Fastest way to search through text records

Mr_Tom · March 2020

Hi all,

Hoping to pick your collective minds.

I've got a MySQL table (pretty much just a couple of fields - an ID, and a string) which I need to search the string for and get the ID(s) of any matching rows.

Using LIKE "%search%" works perfect - but it requires a full table scan and there's 61 million rows.

I've tried using a MATCH ... AGAINST() search but this seems to work against the FULL word, but I need to include results where they match against part of the word too, if that makes sense.

I've considered putting the dataset into something like redis, but does anyone have any suggestions on the fastest way to do this. It doesn't have to use a database, I can put the data into any format I want - the only "requirement" is I've got to be able to search it using PHP.

Thanks in advance!

InceptionHosting · March 2020

I don't think there will be any magic bullet for 61 million rows short of a complete database redesign to separate them out, I would say your best bet is better caching so the table runs in ram.

Perhaps someone like @joepie91 or @Francisco can give a better answer.

isunbejo · March 2020

How about Full-text search, (FTS) indexing with ADD FULLTEXT

example :

ALTER TABLE news ADD FULLTEXT (title, content, author)

Mr_Tom · March 2020

@AnthonySmith said: I don't think there will be any magic bullet for 61 million rows short of a complete database redesign to separate them out, I would say your best bet is better caching so the table runs in ram.

Yeah I think that might be the way to go.

@isunbejo said: How about Full-text search, (FTS) indexing with ADD FULLTEXT

I've added a FULLTEXT index to the table already, hasn't really sped up the LIKE searches. Using MATCH/AGAINST is quicker but doesn't return the correct set of results.

FlamingSpaceJunk · March 2020

Basically, what you're asking for is a search engine, and a Solr or Elasticsearch cluster might be better suited for this case. They're designed for text search.

If you want to keep using MySQL, see if there is a distributed DB module which will automatically shard the DB and parallelize the query. PostgreSQL has the Citus extension and Postgres-XL, but I'm not as familiar with the MySQL ecosystem as I am the postgres ecosystem.

Of course, you'll need to throw lots of medium boxes at the problem to get the best throughput.

Mr_Tom · March 2020

@FlamingSpaceJunk said: Basically, what you're asking for is a search engine, and a Solr or Elasticsearch cluster might be better suited for this case. They're designed for text search.

Thanks! That's probably the best way of looking at it.

I might setup a solr or elasticsearch install and import the data to see how it fairs speed wise.

There's no reason for it to be kept in a database, if grepping a csv file is quickest I'll write a solution that does that lol.

FlamingSpaceJunk · March 2020

@Mr_Tom said:
I might setup a solr or elasticsearch install and import the data to see how it fairs speed wise.

You'll probably need a 3 node cluster to start seeing performance gains.

There's no reason for it to be kept in a database, if grepping a csv file is quickest I'll write a solution that does that lol.

You can use a CSV as a backend of some database systems.

PHP_Backend · March 2020

if grepping a csv file is quickest I'll write a solution that does that lol

No matter how silly this sounds like, these type of solutions most of the times get the job done.
Fuzzy search is already PITA. You never gonna like what you get with it (at least for me).
text search oriented solutions like elasticsearch/solr will perform better but comes with own set of extra work.

Using LIKE "%search%" works perfect - but it requires a full table scan and there's 61 million rows.

have you tried to load whole dataset to buffer (RAM). which DB engine did you use. If your data is mostly read-only, may I suggest Aria engine (or Memory)? Off-course I dont know much GBs actually your 61 mil data is, but If all of your data is in Buffer, I don't see any problem with FULL TABLE SCAN.

seriesn · March 2020

@poisson

joepie91 · March 2020

I haven't played with it much yet, but AFAIK, PostgreSQL's full-text search capabilities are a lot better (and probably more performant) than those of MySQL.

Given that PostgreSQL is generally nicer to work with (and faster) than MySQL, but its SQL dialect is similar, it may be worth considering a switch - after trying out your usecase in PostgreSQL and verifying that it does work well there, of course. You can also use EXPLAIN ANALYZE (along with this tool) to understand exactly how a query was scheduled and executed, so that you can track down what exactly made it slow.

ElasticSearch is a possibility, but note that it is open-core (not open-source) and certainly not low-maintenance. I wouldn't bother with it unless you've verified that PostgreSQL really can't do what you need. As already mentioned elsewhere, it's probably not going to be faster in and of itself either, you'd need to actually add more servers for that.

But honestly, 61 million records should be peanuts for an RDBMS on a single server. It's not the sort of dataset size where you should be running into performance issues, so long as you get your indexes right and don't run especially unperformant queries. This should just work.

willie · March 2020

@Mr_Tom said: I've got a MySQL table (pretty much just a couple of fields - an ID, and a string) which I need to search the string for and get the ID(s) of any matching rows.

That is a crazy db setup. You shouldn't have to scan strings at query time. Process each string when you insert it into the database, extracting any id's from the string. Make a new table that combines the extracted id's with the id of the row you just added. Add suitable indexes and do appropriate JOIN operations to find the rows you want. Full text searching doesn't seem like the right thing at all.

What is the application? If it's not an online op taking a high query load, maybe you can use a plain text file. 61M records is probably less than a GB of text and you can grep that much text in a second or so. Maybe that is ok for what you are doing.

havoc · March 2020

Pretty sure the standard unix search stuff is optimised to death so I'd probably try to find a way to turn this into text files

FlamingSpaceJunk · March 2020

@joepie91 said:
ElasticSearch is a possibility, but note that it is open-core (not open-source) and certainly not low-maintenance. I wouldn't bother with it unless you've verified that PostgreSQL really can't do what you need. As already mentioned elsewhere, it's probably not going to be faster in and of itself either, you'd need to actually add more servers for that.

The base functionality is still in Elasticsearch; the proprietary addons don't add much when it just needs to be a search engine, provided this isn't filled with sensitive information and on the Internet.

For everything, Amazon's Open Distro for Elasticsearch should be looked at. (https://opendistro.github.io/for-elasticsearch/) It's what Amazon uses for their hosted Elasticsearch instances.

Yes, it's like any other Java application.

@Mr_Tom said:
There's no reason for it to be kept in a database, if grepping a csv file is quickest I'll write a solution that does that lol.

I found Whoosh, which is a search engine library for Python (https://whoosh.readthedocs.io/en/latest/intro.html), and Xapian, search engine lib for lots of languages (https://xapian.org/). [I haven't used either of these. I just found them after getting curious about search engine libraries.)

Now that I think about it, there are also tools like the silver searcher, ack, and parallel grep implementations which might be helpful.

WSS · March 2020

@FlamingSpaceJunk said:
Now that I think about it, there are also tools like the silver searcher, ack, and parallel grep implementations which might be helpful.

The answer is always awk based somehow.

FlamingSpaceJunk · March 2020

@WSS said:

show previous quotes

@FlamingSpaceJunk said:
Now that I think about it, there are also tools like the silver searcher, ack, and parallel grep implementations which might be helpful.

The answer is always awk based somehow.

I thought of awk. I haven't used it enough to get a feel for it's speed, or how well text searching works.

WSS · March 2020

@FlamingSpaceJunk said:

show previous quotes

@WSS said:

@FlamingSpaceJunk said:
Now that I think about it, there are also tools like the silver searcher, ack, and parallel grep implementations which might be helpful.

The answer is always awk based somehow.

I thought of awk. I haven't used it enough to get a feel for it's speed, or how well text searching works.

awk '/term/ {print $1}' file_with_penis_ascii.txt

FlamingSpaceJunk · March 2020

@WSS said:
awk '/term/ {print $1}' file_with_penis_ascii.txt

61 millions lines of ASCII peens.

Would that be a linear search? I'm thinking part of the solution is getting as much parallelism as possible. Something, something O(log(n)).

WSS · March 2020

@FlamingSpaceJunk said:
Would that be a linear search? I'm thinking part of the solution is getting as much parallelism as possible. Something, something O(log(n)).

All I know is that my log comes first.

WSS · March 2020

..oh god did I just make a sexual math joke?

Mr_Tom · March 2020

Thanks for all the info/tips guys.

I've tweaked some of the indexes and it's helped - doing a WHERE text="xx" search is now almsot instant, but still a little slow for LIKE queries. I've started having a look at the "search engine" type options too, as these might be a better bet for searching/returning "close" matches as well as exact matches, which I'm currently handling in an acceptable way in PHP by generating "close" terms based on the search term and then querying the database for these depending on the number of results.

nfn · March 2020

I would offload search to sphixsearch using sphinxse with mariadb.
https://mariadb.com/kb/en/about-sphinxse/

willie · March 2020

LIKE 'XX%' should also be fast, because of the index, and there is sometimes a way to make an index on the reverse spelling of the string so that LIKE '%XX' is also fast. The slow one is LIKE '%X%'. But really if the stuff you are parsing out of those strings is id numbers or something else you can identify when you import the strings in the first place, then as mentioned above it's best to extract that at import time and put it in its own indexed column.

The way you described the problem initially, it sounds a bit weird if you really have to search free text. If that is the case though, Postgres's text search stuff is supposed to be quite good (almost as good as Elastic/Solr) while MySQL's is terrible.

exception0x876 · March 2020

You can still achieve it using MySQL FULLTEXT (not really, see update below)

It supports wildcard search for words suffixes so you could search like this MATCH(column) AGAINST('term*' IN BOOLEAN MODE)

If you want to search for wildcard prefix as well, you can create another column that will contain reversed words of the original column. The final query would look like this

MATCH(original_column) AGAINST('term*' IN BOOLEAN MODE) OR MATCH(reversed_column) AGAINST('mret*' IN BOOLEAN MODE)

You might need to update MySQL stopwords list with their reversed counterparts in this case.

UPDATE: however, that is still not gonna find the partial match inside the word.

bikegremlin · March 2020

Not 100% sure if this is a good option for your particular database, but to me it came highly recommended for editing WordPress ones:

https://interconnectit.com/products/search-and-replace-for-wordpress-databases/

DISCLAIMERS:
Make backups before trying anything, don't blindly follow advice of idiots like me from the Internet, eat healthy food and drink enough water...

Mr_Tom · March 2020

@willie said: The way you described the problem initially, it sounds a bit weird if you really have to search free text. If that is the case though, Postgres's text search stuff is supposed to be quite good (almost as good as Elastic/Solr) while MySQL's is terrible.

Yeah, although the ID is the final required but if information, it's user inputted text which is being searched in a way that needs to find "Mr_Tom" from the term "To". The text is only 9 characters max. If postgresql is faster than mysql that will probably solve it - I've replicated it on a server of similar spec to the live one and with the index tweaks it's a lot faster - but not quite there for a the expected times for a website.

@exception0x876 said: UPDATE: however, that is still not gonna find the partial match inside the word.

That's what I need ideally. LIKE %TERM% works perfect, just not the fastest.

@bikegremlin said: Not 100% sure if this is a good option for your particular database

That looks a good option for what it does, but doesn't quite fit here.

Cheers

tgl · March 2020

mysql is not that good for that, I guess elasticsearch would make more sense

there is however a tool called sphinx that can index your tables and then the search will be fast, but i dont know if it will fit your needs

willie · March 2020

What is the actual application if you don't mind my asking, and what is the total amount of text to be searched? What size vm are you running in? I'm still perplexed that you have to find "Mr_Tom" from "to".

Maybe you can index trigraphs or substrings from the tokens in this text? If column length is 9 chars max, this is around 0.5GB of text, and if you index the 9-char strings, and their substrings starting from the 2nd char, 3rd char etc., you get around 500M total strings, around 2.5GB, doesn't seem too bad. It will chew up some disk space but you can index them all. Doesn't seem necessary to have them in ram since the index lookups on ssd will be fast enough

vimalware · March 2020

Postgresl full text native capabilities are probably enough.

comi · March 2020

$ head -5 records
B7C9607F0
BEA9DE6AF
EC1D58C44
576099FE5
88607C803

$ wc -l records
69905067 records

$ time grep -F AAAAAA records
9AAAAAAAB
AAAAAA9F8
1AAAAAA1F
AAAAAADA7
AAAAAAD4A
AAAAAA519
AAAAAA5DD
DAAAAAA58
4AAAAAA92
6AAAAAA4C
F81AAAAAA
AAAAAA34D
746AAAAAA
BAAAAAAEF

real    0m0.545s
user    0m0.465s
sys     0m0.079s

$ cat generate.sh
#!/bin/sh

dd if=/dev/urandom bs=1M count=300 | basenc --base16 -w 9 > records

$ ls -lh
total 667M
-rwxr--r-- 1 1000 1000   79 Mar 13 15:15 generate.sh
-rw-r--r-- 1 1000 1000 667M Mar 13 15:16 records
-rw-r--r-- 1 1000 1000   42 Mar 13 15:21 rep.txt
drwxr-xr-x 2 1000 1000 4.0K Mar 13 15:05 v1

$

databases are lame

willie · March 2020

If this is for a web service, 0.5 sec per query is really too slow. We still haven't heard anything about the application though. Fulltext engines will be of no help with this since they search for tokens given prefixes. They won't look for substrings in the middle of tokens. All you can really do is index all the "tails" of each token, or (depending on the input distribution) do some hack like indexing trigraphs. Mr_Tom if you're still around and want serious advice about this, we need to know the query load, the type of server, the software stack, the origin of the data (so we can make reasonable guesses about how the substrings are distributed), what kinds of lengths you're willing to go to in the software to make it fast, etc. A full-out in-process data structure with everything in ram can probably handle millions of queries per second, but the implementation hassle would be considerable.

comi · March 2020

@willie said: 0.5 sec per query is really too slow

you are not wrong, but:

it will satisfy a lot of usecases
it is 0.5s without database and all involved hassle
it is 0.5s without optimisations like query caching

As you said yourself it is not possible to index by substrings in the middle of tokens, so not sure what kind of algo in a dynamic language is going to be faster.
Also, if it's a web service, it is going to serve queries in parallel, so 0.5s does not mean two clients per second.

Fastest way to search through text records

Comments