N-gram string matching
You're reading Mikko Saari's blog Life and opinions. This entry was written 11/27/2004, at 12:30.
If you want read more of my entries in the same topic, this entry belongs to the category of Geek stuff.
Previous entry: Qveere Eye For Thye Mediaeval Man
Next entry: Interesting blog
As my game review collection has grown quite large (92 reviews now), I thought a search for game titles would a nice addition to the different means I have for visitors to find the reviews they want. As I'm a student of information studies, which includes a healthy dose of computer information retrieval, I knew I would be able to tackle the task.
I didn't want to do full-text indexing of the reviews, but make the game titles searchable. Most searches of this kind use simple keyword searching with truncation. That kind of search produces a list of titles which have the keyword in them and nothing else. The list can't be ordered by hit quality and even a slightest mistake makes the search fail. That, I believe, is not satisfactory. A course I took in cross-language information retrieval had equipped me with just the tools I needed to create a better system.
Cross-language information retrieval uses so called N-grams. N can be whatever, but digrams and trigrams are most used. For example, digrams of word "foobar" are *f, fo, oo, ob, ba, ar, r* where * denotes a padding space. Both the indexed words and the search key are broken into N-grams and then compared using a simple formula of intersection of N-grams in the phrases divided by sum of unique N-grams in the phrases. Let's have an example.
How well does foobar match with fubar? Digrams of the words are *f, fo, oo, ob, ba, ar, r* and *f, fu, ub, ba, ar, r*. The union (sum) of unique digrams in both phrases is *f, fo, oo, ob, fu, ub, ba, ar, r*. The intersection of the phrases' digrams is *f, ba, ar, r*. There are thus four digrams that appear in both phrases and nine different digrams that appear in at least one the phrases.
A simple division gives a score of 4/9 or 0.444... for the match between the phrases. That's pretty good, but still far from complete match score of 1.
With this method, it's easy to compare the search key to different indexed keys and find out which are the closest ones. It allows for small mistakes and variations in spelling (which is one reason why it's used in CLIR: N-gram matching can overcome different spellings of the same words in different languages). It also eliminates one problem with straight pattern matching: small keys are easy to search. Try searching BoardGameGeek for Go or Ra and you'll know what I mean. In this system, there's no problem.
Of course, working with the N-grams takes more computing power, which can be a problem. Also, two words can have the same N-grams but in completely different order. Still, I think N-gram matching is a good tool for simple title or name searches such as the one I'm doing.
I'm sharing the PHP code that does the tricks of forming the N-grams and comparing two sets of N-grams. The compare function is very elegant, thanks to good tools already implemented in PHP. You can find it here: ngram.php. If you're planning to do something with it, I'd appreciate if you'd drop me a note.
In addition to that, I have indexing code that goes through all the files in the directory that has the reviews, searching for HTML meta field that contains the title of the game - or titles, as each game can have several titles in different languages. File names, game titles and the N-grams are then stored in a MySQL database to be accessed by the search script.
Creating the search engine was very straighforward. I was surprised how quickly I got it all together. Having thought a lot about N-grams before helped a bit, obviously.
Trackback Pings
TrackBack URL for this entry:
http://www.melankolia.net/mt/mt-tb.cgi/2777
Comments
Hi Mikko
I want to implement something like this for a search i'm doing with city names. What i'd like to know is how you implemented it, how big your database / table your using this on and how much slower this makes the search.
I want to use this solution for the same reasons, to correct spelling mistakes or errors and still come up with the right city.
I am currently using a fulltext search on city name alone which isn't producing the correct results.
I was also wondering if this solution is better then a soundex, levenshtein or metaphone solution.
Thanks for the idea and i hope it can work for me
Adam
Posted by: Adam at February 9, 2005 7:29 PM