Pages

Thursday, October 2, 2014

Evaluating similarity between strings. Jaro-Winkler implementation in C#

We recently had a requirement where our application had to let the user know about existing trading security names from the database that were "similar" to what the user was trying to import into the system.


The key of the problem was what "similar" meant in this context.

Something like "McDonalds" could be considered similar to the existing official name in the database "McDonalds Corporation". In this case the SQL "like" operation and the use of the "%"  symbol would solve our problem.

However, if a user (a broker) calls this company "McDNLDS", we wanted our system to know that he probably meant  "McDonalds Corporation" and in this case the like operator in SQL is of absolutely no value. 

Other text comparison algorithms from SQL include:
FreeText which returns texts that are not necessarily identical but that have similar meanings. For example Cat and Kitten. Dog and puppy. Or Synonyms as well. Or inflection of verbs. For example Drive and Drove would be a match. FreeText certainly gives you flexibility but still, the words need to have real meaning for SQL to figure out their alternatives. However, the data provided by Brokers didn't need to necessarily be a semantically correct or meaningful word.

Soundex is another SQL function to rate similarities between two texts but in this case it is based on their sound. Obviously, because the user may be entering strings that were not real words that exist in an English dictionary, this type of comparison would also leave a lot to be desired.


JARO-WINKLER algorithm

The Jaro-Winkler algorithm (Wikipedia link) allows you to compare two strings and obtain a numerical number that evaluates how similar they are.

The idea is to simply calculate the "distance" between two words. And the distance would be the number of steps required to apply on word number one so that it becomes identical to word number 2. Steps like, remove a character, transpose two characters, delete a character etc.

Finally, based on the "distance" and length of the words, a formula is applied that returns a coefficient between 0 and 1. One means the words are identical. Zero means they are totally different.

By the way this algorithm compares two strings that don't necessarily have to be single words. It can be a comparison between two entire sentences as well.

Attached to this article is a JaroWinkler class in C# with a static method called RateSimilarity that expects two strings as input parameters and returns the aforementioned coefficient as follows:

The above comparison for example returns 0.78 as the coefficient of similarity. Which means that this words are somewhat similar.

With this class, you can apply the function to an entire collection of strings and filter by the coefficient, so that you only show the user those elements that exist in the system that are fairly similar to what he is typing. That could be coefficients above 0.70 for example. Ultimately it is up to you, the developer, to decide how relaxed or strict you want those results to be.

Download JaroWinkler.cs




No comments:

Post a Comment