Fuzzy Matching with SQL

Exact duplicates are easy to find with SQL. And most database servers also implement their own SQL commands to search for fuzzy duplicates, dupes with words that sound similar.

Most database servers implement the SOUNDEX command as an algorithm for phonetic searches. SOUNDEX represents the letters of a word as a four-character string according to their sound in the English language. This simple algorithm sometimes delivers reasonably good results. Thus, for example, 'Smith' and 'Smythe' are recognised to be identical. The algorithm also delivers some good results in languages other than English. Thus, 'Maier', 'Mayer', 'Mayr' and 'Mair' are recognised to be identical.

However, the algorithm is configured to compare individual words, so that already the comparison of 'Ken Smith' and 'Smith Ken' will not appear in the results. In addition, this algorithm is language dependent. The representation of the examined words as a 4-character string is also rather rough, so that sometimes strange results are obtained. For example, 'Hilbert' and 'Heilbronn' or 'Knut' and 'Kant' are recognised to be duplicates.

In addition, the difference between the two words to be compared is not always the same. Therefore, it would be nice if the comparison would return a confidence score, for example 80% or 90%. Phonetic algorithms, however, return as a result of the comparison only the information whether the two words are similar or not.

An SQL query to dedupe a database using the SOUNDEX command could look like this:

SELECT tab1.id, tab1.name, tab2.id, tab2.name
FROM tablename tab1, tablename tab2
WHERE SOUNDEX(tab1.name)= SOUNDEX(tab2.name)
AND tab1.id<>tab2.id
AND tab1.id=(SELECT MAX(id) FROM tablename tab
WHERE SOUNDEX(tab.name)=SOUNDEX(tab1.name))

Seeing as the results of this query probably contain hits that are not actually hits, and you do not want to randomly delete or keep records from a duplicates group, the results cannot be processed any further without editing. There is also the fact that such a query always matches all of the records with each other, which of course lengthens the time required for the query.

Therefore, in most cases, the SOUNDEX command in SQL is not a feasible method to deduplicate a database. Only specialised tools that include an error-tolerant (fuzzy) matching algorithm can provide a satisfactory solution to this problem, such as DataQualityTools: