Duplicates with Small Differences

Finding exact duplicates is, as a rule, not a problem. Finding duplicates with small differences, however, as can arise from typos, transposed words, and omitted or added letters, is considerably harder.

Data, and addresses in particular, are usually recorded by people, and every person has his or her own style. One person is used to writing the first name before the last name, as in "Albert Einstein". For someone else, it is obvious that the last name belongs first, as in "Einstein Albert". A third abbreviates the first name because it is faster to type: "A. Einstein". Yet another makes a typographical error and turns "Albert Einstein" into "Albert Einsein".
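To make the problem concrete, here is a minimal sketch in Python (not taken from any of the tools mentioned in this article) that compares the name variants above. It uses the standard library's difflib together with a simple word-sorting step of our own so that transposed names line up; the 0-to-1 ratio it prints is only an illustrative similarity measure:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Sort the words so that "Einstein Albert" lines up with "Albert Einstein".
    return " ".join(sorted(name.lower().replace(".", "").split()))

def similarity(a: str, b: str) -> float:
    # Ratio between 0.0 (nothing in common) and 1.0 (identical).
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

reference = "Albert Einstein"
for variant in ["Einstein Albert", "A. Einstein", "Albert Einsein"]:
    exact = variant == reference            # character-for-character comparison
    fuzzy = similarity(reference, variant)  # error-tolerant comparison
    print(f"{variant!r}: exact match: {exact}, similarity: {fuzzy:.2f}")
```

The exact comparison rejects all three variants, while the fuzzy ratio stays high (here roughly 0.8 and above), which is exactly the kind of signal a search for duplicates with small differences needs.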

Similar things happen with company names: place names like the "Berlin" in "BioEnergy Berlin GmbH" or references to the business purpose like the "Computer" in "Apple Computer Inc" are often omitted or recorded in modified form. Even the legal form of a company is not always recorded consistently; possible variations include "G.m.b.H." and "GmbH". Street names pose the same problem: whether we write "First Avenue", "First Av" or "1st Av", we probably mean the same street. Despite all these differences in how the data was recorded, we are very likely dealing with duplicates, just duplicates with small differences, and they still cost money.
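One common building block of such error-tolerant matching is to normalize the known variations away before comparing. The following sketch shows the idea with two deliberately tiny lookup tables of our own; a real solution would need far more comprehensive ones:

```python
# Deliberately tiny, illustrative lookup tables.
LEGAL_FORMS = {"g.m.b.h.": "gmbh", "incorporated": "inc"}
STREET_WORDS = {"av": "avenue", "ave": "avenue", "1st": "first"}

def normalize(name: str, table: dict[str, str]) -> str:
    # Lower-case the name and replace each known variant with its canonical form.
    return " ".join(table.get(word, word) for word in name.lower().split())

# The legal-form variants from above now compare as equal ...
print(normalize("BioEnergy Berlin G.m.b.H.", LEGAL_FORMS)
      == normalize("BioEnergy Berlin GmbH", LEGAL_FORMS))   # True

# ... and so do the street spellings.
print(normalize("First Avenue", STREET_WORDS)
      == normalize("1st Av", STREET_WORDS))                 # True
```

Normalization alone does not catch typos like "Einsein", of course; in practice it is combined with a fuzzy comparison like the one sketched above.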

So what can be done? If the addresses are stored in text files, you will not get far without a specially designed solution. On database servers like MySQL, MariaDB or SQL Server, you can try the SOUNDEX function, but its possibilities are limited; the article "Fuzzy Matching with SQL" covers this subject in more detail. Things look no better in Excel: it offers only a function for finding exact duplicates, i.e. duplicates that match character for character, so duplicates with deviations go unrecognized. In the end, only specialised tools that include an error-tolerant (fuzzy) matching algorithm, such as DedupeWizard or DataQualityTools, can provide a satisfactory solution to this problem.
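To see why SOUNDEX falls short, here is a small re-implementation of the classic American Soundex algorithm in Python (a sketch of the algorithm itself, not of any vendor's SQL function), applied to the examples from above:

```python
def soundex(name: str) -> str:
    # Classic American Soundex: keep the first letter, encode the rest as digits.
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for letter in letters:
            codes[letter] = digit
    word = "".join(c for c in name.lower() if c.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    last = codes.get(word[0], "")
    for c in word[1:]:
        digit = codes.get(c, "")
        if digit and digit != last:
            result += digit
        if c not in "hw":  # h and w do not separate letters of the same code
            last = digit
    return (result + "000")[:4]  # pad or truncate to the usual four characters

for a, b in [("Einstein", "Einsein"), ("Albert Einstein", "Einstein Albert")]:
    print(f"{a}: {soundex(a)}  vs.  {b}: {soundex(b)}")
```

A single omitted letter already yields a different code ("Einstein" becomes E523, "Einsein" becomes E525), and transposed words break the comparison completely (A416 vs. E523). This is why a purely code-based approach cannot replace a genuinely error-tolerant matching algorithm.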