ORACLE: Deduping Data (duplicate records)

While duplicates can hardly be kept in check by hand even in small databases, finding duplicate records in large data sets, such as databases managed with ORACLE, is only feasible with appropriate tools.

As a matter of principle, one must differentiate between exact and fuzzy duplicates. You will find more on the subject of fuzzy duplicates in the article "Duplicates with Small Differences". Exact duplicates, i.e. duplicate records whose compared field contents are completely identical, are relatively easy to find using SQL queries. For the following query, for example, ORACLE delivers all records that match in the contents of the 'name' data field:

SELECT tab1.id, tab1.name, tab2.id, tab2.name
FROM tablename tab1, tablename tab2
WHERE tab1.name=tab2.name
AND tab1.id<>tab2.id
AND tab1.id=(SELECT MAX(id) FROM tablename tab
WHERE tab.name=tab1.name)
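
Note that string comparison in ORACLE is case-sensitive by default, so records that differ only in capitalisation would not be found by the query above. If such records should also count as duplicates, wrapping both sides of the comparison in UPPER() makes the match case-insensitive. A possible variant (the function call prevents the use of a plain index on 'name'; a function-based index on UPPER(name) would restore index access):

SELECT tab1.id, tab1.name, tab2.id, tab2.name
FROM tablename tab1, tablename tab2
WHERE UPPER(tab1.name)=UPPER(tab2.name)
AND tab1.id<>tab2.id
AND tab1.id=(SELECT MAX(id) FROM tablename tab
WHERE UPPER(tab.name)=UPPER(tab1.name))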

As one can see, this SQL query requires a column with an ID that uniquely identifies each record, to ensure that a record is not compared with itself. Furthermore, the ID guarantees that the record with the highest ID of a duplicate group only ever appears in the column 'tab1.id', never in the column 'tab2.id'. This ensures that the record with the highest ID of each duplicate group is not deleted. The IDs of the records that are to be deleted thus all end up in the column 'tab2.id'. Integrated into a DELETE command for ORACLE, the query looks like this:

DELETE FROM tablename 
WHERE id IN
(SELECT tab2.id
FROM tablename tab1, tablename tab2
WHERE tab1.name=tab2.name
AND tab1.id<>tab2.id
AND tab1.id=(SELECT MAX(id) FROM tablename tab
WHERE tab.name=tab1.name))
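
Incidentally, if no suitable ID column exists, ORACLE offers a shortcut: every row can be addressed via the ROWID pseudo-column, which is unique by definition. A common ORACLE idiom along these lines, sketched here for the single field 'name', keeps the row with the lowest ROWID in each duplicate group and deletes all others:

DELETE FROM tablename t1
WHERE t1.ROWID > (SELECT MIN(t2.ROWID)
FROM tablename t2
WHERE t2.name = t1.name)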

Of course, this SQL command can easily be extended to include further data fields in the comparison in addition to the 'name' data field, for example the fields that contain the postal address.
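
Extended to a postal address, such a comparison might look like the following sketch; the column names street, zip and city are placeholders and must be replaced with the actual field names. This formulation keeps the record with the highest ID per duplicate group (and leaves records without duplicates untouched, since each is the maximum of its own group):

DELETE FROM tablename
WHERE id NOT IN (SELECT MAX(id)
FROM tablename
GROUP BY name, street, zip, city)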

You can read more about the options SQL provides for searching for fuzzy duplicates in the article 'Fuzzy matching with SQL'. A truly satisfactory solution to this problem, however, is only provided by specialised tools that include an error-tolerant (fuzzy) matching algorithm, such as DataQualityTools.
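
To give a first impression of such on-board means: ORACLE's SOUNDEX function maps similar-sounding strings to the same phonetic code, so the equality comparison from above can be turned into a rough fuzzy match. The results contain many false positives and should therefore only serve as a review list, never be deleted blindly:

SELECT tab1.id, tab1.name, tab2.id, tab2.name
FROM tablename tab1, tablename tab2
WHERE SOUNDEX(tab1.name)=SOUNDEX(tab2.name)
AND tab1.id<tab2.id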