ACCESS: Deduping Data (duplicate records)

Considering that duplicates can hardly be kept in check by hand in small databases, finding duplicate records among large amounts of data, such as those found in databases managed with ACCESS, can only be accomplished using appropriate tools.

As a matter of principle, one must differentiate between exact and fuzzy duplicates. You will find more on the subject of fuzzy duplicates in the article "Duplicates with Small Differences". Exact duplicates, i.e. duplicate records that are completely identical except for capitalisation, are relatively easy to find using SQL queries. Although the query assistant in ACCESS includes an option to create a query for a duplicates search, the results only list the duplicate terms and the number of times they occur in the table. Unfortunately, the duplicates that were found cannot be deleted this way. For this reason, there is no way to avoid making an individual SQL query for this purpose in ACCESS.

The easiest way is to start by creating a new query scheme and then switch over to the direct entry of SQL commands using the menu item ‘SQL view’ from the context menu or using the small 'SQL‘ symbol at the very bottom right edge of the window. Here, just like with other database servers, you can enter SQL commands. The entered SQL command can be executed using the ‘Execute’ button. The small 'SQL‘ symbol at the very bottom right edge of the window takes you back to the corresponding SQL query.

For the following query, for example, ACCESS delivers all records that match the contents of the 'name' data field:

SELECT tab1.id, tab1.name, tab2.id, tab2.name
FROM tablename tab1, tablename tab2
WHERE tab1.name=tab2.name
AND tab1.id<>tab2.id
AND tab1.id=(SELECT MAX(id) FROM tablename tab
WHERE tab.name=tab1.name)

As one can see, this SQL query requires a column with an ID clearly identifying each record to ensure that a record is not compared with itself. Furthermore, this ID is required to ensure that the record with the biggest ID only appears in the column ‘tab1.id’, but not in the column ‘tab2.id’. This ensures that the record with the biggest ID from a duplicates group is not deleted. The IDs of the records that are to be deleted are written in column 'tab2.id'. This is how it looks when the results are integrated into a DELETE command for ACCESS:

DELETE FROM tablename 
WHERE id IN
(SELECT tab2.id
FROM tablename tab1, tablename tab2
WHERE tab1.name=tab2.name
AND tab1.id<>tab2.id
AND tab1.id=(SELECT MAX(id) FROM tablename tab
WHERE tab.name=tab1.name))

Of course, this SQL command can be easily extended to include other data fields in addition to the contents of the 'name' data field, for example, the data fields that contain the postal address.

You can read more about the options SQL provides to search for fuzzy duplicates in the article ‘Fuzzy matching with SQL'. But only specialised tools that include an error-tolerant (fuzzy) matching algorithm can provide a satisfactory solution to this problem, such as DataQualityTools: