Most developers have a basic knowledge of the SQL commands “Select,” “Insert,” “Update” and “Delete.” Sometimes you want the database to return something different than was initially planned and that is when you need the SQL statements to do more. Being asked to find duplicates is a request that comes up repeatedly.
Early in my career, I was fortunate enough to receive a five-day SQL PL/SQL course at Oracle. It was a great experience and really helped me better understand how the language should be used. I remember the sprit of the lessons and the logic behind the explanations. That is what I want to share with you. I will try to always use SQL-style language to reinforce the concepts.
Now to find duplicate records you first need to identify exactly each field that you consider important in the duplication. There are often columns like the record ID or a date/time fields that are unique between two otherwise identical rows. So, analyze some records to determine what fields you will actually select to determine duplication. Fields that are always unique like a record ID are not needed.
Note: In a procedural language, you would normally identify a duplicate by sorting the information and then examining each record with the preceding record to match if it is the same. The relational database software will actually do that in the background in a variety of ways. But in SQL you do not work with this sort of step-by-step concept. You work with sets of data.
In SQL language you create a set of data by selecting the fields and applying filter criteria. The “where” clause in a select statement is probably know by every developer. The lesser know is the “having” clause. The “where” clause is applied to the set of data before any grouping and the “having” clause is applied after the grouping.
Let’s look at an example. I want to run a query that would find the first two records in the example below but not the third record.
I can see that I will need to select the Street Number, Street Name, City, Province, Postal Code. Those are the fields that are duplicated.
So, the query you write needs to look like below. The fields are selected. The same fields are grouped. The “count” function counts the number of records that are included in the grouping just made. Don’t let the 1 fool you. It is just a literal. You can specify a single field like “Street Number” or a “*” the result would be the same. The one is traditionally used for a legacy performance reason in Oracle. The “having” clause now evaluates the records that have a property in this grouping of having more than 1 record. I could have just as easily put “= 2” to get the records that only appear twice and not three times. It is really up to you.
SELECT Street Number, Street Name, City, Province, Postal Code FROM ADDRESS GROUP BY Street Number, Street Name, City, Province, Postal Code HAVING COUNT(1) > 1
Voila! That is all that there is to this. One record will be returned in this example.
Have a nice day.