I have been working with the SQL language for very many years. I take for granted that most developers are not very experienced in using commands outside of simple select, insert, update and delete statements. Certainly, this can be a disadvantage when working. I would like to explain a few tricks when using the UNION clause.
Union basically joins the result of two separate results sets with the condition that the result sets have the same number of columns and the same or very similar data type. Naturally, the columns need to be in the same order. This seems very intuitive if you can imagine that you have two tables that are similar, and you want to produce a result set that has records from both. But you can also use records from the same table or two more complex queries to produce the result you need. Let me walk you through the details and some examples.
First the details…
There is UNION and UNION ALL
UNION joins the results of two queries UNION ALL also joins the result of two queries. UNION does not include duplicates; however, UNION ALL does include duplicates. Let us walk through the logic quickly. In the example below imagine A,B,C… as records.
Select 1 returns A,B,C
Select 2 returns C, D, E
UNION of the two -> A,B,C,D,E
UNION ALL of the two -> A,B,C,C,D,E
Higher cost of UNION vs. UNION ALL
Since the UNION clause does not include duplicates the database engine needs to sort the two result sets and eliminate the duplicates. If you are selecting records from two different tables and you know it is not logically possible to have a duplicate, you should use the UNION ALL. For example, you are querying a SALES_2020 table and SALES_2021 table and one of the columns is the date of the sale. Logically there should not be any duplicate records between the 2020 and 2021 tables. UNION also has the undesirable surprise that a GROUP BY is applied without you knowing. UNION will eliminate any duplicate not only between Select 1 and Select 2 but also UNION will eliminate any duplicates within Select 1 or Select 2.
Now some examples…
We can see in table SalesQ1 that John appears three times. Twice John purchase $100 on the same day. A normal select of the of three columns SalesDate, Customer and SalesAmount would return 5 records.
It would be intuitive to simply union the same query on the second table and have a result that combines the two sets of data together. Surprise the result will not be what you want.
SELECT SalesDate, Customer, SalesAmount FROM SalesQ1 UNION SELECT SalesDate, Customer, SalesAmount FROM SalesQ2;
John does not appear 3 times anymore in Q1’s dataset. The two records that had the same SalesDate, Customer and SalesAmount were grouped into one. Like I explained to make the query do less you need to use the UNION ALL.
SELECT SalesDate, Customer, SalesAmount FROM SalesQ1 UNION ALL SELECT SalesDate, Customer, SalesAmount FROM SalesQ2;
To illustrate how wrong things could look sometimes we can do it again without the date.
SELECT Customer, SalesAmount FROM SalesQ1 UNION SELECT Customer, SalesAmount FROM SalesQ2;
We see that the UNION clause essentially grouped the records from the first table but also the records that have the same values in either table are grouped together. If we use the UNION ALL the result looks like all the records from both tables.
SELECT Customer, SalesAmount FROM SalesQ1 UNION ALL SELECT Customer, SalesAmount FROM SalesQ2;
If the selection in each table was unique, the UNION and UNION ALL clauses would return the same results. Often things look that way in the beginning but if you are the person working with reports and ETL you often have to work with set of data that you are not an expert with. I hope this example helps with your understanding.
Have a nice day.