Union queries are resource hogs

Union queries are one of the most used and abused way of hogging down the SQL Server. Unions can be used either as a DISTINCT, ALL, or an OR unions.

Data used

  1. USE sql911;
  2. go
  3.  
  4. SELECT EMPNO,PROJNO, EMPTIME AS 'Time Spent',
  5.        EMSTDATE AS 'Start Date', EMENDATE AS 'End Date'
  6. FROM emp_work
  7. WHERE projno in ('ad3111','ad3112');
  8. go
EMPNO  PROJNO Time Spent                              Start Date              End Date
------ ------ --------------------------------------- ----------------------- -----------------------
000230 AD3111 1.00                                    2002-01-01 00:00:00.000 2002-03-15 00:00:00.000
000230 AD3111 0.50                                    2002-03-15 00:00:00.000 2002-04-15 00:00:00.000
000230 AD3111 0.50                                    2002-03-15 00:00:00.000 2002-10-15 00:00:00.000
000230 AD3111 0.50                                    2002-04-15 00:00:00.000 2002-10-15 00:00:00.000
000230 AD3111 1.00                                    2002-10-15 00:00:00.000 2003-01-01 00:00:00.000
000240 AD3111 1.00                                    2002-02-15 00:00:00.000 2002-09-15 00:00:00.000
000240 AD3111 1.00                                    2002-09-15 00:00:00.000 2003-01-01 00:00:00.000
000250 AD3112 1.00                                    2002-01-01 00:00:00.000 2002-02-01 00:00:00.000
000250 AD3112 0.50                                    2002-02-01 00:00:00.000 2002-03-15 00:00:00.000
000250 AD3112 0.50                                    2004-12-01 00:00:00.000 2005-01-01 00:00:00.000
000250 AD3112 1.00                                    2003-01-01 00:00:00.000 2003-02-01 00:00:00.000
000250 AD3112 0.50                                    2006-02-01 00:00:00.000 2006-03-15 00:00:00.000
000250 AD3112 1.00                                    2002-03-15 00:00:00.000 2002-08-15 00:00:00.000
000250 AD3112 0.25                                    2005-08-15 00:00:00.000 2005-10-15 00:00:00.000
000250 AD3112 0.25                                    2002-08-15 00:00:00.000 2002-10-15 00:00:00.000
000250 AD3112 0.50                                    2002-10-15 00:00:00.000 2002-12-01 00:00:00.000
000250 AD3112 0.50                                    2002-08-15 00:00:00.000 2003-01-01 00:00:00.000

(17 row(s) affected)

Here is the execution plan for the plain listing:

Execution Plan
Execution Plan for the plain query

Union

Here we will get the data for project: AD3111 and AD3112

  1. SELECT EMPNO,PROJNO, EMPTIME AS 'Time Spent',
  2.        EMSTDATE AS 'Start Date', EMENDATE AS 'End Date'
  3. FROM emp_work
  4. WHERE projno = 'ad3111'
  5. UNION all
  6. SELECT EMPNO,PROJNO, EMPTIME AS 'Time Spent',
  7.        EMSTDATE AS 'Start Date', EMENDATE AS 'End Date'
  8. FROM emp_work
  9. WHERE projno = 'ad3112';
  10. go
EMPNO  PROJNO Time Spent                              Start Date              End Date
------ ------ --------------------------------------- ----------------------- -----------------------
000230 AD3111 0.50                                    2002-03-15 00:00:00.000 2002-04-15 00:00:00.000
000230 AD3111 0.50                                    2002-03-15 00:00:00.000 2002-10-15 00:00:00.000
000230 AD3111 0.50                                    2002-04-15 00:00:00.000 2002-10-15 00:00:00.000
000230 AD3111 1.00                                    2002-01-01 00:00:00.000 2002-03-15 00:00:00.000
000230 AD3111 1.00                                    2002-10-15 00:00:00.000 2003-01-01 00:00:00.000
000240 AD3111 1.00                                    2002-02-15 00:00:00.000 2002-09-15 00:00:00.000
000240 AD3111 1.00                                    2002-09-15 00:00:00.000 2003-01-01 00:00:00.000
000250 AD3112 0.25                                    2002-08-15 00:00:00.000 2002-10-15 00:00:00.000
000250 AD3112 0.25                                    2005-08-15 00:00:00.000 2005-10-15 00:00:00.000
000250 AD3112 0.50                                    2002-02-01 00:00:00.000 2002-03-15 00:00:00.000
000250 AD3112 0.50                                    2002-08-15 00:00:00.000 2003-01-01 00:00:00.000
000250 AD3112 0.50                                    2002-10-15 00:00:00.000 2002-12-01 00:00:00.000
000250 AD3112 0.50                                    2004-12-01 00:00:00.000 2005-01-01 00:00:00.000
000250 AD3112 0.50                                    2006-02-01 00:00:00.000 2006-03-15 00:00:00.000
000250 AD3112 1.00                                    2002-01-01 00:00:00.000 2002-02-01 00:00:00.000
000250 AD3112 1.00                                    2002-03-15 00:00:00.000 2002-08-15 00:00:00.000
000250 AD3112 1.00                                    2003-01-01 00:00:00.000 2003-02-01 00:00:00.000

(17 row(s) affected)

Which is the correct result

Here is the execution plan for the plain union and the details of the sort on the execution plan.

Execution Plan
Execution Plan for query with the union

Execution Plan
Execution Plan for query with the union with cost for the sort
  1. The sort is a DISTINCT SORT.
  2. The sort uses: 63% of the cost.
  3. Note the estmated I/O

The definition of UNION in SQL is to:

  1. sort the result.
  2. execute a SELECT DISTINCT on the result set.
  3. display the results from the SELECT DISTINCT.

Union All

Here is the same query result but this time with a UNION ALL.

  1. SELECT EMPNO,PROJNO, EMPTIME AS 'Time Spent',
  2.        EMSTDATE AS 'Start Date', EMENDATE AS 'End Date'
  3. FROM emp_work
  4. WHERE projno = 'ad3111'
  5. UNION all
  6. SELECT EMPNO,PROJNO, EMPTIME AS 'Time Spent',
  7.        EMSTDATE AS 'Start Date', EMENDATE AS 'End Date'
  8. FROM emp_work
  9. WHERE projno = 'ad3112';
  10. go
EMPNO  PROJNO Time Spent                              Start Date              End Date
------ ------ --------------------------------------- ----------------------- -----------------------
000230 AD3111 1.00                                    2002-01-01 00:00:00.000 2002-03-15 00:00:00.000
000230 AD3111 0.50                                    2002-03-15 00:00:00.000 2002-04-15 00:00:00.000
000230 AD3111 0.50                                    2002-03-15 00:00:00.000 2002-10-15 00:00:00.000
000230 AD3111 0.50                                    2002-04-15 00:00:00.000 2002-10-15 00:00:00.000
000230 AD3111 1.00                                    2002-10-15 00:00:00.000 2003-01-01 00:00:00.000
000240 AD3111 1.00                                    2002-02-15 00:00:00.000 2002-09-15 00:00:00.000
000240 AD3111 1.00                                    2002-09-15 00:00:00.000 2003-01-01 00:00:00.000
000250 AD3112 1.00                                    2002-01-01 00:00:00.000 2002-02-01 00:00:00.000
000250 AD3112 0.50                                    2002-02-01 00:00:00.000 2002-03-15 00:00:00.000
000250 AD3112 0.50                                    2004-12-01 00:00:00.000 2005-01-01 00:00:00.000
000250 AD3112 1.00                                    2003-01-01 00:00:00.000 2003-02-01 00:00:00.000
000250 AD3112 0.50                                    2006-02-01 00:00:00.000 2006-03-15 00:00:00.000
000250 AD3112 1.00                                    2002-03-15 00:00:00.000 2002-08-15 00:00:00.000
000250 AD3112 0.25                                    2005-08-15 00:00:00.000 2005-10-15 00:00:00.000
000250 AD3112 0.25                                    2002-08-15 00:00:00.000 2002-10-15 00:00:00.000
000250 AD3112 0.50                                    2002-10-15 00:00:00.000 2002-12-01 00:00:00.000
000250 AD3112 0.50                                    2002-08-15 00:00:00.000 2003-01-01 00:00:00.000

(17 row(s) affected)

And here is the execution with the I/O Costs.

Execution Plan
Execution Plan for query with the union all

Execution Plan
Execution Plan for query with the union all with cost
  • The UNION does imply a select distinct.
  • The UNION ALL does NOT use a select distinct.
  • When you are sure that there will be no duplicate, do NOT use the union, but a union all.
  • The UNION is often used when older section of data have been archived into separate tables (in the hope of improving speed), therefore you are sure that the data is not a duplicate one.

Please note that there is some cheating in the costs:

  1. I am using small tables.
  2. I query the same table twice.
  3. SQL Server does optimize by realizing that it's the same table, therefore reread the data from RAM.