How to delete duplicate rows in batch with MS SQL Server
- Duplicates usually occurs because of wrong logic or from a bug in the implementation.
- Often when you have duplicate, you will have hundreds, and thousands.
- It's not always possible to delete manually the duplicate rows.
To delete manually duplicate rows see: How to manually delete duplicate rows
Data used
use sql911;
go
select emp_short,
sales_date,
sales_qty as 'Sold'
from sales
order by emp_short, sales_date;
go
emp_short sales_date Sold ---------- ----------------------- ---------- 000010 2005-04-06 15:01:01.000 4.00 000010 2005-11-06 10:01:01.000 9.00 000010 2005-12-06 11:01:01.000 8.00 000010 2005-12-06 11:01:01.000 8.00 000010 2005-12-06 11:01:01.000 8.00 000010 2006-04-06 15:01:01.000 4.00 000010 2006-05-06 16:01:01.000 3.00 000010 2006-05-06 16:01:01.000 3.00 000010 2006-07-06 19:01:01.000 1.00 000010 2007-04-06 00:00:00.000 4.00 000010 2007-04-06 00:00:00.000 4.00 000010 2007-04-06 00:00:00.000 4.00 000010 2007-06-06 17:01:01.000 2.00 000030 2004-02-06 13:01:01.000 6.00 000030 2005-10-06 10:01:01.000 10.00 000030 2006-01-06 12:01:01.000 7.00 000030 2006-04-06 15:01:01.000 4.00 000030 2006-04-06 15:01:01.000 4.00 000030 2006-06-06 17:01:01.000 2.00 000030 2007-05-01 00:00:00.000 5.00 000030 2007-05-01 00:00:00.000 5.00 000030 2007-05-01 00:00:00.000 5.00 000030 2007-05-01 00:00:00.000 5.00 000030 2007-05-01 00:00:00.000 5.00 000030 2007-05-01 00:00:00.000 5.00 000030 2007-05-01 00:00:00.000 5.00 000030 2007-11-06 10:01:01.000 9.00 000110 2004-03-06 14:01:01.000 5.00 000110 2005-11-06 10:01:01.000 9.00 000110 2006-02-06 13:01:01.000 6.00 000110 2006-03-06 14:01:01.000 5.00 000110 2006-05-06 16:01:01.000 3.00 000110 2006-06-06 17:01:01.000 2.00 000110 2006-07-06 19:01:01.000 1.00 000110 2006-07-06 19:01:01.000 1.00 000190 2005-03-06 14:01:01.000 5.00 000190 2005-05-06 16:01:01.000 3.00 000190 2005-12-06 11:01:01.000 8.00 000190 2006-01-06 12:01:01.000 7.00 000190 2006-01-06 12:01:01.000 7.00 000190 2006-02-06 13:01:01.000 6.00 000190 2006-05-06 16:01:01.000 3.00 000190 2006-06-06 17:01:01.000 2.00 000190 2006-06-06 17:01:01.000 2.00 000190 2006-07-06 00:00:00.000 1.00 000190 2006-07-06 19:01:01.000 1.00 000190 2007-07-06 00:00:00.000 1.00 000190 2007-07-06 00:00:00.000 1.00 000190 2007-07-06 00:00:00.000 1.00 000310 2004-10-06 10:01:01.000 10.00 000310 2005-10-06 10:01:01.000 10.00 000310 2005-10-06 10:01:01.000 10.00 000310 2005-10-06 10:01:01.000 10.00 000310 2005-11-06 10:01:01.000 9.00 000310 2005-11-06 10:01:01.000 9.00 000310 2006-01-06 12:01:01.000 7.00 000310 2006-03-06 14:01:01.000 5.00 000310 2006-07-06 19:01:01.000 1.00 000310 2006-10-06 10:01:01.000 10.00 (59 row(s) affected)
Finding the duplicate rows
- The duplicate rows occur more than once.
- It's kind of self-evident, but that's how we find the duplicates.
We want to find the duplicate sales. What makes a duplicate sale?
- The same rep
- The same date
- The same quantity
use sql911;
go
select emp_short,
sales_date,
sum(sales_qty) as 'Totals',
count(emp_short) as 'Duplicates'
from sales
group by emp_short, sales_date, sales_qty
having count(emp_short) > 1
order by emp_short;
go
emp_short sales_date Totals Duplicates ---------- ----------------------- ----------- ----------- 000010 2005-12-06 11:01:01.000 24.00 3 000010 2006-05-06 16:01:01.000 6.00 2 000010 2007-04-06 00:00:00.000 12.00 3 000030 2006-04-06 15:01:01.000 8.00 2 000030 2007-05-01 00:00:00.000 35.00 7 000110 2006-07-06 19:01:01.000 2.00 2 000190 2006-01-06 12:01:01.000 14.00 2 000190 2006-06-06 17:01:01.000 4.00 2 000190 2007-07-06 00:00:00.000 3.00 3 000310 2005-10-06 10:01:01.000 30.00 3 000310 2005-11-06 10:01:01.000 18.00 2 (11 row(s) affected)
- You must group by all the columns that make the duplicate.
Deleting the duplicate rows in batch
Copy the duplicate key values into a temporary table
select emp_short, sales_date, sales_qty, count(*) into #tempsaleskey from sales group by emp_short, sales_date, sales_qty having count(*) > 1; go
(11 row(s) affected)
Copy the duplicate rows into a temporary table
This is the step that will eliminated all the duplicate rows.
select distinct s1.*
into #tempsalesdata
from sales s1, #tempsaleskey t1
where s1.emp_short = t1.emp_short and
s1.sales_date = t1.sales_date and
s1.sales_qty = t1.sales_qty;
go
(11 row(s) affected)
Verify that you have correct data
- Optional, but as Ronald Reagan said: Trust but verify.
select emp_short, sales_date, sales_qty, count(*) as 'Counter' from #tempsalesdata group by emp_short, sales_date, sales_qty; go
emp_short sales_date sales_qty Counter ---------- ----------------------- ------------ ----------- 000010 2005-12-06 11:01:01.000 8.00 1 000010 2006-05-06 16:01:01.000 3.00 1 000010 2007-04-06 00:00:00.000 4.00 1 000030 2006-04-06 15:01:01.000 4.00 1 000030 2007-05-01 00:00:00.000 5.00 1 000110 2006-07-06 19:01:01.000 1.00 1 000190 2006-01-06 12:01:01.000 7.00 1 000190 2006-06-06 17:01:01.000 2.00 1 000190 2007-07-06 00:00:00.000 1.00 1 000310 2005-10-06 10:01:01.000 10.00 1 000310 2005-11-06 10:01:01.000 9.00 1 (11 row(s) affected)
Delete the duplicates from the original table
DELETE sales
FROM sales s1, #tempsaleskey t1
WHERE s1.emp_short = t1.emp_short and
s1.sales_date = t1.sales_date and
s1.sales_qty = t1.sales_qty;
go
(31 row(s) affected)
Put the unique rows back in the original table
INSERT sales SELECT * FROM #tempsalesdata; go
(11 row(s) affected)
Verification
SELECT emp_short,
sales_date,
sales_qty AS 'Sold'
FROM sales
ORDER BY emp_short, sales_date;
go
emp_short sales_date Sold ---------- ----------------------- --------------------------------------- 000010 2005-04-06 15:01:01.000 4.00 000010 2005-11-06 10:01:01.000 9.00 000010 2005-12-06 11:01:01.000 8.00 000010 2006-04-06 15:01:01.000 4.00 000010 2006-05-06 16:01:01.000 3.00 000010 2006-07-06 19:01:01.000 1.00 000010 2007-04-06 00:00:00.000 4.00 000010 2007-06-06 17:01:01.000 2.00 000030 2004-02-06 13:01:01.000 6.00 000030 2005-10-06 10:01:01.000 10.00 000030 2006-01-06 12:01:01.000 7.00 000030 2006-04-06 15:01:01.000 4.00 000030 2006-06-06 17:01:01.000 2.00 000030 2007-05-01 00:00:00.000 5.00 000030 2007-11-06 10:01:01.000 9.00 000110 2004-03-06 14:01:01.000 5.00 000110 2005-11-06 10:01:01.000 9.00 000110 2006-02-06 13:01:01.000 6.00 000110 2006-03-06 14:01:01.000 5.00 000110 2006-05-06 16:01:01.000 3.00 000110 2006-06-06 17:01:01.000 2.00 000110 2006-07-06 19:01:01.000 1.00 000190 2005-03-06 14:01:01.000 5.00 000190 2005-05-06 16:01:01.000 3.00 000190 2005-12-06 11:01:01.000 8.00 000190 2006-01-06 12:01:01.000 7.00 000190 2006-02-06 13:01:01.000 6.00 000190 2006-05-06 16:01:01.000 3.00 000190 2006-06-06 17:01:01.000 2.00 000190 2006-07-06 00:00:00.000 1.00 000190 2006-07-06 19:01:01.000 1.00 000190 2007-07-06 00:00:00.000 1.00 000310 2004-10-06 10:01:01.000 10.00 000310 2005-10-06 10:01:01.000 10.00 000310 2005-11-06 10:01:01.000 9.00 000310 2006-01-06 12:01:01.000 7.00 000310 2006-03-06 14:01:01.000 5.00 000310 2006-07-06 19:01:01.000 1.00 000310 2006-10-06 10:01:01.000 10.00 (39 row(s) affected)

