SQL to delete the duplicates in a table - 【StackMirror】|sql|oracle|oracle10g

I have a table transaction which has duplicates. i want to keep the record that had minimum id and delete all the duplicates based on four fields DATE, AMOUNT, REFNUMBER, PARENTFOLDERID. I wrote this query but i am not sure if this can be written in an efficient way. Do you think there is a better way? I am asking because i am worried about the run time.

DELETE FROM TRANSACTION
WHERE ID IN 
(SELECT FIT2.ID
FROM
(SELECT MIN(ID) AS ID, FIT.DATE, FIT.AMOUNT, FIT.REFNUMBER, FIT.PARENTFOLDERID
FROM EWORK.TRANSACTION FIT
GROUP BY FIT.DATE, FIT.AMOUNT , FIT.REFNUMBER, FIT.PARENTFOLDERID
HAVING COUNT(1)>1 and FIT.AMOUNT >0) FIT1,
EWORK.TRANSACTION FIT2

WHERE FIT1.DATE=FIT2.DATE AND
FIT1.AMOUNT=FIT2.AMOUNT AND
FIT1.REFNUMBER=FIT2.REFNUMBER AND 
FIT1.PARENTFOLDERID=FIT2.PARENTFOLDERID AND 
FIT1.ID<>FIT2.ID)

2012-04-04 18:19
by spider8

It would probably be more efficient to do something like

DELETE FROM transaction t1
 WHERE EXISTS( SELECT 1
                 FROM transaction t2
                WHERE t1.date = t2.date
                  AND t1.refnumber = t2.refnumber
                  AND t1.parentFolderId = t2.parentFolderId
                  AND t2.id > t1.id )

2012-04-04 18:34
by Justin Cave

Yes that works. I am not sure about the performance but it looks more legible - spider8 2012-04-04 19:24

@justin:I think using analytic function is much more optimized .what you say - Gaurav Soni 2012-04-04 20:25

@GauravSoni - I wouldn't expect it to be more efficient in this case. With either my approach or yours, Oracle is going to have to hit the transaction table twice. I would tend to expect that an anti-join would be a bit more efficient than the analytic function approach. But it will depend on the indexes available, the data, how many duplicate rows there are, etc. I wouldn't be shocked if in some cases the analytic function approach was a bit more efficient but I'd expect it to be reasonably close - Justin Cave 2012-04-04 20:30

@JustinCave:you're correct,but in this case you're considering id as primary key .what if id is not primary .On what basis you'll do the join then - Gaurav Soni 2012-04-04 20:37

@GauravSoni - The original poster requested that the row with the minimum id value be retained. If you didn't care which row was retained, you could simply use the predicate t2.rowid > t1.rowid - Justin Cave 2012-04-04 20:39

@JustinCave:Got it now,thanks alo - Gaurav Soni 2012-04-04 20:42

DELETE FROM transaction
      WHERE ID IN (
               SELECT ID
                 FROM (SELECT ID,
                          ROW_NUMBER () OVER (PARTITION BY  date
                                                          ,amount
                                                          ,refnumber
                                                          ,parentfolderid
                                                ORDER BY ID) rn
                                              FROM transaction)
                WHERE rn <> 1);

I will try like this

2012-04-04 19:39
by Gaurav Soni

@mahen:If the table is really really large,using the invaluable power of Analytics you can optimiz - Gaurav Soni 2012-04-04 19:44

I believe you want to ORDER BY id in your analytic function - Justin Cave 2012-04-04 20:30

@JustinCave:On you're suggestion ,i have edited my initial query .Thanks for you comments justi - Gaurav Soni 2012-04-04 20:44

I would try something like this:

DELETE transaction 
FROM transaction
LEFT OUTER JOIN 
   (
       SELECT MIN(id) as id, date, amount, refnumber, parentfolderid 
       FROM transaction
      GROUP BY date, amount, refnumber, parentfolderid
   ) as validRows 
ON transaction.id = validRows.id
WHERE validRows.id IS NULL

2012-04-04 18:57
by jordeu