Re: Need efficient way to remove (almost) duplicate rows from a table
От | Paul Jungwirth |
---|---|
Тема | Re: Need efficient way to remove (almost) duplicate rows from a table |
Дата | |
Msg-id | c6e4f48d-3af4-4d65-4031-b1e447e09214@illuminatedcomputing.com обсуждение исходный текст |
Ответ на | Need efficient way to remove (almost) duplicate rows from a table (David Gauthier <dfgpostgres@gmail.com>) |
Список | pgsql-general |
On 10/17/23 15:48, David Gauthier wrote: > I have a table with millions of records and there are a lot of "almost" > duplicates that I want to get rid of in an efficient way. Best to > illustrate with a simple example... This sounds like a classic use-case for a "correlated subquery". Assuming you want to keep the lowest twig_id, you could do this: ``` paul=# select * from twigs; limb_id | branch_id | twig_id | surfarea ---------+-----------+---------+---------- 1 | 1 | 1 | 100 1 | 1 | 2 | 103.7 1 | 1 | 3 | 103.7 1 | 1 | 4 | 110.4 1 | 2 | 1 | 120 1 | 2 | 2 | 123.6 1 | 2 | 3 | 128.1 1 | 2 | 4 | 128.1 1 | 2 | 5 | 128.1 2 | 1 | 1 | 100 2 | 1 | 3 | 104.4 2 | 1 | 4 | 131.9 (12 rows) Time: 0.218 ms paul=# delete from twigs as t1 where exists (select 1 from twigs as t2 where (t1.limb_id, t1.branch_id, t1.surfarea) = (t2.limb_id, t2.branch_id, t2.surfarea) and t1.twig_id > t2.twig_id); DELETE 3 Time: 8.555 ms paul=# select * from twigs; limb_id | branch_id | twig_id | surfarea ---------+-----------+---------+---------- 1 | 1 | 1 | 100 1 | 1 | 2 | 103.7 1 | 1 | 4 | 110.4 1 | 2 | 1 | 120 1 | 2 | 2 | 123.6 1 | 2 | 3 | 128.1 2 | 1 | 1 | 100 2 | 1 | 3 | 104.4 2 | 1 | 4 | 131.9 (9 rows) Time: 0.231 ms ``` That says "delete from t1 where there is a matching t2 with a lower twig_id." If you want to keep the highest-numbered twig_id, the sql is easy to adjust. Regards, -- Paul ~{:-) pj@illuminatedcomputing.com
В списке pgsql-general по дате отправления: