Re: Need efficient way to remove (almost) duplicate rows from a table

Поиск

Список

Период

Сортировка

От	Paul Jungwirth
Тема	Re: Need efficient way to remove (almost) duplicate rows from a table
Дата	17 октября 2023 г. 23:09:17
Msg-id	c6e4f48d-3af4-4d65-4031-b1e447e09214@illuminatedcomputing.com обсуждение исходный текст
Ответ на	Need efficient way to remove (almost) duplicate rows from a table (David Gauthier <dfgpostgres@gmail.com>)
Список	pgsql-general

Дерево обсуждения

On 10/17/23 15:48, David Gauthier wrote:
> I have a table with millions of records and there are a lot of "almost" 
> duplicates that I want to get rid of in an efficient way.  Best to 
> illustrate with a simple example...

This sounds like a classic use-case for a "correlated subquery". 
Assuming you want to keep the lowest twig_id, you could do this:

```
paul=# select * from twigs;
  limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
        1 |         1 |       1 |      100
        1 |         1 |       2 |    103.7
        1 |         1 |       3 |    103.7
        1 |         1 |       4 |    110.4
        1 |         2 |       1 |      120
        1 |         2 |       2 |    123.6
        1 |         2 |       3 |    128.1
        1 |         2 |       4 |    128.1
        1 |         2 |       5 |    128.1
        2 |         1 |       1 |      100
        2 |         1 |       3 |    104.4
        2 |         1 |       4 |    131.9
(12 rows)

Time: 0.218 ms
paul=# delete from twigs as t1
where exists (select 1 from twigs as t2 where (t1.limb_id, t1.branch_id, 
t1.surfarea) = (t2.limb_id, t2.branch_id, t2.surfarea) and t1.twig_id > 
t2.twig_id);
DELETE 3
Time: 8.555 ms
paul=# select * from twigs;
  limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
        1 |         1 |       1 |      100
        1 |         1 |       2 |    103.7
        1 |         1 |       4 |    110.4
        1 |         2 |       1 |      120
        1 |         2 |       2 |    123.6
        1 |         2 |       3 |    128.1
        2 |         1 |       1 |      100
        2 |         1 |       3 |    104.4
        2 |         1 |       4 |    131.9
(9 rows)

Time: 0.231 ms
```

That says "delete from t1 where there is a matching t2 with a lower 
twig_id."

If you want to keep the highest-numbered twig_id, the sql is easy to adjust.

Regards,

-- 
Paul              ~{:-)
pj@illuminatedcomputing.com

В списке pgsql-general по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Need efficient way to remove (almost) duplicate rows from a table