Re: Removing duplicates

Поиск

Список

Период

Сортировка

От	Giuseppe Broccolo
Тема	Re: Removing duplicates
Дата	10 июля 2013 г. 15:50:18
Msg-id	51DD82AF.4060006@2ndquadrant.it обсуждение исходный текст
Ответ на	Removing duplicates (Johann Spies <johann.spies@gmail.com>)
Список	pgsql-general

Дерево обсуждения

Dear Johann,

I tried (with PostgreSQL 9.2) to run the two DELETE statements you describe in your mail (the first based on the "id" field, the second on the ctid) and they work! I have to point out that if you use the DELETE based on the "id" field YOU'LL DELETE ALL RECORDS having at least one duplicate.
The "q1.id != q.id" doesn't work because query doesn't know yet its result so every row with a duplicate will match without exception.
If you use the DELETE based on ctid order you delete only records with the same "rart_id", keeping the record with the lowest "id".

Remember that if you run the first DELETE query you remove more data than you expect.

Regards,

Giuseppe.

Il 10/07/2013 09:11, Johann Spies ha scritto:

I have used this method successfully on another table but this one is not
working.

I have a table of nearly 800 million records with some duplicates in.

Here is an example:

select rart_id, r9, ra, ry, rw, rv, rp, rs, ri
from isi.rcited_ref
where rart_id = 'A1986D733500019';

renders a result of 72 records. When I do

select count(id), rart_id, r9, ra, ry, rw, rv, rp, rs, ri
from isi.rcited_ref
where rart_id = 'A1986D733500019'
group by rart_id, r9, ra, ry, rw, rv, rp, rs, ri;

It shows that there are 36 duplicates with this rart_id.

So as a test I did the following (the id-field is the primary key):

DELETE FROM isi.rcited_ref q
WHERE EXISTS(SELECT 1 FROM   isi.rcited_ref q1  WHERE  q1.id != q.id  AND    q.rart_id = q1.rart_id  AND    q.r9 = q1.r9  AND    q.ra = q1.ra  AND    q.ry = q1.ry  AND    q.rw = q1.rw  AND    q.rv = q1.rv  AND    q.rp = q1.rp  AND    q.rs = q1.rs  AND    q.rart_id = 'A1986D733500019'   );

But that deletes none.  And I cannot see what went wrong.

I have also tried the same query with ctid without success:

DELETE FROM isi.rcited_ref q
WHERE EXISTS(SELECT 1 FROM   isi.rcited_ref q1  WHERE  q1.ctid < q.ctid  AND    q.rart_id = q1.rart_id  AND    q.r9 = q1.r9  AND    q.ra = q1.ra  AND    q.ry = q1.ry  AND    q.rw = q1.rw  AND    q.rv = q1.rv  AND    q.rp = q1.rp  AND    q.rs = q1.rs  AND    q.rart_id = 'A1986D733500019'   );


The size of the table makes it difficult to use a 'group by'  method to
delete all duplcates.

What am I doing wrong?

Regards
Johann
-- 
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)

В списке pgsql-general по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Removing duplicates