Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
От | John McKown |
---|---|
Тема | Re: Removing duplicate records from a bulk upload (rationale behind selecting a method) |
Дата | |
Msg-id | CAAJSdjiR9sUOQNqWQKZRr9BCyqVAtA_2dYVfhWKcHUGYhgFSxw@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Removing duplicate records from a bulk upload (rationale behind selecting a method) (Daniel Begin <jfd553@hotmail.com>) |
Ответы |
Re: Removing duplicate records from a bulk upload (rationale behind
selecting a method)
(David G Johnston <david.g.johnston@gmail.com>)
Re: Removing duplicate records from a bulk upload (rationale behind selecting a method) (Marc Mamin <M.Mamin@intershop.de>) |
Список | pgsql-general |
Seems promising but could you provide me a reference to PostgreSQL
documentation regarding this "a%8=*" feature?
Best
% is the modulus operator.
Assuming "a" is an integer (I don't remember), then doing 8 selects of "a modulus 8" = for each of the possible results (0..7) will each select about 1/8 of the entire table (I would guess) and the end result put together, they will end up selecting all of the original table. I don't know, myself, why this would be faster. But I'm not any kind of a PostgreSQL expert either.
Daniel
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Marc Mamin
Sent: December-12-14 06:41
To: Daniel Begin; 'Tom Lane'; 'Scott Marlowe'
Cc: 'Andy Colson'; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
(rationale behind selecting a method)
>Thank Tom,
>I understand that the rationale behind choosing to create a new table
>from distinct records is that, since both approaches need full table
>scans, selecting distinct records is faster (and seems more straight
>forward) than finding/deleting duplicates;
Hi,
on a large table you may get it faster while using more than one thread.
e.g.:
select a,b,c into newtable from oldtable where a%8 =0 group by a,b,c; select
a,b,c into newtable from oldtable where a%8 =1 group by a,b,c; ...
select a,b,c into newtable from oldtable where a%8 =7 group by a,b,c;
This will/should use a shared full table scan on oldtable.
HTH
Marc Mamin
>
>Best regards,
>Daniel
>
>-----Original Message-----
>From: pgsql-general-owner@postgresql.org
>[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tom Lane
>Sent: December-08-14 21:52
>To: Scott Marlowe
>Cc: Andy Colson; Daniel Begin; pgsql-general@postgresql.org
>Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
>(rationale behind selecting a method)
>
>Scott Marlowe <scott.marlowe@gmail.com> writes:
>> If you're de-duping a whole table, no need to create indexes, as it's
>> gonna have to hit every row anyway. Fastest way I've found has been:
>
>> select a,b,c into newtable from oldtable group by a,b,c;
>
>> On pass, done.
>
>> If you want to use less than the whole row, you can use select
>> distinct on (col1, col2) * into newtable from oldtable;
>
>Also, the DISTINCT ON method can be refined to control which of a set
>of duplicate keys is retained, if you can identify additional columns
>that constitute a preference order for retaining/discarding dupes. See
>the "latest weather reports" example in the SELECT reference page.
>
>In any case, it's advisable to crank up work_mem while performing this
>operation.
>
> regards, tom lane
>
>
>--
>Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To
>make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-general
>
>
>
>--
>Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To
>make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-general
>
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make
changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
While a transcendent vocabulary is laudable, one must be eternally careful so that the calculated objective of communication does not become ensconced in obscurity. In other words, eschew obfuscation.John McKown
В списке pgsql-general по дате отправления:
Предыдущее
От: Daniel BeginДата:
Сообщение: Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)