Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
From | Daniel Begin
---|---
Subject | Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)
Date |
Msg-id | COL129-DS838ED023AE6FCC4446A2494640@phx.gbl
Responses | Re: Removing duplicate records from a bulk upload (rationale behind selecting a method) (John R Pierce <pierce@hogranch.com>)
| Re: Removing duplicate records from a bulk upload (rationale behind selecting a method) (Andy Colson <andy@squeakycode.net>)
List | pgsql-general
Thanks for your answers, Andy; I will keep in mind the procedure you proposed. About the fields required to find duplicate records: all of them are required (5-9, depending on the table). Considering that the tables are not indexed yet, am I right to think that both approaches will need a full table scan?

- Deleting duplicate records would need a full table scan to create temporary indexes to select/remove duplicate records;
- Inserting distinct records into an empty table will also need a full table scan, to select distinct (*) from big_table.

That said, is the indexing and selection/deletion of duplicate records faster than rewriting a whole table from distinct records? I am trying to find a rationale behind the choice - execution time, simplicity? I just don't know what the criteria should be and how/why it would affect the selection of an approach :-|

Daniel

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Andy Colson
Sent: December-08-14 11:39
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Removing duplicate records from a bulk upload

On 12/8/2014 10:30 AM, Andy Colson wrote:
> On 12/7/2014 9:31 PM, Daniel Begin wrote:
>> I have just completed the bulk upload of a large database. Some
>> tables have billions of records and no constraints or indexes have
>> been applied yet. About 0.1% of these records may have been
>> duplicated during the upload and I need to remove them before applying constraints.
>>
>> I understand there are (at least) two approaches to get a table
>> without duplicate records…
>>
>> - Delete duplicate records from the table based on an
>> appropriate select clause;
>>
>> - Create a new table with the results from a select distinct
>> clause, and then drop the original table.
>>
>> What would be the most efficient procedure in PostgreSQL to do the
>> job considering …
>>
>> - I do not know which records were duplicated;
>>
>> - There are no indexes applied on tables yet;
>>
>> - There is no OIDS on tables yet;
>>
>> - The database is currently 1TB but I have plenty of disk
>> space.
>>
>> Daniel
>>
>
> How would you detect a duplicate? Is there a single field that would be
> duplicated? Or do you have to test a bunch of different fields?
>
> If it's a single field, you could find dups in a single pass of the
> table with:
>
> create index bigtable_key on bigtable(key);
> select key, count(*) from bigtable group by key having count(*) > 1;
>
> Save that list, and decide on some way of deleting the dups.
>
> The index might help the initial select, but will really help re-query
> and delete statements.
>
> -Andy

I just thought of a more generic way.

1) make a non-unique index on bigtable
2) make a temp table
3) -- copy only dups
   insert into temp table
   select * from bigtable where (its a duplicate);
4) delete from bigtable where keys in (select key from temp);
5) insert into bigtable select distinct from temp;

This would minimize the amount of data you have to move around. Depends on how hard step 3 is to write. Index not required but would help both step 3 and 4 be faster.

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
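[Editor's note] The two approaches Daniel is weighing can be sketched in SQL. This is an illustration only: `big_table` and the key columns `col1 .. col3` are placeholders for the 5-9 real columns. The first variant relies on `ctid`, PostgreSQL's physical row identifier, which is usable here precisely because the table has no primary key or OIDs yet; note that `NOT IN` over billions of rows can be very slow without an index.

```sql
-- Approach 1: delete duplicates in place, keeping one physical row per group.
DELETE FROM big_table t
WHERE t.ctid NOT IN (
    SELECT min(b.ctid)
    FROM big_table b
    GROUP BY b.col1, b.col2, b.col3
);

-- Approach 2: rewrite the table from distinct rows, then swap names.
CREATE TABLE big_table_dedup AS
SELECT DISTINCT * FROM big_table;

DROP TABLE big_table;
ALTER TABLE big_table_dedup RENAME TO big_table;
```

With only ~0.1% duplicates, approach 2 rewrites the whole 1TB table to discard a tiny fraction of it, while approach 1 touches few rows but pays for the grouping/lookup; that trade-off is the rationale being asked about.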
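[Editor's note] Andy's five-step temp-table procedure, written out as a hedged sketch. Again `big_table` and the key columns `col1 .. col3` are hypothetical stand-ins for the real duplicate-detection columns; adjust to the actual schema.

```sql
-- 1) non-unique index to speed up duplicate lookups in steps 3 and 4
CREATE INDEX big_table_key ON big_table (col1, col2, col3);

-- 2) temp table with the same row shape, no data
CREATE TEMP TABLE dups AS SELECT * FROM big_table WITH NO DATA;

-- 3) copy only the duplicated rows (every copy of them)
INSERT INTO dups
SELECT * FROM big_table
WHERE (col1, col2, col3) IN (
    SELECT col1, col2, col3
    FROM big_table
    GROUP BY col1, col2, col3
    HAVING count(*) > 1
);

-- 4) remove all copies of those rows from the big table
DELETE FROM big_table
WHERE (col1, col2, col3) IN (SELECT col1, col2, col3 FROM dups);

-- 5) put exactly one copy of each back
INSERT INTO big_table SELECT DISTINCT * FROM dups;
```

Because only the duplicated rows (about 0.1% of the table) pass through the temp table, this moves far less data than rewriting the whole table.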
In the pgsql-general list, by date sent:

Next
From: John R Pierce
Date:
Message: Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)