Обсуждение: De-duplicating rows
The Subject: is somewhat imprecise, but here's what I'm trying to do. For some reason, my brain is locking up over it. I'm moving a 7.2 (yes) database to 8.4. In the table in question, the structure is along the lines of: serial_number SERIAL, PRIMARY KEYemail TEXTcreate_date TIMESTAMPattr1 typeattr2 typeattr3 type... (The point of the "attr" fields is that there are many more columns for each row.) The new structure removes the "serial_number" field, and uses "email" as the primary key, but is otherwise unchanged: email TEXT, PRIMARY KEYcreate_date TIMESTAMPattr1 typeattr2 typeattr3 type... Now, since this database has been production since 7.2 days, cruft has crept in: in particular, there are duplicate email addresses, some with mismatched attributes. The policy decision by the client is that the correct row is the one with the earliest timestamp. (The timestamps are widely distributed; it's not the case that there is a single timestamp above which all the duplicates live.) Thus, ideally, I want to select exactly one row per "email", picking the row with the earliest timestamp in the case that there is more than one row with that email. Any suggestions on how to write such a SELECT? Of course, I could do this with an application against the db, but a single SELECT would be great if possible. TIA!
Christophe wrote: > The Subject: is somewhat imprecise, but here's what I'm trying to do. > For some reason, my brain is locking up over it. > > I'm moving a 7.2 (yes) database to 8.4. > Big leap. Allow some time for testing your application. I'm not sure if 7.2 even supported schemas, and there's been loads of tightening up the rules for automatic type casting, unicode etc. > Now, since this database has been production since 7.2 days, cruft has > crept in: in particular, there are duplicate email addresses, some with > mismatched attributes. The policy decision by the client is that the > correct row is the one with the earliest timestamp. Something like (untested): CREATE TEMPORARY TABLE earliest_duplicates AS SELECT email AS tgt_email, min(create_date) AS tgt_date FROM mytable GROUP BY email HAVING count(*) > 1; DELETE FROM mytable USING earliest duplicates WHERE email=tgt_email AND create_date > tgt_date; -- Richard Huxton Archonet Ltd
Richard Huxton wrote: > Christophe wrote: >> Now, since this database has been production since 7.2 days, cruft has >> crept in: in particular, there are duplicate email addresses, some >> with mismatched attributes. The policy decision by the client is that >> the correct row is the one with the earliest timestamp. > > Something like (untested): > > CREATE TEMPORARY TABLE earliest_duplicates AS > SELECT > email AS tgt_email, > min(create_date) AS tgt_date > FROM mytable > GROUP BY email > HAVING count(*) > 1; > > DELETE FROM mytable USING earliest duplicates > WHERE email=tgt_email AND create_date > tgt_date; If it is possible that two rows exist for the same email/date; then you will likely need to deal with these manually. If you rerun the above SELECT after running the delete you should identify these rows.
On Thu, Jul 16, 2009 at 9:07 PM, Christophe<xof@thebuild.com> wrote: > The Subject: is somewhat imprecise, but here's what I'm trying to do. For > some reason, my brain is locking up over it. > > I'm moving a 7.2 (yes) database to 8.4. In the table in question, the > structure is along the lines of: > > serial_number SERIAL, PRIMARY KEY > email TEXT > create_date TIMESTAMP > attr1 type > attr2 type > attr3 type > ... > > (The point of the "attr" fields is that there are many more columns for each > row.) > > The new structure removes the "serial_number" field, and uses "email" as the > primary key, but is otherwise unchanged: > > email TEXT, PRIMARY KEY > create_date TIMESTAMP > attr1 type > attr2 type > attr3 type > ... > > Now, since this database has been production since 7.2 days, cruft has crept > in: in particular, there are duplicate email addresses, some with mismatched > attributes. The policy decision by the client is that the correct row is > the one with the earliest timestamp. (The timestamps are widely > distributed; it's not the case that there is a single timestamp above which > all the duplicates live.) Thus, ideally, I want to select exactly one row > per "email", picking the row with the earliest timestamp in the case that > there is more than one row with that email. > > Any suggestions on how to write such a SELECT? Of course, I could do this > with an application against the db, but a single SELECT would be great if > possible. OK, assuming we can keep the serial number during the conversion, we could use something like this: select distinct a.serial_number from table a join table b on (a.email=b.email and a.serial_number>b.serial_number) Now assuming that the serial numbers and the timestamps are in order together, that'll give us all the serial numbers for all the matching email addresses EXCEPT the first one. If the serial numbers are not in order with the timestamps, then create a sequence, and update them in order, then the query will work. Once you've confirmed by hand that the first hundred or so serial_numbers you're getting back ARE in fact all n+1 for the same email address, use the select in a subselect to delete: delete from table x where serial_number in (select distinct....)