Обсуждение: De-duplicating rows

Поиск
Список
Период
Сортировка

De-duplicating rows

От
Christophe
Дата:
The Subject: is somewhat imprecise, but here's what I'm trying to do.   
For some reason, my brain is locking up over it.

I'm moving a 7.2 (yes) database to 8.4.  In the table in question, the  
structure is along the lines of:
serial_number    SERIAL, PRIMARY KEYemail        TEXTcreate_date    TIMESTAMPattr1        typeattr2        typeattr3
   type...
 

(The point of the "attr" fields is that there are many more columns  
for each row.)

The new structure removes the "serial_number" field, and uses "email"  
as the primary key, but is otherwise unchanged:
email        TEXT, PRIMARY KEYcreate_date    TIMESTAMPattr1        typeattr2        typeattr3        type...

Now, since this database has been production since 7.2 days, cruft has  
crept in: in particular, there are duplicate email addresses, some  
with mismatched attributes.  The policy decision by the client is that  
the correct row is the one with the earliest timestamp.  (The  
timestamps are widely distributed; it's not the case that there is a  
single timestamp above which all the duplicates live.)  Thus, ideally,  
I want to select exactly one row per "email", picking the row with the  
earliest timestamp in the case that there is more than one row with  
that email.

Any suggestions on how to write such a SELECT?  Of course, I could do  
this with an application against the db, but a single SELECT would be  
great if possible.

TIA!


Re: De-duplicating rows

От
Richard Huxton
Дата:
Christophe wrote:
> The Subject: is somewhat imprecise, but here's what I'm trying to do.  
> For some reason, my brain is locking up over it.
> 
> I'm moving a 7.2 (yes) database to 8.4. >

Big leap. Allow some time for testing your application. I'm not sure if 
7.2 even supported schemas, and there's been loads of tightening up the 
rules for automatic type casting, unicode etc.

> Now, since this database has been production since 7.2 days, cruft has 
> crept in: in particular, there are duplicate email addresses, some with 
> mismatched attributes.  The policy decision by the client is that the 
> correct row is the one with the earliest timestamp.

Something like (untested):

CREATE TEMPORARY TABLE earliest_duplicates AS
SELECT  email AS tgt_email,  min(create_date) AS tgt_date
FROM mytable
GROUP BY email
HAVING count(*) > 1;

DELETE FROM mytable USING earliest duplicates
WHERE email=tgt_email AND create_date > tgt_date;

--   Richard Huxton  Archonet Ltd


Re: De-duplicating rows

От
Frank Bax
Дата:
Richard Huxton wrote:
> Christophe wrote:
>> Now, since this database has been production since 7.2 days, cruft has 
>> crept in: in particular, there are duplicate email addresses, some 
>> with mismatched attributes.  The policy decision by the client is that 
>> the correct row is the one with the earliest timestamp.
> 
> Something like (untested):
> 
> CREATE TEMPORARY TABLE earliest_duplicates AS
> SELECT
>   email AS tgt_email,
>   min(create_date) AS tgt_date
> FROM mytable
> GROUP BY email
> HAVING count(*) > 1;
> 
> DELETE FROM mytable USING earliest duplicates
> WHERE email=tgt_email AND create_date > tgt_date;



If it is possible that two rows exist for the same email/date; then you 
will likely need to deal with these manually.

If you rerun the above SELECT after running the delete you should 
identify these rows.


Re: De-duplicating rows

От
Scott Marlowe
Дата:
On Thu, Jul 16, 2009 at 9:07 PM, Christophe<xof@thebuild.com> wrote:
> The Subject: is somewhat imprecise, but here's what I'm trying to do.  For
> some reason, my brain is locking up over it.
>
> I'm moving a 7.2 (yes) database to 8.4.  In the table in question, the
> structure is along the lines of:
>
>        serial_number   SERIAL, PRIMARY KEY
>        email           TEXT
>        create_date     TIMESTAMP
>        attr1           type
>        attr2           type
>        attr3           type
>        ...
>
> (The point of the "attr" fields is that there are many more columns for each
> row.)
>
> The new structure removes the "serial_number" field, and uses "email" as the
> primary key, but is otherwise unchanged:
>
>        email           TEXT, PRIMARY KEY
>        create_date     TIMESTAMP
>        attr1           type
>        attr2           type
>        attr3           type
>        ...
>
> Now, since this database has been production since 7.2 days, cruft has crept
> in: in particular, there are duplicate email addresses, some with mismatched
> attributes.  The policy decision by the client is that the correct row is
> the one with the earliest timestamp.  (The timestamps are widely
> distributed; it's not the case that there is a single timestamp above which
> all the duplicates live.)  Thus, ideally, I want to select exactly one row
> per "email", picking the row with the earliest timestamp in the case that
> there is more than one row with that email.
>
> Any suggestions on how to write such a SELECT?  Of course, I could do this
> with an application against the db, but a single SELECT would be great if
> possible.

OK, assuming we can keep the serial number during the conversion, we
could use something like this:

select distinct a.serial_number from table a join table b on
(a.email=b.email and a.serial_number>b.serial_number)

Now assuming that the serial numbers and the timestamps are in order
together, that'll give us all the serial numbers for all the matching
email addresses EXCEPT the first one.  If the serial numbers are not
in order with the timestamps, then create a sequence, and update them
in order, then the query will work.  Once you've confirmed by hand
that the first hundred or so serial_numbers you're getting back ARE in
fact all n+1 for the same email address, use the select in a subselect
to delete:

delete from table x where serial_number in (select distinct....)