Scrub one large table against another
От | Brendan Curran |
---|---|
Тема | Scrub one large table against another |
Дата | |
Msg-id | 452C1437.9040709@gmail.com обсуждение исходный текст |
Ответы |
Re: Scrub one large table against another
|
Список | pgsql-performance |
I currently have a db supporting what is for the most part an OLAP data warehousing application. One table (good data) has roughly 120 million rows, divided into roughly 40 different relational groups (logically by foreign key). Every time I add data to this table, I need to afterwards scrub that group against known "bad data" which is housed in a second table that has roughly 21 million rows. The 120 million row good data table is called "email_record" The 21 million row bad data table is called "suppress" There are separate btree indexes on 'email_record_id', 'email_list_id' and 'email' on both tables. Each time I scrub data I pull out anywhere from 1 to 5 million rows from the good table (depending on the size of the group i'm scrubbing) and compare them against the 21 million rows in the 'suppress' table. So far I've done this using a temporary staging table that stores only the email_record_id for each row from the relevant group of the good table. I use a plsql function that does roughly the following (i've included only sql syntax and inserted the constant '9' where i would normally use a variable): The characters: email_record_id int8, email varchar(255), email_list_id int8 ------------------------------------------------------------- CREATE TEMP TABLE temp_list_suppress(email_record_id int8); INSERT INTO temp_list_suppress SELECT email_record_id from ONLY email_record er WHERE email_list_id = 9 AND email IN (select email from suppress); CREATE INDEX unique_id_index on temp_list_suppress ( email_record_id ); INSERT INTO er_banned SELECT * from ONLY email_record er WHERE EXISTS (SELECT 1 from temp_list_suppress ts where er.email_record_id = ts.email_record_id)'; DELETE FROM ONLY email_record WHERE email_list_id = 9 AND email_record_id IN (SELECT email_record_id from temp_list_suppress); TRUNCATE TABLE temp_list_suppress; DROP TABLE temp_list_suppress; -------------------------------------------------------------- The performance is dreadful, is there a more efficient way to do this? Would I be better off just grabbing * initially from the good table instead of just the id to avoid more sequential searches later? Here are my configs: Debian Postgres 8.1.4 dual zeon ram: 4 gigs raid 5 # - Memory - shared_buffers = 3000 work_mem = 92768 maintenance_work_mem = 128384 autovacuum is turned off, and the db is annalyzed and vacuumed regularly. Regards, Brendan
В списке pgsql-performance по дате отправления: