Most efficient way to insert without duplicates
От | François Beausoleil |
---|---|
Тема | Most efficient way to insert without duplicates |
Дата | |
Msg-id | 897C3D25-1352-49C7-A9E6-2E9270029FAC@teksol.info обсуждение исходный текст |
Ответы |
Re: Most efficient way to insert without duplicates
Re: Most efficient way to insert without duplicates Re: Most efficient way to insert without duplicates |
Список | pgsql-general |
Hi all! I track Twitter followers in my database. I have the following table: # \d persona_followers Table "public.persona_followers" Column | Type | Modifiers -------------+-----------------------------+----------- service_id | bigint | not null follower_id | bigint | not null valid_at | timestamp without time zone | Indexes: "persona_followers_pkey" PRIMARY KEY, btree (service_id, follower_id) The table IS NOT partitioned. I have a list of Twitter people I follow more - brands, actors, those kinds of Twitter accounts. They often have thousands,if not hundreds of thousands, of followers. I fetch the followers of these accounts about once a day. When it'stime to insert into the database, I use the following algorithm: CREATE TEMP TABLE import( service_id bigint, follower_id bigint ); COPY INTO import FROM STDIN; ... \N INSERT INTO persona_followers(service_id, follower_id, valid_at) SELECT service_id, follower_id, NOW() FROM (SELECT DISTINCT service_id, follower_id FROM import) AS import WHERE NOT EXISTS(SELECT * FROM persona_followers WHERE import.service_id = persona_followers.service_id AND import.follower_id= persona_followers.follower_id); I currently have 660 million rows in persona_followers (47 GB). A test import is 13.5 million rows (571 MB). The real dailyimport will be at least 10x more. In a 24 hour period, I will have at most a few thousand *new* rows - the rest willalready exist in persona_followers. How do I most efficiently eliminate the duplicates? Should I delete the duplicatesin import? Or should I bite the bullet and EXCEPT the final table? Should I insert much smaller batches? Or isthe above already the most efficient way? What other completely different data structure could I use to achieve my goal?I truly need the exhaustive list of followers because we do reach calculations (number of unique accounts which receiveda particular tweet). The true answer is probably "benchmark on your own servers", but I'm looking for guidelines, people with the same kind ofexperience. Thanks! François
Вложения
В списке pgsql-general по дате отправления: