Most efficient way to insert without duplicates

Поиск

Список

Период

Сортировка

От	François Beausoleil
Тема	Most efficient way to insert without duplicates
Дата	17 апреля 2013 г. 02:51:31
Msg-id	897C3D25-1352-49C7-A9E6-2E9270029FAC@teksol.info обсуждение исходный текст
Ответы	Re: Most efficient way to insert without duplicates Re: Most efficient way to insert without duplicates Re: Most efficient way to insert without duplicates
Список	pgsql-general

Дерево обсуждения

Hi all!

I track Twitter followers in my database. I have the following table:

# \d persona_followers
           Table "public.persona_followers"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 service_id  | bigint                      | not null
 follower_id | bigint                      | not null
 valid_at    | timestamp without time zone |
Indexes:
    "persona_followers_pkey" PRIMARY KEY, btree (service_id, follower_id)

The table IS NOT partitioned.

I have a list of Twitter people I follow more - brands, actors, those kinds of Twitter accounts. They often have
thousands,if not hundreds of thousands, of followers. I fetch the followers of these accounts about once a day. When
it'stime to insert into the database, I use the following algorithm: 

CREATE TEMP TABLE import( service_id bigint, follower_id bigint );
COPY INTO import FROM STDIN;
...
\N

INSERT INTO persona_followers(service_id, follower_id, valid_at)
  SELECT service_id, follower_id, NOW()
  FROM (SELECT DISTINCT service_id, follower_id FROM import) AS import
  WHERE NOT EXISTS(SELECT * FROM persona_followers WHERE import.service_id = persona_followers.service_id AND
import.follower_id= persona_followers.follower_id); 

I currently have 660 million rows in persona_followers (47 GB). A test import is 13.5 million rows (571 MB). The real
dailyimport will be at least 10x more. In a 24 hour period, I will have at most a few thousand *new* rows - the rest
willalready exist in persona_followers. How do I most efficiently eliminate the duplicates? Should I delete the
duplicatesin import? Or should I bite the bullet and EXCEPT the final table? Should I insert much smaller batches? Or
isthe above already the most efficient way? What other completely different data structure could I use to achieve my
goal?I truly need the exhaustive list of followers because we do reach calculations (number of unique accounts which
receiveda particular tweet). 

The true answer is probably "benchmark on your own servers", but I'm looking for guidelines, people with the same kind
ofexperience. 

Thanks!
François

Вложения

smime.p7s

В списке pgsql-general по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Most efficient way to insert without duplicates

Вложения