Re: Merge David and Goliath tables efficiently

From: Tomas Vondra
Subject: Re: Merge David and Goliath tables efficiently
Date:
Msg-id: c2bb0978-10ea-1644-6b72-240298fd88d7@enterprisedb.com
In reply to: Merge David and Goliath tables efficiently  (Nicolas Paris <nicolas.paris@riseup.net>)
Responses: Re: Merge David and Goliath tables efficiently  (nicolas paris <nicolas.paris@riseup.net>)
List: pgsql-performance
On 6/17/23 15:48, Nicolas Paris wrote:
> In my use case I have a 2 billion row / 1 TB table. I have around 2 million rows of daily data to upsert, with say 50% inserts, based on the primary key, into a freshly analyzed table.
> 
> I have tested multiple strategies to merge the data, all based on a first stage that copies the 2M dataset into a staging unlogged, indexed table:
> 
> 1. Join insert then join update
> 2.1. Usage of the new MERGE statement
> 2.2. Usage of MERGE on two hash-partitioned tables with partition-wise join enabled
> 3. Usage of MERGE in batches of 1000 rows
> 
> First remark: the MERGE statement is almost 30% faster than the two separate statements in my benchmarks. Thanks to the pg community for this.
> 
> While strategies 1 and 2.x are incredibly slow (canceled after 10 hours), the third one finishes within 30 minutes.
> 

Seems pretty terrible, provided the data is on reasonable storage (with
acceptable random I/O behavior).
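
FWIW my mental model of the batched variant (strategy 3) is something
like the sketch below - the table/column names (goliath, david, id,
payload) and the 'merged' flag are made up, since your mail doesn't show
the actual code:

    DO $$
    DECLARE
      batch_ids bigint[];
    BEGIN
      LOOP
        -- pick the next 1000 unprocessed rows from the staging table
        SELECT array_agg(id) INTO batch_ids
        FROM (SELECT id FROM david
              WHERE NOT merged
              ORDER BY id
              LIMIT 1000) s;

        EXIT WHEN batch_ids IS NULL;

        MERGE INTO goliath g
        USING (SELECT * FROM david WHERE id = ANY (batch_ids)) d
        ON g.id = d.id
        WHEN MATCHED THEN
          UPDATE SET payload = d.payload
        WHEN NOT MATCHED THEN
          INSERT (id, payload) VALUES (d.id, d.payload);

        -- remember which staging rows are done, commit per batch
        UPDATE david SET merged = true WHERE id = ANY (batch_ids);
        COMMIT;
      END LOOP;
    END $$;

Is that roughly it, or do you drive the batches from the client?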

> My interpretation, reading the query plan, is that well-sized small batches of upserts leverage the indexes, while the regular join chooses a sequential scan, including sorting and hashing, which take a huge amount of time and resources, including disk.

You may be right, but it's hard to tell without seeing the query plan.
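
Something like this would do (with your actual MERGE - the names here
are just placeholders):

    BEGIN;
    EXPLAIN (ANALYZE, BUFFERS)
    MERGE INTO goliath g
    USING david d ON g.id = d.id
    WHEN MATCHED THEN UPDATE SET payload = d.payload
    WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (d.id, d.payload);
    ROLLBACK;  -- ANALYZE executes the MERGE for real, so roll it back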

> 
> Sadly, my incoming dataset is too small to benefit from a seq scan and too large to benefit from an index scan join. However, when split manually into N portions, the problem can be tackled with N * small cost, which is cheap anyway.
> 

Sounds very much like you'd benefit from tuning some cost parameters to
make the index scan look cheaper.
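
For example, on SSD storage it usually makes sense to lower
random_page_cost (the 4.0 default assumes spinning disks) and to set
effective_cache_size to roughly the memory available for caching - the
values below are just illustrative:

    SET random_page_cost = 1.1;         -- default 4.0, tuned for rotational disks
    SET effective_cache_size = '48GB';  -- adjust to your RAM

Then re-run EXPLAIN on the slow MERGE and see if the plan flips to
index scans.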

> Questions:
> 1. Is there another strategy?
> 2. Could Postgres itself support a "batched indexed join", leveraging indexes by dynamically sized batches?
> 

Not sure what 'batched indexed join' would be, but it very much sounds
like a nested loop with an index scan.
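
You can see what such a plan would look like (and what the planner
thinks it'd cost) by disabling the other join methods in a session -
plain EXPLAIN, so nothing is actually executed:

    BEGIN;
    SET LOCAL enable_hashjoin = off;   -- steer the planner away from hash join
    SET LOCAL enable_mergejoin = off;  -- ... and from merge join
    EXPLAIN
    MERGE INTO goliath g
    USING david d ON g.id = d.id
    WHEN MATCHED THEN UPDATE SET payload = d.payload
    WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (d.id, d.payload);
    ROLLBACK;

If the nested loop comes out much cheaper than the default plan, that
again points at the cost parameters.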

> 
> It is error-prone to write code to split and iterate. I suspect Postgres has everything internally (index catalog, planner) to split the job itself, making David vs Goliath something trivial.
> 

What PostgreSQL version are you using, what hardware? Did you tune it in
any way, or is everything just default?
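
For example:

    SELECT version();

    -- settings changed from their defaults
    SELECT name, setting, unit
    FROM pg_settings
    WHERE source <> 'default';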


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


