Discussion: parallel data loading for pgbench -i
Attachments
Hi Mircea,
I tested the patch on 19devel and it worked well for me.
Before applying it, -j is rejected in pgbench initialization mode as expected. After applying the patch, pgbench -i -s 100 -j 10 runs successfully and shows a clear speedup.
On my system the total runtime dropped to about 9.6s, with client-side data generation around 3.3s.
I also checked correctness after the run — row counts for pgbench_accounts, pgbench_branches, and pgbench_tellers all match the expected values.
Thanks for working on this, the improvement is very noticeable.
Best regards,
lakshmi
Hi,
I propose a patch for speeding up pgbench -i through multithreading.
To enable this, pass -j and then the number of workers you want to use.
Here are some results I got on my laptop:
master
---
-i -s 100
done in 20.95 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 14.51 s, vacuum 0.27 s, primary keys 6.16 s).
-i -s 100 --partitions=10
done in 29.73 s (drop tables 0.00 s, create tables 0.02 s, client-side
generate 16.33 s, vacuum 8.72 s, primary keys 4.67 s).
patch (-j 10)
---
-i -s 100 -j 10
done in 18.64 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 5.82 s, vacuum 6.89 s, primary keys 5.93 s).
-i -s 100 -j 10 --partitions=10
done in 14.66 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 8.42 s, vacuum 1.55 s, primary keys 4.68 s).
The speedup is more significant for the partitioned use-case. This is
because all workers can use COPY FREEZE (thus incurring a lower vacuum
penalty) because they create their separate partitions.
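For reference, the per-worker pattern that makes COPY FREEZE possible looks roughly like this (a sketch with an assumed partition name and bounds, not code from the patch). COPY FREEZE is only allowed when the target table was created or truncated in the current (sub)transaction, so a worker that creates its own partition can freeze rows at load time, while concurrent workers loading one shared table cannot:

```sql
-- Each worker, in its own connection and transaction:
BEGIN;
CREATE TABLE pgbench_accounts_1
  PARTITION OF pgbench_accounts
  FOR VALUES FROM (1) TO (1000001);
-- FREEZE is permitted because the partition was created in this
-- transaction; rows are written already frozen, so VACUUM has
-- little left to do afterwards.
COPY pgbench_accounts_1 FROM STDIN WITH (FREEZE);
COMMIT;
```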
For the non-partitioned case the speedup is lower, but I observe it
improves somewhat with larger scale factors. When parallel vacuum
support is merged, this should further reduce the time.
I still need to update the docs and tests, better integrate the code with
its surroundings, and address other aspects. I'd appreciate any feedback
on what I have so far, though. Thanks!
Kind regards,
Mircea Cadariu
Hi Lakshmi,
On 19/01/2026 09:25, lakshmi wrote:
> Hi Mircea,
> I tested the patch on 19devel and it worked well for me.
> Before applying it, -j is rejected in pgbench initialization mode as expected. After applying the patch, pgbench -i -s 100 -j 10 runs successfully and shows a clear speedup.
> On my system the total runtime dropped to about 9.6 s, with client-side data generation around 3.3 s.
> I also checked correctness after the run — row counts for pgbench_accounts, pgbench_branches, and pgbench_tellers all match the expected values.
> Thanks for working on this, the improvement is very noticeable.
> Best regards,
> lakshmi
Thanks for having a look and trying it out!
FYI this is one of Tomas Vondra's patch ideas from his blog [1].
I have attached a new version which now includes docs, tests, a proposed commit message, and an attempt to fix the current CI failures (Windows).
[1] - https://vondra.me/posts/patch-idea-parallel-pgbench-i
--
Thanks,
Mircea Cadariu
Attachments
Thanks again for the updated patch.
I did some additional testing on 19devel with a larger scale factor.
For scale 100, parallel initialization with -j 10 shows a clear overall speedup and correct results, as mentioned earlier.
For scale 500, I observed that client-side data generation becomes significantly faster with parallel loading, but the total run time was slightly higher than the serial case on my system. This appears to be mainly due to a much longer vacuum phase after the parallel load.
So the parallel approach clearly improves data generation time, but the overall benefit may depend on scale and workload characteristics.
Regression tests still pass locally, and correctness checks look good.
Just sharing these observations in case they are useful for further evaluation.
Best regards,
lakshmi
I ran a few more tests on 19devel, focusing on the partitioned case to better understand the performance behavior.
For scale 500, the serial initialization on my system takes around 34.3 seconds. Using parallel initialization without partitions (-j 10) makes the client-side data generation noticeably faster, but the overall runtime ends up slightly higher because the vacuum phase becomes much longer.
However, when running with partitions (pgbench -i -s 500 --partitions=10 -j 10), the total runtime drops to about 21.9 seconds, and the vacuum cost is much smaller. I also verified that the row counts are correct in all cases, and regression tests still pass locally.
So it looks like the main benefit of parallel initialization shows up clearly in the partitioned setup, which matches the expectations discussed earlier. Just sharing these observations in case they are useful for the ongoing review.
Thanks again for the work on this patch.
Best regards,
Lakshmi
Dear Mircea,
Thanks for the proposal. I also feel the initialization wastes time.
Here are my initial comments.
01.
I found that pgbench raises a FATAL in case of -j > --partitions, is there a
specific reason?
If needed, we may choose the softer way, which caps nthreads at the number
of partitions. -c and -j already do something similar:
```
if (nthreads > nclients && !is_init_mode)
nthreads = nclients;
```
02.
Also, why is -j accepted when --partitions is not used?
03.
Can we move all the validation into main()? I found initPopulateTableParallel()
also has such a check.
04.
Copying seems to be divided into chunks of COPY_BATCH_SIZE rows. Is that really
essential for parallelizing the initialization? I feel it may speed up even the
serialized case and thus can be discussed independently.
05.
Per my understanding, each thread creates its own tables, and all of them are
then attached to the parent table. Is that right? I think it needs more code
changes, and I am not sure it is critical for making initialization faster.
So I suggest an incremental approach: the first patch only parallelizes the
data load, and the second patch implements the CREATE TABLE and ALTER TABLE
... ATTACH PARTITION part. You can then benchmark three patterns, master, 0001,
and 0001 + 0002, and compare the results. IIUC, this is the common approach to
reduce patch size and make patches more reviewable.
06.
Missing update for typedefs.list. WorkerTask and CopyTarget can be added there.
07.
Since there is a report like [1], you can benchmark more cases.
[1]: https://www.postgresql.org/message-id/CAEvyyTht69zjnosPjziW6dqNLqs-n6eKia2vof108zQp1QFX%3DQ%40mail.gmail.com
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Dear Lakshmi,
Thanks for the measurement!
> For scale 500, the serial initialization on my system takes around 34.3 seconds.
> Using parallel initialization without partitions (-j 10) makes the client-side
> data generation noticeably faster,But the overall runtime ends up slightly
> higher because the vacuum phase becomes much longer.
To confirm, do you know why the VACUUM needs more time than in the serial case?
Thank you for the question.
From what I observed, in the non-partitioned parallel case the data generation phase becomes much faster, but the VACUUM phase takes longer compared to the serial run.
My current understanding is that this may be related to multiple workers inserting into the same heap relation. That could affect page locality or increase the amount of freezing work required afterwards. In contrast, the partitioned case seems to benefit more clearly, likely because each worker operates on a separate partition and COPY FREEZE reduces the vacuum effort.
I have not yet done a deeper internal analysis, so this is based on the behavior I measured rather than detailed inspection. If needed, I can try to collect additional statistics to better understand the difference.
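One way to quantify the remaining freezing work (assuming the pg_visibility extension is available, which ships in contrib) is to count the all-frozen pages right after the load, before the explicit VACUUM runs. A low frozen-page count after a parallel non-partitioned load would support the explanation above:

```sql
-- Sketch: inspect the visibility map of pgbench_accounts after the
-- load. Pages not yet all-frozen represent work VACUUM still has to do.
CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT count(*) FILTER (WHERE all_frozen) AS frozen_pages,
       count(*) AS total_pages
FROM pg_visibility_map('pgbench_accounts');
```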
Please let me know if this reasoning aligns with your understanding.
Best regards,
Lakshmi