Discussion: Idea to enhance pgbench by more modes to generate data (multi-TXNs, UNNEST, COPY BINARY)
Idea to enhance pgbench by more modes to generate data (multi-TXNs, UNNEST, COPY BINARY)
From
Boris Mironov
Date:
Hello hackers,
For some of my specific hardware tests I needed to generate big databases well beyond RAM size. Hence I turned to the pgbench tool and its two default modes for client- and server-side data generation for TPC-B tests. When I use a "scale" factor in the range of a few thousand (e.g., 3000 - 5000), the data generation phase takes quite some time. I saw it as an opportunity to prove/disprove a few hypotheses:
- will INSERT mode work faster if we commit once every "scale" and turn the single INSERT into a "for" loop with commits for the 3 tables at the end of each loop
- will "INSERT .. SELECT FROM unnest" be faster than "INSERT .. SELECT FROM generate_series"
- will BINARY mode work faster than TEXT even though we send much more data
- and so on
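For context, the existing client-side "g" step streams tab-separated COPY TEXT rows for pgbench_accounts. A minimal Python sketch of that row stream follows; the column layout (aid, bid, abalance, filler) and the 100,000-rows-per-scale constant are taken from pgbench's standard schema, while the generator function itself is purely illustrative, not code from the patch:

```python
def accounts_rows(scale, rows_per_scale=100_000):
    """Yield COPY TEXT lines for pgbench_accounts at the given scale.

    Sketch only: pgbench's real generator is C code; this just shows
    the shape of the data the TEXT mode has to push over the wire.
    """
    for aid in range(1, scale * rows_per_scale + 1):
        bid = (aid - 1) // rows_per_scale + 1  # one branch per scale unit
        # abalance starts at 0; the filler column is sent empty in TEXT mode
        yield f"{aid}\t{bid}\t0\t\n"
```

At scale 1 this yields 100,000 lines, starting with `1\t1\t0\t\n`; every such row costs a text-to-binary conversion on the server, which is part of what the BINARY mode avoids.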
As a result of my experiments I produced a significant patch for the pgbench utility and thought that it might be of interest not just to me. Therefore I'm sending a draft version of it in diff format against the current development tree on GitHub. As of November 11, 2025 it merges cleanly with the main branch of the project on GitHub.
Spoiler alert: "COPY FROM BINARY" is significantly faster than the current "COPY FROM TEXT"
Would be happy to polish it if there is interest in such a change.
Cheers
Attachments
Re: Idea to enhance pgbench by more modes to generate data (multi-TXNs, UNNEST, COPY BINARY)
From
Ashutosh Bapat
Date:
Hi Boris,

On Wed, Nov 12, 2025 at 3:25 AM Boris Mironov <boris_mironov@outlook.com> wrote:
>
> Hello hackers,
>
> For some of my specific hardware tests I needed to generate big databases well beyond RAM size. Hence I turned to the pgbench tool and its two default modes for client- and server-side data generation for TPC-B tests. When I use a "scale" factor in the range of a few thousand (e.g., 3000 - 5000), the data generation phase takes quite some time. I saw it as an opportunity to prove/disprove a few hypotheses:
>
> will INSERT mode work faster if we commit once every "scale" and turn the single INSERT into a "for" loop with commits for the 3 tables at the end of each loop
> will "INSERT .. SELECT FROM unnest" be faster than "INSERT .. SELECT FROM generate_series"
> will BINARY mode work faster than TEXT even though we send much more data
> and so on
>
> As a result of my experiments I produced a significant patch for the pgbench utility and thought that it might be of interest not just to me. Therefore I'm sending a draft version of it in diff format against the current development tree on GitHub. As of November 11, 2025 it merges cleanly with the main branch of the project on GitHub.
>
> Spoiler alert: "COPY FROM BINARY" is significantly faster than the current "COPY FROM TEXT"
>
> Would be happy to polish it if there is interest in such a change.

Making pgbench data initialization faster at a higher scale is desirable, and the community might be willing to accept such a change. Running very large benchmarks is becoming common these days.

However, it's not clear what you are proposing and what the performance improvement is. Answering the following questions may help: Does your patch implement all the above methods? Do all of them provide a performance improvement? If each of them performs better under certain conditions, what are those conditions? If there is one method which performs better than all others, which is it, and why not implement just that method? What performance numbers are we looking at?

Can the methods which use batch commits also run those batches in parallel?

--
Best Wishes,
Ashutosh Bapat
Re: Idea to enhance pgbench by more modes to generate data (multi-TXNs, UNNEST, COPY BINARY)
From
Boris Mironov
Date:
Hi Ashutosh,
> If there is one method that is better than all others, community will
> be more willing to accept implementation of that one method than
> multiple implementations so as to reduce maintenance burden.
Ok then. I'll keep only the "COPY FROM STDIN BINARY" implementation out of the three.
Would you prefer to replace the original COPY FROM STDIN TEXT with this
code, or add it as a new "init-step" (e.g., with code "c")?
I have also noted that the current code doesn't prevent a pgbench parameter
like "--init-steps=dtgG". It allows running the data generation step twice.
Each of these "g" and "G" will present its own timing in the status line. Is this
an oversight or intentional?
> The code in the patch does not have enough comments. It's hard to
> understand the methods just from the code. Each of the generateData*
> functions could use a prologue explaining the data generation method
> it uses.
Adding comments is not a problem at all. So far, it was just "code for myself"
and I was checking whether there is any interest in the community to include it.
>> I'm sure that much more testing is required to run this code under different
>> conditions and hardware to get a better picture. So far it looks very promising.
> Sure.
Cheers,
Boris
Re: Idea to enhance pgbench by more modes to generate data (multi-TXNs, UNNEST, COPY BINARY)
From
Ashutosh Bapat
Date:
On Fri, Nov 14, 2025 at 8:51 PM Boris Mironov <boris_mironov@outlook.com> wrote:
>
> Hi Ashutosh,
>
> > If there is one method that is better than all others, community will
> > be more willing to accept implementation of that one method than
> > multiple implementations so as to reduce maintenance burden.
>
> Ok then. I'll keep only the "COPY FROM STDIN BINARY" implementation out of the three.
> Would you prefer to replace the original COPY FROM STDIN TEXT with this
> code, or add it as a new "init-step" (e.g., with code "c")?
>
TEXT copy may be useful for cross-platform client-side data generation. BINARY might be useful for same-platform client-side generation or server-side generation. Just a thought: use TEXT or BINARY automatically based on whether it's a cross-platform or same-platform setup.

> I also have noted that the current code doesn't prevent a pgbench parameter
> like "--init-steps=dtgG". It allows running the data generation step twice.
> Each of these "g" and "G" will present its own timing in the status line. Is this
> an oversight or intentional?
>
I would review the commit a386942bd29b0ef0c9df061392659880d22cdf43 and the discussion thread https://postgr.es/m/alpine.DEB.2.21.1904061826420.3678@lancre mentioned in the commit message to find that out. At first glance it looks like an oversight, but I haven't reviewed the commit and thread myself. That thread might reveal why generate_series() was used instead of BINARY COPY for server-side data generation. If it needs to change, it's better to start a separate thread and a separate patch for that discussion.

--
Best Wishes,
Ashutosh Bapat
Re: Idea to enhance pgbench by more modes to generate data (multi-TXNs, UNNEST, COPY BINARY)
From
Boris Mironov
Date:
Hi Ashutosh,
> TEXT copy may be useful for cross platform client side data
> generation. BINARY might be useful for same platform client side
> generation or server side generation. Just a thought, use TEXT or
> BINARY automatically based on where it's cross-platform or same
> platform setup.
It is true that the BINARY format is not as flexible as TEXT. Postgres expects
data on the wire to arrive in "network byte order". AFAIK only Solaris can
send its data without byte reordering. I support such an exception via
#ifdef __sparc__
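To make the byte-order point concrete, here is a minimal Python sketch of the COPY BINARY framing a client has to produce, assuming the standard PGCOPY header and plain int4 columns. struct's '!' modifier forces network (big-endian) byte order, which is exactly the reordering step a big-endian host such as SPARC can skip. The function name and the reduced three-column row layout (filler omitted) are illustrative, not code from the patch:

```python
import struct

# Fixed 11-byte COPY BINARY signature, followed by int32 flags and
# int32 header-extension length (both zero in the common case).
PGCOPY_HEADER = b"PGCOPY\n\xff\r\n\x00"

def copy_binary_stream(rows):
    """Frame (aid, bid, abalance) int4 rows as a COPY BINARY byte stream."""
    out = [PGCOPY_HEADER,
           struct.pack("!ii", 0, 0)]             # flags, extension length
    for aid, bid, abalance in rows:
        out.append(struct.pack("!h", 3))         # per-tuple field count
        for v in (aid, bid, abalance):
            out.append(struct.pack("!ii", 4, v)) # field length, int4 value
    out.append(struct.pack("!h", -1))            # end-of-data marker
    return b"".join(out)
```

Each int4 costs 8 bytes on the wire (length word plus value) instead of a variable-length decimal string, but the server can copy the value straight into the tuple with no parsing.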
I don't see an easy way for pgbench to decide which COPY mode to use,
TEXT or BINARY, except by specifically asking for one via a command line
parameter. This is why I left flag "g" for TEXT mode and added BINARY as "C"
(for "Copy", upper case meaning faster). I guess we can add alias "c" for the
old client-side generation as "copy", but a slower version of it.
While we are on the topic of client- vs server-side generation: IMHO these are
quite misleading terms. Both of them are executed by the RDBMS on the server side,
but the "server" one gets quite a short query (and is quite slow in execution), while
the "client" one involves quite a big network transfer (and is quite fast in execution). The
reason is the difference in data path, which needs to be reflected in the documentation.
On top of that, server-side generation thrashes the DB cache, while client-side works via a ring buffer
that doesn't allocate more than 1/8 of shared_buffers.
> I would review the commit a386942bd29b0ef0c9df061392659880d22cdf43 and
> the discussion thread
> https://postgr.es/m/alpine.DEB.2.21.1904061826420.3678@lancre
> mentioned in the commit message to find that out. At first glance it
> looks like an oversight, but I haven't reviewed the commit and thread
> myself. That thread might reveal why generate_series() was used
> instead of BINARY COPY for server side data generation. If it needs to
> change it's better to start a separate thread and separate patch for
> that discussion.
Thank you for this hint. I went through the whole thread; there they discuss
how to reflect certain behavior of init-steps, and nothing about COPY BINARY.
The major point of the generate_series() introduction is to send a short query to the DB
and not to worry about network performance. It is quite true that COPY
sends tons of data over the network, and that might be an issue for a slow network.
They also touched on the topic of "one huge transaction" for the whole generated
dataset versus a few smaller transactions.
Allow me to repost my benchmarks here (they were lost for pgsql-hackers
because I used Reply instead of Reply-To-All):
Tests:
Test | Binary | Init mode | Query and details
-----|----------|-----------|-------------------------------------------------------
1 | original | G | INSERT FROM generate_series in single huge transaction
2 | enhanced | G | INSERT FROM generate_series in single huge transaction
3 | enhanced | i | INSERT FROM generate_series in one transaction per scale
4 | enhanced | I | INSERT FROM unnest in one transaction per scale
5 | original | g | COPY FROM STDIN TEXT in single transaction
6 | enhanced | g | COPY FROM STDIN TEXT in single transaction
7 | enhanced | C | COPY FROM STDIN BINARY in single transaction
Test | Scale and seconds to complete data generation
| 1 | 2 | 10 | 20 | 100 | 200 | 1000 | 2000
-----|------|------|------|------|-------|-------|--------|--------
1 | 0.19 | 0.37 | 2.01 | 4.34 | 22.58 | 46.64 | 245.98 | 525.99
2 | 0.30 | 0.47 | 2.18 | 4.37 | 25.38 | 56.66 | 240.89 | 482.63
3 | 0.18 | 0.39 | 2.14 | 4.19 | 23.78 | 47.63 | 240.91 | 483.19
4 | 0.18 | 0.38 | 2.17 | 4.39 | 23.68 | 47.93 | 242.63 | 487.33
5 | 0.11 | 0.22 | 1.46 | 2.95 | 15.69 | 32.86 | 154.16 | 311.00
6 | 0.11 | 0.22 | 1.43 | 2.89 | 16.01 | 29.41 | 158.10 | 307.54
7 | 0.14 | 0.12 | 0.56 | 1.16 | 6.22 | 12.70 | 64.70 | 135.58
"Original" binary is pgbench v17.6.
"Enhanced" binary is pgbench 19-devel with proposed patch.
Another point of discussion in the earlier pgsql-hackers thread mentioned
above was that multiple transactions for the init-step do NOT bring any
benefit. My numbers show some increase in performance from simply
INSERT-ing data in a loop with one COMMIT per "scale" at lower scales.
At higher scales the benefit disappears. My guess here is the quite active
WAL archiver process.
COPY TEXT is 36% faster than INSERT with multiple transactions.
COPY BINARY is ~72% faster than INSERT with multiple transactions.
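As a sanity check on those percentages, the scale-2000 column of the table above reproduces them, taking test 3 (INSERT with one transaction per scale) as the baseline; this small Python snippet is just arithmetic on the published numbers:

```python
# Timings from the scale-2000 column of the benchmark table.
multi_txn_insert = 483.19  # test 3: INSERT, one transaction per scale
copy_text = 307.54         # test 6: COPY FROM STDIN TEXT
copy_binary = 135.58       # test 7: COPY FROM STDIN BINARY

def speedup_pct(baseline, candidate):
    """Percentage reduction in elapsed time relative to the baseline."""
    return round(100 * (baseline - candidate) / baseline)

print(speedup_pct(multi_txn_insert, copy_text))    # -> 36
print(speedup_pct(multi_txn_insert, copy_binary))  # -> 72
```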
At this point I'm torn between keeping the old modes and logic for
backward compatibility and introducing new modes for INSERT & COPY,
versus simply replacing the old, less efficient logic with the new one.
Sorry for the quite long response.
Best regards,
Boris
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Sent: November 16, 2025 11:58 PM
To: Boris Mironov <boris_mironov@outlook.com>
Cc: pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>
Subject: Re: Idea to enhance pgbench by more modes to generate data (multi-TXNs, UNNEST, COPY BINARY)
On Fri, Nov 14, 2025 at 8:51 PM Boris Mironov <boris_mironov@outlook.com> wrote:
>
> Hi Ashutosh,
>
> > If there is one method that is better than all others, community will
> > be more willing to accept implementation of that one method than
> > multiple implementations so as to reduce maintenance burden.
>
> Ok then. I'll leave "COPY FROM STDIN BINARY" implementation out of 3 only.
> Would you prefer to replace original COPY FROM STDIN TEXT by this
> code or add it as new "init-step" (e.g., with code "c")?
>
TEXT copy may be useful for cross platform client side data
generation. BINARY might be useful for same platform client side
generation or server side generation. Just a thought, use TEXT or
BINARY automatically based on where it's cross-platform or same
platform setup.
> I also have noted that current code doesn't prevent pgbench parameter
> like "--init-steps=dtgG". It allows to run data generation step twice.
> Each of these "g" and "G" will present own timing in status line. Is this
> an oversight or intentional?
>
I would review the commit a386942bd29b0ef0c9df061392659880d22cdf43 and
the discussion thread
https://postgr.es/m/alpine.DEB.2.21.1904061826420.3678@lancre
mentioned in the commit message to find that out. At first glance it
looks like an oversight, but I haven't reviewed the commit and thread
myself. That thread might reveal why generate_series() was used
instead of BINARY COPY for server side data generation. If it needs to
change it's better to start a separate thread and separate patch for
that discussion.
--
Best Wishes,
Ashutosh Bapat