Discussion: Re: Suggestion to add --continue-client-on-abort option to pgbench
Hi hackers,
I would like to suggest adding a new option to pgbench that enables a client
to continue processing transactions even if some errors occur during a
transaction.
Currently, a client stops sending requests when its transaction is aborted
for reasons other than serialization failures or deadlocks. I think in some
cases, especially when using custom scripts, the client should be able to
roll back the failed transaction and start a new one.
For example, my custom script (insert_to_unique_column.sql) is as follows:
```
CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
INSERT INTO test (col2) VALUES (random(0, 50000));
```
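(For reference, an illustrative reproduction, not from the original post: the failure mode here is the ordinary duplicate-key error, shown below with the default constraint name.)
```sql
-- Hypothetical psql session; the second INSERT collides with the first:
INSERT INTO test (col2) VALUES (42);
INSERT INTO test (col2) VALUES (42);
-- ERROR:  duplicate key value violates unique constraint "test_col2_key"
-- DETAIL:  Key (col2)=(42) already exists.
```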
Assume we need to continuously apply load to the server using 5 clients
for a certain period of time. However, a client sometimes stops when its
transaction in my custom script is aborted due to a unique constraint
violation. As a result, the load on the server is lower than expected,
which is the problem I want to address.
The proposed new option solves this problem. When
--continue-client-on-abort is set, the client rolls back the failed
transaction and starts a new one. This allows all 5 clients to
continuously apply load to the server, even if some transactions fail.
```
% bin/pgbench -d postgres -f ../insert_to_unique_column.sql -T 10
--failures-detailed --continue-client-on-error
transaction type: ../custom_script_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 10 s
number of transactions actually processed: 33552
number of failed transactions: 21901 (39.495%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other failures: 21901 (39.495%)
latency average = 0.180 ms (including failures)
initial connection time = 2.857 ms
tps = 3356.092385 (without initial connection time)
```
I have attached the patch. I would appreciate your feedback.
Best regards,
Rintaro Ikeda
NTT DATA Corporation Japan
Hi Rintaro,
Thanks for the patch and explanation. I understand your goal is to ensure that pgbench
clients continue running even when some transactions fail due to application-level errors (e.g., constraint violations), especially when running custom scripts.
However, I wonder if the intended behavior can't already be achieved using standard SQL constructs — specifically ON CONFLICT or careful transaction structure. For example, your sample script:
CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
INSERT INTO test (col2) VALUES (random(0, 50000));
can be rewritten as:
\set val random(0, 50000)
INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;
This avoids transaction aborts entirely in the presence of uniqueness violations and ensures the client continues to issue load without interruption. In many real-world benchmarking scenarios, this is the preferred and simplest approach.
So from that angle, could you elaborate on specific cases where this SQL-level workaround wouldn't be sufficient? Are there error types you intend to handle that cannot be gracefully avoided or recovered from using SQL constructs like ON CONFLICT, or SAVEPOINT/ROLLBACK TO?
Best regards,
Stepan Neretin
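(For readers unfamiliar with the pattern named above, a minimal plain-SQL sketch of SAVEPOINT/ROLLBACK TO, assuming the test table from the first message; values are illustrative. Note this is session-level SQL as an application would issue it — pgbench has no conditional error handling with which to drive the ROLLBACK TO, which is exactly the limitation the reply below picks up.)
```sql
BEGIN;
SAVEPOINT before_insert;
INSERT INTO test (col2) VALUES (42);   -- may fail with a unique violation
-- An application would issue the next line only after an error; it returns
-- the transaction to a usable state instead of aborting it outright:
ROLLBACK TO SAVEPOINT before_insert;
COMMIT;
```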
> On Sat, May 10, 2025 at 8:45 PM ikedarintarof <ikedarintarof@oss.nttdata.com> wrote:
>>
>> Hi hackers,
>>
>> I would like to suggest adding a new option to pgbench, which enables
>> the client to continue processing transactions even if some errors occur
>> during a transaction.
>> Currently, a client stops sending requests when its transaction is
>> aborted due to reasons other than serialization failures or deadlocks. I
>> think in some cases, especially when using custom scripts, the client
>> should be able to rollback the failed transaction and start a new one.
>>
>> For example, my custom script (insert_to_unique_column.sql) follows:
>> ```
>> CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
>> INSERT INTO test (col2) VALUES (random(0, 50000));
>> ```
>> Assume we need to continuously apply load to the server using 5 clients
>> for a certain period of time. However, a client sometimes stops when its
>> transaction in my custom script is aborted due to a check constraint
>> violation. As a result, the load on the server is lower than expected,
>> which is the problem I want to address.
>>
>> The proposed new option solves this problem. When
>> --continue-client-on-abort is set to true, the client rolls back the
>> failed transaction and starts a new one. This allows all 5 clients to
>> continuously apply load to the server, even if some transactions fail.
+1. I've had similar cases before too, where I'd wanted pgbench to
continue creating load on the server even if a transaction failed
server-side for any reason. Sometimes, I'd even want that type of
load.
On Sat, 10 May 2025 at 17:02, Stepan Neretin <slpmcf@gmail.com> wrote:
> INSERT INTO test (col2) VALUES (random(0, 50000));
>
> can be rewritten as:
>
> \set val random(0, 50000)
> INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;
That won't test the same execution paths, so an option to explicitly
rollback or ignore failed transactions (rather than stopping the
benchmark) would be a nice feature.
With e.g. ON CONFLICT DO NOTHING you'll have a much higher workload if
there are many conflicting entries, as that triggers and catches per-row
errors rather than per-statement ones. E.g. an INSERT INTO ... SELECT
feeding multiple rows could conflict on several of them, but will fail at
the first conflict. DO NOTHING would cause full execution of the SELECT
statement, which has an inherently different performance profile.
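(A concrete, hypothetical illustration of that difference, using the test table from upthread; row counts are arbitrary:)
```sql
-- Plain multi-row INSERT: the whole statement errors out at the first
-- duplicate, and no rows from the SELECT are kept:
INSERT INTO test (col2) SELECT g FROM generate_series(1, 1000) AS g;

-- With DO NOTHING the SELECT runs to completion; conflicts are detected
-- and skipped row by row, and every non-conflicting row is inserted:
INSERT INTO test (col2) SELECT g FROM generate_series(1, 1000) AS g
ON CONFLICT DO NOTHING;
```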
> This avoids transaction aborts entirely in the presence of uniqueness violations and ensures the client continues to issue load without interruption. In many real-world benchmarking scenarios, this is the preferred and simplest approach.
>
> So from that angle, could you elaborate on specific cases where this SQL-level workaround wouldn't be sufficient? Are there error types you intend to handle that cannot be gracefully avoided or recovered from using SQL constructs like ON CONFLICT, or SAVEPOINT/ROLLBACK TO?
The issue isn't necessarily whether you can construct SQL scripts that
don't raise such errors (indeed, it's possible to do so for nearly any
command; you can run pl/pgsql procedures or DO blocks which catch and
ignore errors), but rather whether we can make pgbench function in a
way that can keep load on the server even when it notices an error.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
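(To make the pl/pgsql route mentioned above concrete, a sketch under the thread's schema; ins_ignore is a hypothetical helper, not part of any patch in this thread.)
```sql
-- Absorb duplicate-key errors server-side, so the pgbench client never
-- sees a failed statement:
CREATE OR REPLACE FUNCTION ins_ignore(v int) RETURNS void AS $$
BEGIN
    INSERT INTO test (col2) VALUES (v);
EXCEPTION WHEN unique_violation THEN
    NULL;  -- swallow the error; the surrounding transaction stays healthy
END;
$$ LANGUAGE plpgsql;
```
A pgbench script would then run `SELECT ins_ignore(:val);` — which illustrates Matthias's point: the error can be hidden in SQL, but the server-side work and the error paths exercised are no longer the same.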
Hi Matthias,
Thanks for your detailed explanation — it really helped clarify the usefulness of the patch. I agree that the feature is indeed valuable, and it's great to see it being pushed forward.
Regarding the patch code, I noticed that there are duplicate case entries in the command-line option handling (specifically for case 18 or case ESTATUS_OTHER_SQL_ERROR, the continue-client-on-error option). These duplicated cases can be merged to simplify the logic and reduce redundancy.
Best regards,
Stepan Neretin
Re: Suggestion to add --continue-client-on-abort option to pgbench
On Tue, May 13, 2025 at 9:20 AM <Rintaro.Ikeda@nttdata.com> wrote:
> I also appreciate you for pointing out my mistakes in the previous version of the patch. I fixed the duplicated lines. I’ve attached the updated patch.
>
This is a useful feature, so +1 from my side. Here are some initial
comments on the patch while having a quick look.
1. You need to update the stats for this new counter in the
"accumStats()" function.
2. IMHO, "continue-on-error" is more user-friendly than
"continue-client-on-error".
3. There are a lot of whitespace errors, so those can be fixed. You
can just try to apply using git am, and it will report those
whitespace warnings. And for fixing, you can just use
"--whitespace=fix" along with git am.
Here's a diff adding the missing usage information for this option and, as Dilip mentioned, updating the new counter in the "accumStats()" function.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index baaf1379be2..20d456bc4b9 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -959,6 +959,8 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-client-on-error\n"
+ " Continue and retry transactions that failed due to errors other than serialization or deadlocks.\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1522,6 +1524,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Ikeda-san,

Thanks for starting the new thread! I had never known about this issue before I heard it at PGConf.dev.

A few comments:

1. This parameter seems to be a type of benchmark option. So should we set benchmarking_option_set as well?

2. Not sure, but exit-on-abort seems to be a similar option. What if both are specified? Is it allowed?

3. Can you consider a test case for the new parameter?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Dear Kuroda-san, hackers,

On 2025/06/04 21:57, Hayato Kuroda (Fujitsu) wrote:
> Dear Ikeda-san,
>
> Thanks for starting the new thread! I had never known about this issue
> before I heard it at PGConf.dev.
>
> A few comments:
>
> 1. This parameter seems to be a type of benchmark option. So should we set
> benchmarking_option_set as well?
>
> 2. Not sure, but exit-on-abort seems to be a similar option. What if both
> are specified? Is it allowed?
>
> 3. Can you consider a test case for the new parameter?

Thank you for your valuable comments!

1. I should've also set benchmarking_option_set. I've modified it accordingly.

2. The exit-on-abort option and the continue-on-error option are mutually exclusive. Therefore, I've updated the patch to throw a FATAL error when the two options are set simultaneously. A corresponding explanation was also added. (I'm wondering whether the parameter should be named continue-on-abort so that users understand the two options are mutually exclusive.)

3. I've added the test.

Additionally, I modified the patch so that st->state does not transition to CSTATE_RETRY when a transaction fails and the continue-on-error option is enabled. In the previous patch, we retried the failed transaction up to max-tries times, which is unnecessary for our purpose: clients do not exit when their transactions fail.

I've attached the updated patch. v3-0001-Add-continue-on-error-option-to-pgbench.patch is identical to v4-0001-Add-continue-on-error-option-to-pgbench.patch. The v4-0002 patch is the diff from the previous patch.

Best regards,
Rintaro Ikeda
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Ikeda-san,

Thanks for updating the patch!

> 1. I should've also set benchmarking_option_set. I've modified it accordingly.

Confirmed it has been fixed. Thanks.

> 2. The exit-on-abort option and the continue-on-error option are mutually exclusive.
> Therefore, I've updated the patch to throw a FATAL error when the two options are
> set simultaneously. A corresponding explanation was also added.
> (I'm wondering whether the parameter should be named continue-on-abort so that
> users understand the two options are mutually exclusive.)

Makes sense, +1.

Here are new comments.

01. build failure

According to the cfbot [1], the documentation cannot be built. IIUC a </para> seems to be missing here:

```
+       <para>
+        Note that this option can not be used together with
+        <option>--exit-on-abort</option>.
+      </listitem>
+     </varlistentry>
```

02. patch separation

How about separating the patch series like:

0001 - contains option handling and retry part, and documentation
0002 - contains accumulation/reporting part
0003 - contains tests

I hope the above style is more helpful for reviewers.

03. documentation

```
+        Note that this option can not be used together with
+        <option>--exit-on-abort</option>.
```

I feel we should add a similar description to the `exit-on-abort` part.

04. documentation

```
+        Client rolls back the failed transaction and starts a new one when its
+        transaction fails due to the reason other than the deadlock and
+        serialization failure. This allows all clients specified with -c option
+        to continuously apply load to the server, even if some transactions fail.
```

I feel the description contains a somewhat redundant part and misses the default behavior. How about:

```
<para>
 Clients survive when their transactions are aborted, and they continue
 their run. Without the option, clients exit when transactions they run
 are aborted.
</para>
<para>
 Note that serialization failures or deadlock failures do not abort the
 client, so they are not affected by this option.
 See <xref linkend="failures-and-retries"/> for more information.
</para>
```

05. StatsData

```
+ * When continue-on-error option is specified,
+ * failed (the number of failed transactions) =
+ *   'other_sql_failures' (they got a error when continue-on-error option
+ *   was specified).
```

Let me confirm one point; can serialization_failures and deadlock_failures be counted when continue-on-error is true? If so, the comment seems incorrect to me. The formula would then be 'serialization_failures' + 'deadlock_failures' + 'other_sql_failures'.

06. StatsData

Another point; can other_sql_failures be counted when continue-on-error is NOT specified? I feel it should be...

07. usage()

The added line is too long. According to program_help_ok(), the output by --help should be less than 80 columns.

08. Please run pgindent/pgperltidy; I got some diffs.

[1]: https://cirrus-ci.com/task/5210061275922432

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, 9 Jun 2025 09:34:03 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

> > 2. The exit-on-abort option and the continue-on-error option are mutually exclusive.
> > Therefore, I've updated the patch to throw a FATAL error when the two options are
> > set simultaneously. A corresponding explanation was also added.

I don't think that's right, since "abort" and "error" are different concepts in pgbench. (Here, "abort" refers to the termination of a client, not a transaction abort.)

The --exit-on-abort option forces pgbench to exit immediately when any client is aborted due to some error. When the --continue-on-error option is not set, SQL errors other than deadlock or serialization errors cause a client to be aborted. On the other hand, when the option is set, clients are not aborted by any SQL errors; instead they continue to run after them. However, clients can still be aborted for other reasons, such as connection failures or meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option remains useful even when --continue-on-error is enabled.

> > (I'm wondering whether the parameter should be named continue-on-abort so that
> > users understand the two options are mutually exclusive.)

For the same reason as above, I believe --continue-on-error is a more accurate description of the option's behavior.

> 02. patch separation
> How about separating the patch series like:
>
> 0001 - contains option handling and retry part, and documentation
> 0002 - contains accumulation/reporting part
> 0003 - contains tests
>
> I hope the above style is more helpful for reviewers.

I'm not sure whether it's necessary to split the patch, as the change doesn't seem very complex. However, the current separation appears inconsistent. For example, patch 0001 modifies canRetryError(), but patch 0002 reverts that change, and so on.

> 04. documentation
> [suggested rewording of the --continue-on-error description, quoted above]

I think we can make it clearer as follows:

    Allows clients to continue their run even if an SQL statement fails due to
    errors other than serialization or deadlock. Without this option, the
    client is aborted after such errors.

    Note that serialization and deadlock failures never cause the client to be
    aborted, so they are not affected by this option.
    See <xref linkend="failures-and-retries"/> for more information.

That said, a review by a native English speaker would still be appreciated.

Also, we would need to update several parts of the documentation. For example, the "Failures and Serialization/Deadlock Retries" section should be revised to describe the behavior change. In addition, we should update the explanations of the output result examples and logging, the description of the --failures-detailed option, and so on. If transactions are not retried after SQL errors other than serialization or deadlock, this should also be explicitly documented.

> 05. StatsData
> Let me confirm one point; can serialization_failures and deadlock_failures be
> counted when continue-on-error is true? If so, the comment seems incorrect to me.
> The formula would then be 'serialization_failures' + 'deadlock_failures' +
> 'other_sql_failures'.

+1

> 06. StatsData
> Another point; can other_sql_failures be counted when continue-on-error is NOT
> specified? I feel it should be...

We could do that. However, if an SQL error other than a serialization or deadlock error occurs when --continue-on-error is not set, pgbench will be aborted midway and the printed results will be incomplete. Therefore, this might not make much sense.

> 07. usage()
> The added line is too long. According to program_help_ok(), the output by --help
> should be less than 80 columns.

+1

Here are additional comments from me.

@@ -4548,6 +4570,8 @@ getResultString(bool skipped, EStatus estatus)
 			return "serialization";
 		case ESTATUS_DEADLOCK_ERROR:
 			return "deadlock";
+		case ESTATUS_OTHER_SQL_ERROR:
+			return "error (except serialization/deadlock)";

Strings returned by getResultString() are printed in the "time" field of the log when both the -l and --failures-detailed options are set. Therefore, they should be single words that do not contain any space characters. I wonder if something like "other" or "other_sql_error" would be appropriate.

@@ -4099,6 +4119,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * can retry the error.
 				 */
 				st->state = timer_exceeded ? CSTATE_FINISHED :
+					continue_on_error ? CSTATE_FAILURE :
 					doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;
 			}
 			else

This fix is not necessary because doRetry() (and canRetryError(), which is called within it) will return false when continue_on_error is set (after applying patch 0002).

			case PGRES_NONFATAL_ERROR:
			case PGRES_FATAL_ERROR:
				st->estatus = getSQLErrorStatus(PQresultErrorField(res,
																   PG_DIAG_SQLSTATE));
				if (canRetryError(st->estatus))
				{
					if (verbose_errors)
						commandError(st, PQerrorMessage(st->con));
					goto error;
				}
				/* fall through */

			default:
				/* anything else is unexpected */
				pg_log_error("client %d script %d aborted in command %d query %d: %s",
							 st->id, st->use_file, st->command, qrynum,
							 PQerrorMessage(st->con));
				goto error;
		}

When an SQL error other than a serialization or deadlock error occurs, an error message is output via pg_log_error in this code path. However, I think this should be reported only when verbose_errors is set, similar to how serialization and deadlock errors are handled when --continue-on-error is enabled.

Best regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Nagata-san,

> I don't think that's right, since "abort" and "error" are different concepts in pgbench.
> (Here, "abort" refers to the termination of a client, not a transaction abort.)
>
> The --exit-on-abort option forces pgbench to exit immediately when any client is
> aborted due to some error. [...] In these cases, the --exit-on-abort option remains
> useful even when --continue-on-error is enabled.

To clarify: another approach is to allow the --continue-on-error option to continue running even when clients meet such errors. Which one is better?

> I'm not sure whether it's necessary to split the patch, as the change doesn't seem
> very complex. However, the current separation appears inconsistent. For example,
> patch 0001 modifies canRetryError(), but patch 0002 reverts that change, and so on.

Either way is fine for me if it is changed from the current method.

> I think we can make it clearer as follows: [...]

I do not have confidence in my English; a native speaker is needed...

> > 07. usage()
> > The added line is too long. According to program_help_ok(), the output by --help
> > should be less than 80 columns.
>
> +1

FYI - I posted a patch which adds the test [1]. You can apply it and confirm what the function says.

[1]: https://www.postgresql.org/message-id/OSCPR01MB1496610451F5896375B2562E6F56BA%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Tue, 17 Jun 2025 03:47:00 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

> Dear Nagata-san,
>
> > The --exit-on-abort option forces pgbench to exit immediately when any client is
> > aborted due to some error. [...] In these cases, the --exit-on-abort option
> > remains useful even when --continue-on-error is enabled.
>
> To clarify: another approach is to allow the --continue-on-error option to continue
> running even when clients meet such errors. Which one is better?

It might be worth discussing which types of errors this option should allow pgbench to continue after. On my understanding, the current patch's goal is to allow only SQL-level errors like constraint violations. It seems good because this could simulate the behaviour of applications that ignore or retry such errors (although they are not retried in the current patch). Perhaps it makes sense to allow continuing after some network errors, because it would enable benchmarks using a cluster system such as a cloud service that could report a temporary error during a failover.

It might be worth discussing which types of errors this option should allow pgbench to continue after.

As I understand it, the current patch aims to allow continuation only after SQL-level errors, such as constraint violations. That seems reasonable, as it can simulate the behavior of applications that ignore or retry such errors (even though retries are not implemented in the current patch).

Perhaps it also makes sense to allow continuation after certain network errors, as this would enable benchmarking with cluster systems or cloud services, which might report temporary errors during a failover. We would need additional work to properly detect and handle network errors, though.

However, I'm not sure it's reasonable to allow continuation after other types of errors, such as misuse of meta-commands or unexpected errors during their execution, since these wouldn't simulate any real application behavior and would more likely indicate a failure in the benchmarking process itself.

Best regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Tue, 17 Jun 2025 16:28:28 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

> It might be worth discussing which types of errors this option should allow pgbench
> to continue after. On my understanding, the current patch's goal is to allow only
> SQL-level errors like constraint violations. [...] Perhaps it makes sense to allow
> continuing after some network errors, because it would enable benchmarks using a
> cluster system such as a cloud service that could report a temporary error during
> a failover.

I apologize for accidentally leaving the draft paragraph just above in my previous post. Please ignore it.

> It might be worth discussing which types of errors this option should allow pgbench
> to continue after.
>
> As I understand it, the current patch aims to allow continuation only after SQL-level
> errors, such as constraint violations. That seems reasonable, as it can simulate the
> behavior of applications that ignore or retry such errors (even though retries are not
> implemented in the current patch).
>
> Perhaps it also makes sense to allow continuation after certain network errors, as this
> would enable benchmarking with cluster systems or cloud services, which might report
> temporary errors during a failover. We would need additional work to properly detect
> and handle network errors, though.
>
> However, I'm not sure it's reasonable to allow continuation after other types of errors,
> such as misuse of meta-commands or unexpected errors during their execution, since these
> wouldn't simulate any real application behavior and would more likely indicate a failure
> in the benchmarking process itself.

--
Yugo Nagata <nagata@sraoss.co.jp>
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Nagata-san,

> As I understand it, the current patch aims to allow continuation only after SQL-level
> errors, such as constraint violations. That seems reasonable, as it can simulate the
> behavior of applications that ignore or retry such errors (even though retries are not
> implemented in the current patch).

Yes, no one has objections to retrying in this case. This is the main part of the proposal.

> However, I'm not sure it's reasonable to allow continuation after other types of errors,
> such as misuse of meta-commands or unexpected errors during their execution, since these
> wouldn't simulate any real application behavior and would more likely indicate a failure
> in the benchmarking process itself.

I have a concern about the \gset meta-command. According to the docs and source code, \gset assumes that the executed command surely returns one tuple:

```
if (meta == META_GSET && ntuples != 1)
{
    /* under \gset, report the error */
    pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
                 st->id, st->use_file, st->command, qrynum, PQntuples(res));
    st->estatus = ESTATUS_META_COMMAND_ERROR;
    goto error;
}
```

But sometimes the SQL may not be able to return tuples, or may return multiple ones, due to concurrent transactions. I feel retrying the transaction is very useful in this case.

Anyway, we must confirm the opinion of the proposer.

[1]: https://github.com/ryogrid/tpcc_like_with_pgbench

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, 26 Jun 2025 05:45:12 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

> Dear Nagata-san,
>
> > As I understand it, the current patch aims to allow continuation only after
> > SQL-level errors, such as constraint violations. [...]
>
> Yes, no one has objections to retrying in this case. This is the main part of
> the proposal.

As I understand it, the proposed --continue-on-error option does not retry the transaction in any case; it simply gives up on the transaction. That is, when an SQL-level error occurs, the transaction is reported as "failed" rather than "retried", and the random state is discarded.

> I have a concern about the \gset meta-command. According to the docs and source
> code, \gset assumes that the executed command surely returns one tuple. [...]
>
> But sometimes the SQL may not be able to return tuples, or may return multiple
> ones, due to concurrent transactions. I feel retrying the transaction is very
> useful in this case.

You can use the \aset command instead to avoid the pgbench error. If the query doesn't return any row, a subsequent SQL command trying to use the variable will fail, but this would be ignored without terminating the benchmark when the --continue-on-error option is enabled.

> Anyway, we must confirm the opinion of the proposer.

+1

Best regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
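(A minimal sketch of the \gset vs. \aset behavior described above, assuming the test table from upthread; the queries are illustrative, and the note about the unset variable follows the explanation just given:)
```sql
\set val random(0, 50000)
-- \gset requires exactly one returned row; zero or many rows raise a
-- meta-command error, which aborts the client even with --continue-on-error:
SELECT col1 FROM test WHERE col2 = :val \gset
-- \aset simply sets no variables when zero rows come back; per the note
-- above, a later use of the unset :col1 then fails as an ordinary error
-- rather than terminating the benchmark when --continue-on-error is enabled:
SELECT col1 FROM test WHERE col2 = :val \aset
UPDATE test SET col2 = col2 WHERE col1 = :col1;
```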
Hi,

Thank you very much for your valuable comments and kind advice. I'm currently working on revising the previous patch based on the feedback received. I would like to share my thoughts regarding the conditions under which the --continue-on-error option should initiate a new transaction or a new connection.

In my opinion, when the --continue-on-error option is enabled, pgbench clients do not need to start new transactions after network errors or other errors except for SQL-level errors. Network errors are relatively rare, except in failover scenarios. Outside of failover, any network issues should be resolved rather than worked around. In the context of failover, the key metric is not TPS but system downtime. While one might infer the timing of a failover by using the --progress option, you can easily determine the downtime by executing a simple SQL query such as `psql -c 'SELECT 1'` every second.

On 2025/06/26 18:47, Yugo Nagata wrote:
> As I understand it, the proposed --continue-on-error option does not retry the
> transaction in any case; it simply gives up on the transaction. That is, when an
> SQL-level error occurs, the transaction is reported as "failed" rather than
> "retried", and the random state is discarded.

Retrying the failed transaction is not necessary when the transaction failed due to SQL-level errors. Unlike real-world applications, pgbench does not need to complete a specific transaction successfully. In the case of unique constraint violations, retrying the same transaction will likely result in the same error again.

I want to hear your thoughts on this.

Best regards,
Rintaro Ikeda
On Fri, 27 Jun 2025 14:06:24 +0900
ikedarintarof <ikedarintarof@oss.nttdata.com> wrote:

> In my opinion, when the --continue-on-error option is enabled, pgbench clients
> do not need to start new transactions after network errors or other errors
> except for SQL-level errors.

+1

I agree that --continue-on-error prevents pgbench from terminating only when SQL-level errors occur, and does not change the behavior in the case of other types of errors, including network errors.

> Retrying the failed transaction is not necessary when the transaction failed
> due to SQL-level errors. Unlike real-world applications, pgbench does not need
> to complete a specific transaction successfully. In the case of unique
> constraint violations, retrying the same transaction will likely result in the
> same error again.

Agreed.

Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Nagata-san, Ikeda-san,

> > In my opinion, when the --continue-on-error option is enabled, pgbench clients
> > do not need to start new transactions after network errors or other errors
> > except for SQL-level errors.
>
> +1
>
> I agree that --continue-on-error prevents pgbench from terminating only when
> SQL-level errors occur, and does not change the behavior in the case of other
> types of errors, including network errors.

OK, so let's do it like that. BTW, initially we were discussing the combination of --continue-on-error and --exit-on-abort. What is the conclusion? I feel Nagata-san's point [1] is valid in this approach.

> > > As I understand it, the proposed --continue-on-error option does not retry the
> > > transaction in any case; it simply gives up on the transaction. That is, when
> > > an SQL-level error occurs, the transaction is reported as "failed" rather than
> > > "retried", and the random state is discarded.
> >
> > Retrying the failed transaction is not necessary when the transaction failed
> > due to SQL-level errors. Unlike real-world applications, pgbench does not need
> > to complete a specific transaction successfully. In the case of unique
> > constraint violations, retrying the same transaction will likely result in the
> > same error again.

I intended here that clients could throw away the failed transaction and start a new one again in that case. I hope we are on the same page...

[1]: https://www.postgresql.org/message-id/20250614002453.5c72f2ec80864d840150a642%40sraoss.co.jp

Best regards,
Hayato Kuroda
FUJITSU LIMITED
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Ikeda-san, Nagata-san,

Thanks for updating the patch!

> > Could I confirm what you mean by "start new one"?
> >
> > In the current pgbench, when a query raises an error (a deadlock or
> > serialization failure), it can be retried using the same random state.
> > This typically means the query will be retried with the same parameter values.
> >
> > On the other hand, when the query ultimately fails (possibly after some retries),
> > the transaction is marked as a "failure", and the next transaction starts with a
> > new random state (i.e., with new parameter values).
> >
> > Therefore, if a query fails due to a unique constraint violation and is retried
> > with the same parameters, it will keep failing on each retry.
>
> Thank you for your explanation. I understand it as you described. I've also
> attached a schematic diagram of the state machine. I hope it will help clarify
> the behavior of pgbench. Red arrows represent the transition of state when an
> SQL command fails and the --continue-on-error option is specified.

Thanks for the diagram, it's quite helpful. Let me share my understanding and opinion.

The terminology "retry" is being used for the transition CSTATE_ERROR->CSTATE_RETRY, and there the random state is restored to what it was at the beginning:

```
/*
 * Reset the random state as they were at the beginning of the
 * transaction.
 */
st->cs_func_rs = st->random_state;
```

In the --continue-on-error case, the transition CSTATE_WAIT_RESULT->CSTATE_ERROR can happen even when the reason for the failure is not serialization or deadlock. Ultimately the path will reach ...->CSTATE_END_TX->CSTATE_CHOOSE_SCRIPT, the beginning of the state machine. cs_func_rs is not overwritten along that route, so a different random value will be generated, or even another script may be chosen. Is that correct?

And I feel this behavior is OK. The most likely failure here is a unique constraint violation. Clients should roll the dice again; otherwise they would face the same error again.

Below are my comments on the latest patch.

01.
```
$ git am ../patches/pgbench/v5-0001-Add-continue-on-error-option-to-pgbench.patch
Applying: When the option is set, client rolls back the failed transaction and...
.git/rebase-apply/patch:65: trailing whitespace.
          <literal>serialization</literal>, <literal>deadlock</literal>, or
.git/rebase-apply/patch:139: trailing whitespace.
         <option>--max-tries</option> option is not equal to 1 and
warning: 2 lines add whitespace errors.
```

I got warnings when I applied the patch. Please fix them.

02.
```
+ *   'serialization_failures' + 'deadlock_failures' +
+ *   'other_sql_failures' (they got a error when continue-on-error option
```
The first line has a tab, but it should be a normal blank.

03.
```
+ else if (continue_on_error | canRetryError(st->estatus))
```

I feel "|" should be "||".

04.
```
<term><replaceable>retries</replaceable></term>
<listitem>
 <para>
  number of retries after serialization or deadlock errors
  (zero unless <option>--max-tries</option> is not equal to one)
 </para>
</listitem>
```

To confirm: --continue-on-error won't be counted here because it is not a "retry", in other words, it does not reach CSTATE_RETRY, right?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi, Thank you for the kind comments. I've updated the previous patch. Below is a summary of the changes: 1. The code path and documentation have been corrected based on your feedback. 2. The following message is now suppressed by default. Instead, an error message is added when a client aborts during SQL execution. (v6-0003-Suppress-xxx.patch) ``` if (verbose_errors) pg_log_error("client %d script %d aborted in command %d query %d: %s", st->id, st->use_file, st->command, qrynum, PQerrorMessage(st->con)); ``` On 2025/07/04 22:01, Hayato Kuroda (Fujitsu) wrote: >>> Could I confirm what you mean by "start new one"? >>> >>> In the current pgbench, when a query raises an error (a deadlock or >>> serialization failure), it can be retried using the same random state. >>> This typically means the query will be retried with the same parameter values. >>> >>> On the other hand, when the query ultimately fails (possibly after some retries), >>> the transaction is marked as a "failure", and the next transaction starts with a >>> new random state (i.e., with new parameter values). >>> >>> Therefore, if a query fails due to a unique constraint violation and is retried >>> with the same parameters, it will keep failing on each retry. >> >> Thank you for your explanation. I understand it as you described. I've also >> attached a schematic diagram of the state machine. I hope it will help clarify >> the behavior of pgbench. Red arrows represent the transition of state when SQL >> command fails and --continue-on-error option is specified. > > Thanks for the diagram, it's quite helpful. Let me share my understanding and opinion. > > The terminology "retry" is being used for the transition CSTATE_ERROR->CSTATE_RETRY, > and here the random state would be restored to be the begining: > > ``` > /* > * Reset the random state as they were at the beginning of the > * transaction. > */ > st->cs_func_rs = st->random_state; > ``` > > In --continue-on-error case, the transaction CSTATE_WAIT_RESULT->CSTATE_ERROR > can happen even the reason of failure is not serialization and deadlock. > Ultimately the pass will reach ...->CSTATE_END_TX->CSTATE_CHOOSE_SCRIPT, the > beginning of the state machine. cs_func_rs is not overwritten in the route so > that different random value would be generated, or even another script may be > chosen. Is it correct? Yes, I believe that’s correct. > > 01. > ``` > $ git am ../patches/pgbench/v5-0001-Add-continue-on-error-opt > ion-to-pgbench.patch > Applying: When the option is set, client rolls back the failed transaction and... > .git/rebase-apply/patch:65: trailing whitespace. > <literal>serialization</literal>, <literal>deadlock</literal>, or > .git/rebase-apply/patch:139: trailing whitespace. > <option>--max-tries</option> option is not equal to 1 and > warning: 2 lines add whitespace errors. > ``` > > I got warnings when I applied the patch. Please fix it. It's been fixed. > > 02. > ``` > + * 'serialization_failures' + 'deadlock_failures' + > + * 'other_sql_failures' (they got a error when continue-on-error option > ``` > The first line has the tab, but it should be normal blank. I hadn't noticed it. It's fixed. > 03. > ``` > + else if (continue_on_error | canRetryError(st->estatus)) > ``` > > I feel "|" should be "||". Thank you for pointing out. Fixed it. > 04. 
> ``` > <term><replaceable>retries</replaceable></term> > <listitem> > <para> > number of retries after serialization or deadlock errors > (zero unless <option>--max-tries</option> is not equal to one) > </para> > </listitem> > ``` > > To confirm; --continue-on-error won't be counted here because it is not "retry", > in other words, it does not reach CSTATE_RETRY, right? Yes. I agree with Nagata-san [1] — --continue-on-error is not considered a "retry" because it doesn't reach CSTATE_RETRY. On 2025/07/05 0:03, Yugo Nagata wrote: >>> case PGRES_NONFATAL_ERROR: >>> case PGRES_FATAL_ERROR: >>> st->estatus = getSQLErrorStatus(PQresultErrorField(res, >>> PG_DIAG_SQLSTATE)); >>> if (canRetryError(st->estatus)) >>> { >>> if (verbose_errors) >>> commandError(st, PQerrorMessage(st->con)); >>> goto error; >>> } >>> /* fall through */ >>> >>> default: >>> /* anything else is unexpected */ >>> pg_log_error("client %d script %d aborted in command %d query %d: %s", >>> st->id, st->use_file, st->command, qrynum, >>> PQerrorMessage(st->con)); >>> goto error; >>> } >>> >>> When an SQL error other than a serialization or deadlock error occurs, an error message is >>> output via pg_log_error in this code path. However, I think this should be reported only >>> when verbose_errors is set, similar to how serialization and deadlock errors are handled when >>> --continue-on-error is enabled >> >> I think the error message logged via pg_log_error is useful when verbose_errors >> is not specified, because it informs users that the client has exited. Without >> it, users may not notice that something went wrong. > > However, if a large number of errors occur, this could result in a significant increase > in stderr output during the benchmark. > > Users can still notice that something went wrong by checking the “number of other failures” > reported after the run, and I assume that in most cases, when --continue-on-error is enabled, > users aren’t particularly interested in seeing individual error messages as they happen. > > It’s true that seeing error messages during the benchmark might be useful in some cases, but > the same could be said for serialization or deadlock errors, and that’s exactly what the > --verbose-errors option is for. I understand your concern. The condition for calling pg_log_error() was modified to reduce stderr output. Additionally, an error message was added for cases where some clients aborted while executing SQL commands, similar to other code paths that transition to st->state = CSTATE_ABORTED, as shown in the example below: ``` pg_log_error("client %d aborted while establishing connection", st->id); st->state = CSTATE_ABORTED; ``` > Here are some comments on the patch. > > (1) > > } > - else if (canRetryError(st->estatus)) > + else if (continue_on_error | canRetryError(st->estatus)) > st->state = CSTATE_ERROR; > else > st->state = CSTATE_ABORTED; > > Due to this change, when --continue-on-error is enabled, st->state is set to > CSTATE_ERROR regardless of the type of error returned by readCommandResponse. > When the error is not ESTATUS_OTHER_SQL_ERROR, e.g. ESTATUS_META_COMMAND_ERROR > due to a failure of \gset with query returning more the one row. > > Therefore, this should be like: > > else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) || > canRetryError(st->estatus)) > Thanks for pointing that out — I’ve corrected it. 
> (2) > > + " --continue-on-error continue processing transations after a trasaction fails\n" > > "trasaction" is a typo and including "transaction" twice looks a bit redundant. > Instead using the word "transaction", how about: > > "--continue-on-error continue running after an SQL error" ? > > This version is shorter, avoids repetition, and describes well the actual behavior when > SQL statements fail. Fixed it. > (3) > > - * A failed transaction is defined as unsuccessfully retried transactions. > + * A failed transaction is defined as unsuccessfully retried transactions > + * unless continue-on-error option is specified. > * It can be one of two types: > * > * failed (the number of failed transactions) = > @@ -411,6 +412,12 @@ typedef struct StatsData > * 'deadlock_failures' (they got a deadlock error and were not > * successfully retried). > * > + * When continue-on-error option is specified, > + * failed (the number of failed transactions) = > + * 'serialization_failures' + 'deadlock_failures' + > + * 'other_sql_failures' (they got a error when continue-on-error option > + * was specified). > + * > > To explain explicitly that there are two definitions of failed transactions > depending on the situation, how about: > > """ > A failed transaction is counted differently depending on whether > the --continue-on-error option is specified. > > Without --continue-on-error: > > failed (the number of failed transactions) = > 'serialization_failures' (they got a serialization error and were not > successfully retried) + > 'deadlock_failures' (they got a deadlock error and were not > successfully retried). > > When --continue-on-error is specified: > > failed (number of failed transactions) = > 'serialization_failures' + 'deadlock_failures' + > 'other_sql_failures' (they got some other SQL error; the transaction was > not retried and counted as failed due to > --continue-on-error). > """ Thank you for your suggestion. I modified it accordingly. > (4) > + int64 other_sql_failures; /* number of failed transactions for > + * reasons other than > + * serialization/deadlock failure , which > + * is enabled if --continue-on-error is > + * used */ > > Is "counted" is more proper than "enabled" here? Fixed. > > Af for the documentations: > (5) > The next line reports the number of failed transactions due to > - serialization or deadlock errors (see <xref linkend="failures-and-retries"/> > - for more information). > + serialization or deadlock errors by default (see > + <xref linkend="failures-and-retries"/> for more information). > > Would it be more readable to simply say: > "The next line reports the number of failed transactions (see ... for more information), > since definition of "failed transaction" has become a bit messy? > I fixed it to the simple explanation. > (6) > connection with the database server was lost or the end of script was reached > without completing the last transaction. In addition, if execution of an SQL > or meta command fails for reasons other than serialization or deadlock errors, > - the client is aborted. Otherwise, if an SQL command fails with serialization or > - deadlock errors, the client is not aborted. In such cases, the current > - transaction is rolled back, which also includes setting the client variables > - as they were before the run of this transaction (it is assumed that one > - transaction script contains only one transaction; see > - <xref linkend="transactions-and-scripts"/> for more information). > + the client is aborted by default. 
> + However, if the --continue-on-error option
> + is specified, the client does not abort and proceeds to the next transaction
> + regardless of the error. This case is reported as other failures in the output.
> + Otherwise, if an SQL command fails with serialization or deadlock errors, the
> + client is not aborted. In such cases, the current transaction is rolled back,
> + which also includes setting the client variables as they were before the run
> + of this transaction (it is assumed that one transaction script contains only
> + one transaction; see <xref linkend="transactions-and-scripts"/> for more information).
>
> To emphasize the default behavior, I wonder if it would be better to move "by default"
> to the beginning of the statements, like:
>
> "By default, if execution of an SQL or meta command fails for reasons other than
> serialization or deadlock errors, the client is aborted."
>
> How about quoting "other failures"? Like:
>
> "These cases are reported as "other failures" in the output."
>
> Also, I feel the meaning of "Otherwise" has become somewhat unclear since the
> explanation of --continue-on-error was added between the sentences. So, how about
> clarifying that the clients are not aborted due to serialization/deadlock errors even
> without --continue-on-error? For example:
>
> "In contrast, if an SQL command fails with serialization or deadlock errors, the
> client is not aborted even without <option>--continue-on-error</option>.
> Instead, the current transaction is rolled back, which also includes setting
> the client variables as they were before the run of this transaction
> (it is assumed that one transaction script contains only
> one transaction; see <xref linkend="transactions-and-scripts"/> for more information)."
>

I've modified it according to your suggestion.

> (7)
> The main report contains the number of failed transactions. If the
> - <option>--max-tries</option> option is not equal to 1, the main report also
> + <option>--max-tries</option> option is not equal to 1 and
> + <option>--continue-on-error</option> is not specified, the main report also
> contains statistics related to retries: the total number of retried
>
> Is that true?
> The retries statistics would be included even without --continue-on-error.

That was wrong. I corrected it.

[1] https://www.postgresql.org/message-id/20250705002239.27e6e5a4ba22c047ac2fa16a%40sraoss.co.jp

Regards,
Rintaro Ikeda
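To make the two definitions concrete: since the new counter stays zero unless the option is given, the agreed accounting reduces to a single sum. A minimal sketch, assuming the StatsData field names quoted above (the helper name is illustrative, not part of the patch):

```
/*
 * Sketch of the failed-transaction accounting discussed above.
 * serialization_failures and deadlock_failures are existing StatsData
 * counters; other_sql_failures is the counter added by the patch and
 * remains zero unless --continue-on-error is specified, so the same
 * sum expresses both definitions of "failed".
 */
static int64
numFailedTransactions(const StatsData *stats)
{
    return stats->serialization_failures +
           stats->deadlock_failures +
           stats->other_sql_failures;
}
```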
On Wed, 9 Jul 2025 23:58:32 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:

> Hi,
>
> Thank you for the kind comments.
>
> I've updated the previous patch.

Thank you for updating the patch!

> > However, if a large number of errors occur, this could result in a significant increase
> > in stderr output during the benchmark.
> >
> > Users can still notice that something went wrong by checking the “number of other failures”
> > reported after the run, and I assume that in most cases, when --continue-on-error is enabled,
> > users aren’t particularly interested in seeing individual error messages as they happen.
> >
> > It’s true that seeing error messages during the benchmark might be useful in some cases, but
> > the same could be said for serialization or deadlock errors, and that’s exactly what the
> > --verbose-errors option is for.
>
> I understand your concern. The condition for calling pg_log_error() was modified
> to reduce stderr output.
> Additionally, an error message was added for cases where a client aborts while
> executing SQL commands, similar to other code paths that transition to
> st->state = CSTATE_ABORTED, as shown in the example below:
>
> ```
> pg_log_error("client %d aborted while establishing connection", st->id);
> st->state = CSTATE_ABORTED;
> ```

default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ if (verbose_errors)
+ pg_log_error("client %d script %d aborted in command %d query %d: %s",
+ st->id, st->use_file, st->command, qrynum,
+ PQerrorMessage(st->con));
goto error;
}

Thanks to this fix, error messages caused by SQL errors are now output only when
--verbose-errors is enabled. However, the comment describes the condition as "unexpected",
and the message states that the client was "aborted". This does not seem accurate, since
clients are not aborted due to SQL errors when --continue-on-error is enabled.

I think the error message should be emitted using commandError() when both
--continue-on-error and --verbose-errors are specified, like this:

case PGRES_NONFATAL_ERROR:
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
goto error;
}
/* fall through */

In addition, the error message in the "default" case should be shown regardless of
--verbose-errors, since it represents an unexpected situation and should always be
reported.

Finally, I believe this fix should be included in patch 0001 rather than 0003,
as it would be a part of the implementation of --continue-on-error.


As for 0003:

+ {
+ pg_log_error("client %d aborted while executing SQL commands", st->id);
st->state = CSTATE_ABORTED;
+ }
break;

I understand that the patch is not directly related to --continue-on-error, similar to 0002,
and that it aims to improve the error message to indicate that the client was aborted due to
some error during readCommandResponse().

However, this message doesn't seem entirely accurate, since the error is not always caused
by an SQL command failure itself. For example, it could also be due to a failure of the \gset
meta-command.

In addition, this fix causes error messages to be emitted twice.
For example, if \gset fails, the following similar messages are printed:

pgbench: error: client 0 script 0 command 0 query 0: expected one row, got 0
pgbench: error: client 0 aborted while executing SQL commands

Even worse, if an unexpected error occurs in readCommandResponse() (i.e., the default case),
the following messages are emitted, both indicating that the client was aborted:

pgbench: error: client 0 script 0 aborted in command ... query ...
pgbench: error: client 0 aborted while executing SQL commands

I feel this is a bit redundant.

Therefore, if we are to improve these messages to indicate explicitly that the client
was aborted, I would suggest modifying the error messages in readCommandResponse() rather
than adding a new one in advanceConnectionState().

I've attached patch 0003 incorporating my suggestion. What do you think?

Additionally, the patch 0001 includes the fix that was originally part of your proposed 0003,
as previously discussed.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Hi,

On 2025/07/10 18:17, Yugo Nagata wrote:
> On Wed, 9 Jul 2025 23:58:32 +0900
> Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:
>
>> Hi,
>>
>> Thank you for the kind comments.
>>
>> I've updated the previous patch.
>
> Thank you for updating the patch!
>
>>> However, if a large number of errors occur, this could result in a significant increase
>>> in stderr output during the benchmark.
>>>
>>> Users can still notice that something went wrong by checking the “number of other failures”
>>> reported after the run, and I assume that in most cases, when --continue-on-error is enabled,
>>> users aren’t particularly interested in seeing individual error messages as they happen.
>>>
>>> It’s true that seeing error messages during the benchmark might be useful in some cases, but
>>> the same could be said for serialization or deadlock errors, and that’s exactly what the
>>> --verbose-errors option is for.
>>
>>
>> I understand your concern. The condition for calling pg_log_error() was modified
>> to reduce stderr output.
>> Additionally, an error message was added for cases where a client aborts while
>> executing SQL commands, similar to other code paths that transition to
>> st->state = CSTATE_ABORTED, as shown in the example below:
>>
>> ```
>> pg_log_error("client %d aborted while establishing connection", st->id);
>> st->state = CSTATE_ABORTED;
>> ```
>
> default:
> /* anything else is unexpected */
> - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> - st->id, st->use_file, st->command, qrynum,
> - PQerrorMessage(st->con));
> + if (verbose_errors)
> + pg_log_error("client %d script %d aborted in command %d query %d: %s",
> + st->id, st->use_file, st->command, qrynum,
> + PQerrorMessage(st->con));
> goto error;
> }
>
> Thanks to this fix, error messages caused by SQL errors are now output only when
> --verbose-errors is enabled. However, the comment describes the condition as "unexpected",
> and the message states that the client was "aborted". This does not seem accurate, since
> clients are not aborted due to SQL errors when --continue-on-error is enabled.
>
> I think the error message should be emitted using commandError() when both
> --continue-on-error and --verbose-errors are specified, like this:
>
> case PGRES_NONFATAL_ERROR:
> case PGRES_FATAL_ERROR:
> st->estatus = getSQLErrorStatus(PQresultErrorField(res,
> PG_DIAG_SQLSTATE));
> if (continue_on_error || canRetryError(st->estatus))
> {
> if (verbose_errors)
> commandError(st, PQerrorMessage(st->con));
> goto error;
> }
> /* fall through */
>
> In addition, the error message in the "default" case should be shown regardless of
> --verbose-errors, since it represents an unexpected situation and should always be
> reported.
>
> Finally, I believe this fix should be included in patch 0001 rather than 0003,
> as it would be a part of the implementation of --continue-on-error.
>
>
> As for 0003:
>
> + {
> + pg_log_error("client %d aborted while executing SQL commands", st->id);
> st->state = CSTATE_ABORTED;
> + }
> break;
>
> I understand that the patch is not directly related to --continue-on-error, similar to 0002,
> and that it aims to improve the error message to indicate that the client was aborted due to
> some error during readCommandResponse().
>
> However, this message doesn't seem entirely accurate, since the error is not always caused
> by an SQL command failure itself. For example, it could also be due to a failure of the \gset
> meta-command.
>
> In addition, this fix causes error messages to be emitted twice. For example, if \gset fails,
> the following similar messages are printed:
>
> pgbench: error: client 0 script 0 command 0 query 0: expected one row, got 0
> pgbench: error: client 0 aborted while executing SQL commands
>
> Even worse, if an unexpected error occurs in readCommandResponse() (i.e., the default case),
> the following messages are emitted, both indicating that the client was aborted:
>
> pgbench: error: client 0 script 0 aborted in command ... query ...
> pgbench: error: client 0 aborted while executing SQL commands
>
> I feel this is a bit redundant.
>
> Therefore, if we are to improve these messages to indicate explicitly that the client
> was aborted, I would suggest modifying the error messages in readCommandResponse() rather
> than adding a new one in advanceConnectionState().
>
> I've attached patch 0003 incorporating my suggestion. What do you think?

Thank you very much for the updated patch! I reviewed 0003 and it looks great: the error
messages have become easier to understand.

I noticed one small thing I’d like to discuss. I'm not sure that users can clearly tell
which was aborted in the following error message, the client or the script.

> pgbench: error: client 0 script 0 aborted in command ... query ...

Since the code path always results in a client abort, I wonder if the following
message might be clearer:

> pgbench: error: client 0 aborted in script 0 command ... query ...

Regards,
Rintaro Ikeda
Hi,

On Sun, 13 Jul 2025 23:15:24 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:

> I noticed one small thing I’d like to discuss. I'm not sure that users can clearly tell
> which was aborted in the following error message, the client or the script.
>
> > pgbench: error: client 0 script 0 aborted in command ... query ...
>
> Since the code path always results in a client abort, I wonder if the following
> message might be clearer:
>
> > pgbench: error: client 0 aborted in script 0 command ... query ...

Indeed, it seems clearer to explicitly state that it is the client that
was aborted.

I've attached an updated patch that replaces the remaining message mentioned
above with a call to commandFailed(). With this change, the output in such
situations will now be:

"client 0 aborted in command 0 (SQL) of script 0; ...."

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Hi,

On 2025/07/15 11:16, Yugo Nagata wrote:
>> I noticed one small thing I’d like to discuss. I'm not sure that users can clearly tell
>> which was aborted in the following error message, the client or the script.
>>> pgbench: error: client 0 script 0 aborted in command ... query ...
>>
>> Since the code path always results in a client abort, I wonder if the following
>> message might be clearer:
>>> pgbench: error: client 0 aborted in script 0 command ... query ...
>
> Indeed, it seems clearer to explicitly state that it is the client that
> was aborted.
>
> I've attached an updated patch that replaces the remaining message mentioned
> above with a call to commandFailed(). With this change, the output in such
> situations will now be:
>
> "client 0 aborted in command 0 (SQL) of script 0; ...."

Thank you for updating the patch!

When I executed a custom script that may raise a unique constraint violation, I
got the following output:

> pgbench: error: client 0 script 0 aborted in command 1 query 0: ERROR: duplicate key value violates unique constraint "test_col2_key"

I think we should also change the error message in pg_log_error. I modified the
patch v8-0003 as follows:

@@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)

 default:
 /* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
+ pg_log_error("client %d aborted in command %d query %d of script %d: %s",
+ st->id, st->command, qrynum, st->use_file,
 PQerrorMessage(st->con));
 goto error;
 }

With this change, the output is now like this:

> pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR: duplicate key value violates unique constraint "test_col2_key"

I want to hear your thoughts.

Also, let me ask one question. In this case, I directly modified your commit in
the v8-0003 patch. Is that the right way to update the patch?

Regards,
Rintaro Ikeda
On Wed, 16 Jul 2025 21:35:01 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:

> Hi,
>
> On 2025/07/15 11:16, Yugo Nagata wrote:
> >> I noticed one small thing I’d like to discuss. I'm not sure that users can clearly tell
> >> which was aborted in the following error message, the client or the script.
> >>> pgbench: error: client 0 script 0 aborted in command ... query ...
> >>
> >> Since the code path always results in a client abort, I wonder if the following
> >> message might be clearer:
> >>> pgbench: error: client 0 aborted in script 0 command ... query ...
> >
> > Indeed, it seems clearer to explicitly state that it is the client that
> > was aborted.
> >
> > I've attached an updated patch that replaces the remaining message mentioned
> > above with a call to commandFailed(). With this change, the output in such
> > situations will now be:
> >
> > "client 0 aborted in command 0 (SQL) of script 0; ...."
>
> Thank you for updating the patch!
>
> When I executed a custom script that may raise a unique constraint violation, I
> got the following output:
> > pgbench: error: client 0 script 0 aborted in command 1 query 0: ERROR:
> duplicate key value violates unique constraint "test_col2_key"

I'm sorry. I must have failed to attach the correct patch in my previous post.
As a result, patch v8 was actually the same as v7, and the message in question
was not modified as intended.

>
> I think we should also change the error message in pg_log_error. I modified the
> patch v8-0003 as follows:
> @@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
>
> default:
> /* anything else is unexpected */
> - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> - st->id, st->use_file, st->command, qrynum,
> + pg_log_error("client %d aborted in command %d query %d of script %d: %s",
> + st->id, st->command, qrynum, st->use_file,
> PQerrorMessage(st->con));
> goto error;
> }
>
> With this change, the output is now like this:
> > pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
> duplicate key value violates unique constraint "test_col2_key"
>
> I want to hear your thoughts.

My idea is to modify this as follows:

 default:
 /* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ commandFailed(st, "SQL", PQerrorMessage(st->con));
 goto error;
 }

This fix was originally planned to be included in patch v8, but was missed.
It is now included in the attached patch, v10.

With this change, the output becomes:

pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
ERROR: duplicate key value violates unique constraint "t2_pkey"

Although there is a slight difference, the message is essentially the same as
your proposal. Also, I believe the use of commandFailed() makes the code simpler
and more consistent.

What do you think?

> Also, let me ask one question. In this case, I directly modified your commit in
> the v8-0003 patch. Is that the right way to update the patch?

I’m not sure if that’s the best way, but I think modifying the patch directly is a
valid way to propose an alternative approach during discussion, as long as the original
patch is respected. It can often help clarify suggestions.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Hi,

On 2025/07/16 22:49, Yugo Nagata wrote:
>> I think we should also change the error message in pg_log_error. I modified the
>> patch v8-0003 as follows:
>> @@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
>>
>> default:
>> /* anything else is unexpected */
>> - pg_log_error("client %d script %d aborted in command %d query %d: %s",
>> - st->id, st->use_file, st->command, qrynum,
>> + pg_log_error("client %d aborted in command %d query %d of script %d: %s",
>> + st->id, st->command, qrynum, st->use_file,
>> PQerrorMessage(st->con));
>> goto error;
>> }
>>
>> With this change, the output is now like this:
>>> pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
>> duplicate key value violates unique constraint "test_col2_key"
>>
>> I want to hear your thoughts.
>
> My idea is to modify this as follows:
>
> default:
> /* anything else is unexpected */
> - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> - st->id, st->use_file, st->command, qrynum,
> - PQerrorMessage(st->con));
> + commandFailed(st, "SQL", PQerrorMessage(st->con));
> goto error;
> }
>
> This fix was originally planned to be included in patch v8, but was missed.
> It is now included in the attached patch, v10.
>
> With this change, the output becomes:
>
> pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
> ERROR: duplicate key value violates unique constraint "t2_pkey"
>
> Although there is a slight difference, the message is essentially the same as
> your proposal. Also, I believe the use of commandFailed() makes the code simpler
> and more consistent.
>
> What do you think?
>

Thank you for the new patch! I think Nagata-san's v10 patch is a clear
improvement over my v9 patch. I'm happy with the changes.

>> Also, let me ask one question. In this case, I directly modified your commit in
>> the v8-0003 patch. Is that the right way to update the patch?
>
> I’m not sure if that’s the best way, but I think modifying the patch directly is a
> valid way to propose an alternative approach during discussion, as long as the original
> patch is respected. It can often help clarify suggestions.

I understand that. Thank you.

Regards,
Rintaro Ikeda
On Fri, 18 Jul 2025 17:07:53 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:

> Hi,
>
> On 2025/07/16 22:49, Yugo Nagata wrote:
> >> I think we should also change the error message in pg_log_error. I modified the
> >> patch v8-0003 as follows:
> >> @@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
> >>
> >> default:
> >> /* anything else is unexpected */
> >> - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> >> - st->id, st->use_file, st->command, qrynum,
> >> + pg_log_error("client %d aborted in command %d query %d of script %d: %s",
> >> + st->id, st->command, qrynum, st->use_file,
> >> PQerrorMessage(st->con));
> >> goto error;
> >> }
> >>
> >> With this change, the output is now like this:
> >>> pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
> >> duplicate key value violates unique constraint "test_col2_key"
> >>
> >> I want to hear your thoughts.
> >
> > My idea is to modify this as follows:
> >
> > default:
> > /* anything else is unexpected */
> > - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> > - st->id, st->use_file, st->command, qrynum,
> > - PQerrorMessage(st->con));
> > + commandFailed(st, "SQL", PQerrorMessage(st->con));
> > goto error;
> > }
> >
> > This fix was originally planned to be included in patch v8, but was missed.
> > It is now included in the attached patch, v10.
> >
> > With this change, the output becomes:
> >
> > pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
> > ERROR: duplicate key value violates unique constraint "t2_pkey"
> >
> > Although there is a slight difference, the message is essentially the same as
> > your proposal. Also, I believe the use of commandFailed() makes the code simpler
> > and more consistent.
> >
> > What do you think?
> >
>
> Thank you for the new patch! I think Nagata-san's v10 patch is a clear
> improvement over my v9 patch. I'm happy with the changes.

Thank you.

I believe the patches implement the expected behavior, include appropriate doc and test
modifications, and are in good shape overall, so if there are no objections,
I'll mark this as Ready for Committer.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Tue, 22 Jul 2025 17:49:49 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

> On Fri, 18 Jul 2025 17:07:53 +0900
> Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:
>
> > Hi,
> >
> > On 2025/07/16 22:49, Yugo Nagata wrote:
> > >> I think we should also change the error message in pg_log_error. I modified the
> > >> patch v8-0003 as follows:
> > >> @@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
> > >>
> > >> default:
> > >> /* anything else is unexpected */
> > >> - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> > >> - st->id, st->use_file, st->command, qrynum,
> > >> + pg_log_error("client %d aborted in command %d query %d of script %d: %s",
> > >> + st->id, st->command, qrynum, st->use_file,
> > >> PQerrorMessage(st->con));
> > >> goto error;
> > >> }
> > >>
> > >> With this change, the output is now like this:
> > >>> pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
> > >> duplicate key value violates unique constraint "test_col2_key"
> > >>
> > >> I want to hear your thoughts.
> > >
> > > My idea is to modify this as follows:
> > >
> > > default:
> > > /* anything else is unexpected */
> > > - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> > > - st->id, st->use_file, st->command, qrynum,
> > > - PQerrorMessage(st->con));
> > > + commandFailed(st, "SQL", PQerrorMessage(st->con));
> > > goto error;
> > > }
> > >
> > > This fix was originally planned to be included in patch v8, but was missed.
> > > It is now included in the attached patch, v10.
> > >
> > > With this change, the output becomes:
> > >
> > > pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
> > > ERROR: duplicate key value violates unique constraint "t2_pkey"
> > >
> > > Although there is a slight difference, the message is essentially the same as
> > > your proposal. Also, I believe the use of commandFailed() makes the code simpler
> > > and more consistent.
> > >
> > > What do you think?
> > >
> >
> > Thank you for the new patch! I think Nagata-san's v10 patch is a clear
> > improvement over my v9 patch. I'm happy with the changes.
>
> Thank you.
>
> I believe the patches implement the expected behavior, include appropriate doc and test
> modifications, and are in good shape overall, so if there are no objections,
> I'll mark this as Ready for Committer.

I've updated the CF status to Ready for Committer.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, Jul 24, 2025 at 5:44 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
>
> > I believe the patches implement the expected behavior, include appropriate doc and test
> > modifications, and are in good shape overall, so if there are no objections,
> > I'll mark this as Ready for Committer.
>
> I've updated the CF status to Ready for Committer.

Thanks for working on it! As Matthias, Dilip, Srinath and many others pointed out,
it would be a very nice and helpful addition to pgbench. I've just used it out of
necessity, and it worked as advertised for me. It even adds a cool-looking
"XXX failed" count when used with the -P progress meter:

progress: 1.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 3854 failed
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 3796 failed

-J.
On Tue, Sep 16, 2025 at 5:34 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>
> On Thu, Jul 24, 2025 at 5:44 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> >
> > > I believe the patches implement the expected behavior, include appropriate doc and test
> > > modifications, and are in good shape overall, so if there are no objections,
> > > I'll mark this as Ready for Committer.
> >
> > I've updated the CF status to Ready for Committer.

Since this patch is marked as ready for committer, I've started reviewing it.
The patch basically looks good to me.


+ the client is aborted. However, if the --continue-on-error option is specified,

"--continue-on-error" should be enclosed in <option> tags.

+ without completing the last transaction. By default, if execution of an SQL
 or meta command fails for reasons other than serialization or deadlock errors,
<snip>
+ the client is aborted. However, if the --continue-on-error option is specified,
+ the client does not abort and proceeds to the next transaction regardless of
+ the error. These cases are reported as "other failures" in the output.

This explanation can be read as if --continue-on-error allows the client to
proceed to the next transaction even when a meta command (not an SQL command)
fails, but that is not correct, right? If so, the description should be updated
to make it clear that only SQL errors are affected, while meta command failures
are not.

+$node->pgbench(
+ '-t 10 --continue-on-error --failures-detailed',

Isn't it better to also specify the -n option, to skip the unnecessary VACUUM and
speed the test up?

+ 'test --continue-on-error',
+ {
+ '002_continue_on_error' => q{

Regarding the test file name, perhaps 001 would be a better prefix than 002,
since other tests in 001_pgbench_with_server.pl use 001 as the prefix.

+ insert into unique_table values 0;

This INSERT causes a syntax error. Was this intentional? If the intention was
to test unique constraint violations, it should instead be
INSERT INTO unique_table VALUES (0);.

To further improve the test, it might also be useful to mix successful and
failed transactions in the --continue-on-error case. For example,
the following change would result in one successful transaction and
nine failures:

-----------------------------
$node->safe_psql('postgres',
- 'CREATE TABLE unique_table(i int unique);' . 'INSERT INTO unique_table VALUES (0);');
+ 'CREATE TABLE unique_table(i int unique);');

$node->pgbench(
 '-t 10 --continue-on-error --failures-detailed',
 0,
 [
- qr{processed: 0/10\b},
- qr{other failures: 10\b}
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
-----------------------------

Regards,

--
Fujii Masao
On Thu, 18 Sep 2025 01:52:46 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> On Tue, Sep 16, 2025 at 5:34 PM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> >
> > On Thu, Jul 24, 2025 at 5:44 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > >
> > > > I believe the patches implement the expected behavior, include appropriate doc and test
> > > > modifications, and are in good shape overall, so if there are no objections,
> > > > I'll mark this as Ready for Committer.
> > >
> > > I've updated the CF status to Ready for Committer.
>
> Since this patch is marked as ready for committer, I've started reviewing it.
> The patch basically looks good to me.
>
>
> + the client is aborted. However, if the --continue-on-error option is specified,
>
> "--continue-on-error" should be enclosed in <option> tags.

+1

> + without completing the last transaction. By default, if execution of an SQL
>  or meta command fails for reasons other than serialization or deadlock errors,
> <snip>
> + the client is aborted. However, if the --continue-on-error option is specified,
> + the client does not abort and proceeds to the next transaction regardless of
> + the error. These cases are reported as "other failures" in the output.
>
> This explanation can be read as if --continue-on-error allows the client to
> proceed to the next transaction even when a meta command (not an SQL command)
> fails, but that is not correct, right? If so, the description should be updated
> to make it clear that only SQL errors are affected, while meta command failures
> are not.

That makes sense. How about rewriting this like:

However, if the --continue-on-error option is specified and the error occurs in
an SQL command, the client does not abort and proceeds to the next transaction
regardless of the error. These cases are reported as "other failures" in the output.
Note that if the error occurs in a meta-command, the client will still abort even
when this option is specified.

> +$node->pgbench(
> + '-t 10 --continue-on-error --failures-detailed',
>
> Isn't it better to also specify the -n option, to skip the unnecessary VACUUM and
> speed the test up?

+1

> + 'test --continue-on-error',
> + {
> + '002_continue_on_error' => q{
>
> Regarding the test file name, perhaps 001 would be a better prefix than 002,
> since other tests in 001_pgbench_with_server.pl use 001 as the prefix.

Right. This filename is shown in the “transaction type:” field of the results
when the test fails, so it should be aligned with the test file name.

> + insert into unique_table values 0;
>
> This INSERT causes a syntax error. Was this intentional? If the intention was
> to test unique constraint violations, it should instead be
> INSERT INTO unique_table VALUES (0);.

This was clearly unintentional. I happened to overlook it during my review.

> To further improve the test, it might also be useful to mix successful and
> failed transactions in the --continue-on-error case. For example,
> the following change would result in one successful transaction and
> nine failures:
>
> -----------------------------
> $node->safe_psql('postgres',
> - 'CREATE TABLE unique_table(i int unique);' . 'INSERT INTO unique_table VALUES (0);');
> + 'CREATE TABLE unique_table(i int unique);');
>
> $node->pgbench(
>  '-t 10 --continue-on-error --failures-detailed',
>  0,
>  [
> - qr{processed: 0/10\b},
> - qr{other failures: 10\b}
> + qr{processed: 1/10\b},
> + qr{other failures: 9\b}
> -----------------------------

+1

This makes the purpose of the test clearer.
Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> That makes sense. How about rewriting this like:
>
> However, if the --continue-on-error option is specified and the error occurs in
> an SQL command, the client does not abort and proceeds to the next
> transaction regardless of the error. These cases are reported as "other failures"
> in the output. Note that if the error occurs in a meta-command, the client will
> still abort even when this option is specified.

How about phrasing it like this, based on your version?

----------------------------
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
without completing the last transaction. The client also aborts
if a meta-command fails, or if an SQL command fails for reasons other than
serialization or deadlock errors when --continue-on-error is not specified.
With --continue-on-error, the client does not abort on such SQL errors
and instead proceeds to the next transaction. These cases are reported
as "other failures" in the output. If the error occurs in a meta-command,
however, the client still aborts even when this option is specified.
----------------------------

Regards,

--
Fujii Masao
On Thu, 18 Sep 2025 14:37:29 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > That makes sense. How about rewriting this like:
> >
> > However, if the --continue-on-error option is specified and the error occurs in
> > an SQL command, the client does not abort and proceeds to the next
> > transaction regardless of the error. These cases are reported as "other failures"
> > in the output. Note that if the error occurs in a meta-command, the client will
> > still abort even when this option is specified.
>
> How about phrasing it like this, based on your version?
>
> ----------------------------
> A client's run is aborted in case of a serious error; for example, the
> connection with the database server was lost or the end of script was reached
> without completing the last transaction. The client also aborts
> if a meta-command fails, or if an SQL command fails for reasons other than
> serialization or deadlock errors when --continue-on-error is not specified.
> With --continue-on-error, the client does not abort on such SQL errors
> and instead proceeds to the next transaction. These cases are reported
> as "other failures" in the output. If the error occurs in a meta-command,
> however, the client still aborts even when this option is specified.
> ----------------------------

I'm fine with that. This version is clearer.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, Sep 18, 2025 at 4:20 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
>
> On Thu, 18 Sep 2025 14:37:29 +0900
> Fujii Masao <masao.fujii@gmail.com> wrote:
>
> > On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > > That makes sense. How about rewriting this like:
> > >
> > > However, if the --continue-on-error option is specified and the error occurs in
> > > an SQL command, the client does not abort and proceeds to the next
> > > transaction regardless of the error. These cases are reported as "other failures"
> > > in the output. Note that if the error occurs in a meta-command, the client will
> > > still abort even when this option is specified.
> >
> > How about phrasing it like this, based on your version?
> >
> > ----------------------------
> > A client's run is aborted in case of a serious error; for example, the
> > connection with the database server was lost or the end of script was reached
> > without completing the last transaction. The client also aborts
> > if a meta-command fails, or if an SQL command fails for reasons other than
> > serialization or deadlock errors when --continue-on-error is not specified.
> > With --continue-on-error, the client does not abort on such SQL errors
> > and instead proceeds to the next transaction. These cases are reported
> > as "other failures" in the output. If the error occurs in a meta-command,
> > however, the client still aborts even when this option is specified.
> > ----------------------------
>
> I'm fine with that. This version is clearer.

Thanks for checking!

Also, I'd like to share my review comments for 0002 and 0003.

Regarding 0002:

- TSTATUS_OTHER_ERROR,
+ TSTATUS_UNKNOWN_ERROR,

You did this rename to avoid confusion with other_sql_errors.
I see the intention, but I'm not sure this concern is really valid
or that the rename adds much value. Also, TSTATUS_UNKNOWN_ERROR
might be mistakenly assumed to be related to PQTRANS_UNKNOWN,
even though they aren't related...

But if we agree with this change, I think it should be folded into 0001,
since there's no strong reason to keep it separate.

Regarding 0003:

- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, 0);
+ commandFailed(st, "gset", psprintf("expected one row, got %d", 0));

The change to use commandFailed() seems to remove
the "query %d" detail that the current pg_log_error() call reports.
Is it OK to lose that information?

Regards,

--
Fujii Masao
On Fri, Sep 19, 2025 at 11:43 AM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Thu, Sep 18, 2025 at 4:20 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> >
> > On Thu, 18 Sep 2025 14:37:29 +0900
> > Fujii Masao <masao.fujii@gmail.com> wrote:
> >
> > > On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > > > That makes sense. How about rewriting this like:
> > > >
> > > > However, if the --continue-on-error option is specified and the error occurs in
> > > > an SQL command, the client does not abort and proceeds to the next
> > > > transaction regardless of the error. These cases are reported as "other failures"
> > > > in the output. Note that if the error occurs in a meta-command, the client will
> > > > still abort even when this option is specified.
> > >
> > > How about phrasing it like this, based on your version?
> > >
> > > ----------------------------
> > > A client's run is aborted in case of a serious error; for example, the
> > > connection with the database server was lost or the end of script was reached
> > > without completing the last transaction. The client also aborts
> > > if a meta-command fails, or if an SQL command fails for reasons other than
> > > serialization or deadlock errors when --continue-on-error is not specified.
> > > With --continue-on-error, the client does not abort on such SQL errors
> > > and instead proceeds to the next transaction. These cases are reported
> > > as "other failures" in the output. If the error occurs in a meta-command,
> > > however, the client still aborts even when this option is specified.
> > > ----------------------------
> >
> > I'm fine with that. This version is clearer.
>
> Thanks for checking!

I've updated the 0001 patch based on the comments.
The revised version is attached.

While testing, I found that running pgbench with --continue-on-error and
pipeline mode triggers the following assertion failure. Could this be
a bug in the patch?

---------------------------------------------------
$ cat pipeline.pgbench
\startpipeline
DO $$
BEGIN
PERFORM pg_sleep(3);
PERFORM pg_terminate_backend(pg_backend_pid());
END $$;
\endpipeline

$ pgbench -n --debug --verbose-errors -f pipeline.pgbench -c 2 -t 4 -M extended --continue-on-error
...
Assertion failed:
(sql_script[st->use_file].commands[st->command]->type == 1), function
commandError, file pgbench.c, line 3081.
Abort trap: 6
---------------------------------------------------

When I ran the same command without --continue-on-error,
the assertion failure did not occur.

Regards,

--
Fujii Masao
On Fri, 19 Sep 2025 11:43:28 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> On Thu, Sep 18, 2025 at 4:20 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> >
> > On Thu, 18 Sep 2025 14:37:29 +0900
> > Fujii Masao <masao.fujii@gmail.com> wrote:
> >
> > > On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > > > That makes sense. How about rewriting this like:
> > > >
> > > > However, if the --continue-on-error option is specified and the error occurs in
> > > > an SQL command, the client does not abort and proceeds to the next
> > > > transaction regardless of the error. These cases are reported as "other failures"
> > > > in the output. Note that if the error occurs in a meta-command, the client will
> > > > still abort even when this option is specified.
> > >
> > > How about phrasing it like this, based on your version?
> > >
> > > ----------------------------
> > > A client's run is aborted in case of a serious error; for example, the
> > > connection with the database server was lost or the end of script was reached
> > > without completing the last transaction. The client also aborts
> > > if a meta-command fails, or if an SQL command fails for reasons other than
> > > serialization or deadlock errors when --continue-on-error is not specified.
> > > With --continue-on-error, the client does not abort on such SQL errors
> > > and instead proceeds to the next transaction. These cases are reported
> > > as "other failures" in the output. If the error occurs in a meta-command,
> > > however, the client still aborts even when this option is specified.
> > > ----------------------------
> >
> > I'm fine with that. This version is clearer.
>
> Thanks for checking!
>
>
> Also, I'd like to share my review comments for 0002 and 0003.
>
> Regarding 0002:
>
> - TSTATUS_OTHER_ERROR,
> + TSTATUS_UNKNOWN_ERROR,
>
> You did this rename to avoid confusion with other_sql_errors.
> I see the intention, but I'm not sure this concern is really valid
> or that the rename adds much value. Also, TSTATUS_UNKNOWN_ERROR
> might be mistakenly assumed to be related to PQTRANS_UNKNOWN,
> even though they aren't related...

I don’t have a strong opinion on this, but I think TSTATUS_* is slightly
related to PQTRANS_*, since getTransactionStatus() determines the TSTATUS
value based on PQTRANS. There is no one-to-one relationship, of course,
but it is more related than ESTATUS_OTHER_SQL_ERROR, which is entirely
separate.

> But if we agree with this change, I think it should be folded into 0001,
> since there's no strong reason to keep it separate.

+1

I personally don't mind omitting this change, but I would like to wait
for Ikeda-san's response, because he is the author of these two patches.

> Regarding 0003:
>
> - pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
> - st->id, st->use_file, st->command, qrynum, 0);
> + commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
>
> The change to use commandFailed() seems to remove
> the "query %d" detail that the current pg_log_error() call reports.
> Is it OK to lose that information?

"qrynum" is the index of the SQL queries combined by "\;", but reporting it in
\gset errors is almost useless, since \gset can only be applied to the last
query of a compound query. So I think it's fine to commit this change as is.
That said, it might still be useful for debugging when an internal error like the
following occurs (mainly for developers rather than users):

 /* internal error */
 commandFailed(st, cmd, psprintf("error storing into variable %s", varname));

For that case, I’d be fine with adding information like this:

 /* internal error */
 commandFailed(st, cmd, psprintf("error storing into variable %s, at query %d", varname, qrynum));

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Fri, 19 Sep 2025 19:21:29 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> On Fri, Sep 19, 2025 at 11:43 AM Fujii Masao <masao.fujii@gmail.com> wrote:
> >
> > On Thu, Sep 18, 2025 at 4:20 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > >
> > > On Thu, 18 Sep 2025 14:37:29 +0900
> > > Fujii Masao <masao.fujii@gmail.com> wrote:
> > >
> > > > On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > > > > That makes sense. How about rewriting this like:
> > > > >
> > > > > However, if the --continue-on-error option is specified and the error occurs in
> > > > > an SQL command, the client does not abort and proceeds to the next
> > > > > transaction regardless of the error. These cases are reported as "other failures"
> > > > > in the output. Note that if the error occurs in a meta-command, the client will
> > > > > still abort even when this option is specified.
> > > >
> > > > How about phrasing it like this, based on your version?
> > > >
> > > > ----------------------------
> > > > A client's run is aborted in case of a serious error; for example, the
> > > > connection with the database server was lost or the end of script was reached
> > > > without completing the last transaction. The client also aborts
> > > > if a meta-command fails, or if an SQL command fails for reasons other than
> > > > serialization or deadlock errors when --continue-on-error is not specified.
> > > > With --continue-on-error, the client does not abort on such SQL errors
> > > > and instead proceeds to the next transaction. These cases are reported
> > > > as "other failures" in the output. If the error occurs in a meta-command,
> > > > however, the client still aborts even when this option is specified.
> > > > ----------------------------
> > >
> > > I'm fine with that. This version is clearer.
> >
> > Thanks for checking!
>
> I've updated the 0001 patch based on the comments.
> The revised version is attached.

Thank you for updating the patch!

>
> While testing, I found that running pgbench with --continue-on-error and
> pipeline mode triggers the following assertion failure. Could this be
> a bug in the patch?
>
> ---------------------------------------------------
> $ cat pipeline.pgbench
> \startpipeline
> DO $$
> BEGIN
> PERFORM pg_sleep(3);
> PERFORM pg_terminate_backend(pg_backend_pid());
> END $$;
> \endpipeline
>
> $ pgbench -n --debug --verbose-errors -f pipeline.pgbench -c 2 -t 4 -M
> extended --continue-on-error
> ...
> Assertion failed:
> (sql_script[st->use_file].commands[st->command]->type == 1), function
> commandError, file pgbench.c, line 3081.
> Abort trap: 6
> ---------------------------------------------------
>
> When I ran the same command without --continue-on-error,
> the assertion failure did not occur.

I think this bug was introduced by commit 4a39f87acd6e, which enabled pgbench
to retry and added the --verbose-errors option, rather than by this patch itself.

The assertion failure occurs in commandError(), which is called to report an error
when it can be retried (i.e., a serialization failure or deadlock), or, after this
patch, when --continue-on-error is used.

Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);

This assumes the error is always detected during SQL command execution, but
that’s not correct, since in pipeline mode the error can be detected when
a \endpipeline meta-command is executed.
$ cat deadlock.sql
\startpipeline
begin;
lock b;
lock a;
end;
\endpipeline

$ cat deadlock2.sql
\startpipeline
begin;
lock a;
lock b;
end;
\endpipeline

$ pgbench --verbose-errors -f deadlock.sql -f deadlock2.sql -c 2 -T 3 -M extended
pgbench (19devel)
starting vacuum...end.
pgbench: pgbench.c:3062: commandError: Assertion `sql_script[st->use_file].commands[st->command]->type == 1' failed.

Although one option would be to remove this assertion, if we prefer to keep it,
the attached patch fixes the issue.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
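For readers following along, the relaxed assertion would presumably take a shape like the sketch below, admitting \endpipeline as a legitimate place to report the error. This is an assumption about the attached patch, not its actual text; Command, META_COMMAND, and META_ENDPIPELINE are existing pgbench identifiers:

```
/*
 * Sketch (assumed shape of the fix): in pipeline mode the server error
 * is consumed while processing the \endpipeline meta-command, so the
 * current command may legitimately be that meta-command rather than an
 * SQL command.
 */
Command    *command = sql_script[st->use_file].commands[st->command];

Assert(command->type == SQL_COMMAND ||
       (command->type == META_COMMAND &&
        command->meta == META_ENDPIPELINE));
```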
Thank you for reviewing the patches.

On 2025/09/19 20:56, Yugo Nagata wrote:
>>>> A client's run is aborted in case of a serious error; for example, the
>>>> connection with the database server was lost or the end of script was reached
>>>> without completing the last transaction. The client also aborts
>>>> if a meta-command fails, or if an SQL command fails for reasons other than
>>>> serialization or deadlock errors when --continue-on-error is not specified.
>>>> With --continue-on-error, the client does not abort on such SQL errors
>>>> and instead proceeds to the next transaction. These cases are reported
>>>> as "other failures" in the output. If the error occurs in a meta-command,
>>>> however, the client still aborts even when this option is specified.
>>>> ----------------------------
>>>
>>> I'm fine with that. This version is clearer.

I also agree with this.

>>
>> Also, I'd like to share my review comments for 0002 and 0003.
>>
>> Regarding 0002:
>>
>> - TSTATUS_OTHER_ERROR,
>> + TSTATUS_UNKNOWN_ERROR,
>>
>> You did this rename to avoid confusion with other_sql_errors.
>> I see the intention, but I'm not sure this concern is really valid
>> or that the rename adds much value. Also, TSTATUS_UNKNOWN_ERROR
>> might be mistakenly assumed to be related to PQTRANS_UNKNOWN,
>> even though they aren't related...
>
> I don’t have a strong opinion on this, but I think TSTATUS_* is slightly
> related to PQTRANS_*, since getTransactionStatus() determines the TSTATUS
> value based on PQTRANS. There is no one-to-one relationship, of course,
> but it is more related than ESTATUS_OTHER_SQL_ERROR, which is entirely
> separate.
>
>> But if we agree with this change, I think it should be folded into 0001,
>> since there's no strong reason to keep it separate.
>
> +1
>
> I personally don't mind omitting this change, but I would like to wait
> for Ikeda-san's response, because he is the author of these two patches.
>

The points you both raise make sense to me.
Changing the macro name is not important for the purpose of the patch, so I now
feel it would be reasonable to drop patch 0002.

Regards,
Rintaro Ikeda
On Sat, Sep 20, 2025 at 12:21 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > While testing, I found that running pgbench with --continue-on-error and
> > pipeline mode triggers the following assertion failure. Could this be
> > a bug in the patch?
> >
> > ---------------------------------------------------
> > $ cat pipeline.pgbench
> > \startpipeline
> > DO $$
> > BEGIN
> > PERFORM pg_sleep(3);
> > PERFORM pg_terminate_backend(pg_backend_pid());
> > END $$;
> > \endpipeline
> >
> > $ pgbench -n --debug --verbose-errors -f pipeline.pgbench -c 2 -t 4 -M
> > extended --continue-on-error
> > ...
> > Assertion failed:
> > (sql_script[st->use_file].commands[st->command]->type == 1), function
> > commandError, file pgbench.c, line 3081.
> > Abort trap: 6
> > ---------------------------------------------------
> >
> > When I ran the same command without --continue-on-error,
> > the assertion failure did not occur.
>
> I think this bug was introduced by commit 4a39f87acd6e, which enabled pgbench
> to retry and added the --verbose-errors option, rather than by this patch itself.
>
> The assertion failure occurs in commandError(), which is called to report an error
> when it can be retried (i.e., a serialization failure or deadlock), or, after this
> patch, when --continue-on-error is used.
>
> Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);
>
> This assumes the error is always detected during SQL command execution, but
> that’s not correct, since in pipeline mode the error can be detected when
> a \endpipeline meta-command is executed.
>
> $ cat deadlock.sql
> \startpipeline
> begin;
> lock b;
> lock a;
> end;
> \endpipeline
>
> $ cat deadlock2.sql
> \startpipeline
> begin;
> lock a;
> lock b;
> end;
> \endpipeline
>
> $ pgbench --verbose-errors -f deadlock.sql -f deadlock2.sql -c 2 -T 3 -M extended
> pgbench (19devel)
> starting vacuum...end.
> pgbench: pgbench.c:3062: commandError: Assertion `sql_script[st->use_file].commands[st->command]->type == 1' failed.
>
> Although one option would be to remove this assertion, if we prefer to keep it,
> the attached patch fixes the issue.

Thanks for the analysis and the patch!

I think we should fix the issue rather than just removing the assertion.
I'd like to apply your patch with the following source comment:

---------------------------
Errors should only be detected during an SQL command or the \endpipeline
meta command. Any other case triggers an assertion failure.
--------------------------


With your patch and the continue-on-error patches, running the same pgbench
command I used to reproduce the assertion failure upthread causes pgbench
to hang. From my analysis, it enters an infinite loop in discardUntilSync().
That loop waits for PGRES_PIPELINE_SYNC, but since the connection has already
been closed, it never arrives, leaving pgbench stuck.

Could this also happen without the continue-on-error patch, or is it a new bug
introduced by it? Either way, it seems pgbench needs to exit the loop when
the result status is PGRES_FATAL_ERROR.

Regards,

--
Fujii Masao
On Sat, Sep 20, 2025 at 9:58 PM Rintaro Ikeda
<ikedarintarof@oss.nttdata.com> wrote:
> The points you both raise make sense to me.
> Changing the macro name is not important for the purpose of the patch, so I now
> feel it would be reasonable to drop patch 0002.

Thanks for your thoughts! So let's focus on the 0001 patch for now.

Regards,

--
Fujii Masao
Hi,

On 2025/09/22 11:56, Fujii Masao wrote:
> On Sat, Sep 20, 2025 at 12:21 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
>>> While testing, I found that running pgbench with --continue-on-error and
>>> pipeline mode triggers the following assertion failure. Could this be
>>> a bug in the patch?
>>>
>>> ---------------------------------------------------
>>> $ cat pipeline.pgbench
>>> \startpipeline
>>> DO $$
>>> BEGIN
>>> PERFORM pg_sleep(3);
>>> PERFORM pg_terminate_backend(pg_backend_pid());
>>> END $$;
>>> \endpipeline
>>>
>>> $ pgbench -n --debug --verbose-errors -f pipeline.pgbench -c 2 -t 4 -M
>>> extended --continue-on-error
>>> ...
>>> Assertion failed:
>>> (sql_script[st->use_file].commands[st->command]->type == 1), function
>>> commandError, file pgbench.c, line 3081.
>>> Abort trap: 6
>>> ---------------------------------------------------
>>>
>>> When I ran the same command without --continue-on-error,
>>> the assertion failure did not occur.
>>
>> I think this bug was introduced by commit 4a39f87acd6e, which enabled pgbench
>> to retry and added the --verbose-errors option, rather than by this patch itself.
>>
>> The assertion failure occurs in commandError(), which is called to report an error
>> when it can be retried (i.e., a serialization failure or deadlock), or, after this
>> patch, when --continue-on-error is used.
>>
>> Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);
>>
>> This assumes the error is always detected during SQL command execution, but
>> that’s not correct, since in pipeline mode the error can be detected when
>> a \endpipeline meta-command is executed.
>>
>> $ cat deadlock.sql
>> \startpipeline
>> begin;
>> lock b;
>> lock a;
>> end;
>> \endpipeline
>>
>> $ cat deadlock2.sql
>> \startpipeline
>> begin;
>> lock a;
>> lock b;
>> end;
>> \endpipeline
>>
>> $ pgbench --verbose-errors -f deadlock.sql -f deadlock2.sql -c 2 -T 3 -M extended
>> pgbench (19devel)
>> starting vacuum...end.
>> pgbench: pgbench.c:3062: commandError: Assertion `sql_script[st->use_file].commands[st->command]->type == 1' failed.
>>
>> Although one option would be to remove this assertion, if we prefer to keep it,
>> the attached patch fixes the issue.
>
> Thanks for the analysis and the patch!
>
> I think we should fix the issue rather than just removing the assertion.
> I'd like to apply your patch with the following source comment:
>
> ---------------------------
> Errors should only be detected during an SQL command or the \endpipeline
> meta command. Any other case triggers an assertion failure.
> --------------------------
>
>
> With your patch and the continue-on-error patches, running the same pgbench
> command I used to reproduce the assertion failure upthread causes pgbench
> to hang. From my analysis, it enters an infinite loop in discardUntilSync().
> That loop waits for PGRES_PIPELINE_SYNC, but since the connection has already
> been closed, it never arrives, leaving pgbench stuck.
>
> Could this also happen without the continue-on-error patch, or is it a new bug
> introduced by it? Either way, it seems pgbench needs to exit the loop when
> the result status is PGRES_FATAL_ERROR.
>

Thank you for the analysis and the patches.

I think the issue is a new bug, because previously the client transitioned to
CSTATE_ABORTED immediately after the queries failed, without executing
discardUntilSync().

I've attached a patch that fixes the assertion error. The content of the v1 patch
by Mr. Nagata is also included. I would appreciate it if you could review my patch.

Regards,
Rintaro Ikeda
Attachments
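The attached patch itself is not inlined above, but the analysis pins down what it must do: allow commandError() to fire for the \endpipeline meta command as well as for SQL commands. A hypothetical, self-contained miniature of that relaxed assertion (the enum and field names mirror pgbench.c, but this is not the attached patch):

```
#include <assert.h>

/* Miniature model of the pgbench structures involved (names mirror pgbench.c) */
typedef enum { SQL_COMMAND, META_COMMAND } CommandType;
typedef enum { META_NONE, META_ENDPIPELINE } MetaCommand;

typedef struct Command
{
	CommandType type;
	MetaCommand meta;
} Command;

/*
 * Hypothetical shape of the relaxed assertion: an error may be detected
 * while executing an SQL command, or, in pipeline mode, while executing
 * the \endpipeline meta command -- any other case is a bug.
 */
static void
assert_error_location(const Command *command)
{
	assert(command->type == SQL_COMMAND ||
		   (command->type == META_COMMAND &&
			command->meta == META_ENDPIPELINE));
}
```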
On Tue, Sep 23, 2025 at 11:58 AM Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:
> I think the issue is a new bug, because we transition to CSTATE_ABORTED
> immediately after queries fail, without executing discardUntilSync().

If so, the fix in discardUntilSync() should be part of the continue-on-error
patch. The assertion failure fix should be a separate patch, since only that
needs to be backpatched (the failure can also occur in back branches).

> I've attached a patch that fixes the assertion error. The content of
> Mr. Nagata's v1 patch is also included. I would appreciate it if you could
> review my patch.

+ if (received_sync == true)

For boolean flags, we usually just use the variable itself instead of
"== true/false".

+ switch (PQresultStatus(res))
+ {
+     case PGRES_PIPELINE_SYNC:
+         received_sync = true;

In the PGRES_PIPELINE_SYNC case, PQclear(res) isn't called but should be.

+     case PGRES_FATAL_ERROR:
+         PQclear(res);
+         goto done;

I don't think we need goto here. How about this instead?

-----------------------
@@ -3525,11 +3525,18 @@ discardUntilSync(CState *st)
 			 * results have been discarded.
 			 */
 			st->num_syncs = 0;
-			PQclear(res);
 			break;
 		}
+		/*
+		 * Stop receiving further results if PGRES_FATAL_ERROR is returned
+		 * (e.g., due to a connection failure) before PGRES_PIPELINE_SYNC,
+		 * since PGRES_PIPELINE_SYNC will never arrive.
+		 */
+		else if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+			break;
 		PQclear(res);
 	}
+	PQclear(res);

 	/* exit pipeline */
 	if (PQexitPipelineMode(st->con) != 1)
-----------------------

Regards,

--
Fujii Masao
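For readers who don't have libpq's pipeline semantics paged in: PQgetResult() returns NULL between the result sets of queued queries, so NULL alone does not mean the pipeline is drained; only PGRES_PIPELINE_SYNC does. The sketch below is a minimal standalone illustration of the pattern under discussion (not pgbench's actual discardUntilSync()), including an escape hatch so a dead connection cannot make the loop spin forever:

```
#include <stdbool.h>
#include <libpq-fe.h>

/*
 * Minimal sketch (not pgbench's actual code): after an error in pipeline
 * mode, discard pending results until the PGRES_PIPELINE_SYNC marker.
 * Returns true if the sync point was reached.
 */
static bool
discard_until_sync(PGconn *conn)
{
	for (;;)
	{
		PGresult   *res = PQgetResult(conn);

		if (res == NULL)
		{
			/* separator between queued queries' result sets */
			if (PQstatus(conn) == CONNECTION_BAD)
				return false;	/* connection died; no sync will arrive */
			continue;
		}

		if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
		{
			PQclear(res);
			return true;		/* reached the sync point */
		}

		if (PQresultStatus(res) == PGRES_FATAL_ERROR &&
			PQstatus(conn) == CONNECTION_BAD)
		{
			PQclear(res);
			return false;		/* fatal error on a dead connection */
		}

		PQclear(res);			/* discard and keep draining */
	}
}
```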
On Thu, 25 Sep 2025 02:19:27 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> On Tue, Sep 23, 2025 at 11:58 AM Rintaro Ikeda
> <ikedarintarof@oss.nttdata.com> wrote:
> > I think the issue is a new bug, because we transition to CSTATE_ABORTED
> > immediately after queries fail, without executing discardUntilSync().

Agreed.

> If so, the fix in discardUntilSync() should be part of the continue-on-error
> patch. The assertion failure fix should be a separate patch, since only that
> needs to be backpatched (the failure can also occur in back branches).

+1

> I don't think we need goto here. How about this instead?
>
> [...]

I think Fujii-san's version is better, because Ikeda-san's version doesn't
consider the case where PGRES_PIPELINE_SYNC is followed by another one.
In that situation, the loop would terminate without getting NULL, which
would cause an error in PQexitPipelineMode().

However, I would like to suggest an alternative solution: checking the
connection status when readCommandResponse() returns false. This seems more
straightforward, since the cause of the error can be investigated immediately
after it is detected.

@@ -4024,7 +4043,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
 					st->state = CSTATE_END_COMMAND;
 			}
-			else if (canRetryError(st->estatus))
+			else if (PQstatus(st->con) == CONNECTION_BAD)
+				st->state = CSTATE_ABORTED;
+			else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+					 canRetryError(st->estatus))
 				st->state = CSTATE_ERROR;
 			else
 				st->state = CSTATE_ABORTED;

What do you think?

Additionally, I noticed that in pipeline mode, the error message reported in
readCommandResponse() is lost, because it is reset when PQgetResult() returns
NULL to indicate the end of query processing. For example:

pgbench: client 0 got an error in command 3 (SQL) of script 0;
pgbench: client 1 got an error in command 3 (SQL) of script 0;

This can be fixed by saving the previous error message and using it for the
report. After the fix:

pgbench: client 0 got an error in command 3 (SQL) of script 0; FATAL: terminating connection due to administrator command

I've attached updated patches.

0001 fixes the assertion failure in commandError() and the lost error message
in readCommandResponse().

0002 was the previous 0001 that adds --continue-on-error, including the fix
to handle connection termination errors.

0003 improves error messages for errors that cause client abortion.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments
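The "save the previous error message" fix can be pictured with a small libpq sketch. PQerrorMessage() reports the connection's current error state, which further PQgetResult() calls may reset; copying it first sidesteps that. This is a hypothetical illustration of the idea, not the patch itself (pgbench would use pg_strdup()/pg_free(); plain strdup()/free() keep the sketch self-contained):

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

/*
 * Hypothetical sketch: copy the connection-level error message as soon as
 * the failing result is seen, because draining the remaining results with
 * PQgetResult() can reset it to an empty string.
 */
static void
report_saved_error(PGconn *conn)
{
	/* save the message now; later PQgetResult() calls may reset it */
	char	   *errmsg = strdup(PQerrorMessage(conn));
	PGresult   *res;

	/* drain remaining results; in pipeline mode this clears the error */
	while ((res = PQgetResult(conn)) != NULL)
		PQclear(res);

	fprintf(stderr, "error: %s", errmsg);
	free(errmsg);
}
```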
On Thu, 25 Sep 2025 11:09:40 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

[...]

> I've attached updated patches.
>
> 0001 fixes the assertion failure in commandError() and the lost error message
> in readCommandResponse().
>
> 0002 was the previous 0001 that adds --continue-on-error, including the fix
> to handle connection termination errors.
>
> 0003 improves error messages for errors that cause client abortion.

I think patch 0001 should be backpatched, since the issue can also occur on
retries after serialization failures or deadlocks in pipeline mode.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, Sep 25, 2025 at 11:17 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
[...]
> I think patch 0001 should be backpatched, since the issue can also occur on
> retries after serialization failures or deadlocks in pipeline mode.

Agreed.

Regarding 0001:

+ /*
+ Errors should only be detected during an SQL command or the \endpipeline
+ meta command. Any other case triggers an assertion failure.
+ */

* should be added before "Errors" and "meta".

+ errmsg = pg_strdup(PQerrorMessage(st->con));

It would be good to add a comment explaining why we do this.

+ pg_free(errmsg);

Shouldn't pg_free() be called also in the error case, i.e., after
jumping to the error label?

Regards,

--
Fujii Masao
On Thu, 25 Sep 2025 13:49:05 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
[...]
> Regarding 0001:
>
> + /*
> + Errors should only be detected during an SQL command or the \endpipeline
> + meta command. Any other case triggers an assertion failure.
> + */
>
> * should be added before "Errors" and "meta".

Oops. Fixed.

> + errmsg = pg_strdup(PQerrorMessage(st->con));
>
> It would be good to add a comment explaining why we do this.
>
> + pg_free(errmsg);
>
> Shouldn't pg_free() be called also in the error case, i.e., after
> jumping to the error label?

Yes, it should be. Fixed.

I've attached updated patches.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments
Hi,

The patch looks good; I've spotted some typos in the doc.

+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but start new transaction.

Should be "but start a new transaction.", although "proceed to the next
transaction." may be clearer here?

+ number of transactions that got a SQL error
+ (zero unless <option>--failures-detailed</option> is specified)

It seems like both "a SQL" and "an SQL" are used in the codebase and doc, but
this page only uses "an SQL", so using "an SQL" may be better for consistency.

+ If an SQL command fails due to serialization or deadlock errors, the
+ client does not aborted, regardless of whether

Should be "the client does not abort."

Regards,
Anthonin Bonnefoy
Hi Yugo,

Thanks for the patch. After reviewing it, I got a few small comments:

> On Sep 25, 2025, at 15:22, Yugo Nagata <nagata@sraoss.co.jp> wrote:
>
> --
> Yugo Nagata <nagata@sraoss.co.jp>
> <v13-0003-Improve-error-messages-for-errors-that-cause-cli.patch><v13-0002-Add-continue-on-error-option.patch><v13-0001-Fix-assertion-failure-and-verbose-messages-in-pi.patch>

[...]

Best regards,
Chao Li
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, Sep 25, 2025 at 4:22 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> I've attached updated patches.

Thanks for updating the patches!

About 0001: you mentioned that the lost error message issue occurs in pipeline
mode. Just to confirm, are you sure it never happens in non-pipeline mode?
From a quick look, readCommandResponse() seems to have this problem regardless
of whether pipeline mode is used.

If it can also happen outside pipeline mode, maybe we should split this from
the assertion failure fix, since they'd need to be backpatched to different
branches. What do you think?

Regards,

--
Fujii Masao
On Fri, 26 Sep 2025 00:03:06 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> About 0001: you mentioned that the lost error message issue occurs in pipeline
> mode. Just to confirm, are you sure it never happens in non-pipeline mode?
> From a quick look, readCommandResponse() seems to have this problem regardless
> of whether pipeline mode is used.
>
> If it can also happen outside pipeline mode, maybe we should split this from
> the assertion failure fix, since they'd need to be backpatched to different
> branches.

I could not find a code path that resets the error state before reporting in
non-pipeline mode, since it is typically reset when starting to send a query.
However, referencing an error message after another PQgetResult() does not
seem like a good idea in general, so I agree with splitting the patch.

I'll submit updated patches soon.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, 25 Sep 2025 17:17:36 +0800
Chao Li <li.evan.chao@gmail.com> wrote:

> Hi Yugo,
>
> Thanks for the patch. After reviewing it, I got a few small comments:

Thank you for your review and comments.

> 1 - 0001
> ```
> @@ -3265,6 +3271,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
>  	PGresult   *res;
>  	PGresult   *next_res;
>  	int			qrynum = 0;
> +	char	   *errmsg;
> ```
>
> I think we should initialize errmsg to NULL. The compiler won't auto-initialize
> a local variable. If it happens to not enter the while loop, then errmsg will
> hold a random value, and pg_free(errmsg) will have trouble.

I think this initialization is unnecessary, just like for res and next_res.
If the code happens not to enter the while loop, pg_free(errmsg) will not be
called anyway, since the error: label is only reachable from inside the loop.

> 2 - 0002
> ```
> +     <para>
> +      Allows clients to continue their run even if an SQL statement fails due to
> +      errors other than serialization or deadlock. Unlike serialization and deadlock
> +      failures, clients do not retry the same transactions but start new transaction.
> +      This option is useful when your custom script may raise errors due to some
> +      reason like unique constraints violation. Without this option, the client is
> +      aborted after such errors.
> +     </para>
> ```
>
> A few nit suggestions:
>
> * "continue their run" => "continue running"

Fixed.

> * "clients do not retry the same transactions but start new transaction" =>
>   "clients do not retry the same transaction but start a new transaction instead"

I see your point. Maybe we could follow Anthonin Bonnefoy's suggestion to use
"proceed to the next transaction", as it may sound a bit more natural.

> * "due to some reason like" => "for reasons such as"

Fixed.

> 3 - 0002
> ```
> + * Without --continue-on-error:
>  * failed (the number of failed transactions) =
> ```
>
> Maybe add an empty line after the "without" line.

Makes sense. Fixed.

> 4 - 0002
> ```
> + * When --continue-on-error is specified:
> + *
> + * failed (number of failed transactions) =
> ```
>
> Maybe change to "With --continue-on-error", which sounds consistent with the
> previous "without".

Fixed.

> 5 - 0002
> ```
> +	int64		other_sql_failures; /* number of failed transactions for
> +									 * reasons other than
> +									 * serialization/deadlock failure, which
> +									 * is counted if --continue-on-error is
> +									 * specified */
> ```
>
> How about renaming this variable to "sql_errors", which reflects the new
> option name?

I think it's better to keep the current name, since the variable counts failed
transactions, even though that happens to be equivalent to the number of SQL
errors. It's also consistent with the other variables, serialization_failures
and deadlock_failures.

> 6 - 0002
> ```
> @@ -4571,6 +4594,8 @@ getResultString(bool skipped, EStatus estatus)
>  			return "serialization";
>  		case ESTATUS_DEADLOCK_ERROR:
>  			return "deadlock";
> +		case ESTATUS_OTHER_SQL_ERROR:
> +			return "other";
> ```
>
> I think this can just return "error". I checked where this function is called;
> there will not be other words such as "error" appended.

getResultString() is called to get a string that represents the type of error
causing the transaction failure, so simply returning "error" doesn't seem very
useful.

> 7 - 0002
> ```
>  	/* it can be non-zero only if max_tries is not equal to one */
> @@ -6569,6 +6602,10 @@ printResults(StatsData *total,
>  				   sstats->deadlock_failures,
>  				   (100.0 * sstats->deadlock_failures /
>  					script_total_cnt));
> +			printf("  - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
> +				   sstats->other_sql_failures,
> +				   (100.0 * sstats->other_sql_failures /
> +					script_total_cnt));
> ```
>
> Do we only want to print this number when --continue-on-error is given?

We could do that, but this message is printed only when --failures-detailed is
specified. So I think users would not mind if it shows that the number of other
failures is zero, even when --continue-on-error is not specified.

I would appreciate hearing other people's opinions on this.

I've attached updated patches that include fixes for some of your suggestions
and for Anthonin Bonnefoy's suggestion on the documentation.

I also split the patch according to Fujii-san's suggestion.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments
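The control-flow argument in point 1 can be checked with a miniature example: if every goto to the error label happens after the variable is assigned, leaving it uninitialized is safe, which is the same reasoning applied to res and next_res. A hypothetical sketch (not pgbench code):

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Control-flow miniature of the point being argued: the "error" label is
 * reachable only from inside the loop, and every jump to it happens after
 * errmsg has been assigned, so leaving errmsg uninitialized is safe
 * (though some compilers may still warn about it).
 */
static int
drain(int nitems, const char **items)
{
	char	   *errmsg;			/* deliberately not initialized */

	for (int i = 0; i < nitems; i++)
	{
		if (items[i] == NULL)
		{
			errmsg = strdup("unexpected NULL item\n");	/* assigned before the jump */
			goto error;
		}
	}
	return 0;					/* errmsg is never read on this path */

error:
	fprintf(stderr, "%s", errmsg);
	free(errmsg);
	return 1;
}
```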
On Thu, 25 Sep 2025 10:27:44 +0200
Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:

> The patch looks good; I've spotted some typos in the doc.
> [...]

Thank you for your review. I've attached the updated patch in my previous post
in this thread.

By the way, on the pgsql-hackers list, top-posting is generally discouraged [1],
so replying below the quoted messages is usually preferred.

[1] https://wiki.postgresql.org/wiki/Mailing_Lists

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Fri, 26 Sep 2025 11:44:42 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

[...]

> I've attached updated patches that include fixes for some of your
> suggestions and for Anthonin Bonnefoy's suggestion on the documentation.
>
> I also split the patch according to Fujii-san's suggestion.

Fujii-san, thank you for committing the patch that fixes the assertion failure.
I've attached the remaining patches so that cfbot stays green.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments
On Tue, Sep 30, 2025 at 10:24 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> Fujii-san, thank you for committing the patch that fixes the assertion failure.
> I've attached the remaining patches so that cfbot stays green.

Thanks for reattaching the patches!

For 0001, after reading the docs on PQresultErrorMessage(), I wonder if it
would be better to just use that to get the error message. Thoughts?

Regards,

--
Fujii Masao
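For context, PQresultErrorMessage() is the libpq call that returns the error message tied to a specific PGresult; unlike PQerrorMessage(), it is unaffected by later traffic on the connection, which would make the pg_strdup() copy unnecessary. A hedged sketch of how that could look (not the patch itself):

```
#include <stdio.h>
#include <libpq-fe.h>

/*
 * Hedged sketch: report the error tied to a specific failed PGresult after
 * draining the remaining results.  Because the message belongs to the
 * PGresult, later PQgetResult() calls cannot reset it, so no copy needs
 * to be made up front.
 */
static void
report_result_error(PGconn *conn, PGresult *failed)
{
	PGresult   *res;

	/* drain whatever is still queued on the connection */
	while ((res = PQgetResult(conn)) != NULL)
		PQclear(res);

	/* the message is still valid: it lives in "failed", not in the conn */
	fprintf(stderr, "error: %s", PQresultErrorMessage(failed));
	PQclear(failed);
}
```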
On Tue, 30 Sep 2025 13:46:11 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> For 0001, after reading the docs on PQresultErrorMessage(), I wonder if it
> would be better to just use that to get the error message. Thoughts?

Thank you for your suggestion.

I agree that it is better to use PQresultErrorMessage().
I had overlooked the existence of this interface.

I've attached the updated patches.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments
On Tue, Sep 30, 2025 at 3:17 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> I agree that it is better to use PQresultErrorMessage().
> I had overlooked the existence of this interface.
>
> I've attached the updated patches.

Thanks for updating the patches! I've pushed 0001.

Regarding 0002:

- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
  {
      if (verbose_errors)
          commandError(st, PQresultErrorMessage(res));
      goto error;

With this change, even non-SQL errors (e.g., connection failures) would
satisfy the condition when --continue-on-error is set. Isn't that a problem?
Shouldn't we also check that the error status is one that --continue-on-error
is meant to handle?

+ * Without --continue-on-error:
  *
  * failed (the number of failed transactions) =
  * 'serialization_failures' (they got a serialization error and were not
  * successfully retried) +
  * 'deadlock_failures' (they got a deadlock error and were not
  * successfully retried).
  *
+ * With --continue-on-error:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).

About the comments on failed transactions: I don't think we need to split
them into separate "with/without --continue-on-error" sections. How about
simplifying them like this?

------------------------
 * failed (the number of failed transactions) =
 * 'serialization_failures' (they got a serialization error and were not
 *                           successfully retried) +
 * 'deadlock_failures' (they got a deadlock error and were not
 *                      successfully retried) +
 * 'other_sql_failures' (they failed on the first try or after retries
 *                       due to a SQL error other than serialization or
 *                       deadlock; they are counted as a failed transaction
 *                       only when --continue-on-error is specified).
------------------------

 * 'retried' (number of all retried transactions) =
 * successfully retried transactions +
 * failed transactions.

Since transactions that failed on the first try (i.e., no retries) due to an
SQL error are not counted as 'retried', shouldn't this source comment be
updated?

Regards,

--
Fujii Masao
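One way to address this concern already appears upthread: gate the continue-on-error path on the specific error status instead of the flag alone, so connection failures still abort the client. Below is a self-contained miniature of that stricter transition (the enum and function names mirror pgbench.c, adapted from Nagata-san's earlier advanceConnectionState() hunk; a sketch, not a committed change):

```
#include <stdbool.h>

/* Minimal model of the pgbench states/statuses involved (names mirror pgbench.c) */
typedef enum { CSTATE_ERROR, CSTATE_ABORTED } ClientState;
typedef enum { ESTATUS_SERIALIZATION_ERROR, ESTATUS_DEADLOCK_ERROR,
			   ESTATUS_OTHER_SQL_ERROR } EStatus;

static bool
canRetryError(EStatus e)
{
	return e == ESTATUS_SERIALIZATION_ERROR || e == ESTATUS_DEADLOCK_ERROR;
}

/*
 * Sketch of the stricter transition: connection loss always aborts the
 * client; --continue-on-error only rescues ordinary SQL errors.
 */
static ClientState
next_state(bool connection_bad, EStatus estatus, bool continue_on_error)
{
	if (connection_bad)
		return CSTATE_ABORTED;	/* connection lost: give up on this client */
	if ((estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
		canRetryError(estatus))
		return CSTATE_ERROR;	/* roll back and start a new transaction */
	return CSTATE_ABORTED;
}
```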