Discussion: Re: Suggestion to add --continue-client-on-abort option to pgbench
Hi hackers,
I would like to suggest adding a new option to pgbench that enables a client
to continue processing transactions even if some errors occur during a
transaction.
Currently, a client stops sending requests when its transaction is aborted
for reasons other than serialization failures or deadlocks. I think in some
cases, especially when using custom scripts, the client should be able to
roll back the failed transaction and start a new one.
For example, my custom script (insert_to_unique_column.sql) is as follows:
```
CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
INSERT INTO test (col2) VALUES (random(0, 50000));
```
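(For reference, an illustrative reproduction, not from the original post: the failure mode here is the ordinary duplicate-key error, shown below with the default constraint name.)
```sql
-- Hypothetical psql session; the second INSERT collides with the first:
INSERT INTO test (col2) VALUES (42);
INSERT INTO test (col2) VALUES (42);
-- ERROR:  duplicate key value violates unique constraint "test_col2_key"
-- DETAIL:  Key (col2)=(42) already exists.
```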
Assume we need to continuously apply load to the server using 5 clients
for a certain period of time. However, a client sometimes stops when its
transaction in my custom script is aborted due to a unique constraint
violation. As a result, the load on the server is lower than expected,
which is the problem I want to address.
The proposed new option solves this problem. When
--continue-client-on-abort is set, the client rolls back the failed
transaction and starts a new one. This allows all 5 clients to
continuously apply load to the server, even if some transactions fail.
```
% bin/pgbench -d postgres -f ../insert_to_unique_column.sql -T 10
--failures-detailed --continue-client-on-error
transaction type: ../custom_script_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 10 s
number of transactions actually processed: 33552
number of failed transactions: 21901 (39.495%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other failures: 21901 (39.495%)
latency average = 0.180 ms (including failures)
initial connection time = 2.857 ms
tps = 3356.092385 (without initial connection time)
```
I have attached the patch. I would appreciate your feedback.
Best regards,
Rintaro Ikeda
NTT DATA Corporation Japan
Hi Rintaro,
Thanks for the patch and explanation. I understand your goal is to ensure that pgbench
clients continue running even when some transactions fail due to application-level errors (e.g., constraint violations), especially when running custom scripts.
However, I wonder if the intended behavior can't already be achieved using standard SQL constructs — specifically ON CONFLICT or careful transaction structure. For example, your sample script:
CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
INSERT INTO test (col2) VALUES (random(0, 50000));
can be rewritten as:
\set val random(0, 50000)
INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;
This avoids transaction aborts entirely in the presence of uniqueness violations and ensures the client continues to issue load without interruption. In many real-world benchmarking scenarios, this is the preferred and simplest approach.
So from that angle, could you elaborate on specific cases where this SQL-level workaround wouldn't be sufficient? Are there error types you intend to handle that cannot be gracefully avoided or recovered from using SQL constructs like ON CONFLICT, or SAVEPOINT/ROLLBACK TO?
Best regards,
Stepan Neretin
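(For readers unfamiliar with the pattern named above, a minimal plain-SQL sketch of SAVEPOINT/ROLLBACK TO, assuming the test table from the first message; values are illustrative. Note this is session-level SQL as an application would issue it — pgbench has no conditional error handling with which to drive the ROLLBACK TO, which is exactly the limitation the reply below picks up.)
```sql
BEGIN;
SAVEPOINT before_insert;
INSERT INTO test (col2) VALUES (42);   -- may fail with a unique violation
-- An application would issue the next line only after an error; it returns
-- the transaction to a usable state instead of aborting it outright:
ROLLBACK TO SAVEPOINT before_insert;
COMMIT;
```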
> On Sat, May 10, 2025 at 8:45 PM ikedarintarof <ikedarintarof@oss.nttdata.com> wrote:
>>
>> Hi hackers,
>>
>> I would like to suggest adding a new option to pgbench, which enables
>> the client to continue processing transactions even if some errors occur
>> during a transaction.
>> Currently, a client stops sending requests when its transaction is
>> aborted due to reasons other than serialization failures or deadlocks. I
>> think in some cases, especially when using custom scripts, the client
>> should be able to rollback the failed transaction and start a new one.
>>
>> For example, my custom script (insert_to_unique_column.sql) follows:
>> ```
>> CREATE TABLE IF NOT EXISTS test (col1 serial, col2 int unique);
>> INSERT INTO test (col2) VALUES (random(0, 50000));
>> ```
>> Assume we need to continuously apply load to the server using 5 clients
>> for a certain period of time. However, a client sometimes stops when its
>> transaction in my custom script is aborted due to a check constraint
>> violation. As a result, the load on the server is lower than expected,
>> which is the problem I want to address.
>>
>> The proposed new option solves this problem. When
>> --continue-client-on-abort is set to true, the client rolls back the
>> failed transaction and starts a new one. This allows all 5 clients to
>> continuously apply load to the server, even if some transactions fail.
+1. I've had similar cases before too, where I'd wanted pgbench to
continue creating load on the server even if a transaction failed
server-side for any reason. Sometimes, I'd even want that type of
load.
On Sat, 10 May 2025 at 17:02, Stepan Neretin <slpmcf@gmail.com> wrote:
> INSERT INTO test (col2) VALUES (random(0, 50000));
>
> can be rewritten as:
>
> \set val random(0, 50000)
> INSERT INTO test (col2) VALUES (:val) ON CONFLICT DO NOTHING;
That won't test the same execution paths, so an option to explicitly
rollback or ignore failed transactions (rather than stopping the
benchmark) would be a nice feature.
With e.g. ON CONFLICT DO NOTHING you'll have a much higher workload if
there are many conflicting entries, as that triggers and catches per-row
errors rather than per-statement ones. E.g. an INSERT INTO ... SELECT
feeding multiple rows could conflict on several of them, but will fail at
the first conflict. DO NOTHING would cause full execution of the SELECT
statement, which has an inherently different performance profile.
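(A concrete, hypothetical illustration of that difference, using the test table from upthread; row counts are arbitrary:)
```sql
-- Plain multi-row INSERT: the whole statement errors out at the first
-- duplicate, and no rows from the SELECT are kept:
INSERT INTO test (col2) SELECT g FROM generate_series(1, 1000) AS g;

-- With DO NOTHING the SELECT runs to completion; conflicts are detected
-- and skipped row by row, and every non-conflicting row is inserted:
INSERT INTO test (col2) SELECT g FROM generate_series(1, 1000) AS g
ON CONFLICT DO NOTHING;
```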
> This avoids transaction aborts entirely in the presence of uniqueness violations and ensures the client continues to issue load without interruption. In many real-world benchmarking scenarios, this is the preferred and simplest approach.
>
> So from that angle, could you elaborate on specific cases where this SQL-level workaround wouldn't be sufficient? Are there error types you intend to handle that cannot be gracefully avoided or recovered from using SQL constructs like ON CONFLICT, or SAVEPOINT/ROLLBACK TO?
The issue isn't necessarily whether you can construct SQL scripts that
don't raise such errors (indeed, it's possible to do so for nearly any
command; you can run pl/pgsql procedures or DO blocks which catch and
ignore errors), but rather whether we can make pgbench function in a
way that can keep load on the server even when it notices an error.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
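(To make the pl/pgsql route mentioned above concrete, a sketch under the thread's schema; ins_ignore is a hypothetical helper, not part of any patch in this thread.)
```sql
-- Absorb duplicate-key errors server-side, so the pgbench client never
-- sees a failed statement:
CREATE OR REPLACE FUNCTION ins_ignore(v int) RETURNS void AS $$
BEGIN
    INSERT INTO test (col2) VALUES (v);
EXCEPTION WHEN unique_violation THEN
    NULL;  -- swallow the error; the surrounding transaction stays healthy
END;
$$ LANGUAGE plpgsql;
```
A pgbench script would then run `SELECT ins_ignore(:val);` — which illustrates Matthias's point: the error can be hidden in SQL, but the server-side work and the error paths exercised are no longer the same.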
Hi Matthias,
Thanks for your detailed explanation — it really helped clarify the usefulness of the patch. I agree that the feature is indeed valuable, and it's great to see it being pushed forward.
Regarding the patch code, I noticed that there are duplicate case entries in the command-line option handling (specifically for case 18 or case ESTATUS_OTHER_SQL_ERROR, the continue-client-on-error option). These duplicated cases can be merged to simplify the logic and reduce redundancy.
Best regards,
Stepan Neretin
Re: Suggestion to add --continue-client-on-abort option to pgbench
On Tue, May 13, 2025 at 9:20 AM <Rintaro.Ikeda@nttdata.com> wrote:
> I also appreciate you for pointing out my mistakes in the previous version of the patch. I fixed the duplicated lines. I’ve attached the updated patch.
>
This is a useful feature, so +1 from my side. Here are some initial
comments on the patch while having a quick look.
1. You need to update the stats for this new counter in the
"accumStats()" function.
2. IMHO, "continue-on-error" is more user-friendly than
"continue-client-on-error".
3. There are a lot of whitespace errors, so those can be fixed. You
can just try to apply using git am, and it will report those
whitespace warnings. And for fixing, you can just use
"--whitespace=fix" along with git am.
Here's a diff adding the missing usage information for this option and, as Dilip mentioned, updating the new counter in the "accumStats()" function.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index baaf1379be2..20d456bc4b9 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -959,6 +959,8 @@ usage(void)
" --log-prefix=PREFIX prefix for transaction time log file\n"
" (default: \"pgbench_log\")\n"
" --max-tries=NUM max number of tries to run transaction (default: 1)\n"
+ " --continue-client-on-error\n"
+ " Continue and retry transactions that failed due to errors other than serialization or deadlocks.\n"
" --progress-timestamp use Unix epoch timestamps for progress\n"
" --random-seed=SEED set random seed (\"time\", \"rand\", integer)\n"
" --sampling-rate=NUM fraction of transactions to log (e.g., 0.01 for 1%%)\n"
@@ -1522,6 +1524,9 @@ accumStats(StatsData *stats, bool skipped, double lat, double lag,
case ESTATUS_DEADLOCK_ERROR:
stats->deadlock_failures++;
break;
+ case ESTATUS_OTHER_SQL_ERROR:
+ stats->other_sql_failures++;
+ break;
default:
/* internal error which should never occur */
pg_fatal("unexpected error status: %d", estatus);
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Ikeda-san,

Thanks for starting the new thread! I had never known about this issue before I heard it at PGConf.dev.

A few comments:

1. This parameter seems to be a type of benchmark option. So should we set benchmarking_option_set as well?

2. Not sure, but exit-on-abort seems to be a similar option. What if both are specified? Is it allowed?

3. Can you consider a test case for the new parameter?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Dear Kuroda-san, hackers,

On 2025/06/04 21:57, Hayato Kuroda (Fujitsu) wrote:
> Dear Ikeda-san,
>
> Thanks for starting the new thread! I had never known about this issue
> before I heard it at PGConf.dev.
>
> A few comments:
>
> 1. This parameter seems to be a type of benchmark option. So should we set
> benchmarking_option_set as well?
>
> 2. Not sure, but exit-on-abort seems to be a similar option. What if both
> are specified? Is it allowed?
>
> 3. Can you consider a test case for the new parameter?

Thank you for your valuable comments!

1. I should've also set benchmarking_option_set. I've modified it accordingly.

2. The exit-on-abort option and the continue-on-error option are mutually exclusive. Therefore, I've updated the patch to throw a FATAL error when the two options are set simultaneously. A corresponding explanation was also added. (I'm wondering whether the parameter should be named continue-on-abort so that users understand the two options are mutually exclusive.)

3. I've added the test.

Additionally, I modified the patch so that st->state does not transition to CSTATE_RETRY when a transaction fails and the continue-on-error option is enabled. In the previous patch, we retried the failed transaction up to max-tries times, which is unnecessary for our purpose: clients do not exit when their transactions fail.

I've attached the updated patch. v3-0001-Add-continue-on-error-option-to-pgbench.patch is identical to v4-0001-Add-continue-on-error-option-to-pgbench.patch. The v4-0002 patch is the diff from the previous patch.

Best regards,
Rintaro Ikeda
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Ikeda-san,

Thanks for updating the patch!

> 1. I should've also set benchmarking_option_set. I've modified it accordingly.

Confirmed it has been fixed. Thanks.

> 2. The exit-on-abort option and the continue-on-error option are mutually exclusive.
> Therefore, I've updated the patch to throw a FATAL error when the two options are
> set simultaneously. A corresponding explanation was also added.
> (I'm wondering whether the parameter should be named continue-on-abort so that
> users understand the two options are mutually exclusive.)

Makes sense, +1.

Here are new comments.

01. build failure

According to the cfbot [1], the documentation cannot be built. IIUC a </para> seems to be missing here:

```
+       <para>
+        Note that this option can not be used together with
+        <option>--exit-on-abort</option>.
+      </listitem>
+     </varlistentry>
```

02. patch separation

How about separating the patch series like:

0001 - contains option handling and retry part, and documentation
0002 - contains accumulation/reporting part
0003 - contains tests

I hope the above style is more helpful for reviewers.

03. documentation

```
+        Note that this option can not be used together with
+        <option>--exit-on-abort</option>.
```

I feel we should add a similar description to the `exit-on-abort` part.

04. documentation

```
+        Client rolls back the failed transaction and starts a new one when its
+        transaction fails due to the reason other than the deadlock and
+        serialization failure. This allows all clients specified with -c option
+        to continuously apply load to the server, even if some transactions fail.
```

I feel the description contains a somewhat redundant part and misses the default behavior. How about:

```
<para>
 Clients survive when their transactions are aborted, and they continue
 their run. Without the option, clients exit when transactions they run
 are aborted.
</para>
<para>
 Note that serialization failures or deadlock failures do not abort the
 client, so they are not affected by this option.
 See <xref linkend="failures-and-retries"/> for more information.
</para>
```

05. StatsData

```
+ * When continue-on-error option is specified,
+ * failed (the number of failed transactions) =
+ *   'other_sql_failures' (they got a error when continue-on-error option
+ *   was specified).
```

Let me confirm one point; can serialization_failures and deadlock_failures be counted when continue-on-error is true? If so, the comment seems incorrect to me. The formula would then be 'serialization_failures' + 'deadlock_failures' + 'other_sql_failures'.

06. StatsData

Another point; can other_sql_failures be counted when continue-on-error is NOT specified? I feel it should be...

07. usage()

The added line is too long. According to program_help_ok(), the output by --help should be less than 80 columns.

08. Please run pgindent/pgperltidy; I got some diffs.

[1]: https://cirrus-ci.com/task/5210061275922432

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Mon, 9 Jun 2025 09:34:03 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

> > 2. The exit-on-abort option and the continue-on-error option are mutually exclusive.
> > Therefore, I've updated the patch to throw a FATAL error when the two options are
> > set simultaneously. A corresponding explanation was also added.

I don't think that's right, since "abort" and "error" are different concepts in pgbench. (Here, "abort" refers to the termination of a client, not a transaction abort.)

The --exit-on-abort option forces pgbench to exit immediately when any client is aborted due to some error. When the --continue-on-error option is not set, SQL errors other than deadlock or serialization errors cause a client to be aborted. On the other hand, when the option is set, clients are not aborted by any SQL errors; instead they continue to run after them. However, clients can still be aborted for other reasons, such as connection failures or meta-command errors (e.g., \set x 1/0). In these cases, the --exit-on-abort option remains useful even when --continue-on-error is enabled.

> > (I'm wondering whether the parameter should be named continue-on-abort so that
> > users understand the two options are mutually exclusive.)

For the same reason as above, I believe --continue-on-error is a more accurate description of the option's behavior.

> 02. patch separation
> How about separating the patch series like:
>
> 0001 - contains option handling and retry part, and documentation
> 0002 - contains accumulation/reporting part
> 0003 - contains tests
>
> I hope the above style is more helpful for reviewers.

I'm not sure whether it's necessary to split the patch, as the change doesn't seem very complex. However, the current separation appears inconsistent. For example, patch 0001 modifies canRetryError(), but patch 0002 reverts that change, and so on.

> 04. documentation
> [suggested rewording of the --continue-on-error description, quoted above]

I think we can make it clearer as follows:

    Allows clients to continue their run even if an SQL statement fails due to
    errors other than serialization or deadlock. Without this option, the
    client is aborted after such errors.

    Note that serialization and deadlock failures never cause the client to be
    aborted, so they are not affected by this option.
    See <xref linkend="failures-and-retries"/> for more information.

That said, a review by a native English speaker would still be appreciated.

Also, we would need to update several parts of the documentation. For example, the "Failures and Serialization/Deadlock Retries" section should be revised to describe the behavior change. In addition, we should update the explanations of the output result examples and logging, the description of the --failures-detailed option, and so on. If transactions are not retried after SQL errors other than serialization or deadlock, this should also be explicitly documented.

> 05. StatsData
> Let me confirm one point; can serialization_failures and deadlock_failures be
> counted when continue-on-error is true? If so, the comment seems incorrect to me.
> The formula would then be 'serialization_failures' + 'deadlock_failures' +
> 'other_sql_failures'.

+1

> 06. StatsData
> Another point; can other_sql_failures be counted when continue-on-error is NOT
> specified? I feel it should be...

We could do that. However, if an SQL error other than a serialization or deadlock error occurs when --continue-on-error is not set, pgbench will be aborted midway and the printed results will be incomplete. Therefore, this might not make much sense.

> 07. usage()
> The added line is too long. According to program_help_ok(), the output by --help
> should be less than 80 columns.

+1

Here are additional comments from me.

@@ -4548,6 +4570,8 @@ getResultString(bool skipped, EStatus estatus)
 			return "serialization";
 		case ESTATUS_DEADLOCK_ERROR:
 			return "deadlock";
+		case ESTATUS_OTHER_SQL_ERROR:
+			return "error (except serialization/deadlock)";

Strings returned by getResultString() are printed in the "time" field of the log when both the -l and --failures-detailed options are set. Therefore, they should be single words that do not contain any space characters. I wonder if something like "other" or "other_sql_error" would be appropriate.

@@ -4099,6 +4119,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				 * can retry the error.
 				 */
 				st->state = timer_exceeded ? CSTATE_FINISHED :
+					continue_on_error ? CSTATE_FAILURE :
 					doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;
 			}
 			else

This fix is not necessary because doRetry() (and canRetryError(), which is called within it) will return false when continue_on_error is set (after applying patch 0002).

			case PGRES_NONFATAL_ERROR:
			case PGRES_FATAL_ERROR:
				st->estatus = getSQLErrorStatus(PQresultErrorField(res,
																   PG_DIAG_SQLSTATE));
				if (canRetryError(st->estatus))
				{
					if (verbose_errors)
						commandError(st, PQerrorMessage(st->con));
					goto error;
				}
				/* fall through */

			default:
				/* anything else is unexpected */
				pg_log_error("client %d script %d aborted in command %d query %d: %s",
							 st->id, st->use_file, st->command, qrynum,
							 PQerrorMessage(st->con));
				goto error;
		}

When an SQL error other than a serialization or deadlock error occurs, an error message is output via pg_log_error in this code path. However, I think this should be reported only when verbose_errors is set, similar to how serialization and deadlock errors are handled when --continue-on-error is enabled.

Best regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Nagata-san,

> I don't think that's right, since "abort" and "error" are different concepts in pgbench.
> (Here, "abort" refers to the termination of a client, not a transaction abort.)
>
> The --exit-on-abort option forces pgbench to exit immediately when any client is
> aborted due to some error. [...] In these cases, the --exit-on-abort option remains
> useful even when --continue-on-error is enabled.

To clarify: another approach is to allow the --continue-on-error option to continue running even when clients meet such errors. Which one is better?

> I'm not sure whether it's necessary to split the patch, as the change doesn't seem
> very complex. However, the current separation appears inconsistent. For example,
> patch 0001 modifies canRetryError(), but patch 0002 reverts that change, and so on.

Either way is fine for me if it is changed from the current method.

> I think we can make it clearer as follows: [...]

I do not have confidence in my English; a native speaker is needed...

> > 07. usage()
> > The added line is too long. According to program_help_ok(), the output by --help
> > should be less than 80 columns.
>
> +1

FYI - I posted a patch which adds the test [1]. You can apply it and confirm what the function says.

[1]: https://www.postgresql.org/message-id/OSCPR01MB1496610451F5896375B2562E6F56BA%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Tue, 17 Jun 2025 03:47:00 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

> Dear Nagata-san,
>
> > The --exit-on-abort option forces pgbench to exit immediately when any client is
> > aborted due to some error. [...] In these cases, the --exit-on-abort option
> > remains useful even when --continue-on-error is enabled.
>
> To clarify: another approach is to allow the --continue-on-error option to continue
> running even when clients meet such errors. Which one is better?

It might be worth discussing which types of errors this option should allow pgbench to continue after. On my understanding, the current patch's goal is to allow only SQL-level errors like constraint violations. It seems good because this could simulate the behaviour of applications that ignore or retry such errors (although they are not retried in the current patch). Perhaps it makes sense to allow continuing after some network errors, because it would enable benchmarks using a cluster system such as a cloud service that could report a temporary error during a failover.

It might be worth discussing which types of errors this option should allow pgbench to continue after.

As I understand it, the current patch aims to allow continuation only after SQL-level errors, such as constraint violations. That seems reasonable, as it can simulate the behavior of applications that ignore or retry such errors (even though retries are not implemented in the current patch).

Perhaps it also makes sense to allow continuation after certain network errors, as this would enable benchmarking with cluster systems or cloud services, which might report temporary errors during a failover. We would need additional work to properly detect and handle network errors, though.

However, I'm not sure it's reasonable to allow continuation after other types of errors, such as misuse of meta-commands or unexpected errors during their execution, since these wouldn't simulate any real application behavior and would more likely indicate a failure in the benchmarking process itself.

Best regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
On Tue, 17 Jun 2025 16:28:28 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

> It might be worth discussing which types of errors this option should allow pgbench
> to continue after. On my understanding, the current patch's goal is to allow only
> SQL-level errors like constraint violations. [...] Perhaps it makes sense to allow
> continuing after some network errors, because it would enable benchmarks using a
> cluster system such as a cloud service that could report a temporary error during
> a failover.

I apologize for accidentally leaving the draft paragraph just above in my previous post. Please ignore it.

> It might be worth discussing which types of errors this option should allow pgbench
> to continue after.
>
> As I understand it, the current patch aims to allow continuation only after SQL-level
> errors, such as constraint violations. That seems reasonable, as it can simulate the
> behavior of applications that ignore or retry such errors (even though retries are not
> implemented in the current patch).
>
> Perhaps it also makes sense to allow continuation after certain network errors, as this
> would enable benchmarking with cluster systems or cloud services, which might report
> temporary errors during a failover. We would need additional work to properly detect
> and handle network errors, though.
>
> However, I'm not sure it's reasonable to allow continuation after other types of errors,
> such as misuse of meta-commands or unexpected errors during their execution, since these
> wouldn't simulate any real application behavior and would more likely indicate a failure
> in the benchmarking process itself.

--
Yugo Nagata <nagata@sraoss.co.jp>
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Nagata-san,

> As I understand it, the current patch aims to allow continuation only after SQL-level
> errors, such as constraint violations. That seems reasonable, as it can simulate the
> behavior of applications that ignore or retry such errors (even though retries are not
> implemented in the current patch).

Yes, no one has objections to retrying in this case. This is the main part of the proposal.

> However, I'm not sure it's reasonable to allow continuation after other types of errors,
> such as misuse of meta-commands or unexpected errors during their execution, since these
> wouldn't simulate any real application behavior and would more likely indicate a failure
> in the benchmarking process itself.

I have a concern about the \gset meta-command. According to the docs and source code, \gset assumes that the executed command surely returns one tuple:

```
if (meta == META_GSET && ntuples != 1)
{
    /* under \gset, report the error */
    pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
                 st->id, st->use_file, st->command, qrynum, PQntuples(res));
    st->estatus = ESTATUS_META_COMMAND_ERROR;
    goto error;
}
```

But sometimes the SQL may not be able to return tuples, or may return multiple ones, due to concurrent transactions. I feel retrying the transaction is very useful in this case.

Anyway, we must confirm the opinion of the proposer.

[1]: https://github.com/ryogrid/tpcc_like_with_pgbench

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, 26 Jun 2025 05:45:12 +0000
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote:

> Dear Nagata-san,
>
> > As I understand it, the current patch aims to allow continuation only after
> > SQL-level errors, such as constraint violations. [...]
>
> Yes, no one has objections to retrying in this case. This is the main part of
> the proposal.

As I understand it, the proposed --continue-on-error option does not retry the transaction in any case; it simply gives up on the transaction. That is, when an SQL-level error occurs, the transaction is reported as "failed" rather than "retried", and the random state is discarded.

> I have a concern about the \gset meta-command. According to the docs and source
> code, \gset assumes that the executed command surely returns one tuple. [...]
>
> But sometimes the SQL may not be able to return tuples, or may return multiple
> ones, due to concurrent transactions. I feel retrying the transaction is very
> useful in this case.

You can use the \aset command instead to avoid the pgbench error. If the query doesn't return any row, a subsequent SQL command trying to use the variable will fail, but this would be ignored without terminating the benchmark when the --continue-on-error option is enabled.

> Anyway, we must confirm the opinion of the proposer.

+1

Best regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
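(A minimal sketch of the \gset vs. \aset behavior described above, assuming the test table from upthread; the queries are illustrative, and the note about the unset variable follows the explanation just given:)
```sql
\set val random(0, 50000)
-- \gset requires exactly one returned row; zero or many rows raise a
-- meta-command error, which aborts the client even with --continue-on-error:
SELECT col1 FROM test WHERE col2 = :val \gset
-- \aset simply sets no variables when zero rows come back; per the note
-- above, a later use of the unset :col1 then fails as an ordinary error
-- rather than terminating the benchmark when --continue-on-error is enabled:
SELECT col1 FROM test WHERE col2 = :val \aset
UPDATE test SET col2 = col2 WHERE col1 = :col1;
```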
Hi,

Thank you very much for your valuable comments and kind advice. I'm currently working on revising the previous patch based on the feedback received. I would like to share my thoughts regarding the conditions under which the --continue-on-error option should initiate a new transaction or a new connection.

In my opinion, when the --continue-on-error option is enabled, pgbench clients do not need to start new transactions after network errors or other errors except for SQL-level errors. Network errors are relatively rare, except in failover scenarios. Outside of failover, any network issues should be resolved rather than worked around. In the context of failover, the key metric is not TPS but system downtime. While one might infer the timing of a failover by using the --progress option, you can easily determine the downtime by executing a simple SQL query such as `psql -c 'SELECT 1'` every second.

On 2025/06/26 18:47, Yugo Nagata wrote:
> As I understand it, the proposed --continue-on-error option does not retry the
> transaction in any case; it simply gives up on the transaction. That is, when an
> SQL-level error occurs, the transaction is reported as "failed" rather than
> "retried", and the random state is discarded.

Retrying the failed transaction is not necessary when the transaction failed due to SQL-level errors. Unlike real-world applications, pgbench does not need to complete a specific transaction successfully. In the case of unique constraint violations, retrying the same transaction will likely result in the same error again.

I want to hear your thoughts on this.

Best regards,
Rintaro Ikeda
On Fri, 27 Jun 2025 14:06:24 +0900
ikedarintarof <ikedarintarof@oss.nttdata.com> wrote:

> In my opinion, when the --continue-on-error option is enabled, pgbench clients
> do not need to start new transactions after network errors or other errors
> except for SQL-level errors.

+1

I agree that --continue-on-error prevents pgbench from terminating only when SQL-level errors occur, and does not change the behavior in the case of other types of errors, including network errors.

> Retrying the failed transaction is not necessary when the transaction failed
> due to SQL-level errors. Unlike real-world applications, pgbench does not need
> to complete a specific transaction successfully. In the case of unique
> constraint violations, retrying the same transaction will likely result in the
> same error again.

Agreed.

Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Nagata-san, Ikeda-san,

> > In my opinion, when the --continue-on-error option is enabled, pgbench clients
> > do not need to start new transactions after network errors or other errors
> > except for SQL-level errors.
>
> +1
>
> I agree that --continue-on-error prevents pgbench from terminating only when
> SQL-level errors occur, and does not change the behavior in the case of other
> types of errors, including network errors.

OK, so let's do it like that. BTW, initially we were discussing the combination of --continue-on-error and --exit-on-abort. What is the conclusion? I feel Nagata-san's point [1] is valid in this approach.

> > > As I understand it, the proposed --continue-on-error option does not retry the
> > > transaction in any case; it simply gives up on the transaction. That is, when
> > > an SQL-level error occurs, the transaction is reported as "failed" rather than
> > > "retried", and the random state is discarded.
> >
> > Retrying the failed transaction is not necessary when the transaction failed
> > due to SQL-level errors. Unlike real-world applications, pgbench does not need
> > to complete a specific transaction successfully. In the case of unique
> > constraint violations, retrying the same transaction will likely result in the
> > same error again.

I intended here that clients could throw away the failed transaction and start a new one again in that case. I hope we are on the same page...

[1]: https://www.postgresql.org/message-id/20250614002453.5c72f2ec80864d840150a642%40sraoss.co.jp

Best regards,
Hayato Kuroda
FUJITSU LIMITED
RE: Suggestion to add --continue-client-on-abort option to pgbench
Dear Ikeda-san, Nagata-san,

Thanks for updating the patch!

> > Could I confirm what you mean by "start new one"?
> >
> > In the current pgbench, when a query raises an error (a deadlock or
> > serialization failure), it can be retried using the same random state.
> > This typically means the query will be retried with the same parameter values.
> >
> > On the other hand, when the query ultimately fails (possibly after some retries),
> > the transaction is marked as a "failure", and the next transaction starts with a
> > new random state (i.e., with new parameter values).
> >
> > Therefore, if a query fails due to a unique constraint violation and is retried
> > with the same parameters, it will keep failing on each retry.
>
> Thank you for your explanation. I understand it as you described. I've also
> attached a schematic diagram of the state machine. I hope it will help clarify
> the behavior of pgbench. Red arrows represent the transition of state when an
> SQL command fails and the --continue-on-error option is specified.

Thanks for the diagram, it's quite helpful. Let me share my understanding and opinion.

The terminology "retry" is being used for the transition CSTATE_ERROR->CSTATE_RETRY, and there the random state is restored to what it was at the beginning:

```
/*
 * Reset the random state as they were at the beginning of the
 * transaction.
 */
st->cs_func_rs = st->random_state;
```

In the --continue-on-error case, the transition CSTATE_WAIT_RESULT->CSTATE_ERROR can happen even when the reason for the failure is not serialization or deadlock. Ultimately the path will reach ...->CSTATE_END_TX->CSTATE_CHOOSE_SCRIPT, the beginning of the state machine. cs_func_rs is not overwritten along that route, so a different random value will be generated, or even another script may be chosen. Is that correct?

And I feel this behavior is OK. The most likely failure here is a unique constraint violation. Clients should roll the dice again; otherwise they would face the same error again.

Below are my comments on the latest patch.

01.
```
$ git am ../patches/pgbench/v5-0001-Add-continue-on-error-option-to-pgbench.patch
Applying: When the option is set, client rolls back the failed transaction and...
.git/rebase-apply/patch:65: trailing whitespace.
          <literal>serialization</literal>, <literal>deadlock</literal>, or
.git/rebase-apply/patch:139: trailing whitespace.
         <option>--max-tries</option> option is not equal to 1 and
warning: 2 lines add whitespace errors.
```

I got warnings when I applied the patch. Please fix them.

02.
```
+ *   'serialization_failures' + 'deadlock_failures' +
+ *   'other_sql_failures' (they got a error when continue-on-error option
```
The first line has a tab, but it should be a normal blank.

03.
```
+ else if (continue_on_error | canRetryError(st->estatus))
```

I feel "|" should be "||".

04.
```
<term><replaceable>retries</replaceable></term>
<listitem>
 <para>
  number of retries after serialization or deadlock errors
  (zero unless <option>--max-tries</option> is not equal to one)
 </para>
</listitem>
```

To confirm: --continue-on-error won't be counted here because it is not a "retry", in other words, it does not reach CSTATE_RETRY, right?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hi, Thank you for the kind comments. I've updated the previous patch. Below is a summary of the changes: 1. The code path and documentation have been corrected based on your feedback. 2. The following message is now suppressed by default. Instead, an error message is added when a client aborts during SQL execution. (v6-0003-Suppress-xxx.patch) ``` if (verbose_errors) pg_log_error("client %d script %d aborted in command %d query %d: %s", st->id, st->use_file, st->command, qrynum, PQerrorMessage(st->con)); ``` On 2025/07/04 22:01, Hayato Kuroda (Fujitsu) wrote: >>> Could I confirm what you mean by "start new one"? >>> >>> In the current pgbench, when a query raises an error (a deadlock or >>> serialization failure), it can be retried using the same random state. >>> This typically means the query will be retried with the same parameter values. >>> >>> On the other hand, when the query ultimately fails (possibly after some retries), >>> the transaction is marked as a "failure", and the next transaction starts with a >>> new random state (i.e., with new parameter values). >>> >>> Therefore, if a query fails due to a unique constraint violation and is retried >>> with the same parameters, it will keep failing on each retry. >> >> Thank you for your explanation. I understand it as you described. I've also >> attached a schematic diagram of the state machine. I hope it will help clarify >> the behavior of pgbench. Red arrows represent the transition of state when SQL >> command fails and --continue-on-error option is specified. > > Thanks for the diagram, it's quite helpful. Let me share my understanding and opinion. > > The terminology "retry" is being used for the transition CSTATE_ERROR->CSTATE_RETRY, > and here the random state would be restored to be the begining: > > ``` > /* > * Reset the random state as they were at the beginning of the > * transaction. > */ > st->cs_func_rs = st->random_state; > ``` > > In --continue-on-error case, the transaction CSTATE_WAIT_RESULT->CSTATE_ERROR > can happen even the reason of failure is not serialization and deadlock. > Ultimately the pass will reach ...->CSTATE_END_TX->CSTATE_CHOOSE_SCRIPT, the > beginning of the state machine. cs_func_rs is not overwritten in the route so > that different random value would be generated, or even another script may be > chosen. Is it correct? Yes, I believe that’s correct. > > 01. > ``` > $ git am ../patches/pgbench/v5-0001-Add-continue-on-error-opt > ion-to-pgbench.patch > Applying: When the option is set, client rolls back the failed transaction and... > .git/rebase-apply/patch:65: trailing whitespace. > <literal>serialization</literal>, <literal>deadlock</literal>, or > .git/rebase-apply/patch:139: trailing whitespace. > <option>--max-tries</option> option is not equal to 1 and > warning: 2 lines add whitespace errors. > ``` > > I got warnings when I applied the patch. Please fix it. It's been fixed. > > 02. > ``` > + * 'serialization_failures' + 'deadlock_failures' + > + * 'other_sql_failures' (they got a error when continue-on-error option > ``` > The first line has the tab, but it should be normal blank. I hadn't noticed it. It's fixed. > 03. > ``` > + else if (continue_on_error | canRetryError(st->estatus)) > ``` > > I feel "|" should be "||". Thank you for pointing out. Fixed it. > 04. 
> ``` > <term><replaceable>retries</replaceable></term> > <listitem> > <para> > number of retries after serialization or deadlock errors > (zero unless <option>--max-tries</option> is not equal to one) > </para> > </listitem> > ``` > > To confirm; --continue-on-error won't be counted here because it is not "retry", > in other words, it does not reach CSTATE_RETRY, right? Yes. I agree with Nagata-san [1] — --continue-on-error is not considered a "retry" because it doesn't reach CSTATE_RETRY. On 2025/07/05 0:03, Yugo Nagata wrote: >>> case PGRES_NONFATAL_ERROR: >>> case PGRES_FATAL_ERROR: >>> st->estatus = getSQLErrorStatus(PQresultErrorField(res, >>> PG_DIAG_SQLSTATE)); >>> if (canRetryError(st->estatus)) >>> { >>> if (verbose_errors) >>> commandError(st, PQerrorMessage(st->con)); >>> goto error; >>> } >>> /* fall through */ >>> >>> default: >>> /* anything else is unexpected */ >>> pg_log_error("client %d script %d aborted in command %d query %d: %s", >>> st->id, st->use_file, st->command, qrynum, >>> PQerrorMessage(st->con)); >>> goto error; >>> } >>> >>> When an SQL error other than a serialization or deadlock error occurs, an error message is >>> output via pg_log_error in this code path. However, I think this should be reported only >>> when verbose_errors is set, similar to how serialization and deadlock errors are handled when >>> --continue-on-error is enabled >> >> I think the error message logged via pg_log_error is useful when verbose_errors >> is not specified, because it informs users that the client has exited. Without >> it, users may not notice that something went wrong. > > However, if a large number of errors occur, this could result in a significant increase > in stderr output during the benchmark. > > Users can still notice that something went wrong by checking the “number of other failures” > reported after the run, and I assume that in most cases, when --continue-on-error is enabled, > users aren’t particularly interested in seeing individual error messages as they happen. > > It’s true that seeing error messages during the benchmark might be useful in some cases, but > the same could be said for serialization or deadlock errors, and that’s exactly what the > --verbose-errors option is for. I understand your concern. The condition for calling pg_log_error() was modified to reduce stderr output. Additionally, an error message was added for cases where some clients aborted while executing SQL commands, similar to other code paths that transition to st->state = CSTATE_ABORTED, as shown in the example below: ``` pg_log_error("client %d aborted while establishing connection", st->id); st->state = CSTATE_ABORTED; ``` > Here are some comments on the patch. > > (1) > > } > - else if (canRetryError(st->estatus)) > + else if (continue_on_error | canRetryError(st->estatus)) > st->state = CSTATE_ERROR; > else > st->state = CSTATE_ABORTED; > > Due to this change, when --continue-on-error is enabled, st->state is set to > CSTATE_ERROR regardless of the type of error returned by readCommandResponse. > When the error is not ESTATUS_OTHER_SQL_ERROR, e.g. ESTATUS_META_COMMAND_ERROR > due to a failure of \gset with query returning more the one row. > > Therefore, this should be like: > > else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) || > canRetryError(st->estatus)) > Thanks for pointing that out — I’ve corrected it. 
> (2) > > + " --continue-on-error continue processing transations after a trasaction fails\n" > > "trasaction" is a typo and including "transaction" twice looks a bit redundant. > Instead using the word "transaction", how about: > > "--continue-on-error continue running after an SQL error" ? > > This version is shorter, avoids repetition, and describes well the actual behavior when > SQL statements fail. Fixed it. > (3) > > - * A failed transaction is defined as unsuccessfully retried transactions. > + * A failed transaction is defined as unsuccessfully retried transactions > + * unless continue-on-error option is specified. > * It can be one of two types: > * > * failed (the number of failed transactions) = > @@ -411,6 +412,12 @@ typedef struct StatsData > * 'deadlock_failures' (they got a deadlock error and were not > * successfully retried). > * > + * When continue-on-error option is specified, > + * failed (the number of failed transactions) = > + * 'serialization_failures' + 'deadlock_failures' + > + * 'other_sql_failures' (they got a error when continue-on-error option > + * was specified). > + * > > To explain explicitly that there are two definitions of failed transactions > depending on the situation, how about: > > """ > A failed transaction is counted differently depending on whether > the --continue-on-error option is specified. > > Without --continue-on-error: > > failed (the number of failed transactions) = > 'serialization_failures' (they got a serialization error and were not > successfully retried) + > 'deadlock_failures' (they got a deadlock error and were not > successfully retried). > > When --continue-on-error is specified: > > failed (number of failed transactions) = > 'serialization_failures' + 'deadlock_failures' + > 'other_sql_failures' (they got some other SQL error; the transaction was > not retried and counted as failed due to > --continue-on-error). > """ Thank you for your suggestion. I modified it accordingly. > (4) > + int64 other_sql_failures; /* number of failed transactions for > + * reasons other than > + * serialization/deadlock failure , which > + * is enabled if --continue-on-error is > + * used */ > > Is "counted" is more proper than "enabled" here? Fixed. > > Af for the documentations: > (5) > The next line reports the number of failed transactions due to > - serialization or deadlock errors (see <xref linkend="failures-and-retries"/> > - for more information). > + serialization or deadlock errors by default (see > + <xref linkend="failures-and-retries"/> for more information). > > Would it be more readable to simply say: > "The next line reports the number of failed transactions (see ... for more information), > since definition of "failed transaction" has become a bit messy? > I fixed it to the simple explanation. > (6) > connection with the database server was lost or the end of script was reached > without completing the last transaction. In addition, if execution of an SQL > or meta command fails for reasons other than serialization or deadlock errors, > - the client is aborted. Otherwise, if an SQL command fails with serialization or > - deadlock errors, the client is not aborted. In such cases, the current > - transaction is rolled back, which also includes setting the client variables > - as they were before the run of this transaction (it is assumed that one > - transaction script contains only one transaction; see > - <xref linkend="transactions-and-scripts"/> for more information). > + the client is aborted by default. 
> + However, if the --continue-on-error option
> + is specified, the client does not abort and proceeds to the next transaction
> + regardless of the error. This case is reported as other failures in the output.
> + Otherwise, if an SQL command fails with serialization or deadlock errors, the
> + client is not aborted. In such cases, the current transaction is rolled back,
> + which also includes setting the client variables as they were before the run
> + of this transaction (it is assumed that one transaction script contains only
> + one transaction; see <xref linkend="transactions-and-scripts"/> for more information).
>
> To emphasize the default behavior, I wonder if it would be better to move "by default"
> to the beginning of the statements, like:
>
> "By default, if execution of an SQL or meta command fails for reasons other than
> serialization or deadlock errors, the client is aborted."
>
> How about quoting "other failures"? Like:
>
> "These cases are reported as "other failures" in the output."
>
> Also, I feel the meaning of "Otherwise" has become somewhat unclear since the
> explanation of --continue-on-error was added between the sentences. So, how about
> clarifying that the clients are not aborted due to serialization/deadlock errors even
> without --continue-on-error? For example:
>
> "In contrast, if an SQL command fails with serialization or deadlock errors, the
> client is not aborted even without <option>--continue-on-error</option>.
> Instead, the current transaction is rolled back, which also includes setting
> the client variables as they were before the run of this transaction
> (it is assumed that one transaction script contains only
> one transaction; see <xref linkend="transactions-and-scripts"/> for more information)."
>

I've modified it according to your suggestion.

> (7)
> The main report contains the number of failed transactions. If the
> - <option>--max-tries</option> option is not equal to 1, the main report also
> + <option>--max-tries</option> option is not equal to 1 and
> + <option>--continue-on-error</option> is not specified, the main report also
> contains statistics related to retries: the total number of retried
>
> Is that true?
> The retries statistics would be included even without --continue-on-error.

That was wrong. I corrected it.

[1] https://www.postgresql.org/message-id/20250705002239.27e6e5a4ba22c047ac2fa16a%40sraoss.co.jp

Regards,
Rintaro Ikeda
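To make the two definitions concrete: since the new counter stays zero unless the option is given, the agreed accounting reduces to a single sum. A minimal sketch, assuming the StatsData field names quoted above (the helper name is illustrative, not part of the patch):

```
/*
 * Sketch of the failed-transaction accounting discussed above.
 * serialization_failures and deadlock_failures are existing StatsData
 * counters; other_sql_failures is the counter added by the patch and
 * remains zero unless --continue-on-error is specified, so the same
 * sum expresses both definitions of "failed".
 */
static int64
numFailedTransactions(const StatsData *stats)
{
    return stats->serialization_failures +
           stats->deadlock_failures +
           stats->other_sql_failures;
}
```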
On Wed, 9 Jul 2025 23:58:32 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:

> Hi,
>
> Thank you for the kind comments.
>
> I've updated the previous patch.

Thank you for updating the patch!

> > However, if a large number of errors occur, this could result in a significant increase
> > in stderr output during the benchmark.
> >
> > Users can still notice that something went wrong by checking the “number of other failures”
> > reported after the run, and I assume that in most cases, when --continue-on-error is enabled,
> > users aren’t particularly interested in seeing individual error messages as they happen.
> >
> > It’s true that seeing error messages during the benchmark might be useful in some cases, but
> > the same could be said for serialization or deadlock errors, and that’s exactly what the
> > --verbose-errors option is for.
>
> I understand your concern. The condition for calling pg_log_error() was modified
> to reduce stderr output.
> Additionally, an error message was added for cases where a client aborts while
> executing SQL commands, similar to other code paths that transition to
> st->state = CSTATE_ABORTED, as shown in the example below:
>
> ```
> pg_log_error("client %d aborted while establishing connection", st->id);
> st->state = CSTATE_ABORTED;
> ```

default:
/* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ if (verbose_errors)
+ pg_log_error("client %d script %d aborted in command %d query %d: %s",
+ st->id, st->use_file, st->command, qrynum,
+ PQerrorMessage(st->con));
goto error;
}

Thanks to this fix, error messages caused by SQL errors are now output only when
--verbose-errors is enabled. However, the comment describes the condition as "unexpected",
and the message states that the client was "aborted". This does not seem accurate, since
clients are not aborted due to SQL errors when --continue-on-error is enabled.

I think the error message should be emitted using commandError() when both
--continue-on-error and --verbose-errors are specified, like this:

case PGRES_NONFATAL_ERROR:
case PGRES_FATAL_ERROR:
st->estatus = getSQLErrorStatus(PQresultErrorField(res,
PG_DIAG_SQLSTATE));
if (continue_on_error || canRetryError(st->estatus))
{
if (verbose_errors)
commandError(st, PQerrorMessage(st->con));
goto error;
}
/* fall through */

In addition, the error message in the "default" case should be shown regardless of
--verbose-errors, since it represents an unexpected situation and should always be
reported.

Finally, I believe this fix should be included in patch 0001 rather than 0003,
as it would be a part of the implementation of --continue-on-error.


As for 0003:

+ {
+ pg_log_error("client %d aborted while executing SQL commands", st->id);
st->state = CSTATE_ABORTED;
+ }
break;

I understand that the patch is not directly related to --continue-on-error, similar to 0002,
and that it aims to improve the error message to indicate that the client was aborted due to
some error during readCommandResponse().

However, this message doesn't seem entirely accurate, since the error is not always caused
by an SQL command failure itself. For example, it could also be due to a failure of the \gset
meta-command.

In addition, this fix causes error messages to be emitted twice.
For example, if \gset fails, the following similar messages are printed:

pgbench: error: client 0 script 0 command 0 query 0: expected one row, got 0
pgbench: error: client 0 aborted while executing SQL commands

Even worse, if an unexpected error occurs in readCommandResponse() (i.e., the default case),
the following messages are emitted, both indicating that the client was aborted:

pgbench: error: client 0 script 0 aborted in command ... query ...
pgbench: error: client 0 aborted while executing SQL commands

I feel this is a bit redundant.

Therefore, if we are to improve these messages to indicate explicitly that the client
was aborted, I would suggest modifying the error messages in readCommandResponse() rather
than adding a new one in advanceConnectionState().

I've attached patch 0003 incorporating my suggestion. What do you think?

Additionally, the patch 0001 includes the fix that was originally part of your proposed 0003,
as previously discussed.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Hi,

On 2025/07/10 18:17, Yugo Nagata wrote:
> On Wed, 9 Jul 2025 23:58:32 +0900
> Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:
>
>> Hi,
>>
>> Thank you for the kind comments.
>>
>> I've updated the previous patch.
>
> Thank you for updating the patch!
>
>>> However, if a large number of errors occur, this could result in a significant increase
>>> in stderr output during the benchmark.
>>>
>>> Users can still notice that something went wrong by checking the “number of other failures”
>>> reported after the run, and I assume that in most cases, when --continue-on-error is enabled,
>>> users aren’t particularly interested in seeing individual error messages as they happen.
>>>
>>> It’s true that seeing error messages during the benchmark might be useful in some cases, but
>>> the same could be said for serialization or deadlock errors, and that’s exactly what the
>>> --verbose-errors option is for.
>>
>>
>> I understand your concern. The condition for calling pg_log_error() was modified
>> to reduce stderr output.
>> Additionally, an error message was added for cases where a client aborts while
>> executing SQL commands, similar to other code paths that transition to
>> st->state = CSTATE_ABORTED, as shown in the example below:
>>
>> ```
>> pg_log_error("client %d aborted while establishing connection", st->id);
>> st->state = CSTATE_ABORTED;
>> ```
>
> default:
> /* anything else is unexpected */
> - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> - st->id, st->use_file, st->command, qrynum,
> - PQerrorMessage(st->con));
> + if (verbose_errors)
> + pg_log_error("client %d script %d aborted in command %d query %d: %s",
> + st->id, st->use_file, st->command, qrynum,
> + PQerrorMessage(st->con));
> goto error;
> }
>
> Thanks to this fix, error messages caused by SQL errors are now output only when
> --verbose-errors is enabled. However, the comment describes the condition as "unexpected",
> and the message states that the client was "aborted". This does not seem accurate, since
> clients are not aborted due to SQL errors when --continue-on-error is enabled.
>
> I think the error message should be emitted using commandError() when both
> --continue-on-error and --verbose-errors are specified, like this:
>
> case PGRES_NONFATAL_ERROR:
> case PGRES_FATAL_ERROR:
> st->estatus = getSQLErrorStatus(PQresultErrorField(res,
> PG_DIAG_SQLSTATE));
> if (continue_on_error || canRetryError(st->estatus))
> {
> if (verbose_errors)
> commandError(st, PQerrorMessage(st->con));
> goto error;
> }
> /* fall through */
>
> In addition, the error message in the "default" case should be shown regardless of
> --verbose-errors, since it represents an unexpected situation and should always be
> reported.
>
> Finally, I believe this fix should be included in patch 0001 rather than 0003,
> as it would be a part of the implementation of --continue-on-error.
>
>
> As for 0003:
>
> + {
> + pg_log_error("client %d aborted while executing SQL commands", st->id);
> st->state = CSTATE_ABORTED;
> + }
> break;
>
> I understand that the patch is not directly related to --continue-on-error, similar to 0002,
> and that it aims to improve the error message to indicate that the client was aborted due to
> some error during readCommandResponse().
>
> However, this message doesn't seem entirely accurate, since the error is not always caused
> by an SQL command failure itself. For example, it could also be due to a failure of the \gset
> meta-command.
>
> In addition, this fix causes error messages to be emitted twice. For example, if \gset fails,
> the following similar messages are printed:
>
> pgbench: error: client 0 script 0 command 0 query 0: expected one row, got 0
> pgbench: error: client 0 aborted while executing SQL commands
>
> Even worse, if an unexpected error occurs in readCommandResponse() (i.e., the default case),
> the following messages are emitted, both indicating that the client was aborted:
>
> pgbench: error: client 0 script 0 aborted in command ... query ...
> pgbench: error: client 0 aborted while executing SQL commands
>
> I feel this is a bit redundant.
>
> Therefore, if we are to improve these messages to indicate explicitly that the client
> was aborted, I would suggest modifying the error messages in readCommandResponse() rather
> than adding a new one in advanceConnectionState().
>
> I've attached patch 0003 incorporating my suggestion. What do you think?

Thank you very much for the updated patch! I reviewed 0003 and it looks great: the error
messages have become easier to understand.

I noticed one small thing I’d like to discuss. I'm not sure that users can clearly tell
which was aborted in the following error message, the client or the script.

> pgbench: error: client 0 script 0 aborted in command ... query ...

Since the code path always results in a client abort, I wonder if the following
message might be clearer:

> pgbench: error: client 0 aborted in script 0 command ... query ...

Regards,
Rintaro Ikeda
Hi,

On Sun, 13 Jul 2025 23:15:24 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:

> I noticed one small thing I’d like to discuss. I'm not sure that users can clearly tell
> which was aborted in the following error message, the client or the script.
>
> > pgbench: error: client 0 script 0 aborted in command ... query ...
>
> Since the code path always results in a client abort, I wonder if the following
> message might be clearer:
>
> > pgbench: error: client 0 aborted in script 0 command ... query ...

Indeed, it seems clearer to explicitly state that it is the client that
was aborted.

I've attached an updated patch that replaces the remaining message mentioned
above with a call to commandFailed(). With this change, the output in such
situations will now be:

"client 0 aborted in command 0 (SQL) of script 0; ...."

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Hi,

On 2025/07/15 11:16, Yugo Nagata wrote:
>> I noticed one small thing I’d like to discuss. I'm not sure that users can clearly tell
>> which was aborted in the following error message, the client or the script.
>>> pgbench: error: client 0 script 0 aborted in command ... query ...
>>
>> Since the code path always results in a client abort, I wonder if the following
>> message might be clearer:
>>> pgbench: error: client 0 aborted in script 0 command ... query ...
>
> Indeed, it seems clearer to explicitly state that it is the client that
> was aborted.
>
> I've attached an updated patch that replaces the remaining message mentioned
> above with a call to commandFailed(). With this change, the output in such
> situations will now be:
>
> "client 0 aborted in command 0 (SQL) of script 0; ...."

Thank you for updating the patch!

When I executed a custom script that may raise a unique constraint violation, I
got the following output:

> pgbench: error: client 0 script 0 aborted in command 1 query 0: ERROR: duplicate key value violates unique constraint "test_col2_key"

I think we should also change the error message in pg_log_error. I modified the
patch v8-0003 as follows:

@@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)

 default:
 /* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
+ pg_log_error("client %d aborted in command %d query %d of script %d: %s",
+ st->id, st->command, qrynum, st->use_file,
 PQerrorMessage(st->con));
 goto error;
 }

With this change, the output is now like this:

> pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR: duplicate key value violates unique constraint "test_col2_key"

I want to hear your thoughts.

Also, let me ask one question. In this case, I directly modified your commit in
the v8-0003 patch. Is that the right way to update the patch?

Regards,
Rintaro Ikeda
On Wed, 16 Jul 2025 21:35:01 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:

> Hi,
>
> On 2025/07/15 11:16, Yugo Nagata wrote:
> >> I noticed one small thing I’d like to discuss. I'm not sure that users can clearly tell
> >> which was aborted in the following error message, the client or the script.
> >>> pgbench: error: client 0 script 0 aborted in command ... query ...
> >>
> >> Since the code path always results in a client abort, I wonder if the following
> >> message might be clearer:
> >>> pgbench: error: client 0 aborted in script 0 command ... query ...
> >
> > Indeed, it seems clearer to explicitly state that it is the client that
> > was aborted.
> >
> > I've attached an updated patch that replaces the remaining message mentioned
> > above with a call to commandFailed(). With this change, the output in such
> > situations will now be:
> >
> > "client 0 aborted in command 0 (SQL) of script 0; ...."
>
> Thank you for updating the patch!
>
> When I executed a custom script that may raise a unique constraint violation, I
> got the following output:
> > pgbench: error: client 0 script 0 aborted in command 1 query 0: ERROR:
> duplicate key value violates unique constraint "test_col2_key"

I'm sorry. I must have failed to attach the correct patch in my previous post.
As a result, patch v8 was actually the same as v7, and the message in question
was not modified as intended.

>
> I think we should also change the error message in pg_log_error. I modified the
> patch v8-0003 as follows:
> @@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
>
> default:
> /* anything else is unexpected */
> - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> - st->id, st->use_file, st->command, qrynum,
> + pg_log_error("client %d aborted in command %d query %d of script %d: %s",
> + st->id, st->command, qrynum, st->use_file,
> PQerrorMessage(st->con));
> goto error;
> }
>
> With this change, the output is now like this:
> > pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
> duplicate key value violates unique constraint "test_col2_key"
>
> I want to hear your thoughts.

My idea is to modify this as follows:

 default:
 /* anything else is unexpected */
- pg_log_error("client %d script %d aborted in command %d query %d: %s",
- st->id, st->use_file, st->command, qrynum,
- PQerrorMessage(st->con));
+ commandFailed(st, "SQL", PQerrorMessage(st->con));
 goto error;
 }

This fix was originally planned to be included in patch v8, but was missed.
It is now included in the attached patch, v10.

With this change, the output becomes:

pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
ERROR: duplicate key value violates unique constraint "t2_pkey"

Although there is a slight difference, the message is essentially the same as
your proposal. Also, I believe the use of commandFailed() makes the code simpler
and more consistent.

What do you think?

> Also, let me ask one question. In this case, I directly modified your commit in
> the v8-0003 patch. Is that the right way to update the patch?

I’m not sure if that’s the best way, but I think modifying the patch directly is a
valid way to propose an alternative approach during discussion, as long as the original
patch is respected. It can often help clarify suggestions.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Hi,

On 2025/07/16 22:49, Yugo Nagata wrote:
>> I think we should also change the error message in pg_log_error. I modified the
>> patch v8-0003 as follows:
>> @@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
>>
>> default:
>> /* anything else is unexpected */
>> - pg_log_error("client %d script %d aborted in command %d query %d: %s",
>> - st->id, st->use_file, st->command, qrynum,
>> + pg_log_error("client %d aborted in command %d query %d of script %d: %s",
>> + st->id, st->command, qrynum, st->use_file,
>> PQerrorMessage(st->con));
>> goto error;
>> }
>>
>> With this change, the output is now like this:
>>> pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
>> duplicate key value violates unique constraint "test_col2_key"
>>
>> I want to hear your thoughts.
>
> My idea is to modify this as follows:
>
> default:
> /* anything else is unexpected */
> - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> - st->id, st->use_file, st->command, qrynum,
> - PQerrorMessage(st->con));
> + commandFailed(st, "SQL", PQerrorMessage(st->con));
> goto error;
> }
>
> This fix was originally planned to be included in patch v8, but was missed.
> It is now included in the attached patch, v10.
>
> With this change, the output becomes:
>
> pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
> ERROR: duplicate key value violates unique constraint "t2_pkey"
>
> Although there is a slight difference, the message is essentially the same as
> your proposal. Also, I believe the use of commandFailed() makes the code simpler
> and more consistent.
>
> What do you think?
>

Thank you for the new patch! I think Nagata-san's v10 patch is a clear
improvement over my v9 patch. I'm happy with the changes.

>> Also, let me ask one question. In this case, I directly modified your commit in
>> the v8-0003 patch. Is that the right way to update the patch?
>
> I’m not sure if that’s the best way, but I think modifying the patch directly is a
> valid way to propose an alternative approach during discussion, as long as the original
> patch is respected. It can often help clarify suggestions.

I understand that. Thank you.

Regards,
Rintaro Ikeda
On Fri, 18 Jul 2025 17:07:53 +0900
Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:

> Hi,
>
> On 2025/07/16 22:49, Yugo Nagata wrote:
> >> I think we should also change the error message in pg_log_error. I modified the
> >> patch v8-0003 as follows:
> >> @@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
> >>
> >> default:
> >> /* anything else is unexpected */
> >> - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> >> - st->id, st->use_file, st->command, qrynum,
> >> + pg_log_error("client %d aborted in command %d query %d of script %d: %s",
> >> + st->id, st->command, qrynum, st->use_file,
> >> PQerrorMessage(st->con));
> >> goto error;
> >> }
> >>
> >> With this change, the output is now like this:
> >>> pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
> >> duplicate key value violates unique constraint "test_col2_key"
> >>
> >> I want to hear your thoughts.
> >
> > My idea is to modify this as follows:
> >
> > default:
> > /* anything else is unexpected */
> > - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> > - st->id, st->use_file, st->command, qrynum,
> > - PQerrorMessage(st->con));
> > + commandFailed(st, "SQL", PQerrorMessage(st->con));
> > goto error;
> > }
> >
> > This fix was originally planned to be included in patch v8, but was missed.
> > It is now included in the attached patch, v10.
> >
> > With this change, the output becomes:
> >
> > pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
> > ERROR: duplicate key value violates unique constraint "t2_pkey"
> >
> > Although there is a slight difference, the message is essentially the same as
> > your proposal. Also, I believe the use of commandFailed() makes the code simpler
> > and more consistent.
> >
> > What do you think?
> >
>
> Thank you for the new patch! I think Nagata-san's v10 patch is a clear
> improvement over my v9 patch. I'm happy with the changes.

Thank you.

I believe the patches implement the expected behavior, include appropriate doc and test
modifications, and are in good shape overall, so if there are no objections,
I'll mark this as Ready for Committer.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Tue, 22 Jul 2025 17:49:49 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

> On Fri, 18 Jul 2025 17:07:53 +0900
> Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:
>
> > Hi,
> >
> > On 2025/07/16 22:49, Yugo Nagata wrote:
> > >> I think we should also change the error message in pg_log_error. I modified the
> > >> patch v8-0003 as follows:
> > >> @@ -3383,8 +3383,8 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
> > >>
> > >> default:
> > >> /* anything else is unexpected */
> > >> - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> > >> - st->id, st->use_file, st->command, qrynum,
> > >> + pg_log_error("client %d aborted in command %d query %d of script %d: %s",
> > >> + st->id, st->command, qrynum, st->use_file,
> > >> PQerrorMessage(st->con));
> > >> goto error;
> > >> }
> > >>
> > >> With this change, the output is now like this:
> > >>> pgbench: error: client 0 aborted in command 1 query 0 of script 0: ERROR:
> > >> duplicate key value violates unique constraint "test_col2_key"
> > >>
> > >> I want to hear your thoughts.
> > >
> > > My idea is to modify this as follows:
> > >
> > > default:
> > > /* anything else is unexpected */
> > > - pg_log_error("client %d script %d aborted in command %d query %d: %s",
> > > - st->id, st->use_file, st->command, qrynum,
> > > - PQerrorMessage(st->con));
> > > + commandFailed(st, "SQL", PQerrorMessage(st->con));
> > > goto error;
> > > }
> > >
> > > This fix was originally planned to be included in patch v8, but was missed.
> > > It is now included in the attached patch, v10.
> > >
> > > With this change, the output becomes:
> > >
> > > pgbench: error: client 0 aborted in command 0 (SQL) of script 0;
> > > ERROR: duplicate key value violates unique constraint "t2_pkey"
> > >
> > > Although there is a slight difference, the message is essentially the same as
> > > your proposal. Also, I believe the use of commandFailed() makes the code simpler
> > > and more consistent.
> > >
> > > What do you think?
> > >
> >
> > Thank you for the new patch! I think Nagata-san's v10 patch is a clear
> > improvement over my v9 patch. I'm happy with the changes.
>
> Thank you.
>
> I believe the patches implement the expected behavior, include appropriate doc and test
> modifications, and are in good shape overall, so if there are no objections,
> I'll mark this as Ready for Committer.

I've updated the CF status to Ready for Committer.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, Jul 24, 2025 at 5:44 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
>
> > I believe the patches implement the expected behavior, include appropriate doc and test
> > modifications, and are in good shape overall, so if there are no objections,
> > I'll mark this as Ready for Committer.
>
> I've updated the CF status to Ready for Committer.

Thanks for working on it! As Matthias, Dilip, Srinath and many others pointed out,
it would be a very nice and helpful addition to pgbench. I've just used it out of
necessity, and it worked as advertised for me. It even adds a cool-looking
"XXX failed" count when used with the -P progress meter:

progress: 1.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 3854 failed
progress: 2.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 3796 failed

-J.
On Tue, Sep 16, 2025 at 5:34 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>
> On Thu, Jul 24, 2025 at 5:44 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> >
> > > I believe the patches implement the expected behavior, include appropriate doc and test
> > > modifications, and are in good shape overall, so if there are no objections,
> > > I'll mark this as Ready for Committer.
> >
> > I've updated the CF status to Ready for Committer.

Since this patch is marked as ready for committer, I've started reviewing it.
The patch basically looks good to me.


+ the client is aborted. However, if the --continue-on-error option is specified,

"--continue-on-error" should be enclosed in <option> tags.

+ without completing the last transaction. By default, if execution of an SQL
 or meta command fails for reasons other than serialization or deadlock errors,
<snip>
+ the client is aborted. However, if the --continue-on-error option is specified,
+ the client does not abort and proceeds to the next transaction regardless of
+ the error. These cases are reported as "other failures" in the output.

This explanation can be read as if --continue-on-error allows the client to
proceed to the next transaction even when a meta command (not an SQL command)
fails, but that is not correct, right? If so, the description should be updated
to make it clear that only SQL errors are affected, while meta command failures
are not.

+$node->pgbench(
+ '-t 10 --continue-on-error --failures-detailed',

Isn't it better to also specify the -n option, to skip the unnecessary VACUUM and
speed the test up?

+ 'test --continue-on-error',
+ {
+ '002_continue_on_error' => q{

Regarding the test file name, perhaps 001 would be a better prefix than 002,
since other tests in 001_pgbench_with_server.pl use 001 as the prefix.

+ insert into unique_table values 0;

This INSERT causes a syntax error. Was this intentional? If the intention was
to test unique constraint violations, it should instead be
INSERT INTO unique_table VALUES (0);.

To further improve the test, it might also be useful to mix successful and
failed transactions in the --continue-on-error case. For example,
the following change would result in one successful transaction and
nine failures:

-----------------------------
$node->safe_psql('postgres',
- 'CREATE TABLE unique_table(i int unique);' . 'INSERT INTO unique_table VALUES (0);');
+ 'CREATE TABLE unique_table(i int unique);');

$node->pgbench(
 '-t 10 --continue-on-error --failures-detailed',
 0,
 [
- qr{processed: 0/10\b},
- qr{other failures: 10\b}
+ qr{processed: 1/10\b},
+ qr{other failures: 9\b}
-----------------------------

Regards,

--
Fujii Masao
On Thu, 18 Sep 2025 01:52:46 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> On Tue, Sep 16, 2025 at 5:34 PM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> >
> > On Thu, Jul 24, 2025 at 5:44 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > >
> > > > I believe the patches implement the expected behavior, include appropriate doc and test
> > > > modifications, and are in good shape overall, so if there are no objections,
> > > > I'll mark this as Ready for Committer.
> > >
> > > I've updated the CF status to Ready for Committer.
>
> Since this patch is marked as ready for committer, I've started reviewing it.
> The patch basically looks good to me.
>
>
> + the client is aborted. However, if the --continue-on-error option is specified,
>
> "--continue-on-error" should be enclosed in <option> tags.

+1

> + without completing the last transaction. By default, if execution of an SQL
>  or meta command fails for reasons other than serialization or deadlock errors,
> <snip>
> + the client is aborted. However, if the --continue-on-error option is specified,
> + the client does not abort and proceeds to the next transaction regardless of
> + the error. These cases are reported as "other failures" in the output.
>
> This explanation can be read as if --continue-on-error allows the client to
> proceed to the next transaction even when a meta command (not an SQL command)
> fails, but that is not correct, right? If so, the description should be updated
> to make it clear that only SQL errors are affected, while meta command failures
> are not.

That makes sense. How about rewriting this like:

However, if the --continue-on-error option is specified and the error occurs in
an SQL command, the client does not abort and proceeds to the next transaction
regardless of the error. These cases are reported as "other failures" in the output.
Note that if the error occurs in a meta-command, the client will still abort even
when this option is specified.

> +$node->pgbench(
> + '-t 10 --continue-on-error --failures-detailed',
>
> Isn't it better to also specify the -n option, to skip the unnecessary VACUUM and
> speed the test up?

+1

> + 'test --continue-on-error',
> + {
> + '002_continue_on_error' => q{
>
> Regarding the test file name, perhaps 001 would be a better prefix than 002,
> since other tests in 001_pgbench_with_server.pl use 001 as the prefix.

Right. This filename is shown in the “transaction type:” field of the results
when the test fails, so it should be aligned with the test file name.

> + insert into unique_table values 0;
>
> This INSERT causes a syntax error. Was this intentional? If the intention was
> to test unique constraint violations, it should instead be
> INSERT INTO unique_table VALUES (0);.

This was clearly unintentional. I happened to overlook it during my review.

> To further improve the test, it might also be useful to mix successful and
> failed transactions in the --continue-on-error case. For example,
> the following change would result in one successful transaction and
> nine failures:
>
> -----------------------------
> $node->safe_psql('postgres',
> - 'CREATE TABLE unique_table(i int unique);' . 'INSERT INTO unique_table VALUES (0);');
> + 'CREATE TABLE unique_table(i int unique);');
>
> $node->pgbench(
>  '-t 10 --continue-on-error --failures-detailed',
>  0,
>  [
> - qr{processed: 0/10\b},
> - qr{other failures: 10\b}
> + qr{processed: 1/10\b},
> + qr{other failures: 9\b}
> -----------------------------

+1

This makes the purpose of the test clearer.
Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> That makes sense. How about rewriting this like:
>
> However, if the --continue-on-error option is specified and the error occurs in
> an SQL command, the client does not abort and proceeds to the next
> transaction regardless of the error. These cases are reported as "other failures"
> in the output. Note that if the error occurs in a meta-command, the client will
> still abort even when this option is specified.

How about phrasing it like this, based on your version?

----------------------------
A client's run is aborted in case of a serious error; for example, the
connection with the database server was lost or the end of script was reached
without completing the last transaction. The client also aborts
if a meta-command fails, or if an SQL command fails for reasons other than
serialization or deadlock errors when --continue-on-error is not specified.
With --continue-on-error, the client does not abort on such SQL errors
and instead proceeds to the next transaction. These cases are reported
as "other failures" in the output. If the error occurs in a meta-command,
however, the client still aborts even when this option is specified.
----------------------------

Regards,

--
Fujii Masao
On Thu, 18 Sep 2025 14:37:29 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > That makes sense. How about rewriting this like:
> >
> > However, if the --continue-on-error option is specified and the error occurs in
> > an SQL command, the client does not abort and proceeds to the next
> > transaction regardless of the error. These cases are reported as "other failures"
> > in the output. Note that if the error occurs in a meta-command, the client will
> > still abort even when this option is specified.
>
> How about phrasing it like this, based on your version?
>
> ----------------------------
> A client's run is aborted in case of a serious error; for example, the
> connection with the database server was lost or the end of script was reached
> without completing the last transaction. The client also aborts
> if a meta-command fails, or if an SQL command fails for reasons other than
> serialization or deadlock errors when --continue-on-error is not specified.
> With --continue-on-error, the client does not abort on such SQL errors
> and instead proceeds to the next transaction. These cases are reported
> as "other failures" in the output. If the error occurs in a meta-command,
> however, the client still aborts even when this option is specified.
> ----------------------------

I'm fine with that. This version is clearer.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, Sep 18, 2025 at 4:20 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
>
> On Thu, 18 Sep 2025 14:37:29 +0900
> Fujii Masao <masao.fujii@gmail.com> wrote:
>
> > On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > > That makes sense. How about rewriting this like:
> > >
> > > However, if the --continue-on-error option is specified and the error occurs in
> > > an SQL command, the client does not abort and proceeds to the next
> > > transaction regardless of the error. These cases are reported as "other failures"
> > > in the output. Note that if the error occurs in a meta-command, the client will
> > > still abort even when this option is specified.
> >
> > How about phrasing it like this, based on your version?
> >
> > ----------------------------
> > A client's run is aborted in case of a serious error; for example, the
> > connection with the database server was lost or the end of script was reached
> > without completing the last transaction. The client also aborts
> > if a meta-command fails, or if an SQL command fails for reasons other than
> > serialization or deadlock errors when --continue-on-error is not specified.
> > With --continue-on-error, the client does not abort on such SQL errors
> > and instead proceeds to the next transaction. These cases are reported
> > as "other failures" in the output. If the error occurs in a meta-command,
> > however, the client still aborts even when this option is specified.
> > ----------------------------
>
> I'm fine with that. This version is clearer.

Thanks for checking!

Also, I'd like to share my review comments for 0002 and 0003.

Regarding 0002:

- TSTATUS_OTHER_ERROR,
+ TSTATUS_UNKNOWN_ERROR,

You did this rename to avoid confusion with other_sql_errors.
I see the intention, but I'm not sure this concern is really valid
or that the rename adds much value. Also, TSTATUS_UNKNOWN_ERROR
might be mistakenly assumed to be related to PQTRANS_UNKNOWN,
even though they aren't related...

But if we agree with this change, I think it should be folded into 0001,
since there's no strong reason to keep it separate.

Regarding 0003:

- pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
- st->id, st->use_file, st->command, qrynum, 0);
+ commandFailed(st, "gset", psprintf("expected one row, got %d", 0));

The change to use commandFailed() seems to remove
the "query %d" detail that the current pg_log_error() call reports.
Is it OK to lose that information?

Regards,

--
Fujii Masao
On Fri, Sep 19, 2025 at 11:43 AM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Thu, Sep 18, 2025 at 4:20 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> >
> > On Thu, 18 Sep 2025 14:37:29 +0900
> > Fujii Masao <masao.fujii@gmail.com> wrote:
> >
> > > On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > > > That makes sense. How about rewriting this like:
> > > >
> > > > However, if the --continue-on-error option is specified and the error occurs in
> > > > an SQL command, the client does not abort and proceeds to the next
> > > > transaction regardless of the error. These cases are reported as "other failures"
> > > > in the output. Note that if the error occurs in a meta-command, the client will
> > > > still abort even when this option is specified.
> > >
> > > How about phrasing it like this, based on your version?
> > >
> > > ----------------------------
> > > A client's run is aborted in case of a serious error; for example, the
> > > connection with the database server was lost or the end of script was reached
> > > without completing the last transaction. The client also aborts
> > > if a meta-command fails, or if an SQL command fails for reasons other than
> > > serialization or deadlock errors when --continue-on-error is not specified.
> > > With --continue-on-error, the client does not abort on such SQL errors
> > > and instead proceeds to the next transaction. These cases are reported
> > > as "other failures" in the output. If the error occurs in a meta-command,
> > > however, the client still aborts even when this option is specified.
> > > ----------------------------
> >
> > I'm fine with that. This version is clearer.
>
> Thanks for checking!

I've updated the 0001 patch based on the comments.
The revised version is attached.

While testing, I found that running pgbench with --continue-on-error and
pipeline mode triggers the following assertion failure. Could this be
a bug in the patch?

---------------------------------------------------
$ cat pipeline.pgbench
\startpipeline
DO $$
BEGIN
PERFORM pg_sleep(3);
PERFORM pg_terminate_backend(pg_backend_pid());
END $$;
\endpipeline

$ pgbench -n --debug --verbose-errors -f pipeline.pgbench -c 2 -t 4 -M extended --continue-on-error
...
Assertion failed:
(sql_script[st->use_file].commands[st->command]->type == 1), function
commandError, file pgbench.c, line 3081.
Abort trap: 6
---------------------------------------------------

When I ran the same command without --continue-on-error,
the assertion failure did not occur.

Regards,

--
Fujii Masao
On Fri, 19 Sep 2025 11:43:28 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> On Thu, Sep 18, 2025 at 4:20 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> >
> > On Thu, 18 Sep 2025 14:37:29 +0900
> > Fujii Masao <masao.fujii@gmail.com> wrote:
> >
> > > On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > > > That makes sense. How about rewriting this like:
> > > >
> > > > However, if the --continue-on-error option is specified and the error occurs in
> > > > an SQL command, the client does not abort and proceeds to the next
> > > > transaction regardless of the error. These cases are reported as "other failures"
> > > > in the output. Note that if the error occurs in a meta-command, the client will
> > > > still abort even when this option is specified.
> > >
> > > How about phrasing it like this, based on your version?
> > >
> > > ----------------------------
> > > A client's run is aborted in case of a serious error; for example, the
> > > connection with the database server was lost or the end of script was reached
> > > without completing the last transaction. The client also aborts
> > > if a meta-command fails, or if an SQL command fails for reasons other than
> > > serialization or deadlock errors when --continue-on-error is not specified.
> > > With --continue-on-error, the client does not abort on such SQL errors
> > > and instead proceeds to the next transaction. These cases are reported
> > > as "other failures" in the output. If the error occurs in a meta-command,
> > > however, the client still aborts even when this option is specified.
> > > ----------------------------
> >
> > I'm fine with that. This version is clearer.
>
> Thanks for checking!
>
>
> Also, I'd like to share my review comments for 0002 and 0003.
>
> Regarding 0002:
>
> - TSTATUS_OTHER_ERROR,
> + TSTATUS_UNKNOWN_ERROR,
>
> You did this rename to avoid confusion with other_sql_errors.
> I see the intention, but I'm not sure this concern is really valid
> or that the rename adds much value. Also, TSTATUS_UNKNOWN_ERROR
> might be mistakenly assumed to be related to PQTRANS_UNKNOWN,
> even though they aren't related...

I don’t have a strong opinion on this, but I think TSTATUS_* is slightly
related to PQTRANS_*, since getTransactionStatus() determines the TSTATUS
value based on PQTRANS. There is no one-to-one relationship, of course,
but it is more related than ESTATUS_OTHER_SQL_ERROR, which is entirely
separate.

> But if we agree with this change, I think it should be folded into 0001,
> since there's no strong reason to keep it separate.

+1

I personally don't mind omitting this change, but I would like to wait
for Ikeda-san's response, because he is the author of these two patches.

> Regarding 0003:
>
> - pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
> - st->id, st->use_file, st->command, qrynum, 0);
> + commandFailed(st, "gset", psprintf("expected one row, got %d", 0));
>
> The change to use commandFailed() seems to remove
> the "query %d" detail that the current pg_log_error() call reports.
> Is it OK to lose that information?

"qrynum" is the index of the SQL queries combined by "\;", but reporting it in
\gset errors is almost useless, since \gset can only be applied to the last
query of a compound query. So I think it's fine to commit this change as is.
That said, it might still be useful for debugging when an internal error like the
following occurs (mainly for developers rather than users):

 /* internal error */
 commandFailed(st, cmd, psprintf("error storing into variable %s", varname));

For that case, I’d be fine with adding information like this:

 /* internal error */
 commandFailed(st, cmd, psprintf("error storing into variable %s, at query %d", varname, qrynum));

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Fri, 19 Sep 2025 19:21:29 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> On Fri, Sep 19, 2025 at 11:43 AM Fujii Masao <masao.fujii@gmail.com> wrote:
> >
> > On Thu, Sep 18, 2025 at 4:20 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > >
> > > On Thu, 18 Sep 2025 14:37:29 +0900
> > > Fujii Masao <masao.fujii@gmail.com> wrote:
> > >
> > > > On Thu, Sep 18, 2025 at 10:22 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > > > > That makes sense. How about rewriting this like:
> > > > >
> > > > > However, if the --continue-on-error option is specified and the error occurs in
> > > > > an SQL command, the client does not abort and proceeds to the next
> > > > > transaction regardless of the error. These cases are reported as "other failures"
> > > > > in the output. Note that if the error occurs in a meta-command, the client will
> > > > > still abort even when this option is specified.
> > > >
> > > > How about phrasing it like this, based on your version?
> > > >
> > > > ----------------------------
> > > > A client's run is aborted in case of a serious error; for example, the
> > > > connection with the database server was lost or the end of script was reached
> > > > without completing the last transaction. The client also aborts
> > > > if a meta-command fails, or if an SQL command fails for reasons other than
> > > > serialization or deadlock errors when --continue-on-error is not specified.
> > > > With --continue-on-error, the client does not abort on such SQL errors
> > > > and instead proceeds to the next transaction. These cases are reported
> > > > as "other failures" in the output. If the error occurs in a meta-command,
> > > > however, the client still aborts even when this option is specified.
> > > > ----------------------------
> > >
> > > I'm fine with that. This version is clearer.
> >
> > Thanks for checking!
>
> I've updated the 0001 patch based on the comments.
> The revised version is attached.

Thank you for updating the patch!

>
> While testing, I found that running pgbench with --continue-on-error and
> pipeline mode triggers the following assertion failure. Could this be
> a bug in the patch?
>
> ---------------------------------------------------
> $ cat pipeline.pgbench
> \startpipeline
> DO $$
> BEGIN
> PERFORM pg_sleep(3);
> PERFORM pg_terminate_backend(pg_backend_pid());
> END $$;
> \endpipeline
>
> $ pgbench -n --debug --verbose-errors -f pipeline.pgbench -c 2 -t 4 -M
> extended --continue-on-error
> ...
> Assertion failed:
> (sql_script[st->use_file].commands[st->command]->type == 1), function
> commandError, file pgbench.c, line 3081.
> Abort trap: 6
> ---------------------------------------------------
>
> When I ran the same command without --continue-on-error,
> the assertion failure did not occur.

I think this bug was introduced by commit 4a39f87acd6e, which enabled pgbench
to retry and added the --verbose-errors option, rather than by this patch itself.

The assertion failure occurs in commandError(), which is called to report an error
when it can be retried (i.e., a serialization failure or deadlock), or, after this
patch, when --continue-on-error is used.

Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);

This assumes the error is always detected during SQL command execution, but
that’s not correct, since in pipeline mode the error can be detected when
a \endpipeline meta-command is executed.
$ cat deadlock.sql
\startpipeline
begin;
lock b;
lock a;
end;
\endpipeline

$ cat deadlock2.sql
\startpipeline
begin;
lock a;
lock b;
end;
\endpipeline

$ pgbench --verbose-errors -f deadlock.sql -f deadlock2.sql -c 2 -T 3 -M extended
pgbench (19devel)
starting vacuum...end.
pgbench: pgbench.c:3062: commandError: Assertion `sql_script[st->use_file].commands[st->command]->type == 1' failed.

Although one option would be to remove this assertion, if we prefer to keep it,
the attached patch fixes the issue.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
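For readers following along, the relaxed assertion would presumably take a shape like the sketch below, admitting \endpipeline as a legitimate place to report the error. This is an assumption about the attached patch, not its actual text; Command, META_COMMAND, and META_ENDPIPELINE are existing pgbench identifiers:

```
/*
 * Sketch (assumed shape of the fix): in pipeline mode the server error
 * is consumed while processing the \endpipeline meta-command, so the
 * current command may legitimately be that meta-command rather than an
 * SQL command.
 */
Command    *command = sql_script[st->use_file].commands[st->command];

Assert(command->type == SQL_COMMAND ||
       (command->type == META_COMMAND &&
        command->meta == META_ENDPIPELINE));
```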
Thank you for reviewing the patches.

On 2025/09/19 20:56, Yugo Nagata wrote:
>>>> A client's run is aborted in case of a serious error; for example, the
>>>> connection with the database server was lost or the end of script was reached
>>>> without completing the last transaction. The client also aborts
>>>> if a meta-command fails, or if an SQL command fails for reasons other than
>>>> serialization or deadlock errors when --continue-on-error is not specified.
>>>> With --continue-on-error, the client does not abort on such SQL errors
>>>> and instead proceeds to the next transaction. These cases are reported
>>>> as "other failures" in the output. If the error occurs in a meta-command,
>>>> however, the client still aborts even when this option is specified.
>>>> ----------------------------
>>>
>>> I'm fine with that. This version is clearer.

I also agree with this.

>>
>> Also, I'd like to share my review comments for 0002 and 0003.
>>
>> Regarding 0002:
>>
>> - TSTATUS_OTHER_ERROR,
>> + TSTATUS_UNKNOWN_ERROR,
>>
>> You did this rename to avoid confusion with other_sql_errors.
>> I see the intention, but I'm not sure this concern is really valid
>> or that the rename adds much value. Also, TSTATUS_UNKNOWN_ERROR
>> might be mistakenly assumed to be related to PQTRANS_UNKNOWN,
>> even though they aren't related...
>
> I don’t have a strong opinion on this, but I think TSTATUS_* is slightly
> related to PQTRANS_*, since getTransactionStatus() determines the TSTATUS
> value based on PQTRANS. There is no one-to-one relationship, of course,
> but it is more related than ESTATUS_OTHER_SQL_ERROR, which is entirely
> separate.
>
>> But if we agree with this change, I think it should be folded into 0001,
>> since there's no strong reason to keep it separate.
>
> +1
>
> I personally don't mind omitting this change, but I would like to wait
> for Ikeda-san's response, because he is the author of these two patches.
>

The points you both raise make sense to me.
Changing the macro name is not important for the purpose of the patch, so I now
feel it would be reasonable to drop patch 0002.

Regards,
Rintaro Ikeda
On Sat, Sep 20, 2025 at 12:21 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > While testing, I found that running pgbench with --continue-on-error and
> > pipeline mode triggers the following assertion failure. Could this be
> > a bug in the patch?
> >
> > ---------------------------------------------------
> > $ cat pipeline.pgbench
> > \startpipeline
> > DO $$
> > BEGIN
> > PERFORM pg_sleep(3);
> > PERFORM pg_terminate_backend(pg_backend_pid());
> > END $$;
> > \endpipeline
> >
> > $ pgbench -n --debug --verbose-errors -f pipeline.pgbench -c 2 -t 4 -M
> > extended --continue-on-error
> > ...
> > Assertion failed:
> > (sql_script[st->use_file].commands[st->command]->type == 1), function
> > commandError, file pgbench.c, line 3081.
> > Abort trap: 6
> > ---------------------------------------------------
> >
> > When I ran the same command without --continue-on-error,
> > the assertion failure did not occur.
>
> I think this bug was introduced by commit 4a39f87acd6e, which enabled pgbench
> to retry and added the --verbose-errors option, rather than by this patch itself.
>
> The assertion failure occurs in commandError(), which is called to report an error
> when it can be retried (i.e., a serialization failure or deadlock), or, after this
> patch, when --continue-on-error is used.
>
> Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);
>
> This assumes the error is always detected during SQL command execution, but
> that’s not correct, since in pipeline mode the error can be detected when
> a \endpipeline meta-command is executed.
>
> $ cat deadlock.sql
> \startpipeline
> begin;
> lock b;
> lock a;
> end;
> \endpipeline
>
> $ cat deadlock2.sql
> \startpipeline
> begin;
> lock a;
> lock b;
> end;
> \endpipeline
>
> $ pgbench --verbose-errors -f deadlock.sql -f deadlock2.sql -c 2 -T 3 -M extended
> pgbench (19devel)
> starting vacuum...end.
> pgbench: pgbench.c:3062: commandError: Assertion `sql_script[st->use_file].commands[st->command]->type == 1' failed.
>
> Although one option would be to remove this assertion, if we prefer to keep it,
> the attached patch fixes the issue.

Thanks for the analysis and the patch!

I think we should fix the issue rather than just removing the assertion.
I'd like to apply your patch with the following source comment:

---------------------------
Errors should only be detected during an SQL command or the \endpipeline
meta command. Any other case triggers an assertion failure.
--------------------------


With your patch and the continue-on-error patches, running the same pgbench
command I used to reproduce the assertion failure upthread causes pgbench
to hang. From my analysis, it enters an infinite loop in discardUntilSync().
That loop waits for PGRES_PIPELINE_SYNC, but since the connection has already
been closed, it never arrives, leaving pgbench stuck.

Could this also happen without the continue-on-error patch, or is it a new bug
introduced by it? Either way, it seems pgbench needs to exit the loop when
the result status is PGRES_FATAL_ERROR.

Regards,

--
Fujii Masao
On Sat, Sep 20, 2025 at 9:58 PM Rintaro Ikeda
<ikedarintarof@oss.nttdata.com> wrote:
> The points you both raise make sense to me.
> Changing the macro name is not important for the purpose of the patch, so I now
> feel it would be reasonable to drop patch 0002.

Thanks for your thoughts! So let's focus on the 0001 patch for now.

Regards,

--
Fujii Masao
Hi,

On 2025/09/22 11:56, Fujii Masao wrote:
> On Sat, Sep 20, 2025 at 12:21 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
>>> While testing, I found that running pgbench with --continue-on-error and
>>> pipeline mode triggers the following assertion failure. Could this be
>>> a bug in the patch?
>>>
>>> ---------------------------------------------------
>>> $ cat pipeline.pgbench
>>> \startpipeline
>>> DO $$
>>> BEGIN
>>> PERFORM pg_sleep(3);
>>> PERFORM pg_terminate_backend(pg_backend_pid());
>>> END $$;
>>> \endpipeline
>>>
>>> $ pgbench -n --debug --verbose-errors -f pipeline.pgbench -c 2 -t 4 -M
>>> extended --continue-on-error
>>> ...
>>> Assertion failed:
>>> (sql_script[st->use_file].commands[st->command]->type == 1), function
>>> commandError, file pgbench.c, line 3081.
>>> Abort trap: 6
>>> ---------------------------------------------------
>>>
>>> When I ran the same command without --continue-on-error,
>>> the assertion failure did not occur.
>>
>> I think this bug was introduced by commit 4a39f87acd6e, which enabled pgbench
>> to retry and added the --verbose-errors option, rather than by this patch itself.
>>
>> The assertion failure occurs in commandError(), which is called to report an error
>> when it can be retried (i.e., a serialization failure or deadlock), or, after this
>> patch, when --continue-on-error is used.
>>
>> Assert(sql_script[st->use_file].commands[st->command]->type == SQL_COMMAND);
>>
>> This assumes the error is always detected during SQL command execution, but
>> that’s not correct, since in pipeline mode the error can be detected when
>> a \endpipeline meta-command is executed.
>>
>> $ cat deadlock.sql
>> \startpipeline
>> begin;
>> lock b;
>> lock a;
>> end;
>> \endpipeline
>>
>> $ cat deadlock2.sql
>> \startpipeline
>> begin;
>> lock a;
>> lock b;
>> end;
>> \endpipeline
>>
>> $ pgbench --verbose-errors -f deadlock.sql -f deadlock2.sql -c 2 -T 3 -M extended
>> pgbench (19devel)
>> starting vacuum...end.
>> pgbench: pgbench.c:3062: commandError: Assertion `sql_script[st->use_file].commands[st->command]->type == 1' failed.
>>
>> Although one option would be to remove this assertion, if we prefer to keep it,
>> the attached patch fixes the issue.
>
> Thanks for the analysis and the patch!
>
> I think we should fix the issue rather than just removing the assertion.
> I'd like to apply your patch with the following source comment:
>
> ---------------------------
> Errors should only be detected during an SQL command or the \endpipeline
> meta command. Any other case triggers an assertion failure.
> --------------------------
>
>
> With your patch and the continue-on-error patches, running the same pgbench
> command I used to reproduce the assertion failure upthread causes pgbench
> to hang. From my analysis, it enters an infinite loop in discardUntilSync().
> That loop waits for PGRES_PIPELINE_SYNC, but since the connection has already
> been closed, it never arrives, leaving pgbench stuck.
>
> Could this also happen without the continue-on-error patch, or is it a new bug
> introduced by it? Either way, it seems pgbench needs to exit the loop when
> the result status is PGRES_FATAL_ERROR.
>

Thank you for the analysis and the patches.

I think the issue is a new bug, because previously the client transitioned to
CSTATE_ABORTED immediately after the queries failed, without executing
discardUntilSync().

I've attached a patch that fixes the assertion error. The content of the v1 patch
by Mr. Nagata is also included. I would appreciate it if you could review my patch.

Regards,
Rintaro Ikeda
Attachments
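The attached patch itself is not inlined above, but the analysis pins down what it must do: allow commandError() to fire for the \endpipeline meta command as well as for SQL commands. A hypothetical, self-contained miniature of that relaxed assertion (the enum and field names mirror pgbench.c, but this is not the attached patch):

```
#include <assert.h>

/* Miniature model of the pgbench structures involved (names mirror pgbench.c) */
typedef enum { SQL_COMMAND, META_COMMAND } CommandType;
typedef enum { META_NONE, META_ENDPIPELINE } MetaCommand;

typedef struct Command
{
	CommandType type;
	MetaCommand meta;
} Command;

/*
 * Hypothetical shape of the relaxed assertion: an error may be detected
 * while executing an SQL command, or, in pipeline mode, while executing
 * the \endpipeline meta command -- any other case is a bug.
 */
static void
assert_error_location(const Command *command)
{
	assert(command->type == SQL_COMMAND ||
		   (command->type == META_COMMAND &&
			command->meta == META_ENDPIPELINE));
}
```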
On Tue, Sep 23, 2025 at 11:58 AM Rintaro Ikeda <ikedarintarof@oss.nttdata.com> wrote:
> I think the issue is a new bug, because we transition to CSTATE_ABORTED
> immediately after queries fail, without executing discardUntilSync().

If so, the fix in discardUntilSync() should be part of the continue-on-error
patch. The assertion failure fix should be a separate patch, since only that
needs to be backpatched (the failure can also occur in back branches).

> I've attached a patch that fixes the assertion error. The content of
> Mr. Nagata's v1 patch is also included. I would appreciate it if you could
> review my patch.

+ if (received_sync == true)

For boolean flags, we usually just use the variable itself instead of
"== true/false".

+ switch (PQresultStatus(res))
+ {
+     case PGRES_PIPELINE_SYNC:
+         received_sync = true;

In the PGRES_PIPELINE_SYNC case, PQclear(res) isn't called but should be.

+     case PGRES_FATAL_ERROR:
+         PQclear(res);
+         goto done;

I don't think we need goto here. How about this instead?

-----------------------
@@ -3525,11 +3525,18 @@ discardUntilSync(CState *st)
 			 * results have been discarded.
 			 */
 			st->num_syncs = 0;
-			PQclear(res);
 			break;
 		}
+		/*
+		 * Stop receiving further results if PGRES_FATAL_ERROR is returned
+		 * (e.g., due to a connection failure) before PGRES_PIPELINE_SYNC,
+		 * since PGRES_PIPELINE_SYNC will never arrive.
+		 */
+		else if (PQresultStatus(res) == PGRES_FATAL_ERROR)
+			break;
 		PQclear(res);
 	}
+	PQclear(res);

 	/* exit pipeline */
 	if (PQexitPipelineMode(st->con) != 1)
-----------------------

Regards,

--
Fujii Masao
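For readers who don't have libpq's pipeline semantics paged in: PQgetResult() returns NULL between the result sets of queued queries, so NULL alone does not mean the pipeline is drained; only PGRES_PIPELINE_SYNC does. The sketch below is a minimal standalone illustration of the pattern under discussion (not pgbench's actual discardUntilSync()), including an escape hatch so a dead connection cannot make the loop spin forever:

```
#include <stdbool.h>
#include <libpq-fe.h>

/*
 * Minimal sketch (not pgbench's actual code): after an error in pipeline
 * mode, discard pending results until the PGRES_PIPELINE_SYNC marker.
 * Returns true if the sync point was reached.
 */
static bool
discard_until_sync(PGconn *conn)
{
	for (;;)
	{
		PGresult   *res = PQgetResult(conn);

		if (res == NULL)
		{
			/* separator between queued queries' result sets */
			if (PQstatus(conn) == CONNECTION_BAD)
				return false;	/* connection died; no sync will arrive */
			continue;
		}

		if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
		{
			PQclear(res);
			return true;		/* reached the sync point */
		}

		if (PQresultStatus(res) == PGRES_FATAL_ERROR &&
			PQstatus(conn) == CONNECTION_BAD)
		{
			PQclear(res);
			return false;		/* fatal error on a dead connection */
		}

		PQclear(res);			/* discard and keep draining */
	}
}
```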
On Thu, 25 Sep 2025 02:19:27 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> On Tue, Sep 23, 2025 at 11:58 AM Rintaro Ikeda
> <ikedarintarof@oss.nttdata.com> wrote:
> > I think the issue is a new bug, because we transition to CSTATE_ABORTED
> > immediately after queries fail, without executing discardUntilSync().

Agreed.

> If so, the fix in discardUntilSync() should be part of the continue-on-error
> patch. The assertion failure fix should be a separate patch, since only that
> needs to be backpatched (the failure can also occur in back branches).

+1

> I don't think we need goto here. How about this instead?
>
> [...]

I think Fujii-san's version is better, because Ikeda-san's version doesn't
consider the case where PGRES_PIPELINE_SYNC is followed by another one.
In that situation, the loop would terminate without getting NULL, which
would cause an error in PQexitPipelineMode().

However, I would like to suggest an alternative solution: checking the
connection status when readCommandResponse() returns false. This seems more
straightforward, since the cause of the error can be investigated immediately
after it is detected.

@@ -4024,7 +4043,10 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 				if (PQpipelineStatus(st->con) != PQ_PIPELINE_ON)
 					st->state = CSTATE_END_COMMAND;
 			}
-			else if (canRetryError(st->estatus))
+			else if (PQstatus(st->con) == CONNECTION_BAD)
+				st->state = CSTATE_ABORTED;
+			else if ((st->estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
+					 canRetryError(st->estatus))
 				st->state = CSTATE_ERROR;
 			else
 				st->state = CSTATE_ABORTED;

What do you think?

Additionally, I noticed that in pipeline mode, the error message reported in
readCommandResponse() is lost, because it is reset when PQgetResult() returns
NULL to indicate the end of query processing. For example:

pgbench: client 0 got an error in command 3 (SQL) of script 0;
pgbench: client 1 got an error in command 3 (SQL) of script 0;

This can be fixed by saving the previous error message and using it for the
report. After the fix:

pgbench: client 0 got an error in command 3 (SQL) of script 0; FATAL: terminating connection due to administrator command

I've attached updated patches.

0001 fixes the assertion failure in commandError() and the lost error message
in readCommandResponse().

0002 was the previous 0001 that adds --continue-on-error, including the fix
to handle connection termination errors.

0003 improves error messages for errors that cause client abortion.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments
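The "save the previous error message" fix can be pictured with a small libpq sketch. PQerrorMessage() reports the connection's current error state, which further PQgetResult() calls may reset; copying it first sidesteps that. This is a hypothetical illustration of the idea, not the patch itself (pgbench would use pg_strdup()/pg_free(); plain strdup()/free() keep the sketch self-contained):

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

/*
 * Hypothetical sketch: copy the connection-level error message as soon as
 * the failing result is seen, because draining the remaining results with
 * PQgetResult() can reset it to an empty string.
 */
static void
report_saved_error(PGconn *conn)
{
	/* save the message now; later PQgetResult() calls may reset it */
	char	   *errmsg = strdup(PQerrorMessage(conn));
	PGresult   *res;

	/* drain remaining results; in pipeline mode this clears the error */
	while ((res = PQgetResult(conn)) != NULL)
		PQclear(res);

	fprintf(stderr, "error: %s", errmsg);
	free(errmsg);
}
```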
On Thu, 25 Sep 2025 11:09:40 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

[...]

> I've attached updated patches.
>
> 0001 fixes the assertion failure in commandError() and the lost error message
> in readCommandResponse().
>
> 0002 was the previous 0001 that adds --continue-on-error, including the fix
> to handle connection termination errors.
>
> 0003 improves error messages for errors that cause client abortion.

I think patch 0001 should be backpatched, since the issue can also occur on
retries after serialization failures or deadlocks in pipeline mode.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, Sep 25, 2025 at 11:17 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
[...]
> I think patch 0001 should be backpatched, since the issue can also occur on
> retries after serialization failures or deadlocks in pipeline mode.

Agreed.

Regarding 0001:

+ /*
+ Errors should only be detected during an SQL command or the \endpipeline
+ meta command. Any other case triggers an assertion failure.
+ */

* should be added before "Errors" and "meta".

+ errmsg = pg_strdup(PQerrorMessage(st->con));

It would be good to add a comment explaining why we do this.

+ pg_free(errmsg);

Shouldn't pg_free() be called also in the error case, i.e., after
jumping to the error label?

Regards,

--
Fujii Masao
On Thu, 25 Sep 2025 13:49:05 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:
[...]
> Regarding 0001:
>
> + /*
> + Errors should only be detected during an SQL command or the \endpipeline
> + meta command. Any other case triggers an assertion failure.
> + */
>
> * should be added before "Errors" and "meta".

Oops. Fixed.

> + errmsg = pg_strdup(PQerrorMessage(st->con));
>
> It would be good to add a comment explaining why we do this.
>
> + pg_free(errmsg);
>
> Shouldn't pg_free() be called also in the error case, i.e., after
> jumping to the error label?

Yes, it should be. Fixed.

I've attached updated patches.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments
Hi,

The patch looks good; I've spotted some typos in the doc.

+ Allows clients to continue their run even if an SQL statement fails due to
+ errors other than serialization or deadlock. Unlike serialization and deadlock
+ failures, clients do not retry the same transactions but start new transaction.

Should be "but start a new transaction.", although "proceed to the next
transaction." may be clearer here?

+ number of transactions that got a SQL error
+ (zero unless <option>--failures-detailed</option> is specified)

It seems like both "a SQL" and "an SQL" are used in the codebase and doc, but
this page only uses "an SQL", so using "an SQL" may be better for consistency.

+ If an SQL command fails due to serialization or deadlock errors, the
+ client does not aborted, regardless of whether

Should be "the client does not abort."

Regards,
Anthonin Bonnefoy
Hi Yugo,

Thanks for the patch. After reviewing it, I got a few small comments:

> On Sep 25, 2025, at 15:22, Yugo Nagata <nagata@sraoss.co.jp> wrote:
>
> --
> Yugo Nagata <nagata@sraoss.co.jp>
> <v13-0003-Improve-error-messages-for-errors-that-cause-cli.patch><v13-0002-Add-continue-on-error-option.patch><v13-0001-Fix-assertion-failure-and-verbose-messages-in-pi.patch>

[...]

Best regards,
Chao Li
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, Sep 25, 2025 at 4:22 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> I've attached updated patches.

Thanks for updating the patches!

About 0001: you mentioned that the lost error message issue occurs in pipeline
mode. Just to confirm, are you sure it never happens in non-pipeline mode?
From a quick look, readCommandResponse() seems to have this problem regardless
of whether pipeline mode is used.

If it can also happen outside pipeline mode, maybe we should split this from
the assertion failure fix, since they'd need to be backpatched to different
branches. What do you think?

Regards,

--
Fujii Masao
On Fri, 26 Sep 2025 00:03:06 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> About 0001: you mentioned that the lost error message issue occurs in pipeline
> mode. Just to confirm, are you sure it never happens in non-pipeline mode?
> From a quick look, readCommandResponse() seems to have this problem regardless
> of whether pipeline mode is used.
>
> If it can also happen outside pipeline mode, maybe we should split this from
> the assertion failure fix, since they'd need to be backpatched to different
> branches.

I could not find a code path that resets the error state before reporting in
non-pipeline mode, since it is typically reset when starting to send a query.
However, referencing an error message after another PQgetResult() does not
seem like a good idea in general, so I agree with splitting the patch.

I'll submit updated patches soon.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Thu, 25 Sep 2025 17:17:36 +0800
Chao Li <li.evan.chao@gmail.com> wrote:

> Hi Yugo,
>
> Thanks for the patch. After reviewing it, I got a few small comments:

Thank you for your review and comments.

> 1 - 0001
> ```
> @@ -3265,6 +3271,7 @@ readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
>  	PGresult   *res;
>  	PGresult   *next_res;
>  	int			qrynum = 0;
> +	char	   *errmsg;
> ```
>
> I think we should initialize errmsg to NULL. The compiler won't auto-initialize
> a local variable. If it happens to not enter the while loop, then errmsg will
> hold a random value, and pg_free(errmsg) will have trouble.

I think this initialization is unnecessary, just like for res and next_res.
If the code happens not to enter the while loop, pg_free(errmsg) will not be
called anyway, since the error: label is only reachable from inside the loop.

> 2 - 0002
> ```
> +     <para>
> +      Allows clients to continue their run even if an SQL statement fails due to
> +      errors other than serialization or deadlock. Unlike serialization and deadlock
> +      failures, clients do not retry the same transactions but start new transaction.
> +      This option is useful when your custom script may raise errors due to some
> +      reason like unique constraints violation. Without this option, the client is
> +      aborted after such errors.
> +     </para>
> ```
>
> A few nit suggestions:
>
> * "continue their run" => "continue running"

Fixed.

> * "clients do not retry the same transactions but start new transaction" =>
>   "clients do not retry the same transaction but start a new transaction instead"

I see your point. Maybe we could follow Anthonin Bonnefoy's suggestion to use
"proceed to the next transaction", as it may sound a bit more natural.

> * "due to some reason like" => "for reasons such as"

Fixed.

> 3 - 0002
> ```
> + * Without --continue-on-error:
>  * failed (the number of failed transactions) =
> ```
>
> Maybe add an empty line after the "without" line.

Makes sense. Fixed.

> 4 - 0002
> ```
> + * When --continue-on-error is specified:
> + *
> + * failed (number of failed transactions) =
> ```
>
> Maybe change to "With --continue-on-error", which sounds consistent with the
> previous "without".

Fixed.

> 5 - 0002
> ```
> +	int64		other_sql_failures; /* number of failed transactions for
> +									 * reasons other than
> +									 * serialization/deadlock failure, which
> +									 * is counted if --continue-on-error is
> +									 * specified */
> ```
>
> How about renaming this variable to "sql_errors", which reflects the new
> option name?

I think it's better to keep the current name, since the variable counts failed
transactions, even though that happens to be equivalent to the number of SQL
errors. It's also consistent with the other variables, serialization_failures
and deadlock_failures.

> 6 - 0002
> ```
> @@ -4571,6 +4594,8 @@ getResultString(bool skipped, EStatus estatus)
>  			return "serialization";
>  		case ESTATUS_DEADLOCK_ERROR:
>  			return "deadlock";
> +		case ESTATUS_OTHER_SQL_ERROR:
> +			return "other";
> ```
>
> I think this can just return "error". I checked where this function is called;
> there will not be other words such as "error" appended.

getResultString() is called to get a string that represents the type of error
causing the transaction failure, so simply returning "error" doesn't seem very
useful.

> 7 - 0002
> ```
>  	/* it can be non-zero only if max_tries is not equal to one */
> @@ -6569,6 +6602,10 @@ printResults(StatsData *total,
>  				   sstats->deadlock_failures,
>  				   (100.0 * sstats->deadlock_failures /
>  					script_total_cnt));
> +			printf("  - number of other failures: " INT64_FORMAT " (%.3f%%)\n",
> +				   sstats->other_sql_failures,
> +				   (100.0 * sstats->other_sql_failures /
> +					script_total_cnt));
> ```
>
> Do we only want to print this number when --continue-on-error is given?

We could do that, but this message is printed only when --failures-detailed is
specified. So I think users would not mind if it shows that the number of other
failures is zero, even when --continue-on-error is not specified.

I would appreciate hearing other people's opinions on this.

I've attached updated patches that include fixes for some of your suggestions
and for Anthonin Bonnefoy's suggestion on the documentation.

I also split the patch according to Fujii-san's suggestion.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments
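The control-flow argument in point 1 can be checked with a miniature example: if every goto to the error label happens after the variable is assigned, leaving it uninitialized is safe, which is the same reasoning applied to res and next_res. A hypothetical sketch (not pgbench code):

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Control-flow miniature of the point being argued: the "error" label is
 * reachable only from inside the loop, and every jump to it happens after
 * errmsg has been assigned, so leaving errmsg uninitialized is safe
 * (though some compilers may still warn about it).
 */
static int
drain(int nitems, const char **items)
{
	char	   *errmsg;			/* deliberately not initialized */

	for (int i = 0; i < nitems; i++)
	{
		if (items[i] == NULL)
		{
			errmsg = strdup("unexpected NULL item\n");	/* assigned before the jump */
			goto error;
		}
	}
	return 0;					/* errmsg is never read on this path */

error:
	fprintf(stderr, "%s", errmsg);
	free(errmsg);
	return 1;
}
```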
On Thu, 25 Sep 2025 10:27:44 +0200
Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:

> The patch looks good; I've spotted some typos in the doc.
> [...]

Thank you for your review. I've attached the updated patch in my previous post
in this thread.

By the way, on the pgsql-hackers list, top-posting is generally discouraged [1],
so replying below the quoted messages is usually preferred.

[1] https://wiki.postgresql.org/wiki/Mailing_Lists

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
On Fri, 26 Sep 2025 11:44:42 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:

[...]

> I've attached updated patches that include fixes for some of your
> suggestions and for Anthonin Bonnefoy's suggestion on the documentation.
>
> I also split the patch according to Fujii-san's suggestion.

Fujii-san, thank you for committing the patch that fixes the assertion failure.
I've attached the remaining patches so that cfbot stays green.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments
On Tue, Sep 30, 2025 at 10:24 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> Fujii-san, thank you for committing the patch that fixes the assertion failure.
> I've attached the remaining patches so that cfbot stays green.

Thanks for reattaching the patches!

For 0001, after reading the docs on PQresultErrorMessage(), I wonder if it
would be better to just use that to get the error message. Thoughts?

Regards,

--
Fujii Masao
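For context, PQresultErrorMessage() is the libpq call that returns the error message tied to a specific PGresult; unlike PQerrorMessage(), it is unaffected by later traffic on the connection, which would make the pg_strdup() copy unnecessary. A hedged sketch of how that could look (not the patch itself):

```
#include <stdio.h>
#include <libpq-fe.h>

/*
 * Hedged sketch: report the error tied to a specific failed PGresult after
 * draining the remaining results.  Because the message belongs to the
 * PGresult, later PQgetResult() calls cannot reset it, so no copy needs
 * to be made up front.
 */
static void
report_result_error(PGconn *conn, PGresult *failed)
{
	PGresult   *res;

	/* drain whatever is still queued on the connection */
	while ((res = PQgetResult(conn)) != NULL)
		PQclear(res);

	/* the message is still valid: it lives in "failed", not in the conn */
	fprintf(stderr, "error: %s", PQresultErrorMessage(failed));
	PQclear(failed);
}
```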
On Tue, 30 Sep 2025 13:46:11 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

> For 0001, after reading the docs on PQresultErrorMessage(), I wonder if it
> would be better to just use that to get the error message. Thoughts?

Thank you for your suggestion.

I agree that it is better to use PQresultErrorMessage().
I had overlooked the existence of this interface.

I've attached the updated patches.

Regards,
Yugo Nagata

--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments
On Tue, Sep 30, 2025 at 3:17 PM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> I agree that it is better to use PQresultErrorMessage().
> I had overlooked the existence of this interface.
>
> I've attached the updated patches.

Thanks for updating the patches! I've pushed 0001.

Regarding 0002:

- if (canRetryError(st->estatus))
+ if (continue_on_error || canRetryError(st->estatus))
  {
      if (verbose_errors)
          commandError(st, PQresultErrorMessage(res));
      goto error;

With this change, even non-SQL errors (e.g., connection failures) would
satisfy the condition when --continue-on-error is set. Isn't that a problem?
Shouldn't we also check that the error status is one that --continue-on-error
is meant to handle?

+ * Without --continue-on-error:
  *
  * failed (the number of failed transactions) =
  * 'serialization_failures' (they got a serialization error and were not
  * successfully retried) +
  * 'deadlock_failures' (they got a deadlock error and were not
  * successfully retried).
  *
+ * With --continue-on-error:
+ *
+ * failed (number of failed transactions) =
+ * 'serialization_failures' + 'deadlock_failures' +
+ * 'other_sql_failures' (they got some other SQL error; the transaction was
+ * not retried and counted as failed due to --continue-on-error).

About the comments on failed transactions: I don't think we need to split
them into separate "with/without --continue-on-error" sections. How about
simplifying them like this?

------------------------
 * failed (the number of failed transactions) =
 * 'serialization_failures' (they got a serialization error and were not
 *                           successfully retried) +
 * 'deadlock_failures' (they got a deadlock error and were not
 *                      successfully retried) +
 * 'other_sql_failures' (they failed on the first try or after retries
 *                       due to a SQL error other than serialization or
 *                       deadlock; they are counted as a failed transaction
 *                       only when --continue-on-error is specified).
------------------------

 * 'retried' (number of all retried transactions) =
 * successfully retried transactions +
 * failed transactions.

Since transactions that failed on the first try (i.e., no retries) due to an
SQL error are not counted as 'retried', shouldn't this source comment be
updated?

Regards,

--
Fujii Masao
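One way to address this concern already appears upthread: gate the continue-on-error path on the specific error status instead of the flag alone, so connection failures still abort the client. Below is a self-contained miniature of that stricter transition (the enum and function names mirror pgbench.c, adapted from Nagata-san's earlier advanceConnectionState() hunk; a sketch, not a committed change):

```
#include <stdbool.h>

/* Minimal model of the pgbench states/statuses involved (names mirror pgbench.c) */
typedef enum { CSTATE_ERROR, CSTATE_ABORTED } ClientState;
typedef enum { ESTATUS_SERIALIZATION_ERROR, ESTATUS_DEADLOCK_ERROR,
			   ESTATUS_OTHER_SQL_ERROR } EStatus;

static bool
canRetryError(EStatus e)
{
	return e == ESTATUS_SERIALIZATION_ERROR || e == ESTATUS_DEADLOCK_ERROR;
}

/*
 * Sketch of the stricter transition: connection loss always aborts the
 * client; --continue-on-error only rescues ordinary SQL errors.
 */
static ClientState
next_state(bool connection_bad, EStatus estatus, bool continue_on_error)
{
	if (connection_bad)
		return CSTATE_ABORTED;	/* connection lost: give up on this client */
	if ((estatus == ESTATUS_OTHER_SQL_ERROR && continue_on_error) ||
		canRetryError(estatus))
		return CSTATE_ERROR;	/* roll back and start a new transaction */
	return CSTATE_ABORTED;
}
```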