Обсуждение: A failure in prepared_xacts test

Поиск

Список

Период

Сортировка

A failure in prepared_xacts test

От

Richard Guo

Дата:

29 апреля, 04:12:48

Yesterday I noticed a failure on cirrus-ci for the 'Right Semi Join'
patch. The failure can be found at [1], and it looks like:

--- /tmp/cirrus-ci-build/src/test/regress/expected/prepared_xacts.out 2024-04-27 00:41:25.831297000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/prepared_xacts.out 2024-04-27 00:45:50.261369000 +0000
@@ -83,8 +83,9 @@
SELECT gid FROM pg_prepared_xacts;
gid
------
+ gxid
foo3
-(1 row)
+(2 rows)

Upon closer look, it seems that this issue is not caused by the patch
about 'Right Semi Join', because this query, which initially included
two left joins, can actually be reduced to a function scan after
removing these two useless left joins. It seems that no semi-joins
would be involved.

EXPLAIN SELECT gid FROM pg_prepared_xacts;
QUERY PLAN
----------------------------------------------------------------------------
Function Scan on pg_prepared_xact p (cost=0.00..10.00 rows=1000 width=32)
(1 row)

Does anyone have any clue to this failure?

FWIW, after another run of this test, the failure just disappears. Does
it suggest that the test case is flaky?

[1] https://api.cirrus-ci.com/v1/artifact/task/6220592364388352/testrun/build/testrun/regress-running/regress/regression.diffs

Thanks
Richard

Re: A failure in prepared_xacts test

От

Michael Paquier

Дата:

29 апреля, 07:58:51

On Mon, Apr 29, 2024 at 09:12:48AM +0800, Richard Guo wrote:
> Does anyone have any clue to this failure?
>
> FWIW, after another run of this test, the failure just disappears.  Does
> it suggest that the test case is flaky?

If you grep the source tree, you'd notice that a prepared transaction
named gxid only exists in the 2PC tests of ECPG, in
src/interfaces/ecpg/test/sql/twophase.pgc.  So the origin of the
failure comes from a race condition due to test parallelization,
because the scan of pg_prepared_xacts affects all databases with
installcheck, and in your case it means that the scan of
pg_prepared_xacts was running in parallel of the ECPG tests with an
installcheck.

The only location in the whole tree where we want to do predictible
scans of pg_prepared_xacts is prepared_xacts.sql, so rather than
playing with 2PC transactions across a bunch of tests, I think that we
should do two things, both touching prepared_xacts.sql:
- The 2PC transactions run in the main regression test suite should
use names that would be unlikely used elsewhere.
- Limit the scans of pg_prepared_xacts on these name patterns to avoid
interferences.

See for example the attached with both expected outputs updated
depending on the value set for max_prepared_transactions in the
backend.  There may be an argument in back-patching that, but I don't
recall seeing this failure in the CI, so perhaps that's not worth
bothering with.  What do you think?
--
Michael

Вложения

Re: A failure in prepared_xacts test

От

Alexander Lakhin

Дата:

29 апреля, 08:00:00

Hello Richard,

29.04.2024 04:12, Richard Guo wrote:
> Does anyone have any clue to this failure?
>
> FWIW, after another run of this test, the failure just disappears.  Does
> it suggest that the test case is flaky?
>

I think this could be caused by the ecpg test twophase executed
simultaneously with the test prepared_xacts thanks to meson's jobs
parallelization.

Best regards,
Alexander

Re: A failure in prepared_xacts test

От

Tom Lane

Дата:

29 апреля, 08:11:00

Michael Paquier <michael@paquier.xyz> writes:
> If you grep the source tree, you'd notice that a prepared transaction
> named gxid only exists in the 2PC tests of ECPG, in
> src/interfaces/ecpg/test/sql/twophase.pgc.  So the origin of the
> failure comes from a race condition due to test parallelization,
> because the scan of pg_prepared_xacts affects all databases with
> installcheck, and in your case it means that the scan of
> pg_prepared_xacts was running in parallel of the ECPG tests with an
> installcheck.

Up to now, we've only worried about whether tests running in parallel
within a single test suite can interact.  It's quite scary to think
that the meson setup has expanded the possibility of interactions
to our entire source tree.  Maybe that was a bad idea and we should
fix the meson infrastructure to not do that.  I fear that otherwise,
we'll get bit regularly by very-low-probability bugs of this kind.

            regards, tom lane

Re: A failure in prepared_xacts test

От

Michael Paquier

Дата:

29 апреля, 08:25:10

On Mon, Apr 29, 2024 at 01:11:00AM -0400, Tom Lane wrote:
> Up to now, we've only worried about whether tests running in parallel
> within a single test suite can interact.  It's quite scary to think
> that the meson setup has expanded the possibility of interactions
> to our entire source tree.  Maybe that was a bad idea and we should
> fix the meson infrastructure to not do that.  I fear that otherwise,
> we'll get bit regularly by very-low-probability bugs of this kind.

I don't disagree with your point, still I'm not sure that this can be
made entirely bullet-proof.  Anyway, I think that we should still
improve this test and make it more robust for parallel operations:
installcheck fails equally on HEAD if there is a prepared transaction
on the backend where the tests run, and that seems like a bad idea to
me to rely on cluster-wide scans for what should be a "local" test.
--
Michael

Вложения

signature.asc

Re: A failure in prepared_xacts test

От

Alexander Lakhin

Дата:

29 апреля, 08:30:00

Hello Tom and Michael,

29.04.2024 08:11, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> If you grep the source tree, you'd notice that a prepared transaction
>> named gxid only exists in the 2PC tests of ECPG, in
>> src/interfaces/ecpg/test/sql/twophase.pgc.  So the origin of the
>> failure comes from a race condition due to test parallelization,
>> because the scan of pg_prepared_xacts affects all databases with
>> installcheck, and in your case it means that the scan of
>> pg_prepared_xacts was running in parallel of the ECPG tests with an
>> installcheck.
> Up to now, we've only worried about whether tests running in parallel
> within a single test suite can interact.  It's quite scary to think
> that the meson setup has expanded the possibility of interactions
> to our entire source tree.  Maybe that was a bad idea and we should
> fix the meson infrastructure to not do that.  I fear that otherwise,
> we'll get bit regularly by very-low-probability bugs of this kind.

Yes, I'm afraid of the same. For example, the test failure [1] is of that
ilk, I guess.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-04-17%2016%3A33%3A23

Best regards,
Alexander

Re: A failure in prepared_xacts test

От

Tom Lane

Дата:

29 апреля, 08:32:40

Michael Paquier <michael@paquier.xyz> writes:
> I don't disagree with your point, still I'm not sure that this can be
> made entirely bullet-proof.  Anyway, I think that we should still
> improve this test and make it more robust for parallel operations:
> installcheck fails equally on HEAD if there is a prepared transaction
> on the backend where the tests run, and that seems like a bad idea to
> me to rely on cluster-wide scans for what should be a "local" test.

True, it's antithetical to the point of an "installcheck" test if
unrelated actions in another database can break it.  So I'm fine
with tightening up prepared_xacts's query.  I just wonder how far
we want to try to carry this.

(BTW, on the same logic, should ecpg's twophase.pgc be using a
prepared-transaction name that's less generic than "gxid"?)

            regards, tom lane

Re: A failure in prepared_xacts test

От

Michael Paquier

Дата:

29 апреля, 09:57:58

On Mon, Apr 29, 2024 at 01:32:40AM -0400, Tom Lane wrote:
> (BTW, on the same logic, should ecpg's twophase.pgc be using a
> prepared-transaction name that's less generic than "gxid"?)

I've hesitated a few seconds about that before sending my patch, but
refrained because this stuff does not care about the contents of
pg_prepared_xacts.  I'd be OK to use something like an "ecpg_regress"
or something similar there.
--
Michael

Вложения

signature.asc

Re: A failure in prepared_xacts test

От

Richard Guo

Дата:

29 апреля, 11:49:28

On Mon, Apr 29, 2024 at 1:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Up to now, we've only worried about whether tests running in parallel
within a single test suite can interact. It's quite scary to think
that the meson setup has expanded the possibility of interactions
to our entire source tree. Maybe that was a bad idea and we should
fix the meson infrastructure to not do that. I fear that otherwise,
we'll get bit regularly by very-low-probability bugs of this kind.

I have the same concern. I suspect that the scan of pg_prepared_xacts
is not the only test that could cause problems when running in parallel
to other tests from the entire source tree.

Thanks
Richard

Re: A failure in prepared_xacts test

От

Richard Guo

Дата:

29 апреля, 12:11:19

On Mon, Apr 29, 2024 at 2:58 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Apr 29, 2024 at 01:32:40AM -0400, Tom Lane wrote:
> (BTW, on the same logic, should ecpg's twophase.pgc be using a
> prepared-transaction name that's less generic than "gxid"?)

I've hesitated a few seconds about that before sending my patch, but
refrained because this stuff does not care about the contents of
pg_prepared_xacts. I'd be OK to use something like an "ecpg_regress"
or something similar there.

I noticed that some TAP tests from recovery and subscription would
select the count from pg_prepared_xacts. I wonder if these tests would
be affected if there are any prepared transactions on the backend.

Thanks
Richard

Re: A failure in prepared_xacts test

От

Michael Paquier

Дата:

29 апреля, 12:19:48

On Mon, Apr 29, 2024 at 05:11:19PM +0800, Richard Guo wrote:
> I noticed that some TAP tests from recovery and subscription would
> select the count from pg_prepared_xacts.  I wonder if these tests would
> be affected if there are any prepared transactions on the backend.

TAP tests run in isolation of the rest with their own clusters
initialized from a copy initdb'd (rather than initdb because that's
much cheaper), so these scans are OK left alone.
--
Michael

Вложения

signature.asc

Re: A failure in prepared_xacts test

От

Tom Lane

Дата:

29 апреля, 16:45:16

Richard Guo <guofenglinux@gmail.com> writes:
> I noticed that some TAP tests from recovery and subscription would
> select the count from pg_prepared_xacts.  I wonder if these tests would
> be affected if there are any prepared transactions on the backend.

TAP tests shouldn't be at risk, because there is no "make
installcheck" equivalent for them.  Each TAP test creates its own
database instance (or maybe several), so that instance won't have
anything else going on.

            regards, tom lane

Re: A failure in prepared_xacts test

От

Michael Paquier

Дата:

30 апреля, 01:42:52

On Mon, Apr 29, 2024 at 09:45:16AM -0400, Tom Lane wrote:
> TAP tests shouldn't be at risk, because there is no "make
> installcheck" equivalent for them.  Each TAP test creates its own
> database instance (or maybe several), so that instance won't have
> anything else going on.

There are a few more 2PC transactions in test_decoding (no
installcheck), temp.sql, test_extensions.sql and pg_stat_statements's
utility.sql (no installcheck) but their GIDs are not that bad.
twophase_stream.sql has a GID "test1", which is kind of generic, but
it won't run in parallel.  At the end, only addressing the
prepared_xacts.sql and the ECPG bits looked enough to me, so I've
tweaked these with 7e61e4cc7cfc and called it a day.

I'd be curious about any discussion involving the structure of the
meson tests.
--
Michael

Вложения

signature.asc

Re: A failure in prepared_xacts test

От

Richard Guo

Дата:

30 апреля, 03:43:32

On Mon, Apr 29, 2024 at 9:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Richard Guo <guofenglinux@gmail.com> writes:
> I noticed that some TAP tests from recovery and subscription would
> select the count from pg_prepared_xacts. I wonder if these tests would
> be affected if there are any prepared transactions on the backend.

TAP tests shouldn't be at risk, because there is no "make
installcheck" equivalent for them. Each TAP test creates its own
database instance (or maybe several), so that instance won't have
anything else going on.

Thank you for the explanation. I wasn't aware of this before.

Thanks
Richard

Re: A failure in prepared_xacts test

От

Richard Guo

Дата:

30 апреля, 03:54:47

On Tue, Apr 30, 2024 at 6:43 AM Michael Paquier <michael@paquier.xyz> wrote:

I'd be curious about any discussion involving the structure of the
meson tests.

+1. I'm kind of worried that the expansion of parallelization could
lead to more instances of instability. Alexander mentioned one such
case at [1]. I haven't looked into it though.

[1] https://www.postgresql.org/message-id/cbf0156f-5aa1-91db-5802-82435dda03e6%40gmail.com

Thanks
Richard

Re: A failure in prepared_xacts test

От

Tom Lane

Дата:

30 апреля, 04:48:13

Richard Guo <guofenglinux@gmail.com> writes:
> +1.  I'm kind of worried that the expansion of parallelization could
> lead to more instances of instability.  Alexander mentioned one such
> case at [1].  I haven't looked into it though.
> [1] https://www.postgresql.org/message-id/cbf0156f-5aa1-91db-5802-82435dda03e6%40gmail.com

The mechanism there is pretty obvious: a plancache flush happened
at just the wrong (right?) time and caused the output to change,
as indeed the comment acknowledges:

 -- currently, this fails due to cached plan for "r.f1 + 1" expression
 -- (but if debug_discard_caches is on, it will succeed)

I wonder if we shouldn't just remove that test case as being
too unstable -- especially since it's not proving much anyway.

            regards, tom lane

Re: A failure in prepared_xacts test

От

Peter Eisentraut

Дата:

03 мая, 15:27:40

On 29.04.24 07:11, Tom Lane wrote:
> Up to now, we've only worried about whether tests running in parallel
> within a single test suite can interact.  It's quite scary to think
> that the meson setup has expanded the possibility of interactions
> to our entire source tree.  Maybe that was a bad idea and we should
> fix the meson infrastructure to not do that.  I fear that otherwise,
> we'll get bit regularly by very-low-probability bugs of this kind.

I don't think there is anything fundamentally different in the 
parallelism setups of the make-based and the meson-based tests.  There 
are just different implementation details that might affect the likely 
orderings and groupings.