Обсуждение: A failure in prepared_xacts test
Yesterday I noticed a failure on cirrus-ci for the 'Right Semi Join'
patch. The failure can be found at [1], and it looks like:
--- /tmp/cirrus-ci-build/src/test/regress/expected/prepared_xacts.out 2024-04-27 00:41:25.831297000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/prepared_xacts.out 2024-04-27 00:45:50.261369000 +0000
@@ -83,8 +83,9 @@
SELECT gid FROM pg_prepared_xacts;
gid
------
+ gxid
foo3
-(1 row)
+(2 rows)
Upon closer look, it seems that this issue is not caused by the patch
about 'Right Semi Join', because this query, which initially included
two left joins, can actually be reduced to a function scan after
removing these two useless left joins. It seems that no semi-joins
would be involved.
EXPLAIN SELECT gid FROM pg_prepared_xacts;
QUERY PLAN
----------------------------------------------------------------------------
Function Scan on pg_prepared_xact p (cost=0.00..10.00 rows=1000 width=32)
(1 row)
Does anyone have any clue to this failure?
FWIW, after another run of this test, the failure just disappears. Does
it suggest that the test case is flaky?
[1] https://api.cirrus-ci.com/v1/artifact/task/6220592364388352/testrun/build/testrun/regress-running/regress/regression.diffs
Thanks
Richard
patch. The failure can be found at [1], and it looks like:
--- /tmp/cirrus-ci-build/src/test/regress/expected/prepared_xacts.out 2024-04-27 00:41:25.831297000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/prepared_xacts.out 2024-04-27 00:45:50.261369000 +0000
@@ -83,8 +83,9 @@
SELECT gid FROM pg_prepared_xacts;
gid
------
+ gxid
foo3
-(1 row)
+(2 rows)
Upon closer look, it seems that this issue is not caused by the patch
about 'Right Semi Join', because this query, which initially included
two left joins, can actually be reduced to a function scan after
removing these two useless left joins. It seems that no semi-joins
would be involved.
EXPLAIN SELECT gid FROM pg_prepared_xacts;
QUERY PLAN
----------------------------------------------------------------------------
Function Scan on pg_prepared_xact p (cost=0.00..10.00 rows=1000 width=32)
(1 row)
Does anyone have any clue to this failure?
FWIW, after another run of this test, the failure just disappears. Does
it suggest that the test case is flaky?
[1] https://api.cirrus-ci.com/v1/artifact/task/6220592364388352/testrun/build/testrun/regress-running/regress/regression.diffs
Thanks
Richard
On Mon, Apr 29, 2024 at 09:12:48AM +0800, Richard Guo wrote: > Does anyone have any clue to this failure? > > FWIW, after another run of this test, the failure just disappears. Does > it suggest that the test case is flaky? If you grep the source tree, you'd notice that a prepared transaction named gxid only exists in the 2PC tests of ECPG, in src/interfaces/ecpg/test/sql/twophase.pgc. So the origin of the failure comes from a race condition due to test parallelization, because the scan of pg_prepared_xacts affects all databases with installcheck, and in your case it means that the scan of pg_prepared_xacts was running in parallel of the ECPG tests with an installcheck. The only location in the whole tree where we want to do predictible scans of pg_prepared_xacts is prepared_xacts.sql, so rather than playing with 2PC transactions across a bunch of tests, I think that we should do two things, both touching prepared_xacts.sql: - The 2PC transactions run in the main regression test suite should use names that would be unlikely used elsewhere. - Limit the scans of pg_prepared_xacts on these name patterns to avoid interferences. See for example the attached with both expected outputs updated depending on the value set for max_prepared_transactions in the backend. There may be an argument in back-patching that, but I don't recall seeing this failure in the CI, so perhaps that's not worth bothering with. What do you think? -- Michael
Вложения
Hello Richard, 29.04.2024 04:12, Richard Guo wrote: > Does anyone have any clue to this failure? > > FWIW, after another run of this test, the failure just disappears. Does > it suggest that the test case is flaky? > I think this could be caused by the ecpg test twophase executed simultaneously with the test prepared_xacts thanks to meson's jobs parallelization. Best regards, Alexander
Michael Paquier <michael@paquier.xyz> writes: > If you grep the source tree, you'd notice that a prepared transaction > named gxid only exists in the 2PC tests of ECPG, in > src/interfaces/ecpg/test/sql/twophase.pgc. So the origin of the > failure comes from a race condition due to test parallelization, > because the scan of pg_prepared_xacts affects all databases with > installcheck, and in your case it means that the scan of > pg_prepared_xacts was running in parallel of the ECPG tests with an > installcheck. Up to now, we've only worried about whether tests running in parallel within a single test suite can interact. It's quite scary to think that the meson setup has expanded the possibility of interactions to our entire source tree. Maybe that was a bad idea and we should fix the meson infrastructure to not do that. I fear that otherwise, we'll get bit regularly by very-low-probability bugs of this kind. regards, tom lane
On Mon, Apr 29, 2024 at 01:11:00AM -0400, Tom Lane wrote: > Up to now, we've only worried about whether tests running in parallel > within a single test suite can interact. It's quite scary to think > that the meson setup has expanded the possibility of interactions > to our entire source tree. Maybe that was a bad idea and we should > fix the meson infrastructure to not do that. I fear that otherwise, > we'll get bit regularly by very-low-probability bugs of this kind. I don't disagree with your point, still I'm not sure that this can be made entirely bullet-proof. Anyway, I think that we should still improve this test and make it more robust for parallel operations: installcheck fails equally on HEAD if there is a prepared transaction on the backend where the tests run, and that seems like a bad idea to me to rely on cluster-wide scans for what should be a "local" test. -- Michael
Вложения
Hello Tom and Michael, 29.04.2024 08:11, Tom Lane wrote: > Michael Paquier <michael@paquier.xyz> writes: >> If you grep the source tree, you'd notice that a prepared transaction >> named gxid only exists in the 2PC tests of ECPG, in >> src/interfaces/ecpg/test/sql/twophase.pgc. So the origin of the >> failure comes from a race condition due to test parallelization, >> because the scan of pg_prepared_xacts affects all databases with >> installcheck, and in your case it means that the scan of >> pg_prepared_xacts was running in parallel of the ECPG tests with an >> installcheck. > Up to now, we've only worried about whether tests running in parallel > within a single test suite can interact. It's quite scary to think > that the meson setup has expanded the possibility of interactions > to our entire source tree. Maybe that was a bad idea and we should > fix the meson infrastructure to not do that. I fear that otherwise, > we'll get bit regularly by very-low-probability bugs of this kind. Yes, I'm afraid of the same. For example, the test failure [1] is of that ilk, I guess. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-04-17%2016%3A33%3A23 Best regards, Alexander
Michael Paquier <michael@paquier.xyz> writes: > I don't disagree with your point, still I'm not sure that this can be > made entirely bullet-proof. Anyway, I think that we should still > improve this test and make it more robust for parallel operations: > installcheck fails equally on HEAD if there is a prepared transaction > on the backend where the tests run, and that seems like a bad idea to > me to rely on cluster-wide scans for what should be a "local" test. True, it's antithetical to the point of an "installcheck" test if unrelated actions in another database can break it. So I'm fine with tightening up prepared_xacts's query. I just wonder how far we want to try to carry this. (BTW, on the same logic, should ecpg's twophase.pgc be using a prepared-transaction name that's less generic than "gxid"?) regards, tom lane
On Mon, Apr 29, 2024 at 01:32:40AM -0400, Tom Lane wrote: > (BTW, on the same logic, should ecpg's twophase.pgc be using a > prepared-transaction name that's less generic than "gxid"?) I've hesitated a few seconds about that before sending my patch, but refrained because this stuff does not care about the contents of pg_prepared_xacts. I'd be OK to use something like an "ecpg_regress" or something similar there. -- Michael
Вложения
On Mon, Apr 29, 2024 at 1:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Up to now, we've only worried about whether tests running in parallel
within a single test suite can interact. It's quite scary to think
that the meson setup has expanded the possibility of interactions
to our entire source tree. Maybe that was a bad idea and we should
fix the meson infrastructure to not do that. I fear that otherwise,
we'll get bit regularly by very-low-probability bugs of this kind.
I have the same concern. I suspect that the scan of pg_prepared_xacts
is not the only test that could cause problems when running in parallel
to other tests from the entire source tree.
Thanks
Richard
is not the only test that could cause problems when running in parallel
to other tests from the entire source tree.
Thanks
Richard
On Mon, Apr 29, 2024 at 2:58 PM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Apr 29, 2024 at 01:32:40AM -0400, Tom Lane wrote:
> (BTW, on the same logic, should ecpg's twophase.pgc be using a
> prepared-transaction name that's less generic than "gxid"?)
I've hesitated a few seconds about that before sending my patch, but
refrained because this stuff does not care about the contents of
pg_prepared_xacts. I'd be OK to use something like an "ecpg_regress"
or something similar there.
I noticed that some TAP tests from recovery and subscription would
select the count from pg_prepared_xacts. I wonder if these tests would
be affected if there are any prepared transactions on the backend.
Thanks
Richard
select the count from pg_prepared_xacts. I wonder if these tests would
be affected if there are any prepared transactions on the backend.
Thanks
Richard
On Mon, Apr 29, 2024 at 05:11:19PM +0800, Richard Guo wrote: > I noticed that some TAP tests from recovery and subscription would > select the count from pg_prepared_xacts. I wonder if these tests would > be affected if there are any prepared transactions on the backend. TAP tests run in isolation of the rest with their own clusters initialized from a copy initdb'd (rather than initdb because that's much cheaper), so these scans are OK left alone. -- Michael
Вложения
Richard Guo <guofenglinux@gmail.com> writes: > I noticed that some TAP tests from recovery and subscription would > select the count from pg_prepared_xacts. I wonder if these tests would > be affected if there are any prepared transactions on the backend. TAP tests shouldn't be at risk, because there is no "make installcheck" equivalent for them. Each TAP test creates its own database instance (or maybe several), so that instance won't have anything else going on. regards, tom lane
On Mon, Apr 29, 2024 at 09:45:16AM -0400, Tom Lane wrote: > TAP tests shouldn't be at risk, because there is no "make > installcheck" equivalent for them. Each TAP test creates its own > database instance (or maybe several), so that instance won't have > anything else going on. There are a few more 2PC transactions in test_decoding (no installcheck), temp.sql, test_extensions.sql and pg_stat_statements's utility.sql (no installcheck) but their GIDs are not that bad. twophase_stream.sql has a GID "test1", which is kind of generic, but it won't run in parallel. At the end, only addressing the prepared_xacts.sql and the ECPG bits looked enough to me, so I've tweaked these with 7e61e4cc7cfc and called it a day. I'd be curious about any discussion involving the structure of the meson tests. -- Michael
Вложения
On Mon, Apr 29, 2024 at 9:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
> I noticed that some TAP tests from recovery and subscription would
> select the count from pg_prepared_xacts. I wonder if these tests would
> be affected if there are any prepared transactions on the backend.
TAP tests shouldn't be at risk, because there is no "make
installcheck" equivalent for them. Each TAP test creates its own
database instance (or maybe several), so that instance won't have
anything else going on.
Thank you for the explanation. I wasn't aware of this before.
Thanks
Richard
Thanks
Richard
On Tue, Apr 30, 2024 at 6:43 AM Michael Paquier <michael@paquier.xyz> wrote:
I'd be curious about any discussion involving the structure of the
meson tests.
+1. I'm kind of worried that the expansion of parallelization could
lead to more instances of instability. Alexander mentioned one such
case at [1]. I haven't looked into it though.
[1] https://www.postgresql.org/message-id/cbf0156f-5aa1-91db-5802-82435dda03e6%40gmail.com
Thanks
Richard
lead to more instances of instability. Alexander mentioned one such
case at [1]. I haven't looked into it though.
[1] https://www.postgresql.org/message-id/cbf0156f-5aa1-91db-5802-82435dda03e6%40gmail.com
Thanks
Richard
Richard Guo <guofenglinux@gmail.com> writes: > +1. I'm kind of worried that the expansion of parallelization could > lead to more instances of instability. Alexander mentioned one such > case at [1]. I haven't looked into it though. > [1] https://www.postgresql.org/message-id/cbf0156f-5aa1-91db-5802-82435dda03e6%40gmail.com The mechanism there is pretty obvious: a plancache flush happened at just the wrong (right?) time and caused the output to change, as indeed the comment acknowledges: -- currently, this fails due to cached plan for "r.f1 + 1" expression -- (but if debug_discard_caches is on, it will succeed) I wonder if we shouldn't just remove that test case as being too unstable -- especially since it's not proving much anyway. regards, tom lane
On 29.04.24 07:11, Tom Lane wrote: > Up to now, we've only worried about whether tests running in parallel > within a single test suite can interact. It's quite scary to think > that the meson setup has expanded the possibility of interactions > to our entire source tree. Maybe that was a bad idea and we should > fix the meson infrastructure to not do that. I fear that otherwise, > we'll get bit regularly by very-low-probability bugs of this kind. I don't think there is anything fundamentally different in the parallelism setups of the make-based and the meson-based tests. There are just different implementation details that might affect the likely orderings and groupings.