Discussion: BF member drongo doesn't like 035_standby_logical_decoding.pl
I realized just now that drongo has been intermittently failing like this:

147/256 postgresql:recovery / recovery/035_standby_logical_decoding ERROR 2116.16s (exit status 255 or signal 127 SIGinvalid)
------------------------------------- 8< -------------------------------------
stderr:
# Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
# at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
# poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
# Failed test 'confl_active_logicalslot updated'
# at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 24.

This has been happening for some time, in all three branches where
that test script exists. The oldest failure that looks like that in
the v16 branch is

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35

However, there are older failures showing a timeout of
035_standby_logical_decoding.pl that don't provide any detail, but
might well be the same thing. The oldest one of those is from
2024-05-01, which is still considerably later than the test script
itself (added on 2023-04-08). So it would seem that this is something
we broke during 2024, rather than an aboriginal problem in this test.

A search of the buildfarm logs did not turn up similar failures
on any other animals.

I have no idea how to proceed on narrowing down the cause...

regards, tom lane

PS: FTR, the hits I got on this in the past 90 days were

 sysname |    branch     |      snapshot       |     stage     | l
---------+---------------+---------------------+---------------+---------------------------------------------------------------------------------
 drongo  | HEAD          | 2024-10-29 10:52:07 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
 drongo  | REL_16_STABLE | 2024-10-31 08:07:11 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
 drongo  | REL_17_STABLE | 2024-11-05 11:11:28 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
 drongo  | REL_16_STABLE | 2024-11-22 08:58:52 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
 drongo  | HEAD          | 2024-11-29 22:23:47 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
 drongo  | HEAD          | 2024-12-04 20:33:42 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
 drongo  | REL_17_STABLE | 2024-12-20 17:00:28 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
 drongo  | HEAD          | 2024-12-24 21:07:27 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
 drongo  | REL_16_STABLE | 2025-01-10 15:48:16 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
 drongo  | REL_17_STABLE | 2025-01-10 18:05:18 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
 drongo  | HEAD          | 2025-01-11 20:40:09 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
 drongo  | REL_16_STABLE | 2025-01-12 14:06:11 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'\r\r
 drongo  | REL_16_STABLE | 2025-01-13 07:19:22 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
 drongo  | REL_16_STABLE | 2025-01-14 10:48:16 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
 drongo  | HEAD          | 2025-01-20 20:11:00 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
 drongo  | REL_16_STABLE | 2025-01-24 01:54:55 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
 drongo  | REL_17_STABLE | 2025-01-24 04:21:44 | recoveryCheck | # Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'\r\r
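For reference, the check that is timing out boils down to a poll along these
lines (a rough sketch using the standard PostgreSQL::Test::Cluster helper; the
node-handle name is a placeholder, not the test's literal code):

    # Wait until the standby has counted one active-slot conflict for testdb.
    # On drongo this is what times out: the counter stays at 0, so the query
    # keeps returning 'f' instead of the expected 't'.
    $node_standby->poll_query_until('postgres',
        "select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'")
      or die "timed out waiting for confl_active_logicalslot to reach 1";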
Hello Tom,
24.01.2025 18:42, Tom Lane wrote:
> I realized just now that drongo has been intermittently failing like this:
>
> 147/256 postgresql:recovery / recovery/035_standby_logical_decoding ERROR 2116.16s (exit status 255 or signal 127 SIGinvalid)
> ------------------------------------- 8< -------------------------------------
> stderr:
> # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
> # at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
> # poll_query_until timed out executing this query:
> # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> # Failed test 'confl_active_logicalslot updated'
> # at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 255 just after 24.
>
> This has been happening for some time, in all three branches where
> that test script exists. The oldest failure that looks like that in
> the v16 branch is
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35
>
> However, there are older failures showing a timeout of
> 035_standby_logical_decoding.pl that don't provide any detail, but
> might well be the same thing. The oldest one of those is from
> 2024-05-01, which is still considerably later than the test script
> itself (added on 2023-04-08). So it would seem that this is something
> we broke during 2024, rather than an aboriginal problem in this test.
>
> A search of the buildfarm logs did not turn up similar failures
> on any other animals.
>
> I have no idea how to proceed on narrowing down the cause...
Please take a look at the list of such failures since 2024-06-01 I
collected here:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#035_standby_logical_decoding_standby.pl_fails_due_to_missing_activeslot_invalidation
There is also a reference to a discussion of the failure there:
https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
(In short, I observed that that test suffers from bgwriter's activity.)
Best regards,
Alexander Lakhin
Neon (https://neon.tech)
Hi,

Thanks for noticing the pattern.

On 2025-01-24 11:42:15 -0500, Tom Lane wrote:
> I realized just now that drongo has been intermittently failing like this:
>
> 147/256 postgresql:recovery / recovery/035_standby_logical_decoding ERROR 2116.16s (exit status 255 or signal 127 SIGinvalid)
> ------------------------------------- 8< -------------------------------------
> stderr:
> # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
> # at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
> # poll_query_until timed out executing this query:
> # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> # Failed test 'confl_active_logicalslot updated'
> # at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 255 just after 24.

There's a failure before this one:
[05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class

> This has been happening for some time, in all three branches where
> that test script exists. The oldest failure that looks like that in
> the v16 branch is
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35

Hm.

2024-09-06 05:23:29.031 UTC [1612:14] 035_standby_logical_decoding.pl LOG: logical decoding found consistent point at 0/4000180
2024-09-06 05:23:29.031 UTC [1612:15] 035_standby_logical_decoding.pl DETAIL: There are no running transactions.
2024-09-06 05:23:29.031 UTC [1612:16] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT "row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')
...
2024-09-06 05:23:49.601 UTC [3984:7] LOG: invalidating obsolete replication slot "row_removal_inactiveslot"
2024-09-06 05:23:49.601 UTC [3984:8] DETAIL: The slot conflicted with xid horizon 747.
2024-09-06 05:23:49.601 UTC [3984:9] CONTEXT: WAL redo at 0/4020210 for Heap2/PRUNE: snapshotConflictHorizon: 747, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0
...
2024-09-06 05:28:21.043 UTC [5648:6] LOG: restartpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.191 s, sync=0.001 s, total=3.192 s; sync files=0, longest=0.000 s, average=0.000 s; distance=249 kB, estimate=29516 kB; lsn=0/403E6B8, redo lsn=0/403E680
2024-09-06 05:28:21.043 UTC [5648:7] LOG: recovery restart point at 0/403E680
...
2024-09-06 05:54:38.414 UTC [3984:10] LOG: invalid record length at 0/403E768: expected at least 24, got 0

There is no further message from 1612 and no intervening message by 3984
(startup) between invalidating row_removal_inactiveslot and reaching the end
of the WAL.

The regress_log does have:

Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
done

I.e. this looks to not be a failure of the test to find the right message or
the stats about the termination not being flushed, but that
row_removal_activeslot isn't getting terminated. And due to the replay_lsn
check on the primary, we know that replication must have progressed past the
point of the conflict. So the problem doesn't appear to be that we're waiting
for termination, but that somehow we determined that termination is not
necessary.
Unfortunately with the logs available I can't really pinpoint why. I think we
need to
a) increase the postgres log level for the standby
b) add some explicit debug logging to report the contents of
   pg_replication_slots

I hope the issue is that somehow we didn't need to terminate, rather than
missing the need to terminate and just continuing on with replay. But I can't
think of a scenario in which there wouldn't be a need to terminate...

> However, there are older failures showing a timeout of
> 035_standby_logical_decoding.pl that don't provide any detail, but
> might well be the same thing. The oldest one of those is from
> 2024-05-01, which is still considerably later than the test script
> itself (added on 2023-04-08). So it would seem that this is something
> we broke during 2024, rather than an aboriginal problem in this test.

Looking at the failures I see on drongo, I also started to wonder if there's
something broken around recovery conflict handling. Perhaps windows specific?
There e.g. are the following:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-12-26%2005%3A49%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-10-14%2016%3A39%3A49

Which fail like this:
------------------------------------- 8< -------------------------------------
stderr:
# poll_query_until timed out executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
# Failed test 'startup deadlock: lock acquisition is waiting'
# at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 261.

Greetings,

Andres Freund
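As a concrete (hypothetical) sketch of what (a) and (b) could look like in the
TAP test, assuming the usual $node_standby handle; the chosen log level and
the exact placement in the script are assumptions:

    # (a) Bump the standby's log verbosity so recovery-conflict processing
    #     shows up in the server log (DEBUG2 is an arbitrary choice here).
    $node_standby->append_conf('postgresql.conf', 'log_min_messages = debug2');
    $node_standby->reload;

    # (b) Dump the standby's slot state into the TAP log right before the
    #     conflict check, so a failing run records what the slots looked like.
    note("standby slots before conflict check:\n"
          . $node_standby->safe_psql('postgres', 'SELECT * FROM pg_replication_slots'));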
Hi,

On 2025-01-24 14:23:26 -0500, Andres Freund wrote:
> On 2025-01-24 11:42:15 -0500, Tom Lane wrote:
> There's a failure before this one:
> [05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class
>
> > This has been happening for some time, in all three branches where
> > that test script exists. The oldest failure that looks like that in
> > the v16 branch is
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35
>
> Hm.
>
> 2024-09-06 05:23:29.031 UTC [1612:14] 035_standby_logical_decoding.pl LOG: logical decoding found consistent point at 0/4000180
> 2024-09-06 05:23:29.031 UTC [1612:15] 035_standby_logical_decoding.pl DETAIL: There are no running transactions.
> 2024-09-06 05:23:29.031 UTC [1612:16] 035_standby_logical_decoding.pl STATEMENT: START_REPLICATION SLOT "row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')
> ...
> 2024-09-06 05:23:49.601 UTC [3984:7] LOG: invalidating obsolete replication slot "row_removal_inactiveslot"
> 2024-09-06 05:23:49.601 UTC [3984:8] DETAIL: The slot conflicted with xid horizon 747.
> 2024-09-06 05:23:49.601 UTC [3984:9] CONTEXT: WAL redo at 0/4020210 for Heap2/PRUNE: snapshotConflictHorizon: 747, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0
> ...
> 2024-09-06 05:28:21.043 UTC [5648:6] LOG: restartpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.191 s, sync=0.001 s, total=3.192 s; sync files=0, longest=0.000 s, average=0.000 s; distance=249 kB, estimate=29516 kB; lsn=0/403E6B8, redo lsn=0/403E680
> 2024-09-06 05:28:21.043 UTC [5648:7] LOG: recovery restart point at 0/403E680
> ...
> 2024-09-06 05:54:38.414 UTC [3984:10] LOG: invalid record length at 0/403E768: expected at least 24, got 0
>
> There is no further message from 1612 and no intervening message by 3984
> (startup) between invalidating row_removal_inactiveslot and reaching the end
> of the WAL.
>
> The regress_log does have:
>
> Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
> done
>
> I.e. this looks to not be a failure of the test to find the right message or
> the stats about the termination not being flushed, but that
> row_removal_activeslot isn't getting terminated. And due to the replay_lsn
> check on the primary, we know that replication must have progressed past the
> point of the conflict. So the problem doesn't appear to be that we're waiting
> for termination, but that somehow we determined that termination is not
> necessary.
>
> Unfortunately with the logs available I can't really pinpoint why. I think we
> need to
> a) increase the postgres log level for the standby
> b) add some explicit debug logging to report the contents of
>    pg_replication_slots
>
> I hope the issue is that somehow we didn't need to terminate, rather than
> missing the need to terminate and just continuing on with replay. But I can't
> think of a scenario in which there wouldn't be a need to terminate...

<thinks>

Hm, maybe I'm missing something, but isn't it possible for the active slot to
actually progress decoding past the conflict point? It's an active slot, with
the consumer running in the background, so all that needs to happen for that
is that logical decoding progresses past the conflict point. That requires
there to be some reference to a newer xid in the WAL, but there's nothing
preventing that afaict?
In fact, I now saw this comment:

# Note that pg_current_snapshot() is used to get the horizon. It does
# not generate a Transaction/COMMIT WAL record, decreasing the risk of
# seeing a xl_running_xacts that would advance an active replication slot's
# catalog_xmin. Advancing the active replication slot's catalog_xmin
# would break some tests that expect the active slot to conflict with
# the catalog xmin horizon.

Which seems to be precisely what's happening here?

If that's the issue, I think we need to find a way to block logical decoding
from making forward progress during the test. The easiest way would be to
stop pg_recvlogical and emit a bunch of changes, so that the backend is
stalled sending out data. But that'd require a hard to predict amount of data
to be emitted, which isn't great.

But perhaps we could do something smarter, by starting a session on the
primary that acquires an access exclusive lock on a relation that logical
decoding will need to access? The tricky bit likely would be that it'd
somehow need to *not* prevent VACUUM on the primary. If we could trigger
VACUUM in a transaction on the primary this would be easy, but we can't.

Greetings,

Andres
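A minimal sketch of that locking idea, assuming the test's usual $node_primary
handle and the PostgreSQL::Test::Cluster background_psql API (names and
placement are assumptions, and, as noted above, this naive form would also
block the VACUUM on the primary, so it only illustrates the shape of the
approach):

    # Hold an access exclusive lock on a catalog the decoder will need, in an
    # open transaction on the primary, so that logical decoding on the standby
    # cannot progress past the conflict point while the lock is held.
    my $blocker = $node_primary->background_psql('testdb');
    $blocker->query_safe('BEGIN');
    $blocker->query_safe('LOCK TABLE pg_class IN ACCESS EXCLUSIVE MODE');

    # ... trigger the conflict and run the invalidation checks here ...

    $blocker->query_safe('COMMIT');
    $blocker->quit;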
Hi,

On 2025-01-24 21:00:00 +0200, Alexander Lakhin wrote:
> 24.01.2025 18:42, Tom Lane wrote:
> > I realized just now that drongo has been intermittently failing like this:
> >
> > 147/256 postgresql:recovery / recovery/035_standby_logical_decoding ERROR 2116.16s (exit status 255 or signal 127 SIGinvalid)
> > ------------------------------------- 8< -------------------------------------
> > stderr:
> > # Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
> > # at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
> > # poll_query_until timed out executing this query:
> > # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
> > # expecting this output:
> > # t
> > # last actual query output:
> > # f
> > # with stderr:
> > # Failed test 'confl_active_logicalslot updated'
> > # at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
> > # Tests were run but no plan was declared and done_testing() was not seen.
> > # Looks like your test exited with 255 just after 24.
> >
> > This has been happening for some time, in all three branches where
> > that test script exists. The oldest failure that looks like that in
> > the v16 branch is
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35
> >
> > However, there are older failures showing a timeout of
> > 035_standby_logical_decoding.pl that don't provide any detail, but
> > might well be the same thing. The oldest one of those is from
> > 2024-05-01, which is still considerably later than the test script
> > itself (added on 2023-04-08). So it would seem that this is something
> > we broke during 2024, rather than an aboriginal problem in this test.
> >
> > A search of the buildfarm logs did not turn up similar failures
> > on any other animals.
> >
> > I have no idea how to proceed on narrowing down the cause...
>
> Please take a look at the list of such failures since 2024-06-01 I
> collected here:
> https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#035_standby_logical_decoding_standby.pl_fails_due_to_missing_activeslot_invalidation
>
> There is also a reference to a discussion of the failure there:
> https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
> (In short, I observed that that test suffers from bgwriter's activity.)

I don't think it's quite right to blame this on bgwriter. E.g. a checkpoint
will also emit XLOG_RUNNING_XACTS. The problem is that the test just is racy
and that needs to be fixed.

Greetings,

Andres Freund
Alexander Lakhin <exclusion@gmail.com> writes:
> There is also a reference to a discussion of the failure there:
> https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
> (In short, I observed that that test suffers from bgwriter's activity.)

Hmm, that's an interesting theory, but it fails to explain why only drongo is
showing this failure. It's not particularly slow compared to some other BF
animals. I suspect the correct explanation involves "it's Windows" somehow,
but no idea exactly how.

BTW, looking more closely at the failures, I notice that some of them are
'with vacuum on pg_authid' but others are 'with vacuum on pg_class'. So those
are two different steps in the test script, and there are other steps that
are very similar but have not failed. Curiouser and curiouser.

regards, tom lane
Hi,

On 2025-01-24 14:56:45 -0500, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
> > There is also a reference to a discussion of the failure there:
> > https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
> > (In short, I observed that that test suffers from bgwriter's activity.)
>
> Hmm, that's an interesting theory, but it fails to explain why
> only drongo is showing this failure. It's not particularly slow
> compared to some other BF animals.

One interesting bit is that both on successful and failing runs the preceding
test is quite slow:

[14:18:13.386](3.192s) ok 25 - row_removal_activeslot on standby created
### Reloading node "standby"
# Running: pg_ctl -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/recovery/035_standby_logical_decoding\\data/t_035_standby_logical_decoding_standby_data/pgdata reload
server signaled
#### Begin standard error
psql:<stdin>:1: INFO: vacuuming "testdb.pg_catalog.pg_class"
psql:<stdin>:1: INFO: finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 419 remain, 0 are dead but not yet removable
removable cutoff: 760, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 11 pages set all-visible, 10 pages set all-frozen (0 were all-visible)
index scan needed: 3 pages from table (27.27% of total) had 5 dead item identifiers removed
index "pg_class_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_relname_nsp_index": pages: 5 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_tblspc_relfilenode_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 11.684 MB/s, avg write rate: 23.369 MB/s
buffer usage: 90 hits, 8 reads, 16 dirtied
WAL usage: 28 records, 15 full page images, 123107 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
#### End standard error
Waiting for replication conn standby's replay_lsn to pass 0/403FA80 on primary
done
[14:18:48.597](35.210s) ok 26 - inactiveslot slot invalidation is logged with vacuum on pg_class

That's really quite slow. Looking at a few other instances it always just
seems to take a few seconds. Obviously running for longer does make it more
likely for checkpoint/bgwriter to log running xacts.

Unfortunately there's no timing visible between the vacuum and waiting for
replication... And the passing runs don't seem to have the server logs.
Because of that I'm looking at a failure:

[05:23:26.630](3.831s) ok 21 - row_removal_activeslot on standby created
### Reloading node "standby"
# Running: pg_ctl -D C:\\prog\\bf\\root\\REL_16_STABLE\\pgsql.build/testrun/recovery/035_standby_logical_decoding\\data/t_035_standby_logical_decoding_standby_data/pgdata reload
server signaled
#### Begin standard error
psql:<stdin>:1: INFO: vacuuming "testdb.pg_catalog.pg_class"
psql:<stdin>:1: INFO: finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 417 remain, 0 are dead but not yet removable
removable cutoff: 748, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 3 pages from table (27.27% of total) had 5 dead item identifiers removed
index "pg_class_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_relname_nsp_index": pages: 5 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_tblspc_relfilenode_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 0.264 MB/s, avg write rate: 0.495 MB/s
buffer usage: 77 hits, 8 misses, 15 dirtied
WAL usage: 28 records, 15 full page images, 122841 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.23 s
#### End standard error
Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
done
[05:23:54.630](28.000s) ok 22 - inactiveslot slot invalidation is logged with vacuum on pg_class
[05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class

The vacuum only starts at 05:23:49.294, i.e. after ~23s.

2024-09-06 05:23:49.294 UTC [552:1] [unknown] LOG: connection received: host=127.0.0.1 port=51838
2024-09-06 05:23:49.301 UTC [552:2] [unknown] LOG: connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" method=sspi (C:/prog/bf/root/REL_16_STABLE/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_primary_data/pgdata/pg_hba.conf:2)
2024-09-06 05:23:49.301 UTC [552:3] [unknown] LOG: connection authorized: user=pgrunner database=testdb application_name=035_standby_logical_decoding.pl
2024-09-06 05:23:49.355 UTC [552:4] 035_standby_logical_decoding.pl LOG: statement: VACUUM verbose pg_class;
2024-09-06 05:23:49.607 UTC [552:5] 035_standby_logical_decoding.pl LOG: statement: INSERT INTO flush_wal DEFAULT VALUES;
2024-09-06 05:23:49.609 UTC [552:6] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.322 user=pgrunner database=testdb host=127.0.0.1 port=51838

Before that we see:

2024-09-06 05:23:27.683 UTC [168:4] 035_standby_logical_decoding.pl LOG: statement: SELECT xmin IS NULL AND catalog_xmin IS NULL FROM pg_catalog.pg_replication_slots WHERE slot_name = 'primary_physical';
2024-09-06 05:23:31.447 UTC [5836:4] 035_standby_logical_decoding.pl LOG: statement: select pg_snapshot_xmin(pg_current_snapshot());
2024-09-06 05:23:36.963 UTC [4908:4] 035_standby_logical_decoding.pl LOG: statement: CREATE TABLE conflict_test(x integer, y text);
2024-09-06 05:23:36.972 UTC [4908:5] 035_standby_logical_decoding.pl LOG: statement: DROP TABLE conflict_test;
2024-09-06 05:23:45.192 UTC [4240:4] 035_standby_logical_decoding.pl LOG: statement: SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - 746) > 0

Between 05:23:30.797 and 05:23:49.601 nothing is logged on the standby.
There's a lot of time between disconnects and connects in the test around
that time:

2024-09-06 05:23:31.591 UTC [5836:5] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.240 user=pgrunner database=testdb host=127.0.0.1 port=51817
2024-09-06 05:23:36.897 UTC [4908:1] [unknown] LOG: connection received: host=127.0.0.1 port=51829
2024-09-06 05:23:37.032 UTC [4908:6] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.142 user=pgrunner database=testdb host=127.0.0.1 port=51829
2024-09-06 05:23:45.135 UTC [4240:1] [unknown] LOG: connection received: host=127.0.0.1 port=51836
2024-09-06 05:23:45.272 UTC [4240:5] 035_standby_logical_decoding.pl LOG: disconnection: session time: 0:00:00.152 user=pgrunner database=testdb host=127.0.0.1 port=51836
2024-09-06 05:23:49.294 UTC [552:1] [unknown] LOG: connection received: host=127.0.0.1 port=51838

Even though afaict the perl script doesn't do anything else between those
connections.

> BTW, looking more closely at the failures, I notice that some of them
> are 'with vacuum on pg_authid' but others are 'with vacuum on pg_class'.
> So those are two different steps in the test script, and there are
> other steps that are very similar but have not failed. Curiouser
> and curiouser.

I think that makes some sense:

"Scenario 1": Doesn't have the race to the same degree, because a VACUUM FULL
on pg_class is used, which prevents logical decoding from progressing after
it started (due to the logged AEL at the start of VACFULL).

"Scenario 2" and "Scenario 3": Nearly the same test, except once with a
database-local relation and once with a shared rel.

"Scenario 4": Doesn't expect to trigger a conflict, so the race doesn't
matter.

"Scenario 5": Theoretically racy, but the window seems much smaller; it looks
like the conflict has multiple chances to be triggered due to each of the
updates happening in a separate session (and thus transaction).

Greetings,

Andres Freund
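If the race in scenarios 2 and 3 is really the culprit, one way to make it
visible in a failing run would be a diagnostic along these lines (purely
illustrative; the node handles are assumptions, and the slot name is taken
from the log excerpts above):

    # Log the horizon the primary-side VACUUM will rely on and the active
    # slot's catalog_xmin on the standby. If the slot's catalog_xmin has
    # already advanced past that horizon, no conflict/termination is expected,
    # which would explain the missing invalidation message.
    my $horizon = $node_primary->safe_psql('testdb',
        'SELECT pg_snapshot_xmin(pg_current_snapshot())');
    my $slot_xmin = $node_standby->safe_psql('postgres',
        "SELECT catalog_xmin FROM pg_replication_slots WHERE slot_name = 'row_removal_activeslot'");
    note("conflict horizon on primary: $horizon, row_removal_activeslot catalog_xmin: $slot_xmin");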