Discussion: BF member drongo doesn't like 035_standby_logical_decoding.pl

BF member drongo doesn't like 035_standby_logical_decoding.pl

From:
Tom Lane
Date:
I realized just now that drongo has been intermittently failing like this:

147/256 postgresql:recovery / recovery/035_standby_logical_decoding               ERROR          2116.16s   (exit status 255 or signal 127 SIGinvalid)
------------------------------------- 8< -------------------------------------
stderr:
#   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
#   at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
# poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
#   Failed test 'confl_active_logicalslot updated'
#   at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 24.
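
For reference, the check that times out here boils down to a poll of
pg_stat_database_conflicts, roughly like this (a paraphrase assuming the
stock PostgreSQL::Test::Cluster API, not the exact test code):

    # wait for the standby to report the recovery conflict for the active slot
    $node_standby->poll_query_until('testdb',
        qq[select (confl_active_logicalslot = 1)
           from pg_stat_database_conflicts where datname = 'testdb'])
      or die "timed out waiting for confl_active_logicalslot to be updated";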

This has been happening for some time, in all three branches where
that test script exists.  The oldest failure that looks like that in
the v16 branch is

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35

However, there are older failures showing a timeout of
035_standby_logical_decoding.pl that don't provide any detail, but
might well be the same thing.  The oldest one of those is from
2024-05-01, which is still considerably later than the test script
itself (added on 2023-04-08).  So it would seem that this is something
we broke during 2024, rather than an aboriginal problem in this test.

A search of the buildfarm logs did not turn up similar failures
on any other animals.

I have no idea how to proceed on narrowing down the cause...

            regards, tom lane

PS: FTR, the hits I got on this in the past 90 days were

 sysname |    branch     |      snapshot       |     stage     |                                        l
---------+---------------+---------------------+---------------+-----------------------------------------------------------------------------------
 drongo  | HEAD          | 2024-10-29 10:52:07 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
 drongo  | REL_16_STABLE | 2024-10-31 08:07:11 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
 drongo  | REL_17_STABLE | 2024-11-05 11:11:28 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
 drongo  | REL_16_STABLE | 2024-11-22 08:58:52 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
 drongo  | HEAD          | 2024-11-29 22:23:47 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
 drongo  | HEAD          | 2024-12-04 20:33:42 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
 drongo  | REL_17_STABLE | 2024-12-20 17:00:28 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
 drongo  | HEAD          | 2024-12-24 21:07:27 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
 drongo  | REL_16_STABLE | 2025-01-10 15:48:16 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
 drongo  | REL_17_STABLE | 2025-01-10 18:05:18 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
 drongo  | HEAD          | 2025-01-11 20:40:09 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
 drongo  | REL_16_STABLE | 2025-01-12 14:06:11 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
 drongo  | REL_16_STABLE | 2025-01-13 07:19:22 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
 drongo  | REL_16_STABLE | 2025-01-14 10:48:16 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
 drongo  | HEAD          | 2025-01-20 20:11:00 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
 drongo  | REL_16_STABLE | 2025-01-24 01:54:55 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
 drongo  | REL_17_STABLE | 2025-01-24 04:21:44 | recoveryCheck | #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'



Re: BF member drongo doesn't like 035_standby_logical_decoding.pl

From:
Alexander Lakhin
Date:
Hello Tom,

24.01.2025 18:42, Tom Lane wrote:
I realized just now that drongo has been intermittently failing like this:

147/256 postgresql:recovery / recovery/035_standby_logical_decoding               ERROR          2116.16s   (exit status 255 or signal 127 SIGinvalid)
------------------------------------- 8< -------------------------------------
stderr:
#   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
#   at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
# poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
#   Failed test 'confl_active_logicalslot updated'
#   at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 24.

This has been happening for some time, in all three branches where
that test script exists.  The oldest failure that looks like that in
the v16 branch is

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35

However, there are older failures showing a timeout of
035_standby_logical_decoding.pl that don't provide any detail, but
might well be the same thing.  The oldest one of those is from
2024-05-01, which is still considerably later than the test script
itself (added on 2023-04-08).  So it would seem that this is something
we broke during 2024, rather than an aboriginal problem in this test.

A search of the buildfarm logs did not turn up similar failures
on any other animals.

I have no idea how to proceed on narrowing down the cause...


Please take a look at the list of such failures since 2024-06-01 I
collected here:
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#035_standby_logical_decoding_standby.pl_fails_due_to_missing_activeslot_invalidation

There is also a reference to a discussion of the failure there:
https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
(In short, I observed that that test suffers from bgwriter's activity.)

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Re: BF member drongo doesn't like 035_standby_logical_decoding.pl

From:
Andres Freund
Date:
Hi,

Thanks for noticing the pattern.

On 2025-01-24 11:42:15 -0500, Tom Lane wrote:
> I realized just now that drongo has been intermittently failing like this:
>
> 147/256 postgresql:recovery / recovery/035_standby_logical_decoding               ERROR          2116.16s   (exit status 255 or signal 127 SIGinvalid)
> ------------------------------------- 8< -------------------------------------
> stderr:
> #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
> #   at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
> # poll_query_until timed out executing this query:
> # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> #   Failed test 'confl_active_logicalslot updated'
> #   at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 255 just after 24.
>

There's a failure before this one:
[05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class


> This has been happening for some time, in all three branches where
> that test script exists.  The oldest failure that looks like that in
> the v16 branch is
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35

Hm.

2024-09-06 05:23:29.031 UTC [1612:14] 035_standby_logical_decoding.pl LOG:  logical decoding found consistent point at 0/4000180
2024-09-06 05:23:29.031 UTC [1612:15] 035_standby_logical_decoding.pl DETAIL:  There are no running transactions.
2024-09-06 05:23:29.031 UTC [1612:16] 035_standby_logical_decoding.pl STATEMENT:  START_REPLICATION SLOT "row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')
...
2024-09-06 05:23:49.601 UTC [3984:7] LOG:  invalidating obsolete replication slot "row_removal_inactiveslot"
2024-09-06 05:23:49.601 UTC [3984:8] DETAIL:  The slot conflicted with xid horizon 747.
2024-09-06 05:23:49.601 UTC [3984:9] CONTEXT:  WAL redo at 0/4020210 for Heap2/PRUNE: snapshotConflictHorizon: 747, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0
...
2024-09-06 05:28:21.043 UTC [5648:6] LOG:  restartpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.191 s, sync=0.001 s, total=3.192 s; sync files=0, longest=0.000 s, average=0.000 s; distance=249 kB, estimate=29516 kB; lsn=0/403E6B8, redo lsn=0/403E680
2024-09-06 05:28:21.043 UTC [5648:7] LOG:  recovery restart point at 0/403E680
...
2024-09-06 05:54:38.414 UTC [3984:10] LOG:  invalid record length at 0/403E768: expected at least 24, got 0

There is no further message from 1612 and no intervening message by 3984
(startup) between invalidating row_removal_inactiveslot and reaching the end
of the WAL.

The regress_log does have:

Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
done


I.e. this looks to not be a failure of the test to find the right message or
the stats about the termination not being flushed, but that
row_removal_activeslot isn't getting terminated. And due to the replay_lsn
check on the primary, we know that replication must have progressed past the
point of the conflict. So the problem doesn't appear to be that we're waiting
for termination, but that somehow we determined that termination is not
necessary.

Unfortunately with the logs available I can't really pinpoint why. I think we
need to
a) increase the postgres log level for the standby
b) add some explicit debug logging to report the contents of
   pg_replication_slots

I hope the issue is that somehow we didn't need to terminate, rather than
missing the need to terminate and just continuing on with replay. But I can't
think of a scenario in which there wouldn't be a need to terminate...
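
For what it's worth, in TAP-test terms (a) and (b) could look roughly like
the following sketch (assuming the stock PostgreSQL::Test::Cluster helpers;
placement and the exact query are only illustrative):

    # (a) more verbose server logging on the standby
    $node_standby->append_conf('postgresql.conf', 'log_min_messages = debug2');
    $node_standby->reload;

    # (b) dump the slot state into the regress_log right before the
    # invalidation checks, so a failing run records what the slots looked like
    note "standby slots:\n"
      . $node_standby->safe_psql('postgres',
        'SELECT * FROM pg_replication_slots');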



> However, there are older failures showing a timeout of
> 035_standby_logical_decoding.pl that don't provide any detail, but
> might well be the same thing.  The oldest one of those is from
> 2024-05-01, which is still considerably later than the test script
> itself (added on 2023-04-08).  So it would seem that this is something
> we broke during 2024, rather than an aboriginal problem in this test.

Looking at the failures I see on drongo, I also started to wonder if there's
something broken around recovery conflict handling. Perhaps Windows-specific?

There e.g. are the following:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-12-26%2005%3A49%3A14
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-10-14%2016%3A39%3A49

Which fail like this:
------------------------------------- 8< -------------------------------------
stderr:
# poll_query_until timed out executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
#   Failed test 'startup deadlock: lock acquisition is waiting'
#   at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 261.


Greetings,

Andres Freund



Re: BF member drongo doesn't like 035_standby_logical_decoding.pl

From:
Andres Freund
Date:
Hi,

On 2025-01-24 14:23:26 -0500, Andres Freund wrote:
> On 2025-01-24 11:42:15 -0500, Tom Lane wrote:
> There's a failure before this one:
> [05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class
>
>
> > This has been happening for some time, in all three branches where
> > that test script exists.  The oldest failure that looks like that in
> > the v16 branch is
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35
>
> Hm.
>
> 2024-09-06 05:23:29.031 UTC [1612:14] 035_standby_logical_decoding.pl LOG:  logical decoding found consistent point at 0/4000180
> 2024-09-06 05:23:29.031 UTC [1612:15] 035_standby_logical_decoding.pl DETAIL:  There are no running transactions.
> 2024-09-06 05:23:29.031 UTC [1612:16] 035_standby_logical_decoding.pl STATEMENT:  START_REPLICATION SLOT "row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')
> ...
> 2024-09-06 05:23:49.601 UTC [3984:7] LOG:  invalidating obsolete replication slot "row_removal_inactiveslot"
> 2024-09-06 05:23:49.601 UTC [3984:8] DETAIL:  The slot conflicted with xid horizon 747.
> 2024-09-06 05:23:49.601 UTC [3984:9] CONTEXT:  WAL redo at 0/4020210 for Heap2/PRUNE: snapshotConflictHorizon: 747, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0
> ...
> 2024-09-06 05:28:21.043 UTC [5648:6] LOG:  restartpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.191 s, sync=0.001 s, total=3.192 s; sync files=0, longest=0.000 s, average=0.000 s; distance=249 kB, estimate=29516 kB; lsn=0/403E6B8, redo lsn=0/403E680
> 2024-09-06 05:28:21.043 UTC [5648:7] LOG:  recovery restart point at 0/403E680
> ...
> 2024-09-06 05:54:38.414 UTC [3984:10] LOG:  invalid record length at 0/403E768: expected at least 24, got 0
>
> There is no further message from 1612 and no intervening message by 3984
> (startup) between invalidating row_removal_inactiveslot and reaching the end
> of the WAL.
>
> The regress_log does have:
>
> Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
> done
>
>
> I.e. this looks to not be a failure of the test to find the right message or
> the stats about the termination not being flushed, but that
> row_removal_activeslot isn't getting terminated. And due to the replay_lsn
> check on the primary, we know that replication must have progressed past the
> point of the conflict. So the problem doesn't appear to be that we're waiting
> for termination, but that somehow we determined that termination is not
> necessary.
>
> Unfortunately with the logs available I can't really pinpoint why. I think we
> need to
> a) increase the postgres log level for the standby
> b) add some explicit debug logging to report the contents of
>    pg_replication_slots
>
> I hope the issue is that somehow we didn't need to terminate, rather than
> missing the need to terminate and just continuing on with replay. But I can't
> think of a scenario in which there wouldn't be a need to terminate...

<thinks>

Hm, maybe I'm missing something, but isn't it possible for the active slot to
actually progress decoding past the conflict point? It's an active slot, with
the consumer running in the background, so all that needs to happen for that
is that logical decoding progresses past the conflict point. That requires
some reference to a newer xid to be present in the WAL, but there's nothing
preventing that afaict?


In fact, I now saw this comment:

# Note that pg_current_snapshot() is used to get the horizon.  It does
# not generate a Transaction/COMMIT WAL record, decreasing the risk of
# seeing a xl_running_xacts that would advance an active replication slot's
# catalog_xmin.  Advancing the active replication slot's catalog_xmin
# would break some tests that expect the active slot to conflict with
# the catalog xmin horizon.

Which seems precisely what's happening here?



If that's the issue, I think we need to find a way to block logical decoding
from making forward progress during the test.

The easiest way would be to stop pg_recvlogical and emit a bunch of changes,
so that the backend is stalled sending out data. But that'd require a hard to
predict amount of data to be emitted, which isn't great.


But perhaps we could do something smarter, by starting a session on the
primary that acquires an access exclusive lock on a relation that logical
decoding will need to access?  The tricky bit likely would be that it'd
somehow need to *not* prevent VACUUM on the primary.

If we could trigger VACUUM in a transaction on the primary this would be
easy, but we can't.
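
Just to illustrate the mechanics (not a worked-out fix), such a blocking
session could be driven from the test with the BackgroundPsql helper, along
the lines of the sketch below; which relation to lock, and how to do so
without also blocking the VACUUM on the primary, is exactly the open
question:

    # hypothetical sketch: hold an AEL on the primary; once the lock record
    # has been replayed, logical decoding on the standby that needs to access
    # that relation cannot progress until the lock is released
    my $blocker = $node_primary->background_psql('testdb');
    $blocker->query_safe('BEGIN');
    $blocker->query_safe('LOCK TABLE some_rel IN ACCESS EXCLUSIVE MODE');  # placeholder relation

    # ... run the VACUUM, wait for replay, check for the invalidation ...

    $blocker->query_safe('COMMIT');
    $blocker->quit;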


Greetings,

Andres



Re: BF member drongo doesn't like 035_standby_logical_decoding.pl

From:
Andres Freund
Date:
Hi,

On 2025-01-24 21:00:00 +0200, Alexander Lakhin wrote:
> 24.01.2025 18:42, Tom Lane wrote:
> > I realized just now that drongo has been intermittently failing like this:
> > 
> > 147/256 postgresql:recovery / recovery/035_standby_logical_decoding               ERROR          2116.16s   (exit status 255 or signal 127 SIGinvalid)
> > ------------------------------------- 8< -------------------------------------
> > stderr:
> > #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
> > #   at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
> > # poll_query_until timed out executing this query:
> > # select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
> > # expecting this output:
> > # t
> > # last actual query output:
> > # f
> > # with stderr:
> > #   Failed test 'confl_active_logicalslot updated'
> > #   at C:/prog/bf/root/REL_16_STABLE/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
> > # Tests were run but no plan was declared and done_testing() was not seen.
> > # Looks like your test exited with 255 just after 24.
> > 
> > This has been happening for some time, in all three branches where
> > that test script exists.  The oldest failure that looks like that in
> > the v16 branch is
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-06%2004%3A19%3A35
> > 
> > However, there are older failures showing a timeout of
> > 035_standby_logical_decoding.pl that don't provide any detail, but
> > might well be the same thing.  The oldest one of those is from
> > 2024-05-01, which is still considerably later than the test script
> > itself (added on 2023-04-08).  So it would seem that this is something
> > we broke during 2024, rather than an aboriginal problem in this test.
> > 
> > A search of the buildfarm logs did not turn up similar failures
> > on any other animals.
> > 
> > I have no idea how to proceed on narrowing down the cause...
> > 
> 
> Please take a look at the list of such failures since 2024-06-01 I
> collected here:
> https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#035_standby_logical_decoding_standby.pl_fails_due_to_missing_activeslot_invalidation
> 
> There is also a reference to a discussion of the failure there:
> https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
> (In short, I observed that that test suffers from bgwriter's activity.)

I don't think it's quite right to blame this on bgwriter. E.g. a checkpoint
will also emit XLOG_RUNNING_XACTS. The problem is that the test just is racy
and that needs to be fixed.

Greetings,

Andres Freund



Re: BF member drongo doesn't like 035_standby_logical_decoding.pl

From:
Tom Lane
Date:
Alexander Lakhin <exclusion@gmail.com> writes:
> There is also a reference to a discussion of the failure there:
> https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
> (In short, I observed that that test suffers from bgwriter's activity.)

Hmm, that's an interesting theory, but it fails to explain why
only drongo is showing this failure.  It's not particularly slow
compared to some other BF animals.

I suspect the correct explanation involves "it's Windows" somehow,
but no idea exactly how.

BTW, looking more closely at the failures, I notice that some of them
are 'with vacuum on pg_authid' but others are 'with vacuum on pg_class'.
So those are two different steps in the test script, and there are
other steps that are very similar but have not failed.  Curiouser
and curiouser.

            regards, tom lane



Re: BF member drongo doesn't like 035_standby_logical_decoding.pl

From:
Andres Freund
Date:
Hi,

On 2025-01-24 14:56:45 -0500, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
> > There is also a reference to a discussion of the failure there:
> > https://www.postgresql.org/message-id/657815a2-5a89-fcc1-1c9d-d77a6986bc26@gmail.com
> > (In short, I observed that that test suffers from bgwriter's activity.)
>
> Hmm, that's an interesting theory, but it fails to explain why
> only drongo is showing this failure.  It's not particularly slow
> compared to some other BF animals.

One interesting bit is that on both successful and failing runs the preceding
test is quite slow:

[14:18:13.386](3.192s) ok 25 - row_removal_activeslot on standby created
### Reloading node "standby"
# Running: pg_ctl -D C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/recovery/035_standby_logical_decoding\\data/t_035_standby_logical_decoding_standby_data/pgdata reload
server signaled

#### Begin standard error
psql:<stdin>:1: INFO:  vacuuming "testdb.pg_catalog.pg_class"
psql:<stdin>:1: INFO:  finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 419 remain, 0 are dead but not yet removable
removable cutoff: 760, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 11 pages set all-visible, 10 pages set all-frozen (0 were all-visible)
index scan needed: 3 pages from table (27.27% of total) had 5 dead item identifiers removed
index "pg_class_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_relname_nsp_index": pages: 5 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_tblspc_relfilenode_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 11.684 MB/s, avg write rate: 23.369 MB/s
buffer usage: 90 hits, 8 reads, 16 dirtied
WAL usage: 28 records, 15 full page images, 123107 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
#### End standard error
Waiting for replication conn standby's replay_lsn to pass 0/403FA80 on primary
done
[14:18:48.597](35.210s) ok 26 - inactiveslot slot invalidation is logged with vacuum on pg_class

That's really quite slow. Looking at a few other instances, it usually seems
to take just a few seconds. Obviously running for longer does make it more
likely for checkpoint/bgwriter to log running xacts.

Unfortunately there's no timing visible between the vacuum and waiting for
replication... And the passing runs don't seem to have the server logs.


Because of that I'm looking at a failure:

[05:23:26.630](3.831s) ok 21 - row_removal_activeslot on standby created
### Reloading node "standby"
# Running: pg_ctl -D C:\\prog\\bf\\root\\REL_16_STABLE\\pgsql.build/testrun/recovery/035_standby_logical_decoding\\data/t_035_standby_logical_decoding_standby_data/pgdata reload
server signaled

#### Begin standard error
psql:<stdin>:1: INFO:  vacuuming "testdb.pg_catalog.pg_class"
psql:<stdin>:1: INFO:  finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 417 remain, 0 are dead but not yet removable
removable cutoff: 748, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan needed: 3 pages from table (27.27% of total) had 5 dead item identifiers removed
index "pg_class_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_relname_nsp_index": pages: 5 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_class_tblspc_relfilenode_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 0.264 MB/s, avg write rate: 0.495 MB/s
buffer usage: 77 hits, 8 misses, 15 dirtied
WAL usage: 28 records, 15 full page images, 122841 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.23 s
#### End standard error
Waiting for replication conn standby's replay_lsn to pass 0/403E648 on primary
done
[05:23:54.630](28.000s) ok 22 - inactiveslot slot invalidation is logged with vacuum on pg_class
[05:23:54.643](0.013s) not ok 23 - activeslot slot invalidation is logged with vacuum on pg_class


The vacuum only starts at: 05:23:49.294, i.e. after ~23s.

2024-09-06 05:23:49.294 UTC [552:1] [unknown] LOG:  connection received: host=127.0.0.1 port=51838
2024-09-06 05:23:49.301 UTC [552:2] [unknown] LOG:  connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" method=sspi (C:/prog/bf/root/REL_16_STABLE/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_primary_data/pgdata/pg_hba.conf:2)
2024-09-06 05:23:49.301 UTC [552:3] [unknown] LOG:  connection authorized: user=pgrunner database=testdb application_name=035_standby_logical_decoding.pl
2024-09-06 05:23:49.355 UTC [552:4] 035_standby_logical_decoding.pl LOG:  statement: VACUUM  verbose pg_class;
2024-09-06 05:23:49.607 UTC [552:5] 035_standby_logical_decoding.pl LOG:  statement: INSERT INTO flush_wal DEFAULT VALUES;
2024-09-06 05:23:49.609 UTC [552:6] 035_standby_logical_decoding.pl LOG:  disconnection: session time: 0:00:00.322 user=pgrunner database=testdb host=127.0.0.1 port=51838


Before that we see:

2024-09-06 05:23:27.683 UTC [168:4] 035_standby_logical_decoding.pl LOG:  statement: SELECT xmin IS NULL AND catalog_xmin IS NULL
            FROM pg_catalog.pg_replication_slots
            WHERE slot_name = 'primary_physical';

2024-09-06 05:23:31.447 UTC [5836:4] 035_standby_logical_decoding.pl LOG:  statement: select pg_snapshot_xmin(pg_current_snapshot());
2024-09-06 05:23:36.963 UTC [4908:4] 035_standby_logical_decoding.pl LOG:  statement: CREATE TABLE conflict_test(x integer, y text);
2024-09-06 05:23:36.972 UTC [4908:5] 035_standby_logical_decoding.pl LOG:  statement: DROP TABLE conflict_test;
2024-09-06 05:23:45.192 UTC [4240:4] 035_standby_logical_decoding.pl LOG:  statement: SELECT (select pg_snapshot_xmin(pg_current_snapshot())::text::int - 746) > 0

Between 05:23:30.797 and 05:23:49.601 nothing is logged on the standby.


There's a lot of time between disconnects and connects in the test around that
time:

2024-09-06 05:23:31.591 UTC [5836:5] 035_standby_logical_decoding.pl LOG:  disconnection: session time: 0:00:00.240 user=pgrunner database=testdb host=127.0.0.1 port=51817

2024-09-06 05:23:36.897 UTC [4908:1] [unknown] LOG:  connection received: host=127.0.0.1 port=51829

2024-09-06 05:23:37.032 UTC [4908:6] 035_standby_logical_decoding.pl LOG:  disconnection: session time: 0:00:00.142 user=pgrunner database=testdb host=127.0.0.1 port=51829

2024-09-06 05:23:45.135 UTC [4240:1] [unknown] LOG:  connection received: host=127.0.0.1 port=51836

2024-09-06 05:23:45.272 UTC [4240:5] 035_standby_logical_decoding.pl LOG:  disconnection: session time: 0:00:00.152 user=pgrunner database=testdb host=127.0.0.1 port=51836

2024-09-06 05:23:49.294 UTC [552:1] [unknown] LOG:  connection received: host=127.0.0.1 port=51838


Even though afaict the perl script doesn't do anything else between those
connections.


> BTW, looking more closely at the failures, I notice that some of them
> are 'with vacuum on pg_authid' but others are 'with vacuum on pg_class'.
> So those are two different steps in the test script, and there are
> other steps that are very similar but have not failed.  Curiouser
> and curiouser.

I think that makes some sense:

"Scenario 1":

Doesn't have the race to the same degree, because a VACUUM FULL on pg_class is
used, which prevents logical decoding from progressing after it started (due
to the logged AEL at the start of VACFULL).


"Scenario 2" and "Scenario 3":

Nearly the same test, except once with a database-local relation and once with
a shared rel.


"Scenario 4":

Doesn't expect to trigger a conflict, so the race doesn't matter.


"Scenario 5":

Theoretically racy, but the window seems much smaller; it looks like the
conflict has multiple chances to be triggered because each of the updates
happens in a separate session (and thus transaction).

Greetings,

Andres Freund