
A failure in 031_recovery_conflict.pl on Cirrus CI Windows Server

From: Bharath Rupireddy
Date:
Hi,

I've observed the following failure once in one of my Cirrus CI runs
on Windows Server on HEAD:

timed out waiting for match: (?^:User was holding shared buffer pin
for too long) at
C:/cirrus/src/test/recovery/t/031_recovery_conflict.pl line 318.
# Postmaster PID for node "primary" is 696
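
For context, the check that times out at that line is a log poll along these lines; a paraphrase built on the PostgreSQL::Test::Cluster API, not the test's literal code:

# Hypothetical paraphrase of the failing check (not the exact test code):
# poll the standby's server log until the conflict message appears;
# wait_for_log croaks with "timed out waiting for match: ..." when the
# default timeout elapses first, which is exactly the error above.
# Assumes $node_standby is the standby's PostgreSQL::Test::Cluster handle.
my $log_offset = -s $node_standby->logfile;
$node_standby->wait_for_log(
    qr/User was holding shared buffer pin for too long/, $log_offset);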


https://api.cirrus-ci.com/v1/artifact/task/5272062399348736/testrun/build/testrun/recovery/031_recovery_conflict/log/regress_log_031_recovery_conflict

https://github.com/BRupireddy/postgres/runs/15296698158

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: A failure in 031_recovery_conflict.pl on Cirrus CI Windows Server

From: Noah Misch
Date:
On Tue, Jul 25, 2023 at 01:56:41PM +0530, Bharath Rupireddy wrote:
> I've observed the following failure once in one of my Cirrus CI runs
> on Windows Server on HEAD:
> 
> timed out waiting for match: (?^:User was holding shared buffer pin
> for too long) at
> C:/cirrus/src/test/recovery/t/031_recovery_conflict.pl line 318.
> # Postmaster PID for node "primary" is 696
> 
> https://api.cirrus-ci.com/v1/artifact/task/5272062399348736/testrun/build/testrun/recovery/031_recovery_conflict/log/regress_log_031_recovery_conflict
> 
> https://github.com/BRupireddy/postgres/runs/15296698158

Known:
https://postgr.es/m/flat/20220409045515.35ypjzddp25v72ou%40alap3.anarazel.de
https://postgr.es/m/flat/CA+hUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO=nEYKSg@mail.gmail.com

A good next step would be patch "Fix recovery conflict SIGUSR1 handling" from
https://postgr.es/m/CA+hUKG+Hi5P1j_8cVeqLLwNSVyJh4RntF04fYWkeNXfTrH2MYA@mail.gmail.com
(near the end of that second thread).



A failure in 031_recovery_conflict.pl on Debian/s390x

From: Christoph Berg
Date:
Re: Noah Misch
> On Tue, Jul 25, 2023 at 01:56:41PM +0530, Bharath Rupireddy wrote:
> > I've observed the following failure once in one of my Cirrus CI runs
> > on Windows Server on HEAD:
> >
> > timed out waiting for match: (?^:User was holding shared buffer pin
> > for too long) at
> > C:/cirrus/src/test/recovery/t/031_recovery_conflict.pl line 318.
> > # Postmaster PID for node "primary" is 696
> >
> > https://api.cirrus-ci.com/v1/artifact/task/5272062399348736/testrun/build/testrun/recovery/031_recovery_conflict/log/regress_log_031_recovery_conflict
> >
> > https://github.com/BRupireddy/postgres/runs/15296698158
>
> Known:
> https://postgr.es/m/flat/20220409045515.35ypjzddp25v72ou%40alap3.anarazel.de
> https://postgr.es/m/flat/CA+hUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO=nEYKSg@mail.gmail.com
>
> A good next step would be patch "Fix recovery conflict SIGUSR1 handling" from
> https://postgr.es/m/CA+hUKG+Hi5P1j_8cVeqLLwNSVyJh4RntF04fYWkeNXfTrH2MYA@mail.gmail.com
> (near the end of that second thread).

I am also seeing problems with t/031_recovery_conflict.pl, on the
newly added s390x architecture on apt.postgresql.org. The failure is
flaky, but has been popping up in build logs often enough.

I managed to reproduce it on the shell by running the test in a loop a
few times. The failure looks like this:

echo "# +++ tap check in src/test/recovery +++" && rm -rf
'/home/myon/postgresql/pg/postgresql/build/src/test/recovery'/tmp_check&& /bin/mkdir -p
'/home/myon/postgresql/pg/postgresql/build/src/test/recovery'/tmp_check&& cd
/home/myon/postgresql/pg/postgresql/build/../src/test/recovery&&
TESTLOGDIR='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/tmp_check/log'
TESTDATADIR='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/tmp_check'
PATH="/home/myon/postgresql/pg/postgresql/build/tmp_install/usr/lib/postgresql/17/bin:/home/myon/postgresql/pg/postgresql/build/src/test/recovery:$PATH"
LD_LIBRARY_PATH="/home/myon/postgresql/pg/postgresql/build/tmp_install/usr/lib/s390x-linux-gnu" PGPORT='65432'
top_builddir='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/../../..'
PG_REGRESS='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/../../../src/test/regress/pg_regress'
/usr/bin/prove-I /home/myon/postgresql/pg/postgresql/build/../src/test/perl/ -I
/home/myon/postgresql/pg/postgresql/build/../src/test/recovery--verbose t/031_recovery_conflict.pl 
# +++ tap check in src/test/recovery +++
t/031_recovery_conflict.pl ..
# issuing query via background psql:
#     BEGIN;
#     DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
#     FETCH FORWARD FROM test_recovery_conflict_cursor;
ok 1 - buffer pin conflict: cursor with conflicting pin established
ok 2 - buffer pin conflict: logfile contains terminated connection due to recovery conflict
ok 3 - buffer pin conflict: stats show conflict on standby
# issuing query via background psql:
#         BEGIN;
#         DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
#         FETCH FORWARD FROM test_recovery_conflict_cursor;
#
ok 4 - snapshot conflict: cursor with conflicting snapshot established
ok 5 - snapshot conflict: logfile contains terminated connection due to recovery conflict
ok 6 - snapshot conflict: stats show conflict on standby
# issuing query via background psql:
#         BEGIN;
#         LOCK TABLE test_recovery_conflict_table1 IN ACCESS SHARE MODE;
#         SELECT 1;
#
ok 7 - lock conflict: conflicting lock acquired
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 7.
Dubious, test returned 255 (wstat 65280, 0xff00)
All 7 subtests passed

Test Summary Report
-------------------
t/031_recovery_conflict.pl (Wstat: 65280 Tests: 7 Failed: 0)
  Non-zero exit status: 255
  Parse errors: No plan found in TAP output
Files=1, Tests=7, 186 wallclock secs ( 0.01 usr  0.00 sys +  0.74 cusr  0.71 csys =  1.46 CPU)
Result: FAIL
make: *** [Makefile:23: check] Error 1

I.e. the test file just exits after test 7, without running the rest.

Is it dying because it's running into a deadlock instead of some other
expected error message?

timed out waiting for match: (?^:User was holding a relation lock for too long) at t/031_recovery_conflict.pl line 318.
2023-08-04 12:20:38.061 UTC [1920073] 031_recovery_conflict.pl FATAL:  terminating connection due to conflict with recovery
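
A throwaway driver along these lines is enough to repeat that kind of loop reproduction (hypothetical; it assumes an in-tree make build and being in src/test/recovery):

# Hypothetical rerun-until-failure driver; PROVE_TESTS restricts the run
# to the one flaky script.  Adjust the iteration count to taste.
use strict;
use warnings;

for my $i (1 .. 500)
{
    print "=== iteration $i ===\n";
    system('make', 'check', 'PROVE_TESTS=t/031_recovery_conflict.pl') == 0
        or die "test failed on iteration $i\n";
}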

Christoph

Attachments

Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Thomas Munro
Date:
On Sat, Aug 5, 2023 at 12:43 AM Christoph Berg <myon@debian.org> wrote:
> I managed to reproduce it on the shell by running the test in a loop a
> few times. The failure looks like this:

It's great that you can reproduce this semi-reliably!  I've rebased
the patch, hoping you can try it out.

https://www.postgresql.org/message-id/CA%2BhUKGJDcyW8Hbq3UyG-5-5Y7WqqOTjrXbFTMxxmhiofFraE-Q%40mail.gmail.com



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Christoph Berg
Date:
Re: Thomas Munro
> It's great that you can reproduce this semi-reliably!  I've rebased
> the patch, hoping you can try it out.

Unfortunately only very semi: today I didn't get to the same point where it
exited after test 7, but got some other timeouts instead. I'm not even sure
they are related to this problem. Anyway:

> https://www.postgresql.org/message-id/CA%2BhUKGJDcyW8Hbq3UyG-5-5Y7WqqOTjrXbFTMxxmhiofFraE-Q%40mail.gmail.com

This patch makes the testsuite hang (and later exit) after this:

ok 17 - 5 recovery conflicts shown in pg_stat_database
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 17.

I haven't seen any other problems with the patch applied, but since
that test was hanging and hence very slow, I couldn't do many runs.

Perhaps that's progress, I don't know. :) Logs attached.

Christoph

Attachments

Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Thomas Munro
Date:
On Mon, Aug 7, 2023 at 8:40 AM Christoph Berg <myon@debian.org> wrote:
> 2023-08-06 17:21:24.078 UTC [1275555] 031_recovery_conflict.pl FATAL:  unrecognized conflict mode: 7

Thanks for testing!  Would you mind trying v8 from that thread?  V7
had a silly bug (I accidentally deleted a 'case' label while cleaning
some stuff up, resulting in the above error...)



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Christoph Berg
Date:
Re: Thomas Munro
> Thanks for testing!  Would you mind trying v8 from that thread?  V7
> had a silly bug (I accidentally deleted a 'case' label while cleaning
> some stuff up, resulting in the above error...)

v8 worked better. It succeeded a few times (at least 12, my screen
scrollback didn't catch more) before erroring like this:

[10:21:58.410](0.151s) ok 15 - startup deadlock: logfile contains terminated connection due to recovery conflict
[10:21:58.463](0.053s) not ok 16 - startup deadlock: stats show conflict on standby
[10:21:58.463](0.000s)
[10:21:58.463](0.000s) #   Failed test 'startup deadlock: stats show conflict on standby'
#   at t/031_recovery_conflict.pl line 332.
[10:21:58.463](0.000s) #          got: '0'
#     expected: '1'

Christoph

Attachments

Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Andres Freund
Date:
Hi,

On 2023-08-07 12:57:40 +0200, Christoph Berg wrote:
> Re: Thomas Munro
> > Thanks for testing!  Would you mind trying v8 from that thread?  V7
> > had a silly bug (I accidentally deleted a 'case' label while cleaning
> > some stuff up, resulting in the above error...)
> 
> v8 worked better. It succeeded a few times (at least 12, my screen
> scrollback didn't catch more) before erroring like this:

> [10:21:58.410](0.151s) ok 15 - startup deadlock: logfile contains terminated connection due to recovery conflict
> [10:21:58.463](0.053s) not ok 16 - startup deadlock: stats show conflict on standby
> [10:21:58.463](0.000s)
> [10:21:58.463](0.000s) #   Failed test 'startup deadlock: stats show conflict on standby'
> #   at t/031_recovery_conflict.pl line 332.
> [10:21:58.463](0.000s) #          got: '0'
> #     expected: '1'

Hm, that could just be a "harmless" race. Does it still happen if you apply
the attached patch in addition?
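
For reference, the assertion at line 332 amounts to reading the standby's conflict counters, roughly as below; a paraphrase, with the view and column names inferred from the failure text rather than quoted from the test. The race would be that the cancelled backend has not yet flushed its conflict report to the cumulative stats system when the test reads the view:

# Hypothetical paraphrase of the failing check: the deadlock conflict
# should already be counted in the standby's statistics at this point.
# Assumes Test::More's is() plus $node_standby and $test_db from the test.
is( $node_standby->safe_psql(
        $test_db,
        qq[SELECT confl_deadlock FROM pg_stat_database_conflicts
           WHERE datname = '$test_db';]),
    1,
    'startup deadlock: stats show conflict on standby');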

Greetings,

Andres Freund

Attachments

Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Christoph Berg
Date:
Re: Andres Freund
> Hm, that could just be a "harmless" race. Does it still happen if you apply
> the attached patch in addition?

Putting that patch on top of v8 made it pass 294 times before exiting
like this:

[08:52:34.134](0.032s) ok 1 - buffer pin conflict: cursor with conflicting pin established
Waiting for replication conn standby's replay_lsn to pass 0/3430000 on primary
done
timed out waiting for match: (?^:User was holding shared buffer pin for too long) at t/031_recovery_conflict.pl line 318.

But admittedly, this build machine is quite sluggish at times, likely
due to neighboring VMs on the same host. (Perhaps that even explains
the behavior from before this patch.) I'll still attach the logs since
I frankly can't judge which errors are acceptable here and which aren't.

Christoph

Attachments

Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Thomas Munro
Date:
On Wed, Aug 9, 2023 at 2:01 AM Christoph Berg <myon@debian.org> wrote:
> Putting that patch on top of v8 made it pass 294 times before exiting
> like this:
>
> [08:52:34.134](0.032s) ok 1 - buffer pin conflict: cursor with conflicting pin established
> Waiting for replication conn standby's replay_lsn to pass 0/3430000 on primary
> done
> timed out waiting for match: (?^:User was holding shared buffer pin for too long) at t/031_recovery_conflict.pl line 318.

Can you reproduce that with logging like this added on top?

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5b663a2997..72f5274c95 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -629,6 +629,7 @@ SetStartupBufferPinWaitBufId(int bufid)
        /* use volatile pointer to prevent code rearrangement */
        volatile PROC_HDR *procglobal = ProcGlobal;

+       elog(LOG, "XXX SetStartupBufferPinWaitBufId(%d)", bufid);
        procglobal->startupBufferPinWaitBufId = bufid;
 }

@@ -640,8 +641,11 @@ GetStartupBufferPinWaitBufId(void)
 {
        /* use volatile pointer to prevent code rearrangement */
        volatile PROC_HDR *procglobal = ProcGlobal;
+       int bufid;

-       return procglobal->startupBufferPinWaitBufId;
+       bufid = procglobal->startupBufferPinWaitBufId;
+       elog(LOG, "XXX GetStartupBufferPinWaitBufId() -> %d", bufid);
+       return bufid;
 }



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Christoph Berg
Date:
Re: Thomas Munro
> On Wed, Aug 9, 2023 at 2:01 AM Christoph Berg <myon@debian.org> wrote:
> > Putting that patch on top of v8 made it pass 294 times before exiting
> > like this:
> >
> > [08:52:34.134](0.032s) ok 1 - buffer pin conflict: cursor with conflicting pin established
> > Waiting for replication conn standby's replay_lsn to pass 0/3430000 on primary
> > done
> > timed out waiting for match: (?^:User was holding shared buffer pin for too long) at t/031_recovery_conflict.pl line 318.
>
> Can you reproduce that with logging like this added on top?

603 iterations later it hit again, but didn't log anything. (I believe
I did run "make" in the right directory.)

[22:20:24.714](3.145s) # issuing query via background psql:
#     BEGIN;
#     DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
#     FETCH FORWARD FROM test_recovery_conflict_cursor;
[22:20:24.745](0.031s) ok 1 - buffer pin conflict: cursor with conflicting pin established
Waiting for replication conn standby's replay_lsn to pass 0/3430000 on primary
done
timed out waiting for match: (?^:User was holding shared buffer pin for too long) at t/031_recovery_conflict.pl line 318.

Perhaps this can simply be attributed to the machine being too busy.

Christoph

Attachments

Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Christoph Berg
Date:
Re: To Thomas Munro
> 603 iterations later it hit again, but didn't log anything. (I believe
> I did run "make" in the right directory.)

Since that didn't seem right, I'm running the tests again. There are
XXX lines in the output, but it hasn't hit yet.

Christoph



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Christoph Berg
Date:
Re: To Thomas Munro
> 603 iterations later it hit again, but didn't log anything. (I believe
> I did run "make" in the right directory.)

This time it took 3086 iterations to hit the problem.

Running c27f8621eedf7 + Debian patches + v8 +
pgstat-report-conflicts-immediately.patch + the XXX logging.

> [22:20:24.714](3.145s) # issuing query via background psql:
> #     BEGIN;
> #     DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
> #     FETCH FORWARD FROM test_recovery_conflict_cursor;
> [22:20:24.745](0.031s) ok 1 - buffer pin conflict: cursor with conflicting pin established
> Waiting for replication conn standby's replay_lsn to pass 0/3430000 on primary
> done
> timed out waiting for match: (?^:User was holding shared buffer pin for too long) at t/031_recovery_conflict.pl line 318.

No XXX lines this time either, but I've seen them in logfiles of runs
that went through successfully.

> Perhaps this can simply be attributed to the machine being too busy.

With the patches, the test no longer dies so often that builds targeting
several distributions in parallel usually fail.

Christoph

Attachments

Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Thomas Munro
Date:
On Thu, Aug 10, 2023 at 9:15 PM Christoph Berg <myon@debian.org> wrote:
> No XXX lines this time either, but I've seen them in logfiles of runs
> that went through successfully.

Hmm.  Well, I think this looks like a different kind of bug then.
That patch of mine is about fixing some unsafe coding on the receiving
side of a signal.  In this case it's apparently not being sent.  So
either the Heap2/PRUNE record was able to proceed (indicating that
that CURSOR was not holding a pin as expected), or VACUUM decided not
to actually do anything to that block (conditional cleanup lock vs
transient pin changing behaviour?), or there's a bug somewhere in/near
LockBufferForCleanup(), which should have emitted that XXX message
before even calling ResolveRecoveryConflictWithBufferPin().

Do you still have the data directories around from that run, so we can
see if the expected Heap2/PRUNE was actually logged?  For example
(using meson layout here, in the build directory) that'd be something
like:

$ ./tmp_install/home/tmunro/install/bin/pg_waldump
testrun/recovery/031_recovery_conflict/data/t_031_recovery_conflict_standby_data/pgdata/pg_wal/000000010000000000000003

In there I see this:

rmgr: Heap2       len (rec/tot):     57/    57, tx:          0, lsn:
0/0344BB90, prev 0/0344BB68, desc: PRUNE snapshotConflictHorizon: 0,
nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [21],
unused: [], blkref #0: rel 1663/16385/16386 blk 0

That's the WAL record that's supposed to be causing
031_recovery_conflict_standby.log to talk about a conflict, starting
with this:

2023-08-10 22:47:04.564 NZST [57145] LOG:  recovery still waiting
after 10.035 ms: recovery conflict on buffer pin
2023-08-10 22:47:04.564 NZST [57145] CONTEXT:  WAL redo at 0/344BB90
for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 1,
 nunused: 0, redirected: [], dead: [21], unused: []; blkref #0: rel
1663/16385/16386, blk 0



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Christoph Berg
Date:
Re: Thomas Munro
> On Thu, Aug 10, 2023 at 9:15 PM Christoph Berg <myon@debian.org> wrote:
> > No XXX lines this time either, but I've seen them in logfiles of runs
> > that went through successfully.
> 
> Do you still have the data directories around from that run, so we can
> see if the expected Heap2/PRUNE was actually logged?  For example
> (using meson layout here, in the build directory) that'd be something
> like:

Sorry for the late reply, getting the PG release out of the door was
more important.

> $ ./tmp_install/home/tmunro/install/bin/pg_waldump
> testrun/recovery/031_recovery_conflict/data/t_031_recovery_conflict_standby_data/pgdata/pg_wal/000000010000000000000003
> 
> In there I see this:

It's been a long day and I can't wrap my mind around understanding
this now, so I'll just dump the output here.

[0] 16:03 myon@sid-s390x.pgs390x:~/.../build/src/test/recovery $ LC_ALL=C
../../../tmp_install/usr/lib/postgresql/17/bin/pg_waldump
tmp_check/t_031_recovery_conflict_standby_data/pgdata/pg_wal/000000010000000000000003 | grep -3 PRUNE > PRUNE.log

pg_waldump: error: error in WAL record at 0/347E6A8: invalid record length at 0/347E6E0: expected at least 24, got 0

Christoph

Attachments

Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Thomas Munro
Date:
Thanks.  I realised that it's easy enough to test that theory about
cleanup locks by hacking ConditionalLockBufferForCleanup() to return
false randomly.  Then the test occasionally fails as described.  Seems
like we'll need to fix that test, but it's not evidence of a server
bug, and my signal handler refactoring patch is in the clear.  Thanks
for testing it!



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Andres Freund
Date:
Hi,

On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
> Thanks.  I realised that it's easy enough to test that theory about
> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
> false randomly.  Then the test occasionally fails as described.  Seems
> like we'll need to fix that test, but it's not evidence of a server
> bug, and my signal handler refactoring patch is in the clear.  Thanks
> for testing it!

WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
passes even after I make ConditionalLockBufferForCleanup() fail 100%.
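
Concretely, the change would look something like this in 031_recovery_conflict.pl; a sketch only, with the variable names here being assumptions rather than quotes from the test:

# Hypothetical sketch of the proposed test change: with FREEZE, vacuum
# cannot silently skip the page when its conditional cleanup-lock attempt
# fails, so the WAL record that provokes the conflict is always emitted.
$node_primary->safe_psql($test_db, qq[VACUUM FREEZE $table1;]);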

Greetings,

Andres Freund



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Christoph Berg
Date:
Re: Andres Freund
> > Thanks.  I realised that it's easy enough to test that theory about
> > cleanup locks by hacking ConditionalLockBufferForCleanup() to return
> > false randomly.  Then the test occasionally fails as described.  Seems
> > like we'll need to fix that test, but it's not evidence of a server
> > bug, and my signal handler refactoring patch is in the clear.  Thanks
> > for testing it!
> 
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.

I have now applied the last two patches to postgresql-17 to see if the
build is more stable. (So far I had only tried them in manual tests.)

Fwiw this is also causing pain on PostgreSQL 16:


https://pgdgbuild.dus.dg-i.net/view/Snapshot/job/postgresql-16-binaries-snapshot/1011/architecture=s390x,distribution=sid/consoleText

Most of the failing builds in
https://pgdgbuild.dus.dg-i.net/view/Snapshot/job/postgresql-16-binaries-snapshot/
are on s390x and likely due to this problem.

This should be fixed before the 16 release.

Christoph



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Thomas Munro
Date:
On Tue, Aug 29, 2023 at 1:58 AM Christoph Berg <myon@debian.org> wrote:
> This should be fixed before the 16 release.

Here's what I was thinking of doing for this, given where we are in
the release schedule:

* commit the signal-refactoring patch in master only
* plan to back-patch it into 16 in a later point release (it's much
harder to back-patch further than that)
* look into what we can do to suppress the offending test for 16 in the meantime

But looking into that just now, I am curious about something.  The
whole area of recovery conflicts has been buggy forever, and the tests
were only developed fairly recently by Melanie and Andres (April
2022), and then backpatched to all releases.  They were disabled again
in release branches 10-14 (discussion at
https://postgr.es/m/3447060.1652032749@sss.pgh.pa.us):

+plan skip_all => "disabled until after minor releases, due to instability";

Now your mainframe build bot is regularly failing for whatever timing
reason, in 16 and master.  That's quite useful because your tests have
made us or at least me a lot more confident that the fix is good (one
of the reasons progress has been slow is that failures in CI and BF
were pretty rare and hard to analyse).  But... I wonder why it isn't
failing for you in 15?  Are you able to check if it ever has?  I
suppose we could go and do the "disabled due to instability" thing in
15 and 16, and then later we'll un-disable it in 16 when we back-patch
the fix into it.



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Christoph Berg
Date:
Re: Thomas Munro
> 2022), and then backpatched to all releases.  They were disabled again
> in release branches 10-14 (discussion at
> https://postgr.es/m/3447060.1652032749@sss.pgh.pa.us):
> 
> +plan skip_all => "disabled until after minor releases, due to instability";

Right:

https://pgdgbuild.dus.dg-i.net/view/Snapshot/job/postgresql-14-binaries-snapshot/366/architecture=s390x,distribution=sid/consoleText
-> t/031_recovery_conflict.pl ........... skipped: disabled until after minor releases, due to instability

> Now your mainframe build bot is regularly failing for whatever timing
> reason, in 16 and master.  That's quite useful because your tests have
> made us or at least me a lot more confident that the fix is good (one
> of the reasons progress has been slow is that failures in CI and BF
> were pretty rare and hard to analyse).  But... I wonder why it isn't
> failing for you in 15?  Are you able to check if it ever has?  I
> suppose we could go and do the "disabled due to instability" thing in
> 15 and 16, and then later we'll un-disable it in 16 when we back-patch
> the fix into it.

The current explanation is that builds are only running frequently for
16 and devel, and the periodic extra test jobs only run "make check"
and the postgresql-common testsuite, not the full server testsuite.

I had "snapshot" jobs configured for basically everything (server and
extensions) for some time, but ultimately realized that I don't have
the slightest time to look at everything, so I disabled them again
about 3 weeks ago. This removed enough noise so I could actually look
at the failing 16 and devel jobs again.

I don't see any failures of that kind for the 15 snapshot build while
it was still running; the failures there are all something else
(mostly problems outside PG).

https://pgdgbuild.dus.dg-i.net/view/Snapshot/job/postgresql-15-binaries-snapshot/

I'll enable the server snapshot builds again; those shouldn't be too
noisy.

Christoph



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Thomas Munro
Date:
I have now disabled the test in 15 and 16 (like the older branches).
I'll see about getting the fixes into master today, and we can
contemplate back-patching later, after we've collected a convincing
volume of test results from the build farm, CI and hopefully your
s390x master snapshot builds (if that is a thing).



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Thomas Munro
Date:
On Sun, Aug 13, 2023 at 9:00 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
> > Thanks.  I realised that it's easy enough to test that theory about
> > cleanup locks by hacking ConditionalLockBufferForCleanup() to return
> > false randomly.  Then the test occasionally fails as described.  Seems
> > like we'll need to fix that test, but it's not evidence of a server
> > bug, and my signal handler refactoring patch is in the clear.  Thanks
> > for testing it!
>
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.

I pushed that change (master only for now, like the other change).



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Thomas Munro
Date:
On Tue, Aug 8, 2023 at 11:08 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-07 12:57:40 +0200, Christoph Berg wrote:
> > v8 worked better. It succeeded a few times (at least 12, my screen
> > scrollback didn't catch more) before erroring like this:
>
> > [10:21:58.410](0.151s) ok 15 - startup deadlock: logfile contains terminated connection due to recovery conflict
> > [10:21:58.463](0.053s) not ok 16 - startup deadlock: stats show conflict on standby
> > [10:21:58.463](0.000s)
> > [10:21:58.463](0.000s) #   Failed test 'startup deadlock: stats show conflict on standby'
> > #   at t/031_recovery_conflict.pl line 332.
> > [10:21:58.463](0.000s) #          got: '0'
> > #     expected: '1'
>
> Hm, that could just be a "harmless" race. Does it still happen if you apply
> the attached patch in addition?

Do you intend that as the fix, or just proof of what's wrong?  It
looks pretty reasonable to me.  I will go ahead and commit this unless
you want to change something/do it yourself.  As far as I know so far,
this is the last thing preventing the mighty Debian/s390x build box
from passing consistently, which would be nice.

Attachments

Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Christoph Berg
Date:
Re: Thomas Munro
> I have now disabled the test in 15 and 16 (like the older branches).
> I'll see about getting the fixes into master today, and we can
> contemplate back-patching later, after we've collected a convincing
> volume of test results from the build farm, CI and hopefully your
> s390x master snapshot builds (if that is a thing).

I have now seen the problem on 15 as well, in case I had not reported
that yet.

I think the 16/17 builds look good now with the patches applied, but
there's still a lot of noise from the build box having random other
problems (network disconnect and other silly things), so it still
needs a lot of hand-holding :-/.

Christoph



Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Alexander Lakhin
Date:
Hi,

13.08.2023 00:00, Andres Freund wrote:
> Hi,
>
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
>> Thanks.  I realised that it's easy enough to test that theory about
>> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
>> false randomly.  Then the test occasionally fails as described.  Seems
>> like we'll need to fix that test, but it's not evidence of a server
>> bug, and my signal handler refactoring patch is in the clear.  Thanks
>> for testing it!
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.
>

I happened to encounter another failure of 031_recovery_conflict, on a
rather slow ARMv7 machine:
t/031_recovery_conflict.pl ............ 13/? # poll_query_until timed out executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
t/031_recovery_conflict.pl ............ 14/?
#   Failed test 'startup deadlock: lock acquisition is waiting'
#   at t/031_recovery_conflict.pl line 261.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 15.
t/031_recovery_conflict.pl ............ Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/15 subtests
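
The check at line 261 is, roughly, a poll of pg_locks like the one below; a paraphrase based on the query printed in the failure, not the exact test code:

# Hypothetical paraphrase: wait until the victim backend's relation lock
# request shows up as not granted on the standby; poll_query_until prints
# "timed out executing this query" when that never happens.
# Assumes Test::More's ok() plus $node_standby and $test_db from the test.
ok( $node_standby->poll_query_until(
        $test_db,
        qq[SELECT 'waiting' FROM pg_locks
           WHERE locktype = 'relation' AND NOT granted;],
        'waiting'),
    'startup deadlock: lock acquisition is waiting');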

I've managed to reproduce it with the attached test patch, on a regular
x86_64 VM with CPU slowed down to 2%, so that the modified test runs for
5+ minutes.
...
t/031_recovery_conflict.pl .. 102/? # iteration 31
t/031_recovery_conflict.pl .. 104/? # iteration 32
t/031_recovery_conflict.pl .. 106/?
<... waiting for something ...>

031_recovery_conflict_standby.log contains:
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: BEGIN;
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG: statement: DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT a FROM test_recovery_conflict_table1;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: FETCH FORWARD FROM test_recovery_conflict_cursor;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG: statement: SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG:  recovery still waiting after 10.217 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT:  WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel 1663/16385/16489, blk 0
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl ERROR:  canceling statement due to conflict with recovery at character 15
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl DETAIL:  User transaction caused buffer deadlock with recovery.
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl STATEMENT:  SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG:  recovery finished waiting after 10.382 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT:  WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel 1663/16385/16489, blk 0
2023-10-09 12:43:15.134 UTC [145527] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.647 UTC [145530] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.981 UTC [145532] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
...
2023-10-09 12:51:24.729 UTC [149175] 031_recovery_conflict.pl LOG: statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2023-10-09 12:51:25.969 UTC [145514] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                 This probably means the server terminated abnormally
                 before or while processing the request.

Indeed, t_031_recovery_conflict_standby_data/pgdata/pg_wal/000000010000000000000003 contains:
...
rmgr: Heap        len (rec/tot):     59/    59, tx:        972, lsn: 0/035ADB18, prev 0/035ADAD8, desc: INSERT off: 100, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap        len (rec/tot):     59/    59, tx:        972, lsn: 0/035ADB58, prev 0/035ADB18, desc: INSERT off: 101, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        972, lsn: 0/035ADB98, prev 0/035ADB58, desc: ABORT 2023-10-09 12:43:13.797513 UTC
rmgr: Standby     len (rec/tot):     42/    42, tx:        973, lsn: 0/035ADBC0, prev 0/035ADB98, desc: LOCK xid 973 db 16385 rel 16389
rmgr: Transaction len (rec/tot):    178/   178, tx:        973, lsn: 0/035ADBF0, prev 0/035ADBC0, desc: PREPARE gid lock: 2023-10-09 12:43:13.798048 UTC
rmgr: Heap        len (rec/tot):     59/    59, tx:        974, lsn: 0/035ADCA8, prev 0/035ADBF0, desc: INSERT off: 102, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        974, lsn: 0/035ADCE8, prev 0/035ADCA8, desc: COMMIT 2023-10-09 12:43:13.798434 UTC
rmgr: Transaction len (rec/tot):     34/    34, tx:        975, lsn: 0/035ADD10, prev 0/035ADCE8, desc: COMMIT 2023-10-09 12:43:13.900046 UTC
rmgr: Heap2       len (rec/tot):    255/   255, tx:          0, lsn: 0/035ADD38, prev 0/035ADD10, desc: PRUNE snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: [], blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap2       len (rec/tot):    248/   248, tx:          0, lsn: 0/035ADE38, prev 0/035ADD38, desc: VACUUM nunused: 100, unused: [2, 3, 4, ... , 99, 100, 101], blkref #0: rel 1663/16385/16489 blk 0

So it looks like "SELECT * FROM test_recovery_conflict_table2" was in this
case interrupted due to a conflict with replaying PRUNE, not VACUUM, and as
a result the query never entered the "waiting for lock" state.
(I also saw other failures of the test, but they probably have the same
root cause.)

Best regards,
Alexander
Attachments

Re: A failure in 031_recovery_conflict.pl on Debian/s390x

From: Alexander Lakhin
Date:
13.08.2023 00:00, Andres Freund wrote:
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
>> Thanks.  I realised that it's easy enough to test that theory about
>> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
>> false randomly.  Then the test occasionally fails as described.  Seems
>> like we'll need to fix that test, but it's not evidence of a server
>> bug, and my signal handler refactoring patch is in the clear.  Thanks
>> for testing it!
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.

Yesterday I wondered what the reason is for another failure on skink [1],
[2], [3]:
#   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_authid'
#   at t/035_standby_logical_decoding.pl line 224.

I've reproduced the failure by running several test instances in parallel
(with 15 jobs) under Valgrind and saw a difference in the primary node WAL:
The failed run:
rmgr: Heap        len (rec/tot):     54/    54, tx:        752, lsn: 0/0403FB88, prev 0/0403FB48, desc: DELETE xmax: 752, off: 16, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1664/0/1260 blk 0
rmgr: Transaction len (rec/tot):     62/    62, tx:        752, lsn: 0/0403FBC0, prev 0/0403FB88, desc: INVALIDATION ; inval msgs: catcache 11 catcache 10
rmgr: Transaction len (rec/tot):     82/    82, tx:        752, lsn: 0/0403FC00, prev 0/0403FBC0, desc: COMMIT 2023-10-14 13:35:21.955043 MSK; inval msgs: catcache 11 catcache 10
rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/0403FC58, prev 0/0403FC00, desc: FPI_FOR_HINT , blkref #0: rel 1664/0/1260 fork fsm blk 2 FPW
rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/04041CA8, prev 0/0403FC58, desc: FPI_FOR_HINT , blkref #0: rel 1664/0/1260 fork fsm blk 1 FPW
rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/04043CF8, prev 0/04041CA8, desc: FPI_FOR_HINT , blkref #0: rel 1664/0/1260 fork fsm blk 0 FPW
rmgr: Heap        len (rec/tot):    225/   225, tx:          0, lsn: 0/04045D48, prev 0/04043CF8, desc: INPLACE off: 26, blkref #0: rel 1663/16384/16406 blk 4
rmgr: Transaction len (rec/tot):     78/    78, tx:          0, lsn: 0/04045E30, prev 0/04045D48, desc: INVALIDATION ; inval msgs: catcache 55 catcache 54 relcache 1260
rmgr: Standby     len (rec/tot):     90/    90, tx:          0, lsn: 0/04045E80, prev 0/04045E30, desc: INVALIDATIONS ; relcache init file inval dbid 16384 tsid 1663; inval msgs: catcache 55 catcache 54 relcache 1260

A successful run:
rmgr: Heap        len (rec/tot):     54/    54, tx:        752, lsn: 0/0403FB88, prev 0/0403FB48, desc: DELETE xmax: 752, off: 16, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1664/0/1260 blk 0
rmgr: Transaction len (rec/tot):     62/    62, tx:        752, lsn: 0/0403FBC0, prev 0/0403FB88, desc: INVALIDATION ; inval msgs: catcache 11 catcache 10
rmgr: Transaction len (rec/tot):     82/    82, tx:        752, lsn: 0/0403FC00, prev 0/0403FBC0, desc: COMMIT 2023-10-14 14:21:47.254556 MSK; inval msgs: catcache 11 catcache 10
rmgr: Heap2       len (rec/tot):     57/    57, tx:          0, lsn: 0/0403FC58, prev 0/0403FC00, desc: PRUNE snapshotConflictHorizon: 752, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [16], unused: [], blkref #0: rel 1664/0/1260 blk 0
rmgr: Btree       len (rec/tot):     52/    52, tx:          0, lsn: 0/0403FC98, prev 0/0403FC58, desc: VACUUM ndeleted: 1, nupdated: 0, deleted: [1], updated: [], blkref #0: rel 1664/0/2676 blk 1
rmgr: Btree       len (rec/tot):     52/    52, tx:          0, lsn: 0/0403FCD0, prev 0/0403FC98, desc: VACUUM ndeleted: 1, nupdated: 0, deleted: [16], updated: [], blkref #0: rel 1664/0/2677 blk 1
rmgr: Heap2       len (rec/tot):     50/    50, tx:          0, lsn: 0/0403FD08, prev 0/0403FCD0, desc: VACUUM nunused: 1, unused: [16], blkref #0: rel 1664/0/1260 blk 0

I repeated the hack Thomas did before (added
"if (rand() % 10 == 0) return false;" in ConditionalLockBufferForCleanup())
and reproduced that failure without Valgrind or parallelism.
The following changes:
@@ -562 +562 @@
-  VACUUM pg_class;
+  VACUUM FREEZE pg_class;
@@ -600 +600 @@
-  VACUUM pg_authid;
+  VACUUM FREEZE pg_authid;
@@ -638 +638 @@
-  VACUUM conflict_test;
+  VACUUM FREEZE conflict_test;
fixed the issue for me.

The second change should protect against a similar failure [4]:
#   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_class'
#   at t/035_standby_logical_decoding.pl line 224.

(I decided to write about failures of 035_standby_logical_decoding in this
thread because of the same VACUUM issue.)

I also confirmed that all the other tests in src/test/recovery pass reliably
with the modified ConditionalLockBufferForCleanup (except for
027_stream_regress, where plans changed).

When I had overcome the failures with pg_authid/pg_class, I stumbled upon
another, even rarer one (which occurs without modifying
ConditionalLockBufferForCleanup()):
t/035_standby_logical_decoding.pl .. 42/?
#   Failed test 'inactiveslot slot invalidation is logged with on-access pruning'
#   at t/035_standby_logical_decoding.pl line 224.
But I think that's another case, which I'm going to investigate separately.

BTW, it looks like the comment:
# One way to produce recovery conflict is to create/drop a relation and
# launch a vacuum on pg_class with hot_standby_feedback turned off on the standby.
in scenario 3 is a copy-paste error.
Also, there are two "Scenario 4" in this test.

Long story short, VACUUM FREEZE should make skink greener, but there is
still another way for the 035 test to fail.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-09-05%2009%3A47%3A44
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-10-13%2012%3A16%3A52
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-09-15%2017%3A10%3A55
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-08-27%2008%3A30%3A28