Discussion: A failure in 031_recovery_conflict.pl on Cirrus CI Windows Server
Hi,

I've observed the following failure once in one of my Cirrus CI runs on
Windows Server on HEAD:

timed out waiting for match: (?^:User was holding shared buffer pin for too long) at C:/cirrus/src/test/recovery/t/031_recovery_conflict.pl line 318.
# Postmaster PID for node "primary" is 696

https://api.cirrus-ci.com/v1/artifact/task/5272062399348736/testrun/build/testrun/recovery/031_recovery_conflict/log/regress_log_031_recovery_conflict
https://github.com/BRupireddy/postgres/runs/15296698158

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Jul 25, 2023 at 01:56:41PM +0530, Bharath Rupireddy wrote:
> I've observed the following failure once in one of my Cirrus CI runs
> on Windows Server on HEAD:
>
> timed out waiting for match: (?^:User was holding shared buffer pin
> for too long) at
> C:/cirrus/src/test/recovery/t/031_recovery_conflict.pl line 318.
> # Postmaster PID for node "primary" is 696
>
> https://api.cirrus-ci.com/v1/artifact/task/5272062399348736/testrun/build/testrun/recovery/031_recovery_conflict/log/regress_log_031_recovery_conflict
>
> https://github.com/BRupireddy/postgres/runs/15296698158

Known:
https://postgr.es/m/flat/20220409045515.35ypjzddp25v72ou%40alap3.anarazel.de
https://postgr.es/m/flat/CA+hUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO=nEYKSg@mail.gmail.com

A good next step would be patch "Fix recovery conflict SIGUSR1 handling" from
https://postgr.es/m/CA+hUKG+Hi5P1j_8cVeqLLwNSVyJh4RntF04fYWkeNXfTrH2MYA@mail.gmail.com
(near the end of that second thread).
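For context on what that patch title implies: the recovery-conflict signal
(SIGUSR1 via ProcSignal) historically triggered real work directly from the
signal handler, where only async-signal-safe operations are allowed. The
safe pattern such a fix moves towards looks roughly like this — a schematic
sketch based on the patch title and the later discussion in this thread,
not the committed code (the function names here are hypothetical):

#include "postgres.h"
#include "miscadmin.h"
#include "storage/latch.h"

/* Set from the handler; examined later at a safe point. */
static volatile sig_atomic_t recovery_conflict_pending = false;

/*
 * Signal handler: do no real work here.  Only record that a conflict was
 * signalled and wake the process; both steps are async-signal-safe.
 */
static void
handle_recovery_conflict_signal(SIGNAL_ARGS)
{
	recovery_conflict_pending = true;
	SetLatch(MyLatch);
}

/*
 * Called from the main loop / ProcessInterrupts(), where it is safe to
 * take locks, allocate memory, and report errors.
 */
static void
process_pending_recovery_conflict(void)
{
	if (recovery_conflict_pending)
	{
		recovery_conflict_pending = false;
		/* ... actually resolve the recovery conflict here ... */
	}
}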
Re: Noah Misch
> On Tue, Jul 25, 2023 at 01:56:41PM +0530, Bharath Rupireddy wrote:
> > I've observed the following failure once in one of my Cirrus CI runs
> > on Windows Server on HEAD:
> >
> > timed out waiting for match: (?^:User was holding shared buffer pin
> > for too long) at
> > C:/cirrus/src/test/recovery/t/031_recovery_conflict.pl line 318.
> > # Postmaster PID for node "primary" is 696
> >
> > https://api.cirrus-ci.com/v1/artifact/task/5272062399348736/testrun/build/testrun/recovery/031_recovery_conflict/log/regress_log_031_recovery_conflict
> >
> > https://github.com/BRupireddy/postgres/runs/15296698158
>
> Known:
> https://postgr.es/m/flat/20220409045515.35ypjzddp25v72ou%40alap3.anarazel.de
> https://postgr.es/m/flat/CA+hUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO=nEYKSg@mail.gmail.com
>
> A good next step would be patch "Fix recovery conflict SIGUSR1 handling" from
> https://postgr.es/m/CA+hUKG+Hi5P1j_8cVeqLLwNSVyJh4RntF04fYWkeNXfTrH2MYA@mail.gmail.com
> (near the end of that second thread).

I am also seeing problems with t/031_recovery_conflict.pl, on the newly
added s390x architecture on apt.postgresql.org. The failure is flaky, but
has been popping up in build logs often enough.

I managed to reproduce it on the shell by running the test in a loop a few
times. The failure looks like this:

echo "# +++ tap check in src/test/recovery +++" && rm -rf '/home/myon/postgresql/pg/postgresql/build/src/test/recovery'/tmp_check && /bin/mkdir -p '/home/myon/postgresql/pg/postgresql/build/src/test/recovery'/tmp_check && cd /home/myon/postgresql/pg/postgresql/build/../src/test/recovery && TESTLOGDIR='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/tmp_check/log' TESTDATADIR='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/tmp_check' PATH="/home/myon/postgresql/pg/postgresql/build/tmp_install/usr/lib/postgresql/17/bin:/home/myon/postgresql/pg/postgresql/build/src/test/recovery:$PATH" LD_LIBRARY_PATH="/home/myon/postgresql/pg/postgresql/build/tmp_install/usr/lib/s390x-linux-gnu" PGPORT='65432' top_builddir='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/../../..' PG_REGRESS='/home/myon/postgresql/pg/postgresql/build/src/test/recovery/../../../src/test/regress/pg_regress' /usr/bin/prove -I /home/myon/postgresql/pg/postgresql/build/../src/test/perl/ -I /home/myon/postgresql/pg/postgresql/build/../src/test/recovery --verbose t/031_recovery_conflict.pl

# +++ tap check in src/test/recovery +++
t/031_recovery_conflict.pl ..
# issuing query via background psql:
#   BEGIN;
#   DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
#   FETCH FORWARD FROM test_recovery_conflict_cursor;
ok 1 - buffer pin conflict: cursor with conflicting pin established
ok 2 - buffer pin conflict: logfile contains terminated connection due to recovery conflict
ok 3 - buffer pin conflict: stats show conflict on standby
# issuing query via background psql:
#   BEGIN;
#   DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
#   FETCH FORWARD FROM test_recovery_conflict_cursor;
#
ok 4 - snapshot conflict: cursor with conflicting snapshot established
ok 5 - snapshot conflict: logfile contains terminated connection due to recovery conflict
ok 6 - snapshot conflict: stats show conflict on standby
# issuing query via background psql:
#   BEGIN;
#   LOCK TABLE test_recovery_conflict_table1 IN ACCESS SHARE MODE;
#   SELECT 1;
#
ok 7 - lock conflict: conflicting lock acquired
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 7.
Dubious, test returned 255 (wstat 65280, 0xff00)
All 7 subtests passed

Test Summary Report
-------------------
t/031_recovery_conflict.pl (Wstat: 65280 Tests: 7 Failed: 0)
  Non-zero exit status: 255
  Parse errors: No plan found in TAP output
Files=1, Tests=7, 186 wallclock secs ( 0.01 usr  0.00 sys +  0.74 cusr  0.71 csys =  1.46 CPU)
Result: FAIL
make: *** [Makefile:23: check] Error 1

I.e. the test file just exits after test 7, without running the rest. Is
it dying because it's running into a deadlock instead of some other
expected error message?

timed out waiting for match: (?^:User was holding a relation lock for too long) at t/031_recovery_conflict.pl line 318.

2023-08-04 12:20:38.061 UTC [1920073] 031_recovery_conflict.pl FATAL:  terminating connection due to conflict with recovery

Christoph
Attachments
On Sat, Aug 5, 2023 at 12:43 AM Christoph Berg <myon@debian.org> wrote:
> I managed to reproduce it on the shell by running the test in a loop a
> few times. The failure looks like this:

It's great that you can reproduce this semi-reliably!  I've rebased the
patch, hoping you can try it out.

https://www.postgresql.org/message-id/CA%2BhUKGJDcyW8Hbq3UyG-5-5Y7WqqOTjrXbFTMxxmhiofFraE-Q%40mail.gmail.com
Re: Thomas Munro
> It's great that you can reproduce this semi-reliably!  I've rebased
> the patch, hoping you can try it out.

Unfortunately only very "semi": today I didn't get to the same point where
it exited after test 7, but got some other timeouts. I'm not even sure
they are related to this (?) problem. Anyway:

> https://www.postgresql.org/message-id/CA%2BhUKGJDcyW8Hbq3UyG-5-5Y7WqqOTjrXbFTMxxmhiofFraE-Q%40mail.gmail.com

This patch makes the testsuite hang (and later exit) after this:

ok 17 - 5 recovery conflicts shown in pg_stat_database
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 17.

I haven't seen any other problems with the patch applied, but since that
test was hanging and hence very slow, I couldn't do many runs. Perhaps
that's progress, I don't know. :)

Logs attached.

Christoph
Attachments
On Mon, Aug 7, 2023 at 8:40 AM Christoph Berg <myon@debian.org> wrote:
> 2023-08-06 17:21:24.078 UTC [1275555] 031_recovery_conflict.pl FATAL:  unrecognized conflict mode: 7

Thanks for testing!  Would you mind trying v8 from that thread?  V7 had a
silly bug (I accidentally deleted a 'case' label while cleaning some stuff
up, resulting in the above error...)
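The failure mode is easy to picture: the backend dispatches on the conflict
reason it was signalled with, and a lost 'case' label sends that reason to
the catch-all. A schematic sketch (the wrapper function name is
hypothetical; the enum value and the error text are the real ones):

#include "postgres.h"
#include "storage/procsignal.h"

static void
handle_recovery_conflict(ProcSignalReason reason)
{
	switch (reason)
	{
		case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
			/* ... handle a buffer-pin conflict ... */
			break;

		/* ... one case per recovery-conflict reason ... */

		default:
			/* an accidentally deleted case label lands its reason here */
			elog(FATAL, "unrecognized conflict mode: %d", (int) reason);
	}
}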
Re: Thomas Munro
> Thanks for testing!  Would you mind trying v8 from that thread?  V7
> had a silly bug (I accidentally deleted a 'case' label while cleaning
> some stuff up, resulting in the above error...)

v8 worked better. It succeeded a few times (at least 12; my screen
scrollback didn't catch more) before erroring like this:

[10:21:58.410](0.151s) ok 15 - startup deadlock: logfile contains terminated connection due to recovery conflict
[10:21:58.463](0.053s) not ok 16 - startup deadlock: stats show conflict on standby
[10:21:58.463](0.000s)
[10:21:58.463](0.000s) #   Failed test 'startup deadlock: stats show conflict on standby'
#   at t/031_recovery_conflict.pl line 332.
[10:21:58.463](0.000s) #          got: '0'
#     expected: '1'

Christoph
Attachments
Hi,

On 2023-08-07 12:57:40 +0200, Christoph Berg wrote:
> Re: Thomas Munro
> > Thanks for testing!  Would you mind trying v8 from that thread?  V7
> > had a silly bug (I accidentally deleted a 'case' label while cleaning
> > some stuff up, resulting in the above error...)
>
> v8 worked better. It succeeded a few times (at least 12; my screen
> scrollback didn't catch more) before erroring like this:
>
> [10:21:58.410](0.151s) ok 15 - startup deadlock: logfile contains terminated connection due to recovery conflict
> [10:21:58.463](0.053s) not ok 16 - startup deadlock: stats show conflict on standby
> [10:21:58.463](0.000s)
> [10:21:58.463](0.000s) #   Failed test 'startup deadlock: stats show conflict on standby'
> #   at t/031_recovery_conflict.pl line 332.
> [10:21:58.463](0.000s) #          got: '0'
> #     expected: '1'

Hm, that could just be a "harmless" race. Does it still happen if you
apply the attached patch in addition?

Greetings,

Andres Freund
Attachments
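A note on what the "harmless" race likely is: cumulative statistics are
accumulated in backend-local pending entries and only become visible in
pg_stat_database after a flush, so a test that reads the view immediately
after the conflict can still see the old counter. Judging by the patch
file name mentioned later in the thread (pgstat-report-conflicts-immediately.patch),
the idea is to force a flush as soon as the conflict is counted. A sketch
of that idea (the helper is hypothetical; pgstat_report_recovery_conflict()
and pgstat_report_stat() are the real functions):

#include "postgres.h"
#include "pgstat.h"

/*
 * Hypothetical helper: bump the per-database recovery-conflict counter
 * and push the pending stats out immediately, so a subsequent read of
 * pg_stat_database can observe it without waiting for the next regularly
 * scheduled flush.
 */
static void
report_conflict_and_flush(int reason)
{
	pgstat_report_recovery_conflict(reason);
	(void) pgstat_report_stat(true);	/* force == true: flush now */
}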
Re: Andres Freund
> Hm, that could just be a "harmless" race. Does it still happen if you
> apply the attached patch in addition?

Putting that patch on top of v8 made it pass 294 times before exiting
like this:

[08:52:34.134](0.032s) ok 1 - buffer pin conflict: cursor with conflicting pin established
Waiting for replication conn standby's replay_lsn to pass 0/3430000 on primary
done
timed out waiting for match: (?^:User was holding shared buffer pin for too long) at t/031_recovery_conflict.pl line 318.

But admittedly, this build machine is quite sluggish at times, likely due
to neighboring VMs on the same host. (Perhaps that even explains the
behavior from before this patch.) I'll still attach the logs, since I
frankly can't judge which errors are acceptable here and which aren't.

Christoph
Attachments
On Wed, Aug 9, 2023 at 2:01 AM Christoph Berg <myon@debian.org> wrote:
> Putting that patch on top of v8 made it pass 294 times before exiting
> like this:
>
> [08:52:34.134](0.032s) ok 1 - buffer pin conflict: cursor with conflicting pin established
> Waiting for replication conn standby's replay_lsn to pass 0/3430000 on primary
> done
> timed out waiting for match: (?^:User was holding shared buffer pin for too long) at t/031_recovery_conflict.pl line 318.

Can you reproduce that with logging like this added on top?

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5b663a2997..72f5274c95 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -629,6 +629,7 @@ SetStartupBufferPinWaitBufId(int bufid)
 	/* use volatile pointer to prevent code rearrangement */
 	volatile PROC_HDR *procglobal = ProcGlobal;
 
+	elog(LOG, "XXX SetStartupBufferPinWaitBufId(%d)", bufid);
 	procglobal->startupBufferPinWaitBufId = bufid;
 }
 
@@ -640,8 +641,11 @@ GetStartupBufferPinWaitBufId(void)
 {
 	/* use volatile pointer to prevent code rearrangement */
 	volatile PROC_HDR *procglobal = ProcGlobal;
+	int			bufid;
 
-	return procglobal->startupBufferPinWaitBufId;
+	bufid = procglobal->startupBufferPinWaitBufId;
+	elog(LOG, "XXX GetStartupBufferPinWaitBufId() -> %d", bufid);
+	return bufid;
 }
Re: Thomas Munro
> On Wed, Aug 9, 2023 at 2:01 AM Christoph Berg <myon@debian.org> wrote:
> > Putting that patch on top of v8 made it pass 294 times before exiting
> > like this:
> >
> > [08:52:34.134](0.032s) ok 1 - buffer pin conflict: cursor with conflicting pin established
> > Waiting for replication conn standby's replay_lsn to pass 0/3430000 on primary
> > done
> > timed out waiting for match: (?^:User was holding shared buffer pin for too long) at t/031_recovery_conflict.pl line 318.
>
> Can you reproduce that with logging like this added on top?

603 iterations later it hit again, but didn't log anything. (I believe I
did run "make" in the right directory.)

[22:20:24.714](3.145s) # issuing query via background psql:
#   BEGIN;
#   DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
#   FETCH FORWARD FROM test_recovery_conflict_cursor;
[22:20:24.745](0.031s) ok 1 - buffer pin conflict: cursor with conflicting pin established
Waiting for replication conn standby's replay_lsn to pass 0/3430000 on primary
done
timed out waiting for match: (?^:User was holding shared buffer pin for too long) at t/031_recovery_conflict.pl line 318.

Perhaps this can simply be attributed to the machine being too busy.

Christoph
Attachments
Re: To Thomas Munro
> 603 iterations later it hit again, but didn't log anything. (I believe
> I did run "make" in the right directory.)

Since that didn't seem right, I'm running the tests again. There are XXX
lines in the output, but it hasn't hit yet.

Christoph
Re: To Thomas Munro
> 603 iterations later it hit again, but didn't log anything. (I believe
> I did run "make" in the right directory.)

This time it took 3086 iterations to hit the problem. Running c27f8621eedf7
+ Debian patches + v8 + pgstat-report-conflicts-immediately.patch + the XXX
logging.

> [22:20:24.714](3.145s) # issuing query via background psql:
> #   BEGIN;
> #   DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
> #   FETCH FORWARD FROM test_recovery_conflict_cursor;
> [22:20:24.745](0.031s) ok 1 - buffer pin conflict: cursor with conflicting pin established
> Waiting for replication conn standby's replay_lsn to pass 0/3430000 on primary
> done
> timed out waiting for match: (?^:User was holding shared buffer pin for too long) at t/031_recovery_conflict.pl line 318.

No XXX lines this time either, but I've seen them in logfiles that went
through successfully.

> Perhaps this can simply be attributed to the machine being too busy.

With the patches, the problem of dying so often that builds targeting
several distributions in parallel would usually fail is gone.

Christoph
Attachments
On Thu, Aug 10, 2023 at 9:15 PM Christoph Berg <myon@debian.org> wrote:
> No XXX lines this time either, but I've seen them in logfiles that
> went through successfully.

Hmm. Well, I think this looks like a different kind of bug then. That
patch of mine is about fixing some unsafe coding on the receiving side of
a signal. In this case it's apparently not being sent. So either the
Heap2/PRUNE record was able to proceed (indicating that that CURSOR was
not holding a pin as expected), or VACUUM decided not to actually do
anything to that block (conditional cleanup lock vs transient pin changing
behaviour?), or there's a bug somewhere in/near LockBufferForCleanup(),
which should have emitted that XXX message before even calling
ResolveRecoveryConflictWithBufferPin().

Do you still have the data directories around from that run, so we can see
if the expected Heap2/PRUNE record was actually logged? For example (using
meson layout here, in the build directory) that'd be something like:

$ ./tmp_install/home/tmunro/install/bin/pg_waldump testrun/recovery/031_recovery_conflict/data/t_031_recovery_conflict_standby_data/pgdata/pg_wal/000000010000000000000003

In there I see this:

rmgr: Heap2       len (rec/tot):     57/    57, tx:          0, lsn: 0/0344BB90, prev 0/0344BB68, desc: PRUNE snapshotConflictHorizon: 0, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [21], unused: [], blkref #0: rel 1663/16385/16386 blk 0

That's the WAL record that's supposed to cause
031_recovery_conflict_standby.log to report a conflict, starting with this:

2023-08-10 22:47:04.564 NZST [57145] LOG:  recovery still waiting after 10.035 ms: recovery conflict on buffer pin
2023-08-10 22:47:04.564 NZST [57145] CONTEXT:  WAL redo at 0/344BB90 for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [21], unused: []; blkref #0: rel 1663/16385/16386, blk 0
Re: Thomas Munro
> On Thu, Aug 10, 2023 at 9:15 PM Christoph Berg <myon@debian.org> wrote:
> > No XXX lines this time either, but I've seen them in logfiles that
> > went through successfully.
>
> Do you still have the data directories around from that run, so we can
> see if the expected Heap2/PRUNE record was actually logged? For example
> (using meson layout here, in the build directory) that'd be something
> like:

Sorry for the late reply; getting the PG release out of the door was more
important.

> $ ./tmp_install/home/tmunro/install/bin/pg_waldump
> testrun/recovery/031_recovery_conflict/data/t_031_recovery_conflict_standby_data/pgdata/pg_wal/000000010000000000000003
>
> In there I see this:

It's been a long day and I can't wrap my mind around this now, so I'll
just dump the output here.

[0] 16:03 myon@sid-s390x.pgs390x:~/.../build/src/test/recovery
$ LC_ALL=C ../../../tmp_install/usr/lib/postgresql/17/bin/pg_waldump tmp_check/t_031_recovery_conflict_standby_data/pgdata/pg_wal/000000010000000000000003 | grep -3 PRUNE > PRUNE.log
pg_waldump: error: error in WAL record at 0/347E6A8: invalid record length at 0/347E6E0: expected at least 24, got 0

Christoph
Attachments
Thanks. I realised that it's easy enough to test that theory about cleanup locks by hacking ConditionalLockBufferForCleanup() to return false randomly. Then the test occasionally fails as described. Seems like we'll need to fix that test, but it's not evidence of a server bug, and my signal handler refactoring patch is in the clear. Thanks for testing it!
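For reference, the hack described above can be as small as this throwaway
change to ConditionalLockBufferForCleanup() in
src/backend/storage/buffer/bufmgr.c (a testing aid only, never committed;
the failure probability is arbitrary, and a rand() % 10 variant of the
same idea reappears later in this thread):

bool
ConditionalLockBufferForCleanup(Buffer buffer)
{
	/*
	 * Fault injection for testing: pretend the cleanup lock could not be
	 * taken some of the time, as legitimately happens when another scan
	 * holds a transient pin on the page.
	 */
	if (random() % 10 == 0)
		return false;

	/* ... the original function body continues unchanged ... */
}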
Hi,

On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
> Thanks. I realised that it's easy enough to test that theory about
> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
> false randomly. Then the test occasionally fails as described. Seems
> like we'll need to fix that test, but it's not evidence of a server
> bug, and my signal handler refactoring patch is in the clear. Thanks
> for testing it!

WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
passes even after I make ConditionalLockBufferForCleanup() fail 100%.

Greetings,

Andres Freund
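The likely reason VACUUM FREEZE helps: an aggressive vacuum must process
every page (it cannot leave tuples unfrozen), so when
ConditionalLockBufferForCleanup() fails it waits for the cleanup lock
instead of skipping the page, and the pruning WAL record the test depends
on is always emitted. Simplified control flow — a sketch of the idea,
modeled loosely on vacuumlazy.c, with details that vary by version:

/* per-page locking decision inside lazy vacuum's heap scan */
if (!ConditionalLockBufferForCleanup(buf))
{
	if (!aggressive)
	{
		/*
		 * Plain VACUUM: just skip cleanup of this page.  No PRUNE record
		 * is written, so the standby never sees the expected conflict --
		 * the race the test was hitting.
		 */
		return;
	}

	/*
	 * VACUUM FREEZE (aggressive): the page must be processed, so wait
	 * for the competing pin to go away.  The conflict on the standby
	 * becomes deterministic.
	 */
	LockBufferForCleanup(buf);
}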
Re: Andres Freund
> > Thanks. I realised that it's easy enough to test that theory about
> > cleanup locks by hacking ConditionalLockBufferForCleanup() to return
> > false randomly. Then the test occasionally fails as described. Seems
> > like we'll need to fix that test, but it's not evidence of a server
> > bug, and my signal handler refactoring patch is in the clear. Thanks
> > for testing it!
>
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.

I have now applied the last two patches to postgresql-17 to see if the
build is more stable. (So far I had only tried them in manual tests.)

Fwiw, this is also causing pain on PostgreSQL 16:
https://pgdgbuild.dus.dg-i.net/view/Snapshot/job/postgresql-16-binaries-snapshot/1011/architecture=s390x,distribution=sid/consoleText

Most of the failing builds in
https://pgdgbuild.dus.dg-i.net/view/Snapshot/job/postgresql-16-binaries-snapshot/
are on s390x and likely due to this problem.

This should be fixed before the 16 release.

Christoph
On Tue, Aug 29, 2023 at 1:58 AM Christoph Berg <myon@debian.org> wrote:
> This should be fixed before the 16 release.

Here's what I was thinking of doing for this, given where we are in the
release schedule:

* commit the signal-refactoring patch in master only
* plan to back-patch it into 16 in a later point release (it's much harder
  to back-patch further than that)
* look into what we can do to suppress the offending test for 16 in the
  meantime

But looking into that just now, I am curious about something. The whole
area of recovery conflicts has been buggy forever, and the tests were only
developed fairly recently by Melanie and Andres (April 2022), and then
backpatched to all releases. They were disabled again in release branches
10-14 (discussion at https://postgr.es/m/3447060.1652032749@sss.pgh.pa.us):

+plan skip_all => "disabled until after minor releases, due to instability";

Now your mainframe build bot is regularly failing for whatever timing
reason, in 16 and master. That's quite useful, because your tests have
made us (or at least me) a lot more confident that the fix is good: one of
the reasons progress has been slow is that failures in CI and the
buildfarm were pretty rare and hard to analyse. But... I wonder why it
isn't failing for you in 15? Are you able to check if it ever has? I
suppose we could go and do the "disabled due to instability" thing in 15
and 16, and then later we'll un-disable it in 16 when we back-patch the
fix into it.
Re: Thomas Munro
> 2022), and then backpatched to all releases. They were disabled again
> in release branches 10-14 (discussion at
> https://postgr.es/m/3447060.1652032749@sss.pgh.pa.us):
>
> +plan skip_all => "disabled until after minor releases, due to instability";

Right:
https://pgdgbuild.dus.dg-i.net/view/Snapshot/job/postgresql-14-binaries-snapshot/366/architecture=s390x,distribution=sid/consoleText
-> t/031_recovery_conflict.pl ........... skipped: disabled until after minor releases, due to instability

> Now your mainframe build bot is regularly failing for whatever timing
> reason, in 16 and master. But... I wonder why it isn't failing for
> you in 15? Are you able to check if it ever has?

The current explanation is that builds are only running frequently for 16
and devel, and the periodic extra test jobs only run "make check" and the
postgresql-common testsuite, not the full server testsuite.

I had "snapshot" jobs configured for basically everything (server and
extensions) for some time, but ultimately realized that I don't have the
slightest time to look at everything, so I disabled them again about 3
weeks ago. This removed enough noise that I could actually look at the
failing 16 and devel jobs again.

I don't see any failures of that kind for the 15 snapshot build while it
was still running; the failures in there are all something else (mostly
problems outside PG).
https://pgdgbuild.dus.dg-i.net/view/Snapshot/job/postgresql-15-binaries-snapshot/

I'll enable the server snapshot builds again; those shouldn't be too noisy.

Christoph
I have now disabled the test in 15 and 16 (like the older branches). I'll see about getting the fixes into master today, and we can contemplate back-patching later, after we've collected a convincing volume of test results from the build farm, CI and hopefully your s390x master snapshot builds (if that is a thing).
On Sun, Aug 13, 2023 at 9:00 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
> > Thanks. I realised that it's easy enough to test that theory about
> > cleanup locks by hacking ConditionalLockBufferForCleanup() to return
> > false randomly. Then the test occasionally fails as described. Seems
> > like we'll need to fix that test, but it's not evidence of a server
> > bug, and my signal handler refactoring patch is in the clear. Thanks
> > for testing it!
>
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.

I pushed that change (master only for now, like the other change).
On Tue, Aug 8, 2023 at 11:08 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-07 12:57:40 +0200, Christoph Berg wrote:
> > v8 worked better. It succeeded a few times (at least 12; my screen
> > scrollback didn't catch more) before erroring like this:
>
> > [10:21:58.410](0.151s) ok 15 - startup deadlock: logfile contains terminated connection due to recovery conflict
> > [10:21:58.463](0.053s) not ok 16 - startup deadlock: stats show conflict on standby
> > [10:21:58.463](0.000s)
> > [10:21:58.463](0.000s) #   Failed test 'startup deadlock: stats show conflict on standby'
> > #   at t/031_recovery_conflict.pl line 332.
> > [10:21:58.463](0.000s) #          got: '0'
> > #     expected: '1'
>
> Hm, that could just be a "harmless" race. Does it still happen if you
> apply the attached patch in addition?

Do you intend that as the fix, or just as proof of what's wrong? It looks
pretty reasonable to me. I will go ahead and commit it unless you want to
change something/do it yourself. As far as I know, this is the last thing
preventing the mighty Debian/s390x build box from passing consistently,
which would be nice.
Attachments
Re: Thomas Munro
> I have now disabled the test in 15 and 16 (like the older branches).
> I'll see about getting the fixes into master today, and we can
> contemplate back-patching later, after we've collected a convincing
> volume of test results from the build farm, CI and hopefully your
> s390x master snapshot builds (if that is a thing).

I have now seen the problem on 15 as well, in case I had not reported
that yet.

I think the 16/17 builds look good now with the patches applied, but
there's still a lot of noise from the build box having random other
problems (network disconnects and other silly things), so it still needs
a lot of hand-holding :-/

Christoph
Hi,

13.08.2023 00:00, Andres Freund wrote:
> Hi,
>
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
>> Thanks. I realised that it's easy enough to test that theory about
>> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
>> false randomly. Then the test occasionally fails as described. Seems
>> like we'll need to fix that test, but it's not evidence of a server
>> bug, and my signal handler refactoring patch is in the clear. Thanks
>> for testing it!
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.

I happened to encounter another failure of 031_recovery_conflict, on a
rather slow ARMv7 machine:

t/031_recovery_conflict.pl ............ 13/? # poll_query_until timed out executing this query:
#
# SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
#
# expecting this output:
# waiting
# last actual query output:
#
# with stderr:
t/031_recovery_conflict.pl ............ 14/?
#   Failed test 'startup deadlock: lock acquisition is waiting'
#   at t/031_recovery_conflict.pl line 261.
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 15.
t/031_recovery_conflict.pl ............ Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/15 subtests

I've managed to reproduce it with the attached test patch, on a regular
x86_64 VM with the CPU slowed down to 2%, so that the modified test runs
for 5+ minutes.

...
t/031_recovery_conflict.pl .. 102/? # iteration 31
t/031_recovery_conflict.pl .. 104/? # iteration 32
t/031_recovery_conflict.pl .. 106/?
<... waiting for something ...>

031_recovery_conflict_standby.log contains:

2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG:  statement: BEGIN;
2023-10-09 12:43:14.821 UTC [145519] 031_recovery_conflict.pl LOG:  statement: DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT a FROM test_recovery_conflict_table1;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG:  statement: FETCH FORWARD FROM test_recovery_conflict_cursor;
2023-10-09 12:43:14.822 UTC [145519] 031_recovery_conflict.pl LOG:  statement: SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG:  recovery still waiting after 10.217 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT:  WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel 1663/16385/16489, blk 0
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl ERROR:  canceling statement due to conflict with recovery at character 15
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl DETAIL:  User transaction caused buffer deadlock with recovery.
2023-10-09 12:43:15.039 UTC [145519] 031_recovery_conflict.pl STATEMENT:  SELECT * FROM test_recovery_conflict_table2;
2023-10-09 12:43:15.039 UTC [145513] LOG:  recovery finished waiting after 10.382 ms: recovery conflict on buffer pin
2023-10-09 12:43:15.039 UTC [145513] CONTEXT:  WAL redo at 0/35ADD38 for Heap2/PRUNE: snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: []; blkref #0: rel 1663/16385/16489, blk 0
2023-10-09 12:43:15.134 UTC [145527] 031_recovery_conflict.pl LOG:  statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.647 UTC [145530] 031_recovery_conflict.pl LOG:  statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2023-10-09 12:43:15.981 UTC [145532] 031_recovery_conflict.pl LOG:  statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
...
2023-10-09 12:51:24.729 UTC [149175] 031_recovery_conflict.pl LOG:  statement: SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2023-10-09 12:51:25.969 UTC [145514] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
	This probably means the server terminated abnormally before or while processing the request.

Indeed, t_031_recovery_conflict_standby_data/pgdata/pg_wal/000000010000000000000003 contains:

...
rmgr: Heap        len (rec/tot):     59/    59, tx:        972, lsn: 0/035ADB18, prev 0/035ADAD8, desc: INSERT off: 100, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap        len (rec/tot):     59/    59, tx:        972, lsn: 0/035ADB58, prev 0/035ADB18, desc: INSERT off: 101, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        972, lsn: 0/035ADB98, prev 0/035ADB58, desc: ABORT 2023-10-09 12:43:13.797513 UTC
rmgr: Standby     len (rec/tot):     42/    42, tx:        973, lsn: 0/035ADBC0, prev 0/035ADB98, desc: LOCK xid 973 db 16385 rel 16389
rmgr: Transaction len (rec/tot):    178/   178, tx:        973, lsn: 0/035ADBF0, prev 0/035ADBC0, desc: PREPARE gid lock: 2023-10-09 12:43:13.798048 UTC
rmgr: Heap        len (rec/tot):     59/    59, tx:        974, lsn: 0/035ADCA8, prev 0/035ADBF0, desc: INSERT off: 102, flags: 0x00, blkref #0: rel 1663/16385/16489 blk 0
rmgr: Transaction len (rec/tot):     34/    34, tx:        974, lsn: 0/035ADCE8, prev 0/035ADCA8, desc: COMMIT 2023-10-09 12:43:13.798434 UTC
rmgr: Transaction len (rec/tot):     34/    34, tx:        975, lsn: 0/035ADD10, prev 0/035ADCE8, desc: COMMIT 2023-10-09 12:43:13.900046 UTC
rmgr: Heap2       len (rec/tot):    255/   255, tx:          0, lsn: 0/035ADD38, prev 0/035ADD10, desc: PRUNE snapshotConflictHorizon: 0, nredirected: 0, ndead: 100, nunused: 0, redirected: [], dead: [2, 3, 4, ... , 99, 100, 101], unused: [], blkref #0: rel 1663/16385/16489 blk 0
rmgr: Heap2       len (rec/tot):    248/   248, tx:          0, lsn: 0/035ADE38, prev 0/035ADD38, desc: VACUUM nunused: 100, unused: [2, 3, 4, ... , 99, 100, 101], blkref #0: rel 1663/16385/16489 blk 0

So it looks like "SELECT * FROM test_recovery_conflict_table2" was in this
case interrupted due to a conflict with replaying PRUNE, not VACUUM, and
as a result the query never entered the "waiting for lock" state. (I have
also seen other failures of this test, but they probably have the same
root cause.)

Best regards,
Alexander
Attachments
13.08.2023 00:00, Andres Freund wrote:
> On 2023-08-12 15:50:24 +1200, Thomas Munro wrote:
>> Thanks. I realised that it's easy enough to test that theory about
>> cleanup locks by hacking ConditionalLockBufferForCleanup() to return
>> false randomly. Then the test occasionally fails as described. Seems
>> like we'll need to fix that test, but it's not evidence of a server
>> bug, and my signal handler refactoring patch is in the clear. Thanks
>> for testing it!
> WRT fixing the test: I think just using VACUUM FREEZE ought to do the job?
> After changing all the VACUUMs to VACUUM FREEZEs, 031_recovery_conflict.pl
> passes even after I make ConditionalLockBufferForCleanup() fail 100%.

Yesterday I wondered what the reason is for another failure, seen on skink
[1], [2], [3]:

#   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_authid'
#   at t/035_standby_logical_decoding.pl line 224.

I've reproduced the failure by running several test instances in parallel
(with 15 jobs) under Valgrind, and saw a difference in the primary node's
WAL.

The failed run:
rmgr: Heap        len (rec/tot):     54/    54, tx:        752, lsn: 0/0403FB88, prev 0/0403FB48, desc: DELETE xmax: 752, off: 16, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1664/0/1260 blk 0
rmgr: Transaction len (rec/tot):     62/    62, tx:        752, lsn: 0/0403FBC0, prev 0/0403FB88, desc: INVALIDATION ; inval msgs: catcache 11 catcache 10
rmgr: Transaction len (rec/tot):     82/    82, tx:        752, lsn: 0/0403FC00, prev 0/0403FBC0, desc: COMMIT 2023-10-14 13:35:21.955043 MSK; inval msgs: catcache 11 catcache 10
rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/0403FC58, prev 0/0403FC00, desc: FPI_FOR_HINT , blkref #0: rel 1664/0/1260 fork fsm blk 2 FPW
rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/04041CA8, prev 0/0403FC58, desc: FPI_FOR_HINT , blkref #0: rel 1664/0/1260 fork fsm blk 1 FPW
rmgr: XLOG        len (rec/tot):     49/  8241, tx:          0, lsn: 0/04043CF8, prev 0/04041CA8, desc: FPI_FOR_HINT , blkref #0: rel 1664/0/1260 fork fsm blk 0 FPW
rmgr: Heap        len (rec/tot):    225/   225, tx:          0, lsn: 0/04045D48, prev 0/04043CF8, desc: INPLACE off: 26, blkref #0: rel 1663/16384/16406 blk 4
rmgr: Transaction len (rec/tot):     78/    78, tx:          0, lsn: 0/04045E30, prev 0/04045D48, desc: INVALIDATION ; inval msgs: catcache 55 catcache 54 relcache 1260
rmgr: Standby     len (rec/tot):     90/    90, tx:          0, lsn: 0/04045E80, prev 0/04045E30, desc: INVALIDATIONS ; relcache init file inval dbid 16384 tsid 1663; inval msgs: catcache 55 catcache 54 relcache 1260

A successful run:
rmgr: Heap        len (rec/tot):     54/    54, tx:        752, lsn: 0/0403FB88, prev 0/0403FB48, desc: DELETE xmax: 752, off: 16, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1664/0/1260 blk 0
rmgr: Transaction len (rec/tot):     62/    62, tx:        752, lsn: 0/0403FBC0, prev 0/0403FB88, desc: INVALIDATION ; inval msgs: catcache 11 catcache 10
rmgr: Transaction len (rec/tot):     82/    82, tx:        752, lsn: 0/0403FC00, prev 0/0403FBC0, desc: COMMIT 2023-10-14 14:21:47.254556 MSK; inval msgs: catcache 11 catcache 10
rmgr: Heap2       len (rec/tot):     57/    57, tx:          0, lsn: 0/0403FC58, prev 0/0403FC00, desc: PRUNE snapshotConflictHorizon: 752, nredirected: 0, ndead: 1, nunused: 0, redirected: [], dead: [16], unused: [], blkref #0: rel 1664/0/1260 blk 0
rmgr: Btree       len (rec/tot):     52/    52, tx:          0, lsn: 0/0403FC98, prev 0/0403FC58, desc: VACUUM ndeleted: 1, nupdated: 0, deleted: [1], updated: [], blkref #0: rel 1664/0/2676 blk 1
rmgr: Btree       len (rec/tot):     52/    52, tx:          0, lsn: 0/0403FCD0, prev 0/0403FC98, desc: VACUUM ndeleted: 1, nupdated: 0, deleted: [16], updated: [], blkref #0: rel 1664/0/2677 blk 1
rmgr: Heap2       len (rec/tot):     50/    50, tx:          0, lsn: 0/0403FD08, prev 0/0403FCD0, desc: VACUUM nunused: 1, unused: [16], blkref #0: rel 1664/0/1260 blk 0

I repeated the hack Thomas did before (added "if (rand() % 10 == 0) return
false;" in ConditionalLockBufferForCleanup()) and reproduced that failure
without Valgrind or parallelism. The following changes:

@@ -562 +562 @@
-VACUUM pg_class;
+VACUUM FREEZE pg_class;
@@ -600 +600 @@
-VACUUM pg_authid;
+VACUUM FREEZE pg_authid;
@@ -638 +638 @@
-VACUUM conflict_test;
+VACUUM FREEZE conflict_test;

fixed the issue for me. The second change should protect from a similar
failure [4]:

#   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_class'
#   at t/035_standby_logical_decoding.pl line 224.

(I decided to write about failures of 035_standby_logical_decoding in this
thread because of the same VACUUM issue.)

I also confirmed that all the other tests in src/test/recovery pass
reliably with the modified ConditionalLockBufferForCleanup() (except for
027_stream_regress, where plans changed).

When I had overcome the failures with pg_authid/pg_class, I stumbled upon
another, even rarer one (which occurs without modifying
ConditionalLockBufferForCleanup()):

t/035_standby_logical_decoding.pl .. 42/?
#   Failed test 'inactiveslot slot invalidation is logged with on-access pruning'
#   at t/035_standby_logical_decoding.pl line 224.

But I think that's another case, which I'm going to investigate separately.

BTW, it looks like the comment:
# One way to produce recovery conflict is to create/drop a relation and
# launch a vacuum on pg_class with hot_standby_feedback turned off on the standby.
in scenario 3 is a copy-paste error. Also, there are two "Scenario 4"
sections in this test.

Long story short, VACUUM FREEZE should make skink greener, but there is
still another way for the 035 test to fail.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-09-05%2009%3A47%3A44
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-10-13%2012%3A16%3A52
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-09-15%2017%3A10%3A55
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-08-27%2008%3A30%3A28