Re: Race conditions in 019_replslot_limit.pl
From: Andres Freund
Subject: Re: Race conditions in 019_replslot_limit.pl
Msg-id: 20220327213219.smdvfkq2fl74flow@alap3.anarazel.de
In reply to: Re: Race conditions in 019_replslot_limit.pl (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Hi,

On 2022-03-27 15:28:05 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > How about the attached variation, which retries (for 15s, with 100ms
> > sleeps) if there are multiple walsenders, printing the debugging info
> > each time? It'll still fail the test if later iterations find just one
> > walsender, which seems the right behaviour for now.
>
> We have now four instances of failures with this version of the test,
> and in every case the second iteration succeeded. Is that enough
> evidence yet?

I still feel like there's something off here. But that's probably not enough
to keep causing failures. I'm inclined to leave the debugging in for a bit
longer, but not fail the test anymore?

The way the temporary slot removal hangs for a while seems just off:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-03-27%2017%3A04%3A18

2022-03-27 13:30:56.993 EDT [2022-03-27 13:30:56 EDT 750695:20] 019_replslot_limit.pl DEBUG: replication slot drop: pg_basebackup_750695:removed on-disk
...
2022-03-27 13:30:57.456 EDT [2022-03-27 13:30:57 EDT 750759:3] [unknown] LOG: connection authorized: user=andrew database=postgres application_name=019_replslot_limit.pl
2022-03-27 13:30:57.466 EDT [2022-03-27 13:30:57 EDT 750759:4] 019_replslot_limit.pl LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-27 13:30:57.474 EDT [2022-03-27 13:30:56 EDT 750679:87] DEBUG: server process (PID 750759) exited with exit code 0
2022-03-27 13:30:57.507 EDT [2022-03-27 13:30:56 EDT 750695:21] 019_replslot_limit.pl DEBUG: replication slot drop: pg_basebackup_750695:done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2009%3A00%3A09

2022-03-25 10:13:30.364 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_4022819:begin
2022-03-25 10:13:30.364 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_4022819:removed on-disk
...
2022-03-25 10:13:31.038 CET [4022841][client backend][5/7:0][[unknown]] LOG: connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
2022-03-25 10:13:31.039 CET [4022841][client backend][5/8:0][019_replslot_limit.pl] LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-25 10:13:31.045 CET [4022807][postmaster][:0][] DEBUG: server process (PID 4022841) exited with exit code 0
2022-03-25 10:13:31.056 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_4022819:done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-03-25%2008%3A02%3A05

2022-03-25 09:15:20.558 CET [3730425][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_3730425:removed on-disk
...
2022-03-25 09:15:20.803 CET [3730461][client backend][5/7:0][[unknown]] LOG: connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
2022-03-25 09:15:20.804 CET [3730461][client backend][5/8:0][019_replslot_limit.pl] LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-25 09:15:20.834 CET [3730381][postmaster][:0][] DEBUG: server process (PID 3730461) exited with exit code 0
2022-03-25 09:15:20.861 CET [3730425][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_3730425:done

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2003%3A00%3A18

2022-03-25 04:14:03.025 CET [2674398][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_2674398:removed on-disk
...
2022-03-25 04:14:03.264 CET [2674463][client backend][5/7:0][[unknown]] LOG: connection authorized: user=bf database=postgres application_name=019_replslot_limit.pl
2022-03-25 04:14:03.265 CET [2674463][client backend][5/8:0][019_replslot_limit.pl] LOG: statement: SELECT * FROM pg_stat_activity
...
2022-03-25 04:14:03.270 CET [2674384][postmaster][:0][] DEBUG: server process (PID 2674463) exited with exit code 0
...
2022-03-25 04:14:03.324 CET [2674398][walsender][4/0:0][019_replslot_limit.pl] DEBUG: replication slot drop: pg_basebackup_2674398:done

We are able to start / finish several new connections between the two debug
elog()s in ReplicationSlotDropPtr()? I wonder if there's something odd going
on with ConditionVariableBroadcast(). Might be useful to add another debug
message before/after ConditionVariableBroadcast() and rmtree(). If the delay
is due to rmtree() being slow under concurrent tests I'd feel less concerned
(although that machine has dual NVMe drives...).

I really wish I could reproduce the failure. I went through a few hundred
cycles of that test in a separate checkout on that machine.

> I'd like to silence this noise so that we can start tracking
> lower-probability failure modes, like say these:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2022-03-26%2002%3A59%3A03

That one was a missing compiler flag. I see that failure reproducibly locally
when using asan with clang, unless I use
-fsanitize-address-use-after-return=never. gcc has a different default for
the option, which is why I hadn't configured it.

If I understand correctly, the problem is that
-fsanitize-address-use-after-return uses an alternative stack. Sometimes our
stack depth checking functions get called with the "proper" stack, and
sometimes with a "shadow" stack, which breaks our stack depth checking.

> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-03-26%2015%3A53%3A51

This one I have no idea about yet. I assume it's just a race in a new test...

Greetings,

Andres Freund
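[Editor's note] The retry scheme quoted at the top (poll for multiple walsenders, sleeping 100ms between attempts for up to 15s) can be sketched as a bounded polling loop. This is an illustrative C sketch with made-up names (`poll_with_retries`, `countdown_done`); the actual test is Perl TAP code, and the sleep is elided here:

```c
#include <stdbool.h>

/* Poll a condition at a fixed interval, giving up after a bounded
 * number of attempts.  The real test retries for 15s with 100ms
 * sleeps and only fails once the deadline expires. */
static bool
poll_with_retries(bool (*cond)(void *), void *arg, int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; attempt++)
    {
        if (cond(arg))
            return true;        /* condition reached, stop retrying */
        /* the real test sleeps 100ms here between attempts */
    }
    return false;               /* gave up; the test reports failure */
}

/* Toy condition for demonstration: becomes true once a countdown
 * reaches zero, mimicking a state that settles after a short delay. */
static bool
countdown_done(void *arg)
{
    int *remaining = arg;

    if (*remaining <= 0)
        return true;
    (*remaining)--;
    return false;
}
```

The point of the design is that a transiently wrong observation (e.g. a walsender that has not exited yet) does not fail the test immediately, while a persistently wrong one still does.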
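[Editor's note] The ConditionVariableBroadcast() question above concerns the wake-all step of slot drop. As a rough analogue (using POSIX pthreads, not PostgreSQL's ConditionVariable API, and with invented names), the pattern is: waiters sleep on a condition variable guarded by a flag, and the dropper sets the flag and broadcasts so every waiter wakes at once:

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct
{
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    bool            done;   /* has the slot drop completed? */
    int             woken;  /* how many waiters observed it */
} slot_drop_sync;

/* A waiter blocks until the drop is announced. */
static void *
waiter(void *arg)
{
    slot_drop_sync *s = arg;

    pthread_mutex_lock(&s->lock);
    while (!s->done)            /* loop guards against spurious wakeups */
        pthread_cond_wait(&s->cv, &s->lock);
    s->woken++;
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* The dropper marks completion and wakes all waiters at once. */
static void
announce_drop(slot_drop_sync *s)
{
    pthread_mutex_lock(&s->lock);
    s->done = true;
    pthread_cond_broadcast(&s->cv);
    pthread_mutex_unlock(&s->lock);
}
```

If the broadcast itself were slow or a waiter missed it, waiters would linger, which is the kind of delay between the two debug elog()s the extra logging around ConditionVariableBroadcast() would expose.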
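[Editor's note] The stack depth issue above comes from address-based depth checking: record an address near the stack base once, then estimate depth as the distance from that base to a current local variable. If ASAN's use-after-return mode moves some frames to a separate "fake" stack, that distance becomes meaningless. A simplified, self-contained sketch of the scheme (simplified relative to PostgreSQL's actual set_stack_base()/check_stack_depth()):

```c
#include <stddef.h>

static char *stack_base_ptr;    /* address captured near the stack base */

/* Capture a reference address; only the pointer value is kept,
 * never dereferenced, mirroring the real technique. */
static void
set_stack_base(void)
{
    char base;

    stack_base_ptr = &base;
}

/* Estimate depth as the distance from the base to a current local.
 * The stack may grow in either direction, hence the absolute value. */
static ptrdiff_t
current_stack_depth(void)
{
    char here;
    ptrdiff_t depth = stack_base_ptr - &here;

    return depth < 0 ? -depth : depth;
}

/* Recurse, consuming stack per frame, and report the deepest depth
 * observed.  volatile keeps the pad from being optimized away. */
static ptrdiff_t
recurse_and_measure(int levels)
{
    volatile char pad[512];
    ptrdiff_t d;

    pad[0] = (char) levels;
    d = current_stack_depth();
    if (levels <= 0)
        return d;
    ptrdiff_t deeper = recurse_and_measure(levels - 1);
    return deeper > d ? deeper : d;
}
```

Under -fsanitize-address-use-after-return, `&here` can live on the shadow stack while `stack_base_ptr` was captured on the real one (or vice versa), so the subtraction no longer measures recursion depth, which is exactly the breakage described.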