Re: Race conditions in 019_replslot_limit.pl
От | Andres Freund |
---|---|
Тема | Re: Race conditions in 019_replslot_limit.pl |
Дата | |
Msg-id | 20220325000234.uzsbjzk4evshdizl@alap3.anarazel.de обсуждение исходный текст |
Ответ на | Re: Race conditions in 019_replslot_limit.pl (Tom Lane <tgl@sss.pgh.pa.us>) |
Список | pgsql-hackers |
Hi, On 2022-03-24 15:05:40 -0400, Tom Lane wrote: > I just noticed something very interesting: in a couple of recent > buildfarm runs with this failure, the pg_stat_activity printout > no longer shows the extra walsender: Oh. That is interesting. Thanks for catching that. > What this suggests to me is that maybe the extra walsender is > indeed not blocked on anything, but is just taking its time > about exiting. In these two runs, as well as in all the > non-failing runs, it had enough time to do so. Still odd that it started with the bugfix I committed. And the locking pattern in [1] still seems suspicious. But the obove does seem to suggest those might just have been red herrings. I had previously tried to make process exit slow by inserting delays in various points and only succeeded in tests failing due too many connections. For some reason I didn't increase max_connections... I now increased the test's max_connections and managed to get a kind of similar failure by sticking elog(LOG, "received disconnect"); pg_usleep(220000); into the 'X' case in PostgresMain(). However, I always see three pids or the expected one pid, not two like the BF cases. But it's not too surprising that such an artificial, uniform, slowdown would have different symptoms than reality. > I suggest that we add a couple-of-seconds sleep in front of > the query that collects walsender PIDs, and maybe a couple more > seconds before the pg_stat_activity probe in the failure path, > and see if the behavior changes at all. Makes sense. I was previously thinking that it'd make sense to sleep for a bit before the ->stop('fast'); calls, so that we can see whether it's just the shutdown unblocking things (which [1] seems to suggest). How about the attached variation, which retries (for 15s, with 100ms sleeps) if there are multiple walsenders, printing the debugging info each time? It'll still fail the test if later iterations find just one walsender, which seems the right behaviour for now. > That should be enough to confirm or disprove this idea pretty quickly. If > it is right, a permanent fix could be to wait for the basebackup's walsender > to disappear from node_primary3's pg_stat_activity before we start the one > for node_standby3. For some tests a "confimed disconnect" mode would be useful. Basically the client waiting until it receives EOF after sending 'X'. Greetings, Andres Freund [1] https://postgr.es/m/20220226072325.wtvkwvvga2wc3nkn%40alap3.anarazel.de
Вложения
В списке pgsql-hackers по дате отправления: