Re: Race conditions with checkpointer and shutdown
От | Michael Paquier |
---|---|
Тема | Re: Race conditions with checkpointer and shutdown |
Дата | |
Msg-id | 20190613060123.GC1643@paquier.xyz обсуждение исходный текст |
Ответ на | Re: Race conditions with checkpointer and shutdown (Alvaro Herrera <alvherre@2ndquadrant.com>) |
Список | pgsql-hackers |
On Wed, Jun 12, 2019 at 01:42:01PM -0400, Alvaro Herrera wrote: > I looked at the buildfarm failures for the recoveryCheck stage. It > looks like there is only one failure for branch master after this > commit, which was chipmunk saying: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-05-12%2020%3A37%3A11 > AFAICS this is wholly unrelated to the problem at hand. Yes, that's unrelated. This failure is interesting, still it has happened only once per what I can see. I think that this points out a rare race condition in test 007_sync_rep.pl because of this sequence which reorders the standbys: # Stop and start standbys to rearrange the order of standbys # in WalSnd array. Now, if standbys have the same priority, # standby2 is selected preferentially and standby3 is next. $node_standby_1->stop; $node_standby_2->stop; $node_standby_3->stop; $node_standby_2->start; $node_standby_3->start; The failure actually indicates that even if standby3 has started after standby2, standby3 has registered back into the WAL sender array of the primary before standby2 had the occasion to do it, but we have no guarantee that the ordering is actually right. So this has messed up with the expected set of sync standbys when these are not directly listed. I think that this can happen as well when initializing the standbys at the top of the test, but the window is much, much narrower and basically impossible to reach. Using a combo of safe_psql('checkpoint') + wait_for_catchup() makes the test faster, but that's much more costly than using just poll_query_until() on pg_stat_replication to make sure that each standby is registered on the primary's WAL sender array. In short, something like the attached should improve the stability of the test. Thoughts? -- Michael
Вложения
В списке pgsql-hackers по дате отправления: