Re: [HACKERS] More race conditions in logical replication
From | Petr Jelinek |
---|---|
Subject | Re: [HACKERS] More race conditions in logical replication |
Date | |
Msg-id | 59c6012e-1d78-dca3-339c-be67fd166d6d@2ndquadrant.com |
In reply to | Re: [HACKERS] More race conditions in logical replication (Petr Jelinek <petr.jelinek@2ndquadrant.com>) |
Replies | Re: [HACKERS] More race conditions in logical replication |
List | pgsql-hackers |
On 06/07/17 17:33, Petr Jelinek wrote:
> On 03/07/17 01:54, Tom Lane wrote:
>> I noticed a recent failure that looked suspiciously like a race condition:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07
>>
>> The critical bit in the log file is
>>
>> error running SQL: 'psql:<stdin>:1: ERROR: could not drop the replication slot "tap_sub" on publisher
>> DETAIL: The error was: ERROR: replication slot "tap_sub" is active for PID 3866790'
>> while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION tap_sub' at /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm line 1198.
>>
>> After poking at it a bit, I found that I can cause several different
>> failures of this ilk in the subscription tests by injecting delays at
>> the points where a slot's active_pid is about to be cleared, as in the
>> attached patch (which also adds some extra printouts for debugging
>> purposes; none of that is meant for commit). It seems clear that there
>> is inadequate interlocking going on when we kill and restart a logical
>> rep worker: we're trying to start a new one before the old one has
>> gotten out of the slot.
>>
>
> Thanks for the test case.
>
> It's not actually that we start new worker fast. It's that we try to
> drop the slot right after worker process was killed but if the code that
> clears slot's active_pid takes too long, it still looks like it's being
> used. I am quite sure it's possible to make this happen for physical
> replication as well when using slots.
>
> This is not something that can be solved by locking on subscriber. ISTM
> we need to make pg_drop_replication_slot behave more nicely, like making
> it wait for the slot to become available (either by default or as an
> option).
>
> I'll have to think about how to do it without rewriting half of
> replication slots or reimplementing lock queue though because the
> replication slots don't use normal catalog access so there is no object
> locking with wait queue. We could use some latch wait with small timeout
> but that seems ugly as that function can be called by user without
> having dropped the slot before so the wait can be quite long (as in
> "forever").
>

Naive fix would be something like attached. But as I said, it's not
exactly pretty.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
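
For readers following the thread, the "wait for the slot to become available" idea can be illustrated with a toy model. This is only a sketch under stated assumptions, not the attached patch and not PostgreSQL code: `Slot`, `SlotBusyError`, `release`, `drop` and `demo` are hypothetical names, and a Python condition variable with an optional timeout stands in for the latch wait mentioned above. With `wait=False` the drop fails immediately, as in the buildfarm log; with `wait=True` it blocks until the holder clears `active_pid` or the timeout expires.

```python
import threading


class SlotBusyError(Exception):
    """Raised when the slot is still held and we cannot (or may not) wait longer."""


class Slot:
    """Toy stand-in for a replication slot (hypothetical, not PostgreSQL's struct)."""

    def __init__(self, name, active_pid=None):
        self.name = name
        self.active_pid = active_pid
        self.dropped = False
        self._cond = threading.Condition()

    def release(self):
        """Called by the exiting worker: clear active_pid and wake any waiters."""
        with self._cond:
            self.active_pid = None
            self._cond.notify_all()

    def drop(self, wait=True, timeout=None):
        """Drop the slot, optionally waiting for the current holder to let go.

        timeout=None means wait indefinitely, which is the "forever" case
        the message above worries about.
        """
        with self._cond:
            if wait and self.active_pid is not None:
                # wait_for re-checks the predicate after every wakeup.
                self._cond.wait_for(lambda: self.active_pid is None, timeout)
            if self.active_pid is not None:
                raise SlotBusyError(
                    f'replication slot "{self.name}" is active for PID {self.active_pid}'
                )
            self.dropped = True


def demo():
    slot = Slot("tap_sub", active_pid=3866790)
    # Simulate a worker that is slow to clear active_pid after being killed.
    threading.Timer(0.2, slot.release).start()

    try:
        slot.drop(wait=False)  # the behaviour seen in the failing test
        first_error = None
    except SlotBusyError as exc:
        first_error = str(exc)

    slot.drop(wait=True, timeout=5.0)  # waits for release(), then succeeds
    return first_error, slot.dropped
```

The sketch sidesteps the hard part raised in the message: real replication slots have no object lock with a wait queue, so the waiter needs some other way to be woken, and an unbounded wait is visible to any user who calls the function.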