Обсуждение: [HACKERS] src/test/subscription/t/005_encoding.pl is broken
To reproduce the subscription-startup hang that Thomas Munro observed, I changed src/backend/replication/logical/launcher.c like this: @@ -427,7 +427,8 @@ retry: bgw.bgw_notify_pid = MyProcPid; bgw.bgw_main_arg = Int32GetDatum(slot); - if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle)) + if (random() < 1000000000 || + !RegisterDynamicBackgroundWorker(&bgw, &bgw_handle)) { /* Failed to start worker, so clean up the workerslot. */ LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE); This causes about 50% of worker launch requests to fail. With the fix I just committed, 002_types.pl gets through fine, but 005_encoding.pl does not; it sometimes fails like this: t/005_encoding.pl ..... 1/1 # Failed test 'data replicated to subscriber' # at t/005_encoding.pl line 49. # got: '' # expected: '1' # Looks like you failed 1 test of 1. t/005_encoding.pl ..... Dubious, test returned 1 (wstat 256, 0x100) Failed 1/1 subtests The reason seems to be that its method of waiting for replication to happen is completely inapropos. It's watching for the master to say that the slave has received all the WAL, but that does not ensure that the logicalrep apply workers have caught up, does it? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Hi, On 2017-09-18 11:50:06 -0400, Tom Lane wrote: > The reason seems to be that its method of waiting for replication > to happen is completely inapropos. It's watching for the master > to say that the slave has received all the WAL, but that does not > ensure that the logicalrep apply workers have caught up, does it? To my knowledge here's not really any difference between the two in logical replication. Received changes are immediately applied, there's no equivalent to a walreceiver queing up "logical wal" onto disk. So I'm not sure that theory holds. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund <andres@anarazel.de> writes: > On 2017-09-18 11:50:06 -0400, Tom Lane wrote: >> The reason seems to be that its method of waiting for replication >> to happen is completely inapropos. It's watching for the master >> to say that the slave has received all the WAL, but that does not >> ensure that the logicalrep apply workers have caught up, does it? > To my knowledge here's not really any difference between the two in > logical replication. Received changes are immediately applied, there's > no equivalent to a walreceiver queing up "logical wal" onto disk. > So I'm not sure that theory holds. Well, there's *something* wrong with this test's wait method. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Sep 18, 2017 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote: > To my knowledge here's not really any difference between the two in > logical replication. Received changes are immediately applied, there's > no equivalent to a walreceiver queing up "logical wal" onto disk. Huh? Decoding and applying the changes has to take some finite time greater than 0. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 19, 2017 at 8:51 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Sep 18, 2017 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote: >> To my knowledge here's not really any difference between the two in >> logical replication. Received changes are immediately applied, there's >> no equivalent to a walreceiver queing up "logical wal" onto disk. > > Huh? Decoding and applying the changes has to take some finite time > greater than 0. FWIW, the wait method looks fine to me. Now, I just had a look at the logs for a failure and a success, and one difference can be seen in the subscriber's logs as follows: -LOG: logical replication table synchronization worker for subscription "mysub", table "test1" has started -LOG: logical replication table synchronization worker for subscription "mysub", table "test1" has finished +WARNING: out of background worker slots +HINT: You might need to increase max_worker_processes. The "+" portion is for a failure, and I think that this causes the subscription to not consume the changes from the publisher which explains the failure in the test as the logical worker applying the changes on the subscriber-side is not here. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Michael Paquier <michael.paquier@gmail.com> writes: > Now, I just had a look at the logs for a failure and a success, and > one difference can be seen in the subscriber's logs as follows: > -LOG: logical replication table synchronization worker for > subscription "mysub", table "test1" has started > -LOG: logical replication table synchronization worker for > subscription "mysub", table "test1" has finished > +WARNING: out of background worker slots > +HINT: You might need to increase max_worker_processes. > The "+" portion is for a failure, and I think that this causes the > subscription to not consume the changes from the publisher which > explains the failure in the test as the logical worker applying the > changes on the subscriber-side is not here. That would indicate that something isn't ever retrying the worker start; but if that's the case, how is it that we get through the other subscription tests with my random-failure patch in place? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 20, 2017 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > That would indicate that something isn't ever retrying the worker > start; but if that's the case, how is it that we get through the > other subscription tests with my random-failure patch in place? I have been able to dig into this issue further, and the problem is indeed in the wait logic of 005_encoding.pl. It is important to wait for the initial sync of the subscriber to happen. There is no need to incorporate the additional wait query in wait_for_caught_up() as well. Please see the attached which fixes the stability problems for me even after forcing failures in launcher.c. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Michael Paquier <michael.paquier@gmail.com> writes: > On Wed, Sep 20, 2017 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> That would indicate that something isn't ever retrying the worker >> start; but if that's the case, how is it that we get through the >> other subscription tests with my random-failure patch in place? > I have been able to dig into this issue further, and the problem is > indeed in the wait logic of 005_encoding.pl. It is important to wait > for the initial sync of the subscriber to happen. There is no need to > incorporate the additional wait query in wait_for_caught_up() as well. > Please see the attached which fixes the stability problems for me even > after forcing failures in launcher.c. Ah, that makes sense. Pushed. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers