Обсуждение: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

Поиск
Список
Период
Сортировка

[HACKERS] src/test/subscription/t/005_encoding.pl is broken

От
Tom Lane
Дата:
To reproduce the subscription-startup hang that Thomas Munro observed,
I changed src/backend/replication/logical/launcher.c like this:

@@ -427,7 +427,8 @@ retry:    bgw.bgw_notify_pid = MyProcPid;    bgw.bgw_main_arg = Int32GetDatum(slot);
-    if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
+    if (random() < 1000000000 ||
+        !RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))    {        /* Failed to start worker, so clean up the
workerslot. */        LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
 

This causes about 50% of worker launch requests to fail.
With the fix I just committed, 002_types.pl gets through fine,
but 005_encoding.pl does not; it sometimes fails like this:

t/005_encoding.pl ..... 1/1 
#   Failed test 'data replicated to subscriber'
#   at t/005_encoding.pl line 49.
#          got: ''
#     expected: '1'
# Looks like you failed 1 test of 1.
t/005_encoding.pl ..... Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests 

The reason seems to be that its method of waiting for replication
to happen is completely inapropos.  It's watching for the master
to say that the slave has received all the WAL, but that does not
ensure that the logicalrep apply workers have caught up, does it?
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

От
Andres Freund
Дата:
Hi,

On 2017-09-18 11:50:06 -0400, Tom Lane wrote:
> The reason seems to be that its method of waiting for replication
> to happen is completely inapropos.  It's watching for the master
> to say that the slave has received all the WAL, but that does not
> ensure that the logicalrep apply workers have caught up, does it?

To my knowledge here's not really any difference between the two in
logical replication. Received changes are immediately applied, there's
no equivalent to a walreceiver queing up "logical wal" onto disk.

So I'm not sure that theory holds.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2017-09-18 11:50:06 -0400, Tom Lane wrote:
>> The reason seems to be that its method of waiting for replication
>> to happen is completely inapropos.  It's watching for the master
>> to say that the slave has received all the WAL, but that does not
>> ensure that the logicalrep apply workers have caught up, does it?

> To my knowledge here's not really any difference between the two in
> logical replication. Received changes are immediately applied, there's
> no equivalent to a walreceiver queing up "logical wal" onto disk.

> So I'm not sure that theory holds.

Well, there's *something* wrong with this test's wait method.
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

От
Robert Haas
Дата:
On Mon, Sep 18, 2017 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
> To my knowledge here's not really any difference between the two in
> logical replication. Received changes are immediately applied, there's
> no equivalent to a walreceiver queing up "logical wal" onto disk.

Huh?  Decoding and applying the changes has to take some finite time
greater than 0.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

От
Michael Paquier
Дата:
On Tue, Sep 19, 2017 at 8:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Sep 18, 2017 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
>> To my knowledge here's not really any difference between the two in
>> logical replication. Received changes are immediately applied, there's
>> no equivalent to a walreceiver queing up "logical wal" onto disk.
>
> Huh?  Decoding and applying the changes has to take some finite time
> greater than 0.

FWIW, the wait method looks fine to me.

Now, I just had a look at the logs for a failure and a success, and
one difference can be seen in the subscriber's logs as follows:
-LOG:  logical replication table synchronization worker for
subscription "mysub", table "test1" has started
-LOG:  logical replication table synchronization worker for
subscription "mysub", table "test1" has finished
+WARNING:  out of background worker slots
+HINT:  You might need to increase max_worker_processes.
The "+" portion is for a failure, and I think that this causes the
subscription to not consume the changes from the publisher which
explains the failure in the test as the logical worker applying the
changes on the subscriber-side is not here.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

От
Tom Lane
Дата:
Michael Paquier <michael.paquier@gmail.com> writes:
> Now, I just had a look at the logs for a failure and a success, and
> one difference can be seen in the subscriber's logs as follows:
> -LOG:  logical replication table synchronization worker for
> subscription "mysub", table "test1" has started
> -LOG:  logical replication table synchronization worker for
> subscription "mysub", table "test1" has finished
> +WARNING:  out of background worker slots
> +HINT:  You might need to increase max_worker_processes.
> The "+" portion is for a failure, and I think that this causes the
> subscription to not consume the changes from the publisher which
> explains the failure in the test as the logical worker applying the
> changes on the subscriber-side is not here.

That would indicate that something isn't ever retrying the worker
start; but if that's the case, how is it that we get through the
other subscription tests with my random-failure patch in place?
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

От
Michael Paquier
Дата:
On Wed, Sep 20, 2017 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That would indicate that something isn't ever retrying the worker
> start; but if that's the case, how is it that we get through the
> other subscription tests with my random-failure patch in place?

I have been able to dig into this issue further, and the problem is
indeed in the wait logic of 005_encoding.pl. It is important to wait
for the initial sync of the subscriber to happen. There is no need to
incorporate the additional wait query in wait_for_caught_up() as well.
Please see the attached which fixes the stability problems for me even
after forcing failures in launcher.c.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] src/test/subscription/t/005_encoding.pl is broken

От
Tom Lane
Дата:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Wed, Sep 20, 2017 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That would indicate that something isn't ever retrying the worker
>> start; but if that's the case, how is it that we get through the
>> other subscription tests with my random-failure patch in place?

> I have been able to dig into this issue further, and the problem is
> indeed in the wait logic of 005_encoding.pl. It is important to wait
> for the initial sync of the subscriber to happen. There is no need to
> incorporate the additional wait query in wait_for_caught_up() as well.
> Please see the attached which fixes the stability problems for me even
> after forcing failures in launcher.c.

Ah, that makes sense.  Pushed.
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers