Re: [HACKERS] logical replication - still unstable after all these months
| From | Mark Kirkwood |
|---|---|
| Subject | Re: [HACKERS] logical replication - still unstable after all these months |
| Date | |
| Msg-id | 389d0619-b35d-b349-5303-c82723dfdf84@catalyst.net.nz |
| In reply to | Re: [HACKERS] logical replication - still unstable after all these months (Petr Jelinek <petr.jelinek@2ndquadrant.com>) |
| Responses | Re: [HACKERS] logical replication - still unstable after all these months |
| List | pgsql-hackers |
On 31/05/17 21:16, Petr Jelinek wrote:
> On 29/05/17 23:06, Mark Kirkwood wrote:
>> On 29/05/17 23:14, Petr Jelinek wrote:
>>
>>> On 29/05/17 03:33, Jeff Janes wrote:
>>>
>>>> What would you want to look at? Would saving the WAL from the master
>>>> be helpful?
>>>>
>>> Useful info is: logs from the provider (mainly the logical decoding
>>> log lines that mention LSNs), logs from the subscriber (the lines
>>> about when sync workers finished), the contents of pg_subscription_rel
>>> (with srrelid cast to regclass so we know which table is which), and
>>> pg_waldump output around the LSNs found in the logs and in
>>> pg_subscription_rel (plus a few lines before and some after, for
>>> context). It's enough to only care about LSNs for the table(s) that
>>> are out of sync.
>>>
>> I have a run that aborted with a failure (accounts table md5 mismatch).
>> Petr - would you like to have access to the machine? If so, send me
>> your public key and I'll set it up.
>
> Thanks to Mark's offer I was able to study the issue as it happened and
> found the cause of this.
>
> The busy loop in apply stops at the point when the worker shmem state
> indicates that table synchronization has finished, but that might not be
> visible in the next transaction if it takes long to flush the final
> commit to disk, so we might ignore a couple of transactions for a given
> table in the main apply because we think it's still being synchronized.
> This also explains why I could not reproduce it on my testing machine
> (fast SSD disk array where flushes were always fast) and why it happens
> relatively rarely: it's one specific commit during the whole
> synchronization process that needs to be slow.
>
> So as a solution I changed the busy loop in the apply to wait for the
> in-catalog status rather than the in-memory status, to make sure things
> are really there and flushed.
>
> While working on this I realized that the handover itself is a bit more
> complex than necessary (especially for debugging and for other people
> trying to understand it), so I made some small changes as part of this
> patch so that the sequence of states a table goes through during the
> synchronization process is always the same. This might cause one
> unnecessary update per table synchronization in some cases, but that
> seems a small enough price to pay for clearer logic. It also fixes
> another potential bug I identified where we might write the wrong state
> to the catalog if the main apply crashed while the sync worker was
> waiting for a status update.
>
> I've been running tests on this overnight on another machine where I
> was able to reproduce the original issue within a few runs (once I found
> what causes it), and so far it looks good.
>

I'm seeing a new failure with the patch applied - this time the history
table has missing rows.

Petr, I'll put back your access :-)

regards

Mark
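
For reference, the pg_subscription_rel check Petr describes above boils down to a query along these lines on the subscriber (a sketch against the PostgreSQL 10 catalog; only the regclass cast comes from the thread, the rest of the query shape is illustrative):

```sql
-- Per-table sync state on the subscriber. srrelid::regclass makes the
-- relation name readable, and srsublsn is the LSN to cross-check against
-- the provider's logical decoding logs and the WAL.
SELECT srsubid,
       srrelid::regclass AS relation,
       srsubstate,        -- 'i' init, 'd' data copy, 's' synced, 'r' ready
       srsublsn
FROM pg_subscription_rel
ORDER BY srrelid::regclass::text;
```

The LSNs in srsublsn (and the ones mentioned in the provider's logical decoding log lines) can then be handed to pg_waldump via its -s/--start and -e/--end options to dump the surrounding WAL records for the out-of-sync table.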
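
The "accounts table md5 mismatch" kind of failure can be checked with a content checksum run on both provider and subscriber; the exact test script is not shown in the thread, so the following is only an illustration that assumes pgbench-style table and column names:

```sql
-- Run on both provider and subscriber and compare the output; a differing
-- md5 means the replicated copy has missing or divergent rows.
SELECT md5(string_agg(t::text, ',' ORDER BY aid))
FROM pgbench_accounts AS t;
```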