Re: [HACKERS] logical replication - still unstable after all these months
| From | Mark Kirkwood |
|---|---|
| Subject | Re: [HACKERS] logical replication - still unstable after all these months |
| Date | |
| Msg-id | 389d0619-b35d-b349-5303-c82723dfdf84@catalyst.net.nz |
| In reply to | Re: [HACKERS] logical replication - still unstable after all these months (Petr Jelinek <petr.jelinek@2ndquadrant.com>) |
| Responses | Re: [HACKERS] logical replication - still unstable after all these months |
| List | pgsql-hackers |
On 31/05/17 21:16, Petr Jelinek wrote:
> On 29/05/17 23:06, Mark Kirkwood wrote:
>> On 29/05/17 23:14, Petr Jelinek wrote:
>>
>>> On 29/05/17 03:33, Jeff Janes wrote:
>>>
>>>> What would you want to look at? Would saving the WAL from the master
>>>> be helpful?
>>>>
>>> Useful info is: logs from the provider (mainly the logical decoding
>>> log lines that mention LSNs), logs from the subscriber (the lines
>>> about when sync workers finished), the contents of pg_subscription_rel
>>> (with srrelid cast to regclass so we know which table is which), and
>>> pg_waldump output around the LSNs found in the logs and in
>>> pg_subscription_rel (plus a few lines before and some after, for
>>> context). It's enough to only care about LSNs for the table(s) that
>>> are out of sync.
>>>
>> I have a run that aborted with a failure (accounts table md5 mismatch).
>> Petr - would you like to have access to the machine? If so, send me
>> your public key and I'll set it up.
>
> Thanks to Mark's offer I was able to study the issue as it happened and
> found the cause of this.
>
> The busy loop in apply stops at the point when the worker shmem state
> indicates that table synchronization has finished, but that might not be
> visible in the next transaction if it takes long to flush the final
> commit to disk, so we might ignore a couple of transactions for a given
> table in the main apply because we think it's still being synchronized.
> This also explains why I could not reproduce it on my testing machine
> (fast SSD disk array where flushes were always fast) and why it happens
> relatively rarely: it's one specific commit during the whole
> synchronization process that needs to be slow.
>
> So as a solution I changed the busy loop in the apply to wait for the
> in-catalog status rather than the in-memory status, to make sure things
> are really there and flushed.
>
> While working on this I realized that the handover itself is a bit more
> complex than necessary (especially for debugging and for other people
> trying to understand it), so I made some small changes as part of this
> patch so that the sequence of states a table goes through during the
> synchronization process is always the same. This might cause one
> unnecessary update per table synchronization in some cases, but that
> seems a small enough price to pay for clearer logic. It also fixes
> another potential bug I identified where we might write the wrong state
> to the catalog if the main apply crashed while the sync worker was
> waiting for a status update.
>
> I've been running tests on this overnight on another machine where I
> was able to reproduce the original issue within a few runs (once I found
> what causes it), and so far it looks good.
>

I'm seeing a new failure with the patch applied - this time the history
table has missing rows.

Petr, I'll put back your access :-)

regards

Mark
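
For reference, the pg_subscription_rel check Petr describes above boils down to a query along these lines on the subscriber (a sketch against the PostgreSQL 10 catalog; only the regclass cast comes from the thread, the rest of the query shape is illustrative):

```sql
-- Per-table sync state on the subscriber. srrelid::regclass makes the
-- relation name readable, and srsublsn is the LSN to cross-check against
-- the provider's logical decoding logs and the WAL.
SELECT srsubid,
       srrelid::regclass AS relation,
       srsubstate,        -- 'i' init, 'd' data copy, 's' synced, 'r' ready
       srsublsn
FROM pg_subscription_rel
ORDER BY srrelid::regclass::text;
```

The LSNs in srsublsn (and the ones mentioned in the provider's logical decoding log lines) can then be handed to pg_waldump via its -s/--start and -e/--end options to dump the surrounding WAL records for the out-of-sync table.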
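
The "accounts table md5 mismatch" kind of failure can be checked with a content checksum run on both provider and subscriber; the exact test script is not shown in the thread, so the following is only an illustration that assumes pgbench-style table and column names:

```sql
-- Run on both provider and subscriber and compare the output; a differing
-- md5 means the replicated copy has missing or divergent rows.
SELECT md5(string_agg(t::text, ',' ORDER BY aid))
FROM pgbench_accounts AS t;
```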