Re: Skipping logical replication transactions on subscriber side

Поиск
Список
Период
Сортировка
От Masahiko Sawada
Тема Re: Skipping logical replication transactions on subscriber side
Дата
Msg-id CAD21AoCTv4OuZTMi+DYNxhNSRe=M6Ap9YENLQwLuKkWe8rWHMw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Skipping logical replication transactions on subscriber side  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Ответы Re: Skipping logical replication transactions on subscriber side  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers
On Tue, May 25, 2021 at 7:21 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, May 25, 2021 at 1:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, May 25, 2021 at 2:49 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > On Mon, May 24, 2021 at 1:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > If a logical replication worker cannot apply the change on the
> > > > subscriber for some reason (e.g., missing table or violating a
> > > > constraint, etc.), logical replication stops until the problem is
> > > > resolved. Ideally, we resolve the problem on the subscriber (e.g., by
> > > > creating the missing table or removing the conflicting data, etc.) but
> > > > occasionally a problem cannot be fixed and it may be necessary to skip
> > > > the entire transaction in question. Currently, we have two ways to
> > > > skip transactions: advancing the LSN of the replication origin on the
> > > > subscriber and advancing the LSN of the replication slot on the
> > > > publisher. But both ways might not be able to skip exactly one
> > > > transaction in question and end up skipping other transactions too.
> > >
> > > Does it mean pg_replication_origin_advance() can't skip exactly one
> > > txn? I'm not familiar with the function or never used it though, I was
> > > just searching for "how to skip a single txn in postgres" and ended up
> > > in [1]. Could you please give some more details on scenarios when we
> > > can't skip exactly one txn? Is there any other way to advance the LSN,
> > > something like directly updating the pg_replication_slots catalog?
> >
> > Sorry, it's not impossible. Although the user mistakenly skips more
> > than one transaction by specifying a wrong LSN it's always possible to
> > skip an exact one transaction.
>
> IIUC, if the user specifies the "correct LSN", then it's possible to
> skip exact txn for which the sync workers are unable to apply changes,
> right?
>
> How can the user get the LSN (which we call "correct LSN")? Is it from
> pg_replication_slots? Or some other way?
>
> If the user somehow can get the "correct LSN", can't the exact txn be
> skipped using it with any of the existing ways, either using
> pg_replication_origin_advance or any other ways?

One possible way I know is to copy the logical replication slot used
by the subscriber and peek at the changes to identify the correct LSN
(maybe there is another handy way though) . For example, suppose that
two transactions insert tuples as follows on the publisher:

TX-A: BEGIN;
TX-A: INSERT INTO test VALUES (1);
TX-B: BEGIN;
TX-B: INSERT INTO test VALUES (10);
TX-B: COMMIT;
TX-A: INSERT INTO test VALUES (2);
TX-A: COMMIT;

And suppose further that the insertion with value = 10 (by TX-A)
cannot be applied only on the subscriber due to unique constraint
violation. If we copy the slot by
pg_copy_logical_replication_slot('test_sub', 'copy_slot', true,
'test_decoding') , we can peek at those changes with LSN as follows:

=# select * from pg_logical_slot_peek_changes('copy', null, null) order by lsn;
    lsn    | xid |                   data
-----------+-----+------------------------------------------
 0/1911548 | 736 | BEGIN 736
 0/1911548 | 736 | table public.hoge: INSERT: c[integer]:1
 0/1911588 | 737 | BEGIN 737
 0/1911588 | 737 | table public.hoge: INSERT: c[integer]:10
 0/19115F8 | 737 | COMMIT 737
 0/1911630 | 736 | table public.hoge: INSERT: c[integer]:2
 0/19116A0 | 736 | COMMIT 736
(7 rows)

In this case, '0/19115F8' is the correct LSN to specify. We can
advance the replication origin to ' 0/19115F8' by
pg_replication_origin_advance() so that logical replication streams
transactions committed after ' 0/19115F8'. After the logical
replication restarting, it skips the transaction with xid = 737 but
replicates the transaction with xid = 736.

> If there's no way to get the "correct LSN", then why can't we just
> print that LSN in the error context and/or in the new statistics view
> for logical replication workers, so that any of the existing ways can
> be used to skip exactly one txn?

I think specifying XID to the subscription is more understandable for users.

>
> IIUC, the feature proposed here guards against the users specifying
> wrong LSN. If I'm right, what is the guarantee that users don't
> specify the wrong txn id? Why can't we tell the users when a wrong LSN
> is specified that "currently, an apply worker is failing to apply the
> LSN XXXX, and you specified LSN YYYY, are you sure this is
> intentional?"

With the initial idea, specifying the correct XID is the user's
responsibility. If they specify an old XID, the worker invalids it and
raises a warning to tell "the worker invalidated the specified XID as
it's too old". As the second idea, if we store the last failed XID
somewhere (e.g., a system catalog), the user can just specify to skip
that transaction. That is, instead of specifying the XID they could do
something like "ALTER SUBSCRIPTION test_sub RESOLVE CONFLICT BY SKIP".

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Dilip Kumar
Дата:
Сообщение: Re: Assertion failure while streaming toasted data
Следующее
От: Masahiko Sawada
Дата:
Сообщение: Re: Transactions involving multiple postgres foreign servers, take 2