Re: persist logical slots to disk during shutdown checkpoint

Поиск
Список
Период
Сортировка
От Ashutosh Bapat
Тема Re: persist logical slots to disk during shutdown checkpoint
Дата
Msg-id CAExHW5snwhNn7nn4gwAZKjbTTLwa3UbPiGP40MCGzZtaB8EsPw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: persist logical slots to disk during shutdown checkpoint  (Amit Kapila <amit.kapila16@gmail.com>)
Ответы Re: persist logical slots to disk during shutdown checkpoint  (Amit Kapila <amit.kapila16@gmail.com>)
Список pgsql-hackers
On Thu, Aug 31, 2023 at 7:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 31, 2023 at 6:12 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Thu, Aug 31, 2023 at 2:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > > I
> > > > > > think we should shut down subscriber, restart publisher and then make this
> > > > > > check based on the contents of the replication slot instead of server log.
> > > > > > Shutting down subscriber will ensure that the subscriber won't send any new
> > > > > > confirmed flush location to the publisher after restart.
> > > > > >
> > > > >
> > > > > But if we shutdown the subscriber before the publisher there is no
> > > > > guarantee that the publisher has sent all outstanding logs up to the
> > > > > shutdown checkpoint record (i.e., the latest record). Such a guarantee
> > > > > can only be there if we do a clean shutdown of the publisher before
> > > > > the subscriber.
> > > >
> > > > So the sequence is shutdown publisher node, shutdown subscriber node,
> > > > start publisher node and carry out the checks.
> > > >
> > >
> > > This can probably work but I still prefer the current approach as that
> > > will be closer to the ideal values on the disk instead of comparison
> > > with a later in-memory value of confirmed_flush LSN. Ideally, if we
> > > would have a tool like pg_replslotdata which can read the on-disk
> > > state of slots that would be better but missing that, the current one
> > > sounds like the next best possibility. Do you see any problem with the
> > > current approach of test?
> >
> > > +  qr/Streaming transactions committing after ([A-F0-9]+\/[A-F0-9]+),
> > > reading WAL from ([A-F0-9]+\/[A-F0-9]+)./
> >
> > I don't think the LSN reported in this message is guaranteed to be the
> > confirmed_flush LSN of the slot. It's usually confirmed_flush but not
> > always. It's the LSN that snapshot builder computes based on factors
> > including confirmed_flush. There's a chance that this test will fail
> > sometimes because  of this behaviour.
> >
>
> I think I am missing something here because as per my understanding,
> the LOG referred by the test is generated in CreateDecodingContext()
> before which we shouldn't be changing the slot's confirmed_flush LSN.
> The LOG [1] refers to the slot's persistent value for confirmed_flush,
> so how it could be different from what the test is expecting.
>
> [1]
> errdetail("Streaming transactions committing after %X/%X, reading WAL
> from %X/%X.",
>    LSN_FORMAT_ARGS(slot->data.confirmed_flush),

I was afraid that we may move confirmed_flush while creating the
snapshot builder when creating the decoding context. But I don't see
any code doing that. So may be we are safe. But if the log message
changes, this test would fail - depending upon the log message looks a
bit fragile, esp. when we have a way to access the data directly
reliably.

--
Best Wishes,
Ashutosh Bapat



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: [PoC] pg_upgrade: allow to upgrade publisher node
Следующее
От: jian he
Дата:
Сообщение: Re: Incremental View Maintenance, take 2