Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
From | Amit Kapila |
---|---|
Subject | Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns |
Date | |
Msg-id | CAA4eK1LBtv6ayE+TvCcPmC-xse=DVg=SmbyQD1nv_AaqcpUJEg@mail.gmail.com |
In response to | Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
|
List | pgsql-hackers |
On Fri, Jul 29, 2022 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > Yeah, your description makes sense to me. I've also considered how to
> > hit this path but I guess it is never hit. Thinking of it in another
> > way, first of all, at least 2 catalog modifying transactions have to
> > be running while writing a xl_running_xacts. The serialized snapshot
> > that is written when we decode the first xl_running_xact has two
> > transactions. Then, one of them is committed before the second
> > xl_running_xacts. The second serialized snapshot has only one
> > transaction. Then, the transaction is also committed after that. Now,
> > in order to execute the path, we need to start decoding from the first
> > serialized snapshot. However, if we start from there, we cannot decode
> > the full contents of the transaction that was committed later.
> >
>
> I think then we should change this code in the master branch patch
> with an additional comment on the lines of: "Either all the xacts got
> purged or none. It is only possible to partially remove the xids from
> this array if one or more of the xids are still running but not all.
> That can happen if we start decoding from a point (LSN where the
> snapshot state became consistent) where all the xacts in this were
> running and then at least one of those got committed and a few are
> still running. We will never start from such a point because we won't
> move the slot's restart_lsn past the point where the oldest running
> transaction's restart_decoding_lsn is."
>

Unfortunately, this theory doesn't turn out to be true. While
investigating the latest buildfarm failure [1], I see that it is
possible that only part of the xacts in the restored catalog modifying
xacts list needs to be purged. See the attached where I have
demonstrated it via a reproducible test.
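To make the partial-purge case concrete, here is a minimal sketch in Python. It is not the PostgreSQL code: the function name `purge_older_xids`, the xid values, and the plain integer comparison (which ignores xid wraparound) are all assumptions made for illustration. It only shows that, for a sorted list of restored catalog modifying xids, a purge against the oldest running xid can remove all, none, or just a prefix of the list.

```python
from bisect import bisect_left


def purge_older_xids(catchange_xids, oldest_running_xid):
    """Drop xids that precede oldest_running_xid from a sorted list.

    Hypothetical helper for illustration only; it uses plain integer
    ordering and therefore ignores xid wraparound.
    """
    keep_from = bisect_left(catchange_xids, oldest_running_xid)
    return catchange_xids[keep_from:]


# Made-up xids: two catalog modifying xacts restored from a serialized snapshot.
restored = [740, 742]

# Xact 740 has committed; a later running_xacts record reports 742 as the
# oldest running xid, so only part of the list is purged.
after_purge = purge_older_xids(restored, 742)
print(after_purge)
```

In this sketch the "all or none" assumption corresponds to the result being either empty or identical to the input; the case discussed here is the third outcome, where a non-empty proper suffix survives.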
It seems the point we were missing is that, to start decoding from a
point where two or more catalog modifying xacts were serialized, there
must be another open transaction before both of them get committed, and
then we need a checkpoint or some other way to force a running_xacts
record in between the commits of the initial two catalog modifying
xacts. There could possibly be other ways as well, but the theory above
wasn't correct.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2022-08-25%2004%3A15%3A34

--
With Regards,
Amit Kapila.