Re: Infinite loop in XLogPageRead() on standby
From | Alexander Kukushkin |
---|---|
Subject | Re: Infinite loop in XLogPageRead() on standby |
Date | |
Msg-id | CAFh8B=nPSERv7NyYHmjVXK4xK3va1XzU3-rhOswjgEZMWkV=RQ@mail.gmail.com |
In response to | Re: Infinite loop in XLogPageRead() on standby (Michael Paquier <michael@paquier.xyz>) |
List | pgsql-hackers |
Hi Michael,
On Thu, 29 Feb 2024 at 06:05, Michael Paquier <michael@paquier.xyz> wrote:
> Wow. Have you seen that in an actual production environment?
Yes, we see it regularly, and it is reproducible in test environments as well.
> my $start_page = start_of_page($end_lsn);
> my $wal_file = write_wal($primary, $TLI, $start_page,
>     "\x00" x $WAL_BLOCK_SIZE);
> # copy the file we just "hacked" to the archive
> copy($wal_file, $primary->archive_dir);
>
> So you are emulating a failure by filling with zeros the second page
> where the last emit_message() generated a record, and the page before
> that includes the continuation record. Then abuse of WAL archiving to
> force the replay of the last record. That's kind of cool.
Right. At this point, that is easier than causing an artificial crash on the primary right after it has finished writing just one page.
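For readers following along, the page arithmetic behind the quoted snippet can be sketched roughly as below. The bodies of start_of_page() and write_wal() are not shown in this message, so this is only a guess at the former: it assumes the LSN is handled as a plain integer byte position and that the WAL block size is the default 8192 bytes. zero_page() is a hypothetical helper name introduced here for illustration.

```perl
use strict;
use warnings;

# Assumption: default WAL block size (XLOG_BLCKSZ) of 8192 bytes.
my $WAL_BLOCK_SIZE = 8192;

# Hypothetical stand-in for the test's start_of_page() helper:
# round an LSN, taken as an integer byte position, down to the
# start of the WAL page that contains it.
sub start_of_page
{
	my ($lsn) = @_;
	return $lsn - ($lsn % $WAL_BLOCK_SIZE);
}

# Hypothetical helper: the payload used to overwrite one whole
# WAL page with zeros, as in the quoted write_wal() call.
sub zero_page
{
	return "\x00" x $WAL_BLOCK_SIZE;
}
```

With these assumptions, an LSN such as 0x3000128 maps to page start 0x3000000, and the zeroed payload covers exactly one page from that offset.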
> > To be honest, I don't know yet how to fix it nicely. I am thinking about
> > returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
> > timeline while trying to read a page and if this page is invalid.
>
> Hmm. I suspect that you may be right on a TLI change when reading a
> page. There are a bunch of side cases with continuation records and
> header validation around XLogReaderValidatePageHeader(). Perhaps you
> have an idea of patch to show your point?
Not yet, but hopefully I will get something done next week.
> Nit. In your test, it seems to me that you should not call directly
> set_standby_mode and enable_restoring, just rely on has_restoring with
> the standby option included.
Thanks, I'll look into it.
Regards,
--
Alexander Kukushkin