Re: pg_waldump error message fix

From Kyotaro Horiguchi
Subject Re: pg_waldump error message fix
Date
Msg-id 20201214.113451.2231140800354158395.horikyota.ntt@gmail.com
In reply to Re: pg_waldump error message fix  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: pg_waldump error message fix  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
> At Fri, 11 Dec 2020 19:27:31 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> > I looked through all the calls to report_invalid_record() in
> > xlogreader.c and noticed that all but a few in
> > XLogReaderValidatePageHeader() already report an LSN.  Of the calls in
> > XLogReaderValidatePageHeader() that don't report an LSN, it looks like
> > most still report a position, and the remaining ones are for "WAL file
> > is from different database system...," which IIUC generally happens on
> > the first page of the segment.

Apart from this issue, while checking that, I noticed that if the
server starts up with WAL files from a server with a different system
identifier, it stops with obscure messages.

> LOG:  database system was shut down at 2020-12-14 10:36:02 JST
> LOG:  invalid primary checkpoint record
> PANIC:  could not locate a valid checkpoint record

The cause is that XLogPageRead() erases the error message set by
XLogReaderValidatePageHeader(). As the comment just above that code
says, this is required to continue replication in a certain situation.
The code aims to let replication continue when the first half of a
continued record has been removed on the primary, so we don't need
that workaround unless we're in standby mode. If we run the workaround
code only while StandbyMode is true, we get the correct error message.

> JST LOG:  database system was shut down at 2020-12-14 10:36:02 JST
> LOG:  WAL file is from different database system: WAL file database system identifier is 6905923817995618754, pg_control database system identifier is 6905924227171453468
> JST LOG:  invalid primary checkpoint record
> JST PANIC:  could not locate a valid checkpoint record
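
To make the mechanism concrete, here is a minimal, self-contained sketch
of the behavior described above. It is a toy model, not the actual
xlog.c or xlogreader.c code: ToyReader, toy_validate_header(),
toy_page_read(), toy_read_record() and the gate_on_standby flag are all
hypothetical names made up for this illustration. The only things taken
from the real code are the idea of an error-message buffer that the
page-read callback resets, and the fact that the reader validates the
page header again afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the reader state; not the real XLogReaderState. */
typedef struct ToyReader
{
    uint64_t expected_sysid;    /* identifier we expect in page headers */
    char     errormsg_buf[128]; /* last validation error, "" if none */
} ToyReader;

/* Models header validation: sets an error message on mismatch. */
static bool
toy_validate_header(ToyReader *r, uint64_t page_sysid)
{
    assert(r != NULL);
    if (page_sysid != r->expected_sysid)
    {
        snprintf(r->errormsg_buf, sizeof(r->errormsg_buf),
                 "WAL file is from different database system");
        return false;
    }
    return true;
}

/*
 * Models the page-read callback. When it rejects a page it resets the
 * error buffer so that the caller can retry quietly.
 *
 * gate_on_standby = false: always validate here (behavior before the patch)
 * gate_on_standby = true:  validate here only in standby mode (the proposal)
 */
static bool
toy_page_read(ToyReader *r, uint64_t page_sysid,
              bool standby_mode, bool gate_on_standby)
{
    bool check_here = gate_on_standby ? standby_mode : true;

    if (check_here && !toy_validate_header(r, page_sysid))
    {
        /* reset any error the validation might have set */
        r->errormsg_buf[0] = '\0';
        return false;
    }
    return true;
}

/*
 * Models the reader: fetch the page through the callback, then validate
 * the header itself. Only this second validation leaves a message behind.
 */
static bool
toy_read_record(ToyReader *r, uint64_t page_sysid,
                bool standby_mode, bool gate_on_standby)
{
    assert(r != NULL);
    r->errormsg_buf[0] = '\0';
    if (!toy_page_read(r, page_sysid, standby_mode, gate_on_standby))
        return false;           /* failed, message already erased */
    return toy_validate_header(r, page_sysid);
}
```

In this model, calling toy_read_record() with a mismatching identifier,
standby_mode = false and gate_on_standby = false fails with an empty
errormsg_buf, which corresponds to the obscure output shown first. With
gate_on_standby = true the same call still fails, but the "WAL file is
from different database system" message survives for the caller to
report, while standby mode keeps the quiet retry behavior.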

I confirmed that 0668719801 still works in the intended context using
the steps shown in [1].


[1]:
https://www.postgresql.org/message-id/flat/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From d54531aa2774bad7e426cc16691553fbc8f0b3b3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 14 Dec 2020 11:18:08 +0900
Subject: [PATCH] Don't cancel invalid-page-header error in unwanted situation

Commit 0668719801 is intended to work during streaming replication,
but it cancels the error message regardless of the context. As a
result, ReadRecord fails to show the correct error message even when
it is required, that is, when not in replication.  Allowing the
cancellation to happen only in standby mode fixes that.
---
 src/backend/access/transam/xlog.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7e81ce4f17..770902518d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12055,7 +12055,8 @@ retry:
      * Validating the page header is cheap enough that doing it twice
      * shouldn't be a big deal from a performance point of view.
      */
-    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+    if (StandbyMode &&
+        !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
     {
         /* reset any error XLogReaderValidatePageHeader() might have set */
         xlogreader->errormsg_buf[0] = '\0';
-- 
2.27.0

