Re: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted
From | Kyotaro Horiguchi |
---|---|
Subject | Re: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted |
Date | |
Msg-id | 20210329.113457.933340007488906032.horikyota.ntt@gmail.com |
In response to | RE: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted ("egashira.yusuke@fujitsu.com" <egashira.yusuke@fujitsu.com>) |
Responses | RE: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted |
List | pgsql-bugs |
Hello.

(Mmm. Sorry for the noise, but I added some people to Cc:)

This is the same issue as the one discussed in [1] and recently reported in [2].

[1] https://www.postgresql.org/message-id/E63E5670-6CC3-4B09-9686-A77CF94FE4A8%40amazon.com
[2] https://www.postgresql.org/message-id/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com

At Thu, 25 Mar 2021 00:23:52 +0000, "egashira.yusuke@fujitsu.com" <egashira.yusuke@fujitsu.com> wrote in
> > The replication between "NODE-A" and "NODE-B" is synchronous replication,
> > and between "NODE-B" and "NODE-C" is asynchronous.
> >
> > "NODE-A" <-[synchronous]-> "NODE-B" <-[non-synchronous]-> "NODE-C"
> >
> > When the primary server "NODE-A" crashed due to full WAL storage and
> > "NODE-B" promoted, the downstream standby server "NODE-C" aborted with
> > following messages.
> >
> > 2021-03-11 11:26:28.470 JST [85228] LOG: invalid contrecord length 26 at
> > 0/5FFFFF0
> > 2021-03-11 11:26:28.470 JST [85232] FATAL: terminating walreceiver process
> > due to administrator command
> > 2021-03-11 11:26:28.470 JST [85228] PANIC: could not open file
> > "pg_wal/000000020000000000000005": No such file or directory
> > 2021-03-11 11:26:28.492 JST [85260] LOG: started streaming WAL from primary
> > at 0/5000000 on timeline 2
> > 2021-03-11 11:26:29.260 JST [85227] LOG: startup process (PID 85228) was
> > terminated by signal 6: Aborted
>
> I would like to clarify the conditions under which this "abort" occurred to explain to the customer.
>
> By the result of pg_waldump, I think that the conditions are followings.
>
> 1) A partially written (across the following segment files) WAL record is recorded at the end of the WAL segment file, and
> 2) The WAL segment file of 1) is the last WAL segment file that standby server received, and
> 3) The standby server promoted.
>
> I think that the above conditions will be met only when the primary server crashed due to full WAL storage.
>
> Is my idea correct?
The diagnosis looks correct to me, but the cause of the crash is irrelevant. A full disk just makes the crash land exactly on the critical point.

--
Kyotaro Horiguchi
NTT Open Source Software Center
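
For readers following the log, here is a minimal sketch of how the LSN and the file name in the quoted messages relate. It assumes the default 16 MB `wal_segment_size` (the thread does not state the configured value), and the helper names are hypothetical, not PostgreSQL APIs. It maps the position 0/5FFFFF0 on timeline 2 to a WAL segment file name and shows how many bytes remain in that segment after that position.

```python
# Assumption: default wal_segment_size of 16 MB; the thread does not say
# what the affected servers were actually configured with.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text):
    """Convert an LSN written as 'X/Y' (two hex numbers) to an integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline, lsn, seg_size=WAL_SEGMENT_SIZE):
    """Build the 24-hex-digit name of the segment file containing lsn."""
    segno = lsn // seg_size
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


def bytes_left_in_segment(lsn, seg_size=WAL_SEGMENT_SIZE):
    """Number of bytes from lsn up to the end of its segment."""
    return seg_size - (lsn % seg_size)


lsn = parse_lsn("0/5FFFFF0")
print(wal_file_name(2, lsn))       # 000000020000000000000005
print(bytes_left_in_segment(lsn))  # 16
```

Under that assumption, the position in the "invalid contrecord length" message lies 16 bytes before the end of the segment whose timeline-2 file name appears in the PANIC message, so anything longer than 16 bytes that starts there has to continue in the following segment file.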