Re: BUG #17928: Standby fails to decode WAL on termination of primary
From | Thomas Munro |
---|---|
Subject | Re: BUG #17928: Standby fails to decode WAL on termination of primary |
Date | |
Msg-id | CA+hUKGJf3Hhb2MB88-rW2di2H9XT0xr6-hd6ZjGEwdJs3A=b+Q@mail.gmail.com |
In reply to | Re: BUG #17928: Standby fails to decode WAL on termination of primary (Michael Paquier <michael@paquier.xyz>) |
Replies |
Re: BUG #17928: Standby fails to decode WAL on termination of primary
|
List | pgsql-bugs |
On Sat, Sep 23, 2023 at 4:44 PM Michael Paquier <michael@paquier.xyz> wrote:
> The stack may point out at a different issue, but perhaps this is a
> matter where we're returning now XLREAD_SUCCESS where previously we
> had XLREAD_FAIL, causing this code to fail thinking that the block was
> valid while it's not?

"grison" has a little more detail -- we see pg_comp_crc32c_sb8(len=4294636456). I'm wondering how to reproduce this, but among the questions that jump out I have: why was it ever OK that we load record->xl_tot_len into total_len, perform header validation, determine that total_len < len (= this record is all on one page, no reassembly loop needed, so now we're in the single-page branch), then call ReadPageInternal() again, then call ValidXLogRecord() which internally loads record->xl_tot_len *again*? ReadPageInternal() might have changed xl_tot_len, no? That seems to be a possible pathway to reading past the end of the buffer in the CRC check, no?

If that value didn't change underneath us, I think we'd need an explanation for how we finished up in the single-page branch at xlogreader.c:842 with a large xl_tot_len, which I'm not seeing yet, though it might take more coffee.

(Possibly supporting the re-read theory is the fact that it's only happening on a few very slow computers, though I have no idea why it would only happen on master [so far at least].)
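
To make the suspected re-read hazard concrete, here is a minimal, self-contained C sketch. It is not PostgreSQL code: every `demo_*` name, the 64-byte page, and the way the "re-read" is simulated (overwriting the length field in place) are assumptions made purely for illustration. It shows how a length that was validated before the page is read again can differ from the length that is loaded afterwards and handed to the CRC routine, and one possible guard (compare the reloaded value with the validated one).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical page size for the demo; real WAL pages are much larger. */
#define DEMO_PAGE_SIZE 64u

/* Load the (hypothetical) total-length field stored at the start of the page. */
static uint32_t demo_load_len(const unsigned char *page)
{
    uint32_t len;
    memcpy(&len, page, sizeof(len));
    return len;
}

/* Store a total-length field at the start of the page. */
static void demo_store_len(unsigned char *page, uint32_t len)
{
    memcpy(page, &len, sizeof(len));
}

/*
 * Racy pattern: validate the length, let the page be read again (simulated
 * here by overwriting the field with len_after_reread), then load the length
 * a second time and use that second value. Returns the length a CRC check
 * would be asked to cover, or 0 if the first validation fails.
 */
static uint32_t demo_len_racy(unsigned char *page, uint32_t avail,
                              uint32_t len_after_reread)
{
    assert(avail <= DEMO_PAGE_SIZE);
    uint32_t total_len = demo_load_len(page);
    if (total_len > avail)
        return 0;                           /* would take another branch */
    demo_store_len(page, len_after_reread); /* stand-in for the re-read */
    return demo_load_len(page);             /* second, unvalidated load */
}

/*
 * Guarded pattern: after the simulated re-read, refuse to continue if the
 * field no longer matches the value that was validated.
 */
static bool demo_len_checked(unsigned char *page, uint32_t avail,
                             uint32_t len_after_reread, uint32_t *crc_len)
{
    assert(avail <= DEMO_PAGE_SIZE);
    uint32_t total_len = demo_load_len(page);
    if (total_len > avail)
        return false;
    demo_store_len(page, len_after_reread); /* stand-in for the re-read */
    if (demo_load_len(page) != total_len)
        return false;                       /* header changed underneath us */
    *crc_len = total_len;
    return true;
}
```

In the racy variant, a header that passed validation with a 32-byte length can end up driving a CRC over 4294636456 bytes if the buffer contents change between the two loads. Whether that is what actually happens on the affected machines is exactly the open question above.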