Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From: Heikki Linnakangas
Subject: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date:
Msg-id: 89e33d4f-5c75-0738-3dcb-44c4df59e920@iki.fi
In reply to: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
List: pgsql-hackers
On 18/01/18 20:54, Kyotaro HORIGUCHI wrote:
> At Thu, 18 Jan 2018 11:52:52 -0800, Andres Freund <andres@anarazel.de> wrote in <20180118195252.hyxgkb3krcqjzfjm@alap3.anarazel.de>
>> On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
>>> b) The second patch that I would like to mention is doing things on the
>>> standby side by being able to change a WAL source when fetching a single
>>> record so as it is possible to fetch the beginning of a cross-segment
>>> record from say an archive or the local xlog, and then request the rest
>>> on the primary. This patch can be found in
>>> https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
>>> and the diff in WaitForWALToBecomeAvailable() scares me a lot because,
>>> it seems to me that this actually breaks timeline switch
>>> consistency. The concept of switching a WAL source in the middle of a
>>> WAL segment is risky anyway, because we perform sanity checks while
>>> switching sources. The bug we are dealing with here is very rare, and
>>> seeing a problem like that for a timeline switch is even more rare, but
>>> the risk is not zero and we may finish with a corrupted instance, so we
>>> should not in my opinion take this approach. Also this actually achieves
>>> point 2) above, not completely though, but not 1).
>>
>> I don't agree with the sentiment about the approach, but I agree there
>> might be weaknesses in the implementation (which I have not reviewed). I
>> think it's perfectly sensible to require fetching one segment from
>> pg_xlog and the next one via another method. Currently the inability to
>> do so often leads to us constantly refetching the same segment.
>
> Though I'm still not fully confident, if reading a set of
> continuation records from two (or more) sources can lead to
> inconsistency, two successive (complete) records are facing the
> same risk.

This seems like the best approach to me as well.
Looking at the patch linked above (https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp):

> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -11693,6 +11693,10 @@ retry:
>  	Assert(reqLen <= readLen);
>  
>  	*readTLI = curFileTLI;
> +
> +	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
> +		goto next_record_is_invalid;
> +
>  	return readLen;
>  
>  next_record_is_invalid:

What is the point of adding this XLogReaderValidatePageHeader() call? The caller of this callback function, ReadPageInternal(), checks the page header anyway. Earlier in this thread, you said:

> The rest to do is let XLogPageRead retry other sources
> immediately. To do this I made ValidXLogPageHeader in xlogreader.c
> public (and renamed it to XLogReaderValidatePageHeader).

but I don't understand that.

- Heikki