Обсуждение: Outdated replication protocol error?
Commit 5ee2197767 (about 4 years old) introduced the error: "IDENTIFY_SYSTEM has not been run before START_REPLICATION" But it seems like running START_REPLICATION first works (at least in the simple case). We should either: 1. Document that IDENTIFY_SYSTEM must always be run before START_REPLICATION, and always issue a WARNING if that's not done (an ERROR might break existing applications); or 2. If the commit is out of date and no longer needed, or if it's easy enough to fix, just remove the error (and Assert a valid ThisTimeLineID). Regards, Jeff Davis
On 2021/01/12 9:06, Jeff Davis wrote: > Commit 5ee2197767 (about 4 years old) introduced the error: > > "IDENTIFY_SYSTEM has not been run before START_REPLICATION" > > But it seems like running START_REPLICATION first works (at least in > the simple case). > > We should either: > > 1. Document that IDENTIFY_SYSTEM must always be run before > START_REPLICATION, and always issue a WARNING if that's not done (an > ERROR might break existing applications); or > > 2. If the commit is out of date and no longer needed, or if it's easy > enough to fix, just remove the error (and Assert a valid > ThisTimeLineID). +1 to remove the error if START_REPLICATION can always work fine without IDENTIFY_SYSTEM. I found that the error happens when we connect to the standby and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not sure if IDENTIFY_SYSTEM is actually necessary even in that case. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Hi, On 2021-01-14 16:40:26 +0900, Fujii Masao wrote: > On 2021/01/12 9:06, Jeff Davis wrote: > > Commit 5ee2197767 (about 4 years old) introduced the error: > > > > "IDENTIFY_SYSTEM has not been run before START_REPLICATION" > > > > But it seems like running START_REPLICATION first works (at least in > > the simple case). > > > > We should either: > > > > 1. Document that IDENTIFY_SYSTEM must always be run before > > START_REPLICATION, and always issue a WARNING if that's not done (an > > ERROR might break existing applications); or > > > > 2. If the commit is out of date and no longer needed, or if it's easy > > enough to fix, just remove the error (and Assert a valid > > ThisTimeLineID). > > +1 to remove the error if START_REPLICATION can always work fine without > IDENTIFY_SYSTEM. I found that the error happens when we connect to the standby > and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not sure > if IDENTIFY_SYSTEM is actually necessary even in that case. The current approach seems quite bad to me too. By that point the value could be just about arbitrarily out of date - but that doesn't really matter because GetStandbyFlushRecPtr() updates it. And for the !am_cascading_walsender it's trivial to compute. Has anybody dug out the thread leading to the commit? Greetings, Andres Freund
On Fri, 2021-01-15 at 13:55 -0800, Andres Freund wrote: > Has anybody dug out the thread leading to the commit? https://www.postgresql.org/message-id/CAMsr%2BYEN04ztb%2BSDb_UfD72Kg5M3F%2BCpad31QBKT2rRjysmxRg%40mail.gmail.com Regards, Jeff Davis
On Fri, 2021-01-15 at 13:55 -0800, Andres Freund wrote: > > > We should either: > > > > > > 1. Document that IDENTIFY_SYSTEM must always be run before > > > START_REPLICATION, and always issue a WARNING if that's not done > > > (an > > > ERROR might break existing applications); or > > > > > > 2. If the commit is out of date and no longer needed, or if it's > > > easy > > > enough to fix, just remove the error (and Assert a valid > > > ThisTimeLineID). > > > > +1 to remove the error if START_REPLICATION can always work fine > > without > > IDENTIFY_SYSTEM. I found that the error happens when we connect to > > the standby > > and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not > > sure > > if IDENTIFY_SYSTEM is actually necessary even in that case. > > The current approach seems quite bad to me too. By that point the > value > could be just about arbitrarily out of date - but that doesn't really > matter because GetStandbyFlushRecPtr() updates it. And for the > !am_cascading_walsender it's trivial to compute. [ digging up old thread ] It seems everyone agrees that the current behavior is strange. Any ideas on a solution here? > Has anybody dug out the thread leading to the commit? https://www.postgresql.org/message-id/CAMsr%2BYEN04ztb%2BSDb_UfD72Kg5M3F%2BCpad31QBKT2rRjysmxRg%40mail.gmail.com Regards, Jeff Davis
Hi, On 2021-06-16 15:59:11 -0700, Jeff Davis wrote: > [ digging up old thread ] > > It seems everyone agrees that the current behavior is strange. Yea. I don't remember the details, but I've also hit this problem since in some odd circumstance while reviewing the "logical decoding on standbys" patchset. > Any ideas on a solution here? I think we should explicitly compute the current timeline before using ThisTimelineID. E.g. in StartReplication() call a new version of GetFlushRecPtr() that also returns the current timeline id. Greetings, Andres Freund
On Wed, 2021-06-16 at 16:17 -0700, Andres Freund wrote: > I think we should explicitly compute the current timeline before > using > ThisTimelineID. E.g. in StartReplication() call a new version of > GetFlushRecPtr() that also returns the current timeline id. I think all we need to do is follow the pattern in IdentifySystem() by calling: am_cascading_walsender = RecoveryInProgress(); first. There are three cases: 1. If the server was a primary the last time RecoveryInProgress() was called, and it's still a primary, then it returns false immediately. ThisTimeLineID should be set properly already. 2. If the server was a secondary the last time RecoveryInProgress() was called, and now it's a primary, then it updates ThisTimeLineID in InitXLOGAccess() and returns false. 3. If the server was a secondary the last time, and is still a secondary, then it returns true. Then, StartReplication() will call GetStandbyFlushRecPtr(), which will update ThisTimeLineID. It was confusing to me for a while because I was trying to sort out whether am_cascading_walsender and/or ThisTimeLineID could be out of date, and how that would result in not updating ThisTimeLineID; and also why there was a difference between IdentifySystem() and StartReplication(). Simple patch attached. I didn't test it yet, but wanted to post my analysis. Regards, Jeff Davis
Вложения
Hi, On 2021-06-17 18:13:57 -0700, Jeff Davis wrote: > On Wed, 2021-06-16 at 16:17 -0700, Andres Freund wrote: > > I think we should explicitly compute the current timeline before > > using > > ThisTimelineID. E.g. in StartReplication() call a new version of > > GetFlushRecPtr() that also returns the current timeline id. > > I think all we need to do is follow the pattern in IdentifySystem() by > calling: > > am_cascading_walsender = RecoveryInProgress(); > > first. Yea, that sounds reasonable. I'm not a fan of hiding the timeline determination inside RecoveryInProgress(), particularly not when communicated via global variable. But that's not the fault of this patch. Greetings, Andres Freund