Обсуждение: Cascade replication
On Wed, May 25, 2011 at 2:01 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > I'd like to propose cascade replication feature (i.e., allow the > standby to accept > replication connection from another standby) for 9.2. This feature is useful to > reduce the overhead of the master since by using that we can decrease the > number of standbys directly connecting to the master. > > I attached the WIP patch, which changes walsender so that it starts replication > even during recovery. Then, the walsender attempts to send all WAL that's > already been fsync'd to the standby's disk (i.e., send WAL up to the bigger > location between the receive location and the replay one). When the standby is > promoted, all walsenders in that standby end because they cannot continue > replication any more in that case because of the timeline mismatch. > > The standby must not accept replication connection from that standby itself. > Otherwise, since any new WAL data would not appear in that standby, > replication cannot advance any more. As a safeguard against this, I introduced > new ID to identify each instance. The walsender sends that ID as the fourth > field of the reply of IDENTIFY_SYSTEM, and then walreceiver checks whether > the IDs are the same between two servers. If they are the same, which means > that the standby is just connecting to that standby itself, so walreceiver > emits ERROR. > > One remaining problem which I'll have to tackle is that: Even while walreceiver > is not in progress (i.e., the startup process is retrieving WAL file from the > archive), the cascading walsender should continuously send new WAL data. > This means that the walsender should send the WAL file restored from the > archive. The problem is that the name of such a restored WAL file is always > "RECOVERYXLOG". For now, walsender cannot handle the WAL file with such > a name. > > To address the above problem, I'm thinking to make the startup process restore > the WAL file with its real name instead of "RECOVERYXLOG". Then, like in the > master, the walsender can read and send the restored WAL file. The required > WAL file can be recycled before being sent. So we might need to enable > wal_keep_segments setting even in the standby. Done. Updated patch attached. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Вложения
On Tue, Jun 14, 2011 at 6:08 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> The standby must not accept replication connection from that standby itself. >> Otherwise, since any new WAL data would not appear in that standby, >> replication cannot advance any more. As a safeguard against this, I introduced >> new ID to identify each instance. The walsender sends that ID as the fourth >> field of the reply of IDENTIFY_SYSTEM, and then walreceiver checks whether >> the IDs are the same between two servers. If they are the same, which means >> that the standby is just connecting to that standby itself, so walreceiver >> emits ERROR. Thanks for waiting for review. This part of the patch is troubling me. I think you have identified an important problem, but this solution doesn't work fully. If we allow standbys to connect to other standbys then we have problems with standbys not being connected to master. This can occur with a 1-step connection, as you point out, but it could also occur with a 2-step, 3-step or more connection, where a circle of standbys are all depending upon each other. Your solution only works for 1-step connections. "Solving" that problem in a general sense might be more dangerous than leaving it alone. I think we should think some more about the issues there and approach them as a separate problem. I think we should remove that and just focus on the main problem, for now. That will make it a simpler patch and easier to commit. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jul 4, 2011 at 6:24 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Tue, Jun 14, 2011 at 6:08 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > >>> The standby must not accept replication connection from that standby itself. >>> Otherwise, since any new WAL data would not appear in that standby, >>> replication cannot advance any more. As a safeguard against this, I introduced >>> new ID to identify each instance. The walsender sends that ID as the fourth >>> field of the reply of IDENTIFY_SYSTEM, and then walreceiver checks whether >>> the IDs are the same between two servers. If they are the same, which means >>> that the standby is just connecting to that standby itself, so walreceiver >>> emits ERROR. > > Thanks for waiting for review. Thanks for the review! > This part of the patch is troubling me. I think you have identified an > important problem, but this solution doesn't work fully. > > If we allow standbys to connect to other standbys then we have > problems with standbys not being connected to master. This can occur > with a 1-step connection, as you point out, but it could also occur > with a 2-step, 3-step or more connection, where a circle of standbys > are all depending upon each other. Your solution only works for 1-step > connections. "Solving" that problem in a general sense might be more > dangerous than leaving it alone. I think we should think some more > about the issues there and approach them as a separate problem. > > I think we should remove that and just focus on the main problem, for > now. That will make it a simpler patch and easier to commit. I agree to focus on the main problem first. I removed that. Attached is the updated version. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Вложения
On Tue, Jul 5, 2011 at 4:34 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Jul 4, 2011 at 6:24 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On Tue, Jun 14, 2011 at 6:08 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> >>>> The standby must not accept replication connection from that standby itself. >>>> Otherwise, since any new WAL data would not appear in that standby, >>>> replication cannot advance any more. As a safeguard against this, I introduced >>>> new ID to identify each instance. The walsender sends that ID as the fourth >>>> field of the reply of IDENTIFY_SYSTEM, and then walreceiver checks whether >>>> the IDs are the same between two servers. If they are the same, which means >>>> that the standby is just connecting to that standby itself, so walreceiver >>>> emits ERROR. >> >> Thanks for waiting for review. > > Thanks for the review! > I agree to focus on the main problem first. I removed that. Attached > is the updated version. Now for the rest of the review... I'd rather not include another chunk of code related to wal_keep_segments. The existing code in CreateCheckPoint() should be refactored so that we call the same code from both CreateCheckPoint() and CreateRestartPoint(). IMHO it's time to get rid of RECOVERYXLOG as an initial target for de-archived files. That made sense once, but now we have streaming it makes more sense for us to de-archive straight onto the correct file name and let the file be cleaned up later. So de-archiving it and then copying to the new location doesn't seem the right thing to do (especially not to copy rather than rename). RECOVERYXLOG allowed us to de-archive the file without removing a pre-existing file, so we must handle that still - the current patch would fail if a pre-existing WAL file were there. Those changes will make this code cleaner for the long term. I don't think we should simply shutdown a WALSender when we startup. That is indistinguishable from a failure, which is going to be very worrying if we do a switchover. Is there another way to do this? Or if not, at least a log message to explain it was normal that we requested this. It would be possible to have synchronous cascaded replication but that is probably another patch :-) -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 5, 2011 at 8:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > Now for the rest of the review... Thanks! > I'd rather not include another chunk of code related to > wal_keep_segments. The existing code in CreateCheckPoint() should be > refactored so that we call the same code from both CreateCheckPoint() > and CreateRestartPoint(). This makes sense. Will do. > IMHO it's time to get rid of RECOVERYXLOG as an initial target for > de-archived files. That made sense once, but now we have streaming it > makes more sense for us to de-archive straight onto the correct file > name and let the file be cleaned up later. So de-archiving it and then > copying to the new location doesn't seem the right thing to do > (especially not to copy rather than rename). RECOVERYXLOG allowed us > to de-archive the file without removing a pre-existing file, so we > must handle that still - the current patch would fail if a > pre-existing WAL file were there. You mean that we must keep a pre-existing file? If so, why do we need to do that? Since it's older than an archived file, it seems to be OK to overwrite it with an archived file. Or is there corner case that an archived file is older than a pre-existing one? If we don't need to keep a pre-existing file, I'll change the way to de-archive according to your suggestion, as follows; 1. Rename a pre-existing file to EXISTINGXLOG 2. De-archive the file onto the correct name 3. If the de-archived file is invalid (i.e., its size is not 16MB), remove it and rename EXISTINGXLOG to the correct name 4. If the de-archived file is valid, remove EXISTINGXLOG 5. Replay the file with the correct name Or 1. De-archive the file to RECOVERYXLOG 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename RECOVERYXLOG to the correct name 3. Replay the file with the correct name > Those changes will make this code cleaner for the long term. > > I don't think we should simply shutdown a WALSender when we startup. > That is indistinguishable from a failure, which is going to be very > worrying if we do a switchover. Is there another way to do this? Or if > not, at least a log message to explain it was normal that we requested > this. What about outputing something like the following message in that case? if ("walsender receives SIGUSR2") ereport(LOG, "terminating walsender process due to administrator command"); > It would be possible to have synchronous cascaded replication but that > is probably another patch :-) Yeah, right. You'll try? ;) Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Jul 6, 2011 at 2:13 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >> IMHO it's time to get rid of RECOVERYXLOG as an initial target for >> de-archived files. That made sense once, but now we have streaming it >> makes more sense for us to de-archive straight onto the correct file >> name and let the file be cleaned up later. So de-archiving it and then >> copying to the new location doesn't seem the right thing to do >> (especially not to copy rather than rename). RECOVERYXLOG allowed us >> to de-archive the file without removing a pre-existing file, so we >> must handle that still - the current patch would fail if a >> pre-existing WAL file were there. > <snip> > If we don't need to keep a pre-existing file, I'll change the way to de-archive > according to your suggestion, as follows; > > 1. Rename a pre-existing file to EXISTINGXLOG > 2. De-archive the file onto the correct name > 3. If the de-archived file is invalid (i.e., its size is not 16MB), > remove it and > rename EXISTINGXLOG to the correct name > 4. If the de-archived file is valid, remove EXISTINGXLOG > 5. Replay the file with the correct name I'm laughing quite hard here... :-) > Or > > 1. De-archive the file to RECOVERYXLOG > 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename > RECOVERYXLOG to the correct name > 3. Replay the file with the correct name Yes please, that makes sense. >> Those changes will make this code cleaner for the long term. >> >> I don't think we should simply shutdown a WALSender when we startup. >> That is indistinguishable from a failure, which is going to be very >> worrying if we do a switchover. Is there another way to do this? Or if >> not, at least a log message to explain it was normal that we requested >> this. > > What about outputing something like the following message in that case? > > if ("walsender receives SIGUSR2") > ereport(LOG, "terminating walsender process due to > administrator command"); ...which doesn't explain the situation because we don't know why SIGUSR2 was sent. I was thinking of something like this LOG: requesting walsenders for cascading replication reconnect to update timeline but then I ask: Why not simply send a new message type saying "new timeline id is X" and that way we don't need to restart the connection at all? >> It would be possible to have synchronous cascaded replication but that >> is probably another patch :-) > > Yeah, right. You'll try? ;) I'll wait for a request... -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 6, 2011 at 2:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> 1. De-archive the file to RECOVERYXLOG >> 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename >> RECOVERYXLOG to the correct name >> 3. Replay the file with the correct name > > Yes please, that makes sense. Will do. >>> Those changes will make this code cleaner for the long term. >>> >>> I don't think we should simply shutdown a WALSender when we startup. >>> That is indistinguishable from a failure, which is going to be very >>> worrying if we do a switchover. Is there another way to do this? Or if >>> not, at least a log message to explain it was normal that we requested >>> this. >> >> What about outputing something like the following message in that case? >> >> if ("walsender receives SIGUSR2") >> ereport(LOG, "terminating walsender process due to >> administrator command"); > > ...which doesn't explain the situation because we don't know why > SIGUSR2 was sent. > > I was thinking of something like this > > LOG: requesting walsenders for cascading replication reconnect to > update timeline Looks better than my proposal. > but then I ask: Why not simply send a new message type saying "new > timeline id is X" and that way we don't need to restart the connection > at all? Yeah, that's very useful. But I'd like to implement that as a separate patch. I'm thinking that in that case walsender should send the timeline history file and walreceiver should write it down to the disk, instead of just sending timeline ID. Otherwise, when the standby in receive side restarts, it cannot calculate the latest timeline ID correctly. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Jul 6, 2011 at 8:53 AM, Fujii Masao <masao.fujii@gmail.com> wrote: >>> What about outputing something like the following message in that case? >>> >>> if ("walsender receives SIGUSR2") >>> ereport(LOG, "terminating walsender process due to >>> administrator command"); >> >> ...which doesn't explain the situation because we don't know why >> SIGUSR2 was sent. >> >> I was thinking of something like this >> >> LOG: requesting walsenders for cascading replication reconnect to >> update timeline > > Looks better than my proposal. > >> but then I ask: Why not simply send a new message type saying "new >> timeline id is X" and that way we don't need to restart the connection >> at all? > > Yeah, that's very useful. But I'd like to implement that as a separate patch. > > I'm thinking that in that case walsender should send the timeline history file > and walreceiver should write it down to the disk, instead of just sending > timeline ID. Otherwise, when the standby in receive side restarts, it cannot > calculate the latest timeline ID correctly. OK, happy to do that part as an additional patch. That way we can get this committed soon - in this CF. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 6, 2011 at 4:53 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Jul 6, 2011 at 2:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> 1. De-archive the file to RECOVERYXLOG >>> 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename >>> RECOVERYXLOG to the correct name >>> 3. Replay the file with the correct name >> >> Yes please, that makes sense. In #2, if the server is killed with SIGKILL just after removing a pre-existing file and before renaming RECOVERYXLOG, we lose the file with correct name. Even in this case, we would be able to restore it from the archive, but what if unfortunately the archive is unavailable? We would lose the file infinitely. So we should introduce the following safeguard? 2'. If RECOVERYXLOG is valid, move a pre-existing file to pg_xlog/backup, rename RECOVERYXLOG to the correct name,and remove the pre-existing file from pg_xlog/backup Currently we give up a recovery if there is the target file in neither the archive nor pg_xlog. But, if we adopt the above safeguard, in that case, we should try to read thefile from also pg_xlog/backup. In #2, there is another problem; walsender might have the pre-existing file open, so the startup process would need to request walsenders to close the file before removing (or renaming) it, wait for new file to appear and open it again. This might make the code complicated. Does anyone have better approach? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Jul 6, 2011 at 12:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Wed, Jul 6, 2011 at 4:53 PM, Fujii Masao <masao.fujii@gmail.com> wrote: >> On Wed, Jul 6, 2011 at 2:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >>>> 1. De-archive the file to RECOVERYXLOG >>>> 2. If RECOVERYXLOG is valid, remove a pre-existing one and rename >>>> RECOVERYXLOG to the correct name >>>> 3. Replay the file with the correct name >>> >>> Yes please, that makes sense. > > In #2, if the server is killed with SIGKILL just after removing a pre-existing > file and before renaming RECOVERYXLOG, we lose the file with correct name. > Even in this case, we would be able to restore it from the archive, but what if > unfortunately the archive is unavailable? We would lose the file infinitely. So > we should introduce the following safeguard? > > 2'. If RECOVERYXLOG is valid, move a pre-existing file to pg_xlog/backup, > rename RECOVERYXLOG to the correct name, and remove the pre-existing > file from pg_xlog/backup > > Currently we give up a recovery if there is the target file in > neither the > archive nor pg_xlog. But, if we adopt the above safeguard, in that case, > we should try to read the file from also pg_xlog/backup. > > In #2, there is another problem; walsender might have the pre-existing file > open, so the startup process would need to request walsenders to close the > file before removing (or renaming) it, wait for new file to appear and open it > again. This might make the code complicated. Does anyone have better > approach? The risk you describe already exists in current code. I regard it as a non-risk. The unlink() and the rename() are executed consecutively, so the gap between them is small, so the chance of a SIGKILL in that gap at the same time as losing the archive seems low, and we can always get that file from the master again if we are streaming. Any code you add to "fix" this will get executed so rarely it probably won't work when we need it to. In the current scheme we restart archiving from the last restartpoint, which exists only on the archive. This new patch improves upon this by keeping the most recent files locally, so we are less expose in the case of archive unavailability. So this patch already improves things and we don't need any more than that. No extra code please, IMHO. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Fujii, > In the current scheme we restart archiving from the last restartpoint, > which exists only on the archive. This new patch improves upon this by > keeping the most recent files locally, so we are less expose in the > case of archive unavailability. So this patch already improves things > and we don't need any more than that. No extra code please, IMHO. Do you think you'll submit a new version of the patch this commitfest? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Jul 11, 2011 at 3:30 AM, Josh Berkus <josh@agliodbs.com> wrote: > Fujii, > >> In the current scheme we restart archiving from the last restartpoint, >> which exists only on the archive. This new patch improves upon this by >> keeping the most recent files locally, so we are less expose in the >> case of archive unavailability. So this patch already improves things >> and we don't need any more than that. No extra code please, IMHO. > > Do you think you'll submit a new version of the patch this commitfest? Yes. I'm now updating the patch according to Simon's comments. I will submit it today. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Jul 11, 2011 at 10:26 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Mon, Jul 11, 2011 at 3:30 AM, Josh Berkus <josh@agliodbs.com> wrote: >> Do you think you'll submit a new version of the patch this commitfest? > > Yes. I'm now updating the patch according to Simon's comments. > I will submit it today. Attached is the updated version which addresses all the issues raised by Simon. > The risk you describe already exists in current code. > > I regard it as a non-risk. The unlink() and the rename() are executed > consecutively, so the gap between them is small, so the chance of a > SIGKILL in that gap at the same time as losing the archive seems low, > and we can always get that file from the master again if we are > streaming. Any code you add to "fix" this will get executed so rarely > it probably won't work when we need it to. > > In the current scheme we restart archiving from the last restartpoint, > which exists only on the archive. This new patch improves upon this by > keeping the most recent files locally, so we are less expose in the > case of archive unavailability. So this patch already improves things > and we don't need any more than that. No extra code please, IMHO. Yes, I added no extra code for the risk I raised upthread. > In #2, there is another problem; walsender might have the pre-existing file > open, so the startup process would need to request walsenders to close the > file before removing (or renaming) it, wait for new file to appear and open it > again. I implemented this. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Вложения
On Mon, Jul 11, 2011 at 7:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > Attached is the updated version which addresses all the issues raised by > Simon. Is there any reason why we disallow cascading unless hot standby is enabled? ISTM we can just alter the postmaster path for walsenders, patch attached. Some people might be happier if a sync standby were not HS enabled, yet able to cascade to other standbys for reading. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Вложения
On Tue, Jul 19, 2011 at 5:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Mon, Jul 11, 2011 at 7:28 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > >> Attached is the updated version which addresses all the issues raised by >> Simon. > > Is there any reason why we disallow cascading unless hot standby is enabled? > > ISTM we can just alter the postmaster path for walsenders, patch attached. > > Some people might be happier if a sync standby were not HS enabled, > yet able to cascade to other standbys for reading. - return CAC_STARTUP; /* normal startup */ + { + if (am_walsender) + return CAC_OK; + else + return CAC_STARTUP; /* normal startup */ + } In canAcceptConnections(), am_walsender is always false, so the above CAC_OK is never returned. You should change ProcessStartupPacket() as follows, instead. switch (port->canAcceptConnections) { case CAC_STARTUP: + if (am_walsender) + { + port->canAcceptConnections = CAC_OK; + break; + } ereport(FATAL, When I fixed the above, compile the code and set up the cascading replication environment (disable hot_standby), I got the following assertion error: TRAP: FailedAssertion("!(slot > 0 && slot <= PMSignalState->num_child_flags)", File: "pmsignal.c", Line: 227) So we would still have some code to change. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jul 19, 2011 at 12:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > So we would still have some code to change. Sigh, yes, of course. The question was whether there is any reason we need to disallow cascading? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 19, 2011 at 9:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Tue, Jul 19, 2011 at 12:19 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > >> So we would still have some code to change. > > Sigh, yes, of course. > > The question was whether there is any reason we need to disallow cascading? No, at least I have no clear reason for now. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jul 19, 2011 at 1:38 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Tue, Jul 19, 2011 at 9:09 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On Tue, Jul 19, 2011 at 12:19 PM, Fujii Masao <masao.fujii@gmail.com> >> wrote: >> >>> So we would still have some code to change. >> >> Sigh, yes, of course. >> >> The question was whether there is any reason we need to disallow >> cascading? > > No, at least I have no clear reason for now. I'll work up a proper patch. Thanks for your earlier review, -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services