RE: Synchronizing slots from primary to standby
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: Synchronizing slots from primary to standby |
Date | |
Msg-id | OS0PR01MB571641EF50C4B99A5DB3C601946F2@OS0PR01MB5716.jpnprd01.prod.outlook.com |
In response to | Re: Synchronizing slots from primary to standby (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses | Re: Synchronizing slots from primary to standby |
List | pgsql-hackers |
On Friday, January 12, 2024 8:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

>
> On Thu, Jan 11, 2024 at 7:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > +static bool
> > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot
> > > +*remote_slot)
> > > {
> > > ...
> > > + /* Slot ready for sync, so sync it. */ else {
> > > + /*
> > > + * Sanity check: With hot_standby_feedback enabled and
> > > + * invalidations handled appropriately as above, this should never
> > > + * happen.
> > > + */
> > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) elog(ERROR,
> > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> > > + " to remote slot's LSN(%X/%X) as synchronization"
> > > + " would move it backwards", remote_slot->name,
> > > + LSN_FORMAT_ARGS(slot->data.restart_lsn),
> > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn));
> > > ...
> > > }
> > >
> > > I was thinking about the above code in the patch and as far as I can
> > > think this can only occur if the same name slot is re-created with
> > > prior restart_lsn after the existing slot is dropped. Normally, the
> > > newly created slot (with the same name) will have higher restart_lsn
> > > but one can mimic it by copying some older slot by using
> > > pg_copy_logical_replication_slot().
> > >
> > > I don't think as mentioned in comments even if hot_standby_feedback
> > > is temporarily set to off, the above shouldn't happen. It can only
> > > lead to invalidated slots on standby.
> > >
> > > To close the above race, I could think of the following ways:
> > > 1. Drop and re-create the slot.
> > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves
> > > ahead of local_slot's LSN then we can update it; but as mentioned in
> > > your previous comment, we need to update all other fields as well.
> > > If we follow this then we probably need to have a check for
> > > catalog_xmin as well.
> > >
> >
> > The second point as mentioned is slightly misleading, so let me try to
> > rephrase it once again: Emit LOG/WARNING in this case and once
> > remote_slot's LSN moves ahead of local_slot's LSN then we can update
> > it; additionally, we need to update all other fields like two_phase as
> > well. If we follow this then we probably need to have a check for
> > catalog_xmin as well along remote_slot's restart_lsn.
> >
> > > Now, related to this the other case which needs some handling is
> > > what if the remote_slot's restart_lsn is greater than local_slot's
> > > restart_lsn but it is a re-created slot with the same name. In that
> > > case, I think the other properties like 'two_phase', 'plugin' could
> > > be different. So, is simply copying those sufficient or do we need
> > > to do something else as well?
> > >
> >
> > Bertrand, Dilip, Sawada-San, and others, please share your opinion on
> > this problem as I think it is important to handle this race condition.
>
> Is there any good use case of copying a failover slot in the first place? If it's not
> a normal use case and we can probably live without it, why not always disable
> failover during the copy? FYI we always disable two_phase on copied slots. It
> seems to me that copying a failover slot could lead to problems, as long as we
> synchronize slots based on their names. IIUC without the copy, this pass should
> never happen.

Thanks for the suggestion. I also don't have a use case for this.

Attach the V61 patch set that addresses this suggestion. And here is the
summary of the changes made in each patch.

V61-0001
1. Reverts the changes in copy_replication_slot.

V61-0002
1. Adds the documents for the steps that user needs to follow to ensure
   the standby is ready for failover
2. Directly update the fields restart_lsn/confirmed_flush/catalog_xmin
   instead of using APIs like LogicalConfirmReceivedLocation
3. Updates all the fields(two_phase, failover, plugin) when syncing the slots
4. fixes CFbot failures.
5. Some code style adjustment. (pending comments in last version)
6. Remove some unnecessary Assert and variable assignment (off-list comments
   from Peter)

Thanks Shveta for working on 4 and 5.

V61-0003
1. Some documents update related to standby_slot_names and the steps for
   failover.

V61-0004
- No change.

Best Regards,
Hou zj
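
Note for readers of the thread: option 2 discussed above (warn and skip while the remote slot is behind, then copy every field once it has caught up) can be sketched in standalone C. This is an illustration only and is not taken from the V61 patches. `Lsn`, `SlotState`, `SyncResult`, `xid_precedes` and `sync_slot_sketch` are hypothetical names; real PostgreSQL code would use its own slot structures, `elog`, and `TransactionIdPrecedes`, and the wraparound comparison here ignores special transaction IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a WAL position (a plain 64-bit integer here). */
typedef uint64_t Lsn;

/* Hypothetical, simplified view of the slot fields discussed in the thread. */
typedef struct SlotState
{
    const char *name;
    Lsn         restart_lsn;
    Lsn         confirmed_flush;
    uint32_t    catalog_xmin;
    bool        two_phase;
    bool        failover;
    const char *plugin;
} SlotState;

typedef enum SyncResult
{
    SYNC_UPDATED,
    SYNC_SKIPPED_BEHIND
} SyncResult;

/*
 * Wraparound-aware "a is older than b" test for 32-bit transaction IDs.
 * Simplified assumption: special XIDs are not treated differently.
 */
bool
xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Option 2 as a sketch: if the remote slot is behind the local one in
 * restart_lsn or catalog_xmin, warn and leave the local slot untouched;
 * otherwise copy all fields, not only the positions.
 */
SyncResult
sync_slot_sketch(SlotState *local, const SlotState *remote)
{
    assert(local != NULL && remote != NULL);

    if (remote->restart_lsn < local->restart_lsn ||
        xid_precedes(remote->catalog_xmin, local->catalog_xmin))
    {
        fprintf(stderr,
                "WARNING: not synchronizing slot \"%s\": remote LSN(%X/%X) "
                "or catalog_xmin is behind local LSN(%X/%X)\n",
                remote->name,
                (unsigned int) (remote->restart_lsn >> 32),
                (unsigned int) remote->restart_lsn,
                (unsigned int) (local->restart_lsn >> 32),
                (unsigned int) local->restart_lsn);
        return SYNC_SKIPPED_BEHIND;
    }

    /* Positions are assigned directly rather than through helper APIs. */
    local->restart_lsn = remote->restart_lsn;
    local->confirmed_flush = remote->confirmed_flush;
    local->catalog_xmin = remote->catalog_xmin;

    /* A re-created slot with the same name may differ in these as well. */
    local->two_phase = remote->two_phase;
    local->failover = remote->failover;
    local->plugin = remote->plugin;     /* pointer copy, sketch only */

    return SYNC_UPDATED;
}
```

A caller would invoke `sync_slot_sketch` once per sync cycle for each slot name and retry on the next cycle whenever it returns `SYNC_SKIPPED_BEHIND`. Because slots are matched purely by name, this sketch cannot tell a re-created or copied slot from the original one, which is the concern raised in the thread about copying failover slots.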