Re: Race condition in InvalidateObsoleteReplicationSlots()
От | Andres Freund |
---|---|
Тема | Re: Race condition in InvalidateObsoleteReplicationSlots() |
Дата | |
Msg-id | 20210408020913.zzprrlvqyvlt5cyy@alap3.anarazel.de обсуждение исходный текст |
Ответ на | Race condition in InvalidateObsoleteReplicationSlots() (Andres Freund <andres@anarazel.de>) |
Ответы |
Re: Race condition in InvalidateObsoleteReplicationSlots()
Re: Race condition in InvalidateObsoleteReplicationSlots() Re: Race condition in InvalidateObsoleteReplicationSlots() |
Список | pgsql-hackers |
Hi, On 2021-04-07 17:10:37 -0700, Andres Freund wrote: > I think this can be solved in two different ways: > > 1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of > InvalidateObsoleteReplicationSlots(). That way nobody could re-create a new > slot in the to-be-obsoleted-slot's place. > > 2) Atomically check whether the slot needs to be invalidated and try to > acquire if needed. Don't release ReplicationSlotControlLock between those > two steps. Signal the owner to release the slot iff we couldn't acquire the > slot. In the latter case wait and then recheck if the slot still needs to > be dropped. > > To me 2) seems better, because we then can also be sure that the slot still > needs to be obsoleted, rather than potentially doing so unnecessarily. > > > It looks to me like several of the problems here stem from trying to reuse > code from ReplicationSlotAcquireInternal() (which before this was just named > ReplicationSlotAcquire()). I don't think that makes sense, because cases like > this want to check if a condition is true, and acquire it only if so. > > IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(), > except that a different condition is checked, and the if (active_pid) case > needs to prepare a condition variable, signal the owner and then wait on the > condition variable, to restart after. I'm also confused by the use of ConditionVariableTimedSleep(timeout = 10). Why do we need a timed sleep here in the first place? And why with such a short sleep? I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS(); - but is aware it's running in checkpointer. I don't think CFI does much there? If we are worried about needing to check for interrupts, more work is needed. Sketch for a fix attached. I did leave the odd ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's there... After this I don't see a reason to have SAB_Inquire - as far as I can tell it's practically impossible to use without race conditions? Except for raising an error - which is "builtin"... Greetings, Andres Freund
Вложения
В списке pgsql-hackers по дате отправления: