Re: Introduce XID age and inactive timeout based replication slot invalidation
От | Bharath Rupireddy |
---|---|
Тема | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Дата | |
Msg-id | CALj2ACVVb1Uds+ucw_nq03uoGB5NToVi5hSiwW+S_goju_sncg@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Introduce XID age and inactive timeout based replication slot invalidation (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>) |
Ответы |
Re: Introduce XID age and inactive timeout based replication slot invalidation
(Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
|
Список | pgsql-hackers |
On Thu, Mar 28, 2024 at 3:13 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Regarding 0002: Thanks for reviewing it. > Some testing: > > T1 === > > When the slot is invalidated on the primary, then the reason is propagated to > the sync slot (if any). That's fine but we are loosing the inactive_since on the > standby: > > Primary: > > postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from pg_replication_slots where slot_name='lsub29_slot'; > slot_name | inactive_since | conflicting | invalidation_reason > -------------+-------------------------------+-------------+--------------------- > lsub29_slot | 2024-03-28 08:24:51.672528+00 | f | inactive_timeout > (1 row) > > Standby: > > postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from pg_replication_slots where slot_name='lsub29_slot'; > slot_name | inactive_since | conflicting | invalidation_reason > -------------+----------------+-------------+--------------------- > lsub29_slot | | f | inactive_timeout > (1 row) > > I think in this case it should always reflect the value from the primary (so > that one can understand why it is invalidated). I'll come back to this as soon as we all agree on inactive_since behavior for synced slots. > T2 === > > And it is set to a value during promotion: > > postgres=# select pg_promote(); > pg_promote > ------------ > t > (1 row) > > postgres=# select slot_name,inactive_since,conflicting,invalidation_reason from pg_replication_slots where slot_name='lsub29_slot'; > slot_name | inactive_since | conflicting | invalidation_reason > -------------+------------------------------+-------------+--------------------- > lsub29_slot | 2024-03-28 08:30:11.74505+00 | f | inactive_timeout > (1 row) > > I think when it is invalidated it should always reflect the value from the > primary (so that one can understand why it is invalidated). I'll come back to this as soon as we all agree on inactive_since behavior for synced slots. > T3 === > > As far the slot invalidation on the primary: > > postgres=# SELECT * FROM pg_logical_slot_get_changes('lsub29_slot', NULL, NULL, 'include-xids', '0'); > ERROR: cannot acquire invalidated replication slot "lsub29_slot" > > Can we make the message more consistent with what can be found in CreateDecodingContext() > for example? Hm, that makes sense because slot acquisition and release is something internal to the server. > T4 === > > Also, it looks like querying pg_replication_slots() does not trigger an > invalidation: I think it should if the slot is not invalidated yet (and matches > the invalidation criteria). There's a different opinion on this, check comment #3 from https://www.postgresql.org/message-id/CAA4eK1LLj%2BeaMN-K8oeOjfG%2BUuzTY%3DL5PXbcMJURZbFm%2B_aJSA%40mail.gmail.com. > Code review: > > CR1 === > > + Invalidate replication slots that are inactive for longer than this > + amount of time. If this value is specified without units, it is taken > > s/Invalidate/Invalidates/? Done. > Should we mention the relationship with inactive_since? Done. > CR2 === > > + * > + * If check_for_invalidation is true, the slot is checked for invalidation > + * based on replication_slot_inactive_timeout GUC and an error is raised after making the slot ours. > */ > void > -ReplicationSlotAcquire(const char *name, bool nowait) > +ReplicationSlotAcquire(const char *name, bool nowait, > + bool check_for_invalidation) > > > s/check_for_invalidation/check_for_timeout_invalidation/? Done. > CR3 === > > + if (slot->inactive_since == 0 || > + replication_slot_inactive_timeout == 0) > + return false; > > Better to test replication_slot_inactive_timeout first? (I mean there is no > point of testing inactive_since if replication_slot_inactive_timeout == 0) > > CR4 === > > + if (slot->inactive_since > 0 && > + replication_slot_inactive_timeout > 0) > + { > > Same. > > So, instead of CR3 === and CR4 ===, I wonder if it wouldn't be better to do > something like: > > if (replication_slot_inactive_timeout == 0) > return false; > else if (slot->inactive_since > 0) > . > else > return false; > > That would avoid checking replication_slot_inactive_timeout and inactive_since > multiple times. Done. > CR5 === > > + * held to avoid race conditions -- for example the restart_lsn could move > + * forward, or the slot could be dropped. > > Does the restart_lsn example makes sense here? No, it doesn't. Modified that. > CR6 === > > +static bool > +InvalidateSlotForInactiveTimeout(ReplicationSlot *slot, bool need_locks) > +{ > > InvalidatePossiblyInactiveSlot() maybe? I think we will lose the essence i.e. timeout from the suggested function name, otherwise just the inactive doesn't give a clearer meaning. I kept it that way unless anyone suggests otherwise. > CR7 === > > + /* Make sure the invalidated state persists across server restart */ > + slot->just_dirtied = true; > + slot->dirty = true; > + SpinLockRelease(&slot->mutex); > > Maybe we could create a new function say MarkGivenReplicationSlotDirty() > with a slot as parameter, that ReplicationSlotMarkDirty could call too? Done that. > Then maybe we could set slot->data.invalidated = RS_INVAL_INACTIVE_TIMEOUT in > InvalidateSlotForInactiveTimeout()? (to avoid multiple SpinLockAcquire/SpinLockRelease). Done that. > CR8 === > > + if (persist_state) > + { > + char path[MAXPGPATH]; > + > + sprintf(path, "pg_replslot/%s", NameStr(slot->data.name)); > + SaveSlotToPath(slot, path, ERROR); > + } > > Maybe we could create a new function say GivenReplicationSlotSave() > with a slot as parameter, that ReplicationSlotSave() could call too? Done that. > CR9 === > > + if (check_for_invalidation) > + { > + /* The slot is ours by now */ > + Assert(s->active_pid == MyProcPid); > + > + /* > + * Well, the slot is not yet ours really unless we check for the > + * invalidation below. > + */ > + s->active_pid = 0; > + if (InvalidateReplicationSlotForInactiveTimeout(s, true, true)) > + { > + /* > + * If the slot has been invalidated, recalculate the resource > + * limits. > + */ > + ReplicationSlotsComputeRequiredXmin(false); > + ReplicationSlotsComputeRequiredLSN(); > + > + /* Might need it for slot clean up on error, so restore it */ > + s->active_pid = MyProcPid; > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot acquire invalidated replication slot \"%s\"", > + NameStr(MyReplicationSlot->data.name)))); > + } > + s->active_pid = MyProcPid; > > Are we not missing some SpinLockAcquire/Release on the slot's mutex here? (the > places where we set the active_pid). Hm, yes. But, shall I acquire the mutex, set active_pid to 0 for a moment just to satisfy Assert(slot->active_pid == 0); in InvalidateReplicationSlotForInactiveTimeout and InvalidateSlotForInactiveTimeout? I just removed the assertions because being replication_slot_inactive_timeout > 0 and inactive_since > 0 is enough for these functions to think and decide on inactive timeout invalidation. > CR10 === > > @@ -1628,6 +1674,10 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause, > if (SlotIsLogical(s)) > invalidation_cause = cause; > break; > + case RS_INVAL_INACTIVE_TIMEOUT: > + if (InvalidateReplicationSlotForInactiveTimeout(s, false, false)) > + invalidation_cause = cause; > + break; > > InvalidatePossiblyObsoleteSlot() is not called with such a reason, better to use > an Assert here and in the caller too? Done. > CR11 === > > +++ b/src/test/recovery/t/050_invalidate_slots.pl > > why not using 019_replslot_limit.pl? I understand that 019_replslot_limit covers wal_removed related invalidations. But, I don't want to kludge it with a bunch of other tests. The new tests anyway need a bunch of new nodes and a couple of helper functions. Any future invalidation mechanisms can be added here in this new file. Also, having a separate file quickly helps isolate any test failures that BF animals might report in future. I don't think a separate test file here hurts anyone unless there's a strong reason against it. Please see the attached v30 patch. 0002 is where all of the above review comments have been addressed. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
В списке pgsql-hackers по дате отправления: