Обсуждение: Re: Mention idle_replication_slot_timeout in pg_replication_slots docs
On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > Hi, > > The pg_replication_slots documentation mentions only max_slot_wal_keep_size > as a condition under which the wal_status column can show unreserved or lost. > However, since commit ac0e33136ab, idle_replication_slot_timeout can also > cause this behavior when it is set. This has not been documented yet. > https://www.postgresql.org/docs/devel/view-pg-replication-slots.html > +1 to the doc update. > So, how about updating the documentation to also mention > idle_replication_slot_timeout as a factor that can cause wal_status to > become unreserved or lost? Patch attached. > Since idle_replication_slot_timeout can only cause wal_status to become 'lost' and not 'unreserved', perhaps we can reword the sentence slightly for clarity, suggestion - "The last two states are seen when max_slot_wal_keep_size is non-negative and, the 'lost' state may also appear when idle_replication_slot_timeout is greater than zero." Please feel free to rephrase if needed. -- Thanks, Nisha
On 2025/06/26 15:46, Nisha Moond wrote: > On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> >> Hi, >> >> The pg_replication_slots documentation mentions only max_slot_wal_keep_size >> as a condition under which the wal_status column can show unreserved or lost. >> However, since commit ac0e33136ab, idle_replication_slot_timeout can also >> cause this behavior when it is set. This has not been documented yet. >> https://www.postgresql.org/docs/devel/view-pg-replication-slots.html >> > > +1 to the doc update. Thanks for the review! >> So, how about updating the documentation to also mention >> idle_replication_slot_timeout as a factor that can cause wal_status to >> become unreserved or lost? Patch attached. >> > > Since idle_replication_slot_timeout can only cause wal_status to > become 'lost' and not 'unreserved', perhaps we can reword the sentence > slightly for clarity, suggestion - > "The last two states are seen when max_slot_wal_keep_size is > non-negative and, the 'lost' state may also appear when > idle_replication_slot_timeout is greater than zero." I was thinking that when idle_replication_slot_timeout triggers, the following functions are called, and that wal_status can become "unreserved" before ReplicationSlotRelease() runs. It's very short period, though. Am I wrong? ReplicationSlotMarkDirty(); ReplicationSlotSave(); ReplicationSlotRelease(); Regards, -- Fujii Masao NTT DATA Japan Corporation
On Thu, Jun 26, 2025 at 1:33 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2025/06/26 15:46, Nisha Moond wrote: > > On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >> > >> Hi, > >> > >> The pg_replication_slots documentation mentions only max_slot_wal_keep_size > >> as a condition under which the wal_status column can show unreserved or lost. > >> However, since commit ac0e33136ab, idle_replication_slot_timeout can also > >> cause this behavior when it is set. This has not been documented yet. > >> https://www.postgresql.org/docs/devel/view-pg-replication-slots.html > >> > > > > +1 to the doc update. > > Thanks for the review! > > > >> So, how about updating the documentation to also mention > >> idle_replication_slot_timeout as a factor that can cause wal_status to > >> become unreserved or lost? Patch attached. > >> > > > > Since idle_replication_slot_timeout can only cause wal_status to > > become 'lost' and not 'unreserved', perhaps we can reword the sentence > > slightly for clarity, suggestion - > > "The last two states are seen when max_slot_wal_keep_size is > > non-negative and, the 'lost' state may also appear when > > idle_replication_slot_timeout is greater than zero." > > I was thinking that when idle_replication_slot_timeout triggers, > the following functions are called, and that wal_status can become > "unreserved" before ReplicationSlotRelease() runs. It's very short > period, though. Am I wrong? > > ReplicationSlotMarkDirty(); > ReplicationSlotSave(); > ReplicationSlotRelease(); > Thank you for pointing it out. You are correct that while the checkpointer is in the process of invalidating a slot, it sets its PID as the slot’s active_pid. During this short window, if a user queries pg_replication_slot, the underlying function pg_get_replication_slots will compute the wal_status as 'unreserved' for the invalidated slot because the slot has a valid active_pid. That said, it's reasonable to mention in the doc that 'unreserved' may appear when idle_replication_slot_timeout is greater than zero, as this can indeed happen. So, let's retain the current description. However, this behavior isn’t specific to idle_replication_slot_timeout. For example, when a slot is being invalidated due to a different cause "wal_level_insufficient", 'unreserved' may also briefly appear in wal_status. The current patch LGTM. -- Thanks, Nisha
On 2025/06/27 15:32, Nisha Moond wrote: > On Thu, Jun 26, 2025 at 1:33 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> >> >> >> On 2025/06/26 15:46, Nisha Moond wrote: >>> On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>>> >>>> Hi, >>>> >>>> The pg_replication_slots documentation mentions only max_slot_wal_keep_size >>>> as a condition under which the wal_status column can show unreserved or lost. >>>> However, since commit ac0e33136ab, idle_replication_slot_timeout can also >>>> cause this behavior when it is set. This has not been documented yet. >>>> https://www.postgresql.org/docs/devel/view-pg-replication-slots.html >>>> >>> >>> +1 to the doc update. >> >> Thanks for the review! >> >> >>>> So, how about updating the documentation to also mention >>>> idle_replication_slot_timeout as a factor that can cause wal_status to >>>> become unreserved or lost? Patch attached. >>>> >>> >>> Since idle_replication_slot_timeout can only cause wal_status to >>> become 'lost' and not 'unreserved', perhaps we can reword the sentence >>> slightly for clarity, suggestion - >>> "The last two states are seen when max_slot_wal_keep_size is >>> non-negative and, the 'lost' state may also appear when >>> idle_replication_slot_timeout is greater than zero." >> >> I was thinking that when idle_replication_slot_timeout triggers, >> the following functions are called, and that wal_status can become >> "unreserved" before ReplicationSlotRelease() runs. It's very short >> period, though. Am I wrong? >> >> ReplicationSlotMarkDirty(); >> ReplicationSlotSave(); >> ReplicationSlotRelease(); >> > > Thank you for pointing it out. > You are correct that while the checkpointer is in the process of > invalidating a slot, it sets its PID as the slot’s active_pid. During > this short window, if a user queries pg_replication_slot, the > underlying function pg_get_replication_slots will compute the > wal_status as 'unreserved' for the invalidated slot because the slot > has a valid active_pid. > > That said, it's reasonable to mention in the doc that 'unreserved' may > appear when idle_replication_slot_timeout is greater than zero, as > this can indeed happen. So, let's retain the current description. > > However, this behavior isn’t specific to > idle_replication_slot_timeout. For example, when a slot is being > invalidated due to a different cause "wal_level_insufficient", > 'unreserved' may also briefly appear in wal_status. Yes, and "lost" can appear for various reasons, including wal_level_insufficient, so it seems odd to highlight max_slot_wal_keep_size as the cause of the "lost" status in the note. It would probably be better to remove the mention of "lost" from that note. As for "unreserved", it can also occur for different reasons, but typically, it happens when max_slot_wal_keep_size is set to a non-negative value. So it might make sense to keep the explanation focused just on "unreserved" and max_slot_wal_keep_size. For example: ---------------------- <listitem> <para> <literal>unreserved</literal> means that the slot no longer retains the required WAL files and some of them are to be removed at - the next checkpoint. This state can return + the next checkpoint. This can occur when + <xref linkend="guc-max-slot-wal-keep-size"/> is set to + a non-negative value. This state can return to <literal>reserved</literal> or <literal>extended</literal>. </para> </listitem> <listitem> ---------------------- What do you think? Also, I noticed the note that says “If <structfield>restart_lsn</structfield> is NULL, this field is null” seems inaccurate. For example, when "wal_removed" happens, restart_lsn is NULL but wal_status is "lost". So maybe we should remove that note as well? Regards, -- Fujii Masao NTT DATA Japan Corporation
On Fri, Jun 27, 2025 at 5:40 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2025/06/27 15:32, Nisha Moond wrote: > > On Thu, Jun 26, 2025 at 1:33 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >> > >> > >> > >> On 2025/06/26 15:46, Nisha Moond wrote: > >>> On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >>>> > >>>> Hi, > >>>> > >>>> The pg_replication_slots documentation mentions only max_slot_wal_keep_size > >>>> as a condition under which the wal_status column can show unreserved or lost. > >>>> However, since commit ac0e33136ab, idle_replication_slot_timeout can also > >>>> cause this behavior when it is set. This has not been documented yet. > >>>> https://www.postgresql.org/docs/devel/view-pg-replication-slots.html > >>>> > >>> > >>> +1 to the doc update. > >> > >> Thanks for the review! > >> > >> > >>>> So, how about updating the documentation to also mention > >>>> idle_replication_slot_timeout as a factor that can cause wal_status to > >>>> become unreserved or lost? Patch attached. > >>>> > >>> > >>> Since idle_replication_slot_timeout can only cause wal_status to > >>> become 'lost' and not 'unreserved', perhaps we can reword the sentence > >>> slightly for clarity, suggestion - > >>> "The last two states are seen when max_slot_wal_keep_size is > >>> non-negative and, the 'lost' state may also appear when > >>> idle_replication_slot_timeout is greater than zero." > >> > >> I was thinking that when idle_replication_slot_timeout triggers, > >> the following functions are called, and that wal_status can become > >> "unreserved" before ReplicationSlotRelease() runs. It's very short > >> period, though. Am I wrong? > >> > >> ReplicationSlotMarkDirty(); > >> ReplicationSlotSave(); > >> ReplicationSlotRelease(); > >> > > > > Thank you for pointing it out. > > You are correct that while the checkpointer is in the process of > > invalidating a slot, it sets its PID as the slot’s active_pid. During > > this short window, if a user queries pg_replication_slot, the > > underlying function pg_get_replication_slots will compute the > > wal_status as 'unreserved' for the invalidated slot because the slot > > has a valid active_pid. > > > > That said, it's reasonable to mention in the doc that 'unreserved' may > > appear when idle_replication_slot_timeout is greater than zero, as > > this can indeed happen. So, let's retain the current description. > > > > However, this behavior isn’t specific to > > idle_replication_slot_timeout. For example, when a slot is being > > invalidated due to a different cause "wal_level_insufficient", > > 'unreserved' may also briefly appear in wal_status. > > Yes, and "lost" can appear for various reasons, including wal_level_insufficient, > so it seems odd to highlight max_slot_wal_keep_size as the cause of the "lost" > status in the note. It would probably be better to remove the mention of "lost" > from that note. > +1 > As for "unreserved", it can also occur for different reasons, but typically, > it happens when max_slot_wal_keep_size is set to a non-negative value. > So it might make sense to keep the explanation focused just on "unreserved" > and max_slot_wal_keep_size. For example: > > ---------------------- > <listitem> > <para> > <literal>unreserved</literal> means that the slot no longer > retains the required WAL files and some of them are to be removed at > - the next checkpoint. This state can return > + the next checkpoint. This can occur when > + <xref linkend="guc-max-slot-wal-keep-size"/> is set to > + a non-negative value. This state can return > to <literal>reserved</literal> or <literal>extended</literal>. > </para> > </listitem> > <listitem> > ---------------------- > > What do you think? > The change LGTM, only a minor suggestion to add "typically", as “This can typically occur when…” to indicate that max_slot_wal_keep_size is one possible reason, not the only one. > > Also, I noticed the note that says “If <structfield>restart_lsn</structfield> > is NULL, this field is null” seems inaccurate. For example, when "wal_removed" > happens, restart_lsn is NULL but wal_status is "lost". So maybe we should remove > that note as well? You're right, the statement is not accurate. We could rephrase it as: "If <structfield>restart_lsn</structfield> is NULL, this field is either null or lost." But since 'unreserved' can also appear briefly during invalidation, it might be better to remove it altogether. -- Thanks, Nisha
On 2025/06/30 20:32, Nisha Moond wrote: > On Fri, Jun 27, 2025 at 5:40 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> >> >> >> On 2025/06/27 15:32, Nisha Moond wrote: >>> On Thu, Jun 26, 2025 at 1:33 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>>> >>>> >>>> >>>> On 2025/06/26 15:46, Nisha Moond wrote: >>>>> On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> The pg_replication_slots documentation mentions only max_slot_wal_keep_size >>>>>> as a condition under which the wal_status column can show unreserved or lost. >>>>>> However, since commit ac0e33136ab, idle_replication_slot_timeout can also >>>>>> cause this behavior when it is set. This has not been documented yet. >>>>>> https://www.postgresql.org/docs/devel/view-pg-replication-slots.html >>>>>> >>>>> >>>>> +1 to the doc update. >>>> >>>> Thanks for the review! >>>> >>>> >>>>>> So, how about updating the documentation to also mention >>>>>> idle_replication_slot_timeout as a factor that can cause wal_status to >>>>>> become unreserved or lost? Patch attached. >>>>>> >>>>> >>>>> Since idle_replication_slot_timeout can only cause wal_status to >>>>> become 'lost' and not 'unreserved', perhaps we can reword the sentence >>>>> slightly for clarity, suggestion - >>>>> "The last two states are seen when max_slot_wal_keep_size is >>>>> non-negative and, the 'lost' state may also appear when >>>>> idle_replication_slot_timeout is greater than zero." >>>> >>>> I was thinking that when idle_replication_slot_timeout triggers, >>>> the following functions are called, and that wal_status can become >>>> "unreserved" before ReplicationSlotRelease() runs. It's very short >>>> period, though. Am I wrong? >>>> >>>> ReplicationSlotMarkDirty(); >>>> ReplicationSlotSave(); >>>> ReplicationSlotRelease(); >>>> >>> >>> Thank you for pointing it out. >>> You are correct that while the checkpointer is in the process of >>> invalidating a slot, it sets its PID as the slot’s active_pid. During >>> this short window, if a user queries pg_replication_slot, the >>> underlying function pg_get_replication_slots will compute the >>> wal_status as 'unreserved' for the invalidated slot because the slot >>> has a valid active_pid. >>> >>> That said, it's reasonable to mention in the doc that 'unreserved' may >>> appear when idle_replication_slot_timeout is greater than zero, as >>> this can indeed happen. So, let's retain the current description. >>> >>> However, this behavior isn’t specific to >>> idle_replication_slot_timeout. For example, when a slot is being >>> invalidated due to a different cause "wal_level_insufficient", >>> 'unreserved' may also briefly appear in wal_status. >> >> Yes, and "lost" can appear for various reasons, including wal_level_insufficient, >> so it seems odd to highlight max_slot_wal_keep_size as the cause of the "lost" >> status in the note. It would probably be better to remove the mention of "lost" >> from that note. >> > > +1 Is this true starting from v16, when logical replication from standby was introduced? In other words, in v15 and earlier, only max_slot_wal_keep_size could cause the wal_status to become "unreserved" or "lost"? I'm wondering where to back-patch this fix to. >> As for "unreserved", it can also occur for different reasons, but typically, >> it happens when max_slot_wal_keep_size is set to a non-negative value. >> So it might make sense to keep the explanation focused just on "unreserved" >> and max_slot_wal_keep_size. For example: >> >> ---------------------- >> <listitem> >> <para> >> <literal>unreserved</literal> means that the slot no longer >> retains the required WAL files and some of them are to be removed at >> - the next checkpoint. This state can return >> + the next checkpoint. This can occur when >> + <xref linkend="guc-max-slot-wal-keep-size"/> is set to >> + a non-negative value. This state can return >> to <literal>reserved</literal> or <literal>extended</literal>. >> </para> >> </listitem> >> <listitem> >> ---------------------- >> >> What do you think? >> > > The change LGTM, only a minor suggestion to add "typically", as “This > can typically occur when…” to indicate that max_slot_wal_keep_size is > one possible reason, not the only one. OK. >> Also, I noticed the note that says “If <structfield>restart_lsn</structfield> >> is NULL, this field is null” seems inaccurate. For example, when "wal_removed" >> happens, restart_lsn is NULL but wal_status is "lost". So maybe we should remove >> that note as well? > > You're right, the statement is not accurate. > We could rephrase it as: "If <structfield>restart_lsn</structfield> is > NULL, this field is either null or lost." But since 'unreserved' can > also appear briefly during invalidation, it might be better to remove > it altogether. I agree with removing the description. Unless I'm missing something, it has been incorrect since at least v13, so we should back-patch this fix to all supported versions. Regards, -- Fujii Masao NTT DATA Japan Corporation
On Mon, Jun 30, 2025 at 6:12 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2025/06/30 20:32, Nisha Moond wrote: > > On Fri, Jun 27, 2025 at 5:40 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >> > >> > >> > >> On 2025/06/27 15:32, Nisha Moond wrote: > >>> On Thu, Jun 26, 2025 at 1:33 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >>>> > >>>> > >>>> > >>>> On 2025/06/26 15:46, Nisha Moond wrote: > >>>>> On Wed, Jun 25, 2025 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >>>>>> > >>>>>> Hi, > >>>>>> > >>>>>> The pg_replication_slots documentation mentions only max_slot_wal_keep_size > >>>>>> as a condition under which the wal_status column can show unreserved or lost. > >>>>>> However, since commit ac0e33136ab, idle_replication_slot_timeout can also > >>>>>> cause this behavior when it is set. This has not been documented yet. > >>>>>> https://www.postgresql.org/docs/devel/view-pg-replication-slots.html > >>>>>> > >>>>> > >>>>> +1 to the doc update. > >>>> > >>>> Thanks for the review! > >>>> > >>>> > >>>>>> So, how about updating the documentation to also mention > >>>>>> idle_replication_slot_timeout as a factor that can cause wal_status to > >>>>>> become unreserved or lost? Patch attached. > >>>>>> > >>>>> > >>>>> Since idle_replication_slot_timeout can only cause wal_status to > >>>>> become 'lost' and not 'unreserved', perhaps we can reword the sentence > >>>>> slightly for clarity, suggestion - > >>>>> "The last two states are seen when max_slot_wal_keep_size is > >>>>> non-negative and, the 'lost' state may also appear when > >>>>> idle_replication_slot_timeout is greater than zero." > >>>> > >>>> I was thinking that when idle_replication_slot_timeout triggers, > >>>> the following functions are called, and that wal_status can become > >>>> "unreserved" before ReplicationSlotRelease() runs. It's very short > >>>> period, though. Am I wrong? > >>>> > >>>> ReplicationSlotMarkDirty(); > >>>> ReplicationSlotSave(); > >>>> ReplicationSlotRelease(); > >>>> > >>> > >>> Thank you for pointing it out. > >>> You are correct that while the checkpointer is in the process of > >>> invalidating a slot, it sets its PID as the slot’s active_pid. During > >>> this short window, if a user queries pg_replication_slot, the > >>> underlying function pg_get_replication_slots will compute the > >>> wal_status as 'unreserved' for the invalidated slot because the slot > >>> has a valid active_pid. > >>> > >>> That said, it's reasonable to mention in the doc that 'unreserved' may > >>> appear when idle_replication_slot_timeout is greater than zero, as > >>> this can indeed happen. So, let's retain the current description. > >>> > >>> However, this behavior isn’t specific to > >>> idle_replication_slot_timeout. For example, when a slot is being > >>> invalidated due to a different cause "wal_level_insufficient", > >>> 'unreserved' may also briefly appear in wal_status. > >> > >> Yes, and "lost" can appear for various reasons, including wal_level_insufficient, > >> so it seems odd to highlight max_slot_wal_keep_size as the cause of the "lost" > >> status in the note. It would probably be better to remove the mention of "lost" > >> from that note. > >> > > > > +1 > > Is this true starting from v16, when logical replication from standby was introduced? > In other words, in v15 and earlier, only max_slot_wal_keep_size could cause > the wal_status to become "unreserved" or "lost"? I'm wondering where to back-patch > this fix to. > I also think we should back-patch this till v16, since that’s when additional slot invalidation causes were also introduced(commit be87200). And since then “max_slot_wal_keep_size” is no longer the sole reason for “unreserved” or “lost” status. -- Thanks, Nisha
On 2025/07/02 16:12, Fujii Masao wrote: > > > On 2025/07/01 13:52, Nisha Moond wrote: >> On Mon, Jun 30, 2025 at 6:12 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>> Is this true starting from v16, when logical replication from standby was introduced? >>> In other words, in v15 and earlier, only max_slot_wal_keep_size could cause >>> the wal_status to become "unreserved" or "lost"? I'm wondering where to back-patch >>> this fix to. >>> >> >> I also think we should back-patch this till v16, since that’s when >> additional slot invalidation causes were also introduced(commit >> be87200). And since then “max_slot_wal_keep_size” is no longer the >> sole reason for “unreserved” or “lost” status. > > Okay, I've prepared two patches: > > - 0001 removes the incorrect line: "If restart_lsn is NULL, this field is null." > This should be back-patched to v13. > - 0002 updates the description of the wal_status to reflect that max_slot_wal_keep_size > is not the only cause of the lost state. This should be back-patched to v16. > > Barrng objections, I will commit these patches. I've pushed the patches. Thanks! Regards, -- Fujii Masao NTT DATA Japan Corporation