Обсуждение: Unexpected behavior when setting "idle_replication_slot_timeout"

Поиск
Список
Период
Сортировка

Unexpected behavior when setting "idle_replication_slot_timeout"

От
Gunnar Morling
Дата:
Hi all,

I am exploring the new setting "idle_replication_slot_timeout" in Postgres 18; for testing purposes, I set the value to "30s", which, unexpectedly to me, didn't cause an idle slot to be invalidated when I triggered a checkpoint after the timeout had been reached.

The docs of the option state that the value is rounded up or down to the nearest full minute, so I reckon "30s" gets rounded down to 0, thus effectively disabling the feature. It might be less surprising to users if values between "1s" and "59s" get actually always rounded up to one minute? Arguably, that'd seem the more intuitive behavior to me. Alternatively, logging a warning might be considered for values between "1s" and "30s"? Curious what folks here think.

Thanks and all the best,

--Gunnar

Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Nisha Moond
Дата:
On Fri, Jul 4, 2025 at 4:35 PM Gunnar Morling
<gunnar.morling@googlemail.com> wrote:
>
> Hi all,
>
> I am exploring the new setting "idle_replication_slot_timeout" in Postgres 18; for testing purposes, I set the value
to"30s", which, unexpectedly to me, didn't cause an idle slot to be invalidated when I triggered a checkpoint after the
timeouthad been reached. 
>
> The docs of the option state that the value is rounded up or down to the nearest full minute, so I reckon "30s" gets
roundeddown to 0, thus effectively disabling the feature. It might be less surprising to users if values between "1s"
and"59s" get actually always rounded up to one minute? Arguably, that'd seem the more intuitive behavior to me.
Alternatively,logging a warning might be considered for values between "1s" and "30s"? Curious what folks here think. 
>

Thanks for bringing this up!

Yes, this is expected behavior, idle_replication_slot_timeout accepts
values in minutes, so a setting like "30s" is rounded down to 0,
effectively disabling the timeout, while values >= "31s" are rounded
up to 1.

This behavior isn’t specific to this GUC as Postgres generally rounds
values below a parameter’s minimum unit without a warning. For
example, wal_summary_keep_time and log_rotation_age behave the same
way.

--
Thanks,
Nisha



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Fujii Masao
Дата:

On 2025/07/04 21:14, Nisha Moond wrote:
> On Fri, Jul 4, 2025 at 4:35 PM Gunnar Morling
> <gunnar.morling@googlemail.com> wrote:
>>
>> Hi all,
>>
>> I am exploring the new setting "idle_replication_slot_timeout" in Postgres 18; for testing purposes, I set the value
to"30s", which, unexpectedly to me, didn't cause an idle slot to be invalidated when I triggered a checkpoint after the
timeouthad been reached.
 
>>
>> The docs of the option state that the value is rounded up or down to the nearest full minute, so I reckon "30s" gets
roundeddown to 0, thus effectively disabling the feature.
 

When I first tried using idle_replication_slot_timeout, I also encountered this issue.


> It might be less surprising to users if values between "1s" and "59s" get actually always rounded up to one minute?
Arguably,that'd seem the more intuitive behavior to me. Alternatively, logging a warning might be considered for values
between"1s" and "30s"? Curious what folks here think.
 
>>
> 
> Thanks for bringing this up!
> 
> Yes, this is expected behavior, idle_replication_slot_timeout accepts
> values in minutes, so a setting like "30s" is rounded down to 0,
> effectively disabling the timeout, while values >= "31s" are rounded
> up to 1.
> 
> This behavior isn’t specific to this GUC as Postgres generally rounds
> values below a parameter’s minimum unit without a warning. For
> example, wal_summary_keep_time and log_rotation_age behave the same
> way.

Right.

But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
Since at least two users (including myself) tried to set it to a value
less than 1 minute, it might worth considering changing the unit to seconds
(GUC_UNIT_S). Also which would reduces the chance of the reported trouble.

Regards,

-- 
Fujii Masao
NTT DATA Japan Corporation




Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Laurenz Albe
Дата:
On Sat, 2025-07-05 at 00:22 +0900, Fujii Masao wrote:
> On 2025/07/04 23:12, Fujii Masao wrote:
> > But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
> > Since at least two users (including myself) tried to set it to a value
> > less than 1 minute, it might worth considering changing the unit to seconds
> > (GUC_UNIT_S). Also which would reduces the chance of the reported trouble.
>
> Attached patch changes unit of idle_replication_slot_timeout to seconds.

-1

I think that the reason that several users tried to set it it less than a minute
is that they were trying to test the feature and didn't want to wait long.
I cannot imagine that anybody will want to abandon a standby server just
because it is idle for more than 30 seconds.

For me, there would be two appealing alternatives:

1. Always round up to the next minute.

2. Use the value -1 to deactivate the feature.
   Optionally, we could forbid the value 0 in this case.

Stepping back a bit, I am not happy with the documentation.  I looked at it,
trying to figure out what the parameter does, and I am none the wiser.
What exactly does it mean for a replication slot to idle?

- Does it mean that the standby is not connected?
- Does it mean that the standby is connected, but gives no feedback?
  Probably not, but I only guess that because I know that there is a different
  parameter for that.
- Does it mean that the standby is giving feedback, but that feedback doesn't
  indicate progress?

I think we could do better here.

Yours,
Laurenz Albe



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Fujii Masao
Дата:

On 2025/07/05 1:07, Laurenz Albe wrote:
> On Sat, 2025-07-05 at 00:22 +0900, Fujii Masao wrote:
>> On 2025/07/04 23:12, Fujii Masao wrote:
>>> But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
>>> Since at least two users (including myself) tried to set it to a value
>>> less than 1 minute, it might worth considering changing the unit to seconds
>>> (GUC_UNIT_S). Also which would reduces the chance of the reported trouble.
>>
>> Attached patch changes unit of idle_replication_slot_timeout to seconds.
> 
> -1
> 
> I think that the reason that several users tried to set it it less than a minute
> is that they were trying to test the feature and didn't want to wait long.
> I cannot imagine that anybody will want to abandon a standby server just
> because it is idle for more than 30 seconds.

Maybe. But changing the unit to seconds doesn't make things worse, does it?
It still allows users to set values greater than 1 minute, and also less than
1 minute for debugging or testing purposes, if needed.

Or are you suggesting we should disallow values below 1 minute?

Regards,

-- 
Fujii Masao
NTT DATA Japan Corporation




Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Gunnar Morling
Дата:
Hey all,

Thanks for all the replies!

Indeed I also wouldn't expect anyone to set this timeout to 30s for purposes other than testing or exploration. But then again, if the option can be set with second precision, is there any reason to not honor this? To me, it boils down to the principle of least surprise for users, in particular when it comes to a setting like "30s" being interpreted as disabling the feature.

> What exactly does it mean for a replication slot to idle?

I'd like to echo this; I also was/am confused by the term "idle" here; it isn't fully clear to me what it means for a slot to be idle, and in particular whether it is different from a slot being inactive.  pg_replication_slots uses the "active"/"inactive" terminology, and it seems that this is what idleness is about, as per the docs of the setting [1]:

> [Users] can force a checkpoint to promptly invalidate inactive slots. The duration of slot inactivity is calculated using the slot's pg_replication_slots.inactive_since value.

If so, "inactive_replication_slot_timeout" might be a more consistent name for that option?

Best,

--Gunnar



On Fri, 4 Jul 2025 at 18:24, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:


On 2025/07/05 1:07, Laurenz Albe wrote:
> On Sat, 2025-07-05 at 00:22 +0900, Fujii Masao wrote:
>> On 2025/07/04 23:12, Fujii Masao wrote:
>>> But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
>>> Since at least two users (including myself) tried to set it to a value
>>> less than 1 minute, it might worth considering changing the unit to seconds
>>> (GUC_UNIT_S). Also which would reduces the chance of the reported trouble.
>>
>> Attached patch changes unit of idle_replication_slot_timeout to seconds.
>
> -1
>
> I think that the reason that several users tried to set it it less than a minute
> is that they were trying to test the feature and didn't want to wait long.
> I cannot imagine that anybody will want to abandon a standby server just
> because it is idle for more than 30 seconds.

Maybe. But changing the unit to seconds doesn't make things worse, does it?
It still allows users to set values greater than 1 minute, and also less than
1 minute for debugging or testing purposes, if needed.

Or are you suggesting we should disallow values below 1 minute?

Regards,

--
Fujii Masao
NTT DATA Japan Corporation

Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Laurenz Albe
Дата:
On Sat, 2025-07-05 at 01:24 +0900, Fujii Masao wrote:
> On 2025/07/05 1:07, Laurenz Albe wrote:
> > On Sat, 2025-07-05 at 00:22 +0900, Fujii Masao wrote:
> > > On 2025/07/04 23:12, Fujii Masao wrote:
> > > > But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
> > > > Since at least two users (including myself) tried to set it to a value
> > > > less than 1 minute, it might worth considering changing the unit to seconds
> > > > (GUC_UNIT_S). Also which would reduces the chance of the reported trouble.
> > >
> > > Attached patch changes unit of idle_replication_slot_timeout to seconds.
> >
> > -1
> >
> > I think that the reason that several users tried to set it it less than a minute
> > is that they were trying to test the feature and didn't want to wait long.
> > I cannot imagine that anybody will want to abandon a standby server just
> > because it is idle for more than 30 seconds.
>
> Maybe. But changing the unit to seconds doesn't make things worse, does it?
> It still allows users to set values greater than 1 minute, and also less than
> 1 minute for debugging or testing purposes, if needed.
>
> Or are you suggesting we should disallow values below 1 minute?

I guess you are right.  There is no problem with second precision, even if
the use case in this case was artificial.
I withdraw my objection.

Gunnar Morlin wrote:
> I also was/am confused by the term "idle" here; it isn't fully clear to me
> what it means for a slot to be idle, and in particular whether it is different
> from a slot being inactive.  [...]
>
> If so, "inactive_replication_slot_timeout" might be a more consistent name
> for that option?

Perhaps.  I must say that I don't care so much about the name, as long as the
documentation doesn't leave any doubts.

Yours,
Laurenz Albe



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Amit Kapila
Дата:
On Fri, Jul 4, 2025 at 9:54 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> On 2025/07/05 1:07, Laurenz Albe wrote:
> > On Sat, 2025-07-05 at 00:22 +0900, Fujii Masao wrote:
> >> On 2025/07/04 23:12, Fujii Masao wrote:
> >>> But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
> >>> Since at least two users (including myself) tried to set it to a value
> >>> less than 1 minute, it might worth considering changing the unit to seconds
> >>> (GUC_UNIT_S). Also which would reduces the chance of the reported trouble.
> >>
> >> Attached patch changes unit of idle_replication_slot_timeout to seconds.
> >
> > -1
> >
> > I think that the reason that several users tried to set it it less than a minute
> > is that they were trying to test the feature and didn't want to wait long.
> > I cannot imagine that anybody will want to abandon a standby server just
> > because it is idle for more than 30 seconds.
>
> Maybe. But changing the unit to seconds doesn't make things worse, does it?
> It still allows users to set values greater than 1 minute, and also less than
> 1 minute for debugging or testing purposes, if needed.
>

We expect the value of this variable to be in hours or, in some cases,
days. Specifying in seconds would be inconvenient for users. Now, I
agree there is a value in testing/debugging to allow it to be seconds,
but the same could be said for other variables whose units are in
minutes, like log_rotation_age and wal_summary_keep_time.

> Or are you suggesting we should disallow values below 1 minute?
>

We should be consistent with other similar GUC variables whose unit is
in GUC_UNIT_MIN.

--
With Regards,
Amit Kapila.



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Amit Kapila
Дата:
On Fri, Jul 4, 2025 at 9:37 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> Stepping back a bit, I am not happy with the documentation.  I looked at it,
> trying to figure out what the parameter does, and I am none the wiser.
> What exactly does it mean for a replication slot to idle?
>
> - Does it mean that the standby is not connected?
>

It means the above. The slot is used for purposes other than the
standby as well, so we can't mention something only specific to the
standby. This parameter mainly cares when the slot becomes inactive.
You can get more information by looking at
pg_replication_slots.inactive_since (as indicated by sentence: "The
duration of slot inactivity is calculated using the slot's
pg_replication_slots.inactive_since value.").

>
> I think we could do better here.
>

Sure, feel free to propose what you think makes it better.

--
With Regards,
Amit Kapila.



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Laurenz Albe
Дата:
On Sat, 2025-07-05 at 09:52 +0530, Amit Kapila wrote:
> On Fri, Jul 4, 2025 at 9:37 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> > What exactly does it mean for a replication slot to idle?
> >
> > - Does it mean that the standby is not connected?
>
> It means the above. The slot is used for purposes other than the
> standby as well, so we can't mention something only specific to the
> standby.
>
> > I think we could do better here.
>
> Sure, feel free to propose what you think makes it better.

Done in the attached patch.

> We expect the value of this variable to be in hours or, in some cases,
> days. Specifying in seconds would be inconvenient for users.

I don't buy that argument.  Specifying shared_buffers in 8kB blocks
would be quite inconvenient to most users, but I don't remember any
complaints about it.  One of the nice things about the Grand Unified
Configuration is that you can use units different from the default.

On the other hand, if the behavior is clearly documented, as I have
tried to do with my patch, it should be fine.  So I'll rest my case if
you apply my patch.

Yours,
Laurenz Albe

Вложения

Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
"David G. Johnston"
Дата:
On Friday, July 4, 2025, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
One of the nice things about the Grand Unified
Configuration is that you can use units different from the default.

Supplying the value as '2 h', SHOW will display 2h while pg_settings.setting shows 120 (min).

David J.

Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
"David G. Johnston"
Дата:
On Fri, Jul 4, 2025 at 10:35 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On the other hand, if the behavior is clearly documented, as I have
tried to do with my patch, it should be fine.  So I'll rest my case if
you apply my patch.


We should clearly document how rounding works in section 19.1.1 (which we mostly do; "If the parameter is of integer type, a final rounding to integer occurs after any unit conversion.") and not in every time-related setting that chooses to use something larger than microseconds.  So, 30s is 'unit converted' up to 0.5 minutes (not explicitly explained...) then rounded to zero (which is odd, half normally rounds up...).  I'm against cluttering up the individual settings docs with this detail.

If the change from idle to inactive is needed in the description we should just admit we named it wrong in the first place.  As-is, the description matches the name and the callout to the field in the second paragraph precisely clears up what this setting at least cares about.  The reader should be directed to how that field is computed should they need clarification.

Thus, I'd accept but not find required the idle/inactive wording change to any of various degrees; and would ask that any clarification regarding generic setting value interpretation be relegated to 19.1.1 where all such settings can benefit.

David J.






Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Laurenz Albe
Дата:
On Fri, 2025-07-04 at 23:16 -0700, David G. Johnston wrote:
> We should clearly document how rounding works in section 19.1.1
> (which we mostly do; "If the parameter is of integer type, a final rounding
> to integer occurs after any unit conversion.") and not in every
> time-related setting that chooses to use something larger than microseconds.
> So, 30s is 'unit converted' up to 0.5 minutes (not explicitly explained...)
> then rounded to zero (which is odd, half normally rounds up...).
> I'm against cluttering up the individual settings docs with this detail.

That's fine with me; do you have a patch?

> If the change from idle to inactive is needed in the description we should
> just admit we named it wrong in the first place.

I had half a mind to propose renaming the parameter, but I shied from
a lengthy bikeshedding discussion.  Reading up on the archives, I see
that Peter Smith proposed the term "idle" in [1], and nobody had any
problem with it.

For the record: I would be much more happy if the parameter were called
"inactive_replication_slot_timeout", since we use the term "active" in
"pg_replication_slots".  Also, we call connections "idle" when they are
established, but doing nothing, and this parameter is about disconnected
replication connections.

>                                                   As-is, the description
> matches the name and the callout to the field in the second paragraph
> precisely clears up what this setting at least cares about.  The reader
> should be directed to how that field is computed should they need clarification.
>
> Thus, I'd accept but not find required the idle/inactive wording change to
> any of various degrees; and would ask that any clarification regarding
> generic setting value interpretation be relegated to 19.1.1 where all
> such settings can benefit.

I am sure that there is some information in these sentences, but I cannot
extract it, even after reading them twice.

Yours,
Laurenz Albe

 [1]: https://postgr.es/m/CAHut%2BPtHbYNxPvtMfs7jARbsVcFXL1%3DC9SO3Q93NgVDgbKN7LQ%40mail.gmail.com



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
"David G. Johnston"
Дата:
On Saturday, July 5, 2025, Laurenz Albe <laurenz.albe@cybertec.at> wrote:

I am sure that there is some information in these sentences, but I cannot
extract it, even after reading them twice.


Maybe: “During checkpoint if the interval since pg_replication_slots.inactive_since and now is larger than this value pg_replication_slots.conflicting is set to true and pg_replication_slots.inactive_reason is set to ‘timeout’.  See section wherever for more information on handling conflicted slots.”

Heck, writing this, “idle” is probably better, a slot can recover from being idle on its own but usually inactive would imply having to do something to make it active again.

IMO our documentation for replication has serious flaws but this particular area is clear enough.  Like any good timeout the slot is killed if it goes unused “idle” for some length of time.  We can describe that in many ways but the name, to me, is fully descriptive and consistent with other timeouts like “idle_in_transaction_timeout”.

David J.

Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Fujii Masao
Дата:

On 2025/07/05 13:06, Amit Kapila wrote:
> On Fri, Jul 4, 2025 at 9:54 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>> On 2025/07/05 1:07, Laurenz Albe wrote:
>>> On Sat, 2025-07-05 at 00:22 +0900, Fujii Masao wrote:
>>>> On 2025/07/04 23:12, Fujii Masao wrote:
>>>>> But I wonder why the current unit of this GUC is minutes (GUC_UNIT_MIN).
>>>>> Since at least two users (including myself) tried to set it to a value
>>>>> less than 1 minute, it might worth considering changing the unit to seconds
>>>>> (GUC_UNIT_S). Also which would reduces the chance of the reported trouble.
>>>>
>>>> Attached patch changes unit of idle_replication_slot_timeout to seconds.
>>>
>>> -1
>>>
>>> I think that the reason that several users tried to set it it less than a minute
>>> is that they were trying to test the feature and didn't want to wait long.
>>> I cannot imagine that anybody will want to abandon a standby server just
>>> because it is idle for more than 30 seconds.
>>
>> Maybe. But changing the unit to seconds doesn't make things worse, does it?
>> It still allows users to set values greater than 1 minute, and also less than
>> 1 minute for debugging or testing purposes, if needed.
>>
> 
> We expect the value of this variable to be in hours or, in some cases,
> days. Specifying in seconds would be inconvenient for users. Now, I
> agree there is a value in testing/debugging to allow it to be seconds,
> but the same could be said for other variables whose units are in
> minutes, like log_rotation_age and wal_summary_keep_time.

Even if we change the unit to seconds (GUC_UNIT_S), it's still possible
to set the timeout values such as hours or days. For example, even with
the patch, we can set the timeout to 365 days:

   =# ALTER SYSTEM SET idle_replication_slot_timeout TO '365d';
   =# SELECT pg_reload_conf();
   =# SHOW idle_replication_slot_timeout ;
    idle_replication_slot_timeout
   -------------------------------
    365d
   (1 row)

Do you see any serious downside to switching the unit to seconds? I don't
think it introduces any serious issues. On the contrary, it gives users finer
control over the timeout, and additionally works around the issue
that we're discussing here.

Regards,

-- 
Fujii Masao
NTT DATA Japan Corporation




Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Fujii Masao
Дата:

On 2025/07/05 15:16, David G. Johnston wrote:
> On Fri, Jul 4, 2025 at 10:35 PM Laurenz Albe <laurenz.albe@cybertec.at <mailto:laurenz.albe@cybertec.at>> wrote:
> 
>     On the other hand, if the behavior is clearly documented, as I have
>     tried to do with my patch, it should be fine.  So I'll rest my case if
>     you apply my patch.
> 
> 
> We should clearly document how rounding works in section 19.1.1 (which we mostly do; "If the parameter is of integer
type,a final rounding to integer occurs after any unit conversion.") and not in every time-related setting that chooses
touse something larger than microseconds.  So, 30s is 'unit converted' up to 0.5 minutes (not explicitly explained...)
thenrounded to zero (which is odd, half normally rounds up...).
 

This happens because in this case rounding is done using rint(3),
which uses banker's rounding by default. While the rint(3)'s rounding method
can be changed with fesetround(), PostgreSQL doesn't seem to change it.
So 0.5 rounds to 0, 1.5 and 2.5 round to 2, 3.5 and 4.5 round to 4, and so on.

Regards,

-- 
Fujii Masao
NTT DATA Japan Corporation




Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
"David G. Johnston"
Дата:
On Saturday, July 5, 2025, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

Do you see any serious downside to switching the unit to seconds? I don't
think it introduces any serious issues. On the contrary, it gives users finer
control over the timeout, and additionally works around the issue
that we're discussing here.

I do not, and would rather we make the change.  Minutes are an unconventional base unit for time in our world and should be avoided.

David J.

Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
"David G. Johnston"
Дата:
On Saturday, July 5, 2025, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:


On 2025/07/05 15:16, David G. Johnston wrote:
On Fri, Jul 4, 2025 at 10:35 PM Laurenz Albe <laurenz.albe@cybertec.at <mailto:laurenz.albe@cybertec.at>> wrote:

    On the other hand, if the behavior is clearly documented, as I have
    tried to do with my patch, it should be fine.  So I'll rest my case if
    you apply my patch.


We should clearly document how rounding works in section 19.1.1 (which we mostly do; "If the parameter is of integer type, a final rounding to integer occurs after any unit conversion.") and not in every time-related setting that chooses to use something larger than microseconds.  So, 30s is 'unit converted' up to 0.5 minutes (not explicitly explained...) then rounded to zero (which is odd, half normally rounds up...).

This happens because in this case rounding is done using rint(3),
which uses banker's rounding by default. While the rint(3)'s rounding method
can be changed with fesetround(), PostgreSQL doesn't seem to change it.
So 0.5 rounds to 0, 1.5 and 2.5 round to 2, 3.5 and 4.5 round to 4, and so on.

Right, a.k.a., half-even rounding.  So, not odd.

David J.

Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Tom Lane
Дата:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Saturday, July 5, 2025, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> Do you see any serious downside to switching the unit to seconds? I don't
>> think it introduces any serious issues. On the contrary, it gives users
>> finer
>> control over the timeout, and additionally works around the issue
>> that we're discussing here.

> I do not, and would rather we make the change.  Minutes are an
> unconventional base unit for time in our world and should be avoided.

By my count, there are ten GUCs declared with GUC_UNIT_S and three
with GUC_UNIT_MIN.  I'd say that there may be some lean towards
seconds but David's argument seems like pure hyperbole.

I'm kind of down on changing the unit, because it will *silently*
break configuration files where the value was set without a unit.

May I suggest an alternative?  We could change the variable from int
to float type and continue to specify it in minutes.  That will have
exactly zero compatibility impact, it allows sub-minute values to
be selected at need, and it removes the need for hair-splitting
documentation about what the rounding rules are.

We previously did the same with vacuum_cost_delay to avoid worries
about how to specify sub-millisecond precision for that.  So the
infrastructure is already in place, I think.  The patch will be
different from what is proposed but should need to touch pretty
much the same places.

            regards, tom lane



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Tom Lane
Дата:
I wrote:
> I'm kind of down on changing the unit, because it will *silently*
> break configuration files where the value was set without a unit.

Oh, wait a second.  I just noticed that this variable is new in v18.
So there won't be a compatibility issue as long as we change it in
v18 too.  So changing it to a base of seconds seems sufficient,
and more in line with existing practice:

regression=# select name,unit from pg_settings where name like '%timeout';
                name                 | unit 
-------------------------------------+------
 archive_timeout                     | s
 authentication_timeout              | s
 checkpoint_timeout                  | s
 deadlock_timeout                    | ms
 idle_in_transaction_session_timeout | ms
 idle_replication_slot_timeout       | min
 idle_session_timeout                | ms
 lock_timeout                        | ms
 statement_timeout                   | ms
 tcp_user_timeout                    | ms
 transaction_timeout                 | ms
 wal_receiver_timeout                | ms
 wal_sender_timeout                  | ms
(13 rows)

Using 'ms' seems clearly overkill, but there's precedent for
timeouts measured in seconds.

            regards, tom lane



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Fujii Masao
Дата:

On 2025/07/06 1:13, Tom Lane wrote:
> I wrote:
>> I'm kind of down on changing the unit, because it will *silently*
>> break configuration files where the value was set without a unit.
> 
> Oh, wait a second.  I just noticed that this variable is new in v18.
> So there won't be a compatibility issue as long as we change it in
> v18 too.  So changing it to a base of seconds seems sufficient,
> and more in line with existing practice:

Yes!

It seems there are currently no clear objections to the patch that
changes the unit of idle_replication_slot_timeout to seconds.
If that's the case, I'm planning to commit it to both master and v18.

Regards,

-- 
Fujii Masao
NTT DATA Japan Corporation




Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Laurenz Albe
Дата:
On Sat, 2025-07-05 at 12:13 -0400, Tom Lane wrote:
> Oh, wait a second.  I just noticed that this variable is new in v18.
> So there won't be a compatibility issue as long as we change it in
> v18 too.  So changing it to a base of seconds seems sufficient,
> and more in line with existing practice:
>
> Using 'ms' seems clearly overkill, but there's precedent for
> timeouts measured in seconds.

Here is a patch that changes the unit to seconds, like Fujii Masao's
patch upthread.  In addition, it removes the now unnecessary injection
point for the TAP test and tries to improve the documentation to
address my complaints.

Yours,
Laurenz Albe

Вложения

Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Amit Kapila
Дата:
On Sat, Jul 5, 2025 at 9:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > I'm kind of down on changing the unit, because it will *silently*
> > break configuration files where the value was set without a unit.
>
> Oh, wait a second.  I just noticed that this variable is new in v18.
> So there won't be a compatibility issue as long as we change it in
> v18 too.  So changing it to a base of seconds seems sufficient,
> and more in line with existing practice:
>
> regression=# select name,unit from pg_settings where name like '%timeout';
>                 name                 | unit
> -------------------------------------+------
>  archive_timeout                     | s
>  authentication_timeout              | s
>  checkpoint_timeout                  | s
>  deadlock_timeout                    | ms
>  idle_in_transaction_session_timeout | ms
>  idle_replication_slot_timeout       | min
>  idle_session_timeout                | ms
>  lock_timeout                        | ms
>  statement_timeout                   | ms
>  tcp_user_timeout                    | ms
>  transaction_timeout                 | ms
>  wal_receiver_timeout                | ms
>  wal_sender_timeout                  | ms
> (13 rows)
>
> Using 'ms' seems clearly overkill, but there's precedent for
> timeouts measured in seconds.
>

Agreed, but it is unclear to me in how many of these cases users would
always need to give values in hours or more. In case of
idle_replication_slot_timeout, there is less chance that many people
would like to use the value in seconds in production. Say, tomorrow,
we want to assign some default value to this GUC, it is likely that we
would choose some value in minutes. Based on this thinking, we decided
to keep the unit of this GUC as minutes. But I see the theory of
consistency with a larger number of similar-sounding GUCs and the
developer testing theory to keep this value in seconds. So, if more
people prefer to use the unit as seconds and are not convinced by the
points I made, we should go ahead and change it.

--
With Regards,
Amit Kapila.



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Amit Kapila
Дата:
On Tue, Jul 8, 2025 at 10:23 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Sat, 2025-07-05 at 12:13 -0400, Tom Lane wrote:
> > Oh, wait a second.  I just noticed that this variable is new in v18.
> > So there won't be a compatibility issue as long as we change it in
> > v18 too.  So changing it to a base of seconds seems sufficient,
> > and more in line with existing practice:
> >
> > Using 'ms' seems clearly overkill, but there's precedent for
> > timeouts measured in seconds.
>
> Here is a patch that changes the unit to seconds, like Fujii Masao's
> patch upthread.  In addition, it removes the now unnecessary injection
> point for the TAP test and tries to improve the documentation to
> address my complaints.
>

+# Wait a bit for the replication slots to become invalid
+$node->safe_psql('postgres', "SELECT pg_sleep(2)");

Is it a good idea to add 2s to 'make check-world' for one particular test?

--
With Regards,
Amit Kapila.



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Laurenz Albe
Дата:
On Wed, 2025-07-09 at 12:05 +0530, Amit Kapila wrote:
> On Tue, Jul 8, 2025 at 10:23 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> >
> > On Sat, 2025-07-05 at 12:13 -0400, Tom Lane wrote:
> > > Oh, wait a second.  I just noticed that this variable is new in v18.
> > > So there won't be a compatibility issue as long as we change it in
> > > v18 too.  So changing it to a base of seconds seems sufficient,
> > > and more in line with existing practice:
> > >
> > > Using 'ms' seems clearly overkill, but there's precedent for
> > > timeouts measured in seconds.
> >
> > Here is a patch that changes the unit to seconds, like Fujii Masao's
> > patch upthread.  In addition, it removes the now unnecessary injection
> > point for the TAP test and tries to improve the documentation to
> > address my complaints.
> >
>
> +# Wait a bit for the replication slots to become invalid
> +$node->safe_psql('postgres', "SELECT pg_sleep(2)");
>
> Is it a good idea to add 2s to 'make check-world' for one particular test?

That is debatable.  I thought that adding two seconds would be worth the
advantage of removing the injection point that only fakes the timeout.
We'd get better test coverage.

But if the people who run the regression tests dozens of times every day
object to the two seconds, I won't insist on this part of the patch.

Yours,
Laurenz Albe



RE: Unexpected behavior when setting "idle_replication_slot_timeout"

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Laurenz, Amit,

> > +# Wait a bit for the replication slots to become invalid
> > +$node->safe_psql('postgres', "SELECT pg_sleep(2)");
> >
> > Is it a good idea to add 2s to 'make check-world' for one particular test?
> 
> That is debatable.  I thought that adding two seconds would be worth the
> advantage of removing the injection point that only fakes the timeout.
> We'd get better test coverage.
> 
> But if the people who run the regression tests dozens of times every day
> object to the two seconds, I won't insist on this part of the patch.

I can understand both concerns. One benefit not to use the injection point is
that any machines can run the test. Currently some BF machines do not enable
the injection point, thus the feature has not been verified on these env.
However, unconditional sleep makes the test longer.

Based on that, I want to propose the hybrid approach; use injection_point_attach()
if possible, otherwise wait few seconds. One concern is that the test code may be
bit complex, but below codes was enough on my environment.

```
# Check if the 'injection_points' extension is available, as it may be
# possible that this script is run with installcheck, where the module
# would not be installed by default.
if ($node->check_extension('injection_points'))
{
    # Register an injection point on the node to forcibly cause a slot
    # invalidation due to idle_timeout
    $node->safe_psql('postgres', 'CREATE EXTENSION injection_points;');

    $node->safe_psql('postgres',
        "SELECT injection_points_attach('slot-timeout-inval', 'error');");
}
else
{
    # Otherwise wait a bit for the replication slots to become invalid
    $node->safe_psql('postgres', "SELECT pg_sleep(2)");
}
```

Attached patch implements the idea, which can be applied atop v2. How do you think?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Laurenz Albe
Дата:
On Wed, 2025-07-09 at 07:24 +0000, Hayato Kuroda (Fujitsu) wrote:
> > > +# Wait a bit for the replication slots to become invalid
> > > +$node->safe_psql('postgres', "SELECT pg_sleep(2)");
> > >
> > > Is it a good idea to add 2s to 'make check-world' for one particular test?
> >
> > That is debatable.  I thought that adding two seconds would be worth the
> > advantage of removing the injection point that only fakes the timeout.
> > We'd get better test coverage.
> >
> > But if the people who run the regression tests dozens of times every day
> > object to the two seconds, I won't insist on this part of the patch.
>
> I can understand both concerns. One benefit not to use the injection point is
> that any machines can run the test. Currently some BF machines do not enable
> the injection point, thus the feature has not been verified on these env.
> However, unconditional sleep makes the test longer.
>
> Based on that, I want to propose the hybrid approach; use injection_point_attach()
> if possible, otherwise wait few seconds. One concern is that the test code may be
> bit complex, but below codes was enough on my environment.
>
> ```
> # Check if the 'injection_points' extension is available, as it may be
> # possible that this script is run with installcheck, where the module
> # would not be installed by default.
> if ($node->check_extension('injection_points'))
> {
>     # Register an injection point on the node to forcibly cause a slot
>     # invalidation due to idle_timeout
>     $node->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
>
>     $node->safe_psql('postgres',
>         "SELECT injection_points_attach('slot-timeout-inval', 'error');");
> }
> else
> {
>     # Otherwise wait a bit for the replication slots to become invalid
>     $node->safe_psql('postgres', "SELECT pg_sleep(2)");
> }
> ```
>
> Attached patch implements the idea, which can be applied atop v2. How do you think?

I have no objections.

I'll leave the decision to Fujii Masao or whoever wants to commit something
in this area.

Yours,
Laurenz Albe



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Fujii Masao
Дата:

On 2025/07/09 18:21, Laurenz Albe wrote:
> On Wed, 2025-07-09 at 07:24 +0000, Hayato Kuroda (Fujitsu) wrote:
>>>> +# Wait a bit for the replication slots to become invalid
>>>> +$node->safe_psql('postgres', "SELECT pg_sleep(2)");
>>>>
>>>> Is it a good idea to add 2s to 'make check-world' for one particular test?
>>>
>>> That is debatable.  I thought that adding two seconds would be worth the
>>> advantage of removing the injection point that only fakes the timeout.
>>> We'd get better test coverage.
>>>
>>> But if the people who run the regression tests dozens of times every day
>>> object to the two seconds, I won't insist on this part of the patch.
>>
>> I can understand both concerns. One benefit not to use the injection point is
>> that any machines can run the test. Currently some BF machines do not enable
>> the injection point, thus the feature has not been verified on these env.
>> However, unconditional sleep makes the test longer.
>>
>> Based on that, I want to propose the hybrid approach; use injection_point_attach()
>> if possible, otherwise wait few seconds. One concern is that the test code may be
>> bit complex, but below codes was enough on my environment.
>>
>> ```
>> # Check if the 'injection_points' extension is available, as it may be
>> # possible that this script is run with installcheck, where the module
>> # would not be installed by default.
>> if ($node->check_extension('injection_points'))
>> {
>>     # Register an injection point on the node to forcibly cause a slot
>>     # invalidation due to idle_timeout
>>     $node->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
>>
>>     $node->safe_psql('postgres',
>>         "SELECT injection_points_attach('slot-timeout-inval', 'error');");
>> }
>> else
>> {
>>     # Otherwise wait a bit for the replication slots to become invalid
>>     $node->safe_psql('postgres', "SELECT pg_sleep(2)");
>> }
>> ```
>>
>> Attached patch implements the idea, which can be applied atop v2. How do you think?
> 
> I have no objections.

So the purpose of this "hybrid approach" is to test idle_replication_slot_timeout
even in environments where the injection point isn't available? Do we really need
this additional test? It seems to me that the test using the injection point is
sufficient.

Regards,

-- 
Fujii Masao
NTT DATA Japan Corporation




Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Laurenz Albe
Дата:
On Wed, 2025-07-09 at 19:19 +0900, Fujii Masao wrote:
> On 2025/07/09 18:21, Laurenz Albe wrote:
> > On Wed, 2025-07-09 at 07:24 +0000, Hayato Kuroda (Fujitsu) wrote:
> > > >
> > > Based on that, I want to propose the hybrid approach; use injection_point_attach()
> > > if possible, otherwise wait few seconds. One concern is that the test code may be
> > > bit complex, but below codes was enough on my environment.
> > >
> > > Attached patch implements the idea, which can be applied atop v2. How do you think?
> >
> > I have no objections.
>
> So the purpose of this "hybrid approach" is to test idle_replication_slot_timeout
> even in environments where the injection point isn't available? Do we really need
> this additional test? It seems to me that the test using the injection point is
> sufficient.

If you think that using the injection point is sufficient, go for it.

That means slightly less test coverage; the injection point skips the call to
TimestampDifferenceExceedsSeconds() to determine if the slot's "inactive_since"
is old enough and just unconditionally returns RS_INVAL_IDLE_TIMEOUT.
But yes, that is not a big loss.

My patch has the advantage of
a) simplifying the code,
b) running the test even if injection points are not available and
c) testing more of the code
at the price of two extra seconds for the TAP tests.

The "hybrid approach" saves the two seconds on system with injection points,
but cannot simplify the code.

I personally can live with all three; my personal preference is my patch.

Yours,
Laurenz Albe



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Amit Kapila
Дата:
On Wed, Jul 9, 2025 at 8:47 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Wed, 2025-07-09 at 19:19 +0900, Fujii Masao wrote:
> > On 2025/07/09 18:21, Laurenz Albe wrote:
> > > On Wed, 2025-07-09 at 07:24 +0000, Hayato Kuroda (Fujitsu) wrote:
> > > > >
> > > > Based on that, I want to propose the hybrid approach; use injection_point_attach()
> > > > if possible, otherwise wait few seconds. One concern is that the test code may be
> > > > bit complex, but below codes was enough on my environment.
> > > >
> > > > Attached patch implements the idea, which can be applied atop v2. How do you think?
> > >
> > > I have no objections.
> >
> > So the purpose of this "hybrid approach" is to test idle_replication_slot_timeout
> > even in environments where the injection point isn't available? Do we really need
> > this additional test? It seems to me that the test using the injection point is
> > sufficient.
>
> If you think that using the injection point is sufficient, go for it.
>
> That means slightly less test coverage; the injection point skips the call to
> TimestampDifferenceExceedsSeconds() to determine if the slot's "inactive_since"
> is old enough and just unconditionally returns RS_INVAL_IDLE_TIMEOUT.
> But yes, that is not a big loss.
>
> My patch has the advantage of
> a) simplifying the code,
> b) running the test even if injection points are not available and
> c) testing more of the code
> at the price of two extra seconds for the TAP tests.
>

I think an additional two seconds is not worth this test, especially
when we have another way to test most of it, and it could be
noticeable overhead for developers who need to run check-world
multiple times in a day. If we want to proceed with changing the
units, we can leave the existing test as it is. BTW, as mentioned
yesterday, I still feel keeping minutes as a unit for
idle_replication_slot_timeout has merits, but I surrender my case
based on what most other people feel is the right thing to do.

> The "hybrid approach" saves the two seconds on system with injection points,
> but cannot simplify the code.
>

I feel that would be a needless complexity in the test.

--
With Regards,
Amit Kapila.



RE: Unexpected behavior when setting "idle_replication_slot_timeout"

От
"Hayato Kuroda (Fujitsu)"
Дата:
> > The "hybrid approach" saves the two seconds on system with injection points,
> > but cannot simplify the code.
> >
> 
> I feel that would be a needless complexity in the test.

Yeah, it is the down side of the approach. If it do not have enough advantage, let's drop the idea.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Laurenz Albe
Дата:
On Thu, 2025-07-10 at 05:27 +0000, Hayato Kuroda (Fujitsu) wrote:
> > > The "hybrid approach" saves the two seconds on system with injection points,
> > > but cannot simplify the code.
> >
> > I feel that would be a needless complexity in the test.
>
> Yeah, it is the down side of the approach. If it do not have enough advantage, let's drop the idea.

Great, then I think that we have consensus.
Let's change to seconds and leave the injectsion points and tests as they are.

I'd still like to see the documentation improved, like my patch tried to do.

Yours,
Laurenz Albe



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Fujii Masao
Дата:

On 2025/07/10 14:51, Laurenz Albe wrote:
> On Thu, 2025-07-10 at 05:27 +0000, Hayato Kuroda (Fujitsu) wrote:
>>>> The "hybrid approach" saves the two seconds on system with injection points,
>>>> but cannot simplify the code.
>>>
>>> I feel that would be a needless complexity in the test.
>>
>> Yeah, it is the down side of the approach. If it do not have enough advantage, let's drop the idea.
> 
> Great, then I think that we have consensus.
> Let's change to seconds and leave the injectsion points and tests as they are.

+1


> I'd still like to see the documentation improved, like my patch tried to do.

Agreed.

I created this as a separate patch from the one that changes the unit,
since they serve different purposes. Both patches are attached.
I'm planning to commit them.

Regards,

-- 
Fujii Masao
NTT DATA Japan Corporation

Вложения

Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Laurenz Albe
Дата:
On Thu, 2025-07-10 at 19:27 +0900, Fujii Masao wrote:
> I created this as a separate patch from the one that changes the unit,
> since they serve different purposes. Both patches are attached.
> I'm planning to commit them.

Your Patches look good for me.

Yours,
Laurenz Albe



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Fujii Masao
Дата:

On 2025/07/10 20:58, Laurenz Albe wrote:
> On Thu, 2025-07-10 at 19:27 +0900, Fujii Masao wrote:
>> I created this as a separate patch from the one that changes the unit,
>> since they serve different purposes. Both patches are attached.
>> I'm planning to commit them.
> 
> Your Patches look good for me.

Thanks for the review! I've pushed both patches.

Regards,

-- 
Fujii Masao
NTT DATA Japan Corporation




Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Laurenz Albe
Дата:
On Fri, 2025-07-11 at 08:48 +0900, Fujii Masao wrote:
> Thanks for the review! I've pushed both patches.

Thank you for taking care of this!

Yours,
Laurenz Albe



Re: Unexpected behavior when setting "idle_replication_slot_timeout"

От
Gunnar Morling
Дата:
Thanks a lot to you Fujii, and everyone else, for addressing this!

--Gunnar


On Fri, 11 Jul 2025 at 04:32, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Fri, 2025-07-11 at 08:48 +0900, Fujii Masao wrote:
> Thanks for the review! I've pushed both patches.

Thank you for taking care of this!

Yours,
Laurenz Albe