Discussion: Re: Add CHECK_FOR_INTERRUPTS in polling loop code path in XactLockTableWait
Hi Fujii,
Thanks for the review.

> Just an idea: how about calling pgstat_report_wait_start() and _end() around pg_usleep(), similar to what WaitExceedsMaxStandbyDelay() does?
I was thinking about this as well, but the question is what we should report the wait event as here. The one that makes sense to me is PG_WAIT_LOCK, but that wouldn't align with what pg_locks displays, since there is no actual lock acquired on the standby.
> Also, would waiting only 1ms per loop cycle be too aggressive?
I agree that it's aggressive for this case. But XactLockTableWait/ConditionalXactLockTableWait seem to be used in other places, including in heapam, so I'm hesitant to change this behaviour for all of them. Should we have different "wait logic" for this case?
Kevin
On Mon, May 26, 2025 at 9:02 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2025/05/24 5:41, Kevin K Biju wrote:
> Hi,
>
> While creating a logical replication slot, we wait for older transactions to complete to reach a "consistent point", which can take a while on busy databases. If we're creating a slot on a primary instance, it's pretty clear that we're waiting on a transaction:
>
>
> postgres=# SELECT pid,wait_event_type,wait_event,state,query FROM pg_stat_activity WHERE pid=804;
> pid | wait_event_type | wait_event | state | query
> -----+-----------------+---------------+--------+----------------------------------------------------------------
> 804 | Lock | transactionid | active | SELECT pg_create_logical_replication_slot('wow1', 'pgoutput');
> (1 row)
> postgres=# SELECT locktype,transactionid,pid,mode FROM pg_locks WHERE pid=804 AND granted='f';
> locktype | transactionid | pid | mode
> ---------------+---------------+-----+-----------
> transactionid | 6077 | 804 | ShareLock
>
> However, creating a slot on a hot standby behaves very differently when blocked on a transaction. Querying pg_stat_activity doesn't give us any indication on what the issue is:
>
>
> postgres=# SELECT pg_is_in_recovery();
> pg_is_in_recovery
> -------------------
> t
> (1 row)
> postgres=# SELECT pid,wait_event_type,wait_event,state,query FROM pg_stat_activity WHERE pid=5074;
> pid | wait_event_type | wait_event | state | query
> ------+-----------------+------------+--------+----------------------------------------------------------------
> 5074 | | | active | SELECT pg_create_logical_replication_slot('wow1', 'pgoutput');
>
> And more importantly, a backend in this state cannot be terminated via the usual methods and sometimes requires a server restart.
>
>
> postgres=# SELECT pg_cancel_backend(5074), pg_terminate_backend(5074);
> pg_cancel_backend | pg_terminate_backend
> -------------------+----------------------
> t | t
> (1 row)
> postgres=# SELECT pid,wait_event_type,wait_event,state,query FROM pg_stat_activity WHERE pid=5074;
> pid | wait_event_type | wait_event | state | query
> ------+-----------------+------------+--------+----------------------------------------------------------------
> 5074 | | | active | SELECT pg_create_logical_replication_slot('wow1', 'pgoutput');
>
> I've taken a look at the code that "awaits" on transactions and the function of interest is lmgr.c/XactLockTableWait. On a primary, it is able to acquire a ShareLock on the xid and the lock manager does a good job on making this wait efficient as well as visible externally. However, on a standby, it cannot wait on the xid since it is not running the transaction. However, it knows the transaction is still running from KnownAssignedXids, and then ends up on this codepath:
>
> *
> * Some uses of this function don't involve tuple visibility -- such
> * as when building snapshots for logical decoding. It is possible to
> * see a transaction in ProcArray before it registers itself in the
> * locktable. The topmost transaction in that case is the same xid,
> * so we try again after a short sleep. (Don't sleep the first time
> * through, to avoid slowing down the normal case.)
> */
> if (!first)
> pg_usleep(1000L);
>
> The attached comment suggests that this sleep was only meant to be hit a few times while we wait for the lock to appear so we can wait on it. However, in the hot standby case, this ends up becoming a polling loop since the lock will never appear. There is no interrupt processing in this loop either and so the only way out is to wait for the transaction on the primary to complete.
I agree with your analysis. It makes sense to me.
> Attached is a patch that adds CHECK_FOR_INTERRUPTS before the sleep in XactLockTableWait and ConditionalXactLockTableWait. This allows backends waiting on a transaction during slot creation on a standby to be interrupted.
+1
> It would also be nice if there was a way for users to tell what we're waiting on (maybe a different wait event?) but I'd like input on that.
Just an idea: how about calling pgstat_report_wait_start() and _end() around
pg_usleep(), similar to what WaitExceedsMaxStandbyDelay() does?
Since this is more of an improvement than a bug fix, I think
it would be better to make it a separate patch from the CFI addition.
Also, would waiting only 1ms per loop cycle be too aggressive? If so, maybe
we could increase the sleep time gradually during recovery (but not beyond
1 second), again similar to WaitExceedsMaxStandbyDelay().
Regards,
--
Fujii Masao
NTT DATA Japan Corporation
Hi Fujii,
The WaitEvent sounds good to me; I will submit a separate patch for that when I find time.
Kevin
On Wed, May 28, 2025 at 5:06 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
On 2025/05/27 4:43, Kevin K Biju wrote:
> Hi Fujii,
>
> Thanks for the review.
So unless there are any objections, I'm planning to commit
the patch with the following commit message.
---------------
Make XactLockTableWait() and ConditionalXactLockTableWait() more interruptible.
Previously, XactLockTableWait() and ConditionalXactLockTableWait() could enter
a non-interruptible loop when they successfully acquired a lock on a transaction
but the transaction still appeared to be running. Since this loop continued
until the transaction completed, it could result in long, uninterruptible waits.
Although this scenario is generally unlikely since a transaction lock is
basically acquired only when the transaction is not running, it can occur
in a hot standby. In such cases, the transaction may still appear active due to
the KnownAssignedXids list, even while no lock on the transaction exists.
For example, this situation can happen when creating a logical replication slot
on a standby.
The cause of the non-interruptible loop was the absence of CHECK_FOR_INTERRUPTS()
within it. This commit adds CHECK_FOR_INTERRUPTS() to the loop in both functions,
ensuring they can be interrupted safely.
Back-patch to all supported branches.
Author: Kevin K Biju <kevinkbiju@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAM45KeELdjhS-rGuvN=ZLJ_asvZACucZ9LZWVzH7bGcD12DDwg@mail.gmail.com
Backpatch-through: 13
---------------
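For reference, the shape of the change is roughly the following (a simplified sketch of the retry loop only, with declarations and the error-context setup omitted; not the exact lmgr.c diff):

	for (;;)
	{
		/* Probe the transaction lock; on a hot standby this returns immediately. */
		SET_LOCKTAG_TRANSACTION(tag, xid);
		(void) LockAcquire(&tag, ShareLock, false, false);
		LockRelease(&tag, ShareLock, false);

		if (!TransactionIdIsInProgress(xid))
			break;
		xid = SubTransGetTopmostTransaction(xid);

		/*
		 * The xid still appears to be running (e.g. via KnownAssignedXids on
		 * a standby), so retry after a short sleep.  Accept cancel/terminate
		 * requests here so the backend is no longer stuck until the primary's
		 * transaction finishes.
		 */
		if (!first)
		{
			CHECK_FOR_INTERRUPTS();
			pg_usleep(1000L);
		}
		first = false;
	}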
> > Just an idea: how about calling pgstat_report_wait_start() and _end() around pg_usleep(), similar to what WaitExceedsMaxStandbyDelay() does?
>
> I was thinking about this as well, but the question is what do we report the wait event here as? The one that makes sense to me is PG_WAIT_LOCK, but that wouldn't align with what pg_locks would display since there is no actual lock grabbed on the standby.
I couldn't find an existing wait event that fits this case, so how about
adding a new one, like IPC/XactDone, to indicate "Waiting for the transaction
to commit or abort"?
> > Also, would waiting only 1ms per loop cycle be too aggressive?
>
> I agree that it's aggressive for this case. But XactLockTableWait/ConditionalXactLockTableWait seem to be used in other places including in heapam so I'm hesitant on changing this behaviour for all of them.
I agree that we need more time to consider whether this change is safe.
Regards,
--
Fujii Masao
NTT DATA Japan Corporation
On 2025/05/29 5:40, Kevin K Biju wrote:
> Hi Fujii,
>
> The WaitEvent sounds good to me; I will submit a separate patch for that when I find time.
>
> Kevin
>
> On Wed, May 28, 2025 at 5:06 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> > On 2025/05/27 4:43, Kevin K Biju wrote:
> > > Hi Fujii,
> > >
> > > Thanks for the review.
> >
> > So unless there are any objections, I'm planning to commit
> > the patch with the following commit message.

I've pushed the patch. Thanks!

Regards,

--
Fujii Masao
NTT DATA Japan Corporation
On 2025/06/03 13:35, Xuneng Zhou wrote:
> Hi all,
>
> This update amends the previous patch by replacing the global
> XactLockTableWait_us with a local wait_us variable inside XactLockTableWait.
> Now each call resets its backoff to 1 ms, doubling up to 1 s when
> oper == XLTW_None. This eliminates shared-state bugs (where multiple
> backends or successive calls would otherwise inherit a stale, high sleep
> value).
>
> I remain concerned about relying on oper == XLTW_None to signal
> "logical replication" use cases, since:
> - It couples backoff behavior to the absence of an error-callback context.
>   Future callers passing oper == XLTW_None for other reasons would
>   inadvertently get progressive backoff.
> - The intent ("I need a long-wait backoff") isn't self-evident from oper.

I think this concern is valid, and we should avoid this approach.

> An alternative is to add a separate boolean or enum argument (e.g.
> bool isLogicalReplication) so that callers explicitly opt in to backoff.
> That approach is more verbose but avoids hidden coupling and is easier to
> maintain if additional contexts require different wait policies.
> If ConditionalXactLockTableWait ever needs backoff, it would benefit from
> the same explicit flag.

I prefer this approach over the "oper == XLTW_None" one.

Just an idea: if XactLockTableWait() is expected to finish within a few
seconds after acquiring the lock, how about this approach: keep sleeping for
1ms until the total sleep time reaches 10s (10s is just an example), and
after that, start doubling the sleep duration each cycle, up to a maximum of
10s. That is, even in the non-"create replication slot" case, if the total
sleep time exceeds 10s, it seems safe to double the sleep time up to 10s.
This way, we stay responsive early on but can back off more aggressively if
needed. Thoughts? Anyway, we probably need to study XactLockTableWait()
behavior more closely.

> On Tue, Jun 3, 2025 at 11:32 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>>
>> Hi all,
>>
>> Thank you for the earlier pushed patch -- after testing it, things work
>> well. Per Fujii's and Kevin's ideas and suggestions [1], [2], I've tried
>> to extend it to add a new wait event and a progressive backoff mechanism
>> for XactLockTableWait in logical replication scenarios.

Thanks for the patch! I haven't reviewed it yet, but since this is a v19
item, please add it to the next CommitFest so we don't lose track of it.

Also, I think it would be better to split the addition of the wait event and
the introduction of exponential backoff in XactLockTableWait() into separate
patches. They serve different purposes and can be committed independently.

BTW, regarding the discussion on the list, please avoid top-posting;
bottom-posting is the preferred style on this mailing list.

Regards,

--
Fujii Masao
NTT DATA Japan Corporation
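A minimal sketch of that hybrid backoff, using the example numbers above (a 10s threshold before doubling and a 10s cap; wait_us, total_us, and the macro names are illustrative locals for this sketch, not code from an actual patch):

#define XACT_WAIT_BACKOFF_THRESHOLD_US	(10 * 1000 * 1000L)	/* example: 10s */
#define XACT_WAIT_BACKOFF_MAX_US		(10 * 1000 * 1000L)	/* example: 10s */

	long		wait_us = 1000L;	/* start with today's 1ms sleep */
	long		total_us = 0;		/* total time slept so far */

	for (;;)
	{
		/* ... lock probe and TransactionIdIsInProgress() check as before ... */

		if (!first)
		{
			CHECK_FOR_INTERRUPTS();
			pg_usleep(wait_us);
			total_us += wait_us;

			/* stay responsive early; back off once the wait has become long */
			if (total_us >= XACT_WAIT_BACKOFF_THRESHOLD_US)
				wait_us = Min(wait_us * 2, XACT_WAIT_BACKOFF_MAX_US);
		}
		first = false;
	}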
Hi Fujii,

Thanks a lot for reviewing!

> > This update amends the previous patch by replacing the global
> > XactLockTableWait_us with a local wait_us variable inside
> > XactLockTableWait. Now each call resets its backoff to 1 ms, doubling up
> > to 1 s when oper == XLTW_None. This eliminates shared-state bugs (where
> > multiple backends or successive calls would otherwise inherit a stale,
> > high sleep value).
> >
> > I remain concerned about relying on oper == XLTW_None to signal
> > "logical replication" use cases, since:
> > - It couples backoff behavior to the absence of an error-callback context.
> >   Future callers passing oper == XLTW_None for other reasons would
> >   inadvertently get progressive backoff.
> > - The intent ("I need a long-wait backoff") isn't self-evident from oper.
>
> I think this concern is valid, and we should avoid this approach.

+1

> > An alternative is to add a separate boolean or enum argument (e.g.
> > bool isLogicalReplication) so that callers explicitly opt in to backoff.
> > That approach is more verbose but avoids hidden coupling and is easier to
> > maintain if additional contexts require different wait policies.
> > If ConditionalXactLockTableWait ever needs backoff, it would benefit from
> > the same explicit flag.
>
> I prefer this approach over the "oper == XLTW_None" one.

Agreed. My worry was that introducing a new argument adds many call-site
changes, and XLTW_Oper is primarily for error-context callbacks.

> Just an idea: if XactLockTableWait() is expected to finish within a few
> seconds after acquiring the lock, how about this approach: keep sleeping
> for 1ms until the total sleep time reaches 10s (10s is just an example),
> and after that, start doubling the sleep duration each cycle, up to a
> maximum of 10s. That is, even in the non-"create replication slot" case,
> if the total sleep time exceeds 10s, it seems safe to double the sleep time
> up to 10s. This way, we stay responsive early on but can back off more
> aggressively if needed. Thoughts? Anyway, we probably need to study
> XactLockTableWait() behavior more closely.

I agree -- we need to study XactLockTableWait() usage more closely. Picking a
fixed threshold could be tricky across different workloads. Exposing it as a
GUC for this one also feels heavy.

> Thanks for the patch! I haven't reviewed it yet, but since this is a v19
> item, please add it to the next CommitFest so we don't lose track of it.
>
> Also, I think it would be better to split the addition of the wait event
> and the introduction of exponential backoff in XactLockTableWait() into
> separate patches. They serve different purposes and can be committed
> independently.

Understood -- I'll split them and move both to July's CommitFest.

> BTW, regarding the discussion on the list, please avoid top-posting;
> bottom-posting is the preferred style on this mailing list.

Good point -- I'll use bottom-posting going forward.
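To make the explicit opt-in concrete, a purely hypothetical sketch along the lines discussed above (neither the parameter name nor the call-site change below comes from an actual patch):

	/*
	 * Hypothetical: an explicit flag instead of inferring long-wait backoff
	 * from oper == XLTW_None.
	 */
	extern void XactLockTableWait(TransactionId xid, Relation rel,
								  ItemPointer ctid, XLTW_Oper oper,
								  bool allow_long_backoff);

	/*
	 * A snapshot-building caller (e.g. while creating a logical replication
	 * slot on a standby) would then opt in explicitly:
	 *
	 *     XactLockTableWait(xid, NULL, NULL, XLTW_None, true);
	 *
	 * while heapam and other tuple-visibility callers pass false and keep
	 * the current fixed 1ms retry sleep.
	 */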
Hi,

> Just an idea: if XactLockTableWait() is expected to finish within a few
> seconds after acquiring the lock, how about this approach: keep sleeping
> for 1ms until the total sleep time reaches 10s (10s is just an example),
> and after that, start doubling the sleep duration each cycle, up to a
> maximum of 10s. That is, even in the non-"create replication slot" case,
> if the total sleep time exceeds 10s, it seems safe to double the sleep time
> up to 10s. This way, we stay responsive early on but can back off more
> aggressively if needed. Thoughts? Anyway, we probably need to study
> XactLockTableWait() behavior more closely.

After some thought, I think that using an additional parameter to distinguish
replication cases from heap/index cases may not be the optimal approach.
Other scenarios, such as batch operations or analytical workloads, could
potentially cause long waits as well.

An alternative approach would be to adopt a hybrid strategy:
- For logical replication use cases: apply exponential backoff immediately.
- For other cases: apply exponential backoff only after a certain threshold
  is reached.

However, this is more complicated, and I don't see clear benefits for now.

> Thanks for the patch! I haven't reviewed it yet, but since this is a v19
> item, please add it to the next CommitFest so we don't lose track of it.

I've added it to July's CommitFest:
https://commitfest.postgresql.org/patch/5804/

> Also, I think it would be better to split the addition of the wait event
> and the introduction of exponential backoff in XactLockTableWait() into
> separate patches. They serve different purposes and can be committed
> independently.

The following is the split patch for adding the wait event.
Attachments
Hi,

> Thanks for the patch! I haven't reviewed it yet, but since this is a v19
> item, please add it to the next CommitFest so we don't lose track of it.
>
> Also, I think it would be better to split the addition of the wait event
> and the introduction of exponential backoff in XactLockTableWait() into
> separate patches. They serve different purposes and can be committed
> independently.

I've renamed and created two discussion threads and commitfest entries for
these patches to allow independent evaluation.

1) Add new wait event to XactLockTableWait
https://commitfest.postgresql.org/patch/5804/
https://www.postgresql.org/message-id/CABPTF7WZODAVPFxtn9ygA9d6zckkJbFG%3DSUtHdvk7ca%3DUTzSFg%40mail.gmail.com

2) Add progressive backoff to XactLockTableWait
https://commitfest.postgresql.org/patch/5806/
https://www.postgresql.org/message-id/CABPTF7XmTrBp8S93a%2BzQ5M3FhLB6o8kWn9yQ1YnHJqTPT9dRYA%40mail.gmail.com

There's some code overlap between XactLockTableWait() and
ConditionalXactLockTableWait() in 2) that would warrant a shared helper
function. However, introducing such a helper would create dependencies
between the patches, defeating the purpose of keeping them separate.

I'm not very familiar with splitting patches and submitting them separately,
so please let me know if anything here needs improvement.

Best regards,
Xuneng