Обсуждение: Missing calls to UnlockBuffers() - unify error handling?
Hi,
While hacking on a patch to "inline" the lock implementation of content locks
into storage/buffer/, I looked at the callers of UnlockBuffers() and compared
them to the callers of LWLockReleaseAll().
Somewhat disconcertingly, there are more callers of LWLockReleaseAll(),
without - afaict - analysis whether that's right.
E.g., why is it correct that ShutdownAuxiliaryProcess() does not call
UnlockBuffers() anywhere? It's *possible* that that turns out to be safe,
because there's no aux process acquiring cleanup locks, but even if that were
true today, if that ever changed we'd very likely miss that
ShutdownAuxiliaryProcess() would need to be updated.
I guess it's ok that pgarch.c doesn't call UnlockBuffers(), it probably is ok
that walsummarizer.c and method_worker.c don't. For walsender.c I'm a lot less
clear.
Widening the perspective a bit, our current approach for such error-handling /
shutdown functions seems ... not maintainable. In particular we have a
substantial number of top-level sigsetjmp() handlers that have slightly
different recovery codepaths. When adding a new process type, how is one
supposed to even know what function one needs to call?
PostgresMain() has this comment:
/*
* NOTE: if you are tempted to add more code in this if-block,
* consider the high probability that it should be in
* AbortTransaction() instead. The only stuff done directly here
* should be stuff that is guaranteed to apply *only* for outer-level
* error recovery, such as adjusting the FE/BE protocol status.
*/
But a) that clearly hasn't worked, given how long the following block is, and
b) can't really work that well, because we have plenty processes that don't
use transaction, therefore putting cleanup in AbortTransaction() doesn't go
that far.
I don't quite know how it should look like, but it seems pretty obvious that
it should be more unified than it is. My gut feeling is that we should have a
single error recovery function that has a flags argument guiding which
subsystems should be reset and which shouldn't.
Greetings,
Andres Freund
Hi,
On 2025-11-06 10:34:58 -0500, Andres Freund wrote:
> While hacking on a patch to "inline" the lock implementation of content locks
> into storage/buffer/, I looked at the callers of UnlockBuffers() and compared
> them to the callers of LWLockReleaseAll().
>
> Somewhat disconcertingly, there are more callers of LWLockReleaseAll(),
> without - afaict - analysis whether that's right.
A related issue I just found is that ProcKill() calls
SyncRepCleanupAtProcExit() before LWLockReleaseAll(). Which, given that
LWLockReleaseAll() uses lwlocks, doesn't seem great.
This may have kinda been ok in the past, as we'd normally have gone through
AbortTransaction() by that point. However, that's e.g. not true in background
workers, which just rely on ProcKill():
/*
* Do we need more cleanup here? For shmem-connected bgworkers, we
* will call InitProcess below, which will install ProcKill as exit
* callback. That will take care of releasing locks, etc.
*/
Seems we should just reorder the sequence in ProcKill() to first call
LWLockReleaseAll().
Greetings,
Andres Freund
Hi Andres,
Widening the perspective a bit, our current approach for such error-handling /
shutdown functions seems ... not maintainable. In particular we have a
substantial number of top-level sigsetjmp() handlers that have slightly
different recovery codepaths. When adding a new process type, how is one
supposed to even know what function one needs to call?
PostgresMain() has this comment:
/*
* NOTE: if you are tempted to add more code in this if-block,
* consider the high probability that it should be in
* AbortTransaction() instead. The only stuff done directly here
* should be stuff that is guaranteed to apply *only* for outer-level
* error recovery, such as adjusting the FE/BE protocol status.
*/
But a) that clearly hasn't worked, given how long the following block is, and
b) can't really work that well, because we have plenty processes that don't
use transaction, therefore putting cleanup in AbortTransaction() doesn't go
that far.
I don't quite know how it should look like, but it seems pretty obvious that
it should be more unified than it is. My gut feeling is that we should have a
single error recovery function that has a flags argument guiding which
subsystems should be reset and which shouldn't.
+1 for the idea. I am interested in writing a patch for the same if you would like.
As you mentioned, these code blocks under the if (sigsetjmp(local_sigjmp_buf, 1) != 0)
statement have a very similar set of function calls, depending on the process type—whether
it's an auxiliary process, background process, or client backend.
There are also similarities across these types; for instance, all of them register a ProcKill
callback as an on_shmem_exit callback..
Currently, we perform LWLockReleaseAll() at the before_shmem_exit stage,
such as in the ShutdownAuxiliaryProcess() callback for auxiliary processes and
ShutdownPostgres() for client backends.
statement have a very similar set of function calls, depending on the process type—whether
it's an auxiliary process, background process, or client backend.
There are also similarities across these types; for instance, all of them register a ProcKill
callback as an on_shmem_exit callback..
Currently, we perform LWLockReleaseAll() at the before_shmem_exit stage,
such as in the ShutdownAuxiliaryProcess() callback for auxiliary processes and
ShutdownPostgres() for client backends.
However, there are some inconsistencies:
1. The client backend does not call LWLockReleaseAll() until ProcKill(), if it is not
in a transaction.
2. The background processes and AutovacuumLauncher do not register a before_shmem_callback
for calling LWLockReleaseAll() but may invoke it directly within the sigsetjmp() blocks.
1. The client backend does not call LWLockReleaseAll() until ProcKill(), if it is not
in a transaction.
2. The background processes and AutovacuumLauncher do not register a before_shmem_callback
for calling LWLockReleaseAll() but may invoke it directly within the sigsetjmp() blocks.
3. Some auxiliary processes also call LWLockReleaseAll() early in the sigsetjmp() block.
Seems we should just reorder the sequence in ProcKill() to first call
LWLockReleaseAll()
It would be too late to call it in ProcKill, since dsm_backend_shutdown() would
detach segments containing the LWLocks before this. Using a before_shmem_exit callback or
handling it in the initial steps of sigsetjmp might be more suitable.
handling it in the initial steps of sigsetjmp might be more suitable.
Thank you,
Rahila Syed