Discussion: Segmentation fault on proc exit after dshash_find_or_insert
Hi,
If a process encounters a FATAL error after acquiring a dshash lock but before releasing it,
and it is not within a transaction, it can lead to a segmentation fault.
The FATAL error causes the backend to exit, triggering proc_exit() and similar functions.
In the absence of a transaction, LWLockReleaseAll() is delayed until ProcKill. ProcKill is
an on_shmem_exit callback, and dsm_backend_shutdown() is called before any
on_shmem_exit callbacks are invoked.
Consequently, if a dshash lock was acquired before the FATAL error occurred, the lock
will only be released after dsm_backend_shutdown() detaches the DSM segment containing
the lock, resulting in a segmentation fault.
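In code terms, the ordering inside shmem_exit() that produces the crash looks roughly like this (a simplified sketch of ipc.c, with the callback loops reduced to comments):
```c
/* Simplified shape of shmem_exit(); see ipc.c for the real code. */
void
shmem_exit(int code)
{
    shmem_exit_inprogress = true;

    /* 1. before_shmem_exit callbacks run; the dshash partition lock is still held. */

    /* 2. All DSM segments are detached, including the one containing that LWLock. */
    dsm_backend_shutdown();

    /* 3. on_shmem_exit callbacks run; ProcKill() finally calls LWLockReleaseAll(),
     *    which now writes to unmapped memory and segfaults. */
}
```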
Please find a reproducer attached. I have modified the test_dsm_registry module to create
a background worker that does nothing but throws a FATAL error after acquiring the dshash lock.
The reason this must be executed in the background worker is to ensure it runs without a transaction.
To trigger the segmentation fault, apply the 0001-Reproducer* patch, run make install in the
test_dsm_registry module, specify test_dsm_registry as shared_preload_libraries in postgresql.conf,
and start the server.
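For orientation, the core of the reproducer worker looks roughly like this (a sketch only; the attached patch is authoritative, and attach_test_dshash() and the FATAL message below are placeholders):
```c
#include "postgres.h"

#include "lib/dshash.h"
#include "postmaster/bgworker.h"

void
TestDSMRegistryMain(Datum main_arg)
{
	dshash_table *dsh;
	int			key = 1;
	bool		found;

	BackgroundWorkerUnblockSignals();

	/* placeholder: attach to the dshash table that test_dsm_registry sets up */
	dsh = attach_test_dshash();

	/* returns with the entry's partition LWLock held in exclusive mode */
	(void) dshash_find_or_insert(dsh, &key, &found);

	/*
	 * Error out while the partition lock is still held and no transaction is
	 * in progress, so nothing releases the lock before ProcKill().
	 */
	elog(FATAL, "exiting while holding a dshash partition lock");
}
```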
Please find attached a fix to call LWLockReleaseAll() early in the shmem_exit() routine. This ensures
that the dshash lock is released before dsm_backend_shutdown() is called. This will also ensure that
any subsequent callbacks invoked in shmem_exit() will not fail to acquire any lock.
Please see the backtrace below.
```
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055a7515af56c in pg_atomic_fetch_sub_u32_impl (ptr=0x7f92c4b334f4, sub_=262144)
at ../../../../src/include/port/atomics/generic-gcc.h:218
218 return __sync_fetch_and_sub(&ptr->value, sub_);
(gdb) bt
#0 0x000055a7515af56c in pg_atomic_fetch_sub_u32_impl (ptr=0x7f92c4b334f4, sub_=262144)
at ../../../../src/include/port/atomics/generic-gcc.h:218
#1 0x000055a7515af625 in pg_atomic_sub_fetch_u32_impl (ptr=0x7f92c4b334f4, sub_=262144)
at ../../../../src/include/port/atomics/generic.h:232
#2 0x000055a7515af709 in pg_atomic_sub_fetch_u32 (ptr=0x7f92c4b334f4, sub_=262144)
at ../../../../src/include/port/atomics.h:441
#3 0x000055a7515b1583 in LWLockReleaseInternal (lock=0x7f92c4b334f0, mode=LW_EXCLUSIVE) at lwlock.c:1840
#4 0x000055a7515b1638 in LWLockRelease (lock=0x7f92c4b334f0) at lwlock.c:1902
#5 0x000055a7515b16e9 in LWLockReleaseAll () at lwlock.c:1951
#6 0x000055a7515ba63d in ProcKill (code=1, arg=0) at proc.c:953
#7 0x000055a7515913af in shmem_exit (code=1) at ipc.c:276
#8 0x000055a75159119b in proc_exit_prepare (code=1) at ipc.c:198
#9 0x000055a7515910df in proc_exit (code=1) at ipc.c:111
#10 0x000055a7517be71d in errfinish (filename=0x7f92ce41d062 "test_dsm_registry.c", lineno=187,
funcname=0x7f92ce41d160 <__func__.0> "TestDSMRegistryMain") at elog.c:596
#11 0x00007f92ce41ca62 in TestDSMRegistryMain (main_arg=0) at test_dsm_registry.c:187
#12 0x000055a7514db00c in BackgroundWorkerMain (startup_data=0x55a752dd8028, startup_data_len=1472)
at bgworker.c:846
#13 0x000055a7514de1e8 in postmaster_child_launch (child_type=B_BG_WORKER, child_slot=239,
startup_data=0x55a752dd8028, startup_data_len=1472, client_sock=0x0) at launch_backend.c:268
#14 0x000055a7514e530d in StartBackgroundWorker (rw=0x55a752dd8028) at postmaster.c:4168
#15 0x000055a7514e55a4 in maybe_start_bgworkers () at postmaster.c:4334
#16 0x000055a7514e4200 in LaunchMissingBackgroundProcesses () at postmaster.c:3408
#17 0x000055a7514e205b in ServerLoop () at postmaster.c:1728
#18 0x000055a7514e18b0 in PostmasterMain (argc=3, argv=0x55a752dd0e70) at postmaster.c:1403
#19 0x000055a75138eead in main (argc=3, argv=0x55a752dd0e70) at main.c:231
```
Thank you,
Rahila Syed
Hi Rahila,
On Fri, Nov 21, 2025 at 8:45 PM Rahila Syed <rahilasyed90@gmail.com> wrote:
>
> Hi,
>
> If a process encounters a FATAL error after acquiring a dshash lock but before releasing it,
> and it is not within a transaction, it can lead to a segmentation fault.
>
> The FATAL error causes the backend to exit, triggering proc_exit() and similar functions.
> In the absence of a transaction, LWLockReleaseAll() is delayed until ProcKill. ProcKill is
> an on_shmem_exit callback, and dsm_backend_shutdown() is called before any
> on_shmem_exit callbacks are invoked.
> Consequently, if a dshash lock was acquired before the FATAL error occurred, the lock
> will only be released after dsm_backend_shutdown() detaches the DSM segment containing
> the lock, resulting in a segmentation fault.
Thanks for the report.
I am not super familiar with this code path, but this raises a broader
question for me: are there other resources residing in DSM (besides
LWLocks) that might be accessed during the shutdown sequence?
We know dshash and dsa locks (LWLocks) face this risk because ProcKill
runs as an on_shmem_exit callback, which happens after
dsm_backend_shutdown() has already detached the memory.
This patch fixes the specific case for LWLocks, but if there are any
other on_shmem_exit callbacks that attempt to touch DSM memory, they
would still trigger a similar segfault. Do we need to worry about
other cleanup routines, or is ProcKill the only consumer of DSM data
at this stage?
> Please find a reproducer attached. I have modified the test_dsm_registry module to create
> a background worker that does nothing but throws a FATAL error after acquiring the dshash lock.
> The reason this must be executed in the background worker is to ensure it runs without a transaction.
>
> To trigger the segmentation fault, apply the 0001-Reproducer* patch, run make install in the
> test_dsm_registry module, specify test_dsm_registry as shared_preload_libraries in postgresql.conf,
> and start the server.
>
> Please find attached a fix to call LWLockReleaseAll() early in the shmem_exit() routine. This ensures
> that the dshash lock is released before dsm_backend_shutdown() is called. This will also ensure that
> any subsequent callbacks invoked in shmem_exit() will not fail to acquire any lock.
@@ -229,6 +230,14 @@ shmem_exit(int code)
 {
     shmem_exit_inprogress = true;

+    /*
+     * Make sure we release any pending locks so that any callbacks called
+     * subsequently do not fail to acquire any locks. This also fixes a seg
+     * fault due to releasing a dshash lock after the dsm segment containing
+     * the lock has been detached by dsm_backend_shutdown().
+     */
+    LWLockReleaseAll();
+
     /*
      * Call before_shmem_exit callbacks.
      *
Again, not an expert, but I am concerned about placing
LWLockReleaseAll() at the very top, before before_shmem_exit()
callbacks run.
One of those callbacks might rely on locks being held or assume the
consistency of shared memory structures protected by those locks. It
seems safer to sandwich the release between the two callback lists:
after before_shmem_exit is done, but before dsm_backend_shutdown()
runs.
If we move it there, we should also reword the comment to focus on the
resource lifespan rather than just the symptom ("seg fault"):
/*
 * Release all LWLocks before we tear down DSM segments.
 *
 * LWLocks that reside in dynamic shared memory (e.g., dshash partition
 * locks) must be released before the backing segment is detached by
 * dsm_backend_shutdown(). If we wait for ProcKill (via on_shmem_exit),
 * the memory will already be unmapped, leading to access violations.
 */
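Concretely, that placement would land roughly here (sketch only, with the surrounding code abbreviated):
```c
	/* ... before_shmem_exit callbacks have all been run ... */

	/*
	 * Release any LWLocks still held (e.g., dshash partition locks that live
	 * in dynamic shared memory) before the backing segments are detached.
	 */
	LWLockReleaseAll();

	dsm_backend_shutdown();

	/* ... on_shmem_exit callbacks (including ProcKill) run after this ... */
```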
--
Thanks, Amit Langote
On Tue, Dec 2, 2025 at 9:40 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Hi Rahila,
>
> On Fri, Nov 21, 2025 at 8:45 PM Rahila Syed <rahilasyed90@gmail.com> wrote:
> >
> > Hi,
> >
> > If a process encounters a FATAL error after acquiring a dshash lock but before releasing it,
> > and it is not within a transaction, it can lead to a segmentation fault.
> >
> > The FATAL error causes the backend to exit, triggering proc_exit() and similar functions.
> > In the absence of a transaction, LWLockReleaseAll() is delayed until ProcKill. ProcKill is
> > an on_shmem_exit callback, and dsm_backend_shutdown() is called before any
> > on_shmem_exit callbacks are invoked.
> > Consequently, if a dshash lock was acquired before the FATAL error occurred, the lock
> > will only be released after dsm_backend_shutdown() detaches the DSM segment containing
> > the lock, resulting in a segmentation fault.
>
> Thanks for the report.
>
> I am not super familiar with this code path, but this raises a broader
> question for me: are there other resources residing in DSM (besides
> LWLocks) that might be accessed during the shutdown sequence?
>
> We know dshash and dsa locks (LWLocks) face this risk because ProcKill
> runs as an on_shmem_exit callback, which happens after
> dsm_backend_shutdown() has already detached the memory.
>
> This patch fixes the specific case for LWLocks, but if there are any
> other on_shmem_exit callbacks that attempt to touch DSM memory, they
> would still trigger a similar segfault. Do we need to worry about
> other cleanup routines, or is ProcKill the only consumer of DSM data
> at this stage?
>
> > Please find a reproducer attached. I have modified the test_dsm_registry module to create
> > a background worker that does nothing but throws a FATAL error after acquiring the dshash lock.
> > The reason this must be executed in the background worker is to ensure it runs without a transaction.
> >
> > To trigger the segmentation fault, apply the 0001-Reproducer* patch, run make install in the
> > test_dsm_registry module, specify test_dsm_registry as shared_preload_libraries in postgresql.conf,
> > and start the server.
> >
> > Please find attached a fix to call LWLockReleaseAll() early in the shmem_exit() routine. This ensures
> > that the dshash lock is released before dsm_backend_shutdown() is called. This will also ensure that
> > any subsequent callbacks invoked in shmem_exit() will not fail to acquire any lock.
>
> @@ -229,6 +230,14 @@ shmem_exit(int code)
> {
> shmem_exit_inprogress = true;
>
> + /*
> + * Make sure we release any pending locks so that any callbacks called
> + * subsequently do not fail to acquire any locks. This also fixes a seg
> + * fault due to releasing a dshash lock after the dsm segment containing
> + * the lock has been detached by dsm_backend_shutdown().
> + */
> + LWLockReleaseAll();
> +
> /*
> * Call before_shmem_exit callbacks.
> *
>
> Again, not an expert, but I am concerned about placing
> LWLockReleaseAll() at the very top, before before_shmem_exit()
> callbacks run.
>
> One of those callbacks might rely on locks being held or assume the
> consistency of shared memory structures protected by those locks. It
> seems safer to sandwich the release between the two callback lists:
> after before_shmem_exit is done, but before dsm_backend_shutdown()
> runs.
I think it makes sense to place LWLockReleaseAll() right before the
dsm_backend_shutdown() call, which detaches the process from the DSM
segments.
--
Regards,
Dilip Kumar
Google
Hi,
On 2025-12-02 13:10:29 +0900, Amit Langote wrote:
> On Fri, Nov 21, 2025 at 8:45 PM Rahila Syed <rahilasyed90@gmail.com> wrote:
> >
> > Hi,
> >
> > If a process encounters a FATAL error after acquiring a dshash lock but before releasing it,
> > and it is not within a transaction, it can lead to a segmentation fault.
> >
> > The FATAL error causes the backend to exit, triggering proc_exit() and similar functions.
> > In the absence of a transaction, LWLockReleaseAll() is delayed until ProcKill. ProcKill is
> > an on_shmem_exit callback, and dsm_backend_shutdown() is called before any
> > on_shmem_exit callbacks are invoked.
> > Consequently, if a dshash lock was acquired before the FATAL error occurred, the lock
> > will only be released after dsm_backend_shutdown() detaches the DSM segment containing
> > the lock, resulting in a segmentation fault.
>
> Thanks for the report.
>
> I am not super familiar with this code path, but this raises a broader
> question for me: are there other resources residing in DSM (besides
> LWLocks) that might be accessed during the shutdown sequence?
>
> We know dshash and dsa locks (LWLocks) face this risk because ProcKill
> runs as an on_shmem_exit callback, which happens after
> dsm_backend_shutdown() has already detached the memory.
>
> This patch fixes the specific case for LWLocks, but if there are any
> other on_shmem_exit callbacks that attempt to touch DSM memory, they
> would still trigger a similar segfault. Do we need to worry about
> other cleanup routines, or is ProcKill the only consumer of DSM data
> at this stage?
I don't think it's really right to frame it as ProcKill() being a consumer of
DSM data - it's just releasing all held lwlocks, and we happen to hold an
lwlock inside a DSM in the problematic case...
There are many other places that do LWLockReleaseAll()...
> > Please find a reproducer attached. I have modified the test_dsm_registry module to create
> > a background worker that does nothing but throws a FATAL error after acquiring the dshash lock.
> > The reason this must be executed in the background worker is to ensure it runs without a transaction.
> >
> > To trigger the segmentation fault, apply the 0001-Reproducer* patch, run make install in the
> > test_dsm_registry module, specify test_dsm_registry as shared_preload_libraries in postgresql.conf,
> > and start the server.
> >
> > Please find attached a fix to call LWLockReleaseAll() early in the shmem_exit() routine. This ensures
> > that the dshash lock is released before dsm_backend_shutdown() is called. This will also ensure that
> > any subsequent callbacks invoked in shmem_exit() will not fail to acquire any lock.
>
> @@ -229,6 +230,14 @@ shmem_exit(int code)
> {
> shmem_exit_inprogress = true;
>
> + /*
> + * Make sure we release any pending locks so that any callbacks called
> + * subsequently do not fail to acquire any locks. This also fixes a seg
> + * fault due to releasing a dshash lock after the dsm segment containing
> + * the lock has been detached by dsm_backend_shutdown().
> + */
> + LWLockReleaseAll();
> +
> /*
> * Call before_shmem_exit callbacks.
> *
>
> Again, not an expert, but I am concerned about placing
> LWLockReleaseAll() at the very top, before before_shmem_exit()
> callbacks run.
I think it's actually kind of required for correctness, independent of this
crash. If we errored out while holding an lwlock, we cannot reliably run
*any* code acquiring an lwlock, because lwlocks are not reentrant.
> One of those callbacks might rely on locks being held or assume the
> consistency of shared memory structures protected by those locks. It
> seems safer to sandwich the release between the two callback lists:
> after before_shmem_exit is done, but before dsm_backend_shutdown()
> runs.
I don't agree. You *cannot* rely on lwlocks working after an error before you
have called LWLockReleaseAll(). I.e. currently *any* use of lwlocks in
before_shmem_exit is unsafe. The only reason we haven't really noticed that is
that most of the top-level error handlers (i.e. sigsetjmp()s) do an
AbortCurrentTransaction(), which does an LWLockReleaseAll() if in a tx, and
most lwlock acquisitions happen within a transaction. But if you ever do stuff
outside of a transaction, the AbortCurrentTransaction() won't do
LWLockReleaseAll(), and you're in trouble, as the case here.
IOW, I think we need to do LWLockReleaseAll() at the top of proc_exit(), at
least in case of FATAL errors.
We probably should add a note to LWLockReleaseAll() indicating that we rely on
LWLockReleaseAll() working even if CreateLWLocks()/InitializeLWLocks() etc
hasn't yet been called...
Greetings,
Andres Freund
On Thu, Dec 4, 2025 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-12-02 13:10:29 +0900, Amit Langote wrote:
> > On Fri, Nov 21, 2025 at 8:45 PM Rahila Syed <rahilasyed90@gmail.com> wrote:
> > > If a process encounters a FATAL error after acquiring a dshash lock but before releasing it,
> > > and it is not within a transaction, it can lead to a segmentation fault.
> > >
> > > The FATAL error causes the backend to exit, triggering proc_exit() and similar functions.
> > > In the absence of a transaction, LWLockReleaseAll() is delayed until ProcKill. ProcKill is
> > > an on_shmem_exit callback, and dsm_backend_shutdown() is called before any
> > > on_shmem_exit callbacks are invoked.
> > > Consequently, if a dshash lock was acquired before the FATAL error occurred, the lock
> > > will only be released after dsm_backend_shutdown() detaches the DSM segment containing
> > > the lock, resulting in a segmentation fault.
> >
> > Thanks for the report.
> >
> > I am not super familiar with this code path, but this raises a broader
> > question for me: are there other resources residing in DSM (besides
> > LWLocks) that might be accessed during the shutdown sequence?
> >
> > We know dshash and dsa locks (LWLocks) face this risk because ProcKill
> > runs as an on_shmem_exit callback, which happens after
> > dsm_backend_shutdown() has already detached the memory.
> >
> > This patch fixes the specific case for LWLocks, but if there are any
> > other on_shmem_exit callbacks that attempt to touch DSM memory, they
> > would still trigger a similar segfault. Do we need to worry about
> > other cleanup routines, or is ProcKill the only consumer of DSM data
> > at this stage?
>
> I don't think it's really right to frame it as ProcKill() being a consumer of
> DSM data - it's just releasing all held lwlocks, and we happen to hold an
> lwlock inside a DSM in the problematic case...
>
> There are many other places that do LWLockReleaseAll()...
Sure, I was just wondering whether there might be other things in these DSM
segments that could be accessed from on_shmem_exit callbacks. But maybe we
don't have to address all such risks in this patch.
> > > Please find a reproducer attached. I have modified the test_dsm_registry module to create
> > > a background worker that does nothing but throws a FATAL error after acquiring the dshash lock.
> > > The reason this must be executed in the background worker is to ensure it runs without a transaction.
> > >
> > > To trigger the segmentation fault, apply the 0001-Reproducer* patch, run make install in the
> > > test_dsm_registry module, specify test_dsm_registry as shared_preload_libraries in postgresql.conf,
> > > and start the server.
> > >
> > > Please find attached a fix to call LWLockReleaseAll() early in the shmem_exit() routine. This ensures
> > > that the dshash lock is released before dsm_backend_shutdown() is called. This will also ensure that
> > > any subsequent callbacks invoked in shmem_exit() will not fail to acquire any lock.
> >
> > @@ -229,6 +230,14 @@ shmem_exit(int code)
> > {
> > shmem_exit_inprogress = true;
> >
> > + /*
> > + * Make sure we release any pending locks so that any callbacks called
> > + * subsequently do not fail to acquire any locks. This also fixes a seg
> > + * fault due to releasing a dshash lock after the dsm segment containing
> > + * the lock has been detached by dsm_backend_shutdown().
> > + */
> > + LWLockReleaseAll();
> > +
> > /*
> > * Call before_shmem_exit callbacks.
> > *
> >
> > Again, not an expert, but I am concerned about placing
> > LWLockReleaseAll() at the very top, before before_shmem_exit()
> > callbacks run.
>
> I think it's actually kind of required for correctness, independent of this
> crash. If we errored out while holding an lwlock, we cannot reliably run
> *any* code acquiring an lwlock, because lwlocks are not reentrant.
>
> > One of those callbacks might rely on locks being held or assume the
> > consistency of shared memory structures protected by those locks. It
> > seems safer to sandwich the release between the two callback lists:
> > after before_shmem_exit is done, but before dsm_backend_shutdown()
> > runs.
>
> I don't agree. You *cannot* rely on lwlocks working after an error before you
> have called LWLockReleaseAll(). I.e. currently *any* use of lwlocks in
> before_shmem_exit is unsafe. The only reason we haven't really noticed that is
> that most of the top-level error handlers (i.e. sigsetjmp()s) do an
> AbortCurrentTransaction(), which does an LWLockReleaseAll() if in a tx, and
> most lwlock acquisitions happen within a transaction. But if you ever do stuff
> outside of a transaction, the AbortCurrentTransaction() won't do
> LWLockReleaseAll(), and you're in trouble, as the case here.
>
> IOW, I think we need to do LWLockReleaseAll() at the top of proc_exit(), at
> least in case of FATAL errors.
Oh, so not at the top of shmem_exit() as Rahila has proposed?
> We probably should add a note to LWLockReleaseAll() indicating that we rely on
> LWLockReleaseAll() working even if CreateLWLocks()/InitializeLWLocks() etc
> hasn't yet been called...
Makes sense. AFAICS, num_held_lwlocks would be 0 in that case, so
LWLockReleaseAll() would be a no-op.
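For reference, the current implementation in lwlock.c is essentially just this loop, which falls through immediately when num_held_lwlocks is 0:
```c
void
LWLockReleaseAll(void)
{
	while (num_held_lwlocks > 0)
	{
		HOLD_INTERRUPTS();		/* match the upcoming RESUME_INTERRUPTS */

		LWLockRelease(held_lwlocks[num_held_lwlocks - 1].lock);
	}
}
```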
--
Thanks, Amit Langote
Hi,

On 2025-12-04 11:06:20 +0900, Amit Langote wrote:
> On Thu, Dec 4, 2025 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:
> > I don't agree. You *cannot* rely on lwlocks working after an error before you
> > have called LWLockReleaseAll(). I.e. currently *any* use of lwlocks in
> > before_shmem_exit is unsafe. The only reason we haven't really noticed that is
> > that most of the top-level error handlers (i.e. sigsetjmp()s) do an
> > AbortCurrentTransaction(), which does an LWLockReleaseAll() if in a tx, and
> > most lwlock acquisitions happen within a transaction. But if you ever do stuff
> > outside of a transaction, the AbortCurrentTransaction() won't do
> > LWLockReleaseAll(), and you're in trouble, as the case here.
> >
> > IOW, I think we need to do LWLockReleaseAll() at the top of proc_exit(), at
> > least in case of FATAL errors.
>
> Oh, so not at the top of shmem_exit() as Rahila has proposed?

Oh, on top of shmem_exit() seems fine; what I was trying to express was that it
should happen unconditionally at the start of exit processing, not just at the
tail.

> > We probably should add a note to LWLockReleaseAll() indicating that we rely on
> > LWLockReleaseAll() working even if CreateLWLocks()/InitializeLWLocks() etc
> > hasn't yet been called...
>
> Makes sense. AFAICS, num_held_lwlocks would be 0 in that case, so
> LWLockReleaseAll() would be a no-op.

Right. I just meant we should add a comment noting that we rely on that
fact...

Greetings,

Andres Freund
Hi,

On Fri, Dec 5, 2025 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-12-04 11:06:20 +0900, Amit Langote wrote:
> > On Thu, Dec 4, 2025 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:
> > > I don't agree. You *cannot* rely on lwlocks working after an error before you
> > > have called LWLockReleaseAll(). I.e. currently *any* use of lwlocks in
> > > before_shmem_exit is unsafe. The only reason we haven't really noticed that is
> > > that most of the top-level error handlers (i.e. sigsetjmp()s) do an
> > > AbortCurrentTransaction(), which does an LWLockReleaseAll() if in a tx, and
> > > most lwlock acquisitions happen within a transaction. But if you ever do stuff
> > > outside of a transaction, the AbortCurrentTransaction() won't do
> > > LWLockReleaseAll(), and you're in trouble, as the case here.
> > >
> > > IOW, I think we need to do LWLockReleaseAll() at the top of proc_exit(), at
> > > least in case of FATAL errors.
> >
> > Oh, so not at the top of shmem_exit() as Rahila has proposed?
>
> Oh, on top of shmem_exit() seems fine; what I was trying to express was that it
> should happen unconditionally at the start of exit processing, not just at the
> tail.

Ah, ok. I was talking about this with Rahila today and she pointed out to me
that whether we add it to the top of proc_exit() or shmem_exit() doesn't really
make any material difference to the fact that it will get done at the start of
exit processing as you say, at least today. So I think we can keep it like
Rahila originally proposed.

> > > We probably should add a note to LWLockReleaseAll() indicating that we rely on
> > > LWLockReleaseAll() working even if CreateLWLocks()/InitializeLWLocks() etc
> > > hasn't yet been called...
> >
> > Makes sense. AFAICS, num_held_lwlocks would be 0 in that case, so
> > LWLockReleaseAll() would be a no-op.
>
> Right. I just meant we should add a comment noting that we rely on that
> fact...

Ok, got it. Maybe like this:

@@ -1940,6 +1940,10 @@ LWLockReleaseClearVar(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
  * unchanged by this operation. This is necessary since InterruptHoldoffCount
  * has been set to an appropriate level earlier in error recovery. We could
  * decrement it below zero if we allow it to drop for each released lock!
+ *
+ * Note that this function must be safe to call even if the LWLock subsystem
+ * hasn't been initialized (e.g., during early startup error recovery).
+ * In that case, num_held_lwlocks will be 0, and we'll do nothing.
  */
 void
 LWLockReleaseAll(void)

--
Thanks, Amit Langote