Thread: `pg_ctl init` crashes when run concurrently; semget(2) suspected


`pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Gavin Panella
Date:
Hi,

Summary: semget(2) behaves differently on macOS and requires extra care.

I have many tests which spin up clusters using `pg_ctl init`, each in its own single-use temporary directory. Each test is run for every PostgreSQL installation found on the host machine. These tests are often run concurrently. Since adding PostgreSQL 17 to the mix, I've been getting sporadic failures on macOS:

  FATAL:  could not create semaphores: Invalid argument
  DETAIL:  Failed system call was semget(176163502, 20, 03600).
  child process exited with exit code 1

I think it's related to the increase of SEMAS_PER_SET in 38da053463bef32adf563ddee5277d16d2b6c5a (later reverted in 810a8b1c8051d4e8822967a96f133692698386de) combined with the behaviour of semget(2) on macOS.

I think the bug manifests because:
  • I create two clusters concurrently using `pg_ctl init`. One cluster is PostgreSQL 17; the other is PostgreSQL 16 or earlier.
  • Their data directories are separate but created close enough in time to have sequential inodes. This is relevant because the inode is used to seed the semaphore key.
  • Somehow (waves hands) semget(2) in PostgreSQL 17 is called with a key that points at a preexisting semaphore set. On Linux, because of the IPC_CREAT | IPC_EXCL flags, this returns <0 with errno set to EEXIST. On macOS, errno is instead set to EINVAL, likely because the requested number of semaphores is greater than the number in the existing set. This happens in the InternalIpcSemaphoreCreate function, which then aborts the process.
The attached patch fixes the issue, I think, and contains another description of this mechanism. On EINVAL it makes an additional semget(2) call, this time asking for zero semaphores.

The patch is relative to master, but I developed it against REL_17_5; it should apply cleanly to both. I think it would be good to backport a fix to 17 too.

If anyone is feeling nerd-sniped, some better proof of the "waves hands" bit would be useful, because that was a working hypothesis that led to a working fix, and I have not yet had time to investigate further.

Please consider the patch for review. Thanks!

Gavin.
Attachments

Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Tom Lane
Date:
Gavin Panella <gavinpanella@gmail.com> writes:
> I have many tests which spin up clusters using `pg_ctl init`, each in its
> own single-use temporary directory. Each test is run for every PostgreSQL
> installation found on the host machine. These tests are often run
> concurrently. Since adding PostgreSQL 17 to the mix, I've been getting
> sporadic failures on macOS:
>   FATAL:  could not create semaphores: Invalid argument
>   DETAIL:  Failed system call was semget(176163502, 20, 03600).
>   child process exited with exit code 1

Ugh.

> I think it's related to the increase of SEMAS_PER_SET
> in 38da053463bef32adf563ddee5277d16d2b6c5a (later reverted
> in 810a8b1c8051d4e8822967a96f133692698386de) combined with the behaviour of
> semget(2) on macOS.

Yeah, I guess you're right; but it's not macOS-specific AFAICS.
"man semget" quoth

     [EINVAL]           The number of semaphores requested is either less than
                        1 or greater than the system imposed maximum per set.

     [EINVAL]           A semaphore identifier exists for the argument key,
                        but the number of semaphores in the set associated
                        with it is less than nsems, and nsems is non-zero.

This is from current macOS, but equivalent text appears on Linux and
in the POSIX spec.  So it's just luck that nobody has reported the
same problem elsewhere --- unless maybe there is some macOS-specific
behavior making it more likely that different installs would try the
same key.  Nonetheless, key collisions must be expected, and we have
little right to assume that a pre-existing semaphore set must have
enough semaphores.

Using the same errno for these cases was really quite poorly chosen,
because in the first case we should not retry with another key, but
in the second case we should.  If the problem is that SEMMSL is too
small, and we retry, it'll be an infinite loop.

I don't care for your patch because (a) I'm not sure that the result
of asking for zero semaphores is portable (the Linux man page suggests
that it might be allowed there); and (b) if the problem is SEMMSL,
your code will fall into the infinite loop.

My first thought about fixing it is to go ahead and retry after
EINVAL, but keep a count of failures, and give up after 100 or 1000
or so.  This'd be safer in any case --- it's not really great that
IpcSemaphoreCreate is willing to retry an infinite number of times.

If we do retry, the semget() in IpcSemaphoreCreate will fail again,
so we'll just skip over that semaphore key and try another.  It
would be marginally nicer to release the semaphore set if it's
unused, but I'm not sure it'd be safe to assume that a set with
too few semaphores is one of ours.  It certainly didn't come from
a prior cycle of life of the same postmaster.

            regards, tom lane



Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Tom Lane
Date:
I wrote:
> This is from current macOS, but equivalent text appears on Linux and
> in the POSIX spec.  So it's just luck that nobody has reported the
> same problem elsewhere --- unless maybe there is some macOS-specific
> behavior making it more likely that different installs would try the
> same key.

Hmm, no, there is a platform dependency here.  I made the attached
test program to see what happens when there's a key collision,
and on Linux I get

semget(SEMAS_PER_SET) failed: File exists
semget(SEMAS_PER_SET + 1) failed: File exists

but macOS and NetBSD give

semget(SEMAS_PER_SET) failed: File exists
semget(SEMAS_PER_SET + 1) failed: Invalid argument

I didn't try other BSDen; this might be a NetBSD-ism that Apple
inherited, or maybe it's common among the BSDen.  I don't see any
text in POSIX specifying which errno is to be returned in this case,
so we can't really argue that the behavior is wrong.

            regards, tom lane

#include <stdio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/stat.h>

typedef key_t IpcSemaphoreKey;    /* semaphore key passed to semget(2) */

#ifndef HAVE_UNION_SEMUN
union semun
{
    int            val;
    struct semid_ds *buf;
    unsigned short *array;
};
#endif

#define SEMAS_PER_SET    16

#define IPCProtection    (0600)    /* access/modify by user only */


int main(int argc, char **argv)
{
    IpcSemaphoreKey semKey = 42;
    int            semId;
    int            semId2;
    union semun semun;

    /*
     * Find a free semaphore key, and create a set with SEMAS_PER_SET semas.
     */
    for(;;)
    {
        semId = semget(semKey, SEMAS_PER_SET,
                       IPC_CREAT | IPC_EXCL | IPCProtection);
        if (semId >= 0)
            break;
        semKey++;
    }

    /*
     * Check behavior if we try to make another set with same # of semas.
     */
    semId2 = semget(semKey, SEMAS_PER_SET,
                    IPC_CREAT | IPC_EXCL | IPCProtection);
    if (semId2 < 0)
        fprintf(stderr, "semget(SEMAS_PER_SET) failed: %s\n",
                strerror(errno));
    else
        fprintf(stderr, "semget(SEMAS_PER_SET) unexpectedly succeeded\n");

    /*
     * Check behavior if we try to make another set with more semas.
     */
    semId2 = semget(semKey, SEMAS_PER_SET + 1,
                    IPC_CREAT | IPC_EXCL | IPCProtection);
    if (semId2 < 0)
        fprintf(stderr, "semget(SEMAS_PER_SET + 1) failed: %s\n",
                strerror(errno));
    else
        fprintf(stderr, "semget(SEMAS_PER_SET + 1) unexpectedly succeeded\n");

    /*
     * Clean up
     */
    semun.val = 0;                /* unused, but keep compiler quiet */

    if (semctl(semId, 0, IPC_RMID, semun) < 0)
        fprintf(stderr, "semctl(%d, 0, IPC_RMID, ...) failed: %s\n",
                semId, strerror(errno));

    return 0;
}

Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Tom Lane
Date:
I wrote:
> I didn't try other BSDen; this might be a NetBSD-ism that Apple
> inherited, or maybe it's common among the BSDen.

A bit more research later: OpenBSD behaves like NetBSD, while
FreeBSD behaves like Linux.  So that's pretty inconclusive about
what the aboriginal behavior was.  I also found that OpenIndiana
behaves like Linux.

For Linux and FreeBSD, it doesn't actually matter to us because
we no longer use SysV semaphores on those platforms.  But this
is a live problem on macOS, NetBSD, OpenBSD.

            regards, tom lane



Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Tom Lane
Date:
I wrote:
> My first thought about fixing it is to go ahead and retry after
> EINVAL, but keep a count of failures, and give up after 100 or 1000
> or so.  This'd be safer in any case --- it's not really great that
> IpcSemaphoreCreate is willing to retry an infinite number of times.

Concretely, about like the attached.  I don't have any clear idea
about what loop limit to use, but I think we do need one.

            regards, tom lane

diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 423b2b4f9d6..8984a0aa2d1 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -69,7 +69,7 @@ static int    nextSemaNumber;        /* next free sem num in last sema set */


 static IpcSemaphoreId InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey,
-                                                 int numSems);
+                                                 int numSems, bool retry_ok);
 static void IpcSemaphoreInitialize(IpcSemaphoreId semId, int semNum,
                                    int value);
 static void IpcSemaphoreKill(IpcSemaphoreId semId);
@@ -88,9 +88,13 @@ static void ReleaseSemaphores(int status, Datum arg);
  * If we fail with a failure code other than collision-with-existing-set,
  * print out an error and abort.  Other types of errors suggest nonrecoverable
  * problems.
+ *
+ * Unfortunately, it's sometimes hard to tell whether errors are
+ * nonrecoverable.  Our caller keeps track of whether continuing to retry
+ * is sane or not; if not, we abort on failure regardless of the errno.
  */
 static IpcSemaphoreId
-InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
+InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems, bool retry_ok)
 {
     int            semId;

@@ -101,16 +105,27 @@ InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
         int            saved_errno = errno;

         /*
-         * Fail quietly if error indicates a collision with existing set. One
-         * would expect EEXIST, given that we said IPC_EXCL, but perhaps we
-         * could get a permission violation instead?  Also, EIDRM might occur
-         * if an old set is slated for destruction but not gone yet.
+         * Fail quietly if error suggests a collision with an existing set and
+         * our caller has not lost patience.
+         *
+         * One would expect EEXIST, given that we said IPC_EXCL, but perhaps
+         * we could get a permission violation instead.  On some platforms
+         * EINVAL will be reported if the existing set has too few semaphores.
+         * Also, EIDRM might occur if an old set is slated for destruction but
+         * not gone yet.
+         *
+         * EINVAL is the key reason why we need the caller-level loop limit,
+         * as it can also mean that the platform's SEMMSL is less than
+         * numSems, and that condition will not go away.
          */
-        if (saved_errno == EEXIST || saved_errno == EACCES
+        if (retry_ok &&
+            (saved_errno == EEXIST
+             || saved_errno == EACCES
+             || saved_errno == EINVAL
 #ifdef EIDRM
-            || saved_errno == EIDRM
+             || saved_errno == EIDRM
 #endif
-            )
+             ))
             return -1;

         /*
@@ -207,17 +222,22 @@ IpcSemaphoreGetLastPID(IpcSemaphoreId semId, int semNum)
 static IpcSemaphoreId
 IpcSemaphoreCreate(int numSems)
 {
+    int            num_tries = 0;
     IpcSemaphoreId semId;
     union semun semun;
     PGSemaphoreData mysema;

     /* Loop till we find a free IPC key */
-    for (nextSemaKey++;; nextSemaKey++)
+    for (nextSemaKey++;; nextSemaKey++, num_tries++)
     {
         pid_t        creatorPID;

-        /* Try to create new semaphore set */
-        semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1);
+        /*
+         * Try to create new semaphore set.  Give up after trying 1000
+         * distinct IPC keys.
+         */
+        semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1,
+                                           num_tries < 1000);
         if (semId >= 0)
             break;                /* successful create */

@@ -254,7 +274,7 @@ IpcSemaphoreCreate(int numSems)
         /*
          * Now try again to create the sema set.
          */
-        semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1);
+        semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1, true);
         if (semId >= 0)
             break;                /* successful create */


Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Gavin Panella
Date:
With that fix applied to REL_17_5 things are working well. Limiting the search sounds like an improvement too.

As an experiment I added a log line for when semget in InternalIpcSemaphoreCreate returns -1. When running `pg_ctl init` for this local build concurrently with `pg_ctl init` from PostgreSQL 15 (or another version prior to 17), I saw ~8 logged failures under contention. As I increased the concurrency, the maximum number of logged failures looked to be roughly 8 × the concurrency. For me, then, it would take running `pg_ctl init` with a concurrency of 125 to even begin exceeding the max retries of 1000 in the worst case. That sounds high enough.

Then I thought: I'm only seeing the log from one of those instances, yet they're all going through the same search for free semaphore sets. That's a few system calls going to waste. Maybe not important in the big picture, but it gave me the idea to left-shift nextSemaKey in PGReserveSemaphores, i.e. `nextSemaKey = statbuf.st_ino << 4`, to give each pg_ctl process a few guaranteed uncontested keys (at least, uncontested among themselves). In a small test this eliminated contention for semaphore sets due to concurrency. It is more of an optimisation, though, than a bug fix.

Gavin

Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Tom Lane
Date:
Gavin Panella <gavinpanella@gmail.com> writes:
> With that fix applied to REL_17_5 things are working well. Limiting the
> search sounds like an improvement too.

Cool.  I'll push the fix after our release freeze lifts.

> Then I thought: I'm only seeing the log from one of those instances, yet
> they're all going through the same search for free semaphore sets. That's a
> few system calls going to waste. Maybe not important in the big picture,
> but it gave me an idea to left shift nextSemaKey in PGReserveSemaphores,
> i.e. `nextSemaKey = statbuf.st_ino << 4`, to give each pg_ctl process a few
> guaranteed uncontested keys (at least, uncontested between themselves). In
> a small test this eliminated contention for semaphore sets due to
> concurrency. It is more of an optimisation though, rather than a bug fix.

Meh ... if we thought this was worth worrying about, I'd be more
inclined to run the st_ino through a hash function to spread out the
possible values a little more.  But I think in nearly all scenarios
it'd be a waste of effort.  If collisions were probable we'd have
found this issue years ago.

            regards, tom lane



Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Gavin Panella
Date:
On Tue, 12 Aug 2025 at 01:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Cool.  I'll push the fix after our release freeze lifts.

Splendid, thank you!

Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Thomas Munro
Date:
On Mon, Aug 11, 2025 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A bit more research later: OpenBSD behaves like NetBSD, while
> FreeBSD behaves like Linux.  So that's pretty inconclusive about
> what the aboriginal behavior was.  I also found that OpenIndiana
> behaves like Linux.

Out of curiosity, FreeBSD changed that in 2016:

https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e



Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Mon, Aug 11, 2025 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A bit more research later: OpenBSD behaves like NetBSD, while
>> FreeBSD behaves like Linux.  So that's pretty inconclusive about
>> what the aboriginal behavior was.  I also found that OpenIndiana
>> behaves like Linux.

> Out of curiosity, FreeBSD changed that in 2016:
> https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e

Hah, so that lends weight to the theory that this is an ancient
Berkeley mistake.  I wonder how hard it would be to convince
POSIX that they should be more specific about which errno is
expected here.

In the meantime, though, we'll have to deal with the existing
behavior for years to come.  So I'll go ahead with that patch.
I like having a loop limit there anyway --- I was never exactly
convinced that retrying indefinitely was a good idea.

            regards, tom lane



Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Thomas Munro
Date:
On Wed, Aug 13, 2025 at 8:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In the meantime, though, we'll have to deal with the existing
> behavior for years to come.  So I'll go ahead with that patch.
> I like having a loop limit there anyway --- I was never exactly
> convinced that retrying indefinitely was a good idea.

LGTM.  1000 does seem like "enough" following the arguments made
earlier.  Gavin also raises a good point that it might be nice to
randomise the search a bit, but that sounds like another patch for
master.

It's not a very nice interface and I hope to get rid of it one day.
When I'm working on a Mac I frequently finish up with leaked SysV IPC
entries of both kinds and have to clean them up manually, though
admittedly while hacking and debugging PostgreSQL you exit in weird
and wonderful ways.

FWIW in early prototype multithreading patches you can just use
sem_init() on all these systems since you don't need pshared=1.  It
works, but macOS spews "deprecated" warnings when you compile.  Kinda
weird for the best semaphore API IMHO.  But you can also just
implement LWLock on top of pthread_rwlock_t or futexes with some more
work, so I'm not bothered about that and there are more important
problems to solve in that project for now...

Here's a story from an Apple engineer about why they don't have pshared=1:

https://www.quora.com/Why-does-OS-X-not-support-unnamed-semaphores

I don't actually see any reason they couldn't adopt the same trick as
other modern systems, without an ABI break: use the *address* to
identify semaphores, converted to {VM object, offset} to form the key
in a hash table of waitlists that only has an entry while someone is
actually waiting.  No expensive kernel resources while not being
waited on, no leaky SysV namespace management, ideal access control by
memory map.  But you probably have to move to California if you want
to fix that.

They also have the CMU Mach semaphore API, semaphore_create(), which
some other projects seem to be using.  I think it could be made to
work, but Mach ports are not inherited by child processes so you'd
have to send them explicitly after fork() which sounds painful and
slow, not to mention that the documentation for all that is thin on
the ground...



Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Thomas Munro
Date:
On Wed, Aug 13, 2025 at 11:29 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> FWIW in early prototype multithreading patches you can just use
> sem_init() on all these systems since you don't need pshared=1.

Oops, I misremembered that.  It works on NetBSD and OpenBSD, but not
macOS :-(.

Hmm, but it looks like there has been a new development in macOS 15.
They have finally made the futex API public:

https://developer.apple.com/documentation/os/os_sync_wait_on_address

That was one of the sticking points for my patch that added our own
replacement sem_init() implementation.  At the time the only "proper"
way to reach the undocumented and undeclared
__ulock_wait()/__ulock_wake() functions was to write a module in C++
using std::atomic<int>::wait()/notify(), which I had started to wonder
about, but here it is with a shiny new C interface and documentation.

Maybe I should have another go at that!



Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Thomas Munro
Date:
On Wed, Aug 13, 2025 at 12:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Maybe I should have another go at that!

Here's a new attempt at that.  It picks futexes automatically, unless
you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks
sysv as before.

Attachments

Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Thomas Munro
Date:
On Wed, Aug 13, 2025 at 7:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a new attempt at that.  It picks futexes automatically, unless
> you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks
> sysv as before.

I compared my hand-rolled atomics logic against FreeBSD's user space
semaphores[1].  sem_post() and sem_trywait() are about the same, and I
guess there just aren't many ways to write those.  Minor differences:

1. They use CAS in sem_post() because they want to report EOVERFLOW if
you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
just used fetch-and-add.  Is that bad?  I noticed their alternative
older version (remove "_new" from [1]) also uses a whole 32-bit
counter like me, so perhaps isn't so worried about exceeding a narrow
field of bits, and does it my way.
2.  In sem_trywait() they reload the value while our
pg_atomic_compare_exchange_u32() does that for you, and I also prefer
to avoid assignment expression syntax if I can.

On the other hand my v1 sem_wait() seemed a little janky and weird in
comparison to theirs, and I was a bit worried about bugs.  I thought
about rewriting it to look more like that, but then I realised that I
could use sem_trywait() to implement sem_wait() for the same net
effect.  It's much shorter and sweeter and more intuitively
understandable, I think?

I also didn't bother to supply a couple of functions that posix_sema.c
doesn't use.

There's no immediately obvious performance difference, but I wasn't
expecting there to be: it's used in the slow path of our LWLocks and
they already have similar user space atomics in front, so you don't
really avoid entering the kernel with this technique except perhaps
just occasionally if you wait just as someone posts.  This is all just
about the convenience of banishing System V cruft.  It doesn't seem
to be any slower though, in simple testing, and CI might even be a bit
faster based on spot checks of a few runs... but proper statistical
tools might be needed to see if that is a real phenomenon.

Anyone got any better ideas for how to organise the build scripts,
control the feature, naming etc?  I came up with is this, which
automatically falls back to sysv if it can't find what it needs
(respecting Apple's deployment target thing following what we did for
preadv):

      meson.build:        sema_type = "unnamed_posix+emulation"
      configure template: PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"

The pg_futex.h abstraction layer doesn't have any support for other
OSes yet because I tried to pare this patch down to the absolute
minimum to solve this problem, but I did try at least a little bit to
anticipate that (having removed it from earlier versions from a few
years ago) when designing the futex contracts.  We could trivially add
at least OpenBSD support to use emulated semaphores there.  I never
quite figured out whether the NetBSD futex API is really available
outside its Linux syscall emulator, but that might be doable too.  And
futexes might of course turn out to have applications in other
synchronisation primitives.

Any thoughts on whether this is worth pursuing, and what kinds of
validation might be useful?

[1] https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/sem_new.c

Attachments

Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: "Burd, Greg"
Date:
> On Aug 12, 2025, at 4:07 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Mon, Aug 11, 2025 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A bit more research later: OpenBSD behaves like NetBSD, while
>> FreeBSD behaves like Linux.  So that's pretty inconclusive about
>> what the aboriginal behavior was.  I also found that OpenIndiana
>> behaves like Linux.
>
> Out of curiosity, FreeBSD changed that in 2016:
>
> https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e

I believe that this is the code in macOS's "XNU" [1][2] which has similar
logic (unsurprisingly) to the FreeBSD code before the identified commit [3].

best.

-greg

[1]
https://github.com/apple-oss-distributions/xnu/blob/e3723e1f17661b24996789d8afc084c0c3303b26/bsd/kern/sysv_sem.c#L860-L889
[2] "XNU is a hybrid kernel combining the Mach kernel developed at Carnegie Mellon University with components from
FreeBSDand a C++ API for writing drivers called IOKit." 
[3] https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e




Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: "Burd, Greg"
Date:
> On Aug 13, 2025, at 8:01 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Aug 13, 2025 at 7:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Here's a new attempt at that.  It picks futexes automatically, unless
>> you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks
>> sysv as before.

Thomas, thanks for digging into this platform specific issue.  I've read
your patch and done some research.  Thoughts below...

> I compared my hand-rolled atomics logic against FreeBSD's user space
> semaphores[1].  sem_post() and sem_trywait() are about the same, and I
> guess there just aren't many ways to write those.  Minor differences:
>
> 1. They use CAS in sem_post() because they want to report EOVERFLOW if
> you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
> just used fetch-and-add.  Is that bad?  I noticed their alternative
> older version (remove "_new" from [1]) also uses a whole 32 bit
> counter like me, so perhaps isn't so worried about exceeding a narrow
> field of bits, and does it my way.

The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
the sem_post() function shall fail if the maximum allowable value of the
semaphore would be exceeded [EOVERFLOW].  So, maybe a CAS is the way to
go after all?

> 2.  In sem_trywait() they reload the value while our
> pg_atomic_compare_exchange_u32() does that for you, and I also prefer
> to avoid assignment expression syntax if I can.
>
> On the other hand my v1 sem_wait() seemed a little janky and weird in
> comparison to theirs, and I was a bit worried about bugs.  I thought
> about rewriting it to look more like that, but then I realised that I
> could use sem_trywait() to implement sem_wait() for the same net
> effect.  It's much shorter and sweeter and more intuitively
> understandable, I think?

So if I understand it, the way your sem_wait() works is:
 * Try a non-blocking acquire with pg_sem_trywait()
 * If that fails (semaphore == 0): call pg_futex_wait_u32() to block efficiently
 * When woken up: loop back and try again

This is essentially how sem_wait() implementations work on Linux IIRC;
they use futexes or similar kernel primitives for efficient blocking.

> I also didn't bother to supply a couple of functions that posix_sema.c
> doesn't use.

I think that's fine for now.

> There's no immediately obvious performance difference, but I wasn't
> expecting there to be: it's used in the slow path of our LWLocks and
> they already have similar user space atomics in front, so you don't
> really avoid entering the kernel with this technique except perhaps
> just occasionally if you wait just as someone posts.  This is all just
> about the convenience of banishing System V cruft.  It doesn't seem
> to be any slower though, in simple testing, and CI might even be a bit
> faster based on spot checks of a few runs... but proper statistical
> tools might be needed to see if that is a real phenomenon.

That lines up with my intuition after reading the patch.  I don't know
that it is even worth diving too deep in performance testing this.

> Anyone got any better ideas for how to organise the build scripts,
> control the feature, naming etc?  I came up with is this, which
> automatically falls back to sysv if it can't find what it needs
> (respecting Apple's deployment target thing following what we did for
> preadv):
>
>      meson.build:        sema_type = "unnamed_posix+emulation"
>      configure template: PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"

I'm not an expert on Meson, but I do have years of Autoconf work under
my belt.  Offhand this seems reasonable, but I'm new to this project.

> The pg_futex.h abstraction layer doesn't have any support for other
> OSes yet because I tried to pare this patch down to the absolute
> minimum to solve this problem, but I did try at least a little bit to
> anticipate that (having removed it from earlier versions from a few
> years ago) when designing the futex contracts.  We could trivially add
> at least OpenBSD support to use emulated semaphores there.  I never
> quite figured out whether the NetBSD futex API is really available
> outside its Linux syscall emulator, but that might be doable too.  And
> futexes might of course turn out to have applications in other
> synchronisation primitives.
>
> Any thoughts on whether this is worth pursuing, and what kinds of
> validation might be useful?
>
> [1] https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/sem_new.c
> <v2-0001-Use-futex-based-semaphore-emulation-on-macOS.patch>

Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX

Overall, I like the approach, it addresses a real limitation on macOS and
could be expanded in the future to cover other similar issues on other
platforms.

best.

-greg

[1] https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/functions/sem_post.html




Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From: Thomas Munro
Date:
On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg@burd.me> wrote:
> > 1. They use CAS in sem_post() because they want to report EOVERFLOW if
> > you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
> > just used fetch-and-add.  Is that bad?  I noticed their alternative
> > older version (remove "_new" from [1]) also uses a whole 32 bit
> > counter like me, so perhaps isn't so worried about exceeding a narrow
> > field of bits, and does it my way.
>
> The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
> the sem_post() function shall fail if the maximum allowable value of the
> semaphore would be exceeded [EOVERFLOW].  So, maybe a CAS is the way to
> go after all?

Hmm.  Before last year, POSIX *didn't* specify EOVERFLOW.  I was
looking at one of the earlier versions.  But I see now that POSIX 2024
has just added it!  So the question is: do we care?

Supposing posix_sema.c checked that the maximum number of backends
didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
today if you later exceed it in sem_post() you'll get either FATAL:
EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
including a hang due to a corrupted counter, I guess).  I don't know
if systems with a low enough value exist in the wild, but if this
patch were to define its own PG_SEM_VALUE_MAX and then redefine
SEM_VALUE_MAX to that, it certainly wouldn't be one of them.  It would
naturally use UINT32_MAX, and then that check would be tautological
given the types involved, not to mention our own lower limits.  So
perhaps it's not worth more than a comment?

I've found only very brief discussions in the standards community so far:

https://austingroupbugs.net/view.php?id=315
https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow

> So if I understand it, the way your sem_trywait() works is:
>  * Try non-blocking acquire with pg_sem_trywait()
>  * If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
>  * When woken up: Loop back to try again
>
> This is essentially how sem_trywait() implementations work in Linux IIRC,
> they use futexes or similar kernel primitives for efficient blocking.

Yeah.  Linux gave us the name futex and seems to have established a
sort of de facto standard semantics for it, as seen here in macOS.

It's not the only way to skin that cat... IIUC there were various
older systems with the double-check to avoid the race against user
space, and "address as wait channel" wasn't new either, but I think at
least some of the other things like that might have required a
waitlist in user space; I'm not too sure about that history though, a
lot of it was probably closed source.  It's not even too clear to me
how that waitlist tradeoff works out, but in any case, that is exactly
how a fallback implementation would probably work if we ever decided
we wanted pg_futex everywhere: stick a ConditionVariable (or similar)
into the struct.  Or, if you want to preserve sizeof(pg_futex_u32) ==
sizeof(uint32), and maybe even do away with the struct and just use
pg_atomic_uint32 directly, you could hash the address to select a
cv from a table of (say) 256 of them, with the annoying detail that
you'd probably need a separate wake/wait function pair for {DSM
segment handle, offset} for memory mapped at different addresses.  But
we don't need to make decisions about any of that vapourware to solve
this immediate single-platform problem.

> Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX

We'd be checking if a uint32 is > UINT32_MAX, assuming again that
we'd decided to define our own PG_SEM_VALUE_MAX and not the (entirely
unrelated) system SEM_VALUE_MAX.  Might even get a compiler warning if
you tried it.

> Overall, I like the approach; it addresses a real limitation on macOS and
> could be expanded in the future to cover other similar issues on other
> platforms.

Thanks for looking!  And especially for the sanity check on the
definitions of sem_wait()/sem_trywait().



Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From
Thomas Munro
Date:
On Sat, Aug 16, 2025 at 2:25 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Supposing posix_sema.c checked that the maximum number of backends
> didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
> today if you later exceed it in sem_post() you'll get either FATAL:
> EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
> including a hang due to a corrupted counter, I guess).

And just by the way, each backend has its own semaphore, so in actual
usage we're probably only talking about the "superfluous wakeups"
mentioned in lwlock.c, clog.c and procarray.c.  I suppose it's not
expected to go very high at all?  I was just trying to think about
where the bounds on that come from in theory, while working through
the case for ignoring EOVERFLOW...



Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> And just by the way, each backend has its own semaphore, so in actual
> usage we're probably only talking about the "superfluous wakeups"
> mentioned in lwlock.c, clog.c and procarray.c.  I suppose it's not
> expected to go very high at all?

I wouldn't expect so.

            regards, tom lane



Re: `pg_ctl init` crashes when run concurrently; semget(2) suspected

From
Greg Burd
Date:
> On Aug 15, 2025, at 10:25 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg@burd.me> wrote:
>>> 1. They use CAS in sem_post() because they want to report EOVERFLOW if
>>> you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
>>> just used fetch-and-add.  Is that bad?  I noticed their alternative
>>> older version (remove "_new" from [1]) also uses a whole 32 bit
>>> counter like me, so perhaps isn't so worried about exceeding a narrow
>>> field of bits, and does it my way.
>>
>> The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
>> the sem_post() function shall fail if the maximum allowable value of the
>> semaphore would be exceeded [EOVERFLOW].  So, maybe a CAS is the way to
>> go after all?
>
> Hmm.  Before last year, POSIX *didn't* specify EOVERFLOW.  I was
> looking at one of the earlier versions.  But I see now that POSIX 2024
> has just added it!  So the question is: do we care?

I’m on the fence about "caring".  If your code fixes the issue, we could
simply add a comment about eventually needing to support EOVERFLOW.  I
don't think what you have in the patch is wrong, and we're solving our
issue, not the world's inconsistencies.

> Supposing posix_sema.c checked that the maximum number of backends
> didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
> today if you later exceed it in sem_post() you'll get either FATAL:
> EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
> including a hang due to a corrupted counter, I guess).  I don't know
> if systems with a low enough value exist in the wild, but if this
> patch were to define its own PG_SEM_VALUE_MAX and then redefine
> SEM_VALUE_MAX to that, it certainly wouldn't be one of them.  It would
> naturally use UINT32_MAX, and then that check would be tautological
> given the types involved, not to mention our own lower limits.  So
> perhaps it's not worth more than a comment?
>
> I've found only very brief discussions in the standards community so far:
>
> https://austingroupbugs.net/view.php?id=315
> https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow

I see what you mean, a comment that captures this decision and we're
done IMO.

>
>> So if I understand it, the way your sem_trywait() works is:
>> * Try non-blocking acquire with pg_sem_trywait()
>> * If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
>> * When woken up: Loop back to try again
>>
>> This is essentially how sem_trywait() implementations work in Linux IIRC,
>> they use futexes or similar kernel primitives for efficient blocking.
>
> Yeah.  Linux gave us the name futex and seems to have established a
> sort of de facto standard semantics for it, as seen here in macOS.

Indeed, the "fast userspace mutex".  *sigh* ;-P

I was calling this out because an implementation of sem_trywait() that
used a sem_wait() not based on a futex would be less correct, AFAICT.

> It's not the only way to skin that cat... IIUC there were various
> older systems with the double-check to avoid the race against user
> space, and "address as wait channel" wasn't new either, but I think at
> least some of the other things like that might have required a
> waitlist in user space; I'm not too sure about that history though, a
> lot of it was probably closed source.  It's not even too clear to me
> how that waitlist tradeoff works out, but in any case, that is exactly
> how a fallback implementation would probably work if we ever decided
> we wanted pg_futex everywhere: stick a ConditionVariable (or similar)
> into the struct.  Or, if you want to preserve sizeof(pg_futex_u32) ==
> sizeof(uint32), and maybe even do away with the struct and just use
> pg_atomic_uint32 directly, you could hash the address to select a
> cv from a table of (say) 256 of them, with the annoying detail that
> you'd probably need a separate wake/wait function pair for {DSM
> segment handle, offset} for memory mapped at different addresses.  But
> we don't need to make decisions about any of that vapourware to solve
> this immediate single-platform problem.


>> Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX
>
> We'd be checking if a uint32 is > UINT32_MAX, assuming again that
> we'd decided to define our own PG_SEM_VALUE_MAX and not the (entirely
> unrelated) system SEM_VALUE_MAX.  Might even get a compiler warning if
> you tried it.

Oops, yep.  You are right. :)

>> Overall, I like the approach; it addresses a real limitation on macOS and
>> could be expanded in the future to cover other similar issues on other
>> platforms.
>
> Thanks for looking!  And especially for the sanity check on the
> definitions of sem_wait()/sem_trywait().

Sure thing, anytime.  Not sure I helped much...

-greg