Discussion: `pg_ctl init` crashes when run concurrently; semget(2) suspected
Hi,
Summary: semget(2) behaves differently on macOS and requires extra care.
I have many tests which spin up clusters using `pg_ctl init`, each in its own single-use temporary directory. Each test is run for every PostgreSQL installation found on the host machine. These tests are often run concurrently. Since adding PostgreSQL 17 to the mix, I've been getting sporadic failures on macOS:
FATAL: could not create semaphores: Invalid argument
DETAIL: Failed system call was semget(176163502, 20, 03600).
child process exited with exit code 1
I think it's related to the increase of SEMAS_PER_SET in 38da053463bef32adf563ddee5277d16d2b6c5a (later reverted in 810a8b1c8051d4e8822967a96f133692698386de) combined with the behaviour of semget(2) on macOS.
I think the bug manifests because:
- I create two clusters concurrently using `pg_ctl init`. One cluster is PostgreSQL 17; the other is PostgreSQL 16 or earlier.
- Their data directories are separate but created close enough in time to have sequential inodes. This is relevant because the inode is used to seed the semaphore key.
- Somehow (waves hands) semget(2) in PostgreSQL 17 is called with a key that points at a preexisting semaphore set. On Linux, due to the IPC_CREAT | IPC_EXCL flags, this returns <0 with errno set to EEXIST. On macOS, errno is instead set to EINVAL, likely because the requested number of semaphores is greater than the number in the existing set. This happens in the InternalIpcSemaphoreCreate function, which then aborts the process.
The attached patch fixes the issue, I think, and contains another description of this mechanism. On EINVAL it makes an additional call to semget(2), this time requesting zero semaphores.
The patch is relative to master, but I developed it against REL_17_5; it should apply cleanly to both. I think it would be good to backport a fix to 17 too.
If anyone is feeling nerd-sniped, some better proof of the "waves hands" bit would be useful, because that was a working hypothesis that led to a working fix, and I have not yet had time to investigate further.
Please consider the patch for review. Thanks!
Gavin.
Attachments
Gavin Panella <gavinpanella@gmail.com> writes:
> I have many tests which spin up clusters using `pg_ctl init`, each in its own single-use temporary directory. Each test is run for every PostgreSQL installation found on the host machine. These tests are often run concurrently. Since adding PostgreSQL 17 to the mix, I've been getting sporadic failures on macOS:
> FATAL: could not create semaphores: Invalid argument
> DETAIL: Failed system call was semget(176163502, 20, 03600).
> child process exited with exit code 1

Ugh.

> I think it's related to the increase of SEMAS_PER_SET in 38da053463bef32adf563ddee5277d16d2b6c5a (later reverted in 810a8b1c8051d4e8822967a96f133692698386de) combined with the behaviour of semget(2) on macOS.

Yeah, I guess you're right; but it's not macOS-specific AFAICS. "man semget" quoth

     [EINVAL]  The number of semaphores requested is either less than 1 or
               greater than the system imposed maximum per set.

     [EINVAL]  A semaphore identifier exists for the argument key, but the
               number of semaphores in the set associated with it is less
               than nsems, and nsems is non-zero.

This is from current macOS, but equivalent text appears on Linux and in the POSIX spec. So it's just luck that nobody has reported the same problem elsewhere --- unless maybe there is some macOS-specific behavior making it more likely that different installs would try the same key.

Nonetheless, key collisions must be expected, and we have little right to assume that a pre-existing semaphore set must have enough semaphores. Using the same errno for these cases was really quite poorly chosen, because in the first case we should not retry with another key, but in the second case we should. If the problem is that SEMMSL is too small, and we retry, it'll be an infinite loop.

I don't care for your patch because (a) I'm not sure that the result of asking for zero semaphores is portable (the Linux man page suggests that it might be allowed there); and (b) if the problem is SEMMSL, your code will fall into the infinite loop.

My first thought about fixing it is to go ahead and retry after EINVAL, but keep a count of failures, and give up after 100 or 1000 or so. This'd be safer in any case --- it's not really great that IpcSemaphoreCreate is willing to retry an infinite number of times.

If we do retry, the semget() in IpcSemaphoreCreate will fail again, so we'll just skip over that semaphore key and try another. It would be marginally nicer to release the semaphore set if it's unused, but I'm not sure it'd be safe to assume that a set with too few semaphores is one of ours. It certainly didn't come from a prior cycle of life of the same postmaster.

regards, tom lane
I wrote:
> This is from current macOS, but equivalent text appears on Linux and in the POSIX spec. So it's just luck that nobody has reported the same problem elsewhere --- unless maybe there is some macOS-specific behavior making it more likely that different installs would try the same key.

Hmm, no, there is a platform dependency here. I made the attached test program to see what happens when there's a key collision, and on Linux I get

    semget(SEMAS_PER_SET) failed: File exists
    semget(SEMAS_PER_SET + 1) failed: File exists

but macOS and NetBSD give

    semget(SEMAS_PER_SET) failed: File exists
    semget(SEMAS_PER_SET + 1) failed: Invalid argument

I didn't try other BSDen; this might be a NetBSD-ism that Apple inherited, or maybe it's common among the BSDen. I don't see any text in POSIX specifying which errno is to be returned in this case, so we can't really argue that the behavior is wrong.

regards, tom lane

#include <stdio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/stat.h>

typedef key_t IpcSemaphoreKey;	/* semaphore key passed to semget(2) */

#ifndef HAVE_UNION_SEMUN
union semun
{
	int			val;
	struct semid_ds *buf;
	unsigned short *array;
};
#endif

#define SEMAS_PER_SET 16
#define IPCProtection (0600)	/* access/modify by user only */

int
main(int argc, char **argv)
{
	IpcSemaphoreKey semKey = 42;
	int			semId;
	int			semId2;
	union semun semun;

	/*
	 * Find a free semaphore key, and create a set with SEMAS_PER_SET semas.
	 */
	for (;;)
	{
		semId = semget(semKey, SEMAS_PER_SET,
					   IPC_CREAT | IPC_EXCL | IPCProtection);
		if (semId >= 0)
			break;
		semKey++;
	}

	/*
	 * Check behavior if we try to make another set with same # of semas.
	 */
	semId2 = semget(semKey, SEMAS_PER_SET,
					IPC_CREAT | IPC_EXCL | IPCProtection);
	if (semId2 < 0)
		fprintf(stderr, "semget(SEMAS_PER_SET) failed: %s\n",
				strerror(errno));
	else
		fprintf(stderr, "semget(SEMAS_PER_SET) unexpectedly succeeded\n");

	/*
	 * Check behavior if we try to make another set with more semas.
	 */
	semId2 = semget(semKey, SEMAS_PER_SET + 1,
					IPC_CREAT | IPC_EXCL | IPCProtection);
	if (semId2 < 0)
		fprintf(stderr, "semget(SEMAS_PER_SET + 1) failed: %s\n",
				strerror(errno));
	else
		fprintf(stderr, "semget(SEMAS_PER_SET + 1) unexpectedly succeeded\n");

	/*
	 * Clean up
	 */
	semun.val = 0;				/* unused, but keep compiler quiet */
	if (semctl(semId, 0, IPC_RMID, semun) < 0)
		fprintf(stderr, "semctl(%d, 0, IPC_RMID, ...) failed: %s\n",
				semId, strerror(errno));

	return 0;
}
I wrote:
> I didn't try other BSDen; this might be a NetBSD-ism that Apple inherited, or maybe it's common among the BSDen.

A bit more research later: OpenBSD behaves like NetBSD, while FreeBSD behaves like Linux. So that's pretty inconclusive about what the aboriginal behavior was. I also found that OpenIndiana behaves like Linux.

For Linux and FreeBSD, it doesn't actually matter to us because we no longer use SysV semaphores on those platforms. But this is a live problem on macOS, NetBSD, OpenBSD.

regards, tom lane
I wrote:
> My first thought about fixing it is to go ahead and retry after EINVAL, but keep a count of failures, and give up after 100 or 1000 or so. This'd be safer in any case --- it's not really great that IpcSemaphoreCreate is willing to retry an infinite number of times.

Concretely, about like the attached. I don't have any clear idea about what loop limit to use, but I think we do need one.

regards, tom lane

diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index 423b2b4f9d6..8984a0aa2d1 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -69,7 +69,7 @@ static int	nextSemaNumber; /* next free sem num in last sema set */
 
 static IpcSemaphoreId InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey,
-												 int numSems);
+												 int numSems, bool retry_ok);
 static void IpcSemaphoreInitialize(IpcSemaphoreId semId, int semNum, int value);
 static void IpcSemaphoreKill(IpcSemaphoreId semId);
@@ -88,9 +88,13 @@ static void ReleaseSemaphores(int status, Datum arg);
  * If we fail with a failure code other than collision-with-existing-set,
  * print out an error and abort. Other types of errors suggest nonrecoverable
  * problems.
+ *
+ * Unfortunately, it's sometimes hard to tell whether errors are
+ * nonrecoverable. Our caller keeps track of whether continuing to retry
+ * is sane or not; if not, we abort on failure regardless of the errno.
  */
 static IpcSemaphoreId
-InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
+InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems, bool retry_ok)
 {
 	int			semId;
@@ -101,16 +105,27 @@ InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems)
 		int			saved_errno = errno;
 
 		/*
-		 * Fail quietly if error indicates a collision with existing set. One
-		 * would expect EEXIST, given that we said IPC_EXCL, but perhaps we
-		 * could get a permission violation instead?  Also, EIDRM might occur
-		 * if an old set is slated for destruction but not gone yet.
+		 * Fail quietly if error suggests a collision with an existing set and
+		 * our caller has not lost patience.
+		 *
+		 * One would expect EEXIST, given that we said IPC_EXCL, but perhaps
+		 * we could get a permission violation instead.  On some platforms
+		 * EINVAL will be reported if the existing set has too few semaphores.
+		 * Also, EIDRM might occur if an old set is slated for destruction but
+		 * not gone yet.
+		 *
+		 * EINVAL is the key reason why we need the caller-level loop limit,
+		 * as it can also mean that the platform's SEMMSL is less than
+		 * numSems, and that condition will not go away.
 		 */
-		if (saved_errno == EEXIST || saved_errno == EACCES
+		if (retry_ok &&
+			(saved_errno == EEXIST
+			 || saved_errno == EACCES
+			 || saved_errno == EINVAL
 #ifdef EIDRM
-			|| saved_errno == EIDRM
+			 || saved_errno == EIDRM
 #endif
-			)
+			 ))
 			return -1;
 
@@ -207,17 +222,22 @@ IpcSemaphoreGetLastPID(IpcSemaphoreId semId, int semNum)
 static IpcSemaphoreId
 IpcSemaphoreCreate(int numSems)
 {
+	int			num_tries = 0;
 	IpcSemaphoreId semId;
 	union semun semun;
 	PGSemaphoreData mysema;
 
 	/* Loop till we find a free IPC key */
-	for (nextSemaKey++;; nextSemaKey++)
+	for (nextSemaKey++;; nextSemaKey++, num_tries++)
 	{
 		pid_t		creatorPID;
 
-		/* Try to create new semaphore set */
-		semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1);
+		/*
+		 * Try to create new semaphore set.  Give up after trying 1000
+		 * distinct IPC keys.
+		 */
+		semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1,
+										   num_tries < 1000);
 		if (semId >= 0)
 			break;				/* successful create */
 
@@ -254,7 +274,7 @@ IpcSemaphoreCreate(int numSems)
 		/*
 		 * Now try again to create the sema set.
 		 */
-		semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1);
+		semId = InternalIpcSemaphoreCreate(nextSemaKey, numSems + 1, true);
 		if (semId >= 0)
 			break;				/* successful create */
With that fix applied to REL_17_5 things are working well. Limiting the search sounds like an improvement too.
As an experiment I added a log line for when semget in InternalIpcSemaphoreCreate returns -1. When running `pg_ctl init` for this local build concurrently with `pg_ctl init` from PostgreSQL 15 (or another version prior to 17), I saw ~8 logged failures when there was contention. As I increased the concurrency, the maximum number of logged failures looked to be roughly 8 times the concurrency. For me, then, running `pg_ctl init` with a concurrency of 125 would be needed to even begin to exceed the max retries of 1000, in the worst case. That sounds like enough headroom.
Then I thought: I'm only seeing the log from one of those instances, yet they're all going through the same search for free semaphore sets. That's a few system calls going to waste. Maybe not important in the big picture, but it gave me an idea to left shift nextSemaKey in PGReserveSemaphores, i.e. `nextSemaKey = statbuf.st_ino << 4`, to give each pg_ctl process a few guaranteed uncontested keys (at least, uncontested between themselves). In a small test this eliminated contention for semaphore sets due to concurrency. It is more of an optimisation though, rather than a bug fix.
Gavin
Gavin Panella <gavinpanella@gmail.com> writes:
> With that fix applied to REL_17_5 things are working well. Limiting the search sounds like an improvement too.

Cool. I'll push the fix after our release freeze lifts.

> Then I thought: I'm only seeing the log from one of those instances, yet they're all going through the same search for free semaphore sets. That's a few system calls going to waste. Maybe not important in the big picture, but it gave me an idea to left shift nextSemaKey in PGReserveSemaphores, i.e. `nextSemaKey = statbuf.st_ino << 4`, to give each pg_ctl process a few guaranteed uncontested keys (at least, uncontested between themselves). In a small test this eliminated contention for semaphore sets due to concurrency. It is more of an optimisation though, rather than a bug fix.

Meh ... if we thought this was worth worrying about, I'd be more inclined to run the st_ino through a hash function to spread out the possible values a little more. But I think in nearly all scenarios it'd be a waste of effort. If collisions were probable we'd have found this issue years ago.

regards, tom lane
On Tue, 12 Aug 2025 at 01:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Cool. I'll push the fix after our release freeze lifts.
Splendid, thank you!
On Mon, Aug 11, 2025 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A bit more research later: OpenBSD behaves like NetBSD, while FreeBSD behaves like Linux. So that's pretty inconclusive about what the aboriginal behavior was. I also found that OpenIndiana behaves like Linux.

Out of curiosity, FreeBSD changed that in 2016:

https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e
Thomas Munro <thomas.munro@gmail.com> writes:
> On Mon, Aug 11, 2025 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A bit more research later: OpenBSD behaves like NetBSD, while FreeBSD behaves like Linux. So that's pretty inconclusive about what the aboriginal behavior was. I also found that OpenIndiana behaves like Linux.
> Out of curiosity, FreeBSD changed that in 2016:
> https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e

Hah, so that lends weight to the theory that this is an ancient Berkeley mistake. I wonder how hard it would be to convince POSIX that they should be more specific about which errno is expected here.

In the meantime, though, we'll have to deal with the existing behavior for years to come. So I'll go ahead with that patch. I like having a loop limit there anyway --- I was never exactly convinced that retrying indefinitely was a good idea.

regards, tom lane
On Wed, Aug 13, 2025 at 8:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In the meantime, though, we'll have to deal with the existing behavior for years to come. So I'll go ahead with that patch. I like having a loop limit there anyway --- I was never exactly convinced that retrying indefinitely was a good idea.

LGTM. 1000 does seem like "enough" following the arguments made earlier. Gavin does also raise a good point that it might be nice to randomise the search a bit, but that sounds like another patch for master.

It's not a very nice interface and I hope to get rid of it one day. When I'm working on a Mac I frequently finish up with leaked SysV IPC entries of both kinds and have to clean them up manually, though admittedly while hacking and debugging PostgreSQL you exit in weird and wonderful ways.

FWIW in early prototype multithreading patches you can just use sem_init() on all these systems since you don't need pshared=1. It works, but macOS spews "deprecated" warnings when you compile. Kinda weird for the best semaphore API IMHO. But you can also just implement LWLock on top of pthread_rwlock_t or futexes with some more work, so I'm not bothered about that and there are more important problems to solve in that project for now...

Here's a story from an Apple engineer about why they don't have pshared=1:

https://www.quora.com/Why-does-OS-X-not-support-unnamed-semaphores

I don't actually see any reason they couldn't adopt the same trick as other modern systems, without an ABI break: use the *address* to identify semaphores, converted to {VM object, offset} to form the key in a hash table of waitlists that only has an entry while someone is actually waiting. No expensive kernel resources while not being waited on, no leaky SysV namespace management, ideal access control by memory map. But you probably have to move to California if you want to fix that.

They also have the CMU Mach semaphore API, semaphore_create(), which some other projects seem to be using. I think it could be made to work, but Mach ports are not inherited by child processes so you'd have to send them explicitly after fork(), which sounds painful and slow, not to mention that the documentation for all that is thin on the ground...
On Wed, Aug 13, 2025 at 11:29 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> FWIW in early prototype multithreading patches you can just use sem_init() on all these systems since you don't need pshared=1.

Oops, I misremembered that. It works on NetBSD and OpenBSD, but not macOS :-(.

Hmm, but it looks like there has been a new development in macOS 15. They have finally made the futex API public:

https://developer.apple.com/documentation/os/os_sync_wait_on_address

That was one of the sticking points for my patch that added our own replacement sem_init() implementation. At the time the only "proper" way to reach the undocumented and undeclared __ulock_wait()/__ulock_wake() functions was to write a module in C++ using std::atomic<int>::wait()/notify(), which I had started to wonder about, but here it is with a shiny new C interface and documentation. Maybe I should have another go at that!
On Wed, Aug 13, 2025 at 12:45 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Maybe I should have another go at that! Here's a new attempt at that. It picks futexes automatically, unless you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks sysv as before.
Attachments
On Wed, Aug 13, 2025 at 7:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a new attempt at that. It picks futexes automatically, unless you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks sysv as before.

I compared my hand-rolled atomics logic against FreeBSD's user space semaphores[1]. sem_post() and sem_trywait() are about the same, and I guess there just aren't many ways to write those. Minor differences:

1. They use CAS in sem_post() because they want to report EOVERFLOW if you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I just used fetch-and-add. Is that bad? I noticed their alternative older version (remove "_new" from [1]) also uses a whole 32 bit counter like me, so perhaps isn't so worried about exceeding a narrow field of bits, and does it my way.

2. In sem_trywait() they reload the value while our pg_atomic_compare_exchange_u32() does that for you, and I also prefer to avoid assignment expression syntax if I can.

On the other hand my v1 sem_wait() seemed a little janky and weird in comparison to theirs, and I was a bit worried about bugs. I thought about rewriting it to look more like that, but then I realised that I could use sem_trywait() to implement sem_wait() for the same net effect. It's much shorter and sweeter and more intuitively understandable, I think?

I also didn't bother to supply a couple of functions that posix_sema.c doesn't use.

There's no immediately obvious performance difference, but I wasn't expecting there to be: it's used in the slow path of our LWLocks and they already have similar user space atomics in front, so you don't really avoid entering the kernel with this technique except perhaps just occasionally if you wait just as someone posts. This is all just about the convenience of banishing System V cruft. It doesn't seem to be any slower though, in simple testing, and CI might even be a bit faster based on spot checks of a few runs... but proper statistical tools might be needed to see if that is a real phenomenon.

Anyone got any better ideas for how to organise the build scripts, control the feature, naming etc? I came up with this, which automatically falls back to sysv if it can't find what it needs (respecting Apple's deployment target thing following what we did for preadv):

meson.build: sema_type = "unnamed_posix+emulation"
configure template: PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"

The pg_futex.h abstraction layer doesn't have any support for other OSes yet because I tried to pare this patch down to the absolute minimum to solve this problem, but I did try at least a little bit to anticipate that (having removed it from earlier versions from a few years ago) when designing the futex contracts. We could trivially add at least OpenBSD support to use emulated semaphores there. I never quite figured out whether the NetBSD futex API is really available outside its Linux syscall emulator, but that might be doable too. And futexes might of course turn out to have applications in other synchronisation primitives.

Any thoughts on whether this is worth pursuing, and what kinds of validation might be useful?

[1] https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/sem_new.c
Attachments
> On Aug 12, 2025, at 4:07 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Mon, Aug 11, 2025 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A bit more research later: OpenBSD behaves like NetBSD, while FreeBSD behaves like Linux. So that's pretty inconclusive about what the aboriginal behavior was. I also found that OpenIndiana behaves like Linux.
>
> Out of curiosity, FreeBSD changed that in 2016:
>
> https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e

I believe that this is the code in macOS's "XNU" [1][2] which has similar logic (unsurprisingly) to the FreeBSD code before the identified commit [3].

best.

-greg

[1] https://github.com/apple-oss-distributions/xnu/blob/e3723e1f17661b24996789d8afc084c0c3303b26/bsd/kern/sysv_sem.c#L860-L889
[2] "XNU is a hybrid kernel combining the Mach kernel developed at Carnegie Mellon University with components from FreeBSD and a C++ API for writing drivers called IOKit."
[3] https://github.com/freebsd/freebsd-src/commit/f00fb5457ebf5907055e420d803ac67fb098109e
> On Aug 13, 2025, at 8:01 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Aug 13, 2025 at 7:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Here's a new attempt at that. It picks futexes automatically, unless you export MACOSX_DEPLOYMENT_TARGET=14.3 or lower, and then it picks sysv as before.

Thomas, thanks for digging into this platform specific issue. I've read your patch and done some research. Thoughts below...

> I compared my hand-rolled atomics logic against FreeBSD's user space semaphores[1]. sem_post() and sem_trywait() are about the same, and I guess there just aren't many ways to write those. Minor differences:
>
> 1. They use CAS in sem_post() because they want to report EOVERFLOW if you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I just used fetch-and-add. Is that bad? I noticed their alternative older version (remove "_new" from [1]) also uses a whole 32 bit counter like me, so perhaps isn't so worried about exceeding a narrow field of bits, and does it my way.

The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that the sem_post() function shall fail if the maximum allowable value of the semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to go after all?

> 2. In sem_trywait() they reload the value while our pg_atomic_compare_exchange_u32() does that for you, and I also prefer to avoid assignment expression syntax if I can.
>
> On the other hand my v1 sem_wait() seemed a little janky and weird in comparison to theirs, and I was a bit worried about bugs. I thought about rewriting it to look more like that, but then I realised that I could use sem_trywait() to implement sem_wait() for the same net effect. It's much shorter and sweeter and more intuitively understandable, I think?

So if I understand it, the way your sem_wait() works is:

* Try a non-blocking acquire with pg_sem_trywait()
* If that fails (semaphore = 0): call pg_futex_wait_u32() to block efficiently
* When woken up: loop back to try again

This is essentially how sem_wait() implementations work on Linux IIRC; they use futexes or similar kernel primitives for efficient blocking.

> I also didn't bother to supply a couple of functions that posix_sema.c doesn't use.

I think that's fine for now.

> There's no immediately obvious performance difference, but I wasn't expecting there to be: it's used in the slow path of our LWLocks and they already have similar user space atomics in front, so you don't really avoid entering the kernel with this technique except perhaps just occasionally if you wait just as someone posts. This is all just about the convenience of banishing System V cruft. It doesn't seem to be any slower though, in simple testing, and CI might even be a bit faster based on spot checks of a few runs... but proper statistical tools might be needed to see if that is a real phenomenon.

That lines up with my intuition after reading the patch. I don't know that it is even worth diving too deep into performance testing this.

> Anyone got any better ideas for how to organise the build scripts, control the feature, naming etc? I came up with this, which automatically falls back to sysv if it can't find what it needs (respecting Apple's deployment target thing following what we did for preadv):
>
> meson.build: sema_type = "unnamed_posix+emulation"
> configure template: PREFERRED_SEMAPHORES="UNNAMED_POSIX+EMULATION"

I'm not an expert on Meson, but I do have years of Autoconf work under my belt. Offhand this seems reasonable, but I'm new to this project.

> The pg_futex.h abstraction layer doesn't have any support for other OSes yet because I tried to pare this patch down to the absolute minimum to solve this problem, but I did try at least a little bit to anticipate that (having removed it from earlier versions from a few years ago) when designing the futex contracts. We could trivially add at least OpenBSD support to use emulated semaphores there. I never quite figured out whether the NetBSD futex API is really available outside its Linux syscall emulator, but that might be doable too. And futexes might of course turn out to have applications in other synchronisation primitives.
>
> Any thoughts on whether this is worth pursuing, and what kinds of validation might be useful?
>
> [1] https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/sem_new.c
> <v2-0001-Use-futex-based-semaphore-emulation-on-macOS.patch>

Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX

Overall, I like the approach, it addresses a real limitation on macOS and could be expanded in the future to cover other similar issues on other platforms.

best.

-greg

[1] https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/functions/sem_post.html
On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg@burd.me> wrote:
>> 1. They use CAS in sem_post() because they want to report EOVERFLOW if you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I just used fetch-and-add. Is that bad? I noticed their alternative older version (remove "_new" from [1]) also uses a whole 32 bit counter like me, so perhaps isn't so worried about exceeding a narrow field of bits, and does it my way.
>
> The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that the sem_post() function shall fail if the maximum allowable value of the semaphore would be exceeded [EOVERFLOW]. So, maybe a CAS is the way to go after all?

Hmm. Before last year, POSIX *didn't* specify EOVERFLOW. I was looking at one of the earlier versions. But I see now that POSIX 2024 has just added it! So the question is: do we care?

Supposing posix_sema.c checked that the maximum number of backends didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose today if you later exceed it in sem_post() you'll get either FATAL: EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely including a hang due to a corrupted counter, I guess). I don't know if systems with a low enough value exist in the wild, but if this patch were to define its own PG_SEM_VALUE_MAX and then redefine SEM_VALUE_MAX to that, it certainly wouldn't be one of them. It would naturally use UINT32_MAX, and then that check would be tautological given the types involved, not to mention our own lower limits. So perhaps it's not worth more than a comment?

I've found only very brief discussions in the standards community so far:

https://austingroupbugs.net/view.php?id=315
https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow

> So if I understand it, the way your sem_wait() works is:
> * Try a non-blocking acquire with pg_sem_trywait()
> * If that fails (semaphore = 0): call pg_futex_wait_u32() to block efficiently
> * When woken up: loop back to try again
>
> This is essentially how sem_wait() implementations work on Linux IIRC; they use futexes or similar kernel primitives for efficient blocking.

Yeah. Linux gave us the name futex and seems to have established a sort of de facto standard semantics for it, as seen here in macOS. It's not the only way to skin that cat... IIUC there were various older systems with the double-check to avoid the race against user space, and "address as wait channel" wasn't new either, but I think at least some of the other things like that might have required a waitlist in user space; I'm not too sure about that history though, a lot of it was probably closed source.

It's not even too clear to me how that waitlist tradeoff works out, but in any case, that is exactly how a fallback implementation would probably work if we ever decided we wanted pg_futex everywhere: stick a ConditionVariable (or similar) into the struct, or if you want to preserve sizeof(pg_futex_u32) == sizeof(uint32) and maybe even do away with the struct and just use pg_atomic_uint32 directly, then you could hash the address to select a cv from a table of (say) 256 of them, with the annoying detail that you'd probably need a separate wake/wait function pair for {DSM segment handle, offset} for memory mapped at different addresses. But we don't need to make decisions about any of that vapourware to solve this immediate single-platform problem.

> Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX

We'd be checking if a uint32 is > UINT32_MAX, assuming again that we'd decided to define our own PG_SEM_VALUE_MAX and not the (entirely unrelated) system SEM_VALUE_MAX. Might even get a compiler warning if you tried it.

> Overall, I like the approach, it addresses a real limitation on macOS and could be expanded in the future to cover other similar issues on other platforms.

Thanks for looking! And especially for the sanity check on the definitions of sem_wait()/sem_trywait().
On Sat, Aug 16, 2025 at 2:25 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Supposing posix_sema.c checked that the maximum number of backends
> didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
> today if you later exceed it in sem_post() you'll get either FATAL:
> EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
> including a hang due to a corrupted counter, I guess).

And just by the way, each backend has its own semaphore, so in actual
usage we're probably only talking about the "superfluous wakeups"
mentioned in lwlock.c, clog.c and procarray.c.  I suppose it's not
expected to go very high at all?  I was just trying to think about
where the bounds on that come from in theory, while working through
the case for ignoring EOVERFLOW...
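[Editor's note: for contrast with the fetch-and-add approach the thread settles on, a CAS-based post() that reports EOVERFLOW, as POSIX 2024 now requires, would look roughly like this.  The sketch_post_checked name is invented; this is not what the patch does, only the alternative being argued against.]

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Increment the semaphore counter, refusing to wrap.  The CAS loop
 * re-reads the value on contention, so the EOVERFLOW check always
 * applies to the value actually being replaced; a plain fetch-and-add
 * cannot make that guarantee, but also cannot fail.
 */
static int
sketch_post_checked(_Atomic uint32_t *value)
{
	uint32_t	v = atomic_load(value);

	do
	{
		if (v == UINT32_MAX)
			return EOVERFLOW;	/* would wrap, per POSIX 2024 */
	} while (!atomic_compare_exchange_weak(value, &v, v + 1));
	return 0;
}
```

With a full uint32 counter and each backend posting only to its own semaphore, reaching UINT32_MAX would require billions of unconsumed superfluous wakeups, which is the intuition behind "perhaps it's not worth more than a comment".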
Thomas Munro <thomas.munro@gmail.com> writes:
> And just by the way, each backend has its own semaphore, so in actual
> usage we're probably only talking about the "superfluous wakeups"
> mentioned in lwlock.c, clog.c and procarray.c.  I suppose it's not
> expected to go very high at all?

I wouldn't expect so.

			regards, tom lane
> On Aug 15, 2025, at 10:25 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, Aug 16, 2025 at 5:58 AM Burd, Greg <greg@burd.me> wrote:
>>> 1. They use CAS in sem_post() because they want to report EOVERFLOW if
>>> you exceed SEM_VALUE_MAX, but POSIX doesn't seem to require that, so I
>>> just used fetch-and-add.  Is that bad?  I noticed their alternative
>>> older version (remove "_new" from [1]) also uses a whole 32 bit
>>> counter like me, so perhaps isn't so worried about exceeding a narrow
>>> field of bits, and does it my way.
>>
>> The Open Group Base Specifications Issue 7 (POSIX.1-2008) [1] says that
>> the sem_post() function shall fail if the maximum allowable value of the
>> semaphore would be exceeded [EOVERFLOW].  So, maybe a CAS is the way to
>> go after all?
>
> Hmm.  Before last year, POSIX *didn't* specify EOVERFLOW.  I was
> looking at one of the earlier versions.  But I see now that POSIX 2024
> has just added it!  So the question is: do we care?

I'm on the fence about "caring".  If your code fixes the issue, we
could simply add a comment about eventually needing to support
EOVERFLOW.  I don't think what you have in the patch is wrong, and
we're solving our issue, not the world's inconsistencies.

> Supposing posix_sema.c checked that the maximum number of backends
> didn't exceed SEM_VALUE_MAX and refused to start up if so (I suppose
> today if you later exceed it in sem_post() you'll get either FATAL:
> EOVERFLOW on POSIX 2024 systems or unspecified behaviour, likely
> including a hang due to a corrupted counter, I guess).  I don't know
> if systems with a low enough value exist in the wild, but if this
> patch were to define its own PG_SEM_VALUE_MAX and then redefine
> SEM_VALUE_MAX to that, it certainly wouldn't be one of them.  It would
> naturally use UINT32_MAX, and then that check would be tautological
> given the types involved, not to mention our own lower limits.  So
> perhaps it's not worth more than a comment?
> I've found only very brief discussions in the standards community so far:
>
> https://austingroupbugs.net/view.php?id=315
> https://austin-group-l.opengroup.narkive.com/8CNaq5S0/sem-post-overflow

I see what you mean; a comment that captures this decision and we're
done, IMO.

>> So if I understand it the way your sem_trywait() works is:
>> * Try non-blocking acquire with pg_sem_trywait()
>> * If fails (semaphore = 0): Call pg_futex_wait_u32() to block efficiently
>> * When woken up: Loop back to try again
>>
>> This is essentially how sem_trywait() implementations work in Linux IIRC,
>> they use futexes or similar kernel primitives for efficient blocking.
>
> Yeah.  Linux gave us the name futex and seems to have established a
> sort of de facto standard semantics for it, as seen here in macOS.

Indeed, the "fast userspace mutex".  *sigh* ;-P  I was calling this
out because an implementation of sem_trywait() that used sem_wait()
that wasn't based on futex would be less correct AFAICT.

> It's not the only way to skin that cat...  IIUC there were various
> older systems with the double-check to avoid the race against user
> space, and "address as wait channel" wasn't new either, but I think at
> least some of the other things like that might have required a
> waitlist in user space; I'm not too sure about that history though, a
> lot of it was probably closed source.  It's not even too clear to me
> how that waitlist tradeoff works out, but in any case, that is exactly
> how a fallback implementation would probably work if we ever decided
> we wanted pg_futex everywhere: stick a ConditionVariable (or similar)
> into the struct, or if you want to preserve sizeof(pg_futex_u32) ==
> sizeof(uint32) and maybe even do away with the struct and just use
> pg_atomic_uint32 directly then you could hash the address to select a
> cv from a table of (say) 256 of them, with the annoying detail that
> you'd probably need a separate wake/wait function pair for {DSM
> segment handle, offset} for memory mapped at different addresses.  But
> we don't need to make decisions about any of that vapourware to solve
> this immediate single-platform problem.

>> Nit: pg_sem_init() doesn't return EINVAL if value > SEM_VALUE_MAX
>
> We'd be checking if a uint32 is > UINT32_MAX, assuming again that
> we'd decided to define our own PG_SEM_VALUE_MAX and not the (entirely
> unrelated) system SEM_VALUE_MAX.  Might even get a compiler warning if
> you tried it.

Oops, yep.  You are right.  :)

>> Overall, I like the approach, it addresses a real limitation on macOS and
>> could be expanded in the future to cover other similar issues on other
>> platforms.
>
> Thanks for looking!  And especially for the sanity check on the
> definitions of sem_wait()/sem_trywait().

Sure thing, anytime.  Not sure I helped much...

-greg
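[Editor's note: the "hash the address to select a cv from a table of (say) 256 of them" fallback quoted above only needs a slot-selection function; a sketch of that piece follows.  Everything here is invented for illustration (names, table size, hashing constant) — nothing like it exists in the tree, and this is the vapourware explicitly deferred in the thread.]

```c
#include <stdint.h>

#define SKETCH_CV_TABLE_SIZE 256	/* must match the 8 bits extracted below */

/*
 * Map a futex word's address to one of 256 condition-variable slots.
 * The low two bits carry no information (a uint32 is at least 4-byte
 * aligned), so they are discarded before a multiplicative mix; the top
 * 8 bits of the product give the slot.  Collisions are harmless: two
 * unrelated waiters sharing a slot just get superfluous wakeups, which
 * the wait loop already tolerates by re-checking the futex word.
 */
static uint32_t
sketch_cv_slot(const void *addr)
{
	uint64_t	h = (uint64_t) (uintptr_t) addr >> 2;	/* drop alignment bits */

	h *= UINT64_C(0x9E3779B97F4A7C15);	/* Fibonacci-hash style mixer */
	return (uint32_t) (h >> 56);		/* top 8 bits -> 0..255 */
}
```

Note this only works for memory mapped at the same address in every process; as the quoted text says, DSM segments mapped at different addresses would need a separate {handle, offset}-keyed wake/wait pair, since the raw pointer no longer identifies the wait channel.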