Discussion: Re: connection establishment versus parallel workers
On Thu, Dec 12, 2024 at 9:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> My team recently received a report about connection establishment times
> increasing substantially from v16 onwards.  Upon further investigation,
> this seems to have something to do with commit 7389aad (which moved a lot
> of postmaster code out of signal handlers) in conjunction with workloads
> that generate many parallel workers.  I've attached a set of reproduction
> steps.  The issue seems to be worst on larger machines (e.g., r8g.48xlarge,
> r5.24xlarge) when max_parallel_workers/max_worker_processes is set very
> high (>= 48).

Interesting.

> Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
> parallel worker processing much more responsive, to the point of contending
> with incoming connections, and that before this change, the kernel balanced
> the execution of the signal handlers and ServerLoop() to prevent this.  I
> don't have a concrete proposal yet, but I thought it was still worth
> starting a discussion.  TBH I'm not sure we really need to do anything,
> since this arguably comes down to a trade-off between connection and worker
> responsiveness.

One factor is:

 * Check if the latch is set already.  If so, leave the loop
 * immediately, avoid blocking again.  We don't attempt to report any
 * other events that might also be satisfied.

If we had a way to say "no really, gimme everything you have", I guess
that'd help.  Which reminds me a bit of commit 04a09ee9 (a Windows-only
problem, making sure that we handle multiple sockets fairly instead of
reporting only the lowest priority one); I think it'd work the same way:
if you already saw a latch, you'd use a zero timeout for the system call.
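To illustrate that policy, here is a minimal standalone sketch (not
PostgreSQL's actual WaitEventSet code; "latch_is_set" and "server_fd" are
hypothetical stand-ins): if the latch is already set, we still ask the
kernel for events, but with a zero timeout, so sockets that are also ready
get reported in the same round instead of being starved by the latch.

#include <poll.h>
#include <signal.h>

/* Hypothetical flag set asynchronously, e.g. by a signal handler. */
extern volatile sig_atomic_t latch_is_set;

int
wait_for_events(int server_fd, int timeout_ms)
{
	struct pollfd pfd = {.fd = server_fd, .events = POLLIN};
	int			nready = 0;

	/*
	 * Old policy: return immediately when the latch is set, reporting
	 * nothing else.  Sketched policy: still poll, but don't block.
	 */
	if (latch_is_set)
		timeout_ms = 0;

	if (poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN) != 0)
		nready++;				/* a connection is ready to be accepted */

	if (latch_is_set)
		nready++;				/* report the latch as well */

	return nready;
}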
On Thu, Dec 12, 2024 at 11:36 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... instead of
> reporting only the lowest priority one)

s/priority/position/
On Fri, Dec 13, 2024 at 02:29:53AM +1300, Thomas Munro wrote:
> Here's an experimental patch to try changing that policy.  It improves
> the connection times on my small computer with your test, but I doubt
> I'm seeing the real issue.  But in theory, assuming a backlog of
> connections and workers to start, I think each server loop should be
> able to accept and fork one client backend, and fork up to 100
> (MAX_BGWORKERS_TO_LAUNCH) background workers.

Thanks for the quick response!  I'm taking a look at the patch...

--
nathan
Sorry for the delay, and thanks again for digging into this.

On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:
> 0001 patch is unchanged, 0002 patch sketches out a response to the
> observation a couple of paragraphs above.

Both of these patches seem to improve matters quite a bit.  I haven't yet
thought too deeply about it all, but upon a skim, your patches seem
entirely reasonable to me.

However, while this makes the test numbers for >= v16 look more like those
for v15, we're also seeing a big jump from v13 to v14.  This bisects pretty
cleanly to commit d872510.  I haven't figured out _why_ this commit is
impacting this particular test, but I figured I'd at least update the
thread with what we know so far.

--
nathan
On Thu, Dec 19, 2024 at 10:09:35AM -0600, Nathan Bossart wrote:
> On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:
>> 0001 patch is unchanged, 0002 patch sketches out a response to the
>> observation a couple of paragraphs above.
>
> Both of these patches seem to improve matters quite a bit.  I haven't yet
> thought too deeply about it all, but upon a skim, your patches seem
> entirely reasonable to me.

I gave these a closer look, and I still feel that they are both
straightforward and reasonable.  IIUC the main open question is whether
this might cause problems for other PM signal kinds.  Like you, I don't see
anything immediately obvious there, but I'll admit I'm not terribly
familiar with the precise characteristics of postmaster signals.  In any
case, 0001 feels pretty safe to me.

> However, while this makes the test numbers for >= v16 look more like those
> for v15, we're also seeing a big jump from v13 to v14.  This bisects pretty
> cleanly to commit d872510.  I haven't figured out _why_ this commit is
> impacting this particular test, but I figured I'd at least update the
> thread with what we know so far.

I regrettably have no updates on this one, yet.

--
nathan
On Tue, Jan 14, 2025 at 8:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Thu, Dec 19, 2024 at 10:09:35AM -0600, Nathan Bossart wrote:
> > On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:
> >> 0001 patch is unchanged, 0002 patch sketches out a response to the
> >> observation a couple of paragraphs above.
> >
> > Both of these patches seem to improve matters quite a bit.  I haven't yet
> > thought too deeply about it all, but upon a skim, your patches seem
> > entirely reasonable to me.
>
> I gave these a closer look, and I still feel that they are both
> straightforward and reasonable.  IIUC the main open question is whether
> this might cause problems for other PM signal kinds.  Like you, I don't see
> anything immediately obvious there, but I'll admit I'm not terribly
> familiar with the precise characteristics of postmaster signals.  In any
> case, 0001 feels pretty safe to me.

Cool.  Thanks.  I'll think about what else could be affected by that
change, as you say, and if nothing jumps out I'll go ahead and commit them,
back to 16.

I have done a lot more study of this problem and was about to write in with
some more patches to propose for master only.  Basically, that "100" is
destroying performance in this workload, which at least on my machine
hardly gets any parallelism at all, and only in sporadic bursts.  You can
argue that we aren't designed for high-frequency short-lived workers (we'll
have to reuse workers in some way to be good at that), but I don't think it
has to fail as badly as it does today.  It falls off a cliff instead of
plateauing: we are so busy forking that we don't get around to reaping
children, so all our slots are (artificially) used up most of the time, and
the queries that do manage to nab one then sit on their hands for a long
time at query end.

"1" gets much smoother results, but as prophesied in aa1351f1, the
complexity is terrible, possibly even O(n^3) in places depending on how you
count: there are many places that scan the whole worker list, and one that
even scans it again for each item, and that is for each thing that starts.
IOW we have to fix the complexity fundamentally.  I have a WIP patch that
adds a couple of work queues, so that the postmaster never has to consider
anything more than the head of a queue in various places (a rough sketch of
the shape follows at the end of this message).  More soon...

> > However, while this makes the test numbers for >= v16 look more like
> > those for v15, we're also seeing a big jump from v13 to v14.  This
> > bisects pretty cleanly to commit d872510.  I haven't figured out _why_
> > this commit is impacting this particular test, but I figured I'd at
> > least update the thread with what we know so far.
>
> I regrettably have no updates on this one, yet.

My first thought was that the catalogues needed for connection might be
getting evicted, but the data size seems too small for that, surely, and
you'd probably have picked it up immediately from wait events.  Weird.
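To show the shape of the work-queue idea, here is a standalone sketch of
the data structure only (not the actual WIP patch; all names are
hypothetical): workers are appended as they become startable, and each trip
through the server loop launches at most a bounded number from the head,
O(1) per worker, instead of rescanning the whole registered-worker list.

#include <stdbool.h>
#include <stddef.h>

typedef struct PendingWorker
{
	int			slot;			/* index into the bgworker slot array */
	struct PendingWorker *next;
} PendingWorker;

typedef struct StartQueue
{
	PendingWorker *head;
	PendingWorker *tail;
} StartQueue;

/* Append a worker that has become startable (O(1)). */
static void
start_queue_push(StartQueue *q, PendingWorker *w)
{
	w->next = NULL;
	if (q->tail)
		q->tail->next = w;
	else
		q->head = w;
	q->tail = w;
}

/* Launch at most max_launch workers per server loop, O(1) each. */
static int
launch_some_workers(StartQueue *q, int max_launch, bool (*do_fork) (int slot))
{
	int			launched = 0;

	while (q->head != NULL && launched < max_launch)
	{
		PendingWorker *w = q->head;

		if (!do_fork(w->slot))
			break;				/* fork failed; retry later from the head */
		q->head = w->next;
		if (q->head == NULL)
			q->tail = NULL;
		launched++;
	}
	return launched;
}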
On Mon, Jan 20, 2025 at 6:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's the WIP code I have come up with for that so far.
>
> Remaining opportunities not attempted:
> 1.  When a child exits, we could use a hash table to find it by pid.
> 2.  When looking for a bgworker slot that is not in use, we could do
> something better than linear search.

I haven't had time to work on this again due to other projects, but I
wanted to write down the ideas I thought about for the record.

Obviously we'd want some kind of free list, but the postmaster would need
to push free slots into it when workers exit, and it can't use locks or any
data structures that can cause it to get stuck just because a backend has
corrupted shared memory.  It must always be able to process shutdown
commands and coordinate crash restarts, no matter how bananas the backends
go.  With that in mind:

Idea #1:  A CAS-based linked list, relying on CAS never being emulated with
locks, and probably requiring 16-bit indexes, since we can only expect
32-bit atomic hardware and you might need to change both the head and tail
of a hypothetical list head when going from empty to one element.  But you
have to convince yourself that it's OK to run a CAS loop in the postmaster,
which in theory might be prevented from completing...

Idea #2:  The free list could be a simple circular buffer of slot index
numbers.  That would be symmetrical with
0001-Remove-BackgroundWorkerStateChange-s-outer-loop.patch's shared memory
"start" queue, and could be coded essentially the same way.  The "start"
queue and the "free" queue would then both be simple arrays with a head and
a tail, and in both cases only the consumers (regular backends) need an
lwlock to serialise against each other, while the producer (the postmaster)
can get away with careful memory barriers and just has to range-check the
head/tail indexes to deal with untrusted shared memory contents.  A rogue
backend can jam up the bgworker subsystem, but that's already true.  It
still can't prevent shutdown or crash restart.  (A rough sketch follows at
the end of this message.)

Idea #3:  Suggested by Robert in an off-list chat about all this: backends
could maintain a shared memory free list using existing dlist technology
protected by an lwlock.  When it's empty, *they* (not the postmaster) would
fill it up again using a linear slot search.  It's simple but not quite as
satisfying, because it is still possible to degrade to high-frequency
linear searches that only find a small number of free slots each time if
you're unlucky, i.e. you can entirely fail to amortise.

Idea #4:  Maintain a bitmap of free slot indexes, which the postmaster sets
with atomic_fetch_or().

I lean towards idea #2, but haven't actually tried it.

A similar situation exists for DSM slot management, though that has
different complications: no postmaster interaction, but funky handle
requirements due to portability concerns.  I think the handles could
probably be changed to encode the slot index + generation for O(1) lookup
at "attach" time, and free slots could be stored in a circular queue for
O(1) slot allocation.  I think the use of random numbers stemmed from SysV
shared memory's need to find a free key in a 32-bit OS-wide namespace
(yuck).  I haven't looked at that code in a while, but I don't recall any
reason why even those couldn't be hidden inside the slot itself, instead of
being exposed in the handle, forcing linear searches.
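To make the "slot index + generation" handle idea concrete, here is a
minimal standalone sketch (an assumption about one way it could look, not
how dsm.c works today; all names are hypothetical): the handle packs the
slot index and a generation counter, so attach becomes an O(1) array lookup
plus a generation check rather than a linear search.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NSLOTS 1024

typedef struct SlotSketch
{
	uint32_t	generation;		/* bumped every time the slot is recycled */
	bool		in_use;
} SlotSketch;

static SlotSketch slots[SKETCH_NSLOTS];

/* Pack slot index and generation into one 64-bit handle. */
static inline uint64_t
make_handle(uint32_t slot, uint32_t generation)
{
	return ((uint64_t) generation << 32) | slot;
}

/* O(1) lookup: valid only if the slot is in use and the generation matches. */
static SlotSketch *
lookup_handle(uint64_t handle)
{
	uint32_t	slot = (uint32_t) handle;
	uint32_t	generation = (uint32_t) (handle >> 32);

	if (slot >= SKETCH_NSLOTS)
		return NULL;
	if (!slots[slot].in_use || slots[slot].generation != generation)
		return NULL;			/* stale handle: the slot was recycled */
	return &slots[slot];
}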
A generation scheme would also be more robust against weird random-number
collisions, and could detect handles that were once valid but no longer are
in a more obvious way, instead of "I looked everywhere and I couldn't find
it".
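Returning to idea #2 above, here is a minimal standalone sketch of the
circular free queue (not an actual patch; names are hypothetical, and a
pthread mutex stands in for the lwlock): the postmaster is the single
producer and publishes entries with a release store, never taking a lock,
while backends consume under the lock and range-check the head/tail values
they read from untrusted shared memory.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define FREEQ_SIZE 256			/* >= number of slots, so it cannot overflow */

typedef struct FreeQueue
{
	_Atomic uint32_t head;		/* next position the producer will write */
	_Atomic uint32_t tail;		/* next position a consumer will read */
	uint16_t	items[FREEQ_SIZE];
} FreeQueue;

static pthread_mutex_t consumer_lock = PTHREAD_MUTEX_INITIALIZER;

/* Producer (postmaster): push a freed slot index; no locks taken. */
static void
freeq_push(FreeQueue *q, uint16_t slot)
{
	uint32_t	head = atomic_load_explicit(&q->head, memory_order_relaxed);

	/* Each slot index appears at most once, so the queue cannot fill up. */
	q->items[head % FREEQ_SIZE] = slot;
	/* Publish the item before advancing head. */
	atomic_store_explicit(&q->head, head + 1, memory_order_release);
}

/* Consumer (backend): pop a free slot index, or -1 if none is available. */
static int
freeq_pop(FreeQueue *q)
{
	int			result = -1;
	uint32_t	tail;
	uint32_t	head;

	pthread_mutex_lock(&consumer_lock);
	tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
	head = atomic_load_explicit(&q->head, memory_order_acquire);
	if (head - tail > FREEQ_SIZE)
		head = tail;			/* contents look corrupted; treat as empty */
	if (head != tail)
	{
		result = q->items[tail % FREEQ_SIZE];
		atomic_store_explicit(&q->tail, tail + 1, memory_order_relaxed);
	}
	pthread_mutex_unlock(&consumer_lock);
	return result;
}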