Discussion: Re: connection establishment versus parallel workers


Re: connection establishment versus parallel workers

From: Thomas Munro
Date:
On Thu, Dec 12, 2024 at 9:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> My team recently received a report about connection establishment times
> increasing substantially from v16 onwards.  Upon further investigation,
> this seems to have something to do with commit 7389aad (which moved a lot
> of postmaster code out of signal handlers) in conjunction with workloads
> that generate many parallel workers.  I've attached a set of reproduction
> steps.  The issue seems to be worst on larger machines (e.g., r8g.48xlarge,
> r5.24xlarge) when max_parallel_workers/max_worker_processes is set very high
> (>= 48).

Interesting.

> Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
> parallel worker processing much more responsive to the point of contending
> with incoming connections, and that before this change, the kernel balanced
> the execution of the signal handlers and ServerLoop() to prevent this.  I
> don't have a concrete proposal yet, but I thought it was still worth
> starting a discussion.  TBH I'm not sure we really need to do anything
> since this arguably comes down to a trade-off between connection and worker
> responsiveness.

One factor is:

         * Check if the latch is set already. If so, leave the loop
         * immediately, avoid blocking again. We don't attempt to report any
         * other events that might also be satisfied.

If we had a way to say "no really, gimme everything you have", I guess
that'd help.  Which reminds me a bit of commit 04a09ee9 (Windows-only
problem, making sure that we handle multiple sockets fairly instead of
reporting only the lowest priority one); I think it'd work the same
way: if you already saw a latch, you'd use a zero timeout for the
system call.
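
As a minimal standalone sketch of that policy (not the actual WaitEventSet
code; the function and parameters here are hypothetical), the point is just
that a set latch should suppress sleeping, not polling, so sockets that are
already ready get reported in the same round:

#include <poll.h>
#include <stdbool.h>

/*
 * If the latch is already known to be set, don't sleep, but still ask the
 * kernel (with a zero timeout) which descriptors are ready, so a pending
 * connection is reported alongside the latch rather than after it.
 */
static int
wait_for_events(struct pollfd *fds, nfds_t nfds, int timeout_ms,
                bool latch_already_set)
{
    int         effective_timeout = latch_already_set ? 0 : timeout_ms;

    return poll(fds, nfds, effective_timeout);
}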



Re: connection establishment versus parallel workers

From: Thomas Munro
Date:
On Thu, Dec 12, 2024 at 11:36 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... instead of
> reporting only the lowest priority one)

s/priority/position/



Re: connection establishment versus parallel workers

From: Nathan Bossart
Date:
On Fri, Dec 13, 2024 at 02:29:53AM +1300, Thomas Munro wrote:
> Here's an experimental patch to try changing that policy.  It improves
> the connection times on my small computer with your test, but I doubt
> I'm seeing the real issue.  But in theory, assuming a backlog of
> connections and workers to start, I think each server loop should be
> able to accept and fork one client backend, and fork up to 100
> (MAX_BGWORKERS_TO_LAUNCH) background workers.

Thanks for the quick response!  I'm taking a look at the patch...

-- 
nathan



Re: connection establishment versus parallel workers

From: Nathan Bossart
Date:
Sorry for the delay, and thanks again for digging into this.

On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:
> 0001 patch is unchanged, 0002 patch sketches out a response to the
> observation a couple of paragraphs above.

Both of these patches seem to improve matters quite a bit.  I haven't yet
thought too deeply about it all, but upon a skim, your patches seem
entirely reasonable to me.

However, while this makes the test numbers for >= v16 look more like those
for v15, we're also seeing a big jump from v13 to v14.  This bisects pretty
cleanly to commit d872510.  I haven't figured out _why_ this commit is
impacting this particular test, but I figured I'd at least update the
thread with what we know so far.

-- 
nathan



Re: connection establishment versus parallel workers

From: Nathan Bossart
Date:
On Thu, Dec 19, 2024 at 10:09:35AM -0600, Nathan Bossart wrote:
> On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:
>> 0001 patch is unchanged, 0002 patch sketches out a response to the
>> observation a couple of paragraphs above.
> 
> Both of these patches seem to improve matters quite a bit.  I haven't yet
> thought too deeply about it all, but upon a skim, your patches seem
> entirely reasonable to me.

I gave these a closer look, and I still feel that they are both
straightforward and reasonable.  IIUC the main open question is whether
this might cause problems for other PM signal kinds.  Like you, I don't see
anything immediately obvious there, but I'll admit I'm not terribly
familiar with the precise characteristics of postmaster signals.  In any
case, 0001 feels pretty safe to me.

> However, while this makes the test numbers for >= v16 look more like those
> for v15, we're also seeing a big jump from v13 to v14.  This bisects pretty
> cleanly to commit d872510.  I haven't figured out _why_ this commit is
> impacting this particular test, but I figured I'd at least update the
> thread with what we know so far.

I regrettably have no updates on this one, yet.

-- 
nathan



Re: connection establishment versus parallel workers

From: Thomas Munro
Date:
On Tue, Jan 14, 2025 at 8:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Thu, Dec 19, 2024 at 10:09:35AM -0600, Nathan Bossart wrote:
> > On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:
> >> 0001 patch is unchanged, 0002 patch sketches out a response to the
> >> observation a couple of paragraphs above.
> >
> > Both of these patches seem to improve matters quite a bit.  I haven't yet
> > thought too deeply about it all, but upon a skim, your patches seem
> > entirely reasonable to me.
>
> I gave these a closer look, and I still feel that they are both
> straightforward and reasonable.  IIUC the main open question is whether
> this might cause problems for other PM signal kinds.  Like you, I don't see
> anything immediately obvious there, but I'll admit I'm not terribly
> familiar with the precise characteristics of postmaster signals.  In any
> case, 0001 feels pretty safe to me.

Cool.  Thanks.  I'll think about what else could be affected by that
change as you say, and if nothing jumps out I'll go ahead and commit
them, back to 16.

I have done a lot more study of this problem and was about to write in
with some more patches to propose for master only.  Basically that
"100" is destroying performance in this workload, which at least on my
machine hardly gets any parallelism at all, and only in sporadic
bursts.  You can argue that we aren't designed for high frequency
short-lived workers (we'll have to reuse workers in some way to be
good at that), but I don't think it has to fail as badly as it does
today.  It falls off a cliff instead of plateauing: we are so busy
forking that we don't get around to reaping children, so all our slots
are (artificially) used up most of the time, and the queries that do
manage to nab one then sit on their hands for a long time at query
end.  "1" gets much smoother results, but as prophesied in aa1351f1,
the complexity is terrible, possibly even O(n^3) in places depending
on how you count: there are many places that scan the whole worker
list, and one that even scans it again for each item, and that is for
each thing that starts.  IOW we have to fix the complexity
fundamentally.  I have a WIP patch that adds a couple of work queues,
so that the postmaster never has to consider anything more than the
head of a queue in various places.  More soon...
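
To make the complexity point concrete, here is a toy standalone
illustration (not postmaster code; names and sizes are invented): finding
the next worker to launch by rescanning every slot costs O(n) per launch
and O(n^2) for a burst of n launches, while a queue of pending slot numbers
lets each launch look only at the head.

#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 128

typedef struct Slots
{
    bool        pending_start[NSLOTS];  /* the "rescan everything" shape */
    uint16_t    queue[NSLOTS];          /* the queued shape */
    uint32_t    qhead;
    uint32_t    qtail;
} Slots;

/* O(n) per launch: scan all slots for one that still needs starting. */
static int
next_pending_by_scan(Slots *s)
{
    for (int i = 0; i < NSLOTS; i++)
        if (s->pending_start[i])
            return i;
    return -1;
}

/* O(1) per launch: only ever look at the head of the queue. */
static int
next_pending_by_queue(Slots *s)
{
    if (s->qhead == s->qtail)
        return -1;
    return s->queue[s->qhead++ % NSLOTS];
}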

> > However, while this makes the test numbers for >= v16 look more like those
> > for v15, we're also seeing a big jump from v13 to v14.  This bisects pretty
> > cleanly to commit d872510.  I haven't figured out _why_ this commit is
> > impacting this particular test, but I figured I'd at least update the
> > thread with what we know so far.
>
> I regrettably have no updates on this one, yet.

My first thought was that the catalogues needed for connection might
be getting evicted, but surely the data size is too small for that, and
you'd probably have picked it up immediately from wait events.  Weird.



Re: connection establishment versus parallel workers

From: Thomas Munro
Date:
On Mon, Jan 20, 2025 at 6:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's the WIP code I have come up with for that so far.
>
> Remaining opportunities not attempted:
> 1.  When a child exits, we could use a hash table to find it by pid.
> 2.  When looking for a bgworker slot that is not in use, we could do
> something better than linear search.

I haven't had time to work on this again due to other projects, but I
wanted to write down the ideas I thought about for the record.
Obviously we'd want some kind of free list, but the postmaster would
need to push free slots into it when workers exit, and it can't use
locks or any data structures that can cause it to get stuck just
because a backend has corrupted shared memory.  It must always be able
to process shutdown commands and coordinate crash restarts, no matter
how bananas the backends go.  With that in mind:

Idea #1:  A CAS-based linked list, relying on CAS never being emulated
with locks, and probably requiring 16 bit indexes since we can only
expect 32 bit atomic hardware and you might need to change both the
head and tail of a hypothetical list head when going from empty to one
element.  But you have to convince yourself that it's OK to run a
CAS-loop in the postmaster, which might in theory be prevented
from completing...
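
For illustration only, here is one standalone C11 shape this could take
(the paragraph above talks about packing head and tail; this variant
instead packs a 16-bit slot index with a 16-bit generation counter to
sidestep ABA, which reflects the same 32-bit constraint).  The retry loop
in push is exactly the loop one has to be comfortable running in the
postmaster; real code would use pg_atomic_compare_exchange_u32() rather
than <stdatomic.h>.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS  128
#define NIL     0xFFFF

typedef struct FreeList
{
    _Atomic uint32_t head;      /* (generation << 16) | slot; starts at NIL */
    uint16_t    next[NSLOTS];   /* per-slot link to the next free slot */
} FreeList;

/* Postmaster side: push a freed slot; may retry if backends race with it. */
static void
free_list_push(FreeList *fl, uint16_t slot)
{
    uint32_t    old = atomic_load(&fl->head);
    uint32_t    newval;

    do
    {
        fl->next[slot] = (uint16_t) (old & 0xFFFF);
        newval = ((old & 0xFFFF0000u) + 0x10000u) | slot;  /* bump generation */
    } while (!atomic_compare_exchange_weak(&fl->head, &old, newval));
}

/* Backend side: pop a free slot, or return false if the list is empty. */
static bool
free_list_pop(FreeList *fl, uint16_t *slot)
{
    uint32_t    old = atomic_load(&fl->head);
    uint32_t    newval;

    do
    {
        uint16_t    idx = (uint16_t) (old & 0xFFFF);

        if (idx == NIL)
            return false;
        *slot = idx;
        newval = ((old & 0xFFFF0000u) + 0x10000u) | fl->next[idx];
    } while (!atomic_compare_exchange_weak(&fl->head, &old, newval));

    return true;
}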

Idea #2:  The free list could be a simple circular buffer of slot
index numbers.  That would be symmetrical with
0001-Remove-BackgroundWorkerStateChange-s-outer-loop.patch's shared
memory "start" queue, and could be coded essentially the same way.
The "start" queue and the "free" queue would then both be simple
arrays with a head and a tail, and in both cases only the consumers
(regular backends) need an lwlock to serialise against each other,
while the producer (the postmaster) can get away with careful memory
barriers and just has to range-check the head/tail indexes to deal
with untrusted shared memory contents.  A rogue backend can jam up the
bgworker subsystem, but that's already true.  It still can't prevent
shutdown or crash restart.
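
A standalone C11 sketch of that shape (hypothetical names; <stdatomic.h>
standing in for pg_atomic_* and pg_write_barrier()/pg_read_barrier(), and
the LWLock that serialises consuming backends omitted): the postmaster does
one plain store and one release store per freed slot, so it cannot loop or
block, and the consumer range-checks head/tail so garbage in shared memory
cannot send it out of bounds.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 128          /* >= number of slots, so it cannot overflow */

typedef struct FreeQueue
{
    _Atomic uint32_t head;      /* advanced by backends (under their lock) */
    _Atomic uint32_t tail;      /* advanced only by the postmaster */
    uint16_t    slots[QUEUE_SIZE];
} FreeQueue;

/* Postmaster: publish a freed slot.  No locks, no loops, cannot get stuck. */
static void
free_queue_push(FreeQueue *q, uint16_t slot)
{
    uint32_t    tail = atomic_load_explicit(&q->tail, memory_order_relaxed);

    q->slots[tail % QUEUE_SIZE] = slot;
    /* Make the entry visible before advancing tail (the write barrier). */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
}

/* Backend (holding the consumers' lock): take a free slot if one exists. */
static bool
free_queue_pop(FreeQueue *q, uint16_t *slot)
{
    uint32_t    head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t    tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (head == tail)
        return false;           /* empty */
    if (tail - head > QUEUE_SIZE)
        return false;           /* untrusted contents look corrupted */
    *slot = q->slots[head % QUEUE_SIZE];
    atomic_store_explicit(&q->head, head + 1, memory_order_relaxed);
    return true;
}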

Idea #3:  Suggested by Robert in an off-list chat about all this:
backends could maintain a shared memory free-list using existing dlist
technology protected by an lwlock.  When it's empty *they* (not the
postmaster) would fill it up again using a linear slot search.  It's
simple but not quite as satisfying, because it is still possible to
degrade to high frequency linear searches that only find a small
number of free slots each time if you're unlucky, ie you can entirely
fail to amortise.
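
Sketched standalone (a pthread mutex standing in for the LWLock, an array
for the dlist, hypothetical names): the postmaster only clears in_use[]
when a worker exits, and a backend that finds the cached list empty refills
it with a linear scan while holding the lock, which is where the failure to
amortise can creep back in if only a few slots are free each time.

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 128

typedef struct SlotFreeList
{
    pthread_mutex_t lock;           /* stands in for an LWLock */
    bool        in_use[NSLOTS];     /* cleared by the postmaster on exit */
    uint16_t    cache[NSLOTS];      /* backend-maintained free-slot list */
    int         ncached;
} SlotFreeList;

static bool
slot_alloc(SlotFreeList *fl, uint16_t *slot)
{
    bool        ok = false;

    pthread_mutex_lock(&fl->lock);
    if (fl->ncached == 0)
    {
        /* Refill by scanning every slot; cheap only if many are free. */
        for (int i = 0; i < NSLOTS; i++)
            if (!fl->in_use[i])
                fl->cache[fl->ncached++] = (uint16_t) i;
    }
    if (fl->ncached > 0)
    {
        *slot = fl->cache[--fl->ncached];
        fl->in_use[*slot] = true;
        ok = true;
    }
    pthread_mutex_unlock(&fl->lock);
    return ok;
}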

Idea #4:  Maintain a bitmap of free slot indexes, which the
postmaster sets with atomic_fetch_or().
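
A standalone sketch of the bitmap variant (again with <stdatomic.h> and a
GCC/Clang builtin standing in for pg_atomic_fetch_or_u32() and friends;
names hypothetical): freeing is a single wait-free OR in the postmaster,
and backends claim a slot by atomically clearing a bit.

#include <stdatomic.h>
#include <stdint.h>

#define NSLOTS 128
#define NWORDS (NSLOTS / 32)

static _Atomic uint32_t free_bitmap[NWORDS];

/* Postmaster: mark a slot free with one atomic OR; no loops, no locks. */
static void
slot_mark_free(int slot)
{
    atomic_fetch_or(&free_bitmap[slot / 32], UINT32_C(1) << (slot % 32));
}

/* Backend: claim any free slot, or return -1 if none are free. */
static int
slot_claim(void)
{
    for (int w = 0; w < NWORDS; w++)
    {
        uint32_t    word = atomic_load(&free_bitmap[w]);

        while (word != 0)
        {
            int         bit = __builtin_ctz(word);     /* lowest set bit */
            uint32_t    mask = UINT32_C(1) << bit;

            /* Whoever actually clears the bit owns the slot. */
            if (atomic_fetch_and(&free_bitmap[w], ~mask) & mask)
                return w * 32 + bit;
            word = atomic_load(&free_bitmap[w]);
        }
    }
    return -1;
}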

I lean towards idea #2, but haven't actually tried it.

A similar situation exists for DSM slot management, though that has
different complications: no postmaster interaction, but funky handle
requirements due to portability concerns.  I think the handles could
probably be changed to encode the slot index + generation for O(1)
lookup at "attach" time, and free slots could be stored in a circular
queue for O(1) slot allocation.  I think the use of random numbers
stemmed from SysV shared memory's need to find a free key in a 32 bit
OS-wide namespace (yuck).  I haven't looked at that code in a while
but I don't recall any reason why even those couldn't be hidden inside
the slot itself, instead of being exposed in the handle, forcing
linear searches.  A generation scheme would also be more robust
against weird random number collisions, and could detect handles that
were once valid but no longer are in a more obvious way, instead of "I
looked everywhere and I couldn't find it".
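
To make that last idea concrete: dsm_handle is already a 32-bit integer, so
a hypothetical encoding (none of this is the actual DSM code) could put the
slot index in the low half and a per-slot generation in the high half, and
attach would become an array index plus a generation comparison rather than
a search.

#include <stdint.h>

typedef uint32_t dsm_handle_sketch;     /* stands in for dsm_handle */

static inline dsm_handle_sketch
make_handle(uint16_t slot, uint16_t generation)
{
    return ((uint32_t) generation << 16) | slot;
}

static inline uint16_t
handle_slot(dsm_handle_sketch handle)
{
    return (uint16_t) (handle & 0xFFFF);
}

static inline uint16_t
handle_generation(dsm_handle_sketch handle)
{
    return (uint16_t) (handle >> 16);
}

A mismatched generation would then give a crisp "that segment is gone"
error instead of a failed linear search.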