Re: PGC_SIGHUP shared_buffers?

From: Andres Freund
Subject: Re: PGC_SIGHUP shared_buffers?
Date:
Msg-id: 20240218203516.h6n2wgpwfenbswxq@awork3.anarazel.de
In reply to: Re: PGC_SIGHUP shared_buffers?  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: PGC_SIGHUP shared_buffers?  (Robert Haas <robertmhaas@gmail.com>)
           Re: PGC_SIGHUP shared_buffers?  (Joe Conway <mail@joeconway.com>)
List: pgsql-hackers
Hi,

On 2024-02-18 17:06:09 +0530, Robert Haas wrote:
> On Sat, Feb 17, 2024 at 12:38 AM Andres Freund <andres@anarazel.de> wrote:
> > IMO the ability to *shrink* shared_buffers dynamically and cheaply is more
> > important than growing it in a way, except that they are related of
> > course. Idling hardware is expensive, thus overcommitting hardware is very
> > attractive (I count "serverless" as part of that). To be able to overcommit
> > effectively, unused long-lived memory has to be released. I.e. shared buffers
> > needs to be shrinkable.
> 
> I see your point, but people want to scale up, too. Of course, those
> people will have to live with what we can practically implement.

Sure, I didn't intend to say that scaling up isn't useful.


> > Perhaps worth noting that there are two things limiting the size of shared
> > buffers: 1) the available buffer space 2) the available buffer *mapping*
> > space. I think making the buffer mapping resizable is considerably harder than
> > the buffers themselves. Of course pre-reserving memory for a buffer mapping
> > suitable for a huge shared_buffers is more feasible than pre-allocating all
> > that memory for the buffers themselves. But it'd still mean you'd have a maximum
> > set at server start.
> 
> We size the fsync queue based on shared_buffers too. That's a lot less
> important, though, and could be worked around in other ways.

We probably should address that independently of making shared_buffers
PGC_SIGHUP. The queue gets absurdly large once s_b hits a few GB. It's not
that much memory compared to the buffer blocks themselves, but a sync queue of
many millions of entries just doesn't make sense. And a few hundred MB for
that isn't nothing either, even if it's just a fraction of the space for the
buffers. It also makes the checkpointer more susceptible to OOM, because
AbsorbSyncRequests() allocates an array to copy all requests into local
memory.
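
For context, a rough sketch of the arithmetic involved (the 24-byte per-entry
size is an assumption for illustration, not the exact
sizeof(CheckpointerRequest)):

#include <stdio.h>

/*
 * Rough illustration: the checkpointer's sync request queue is sized to
 * NBuffers entries, so its footprint grows linearly with s_b.
 */
int
main(void)
{
	int			gb[] = {1, 8, 64};
	long long	entry_size = 24;	/* assumed bytes per queue entry */

	for (int i = 0; i < 3; i++)
	{
		long long	nbuffers = (long long) gb[i] * 1024 * 1024 * 1024 / 8192;
		long long	queue_mb = nbuffers * entry_size / (1024 * 1024);

		printf("s_b = %2d GB -> %8lld buffers -> ~%lld MB sync queue\n",
			   gb[i], nbuffers, queue_mb);
	}
	return 0;
}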


> > Such a scheme still leaves you with a dependent memory read for a quite
> > frequent operation. It could turn out to not matter hugely if the mapping
> > array is cache resident, but I don't know if we can realistically bank on
> > that.
> 
> I don't know, either. I was hoping you did. :-)
> 
> But we can rig up a test pretty easily, I think. We can just create a
> fake mapping that gives the same answers as the current calculation
> and then beat on it. Of course, if testing shows no difference, there
> is the small problem of knowing whether the test scenario was right;
> and it's also possible that an initial impact could be mitigated by
> removing some gratuitously repeated buffer # -> buffer address
> mappings. Still, I think it could provide us with a useful baseline.
> I'll throw something together when I have time, unless someone beats
> me to it.

I think such a test would be useful, although I also don't know how confident
we would be if we saw positive results. Probably depends a bit on the
generated code and how plausible it is to not see regressions.
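
To make that concrete, something along these lines is roughly what I'd expect
such a micro-benchmark to compare (chunk_base, CHUNK_SHIFT etc. are made-up
names for the sketch; the flat case mirrors what BufferGetBlock() effectively
does today):

#include <stddef.h>
#include <stdint.h>

#define BLCKSZ			8192
#define CHUNK_SHIFT		17		/* 2^17 buffers = 1 GB per chunk, assumed */
#define CHUNK_MASK		((1U << CHUNK_SHIFT) - 1)

/* Flat layout: pure address arithmetic, no dependent load. */
static inline char *
flat_block(char *buffer_blocks, uint32_t buf_id)
{
	return buffer_blocks + (size_t) buf_id * BLCKSZ;
}

/*
 * Chunked layout: first load the chunk's base pointer, then index into the
 * chunk.  The extra dependent read is the thing to measure; whether it hurts
 * depends on chunk_base[] staying cache resident.
 */
static inline char *
chunked_block(char *const *chunk_base, uint32_t buf_id)
{
	return chunk_base[buf_id >> CHUNK_SHIFT] +
		(size_t) (buf_id & CHUNK_MASK) * BLCKSZ;
}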


> > I'm also somewhat concerned about the coarse granularity being problematic. It
> > seems like it'd lead to a desire to make the granule small, causing slowness.
> 
> How many people set shared_buffers to something that's not a whole
> number of GB these days?

I'd say the vast majority of postgres instances in production run with less
than 1GB of s_b, simply because, by the numbers, most instances run on small
VMs and/or many PG instances share one larger machine.  There are a lot of
instances where the total available memory is less than 2GB.


> I mean I bet it happens, but in practice if you rounded to the nearest GB,
> or even the nearest 2GB, I bet almost nobody would really care. I think it's
> fine to be opinionated here and hold the line at a relatively large granule,
> even though in theory people could want something else.

I don't believe that at all, unfortunately.


> > One big advantage of a scheme like this is that it'd be a step towards a NUMA
> > aware buffer mapping and replacement. Practically everything beyond the size
> > of a small consumer device these days has NUMA characteristics, even if not
> > "officially visible". We could make clock sweeps (or a better victim buffer
> > selection algorithm) happen within each "chunk", with some additional
> > infrastructure to choose which of the chunks to search a buffer in. Using a
> > chunk on the current numa node, except when there is a lot of imbalance
> > between buffer usage or replacement rate between chunks.
> 
> I also wondered whether this might be a useful step toward allowing
> different-sized buffers in the same buffer pool (ducks, runs away
> quickly). I don't have any particular use for that myself, but it's a
> thing some people probably want for some reason or other.

I still think that's something that would just incur a significant cost in
complexity, and secondarily also runtime overhead, for a comparatively
marginal gain.


> > > 2. Make a Buffer just a disguised pointer. Imagine something like
> > > typedef struct { Page bp; } *buffer. With this approach,
> > > BufferGetBlock() becomes trivial.
> >
> > You also additionally need something that allows for efficient iteration over
> > all shared buffers. Making buffer replacement and checkpointing more expensive
> > isn't great.
> 
> True, but I don't really see what the problem with this would be in
> this approach.

It's a bit hard to tell at this level of detail :). At the extreme end, if you
end up with a large number of separate allocations for s_b, it surely would.


> > > 3. Reserve lots of address space and then only use some of it. I hear
> > > rumors that some forks of PG have implemented something like this. The
> > > idea is that you convince the OS to give you a whole bunch of address
> > > space, but you try to avoid having all of it be backed by physical
> > > memory. If you later want to increase shared_buffers, you then get the
> > > OS to back more of it by physical memory, and if you later want to
> > > decrease shared_buffers, you hopefully have some way of giving the OS
> > > the memory back. As compared with the previous two approaches, this
> > > seems less likely to be noticeable to most PG code.
> >
> > Another advantage is that you can shrink shared buffers fairly granularly and
> > cheaply with that approach, compared to having to move buffers entirely out of
> > a larger mapping to be able to unmap it.
> 
> Don't you have to still move buffers entirely out of the region you
> want to unmap?

Sure. But you can unmap at the granularity of a hardware page (there is some
fragmentation cost on the OS / hardware page table level
though). Theoretically you could unmap individual 8kB pages.
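
On Linux the reserve/back/release dance would look roughly like this (a sketch
only - a real implementation would need a shared mapping visible to all
backends, and the choice of madvise flavor vs. munmap has tradeoffs):

#include <sys/mman.h>

#define GB	(1024UL * 1024 * 1024)

int
main(void)
{
	size_t		reserve_size = 64 * GB;		/* upper bound, address space only */
	size_t		active_size = 8 * GB;		/* currently backed portion */
	size_t		shrink = 1 * GB;			/* amount to give back */

	/* Reserve a large range of address space without committing memory. */
	char	   *base = mmap(NULL, reserve_size, PROT_NONE,
							MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (base == MAP_FAILED)
		return 1;

	/* Back the first part; pages are still only allocated on first touch. */
	if (mprotect(base, active_size, PROT_READ | PROT_WRITE) != 0)
		return 1;

	/* ... use base[0 .. active_size) as buffer space ... */

	/*
	 * Shrink: drop the physical pages behind the tail at hardware-page
	 * granularity while keeping the address-space reservation, so the range
	 * can be grown again later.
	 */
	madvise(base + active_size - shrink, shrink, MADV_DONTNEED);
	mprotect(base + active_size - shrink, shrink, PROT_NONE);

	munmap(base, reserve_size);
	return 0;
}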


> > > Problems include (1) you have to somehow figure out how much address space
> > > to reserve, and that forms an upper bound on how big shared_buffers can grow
> > > at runtime and
> >
> > Presumably you'd normally not want to reserve more than the physical amount of
> > memory on the system. Sure, memory can be hot added, but IME that's quite
> > rare.
> 
> I would think that might not be so rare in a virtualized environment,
> which would seem to be one of the most important use cases for this
> kind of thing.

I've not seen it in production in a long time - but that might be because I've
been out of the consulting game for too long. To my knowledge none of the
common cloud providers support it, which of course significantly restricts
where it could be used.  I have far more commonly seen "ballooning" used to
remove unused/rarely-used memory from running instances though.


> Plus, this would mean we'd need to auto-detect system RAM. I'd rather
> not go there, and just fix the upper limit via a GUC.

I'd have assumed we'd want a GUC that auto-determines the amount of RAM if set
to -1. I don't think it's that hard to detect the available memory.
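
At least on Linux and most Unix-like systems it's little more than a sysconf()
call; a sketch (Windows would need its own path):

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	/* Total physical memory = page count * page size. */
	long		pages = sysconf(_SC_PHYS_PAGES);
	long		page_size = sysconf(_SC_PAGE_SIZE);

	if (pages > 0 && page_size > 0)
		printf("physical memory: ~%ld MB\n", pages / 1024 * page_size / 1024);
	else
		printf("could not determine physical memory\n");
	return 0;
}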


> > A third issue is that it can confuse administrators inspecting the system with
> > OS tools. "Postgres uses many terabytes of memory on my system!" due to VIRT
> > being huge etc.
> 
> Mmph. That's disagreeable but probably not a reason to entirely
> abandon any particular approach.

Agreed.

Greetings,

Andres Freund


