Discussion: Re: Changing shared_buffers without restart
> On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> changing shared memory mapping layout. Any feedback is appreciated.

Hi,

Here is a new version of the patch, which contains a proposal for how to coordinate shared memory resizing between backends. The rest is more or less the same; feedback about the coordination is appreciated. It's a lot to read, but the main differences are:

1. Allowing a GUC value change to be decoupled from actually applying it, a sort of "pending" change. The idea is to let custom logic be triggered by an assign hook, which then takes responsibility for what happens later and how the change gets applied. This allows the regular GUC infrastructure to be used in cases where a value change requires some complicated processing. I was trying to keep the change minimally invasive; it's also still missing GUC reporting.

2. The shared memory resizing patch became more complicated thanks to the coordination between backends. The current implementation was chosen from a few more or less equal alternatives, which evolve along the following lines:

* There should be one "coordinator" process overseeing the change. Having the postmaster fulfill this role, as in this patch, seems like a natural idea, but it poses certain challenges since the postmaster doesn't have locking infrastructure. Another option would be to elect a single backend as the coordinator, which would then have to handle the postmaster as a special case. If there is ever a dedicated "coordinator" worker in Postgres, it would be useful here.

* The coordinator uses EmitProcSignalBarrier to reach out to all other backends and trigger the resize process. Backends join a Barrier to synchronize and wait until everyone is finished.

* Some resizing state is stored in shared memory to handle backends that were late for some reason or didn't receive the signal. Exactly what to store there is open for discussion.

* Since we want to make sure all processes share the same understanding of what the NBuffers value is, any failure is mostly a hard stop: rolling back the change would need coordination as well, which sounds a bit too complicated for now.

We've tested this change manually for now, although it might be useful to try out injection points. The testing strategy, which has caught plenty of bugs, was simply to run a pgbench workload against a running instance and change shared_buffers on the fly. Some more subtle cases were verified by manually injecting delays to trigger the expected scenarios.

To reiterate, here is the patch breakdown:

* Patches 1-3 prepare the infrastructure and shared memory layout. They could be useful even with multithreaded PostgreSQL, when there will be no need for shared memory; I assume that in the multithreaded world there will still be a need for a contiguous chunk of memory shared between threads, and its layout would be similar to the one used for shared memory mappings. Note that patch nr 2 goes away as soon as I get to implementing shared memory address reservation, but for now it's needed.

* Patch 4 is a new addition to handle "pending" GUC changes.

* Patch 5 actually does the resizing. It's shared memory specific, of course, and uses the Linux-specific mremap, which leaves portability questions open.

* Patch 6 is somewhat independent, but quite convenient to have. It uses the Linux-specific call memfd_create.

I would like to get some feedback on the synchronization part.
While waiting, I'll proceed with implementing shared memory address space reservation, and Ashutosh will continue with buffer eviction to support shared memory reduction.
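[Editor's note] Since the patches themselves are attached rather than inlined, here is a minimal standalone sketch (C, Linux-only) of the two mechanisms named above: an anonymous file created with memfd_create() backing the segment, and mremap() to resize the mapping. The sizes, the segment name and the use of MREMAP_MAYMOVE are illustrative assumptions; the actual patch keeps mappings in place by reserving address space, which a standalone example cannot rely on.

```c
/*
 * Minimal sketch (not the patch code): a shared memory segment backed by an
 * anonymous file from memfd_create(), grown in place with mremap().
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t old_size = 64 * 1024 * 1024;     /* 64 MB, arbitrary */
    size_t new_size = 128 * 1024 * 1024;    /* grow to 128 MB */
    int    fd;
    char  *p;

    /* The anonymous file owns the contents, independently of any mapping. */
    fd = memfd_create("shm_resize_demo", 0);
    if (fd < 0 || ftruncate(fd, old_size) < 0)
        return 1;

    p = mmap(NULL, old_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    strcpy(p, "buffer pool contents");

    /* Grow the backing file first, then the mapping itself. */
    if (ftruncate(fd, new_size) < 0)
        return 1;
    p = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
    if (p == MAP_FAILED)
        return 1;

    printf("contents survived the resize: %s\n", p);
    return 0;
}
```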
Attachments
- v2-0001-Allow-to-use-multiple-shared-memory-mappings.patch
- v2-0002-Allow-placing-shared-memory-mapping-with-an-offse.patch
- v2-0003-Introduce-multiple-shmem-segments-for-shared-buff.patch
- v2-0004-Introduce-pending-flag-for-GUC-assign-hooks.patch
- v2-0005-Allow-to-resize-shared-memory-without-restart.patch
- v2-0006-Use-anonymous-files-to-back-shared-memory-segment.patch
> On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:
> > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > changing shared memory mapping layout. Any feedback is appreciated.
>
> Hi,
>
> Here is a new version of the patch, which contains a proposal about how to
> coordinate shared memory resizing between backends. The rest is more or less
> the same, a feedback about coordination is appreciated. It's a lot to read, but
> the main difference is about:

Just one note: there are still a couple of compilation warnings in the code, which I haven't addressed yet. Those will go away in the next version.
On Thu, Feb 27, 2025 at 1:58 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Tue, Feb 25, 2025 at 10:52:05AM GMT, Dmitry Dolgov wrote:
> > > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > > changing shared memory mapping layout. Any feedback is appreciated.
> >
> > Hi,
> >
> > Here is a new version of the patch, which contains a proposal about how to
> > coordinate shared memory resizing between backends. The rest is more or less
> > the same, a feedback about coordination is appreciated. It's a lot to read, but
> > the main difference is about:
>
> Just one note, there are still couple of compilation warnings in the
> code, which I haven't addressed yet. Those will go away in the next
> version.

PFA the patchset which implements shrinking shared buffers.

0001-0006 are the same as the previous patchset.
0007 fixes compilation warnings from previous patches - I think those should be absorbed into their respective patches.
0008 adds TODOs that need some code changes or at least need some consideration. Some of them might point to the causes of the assertion failures seen with this patch set.
0009 adds WIP support for shrinking shared buffers - I think this should be absorbed into 0005.
0010 is a WIP fix for the assertion failures seen from BgBufferSync() - I am still investigating those.

I am using the attached script to shake the patch well. It runs pgbench and concurrently resizes shared_buffers. I am seeing assertion failures when running the script in both cases, expanding and shrinking the buffers. I am investigating the failed Assert("strategy_delta >= 0") next.

--
Best Wishes,
Ashutosh Bapat
Attachments
- 0004-Introduce-pending-flag-for-GUC-assign-hooks-20250228.patch
- 0003-Introduce-multiple-shmem-segments-for-share-20250228.patch
- 0005-Allow-to-resize-shared-memory-without-resta-20250228.patch
- 0002-Allow-placing-shared-memory-mapping-with-an-20250228.patch
- 0001-Allow-to-use-multiple-shared-memory-mapping-20250228.patch
- 0006-Use-anonymous-files-to-back-shared-memory-s-20250228.patch
- 0010-WIP-Reinitialize-buffer-sync-strategy-20250228.patch
- 0007-Fix-compilation-failures-in-previous-patche-20250228.patch
- 0009-WIP-Support-shrinking-shared-buffers-20250228.patch
- 0008-Add-TODOs-and-questions-about-previous-comm-20250228.patch
- pgbench-concurrent-resize-buffers.sh
I ran some simple tests (outside of PG) on Linux kernel v6.1, which has this commit that added some hugepage support to mremap:
https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/
From reading the kernel code and testing, it seems that for a hugepage-backed mapping mremap supports only shrinking, not growing. Further, for shrinking, what I observed is that after mremap is called the hugepage memory
is not released back to the OS; rather, it's released when the fd is closed (or when the memory is unmapped, for a mapping created with MAP_ANONYMOUS).
I'm not sure if this behavior is expected, but being able to release memory back to the OS immediately after mremap would be important for use cases such as supporting "serverless" PG instances on the cloud.
I'm no expert in the Linux kernel, so I could be missing something. It'd be great if you or somebody else could comment on these observations and whether this mremap-based solution would work with a hugepage-backed buffer pool.
I also attached the test program in case someone can spot something I did wrong.
Regards,
Jack Ng
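[Editor's note] The test program referenced above is attached to the original message and not reproduced in the archive; the following is a rough reconstruction of that kind of experiment, assuming 2 MB huge pages are configured (vm.nr_hugepages). It only illustrates the two reported observations and is not Jack's actual code.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_PAGE (2UL * 1024 * 1024)   /* assumes 2 MB huge pages */

int main(void)
{
    size_t old_size = 16 * HUGE_PAGE;
    size_t new_size = 32 * HUGE_PAGE;
    int    fd;
    char  *p;

    /* Hugepage-backed anonymous file; needs enough free huge pages. */
    fd = memfd_create("huge_seg", MFD_HUGETLB);
    if (fd < 0 || ftruncate(fd, new_size) < 0)
        return 1;

    p = mmap(NULL, old_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    memset(p, 0xab, old_size);

    /* Observation 1: growing a hugetlb mapping with mremap fails. */
    if (mremap(p, old_size, new_size, MREMAP_MAYMOVE) == MAP_FAILED)
        perror("mremap grow");

    /* Observation 2: shrinking succeeds, but HugePages_Free in
     * /proc/meminfo does not go back up until the fd is closed. */
    if (mremap(p, old_size, old_size / 2, 0) == MAP_FAILED)
        perror("mremap shrink");

    pause();    /* inspect /proc/meminfo at this point */
    return 0;
}
```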
On Tue, Feb 25, 2025 at 3:22 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> > TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> > changing shared memory mapping layout. Any feedback is appreciated.
>
> Hi,
>
> Here is a new version of the patch, which contains a proposal about how to
> coordinate shared memory resizing between backends. The rest is more or less
> the same, a feedback about coordination is appreciated. It's a lot to read, but
> the main difference is about:
Thanks Dmitry for the summary.
>
> 1. Allowing to decouple a GUC value change from actually applying it, sort of a
> "pending" change. The idea is to let a custom logic be triggered on an assign
> hook, and then take responsibility for what happens later and how it's going to
> be applied. This allows to use regular GUC infrastructure in cases where value
> change requires some complicated processing. I was trying to make the change
> not so invasive, plus it's missing GUC reporting yet.
>
> 2. Shared memory resizing patch became more complicated thanks to some
> coordination between backends. The current implementation was chosen from few
> more or less equal alternatives, which are evolving along following lines:
>
> * There should be one "coordinator" process overseeing the change. Having
> postmaster to fulfill this role like in this patch seems like a natural idea,
> but it poses certain challenges since it doesn't have locking infrastructure.
> Another option would be to elect a single backend to be a coordinator, which
> will handle the postmaster as a special case. If there will ever be a
> "coordinator" worker in Postgres, that would be useful here.
>
> * The coordinator uses EmitProcSignalBarrier to reach out to all other backends
> and trigger the resize process. Backends join a Barrier to synchronize and wait
> untill everyone is finished.
>
> * There is some resizing state stored in shared memory, which is there to
> handle backends that were for some reason late or didn't receive the signal.
> What to store there is open for discussion.
>
> * Since we want to make sure all processes share the same understanding of what
> NBuffers value is, any failure is mostly a hard stop, since to rollback the
> change coordination is needed as well and sounds a bit too complicated for now.
>
I think we should add a way to monitor the progress of resizing; at
least whether resizing is complete and whether the new GUC value is in
effect.
> We've tested this change manually for now, although it might be useful to try
> out injection points. The testing strategy, which has caught plenty of bugs,
> was simply to run pgbench workload against a running instance and change
> shared_buffers on the fly. Some more subtle cases were verified by manually
> injecting delays to trigger expected scenarios.
I have shared a script with my changes, but it's far from complete
testing. We will need to use injection points to test specific
scenarios.
--
Best Wishes,
Ashutosh Bapat
> On Thu, Mar 20, 2025 at 04:55:47PM GMT, Ni Ku wrote:
>
> I ran some simple tests (outside of PG) on linux kernel v6.1, which has
> this commit that added some hugepage support to mremap (
> https://patchwork.kernel.org/project/linux-mm/patch/20211013195825.3058275-1-almasrymina@google.com/
> ).
>
> From reading the kernel code and testing, for a hugepage-backed mapping it
> seems mremap supports only shrinking but not growing. Further, for
> shrinking, what I observed is that after mremap is called the hugepage
> memory
> is not released back to the OS, rather it's released when the fd is closed
> (or when the memory is unmapped for a mapping created with MAP_ANONYMOUS).
> I'm not sure if this behavior is expected, but being able to release memory
> back to the OS immediately after mremap would be important for use cases
> such as supporting "serverless" PG instances on the cloud.
>
> I'm no expert in the linux kernel so I could be missing something. It'd be
> great if you or somebody can comment on these observations and whether this
> mremap-based solution would work with hugepage bufferpool.

Hm, I think you're right. I didn't realize there is such a limitation, but I've just verified it on the latest kernel build and hit the same condition on increasing a hugetlb mapping that you've mentioned above. That's annoying of course, but I've got another approach I was originally experimenting with -- instead of mremap, do munmap and mmap with the new size and rely on the anonymous fd to keep the memory content in between. I'm currently reworking the mmap'ing part of the patch; let me check if this new approach is something we could universally rely on.
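[Editor's note] A minimal sketch of the munmap-plus-mmap idea described here, assuming a memfd-backed segment that can be recreated at the same (reserved) address; the function name and signature are illustrative, not taken from the patch.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Grow a segment by recreating its mapping: the anonymous file keeps the
 * contents while no mapping exists.  MAP_FIXED puts the new, larger mapping
 * back at the same address, which is safe only because that address range
 * is reserved for this segment.
 */
static char *
grow_segment(char *addr, int fd, size_t old_size, size_t new_size)
{
    void *p;

    if (munmap(addr, old_size) < 0)
        return NULL;

    /* Enlarge the backing anonymous file before mapping it again. */
    if (ftruncate(fd, new_size) < 0)
        return NULL;

    p = mmap(addr, new_size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```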
Thanks for your insights and confirmation, Dmitry.
Right, I think the anonymous fd approach would work to keep the memory contents intact in between munmap and mmap with the new size, so bufferpool expansion would work.
But it seems shrinking would still be problematic, since that approach requires the anonymous fd to remain open (for memory content protection), and so munmap would not release the memory back to the OS right away (it gets released when the fd is closed). From testing, this is true for hugepage memory at least.
Is there a way around this? Or maybe I misunderstood what you have in mind ;)
Regards,
Jack Ng
> On Fri, Mar 21, 2025 at 04:48:30PM GMT, Ni Ku wrote:
> Thanks for your insights and confirmation, Dmitry.
> Right, I think the anonymous fd approach would work to keep the memory
> contents intact in between munmap and mmap with the new size, so bufferpool
> expansion would work.
> But it seems shrinking would still be problematic, since that approach
> requires the anonymous fd to remain open (for memory content protection),
> and so munmap would not release the memory back to the OS right away (gets
> released when the fd is closed). From testing this is true for hugepage
> memory at least.
> Is there a way around this? Or maybe I misunderstood what you have in mind
> ;)
The anonymous file will be truncated to its new, shrunk size before
mapping it a second time (I think this part is missing in your test
example); to my understanding, after a quick look at do_vmi_align_munmap,
this should be enough to make the memory reclaimable.
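[Editor's note] A sketch of that shrinking order of operations, with the same caveats as the growth sketch earlier in the thread (illustrative names, memfd-backed segment, reserved address range):

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Shrink a segment: truncate the backing anonymous file *before* mapping it
 * again, so the discarded tail is referenced by neither a mapping nor the
 * file and the kernel can reclaim it.
 */
static char *
shrink_segment(char *addr, int fd, size_t old_size, size_t new_size)
{
    void *p;

    if (munmap(addr, old_size) < 0)
        return NULL;

    if (ftruncate(fd, new_size) < 0)
        return NULL;

    p = mmap(addr, new_size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```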
On Fri, Feb 28, 2025 at 5:31 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> I think we should add a way to monitor the progress of resizing; at
> least whether resizing is complete and whether the new GUC value is in
> effect.
>

I further tested this approach by tracing the barrier synchronization using the attached patch, which adds a bunch of elog() calls. I ran a pgbench load and simultaneously executed the following commands on a psql connection:

#alter system set shared_buffers to '200MB';
ALTER SYSTEM
#select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

#show shared_buffers;
 shared_buffers
----------------
 200MB
(1 row)

#select count(*) from pg_stat_activity;
 count
-------
     6
(1 row)

#select pg_backend_pid(); - the backend where all these commands were executed
 pg_backend_pid
----------------
         878405
(1 row)

I see the following in the PostgreSQL error logs.

2025-03-12 11:04:53.812 IST [878167] LOG: received SIGHUP, reloading configuration files
2025-03-12 11:04:53.813 IST [878405] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
-- not all backends have reloaded configuration.
2025-03-12 11:04:53.813 IST [878173] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878173] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.813 IST [878173] LOG: reached barrier phase 1
2025-03-12 11:04:53.813 IST [878171] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878172] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878171] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878172] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878340] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878340] STATEMENT: UPDATE pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.813 IST [878340] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878340] STATEMENT: UPDATE pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.813 IST [878338] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878338] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.813 IST [878339] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878339] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.813 IST [878338] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878338] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.813 IST [878339] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.813 IST [878339] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.813 IST [878341] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.813 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878341] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878337] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878337] LOG: attached when barrier was at phase 1
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878168] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:53.814 IST [878172] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878171] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878340] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878340] STATEMENT: UPDATE pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.814 IST [878338] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878338] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.814 IST [878341] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878337] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878173] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878339] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878339] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.814 IST [878172] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878340] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878340] STATEMENT: UPDATE pgbench_branches SET bbalance = bbalance + 1367 WHERE bid = 8;
2025-03-12 11:04:53.814 IST [878341] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878341] STATEMENT: BEGIN;
2025-03-12 11:04:53.814 IST [878339] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878339] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -3449 WHERE aid = 159726;
2025-03-12 11:04:53.814 IST [878171] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878338] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878338] STATEMENT: UPDATE pgbench_accounts SET abalance = abalance + -209 WHERE aid = 453662;
2025-03-12 11:04:53.814 IST [878337] LOG: reached barrier phase 3
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878337] LOG: buffer resizing operation finished at phase 4
2025-03-12 11:04:53.814 IST [878337] STATEMENT: UPDATE pgbench_tellers SET tbalance = tbalance + -1996 WHERE tid = 392;
2025-03-12 11:04:53.814 IST [878168] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.814 IST [878168] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.814 IST [878168] LOG: reached barrier phase 1
2025-03-12 11:04:53.814 IST [878168] LOG: reached barrier phase 2
2025-03-12 11:04:53.814 IST [878168] LOG: buffer resizing operation finished at phase 3
2025-03-12 11:04:53.815 IST [878169] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:53.815 IST [878169] LOG: attached when barrier was at phase 0
2025-03-12 11:04:53.815 IST [878169] LOG: reached barrier phase 1
2025-03-12 11:04:53.815 IST [878169] LOG: reached barrier phase 2
2025-03-12 11:04:53.815 IST [878169] LOG: buffer resizing operation finished at phase 3
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem resizing from 16384 to -1, 0
2025-03-12 11:04:55.965 IST [878405] LOG: Handle a barrier for shmem resizing from 16384 to 25600, 1
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: attached when barrier was at phase 0
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: reached barrier phase 1
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: reached barrier phase 2
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;
2025-03-12 11:04:55.965 IST [878405] LOG: buffer resizing operation finished at phase 3
2025-03-12 11:04:55.965 IST [878405] STATEMENT: show shared_buffers;

To tell the story in short: pid 173 (for the sake of brevity I am just mentioning the last three digits of each PID) attached to the barrier first and immediately reached phase 1. 171, 172, 340, 338, 339, 341 and 337 all attached to the barrier in phase 1. All of these backends completed the phases in a synchronous fashion. But 168, 169 and 405 were yet to attach to the barrier, since they hadn't reloaded their configurations yet. Each of these backends then finished all the phases independently of the others. For your reference:

#select pid, application_name, backend_type from pg_stat_activity where pid in (878169, 878168);
  pid   | application_name |   backend_type
--------+------------------+-------------------
 878168 |                  | checkpointer
 878169 |                  | background writer
(2 rows)

This is because BarrierArriveAndWait() only waits for the backends that are already attached. It doesn't wait for backends which are yet to attach. I think what we want is that *all* backends should execute all the phases synchronously and wait for the others to finish. If we don't do that, there's a possibility that some of them will see inconsistent buffer states, or even worse may not have the necessary memory mapped and resized - thus causing segfaults. Am I correct?

I think what needs to be done is that every backend should wait for the other backends to attach themselves to the barrier before moving to the first phase. One way I can think of is to use two signal barriers - one to ensure that all the backends have attached themselves, and a second for the actual resizing. But then the postmaster needs to wait for all the processes to process the first signal barrier. A postmaster can not wait on anything. Maybe there's a way to poll, but I didn't find it. Does that mean that we have to make some other backend the coordinator?

--
Best Wishes,
Ashutosh Bapat
> On Mon, Apr 07, 2025 at 11:50:46AM GMT, Ashutosh Bapat wrote:
> This is because the BarrierArriveAndWait() only waits for all the
> attached backends. It doesn't wait for backends which are yet to
> attach. I think what we want is *all* the backends should execute all
> the phases synchronously and wait for others to finish. If we don't do
> that, there's a possibility that some of them would see inconsistent
> buffer states or even worse may not have necessary memory mapped and
> resized - thus causing segfaults. Am I correct?
>
> I think what needs to be done is that every backend should wait for other
> backends to attach themselves to the barrier before moving to the
> first phase. One way I can think of is we use two signal barriers -
> one to ensure that all the backends have attached themselves and
> second for the actual resizing. But then the postmaster needs to wait for
> all the processes to process the first signal barrier. A postmaster can
> not wait on anything. Maybe there's a way to poll, but I didn't find
> it. Does that mean that we have to make some other backend a coordinator?

Yes, you're right, a plain dynamic Barrier does not ensure all available processes will be synchronized. I was aware of the scenario you describe; it's mentioned in the comments for the resize function. I was under the impression that this should be enough, but after some more thinking I'm not so sure anymore. Let me try to structure it as a list of possible corner cases that we need to worry about:

* A new backend is spawned while we're busy resizing shared memory. Those should wait until the resizing is complete and get the new size as well.

* An old backend receives a resize message, but exits before attempting to resize. Those should be excluded from coordination.

* A backend is blocked and not responding before or after the ProcSignalBarrier message was sent. I'm thinking about a failure situation, when one rogue backend is doing something without checking for interrupts. We need to wait for those to become responsive, and potentially abort the shared memory resize after some timeout.

* Backends join the barrier in disjoint groups with some time in between, which is longer than what it takes to resize shared memory. That means that relying only on the shared dynamic barrier is not enough -- it will only synchronize the resize procedure within those groups.

Out of those I think the third poses some problems, e.g. if we are shrinking the shared memory, but one backend is accessing the buffer pool without checking for interrupts. In the v3 implementation this won't be handled correctly; other backends will ignore such a rogue process. Independently of that, we could reason about the logic much more easily if it's guaranteed that all the processes resizing shared memory will wait for each other and start simultaneously.

It looks like to achieve that we need a slightly different combination of the global Barrier and the ProcSignalBarrier mechanism. We can't use ProcSignalBarrier as it is, because processes need to wait for each other, and at the same time finish processing to bump the generation. We also can't use a simple dynamic Barrier, due to the possibility of disjoint groups of processes. A static Barrier isn't easier either, because we would somehow need to know the exact number of processes, which might change over time.

I think a relatively elegant solution is to extend the ProcSignalBarrier mechanism to track not only pss_barrierGeneration, as a sign that everything was processed, but also something like pss_barrierReceivedGeneration, indicating that the message was received everywhere but not processed yet. That would be enough to allow processes to wait until the resize message has been received everywhere, and then use a global Barrier to wait until all processes are finished. It's somewhat similar to your proposal to use two signals, but has less implementation overhead.

This would also allow different solutions for error handling. E.g. we could wait unboundedly for all the processes we expect to resize, assuming that the user will be able to intervene and fix an issue if there is any. Or we could do a timed wait, and abort the resize after some timeout if not all processes are ready yet. In the new v4 version of the patch the first option is implemented. On top of that there are the following changes:

* Shared memory address space is now reserved for future usage, making shared memory segment clashes (e.g. due to memory allocation) impossible. There is a new GUC to control how much space to reserve, called max_available_memory -- on the assumption that most of the time it would make sense to set its value to the total amount of memory on the machine. I'm open to suggestions regarding the name.

* There is one more patch to address hugepage remapping. As mentioned in this thread above, the Linux kernel has certain limitations when it comes to mremap for segments allocated with huge pages. To work around this, it's possible to replace mremap with a sequence of unmap and map again, relying on the anon file behind the segment to keep the memory content. I haven't found any downsides of this approach so far, but it makes the anonymous file patch 0007 mandatory.
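[Editor's note] A heavily reduced sketch of the coordination flow described in this message, using names that appear in the patch excerpts quoted later in this thread (ProcessBarrierShmemResize, WaitForProcSignalBarrierReceived, ShmemCtrl, SHMEM_RESIZE_REQUESTED). The wait-event constants and the resize helper are placeholders, and the real patch does considerably more.

```c
/* Sketch only -- not the actual patch code. */
static bool
ProcessBarrierShmemResize(Barrier *barrier)    /* barrier lives in shared state */
{
    /* Someone else already finished this resize, nothing to do. */
    if (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) == NBuffers)
        return true;

    /*
     * Step 1: wait until the ProcSignal barrier message has been *received*
     * by every backend (the new pss_barrierReceivedGeneration), so that no
     * disjoint group of backends can run the resize on its own.
     */
    if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
        WaitForProcSignalBarrierReceived(
            pg_atomic_read_u64(&ShmemCtrl->Generation));

    /* Step 2: all attached backends remap their segments in lockstep. */
    BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START);  /* placeholder */
    ResizeShmemSegments();          /* placeholder for the per-backend remap */
    BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);   /* placeholder */

    BarrierDetach(barrier);
    return true;
}
```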
Attachments
- v4-0001-Allow-to-use-multiple-shared-memory-mappings.patch
- v4-0002-Address-space-reservation-for-shared-memory.patch
- v4-0003-Introduce-multiple-shmem-segments-for-shared-buff.patch
- v4-0004-Introduce-pending-flag-for-GUC-assign-hooks.patch
- v4-0005-Introduce-pss_barrierReceivedGeneration.patch
- v4-0006-Allow-to-resize-shared-memory-without-restart.patch
- v4-0007-Use-anonymous-files-to-back-shared-memory-segment.patch
- v4-0008-Support-resize-for-hugetlb.patch
On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> In the new v4 version
> of the patch the first option is implemented.
>

The patches don't apply cleanly using git am, but patch -p1 applies them cleanly. However, I see the following compilation errors:

RuntimeError: command "ninja" failed with error
[1/1954] Generating src/include/utils/errcodes with a custom command
[2/1954] Generating src/include/storage/lwlocknames_h with a custom command
[3/1954] Generating src/include/utils/wait_event_names with a custom command
[4/1954] Compiling C object src/port/libpgport.a.p/pg_popcount_aarch64.c.o
[5/1954] Compiling C object src/port/libpgport.a.p/pg_numa.c.o
FAILED: src/port/libpgport.a.p/pg_numa.c.o
cc -Isrc/port/libpgport.a.p -Isrc/include -I../../coderoot/pg/src/include -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Werror -g -fno-strict-aliasing -fwrapv -fexcess-precision=standard -D_GNU_SOURCE -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security -Wdeclaration-after-statement -Wno-format-truncation -Wno-stringop-truncation -fPIC -DFRONTEND -MD -MQ src/port/libpgport.a.p/pg_numa.c.o -MF src/port/libpgport.a.p/pg_numa.c.o.d -o src/port/libpgport.a.p/pg_numa.c.o -c ../../coderoot/pg/src/port/pg_numa.c
In file included from ../../coderoot/pg/src/include/storage/spin.h:54,
                 from ../../coderoot/pg/src/include/storage/condition_variable.h:26,
                 from ../../coderoot/pg/src/include/storage/barrier.h:22,
                 from ../../coderoot/pg/src/include/storage/pg_shmem.h:27,
                 from ../../coderoot/pg/src/port/pg_numa.c:26:
../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error "s_lock.h may not be included from frontend code"
   93 | #error "s_lock.h may not be included from frontend code"
      |  ^~~~~
In file included from ../../coderoot/pg/src/port/pg_numa.c:26:
../../coderoot/pg/src/include/storage/pg_shmem.h:66:9: error: unknown type name ‘pg_atomic_uint32’
   66 |         pg_atomic_uint32 NSharedBuffers;
      |         ^~~~~~~~~~~~~~~~
../../coderoot/pg/src/include/storage/pg_shmem.h:68:9: error: unknown type name ‘pg_atomic_uint64’
   68 |         pg_atomic_uint64 Generation;
      |         ^~~~~~~~~~~~~~~~
../../coderoot/pg/src/port/pg_numa.c: In function ‘pg_numa_get_pagesize’:
../../coderoot/pg/src/port/pg_numa.c:117:17: error: too few arguments to function ‘GetHugePageSize’
  117 |                 GetHugePageSize(&os_page_size, NULL);
      |                 ^~~~~~~~~~~~~~~
In file included from ../../coderoot/pg/src/port/pg_numa.c:26:
../../coderoot/pg/src/include/storage/pg_shmem.h:127:13: note: declared here
  127 | extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
      |             ^~~~~~~~~~~~~~~

--
Best Wishes,
Ashutosh Bapat
> On Wed, Apr 09, 2025 at 11:12:18AM GMT, Ashutosh Bapat wrote:
> On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> > In the new v4 version
> > of the patch the first option is implemented.
> >
>
> The patches don't apply cleanly using git am but patch -p1 applies
> them cleanly. However I see following compilation errors
> RuntimeError: command "ninja" failed with error

Because it's relatively meaningless to apply a patch to the tip of master around release freeze time :) Commit 65c298f61fc introduced a new usage of GetHugePageSize, which was modified in my patch. I'm going to address it with the next rebased version; in the meantime, you can always use the specified base commit to apply the changeset:

base-commit: 5e1915439085014140314979c4dd5e23bd677cac
On Wed, Apr 9, 2025 at 1:15 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Wed, Apr 09, 2025 at 11:12:18AM GMT, Ashutosh Bapat wrote:
> > On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > >
> > > In the new v4 version
> > > of the patch the first option is implemented.
> > >
> >
> > The patches don't apply cleanly using git am but patch -p1 applies
> > them cleanly. However I see following compilation errors
> > RuntimeError: command "ninja" failed with error
>
> Becase it's relatively meaningless to apply a patch to the tip of the
> master around the release freeze time :) Commit 65c298f61fc has
> introduced new usage of GetHugePageSize, which was modified in my patch.
> I'm going to address it with the next rebased version, in the meantime
> you can always use the specified base commit to apply the changeset:
>
> base-commit: 5e1915439085014140314979c4dd5e23bd677cac

There is a higher chance that people will try these patches now than there was two days ago, and a better chance still if they find the patches easy to apply.

../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error "s_lock.h may not be included from frontend code"

How about this? Why is that happening?

--
Best Wishes,
Ashutosh Bapat
> On Wed, Apr 09, 2025 at 01:20:16PM GMT, Ashutosh Bapat wrote:
> ../../coderoot/pg/src/include/storage/s_lock.h:93:2: error: #error
> "s_lock.h may not be included from frontend code"
>
> How about this? Why is that happening?

The same reason -- as you can see, it comes from compiling pg_numa.c, which, it seems, is used in frontend code and imports pg_shmem.h. I wanted to reshuffle the includes in the patch anyway; that would be a good excuse to finally do this.
On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> Yes, you're right, plain dynamic Barrier does not ensure all available
> processes will be synchronized. I was aware about the scenario you
> describe, it's mentioned in commentaries for the resize function. I was
> under the impression this should be enough, but after some more thinking
> I'm not so sure anymore. Let me try to structure it as a list of
> possible corner cases that we need to worry about:
>
> * New backend spawned while we're busy resizing shared memory. Those
> should wait until the resizing is complete and get the new size as well.
>
> * Old backend receives a resize message, but exits before attempting to
> resize. Those should be excluded from coordination.

Should we detach the barrier in on_exit()?

> * A backend is blocked and not responding before or after the
> ProcSignalBarrier message was sent. I'm thinking about a failure
> situation, when one rogue backend is doing something without checking
> for interrupts. We need to wait for those to become responsive, and
> potentially abort shared memory resize after some timeout.

Right.

> I think a relatively elegant solution is to extend ProcSignalBarrier
> mechanism to track not only pss_barrierGeneration, as a sign that
> everything was processed, but also something like
> pss_barrierReceivedGeneration, indicating that the message was received
> everywhere but not processed yet. That would be enough to allow
> processes to wait until the resize message was received everywhere, then
> use a global Barrier to wait until all processes are finished. It's
> somehow similar to your proposal to use two signals, but has less
> implementation overhead.

The way it's implemented in v4 still has the disjoint group problem. Assume backends p1, p2, p3. All three of them are executing ProcessProcSignalBarrier(). All three of them have updated pss_barrierReceivedGeneration:

/* The message is observed, record that */
pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
shared_gen);

p1 and p2 moved faster and reached the following code from ProcessBarrierShmemResize():

if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));

Since all the processes have received the barrier message, p1 and p2 move ahead, go through all the next phases and finish resizing even before p3 gets a chance to call ProcessBarrierShmemResize() and attach itself to the Barrier. This could happen because p3 was processing some other ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has not attached itself to the barrier. Once p1 and p2 finish, p3 will attach itself to the barrier and resize buffers again - reinitializing the shared memory, which might have already been modified by p1 or p2. Boom - there's memory corruption.

Either every process has to make sure that all the other extant backends have attached themselves to the barrier, OR somebody has to ensure that and signal all the backends to proceed. The implementation doesn't do either.

> * Shared memory address space is now reserved for future usage, making
> shared memory segments clash (e.g. due to memory allocation)
> impossible. There is a new GUC to control how much space to reserve,
> which is called max_available_memory -- on the assumption that most of
> the time it would make sense to set its value to the total amount of
> memory on the machine. I'm open for suggestions regarding the name.

With 0006 applied:

+ /* Clean up some reserved space to resize into */
+ if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
ze, m->shmem)));
... snip ...
+ ptr = mremap(m->shmem, m->shmem_size, new_size, 0);

We unmap the portion of reserved address space that the existing segment would expand into. As long as we are just expanding, this will work. I am wondering how this would work for shrinking buffers? What scheme do you have in mind?

> * There is one more patch to address hugepages remap. As mentioned in
> this thread above, Linux kernel has certain limitations when it comes
> to mremap for segments allocated with huge pages. To work around it's
> possible to replace mremap with a sequence of unmap and map again,
> relying on the anon file behind the segment to keep the memory
> content. I haven't found any downsides of this approach so far, but it
> makes the anonymous file patch 0007 mandatory.

In 0008:

if (munmap(m->shmem, m->shmem_size) < 0)
... snip ...
/* Resize the backing anon file. */
if(ftruncate(m->segment_fd, new_size) == -1)
...
/* Reclaim the space */
ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
mmap_flags | MAP_FIXED, m->segment_fd, 0);

How are we preventing something from getting mapped into the space after m->shmem + newsize? We will need to add an unallocated but reserved address space mapping after m->shmem + newsize, right?

--
Best Wishes,
Ashutosh Bapat
> On Fri, Apr 11, 2025 at 08:04:39PM GMT, Ashutosh Bapat wrote:
> On Mon, Apr 7, 2025 at 2:13 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> > Yes, you're right, plain dynamic Barrier does not ensure all available
> > processes will be synchronized. I was aware about the scenario you
> > describe, it's mentioned in commentaries for the resize function. I was
> > under the impression this should be enough, but after some more thinking
> > I'm not so sure anymore. Let me try to structure it as a list of
> > possible corner cases that we need to worry about:
> >
> > * New backend spawned while we're busy resizing shared memory. Those
> > should wait until the resizing is complete and get the new size as well.
> >
> > * Old backend receives a resize message, but exits before attempting to
> > resize. Those should be excluded from coordination.
>
> Should we detach barrier in on_exit()?

Yeah, good point.

> > I think a relatively elegant solution is to extend ProcSignalBarrier
> > mechanism to track not only pss_barrierGeneration, as a sign that
> > everything was processed, but also something like
> > pss_barrierReceivedGeneration, indicating that the message was received
> > everywhere but not processed yet. That would be enough to allow
> > processes to wait until the resize message was received everywhere, then
> > use a global Barrier to wait until all processes are finished. It's
> > somehow similar to your proposal to use two signals, but has less
> > implementation overhead.
>
> The way it's implemented in v4 still has the disjoint group problem.
> Assume backends p1, p2, p3. All three of them are executing
> ProcessProcSignalBarrier(). All three of them updated
> pss_barrierReceivedGeneration
>
> /* The message is observed, record that */
> pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
> shared_gen);
>
> p1, p2 moved faster and reached following code from ProcessBarrierShmemResize()
> if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
> WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));
>
> Since all the processes have received the barrier message, p1, p2 move
> ahead and go through all the next phases and finish resizing even
> before p3 gets a chance to call ProcessBarrierShmemResize() and attach
> itself to Barrier. This could happen because it processed some other
> ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has
> not attached itself to the barrier. Once p1, p2 finish, p3 will attach
> itself to the barrier and resize buffers again - reinitializing the
> shared memory, which might has been already modified by p1 or p2. Boom
> - there's memory corruption.

It won't reinitialize anything, since this logic is controlled by ShmemCtrl->NSharedBuffers; if it's already updated, nothing will be changed.

About the race condition you mention: there is indeed a window between receiving the ProcSignalBarrier and attaching to the global Barrier in resize, but I don't think any process will be able to touch the buffer pool while inside this window. Even if the remapping itself is so blazing fast that this window is enough to make one process late (e.g. if it was busy handling some other signal, as you mention), as I've shown above it shouldn't be a problem.

I can experiment with this case though; maybe there is a way to completely close this window so we don't have to think about even potential scenarios.

> > * Shared memory address space is now reserved for future usage, making
> > shared memory segments clash (e.g. due to memory allocation)
> > impossible. There is a new GUC to control how much space to reserve,
> > which is called max_available_memory -- on the assumption that most of
> > the time it would make sense to set its value to the total amount of
> > memory on the machine. I'm open for suggestions regarding the name.
>
> With 0006 applied
> + /* Clean up some reserved space to resize into */
> + if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
> ze, m->shmem)));
> ... snip ...
> + ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
>
> We unmap the portion of reserved address space where the existing
> segment would expand into. As long as we are just expanding this will
> work. I am wondering how would this work for shrinking buffers? What
> scheme do you have in mind?

I didn't like this part originally, and after the changes to support hugetlb I think it's worth it just to replace mremap with munmap/mmap. That way there will be no such question: e.g. if a segment is getting shrunk, the unmapped area will again become a part of the reserved space.

> > * There is one more patch to address hugepages remap. As mentioned in
> > this thread above, Linux kernel has certain limitations when it comes
> > to mremap for segments allocated with huge pages. To work around it's
> > possible to replace mremap with a sequence of unmap and map again,
> > relying on the anon file behind the segment to keep the memory
> > content. I haven't found any downsides of this approach so far, but it
> > makes the anonymous file patch 0007 mandatory.
>
> In 0008
> if (munmap(m->shmem, m->shmem_size) < 0)
> ... snip ...
> /* Resize the backing anon file. */
> if(ftruncate(m->segment_fd, new_size) == -1)
> ...
> /* Reclaim the space */
> ptr = mmap(m->shmem, new_size, PROT_READ | PROT_WRITE,
> mmap_flags | MAP_FIXED, m->segment_fd, 0);
>
> How are we preventing something get mapped into the space after
> m->shmem + newsize? We will need to add an unallocated but reserved
> addressed space map after m->shmem+newsize right?

Nope, the segment is allocated from the reserved space already, with some chunk of it left after the segment's end for resizing purposes. We only take some part of the designated space; the rest is still reserved.
On Fri, Apr 11, 2025 at 8:31 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > > I think a relatively elegant solution is to extend ProcSignalBarrier
> > > mechanism to track not only pss_barrierGeneration, as a sign that
> > > everything was processed, but also something like
> > > pss_barrierReceivedGeneration, indicating that the message was received
> > > everywhere but not processed yet. That would be enough to allow
> > > processes to wait until the resize message was received everywhere, then
> > > use a global Barrier to wait until all processes are finished. It's
> > > somehow similar to your proposal to use two signals, but has less
> > > implementation overhead.
> >
> > The way it's implemented in v4 still has the disjoint group problem.
> > Assume backends p1, p2, p3. All three of them are executing
> > ProcessProcSignalBarrier(). All three of them updated
> > pss_barrierReceivedGeneration
> >
> > /* The message is observed, record that */
> > pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration,
> > shared_gen);
> >
> > p1, p2 moved faster and reached following code from ProcessBarrierShmemResize()
> > if (BarrierAttach(barrier) == SHMEM_RESIZE_REQUESTED)
> > WaitForProcSignalBarrierReceived(pg_atomic_read_u64(&ShmemCtrl->Generation));
> >
> > Since all the processes have received the barrier message, p1, p2 move
> > ahead and go through all the next phases and finish resizing even
> > before p3 gets a chance to call ProcessBarrierShmemResize() and attach
> > itself to Barrier. This could happen because it processed some other
> > ProcSignalBarrier message. p1 and p2 won't wait for p3 since it has
> > not attached itself to the barrier. Once p1, p2 finish, p3 will attach
> > itself to the barrier and resize buffers again - reinitializing the
> > shared memory, which might has been already modified by p1 or p2. Boom
> > - there's memory corruption.
>
> It won't reinitialize anything, since this logic is controlled by the
> ShmemCtrl->NSharedBuffers, if it's already updated nothing will be
> changed.

Ah, I see it now:

if(pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
{

Thanks for the clarification.

However, when we put back the patches to shrink buffers, we will evict the extra buffers, and shrink - if all the processes haven't participated in the barrier by then, some of them may try to access those buffers - re-installing them, and then bad things can happen.

> About the race condition you mention, there is indeed a window between
> receiving the ProcSignalBarrier and attaching to the global Barrier in
> resize, but I don't think any process will be able to touch buffer pool
> while inside this window. Even if it happens that the remapping itself
> was blazing fast that this window was enough to make one process late
> (e.g. if it was busy handling some other signal as you mention), as I've
> showed above it shouldn't be a problem.
>
> I can experiment with this case though, maybe there is a way to
> completely close this window to not thing about even potential
> scenarios.

The window may be small today, but we have to make this future-proof. Multiple ProcSignalBarrier messages may be processed in a single call to ProcessProcSignalBarrier(), and if each of those takes as long as buffer resizing, the window will get bigger and bigger. So we have to close this window.

> > > * Shared memory address space is now reserved for future usage, making
> > > shared memory segments clash (e.g. due to memory allocation)
> > > impossible. There is a new GUC to control how much space to reserve,
> > > which is called max_available_memory -- on the assumption that most of
> > > the time it would make sense to set its value to the total amount of
> > > memory on the machine. I'm open for suggestions regarding the name.
> >
> > With 0006 applied
> > + /* Clean up some reserved space to resize into */
> > + if (munmap(m->shmem + m->shmem_size, new_size - m->shmem_size) == -1)
> > ze, m->shmem)));
> > ... snip ...
> > + ptr = mremap(m->shmem, m->shmem_size, new_size, 0);
> >
> > We unmap the portion of reserved address space where the existing
> > segment would expand into. As long as we are just expanding this will
> > work. I am wondering how would this work for shrinking buffers? What
> > scheme do you have in mind?
>
> I didn't like this part originally, and after changes to support hugetlb
> I think it's worth it just to replace mremap with munmap/mmap. That way
> there will be no such question, e.g. if a segment is getting shrinked
> the unmapped area will again become a part of the reserved space.

I might not have noticed it, but are we putting two mappings, one reserved and one allocated, in the same address range, so that when the allocated mapping shrinks or expands, the reserved mapping continues to prohibit any other mapping from appearing there? I looked at some of the previous emails, but didn't find anything that describes how the reserved mapped space is managed.

--
Best Wishes,
Ashutosh Bapat
> On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote:
>
> However, when we put back the patches to shrink buffers, we will evict
> the extra buffers, and shrink - if all the processes haven't
> participated in the barrier by then, some of them may try to access
> those buffers - re-installing them and then bad things can happen.

As I've mentioned above, I don't see how a process could try to access a buffer if it's on the path between receiving the ProcSignalBarrier and attaching to the global shmem Barrier, even if we shrink buffers. AFAICT interrupt handlers should not touch buffers, and otherwise the process doesn't have any point within this window where it might do this. Do you have some particular scenario in mind?

> I might have not noticed it, but are we putting two mappings one
> reserved and one allocated in the same address space, so that when the
> allocated mapping shrinks or expands, the reserved mapping continues
> to prohibit any other mapping from appearing there? I looked at some
> of the previous emails, but didn't find anything that describes how
> the reserved mapped space is managed.

I thought so, but this turns out to be incorrect. I've just done a small experiment -- it looks like when reserving some space, mapping and then unmapping a small segment from it leaves a non-mapped gap. That would mean that for shrinking, the newly available space has to be reserved again.
On Mon, Apr 14, 2025 at 12:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote: > > > > However, when we put back the patches to shrink buffers, we will evict > > the extra buffers, and shrink - if all the processes haven't > > participated in the barrier by then, some of them may try to access > > those buffers - re-installing them and then bad things can happen. > > As I've mentioned above, I don't see how a process could try to access a > buffer, if it's on the path between receiving the ProcSignalBarrier and > attaching to the global shmem Barrier, even if we shrink buffers. > AFAICT interrupt handles should not touch buffers, and otherwise the > process doesn't have any point withing this window where it might do > this. Do you have some particular scenario in mind? ProcessProcSignalBarrier() is not within an interrupt handler but it responds to a flag set by an interrupt handler. After calling pg_atomic_write_u64(&MyProcSignalSlot->pss_barrierReceivedGeneration, shared_gen); it will enter the loop while (flags != 0) where it may process many barriers before processing PROCSIGNAL_BARRIER_SHMEM_RESIZE. Nothing stops the other barrier processing code from touching buffers. Right now it's just smgrrelease that gets called in the other barrier. But that's not guaranteed in future. > > > I might have not noticed it, but are we putting two mappings one > > reserved and one allocated in the same address space, so that when the > > allocated mapping shrinks or expands, the reserved mapping continues > > to prohibit any other mapping from appearing there? I looked at some > > of the previous emails, but didn't find anything that describes how > > the reserved mapped space is managed. > > I though so, but this turns out to be incorrect. Just have done a small > experiment -- looks like when reserving some space, mapping and > unmapping a small segment from it leaves a non-mapped gap. That would > mean for shrinking the new available space has to be reserved again. Right. That's what I thought. But I didn't see the corresponding code. So we have to keep track of two mappings for every segment - 1 for allocation and one for reserving space and resize those two while shrinking and expanding buffers. Am I correct? -- Best Wishes, Ashutosh Bapat
Hi Dmitry,

On Mon, Apr 14, 2025 at 12:50 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Mon, Apr 14, 2025 at 10:40:28AM GMT, Ashutosh Bapat wrote: > > > > However, when we put back the patches to shrink buffers, we will evict > > the extra buffers, and shrink - if all the processes haven't > > participated in the barrier by then, some of them may try to access > > those buffers - re-installing them and then bad things can happen. > > As I've mentioned above, I don't see how a process could try to access a > buffer, if it's on the path between receiving the ProcSignalBarrier and > attaching to the global shmem Barrier, even if we shrink buffers. > AFAICT interrupt handles should not touch buffers, and otherwise the > process doesn't have any point withing this window where it might do > this. Do you have some particular scenario in mind? > > > I might have not noticed it, but are we putting two mappings one > > reserved and one allocated in the same address space, so that when the > > allocated mapping shrinks or expands, the reserved mapping continues > > to prohibit any other mapping from appearing there? I looked at some > > of the previous emails, but didn't find anything that describes how > > the reserved mapped space is managed. > > I though so, but this turns out to be incorrect. Just have done a small > experiment -- looks like when reserving some space, mapping and > unmapping a small segment from it leaves a non-mapped gap. That would > mean for shrinking the new available space has to be reserved again.

In an offlist chat Thomas Munro mentioned that just ftruncate() would be enough to resize the shared memory without touching address maps using mmap and munmap().

The ftruncate man page seems to concur with him:

    If the effect of ftruncate() is to decrease the size of a memory mapped file or a shared memory object and whole pages beyond the new end were previously mapped, then the whole pages beyond the new end shall be discarded.

    References to discarded pages shall result in the generation of a SIGBUS signal.

    If the effect of ftruncate() is to increase the size of a memory object, it is unspecified whether the contents of any mapped pages between the old end-of-file and the new are flushed to the underlying object.

ftruncate(), when shrinking memory, will release the extra pages and also cause a segmentation fault when memory beyond the size of the file is accessed, even if the actual address map is larger than the mapped file. The expanded memory is allocated as it is written to, and those pages also become visible in the underlying object.

I played with the attached small program under a debugger, observing pmap and /proc/<pid>/status after every memory operation. The address map always shows the full ~300MB length:

00007fffd2200000 307200K rw-s- memfd:mmap_fd_exp (deleted)

Immediately after mmap()
RssShmem: 0 kB
after first memset
RssShmem: 307200 kB
after ftruncate to 100MB (we don't need to wait for memset() to see the effect on RssShmem)
RssShmem: 102400 kB
after ftruncate to 200MB (requires memset to see effect on RssShmem)
RssShmem: 102400 kB
after memsetting up to 200MB
RssShmem: 204800 kB

All the observations concur with the man page.

[1] https://man7.org/linux/man-pages/man3/ftruncate.3p.html#:~:text=If%20the%20effect%20of%20ftruncate,generation%20of%20a%20SIGBUS%20signal.

-- Best Wishes, Ashutosh Bapat
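For reference, the kind of test program described above can be as small as the following. This is a reconstruction under stated assumptions (Linux-only because of memfd_create, error handling omitted), not necessarily the program that was actually attached:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    size_t  full = 300UL * 1024 * 1024;
    int     fd = memfd_create("mmap_fd_exp", MFD_CLOEXEC);
    char   *p;

    ftruncate(fd, full);
    p = mmap(NULL, full, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    memset(p, 1, full);                   /* RssShmem grows to ~300MB */
    ftruncate(fd, 100UL * 1024 * 1024);   /* RssShmem drops immediately */
    ftruncate(fd, 200UL * 1024 * 1024);   /* unchanged until pages are touched */
    memset(p, 1, 200UL * 1024 * 1024);    /* RssShmem grows back to ~200MB */

    pause();                              /* inspect pmap / /proc/<pid>/status */
    return 0;
}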
On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote: TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via changing shared memory mapping layout. Any feedback is appreciated.
Hi Dmitry,
I am sorry that I have not participated in the discussion in this thread from the very beginning, although I am also very interested in dynamic shared buffer resizing and even proposed my own implementation of it: https://github.com/knizhnik/postgres/pull/2 based on memory ballooning and using `madvise`. And it really works (returns unused memory to the system).
This PoC allows me to understand the main drawbacks of this approach:
1. Performance of the Postgres CLOCK page eviction algorithm depends on the number of shared buffers. My first naive attempt to just mark unused buffers as invalid caused a significant degradation of performance for
pgbench -c 32 -j 4 -T 100 -P1 -M prepared -S
(here shared_buffers is the maximal shared buffers size and `available_buffers` is the used part):
| shared_buffers | available_buffers | TPS  |
| -------------- | ----------------- | ---- |
| 128MB          | -1                | 280k |
| 1GB            | -1                | 324k |
| 2GB            | -1                | 358k |
| 32GB           | -1                | 350k |
| 2GB            | 128MB             | 130k |
| 2GB            | 1GB               | 311k |
| 32GB           | 128MB             | 13k  |
| 32GB           | 1GB               | 140k |
| 32GB           | 2GB               | 348k |
My first thought was to replace the clock with an LRU based on a double-linked list. As there is no lockless double-linked list implementation,
it needs some global lock. This lock can become a bottleneck. The standard solution is partitioning: use N LRU lists instead of 1,
just as the partitioned hash table used by the buffer manager to look up buffers does. Actually we can use the same partition locks to protect the LRU lists.
But it is not clear what to do with ring buffers (strategies). So I decided not to perform such a revolution in bufmgr, but to optimize the clock to more efficiently skip reserved buffers.
Just add a skip_count
field to the buffer descriptor. And it helps! Now the worst case shared_buffers/available_buffers = 32GB/128MB
shows the same performance (280k) as shared_buffers=128MB without ballooning.
2. There are several data structures in Postgres whose size depends on the number of buffers.
In my patch I used a dynamic shared buffer size in some cases, but if such a structure has to be allocated in shared memory then the maximal size still has to be used. We have the buffers themselves (8 kB per buffer), then the main BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's CkptBufferIds (20 B), and the hashmap on the buffer cache (24 B + 8 B/entry).
128 bytes per 8 kB buffer is ~1% overhead, which may not seem large, but it becomes quite noticeable with size differences larger than 2 orders of magnitude:
E.g. to support scaling from 0.5GB to 128GB, with 128 bytes/buffer we'd have ~2GiB of static overhead on only 0.5GiB of actual buffers (a back-of-the-envelope check is spelled out below).
3. `madvise` is not portable.
Certainly you have moved much further in your proposal compared with my PoC (including huge pages support).
But it is still not quite clear to me how you are going to solve the problem of large memory overhead in the case of ~100x variation of the shared buffers size.
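A quick back-of-the-envelope check of that ~2GiB figure, using the per-buffer sizes listed in point 2 above:

    128 GiB / 8 KiB per buffer  = 16,777,216 buffers
    16,777,216 buffers * 128 B  = 2 GiB of static metadata overhead,
    while only 0.5 GiB of buffers may actually be in use.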
> On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote: > > In an offlist chat Thomas Munro mentioned that just ftruncate() would > be enough to resize the shared memory without touching address maps > using mmap and munmap(). > > ftruncate man page seems to concur with him > > If the effect of ftruncate() is to decrease the size of a memory > mapped file or a shared memory object and whole pages beyond the > new end were previously mapped, then the whole pages beyond the > new end shall be discarded. > > References to discarded pages shall result in the generation of a > SIGBUS signal. > > If the effect of ftruncate() is to increase the size of a memory > object, it is unspecified whether the contents of any mapped pages > between the old end-of-file and the new are flushed to the > underlying object. > > ftruncate() when shrinking memory will release the extra pages and > also would cause segmentation fault when memory outside the size of > file is accessed even if the actual address map is larger than the > mapped file. The expanded memory is allocated as it is written to, and > those pages also become visible in the underlying object. Thanks for sharing. I need to do more thorough tests, but after a quick look I'm not sure about that. ftruncate will take care about the memory, but AFAICT the memory mapping will stay the same, is that what you mean? In that case if the segment got increased, the memory still can't be used because it's beyond the mapping end (at least in my test that's what happened). If the segment got shrinked, the memory couldn't be reclaimed, because, well, there is already a mapping. Or do I miss something? > > > I might have not noticed it, but are we putting two mappings one > > > reserved and one allocated in the same address space, so that when the > > > allocated mapping shrinks or expands, the reserved mapping continues > > > to prohibit any other mapping from appearing there? I looked at some > > > of the previous emails, but didn't find anything that describes how > > > the reserved mapped space is managed. > > > > I though so, but this turns out to be incorrect. Just have done a small > > experiment -- looks like when reserving some space, mapping and > > unmapping a small segment from it leaves a non-mapped gap. That would > > mean for shrinking the new available space has to be reserved again. > > Right. That's what I thought. But I didn't see the corresponding code. > So we have to keep track of two mappings for every segment - 1 for > allocation and one for reserving space and resize those two while > shrinking and expanding buffers. Am I correct? Not necessarily, depending on what we want. Again, I'll do a bit more testing, but after a quick check it seems that it's possible to "plug" the gap with a new reservation mapping, then reallocate it to another mapping or unmap both reservations (main and the "gap" one) at once. That would mean that for the current functionality we don't need to track reservation in any way more than just start and the end of the "main" reserved space. The only consequence I can imagine is possible fragmentation of the reserved space in case of frequent increase/decrease of a segment with even decreasing size. But since it's only reserved space, which will not really be used, it's probably not going to be a problem.
> On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote: > > 1. Performance of Postgres CLOCK page eviction algorithm depends on number > of shared buffers. My first native attempt just to mark unused buffers as > invalid cause significant degrade of performance Thanks for sharing! Right, but it concerns the case when the number of shared buffers is high, independently from whether it was changed online or with a restart, correct? In that case it's out of scope for this patch. > 2. There are several data structures i Postgres which size depends on number > of buffers. > In my patch I used in some cases dynamic shared buffer size, but if this > structure has to be allocated in shared memory then still maximal size has > to be used. We have the buffers themselves (8 kB per buffer), then the main > BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's > CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry). > 128 bytes per 8kb bytes seems to large overhead (~1%) but but it may be > quote noticeable with size differences larger than 2 orders of magnitude: > E.g. to support scaling to from 0.5Gb to 128GB , with 128 bytes/buffer we'd > have ~2GiB of static overhead on only 0.5GiB of actual buffers. Not sure what do you mean by using a maximal size, can you elaborate. In the current patch those structures are allocated as before, except each goes into a separate segment -- without any extra memory overhead as far as I see. > 3. `madvise` is not portable. The current implementation doesn't rely on madvise so far (it might for shared memory shrinking), but yeah there are plenty of other not very portable things (MAP_FIXED, memfd_create). All of that is mentioned in the corresponding patches as a limitation.
I also have a related question about how ftruncate() is used in the patch.
In my testing I also see that when using ftruncate to shrink a shared segment, the memory is freed immediately after the call, even if other processes still have that memory mapped, and they will hit SIGBUS if they try to access that memory again as the manpage says.
So am I correct to think that, to support the bufferpool shrinking case, it would not be safe to call ftruncate in AnonymousShmemResize as-is, since at that point other processes may still be using pages that belong to the truncated memory?
It appears that for shrinking we should only call ftruncate when we're sure no process will access those pages again (eg, all processes have handled the resize interrupt signal barrier). I suppose this can be done by the resize coordinator after synchronizing with all the other processes.
But in that case it seems we cannot use the postmaster as the coordinator then? b/c I see some code comments saying the postmaster does not have waiting infrastructure... (maybe even if the postmaster has waiting infra we don't want to use it anyway since it can be blocked for a long time and won't be able to serve other requests).
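If so, a minimal sketch of that ordering on whichever process acts as coordinator would be the following (assumptions: the barrier type name used in the patch, PROCSIGNAL_BARRIER_SHMEM_RESIZE, and a per-segment file descriptor segment_fd; this is not the patch's actual code):

    uint64  generation;

    /* ask everyone to stop using the to-be-truncated buffer range */
    generation = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);

    /* only proceed once every backend has absorbed the barrier */
    WaitForProcSignalBarrier(generation);

    /* now nobody can touch the discarded pages, so shrinking is safe */
    if (ftruncate(segment_fd, new_size) < 0)
        elog(ERROR, "could not shrink shared memory segment: %m");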
Regards,
Jack Ng
On 18/04/2025 12:26 am, Dmitry Dolgov wrote: >> On Thu, Apr 17, 2025 at 02:21:07PM GMT, Konstantin Knizhnik wrote: >> >> 1. Performance of Postgres CLOCK page eviction algorithm depends on number >> of shared buffers. My first native attempt just to mark unused buffers as >> invalid cause significant degrade of performance > Thanks for sharing! > > Right, but it concerns the case when the number of shared buffers is > high, independently from whether it was changed online or with a > restart, correct? In that case it's out of scope for this patch. > >> 2. There are several data structures i Postgres which size depends on number >> of buffers. >> In my patch I used in some cases dynamic shared buffer size, but if this >> structure has to be allocated in shared memory then still maximal size has >> to be used. We have the buffers themselves (8 kB per buffer), then the main >> BufferDescriptors array (64 B), the BufferIOCVArray (16 B), checkpoint's >> CkptBufferIds (20 B), and the hashmap on the buffer cache (24B+8B/entry). >> 128 bytes per 8kb bytes seems to large overhead (~1%) but but it may be >> quote noticeable with size differences larger than 2 orders of magnitude: >> E.g. to support scaling to from 0.5Gb to 128GB , with 128 bytes/buffer we'd >> have ~2GiB of static overhead on only 0.5GiB of actual buffers. > Not sure what do you mean by using a maximal size, can you elaborate. > > In the current patch those structures are allocated as before, except > each goes into a separate segment -- without any extra memory overhead > as far as I see.

Thank you for the explanation. I am sorry that I have not precisely investigated your patch before writing: it seemed to me that you were placing only the content of shared buffers in a separate segment. Now I see that I was wrong, and this is actually the main difference from the memory ballooning approach I have used. As you are allocating buffer descriptors and the hash table in the same segment, there is no extra memory overhead.

The only drawback is that we are losing the content of shared buffers in case of a resize. It may be sad, but it doesn't look like there is a better alternative.

But there are still some dependencies on the shared buffers size which are not addressed in this PR. I am not sure how critical they are and whether it is possible to do something here, but at least I want to enumerate them:

1. Checkpointer: the maximal number of checkpointer requests depends on NBuffers. So if we start with small shared buffers and then upscale, it may cause too frequent checkpoints:

Size CheckpointerShmemSize(void) ... size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
CheckpointerShmemInit(void) CheckpointerShmem->max_requests = NBuffers;

2. XLOG: the number of xlog buffers is calculated depending on the number of shared buffers:

XLOGChooseNumBuffers(void) { ... xbuffers = NBuffers / 32;

Should not cause errors, but may not be so efficient if once again we start with tiny shared buffers.

3. AIO: AIO max concurrency is also calculated based on the number of shared buffers:

AioChooseMaxConcurrency(void) { ... max_proportional_pins = NBuffers / max_backends;

For small shared buffers (e.g. 1MB) there will be no concurrency at all.

So none of these issues can cause an error, just some inefficient behavior. But if we want to start with very small shared buffers and then increase them on demand, then it can be a problem.
In all these three cases NBuffers is used not just to calculate some threshold value, but also to determine the size of a structure in shared memory. The straightforward solution is to place them in the same segment as shared buffers, but I am not sure how difficult that will be to implement.
On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > On Thu, Apr 17, 2025 at 03:22:28PM GMT, Ashutosh Bapat wrote: > > > > In an offlist chat Thomas Munro mentioned that just ftruncate() would > > be enough to resize the shared memory without touching address maps > > using mmap and munmap(). > > > > ftruncate man page seems to concur with him > > > > If the effect of ftruncate() is to decrease the size of a memory > > mapped file or a shared memory object and whole pages beyond the > > new end were previously mapped, then the whole pages beyond the > > new end shall be discarded. > > > > References to discarded pages shall result in the generation of a > > SIGBUS signal. > > > > If the effect of ftruncate() is to increase the size of a memory > > object, it is unspecified whether the contents of any mapped pages > > between the old end-of-file and the new are flushed to the > > underlying object. > > > > ftruncate() when shrinking memory will release the extra pages and > > also would cause segmentation fault when memory outside the size of > > file is accessed even if the actual address map is larger than the > > mapped file. The expanded memory is allocated as it is written to, and > > those pages also become visible in the underlying object. > > Thanks for sharing. I need to do more thorough tests, but after a quick > look I'm not sure about that. ftruncate will take care about the memory, > but AFAICT the memory mapping will stay the same, is that what you mean? > In that case if the segment got increased, the memory still can't be > used because it's beyond the mapping end (at least in my test that's > what happened). If the segment got shrinked, the memory couldn't be > reclaimed, because, well, there is already a mapping. Or do I miss > something? I was imagining that you might map some maximum possible size at the beginning to reserve the address space permanently, and then adjust the virtual memory object's size with ftruncate as required to provide backing. Doesn't that achieve the goal with fewer steps, using only portable* POSIX stuff, and keeping all pointers stable? I understand that pointer stability may not be required (I can see roughly how that argument is constructed), but isn't it still better to avoid having to prove that and deal with various other problems completely? Is there a downside/cost to having a large mapping that is only partially backed? I suppose choosing that number might offend you but at least there is an obvious upper bound: physical memory size. *You might also want to use fallocate after ftruncate on Linux to avoid SIGBUS on allocation failure on first touch page fault, which raises portability questions since it's unspecified whether you can do that with shm fds and fails on some systems, but it let's call that an independent topic as it's not affected by this choice.
Hi,

On April 18, 2025 11:17:21 AM GMT+02:00, Thomas Munro <thomas.munro@gmail.com> wrote: > Doesn't that achieve the goal with fewer steps, using only > portable* POSIX stuff, and keeping all pointers stable? I understand > that pointer stability may not be required (I can see roughly how that > argument is constructed), but isn't it still better to avoid having to > prove that and deal with various other problems completely?

I think we should flat out reject any approach that does not maintain pointer stability. It would restrict future optimizations a lot if we can't rely on that (e.g. not materializing tuples when transporting them from worker to leader; pointer-based data structures in shared buffers).

Greetings,

Andres

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Apr 18, 2025 at 9:17 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Apr 18, 2025 at 7:25 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > Thanks for sharing. I need to do more thorough tests, but after a quick > > look I'm not sure about that. ftruncate will take care about the memory, > > but AFAICT the memory mapping will stay the same, is that what you mean? > > In that case if the segment got increased, the memory still can't be > > used because it's beyond the mapping end (at least in my test that's > > what happened). If the segment got shrinked, the memory couldn't be > > reclaimed, because, well, there is already a mapping. Or do I miss > > something? > > I was imagining that you might map some maximum possible size at the > beginning to reserve the address space permanently, and then adjust > the virtual memory object's size with ftruncate as required to provide > backing. Doesn't that achieve the goal with fewer steps, using only > portable* POSIX stuff, and keeping all pointers stable? I understand > that pointer stability may not be required (I can see roughly how that > argument is constructed), but isn't it still better to avoid having to > prove that and deal with various other problems completely? Is there > a downside/cost to having a large mapping that is only partially > backed? I suppose choosing that number might offend you but at least > there is an obvious upper bound: physical memory size.

TIL that mmap(size, fd) will actually extend a hugetlb memfd as a side effect on Linux, as if you had called ftruncate on it (fully allocated huge pages I expected up to the object's size, just not magical size changes beyond that when I merely asked to map it). That doesn't happen for regular page size, or for any page size on my local OS's shm objects, and doesn't seem to fit mmap's job description given an fd*, but maybe I'm just confused. Anyway, a workaround seems to be to start out with PROT_NONE and MAP_NORESERVE, then mprotect(PROT_READ | PROT_WRITE) new regions after extending with ftruncate(), at least in simple tests...

(*Hmm, wild uninformed speculation: perhaps the size-setting behaviour needed when hugetlbfs is used secretly to implement MAP_ANONYMOUS is being exposed also when a hugetlbfs fd is given explicitly to mmap, generating this bizarro side effect?)
> On Fri, Apr 18, 2025 at 09:17:21PM GMT, Thomas Munro wrote: > I was imagining that you might map some maximum possible size at the > beginning to reserve the address space permanently, and then adjust > the virtual memory object's size with ftruncate as required to provide > backing. Doesn't that achieve the goal with fewer steps, using only > portable* POSIX stuff, and keeping all pointers stable?

Ah, I see what you folks mean. So in the latest patch there is a single large shared memory area reserved with PROT_NONE + MAP_NORESERVE. This area is logically divided between shmem segments, and each segment is mmap'd out of it and could be resized within these logical boundaries. Now the suggestion is to have one reserved area for each segment, and instead of really mmap'ing something out of it, manage memory via ftruncate.

Yeah, that would work and will allow us to avoid MAP_FIXED and mremap, which are questionable from a portability point of view. This leaves memfd_create, and I'm still not completely clear on its portability -- it seems to be specific to Linux, but others provide compatible implementations as well. Let me experiment with this idea a bit, I would like to make sure there are no other limitations we might face.

> I understand that pointer stability may not be required

Just to clarify, the current patch maintains this property (stable pointers), which I also see as mandatory for any possible implementation.

> *You might also want to use fallocate after ftruncate on Linux to > avoid SIGBUS on allocation failure on first touch page fault, which > raises portability questions since it's unspecified whether you can do > that with shm fds and fails on some systems, but it let's call that an > independent topic as it's not affected by this choice.

I'm afraid it would be strictly necessary to do fallocate, otherwise we're back where we were before reservation accounting for huge pages in Linux (lots of people were facing unexpected SIGBUS when dealing with cgroups).

> TIL that mmap(size, fd) will actually extend a hugetlb memfd as a side > effect on Linux, as if you had called ftruncate on it (fully allocated > huge pages I expected up to the object's size, just not magical size > changes beyond that when I merely asked to map it). That doesn't > happen for regular page size, or for any page size on my local OS's > shm objects and doesn't seem to fit mmap's job description given an > fd*, but maybe I'm just confused. Anyway, a workaround seems to be > to start out with PROT_NONE and MAP_NORESERVE, then mprotect(PROT_READ > | PROT_WRITE) new regions after extending with ftruncate(), at least > in simple tests...

Right, it's similar to the currently implemented space reservation, which also goes with PROT_NONE and MAP_NORESERVE. I assume it boils down to the way memory reservation accounting in Linux works.
> On Thu, Apr 17, 2025 at 07:05:36PM GMT, Ni Ku wrote: > I also have a related question about how ftruncate() is used in the patch. > In my testing I also see that when using ftruncate to shrink a shared > segment, the memory is freed immediately after the call, even if other > processes still have that memory mapped, and they will hit SIGBUS if they > try to access that memory again as the manpage says. > > So am I correct to think that, to support the bufferpool shrinking case, it > would not be safe to call ftruncate in AnonymousShmemResize as-is, since at > that point other processes may still be using pages that belong to the > truncated memory? > It appears that for shrinking we should only call ftruncate when we're sure > no process will access those pages again (eg, all processes have handled > the resize interrupt signal barrier). I suppose this can be done by the > resize coordinator after synchronizing with all the other processes. > But in that case it seems we cannot use the postmaster as the coordinator > then? b/c I see some code comments saying the postmaster does not have > waiting infrastructure... (maybe even if the postmaster has waiting infra > we don't want to use it anyway since it can be blocked for a long time and > won't be able to serve other requests). There is already a coordination infrastructure, implemented in the patch 0006, which will take care of this and prevent access to the shared memory until everything is resized.
> On Fri, Apr 18, 2025 at 10:06:23AM GMT, Konstantin Knizhnik wrote: > The only drawback is that we are loosing content of shared buffers in case > of resize. It may be sadly, but not looks like there is no better > alternative.

No, why would we lose the content? If we do mremap, it will leave the content as it is. If we do munmap/mmap with an anonymous backing file, it will also keep the content in memory. The same with another proposal about using ftruncate/fallocate only, both will leave the content untouched unless told to do otherwise.

> But there are still some dependencies on shared buffers size which are not > addressed in this PR. > I am not sure how critical they are and is it possible to do something here, > but at least I want to enumerate them:

Right, I'm aware of those (except the AIO one, which was added after the first version of the patch), and didn't address them yet for the same reason you've mentioned -- they're not hard errors, rather inefficiencies. But thanks for the reminder, I keep those in the back of my mind, and when the rest of the design is settled down, I'll try to address them as well.
On Mon, Apr 21, 2025 at 9:30 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Yeah, that would work and will allow to avoid MAP_FIXED and mremap, which are > questionable from portability point of view. This leaves memfd_create, and I'm > still not completely clear on it's portability -- it seems to be specific to > Linux, but others provide compatible implementation as well.

Something like this should work, roughly based on DSM code except here we don't really need the name so we unlink it immediately, at the slight risk of leaking it if the postmaster is killed between those lines (maybe someone should go and tell POSIX to support the special name SHM_ANON or some other way to avoid that; I can't see any portable workaround). Not tested/compiled, just a sketch:

#ifdef HAVE_MEMFD_CREATE
    /* Anonymous shared memory region. */
    fd = memfd_create("foo", MFD_CLOEXEC | huge_pages_flags);
#else
    /* Standard POSIX insists on a name, which we unlink immediately. */
    do
    {
        char    tmp[80];

        snprintf(tmp, sizeof(tmp), "/PostgreSQL.%u",
                 pg_prng_uint32(&pg_global_prng_state));
        fd = shm_open(tmp, O_CREAT | O_EXCL | O_RDWR, 0600);
        if (fd >= 0)
            shm_unlink(tmp);
    } while (fd < 0 && errno == EEXIST);
#endif

> Let me experiment with this idea a bit, I would like to make sure there are no > other limitations we might face.

One thing I'm still wondering about is whether you really need all this multi-phase barrier stuff, or even need to stop other backends from running at all while doing the resize. I guess that's related to your remapping scheme, but supposing you find the simple ftruncate()-only approach to be good, my next question is: why isn't it enough to wait for all backends to agree to stop allocating new buffers in the range to be truncated, and then let them continue to run as normal? As far as they would be concerned, the in-progress downsize has already happened, though it could be reverted later if the eviction phase fails. Then the coordinator could start evicting buffers and truncating the shared memory object, which are phases/steps, sure, but it's not clear to me why they need other backends' help.

It sounds like Windows might need a second ProcSignalBarrier poke in order to call VirtualUnlock() in every backend. That's based on that Usenet discussion I lobbed in here the other day; I haven't tried it myself or fully grokked why it works, and there could well be other ways, IDK. Assuming it's the right approach, between the first poke to make all backends accept the new lower size and the second poke to unlock the memory, I don't see why they need to wait. I suppose it would be the same ProcSignalBarrier, but behave differently based on a control variable. I suppose there could also be a third poke, if you want to consider the operation to be fully complete only once they have all actually done that unlock step, but it may also be OK not to worry about that, IDK. On the other hand, maybe it just feels less risky if you stop the whole world, or maybe you envisage parallelising the eviction work, or there is some correctness concern I haven't grokked yet, but what?

> > *You might also want to use fallocate after ftruncate on Linux to > > avoid SIGBUS on allocation failure on first touch page fault, which > > raises portability questions since it's unspecified whether you can do > > that with shm fds and fails on some systems, but it let's call that an > > independent topic as it's not affected by this choice.
> > I'm afraid it would be strictly neccessary to do fallocate, otherwise we're > back where we were before reservation accounting for huge pages in Linux (lot's > of people were facing unexpected SIGBUS when dealing with cgroups). Yeah. FWIW here is where we decided to gate that on __linux__ while fixing that for DSM: https://www.postgresql.org/message-id/flat/CAEepm%3D0euOKPaYWz0-gFv9xfG%2B8ptAjhFjiQEX0CCJaYN--sDQ%40mail.gmail.com#c81b941d300f04d382472e6414cec1f4
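For what it's worth, that gating could look roughly like the DSM code referenced above (a sketch only, assuming a per-segment fd and new_size; note that posix_fallocate() returns an error number rather than setting errno, and can return EINTR):

    if (ftruncate(fd, new_size) < 0)
        elog(ERROR, "could not resize backing file: %m");

#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__)
    {
        int     rc;

        /* reserve the pages now so a first-touch page fault cannot SIGBUS later */
        do
        {
            rc = posix_fallocate(fd, 0, new_size);
        } while (rc == EINTR);

        if (rc != 0)
            elog(ERROR, "could not allocate backing file: %s", strerror(rc));
    }
#endif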
Thanks Dmitry. Right, the coordination mechanism in v4-0006 works as expected in various tests (sorry, I misunderstood some details initially).

I also want to report a couple of minor issues found during testing (which you may be aware of already):

1. For memory segments other than the first one ('main'), the start address passed to mmap may not be aligned to 4KB or the huge page size (since reserved_offset may not be aligned) and cause mmap to fail.

2. Since the ratios for main/desc/iocv/checkpt/strategy in SHMEM_RESIZE_RATIO are relatively small, I think we need to guard against the case where 'max_available_memory' is too small for the required sizes of these segments (from CalculateShmemSize). Like when max_available_memory=default and shared_buffers=128kB, 'main' still needs ~109MB, but since only 10% of max_available_memory is reserved for it (~102MB) and the start address of the next segment is calculated based on reserved_offset, this would cause the mappings to overlap and memory problems later (I hit this after fixing 1.)

I suppose we can change the minimum value of max_available_memory to be large enough, and may also adjust the ratios in SHMEM_RESIZE_RATIO to ensure the reserved space of those segments is sufficient.

Regards, Jack Ng

-----Original Message----- From: Dmitry Dolgov <9erthalion6@gmail.com> Sent: Monday, April 21, 2025 5:33 AM To: Ni Ku <jakkuniku@gmail.com> Cc: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>; pgsql-hackers@postgresql.org; Robert Haas <robertmhaas@gmail.com> Subject: Re: Changing shared_buffers without restart

> On Thu, Apr 17, 2025 at 07:05:36PM GMT, Ni Ku wrote: > I also have a related question about how ftruncate() is used in the patch. > In my testing I also see that when using ftruncate to shrink a shared > segment, the memory is freed immediately after the call, even if other > processes still have that memory mapped, and they will hit SIGBUS if > they try to access that memory again as the manpage says. > > So am I correct to think that, to support the bufferpool shrinking > case, it would not be safe to call ftruncate in AnonymousShmemResize > as-is, since at that point other processes may still be using pages > that belong to the truncated memory? > It appears that for shrinking we should only call ftruncate when we're > sure no process will access those pages again (eg, all processes have > handled the resize interrupt signal barrier). I suppose this can be > done by the resize coordinator after synchronizing with all the other processes. > But in that case it seems we cannot use the postmaster as the > coordinator then? b/c I see some code comments saying the postmaster > does not have waiting infrastructure... (maybe even if the postmaster > has waiting infra we don't want to use it anyway since it can be > blocked for a long time and won't be able to serve other requests).

There is already a coordination infrastructure, implemented in the patch 0006, which will take care of this and prevent access to the shared memory until everything is resized.
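For issue 1, presumably it is enough to round each segment's start up to the mapping granularity before calling mmap, something along these lines (hypothetical, not the patch's code; reserved_offset and huge_pages_on are names used in this thread, and huge_page_size stands for whatever GetHugePageSize() reports):

    Size    align = huge_pages_on ? huge_page_size : 4096;

    /* round the start of the next segment up to the mapping granularity */
    reserved_offset = TYPEALIGN(align, reserved_offset);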
> On Tue, May 06, 2025 at 04:23:07AM GMT, Jack Ng wrote: > Thanks Dmitry. Right, the coordination mechanism in v4-0006 works as expected in various tests (sorry, I misunderstood some details initially).

Great, thanks for checking.

> I also want to report a couple of minor issues found during testing (which you may be aware of already): > > 1. For memory segments other the first one ('main'), the start address passed to mmap may not be aligned to 4KB or huge page size (since reserved_offset may not be aligned) and cause mmap to fail. > > 2. Since the ratio for main/desc/iocv/checkpt/strategy in SHMEM_RESIZE_RATIO are relatively small, I think we need to guard against the case where 'max_available_memory' is too small for the required sizes of these segments (from CalculateShmemSize). > Like when max_available_memory=default and shared_buffers=128kB, 'main' still needs ~109MB, but since only 10% of max_available_memory is reserved for it (~102MB) and start address of the next segment is calculated based on reserved_offset, this would cause the mappings to overlap and memory problems later (I hit this after fixing 1.) > I suppose we can change the minimum value of max_available_memory to be large enough, and may also adjust the ratios in SHMEM_RESIZE_RATIO to ensure the reserved space of those segments are sufficient.

Yeah, good points. I've introduced max_available_memory expecting some heated discussions about it, and thus didn't put a lot of effort into covering all the possible scenarios. But now I'm reworking it along the lines suggested by Thomas, and will address those as well. Thanks!
> all the possible scenarios. But now I'm reworking it along the lines suggested > by Thomas, and will address those as well. Thanks!

Thanks for the info, Dmitry. Just want to confirm my understanding of Thomas' suggestion and your discussions... I think the simpler and more portable solution goes something like the following?

* For each BP resource segment (main, desc, buffers, etc):
1. create an anonymous file as backing
2. mmap a large reserved shared memory area with PROT_READ/WRITE + MAP_NORESERVE using the anon fd
3. use ftruncate to back the in-use region (and maybe posix_fallocate too to avoid SIGBUS on alloc failure during first-touch), but no need to create a memory mapping for it
4. also no need to create a separate mapping for the reserved region (already covered by the mapping created in 2.)

|-- Memory mapping (MAP_NORESERVE) for BUFFER --|
|-- In-use region --|----- Reserved region -----|

* During resize, simply calculate the new size and call ftruncate on each segment to adjust memory accordingly, no need to mmap/munmap or modify any memory mapping.

I tried this approach with a test program (with huge pages), and both expand and shrink seem to work as expected -- for shrink, the memory is freed right after the resize ftruncate.

Regards, Jack Ng
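In code, the scheme enumerated above boils down to something like this (a toy sketch, not the patch: fd is assumed to come from memfd_create()/shm_open() as in Thomas' earlier sketch, the size variables are placeholders, error handling is reduced to elog(), and with huge pages the PROT_NONE + mprotect() dance Thomas described may be needed instead of mapping read/write up front):

    size_t  reserve_size = max_size_we_may_grow_to;   /* address space only */
    size_t  cur_size = initial_segment_size;
    char   *base;

    /* steps 2+4: one large mapping covers both the in-use and reserved regions */
    base = mmap(NULL, reserve_size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (base == MAP_FAILED)
        elog(ERROR, "could not reserve address space: %m");

    /* step 3: provide backing for the in-use region only */
    if (ftruncate(fd, cur_size) < 0)
        elog(ERROR, "could not size shared memory segment: %m");

    /* resizing later is just another ftruncate(); base and all pointers stay put */
    if (ftruncate(fd, new_size) < 0)
        elog(ERROR, "could not resize shared memory segment: %m");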
Attachments:
- 0005-Introduce-pss_barrierReceivedGeneration-20250610.patch
- 0004-Introduce-pending-flag-for-GUC-assign-hooks-20250610.patch
- 0001-Allow-to-use-multiple-shared-memory-mapping-20250610.patch
- 0002-Address-space-reservation-for-shared-memory-20250610.patch
- 0003-Introduce-multiple-shmem-segments-for-share-20250610.patch
- 0007-Use-anonymous-files-to-back-shared-memory-s-20250610.patch
- 0008-Support-resize-for-hugetlb-20250610.patch
- 0006-Allow-to-resize-shared-memory-without-resta-20250610.patch
- 0009-Support-shrinking-shared-buffers-20250610.patch
- 0010-Reinitialize-StrategyControl-after-resizing-20250610.patch
- 0012-Fix-compilation-failure-in-pg_get_shmem_pag-20250610.patch
- 0011-Fix-compilation-failure-in-pg_get_shmem_all-20250610.patch
> On Fri, Jun 20, 2025 at 12:19:31PM +0200, Dmitry Dolgov wrote: > Thanks! I've reworked the series to implement approach suggested by > Thomas, and applied your patches to support buffers shrinking on top. I > had to restructure the patch set, here is how it looks like right now: The base-commit was left in the cover letter which I didn't post, so for posterity: base-commit: 4464fddf7b50abe3dbb462f76fd925e10eedad1c
Hi Dmitry, Thanks for sharing the patches. On Fri, Jun 20, 2025 at 3:49 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > 3. Shared memory shrinking > > So far only shared memory increase was implemented. These patches from Ashutosh > support shrinking as well, which is tricky due to the need for buffer eviction. > > * Support shrinking shared buffers > * Reinitialize StrategyControl after resizing buffers This applies to both shrinking and expansion of shared buffers. When expanding we need to add the new buffers to the freelist by changing next pointer of last buffer in the free list to point to the first new buffer. > * Additional validation for buffer in the ring > > > 0009 adds support to shrink shared buffers. It has two changes: a. > > evict the buffers outside the new buffer size b. remove buffers with > > buffer id outside the new buffer size from the free list. If a buffer > > being evicted is pinned, the operation is aborted and a FATAL error is > > raised. I think we need to change this behaviour to be less severe > > like rolling back the operation or waiting for the pinned buffer to be > > unpinned etc. Better even if we could let users control the behaviour. > > But we need better infrastructure to do such things. That's one TODO > > left in the patch. > > I haven't reviewed those, just tested a bit to finally include into the series. > Note that I had to tweak two things: > > * The way it was originally implemented was sending resize signal to postmaster > before doing eviction, which could result in sigbus when accessing LSN of a > dirty buffer to be evicted. I've reshaped it a bit to make sure eviction always > happens first. Will take a look at this. > > * It seems the CurrentResource owner could be missing sometimes, so I've added > a band-aid checking its presence. > > One side note, during my testing I've noticed assert failures on > pgstat_tracks_io_op inside a wal writer a few times. I couldn't reproduce it > after the fixes above, but still it may indicate that something is off. E.g. > it's somehow not expected that the wal writer will do buffer eviction IO (from > what I understand, the current shrinking implementation allows that). Yes. I think, we have to find a better way to choose a backend which does the actual work. Eviction can be done in that backend itself. Compiler gives warning about an uninitialized variable, which seems to be a real bug. Fix attached. -- Best Wishes, Ashutosh Bapat
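The expansion side of that freelist adjustment could look roughly like this (a sketch only, assuming the existing freelist fields in freelist.c/buf_internals.h -- firstFreeBuffer, lastFreeBuffer, freeNext -- with the buffer_strategy_lock acquisition omitted; this is not code from the patches):

    /* chain the newly added buffers together */
    for (int i = old_nbuffers; i < new_nbuffers; i++)
    {
        BufferDesc *buf = GetBufferDescriptor(i);

        buf->freeNext = (i + 1 < new_nbuffers) ? i + 1 : FREENEXT_END_OF_LIST;
    }

    /* point the tail of the existing freelist at the first new buffer */
    if (StrategyControl->lastFreeBuffer >= 0)
        GetBufferDescriptor(StrategyControl->lastFreeBuffer)->freeNext = old_nbuffers;
    else
        StrategyControl->firstFreeBuffer = old_nbuffers;
    StrategyControl->lastFreeBuffer = new_nbuffers - 1;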
> On Fri, Jul 04, 2025 at 02:06:16AM +0200, Tomas Vondra wrote: > I took a look at this patch, because it's somewhat related to the NUMA > patch series I posted a couple days ago, and I've been wondering if > it makes some of the NUMA stuff harder or simpler. Thanks a lot for the review! It's a plenty of feedback, and I'll probably take time to answer all of it, but I still want to address couple of most important topics quickly. > But I'm getting a bit lost in how exactly this interacts with things > like overcommit, system memory accounting / OOM killer and this sort of > stuff. I went through the thread and it seems to me the reserve+map > approach works OK in this regard (and the messages on linux-mm seem to > confirm this). But this information is scattered over many messages and > it's hard to say for sure, because some of this might be relevant for > an earlier approach, or a subtly different variant of it. > > A similar question is portability. The comments and commit messages > seem to suggest most of this is linux-specific, and other platforms just > don't have these capabilities. But there's a bunch of messages (mostly > by Thomas Munro) that hint FreeBSD might be capable of this too, even if > to some limited extent. And possibly even Windows/EXEC_BACKEND, although > that seems much trickier. > > [...] > > So I think it'd be very helpful to write a README, explaining the > currnent design/approach, and summarizing all these aspects in a single > place. Including things like portability, interaction with the OS > accounting, OOM killer, this kind of stuff. Some of this stuff may be > already mentioned in code comments, but you it's hard to find those. > > Especially worth documenting are the states the processes need to go > through (using the barriers), and the transacitons between them (i.e. > what is allowed in each phase, what blocks can be visible, etc.). Agree, I'll add some comprehensive readme in the next version. Note, that on the topic of portability the latest version implements a new approach suggested by Thomas Munro, which reduces problematic parts to memfd_create only, which is mentioned as Linux specific in the documentation, but AFAICT has FreeBSD counterparts. > 1) no user docs > > There are no user .sgml docs, and maybe it's time to write some, > explaining how to use this thing - how to configure it, how to trigger > the resizing, etc. It took me a while to realize I need to do ALTER > SYSTEM + pg_reload_conf() to kick this off. > > It should also document the user-visible limitations, e.g. what activity > is blocked during the resizing, etc. While the user interface is still under discussion, I agree, it makes sense to capture this information in sgml docs. > 2) pending GUC changes > > [...] > > It also seems a bit strange that the "switch" gets to be be driven by a > randomly selected backend (unless I'm misunderstanding this bit). It > seems to be true for the buffer eviction during shrinking, at least. The resize itself is coordinated by the postmaster alone, not by a randomly selected backend. But looks like buffer eviction indeed can happen anywhere, which is what we were discussing in the previous messages. > Perhaps this should be a separate utility command, or maybe even just > a new ALTER SYSTEM variant? Or even just a function, similar to what > the "online checksums" patch did, possibly combined with a bgworder > (but probably not needed, there are no db-specific tasks to do). 
This is one topic we still actively discuss, but haven't had much feedback otherwise. The pros and cons seem to be clear:

* Utilizing the existing GUC mechanism would allow treating shared_buffers as any other configuration, meaning that potential users of this feature don't have to do anything new to use it -- they can still use whatever method they prefer to apply new configuration (pg_reload_conf, pg_ctl reload, maybe even sending SIGHUP directly). I'm also wondering if it's only shared_buffers, or whether some other options could use a similar approach.

* Having a separate utility command is a mighty simplification, which helps avoid the problems you've described above.

So far we've got two against one in favour of a simple utility command, so we might as well go with that.

> 3) max_available_memory > > Speaking of GUCs, I dislike how max_available_memory works. It seems a > bit backwards to me. I mean, we're specifying shared_buffers (and some > other parameters), and the system calculates the amount of shared memory > needed. But the limit determines the total limit?

The reason it's so backwards is that it's coming from the need to specify how much memory we would like to reserve, and what would be the upper boundary for increasing shared_buffers. My intention is eventually to get rid of this GUC and figure out its value at runtime as a function of the total available memory.

> I think the GUC should specify the maximum shared_buffers we want to > allow, and then we'd work out the total to pre-allocate? Considering > we're only allowing to resize shared_buffers, that should be pretty > trivial. Yes, it might happen that the "total limit" happens to exceed > the available memory or something, but we already have the problem > with shared_buffers. Seems fine if we explain this in the docs, and > perhaps print the calculated memory limit on start.

Somehow I'm not following what you suggest here. You mean having the maximum shared_buffers specified, but not as a separate GUC?

> 4) SHMEM_RESIZE_RATIO > > The SHMEM_RESIZE_RATIO thing seems a bit strange too. There's no way > these ratios can make sense. For example, BLCKSZ is 8192 but the buffer > descriptor is 64B. That's 128x difference, but the ratios says 0.6 and > 0.1, so 6x. Sure, we'll actually allocate only the memory we need, and > the rest is only "reserved".

SHMEM_RESIZE_RATIO is a temporary hack, waiting for a more decent solution, nothing more. I probably have to mention that in the comments.

> Moreover, all of the above is for mappings sized based on NBuffers. But > if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the > moment someone increases of max_connection, max_locks_per_transaction > and possibly some other stuff?

Can you elaborate on what you mean by that? Increasing max_connections etc. leads to increased memory consumption in the MAIN_SHMEM_SEGMENT, but the ratio is for memory reservation only.
Keep in mind that the scope of this work turned out to be huge, and with just two people on board we're addressing one thing at a time.

> But even just a simple TAP test that does a bunch of (random?) resizes > while running a pgbench seem better than no tests. (That's what I did > manually, and it crashed right away.)

This is the type of testing I was doing before posting the series. I assume you've crashed it on buffer shrinking, since you've got SIGBUS, which would indicate that the memory is not available anymore. Before we go into debugging, just to be on the safe side I would like to make sure you were testing the latest patch version (there are some signs that it's not the case, about that later)?

> 10) what to do about stuck resize? > > AFAICS the resize can get stuck for various reasons, e.g. because it > can't evict pinned buffers, possibly indefinitely. Not great, it's not > clear to me if there's a way out (canceling the resize) after a timeout, > or something like that? Not great to start an "online resize" only to > get stuck with all activity blocked for indefinite amount of time, and > get to restart anyway. > > Seems related to Thomas' message [2], but AFAICS the patch does not do > anything about this yet, right? What's the plan here?

It's another open discussion right now, with an idea to eventually allow canceling after a timeout. I think canceling when stuck on buffer eviction should be pretty straightforward (the eviction must take place before the actual shared memory resize, so we know nothing has changed yet), but in some other failure scenarios it would be harder (e.g. if one backend is stuck resizing, while others have succeeded -- this would require another round of synchronization and some way to figure out what is the current status).

> 11) preparatory actions? > > Even if it doesn't get stuck, some of the actions can take a while, like > evicting dirty buffers before shrinking, etc. This is similar to what > happens on restart, when the shutdown checkpoint can take a while, while > the system is (partly) unavailable. > > The common mitigation is to do an explicit checkpoint right before the > restart, to make the shutdown checkpoint cheap. Could we do something > similar for the shrinking, e.g. flush buffers from the part to be > removed before actually starting the resize?

Yeah, that's a good idea, we will try to explore it.

> 12) does this affect e.g. fork() costs? > > I wonder if this affects the cost of fork() in some undesirable way? > Could it make fork() more measurably more expensive?

The number of new mappings is quite limited, so I would not expect that. But I can measure the impact.

> 14) interesting messages from the thread > > While reading through the thread, I noticed a couple messages that I > think are still relevant:

Right, I'm aware there is a lot of not yet addressed feedback, even more than you've mentioned below. None of this feedback was ignored, we're just solving large problems step by step. So far the focus was on how to do memory reservation and how to coordinate the resize, and everybody is more than welcome to join. But thanks for collecting the list, I probably need to start tracking what was addressed and what was not.

> - Robert asked [5] if Linux might abruptly break this, but I find that > unlikely. We'd point out we rely on this, and they'd likely rethink. > This would be made safer if this was specified by POSIX - taking that > away once implemented seems way harder than for custom extensions.
> It's likely they'd not take away the feature without an alternative > way to achieve the same effect, I think (yes, harder to maintain). > Tom suggests [7] this is not in POSIX. This conversation was related to the original implementation, which was based on mremap and slicing of mappings. As I've mentioned, the new approach doesn't have most of those controversial points, it uses memfd_create and regular compatible mmap -- I don't see any of those changing their behavior any time soon. > - Andres had an interesting comment about how overcommit interacts with > MAP_NORESERVE. AFAIK it means we need the flag to not break overcommit > accounting. There's also some comments about from linux-mm people [9]. The new implementation uses MAP_NORESERVE for the mapping. > - There seem to be some issues with releasing memory backing a mapping > with hugetlb [10]. With the fd (and truncating the file), this seems > to release the memory, but it's linux-specific? But most of this stuff > is specific to linux, it seems. So is this a problem? With this it > should be working even for hugetlb ... Again, the new implementation got rid of problematic bits here, and I haven't found any weak points related to hugetlb in testing so far. > - It seems FreeBSD has MFD_HUGETLB [11], so maybe we could use this and > make the hugetlb stuff work just like on Linux? Unclear. Also, I > thought the mfd stuff is linux-specific ... or am I confused? Yep, probably. > - Thomas asked [13] why we need to stop all the backends, instead of > just waiting for them to acknowledge the new (smaller) NBuffers value > and then let them continue. I also don't quite see why this should > not work, and it'd limit the disruption when we have to wait for > eviction of buffers pinned by paused cursors, etc. I think I've replied to that one, the idea so far was to eliminate any chance of accessing to-be-truncated buffers and make it easier to reason about correctness of the implementation this way. I don't see any other way how to prevent backends from accessing buffers that may disappear without adding overhead on the read path, but if you folks have some ideas -- please share! > v5-0001-Process-config-reload-in-AIO-workers.patch > > 1) Hmmm, so which other workers may need such explicit handling? Do all > other processes participate in procsignal stuff, or does anything > need an explicit handling? So far I've noticed the issue only with io_workers and the checkpointer. > v5-0003-Introduce-pss_barrierReceivedGeneration.patch > > 1) Do we actually need this? Isn't it enough to just have two barriers? > Or a barrier + condition variable, or something like that. The issue with two barriers is that they do not prevent disjoint groups, i.e. one backend joins the barrier, finishes the work and detaches from the barrier, then another backends joins. I'm not familiar with how this was solved for online checkums patch though, will take a look. Having a barrier and a condition variable would be possible, but it's hard to figure out for how many backends to wait. All in all, a small extention to the ProcSignalBarrier feels to me much more elegant. > 2) The comment talks about "coordinated way" when processing messages, > but it's not very clear to me. It should explain what is needed and > not possible with the current barrier code. Yeah, I need to work on the commentaries across the patch. Here in particular it means any coordinated way, whatever that could be. I can add an example to clarify that part. 
> v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch Most of the commentaries here and in the following patches are obviously reasonable and I'll incorporate them into the next version. > 5) I'm a bit confused about the segment/mapping difference. The patch > seems to randomly mix those, or maybe I'm just confused. I mean, > we are creating just shmem segment, and the pieces are mappings, > right? So why do we index them by "shmem_segment"? Indeed, the patch uses "segment" and "mapping" interchangeably, I need to tighten it up. The relation is still one to one, thus are multiple segments as well as mappings. > 7) We should remember which segments got to use huge pages and which > did not. And we should make it optional for each segment. Although, > maybe I'm just confused about the "segment" definition - if we only > have one, that's where huge pages are applied. > > If we could have multiple segments for different segments (whatever > that means), not sure what we'll report for cases when some segments > get to use huge pages and others don't. Exactly to avoid solving this, I've consciously decided to postpone implementing possibility to mix huge and regular pages so far. Any opinions, should a single reported value be removed and this information is instead represented as part of an informational view about shared memory (the one you were suggesting in this thread)? > 11) Actually, what's the difference between the contents of Mappings > and Segments? Isn't that the same thing, indexed in the same way? > Or could it be unified? Or are they conceptually different thing? Unless I'm mixing something badly, the content is the same. The relation is a segment as a structure "contains" a mapping. > v5-0005-Address-space-reservation-for-shared-memory.patch > > 1) Shouldn't reserved_offset and huge_pages_on really be in the segment > info? Or maybe even in mapping info? (again, maybe I'm confused > about what these structs store) I don't think there is reserved_offset variable in the latest version anymore, can you please confirm you use it instead of ther one I've posted in April? > 3) So ReserveAnonymousMemory is what makes decisions about huge pages, > for the whole reserved space / all segments in it. That's a bit > unfortunate with respect to the desirability of some segments > benefiting from huge pages and others not. Maybe we should have two > "reserved" areas, one with huge pages, one without? Again, there is no ReserveAnonymousMemory anymore, the new approach is to reserve the memory via separate mappings. > I guess we don't want too many segments, because that might make > fork() more expensive, etc. Just guessing, though. Also, how would > this work with threading? I assume multithreading will render it unnecessary to use shared memory favoring some other types of memory usage, but the mechanism around it could still be the same. > 5) The general approach seems sound to me, but I'm not expert on this. > I wonder how portable this behavior is. I mean, will it work on other > Unix systems / Windows? Is it POSIX or Linux extension? Don't know yet, it's a topic for investigation. > v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch > > 2) In fact, what happens if the user tries to resize to a value that is > too large for one of the segments? How would the system know before > starting the resize (and failing)? This type of situation is handled (doing hard stop) in the latest version, because all the necessary information is present in the mapping structure. 
> v5-0007-Allow-to-resize-shared-memory-without-restart.patch > > 1) Why would AdjustShmemSize be needed? Isn't that a sign of a bug > somewhere in the resizing? When coordination with barriers kicks in, there is a cut-off line after which any newly spawned backend will not be able to take part in it (e.g. it was too slow to init ProcSignal infrastructure). AdjustShmemSize is used to handle these cases. > 2) Isn't the pg_memory_barrier() in CoordinateShmemResize a bit weird? > Why is it needed, exactly? If it's to flush stuff for processes > consuming EmitProcSignalBarrier, it's that too late? What if a > process consumes the barrier between the emit and memory barrier? I think it's not needed, a leftover from earlier code modifications. > v5-0008-Support-shrinking-shared-buffers.patch > v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch > v5-0010-Additional-validation-for-buffer-in-the-ring.patch This reminds me I still need to review those, so Ashutosh probably can answer those questions better than I.
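Here is the promised sketch of the reserve-and-map mechanics the new implementation relies on. It's just the Linux syscalls, not the patch code; the sizes are arbitrary, and error handling, huge pages and the shrinking path are left out. Reserve address space up front, back part of it with a memfd, and grow by truncating the file and remapping in place:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define RESERVED (1024UL * 1024 * 1024)   /* address space to reserve */
    #define INITIAL  (256UL * 1024 * 1024)    /* initially mapped size */
    #define RESIZED  (512UL * 1024 * 1024)    /* grow to this size */

    int
    main(void)
    {
        /* Reserve address space only: PROT_NONE, nothing is allocated. */
        char *base = mmap(NULL, RESERVED, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED)
            return 1;

        /* Anonymous file backing the real shared memory (Linux, glibc 2.27+). */
        int fd = memfd_create("shared_buffers", MFD_CLOEXEC);
        if (fd < 0 || ftruncate(fd, INITIAL) != 0)
            return 1;

        /* Map the file over the start of the reservation. */
        if (mmap(base, INITIAL, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED | MAP_NORESERVE, fd, 0) == MAP_FAILED)
            return 1;
        base[0] = 1;

        /* Grow: extend the file and widen the mapping in place. The address
         * does not change, so pointers into the region remain valid. */
        if (ftruncate(fd, RESIZED) != 0)
            return 1;
        if (mmap(base, RESIZED, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED | MAP_NORESERVE, fd, 0) == MAP_FAILED)
            return 1;
        base[RESIZED - 1] = 1;

        printf("mapping stays at %p after resize\n", (void *) base);
        return 0;
    }

Shrinking goes roughly the other way around: truncate the file to the smaller size (which is what releases the memory, including in the hugetlb case mentioned above) and put the PROT_NONE reservation back over the tail.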
On 7/4/25 16:41, Dmitry Dolgov wrote: >> On Fri, Jul 04, 2025 at 02:06:16AM +0200, Tomas Vondra wrote: >> I took a look at this patch, because it's somewhat related to the NUMA >> patch series I posted a couple days ago, and I've been wondering if >> it makes some of the NUMA stuff harder or simpler. > > Thanks a lot for the review! It's a plenty of feedback, and I'll > probably take time to answer all of it, but I still want to address > couple of most important topics quickly. > >> But I'm getting a bit lost in how exactly this interacts with things >> like overcommit, system memory accounting / OOM killer and this sort of >> stuff. I went through the thread and it seems to me the reserve+map >> approach works OK in this regard (and the messages on linux-mm seem to >> confirm this). But this information is scattered over many messages and >> it's hard to say for sure, because some of this might be relevant for >> an earlier approach, or a subtly different variant of it. >> >> A similar question is portability. The comments and commit messages >> seem to suggest most of this is linux-specific, and other platforms just >> don't have these capabilities. But there's a bunch of messages (mostly >> by Thomas Munro) that hint FreeBSD might be capable of this too, even if >> to some limited extent. And possibly even Windows/EXEC_BACKEND, although >> that seems much trickier. >> >> [...] >> >> So I think it'd be very helpful to write a README, explaining the >> currnent design/approach, and summarizing all these aspects in a single >> place. Including things like portability, interaction with the OS >> accounting, OOM killer, this kind of stuff. Some of this stuff may be >> already mentioned in code comments, but you it's hard to find those. >> >> Especially worth documenting are the states the processes need to go >> through (using the barriers), and the transacitons between them (i.e. >> what is allowed in each phase, what blocks can be visible, etc.). > > Agree, I'll add some comprehensive readme in the next version. Note, > that on the topic of portability the latest version implements a new > approach suggested by Thomas Munro, which reduces problematic parts to > memfd_create only, which is mentioned as Linux specific in the > documentation, but AFAICT has FreeBSD counterparts. > OK. It's not entirely clear to me if this README should be temporary, or if it should eventually get committed. I'd probably vote to have a proper README explaining the basic design / resizing processes etc. It probably should not discuss portability in too much detail, that can get stale pretty quick. >> 1) no user docs >> >> There are no user .sgml docs, and maybe it's time to write some, >> explaining how to use this thing - how to configure it, how to trigger >> the resizing, etc. It took me a while to realize I need to do ALTER >> SYSTEM + pg_reload_conf() to kick this off. >> >> It should also document the user-visible limitations, e.g. what activity >> is blocked during the resizing, etc. > > While the user interface is still under discussion, I agree, it makes > sense to capture this information in sgml docs. > Yeah. Spelling out the "official" way to use something is helpful. >> 2) pending GUC changes >> >> [...] >> >> It also seems a bit strange that the "switch" gets to be be driven by a >> randomly selected backend (unless I'm misunderstanding this bit). It >> seems to be true for the buffer eviction during shrinking, at least. 
> > The resize itself is coordinated by the postmaster alone, not by a > randomly selected backend. But looks like buffer eviction indeed can > happen anywhere, which is what we were discussing in the previous > messages. > >> Perhaps this should be a separate utility command, or maybe even just >> a new ALTER SYSTEM variant? Or even just a function, similar to what >> the "online checksums" patch did, possibly combined with a bgworder >> (but probably not needed, there are no db-specific tasks to do). > > This is one topic we still actively discuss, but haven't had much > feedback otherwise. The pros and cons seem to be clear: > > * Utilizing the existing GUC mechanism would allow treating > shared_buffers as any other configuration, meaning that potential > users of this feature don't have to do anything new to use it -- they > still can use whatever method they prefer to apply new configuration > (pg_reload_conf, pg_ctr reload, maybe even sending SIGHUP directly). > > I'm also wondering if it's only shared_buffers, or some other options > could use similar approach. > I don't know. What are the "potential users" of this feature? I don't recall any, but there may be some. How do we know the new pending flag will work for them too? > * Having a separate utility command is a mighty simplification, which > helps avoiding problems you've described above. > > So far we've got two against one in favour of simple utility command, so > we can as well go with that. > Not sure voting is a good way to make design decisions ... >> 3) max_available_memory >> >> Speaking of GUCs, I dislike how max_available_memory works. It seems a >> bit backwards to me. I mean, we're specifying shared_buffers (and some >> other parameters), and the system calculates the amount of shared memory >> needed. But the limit determines the total limit? > > The reason it's so backwards is that it's coming from the need to > specify how much memory we would like to reserve, and what would be the > upper boundary for increasing shared_buffers. My intention is eventually > to get rid of this GUC and figure its value at runtime as a function of > the total available memory. > I understand why it's like this. It's simple, and people do want to limit the memory the instance will allocate. That's understandable. The trouble is it makes it very unclear what's the implied limit on shared buffers size. Maybe if there was a sensible way to expose that, we could keep the max_available_memory. But I don't think you can get rid of the GUC, at least not entirely. You need to leave some memory aside for queries, people may start multiple instances at once, ... >> I think the GUC should specify the maximum shared_buffers we want to >> allow, and then we'd work out the total to pre-allocate? Considering >> we're only allowing to resize shared_buffers, that should be pretty >> trivial. Yes, it might happen that the "total limit" happens to exceed >> the available memory or something, but we already have the problem >> with shared_buffers. Seems fine if we explain this in the docs, and >> perhaps print the calculated memory limit on start. > > Somehow I'm not following what you suggest here. You mean having the > maximum shared_buffers specified, but not as a separate GUC? > My suggestion was to have a guc max_shared_buffers. Based on that you can easily calculate the size of all other segments dependent on NBuffers, and reserve memory for that. >> 4) SHMEM_RESIZE_RATIO >> >> The SHMEM_RESIZE_RATIO thing seems a bit strange too. 
There's no way >> these ratios can make sense. For example, BLCKSZ is 8192 but the buffer >> descriptor is 64B. That's 128x difference, but the ratios says 0.6 and >> 0.1, so 6x. Sure, we'll actually allocate only the memory we need, and >> the rest is only "reserved". > > SHMEM_RESIZE_RATIO is a temporary hack, waiting for more decent > solution, nothing more. I probably have to mention that in the > commentaries. > OK >> Moreover, all of the above is for mappings sized based on NBuffers. But >> if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the >> moment someone increases of max_connection, max_locks_per_transaction >> and possibly some other stuff? > > Can you elaborate, what do you mean by that? Increasing max_connection, > etc. leading to increased memory consumption in the MAIN_SHMEM_SEGMENT, > but the ratio is for memory reservation only. > Stuff like PGPROC, fast-path locks etc. are allocated as part of MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum space for that. If I significantly increase GUCs like max_connections or max_locks_per_transaction, how do you know it didn't exceed the 10%? >> 5) no tests >> >> I mentioned no "user docs", but the patch has 0 tests too. Which seems >> a bit strange for a patch of this age. >> >> A really serious part of the patch series seems to be the coordination >> of processes when going through the phases, enforced by the barriers. >> This seems like a perfect match for testing using injection points, and >> I know we did something like this in the online checksums patch, which >> needs to coordinate processes in a similar way. > > Exactly what we're talking about recently, figuring out how to use > injections points for testing. Keep in mind, that the scope of this work > turned out to be huge, and with just two people on board we're > addressing one thing at the time. > Sure. >> But even just a simple TAP test that does a bunch of (random?) resizes >> while running a pgbench seem better than no tests. (That's what I did >> manually, and it crashed right away.) > > This is the type of testing I was doing before posting the series. I > assume you've crashed it on buffers shrinking, singe you've got SIGBUS > which would indicate that the memory is not available anymore. Before we > go into debugging, just to be on the safe side I would like to make sure > you were testing the latest patch version (there are some signs that > it's not the case, about that later)? > Maybe, I don't remember. But I also see crashes while expanding the buffers, with assert failure here: #4 0x0000556f159c43d1 in ExceptionalCondition (conditionName=0x556f15c00e00 "node->prev != INVALID_PROC_NUMBER || list->head == procno", fileName=0x556f15c00ce0 "../../../../src/include/storage/proclist.h", lineNumber=163) at assert.c:66 #5 0x0000556f157a9831 in proclist_contains_offset (list=0x7f296333ce24, procno=140, node_offset=100) at ../../../../src/include/storage/proclist.h:163 #6 0x0000556f157a9add in ConditionVariableTimedSleep (cv=0x7f296333ce20, timeout=-1, wait_event_info=134217782) at condition_variable.c:184 #7 0x0000556f157a99c9 in ConditionVariableSleep (cv=0x7f296333ce20, wait_event_info=134217782) at condition_variable.c:98 #8 0x0000556f157902df in BarrierArriveAndWait (barrier=0x7f296333ce08, wait_event_info=134217782) at barrier.c:191 #9 0x0000556f156d1226 in ProcessBarrierShmemResize (barrier=0x7f296333ce08) at pg_shmem.c:1201 >> 10) what to do about stuck resize? 
>> >> AFAICS the resize can get stuck for various reasons, e.g. because it >> can't evict pinned buffers, possibly indefinitely. Not great, it's not >> clear to me if there's a way out (canceling the resize) after a timeout, >> or something like that? Not great to start an "online resize" only to >> get stuck with all activity blocked for indefinite amount of time, and >> get to restart anyway. >> >> Seems related to Thomas' message [2], but AFAICS the patch does not do >> anything about this yet, right? What's the plan here? > > It's another open discussion right now, with an idea to eventually allow > canceling after a timeout. I think canceling when stuck on buffer > eviction should be pretty straightforward (the evition must take place > before actual shared memory resize, so we know nothing has changed yet), > but in some other failure scenarios it would be harder (e.g. if one > backend is stuck resizing, while other have succeeded -- this would > require another round of synchronization and some way to figure out what > is the current status). > I think it'll be crucial to structure it so that it can't get stuck while resizing. >> 11) preparatory actions? >> >> Even if it doesn't get stuck, some of the actions can take a while, like >> evicting dirty buffers before shrinking, etc. This is similar to what >> happens on restart, when the shutdown checkpoint can take a while, while >> the system is (partly) unavailable. >> >> The common mitigation is to do an explicit checkpoint right before the >> restart, to make the shutdown checkpoint cheap. Could we do something >> similar for the shrinking, e.g. flush buffers from the part to be >> removed before actually starting the resize? > > Yeah, that's a good idea, we will try to explore it. > >> 12) does this affect e.g. fork() costs? >> >> I wonder if this affects the cost of fork() in some undesirable way? >> Could it make fork() more measurably more expensive? > > The number of new mappings is quite limited, so I would not expect that. > But I can measure the impact. > >> 14) interesting messages from the thread >> >> While reading through the thread, I noticed a couple messages that I >> think are still relevant: > > Right, I'm aware there is a lot of not yet addressed feedback, even more > than you've mentioned below. None of this feedback was ignored, we're > just solving large problems step by step. So far the focus was on how to > do memory reservation and to coordinate resize, and everybody is more > than welcome to join. But thanks for collecting the list, I probably > need to start tracking what was addressed and what was not. > >> - Robert asked [5] if Linux might abruptly break this, but I find that >> unlikely. We'd point out we rely on this, and they'd likely rethink. >> This would be made safer if this was specified by POSIX - taking that >> away once implemented seems way harder than for custom extensions. >> It's likely they'd not take away the feature without an alternative >> way to achieve the same effect, I think (yes, harder to maintain). >> Tom suggests [7] this is not in POSIX. > > This conversation was related to the original implementation, which was > based on mremap and slicing of mappings. As I've mentioned, the new > approach doesn't have most of those controversial points, it uses > memfd_create and regular compatible mmap -- I don't see any of those > changing their behavior any time soon. > >> - Andres had an interesting comment about how overcommit interacts with >> MAP_NORESERVE. 
AFAIK it means we need the flag to not break overcommit >> accounting. There's also some comments about from linux-mm people [9]. > > The new implementation uses MAP_NORESERVE for the mapping. > >> - There seem to be some issues with releasing memory backing a mapping >> with hugetlb [10]. With the fd (and truncating the file), this seems >> to release the memory, but it's linux-specific? But most of this stuff >> is specific to linux, it seems. So is this a problem? With this it >> should be working even for hugetlb ... > > Again, the new implementation got rid of problematic bits here, and I > haven't found any weak points related to hugetlb in testing so far. > >> - It seems FreeBSD has MFD_HUGETLB [11], so maybe we could use this and >> make the hugetlb stuff work just like on Linux? Unclear. Also, I >> thought the mfd stuff is linux-specific ... or am I confused? > > Yep, probably. > >> - Thomas asked [13] why we need to stop all the backends, instead of >> just waiting for them to acknowledge the new (smaller) NBuffers value >> and then let them continue. I also don't quite see why this should >> not work, and it'd limit the disruption when we have to wait for >> eviction of buffers pinned by paused cursors, etc. > > I think I've replied to that one, the idea so far was to eliminate any > chance of accessing to-be-truncated buffers and make it easier to reason > about correctness of the implementation this way. I don't see any other > way how to prevent backends from accessing buffers that may disappear > without adding overhead on the read path, but if you folks have some > ideas -- please share! > >> v5-0001-Process-config-reload-in-AIO-workers.patch >> >> 1) Hmmm, so which other workers may need such explicit handling? Do all >> other processes participate in procsignal stuff, or does anything >> need an explicit handling? > > So far I've noticed the issue only with io_workers and the checkpointer. > >> v5-0003-Introduce-pss_barrierReceivedGeneration.patch >> >> 1) Do we actually need this? Isn't it enough to just have two barriers? >> Or a barrier + condition variable, or something like that. > > The issue with two barriers is that they do not prevent disjoint groups, > i.e. one backend joins the barrier, finishes the work and detaches from > the barrier, then another backends joins. I'm not familiar with how this > was solved for online checkums patch though, will take a look. Having a > barrier and a condition variable would be possible, but it's hard to > figure out for how many backends to wait. All in all, a small extention > to the ProcSignalBarrier feels to me much more elegant. > >> 2) The comment talks about "coordinated way" when processing messages, >> but it's not very clear to me. It should explain what is needed and >> not possible with the current barrier code. > > Yeah, I need to work on the commentaries across the patch. Here in > particular it means any coordinated way, whatever that could be. I can > add an example to clarify that part. > >> v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch > > Most of the commentaries here and in the following patches are obviously > reasonable and I'll incorporate them into the next version. > >> 5) I'm a bit confused about the segment/mapping difference. The patch >> seems to randomly mix those, or maybe I'm just confused. I mean, >> we are creating just shmem segment, and the pieces are mappings, >> right? So why do we index them by "shmem_segment"? 
> > Indeed, the patch uses "segment" and "mapping" interchangeably, I need > to tighten it up. The relation is still one to one, thus are multiple > segments as well as mappings. > >> 7) We should remember which segments got to use huge pages and which >> did not. And we should make it optional for each segment. Although, >> maybe I'm just confused about the "segment" definition - if we only >> have one, that's where huge pages are applied. >> >> If we could have multiple segments for different segments (whatever >> that means), not sure what we'll report for cases when some segments >> get to use huge pages and others don't. > > Exactly to avoid solving this, I've consciously decided to postpone > implementing possibility to mix huge and regular pages so far. Any > opinions, should a single reported value be removed and this information > is instead represented as part of an informational view about shared > memory (the one you were suggesting in this thread)? > >> 11) Actually, what's the difference between the contents of Mappings >> and Segments? Isn't that the same thing, indexed in the same way? >> Or could it be unified? Or are they conceptually different thing? > > Unless I'm mixing something badly, the content is the same. The relation > is a segment as a structure "contains" a mapping. > Then, why do we need to track it in two places? Doesn't it just increase the likelihood that someone misses updating one of them? >> v5-0005-Address-space-reservation-for-shared-memory.patch >> >> 1) Shouldn't reserved_offset and huge_pages_on really be in the segment >> info? Or maybe even in mapping info? (again, maybe I'm confused >> about what these structs store) > > I don't think there is reserved_offset variable in the latest version > anymore, can you please confirm you use it instead of ther one I've > posted in April? > >> 3) So ReserveAnonymousMemory is what makes decisions about huge pages, >> for the whole reserved space / all segments in it. That's a bit >> unfortunate with respect to the desirability of some segments >> benefiting from huge pages and others not. Maybe we should have two >> "reserved" areas, one with huge pages, one without? > > Again, there is no ReserveAnonymousMemory anymore, the new approach is > to reserve the memory via separate mappings. > Will check. These may indeed be stale comments, from looking at the earlier version of the patch (the last one from Ashutosh). >> I guess we don't want too many segments, because that might make >> fork() more expensive, etc. Just guessing, though. Also, how would >> this work with threading? > > I assume multithreading will render it unnecessary to use shared memory > favoring some other types of memory usage, but the mechanism around it > could still be the same. > >> 5) The general approach seems sound to me, but I'm not expert on this. >> I wonder how portable this behavior is. I mean, will it work on other >> Unix systems / Windows? Is it POSIX or Linux extension? > > Don't know yet, it's a topic for investigation. > >> v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch >> >> 2) In fact, what happens if the user tries to resize to a value that is >> too large for one of the segments? How would the system know before >> starting the resize (and failing)? > > This type of situation is handled (doing hard stop) in the latest > version, because all the necessary information is present in the mapping > structure. 
> I don't know, but crashing the instance (I assume that's what you mean by hard stop) does not seem like something we want to do. AFAIK the GUC hook should be able to determine if the value is too large, and reject it at that point. Not proceed and crash everything. >> v5-0007-Allow-to-resize-shared-memory-without-restart.patch >> >> 1) Why would AdjustShmemSize be needed? Isn't that a sign of a bug >> somewhere in the resizing? > > When coordination with barriers kicks in, there is a cut off line after > which any newly spawned backend will not be able to take part in it > (e.g. it was too slow to init ProcSignal infrastructure). > AdjustShmemSize is used to handle this cases. > >> 2) Isn't the pg_memory_barrier() in CoordinateShmemResize a bit weird? >> Why is it needed, exactly? If it's to flush stuff for processes >> consuming EmitProcSignalBarrier, it's that too late? What if a >> process consumes the barrier between the emit and memory barrier? > > I think it's not needed, a leftover after code modifications. > >> v5-0008-Support-shrinking-shared-buffers.patch >> v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch >> v5-0010-Additional-validation-for-buffer-in-the-ring.patch > > This reminds me I still need to review those, so Ashutosh probably can > answer those questions better than I. -- Tomas Vondra
> On Fri, Jul 04, 2025 at 05:23:29PM +0200, Tomas Vondra wrote: > >> 2) pending GUC changes > >> > >> Perhaps this should be a separate utility command, or maybe even just > >> a new ALTER SYSTEM variant? Or even just a function, similar to what > >> the "online checksums" patch did, possibly combined with a bgworder > >> (but probably not needed, there are no db-specific tasks to do). > > > > This is one topic we still actively discuss, but haven't had much > > feedback otherwise. The pros and cons seem to be clear: > > > > * Utilizing the existing GUC mechanism would allow treating > > shared_buffers as any other configuration, meaning that potential > > users of this feature don't have to do anything new to use it -- they > > still can use whatever method they prefer to apply new configuration > > (pg_reload_conf, pg_ctr reload, maybe even sending SIGHUP directly). > > > > I'm also wondering if it's only shared_buffers, or some other options > > could use similar approach. > > > > I don't know. What are the "potential users" of this feature? I don't > recall any, but there may be some. How do we know the new pending flag > will work for them too? It could be potentialy useful for any GUC that controls a resource shared between backend, and requires restart. To make this GUC changeable online, every backend has to perform some action, and they have to coordinate to make sure things are consistent -- exactly the use case we're trying to address, shared_buffers is just happened to be one of such resources. While I agree that the currently implemented interface is wrong (e.g. it doesn't prevent pending GUCs from being stored in PG_AUTOCONF_FILENAME, this has to happen only when the new value is actually applied), it still makes sense to me to allow more flexible lifecycle for certain GUC. An example I could think of is shared_preload_libraries. If we ever want to do a hot reload of libraries, this will follow the procedure above: every backend has to do something like dlclose / dlopen and make sure that other backends have the same version of the library. Another maybe less far fetched example is max_worker_processes, which AFAICT is mostly used to control number of slots in shared memory (altough it's also stored in the control file, which makes things more complicated). > > * Having a separate utility command is a mighty simplification, which > > helps avoiding problems you've described above. > > > > So far we've got two against one in favour of simple utility command, so > > we can as well go with that. > > > > Not sure voting is a good way to make design decisions ... I'm somewhat torn between those two options myself. The more I think about this topic, the more I convinced that pending GUC makes sense, but the more work I see needed to implement that. Maybe a good middle ground is to go with a simple utility command, as Ashutosh was suggesting, and keep pending GUC infrastructure on top of that as an optional patch. > >> 3) max_available_memory > >> > >> I think the GUC should specify the maximum shared_buffers we want to > >> allow, and then we'd work out the total to pre-allocate? Considering > >> we're only allowing to resize shared_buffers, that should be pretty > >> trivial. Yes, it might happen that the "total limit" happens to exceed > >> the available memory or something, but we already have the problem > >> with shared_buffers. Seems fine if we explain this in the docs, and > >> perhaps print the calculated memory limit on start. 
> > > > Somehow I'm not following what you suggest here. You mean having the > > maximum shared_buffers specified, but not as a separate GUC? > > My suggestion was to have a guc max_shared_buffers. Based on that you > can easily calculate the size of all other segments dependent on > NBuffers, and reserve memory for that. Got it, ok. > >> Moreover, all of the above is for mappings sized based on NBuffers. But > >> if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the > >> moment someone increases of max_connection, max_locks_per_transaction > >> and possibly some other stuff? > > > > Can you elaborate, what do you mean by that? Increasing max_connection, > > etc. leading to increased memory consumption in the MAIN_SHMEM_SEGMENT, > > but the ratio is for memory reservation only. > > > > Stuff like PGPROC, fast-path locks etc. are allocated as part of > MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum > space for that. If I significantly increase GUCs like max_connections or > max_locks_per_transaction, how do you know it didn't exceed the 10%? Still don't see the problem. The 10% we're talking about is the reserved space, thus it affects only shared memory resizing operation and nothing else. The real memory allocated is less than or equal to the reserved size, but is allocated and managed completely in the same way as without the patch, including size calculations. If some GUCs are increased and drive real memory usage high, it will be handled as before. Are we on the same page about this? > >> 11) Actually, what's the difference between the contents of Mappings > >> and Segments? Isn't that the same thing, indexed in the same way? > >> Or could it be unified? Or are they conceptually different thing? > > > > Unless I'm mixing something badly, the content is the same. The relation > > is a segment as a structure "contains" a mapping. > > > Then, why do we need to track it in two places? Doesn't it just increase > the likelihood that someone misses updating one of them? To clarify, under "contents" I mean the shared memory content (the actual data) behind both "segment" and the "mapping", maybe you had something else in mind. On the surface of it those are two different data structures that have mostly different, but related, fields: a shared memory segment contains stuff needed for working with memory (header, base, end, lock), mapping has more lower level details, e.g. reserved space, fd, IPC key. The only common fields are size and address, maybe I can factor them out to not repeat. > >> 2) In fact, what happens if the user tries to resize to a value that is > >> too large for one of the segments? How would the system know before > >> starting the resize (and failing)? > > > > This type of situation is handled (doing hard stop) in the latest > > version, because all the necessary information is present in the mapping > > structure. > > > I don't know, but crashing the instance (I assume that's what you mean > by hard stop) does not seem like something we want to do. AFAIK the GUC > hook should be able to determine if the value is too large, and reject > it at that point. Not proceed and crash everything. I see, you're pointing out that it would be good to have more validation at the GUC level, right?
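If so, roughly what I picture such a check could look like is below -- a sketch only, assuming a helper that knows how much space was reserved for each segment (both the hook and the helper are illustrative, neither exists in the current patches):

    static bool
    check_shared_buffers(int *newval, void **extra, GucSource source)
    {
        /* hypothetical: how much address space was reserved for the blocks */
        Size    reserved = ReservedSizeForSegment(BUFFERS_SHMEM_SEGMENT);
        Size    needed = (Size) *newval * BLCKSZ;

        if (needed > reserved)
        {
            GUC_check_errdetail("%d buffers do not fit into the reserved "
                                "shared memory segment.", *newval);
            return false;
        }

        /* similar checks for descriptors, IO CVs, checkpoint sort array */
        return true;
    }

That would reject an oversized value when the configuration is reloaded, instead of letting the resize fail later.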
> On Fri, Jul 04, 2025 at 04:41:51PM +0200, Dmitry Dolgov wrote: > > v5-0003-Introduce-pss_barrierReceivedGeneration.patch > > > > 1) Do we actually need this? Isn't it enough to just have two barriers? > > Or a barrier + condition variable, or something like that. > > The issue with two barriers is that they do not prevent disjoint groups, > i.e. one backend joins the barrier, finishes the work and detaches from > the barrier, then another backends joins. I'm not familiar with how this > was solved for online checkums patch though, will take a look. Having a > barrier and a condition variable would be possible, but it's hard to > figure out for how many backends to wait. All in all, a small extention > to the ProcSignalBarrier feels to me much more elegant. After quickly checking how the online checksums patch is dealing with the coordination, I've realized my answer here about the disjoint groups is not quite correct. You were asking about ProcSignalBarrier, I was answering about the barrier within the resizing logic. Here is how it looks to me: * We could follow the same approach as the online checksums patch, launch a coordinator worker (Ashutosh was suggesting that, but no implementation has materialized yet) and fire two ProcSignalBarriers, one to kick off resizing and another one to finish it. Maybe it could even be three phases, one extra to tell backends not to pull new buffers into the pool, to help the buffer eviction process. * This way any backend between the ProcSignalBarriers will be able to proceed with whatever it's doing, and we need to make sure it will not access buffers that will soon disappear. A suggestion so far was to get all backends to agree not to allocate any new buffers in the to-be-truncated range, but accessing already existing buffers that will soon go away is a problem as well. As far as I can tell there is no rock-solid method to make sure a backend doesn't have a reference to such a buffer somewhere (this was discussed earlier in the thread), meaning that either a backend has to wait or buffers have to be checked every time on access. * Since the latter adds a performance overhead, we went with the former (making backends wait). And here is where all the complexity comes from, because waiting backends cannot rely on a ProcSignalBarrier and thus require some other approach. If I've overlooked any other alternative to backends waiting, let me know. > It also seems a bit strange that the "switch" gets to be be driven by > a randomly selected backend (unless I'm misunderstanding this bit). It > seems to be true for the buffer eviction during shrinking, at least. But it looks like the eviction could indeed be improved via a new coordinator worker. Before resizing shared memory, such a worker will first tell all the backends not to allocate new buffers via a ProcSignalBarrier, then will do the buffer eviction. Since backends don't need to be waiting after this type of ProcSignalBarrier, it should work, with a single worker doing the eviction. But the second ProcSignalBarrier for resizing would still follow the current procedure with everybody waiting. Does it make sense to you folks?
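The coordinator side of that would look roughly like the following sketch; the barrier types, the ShmemCtrl field and EvictBuffersAbove() are placeholders for whatever the implementation ends up using:

    /* Publish the target before signalling, so that backends started after
     * the signal (the AdjustShmemSize case) can still pick it up. */
    pg_atomic_write_u32(&ShmemCtrl->target_nbuffers, (uint32) new_nbuffers);

    /* Phase 1: nobody may allocate buffers above the target anymore.
     * Backends only absorb this, nobody has to stop and wait. */
    uint64 gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE_PREPARE);
    WaitForProcSignalBarrier(gen);

    /* The coordinator alone evicts buffers in the to-be-truncated range. */
    EvictBuffersAbove(new_nbuffers);

    /* Phase 2: the actual remapping; this is the step where the current
     * patch makes everyone synchronize until all mappings are adjusted. */
    gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
    WaitForProcSignalBarrier(gen);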
> On Sun, Jul 06, 2025 at 03:01:34PM +0200, Dmitry Dolgov wrote: > * This way any backend between the ProcSignalBarriers will be able > proceed with whatever it's doing, and there is need to make sure it > will not access buffers that will soon disappear. A suggestion so far > was to get all backends agree to not allocate any new buffers in the > to-be-truncated range, but accessing already existing buffers that > will soon go away is a problem as well. As far as I can tell there is > no rock solid method to make sure a backend doesn't have a reference > to such a buffer somewhere (this was discussed earlier in thre > thread), meaning that either a backend has to wait or buffers have to > be checked every time on access. And sure enough, after I wrote this I've realized there should be no such references after the buffer eviction and prohibiting new buffer allocation. I still need to check it though, because not only buffers, but other shared memory structures (whose number depends on NBuffers) will be truncated. But if they are also handled by the eviction, then maybe everything is just fine.
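For the record, these are the NBuffers-scaled shared memory allocations that have to shrink together (just restating what buf_init.c and the freelist code size today):

    /* all indexed by buffer id */
    Size    blocks = mul_size(NBuffers, BLCKSZ);                      /* BufferBlocks      */
    Size    descs  = mul_size(NBuffers, sizeof(BufferDescPadded));    /* BufferDescriptors */
    Size    iocvs  = mul_size(NBuffers,
                              sizeof(ConditionVariableMinimallyPadded)); /* BufferIOCVArray */
    Size    ckpt   = mul_size(NBuffers, sizeof(CkptSortItem));        /* CkptBufferIds     */

    /* plus the buffer lookup table, sized from NBuffers + NUM_BUFFER_PARTITIONS */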
On 7/5/25 12:35, Dmitry Dolgov wrote: >> On Fri, Jul 04, 2025 at 05:23:29PM +0200, Tomas Vondra wrote: >>>> 2) pending GUC changes >>>> >>>> Perhaps this should be a separate utility command, or maybe even just >>>> a new ALTER SYSTEM variant? Or even just a function, similar to what >>>> the "online checksums" patch did, possibly combined with a bgworder >>>> (but probably not needed, there are no db-specific tasks to do). >>> >>> This is one topic we still actively discuss, but haven't had much >>> feedback otherwise. The pros and cons seem to be clear: >>> >>> * Utilizing the existing GUC mechanism would allow treating >>> shared_buffers as any other configuration, meaning that potential >>> users of this feature don't have to do anything new to use it -- they >>> still can use whatever method they prefer to apply new configuration >>> (pg_reload_conf, pg_ctr reload, maybe even sending SIGHUP directly). >>> >>> I'm also wondering if it's only shared_buffers, or some other options >>> could use similar approach. >>> >> >> I don't know. What are the "potential users" of this feature? I don't >> recall any, but there may be some. How do we know the new pending flag >> will work for them too? > > It could be potentialy useful for any GUC that controls a resource > shared between backend, and requires restart. To make this GUC > changeable online, every backend has to perform some action, and they > have to coordinate to make sure things are consistent -- exactly the use > case we're trying to address, shared_buffers is just happened to be one > of such resources. While I agree that the currently implemented > interface is wrong (e.g. it doesn't prevent pending GUCs from being > stored in PG_AUTOCONF_FILENAME, this has to happen only when the new > value is actually applied), it still makes sense to me to allow more > flexible lifecycle for certain GUC. > > An example I could think of is shared_preload_libraries. If we ever want > to do a hot reload of libraries, this will follow the procedure above: > every backend has to do something like dlclose / dlopen and make sure > that other backends have the same version of the library. Another maybe > less far fetched example is max_worker_processes, which AFAICT is mostly > used to control number of slots in shared memory (altough it's also > stored in the control file, which makes things more complicated). > Not sure. My concern is the config reload / GUC assign hook was not designed with this use case in mind, and we'll run into issues. I also dislike the "async" nature of this, which makes it harder to e.g. abort the change, etc. >>> * Having a separate utility command is a mighty simplification, which >>> helps avoiding problems you've described above. >>> >>> So far we've got two against one in favour of simple utility command, so >>> we can as well go with that. >>> >> >> Not sure voting is a good way to make design decisions ... > > I'm somewhat torn between those two options myself. The more I think > about this topic, the more I convinced that pending GUC makes sense, but > the more work I see needed to implement that. Maybe a good middle ground > is to go with a simple utility command, as Ashutosh was suggesting, and > keep pending GUC infrastructure on top of that as an optional patch. > What about a simple function? Probably not as clean as a proper utility command, and it implies a transaction - not sure if that could be a problem for some part of this. 
>>>> 3) max_available_memory >>>> >>>> I think the GUC should specify the maximum shared_buffers we want to >>>> allow, and then we'd work out the total to pre-allocate? Considering >>>> we're only allowing to resize shared_buffers, that should be pretty >>>> trivial. Yes, it might happen that the "total limit" happens to exceed >>>> the available memory or something, but we already have the problem >>>> with shared_buffers. Seems fine if we explain this in the docs, and >>>> perhaps print the calculated memory limit on start. >>> >>> Somehow I'm not following what you suggest here. You mean having the >>> maximum shared_buffers specified, but not as a separate GUC? >> >> My suggestion was to have a guc max_shared_buffers. Based on that you >> can easily calculate the size of all other segments dependent on >> NBuffers, and reserve memory for that. > > Got it, ok. > >>>> Moreover, all of the above is for mappings sized based on NBuffers. But >>>> if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the >>>> moment someone increases of max_connection, max_locks_per_transaction >>>> and possibly some other stuff? >>> >>> Can you elaborate, what do you mean by that? Increasing max_connection, >>> etc. leading to increased memory consumption in the MAIN_SHMEM_SEGMENT, >>> but the ratio is for memory reservation only. >>> >> >> Stuff like PGPROC, fast-path locks etc. are allocated as part of >> MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum >> space for that. If I significantly increase GUCs like max_connections or >> max_locks_per_transaction, how do you know it didn't exceed the 10%? > > Still don't see the problem. The 10% we're talking about is the reserved > space, thus it affects only shared memory resizing operation and nothing > else. The real memory allocated is less than or equal to the reserved > size, but is allocated and managed completely in the same way as without > the patch, including size calculations. If some GUCs are increased and > drive real memory usage high, it will be handled as before. Are we on > the same page about this? > How do you know reserving 10% is sufficient? Imagine I set max_available_memory = '256MB' max_connections = 1000000 max_locks_per_transaction = 10000 How do you know it's not more than 10% of the available memory? FWIW if I add a simple assert to CreateAnonymousSegment Assert(mapping->shmem_reserved >= allocsize); it crashes even with just the max_available_memory=256MB #4 0x0000000000b74fbd in ExceptionalCondition (conditionName=0xe25920 "mapping->shmem_reserved >= allocsize", fileName=0xe251e7 "pg_shmem.c", lineNumber=878) at assert.c:66 because we happen to execute it with this: mapping->shmem_reserved 26845184 allocsize 125042688 I think I mentioned a similar crash earlier, not sure if that's the same issue or a different one. >>>> 11) Actually, what's the difference between the contents of Mappings >>>> and Segments? Isn't that the same thing, indexed in the same way? >>>> Or could it be unified? Or are they conceptually different thing? >>> >>> Unless I'm mixing something badly, the content is the same. The relation >>> is a segment as a structure "contains" a mapping. >>> >> Then, why do we need to track it in two places? Doesn't it just increase >> the likelihood that someone misses updating one of them? > > To clarify, under "contents" I mean the shared memory content (the > actual data) behind both "segment" and the "mapping", maybe you had > something else in mind. 
> > On the surface of it those are two different data structures that have > mostly different, but related, fields: a shared memory segment contains > stuff needed for working with memory (header, base, end, lock), mapping > has more lower level details, e.g. reserved space, fd, IPC key. The > only common fields are size and address, maybe I can factor them out to > not repeat. > OK, I think I'm just confused by the ambiguous definitions of segment/mapping. It'd be good to document/explain this in a comment somewhere. >>>> 2) In fact, what happens if the user tries to resize to a value that is >>>> too large for one of the segments? How would the system know before >>>> starting the resize (and failing)? >>> >>> This type of situation is handled (doing hard stop) in the latest >>> version, because all the necessary information is present in the mapping >>> structure. >>> >> I don't know, but crashing the instance (I assume that's what you mean >> by hard stop) does not seem like something we want to do. AFAIK the GUC >> hook should be able to determine if the value is too large, and reject >> it at that point. Not proceed and crash everything. > > I see, you're pointing out that it would be good to have more validation > at the GUC level, right? Well, that'd be a starting point. We definitely should not allow setting a value that ends up crashing an instance (it does not matter if it's because of FATAL or hitting a segfault/SIGBUS somewhere). cheers -- Tomas Vondra
> On Mon, Jul 07, 2025 at 01:57:42PM +0200, Tomas Vondra wrote: > > It could be potentialy useful for any GUC that controls a resource > > shared between backend, and requires restart. To make this GUC > > changeable online, every backend has to perform some action, and they > > have to coordinate to make sure things are consistent -- exactly the use > > case we're trying to address, shared_buffers is just happened to be one > > of such resources. While I agree that the currently implemented > > interface is wrong (e.g. it doesn't prevent pending GUCs from being > > stored in PG_AUTOCONF_FILENAME, this has to happen only when the new > > value is actually applied), it still makes sense to me to allow more > > flexible lifecycle for certain GUC. > > > > An example I could think of is shared_preload_libraries. If we ever want > > to do a hot reload of libraries, this will follow the procedure above: > > every backend has to do something like dlclose / dlopen and make sure > > that other backends have the same version of the library. Another maybe > > less far fetched example is max_worker_processes, which AFAICT is mostly > > used to control number of slots in shared memory (altough it's also > > stored in the control file, which makes things more complicated). > > > > Not sure. My concern is the config reload / GUC assign hook was not > designed with this use case in mind, and we'll run into issues. I also > dislike the "async" nature of this, which makes it harder to e.g. abort > the change, etc. Yes, GUC assing hook was not designed for that. That's why the idea is to extend the design and see if it will be good enough. > > I'm somewhat torn between those two options myself. The more I think > > about this topic, the more I convinced that pending GUC makes sense, but > > the more work I see needed to implement that. Maybe a good middle ground > > is to go with a simple utility command, as Ashutosh was suggesting, and > > keep pending GUC infrastructure on top of that as an optional patch. > > > > What about a simple function? Probably not as clean as a proper utility > command, and it implies a transaction - not sure if that could be a > problem for some part of this. I'm currently inclined towards this and a new one worker to coordinate the process, with everything else provided as an optional follow-up step. Will try this out unless there are any objections. > >> Stuff like PGPROC, fast-path locks etc. are allocated as part of > >> MAIN_SHMEM_SEGMENT, right? Yet the ratio assigns 10% of the maximum > >> space for that. If I significantly increase GUCs like max_connections or > >> max_locks_per_transaction, how do you know it didn't exceed the 10%? > > > > Still don't see the problem. The 10% we're talking about is the reserved > > space, thus it affects only shared memory resizing operation and nothing > > else. The real memory allocated is less than or equal to the reserved > > size, but is allocated and managed completely in the same way as without > > the patch, including size calculations. If some GUCs are increased and > > drive real memory usage high, it will be handled as before. Are we on > > the same page about this? > > > > How do you know reserving 10% is sufficient? Imagine I set I see, I was convinced you're talking about changing something at runtime, which will hit the reservation boundary. But you mean all of that at simply the start, and yes, of course it will fail -- see the point about SHMEM_RATIO being just a temporary hack.
On Mon, Jul 7, 2025 at 6:36 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Mon, Jul 07, 2025 at 01:57:42PM +0200, Tomas Vondra wrote: > > > It could be potentialy useful for any GUC that controls a resource > > > shared between backend, and requires restart. To make this GUC > > > changeable online, every backend has to perform some action, and they > > > have to coordinate to make sure things are consistent -- exactly the use > > > case we're trying to address, shared_buffers is just happened to be one > > > of such resources. While I agree that the currently implemented > > > interface is wrong (e.g. it doesn't prevent pending GUCs from being > > > stored in PG_AUTOCONF_FILENAME, this has to happen only when the new > > > value is actually applied), it still makes sense to me to allow more > > > flexible lifecycle for certain GUC. > > > > > > An example I could think of is shared_preload_libraries. If we ever want > > > to do a hot reload of libraries, this will follow the procedure above: > > > every backend has to do something like dlclose / dlopen and make sure > > > that other backends have the same version of the library. Another maybe > > > less far fetched example is max_worker_processes, which AFAICT is mostly > > > used to control number of slots in shared memory (altough it's also > > > stored in the control file, which makes things more complicated). > > > > > > > Not sure. My concern is the config reload / GUC assign hook was not > > designed with this use case in mind, and we'll run into issues. I also > > dislike the "async" nature of this, which makes it harder to e.g. abort > > the change, etc. > > Yes, GUC assing hook was not designed for that. That's why the idea is > to extend the design and see if it will be good enough. > > > > I'm somewhat torn between those two options myself. The more I think > > > about this topic, the more I convinced that pending GUC makes sense, but > > > the more work I see needed to implement that. Maybe a good middle ground > > > is to go with a simple utility command, as Ashutosh was suggesting, and > > > keep pending GUC infrastructure on top of that as an optional patch. > > > > > > > What about a simple function? Probably not as clean as a proper utility > > command, and it implies a transaction - not sure if that could be a > > problem for some part of this. > > I'm currently inclined towards this and a new one worker to coordinate > the process, with everything else provided as an optional follow-up > step. Will try this out unless there are any objections. I will reply to the questions but let me summarise my offlist discussion with Andres. I had proposed ALTER SYSTEM ... UPDATE ... approach in pgconf.dev for any system wide GUC change such as this. However, Andres pointed out that any UI proposal has to honour the current ability to edit postgresql.conf and trigger the change in a running server. ALTER SYSTEM ... UDPATE ... does not allow that. So, I think we have to build something similar or on top of the current ALTER SYSTEM ... SET + pg_reload_conf(). My current proposal is ALTER SYSTEM ... SET + pg_reload_conf() with pending mark + pg_apply_pending_conf(<name of GUC>, <more parameters>). The third function would take a GUC name as parameter and complete the pending application change. If the proposed change is not valid, it will throw an error. If there are problems completing the change it will throw an error and keep the pending mark intact. 
Further, the function can take GUC-specific parameters which control the application process. For example, it could tell whether to wait for a backend to unpin a buffer, or cancel that query, or kill the backend, or abort the application of the change itself. If the operation takes too long, a user may want to cancel the function execution just like cancelling a query. Running two concurrent instances of the function, both applying the same GUC, won't be allowed. Does that look good? -- Best Wishes, Ashutosh Bapat
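To make the proposed interface concrete, the SQL-callable entry point could be shaped roughly like this -- a sketch only, nothing like it exists in the patches yet, and the parameter list and behaviour are purely illustrative:

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/builtins.h"

    PG_FUNCTION_INFO_V1(pg_apply_pending_conf);

    Datum
    pg_apply_pending_conf(PG_FUNCTION_ARGS)
    {
        char   *guc_name = text_to_cstring(PG_GETARG_TEXT_PP(0));
        bool    wait_for_unpin = PG_GETARG_BOOL(1);

        /*
         * 1. Verify the GUC actually has a pending value and re-validate it.
         * 2. Take a lock so two sessions cannot apply the same GUC at once.
         * 3. Kick the coordinator and wait, honouring query cancel.
         * 4. On failure, error out and leave the pending mark intact.
         */
        (void) wait_for_unpin;
        ereport(ERROR,
                (errmsg("applying the pending value of \"%s\" is not implemented",
                        guc_name)));

        PG_RETURN_VOID();
    }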
> On Mon, Jul 07, 2025 at 07:12:50PM +0530, Ashutosh Bapat wrote: > > My current proposal is ALTER SYSTEM ... SET + pg_reload_conf() with > pending mark + pg_apply_pending_conf(<name of GUC>, <more > parameters>). The third function would take a GUC name as parameter > and complete the pending application change. If the proposed change is > not valid, it will throw an error. If there are problems completing > the change it will throw an error and keep the pending mark intact. > Further the function can take GUC specific parameters which control > the application process. E.g. for example it could tell whether to > wait for a backend to unpin a buffer or cancel that query or kill the > backend or abort the application itself. If the operation takes too > long, a user may want to cancel the function execution just like > cancelling a query. Running two concurrent instances of the function, > both applying the same GUC won't be allowed. Yeah, it can look like this, but it's a large chunk of work, as is improving the current implementation. I'm still convinced that using the GUC mechanism one way or another is the right choice here, but maybe better as the follow-up step I was mentioning above -- simply to limit the scope and move step by step. How does it sound? Regarding the proposal, I'm somewhat uncomfortable with the fact that between those two function calls the system will be in an awkward state for some time, and how long that takes will not be controlled by the resizing logic anymore. But otherwise it seems to be equivalent to what we want to achieve in many other aspects.
> On Sun, Jul 06, 2025 at 03:21:08PM +0200, Dmitry Dolgov wrote: > > On Sun, Jul 06, 2025 at 03:01:34PM +0200, Dmitry Dolgov wrote: > > * This way any backend between the ProcSignalBarriers will be able > > proceed with whatever it's doing, and there is need to make sure it > > will not access buffers that will soon disappear. A suggestion so far > > was to get all backends agree to not allocate any new buffers in the > > to-be-truncated range, but accessing already existing buffers that > > will soon go away is a problem as well. As far as I can tell there is > > no rock solid method to make sure a backend doesn't have a reference > > to such a buffer somewhere (this was discussed earlier in thre > > thread), meaning that either a backend has to wait or buffers have to > > be checked every time on access. > > And sure enough, after I wrote this I've realized there should be no > such references after the buffer eviction and prohibiting new buffer > allocation. I still need to check it though, because not only buffers, > but other shared memory structures (which number depends on NBuffers) > will be truncated. But if they will also be handled by the eviction, > then maybe everything is just fine. Pondering more about this topic, I've realized there was one more problematic case mentioned by Robert early in the thread, which is relatively easy to construct: * When increasing shared buffers from NBuffers_small to NBuffers_large it's possible that one backend already has applied NBuffers_large, then allocated a buffer B from (NBuffers_small, NBuffers_large] and put it into the buffer lookup table. * In the meantime another backend still has NBuffers_small, but got buffer B from the lookup table. Currently it's being addressed via every backend waiting for each other, but I guess it could be as well managed via handling the freelist, so that only "available" buffers will be inserted into the lookup table. It's probably the only such case, but I can't tell that for sure (hard to say, maybe there are more tricky cases with the latest async IO). If you folks have some other examples that may break, let me know. The idea behind making everyone wait was to be rock-solid sure that no similar but unknown scenario could damage the resize procedure. As for other structures, BufferBlocks, BufferDescriptors and BufferIOCVArray are all buffer-indexed, so making sure shared memory resizing works for buffers should automatically mean the same for the rest. But CkptBufferIds is a different case, as it collects buffers to sync and process them at a later point in time -- it has to be explicitly handled when shrinking shared memory, I guess. Long story short, in the next version of the patch I'll try to experiment with a simplified design: a simple function to trigger resizing, launching a coordinator worker, with backends not waiting for each other and buffers first allocated and then marked as "available to use".
On Mon, Jul 14, 2025 at 12:07 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Sun, Jul 06, 2025 at 03:21:08PM +0200, Dmitry Dolgov wrote: > > > On Sun, Jul 06, 2025 at 03:01:34PM +0200, Dmitry Dolgov wrote: > > > * This way any backend between the ProcSignalBarriers will be able > > > proceed with whatever it's doing, and there is need to make sure it > > > will not access buffers that will soon disappear. A suggestion so far > > > was to get all backends agree to not allocate any new buffers in the > > > to-be-truncated range, but accessing already existing buffers that > > > will soon go away is a problem as well. As far as I can tell there is > > > no rock solid method to make sure a backend doesn't have a reference > > > to such a buffer somewhere (this was discussed earlier in thre > > > thread), meaning that either a backend has to wait or buffers have to > > > be checked every time on access. > > > > And sure enough, after I wrote this I've realized there should be no > > such references after the buffer eviction and prohibiting new buffer > > allocation. I still need to check it though, because not only buffers, > > but other shared memory structures (which number depends on NBuffers) > > will be truncated. But if they will also be handled by the eviction, > > then maybe everything is just fine. > > Pondering more about this topic, I've realized there was one more > problematic case mentioned by Robert early in the thread, which is > relatively easy to construct: > > * When increasing shared buffers from NBuffers_small to NBuffers_large > it's possible that one backend already has applied NBuffers_large, > then allocated a buffer B from (NBuffer_small, NBuffers_large] and put > it into the buffer lookup table. > > * In the meantime another backend still has NBuffers_small, but got > buffer B from the lookup table. > > Currently it's being addressed via every backend waiting for each other, > but I guess it could be as well managed via handling the freelist, so > that only "available" buffers will be inserted into the lookup table. I didn't get how can that be managed by freelist? Buffers are also allocated through clocksweep, which needs to be managed as well. > Long story short, in the next version of the patch I'll try to > experiment with a simplified design: a simple function to trigger > resizing, launching a coordinator worker, with backends not waiting for > each other and buffers first allocated and then marked as "available to > use". Should all the backends wait between buffer allocation and them being marked as "available"? I assume that marking them as available means "declaring the new NBuffers". What about when shrinking the buffers? Do you plan to make all the backends wait while the coordinator is evicting buffers? -- Best Wishes, Ashutosh Bapat
> On Mon, Jul 14, 2025 at 10:25:51AM +0530, Ashutosh Bapat wrote: > > Currently it's being addressed via every backend waiting for each other, > > but I guess it could be as well managed via handling the freelist, so > > that only "available" buffers will be inserted into the lookup table. > > I didn't get how can that be managed by freelist? Buffers are also > allocated through clocksweep, which needs to be managed as well.

The way it is implemented in the patch right now is that new buffers are added into the freelist right away, when they're initialized, by virtue of nextFree. What I have in mind is to do this as the last step, when all backends have confirmed the shared memory signal was absorbed. This would mean that StrategyControl will not return a buffer id from the freshly allocated range until everything is done, and no such buffer will be inserted into the buffer lookup table.

You're right of course, a buffer id could be returned from the ClockSweep and from the custom strategy buffer ring. But from what I see those are picking a buffer from the set of already utilized buffers, meaning that for a buffer to land there it first has to go through StrategyControl->firstFreeBuffer, and hence the idea above will be a requirement for those as well.

> > Long story short, in the next version of the patch I'll try to > > experiment with a simplified design: a simple function to trigger > > resizing, launching a coordinator worker, with backends not waiting for > > each other and buffers first allocated and then marked as "available to > > use". > > Should all the backends wait between buffer allocation and them being > marked as "available"? I assume that marking them as available means > "declaring the new NBuffers".

Yep, making buffers available would be equivalent to declaring the new NBuffers. What I think is important here is to note that we use two mechanisms for coordination: the shared structure ShmemControl that shares the state of the operation, and ProcSignal that tells backends to do something (change the memory mapping). Declaring the new NBuffers could be done via ShmemControl, atomically applying the new value, instead of sending a ProcSignal -- this way there is no need for backends to wait, but StrategyControl would need to use ShmemControl instead of a local copy of NBuffers. Does it make sense to you?

> What about when shrinking the buffers? Do you plan to make all the > backends wait while the coordinator is evicting buffers?

No, it was never planned like that, since it could easily end up with the coordinator waiting for a backend to unpin a buffer, and the backend waiting for a signal from the coordinator.
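To make that ordering concrete, here is a minimal sketch of the "link into the freelist only at the very end" idea; ResizeBarrierAbsorbedByAll() and the first/last bounds of the new range are hypothetical names, and the push below is simplified (it ignores lastFreeBuffer maintenance), so this is only an illustration of the intent, not the patch's actual code:

    /* Sketch only: expose freshly initialized buffers as the very last step.
     * Assumed to live in freelist.c, where StrategyControl is visible. */
    static void
    LinkNewBuffersIntoFreelist(int first_new, int last_new)
    {
        Assert(ResizeBarrierAbsorbedByAll());   /* hypothetical confirmation */

        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
        for (int i = first_new; i <= last_new; i++)
        {
            BufferDesc *buf = GetBufferDescriptor(i);

            /* until now freeNext was FREENEXT_NOT_IN_LIST, so
             * StrategyGetBuffer() could not hand these buffers out */
            buf->freeNext = StrategyControl->firstFreeBuffer;
            StrategyControl->firstFreeBuffer = i;
        }
        SpinLockRelease(&StrategyControl->buffer_strategy_lock);
    }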
On Mon, Jul 14, 2025 at 1:40 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Mon, Jul 14, 2025 at 10:25:51AM +0530, Ashutosh Bapat wrote: > > > Currently it's being addressed via every backend waiting for each other, > > > but I guess it could be as well managed via handling the freelist, so > > > that only "available" buffers will be inserted into the lookup table. > > > > I didn't get how can that be managed by freelist? Buffers are also > > allocated through clocksweep, which needs to be managed as well. > > The way it is implemented in the patch right now is new buffers are > added into the freelist right away, when they're initialized by the > virtue of nextFree. What I have in mind is to do this as the last step, > when all backends have confirmed shared memory signal was absorbed. This > would mean that StrategyControll will not return a buffer id from the > freshly allocated range until everything is done, and no such buffer > will be inserted into the buffer lookup table. > > You're right of course, a buffer id could be returned from the > ClockSweep and from the custom strategy buffer ring. Buf from what I see > those are picking a buffer from the set of already utilized buffers, > meaning that for a buffer to land there it first has to go through > StrategyControl->firstFreeBuffer, and hence the idea above will be a > requirement for those as well. That isn't true. A buffer which was never in the free list can still be picked up by clock sweep. But you are raising a relevant point about StrategyControl below > > > > Long story short, in the next version of the patch I'll try to > > > experiment with a simplified design: a simple function to trigger > > > resizing, launching a coordinator worker, with backends not waiting for > > > each other and buffers first allocated and then marked as "available to > > > use". > > > > Should all the backends wait between buffer allocation and them being > > marked as "available"? I assume that marking them as available means > > "declaring the new NBuffers". > > Yep, making buffers available would be equivalent to declaring the new > NBuffers. What I think is important here is to note, that we use two > mechanisms for coordination: the shared structure ShmemControl that > shares the state of operation, and ProcSignal that tells backends to do > something (change the memory mapping). Declaring the new NBuffers could > be done via ShmemControl, atomically applying the new value, instead of > sending a ProcSignal -- this way there is no need for backends to wait, > but StrategyControl would need to use the ShmemControl instead of local > copy of NBuffers. Does it make sense to you? When expanding buffers, letting StrategyControl continue with the old NBuffers may work. When propagating the new buffer value we have to reinitialize StrategyControl to use new NBuffers. But when shrinking, the StrategyControl needs to be initialized with the new NBuffers, lest it picks a victim from buffers being shrunk. And then if the operation fails, we have to reinitialize the StrategyControl again with the old NBuffers. > > > What about when shrinking the buffers? Do you plan to make all the > > backends wait while the coordinator is evicting buffers? > > No, it was never planned like that, since it could easily end up with > coordinator waiting for the backend to unpin a buffer, and the backend > to wait for a signal from the coordinator. I agree with the deadlock situation. 
How do we prevent the backends from picking, or continuing to work with, a buffer from the range being shrunk then? Each backend then has to do something about its respective pinned buffers.

-- Best Wishes, Ashutosh Bapat
> On Mon, Jul 14, 2025 at 01:55:39PM +0530, Ashutosh Bapat wrote: > > You're right of course, a buffer id could be returned from the > > ClockSweep and from the custom strategy buffer ring. Buf from what I see > > those are picking a buffer from the set of already utilized buffers, > > meaning that for a buffer to land there it first has to go through > > StrategyControl->firstFreeBuffer, and hence the idea above will be a > > requirement for those as well. > > That isn't true. A buffer which was never in the free list can still > be picked up by clock sweep.

How's that?

> > Yep, making buffers available would be equivalent to declaring the new > > NBuffers. What I think is important here is to note, that we use two > > mechanisms for coordination: the shared structure ShmemControl that > > shares the state of operation, and ProcSignal that tells backends to do > > something (change the memory mapping). Declaring the new NBuffers could > > be done via ShmemControl, atomically applying the new value, instead of > > sending a ProcSignal -- this way there is no need for backends to wait, > > but StrategyControl would need to use the ShmemControl instead of local > > copy of NBuffers. Does it make sense to you? > > When expanding buffers, letting StrategyControl continue with the old > NBuffers may work. When propagating the new buffer value we have to > reinitialize StrategyControl to use new NBuffers. But when shrinking, > the StrategyControl needs to be initialized with the new NBuffers, > lest it picks a victim from buffers being shrunk. And then if the > operation fails, we have to reinitialize the StrategyControl again > with the old NBuffers.

Right, those two cases will become more asymmetrical: for expanding, the number of available buffers would have to be propagated to the backends at the end, when they're ready; for shrinking, the number of available buffers would have to be propagated at the start, so that backends will stop allocating unavailable buffers.

> > > What about when shrinking the buffers? Do you plan to make all the > > > backends wait while the coordinator is evicting buffers? > > > > No, it was never planned like that, since it could easily end up with > > coordinator waiting for the backend to unpin a buffer, and the backend > > to wait for a signal from the coordinator. > > I agree with the deadlock situation. How do we prevent the backends > from picking or continuing to work with a buffer from buffers being > shrunk then? Each backend then has to do something about their > respective pinned buffers.

The idea I've got so far is to stop allocating buffers from the unavailable range and wait until backends unpin all unavailable buffers. We either wait unconditionally until it happens, or bail out after a certain timeout. It's probably possible to force backends to unpin buffers they work with, but it sounds much more problematic to me. What do you think?
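A rough sketch of the "wait for pins to drain, or bail out after a timeout" option, using the existing buffer-header accessors; the range bounds and the timeout parameter are hypothetical, and a real implementation would presumably sleep on a condition variable rather than poll:

    /* Sketch only: returns true once no buffer in [first, last] is pinned. */
    static bool
    WaitForRetiredBuffersUnpinned(int first, int last, int timeout_ms)
    {
        TimestampTz start = GetCurrentTimestamp();

        for (;;)
        {
            bool    pinned = false;

            for (int i = first; i <= last; i++)
            {
                uint32  state = pg_atomic_read_u32(&GetBufferDescriptor(i)->state);

                if (BUF_STATE_GET_REFCOUNT(state) > 0)
                {
                    pinned = true;
                    break;
                }
            }

            if (!pinned)
                return true;        /* safe to truncate the range */

            if (timeout_ms >= 0 &&
                TimestampDifferenceExceeds(start, GetCurrentTimestamp(), timeout_ms))
                return false;       /* caller cancels the resize */

            CHECK_FOR_INTERRUPTS();
            pg_usleep(10000L);      /* 10ms between polls */
        }
    }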
> On Mon, Jul 14, 2025 at 10:24:50AM +0100, Thom Brown wrote: > On Mon, 14 Jul 2025, 09:54 Dmitry Dolgov, <9erthalion6@gmail.com> wrote: > > > > On Mon, Jul 14, 2025 at 01:55:39PM +0530, Ashutosh Bapat wrote: > > > > You're right of course, a buffer id could be returned from the > > > > ClockSweep and from the custom strategy buffer ring. Buf from what I > > see > > > > those are picking a buffer from the set of already utilized buffers, > > > > meaning that for a buffer to land there it first has to go through > > > > StrategyControl->firstFreeBuffer, and hence the idea above will be a > > > > requirement for those as well. > > > > > > That isn't true. A buffer which was never in the free list can still > > > be picked up by clock sweep. > > > > How's that? > > > > Isn't it its job to find usable buffers from the used buffer list when no > free ones are available? The next victim buffer can be selected (and > cleaned if dirty) and then immediately used without touching the free list. Ah, I see what you mean folks. But I'm talking here only about buffers which will be allocated after extending shared memory -- they must go through the freelist first (I don't see why not, any other options?), and clock sweep will have a chance to pick them up only afterwards. That makes the freelist sort of an entry point for those buffers.
Hi, On 2025-07-14 11:32:25 +0200, Dmitry Dolgov wrote: > > On Mon, Jul 14, 2025 at 10:24:50AM +0100, Thom Brown wrote: > > On Mon, 14 Jul 2025, 09:54 Dmitry Dolgov, <9erthalion6@gmail.com> wrote: > > > > > > On Mon, Jul 14, 2025 at 01:55:39PM +0530, Ashutosh Bapat wrote: > > > > > You're right of course, a buffer id could be returned from the > > > > > ClockSweep and from the custom strategy buffer ring. Buf from what I > > > see > > > > > those are picking a buffer from the set of already utilized buffers, > > > > > meaning that for a buffer to land there it first has to go through > > > > > StrategyControl->firstFreeBuffer, and hence the idea above will be a > > > > > requirement for those as well. > > > > > > > > That isn't true. A buffer which was never in the free list can still > > > > be picked up by clock sweep. > > > > > > How's that? > > > > > > > Isn't it its job to find usable buffers from the used buffer list when no > > free ones are available? The next victim buffer can be selected (and > > cleaned if dirty) and then immediately used without touching the free list. > > Ah, I see what you mean folks. But I'm talking here only about buffers > which will be allocated after extending shared memory -- they must go > through the freelist first (I don't see why not, any other options?), > and clock sweep will have a chance to pick them up only afterwards. That > makes the freelist sort of an entry point for those buffers. Clock sweep can find any buffer, independent of whether it's on the freelist.
> On Mon, Jul 14, 2025 at 08:56:56AM -0400, Andres Freund wrote: > > Ah, I see what you mean folks. But I'm talking here only about buffers > > which will be allocated after extending shared memory -- they must go > > through the freelist first (I don't see why not, any other options?), > > and clock sweep will have a chance to pick them up only afterwards. That > > makes the freelist sort of an entry point for those buffers. > > Clock sweep can find any buffer, independent of whether it's on the freelist. It does the search based on nextVictimBuffer, where the actual buffer will be a modulo of NBuffers, right? If that's correct and I get everything else right, that would mean as long as NBuffers stays the same (which is the case for the purposes of the current discussion) new buffers, allocated on top of NBuffers after shared memory increase, will not be picked by the clock sweep.
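For reference, a stripped-down version of what ClockSweepTick() in freelist.c does (the real function also handles counter wraparound and completePasses accounting, omitted here); the modulo is why buffer ids at or above a backend's local NBuffers are never returned by the sweep:

    /* Simplified sketch of ClockSweepTick(): victim ids stay below NBuffers. */
    static inline uint32
    ClockSweepTickSketch(void)
    {
        uint64  counter;

        counter = pg_atomic_fetch_add_u64(&StrategyControl->nextVictimBuffer, 1);
        return (uint32) (counter % NBuffers);
    }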
Hi,

On 2025-07-14 15:08:28 +0200, Dmitry Dolgov wrote: > > On Mon, Jul 14, 2025 at 08:56:56AM -0400, Andres Freund wrote: > > > Ah, I see what you mean folks. But I'm talking here only about buffers > > > which will be allocated after extending shared memory -- they must go > > > through the freelist first (I don't see why not, any other options?), > > > and clock sweep will have a chance to pick them up only afterwards. That > > > makes the freelist sort of an entry point for those buffers. > > > > Clock sweep can find any buffer, independent of whether it's on the freelist. > > It does the search based on nextVictimBuffer, where the actual buffer > will be a modulo of NBuffers, right? If that's correct and I get > everything else right, that would mean as long as NBuffers stays the > same (which is the case for the purposes of the current discussion) new > buffers, allocated on top of NBuffers after shared memory increase, will > not be picked by the clock sweep.

Are you telling me that you'd put "new" buffers onto the freelist, before you increase NBuffers? That doesn't make sense.

Orthogonally - there's discussion about simply removing the freelist.

Greetings,

Andres Freund
> On Mon, Jul 14, 2025 at 09:14:26AM -0400, Andres Freund wrote: > > > Clock sweep can find any buffer, independent of whether it's on the freelist. > > > > It does the search based on nextVictimBuffer, where the actual buffer > > will be a modulo of NBuffers, right? If that's correct and I get > > everything else right, that would mean as long as NBuffers stays the > > same (which is the case for the purposes of the current discussion) new > > buffers, allocated on top of NBuffers after shared memory increase, will > > not be picked by the clock sweep. > > Are you tell me that you'd put "new" buffers onto the freelist, before you > increase NBuffers? That doesn't make sense. Why? > Orthogonaly - there's discussion about simply removing the freelist. Good to know, will take a look at that thread, thanks.
Hi,

On 2025-07-14 15:20:03 +0200, Dmitry Dolgov wrote: > > On Mon, Jul 14, 2025 at 09:14:26AM -0400, Andres Freund wrote: > > > > Clock sweep can find any buffer, independent of whether it's on the freelist. > > > > > > It does the search based on nextVictimBuffer, where the actual buffer > > > will be a modulo of NBuffers, right? If that's correct and I get > > > everything else right, that would mean as long as NBuffers stays the > > > same (which is the case for the purposes of the current discussion) new > > > buffers, allocated on top of NBuffers after shared memory increase, will > > > not be picked by the clock sweep. > > > > Are you tell me that you'd put "new" buffers onto the freelist, before you > > increase NBuffers? That doesn't make sense. > > Why?

I think it basically boils down to "That's not how it's supposed to work". If you have buffers that are not in the clock sweep they'll get unfairly high usage counts, as their usecount won't be decremented by the clock sweep. Resulting in those buffers potentially being overly sticky after the s_b resize completed. It breaks the entirely reasonable check to verify that a buffer returned by StrategyGetBuffer() is within the buffer pool. Obviously, if we remove the freelist, not having the clock sweep find the buffer would mean it's unreachable.

What on earth would be the point of putting a buffer on the freelist but not make it reachable by the clock sweep? To me that's just nonsensical.

Greetings,

Andres Freund
> On Mon, Jul 14, 2025 at 09:42:46AM -0400, Andres Freund wrote: > What on earth would be the point of putting a buffer on the freelist but not > make it reachable by the clock sweep? To me that's just nonsensical.

To clarify, we're not talking about this scenario as "that's how it would work after the resize". The point is that to expand shared buffers they need to be initialized, included into the whole buffer machinery (freelist, clock sweep, etc.) and NBuffers has to be updated. Those steps are separated in time, and I'm currently trying to understand what the consequences are of performing them in a different order and whether there are possible concurrency issues under various scenarios. Does this make more sense, or still not?
> On Jul 14, 2025, at 10:01 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > >> On Mon, Jul 14, 2025 at 09:42:46AM -0400, Andres Freund wrote: >> What on earth would be the point of putting a buffer on the freelist but not >> make it reachable by the clock sweep? To me that's just nonsensical. > > To clarify, we're not talking about this scenario as "that's how it > would work after the resize". The point is that to expand shared buffers > they need to be initialized, included into the whole buffer machinery > (freelist, clock sweep, etc.) and NBuffers has to be updated. Those > steps are separated in time, and I'm currently trying to understand what > are the consequences of performing them in different order and whether > there are possible concurrency issues under various scenarios. Does this > make more sense, or still not?

Hello, first off thanks for working on the intricate issues related to resizing shared_buffers. Second, I'm new in this code so take that into account, but I'm the person trying to remove the freelist entirely [1] so I have reviewed this code recently.

I'd initialize them, expand BufferDescriptors, and adjust NBuffers. The clock-sweep algorithm will eventually find them and make use of them. The buf->freeNext should be FREENEXT_NOT_IN_LIST so that StrategyFreeBuffer() will do the work required to append it to the freelist after use. AFAICT there is no need to add to the freelist up front.

best.

-greg

[1] https://postgr.es/m/flat/E2D6FCDC-BE98-4F95-B45E-699C3E17BA10%40burd.me
Hi,

On 2025-07-14 16:01:50 +0200, Dmitry Dolgov wrote: > > On Mon, Jul 14, 2025 at 09:42:46AM -0400, Andres Freund wrote: > > What on earth would be the point of putting a buffer on the freelist but not > > make it reachable by the clock sweep? To me that's just nonsensical. > > To clarify, we're not talking about this scenario as "that's how it > would work after the resize". The point is that to expand shared buffers > they need to be initialized, included into the whole buffer machinery > (freelist, clock sweep, etc.) and NBuffers has to be updated.

It seems pretty obvious to me that the order has to be

1) initialize buffer headers
2) update NBuffers
3) put them onto the freelist

(with 3) hopefully becoming obsolete)

> Those steps are separated in time, and I'm currently trying to understand > what are the consequences of performing them in different order and whether > there are possible concurrency issues under various scenarios. Does this > make more sense, or still not?

I still don't understand why it'd ever make sense to put a buffer onto the freelist before updating NBuffers first.

Greetings,

Andres Freund
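Put as (heavily simplified) code, the coordinator-side expansion in that order could look like the following sketch; InitBufferDescriptor() is a hypothetical stand-in for the per-buffer initialization, ShmemCtrl->NSharedBuffers is the patch's shared counter mentioned later in the thread, and the last step reuses the LinkNewBuffersIntoFreelist() helper sketched earlier:

    /* Sketch only: expansion in the order 1) headers, 2) NBuffers, 3) freelist. */
    static void
    ExpandBufferPool(int old_nbuffers, int new_nbuffers)
    {
        /* 1) initialize the new buffer headers; nothing can reach them yet */
        for (int i = old_nbuffers; i < new_nbuffers; i++)
            InitBufferDescriptor(GetBufferDescriptor(i));   /* hypothetical helper */

        /* 2) publish the new NBuffers, making the range visible to the
         *    clock sweep and to NBuffers-based bounds checks */
        pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, (uint32) new_nbuffers);
        pg_memory_barrier();

        /* 3) only now make the buffers immediately allocatable, e.g. by
         *    linking them into the freelist (a step that disappears if
         *    the freelist is removed) */
        LinkNewBuffersIntoFreelist(old_nbuffers, new_nbuffers - 1);
    }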
> On Mon, Jul 14, 2025 at 10:23:23AM -0400, Andres Freund wrote: > > Those steps are separated in time, and I'm currently trying to understand > > what are the consequences of performing them in different order and whether > > there are possible concurrency issues under various scenarios. Does this > > make more sense, or still not? > > I still don't understand why it'd ever make sense to put a buffer onto the > freelist before updating NBuffers first.

Depending on how NBuffers is updated, different backends may have different values of NBuffers for a short time frame. In that case a scenario I'm trying to address is when one backend with the new NBuffers value allocates a new buffer and puts it into the buffer lookup table, where it could become reachable by another backend, which still has the old NBuffers value. Correct me if I'm wrong, but initializing buffer headers + updating NBuffers means clock sweep can now return one of those new buffers, opening the scenario above, right?
> On Mon, Jul 14, 2025 at 10:22:17AM -0400, Burd, Greg wrote: > I'd initialize them, expand BufferDescriptors, and adjust NBuffers. The > clock-sweep algorithm will eventually find them and make use of them. The > buf->freeNext should be FREENEXT_NOT_IN_LIST so that StrategyFreeBuffer() will > do the work required to append it the freelist after use. AFAICT there is no > need to add to the freelist up front.

Yep, thanks. I think this approach may lead to a problem I'm trying to address with the buffer lookup table (I've just described it in the message above). But if I'm wrong, that of course would be the way to go.
If I understand correctly, putting a new buffer in the freelist before updating NBuffers could break existing logic that calls BufferIsValid(bufnum) and asserts bufnum <= NBuffers? (Since a backend can grab the new buffer and check its validity before the coordinator can add it to the freelist.)

But it seems updating NBuffers before adding new elements to the freelist could be problematic too? Like if a new buffer is already chosen as a victim and then the coordinator adds it to the freelist, would that lead to "double-use"? (Seems possible at least with the current logic and serialization in StrategyGetBuffer.) If that's a valid concern, would something like this work?

1) initialize buffer headers, with a new state/flag to indicate "add-pending"
2) update NBuffers -- add a check in clock-sweep logic for "add-pending" and skip them
3) put them onto the freelist
4) when a new element is grabbed from the freelist, check for and reset the add-pending flag.

This ensures the new element is always obtained from the freelist first, I think.

Jack
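The bounds check Jack refers to is (roughly) the assertion in BufferIsValid() from bufmgr.h, which is what a backend whose local NBuffers is still the old value would trip over when handed a buffer id from the new range:

    /* Roughly as in bufmgr.h: shared buffers are 1..NBuffers, local buffers
     * are negative, InvalidBuffer is 0. */
    static inline bool
    BufferIsValid(Buffer bufnum)
    {
        Assert(bufnum <= NBuffers);
        Assert(bufnum >= -NLocBuffer);

        return bufnum != InvalidBuffer;
    }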
Hi, On July 14, 2025 10:39:33 AM EDT, Dmitry Dolgov <9erthalion6@gmail.com> wrote: >> On Mon, Jul 14, 2025 at 10:23:23AM -0400, Andres Freund wrote: >> > Those steps are separated in time, and I'm currently trying to understand >> > what are the consequences of performing them in different order and whether >> > there are possible concurrency issues under various scenarios. Does this >> > make more sense, or still not? >> >> I still don't understand why it'd ever make sense to put a buffer onto the >> freelist before updating NBuffers first. > >Depending on how NBuffers is updated, different backends may have >different value of NBuffers for a short time frame. In that case a >scenario I'm trying to address is when one backend with the new NBuffers >value allocates a new buffer and puts it into the buffer lookup table, >where it could become reachable by another backend, which still has the >old NBuffer value. Correct me if I'm wrong, but initializing buffer >headers + updating NBuffers means clock sweep can now return one of >those new buffers, opening the scenario above, right? The same is true if you put buffers into the freelist. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Just brain-storming here... would moving NBuffers to shared memory solve this specific issue? Though I'm pretty sure that would open up a new set of synchronization issues elsewhere, so I'm not sure if there's a net gain.

Jack
> On Mon, Jul 14, 2025 at 11:11:36AM -0400, Andres Freund wrote: > Hi, > > On July 14, 2025 10:39:33 AM EDT, Dmitry Dolgov <9erthalion6@gmail.com> wrote: > >> On Mon, Jul 14, 2025 at 10:23:23AM -0400, Andres Freund wrote: > >> > Those steps are separated in time, and I'm currently trying to understand > >> > what are the consequences of performing them in different order and whether > >> > there are possible concurrency issues under various scenarios. Does this > >> > make more sense, or still not? > >> > >> I still don't understand why it'd ever make sense to put a buffer onto the > >> freelist before updating NBuffers first. > > > >Depending on how NBuffers is updated, different backends may have > >different value of NBuffers for a short time frame. In that case a > >scenario I'm trying to address is when one backend with the new NBuffers > >value allocates a new buffer and puts it into the buffer lookup table, > >where it could become reachable by another backend, which still has the > >old NBuffer value. Correct me if I'm wrong, but initializing buffer > >headers + updating NBuffers means clock sweep can now return one of > >those new buffers, opening the scenario above, right? > > The same is true if you put buffers into the freelist. Yep, but the question about clock sweep still stays. Anyway, thanks for the input, let me digest it and come up with more questions & patch series.
> On Mon, Jul 14, 2025 at 03:18:10PM +0000, Jack Ng wrote: > Just brain-storming here... would moving NBuffers to shared memory solve this specific issue? Though I'm pretty sure that would open up a new set of synchronization issues elsewhere, so I'm not sure if there's a net gain.

It's in fact already happening, there is a shared structure that describes the resize status. But if I get everything right, it doesn't solve all the problems.
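For orientation, the shared state being referred to looks roughly like the following; NSharedBuffers and evictor_pid are field names that appear elsewhere in the thread, while the remaining fields are guesses at what such a struct could carry:

    /* Sketch only: approximate shape of the shared resize-coordination state,
     * accessed through a pointer like "ShmemCtrl" in the patch. */
    typedef struct ShmemControl
    {
        pg_atomic_uint32 NSharedBuffers;   /* number of currently usable buffers */
        pg_atomic_uint32 resize_state;     /* e.g. idle / evicting / remapping (assumed) */
        pid_t            evictor_pid;      /* process driving a shrink, if any */
        Barrier          resize_barrier;   /* dynamic Barrier (storage/barrier.h)
                                            * where participants synchronize */
    } ShmemControl;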
> On Fri, Jul 04, 2025 at 02:06:16AM +0200, Tomas Vondra wrote:
> 10) what to do about stuck resize?
>
> AFAICS the resize can get stuck for various reasons, e.g. because it
> can't evict pinned buffers, possibly indefinitely. Not great, it's not
> clear to me if there's a way out (canceling the resize) after a timeout,
> or something like that? Not great to start an "online resize" only to
> get stuck with all activity blocked for indefinite amount of time, and
> get to restart anyway.
>
> Seems related to Thomas' message [2], but AFAICS the patch does not do
> anything about this yet, right? What's the plan here?
It's another open discussion right now, with an idea to eventually allow
canceling after a timeout. I think canceling when stuck on buffer
eviction should be pretty straightforward (the eviction must take place
before the actual shared memory resize, so we know nothing has changed yet),
but in some other failure scenarios it would be harder (e.g. if one
backend is stuck resizing, while others have succeeded -- this would
require another round of synchronization and some way to figure out what
is the current status).
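Since eviction happens before the actual remap, the "stuck on buffer eviction" case can indeed fail fast without any shared memory rollback; a minimal sketch of that behaviour (the function name and range bounds are hypothetical, and the flush-and-invalidate part is only hinted at):

    /* Sketch only: abort the shrink as soon as a pinned buffer is found. */
    static void
    EvictExtraBuffersOrFail(int first, int last)
    {
        for (int i = first; i <= last; i++)
        {
            BufferDesc *buf = GetBufferDescriptor(i);
            uint32      state = LockBufHdr(buf);

            if (BUF_STATE_GET_REFCOUNT(state) > 0)
            {
                UnlockBufHdr(buf, state);
                ereport(ERROR,
                        (errmsg("cannot shrink shared_buffers: buffer %d is pinned",
                                i + 1)));
            }
            UnlockBufHdr(buf, state);

            /* ... flush if dirty and drop from the lookup table, much like
             * the existing EvictUnpinnedBuffer() does ... */
        }
    }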
>> On Mon, Jul 14, 2025 at 03:18:10PM +0000, Jack Ng wrote: >> Just brain-storming here... would moving NBuffers to shared memory solve >this specific issue? Though I'm pretty sure that would open up a new set of >synchronization issues elsewhere, so I'm not sure if there's a net gain. > >It's in fact already happening, there is a shared structure that described the >resize status. But if I get everything right, it doesn't solve all the problems.

Hi Dmitry,

Just to clarify, you're not only referring to the ShmemControl::NSharedBuffers and related logic in the current patches, but actually getting rid of per-process NBuffers completely and using ShmemControl::NSharedBuffers everywhere instead (or something along those lines)? So that when the coordinator updates ShmemControl::NSharedBuffers, everyone sees the new value right away. I guess this is part of the "simplified design" you mentioned several posts earlier?

I also thought about that approach more, and there seem to be new synchronization issues we would need to deal with, like:

1. Mid-execution change of NBuffers in functions like BufferSync and BgBufferSync, which could cause correctness and performance issues. I suppose most of them are solvable with atomics and shared r/w locks etc, but at the cost of higher performance overheads.

2. NBuffers becomes inconsistent with the underlying shared memory mappings for a period of time for each process. Currently both are updated in AnonymousShmemResize and AdjustShmemSize "atomically" for a process, so I wonder if letting them get out-of-sync (even for a brief period) could be problematic.

I agree it doesn't seem to solve all the problems. It can simplify certain aspects of the design, but may also introduce new issues. Overall not a "silver bullet" :)

Jack
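Regarding point 1, one way to sidestep a mid-scan change of the bound is to read the shared value exactly once per pass, as in this sketch (the body of the real BufferSync() is elided; ShmemCtrl->NSharedBuffers is the patch's shared counter):

    /* Sketch only: pin the loop bound for the whole checkpoint scan. */
    static void
    BufferSyncSketch(void)
    {
        int     nbuffers = (int) pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers);

        for (int buf_id = 0; buf_id < nbuffers; buf_id++)
        {
            BufferDesc *bufHdr = GetBufferDescriptor(buf_id);

            /* ... BM_CHECKPOINT_NEEDED marking and the write loop as in
             * the real BufferSync() ... */
            (void) bufHdr;
        }
    }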
> On Tue, Jul 15, 2025 at 10:52:01PM +0000, Jack Ng wrote: > >> On Mon, Jul 14, 2025 at 03:18:10PM +0000, Jack Ng wrote: > >> Just brain-storming here... would moving NBuffers to shared memory solve > >this specific issue? Though I'm pretty sure that would open up a new set of > >synchronization issues elsewhere, so I'm not sure if there's a net gain. > > > >It's in fact already happening, there is a shared structure that described the > >resize status. But if I get everything right, it doesn't solve all the problems. > > Just to clarify, you're not only referring to the ShmemControl::NSharedBuffers > and related logic in the current patches, but actually getting rid of per-process > NBuffers completely and use ShmemControl::NSharedBuffers everywhere instead (or > something along those lines)? So that when the coordinator updates > ShmemControl::NSharedBuffers, everyone sees the new value right away. > I guess this is part of the "simplified design" you mentioned several posts earlier?

I was thinking more about something like NBuffersAvailable, which would control how victim buffers are getting picked, but there is a spectrum of different options to experiment with.

> I also thought about that approach more, and there seems to be new synchronization > issues we would need to deal with, like:

A potentially tricky change of NBuffers already happens in the current patch set, e.g. NBuffers is getting updated in ProcessProcSignalBarrier, which is called at the end of a BufferSync loop iteration. By itself I don't see any obvious problems here except remembering the buffer id in CkptBufferIds (I've mentioned this a few messages above).
> On Mon, Jul 14, 2025 at 05:55:13PM -0500, Jim Nasby wrote: > > Finally, while shared buffers is the most visible target here, there are > other shared memory settings that have a *much* smaller surface area, and > in my experience are going to be much more valuable from a tuning > perspective; notably wal_buffers and the MXID SLRUs (and possibly CLOG and > subtrans). I say that because unless you're running a workload that > entirely fits in shared buffers, or a *really* small shared buffers > compared to system memory, increasing shared buffers quickly gets into > diminishing returns. But since the default size for the other fixed sized > areas is so much smaller than normal values for shared_buffers, increasing > those areas can have a much, much larger impact on performance. (Especially > for something like the MXID SLRUs.) I would certainly consider focusing on > one of those areas before trying to tackle shared buffers.

That's an interesting idea, thanks for sharing. The reason I'm concentrating on shared buffers is that it was frequently called out as a problem when trying to tune PostgreSQL automatically. In this context shared buffers is usually one of the most impactful knobs, yet one of the most painful to manage as well. But if the amount of complexity around resizable shared buffers proves insurmountable, yeah, it would make sense to consider simpler targets using the same mechanism.
Hi,

On 2025-07-14 17:55:13 -0500, Jim Nasby wrote: > I say that because unless you're running a workload that entirely fits in > shared buffers, or a *really* small shared buffers compared to system > memory, increasing shared buffers quickly gets into diminishing returns.

I don't think that's true, at all, today. And it certainly won't be true in a world where we will be able to use direct_io for real workloads.

Particularly for write heavy workloads, the difference between a small buffer pool and a large one can be *dramatic*, because the large buffer pool allows most writes to be done by checkpointer (and thus largely sequentially) rather than by backends and bgwriter (and thus largely randomly). Doing more writes sequentially helps with short-term performance, but *particularly* helps with sustained performance on SSDs. A larger buffer pool also reduces the *total* number of writes dramatically, because the same buffer will often be dirtied repeatedly within one checkpoint window.

r/w pgbench is a workload that *undersells* the benefit of a larger shared_buffers, as each transaction is uncommonly small, making WAL flushes much more of a bottleneck (the access pattern is too uniform, too). But even for that the difference can be massive:

A scale 500 pgbench with 48 clients:

s_b= 512MB: averages 390MB/s of writes in steady state, average TPS: 25072
s_b=8192MB: averages 48MB/s of writes in steady state, average TPS: 47901

Nearly an order of magnitude difference in writes and nearly a 2x difference in TPS.

25%, the advice we give for shared_buffers, is literally close to the worst possible value. The only thing it maximizes is double buffering, while removing useful information about what to cache, and for how long, from both postgres and the OS, leading to reduced cache hit rates.

> But since the default size for the other fixed sized areas is so much > smaller than normal values for shared_buffers, increasing those areas can > have a much, much larger impact on performance. (Especially for something > like the MXID SLRUs.) I would certainly consider focusing on one of those > areas before trying to tackle shared buffers.

I think that'd be a bad idea. There's simply no point in having the complexity in place to allow for dynamically resizing a few megabytes of buffers. You just configure them large enough (including probably increasing some of the defaults one of these years). Whereas you can't just do that for shared_buffers, as we're talking about real amounts of memory. Ahead of time you do not know how much memory backends themselves need and the amount of memory in the system may change.

Resizing shared_buffers is particularly important because it's becoming more important to be able to dynamically increase/decrease the resources of a running postgres instance to adjust for system load. Memory and CPUs can be hot added/removed from VMs, but we need to utilize them...

Greetings,

Andres Freund
Hi Tomas,

Thanks for your detailed feedback. Sorry for replying late.

On Fri, Jul 4, 2025 at 5:36 AM Tomas Vondra <tomas@vondra.me> wrote: > > v5-0008-Support-shrinking-shared-buffers.patch > > 1) Why is ShmemCtrl->evictor_pid reset in AnonymousShmemResize? Isn't > there a place starting it and waiting for it to complete? Why > shouldn't it do EvictExtraBuffers itself? > > 3) Seems a bit strange to do it from a random backend. Shouldn't it > be the responsibility of a process like checkpointer/bgwriter, or > maybe a dedicated dynamic bgworker? Can we even rely on a backend > to be available?

I will answer these two together. I don't think we should rely on a random backend, but that's what the rest of the patches did and the patches to support shrinking followed them. AFAIK, Dmitry is working on a set of changes which will make a non-postmaster backend the coordinator of the buffer pool resizing process. When that happens, the same backend which initializes the expanded memory when expanding the buffer pool should also be responsible for evicting the buffers when shrinking the buffer pool. I will wait for Dmitry's next set of patches before making this change.

> > 4) Unsolved issues with buffers pinned for a long time. Could be an > issue if the buffer is pinned indefinitely (e.g. cursor in idle > connection), and the resizing blocks some activity (new connections > or stuff like that).

In such cases we should cancel the operation or kill that backend (per user preference) after a (user-specified) timeout >= 0. We haven't yet figured out the details. I think the first version of the feature would just cancel the operation if it encounters a pinned buffer.

> 2) Isn't the change to BufferManagerShmemInit wrong? How do we know the > last buffer is still at the end of the freelist? Seems unlikely. > 6) It's not clear to me in what situations this triggers (in the call > to BufferManagerShmemInit) > > if (FirstBufferToInit < NBuffers) ... >

I will answer these two together. As the comment says, FirstBufferToInit < NBuffers indicates two situations: when FirstBufferToInit = 0, it's the first time the buffer pool is being initialized; otherwise it indicates expanding the buffer pool, in which case the last buffer will be a newly initialized buffer. All newly initialized buffers are linked into the freelist one after the other in the increasing order of their buffer ids by code a few lines above. Now that the free buffer list has been removed, we don't need to worry about it. In the next set of patches, I have removed this code.

> > v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch > > 1) IMHO this should be included in the earlier resize/shrink patches, > I don't see a reason to keep it separate (assuming this is the > correct way, and the "init" is not).

These patches are separate just because Dmitry and I developed them separately. Once they are reviewed by Dmitry, we will squash them into a single patch. I am expecting that Dmitry's next patchset, which will make significant changes to the synchronization, will have a single patch for all code related to and consequential to resizing.

> > 5) Funny that "AI suggests" something, but doesn't the block fail to > reset nextVictimBuffer of the clocksweep? It may point to a buffer > we're removing, and it'll be invalid, no? >

The TODO no longer applies. There's code to reset the clocksweep in a separate patch. Sorry for not removing it earlier. It will be removed in the next set of patches.
> > 2) Doesn't StrategyPurgeFreeList already do some of this for the case > of shrinking memory? > > 3) Not great adding a bunch of static variables to bufmgr.c. Why do we > need to make "everything" static global? Isn't it enough to make > only the "valid" flag global? The rest can stay local, no? > > If everything needs to be global for some reason, could we at least > make it a struct, to group the fields, not just separate random > variables? And maybe at the top, not half-way throught the file? > > 4) Isn't the name BgBufferSyncAdjust misleading? It's not adjusting > anything, it's just invalidating the info about past runs. I think there's a bit of refactoring possible here. Setting up the BgBufferSync state, resetting it when bgwriter_lru_maxpages <= 0 and then re initializing it when bgwriter_lru_maxpages > 0, and actually performing the buffer sync is all packed into the same function BgBufferSync() right now. It makes this function harder to read. I think these functionalities should be separated into their own functions and use the appropriate one instead of BgBufferSyncAdjust(), whose name is misleading. The static global variables should all be packed into a structure which is passed as an argument to these functions. I need more time to study the code and refactor it that way. For now I have added a note to the commit message of this patch so that I will revisit it. I have renamed BgBufferSyncAdjust() to BgBufferSyncReset(). > > 5) I don't quite understand why BufferSync needs to do the dance with > delay_shmem_resize. I mean, we certainly should not run BufferSync > from the code that resizes buffers, right? Certainly not after the > eviction, from the part that actually rebuilds shmem structs etc. Right. But let me answer all three questions together. > So perhaps something could trigger resize while we're running the > BufferSync()? Isn't that a bit strange? If this flag is needed, it > seems more like a band-aid for some issue in the architecture. > > 6) Also, why should it be fine to get into situation that some of the > buffers might not be valid, during shrinking? I mean, why should > this check (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers). > It seems better to ensure we never get into "sync" in a way that > might lead some of the buffers invalid. Seems way too lowlevel to > care about whether resize is happening. > > 7) I don't understand the new condition for "Execute the LRU scan". > Won't this stop LRU scan even in cases when we want it to happen? > Don't we want to scan the buffers in the remaining part (after > shrinking), for example? Also, we already checked this shmem flag at > the beginning of the function - sure, it could change (if some other > process modifies it), but does that make sense? Wouldn't it cause > problems if it can change at an arbitrary point while running the > BufferSync? IMHO just another sign it may not make sense to allow > this, i.e. buffer sync should not run during the "actual" resize. > ProcessBarrierShmemResize() which does the resizing is part of ProcessProcSignalBarrier() which in turn gets called from CHECK_FOR_INTERRUPTS(), which is called from multiple places, even from elog(). I am not able to find a call stack linking BgBufferSync() and ProcessProcSignalBarrier(). But I couldn't convince myself that it is true and will remain true in the future. I mean, the function loops through a large number of buffers and performs IO, both avenues to call CHECK_FOR_INTERRUPTS(). Hence that flag. 
Do you know what (part of the code) guarantees that ProcessProcSignalBarrier() will never be called from BgBufferSync()? Note, resizing cannot begin till delay_shmem_resize is cleared, so while BgBufferSync is executing, no buffer can be invalidated and no new buffers can be added. But that comes at the cost of all other backends waiting till BgBufferSync finishes. We want to avoid that. The idea here is to make BgBufferSync stop as soon as it realises that the buffer resizing is "about to begin". But I think the condition looks wrong. I think the right condition would be NBufferPending != NBuffers or NBuffersOld. AFAIK, Dmitry is working on consolidating the NBuffers* variables as you have requested elsewhere. Better even if we could somehow set a flag in shared memory indicating that the buffer resizing is "about to begin" and BgBufferSync() checks that flag. So I will wait for him to make that change and then change this condition.

> > v5-0010-Additional-validation-for-buffer-in-the-ring.patch > > 1) So the problem is we might create a ring before shrinking shared > buffers, and then GetBufferFromRing will see bogus buffers? OK, but > we should be more careful with these checks, otherwise we'll miss > real issues when we incorrectly get an invalid buffer. Can't the > backends do this only when they for sure know we did shrink the > shared buffers? Or maybe even handle that during the barrier? > > 2) IMHO a sign there's the "transitions" between different NBuffers > values may not be clear enough, and we're allowing stuff to happen > in the "blurry" area. I think that's likely to cause bugs (it did > cause issues for the online checksums patch, I think). >

I think you are right that this might hide some bugs. Just like we remove buffers to be shrunk from the freelist only once, I wanted each backend to remove them from its buffer rings only once. But I couldn't find a way to make all the buffer rings for a given backend available to the barrier handling code. The rings are stored in Scan objects, which seem local to the executor nodes. Is there a way to make them available to the barrier handling code (even if it has to walk an execution tree, let's say)? If there were only one scan, we could set a flag after shrinking, let GetBufferFromRing() purge all invalid buffers once when the flag is true, and reset the flag. But there can be more than one scan happening and we don't know how many there are and when all of them have finished calling GetBufferFromRing() after shrinking. Suggestions to do this only once are welcome.

I will send the next set of patches with my next email.

-- Best Wishes, Ashutosh Bapat
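The v5-0010 check under discussion amounts to something like the following inside freelist.c's GetBufferFromRing() path -- a sketch only, using the ring fields as they exist in BufferAccessStrategyData; a stale slot left over from before a shrink is simply forgotten rather than treated as an error:

    /* Sketch only: drop ring entries that no longer fit in the buffer pool. */
    static Buffer
    ValidateRingSlot(BufferAccessStrategy strategy)
    {
        Buffer  bufnum = strategy->buffers[strategy->current];

        if (bufnum != InvalidBuffer && bufnum > NBuffers)
        {
            /* buffer id from before the shrink; forget it and fall back
             * to the normal allocation path */
            strategy->buffers[strategy->current] = InvalidBuffer;
            return InvalidBuffer;
        }
        return bufnum;
    }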
On Mon, Jun 16, 2025 at 6:09 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > > > > Buffer lookup table resizing > > ------------------------------------

I looked at the interaction of the shared buffer lookup table with buffer resizing as per the patches in [0]. Here's a list of my findings, issues and fixes.

1. The basic structure of the buffer lookup table (directory and control area etc.) is allocated in a shared memory segment dedicated to the buffer lookup table. However, the entries are allocated in shared memory using ShmemAllocNoError(), which allocates the entries in the main memory segment. In order for ShmemAllocNoError() to allocate entries in the dedicated shared memory segment, it needs to know that segment. We could do that by setting the segment number in element_alloc() before calling hashp->alloc(). This is similar to how ShmemAllocNoError() knows the memory context in which to allocate the entries on heap. But read on ...

2. When the buffer pool is expanded, an "out of shared memory" error is thrown when more entries are added to the buffer lookup table. We could temporarily adjust that flag and allocate more entries. But the directory also needs to be expanded proportionately, otherwise it may lead to more contention. Expanding the directory is non-trivial since it's a contiguous chunk of memory, followed by other data structures. Further, expanding the directory would require rehashing all the existing entries, which may impact the time taken by the resizing operation and how long other backends remain blocked.

3. When the buffer pool is shrunk, there is no way to free the extra entries in such a way that a contiguous chunk of shared memory can be given back to the OS. In case we implement it, we will need some way to compact the shrunk entries into a contiguous chunk of memory and unmap the remaining chunk. That's some significant code.

Given these things, I think we should set up the buffer lookup table to hold the maximum number of entries required to expand the buffer pool to its maximum, right at the beginning. The maximum size to which the buffer pool can grow is given by the GUC max_available_memory (which is a misnomer and should be renamed to max_shared_buffers or something), introduced by a previous set of patches [0]. We don't shrink or expand the buffer lookup table as we shrink and expand the buffer pool. With that, the buffer lookup table can be located in the main memory segment itself and we don't have to fix ShmemAllocNoError(). This has two side effects:

1. A larger hash table makes hash table operations slower [2]. Its impact on actual queries needs to be studied.

2. There's an increase in the total shared memory allocated upfront. Currently we allocate 150MB of memory with all default GUC values. With this change we will allocate 250MB of memory, since max_available_memory (or rather max_shared_buffers) defaults to allow 524288 shared buffers. If we make max_shared_buffers default to shared_buffers, it won't be a problem. However, when a user sets max_shared_buffers themselves, they have to be conscious of the fact that it will allocate more memory than necessary with a given shared_buffers value.

This fix is part of patch 0015. The patchset contains more fixes and improvements as described below.
Per the TODO in the prologue of CalculateShmemSize(), more shared memory than necessary was mapped and allocated in the buffer manager related memory segments because of an error in that function; the amount of memory to be allocated in the main shared memory segment was added to every other shared memory segment. Thus shrinking those memory segments didn't actually affect the objects allocated in them. Because of that, we were not seeing SIGBUS even when the objects supposedly shrunk were accessed, masking bugs in the patches. In this patchset I have a working fix for CalculateShmemSize(). With that fix in place we see the server crashing with SIGBUS in some resizing operations. Those cases need to be investigated. The fix changes its minions to a. return the size of shared memory objects to be allocated in the main memory segment and b. add the sizes of the shared memory objects to be allocated in other memory segments to the respective AnonymousMapping structures. This asymmetry between the main segment and the other segments exists so as not to change the minions of CalculateShmemSize() too much. But I think we should eliminate the asymmetry and change every minion to add sizes to the respective segment's AnonymousMapping structure. The patch proposed at [3] would simplify CalculateShmemSize(), which should help eliminate the asymmetry. Along with refactoring CalculateShmemSize() I have added small fixes to update the total size and end address of shared memory mappings after resizing them, and also to update the new allocated_sizes of resized structures in the ShmemIndex entry. Patch 0009 includes these changes.

I found that the shared memory resizing synchronization is triggered even before setting up the shared buffers the first time after starting the server. That's not required and also can lead to issues because of trying to resize shared buffers which do not exist. A WIP fix is included as patch 0012. A TODO in the patch needs to be addressed. It should be squashed into an earlier patch 0011 when appropriate.

While debugging the above mentioned issues, I found it useful to have an insight into the contents of the buffer lookup table. Hence I added a system view exposing the contents of the buffer lookup table. This is added as patch 0001 in the attached patchset. I think it's useful to have this independent of this patchset, to investigate inconsistencies between the contents of the shared buffer pool and the buffer lookup table.

Again for debugging purposes, I have added a new column "segment" in pg_shmem_allocations reporting the shared memory segment in which the given allocation has happened. I have also added another view pg_shmem_segments to provide information about the shared memory segments. This view definition will change as we design shared memory mappings and shared memory segments better. So it's WIP and needs doc changes as well. I have included it in the patchset as patch 0011 since it will be helpful to debug issues found in the patch when testing. The patch should be merged into patch 0007.

Last but not least, patch 0016 contains two tests: a. a stress test that runs buffer resizing while pgbench is running, and b. a SQL test for the sizes of segments and shared memory allocations after resizing. The stress test polls "show shared_buffers" output to know when the resizing is finished. I think we need a better interface to know when resizing has finished. Thanks a lot to my colleague Palak Chaturvedi for providing the initial draft of the test case.
The patches are rebased on top of the latest master, which includes the changes to remove the free buffer list. That led to removing all the code in these patches dealing with the free buffer list. I am intentionally keeping my changes (patches 0001, 0008 to 0012, 0012 to 0016) separate from Dmitry's changes so that Dmitry can review them easily. The patches are arranged so that my patches are nearer to Dmitry's patches, into which they should be squashed.

Dmitry, I found that max_available_memory is PGC_SIGHUP. Is that intentional? I thought it should be PGC_POSTMASTER, since we can not reserve more address space without restarting the postmaster. I left a TODO for this. I think we also need to change the name and description to better reflect its actual functionality.

[0] https://www.postgresql.org/message-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2
[1] https://www.postgresql.org/message-id/CAExHW5v0jh3F_wj86yC%3DqBfWk0uiT94qy%3DZ41uzAHLHh0SerRA%40mail.gmail.com
[2] https://ashutoshpg.blogspot.com/2025/07/efficiency-of-sparse-hash-table.html
[3] https://commitfest.postgresql.org/patch/5997/

--
Best Wishes,
Ashutosh Bapat
Attachments
- 0004-Introduce-pss_barrierReceivedGeneration-20250918.patch
- 0002-Process-config-reload-in-AIO-workers-20250918.patch
- 0003-Introduce-pending-flag-for-GUC-assign-hooks-20250918.patch
- 0005-Allow-to-use-multiple-shared-memory-mapping-20250918.patch
- 0001-Add-system-view-for-shared-buffer-lookup-ta-20250918.patch
- 0006-Address-space-reservation-for-shared-memory-20250918.patch
- 0008-Fix-compilation-failures-from-previous-comm-20250918.patch
- 0007-Introduce-multiple-shmem-segments-for-share-20250918.patch
- 0009-Refactor-CalculateShmemSize-20250918.patch
- 0010-WIP-Monitoring-views-20250918.patch
- 0013-Update-sizes-and-addresses-of-shared-memory-20250918.patch
- 0011-Allow-to-resize-shared-memory-without-resta-20250918.patch
- 0012-Initial-value-of-shared_buffers-or-NBuffers-20250918.patch
- 0014-Support-shrinking-shared-buffers-20250918.patch
- 0015-Reinitialize-StrategyControl-after-resizing-20250918.patch
- 0016-Tests-for-dynamic-shared_buffers-resizing-20250918.patch
Hi,

On 2025-09-18 10:25:29 +0530, Ashutosh Bapat wrote:

> From d1ed934ccd02fca2c831e582b07a169e17d19f59 Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6@gmail.com>
> Date: Tue, 17 Jun 2025 15:14:33 +0200
> Subject: [PATCH 02/16] Process config reload in AIO workers

I think this is superfluous due to b8e1f2d96bb9.

> Currenly AIO workers process interrupts only via CHECK_FOR_INTERRUPTS, which does not include ConfigReloadPending. Thus we need to check for it explicitly.

> +/*
> + * Process any new interrupts.
> + */
> +static void
> +pgaio_worker_process_interrupts(void)
> +{
> +    /*
> +     * Reloading config can trigger further signals, complicating interrupts
> +     * processing -- so let it run first.
> +     *
> +     * XXX: Is there any need in memory barrier after ProcessConfigFile?
> +     */
> +    if (ConfigReloadPending)
> +    {
> +        ConfigReloadPending = false;
> +        ProcessConfigFile(PGC_SIGHUP);
> +    }
> +
> +    if (ProcSignalBarrierPending)
> +        ProcessProcSignalBarrier();
> +}

Given that even before b8e1f2d96bb9 method_worker.c used CHECK_FOR_INTERRUPTS(), which contains a ProcessProcSignalBarrier(), I don't know why that second check was added here?

> From 0a13e56dceea8cc7a2685df7ee8cea434588681b Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6@gmail.com>
> Date: Sun, 6 Apr 2025 16:40:32 +0200
> Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks
>
> Currently an assing hook can perform some preprocessing of a new value, but it cannot change the behavior, which dictates that the new value will be applied immediately after the hook. Certain GUC options (like shared_buffers, coming in subsequent patches) may need coordinating work between backends to change, meaning we cannot apply it right away.
>
> Add a new flag "pending" for an assign hook to allow the hook indicate exactly that. If the pending flag is set after the hook, the new value will not be applied and it's handling becomes the hook's implementation responsibility.

I doubt it makes sense to add this to the GUC system. I think it'd be better to just use the GUC value as the desired "target" configuration and have a function or a show-only GUC for reporting the current size.

I don't think you can just block application of the GUC until the resize is complete. E.g. what if the value was too big and the new configuration needs to be fixed to be lower?

> From 0a55bc15dc3a724f03e674048109dac1f248c406 Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6@gmail.com>
> Date: Fri, 4 Apr 2025 21:46:14 +0200
> Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration
>
> Currently WaitForProcSignalBarrier allows to make sure the message sent via EmitProcSignalBarrier was processed by all ProcSignal mechanism participants.
>
> Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration, which will be updated when a process has received the message, but not processed it yet. This makes it possible to support a new mode of waiting, when ProcSignal participants want to synchronize message processing. To do that, a participant can wait via WaitForProcSignalBarrierReceived when processing a message, effectively making sure that all processes are going to start processing ProcSignalBarrier simultaneously.

I doubt "online resizing" that requires synchronously processing the same event can really be called "online". There can be significant delays in processing a barrier; stalling the entire server until the barrier is reached seems like a complete no-go for production systems?
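For reference, the synchronized start being questioned here boils down to roughly the sketch below. The pss_barrierReceivedGeneration field and the wait loop are the patch's proposed additions, paraphrased rather than copied; ProcSignal, NumProcSignalSlots, pg_atomic_read_u64() and pg_usleep() exist today.

    /*
     * Paraphrased sketch of the proposed "wait until everyone has received
     * the barrier" step.  Real code would skip unused slots and wait on a
     * condition variable instead of sleeping in a loop.
     */
    static void
    WaitForProcSignalBarrierReceivedSketch(uint64 generation)
    {
        for (int i = 0; i < NumProcSignalSlots; i++)
        {
            ProcSignalSlot *slot = &ProcSignal->psh_slot[i];

            while (pg_atomic_read_u64(&slot->pss_barrierReceivedGeneration) < generation)
                pg_usleep(1000L);   /* 1ms */
        }
    }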
> From 63fe27340656c52b13f4eecebd9e73d24efe5e33 Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6@gmail.com>
> Date: Fri, 28 Feb 2025 19:54:47 +0100
> Subject: [PATCH 05/16] Allow to use multiple shared memory mappings
>
> Currently all the work with shared memory is done via a single anonymous memory mapping, which limits ways how the shared memory could be organized.
>
> Introduce possibility to allocate multiple shared memory mappings, where a single mapping is associated with a specified shared memory segment. There is only fixed amount of available segments, currently only one main shared memory segment is allocated. A new shared memory API is introduces, extended with a segment as a new parameter. As a path of least resistance, the original API is kept in place, utilizing the main shared memory segment.

> -#define MAX_ON_EXITS 20
> +#define MAX_ON_EXITS 40

Why does a patch like this contain changes like this mixed in with the rest? That's clearly not directly related to $subject.

> /* shared memory global variables */
>
> -static PGShmemHeader *ShmemSegHdr;    /* shared mem segment header */
> +ShmemSegment Segments[ANON_MAPPINGS];
>
> -static void *ShmemBase;       /* start address of shared memory */
>
> -static void *ShmemEnd;        /* end+1 address of shared memory */
>
> -slock_t *ShmemLock;           /* spinlock for shared memory and LWLock
> -                               * allocation */
>
> -static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */

Why do we need a separate ShmemLock for each segment? Besides being unnecessary, it seems like that prevents locking in a way that provides consistency across all segments.

> From e2f48da8a8206711b24e34040d699431910fbf9c Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6@gmail.com>
> Date: Tue, 17 Jun 2025 11:47:04 +0200
> Subject: [PATCH 06/16] Address space reservation for shared memory
>
> Currently the shared memory layout is designed to pack everything tight together, leaving no space between mappings for resizing. Here is how it looks like for one mapping in /proc/$PID/maps, /dev/zero represents the anonymous shared memory we talk about:
>
> 00400000-00490000 /path/bin/postgres
> ...
> 012d9000-0133e000 [heap]
> 7f443a800000-7f470a800000 /dev/zero (deleted)
> 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
> 7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
> ...
>
> Make the layout more dynamic via splitting every shared memory segment into two parts:
>
> * An anonymous file, which actually contains shared memory content. Such an anonymous file is created via memfd_create, it lives in memory, behaves like a regular file and semantically equivalent to an anonymous memory allocated via mmap with MAP_ANONYMOUS.
>
> * A reservation mapping, which size is much larger than required shared segment size. This mapping is created with flags PROT_NONE (which makes sure the reserved space is not used), and MAP_NORESERVE (to not count the reserved space against memory limits). The anonymous file is mapped into this reservation mapping.

The commit message fails to explain why, if we're already relying on MAP_NORESERVE, we need anything else. Why can't we just have one maximally sized allocation that's marked MAP_NORESERVE for all the parts that we don't yet need?

> There are also few unrelated advantages of using anon files:
>
> * We've got a file descriptor, which could be used for regular file operations (modification, truncation, you name it).

What is this an advantage for?
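To make the quoted reservation scheme easier to picture, here is a standalone sketch using only plain Linux calls (memfd_create, ftruncate, mmap). The sizes and the segment name are arbitrary examples, error handling is minimal, and the actual patch is organized differently.

    /* cc -o reserve reserve.c  --  Linux-only sketch of the reservation idea */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t  reserve_size = (size_t) 8 * 1024 * 1024 * 1024;    /* 8GB of address space */
        size_t  initial_size = 256 * 1024 * 1024;                  /* 256MB actually backed */

        /* 1. Anonymous in-memory file that holds the shared memory content. */
        int     fd = memfd_create("buffers", 0);

        if (fd < 0 || ftruncate(fd, initial_size) < 0)
        {
            perror("memfd_create/ftruncate");
            return 1;
        }

        /* 2. Large PROT_NONE, MAP_NORESERVE reservation: address space only. */
        char   *base = mmap(NULL, reserve_size, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

        if (base == MAP_FAILED)
        {
            perror("mmap reservation");
            return 1;
        }

        /* 3. Map the file over the start of the reservation. */
        if (mmap(base, initial_size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
        {
            perror("mmap file");
            return 1;
        }

        memset(base, 0, initial_size);      /* usable shared memory */
        printf("reserved %zu bytes at %p, %zu backed\n",
               reserve_size, (void *) base, initial_size);
        return 0;
    }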
> * The file could be given a name, which improves readability when it comes to process maps.
>
> * By default, Linux will not add file-backed shared mappings into a core dump, making it more convenient to work with them in PostgreSQL: no more huge dumps to process.

That's just as well a downside, because you now can't investigate some issues. This was already configurable via coredump_filter.

> From 942b69a0876b0e83303e6704da54c4c002a5a2d8 Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6@gmail.com>
> Date: Tue, 17 Jun 2025 11:22:02 +0200
> Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers
>
> Add more shmem segments to split shared buffers into following chunks:
> * BUFFERS_SHMEM_SEGMENT: contains buffer blocks
> * BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
> * BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
> * CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
> * STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

Why do all these need to be separate segments? Afaict we'll have to maximally size everything other than BUFFERS_SHMEM_SEGMENT at start?

> From 78bc0a49f8ebe17927abd66164764745ecc6d563 Mon Sep 17 00:00:00 2001
> From: Dmitrii Dolgov <9erthalion6@gmail.com>
> Date: Tue, 17 Jun 2025 14:16:55 +0200
> Subject: [PATCH 11/16] Allow to resize shared memory without restart
>
> Add assing hook for shared_buffers to resize shared memory using space, introduced in the previous commits without requiring PostgreSQL restart. Essentially the implementation is based on two mechanisms: a ProcSignalBarrier is used to make sure all processes are starting the resize procedure simultaneously, and a global Barrier is used to coordinate after that and make sure all finished processes are waiting for others that are in progress.
>
> The resize process looks like this:
>
> * The GUC assign hook sets a flag to let the Postmaster know that resize was requested.
>
> * Postmaster verifies the flag in the event loop, and starts the resize by emitting a ProcSignal barrier.
>
> * All processes, that participate in ProcSignal mechanism, begin to process ProcSignal barrier. First a process waits until all processes have confirmed they received the message and can start simultaneously.

As mentioned above, this basically makes the entire feature not really online. Besides the latency of some processes not getting to the barrier immediately, there's also the issue that actually reserving large amounts of memory can take a long time - during which all processes would be unavailable.

I really don't see that being viable. It'd be one thing if that were a "temporary" restriction, but the whole design seems to be fairly centered around that.

> * Every process recalculates shared memory size based on the new NBuffers, adjusts its size using ftruncate and adjust reservation permissions with mprotect. One elected process signals the postmaster to do the same.

If we just used a single memory mapping with all unused parts marked MAP_NORESERVE, we wouldn't need this (and wouldn't need a fair bit of other work in this patchset).

> From experiment it turns out that shared mappings have to be extended separately for each process that uses them. Another rough edge is that a backend blocked on ReadCommand will not apply shared_buffers change until it receives something.

That's not a rough edge; that basically makes the feature unusable, no?
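Continuing the earlier sketch, the per-backend grow step the quoted commit message describes could look roughly like the following. This is illustrative only; fd and base are the file and reservation from the previous sketch (same includes), and the actual patch may structure the step differently.

    /*
     * Illustrative grow step: extend the in-memory file, then map the new
     * tail over the PROT_NONE part of the reservation.  ftruncate() only
     * needs to happen once for the shared file, but every process that has
     * the segment mapped must perform the mmap() part for itself -- that is
     * the per-backend work referred to above.
     */
    static int
    grow_segment(int fd, char *base, size_t old_size, size_t new_size)
    {
        if (ftruncate(fd, new_size) < 0)
            return -1;

        if (mmap(base + old_size, new_size - old_size, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, fd, (off_t) old_size) == MAP_FAILED)
            return -1;

        return 0;
    }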
> +-- Test 2: Set to 64MB
> +ALTER SYSTEM SET shared_buffers = '64MB';
> +SELECT pg_reload_conf();
> +SELECT pg_sleep(1);
> +SHOW shared_buffers;

Tests containing sleeps are a significant warning flag imo.

Greetings,

Andres Freund
Hi,

On 2025-09-18 09:52:03 -0400, Andres Freund wrote:
> On 2025-09-18 10:25:29 +0530, Ashutosh Bapat wrote:
> > Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration
> > [...]
> > To do that, a participant can wait via WaitForProcSignalBarrierReceived when processing a message, effectively making sure that all processes are going to start processing ProcSignalBarrier simultaneously.
>
> I doubt "online resizing" that requires synchronously processing the same event can really be called "online". There can be significant delays in processing a barrier; stalling the entire server until the barrier is reached seems like a complete no-go for production systems?
> [...]
> > Subject: [PATCH 11/16] Allow to resize shared memory without restart
> > [...]
> As mentioned above, this basically makes the entire feature not really online. Besides the latency of some processes not getting to the barrier immediately, there's also the issue that actually reserving large amounts of memory can take a long time - during which all processes would be unavailable.
>
> I really don't see that being viable. It'd be one thing if that were a "temporary" restriction, but the whole design seems to be fairly centered around that.

Besides not really being online, isn't this a recipe for endless undetected deadlocks? What if process A waits for a lock held by process B, and process B arrives at the barrier? Process A will never get to the barrier because it is waiting on the lock, and process B will never release the lock because it is waiting at the barrier for A.

Greetings,

Andres Freund
Sorry for the late reply, folks.

> On Thu, Sep 18, 2025 at 09:52:03AM -0400, Andres Freund wrote:
> > Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks
> > [...]
>
> I doubt it makes sense to add this to the GUC system. I think it'd be better to just use the GUC value as the desired "target" configuration and have a function or a show-only GUC for reporting the current size.
>
> I don't think you can just block application of the GUC until the resize is complete. E.g. what if the value was too big and the new configuration needs to be fixed to be lower?

I think it was a bit hasty to post another version of the patch without the design changes we've agreed upon last time. I'm still working on that (sorry, it takes time, I haven't written so much Perl for testing in forever), and the current implementation doesn't include anything GUC-related, to simplify the discussion. I'm still convinced that multi-step GUC changes make sense, but it has proven to be more complicated than I anticipated, so I'll spin up another thread to discuss it when I come to it.

> > Subject: [PATCH 04/16] Introduce pss_barrierReceivedGeneration
> > [...]
>
> I doubt "online resizing" that requires synchronously processing the same event can really be called "online". There can be significant delays in processing a barrier; stalling the entire server until the barrier is reached seems like a complete no-go for production systems?
>
> [...]
>
> As mentioned above, this basically makes the entire feature not really online. Besides the latency of some processes not getting to the barrier immediately, there's also the issue that actually reserving large amounts of memory can take a long time - during which all processes would be unavailable.
>
> I really don't see that being viable. It'd be one thing if that were a "temporary" restriction, but the whole design seems to be fairly centered around that.
>
> [...]
>
> Besides not really being online, isn't this a recipe for endless undetected deadlocks? What if process A waits for a lock held by process B, and process B arrives at the barrier?

Same as above: in the version I'm working on right now this is changed in favor of an approach that looks more like the one from the "online checksum change" patch. I've even stumbled upon cases where a process was simply killed and never arrived at the barrier, so that was that. The new approach makes certain parts simpler, but it requires managing backends that, for some interval, have a different understanding of how large the shared memory segments are. Introducing a new parameter, "number of available buffers", seems to be helpful in addressing all the cases I've found so far.

Btw, by "online" resizing I mostly meant "without restart"; the goal was not to make it truly "online".

> > -#define MAX_ON_EXITS 20
> > +#define MAX_ON_EXITS 40
>
> Why does a patch like this contain changes like this mixed in with the rest? That's clearly not directly related to $subject.

An artifact of rebasing; it belonged to 0007.

> > Subject: [PATCH 06/16] Address space reservation for shared memory
> > [...]
>
> The commit message fails to explain why, if we're already relying on MAP_NORESERVE, we need anything else. Why can't we just have one maximally sized allocation that's marked MAP_NORESERVE for all the parts that we don't yet need?

How do we return memory to the OS in that case? Currently it's done explicitly via truncating the anonymous file.

> > * The file could be given a name, which improves readability when it comes to process maps.
> >
> > * By default, Linux will not add file-backed shared mappings into a core dump, making it more convenient to work with them in PostgreSQL: no more huge dumps to process.
>
> That's just as well a downside, because you now can't investigate some issues. This was already configurable via coredump_filter.

This behaviour is configured via coredump_filter as well; only the default value has changed.

> > Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers
> > [...]
>
> Why do all these need to be separate segments? Afaict we'll have to maximally size everything other than BUFFERS_SHMEM_SEGMENT at start?

Why would they need to be maxed out at the start? So far my rule of thumb has been one segment per structure whose size depends on NBuffers, so that when changing NBuffers each segment can be adjusted independently.

> > +-- Test 2: Set to 64MB
> > +ALTER SYSTEM SET shared_buffers = '64MB';
> > +SELECT pg_reload_conf();
> > +SELECT pg_sleep(1);
> > +SHOW shared_buffers;
>
> Tests containing sleeps are a significant warning flag imo.

The tests I'm preparing avoid this by waiting on injection points. I haven't found anything similar in the existing tests, but I assume such an approach is fine.
Hi,

On 2025-09-26 20:04:21 +0200, Dmitry Dolgov wrote:
> > On Thu, Sep 18, 2025 at 09:52:03AM -0400, Andres Freund wrote:
> > > Subject: [PATCH 03/16] Introduce pending flag for GUC assign hooks
> > > [...]
> >
> > I doubt it makes sense to add this to the GUC system. I think it'd be better to just use the GUC value as the desired "target" configuration and have a function or a show-only GUC for reporting the current size.
> > [...]
>
> I think it was a bit hasty to post another version of the patch without the design changes we've agreed upon last time. I'm still working on that (sorry, it takes time, I haven't written so much Perl for testing in forever), and the current implementation doesn't include anything GUC-related, to simplify the discussion. I'm still convinced that multi-step GUC changes make sense, but it has proven to be more complicated than I anticipated, so I'll spin up another thread to discuss it when I come to it.

FWIW, I'm fairly convinced it's a completely dead end.

> > > Subject: [PATCH 06/16] Address space reservation for shared memory
> > > [...]
> >
> > The commit message fails to explain why, if we're already relying on MAP_NORESERVE, we need anything else. Why can't we just have one maximally sized allocation that's marked MAP_NORESERVE for all the parts that we don't yet need?
>
> How do we return memory to the OS in that case? Currently it's done explicitly via truncating the anonymous file.

madvise with MADV_DONTNEED or MADV_REMOVE.

Greetings,

Andres Freund
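As a concrete illustration of that suggestion, shrinking with a single maximally sized mapping could look like the sketch below. It is Linux-specific, assumes new_size is page-aligned, and assumes the mapping is backed by memfd/tmpfs so that MADV_REMOVE can punch a hole.

    #define _GNU_SOURCE
    #include <sys/mman.h>

    /*
     * Sketch: with one maximally sized MAP_SHARED mapping, shrinking needs
     * no remapping -- the no-longer-used tail is simply handed back to the
     * kernel.  MADV_REMOVE punches a hole in the backing file (memfd/tmpfs);
     * MADV_DONTNEED would be the choice for a purely anonymous private
     * mapping.
     */
    static int
    release_tail(char *base, size_t new_size, size_t max_size)
    {
        return madvise(base + new_size, max_size - new_size, MADV_REMOVE);
    }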
> On Thu, Sep 18, 2025 at 10:25:29AM +0530, Ashutosh Bapat wrote:
> Given these things, I think we should set up the buffer lookup table to hold the maximum number of entries required to expand the buffer pool to its maximum, right at the beginning.

Thanks for investigating. I think another option would be to rebuild the buffer lookup table (create a new table based on the new size and copy the data over from the original one) as part of the resize procedure, alongside buffer eviction and initialization. From what I recall, the size of the buffer lookup table is about two orders of magnitude smaller than shared buffers, so the overhead should not be that large even for a significant number of buffers.
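A rebuild along those lines could reuse the existing dynahash sequential-scan API; the sketch below is illustrative only and glosses over locking and over where the new table's memory comes from. hash_seq_init(), hash_seq_search() and hash_search() are existing functions, and BufferLookupEnt is the existing entry type in buf_table.c.

    /*
     * Illustrative only: copy all entries of the old buffer lookup table
     * into a freshly created one sized for the new NBuffers.  Real code
     * would have to hold the buffer mapping locks and allocate the new
     * table in the right shared memory segment.
     */
    static void
    rebuild_buf_table(HTAB *old_table, HTAB *new_table)
    {
        HASH_SEQ_STATUS status;
        BufferLookupEnt *old_ent;

        hash_seq_init(&status, old_table);
        while ((old_ent = (BufferLookupEnt *) hash_seq_search(&status)) != NULL)
        {
            bool        found;
            BufferLookupEnt *new_ent;

            new_ent = (BufferLookupEnt *)
                hash_search(new_table, &old_ent->key, HASH_ENTER, &found);
            new_ent->id = old_ent->id;      /* the buffer id stays the same */
        }
    }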
On Sun, Sep 28, 2025 at 2:54 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Thu, Sep 18, 2025 at 10:25:29AM +0530, Ashutosh Bapat wrote:
> > Given these things, I think we should set up the buffer lookup table to hold the maximum number of entries required to expand the buffer pool to its maximum, right at the beginning.
>
> Thanks for investigating. I think another option would be to rebuild the buffer lookup table (create a new table based on the new size and copy the data over from the original one) as part of the resize procedure, alongside buffer eviction and initialization. From what I recall, the size of the buffer lookup table is about two orders of magnitude smaller than shared buffers, so the overhead should not be that large even for a significant number of buffers.

The proposal will work, but it will require significant work:

1. The pointer to the shared buffer lookup table will change. The change needs to be absorbed by all the processes at the same time; we can not have a few processes accessing the old lookup table and a few the new one. That has the potential to make many processes wait for a very long time. It could instead be handled by picking up the new pointer on the next buffer lookup access, by modifying the BufTable* functions. But that means an extra condition check and some extra code in those hot paths. I'm not sure whether that's acceptable.

2. The memory consumed by the old buffer lookup table will need to be "freed" to the OS. The only way to do so is by having a new memory segment (which can be unmapped), or by unmapping portions of the segment dedicated to the buffer lookup table. That's some more synchronization and additional wait time for backends.

3. While the new shared buffer lookup table is being built, processes may be able to access the old one in shared mode, but they may not be able to make changes to it (or else we would need to make the corresponding changes to the new table as well). That means more restrictions on the running backends.

I am not saying that we can not implement your idea, but maybe we could do that incrementally after basic resizing is in place.

--
Best Wishes,
Ashutosh Bapat
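To illustrate the "extra condition check in hot paths" from point 1, something like the following would have to run before each lookup. None of these names exist in PostgreSQL or in the patchset; the sketch only shows the cost being discussed (one extra atomic read and branch per access).

    /*
     * Hypothetical hot-path check: before touching the lookup table, verify
     * that the process-local attachment is still current, and re-attach if
     * the table was rebuilt.  With dynahash, each backend keeps its own HTAB
     * header, so "switching tables" would mean re-running ShmemInitHash()
     * for the new table rather than copying a pointer.
     */
    typedef struct BufTableCtlData
    {
        pg_atomic_uint64 generation;        /* bumped once per rebuild */
    } BufTableCtlData;

    static BufTableCtlData *BufTableCtl;    /* lives in shared memory */
    static uint64 local_buf_table_gen;

    static void reattach_buf_table(void);   /* hypothetical helper */

    static inline void
    ensure_buf_table_current(void)
    {
        uint64      gen = pg_atomic_read_u64(&BufTableCtl->generation);

        if (unlikely(gen != local_buf_table_gen))
        {
            reattach_buf_table();
            local_buf_table_gen = gen;
        }
    }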
On Fri, Sep 26, 2025 at 11:34 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > > Subject: [PATCH 07/16] Introduce multiple shmem segments for shared buffers
> > >
> > > Add more shmem segments to split shared buffers into following chunks:
> > > * BUFFERS_SHMEM_SEGMENT: contains buffer blocks
> > > * BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
> > > * BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
> > > * CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
> > > * STRATEGY_SHMEM_SEGMENT: contains buffer strategy status
> >
> > Why do all these need to be separate segments? Afaict we'll have to maximally size everything other than BUFFERS_SHMEM_SEGMENT at start?
>
> Why would they need to be maxed out at the start? So far my rule of thumb has been one segment per structure whose size depends on NBuffers, so that when changing NBuffers each segment can be adjusted independently.

Off-list, Andres expressed a concern that having multiple shared memory segments may increase the time it takes to disconnect a backend. If the application is using all the configured backends, a slow disconnection will lead to a slow connection. If we want to go the route of multiple segments (as many as five), it would make sense to measure that impact first.

Maxing out at the start avoids using multiple segments. Even when maxed out with a reasonable max_shared_buffers setting, those segments need much less memory than the buffer blocks, so we avoid complicating the code for a small increase in shared memory.

--
Best Wishes,
Ashutosh Bapat
> On Mon, Sep 29, 2025 at 12:21:08PM +0530, Ashutosh Bapat wrote:
> The proposal will work, but it will require significant work:
>
> 1. The pointer to the shared buffer lookup table will change.

Which pointers do you mean? AFAICT no operation on the buffer lookup table returns a pointer (they work with a buffer id or a hash), and keys are compared by value as well.

> we can not have a few processes accessing the old lookup table and a few the new one. That has the potential to make many processes wait for a very long time.

As I've mentioned above, the size of the buffer lookup table is a few orders of magnitude smaller than shared buffers, so I have doubts about "a very long time". But it can be measured.

> 2. The memory consumed by the old buffer lookup table will need to be "freed" to the OS. The only way to do so is by having a new memory segment

The shared buffer lookup table already lives in its own segment as implemented in the current patch, so I don't see any problem here.

I see you folks are inclined to keep some small segments static and allocate the maximum allowed memory for them. It's an option; at the end of the day we need to experiment and measure both approaches.
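For context on the "no operation returns a pointer" point, the buffer mapping table's public interface in current PostgreSQL (declared in buf_internals.h, modulo const qualifiers) only deals in buffer tags, hash codes and buffer ids:

    /* existing API, reproduced here for reference */
    extern uint32 BufTableHashCode(BufferTag *tagPtr);
    extern int    BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
    extern int    BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
    extern void   BufTableDelete(BufferTag *tagPtr, uint32 hashcode);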
On Wed, Oct 1, 2025 at 2:40 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Mon, Sep 29, 2025 at 12:21:08PM +0530, Ashutosh Bapat wrote:
> > 1. The pointer to the shared buffer lookup table will change.
>
> Which pointers do you mean? AFAICT no operation on the buffer lookup table returns a pointer (they work with a buffer id or a hash), and keys are compared by value as well.

The buffer lookup table itself:

    /* Pass location of hashtable header to hash_create */
    infoP->hctl = (HASHHDR *) location;

> > we can not have a few processes accessing the old lookup table and a few the new one. That has the potential to make many processes wait for a very long time.
>
> As I've mentioned above, the size of the buffer lookup table is a few orders of magnitude smaller than shared buffers, so I have doubts about "a very long time". But it can be measured.
>
> > 2. The memory consumed by the old buffer lookup table will need to be "freed" to the OS. The only way to do so is by having a new memory segment
>
> The shared buffer lookup table already lives in its own segment as implemented in the current patch, so I don't see any problem here.

The table is not a single chunk of memory. It's a few chunks spread across the shared memory segment. Freeing the lookup table means freeing those chunks. We have ways to free the tail parts of shared memory segments, but not chunks in between.

--
Best Wishes,
Ashutosh Bapat
> On Wed, Oct 01, 2025 at 03:50:17PM +0530, Ashutosh Bapat wrote:
> The buffer lookup table itself:
>
>     /* Pass location of hashtable header to hash_create */
>     infoP->hctl = (HASHHDR *) location;

How does this affect any users of the lookup table, if they do not even get to see that pointer?

> > The shared buffer lookup table already lives in its own segment as implemented in the current patch, so I don't see any problem here.
>
> The table is not a single chunk of memory. It's a few chunks spread across the shared memory segment. Freeing the lookup table means freeing those chunks. We have ways to free the tail parts of shared memory segments, but not chunks in between.

Right, and the idea was to rebuild it completely to fit the new size, not to free it chunk by chunk.