Обсуждение: Add os_page_num to pg_buffercache
Hi hackers, I was doing some more tests around ba2a3c2302f (pg_buffercache_numa) and thought that seeing how buffers are spread across multiple OS pages (if that's the case) thanks to the os_page_num field is good information to have. The thing that I think is annoying is that to get this information (os_page_num): - One needs to use pg_buffercache_numa (which is more costly/slower) than pg_buffercache - One needs a system with NUMA support enabled So why not also add this information (os_page_num) in pg_buffercache? - It would make this information available on all systems, not just NUMA-enabled ones - It would help understand the memory layout implications of configuration changes such as database block size, OS page size (huge pages for example) and see how the buffers are spread across OS pages (if that's the case). So, please find attached a patch to $SUBJECT then. Remarks: - Maybe we could create a helper function to reduce the code duplication between pg_buffercache_pages() and pg_buffercache_numa_pages() - I think it would have made sense to also add this information while working on ba2a3c2302f but (unfortunately) I doubt that this patch is candidate for v18 post freeze (it looks more a feature enhancement than anything else) - It's currently doing the changes in pg_buffercache v1.6 but will need to create v1.7 for 19 (if the above stands true) Looking forward to your feedback, Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
On 4/10/25 15:17, Bertrand Drouvot wrote: > Hi hackers, > > I was doing some more tests around ba2a3c2302f (pg_buffercache_numa) and > thought that seeing how buffers are spread across multiple OS pages (if that's > the case) thanks to the os_page_num field is good information to have. > > The thing that I think is annoying is that to get this information (os_page_num): > > - One needs to use pg_buffercache_numa (which is more costly/slower) than pg_buffercache > - One needs a system with NUMA support enabled > > So why not also add this information (os_page_num) in pg_buffercache? > > - It would make this information available on all systems, not just NUMA-enabled ones > - It would help understand the memory layout implications of configuration changes > such as database block size, OS page size (huge pages for example) and see how the > buffers are spread across OS pages (if that's the case). > > So, please find attached a patch to $SUBJECT then. > > Remarks: > > - Maybe we could create a helper function to reduce the code duplication between > pg_buffercache_pages() and pg_buffercache_numa_pages() > - I think it would have made sense to also add this information while working > on ba2a3c2302f but (unfortunately) I doubt that this patch is candidate for v18 > post freeze (it looks more a feature enhancement than anything else) > - It's currently doing the changes in pg_buffercache v1.6 but will need to > create v1.7 for 19 (if the above stands true) > This seems like a good idea in principle, but at this point it has to wait for PG19. Please add it to the July commitfest. regards -- Tomas Vondra
On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote: > This seems like a good idea in principle, but at this point it has to > wait for PG19. Please add it to the July commitfest. +1. From a glance, this seems to fall in the "new feature" bucket and should likely wait for v19. -- nathan
Hi, On Thu, Apr 10, 2025 at 09:58:18AM -0500, Nathan Bossart wrote: > On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote: > > This seems like a good idea in principle, but at this point it has to > > wait for PG19. Please add it to the July commitfest. > > +1. From a glance, this seems to fall in the "new feature" bucket and > should likely wait for v19. Thank you both for providing your thoughts that confirm my initial doubt. Let's come back to that one later then. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Thu, Apr 10, 2025 at 03:05:29PM +0000, Bertrand Drouvot wrote: > Hi, > > On Thu, Apr 10, 2025 at 09:58:18AM -0500, Nathan Bossart wrote: > > On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote: > > > This seems like a good idea in principle, but at this point it has to > > > wait for PG19. Please add it to the July commitfest. > > > > +1. From a glance, this seems to fall in the "new feature" bucket and > > should likely wait for v19. > > Thank you both for providing your thoughts that confirm my initial doubt. Let's > come back to that one later then. > Here we are. Please find attached a rebased version and while at it, v2 adds a new macro and a function to avoid some code duplication between pg_buffercache_pages() and pg_buffercache_numa_pages(). So, PFA: 0001 - Introduce GET_MAX_BUFFER_ENTRIES and get_buffer_page_boundaries Those new macro and function are extracted from pg_buffercache_numa_pages() and pg_buffercache_numa_pages() makes use of them. 0002 - Add os_page_num to pg_buffercache Making use of the new macro and function from 0001. As it's for v19, also bumping pg_buffercache's version to 1.7. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
On 7/1/25 15:45, Bertrand Drouvot wrote: > Hi, > > On Thu, Apr 10, 2025 at 03:05:29PM +0000, Bertrand Drouvot wrote: >> Hi, >> >> On Thu, Apr 10, 2025 at 09:58:18AM -0500, Nathan Bossart wrote: >>> On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote: >>>> This seems like a good idea in principle, but at this point it has to >>>> wait for PG19. Please add it to the July commitfest. >>> >>> +1. From a glance, this seems to fall in the "new feature" bucket and >>> should likely wait for v19. >> >> Thank you both for providing your thoughts that confirm my initial doubt. Let's >> come back to that one later then. >> > > Here we are. > > Please find attached a rebased version and while at it, v2 adds a new macro and > a function to avoid some code duplication between pg_buffercache_pages() and > pg_buffercache_numa_pages(). > > So, PFA: > > 0001 - Introduce GET_MAX_BUFFER_ENTRIES and get_buffer_page_boundaries > > Those new macro and function are extracted from pg_buffercache_numa_pages() and > pg_buffercache_numa_pages() makes use of them. > > 0002 - Add os_page_num to pg_buffercache > > Making use of the new macro and function from 0001. > > As it's for v19, also bumping pg_buffercache's version to 1.7. > Thanks for the updated patch! I took a quick look on this, and I doubt we want to change the schema of pg_buffercache like this. Adding columns is fine, but it seems rather wrong to change the cardinality. The view is meant to be 1:1 mapping for buffers, but now suddenly it's 1:1 with memory pages. Or rather (buffer, page), to be precise. I think this will break a lot of monitoring queries, and possibly in a very subtle way - especially on systems with huge pages, where most buffers will have one row, but then a buffer that happens to be split on two pages will have two rows. That seems not great. Just look at the changes needed in regression tests, where the first test now needs to be -select count(*) = (select setting::bigint +select count(*) >= (select setting::bigint from pg_settings where name = 'shared_buffers') which seems like a much weaker check. IMHO it'd be better to have a new view for this info, something like pg_buffercache_pages, or something like that. But I'm also starting to question if the patch really is that useful. Sure, people may not have NUMA support enabled (e.g. on non-linux platforms), and even if they do the _numa view is quite expensive. But I don't recall ever asking how the buffers map to memory pages. At least not before/without the NUMA stuff. regards -- Tomas Vondra
Hi, On Tue, Jul 01, 2025 at 04:31:01PM +0200, Tomas Vondra wrote: > On 7/1/25 15:45, Bertrand Drouvot wrote: > > I took a quick look on this, Thanks for looking at it! > and I doubt we want to change the schema of > pg_buffercache like this. Adding columns is fine, but it seems rather > wrong to change the cardinality. The view is meant to be 1:1 mapping for > buffers, but now suddenly it's 1:1 with memory pages. Or rather (buffer, > page), to be precise. > > I think this will break a lot of monitoring queries, and possibly in a > very subtle way - especially on systems with huge pages, where most > buffers will have one row, but then a buffer that happens to be split on > two pages will have two rows. That seems not great. > > IMHO it'd be better to have a new view for this info, something like > pg_buffercache_pages, or something like that. That's a good point, fully agree! > But I'm also starting to question if the patch really is that useful. > Sure, people may not have NUMA support enabled (e.g. on non-linux > platforms), and even if they do the _numa view is quite expensive. > Yeah, it's not for day to day activities, more for configuration testing and also for development activity/testing. For example, If I set BLCKSZ to 8KB and enable huge pages (2MB), then I may expect to see buffers not spread across pages. But what I can see is: SELECT pages_per_buffer, COUNT(*) as buffer_count FROM ( SELECT bufferid, COUNT(*) as pages_per_buffer FROM pg_buffercache GROUP BY bufferid ) subq GROUP BY pages_per_buffer ORDER BY pages_per_buffer; pages_per_buffer | buffer_count ------------------+-------------- 1 | 261120 2 | 1024 This is due to the shared buffers being aligned to PG_IO_ALIGN_SIZE. If I change it to: BufferManagerShmemInit(void) /* Align buffer pool on IO page size boundary. */ BufferBlocks = (char *) - TYPEALIGN(PG_IO_ALIGN_SIZE, + TYPEALIGN(2 * 1024 * 1024, ShmemInitStruct("Buffer Blocks", - NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE, + NBuffers * (Size) BLCKSZ + (2 * 1024 * 1024), &foundBufs)); Then I get: pages_per_buffer | buffer_count ------------------+-------------- 1 | 262144 (1 row) So we've been able to see that some buffers were spread across pages due to shared buffer alignment on PG_IO_ALIGN_SIZE. And that if we change the alignment to be set to 2MB then I don't see any buffers spread across pages anymore. I think that it helps "visualize" some configuration or code changes. What are your thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 7/1/25 18:34, Bertrand Drouvot wrote: > Hi, > > On Tue, Jul 01, 2025 at 04:31:01PM +0200, Tomas Vondra wrote: >> On 7/1/25 15:45, Bertrand Drouvot wrote: >> >> I took a quick look on this, > > Thanks for looking at it! > >> and I doubt we want to change the schema of >> pg_buffercache like this. Adding columns is fine, but it seems rather >> wrong to change the cardinality. The view is meant to be 1:1 mapping for >> buffers, but now suddenly it's 1:1 with memory pages. Or rather (buffer, >> page), to be precise. >> >> I think this will break a lot of monitoring queries, and possibly in a >> very subtle way - especially on systems with huge pages, where most >> buffers will have one row, but then a buffer that happens to be split on >> two pages will have two rows. That seems not great. >> >> IMHO it'd be better to have a new view for this info, something like >> pg_buffercache_pages, or something like that. > > That's a good point, fully agree! > >> But I'm also starting to question if the patch really is that useful. >> Sure, people may not have NUMA support enabled (e.g. on non-linux >> platforms), and even if they do the _numa view is quite expensive. >> > > Yeah, it's not for day to day activities, more for configuration testing and > also for development activity/testing. > > For example, If I set BLCKSZ to 8KB and enable huge pages (2MB), then I may > expect to see buffers not spread across pages. > > But what I can see is: > > SELECT > pages_per_buffer, > COUNT(*) as buffer_count > FROM ( > SELECT bufferid, COUNT(*) as pages_per_buffer > FROM pg_buffercache > GROUP BY bufferid > ) subq > GROUP BY pages_per_buffer > ORDER BY pages_per_buffer; > > pages_per_buffer | buffer_count > ------------------+-------------- > 1 | 261120 > 2 | 1024 > > This is due to the shared buffers being aligned to PG_IO_ALIGN_SIZE. > > If I change it to: > > BufferManagerShmemInit(void) > > /* Align buffer pool on IO page size boundary. */ > BufferBlocks = (char *) > - TYPEALIGN(PG_IO_ALIGN_SIZE, > + TYPEALIGN(2 * 1024 * 1024, > ShmemInitStruct("Buffer Blocks", > - NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE, > + NBuffers * (Size) BLCKSZ + (2 * 1024 * 1024), > &foundBufs)); > > Then I get: > > pages_per_buffer | buffer_count > ------------------+-------------- > 1 | 262144 > (1 row) > > > So we've been able to see that some buffers were spread across pages due to > shared buffer alignment on PG_IO_ALIGN_SIZE. And that if we change the alignment > to be set to 2MB then I don't see any buffers spread across pages anymore. > > I think that it helps "visualize" some configuration or code changes. > > What are your thoughts? > But isn't the _numa view good enough for this? Sure, you need NUMA support for it, and it may take a fair amount of time, but how often you need to do such queries? I don't plan to block improving this use case, but I'm not sure it's worth the effort. cheers -- Tomas Vondra
Hi, On Tue, Jul 01, 2025 at 06:45:37PM +0200, Tomas Vondra wrote: > On 7/1/25 18:34, Bertrand Drouvot wrote: > > But isn't the _numa view good enough for this? Sure, you need NUMA > support for it, and it may take a fair amount of time, but how often you > need to do such queries? Not that often, but my reasoning was more like: why people managing engines and/or developing on platform that does not support libnuma would not deserve access to this information? > I don't plan to block improving this use case, No worries at all, I do appreciate that you're looking at it and provide feedback whatever the outcome would be. > but I'm not sure it's worth the effort. I think that the hard work has already been done while creating pg_buffercache_numa_pages(). Now it's just a matter of extracting the necessary pieces from pg_buffercache_numa_pages() so that: * the new view could make use of it * the maintenance burden should be low (thanks to code dedeuplication) * people that don't have access to a platform that supports libnuma can have access to this information Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 7/1/25 19:20, Bertrand Drouvot wrote: > Hi, > > On Tue, Jul 01, 2025 at 06:45:37PM +0200, Tomas Vondra wrote: >> On 7/1/25 18:34, Bertrand Drouvot wrote: >> >> But isn't the _numa view good enough for this? Sure, you need NUMA >> support for it, and it may take a fair amount of time, but how often you >> need to do such queries? > > Not that often, but my reasoning was more like: > > why people managing engines and/or developing on platform that does not support > libnuma would not deserve access to this information? > True. I always forget we only support libnuma on linux for now. >> I don't plan to block improving this use case, > > No worries at all, I do appreciate that you're looking at it and provide feedback > whatever the outcome would be. > >> but I'm not sure it's worth the effort. > > I think that the hard work has already been done while creating > pg_buffercache_numa_pages(). > > Now it's just a matter of extracting the necessary pieces from pg_buffercache_numa_pages() > so that: > > * the new view could make use of it > * the maintenance burden should be low (thanks to code dedeuplication) > * people that don't have access to a platform that supports libnuma can have > access to this information > +1 regards -- Tomas Vondra
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed Hi Bertrand, Just tried out your patch, nice work, thought to leave a review as well. Patch applied successfully on top of commit a27893df4 in master. Ran the tests in pg_buffercache and they pass including the new ones. Running "pagesize" on my laptop returns 16384. test=# SELECT current_setting('block_size'); current_setting ----------------- 8192 (1 row) Given the above, the results are as expected: test=# select * from pg_buffercache_os_pages; bufferid | os_page_num ----------+------------- 1 | 0 2 | 0 3 | 1 4 | 1 5 | 2 6 | 2 I have noticed that pg_buffercache_os_pages would be the 3rd function which follows the same high-level structure (others being pg_buffercache_pages and pg_buffercache_numa_pages). I am wondering if this would be let's say "strike three" - time to consider extracting out a high-level "skeleton" function, with a couple of slots which would then be provided by the 3 variants. Kind regards, Mircea The new status of this patch is: Waiting on Author
Hi, On Tue, Jul 08, 2025 at 02:47:34PM +0000, Mircea Cadariu wrote: > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, passed > Spec compliant: tested, passed > Documentation: tested, passed > > Hi Bertrand, > > Just tried out your patch, nice work, thought to leave a review as well. Thanks for looking at it! > Patch applied successfully on top of commit a27893df4 in master. > Ran the tests in pg_buffercache and they pass including the new ones. > > Running "pagesize" on my laptop returns 16384. > > test=# SELECT current_setting('block_size'); > current_setting > ----------------- > 8192 > (1 row) > > Given the above, the results are as expected: > > test=# select * from pg_buffercache_os_pages; > bufferid | os_page_num > ----------+------------- > 1 | 0 > 2 | 0 > 3 | 1 > 4 | 1 > 5 | 2 > 6 | 2 Cool. > I have noticed that pg_buffercache_os_pages would be the 3rd function > which follows the same high-level structure (others being pg_buffercache_pages > and pg_buffercache_numa_pages). I am wondering if this would be let's say > "strike three" - time to consider extracting out a high-level "skeleton" function, > with a couple of slots which would then be provided by the 3 variants. Yeah, I tried to avoid code duplication for the "os pages" related stuff in v1. I can check if more can be done (outside of the "os pages" related stuff). Might be done in a dedicated patch though, I mean I don't think that should be a blocker for this one. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi,
Thanks for the prompt reply!
Yeah, I tried to avoid code duplication for the "os pages" related stuff in v1. I can check if more can be done (outside of the "os pages" related stuff). Might be done in a dedicated patch though, I mean I don't think that should be a blocker for this one.
Agreed, if there's any low-hanging fruit to address now that this file is cracked open, then great. Otherwise, makes sense to leave it for a separate dedicated patch.
If you don't mind I have some further questions on the patch, see below.
+ if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type");
Is this needed in the new pg_buffercache_os_pages function? I noticed this check also in the "original" pg_buffercache_pages. There's a comment there indicating that (if I understand correctly) its purpose is to handle upgrades from version 1.0, mentioning a field unrelated to this patch.
If it's needed, shall we consider adding a similar comment as in pg_buffercache_pages?
For step number 3 - should it be the other way around: database blocks are contained within OS pages?+ /* + * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, + * while the OS may have different memory page sizes. + * + * To correctly map between them, we need to: 1. Determine the OS + * memory page size 2. Calculate how many OS pages are used by all + * buffer blocks 3. Calculate how many OS pages are contained within + * each database block. + */
Kind regards,
Mircea Cadariu
Hi, On Wed, Jul 09, 2025 at 10:51:16AM +0100, Mircea Cadariu wrote: > If you don't mind I have some further questions on the patch, see below. Thanks for the feedback/questions! > > + if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE) > > + elog(ERROR, "return type must be a row type"); > > Is this needed in the new pg_buffercache_os_pages function? Strictly speaking it is not, we could CreateTemplateTupleDesc(NUM_BUFFERCACHE_OS_PAGES_ELEM) instead of CreateTemplateTupleDesc(expected_tupledesc->natts). OTOH, it's used in multiple places in this extension so I think it's ok to keep it that way for consistency. > I noticed this > check also in the "original" pg_buffercache_pages. There's a comment there > indicating that (if I understand correctly) its purpose is to handle > upgrades from version 1.0, mentioning a field unrelated to this patch. > > If it's needed, shall we consider adding a similar comment as > in pg_buffercache_pages? We don't need the same kind of comment in pg_buffercache_os_pages() because it's new in 1.7 (so the patch can not "break" a pre-1.7 version of this function /view). In the pg_buffercache_pages case, the story is different, it's used to deal with the pinning_backends fields that has been introduced in 1.1 (see pg_buffercache--1.0--1.1.sql). > > + /* > > + * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, > > + * while the OS may have different memory page sizes. > > + * > > + * To correctly map between them, we need to: 1. Determine the OS > > + * memory page size 2. Calculate how many OS pages are used by all > > + * buffer blocks 3. Calculate how many OS pages are contained within > > + * each database block. > > + */ > For step number 3 - should it be the other way around: database blocks are > contained within OS pages? This comment comes from pg_get_shmem_allocations_numa() and I agree that it could be misleading: it all depends what the OS and block sizes actually are: fixed in v5 attached where the wording is almost the same as in pg_buffercache_numa_pages(). Also I think that it is not correct in pg_get_shmem_allocations_numa(), I think that it should be something like proposed in [1]. [1]: https://www.postgresql.org/message-id/aH4DDhdiG9Gi0rG7%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
Hi,
Thanks for the update! I tried v5 and it returns the expected results on my laptop, same as before.
Just two further remarks for your consideration.
Let's capitalize the first letter here.+ <para> + number of OS memory page for this buffer + </para></entry>
In the existing pg_buffercache.sql there are sections similar to the above (SET ROLE pg_database_owner/pg_monitor ... RESET role), with a couple of different SELECT statements within. Should we rather add the above new SELECTs there, instead of in the new pg_buffercache_os_pages.sql?+-- Check that the functions / views can't be accessed by default. To avoid +-- having to create a dedicated user, use the pg_database_owner pseudo-role. +SET ROLE pg_database_owner; +SELECT count(*) > 0 FROM pg_buffercache_os_pages; +RESET role; + +-- Check that pg_monitor is allowed to query view / function +SET ROLE pg_monitor; +SELECT count(*) > 0 FROM pg_buffercache_os_pages; +RESET role;
Kind regards,
Mircea Cadariu
Hi, On Thu, Jul 24, 2025 at 10:30:06PM +0800, Mircea Cadariu wrote: > I tried v5 and it returns the expected results on my > laptop, same as before. Thanks for the review and testing. > > Just two further remarks for your consideration. > > > + <para> > > + number of OS memory page for this buffer > > + </para></entry> > Let's capitalize the first letter here. It's copy/pasted from pg_buffercache_numa, but I agree that both (the one in pg_buffercache_numa and the new one) should be capitalized (for consistency with the other views). Done in the attached. > > +-- Check that the functions / views can't be accessed by default. To avoid > > +-- having to create a dedicated user, use the pg_database_owner pseudo-role. > > +SET ROLE pg_database_owner; > > +SELECT count(*) > 0 FROM pg_buffercache_os_pages; > > +RESET role; > > + > > +-- Check that pg_monitor is allowed to query view / function > > +SET ROLE pg_monitor; > > +SELECT count(*) > 0 FROM pg_buffercache_os_pages; > > +RESET role; > In the existing pg_buffercache.sql there are sections similar to the above > (SET ROLE pg_database_owner/pg_monitor ... RESET role), with a couple of > different SELECT statements within. Should we rather add the above new > SELECTs there, instead of in the new pg_buffercache_os_pages.sql? Yeah, that probably makes more sense, done in the attached. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
Hi,
Thanks! As the most recent changes are only docs and tests, I did not try out v6 anymore, but just checked the CI result; all green.
I've set the status to Ready for Committer.
Kind regards,
Mircea Cadariu
Hi,
A small addendum might make sense for this patch, given a recent change to master.
A CHECK_FOR_INTERRUPTS() call was added in several pg_buffercache functions in commit eab9e4e.
See also the corresponding discussion [1].
Shall we add it to the function introduced in this patch as well?
Kind regards,
Mircea Cadariu
Hi, On Thu, Aug 21, 2025 at 12:08:37PM +0100, Mircea Cadariu wrote: > Hi, > > A small addendum might make sense for this patch, given a recent change to > master. > > A CHECK_FOR_INTERRUPTS() call was added in several pg_buffercache functions > in commit eab9e4e. > > See also the corresponding discussion [1]. > > Shall we add it to the function introduced in this patch as well? Yeah, I think so. Thanks for the ping, done in attached. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com