Thread: Add os_page_num to pg_buffercache

Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

April 10, 16:17:55

Hi hackers,

I was doing some more tests around ba2a3c2302f (pg_buffercache_numa) and
thought that seeing how buffers are spread across multiple OS pages (if that's
the case) thanks to the os_page_num field is good information to have.

The thing that I think is annoying is that to get this information (os_page_num):

- One needs to use pg_buffercache_numa (which is more costly/slower than pg_buffercache)
- One needs a system with NUMA support enabled

So why not also add this information (os_page_num) in pg_buffercache?

- It would make this information available on all systems, not just NUMA-enabled ones
- It would help understand the memory layout implications of configuration changes
such as database block size, OS page size (huge pages for example) and see how the
buffers are spread across OS pages (if that's the case).

So, please find attached a patch to $SUBJECT then.

Remarks:

- Maybe we could create a helper function to reduce the code duplication between
pg_buffercache_pages() and pg_buffercache_numa_pages()
- I think it would have made sense to also add this information while working
on ba2a3c2302f but (unfortunately) I doubt that this patch is a candidate for v18
post freeze (it looks more like a feature enhancement than anything else)
- It's currently doing the changes in pg_buffercache v1.6 but will need to
create v1.7 for 19 (if the above stands true)

Looking forward to your feedback,

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

v1-0001-Add-os_page_num-to-pg_buffercache.patch

Re: Add os_page_num to pg_buffercache

From

Tomas Vondra

Date:

April 10, 16:35:24


On 4/10/25 15:17, Bertrand Drouvot wrote:
> Hi hackers,
> 
> I was doing some more tests around ba2a3c2302f (pg_buffercache_numa) and
> thought that seeing how buffers are spread across multiple OS pages (if that's
> the case) thanks to the os_page_num field is good information to have.
> 
> The thing that I think is annoying is that to get this information (os_page_num):
> 
> - One needs to use pg_buffercache_numa (which is more costly/slower) than pg_buffercache
> - One needs a system with NUMA support enabled
> 
> So why not also add this information (os_page_num) in pg_buffercache?
> 
> - It would make this information available on all systems, not just NUMA-enabled ones
> - It would help understand the memory layout implications of configuration changes
> such as database block size, OS page size (huge pages for example) and see how the
> buffers are spread across OS pages (if that's the case).
> 
> So, please find attached a patch to $SUBJECT then.
> 
> Remarks:
> 
> - Maybe we could create a helper function to reduce the code duplication between
> pg_buffercache_pages() and pg_buffercache_numa_pages()
> - I think it would have made sense to also add this information while working
> on ba2a3c2302f but (unfortunately) I doubt that this patch is candidate for v18
> post freeze (it looks more a feature enhancement than anything else)
> - It's currently doing the changes in pg_buffercache v1.6 but will need to
> create v1.7 for 19 (if the above stands true)
> 

This seems like a good idea in principle, but at this point it has to
wait for PG19. Please add it to the July commitfest.


regards

-- 
Tomas Vondra

Re: Add os_page_num to pg_buffercache

From

Nathan Bossart

Date:

April 10, 17:58:18

On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote:
> This seems like a good idea in principle, but at this point it has to
> wait for PG19. Please add it to the July commitfest.

+1.  From a glance, this seems to fall in the "new feature" bucket and
should likely wait for v19.

-- 
nathan

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

April 10, 18:05:29

Hi,

On Thu, Apr 10, 2025 at 09:58:18AM -0500, Nathan Bossart wrote:
> On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote:
> > This seems like a good idea in principle, but at this point it has to
> > wait for PG19. Please add it to the July commitfest.
> 
> +1.  From a glance, this seems to fall in the "new feature" bucket and
> should likely wait for v19.

Thank you both for providing your thoughts that confirm my initial doubt. Let's
come back to that one later then.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

July 01, 16:45:55

Hi,

On Thu, Apr 10, 2025 at 03:05:29PM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Apr 10, 2025 at 09:58:18AM -0500, Nathan Bossart wrote:
> > On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote:
> > > This seems like a good idea in principle, but at this point it has to
> > > wait for PG19. Please add it to the July commitfest.
> > 
> > +1.  From a glance, this seems to fall in the "new feature" bucket and
> > should likely wait for v19.
> 
> Thank you both for providing your thoughts that confirm my initial doubt. Let's
> come back to that one later then.
> 

Here we are.

Please find attached a rebased version and while at it, v2 adds a new macro and
a function to avoid some code duplication between pg_buffercache_pages() and
pg_buffercache_numa_pages().

So, PFA:

0001 - Introduce GET_MAX_BUFFER_ENTRIES and get_buffer_page_boundaries

The new macro and function are extracted from pg_buffercache_numa_pages(), and
pg_buffercache_numa_pages() makes use of them.

0002 - Add os_page_num to pg_buffercache

Making use of the new macro and function from 0001.

As it's for v19, also bumping pg_buffercache's version to 1.7.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Add os_page_num to pg_buffercache

From

Tomas Vondra

Date:

July 01, 17:31:01

On 7/1/25 15:45, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Apr 10, 2025 at 03:05:29PM +0000, Bertrand Drouvot wrote:
>> Hi,
>>
>> On Thu, Apr 10, 2025 at 09:58:18AM -0500, Nathan Bossart wrote:
>>> On Thu, Apr 10, 2025 at 03:35:24PM +0200, Tomas Vondra wrote:
>>>> This seems like a good idea in principle, but at this point it has to
>>>> wait for PG19. Please add it to the July commitfest.
>>>
>>> +1.  From a glance, this seems to fall in the "new feature" bucket and
>>> should likely wait for v19.
>>
>> Thank you both for providing your thoughts that confirm my initial doubt. Let's
>> come back to that one later then.
>>
> 
> Here we are.
> 
> Please find attached a rebased version and while at it, v2 adds a new macro and
> a function to avoid some code duplication between pg_buffercache_pages() and
> pg_buffercache_numa_pages().
> 
> So, PFA:
> 
> 0001 - Introduce GET_MAX_BUFFER_ENTRIES and get_buffer_page_boundaries
> 
> Those new macro and function are extracted from pg_buffercache_numa_pages() and
> pg_buffercache_numa_pages() makes use of them.
> 
> 0002 - Add os_page_num to pg_buffercache
> 
> Making use of the new macro and function from 0001.
> 
> As it's for v19, also bumping pg_buffercache's version to 1.7.
> 

Thanks for the updated patch!

I took a quick look on this, and I doubt we want to change the schema of
pg_buffercache like this. Adding columns is fine, but it seems rather
wrong to change the cardinality. The view is meant to be 1:1 mapping for
buffers, but now suddenly it's 1:1 with memory pages. Or rather (buffer,
page), to be precise.

I think this will break a lot of monitoring queries, and possibly in a
very subtle way - especially on systems with huge pages, where most
buffers will have one row, but then a buffer that happens to be split on
two pages will have two rows. That seems not great.

Just look at the changes needed in regression tests, where the first
test now needs to be

  -select count(*) = (select setting::bigint
  +select count(*) >= (select setting::bigint
                    from pg_settings
                    where name = 'shared_buffers')

which seems like a much weaker check.
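
To illustrate the cardinality concern: once a view emits one row per (buffer, OS page) pair, its row count depends on the OS page size and on where the buffer pool starts, not on shared_buffers alone. The following is a minimal standalone sketch, not code from the patch; count_view_rows() and its parameters are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of PostgreSQL): number of rows a
 * (buffer, OS page) view would emit for nbuffers buffers of blcksz bytes
 * whose pool starts at byte offset start_off, with OS pages of
 * os_page_size bytes. A buffer contributes one row per OS page it touches.
 */
uint64_t
count_view_rows(uint64_t start_off, uint64_t nbuffers,
                uint64_t blcksz, uint64_t os_page_size)
{
    uint64_t rows = 0;

    assert(blcksz > 0 && os_page_size > 0);

    for (uint64_t i = 0; i < nbuffers; i++)
    {
        /* OS pages holding the first and the last byte of buffer i */
        uint64_t first = (start_off + i * blcksz) / os_page_size;
        uint64_t last = (start_off + (i + 1) * blcksz - 1) / os_page_size;

        rows += last - first + 1;
    }
    return rows;
}
```

Under these assumptions, 16384 buffers of 8kB on 2MB pages give 16384 rows when the pool starts on a page boundary, 16448 rows when it starts 4096 bytes later, and 32768 rows with 4kB OS pages. That is why the regression test can only assert ">=".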

IMHO it'd be better to have a new view for this info, something like
pg_buffercache_pages, or something like that.

But I'm also starting to question if the patch really is that useful.
Sure, people may not have NUMA support enabled (e.g. on non-linux
platforms), and even if they do the _numa view is quite expensive.

But I don't recall ever asking how the buffers map to memory pages. At
least not before/without the NUMA stuff.

regards

-- 
Tomas Vondra

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

July 01, 19:34:56

Hi,

On Tue, Jul 01, 2025 at 04:31:01PM +0200, Tomas Vondra wrote:
> On 7/1/25 15:45, Bertrand Drouvot wrote:
> 
> I took a quick look on this,

Thanks for looking at it!

> and I doubt we want to change the schema of
> pg_buffercache like this. Adding columns is fine, but it seems rather
> wrong to change the cardinality. The view is meant to be 1:1 mapping for
> buffers, but now suddenly it's 1:1 with memory pages. Or rather (buffer,
> page), to be precise.
> 
> I think this will break a lot of monitoring queries, and possibly in a
> very subtle way - especially on systems with huge pages, where most
> buffers will have one row, but then a buffer that happens to be split on
> two pages will have two rows. That seems not great.
> 
> IMHO it'd be better to have a new view for this info, something like
> pg_buffercache_pages, or something like that.

That's a good point, fully agree!

> But I'm also starting to question if the patch really is that useful.
> Sure, people may not have NUMA support enabled (e.g. on non-linux
> platforms), and even if they do the _numa view is quite expensive.
> 

Yeah, it's not for day to day activities, more for configuration testing and
also for development activity/testing.

For example, if I set BLCKSZ to 8KB and enable huge pages (2MB), then I may
expect to see buffers not spread across pages.

But what I can see is:

SELECT
    pages_per_buffer,
    COUNT(*) as buffer_count
FROM (
    SELECT bufferid, COUNT(*) as pages_per_buffer
    FROM pg_buffercache
    GROUP BY bufferid
) subq
GROUP BY pages_per_buffer
ORDER BY pages_per_buffer;

 pages_per_buffer | buffer_count
------------------+--------------
                1 |       261120
                2 |         1024

This is due to the shared buffers being aligned to PG_IO_ALIGN_SIZE.

If I change it to:

BufferManagerShmemInit(void)

        /* Align buffer pool on IO page size boundary. */
        BufferBlocks = (char *)
-               TYPEALIGN(PG_IO_ALIGN_SIZE,
+               TYPEALIGN(2 * 1024 * 1024,
                                  ShmemInitStruct("Buffer Blocks",
-                                                                 NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+                                                                 NBuffers * (Size) BLCKSZ + (2 * 1024 * 1024),
                                                                  &foundBufs));

Then I get:

 pages_per_buffer | buffer_count
------------------+--------------
                1 |       262144
(1 row)


So we've been able to see that some buffers were spread across pages due to 
shared buffer alignment on PG_IO_ALIGN_SIZE. And that if we change the alignment
to be set to 2MB then I don't see any buffers spread across pages anymore.
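
The effect can be sketched outside PostgreSQL with plain arithmetic. This is only an illustration: count_split_buffers() is a hypothetical helper, and the 4096-byte starting offset used in the note below is an assumption (a start that is 4kB-aligned but not 8kB-aligned relative to a huge page boundary), since the actual address is not given in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ_BYTES   UINT64_C(8192)               /* 8kB database block */
#define HUGE_PAGE_SIZE (UINT64_C(2) * 1024 * 1024)  /* 2MB huge page */

/*
 * Hypothetical helper: count the buffers that straddle a huge page boundary
 * when the buffer pool starts start_off bytes past a huge page boundary.
 */
uint64_t
count_split_buffers(uint64_t start_off, uint64_t nbuffers)
{
    uint64_t split = 0;

    assert(start_off < HUGE_PAGE_SIZE);

    for (uint64_t i = 0; i < nbuffers; i++)
    {
        /* huge pages holding the first and the last byte of buffer i */
        uint64_t first = (start_off + i * BLCKSZ_BYTES) / HUGE_PAGE_SIZE;
        uint64_t last = (start_off + (i + 1) * BLCKSZ_BYTES - 1) / HUGE_PAGE_SIZE;

        if (first != last)
            split++;
    }
    return split;
}
```

With 262144 buffers, an assumed offset of 4096 gives 1024 split buffers (leaving 261120 on a single page), and an offset of 0 gives none, which is consistent with the two query results above.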

I think that it helps "visualize" some configuration or code changes.

What are your thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Add os_page_num to pg_buffercache

From

Tomas Vondra

Date:

July 01, 19:45:37

On 7/1/25 18:34, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jul 01, 2025 at 04:31:01PM +0200, Tomas Vondra wrote:
>> On 7/1/25 15:45, Bertrand Drouvot wrote:
>>
>> I took a quick look on this,
> 
> Thanks for looking at it!
> 
>> and I doubt we want to change the schema of
>> pg_buffercache like this. Adding columns is fine, but it seems rather
>> wrong to change the cardinality. The view is meant to be 1:1 mapping for
>> buffers, but now suddenly it's 1:1 with memory pages. Or rather (buffer,
>> page), to be precise.
>>
>> I think this will break a lot of monitoring queries, and possibly in a
>> very subtle way - especially on systems with huge pages, where most
>> buffers will have one row, but then a buffer that happens to be split on
>> two pages will have two rows. That seems not great.
>>
>> IMHO it'd be better to have a new view for this info, something like
>> pg_buffercache_pages, or something like that.
> 
> That's a good point, fully agree!
> 
>> But I'm also starting to question if the patch really is that useful.
>> Sure, people may not have NUMA support enabled (e.g. on non-linux
>> platforms), and even if they do the _numa view is quite expensive.
>>
> 
> Yeah, it's not for day to day activities, more for configuration testing and
> also for development activity/testing.
> 
> For example, If I set BLCKSZ to 8KB and enable huge pages (2MB), then I may
> expect to see buffers not spread across pages.
> 
> But what I can see is:
> 
> SELECT
>     pages_per_buffer,
>     COUNT(*) as buffer_count
> FROM (
>     SELECT bufferid, COUNT(*) as pages_per_buffer
>     FROM pg_buffercache
>     GROUP BY bufferid
> ) subq
> GROUP BY pages_per_buffer
> ORDER BY pages_per_buffer;
> 
>  pages_per_buffer | buffer_count
> ------------------+--------------
>                 1 |       261120
>                 2 |         1024
> 
> This is due to the shared buffers being aligned to PG_IO_ALIGN_SIZE.
> 
> If I change it to:
> 
> BufferManagerShmemInit(void)
> 
>         /* Align buffer pool on IO page size boundary. */
>         BufferBlocks = (char *)
> -               TYPEALIGN(PG_IO_ALIGN_SIZE,
> +               TYPEALIGN(2 * 1024 * 1024,
>                                   ShmemInitStruct("Buffer Blocks",
> -                                                                 NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
> +                                                                 NBuffers * (Size) BLCKSZ + (2 * 1024 * 1024),
>                                                                   &foundBufs));
> 
> Then I get:
> 
>  pages_per_buffer | buffer_count
> ------------------+--------------
>                 1 |       262144
> (1 row)
> 
> 
> So we've been able to see that some buffers were spread across pages due to 
> shared buffer alignment on PG_IO_ALIGN_SIZE. And that if we change the alignment
> to be set to 2MB then I don't see any buffers spread across pages anymore.
> 
> I think that it helps "visualize" some configuration or code changes.
> 
> What are your thoughts?
> 

But isn't the _numa view good enough for this? Sure, you need NUMA
support for it, and it may take a fair amount of time, but how often do you
need to do such queries? I don't plan to block improving this use case,
but I'm not sure it's worth the effort.


cheers

-- 
Tomas Vondra

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

July 01, 20:20:06

Hi,

On Tue, Jul 01, 2025 at 06:45:37PM +0200, Tomas Vondra wrote:
> On 7/1/25 18:34, Bertrand Drouvot wrote:
> 
> But isn't the _numa view good enough for this? Sure, you need NUMA
> support for it, and it may take a fair amount of time, but how often you
> need to do such queries?

Not that often, but my reasoning was more like:

why would people managing engines and/or developing on a platform that does not
support libnuma not deserve access to this information?

> I don't plan to block improving this use case,

No worries at all, I do appreciate that you're looking at it and provide feedback
whatever the outcome would be.

> but I'm not sure it's worth the effort.

I think that the hard work has already been done while creating
pg_buffercache_numa_pages().

Now it's just a matter of extracting the necessary pieces from pg_buffercache_numa_pages()
so that:

* the new view could make use of it
* the maintenance burden should be low (thanks to code deduplication)
* people that don't have access to a platform that supports libnuma can have
access to this information

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Add os_page_num to pg_buffercache

From

Tomas Vondra

Date:

July 01, 20:46:30

On 7/1/25 19:20, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jul 01, 2025 at 06:45:37PM +0200, Tomas Vondra wrote:
>> On 7/1/25 18:34, Bertrand Drouvot wrote:
>>
>> But isn't the _numa view good enough for this? Sure, you need NUMA
>> support for it, and it may take a fair amount of time, but how often you
>> need to do such queries?
> 
> Not that often, but my reasoning was more like:
> 
> why people managing engines and/or developing on platform that does not support
> libnuma would not deserve access to this information?
> 

True. I always forget we only support libnuma on linux for now.

>> I don't plan to block improving this use case,
> 
> No worries at all, I do appreciate that you're looking at it and provide feedback
> whatever the outcome would be.
> 
>> but I'm not sure it's worth the effort.
> 
> I think that the hard work has already been done while creating
> pg_buffercache_numa_pages().
> 
> Now it's just a matter of extracting the necessary pieces from pg_buffercache_numa_pages()
> so that:
> 
> * the new view could make use of it
> * the maintenance burden should be low (thanks to code deduplication)
> * people that don't have access to a platform that supports libnuma can have
> access to this information
> 

+1


regards

-- 
Tomas Vondra

Re: Add os_page_num to pg_buffercache

From

Mircea Cadariu

Date:

July 08, 17:47:34

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hi Bertrand, 

Just tried out your patch, nice work, thought to leave a review as well.

Patch applied successfully on top of commit a27893df4 in master. 
Ran the tests in pg_buffercache and they pass including the new ones. 

Running "pagesize" on my laptop returns 16384. 

test=# SELECT current_setting('block_size');
 current_setting 
-----------------
 8192
(1 row)

Given the above, the results are as expected: 

test=# select * from pg_buffercache_os_pages;
 bufferid | os_page_num 
----------+-------------
        1 |           0
        2 |           0
        3 |           1
        4 |           1
        5 |           2
        6 |           2
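
The mapping in this output can be reproduced with a small sketch. It assumes the buffer pool starts exactly on an OS page boundary and that a block is no larger than an OS page; os_page_num_for_buffer() is a hypothetical name, not a function from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: OS page number holding a buffer, assuming the buffer
 * pool starts exactly on an OS page boundary and a block fits in one page.
 * bufferid is 1-based, as in the view output.
 */
int64_t
os_page_num_for_buffer(int64_t bufferid, int64_t blcksz, int64_t os_page_size)
{
    assert(bufferid >= 1);
    assert(os_page_size >= blcksz && os_page_size % blcksz == 0);

    /* byte offset of the buffer's first byte, divided by the page size */
    return ((bufferid - 1) * blcksz) / os_page_size;
}
```

With blcksz = 8192 and os_page_size = 16384, buffers 1 to 6 map to pages 0, 0, 1, 1, 2, 2, matching the rows shown.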

I have noticed that pg_buffercache_os_pages would be the 3rd function 
which follows the same high-level structure (others being pg_buffercache_pages 
and pg_buffercache_numa_pages). I am wondering if this would be let's say 
"strike three" - time to consider extracting out a high-level "skeleton" function, 
with a couple of slots which would then be provided by the 3 variants. 

Kind regards,
Mircea

The new status of this patch is: Waiting on Author

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

July 09, 10:37:11

Hi,

On Tue, Jul 08, 2025 at 02:47:34PM +0000, Mircea Cadariu wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           tested, passed
> Documentation:            tested, passed
> 
> Hi Bertrand, 
> 
> Just tried out your patch, nice work, thought to leave a review as well.

Thanks for looking at it!

> Patch applied successfully on top of commit a27893df4 in master. 
> Ran the tests in pg_buffercache and they pass including the new ones. 
> 
> Running "pagesize" on my laptop returns 16384. 
> 
> test=# SELECT current_setting('block_size');
>  current_setting 
> -----------------
>  8192
> (1 row)
> 
> Given the above, the results are as expected: 
> 
> test=# select * from pg_buffercache_os_pages;
>  bufferid | os_page_num 
> ----------+-------------
>         1 |           0
>         2 |           0
>         3 |           1
>         4 |           1
>         5 |           2
>         6 |           2

Cool.

> I have noticed that pg_buffercache_os_pages would be the 3rd function 
> which follows the same high-level structure (others being pg_buffercache_pages 
> and pg_buffercache_numa_pages). I am wondering if this would be let's say 
> "strike three" - time to consider extracting out a high-level "skeleton" function, 
> with a couple of slots which would then be provided by the 3 variants. 

Yeah, I tried to avoid code duplication for the "os pages" related stuff in
v1. I can check if more can be done (outside of the "os pages" related stuff).

Might be done in a dedicated patch though, I mean I don't think that should be
a blocker for this one.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Add os_page_num to pg_buffercache

From

Mircea Cadariu

Date:

July 09, 12:51:16

Hi,

Thanks for the prompt reply!

On 09/07/2025 08:37, Bertrand Drouvot wrote:

> Yeah, I tried to avoid code duplication for the "os pages" related stuff in
> v1. I can check if more can be done (outside of the "os pages" related stuff).
>
> Might be done in a dedicated patch though, I mean I don't think that should be
> a blocker for this one.

Agreed, if there's any low-hanging fruit to address now that this file is cracked open, then great. Otherwise, makes sense to leave it for a separate dedicated patch.

If you don't mind I have some further questions on the patch, see below.

+		if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+			elog(ERROR, "return type must be a row type");

Is this needed in the new pg_buffercache_os_pages function? I noticed this check also in the "original" pg_buffercache_pages. There's a comment there indicating that (if I understand correctly) its purpose is to handle upgrades from version 1.0, mentioning a field unrelated to this patch.

If it's needed, shall we consider adding a similar comment as in pg_buffercache_pages?

+		/*
+		 * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+		 * while the OS may have different memory page sizes.
+		 *
+		 * To correctly map between them, we need to: 1. Determine the OS
+		 * memory page size 2. Calculate how many OS pages are used by all
+		 * buffer blocks 3. Calculate how many OS pages are contained within
+		 * each database block.
+		 */

For step number 3 - should it be the other way around: database blocks are contained within OS pages?
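
Which way step 3 reads depends on which size is larger, as a rough sketch of the three steps shows. Everything here is illustrative: PageMapping, compute_page_mapping() and query_os_page_size() are hypothetical names, the pool is assumed to start on a page boundary, and sysconf(_SC_PAGESIZE) reports only the base page size on POSIX systems (it does not account for huge pages).

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical container for the quantities named in the comment. */
typedef struct PageMapping
{
    uint64_t os_page_size;      /* step 1: OS memory page size */
    uint64_t total_os_pages;    /* step 2: OS pages covering all blocks */
    uint64_t pages_per_block;   /* step 3: OS pages per block, at least 1 */
    uint64_t blocks_per_page;   /* the reverse ratio, at least 1 */
} PageMapping;

PageMapping
compute_page_mapping(uint64_t nbuffers, uint64_t blcksz, uint64_t os_page_size)
{
    PageMapping m;
    uint64_t total_bytes = nbuffers * blcksz;

    assert(blcksz > 0 && os_page_size > 0);

    m.os_page_size = os_page_size;
    /* round up so a partially used trailing page is counted */
    m.total_os_pages = (total_bytes + os_page_size - 1) / os_page_size;
    m.pages_per_block = blcksz > os_page_size ? blcksz / os_page_size : 1;
    m.blocks_per_page = os_page_size > blcksz ? os_page_size / blcksz : 1;
    return m;
}

/* Step 1 on a POSIX system: ask the OS for its base page size. */
uint64_t
query_os_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);

    assert(sz > 0);
    return (uint64_t) sz;
}
```

For 8kB blocks on 16kB pages, blocks_per_page is 2 and pages_per_block is 1; on 4kB pages the ratio flips, so neither phrasing of step 3 holds for every configuration.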

Kind regards,

Mircea Cadariu

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

July 21, 12:12:58

Hi,

On Wed, Jul 09, 2025 at 10:51:16AM +0100, Mircea Cadariu wrote:
> If you don't mind I have some further questions on the patch, see below.

Thanks for the feedback/questions!

> > +        if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
> > +            elog(ERROR, "return type must be a row type");
> 
> Is this needed in the new pg_buffercache_os_pages function?

Strictly speaking it is not, we could CreateTemplateTupleDesc(NUM_BUFFERCACHE_OS_PAGES_ELEM)
instead of CreateTemplateTupleDesc(expected_tupledesc->natts). OTOH, it's used
in multiple places in this extension so I think it's ok to keep it that way
for consistency.

> I noticed this
> check also in the "original" pg_buffercache_pages. There's a comment there
> indicating that (if I understand correctly) its purpose is to handle
> upgrades from version 1.0, mentioning a field unrelated to this patch.
> 
> If it's needed, shall we consider adding a similar comment as
> in pg_buffercache_pages?

We don't need the same kind of comment in pg_buffercache_os_pages() because
it's new in 1.7 (so the patch can not "break" a pre-1.7 version of this function
/view).

In the pg_buffercache_pages case, the story is different: the check is used to deal
with the pinning_backends field that was introduced in 1.1 (see
pg_buffercache--1.0--1.1.sql).

> > +        /*
> > +         * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
> > +         * while the OS may have different memory page sizes.
> > +         *
> > +         * To correctly map between them, we need to: 1. Determine the OS
> > +         * memory page size 2. Calculate how many OS pages are used by all
> > +         * buffer blocks 3. Calculate how many OS pages are contained within
> > +         * each database block.
> > +         */
> For step number 3 - should it be the other way around: database blocks are
> contained within OS pages?

This comment comes from pg_get_shmem_allocations_numa() and I agree that it
could be misleading: it all depends on what the OS page and block sizes actually are.
This is fixed in v5 attached, where the wording is almost the same as in
pg_buffercache_numa_pages().

Also I think that it is not correct in pg_get_shmem_allocations_numa(), I think
that it should be something like proposed in [1].

[1]: https://www.postgresql.org/message-id/aH4DDhdiG9Gi0rG7%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Add os_page_num to pg_buffercache

From

Mircea Cadariu

Date:

July 24, 17:30:06

Hi,

Thanks for the update! I tried v5 and it returns the expected results on my laptop, same as before.

Just two further remarks for your consideration.

+      <para>
+       number of OS memory page for this buffer
+      </para></entry>

Let's capitalize the first letter here.

+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_os_pages;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_os_pages;
+RESET role;

In the existing pg_buffercache.sql there are sections similar to the above (SET ROLE pg_database_owner/pg_monitor ... RESET role), with a couple of different SELECT statements within. Should we rather add the above new SELECTs there, instead of in the new pg_buffercache_os_pages.sql?

Kind regards,

Mircea Cadariu

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

July 25, 15:58:34

Hi,

On Thu, Jul 24, 2025 at 10:30:06PM +0800, Mircea Cadariu wrote:
> I tried v5 and it returns the expected results on my
> laptop, same as before.

Thanks for the review and testing.

> 
> Just two further remarks for your consideration.
> 
> > +      <para>
> > +       number of OS memory page for this buffer
> > +      </para></entry>
> Let's capitalize the first letter here.

It's copy/pasted from pg_buffercache_numa, but I agree that both (the one
in pg_buffercache_numa and the new one) should be capitalized (for consistency
with the other views).

Done in the attached.

> > +-- Check that the functions / views can't be accessed by default. To avoid
> > +-- having to create a dedicated user, use the pg_database_owner pseudo-role.
> > +SET ROLE pg_database_owner;
> > +SELECT count(*) > 0 FROM pg_buffercache_os_pages;
> > +RESET role;
> > +
> > +-- Check that pg_monitor is allowed to query view / function
> > +SET ROLE pg_monitor;
> > +SELECT count(*) > 0 FROM pg_buffercache_os_pages;
> > +RESET role;
> In the existing pg_buffercache.sql there are sections similar to the above
> (SET ROLE pg_database_owner/pg_monitor ... RESET role), with a couple of
> different SELECT statements within. Should we rather add the above new
> SELECTs there, instead of in the new pg_buffercache_os_pages.sql?

Yeah, that probably makes more sense, done in the attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Add os_page_num to pg_buffercache

From

Mircea Cadariu

Date:

July 28, 10:52:16

Hi,

Thanks! As the most recent changes are only docs and tests, I did not try out v6 anymore, but just checked the CI result; all green.

I've set the status to Ready for Committer.

Kind regards,

Mircea Cadariu

Re: Add os_page_num to pg_buffercache

From

Mircea Cadariu

Date:

August 21, 14:08:37

Hi,

A small addendum might make sense for this patch, given a recent change to master.

A CHECK_FOR_INTERRUPTS() call was added in several pg_buffercache functions in commit eab9e4e.

See also the corresponding discussion [1].

Shall we add it to the function introduced in this patch as well?

Kind regards,

Mircea Cadariu

[1] https://www.postgresql.org/message-id/flat/CAHg%2BQDcejeLx7WunFT3DX6XKh1KshvGKa8F5au8xVhqVvvQPRw%40mail.gmail.com

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

August 22, 11:48:57

Hi,

On Thu, Aug 21, 2025 at 12:08:37PM +0100, Mircea Cadariu wrote:
> Hi,
> 
> A small addendum might make sense for this patch, given a recent change to
> master.
> 
> A CHECK_FOR_INTERRUPTS() call was added in several pg_buffercache functions
> in commit eab9e4e.
> 
> See also the corresponding discussion [1].
> 
> Shall we add it to the function introduced in this patch as well?

Yeah, I think so. Thanks for the ping, done in attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Add os_page_num to pg_buffercache

From

Michael Paquier

Date:

November 18, 08:39:36

On Fri, Aug 22, 2025 at 08:48:57AM +0000, Bertrand Drouvot wrote:
> Yeah, I think so. Thanks for the ping, done in attached.

The patch has been marked as ready for committer, and I see the value
in providing a view where it is possible to know to which OS page a
given shared buffer is linked to, based on the OS page size and our
shared buffer size.  No problem with that.

Now, I am not really a fan of the duplication created here, where most
of the code in pg_buffercache_os_pages() is a plain copy-paste of
pg_buffercache_numa_pages(), with the difference being that we want
only os_page_status via a call to pg_numa_query_pages(), something
that we can rely on when pg_numa_init() is able to work.  The
copy-paste is not complete, actually, we surely care about the
Assert() done after calling pg_get_shmem_pagesize(), that acts as a
sanity check to make sure that the buffer size and the OS page size
are divisible pieces (one being divisible by the other), and there is
more that would be nice to not duplicate, like the start pointer
location, the comment on top of pg_get_shmem_pagesize(), etc.

It seems to me that it would be simpler to make the allocations and
information of os_page_ptrs and os_page_status conditional depending
on the result of pg_numa_init(), and that we could just fill numa_node
with NULLs if numa is not available in the environment where the query
is run.  This includes making pg_numa_touch_mem_if_required()
conditional, of course, not called when pg_numa_init() fails.  Or is
there a strong reason where it would be better to rely on an
elog(ERROR) if numa(3) fails, based on numa_available()?  The purpose
of these views being monitoring, it is usually easier in my experience
to rely on NULLs rather than facing periodic errors when we don't know
something.  That makes JOINs more predictable, for one.
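
As a rough illustration of the NULL-instead-of-error idea, a row could carry an explicit null flag for numa_node, so the OS page information is always returned. This is a standalone sketch under assumed names: OsPageRow and make_os_page_row() are hypothetical and do not reflect the extension's actual tuple-building code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical row layout: one (buffer, OS page) pair with a nullable NUMA node. */
typedef struct OsPageRow
{
    int32_t bufferid;
    int64_t os_page_num;
    int32_t numa_node;          /* meaningful only if !numa_node_isnull */
    bool    numa_node_isnull;   /* true when NUMA information is unavailable */
} OsPageRow;

/*
 * Build a row. When NUMA is not available the node is reported as NULL
 * instead of raising an error, so bufferid and os_page_num stay usable.
 */
OsPageRow
make_os_page_row(int32_t bufferid, int64_t os_page_num,
                 bool numa_available, int32_t queried_node)
{
    OsPageRow row;

    assert(bufferid >= 1 && os_page_num >= 0);

    row.bufferid = bufferid;
    row.os_page_num = os_page_num;
    row.numa_node_isnull = !numa_available;
    row.numa_node = numa_available ? queried_node : 0;
    return row;
}
```

A caller on a platform without NUMA support would pass numa_available = false and still get a complete (bufferid, os_page_num) pair.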

Please note that I don't mind the extra view pg_buffercache_os_pages
that can provide some information that's transparent cross-platform,
but let's make it something that calls pg_buffercache_numa_pages()
instead.

Note 1: the patch failed to compile as we don't need a buffer state
anymore when unlocking a buffer.

Note 2: Nice catch about the description of os_page_num, applied this
one separately.
--
Michael

Attachments

signature.asc

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

November 19, 12:28:09

Hi,

On Tue, Nov 18, 2025 at 02:39:36PM +0900, Michael Paquier wrote:
> On Fri, Aug 22, 2025 at 08:48:57AM +0000, Bertrand Drouvot wrote:
> > Yeah, I think so. Thanks for the ping, done in attached.
> 
> The patch has been marked as ready for committer, and I see the value
> in providing a view where it is possible to know to which OS page a
> given shared buffer is linked to, based on the OS page size and our
> shared buffer size.  No problem with that.

Thanks for looking at it!

> Now, I am not really a fan of the duplication created here, where most
> of the code pg_buffercache_os_pages() is a plain copy-paste of
> pg_buffercache_numa_pages(), with the difference being that we want
> only os_page_status via a call to pg_numa_query_pages(),

Agreed. 0001 helps to avoid code duplication, but I agree that we could do more.

> It seems to me that it would be simpler to make the allocations and
> information of os_page_ptrs and os_page_status conditional depending
> on the result of pg_numa_init(), and that we could just fill numa_node
> with NULLs if numa is not available in the environment where the query
> is run.  This includes making pg_numa_touch_mem_if_required()
> conditional, of course, not called when pg_numa_init() fails.

That's a good idea and I think we have 2 options here.

Option 1:

We could simply create the pg_buffercache_os_pages view on top of the
pg_buffercache_numa one. The downside I can think of is that, when numa is
available, pg_buffercache_os_pages would pay the extra cost that also makes
pg_buffercache_numa slow.

In that case there would be no real benefit in adding a new view: we could just
keep pg_buffercache_numa, fill numa_node with NULLs if numa is not available, and
also document the use case (with an example) when numa is not available.

That would achieve the main goal.

Option 2:

Still make changes in pg_buffercache_numa_pages() and fill with NULL when
numa is not available. Then create a helper to map buffers to OS
pages without any NUMA-specific operations.

That way we could create a dedicated view pg_buffercache_os_pages on top of
a new function. No code duplication and the new view would not get the extra
cost if numa is available.

> Or is
> there a strong reason where it would be better to rely on an
> elog(ERROR) if numa(3) fails, based on numa_available()?  The purpose
> of these views being monitoring, it is usually easier in my experience
> to rely on NULLs rather than facing periodic errors when we don't know
> something.  That makes JOINs more predictable, for one.

Yeah, I don't think it's an issue to fill with NULL instead of generating errors,
especially as it gives added value.

> Please note that I don't mind the extra view pg_buffercache_os_pages
> that can provide some information that's transparent cross-platform,
> but let's make it something that calls pg_buffercache_numa_pages()
> instead.

So, I think we have Option 1 and Option 2. Option 1 is very simple, and Option
2 would make it possible to get the desired information with no performance
penalty when numa is available.

I'm tempted to vote for 1 as I'm not sure the larger code refactoring of option
2 is worth the benefits. Thoughts?

> Note 2: Nice catch about the description of os_page_num, applied this
> one separately.

Thanks!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Add os_page_num to pg_buffercache

From

Michael Paquier

Date:

November 19, 16:49:49

On Wed, Nov 19, 2025 at 09:28:09AM +0000, Bertrand Drouvot wrote:
> Option 1:
>
> We could simply create the pg_buffercache_os_pages view on top of the
> pg_buffercache_numa one. The cons I can think of is that, when numa is available,
> then pg_buffercache_os_pages would pay the extra cost that also make
> pg_buffercache_numa slow.
>
> Then there is no real benefits for adding a new view, we could just keep
> pg_buffercache_numa and fill numa_node with NULLs if numa is not available and
> document also the use case (with an example) when numa is not available.
>
> That would achieve the main goal.
>
> Option 2:
>
> Still make changes in pg_buffercache_numa_pages() and fill with NULL when
> numa is not available. Then create an helper to do the mapping buffers to OS
> pages without any NUMA specific operations.
>
> That way we could create a dedicated view pg_buffercache_os_pages on top of
> a new function. No code duplication and the new view would not get the extra
> cost if numa is available.

Hmm.  I can think of an option 3 here: pg_buffercache outlines the
view pg_buffercache_numa as the primary choice over
pg_buffercache_numa_pages().  So I would suggest a more drastic
strategy, that should not break monitoring queries with the views
being the primary source for the results:
- Rename pg_buffercache_numa_pages() to pg_buffercache_os_pages(),
which takes a boolean argument as input to decide if the numa code
should be executed or not.
- Creation of a second view for the OS pages that calls
pg_buffercache_os_pages() without the numa code activated, for the two
attributes that matter.
- Switch the existing view pg_buffercache_numa to call
pg_buffercache_os_pages() with the numa code activated.  If NUMA
cannot be set up, elog(ERROR).
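
The shape of option 3 can be sketched in standalone C as follows.  This
is only an illustration under stated assumptions: the sketch_* names,
the status enum, and the page_nodes input array are hypothetical, and
the error return stands in for elog(ERROR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum SketchStatus
{
	SKETCH_OK,
	SKETCH_ERR_NUMA_UNAVAILABLE	/* stands in for elog(ERROR) */
} SketchStatus;

typedef struct SketchPageInfo
{
	long		os_page_num;
	int			numa_node;
	bool		numa_node_isnull;
} SketchPageInfo;

/*
 * One function serving both views: include_numa selects whether the
 * NUMA lookup is performed.  page_nodes[] stands in for the result of
 * the NUMA query and is only read when include_numa is true.
 */
static SketchStatus
sketch_os_pages(bool include_numa, bool numa_available,
				const int *page_nodes, size_t npages,
				SketchPageInfo *out)
{
	/* The NUMA view keeps its current behavior: fail if NUMA is absent. */
	if (include_numa && !numa_available)
		return SKETCH_ERR_NUMA_UNAVAILABLE;

	for (size_t i = 0; i < npages; i++)
	{
		out[i].os_page_num = (long) i;
		if (include_numa)
		{
			out[i].numa_node = page_nodes[i];
			out[i].numa_node_isnull = false;
		}
		else
		{
			/* OS-pages view: skip the costly NUMA part entirely. */
			out[i].numa_node = 0;
			out[i].numa_node_isnull = true;
		}
	}
	return SKETCH_OK;
}
```

The OS-pages view would correspond to include_numa = false and works on
any platform, while the NUMA view corresponds to include_numa = true.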
--
Michael

Attachments

signature.asc

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

November 20, 19:59:07

Hi,

On Wed, Nov 19, 2025 at 10:49:49PM +0900, Michael Paquier wrote:
> On Wed, Nov 19, 2025 at 09:28:09AM +0000, Bertrand Drouvot wrote:
> > Option 1:
> > 
> > We could simply create the pg_buffercache_os_pages view on top of the
> > pg_buffercache_numa one. The cons I can think of is that, when numa is available,
> > then pg_buffercache_os_pages would pay the extra cost that also make
> > pg_buffercache_numa slow.
> > 
> > Then there is no real benefits for adding a new view, we could just keep
> > pg_buffercache_numa and fill numa_node with NULLs if numa is not available and
> > document also the use case (with an example) when numa is not available.
> > 
> > That would achieve the main goal.
> > 
> > Option 2:
> > 
> > Still make changes in pg_buffercache_numa_pages() and fill with NULL when 
> > numa is not available. Then create an helper to do the mapping buffers to OS
> > pages without any NUMA specific operations.
> > 
> > That way we could create a dedicated view pg_buffercache_os_pages on top of
> > a new function. No code duplication and the new view would not get the extra
> > cost if numa is available.
> 
> Hmm.  I can think about an option 3 here: pg_buffercache outlines the
> view pg_buffercache_numa as the primary choice over
> pg_buffercache_numa_pages().  So I would suggest a more drastic
> strategy, that should not break monitoring queries with the views
> being the primary source for the results:
> - Rename of pg_buffercache_numa_pages() to pg_buffercache_os_pages(),
> that takes in input a boolean argument to decide if numa should be
> executed or not.
> - Creation of a second view for the OS pages that calls
> pg_buffercache_os_pages() without the numa code activated, for the two
> attributes that matter.
> - Switch the existing view pg_buffercache_numa to call
> pg_buffercache_os_pages() with the numa code activated.  If NUMA
> cannot be set up, elog(ERROR).

Love the idea: the new view would not suffer from the numa availability overhead
and the current behavior is kept. Will look at it, thanks!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

November 21, 14:53:52

Hi,

On Thu, Nov 20, 2025 at 04:59:07PM +0000, Bertrand Drouvot wrote:
> On Wed, Nov 19, 2025 at 10:49:49PM +0900, Michael Paquier wrote:
> > 
> > Hmm.  I can think about an option 3 here: pg_buffercache outlines the
> > view pg_buffercache_numa as the primary choice over
> > pg_buffercache_numa_pages().  So I would suggest a more drastic
> > strategy, that should not break monitoring queries with the views
> > being the primary source for the results:
> > - Rename of pg_buffercache_numa_pages() to pg_buffercache_os_pages(),
> > that takes in input a boolean argument to decide if numa should be
> > executed or not.
> > - Creation of a second view for the OS pages that calls
> > pg_buffercache_os_pages() without the numa code activated, for the two
> > attributes that matter.
> > - Switch the existing view pg_buffercache_numa to call
> > pg_buffercache_os_pages() with the numa code activated.  If NUMA
> > cannot be set up, elog(ERROR).
> 
> Love the idea: the new view would not suffer from the numa availability overhead
> and the current behavior is kept. Will look at it, thanks!

Here they are:

0001:

It is the same as the one shared in [1].

0002:

Introduce GET_MAX_BUFFER_ENTRIES and get_buffer_page_boundaries

They are not really needed anymore since we'll avoid code duplication with the
new approach. That said, I think they help code readability, so I'm keeping them
(I don't have a strong opinion about it if others prefer not to add them).

0003: 

Adding pg_buffercache_numa_pages_internal()

This new function makes NUMA data collection conditional.

It extracts the current core logic of pg_buffercache_numa_pages() into an
internal function that accepts a boolean parameter. It's currently only called
with the boolean set to true to serve the pg_buffercache_numa view needs.

It's done that way to ease review but could be pushed as is.

0004:

Add pg_buffercache_os_pages function and view

The patch:

- renames pg_buffercache_numa_pages_internal() to pg_buffercache_os_pages()
- keeps pg_buffercache_numa_pages() as a backward compatibility wrapper
- re-creates the pg_buffercache_numa view on top of pg_buffercache_os_pages using
 true as argument
- adds docs
- adds tests

Remark for the doc: the patch does not show the pg_buffercache_os_pages() parameter.
It just mentions that it exists. I think that's fine given that a) the same is
true for pg_buffercache_evict() and pg_buffercache_evict_relation() (maybe that
should be changed though), b) the only purpose of this function is to be linked
to the pg_buffercache_os_pages and pg_buffercache_numa views.

[1]: https://www.postgresql.org/message-id/aSBOKX6pLJzumbmF%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Add os_page_num to pg_buffercache

From

Bertrand Drouvot

Date:

November 23, 12:15:31

Hi,

On Fri, Nov 21, 2025 at 11:53:52AM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Nov 20, 2025 at 04:59:07PM +0000, Bertrand Drouvot wrote:
> > On Wed, Nov 19, 2025 at 10:49:49PM +0900, Michael Paquier wrote:
> > > 
> > > Hmm.  I can think about an option 3 here: pg_buffercache outlines the
> > > view pg_buffercache_numa as the primary choice over
> > > pg_buffercache_numa_pages().  So I would suggest a more drastic
> > > strategy, that should not break monitoring queries with the views
> > > being the primary source for the results:
> > > - Rename of pg_buffercache_numa_pages() to pg_buffercache_os_pages(),
> > > that takes in input a boolean argument to decide if numa should be
> > > executed or not.
> > > - Creation of a second view for the OS pages that calls
> > > pg_buffercache_os_pages() without the numa code activated, for the two
> > > attributes that matter.
> > > - Switch the existing view pg_buffercache_numa to call
> > > pg_buffercache_os_pages() with the numa code activated.  If NUMA
> > > cannot be set up, elog(ERROR).
> > 
> > Love the idea: the new view would not suffer from the numa availability overhead
> > and the current behavior is kept. Will look at it, thanks!
> 
> Here they are:

Attached is a rebase due to 7d9043aee80. Also, 0003 has a minor change (compared
to v8-0004) to avoid this error when creating the 1.6 version with the new code:

postgres=# create extension pg_buffercache version '1.6';
CREATE EXTENSION
postgres=# select count(*) from pg_buffercache_numa;
ERROR:  set-valued function called in context that cannot accept a set

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Add os_page_num to pg_buffercache

From

Michael Paquier

Date:

November 24, 08:35:32

On Sun, Nov 23, 2025 at 09:15:31AM +0000, Bertrand Drouvot wrote:
> Attached a rebase due to 7d9043aee80. Also 0003 has a minor change (as compared
> to v8-0004) to avoid this error when creating the 1.6 version with the new code:

Yes, sorry, I forgot to mention that part.  I have played with the
patch for a couple of hours, fixed a couple of issues, rewording and
tweaking things while browsing the whole (typedefs.list was incorrect,
docs were partially incorrect), and applied the result.  I did not see
a point in 0001 either, because with the refactored "internal" function
we'd have just one caller for the proposed macro and function.

The original function pg_buffercache_numa_pages could be dropped when
upgrading to v1.7 now that the view pg_buffercache_numa relies on the
new SQL function pg_buffercache_os_pages(boolean), but I could not be
really excited about that.  We could add a DROP FUNCTION, of course.

By the way, thanks for the effort of splitting up things.  This was
super useful for the review and when dealing with each part of the
proposed changes.
--
Michael

Attachments

signature.asc
