Thread: Re: pgsql: Introduce pg_shmem_allocations_numa view


Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Christoph Berg
Date:
Re: Tomas Vondra
> Introduce pg_shmem_allocations_numa view

This is acting up on Debian's 32-bit architectures, namely i386, armel
and armhf:

--- /build/reproducible-path/postgresql-18-18~beta1+20250612/src/test/regress/expected/numa.out    2025-06-12 12:21:21.000000000 +0000
+++ /build/reproducible-path/postgresql-18-18~beta1+20250612/build/src/test/regress/results/numa.out    2025-06-12 20:20:33.124292694 +0000
@@ -6,8 +6,4 @@
 -- switch to superuser
 \c -
 SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
- ok
-----
- t
-(1 row)
-
+ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -14

The diff is the same on all architectures.

-14 seems to be -EFAULT, and move_pages(2) says:

   Page states in the status array
       The following values can be returned in each element of the status array.

       -EFAULT
              This is a zero page or the memory area is not mapped by the process.
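In query mode (nodes == NULL), move_pages(2) can return 0 even when individual pages could not be resolved; each status[] entry then carries either a node id (>= 0) or a negative errno such as -EFAULT. A minimal sketch of decoding one such entry (the helper name is ours, for illustration; this is not PostgreSQL code):

```c
#include <assert.h>
#include <errno.h>

/*
 * Interpret one entry of the move_pages(2) status array (query mode,
 * nodes == NULL): values >= 0 are NUMA node ids, negative values are
 * -errno, e.g. -EFAULT (-14 on Linux) for a zero page or an address
 * not mapped by the calling process.
 */
const char *
page_status_str(int status)
{
    if (status >= 0)
        return "resident on a NUMA node";
    switch (-status)
    {
        case EFAULT:
            return "zero page or not mapped by the process";
        case ENOENT:
            return "page not present";
        default:
            return "other per-page error";
    }
}
```

With this decoding, the -14 seen in the regression diff maps directly to the EFAULT case.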

https://buildd.debian.org/status/logs.php?pkg=postgresql-18&ver=18%7Ebeta1%2B20250612-1

https://buildd.debian.org/status/fetch.php?pkg=postgresql-18&arch=armel&ver=18%7Ebeta1%2B20250612-1&stamp=1749759646&raw=0

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Christoph Berg
Date:
Re: To Tomas Vondra
> This is acting up on Debian's 32-bit architectures, namely i386, armel
> and armhf:

... and x32 (x86_64 instruction set with 32-bit pointers).

>  SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
> +ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -14
> 
> -14 seems to be -EFAULT, and move_pages(2) says:
>        -EFAULT
>               This is a zero page or the memory area is not mapped by the process.

I did some debugging on i386 and made it print the page numbers:

 SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+WARNING:  invalid NUMA node id outside of allowed range [0, 0]: -14 for page 35
+WARNING:  invalid NUMA node id outside of allowed range [0, 0]: -14 for page 36
...
+WARNING:  invalid NUMA node id outside of allowed range [0, 0]: -14 for page 32768
+WARNING:  invalid NUMA node id outside of allowed range [0, 0]: -14 for page 32769

So it works for the first few pages and then the rest is EFAULT.

I think the pg_numa_touch_mem_if_required() hack might not be enough
to force the pages to be allocated. Changing that to a memcpy() didn't
help. Is there some optimization where zero pages aren't allocated
until they are written to?
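For context, the touch hack boils down to a volatile read of each OS page; a rough standalone sketch of that idea (function name and loop are ours, not the actual PostgreSQL code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fault in a memory region by reading one byte per OS page.  A volatile
 * read stops the compiler from eliding the access, but it is still only
 * a read: if the kernel backs untouched pages with a shared zero page,
 * reading may not be enough for move_pages(2) to attribute them to this
 * process, which would match the -EFAULT results seen above.
 */
void
touch_all_pages(const char *start, size_t length, size_t os_page_size)
{
    for (size_t off = 0; off < length; off += os_page_size)
        (void) *(volatile const char *) (start + off);
}
```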

Why do we try to force the pages to be allocated at all? This is just
a monitoring function, it should not change the actual system state.
Why not just skip any page where the status is <0?
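The skip idea could look roughly like this: tally per-node page counts and route negative statuses into a separate error bucket instead of raising an error for the whole query (a sketch only; function and bucket names are ours):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8

/*
 * Tally page counts per NUMA node from a move_pages(2) status array,
 * routing any negative status (e.g. -EFAULT) into a separate error
 * bucket instead of failing the whole query.
 */
void
tally_pages(const int *status, size_t npages,
            size_t counts[MAX_NODES], size_t *errors)
{
    for (size_t i = 0; i < npages; i++)
    {
        if (status[i] >= 0 && status[i] < MAX_NODES)
            counts[status[i]]++;
        else
            (*errors)++;
    }
}
```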

The attached patch removes that logic. Regression tests pass, but we
probably have to think about whether to report these negative numbers
as-is or perhaps convert them to NULL.

Christoph

Attachments

Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Christoph Berg
Date:
Re: To Tomas Vondra
> Why do we try to force the pages to be allocated at all? This is just
> a monitoring function, it should not change the actual system state.

One-time touching might also not be enough: what if the pages later
get swapped out and the monitoring functions are called again? They
will have to deal with these "not in memory" error conditions anyway.

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Andres Freund
Date:
Hi,

On 2025-06-23 16:48:27 +0200, Christoph Berg wrote:
> Re: To Tomas Vondra
> > Why do we try to force the pages to be allocated at all? This is just
> > a monitoring function, it should not change the actual system state.

The problem is that the kernel function just gives bogus results for pages
that *are* present in memory but that have only been touched in another
process that has mapped the same range of memory.


> One-time touching might also not be enough, what if the pages later
> get swapped out and the monitoring functions are called again?

I don't think that's a problem; the process still has a relevant page
table entry in that case.

Greetings,

Andres Freund



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Christoph Berg
Date:
Re: Andres Freund
> > > Why do we try to force the pages to be allocated at all? This is just
> > > a monitoring function, it should not change the actual system state.
> 
> The problem is that the kernel function just gives bogus results for pages
> that *are* present in memory but that have only touched in another process
> that has mapped the same range of memory.

Ok, so we leave the touching in, but still defend against negative
status values?

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Andres Freund
Date:
Hi,

On 2025-06-23 17:59:24 +0200, Christoph Berg wrote:
> Re: To Andres Freund
> > Ok, so we leave the touching in, but still defend against negative
> > status values?
> 
> v2 attached.

How confident are we that this isn't actually because we passed a bogus
address to the kernel or such? With this patch, are *any* pages recognized as
valid on the machines that triggered the error?

I wonder if we ought to report the failures as a separate "numa node"
(e.g. NULL as node id) instead ...

Greetings,

Andres Freund



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Christoph Berg
Date:
Re: Andres Freund
> How confident are we that this isn't actually because we passed a bogus
> address to the kernel or such? With this patch, are *any* pages recognized as
> valid on the machines that triggered the error?

See upthread - the first 35 pages were ok, then a lot of -14.

> I wonder if we ought to report the failures as a separate "numa node"
> (e.g. NULL as node id) instead ...

Did that now, using N+1 (== 1 here) for errors in this Debian i386
environment (chroot on an amd64 host):

select * from pg_shmem_allocations_numa \crosstabview 

                      name                      │    0     │    1
────────────────────────────────────────────────┼──────────┼──────────
 multixact_offset                               │    69632 │    65536
 subtransaction                                 │   139264 │   131072
 notify                                         │   139264 │        0
 Shared Memory Stats                            │   188416 │   131072
 serializable                                   │   188416 │    86016
 PROCLOCK hash                                  │     4096 │        0
 FinishedSerializableTransactions               │     4096 │        0
 XLOG Ctl                                       │  2117632 │  2097152
 Shared MultiXact State                         │     4096 │        0
 Proc Header                                    │     4096 │        0
 Archiver Data                                  │     4096 │        0
.... more 0s in the last column ...
 AioHandleData                                  │  1429504 │        0
 Buffer Blocks                                  │ 67117056 │ 67108864
 Buffer IO Condition Variables                  │   266240 │        0
 Proc Array                                     │     4096 │        0
.... more 0s
(73 rows)


There is something fishy with pg_buffercache. If I restart PG, I'm
getting "Bad address" (errno 14), this time as return value of
move_pages().

postgres =# select * from pg_buffercache_numa;
DEBUG:  00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:383
2025-06-23 19:41:41.315 UTC [1331894] ERROR:  failed NUMA pages inquiry: Bad address
2025-06-23 19:41:41.315 UTC [1331894] STATEMENT:  select * from pg_buffercache_numa;
ERROR:  XX000: failed NUMA pages inquiry: Bad address
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:394

Repeated calls are fine.

Maybe NUMA is just not supported on 32-bit archs, but I'd rather be
sure about that before playing that card.

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Tomas Vondra
Date:

On 6/23/25 21:57, Christoph Berg wrote:
> Re: Andres Freund
>> How confident are we that this isn't actually because we passed a bogus
>> address to the kernel or such? With this patch, are *any* pages recognized as
>> valid on the machines that triggered the error?
> 
> See upthread - the first 35 pages were ok, then a lot of -14.
> 
>> I wonder if we ought to report the failures as a separate "numa node"
>> (e.g. NULL as node id) instead ...
> 
> Did that now, using N+1 (== 1 here) for errors in this Debian i386
> environment (chroot on an amd64 host):
> 
> select * from pg_shmem_allocations_numa \crosstabview 
> 
>                       name                      │    0     │    1
> ────────────────────────────────────────────────┼──────────┼──────────
>  multixact_offset                               │    69632 │    65536
>  subtransaction                                 │   139264 │   131072
>  notify                                         │   139264 │        0
>  Shared Memory Stats                            │   188416 │   131072
>  serializable                                   │   188416 │    86016
>  PROCLOCK hash                                  │     4096 │        0
>  FinishedSerializableTransactions               │     4096 │        0
>  XLOG Ctl                                       │  2117632 │  2097152
>  Shared MultiXact State                         │     4096 │        0
>  Proc Header                                    │     4096 │        0
>  Archiver Data                                  │     4096 │        0
> .... more 0s in the last column ...
>  AioHandleData                                  │  1429504 │        0
>  Buffer Blocks                                  │ 67117056 │ 67108864
>  Buffer IO Condition Variables                  │   266240 │        0
>  Proc Array                                     │     4096 │        0
> .... more 0s
> (73 rows)
> 
> 
> There is something fishy with pg_buffercache. If I restart PG, I'm
> getting "Bad address" (errno 14), this time as return value of
> move_pages().
> 
> postgres =# select * from pg_buffercache_numa;
> DEBUG:  00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:383
> 2025-06-23 19:41:41.315 UTC [1331894] ERROR:  failed NUMA pages inquiry: Bad address
> 2025-06-23 19:41:41.315 UTC [1331894] STATEMENT:  select * from pg_buffercache_numa;
> ERROR:  XX000: failed NUMA pages inquiry: Bad address
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:394
> 
> Repeated calls are fine.
> 

Huh. So it's only the first call that does this?

Can you maybe print the addresses passed to pg_numa_query_pages? I
wonder if there's some bug in how we fill that array. Not sure why it
would happen only on 32-bit systems, though.

I'll create a 32-bit VM so that I can try reproducing this.


regards

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Christoph Berg
Date:
Re: Tomas Vondra
> Huh. So it's only the first call that does this?

The first call after a restart. Reconnecting is not enough.

> Can you maybe print the addresses passed to pg_numa_query_pages? I

The addresses look good:

Breakpoint 1, pg_numa_query_pages (pid=0, count=32768, pages=0xeb44d02c, status=0xeb42c02c) at ../src/port/pg_numa.c:49
49        return numa_move_pages(pid, count, pages, NULL, status, 0);
(gdb) p *pages
$1 = (void *) 0xebc33000
(gdb) p pages[1]
$2 = (void *) 0xebc34000
(gdb) p pages[2]
$3 = (void *) 0xebc35000


> wonder if there's some bug in how we fill that array. Not sure why would
> it happen only on 32-bit systems, though.

I found something, but that should be harmless:

--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -365,7 +365,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)

         /* Used to determine the NUMA node for all OS pages at once */
         os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
-        os_page_status = palloc(sizeof(uint64) * os_page_count);
+        os_page_status = palloc(sizeof(int) * os_page_count);

         /* Fill pointers for all the memory pages. */
         idx = 0;

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Tomas Vondra
Date:

On 6/23/25 22:31, Christoph Berg wrote:
> Re: Tomas Vondra
>> Huh. So it's only the first call that does this?
> 
> The first call after a restart. Reconnecting is not enough.
> 
>> Can you maybe print the addresses passed to pg_numa_query_pages? I
> 
> The addresses look good:
> 
> Breakpoint 1, pg_numa_query_pages (pid=0, count=32768, pages=0xeb44d02c, status=0xeb42c02c) at ../src/port/pg_numa.c:49
> 49        return numa_move_pages(pid, count, pages, NULL, status, 0);
> (gdb) p *pages
> $1 = (void *) 0xebc33000
> (gdb) p pages[1]
> $2 = (void *) 0xebc34000
> (gdb) p pages[2]
> $3 = (void *) 0xebc35000
> 

You said the first ~35 addresses succeed, right? What about the
addresses after that?

> 
>> wonder if there's some bug in how we fill that array. Not sure why would
>> it happen only on 32-bit systems, though.
> 
> I found something, but that should be harmless:
> 
> --- a/contrib/pg_buffercache/pg_buffercache_pages.c
> +++ b/contrib/pg_buffercache/pg_buffercache_pages.c
> @@ -365,7 +365,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
> 
>          /* Used to determine the NUMA node for all OS pages at once */
>          os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
> -        os_page_status = palloc(sizeof(uint64) * os_page_count);
> +        os_page_status = palloc(sizeof(int) * os_page_count);
> 

Yes, good catch. But as you say, that should be benign: we allocate
more memory than needed; I don't think it should break anything.


-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Christoph Berg
Date:
Re: Tomas Vondra
> Didn't you say the first ~35 addresses succeed, right? What about the
> addresses after that?

That was pg_shmem_allocations_numa. The pg_numa_query_pages() in there
works (does not return -1), but then some of the status[] values are
-14.

When pg_buffercache_numa fails, pg_numa_query_pages() itself
returns -14.
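So these are two distinct failure levels: the move_pages(2) call itself failing (surfaced here as "Bad address"), versus the call succeeding with negative values in individual status[] entries. A sketch separating the two (the result is passed in so the logic stays self-contained and testable; names are ours):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Sketch of the two failure levels when querying page locations.
 * A real caller would invoke move_pages(2)/numa_move_pages() first;
 * here the call's result is passed in.
 */
int
handle_query_result(int rc, int saved_errno,
                    const int *status, size_t npages,
                    size_t *bad_pages)
{
    if (rc < 0)
        return -saved_errno;    /* whole call failed, e.g. -EFAULT */

    *bad_pages = 0;
    for (size_t i = 0; i < npages; i++)
        if (status[i] < 0)      /* per-page failure, e.g. -EFAULT */
            (*bad_pages)++;
    return 0;
}
```

pg_shmem_allocations_numa hits the per-page case; the pg_buffercache_numa failure above is the whole-call case.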

The printed os_page_ptrs[] contents are the same for the failing and
non-failing calls, so the problem is probably elsewhere.

        /* Fill pointers for all the memory pages. */
        idx = 0;
        for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
        {
+           if (idx < 50)
+               elog(DEBUG1, "os_page_ptrs idx %d = %p", idx, ptr);
            os_page_ptrs[idx++] = ptr;


20:47 myon@postgres =# select * from pg_buffercache_numa;
DEBUG:  00000: os_page_ptrs idx 0 = 0xebc44000
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG:  00000: os_page_ptrs idx 1 = 0xebc45000
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG:  00000: os_page_ptrs idx 2 = 0xebc46000
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
...
DEBUG:  00000: os_page_ptrs idx 49 = 0xebc75000
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG:  00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:385
2025-06-23 20:47:41.827 UTC [1368080] ERROR:  failed NUMA pages inquiry: Bad address
2025-06-23 20:47:41.827 UTC [1368080] STATEMENT:  select * from pg_buffercache_numa;
ERROR:  XX000: failed NUMA pages inquiry: Bad address
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:396
Time: 92.757 ms

20:47 myon@postgres =# select * from pg_buffercache_numa;
DEBUG:  00000: os_page_ptrs idx 0 = 0xebc44000
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG:  00000: os_page_ptrs idx 1 = 0xebc45000
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG:  00000: os_page_ptrs idx 2 = 0xebc46000
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
...
DEBUG:  00000: os_page_ptrs idx 49 = 0xebc75000
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
DEBUG:  00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:385
DEBUG:  00000: NUMA: page-faulting the buffercache for proper NUMA readouts
LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:444
Time: 24.547 ms
20:47 myon@postgres =#


Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Tomas Vondra
Date:

On 6/23/25 22:51, Christoph Berg wrote:
> Re: Tomas Vondra
>> Didn't you say the first ~35 addresses succeed, right? What about the
>> addresses after that?
> 
> That was pg_shmem_allocations_numa. The pg_numa_query_pages() in there
> works (does not return -1), but then some of the status[] values are
> -14.
> 
> When pg_buffercache_numa fails, pg_numa_query_pages() itself
> returns -14.
> 
> The printed os_page_ptrs[] contents are the same for the failing and
> non-failing calls, so the problem is probably elsewhere.
> 
>         /* Fill pointers for all the memory pages. */
>         idx = 0;
>         for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
>         {
> +           if (idx < 50)
> +               elog(DEBUG1, "os_page_ptrs idx %d = %p", idx, ptr);
>             os_page_ptrs[idx++] = ptr;
> 
> 
> 20:47 myon@postgres =# select * from pg_buffercache_numa;
> DEBUG:  00000: os_page_ptrs idx 0 = 0xebc44000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: os_page_ptrs idx 1 = 0xebc45000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: os_page_ptrs idx 2 = 0xebc46000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: os_page_ptrs idx 3 = 0xebc47000
...
> DEBUG:  00000: os_page_ptrs idx 47 = 0xebc73000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: os_page_ptrs idx 48 = 0xebc74000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: os_page_ptrs idx 49 = 0xebc75000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:385
> 2025-06-23 20:47:41.827 UTC [1368080] ERROR:  failed NUMA pages inquiry: Bad address
> 2025-06-23 20:47:41.827 UTC [1368080] STATEMENT:  select * from pg_buffercache_numa;
> ERROR:  XX000: failed NUMA pages inquiry: Bad address
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:396
> Time: 92.757 ms
> 
> 20:47 myon@postgres =# select * from pg_buffercache_numa;
> DEBUG:  00000: os_page_ptrs idx 0 = 0xebc44000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: os_page_ptrs idx 1 = 0xebc45000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: os_page_ptrs idx 2 = 0xebc46000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: os_page_ptrs idx 3 = 0xebc47000
...
> DEBUG:  00000: os_page_ptrs idx 46 = 0xebc72000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: os_page_ptrs idx 47 = 0xebc73000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: os_page_ptrs idx 48 = 0xebc74000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: os_page_ptrs idx 49 = 0xebc75000
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:375
> DEBUG:  00000: NUMA: NBuffers=16384 os_page_count=32768 os_page_size=4096
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:385
> DEBUG:  00000: NUMA: page-faulting the buffercache for proper NUMA readouts
> LOCATION:  pg_buffercache_numa_pages, pg_buffercache_pages.c:444
> Time: 24.547 ms
> 20:47 myon@postgres =#
> 

True. If it fails on the first call, but succeeds on the others, then the
problem is likely somewhere else. But also, on the second call we won't
do the memory touching. Can you try setting firstNumaTouch=false, so
that we do this on every call?


At the beginning you mentioned this is happening on i386, armel and
armhf - are all those in qemu? I've tried on my rpi5 (with 32-bit user
space), and there everything seems to work fine. But that's an aarch64
kernel; just the user space is 32-bit.


regards

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Christoph Berg
Date:
Re: Tomas Vondra
> True. If it fails on first call, but succeeds on the other, then the
> problem is likely somewhere else. But also on the second call we won't
> do the memory touching. Can you try setting firstNumaTouch=false, so
> that we do this on every call?

With firstNumaTouch=false, it still fails on the first call.

I assume you meant actually keeping firstNumaTouch=true - but it still
fails on the first call.

The memory touching is done for the first call in each backend, but
reconnecting doesn't reset it; I have to restart PG.

> At the beginning you mentioned this is happening on i386, armel and
> armhf - are all those in qemu? I've tried on my rpi5 (with 32-bit user
> space), and there everything seems to work fine. But that's aarch64
> kernel, just the user space if 32-bit.

I'm testing on i386 in a chroot on an amd64 kernel (same for x32).
armel and armhf are also 32-bit chroots on an arm64 host.

https://buildd.debian.org/status/package.php?p=postgresql-18&suite=experimental

Maybe this is a kernel bug.

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Tomas Vondra
Date:

On 6/23/25 23:25, Christoph Berg wrote:
> Re: Tomas Vondra
>> True. If it fails on first call, but succeeds on the other, then the
>> problem is likely somewhere else. But also on the second call we won't
>> do the memory touching. Can you try setting firstNumaTouch=false, so
>> that we do this on every call?
> 
> firstNumaTouch=false, it still fails on the first call.
> 
> I assume you meant actually keeping firstNumaTouch=true - but it still
> fails on the first call.
> 

No, I meant firstNumaTouch=false, so that the touching happens on every
call. I was wondering if that makes all calls fail.

> The memory touching is done for the first call in each backend, but
> reconnecting doesn't reset it, I have to restart PG.
> 

I don't follow. Why wouldn't reconnecting reset it?

>> At the beginning you mentioned this is happening on i386, armel and
>> armhf - are all those in qemu? I've tried on my rpi5 (with 32-bit user
>> space), and there everything seems to work fine. But that's an aarch64
>> kernel, just the user space is 32-bit.
> 
> I'm testing on i386 in a chroot on an amd64 kernel (same for x32).
> armel and armhf are also 32-bit chroots on an arm64 host.
> 
> https://buildd.debian.org/status/package.php?p=postgresql-18&suite=experimental
> 
> Maybe this is a kernel bug.
> 

Or maybe the 32-bit chroot on 64-bit host matters and confuses some
calculation.


-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Tomas Vondra
Date:
On 6/23/25 23:47, Tomas Vondra wrote:
> ...
> 
> Or maybe the 32-bit chroot on 64-bit host matters and confuses some
> calculation.
>

I think it's likely something like this. I noticed that if I modify
pg_buffercache_numa_pages() to query the addresses one by one, it works.
And when I increase the number, it stops working somewhere between 16k
and 17k items.

It may be a coincidence, but I suspect it's related to the sizeof(void
*) being 8 in the kernel, but only 4 in the chroot. So the userspace
passes an array of 4-byte items, but kernel interprets that as 8-byte
items. That is, we call

long move_pages(int pid, unsigned long count, void *pages[.count], const
int nodes[.count], int status[.count], int flags);

Which (I assume) just passes the parameters straight to the kernel. And
the kernel will interpret them per its own pointer size.


If this is what's happening, I'm not sure what to do about it ...


FWIW while looking into this, I tried running this under valgrind (on a
regular 64-bit system, not in the chroot), and I get this report:

==65065== Invalid read of size 8
==65065==    at 0x113B0EBE: pg_buffercache_numa_pages
(pg_buffercache_pages.c:380)
==65065==    by 0x6B539D: ExecMakeTableFunctionResult (execSRF.c:234)
==65065==    by 0x6CEB7E: FunctionNext (nodeFunctionscan.c:94)
==65065==    by 0x6B6ACA: ExecScanFetch (execScan.h:126)
==65065==    by 0x6B6B31: ExecScanExtended (execScan.h:170)
==65065==    by 0x6B6C9D: ExecScan (execScan.c:59)
==65065==    by 0x6CEF0F: ExecFunctionScan (nodeFunctionscan.c:269)
==65065==    by 0x6B29FA: ExecProcNodeFirst (execProcnode.c:469)
==65065==    by 0x6A6F56: ExecProcNode (executor.h:313)
==65065==    by 0x6A9533: ExecutePlan (execMain.c:1679)
==65065==    by 0x6A7422: standard_ExecutorRun (execMain.c:367)
==65065==    by 0x6A7330: ExecutorRun (execMain.c:304)
==65065==    by 0x934EF0: PortalRunSelect (pquery.c:921)
==65065==    by 0x934BD8: PortalRun (pquery.c:765)
==65065==    by 0x92E4CD: exec_simple_query (postgres.c:1273)
==65065==    by 0x93301E: PostgresMain (postgres.c:4766)
==65065==    by 0x92A88B: BackendMain (backend_startup.c:124)
==65065==    by 0x85A7C7: postmaster_child_launch (launch_backend.c:290)
==65065==    by 0x860111: BackendStartup (postmaster.c:3580)
==65065==    by 0x85DE6F: ServerLoop (postmaster.c:1702)
==65065==  Address 0x7b6c000 is in a rw- anonymous segment


This fails here (on the pg_numa_touch_mem_if_required call):

    for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
    {
        os_page_ptrs[idx++] = ptr;

        /* Only need to touch memory once per backend process */
        if (firstNumaTouch)
            pg_numa_touch_mem_if_required(touch, ptr);
    }

The 0x7b6c000 is the very first pointer, and it's the only pointer that
triggers this warning. At first I thought there's something wrong with
how we align the pointer using TYPEALIGN_DOWN(), but then I noticed it's
actually the pointer of BufferGetBlock(1).

So I'm a bit puzzled by this, and I'm not sure it's related to the other
issue at all (it probably is not).

It's a bit too late here, I'll continue investigating this tomorrow.

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Tomas Vondra
Date:
On 6/24/25 10:24, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jun 24, 2025 at 03:43:19AM +0200, Tomas Vondra wrote:
>> On 6/23/25 23:47, Tomas Vondra wrote:
>>> ...
>>>
>>> Or maybe the 32-bit chroot on 64-bit host matters and confuses some
>>> calculation.
>>>
>>
>> I think it's likely something like this.
> 
> I think the same.
> 
>> I noticed that if I modify
>> pg_buffercache_numa_pages() to query the addresses one by one, it works.
>> And when I increase the number, it stops working somewhere between 16k
>> and 17k items.
> 
> Yeah, same for me with pg_get_shmem_allocations_numa(). It works if
> pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more
> than 16 pages.
> 
> It's also confirmed by test_chunk_size.c attached:
> 
> $ gcc-11 -m32 -o test_chunk_size test_chunk_size.c
> $ ./test_chunk_size
>  1 pages: SUCCESS (0 errors)
>  2 pages: SUCCESS (0 errors)
>  3 pages: SUCCESS (0 errors)
>  4 pages: SUCCESS (0 errors)
>  5 pages: SUCCESS (0 errors)
>  6 pages: SUCCESS (0 errors)
>  7 pages: SUCCESS (0 errors)
>  8 pages: SUCCESS (0 errors)
>  9 pages: SUCCESS (0 errors)
> 10 pages: SUCCESS (0 errors)
> 11 pages: SUCCESS (0 errors)
> 12 pages: SUCCESS (0 errors)
> 13 pages: SUCCESS (0 errors)
> 14 pages: SUCCESS (0 errors)
> 15 pages: SUCCESS (0 errors)
> 16 pages: SUCCESS (0 errors)
> 17 pages: 1 errors
> Threshold: 17 pages
> 
> No error if -m32 is not used.
> 
>> It may be a coincidence, but I suspect it's related to the sizeof(void
>> *) being 8 in the kernel, but only 4 in the chroot. So the userspace
>> passes an array of 4-byte items, but kernel interprets that as 8-byte
>> items. That is, we call
>>
>> long move_pages(int pid, unsigned long count, void *pages[.count], const
>> int nodes[.count], int status[.count], int flags);
>>
>> Which (I assume) just passes the parameters to kernel. And it'll
>> interpret them per kernel pointer size.
>>
> 
> I also suspect something in this area...
> 
>> If this is what's happening, I'm not sure what to do about it ...
> 
> We could work by chunks (16?) on 32 bits but would probably produce performance
> degradation (we mention it in the doc though). Also would always 16 be a correct
> chunk size? 

I don't see how this would solve anything?

AFAICS the problem is the two places are confused about how large the
array elements are, and get to interpret that differently. Using a
smaller array won't solve that. The pg function would still allocate
array of 16 x 32-bit pointers, and the kernel would interpret this as 16
x 64-bit pointers. And that means the kernel will (a) write into memory
beyond the allocated buffer - a clear buffer overflow, and (b) see bogus
pointers, because it'll concatenate two 32-bit pointers.

I don't see how using smaller array makes this correct. That it works is
more a matter of luck, and also a consequence of still allocating the
whole array, so there's no overflow (at least I kept that, not sure how
you did the chunks).

If I fix the code to make the entries 64-bit (by treating the pointers
as int64), it suddenly starts working - no bad addresses, etc. Well,
almost, because I get this

 bufferid | os_page_num | numa_node
----------+-------------+-----------
        1 |           0 |         0
        1 |           1 |       -14
        2 |           2 |         0
        2 |           3 |       -14
        3 |           4 |         0
        3 |           5 |       -14
        4 |           6 |         0
        4 |           7 |       -14
        ...

The -14 status is interesting, because that's the same value Christoph
reported as the other issue (in pg_shmem_allocations_numa).

I did an experiment and changed os_page_status to be declared as int64,
not just int. And interestingly, that produced this:

 bufferid | os_page_num | numa_node
----------+-------------+-----------
        1 |           0 |         0
        1 |           1 |         0
        2 |           2 |         0
        2 |           3 |         0
        3 |           4 |         0
        3 |           5 |         0
        4 |           6 |         0
        4 |           7 |         0
        ...

But I don't see how this makes any sense, because "int" should be 4B in
both cases (in 64-bit kernel and 32-bit chroot).

FWIW I realized this applies to "official" systems with 32-bit user
space on 64-bit kernels, like e.g. rpi5 with RPi OS 32-bit. (Fun fact,
rpi5 has 8 NUMA nodes, with all CPUs attached to all NUMA nodes.)

I'm starting to think we need to disable NUMA for setups like this,
mixing 64-bit kernels with 32-bit chroot. Is there a good way to detect
those, so that we can error-out?

FWIW this doesn't explain the strange valgrind issue, though.


-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 24, 2025 at 11:20:15AM +0200, Tomas Vondra wrote:
> On 6/24/25 10:24, Bertrand Drouvot wrote:
> > Yeah, same for me with pg_get_shmem_allocations_numa(). It works if
> > pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more
> > than 16 pages.
> > 
> > It's also confirmed by test_chunk_size.c attached:
> > 
> > $ gcc-11 -m32 -o test_chunk_size test_chunk_size.c
> > $ ./test_chunk_size
> >  1 pages: SUCCESS (0 errors)
> >  2 pages: SUCCESS (0 errors)
> >  3 pages: SUCCESS (0 errors)
> >  4 pages: SUCCESS (0 errors)
> >  5 pages: SUCCESS (0 errors)
> >  6 pages: SUCCESS (0 errors)
> >  7 pages: SUCCESS (0 errors)
> >  8 pages: SUCCESS (0 errors)
> >  9 pages: SUCCESS (0 errors)
> > 10 pages: SUCCESS (0 errors)
> > 11 pages: SUCCESS (0 errors)
> > 12 pages: SUCCESS (0 errors)
> > 13 pages: SUCCESS (0 errors)
> > 14 pages: SUCCESS (0 errors)
> > 15 pages: SUCCESS (0 errors)
> > 16 pages: SUCCESS (0 errors)
> > 17 pages: 1 errors
> > Threshold: 17 pages
> > 
> > No error if -m32 is not used.
> > 
> > We could work by chunks (16?) on 32 bits but would probably produce performance
> > degradation (we mention it in the doc though). Also would always 16 be a correct
> > chunk size? 
> 
> I don't see how this would solve anything?
> 
> AFAICS the problem is the two places are confused about how large the
> array elements are, and get to interpret that differently.

> I don't see how using smaller array makes this correct. That it works is
> more a matter of luck,

Not sure it's luck, maybe the wrong pointer arithmetic has no effect if the
batch size is <= 16.

So we have kernel_move_pages() -> do_pages_stat() (because nodes is NULL here
for us, as we call "numa_move_pages(pid, count, pages, NULL, status, 0);").

So, if we look at do_pages_stat() ([1]), we can see that it uses a hardcoded
"#define DO_PAGES_STAT_CHUNK_NR 16UL" and that this pointer arithmetic:

"
        pages += chunk_nr;
        status += chunk_nr;
"

is done but has no effect since nr_pages will exit the loop if we use a batch
size <= 16.

So if this pointer arithmetic is incorrect (it seems it should advance by
16 * sizeof(compat_uptr_t) instead), it has no effect as long as the batch
size is <= 16.

Does test_chunk_size also fail at 17 for you?

[1]: https://github.com/torvalds/linux/blob/master/mm/migrate.c

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Andres Freund
Date:
Hi,

On 2025-06-24 03:43:19 +0200, Tomas Vondra wrote:
> FWIW while looking into this, I tried running this under valgrind (on a
> regular 64-bit system, not in the chroot), and I get this report:
> 
> ==65065== Invalid read of size 8
> ==65065==    at 0x113B0EBE: pg_buffercache_numa_pages
> (pg_buffercache_pages.c:380)
> ==65065==    by 0x6B539D: ExecMakeTableFunctionResult (execSRF.c:234)
> ==65065==    by 0x6CEB7E: FunctionNext (nodeFunctionscan.c:94)
> ==65065==    by 0x6B6ACA: ExecScanFetch (execScan.h:126)
> ==65065==    by 0x6B6B31: ExecScanExtended (execScan.h:170)
> ==65065==    by 0x6B6C9D: ExecScan (execScan.c:59)
> ==65065==    by 0x6CEF0F: ExecFunctionScan (nodeFunctionscan.c:269)
> ==65065==    by 0x6B29FA: ExecProcNodeFirst (execProcnode.c:469)
> ==65065==    by 0x6A6F56: ExecProcNode (executor.h:313)
> ==65065==    by 0x6A9533: ExecutePlan (execMain.c:1679)
> ==65065==    by 0x6A7422: standard_ExecutorRun (execMain.c:367)
> ==65065==    by 0x6A7330: ExecutorRun (execMain.c:304)
> ==65065==    by 0x934EF0: PortalRunSelect (pquery.c:921)
> ==65065==    by 0x934BD8: PortalRun (pquery.c:765)
> ==65065==    by 0x92E4CD: exec_simple_query (postgres.c:1273)
> ==65065==    by 0x93301E: PostgresMain (postgres.c:4766)
> ==65065==    by 0x92A88B: BackendMain (backend_startup.c:124)
> ==65065==    by 0x85A7C7: postmaster_child_launch (launch_backend.c:290)
> ==65065==    by 0x860111: BackendStartup (postmaster.c:3580)
> ==65065==    by 0x85DE6F: ServerLoop (postmaster.c:1702)
> ==65065==  Address 0x7b6c000 is in a rw- anonymous segment
> 
> 
> This fails here (on the pg_numa_touch_mem_if_required call):
> 
>     for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
>     {
>         os_page_ptrs[idx++] = ptr;
> 
>         /* Only need to touch memory once per backend process */
>         if (firstNumaTouch)
>             pg_numa_touch_mem_if_required(touch, ptr);
>     }

That's because we mark unpinned pages as inaccessible / mark them as
accessible when pinning. See logic related to that in PinBuffer():

                /*
                 * Assume that we acquired a buffer pin for the purposes of
                 * Valgrind buffer client checks (even in !result case) to
                 * keep things simple.  Buffers that are unsafe to access are
                 * not generally guaranteed to be marked undefined or
                 * non-accessible in any case.
                 */


> The 0x7b6c000 is the very first pointer, and it's the only pointer that
> triggers this warning.

I suspect that that's because valgrind combines different reports or such.

Greetings,

Andres Freund



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Tomas Vondra
Date:

On 6/24/25 13:10, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jun 24, 2025 at 11:20:15AM +0200, Tomas Vondra wrote:
>> On 6/24/25 10:24, Bertrand Drouvot wrote:
>>> Yeah, same for me with pg_get_shmem_allocations_numa(). It works if
>>> pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more
>>> than 16 pages.
>>>
>>> It's also confirmed by test_chunk_size.c attached:
>>>
>>> $ gcc-11 -m32 -o test_chunk_size test_chunk_size.c
>>> $ ./test_chunk_size
>>>  1 pages: SUCCESS (0 errors)
>>>  2 pages: SUCCESS (0 errors)
>>>  3 pages: SUCCESS (0 errors)
>>>  4 pages: SUCCESS (0 errors)
>>>  5 pages: SUCCESS (0 errors)
>>>  6 pages: SUCCESS (0 errors)
>>>  7 pages: SUCCESS (0 errors)
>>>  8 pages: SUCCESS (0 errors)
>>>  9 pages: SUCCESS (0 errors)
>>> 10 pages: SUCCESS (0 errors)
>>> 11 pages: SUCCESS (0 errors)
>>> 12 pages: SUCCESS (0 errors)
>>> 13 pages: SUCCESS (0 errors)
>>> 14 pages: SUCCESS (0 errors)
>>> 15 pages: SUCCESS (0 errors)
>>> 16 pages: SUCCESS (0 errors)
>>> 17 pages: 1 errors
>>> Threshold: 17 pages
>>>
>>> No error if -m32 is not used.
>>>
>>> We could work by chunks (16?) on 32 bits but would probably produce performance
>>> degradation (we mention it in the doc though). Also would always 16 be a correct
>>> chunk size? 
>>
>> I don't see how this would solve anything?
>>
>> AFAICS the problem is the two places are confused about how large the
>> array elements are, and get to interpret that differently.
> 
>> I don't see how using smaller array makes this correct. That it works is
>> more a matter of luck,
> 
> Not sure it's luck, maybe the wrong pointer arithmetic has no effect if the
> batch size is <= 16.
> 
> So we have kernel_move_pages() -> do_pages_stat() (because nodes is NULL here
> for us as we call "numa_move_pages(pid, count, pages, NULL, status, 0);").
> 
> So, if we look at do_pages_stat() ([1]), we can see that it uses a hardcoded
> "#define DO_PAGES_STAT_CHUNK_NR 16UL" and that this pointers arithmetic:
> 
> "
>         pages += chunk_nr;
>         status += chunk_nr;
> "
> 
> is done but has no effect since nr_pages will exit the loop if we use a batch
> size <= 16.
> 
> So if this pointer arithmetic is not correct, (it seems that it should advance
> by 16 * sizeof(compat_uptr_t) instead) then it has no effect as long as the batch
> size is <= 16.
> 
> Does test_chunk_size also fail at 17 for you?

Yes, it fails for me at 17 too. So you're saying the access within each
chunk of 16 elements is OK, but that maybe advancing to the next chunk
is not quite right? In which case limiting the access to 16 entries
might be a workaround.

In any case, this sounds like a kernel bug, right? I don't have much
experience with the kernel code, so don't want to rely too much on my
interpretation of it.


regards

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Tomas Vondra
Date:

On 6/24/25 13:10, Andres Freund wrote:
> Hi,
> 
> On 2025-06-24 03:43:19 +0200, Tomas Vondra wrote:
>> FWIW while looking into this, I tried running this under valgrind (on a
>> regular 64-bit system, not in the chroot), and I get this report:
>>
>> ==65065== Invalid read of size 8
>> ==65065==    at 0x113B0EBE: pg_buffercache_numa_pages
>> (pg_buffercache_pages.c:380)
>> ==65065==    by 0x6B539D: ExecMakeTableFunctionResult (execSRF.c:234)
>> ==65065==    by 0x6CEB7E: FunctionNext (nodeFunctionscan.c:94)
>> ==65065==    by 0x6B6ACA: ExecScanFetch (execScan.h:126)
>> ==65065==    by 0x6B6B31: ExecScanExtended (execScan.h:170)
>> ==65065==    by 0x6B6C9D: ExecScan (execScan.c:59)
>> ==65065==    by 0x6CEF0F: ExecFunctionScan (nodeFunctionscan.c:269)
>> ==65065==    by 0x6B29FA: ExecProcNodeFirst (execProcnode.c:469)
>> ==65065==    by 0x6A6F56: ExecProcNode (executor.h:313)
>> ==65065==    by 0x6A9533: ExecutePlan (execMain.c:1679)
>> ==65065==    by 0x6A7422: standard_ExecutorRun (execMain.c:367)
>> ==65065==    by 0x6A7330: ExecutorRun (execMain.c:304)
>> ==65065==    by 0x934EF0: PortalRunSelect (pquery.c:921)
>> ==65065==    by 0x934BD8: PortalRun (pquery.c:765)
>> ==65065==    by 0x92E4CD: exec_simple_query (postgres.c:1273)
>> ==65065==    by 0x93301E: PostgresMain (postgres.c:4766)
>> ==65065==    by 0x92A88B: BackendMain (backend_startup.c:124)
>> ==65065==    by 0x85A7C7: postmaster_child_launch (launch_backend.c:290)
>> ==65065==    by 0x860111: BackendStartup (postmaster.c:3580)
>> ==65065==    by 0x85DE6F: ServerLoop (postmaster.c:1702)
>> ==65065==  Address 0x7b6c000 is in a rw- anonymous segment
>>
>>
>> This fails here (on the pg_numa_touch_mem_if_required call):
>>
>>     for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
>>     {
>>         os_page_ptrs[idx++] = ptr;
>>
>>         /* Only need to touch memory once per backend process */
>>         if (firstNumaTouch)
>>             pg_numa_touch_mem_if_required(touch, ptr);
>>     }
> 
> That's because we mark unpinned pages as inaccessible / mark them as
> accessible when pinning. See logic related to that in PinBuffer():
> 
>                 /*
>                  * Assume that we acquired a buffer pin for the purposes of
>                  * Valgrind buffer client checks (even in !result case) to
>                  * keep things simple.  Buffers that are unsafe to access are
>                  * not generally guaranteed to be marked undefined or
>                  * non-accessible in any case.
>                  */
> 
> 
>> The 0x7b6c000 is the very first pointer, and it's the only pointer that
>> triggers this warning.
> 
> I suspect that that's because valgrind combines different reports or such.
> 

Thanks. It probably is something like that, although I made sure to not
use any such options when running valgrind (so --error-limit=no). But
maybe there's something else, hiding the reports.

I guess there are two ways to address this - make sure the buffers are
marked as accessible/defined, or add a valgrind suppression. I think the
suppression is the right approach here, otherwise we'd need to worry
about already pinned buffers etc. Which seems not great, the functions
don't even care about buffers right now, they mostly work with memory
pages (especially pg_shmem_allocations_numa).

Barring objections, I'll fix it this way.


regards

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 24, 2025 at 02:33:59PM +0200, Tomas Vondra wrote:
> 
> 
> On 6/24/25 13:10, Bertrand Drouvot wrote:
> > So, if we look at do_pages_stat() ([1]), we can see that it uses a hardcoded
> > "#define DO_PAGES_STAT_CHUNK_NR 16UL" and that this pointers arithmetic:
> > 
> > "
> >         pages += chunk_nr;
> >         status += chunk_nr;
> > "
> > 
> > is done but has no effect since nr_pages will exit the loop if we use a batch
> > size <= 16.
> > 
> > So if this pointer arithmetic is not correct, (it seems that it should advance
> > by 16 * sizeof(compat_uptr_t) instead) then it has no effect as long as the batch
> > size is <= 16.
> > 
> > Does test_chunk_size also fail at 17 for you?
> 
> Yes, it fails for me at 17 too. So you're saying the access within each
> chunk of 16 elements is OK, but that maybe advancing to the next chunk
> is not quite right?

Yes, I think compat_uptr_t usage is missing in do_pages_stat() (while it's used
in do_pages_move()).

Having a chunk size <= DO_PAGES_STAT_CHUNK_NR ensures we are not affected
by the wrong pointer arithmetic.

> In which case limiting the access to 16 entries
> might be a workaround.

Yes, something like:

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index c9ae3b45b76..070ad2f13e7 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -689,8 +689,17 @@ pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
                        CHECK_FOR_INTERRUPTS();
                }

-               if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
-                       elog(ERROR, "failed NUMA pages inquiry status: %m");
+               #define NUMA_QUERY_CHUNK_SIZE 16  /* has to be <= DO_PAGES_STAT_CHUNK_NR (do_pages_stat())*/
+
+               for (uint64 chunk_start = 0; chunk_start < shm_ent_page_count; chunk_start += NUMA_QUERY_CHUNK_SIZE) {
+                        uint64 chunk_size = Min(NUMA_QUERY_CHUNK_SIZE, shm_ent_page_count - chunk_start);
+
+                       if (pg_numa_query_pages(0, chunk_size, &page_ptrs[chunk_start],
+                                                                       &pages_status[chunk_start]) == -1)
+                               elog(ERROR, "failed NUMA pages inquiry status: %m");
+               }
+
+               #undef NUMA_QUERY_CHUNK_SIZE

> In any case, this sounds like a kernel bug, right?

yes it sounds like a kernel bug.

> I don't have much
> experience with the kernel code, so don't want to rely too much on my
> interpretation of it.

I don't have that much experience either, but I think the issue is in do_pages_stat()
and that "pages += chunk_nr" should be advanced in sizeof(compat_uptr_t) steps instead.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Christoph Berg
Date:
Re: Bertrand Drouvot
> Yes, I think compat_uptr_t usage is missing in do_pages_stat() (while it's used
> in do_pages_move()).

I was also reading the kernel source around that place but you spotted
the problem before me. This patch resolves the issue here:

diff --git a/mm/migrate.c b/mm/migrate.c
index 8cf0f9c9599..542c81ec3ed 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2444,7 +2444,13 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
         if (copy_to_user(status, chunk_status, chunk_nr * sizeof(*status)))
             break;

-        pages += chunk_nr;
+        if (in_compat_syscall()) {
+            compat_uptr_t __user *pages32 = (compat_uptr_t __user *)pages;
+
+            pages32 += chunk_nr;
+            pages = (const void __user * __user *) pages32;
+        } else
+            pages += chunk_nr;
         status += chunk_nr;
         nr_pages -= chunk_nr;
     }


> Having a chunk size <= DO_PAGES_STAT_CHUNK_NR ensures we are not affected
> by the wrong pointer arithmetic.

Good idea. Buggy kernels will be around for some time.

> +               #define NUMA_QUERY_CHUNK_SIZE 16  /* has to be <= DO_PAGES_STAT_CHUNK_NR (do_pages_stat())*/
> +
> +               for (uint64 chunk_start = 0; chunk_start < shm_ent_page_count; chunk_start += NUMA_QUERY_CHUNK_SIZE)
{

Perhaps optimize it to this:

#if SIZEOF_VOID_P == 4
#define NUMA_QUERY_CHUNK_SIZE 16  /* has to be <= DO_PAGES_STAT_CHUNK_NR (do_pages_stat())*/
#else
#define NUMA_QUERY_CHUNK_SIZE 1024
#endif

... or some other bigger number.

The loop could also include CHECK_FOR_INTERRUPTS();

> > I don't have much
> > experience with the kernel code, so don't want to rely too much on my
> > interpretation of it.
> 
> I don't have that much experience too but I think the issue is in do_pages_stat()
> and that "pages += chunk_nr" should be advanced by sizeof(compat_uptr_t) instead.

Me neither, but I'll try to submit this fix.

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Tomas Vondra
Date:
On 6/24/25 16:41, Christoph Berg wrote:
> Re: Bertrand Drouvot
>> Yes, I think compat_uptr_t usage is missing in do_pages_stat() (while it's used
>> in do_pages_move()).
> 
> I was also reading the kernel source around that place but you spotted
> the problem before me. This patch resolves the issue here:
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8cf0f9c9599..542c81ec3ed 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2444,7 +2444,13 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
>          if (copy_to_user(status, chunk_status, chunk_nr * sizeof(*status)))
>              break;
> 
> -        pages += chunk_nr;
> +        if (in_compat_syscall()) {
> +            compat_uptr_t __user *pages32 = (compat_uptr_t __user *)pages;
> +
> +            pages32 += chunk_nr;
> +            pages = (const void __user * __user *) pages32;
> +        } else
> +            pages += chunk_nr;
>          status += chunk_nr;
>          nr_pages -= chunk_nr;
>      }
> 
> 
>> Having a chunk size <= DO_PAGES_STAT_CHUNK_NR ensures we are not affected
>> by the wrong pointer arithmetic.
> 
> Good idea. Buggy kernels will be around for some time.
> 

If it's a reliable fix, then I guess we can do it like this. But won't
that be a performance penalty on everyone? Or does the system split the
array into 16-element chunks anyway, so this makes no difference?

Anyway, maybe we should start by reporting this to the kernel people. Do
you want me to do that, or shall one of you take care of that? I suppose
that'd be better, as you already wrote a fix / know the code better.

>> +               #define NUMA_QUERY_CHUNK_SIZE 16  /* has to be <= DO_PAGES_STAT_CHUNK_NR (do_pages_stat())*/
>> +
>> +               for (uint64 chunk_start = 0; chunk_start < shm_ent_page_count; chunk_start += NUMA_QUERY_CHUNK_SIZE) {
> 
> Perhaps optimize it to this:
> 
> #if SIZEOF_VOID_P == 4
> #define NUMA_QUERY_CHUNK_SIZE 16  /* has to be <= DO_PAGES_STAT_CHUNK_NR (do_pages_stat())*/
> #else
> #define NUMA_QUERY_CHUNK_SIZE 1024
> #endif
> 
> ... or some other bigger number.
> 

Hmm, maybe. I guess that'd hurt only fully 32-bit systems, but that also
seems like a non-issue.

> The loop could also include CHECK_FOR_INTERRUPTS();
> 
>>> I don't have much
>>> experience with the kernel code, so don't want to rely too much on my
>>> interpretation of it.
>>
>> I don't have that much experience too but I think the issue is in do_pages_stat()
>> and that "pages += chunk_nr" should be advanced by sizeof(compat_uptr_t) instead.
> 
> Me neither, but I'll try submit this fix.
> 

+1

Thanks to both of you for the report and the investigation.


regards

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Christoph Berg
Date:
Re: Tomas Vondra
> If it's a reliable fix, then I guess we can do it like this. But won't
> that be a performance penalty on everyone? Or does the system split the
> array into 16-element chunks anyway, so this makes no difference?

There's still the overhead of the syscall itself. But no idea how
costly it is to have this 16-step loop in user or kernel space.

We could claim that on 32-bit systems, shared_buffers would be smaller
anyway, so there the overhead isn't that big. And the step size should
be larger (if at all) on 64-bit.

> Anyway, maybe we should start by reporting this to the kernel people. Do
> you want me to do that, or shall one of you take care of that? I suppose
> that'd be better, as you already wrote a fix / know the code better.

Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Tomas Vondra
Date:
On 6/24/25 17:30, Christoph Berg wrote:
> Re: Tomas Vondra
>> If it's a reliable fix, then I guess we can do it like this. But won't
>> that be a performance penalty on everyone? Or does the system split the
>> array into 16-element chunks anyway, so this makes no difference?
> 
> There's still the overhead of the syscall itself. But no idea how
> costly it is to have this 16-step loop in user or kernel space.
> 
> We could claim that on 32-bit systems, shared_buffers would be smaller
> anyway, so there the overhead isn't that big. And the step size should
> be larger (if at all) on 64-bit.
> 
>> Anyway, maybe we should start by reporting this to the kernel people. Do
>> you want me to do that, or shall one of you take care of that? I suppose
>> that'd be better, as you already wrote a fix / know the code better.
> 
> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
> 

Thanks! Now we wait ...

Attached is a minor tweak of the valgrind suppression rules, to add the
two places touching the memory. I was hoping I could add a single rule
for pg_numa_touch_mem_if_required, but that does not work - it's a
macro, not a function. So I had to add one rule for each of the two
functions querying NUMA. That's a bit disappointing, because it means
it'll hide all other failures (of Memcheck:Addr8 type) in those functions.

Perhaps it'd be better to turn pg_numa_touch_mem_if_required into a
proper (inlined) function, at least with USE_VALGRIND defined. Something
like the v2 patch - needs more testing to ensure the inlined function
doesn't break the touching or something silly like that.

regards

-- 
Tomas Vondra

Attachments

Re: pgsql: Introduce pg_shmem_allocations_numa view

From:
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 24, 2025 at 04:41:33PM +0200, Christoph Berg wrote:
> Re: Bertrand Drouvot
> > Yes, I think compat_uptr_t usage is missing in do_pages_stat() (while it's used
> > in do_pages_move()).
> 
> I was also reading the kernel source around that place but you spotted
> the problem before me. This patch resolves the issue here:
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8cf0f9c9599..542c81ec3ed 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2444,7 +2444,13 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
>          if (copy_to_user(status, chunk_status, chunk_nr * sizeof(*status)))
>              break;
> 
> -        pages += chunk_nr;
> +        if (in_compat_syscall()) {
> +            compat_uptr_t __user *pages32 = (compat_uptr_t __user *)pages;
> +
> +            pages32 += chunk_nr;
> +            pages = (const void __user * __user *) pages32;
> +        } else
> +            pages += chunk_nr;
>          status += chunk_nr;
>          nr_pages -= chunk_nr;
>      }
> 

Thanks! Yeah, I had the same kind of patch idea in mind.
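For illustration only (not part of the kernel patch): the root cause is a
pointer-stride mismatch. In a compat syscall the caller's pages[] array holds
4-byte pointers, but `pages += chunk_nr` advances by native 8-byte pointers
inside a 64-bit kernel, so after the first chunk the kernel reads past the
user's array and sees -EFAULT. A standalone sketch of the arithmetic
(function names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t compat_uptr_t;   /* 32-bit user pointer, as in the kernel */

#define DO_PAGES_STAT_CHUNK_NR 16

/* Bytes skipped per chunk by the buggy `pages += chunk_nr` (native pointers). */
static size_t
buggy_stride(void)
{
    return DO_PAGES_STAT_CHUNK_NR * sizeof(void *);
}

/* Bytes one chunk of a 32-bit caller's pages[] array actually occupies. */
static size_t
compat_stride(void)
{
    return DO_PAGES_STAT_CHUNK_NR * sizeof(compat_uptr_t);
}
```

On an LP64 kernel the two strides differ by a factor of two, which is why
pages beyond the first 16 (e.g. "page 35" in the earlier debugging output)
fail.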

> > +               #define NUMA_QUERY_CHUNK_SIZE 16  /* has to be <= DO_PAGES_STAT_CHUNK_NR (do_pages_stat())*/
> > +
> > +               for (uint64 chunk_start = 0; chunk_start < shm_ent_page_count; chunk_start += NUMA_QUERY_CHUNK_SIZE) {
> 
> Perhaps optimize it to this:
> 
> #if sizeof(void *) == 4
> #define NUMA_QUERY_CHUNK_SIZE 16  /* has to be <= DO_PAGES_STAT_CHUNK_NR (do_pages_stat())*/
> #else
> #define NUMA_QUERY_CHUNK_SIZE 1024
> #endif
> 
> ... or some other bigger number.

I had in mind to split the batch size on the PG side only for 32-bit
builds; what about the attached?
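One nit on the suggestion quoted above: the preprocessor cannot evaluate
`sizeof(void *)`, so a macro that is known at preprocessing time has to be
used instead (PostgreSQL's configure-generated SIZEOF_SIZE_T, or in plain C,
UINTPTR_MAX). A minimal sketch, with illustrative chunk sizes:

```c
#include <stdint.h>

/* sizeof() cannot appear in #if; UINTPTR_MAX (or a configure-generated
 * constant such as PostgreSQL's SIZEOF_SIZE_T) works at preprocessing time. */
#if UINTPTR_MAX == 0xFFFFFFFF
#define NUMA_QUERY_CHUNK_SIZE 16      /* has to be <= DO_PAGES_STAT_CHUNK_NR */
#else
#define NUMA_QUERY_CHUNK_SIZE 1024
#endif

/* Small helper so the selected value can be observed at run time. */
static int
chunk_size(void)
{
    return NUMA_QUERY_CHUNK_SIZE;
}
```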

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 24, 2025 at 05:30:02PM +0200, Christoph Berg wrote:
> Re: Tomas Vondra
> > If it's a reliable fix, then I guess we can do it like this. But won't
> > that be a performance penalty on everyone? Or does the system split the
> > array into 16-element chunks anyway, so this makes no difference?
> 
> There's still the overhead of the syscall itself. But no idea how
> costly it is to have this 16-step loop in user or kernel space.
> 
> We could claim that on 32-bit systems, shared_buffers would be smaller
> anyway, so there the overhead isn't that big. And the step size should
> be larger (if at all) on 64-bit.

Right, and we already mention in the doc that using those views is "very slow"
or "can take a noticeable amount of time".

> > Anyway, maybe we should start by reporting this to the kernel people. Do
> > you want me to do that, or shall one of you take care of that? I suppose
> > that'd be better, as you already wrote a fix / know the code better.
> 
> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2

Thanks! I had in mind to look at how to report such a bug and provide a patch
but you beat me to it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 24, 2025 at 10:32:25PM +0200, Tomas Vondra wrote:
> 
> Attached is a minor tweak of the valgrind suppression rules,

Thanks!

> to add the
> two places touching the memory. I was hoping I could add a single rule
> for pg_numa_touch_mem_if_required, but that does not work - it's a
> macro, not a function. So I had to add one rule for each of the two
> functions querying the NUMA. That's a bit disappointing, because it
> means it'll hide all other failures (of Memcheck:Addr8 type) in those functions.
>

Shouldn't we add 2 rules for Memcheck:Addr4 too?

> Perhaps it'd be better to turn pg_numa_touch_mem_if_required into a
> proper (inlined) function, at least with USE_VALGRIND defined.

Yeah, I think that's probably better, to reduce the scope to what we really want.

> Something
> like the v2 patch -

yeah, maybe:

- add a rule for Memcheck:Addr4?
- use the same parameter names for the macro and the function?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Jakub Wartak
Date:
On Tue, Jun 24, 2025 at 5:30 PM Christoph Berg <myon@debian.org> wrote:
>
> Re: Tomas Vondra
> > If it's a reliable fix, then I guess we can do it like this. But won't
> > that be a performance penalty on everyone? Or does the system split the
> > array into 16-element chunks anyway, so this makes no difference?
>
> There's still the overhead of the syscall itself. But no idea how
> costly it is to have this 16-step loop in user or kernel space.
>
> We could claim that on 32-bit systems, shared_buffers would be smaller
> anyway, so there the overhead isn't that big. And the step size should
> be larger (if at all) on 64-bit.
>
> > Anyway, maybe we should start by reporting this to the kernel people. Do
> > you want me to do that, or shall one of you take care of that? I suppose
> > that'd be better, as you already wrote a fix / know the code better.
>
> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
>

Hi all, I'm quite late to the party (just noticed the thread), but
here's some additional context: it technically didn't make any sense to
me to have NUMA on 32-bit, due to the small amount of addressable memory
(after all, NUMA is about big iron, probably not even VMs), so in the
first versions of the patchset I excluded 32-bit (and back then for
some reason I couldn't even find libnuma for i386, but Andres pointed
out to me that it exists, so we re-added it, probably just to stay
consistent). The thread has kind of snowballed since then, but I still
believe that NUMA on 32-bit does not make a lot of sense.

Even assuming shm interleaving arrives in some future version,
allocations at small s_b sizes will usually fit a single NUMA node.

-J.



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Christoph Berg
Date:
Re: Bertrand Drouvot
> +/*
> + * Work around Linux kernel bug in 32-bit compat mode: do_pages_stat() has
> + * incorrect pointer arithmetic for more than DO_PAGES_STAT_CHUNK_NR pages.
> + */
> +#if SIZEOF_SIZE_T == 4

I was also missing it in my suggested patch draft, but this should
probably include #ifdef __linux__.


Re: Tomas Vondra
> +#ifdef USE_VALGRIND
> +
> +static inline void
> +pg_numa_touch_mem_if_required(uint64 tmp, char *ptr)

Stupid question, if this function gets properly inlined, why not
always use it as there should be no performance difference vs using a
macro?

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
On 6/25/25 11:00, Christoph Berg wrote:
> Re: Bertrand Drouvot
>> +/*
>> + * Work around Linux kernel bug in 32-bit compat mode: do_pages_stat() has
>> + * incorrect pointer arithmetic for more than DO_PAGES_STAT_CHUNK_NR pages.
>> + */
>> +#if SIZEOF_SIZE_T == 4
> 
> I was also missing it in my suggested patch draft, but this should
> probably include #ifdef __linux__.
> 


> 
> Re: Tomas Vondra
>> +#ifdef USE_VALGRIND
>> +
>> +static inline void
>> +pg_numa_touch_mem_if_required(uint64 tmp, char *ptr)
> 
> Stupid question, if this function gets properly inlined, why not
> always use it as there should be no performance difference vs using a
> macro?
> 

TBH I'm not 100% sure it works correctly, I need to check it actually
touches the memory etc. It's possible it was discussed in one of the
earlier NUMA threads, and there are reasons to do a macro.

I also dislike the ifdefs because it adds subtle differences between the
"normal" code and the code tested with valgrind. So just having the
inlined function would be "nicer".

regards

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
On 6/25/25 09:15, Jakub Wartak wrote:
> On Tue, Jun 24, 2025 at 5:30 PM Christoph Berg <myon@debian.org> wrote:
>>
>> Re: Tomas Vondra
>>> If it's a reliable fix, then I guess we can do it like this. But won't
>>> that be a performance penalty on everyone? Or does the system split the
>>> array into 16-element chunks anyway, so this makes no difference?
>>
>> There's still the overhead of the syscall itself. But no idea how
>> costly it is to have this 16-step loop in user or kernel space.
>>
>> We could claim that on 32-bit systems, shared_buffers would be smaller
>> anyway, so there the overhead isn't that big. And the step size should
>> be larger (if at all) on 64-bit.
>>
>>> Anyway, maybe we should start by reporting this to the kernel people. Do
>>> you want me to do that, or shall one of you take care of that? I suppose
>>> that'd be better, as you already wrote a fix / know the code better.
>>
>> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
>>
> 
> Hi all, I'm quite late to the party (just noticed the thread), but
> here's some additional context: it technically didn't make any sense to
> me to have NUMA on 32-bit, due to the small amount of addressable memory
> (after all, NUMA is about big iron, probably not even VMs), so in the
> first versions of the patchset I excluded 32-bit (and back then for
> some reason I couldn't even find libnuma for i386, but Andres pointed
> out to me that it exists, so we re-added it, probably just to stay
> consistent). The thread has kind of snowballed since then, but I still
> believe that NUMA on 32-bit does not make a lot of sense.
> 
> Even assuming shm interleaving arrives in some future version,
> allocations at small s_b sizes will usually fit a single NUMA node.
> 

Not sure. I thought NUMA doesn't matter very much on 32-bit systems too,
exactly because those systems tend to use small amounts of memory. But
then while investigating this issue I realized even rpi5 has NUMA, in
fact it has a whopping 8 nodes:

debian@raspberry-32:~ $ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3
node 0 size: 981 MB
node 0 free: 882 MB
node 1 cpus: 0 1 2 3
node 1 size: 1007 MB
node 1 free: 936 MB
node 2 cpus: 0 1 2 3
node 2 size: 1007 MB
node 2 free: 936 MB
node 3 cpus: 0 1 2 3
node 3 size: 943 MB
node 3 free: 873 MB
node 4 cpus: 0 1 2 3
node 4 size: 1007 MB
node 4 free: 936 MB
node 5 cpus: 0 1 2 3
node 5 size: 1007 MB
node 5 free: 935 MB
node 6 cpus: 0 1 2 3
node 6 size: 1007 MB
node 6 free: 936 MB
node 7 cpus: 0 1 2 3
node 7 size: 990 MB
node 7 free: 918 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  10  10  10  10  10  10  10
  1:  10  10  10  10  10  10  10  10
  2:  10  10  10  10  10  10  10  10
  3:  10  10  10  10  10  10  10  10
  4:  10  10  10  10  10  10  10  10
  5:  10  10  10  10  10  10  10  10
  6:  10  10  10  10  10  10  10  10
  7:  10  10  10  10  10  10  10  10


This is with the 32-bit system (which AFAICS means 64-bit kernel and
32-bit user space). I'm not saying it's a particularly interesting NUMA
system, considering all the costs are 10, and it's not like it's
critical to get the best performance on rpi5. But it's NUMA, and maybe
there are some other (more practical) systems. I find it interesting
mostly for testing purposes.


regards

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Álvaro Herrera
Date:
On 2025-Jun-25, Tomas Vondra wrote:

> Not sure. I thought NUMA doesn't matter very much on 32-bit systems too,
> exactly because those systems tend to use small amounts of memory. But
> then while investigating this issue I realized even rpi5 has NUMA, in
> fact it has a whopping 8 nodes:
> 
> debian@raspberry-32:~ $ numactl --hardware
> available: 8 nodes (0-7)

Interesting.  Mine only shows a single node.

alvherre@amras:~ $ uname -a
Linux amras 6.12.25+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.12.25-1+rpt1 (2025-04-30) aarch64 GNU/Linux
alvherre@amras:~ $ sudo numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 8051 MB
node 0 free: 202 MB
node distances:
node   0 
  0:  10 
alvherre@amras:~ $ sudo lscpu 
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Vendor ID:                ARM
  Model name:             Cortex-A76
    Model:                1
    Thread(s) per core:   1
    Core(s) per cluster:  4
    Socket(s):            -
    Cluster(s):           1
    Stepping:             r4p1
    CPU(s) scaling MHz:   62%
    CPU max MHz:          2400.0000
    CPU min MHz:          1500.0000
    BogoMIPS:             108.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
                           asimdhp cpuid asimdrdm lrcpc dcpop asimddp
[...]
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-3


Did you enable something special on it maybe?

... Oh, I found this:
https://www.jeffgeerling.com/blog/2024/numa-emulation-speeds-pi-5-and-other-improvements
Sounds like you have this in your system and I don't in mine.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Bertrand Drouvot
Date:
Hi,

On Wed, Jun 25, 2025 at 11:00:38AM +0200, Christoph Berg wrote:
> Re: Bertrand Drouvot
> > +/*
> > + * Work around Linux kernel bug in 32-bit compat mode: do_pages_stat() has
> > + * incorrect pointer arithmetic for more than DO_PAGES_STAT_CHUNK_NR pages.
> > + */
> > +#if SIZEOF_SIZE_T == 4
> 
> I was also missing it in my suggested patch draft, but this should
> probably include #ifdef __linux__.

I'm not sure because the workaround is after this part of the code in pg_numa.c:

"
/*
 * At this point we provide support only for Linux thanks to libnuma, but in
 * future support for other platforms e.g. Win32 or FreeBSD might be possible
 * too. For Win32 NUMA APIs see
 * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
 */
#ifdef USE_LIBNUMA
"

So I guess that the "#ifdef __linux__" would have to be at a higher level anyway
(should we support NUMA on more than Linux in the future).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Bertrand Drouvot
Date:
Hi,

On Tue, Jun 24, 2025 at 10:32:25PM +0200, Tomas Vondra wrote:
> On 6/24/25 17:30, Christoph Berg wrote:
> > Re: Tomas Vondra
> >> If it's a reliable fix, then I guess we can do it like this. But won't
> >> that be a performance penalty on everyone? Or does the system split the
> >> array into 16-element chunks anyway, so this makes no difference?
> > 
> > There's still the overhead of the syscall itself. But no idea how
> > costly it is to have this 16-step loop in user or kernel space.
> > 
> > We could claim that on 32-bit systems, shared_buffers would be smaller
> > anyway, so there the overhead isn't that big. And the step size should
> > be larger (if at all) on 64-bit.
> > 
> >> Anyway, maybe we should start by reporting this to the kernel people. Do
> >> you want me to do that, or shall one of you take care of that? I suppose
> >> that'd be better, as you already wrote a fix / know the code better.
> > 
> > Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
> > 
> 
> Thanks! Now we wait ...

It looks like the bug is "confirmed" and will be fixed:
https://marc.info/?l=linux-kernel&m=175088392116841&w=2

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
On 6/26/25 08:00, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jun 24, 2025 at 10:32:25PM +0200, Tomas Vondra wrote:
>> On 6/24/25 17:30, Christoph Berg wrote:
>>> Re: Tomas Vondra
>>>> If it's a reliable fix, then I guess we can do it like this. But won't
>>>> that be a performance penalty on everyone? Or does the system split the
>>>> array into 16-element chunks anyway, so this makes no difference?
>>>
>>> There's still the overhead of the syscall itself. But no idea how
>>> costly it is to have this 16-step loop in user or kernel space.
>>>
>>> We could claim that on 32-bit systems, shared_buffers would be smaller
>>> anyway, so there the overhead isn't that big. And the step size should
>>> be larger (if at all) on 64-bit.
>>>
>>>> Anyway, maybe we should start by reporting this to the kernel people. Do
>>>> you want me to do that, or shall one of you take care of that? I suppose
>>>> that'd be better, as you already wrote a fix / know the code better.
>>>
>>> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
>>>
>>
>> Thanks! Now we wait ...
> 
> It looks like the bug is "confirmed" and will be fixed:
> https://marc.info/?l=linux-kernel&m=175088392116841&w=2
> 

Yay! I like how the first response is "you sent the patch wrong" ;-)


cheers

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
Here are three small patches that should handle the issue:


0001 - Adds the batching into pg_numa_query_pages, so that the callers
don't need to do anything.

The batching doesn't seem to cause any performance regression. 32-bit
systems can't use that much memory anyway, and on 64-bit systems the
batch is sufficiently large (1024).


0002 - Silences the valgrind warnings about the memory touching. It
replaces the macro with a static inline function, and adds suppressions
for both 32-bit and 64-bit. The 32-bit one may be a bit pointless,
because on my rpi5 valgrind produces a bunch of other noise anyway. But
it doesn't hurt.

The function now looks like this:

  static inline void
  pg_numa_touch_mem_if_required(void *ptr)
  {
      volatile uint64 touch pg_attribute_unused();
      touch = *(volatile uint64 *) ptr;
  }

I did a lot of testing on multiple systems to check replacing the macro
with a static inline function still works - and it seems it does. But if
someone thinks the function won't work, I'd like to know.
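As a standalone illustration of that pattern (plain C with uint64_t in place
of PostgreSQL's uint64; touch_mem is a hypothetical name, not the actual
backend symbol):

```c
#include <stdint.h>

/* The volatile qualifier forces the compiler to actually emit the load,
 * faulting the page in, even though the value is discarded.  Without
 * volatile, an optimizer could drop the dead read entirely. */
static inline void
touch_mem(void *ptr)
{
    volatile uint64_t touch;

    touch = *(volatile uint64_t *) ptr;
    (void) touch;
}
```

Since the touch is a pure read, it cannot corrupt the memory it inspects,
which is what makes the inline-function replacement for the macro safe to
test for.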


0003 - While working on these patches, it occurred to me we could/should
add CHECK_FOR_INTERRUPTS() into the batch loop. This querying can take
quite a bit of time, so letting people interrupt it seems reasonable.
It wasn't possible with just one call into the kernel, but with the
batching we can add a CFI.
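A rough user-space sketch of what 0001 and 0003 do together (function and
constant names are illustrative, not the committed PostgreSQL code; a stub
stands in for the move_pages(2) call, and a comment marks where
CHECK_FOR_INTERRUPTS() would sit):

```c
#include <stddef.h>

#define NUMA_QUERY_CHUNK_SIZE 16    /* 1024 on 64-bit in the real patch */

/* Stub for the per-chunk move_pages(2) query: report node 0 for each page. */
static int
query_chunk(void **pages, int *status, size_t n)
{
    for (size_t i = 0; i < n; i++)
        status[i] = 0;
    return 0;
}

/* Query all pages in chunks small enough to dodge the do_pages_stat() bug. */
static int
query_pages_chunked(void **pages, int *status, size_t count)
{
    for (size_t next = 0; next < count; next += NUMA_QUERY_CHUNK_SIZE)
    {
        size_t chunk = count - next;

        if (chunk > NUMA_QUERY_CHUNK_SIZE)
            chunk = NUMA_QUERY_CHUNK_SIZE;

        /* CHECK_FOR_INTERRUPTS() would go here in backend code (0003) */
        if (query_chunk(pages + next, status + next, chunk) != 0)
            return -1;
    }
    return 0;
}
```

Splitting the request this way is also what makes the CFI possible at all: a
single syscall over the whole array gives no point at which to check.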


Please, take a look.

regards

-- 
Tomas Vondra

Attachments

Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Bertrand Drouvot
Date:
Hi,

On Fri, Jun 27, 2025 at 04:52:08PM +0200, Tomas Vondra wrote:
> Here's three small patches, that should handle the issue

Thanks for the patches!

> 0001 - Adds the batching into pg_numa_query_pages, so that the callers
> don't need to do anything.
> 
> The batching doesn't seem to cause any performance regression. 32-bit
> systems can't use that much memory anyway, and on 64-bit systems the
> batch is sufficiently large (1024).

=== 1

-#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+#define pg_numa_touch_mem_if_required(ptr) \

Looks unrelated, should be in 0002?

=== 2

I thought that it would be better to provide a batch size only in the 32-bit
case (see [1]), but I now think it makes sense to also provide (a larger) one
for non-32-bit (as you did), due to the CFI added in 0003 (which is also good
to have on non-32-bit).

> 0002 - Silences the valgrind about the memory touching. It replaces the
> macro with a static inline function, and adds suppressions for both
> 32-bit and 64-bits. The 32-bit may be a bit pointless, because on my
> rpi5 valgrind produces about a bunch of other stuff anyway. But doesn't
> hurt.
> 
> The function now looks like this:
> 
>   static inline void
>   pg_numa_touch_mem_if_required(void *ptr)
>   {
>       volatile uint64 touch pg_attribute_unused();
>       touch = *(volatile uint64 *) ptr;
>   }
> 
> I did a lot of testing on multiple systems to check replacing the macro
> with a static inline function still works - and it seems it does. But if
> someone thinks the function won't work, I'd like to know.

LGTM.

> 0003 - While working on these patches, it occurred to me we could/should
> add CHECK_FOR_INTERRUPTS() into the batch loop. This querying can take
> quite a bit of time, so letting people to interrupt it seems reasonable.
> It wasn't possible with just one call into the kernel, but with the
> batching we can add a CFI.

Yeah, LGTM.

[1]: https://www.postgresql.org/message-id/aFuRoUieUVh%2BpMfZ%40ip-10-97-1-34.eu-west-3.compute.internal 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Bertrand Drouvot
Date:
Hi,

On Mon, Jun 30, 2025 at 08:56:43PM +0200, Tomas Vondra wrote:
> In particular it now uses "chunking" instead of "batching". I believe
> batching is "combining multiple requests into a single one", but we're
> doing exactly the opposite - splitting a large request into smaller
> ones. Which is what "chunking" does.

I do agree that "chunking" is more appropriate here.

> I plan to push this tomorrow morning.

Thanks!

LGTM, just 2 nits about the commit messages:

For 0001:

Is it worth adding a link to the kernel bug report, or mentioning that it
can be found in the discussion?

For 0003:

"
But with the chunking, introduced to work around the do_pages_stat()
bug"

Do you have in mind to quote the hex commit object name that will be generated
by 0001?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Tomas Vondra
Date:
On 7/1/25 06:06, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Jun 30, 2025 at 08:56:43PM +0200, Tomas Vondra wrote:
>> In particular it now uses "chunking" instead of "batching". I believe
>> batching is "combining multiple requests into a single one", but we're
>> doing exactly the opposite - splitting a large request into smaller
>> ones. Which is what "chunking" does.
> 
> I do agree that "chunking" is more appropriate here.
> 
>> I plan to push this tomorrow morning.
> 
> Thanks!
> 
> LGTM, just 2 nits about the commit messages:
> 
> For 0001:
> 
> Is it worth adding a link to the kernel bug report, or mentioning that it
> can be found in the discussion?
> 
> For 0003:
> 
> "
> But with the chunking, introduced to work around the do_pages_stat()
> bug"
> 
> Do you have in mind to quote the hex commit object name that will be generated
> by 0001?
> 

Thanks! Pushed, with both adjustments (link to kernel thread, adding the
commit hash).

But damn it, right after pushing I noticed two typos in the second
commit message :-/

-- 
Tomas Vondra




Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Christoph Berg
Date:
Re: Tomas Vondra
> >>> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
> >>>
> >>
> >> Thanks! Now we wait ...
> > 
> > It looks like the bug is "confirmed" and will be fixed:
> > https://marc.info/?l=linux-kernel&m=175088392116841&w=2

If I'm reading the Linux git log correctly, the fix was merged into
Linux 6.16-rc7. Yay :)

> Yay! I like how the first response is "you sent the patch wrong" ;-)

I would have thought that coming from two major projects that use
email extensively (Debian, PostgreSQL), I would navigate that process
better. But it worked in the end...

Christoph



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Bertrand Drouvot
Date:
Hi,

On Mon, Jul 21, 2025 at 10:52:12PM +0200, Christoph Berg wrote:
> Re: Tomas Vondra
> > >>> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
> > >>>
> > >>
> > >> Thanks! Now we wait ...
> > > 
> > > It looks like the bug is "confirmed" and will be fixed:
> > > https://marc.info/?l=linux-kernel&m=175088392116841&w=2
> 
> If I'm reading the Linux git log correctly, the fix was merged into
> Linux 6.16-rc7. Yay :)

Yeah! ;-)  https://github.com/torvalds/linux/commit/10d04c26ab2b7

> > Yay! I like how the first response is "you sent the patch wrong" ;-)
> 
> I would have thought that coming from two major projects that use
> email extensively (Debian, PostgreSQL), I would navigate that process
> better. But it worked in the end...

Indeed, thanks!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Christoph Berg
Date:
Re: Tomas Vondra
> Thanks! Pushed, with both adjustments (link to kernel thread, adding the
> commit hash).

The PG18 Debian package is still carrying the contrib complement of
this patch (see attachment).

Should that be addressed before 18.0?

Christoph

Attachments

Re: pgsql: Introduce pg_shmem_allocations_numa view

From
Christoph Berg
Date:
Re: To Tomas Vondra
> The PG18 Debian package is still carrying the contrib complement of
> this patch (see attachment).

Ah sorry, I was confused here. I had assumed that the patch is
required as long as it doesn't conflict, but it doesn't conflict since
the problem was fixed inside pg_numa_query_pages() in git, while the
workaround was outside.

Christoph