Thread: InitControlFile misbehaving on graviton

InitControlFile misbehaving on graviton

From:
Christoph Berg
Date:
Bernd and I have been chasing a bug that happens when all of the
following conditions are fulfilled:

* PG 15..18 (older PGs are ok)
* gcc 14.2 on Debian unstable/testing (older Debians and Ubuntus are ok)
* arm64 running on graviton (AWS EC2 c8g.2xlarge, ok on different arm64 host)
* -O2 (ok with -O0)
* --with-openssl (ok without openssl)
* using no -m flag, or using -march=armv8.4-a (using `-march=armv9-a` fixes it)

The problem happens early during initdb:

$ ./configure --with-openssl --enable-debug
...
$ /usr/local/pgsql/bin/initdb -D broken --no-clean
...
running bootstrap script ... 2025-01-13 18:02:44.484 UTC [523300] FATAL:  control file contains invalid database cluster state
child process exited with exit code 1
initdb: data directory "broken" not removed at user's request

Looking at the control file, we can see that the cluster state is
"starting up":

$ /usr/local/pgsql/bin/pg_controldata broken/
pg_control version number:            1700
Catalog version number:               202501101
Database system identifier:           7459462110308027428
Database cluster state:               starting up
pg_control last modified:             Mon 13 Jan 2025 06:02:44 PM UTC
Latest checkpoint location:           0/1000028
Latest checkpoint's REDO location:    0/1000028
Latest checkpoint's REDO WAL file:    000000010000000000000001
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:3

The relevant code is in BootStrapXLOG():

    /* Now create pg_control */
    InitControlFile(sysidentifier, data_checksum_version);
    ControlFile->time = checkPoint.time;
    ControlFile->checkPoint = checkPoint.redo;
    ControlFile->checkPointCopy = checkPoint;

    /* some additional ControlFile fields are set in WriteControlFile() */
    WriteControlFile();

and InitControlFile():

    if (!pg_strong_random(mock_auth_nonce, MOCK_AUTH_NONCE_LEN))
        ereport(PANIC,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg("could not generate secret authorization token")));

    memset(ControlFile, 0, sizeof(ControlFileData));
    /* Initialize pg_control status fields */
    ControlFile->system_identifier = sysidentifier;
    memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce, MOCK_AUTH_NONCE_LEN);
    ControlFile->state = DB_SHUTDOWNED;

So the state should actually be DB_SHUTDOWNED (1), but on this system,
the value is DB_STARTUP (0).
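
A minimal sketch of why a skipped store to the state field shows up as "starting up": after the memset, the field holds 0, which is DB_STARTUP. The enum order follows DBState in pg_control.h and the labels follow pg_controldata; `MockControlFile`, `db_state_name` and `mock_init_control_file` are hypothetical names used only for illustration, not PostgreSQL code.

```c
#include <assert.h>
#include <string.h>

/* Same ordering as DBState in src/include/catalog/pg_control.h. */
typedef enum DBState
{
    DB_STARTUP = 0,
    DB_SHUTDOWNED,
    DB_SHUTDOWNED_IN_RECOVERY,
    DB_SHUTDOWNING,
    DB_IN_CRASH_RECOVERY,
    DB_IN_ARCHIVE_RECOVERY,
    DB_IN_PRODUCTION
} DBState;

/* Hypothetical, heavily reduced stand-in for ControlFileData. */
typedef struct MockControlFile
{
    unsigned long long system_identifier;
    DBState     state;
} MockControlFile;

/* Hypothetical helper: the pg_controldata labels for the two states involved. */
static const char *
db_state_name(DBState state)
{
    switch (state)
    {
        case DB_STARTUP:
            return "starting up";
        case DB_SHUTDOWNED:
            return "shut down";
        default:
            return "other";
    }
}

/*
 * Same order as InitControlFile(): zero the struct, then set the state.
 * The flag simulates whether the store of DB_SHUTDOWNED takes effect.
 */
static void
mock_init_control_file(MockControlFile *cf, int state_store_takes_effect)
{
    assert(cf != NULL);
    memset(cf, 0, sizeof(*cf));
    if (state_store_takes_effect)
        cf->state = DB_SHUTDOWNED;
}
```

Calling `mock_init_control_file(&cf, 0)` leaves `cf.state` at DB_STARTUP, which is the value seen in the broken control file.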

Stepping through InitControlFile we can see that ControlFile->state is
never written to. (The trace jumps back to InitControlFile a lot
because it seems inlined into BootStrapXLOG):

$ cd broken
$ rm -f global/pg_control; PGDATA=$PWD gdb /usr/lib/postgresql/17/bin/postgres
Reading symbols from /usr/local/pgsql/bin/postgres...
(gdb) b InitControlFile
Breakpoint 1 at 0x1819bc: file xlog.c, line 4214.
(gdb) r --boot -F -c log_checkpoints=false -X 16777216 -k
Starting program: /usr/local/pgsql/bin/postgres --boot -F -c log_checkpoints=false -X 16777216 -k

Breakpoint 1, 0x0000aaaaaac219bc in InitControlFile (sysidentifier=<optimized out>, data_checksum_version=<optimized out>) at xlog.c:4214
4214        if (!pg_strong_random(mock_auth_nonce, MOCK_AUTH_NONCE_LEN))
(gdb) s
5175        InitControlFile(sysidentifier, data_checksum_version);
(gdb)
InitControlFile (sysidentifier=7459466832287685723, data_checksum_version=1) at xlog.c:4214
4214        if (!pg_strong_random(mock_auth_nonce, MOCK_AUTH_NONCE_LEN))
(gdb)
pg_strong_random (buf=buf@entry=0xfffffffff670, len=len@entry=32) at pg_strong_random.c:79
79        for (i = 0; i < NUM_RAND_POLL_RETRIES; i++)
(gdb)
81            if (RAND_status() == 1)
(gdb)
87            RAND_poll();
(gdb)
90        if (RAND_bytes(buf, len) == 1)
(gdb)
InitControlFile (sysidentifier=7459466832287685723, data_checksum_version=1) at xlog.c:4219
4219        memset(ControlFile, 0, sizeof(ControlFileData));
(gdb)
4221        ControlFile->system_identifier = sysidentifier;
(gdb)
5175        InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0x0000aaaaaac21a08 in InitControlFile (sysidentifier=7459466832287685723, data_checksum_version=1) at xlog.c:4221
4221        ControlFile->system_identifier = sysidentifier;
(gdb)
4222        memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce, MOCK_AUTH_NONCE_LEN);
(gdb)
5175        InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0x0000aaaaaac21a3c in InitControlFile (sysidentifier=7459466832287685723, data_checksum_version=1) at xlog.c:4234
4234        ControlFile->track_commit_timestamp = track_commit_timestamp;
(gdb)
5175        InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0x0000aaaaaac21a4c in InitControlFile (sysidentifier=7459466832287685723, data_checksum_version=1) at xlog.c:4232
4232        ControlFile->wal_level = wal_level;
(gdb)
5175        InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0x0000aaaaaac21a6c in InitControlFile (sysidentifier=7459466832287685723, data_checksum_version=1) at xlog.c:4233
4233        ControlFile->wal_log_hints = wal_log_hints;
(gdb)
5175        InitControlFile(sysidentifier, data_checksum_version);
(gdb)
0x0000aaaaaac21a8c in InitControlFile (sysidentifier=7459466832287685723, data_checksum_version=<optimized out>) at xlog.c:4233
4233        ControlFile->wal_log_hints = wal_log_hints;
(gdb)
BootStrapXLOG (data_checksum_version=data_checksum_version@entry=1) at xlog.c:5181
5181        WriteControlFile();

(gdb) p *ControlFile
$1 = {system_identifier = 7459466832287685723, pg_control_version = 1700, catalog_version_no = 202501101, state = DB_STARTUP, time = 1736792463,
  checkPoint = 16777256, checkPointCopy = {redo = 16777256, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, wal_level = 1, nextXid = {
      value = 3}, nextOid = 10000, nextMulti = 1, nextMultiOffset = 0, oldestXid = 3, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1,
    time = 1736792463, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}, unloggedLSN = 1000, minRecoveryPoint = 0,
  minRecoveryPointTLI = 0, backupStartPoint = 0, backupEndPoint = 0, backupEndRequired = false, wal_level = 1, wal_log_hints = false, MaxConnections = 100,
  max_worker_processes = 8, max_wal_senders = 10, max_prepared_xacts = 0, max_locks_per_xact = 64, track_commit_timestamp = false, maxAlign = 8,
  floatFormat = 1234567, blcksz = 8192, relseg_size = 131072, xlog_blcksz = 8192, xlog_seg_size = 16777216, nameDataLen = 64, indexMaxKeys = 32,
  toast_max_chunk_size = 1996, loblksize = 2048, float8ByVal = true, data_checksum_version = 1,
  mock_authentication_nonce = "*\307\177t\215\362\344 \326\307I\374\005f7v@\242ə\265\230\273#+\301\t\212\204\377\004A", crc = 4294967295}

Disassembling at the breakpoint:

(gdb) disassemble
Dump of assembler code for function BootStrapXLOG:
   0x0000aaaaaac21708 <+0>:    stp    x29, x30, [sp, #-272]!
   0x0000aaaaaac2170c <+4>:    mov    w1, #0x0                       // #0
   0x0000aaaaaac21710 <+8>:    mov    x29, sp
...
=> 0x0000aaaaaac219bc <+692>:    add    x19, sp, #0x90
   0x0000aaaaaac219c0 <+696>:    mov    x0, x19
   0x0000aaaaaac219c4 <+700>:    mov    x1, #0x20                      // #32
   0x0000aaaaaac219c8 <+704>:    str    w2, [x21, #28]
   0x0000aaaaaac219cc <+708>:    bl    0xaaaaab0ac824 <pg_strong_random>
   0x0000aaaaaac219d0 <+712>:    tbz    w0, #0, 0xaaaaaac21b28 <BootStrapXLOG+1056>
   0x0000aaaaaac219d4 <+716>:    ldr    x3, [x22, #32]
   0x0000aaaaaac219d8 <+720>:    mov    x2, #0x128                     // #296
   0x0000aaaaaac219dc <+724>:    mov    w1, #0x0                       // #0
   0x0000aaaaaac219e0 <+728>:    mov    x0, x3
   0x0000aaaaaac219e4 <+732>:    bl    0xaaaaaab7f3b0 <memset@plt>
   0x0000aaaaaac219e8 <+736>:    mov    x3, x0
   0x0000aaaaaac219ec <+740>:    mov    x1, #0x3e8                     // #1000
   0x0000aaaaaac219f0 <+744>:    ldr    w9, [x21, #32]
   0x0000aaaaaac219f4 <+748>:    adrp    x7, 0xaaaaab3ce000 <fmgr_builtins+72112>
   0x0000aaaaaac219f8 <+752>:    ldr    x7, [x7, #3720]
   0x0000aaaaaac219fc <+756>:    str    x1, [x3, #128]
   0x0000aaaaaac21a00 <+760>:    ldr    w1, [sp, #120]
   0x0000aaaaaac21a04 <+764>:    add    x0, x0, #0x28
   0x0000aaaaaac21a08 <+768>:    str    x23, [x3]
   0x0000aaaaaac21a0c <+772>:    str    w1, [x3, #252]
   0x0000aaaaaac21a10 <+776>:    adrp    x6, 0xaaaaab3cf000
   0x0000aaaaaac21a14 <+780>:    ldr    x6, [x6, #2392]
   0x0000aaaaaac21a18 <+784>:    adrp    x5, 0xaaaaab3cf000
   0x0000aaaaaac21a1c <+788>:    ldr    x5, [x5, #2960]
   0x0000aaaaaac21a20 <+792>:    adrp    x4, 0xaaaaab3cf000
   0x0000aaaaaac21a24 <+796>:    ldr    x4, [x4, #3352]
   0x0000aaaaaac21a28 <+800>:    ldp    q26, q25, [x19]
   0x0000aaaaaac21a2c <+804>:    str    s15, [x3, #16]
   0x0000aaaaaac21a30 <+808>:    adrp    x2, 0xaaaaab3cf000
   0x0000aaaaaac21a34 <+812>:    ldr    x2, [x2, #2552]
   0x0000aaaaaac21a38 <+816>:    str    x26, [x3, #24]
   0x0000aaaaaac21a3c <+820>:    adrp    x1, 0xaaaaab3ce000 <fmgr_builtins+72112>
   0x0000aaaaaac21a40 <+824>:    ldr    x1, [x1, #3320]
   0x0000aaaaaac21a44 <+828>:    ldp    q27, q29, [x20]
   0x0000aaaaaac21a48 <+832>:    str    x25, [x3, #32]
   0x0000aaaaaac21a4c <+836>:    str    w9, [x3, #172]
   0x0000aaaaaac21a50 <+840>:    ldr    w4, [x4]
   0x0000aaaaaac21a54 <+844>:    ldr    w5, [x5]
   0x0000aaaaaac21a58 <+848>:    ldr    w6, [x6]
   0x0000aaaaaac21a5c <+852>:    ldr    w7, [x7]
   0x0000aaaaaac21a60 <+856>:    ldr    q30, [x20, #64]
   0x0000aaaaaac21a64 <+860>:    ldp    q28, q31, [x20, #32]
   0x0000aaaaaac21a68 <+864>:    stur    q27, [x3, #40]
   0x0000aaaaaac21a6c <+868>:    ldrb    w8, [x22, #240]
   0x0000aaaaaac21a70 <+872>:    ldr    w2, [x2]
   0x0000aaaaaac21a74 <+876>:    ldrb    w1, [x1]
   0x0000aaaaaac21a78 <+880>:    stp    w7, w6, [x3, #180]
   0x0000aaaaaac21a7c <+884>:    stp    w5, w4, [x3, #188]
   0x0000aaaaaac21a80 <+888>:    strb    w1, [x3, #200]
   0x0000aaaaaac21a84 <+892>:    ldr    x1, [x20, #80]
   0x0000aaaaaac21a88 <+896>:    str    x1, [x3, #120]
   0x0000aaaaaac21a8c <+900>:    strb    w8, [x3, #176]
   0x0000aaaaaac21a90 <+904>:    str    w2, [x3, #196]
   0x0000aaaaaac21a94 <+908>:    stp    q26, q25, [x3, #256]
   0x0000aaaaaac21a98 <+912>:    stp    q29, q28, [x0, #16]
   0x0000aaaaaac21a9c <+916>:    stp    q31, q30, [x0, #48]
   0x0000aaaaaac21aa0 <+920>:    bl    0xaaaaaac1dcf0 <WriteControlFile>
...


The really weird thing is that the very same binaries work on a
different host (arm64 VM provided by Huawei) - the
postgresql_arm64.deb files compiled there and present on
apt.postgresql.org are fine, but when installed on that graviton VM,
they throw the above error.

It smells like graviton's Armv9 isn't as backwards compatible with Armv8
as it should be. (But then I don't understand why disabling openssl
fixes it.)

Christoph



Re: InitControlFile misbehaving on graviton

From:
Alexander Lakhin
Date:
Hello Christoph,

13.01.2025 21:04, Christoph Berg wrote:
Bernd and I have been chasing a bug that happens when all of the
following conditions are fulfilled:

...

It smells like graviton's Armv9 isn't as backwards compatible with Armv8
as it should be. (But then I don't understand why disabling openssl
fixes it.)

Very interesting! Maybe this can also explain weird leafhopper (which
uses graviton4, as Robins said) failures in the buildfarm. (I've
extracted diffs from the logs at [1] and brought Robins's attention to it
at [2].)

[1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures
[2] https://www.postgresql.org/message-id/35d87371-f3ab-42c8-9aac-bb39ab5bd987%40gmail.com

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Re: InitControlFile misbehaving on graviton

From:
Matthias van de Meent
Date:
On Mon, 13 Jan 2025 at 20:04, Christoph Berg <cb@df7cb.de> wrote:
>
> Bernd and I have been chasing a bug that happens when all of the
> following conditions are fulfilled:
>
> * PG 15..18 (older PGs are ok)
> * gcc 14.2 on Debian unstable/testing (older Debians and Ubuntus are ok)
> * arm64 running on graviton (AWS EC2 c8g.2xlarge, ok on different arm64 host)
> * -O2 (ok with -O0)
> * --with-openssl (ok without openssl)
> * using no -m flag, or using -march=armv8.4-a (using `-march=armv9-a` fixes it)
>
> The problem happens early during initdb:
>
> $ ./configure --with-openssl --enable-debug
> ...
> $ /usr/local/pgsql/bin/initdb -D broken --no-clean
> ...
> running bootstrap script ... 2025-01-13 18:02:44.484 UTC [523300] FATAL:  control file contains invalid database cluster state
> child process exited with exit code 1
> initdb: data directory "broken" not removed at user's request

Yes, weird.

> (gdb) disassemble
> Dump of assembler code for function BootStrapXLOG:
>    0x0000aaaaaac21708 <+0>:     stp     x29, x30, [sp, #-272]!
>    0x0000aaaaaac2170c <+4>:     mov     w1, #0x0                        // #0
>    0x0000aaaaaac21710 <+8>:     mov     x29, sp
> ...
> => 0x0000aaaaaac219bc <+692>:   add     x19, sp, #0x90
>    0x0000aaaaaac219c0 <+696>:   mov     x0, x19
>    0x0000aaaaaac219c4 <+700>:   mov     x1, #0x20                       // #32
>    0x0000aaaaaac219c8 <+704>:   str     w2, [x21, #28]
>    0x0000aaaaaac219cc <+708>:   bl      0xaaaaab0ac824 <pg_strong_random>

pg_strong_random pulls random values from OpenSSL's RAND_bytes
(declared in openssl/rand.h) when PostgreSQL is compiled with OpenSSL
support. If OpenSSL isn't enabled we instead use /dev/urandom (on
unix-y systems), which means different code will be generated for
pg_strong_random.
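
The two code paths can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the contents of pg_strong_random.c: `strong_random_sketch` is a hypothetical name, the RAND_status()/RAND_poll() retry loop visible in the gdb trace is omitted, and an interrupted read is treated as a failure instead of being retried.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#ifdef USE_OPENSSL
#include <openssl/rand.h>
#endif

/*
 * Hypothetical sketch: fill buf with len random bytes.
 * Returns true on success, false on any failure.
 */
static bool
strong_random_sketch(void *buf, size_t len)
{
    assert(buf != NULL);

#ifdef USE_OPENSSL
    /* OpenSSL path: RAND_bytes() returns 1 on success. */
    return RAND_bytes((unsigned char *) buf, (int) len) == 1;
#else
    /* Fallback path: read straight from the kernel's random device. */
    unsigned char *p = buf;
    int         fd = open("/dev/urandom", O_RDONLY);

    if (fd < 0)
        return false;

    while (len > 0)
    {
        ssize_t     n = read(fd, p, len);

        if (n <= 0)
        {
            /* Simplification: give up instead of retrying on EINTR. */
            close(fd);
            return false;
        }
        p += n;
        len -= (size_t) n;
    }

    close(fd);
    return true;
#endif
}
```

Only the first branch ever enters OpenSSL code, which fits the observation that the failure disappears when building without --with-openssl.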

>    0x0000aaaaaac219d0 <+712>:   tbz     w0, #0, 0xaaaaaac21b28 <BootStrapXLOG+1056>
>    0x0000aaaaaac219d4 <+716>:   ldr     x3, [x22, #32]
>    0x0000aaaaaac219d8 <+720>:   mov     x2, #0x128                      // #296
>    0x0000aaaaaac219dc <+724>:   mov     w1, #0x0                        // #0
>    0x0000aaaaaac219e0 <+728>:   mov     x0, x3
>    0x0000aaaaaac219e4 <+732>:   bl      0xaaaaaab7f3b0 <memset@plt>

Given this code, it looks like register x3 contains ControlFile - it's
being memset(..., 0, sizeof(ControlFileData));

>    0x0000aaaaaac219e8 <+736>:   mov     x3, x0
>    0x0000aaaaaac219ec <+740>:   mov     x1, #0x3e8                      // #1000
>    0x0000aaaaaac219f0 <+744>:   ldr     w9, [x21, #32]
>    0x0000aaaaaac219f4 <+748>:   adrp    x7, 0xaaaaab3ce000 <fmgr_builtins+72112>
>    0x0000aaaaaac219f8 <+752>:   ldr     x7, [x7, #3720]
>    0x0000aaaaaac219fc <+756>:   str     x1, [x3, #128]

... Which would make this the assignment to unloggedLSN (which matches
the FirstNormalUnloggedLSN=1000 stored just above)

>    0x0000aaaaaac21a00 <+760>:   ldr     w1, [sp, #120]
>    0x0000aaaaaac21a04 <+764>:   add     x0, x0, #0x28
>    0x0000aaaaaac21a08 <+768>:   str     x23, [x3]

And this would be the assignment of system_identifier,

>    0x0000aaaaaac21a0c <+772>:   str     w1, [x3, #252]

... data_checksum_version,

>    0x0000aaaaaac21a10 <+776>:   adrp    x6, 0xaaaaab3cf000
>    0x0000aaaaaac21a14 <+780>:   ldr     x6, [x6, #2392]
>    0x0000aaaaaac21a18 <+784>:   adrp    x5, 0xaaaaab3cf000
>    0x0000aaaaaac21a1c <+788>:   ldr     x5, [x5, #2960]
>    0x0000aaaaaac21a20 <+792>:   adrp    x4, 0xaaaaab3cf000
>    0x0000aaaaaac21a24 <+796>:   ldr     x4, [x4, #3352]
>    0x0000aaaaaac21a28 <+800>:   ldp     q26, q25, [x19]
>    0x0000aaaaaac21a2c <+804>:   str     s15, [x3, #16]

... and finally ControlFile->state.
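
The offset-to-field mapping above can be cross-checked with offsetof. The struct below is a hypothetical stand-in for only the first fields of ControlFileData: the field order is taken from the gdb dump, while the types are assumptions (a 4-byte enum for state, 64-bit pg_time_t and XLogRecPtr).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the leading fields of ControlFileData. */
typedef struct MockControlFilePrefix
{
    uint64_t    system_identifier;  /* str x23, [x3]      -> offset 0  */
    uint32_t    pg_control_version;
    uint32_t    catalog_version_no;
    int32_t     state;              /* str s15, [x3, #16] -> offset 16 */
    int64_t     time;               /* assumed 64-bit pg_time_t */
    uint64_t    checkPoint;         /* assumed 64-bit XLogRecPtr */
} MockControlFilePrefix;

/* Compile-time check of the two offsets used in the disassembly. */
static_assert(offsetof(MockControlFilePrefix, system_identifier) == 0,
              "system_identifier is expected at offset 0");
static_assert(offsetof(MockControlFilePrefix, state) == 16,
              "state is expected at offset 16");

/* Hypothetical helper: the byte offset a store to state would use. */
static size_t
mock_offset_of_state(void)
{
    return offsetof(MockControlFilePrefix, state);
}
```

With these assumptions, `mock_offset_of_state()` yields 16, matching the `[x3, #16]` operand of the store from s15.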

I don't see where s15 is initialized and/or written to first, but this
is the only reference in this section of ASM. As such, I think the
initialization (presumably, "mov s15, #1" or such) must have happened
before the call to pg_strong_random/RAND_bytes.

Looking around on the internet, it seems that the ARM Procedure
Call Standard requires a callee to preserve the low 64 bits of v8-v15,
which includes s15, so the compiler may keep a value there across the
call into pg_strong_random and co. If the register was indeed clobbered
by OpenSSL, that would be a good explanation for these issues. Can you
check this?

> The really weird thing is that the very same binaries work on a
> different host (arm64 VM provided by Huawei) - the
> postgresql_arm64.deb files compiled there and present on
> apt.postgresql.org are fine, but when installed on that graviton VM,
> they throw the above error.

If I were you, I'd start looking into the differences in behaviour of
OpenSSL between the two ARM-based systems you mention; particularly
with a focus on register contents. It looks like gdb's `i r ...`
command could help out with that - or so StackOverflow tells me.


Kind regards,

Matthias van de Meent



Re: InitControlFile misbehaving on graviton

From:
Julian Andres Klode
Date:
On Mon, Jan 13, 2025 at 09:39:40PM +0100, Matthias van de Meent wrote:
> On Mon, 13 Jan 2025 at 20:04, Christoph Berg <cb@df7cb.de> wrote:
> >
> > Bernd and I have been chasing a bug that happens when all of the
> > following conditions are fulfilled:
> >
> > * PG 15..18 (older PGs are ok)
> > * gcc 14.2 on Debian unstable/testing (older Debians and Ubuntus are ok)
> > * arm64 running on graviton (AWS EC2 c8g.2xlarge, ok on different arm64 host)
> > * -O2 (ok with -O0)
> > * --with-openssl (ok without openssl)
> > * using no -m flag, or using -march=armv8.4-a (using `-march=armv9-a` fixes it)
> >
> > The problem happens early during initdb:
> >
> > $ ./configure --with-openssl --enable-debug
> > ...
> > $ /usr/local/pgsql/bin/initdb -D broken --no-clean
> > ...
> > running bootstrap script ... 2025-01-13 18:02:44.484 UTC [523300] FATAL:  control file contains invalid database cluster state
> > child process exited with exit code 1
> > initdb: data directory "broken" not removed at user's request
> 
> Yes, weird.
> 
> > (gdb) disassemble
> > Dump of assembler code for function BootStrapXLOG:
> >    0x0000aaaaaac21708 <+0>:     stp     x29, x30, [sp, #-272]!
> >    0x0000aaaaaac2170c <+4>:     mov     w1, #0x0                        // #0
> >    0x0000aaaaaac21710 <+8>:     mov     x29, sp
> > ...
> > => 0x0000aaaaaac219bc <+692>:   add     x19, sp, #0x90
> >    0x0000aaaaaac219c0 <+696>:   mov     x0, x19
> >    0x0000aaaaaac219c4 <+700>:   mov     x1, #0x20                       // #32
> >    0x0000aaaaaac219c8 <+704>:   str     w2, [x21, #28]
> >    0x0000aaaaaac219cc <+708>:   bl      0xaaaaab0ac824 <pg_strong_random>
> 
> pg_strong_random pulls random values from OpenSSL's RAND_bytes
> (declared in openssl/rand.h) when PostgreSQL is compiled with OpenSSL
> support. If OpenSSL isn't enabled we instead use /dev/urandom (on
> unix-y systems), which means different code will be generated for
> pg_strong_random.
> 
> >    0x0000aaaaaac219d0 <+712>:   tbz     w0, #0, 0xaaaaaac21b28 <BootStrapXLOG+1056>
> >    0x0000aaaaaac219d4 <+716>:   ldr     x3, [x22, #32]
> >    0x0000aaaaaac219d8 <+720>:   mov     x2, #0x128                      // #296
> >    0x0000aaaaaac219dc <+724>:   mov     w1, #0x0                        // #0
> >    0x0000aaaaaac219e0 <+728>:   mov     x0, x3
> >    0x0000aaaaaac219e4 <+732>:   bl      0xaaaaaab7f3b0 <memset@plt>
> 
> Given this code, it looks like register x3 contains ControlFile - it's
> being memset(..., 0, sizeof(ControlFileData));
> 
> >    0x0000aaaaaac219e8 <+736>:   mov     x3, x0
> >    0x0000aaaaaac219ec <+740>:   mov     x1, #0x3e8                      // #1000
> >    0x0000aaaaaac219f0 <+744>:   ldr     w9, [x21, #32]
> >    0x0000aaaaaac219f4 <+748>:   adrp    x7, 0xaaaaab3ce000 <fmgr_builtins+72112>
> >    0x0000aaaaaac219f8 <+752>:   ldr     x7, [x7, #3720]
> >    0x0000aaaaaac219fc <+756>:   str     x1, [x3, #128]
> 
> ... Which would make this the assignment to unloggedLSN (which matches
> the FirstNormalUnloggedLSN=1000 stored just above)
> 
> >    0x0000aaaaaac21a00 <+760>:   ldr     w1, [sp, #120]
> >    0x0000aaaaaac21a04 <+764>:   add     x0, x0, #0x28
> >    0x0000aaaaaac21a08 <+768>:   str     x23, [x3]
> 
> And this would be the assignment of system_identifier,
> 
> >    0x0000aaaaaac21a0c <+772>:   str     w1, [x3, #252]
> 
> ... data_checksum_version,
> 
> >    0x0000aaaaaac21a10 <+776>:   adrp    x6, 0xaaaaab3cf000
> >    0x0000aaaaaac21a14 <+780>:   ldr     x6, [x6, #2392]
> >    0x0000aaaaaac21a18 <+784>:   adrp    x5, 0xaaaaab3cf000
> >    0x0000aaaaaac21a1c <+788>:   ldr     x5, [x5, #2960]
> >    0x0000aaaaaac21a20 <+792>:   adrp    x4, 0xaaaaab3cf000
> >    0x0000aaaaaac21a24 <+796>:   ldr     x4, [x4, #3352]
> >    0x0000aaaaaac21a28 <+800>:   ldp     q26, q25, [x19]
> >    0x0000aaaaaac21a2c <+804>:   str     s15, [x3, #16]
> 
> ... and finally ControlFile->state.
> 
> I don't see where s15 is initialized and/or written to first, but this
> is the only reference in this section of ASM. As such, I think the
> initialization (presumably, "mov s15, #1" or such) must have happened
> before the call to pg_strong_random/RAND_bytes.
> 
> Looking around on the internet, it seems that the ARM Procedure
> Call Standard requires a callee to preserve the low 64 bits of v8-v15,
> which includes s15, so the compiler may keep a value there across the
> call into pg_strong_random and co. If the register was indeed clobbered
> by OpenSSL, that would be a good explanation for these issues. Can you
> check this?
> 
> > The really weird thing is that the very same binaries work on a
> > different host (arm64 VM provided by Huawei) - the
> > postgresql_arm64.deb files compiled there and present on
> > apt.postgresql.org are fine, but when installed on that graviton VM,
> > they throw the above error.
> 
> If I were you, I'd start looking into the differences in behaviour of
> OpenSSL between the two ARM-based systems you mention; particularly
> with a focus on register contents. It looks like gdb's `i r ...`
> command could help out with that - or so StackOverflow tells me.

This was all very helpful, and if I had paid more attention I'd have
seen it sooner, but here we go:

https://github.com/openssl/openssl/pull/26469

I believe this should fix your issue as well; I was debugging it
from the APT side for the past 14 hours or so.

The AES-CTR code is used by the default random number generator
to derive random numbers from the initial seed.
-- 
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer                              i speak de, en