Обсуждение: Non-reproducible AIO failure

Поиск

Список

Период

Сортировка

Non-reproducible AIO failure

От

Tom Lane

Дата:

24 апреля, 00:17:01

My buildfarm animal sifaka just failed like this [1]:

TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322
0   postgres                            0x0000000100e3df2c ExceptionalCondition + 108
1   postgres                            0x0000000100cbb154 WaitReadBuffers + 616
2   postgres                            0x0000000100f072bc read_stream_next_buffer.cold.1 + 184
3   postgres                            0x0000000100cb6dd8 read_stream_next_buffer + 300
4   postgres                            0x00000001009b75b4 heap_fetch_next_buffer + 136
5   postgres                            0x00000001009ad8c4 heapgettup + 368
6   postgres                            0x00000001009ad364 heap_getnext + 104
7   postgres                            0x00000001009b9770 heapam_index_build_range_scan + 700
8   postgres                            0x00000001009ef4b8 spgbuild + 480
9   postgres                            0x0000000100a41acc index_build + 428
10  postgres                            0x0000000100a43d14 reindex_index + 752
11  postgres                            0x0000000100ad09e0 ExecReindex + 1908
12  postgres                            0x0000000100d04628 ProcessUtilitySlow + 1300
13  postgres                            0x0000000100d03754 standard_ProcessUtility + 1528
14  pg_stat_statements.dylib            0x000000010158d7a0 pgss_ProcessUtility + 632
15  postgres                            0x0000000100d02cec PortalRunUtility + 184
16  postgres                            0x0000000100d02354 PortalRunMulti + 236
17  postgres                            0x0000000100d01d94 PortalRun + 456
18  postgres                            0x0000000100d00e10 exec_simple_query + 1308
...
2025-04-23 16:08:28.834 EDT [78788:4] LOG:  client backend (PID 79322) was terminated by signal 6: Abort trap: 6
2025-04-23 16:08:28.834 EDT [78788:5] DETAIL:  Failed process was running: reindex index spgist_point_idx;


Scraping the buildfarm database confirmed that there have been no
other similar failures since AIO went in, and unsurprisingly some
attempts to reproduce it manually didn't bear fruit.  So I'm
suspecting a race condition with a very narrow window, but I don't
have an idea where to look.  Any thoughts on how to gather more
data?

            regards, tom lane

[1]
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=sifaka&dt=2025-04-23%2020%3A03%3A24&stg=recovery-check

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

24 апреля, 01:04:56

Hi,

On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
> My buildfarm animal sifaka just failed like this [1]:

There's nothing really special about sifaka, is there? I see -std=gnu99 and a
few debug -D cppflags, but they don't look they could really be relevant here.


> TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322

Huh.

I assume there's no core file left over? Would be helpful to query some more
state (e.g. whether operation->io_wref is valid (think it has to be) and
whether this was a partial read or such).


> 0   postgres                            0x0000000100e3df2c ExceptionalCondition + 108
> 1   postgres                            0x0000000100cbb154 WaitReadBuffers + 616
> 2   postgres                            0x0000000100f072bc read_stream_next_buffer.cold.1 + 184
> 3   postgres                            0x0000000100cb6dd8 read_stream_next_buffer + 300
> 4   postgres                            0x00000001009b75b4 heap_fetch_next_buffer + 136
> 5   postgres                            0x00000001009ad8c4 heapgettup + 368
> 6   postgres                            0x00000001009ad364 heap_getnext + 104
> 7   postgres                            0x00000001009b9770 heapam_index_build_range_scan + 700
> 8   postgres                            0x00000001009ef4b8 spgbuild + 480
> 9   postgres                            0x0000000100a41acc index_build + 428
> 10  postgres                            0x0000000100a43d14 reindex_index + 752
> 11  postgres                            0x0000000100ad09e0 ExecReindex + 1908
> 12  postgres                            0x0000000100d04628 ProcessUtilitySlow + 1300
> 13  postgres                            0x0000000100d03754 standard_ProcessUtility + 1528
> 14  pg_stat_statements.dylib            0x000000010158d7a0 pgss_ProcessUtility + 632
> 15  postgres                            0x0000000100d02cec PortalRunUtility + 184
> 16  postgres                            0x0000000100d02354 PortalRunMulti + 236
> 17  postgres                            0x0000000100d01d94 PortalRun + 456
> 18  postgres                            0x0000000100d00e10 exec_simple_query + 1308
> ...
> 2025-04-23 16:08:28.834 EDT [78788:4] LOG:  client backend (PID 79322) was terminated by signal 6: Abort trap: 6
> 2025-04-23 16:08:28.834 EDT [78788:5] DETAIL:  Failed process was running: reindex index spgist_point_idx;


> Scraping the buildfarm database confirmed that there have been no
> other similar failures since AIO went in, and unsurprisingly some
> attempts to reproduce it manually didn't bear fruit.  So I'm
> suspecting a race condition with a very narrow window, but I don't
> have an idea where to look.  Any thoughts on how to gather more
> data?

The weird thing is that the concrete symptom here doesn't actually involve
anything that could really race - the relevant state is only updated by the
process itself.  But of course there has to be something narrow here,
otherwise we'd have seen this before :(.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

24 апреля, 01:58:46

Andres Freund <andres@anarazel.de> writes:
> On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
>> My buildfarm animal sifaka just failed like this [1]:

> There's nothing really special about sifaka, is there? I see -std=gnu99 and a
> few debug -D cppflags, but they don't look they could really be relevant here.

No, it's intended to be a bog-standard macOS environment.

>> TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322

> Huh.

> I assume there's no core file left over?

No such luck, I did not have that enabled.  I will turn it on though,
and also on its sibling indri.

            regards, tom lane

Re: Non-reproducible AIO failure

От

Alexander Lakhin

Дата:

24 мая, 06:00:01

Hello Tom and Andres,

24.04.2025 01:58, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
>>> My buildfarm animal sifaka just failed like this [1]:
>> There's nothing really special about sifaka, is there? I see -std=gnu99 and a
>> few debug -D cppflags, but they don't look they could really be relevant here.
>>

FWIW, that Assert have just triggered on another mac:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2025-05-23%2020%3A30%3A07

I have been trying to reproduce it locally (on Linux, FreeBSD VMs) for days,
but failed. So maybe it's a MacOS-specific issue. I will try to get an access
to a similar machine to reproduce it there.

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

24 мая, 06:17:12

Alexander Lakhin <exclusion@gmail.com> writes:
> FWIW, that Assert have just triggered on another mac:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2025-05-23%2020%3A30%3A07

Yeah, I was just looking at that too.  There is a corefile
from that crash, but lldb seems unable to extract anything
from it :-(.  There is a low-resolution stack trace in the
postmaster log though:

0   postgres                            0x0000000105299c84 ExceptionalCondition + 108
1   postgres                            0x00000001051159ac WaitReadBuffers + 616
2   postgres                            0x00000001053611ec read_stream_next_buffer.cold.1 + 184
3   postgres                            0x0000000105111630 read_stream_next_buffer + 300
4   postgres                            0x0000000104e0b994 heap_fetch_next_buffer + 136
5   postgres                            0x0000000104e018f4 heapgettup_pagemode + 204
6   postgres                            0x0000000104e02010 heap_getnextslot + 84
7   postgres                            0x0000000104faebb4 SeqNext + 160


Of note here is that indri and sifaka run on the same host ---
indri uses some MacPorts packages while sifaka doesn't, but
that shouldn't have anything to do with our AIO code.

I trawled the buildfarm database and confirmed that these two crashes
are our only similar reports (grepping for "PGAIO_RS_UNKNOWN"):

 sysname | branch |      snapshot       |     stage     |                                                      l
                                              

---------+--------+---------------------+---------------+-------------------------------------------------------------------------------------------------------------
 indri   | HEAD   | 2025-05-23 20:30:07 | recoveryCheck | TRAP: failed Assert("aio_ret->result.status !=
PGAIO_RS_UNKNOWN"),File: "bufmgr.c", Line: 1605, PID: 20931 
 sifaka  | HEAD   | 2025-04-23 20:03:24 | recoveryCheck | TRAP: failed Assert("aio_ret->result.status !=
PGAIO_RS_UNKNOWN"),File: "bufmgr.c", Line: 1605, PID: 79322 
(2 rows)

So it seems that "very low-probability issue in our Mac AIO code" is
the most probable description.

            regards, tom lane

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

24 мая, 14:42:35

On Sat, May 24, 2025 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So it seems that "very low-probability issue in our Mac AIO code" is
> the most probable description.

There isn't any macOS-specific AIO code so my first guess would be
that it might be due to aarch64 weak memory reordering (though Andres
speculated that itt should all be one backend, huh), if it's not just
a timing luck thing.  Alexander, were the other OSes you tried all on
x86?

Re: Non-reproducible AIO failure

От

Alexander Lakhin

Дата:

25 мая, 00:00:00

Hello Thomas,

24.05.2025 14:42, Thomas Munro wrote:
> On Sat, May 24, 2025 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So it seems that "very low-probability issue in our Mac AIO code" is
>> the most probable description.
> There isn't any macOS-specific AIO code so my first guess would be
> that it might be due to aarch64 weak memory reordering (though Andres
> speculated that itt should all be one backend, huh), if it's not just
> a timing luck thing.  Alexander, were the other OSes you tried all on
> x86?

As I wrote off-list before, I had tried x86_64 only, but since then I
tried to reproduce the issue on an aarch64 server with Ubuntu 24.04,
running 10, then 40 instances of t/027_stream_regress.pl in parallel. I've
also multiplied "test: brin ..." line x10. But the issue is still not
reproduced (in 8+ hours).

However, I've managed to get an AIO-related assertion failure on macOS 14.5
(Darwin Kernel Version 23.5.0) quite easily, running 027_stream_regress.pl
the same way:
2       # +++ tap check in src/test/recovery_2 +++
3       # +++ tap check in src/test/recovery_3 +++
4       # +++ tap check in src/test/recovery_4 +++
5       # +++ tap check in src/test/recovery_5 +++
1       # +++ tap check in src/test/recovery_1 +++
7       # +++ tap check in src/test/recovery_7 +++
6       # +++ tap check in src/test/recovery_6 +++
8       # +++ tap check in src/test/recovery_8 +++
1       [16:07:31] t/027_stream_regress.pl ..
1       Dubious, test returned 29 (wstat 7424, 0x1d00)
1       All 1 subtests passed
1       [16:07:31]
1
1       Test Summary Report
1       -------------------
1       t/027_stream_regress.pl (Wstat: 7424 Tests: 1 Failed: 0)
1         Non-zero exit status: 29
1         Parse errors: No plan found in TAP output
1       Files=1, Tests=1, 197 wallclock secs ( 0.01 usr  0.01 sys + 4.47 cusr  5.79 csys = 10.28 CPU)
1       Result: FAIL

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
0   postgres                            0x0000000104f078f4 ExceptionalCondition + 236
1   postgres                            0x0000000104c0ebd4 pgaio_io_before_start + 260
2   postgres                            0x0000000104c0ea94 pgaio_io_start_readv + 36
3   postgres                            0x0000000104c2d4e8 FileStartReadV + 252
4   postgres                            0x0000000104c807c8 mdstartreadv + 668
5   postgres                            0x0000000104c83db0 smgrstartreadv + 116
6   postgres                            0x0000000104c1675c AsyncReadBuffers + 1444
7   postgres                            0x0000000104c15868 StartReadBuffersImpl + 1196
8   postgres                            0x0000000104c153ac StartReadBuffers + 64
9   postgres                            0x0000000104c129b4 read_stream_start_pending_read + 1204
10  postgres                            0x0000000104c12018 read_stream_look_ahead + 812
11  postgres                            0x0000000104c11cd0 read_stream_next_buffer + 1396
12  postgres                            0x0000000104606c38 heap_fetch_next_buffer + 284
13  postgres                            0x00000001045f79d4 heapgettup_pagemode + 192
14  postgres                            0x00000001045f7fa4 heap_getnextslot + 84
15  postgres                            0x000000010497a404 table_scan_getnextslot + 340
16  postgres                            0x000000010497a210 SeqNext + 148
17  postgres                            0x000000010497a8e4 ExecScanFetch + 772
18  postgres                            0x000000010497a4b8 ExecScanExtended + 164
19  postgres                            0x0000000104979bf0 ExecSeqScanWithProject + 244
20  postgres                            0x0000000104953ce0 ExecProcNode + 68
21  postgres                            0x000000010494f148 MultiExecPrivateHash + 64
22  postgres                            0x000000010494ed48 MultiExecHash + 100
23  postgres                            0x0000000104925a18 MultiExecProcNode + 180
24  postgres                            0x0000000104957220 ExecHashJoinImpl + 628
25  postgres                            0x00000001049566dc ExecHashJoin + 32
26  postgres                            0x0000000104925958 ExecProcNodeFirst + 100
27  postgres                            0x000000010491c08c ExecProcNode + 68
28  postgres                            0x000000010491692c ExecutePlan + 268
29  postgres                            0x000000010491679c standard_ExecutorRun + 504
30  pg_stat_statements.dylib            0x00000001055e132c pgss_ExecutorRun + 212
31  postgres                            0x0000000104916580 ExecutorRun + 72
32  postgres                            0x000000010491f59c ParallelQueryMain + 508
33  postgres                            0x00000001046968d4 ParallelWorkerMain + 1712
34  postgres                            0x0000000104b476f4 BackgroundWorkerMain + 824
35  postgres                            0x0000000104b4b87c postmaster_child_launch + 492
36  postgres                            0x0000000104b56610 StartBackgroundWorker + 416
37  postgres                            0x0000000104b51378 maybe_start_bgworkers + 552
38  postgres                            0x0000000104b540cc LaunchMissingBackgroundProcesses + 1316
39  postgres                            0x0000000104b51650 ServerLoop + 616
40  postgres                            0x0000000104b4fef0 PostmasterMain + 6632
41  postgres                            0x00000001049bfe20 main + 952
42  dyld                                0x00000001849d60e0 start + 2360
2025-05-24 16:07:23.427 EDT [22120:4] LOG:  background worker "parallel worker" (PID 32355) was terminated by signal 6:

Abort trap: 6
2025-05-24 16:07:23.427 EDT [22120:5] DETAIL:  Failed process was running: select count(*) from join_foo
           left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss

I'm yet to see the Assert triggered on the buildfarm, but this one looks
interesting too.

(I can share the complete patch + script for such testing, if it can be helpful.)

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

25 мая, 05:45:52

On Sun, May 25, 2025 at 9:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> Hello Thomas,
> 24.05.2025 14:42, Thomas Munro wrote:
> > On Sat, May 24, 2025 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> So it seems that "very low-probability issue in our Mac AIO code" is
> >> the most probable description.
> > There isn't any macOS-specific AIO code so my first guess would be
> > that it might be due to aarch64 weak memory reordering (though Andres
> > speculated that itt should all be one backend, huh), if it's not just
> > a timing luck thing.  Alexander, were the other OSes you tried all on
> > x86?
>
> As I wrote off-list before, I had tried x86_64 only, but since then I
> tried to reproduce the issue on an aarch64 server with Ubuntu 24.04,
> running 10, then 40 instances of t/027_stream_regress.pl in parallel. I've
> also multiplied "test: brin ..." line x10. But the issue is still not
> reproduced (in 8+ hours).

Hmm.  And I see now that this really is all in one backend.  Could it
be some variation of the interrupt processing stuff from acad9093?

> However, I've managed to get an AIO-related assertion failure on macOS 14.5
...
> TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355

Can you get a core and print *ioh in the debugger?

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

25 мая, 06:22:22

Thomas Munro <thomas.munro@gmail.com> writes:
> Can you get a core and print *ioh in the debugger?

So far, I've failed to get anything useful out of core files
from this failure.  The trace goes back no further than

(lldb) bt
* thread #1
  * frame #0: 0x000000018de39388 libsystem_kernel.dylib`__pthread_kill + 8

That's quite odd in itself: while I don't find the debugging
environment on macOS to be the greatest, it's not normally
this unhelpful.

I have managed to reproduce the "aio_ret->result.status !=
PGAIO_RS_UNKNOWN" crash about five times in total on three different
Apple Silicon machines (an M1 mini, an M4 Pro mini, and an M4 Pro
laptop).  So there is definitely something wrong.  The failure rate
is pretty tiny for me, though I've not tried running several test
instances concurrently.

            regards, tom lane

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

26 мая, 02:44:48

On Sun, May 25, 2025 at 3:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Can you get a core and print *ioh in the debugger?
>
> So far, I've failed to get anything useful out of core files
> from this failure.  The trace goes back no further than
>
> (lldb) bt
> * thread #1
>   * frame #0: 0x000000018de39388 libsystem_kernel.dylib`__pthread_kill + 8
>
> That's quite odd in itself: while I don't find the debugging
> environment on macOS to be the greatest, it's not normally
> this unhelpful.

(And Alexander reported the same off-list.). It's interesting that the
elog.c backtrace stuff is able to analyse the stack and it looks
normal AFAICS.  Could that be interfering with the stack in the core?!
 I doubt it ... I kinda wonder if the debugger might be confused about
libsystem sometimes since it has ceased to be a regular Mach-O file on
disk, but IDK; maybe gdb (from MacPorts etc) would offer a clue?

So far we have:

TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"),
File: "bufmgr.c", Line: 1605, PID: 20931
0   postgres                            0x0000000105299c84
ExceptionalCondition + 108
1   postgres                            0x00000001051159ac WaitReadBuffers + 616
2   postgres                            0x00000001053611ec
read_stream_next_buffer.cold.1 + 184
3   postgres                            0x0000000105111630
read_stream_next_buffer + 300
4   postgres                            0x0000000104e0b994
heap_fetch_next_buffer + 136
5   postgres                            0x0000000104e018f4
heapgettup_pagemode + 204

Hmm, looking around that code and wracking my brain for things that
might happen on one OS but not others, I wonder about partial I/Os.
Perhaps combined with some overlapping requests.

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c",
Line: 161, PID: 32355
0   postgres                            0x0000000104f078f4
ExceptionalCondition + 236
1   postgres                            0x0000000104c0ebd4
pgaio_io_before_start + 260
2   postgres                            0x0000000104c0ea94
pgaio_io_start_readv + 36
3   postgres                            0x0000000104c2d4e8 FileStartReadV + 252
4   postgres                            0x0000000104c807c8 mdstartreadv + 668
5   postgres                            0x0000000104c83db0 smgrstartreadv + 116

But this one seems like a more basic confusion...  wild writes
somewhere?  Hmm, we need to see what's in that struct.

If we can't get a debugger to break there or a core file to be
analysable, maybe we should try logging as much info as possible at
those points to learn a bit more?  I would be digging like that myself
but I haven't seen this failure on my little M4 MacBook Air yet
(Sequoia 15.5, Apple clang-1700.0.13.3).  It is infected with
corporate security-ware that intercepts at least file system stuff and
slows it down and I can't even convince it to dump core files right
now.  Could you guys please share your exact repro steps?

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

26 мая, 03:05:49

Thomas Munro <thomas.munro@gmail.com> writes:
> Could you guys please share your exact repro steps?

I've just been running 027_stream_regress.pl over and over.
It's not a recommendable answer though because the failure
probability is tiny, under 1%.  It sounded like Alexander
had a better way.

            regards, tom lane

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

26 мая, 03:25:46

Thomas Munro <thomas.munro@gmail.com> writes:
> On Sun, May 25, 2025 at 3:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So far, I've failed to get anything useful out of core files
>> from this failure.  The trace goes back no further than
>> (lldb) bt
>> * thread #1
>> * frame #0: 0x000000018de39388 libsystem_kernel.dylib`__pthread_kill + 8

> (And Alexander reported the same off-list.). It's interesting that the
> elog.c backtrace stuff is able to analyse the stack and it looks
> normal AFAICS.  Could that be interfering with the stack in the core?!

No, but something is.  Just to make sure it wasn't totally broken,
I added a sure-to-fail Assert in a random place (I chose
pg_backend_pid), and I get both a trace in the postmaster log and a
perfectly usable core file:

TRAP: failed Assert("MyProcPid == 0"), File: "pgstatfuncs.c", Line: 692, PID: 59063
0   postgres                            0x00000001031f1fa4 ExceptionalCondition + 108
1   postgres                            0x00000001031672b4 pg_stat_get_backend_pid + 0
2   postgres                            0x0000000102e9e598 ExecInterpExpr + 5524
3   postgres                            0x0000000102edb100 ExecResult + 368
4   postgres                            0x0000000102ea6418 standard_ExecutorRun + 316

(lldb) bt
* thread #1
  * frame #0: 0x00000001836b5388 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001836ee88c libsystem_pthread.dylib`pthread_kill + 296
    frame #2: 0x00000001835f7c60 libsystem_c.dylib`abort + 124
    frame #3: 0x000000010491dfac postgres`ExceptionalCondition(conditionName=<unavailable>, fileName=<unavailable>,
lineNumber=692)at assert.c:66:2 [opt] 
    frame #4: 0x000000010489329c postgres`pg_backend_pid(fcinfo=<unavailable>) at pgstatfuncs.c:692:2 [opt]
    frame #5: 0x00000001045ca598 postgres`ExecInterpExpr(state=0x000000013780d190, econtext=0x000000013780ce38,
isnull=<unavailable>)at execExprInterp.c:0 [opt] 
    frame #6: 0x0000000104607100 postgres`ExecResult [inlined] ExecEvalExprNoReturn(state=<unavailable>,
econtext=0x000000013780ce38)at executor.h:417:13 [opt] 
    frame #7: 0x00000001046070f4 postgres`ExecResult [inlined] ExecEvalExprNoReturnSwitchContext(state=<unavailable>,
econtext=0x000000013780ce38)at executor.h:458:2 [opt] 

The fact that I can trace through this Assert failure but not the
AIO one strongly suggests some system-level problem in the latter.
There is something rotten in the state of Denmark.

For completeness, this is with Sequoia 15.5 (latest macOS) on
an M4 Pro MacBook.

> but I haven't seen this failure on my little M4 MacBook Air yet
> (Sequoia 15.5, Apple clang-1700.0.13.3).  It is infected with
> corporate security-ware that intercepts at least file system stuff and
> slows it down and I can't even convince it to dump core files right
> now.

As far as that goes, if you have SIP turned on (which I'm sure a
corporate laptop would), there are extra steps needed to get a
core to happen.  See [1]; that page is old, but the recipe still
works for me.

            regards, tom lane

[1] https://nasa.github.io/trick/howto_guides/How-to-dump-core-file-on-MacOS.html

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

27 мая, 16:06:51

On Mon, May 26, 2025 at 12:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Could you guys please share your exact repro steps?
>
> I've just been running 027_stream_regress.pl over and over.
> It's not a recommendable answer though because the failure
> probability is tiny, under 1%.  It sounded like Alexander
> had a better way.

Could you please share your configure options?

While flailing around in the dark and contemplating sources of
nondeterminism that might come from a small system under a lot of load
(as hinted at by Alexander's mention of running the test in parallel)
with a 1MB buffer pool (as used by 027_stream_read.pl via Cluster.pm's
settings for replication tests), I thought about partial reads:

--- a/src/backend/storage/aio/aio_io.c
+++ b/src/backend/storage/aio/aio_io.c
@@ -128,6 +128,8 @@ pgaio_io_perform_synchronously(PgAioHandle *ioh)
                        result = pg_preadv(ioh->op_data.read.fd, iov,

ioh->op_data.read.iov_length,

ioh->op_data.read.offset);
+                       if (result > BLCKSZ && rand() < RAND_MAX / 2)
+                               result = BLCKSZ;

... and the fallback path for io_method=worker that runs IOs
synchronous when the submission queue overflows because the I/O
workers aren't keeping up:

--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -253,7 +253,7 @@ pgaio_worker_submit_internal(int nios, PgAioHandle *ios[])
        for (int i = 0; i < nios; ++i)
        {
                Assert(!pgaio_worker_needs_synchronous_execution(ios[i]));
-               if (!pgaio_worker_submission_queue_insert(ios[i]))
+               if (rand() < RAND_MAX / 2 ||
!pgaio_worker_submission_queue_insert(ios[i]))
                {
                        /*
                         * We'll do it synchronously, but only after
we've sent as many as

... but still no dice here...

Re: Non-reproducible AIO failure

От

Tomas Vondra

Дата:

27 мая, 16:26:03


On 5/24/25 23:00, Alexander Lakhin wrote:
> ...
>
> I'm yet to see the Assert triggered on the buildfarm, but this one looks
> interesting too.
> 
> (I can share the complete patch + script for such testing, if it can be
> helpful.)
> 

I'm interested in how you run these tests in parallel. Can you share the
patch/script?

thank

-- 
Tomas Vondra

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

27 мая, 16:35:09

Hi,

On 2025-05-25 20:05:49 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Could you guys please share your exact repro steps?
> 
> I've just been running 027_stream_regress.pl over and over.
> It's not a recommendable answer though because the failure
> probability is tiny, under 1%.  It sounded like Alexander
> had a better way.

Just FYI, I've been trying to reproduce this as well, without a single failure
so far. Despite running all tests for a few hundred times (~2 days) and
027_stream_regress.pl many hundreds of times (~1 day).

This is on a m4 mac mini.  I'm wondering if there's some hardware specific
memory ordering issue or disk speed based timing issue that I'm just not
hitting.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Alexander Lakhin

Дата:

27 мая, 17:00:01

Hello Tomas,

27.05.2025 16:26, Tomas Vondra wrote:

I'm interested in how you run these tests in parallel. Can you share the
patch/script?

Yeah, sure. I'm running the test as follows:
rm -rf src/test/recovery_*; for i in `seq 40`; do cp -r src/test/recovery/ src/test/recovery_$i/; sed -i .bak "s|src/test/recovery|src/test/recovery_$i|" src/test/recovery_$i/Makefile; done
PROVE_TESTS="t/027*" make -s check -s -C src/test/recovery

for i in `seq 100`; do echo "iteration $i"; parallel -j40 --linebuffer --tag PROVE_TESTS="t/027*" NO_TEMP_INSTALL=1 make -s check -s -C src/test/recovery_{} ::: `seq 8` || break; done; echo -e "\007"

Alexander Lakhin
Neon (https://neon.tech)

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

27 мая, 17:10:15

Thomas Munro <thomas.munro@gmail.com> writes:
> Could you please share your configure options?

The failures on indri and sifaka were during ordinary buildfarm
runs, you can check the animals' details on the website.
(Note those are same host machine, the difference is that
indri uses some MacPorts packages while sifaka is meant to be
a bare-bones-macOS build.)

On the other two, I believe I just used my normal Mac devel
configuration.  On the M4 laptop that's

$ ./configure CC="ccache gcc" CFLAGS="-O2 -fno-common" --enable-debug --enable-cassert --with-bonjour --without-icu
--enable-tap-tests--with-system-tzdata=/usr/share/zoneinfo 

but I see I haven't quite got the same thing on the M1 mini:

$ ./configure CC="ccache clang" --enable-debug --enable-cassert --enable-tap-tests
--with-system-tzdata=/usr/share/zoneinfo

In both cases the compiler is really

$ gcc -v
Apple clang version 17.0.0 (clang-1700.0.13.3)
Target: arm64-apple-darwin24.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

I did some experimentation with different -O levels to see if that
would make the failure any more or less common.  But I've yet to
reproduce it at anything except -O2.

            regards, tom lane

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

27 мая, 17:12:28

Andres Freund <andres@anarazel.de> writes:
> This is on a m4 mac mini.  I'm wondering if there's some hardware specific
> memory ordering issue or disk speed based timing issue that I'm just not
> hitting.

I dunno, I've seen it on three different physical machines now
(one M1, two M4 Pros).  But it is darn hard to repro, for sure.

            regards, tom lane

Re: Non-reproducible AIO failure

От

Alexander Lakhin

Дата:

27 мая, 18:00:01

Hello hackers,

27.05.2025 16:35, Andres Freund пишет:

On 2025-05-25 20:05:49 -0400, Tom Lane wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

Could you guys please share your exact repro steps?

I've just been running 027_stream_regress.pl over and over.
It's not a recommendable answer though because the failure
probability is tiny, under 1%.  It sounded like Alexander
had a better way.

Just FYI, I've been trying to reproduce this as well, without a single failure
so far. Despite running all tests for a few hundred times (~2 days) and
027_stream_regress.pl many hundreds of times (~1 day).

This is on a m4 mac mini.  I'm wondering if there's some hardware specific
memory ordering issue or disk speed based timing issue that I'm just not
hitting.

I'm sorry, but I need several days more to present a working reproducer.

I was lucky enough to catch the assert on my first attempt, without much
effort, but then something changed on that MacBook (it's not mine, I
connect to it remotely when it's available) and I can not reproduce it
anymore.
Just today, I discovered that 027_stream_regress is running very slow
there just because of shared_preload_libraries:
# after the 027_stream_regress test run
echo "shared_preload_libraries = 'pg_stat_statements'" >/tmp/extra.config
TEMP_CONFIG=/tmp/extra.config NO_TEMP_INSTALL=1 /usr/bin/time make -s check
1061,29 real 56,09 user 27,69 sys
vs
NO_TEMP_INSTALL=1 /usr/bin/time make -s check
36,42 real 27,11 user 13,98 sys

Probably it's an effect of antivirus (I see wdavdaemon_unprivileged eating
CPU time), and I uninstalled it before, but now it's installed again
(maybe by some policy). So I definitely need more time to figure out the
exact recipe for triggering the assert.

As to the configure options, when I tried to reproduce the issue on other
(non-macOS) machines, I used options from sifaka:
-DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS,
but then I added -DREAD_STREAM_DISABLE_FAST_PATH to stress read_stream,
and then I just copied that command and ran it on MacBook...
So I think the complete compilation command was (and I'm seeing it in
the history):
CFLAGS="-DREAD_STREAM_DISABLE_FAST_PATH -DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS" ./configure --enable-injection-points --enable-cassert --enable-debug --enable-tap-tests --prefix=/tmp/pg -q && make -s -j8 && make -s install && make -s check
... then running 5 027_stream_regress tests in parallel ...
I had also applied a patch to repeat "test: brin" line, but I'm not sure
it does matter.

Sorry for the lack of useful information again.

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

27 мая, 21:21:30

Hi,

On 2025-05-27 10:12:28 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > This is on a m4 mac mini.  I'm wondering if there's some hardware specific
> > memory ordering issue or disk speed based timing issue that I'm just not
> > hitting.
> 
> I dunno, I've seen it on three different physical machines now
> (one M1, two M4 Pros).  But it is darn hard to repro, for sure.

I just meant that it seems that I can't reproduce it for some as of yet
unknown reason. I've now been through 3k+ runs of 027_stream_regress, without
a single failure, so there has to be *something* different about my
environment than yours.

Darwin m4-dev 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024;
root:xnu-11215.41.3~3/RELEASE_ARM64_T8132arm64

cc -v
Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix

I guess I'll try to update to a later version and see if it repros there?

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

27 мая, 21:43:14

Andres Freund <andres@anarazel.de> writes:
> I just meant that it seems that I can't reproduce it for some as of yet
> unknown reason. I've now been through 3k+ runs of 027_stream_regress, without
> a single failure, so there has to be *something* different about my
> environment than yours.

> Darwin m4-dev 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024;
root:xnu-11215.41.3~3/RELEASE_ARM64_T8132arm64 

> cc -v
> Apple clang version 16.0.0 (clang-1600.0.26.4)
> Target: arm64-apple-darwin24.1.0
> Thread model: posix

> I guess I'll try to update to a later version and see if it repros there?

Maybe.  All the machines I've seen it on are current-software:

$ uname -a
Darwin minim4.sss.pgh.pa.us 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025;
root:xnu-11417.121.6~2/RELEASE_ARM64_T6041arm64 
$ cc -v
Apple clang version 17.0.0 (clang-1700.0.13.3)
Target: arm64-apple-darwin24.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

If it's OS-version-specific, that raises the odds IMO that this
is Apple's fault more than ours.

            regards, tom lane

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

27 мая, 22:16:15

Hi,

On 2025-05-27 14:43:14 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I just meant that it seems that I can't reproduce it for some as of yet
> > unknown reason. I've now been through 3k+ runs of 027_stream_regress, without
> > a single failure, so there has to be *something* different about my
> > environment than yours.
> 
> > Darwin m4-dev 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024;
root:xnu-11215.41.3~3/RELEASE_ARM64_T8132arm64
 
> 
> > cc -v
> > Apple clang version 16.0.0 (clang-1600.0.26.4)
> > Target: arm64-apple-darwin24.1.0
> > Thread model: posix
> 
> > I guess I'll try to update to a later version and see if it repros there?
> 
> Maybe.  All the machines I've seen it on are current-software:
> 
> $ uname -a
> Darwin minim4.sss.pgh.pa.us 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025;
root:xnu-11417.121.6~2/RELEASE_ARM64_T6041arm64
 
> $ cc -v
> Apple clang version 17.0.0 (clang-1700.0.13.3)
> Target: arm64-apple-darwin24.5.0
> Thread model: posix
> InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
> 
> If it's OS-version-specific, that raises the odds IMO that this
> is Apple's fault more than ours.

Uh, huh.  After more than 3k successful runs, ~1 minute after I started to log
in graphically (to update the OS), I got my first reproduction.

2025-05-27 14:51:34.703 EDT [88755] pg_regress/sanity_check LOG:  statement: VACUUM;
TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File:
"../../src/postgres/src/backend/storage/buffer/bufmgr.c",Line: 1605, PID: 88755
 
0   postgres                            0x0000000102747514 ExceptionalCondition + 108
1   postgres                            0x00000001025cd618 WaitReadBuffers + 596
2   postgres                            0x00000001025c9314 read_stream_next_buffer + 428
3   postgres                            0x0000000102345a24 heap_vacuum_rel + 1884
4   postgres                            0x0000000102452fec vacuum_rel + 724
5   postgres                            0x0000000102452b54 vacuum + 1656
6   postgres                            0x0000000102452400 ExecVacuum + 1504
7   postgres                            0x0000000102615990 standard_ProcessUtility + 444
8   pg_stat_statements.dylib            0x0000000102f2c39c pgss_ProcessUtility + 668
9   postgres                            0x00000001026153c4 PortalRunUtility + 136
10  postgres                            0x0000000102614af4 PortalRunMulti + 232
11  postgres                            0x0000000102614530 PortalRun + 456
12  postgres                            0x00000001026135ac exec_simple_query + 1240
13  postgres                            0x000000010261084c PostgresMain + 1400
14  postgres                            0x000000010260c5d4 BackendInitialize + 0
15  postgres                            0x0000000102568f44 postmaster_child_launch + 372
16  postgres                            0x000000010256d218 ServerLoop + 4960
17  postgres                            0x000000010256b55c InitProcessGlobals + 0
18  postgres                            0x00000001024beabc help + 0
19  dyld                                0x0000000192b80274 start + 2840

I'll see if being graphically logged in somehow indeed increased the repro
rate, and if so I'll expand the debugging somewhat, or if this was just an
absurd coincidence.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Robert Haas

Дата:

27 мая, 22:18:50

On Sun, May 25, 2025 at 8:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The fact that I can trace through this Assert failure but not the
> AIO one strongly suggests some system-level problem in the latter.
> There is something rotten in the state of Denmark.

I have been quite frustrated with lldb on macOS for a while now -- I
tend to find that when I get a can get stack trace from a
still-running process it works fine, but trying to get a stack trace
from a core dump often fails to produce anything useful (but sometimes
it does produce something useful). I haven't been able to find any
information on the Internet to explain why this sometimes happens and
sometimes doesn't, and various things I attempted as fixes didn't work
out. There could be something wrong specifically with this machine,
but I also wouldn't be too shocked if this is just randomly broken on
macOS.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

28 мая, 00:49:08

Andres Freund <andres@anarazel.de> writes:
> I'll see if being graphically logged in somehow indeed increased the repro
> rate, and if so I'll expand the debugging somewhat, or if this was just an
> absurd coincidence.

Hmm.  Now that you mention it, the one repro on the M1 came just as
I was about to give up and manually cancel the script that was running
the 027 test in a loop.  I wondered for a moment if I'd somehow
affected the state of the machine, but since I was logged in over ssh
I didn't quite see how that would be possible.  But your tale makes me
suspect that outside activity helps --- which would square with
Alexander's results that suggest concurrent runs help.

            regards, tom lane

Re: Non-reproducible AIO failure

От

Alexander Lakhin

Дата:

31 мая, 06:00:00

Hello Thomas,

25.05.2025 05:45, Thomas Munro wrote:
>> TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
> Can you get a core and print *ioh in the debugger?
It looks like I managed to restore all the conditions needed to reproduce
that Assert more or less reliably (within a couple of hours), so I can
continue experiments.

I've added the following debugging:
+++ b/src/backend/storage/aio/aio_io.c
@@ -158,6 +158,7 @@ pgaio_io_before_start(PgAioHandle *ioh)
         Assert(ioh->state == PGAIO_HS_HANDED_OUT);
         Assert(pgaio_my_backend->handed_out_io == ioh);
         Assert(pgaio_io_has_target(ioh));
+if (ioh->op != PGAIO_OP_INVALID) elog(LOG, "!!!pgaio_io_before_start| ioh: %p, ioh->op: %d", ioh, ioh->op);
         Assert(ioh->op == PGAIO_OP_INVALID);

--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -67,6 +67,7 @@
  #include "utils/resowner.h"
  #include "utils/timestamp.h"

+#include "storage/aio_internal.h"
...
@@ -1855,7 +1860,11 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
                 pgaio_submit_staged();

                 ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
+fprintf(stderr, "!!!AsyncReadBuffers [%d]| ioh(2): %p, op: %d, blocknum: %d\n", getpid(), ioh, ((PgAioHandle 
*)ioh)->op, blocknum);
         }
+else
+fprintf(stderr, "!!!AsyncReadBuffers [%d]| ioh(1): %p, op: %d, blocknum: %d\n", getpid(), ioh, ((PgAioHandle 
*)ioh)->op, blocknum);
+

         /*
          * Check if we can start IO on the first to-be-read buffer.
...

(using fprintf here to rule out side effects of interrupt processing
inside elog(), and it looks like fprintf there eases the reproducing)

--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -712,7 +712,7 @@ sub init
                 print $conf "wal_log_hints = on\n";
                 print $conf "hot_standby = on\n";
                 # conservative settings to ensure we can run multiple postmasters:
-               print $conf "shared_buffers = 1MB\n";
+               print $conf "shared_buffers = 512kB\n";
                 print $conf "max_connections = 10\n";

(it looks like it increases the probability of failure too, will gather
more statistics...)

And got:
...
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 121
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 122
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 123
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 124
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 125
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 126
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 1, blocknum: 127
2025-05-30 14:38:40.812 EDT [97859:121] pg_regress/replica_identity LOG:  !!!pgaio_io_before_start| ioh: 0x106efd540, 
ioh->op: 1
2025-05-30 14:38:40.812 EDT [97859:122] pg_regress/replica_identity STATEMENT:  SELECT c.oid,
           n.nspname,
           c.relname
         FROM pg_catalog.pg_class c
              LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
         WHERE c.relname OPERATOR(pg_catalog.~) '^(test_replica_identity3)$' COLLATE pg_catalog.default
           AND pg_catalog.pg_table_is_visible(c.oid)
         ORDER BY 2, 3;
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 162, PID: 97859

0   postgres                            0x0000000105823878 ExceptionalCondition + 236
1   postgres                            0x000000010552ab54 pgaio_io_before_start + 452
2   postgres                            0x000000010552a954 pgaio_io_start_readv + 36
3   postgres                            0x000000010554946c FileStartReadV + 252
4   postgres                            0x000000010559c74c mdstartreadv + 668
5   postgres                            0x000000010559fd34 smgrstartreadv + 116
6   postgres                            0x00000001055326e0 AsyncReadBuffers + 1664
7   postgres                            0x00000001055317e8 StartReadBuffersImpl + 1196
8   postgres                            0x000000010553132c StartReadBuffers + 64
9   postgres                            0x000000010552e934 read_stream_start_pending_read + 1204
10  postgres                            0x000000010552df98 read_stream_look_ahead + 812
11  postgres                            0x000000010552dc50 read_stream_next_buffer + 1396
12  postgres                            0x0000000104f22af8 heap_fetch_next_buffer + 284
13  postgres                            0x0000000104f13894 heapgettup_pagemode + 192
14  postgres                            0x0000000104f13e64 heap_getnextslot + 84
15  postgres                            0x00000001052962c4 table_scan_getnextslot + 340
16  postgres                            0x00000001052960d0 SeqNext + 148
17  postgres                            0x00000001052967a4 ExecScanFetch + 772
18  postgres                            0x0000000105296378 ExecScanExtended + 164
...

And another failed run:
...
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 30
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 31
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 34
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 35
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 36
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 37
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 38
!!!AsyncReadBuffers [5043]| ioh(1): 0x102d14f10, op: 0, blocknum: 253
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 39
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 40
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 41
!!!AsyncReadBuffers [5043]| ioh(1): 0x102d14f10, op: 0, blocknum: 254
!!!AsyncReadBuffers [5043]| ioh(1): 0x102d14f10, op: 0, blocknum: 255
2025-05-30 22:48:00.265 EDT [5043:16] pg_regress/memoize LOG: !!!pgaio_io_before_start| ioh: 0x102d14f10, ioh->op: 1
2025-05-30 22:48:00.265 EDT [5043:17] pg_regress/memoize CONTEXT: PL/pgSQL function explain_memoize(text,boolean) line
5
 
at FOR over EXECUTE statement
2025-05-30 22:48:00.265 EDT [5043:18] pg_regress/memoize STATEMENT: SELECT explain_memoize('
         SELECT COUNT(*), AVG(t1.twenty) FROM tenk1 t1 LEFT JOIN
         LATERAL (SELECT t1.twenty AS c1, t2.unique1 AS c2, t2.two FROM tenk1 t2) s
         ON t1.two = s.two
         WHERE s.c1 = s.c2 AND t1.unique1 < 1000;', false);
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 162, PID: 5043
...
0   postgres                            0x000000010163b878 ExceptionalCondition + 236
1   postgres                            0x0000000101342b54 pgaio_io_before_start + 452
2   postgres                            0x0000000101342954 pgaio_io_start_readv + 36
3   postgres                            0x000000010136146c FileStartReadV + 252
4   postgres                            0x00000001013b474c mdstartreadv + 668
5   postgres                            0x00000001013b7d34 smgrstartreadv + 116
6   postgres                            0x000000010134a6e0 AsyncReadBuffers + 1664
7   postgres                            0x00000001013497e8 StartReadBuffersImpl + 1196
8   postgres                            0x000000010134932c StartReadBuffers + 64
9   postgres                            0x0000000101346934 read_stream_start_pending_read + 1204
10  postgres                            0x0000000101345f98 read_stream_look_ahead + 812
11  postgres                            0x0000000101345c50 read_stream_next_buffer + 1396
12  postgres                            0x0000000100d3aaf8 heap_fetch_next_buffer + 284
13  postgres                            0x0000000100d2b894 heapgettup_pagemode + 192
14  postgres                            0x0000000100d2be64 heap_getnextslot + 84
15  postgres                            0x00000001010ae2c4 table_scan_getnextslot + 340
16  postgres                            0x00000001010ae0d0 SeqNext + 148
...

I run the same loop as shown at [1], but with a random number of parallel
processes: np=$((1 + $RANDOM % 5)), and as I can see the failure is more
likely with np == 5. I also tried running stress-ng --hdd 1 --timeout 60m
concurrently, but I'm still not sure if it affects the reproducing
directly.
I use MacBook M1 and don't know how to disable E/P-cores, moreover, from
my observation, the test performs slower (up to 10%) when it runs for
hours in a loop, probably because of the CPU heating up and lowering
it's frequency, so if the issue is really timing-dependent, it's not that
easy to get more or less the same timings once more (I tried to add sleep
into the loop, to let CPU cool down...).

[1] https://www.postgresql.org/message-id/87673f76-1ed7-44dc-8e8b-6aaf78e2817a%40gmail.com

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Re: Non-reproducible AIO failure

От

Alexander Lakhin

Дата:

02 июня, 09:00:01

31.05.2025 06:00, Alexander Lakhin wrote:
> Hello Thomas,
>
> It looks like I managed to restore all the conditions needed to reproduce
> that Assert more or less reliably (within a couple of hours), so I can
> continue experiments.
>
> I've added the following debugging:
> ...

With additional logging (the patch is attached), I can see the following:
!!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2
!!!AsyncReadBuffers [63817] (1)| blocknum: 15, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, 
ioh->num_callbacks: 0
!!!pgaio_io_start_readv [63817]| ioh: 0x1046b5660, ioh->op: 1
!!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2
!!!AsyncReadBuffers [63817] (1)| blocknum: 16, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, 
ioh->num_callbacks: 0
!!!pgaio_io_start_readv [63817]| ioh: 0x1046b5660, ioh->op: 1
!!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2
!!!AsyncReadBuffers [63817] (1)| blocknum: 17, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, 
ioh->num_callbacks: 0
!!!pgaio_io_start_readv [63817]| ioh: 0x1046b5660, ioh->op: 1
!!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2
!!!AsyncReadBuffers [63817] (1)| blocknum: 18, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, 
ioh->num_callbacks: 0
2025-06-02 00:45:11.391 EDT [63817:14] pg_regress/brin LOG: !!!pgaio_io_before_start| ioh: 0x1046b5660, ioh->op: 1, 
ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 63817
0   postgres                            0x0000000102fdb60c ExceptionalCondition + 236
1   postgres                            0x0000000102ce233c pgaio_io_before_start + 504
2   postgres                            0x0000000102ce20a8 pgaio_io_start_readv + 36
3   postgres                            0x0000000102d01200 FileStartReadV + 252
4   postgres                            0x0000000102d544e0 mdstartreadv + 668
5   postgres                            0x0000000102d57ac8 smgrstartreadv + 116
6   postgres                            0x0000000102cea474 AsyncReadBuffers + 1996
7   postgres                            0x0000000102ce9430 StartReadBuffersImpl + 1196
8   postgres                            0x0000000102ce8f74 StartReadBuffers + 64
9   postgres                            0x0000000102ce657c read_stream_start_pending_read + 1204
10  postgres                            0x0000000102ce5c60 read_stream_look_ahead + 812
11  postgres                            0x0000000102ce57f8 read_stream_next_buffer + 2356
12  postgres                            0x00000001026da1b8 heap_fetch_next_buffer + 284
13  postgres                            0x00000001026caf54 heapgettup_pagemode + 192
14  postgres                            0x00000001026cb524 heap_getnextslot + 84
15  postgres                            0x0000000102a4d984 table_scan_getnextslot + 340
16  postgres                            0x0000000102a4d790 SeqNext + 148
17  postgres                            0x0000000102a4de64 ExecScanFetch + 772
18  postgres                            0x0000000102a4da38 ExecScanExtended + 164
19  postgres                            0x0000000102a4d374 ExecSeqScanWithQualProject + 248
20  postgres                            0x0000000102a13248 ExecProcNode + 68
21  postgres                            0x0000000102a12e70 fetch_input_tuple + 180
22  postgres                            0x0000000102a12bb4 agg_retrieve_direct + 1664
23  postgres                            0x0000000102a0f914 ExecAgg + 240
...
2025-06-02 00:45:11.409 EDT [57192:4] LOG:  client backend (PID 63817) was terminated by signal 6: Abort trap: 6
2025-06-02 00:45:11.409 EDT [57192:5] DETAIL:  Failed process was running: DO $x$
         DECLARE
                 r record;
                 r2 record;
                 cond text;
                 idx_ctids tid[];
                 ss_ctids tid[];
                 count int;
                 plan_ok bool;
                 plan_line text;
         BEGIN
...

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

Вложения

aio-debugging-0602.patch

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

04 июня, 23:02:32

Hi,

Thanks for working on investigating this.


On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
> 02.06.2025 09:00, Alexander Lakhin wrote:
> > With additional logging (the patch is attached), I can see the following:
> > ...
> > !!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2
> > !!!AsyncReadBuffers [63817] (1)| blocknum: 18, ioh: 0x1046b5660,
> > ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0
> > 2025-06-02 00:45:11.391 EDT [63817:14] pg_regress/brin LOG:
> > !!!pgaio_io_before_start| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 1,
> > ioh->result: 0, ioh->num_callbacks: 2
> > 
> > TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 63817
> 
> I've added logging of the ioh->generation field at the beginning and at
> the end of pgaio_io_reclaim() (the complete patch is attached) and got
> even more interesting output:
> ...
> !!!pgaio_io_reclaim [25175]| ioh: 0x104c3e1a0, ioh->op: 1, ioh->state: 6,
> ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 21692
> !!!pgaio_io_reclaim [25175]| ioh: 0x104c3e1a0, ioh->op: 0, ioh->generation: 21693
> !!!AsyncReadBuffers [25175] (1)| blocknum: 40, ioh: 0x104c3e1a0, ioh->op: 0,
> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 21693
> !!!pgaio_io_start_readv [25175]| ioh: 0x104c3e1a0, ioh->op: 1, ioh->generation: 21693

IO starts.


> !!!pgaio_io_reclaim [25176]| ioh: 0x104c3e080, ioh->op: 1, ioh->state: 6,
> ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 21408
> !!!pgaio_io_reclaim [25176]| ioh: 0x104c3e080, ioh->op: 0, ioh->generation: 21409
> !!!AsyncReadBuffers [25176] (1)| blocknum: 49, ioh: 0x104c3e080, ioh->op: 0,
> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 21409
> !!!pgaio_io_start_readv [25176]| ioh: 0x104c3e080, ioh->op: 1, ioh->generation: 21409

(other backend / other IO)


> !!!pgaio_io_reclaim [25175]| ioh: 0x104c3e1a0, ioh->op: 1, ioh->state: 6,
> ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 21693
> !!!pgaio_io_reclaim [25175]| ioh: 0x104c3e1a0, ioh->op: 0, ioh->generation: 21694

I completes and is marked as idle, as we'd expect.


> !!!AsyncReadBuffers [25175] (1)| blocknum: 41, ioh: 0x104c3e1a0, ioh->op: 0,
> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 21694

And here, before starting a new IO, the op is 0 (i.e. PGAIO_OP_INVALID), as we
would expect.


> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
> 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks:
> 2, ioh->generation: 21694

But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any
"normal" reason for that. We know that the IO wasn't actually started as
otherwise pgaio_io_start_readv() would have logged that fact.

Which then immediately leads to an assert failure:

> 2025-06-03 00:19:09.282 EDT [25175:2] STATEMENT:  explain (analyze, format 'json')
>           select count(*) from join_foo
>             left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
>             on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
> 
> TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 25175
> 
> 0   postgres                            0x0000000103563550 ExceptionalCondition + 236
> 1   postgres                            0x000000010326a260 pgaio_io_before_start + 516
> 2   postgres                            0x0000000103269fb4 pgaio_io_start_readv + 36
> 3   postgres                            0x0000000103289144 FileStartReadV + 252
> 4   postgres                            0x00000001032dc424 mdstartreadv + 668
> 5   postgres                            0x00000001032dfa0c smgrstartreadv + 116
> 6   postgres                            0x00000001032723b8 AsyncReadBuffers + 2028
> 7   postgres                            0x0000000103271354 StartReadBuffersImpl + 1196
> 8   postgres                            0x0000000103270e98 StartReadBuffers + 64
> 9   postgres                            0x000000010326e4a0 read_stream_start_pending_read + 1204
> 10  postgres                            0x000000010326db84 read_stream_look_ahead + 812
> 11  postgres                            0x000000010326d71c read_stream_next_buffer + 2356

I don't see how this can happen.

There is no mac specific code in AIO as of now, why aren't we seeing this on
any other platform?

IME the theory that there is something outside of a subsystem randomly
overwriting memory is almost always wrong, but I'm kinda starting to suspect
it here nonetheless.


I wonder if it's worth trying to put a fair bit of "padding" memory
before/after AIO's shared memory region(s). If that were to stop the crashes,
it'd likely be something random writing there. Unfortunately continuing to
crash wouldn't necessarily guarantee it's aio specific.

Hm. We had issues with wrong size calculations of shmem regions before, what
about writing a patch that puts mprotect(PROT_NONE) pages before/after each
ShmemAlloc'd region?

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

04 июня, 23:32:53

On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
> > 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
> > 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks:
> > 2, ioh->generation: 21694
>
> But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any
> "normal" reason for that. We know that the IO wasn't actually started as
> otherwise pgaio_io_start_readv() would have logged that fact.

A cheap/easy thing to try, maybe: give that enumeration more
distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get
a wild 1 here?

Re: Non-reproducible AIO failure

От

Alexander Lakhin

Дата:

05 июня, 22:00:00

Hello Thomas and Andres,

04.06.2025 23:32, Thomas Munro wrote:
> On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <andres@anarazel.de> wrote:
>> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
>>> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
>>> 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks:
>>> 2, ioh->generation: 21694
>> But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any
>> "normal" reason for that. We know that the IO wasn't actually started as
>> otherwise pgaio_io_start_readv() would have logged that fact.
> A cheap/easy thing to try, maybe: give that enumeration more
> distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get
> a wild 1 here?

Thank you for your attention to this and for the tip! Today I tried the
following:
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -89,8 +89,8 @@ typedef enum PgAioOp
         /* intentionally the zero value, to help catch zeroed memory etc */
         PGAIO_OP_INVALID = 0,

-       PGAIO_OP_READV,
-       PGAIO_OP_WRITEV,
+       PGAIO_OP_READV = 0xaa,
+       PGAIO_OP_WRITEV = 0xbb,

And got::
!!!AsyncReadBuffers [68292] (1)| blocknum: 9, ioh: 0x102b26230, ioh->op: 0, ioh->state: 1, ioh->result: 0, 
ioh->num_callbacks: 0, ioh->generation: 13038
!!!pgaio_io_start_readv [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->generation: 13038
!!!pgaio_io_reclaim [67729]| ioh: 0x102b25540, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, 
ioh->generation: 47468
!!!pgaio_io_reclaim [67729]| ioh: 0x102b25540, ioh->op: 0, ioh->generation: 47469
!!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, 
ioh->generation: 13038
!!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 0, ioh->generation: 13039
!!!AsyncReadBuffers [68292] (1)| blocknum: 10, ioh: 0x102b26230, ioh->op: 0, ioh->state: 1, ioh->result: 0, 
ioh->num_callbacks: 0, ioh->generation: 13039
!!!pgaio_io_start_readv [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->generation: 13039
!!!pgaio_io_reclaim [67718]| ioh: 0x102b25390, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, 
ioh->generation: 35821
!!!pgaio_io_reclaim [67718]| ioh: 0x102b25390, ioh->op: 0, ioh->generation: 35822
!!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, 
ioh->generation: 13039
!!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 0, ioh->generation: 13040
!!!AsyncReadBuffers [68292] (1)| blocknum: 11, ioh: 0x102b26230, ioh->op: 0, ioh->state: 1, ioh->result: 0, 
ioh->num_callbacks: 0, ioh->generation: 13040
2025-06-05 04:15:38.847 EDT [68292:1] LOG: !!!pgaio_io_before_start| ioh: 0x102b26230, ioh->op: 170, ioh->state: 1, 
ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 13040
### I reproduced the failure a few times and observed "ioh->op: 170" each time. ###
2025-06-05 04:15:38.847 EDT [68292:2] STATEMENT:  select count(*) from simple r join extremely_skewed s using (id);
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 68292

0   postgres                            0x000000010144b550 ExceptionalCondition + 236
1   postgres                            0x0000000101152260 pgaio_io_before_start + 516
2   postgres                            0x0000000101151fb4 pgaio_io_start_readv + 36
3   postgres                            0x0000000101171144 FileStartReadV + 252
4   postgres                            0x00000001011c4424 mdstartreadv + 668
5   postgres                            0x00000001011c7a0c smgrstartreadv + 116
6   postgres                            0x000000010115a3b8 AsyncReadBuffers + 2028
7   postgres                            0x0000000101159354 StartReadBuffersImpl + 1196
8   postgres                            0x0000000101158e98 StartReadBuffers + 64
9   postgres                            0x00000001011564a0 read_stream_start_pending_read + 1204
10  postgres                            0x0000000101155b84 read_stream_look_ahead + 812
11  postgres                            0x000000010115571c read_stream_next_buffer + 2356
12  postgres                            0x0000000100b4a04c heap_fetch_next_buffer + 284
13  postgres                            0x0000000100b3ade8 heapgettup_pagemode + 192
14  postgres                            0x0000000100b3b3b8 heap_getnextslot + 84
15  postgres                            0x0000000100ebd818 table_scan_getnextslot + 340
16  postgres                            0x0000000100ebd624 SeqNext + 148
17  postgres                            0x0000000100ebdcf8 ExecScanFetch + 772
18  postgres                            0x0000000100ebd8cc ExecScanExtended + 164
19  postgres                            0x0000000100ebd004 ExecSeqScanWithProject + 244
20  postgres                            0x0000000100e970f4 ExecProcNode + 68

I also tried reproducing this with -DFDDEBUG, no luck so far, probably due
to significant change of the test duration.

Best regards,
Alexander Lakhin

Re: Non-reproducible AIO failure

От

Matthias van de Meent

Дата:

06 июня, 01:13:52

On Thu, 5 Jun 2025 at 21:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Thomas and Andres,
>
> 04.06.2025 23:32, Thomas Munro wrote:
> > On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <andres@anarazel.de> wrote:
> >> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
> >>> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
> >>> 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks:
> >>> 2, ioh->generation: 21694
> >> But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any
> >> "normal" reason for that. We know that the IO wasn't actually started as
> >> otherwise pgaio_io_start_readv() would have logged that fact.
> > A cheap/easy thing to try, maybe: give that enumeration more
> > distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get
> > a wild 1 here?

I have a very wild guess that's probably wrong in a weird way, but
here goes anyway:

Did anyone test if interleaving the enum-typed bitfield fields of
PgAioHandle with the uint8 fields might solve the issue? I.e. {state,
flags, target, num_callbacks, op} instead of {state, target, op,
flags, num_callbacks}?

I've looked at some decompiled code generated for
pgaio_io_before_start and pgaio_io_start_readv. pgaio_io_start_readv
looks rather normal, but in pgaio_io_before_start it's clear that
pgaio_io_before_start does some weird things as a response to the
bitfields of PgAioHandle:

(manually annotated)

_pgaio_io_before_start:
00000001002eaba8    stp    x20, x19, [sp, #-0x20]!     /* save registers */
00000001002eabac    stp    x29, x30, [sp, #0x10]       /* save registers */
00000001002eabb0    add    x29, sp, #0x10              /* x29 := stack+10 */
00000001002eabb4    ldrb    w8, [x0]                /* w8 := ioh->state */
00000001002eabb8    cmp    w8, #0x1                    /* w8 ??
PGAIO_HS_HANDED_OUT */
00000001002eabbc    b.ne    0x1002eac38

w8 here will contain the value 0x1.

00000001002eabc0    mov    x19, x0                     /* x19 := ioh */
00000001002eabc4    adrp    x8, 1461 ; 0x10089f000  /* x8 := 0x10089f000 */
00000001002eabc8    add    x8, x8, #0xd40              /* x8 += 0xd40
(0x10089fd40) */
00000001002eabcc    ldr    x8, [x8]                    /* x8 :=
pgaio_my_backend */
00000001002eabd0    ldr    x8, [x8, #0x20]             /* x8 :=
pgaio_my_backend->handed_out_io */
00000001002eabd4    cmp    x8, x0                      /* x8
(pgaio_my_backend->handed_out_io) ?? x19 (ioh) */
00000001002eabd8    b.ne    0x1002eac50
00000001002eabdc    mov    x0, x19                     /* x0 := ioh */
00000001002eabe0    bl    _pgaio_io_has_target
00000001002eabe4    tbz    w0, #0x0, 0x1002eac68       /*
_pgaio_io_has_target(ioh)? */
00000001002eabe8    ldrb    w8, [x19, #0x2]         /* w8 := ioh->op */

w8 now contains ioh->op, which is presumed to be 0x0.

00000001002eabec    ldrh    w9, [x19]               /* w9 :=
ioh->[target | state] */
00000001002eabf0    orr    w8, w9, w8, lsl #16         /* w8 := w9 |
(w8 << 16) // [ op | [target | state] ]*/

But here more operations are applied, pushing in the preceding
bitfields of PgAioHandle, mixing in various values into w8.

00000001002eabf4    cmp    w8, #0x10, lsl #12          /* w8 ??
0x1'00'00 // op'state'target ?? 0x1'00'00 */

And finally the full set of bitfields is compared against a target
value, rather than just ioh->op. This means we do 2 memory accesses
for this field.

00000001002eabf8    b.hs    0x1002eac80             /* >= ?? ERROR : continue */

And finally we branch into the setup for _ExceptionalCondition.

This shows that b->op is not exactly a stand-alone field, and might be
influenced by the other sibling bitfield fields. If we changed the
order of the fields to not have continuous bitfields, that might be
able to convince the compiler that there are no data dependencies
between state/target/op, and remove the two-phase load operation.

(checks more decompiled code)

Oh, the bitfields are not just read with multiple accesses, but also
*written* using multiple accesses. See pgaio_io_set_target:

_pgaio_io_set_target:
00000001002eafdc    stp    x29, x30, [sp, #-0x10]!
00000001002eafe0    mov    x29, sp
00000001002eafe4    ldrb    w8, [x0, #0x2]
00000001002eafe8    ldrh    w9, [x0]
00000001002eafec    orr    w8, w9, w8, lsl #16
00000001002eaff0    and    w9, w8, #0xff
00000001002eaff4    cmp    w9, #0x1
00000001002eaff8    b.ne    0x1002eb01c
00000001002eaffc    tst    w8, #0xff00
00000001002eb000    b.ne    0x1002eb034
00000001002eb004    lsr    w9, w8, #16
00000001002eb008    bfi    w8, w1, #8, #24
00000001002eb00c    strb    w9, [x0, #0x2] /* ioh->op = w9 */
00000001002eb010    strh    w8, [x0]           /* ioh->[state, target] = w8 */

... and pgaio_io_update_state:

_pgaio_io_update_state:
00000001002e7f08    sub    sp, sp, #0x70
00000001002e7f0c    stp    x24, x23, [sp, #0x30]
00000001002e7f10    stp    x22, x21, [sp, #0x40]
00000001002e7f14    stp    x20, x19, [sp, #0x50]
00000001002e7f18    stp    x29, x30, [sp, #0x60]
00000001002e7f1c    add    x29, sp, #0x60
00000001002e7f20    mov    x19, x1
00000001002e7f24    mov    x20, x0
00000001002e7f28    mov    w0, #0xa
00000001002e7f2c    mov    x1, #0x0
00000001002e7f30    bl    _errstart
00000001002e7f34    cbz    w0, 0x1002e8010
[...] /* skip pgaio_debug_io log handling */
00000001002e8010    dmb    ish
00000001002e8014    ldurh    w8, [x20, #0x1]
00000001002e8018    bfi    w19, w8, #8, #16
00000001002e801c    lsr    w8, w8, #8
00000001002e8020    strb    w8, [x20, #0x2] /* store ioh->op */
00000001002e8024    strh    w19, [x20]        /* store ioh->{state, target} */

I'm sending this now without looking much deeper into this due to
$err_date_changed and other time-related constraints, but might this
be a candidate for further debugging research?


Kind regards,

Matthias van de Meent
Databricks Inc.

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

06 июня, 02:31:51

Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> I have a very wild guess that's probably wrong in a weird way, but
> here goes anyway:
> Did anyone test if interleaving the enum-typed bitfield fields of
> PgAioHandle with the uint8 fields might solve the issue?

Ugh.  I think you probably nailed it.

IMO all those struct fields better be declared uint8.

            regards, tom lane

Re: Non-reproducible AIO failure

От

Alexander Lakhin

Дата:

06 июня, 06:00:00

Hello,

05.06.2025 22:00, Alexander Lakhin wrote:
> Thank you for your attention to this and for the tip! Today I tried the
> following:
> --- a/src/include/storage/aio.h
> +++ b/src/include/storage/aio.h
> @@ -89,8 +89,8 @@ typedef enum PgAioOp
>         /* intentionally the zero value, to help catch zeroed memory etc */
>         PGAIO_OP_INVALID = 0,
>
> -       PGAIO_OP_READV,
> -       PGAIO_OP_WRITEV,
> +       PGAIO_OP_READV = 0xaa,
> +       PGAIO_OP_WRITEV = 0xbb,

And one more bit of information from me, if you're not tired of it yet:
I've added:
@@ -2238,6 +2240,7 @@ FileStartReadV(PgAioHandle *ioh, File file,

         vfdP = &VfdCache[file];

+if (ioh->op != PGAIO_OP_INVALID) elog(LOG, "!!!FileStartReadV| ioh: %p, ioh->op: %d, ioh->state: %d, ioh->result: %d,

ioh->num_callbacks: %d, ioh->generation: %llu", ioh, ((PgAioHandle *)ioh)->op, ((PgAioHandle *)ioh)->state, 
((PgAioHandle *)ioh)->result, ((PgAioHandle *)ioh)->num_callbacks, ((PgAioHandle *)ioh)->generation);
         pgaio_io_start_readv(ioh, vfdP->fd, iovcnt, offset);

and got:
...
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29200
!!!AsyncReadBuffers [91056] (1)| blocknum: 70, ioh: 0x1066d15d0, ioh->op: 0, ioh->state: 1, ioh->result: 0, 
ioh->num_callbacks: 0, ioh->generation: 29200
!!!pgaio_io_start_readv [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->generation: 29200
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, 
ioh->generation: 29200
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29201
!!!AsyncReadBuffers [91056] (1)| blocknum: 71, ioh: 0x1066d15d0, ioh->op: 0, ioh->state: 1, ioh->result: 0, 
ioh->num_callbacks: 0, ioh->generation: 29201
!!!pgaio_io_start_readv [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->generation: 29201
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, 
ioh->generation: 29201
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29202
!!!AsyncReadBuffers [91056] (1)| blocknum: 72, ioh: 0x1066d15d0, ioh->op: 0, ioh->state: 1, ioh->result: 0, 
ioh->num_callbacks: 0, ioh->generation: 29202
!!!pgaio_io_start_readv [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->generation: 29202
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, 
ioh->generation: 29202
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29203
!!!AsyncReadBuffers [91056] (1)| blocknum: 73, ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 1, ioh->result: 0, 
ioh->num_callbacks: 0, ioh->generation: 29203
2025-06-05 19:52:44.174 EDT [91056:14] pg_regress/brin LOG: !!!FileStartReadV| ioh: 0x1066d15d0, ioh->op: 170, 
ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 29203
2025-06-05 19:52:44.174 EDT [91056:15] pg_regress/brin CONTEXT:  SQL statement "SELECT array_agg(ctid) FROM brintest 
WHERE cidrcol = '10.2.14/24'::inet "
         PL/pgSQL function inline_code_block line 52 at EXECUTE
2025-06-05 19:52:44.174 EDT [91056:16] pg_regress/brin STATEMENT: DO $x$
...
2025-06-05 19:52:44.174 EDT [91056:17] pg_regress/brin LOG: !!!pgaio_io_before_start| ioh: 0x1066d15d0, ioh->op: 170, 
ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 29203
2025-06-05 19:52:44.174 EDT [91056:18] pg_regress/brin CONTEXT:  SQL statement "SELECT array_agg(ctid) FROM brintest 
WHERE cidrcol = '10.2.14/24'::inet "
         PL/pgSQL function inline_code_block line 52 at EXECUTE
2025-06-05 19:52:44.174 EDT [91056:19] pg_regress/brin STATEMENT: DO $x$
         DECLARE
                 r record;
                 r2 record;
                 cond text;
                 idx_ctids tid[];
                 ss_ctids tid[];
                 count int;
                 plan_ok bool;
                 plan_line text;
         BEGIN
                 FOR r IN SELECT colname, oper, typ, value[ordinality], matches[ordinality] FROM brinopers, unnest(op)

WITH ORDINALITY AS oper LOOP
...
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 91056
0   postgres                            0x0000000104ff771c ExceptionalCondition + 236
1   postgres                            0x0000000104cfe32c pgaio_io_before_start + 516
2   postgres                            0x0000000104cfe080 pgaio_io_start_readv + 36
3   postgres                            0x0000000104d1d310 FileStartReadV + 508
4   postgres                            0x0000000104d705f0 mdstartreadv + 668
5   postgres                            0x0000000104d73bd8 smgrstartreadv + 116
6   postgres                            0x0000000104d06484 AsyncReadBuffers + 2028
7   postgres                            0x0000000104d05420 StartReadBuffersImpl + 1196
8   postgres                            0x0000000104d04f64 StartReadBuffers + 64
9   postgres                            0x0000000104d0256c read_stream_start_pending_read + 1204
10  postgres                            0x0000000104d01c50 read_stream_look_ahead + 812
11  postgres                            0x0000000104d017e8 read_stream_next_buffer + 2356
12  postgres                            0x00000001046f661c heap_fetch_next_buffer + 284
13  postgres                            0x00000001046e73b8 heapgettup_pagemode + 192
14  postgres                            0x00000001046e7988 heap_getnextslot + 84

Thus, this time ioh->op switched to PGAIO_OP_READV earlier, even before
the logging in AsyncReadBuffers(). By the way, this
"DO .. FOR ... FROM brinopers, unnest(op) ... " triggers the assertion not
for the first time for me (I've counted 3 such failures in my notes).

Best regards,
Alexander Lakhin

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

06 июня, 09:13:57

On 06/06/2025 2:31 am, Tom Lane wrote:
> Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
>> I have a very wild guess that's probably wrong in a weird way, but
>> here goes anyway:
>> Did anyone test if interleaving the enum-typed bitfield fields of
>> PgAioHandle with the uint8 fields might solve the issue?
> Ugh.  I think you probably nailed it.
>
> IMO all those struct fields better be declared uint8.
>
>             regards, tom lane

I also think that it can be in compiler. Bitfields with different enum 
type looks really exotic, so no wonder that optimizer can do something 
strange here.
I failed to reproduce the problem with old version of clang (15.0). Also 
as far as I understand nobody was able to reproduce the problem with 
disabled optimizations (-O0).
It definitely doesn't mean that there is bug in optimizer - just timing 
can be changed.

Still it is not quite clear to me how `PGAIO_OP_READV` is managed to be 
written.
There is just one place in the code when it is assigned:

```
pgaio_io_start_readv(PgAioHandle *ioh,
                      int fd, int iovcnt, uint64 offset)
{
     ...

     pgaio_io_stage(ioh, PGAIO_OP_READV);
}

```

and `pgaio_io_stage` should update both `state` and `op`:

```
     ioh->op = op;
     ioh->result = 0;

     pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
```

But as we see from the trace state is still PGAIO_HS_HANDED_OUT, so it 
was not updated.

If there is some bug in optimizer which incorrectly construct mask for 
bitfield assignment, it is still not clean where it managed to get this 
PGAIO_OP_READV.
And we can be sure that it is really PGAIO_OP_READV and just arbitrary 
garbage, because Alexander has replaced its value with 0xaa and we see 
in logs that it is rally stored.

If there is race condition in `pgaio_io_update_state` (which enforces 
memory barrier before updating state) then for example inserting some 
sleep between assignment operation and status should increase 
probability of error. But it doesn't happen. Also as far as I 
understand, op is updated and read by the same backend. So it should not 
be some synchronization issue.

So most likely it is bug in optimizer which generates incorrect code. 
Can Alexander or somebody else who was able to reproduce the problem 
share assembler code of `pgaio_io_reclaim`  function?
I am not sure that the bug is in this function - but it is prime 
suspect. Only `pgaio_io_start_readv` can set PGAIO_OP_READV, but we are 
almost sure that it was no called.
So looks like that `op` was not cleared despite to what we see in logs. 
But if there was incorrect code in `pgaio_io_reclaim`, then it should 
always work incorrectly - doesn't clear "op" but in most cases it works...

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

06 июня, 14:03:12

There is really essential difference in code generated by clang 15 (working) and 16 (not working).

```

pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)

{

...

     HOLD_INTERRUPTS();

    ioh->op = op;
    ioh->result = 0;

    pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
    ...

}

```

clang15:

```

postgres[0x10035d4ac] <+84>: add    x22, x22, #0x5d4          ; InterruptHoldoffCount
postgres[0x10035d4b0] <+88>: ldr    w8, [x22]
postgres[0x10035d4b4] <+92>: add    w8, w8, #0x1
postgres[0x10035d4b8] <+96>: str    w8, [x22]
postgres[0x10035d4bc] <+100>: strb   w20, [x19, #0x2]
postgres[0x10035d4c0] <+104>: str    wzr, [x19, #0x14]
postgres[0x10035d4c4] <+108>: mov    x0, x19
postgres[0x10035d4c8] <+112>: mov    w1, #0x2
postgres[0x10035d4cc] <+116>: bl     0x10035cbdc               ; pgaio_io_update_state at aio.c:384
```

clang16:

```

postgres[0x100699e8c] <+200>: add    x9, x9, #0xd74 ; InterruptHoldoffCount
postgres[0x100699e90] <+204>: ldr    w8, [x9]
postgres[0x100699e94] <+208>: mov    w10, #0x1 ; =1
postgres[0x100699e98] <+212>: stur   w10, [x29, #-0x18]
postgres[0x100699e9c] <+216>: add    w8, w8, #0x1
postgres[0x100699ea0] <+220>: str    w8, [x9]
postgres[0x100699ea4] <+224>: ldur   w10, [x29, #-0xc]
postgres[0x100699ea8] <+228>: ldur   x9, [x29, #-0x8]
postgres[0x100699eac] <+232>: ldrh   w8, [x9]
postgres[0x100699eb0] <+236>: and    w10, w10, #0xff
postgres[0x100699eb4] <+240>: orr    w10, w8, w10, lsl #16
postgres[0x100699eb8] <+244>: lsr    w8, w10, #16
postgres[0x100699ebc] <+248>: strh   w10, [x9]
postgres[0x100699ec0] <+252>: strb   w8, [x9, #0x2]
postgres[0x100699ec4] <+256>: ldur   x8, [x29, #-0x8]
postgres[0x100699ec8] <+260>: str    wzr, [x8, #0x14]
postgres[0x100699ecc] <+264>: ldur   x0, [x29, #-0x8]
postgres[0x100699ed0] <+268>: mov    w1, #0x2 ; =2
postgres[0x100699ed4] <+272>: bl     0x100698efc    ; pgaio_io_update_state at aio.c:384
```

As you can see pg16 just updates third byte of the structure (`op` field). It is not affecting 'state' field at all.

While clang16 is also updating `state` field! And looks like it is moved before memory barrier (in `pgaio_io_update_state` function),

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

06 июня, 21:47:32

Hi,

On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote:
> There is really essential difference in code generated by clang 15 (working)
> and 16 (not working).

There also are code gen differences between upstream clang 17 and apple's
clang, which is based on llvm 17 as well (I've updated the toolchain, it
repros with that as well).

The code generated for the bitfield access is so attrocious that it's worth
changing to plain uint8 for that alone. It's annoying to have to add casts to
all the switches, but clearly there isn't a better way for now.

What specific compilers / versions were you comparing? I don't quite see
similar assembly...

I'm annotating the code below mostly for my understanding, I'm not fluid in armv8.

> ```
>
> pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)
>
> {
>
>      ...
>
>      HOLD_INTERRUPTS();
>
>     ioh->op = op;
>     ioh->result = 0;
>
>     pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
>     ...
>
> }
>
> ```
>
> clang15:
>
> ```
>
> postgres[0x10035d4ac] <+84>:  add    x22, x22, #0x5d4          ;
> InterruptHoldoffCount
> postgres[0x10035d4b0] <+88>:  ldr    w8, [x22]
> postgres[0x10035d4b4] <+92>:  add    w8, w8, #0x1
> postgres[0x10035d4b8] <+96>:  str    w8, [x22]

That's InterruptHoldoffCount++

> *postgres[0x10035d4bc] <+100>: strb   w20, [x19, #0x2]

That's ioh->op = op

> *postgres[0x10035d4c0] <+104>: str    wzr, [x19, #0x14]

ioh->result = 0;

> postgres[0x10035d4c4] <+108>: mov    x0, x19

Saving the caller-saved x0 register.

> postgres[0x10035d4c8] <+112>: mov    w1, #0x2
> postgres[0x10035d4cc] <+116>: bl 0x10035cbdc               ;
> pgaio_io_update_state at aio.c:384

Calling pgaio_io_update_state() with new_state = PGAIO_HS_DEFINED

> ```
>
> clang16:
>
> ```
>
> postgres[0x100699e8c] <+200>: add    x9, x9, #0xd74 ; InterruptHoldoffCount
> postgres[0x100699e90] <+204>: ldr    w8, [x9]

That's the start of InterruptHoldoffCount++

> postgres[0x100699e94] <+208>: mov    w10, #0x1 ; =1
> postgres[0x100699e98] <+212>: stur   w10, [x29, #-0x18]

Isn't this setting w10 to 1 just to immediately overwrite it with another
value?

I assume this is restoring things from the stack?

> postgres[0x100699e9c] <+216>: add    w8, w8, #0x1
> postgres[0x100699ea0] <+220>: str    w8, [x9]

That's the end of InterruptHoldoffCount++

> postgres[0x100699ea4] <+224>: ldur   w10, [x29, #-0xc]

One more overwriting of w10, for good measure?

> postgres[0x100699ea8] <+228>: ldur   x9, [x29, #-0x8]

I assume this is restoring the pointer to ioh from the stack?

> postgres[0x100699eac] <+232>: ldrh   w8, [x9]

Loading state + target as one 16bit value.

> postgres[0x100699eb0] <+236>: and    w10, w10, #0xff

Clearing the upper bits of w10 - not sure what w10 actually is, probably the
op argument?

> postgres[0x100699eb4] <+240>: orr    w10, w8, w10, lsl #16

Or-ing {ioh->state, ioh->target} and (I think) the 'op' argument shifted left
by 16 bit, presumably leaving us with {ioh->state, ioh->target, op}.

> postgres[0x100699eb8] <+244>: lsr    w8, w10, #16

So this is right shifting what I think is {ioh->state, ioh->target, op} by 16,
which presumably leaves you just with op. WTF.

> *postgres[0x100699ebc] <+248>: strh   w10, [x9]

Now we're writing 16 bits of {ioh->state, ioh->target, ioh->op} to ioh,
thereby setting ioh->state and ioh->target, but not ioh->op.

> postgres[0x100699ec0] <+252>: strb   w8, [x9, #0x2]

Setting ioh->op.

> *postgres[0x100699ec4] <+256>: ldur   x8, [x29, #-0x8]

This was loaded into x9 above, why is this redone, x9 is not overwritten? Oh,
I guess you removed the Assert related instructions, and that's why it's
reloading it...

> postgres[0x100699ec8] <+260>: str    wzr, [x8, #0x14]

ioh->return = 0.

> postgres[0x100699ed0] <+268>: mov    w1, #0x2 ; =2

This puts 2, presumably PGAIO_HS_DEFINED into w1 - presumably as an argument
for the call to pgaio_io_update_state().

> postgres[0x100699ed4] <+272>: bl     0x100698efc    ; pgaio_io_update_state
> at aio.c:384
> ```

> As you can see pg16 just updates third byte of the structure (`op` field).
> It is not affecting 'state' field at all.

It's really weird code. I think the state store would be in the call to
pgaio_io_update_state().

> While clang16 is also updating `state` field!

I assume this was intending to refer to clang15?  FWIW, I don't think it does
update the state field in the assembly you shared.

This is very odd code, but I don't immediately see a bug in it :(

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

06 июня, 22:21:13

Konstantin Knizhnik <knizhnik@garret.ru> writes:
> There is really essential difference in code generated by clang 15
> (working) and 16 (not working).

It's a mistake to think that this is a compiler bug. The C standard
explicitly allows compilers to use word-wide operations to access
bit-field struct members. Such accesses may fetch or overwrite other
bit-fields in the same word, using a non-atomic read/modify/write
sequence. I don't have a recent C standard at hand, but I found this
quote on the web [1]:

The C Standard, 3.17, paragraph 3 [ISO/IEC 9899:2024], states

Note 2 to entry: A bit-field and an adjacent non-bit-field member
are in separate memory locations. The same applies to two
bit-fields, if one is declared inside a nested structure
declaration and the other is not, or if the two are separated by a
zero-length bit-field declaration, or if they are separated by a
non-bit-field member declaration. It is not safe to concurrently
update two non-atomic bit-fields in the same structure if all
members declared between them are also (nonzero-length)
bit-fields, no matter what the sizes of those intervening
bit-fields happen to be.

So it's our code that is busted. No doubt, what is happening is
that process A is fetching two fields, modifying one of them,
and storing the word back (with the observed value of the other
field) concurrently with some other process trying to update
the other field. So the other process's update is lost.

If we want to preserve the compactness of this struct, we have
to use primitive int types not bit-fields.

regards, tom lane

[1]
https://wiki.sei.cmu.edu/confluence/display/c/CON32-C.+Prevent+data+races+when+accessing+bit-fields+from+multiple+threads

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

06 июня, 22:37:45

Hi,

On 2025-06-06 15:21:13 -0400, Tom Lane wrote:
> Konstantin Knizhnik <knizhnik@garret.ru> writes:
> > There is really essential difference in code generated by clang 15
> > (working) and 16 (not working).
>
> It's a mistake to think that this is a compiler bug.  The C standard
> explicitly allows compilers to use word-wide operations to access
> bit-field struct members.  Such accesses may fetch or overwrite other
> bit-fields in the same word, using a non-atomic read/modify/write
> sequence.  I don't have a recent C standard at hand, but I found this
> quote on the web [1]:
>
> The C Standard, 3.17, paragraph 3 [ISO/IEC 9899:2024], states
>
>     Note 2 to entry: A bit-field and an adjacent non-bit-field member
>     are in separate memory locations. The same applies to two
>     bit-fields, if one is declared inside a nested structure
>     declaration and the other is not, or if the two are separated by a
>     zero-length bit-field declaration, or if they are separated by a
>     non-bit-field member declaration. It is not safe to concurrently
>     update two non-atomic bit-fields in the same structure if all
>     members declared between them are also (nonzero-length)
>     bit-fields, no matter what the sizes of those intervening
>     bit-fields happen to be.
>
> So it's our code that is busted.  No doubt, what is happening is
> that process A is fetching two fields, modifying one of them,
> and storing the word back (with the observed value of the other
> field) concurrently with some other process trying to update
> the other field.  So the other process's update is lost.

There shouldn't be any concurrent accesses here, so I don't really see how the
above would explain the problem (the IO can only ever be modified by one
backend, initially the "owning backend", then, when submitted, by the IO
worker, and then again by the backend).


Alexander's latest post has a good example (due to the added logging)

At the start of reclaim (looks sane):
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2,
ioh->generation:29202
 

At the end of reclaim (looks sane), note ioh->op = 0
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29203

Just a bit later (broken):
!!!AsyncReadBuffers [91056] (1)| blocknum: 73, ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 1, ioh->result: 0,
ioh->num_callbacks:0, ioh->generation: 29203
 

Note that ioh->op went back to 170, despite the code that sets ioh->op not yet
having been called. However the increment of the increment of ioh->generation
survived.  Between the second "!!!pgaio_io_reclaim" and "!!!AsyncReadBuffers"
there definitely aren't any other processes accessing things.



> If we want to preserve the compactness of this struct, we have
> to use primitive int types not bit-fields.

I think we have to do that, if for no other reason than how bad the generated
code is. But I would still like to understand what actually causes the problem
here - I'm far from sure it's actually the bitfield.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Alexander Lakhin

Дата:

07 июня, 07:00:00

Hello Andres and Tom,

06.06.2025 22:37, Andres Freund wrote:

On 2025-06-06 15:21:13 -0400, Tom Lane wrote:

It's a mistake to think that this is a compiler bug.  The C standard
explicitly allows compilers to use word-wide operations to access
bit-field struct members.  Such accesses may fetch or overwrite other
bit-fields in the same word, using a non-atomic read/modify/write
sequence.  I don't have a recent C standard at hand, but I found this
quote on the web [1]:

The C Standard, 3.17, paragraph 3 [ISO/IEC 9899:2024], states
    Note 2 to entry: A bit-field and an adjacent non-bit-field member    are in separate memory locations. The same applies to two    bit-fields, if one is declared inside a nested structure    declaration and the other is not, or if the two are separated by a    zero-length bit-field declaration, or if they are separated by a    non-bit-field member declaration ...

So it's our code that is busted.  No doubt, what is happening is
that process A is fetching two fields, modifying one of them,
and storing the word back (with the observed value of the other
field) concurrently with some other process trying to update
the other field.  So the other process's update is lost.

There shouldn't be any concurrent accesses here, so I don't really see how the
above would explain the problem (the IO can only ever be modified by one
backend, initially the "owning backend", then, when submitted, by the IO
worker, and then again by the backend).

FWIW, I tried the idea Matthias proposed upthread:
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -98,16 +98,17 @@ struct PgAioHandle
        /* all state updates should go through pgaio_io_update_state() */
        PgAioHandleState state:8;

+       /* bitfield of PgAioHandleFlags */
+       uint8           flags;
+
        /* what are we operating on */
        PgAioTargetID target:8;

+       uint8           num_callbacks;
+
        /* which IO operation */
        PgAioOp         op:8;

-       /* bitfield of PgAioHandleFlags */
-       uint8           flags;
-
-       uint8           num_callbacks;

        /* using the proper type here would use more space */
        uint8           callbacks[PGAIO_HANDLE_MAX_CALLBACKS];

and got 120 iterations passed cleanly. Without this change, I got failures
on iterations 41, 5, 20, Not a proof, just an observation...

May be the op field gets broken because of a (concurrent?) write to
not that field, by to state or target in some other place?

I also tried to reproduce the same on an ARM server again with clang 18,
using -O2 and -O3, and we with Konstantin didn't see such a sophisticated
assembly for pgaio_io_stage() — the code is much simpler and the failure
is not reproduced. The triplet there is:
PostgreSQL 18beta1 on aarch64-unknown-linux-gnu, compiled by Ubuntu clang version 18.1.3 (1ubuntu1), 64-bit

On that my MacBook, the triplet is:
PostgreSQL 18beta1 on aarch64-apple-darwin23.5.0, compiled by Apple clang version 16.0.0 (clang-1600.0.26.6), 64-bit
indri:
PostgreSQL 18beta1 on aarch64-apple-darwin24.5.0, compiled by Apple clang version 17.0.0 (clang-1700.0.13.3), 64-bit
sifaka:
PostgreSQL 18devel on aarch64-apple-darwin24.4.0, compiled by Apple clang version 17.0.0 (clang-1700.0.13.3), 64-bit

So it looks like "aarch64-apple" enables some extra optimizations with
bitfields...

By the way, there are also some other places with adjacent bitfields, e, g.:
typedef union PgAioTargetData
{
    struct
    {
        RelFileLocator rlocator;    /* physical relation identifier */
        BlockNumber blockNum;   /* blknum relative to begin of reln */
        BlockNumber nblocks;
        ForkNumber forkNum:8; /* don't waste 4 byte for four values */
        bool        is_temp:1; /* proc can be inferred by owning AIO */
        bool        skip_fsync:1;
    }           smgr;

and
typedef struct PgStat_KindInfo
{
    /*
     * Do a fixed number of stats objects exist for this kind of stats (e.g.
     * bgwriter stats) or not (e.g. tables).
     */
    bool        fixed_amount:1;

    /*
     * Can stats of this kind be accessed from another database? Determines
     * whether a stats object gets included in stats snapshots.
     */
    bool        accessed_across_databases:1;

    /* Should stats be written to the on-disk stats file? */
    bool        write_to_file:1;
...

Best regards,
Alexander Lakhin

Re: Non-reproducible AIO failure

От

Nico Williams

Дата:

07 июня, 07:28:12

On Fri, Jun 06, 2025 at 03:37:45PM -0400, Andres Freund wrote:
> On 2025-06-06 15:21:13 -0400, Tom Lane wrote:
> > So it's our code that is busted.  No doubt, what is happening is
> > that process A is fetching two fields, modifying one of them,
> > and storing the word back (with the observed value of the other
> > field) concurrently with some other process trying to update
> > the other field.  So the other process's update is lost.
> 
> There shouldn't be any concurrent accesses here, so I don't really see
> how the above would explain the problem (the IO can only ever be
> modified by one backend, initially the "owning backend", then, when
> submitted, by the IO worker, and then again by the backend).

No concurrent accesses, not from threads anyways, but not even from
kernel subsystems nor from signal handlers?

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

07 июня, 07:32:43

On 06/06/2025 9:47 pm, Andres Freund wrote:

Hi,

On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote:

There is really essential difference in code generated by clang 15 (working)
and 16 (not working).

There also are code gen differences between upstream clang 17 and apple's
clang, which is based on llvm 17 as well (I've updated the toolchain, it
repros with that as well).

The code generated for the bitfield access is so attrocious that it's worth
changing to plain uint8 for that alone. It's annoying to have to add casts to
all the switches, but clearly there isn't a better way for now.

What specific compilers / versions were you comparing? I don't quite see
similar assembly...

My MacBook (bug is not reproduced):

Apple clang version 15.0.0 (clang-1500.0.40.1)
Target: arm64-apple-darwin22.6.0

Alexander's MacBook (bug is reproduced)

Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin23.5.0

As you can see pg16 just updates third byte of the structure (`op` field).
It is not affecting 'state' field at all.

It's really weird code. I think the state store would be in the call to
pgaio_io_update_state().

Yes, and that seems to be the problem,

While clang16 is also updating `state` field!

I assume this was intending to refer to clang15?  FWIW, I don't think it does
update the state field in the assembly you shared.

clang 15 (old) doesn't update `state` field. And this code works normally.
clang 16 - does.

This is very odd code, but I don't immediately see a bug in it :(

I also not an expert in ARM assembler. My analyze of the code is mostly the same as your.
Looks like the code generated by clang 16 does update all three bitfields in `pgaio_io_stage`.
As far as I understand assignment is correct: so it uses old (extracted) value of state and new (passed to the function as parameter) value of op).
But I am not 100% sure here.

It is not quite clear to me why it happen, because `pgaio_io_update_state` is not actually inlined.
And there is at least one problem that assignment of state moved through memory barrier (in `pgaio_io_update_state`).

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

07 июня, 16:27:45

On 06/06/2025 10:21 pm, Tom Lane wrote:
> Konstantin Knizhnik <knizhnik@garret.ru> writes:
>> There is really essential difference in code generated by clang 15
>> (working) and 16 (not working).
> It's a mistake to think that this is a compiler bug.  The C standard
> explicitly allows compilers to use word-wide operations to access
> bit-field struct members.

Sorry, I have not said that I have found compiler bug.
The generated code (clang 16) overwrites all three bitfields but looks 
like is doing it in correct way (old values of state and target are 
preserved).
My first thought was that compiler moves forward assignment of state 
from `pgaio_io_update_state`. But it is not true: it just reads and 
writes old value of `state` field.

It is also possible to expect that `pgaio_io_update_state` somehow 
"resurrects" old value of `op` field.
But it is also not true:


```

postgres`pgaio_io_update_state:
...

postgres[0x100699070] <+372>: dmb    ish ; memory write barrier
postgres[0x100699074] <+376>: ldur   w10, [x29, #-0xc] ; w10 = state
postgres[0x100699078] <+380>: ldur   x9, [x29, #-0x8]  ; x9 = ioh
postgres[0x10069907c] <+384>: ldrh   w8, [x9]    ; w8 = ioh->state, 
ioh->target
postgres[0x100699080] <+388>: ldrb   w11, [x9, #0x2] ; w11 = ioh->op
postgres[0x100699084] <+392>: orr    w8, w8, w11, lsl #16 ; w8 = 
ioh->state, ioh->target, ioh->op
postgres[0x100699088] <+396>: and    w10, w10, #0xff      ; w10 &= 0xff
postgres[0x10069908c] <+400>: and    w8, w8, #0xffffff00  ; w8 &= ~0xff
postgres[0x100699090] <+404>: orr    w10, w8, w10          ; w10 = 
state, ioh->target, ioh->op
postgres[0x100699094] <+408>: lsr    w8, w10, #16         ; w8 = ioh->op
postgres[0x100699098] <+412>: strh   w10, [x9]
postgres[0x10069909c] <+416>: strb   w8, [x9, #0x2]
postgres[0x1006990a0] <+420>: ldp    x29, x30, [sp, #0x60]
postgres[0x1006990a4] <+424>: add    sp, sp, #0x70
postgres[0x1006990a8] <+428>: ret
```

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

08 июня, 21:51:11

Hi,

On 2025-06-06 15:37:45 -0400, Andres Freund wrote:
> There shouldn't be any concurrent accesses here, so I don't really see how the
> above would explain the problem (the IO can only ever be modified by one
> backend, initially the "owning backend", then, when submitted, by the IO
> worker, and then again by the backend).

The symptoms I can reproduce are slightly different than Alexander's - it's
the assertion failure reported upthread by Tom.

FWIW, I can continue to repro the assertion after removing the use of the
bitfield in PgAioHandle. So the problem indeed seems to be be independent of
the bitfields.

I'm continuing to investigate.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

08 июня, 22:15:43

Andres Freund <andres@anarazel.de> writes:
> The symptoms I can reproduce are slightly different than Alexander's - it's
> the assertion failure reported upthread by Tom.
>
> FWIW, I can continue to repro the assertion after removing the use of the
> bitfield in PgAioHandle. So the problem indeed seems to be be independent of
> the bitfields.

I noticed that the assertion you and I are seeing is also about
a bitfield:

TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322

typedef struct PgAioResult
{
...
    /* of type PgAioResultStatus, see above */
    uint32        status:PGAIO_RESULT_STATUS_BITS;
...
} PgAioResult;

That doesn't make a lot of sense to me, because as far as I've
seen the bitfields in PgAioResult are not concurrently accessed.
But it feels like a relevant observation.

            regards, tom lane

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

09 июня, 02:05:29

On Sat, Jun 7, 2025 at 6:47 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote:
> > There is really essential difference in code generated by clang 15 (working)
> > and 16 (not working).
>
> There also are code gen differences between upstream clang 17 and apple's
> clang, which is based on llvm 17 as well (I've updated the toolchain, it
> repros with that as well).

Just for the record, Apple clang 17 (self-reported clobbered version)
is said to be based on LLVM 19[1].  For a long time it was off by one
but tada now it's apparently two.  Might be relevant if people are
comparing generated code up that close....

. o O (I wonder if one could corroborate that by running "strings" on
upstream clang binaries (as compiled by MacPorts/whatever) for each
major version and finding new strings, ie strings that don't appear in
earlier major versions, and then seeing which ones are present in
Apple's clang binaries...  What a silly problem.)

[1] https://en.wikipedia.org/wiki/Xcode#Xcode_15.0_-_16.x_(since_visionOS_support)

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

10 июня, 17:28:11

On 09/06/2025 2:05 am, Thomas Munro wrote:
> On Sat, Jun 7, 2025 at 6:47 AM Andres Freund <andres@anarazel.de> wrote:
>> On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote:
>>> There is really essential difference in code generated by clang 15 (working)
>>> and 16 (not working).
>> There also are code gen differences between upstream clang 17 and apple's
>> clang, which is based on llvm 17 as well (I've updated the toolchain, it
>> repros with that as well).
> Just for the record, Apple clang 17 (self-reported clobbered version)
> is said to be based on LLVM 19[1].  For a long time it was off by one
> but tada now it's apparently two.  Might be relevant if people are
> comparing generated code up that close....
>
> . o O (I wonder if one could corroborate that by running "strings" on
> upstream clang binaries (as compiled by MacPorts/whatever) for each
> major version and finding new strings, ie strings that don't appear in
> earlier major versions, and then seeing which ones are present in
> Apple's clang binaries...  What a silly problem.)
>
> [1] https://en.wikipedia.org/wiki/Xcode#Xcode_15.0_-_16.x_(since_visionOS_support)

Some updates: I was able to reproduce the problem at my Mac with old 
clang (15.0) but only with disabled optimization (CFLAGS=-O0).
So very unlikely it is bug in compiler.

Why it is better reproduced in debug build? May be because of timing.
Or may be because without optimization compiler is doing stupid things: 
loads all three bitfields from memory to register (one half word+one 
byte), then does some manipulations with this register and writes it 
back to memory. Can register somehow be clobbered between read and write 
(for example by signal handler)? Very unlikely...
So still do not have any good hypothesis.

But with bitfields replaced with uint8 the bug is not reproduced any more.
May be just do this change (which seems to be good thing in any case)?

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

10 июня, 20:41:07

Hi,

On 2025-06-10 17:28:11 +0300, Konstantin Knizhnik wrote:
> On 09/06/2025 2:05 am, Thomas Munro wrote:
> > On Sat, Jun 7, 2025 at 6:47 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote:
> > > > There is really essential difference in code generated by clang 15 (working)
> > > > and 16 (not working).
> > > There also are code gen differences between upstream clang 17 and apple's
> > > clang, which is based on llvm 17 as well (I've updated the toolchain, it
> > > repros with that as well).
> > Just for the record, Apple clang 17 (self-reported clobbered version)
> > is said to be based on LLVM 19[1].  For a long time it was off by one
> > but tada now it's apparently two.  Might be relevant if people are
> > comparing generated code up that close....

You've got to be kidding me. Because the world otherwise would be too easy, I
guess.

> > . o O (I wonder if one could corroborate that by running "strings" on
> > upstream clang binaries (as compiled by MacPorts/whatever) for each
> > major version and finding new strings, ie strings that don't appear in
> > earlier major versions, and then seeing which ones are present in
> > Apple's clang binaries...  What a silly problem.)
> > 
> > [1] https://en.wikipedia.org/wiki/Xcode#Xcode_15.0_-_16.x_(since_visionOS_support)
> 
> 
> Some updates: I was able to reproduce the problem at my Mac with old clang
> (15.0) but only with disabled optimization (CFLAGS=-O0).
> So very unlikely it is bug in compiler.

I was able to reproduce it with gcc, too.


> Why it is better reproduced in debug build? May be because of timing.

Code-gen wise the biggest change I see is that there is more stack spilling
due to assertion related code...


> Or may be because without optimization compiler is doing stupid things:
> loads all three bitfields from memory to register (one half word+one byte),
> then does some manipulations with this register and writes it back to
> memory. Can register somehow be clobbered between read and write (for
> example by signal handler)? Very unlikely...
> So still do not have any good hypothesis.
> 
> But with bitfields replaced with uint8 the bug is not reproduced any more.
> May be just do this change (which seems to be good thing in any case)?

I've reproduced it without that bitfield, unfortunately :(.


Unfortunately my current set of debugging output seems to have prevented the
issue from re-occurring. Need to pare it down to re-trigger. But for me it
only reproduces relatively rarely, so paring down the debug output is a rather
slow process :(


This is really a peculiar issue. I've now ran 10s of thousands of non-macos
iterations, without triggering this or a related issue even once. The one good
news is that currently the regression tests are remarkably stable, I think in
the past I hardly could have run that many iterations without (independent)
failures.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

10 июня, 21:09:18

On 10/06/2025 8:41 pm, Andres Freund wrote:
> I was able to reproduce it with gcc, too.
> I've reproduced it without that bitfield, unfortunately :(.

But also only at MacOS?

I wonder if it is possible to set hardware watchpoint fro program itself 
(not using gdb)? I.e. using ptrace?
Looks like it is not possible to debug yourself:
https://stackoverflow.com/questions/64321402/adding-a-watchpoint-in-the-current-process-not-in-gdb-not-for-debugging

but it is possible to fork process.
In theory it certainly should be possible - gdb is normal process, so at 
least we can implement our mini-gdb.
But not sure how complex it will be.

It will be nice to set watchpoint on for example "op" in 
`pgaio_io_reclaim` and disable it in `pgaio_io_before_start`.
It may have less overhead (both memory and CPU) than using `mprotect` to 
detect illegal access.

But I afraid that if mempory corruption happens in kernal mode then 
memory protection or watchpoints will not help.

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

10 июня, 23:09:53

Hi,

On 2025-06-10 21:09:18 +0300, Konstantin Knizhnik wrote:
> 
> On 10/06/2025 8:41 pm, Andres Freund wrote:
> > I was able to reproduce it with gcc, too.
> > I've reproduced it without that bitfield, unfortunately :(.

> But also only at MacOS?

Correct.


> I wonder if it is possible to set hardware watchpoint fro program itself
> (not using gdb)? I.e. using ptrace?
> Looks like it is not possible to debug yourself:
> https://stackoverflow.com/questions/64321402/adding-a-watchpoint-in-the-current-process-not-in-gdb-not-for-debugging
> 
> but it is possible to fork process.
> In theory it certainly should be possible - gdb is normal process, so at
> least we can implement our mini-gdb.
> But not sure how complex it will be.

If this were on intel, I would try to use intel-pt to get the execution trace
for the recent past at the time of the assertion failure, but it's not
intel...


FWIW, in all the incidents I looked at the actual read-in buffers are valid,
so it's not like the IO did not occur or such.


Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

12 июня, 08:03:22

I tried to catch moment when memory is changed using mprotect.
I have aligned PgAioHandle on page boundary (16kb at MacOS), and disable 
writes in `pgaio_io_reclaim`:
```
static void
pgaio_io_reclaim(PgAioHandle *ioh)
{
    RESUME_INTERRUPTS();
     rc = mprotect(ioh, sizeof(*ioh), PROT_READ);
     Assert(rc == 0);
fprintf(stderr, "!!!pgaio_io_reclaim [%d]| ioh: %p, ioh->op: %d, 
ioh->generation: %llu\n", getpid(), ioh, ioh->op, ioh->generation);
}

```

and reenable writes in `pgaio_io_before_start` and `pgaio_io_acquire_nb`:

```

static void
pgaio_io_before_start(PgAioHandle *ioh)
{
     int rc = mprotect(ioh, sizeof(*ioh), PROT_READ|PROT_WRITE);
     Assert(rc == 0);

```

and

```
PgAioHandle *
pgaio_io_acquire_nb(struct ResourceOwnerData *resowner, PgAioReturn *ret)
{
      ...

         ioh = dclist_container(PgAioHandle, node, ion);

         Assert(ioh->state == PGAIO_HS_IDLE);
         Assert(ioh->owner_procno == MyProcNumber);

         rc = mprotect(ioh, sizeof(*ioh), PROT_READ|PROT_WRITE);
         Assert(rc == 0);
}

```


The error is reproduced after 133 iterations:
```
!!!pgaio_io_reclaim [20376]| ioh: 0x1019bc000, ioh->op: 0, 
ioh->generation: 19346
!!!AsyncReadBuffers [20376] (1)| blocknum: 21, ioh: 0x1019bc000, 
ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, 
ioh->generation: 19346
2025-06-12 01:05:31.865 EEST [20376:918] pg_regress/psql LOG: 
!!!pgaio_io_before_start| ioh: 0x1019bc000, ioh->op: 1, ioh->state: 1, 
ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 19346
```

But no write protection violation happen.
Do not know how to interpret this fact. Changes are made by kernel? 
`pgaio_io_acquire_nb` was called between `pgaio_io_reclaim` and 
`pgaio_io_before_start`?

I am now going add trace to `pgaio_io_acquire_nb`.

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

12 июня, 15:12:00

Reproduced it once again with with write-protected io handle.
But once again - no access violation, just assert failure.

Previously "op" field was overwritten somewhere between `pgaio_io_reclaim` and `AsyncReadBuffers`:

!!!pgaio_io_reclaim [20376]| ioh: 0x1019bc000, ioh->op: 0, ioh->generation: 19346
!!!AsyncReadBuffers [20376] (1)| blocknum: 21, ioh: 0x1019bc000, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 19346

Now it is overwritten after print in AsyncReadBuffers:

!!!pgaio_io_reclaim [88932]| ioh: 0x105a5c000, ioh->op: 0, ioh->generation: 42848
!!!pgaio_io_acquire_nb[88932]| ioh: 0x105a5c000, ioh->op: 0, ioh->generation: 42848
!!!AsyncReadBuffers [88932] (1)| blocknum: 10, ioh: 0x105a5c000, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 42848
!!!pgaio_io_before_start| ioh: 0x105a5c000, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 42848

In this run I prohibit writes to io handle in `pgaio_io_acquire_nb` and reenable them in `AsyncReadBuffer`.

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

12 июня, 16:13:04

Hi,

On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote:
> Reproduced it once again with with write-protected io handle.
> But once again - no access violation, just assert failure.
> 
> Previously "op" field was overwritten somewhere between `pgaio_io_reclaim`
> and `AsyncReadBuffers`:
> 
> !!!pgaio_io_reclaim [20376]| ioh: 0x1019bc000, ioh->op: 0, ioh->generation:
> 19346
> !!!AsyncReadBuffers [20376] (1)| blocknum: 21, ioh: 0x1019bc000, ioh->op: 1,
> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 19346
> 
> Now it is overwritten after print in AsyncReadBuffers:
> 
> !!!pgaio_io_reclaim [88932]| ioh: 0x105a5c000, ioh->op: 0, ioh->generation:
> 42848
> !!!pgaio_io_acquire_nb[88932]| ioh: 0x105a5c000, ioh->op: 0,
> ioh->generation: 42848
> !!!AsyncReadBuffers [88932] (1)| blocknum: 10, ioh: 0x105a5c000, ioh->op: 0,
> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 42848
> !!!pgaio_io_before_start| ioh: 0x105a5c000, ioh->op: 1, ioh->state: 1,
> ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 42848
> 
> In this run I prohibit writes to io handle in `pgaio_io_acquire_nb` and
> reenable them in `AsyncReadBuffer`.

I'm reasonably certain I found the issue, I think it's a missing memory
barrier on the read side. The CPU is reordering the read (or just using a
cached value) of ->distilled_result to be before the load of ->state.

But it'll take a bit to verify that that's the issue...

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

12 июня, 16:30:54

On 12/06/2025 4:13 pm, Andres Freund wrote:
> Hi,
>
> On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote:
>> Reproduced it once again with with write-protected io handle.
>> But once again - no access violation, just assert failure.
>>
>> Previously "op" field was overwritten somewhere between `pgaio_io_reclaim`
>> and `AsyncReadBuffers`:
>>
>> !!!pgaio_io_reclaim [20376]| ioh: 0x1019bc000, ioh->op: 0, ioh->generation:
>> 19346
>> !!!AsyncReadBuffers [20376] (1)| blocknum: 21, ioh: 0x1019bc000, ioh->op: 1,
>> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 19346
>>
>> Now it is overwritten after print in AsyncReadBuffers:
>>
>> !!!pgaio_io_reclaim [88932]| ioh: 0x105a5c000, ioh->op: 0, ioh->generation:
>> 42848
>> !!!pgaio_io_acquire_nb[88932]| ioh: 0x105a5c000, ioh->op: 0,
>> ioh->generation: 42848
>> !!!AsyncReadBuffers [88932] (1)| blocknum: 10, ioh: 0x105a5c000, ioh->op: 0,
>> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 42848
>> !!!pgaio_io_before_start| ioh: 0x105a5c000, ioh->op: 1, ioh->state: 1,
>> ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 42848
>>
>> In this run I prohibit writes to io handle in `pgaio_io_acquire_nb` and
>> reenable them in `AsyncReadBuffer`.
> I'm reasonably certain I found the issue, I think it's a missing memory
> barrier on the read side. The CPU is reordering the read (or just using a
> cached value) of ->distilled_result to be before the load of ->state.
>
> But it'll take a bit to verify that that's the issue...

It is great!
But I wonder how it correlates with your previous statement:

> There shouldn't be any concurrent accesses here, so I don't really see how the
> above would explain the problem (the IO can only ever be modified by one
> backend, initially the "owning backend", then, when submitted, by the IO
> worker, and then again by the backend).


This is what I am observing myself: "op"  field is modified and fetched 
by the same process.
Certainly process can be rescheduled to some other CPU. But if such 
reschedule can cause loose of stored value, then nothing will work, will it?
So assume that there is some variable "x" which is updated by process 
"x=1" executed at CPU1, then process is rescheduled to CPU2 which does 
"x=2", then process is once again rescheduled to CPU1 and we found out 
that  "x==1". And to prevent it we need to explicitly enforce some 
memory barrier. Unless I missing something , nothing can work with such 
memory  model.

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

12 июня, 16:57:38

Hi,

On 2025-06-12 16:30:54 +0300, Konstantin Knizhnik wrote:
> On 12/06/2025 4:13 pm, Andres Freund wrote:
> > On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote:
> > I'm reasonably certain I found the issue, I think it's a missing memory
> > barrier on the read side. The CPU is reordering the read (or just using a
> > cached value) of ->distilled_result to be before the load of ->state.
> > 
> > But it'll take a bit to verify that that's the issue...
> 
> It is great!
> But I wonder how it correlates with your previous statement:
> 
> > There shouldn't be any concurrent accesses here, so I don't really see how the
> > above would explain the problem (the IO can only ever be modified by one
> > backend, initially the "owning backend", then, when submitted, by the IO
> > worker, and then again by the backend).

The problem appears to be in that switch between "when submitted, by the IO
worker" and "then again by the backend".  It's not concurrent access in the
sense of two processes writing to the same value, it's that when switching
from the worker updating ->distilled_result to the issuer looking at that, the
issuer didn't ensure that no outdated version of ->distilled_result could be
used.

Basically, the problem is that the worker would

1) set ->distilled_result
2) perform a write memory barrier
3) set ->state to COMPLETED_SHARED

and then the issuer of the IO would:

4) check ->state is COMPLETED_SHARED
5) use ->distilled_result

The problem is that there currently is no barrier between 4 & 5, which means
an outdated ->distilled_result could be used.

This also explains why the issue looked so weird - eventually, after fprintfs,
after a core dump, etc, the updated ->distilled_result result would "arrive"
in the issuing process, and suddenly look correct.

> This is what I am observing myself: "op"  field is modified and fetched by
> the same process.

Right - but I don't think the ->op field being wrong was actually part of the
issue though.

> Certainly process can be rescheduled to some other CPU. But if such
> reschedule can cause loose of stored value, then nothing will work, will it?

Yes, that'd be completely broken - but isn't the issue here.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

12 июня, 17:22:22

On 12/06/2025 4:57 pm, Andres Freund wrote:
> The problem appears to be in that switch between "when submitted, by the IO
> worker" and "then again by the backend".  It's not concurrent access in the
> sense of two processes writing to the same value, it's that when switching
> from the worker updating ->distilled_result to the issuer looking at that, the
> issuer didn't ensure that no outdated version of ->distilled_result could be
> used.
>
> Basically, the problem is that the worker would
>
> 1) set ->distilled_result
> 2) perform a write memory barrier
> 3) set ->state to COMPLETED_SHARED
>
> and then the issuer of the IO would:
>
> 4) check ->state is COMPLETED_SHARED
> 5) use ->distilled_result
>
> The problem is that there currently is no barrier between 4 & 5, which means
> an outdated ->distilled_result could be used.
>
>
> This also explains why the issue looked so weird - eventually, after fprintfs,
> after a core dump, etc, the updated ->distilled_result result would "arrive"
> in the issuing process, and suddenly look correct.

Thank you very much for explanation.
Everything seems to be so simple after explanations, that you can not 
even believe that before you think that such behavior can be only caused 
by "black magic" or "OS bug":)

Certainly using outdated result can explain such behavior.
But in which particular place we loose read barrier between 4 and 5?
I see `pgaio_io_wait` which as I expect should be called by backend to 
wait completion of IO.
And it calls `pgaio_io_was_recycled` to get state and it in turn enforce 
read barrier:
```

bool
pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, 
PgAioHandleState *state)
{
     *state = ioh->state;
     pg_read_barrier();

     return ioh->generation != ref_generation;
}
```

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

12 июня, 18:52:31

Hi,

On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> On 12/06/2025 4:57 pm, Andres Freund wrote:
> > The problem appears to be in that switch between "when submitted, by the IO
> > worker" and "then again by the backend".  It's not concurrent access in the
> > sense of two processes writing to the same value, it's that when switching
> > from the worker updating ->distilled_result to the issuer looking at that, the
> > issuer didn't ensure that no outdated version of ->distilled_result could be
> > used.
> > 
> > Basically, the problem is that the worker would
> > 
> > 1) set ->distilled_result
> > 2) perform a write memory barrier
> > 3) set ->state to COMPLETED_SHARED
> > 
> > and then the issuer of the IO would:
> > 
> > 4) check ->state is COMPLETED_SHARED
> > 5) use ->distilled_result
> > 
> > The problem is that there currently is no barrier between 4 & 5, which means
> > an outdated ->distilled_result could be used.
> > 
> > 
> > This also explains why the issue looked so weird - eventually, after fprintfs,
> > after a core dump, etc, the updated ->distilled_result result would "arrive"
> > in the issuing process, and suddenly look correct.
> 
> Thank you very much for explanation.
> Everything seems to be so simple after explanations, that you can not even
> believe that before you think that such behavior can be only caused by
> "black magic" or "OS bug":)

Indeed. For a while I was really starting to doubt my sanity - not helped by
the fact that I'd loose most of the mental context between reproductions (the
problem did not re-occur from Saturday to Wednesday morning...).  What finally
made it clearer was moving the failing assertion to earlier
(pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
that took out a lot of the time in which the problem could occur.

> Certainly using outdated result can explain such behavior.
> But in which particular place we loose read barrier between 4 and 5?
> I see `pgaio_io_wait` which as I expect should be called by backend to wait
> completion of IO.
> And it calls `pgaio_io_was_recycled` to get state and it in turn enforce
> read barrier:
> ```
> 
> bool
> pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
> PgAioHandleState *state)
> {
>     *state = ioh->state;
>     pg_read_barrier();
> 
>     return ioh->generation != ref_generation;
> }
> ```

Yes, I don't think that path has the issue.

As far as I can tell the problem only ever happens if we end up reclaiming an
IO while waiting for a free IO handle.

The known problematic path is
pgaio_io_acquire()
-> pgaio_io_wait_for_free()
-> pgaio_io_reclaim()

In that path we don't use pgaio_io_was_recycled().  I couldn't find any other
path with the same issue [1].

The issue only happening while waiting for free IO handles is presumably the
reason why it happend in 027_stream_regress.pl, due to the small
shared_buffers used for the test io_max_concurrency is 1, which means we
constantly need to wait for a free IO handles.

Greetings,

Andres Freund

[1] I think theoretically we are missing a memory barrier in the worker, when
starting to process an IO, but the lwlock for getting the IO to process
provides that. But we should probably add a comment about that.

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

12 июня, 19:23:13

Hi,

On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
> On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> > On 12/06/2025 4:57 pm, Andres Freund wrote:
> > > The problem appears to be in that switch between "when submitted, by the IO
> > > worker" and "then again by the backend".  It's not concurrent access in the
> > > sense of two processes writing to the same value, it's that when switching
> > > from the worker updating ->distilled_result to the issuer looking at that, the
> > > issuer didn't ensure that no outdated version of ->distilled_result could be
> > > used.
> > > 
> > > Basically, the problem is that the worker would
> > > 
> > > 1) set ->distilled_result
> > > 2) perform a write memory barrier
> > > 3) set ->state to COMPLETED_SHARED
> > > 
> > > and then the issuer of the IO would:
> > > 
> > > 4) check ->state is COMPLETED_SHARED
> > > 5) use ->distilled_result
> > > 
> > > The problem is that there currently is no barrier between 4 & 5, which means
> > > an outdated ->distilled_result could be used.
> > > 
> > > 
> > > This also explains why the issue looked so weird - eventually, after fprintfs,
> > > after a core dump, etc, the updated ->distilled_result result would "arrive"
> > > in the issuing process, and suddenly look correct.
> > 
> > Thank you very much for explanation.
> > Everything seems to be so simple after explanations, that you can not even
> > believe that before you think that such behavior can be only caused by
> > "black magic" or "OS bug":)
> 
> Indeed. For a while I was really starting to doubt my sanity - not helped by
> the fact that I'd loose most of the mental context between reproductions (the
> problem did not re-occur from Saturday to Wednesday morning...).  What finally
> made it clearer was moving the failing assertion to earlier
> (pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
> that took out a lot of the time in which the problem could occur.

> 
> > Certainly using outdated result can explain such behavior.
> > But in which particular place we loose read barrier between 4 and 5?
> > I see `pgaio_io_wait` which as I expect should be called by backend to wait
> > completion of IO.
> > And it calls `pgaio_io_was_recycled` to get state and it in turn enforce
> > read barrier:
> > ```
> > 
> > bool
> > pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
> > PgAioHandleState *state)
> > {
> >     *state = ioh->state;
> >     pg_read_barrier();
> > 
> >     return ioh->generation != ref_generation;
> > }
> > ```
> 
> Yes, I don't think that path has the issue.
> 
> As far as I can tell the problem only ever happens if we end up reclaiming an
> IO while waiting for a free IO handle.
> 
> The known problematic path is
> pgaio_io_acquire()
> -> pgaio_io_wait_for_free()
> -> pgaio_io_reclaim()
> 
> In that path we don't use pgaio_io_was_recycled().  I couldn't find any other
> path with the same issue [1].

> The issue only happening while waiting for free IO handles is presumably the
> reason why it happend in 027_stream_regress.pl, due to the small
> shared_buffers used for the test io_max_concurrency is 1, which means we
> constantly need to wait for a free IO handles.

Armed with that knowledge it turns out to not be too hard to create a repro
that fails a lot quicker than multiple days.

Starting postgres with

  -c io_max_concurrency=1 -c io_combine_limit=1 -c synchronize_seqscans=false -c restart_after_crash=false -c
max_parallel_workers_per_gather=0

and creating a test table with:
  create table large as select generate_series(1, 100000) a, b from generate_series(1, 100) b;

makes the following pgbench fail within seconds for me:
  c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 1000 -P 10

(interestingly at c=8 it takes closer to a minute on average)

Adding pg_read_barrier() before the two pgaio_io_reclaim() calls in
pgaio_io_wait_for_free() has so far prevented a crash.  I'll leave it running
for a while longer, to make sure.

We really lack any meaningful stress testing of IO paths, or I suspect we
would have found this much earlier.  I wonder if we could add a smaller
version of the above, with a small fixed runtime (maybe 3-5s), as a tap test.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

14 июня, 21:44:59

On 13/06/2025 11:20 pm, Andres Freund wrote:
> Hi,
>
> On 2025-06-12 12:23:13 -0400, Andres Freund wrote:
>> On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
>>> On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
>>>> On 12/06/2025 4:57 pm, Andres Freund wrote:
>>>>> The problem appears to be in that switch between "when submitted, by the IO
>>>>> worker" and "then again by the backend".  It's not concurrent access in the
>>>>> sense of two processes writing to the same value, it's that when switching
>>>>> from the worker updating ->distilled_result to the issuer looking at that, the
>>>>> issuer didn't ensure that no outdated version of ->distilled_result could be
>>>>> used.
>>>>>
>>>>> Basically, the problem is that the worker would
>>>>>
>>>>> 1) set ->distilled_result
>>>>> 2) perform a write memory barrier
>>>>> 3) set ->state to COMPLETED_SHARED
>>>>>
>>>>> and then the issuer of the IO would:
>>>>>
>>>>> 4) check ->state is COMPLETED_SHARED
>>>>> 5) use ->distilled_result
>>>>>
>>>>> The problem is that there currently is no barrier between 4 & 5, which means
>>>>> an outdated ->distilled_result could be used.
>>>>>
>>>>>
>>>>> This also explains why the issue looked so weird - eventually, after fprintfs,
>>>>> after a core dump, etc, the updated ->distilled_result result would "arrive"
>>>>> in the issuing process, and suddenly look correct.
>>>> Thank you very much for explanation.
>>>> Everything seems to be so simple after explanations, that you can not even
>>>> believe that before you think that such behavior can be only caused by
>>>> "black magic" or "OS bug":)
>>> Indeed. For a while I was really starting to doubt my sanity - not helped by
>>> the fact that I'd loose most of the mental context between reproductions (the
>>> problem did not re-occur from Saturday to Wednesday morning...).  What finally
>>> made it clearer was moving the failing assertion to earlier
>>> (pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
>>> that took out a lot of the time in which the problem could occur.
>>>> Certainly using outdated result can explain such behavior.
>>>> But in which particular place we loose read barrier between 4 and 5?
>>>> I see `pgaio_io_wait` which as I expect should be called by backend to wait
>>>> completion of IO.
>>>> And it calls `pgaio_io_was_recycled` to get state and it in turn enforce
>>>> read barrier:
>>>> ```
>>>>
>>>> bool
>>>> pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
>>>> PgAioHandleState *state)
>>>> {
>>>>      *state = ioh->state;
>>>>      pg_read_barrier();
>>>>
>>>>      return ioh->generation != ref_generation;
>>>> }
>>>> ```
>>> Yes, I don't think that path has the issue.
>>>
>>> As far as I can tell the problem only ever happens if we end up reclaiming an
>>> IO while waiting for a free IO handle.
>>>
>>> The known problematic path is
>>> pgaio_io_acquire()
>>> -> pgaio_io_wait_for_free()
>>> -> pgaio_io_reclaim()
>>>
>>> In that path we don't use pgaio_io_was_recycled().  I couldn't find any other
>>> path with the same issue [1].
>>> The issue only happening while waiting for free IO handles is presumably the
>>> reason why it happend in 027_stream_regress.pl, due to the small
>>> shared_buffers used for the test io_max_concurrency is 1, which means we
>>> constantly need to wait for a free IO handles.
>> Armed with that knowledge it turns out to not be too hard to create a repro
>> that fails a lot quicker than multiple days.
>>
>> Starting postgres with
>>
>>    -c io_max_concurrency=1 -c io_combine_limit=1 -c synchronize_seqscans=false -c restart_after_crash=false -c
max_parallel_workers_per_gather=0
>>
>> and creating a test table with:
>>    create table large as select generate_series(1, 100000) a, b from generate_series(1, 100) b;
>>
>> makes the following pgbench fail within seconds for me:
>>    c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 1000 -P 10
>>
>> (interestingly at c=8 it takes closer to a minute on average)
>>
>> Adding pg_read_barrier() before the two pgaio_io_reclaim() calls in
>> pgaio_io_wait_for_free() has so far prevented a crash.  I'll leave it running
>> for a while longer, to make sure.
>>
>> We really lack any meaningful stress testing of IO paths, or I suspect we
>> would have found this much earlier.  I wonder if we could add a smaller
>> version of the above, with a small fixed runtime (maybe 3-5s), as a
>> tap test.
> Attached is a patch that fixes the problem for me. Alexander, Konstantin,
> could you verify that it also fixes the problem for you?


I have inspected your patch and from my point of view it adds all 
necessary read barriers and so eliminates the race.
I also ran proposed test during more than 4 hours and the problem is not 
reproduced at my system (without patch it is reproduced after 100 seconds).

It's a little bit strange that pgbench is "stuck" - I have specified 
timeout 10000 seconds (-T 10000) but it didn't stop and continue working 
until 15000 seconds when I have terminated it.
There were no stuck queries: all 16 backends are running queries which 
execution takes about 2 seconds:

progress: 14960.0 s, 7.4 tps, lat 2189.006 ms stddev 133.277, 0 failed
progress: 14970.0 s, 7.3 tps, lat 2113.871 ms stddev 164.008, 0 failed
progress: 14980.0 s, 7.3 tps, lat 2038.736 ms stddev 152.869, 0 failed

Looks like there is yet another race condition now in pgbench:)




>
> Given that it does address the problem for me, I'm inclined to push this
> fairly soon, the barrier is pretty obviously required.
>
> Greetings,
>
> Andres Freund

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

15 июня, 14:48:43

On 13/06/2025 11:20 pm, Andres Freund wrote:

Attached is a patch that fixes the problem for me. Alexander, Konstantin,
could you verify that it also fixes the problem for you?

Given that it does address the problem for me, I'm inclined to push this
fairly soon, the barrier is pretty obviously required.

Unfortunately I still able to reproduce assertion failure after 17450 seconds (`--enable-debug --enable-cassert CFLAGS=-O0`):

* frame #0: 0x0000000187248704 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x000000018727fc28 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x000000018718dae8 libsystem_c.dylib`abort + 180
    frame #3: 0x00000001011a0fbc postgres`ExceptionalCondition(conditionName="ioh->op == PGAIO_OP_INVALID", fileName="aio_io.c", lineNumber=161) at assert.c:66:2
    frame #4: 0x0000000100ea9b40 postgres`pgaio_io_before_start(ioh=0x000000010a6e22f0) at aio_io.c:161:2
    frame #5: 0x0000000100ea9a00 postgres`pgaio_io_start_readv(ioh=0x000000010a6e22f0, fd=12, iovcnt=1, offset=85565440) at aio_io.c:81:2
    frame #6: 0x0000000100ec8798 postgres`FileStartReadV(ioh=0x000000010a6e22f0, file=4, iovcnt=1, offset=85565440, wait_event_info=167772181) at fd.c:2241:2
    frame #7: 0x0000000100f1b7d0 postgres`mdstartreadv(ioh=0x000000010a6e22f0, reln=0x000000010b0289c8, forknum=MAIN_FORKNUM, blocknum=10445, buffers=0x000000016f5ed998, nblocks=1) at md.c:1019:8
    frame #8: 0x0000000100f1ed88 postgres`smgrstartreadv(ioh=0x000000010a6e22f0, reln=0x000000010b0289c8, forknum=MAIN_FORKNUM, blocknum=10445, buffers=0x000000016f5ed998, nblocks=1) at smgr.c:758:2
    frame #9: 0x0000000100eb1a48 postgres`AsyncReadBuffers(operation=0x000000011d80d108, nblocks_progress=0x000000016f5edeb4) at bufmgr.c:1959:3
    frame #10: 0x0000000100eb0c24 postgres`StartReadBuffersImpl(operation=0x000000011d80d108, buffers=0x000000011d80cdf4, blockNum=10445, nblocks=0x000000016f5edeb4, flags=0, allow_forwarding=true) at bufmgr.c:1428:18

I wonder if we also need second write barrier in pgaio_io_update_state:

    /*
     * Ensure the changes signified by the new state are visible before the
     * new state becomes visible.
     */
    pg_write_barrier();

    ioh->state = new_state;

    /*
     * Make sure that status is written
     */
    pg_write_barrier();

Otherwise is is not clear when status is written.
Also I think that replacing bitfields with `uint8` and may be even with `int`, is good idea at least to avoids false sharing.

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

15 июня, 20:54:54

With this two additional changes:

diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index 6c6c0a908e2..6dd2816bea9 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -538,6 +538,9 @@ pgaio_io_process_completion(PgAioHandle *ioh, int 
result)

         pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_SHARED);

+       /* ensure the state update is visible before we broadcast 
condition variable */
+       pg_write_barrier();
+
         /* condition variable broadcast ensures state is visible before 
wakeup */
         ConditionVariableBroadcast(&ioh->cv);

bool only_running);
diff --git a/src/include/storage/aio_internal.h 
b/src/include/storage/aio_internal.h
index 2d37a243abe..0a2bb109696 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -96,13 +96,13 @@ struct ResourceOwnerData;
  struct PgAioHandle
  {
         /* all state updates should go through pgaio_io_update_state() */
-       PgAioHandleState state:8;
+       uint8           state;

         /* what are we operating on */
-       PgAioTargetID target:8;
+       uint8           target;

         /* which IO operation */
-       PgAioOp         op:8;
+       uint8           op;

         /* bitfield of PgAioHandleFlags */
         uint8           flags;

the problem is not reproduced at my system within 20000 seconds. I will 
leave it to run during the night.

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

16 июня, 14:11:39

One more update: with the proposed patch (memory barrier before 
`ConditionVariableBroadcast` in `pgaio_io_process_completion` and 
replacing bit fields with `uint8`) the problem is not reproduced at my 
system during 50000 seconds.

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

16 июня, 18:11:29

Hi,

On 2025-06-16 14:11:39 +0300, Konstantin Knizhnik wrote:
> One more update: with the proposed patch (memory barrier before
> `ConditionVariableBroadcast` in `pgaio_io_process_completion`

I don't see how that barrier could be required for correctness -
ConditionVariableBroadcast() is a barrier itself (as the comment above the
call notes, too).

On 2025-06-15 14:48:43 +0300, Konstantin Knizhnik wrote:
> Also I think that replacing bitfields with `uint8` and may be even with
> `int`, is good idea at least to avoids false sharing.

I don't think there's false sharing here. And even if there were, the
granularity at which false sharing occurs is a cache line size, so either 64
or 128 byte.

I unfortunately can't repro this issue so far.  I don't think it's the same
problem as my patch fixes, so I'll push my patch.  How exactly did you
reproduce the probelm?

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

16 июня, 18:48:27

On 16/06/2025 6:11 pm, Andres Freund wrote:

Hi,

On 2025-06-16 14:11:39 +0300, Konstantin Knizhnik wrote:

One more update: with the proposed patch (memory barrier before
`ConditionVariableBroadcast` in `pgaio_io_process_completion`

I don't see how that barrier could be required for correctness -
ConditionVariableBroadcast() is a barrier itself (as the comment above the
call notes, too).


On 2025-06-15 14:48:43 +0300, Konstantin Knizhnik wrote:

Also I think that replacing bitfields with `uint8` and may be even with
`int`, is good idea at least to avoids false sharing.

I don't think there's false sharing here. And even if there were, the
granularity at which false sharing occurs is a cache line size, so either 64
or 128 byte.


I unfortunately can't repro this issue so far.  I don't think it's the same
problem as my patch fixes, so I'll push my patch.  How exactly did you
reproduce the probelm?

Greetings,

Andres Freund

Sorry, I was not sure that spinlock (used in `ConditionVariableBroadcast`) enforces memory barrier. Certainly in this case adding memory barrir here is not needed.

Concerning false sharing - I suspected that compiler can extract bitfield from the word loaded before read barrier. But it seems to be not possible (if barrier is correctl recognized by compiler) and more over - I have reproduced it with didsabled optimization, so unlikely compiler tries to eliminate some reads here. And I agree with your argument about cache line: even replacing uint8 with int will not prevent it.

But unfortunately it means that the problem is not fixed. I just did the same test as you proposed:

c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 10000 -P 10


with the following config changes:
io_max_concurrency=1
io_combine_limit=1
synchronize_seqscans=false
restart_after_crash=false
max_parallel_workers_per_gather=0
fsync=off

Postgres was build with the following options: --without-icu --enable-debug --enable-cassert CFLAGS=-O0

As I wrote - it takes about 10000 seconds to get this assertion failure.
I can  try to do it once again.
Looks like the problem is better reproduced with disabled optimizations.

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

17 июня, 03:22:00

Konstantin Knizhnik <knizhnik@garret.ru> writes:
> On 16/06/2025 6:11 pm, Andres Freund wrote:
>> I unfortunately can't repro this issue so far.

> But unfortunately it means that the problem is not fixed.

FWIW, I get similar results to Andres' on a Mac Mini M4 Pro
using MacPorts' current compiler release (clang version 19.1.7).
The currently-proposed test case fails within a few minutes on
e9a3615a5^ but doesn't fail in a couple of hours on e9a3615a5.

However, I cannot repro that on a slightly older Mini M1 using Apple's
current release (clang-1700.0.13.5, which per wikipedia is really LLVM
19.1.4).  It seems to work fine even without e9a3615a5.  So the whole
thing is still depressingly phase-of-the-moon-dependent.

I don't doubt that Konstantin has found a different issue, but
it's hard to be sure about the fix unless we can get it to be
more reproducible.  Neither of my machines has ever shown the
symptom he's getting.

            regards, tom lane

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

17 июня, 13:30:22

On 17/06/2025 3:22 am, Tom Lane wrote:

Konstantin Knizhnik <knizhnik@garret.ru> writes:

On 16/06/2025 6:11 pm, Andres Freund wrote:

I unfortunately can't repro this issue so far.

But unfortunately it means that the problem is not fixed.

FWIW, I get similar results to Andres' on a Mac Mini M4 Pro
using MacPorts' current compiler release (clang version 19.1.7).
The currently-proposed test case fails within a few minutes on
e9a3615a5^ but doesn't fail in a couple of hours on e9a3615a5.

However, I cannot repro that on a slightly older Mini M1 using Apple's
current release (clang-1700.0.13.5, which per wikipedia is really LLVM
19.1.4).  It seems to work fine even without e9a3615a5.  So the whole
thing is still depressingly phase-of-the-moon-dependent.

I don't doubt that Konstantin has found a different issue, but
it's hard to be sure about the fix unless we can get it to be
more reproducible.  Neither of my machines has ever shown the
symptom he's getting.
			regards, tom lane

Unfortunately I still able to reproduce assertion failure (with Andreas patch but without my "fixes" - uint8 instead of bitfields).

Apple M2 Pro, Ventura 13.7.6, clang 15.0

Postgres build option: --without-icu --enable-debug --enable-cassert CFLAGS=-O0 Postgres config:

io_max_concurrency=1
io_combine_limit=1
synchronize_seqscans=false
restart_after_crash=false
max_parallel_workers_per_gather=0
fsync=off

Scenario proposed by Andreas:
c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 30000 -P 10


Now it takes me much more time to reproduce the error. Night 30000 seconds iterations passed normally.
I thought that it can be somehow caused or enforces by hibernation.
When I have reproduced it last time, I do not change default setting for display "Prevent automatic sleeping on power adapter when the display is off" (disabled).
It was cause strange timing effect I have reported: when system is hibernated, alarm is also "hibernated" and so total test time is much more than specified with -T option.
Now I tried explicitly force hibernation by periodically closing laptop cover - it doesn't help. But I disable one again this option and let it run. Several times it normally continue execution after wakeup. But finally I got assertion failure:

progress: 18290.0 s, 7.3 tps, lat 2229.243 ms stddev 147.211, 0 failed
progress: 18300.0 s, 7.3 tps, lat 2164.216 ms stddev 176.251, 0 failed
progress: 18310.0 s, 7.5 tps, lat 2105.803 ms stddev 202.003, 0 failed
progress: 18320.0 s, 7.7 tps, lat 2162.209 ms stddev 209.344, 0 failed
progress: 18330.0 s, 7.0 tps, lat 2157.891 ms stddev 181.369, 0 failed
progress: 18340.0 s, 7.6 tps, lat 2120.269 ms stddev 169.287, 0 failed
progress: 18350.0 s, 7.3 tps, lat 2178.657 ms stddev 159.984, 0 failed
pgbench: error: client 4 aborted in command 0 (SQL) of script 0; perhaps the backend died while processing
WARNING:  terminating connection because of crash of another server process

Not sure if it is related with hibernation or not (looks like it happens at the moment when computer is going to hibernate - not sure how to check it).
Backtrace is usual:
  * frame #0: 0x0000000187248704 libsystem_kernel.dylib`__pthread_kill + 8    frame #1: 0x000000018727fc28 libsystem_pthread.dylib`pthread_kill + 288    frame #2: 0x000000018718dae8 libsystem_c.dylib`abort + 180    frame #3: 0x00000001034b2080 postgres`ExceptionalCondition(conditionName="ioh->op == PGAIO_OP_INVALID", fileName="aio_io.c", lineNumber=161) at assert.c:66:2    frame #4: 0x00000001031bac04 postgres`pgaio_io_before_start(ioh=0x000000010c9f6e40) at aio_io.c:161:2    frame #5: 0x00000001031baac4 postgres`pgaio_io_start_readv(ioh=0x000000010c9f6e40, fd=10, iovcnt=1, offset=180944896) at aio_io.c:81:2    frame #6: 0x00000001031d985c postgres`FileStartReadV(ioh=0x000000010c9f6e40, file=3, iovcnt=1, offset=180944896, wait_event_info=167772181) at fd.c:2241:2    frame #7: 0x000000010322c894 postgres`mdstartreadv(ioh=0x000000010c9f6e40, reln=0x000000011f0289c8, forknum=MAIN_FORKNUM, blocknum=22088, buffers=0x000000016d2dd998, nblocks=1) at md.c:1019:8    frame #8: 0x000000010322fe4c postgres`smgrstartreadv(ioh=0x000000010c9f6e40, reln=0x000000011f0289c8, forknum=MAIN_FORKNUM, blocknum=22088, buffers=0x000000016d2dd998, nblocks=1) at smgr.c:758:2    frame #9: 0x00000001031c2b0c postgres`AsyncReadBuffers(operation=0x000000010f81b910, nblocks_progress=0x000000016d2ddeb4) at bufmgr.c:1959:3    frame #10: 0x00000001031c1ce8 postgres`StartReadBuffersImpl(operation=0x000000010f81b910, buffers=0x000000010f81b3e0, blockNum=22088, nblocks=0x000000016d2ddeb4, flags=0, allow_forwarding=true) at bufmgr.c:1428:18    frame #11: 0x00000001031c182c postgres`StartReadBuffers(operation=0x000000010f81b910, buffers=0x000000010f81b3e0, blockNum=22088, nblocks=0x000000016d2ddeb4, flags=0) at bufmgr.c:1500:9    frame #12: 0x00000001031bee44 postgres`read_stream_start_pending_read(stream=0x000000010f81b378) at read_stream.c:335:14    frame #13: 0x00000001031be528 postgres`read_stream_look_ahead(stream=0x000000010f81b378) at read_stream.c:493:3    frame #14: 0x00000001031be0c0 postgres`read_stream_next_buffer(stream=0x000000010f81b378, per_buffer_data=0x0000000000000000) at read_stream.c:971:2    frame #15: 0x0000000102bb3d34 postgres`heap_fetch_next_buffer(scan=0x000000010f81ae58, dir=ForwardScanDirection) at heapam.c:675:18    frame #16: 0x0000000102ba4a88 postgres`heapgettup_pagemode(scan=0x000000010f81ae58, dir=ForwardScanDirection, nkeys=0, key=0x0000000000000000) at heapam.c:1037:3    frame #17: 0x0000000102ba5058 postgres`heap_getnextslot(sscan=0x000000010f81ae58, direction=ForwardScanDirection, slot=0x000000010f811330) at heapam.c:1391:3    frame #18: 0x0000000102f26448 postgres`table_scan_getnextslot(sscan=0x000000010f81ae58, direction=ForwardScanDirection, slot=0x000000010f811330) at tableam.h:1031:9    frame #19: 0x0000000102f26254 postgres`SeqNext(node=0x000000010f811110) at nodeSeqscan.c:81:6    frame #20: 0x0000000102f2691c postgres`ExecScanFetch(node=0x000000010f811110, epqstate=0x0000000000000000, accessMtd=(postgres`SeqNext at nodeSeqscan.c:52), recheckMtd=(postgres`SeqRecheck at nodeSeqscan.c:91)) at execScan.h:126:9

If you are interested - I have core file.
May be it is separate problem - not related with race condition caused by lack of memory barrier. But frankly speaking - unlikely.
I think hibernation just affects processes schedule which cause this effect.

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

17 июня, 16:35:55

Hi,

On 2025-06-16 20:22:00 -0400, Tom Lane wrote:
> Konstantin Knizhnik <knizhnik@garret.ru> writes:
> > On 16/06/2025 6:11 pm, Andres Freund wrote:
> >> I unfortunately can't repro this issue so far.
>
> > But unfortunately it means that the problem is not fixed.
>
> FWIW, I get similar results to Andres' on a Mac Mini M4 Pro
> using MacPorts' current compiler release (clang version 19.1.7).
> The currently-proposed test case fails within a few minutes on
> e9a3615a5^ but doesn't fail in a couple of hours on e9a3615a5.

I'm surprised it takes that long, given it takes seconds to reproduce here
with the config parameters I outlined. Did you try crank up the concurrency a
bit? Yours has more cores than mine, and I found that that makes a huge
difference.

> However, I cannot repro that on a slightly older Mini M1 using Apple's
> current release (clang-1700.0.13.5, which per wikipedia is really LLVM
> 19.1.4).  It seems to work fine even without e9a3615a5.  So the whole
> thing is still depressingly phase-of-the-moon-dependent.

It's not entirely surprising that an M1 would have a harder time reproducing
the issue, more cores, larger caches and a larger out-of-order execution
window will make it more likely that the missing memory barriers have a
visible effect.

I'm reasonably sure that e9a3615a5 quashed that specific issue - I could repro
it within seconds with e9a3615a5^ and with e9a3615a5 I ran it for several days
without a single failure...

> I don't doubt that Konstantin has found a different issue, but
> it's hard to be sure about the fix unless we can get it to be
> more reproducible.  Neither of my machines has ever shown the
> symptom he's getting.

I've not been able to reproduce that symptom a single time either so far.

The assertion continues to be inexplicable to me. It shows, within a single
process, memory in shared memory going "backwards". But not always, just very
occasionally. Because this is before the IO is defined, there's no concurrent
access whatsoever.

I stole^Wgot my partner's m1 macbook for a bit, trying to reproduce the issue
there. It has
"Apple clang version 16.0.0 (clang-1600.0.26.6)"
on
"Darwin Kernel Version 24.3.0"

That's the same Apple-clang version that Alexander reported being able to
reproduce the issue on [1], but unfortunately it's a newer kernel version. No
dice in the first 55 test iterations.

Konstantin, Alexander - are you using the same device to reproduce this or
different ones? I wonder if this somehow depends on some MDM / corporate
enforcement tooling running or such.

What does:
- profiles status -type enrollment
- kextstat -l
show?

Greetings,

Andres Freund

[1] https://postgr.es/m/92b33ab2-0596-40fe-9db6-a6d821d08e8a%40gmail.com

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

17 июня, 16:43:05

On 17/06/2025 4:35 pm, Andres Freund wrote:
> Konstantin, Alexander - are you using the same device to reproduce this or
> different ones? I wonder if this somehow depends on some MDM / corporate
> enforcement tooling running or such.
>
> What does:
> - profiles status -type enrollment
> - kextstat -l
> show?

I and Alexandr are using completely different devices with different 
hardware, OS and clang version.

profiles status -type enrollment
Enrolled via DEP: No
MDM enrollment: Yes (User Approved)
MDM server: https://apple.mdm.jumpcloud.com/command

`kextstat -l` output very long list, should I post it all or search for 
some particular pattern?

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

17 июня, 16:47:37

Hi,

On 2025-06-17 16:43:05 +0300, Konstantin Knizhnik wrote:
> On 17/06/2025 4:35 pm, Andres Freund wrote:
> > Konstantin, Alexander - are you using the same device to reproduce this or
> > different ones? I wonder if this somehow depends on some MDM / corporate
> > enforcement tooling running or such.
> > 
> > What does:
> > - profiles status -type enrollment
> > - kextstat -l
> > show?
> 
> 
> I and Alexandr are using completely different devices with different
> hardware, OS and clang version.

Both of you are running Ventura, right?


> profiles status -type enrollment
> Enrolled via DEP: No
> MDM enrollment: Yes (User Approved)
> MDM server: https://apple.mdm.jumpcloud.com/command

So it is enrolled in MDM, which, I guess, means there's some kernel component
running...


> `kextstat -l` output very long list, should I post it all or search for some
> particular pattern?

I guess
  kextstat -l|grep -v com.apple
might suffice.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Tom Lane

Дата:

17 июня, 17:15:37

Andres Freund <andres@anarazel.de> writes:
> Both of you are running Ventura, right?

FTR, the machines I'm trying this on are all running current Sequoia:

[tgl@minim4 ~]$ uname -a
Darwin minim4.sss.pgh.pa.us 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025;
root:xnu-11417.121.6~2/RELEASE_ARM64_T6041arm64 

Interestingly, the M1 seems to be using a different kernel build than
my two M4 machines:

[tgl@minim1 ~]$ uname -a
Darwin minim1.sss.pgh.pa.us 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:48:46 PDT 2025;
root:xnu-11417.121.6~2/RELEASE_ARM64_T8103arm64 

although Apple has a history of using kernels customized to specific
core counts, so maybe that's all there is there.

These are all personally-owned bare metal:

[tgl@minim4 ~]$ profiles status -type enrollment
Enrolled via DEP: No
MDM enrollment: No
[tgl@minim4 ~]$ kextstat -l | grep -v com.apple
Executing: /usr/bin/kmutil showloaded --list-only
No variant specified, falling back to release

As for your previous question, I tried both c=16 and c=32 on
the M4.

            regards, tom lane

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

17 июня, 17:54:12

On 12/06/2025 4:57 pm, Andres Freund wrote:
> The problem appears to be in that switch between "when submitted, by the IO
> worker" and "then again by the backend".  It's not concurrent access in the
> sense of two processes writing to the same value, it's that when switching
> from the worker updating ->distilled_result to the issuer looking at that, the
> issuer didn't ensure that no outdated version of ->distilled_result could be
> used.
>
> Basically, the problem is that the worker would
>
> 1) set ->distilled_result
> 2) perform a write memory barrier
> 3) set ->state to COMPLETED_SHARED
>
> and then the issuer of the IO would:
>
> 4) check ->state is COMPLETED_SHARED
> 5) use ->distilled_result
>
> The problem is that there currently is no barrier between 4 & 5, which means
> an outdated ->distilled_result could be used.
>
>
> This also explains why the issue looked so weird - eventually, after fprintfs,
> after a core dump, etc, the updated ->distilled_result result would "arrive"
> in the issuing process, and suddenly look correct.
>
Sorry, I realized that O do not completely understand how it can 
explained assertion failure in `pgaio_io_before_start`:

     Assert(ioh->op == PGAIO_OP_INVALID);

I am not so familiar with AIO implementation in Postgress, so please 
excuse me if I missing something...

Ok, `distilled_result` contains some deteriorated data.
As far as I understand it is used for calling callbacks.

Right now in debugger (loaded core file), I see thsat two callbacks are 
registered: `aio_md_readv_cb` and `aio_shared_buffer_readv_cb`.
None of them is changing "op" field.

This is full content of the handle:

(PgAioHandle) {
   state = PGAIO_HS_HANDED_OUT
   target = PGAIO_TID_SMGR
   op = PGAIO_OP_READV
   flags = '\x04'
   num_callbacks = '\x02'
   callbacks = {
     [0] = '\x02'
     [1] = '\x01'
     [2] = '\0'
     [3] = '\0'
   }
   callbacks_data = {
     [0] = '\0'
     [1] = '\0'
     [2] = '\0'
     [3] = '\0'
   }
   handle_data_len = '\x01'
   owner_procno = 28
   result = 0
   node = {
     prev = 0x000000010c9ea368
     next = 0x000000010c9ea368
   }
   resowner = 0x000000014880ef28
   resowner_node = {
     prev = 0x000000014880f1d8
     next = 0x000000014880f1d8
   }
   generation = 133159566
   cv = {
     mutex = 0
     wakeup = (head = -1, tail = -1)
   }
   distilled_result = (id = 0, status = 0, error_data = 0, result = 1)
   iovec_off = 448
   report_return = 0x000000010f81b950
   op_data = {
     read = (fd = 7, iov_length = 1, offset = 180936704)
     write = (fd = 7, iov_length = 1, offset = 180936704)
   }
   target_data = {
     smgr = {
       rlocator = (spcOid = 1663, dbOid = 5, relNumber = 16384)
       blockNum = 22088
       nblocks = 1
       forkNum = MAIN_FORKNUM
       is_temp = false
       skip_fsync = false
     }
   }
}

Actually `PGAIO_OP_READV` is set in the only place: `pgaio_io_start_readv`.
And this function is obviously called in user's backed with procno=28 
which letter gets this assert failure.

So I try to understand dataflow.

1. Backend calls `pgaio_io_start_readv` to initiate IO. It assigns 
`op=PGAIO_OP_READV`
2. It is broadcasted to all CPUs because of write barrier in 
`pgaio_io_update_state(ioh, PGAIO_HS_DEFINED)` called fro `pgaio_io_stage`.
3. AIO thread gets notification about this request and somehow handle it 
changing status to `COMPLETED_SHARED` in `pgaio_io_process_completion` 
which notifies backend using condvar (which is expected to enforce write 
barrier).
4. Backend see that status is `COMPLETED_SHARED`. Backend can do it in 
`pgaio_io_wait` or in `pgaio_io_wait_for_free`. In first case there 
seems to be no problem, because `pgaio_io_was_recycled` is used which 
does read barrier. So you suspect `pgaio_io_wait_for_free` where read 
barrier was missed (before your patch). But it is not quite clear to me 
what can happen with "op" field without read barrier? In theory backend 
can see old value. In practice it seems to be impossible because as far 
as I know the unit of syncing data between CPU is cache line.

So my hypothesis was the following: backend observed old value of "op" 
(PGAIO_OP_READV, because this is what it has written itself), then 
overwrites it with PGAIO_OP_INVALID in `pgaio_io_reclaim` and then 
delayed update from AIO worker is received which restores old value - 
PGAIO_OP_READV.
But it seems to be not so realistic scenario.

Can you explain please how according to you theory not up-to-date 
version of distilled_result can explain such behavior?
It will be possible if result contains index of the handle and stale 
version of the index can cause addressing from IO handle.
But it is not the case - as we can see from the trace, address of `ioh` 
in the backens is always the same.

>
>> This is what I am observing myself: "op"  field is modified and fetched by
>> the same process.
> Right - but I don't think the ->op field being wrong was actually part of the
> issue though.
>
>
>> Certainly process can be rescheduled to some other CPU. But if such
>> reschedule can cause loose of stored value, then nothing will work, will it?
> Yes, that'd be completely broken - but isn't the issue here.
>
> Greetings,
>
> Andres Freund

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

17 июня, 18:08:30

On 17/06/2025 4:47 pm, Andres Freund wrote:

I and Alexandr are using completely different devices with different
hardware, OS and clang version.

Both of you are running Ventura, right?

No, Alexandr is using darwin23.5
Alexandr also noticed that he can reproduce the problem only with --with-liburing.
And I am not using it.

I guess  kextstat -l|grep -v com.apple
might suffice.

kextstat -l| grep -v com.apple
Executing: /usr/bin/kmutil showloaded --list-only
No variant specified, falling back to release

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

17 июня, 18:08:48

On 2025-06-17 17:54:12 +0300, Konstantin Knizhnik wrote:
> 
> On 12/06/2025 4:57 pm, Andres Freund wrote:
> > The problem appears to be in that switch between "when submitted, by the IO
> > worker" and "then again by the backend".  It's not concurrent access in the
> > sense of two processes writing to the same value, it's that when switching
> > from the worker updating ->distilled_result to the issuer looking at that, the
> > issuer didn't ensure that no outdated version of ->distilled_result could be
> > used.
> > 
> > Basically, the problem is that the worker would
> > 
> > 1) set ->distilled_result
> > 2) perform a write memory barrier
> > 3) set ->state to COMPLETED_SHARED
> > 
> > and then the issuer of the IO would:
> > 
> > 4) check ->state is COMPLETED_SHARED
> > 5) use ->distilled_result
> > 
> > The problem is that there currently is no barrier between 4 & 5, which means
> > an outdated ->distilled_result could be used.
> > 
> > 
> > This also explains why the issue looked so weird - eventually, after fprintfs,
> > after a core dump, etc, the updated ->distilled_result result would "arrive"
> > in the issuing process, and suddenly look correct.
> > 
> Sorry, I realized that O do not completely understand how it can explained
> assertion failure in `pgaio_io_before_start`:
> 
>     Assert(ioh->op == PGAIO_OP_INVALID);

I don't think it can - this must be an independent bug from the one that Tom
and I were encountering.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

17 июня, 18:11:13

On 2025-06-17 18:08:30 +0300, Konstantin Knizhnik wrote:
> 
> On 17/06/2025 4:47 pm, Andres Freund wrote:
> > > I and Alexandr are using completely different devices with different
> > > hardware, OS and clang version.
> > Both of you are running Ventura, right?
> > 
> No, Alexandr is using darwin23.5

I.e. Sonoma.


> Alexandr also noticed that he can reproduce the problem only with
> --with-liburing.
> And  I am not using it.

That must be a misunderstanding - liburing is linux only.



> > I guess
> >    kextstat -l|grep -v com.apple
> > might suffice.
> 
> kextstat -l| grep -v com.apple
> Executing: /usr/bin/kmutil showloaded --list-only
> No variant specified, falling back to release

Thanks.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

18 июня, 10:32:08

On 17/06/2025 6:08 pm, Andres Freund wrote:
>
> I don't think it can - this must be an independent bug from the one that Tom
> and I were encountering.

I see... It's a pity.

By the way, I have a questions concerning using interrupts in AIO.
The comments say:

pgaio_io_release(PgAioHandle *ioh)
                 /*
                  * Note that no interrupts are processed between
                  * pgaio_io_was_recycled() and this check - that's 
important
                  * as otherwise an interrupt could have already 
reclaimed the
                  * handle.
                  */

pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state)
     /*
      * All callers need to have held interrupts in some form, otherwise
      * interrupt processing could wait for the IO to complete, while in an
      * intermediary state.
      */
...

But I failed to understand how handle can be reclaimed by interrupt or 
how any other AIO processing activity can be done  in interrupt handlers,
`IoWorkerMain` is not registering some IO specific interrupts. Can you 
explain please how interrupts can affect AIO, because I suspect that 
interrupts may be the only possible explanation of such behavior?

Also I tried to write small test reproducing AIO data flow:

#include <assert.h>
#include <pthread.h>

#define read_barrier() __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define write_barrier() __atomic_thread_fence(__ATOMIC_RELEASE)

typedef struct {
     int state:8;
     int target:8;
     int op:8;
     int result;
} Handle;

enum State { IDLE, GO, DONE };
enum Operation { NOP, READ };

void* io_thread_proc(void* arg)
{
     Handle* h = (Handle*)arg;
     while (1)
     {
         if (h->state == GO)
         {
             assert(h->op == READ);
             h->result += 1;
             write_barrier();
             h->state = DONE;
         }
     }
     return  0;
}

void* client_thread_proc(void* arg)
{
     Handle* h = (Handle*)arg;
     int expected_result = 0;
     while (1)
     {
         assert(h->op == NOP);
         assert(h->state == IDLE);
         h->op = READ;
         write_barrier();
         h->state = GO;
         while (h->state != DONE);
         read_barrier();
         h->op = NOP;
         expected_result += 1;
         assert(h->result == expected_result);
         write_barrier();
         h->state = IDLE;
     }
     return  0;
}

int main() {
     void* res;
     pthread_t client_thread, io_thread;
     Handle h = {IDLE, 0, NOP, 0};
     pthread_create(&client_thread, NULL, client_thread_proc, &h);
     pthread_create(&io_thread, NULL, io_thread_proc, &h);
     pthread_join(client_thread, &res);
     pthread_join(io_thread, res);
     return 0;
}

It certainly works without any problems (well, I have not run it for 
hours, but I do not think that it is needed).
Do you think that this test is doing something similar as Postgres AIO 
or something should be changed (certainly AIO is not doing busy loop 
like this test, but unlikely it is important for reproducing the problem).

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

18 июня, 19:08:17

Hi,

On 2025-06-18 10:32:08 +0300, Konstantin Knizhnik wrote:
> On 17/06/2025 6:08 pm, Andres Freund wrote:
> >
> > I don't think it can - this must be an independent bug from the one that Tom
> > and I were encountering.

> I see... It's a pity.

Indeed.

Konstantin, Alexander, can you share what commit you're testing and what
precise changes have been applied to the source?  I've now tested this on a
significant number of apple machines for many many days without being able to
reproduce this a single time, despite using various compiler
[versions]. Something has to be different on the two systems you're testing
on.


> By the way, I have a questions concerning using interrupts in AIO.
> The comments say:
>
> pgaio_io_release(PgAioHandle *ioh)
>                 /*
>                  * Note that no interrupts are processed between
>                  * pgaio_io_was_recycled() and this check - that's important
>                  * as otherwise an interrupt could have already reclaimed
> the
>                  * handle.
>                  */
>
> pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state)
>     /*
>      * All callers need to have held interrupts in some form, otherwise
>      * interrupt processing could wait for the IO to complete, while in an
>      * intermediary state.
>      */
> ...
>
> But I failed to understand how handle can be reclaimed by interrupt or how
> any other AIO processing activity can be done  in interrupt handlers,
> `IoWorkerMain` is not registering some IO specific interrupts. Can you
> explain please how interrupts can affect AIO, because I suspect that
> interrupts may be the only possible explanation of such behavior?

The most problematic interrupt is ProcessBarrierSmgrRelease(). To prevent
problems with closing file-descriptors of in-progress IOs (with io_uring), we
may need to wait for IO for file descriptors to complete before closing the
file descriptor.  If we started to wait for IO while in the middle of updating
the state of an IO we'd re-enter the AIO code from the interrupt processing,
which would lead to confusion.

Note that this isn't specific to IO workers, it applies to nearly all process
types (I guess not to the logger process, but ...).


> Also I tried to write small test reproducing AIO data flow:
>
> #include <assert.h>
> #include <pthread.h>
>
> #define read_barrier() __atomic_thread_fence(__ATOMIC_ACQUIRE)
> #define write_barrier() __atomic_thread_fence(__ATOMIC_RELEASE)
>
> typedef struct {
>     int state:8;
>     int target:8;
>     int op:8;
>     int result;
> } Handle;
>
> enum State { IDLE, GO, DONE };
> enum Operation { NOP, READ };
>
> void* io_thread_proc(void* arg)
> {
>     Handle* h = (Handle*)arg;
>     while (1)
>     {
>         if (h->state == GO)

Strictly speaking I don't think the compiler is actually forced to reload
h->state from memory here...

>         {
>             assert(h->op == READ);
>             h->result += 1;
>             write_barrier();
>             h->state = DONE;
>         }
>     }
>     return  0;
> }
>
> void* client_thread_proc(void* arg)
> {
>     Handle* h = (Handle*)arg;
>     int expected_result = 0;
>     while (1)
>     {
>         assert(h->op == NOP);
>         assert(h->state == IDLE);
>         h->op = READ;
>         write_barrier();
>         h->state = GO;
>         while (h->state != DONE);

Same here.


> Do you think that this test is doing something similar as Postgres AIO or
> something should be changed (certainly AIO is not doing busy loop like this
> test, but unlikely it is important for reproducing the problem).

It's reasonably similar. As you say, with pg's aio there's no such busy
looping - I guess it's possible that this only happens if there are scheduler
interactions in the loops.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

19 июня, 01:18:21

On Thu, Jun 19, 2025 at 4:08 AM Andres Freund <andres@anarazel.de> wrote:
> Konstantin, Alexander, can you share what commit you're testing and what
> precise changes have been applied to the source?  I've now tested this on a
> significant number of apple machines for many many days without being able to
> reproduce this a single time, despite using various compiler
> [versions]. Something has to be different on the two systems you're testing
> on.

What do you see in "System Firmware Version" from "About This Mac" ->
"More Info" -> "General" -> "System Report"?  My M4 Air running
Sequoia 15.5 shows "11881.121.1".  My M1 Air is still running Sequoia
15.4.1 and shows "11881.101.1" (will upgrade it shortly).  The version
before, Sequoia 15.4 (no ".1") actually broke my M1 Air for a couple
of weeks, there was something funny going on with the firmware
involving a kernel panic that mentioned t6000dart (I believe that is
the IOMMU, ie virtual memory hardware for IO), fixed by 15.4.1.  It's
just a paranoid thought, but I wonder if all that stuff rolls out in
sync across all territories.

I have also not managed to reproduce this issue despite running for
many hours yesterday...

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

19 июня, 17:02:18

On 18/06/2025 7:08 pm, Andres Freund wrote:
> Hi,
>
> On 2025-06-18 10:32:08 +0300, Konstantin Knizhnik wrote:
>> On 17/06/2025 6:08 pm, Andres Freund wrote:
>>> I don't think it can - this must be an independent bug from the one that Tom
>>> and I were encountering.
>> I see... It's a pity.
> Indeed.
>
> Konstantin, Alexander, can you share what commit you're testing and what
> precise changes have been applied to the source?

I am using most recent master: 0cf205e122ae0fe9333ccf843c2269f13ddc32fc

By the way - still not been able to reproduce assertion failure at most 
recent MacPro (Apple M4 Pro) with Sequoia 15.5.

Re: Non-reproducible AIO failure

От

Daniel Gustafsson

Дата:

19 июня, 17:11:19

> On 19 Jun 2025, at 16:02, Konstantin Knizhnik <knizhnik@garret.ru> wrote:

> By the way - still not been able to reproduce assertion failure at most recent MacPro (Apple M4 Pro) with Sequoia
15.5.

I tried to reproduce this on an older quad core i7 MacBook Pro running Sonoma
14.7.5 using Apple clang version 15.0.0 (clang-1500.3.9.4) which corresponds to
LLVM 16.  Nothing so far after running a testloop suggested by Andres off-list
for two hours.

--
Daniel Gustafsson

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

19 июня, 17:36:20

Hi,

On 2025-06-19 17:02:18 +0300, Konstantin Knizhnik wrote:
> On 18/06/2025 7:08 pm, Andres Freund wrote:
> > Hi,
> >
> > On 2025-06-18 10:32:08 +0300, Konstantin Knizhnik wrote:
> > > On 17/06/2025 6:08 pm, Andres Freund wrote:
> > > > I don't think it can - this must be an independent bug from the one that Tom
> > > > and I were encountering.
> > > I see... It's a pity.
> > Indeed.
> >
> > Konstantin, Alexander, can you share what commit you're testing and what
> > precise changes have been applied to the source?
>
> I am using most recent master: 0cf205e122ae0fe9333ccf843c2269f13ddc32fc
>
> By the way - still not been able to reproduce assertion failure at most
> recent MacPro (Apple M4 Pro) with Sequoia 15.5.

So for some reason this apparently can only be reproduced on older macos - we
know it's not the older compiler, because I couldn't reproduce it on the same
compile version as alexander, on an m1 that was running sequoia. That's really
really weird. I guess I'll try to see if i can reproduce it on an older macos
in a VM, running on an m4.

Greetings,

Andres

Re: Non-reproducible AIO failure

От

Daniel Gustafsson

Дата:

19 июня, 18:05:25

> On 19 Jun 2025, at 16:36, Andres Freund <andres@anarazel.de> wrote:

> So for some reason this apparently can only be reproduced on older macos - we
> know it's not the older compiler, because I couldn't reproduce it on the same
> compile version as alexander, on an m1 that was running sequoia. That's really
> really weird. I guess I'll try to see if i can reproduce it on an older macos
> in a VM, running on an m4.

I also dug out an archeologically old MacBook Pro running macOS High Sierra
10.13.6 with an i5 using Apple LLVM version 10.0.0 (clang-1000.10.44.4), and it
too fails to reproduce any issue.

--
Daniel Gustafsson

Re: Non-reproducible AIO failure

От

Nico Williams

Дата:

19 июня, 18:16:12

On Thu, Jun 19, 2025 at 05:05:25PM +0200, Daniel Gustafsson wrote:
> I also dug out an archeologically old MacBook Pro running macOS High Sierra
> 10.13.6 with an i5 using Apple LLVM version 10.0.0 (clang-1000.10.44.4), and it
> too fails to reproduce any issue.

It's not going to be reproducible on x86_64 macs.  It has to be ARM
macs.

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

22 июля, 02:19:54

Hi,

On 2025-06-19 10:16:12 -0500, Nico Williams wrote:
> On Thu, Jun 19, 2025 at 05:05:25PM +0200, Daniel Gustafsson wrote:
> > I also dug out an archeologically old MacBook Pro running macOS High Sierra
> > 10.13.6 with an i5 using Apple LLVM version 10.0.0 (clang-1000.10.44.4), and it
> > too fails to reproduce any issue.
>
> It's not going to be reproducible on x86_64 macs.  It has to be ARM
> macs.

In addition to various other folks that had reported in, I had various macs,
both real and virtualized, run this test in a loop. For weeks - without
reproducing the issue once [1].  Since this can only be reproduced on
Alexander and Konstantin's laptops, running an older macos [2] and running
MDM, I'm for now considering this issue to not be reproducible outside of that
environment. That's deeply unsatisfying, but I just don't see what else I
could do.

RMT, note that there were two issues in this thread, the original report by
Tom has been addressed (in e9a3615a522).  I guess the best thing would be to
split the open items entry into two?

Greetings,

Andres Freund

[1] Rather impressed at how stable our tests are right now, I've barely hit
any independent issues. That definitely wasn't always the case.

[2] I tried various versions of macos, without success

Re: Non-reproducible AIO failure

От

Nathan Bossart

Дата:

12 августа, 19:29:06

On Mon, Jul 21, 2025 at 07:19:54PM -0400, Andres Freund wrote:
> RMT, note that there were two issues in this thread, the original report by
> Tom has been addressed (in e9a3615a522).  I guess the best thing would be to
> split the open items entry into two?

I went ahead and marked the open item as fixed for 18beta2 with a note
about the unreproducibility of the remaining issue.

-- 
nathan

Re: Non-reproducible AIO failure

От

Alexander Lakhin

Дата:

20 августа, 21:00:00

Hello Andres,

22.07.2025 02:19, Andres Freund wrote:

Hi,

On 2025-06-19 10:16:12 -0500, Nico Williams wrote:

On Thu, Jun 19, 2025 at 05:05:25PM +0200, Daniel Gustafsson wrote:

I also dug out an archeologically old MacBook Pro running macOS High Sierra
10.13.6 with an i5 using Apple LLVM version 10.0.0 (clang-1000.10.44.4), and it
too fails to reproduce any issue.

It's not going to be reproducible on x86_64 macs.  It has to be ARM
macs.

In addition to various other folks that had reported in, I had various macs,
both real and virtualized, run this test in a loop. For weeks - without
reproducing the issue once [1].  Since this can only be reproduced on
Alexander and Konstantin's laptops, running an older macos [2] and running
MDM, I'm for now considering this issue to not be reproducible outside of that
environment. That's deeply unsatisfying, but I just don't see what else I
could do.

I got access to that M1 laptop the past week and reproduced the failure
again. Then I just copied all the postgres directory to my MacBook M4 Pro
and ran the same tests there. And the issue reproduced! This fact made me
believe that that's not the old CPU-specific issue. Armed with this belief,
I managed to reproduce this with a build made on M4 too.
So for now I have pretty stable reproducer which I'd like to share (please
forgive me some rough edges):
git apply AIO-debugging.patch
CC=/opt/homebrew/Cellar/llvm@17/17.0.6/bin/clang-17 CFLAGS="" ./configure --enable-cassert --enable-debug --enable-tap-tests --without-icu --without-zlib --prefix=/tmp/postgres -q && make -s -j8 && make -s install && make -s check

for i in {1..10}; do np=$((20 + $RANDOM % 10)); echo "iteration $i: $np"; time parallel -j40 --linebuffer --tag /tmp/repro-AIO-Assert.sh {} ::: `seq $np` || break; sleep $(($RANDOM % 20)); done; echo -e "\007"
(For M1 it works with ~6 servers.)

The script executes opr_sanity.sql from the regress suite, but it can be
reduced, of course. Though reducing the number of queries seemingly leads
to decrease of the failure's probability, so I decided to leave it as-is
for now. Anyway, I can work on simplifying the repro, if needed.

This fails for me as below:
Run 1:
iteration 3: 29
...
24      TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 167, PID: 56420
...
real    12m3.018s

server24 log contains:
!!!AsyncReadBuffers [56420] (1)| blocknum: 73, ioh: 0x102588640, ioh->op: 0, ioh->state: 1, ioh->target: 0, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 16293
!!!AsyncReadBuffers [56420] before smgrstartreadv| blocknum: 73, ioh: 0x102588640, ioh->op: 170, pre_op: 170, ioh->state: 1, ioh->target: 0, ioh->result: 0, ioh->num_callbacks: 1, ioh->generation: 16293
2025-08-20 19:20:23.121 EEST [56420] LOG: !!!FileStartReadV| ioh: 0x102588640, ioh->op: 170, ioh->state: 1, ioh->target: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 16293
2025-08-20 19:20:23.121 EEST [56420] STATEMENT: SELECT DISTINCT p1.proargtypes[7]::regtype, p2.proargtypes[7]::regtype
    FROM pg_proc AS p1, pg_proc AS p2
    WHERE p1.oid != p2.oid AND
        p1.prosrc = p2.prosrc AND
        p1.prolang = 12 AND p2.prolang = 12 AND
        p1.prokind != 'a' AND p2.prokind != 'a' AND
        (p1.proargtypes[7] < p2.proargtypes[7])
    ORDER BY 1, 2;
2025-08-20 19:20:23.121 EEST [56420] LOG: !!!pgaio_io_before_start| ioh: 0x102588640, ioh->op: 170, ioh->state: 1, ioh->target: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 16293
2025-08-20 19:20:23.121 EEST [56420] STATEMENT: SELECT DISTINCT p1.proargtypes[7]::regtype, p2.proargtypes[7]::regtype
    FROM pg_proc AS p1, pg_proc AS p2
    WHERE p1.oid != p2.oid AND
        p1.prosrc = p2.prosrc AND
        p1.prolang = 12 AND p2.prolang = 12 AND
        p1.prokind != 'a' AND p2.prokind != 'a' AND
        (p1.proargtypes[7] < p2.proargtypes[7])
    ORDER BY 1, 2;
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 167, PID: 56420

Run 2:
iteration 3: 21
...
21      TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 167, PID: 378
...
real    8m32.370s

server21.log contains:
!!!pgaio_io_reclaim [378]| ioh: 0x103f533d0, ioh->op: 0, ioh->state: 0, ioh->target: 0, ioh->generation: 13609
!!!pgaio_io_acquire_nb [378]| ioh: 0x103f533d0, ioh->op: 170, pre_op: 170
!!!AsyncReadBuffers [378] (1)| blocknum: 55, ioh: 0x103f533d0, ioh->op: 170, ioh->state: 1, ioh->target: 0, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 13609
!!!AsyncReadBuffers [378] before smgrstartreadv| blocknum: 55, ioh: 0x103f533d0, ioh->op: 170, pre_op: 170, ioh->state: 1, ioh->target: 0, ioh->result: 0, ioh->num_callbacks: 1, ioh->generation: 13609
!!!pgaio_io_reclaim [379]| ioh: 0x103f53340, ioh->op: 0, ioh->state: 1, ioh->target: 0, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 13457
!!!pgaio_io_reclaim [379]| ioh: 0x103f53340, ioh->op: 0, ioh->state: 0, ioh->target: 0, ioh->generation: 13458
2025-08-20 19:53:56.726 EEST [378] LOG: !!!FileStartReadV| ioh: 0x103f533d0, ioh->op: 170, ioh->state: 1, ioh->target: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 13609
2025-08-20 19:53:56.726 EEST [378] STATEMENT: SELECT p1.oid, p1.proname
    FROM pg_proc as p1
    WHERE p1.prolang = 0 OR p1.prorettype = 0 OR
           p1.pronargs < 0 OR
           p1.pronargdefaults < 0 OR
           p1.pronargdefaults > p1.pronargs OR
           array_lower(p1.proargtypes, 1) != 0 OR
           array_upper(p1.proargtypes, 1) != p1.pronargs-1 OR
           0::oid = ANY (p1.proargtypes) OR
           procost <= 0 OR
           CASE WHEN proretset THEN prorows <= 0 ELSE prorows != 0 END OR
           prokind NOT IN ('f', 'a', 'w', 'p') OR
           provolatile NOT IN ('i', 's', 'v') OR
           proparallel NOT IN ('s', 'r', 'u');
2025-08-20 19:53:56.726 EEST [378] LOG: !!!pgaio_io_before_start| ioh: 0x103f533d0, ioh->op: 170, ioh->state: 1, ioh->target: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 13609
2025-08-20 19:53:56.726 EEST [378] STATEMENT: SELECT p1.oid, p1.proname
    FROM pg_proc as p1
    WHERE p1.prolang = 0 OR p1.prorettype = 0 OR
           p1.pronargs < 0 OR
           p1.pronargdefaults < 0 OR
           p1.pronargdefaults > p1.pronargs OR
           array_lower(p1.proargtypes, 1) != 0 OR
           array_upper(p1.proargtypes, 1) != p1.pronargs-1 OR
           0::oid = ANY (p1.proargtypes) OR
           procost <= 0 OR
           CASE WHEN proretset THEN prorows <= 0 ELSE prorows != 0 END OR
           prokind NOT IN ('f', 'a', 'w', 'p') OR
           provolatile NOT IN ('i', 's', 'v') OR
           proparallel NOT IN ('s', 'r', 'u');
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 167, PID: 378

Run 3:
iteration 3: 29
...
2       TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 167, PID: 39023
...
real    11m51.809s

server2.log contains:
!!!pgaio_io_reclaim [39023]| ioh: 0x1045ff220, ioh->op: 0, ioh->state: 0, ioh->target: 0, ioh->generation: 5268
!!!pgaio_io_acquire_nb [39023]| ioh: 0x1045ff220, ioh->op: 170, pre_op: 170
!!!AsyncReadBuffers [39023] (1)| blocknum: 67, ioh: 0x1045ff220, ioh->op: 170, ioh->state: 1, ioh->target: 0, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 5268
!!!AsyncReadBuffers [39023] before smgrstartreadv| blocknum: 67, ioh: 0x1045ff220, ioh->op: 170, pre_op: 170, ioh->state: 1, ioh->target: 0, ioh->result: 0, ioh->num_callbacks: 1, ioh->generation: 5268
2025-08-20 20:24:41.767 EEST [39023] LOG: !!!FileStartReadV| ioh: 0x1045ff220, ioh->op: 170, ioh->state: 1, ioh->target: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 5268
2025-08-20 20:24:41.767 EEST [39023] STATEMENT: SELECT p1.oid, p1.proname
    FROM pg_proc as p1 LEFT JOIN pg_description as d
         ON p1.tableoid = d.classoid and p1.oid = d.objoid and d.objsubid = 0
    WHERE d.classoid IS NULL AND p1.oid <= 9999;
2025-08-20 20:24:41.767 EEST [39023] LOG: !!!pgaio_io_before_start| ioh: 0x1045ff220, ioh->op: 170, ioh->state: 1, ioh->target: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 5268
2025-08-20 20:24:41.767 EEST [39023] STATEMENT: SELECT p1.oid, p1.proname
    FROM pg_proc as p1 LEFT JOIN pg_description as d
         ON p1.tableoid = d.classoid and p1.oid = d.objoid and d.objsubid = 0
    WHERE d.classoid IS NULL AND p1.oid <= 9999;
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 167, PID: 39023

Best regards,
Alexander

Вложения

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

23 августа, 20:32:48

On 20/08/2025 9:00 PM, Alexander Lakhin wrote:
> for i in {1..10}; do np=$((20 + $RANDOM % 10)); echo "iteration $i: 
> $np"; time parallel -j40 --linebuffer --tag  /tmp/repro-AIO-Assert.sh 
> {} ::: `seq $np` || break; sleep $(($RANDOM % 20)); done; echo -e "\007"

Unfortunately I was not able to reproduce the problem on my MacBook M4 
Pro (which should be identical with yours) using your instructions: I 
tried five times but in all cases 10 iterations are completed normally.

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

24 августа, 15:38:16

On Sun, Aug 24, 2025 at 5:32 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> On 20/08/2025 9:00 PM, Alexander Lakhin wrote:
> > for i in {1..10}; do np=$((20 + $RANDOM % 10)); echo "iteration $i:
> > $np"; time parallel -j40 --linebuffer --tag  /tmp/repro-AIO-Assert.sh
> > {} ::: `seq $np` || break; sleep $(($RANDOM % 20)); done; echo -e "\007"
>
>
> Unfortunately I was not able to reproduce the problem on my MacBook M4
> Pro (which should be identical with yours) using your instructions: I
> tried five times but in all cases 10 iterations are completed normally.

I can't reproduce it either.  I think the sequence is:

IO worker: pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_SHARED)
Backend:   pgaio_io_wait(ioh, generation)
             pgaio_io_was_recycled(ioh, generation, &state)
               case for state == PGAIO_HS_COMPLETED_SHARED
                 pgaio_io_reclaim(boh)
                   pgaio_io_update_state(ioh, PGAIO_HS_IDLE)
                   ioh->op = PGAIO_OP_INVALID;
Backend:   pgaio_io_acquire_nb()
             Assert(ioh->state == PGAIO_HS_IDLE) *fails*

pgaio_io_update_state() is essentially:

    pg_write_barrier();
    ioh->state = new_state;

I looked at the code generated by a non-optimised build:

(lldb) dis --name pgaio_io_update_state
...
postgres[0x100687fc4] <+372>: dmb    ish                    XXX memory barrier
postgres[0x100687fc8] <+376>: ldur   w10, [x29, #-0xc]
postgres[0x100687fcc] <+380>: ldur   x9, [x29, #-0x8]
postgres[0x100687fd0] <+384>: ldrh   w8, [x9]               XXX load
state + target
postgres[0x100687fd4] <+388>: ldrb   w11, [x9, #0x2]        XXX load op
postgres[0x100687fd8] <+392>: orr    w8, w8, w11, lsl #16
postgres[0x100687fdc] <+396>: and    w10, w10, #0xff
postgres[0x100687fe0] <+400>: and    w8, w8, #0xffffff00
postgres[0x100687fe4] <+404>: orr    w10, w8, w10
postgres[0x100687fe8] <+408>: lsr    w8, w10, #16
postgres[0x100687fec] <+412>: strh   w10, [x9]              XXX store
state + target
postgres[0x100687ff0] <+416>: strb   w8, [x9, #0x2]         XXX store op
postgres[0x100687ff4] <+420>: ldp    x29, x30, [sp, #0x60]
postgres[0x100687ff8] <+424>: add    sp, sp, #0x70
postgres[0x100687ffc] <+428>: ret

Even the -O3 version does that, it just uses fancier instructions:

postgres[0x100687c08] <+316>: dmb    ish
postgres[0x100687c0c] <+320>: ldurh  w8, [x20, #0x1]
postgres[0x100687c10] <+324>: bfi    w19, w8, #8, #16
postgres[0x100687c14] <+328>: lsr    w8, w8, #8
postgres[0x100687c18] <+332>: strb   w8, [x20, #0x2]       XXX clobber op
postgres[0x100687c1c] <+336>: strh   w19, [x20]

Could it be that the store buffer was flushed between the two stores,
pgaio_io_was_recycled() saw the new state, pgaio_io_reclaim() assigned
ioh->op = PGAIO_OP_INVALID and it was flushed to L1, and then finally
the time travelling op value was flushed and clobbered it?

Apple clang 17 output (= LLVM 19) with optimisation is smart enough to
use a single store:

postgres[0x1002f0ae8] <+308>: dmb    ish
postgres[0x1002f0aec] <+312>: strb   w19, [x20]

That's also how open source clang 17 compiles it if you rip out the
bitfield.  I guess if you do that, you won't be able to reproduce
this, Alexander?  Something like:

 struct PgAioHandle
 {
        /* all state updates should go through pgaio_io_update_state() */
-       PgAioHandleState state:8;
+       uint8           state;

        /* what are we operating on */
-       PgAioTargetID target:8;
+       uint8           target;

        /* which IO operation */
-       PgAioOp         op:8;
+       uint8           op;

Curiously, Apple clang 17 without optimisation does some amount of
unnecessary clobbering, but not the whole bitset and it's only a
single 16 bit store.

postgres[0x10057bfb0] <+348>: dmb    ish
postgres[0x10057bfb4] <+352>: ldur   w10, [x29, #-0xc]
postgres[0x10057bfb8] <+356>: ldur   x9, [x29, #-0x8]
postgres[0x10057bfbc] <+360>: ldrh   w8, [x9]
postgres[0x10057bfc0] <+364>: and    w10, w10, #0xff
postgres[0x10057bfc4] <+368>: and    w8, w8, #0xffffff00
postgres[0x10057bfc8] <+372>: orr    w8, w8, w10
postgres[0x10057bfcc] <+376>: strh   w8, [x9]

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

24 августа, 21:11:36

On 24/08/2025 3:38 PM, Thomas Munro wrote:
> That's also how open source clang 17 compiles it if you rip out the
> bitfield.  I guess if you do that, you won't be able to reproduce
> this, Alexander?  Something like:

I think that we have made this experiment at the very beginning and as 
far as I remembered we were not able to reproduce the problem with 
replaced bitfields. In theory even replacing bitfield with in should not 
avoid race condition, because they are still shared the same cache line.
But practice doesn't confirm it:)

I tried to create small test reproducing this data flow.
But still failed to find and problem with this test.
May be somebody can change this test to be more similar with real AIO 
dataflows?

Вложения

race.c

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

25 августа, 01:43:21

On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> In theory even replacing bitfield with in should not
> avoid race condition, because they are still shared the same cache line.

I'm no expert in this stuff, but that's not my understanding of how it
works.  Plain stores to normal memory go into the store buffer and are
eventually flushed to the memory hierarchy, but all modifications that
reach the cache hierarchy have a consistent view of memory created by
the cache coherency protocol (in ARM's case MOESI[1]): only one core
can change a cache line at a time while it has exclusive access (with
some optimisations, owner mode, snooping, etc but AFAIK that doesn't
change the basic consistency).  While modifying memory contents, the
core has the most up to date contents of the cache line and it doesn't
touch any other bytes.  What I mean is that cores don't make changes
to bytes that you didn't explicitly change with a store instruction,
it's just the order and timing that the contents of the store buffer
hits the memory system that is in question.  So I guess the key thing
here is that there is more than one store.

If the Apple's fork of LLVM 17 generated the same code as the MacPorts
LLVM 17 output I disassembled above, then I guess this double store
with the Apple toolchain until recently, but if you upgraded Xcode in
the past few months[2] then it would have stopped happening, which is
probably why Alexander had to use Homebrew's clang 17 to show the
problem.  (For the record, what I disassembled earlier was from
MacPorts's package for clang 17.0.6, so I guess it's remotely possible
that some difference between Homebrew's package and MacPorts' package
is causing my attempts to reproduce it to fail; it claims to be the
same version though and I don't expect them to patch anything.).

I don't understand these things, but maybe this commit or something
near it caused the change:

https://github.com/llvm/llvm-project/blob/b64499c6ef73881378b7e767b31c0fa06e3fe7b4/llvm/lib/Transforms/Scalar/EarlyCSE.cpp

> I tried to create small test reproducing this data flow.
> But still failed to find and problem with this test.
> May be somebody can change this test to be more similar with real AIO
> dataflows?

(lldb) dis --name update_state
a.out`update_state:
a.out[0x100000548] <+0>:  sub    sp, sp, #0x10
a.out[0x10000054c] <+4>:  str    x0, [sp, #0x8]
a.out[0x100000550] <+8>:  str    w1, [sp, #0x4]
a.out[0x100000554] <+12>: dmb    ish
a.out[0x100000558] <+16>: ldrb   w10, [sp, #0x4]
a.out[0x10000055c] <+20>: ldr    x9, [sp, #0x8]
a.out[0x100000560] <+24>: ldr    w8, [x9]
a.out[0x100000564] <+28>: and    w8, w8, #0xffffff00
a.out[0x100000568] <+32>: orr    w8, w8, w10
a.out[0x10000056c] <+36>: str    w8, [x9]           <--- single store
a.out[0x100000570] <+40>: add    sp, sp, #0x10
a.out[0x100000574] <+44>: ret

I think you need to add the following uint8 member to the struct, so
that it has to store 24 bits with two stores (assuming you are testing
with LLVM 17 and not LLVM 19 which chooses to clobber only the first
16 bits anyway for some reason).  In that code it's storing 32 bits at
once so although it's still clobbering op, it's doing it atomically so
nothing goes wrong.

If that doesn't work, maybe try adding a random number of other
stores, to try to find the timing that overflows the store queue
between the two stores, or maybe it has to do with delivering signals
or context switches that cause memory barriers, or who knows...  Hmm,
yeah maybe it's actually context switches that are doing this, as
Alexander is running a huge number of processes.

[1] https://en.wikipedia.org/wiki/MOESI_protocol
[2] https://bfbot.cputube.org/project/apple_xcode/

Re: Non-reproducible AIO failure

От

Nico Williams

Дата:

25 августа, 02:42:29

On Mon, Aug 25, 2025 at 10:43:21AM +1200, Thomas Munro wrote:
> On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> > In theory even replacing bitfield with in should not
> > avoid race condition, because they are still shared the same cache line.
>
> I'm no expert in this stuff, but that's not my understanding of how it
> works.  Plain stores to normal memory go into the store buffer and are
> eventually flushed to the memory hierarchy, but all modifications that
> reach the cache hierarchy have a consistent view of memory created by
> the cache coherency protocol (in ARM's case MOESI[1]): only one core
> can change a cache line at a time while it has exclusive access (with
> some optimisations, owner mode, snooping, etc but AFAIK that doesn't
> change the basic consistency).  While modifying memory contents, the
> core has the most up to date contents of the cache line and it doesn't
> touch any other bytes.  What I mean is that cores don't make changes
> to bytes that you didn't explicitly change with a store instruction,
> it's just the order and timing that the contents of the store buffer
> hits the memory system that is in question.  So I guess the key thing
> here is that there is more than one store.

I think the issue is that if the compiler decides to coalesce what we
think of as distinct (but neighboring) bitfields, then when you update
one of the bitfields you could be updating the other with stale data
from an earlier read where the cached stale data is cached in a
_register_.  Thus the fact that the cache line should have the most up
to date data for that other field is irrelevant because the stale data
is in a _register_.

The very fact that this can happen, that the C specs allow it, argues
that one must never have adjacent distinct (for some value of
"distinct") bitfields for anything that requires atomics.

IIRC we're talking about this:

```
struct PgAioHandle
{
        /* all state updates should go through pgaio_io_update_state() */
        PgAioHandleState state:8;

        /* what are we operating on */
        PgAioTargetID target:8;

        /* which IO operation */
        PgAioOp         op:8;

        /* bitfield of PgAioHandleFlags */
        uint8           flags;
```

To us humans each of the first three fields are distinct because sure
they are bitfields, and bitfields can be coalesced, but their _types_
are different.  But from the compiler's perspective the types are all
enums with small positive values, therefore they might as well all be
`int`, like this:

```
struct PgAioHandle
{
        /* all state updates should go through pgaio_io_update_state() */
        int              state:8;

        /* what are we operating on */
        int           target:8;

        /* which IO operation */
        int             op:8;

        /* bitfield of PgAioHandleFlags */
        uint8           flags;
```

and now you can see that a write to any one of those three bytes can
also be a write to either or both of the other two.  Although... it
seems unlikely that a write to `state` could write to `op` given that a
32-bit write would also write `flags`, and `flags` is not a bitfields
(though its contents should be thought of as a bitfield).

It's also hard to see how `op` gets written to when writing to `target`
either because that crosses a 16-bit boundary, and again a 32-bit write
would be wrong.

So even if the compiler could merge these three bitfields it should not
be able to ever write to `op` when writing to the other two unless there
were a compiler bug that also caused the compiler to coalesce `flags`
into these three bitfields.

The fact that a compiler can do this is... well, I think to some degree
desirable, but in this case horrifying.

BTW, I _think_ you can keep the adjacent enum bitfields and prevent this
with this [slightly ugly?] hack:

```
struct PgAioHandle
{
        /* all state updates should go through pgaio_io_update_state() */
        struct { PgAioHandleState v:8; } state;

        /* what are we operating on */
        struct { PgAioTargetID v:8; } target;

        /* which IO operation */
        struct { PgAioOp v:8; } op;

        /* bitfield of PgAioHandleFlags */
        uint8           flags;
```

because wrapping each bitfield in a struct makes it so the compiler
doesn't see each of these adjacent fields of PgAioHandle as being
bitfields when coding accesses to them.

Nico
--

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

25 августа, 04:52:57

On Mon, Aug 25, 2025 at 11:42 AM Nico Williams <nico@cryptonector.com> wrote:
> I think the issue is that if the compiler decides to coalesce what we
> think of as distinct (but neighboring) bitfields, then when you update
> one of the bitfields you could be updating the other with stale data
> from an earlier read where the cached stale data is cached in a
> _register_.  Thus the fact that the cache line should have the most up
> to date data for that other field is irrelevant because the stale data
> is in a _register_.

Yeah.  It so happens that the other bitfield members are never updated
by other backends in our particular usage, or another way to put, in
states after STAGED which are entirely managed by the life-long owner
of the PgAioHandle, and we have paired memory barriers to make sure we
always see changes to state after all other data changes, so in our
usage we'd always reload that, both at compiler barrier and memory
barrier level (the latter being the thing that is being defeated by
this multiple store re-writing neighbouring bitfields thing, it's not
a stale-value-in-register problem).  But I'm sure you're right in
general.

> The very fact that this can happen, that the C specs allow it, argues
> that one must never have adjacent distinct (for some value of
> "distinct") bitfields for anything that requires atomics.

C doesn't even say this should be safe for plain integers, only
sig_atomic_t and the newer _Atomic types.  (I guess word addressable
architectures like Alpha and Cray that needed bitfield-esque
load-modify-store for small types at one end, and microcontrollers
might have the opposite problem.  C11 lets you query that but you have
to go through the appropriate types.)

>         struct { PgAioHandleState v:8; } state;

This preserves type safety and compiles to strb two properties we
want, but it seems to waste space (look at the offsets for the
stores):

a.out[0x1000005f8] <+140>: ldr    x8, [sp, #0x8]
a.out[0x1000005fc] <+144>: strb   wzr, [x8, #0x8]
a.out[0x100000600] <+148>: ldr    x8, [sp, #0x8]
a.out[0x100000604] <+152>: strb   wzr, [x8, #0x4]

I was wracking my brain for ways to keep the desired type safety,
atomicity, AND storage size.  The best I could come up with was GCC
pack attributes, but I couldn't figure out how to do that in MSVC
without using C++, so I gave up.

The only other thing I could think of is that if you put  padding
after the bitfields, it should be able to generate a single 32 bit
store (as Konstantin's race.c shows).  But that'd be baking in shaky
knowledge the possible dirtiness other bitfields that I mentioned in
parentheses above a little too hard, on top of a new shaky assumption
about how compilers implement bitfield merging and assignment, so I
abandoned that thought quickly too.

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

25 августа, 05:41:18

On Mon, Aug 25, 2025 at 1:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >         struct { PgAioHandleState v:8; } state;
>
> This preserves type safety and compiles to strb two properties we
> want, but it seems to waste space (look at the offsets for the
> stores):
>
> a.out[0x1000005f8] <+140>: ldr    x8, [sp, #0x8]
> a.out[0x1000005fc] <+144>: strb   wzr, [x8, #0x8]
> a.out[0x100000600] <+148>: ldr    x8, [sp, #0x8]
> a.out[0x100000604] <+152>: strb   wzr, [x8, #0x4]

Sorry, I didn't make that very clear: that's open source clang 17
compiling assignment of zero to two neighbouring wrapped bitfields
with your trick.  Probably easier to look at the struct layout with
pahole or printf offsetof(...) or sizeof() to see that PgAioHandle
grows by 9 bytes of padding, something Andres obviously felt pretty
strongly about if he felt the need to summon bitfields...

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

25 августа, 05:42:35

On Mon, Aug 25, 2025 at 2:41 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Aug 25, 2025 at 1:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > >         struct { PgAioHandleState v:8; } state;
> >
> > This preserves type safety and compiles to strb two properties we
> > want, but it seems to waste space (look at the offsets for the
> > stores):
> >
> > a.out[0x1000005f8] <+140>: ldr    x8, [sp, #0x8]
> > a.out[0x1000005fc] <+144>: strb   wzr, [x8, #0x8]
> > a.out[0x100000600] <+148>: ldr    x8, [sp, #0x8]
> > a.out[0x100000604] <+152>: strb   wzr, [x8, #0x4]
>
> Sorry, I didn't make that very clear: that's open source clang 17
> compiling assignment of zero to two neighbouring wrapped bitfields
> with your trick.  Probably easier to look at the struct layout with
> pahole or printf offsetof(...) or sizeof() to see that PgAioHandle
> grows by 9 bytes of padding, something Andres obviously felt pretty
> strongly about if he felt the need to summon bitfields...

(Errm, surely more than 9 due to alignment of the following stuff, I
was just looking at a truncated test struct with only those bits...)

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

26 августа, 03:37:17

Hi,

I'm a bit confused by this focus on bitfields - both Alexander and Konstantin
stated they could reproduce the issue without the bitfields.

But we have observed the generated code being pretty grotty and it's caused
more than enough confusion - so let's just replace them with plain uint8's and
cast in switches.

> I think the issue is that if the compiler decides to coalesce what we
> think of as distinct (but neighboring) bitfields, then when you update
> one of the bitfields you could be updating the other with stale data
> from an earlier read where the cached stale data is cached in a
> _register_.  Thus the fact that the cache line should have the most up
> to date data for that other field is irrelevant because the stale data
> is in a _register_.
> 
> The very fact that this can happen, that the C specs allow it, argues
> that one must never have adjacent distinct (for some value of
> "distinct") bitfields for anything that requires atomics.

I think the barriers in place should prevent that.

Greetings,

Andres Freund

Вложения

v1-0001-aio-Stop-using-enum-bitfields-due-to-bad-code-gen.patch

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

26 августа, 03:44:58

Hi,

On 2025-08-25 10:43:21 +1200, Thomas Munro wrote:
> On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> > In theory even replacing bitfield with in should not
> > avoid race condition, because they are still shared the same cache line.
> 
> I'm no expert in this stuff, but that's not my understanding of how it
> works.  Plain stores to normal memory go into the store buffer and are
> eventually flushed to the memory hierarchy, but all modifications that reach
> the cache hierarchy have a consistent view of memory created by the cache
> coherency protocol (in ARM's case MOESI[1]): only one core can change a
> cache line at a time while it has exclusive access (with some optimisations,
> owner mode, snooping, etc but AFAIK that doesn't change the basic
> consistency).

From what I understand that's not quite right - the whole point of the store
buffer is to avoid the latency hit of having to wait for cacheline
ownership. Instead the write is done into the store buffer, notably on a
granularity *smaller* than the cacheline (it has to be smaller, because we
don't have the contents of the cacheline).  The reason that that is somewhat
OK from a coherency perspective is that this is done only for pure writes, not
read-modify-write operations. As the write overwrites the prior contents of
the memory, it is "ok" to do the write without waiting for cacheline ownership
ahead of time.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

26 августа, 06:21:34

On Tue, Aug 26, 2025 at 12:45 PM Andres Freund <andres@anarazel.de> wrote:
> On 2025-08-25 10:43:21 +1200, Thomas Munro wrote:
> > On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> > > In theory even replacing bitfield with in should not
> > > avoid race condition, because they are still shared the same cache line.
> >
> > I'm no expert in this stuff, but that's not my understanding of how it
> > works.  Plain stores to normal memory go into the store buffer and are
> > eventually flushed to the memory hierarchy, but all modifications that reach
> > the cache hierarchy have a consistent view of memory created by the cache
> > coherency protocol (in ARM's case MOESI[1]): only one core can change a
> > cache line at a time while it has exclusive access (with some optimisations,
> > owner mode, snooping, etc but AFAIK that doesn't change the basic
> > consistency).
>
> From what I understand that's not quite right - the whole point of the store
> buffer is to avoid the latency hit of having to wait for cacheline
> ownership. Instead the write is done into the store buffer, notably on a
> granularity *smaller* than the cacheline (it has to be smaller, because we
> don't have the contents of the cacheline).  The reason that that is somewhat
> OK from a coherency perspective is that this is done only for pure writes, not
> read-modify-write operations. As the write overwrites the prior contents of
> the memory, it is "ok" to do the write without waiting for cacheline ownership
> ahead of time.

*confused*  Where's the contradiction?

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

26 августа, 06:45:25

On Tue, Aug 26, 2025 at 12:37 PM Andres Freund <andres@anarazel.de> wrote:
> I'm a bit confused by this focus on bitfields - both Alexander and Konstantin
> stated they could reproduce the issue without the bitfields.

Konstantin's message all seem to say it *did* fix it?

But I do apologise for working through the same theory and analysis as
Konstantin had already done earlier.  I somehow had the impression
this topic was considered closed, and was reacting to Alexander's
latest email as if it was a new variant of the problem when I wrote
that.  ETHREADTOOLONG.

> But we have observed the generated code being pretty grotty and it's caused
> more than enough confusion - so let's just replace them with plain uint8's and
> cast in switches.

+1

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

26 августа, 15:01:03

Hi,

On 2025-08-26 15:21:34 +1200, Thomas Munro wrote:
> On Tue, Aug 26, 2025 at 12:45 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-08-25 10:43:21 +1200, Thomas Munro wrote:
> > > On Mon, Aug 25, 2025 at 6:11 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> > > > In theory even replacing bitfield with in should not
> > > > avoid race condition, because they are still shared the same cache line.
> > >
> > > I'm no expert in this stuff, but that's not my understanding of how it
> > > works.  Plain stores to normal memory go into the store buffer and are
> > > eventually flushed to the memory hierarchy, but all modifications that reach
> > > the cache hierarchy have a consistent view of memory created by the cache
> > > coherency protocol (in ARM's case MOESI[1]): only one core can change a
> > > cache line at a time while it has exclusive access (with some optimisations,
> > > owner mode, snooping, etc but AFAIK that doesn't change the basic
> > > consistency).
> >
> > From what I understand that's not quite right - the whole point of the store
> > buffer is to avoid the latency hit of having to wait for cacheline
> > ownership. Instead the write is done into the store buffer, notably on a
> > granularity *smaller* than the cacheline (it has to be smaller, because we
> > don't have the contents of the cacheline).  The reason that that is somewhat
> > OK from a coherency perspective is that this is done only for pure writes, not
> > read-modify-write operations. As the write overwrites the prior contents of
> > the memory, it is "ok" to do the write without waiting for cacheline ownership
> > ahead of time.
> 
> *confused*  Where's the contradiction?

Maybe I just misunderstood you. I was reading your message as saying that MESI
governs all writes to cachelines, but that's not really true for writes going
through the store buffer... You can write to a cacheline that currently is
exclusively owned by another core.

Greetings,

Andres Freund

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

26 августа, 16:59:54

On 26/08/2025 3:37 AM, Andres Freund wrote:

Hi,

I'm a bit confused by this focus on bitfields - both Alexander and Konstantin
stated they could reproduce the issue without the bitfields.

Sorry if I am not correct, but it seems that the problem was never reproduced with replaced bitfields.
I have once again looked through all this thread (it is really long;) and find just one place when you wrote, that you was able to reproduce the problem without bitfields. But as far as understand it was the different problem which was fixed by your patch v1-0001-aio-Add-missing-memory-barrier-when-waiting-for-I.patch

Still it is not quite clear to me how bitfields can cause this issue.
Yes, fitfields are updated all together - when you update one field, all other bitfields are also rewritten.
But actually compiler (at least some versions) does quite straightforward thing (especially without optimization):
1. Loads all bitfields.
2. Update required bitfield.
3. Stores all bitfields.

It is very unlikely that some stale version is stored in registers - without -O optimizer is using registers only locally and doesn't try to keep some variables in registers. Also not clear how we can read some stale value from memory if we perform write barrier before updating status (in pgaio_io_update_state).
It it impossible that `op` and `state` fields belongs to different cache lines because PgAioHandle is aligned on 8. So doesn;t matter whether there are bitfields or bytes, this fields should be propagated between cores all together.
Certainly, extraction of bitfields is not an atomic operations (as we see in assembler, it is done using two loads).
So we can observe inconsistent state of this two fields.

Let's consider fields `op` and `state`. Let's their initial state is `op=1`m `state=1`.
The we do update `op=2` and call pgaio_io_update_state(2). Some other task can see (op=1, state=1) or (op=2, state=1) or (op=2, state=2) but not (op=1, state=2).

The problem can happen only if some other task tries to update `op` without prior checking for `state`.
In this case we can get (op=3,state=1) as well as (op=2,state=2).
But `op` field is updated in just two places: pgaio_io_reclaim and pgaio_io_stage. And in both cases we first check 'state' value.

What makes me concern is what happen in case of branch misprediction.
Assume that we have code:

if (state == 1) {
op = 2;
} else {
op = 3;
}

It seems to be absolutely legal, but what will happen in case if with speculative execution processor start executing first branch and does assignment op=2, but then state is loaded and used to be not equal to 1, so processor has to undone this execution. But for some moment we we can have state (state=0 and op=2) which can be observed by other task and cause assertion failure.
Is it really possible?

But we have observed the generated code being pretty grotty and it's caused
more than enough confusion - so let's just replace them with plain uint8's and
cast in switches.

May be I am wrong, but it seems to me that after add-moissing-memory-barrier patch was applied nobody reproduced assertion failure with replaced bitfields.

Re: Non-reproducible AIO failure

От

Ken Marshall

Дата:

26 августа, 17:15:13

On Tue, Aug 26, 2025 at 04:59:54PM +0300, Konstantin Knizhnik wrote:
> 
> > But we have observed the generated code being pretty grotty and it's caused
> > more than enough confusion - so let's just replace them with plain uint8's and
> > cast in switches.
> 
> +1
> 
> May be I am wrong, but it seems to me that after add-moissing-memory-barrier
> patch was applied nobody reproduced assertion failure with replaced
> bitfields.
> 

Hi,

I am just a lurker but that was what I saw happen too. No problem with
the replaced bitfields.

Regards,
Ken

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

28 августа, 02:08:20

Hi,

On 2025-08-26 16:59:54 +0300, Konstantin Knizhnik wrote:
> On 26/08/2025 3:37 AM, Andres Freund wrote:
> > Hi,
> > 
> > I'm a bit confused by this focus on bitfields - both Alexander and Konstantin
> > stated they could reproduce the issue without the bitfields.
> 
> 
> Sorry if I am not correct, but it seems that the problem was never
> reproduced with replaced bitfields.

I think unfortunately you and Thomas that are correct, and me that somehow got
confused. I went through the entire thread and I'm not sure how I got the
impression that the problem could be reproduced without use of the
bitfields. Sorry for that!

I'll push the patch to remove the bitfields after adjusting the commit message
somewhat.


> Still it is not quite clear to me how bitfields can cause this issue.

Same.


> Yes, fitfields are updated all together - when you update one field, all
> other bitfields are also rewritten.
> But actually compiler (at least some versions) does quite straightforward
> thing (especially without optimization):
> 1. Loads all bitfields.
> 2. Update required bitfield.
> 3. Stores all bitfields.
> 
> It is very unlikely that some stale version is stored in registers - without
> -O optimizer is using registers only locally and doesn't try to keep some
> variables in registers. Also not clear how we can read some stale value from
> memory if we perform write barrier before updating status (in
> pgaio_io_update_state).
> It it impossible that `op` and `state` fields belongs to different cache
> lines because PgAioHandle is aligned on 8. So doesn;t matter whether there
> are bitfields or bytes, this fields should be propagated between cores all
> together.
> Certainly, extraction of bitfields is not an atomic operations (as we see in
> assembler, it is done using two loads).
> So we can observe inconsistent state of this two fields.
> 
> Let's consider fields `op` and `state`. Let's their initial state is `op=1`m
> `state=1`.
> The we do update `op=2` and call pgaio_io_update_state(2). Some other task
> can see (op=1, state=1)  or (op=2, state=1) or (op=2, state=2) but not
> (op=1, state=2).
> 
> The problem can happen only if some other task tries to update `op` without
> prior checking for `state`.

Which isn't possible here - the handle was idle (i.e. not accessed by another
backend) just before the assertion failure and is being initialized at that
time. Which all happens within one process, without any concurrent accesses.


> In this case we can get (op=3,state=1) as well as (op=2,state=2).
> But `op` field is updated in just two places: pgaio_io_reclaim and
> pgaio_io_stage. And in both cases we first check 'state' value.
> 
> What makes me concern is what happen in case of branch misprediction.
> Assume that we have code:
> 
> if (state == 1) {
>     op = 2;
> } else {
>     op = 3;
> }
> 
> It seems to be absolutely legal, but what will happen in case if with
> speculative execution processor start executing first branch and does
> assignment op=2, but then state is loaded and used to be not equal to 1, so
> processor has to undone this execution. But for some moment we we can
> have state (state=0 and op=2) which can be observed by other task and cause
> assertion failure.
> Is it really possible?

No, the CPU will not retire ("commit") instructions that are still dependent
on speculatively executed instructions. And it won't do work that can't be
undone (like actually writing memory or doing system calls) as part of
speculative execution. Therefore the result of speculatively executed
instructions aren't visible to other cores [1].

Greetings,

Andres Freund


[1] Leaving aside side-channel attacks like spectre / meltdown and a lot of
    subsequent issues, but those are observing side channels like presence of
    a cacheline in the cpu cache, not actually speculatively written memory

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

28 августа, 02:16:29

On 2025-08-27 19:08:20 -0400, Andres Freund wrote:
> I'll push the patch to remove the bitfields after adjusting the commit message
> somewhat.

And done.

Re: Non-reproducible AIO failure

От

Thomas Munro

Дата:

28 августа, 03:08:17

On Thu, Aug 28, 2025 at 11:08 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-08-26 16:59:54 +0300, Konstantin Knizhnik wrote:
> > Still it is not quite clear to me how bitfields can cause this issue.
>
> Same.

Here's what I speculated after reading the generated asm[1]:

"Could it be that the store buffer was flushed between the two stores,
pgaio_io_was_recycled() saw the new state, pgaio_io_reclaim() assigned
ioh->op = PGAIO_OP_INVALID and it was flushed to L1, and then finally
the time travelling op value was flushed and clobbered it?"

To put it in terms of cache line modes and store buffer operations,
the IO worker does this, omitting uninteresting instructions:

postgres[0x100687fd0] <+384>: ldrh w8, [x9]          ; load state, target
postgres[0x100687fd4] <+388>: ldrb   w11, [x9, #0x2] ; load op
... build new state + target in w10 ....
postgres[0x100687fec] <+412>: strh   w10, [x9]       ; state <- COMPLETED_SHARED
postgres[0x100687ff0] <+416>: strb   w8, [x9, #0x2]  ; op <- PGAIO_OP_READV

My speculation was that the two stores hit memory in separate store
buffer flushes for whatever reason (something to do with a context
switch, or maybe the store buffer is just full, or... who knows,
doesn't matter).  Cache line exclusive mode is released and
re-acquired in between the two flushes.  I know that merely executing
a store instruction doesn't acquire cache line exclusive mode, I'm
talking specifically about store buffer flush operations here, which
must.

In that window, pgaio_io_wait() in the owner backend sees state ==
COMPLETED_SHARED and runs pgaio_io_reclaim() which then stores and
flushes op <- PGAIO_OP_IDLE.  The IO worker core's second flush
operation is delayed because it has to wait to re-acquire the cache
line in exclusive mode, ie wait for the owner backend to release it,
and then  it clobbers op with the old value, later producing this
assertion failure in the owner backend:

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c",
Line: 167, PID: 56420

To put it another way, this could only work correctly if the IO worker
did: store op (unnecessarily because bitfield blah blah), dmb ish,
store state, pairing with the owner's dmb ishld, load state.  But we
have no control over that, the fact that the compiler generated two
stores for our byte-sized assignment is invisible from C.

Or if this sequence is not possible, what exactly prevents it?

[1] https://www.postgresql.org/message-id/CA%2BhUKGK6ujMT5myrEkgQ%2Bn-N3rquZA4haHfJszQVe4ofHd6z6A%40mail.gmail.com

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

28 августа, 15:57:27

On 28/08/2025 3:08 AM, Thomas Munro wrote:
> On Thu, Aug 28, 2025 at 11:08 AM Andres Freund <andres@anarazel.de> wrote:
>> On 2025-08-26 16:59:54 +0300, Konstantin Knizhnik wrote:
>>> Still it is not quite clear to me how bitfields can cause this issue.
>> Same.
> Here's what I speculated after reading the generated asm[1]:
>
> "Could it be that the store buffer was flushed between the two stores,
> pgaio_io_was_recycled() saw the new state, pgaio_io_reclaim() assigned
> ioh->op = PGAIO_OP_INVALID and it was flushed to L1, and then finally
> the time travelling op value was flushed and clobbered it?"
>
> To put it in terms of cache line modes and store buffer operations,
> the IO worker does this, omitting uninteresting instructions:
>
> postgres[0x100687fd0] <+384>: ldrh w8, [x9]          ; load state, target
> postgres[0x100687fd4] <+388>: ldrb   w11, [x9, #0x2] ; load op
> ... build new state + target in w10 ....
> postgres[0x100687fec] <+412>: strh   w10, [x9]       ; state <- COMPLETED_SHARED
> postgres[0x100687ff0] <+416>: strb   w8, [x9, #0x2]  ; op <- PGAIO_OP_READV
>
> My speculation was that the two stores hit memory in separate store
> buffer flushes for whatever reason (something to do with a context
> switch, or maybe the store buffer is just full, or... who knows,
> doesn't matter).  Cache line exclusive mode is released and
> re-acquired in between the two flushes.  I know that merely executing
> a store instruction doesn't acquire cache line exclusive mode, I'm
> talking specifically about store buffer flush operations here, which
> must.
>
> In that window, pgaio_io_wait() in the owner backend sees state ==
> COMPLETED_SHARED and runs pgaio_io_reclaim() which then stores and
> flushes op <- PGAIO_OP_IDLE.  The IO worker core's second flush
> operation is delayed because it has to wait to re-acquire the cache
> line in exclusive mode, ie wait for the owner backend to release it,
> and then  it clobbers op with the old value, later producing this
> assertion failure in the owner backend:
>
> TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c",
> Line: 167, PID: 56420
>
> To put it another way, this could only work correctly if the IO worker
> did: store op (unnecessarily because bitfield blah blah), dmb ish,
> store state, pairing with the owner's dmb ishld, load state.  But we
> have no control over that, the fact that the compiler generated two
> stores for our byte-sized assignment is invisible from C.

I am convinced by your speculations:)
And it explains why replacing of bitfields with uint8 eliminates the 
problem despite to the fact that it is not possible to update individual 
byte at RISC and cache line.  The critical thing here is that op and 
state are updated independently.
If state field is byte or "good" version of clang (or enabling 
optimization) eliminates extracting and storing redundant bitfields,
then the problem describe above is not possible. Even through `state` 
and `op` can be updated together but it is atomic update!

No other process can see new value of `state`, do assignment to `op` and 
some time later receive delayed update for `op`, which overwrites new value.

If it is true, then patch proposed by Andres should solve the issue (and 
simplify generated code).

Re: Non-reproducible AIO failure

От

Dmitry Mityugov

Дата:

03 сентября, 20:37:40

Quite inspiring discussion. The patch is brilliantly good but it adds a bunch of explicit type casts, and it's not always easy to remember what cast to use in a particular case, and that may eventually lead to errors in the future. Just wanted to add that when 64-bit code is generated, uint8s are probably aligned to 64-bit boundaries for structures that contain other 64-bit members like pointers, and bit fields are not (and they may not have an address at all). To align bit fields properly, anonymous bit fields probably can be used, or unions containing the bitfields together with uint8s, to avoid the explicit casts, like in this example:

struct Foo {
union {
char a:8;
char aa;
};
char b:8;
int* c;
};

Regards,

On Thu, Aug 28, 2025 at 3:57 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:

On 28/08/2025 3:08 AM, Thomas Munro wrote:
> On Thu, Aug 28, 2025 at 11:08 AM Andres Freund <andres@anarazel.de> wrote:
>> On 2025-08-26 16:59:54 +0300, Konstantin Knizhnik wrote:
>>> Still it is not quite clear to me how bitfields can cause this issue.
>> Same.
> Here's what I speculated after reading the generated asm[1]:
>
> "Could it be that the store buffer was flushed between the two stores,
> pgaio_io_was_recycled() saw the new state, pgaio_io_reclaim() assigned
> ioh->op = PGAIO_OP_INVALID and it was flushed to L1, and then finally
> the time travelling op value was flushed and clobbered it?"
>
> To put it in terms of cache line modes and store buffer operations,
> the IO worker does this, omitting uninteresting instructions:
>
> postgres[0x100687fd0] <+384>: ldrh w8, [x9] ; load state, target
> postgres[0x100687fd4] <+388>: ldrb w11, [x9, #0x2] ; load op
> ... build new state + target in w10 ....
> postgres[0x100687fec] <+412>: strh w10, [x9] ; state <- COMPLETED_SHARED
> postgres[0x100687ff0] <+416>: strb w8, [x9, #0x2] ; op <- PGAIO_OP_READV
>
> My speculation was that the two stores hit memory in separate store
> buffer flushes for whatever reason (something to do with a context
> switch, or maybe the store buffer is just full, or... who knows,
> doesn't matter). Cache line exclusive mode is released and
> re-acquired in between the two flushes. I know that merely executing
> a store instruction doesn't acquire cache line exclusive mode, I'm
> talking specifically about store buffer flush operations here, which
> must.
>
> In that window, pgaio_io_wait() in the owner backend sees state ==
> COMPLETED_SHARED and runs pgaio_io_reclaim() which then stores and
> flushes op <- PGAIO_OP_IDLE. The IO worker core's second flush
> operation is delayed because it has to wait to re-acquire the cache
> line in exclusive mode, ie wait for the owner backend to release it,
> and then it clobbers op with the old value, later producing this
> assertion failure in the owner backend:
>
> TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c",
> Line: 167, PID: 56420
>
> To put it another way, this could only work correctly if the IO worker
> did: store op (unnecessarily because bitfield blah blah), dmb ish,
> store state, pairing with the owner's dmb ishld, load state. But we
> have no control over that, the fact that the compiler generated two
> stores for our byte-sized assignment is invisible from C.

I am convinced by your speculations:)
And it explains why replacing of bitfields with uint8 eliminates the
problem despite to the fact that it is not possible to update individual
byte at RISC and cache line. The critical thing here is that op and
state are updated independently.
If state field is byte or "good" version of clang (or enabling
optimization) eliminates extracting and storing redundant bitfields,
then the problem describe above is not possible. Even through `state`
and `op` can be updated together but it is atomic update!

No other process can see new value of `state`, do assignment to `op` and
some time later receive delayed update for `op`, which overwrites new value.

If it is true, then patch proposed by Andres should solve the issue (and
simplify generated code).

Dmitry

Re: Non-reproducible AIO failure

От

Konstantin Knizhnik

Дата:

03 сентября, 21:50:42

On 03/09/2025 8:37 PM, Dmitry Mityugov wrote:
> Quite inspiring discussion. The patch is brilliantly good but it adds 
> a bunch of explicit type casts, and it's not always easy to remember 
> what cast to use in a particular case, and that may eventually lead to 
> errors in the future. Just wanted to add that when 64-bit code is 
> generated, uint8s are probably aligned to 64-bit boundaries for 
> structures that contain other 64-bit members like pointers, and bit 
> fields are not (and they may not have an address at all). To align bit 
> fields properly, anonymous bit fields probably can be used, or unions 
> containing the bitfields together with uint8s, to avoid the explicit 
> casts, like in this example:
>
> struct Foo {
>     union {
>         char a:8;
>         char aa;
>     };
>     char b:8;
>     int* c;
> };
>
> Regards,
>

Size of PgAioHandle is144 bytes. I wonder how critical for us is to save 
9 bytes for it (3 bytes vs 3 integers)?
Why not to use normal enums instead of bitfields and uint8 with type casts?

Re: Non-reproducible AIO failure

От

Andres Freund

Дата:

03 сентября, 22:42:06

Hi,

On 2025-09-03 21:50:42 +0300, Konstantin Knizhnik wrote:
> On 03/09/2025 8:37 PM, Dmitry Mityugov wrote:
> Size of PgAioHandle is144 bytes. I wonder how critical for us is to save 9
> bytes for it (3 bytes vs 3 integers)?

Not that it makes that huge a difference, but due to alignment considerations
the size increase would be 12 bytes, not 9. Maybe that could be addressed by
more reordering though.

> Why not to use normal enums instead of bitfields and uint8 with type casts?

There are a lot of PgAioHandles when max_connections is large. In addition,
some workloads (e.g. network attached storage), one might need to increase
io_max_concurrency to actually be able to saturate storage.  An ~8% increase
in size isn't nothing when the baseline isn't small.

Greetings,

Andres Freund

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Non-reproducible AIO failure

Вложения

Вложения

Вложения

Вложения