Обсуждение: BUG #18009: Postgres Recovery not happening

Поиск
Список
Период
Сортировка

BUG #18009: Postgres Recovery not happening

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      18009
Logged by:          Vamshi krishna
Email address:      tvk1271@gmail.com
PostgreSQL version: 15.2
Operating system:   AIX
Description:

Basically what i observe is ..Abrupt down of the system, Postgres database
is unable to find its checkpoint record.

2023-06-28 03:17:33.051 CDT|649bec9d.cc01b6|LOG:  database system was shut
down at 2023-06-28 01:20:17 CDT
2023-06-28 03:17:33.054 CDT|649bec9d.cc01b6|LOG:  unexpected pageaddr
0/1000000 in log segment 000000010000000000000003, offset 0 <==
2023-06-28 03:17:33.054 CDT|649bec9d.cc01b6|LOG:  invalid primary checkpoint
record
2023-06-28 03:17:33.054 CDT|649bec9d.cc01b6|PANIC:  could not locate a valid
checkpoint record
2023-06-28 03:17:33.058 CDT|649bec9c.c6015c|LOG:  startup process (PID
13369782) was terminated by signal 6: IOT/Abort trap

I verified in the OS side, we are not observing explicit fsync() call post
writing to this file "000000010000000000000003". I suspect this because 
the writes are present in the VMM page cache and not getting synced up to
the disk. Post restart of my node, DB is not coming up.
Can anyone help me figure out the underlying issue.  reset-wal option is
there, but i am afraid it will lead to data loss of the database. This seems
easily recreating on my node.

Thanks
 Vamshi.


Re: BUG #18009: Postgres Recovery not happening

От
Thomas Munro
Дата:
On Sat, Jul 1, 2023 at 2:29 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
> Operating system:   AIX

> I verified in the OS side, we are not observing explicit fsync() call post
> writing to this file "000000010000000000000003". I suspect this because
> the writes are present in the VMM page cache and not getting synced up to
> the disk. Post restart of my node, DB is not coming up.

We don't usually call fsync() for WAL files (except when initially
creating them), we use various methods controlled by the setting
wal_sync_method[1] and on AIX we default to open_datasync (that means
we open the WAL with O_DSYNC and then we expect pwrite() to return
only after the data is durably on disk).  Have you changed that
setting?  When you say "abrupt shutdown", do you mean power loss?
Perhaps you could investigate what O_DSYNC does with respect to write
caches on your system and what your disk controllers etc promise about
power loss.  Can you reproduce this problem with a fresh cluster, and
does it go away if you use wal_sync_method=fdatasync?

It doesn't seem that likely to me that expensive AIX systems would
fail at sensible volatile cache management, so that's a long shot, but
we know that some other systems can fail in that way (eg Windows on
consumer storage), and I'm pretty sure they can fail exactly as you
described because the control file is fsync'd while the WAL is only
written to volatile drive caches.

[1] https://www.postgresql.org/docs/15/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS



Re: BUG #18009: Postgres Recovery not happening

От
Vamshikrishna T
Дата:
Hi Thomas,

Thank you for your immediate response, This is not on the default file system of AIX ( JFS2 ), but on a specific special purpose file system. looks like (  open_datasync ) O_DSYNC is causing the issue which seems to be not honoured on this file system. Yeah abrupt shutdown can be treated as power loss.

I used wal_sync_method=fdatasync, ( although i am not sure the problem vanished or not, because it is not getting reproduced )  but what i observe
is there is an immediate explicit sync calls to the files present in ../pg_wal/ directory post write call completions. 

2023-07-04 03:36:04.259 CDT|64a3d9b4.8701bc|LOG:  checkpoint starting: time
2023-07-04 03:36:06.263 CDT|64a3d9b4.8701bc|LOG:  checkpoint complete: wrote 21 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=1.925 s, sync=0.053 s, total=2.005 s; sync files=20, longest=0.019 s, average=0.003 s; distance=32 kB, estimate=32 kB

I can see between these two time interval, write caches are cleared. With  wal_sync_method=fdatasync tunable, Is it safe to assume all the Postgres DB writes during checkpointing are called via explicit call to  OS level sync() or fsync(), irrespective of O_DSYNC during open?.

Thanks
 Vamshi.

On Mon, 3 Jul 2023 at 03:57, Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, Jul 1, 2023 at 2:29 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
> Operating system:   AIX

> I verified in the OS side, we are not observing explicit fsync() call post
> writing to this file "000000010000000000000003". I suspect this because
> the writes are present in the VMM page cache and not getting synced up to
> the disk. Post restart of my node, DB is not coming up.

We don't usually call fsync() for WAL files (except when initially
creating them), we use various methods controlled by the setting
wal_sync_method[1] and on AIX we default to open_datasync (that means
we open the WAL with O_DSYNC and then we expect pwrite() to return
only after the data is durably on disk).  Have you changed that
setting?  When you say "abrupt shutdown", do you mean power loss?
Perhaps you could investigate what O_DSYNC does with respect to write
caches on your system and what your disk controllers etc promise about
power loss.  Can you reproduce this problem with a fresh cluster, and
does it go away if you use wal_sync_method=fdatasync?

It doesn't seem that likely to me that expensive AIX systems would
fail at sensible volatile cache management, so that's a long shot, but
we know that some other systems can fail in that way (eg Windows on
consumer storage), and I'm pretty sure they can fail exactly as you
described because the control file is fsync'd while the WAL is only
written to volatile drive caches.

[1] https://www.postgresql.org/docs/15/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS


--
Thanks
 Vamshi.

Re: BUG #18009: Postgres Recovery not happening

От
Thomas Munro
Дата:
On Wed, Jul 5, 2023 at 1:48 AM Vamshikrishna T <tvk1271@gmail.com> wrote:
> Thank you for your immediate response, This is not on the default file system of AIX ( JFS2 ), but on a specific
specialpurpose file system. looks like (  open_datasync ) O_DSYNC is causing the issue which seems to be not honoured
onthis file system. Yeah abrupt shutdown can be treated as power loss. 

Sounds like a fun project.  Is this alien technology that regular AIX
users won't run into and I should forget this conversation ever
happened, or is it a clue we should use wal_sync_method=fdatasync by
default on AIX?

> I used wal_sync_method=fdatasync, ( although i am not sure the problem vanished or not, because it is not getting
reproduced)  but what i observe 
> is there is an immediate explicit sync calls to the files present in ../pg_wal/ directory post write call
completions.
>
> 2023-07-04 03:36:04.259 CDT|64a3d9b4.8701bc|LOG:  checkpoint strting: time
> 2023-07-04 03:36:06.263 CDT|64a3d9b4.8701bc|LOG:  checkpoint complete: wrote 21 buffers (0.1%); 0 WAL file(s) added,
0removed, 0 recycled; write=1.925 s, sync=0.053 s, total=2.005 s; sync files=20, longest=0.019 s, average=0.003 s;
distance=32kB, estimate=32 kB 
>
> I can see between these two time interval, write caches are cleared. With  wal_sync_method=fdatasync tunable, Is it
safeto assume all the Postgres DB writes during checkpointing are called via explicit call to  OS level sync() or
fsync(),irrespective of O_DSYNC during open?. 

Yes.  Our periodic checkpoints write out all the relation data (files
like base/1234/2345 that hold tables and indexes), and then always
call fsync() (sometimes the pwrite() calls and the fsync() happen in
different processes*).  But WAL data (files like
pg_wal/000000010000000000000001) get the various wal_sync_method
behaviours, with (IMHO unfortunately) different defaults based on a
series of inconsistent platform-by-platform historical decisions...

Just BTW, if you're interested in PostgreSQL on AIX, see
https://wiki.postgresql.org/wiki/AIX .

*With interesting cross-platform consequences.  If you're a
kernel/VMM/storage person you might find
https://wiki.postgresql.org/wiki/Fsync_Errors interesting.  Of course
we have no idea what any closed source kernel does.



Re: BUG #18009: Postgres Recovery not happening

От
Vamshikrishna T
Дата:
Hi Thomas,

Thank you for the confirmation on fdatasync, Definitely this is not related to regular AIX users, so we may not need any change.
Other thing i am interested in is, Looks like Postgres 15.2 doesn't support Direct or Concurrent I/O on AIX ?. I don't see any
setting where i can tune that ?.  I feel DIO or CIO would be more safer than cached I/O. I see fsync error behaviors are
pretty interesting across different OS. Thanks for the info.

Thanks
 Vamshi.

On Wed, 5 Jul 2023 at 09:06, Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Jul 5, 2023 at 1:48 AM Vamshikrishna T <tvk1271@gmail.com> wrote:
> Thank you for your immediate response, This is not on the default file system of AIX ( JFS2 ), but on a specific special purpose file system. looks like (  open_datasync ) O_DSYNC is causing the issue which seems to be not honoured on this file system. Yeah abrupt shutdown can be treated as power loss.

Sounds like a fun project.  Is this alien technology that regular AIX
users won't run into and I should forget this conversation ever
happened, or is it a clue we should use wal_sync_method=fdatasync by
default on AIX?

> I used wal_sync_method=fdatasync, ( although i am not sure the problem vanished or not, because it is not getting reproduced )  but what i observe
> is there is an immediate explicit sync calls to the files present in ../pg_wal/ directory post write call completions.
>
> 2023-07-04 03:36:04.259 CDT|64a3d9b4.8701bc|LOG:  checkpoint strting: time
> 2023-07-04 03:36:06.263 CDT|64a3d9b4.8701bc|LOG:  checkpoint complete: wrote 21 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=1.925 s, sync=0.053 s, total=2.005 s; sync files=20, longest=0.019 s, average=0.003 s; distance=32 kB, estimate=32 kB
>
> I can see between these two time interval, write caches are cleared. With  wal_sync_method=fdatasync tunable, Is it safe to assume all the Postgres DB writes during checkpointing are called via explicit call to  OS level sync() or fsync(), irrespective of O_DSYNC during open?.

Yes.  Our periodic checkpoints write out all the relation data (files
like base/1234/2345 that hold tables and indexes), and then always
call fsync() (sometimes the pwrite() calls and the fsync() happen in
different processes*).  But WAL data (files like
pg_wal/000000010000000000000001) get the various wal_sync_method
behaviours, with (IMHO unfortunately) different defaults based on a
series of inconsistent platform-by-platform historical decisions...

Just BTW, if you're interested in PostgreSQL on AIX, see
https://wiki.postgresql.org/wiki/AIX .

*With interesting cross-platform consequences.  If you're a
kernel/VMM/storage person you might find
https://wiki.postgresql.org/wiki/Fsync_Errors interesting.  Of course
we have no idea what any closed source kernel does.


--
Thanks
 Vamshi.

Re: BUG #18009: Postgres Recovery not happening

От
Thomas Munro
Дата:
On Mon, Jul 10, 2023 at 11:27 PM Vamshikrishna T <tvk1271@gmail.com> wrote:
> Other thing i am interested in is, Looks like Postgres 15.2 doesn't support Direct or Concurrent I/O on AIX ?. I
don'tsee any 
> setting where i can tune that ?.  I feel DIO or CIO would be more safer than cached I/O. I see fsync error behaviors
are
> pretty interesting across different OS. Thanks for the info.

There is nothing in v15.  In v16 (in beta now), we have just the
tentative beginnings of direct I/O support.  So far we just provided
the switch to turn it on for WAL and/or data files, which involved
getting various memory buffers to be correctly aligned and figuring
out various system programming details for ~10 OSes.  For now the
setting is called "debug_io_direct", with a temporary "debug_" prefix
because we don't want people turning it on for real production
databases yet.  But that's the easy bit.  You can't use it naively
without taking a massive performance hit.  We also need to replace the
readahead/writeback and automatic clustering that PostgreSQL relies on
today, which is probably why previous proposals over the past decades
to just add a simple direct I/O switch went nowhere -- that's not even
the real problem!  That just gives you a lot of painfully slow
unclustered 8KB reads and write calls.  We have a prototype and
forthcoming proposal in the pipeline to address the real problem, but
we wanted the debug_io_direct to be released sooner to give us more
time to learn about weird problems on various OSes and filesystems.
We've already learned a few surprising things in the thread that added
the setting[1] (like how to make btrfs corrupt its own checksums).

I remember from a previous life that different proprietary Unixes
seemed to take different views on inode-level serialisation with
O_DIRECT.  (Maybe something to do with interpreting POSIX's
requirements differently?)  AIX and HP-UX seemed to take the view that
you should have to opt in to concurrency separately, unlike IRIX/XFS.
AIX seems to be the last OS standing that might want us to do that.
Hence my question on https://wiki.postgresql.org/wiki/AIX which I left
there for someone who cares about PostgreSQL-on-AIX to research, along
with the other patches there (patches that will *definitely* improve
performance, I should add, I just can't test them/take them forward
myself).  Maybe later though, once we have more async machinery.
(People joke about these old dinosaur systems but they were doing
direct concurrent writes in the 90s or even late 80s while ext4 has
literally just learned how to do that in 2023 and can't even make a
file bigger than 16TB.  But I digress.).

As for what comes next, here's what we have prototyped:  My colleague
Andres Freund designed an architecture for driving asynchronous I/O
ahead of time, initially with Linux's io_uring.  I worked on extending
it to use pluggable backends to support the obvious options available
on the ~10 OSes we target, including a fully portable fallback.  For
all OSes, we have the option to use a pool of IO worker processes
doing plain old synchronous preadv()/pwritev()/fsync()/... calls in
the background, but since we're talking about AIX, I can report that
AIX turned out to be one of the few OSes where the cursed POSIX AIO
API actually worked as expected and well enough for our purposes[2],
and from light reading, it seems like it might be truly asynchronous
down to the drivers in some cases.  I wrote a little talk about that
portability work at an OS conference, primarily for therapeutic value,
because asynchronous I/O truly is a portability nightmare.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
[2] https://speakerdeck.com/macdice/aio-and-dio-for-postgresql-on-freebsd?slide=20