Discussion: fdatasync(2) on macOS

fdatasync(2) on macOS

From: Thomas Munro

Hello hackers,

While following along with the nearby investigation into weird
cross-version Apple toolchain issues that confuse configure, I noticed
that the newer buildfarm Macs say:

checking for fdatasync... (cached) yes

That's a bit strange because there's no man page and no declaration:

checking whether fdatasync is declared... (cached) no

That's no obstacle for us, because c.h does:

#if defined(HAVE_FDATASYNC) && !HAVE_DECL_FDATASYNC
extern int  fdatasync(int fildes);
#endif

So... does this unreleased function flush drive caches?  We know that
fsync(2) doesn't, based on Apple's advice[1] for database hackers to
call fcntl(fd, F_FULLFSYNC, 0) instead.  We do that.
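
A minimal sketch of that pattern, not PostgreSQL's actual code. The helper name flush_to_stable_storage is hypothetical, and falling back to fsync() when F_FULLFSYNC is missing or fails is an assumption made so the sketch also builds on non-Apple systems:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: push fd's data as far toward stable storage as the
 * platform lets us ask for.  Returns 0 on success, -1 with errno set.
 */
static int
flush_to_stable_storage(int fd)
{
#ifdef F_FULLFSYNC
    /* macOS: also ask the drive to flush its own write cache. */
    if (fcntl(fd, F_FULLFSYNC, 0) != -1)
        return 0;
    /* Assumption: if the file system rejects F_FULLFSYNC, settle for fsync(). */
#endif
    return fsync(fd);
}
```

On a platform without F_FULLFSYNC this is just fsync(), so it says nothing about drive caches there.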

Speaking as an armchair Internet Unix detective, my guess is: no.  In
the source[2] we can see that there is a real system call table entry
and VFS support, so there is *something* wired up to this lever.  On
the other hand, it shares code with fsync(2), and I suppose that
fdatasync(2) isn't going to do *more* than fsync(2).  But who knows?
Not only is it unreleased, but below VNOP_FSYNC() you reach closed
source file system code.

That was fun, but now I'm asking myself: do we really want to use an
IO synchronisation facility that's not declared by the vendor?  I see
that our declaration goes back 20 years to 33cc5d8a, which introduced
fdatasync(2).  The discussion from the time[3] makes it clear that the
OS support was very patchy and thin back then.

Just by the way, another fun thing I learned about libSystem while
reading up on Big Sur changes is that the system libraries are no
longer on the file system.  dlopen() is magical.

[1] https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/fsync.2.html
[2] https://github.com/apple/darwin-xnu/blob/d4061fb0260b3ed486147341b72468f836ed6c8f/bsd/vfs/vfs_syscalls.c#L7708
[3] https://www.postgresql.org/message-id/flat/200102171805.NAA24180%40candle.pha.pa.us



Re: fdatasync(2) on macOS

From: Thomas Munro

On Fri, Jan 15, 2021 at 7:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> That was fun, but now I'm asking myself: do we really want to use an
> IO synchronisation facility that's not declared by the vendor?

I should add, the default wal_sync_method is open_datasync, not
fdatasync.  I'm pretty suspicious of that too: neither O_SYNC nor
O_DSYNC appears as a documented flag for open(2) and the numbers look
suspicious.  Perhaps they only define them to support aio_fsync(2).
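
To make the open_datasync idea concrete, here is a rough sketch of opening a file so that each write() is meant to be synchronous. The helper name open_for_sync_writes and the SKETCH_SYNC_FLAG macro are hypothetical, and the fallback chain is an assumption. Whether macOS honours these undocumented flags is exactly the open question here:

```c
#include <fcntl.h>
#include <unistd.h>

/* Pick the strongest "synchronous write" open flag the headers offer. */
#if defined(O_DSYNC)
#define SKETCH_SYNC_FLAG O_DSYNC
#elif defined(O_SYNC)
#define SKETCH_SYNC_FLAG O_SYNC
#else
#define SKETCH_SYNC_FLAG 0      /* nothing available: plain buffered writes */
#endif

/*
 * Hypothetical helper: open (creating if needed) a file for writing with
 * the chosen flag.  Returns a file descriptor, or -1 with errno set.
 */
static int
open_for_sync_writes(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | SKETCH_SYNC_FLAG, 0600);
}
```

With this approach there is no separate sync call; the caller relies on write() itself not returning until the data has been handed to the device.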



Re: fdatasync(2) on macOS

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> While following along with the nearby investigation into weird
> cross-version Apple toolchain issues that confuse configure, I noticed
> that the newer buildfarm Macs say:
> checking for fdatasync... (cached) yes
> That's a bit strange because there's no man page and no declaration:

Yeah, it's been there but undeclared for a long time.  Who knows why.

> So... does this unreleased function flush drive caches?  We know that
> fsync(2) doesn't, based on Apple's advice[1] for database hackers to
> call fcntl(fd, F_FULLFSYNC, 0) instead.  We do that.

pg_test_fsync results make it clear that fdatasync is the same or a shade
faster than fsync, which is pretty much what you'd expect.  On my
late-model MacBook Pro:

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     14251.416 ops/sec      70 usecs/op
        fdatasync                         25345.103 ops/sec      39 usecs/op
        fsync                             24677.445 ops/sec      41 usecs/op
        fsync_writethrough                   41.519 ops/sec   24085 usecs/op
        open_sync                         14188.903 ops/sec      70 usecs/op

and on an old Mac mini with spinning rust:

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                      2536.535 ops/sec     394 usecs/op
        fdatasync                          4602.192 ops/sec     217 usecs/op
        fsync                              4600.365 ops/sec     217 usecs/op
        fsync_writethrough                   12.135 ops/sec   82404 usecs/op
        open_sync                          2506.674 ops/sec     399 usecs/op

So it's not a no-op, but on the other hand it's not succeeding in getting
bits down to the platter.  I'm not inclined to dike it out, but it does
seem problematic that we're defaulting to open_datasync, which is also
not getting bits down to the platter.

I have a vague recollection that we discussed changing the default
wal_sync_method for Darwin years ago, but I don't recall why we
didn't pull the trigger.  These results certainly suggest that
we oughta.

            regards, tom lane



Re: fdatasync(2) on macOS

From: Bruce Momjian

On Fri, Jan 15, 2021 at 12:55:52PM -0500, Tom Lane wrote:
> > So... does this unreleased function flush drive caches?  We know that
> > fsync(2) doesn't, based on Apple's advice[1] for database hackers to
> > call fcntl(fd, F_FULLFSYNC, 0) instead.  We do that.
> 
> pg_test_fsync results make it clear that fdatasync is the same or a shade
> faster than fsync, which is pretty much what you'd expect.  On my
> late-model Macbook Pro:
> 
> Compare file sync methods using two 8kB writes:
> (in wal_sync_method preference order, except fdatasync is Linux's default)
>         open_datasync                     14251.416 ops/sec      70 usecs/op
>         fdatasync                         25345.103 ops/sec      39 usecs/op
>         fsync                             24677.445 ops/sec      41 usecs/op
>         fsync_writethrough                   41.519 ops/sec   24085 usecs/op
>         open_sync                         14188.903 ops/sec      70 usecs/op
> 
> and on an old Mac mini with spinning rust:
> 
> Compare file sync methods using two 8kB writes:
> (in wal_sync_method preference order, except fdatasync is Linux's default)
>         open_datasync                      2536.535 ops/sec     394 usecs/op
>         fdatasync                          4602.192 ops/sec     217 usecs/op
>         fsync                              4600.365 ops/sec     217 usecs/op
>         fsync_writethrough                   12.135 ops/sec   82404 usecs/op
>         open_sync                          2506.674 ops/sec     399 usecs/op
> 
> So it's not a no-op, but on the other hand it's not succeeding in getting
> bits down to the platter.  I'm not inclined to dike it out, but it does
> seem problematic that we're defaulting to open_datasync, which is also
> not getting bits down to the platter.
> 
> I have a vague recollection that we discussed changing the default
> wal_sync_method for Darwin years ago, but I don't recall why we
> didn't pull the trigger.  These results certainly suggest that
> we oughta.

Is this with an SSD?  We used to be able to know something wasn't
flushing to durable storage because magnetic disk was so slow you could
tell from the numbers, but with SSDs, it might be harder to guess. 
Maybe time to use:

    https://brad.livejournal.com/2116715.html
    diskchecker.pl

or find a way to automate that test.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: fdatasync(2) on macOS

From: Thomas Munro

On Sat, Jan 16, 2021 at 6:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So it's not a no-op, but on the other hand it's not succeeding in getting
> bits down to the platter.  I'm not inclined to dike it out, but it does
> seem problematic that we're defaulting to open_datasync, which is also
> not getting bits down to the platter.

Hmm, OK, from these times it does appear that O_SYNC and O_DSYNC
actually do something then.  It's baffling that they are undocumented.
It might be possible to use dtrace on a SIP-disabled Mac to trace the
I/Os with the script below and see whether the B_FUA flag is being set.
If it is sent and not ignored, that might make open_datasync better
than fdatasync, but again, who knows?!

https://github.com/apple/darwin-xnu/blob/master/bsd/dev/dtrace/scripts/io.d

> I have a vague recollection that we discussed changing the default
> wal_sync_method for Darwin years ago, but I don't recall why we
> didn't pull the trigger.  These results certainly suggest that
> we oughta.

No strong preference here, at least without more information. It's
unsettling that two of our wal_sync_methods are based on half-released
phantom operating system features, but there doesn't seem to be much
we can do about that other than try to understand what they do.  I see
that the idea of defaulting to fsync_writethrough was discussed a
decade ago and rejected[1].  I'm not entirely sure how it manages to
be so slow.

It looks like the reliability section of our manual could use a spring
clean[2].  It's still talking about IDE and platters, instead of
modern stuff like NVMe, cloud/network storage and FUA flags.

[1] https://www.postgresql.org/message-id/flat/AANLkTik261QWc9kGv6acZz2h9ZrQy9rKQC8ow5U1tAaM%40mail.gmail.com
[2] https://www.postgresql.org/docs/13/wal-reliability.html



Re: fdatasync(2) on macOS

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> On Sat, Jan 16, 2021 at 6:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I have a vague recollection that we discussed changing the default
>> wal_sync_method for Darwin years ago, but I don't recall why we
>> didn't pull the trigger.  These results certainly suggest that
>> we oughta.

> No strong preference here, at least without more information. It's
> unsettling that two of our wal_sync_methods are based on half-released
> phantom operating system features, but there doesn't seem to be much
> we can do about that other than try to understand what they do.  I see
> that the idea of defaulting to fsync_writethrough was discussed a
> decade ago and rejected[1].
> [1] https://www.postgresql.org/message-id/flat/AANLkTik261QWc9kGv6acZz2h9ZrQy9rKQC8ow5U1tAaM%40mail.gmail.com

Ah, thanks for doing the archaeology on that.  Re-reading that old
thread, it seems like the two big arguments against making it
safe-by-default were

(1) other platforms weren't safe-by-default either.  Perhaps the
state of the art is better now, though?

(2) we don't want to force exceedingly-expensive defaults on people
who may be uninterested in reliable storage.  That seemed like a
shaky argument then and it still does now.  Still, I see the point
that suddenly degrading performance by orders of magnitude would
be a PR disaster.

            regards, tom lane



Re: fdatasync(2) on macOS

From: Thomas Munro

On Mon, Jan 18, 2021 at 5:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (1) other platforms weren't safe-by-default either.  Perhaps the
> state of the art is better now, though?

Generally the answer seems to be yes, but there are still some systems
out there that don't send flushes when volatile write cache is
enabled.  Probably still including Macs, by the admission of their man
page.  The numbers I saw would put a little M1 Air at the upper range
of super expensive server storage if they included or didn't need a
flush to survive power loss, but then that's a consumer device with a
battery so it doesn't really fit into the usual way we think about
database server storage and power loss...

> (2) we don't want to force exceedingly-expensive defaults on people
> who may be uninterested in reliable storage.  That seemed like a
> shaky argument then and it still does now.  Still, I see the point
> that suddenly degrading performance by orders of magnitude would
> be a PR disaster.

(Purely as a matter of curiosity, I wonder why the latency is so high
for F_FULLFSYNC.  Wild speculation: APFS is said to be a bit like ZFS,
but it's also said to avoid the data journaling of HFS+... so perhaps
it lacks an equivalent of ZFS's ZIL (a thing like WAL) that allows
synchronous writes to avoid having to flush out a new tree and
uberblock (in ZFS lingo "spa_sync()").  It might be possible to see this
with tools like iosnoop (or the underlying io:::start dtrace probe),
if you overwrite a single block and then fcntl(F_FULLFSYNC).  Your 12
ops/sec on spinning rust would have to be explained by something like
that, and is significantly slower than the speeds I see on my spinning
rust ZFS system that manages something like disk rotation speed.)

Anyway, my purpose in this thread was to flag our usage of the
undocumented system call and open flags; that is, "how we talk to the
OS", not "how the OS talks to the disk".  That turned out to be
already well known and not as new as I first thought, so I'm not
planning to pursue this Mac stuff any further, despite my curiosity...



Re: fdatasync(2) on macOS

From: Thomas Munro

On Mon, Jan 18, 2021 at 4:39 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Jan 16, 2021 at 6:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > So it's not a no-op, but on the other hand it's not succeeding in getting
> > bits down to the platter.  I'm not inclined to dike it out, but it does
> > seem problematic that we're defaulting to open_datasync, which is also
> > not getting bits down to the platter.
>
> Hmm, OK, from these times it does appear that O_SYNC and O_DSYNC
> actually do something then.  It's baffling that they are undocumented.

I was digging through Apple sources again trying to learn something
about bug report #16827, and I spotted one extra detail that I wanted
to share in this thread about these undocumented system interfaces,
just for the record.  It appears that as of macOS 11.2/XNU 7195.83.3,
their vn_write() doesn't treat O_DSYNC any differently than O_SYNC:

    /*
     * Treat synchronous mounts and O_FSYNC on the fd as equivalent.
     *
     * XXX We treat O_DSYNC as O_FSYNC for now, since we can not delay
     * XXX the non-essential metadata without some additional VFS work;
     * XXX the intent at this point is to plumb the interface for it.
     */
    if ((fp->fp_glob->fg_flag & (O_FSYNC | O_DSYNC)) ||
        (vp->v_mount && (vp->v_mount->mnt_flag & MNT_SYNCHRONOUS))) {
        ioflag |= IO_SYNC;
    }
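
Restated as a user-space sketch, the condition boils down to the following. The function name and its boolean mount parameter are hypothetical, and the sketch uses the portable O_SYNC where the kernel excerpt uses O_FSYNC:

```c
#include <fcntl.h>

#ifndef O_DSYNC
#define O_DSYNC O_SYNC          /* sketch-only fallback if the header lacks it */
#endif

/*
 * Hypothetical restatement of the check quoted above: a write is treated
 * as synchronous if the file was opened with either sync flag, or if the
 * mount itself is synchronous.  O_DSYNC gets no lighter treatment.
 */
static int
write_is_treated_as_sync(int open_flags, int mount_is_synchronous)
{
    return (open_flags & (O_SYNC | O_DSYNC)) != 0 ||
        mount_is_synchronous != 0;
}
```

If that reading of the excerpt is right, open_datasync and open_sync take the same path there, which would fit their nearly identical timings earlier in the thread.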