Volatile write caches on macOS and Windows, redux
От | Thomas Munro |
---|---|
Тема | Volatile write caches on macOS and Windows, redux |
Дата | |
Msg-id | CA+hUKG+F0EL4Up6yVYbbcWse4xKaqW4wc2xpw67Pq9FjmByWVg@mail.gmail.com обсуждение исходный текст |
Ответы |
Re: Volatile write caches on macOS and Windows, redux
(Peter Smith <smithpb2250@gmail.com>)
Re: Volatile write caches on macOS and Windows, redux (Jelte Fennema-Nio <postgres@jeltef.nl>) |
Список | pgsql-hackers |
Hi, Continuing a topic from earlier threads[1][2], I've been wondering about how to de-klugify wal_sync_method=fsync_writethrough (a setting that actually affects much more than just WAL), and how to do the right thing for our users on macOS and Windows by default. Commit d0c28601 was a very small cleanup in this area. Here are some bigger ideas I'd like to try out. Short version: * Make wal_sync_method=fdatasync the default everywhere * Drop wal_sync_method=fsync_writethrough * Add new macOS-only level for the fsync GUC: fsync=full * Make fsync=full redirect both pg_fsync() and pg_fdatasync() * Make fsync=full the default on macOS Motivation: I think expectations might have changed quite a bit since ~2000. Back then, fsync() didn't flush write caches on any OS (you were supposed to use battery-backed controllers and SCSI as found on expensive proprietary Unix systems if you were serious, IDE/ATA protocols didn't originally have flush commands, and some consumer drives famously ignored them or lied, so users of cheap drives were advised to turn write caches off). Around 2005, Linux decided to start sending the flush command in fsync(). Windows' FlushFileBuffers() does the same, and I gathered from Raymond Chen's blog that by the Windows 8 timeframe all consumer drive vendors supported and respected the flush command. macOS *still* doesn't send it for fsync(), but has had fcntl(F_FULLFSYNC) since 2003. In Apple's defence, they seem to have been ahead of the curve on this problem[3]... I suppose they didn't anticipate that everyone else was going to do it in their main fsync()/fdatasync() call, they blazed their own trail, and now it all looks a bit weird. In other words, back then all systems running PostgreSQL risked data loss unless you had fancy hardware or turned off unsafe caching. But now, due to the changing landscape and our policy choices, that is true only for rarer systems by default while most in our community are on Linux where this is all just a historical footnote. People's baseline expectations have moved, and although we try to document the situation, they are occasionally very surprised: "Loaded footgun open_datasync on Windows" was Laurenz Albe's reaction[4] to those paragraphs. Surely we should be able to recover after power loss by default even on a lowly desktop PC or basic server loaded with SATA drives, out of the box? Proposal for Windows: The existing default use of FILE_FLAG_WRITE_THROUGH is probably a better choice on hardware where it works reliably (cache disabled, non-volatile cache, or working FUA support), since it skips a system call and doesn't wait for incidental other stuff in the cache to flush, but it's well documented that Windows' SATA drivers neither pass the "FUA" flag down to the device nor fall back to sending a full cache flush command. It's also easy to see in the pg_test_fsync numbers, which are too good to be true on consumer gear. Therefore wal_sync_method=fdatasync is a better default level. We map that to NtFlushBuffersFileEx(FLUSH_FLAGS_FILE_DATA_SYNC_ONLY). (The "SYNC" in that flag name means flush the drive cache; the "DATA...ONLY" in that flag name means skip non-essential stuff like file modification time etc just like fdatasync() in POSIX, and goes visibly faster thanks to not journaling metadata.) Proposal for macOS: Our current default isn't nice to users who run a database on mains-powered Macs. I don't have one myself to try it, but "man fsync" clearly states that you can lose data and it is easily demonstrated with a traditional cord-yanking test[5]. You could certainly lose some recent commits; you could probably also get more subtle corruption or a total recovery failure like [6] too, if for example the control file can make it to durable storage and while pointing to a checkpoint that did not (maybe a ZFS-like atomic root-switch prevents that sort of disorder in APFS, I dunno, but I read some semi-informed speculation that it doesn't work that way *shrug*). We do currently offer a non-default setting wal_sync_method=fsync_writethough to address all this already. Despite its name, it affects every caller of pg_fsync() (control file, data files, etc). It's certainly essential to flush all those files fully too as part of our recovery protocol, but they're not "WAL". The new idea here is to provide a separate way of controlling that global behaviour, and I propose fsync=full. Furthermore, I think that setting should also affect pg_fdatasync(), given that Apple doesn't even really have fdatasync() (perhaps if they carry out their threat to implement it, they'll also invent F_FULLFDATASYNC; for now it *seems* to be basically just another name for fsync() albeit undeclared by <unistd.h>). It's possible that fcntl(F_FULLFSYNC) might fail with ENOSUPP or other errors in obscure cases (eg unusual file systems). In that case, you could manually lower fsync to just "on" and do your own research on whether power loss can toast your database, but that doesn't seem like a reason for us not to ship good solid defaults for typical users. Rationale for changing wal_sync_method globally (for now): With wal_sync_method=fdatasync as default for Linux, FreeBSD, OpenBSD, DragonflyBSD already, if we added macOS and Windows, that'd leave only NetBSD, AIX, Solaris/illumos. I don't like having different and more magical defaults on rare target OSes with no expert users left in our community (as [6] reminded me), so I figure we'd be better off with the same less magical setting everywhere, as a baseline. Later we might want a per-platform default again. For example, Linux (like Windows) has policies on whether to believe FUA works reliably for the purposes of O_DSYNC, but (unlike Windows) falls back to sending cache flushes instead of doing nothing, so in theory open_datasync might be a safe and sometimes better performing default there. If we decided to do that, we'd just restore the PLATFORM_DEFAULT_SYNC_METHOD mechanism. The only other OS where I have detailed enough knowledge to comment is FreeBSD. Its ZFS flushes caches for all levels just fine, so it doesn't much matter, while its UFS never got that memo (so it's like a Mac and probably other old Unixes; maybe I'll get that fixed, see FreeBSD proposal D36371 if interested). The reasons for using fdatasync on both FreeBSD and Linux wasn't cache control policies, but rather some obscure logic of ours that would turn on O_DIRECT in some cases (and I think in the past when wal_level was lower by default, it would have been common), which might have complications or fail. The last trace of that is gone since d4e71df6, so if we were to put Linux on a 'known-good-for-open_datasync' list I'd probably also consider putting FreeBSD on the list too. Note that while this'll slow down some real world databases by being more careful, 'meson test' time shouldn't be affected on any OS due to use of fsync=off in tests. Draft patches attached. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJZJVO%3DiX%2Beb-PXi2_XS9ZRqnn_4URh0NUQOwt6-_51xQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de [3] https://lists.apple.com/archives/darwin-dev/2005/Feb/msg00087.html [4] https://www.postgresql.org/message-id/flat/1527846213.2475.31.camel%40cybertec.at [5] https://news.ycombinator.com/item?id=30372194 [6] https://www.postgresql.org/message-id/flat/18009-40a42f84af3fbda1%40postgresql.org
Вложения
В списке pgsql-hackers по дате отправления:
Следующее
От: Nathan BossartДата:
Сообщение: Re: Inefficiency in parallel pg_restore with many tables