Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
От | Thomas Munro |
---|---|
Тема | Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt" |
Дата | |
Msg-id | CA+hUKGLkOYgHLmErZ1jAgR3WKzJRxCjJumFM4mkkba7+hv-NGA@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt" ("Anton A. Melnikov" <aamelnikov@inbox.ru>) |
Ответы |
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
(Thomas Munro <thomas.munro@gmail.com>)
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt" ("Anton A. Melnikov" <aamelnikov@inbox.ru>) |
Список | pgsql-hackers |
On Tue, Feb 14, 2023 at 4:38 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote: > First of all it seemed to me that is not a problem at all since msdn > guarantees sector-by-sector atomicity. > "Physical Sector: The unit for which read and write operations to the device > are completed in a single operation. This is the unit of atomic write..." > https://learn.microsoft.com/en-us/windows/win32/fileio/file-buffering > https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile > (Of course, only if the 512 bytes lays from the beginning of the file with a zero > offset, but this is our case. The current size of ControlFileData is > 296 bytes at offset = 0.) There are two kinds of atomicity that we rely on for the control file today: * atomicity on power loss (= device property, in case of overwrite filesystems) * atomicity of concurrent reads and writes (= VFS or kernel buffer pool interlocking policy) I assume that documentation is talking about the first thing (BTW, I suspect that documentation is also now wrong in one special case: NTFS filesystems mounted on DAX devices don't have sectors or sector-based atomicity unless you turn on BTT and slow them down[1]; that might eventually be something for PostgreSQL to think about, and it applies to other OSes too). With this patch we would stop relying on the second thing. Supposedly POSIX requires read/write atomicity, and many file systems offer it in a strong form (internal range locking) or maybe a weak accidental form (page-level locking). Since some extremely popular implementations just don't care, and Windows isn't even a POSIX, we just have to do it ourselves. BTW there are at least two other places where PostgreSQL already knows that concurrent reads and writes are possibly non-atomic (and we also don't even try to get the alignment right, making the question moot): pg_basebackup, which enables full_page_writes implicitly if you didn't have the GUC on already, and pg_rewind, which refuses to run if you haven't enabled full_page_writes explicitly (as recently discussed on another thread recently; that's an annoying difference, and also an annoying behaviour if you know your storage doesn't really need it!) > I tried to verify this fact experimentally and was very surprised. > Unfortunately it crashed in an hour during torture test: > 2023-02-13 15:10:52.675 MSK [5704] LOG: starting PostgreSQL 16devel, compiled by Visual C++ build 1929, 64-bit > 2023-02-13 15:10:52.768 MSK [5704] LOG: database system is ready to accept connections > @@@@@@ sizeof(ControlFileData) = 296 > ......... > 2023-02-13 16:10:41.997 MSK [9380] ERROR: calculated CRC checksum does not match value stored in file Thanks. I'd seen reports of this in discussions on the 'net, but those had no authoritative references or supporting test results. The fact that fsync made it take longer (in your following email) makes sense and matches Linux. It inserts a massive pause in the middle of the experiment loop, affecting the probabilities. Therefore, we need a solution for Windows too. I tried to write the equivalent code, in the attached. I learned a few things: Windows locks are mandatory (that is, if you lock a file, reads/writes can fail, unlike Unix). Combined with async release, that means we probably also need to lock the file in xlog.c, when reading it in xlog.c:ReadControlFile() (see comments). Before I added that, on a CI run, I found that the read in there would sometimes fail, and adding the locking fixed that. I am a bit confused, though, because I expected that to be necessary only if someone managed to crash while holding the write lock, which the CI tests shouldn't do. Do you have any ideas? While contemplating what else a mandatory file lock might break, I remembered that basebackup.c also reads the control file. Hrmph. Not addressed yet; I guess it might need to acquire/release around sendFile(sink, XLOG_CONTROL_FILE, ...)? > > I would > > only want to consider this if we also stop choosing "open_datasync" as > > a default on at least Windows. > > I didn't quite understand this point. Could you clarify it for me, please? If the performance > of UpdateMinRecoveryPoint() wasn't a problem we could just use fsync in all platforms. The level of durability would be weakened on Windows. On all systems except Linux and FreeBSD, we default to wal_sync_method=open_datasync, so then we would start using FILE_FLAG_WRITE_THROUGH for the control file too. We know from reading and experimentation that FILE_FLAG_WRITE_THROUGH doesn't flush the drive cache on consumer Windows hardware, but its fdatasync-like thing does[2]. I have not thought too hard about the consequences of using different durability levels for different categories of file, but messing with write ordering can't be good for crash recovery, so I'd rather increase WAL durability than decrease control file durability. If a Windows user complains that it makes their fancy non-volatile cache slow down, they can always adjust the settings in PostgreSQL, their OS, or their drivers etc. I think we should just make fdatasync the default on all systems. Here's a patch like that. [1] https://learn.microsoft.com/en-us/windows-server/storage/storage-spaces/persistent-memory-direct-access [2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2Ba-7r4GpADsasCnuDBiqC1c31DAQQco2FayVtB9V3sQw%40mail.gmail.com#460bfa5a6b571cc98c575d23322e0b25
Вложения
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Amit KapilaДата:
Сообщение: Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
Следующее
От: Amit KapilaДата:
Сообщение: Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy