Re: new option to allow pg_rewind to run without full_page_writes
From | Jérémie Grauer |
---|---|
Subject | Re: new option to allow pg_rewind to run without full_page_writes |
Date | |
Msg-id | e3184432-54b8-5420-f2a0-b26e7a4652e0@cosium.com |
In response to | Re: new option to allow pg_rewind to run without full_page_writes (Andres Freund <andres@anarazel.de>) |
Responses |
Re: new option to allow pg_rewind to run without full_page_writes
Re: new option to allow pg_rewind to run without full_page_writes |
List | pgsql-hackers |
Hello,

First, thank you for reviewing.

ZFS writes files in increments of its configured recordsize for the current
filesystem dataset. So with a recordsize configured to be a multiple of 8K,
you can't get torn pages on writes; that's why full_page_writes can be safely
deactivated on ZFS (the usual advice is to configure ZFS with a recordsize of
8K for postgres, but on some workloads it can actually be beneficial to go to
a higher multiple of 8K).

On 06/11/2022 03:38, Andres Freund wrote:
> Hi,
>
> On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
>> Currently pg_rewind refuses to run if full_page_writes is off. This is to
>> prevent it to run into a torn page during operation.
>>
>> This is usually a good call, but some file systems like ZFS are naturally
>> immune to torn page (maybe btrfs too, but I don't know for sure for this
>> one).
>
> Note that this isn't about torn pages in case of crashes, but about reading
> pages while they're being written to.

Like I wrote above, ZFS will prevent torn pages on writes, like
full_page_writes does.

> Right now, that definitely allows for torn reads, because of the way
> pg_read_binary_file() is implemented. We only ensure a 4k read size from the
> view of our code, which obviously can lead to torn 8k page reads, no matter
> what the filesystem guarantees.
>
> Also, for reasons I don't understand we use C streaming IO or
> pg_read_binary_file(), so you'd also need to ensure that the buffer size used
> by the stream implementation can't cause the reads to happen in smaller
> chunks. Afaict we really shouldn't use file streams here, then we'd at least
> have control over that aspect.
>
>
> Does ZFS actually guarantee that there never can be short reads? As soon as
> they are possible, full page writes are needed

I may be missing something here: how does full_page_writes prevent short
_reads_?

Presumably, if we do something like read the first 4K of a file, then change
the file, then read the next 4K, the second 4K may be a torn read. But I fail
to see how full_page_writes prevents this, since it only acts on writes.

> This isn't an fundamental issue - we could have a version of
> pg_read_binary_file() for relation data that prevents the page being written
> out concurrently by locking the buffer page. In addition it could often avoid
> needing to read the page from the OS / disk, if present in shared buffers
> (perhaps minus cases where we haven't flushed the WAL yet, but we could also
> flush the WAL in those).
>

I agree, but this would need a different patch, which may be beyond my skills.

> Greetings,
>
> Andres Freund

Anyway, ZFS will act like full_page_writes is always active, so isn't the
proposed modification to pg_rewind valid?

You'll find attached a second version of the patch, which is cleaner (removed
double negation).

Regards,
Jérémie Grauer
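
For readers following the thread, the "read the first 4K, change the file, read the next 4K" scenario described above can be reproduced deterministically outside PostgreSQL. The following is only a minimal Python sketch under stated assumptions: it is not pg_rewind or pg_read_binary_file() code, all function names are hypothetical, and the "concurrent" write is simulated by a callback between the two reads rather than by a real second process. It relies on os.pread/os.pwrite, so it needs a Unix-like system.

```python
import os
import tempfile

PAGE_SIZE = 8192   # PostgreSQL's default block size
CHUNK_SIZE = 4096  # the read granularity discussed in the thread


def write_page(fd, fill):
    """Overwrite page 0 of the file with PAGE_SIZE copies of one byte."""
    os.pwrite(fd, fill * PAGE_SIZE, 0)


def read_page_in_chunks(fd, between_chunks=None):
    """Read page 0 as two CHUNK_SIZE reads, optionally running a callback
    between them to stand in for a concurrent writer (an assumption of
    this sketch, not how PostgreSQL schedules writes)."""
    first = os.pread(fd, CHUNK_SIZE, 0)
    if between_chunks is not None:
        between_chunks()
    second = os.pread(fd, CHUNK_SIZE, CHUNK_SIZE)
    return first + second


def is_torn(page):
    """In this toy model each page version is a single repeated byte,
    so a page mixing two different bytes is torn."""
    return len(set(page)) > 1


def demo():
    """Return (torn_when_interleaved, torn_when_read_after_write)."""
    fd, path = tempfile.mkstemp()
    try:
        write_page(fd, b"A")
        # The page is rewritten between the two 4K reads.
        interleaved = read_page_in_chunks(fd, lambda: write_page(fd, b"B"))
        # No writer is active here, so this read sees one version only.
        settled = os.pread(fd, PAGE_SIZE, 0)
        return is_torn(interleaved), is_torn(settled)
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

The interleaved read returns half of the old page and half of the new one even though each individual write covered the whole 8K page, which is the point both sides of the thread agree on: atomic writes alone do not make chunked reads atomic. The second read is clean only because nothing writes during it; the sketch says nothing about what ZFS or any other filesystem guarantees for a single 8K read.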