Re: should crash recovery ignore checkpoint_flush_after ?
From | Andres Freund
Subject | Re: should crash recovery ignore checkpoint_flush_after ?
Date |
Msg-id | 20200118233202.ax27prmsvvxqaytx@alap3.anarazel.de
In response to | Re: should crash recovery ignore checkpoint_flush_after ? (Thomas Munro <thomas.munro@gmail.com>)
Responses | Re: should crash recovery ignore checkpoint_flush_after ?
List | pgsql-hackers
Hi,

On 2020-01-19 09:52:21 +1300, Thomas Munro wrote:
> On Sun, Jan 19, 2020 at 3:08 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > As I understand, the first thing that happens is syncing every file in the
> > data dir, like in initdb --sync.  These instances were both 5+TB on zfs,
> > with compression, so that's slow, but tolerable, and at least
> > understandable, and with visible progress in ps.
> >
> > The 2nd stage replays WAL.  strace shows it's occasionally running
> > sync_file_range, and I think recovery might've been several times faster
> > if we'd just dumped the data at the OS ASAP, fsync once per file.  In
> > fact, I've just kill -9 the recovery process and edited the config to
> > disable this lest it spend all night in recovery.
>
> Does sync_file_range() even do anything for non-mmap'd files on ZFS?

Good point. Next time it might be worthwhile to use strace -T to see
whether the sync_file_range calls actually take meaningful time.

> Non-mmap'd ZFS data is not in the Linux page cache, and I think
> sync_file_range() works at that level.  At a guess, there'd need to be
> a new VFS file_operation so that ZFS could get a callback to handle
> data in its ARC.

Yea, it requires the pages to be in the pagecache to do anything:

int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
		    unsigned int flags)
{
	...
	if (flags & SYNC_FILE_RANGE_WRITE) {
		int sync_mode = WB_SYNC_NONE;

		if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
			     SYNC_FILE_RANGE_WRITE_AND_WAIT)
			sync_mode = WB_SYNC_ALL;

		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
						 sync_mode);
		if (ret < 0)
			goto out;
	}

and then

int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
			       loff_t end, int sync_mode)
{
	int ret;
	struct writeback_control wbc = {
		.sync_mode = sync_mode,
		.nr_to_write = LONG_MAX,
		.range_start = start,
		.range_end = end,
	};

	if (!mapping_cap_writeback_dirty(mapping) ||
	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
		return 0;

which means that if there are no pages in the pagecache for the relevant
range, it'll just finish here. *Iff* there are some, say because
something else mmap()ed a section, it'd potentially call into the
address_space->writepages() callback.

So it's possible to emulate enough state for ZFS or such to still get
sync_file_range() to call into it (by setting up a pseudo mapping tagged
as dirty), but that's not really the normal path.

Greetings,

Andres Freund