remove more archiving overhead
От | Nathan Bossart |
---|---|
Тема | remove more archiving overhead |
Дата | |
Msg-id | 20220222011948.GA3850532@nathanxps13 обсуждение исходный текст |
Ответы |
Re: remove more archiving overhead
|
Список | pgsql-hackers |
Hi hackers, With the addition of archive modules (5ef1eef) and other archiving improvements (e.g., beb4e9b), the amount of archiving overhead has been reduced. I spent some time measuring the remaining overhead, and the following function stood out: /* * pgarch_archiveDone * * Emit notification that an xlog file has been successfully archived. * We do this by renaming the status file from NNN.ready to NNN.done. * Eventually, a checkpoint process will notice this and delete both the * NNN.done file and the xlog file itself. */ static void pgarch_archiveDone(char *xlog) { char rlogready[MAXPGPATH]; char rlogdone[MAXPGPATH]; StatusFilePath(rlogready, xlog, ".ready"); StatusFilePath(rlogdone, xlog, ".done"); (void) durable_rename(rlogready, rlogdone, WARNING); } Specifically, the call to durable_rename(), which calls fsync() a few times, can cause a decent delay. On my machine, archiving ~12k WAL files using a bare-bones archive module that did nothing but return true took over a minute. When I changed this to rename() (i.e., reverting the change to pgarch.c in 1d4a0ab), the same test took a second or less. I also spent some time investigating whether durably renaming the archive status files was even necessary. In theory, it shouldn't be. If a crash happens before the rename is persisted, we might try to re-archive files, but your archive_command/archive_library should handle that. If the file was already recycled or removed, the archiver will skip it (thanks to 6d8727f). But when digging further, I found that WAL file recyling uses durable_rename_excl(), which has the following note: * Note that a crash in an unfortunate moment can leave you with two links to * the target file. IIUC this means that in theory, a crash at an unfortunate moment could leave you with a .ready file, the file to archive, and another link to that file with a "future" WAL filename. If you re-archive the file after it has been reused, you could end up with corrupt WAL archives. I think this might already be possible today. Syncing the directory after every rename probably reduces the likelihood quite substantially, but if RemoveOldXlogFiles() quickly picks up the .done file and attempts recycling before durable_rename() calls fsync() on the directory, presumably the same problem could occur. I believe the current recommendation is to fail if the archive file already exists. The basic_archive contrib module fails if the archive already exists and has different contents, and it succeeds without re-archiving if it already exists with identical contents. Since something along these lines is recommended as a best practice, perhaps this is okay. But I think we might be able to do better. If we ensure that the archive_status directory is sync'd prior to recycling/removing a WAL file, I believe we are safe from the aforementioned problem. The drawback of this is that we are essentially offloading more work to the checkpointer process. In addition to everything else it must do, it will also need to fsync() the archive_status directory many times. I'm not sure how big of a problem this is. It should be good to ensure that archiving keeps up, and there are far more expensive tasks that the checkpointer must perform that could make the added fsync() calls relatively insignificant. I've attached a patch that makes the changes discussed above. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Вложения
В списке pgsql-hackers по дате отправления: