Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
От | Thomas Munro |
---|---|
Тема | Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt" |
Дата | |
Msg-id | CA+hUKGKiZJcfZSA5G5Rm8oC78SNOQ4c8az5Ku=4wMTjw1FZ40g@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt" (Daniel Gustafsson <daniel@yesql.se>) |
Ответы |
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt" |
Список | pgsql-hackers |
This is a frustrating thread, because despite the last patch solving most of the problems we discussed, it doesn't address the low-level-backup procedure in a nice way. We'd have to tell users they have to flock that file, or add a new step "pg_controldata --raw > pg_control", which seems weird when they already have a connection to the server. Maybe it just doesn't matter if eg the pg_controldata program can spuriously fail if pointed at a running system, and I was being too dogmatic trying to fix even that. Maybe we should just focus on fixing backups. Even there, I am beginning to suspect we are solving this problem in the wrong place when a higher level change could simplify the problem away. Idea for future research: Perhaps pg_backup_stop()'s label-file output should include the control file image (suitably encoded)? Then the recovery-from-label code could completely ignore the existing control file, and overwrite it using that copy. It's already partially ignoring it, by using the label file's checkpoint LSN instead of the control file's. Perhaps the captured copy could include the correct LSN already, simplifying that code, and the low level backup procedure would not need any additional steps or caveats. No more atomicity problem for low-level-backups... but probably not something we would back-patch, for such a rare failure mode. Here's a new minimal patch that solves only the bugs in basebackup + the simple SQL-facing functions that read the control file, by simply acquiring ControlFileLock in the obvious places. This should be simple enough for back-patching? Perhaps we could use the control file image from server memory, but that requires us to be certain that its CRC is always up to date. That seems to be true, but I didn't want to require it for this, and it doesn't seem important for non-performance-critical code. Thoughts? As for the other topics that came up in this thread, I kicked the wal_sync_method thing out to its own thread[1]. (There was a logical chain connecting these topics: "can I add file lock system calls here?" -> "well if someone is going to complain that it's performance critical then why are we using unnecessarily slow pg_fsync()?" -> "well if we change that to pg_fdatasync() we have to address known weakness/kludge on macOS first". I don't like the flock stuff anymore, but I do want to fix the known macOS problem independently. Hereby disentangled.) [1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BF0EL4Up6yVYbbcWse4xKaqW4wc2xpw67Pq9FjmByWVg%40mail.gmail.com
Вложения
В списке pgsql-hackers по дате отправления: