Re: FSM Corruption (was: Could not read block at end of the relation)
From | Noah Misch |
---|---|
Subject | Re: FSM Corruption (was: Could not read block at end of the relation) |
Date | |
Msg-id | 20240304190312.b6.nmisch@google.com |
In response to | Re: FSM Corruption (was: Could not read block at end of the relation) (Ronan Dunklau <ronan.dunklau@aiven.io>) |
Responses | Re: FSM Corruption (was: Could not read block at end of the relation) |
List | pgsql-bugs |
On Mon, Mar 04, 2024 at 02:10:39PM +0100, Ronan Dunklau wrote:
> On Monday, 4 March 2024, 00:47:15 CET, Noah Misch wrote:
> > On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> > > - happens during heavy system load
> > > - lots of concurrent writes happening on a table
> > > - often (but haven't been able to confirm it is necessary), a vacuum is
> > > running on the table at the same time the error is triggered

> Looking at when the corruption was WAL-logged, this particular case is quite
> easy to trace. We have a few MULTI-INSERTS+INIT initially loading the table
> (probably a pg_restore), then, 2GB of WAL later, what looks like a VACUUM
> running on the table: a succession of FPI_FOR_HINT, FREEZE_PAGE, VISIBLE xlog
> records for each of the relation main fork, followed by a lonely FPI for the
> leaf page of its FSM:

You're using data_checksums, right? Thanks for the wal dump excerpts; I agree
with this summary thereof.

> There are no traces of relation truncation happening in the WAL.

That is notable.

> This case only shows a single invalid entry in the FSM, but I've noticed as
> much as 62 blocks present in the FSM while they do not exist on disk, all
> tagged with MaxFSMRequestSize so I suppose something is wrong with the bulk
> extension mechanism.

Is this happening after an OS crash, a replica promote, or a PITR restore? If
so, I think I see the problem. We have an undocumented rule that FSM shall not
contain references to pages past the end of the relation. To facilitate that,
relation truncation WAL-logs FSM truncate. However, there's no similar
protection for relation extension, which is not WAL-logged. We break the rule
whenever we write FSM for block X before some WAL record initializes block X.
data_checksums makes the trouble easier to hit, since it creates FPI_FOR_HINT
records for FSM changes. A replica promote or PITR ending just after the FSM
FPI_FOR_HINT would yield this broken state.
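To make the ordering problem concrete, here is a toy model of replay stopping
just after the FSM image. It is only a sketch, not PostgreSQL code: the record
kinds, replay() and dangling() are hypothetical names, and a real FSM FPI
carries a whole FSM page rather than a single entry.

```python
# Toy model, not PostgreSQL code: every name here is hypothetical.

def replay(wal, stop_after):
    """Replay WAL records up to and including index stop_after.

    Returns (main_fork_blocks, fsm). The main fork only grows when a record
    initializes a block, because relation extension itself is not WAL-logged.
    """
    main_fork_blocks = 0   # length of the main fork, in blocks
    fsm = {}               # block number -> advertised free space
    for kind, block in wal[: stop_after + 1]:
        if kind == "FSM_FPI":
            fsm[block] = "MaxFSMRequestSize"
        elif kind == "INIT_PAGE":
            main_fork_blocks = max(main_fork_blocks, block + 1)
    return main_fork_blocks, fsm


def dangling(main_fork_blocks, fsm):
    """FSM entries that point past the end of the main fork."""
    return sorted(b for b in fsm if b >= main_fork_blocks)


wal = [
    ("INIT_PAGE", 0),
    ("FSM_FPI", 1),     # FSM image mentions block 1 ...
    ("INIT_PAGE", 1),   # ... before any record initializes block 1
]

# Promote or end PITR right after the FSM image: block 1 is in the FSM but
# does not exist in the main fork.
print(dangling(*replay(wal, stop_after=1)))
# Replaying one more record closes the gap.
print(dangling(*replay(wal, stop_after=2)))
```

In this model the window only exists between the FSM record and the first
main-fork record for the same block, which matches the "promote or PITR ending
just after" condition above.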
While v16 RelationAddBlocks() made this easier to hit, I suspect it's
reproducible in all supported branches. For example, lazy_scan_new_or_empty()
and multiple index AMs break the rule via RecordPageWithFreeSpace() on a
PageIsNew() page.

I think the fix is one of:

- Revoke the undocumented rule. Make FSM consumers resilient to the FSM
  returning a now-too-large block number.

- Enforce a new "main-fork WAL before FSM" rule for logged rels. For example,
  in each PageIsNew() case, either don't update FSM or WAL-log an init (like
  lazy_scan_new_or_empty() does when PageIsEmpty()).
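For the first option, a resilient consumer might look roughly like the sketch
below. This is a toy in Python, not a proposed patch: get_page_with_free_space()
and its arguments are hypothetical, and the real FSM is a tree of pages rather
than a flat map.

```python
# Toy sketch of the "resilient consumer" option; hypothetical names only.

def get_page_with_free_space(fsm, nblocks, needed):
    """Return a block advertising at least `needed` free bytes, or None.

    fsm maps block number -> advertised free bytes. nblocks is the current
    length of the main fork. An entry at or past nblocks is treated as stale:
    it is zeroed so it is not handed out again, and the search continues.
    """
    for block in sorted(fsm):
        if fsm[block] < needed:
            continue
        if block >= nblocks:
            fsm[block] = 0      # stop advertising the nonexistent block
            continue
        return block
    return None                 # caller would extend the relation instead


fsm = {0: 100, 1: 8000, 2: 8000}
# Only block 0 exists; blocks 1 and 2 are stale entries past the end.
print(get_page_with_free_space(fsm, nblocks=1, needed=4000))
print(fsm)
```

The point of the design is that the consumer checks the returned block number
against the relation's current size and repairs the FSM as it goes, rather than
trusting the FSM and failing on the read.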