Re: BUG #10432: failed to re-find parent key in index
From: Greg Stark
Subject: Re: BUG #10432: failed to re-find parent key in index
Date:
Msg-id: CAM-w4HM3wLZU_qd6fU7XhEiPX9DO-R_FBszfQqE9GhrNy-kfYw@mail.gmail.com
In response to: Re: BUG #10432: failed to re-find parent key in index (Andres Freund <andres@2ndquadrant.com>)
Responses:
  Re: BUG #10432: failed to re-find parent key in index
  Re: BUG #10432: failed to re-find parent key in index
  Re: BUG #10432: failed to re-find parent key in index
List: pgsql-bugs
Ok, I made some progress. It turns out this was a pre-existing problem on the master: they had been getting "failed to re-find parent" errors for weeks, far longer than I have any WAL or backups for.

What I did find interesting is that this error basically made the backups worthless. I could build a hot standby, connect, and query it. But as soon as recovery finished it would try to clean up the incomplete split and fail. Because it had noticed the incomplete split it had skipped every restartpoint, so the next time I tried to start it, it insisted on restarting recovery from the beginning. If we had been lucky enough not to do any page splits in the broken index while the backup was being taken, all would have been fine. But that doesn't seem to have happened, so all the backups were unrecoverable.

So a few thoughts on how to improve things:

1) "Failed to re-find parent" should perhaps not be FATAL to recovery. In fact, it would be nice not to have to crash on any index replay error. All crashing does is prevent the user from being able to bring up their database and REINDEX the btree. This may be another use case for the machinery that would protect against corrupt hash indexes or user-defined indexes -- if we could mark the index invalid and proceed (perhaps ignoring subsequent records for it) that would be great.

2) When we see an abort record we could check for any cleanup actions triggered by that transaction and run them right away. I think the checkpoints (and maybe hot standby snapshots or vacuum cleanup records?) also include information about the oldest running xid, which would let us prune the cleanup actions sooner. That would at least surface the error sooner. In conjunction with (1) it would also mean subsequent restartpoints would be effective, instead of restartpoints being suppressed right up to the end of recovery.

3) The lack of logging around an error during recovery makes it hard to decipher what's going on.
It would be nice to see "Beginning Xlog cleanup (1 incomplete split to replay)" and, when it crashed, "Last safe point to restart recovery is 324/ABCDEF". As it was, it was a pretty big mystery why the database crashed; the logs made it appear as if it had started up fine. And it was unclear why restarting it caused it to replay from the beginning -- I thought maybe something was wrong with our scripts.
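To make the restartpoint behaviour above concrete, here is a toy model of it. This is plain Python, not PostgreSQL source, and every name in it is invented for illustration; it only shows the shape of the problem: while a multi-record operation like a page split is unfinished, restartpoints are not established, so a crashed standby has to replay from the last one that was (here, the very start of recovery).

```python
# Toy model (NOT PostgreSQL code; all names invented) of how an
# incomplete btree split suppresses restartpoints during recovery.

def replay(records):
    """Replay (lsn, record) pairs; return the LSN a restarted
    recovery would have to resume from."""
    last_restartpoint = 0      # beginning of recovery
    split_pending = False      # an unfinished multi-record operation
    for lsn, rec in records:
        if rec == "split-start":
            split_pending = True
        elif rec == "split-finish":
            split_pending = False
        elif rec == "checkpoint":
            # A restartpoint is only established when no split is
            # still incomplete; otherwise the checkpoint is skipped.
            if not split_pending:
                last_restartpoint = lsn
    return last_restartpoint

wal = [(10, "checkpoint"), (20, "split-start"),
       (30, "checkpoint"), (40, "checkpoint")]
# The split never finishes, so the checkpoints at 30 and 40 do not
# become restartpoints; a crash forces replay from LSN 10.
print(replay(wal))  # -> 10
```

In this model, an "abort-triggered cleanup" as in proposal (2) would amount to resetting `split_pending` as soon as the aborting transaction's cleanup runs, which is exactly what would let the later checkpoints become effective restartpoints.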