Re: BUG #10432: failed to re-find parent key in index
From | Andres Freund |
---|---|
Subject | Re: BUG #10432: failed to re-find parent key in index |
Date | |
Msg-id | 20140604113519.GG1220@awork2.anarazel.de |
In response to | Re: BUG #10432: failed to re-find parent key in index (Greg Stark <stark@mit.edu>) |
List | pgsql-bugs |
Hi,

On 2014-06-04 12:14:27 +0100, Greg Stark wrote:
> Ok, I made some progress. It turns out this was a pre-existing problem
> in the master. They've been getting "failed to re-find parent" errors
> for weeks. Far longer than I have any WAL or backups for.

Ok.

> 1) Failed to re-find parent should perhaps not be FATAL to recovery.
> In fact any index replay error would really be nice not to have to
> crash on.

I think that's not really realistic. We'd need to put in a significant
amount of machinery for this to be workable. Suddenly a crash restart
doesn't guarantee that your indexes are there anymore? Not nice.

> All crashing does is prevent the user from being able to
> bring up their database and REINDEX the btree. This may be another use
> case for the machinery that would protect against corrupt hash indexes
> or user-defined indexes -- if we could mark the index invalid and
> proceed (perhaps ignoring subsequent records for it) that would be
> great.
>
> 2) When we see an abort record we could check for any cleanup actions
> triggered by that transaction and run them right away. I think the
> checkpoints (and maybe hot standby snapshots or vacuum cleanup
> records?) also include information about the oldest xid running, they
> would also let us prune the cleanup actions sooner. That would at
> least find the error sooner. In conjunction with (1) it would also
> mean subsequent restartpoints would be effective instead of
> suppressing restartpoints right to the end of recovery.

Heikki removed restartpoints from 9.4 altogether, so most of these are
gone. As all of these, even if they were doable, sound far too large for
backpatching, I think it's luckily mostly done.

> 3) The lack of logs around an error during recovery makes it hard to
> decipher what's going on. It would be nice to see "Beginning Xlog
> cleanup (1 incomplete splits to replay)" and when it crashed "Last
> safe point to restart recovery is 324/ABCDEF". As it was it was a
> pretty big mystery why the database crashed, the logs made it appear
> as if it had started up fine. And it was unclear why restarting it
> caused it to replay from the beginning, I thought maybe something was
> wrong with our scripts.

I think this should be fixed by setting up error context stack support
in two places:
a) in StartupXLOG() before the rm_cleanup() calls
b) in < 9.4, inside the individual cleanup routines.

We do all that around redo routines, but, as evidenced here, that's not
always enough.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
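
For readers unfamiliar with the error context stack mentioned in the reply, here is a minimal standalone sketch of the push/pop pattern it refers to. It is not PostgreSQL source and not a patch. The struct mirrors the shape of PostgreSQL's `ErrorContextCallback`, but `report_error`, `append_line`, `run_cleanup_with_context`, `failing_btree_cleanup` and the wording of the context message are all hypothetical stand-ins invented for illustration. In the real server a callback would call `errcontext()`, and an error raised during recovery would not return to the caller as it does here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same shape as PostgreSQL's ErrorContextCallback: a linked stack of callbacks. */
typedef struct ErrorContextCallback
{
    struct ErrorContextCallback *previous;
    void        (*callback) (void *arg);
    void       *arg;
} ErrorContextCallback;

static ErrorContextCallback *error_context_stack = NULL;

/* Hypothetical stand-in for the server log: collects the last error report. */
static char report_buf[512];

static void
append_line(const char *prefix, const char *text)
{
    size_t      used = strlen(report_buf);

    snprintf(report_buf + used, sizeof(report_buf) - used, "%s%s\n", prefix, text);
}

/* Hypothetical stand-in for ereport(): emit the message, then walk the stack. */
static void
report_error(const char *msg)
{
    ErrorContextCallback *ecb;

    report_buf[0] = '\0';
    append_line("ERROR:  ", msg);
    for (ecb = error_context_stack; ecb != NULL; ecb = ecb->previous)
        ecb->callback(ecb->arg);
}

/* Context callback: says which resource manager's cleanup was running. */
static void
rm_cleanup_error_callback(void *arg)
{
    append_line("CONTEXT:  xlog cleanup for resource manager ", (const char *) arg);
}

/* Hypothetical cleanup routine that fails the way the bug report describes. */
static void
failing_btree_cleanup(void)
{
    report_error("failed to re-find parent key in index");
}

/* Push a context entry, run the cleanup routine, pop the entry again. */
static const char *
run_cleanup_with_context(const char *rmgr_name, void (*cleanup) (void))
{
    ErrorContextCallback errcallback;

    assert(cleanup != NULL);

    errcallback.callback = rm_cleanup_error_callback;
    errcallback.arg = (void *) rmgr_name;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    cleanup();

    error_context_stack = errcallback.previous;
    return report_buf;
}
```

Calling `run_cleanup_with_context("Btree", failing_btree_cleanup)` yields a report in which the error line is followed by a context line naming the resource manager. That extra line is the kind of information the original report found missing from the logs.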