Re: Losing records when server hang
From | lec |
---|---|
Subject | Re: Losing records when server hang |
Date | |
Msg-id | 41182683.6090409@streamyx.com |
In reply to | Losing records when server hang (lec <limec@streamyx.com>) |
List | pgsql-general |
Tom Lane wrote:

> Marco Colombo <marco@esi.it> writes:
>
>> Tom Lane wrote:
>>
>>> However this would seem to imply disk drive misfeasance above and beyond your motherboard problem.
>>
>> Well, no. How about this theory:
>>
>> 1) everything is ok: the backend executes write()/fsync() for transactions 1-5
>>
>> 2) hardware fails somehow at the MB level (imagine CPU/RAM overheating): RAM gets corrupted and the kernel starts oopsing (but goes on). Meanwhile, the backend executes write()/fsync() for transactions 6-10, but randomly corrupted data gets written to disk.
>>
>> 3) unrecoverable kernel error occurs, the show stops.
>>
>> On recovery, transactions 6-9 don't even look like valid log entries, while 10, for some reason, does (maybe only data is corrupted).
>>
>> I'm not familiar with the details of WAL files and post-crash recovery, but is that possible? Or does the process stop at the first failure?
>
> Recovery will stop at the first corrupted record, so it would not happen like that. But you are right, the MB failure alone might have been enough to corrupt the outgoing WAL log data and thus produce the scenario I described. Once Postgres *thinks* transactions 1-10 are safely down to disk in the WAL log, it will feel free to update the data files in any random order that seems convenient. So the write of record 10 could have occurred before the rest, and if that happened not to get corrupted by the MB problem, we could see the result lec describes. Of course this is all guesswork since we have no direct evidence to look at, but it seems fairly plausible.
>
>> Anyway, if your CPU/RAM is failing, no DB technology can save you.
>
> Agreed. Software certainly cannot make any guarantees if it can't even execute correctly ...

Same here. I don't even want to have to prove anything if the hardware isn't reliable, but "management" is asking about the lost transactions and blaming the system, the software, and the database. I could prove to them that the lost transactions were due to the system hang, but transaction #10 being there makes my reasoning doubtful.
Thanks for all your feedback and reasoning.
--lec
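
To make the "recovery will stop at the first corrupted record" rule from the thread concrete, here is a minimal toy sketch in Python. It is not PostgreSQL's WAL format or recovery code: the record layout (length, CRC-32, payload) and the names `encode_record`, `flip_last_bit`, and `replay` are assumptions made up for illustration. It only shows that a replay loop which halts at the first bad checksum cannot resurrect record 10 from the log when record 6 is damaged, which is why the explanation in the thread relies on data-file writes happening out of order rather than on log replay.

```python
import struct
import zlib

HEADER = "<II"  # hypothetical layout: payload length, CRC-32 of payload
HEADER_SIZE = struct.calcsize(HEADER)


def encode_record(payload: bytes) -> bytes:
    """Frame a payload as header (length, CRC-32) followed by the payload."""
    return struct.pack(HEADER, len(payload), zlib.crc32(payload)) + payload


def flip_last_bit(record: bytes) -> bytes:
    """Simulate corruption by flipping the lowest bit of the last payload byte."""
    return record[:-1] + bytes([record[-1] ^ 0x01])


def replay(log: bytes) -> list:
    """Return payloads in order, stopping at the first record that fails its check."""
    records = []
    offset = 0
    while offset + HEADER_SIZE <= len(log):
        length, crc = struct.unpack_from(HEADER, log, offset)
        start = offset + HEADER_SIZE
        payload = log[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # first corrupted record: nothing after it is replayed
        records.append(payload)
        offset = start + length
    return records


if __name__ == "__main__":
    frames = [encode_record(f"txn {i}".encode()) for i in range(1, 11)]
    frames[5] = flip_last_bit(frames[5])  # damage transaction 6 only
    print(len(replay(b"".join(frames))))  # prints 5
```

In this toy model transactions 7-10 are intact in the log but are never replayed, because the loop stops at record 6. That matches the point in the thread that a surviving transaction #10 would have to come from a data-file write, not from recovery.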