Re: Theory about XLogFlush startup failures
От | Tom Lane |
---|---|
Тема | Re: Theory about XLogFlush startup failures |
Дата | |
Msg-id | 25375.1011113260@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Re: Theory about XLogFlush startup failures (Tom Lane <tgl@sss.pgh.pa.us>) |
Ответы |
Re: Theory about XLogFlush startup failures
|
Список | pgsql-hackers |
I said: > Hiroshi Inoue <Inoue@tpf.co.jp> writes: >> What about PageGetItem ? It seems to be able to touch the item >> via HeapTupleSatisfies etc. > Hmm. Strictly speaking I think you are right, but I'm hesitant to add a > bunch of new tests to PageGetItem --- that is much more of a hot spot > than PageAddItem, and it'll cost us something in speed I fear. I wasn't totally comfortable with this (and I'm sure you weren't either), but after more thought I still feel it's the right tradeoff. Here are a couple of heuristic arguments why we don't need more error checking in PageGetItem: 1. tqual.c won't ever try to set tuple status bits until it's checked t_xmin or t_xmax against TransactionIdDidCommit/DidAbort. If messed-up page headers have caused us to compute a bogus item pointer, one would expect that a more-or-less-random transaction ID will be passed to TransactionIdDidCommit/DidAbort. Now in 7.2, it's unlikely that more than about 2 segments (2 million transactions' worth) of CLOG will exist at any instant, so the odds that asking about a random XID will produce an answer and not elog(STOP) are less than 1 in 1000. 2. If this happened in the field, the signature would be one or two bits set apparently at random in an otherwise-okay page. In the data corruption cases I've been able to examine personally, I can't recall ever having seen such a case. The usual form of corruption is dozens of consecutive bytes worth of garbage overlaying part of an otherwise-valid page. While I tend to blame such stuff on hardware glitches (especially when the damage is aligned on power-of-2 byte boundaries), it's certainly possible that it comes from a misdirected memcpy, which is why I think it's a good idea to introduce more bounds checking in PageAddItem and so forth. If we start to see failure reports that look like they might have been caused by tqual.c let loose on the wrong bits, we can certainly revisit this decision. But right now I think that adding more checks in PageGetItem would waste a lot of cycles to little purpose. BTW, to close the loop back to the original topic: I think it's quite likely that some of the elog(STOP)s in clog.c will need to be reduced to lesser error levels once we see what sorts of problems arise in the field, just as we found that this particular elog(STOP) in xlog.c was overkill. But I want to wait and see which ones cause problems before backing off the error severity. I will go and add a few more sanity checks to bufpage.c. regards, tom lane
В списке pgsql-hackers по дате отправления: