Re: Logging corruption error codes
From | Andrey Borodin |
---|---|
Subject | Re: Logging corruption error codes |
Date | |
Msg-id | E14AD04A-A800-4D39-BA1F-64B5619F76D2@yandex-team.ru |
In reply to | Re: Logging corruption error codes (Peter Geoghegan <pg@bowt.ie>) |
Responses | Re: Logging corruption error codes |
List | pgsql-bugs |
Please find attached patch v2: I added a few more corruption cases.

> On 25 July 2019, at 23:27, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Jul 25, 2019 at 3:45 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> From my POV these messages provide meaningful information to cope with corruption. But they are definitely internal.
>> Translations already provide some information on TOAST chunks, mention btree many times, and cover many other internal things.
>> So, I'm confused about the status of these messages.
>> Such messages should be rare enough, and those to whom they are addressed should be familiar with English.
>
> I agree that these don't need to be translated, which means you must
> use errmsg_internal() with ereport(). A message like "failed to
> re-find parent key in index..." doesn't mean anything to more than a
> tiny number of experts. It is useful only because you can paste it
> into a search engine. Users will want to search for the English string
> anyway.

We already have translations for messages like "index \"%s\" is not a btree" and "version mismatch in index \"%s\": file version %d, ". Personally, I agree that we should try to keep these messages googlable in the mailing-list archives. But marking them errmsg_internal() would throw away some of the translators' work, so I haven't marked them internal in this version.

>> This causes various data corruptions, always undetected by data checksums (do we want a Merkle tree?).
>
> I don't think that it's possible to verify the integrity of multiple
> page images without amcheck support for the access method. It might be
> possible to do slightly more in a generic way, but I doubt it.

Well, if you have a fork with the LSN of each page, you can guarantee that you do not have a stale version of any single page. You also get cheap block-level incremental backups, fast catch-up of standbys, etc. But this comes at a cost. Anyway, that's a discussion for another thread.

>> Besides the messages in this patch, we also saw:
>> could not read block 1751 in file "base/16452/358336": Bad address // Probably not data corruption alone, but mostly a hardware fault
>> t_xmin is uncommitted in tuple to be updated // Probably on-disk corruption
>> failed to re-find parent key in index // Probably index corruption
>> left link changed unexpectedly in block // Probably on-disk data corruption
>> right sibling 45056 of block * is not next child * of block * in index // Definitely index corruption
>>
>> Should I add corruption codes for these messages in this patch, or start a separate discussion about them?
>
> I don't think that we need to worry too much about the difference
> between data corruption and a hardware fault that could theoretically
> self-correct. There is a cost to making fine distinctions like this in
> the errcodes we use.

Currently, the "could not read block" case is reported with errcode_for_file_access(). I think that code is better than a corruption error code here.

Thanks!

Best regards, Andrey Borodin.
Attachments