Re: [HACKERS] emergency outage requiring database restart
From | Merlin Moncure |
---|---|
Subject | Re: [HACKERS] emergency outage requiring database restart |
Date | |
Msg-id | CAHyXU0ypCaDJMJ78H6EdKztZeh5oEkGu+j5HpwmfzOpWB4q1zg@mail.gmail.com |
In response to | Re: [HACKERS] emergency outage requiring database restart (Ants Aasma <ants.aasma@eesti.ee>) |
List | pgsql-hackers |
On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> Still getting checksum failures. Over the last 30 days, I see the
>> following. Since enabling checksums FWICT none of the damage is
>> permanent and rolls back with the transaction. So creepy!
>
> The checksums still only differ in least significant digits which
> pretty much means that there is a block number mismatch. So if you
> rule out filesystem not doing its job correctly and transposing
> blocks, it could be something else that is resulting in blocks getting
> read from a location that happens to differ by a small multiple of
> page size. Maybe somebody is racily mucking with table fd's between
> seeking and reading. That would explain the issue disappearing after a
> retry.
>
> Maybe you can arrange for the RelFileNode and block number to be
> logged for the checksum failures and check what the actual checksums
> are in data files surrounding the failed page. If the requested block
> number contains something completely else, but the page that follows
> contains the expected checksum value, then it would support this
> theory.

will do. Main challenge is getting hand compiled server to swap in so
that libdir continues to work. Getting access to the server is
difficult as is getting a maintenance window. I'll post back ASAP.

merlin
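
For anyone following along, here is a rough Python sketch of the two ideas in the quoted suggestion. It is an illustration under stated assumptions, not code from the server. `mix_block_number` mirrors what I understand to be the last step of PostgreSQL's page checksum (the page hash is XORed with the block number, then reduced modulo 65535 and incremented by one), which is why a small block number mismatch tends to show up only in the low digits. `stored_checksums` is a hypothetical helper that reads the `pd_checksum` field (a 16-bit value at byte offset 8 of the page header) for the blocks around a suspect one. It assumes the default 8 kB page size, a little-endian machine, a block number relative to the start of the segment file, and a file that is not being written to while it is read.

```python
import struct

BLCKSZ = 8192  # assumed: default PostgreSQL page size


def mix_block_number(page_hash: int, blkno: int) -> int:
    """Fold a block number into a 32-bit page hash, as the final checksum step does."""
    return ((page_hash ^ blkno) & 0xFFFFFFFF) % 65535 + 1


def stored_checksums(path: str, blkno: int, radius: int = 2) -> dict:
    """Return {block: pd_checksum} for blocks within `radius` of `blkno` in one file."""
    result = {}
    with open(path, "rb") as f:
        for b in range(max(0, blkno - radius), blkno + radius + 1):
            f.seek(b * BLCKSZ)
            header = f.read(10)
            if len(header) < 10:
                break  # ran past the end of the file
            # pd_lsn occupies bytes 0-7; pd_checksum is a uint16 at offset 8
            (checksum,) = struct.unpack_from("<H", header, 8)
            result[b] = checksum
    return result


if __name__ == "__main__":
    # Same (made-up) page hash, adjacent block numbers: only the low digits move.
    for blk in (100, 101, 102):
        print(blk, mix_block_number(0x12345678, blk))
```

If the checksum the server expected for the failed block turns up as the stored value of a neighbouring block, that would be consistent with the block number mismatch theory. The page hash passed to `mix_block_number` above is a made-up constant, not a value from any real page.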