Re: possible bug in handling of contrecords in dd38ff28ad (Fix recovery_prefetch with low maintenance_io_concurrency)
From | Tomas Vondra |
---|---|
Subject | Re: possible bug in handling of contrecords in dd38ff28ad (Fix recovery_prefetch with low maintenance_io_concurrency) |
Date | |
Msg-id | 10616819-5333-e35e-32c8-c0d138542167@enterprisedb.com |
In reply to | Re: possible bug in handling of contrecords in dd38ff28ad (Fix recovery_prefetch with low maintenance_io_concurrency) (Thomas Munro <thomas.munro@gmail.com>) |
Responses | Re: possible bug in handling of contrecords in dd38ff28ad (Fix recovery_prefetch with low maintenance_io_concurrency) (Thomas Munro <thomas.munro@gmail.com>) |
List | pgsql-hackers |
On 7/2/23 04:09, Thomas Munro wrote:
> On Sun, Jul 2, 2023 at 1:40 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> I think there's some sort of bug in how dd38ff28ad deals with
>> contrecords. Consider something as simple as
>>
>>   pgbench -i -s 100
>>
>> and then doing pg_waldump on the WAL segments, I get this for every
>> single one:
>>
>>   pg_waldump: error: error in WAL record at 0/1FFFF98: missing
>>   contrecord at 0/1FFFFE0
>>
>> This only happens since dd38ff28ad, and revert makes it disappear.
>>
>> It's possible we still have some issue with contrecords, but IIUC we
>> fixed those. So unless there's some unknown one (and considering this
>> seems to happen for *every* WAL segment that's hard to believe), this
>> seems more like an issue in the error detection.
>
> Yeah. That message is due to this part of dd38ff28ad's change:
>
>     Also add an explicit error message for missing contrecords. It was a
>     bit strange that we didn't report an error already, and became a latent
>     bug with prefetching, since the internal state that tracks aborted
>     contrecords would not survive retrying, as revealed by
>     026_overwrite_contrecord.pl with this adjustment. Reporting an error
>     prevents that.
>
> We can change 'missing contrecord' back to silent end-of-decoding (as
> it was in 14) with the attached. Here [1] is some analysis of this
> error that I wrote last year. The reason for my hesitation in pushing
> a fix was something like this:
>
> 1. For pg_waldump, it's "you told me to decode only this WAL segment,
>    so decoding failed here", which is useless noise
> 2. For walsender, it's "you told me to shut down, so decoding failed
>    here", which is useless noise
> 3. For crash recovery, "I ran out of data, so decoding failed here",
>    which seems like a report-worthy condition, I think?
>
> When I added that new error I was thinking about that third case. We
> generally expect to detect the end of WAL replay after a crash with an
> error ("invalid record length ...: wanted 24, got 0" + several other
> possibilities), but in this rare case it would be silent. The effect
> on the first two cases is cosmetic, but certainly annoying. Perhaps I
> should go ahead and back-patch the attached change, and then we can
> discuss whether/how we should do a better job of distinguishing "user
> requested artificial end of decoding" from "unexpectedly ran out of
> data" for v17?

Yeah, I think that'd be sensible. IMHO we have a habit of scaring users
with stuff that might be dangerous/bad, but 99% of the time it's
actually fine and perhaps even expected. It's almost as if we're
conditioning people to ignore errors.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
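[Editor's note: a minimal sketch of the reproduction described in the quoted report, for readers who want to see the message themselves. The data directory, log file name, and target database are illustrative assumptions; any build containing commit dd38ff28ad should show the same behaviour.]

    # Initialize a throwaway cluster and generate a few hundred MB of WAL
    # with the command from the report.
    initdb -D ./data
    pg_ctl -D ./data -l logfile start
    pgbench -i -s 100 postgres
    pg_ctl -D ./data stop

    # Decode each WAL segment on its own; per the report, every segment whose
    # last record continues into the next one ends with the
    # "missing contrecord" error.
    for seg in ./data/pg_wal/0000000100000000000000*; do
        pg_waldump "$seg"
    done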