Re: problems with making relfilenodes 56-bits

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: problems with making relfilenodes 56-bits
Дата
Msg-id CA+TgmoYP=w5j3TCdmsZt=HtQMBe+E2X58YkZGvk+vvCdGStmWw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: problems with making relfilenodes 56-bits  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
On Tue, Oct 4, 2022 at 2:30 PM Andres Freund <andres@anarazel.de> wrote:
> Consider the following sequence:
>
> 1) we write WAL like this:
>
> [record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev B_lsn]
>
> 2) crash, the sectors with A and C made it to disk, the one with B didn't
>
> 3) We replay A, discover B is invalid (possibly zeroes or old contents),
>    insert a new record B' with the same length. Now it looks like this:
>
> [record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev B_lsn]
>
> 4) crash, the sector with B' makes it to disk
>
> 5) we replay A, B', C, because C has an xl_prev that's compatible with B'
>    location and a valid CRC.
>
> Oops.
>
> I think this can happen both within a single page and across page boundaries.
>
> I hope I am missing something here?

If you are, I don't know what it is off-hand. That seems like a
plausible scenario to me. It does require the OS to write things out
of order, and I don't know how likely that is in practice, but the
answer probably isn't zero.

> I think there might be reasonable ways to increase the guarantees based on the
> 2 byte xl_prev approach "alone". We don't have to store the offset from the
> page header as a plain offset. What about storing something like:
>   page_offset ^ (page_lsn >> wal_segsz_shift)
>
> I think something like that'd result in prev_not_really_lsn typically not
> simply matching after recycling. Of course it only provides so much
> protection, given 16bits...

Maybe. That does seem somewhat better, but I feel like it's hard to
reason about whether it's safe in absolute terms or just resistant to
the precise scenario Matthias postulated while remaining vulnerable to
slightly modified versions.

How about this: remove xl_prev. widen xl_crc to 64 bits. include the
CRC of the previous WAL record in the xl_crc calculation. That doesn't
cut quite as many bytes out of the record size as your proposal, but
it feels like it should strongly resist pretty much every attack of
this general type, with only the minor disadvantage that the more
expensive CRC calculation will destroy all hope of getting anything
committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: installcheck-world concurrency issues
Следующее
От: Nikita Malakhov
Дата:
Сообщение: Re: Pluggable toaster