Обсуждение: Preventing indirection for IndexPageGetOpaque for known-size page special areas

Поиск
Список
Период
Сортировка

Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Matthias van de Meent
Дата:
Hi,

I noticed that effectively all indexes use the special region of a
page to store some index-specific data on the page. In all cases I've
noticed, this is a constant-sized struct, located at what is
effectively a fixed offset from the end of the page; indicated on the
page by pd_special; and accessed through PageGetSpecialPointer() (that
is about equal to `page + page->pd_special` modulo typing and
assertions).

Seeing that these indexes effectively always use a constant-sized
struct as the only data in the special region; why does this code
depend on the pd_special pointer to retrieve their PageOpaque pointer?
Wouldn't a constant offset off of (page), based on the size of the
opaque structure and BLCKSZ, be enough?

Sure, in assertion builds this also validates that pd_special indeed
points in the bounds of the page, but PageGetSpecialPointer() does not
validate that the struct itself is in the bounds of the page, so
there's little guarantee that this is actually safe.

Additionally, this introduces a layer of indirection that is not
necessarily useful: for example the_bt_step_left code has no use for
the page's header information other than that it contains the
pd_special pointer (which is effectively always BLCKSZ -
MAXALIGN(sizeof(opaque))). This introduces more data and ordering
dependencies in the CPU, where this should be as simple as a constant
offset over the base page pointer.


Assuming there's no significant reason to _not_ to change this code to
a constant offset off the page pointer and only relying on pd_special
in asserts when retrieving the IndexPageOpaques, I propose the
attached patches:

0001 adds a new macro PageGetSpecialOpaque(page, opaquedatatyp); which
replaces PageGetSpecialPointer for constant sized special area
structures, and sets up the SPGist, GIST and Gin index methods to use
that instead of PageGetSpecialPointer.
0002 replaces manual PageGetSpecialPointer calls & casts in btree code
with a new macro BTPageGetOpaque, which utilizes PageGetSpecialOpaque.
0003 does the same as 0002, but for hash indexes.

A first good reason to do this is preventing further damage when a
page is corrupted: if I can somehow overwrite pd_special,
non-assert-enabled builds would start reading and writing at arbitrary
offsets from the page pointer, quite possibly in subsequent buffers
(or worse, on the stack, in case of stack-allocated blocks).
A second reason would be less indirection to get to the opaque
pointer. This should improve performance a bit in those cases where we
(initially) only want to access the [Index]PageOpaque struct.
Lastly, 0002 and 0003 remove some repetitive tasks of dealing with the
[nbtree, hash] opaques by providing a typed accessor macro similar to
what is used in the GIN and (SP-)GIST index methods; improving the
legibility of the code and decreasing the churn.

Kind regards,

Matthias van de Meent.

Вложения

Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Michael Paquier
Дата:
On Wed, Feb 16, 2022 at 10:24:58PM +0100, Matthias van de Meent wrote:
> A first good reason to do this is preventing further damage when a
> page is corrupted: if I can somehow overwrite pd_special,
> non-assert-enabled builds would start reading and writing at arbitrary
> offsets from the page pointer, quite possibly in subsequent buffers
> (or worse, on the stack, in case of stack-allocated blocks).

Well, pageinspect has proved to be rather careless in this area for a
couple of years, but this just came from the fact that we called
PageGetSpecialPointer() before checking that PageGetSpecialSize() maps
with the MAXALIGN'd size of the opaque area, if the related AM has
one.

I am not sure that we need to worry about corruptions, as data
checksums would offer protection for most cases users usually need to
worry about, at the exception of page corruptions because of memory
for pages already read in the PG shared buffers from disk or even the
OS cache.

> A second reason would be less indirection to get to the opaque
> pointer. This should improve performance a bit in those cases where we
> (initially) only want to access the [Index]PageOpaque struct.

We don't have many cases that do that, do we?

> Lastly, 0002 and 0003 remove some repetitive tasks of dealing with the
> [nbtree, hash] opaques by providing a typed accessor macro similar to
> what is used in the GIN and (SP-)GIST index methods; improving the
> legibility of the code and decreasing the churn.

+#define PageGetSpecialOpaque(page, OpaqueDataTyp) \
+( \
+   AssertMacro(PageGetPageSize(page) == BLCKSZ && \
+               PageGetSpecialSize(page) == MAXALIGN(sizeof(OpaqueDataTyp))), \
+   (OpaqueDataTyp *) ((char *) (page) + (BLCKSZ - MAXALIGN(sizeof(OpaqueDataTyp)))) \
+)

Did you notice PageValidateSpecialPointer() that mentions MSVC?  Could
this stuff a problem if not an inline function.

Perhaps you should do a s/OpaqueDataTyp/OpaqueDataType/.

As a whole, this patch set looks like an improvement in terms of
checks and consistency regarding the special area handling across the
different AMs for the backend code, while reducing the MAXALIGN-ism on
all those sizes, and this tends to be missed.  Any other opinions?
--
Michael

Вложения

Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Matthias van de Meent
Дата:
On Mon, 28 Mar 2022 at 06:33, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Feb 16, 2022 at 10:24:58PM +0100, Matthias van de Meent wrote:
> > A first good reason to do this is preventing further damage when a
> > page is corrupted: if I can somehow overwrite pd_special,
> > non-assert-enabled builds would start reading and writing at arbitrary
> > offsets from the page pointer, quite possibly in subsequent buffers
> > (or worse, on the stack, in case of stack-allocated blocks).
>
> Well, pageinspect has proved to be rather careless in this area for a
> couple of years, but this just came from the fact that we called
> PageGetSpecialPointer() before checking that PageGetSpecialSize() maps
> with the MAXALIGN'd size of the opaque area, if the related AM has
> one.
>
> I am not sure that we need to worry about corruptions, as data
> checksums would offer protection for most cases users usually need to
> worry about, at the exception of page corruptions because of memory
> for pages already read in the PG shared buffers from disk or even the
> OS cache.

Not all clusters have checksums enabled (boo!, but we can't
realistically fix that), so on-disk corruption could reasonably
propagate to the rest of such system. Additionally, checksums are only
checked on read, and updated when the page is written to disk (in
PageSetChecksumCopy called from FlushBuffer), but this does not check
for signs of page corruption. As such, a memory bug resulting in
in-memory corruption of pd_special would persist and would currently
have the potential to further corrupt neighbouring buffers.

> > A second reason would be less indirection to get to the opaque
> > pointer. This should improve performance a bit in those cases where we
> > (initially) only want to access the [Index]PageOpaque struct.
>
> We don't have many cases that do that, do we?

Indeed not many, but I know that at least the nbtree-code has some
cases where it uses the opaque before other fields in the header are
accessed (sometimes even without accessing the header for other
reasons): in _bt_moveright and _bt_walk_left. There might be more; but
these are cases I know of.

> > Lastly, 0002 and 0003 remove some repetitive tasks of dealing with the
> > [nbtree, hash] opaques by providing a typed accessor macro similar to
> > what is used in the GIN and (SP-)GIST index methods; improving the
> > legibility of the code and decreasing the churn.
>
> +#define PageGetSpecialOpaque(page, OpaqueDataTyp) \
> +( \
> +   AssertMacro(PageGetPageSize(page) == BLCKSZ && \
> +               PageGetSpecialSize(page) == MAXALIGN(sizeof(OpaqueDataTyp))), \
> +   (OpaqueDataTyp *) ((char *) (page) + (BLCKSZ - MAXALIGN(sizeof(OpaqueDataTyp)))) \
> +)
>
> Did you notice PageValidateSpecialPointer() that mentions MSVC?  Could
> this stuff a problem if not an inline function.

I'm not sure; but the CFbot build (using whatever MSVC is used with
latest Window Server 2019, VS 19) failed to complain in this case.
Furthermore, the case mentioned in the comment of
PageValidateSpecialPointer() refers to problems that only happened
after the macro was updated from one MacroAssert to three seperate
MacroAssert()s; this new code currently only has one, so we should be
fine for now.

> Perhaps you should do a s/OpaqueDataTyp/OpaqueDataType/.

This has been fixed in attached v2; and apart from tweaks in the
commit messages there have been no other changes.

> As a whole, this patch set looks like an improvement in terms of
> checks and consistency regarding the special area handling across the
> different AMs for the backend code, while reducing the MAXALIGN-ism on
> all those sizes, and this tends to be missed.

Thanks!

-Matthias

Вложения

Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Michael Paquier
Дата:
On Mon, Mar 28, 2022 at 05:09:10PM +0200, Matthias van de Meent wrote:
> Not all clusters have checksums enabled (boo!, but we can't
> realistically fix that), so on-disk corruption could reasonably
> propagate to the rest of such system. Additionally, checksums are only
> checked on read, and updated when the page is written to disk (in
> PageSetChecksumCopy called from FlushBuffer), but this does not check
> for signs of page corruption. As such, a memory bug resulting in
> in-memory corruption of pd_special would persist and would currently
> have the potential to further corrupt neighbouring buffers.

Well, if it comes to corruption then pd_special is not the only
problem, just one part of it.  A broken pd_lower or pd_lower or even
pd_lsn could also cause AMs to point at areas they should not.  There
are many reasons that could make things go wrong depending on the
compiled page size.

>>> A second reason would be less indirection to get to the opaque
>>> pointer. This should improve performance a bit in those cases where we
>>> (initially) only want to access the [Index]PageOpaque struct.
>>
>> We don't have many cases that do that, do we?
>
> Indeed not many, but I know that at least the nbtree-code has some
> cases where it uses the opaque before other fields in the header are
> accessed (sometimes even without accessing the header for other
> reasons): in _bt_moveright and _bt_walk_left. There might be more; but
> these are cases I know of.

Even these are marginal as the page is already loaded in the shared
buffers..

While looking at the page, the pieces in 0002 and 0003 that I really
liked are the introduction of the macros to grab the special area for
btree and hash.  This brings the code of those APIs closer to GiST,
SpGiST and GIN.  And it is possible to move to a possible change in
the checks and/or the shape of all the *GetOpaque macros at once after
the switch to this part is done.  So I don't really mind introducing
this part.

PageGetSpecialOpaque() would not have saved the day with the recent
pageinspect changes, and that's just moving the responsability of the
page header to something else.  I am not sure if it is a good idea to
have a mention of the opaque page type in bufpage.h actually, as this
is optional.  Having an AssertMacro() checking that pd_special is in
line with MAXALIGN(sizeof(OpaqueData)) is attractive, but I'd rather
keep that in each AM's headers per its optional nature, and an index
AM may handle things differently depending on a page type, as well.
--
Michael

Вложения

Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Matthias van de Meent
Дата:
On Thu, 31 Mar 2022 at 09:32, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Mar 28, 2022 at 05:09:10PM +0200, Matthias van de Meent wrote:
> > Not all clusters have checksums enabled (boo!, but we can't
> > realistically fix that), so on-disk corruption could reasonably
> > propagate to the rest of such system. Additionally, checksums are only
> > checked on read, and updated when the page is written to disk (in
> > PageSetChecksumCopy called from FlushBuffer), but this does not check
> > for signs of page corruption. As such, a memory bug resulting in
> > in-memory corruption of pd_special would persist and would currently
> > have the potential to further corrupt neighbouring buffers.
>
> Well, if it comes to corruption then pd_special is not the only
> problem, just one part of it.  A broken pd_lower or pd_lower or even
> pd_lsn could also cause AMs to point at areas they should not.  There
> are many reasons that could make things go wrong depending on the
> compiled page size.
>
> >>> A second reason would be less indirection to get to the opaque
> >>> pointer. This should improve performance a bit in those cases where we
> >>> (initially) only want to access the [Index]PageOpaque struct.
> >>
> >> We don't have many cases that do that, do we?
> >
> > Indeed not many, but I know that at least the nbtree-code has some
> > cases where it uses the opaque before other fields in the header are
> > accessed (sometimes even without accessing the header for other
> > reasons): in _bt_moveright and _bt_walk_left. There might be more; but
> > these are cases I know of.
>
> Even these are marginal as the page is already loaded in the shared
> buffers..

I can't really disagree there.

> While looking at the page, the pieces in 0002 and 0003 that I really
> liked are the introduction of the macros to grab the special area for
> btree and hash.  This brings the code of those APIs closer to GiST,
> SpGiST and GIN.  And it is possible to move to a possible change in
> the checks and/or the shape of all the *GetOpaque macros at once after
> the switch to this part is done.  So I don't really mind introducing
> this part.
>
> PageGetSpecialOpaque() would not have saved the day with the recent
> pageinspect changes, and that's just moving the responsability of the
> page header to something else.  I am not sure if it is a good idea to
> have a mention of the opaque page type in bufpage.h actually, as this
> is optional.  Having an AssertMacro() checking that pd_special is in
> line with MAXALIGN(sizeof(OpaqueData)) is attractive, but I'd rather
> keep that in each AM's headers per its optional nature, and an index
> AM may handle things differently depending on a page type, as well.

PageInit MAXALIGNs the size of the special area that it receives as an
argument; so any changes to the page header that would misalign the
value would be AM-specific; in which case it is quite unlikely that
this is the right accessor for your page's special area.

- Matthias



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Michael Paquier
Дата:
On Thu, Mar 31, 2022 at 12:09:35PM +0200, Matthias van de Meent wrote:
> PageInit MAXALIGNs the size of the special area that it receives as an
> argument; so any changes to the page header that would misalign the
> value would be AM-specific; in which case it is quite unlikely that
> this is the right accessor for your page's special area.

Right.  I'd still be tempted to keep that per-AM rather than making
the checks deeper with one extra macro layer in the page header or
with a different macro that would depend on the opaque type, though.
Like in the attached, for example.
--
Michael

Вложения

Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Matthias van de Meent
Дата:
On Fri, 1 Apr 2022 at 07:38, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Mar 31, 2022 at 12:09:35PM +0200, Matthias van de Meent wrote:
> > PageInit MAXALIGNs the size of the special area that it receives as an
> > argument; so any changes to the page header that would misalign the
> > value would be AM-specific; in which case it is quite unlikely that
> > this is the right accessor for your page's special area.
>
> Right.  I'd still be tempted to keep that per-AM rather than making
> the checks deeper with one extra macro layer in the page header or
> with a different macro that would depend on the opaque type, though.
> Like in the attached, for example.

I see. I still would like it better if the access could use this
statically determined offset: your opaque-macros.patch doesn't fix the
out-of-bound read/write scenariofor non-assert builds, nor does it
remove the unneeded indirection through the page header that I was
trying to remove.

Even in assert-enabled builds; with my proposed changes the code paths
for checking the header value and the use of the special area can be
executed independently, which allows for parallel (pre-)fetching of
the page header and special area, as opposed to the current sequential
load order due to the required use of pd_special.

-Matthias



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Matthias van de Meent
Дата:
On Fri, 1 Apr 2022 at 10:50, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Fri, 1 Apr 2022 at 07:38, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Mar 31, 2022 at 12:09:35PM +0200, Matthias van de Meent wrote:
> > > PageInit MAXALIGNs the size of the special area that it receives as an
> > > argument; so any changes to the page header that would misalign the
> > > value would be AM-specific; in which case it is quite unlikely that
> > > this is the right accessor for your page's special area.
> >
> > Right.  I'd still be tempted to keep that per-AM rather than making
> > the checks deeper with one extra macro layer in the page header or
> > with a different macro that would depend on the opaque type, though.
> > Like in the attached, for example.

I noticed that you committed the equivalent of 0002 and 0003, thank you.

Here's a new 0001 to keep CFBot happy.

> I see. I still would like it better if the access could use this
> statically determined offset: your opaque-macros.patch doesn't fix the
> out-of-bound read/write scenariofor non-assert builds, nor does it
> remove the unneeded indirection through the page header that I was
> trying to remove.
>
> Even in assert-enabled builds; with my proposed changes the code paths
> for checking the header value and the use of the special area can be
> executed independently, which allows for parallel (pre-)fetching of
> the page header and special area, as opposed to the current sequential
> load order due to the required use of pd_special.

As an extra argument for this; it removes the need for a temporary
variable on the stack; as now all accesses into the special area can
be based on offsets off the base page pointer (which is also used
often). This would allow the compiler to utilize the available
registers more effectively.

-Matthias

PS. I noticed that between PageGetSpecialPointer and
PageGetSpecialOpaque the binary size was bigger by ~1kb for the
*Opaque variant when compiled with `CFLAGS="-o2" ./configure
--enable-debug`; but that varied a lot between builds and likely
related to binary layouts. For similar builds with `--enable-cassert`,
the *Opaque build was slimmer by ~ 50kB, which is a suprisingly large
amount.

Вложения

Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Robert Haas
Дата:
On Fri, Apr 1, 2022 at 11:12 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Here's a new 0001 to keep CFBot happy.

This seems like it would conflict with the proposal from
http://postgr.es/m/CA+TgmoaD8wMN6i1mmuo+4ZNeGE3Hd57ys8uV8UZm7cneqy3W2g@mail.gmail.com
which I still hope to advance in some form at an appropriate time.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Matthias van de Meent
Дата:
On Tue, 5 Apr 2022 at 21:45, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Apr 1, 2022 at 11:12 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Here's a new 0001 to keep CFBot happy.
>
> This seems like it would conflict with the proposal from
> http://postgr.es/m/CA+TgmoaD8wMN6i1mmuo+4ZNeGE3Hd57ys8uV8UZm7cneqy3W2g@mail.gmail.com
> which I still hope to advance in some form at an appropriate time.

I can see why what would be the case. However, I don't think that is
of much importance here, as reverting the changes that are yet to be
committed would be trivial.

Also, I can't see why we would allow page-level layout changes in
initdb; that seems like the wrong place to do that. All page layout
currently is at compile-time; even checksums (which can be
enabled/disabled in initdb) have reserved space available in the page
header. Why would it be different for nonces?

I don't think that the 'storing an explicit nonce' patch would
conflict in any meaningful form, other than maybe needing to roll back
from PageGetSpecialOpaque to PageGetSpecialPointer when (if) the nonce
is indeed going to be stored in the special area of each page with a
size defined during initdb. Until then, we could have a nice (but
small) performance boost.


-Matthias



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Pavel Borisov
Дата:
On Fri, Apr 1, 2022 at 11:12 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Here's a new 0001 to keep CFBot happy.

This seems like it would conflict with the proposal from
http://postgr.es/m/CA+TgmoaD8wMN6i1mmuo+4ZNeGE3Hd57ys8uV8UZm7cneqy3W2g@mail.gmail.com
which I still hope to advance in some form at an appropriate time.
It may be weakly related to the current discussion, but I'd like to note that the other proposed patch for making XID's 64-bit [1] will store per-page-common xid data in the heap page special area. It doesn't contradict storing a nonce in this area also, but I'd appreciate it if we could avoid changing heap page special layout twice.


Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Robert Haas
Дата:
On Thu, Apr 7, 2022 at 4:34 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Also, I can't see why we would allow page-level layout changes in
> initdb; that seems like the wrong place to do that. All page layout
> currently is at compile-time; even checksums (which can be
> enabled/disabled in initdb) have reserved space available in the page
> header. Why would it be different for nonces?

Because there's no place to put them in the existing page format. We
jammed checksums into the 2-byte field that had previously been set
aside for the TLI, but that wasn't really an ideal solution because it
meant we ended up with a checksum that is only 16 bits wide. However,
the 2 bytes set aside for the TLI weren't really being used
effectively anyway, so repurposing them was relatively easy, and a
16-bit checksum is better than nothing.

For encryption or integrity verification, you need a lot more space,
like 16-32 bytes. I can't see us adding that unilaterally to every
cluster, because (1) it would break pg_upgrade compatibility, (2) it's
a pretty significant amount of space to set aside in every single
cluster whether it is using those features or not, and (3) the
technology is constantly evolving and we can't predict how many bytes
will be needed in the future. AES in whatever mode we use it requires
whatever number of bytes it does, but AES replaced DES and will in
turn be replaced by something else which can have totally different
requirements.

I do understand that there are significant challenges and performance
concerns around having these kinds of initdb-controlled page layout
changes, so the future of that patch is unclear. However, I don't
think that it's practical to say that a 16-bit checksum should be good
enough for everyone, and I also don't think it's practical to say that
if you want a wider checksum or to use encryption or whatever, you
have to recompile from source. Most users are going to get precompiled
binaries from a vendor, and those binaries need to be able to work in
all the configurations that users want to use. It's OK for
experimental things that we only expect developers to fool with to
require recompiling, but things that users have a legitimate need to
reconfigure need to be initdb-time controllable at worst. Hence
fc49e24fa69a15efacd5b8958115ed9c43c48f9a, for example.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Peter Geoghegan
Дата:
On Thu, Apr 7, 2022 at 7:01 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Because there's no place to put them in the existing page format. We
> jammed checksums into the 2-byte field that had previously been set
> aside for the TLI, but that wasn't really an ideal solution because it
> meant we ended up with a checksum that is only 16 bits wide. However,
> the 2 bytes set aside for the TLI weren't really being used
> effectively anyway, so repurposing them was relatively easy, and a
> 16-bit checksum is better than nothing.

But if we were in a green-field situation we'd probably not want to
use up several bytes for a nonse anyway. You said so yourself.

> I do understand that there are significant challenges and performance
> concerns around having these kinds of initdb-controlled page layout
> changes, so the future of that patch is unclear.

Why does it need to be at initdb time?

Though I cannot prove it, I suspect that the original intent of the
special area was to support an additional (though typically small)
variable length array, that works a little like the current line
pointer array. This array would have to grow backwards (newer items
get appended at earlier physical offsets), unlike our line pointer
array (which gets appended to at the end, in the simple and obvious
way). Growing backwards like this happens with DB systems, that store
their line pointer array at the end of the page(the traditional
approach from the System R days, I believe).

Supporting a variable-length special area array like this would mean
that any time you add a new item to the variable-sized array in the
special area, the page's entire tuple space has to be memmove()'d
backwards by a couple of bytes to create the required space. And so
the relevant bufpage.c routine would have to adjust the whole line
pointer array such that each lp_off received a compensating
adjustment. The array might only be for some kind of page-level
transaction metadata, something like that -- shifting it around is
pretty expensive (reusing existing slots isn't too expensive, though).

Why can't it work like that? You don't really need to build the full
set of bufpage.c facilities (though it might not be a bad idea to
fully support these variable-length arrays, which seem like they might
come in handy). That seems perfectly compatible with what Matthias
wants to do, provided we're willing to deem the special area struct
(e.g. BTOpaque) as always coming "first" (which is essentially the
same as his current proposal anyway). You can even do the same thing
yourself for the nonse (use a fixed, known offset), with relatively
modest effort. You'd need to have AM-specific knowledge (it would
stack right on top of Matthias's technique), but that doesn't seem all
that hard. There are plenty of remaining status bits in BTOpaque, and
probably all other index AM special areas.

-- 
Peter Geoghegan



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Robert Haas
Дата:
On Thu, Apr 7, 2022 at 2:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> But if we were in a green-field situation we'd probably not want to
> use up several bytes for a nonse anyway. You said so yourself.

I don't know what statement of mine you're talking about here, and
while I don't love using up space for a nonce, it seems to be the way
this encryption stuff works. I don't see that there's a reasonable
alternative, green field or no.

> > I do understand that there are significant challenges and performance
> > concerns around having these kinds of initdb-controlled page layout
> > changes, so the future of that patch is unclear.
>
> Why does it need to be at initdb time?
>
> Though I cannot prove it, I suspect that the original intent of the
> special area was to support an additional (though typically small)
> variable length array, that works a little like the current line
> pointer array. This array would have to grow backwards (newer items
> get appended at earlier physical offsets), unlike our line pointer
> array (which gets appended to at the end, in the simple and obvious
> way). Growing backwards like this happens with DB systems, that store
> their line pointer array at the end of the page(the traditional
> approach from the System R days, I believe).
>
> Supporting a variable-length special area array like this would mean
> that any time you add a new item to the variable-sized array in the
> special area, the page's entire tuple space has to be memmove()'d
> backwards by a couple of bytes to create the required space. And so
> the relevant bufpage.c routine would have to adjust the whole line
> pointer array such that each lp_off received a compensating
> adjustment. The array might only be for some kind of page-level
> transaction metadata, something like that -- shifting it around is
> pretty expensive (reusing existing slots isn't too expensive, though).
>
> Why can't it work like that? You don't really need to build the full
> set of bufpage.c facilities (though it might not be a bad idea to
> fully support these variable-length arrays, which seem like they might
> come in handy). That seems perfectly compatible with what Matthias
> wants to do, provided we're willing to deem the special area struct
> (e.g. BTOpaque) as always coming "first" (which is essentially the
> same as his current proposal anyway). You can even do the same thing
> yourself for the nonse (use a fixed, known offset), with relatively
> modest effort. You'd need to have AM-specific knowledge (it would
> stack right on top of Matthias's technique), but that doesn't seem all
> that hard. There are plenty of remaining status bits in BTOpaque, and
> probably all other index AM special areas.

I'm not really following any of this. You seem to be arguing about
whether it's possible to change the length of the special space
*later* than initdb time. I agree that might have some use for some
purpose, but for encryption it's not necessarily all that helpful
because you have to be able to find the nonce on the page before
you've decrypted it. If you don't know whether there's a nonce or
where it's located, you can't do that. What Matthias and I were
discussing is whether you have to make a decision about appending
stuff to the special space *earlier* than initdb-time i.e. at compile
time.

My position is that if we need some space in every page to put a
nonce, the best place to put it is at the very end of the page, within
the special space and after anything else that is stored in the
special space. Code that only manipulates the line pointer array and
tuple data won't care, because pd_special will just be a bit smaller
than it would otherwise have been, and none of that code looks at any
byte offset >= pd_special. Code that looks at the special space won't
care either, as long as it uses PageGetSpecialPointer to find the
data, and doesn't examine how large the special space actually is.
That corresponds pretty well to how existing users of the special
space work, so it seems pretty good.

If we *didn't* put the nonce at the end of the page, where else would
we put it? It has to be at a fixed offset, because otherwise you can't
find it without decrypting the page first, which would be circular.
You could put it at the beginning of the page, or after the page
header and before the line pointer array, but either of those things
seem likely to affect a lot more code, because there's a lot more
stuff that accesses the line pointer array than the special space.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Peter Geoghegan
Дата:
On Thu, Apr 7, 2022 at 12:11 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't know what statement of mine you're talking about here, and
> while I don't love using up space for a nonce, it seems to be the way
> this encryption stuff works. I don't see that there's a reasonable
> alternative, green field or no.

I just meant that it wouldn't be reasonable to impose a fixed cost on
every user, even those not using the feature. Which you said yourself.

> I'm not really following any of this. You seem to be arguing about
> whether it's possible to change the length of the special space
> *later* than initdb time. I agree that might have some use for some
> purpose, but for encryption it's not necessarily all that helpful
> because you have to be able to find the nonce on the page before
> you've decrypted it. If you don't know whether there's a nonce or
> where it's located, you can't do that.

What if you just encrypt a significant subset of the page, and leave a
small amount of metadata that can be read without decryption? Is that
approach feasible?

I think that you could even encrypt the line pointer array (not just
the tuple space), without breaking anything.

> What Matthias and I were
> discussing is whether you have to make a decision about appending
> stuff to the special space *earlier* than initdb-time i.e. at compile
> time.

I got that much, of course. That will work, I suppose, but it'll be
the first and last time that anybody gets to do that (unless we accept
it being incompatible with encryption).

> That corresponds pretty well to how existing users of the special
> space work, so it seems pretty good.

I think that it still needs to be accounted for in a bunch of places.
In particular anywhere concerned with page-level space management. For
example the definition of "1/3 of a page" used to enforce the limit on
the maximum size of an nbtree index tuple will need to account for the
presence of a nonse, either way.

> If we *didn't* put the nonce at the end of the page, where else would
> we put it? It has to be at a fixed offset, because otherwise you can't
> find it without decrypting the page first, which would be circular.

Immediately before the special area proper (say BTOpaque), which would
"logically come after" the special space under this scheme. You
wouldn't have a simple constant offset into the page, but you'd have
something not too far removed from such a constant. It could work as a
constant with minimal context (just the AM type). Just like with
Matthias' patch.

-- 
Peter Geoghegan



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Robert Haas
Дата:
On Thu, Apr 7, 2022 at 3:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I just meant that it wouldn't be reasonable to impose a fixed cost on
> every user, even those not using the feature. Which you said yourself.

Unfortunately, I think there's bound to be some cost. We can avoid
using the space in the page for every user, but anything that makes
the page layout variable is going to cost some number of CPU cycles
someplace. We have to measure that overhead and see if it's small
enough that we're OK with it.

> I got that much, of course. That will work, I suppose, but it'll be
> the first and last time that anybody gets to do that (unless we accept
> it being incompatible with encryption).

Yeah.

> > If we *didn't* put the nonce at the end of the page, where else would
> > we put it? It has to be at a fixed offset, because otherwise you can't
> > find it without decrypting the page first, which would be circular.
>
> Immediately before the special area proper (say BTOpaque), which would
> "logically come after" the special space under this scheme. You
> wouldn't have a simple constant offset into the page, but you'd have
> something not too far removed from such a constant. It could work as a
> constant with minimal context (just the AM type). Just like with
> Matthias' patch.

I don't think this would work, because I don't think it would be
practical to always know the AM type. Think about applying an XLOG_FPI
record, for example.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Peter Geoghegan
Дата:
On Thu, Apr 7, 2022 at 12:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 7, 2022 at 3:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I just meant that it wouldn't be reasonable to impose a fixed cost on
> > every user, even those not using the feature. Which you said yourself.
>
> Unfortunately, I think there's bound to be some cost. We can avoid
> using the space in the page for every user, but anything that makes
> the page layout variable is going to cost some number of CPU cycles
> someplace. We have to measure that overhead and see if it's small
> enough that we're OK with it.

I wouldn't go so far as to say that any non-zero overhead is not okay
(that sounds really extreme). I would only say this much: wasting
non-trivial amounts of space on every page must not happen. If the
user opts-in to the feature then it's not just waste, so that's
perfectly okay. (I imagine that we agree on this much already.)

> > Immediately before the special area proper (say BTOpaque), which would
> > "logically come after" the special space under this scheme. You
> > wouldn't have a simple constant offset into the page, but you'd have
> > something not too far removed from such a constant. It could work as a
> > constant with minimal context (just the AM type). Just like with
> > Matthias' patch.
>
> I don't think this would work, because I don't think it would be
> practical to always know the AM type. Think about applying an XLOG_FPI
> record, for example.

There are already some pretty shaky heuristics that are used by tools
like pg_filedump for this exact purpose. But they work! And they're
even supported by Postgres, quasi-officially -- grep for "pg_filedump"
to see what I mean.

There are numerous reasons why we might want to put that on a formal
footing, so that we can reliably detect the AM type starting from only
a page image. I suspect that you're going to end up needing to account
for this index AMs anyway, so going this way isn't necessarily making
your life all that much harder. (I am not really sure about that,
though, since I don't have enough background information.)


--
Peter Geoghegan



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Stephen Frost
Дата:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Apr 7, 2022 at 3:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I just meant that it wouldn't be reasonable to impose a fixed cost on
> > every user, even those not using the feature. Which you said yourself.
>
> Unfortunately, I think there's bound to be some cost. We can avoid
> using the space in the page for every user, but anything that makes
> the page layout variable is going to cost some number of CPU cycles
> someplace. We have to measure that overhead and see if it's small
> enough that we're OK with it.

Agreed that there may be some cost in CPU cycles for folks who don't
initialize the database with an option that needs this, but hopefully
that'll be quite small.

> > I got that much, of course. That will work, I suppose, but it'll be
> > the first and last time that anybody gets to do that (unless we accept
> > it being incompatible with encryption).
>
> Yeah.

I don't know that I agree with this being the 'first and last time'..?
If we have two options that could work together and each need some
amount of per-page space, such as a nonce or authentication tag, we'd
just need to be able to track which of those are enabled (eg: through
pg_control) and then know which gets what space.  I don't see why we
couldn't add something today and then add something else later on.  I do
think there'll likely be cases where we have two options that don't make
sense together and we wouldn't wish to allow that (such as page-level
strong checksums and authenticated encryption, as the latter provides
the former inherently) but there could certainly be things that do work
together too.

> > > If we *didn't* put the nonce at the end of the page, where else would
> > > we put it? It has to be at a fixed offset, because otherwise you can't
> > > find it without decrypting the page first, which would be circular.
> >
> > Immediately before the special area proper (say BTOpaque), which would
> > "logically come after" the special space under this scheme. You
> > wouldn't have a simple constant offset into the page, but you'd have
> > something not too far removed from such a constant. It could work as a
> > constant with minimal context (just the AM type). Just like with
> > Matthias' patch.
>
> I don't think this would work, because I don't think it would be
> practical to always know the AM type. Think about applying an XLOG_FPI
> record, for example.

I'm also doubtful about how well this would work, but the other question
is- what would be the advantage to doing it this way?  If we could have
it be run-time instead of initdb-time, that'd be great (imagine a
database that's encrypted while another isn't in the same cluster, or
even individual tables, which would all be very cool), but I don't think
this approach would make that possible..?

(sorry for only just seeing this thread getting traction, have been a
bit busy with other things ...)

Thanks,

Stephen

Вложения

Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Stephen Frost
Дата:
Greetings,

* Peter Geoghegan (pg@bowt.ie) wrote:
> On Thu, Apr 7, 2022 at 12:37 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Apr 7, 2022 at 3:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > I just meant that it wouldn't be reasonable to impose a fixed cost on
> > > every user, even those not using the feature. Which you said yourself.
> >
> > Unfortunately, I think there's bound to be some cost. We can avoid
> > using the space in the page for every user, but anything that makes
> > the page layout variable is going to cost some number of CPU cycles
> > someplace. We have to measure that overhead and see if it's small
> > enough that we're OK with it.
>
> I wouldn't go so far as to say that any non-zero overhead is not okay
> (that sounds really extreme). I would only say this much: wasting
> non-trivial amounts of space on every page must not happen. If the
> user opts-in to the feature then it's not just waste, so that's
> perfectly okay. (I imagine that we agree on this much already.)

Right, the *space* wouldn't ever be wasted- either the user opts-in and
the space is used for what the user asked it to be used for, or ther
user doesn't select that option and we don't reserve that space.
Robert's point was that by allowing this to happen at initdb time, there
may be some CPU cycles that end up getting spent, even if the user
didn't opt-in, that would be saved if this decision was made at compile
time.  For my 2c, I'd think that would be our main concern with this,
but I also am hopeful and strongly suspect that the extra CPU cycles
aren't likely to be much and therefore would be acceptable.

A thought that's further down the line would be the question of if we
could get those CPU cycles back (and perhaps more..) by using JIT.

> > > Immediately before the special area proper (say BTOpaque), which would
> > > "logically come after" the special space under this scheme. You
> > > wouldn't have a simple constant offset into the page, but you'd have
> > > something not too far removed from such a constant. It could work as a
> > > constant with minimal context (just the AM type). Just like with
> > > Matthias' patch.
> >
> > I don't think this would work, because I don't think it would be
> > practical to always know the AM type. Think about applying an XLOG_FPI
> > record, for example.
>
> There are already some pretty shaky heuristics that are used by tools
> like pg_filedump for this exact purpose. But they work! And they're
> even supported by Postgres, quasi-officially -- grep for "pg_filedump"
> to see what I mean.

Shaky heuristics feels like something appropriate for pg_filedump.. less
so for things like TDE.

> There are numerous reasons why we might want to put that on a formal
> footing, so that we can reliably detect the AM type starting from only
> a page image. I suspect that you're going to end up needing to account
> for this index AMs anyway, so going this way isn't necessarily making
> your life all that much harder. (I am not really sure about that,
> though, since I don't have enough background information.)

This seems like it'd be pretty difficult given that AMs can be loaded as
extensions..?  Also seems like a much bigger change and I'm still not
sure what the advantage is, at least in terms of what this particular
patch is doing.

Thanks,

Stephen

Вложения

Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Peter Geoghegan
Дата:
On Thu, Apr 7, 2022 at 1:09 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > I got that much, of course. That will work, I suppose, but it'll be
> > > the first and last time that anybody gets to do that (unless we accept
> > > it being incompatible with encryption).
> >
> > Yeah.
>
> I don't know that I agree with this being the 'first and last time'..?
> If we have two options that could work together and each need some
> amount of per-page space, such as a nonce or authentication tag, we'd
> just need to be able to track which of those are enabled (eg: through
> pg_control) and then know which gets what space.

Sounds very messy.

> I don't see why we
> couldn't add something today and then add something else later on.

That's what I'm arguing in favor of, in part.

> I'm also doubtful about how well this would work, but the other question
> is- what would be the advantage to doing it this way?  If we could have
> it be run-time instead of initdb-time, that'd be great (imagine a
> database that's encrypted while another isn't in the same cluster, or
> even individual tables, which would all be very cool), but I don't think
> this approach would make that possible..?

That would be the main advantage, yes. But I also tend to doubt that
we should make it completely impossible to know anything at all about
the page without fully decrypting it.

It was just a suggestion. I will leave it at that.

-- 
Peter Geoghegan



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Matthias van de Meent
Дата:
On Thu, 7 Apr 2022 at 21:11, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Apr 7, 2022 at 2:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > But if we were in a green-field situation we'd probably not want to
> > use up several bytes for a nonse anyway. You said so yourself.
>
> I don't know what statement of mine you're talking about here, and
> while I don't love using up space for a nonce, it seems to be the way
> this encryption stuff works. I don't see that there's a reasonable
> alternative, green field or no.
>
> > > I do understand that there are significant challenges and performance
> > > concerns around having these kinds of initdb-controlled page layout
> > > changes, so the future of that patch is unclear.
> >
> > Why does it need to be at initdb time?
> >
> > Though I cannot prove it, I suspect that the original intent of the
> > special area was to support an additional (though typically small)
> > variable length array, that works a little like the current line
> > pointer array. This array would have to grow backwards (newer items
> > get appended at earlier physical offsets), unlike our line pointer
> > array (which gets appended to at the end, in the simple and obvious
> > way). Growing backwards like this happens with DB systems, that store
> > their line pointer array at the end of the page(the traditional
> > approach from the System R days, I believe).
> >
> > Supporting a variable-length special area array like this would mean
> > that any time you add a new item to the variable-sized array in the
> > special area, the page's entire tuple space has to be memmove()'d
> > backwards by a couple of bytes to create the required space. And so
> > the relevant bufpage.c routine would have to adjust the whole line
> > pointer array such that each lp_off received a compensating
> > adjustment. The array might only be for some kind of page-level
> > transaction metadata, something like that -- shifting it around is
> > pretty expensive (reusing existing slots isn't too expensive, though).
> >
> > Why can't it work like that? You don't really need to build the full
> > set of bufpage.c facilities (though it might not be a bad idea to
> > fully support these variable-length arrays, which seem like they might
> > come in handy). That seems perfectly compatible with what Matthias
> > wants to do, provided we're willing to deem the special area struct
> > (e.g. BTOpaque) as always coming "first" (which is essentially the
> > same as his current proposal anyway). You can even do the same thing
> > yourself for the nonse (use a fixed, known offset), with relatively
> > modest effort. You'd need to have AM-specific knowledge (it would
> > stack right on top of Matthias's technique), but that doesn't seem all
> > that hard. There are plenty of remaining status bits in BTOpaque, and
> > probably all other index AM special areas.
>
> I'm not really following any of this. You seem to be arguing about
> whether it's possible to change the length of the special space
> *later* than initdb time. I agree that might have some use for some
> purpose, but for encryption it's not necessarily all that helpful
> because you have to be able to find the nonce on the page before
> you've decrypted it. If you don't know whether there's a nonce or
> where it's located, you can't do that. What Matthias and I were
> discussing is whether you have to make a decision about appending
> stuff to the special space *earlier* than initdb-time i.e. at compile
> time.
>
> My position is that if we need some space in every page to put a
> nonce, the best place to put it is at the very end of the page, within
> the special space and after anything else that is stored in the
> special space. Code that only manipulates the line pointer array and
> tuple data won't care, because pd_special will just be a bit smaller
> than it would otherwise have been, and none of that code looks at any
> byte offset >= pd_special. Code that looks at the special space won't
> care either, as long as it uses PageGetSpecialPointer to find the
> data, and doesn't examine how large the special space actually is.
> That corresponds pretty well to how existing users of the special
> space work, so it seems pretty good.

Except that reserving space on each page requires recalculation of all
variables that depend on the amount of potential free space available
on a page (for some cases this is less important, for some it is
critical that the value is not wrong). If this is always done at
runtime then that can cause significant overhead.

> If we *didn't* put the nonce at the end of the page, where else would
> we put it? It has to be at a fixed offset, because otherwise you can't
> find it without decrypting the page first, which would be circular.

I think there's no specifically good reason why we'd need to put the
nonce in storage at the same place as where we reserve the space for
the nonce in the unencrypted in-memory format.

> You could put it at the beginning of the page, or after the page
> header and before the line pointer array, but either of those things
> seem likely to affect a lot more code, because there's a lot more
> stuff that accesses the line pointer array than the special space.

I'm not too keen on anything related to having no page layout
guarantees. I can understand needing a nonce; but couldn't that be put
somewhere different than smack-dab in the middle of what are
considered AM-controlled areas?

I'm not certain why we need lsn on a page after we've checked that we
flushed all WAL up to that LSN. That is, right now we store the LSN in
the in-memory representation of the page because we need it to check
that the WAL is flushed up to that point when we write out the page,
so that we can recover the data in case of disk write issues.
But after flushing the WAL to disk, this LSN on the page is not needed
anymore, and could thus be replaced with a nonce. When reading such
page, the LSN-now-nonce can be replaced with the latest flushed LSN to
prevent unwanted xlog flushes. Sure, this limits nonce to 8 bytes, but
if you really need more than that IMO you can recompile from scratch
with a bigger PageHeader.

Benefit of using existing pageheader structs is that we could enable
TDE on a relation level and on existing clusters - there's no extra
space needed right now as there's already some space available.
Yes, we'd lose the ability to skip redo on pages, but I think that's a
small price to pay when you enable TDE.

-Matthias



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Bruce Momjian
Дата:
On Thu, Apr  7, 2022 at 03:11:24PM -0400, Robert Haas wrote:
> On Thu, Apr 7, 2022 at 2:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > But if we were in a green-field situation we'd probably not want to
> > use up several bytes for a nonse anyway. You said so yourself.
> 
> I don't know what statement of mine you're talking about here, and
> while I don't love using up space for a nonce, it seems to be the way
> this encryption stuff works. I don't see that there's a reasonable
> alternative, green field or no.

Uh, XTS doesn't require a nonce, so why are talking about nonces in this
thread?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> Uh, XTS doesn't require a nonce, so why are talking about nonces in this
> thread?

Because some other proposals do, or could, require a per-page nonce.

After looking through this thread, I side with Robert: we should reject
the remainder of this patch.  It gives up page layout flexibility that
we might want back someday.  Moreover, I didn't see any hard evidence
offered that there's any actual performance gain, let alone such a
compelling gain that we should give up that flexibility for it.

            regards, tom lane



Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Michael Paquier
Дата:
On Mon, Nov 28, 2022 at 05:31:50PM -0500, Tom Lane wrote:
> After looking through this thread, I side with Robert: we should reject
> the remainder of this patch.  It gives up page layout flexibility that
> we might want back someday.  Moreover, I didn't see any hard evidence
> offered that there's any actual performance gain, let alone such a
> compelling gain that we should give up that flexibility for it.

As far as I understand, we are talking about this one:
https://www.postgresql.org/message-id/CAEze2Wj9c0abW2aRbC8JzOuUdGurO5av6SJ2H83du6tM+Q1rHQ@mail.gmail.com
After a few months looking at it again, I cannot get much excited
about switching these routines from a logic where we look at the page
header to something where we'd rely on the opaque structure size.

I am wondering if it would be worth adding an AssertMacro() like in
this one, though:
https://www.postgresql.org/message-id/YkaP64JvZTMgcHtq@paquier.xyz

This still uses PageGetSpecialPointer() to retrieve the pointer area
from the page header, but it also checks that we have a match with the
structure expected by the index AM.
--
Michael

Вложения

Re: Preventing indirection for IndexPageGetOpaque for known-size page special areas

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> I am wondering if it would be worth adding an AssertMacro() like in
> this one, though:
> https://www.postgresql.org/message-id/YkaP64JvZTMgcHtq@paquier.xyz

Kind of doubt it.  It'd bloat debug builds with a lot of redundant
checks, and probably never catch anything.  For catching problems
in production, the right place to do this (and where we already do
do it) is in _bt_checkpage.

If any of the other AMs lack page-read-time sanity checks like
_bt_checkpage, I'd be in favor of adding that.  But I don't
think a whole bunch of additional checks afterwards will buy much.

            regards, tom lane