Обсуждение: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

Поиск
Список
Период
Сортировка

Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
Hi all,

I have been looking at $subject and the many past reviews recently,
also related to some of the work related to the potential support for
zstandard compression in TOAST values, and found myself pondering
about the following message from Tom, to be reminded that nothing has
been done regarding the fact that the backend may finish in an
infinite loop once a TOAST table reaches 4 billion values:
https://www.postgresql.org/message-id/764273.1669674269%40sss.pgh.pa.us

Spoiler: I have heard of users that are in this case, and the best
thing we can do currently except raising shoulders is to use
workarounds with data externalization AFAIK, which is not nice,
usually, and users notice the problem once they see some backends
stuck in the infinite loop.  I have spent some time looking at the
problem, and looked at all the proposals in this area like these ones
(I hope so at least):
https://commitfest.postgresql.org/patch/4296/
https://www.postgresql.org/message-id/flat/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru

Anyway, it seems like nothing goes in a direction that I think would
be suited to fix the two following problems (some of the proposed
patches broke backward-compatibility, as well, and that's critical):
- The limit of TOAST values to 4 billions, because external TOAST
pointers want OIDs.
- New compression methods, see the recent proposal about zstandard at
[1].  ToastCompressionId is currently limited to 4 values because the
extinfo field of varatt_external has only this much data remaining.
Spoiler: I want to propose a new varatt_external dedicated to
zstandard-compressed external pointers, but that's not for this
thread.

Please find attached a patch set I have finished with while poking at
the problem, to address points 1) and 2) in the first email mentioned
at the top of this message.  It is not yet ready for prime day yet
(there are a couple of things that would need adjustments), but I have
reached the point where I am going to need a consensus about what
people would be OK to have in terms of design to be able to support
multiple types of varatt_external to address these issues.  And I'm OK
to consume time on that for the v19 cycle.

While hacking at (playing with?) the whole toasting and detoasting
code to understand the blast radius that this would involve, I have
quickly found that it is very annoying to have to touch at many places
of varatt.h to make variants of the existing varatt_external structure
(what we store on-disk as varlena Datum for external TOAST pointers).
Spoiler: it's my first time touching the internals of this code area
so deeply.  Most of the code churns happen because we need to make the
[de]toast code aware of what to do depending on the vartags of the
external varlenas.  It would be simple to hardcode a bunch of new
VARATT_IS_EXTERNAL_ONDISK() variants to plug in the new structures.
While it is efficient, this has a cost for out-of-core code and in
core because all the places that touch external TOAST pointers need to
be adjusted.  Point is: it can be done.  But if we introduce more
types of external TOAST pointers we need to always patch all these
areas, and there's a cost in that each time one or more new vartags
are added.

So, I have invented a new interface aimed at manipulating on-disk
external TOAST pointers, called toast_external, that is an on-memory
structure that services as an interface between on-disk external TOAST
pointers and what the backend wants to look at when retrieving chunks
of data from the TOAST relations.  That's the main proposal of this
patch set, with a structure looking like that:
typedef struct toast_external_data
{
    /* Original data size (includes header) */
    int32       rawsize;
    /* External saved size (without header) */
    uint32      extsize;
    /* compression method */
    ToastCompressionId compression_method;
    /* Relation OID of TOAST table containing the value */
    Oid            toastrelid;
    /*
     * Unique ID of value within TOAST table.  This could be an OID or an
     * int8 value.  This field is large enough to be able to store any of
     * them.
     */
    uint64        value;
} toast_external_data;

This is a bit similar to what the code does for R/W and R/O vartags,
only applying to the on-disk external pointers.  Then, the [de]toast
code and extension code is updated so as varlenas are changed into
this structure if we need to retrieve some of its data, and these
areas of the code do not need to know about the details of the
external TOAST pointers.  When saving an external set of chunks, this
structure is filled with information depending on what
toast_save_datum() deals with, be it a short varlena, a non-compressed
external value, or a compressed external value, then builds a varlena
with the vartag we want.

External TOAST pointers have three properties that are hardcoded in
the tree, bringing some challenges of their own:
- The maximum size of a chunk, TOAST_MAX_CHUNK_SIZE, tweaked at close
to 2k to make 4 chunks fit on a page.  This depends on the size of the
external pointer.  This one was actually easy to refactor.
- The varlena header size, based on VARTAG_SIZE(), which is kind of
tricky to refactor out in the new toast_external.c, but that seems OK
even if this knowledge stays in varatt.h.
- The toast pointer size, aka TOAST_POINTER_SIZE.  This one is
actually very interesting (tricky): we use it in one place,
toast_tuple_find_biggest_attribute(), as a lower bound to decide if an
attribute should be toastable or not.  I've refactored the code to use
a "best" guess depending on the value type in the TOAST relation, but
that's not 100% waterproof.  That needs more thoughts.

Anyway, the patch set is able to demonstrate how much needs to be done
in the tree to support multiple chunk_id types, and the CI is happy
with the attached.  Some of the things done:
- Introduction of a user-settable GUC called default_toast_type, that
can be switched between two modes "oid" and "int8", to force the
creation of a TOAST relation using one type or the other.
- Dump, restore and upgrade support are integrated, relying on a GUC
makes the logic a breeze.
- 64b values are retrieved from a single counter in the control file,
named a "TOAST counter", which has the same reliability and properties
as an OID, with checkpoint support, WAL records, etc.
- Rewrites are soft, so I have kicked the can down the toast on this
point to not make the proposal more complicated than it should be: a
VACUUM FULL retains the same TOAST value type as the original.  We
could extend rewrites so as the type of TOAST value is changed.  It is
possible to setup a new cluster with default_toast_type = int8 set
after an upgrade, with the existing tables still using the OID mode.
This relates to the recent proposal with a concurrent VACUUM FULL
(REPACK discussion).

The patch set keeps the existing vartag_external with OID values for
backward-compatibility, and adds a second vartag_external that can
store 8-byte values.  This model is the simplest one, and
optimizations are possible, where the Datum TOAST pointer could be
shorter depending on the ID type (OID or int8), the compression method
and the actual value to divide in chunks.  For example, if you know
that a chunk of data to save has a value less than UINT32_MAX, we
could store 4 bytes worth of data instead of 8.  This design has the
advantage to allow plugging in new TOAST external structures easily.
Now I've not spent extra time in this tuning, because there's no point
in spending more time without an actual agreement about three things,
and *that's what I'm looking for* as feedback for this upcoming CF:
- The structures of the external TOAST pointers.  Variable-sized
pointers could be one possibility, across multiple vartags.  Ideas are
welcome.
- How many vartag_external types we want.
- If people are actually OK with this translation layer or not, and I
don't disagree that there may be some paths hot enough where the
translation between the on-disk varlenas and this on-memory
toast_external_data hurts.  Again, it is possible to hardcode more
types of vartags in the tree, or just bypass the translation in the
paths that are too hot.  That's doable still brutal, but if that's the
final consensus reached I'm OK with that as well.  (See for example
the changes in amcheck to see how simpler things get.)

The patch set has been divided into multiple pieces to ease its
review.  Again, I'm not completely happy with everything in it, but
it's a start.  Each patch has its own commit message, so feel free to
refer to them for more details:
- 0001 introduces the GUC default_toast_type.  It is just defined, not
used in the tree at this stage.
- 0002 adds support for catcache lookups for int8 values, required to
allow TOAST values with int8 and its indexes.  Potentially useful for
extensions.
- 0003 introduces the "TOAST counter", 8 bytes in the control file to
allocate values for the int8 chunk_id.  That's cheap, reliable.
- 0004 is a mechanical change, that enlarges a couple of TOAST
interfaces to use values of uint64 instead of OID.
- 0005, again a mechanical change, reducing a bit the footprint of
TOAST_MAX_CHUNK_SIZE because OID and int8 values need different
values.
- 0006 tweaks pg_column_toast_chunk_id() to use int8 as return type.

Then comes the "main" patches:
- 0007 adds support for int8 chunk_id in TOAST tables.  This is mostly
a mechanical change.  If applying the patches up to this point,
external Datums are applied to both OID and int8 values.  Note that
there is one tweak I'm unhappy with: the toast counter generation
would need to be smarter to avoid concurrent values because we don't
cross-check the TOAST index for existing values.  (Sorry, got slightly
lazy here).
- 0008 adds tests for external compressed and uncompressed TOAST
values for int8 TOAST types.
- 0009 adds support for dump, restore, upgrades of the TOAST table
types.
- 0010 is the main dish: refactoring of the TOAST code to use
toast_external_data, with OID vartags as the only type defined.
- 0011 adds a second vartag_external: the one with int8 values stored
in the external TOAST pointer.
- 0012 is a bonus for amcheck: what needs to be done in its TAP tests
to allow the corruption cases to work when supporting a new vartag.

That was a long message.  Thank you for reading if you have reached
this point.

Regards,

[1]: https://www.postgresql.org/message-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-%2BFFwAAPo7JHx4aShCvunQ%40mail.gmail.com
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Hannu Krosing
Дата:
Hi Michael

I'll take a look at the patch set.

While digging around in the TOAST code did you have any ideas on how
one could extract the TOAST APIs in a way that they can be added in
Table Access Method definition ?

Not all TAMs need TOAST, but the ones that do could also be the ones
that still like to do something different when materializing toasted
values.

And TOAST is actually a nice abstraction which could be used as basis
for both offloading more columns into separate forks and files as well
as implementing some kinds of vectored, columnar and compressed
storages.

----
Hannu

On Thu, Jun 19, 2025 at 7:59 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> Hi all,
>
> I have been looking at $subject and the many past reviews recently,
> also related to some of the work related to the potential support for
> zstandard compression in TOAST values, and found myself pondering
> about the following message from Tom, to be reminded that nothing has
> been done regarding the fact that the backend may finish in an
> infinite loop once a TOAST table reaches 4 billion values:
> https://www.postgresql.org/message-id/764273.1669674269%40sss.pgh.pa.us
>
> Spoiler: I have heard of users that are in this case, and the best
> thing we can do currently except raising shoulders is to use
> workarounds with data externalization AFAIK, which is not nice,
> usually, and users notice the problem once they see some backends
> stuck in the infinite loop.  I have spent some time looking at the
> problem, and looked at all the proposals in this area like these ones
> (I hope so at least):
> https://commitfest.postgresql.org/patch/4296/
> https://www.postgresql.org/message-id/flat/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru
>
> Anyway, it seems like nothing goes in a direction that I think would
> be suited to fix the two following problems (some of the proposed
> patches broke backward-compatibility, as well, and that's critical):
> - The limit of TOAST values to 4 billions, because external TOAST
> pointers want OIDs.
> - New compression methods, see the recent proposal about zstandard at
> [1].  ToastCompressionId is currently limited to 4 values because the
> extinfo field of varatt_external has only this much data remaining.
> Spoiler: I want to propose a new varatt_external dedicated to
> zstandard-compressed external pointers, but that's not for this
> thread.
>
> Please find attached a patch set I have finished with while poking at
> the problem, to address points 1) and 2) in the first email mentioned
> at the top of this message.  It is not yet ready for prime day yet
> (there are a couple of things that would need adjustments), but I have
> reached the point where I am going to need a consensus about what
> people would be OK to have in terms of design to be able to support
> multiple types of varatt_external to address these issues.  And I'm OK
> to consume time on that for the v19 cycle.
>
> While hacking at (playing with?) the whole toasting and detoasting
> code to understand the blast radius that this would involve, I have
> quickly found that it is very annoying to have to touch at many places
> of varatt.h to make variants of the existing varatt_external structure
> (what we store on-disk as varlena Datum for external TOAST pointers).
> Spoiler: it's my first time touching the internals of this code area
> so deeply.  Most of the code churns happen because we need to make the
> [de]toast code aware of what to do depending on the vartags of the
> external varlenas.  It would be simple to hardcode a bunch of new
> VARATT_IS_EXTERNAL_ONDISK() variants to plug in the new structures.
> While it is efficient, this has a cost for out-of-core code and in
> core because all the places that touch external TOAST pointers need to
> be adjusted.  Point is: it can be done.  But if we introduce more
> types of external TOAST pointers we need to always patch all these
> areas, and there's a cost in that each time one or more new vartags
> are added.
>
> So, I have invented a new interface aimed at manipulating on-disk
> external TOAST pointers, called toast_external, that is an on-memory
> structure that services as an interface between on-disk external TOAST
> pointers and what the backend wants to look at when retrieving chunks
> of data from the TOAST relations.  That's the main proposal of this
> patch set, with a structure looking like that:
> typedef struct toast_external_data
> {
>         /* Original data size (includes header) */
>         int32       rawsize;
>         /* External saved size (without header) */
>         uint32      extsize;
>         /* compression method */
>         ToastCompressionId compression_method;
>         /* Relation OID of TOAST table containing the value */
>         Oid                     toastrelid;
>         /*
>          * Unique ID of value within TOAST table.  This could be an OID or an
>          * int8 value.  This field is large enough to be able to store any of
>          * them.
>          */
>         uint64          value;
> } toast_external_data;
>
> This is a bit similar to what the code does for R/W and R/O vartags,
> only applying to the on-disk external pointers.  Then, the [de]toast
> code and extension code is updated so as varlenas are changed into
> this structure if we need to retrieve some of its data, and these
> areas of the code do not need to know about the details of the
> external TOAST pointers.  When saving an external set of chunks, this
> structure is filled with information depending on what
> toast_save_datum() deals with, be it a short varlena, a non-compressed
> external value, or a compressed external value, then builds a varlena
> with the vartag we want.
>
> External TOAST pointers have three properties that are hardcoded in
> the tree, bringing some challenges of their own:
> - The maximum size of a chunk, TOAST_MAX_CHUNK_SIZE, tweaked at close
> to 2k to make 4 chunks fit on a page.  This depends on the size of the
> external pointer.  This one was actually easy to refactor.
> - The varlena header size, based on VARTAG_SIZE(), which is kind of
> tricky to refactor out in the new toast_external.c, but that seems OK
> even if this knowledge stays in varatt.h.
> - The toast pointer size, aka TOAST_POINTER_SIZE.  This one is
> actually very interesting (tricky): we use it in one place,
> toast_tuple_find_biggest_attribute(), as a lower bound to decide if an
> attribute should be toastable or not.  I've refactored the code to use
> a "best" guess depending on the value type in the TOAST relation, but
> that's not 100% waterproof.  That needs more thoughts.
>
> Anyway, the patch set is able to demonstrate how much needs to be done
> in the tree to support multiple chunk_id types, and the CI is happy
> with the attached.  Some of the things done:
> - Introduction of a user-settable GUC called default_toast_type, that
> can be switched between two modes "oid" and "int8", to force the
> creation of a TOAST relation using one type or the other.
> - Dump, restore and upgrade support are integrated, relying on a GUC
> makes the logic a breeze.
> - 64b values are retrieved from a single counter in the control file,
> named a "TOAST counter", which has the same reliability and properties
> as an OID, with checkpoint support, WAL records, etc.
> - Rewrites are soft, so I have kicked the can down the toast on this
> point to not make the proposal more complicated than it should be: a
> VACUUM FULL retains the same TOAST value type as the original.  We
> could extend rewrites so as the type of TOAST value is changed.  It is
> possible to setup a new cluster with default_toast_type = int8 set
> after an upgrade, with the existing tables still using the OID mode.
> This relates to the recent proposal with a concurrent VACUUM FULL
> (REPACK discussion).
>
> The patch set keeps the existing vartag_external with OID values for
> backward-compatibility, and adds a second vartag_external that can
> store 8-byte values.  This model is the simplest one, and
> optimizations are possible, where the Datum TOAST pointer could be
> shorter depending on the ID type (OID or int8), the compression method
> and the actual value to divide in chunks.  For example, if you know
> that a chunk of data to save has a value less than UINT32_MAX, we
> could store 4 bytes worth of data instead of 8.  This design has the
> advantage to allow plugging in new TOAST external structures easily.
> Now I've not spent extra time in this tuning, because there's no point
> in spending more time without an actual agreement about three things,
> and *that's what I'm looking for* as feedback for this upcoming CF:
> - The structures of the external TOAST pointers.  Variable-sized
> pointers could be one possibility, across multiple vartags.  Ideas are
> welcome.
> - How many vartag_external types we want.
> - If people are actually OK with this translation layer or not, and I
> don't disagree that there may be some paths hot enough where the
> translation between the on-disk varlenas and this on-memory
> toast_external_data hurts.  Again, it is possible to hardcode more
> types of vartags in the tree, or just bypass the translation in the
> paths that are too hot.  That's doable still brutal, but if that's the
> final consensus reached I'm OK with that as well.  (See for example
> the changes in amcheck to see how simpler things get.)
>
> The patch set has been divided into multiple pieces to ease its
> review.  Again, I'm not completely happy with everything in it, but
> it's a start.  Each patch has its own commit message, so feel free to
> refer to them for more details:
> - 0001 introduces the GUC default_toast_type.  It is just defined, not
> used in the tree at this stage.
> - 0002 adds support for catcache lookups for int8 values, required to
> allow TOAST values with int8 and its indexes.  Potentially useful for
> extensions.
> - 0003 introduces the "TOAST counter", 8 bytes in the control file to
> allocate values for the int8 chunk_id.  That's cheap, reliable.
> - 0004 is a mechanical change, that enlarges a couple of TOAST
> interfaces to use values of uint64 instead of OID.
> - 0005, again a mechanical change, reducing a bit the footprint of
> TOAST_MAX_CHUNK_SIZE because OID and int8 values need different
> values.
> - 0006 tweaks pg_column_toast_chunk_id() to use int8 as return type.
>
> Then comes the "main" patches:
> - 0007 adds support for int8 chunk_id in TOAST tables.  This is mostly
> a mechanical change.  If applying the patches up to this point,
> external Datums are applied to both OID and int8 values.  Note that
> there is one tweak I'm unhappy with: the toast counter generation
> would need to be smarter to avoid concurrent values because we don't
> cross-check the TOAST index for existing values.  (Sorry, got slightly
> lazy here).
> - 0008 adds tests for external compressed and uncompressed TOAST
> values for int8 TOAST types.
> - 0009 adds support for dump, restore, upgrades of the TOAST table
> types.
> - 0010 is the main dish: refactoring of the TOAST code to use
> toast_external_data, with OID vartags as the only type defined.
> - 0011 adds a second vartag_external: the one with int8 values stored
> in the external TOAST pointer.
> - 0012 is a bonus for amcheck: what needs to be done in its TAP tests
> to allow the corruption cases to work when supporting a new vartag.
>
> That was a long message.  Thank you for reading if you have reached
> this point.
>
> Regards,
>
> [1]: https://www.postgresql.org/message-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-%2BFFwAAPo7JHx4aShCvunQ%40mail.gmail.com
> --
> Michael



Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikita Malakhov
Дата:
Hi!

Hannu, we'd already made an attempt to extract the TOAST functionality as API
and make it extensible and usable by other AMs in [1], the patch set was met calmly
but we still have some hopes on it.

Michael, glad you continue this work! Took patch set for review.


--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company


On Fri, Jul 4, 2025 at 2:03 PM Hannu Krosing <hannuk@google.com> wrote:
Hi Michael

I'll take a look at the patch set.

While digging around in the TOAST code did you have any ideas on how
one could extract the TOAST APIs in a way that they can be added in
Table Access Method definition ?

Not all TAMs need TOAST, but the ones that do could also be the ones
that still like to do something different when materializing toasted
values.

And TOAST is actually a nice abstraction which could be used as basis
for both offloading more columns into separate forks and files as well
as implementing some kinds of vectored, columnar and compressed
storages.

----
Hannu

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikita Malakhov
Дата:
Hi!

I'm reviewing code at toast_64bit_v2.
Michael, isn't there a typo?
static const toast_external_info toast_external_infos[TOAST_EXTERNAL_INFO_SIZE] = {
[VARTAG_ONDISK_INT8] = {
.toast_pointer_size = TOAST_POINTER_INT8_SIZE,
.maximum_chunk_size = TOAST_MAX_CHUNK_SIZE_INT8,
.to_external_data = ondisk_int8_to_external_data,
.create_external_data = ondisk_int8_create_external_data,
},
[VARTAG_ONDISK_OID] = {
.toast_pointer_size = TOAST_POINTER_INT8_SIZE, <--- here
.maximum_chunk_size = TOAST_MAX_CHUNK_SIZE_OID,
.to_external_data = ondisk_oid_to_external_data,
.create_external_data = ondisk_oid_create_external_data,
},
};

Shouldn't TOAST_POINTER_INT8_SIZE be replaced with TOAST_POINTER_OID_SIZE?

Regards,

--
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Tue, Jul 08, 2025 at 08:38:41AM +0900, Michael Paquier wrote:
> Please note that I still need to look at perf profiles and some flame
> graphs with the refactoring done in 0003 with the worst case I've
> mentioned upthread with detoasting and values stored uncompressed in
> the TOAST relation.

So, the worst case I could think of for the slice detoast path is
something like that:
create table toasttest_bytea (f1 bytea);
alter table toasttest_bytea alter column f1 set storage external;
insert into toasttest_bytea values(decode(repeat('1234567890',10000),'escape'));

And then use something like the following query that retrieves a small
substring many times, to force a maximum of detoast_attr_slice() to
happen, checking the effect of toast_external_info_get_data():
select length(string_agg(substr(f1, 2, 3), '')) from
   toasttest_bytea, lateral generate_series(1,1000000) as a (id);

I have taken this query, kept running that with a \watch, and took
samples of 10s perf records, finishing with the attached graphs
(runtime does not show any difference):
- detoast_master.svg, for the graph on HEAD.
- detoast_patch.svg with the patch set up to 0003 and the external
TOAST pointer refactoring, where detoast_attr_slice() shows up.
- master_patch_diff.svg as the difference between both, with
difffolded.pl from [1].

I don't see a difference in the times spent in these stacks, as we are
spending most of the run retrieving the slices from the TOAST relation
in fetch_datum_slice().  Opinions and/or comments are welcome.

[1]: https://github.com/brendangregg/FlameGraph
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
"Burd, Greg"
Дата:
> On Jul 7, 2025, at 7:38 PM, Michael Paquier <michael@paquier.xyz> wrote:
>
> I have also pushed this v2 on this branch, so feel free to grab it if
> that makes your life easier:
> https://github.com/michaelpq/postgres/tree/toast_64bit_v2
> --
> Michael


Thank you for spending time digging into this and for the well structured patch set (and GitHub branch which I
personallyfind helpful).  This $subject is important on its own, but even more so in the broader context of the
zstd/dictwork [1] and also allowing for innovation when it comes to how externalized Datum are stored.   The current
modelfor toasting has served the community well for years, but I think that Hannu [2] and Nikita and others have
promisingideas that should be allowable without forcing core changes.  I've worked a bit in this area too, I re-based
thePluggble TOAST work by Nikita [3] onto 18devel earlier this year as I was looking for a way to implement a toaster
fora custom type. 

All that aside, I think you're right to tackle this one step at a time and try not to boil too much of the ocean at
once(the patch set is already large enough).  With that in mind I've read once or twice over your changes and have a
fewbasic comments/questions. 


v2-0001 Refactor some TOAST value ID code to use uint64 instead of Oid

This set of changes make sense and as you say are mechanical in nature, no real comments other than I think that using
uint64rather than Oid is the right call and addresses #2 on Tom's list. 


v2-0002 Minimize footprint of TOAST_MAX_CHUNK_SIZE in heap TOAST code

I like this as well, clarifies the code and reduces repetition.


v2-0003 Refactor external TOAST pointer code for better pluggability

+ * For now there are only two types, all defined in this file. For now this
+ * is the maximum value of vartag_external, which is a historical choice.

This provides a bridge for compatibility, but doesn't open the door to a truly pluggable API.  I'm guessing the goal is
incrementalchange rather than wholesale rewrite. 

+ * The different kinds of on-disk external TOAST pointers. divided by
+ * vartag_external.

Extra '.' in "TOAST pointers. divided" I'm guessing.


v2-0004 Introduce new callback to get fresh TOAST values
v2-0005 Add catcache support for INT8OID
v2-0006 Add GUC default_toast_type

Easily understood, good progression of supporting changes.


v2-0007 Introduce global 64-bit TOAST ID counter in control file

Do you have any concern that this might become a bottleneck when there are many relations and many backends all
contendingfor a new id?  I'd imagine that this would show up in a flame graph, but I think your test focused on the
readside detoast_attr_slice() rather than insert/update and contention on the shared counter.  Would this be even worse
onNUMA systems? 


v2-0008 Switch pg_column_toast_chunk_id() return value from oid to bigint
v2-0009 Add support for bigint TOAST values
v2-0010 Add tests for TOAST relations with bigint as value type
v2-0011 Add support for TOAST table types in pg_dump and pg_restore
v2-0012 Add new vartag_external for 8-byte TOAST values
V2-0013 amcheck: Add test cases for 8-byte TOAST values

I read through each of these patches, I like the break down and the attention to detail.  The inclusion of good
documentationat each step is helpful.  Thank you. 


Thanks for the flame graphs examining a heavy detoast_attr_slice() workload.  I agree that there is little or no
differencebetween them which is nice. 

I think the only call out Tom made [4] that isn't addressed was the ask for localized ID selection.  That may make
senseat some point, especially if there is contention on GetNewToastId().  I think that case is worth a separate
performancetest, something with a large number of relations and backends all performing a lot of updates generating a
lotof new IDs.  What do you think? 

As for adding even more flexibility, I see the potential to move in that direction over time with this as a good
focusedincremental set of changes that address a few important issues now. 


Really excited by this work, thank you.

-greg


[1] https://commitfest.postgresql.org/patch/5702/
[2] "Yes, the idea is to put the tid pointer directly in the varlena
external header and have a tid array in the toast table as an extra
column. If all of the TOAST fits in the single record, this will be
empty, else it will have an array of tids for all the pages for this
toasted field." - Hannu Krosing in an email to me after PGConf.dev/2025
[3] https://postgr.es/m/224711f9-83b7-a307-b17f-4457ab73aa0a%40sigaev.ru
[4] https://postgr.es/m/764273.1669674269%40sss.pgh.pa.us




Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikita Malakhov
Дата:
Hi!

Greg, thanks for the interest in our work!

Michael, one more thing forgot to mention yesterday -
#define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
static const toast_external_info toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
VARTAG_ONDISK_OID historically has a value of 18
and here we got an array of 19 members with only 2 valid ones.

What do you think about having an individual
TOAST value id counter per relation instead of using
a common one? I think this is a very promising approach,
but a decision must be made where it should be stored.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Hannu Krosing
Дата:
I still think we should go with direct toast tid pointers in varlena
and not some kind of oid.

It will remove the need for any oid management and also will be
many-many orders of magnitude faster for large tables (just 2x faster
for in-memory small tables)

I plan to go over Michael's patch set here and see how much change is
needed to add the "direct toast"

My goals are:
1. fast lookup from skipping index lookup
2. making the toast pointer in main heap as small as possible -
hopefully just the 6 bytes of tid pointer - so that scans that do not
need toasted values get more tuples from each page
3. adding all (optional) the extra data into toast chunk record as
there we are free to add whatever is needed
Currently I plan to introduces something like this for toast chunk record

Column | Type | Storage
-------------+---------+----------
chunk_id | oid | plain | 0 when not using toast index, 0xfffe -
non-deletable, for example when used as dictionary for multiple
toasted values.
chunk_seq | integer | plain | if not 0 when referenced from toast
pointer then the toasted data starts at toast_pages[0] (or below it in
that tree), which *must* have chunk_id = 0
chunk_data | bytea | plain

-- added fields

toast_pages | tid[] | plain | can be chained or make up a tree
offsets | int[] | plain | -- starting offsets of the toast_pages
(octets or type-specific units), upper bit is used to indicate that a
new compressed span starts at that offset, 2nd highest bit indicates
that the page is another tree page
comp_method | int | plain | -- compression methos used maybe should be enum ?
dict_pages | tid[] | plain | -- pages to use as compression
dictionary, up to N pages, one level

This seems to be flexible enough to allow for both compressin and
efficient partial updates

---
Hannu


On Tue, Jul 8, 2025 at 8:31 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
>
> Hi!
>
> Greg, thanks for the interest in our work!
>
> Michael, one more thing forgot to mention yesterday -
> #define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
> static const toast_external_info toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
> VARTAG_ONDISK_OID historically has a value of 18
> and here we got an array of 19 members with only 2 valid ones.
>
> What do you think about having an individual
> TOAST value id counter per relation instead of using
> a common one? I think this is a very promising approach,
> but a decision must be made where it should be stored.
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/



Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Álvaro Herrera
Дата:
On 2025-Jul-08, Hannu Krosing wrote:

> I still think we should go with direct toast tid pointers in varlena
> and not some kind of oid.

I think this can be made to work, as long as we stop seeing the toast
table just like a normal heap table containing normal tuples.  A lot to
reimplement though -- vacuum in particular.  Maybe it can be thought of
as a new table AM.  Not an easy project, I reckon.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikita Malakhov
Дата:
Hi,

Hannu, we have some thoughts on direct tids storage,
it was some time ago and done by another developer,
so I have to look. I'll share it as soon as I find it, if you
are interested.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Hannu Krosing
Дата:
chunk_

On Tue, Jul 8, 2025 at 9:37 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
>
> On 2025-Jul-08, Hannu Krosing wrote:
>
> > I still think we should go with direct toast tid pointers in varlena
> > and not some kind of oid.
>
> I think this can be made to work, as long as we stop seeing the toast
> table just like a normal heap table containing normal tuples.  A lot to
> reimplement though -- vacuum in particular.

Non-FULL vacuum should already work. Only commands like VACUUM FULL
and CLUSTER which move tuples around should be disabled on TOAST
tables.

What other parts do you think need re-implementing in addition to
skipping the index lookup part and using the tid directly ?


The fact that per-page chunk_tid arrays allow also tree structures
should allow us much more flexibility in implementing
in-place-updatable structured storage in something otherways very
similar to toast, but this is not required for just moving from oid +
index ==> tid to using the tid directly.

I think that having a toast table as a normal table with full MVCC is
actually a good thing, as it can implement the "array element update"
as a real partial update of only the affected parts and not the
current 'copy everything' way of doing this. We already do collect the
array element update in the parse tree in a special way, now we just
need to have types that can do the partial update by changing a tid or
two in the chunk_tids array (and adjust the offsets array if needed)

This should make both
UPDATE t SET theintarray[3] = 5, theintarray[4] = 7 WHERE id  = 1;

and even do partial up[dates for something like this

hannuk=# select * from jtab;
 id |             j
----+----------------------------
  1 | {"a": 3, "b": 2}
  2 | {"c": 1, "d": [10, 20, 3]}
(2 rows)
hannuk=# update jtab SET j['d'][3] = '7' WHERE id = 2;
UPDATE 1
hannuk=# select * from jtab;
 id |               j
----+-------------------------------
  1 | {"a": 3, "b": 2}
  2 | {"c": 1, "d": [10, 20, 3, 7]}
(2 rows)

when the JSON data is so large that changed part is in it's own chunk.

> Maybe it can be thought of
> as a new table AM.  Not an easy project, I reckon.

I would prefer it to be an extension of current toast - just another
varatt_* type - as then you can upgrade to new storage CONCURRENTLY,
same way as you can currently switch compression methods.

> --
> Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Tue, Jul 08, 2025 at 08:54:33PM +0200, Hannu Krosing wrote:
> I still think we should go with direct toast tid pointers in varlena
> and not some kind of oid.
>
> It will remove the need for any oid management and also will be
> many-many orders of magnitude faster for large tables (just 2x faster
> for in-memory small tables)

There is also the point of backward-compatibility.  We cannot just
replace things, and we have to provide a way for users to be able to
rely on the system so as upgrades are painless.  So we need to think
about the correct application layer to use to maintain the legacy code
behavior while considering improvements.

> I plan to go over Michael's patch set here and see how much change is
> needed to add the "direct toast"

If you do not have a lot of time looking at the full patch set, I'd
recommend looking at 0003, files toast_external.h and
toast_external.c which include the key idea.  Adding a new external
TOAST pointer is then a two-step process:
- Add a new vartag_external.
- Add some callbacks to let the backend understand what it should do
with this new vartag_external.

> My goals are:
> 1. fast lookup from skipping index lookup
> 2. making the toast pointer in main heap as small as possible -
> hopefully just the 6 bytes of tid pointer - so that scans that do not
> need toasted values get more tuples from each page
> 3. adding all (optional) the extra data into toast chunk record as
> there we are free to add whatever is needed
> Currently I plan to introduces something like this for toast chunk record

Points 2. and 3. are things that the refactoring should allow.  About
1., I have no idea how much you want to store in the TOAST external
points and how it affects the backend, but you could surely implement
an option that lets the backend know that it should still index
lookups based on what the external TOAST pointer says, if this stuff
has benefits.

> This seems to be flexible enough to allow for both compressin and
> efficient partial updates

I don't really disagree with all that.  Now the history of the TOAST
threads point out that we've good at proposing complex things, but
these had a high footprint.  What I'm proposing is lighter than that,
I think, tackling my core issue with the infra supporting backward
compatibility and the addition of more modes on top of it.
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Tue, Jul 08, 2025 at 09:31:29PM +0300, Nikita Malakhov wrote:
> Michael, one more thing forgot to mention yesterday -
> #define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
> static const toast_external_info
> toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
> VARTAG_ONDISK_OID historically has a value of 18
> and here we got an array of 19 members with only 2 valid ones.

Yeah, I'm aware of that.  The code is mostly to make it easier to read
while dealing with this historical behavior, even if it costs a bit
more in memory.  I don't think that it's a big deal, and we could
always have one more level of redirection to reduce its size.  Now
there's the extra complexity..

> What do you think about having an individual
> TOAST value id counter per relation instead of using
> a common one? I think this is a very promising approach,
> but a decision must be made where it should be stored.

I've thought about that, and decided to discard this idea for now to
keep the whole proposal simpler.  This has benefits if you have many
relations with few OIDs consumed, but this has a cost in itself as you
need to maintain the data for each TOAST relation.  When I looked at
the problems a couple of weeks ago, I came to the conclusion that all
the checkbox properties of "local" TOAST values are filled with a
sequence: WAL logging to ensure uniqueness, etc.  So I was even
considering the addition of some code to create sequences on-the-fly,
but at the end that was just more complexity with how we define
sequences currently compared to a unique 8-byte counter in the
control file that's good enough for a veeery long time.

I've also noticed that this sort of links to a piece I've implemented
last year and is still sitting in the CF app without much interest
from others: sequence AMs.  You could implement a "TOAST" sequence
method, for example, optimized for this purpose.  As a whole, I
propose to limit the scope of the proposal to the pluggability of the
external TOAST pointers.  The approach I've taken should allow such
improvements, these can be added later if really needed.
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Tue, Jul 08, 2025 at 12:58:26PM -0400, Burd, Greg wrote:
> All that aside, I think you're right to tackle this one step at a
> time and try not to boil too much of the ocean at once (the patch
> set is already large enough).  With that in mind I've read once or
> twice over your changes and have a few basic comments/questions.
>
>
> v2-0001 Refactor some TOAST value ID code to use uint64 instead of Oid
>
> This set of changes make sense and as you say are mechanical in
> nature, no real comments other than I think that using uint64 rather
> than Oid is the right call and addresses #2 on Tom's list.
>
>
> v2-0002 Minimize footprint of TOAST_MAX_CHUNK_SIZE in heap TOAST code
>
> I like this as well, clarifies the code and reduces repetition.

Thanks.  These are independent pieces if you want to link the code
less to TOAST, assuming that an area of 8 bytes would be good enough
for any TOAST "value" concept.  TIDs were mentioned as well on a
different part of the thread, ItemPointerData is 6 bytes.

> v2-0003 Refactor external TOAST pointer code for better pluggability
>
> + * For now there are only two types, all defined in this file. For now this
> + * is the maximum value of vartag_external, which is a historical choice.
>
> This provides a bridge for compatibility, but doesn't open the door
> to a truly pluggable API.  I'm guessing the goal is incremental
> change rather than wholesale rewrite.

Nope, it does not introduce a pluggable thing, but it does untangle
the fact that one needs to change 15-ish code paths when they want to
add a new type of external TOAST pointer, without showing an actual
impact AFAIK when we insert a TOAST value or fetch it, as long as we
know that we're dealing with an on-disk thing that requires an
external lookup.

> + * The different kinds of on-disk external TOAST pointers. divided by
> + * vartag_external.
>
> Extra '.' in "TOAST pointers. divided" I'm guessing.

Indeed, thanks.

> v2-0007 Introduce global 64-bit TOAST ID counter in control file
>
> Do you have any concern that this might become a bottleneck when
> there are many relations and many backends all contending for a new
> id?  I'd imagine that this would show up in a flame graph, but I
> think your test focused on the read side detoast_attr_slice() rather
> than insert/update and contention on the shared counter.  Would this
> be even worse on NUMA systems?

That may be possible, see below.

> Thanks for the flame graphs examining a heavy detoast_attr_slice()
> workload.  I agree that there is little or no difference between
> them which is nice.

Cool.  Yes.  I was wondering why detoast_attr_slice() does not show up
in the profile on HEAD, perhaps it just got optimized away (I was
under -O2 for these profiles).

> I think the only call out Tom made [4] that isn't addressed was the
> ask for localized ID selection.  That may make sense at some point,
> especially if there is contention on GetNewToastId().  I think that
> case is worth a separate performance test, something with a large
> number of relations and backends all performing a lot of updates
> generating a lot of new IDs.  What do you think?

Yeah, I need to do more benchmark for the int8 part, I was holding on
such evaluations because this part of the patch does not fly if we
don't do the refactoring pieces first.  Anyway, I cannot get excited
about the extra workload that this would require in the catalogs,
because we would need one TOAST sequence tracked in there, linked to
the TOAST relation so it would not be free.  Or we invent a new
facility just for this purpose, meaning that we get far away even more
from being able to resolve the original problem with the values and
compression IDs.  We're talking about two instructions.  Well, I guess
that we could optimize it more some atomics or even cache a range of
values to save in ToastIdGenLock acquisitions in a single backend.  I
suspect that the bottleneck is going to be the insertion of the TOAST
entries in toast_save_datum() anyway with the check for conflicting
values, even if your relation is unlogged or running-on-scissors in
memory.

> [2] "Yes, the idea is to put the tid pointer directly in the varlena
> external header and have a tid array in the toast table as an extra
> column. If all of the TOAST fits in the single record, this will be
> empty, else it will have an array of tids for all the pages for this
> toasted field." - Hannu Krosing in an email to me after
> PGConf.dev/2025

Sure, you could do that as well, but I suspect that we'll need the
steps of at least up to 0003 to be able to handle more easily multiple
external TOAST pointer types, or the code will be messier than it
currently is.  :D
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikita Malakhov
Дата:
Hi!

On this - 
> Non-FULL vacuum should already work. Only commands like VACUUM FULL
> and CLUSTER which move tuples around should be disabled on TOAST
> tables.

Cool, toast tables are subject to bloating in update-heavy scenarios
and it's a big problem in production systems, it seems there is a promising
way to solve it once and for all!

Have to mention though that we encountered issues in logical replication
when we made toast values updatable.

Also researching direct tids implementation.

Cheers!

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Mon, Jul 14, 2025 at 09:01:28AM +0300, Nikita Malakhov wrote:
> Cool, toast tables are subject to bloating in update-heavy scenarios
> and it's a big problem in production systems, it seems there is a promising
> way to solve it once and for all!
>
> Have to mention though that we encountered issues in logical replication
> when we made toast values updatable.
>
> Also researching direct tids implementation.

I would be curious to see if the refactoring done on this thread would
be useful in the scope of what you are trying to do.  I'd suggest
dropping that on a different thread, though, if you finish with a
patch or something worth looking at for others.
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikita Malakhov
Дата:
Hi Michael,

I'm currently debugging POC direct tids TOAST patch (on top of your branch),
will mail it in a day or two.

On Tue, Jul 15, 2025 at 3:56 AM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Jul 14, 2025 at 09:01:28AM +0300, Nikita Malakhov wrote:
> Cool, toast tables are subject to bloating in update-heavy scenarios
> and it's a big problem in production systems, it seems there is a promising
> way to solve it once and for all!
>
> Have to mention though that we encountered issues in logical replication
> when we made toast values updatable.
>
> Also researching direct tids implementation.

I would be curious to see if the refactoring done on this thread would
be useful in the scope of what you are trying to do.  I'd suggest
dropping that on a different thread, though, if you finish with a
patch or something worth looking at for others.
--
Michael


--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Hannu Krosing
Дата:
On Fri, Jul 18, 2025 at 9:24 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
>
> Hi Michael,
>
> I'm currently debugging POC direct tids TOAST patch (on top of your branch),
> will mail it in a day or two.

Great!

I also just started looking at it, starting from 0003 as recommended by Michael.

Will be interesting to see how similar / different our approaches will be :)



Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Hannu Krosing
Дата:
On Mon, Jul 14, 2025 at 8:15 AM Nikita Malakhov <hukutoc@gmail.com> wrote:
>
...
> Have to mention though that we encountered issues in logical replication
> when we made toast values updatable.

This seems to indicate that Logical Decoding does not honour
visibility checks in TOAST.
This works fine if the TOAST visibility never changes but will break
if it can change independently of heap-side tuple

> Also researching direct tids implementation.

Cool!



Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Fri, Jul 18, 2025 at 10:24:12PM +0300, Nikita Malakhov wrote:
> I'm currently debugging POC direct tids TOAST patch (on top of your branch),
> will mail it in a day or two.

Interesting.  Of course I may be wrong because I have no idea of how
you have shaped things, still I suspect that for the basics you should
just need 0003, 0004, the parts with the GUC to switch the TOAST table
type and the dump/restore/upgrade bits.
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikhil Kumar Veldanda
Дата:
Hi Michael,

I'm reviewing your patches(toast_64bit_v2 branch) and have prepared
two incremental patches that add ZSTD compression support. While doing
this I made a small refactoring in heaptoast.c (around
ToastTupleContext) to make adding additional TOAST pointer formats
easier.

Added two new on-disk external vartags to support new compression
methods: one using an Oid value identifier and one using a 64‑bit
(int8) identifier for toast table types. This lets us support extended
compression methods for both existing Oid‑based TOAST tables and a
variant that needs a wider ID space.

typedef enum vartag_external
{
    VARTAG_INDIRECT = 1,
    VARTAG_EXPANDED_RO = 2,
    VARTAG_EXPANDED_RW = 3,
    VARTAG_ONDISK_INT8 = 4,
    VARTAG_ONDISK_CE_OID = 5,
    VARTAG_ONDISK_CE_INT8 = 6,
    VARTAG_ONDISK_OID = 18
} vartag_external;

Two new ondisk TOAST pointer structs carrying an va_ecinfo field for
extended compression methods:

typedef struct varatt_external_ce_oid
{
    int32       va_rawsize;     /* Original data size (includes header) */
    uint32      va_extinfo;     /* External saved size (without header) and
                                 * VARATT_CE_FLAG in top 2 bits */
    uint32      va_ecinfo;      /* Extended compression info */
    Oid         va_valueid;     /* Unique ID of value within TOAST table */
    Oid         va_toastrelid;  /* RelID of TOAST table containing it */
}           varatt_external_ce_oid;

typedef struct varatt_external_ce_int8
{
    int32       va_rawsize;     /* Original data size (includes header) */
    uint32      va_extinfo;     /* External saved size (without header) and
                                 * VARATT_CE_FLAG in top 2 bits */
    uint32      va_ecinfo;      /* Extended compression info */

    /*
     * Unique ID of value within TOAST table, as two uint32 for alignment and
     * padding.
     */
    uint32      va_valueid_lo;
    uint32      va_valueid_hi;
    Oid         va_toastrelid;  /* RelID of TOAST table containing it */
}           varatt_external_ce_int8;

The inline (varattrib_4b) format extension (discussed in [1]) is
included; I made one adjustment: the compression id field is now a
4‑byte va_ecinfo field (instead of 1 byte) for structural consistency
with the extended TOAST pointer formats.

typedef union
{
    struct                      /* Normal varlena (4-byte length) */
    {
        uint32      va_header;
        char        va_data[FLEXIBLE_ARRAY_MEMBER];
    }           va_4byte;
    struct                      /* Compressed-in-line format */
    {
        uint32      va_header;
        uint32      va_tcinfo;  /* Original data size (excludes header) and
                                 * compression method; see va_extinfo */
        char        va_data[FLEXIBLE_ARRAY_MEMBER]; /* Compressed data */
    }           va_compressed;
    struct
    {
        uint32      va_header;
        uint32      va_tcinfo;  /* Original data size (excludes header) and
                                 * compression method or VARATT_CE_FLAG flag;
                                 * see va_extinfo */
        uint32      va_ecinfo;  /* Extended compression info: 32-bit field
                                 * where only the lower 8 bits are used for
                                 * compression method. Upper 24 bits are
                                 * reserved/unused. Lower 8 bits layout: Bits
                                 * 7–1: encode (cmid − 2), so cmid is
                                 * [2…129] Bit 0: flag for extra metadata
                                 */
        char        va_data[FLEXIBLE_ARRAY_MEMBER];
    }           va_compressed_ext;
} varattrib_4b;

[1] https://www.postgresql.org/message-id/flat/CAFAfj_HX84EK4hyRYw50AOHOcdVi-%2BFFwAAPo7JHx4aShCvunQ%40mail.gmail.com

Please review it and let me know if you have any questions or
feedback? Thank you!

v26-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch:
Design proposal covering varattrib_4b, TOAST pointer layouts, and
related macro updates.
v26-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain
ZSTD (non dict) support and few basic tests.

--
Nikhil Veldanda

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikhil Kumar Veldanda
Дата:
Hi,

> v26-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch:
> Design proposal covering varattrib_4b, TOAST pointer layouts, and
> related macro updates.
> v26-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain
> ZSTD (non dict) support and few basic tests.

Sending v27 patch with a small update over v26 patch.

v27-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch:
Design proposal covering varattrib_4b, TOAST pointer layouts, and
related macro updates.
v27-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain
ZSTD (non dict) support and few basic tests.

> --
> Nikhil Veldanda

-- 
Nikhil Veldanda

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Hannu Krosing
Дата:
I have been evolving details for Direct TOAST design in
https://wiki.postgresql.org/wiki/DirectTOAST

The top level goals are

* 8-byte TOAST pointer - just (header:1, tag:1 and TID:6)
* all other info moved from toast pointer to actual toast record(s),
so heap rows are smaller and faster.
* all extra fields are bytea with internal encoding (maybe will create
full new types for these, or maybe just introspection functions are
enough)
  the reasons for this are
  - PostgresSQL arrays add 20 byte overhead
  - bytea gives other freedoms in encoding for minimal space usage

No solution yet for va_toastrelid , but hope is
- to use some kind of mapping and find one or two free bits somewhere
(tid has one free),
- or add a 12-byte toast pointer just for this.
- or to make sure that CLUSTER and VACUUM FULL can be done without
needing va_toastrelid. I assume it is there for clustering the TOAST
which will be not possible separately from the main heap with direct
toast tid pointers anyway.

Please take a look and poke holes in it !

On Sun, Jul 20, 2025 at 10:28 AM Nikhil Kumar Veldanda
<veldanda.nikhilkumar17@gmail.com> wrote:
>
> Hi,
>
> > v26-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch:
> > Design proposal covering varattrib_4b, TOAST pointer layouts, and
> > related macro updates.
> > v26-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain
> > ZSTD (non dict) support and few basic tests.
>
> Sending v27 patch with a small update over v26 patch.
>
> v27-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch:
> Design proposal covering varattrib_4b, TOAST pointer layouts, and
> related macro updates.
> v27-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain
> ZSTD (non dict) support and few basic tests.
>
> > --
> > Nikhil Veldanda
>
> --
> Nikhil Veldanda



Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikita Malakhov
Дата:
Hi!

Michael and Hannu, here's a POC patch with direct TIDs TOAST.
The simplest implementation where we store a chain of TIDs, each
chunk stores the next TID to be fetched. Patch applies on top of
commit 998b0b51d5ea763be081804434f177082ba6772b (origin/toast_64bit_v2)
Author: Michael Paquier <michael@paquier.xyz>
Date:   Thu Jun 19 13:09:11 2025 +0900

While it is very fast on small data - I see several disadvantages:
- first of all, VACUUM should be revised to work with such tables;
- problematic batch insertion due to necessity to store TID chain.

It is just a POC implementation, so please don't blame me for
questionable decisions.

Any opinions and feedback welcome!

PS: Hannu, just seen your latest message, will check it out now.

On Mon, Jul 21, 2025 at 3:15 AM Hannu Krosing <hannuk@google.com> wrote:
I have been evolving details for Direct TOAST design in
https://wiki.postgresql.org/wiki/DirectTOAST

The top level goals are

* 8-byte TOAST pointer - just (header:1, tag:1 and TID:6)
* all other info moved from toast pointer to actual toast record(s),
so heap rows are smaller and faster.
* all extra fields are bytea with internal encoding (maybe will create
full new types for these, or maybe just introspection functions are
enough)
  the reasons for this are
  - PostgresSQL arrays add 20 byte overhead
  - bytea gives other freedoms in encoding for minimal space usage

No solution yet for va_toastrelid , but hope is
- to use some kind of mapping and find one or two free bits somewhere
(tid has one free),
- or add a 12-byte toast pointer just for this.
- or to make sure that CLUSTER and VACUUM FULL can be done without
needing va_toastrelid. I assume it is there for clustering the TOAST
which will be not possible separately from the main heap with direct
toast tid pointers anyway.

Please take a look and poke holes in it !




--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikita Malakhov
Дата:
Hi!

I agree that storing reltoastrelid in each Toast pointer seems to be
a waste of disk space since the current Postgres state does not
allow multiple TOAST tables per relation.
But if we consider this as a viable option it could bring additional
advantages. I've successfully tried to use multiple TOAST tables,
with different variations - by type, by column and as-is just as
an extensible storage.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Mon, Jul 21, 2025 at 02:20:31PM +0300, Nikita Malakhov wrote:
> I agree that storing reltoastrelid in each Toast pointer seems to be
> a waste of disk space since the current Postgres state does not
> allow multiple TOAST tables per relation.

va_toastrelid is a central part of the current system when dealing
with a TOAST relation rewrite, because we need to know to which
relation an on-disk TOAST pointer is part of.  Or do you mean that we
don't need that with what you are proposing with TIDs?  Perhaps yes,
sure, I've not studied this question when associated with your patch
(which has a bunch of duplicated code that could be avoided AFAIK).

> But if we consider this as a viable option it could bring additional
> advantages. I've successfully tried to use multiple TOAST tables,
> with different variations - by type, by column and as-is just as
> an extensible storage.

I don't think we need to be that ambitious.  There is no way to say
such things could have any benefits in the long-term and there's the
catalog representation part.  Even if we do that, I suspect that users
would never really know which setup makes sense because we would want
to control how things happen at a relation level and not purely
automate a behavior.
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Mon, Jul 21, 2025 at 02:06:03PM +0300, Nikita Malakhov wrote:
> While it is very fast on small data - I see several disadvantages:
> - first of all, VACUUM should be revised to work with such tables;
> - problematic batch insertion due to necessity to store TID chain.
>
> It is just a POC implementation, so please don't blame me for
> questionable decisions.
>
> Any opinions and feedback welcome!

I think that this is going to be helpful in taking some decisions with
the refactoring pieces I am proposing for the external TOAST pointer
layer.  You have some interesting points around
detoast_external_attr() and detoast_attr_slice(), as far as I can see.
One point of the on-disk TOAST refactoring is that we should be able
to entirely avoid this level of redirection.  I get that this is a
POC, of course, but it provides pointers that what I've done may not
be sufficient in terms of extensibility so that seems worth digging
into.

The patch posted by Nikhil is also something that touches this area,
that I have on my tablets:
https://www.postgresql.org/message-id/CAFAfj_E-QLiUq--%2BKdyvb%2B-Gg79LLayZRcH8%2BmFPzVuDQOVaAw%40mail.gmail.com

It touches a different point: we need to be smarter with
CompressionToastId and not use that as the value for what we store on
disk.  Each vartag_external should be able to use the compression
methods values it wishes, with the toast_external layer being in
charge of translating that back-and-forth with the GUC value.
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikita Malakhov
Дата:
Hi Michael!

Yes, I know about relation rewrite and have already thought about how
we can avoid excessive storage of toastrelid and do not spoil rewrite,
still do not have a good enough solution.

> You have some interesting points around
> detoast_external_attr() and detoast_attr_slice(), as far as I can see.
> One point of the on-disk TOAST refactoring is that we should be able
> to entirely avoid this level of redirection.  I get that this is a
> POC, of course, but it provides pointers that what I've done may not
> be sufficient in terms of extensibility so that seems worth digging
> into.

I'm currently re-visiting our TOAST API patch set, there are some
good (in terms of simplicity and lightweightness) proposals, will mail
later.

Some more thoughts on TIDs:
TIDs could be stored as a list instead of a chain (as Hannu proposes
in his design). This allows batch operations and storage optimization
'cause TID lists are highly compressible, but significantly complicates
the code responsible for chunk processing.
Also, Toast pointer in current state must store raw size and external
size - these two are used by the executor, and we cannot get rid
of them so lightly.

Vacuuming such a table would be a pain in the ass, we have to
somehow prevent bloating tables with a high update rate.

Also, current toast mechanics is insert-only, it does not support
updates (just to remind - the whole toasted value is marked dead
and new one is inserted during update), this is a subject to change.
And logical replication, as I mentioned earlier, does not have any
means for replicating toast diffs.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Hannu Krosing
Дата:
On Tue, Jul 22, 2025 at 1:24 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
>
> Hi Michael!
>
> Yes, I know about relation rewrite and have already thought about how
> we can avoid excessive storage of toastrelid and do not spoil rewrite,
> still do not have a good enough solution.

The high-level idea would be to any actual rewrite -- as opposed to
plain vacuum which frees empty space within the TOAST relation -- as
part of the vacuum of the main relation.
Another option would be to store a back-pointer to the heap tuple
inside the toast tuple and use that when rewriting, though it has its
own set of complexities.

> > You have some interesting points around
> > detoast_external_attr() and detoast_attr_slice(), as far as I can see.
> > One point of the on-disk TOAST refactoring is that we should be able
> > to entirely avoid this level of redirection.  I get that this is a
> > POC, of course, but it provides pointers that what I've done may not
> > be sufficient in terms of extensibility so that seems worth digging
> > into.
>
> I'm currently re-visiting our TOAST API patch set, there are some
> good (in terms of simplicity and lightweightness) proposals, will mail
> later.

Sounds interesting.

> Some more thoughts on TIDs:
> TIDs could be stored as a list instead of a chain (as Hannu proposes
> in his design). This allows batch operations and storage optimization
> 'cause TID lists are highly compressible, but significantly complicates
> the code responsible for chunk processing.

I would not say it complicates the *code* very much, especially when
you keep offsets in the toast tuples so that you can copy them into
the final materialized datum in any order.
And it does allow many optimisations in terms of batching,
pre-fetching and even parallelism in case of huge toasted values.

> Also, Toast pointer in current state must store raw size and external
> size - these two are used by the executor, and we cannot get rid
> of them so lightly.

Are these ever used without actually using the data ?
When the data _is_ also used then the cost of getting the length from
the toast record with direct toast should mostly amortize over the
full query.

Can you point to where in the code this is done ?

In long run we may want to store also the actual size in the toast
record (not toast pointer) as well for types where length() !=
octetsize bacuse currently a simple call like length(text) has to
materialize the whole thing before getting the length, whereas
pg_colum_size() and octertsize() are instantaneous.

> Vacuuming such a table would be a pain in the ass, we have to
> somehow prevent bloating tables with a high update rate.

Normal Vacuum should work fine. It is the rewrite that cis tricky.

> Also, current toast mechanics is insert-only, it does not support
> updates (just to remind - the whole toasted value is marked dead
> and new one is inserted during update), this is a subject to change.
> And logical replication, as I mentioned earlier, does not have any
> means for replicating toast diffs.

Which points to the need to (optionally) store the diff in the toast
as well when there are defined replication slots.
Once we have a way to actually do JSON(B) updates at SQL or function level.

We may even want to store the JSON in some base JSON + JSON_PATCH
format where we materialize at retrieval.

But this goes way beyond the current patch's scope. Though my design
should accommodate it nicely.

---
Hannu



Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Tue, Jul 22, 2025 at 02:56:23PM +0200, Hannu Krosing wrote:
> The high-level idea would be to any actual rewrite -- as opposed to
> plain vacuum which frees empty space within the TOAST relation -- as
> part of the vacuum of the main relation.
> Another option would be to store a back-pointer to the heap tuple
> inside the toast tuple and use that when rewriting, though it has its
> own set of complexities.

Well.  I would suggest to begin with simpler to begin with, finding
building pieces that can be relied on, and I tend to think that I've
sorted out the first basic one of these because we want to be able to
give the backend the possibility to understand more formats of
external on-dist TOAST pointers.  A simple whole rewrite of the TOAST
engine is not something that can just happen in one day: that's a very
risky move, and we need to worry about backward-compatibility while
maintaining the legacy code.  FWIW, I think that we should still move
on with the 8-byte implementation anyway with specific vartag_external
that can lead to variable-sized pointers (shorter if value is less
than UINT32_MAX, still stored in a int8 TOAST table), extending the
code so as it's possible to think about more on top of that.  That's
basically risk-free if done right while taking the problem at its
root.  That's also what Tom and Andres meant back in 2022 before the
problem drifted away to different issues, and that should allow the
addition of more compression methods if done correctly (quoted at the
top of this thread).

>>> You have some interesting points around
>>> detoast_external_attr() and detoast_attr_slice(), as far as I can see.
>>> One point of the on-disk TOAST refactoring is that we should be able
>>> to entirely avoid this level of redirection.  I get that this is a
>>> POC, of course, but it provides pointers that what I've done may not
>>> be sufficient in terms of extensibility so that seems worth digging
>>> into.
>>
>> I'm currently re-visiting our TOAST API patch set, there are some
>> good (in terms of simplicity and lightweightness) proposals, will mail
>> later.
>
> Sounds interesting.

If you have refactoring proposals, that could be considered
separately, yes.

Now, the reason why nothing has been done about the original problem
is the same reason as what's happening now on this thread: I am seeing
a lot of designs and assumptions for new solutions, without knowing
all the problems we are trying to solve or even if they actually make
sense at this level of the baclend engine.  And there is little to no
argument made about backward-compatibility, which is also something
I've tried to design (single GUC for dump/restore/upgrade with TOAST
type based on attribute data type of the TOAST table, which could be
also a tid later on).

> But this goes way beyond the current patch's scope. Though my design
> should accommodate it nicely.

If you see independent useful pieces that are worth their own, nobody
is going to object to that.  The trick is to find incremental pieces
good enough to be able to build upon with individual evaluations and
reviews, at least that's how I want to tackle the problem because I
would be the one who would be responsible for its maintenance by
default.  Finding out these "simple" independent relevant pieces is
the hard part.
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Wed, Jul 09, 2025 at 08:20:31AM +0900, Michael Paquier wrote:
> Sure, you could do that as well, but I suspect that we'll need the
> steps of at least up to 0003 to be able to handle more easily multiple
> external TOAST pointer types, or the code will be messier than it
> currently is.  :D

Please find attached a v3, that I have spent some time polishing to
fix the value ID problem of this thread.  v2 had some conflicts, and
the CI previously failed with warning job (CI is green here now).

There are two things that have changed in this patch series:
- Addition of separate patch to rename varatt_external to
varatt_external_oid and VARTAG_ONDISK to VARTAG_ONDISK_OID, in 0003.
- In the refactoring to introduce the translation layer for external
ondisk TOAST pointers, aka 0004 in this set labelled v3, I was not
happy about the way we grab the vartag_external that gets assigned to
the TOAST varlena, and it depends on two factors in this patch set:
-- The type of TOAST table.
-- The value assigned.  If, for example, we have a value less than 4
billion, we can make the TOAST pointer shorter.  The interface of 0004
has been rethought to take that into consideration.  That's the
function called toast_external_assign_vartag(), reused on heaptoast.c
to get TOAST_POINTER_SIZE as the compressibility threshold when
inserting a tuple and if it should be compressed.

As things stand, I am getting pretty happy with the patch set up to
0005 and how things are getting in shape for the interface, and I am
planning to begin applying this stuff up to 0005 in the next couple of
weeks.

This stuff introduces all the callbacks and the concept of
toast_external to get the refactoring into the tree first, which is
something that other patches that want to extend the compression
methods or the TID layer are going to need anyway.  The rest of the
patch set is more a group shot, where they actually matter only when
the new vartag_external is added for the TOAST tables able to hold
8-byte values.  As of now, we would need to add more one more
vartag_external:
- For values lower than 4 billion, perhaps we could just reuse the OID
vartag_external here?  The code is modular now with
toast_external_assign_vartag(), and it is only a matter of using
VARTAG_ONDISK_OID if we have a value less than 4 billion.  So there is
only one place in the code to change, and that's a one-line change
with the v3 patch set attached.  And that's less duplication.
- The second one when we have more than 4 billion values.
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Jim Nasby
Дата:
On Fri, Aug 1, 2025 at 4:03 AM Michael Paquier <michael@paquier.xyz> wrote:
- Addition of separate patch to rename varatt_external to
varatt_external_oid and VARTAG_ONDISK to VARTAG_ONDISK_OID, in 0003.

Since you're already renaming things... ISTM "ondisk" has the potential for confusion, assuming that at some point we'll have the ability to store large datums directly in the filesystem (instead of breaking into chunks to live in a relation). VARTAG_DURABLE might be a better option. 

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Mon, Aug 04, 2025 at 02:37:19PM -0500, Jim Nasby wrote:
> On Fri, Aug 1, 2025 at 4:03 AM Michael Paquier <michael@paquier.xyz> wrote:
>> - Addition of separate patch to rename varatt_external to
>> varatt_external_oid and VARTAG_ONDISK to VARTAG_ONDISK_OID, in 0003.
>
> Since you're already renaming things... ISTM "ondisk" has the potential for
> confusion, assuming that at some point we'll have the ability to store
> large datums directly in the filesystem (instead of breaking into chunks to
> live in a relation). VARTAG_DURABLE might be a better option.

Hmm.  I don't know about this one.  Durable is an ACID property that
does not apply to all relation kinds.  For example, take an unlogged
table: its data is not durable but its TOAST pointers would be marked
with a VARTAG_DURABLE.  With that in mind ONDISK still sounds kind of
OK for me to use as a name.
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Fri, Aug 01, 2025 at 06:03:11PM +0900, Michael Paquier wrote:
> Please find attached a v3, that I have spent some time polishing to
> fix the value ID problem of this thread.  v2 had some conflicts, and
> the CI previously failed with warning job (CI is green here now).

Attached is a v4, due to conflicts mainly caused by the recent changes
in varatt.h done by e035863c9a04.  This had an interesting side
benefit when rebasing, where I have been able to isolate most of the
knowledge related to the struct varatt_external (well
varatt_external_oid in the patch set) into toast_external.c, at the
exception of VARTAG_SIZE.  That's done in a separate patch, numbered
0006.

The rest of the patch set has a couple of adjustements to document
better the new API expectations for toast_external.{c,h}, comment
adjustments, some more beautification changes, some indentation
applied, etc.

> As things stand, I am getting pretty happy with the patch set up to
> 0005 and how things are getting in shape for the interface, and I am
> planning to begin applying this stuff up to 0005 in the next couple of
> weeks.

As of this patch set, this means a new target of 0006, to get the
TOAST code refactored so as it is able to support more than 1 type of
external on-disk pointer with the 8-byte value problem in scope.  Any
comments?
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikhil Kumar Veldanda
Дата:
Hi michael,

I have a question regarding TOAST pointer handling.

As I understand, in the current design, each attribute in a HeapTuple
can have its own TOAST pointer, and TOAST pointers are possible only
for top-level attributes.

Would it make sense to maintain an array for ttc_toast_pointer_size in
ToastTupleContext, allowing us to estimate the size per attribute
based on compression or other criteria?

This approach could make the logic more generic in my opinion, but it
would require changes in toast_tuple_find_biggest_attribute and other
places.

I’d like to hear your thoughts on this.

On Fri, Aug 8, 2025 at 12:52 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Aug 01, 2025 at 06:03:11PM +0900, Michael Paquier wrote:
> > Please find attached a v3, that I have spent some time polishing to
> > fix the value ID problem of this thread.  v2 had some conflicts, and
> > the CI previously failed with warning job (CI is green here now).
>
> Attached is a v4, due to conflicts mainly caused by the recent changes
> in varatt.h done by e035863c9a04.  This had an interesting side
> benefit when rebasing, where I have been able to isolate most of the
> knowledge related to the struct varatt_external (well
> varatt_external_oid in the patch set) into toast_external.c, at the
> exception of VARTAG_SIZE.  That's done in a separate patch, numbered
> 0006.
>
> The rest of the patch set has a couple of adjustements to document
> better the new API expectations for toast_external.{c,h}, comment
> adjustments, some more beautification changes, some indentation
> applied, etc.
>
> > As things stand, I am getting pretty happy with the patch set up to
> > 0005 and how things are getting in shape for the interface, and I am
> > planning to begin applying this stuff up to 0005 in the next couple of
> > weeks.
>
> As of this patch set, this means a new target of 0006, to get the
> TOAST code refactored so as it is able to support more than 1 type of
> external on-disk pointer with the 8-byte value problem in scope.  Any
> comments?
> --
> Michael



--
Nikhil Veldanda



Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> Attached is a v4, due to conflicts mainly caused by the recent changes
> in varatt.h done by e035863c9a04.

I found some time to look at the v4 patchset, and have a bunch of
comments of different sizes.

0001:

I'm good with widening all these values to 64 bits, but I wonder
if it's a great idea to use unadorned "uint64" as the data type.
That's impossible to grep for, doesn't convey anything much about
what the variables are, etc.  I'm tempted to propose instead
inventing a typedef "BigOid" or some such name (bikeshedding
welcome).  The elog's could be handled with, say,
    #define PRIBO PRIu64
This suggestion isn't made with the idea that we'd someday switch
to an even wider type, but just with the idea of making it clearer
what these values are being used for.  When you see "Oid" you
know it's some sort of object identifier, and I'm sad to give
that up here.

This hunk is flat out buggy:

@@ -1766,6 +1774,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
         return true;
 
     /* It is external, and we're looking at a page on disk */
+    toast_pointer_valueid = toast_pointer.va_valueid;
 
     /*
      * Must copy attr into toast_pointer for alignment considerations

toast_pointer isn't initialized at this point.  I see you fixed that
in 0004, but it doesn't help to split the patch series if the
intermediate steps are broken.

0002:

OK, although some of the hunks in 0001 seem to belong here.

0003:

No objection to the struct renaming, but does this go far
enough?  Aren't we going to need to rename TOAST_POINTER_SIZE
to TOAST_OID_POINTER_SIZE, etc, so that we can have similar
symbols for the wider version?  I'm suspicious of not renaming
the functions that work on these, too.  (Oh, it looks like you
did some of that in later parts.)

BTW, given that varatt.h has "typedef struct varatt_external {...}
varatt_external", I'm certain that the vast majority of uses of
"struct varatt_external" could be shortened to "varatt_external".
And I think we should do that, because using "struct foo" not "foo"
is not project style.  This patch would be a fine time to do that.

0004:

Shouldn't VARATT_EXTERNAL_GET_POINTER go away entirely?
It looks to me like every use of that should be replaced by
toast_external_info_get_data().

I wonder if we shouldn't try to get rid of the phraseology "standard
TOAST pointer", and instead write something like "short TOAST pointer"
or "small TOAST pointer".  These aren't really going to be more
"standard" than the wider ones, IMO.

I don't like replacing "va_valueid" with just "value".  Dropping
the "id" is not an improvement, because now a reader might be
confused about whether this is somehow the actual value of the
toasted datum.

Functions in toast_external.c are under-documented.  Some lack
any header comment whatever, and others have one that doesn't
actually say what they do.

I kind of wonder whether the run-time indirection this design causes
is going to be a performance problem.  Perhaps not, given the expenses
involved in accessing a toasted value, but it has a faint feeling
of overdesign.

It looks like a lot of this code would just flat out dump core if
passed an invalid vartag value.  Probably not good enough, given
that we might look at corrupted data.

0005:

Nice!

0006:

get_new_value is very confusingly documented.  Is the indexid that of
the toastrel's index?  What attribute is the attnum for?  Why should
the caller need to provide either, rather than get_new_value knowing
that internally?  Also, again you are using "value" for something
that is not the value of the to-be-toasted datum.  I'd suggest
something more like "get_new_toast_identifier".

I kind of feel that treating this operation as a toast_external_info
method is the wrong thing, since where it looks like you are going
is to fetch an identifier and only then decide which vartag you
need to use.  That ugliness comes out here:

+    /*
+     * Retrieve the external TOAST information, with the value still unknown.
+     * We need to do this again once we know the actual value assigned, to
+     * define the correct vartag_external for the new TOAST tuple.
+     */

0007:

Surely we do not need the cast in this:

+    return murmurhash64((int64) DatumGetInt64(datum));

I see you copied that from int4hashfast, but that's not good
style either.

More generally, though, why do we need catcache support for int8?
There aren't any caches on toast chunks, and I doubt we are going to
introduce any.  Sure there might be a need for this down the road,
but I don't see that this patch series is the time to add it.

0008:

I totally hate the idea of introducing a GUC for this.  This should be
user-transparent.  Or if it isn't user-transparent, a GUC is still
the wrong thing; some kind of relation option would be more sensible.

0009:

I do not love this either.  I think the right thing is just to widen
the existing nextOid counter to 64 bits, and increment it in-memory as
64 bits, and then return either all 64 bits when a 64-bit Oid is asked
for, or just the low-order 32 bits when a 32-bit OID is asked for
(with appropriate hacking for 32-bit wraparound).  Having a separate
counter will make it too easy for the same value to be produced by
nearby calls to the 32-bit and 64-bit Oid generators, which is likely
a bad thing, if only because of potential confusion.

0010:

OK

0011:

Not reviewing this yet, because I disagree with the basic design.
I didn't look at the later patches either.

            regards, tom lane



Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Andres Freund
Дата:
Hi,

On 2025-08-08 15:32:16 -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > Attached is a v4, due to conflicts mainly caused by the recent changes
> > in varatt.h done by e035863c9a04.
> 
> I found some time to look at the v4 patchset, and have a bunch of
> comments of different sizes.
> 
> 0001:
> 
> I'm good with widening all these values to 64 bits, but I wonder
> if it's a great idea to use unadorned "uint64" as the data type.
> That's impossible to grep for, doesn't convey anything much about
> what the variables are, etc.  I'm tempted to propose instead
> inventing a typedef "BigOid" or some such name (bikeshedding
> welcome).  The elog's could be handled with, say,
>     #define PRIBO PRIu64
> This suggestion isn't made with the idea that we'd someday switch
> to an even wider type, but just with the idea of making it clearer
> what these values are being used for.  When you see "Oid" you
> know it's some sort of object identifier, and I'm sad to give
> that up here.

I think we should consider introducing Oid64, instead of having a toast
specific type.  I doubt this is the last place that we'll want to use 64 bit
wide types for cataloged entities.


> 0004:
> 
> Shouldn't VARATT_EXTERNAL_GET_POINTER go away entirely?
> It looks to me like every use of that should be replaced by
> toast_external_info_get_data().
> 
> I wonder if we shouldn't try to get rid of the phraseology "standard
> TOAST pointer", and instead write something like "short TOAST pointer"
> or "small TOAST pointer".  These aren't really going to be more
> "standard" than the wider ones, IMO.
> 
> I don't like replacing "va_valueid" with just "value".  Dropping
> the "id" is not an improvement, because now a reader might be
> confused about whether this is somehow the actual value of the
> toasted datum.
> 
> Functions in toast_external.c are under-documented.  Some lack
> any header comment whatever, and others have one that doesn't
> actually say what they do.
> 
> I kind of wonder whether the run-time indirection this design causes
> is going to be a performance problem.  Perhaps not, given the expenses
> involved in accessing a toasted value, but it has a faint feeling
> of overdesign.

+1


> 0008:
> 
> I totally hate the idea of introducing a GUC for this.  This should be
> user-transparent.  Or if it isn't user-transparent, a GUC is still
> the wrong thing; some kind of relation option would be more sensible.

Agreed. I think we need backward compatibility for pg_upgrade purposes, but
that doesn't require a GUC, it just requires an option when creating tables in
binary upgrade mode.


> 0009:
> 
> I do not love this either.  I think the right thing is just to widen
> the existing nextOid counter to 64 bits, and increment it in-memory as
> 64 bits, and then return either all 64 bits when a 64-bit Oid is asked
> for, or just the low-order 32 bits when a 32-bit OID is asked for
> (with appropriate hacking for 32-bit wraparound).  Having a separate
> counter will make it too easy for the same value to be produced by
> nearby calls to the 32-bit and 64-bit Oid generators, which is likely
> a bad thing, if only because of potential confusion.

I'm not convinced that the global counter, be it a 32 or a 64 bit, approach
has merit in general, and I'm rather sure it's the wrong thing for toast
values. There's no straightforward path to move away from the global counter
for plain oids, but I would suggest simply not using the global oid counter
for toast IDs.

A large portion of the cases I've seen where toast ID assignments were a
problem were when the global OID counter wrapped around due to activity on
*other* tables (and/or temporary table creation). If you instead had a
per-toast-table sequence for assigning chunk IDs, that problem would largely
vanish.

With 64bit toast IDs we shouldn't need to search the index for a
non-conflicting toast IDs, there can't be wraparounds (we'd hit wraparound of
LSNs well before that and that's not practically reachable).

Greetings,

Andres Freund



Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Fri, Aug 08, 2025 at 05:02:46PM -0400, Andres Freund wrote:
> On 2025-08-08 15:32:16 -0400, Tom Lane wrote:
>> I found some time to look at the v4 patchset, and have a bunch of
>> comments of different sizes.

Thanks for the input.  This patch set exists as a result of a
discussion between you and Andres back in 2022.

(Replying here regarding the points that Andres is quoting, I'll add a
second mail for some of the other things later.)

>> I'm good with widening all these values to 64 bits, but I wonder
>> if it's a great idea to use unadorned "uint64" as the data type.
>> That's impossible to grep for, doesn't convey anything much about
>> what the variables are, etc.  I'm tempted to propose instead
>> inventing a typedef "BigOid" or some such name (bikeshedding
>> welcome).  The elog's could be handled with, say,
>>     #define PRIBO PRIu64
>> This suggestion isn't made with the idea that we'd someday switch
>> to an even wider type, but just with the idea of making it clearer
>> what these values are being used for.  When you see "Oid" you
>> know it's some sort of object identifier, and I'm sad to give
>> that up here.
>
> I think we should consider introducing Oid64, instead of having a toast
> specific type.  I doubt this is the last place that we'll want to use 64 bit
> wide types for cataloged entities.

Works for me for the grepping argument.  Using "64" as a number of
bits in the type name sounds a bit strange to me, not in line with
what's done with bigint/int8 or float, where we use bytes.  Naming a
dedicated type "bigoid" with a structure named BigOid behind that's
used for TOAST or in the future for other code paths is OK by me.
AFAIK, the itchy point with unsigned 64b would be the casting, but my
take would be to introduce no casts in the first implementation,
particularly for the bigint -> bigoid case.

>> 0004:
>>
>> Shouldn't VARATT_EXTERNAL_GET_POINTER go away entirely?
>> It looks to me like every use of that should be replaced by
>> toast_external_info_get_data().

Indirect toast pointers still wanted it.  But perhaps this could just
be renamed VARATT_EXTERNAL_INDIRECT_GET_POINTER() or something like
that if jt's used only for TOAST pointers?

>> I wonder if we shouldn't try to get rid of the phraseology "standard
>> TOAST pointer", and instead write something like "short TOAST pointer"
>> or "small TOAST pointer".  These aren't really going to be more
>> "standard" than the wider ones, IMO.
>>
>> I don't like replacing "va_valueid" with just "value".  Dropping
>> the "id" is not an improvement, because now a reader might be
>> confused about whether this is somehow the actual value of the
>> toasted datum.

Okay, sure.

>> Functions in toast_external.c are under-documented.  Some lack
>> any header comment whatever, and others have one that doesn't
>> actually say what they do.

Okay.  Will work on that.  I am not sure if it's worth doing yet, it
does not seem like there's a clear agreement about patch 0004 and the
toast_external business I am proposing.

>> I kind of wonder whether the run-time indirection this design causes
>> is going to be a performance problem.  Perhaps not, given the expenses
>> involved in accessing a toasted value, but it has a faint feeling
>> of overdesign.

toast_external_info_get_data() is called in 8 places as of the patch
set:
1) Twice for amcheck.
2) Twice for reorderbuffer.
3) Three times in detoast.c
3-1) When fetching a value in full, toast_fetch_datum(), which goes
through the slice API.
3-2) toast_fetch_datum_slice(), to retrieve a slice of the toasted
data.
3-3) detoast_attr_slice(), falling back to the slice fetch.
4) Twice in toast_internals.c:
4-1) Saving a datum.
4-2) Deleting a datum.

And if you see here, upthread, I've defined the worst case as
retrieving a minimal toasted slice stored uncompressed, to measure the
cost of the conversion to this toast_external, without seeing any
effects, the btree conflict lookups doing most of the work:
https://www.postgresql.org/message-id/aGzLiDUB_18-8aVQ%40paquier.xyz

Thoughts and counter-arguments are welcome, of course.

> +1

Well, one reason why this design exists is a point from 2023, made by
Robert H., around here:
https://www.postgresql.org/message-id/CA%2BTgmoaVcjUkmtWdc_9QjBzvSShjDBYk-5XFNaOvYLgGROjJMA%40mail.gmail.com

The argument is that it's hard to miss for the existing code and
extensions how a new vartag should be handled.  Inventing a new layer
means that extensions and existing code need to do a switch once, then
they cannot really miss be missed when we add a new vartag because the
from/to indirection between the toast_external layer and the varlenas
is handled in its own place.

A second reason why I've used this design is the problem related to
compression IDs, where the idea would be to add a new vartag for
zstandard for example.  This still needs more work due to the existing
limitations that we currently have with the CompressionToastId and its
limitation to four values.  The idea is that the deserialization of
this data into this toast_external proposal makes the future additions
easier.  It would be of course more brutal and more efficient to just
extend the code paths where the toast_external layer (I mean where
VARATT_IS_EXTERNAL_ONDISK() is used currently) with an extra branch
made of a VARATT_IS_EXTERNAL_ONDISK_BIGOID(), but I've decided to
digest the argument from Robert instead, where I'm aiming at isolating
most of the code related to on-disk external TOAST pointers into a
single file, named toast_external.c here.

I'm OK with the "brutal" method if the patch presented here is
thought as overengineered and overdesigned, just wanted to say that
there are a couple of reasons behind these choices.  It's great that
both of you are able to take some time to comment on these choices.
If you disagree with this method and just say that we should go ahead
with a more direct VARATT-like method, that's also fine by me as it
would still solve the value limitation problem.  And that's primarily
what I want to solve here as it's a pain for some in the field.

>> 0008:
>>
>> I totally hate the idea of introducing a GUC for this.  This should be
>> user-transparent.  Or if it isn't user-transparent, a GUC is still
>> the wrong thing; some kind of relation option would be more sensible.
>
> Agreed. I think we need backward compatibility for pg_upgrade purposes, but
> that doesn't require a GUC, it just requires an option when creating tables in
> binary upgrade mode.

So, you basically mean that we should just make the 8-byte case the
default for all the new tables created on a new cluster at upgrade,
and just carry across upgrades the TOAST tables with OID values?  If
this stuff uses a reloption or an option at DDL level, that's going to
need some dump/restore parts.  Not sure that I'm betting the full
picture of what the default behavior should be.  I have read on the
2022 thread where both of you have discussed this issue a point about
a "legacy" mode, to give users a way to get the old behavior if they
wanted.  A GUC fit nicely around that idea, because it is them
possible to choose how all tables should behave, or perhaps I just did
not design correctly what you meant through that.

>> 0009:
>>
>> I do not love this either.  I think the right thing is just to widen
>> the existing nextOid counter to 64 bits, and increment it in-memory as
>> 64 bits, and then return either all 64 bits when a 64-bit Oid is asked
>> for, or just the low-order 32 bits when a 32-bit OID is asked for
>> (with appropriate hacking for 32-bit wraparound).  Having a separate
>> counter will make it too easy for the same value to be produced by
>> nearby calls to the 32-bit and 64-bit Oid generators, which is likely
>> a bad thing, if only because of potential confusion.

I was wondering about doing that as well.  FWIW, this approach works
for me, eating only 4 more bytes in the control file.

> I'm not convinced that the global counter, be it a 32 or a 64 bit, approach
> has merit in general, and I'm rather sure it's the wrong thing for toast
> values. There's no straightforward path to move away from the global counter
> for plain oids, but I would suggest simply not using the global oid counter
> for toast IDs.
>
> A large portion of the cases I've seen where toast ID assignments were a
> problem were when the global OID counter wrapped around due to activity on
> *other* tables (and/or temporary table creation). If you instead had a
> per-toast-table sequence for assigning chunk IDs, that problem would largely
> vanish.

I've thought about this piece already and the reason why I have not
taken this approach is mostly simplicity.  It's mentioned upthread,
with potentially pieces referring to sequence AMs and "toast"
sequences that could be optimized for the purpose of TOAST relations:
https://www.postgresql.org/message-id/aG2iY26tXj1_MHfH%40paquier.xyz

Yep.  All the requirement boxes are filled with a sequence assigned to
the TOAST relation.  Using a single 8-byte counter is actually
cheaper, even if the overhead when generating the TOAST tuples are in
the btree lookups and the insertions themselves.  It is also possible
to do that as a two-step process:
- First have TOAST relations with 8-byte values.
- Add local sequences later on.
What matters is having the code in place to check for index lookups
upon values conflicting when a new sequence is attached to an existing
TOAST relation.  If we assign sequences from the start, that would not
matter, of course.  Another reason on top of the simplicity piece
behind the control file is that it's dead cheap to acquire a new
value when saving a chunk externally.
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Fri, Aug 08, 2025 at 03:32:16PM -0400, Tom Lane wrote:
> I found some time to look at the v4 patchset, and have a bunch of
> comments of different sizes.

Thanks for the comments.  I have replied to some of the items here:
https://www.postgresql.org/message-id/aJbygEBqJgmLS0wF%40paquier.xyz

Will try to answer the rest here.

> 0001:
>
> I'm good with widening all these values to 64 bits, but I wonder
> if it's a great idea to use unadorned "uint64" as the data type.
> That's impossible to grep for, doesn't convey anything much about
> what the variables are, etc.  I'm tempted to propose instead
> inventing a typedef "BigOid" or some such name (bikeshedding
> welcome).  The elog's could be handled with, say,
>     #define PRIBO PRIu64
> This suggestion isn't made with the idea that we'd someday switch
> to an even wider type, but just with the idea of making it clearer
> what these values are being used for.  When you see "Oid" you
> know it's some sort of object identifier, and I'm sad to give
> that up here.

Already mentioned on the other message, but I'm OK with a "bigoid"
type and a BigOid type.  Honestly, I'm not sure about a replacement
for PRIu64, as we don't really do that for Oids with %u.

> toast_pointer isn't initialized at this point.  I see you fixed that
> in 0004, but it doesn't help to split the patch series if the
> intermediate steps are broken.

Oops.  FWIW, I've rebased this patch set much more than 4 times.  It
looks like I've messed up some of the diffs.  Sorry about that.

> 0003:
>
> No objection to the struct renaming, but does this go far
> enough?  Aren't we going to need to rename TOAST_POINTER_SIZE
> to TOAST_OID_POINTER_SIZE, etc, so that we can have similar
> symbols for the wider version?  I'm suspicious of not renaming
> the functions that work on these, too.  (Oh, it looks like you
> did some of that in later parts.)

Yeah.  I've stuck that into the later parts where the int8 bits have
been added.  Perhaps I've not been ambitious enough.

> BTW, given that varatt.h has "typedef struct varatt_external {...}
> varatt_external", I'm certain that the vast majority of uses of
> "struct varatt_external" could be shortened to "varatt_external".
> And I think we should do that, because using "struct foo" not "foo"
> is not project style.  This patch would be a fine time to do that.

Good point.

> 0004:
>
> Shouldn't VARATT_EXTERNAL_GET_POINTER go away entirely?
> It looks to me like every use of that should be replaced by
> toast_external_info_get_data().

Point about indirect pointers mentioned on the other message, where we
could rename VARATT_EXTERNAL_GET_POINTER to
VARATT_EXTERNAL_INDIRECT_GET_POINTER or equivalent to limit the
confusion?

> I wonder if we shouldn't try to get rid of the phraseology "standard
> TOAST pointer", and instead write something like "short TOAST pointer"
> or "small TOAST pointer".  These aren't really going to be more
> "standard" than the wider ones, IMO.

I'm seeing one place in arrayfuncs.c and one in detoast.h using this
term on HEAD.  I would do simpler: no standard and no short, just with
a removal of the "standard" part if I were to change something.

> I don't like replacing "va_valueid" with just "value".  Dropping
> the "id" is not an improvement, because now a reader might be
> confused about whether this is somehow the actual value of the
> toasted datum.

Okay, sure.  One reason behind the field renaming was also to track
down all the areas in need to be updated.

> Functions in toast_external.c are under-documented.  Some lack
> any header comment whatever, and others have one that doesn't
> actually say what they do.

Okay.

> I kind of wonder whether the run-time indirection this design causes
> is going to be a performance problem.  Perhaps not, given the expenses
> involved in accessing a toasted value, but it has a faint feeling
> of overdesign.

Mentioned on the other message, linked to this message:
https://www.postgresql.org/message-id/aGzLiDUB_18-8aVQ%40paquier.xyz

> It looks like a lot of this code would just flat out dump core if
> passed an invalid vartag value.  Probably not good enough, given
> that we might look at corrupted data.

Right.  That's where we would need an error path in
toast_external_get_info() if nothing is found.  That's a cheap
defense, I guess.  Good thing is that the lookups are centralized in a
single code path, at least with this design.

> I kind of feel that treating this operation as a toast_external_info
> method is the wrong thing, since where it looks like you are going
> is to fetch an identifier and only then decide which vartag you
> need to use.  That ugliness comes out here:
>
> +    /*
> +     * Retrieve the external TOAST information, with the value still unknown.
> +     * We need to do this again once we know the actual value assigned, to
> +     * define the correct vartag_external for the new TOAST tuple.
> +     */

Yeah, I was feeling a bit depressed when writing the concept of what a
"default" method should be before assigning the value, but that was
still feeling right.

> 0007:
>
> Surely we do not need the cast in this:
>
> +    return murmurhash64((int64) DatumGetInt64(datum));
>
> I see you copied that from int4hashfast, but that's not good
> style either.

Okay.

>
> More generally, though, why do we need catcache support for int8?
> There aren't any caches on toast chunks, and I doubt we are going to
> introduce any.  Sure there might be a need for this down the road,
> but I don't see that this patch series is the time to add it.

I recall that this part was required for the value conflict lookups
and also the TOAST value retrievals, so we are going to need it.

> 0008:
>
> I totally hate the idea of introducing a GUC for this.  This should be
> user-transparent.  Or if it isn't user-transparent, a GUC is still
> the wrong thing; some kind of relation option would be more sensible.

Per other email, I'm not sure what you entirely mean here: should 8B
values be the default with existing TOAST OID values kept as they are
across upgrades?  Or something else?

> 0009:
>
> I do not love this either.  I think the right thing is just to widen
> the existing nextOid counter to 64 bits, and increment it in-memory as
> 64 bits, and then return either all 64 bits when a 64-bit Oid is asked
> for, or just the low-order 32 bits when a 32-bit OID is asked for
> (with appropriate hacking for 32-bit wraparound).  Having a separate
> counter will make it too easy for the same value to be produced by
> nearby calls to the 32-bit and 64-bit Oid generators, which is likely
> a bad thing, if only because of potential confusion.

Acknowledged on other message, fine my be.

> 0011:
>
> Not reviewing this yet, because I disagree with the basic design.
> I didn't look at the later patches either.

Without an agreement about the design choices related to the first
patches up to 0006, I doubt there this is any need to review any of
the follow-up patches yet because the choices of the first patches
influence the next patches in the series.  Thanks for the feedback!
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Fri, Aug 08, 2025 at 09:09:20AM -0700, Nikhil Kumar Veldanda wrote:
> I have a question regarding TOAST pointer handling.
>
> As I understand, in the current design, each attribute in a HeapTuple
> can have its own TOAST pointer, and TOAST pointers are possible only
> for top-level attributes.
>
> Would it make sense to maintain an array for ttc_toast_pointer_size in
> ToastTupleContext, allowing us to estimate the size per attribute
> based on compression or other criteria?
>
> This approach could make the logic more generic in my opinion, but it
> would require changes in toast_tuple_find_biggest_attribute and other
> places.
>
> I’d like to hear your thoughts on this.

Yes, that's some complexity that you would need if plugging in more
vartags if these are related to compression methods, and also
something that we may need if we use multiple vartags depending on a
value ID assigned (aka for the TOAST table with 8-byte values, short
Datum if we have a value less than 4 billion with one vartag, longer
Datum if value more than 4 billion with a second vartag).

For the 4-byte vs 8-byte value case, I was wondering if we should be
simpler and less optimistic and assume that we are only going to use
the wider one depending on the type of chunk_id in the TOAST table, as
a minimum threshold when checking if a tuple should be toasted or not.
Perhaps my vision of things is too simple, but I cannot think about a
good reason that would justify making this code more complicated than
it already is.
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Nikita Malakhov
Дата:
Hi!

Michael, I'm looking into your patch set for Sequence AMs.
I'm not very happy with single global OID counter for TOAST
tables, so I've decided to investigate your work on sequences
and try to adapt it to TOAST tables. Would report my progress.

Some time ago I've proposed individual TOAST counter
to get rid of table lookups for unused ID, but later decided
that neither reloptions nor pg_class are not a good place
to store it due to lots of related catalog updates including
locks, so sequences look much more useful for this,
as you wrote above.

Personally I'm not too happy with
toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
array and for me it seems that lookups using VARTAG
should be straightened out with more error-proof, currently
using wrong vartags would result in core dumps.

On Sat, Aug 9, 2025 at 10:40 AM Michael Paquier <michael@paquier.xyz> wrote:

For the 4-byte vs 8-byte value case, I was wondering if we should be
simpler and less optimistic and assume that we are only going to use
the wider one depending on the type of chunk_id in the TOAST table, as
a minimum threshold when checking if a tuple should be toasted or not.
Perhaps my vision of things is too simple, but I cannot think about a
good reason that would justify making this code more complicated than
it already is.
--
Michael


--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Jim Nasby
Дата:
On Fri, Aug 8, 2025 at 4:03 PM Andres Freund <andres@anarazel.de> wrote:
I'm not convinced that the global counter, be it a 32 or a 64 bit, approach
has merit in general, and I'm rather sure it's the wrong thing for toast
values. There's no straightforward path to move away from the global counter
for plain oids, but I would suggest simply not using the global oid counter
for toast IDs.

A large portion of the cases I've seen where toast ID assignments were a
problem were when the global OID counter wrapped around due to activity on
*other* tables (and/or temporary table creation). If you instead had a
per-toast-table sequence for assigning chunk IDs, that problem would largely
vanish.

That's been my experience as well. I was actually toying with the idea of simply switching from OIDs to per-table counters when I came across this, specifically to address the problem of OID wraparound induced performance problems. 

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Wed, Aug 13, 2025 at 03:06:16PM -0500, Jim Nasby wrote:
> On Fri, Aug 8, 2025 at 4:03 PM Andres Freund <andres@anarazel.de> wrote:
>> I'm not convinced that the global counter, be it a 32 or a 64 bit, approach
>> has merit in general, and I'm rather sure it's the wrong thing for toast
>> values. There's no straightforward path to move away from the global
>> counter
>> for plain oids, but I would suggest simply not using the global oid counter
>> for toast IDs.
>>
>> A large portion of the cases I've seen where toast ID assignments were a
>> problem were when the global OID counter wrapped around due to activity on
>> *other* tables (and/or temporary table creation). If you instead had a
>> per-toast-table sequence for assigning chunk IDs, that problem would
>> largely
>> vanish.
>
> That's been my experience as well. I was actually toying with the idea of
> simply switching from OIDs to per-table counters when I came across this,
> specifically to address the problem of OID wraparound induced performance
> problems.

Yep, that exists.  My original use case is not this one, where I have
a class of users able to reach the OID limit on a single table even
within the 32TB limit, meaning that TOAST blobs are large enough to
reach the external threshold, still small enough to have 4 billions of
them within the single-table limit.

Implementation-wise, switching to 8B would solve both things, and
it's so much cheaper to grab a value from a single source.  I don't
really object to using 4B local values, but that does not really solve
the original issue that's causing the efforts I've initiated on this
thread.  But yes, I'm also trying to implement things so as such an
addition like this one would be slightly easier.  It just does not
have to be me doing it ;)
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Sat, Aug 09, 2025 at 04:31:05PM +0900, Michael Paquier wrote:
> Without an agreement about the design choices related to the first
> patches up to 0006, I doubt there this is any need to review any of
> the follow-up patches yet because the choices of the first patches
> influence the next patches in the series.  Thanks for the feedback!

So, please find attached a rebased version set that addresses all the
feedback that has been provided, hopefully:
- 0001 is the requested data type for 64b OIDs, which is called here
oid8 for the data type and Oid8 for the internal name.  I'm not
completely wedded to these names, but anyway.  This comes with a
package that make this data type useful for other purposes than this
patch set, basically everything where I've wanted toys for TOAST:
hash/btree operators, min/max, a couple of casts (oid -> oid8,
integers, etc).  The rest of the patch set builds upon that.
- 0002 changes a couple of APIs related to TOAST to use Oid8 (in
previous iterations this was uint64).  In the last version, the patch
was mixing parts related to max_chunk_size, that should be split
cleanly now.
- 0003 is a preparatory change, reducing the footprint of
TOAST_MAX_CHUNK_SIZE, because we'd want to have something that changes
depending on the vartag we're dealing with.
- 0004 renames things around vartag_external to vartag_external_oid.
Based on the previous feedback, this is more aggressive now with the
renames, renaming a few more things like MAX_CHUNK_SIZE,
TOAST_POINTER_SIZE, etc.
- 0005 is the refactoring piece, that introduces toast_external.c.  I
have included the previous feedback, cleaning up the code, adding more
documentation, adding a failure fallback if the vartag given in input
is incorrect, to serve as a corruption defense.  And a few more
things.
- 0006 is a follow-up cleanup of varatt.h.  It would make more sense
to merge that with 0005, but I've decided to keep that separate as it
makes the review slightly cleaner.
- 0007 is related to the previous feedback, and a follow-up cleanup
that could be merged with one of the previous steps.  It changes the
remaining pieces of VARATT_EXTERNAL_GET_POINTER to be related to
indirect TOAST pointers, as it's only used there.
- 0008 is a previous piece, switching pg_column_toast_chunk_id() to
use oid8 as result instead.
- 0009 adds the catcache bits for OID8OID, required for toast values
lookups and deletions, in the upcoming patches.  Same as previous.
- 0010 is a new piece, based on the previous feedback.  This is an
independent piece that adds an extra step in binary upgrades to be
able to pass down the attribute type of chunk_id.  Without oid8
support for the values, this does not bring much of course, but it's
less code churn in the follow-up patches.
- 0011 adds a new piece, based on the previous feedback, where
the existing nextOid is enlarged to 8 bytes, keeping compatibility for
the existing 4-byte OIDs where we don't want values within the [0,
FirstNormalObjectId] range.  The next patches rely on the new API to
get 8-byte values.
- 0012 is a new piece requested: reloption able to define the
attribute type of chunk_id when a TOAST relation is initially created
for a table, replacing the previous GUC approach which is now gone.
- 0013 adds support for oid8 in TOAST relations, extending the new
reloption and the binary upgrade paths previously introduced.
- 0014 adds some tests with toast OID8 case, leaving some relations
around for pg_upgrade, covering the binary upgrade path for oid8.
- 0015 is the last patch, adding a new vartag for OID8.  It would make
most sense to merge that with 0013, perhaps, the split is here to ease
reviews.

I have dropped the amcheck test patch for now, which was fun but it's
not really necessary for the "basics".  I have done also more tests,
playing for example with pg_resetwal, installcheck and pg_upgrade
scenarios.  I am wondering if it would be worth doing a pg_resetwal in
the node doing an installcheck on the instance to be upgraded, bumping
its next OID to be much larger than 4 billion, actually..
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Thu, Aug 14, 2025 at 02:49:06PM +0900, Michael Paquier wrote:
> I have dropped the amcheck test patch for now, which was fun but it's
> not really necessary for the "basics".  I have done also more tests,
> playing for example with pg_resetwal, installcheck and pg_upgrade
> scenarios.  I am wondering if it would be worth doing a pg_resetwal in
> the node doing an installcheck on the instance to be upgraded, bumping
> its next OID to be much larger than 4 billion, actually..

Four patches had conflicts with 748caa9dcb68, so rebased as v6.
--
Michael

Вложения

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

От
Michael Paquier
Дата:
On Tue, Sep 16, 2025 at 02:14:43PM +0900, Michael Paquier wrote:
> On Thu, Aug 14, 2025 at 02:49:06PM +0900, Michael Paquier wrote:
>> I have dropped the amcheck test patch for now, which was fun but it's
>> not really necessary for the "basics".  I have done also more tests,
>> playing for example with pg_resetwal, installcheck and pg_upgrade
>> scenarios.  I am wondering if it would be worth doing a pg_resetwal in
>> the node doing an installcheck on the instance to be upgraded, bumping
>> its next OID to be much larger than 4 billion, actually..
>
> Four patches had conflicts with 748caa9dcb68, so rebased as v6.

There were a few conflicts, so here is a rebased v7, moving the patch
to the next CF.  I have been sitting on this patch for six weeks for
the moment.

Tom, you are registered as a reviewer of the patch.  The point of
contention of the patch, where I see there is no consensus yet, is if
my approach of using a redirection for the external TOAST pointers
with a new layer to facilitate the addition of more vartags (aka the
64b value vartag proposed here, concept that could also apply to
compression methods later on) is acceptable.  Moving to a different
approach, like the "brutal" one I am naming upthread where the
redirection layer is replaced by changes in all the code paths that
need to be touched, would be of course cheaper at runtime as there
would be no more redirection, but the maintenance would be a nightmare
the more vartags we add, and I have some plans for more of these.
Doing the switch would be a few hours work, so that would not be a big
deal, I guess.  The important part is an agreement about the approach,
IMO.

Please note that the latest patch set also uses a new reloption to
control if 8 byte TOAST values are set (not a GUC), binary upgrades
are handled with binary_upgrade.h functions, and the 8-byte OID value
uses a dedicated data type, as reviewed upthread.

One thing that shows up in the last patch of the set is mentioned in
an XXX comment in toast_external_assign_vartag(), if it would be
better to use a 4-byte vartag even if we have a 8-byte value in the
TOAST table to make the TOAST pointers shorter when a value is less
than 4 billions.  Not sure how much to do about this one, and there's
little point in doing this change without the earlier infrastructure
patches if the approach taken is thought as OK, as well.
--
Michael

Вложения