Обсуждение: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)
Hi all, I have been looking at $subject and the many past reviews recently, also related to some of the work related to the potential support for zstandard compression in TOAST values, and found myself pondering about the following message from Tom, to be reminded that nothing has been done regarding the fact that the backend may finish in an infinite loop once a TOAST table reaches 4 billion values: https://www.postgresql.org/message-id/764273.1669674269%40sss.pgh.pa.us Spoiler: I have heard of users that are in this case, and the best thing we can do currently except raising shoulders is to use workarounds with data externalization AFAIK, which is not nice, usually, and users notice the problem once they see some backends stuck in the infinite loop. I have spent some time looking at the problem, and looked at all the proposals in this area like these ones (I hope so at least): https://commitfest.postgresql.org/patch/4296/ https://www.postgresql.org/message-id/flat/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru Anyway, it seems like nothing goes in a direction that I think would be suited to fix the two following problems (some of the proposed patches broke backward-compatibility, as well, and that's critical): - The limit of TOAST values to 4 billions, because external TOAST pointers want OIDs. - New compression methods, see the recent proposal about zstandard at [1]. ToastCompressionId is currently limited to 4 values because the extinfo field of varatt_external has only this much data remaining. Spoiler: I want to propose a new varatt_external dedicated to zstandard-compressed external pointers, but that's not for this thread. Please find attached a patch set I have finished with while poking at the problem, to address points 1) and 2) in the first email mentioned at the top of this message. It is not yet ready for prime day yet (there are a couple of things that would need adjustments), but I have reached the point where I am going to need a consensus about what people would be OK to have in terms of design to be able to support multiple types of varatt_external to address these issues. And I'm OK to consume time on that for the v19 cycle. While hacking at (playing with?) the whole toasting and detoasting code to understand the blast radius that this would involve, I have quickly found that it is very annoying to have to touch at many places of varatt.h to make variants of the existing varatt_external structure (what we store on-disk as varlena Datum for external TOAST pointers). Spoiler: it's my first time touching the internals of this code area so deeply. Most of the code churns happen because we need to make the [de]toast code aware of what to do depending on the vartags of the external varlenas. It would be simple to hardcode a bunch of new VARATT_IS_EXTERNAL_ONDISK() variants to plug in the new structures. While it is efficient, this has a cost for out-of-core code and in core because all the places that touch external TOAST pointers need to be adjusted. Point is: it can be done. But if we introduce more types of external TOAST pointers we need to always patch all these areas, and there's a cost in that each time one or more new vartags are added. So, I have invented a new interface aimed at manipulating on-disk external TOAST pointers, called toast_external, that is an on-memory structure that services as an interface between on-disk external TOAST pointers and what the backend wants to look at when retrieving chunks of data from the TOAST relations. That's the main proposal of this patch set, with a structure looking like that: typedef struct toast_external_data { /* Original data size (includes header) */ int32 rawsize; /* External saved size (without header) */ uint32 extsize; /* compression method */ ToastCompressionId compression_method; /* Relation OID of TOAST table containing the value */ Oid toastrelid; /* * Unique ID of value within TOAST table. This could be an OID or an * int8 value. This field is large enough to be able to store any of * them. */ uint64 value; } toast_external_data; This is a bit similar to what the code does for R/W and R/O vartags, only applying to the on-disk external pointers. Then, the [de]toast code and extension code is updated so as varlenas are changed into this structure if we need to retrieve some of its data, and these areas of the code do not need to know about the details of the external TOAST pointers. When saving an external set of chunks, this structure is filled with information depending on what toast_save_datum() deals with, be it a short varlena, a non-compressed external value, or a compressed external value, then builds a varlena with the vartag we want. External TOAST pointers have three properties that are hardcoded in the tree, bringing some challenges of their own: - The maximum size of a chunk, TOAST_MAX_CHUNK_SIZE, tweaked at close to 2k to make 4 chunks fit on a page. This depends on the size of the external pointer. This one was actually easy to refactor. - The varlena header size, based on VARTAG_SIZE(), which is kind of tricky to refactor out in the new toast_external.c, but that seems OK even if this knowledge stays in varatt.h. - The toast pointer size, aka TOAST_POINTER_SIZE. This one is actually very interesting (tricky): we use it in one place, toast_tuple_find_biggest_attribute(), as a lower bound to decide if an attribute should be toastable or not. I've refactored the code to use a "best" guess depending on the value type in the TOAST relation, but that's not 100% waterproof. That needs more thoughts. Anyway, the patch set is able to demonstrate how much needs to be done in the tree to support multiple chunk_id types, and the CI is happy with the attached. Some of the things done: - Introduction of a user-settable GUC called default_toast_type, that can be switched between two modes "oid" and "int8", to force the creation of a TOAST relation using one type or the other. - Dump, restore and upgrade support are integrated, relying on a GUC makes the logic a breeze. - 64b values are retrieved from a single counter in the control file, named a "TOAST counter", which has the same reliability and properties as an OID, with checkpoint support, WAL records, etc. - Rewrites are soft, so I have kicked the can down the toast on this point to not make the proposal more complicated than it should be: a VACUUM FULL retains the same TOAST value type as the original. We could extend rewrites so as the type of TOAST value is changed. It is possible to setup a new cluster with default_toast_type = int8 set after an upgrade, with the existing tables still using the OID mode. This relates to the recent proposal with a concurrent VACUUM FULL (REPACK discussion). The patch set keeps the existing vartag_external with OID values for backward-compatibility, and adds a second vartag_external that can store 8-byte values. This model is the simplest one, and optimizations are possible, where the Datum TOAST pointer could be shorter depending on the ID type (OID or int8), the compression method and the actual value to divide in chunks. For example, if you know that a chunk of data to save has a value less than UINT32_MAX, we could store 4 bytes worth of data instead of 8. This design has the advantage to allow plugging in new TOAST external structures easily. Now I've not spent extra time in this tuning, because there's no point in spending more time without an actual agreement about three things, and *that's what I'm looking for* as feedback for this upcoming CF: - The structures of the external TOAST pointers. Variable-sized pointers could be one possibility, across multiple vartags. Ideas are welcome. - How many vartag_external types we want. - If people are actually OK with this translation layer or not, and I don't disagree that there may be some paths hot enough where the translation between the on-disk varlenas and this on-memory toast_external_data hurts. Again, it is possible to hardcode more types of vartags in the tree, or just bypass the translation in the paths that are too hot. That's doable still brutal, but if that's the final consensus reached I'm OK with that as well. (See for example the changes in amcheck to see how simpler things get.) The patch set has been divided into multiple pieces to ease its review. Again, I'm not completely happy with everything in it, but it's a start. Each patch has its own commit message, so feel free to refer to them for more details: - 0001 introduces the GUC default_toast_type. It is just defined, not used in the tree at this stage. - 0002 adds support for catcache lookups for int8 values, required to allow TOAST values with int8 and its indexes. Potentially useful for extensions. - 0003 introduces the "TOAST counter", 8 bytes in the control file to allocate values for the int8 chunk_id. That's cheap, reliable. - 0004 is a mechanical change, that enlarges a couple of TOAST interfaces to use values of uint64 instead of OID. - 0005, again a mechanical change, reducing a bit the footprint of TOAST_MAX_CHUNK_SIZE because OID and int8 values need different values. - 0006 tweaks pg_column_toast_chunk_id() to use int8 as return type. Then comes the "main" patches: - 0007 adds support for int8 chunk_id in TOAST tables. This is mostly a mechanical change. If applying the patches up to this point, external Datums are applied to both OID and int8 values. Note that there is one tweak I'm unhappy with: the toast counter generation would need to be smarter to avoid concurrent values because we don't cross-check the TOAST index for existing values. (Sorry, got slightly lazy here). - 0008 adds tests for external compressed and uncompressed TOAST values for int8 TOAST types. - 0009 adds support for dump, restore, upgrades of the TOAST table types. - 0010 is the main dish: refactoring of the TOAST code to use toast_external_data, with OID vartags as the only type defined. - 0011 adds a second vartag_external: the one with int8 values stored in the external TOAST pointer. - 0012 is a bonus for amcheck: what needs to be done in its TAP tests to allow the corruption cases to work when supporting a new vartag. That was a long message. Thank you for reading if you have reached this point. Regards, [1]: https://www.postgresql.org/message-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-%2BFFwAAPo7JHx4aShCvunQ%40mail.gmail.com -- Michael
Вложения
- v1-0001-Add-GUC-default_toast_type.patch
- v1-0002-Add-catcache-support-for-INT8OID.patch
- v1-0003-Introduce-global-64-bit-TOAST-ID-counter-in-contr.patch
- v1-0004-Refactor-some-TOAST-value-ID-code-to-use-uint64-i.patch
- v1-0005-Minimize-footprint-of-TOAST_MAX_CHUNK_SIZE-in-hea.patch
- v1-0006-Switch-pg_column_toast_chunk_id-return-value-from.patch
- v1-0007-Add-support-for-bigint-TOAST-values.patch
- v1-0008-Add-tests-for-TOAST-relations-with-bigint-as-valu.patch
- v1-0009-Add-support-for-TOAST-table-types-in-pg_dump-and-.patch
- v1-0010-Refactor-external-TOAST-pointer-code-for-better-p.patch
- v1-0011-Add-new-vartag_external-for-8-byte-TOAST-values.patch
- v1-0012-amcheck-Add-test-cases-for-8-byte-TOAST-values.patch
- signature.asc
Hi Michael I'll take a look at the patch set. While digging around in the TOAST code did you have any ideas on how one could extract the TOAST APIs in a way that they can be added in Table Access Method definition ? Not all TAMs need TOAST, but the ones that do could also be the ones that still like to do something different when materializing toasted values. And TOAST is actually a nice abstraction which could be used as basis for both offloading more columns into separate forks and files as well as implementing some kinds of vectored, columnar and compressed storages. ---- Hannu On Thu, Jun 19, 2025 at 7:59 AM Michael Paquier <michael@paquier.xyz> wrote: > > Hi all, > > I have been looking at $subject and the many past reviews recently, > also related to some of the work related to the potential support for > zstandard compression in TOAST values, and found myself pondering > about the following message from Tom, to be reminded that nothing has > been done regarding the fact that the backend may finish in an > infinite loop once a TOAST table reaches 4 billion values: > https://www.postgresql.org/message-id/764273.1669674269%40sss.pgh.pa.us > > Spoiler: I have heard of users that are in this case, and the best > thing we can do currently except raising shoulders is to use > workarounds with data externalization AFAIK, which is not nice, > usually, and users notice the problem once they see some backends > stuck in the infinite loop. I have spent some time looking at the > problem, and looked at all the proposals in this area like these ones > (I hope so at least): > https://commitfest.postgresql.org/patch/4296/ > https://www.postgresql.org/message-id/flat/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru > > Anyway, it seems like nothing goes in a direction that I think would > be suited to fix the two following problems (some of the proposed > patches broke backward-compatibility, as well, and that's critical): > - The limit of TOAST values to 4 billions, because external TOAST > pointers want OIDs. > - New compression methods, see the recent proposal about zstandard at > [1]. ToastCompressionId is currently limited to 4 values because the > extinfo field of varatt_external has only this much data remaining. > Spoiler: I want to propose a new varatt_external dedicated to > zstandard-compressed external pointers, but that's not for this > thread. > > Please find attached a patch set I have finished with while poking at > the problem, to address points 1) and 2) in the first email mentioned > at the top of this message. It is not yet ready for prime day yet > (there are a couple of things that would need adjustments), but I have > reached the point where I am going to need a consensus about what > people would be OK to have in terms of design to be able to support > multiple types of varatt_external to address these issues. And I'm OK > to consume time on that for the v19 cycle. > > While hacking at (playing with?) the whole toasting and detoasting > code to understand the blast radius that this would involve, I have > quickly found that it is very annoying to have to touch at many places > of varatt.h to make variants of the existing varatt_external structure > (what we store on-disk as varlena Datum for external TOAST pointers). > Spoiler: it's my first time touching the internals of this code area > so deeply. Most of the code churns happen because we need to make the > [de]toast code aware of what to do depending on the vartags of the > external varlenas. It would be simple to hardcode a bunch of new > VARATT_IS_EXTERNAL_ONDISK() variants to plug in the new structures. > While it is efficient, this has a cost for out-of-core code and in > core because all the places that touch external TOAST pointers need to > be adjusted. Point is: it can be done. But if we introduce more > types of external TOAST pointers we need to always patch all these > areas, and there's a cost in that each time one or more new vartags > are added. > > So, I have invented a new interface aimed at manipulating on-disk > external TOAST pointers, called toast_external, that is an on-memory > structure that services as an interface between on-disk external TOAST > pointers and what the backend wants to look at when retrieving chunks > of data from the TOAST relations. That's the main proposal of this > patch set, with a structure looking like that: > typedef struct toast_external_data > { > /* Original data size (includes header) */ > int32 rawsize; > /* External saved size (without header) */ > uint32 extsize; > /* compression method */ > ToastCompressionId compression_method; > /* Relation OID of TOAST table containing the value */ > Oid toastrelid; > /* > * Unique ID of value within TOAST table. This could be an OID or an > * int8 value. This field is large enough to be able to store any of > * them. > */ > uint64 value; > } toast_external_data; > > This is a bit similar to what the code does for R/W and R/O vartags, > only applying to the on-disk external pointers. Then, the [de]toast > code and extension code is updated so as varlenas are changed into > this structure if we need to retrieve some of its data, and these > areas of the code do not need to know about the details of the > external TOAST pointers. When saving an external set of chunks, this > structure is filled with information depending on what > toast_save_datum() deals with, be it a short varlena, a non-compressed > external value, or a compressed external value, then builds a varlena > with the vartag we want. > > External TOAST pointers have three properties that are hardcoded in > the tree, bringing some challenges of their own: > - The maximum size of a chunk, TOAST_MAX_CHUNK_SIZE, tweaked at close > to 2k to make 4 chunks fit on a page. This depends on the size of the > external pointer. This one was actually easy to refactor. > - The varlena header size, based on VARTAG_SIZE(), which is kind of > tricky to refactor out in the new toast_external.c, but that seems OK > even if this knowledge stays in varatt.h. > - The toast pointer size, aka TOAST_POINTER_SIZE. This one is > actually very interesting (tricky): we use it in one place, > toast_tuple_find_biggest_attribute(), as a lower bound to decide if an > attribute should be toastable or not. I've refactored the code to use > a "best" guess depending on the value type in the TOAST relation, but > that's not 100% waterproof. That needs more thoughts. > > Anyway, the patch set is able to demonstrate how much needs to be done > in the tree to support multiple chunk_id types, and the CI is happy > with the attached. Some of the things done: > - Introduction of a user-settable GUC called default_toast_type, that > can be switched between two modes "oid" and "int8", to force the > creation of a TOAST relation using one type or the other. > - Dump, restore and upgrade support are integrated, relying on a GUC > makes the logic a breeze. > - 64b values are retrieved from a single counter in the control file, > named a "TOAST counter", which has the same reliability and properties > as an OID, with checkpoint support, WAL records, etc. > - Rewrites are soft, so I have kicked the can down the toast on this > point to not make the proposal more complicated than it should be: a > VACUUM FULL retains the same TOAST value type as the original. We > could extend rewrites so as the type of TOAST value is changed. It is > possible to setup a new cluster with default_toast_type = int8 set > after an upgrade, with the existing tables still using the OID mode. > This relates to the recent proposal with a concurrent VACUUM FULL > (REPACK discussion). > > The patch set keeps the existing vartag_external with OID values for > backward-compatibility, and adds a second vartag_external that can > store 8-byte values. This model is the simplest one, and > optimizations are possible, where the Datum TOAST pointer could be > shorter depending on the ID type (OID or int8), the compression method > and the actual value to divide in chunks. For example, if you know > that a chunk of data to save has a value less than UINT32_MAX, we > could store 4 bytes worth of data instead of 8. This design has the > advantage to allow plugging in new TOAST external structures easily. > Now I've not spent extra time in this tuning, because there's no point > in spending more time without an actual agreement about three things, > and *that's what I'm looking for* as feedback for this upcoming CF: > - The structures of the external TOAST pointers. Variable-sized > pointers could be one possibility, across multiple vartags. Ideas are > welcome. > - How many vartag_external types we want. > - If people are actually OK with this translation layer or not, and I > don't disagree that there may be some paths hot enough where the > translation between the on-disk varlenas and this on-memory > toast_external_data hurts. Again, it is possible to hardcode more > types of vartags in the tree, or just bypass the translation in the > paths that are too hot. That's doable still brutal, but if that's the > final consensus reached I'm OK with that as well. (See for example > the changes in amcheck to see how simpler things get.) > > The patch set has been divided into multiple pieces to ease its > review. Again, I'm not completely happy with everything in it, but > it's a start. Each patch has its own commit message, so feel free to > refer to them for more details: > - 0001 introduces the GUC default_toast_type. It is just defined, not > used in the tree at this stage. > - 0002 adds support for catcache lookups for int8 values, required to > allow TOAST values with int8 and its indexes. Potentially useful for > extensions. > - 0003 introduces the "TOAST counter", 8 bytes in the control file to > allocate values for the int8 chunk_id. That's cheap, reliable. > - 0004 is a mechanical change, that enlarges a couple of TOAST > interfaces to use values of uint64 instead of OID. > - 0005, again a mechanical change, reducing a bit the footprint of > TOAST_MAX_CHUNK_SIZE because OID and int8 values need different > values. > - 0006 tweaks pg_column_toast_chunk_id() to use int8 as return type. > > Then comes the "main" patches: > - 0007 adds support for int8 chunk_id in TOAST tables. This is mostly > a mechanical change. If applying the patches up to this point, > external Datums are applied to both OID and int8 values. Note that > there is one tweak I'm unhappy with: the toast counter generation > would need to be smarter to avoid concurrent values because we don't > cross-check the TOAST index for existing values. (Sorry, got slightly > lazy here). > - 0008 adds tests for external compressed and uncompressed TOAST > values for int8 TOAST types. > - 0009 adds support for dump, restore, upgrades of the TOAST table > types. > - 0010 is the main dish: refactoring of the TOAST code to use > toast_external_data, with OID vartags as the only type defined. > - 0011 adds a second vartag_external: the one with int8 values stored > in the external TOAST pointer. > - 0012 is a bonus for amcheck: what needs to be done in its TAP tests > to allow the corruption cases to work when supporting a new vartag. > > That was a long message. Thank you for reading if you have reached > this point. > > Regards, > > [1]: https://www.postgresql.org/message-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-%2BFFwAAPo7JHx4aShCvunQ%40mail.gmail.com > -- > Michael
Hi!
Hannu, we'd already made an attempt to extract the TOAST functionality as API
and make it extensible and usable by other AMs in [1], the patch set was met calmly
but we still have some hopes on it.
Michael, glad you continue this work! Took patch set for review.
[1] Pluggable TOAST
On Fri, Jul 4, 2025 at 2:03 PM Hannu Krosing <hannuk@google.com> wrote:
Hi Michael
I'll take a look at the patch set.
While digging around in the TOAST code did you have any ideas on how
one could extract the TOAST APIs in a way that they can be added in
Table Access Method definition ?
Not all TAMs need TOAST, but the ones that do could also be the ones
that still like to do something different when materializing toasted
values.
And TOAST is actually a nice abstraction which could be used as basis
for both offloading more columns into separate forks and files as well
as implementing some kinds of vectored, columnar and compressed
storages.
----
Hannu
Hi!
I'm reviewing code at toast_64bit_v2.
Michael, isn't there a typo?
static const toast_external_info toast_external_infos[TOAST_EXTERNAL_INFO_SIZE] = {
[VARTAG_ONDISK_INT8] = {
.toast_pointer_size = TOAST_POINTER_INT8_SIZE,
.maximum_chunk_size = TOAST_MAX_CHUNK_SIZE_INT8,
.to_external_data = ondisk_int8_to_external_data,
.create_external_data = ondisk_int8_create_external_data,
},
[VARTAG_ONDISK_OID] = {
.toast_pointer_size = TOAST_POINTER_INT8_SIZE, <--- here
.maximum_chunk_size = TOAST_MAX_CHUNK_SIZE_OID,
.to_external_data = ondisk_oid_to_external_data,
.create_external_data = ondisk_oid_create_external_data,
},
};
[VARTAG_ONDISK_INT8] = {
.toast_pointer_size = TOAST_POINTER_INT8_SIZE,
.maximum_chunk_size = TOAST_MAX_CHUNK_SIZE_INT8,
.to_external_data = ondisk_int8_to_external_data,
.create_external_data = ondisk_int8_create_external_data,
},
[VARTAG_ONDISK_OID] = {
.toast_pointer_size = TOAST_POINTER_INT8_SIZE, <--- here
.maximum_chunk_size = TOAST_MAX_CHUNK_SIZE_OID,
.to_external_data = ondisk_oid_to_external_data,
.create_external_data = ondisk_oid_create_external_data,
},
};
Shouldn't TOAST_POINTER_INT8_SIZE be replaced with TOAST_POINTER_OID_SIZE?
Regards,
--
On Tue, Jul 08, 2025 at 08:38:41AM +0900, Michael Paquier wrote: > Please note that I still need to look at perf profiles and some flame > graphs with the refactoring done in 0003 with the worst case I've > mentioned upthread with detoasting and values stored uncompressed in > the TOAST relation. So, the worst case I could think of for the slice detoast path is something like that: create table toasttest_bytea (f1 bytea); alter table toasttest_bytea alter column f1 set storage external; insert into toasttest_bytea values(decode(repeat('1234567890',10000),'escape')); And then use something like the following query that retrieves a small substring many times, to force a maximum of detoast_attr_slice() to happen, checking the effect of toast_external_info_get_data(): select length(string_agg(substr(f1, 2, 3), '')) from toasttest_bytea, lateral generate_series(1,1000000) as a (id); I have taken this query, kept running that with a \watch, and took samples of 10s perf records, finishing with the attached graphs (runtime does not show any difference): - detoast_master.svg, for the graph on HEAD. - detoast_patch.svg with the patch set up to 0003 and the external TOAST pointer refactoring, where detoast_attr_slice() shows up. - master_patch_diff.svg as the difference between both, with difffolded.pl from [1]. I don't see a difference in the times spent in these stacks, as we are spending most of the run retrieving the slices from the TOAST relation in fetch_datum_slice(). Opinions and/or comments are welcome. [1]: https://github.com/brendangregg/FlameGraph -- Michael
Вложения
> On Jul 7, 2025, at 7:38 PM, Michael Paquier <michael@paquier.xyz> wrote: > > I have also pushed this v2 on this branch, so feel free to grab it if > that makes your life easier: > https://github.com/michaelpq/postgres/tree/toast_64bit_v2 > -- > Michael Thank you for spending time digging into this and for the well structured patch set (and GitHub branch which I personallyfind helpful). This $subject is important on its own, but even more so in the broader context of the zstd/dictwork [1] and also allowing for innovation when it comes to how externalized Datum are stored. The current modelfor toasting has served the community well for years, but I think that Hannu [2] and Nikita and others have promisingideas that should be allowable without forcing core changes. I've worked a bit in this area too, I re-based thePluggble TOAST work by Nikita [3] onto 18devel earlier this year as I was looking for a way to implement a toaster fora custom type. All that aside, I think you're right to tackle this one step at a time and try not to boil too much of the ocean at once(the patch set is already large enough). With that in mind I've read once or twice over your changes and have a fewbasic comments/questions. v2-0001 Refactor some TOAST value ID code to use uint64 instead of Oid This set of changes make sense and as you say are mechanical in nature, no real comments other than I think that using uint64rather than Oid is the right call and addresses #2 on Tom's list. v2-0002 Minimize footprint of TOAST_MAX_CHUNK_SIZE in heap TOAST code I like this as well, clarifies the code and reduces repetition. v2-0003 Refactor external TOAST pointer code for better pluggability + * For now there are only two types, all defined in this file. For now this + * is the maximum value of vartag_external, which is a historical choice. This provides a bridge for compatibility, but doesn't open the door to a truly pluggable API. I'm guessing the goal is incrementalchange rather than wholesale rewrite. + * The different kinds of on-disk external TOAST pointers. divided by + * vartag_external. Extra '.' in "TOAST pointers. divided" I'm guessing. v2-0004 Introduce new callback to get fresh TOAST values v2-0005 Add catcache support for INT8OID v2-0006 Add GUC default_toast_type Easily understood, good progression of supporting changes. v2-0007 Introduce global 64-bit TOAST ID counter in control file Do you have any concern that this might become a bottleneck when there are many relations and many backends all contendingfor a new id? I'd imagine that this would show up in a flame graph, but I think your test focused on the readside detoast_attr_slice() rather than insert/update and contention on the shared counter. Would this be even worse onNUMA systems? v2-0008 Switch pg_column_toast_chunk_id() return value from oid to bigint v2-0009 Add support for bigint TOAST values v2-0010 Add tests for TOAST relations with bigint as value type v2-0011 Add support for TOAST table types in pg_dump and pg_restore v2-0012 Add new vartag_external for 8-byte TOAST values V2-0013 amcheck: Add test cases for 8-byte TOAST values I read through each of these patches, I like the break down and the attention to detail. The inclusion of good documentationat each step is helpful. Thank you. Thanks for the flame graphs examining a heavy detoast_attr_slice() workload. I agree that there is little or no differencebetween them which is nice. I think the only call out Tom made [4] that isn't addressed was the ask for localized ID selection. That may make senseat some point, especially if there is contention on GetNewToastId(). I think that case is worth a separate performancetest, something with a large number of relations and backends all performing a lot of updates generating a lotof new IDs. What do you think? As for adding even more flexibility, I see the potential to move in that direction over time with this as a good focusedincremental set of changes that address a few important issues now. Really excited by this work, thank you. -greg [1] https://commitfest.postgresql.org/patch/5702/ [2] "Yes, the idea is to put the tid pointer directly in the varlena external header and have a tid array in the toast table as an extra column. If all of the TOAST fits in the single record, this will be empty, else it will have an array of tids for all the pages for this toasted field." - Hannu Krosing in an email to me after PGConf.dev/2025 [3] https://postgr.es/m/224711f9-83b7-a307-b17f-4457ab73aa0a%40sigaev.ru [4] https://postgr.es/m/764273.1669674269%40sss.pgh.pa.us
Hi!
-- Greg, thanks for the interest in our work!
Michael, one more thing forgot to mention yesterday -
#define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
static const toast_external_info toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
VARTAG_ONDISK_OID historically has a value of 18
and here we got an array of 19 members with only 2 valid ones.
What do you think about having an individual
TOAST value id counter per relation instead of using
a common one? I think this is a very promising approach,
but a decision must be made where it should be stored.
I still think we should go with direct toast tid pointers in varlena and not some kind of oid. It will remove the need for any oid management and also will be many-many orders of magnitude faster for large tables (just 2x faster for in-memory small tables) I plan to go over Michael's patch set here and see how much change is needed to add the "direct toast" My goals are: 1. fast lookup from skipping index lookup 2. making the toast pointer in main heap as small as possible - hopefully just the 6 bytes of tid pointer - so that scans that do not need toasted values get more tuples from each page 3. adding all (optional) the extra data into toast chunk record as there we are free to add whatever is needed Currently I plan to introduces something like this for toast chunk record Column | Type | Storage -------------+---------+---------- chunk_id | oid | plain | 0 when not using toast index, 0xfffe - non-deletable, for example when used as dictionary for multiple toasted values. chunk_seq | integer | plain | if not 0 when referenced from toast pointer then the toasted data starts at toast_pages[0] (or below it in that tree), which *must* have chunk_id = 0 chunk_data | bytea | plain -- added fields toast_pages | tid[] | plain | can be chained or make up a tree offsets | int[] | plain | -- starting offsets of the toast_pages (octets or type-specific units), upper bit is used to indicate that a new compressed span starts at that offset, 2nd highest bit indicates that the page is another tree page comp_method | int | plain | -- compression methos used maybe should be enum ? dict_pages | tid[] | plain | -- pages to use as compression dictionary, up to N pages, one level This seems to be flexible enough to allow for both compressin and efficient partial updates --- Hannu On Tue, Jul 8, 2025 at 8:31 PM Nikita Malakhov <hukutoc@gmail.com> wrote: > > Hi! > > Greg, thanks for the interest in our work! > > Michael, one more thing forgot to mention yesterday - > #define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1) > static const toast_external_info toast_external_infos[TOAST_EXTERNAL_INFO_SIZE] > VARTAG_ONDISK_OID historically has a value of 18 > and here we got an array of 19 members with only 2 valid ones. > > What do you think about having an individual > TOAST value id counter per relation instead of using > a common one? I think this is a very promising approach, > but a decision must be made where it should be stored. > > -- > Regards, > Nikita Malakhov > Postgres Professional > The Russian Postgres Company > https://postgrespro.ru/
On 2025-Jul-08, Hannu Krosing wrote: > I still think we should go with direct toast tid pointers in varlena > and not some kind of oid. I think this can be made to work, as long as we stop seeing the toast table just like a normal heap table containing normal tuples. A lot to reimplement though -- vacuum in particular. Maybe it can be thought of as a new table AM. Not an easy project, I reckon. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Hi,
Hannu, we have some thoughts on direct tids storage,
it was some time ago and done by another developer,
so I have to look. I'll share it as soon as I find it, if you
are interested.
--
chunk_ On Tue, Jul 8, 2025 at 9:37 PM Álvaro Herrera <alvherre@kurilemu.de> wrote: > > On 2025-Jul-08, Hannu Krosing wrote: > > > I still think we should go with direct toast tid pointers in varlena > > and not some kind of oid. > > I think this can be made to work, as long as we stop seeing the toast > table just like a normal heap table containing normal tuples. A lot to > reimplement though -- vacuum in particular. Non-FULL vacuum should already work. Only commands like VACUUM FULL and CLUSTER which move tuples around should be disabled on TOAST tables. What other parts do you think need re-implementing in addition to skipping the index lookup part and using the tid directly ? The fact that per-page chunk_tid arrays allow also tree structures should allow us much more flexibility in implementing in-place-updatable structured storage in something otherways very similar to toast, but this is not required for just moving from oid + index ==> tid to using the tid directly. I think that having a toast table as a normal table with full MVCC is actually a good thing, as it can implement the "array element update" as a real partial update of only the affected parts and not the current 'copy everything' way of doing this. We already do collect the array element update in the parse tree in a special way, now we just need to have types that can do the partial update by changing a tid or two in the chunk_tids array (and adjust the offsets array if needed) This should make both UPDATE t SET theintarray[3] = 5, theintarray[4] = 7 WHERE id = 1; and even do partial up[dates for something like this hannuk=# select * from jtab; id | j ----+---------------------------- 1 | {"a": 3, "b": 2} 2 | {"c": 1, "d": [10, 20, 3]} (2 rows) hannuk=# update jtab SET j['d'][3] = '7' WHERE id = 2; UPDATE 1 hannuk=# select * from jtab; id | j ----+------------------------------- 1 | {"a": 3, "b": 2} 2 | {"c": 1, "d": [10, 20, 3, 7]} (2 rows) when the JSON data is so large that changed part is in it's own chunk. > Maybe it can be thought of > as a new table AM. Not an easy project, I reckon. I would prefer it to be an extension of current toast - just another varatt_* type - as then you can upgrade to new storage CONCURRENTLY, same way as you can currently switch compression methods. > -- > Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On Tue, Jul 08, 2025 at 08:54:33PM +0200, Hannu Krosing wrote: > I still think we should go with direct toast tid pointers in varlena > and not some kind of oid. > > It will remove the need for any oid management and also will be > many-many orders of magnitude faster for large tables (just 2x faster > for in-memory small tables) There is also the point of backward-compatibility. We cannot just replace things, and we have to provide a way for users to be able to rely on the system so as upgrades are painless. So we need to think about the correct application layer to use to maintain the legacy code behavior while considering improvements. > I plan to go over Michael's patch set here and see how much change is > needed to add the "direct toast" If you do not have a lot of time looking at the full patch set, I'd recommend looking at 0003, files toast_external.h and toast_external.c which include the key idea. Adding a new external TOAST pointer is then a two-step process: - Add a new vartag_external. - Add some callbacks to let the backend understand what it should do with this new vartag_external. > My goals are: > 1. fast lookup from skipping index lookup > 2. making the toast pointer in main heap as small as possible - > hopefully just the 6 bytes of tid pointer - so that scans that do not > need toasted values get more tuples from each page > 3. adding all (optional) the extra data into toast chunk record as > there we are free to add whatever is needed > Currently I plan to introduces something like this for toast chunk record Points 2. and 3. are things that the refactoring should allow. About 1., I have no idea how much you want to store in the TOAST external points and how it affects the backend, but you could surely implement an option that lets the backend know that it should still index lookups based on what the external TOAST pointer says, if this stuff has benefits. > This seems to be flexible enough to allow for both compressin and > efficient partial updates I don't really disagree with all that. Now the history of the TOAST threads point out that we've good at proposing complex things, but these had a high footprint. What I'm proposing is lighter than that, I think, tackling my core issue with the infra supporting backward compatibility and the addition of more modes on top of it. -- Michael
Вложения
On Tue, Jul 08, 2025 at 09:31:29PM +0300, Nikita Malakhov wrote: > Michael, one more thing forgot to mention yesterday - > #define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1) > static const toast_external_info > toast_external_infos[TOAST_EXTERNAL_INFO_SIZE] > VARTAG_ONDISK_OID historically has a value of 18 > and here we got an array of 19 members with only 2 valid ones. Yeah, I'm aware of that. The code is mostly to make it easier to read while dealing with this historical behavior, even if it costs a bit more in memory. I don't think that it's a big deal, and we could always have one more level of redirection to reduce its size. Now there's the extra complexity.. > What do you think about having an individual > TOAST value id counter per relation instead of using > a common one? I think this is a very promising approach, > but a decision must be made where it should be stored. I've thought about that, and decided to discard this idea for now to keep the whole proposal simpler. This has benefits if you have many relations with few OIDs consumed, but this has a cost in itself as you need to maintain the data for each TOAST relation. When I looked at the problems a couple of weeks ago, I came to the conclusion that all the checkbox properties of "local" TOAST values are filled with a sequence: WAL logging to ensure uniqueness, etc. So I was even considering the addition of some code to create sequences on-the-fly, but at the end that was just more complexity with how we define sequences currently compared to a unique 8-byte counter in the control file that's good enough for a veeery long time. I've also noticed that this sort of links to a piece I've implemented last year and is still sitting in the CF app without much interest from others: sequence AMs. You could implement a "TOAST" sequence method, for example, optimized for this purpose. As a whole, I propose to limit the scope of the proposal to the pluggability of the external TOAST pointers. The approach I've taken should allow such improvements, these can be added later if really needed. -- Michael
Вложения
On Tue, Jul 08, 2025 at 12:58:26PM -0400, Burd, Greg wrote: > All that aside, I think you're right to tackle this one step at a > time and try not to boil too much of the ocean at once (the patch > set is already large enough). With that in mind I've read once or > twice over your changes and have a few basic comments/questions. > > > v2-0001 Refactor some TOAST value ID code to use uint64 instead of Oid > > This set of changes make sense and as you say are mechanical in > nature, no real comments other than I think that using uint64 rather > than Oid is the right call and addresses #2 on Tom's list. > > > v2-0002 Minimize footprint of TOAST_MAX_CHUNK_SIZE in heap TOAST code > > I like this as well, clarifies the code and reduces repetition. Thanks. These are independent pieces if you want to link the code less to TOAST, assuming that an area of 8 bytes would be good enough for any TOAST "value" concept. TIDs were mentioned as well on a different part of the thread, ItemPointerData is 6 bytes. > v2-0003 Refactor external TOAST pointer code for better pluggability > > + * For now there are only two types, all defined in this file. For now this > + * is the maximum value of vartag_external, which is a historical choice. > > This provides a bridge for compatibility, but doesn't open the door > to a truly pluggable API. I'm guessing the goal is incremental > change rather than wholesale rewrite. Nope, it does not introduce a pluggable thing, but it does untangle the fact that one needs to change 15-ish code paths when they want to add a new type of external TOAST pointer, without showing an actual impact AFAIK when we insert a TOAST value or fetch it, as long as we know that we're dealing with an on-disk thing that requires an external lookup. > + * The different kinds of on-disk external TOAST pointers. divided by > + * vartag_external. > > Extra '.' in "TOAST pointers. divided" I'm guessing. Indeed, thanks. > v2-0007 Introduce global 64-bit TOAST ID counter in control file > > Do you have any concern that this might become a bottleneck when > there are many relations and many backends all contending for a new > id? I'd imagine that this would show up in a flame graph, but I > think your test focused on the read side detoast_attr_slice() rather > than insert/update and contention on the shared counter. Would this > be even worse on NUMA systems? That may be possible, see below. > Thanks for the flame graphs examining a heavy detoast_attr_slice() > workload. I agree that there is little or no difference between > them which is nice. Cool. Yes. I was wondering why detoast_attr_slice() does not show up in the profile on HEAD, perhaps it just got optimized away (I was under -O2 for these profiles). > I think the only call out Tom made [4] that isn't addressed was the > ask for localized ID selection. That may make sense at some point, > especially if there is contention on GetNewToastId(). I think that > case is worth a separate performance test, something with a large > number of relations and backends all performing a lot of updates > generating a lot of new IDs. What do you think? Yeah, I need to do more benchmark for the int8 part, I was holding on such evaluations because this part of the patch does not fly if we don't do the refactoring pieces first. Anyway, I cannot get excited about the extra workload that this would require in the catalogs, because we would need one TOAST sequence tracked in there, linked to the TOAST relation so it would not be free. Or we invent a new facility just for this purpose, meaning that we get far away even more from being able to resolve the original problem with the values and compression IDs. We're talking about two instructions. Well, I guess that we could optimize it more some atomics or even cache a range of values to save in ToastIdGenLock acquisitions in a single backend. I suspect that the bottleneck is going to be the insertion of the TOAST entries in toast_save_datum() anyway with the check for conflicting values, even if your relation is unlogged or running-on-scissors in memory. > [2] "Yes, the idea is to put the tid pointer directly in the varlena > external header and have a tid array in the toast table as an extra > column. If all of the TOAST fits in the single record, this will be > empty, else it will have an array of tids for all the pages for this > toasted field." - Hannu Krosing in an email to me after > PGConf.dev/2025 Sure, you could do that as well, but I suspect that we'll need the steps of at least up to 0003 to be able to handle more easily multiple external TOAST pointer types, or the code will be messier than it currently is. :D -- Michael
Вложения
Hi!
On this -
> Non-FULL vacuum should already work. Only commands like VACUUM FULL
> and CLUSTER which move tuples around should be disabled on TOAST
> tables.
> and CLUSTER which move tuples around should be disabled on TOAST
> tables.
Cool, toast tables are subject to bloating in update-heavy scenarios
and it's a big problem in production systems, it seems there is a promising
way to solve it once and for all!
Have to mention though that we encountered issues in logical replication
when we made toast values updatable.
Also researching direct tids implementation.
Cheers!
On Mon, Jul 14, 2025 at 09:01:28AM +0300, Nikita Malakhov wrote: > Cool, toast tables are subject to bloating in update-heavy scenarios > and it's a big problem in production systems, it seems there is a promising > way to solve it once and for all! > > Have to mention though that we encountered issues in logical replication > when we made toast values updatable. > > Also researching direct tids implementation. I would be curious to see if the refactoring done on this thread would be useful in the scope of what you are trying to do. I'd suggest dropping that on a different thread, though, if you finish with a patch or something worth looking at for others. -- Michael
Вложения
Hi Michael,
I'm currently debugging POC direct tids TOAST patch (on top of your branch),
will mail it in a day or two.
On Tue, Jul 15, 2025 at 3:56 AM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Jul 14, 2025 at 09:01:28AM +0300, Nikita Malakhov wrote:
> Cool, toast tables are subject to bloating in update-heavy scenarios
> and it's a big problem in production systems, it seems there is a promising
> way to solve it once and for all!
>
> Have to mention though that we encountered issues in logical replication
> when we made toast values updatable.
>
> Also researching direct tids implementation.
I would be curious to see if the refactoring done on this thread would
be useful in the scope of what you are trying to do. I'd suggest
dropping that on a different thread, though, if you finish with a
patch or something worth looking at for others.
--
Michael
On Fri, Jul 18, 2025 at 9:24 PM Nikita Malakhov <hukutoc@gmail.com> wrote: > > Hi Michael, > > I'm currently debugging POC direct tids TOAST patch (on top of your branch), > will mail it in a day or two. Great! I also just started looking at it, starting from 0003 as recommended by Michael. Will be interesting to see how similar / different our approaches will be :)
On Mon, Jul 14, 2025 at 8:15 AM Nikita Malakhov <hukutoc@gmail.com> wrote: > ... > Have to mention though that we encountered issues in logical replication > when we made toast values updatable. This seems to indicate that Logical Decoding does not honour visibility checks in TOAST. This works fine if the TOAST visibility never changes but will break if it can change independently of heap-side tuple > Also researching direct tids implementation. Cool!
On Fri, Jul 18, 2025 at 10:24:12PM +0300, Nikita Malakhov wrote: > I'm currently debugging POC direct tids TOAST patch (on top of your branch), > will mail it in a day or two. Interesting. Of course I may be wrong because I have no idea of how you have shaped things, still I suspect that for the basics you should just need 0003, 0004, the parts with the GUC to switch the TOAST table type and the dump/restore/upgrade bits. -- Michael
Вложения
Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)
От
Nikhil Kumar Veldanda
Дата:
Hi Michael, I'm reviewing your patches(toast_64bit_v2 branch) and have prepared two incremental patches that add ZSTD compression support. While doing this I made a small refactoring in heaptoast.c (around ToastTupleContext) to make adding additional TOAST pointer formats easier. Added two new on-disk external vartags to support new compression methods: one using an Oid value identifier and one using a 64‑bit (int8) identifier for toast table types. This lets us support extended compression methods for both existing Oid‑based TOAST tables and a variant that needs a wider ID space. typedef enum vartag_external { VARTAG_INDIRECT = 1, VARTAG_EXPANDED_RO = 2, VARTAG_EXPANDED_RW = 3, VARTAG_ONDISK_INT8 = 4, VARTAG_ONDISK_CE_OID = 5, VARTAG_ONDISK_CE_INT8 = 6, VARTAG_ONDISK_OID = 18 } vartag_external; Two new ondisk TOAST pointer structs carrying an va_ecinfo field for extended compression methods: typedef struct varatt_external_ce_oid { int32 va_rawsize; /* Original data size (includes header) */ uint32 va_extinfo; /* External saved size (without header) and * VARATT_CE_FLAG in top 2 bits */ uint32 va_ecinfo; /* Extended compression info */ Oid va_valueid; /* Unique ID of value within TOAST table */ Oid va_toastrelid; /* RelID of TOAST table containing it */ } varatt_external_ce_oid; typedef struct varatt_external_ce_int8 { int32 va_rawsize; /* Original data size (includes header) */ uint32 va_extinfo; /* External saved size (without header) and * VARATT_CE_FLAG in top 2 bits */ uint32 va_ecinfo; /* Extended compression info */ /* * Unique ID of value within TOAST table, as two uint32 for alignment and * padding. */ uint32 va_valueid_lo; uint32 va_valueid_hi; Oid va_toastrelid; /* RelID of TOAST table containing it */ } varatt_external_ce_int8; The inline (varattrib_4b) format extension (discussed in [1]) is included; I made one adjustment: the compression id field is now a 4‑byte va_ecinfo field (instead of 1 byte) for structural consistency with the extended TOAST pointer formats. typedef union { struct /* Normal varlena (4-byte length) */ { uint32 va_header; char va_data[FLEXIBLE_ARRAY_MEMBER]; } va_4byte; struct /* Compressed-in-line format */ { uint32 va_header; uint32 va_tcinfo; /* Original data size (excludes header) and * compression method; see va_extinfo */ char va_data[FLEXIBLE_ARRAY_MEMBER]; /* Compressed data */ } va_compressed; struct { uint32 va_header; uint32 va_tcinfo; /* Original data size (excludes header) and * compression method or VARATT_CE_FLAG flag; * see va_extinfo */ uint32 va_ecinfo; /* Extended compression info: 32-bit field * where only the lower 8 bits are used for * compression method. Upper 24 bits are * reserved/unused. Lower 8 bits layout: Bits * 7–1: encode (cmid − 2), so cmid is * [2…129] Bit 0: flag for extra metadata */ char va_data[FLEXIBLE_ARRAY_MEMBER]; } va_compressed_ext; } varattrib_4b; [1] https://www.postgresql.org/message-id/flat/CAFAfj_HX84EK4hyRYw50AOHOcdVi-%2BFFwAAPo7JHx4aShCvunQ%40mail.gmail.com Please review it and let me know if you have any questions or feedback? Thank you! v26-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch: Design proposal covering varattrib_4b, TOAST pointer layouts, and related macro updates. v26-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain ZSTD (non dict) support and few basic tests. -- Nikhil Veldanda
Вложения
Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)
От
Nikhil Kumar Veldanda
Дата:
Hi, > v26-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch: > Design proposal covering varattrib_4b, TOAST pointer layouts, and > related macro updates. > v26-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain > ZSTD (non dict) support and few basic tests. Sending v27 patch with a small update over v26 patch. v27-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch: Design proposal covering varattrib_4b, TOAST pointer layouts, and related macro updates. v27-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain ZSTD (non dict) support and few basic tests. > -- > Nikhil Veldanda -- Nikhil Veldanda
Вложения
I have been evolving details for Direct TOAST design in https://wiki.postgresql.org/wiki/DirectTOAST The top level goals are * 8-byte TOAST pointer - just (header:1, tag:1 and TID:6) * all other info moved from toast pointer to actual toast record(s), so heap rows are smaller and faster. * all extra fields are bytea with internal encoding (maybe will create full new types for these, or maybe just introspection functions are enough) the reasons for this are - PostgresSQL arrays add 20 byte overhead - bytea gives other freedoms in encoding for minimal space usage No solution yet for va_toastrelid , but hope is - to use some kind of mapping and find one or two free bits somewhere (tid has one free), - or add a 12-byte toast pointer just for this. - or to make sure that CLUSTER and VACUUM FULL can be done without needing va_toastrelid. I assume it is there for clustering the TOAST which will be not possible separately from the main heap with direct toast tid pointers anyway. Please take a look and poke holes in it ! On Sun, Jul 20, 2025 at 10:28 AM Nikhil Kumar Veldanda <veldanda.nikhilkumar17@gmail.com> wrote: > > Hi, > > > v26-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch: > > Design proposal covering varattrib_4b, TOAST pointer layouts, and > > related macro updates. > > v26-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain > > ZSTD (non dict) support and few basic tests. > > Sending v27 patch with a small update over v26 patch. > > v27-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch: > Design proposal covering varattrib_4b, TOAST pointer layouts, and > related macro updates. > v27-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain > ZSTD (non dict) support and few basic tests. > > > -- > > Nikhil Veldanda > > -- > Nikhil Veldanda
Hi!
Michael and Hannu, here's a POC patch with direct TIDs TOAST.
The simplest implementation where we store a chain of TIDs, each
chunk stores the next TID to be fetched. Patch applies on top of
commit 998b0b51d5ea763be081804434f177082ba6772b (origin/toast_64bit_v2)
Author: Michael Paquier <michael@paquier.xyz>
Date: Thu Jun 19 13:09:11 2025 +0900
Date: Thu Jun 19 13:09:11 2025 +0900
While it is very fast on small data - I see several disadvantages:
- first of all, VACUUM should be revised to work with such tables;
- problematic batch insertion due to necessity to store TID chain.
It is just a POC implementation, so please don't blame me for
questionable decisions.
Any opinions and feedback welcome!
PS: Hannu, just seen your latest message, will check it out now.
On Mon, Jul 21, 2025 at 3:15 AM Hannu Krosing <hannuk@google.com> wrote:
I have been evolving details for Direct TOAST design in
https://wiki.postgresql.org/wiki/DirectTOAST
The top level goals are
* 8-byte TOAST pointer - just (header:1, tag:1 and TID:6)
* all other info moved from toast pointer to actual toast record(s),
so heap rows are smaller and faster.
* all extra fields are bytea with internal encoding (maybe will create
full new types for these, or maybe just introspection functions are
enough)
the reasons for this are
- PostgresSQL arrays add 20 byte overhead
- bytea gives other freedoms in encoding for minimal space usage
No solution yet for va_toastrelid , but hope is
- to use some kind of mapping and find one or two free bits somewhere
(tid has one free),
- or add a 12-byte toast pointer just for this.
- or to make sure that CLUSTER and VACUUM FULL can be done without
needing va_toastrelid. I assume it is there for clustering the TOAST
which will be not possible separately from the main heap with direct
toast tid pointers anyway.
Please take a look and poke holes in it !
Вложения
Hi!
a waste of disk space since the current Postgres state does not
allow multiple TOAST tables per relation.
But if we consider this as a viable option it could bring additional
advantages. I've successfully tried to use multiple TOAST tables,
with different variations - by type, by column and as-is just as
an extensible storage.
--
On Mon, Jul 21, 2025 at 02:20:31PM +0300, Nikita Malakhov wrote: > I agree that storing reltoastrelid in each Toast pointer seems to be > a waste of disk space since the current Postgres state does not > allow multiple TOAST tables per relation. va_toastrelid is a central part of the current system when dealing with a TOAST relation rewrite, because we need to know to which relation an on-disk TOAST pointer is part of. Or do you mean that we don't need that with what you are proposing with TIDs? Perhaps yes, sure, I've not studied this question when associated with your patch (which has a bunch of duplicated code that could be avoided AFAIK). > But if we consider this as a viable option it could bring additional > advantages. I've successfully tried to use multiple TOAST tables, > with different variations - by type, by column and as-is just as > an extensible storage. I don't think we need to be that ambitious. There is no way to say such things could have any benefits in the long-term and there's the catalog representation part. Even if we do that, I suspect that users would never really know which setup makes sense because we would want to control how things happen at a relation level and not purely automate a behavior. -- Michael
Вложения
On Mon, Jul 21, 2025 at 02:06:03PM +0300, Nikita Malakhov wrote: > While it is very fast on small data - I see several disadvantages: > - first of all, VACUUM should be revised to work with such tables; > - problematic batch insertion due to necessity to store TID chain. > > It is just a POC implementation, so please don't blame me for > questionable decisions. > > Any opinions and feedback welcome! I think that this is going to be helpful in taking some decisions with the refactoring pieces I am proposing for the external TOAST pointer layer. You have some interesting points around detoast_external_attr() and detoast_attr_slice(), as far as I can see. One point of the on-disk TOAST refactoring is that we should be able to entirely avoid this level of redirection. I get that this is a POC, of course, but it provides pointers that what I've done may not be sufficient in terms of extensibility so that seems worth digging into. The patch posted by Nikhil is also something that touches this area, that I have on my tablets: https://www.postgresql.org/message-id/CAFAfj_E-QLiUq--%2BKdyvb%2B-Gg79LLayZRcH8%2BmFPzVuDQOVaAw%40mail.gmail.com It touches a different point: we need to be smarter with CompressionToastId and not use that as the value for what we store on disk. Each vartag_external should be able to use the compression methods values it wishes, with the toast_external layer being in charge of translating that back-and-forth with the GUC value. -- Michael
Вложения
Hi Michael!
Yes, I know about relation rewrite and have already thought about how
we can avoid excessive storage of toastrelid and do not spoil rewrite,
still do not have a good enough solution.
> You have some interesting points around
> detoast_external_attr() and detoast_attr_slice(), as far as I can see.
> One point of the on-disk TOAST refactoring is that we should be able
> to entirely avoid this level of redirection. I get that this is a
> POC, of course, but it provides pointers that what I've done may not
> be sufficient in terms of extensibility so that seems worth digging
> into.
> detoast_external_attr() and detoast_attr_slice(), as far as I can see.
> One point of the on-disk TOAST refactoring is that we should be able
> to entirely avoid this level of redirection. I get that this is a
> POC, of course, but it provides pointers that what I've done may not
> be sufficient in terms of extensibility so that seems worth digging
> into.
I'm currently re-visiting our TOAST API patch set, there are some
good (in terms of simplicity and lightweightness) proposals, will mail
later.
Some more thoughts on TIDs:
TIDs could be stored as a list instead of a chain (as Hannu proposes
in his design). This allows batch operations and storage optimization
'cause TID lists are highly compressible, but significantly complicates
the code responsible for chunk processing.
Also, Toast pointer in current state must store raw size and external
size - these two are used by the executor, and we cannot get rid
of them so lightly.
Vacuuming such a table would be a pain in the ass, we have to
somehow prevent bloating tables with a high update rate.
Also, current toast mechanics is insert-only, it does not support
updates (just to remind - the whole toasted value is marked dead
and new one is inserted during update), this is a subject to change.
And logical replication, as I mentioned earlier, does not have any
means for replicating toast diffs.
On Tue, Jul 22, 2025 at 1:24 PM Nikita Malakhov <hukutoc@gmail.com> wrote: > > Hi Michael! > > Yes, I know about relation rewrite and have already thought about how > we can avoid excessive storage of toastrelid and do not spoil rewrite, > still do not have a good enough solution. The high-level idea would be to any actual rewrite -- as opposed to plain vacuum which frees empty space within the TOAST relation -- as part of the vacuum of the main relation. Another option would be to store a back-pointer to the heap tuple inside the toast tuple and use that when rewriting, though it has its own set of complexities. > > You have some interesting points around > > detoast_external_attr() and detoast_attr_slice(), as far as I can see. > > One point of the on-disk TOAST refactoring is that we should be able > > to entirely avoid this level of redirection. I get that this is a > > POC, of course, but it provides pointers that what I've done may not > > be sufficient in terms of extensibility so that seems worth digging > > into. > > I'm currently re-visiting our TOAST API patch set, there are some > good (in terms of simplicity and lightweightness) proposals, will mail > later. Sounds interesting. > Some more thoughts on TIDs: > TIDs could be stored as a list instead of a chain (as Hannu proposes > in his design). This allows batch operations and storage optimization > 'cause TID lists are highly compressible, but significantly complicates > the code responsible for chunk processing. I would not say it complicates the *code* very much, especially when you keep offsets in the toast tuples so that you can copy them into the final materialized datum in any order. And it does allow many optimisations in terms of batching, pre-fetching and even parallelism in case of huge toasted values. > Also, Toast pointer in current state must store raw size and external > size - these two are used by the executor, and we cannot get rid > of them so lightly. Are these ever used without actually using the data ? When the data _is_ also used then the cost of getting the length from the toast record with direct toast should mostly amortize over the full query. Can you point to where in the code this is done ? In long run we may want to store also the actual size in the toast record (not toast pointer) as well for types where length() != octetsize bacuse currently a simple call like length(text) has to materialize the whole thing before getting the length, whereas pg_colum_size() and octertsize() are instantaneous. > Vacuuming such a table would be a pain in the ass, we have to > somehow prevent bloating tables with a high update rate. Normal Vacuum should work fine. It is the rewrite that cis tricky. > Also, current toast mechanics is insert-only, it does not support > updates (just to remind - the whole toasted value is marked dead > and new one is inserted during update), this is a subject to change. > And logical replication, as I mentioned earlier, does not have any > means for replicating toast diffs. Which points to the need to (optionally) store the diff in the toast as well when there are defined replication slots. Once we have a way to actually do JSON(B) updates at SQL or function level. We may even want to store the JSON in some base JSON + JSON_PATCH format where we materialize at retrieval. But this goes way beyond the current patch's scope. Though my design should accommodate it nicely. --- Hannu
On Tue, Jul 22, 2025 at 02:56:23PM +0200, Hannu Krosing wrote: > The high-level idea would be to any actual rewrite -- as opposed to > plain vacuum which frees empty space within the TOAST relation -- as > part of the vacuum of the main relation. > Another option would be to store a back-pointer to the heap tuple > inside the toast tuple and use that when rewriting, though it has its > own set of complexities. Well. I would suggest to begin with simpler to begin with, finding building pieces that can be relied on, and I tend to think that I've sorted out the first basic one of these because we want to be able to give the backend the possibility to understand more formats of external on-dist TOAST pointers. A simple whole rewrite of the TOAST engine is not something that can just happen in one day: that's a very risky move, and we need to worry about backward-compatibility while maintaining the legacy code. FWIW, I think that we should still move on with the 8-byte implementation anyway with specific vartag_external that can lead to variable-sized pointers (shorter if value is less than UINT32_MAX, still stored in a int8 TOAST table), extending the code so as it's possible to think about more on top of that. That's basically risk-free if done right while taking the problem at its root. That's also what Tom and Andres meant back in 2022 before the problem drifted away to different issues, and that should allow the addition of more compression methods if done correctly (quoted at the top of this thread). >>> You have some interesting points around >>> detoast_external_attr() and detoast_attr_slice(), as far as I can see. >>> One point of the on-disk TOAST refactoring is that we should be able >>> to entirely avoid this level of redirection. I get that this is a >>> POC, of course, but it provides pointers that what I've done may not >>> be sufficient in terms of extensibility so that seems worth digging >>> into. >> >> I'm currently re-visiting our TOAST API patch set, there are some >> good (in terms of simplicity and lightweightness) proposals, will mail >> later. > > Sounds interesting. If you have refactoring proposals, that could be considered separately, yes. Now, the reason why nothing has been done about the original problem is the same reason as what's happening now on this thread: I am seeing a lot of designs and assumptions for new solutions, without knowing all the problems we are trying to solve or even if they actually make sense at this level of the baclend engine. And there is little to no argument made about backward-compatibility, which is also something I've tried to design (single GUC for dump/restore/upgrade with TOAST type based on attribute data type of the TOAST table, which could be also a tid later on). > But this goes way beyond the current patch's scope. Though my design > should accommodate it nicely. If you see independent useful pieces that are worth their own, nobody is going to object to that. The trick is to find incremental pieces good enough to be able to build upon with individual evaluations and reviews, at least that's how I want to tackle the problem because I would be the one who would be responsible for its maintenance by default. Finding out these "simple" independent relevant pieces is the hard part. -- Michael
Вложения
On Wed, Jul 09, 2025 at 08:20:31AM +0900, Michael Paquier wrote: > Sure, you could do that as well, but I suspect that we'll need the > steps of at least up to 0003 to be able to handle more easily multiple > external TOAST pointer types, or the code will be messier than it > currently is. :D Please find attached a v3, that I have spent some time polishing to fix the value ID problem of this thread. v2 had some conflicts, and the CI previously failed with warning job (CI is green here now). There are two things that have changed in this patch series: - Addition of separate patch to rename varatt_external to varatt_external_oid and VARTAG_ONDISK to VARTAG_ONDISK_OID, in 0003. - In the refactoring to introduce the translation layer for external ondisk TOAST pointers, aka 0004 in this set labelled v3, I was not happy about the way we grab the vartag_external that gets assigned to the TOAST varlena, and it depends on two factors in this patch set: -- The type of TOAST table. -- The value assigned. If, for example, we have a value less than 4 billion, we can make the TOAST pointer shorter. The interface of 0004 has been rethought to take that into consideration. That's the function called toast_external_assign_vartag(), reused on heaptoast.c to get TOAST_POINTER_SIZE as the compressibility threshold when inserting a tuple and if it should be compressed. As things stand, I am getting pretty happy with the patch set up to 0005 and how things are getting in shape for the interface, and I am planning to begin applying this stuff up to 0005 in the next couple of weeks. This stuff introduces all the callbacks and the concept of toast_external to get the refactoring into the tree first, which is something that other patches that want to extend the compression methods or the TID layer are going to need anyway. The rest of the patch set is more a group shot, where they actually matter only when the new vartag_external is added for the TOAST tables able to hold 8-byte values. As of now, we would need to add more one more vartag_external: - For values lower than 4 billion, perhaps we could just reuse the OID vartag_external here? The code is modular now with toast_external_assign_vartag(), and it is only a matter of using VARTAG_ONDISK_OID if we have a value less than 4 billion. So there is only one place in the code to change, and that's a one-line change with the v3 patch set attached. And that's less duplication. - The second one when we have more than 4 billion values. -- Michael
Вложения
- v3-0001-Refactor-some-TOAST-value-ID-code-to-use-uint64-i.patch
- v3-0002-Minimize-footprint-of-TOAST_MAX_CHUNK_SIZE-in-hea.patch
- v3-0003-varatt_external-varatt_external_oid-and-VARTAG_ON.patch
- v3-0004-Refactor-external-TOAST-pointer-code-for-better-p.patch
- v3-0005-Introduce-new-callback-to-get-fresh-TOAST-values.patch
- v3-0006-Add-catcache-support-for-INT8OID.patch
- v3-0007-Add-GUC-default_toast_type.patch
- v3-0008-Introduce-global-64-bit-TOAST-ID-counter-in-contr.patch
- v3-0009-Switch-pg_column_toast_chunk_id-return-value-from.patch
- v3-0010-Add-support-for-bigint-TOAST-values.patch
- v3-0011-Add-tests-for-TOAST-relations-with-bigint-as-valu.patch
- v3-0012-Add-support-for-TOAST-table-types-in-pg_dump-and-.patch
- v3-0013-Add-new-vartag_external-for-8-byte-TOAST-values.patch
- v3-0014-amcheck-Add-test-cases-for-8-byte-TOAST-values.patch
- signature.asc
On Fri, Aug 1, 2025 at 4:03 AM Michael Paquier <michael@paquier.xyz> wrote:
- Addition of separate patch to rename varatt_external to
varatt_external_oid and VARTAG_ONDISK to VARTAG_ONDISK_OID, in 0003.
Since you're already renaming things... ISTM "ondisk" has the potential for confusion, assuming that at some point we'll have the ability to store large datums directly in the filesystem (instead of breaking into chunks to live in a relation). VARTAG_DURABLE might be a better option.
On Mon, Aug 04, 2025 at 02:37:19PM -0500, Jim Nasby wrote: > On Fri, Aug 1, 2025 at 4:03 AM Michael Paquier <michael@paquier.xyz> wrote: >> - Addition of separate patch to rename varatt_external to >> varatt_external_oid and VARTAG_ONDISK to VARTAG_ONDISK_OID, in 0003. > > Since you're already renaming things... ISTM "ondisk" has the potential for > confusion, assuming that at some point we'll have the ability to store > large datums directly in the filesystem (instead of breaking into chunks to > live in a relation). VARTAG_DURABLE might be a better option. Hmm. I don't know about this one. Durable is an ACID property that does not apply to all relation kinds. For example, take an unlogged table: its data is not durable but its TOAST pointers would be marked with a VARTAG_DURABLE. With that in mind ONDISK still sounds kind of OK for me to use as a name. -- Michael
Вложения
On Fri, Aug 01, 2025 at 06:03:11PM +0900, Michael Paquier wrote: > Please find attached a v3, that I have spent some time polishing to > fix the value ID problem of this thread. v2 had some conflicts, and > the CI previously failed with warning job (CI is green here now). Attached is a v4, due to conflicts mainly caused by the recent changes in varatt.h done by e035863c9a04. This had an interesting side benefit when rebasing, where I have been able to isolate most of the knowledge related to the struct varatt_external (well varatt_external_oid in the patch set) into toast_external.c, at the exception of VARTAG_SIZE. That's done in a separate patch, numbered 0006. The rest of the patch set has a couple of adjustements to document better the new API expectations for toast_external.{c,h}, comment adjustments, some more beautification changes, some indentation applied, etc. > As things stand, I am getting pretty happy with the patch set up to > 0005 and how things are getting in shape for the interface, and I am > planning to begin applying this stuff up to 0005 in the next couple of > weeks. As of this patch set, this means a new target of 0006, to get the TOAST code refactored so as it is able to support more than 1 type of external on-disk pointer with the 8-byte value problem in scope. Any comments? -- Michael
Вложения
- v4-0001-Refactor-some-TOAST-value-ID-code-to-use-uint64-i.patch
- v4-0002-Minimize-footprint-of-TOAST_MAX_CHUNK_SIZE-in-hea.patch
- v4-0003-varatt_external-varatt_external_oid-and-VARTAG_ON.patch
- v4-0004-Refactor-external-TOAST-pointer-code-for-better-p.patch
- v4-0005-Move-static-inline-routines-of-varatt_external_oi.patch
- v4-0006-Introduce-new-callback-to-get-fresh-TOAST-values.patch
- v4-0007-Add-catcache-support-for-INT8OID.patch
- v4-0008-Add-GUC-default_toast_type.patch
- v4-0009-Introduce-global-64-bit-TOAST-ID-counter-in-contr.patch
- v4-0010-Switch-pg_column_toast_chunk_id-return-value-from.patch
- v4-0011-Add-support-for-bigint-TOAST-values.patch
- v4-0012-Add-tests-for-TOAST-relations-with-bigint-as-valu.patch
- v4-0013-Add-support-for-TOAST-table-types-in-pg_dump-and-.patch
- v4-0014-Add-new-vartag_external-for-8-byte-TOAST-values.patch
- v4-0015-amcheck-Add-test-cases-for-8-byte-TOAST-values.patch
- signature.asc
Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)
От
Nikhil Kumar Veldanda
Дата:
Hi michael, I have a question regarding TOAST pointer handling. As I understand, in the current design, each attribute in a HeapTuple can have its own TOAST pointer, and TOAST pointers are possible only for top-level attributes. Would it make sense to maintain an array for ttc_toast_pointer_size in ToastTupleContext, allowing us to estimate the size per attribute based on compression or other criteria? This approach could make the logic more generic in my opinion, but it would require changes in toast_tuple_find_biggest_attribute and other places. I’d like to hear your thoughts on this. On Fri, Aug 8, 2025 at 12:52 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Aug 01, 2025 at 06:03:11PM +0900, Michael Paquier wrote: > > Please find attached a v3, that I have spent some time polishing to > > fix the value ID problem of this thread. v2 had some conflicts, and > > the CI previously failed with warning job (CI is green here now). > > Attached is a v4, due to conflicts mainly caused by the recent changes > in varatt.h done by e035863c9a04. This had an interesting side > benefit when rebasing, where I have been able to isolate most of the > knowledge related to the struct varatt_external (well > varatt_external_oid in the patch set) into toast_external.c, at the > exception of VARTAG_SIZE. That's done in a separate patch, numbered > 0006. > > The rest of the patch set has a couple of adjustements to document > better the new API expectations for toast_external.{c,h}, comment > adjustments, some more beautification changes, some indentation > applied, etc. > > > As things stand, I am getting pretty happy with the patch set up to > > 0005 and how things are getting in shape for the interface, and I am > > planning to begin applying this stuff up to 0005 in the next couple of > > weeks. > > As of this patch set, this means a new target of 0006, to get the > TOAST code refactored so as it is able to support more than 1 type of > external on-disk pointer with the 8-byte value problem in scope. Any > comments? > -- > Michael -- Nikhil Veldanda
Michael Paquier <michael@paquier.xyz> writes: > Attached is a v4, due to conflicts mainly caused by the recent changes > in varatt.h done by e035863c9a04. I found some time to look at the v4 patchset, and have a bunch of comments of different sizes. 0001: I'm good with widening all these values to 64 bits, but I wonder if it's a great idea to use unadorned "uint64" as the data type. That's impossible to grep for, doesn't convey anything much about what the variables are, etc. I'm tempted to propose instead inventing a typedef "BigOid" or some such name (bikeshedding welcome). The elog's could be handled with, say, #define PRIBO PRIu64 This suggestion isn't made with the idea that we'd someday switch to an even wider type, but just with the idea of making it clearer what these values are being used for. When you see "Oid" you know it's some sort of object identifier, and I'm sad to give that up here. This hunk is flat out buggy: @@ -1766,6 +1774,7 @@ check_tuple_attribute(HeapCheckContext *ctx) return true; /* It is external, and we're looking at a page on disk */ + toast_pointer_valueid = toast_pointer.va_valueid; /* * Must copy attr into toast_pointer for alignment considerations toast_pointer isn't initialized at this point. I see you fixed that in 0004, but it doesn't help to split the patch series if the intermediate steps are broken. 0002: OK, although some of the hunks in 0001 seem to belong here. 0003: No objection to the struct renaming, but does this go far enough? Aren't we going to need to rename TOAST_POINTER_SIZE to TOAST_OID_POINTER_SIZE, etc, so that we can have similar symbols for the wider version? I'm suspicious of not renaming the functions that work on these, too. (Oh, it looks like you did some of that in later parts.) BTW, given that varatt.h has "typedef struct varatt_external {...} varatt_external", I'm certain that the vast majority of uses of "struct varatt_external" could be shortened to "varatt_external". And I think we should do that, because using "struct foo" not "foo" is not project style. This patch would be a fine time to do that. 0004: Shouldn't VARATT_EXTERNAL_GET_POINTER go away entirely? It looks to me like every use of that should be replaced by toast_external_info_get_data(). I wonder if we shouldn't try to get rid of the phraseology "standard TOAST pointer", and instead write something like "short TOAST pointer" or "small TOAST pointer". These aren't really going to be more "standard" than the wider ones, IMO. I don't like replacing "va_valueid" with just "value". Dropping the "id" is not an improvement, because now a reader might be confused about whether this is somehow the actual value of the toasted datum. Functions in toast_external.c are under-documented. Some lack any header comment whatever, and others have one that doesn't actually say what they do. I kind of wonder whether the run-time indirection this design causes is going to be a performance problem. Perhaps not, given the expenses involved in accessing a toasted value, but it has a faint feeling of overdesign. It looks like a lot of this code would just flat out dump core if passed an invalid vartag value. Probably not good enough, given that we might look at corrupted data. 0005: Nice! 0006: get_new_value is very confusingly documented. Is the indexid that of the toastrel's index? What attribute is the attnum for? Why should the caller need to provide either, rather than get_new_value knowing that internally? Also, again you are using "value" for something that is not the value of the to-be-toasted datum. I'd suggest something more like "get_new_toast_identifier". I kind of feel that treating this operation as a toast_external_info method is the wrong thing, since where it looks like you are going is to fetch an identifier and only then decide which vartag you need to use. That ugliness comes out here: + /* + * Retrieve the external TOAST information, with the value still unknown. + * We need to do this again once we know the actual value assigned, to + * define the correct vartag_external for the new TOAST tuple. + */ 0007: Surely we do not need the cast in this: + return murmurhash64((int64) DatumGetInt64(datum)); I see you copied that from int4hashfast, but that's not good style either. More generally, though, why do we need catcache support for int8? There aren't any caches on toast chunks, and I doubt we are going to introduce any. Sure there might be a need for this down the road, but I don't see that this patch series is the time to add it. 0008: I totally hate the idea of introducing a GUC for this. This should be user-transparent. Or if it isn't user-transparent, a GUC is still the wrong thing; some kind of relation option would be more sensible. 0009: I do not love this either. I think the right thing is just to widen the existing nextOid counter to 64 bits, and increment it in-memory as 64 bits, and then return either all 64 bits when a 64-bit Oid is asked for, or just the low-order 32 bits when a 32-bit OID is asked for (with appropriate hacking for 32-bit wraparound). Having a separate counter will make it too easy for the same value to be produced by nearby calls to the 32-bit and 64-bit Oid generators, which is likely a bad thing, if only because of potential confusion. 0010: OK 0011: Not reviewing this yet, because I disagree with the basic design. I didn't look at the later patches either. regards, tom lane
Hi, On 2025-08-08 15:32:16 -0400, Tom Lane wrote: > Michael Paquier <michael@paquier.xyz> writes: > > Attached is a v4, due to conflicts mainly caused by the recent changes > > in varatt.h done by e035863c9a04. > > I found some time to look at the v4 patchset, and have a bunch of > comments of different sizes. > > 0001: > > I'm good with widening all these values to 64 bits, but I wonder > if it's a great idea to use unadorned "uint64" as the data type. > That's impossible to grep for, doesn't convey anything much about > what the variables are, etc. I'm tempted to propose instead > inventing a typedef "BigOid" or some such name (bikeshedding > welcome). The elog's could be handled with, say, > #define PRIBO PRIu64 > This suggestion isn't made with the idea that we'd someday switch > to an even wider type, but just with the idea of making it clearer > what these values are being used for. When you see "Oid" you > know it's some sort of object identifier, and I'm sad to give > that up here. I think we should consider introducing Oid64, instead of having a toast specific type. I doubt this is the last place that we'll want to use 64 bit wide types for cataloged entities. > 0004: > > Shouldn't VARATT_EXTERNAL_GET_POINTER go away entirely? > It looks to me like every use of that should be replaced by > toast_external_info_get_data(). > > I wonder if we shouldn't try to get rid of the phraseology "standard > TOAST pointer", and instead write something like "short TOAST pointer" > or "small TOAST pointer". These aren't really going to be more > "standard" than the wider ones, IMO. > > I don't like replacing "va_valueid" with just "value". Dropping > the "id" is not an improvement, because now a reader might be > confused about whether this is somehow the actual value of the > toasted datum. > > Functions in toast_external.c are under-documented. Some lack > any header comment whatever, and others have one that doesn't > actually say what they do. > > I kind of wonder whether the run-time indirection this design causes > is going to be a performance problem. Perhaps not, given the expenses > involved in accessing a toasted value, but it has a faint feeling > of overdesign. +1 > 0008: > > I totally hate the idea of introducing a GUC for this. This should be > user-transparent. Or if it isn't user-transparent, a GUC is still > the wrong thing; some kind of relation option would be more sensible. Agreed. I think we need backward compatibility for pg_upgrade purposes, but that doesn't require a GUC, it just requires an option when creating tables in binary upgrade mode. > 0009: > > I do not love this either. I think the right thing is just to widen > the existing nextOid counter to 64 bits, and increment it in-memory as > 64 bits, and then return either all 64 bits when a 64-bit Oid is asked > for, or just the low-order 32 bits when a 32-bit OID is asked for > (with appropriate hacking for 32-bit wraparound). Having a separate > counter will make it too easy for the same value to be produced by > nearby calls to the 32-bit and 64-bit Oid generators, which is likely > a bad thing, if only because of potential confusion. I'm not convinced that the global counter, be it a 32 or a 64 bit, approach has merit in general, and I'm rather sure it's the wrong thing for toast values. There's no straightforward path to move away from the global counter for plain oids, but I would suggest simply not using the global oid counter for toast IDs. A large portion of the cases I've seen where toast ID assignments were a problem were when the global OID counter wrapped around due to activity on *other* tables (and/or temporary table creation). If you instead had a per-toast-table sequence for assigning chunk IDs, that problem would largely vanish. With 64bit toast IDs we shouldn't need to search the index for a non-conflicting toast IDs, there can't be wraparounds (we'd hit wraparound of LSNs well before that and that's not practically reachable). Greetings, Andres Freund
On Fri, Aug 08, 2025 at 05:02:46PM -0400, Andres Freund wrote: > On 2025-08-08 15:32:16 -0400, Tom Lane wrote: >> I found some time to look at the v4 patchset, and have a bunch of >> comments of different sizes. Thanks for the input. This patch set exists as a result of a discussion between you and Andres back in 2022. (Replying here regarding the points that Andres is quoting, I'll add a second mail for some of the other things later.) >> I'm good with widening all these values to 64 bits, but I wonder >> if it's a great idea to use unadorned "uint64" as the data type. >> That's impossible to grep for, doesn't convey anything much about >> what the variables are, etc. I'm tempted to propose instead >> inventing a typedef "BigOid" or some such name (bikeshedding >> welcome). The elog's could be handled with, say, >> #define PRIBO PRIu64 >> This suggestion isn't made with the idea that we'd someday switch >> to an even wider type, but just with the idea of making it clearer >> what these values are being used for. When you see "Oid" you >> know it's some sort of object identifier, and I'm sad to give >> that up here. > > I think we should consider introducing Oid64, instead of having a toast > specific type. I doubt this is the last place that we'll want to use 64 bit > wide types for cataloged entities. Works for me for the grepping argument. Using "64" as a number of bits in the type name sounds a bit strange to me, not in line with what's done with bigint/int8 or float, where we use bytes. Naming a dedicated type "bigoid" with a structure named BigOid behind that's used for TOAST or in the future for other code paths is OK by me. AFAIK, the itchy point with unsigned 64b would be the casting, but my take would be to introduce no casts in the first implementation, particularly for the bigint -> bigoid case. >> 0004: >> >> Shouldn't VARATT_EXTERNAL_GET_POINTER go away entirely? >> It looks to me like every use of that should be replaced by >> toast_external_info_get_data(). Indirect toast pointers still wanted it. But perhaps this could just be renamed VARATT_EXTERNAL_INDIRECT_GET_POINTER() or something like that if jt's used only for TOAST pointers? >> I wonder if we shouldn't try to get rid of the phraseology "standard >> TOAST pointer", and instead write something like "short TOAST pointer" >> or "small TOAST pointer". These aren't really going to be more >> "standard" than the wider ones, IMO. >> >> I don't like replacing "va_valueid" with just "value". Dropping >> the "id" is not an improvement, because now a reader might be >> confused about whether this is somehow the actual value of the >> toasted datum. Okay, sure. >> Functions in toast_external.c are under-documented. Some lack >> any header comment whatever, and others have one that doesn't >> actually say what they do. Okay. Will work on that. I am not sure if it's worth doing yet, it does not seem like there's a clear agreement about patch 0004 and the toast_external business I am proposing. >> I kind of wonder whether the run-time indirection this design causes >> is going to be a performance problem. Perhaps not, given the expenses >> involved in accessing a toasted value, but it has a faint feeling >> of overdesign. toast_external_info_get_data() is called in 8 places as of the patch set: 1) Twice for amcheck. 2) Twice for reorderbuffer. 3) Three times in detoast.c 3-1) When fetching a value in full, toast_fetch_datum(), which goes through the slice API. 3-2) toast_fetch_datum_slice(), to retrieve a slice of the toasted data. 3-3) detoast_attr_slice(), falling back to the slice fetch. 4) Twice in toast_internals.c: 4-1) Saving a datum. 4-2) Deleting a datum. And if you see here, upthread, I've defined the worst case as retrieving a minimal toasted slice stored uncompressed, to measure the cost of the conversion to this toast_external, without seeing any effects, the btree conflict lookups doing most of the work: https://www.postgresql.org/message-id/aGzLiDUB_18-8aVQ%40paquier.xyz Thoughts and counter-arguments are welcome, of course. > +1 Well, one reason why this design exists is a point from 2023, made by Robert H., around here: https://www.postgresql.org/message-id/CA%2BTgmoaVcjUkmtWdc_9QjBzvSShjDBYk-5XFNaOvYLgGROjJMA%40mail.gmail.com The argument is that it's hard to miss for the existing code and extensions how a new vartag should be handled. Inventing a new layer means that extensions and existing code need to do a switch once, then they cannot really miss be missed when we add a new vartag because the from/to indirection between the toast_external layer and the varlenas is handled in its own place. A second reason why I've used this design is the problem related to compression IDs, where the idea would be to add a new vartag for zstandard for example. This still needs more work due to the existing limitations that we currently have with the CompressionToastId and its limitation to four values. The idea is that the deserialization of this data into this toast_external proposal makes the future additions easier. It would be of course more brutal and more efficient to just extend the code paths where the toast_external layer (I mean where VARATT_IS_EXTERNAL_ONDISK() is used currently) with an extra branch made of a VARATT_IS_EXTERNAL_ONDISK_BIGOID(), but I've decided to digest the argument from Robert instead, where I'm aiming at isolating most of the code related to on-disk external TOAST pointers into a single file, named toast_external.c here. I'm OK with the "brutal" method if the patch presented here is thought as overengineered and overdesigned, just wanted to say that there are a couple of reasons behind these choices. It's great that both of you are able to take some time to comment on these choices. If you disagree with this method and just say that we should go ahead with a more direct VARATT-like method, that's also fine by me as it would still solve the value limitation problem. And that's primarily what I want to solve here as it's a pain for some in the field. >> 0008: >> >> I totally hate the idea of introducing a GUC for this. This should be >> user-transparent. Or if it isn't user-transparent, a GUC is still >> the wrong thing; some kind of relation option would be more sensible. > > Agreed. I think we need backward compatibility for pg_upgrade purposes, but > that doesn't require a GUC, it just requires an option when creating tables in > binary upgrade mode. So, you basically mean that we should just make the 8-byte case the default for all the new tables created on a new cluster at upgrade, and just carry across upgrades the TOAST tables with OID values? If this stuff uses a reloption or an option at DDL level, that's going to need some dump/restore parts. Not sure that I'm betting the full picture of what the default behavior should be. I have read on the 2022 thread where both of you have discussed this issue a point about a "legacy" mode, to give users a way to get the old behavior if they wanted. A GUC fit nicely around that idea, because it is them possible to choose how all tables should behave, or perhaps I just did not design correctly what you meant through that. >> 0009: >> >> I do not love this either. I think the right thing is just to widen >> the existing nextOid counter to 64 bits, and increment it in-memory as >> 64 bits, and then return either all 64 bits when a 64-bit Oid is asked >> for, or just the low-order 32 bits when a 32-bit OID is asked for >> (with appropriate hacking for 32-bit wraparound). Having a separate >> counter will make it too easy for the same value to be produced by >> nearby calls to the 32-bit and 64-bit Oid generators, which is likely >> a bad thing, if only because of potential confusion. I was wondering about doing that as well. FWIW, this approach works for me, eating only 4 more bytes in the control file. > I'm not convinced that the global counter, be it a 32 or a 64 bit, approach > has merit in general, and I'm rather sure it's the wrong thing for toast > values. There's no straightforward path to move away from the global counter > for plain oids, but I would suggest simply not using the global oid counter > for toast IDs. > > A large portion of the cases I've seen where toast ID assignments were a > problem were when the global OID counter wrapped around due to activity on > *other* tables (and/or temporary table creation). If you instead had a > per-toast-table sequence for assigning chunk IDs, that problem would largely > vanish. I've thought about this piece already and the reason why I have not taken this approach is mostly simplicity. It's mentioned upthread, with potentially pieces referring to sequence AMs and "toast" sequences that could be optimized for the purpose of TOAST relations: https://www.postgresql.org/message-id/aG2iY26tXj1_MHfH%40paquier.xyz Yep. All the requirement boxes are filled with a sequence assigned to the TOAST relation. Using a single 8-byte counter is actually cheaper, even if the overhead when generating the TOAST tuples are in the btree lookups and the insertions themselves. It is also possible to do that as a two-step process: - First have TOAST relations with 8-byte values. - Add local sequences later on. What matters is having the code in place to check for index lookups upon values conflicting when a new sequence is attached to an existing TOAST relation. If we assign sequences from the start, that would not matter, of course. Another reason on top of the simplicity piece behind the control file is that it's dead cheap to acquire a new value when saving a chunk externally. -- Michael
Вложения
On Fri, Aug 08, 2025 at 03:32:16PM -0400, Tom Lane wrote: > I found some time to look at the v4 patchset, and have a bunch of > comments of different sizes. Thanks for the comments. I have replied to some of the items here: https://www.postgresql.org/message-id/aJbygEBqJgmLS0wF%40paquier.xyz Will try to answer the rest here. > 0001: > > I'm good with widening all these values to 64 bits, but I wonder > if it's a great idea to use unadorned "uint64" as the data type. > That's impossible to grep for, doesn't convey anything much about > what the variables are, etc. I'm tempted to propose instead > inventing a typedef "BigOid" or some such name (bikeshedding > welcome). The elog's could be handled with, say, > #define PRIBO PRIu64 > This suggestion isn't made with the idea that we'd someday switch > to an even wider type, but just with the idea of making it clearer > what these values are being used for. When you see "Oid" you > know it's some sort of object identifier, and I'm sad to give > that up here. Already mentioned on the other message, but I'm OK with a "bigoid" type and a BigOid type. Honestly, I'm not sure about a replacement for PRIu64, as we don't really do that for Oids with %u. > toast_pointer isn't initialized at this point. I see you fixed that > in 0004, but it doesn't help to split the patch series if the > intermediate steps are broken. Oops. FWIW, I've rebased this patch set much more than 4 times. It looks like I've messed up some of the diffs. Sorry about that. > 0003: > > No objection to the struct renaming, but does this go far > enough? Aren't we going to need to rename TOAST_POINTER_SIZE > to TOAST_OID_POINTER_SIZE, etc, so that we can have similar > symbols for the wider version? I'm suspicious of not renaming > the functions that work on these, too. (Oh, it looks like you > did some of that in later parts.) Yeah. I've stuck that into the later parts where the int8 bits have been added. Perhaps I've not been ambitious enough. > BTW, given that varatt.h has "typedef struct varatt_external {...} > varatt_external", I'm certain that the vast majority of uses of > "struct varatt_external" could be shortened to "varatt_external". > And I think we should do that, because using "struct foo" not "foo" > is not project style. This patch would be a fine time to do that. Good point. > 0004: > > Shouldn't VARATT_EXTERNAL_GET_POINTER go away entirely? > It looks to me like every use of that should be replaced by > toast_external_info_get_data(). Point about indirect pointers mentioned on the other message, where we could rename VARATT_EXTERNAL_GET_POINTER to VARATT_EXTERNAL_INDIRECT_GET_POINTER or equivalent to limit the confusion? > I wonder if we shouldn't try to get rid of the phraseology "standard > TOAST pointer", and instead write something like "short TOAST pointer" > or "small TOAST pointer". These aren't really going to be more > "standard" than the wider ones, IMO. I'm seeing one place in arrayfuncs.c and one in detoast.h using this term on HEAD. I would do simpler: no standard and no short, just with a removal of the "standard" part if I were to change something. > I don't like replacing "va_valueid" with just "value". Dropping > the "id" is not an improvement, because now a reader might be > confused about whether this is somehow the actual value of the > toasted datum. Okay, sure. One reason behind the field renaming was also to track down all the areas in need to be updated. > Functions in toast_external.c are under-documented. Some lack > any header comment whatever, and others have one that doesn't > actually say what they do. Okay. > I kind of wonder whether the run-time indirection this design causes > is going to be a performance problem. Perhaps not, given the expenses > involved in accessing a toasted value, but it has a faint feeling > of overdesign. Mentioned on the other message, linked to this message: https://www.postgresql.org/message-id/aGzLiDUB_18-8aVQ%40paquier.xyz > It looks like a lot of this code would just flat out dump core if > passed an invalid vartag value. Probably not good enough, given > that we might look at corrupted data. Right. That's where we would need an error path in toast_external_get_info() if nothing is found. That's a cheap defense, I guess. Good thing is that the lookups are centralized in a single code path, at least with this design. > I kind of feel that treating this operation as a toast_external_info > method is the wrong thing, since where it looks like you are going > is to fetch an identifier and only then decide which vartag you > need to use. That ugliness comes out here: > > + /* > + * Retrieve the external TOAST information, with the value still unknown. > + * We need to do this again once we know the actual value assigned, to > + * define the correct vartag_external for the new TOAST tuple. > + */ Yeah, I was feeling a bit depressed when writing the concept of what a "default" method should be before assigning the value, but that was still feeling right. > 0007: > > Surely we do not need the cast in this: > > + return murmurhash64((int64) DatumGetInt64(datum)); > > I see you copied that from int4hashfast, but that's not good > style either. Okay. > > More generally, though, why do we need catcache support for int8? > There aren't any caches on toast chunks, and I doubt we are going to > introduce any. Sure there might be a need for this down the road, > but I don't see that this patch series is the time to add it. I recall that this part was required for the value conflict lookups and also the TOAST value retrievals, so we are going to need it. > 0008: > > I totally hate the idea of introducing a GUC for this. This should be > user-transparent. Or if it isn't user-transparent, a GUC is still > the wrong thing; some kind of relation option would be more sensible. Per other email, I'm not sure what you entirely mean here: should 8B values be the default with existing TOAST OID values kept as they are across upgrades? Or something else? > 0009: > > I do not love this either. I think the right thing is just to widen > the existing nextOid counter to 64 bits, and increment it in-memory as > 64 bits, and then return either all 64 bits when a 64-bit Oid is asked > for, or just the low-order 32 bits when a 32-bit OID is asked for > (with appropriate hacking for 32-bit wraparound). Having a separate > counter will make it too easy for the same value to be produced by > nearby calls to the 32-bit and 64-bit Oid generators, which is likely > a bad thing, if only because of potential confusion. Acknowledged on other message, fine my be. > 0011: > > Not reviewing this yet, because I disagree with the basic design. > I didn't look at the later patches either. Without an agreement about the design choices related to the first patches up to 0006, I doubt there this is any need to review any of the follow-up patches yet because the choices of the first patches influence the next patches in the series. Thanks for the feedback! -- Michael
Вложения
On Fri, Aug 08, 2025 at 09:09:20AM -0700, Nikhil Kumar Veldanda wrote: > I have a question regarding TOAST pointer handling. > > As I understand, in the current design, each attribute in a HeapTuple > can have its own TOAST pointer, and TOAST pointers are possible only > for top-level attributes. > > Would it make sense to maintain an array for ttc_toast_pointer_size in > ToastTupleContext, allowing us to estimate the size per attribute > based on compression or other criteria? > > This approach could make the logic more generic in my opinion, but it > would require changes in toast_tuple_find_biggest_attribute and other > places. > > I’d like to hear your thoughts on this. Yes, that's some complexity that you would need if plugging in more vartags if these are related to compression methods, and also something that we may need if we use multiple vartags depending on a value ID assigned (aka for the TOAST table with 8-byte values, short Datum if we have a value less than 4 billion with one vartag, longer Datum if value more than 4 billion with a second vartag). For the 4-byte vs 8-byte value case, I was wondering if we should be simpler and less optimistic and assume that we are only going to use the wider one depending on the type of chunk_id in the TOAST table, as a minimum threshold when checking if a tuple should be toasted or not. Perhaps my vision of things is too simple, but I cannot think about a good reason that would justify making this code more complicated than it already is. -- Michael
Вложения
Hi!
Michael, I'm looking into your patch set for Sequence AMs.
I'm not very happy with single global OID counter for TOAST
tables, so I've decided to investigate your work on sequences
and try to adapt it to TOAST tables. Would report my progress.
Some time ago I've proposed individual TOAST counter
to get rid of table lookups for unused ID, but later decided
that neither reloptions nor pg_class are not a good place
to store it due to lots of related catalog updates including
locks, so sequences look much more useful for this,
as you wrote above.
Personally I'm not too happy with
toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
array and for me it seems that lookups using VARTAG
should be straightened out with more error-proof, currently
using wrong vartags would result in core dumps.
On Sat, Aug 9, 2025 at 10:40 AM Michael Paquier <michael@paquier.xyz> wrote:
For the 4-byte vs 8-byte value case, I was wondering if we should be
simpler and less optimistic and assume that we are only going to use
the wider one depending on the type of chunk_id in the TOAST table, as
a minimum threshold when checking if a tuple should be toasted or not.
Perhaps my vision of things is too simple, but I cannot think about a
good reason that would justify making this code more complicated than
it already is.
--
Michael
On Fri, Aug 8, 2025 at 4:03 PM Andres Freund <andres@anarazel.de> wrote:
I'm not convinced that the global counter, be it a 32 or a 64 bit, approach
has merit in general, and I'm rather sure it's the wrong thing for toast
values. There's no straightforward path to move away from the global counter
for plain oids, but I would suggest simply not using the global oid counter
for toast IDs.
A large portion of the cases I've seen where toast ID assignments were a
problem were when the global OID counter wrapped around due to activity on
*other* tables (and/or temporary table creation). If you instead had a
per-toast-table sequence for assigning chunk IDs, that problem would largely
vanish.
That's been my experience as well. I was actually toying with the idea of simply switching from OIDs to per-table counters when I came across this, specifically to address the problem of OID wraparound induced performance problems.
On Wed, Aug 13, 2025 at 03:06:16PM -0500, Jim Nasby wrote: > On Fri, Aug 8, 2025 at 4:03 PM Andres Freund <andres@anarazel.de> wrote: >> I'm not convinced that the global counter, be it a 32 or a 64 bit, approach >> has merit in general, and I'm rather sure it's the wrong thing for toast >> values. There's no straightforward path to move away from the global >> counter >> for plain oids, but I would suggest simply not using the global oid counter >> for toast IDs. >> >> A large portion of the cases I've seen where toast ID assignments were a >> problem were when the global OID counter wrapped around due to activity on >> *other* tables (and/or temporary table creation). If you instead had a >> per-toast-table sequence for assigning chunk IDs, that problem would >> largely >> vanish. > > That's been my experience as well. I was actually toying with the idea of > simply switching from OIDs to per-table counters when I came across this, > specifically to address the problem of OID wraparound induced performance > problems. Yep, that exists. My original use case is not this one, where I have a class of users able to reach the OID limit on a single table even within the 32TB limit, meaning that TOAST blobs are large enough to reach the external threshold, still small enough to have 4 billions of them within the single-table limit. Implementation-wise, switching to 8B would solve both things, and it's so much cheaper to grab a value from a single source. I don't really object to using 4B local values, but that does not really solve the original issue that's causing the efforts I've initiated on this thread. But yes, I'm also trying to implement things so as such an addition like this one would be slightly easier. It just does not have to be me doing it ;) -- Michael
Вложения
On Sat, Aug 09, 2025 at 04:31:05PM +0900, Michael Paquier wrote: > Without an agreement about the design choices related to the first > patches up to 0006, I doubt there this is any need to review any of > the follow-up patches yet because the choices of the first patches > influence the next patches in the series. Thanks for the feedback! So, please find attached a rebased version set that addresses all the feedback that has been provided, hopefully: - 0001 is the requested data type for 64b OIDs, which is called here oid8 for the data type and Oid8 for the internal name. I'm not completely wedded to these names, but anyway. This comes with a package that make this data type useful for other purposes than this patch set, basically everything where I've wanted toys for TOAST: hash/btree operators, min/max, a couple of casts (oid -> oid8, integers, etc). The rest of the patch set builds upon that. - 0002 changes a couple of APIs related to TOAST to use Oid8 (in previous iterations this was uint64). In the last version, the patch was mixing parts related to max_chunk_size, that should be split cleanly now. - 0003 is a preparatory change, reducing the footprint of TOAST_MAX_CHUNK_SIZE, because we'd want to have something that changes depending on the vartag we're dealing with. - 0004 renames things around vartag_external to vartag_external_oid. Based on the previous feedback, this is more aggressive now with the renames, renaming a few more things like MAX_CHUNK_SIZE, TOAST_POINTER_SIZE, etc. - 0005 is the refactoring piece, that introduces toast_external.c. I have included the previous feedback, cleaning up the code, adding more documentation, adding a failure fallback if the vartag given in input is incorrect, to serve as a corruption defense. And a few more things. - 0006 is a follow-up cleanup of varatt.h. It would make more sense to merge that with 0005, but I've decided to keep that separate as it makes the review slightly cleaner. - 0007 is related to the previous feedback, and a follow-up cleanup that could be merged with one of the previous steps. It changes the remaining pieces of VARATT_EXTERNAL_GET_POINTER to be related to indirect TOAST pointers, as it's only used there. - 0008 is a previous piece, switching pg_column_toast_chunk_id() to use oid8 as result instead. - 0009 adds the catcache bits for OID8OID, required for toast values lookups and deletions, in the upcoming patches. Same as previous. - 0010 is a new piece, based on the previous feedback. This is an independent piece that adds an extra step in binary upgrades to be able to pass down the attribute type of chunk_id. Without oid8 support for the values, this does not bring much of course, but it's less code churn in the follow-up patches. - 0011 adds a new piece, based on the previous feedback, where the existing nextOid is enlarged to 8 bytes, keeping compatibility for the existing 4-byte OIDs where we don't want values within the [0, FirstNormalObjectId] range. The next patches rely on the new API to get 8-byte values. - 0012 is a new piece requested: reloption able to define the attribute type of chunk_id when a TOAST relation is initially created for a table, replacing the previous GUC approach which is now gone. - 0013 adds support for oid8 in TOAST relations, extending the new reloption and the binary upgrade paths previously introduced. - 0014 adds some tests with toast OID8 case, leaving some relations around for pg_upgrade, covering the binary upgrade path for oid8. - 0015 is the last patch, adding a new vartag for OID8. It would make most sense to merge that with 0013, perhaps, the split is here to ease reviews. I have dropped the amcheck test patch for now, which was fun but it's not really necessary for the "basics". I have done also more tests, playing for example with pg_resetwal, installcheck and pg_upgrade scenarios. I am wondering if it would be worth doing a pg_resetwal in the node doing an installcheck on the instance to be upgraded, bumping its next OID to be much larger than 4 billion, actually.. -- Michael
Вложения
- v5-0001-Implement-oid8-data-type.patch
- v5-0002-Refactor-some-TOAST-value-ID-code-to-use-Oid8-ins.patch
- v5-0003-Minimize-footprint-of-TOAST_MAX_CHUNK_SIZE-in-hea.patch
- v5-0004-Renames-around-varatt_external-varatt_external_oi.patch
- v5-0005-Refactor-external-TOAST-pointer-code-for-better-p.patch
- v5-0006-Move-static-inline-routines-of-varatt_external_oi.patch
- v5-0007-Split-VARATT_EXTERNAL_GET_POINTER-for-indirect-an.patch
- v5-0008-Switch-pg_column_toast_chunk_id-return-value-from.patch
- v5-0009-Add-catcache-support-for-OID8OID.patch
- v5-0010-Add-support-for-TOAST-chunk_id-type-in-binary-upg.patch
- v5-0011-Enlarge-OID-generation-to-8-bytes.patch
- v5-0012-Add-relation-option-toast_value_type.patch
- v5-0013-Add-support-for-oid8-TOAST-values.patch
- v5-0014-Add-tests-for-TOAST-relations-with-bigint-as-valu.patch
- v5-0015-Add-new-vartag_external-for-8-byte-TOAST-values.patch
- signature.asc
On Thu, Aug 14, 2025 at 02:49:06PM +0900, Michael Paquier wrote: > I have dropped the amcheck test patch for now, which was fun but it's > not really necessary for the "basics". I have done also more tests, > playing for example with pg_resetwal, installcheck and pg_upgrade > scenarios. I am wondering if it would be worth doing a pg_resetwal in > the node doing an installcheck on the instance to be upgraded, bumping > its next OID to be much larger than 4 billion, actually.. Four patches had conflicts with 748caa9dcb68, so rebased as v6. -- Michael
Вложения
- v6-0001-Implement-oid8-data-type.patch
- v6-0002-Refactor-some-TOAST-value-ID-code-to-use-Oid8-ins.patch
- v6-0003-Minimize-footprint-of-TOAST_MAX_CHUNK_SIZE-in-hea.patch
- v6-0004-Renames-around-varatt_external-varatt_external_oi.patch
- v6-0005-Refactor-external-TOAST-pointer-code-for-better-p.patch
- v6-0006-Move-static-inline-routines-of-varatt_external_oi.patch
- v6-0007-Split-VARATT_EXTERNAL_GET_POINTER-for-indirect-an.patch
- v6-0008-Switch-pg_column_toast_chunk_id-return-value-from.patch
- v6-0009-Add-catcache-support-for-OID8OID.patch
- v6-0010-Add-support-for-TOAST-chunk_id-type-in-binary-upg.patch
- v6-0011-Enlarge-OID-generation-to-8-bytes.patch
- v6-0012-Add-relation-option-toast_value_type.patch
- v6-0013-Add-support-for-oid8-TOAST-values.patch
- v6-0014-Add-tests-for-TOAST-relations-with-bigint-as-valu.patch
- v6-0015-Add-new-vartag_external-for-8-byte-TOAST-values.patch
- signature.asc
On Tue, Sep 16, 2025 at 02:14:43PM +0900, Michael Paquier wrote: > On Thu, Aug 14, 2025 at 02:49:06PM +0900, Michael Paquier wrote: >> I have dropped the amcheck test patch for now, which was fun but it's >> not really necessary for the "basics". I have done also more tests, >> playing for example with pg_resetwal, installcheck and pg_upgrade >> scenarios. I am wondering if it would be worth doing a pg_resetwal in >> the node doing an installcheck on the instance to be upgraded, bumping >> its next OID to be much larger than 4 billion, actually.. > > Four patches had conflicts with 748caa9dcb68, so rebased as v6. There were a few conflicts, so here is a rebased v7, moving the patch to the next CF. I have been sitting on this patch for six weeks for the moment. Tom, you are registered as a reviewer of the patch. The point of contention of the patch, where I see there is no consensus yet, is if my approach of using a redirection for the external TOAST pointers with a new layer to facilitate the addition of more vartags (aka the 64b value vartag proposed here, concept that could also apply to compression methods later on) is acceptable. Moving to a different approach, like the "brutal" one I am naming upthread where the redirection layer is replaced by changes in all the code paths that need to be touched, would be of course cheaper at runtime as there would be no more redirection, but the maintenance would be a nightmare the more vartags we add, and I have some plans for more of these. Doing the switch would be a few hours work, so that would not be a big deal, I guess. The important part is an agreement about the approach, IMO. Please note that the latest patch set also uses a new reloption to control if 8 byte TOAST values are set (not a GUC), binary upgrades are handled with binary_upgrade.h functions, and the 8-byte OID value uses a dedicated data type, as reviewed upthread. One thing that shows up in the last patch of the set is mentioned in an XXX comment in toast_external_assign_vartag(), if it would be better to use a 4-byte vartag even if we have a 8-byte value in the TOAST table to make the TOAST pointers shorter when a value is less than 4 billions. Not sure how much to do about this one, and there's little point in doing this change without the earlier infrastructure patches if the approach taken is thought as OK, as well. -- Michael
Вложения
- v7-0001-Implement-oid8-data-type.patch
- v7-0002-Refactor-some-TOAST-value-ID-code-to-use-Oid8-ins.patch
- v7-0003-Minimize-footprint-of-TOAST_MAX_CHUNK_SIZE-in-hea.patch
- v7-0004-Renames-around-varatt_external-varatt_external_oi.patch
- v7-0005-Refactor-external-TOAST-pointer-code-for-better-p.patch
- v7-0006-Move-static-inline-routines-of-varatt_external_oi.patch
- v7-0007-Split-VARATT_EXTERNAL_GET_POINTER-for-indirect-an.patch
- v7-0008-Switch-pg_column_toast_chunk_id-return-value-from.patch
- v7-0009-Add-catcache-support-for-OID8OID.patch
- v7-0010-Add-support-for-TOAST-chunk_id-type-in-binary-upg.patch
- v7-0011-Enlarge-OID-generation-to-8-bytes.patch
- v7-0012-Add-relation-option-toast_value_type.patch
- v7-0013-Add-support-for-oid8-TOAST-values.patch
- v7-0014-Add-tests-for-TOAST-relations-with-bigint-as-valu.patch
- v7-0015-Add-new-vartag_external-for-8-byte-TOAST-values.patch
- signature.asc