Обсуждение: Adding CommandID to heap xlog records
Hi, The current WAL records generated by the Heap tableAM do not contain the command ID of the query that inserted/updated/deleted the records. The CID is not included in XLog because it is useful only to visibility checks in an active read/write transaction, which currently only appear in a primary node. In Neon [0], we're using XLog to reconstruct page state, as opposed to writing out dirty pages. This has the benefit of saving write IO for dirty page writes, but this does mean that we need the CID in heap insert/update/delete records to correctly mark the tuples, such that modified pages that are flushed from the buffer pool get reconstructed correctly. A more detailed write-up why we do this is here [1]. Neon does not need to be the only user of this API, as adding CID to xlog records also allows the primary to offload (partial) queries to a remote physical replica that would utilise the same transaction and snapshot of the primary. Right now, it's not possible to offload the read-component of RW queries to a secondary [2]. The attached patch would make multi-node transactions possible on systems with a single primary node and multiple read replicas, without the need for prepared commits and special extra code to achieve snapshot consistency, as a consistent snapshot could be copied and used by physical replicas (think parallel workers, but on a different server). Please find attached a patch that adds the CommandId of the inserting transaction to heap (batch)insert, update and delete records. It is based on the changes we made in the fork we maintain for Neon. Kind regards, Matthias van de Meent [0] https://github.com/neondatabase/neon/#neon [1] https://github.com/neondatabase/neon/blob/main/docs/core_changes.md#add-t_cid-to-heap-wal-records [2] At least not without blocking XLog replay of the primary transaction on the secondary, due to the same issues that Neon encountered: you need the CommandID to distinguish between this transactions' updates in the current command and previous commands.
Вложения
Matthias van de Meent <boekewurm+postgres@gmail.com> writes: > Please find attached a patch that adds the CommandId of the inserting > transaction to heap (batch)insert, update and delete records. It is > based on the changes we made in the fork we maintain for Neon. This seems like a very significant cost increment with returns to only a minuscule number of users. We certainly cannot consider it unless you provide some evidence that that impression is wrong. regards, tom lane
On Thu, 8 Sept 2022 at 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Matthias van de Meent <boekewurm+postgres@gmail.com> writes: > > Please find attached a patch that adds the CommandId of the inserting > > transaction to heap (batch)insert, update and delete records. It is > > based on the changes we made in the fork we maintain for Neon. > > This seems like a very significant cost increment with returns > to only a minuscule number of users. We certainly cannot consider > it unless you provide some evidence that that impression is wrong. Attached a proposed set of patches to reduce overhead of the inital patch. 0001: moves the RMGR-specific xl_info flag bits into their own field in the xlog header This new field is allocated in the 2-byte alignment hole in the xlog record header (of which now 1 byte is left). This change was discussed by Andres (cc-ed) earlier in [0], and this is a partial implementation of the suggestion. With the patch we could merge the xl_heap and xl_heap2 redo managers, but that is not implemented and not a goal of this patchset - we're only enabling the change, not providing it. The main difference between this patch and the proposed change of [0] is that this patch only provides a single 8-bit field for rmgr use, for both flag bits and record types, as opposed to separate fields for record type and flag bits. The reason for including this patch is to get free bits to use in the xlog header - all other addressable bits in the xlhdr are in use already; and it is much more difficult to find usable bits in the heap xlog structs. They exist, but processing would be much more of a pain than what it is now. 0002: add new wal_level = remote This implements the same concept as v1; but now makes CommandId presence optional. This presence is now indicated by the all-new XLOG_HEAP_WITH_CID bit. CommandId is included when wal_level is at least set to the new 'remote' value. wal_level=logical by extension also includes this commandId. Performance numbers for this patch seem to indicate no significant regression: Runs of pgbench (options: -s 50 -c 4 -j 4 -T 900 -v) have shown no immediately significant regression with wal_level = replica when compared to master @ cbe6dd17 (master: 978tps, patched: 985tps). Results for wal_level = remote are slightly worse than with wal_level = replica, but acceptable nonetheless (wal_level=remote: 964tps, =replica: 985tps). Apart from wal_level being changed between runs, it was an otherwise default postgres configuration with shared_buffers set to 10GB. Kind regards, Matthias van de Meent [0] https://www.postgresql.org/message-id/20220715173731.6t3km5cww3f5ztfq%40awork3.anarazel.de
Вложения
On Thu, Sep 22, 2022 at 11:12:32PM +0200, Matthias van de Meent wrote: > On Thu, 8 Sept 2022 at 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Matthias van de Meent <boekewurm+postgres@gmail.com> writes: > > > Please find attached a patch that adds the CommandId of the inserting > > > transaction to heap (batch)insert, update and delete records. It is > > > based on the changes we made in the fork we maintain for Neon. > > > > This seems like a very significant cost increment with returns > > to only a minuscule number of users. We certainly cannot consider > > it unless you provide some evidence that that impression is wrong. > > Attached a proposed set of patches to reduce overhead of the inital patch. This might be obvious to some, but the patch got a lot larger. :-( --------------------------------------------------------------------------- > contrib/pg_walinspect/pg_walinspect.c | 4 +- > src/backend/access/brin/brin_pageops.c | 16 +++--- > src/backend/access/brin/brin_xlog.c | 8 +-- > src/backend/access/gin/ginxlog.c | 6 +-- > src/backend/access/gist/gistxlog.c | 6 +-- > src/backend/access/hash/hash_xlog.c | 6 +-- > src/backend/access/heap/heapam.c | 40 +++++++-------- > src/backend/access/nbtree/nbtinsert.c | 18 +++---- > src/backend/access/nbtree/nbtpage.c | 8 +-- > src/backend/access/nbtree/nbtxlog.c | 10 ++-- > src/backend/access/rmgrdesc/brindesc.c | 20 ++++---- > src/backend/access/rmgrdesc/clogdesc.c | 10 ++-- > src/backend/access/rmgrdesc/committsdesc.c | 10 ++-- > src/backend/access/rmgrdesc/dbasedesc.c | 12 ++--- > src/backend/access/rmgrdesc/genericdesc.c | 2 +- > src/backend/access/rmgrdesc/gindesc.c | 8 +-- > src/backend/access/rmgrdesc/gistdesc.c | 8 +-- > src/backend/access/rmgrdesc/hashdesc.c | 8 +-- > src/backend/access/rmgrdesc/heapdesc.c | 46 ++++++++--------- > src/backend/access/rmgrdesc/logicalmsgdesc.c | 8 +-- > src/backend/access/rmgrdesc/mxactdesc.c | 14 ++--- > src/backend/access/rmgrdesc/nbtdesc.c | 8 +-- > src/backend/access/rmgrdesc/relmapdesc.c | 8 +-- > src/backend/access/rmgrdesc/replorigindesc.c | 8 +-- > src/backend/access/rmgrdesc/seqdesc.c | 8 +-- > src/backend/access/rmgrdesc/smgrdesc.c | 10 ++-- > src/backend/access/rmgrdesc/spgdesc.c | 8 +-- > src/backend/access/rmgrdesc/standbydesc.c | 12 ++--- > src/backend/access/rmgrdesc/tblspcdesc.c | 10 ++-- > src/backend/access/rmgrdesc/xactdesc.c | 34 ++++++------ > src/backend/access/rmgrdesc/xlogdesc.c | 28 +++++----- > src/backend/access/spgist/spgxlog.c | 6 +-- > src/backend/access/transam/clog.c | 8 +-- > src/backend/access/transam/commit_ts.c | 8 +-- > src/backend/access/transam/multixact.c | 48 ++++++++--------- > src/backend/access/transam/twophase.c | 2 +- > src/backend/access/transam/xact.c | 36 +++++++------ > src/backend/access/transam/xlog.c | 34 ++++++------ > src/backend/access/transam/xloginsert.c | 31 ++++++++--- > src/backend/access/transam/xlogprefetcher.c | 2 +- > src/backend/access/transam/xlogreader.c | 2 +- > src/backend/access/transam/xlogrecovery.c | 54 ++++++++++---------- > src/backend/access/transam/xlogstats.c | 2 +- > src/backend/catalog/storage.c | 15 +++--- > src/backend/commands/dbcommands.c | 30 ++++++----- > src/backend/commands/sequence.c | 6 +-- > src/backend/commands/tablespace.c | 8 +-- > src/backend/postmaster/autovacuum.c | 4 +- > src/backend/replication/logical/decode.c | 38 +++++++------- > src/backend/replication/logical/message.c | 6 +-- > src/backend/replication/logical/origin.c | 6 +-- > src/backend/storage/ipc/standby.c | 10 ++-- > src/backend/utils/cache/relmapper.c | 6 +-- > src/bin/pg_resetwal/pg_resetwal.c | 2 +- > src/bin/pg_rewind/parsexlog.c | 10 ++-- > src/bin/pg_waldump/pg_waldump.c | 6 +-- > src/include/access/brin_xlog.h | 2 +- > src/include/access/clog.h | 2 +- > src/include/access/ginxlog.h | 2 +- > src/include/access/gistxlog.h | 2 +- > src/include/access/hash_xlog.h | 2 +- > src/include/access/heapam_xlog.h | 4 +- > src/include/access/multixact.h | 3 +- > src/include/access/nbtxlog.h | 2 +- > src/include/access/spgxlog.h | 2 +- > src/include/access/xact.h | 6 +-- > src/include/access/xlog.h | 2 +- > src/include/access/xloginsert.h | 3 +- > src/include/access/xlogreader.h | 1 + > src/include/access/xlogrecord.h | 11 +--- > src/include/access/xlogstats.h | 2 +- > src/include/catalog/storage_xlog.h | 2 +- > src/include/commands/dbcommands_xlog.h | 2 +- > src/include/commands/sequence.h | 2 +- > src/include/commands/tablespace.h | 2 +- > src/include/replication/message.h | 2 +- > src/include/storage/standbydefs.h | 2 +- > src/include/utils/relmapper.h | 2 +- > 78 files changed, 430 insertions(+), 412 deletions(-) -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
On Wed, 28 Sept 2022 at 19:40, Bruce Momjian <bruce@momjian.us> wrote: > > On Thu, Sep 22, 2022 at 11:12:32PM +0200, Matthias van de Meent wrote: > > On Thu, 8 Sept 2022 at 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > Matthias van de Meent <boekewurm+postgres@gmail.com> writes: > > > > Please find attached a patch that adds the CommandId of the inserting > > > > transaction to heap (batch)insert, update and delete records. It is > > > > based on the changes we made in the fork we maintain for Neon. > > > > > > This seems like a very significant cost increment with returns > > > to only a minuscule number of users. We certainly cannot consider > > > it unless you provide some evidence that that impression is wrong. > > > > Attached a proposed set of patches to reduce overhead of the inital patch. > > This might be obvious to some, but the patch got a lot larger. :-( Sorry for that, but updating the field from which the redo manager should pull its information does indeed touch a lot of files because most users of xl_info are only interested in the 4 bits reserved for the redo-manager. Most of 0001 is therefore updates to point code to the new field in XLogRecord, and renaming the variables and arguments from info to rminfo. [tangent] With that refactoring, I also clean up a lot of code that was using a wrong macro/constant for rmgr flags; `info & ~XLR_INFO_MASK` may have the same value as `info & XLR_RMGR_INFO_MASK`, but that's only guaranteed by the documentation; and would require the same significant rework if new bits were assigned to non-XLR_INFO_MASK and non-XLR_RMGR_INFO_MASK. [/tangent] 0002 grew a bit as well, but not to a degree that I think is worrying or otherwise impossible to review. Kind regards, Matthias van de Meent.
2022年9月30日(金) 1:04 Matthias van de Meent <boekewurm+postgres@gmail.com>: > > On Wed, 28 Sept 2022 at 19:40, Bruce Momjian <bruce@momjian.us> wrote: > > > > On Thu, Sep 22, 2022 at 11:12:32PM +0200, Matthias van de Meent wrote: > > > On Thu, 8 Sept 2022 at 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > > > Matthias van de Meent <boekewurm+postgres@gmail.com> writes: > > > > > Please find attached a patch that adds the CommandId of the inserting > > > > > transaction to heap (batch)insert, update and delete records. It is > > > > > based on the changes we made in the fork we maintain for Neon. > > > > > > > > This seems like a very significant cost increment with returns > > > > to only a minuscule number of users. We certainly cannot consider > > > > it unless you provide some evidence that that impression is wrong. > > > > > > Attached a proposed set of patches to reduce overhead of the inital patch. > > > > This might be obvious to some, but the patch got a lot larger. :-( > > Sorry for that, but updating the field from which the redo manager > should pull its information does indeed touch a lot of files because > most users of xl_info are only interested in the 4 bits reserved for > the redo-manager. Most of 0001 is therefore updates to point code to > the new field in XLogRecord, and renaming the variables and arguments > from info to rminfo. > > [tangent] With that refactoring, I also clean up a lot of code that > was using a wrong macro/constant for rmgr flags; `info & > ~XLR_INFO_MASK` may have the same value as `info & > XLR_RMGR_INFO_MASK`, but that's only guaranteed by the documentation; > and would require the same significant rework if new bits were > assigned to non-XLR_INFO_MASK and non-XLR_RMGR_INFO_MASK. [/tangent] > > 0002 grew a bit as well, but not to a degree that I think is worrying > or otherwise impossible to review. Hi This entry was marked as "Needs review" in the CommitFest app but cfbot reports the patch no longer applies. We've marked it as "Waiting on Author". As CommitFest 2022-11 is currently underway, this would be an excellent time update the patch. Once you think the patchset is ready for review again, you (or any interested party) can move the patch entry forward by visiting https://commitfest.postgresql.org/40/3882/ and changing the status to "Needs review". Thanks Ian Barwick
On Thu, 3 Nov 2022 at 15:06, Ian Lawrence Barwick <barwick@gmail.com> wrote: > > 2022年9月30日(金) 1:04 Matthias van de Meent <boekewurm+postgres@gmail.com>: > > > > On Wed, 28 Sept 2022 at 19:40, Bruce Momjian <bruce@momjian.us> wrote: > > > > > > On Thu, Sep 22, 2022 at 11:12:32PM +0200, Matthias van de Meent wrote: > > > > On Thu, 8 Sept 2022 at 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > > > > > Matthias van de Meent <boekewurm+postgres@gmail.com> writes: > > > > > > Please find attached a patch that adds the CommandId of the inserting > > > > > > transaction to heap (batch)insert, update and delete records. It is > > > > > > based on the changes we made in the fork we maintain for Neon. > > > > > > > > > > This seems like a very significant cost increment with returns > > > > > to only a minuscule number of users. We certainly cannot consider > > > > > it unless you provide some evidence that that impression is wrong. > > > > > > > > Attached a proposed set of patches to reduce overhead of the inital patch. > > > > > > This might be obvious to some, but the patch got a lot larger. :-( > > > > Sorry for that, but updating the field from which the redo manager > > should pull its information does indeed touch a lot of files because > > most users of xl_info are only interested in the 4 bits reserved for > > the redo-manager. Most of 0001 is therefore updates to point code to > > the new field in XLogRecord, and renaming the variables and arguments > > from info to rminfo. > > > > [tangent] With that refactoring, I also clean up a lot of code that > > was using a wrong macro/constant for rmgr flags; `info & > > ~XLR_INFO_MASK` may have the same value as `info & > > XLR_RMGR_INFO_MASK`, but that's only guaranteed by the documentation; > > and would require the same significant rework if new bits were > > assigned to non-XLR_INFO_MASK and non-XLR_RMGR_INFO_MASK. [/tangent] > > > > 0002 grew a bit as well, but not to a degree that I think is worrying > > or otherwise impossible to review. > > Hi > > This entry was marked as "Needs review" in the CommitFest app but cfbot > reports the patch no longer applies. > > We've marked it as "Waiting on Author". As CommitFest 2022-11 is > currently underway, this would be an excellent time update the patch. > > Once you think the patchset is ready for review again, you (or any > interested party) can move the patch entry forward by visiting > > https://commitfest.postgresql.org/40/3882/ > > and changing the status to "Needs review". I was not sure if you will be planning to post an updated version of patch as the patch has been awaiting your attention from last commitfest, please post an updated version for it soon or update the commitfest entry accordingly. Regards, Vignesh
On Mon, 16 Jan 2023 at 19:56, vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 3 Nov 2022 at 15:06, Ian Lawrence Barwick <barwick@gmail.com> wrote: > > > > 2022年9月30日(金) 1:04 Matthias van de Meent <boekewurm+postgres@gmail.com>: > > > > > > On Wed, 28 Sept 2022 at 19:40, Bruce Momjian <bruce@momjian.us> wrote: > > > > > > > > On Thu, Sep 22, 2022 at 11:12:32PM +0200, Matthias van de Meent wrote: > > > > > On Thu, 8 Sept 2022 at 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > > > > > > > Matthias van de Meent <boekewurm+postgres@gmail.com> writes: > > > > > > > Please find attached a patch that adds the CommandId of the inserting > > > > > > > transaction to heap (batch)insert, update and delete records. It is > > > > > > > based on the changes we made in the fork we maintain for Neon. > > > > > > > > > > > > This seems like a very significant cost increment with returns > > > > > > to only a minuscule number of users. We certainly cannot consider > > > > > > it unless you provide some evidence that that impression is wrong. > > > > > > > > > > Attached a proposed set of patches to reduce overhead of the inital patch. > > > > > > > > This might be obvious to some, but the patch got a lot larger. :-( > > > > > > Sorry for that, but updating the field from which the redo manager > > > should pull its information does indeed touch a lot of files because > > > most users of xl_info are only interested in the 4 bits reserved for > > > the redo-manager. Most of 0001 is therefore updates to point code to > > > the new field in XLogRecord, and renaming the variables and arguments > > > from info to rminfo. > > > > > > [tangent] With that refactoring, I also clean up a lot of code that > > > was using a wrong macro/constant for rmgr flags; `info & > > > ~XLR_INFO_MASK` may have the same value as `info & > > > XLR_RMGR_INFO_MASK`, but that's only guaranteed by the documentation; > > > and would require the same significant rework if new bits were > > > assigned to non-XLR_INFO_MASK and non-XLR_RMGR_INFO_MASK. [/tangent] > > > > > > 0002 grew a bit as well, but not to a degree that I think is worrying > > > or otherwise impossible to review. > > > > Hi > > > > This entry was marked as "Needs review" in the CommitFest app but cfbot > > reports the patch no longer applies. > > > > We've marked it as "Waiting on Author". As CommitFest 2022-11 is > > currently underway, this would be an excellent time update the patch. > > > > Once you think the patchset is ready for review again, you (or any > > interested party) can move the patch entry forward by visiting > > > > https://commitfest.postgresql.org/40/3882/ > > > > and changing the status to "Needs review". > > I was not sure if you will be planning to post an updated version of > patch as the patch has been awaiting your attention from last > commitfest, please post an updated version for it soon or update the > commitfest entry accordingly. There has been no updates on this thread for some time, so this has been switched as Returned with Feedback. Feel free to open it in the next commitfest if you plan to continue on this. Regards, Vignesh
I took another stab at this from a different angle, and tried to use this to simplify logical decoding. The theory was that if we included the command ID in the WAL records, we wouldn't need the separate HEAP2_NEW_CID record anymore, and could remove much of the code in reorderbuffer.c that's concerned with tracking ctid->(cmin,cmax) mapping. Unfortunately, it didn't work out. Here's one problem: Insert with cmin 1 Commit Delete the same tuple with cmax 2. Abort Even if we store the cmin in the INSERT record, and set it on the tuple on replay, the DELETE overwrites it. That's OK for the original transactions, because they only look at the cmin/cmax of their own transaction, but it's a problem for logical decoding. If we see the inserted tuple during logical decoding, we need the cmin of the tuple. We could still just replace the HEAP2_NEW_CID records with the CIDs in the heap INSERT/UPDATE/DELETE records, and use that information to maintain the ctid->(cmin,cmax) mapping in reorderbuffer.c like we do today. But that doesn't really simplify reorderbuffer.c much. Attached is a patch for that, for the archives sake. Another problem with that is that logical decoding needs slightly different information than what we store on the tuples on disk. My original motivation for this was for Neon, which needs the WAL replay to restore the same CID as what's stored on disk, whether it's cmin, cmax or combocid. But for logical decoding, we need the cmin or cmax, *not* the combocid. To cater for both uses, we'd need to include both the original cmin/cmax and the possible combocid, which again makes it more complicated. So unfortunately I don't see much opportunity to simplify logical decoding with this. However, please take a look at the first two patches attached. They're tiny cleanups that make sense on their own. - Heikki
Вложения
Improve comment on cid mapping (was Re: Adding CommandID to heap xlog records)
От
Heikki Linnakangas
Дата:
On 28/02/2023 15:52, Heikki Linnakangas wrote: > So unfortunately I don't see much opportunity to simplify logical > decoding with this. However, please take a look at the first two patches > attached. They're tiny cleanups that make sense on their own. Rebased these small patches. I'll add this to the commitfest. -- Heikki Linnakangas Neon (https://neon.tech)
Вложения
Re: Improve comment on cid mapping (was Re: Adding CommandID to heap xlog records)
От
Andres Freund
Дата:
Hi, On 2023-06-26 09:57:56 +0300, Heikki Linnakangas wrote: > diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c > index 0786bb0ab7..e403feeccd 100644 > --- a/src/backend/replication/logical/snapbuild.c > +++ b/src/backend/replication/logical/snapbuild.c > @@ -41,10 +41,15 @@ > * transactions we need Snapshots that see intermediate versions of the > * catalog in a transaction. During normal operation this is achieved by using > * CommandIds/cmin/cmax. The problem with that however is that for space > - * efficiency reasons only one value of that is stored > - * (cf. combocid.c). Since combo CIDs are only available in memory we log > - * additional information which allows us to get the original (cmin, cmax) > - * pair during visibility checks. Check the reorderbuffer.c's comment above > + * efficiency reasons, the cmin and cmax are not included in WAL records. We > + * cannot read the cmin/cmax from the tuple itself, either, because it is > + * reset on crash recovery. Even if we could, we could not decode combocids > + * which are only tracked in the original backend's memory. To work around > + * that, heapam writes an extra WAL record (XLOG_HEAP2_NEW_CID) every time a > + * catalog row is modified, which includes the cmin and cmax of the > + * tuple. During decoding, we insert the ctid->(cmin,cmax) mappings into the > + * reorder buffer, and use them at visibility checks instead of the cmin/cmax > + * on the tuple itself. Check the reorderbuffer.c's comment above > * ResolveCminCmaxDuringDecoding() for details. > * > * To facilitate all this we need our own visibility routine, as the normal > -- > 2.30.2 LGTM > From 9140a0d98fd21b595eac6d111175521a6b1a9f1b Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Mon, 26 Jun 2023 09:56:02 +0300 > Subject: [PATCH v2 2/2] Remove redundant check for fast_forward. > > We already checked for it earlier in the function. > > Discussion: https://www.postgresql.org/message-id/1ba2899e-77f8-7866-79e5-f3b7d1251a3e@iki.fi > --- > src/backend/replication/logical/decode.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c > index d91055a440..7039d425e2 100644 > --- a/src/backend/replication/logical/decode.c > +++ b/src/backend/replication/logical/decode.c > @@ -422,8 +422,7 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) > switch (info) > { > case XLOG_HEAP2_MULTI_INSERT: > - if (!ctx->fast_forward && > - SnapBuildProcessChange(builder, xid, buf->origptr)) > + if (SnapBuildProcessChange(builder, xid, buf->origptr)) > DecodeMultiInsert(ctx, buf); > break; > case XLOG_HEAP2_NEW_CID: > -- > 2.30.2 LGTM^2 Greetings, Andres Freund