Discussion: XLog size reductions: smaller XLRec block header for PG17
Hi all,

As was previously discussed at the thread surrounding [1]: currently any block registration in WAL takes up at least 8 bytes in xlog overhead, regardless of how much data is included in that block: 1 byte for block ID, 1 byte for fork and flags, 2 bytes for the block data length, and 4 bytes for the blockNo. (Usually, another 12 bytes are used for a RelFileLocator, but that may not be included for some blocks in the record when conditions apply.)

Attached is a patch that reduces this overhead by up to 2 bytes by encoding how large the block data length field is into the block ID, and thus optionally reducing the block data's length field to 0 bytes. Examples: cross-page update records will now be 2 bytes shorter, because the record never registers any data for the new block of the update; pgbench transactions are now either 6 or 8 bytes smaller depending on whether the update crosses a page boundary (in xlog record size; after alignment it is 0 or 4/8 bytes, depending on MAXALIGN and whether the updates are cross-page updates).

It changes the block IDs used to fit in 6 bits, using the upper 2 bits of the block_id field to store how much data is contained in the record (0, <=UINT8_MAX, or <=UINT16_MAX bytes).

This is part 1 of a series of patches trying to decrease the size of WAL; see also [0], [1] and [2] for more info on what's still to go. I'm working on a separate, much more involved patch for the XLogRecord header itself, similar to the patch in [1], which I expect to send sometime soon as well. Unless someone thinks the patches should be discussed as one series, I'm planning on posting that in another thread, as I don't see any meaningful dependencies between the patches, and the XLR header patch will be quite a bit larger than this one.

Kind regards,

Matthias van de Meent
Neon, Inc.

[0] https://wiki.postgresql.org/wiki/Updating_the_WAL_infrastructure
[1] https://www.postgresql.org/message-id/flat/CAEze2Wjd3jY_UhhOGdGGnC6NO%3D%2BNmtNOmd%3DJaYv-v-nwBAiXXA%40mail.gmail.com#17a51d83923f4390d8f407d0d6c5da07
[2] https://www.postgresql.org/message-id/flat/CAEze2Whf%3DfwAj7rosf6aDM9t%2B7MU1w-bJn28HFWYGkz%2Bics-hg%40mail.gmail.com

PS. Benchmark results on my system (5950x with other light tasks running) don't show an obviously negative effect in a 10-minute run with these arbitrary pgbench settings on a fresh cluster with default configuration:

./pg_install/bin/pgbench postgres -j 2 -c 6 -T 600 -M prepared
[...]
master: tps = 375
patched: tps = 381
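For illustration, a minimal C sketch of this scheme (the names and exact bit assignments here are hypothetical, not necessarily those used in the attached patch):

#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: the upper 2 bits of the block ID byte select the
 * width of the block-data length field that follows it.
 */
#define XLR_SIZECLASS_MASK    0xC0  /* upper 2 bits of the ID byte */
#define XLR_SIZECLASS_EMPTY   0x00  /* no data; length field omitted */
#define XLR_SIZECLASS_UINT8   0x40  /* length <= UINT8_MAX; 1 byte */
#define XLR_SIZECLASS_UINT16  0x80  /* length <= UINT16_MAX; 2 bytes */
#define XLR_BLOCKID_MASK      0x3F  /* block IDs now fit in 6 bits */

/* Writes the length field; returns its size (0, 1, or 2 bytes). */
static inline int
encode_block_data_length(uint8_t *id_byte, uint8_t *out, uint16_t len)
{
    if (len == 0)
    {
        *id_byte |= XLR_SIZECLASS_EMPTY;
        return 0;
    }
    else if (len <= UINT8_MAX)
    {
        *id_byte |= XLR_SIZECLASS_UINT8;
        out[0] = (uint8_t) len;
        return 1;
    }
    else
    {
        *id_byte |= XLR_SIZECLASS_UINT16;
        memcpy(out, &len, sizeof(uint16_t)); /* native byte order */
        return 2;
    }
}

A decoder reads the two class bits first and then knows exactly how many length bytes follow, so no sentinel values are needed within the length field itself.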
Attachments
On 18/05/2023 17:59, Matthias van de Meent wrote:
> Attached is a patch that reduces this overhead by up to 2 bytes by
> encoding how large the block data length field is into the block ID,
> and thus optionally reducing the block data's length field to 0 bytes.
> Examples: cross-page update records will now be 2 bytes shorter,
> because the record never registers any data for the new block of the
> update; pgbench transactions are now either 6 or 8 bytes smaller
> depending on whether the update crosses a page boundary (in xlog
> record size; after alignment it is 0 or 4/8 bytes, depending on
> MAXALIGN and whether the updates are cross-page updates).
>
> It changes the block IDs used to fit in 6 bits, using the upper 2 bits
> of the block_id field to store how much data is contained in the
> record (0, <=UINT8_MAX, or <=UINT16_MAX bytes).

Perhaps we should introduce a few generic inline functions to do varint encoding. That could be useful in many places, while this scheme is very tailored to XLogRecordBlockHeader.

We could replace XLogRecordDataHeaderShort and XLogRecordDataHeaderLong with this too, leaving just one XLogRecordDataHeader with a variable-length length field.

> This is part 1 of a series of patches trying to decrease the size of
> WAL; see also [0], [1] and [2] for more info on what's still to go.
> I'm working on a separate, much more involved patch for the XLogRecord
> header itself, similar to the patch in [1], which I expect to send
> sometime soon as well. Unless someone thinks the patches should be
> discussed as one series, I'm planning on posting that in another
> thread, as I don't see any meaningful dependencies between the
> patches, and the XLR header patch will be quite a bit larger than this
> one.
>
> Kind regards,
>
> Matthias van de Meent
> Neon, Inc.
>
> [0] https://wiki.postgresql.org/wiki/Updating_the_WAL_infrastructure

Good ideas here. Eliminating the two padding bytes from XLogRecord in particular seems like a pure win.

> PS. Benchmark results on my system (5950x with other light tasks
> running) don't show an obviously negative effect in a 10-minute run
> with these arbitrary pgbench settings on a fresh cluster with default
> configuration:
>
> ./pg_install/bin/pgbench postgres -j 2 -c 6 -T 600 -M prepared
> [...]
> master: tps = 375
> patched: tps = 381

That was probably not CPU limited, so any overhead in generating the WAL would not show up. Try PGOPTIONS="-csynchronous_commit=off" and the pgbench -N option, and make sure the scale is large enough that there is no lock contention. It would also be good to measure the overhead of replaying the WAL.

How much space saving does this yield?

- Heikki
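For a sense of what such generic inline helpers could look like, here is a rough LEB128-style sketch (a common continuation-bit scheme; purely illustrative, and not Andres' prototype or anything from the patch):

#include <stdint.h>
#include <stddef.h>

/*
 * LEB128-style varint: each byte carries 7 payload bits; the high bit
 * marks whether another byte follows. Bounds checks omitted for brevity.
 */
static inline size_t
varint_encode_uint64(uint8_t *buf, uint64_t value)
{
    size_t n = 0;

    while (value >= 0x80)
    {
        buf[n++] = (uint8_t) (value | 0x80); /* low 7 bits + continue bit */
        value >>= 7;
    }
    buf[n++] = (uint8_t) value;              /* final byte, high bit clear */
    return n;
}

static inline size_t
varint_decode_uint64(const uint8_t *buf, uint64_t *value)
{
    uint64_t result = 0;
    size_t   n = 0;
    int      shift = 0;

    do
    {
        result |= (uint64_t) (buf[n] & 0x7F) << shift;
        shift += 7;
    } while (buf[n++] & 0x80);

    *value = result;
    return n;
}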
On Thu, 18 May 2023 at 18:22, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 18/05/2023 17:59, Matthias van de Meent wrote:
> Perhaps we should introduce a few generic inline functions to do varint
> encoding. That could be useful in many places, while this scheme is
> very tailored to XLogRecordBlockHeader.

I'm not sure about the reusability of such code, as not all varint encodings are made equal. Here, I chose to determine the size of the field with some bits stored in the leftover bits of another field, i.e. the field and its size are stored separately. But in other cases, such as UTF-8's code point encoding, each byte has a continuation bit indicating whether the value has more bytes to go. In yet other cases, such as sqlite's integer encoding, the value is stored in a single byte, unless that byte contains a sentinel value that indicates how many bytes the value continues into. (A simplified sketch of such a sentinel scheme appears after the table at the bottom of this mail.)

What I'm trying to say is that there is no perfect encoding that is better than all others, and I picked what I thought worked best in this specific case. I think it is reasonable to expect that varint-encoding e.g. the blockNo or RelFileNode in the WAL record could call for a different method than the one I've chosen for the block data length.

> We could replace XLogRecordDataHeaderShort and XLogRecordDataHeaderLong
> with this too, leaving just one XLogRecordDataHeader with a
> variable-length length field.

Yes, that could be done too, but it's not part of the patch right now, and I've not yet planned on implementing it for this patch.

> > [0] https://wiki.postgresql.org/wiki/Updating_the_WAL_infrastructure
>
> Good ideas here. Eliminating the two padding bytes from XLogRecord in
> particular seems like a pure win.

It requires code churn and probably increases complexity, but apart from that I think 'pure win' is accurate, yes.

> > PS. Benchmark results on my system (5950x with other light tasks
> > running) don't show an obviously negative effect in a 10-minute run
> > with these arbitrary pgbench settings on a fresh cluster with default
> > configuration:
> >
> > ./pg_install/bin/pgbench postgres -j 2 -c 6 -T 600 -M prepared
> > [...]
> > master: tps = 375
> > patched: tps = 381
>
> That was probably not CPU limited, so any overhead in generating the
> WAL would not show up. Try PGOPTIONS="-csynchronous_commit=off" and the
> pgbench -N option, and make sure the scale is large enough that there
> is no lock contention. It would also be good to measure the overhead of
> replaying the WAL.

With assertions now disabled and the following configuration:

synchronous_commit = off
fsync = off
full_page_writes = off
checkpoint_timeout = 1d
autovacuum = off

I get:

master: tps = 3500.815859
patched: tps = 3535.188054

With autovacuum enabled it worked similarly well, within 1% of these results.

> How much space saving does this yield?

No meaningful savings in the pgbench workload, mostly because xlog record lengths are MAXALIGNed and the 1- or 2-byte savings usually disappear into the alignment padding. But record sizes have dropped by 1 or 2 bytes in several cases, as can be seen at the bottom of this mail.

Kind regards,

Matthias van de Meent
Neon, Inc.

The data: record type, then record length averages (average aligned length in parentheses) for both master and patched, and the average per-record savings with this patch.
| record type   | master avg      | patched avg     | delta | delta   |
|               | (aligned avg)   | (aligned avg)   |       | aligned |
|---------------|-----------------|-----------------|-------|---------|
| BT/DEDUP      | 64.00 (64.00)   | 63.00 (64.00)   | -1    | 0       |
| BT/INS_LEAF   | 81.41 (81.41)   | 80.41 (81.41)   | -1    | 0       |
| CLOG/0PG      | 30.00 (32.00)   | 30.00 (32.00)   | 0     | 0       |
| HEAP/DEL      | 54.00 (56.00)   | 52.00 (56.00)   | -2    | 0       |
| HEAP/HOT_UPD  | 72.02 (72.19)   | 71.02 (72.19)   | 0     | 0       |
| HEAP/INS      | 79.00 (80.00)   | 78.00 (80.00)   | -1    | 0       |
| HEAP/INS+INIT | 79.00 (80.00)   | 78.00 (80.00)   | -1    | 0       |
| HEAP/LOCK     | 54.00 (56.00)   | 52.00 (56.00) * | -2    | 0       |
| HEAP2/MUL_INS | 85.00 (88.00)   | 84.00 (88.00) * | -1    | 0       |
| HEAP2/PRUNE   | 65.17 (68.19)   | 64.17 (68.19)   | -1    | 0       |
| STDBY/R_XACTS | 52.76 (56.00)   | 52.21 (56.00)   | -0.5  | 0       |
| TX/COMMIT     | 34.00 (40.00)   | 34.00 (40.00)   | 0     | 0       |
| XLOG/CHCKPT_O | 114.00 (120.00) | 114.00 (120.00) | 0     | 0       |
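As promised above, a simplified sketch of a sentinel-style varint of the kind mentioned for sqlite, to contrast with continuation-bit schemes (the cutoff values and names are illustrative, not sqlite's actual format):

#include <stdint.h>
#include <stddef.h>

/*
 * Sentinel-style varint sketch: values up to 0xF9 are stored as-is in a
 * single byte; the sentinel first bytes 0xFA and 0xFB announce a 2- or
 * 4-byte big-endian value following them. Illustrative only.
 */
static inline size_t
sentinel_encode_uint32(uint8_t *buf, uint32_t value)
{
    if (value <= 0xF9)
    {
        buf[0] = (uint8_t) value;   /* common case: one byte total */
        return 1;
    }
    else if (value <= UINT16_MAX)
    {
        buf[0] = 0xFA;              /* sentinel: uint16 follows */
        buf[1] = (uint8_t) (value >> 8);
        buf[2] = (uint8_t) value;
        return 3;
    }
    else
    {
        buf[0] = 0xFB;              /* sentinel: uint32 follows */
        buf[1] = (uint8_t) (value >> 24);
        buf[2] = (uint8_t) (value >> 16);
        buf[3] = (uint8_t) (value >> 8);
        buf[4] = (uint8_t) value;
        return 5;
    }
}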
Hi,

I noticed that the patch needs review and decided to take a look.

> No meaningful savings in the pgbench workload, mostly because xlog
> record lengths are MAXALIGNed and the 1- or 2-byte savings usually
> disappear into the alignment padding. But record sizes have dropped by
> 1 or 2 bytes in several cases, as can be seen at the bottom of this
> mail.

This may not sound like a lot, but it is still valuable IMO if we consider the reduction in terms of percentages of overall saved disk throughput, network traffic, etc., rather than in absolute values per record. Even if 1-2 bytes are not a bottleneck that shows up on benchmarks (or the performance improvement is not that impressive), it is still some amount of money saved in the cloud. Considering that the patch is not that complicated, I see no reason not to apply the optimization as long as it doesn't cause degradations.

I also agree with Matthias' arguments above regarding the lack of a one-size-fits-all variable-length encoding and the overall desire to keep the focus. E.g. the code can be refactored if and when we discover that different subsystems ended up using the same encoding.

All in all the patch looks good to me, but I have a couple of nitpicks:

* The comment for XLogSizeClass seems to be somewhat truncated, as if Ctrl+S was not pressed before creating the patch. I also suggest double-checking the grammar.
* `Size written = -1;` in XLogWriteLength() can lead to compiler warnings some day, considering that Size / size_t are unsigned. The assignment also doesn't seem to serve any particular purpose, so I suggest removing it.
* I don't see much value in using the WRITE_OP macro in XLogWriteLength(). The code is read more often than it's written, and I wouldn't call this code particularly readable (although it's shorter).
* XLogReadLength() - ditto
* `if (read < 0)` in DecodeXLogRecord() is a no-op since `read` is unsigned

--
Best regards,
Aleksander Alekseev
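The last nitpick is the classic unsigned-comparison trap; a standalone illustration of it (not code from the patch):

#include <stddef.h>
#include <stdio.h>

int
main(void)
{
    size_t read = (size_t) -1;  /* wraps around to SIZE_MAX */

    /*
     * Always false: size_t is unsigned, so it can never be negative.
     * Compilers typically warn here: "comparison of unsigned expression
     * < 0 is always false".
     */
    if (read < 0)
        puts("error path");     /* unreachable */
    else
        printf("read = %zu\n", read);
    return 0;
}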
On Tue, 5 Sept 2023 at 15:04, Aleksander Alekseev <aleksander@timescale.com> wrote:
>
> Hi,
>
> I noticed that the patch needs review and decided to take a look.

Thanks for reviewing!

> All in all the patch looks good to me, but I have a couple of nitpicks:
>
> * The comment for XLogSizeClass seems to be somewhat truncated, as if
> Ctrl+S was not pressed before creating the patch. I also suggest
> double-checking the grammar.

I've updated the various comments with improved wording.

> * `Size written = -1;` in XLogWriteLength() can lead to compiler
> warnings some day, considering that Size / size_t are unsigned. The
> assignment also doesn't seem to serve any particular purpose, so I
> suggest removing it.

Fixed; it now uses `int` instead, as does XLogReadLength().

> * I don't see much value in using the WRITE_OP macro in
> XLogWriteLength(). The code is read more often than it's written, and
> I wouldn't call this code particularly readable (although it's
> shorter).
> * XLogReadLength() - ditto

I use READ_OP and WRITE_OP mostly to make sure that each operation's code is clear. Manually expanding the macro would allow each variant's handling to drift apart structurally, and that would allow for more coding errors. I think it's extra important to make sure the code isn't wrong because this concerns WAL (de)serialization, and one copy is (in my opinion) easier to check for errors than three copies. I've had my share of issues in copy-edited code, so I'd rather keep the template around as long as I don't need to modify the underlying code.

> * `if (read < 0)` in DecodeXLogRecord() is a no-op since `read` is unsigned

Yes, thanks for noticing. I've been working with Rust recently, where the unsigned size type is spelled `usize` and the signed one `isize`. The issue has been fixed in the attached patch by using 'int' types instead.

Kind regards,

Matthias van de Meent
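For readers following along, a hypothetical sketch of the macro-template approach being defended here: one macro body is stamped out per size class, so all variants stay structurally identical (names are illustrative, not the patch's actual macros):

#include <stdint.h>
#include <string.h>

/* One template generates the write path for every length size class. */
#define WRITE_LENGTH_OP(sizeclass, type) \
    case sizeclass: \
    { \
        type v = (type) length; \
        memcpy(out, &v, sizeof(type)); \
        return sizeof(type); \
    }

enum size_class { SC_EMPTY, SC_UINT8, SC_UINT16 };

/* Writes the length field for the given class; returns bytes written. */
static int
write_length(uint8_t *out, enum size_class sc, uint16_t length)
{
    switch (sc)
    {
        WRITE_LENGTH_OP(SC_UINT8, uint8_t)
        WRITE_LENGTH_OP(SC_UINT16, uint16_t)
        case SC_EMPTY:
            break;              /* no length field stored */
    }
    return 0;
}

A bug fixed in the macro body is fixed for every size class at once, which is the copy-paste-safety argument made above.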
Attachments
Hi,

On 2023-05-18 19:22:26 +0300, Heikki Linnakangas wrote:
> On 18/05/2023 17:59, Matthias van de Meent wrote:
> > Attached is a patch that reduces this overhead by up to 2 bytes by
> > encoding how large the block data length field is into the block ID,
> > and thus optionally reducing the block data's length field to 0 bytes.
> > Examples: cross-page update records will now be 2 bytes shorter,
> > because the record never registers any data for the new block of the
> > update; pgbench transactions are now either 6 or 8 bytes smaller
> > depending on whether the update crosses a page boundary (in xlog
> > record size; after alignment it is 0 or 4/8 bytes, depending on
> > MAXALIGN and whether the updates are cross-page updates).
> >
> > It changes the block IDs used to fit in 6 bits, using the upper 2 bits
> > of the block_id field to store how much data is contained in the
> > record (0, <=UINT8_MAX, or <=UINT16_MAX bytes).
>
> Perhaps we should introduce a few generic inline functions to do varint
> encoding. That could be useful in many places, while this scheme is
> very tailored to XLogRecordBlockHeader.

Yes - I proposed that and wrote an implementation of reasonably efficient varint encoding. Here's my prototype:
https://postgr.es/m/20221004234952.anrguppx5owewb6n%40awork3.anarazel.de

I think it's a bad tradeoff to write lots of custom varint encodings just to eke out a bit more space savings. The increase in code complexity IMO makes it a bad tradeoff.

Greetings,

Andres Freund
On Mon, Sep 18, 2023 at 04:03:38PM -0700, Andres Freund wrote:
> I think it's a bad tradeoff to write lots of custom varint encodings
> just to eke out a bit more space savings. The increase in code
> complexity IMO makes it a bad tradeoff.

+1.
--
Michael
Attachments
On Tue, 19 Sept 2023 at 01:03, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-05-18 19:22:26 +0300, Heikki Linnakangas wrote:
> > On 18/05/2023 17:59, Matthias van de Meent wrote:
> > > It changes the block IDs used to fit in 6 bits, using the upper 2 bits
> > > of the block_id field to store how much data is contained in the
> > > record (0, <=UINT8_MAX, or <=UINT16_MAX bytes).
> >
> > Perhaps we should introduce a few generic inline functions to do varint
> > encoding. That could be useful in many places, while this scheme is
> > very tailored to XLogRecordBlockHeader.

This scheme is reused later for the XLogRecord xl_tot_len field over at [0], so it is, FWIW, already being reused. Sure, it's tailored to this WAL use case, but IMO we're getting good value from it. We don't use protobuf or JSON for WAL, we use our own serialization format. Having some specialized encoding/decoding in that format for certain fields is IMO quite acceptable.

> Yes - I proposed that and wrote an implementation of reasonably efficient
> varint encoding. Here's my prototype:
> https://postgr.es/m/20221004234952.anrguppx5owewb6n%40awork3.anarazel.de

As I mentioned on that thread, that prototype has a significant probability of doing nothing to improve WAL size, or even increasing the WAL size for installations which consume a lot of OIDs.

> I think it's a bad tradeoff to write lots of custom varint encodings,
> just to eke out a bit more space savings.

This is only a single "custom" varint encoding though, if you can even call it that. It makes a field's size depend on flags set in another byte, which is not that much different from the existing use of XLR_BLOCK_ID_DATA_[LONG, SHORT].

> The increase in code complexity IMO makes it a bad tradeoff.

Pardon me for asking, but what would you consider to be a good tradeoff then? I think the code relating to the WAL storage format is about as simple as you can get it within the feature set it provides and the size of the resulting records. While I think there is still much to gain w.r.t. WAL record size, I don't think we can get many of those improvements without adding at least some amount of complexity, something I think is true for most components in PostgreSQL.

So, short of redesigning significant parts of the public WAL APIs, are we just going to ignore any potential improvements because they "increase code complexity"?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://commitfest.postgresql.org/43/4386/
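For reference, the existing main-data headers that the XLR_BLOCK_ID_DATA_[LONG, SHORT] IDs select between look roughly like this (paraphrased from src/include/access/xlogrecord.h):

#include <stdint.h>

typedef uint8_t uint8;      /* PostgreSQL's typedef, normally from c.h */

/*
 * The record's main data area uses a short header when the data length
 * fits in one byte, and a long header (with an unaligned uint32 length
 * following it) otherwise. Which one was used is encoded in the id byte
 * itself, so the length field's width already depends on another field.
 */
typedef struct XLogRecordDataHeaderShort
{
    uint8    id;             /* XLR_BLOCK_ID_DATA_SHORT */
    uint8    data_length;    /* number of payload bytes */
} XLogRecordDataHeaderShort;

typedef struct XLogRecordDataHeaderLong
{
    uint8    id;             /* XLR_BLOCK_ID_DATA_LONG */
    /* followed by uint32 data_length, unaligned */
} XLogRecordDataHeaderLong;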
Hi,

> This scheme is reused later for the XLogRecord xl_tot_len field over
> at [0], so it is, FWIW, already being reused. Sure, it's tailored to
> this WAL use case, but IMO we're getting good value from it. We don't
> use protobuf or JSON for WAL, we use our own serialization format.
> Having some specialized encoding/decoding in that format for certain
> fields is IMO quite acceptable.
>
> > Yes - I proposed that and wrote an implementation of reasonably efficient
> > varint encoding. Here's my prototype:
> > https://postgr.es/m/20221004234952.anrguppx5owewb6n%40awork3.anarazel.de
>
> As I mentioned on that thread, that prototype has a significant
> probability of doing nothing to improve WAL size, or even increasing
> the WAL size for installations which consume a lot of OIDs.
>
> > I think it's a bad tradeoff to write lots of custom varint encodings,
> > just to eke out a bit more space savings.
>
> This is only a single "custom" varint encoding though, if you can even
> call it that. It makes a field's size depend on flags set in another
> byte, which is not that much different from the existing use of
> XLR_BLOCK_ID_DATA_[LONG, SHORT].
>
> > The increase in code complexity IMO makes it a bad tradeoff.
>
> Pardon me for asking, but what would you consider to be a good
> tradeoff then? I think the code relating to the WAL storage format is
> about as simple as you can get it within the feature set it provides
> and the size of the resulting records. While I think there is still
> much to gain w.r.t. WAL record size, I don't think we can get many of
> those improvements without adding at least some amount of complexity,
> something I think is true for most components in PostgreSQL.
>
> So, short of redesigning significant parts of the public WAL APIs,
> are we just going to ignore any potential improvements because they
> "increase code complexity"?

Here are my two cents.

I definitely see the value in having a reusable varint encoding. It would probably be a good option for extendable TOAST pointers, compression dictionaries, and perhaps other patches. In this particular case, however, Matthias has good arguments that it is not the right tool for this particular task, IMO.

We don't use rbtrees for everything that needs a map from x to y; hash tables have other compromises. Sometimes one container is a better fit, sometimes the other, sometimes none, and we implement a fine-tuned container for the given case. Same here.

--
Best regards,
Aleksander Alekseev
On Tue, 26 Sept 2023 at 02:09, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>
> On Tue, 19 Sept 2023 at 01:03, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2023-05-18 19:22:26 +0300, Heikki Linnakangas wrote:
> > > On 18/05/2023 17:59, Matthias van de Meent wrote:
> > > > It changes the block IDs used to fit in 6 bits, using the upper 2 bits
> > > > of the block_id field to store how much data is contained in the
> > > > record (0, <=UINT8_MAX, or <=UINT16_MAX bytes).
> > >
> > > Perhaps we should introduce a few generic inline functions to do varint
> > > encoding. That could be useful in many places, while this scheme is
> > > very tailored to XLogRecordBlockHeader.
>
> This scheme is reused later for the XLogRecord xl_tot_len field over
> at [0], so it is, FWIW, already being reused. Sure, it's tailored to
> this WAL use case, but IMO we're getting good value from it. We don't
> use protobuf or JSON for WAL, we use our own serialization format.
> Having some specialized encoding/decoding in that format for certain
> fields is IMO quite acceptable.
>
> > Yes - I proposed that and wrote an implementation of reasonably efficient
> > varint encoding. Here's my prototype:
> > https://postgr.es/m/20221004234952.anrguppx5owewb6n%40awork3.anarazel.de
>
> As I mentioned on that thread, that prototype has a significant
> probability of doing nothing to improve WAL size, or even increasing
> the WAL size for installations which consume a lot of OIDs.
>
> > I think it's a bad tradeoff to write lots of custom varint encodings,
> > just to eke out a bit more space savings.
>
> This is only a single "custom" varint encoding though, if you can even
> call it that. It makes a field's size depend on flags set in another
> byte, which is not that much different from the existing use of
> XLR_BLOCK_ID_DATA_[LONG, SHORT].
>
> > The increase in code complexity IMO makes it a bad tradeoff.
>
> Pardon me for asking, but what would you consider to be a good
> tradeoff then? I think the code relating to the WAL storage format is
> about as simple as you can get it within the feature set it provides
> and the size of the resulting records. While I think there is still
> much to gain w.r.t. WAL record size, I don't think we can get many of
> those improvements without adding at least some amount of complexity,
> something I think is true for most components in PostgreSQL.
>
> So, short of redesigning significant parts of the public WAL APIs,
> are we just going to ignore any potential improvements because they
> "increase code complexity"?

I'm seeing that there has been no activity in this thread for nearly 4 months, so I'm planning to close this in the current commitfest unless someone is planning to take it forward.

Regards,
Vignesh
Hi,

> I'm seeing that there has been no activity in this thread for nearly 4
> months, so I'm planning to close this in the current commitfest unless
> someone is planning to take it forward.

I don't think that closing CF entries purely due to inactivity is a good practice (nor something we did before) as long as there is code, it applies, etc. There are a lot of patches and few people working on them. Inactivity in a given thread doesn't necessarily indicate lack of interest; more likely it indicates lack of resources.

--
Best regards,
Aleksander Alekseev
On Mon, 22 Jan 2024 at 16:08, Aleksander Alekseev <aleksander@timescale.com> wrote:
>
> Hi,
>
> > I'm seeing that there has been no activity in this thread for nearly 4
> > months, so I'm planning to close this in the current commitfest unless
> > someone is planning to take it forward.
>
> I don't think that closing CF entries purely due to inactivity is a
> good practice (nor something we did before) as long as there is code,
> it applies, etc. There are a lot of patches and few people working on
> them. Inactivity in a given thread doesn't necessarily indicate lack
> of interest; more likely it indicates lack of resources.

There are a lot of patches like this, and there is no clear way to find out whether someone wants to work on a patch or has lost interest in it. That is the reason I thought to send out a mail, so that the author/reviewer can reply and take it to the next state, like "ready for committer". If the author/reviewer is not planning to work on it in this commitfest but has plans to work on it in the next one, we can move it to the next commitfest. I don't see a better way to identify whether a patch still has interest.

Regards,
Vignesh
On Mon, Jan 22, 2024 at 5:38 AM Aleksander Alekseev <aleksander@timescale.com> wrote:
> I don't think that closing CF entries purely due to inactivity is a
> good practice (nor something we did before) as long as there is code,
> it applies, etc. There are a lot of patches and few people working on
> them. Inactivity in a given thread doesn't necessarily indicate lack
> of interest; more likely it indicates lack of resources.

If the CF entry doesn't serve the purpose of allowing someone to find patches in need of review, it's worse than useless. Patches that aren't getting reviewed should stay in the CF until they do, or until we make a collective decision that we don't care about or want that particular patch enough to bother. But patches that are not ready for review need to get out until such time as they are; otherwise they're just non-actionable clutter. And unfortunately we have so much of that in the CF app now that finding things that really need review is time consuming and difficult. That REALLY needs to be cleaned up.

In the case of this particular patch, I think the problem is that there's no consensus on the design. There's not a ton of debate on this thread, but thread [1] linked in the original post contains a lot of vigorous debate about what the right thing to do is here, and I don't believe we reached any meeting of the minds. In light of that lack of agreement, I'm honestly a bit confused why Matthias even found it worthwhile to submit this on a new thread. I think we all agree with him that there's room for improvement here, but we don't agree on what particular form that improvement should take, and as long as that agreement is lacking, I find it hard to imagine anything getting committed. The task right now is to get agreement on something, and leaving the CF entry open longer isn't going to make the people who have already expressed opinions start agreeing with each other more than they do already.

It looks like I never replied to https://www.postgresql.org/message-id/20221019192130.ebjbycpw6bzjry4v%40awork3.anarazel.de but, FWIW, I agree with Andres that applying the same technique to multiple fields that are stored together (DB OID, TS OID, rel #, block #) is unlikely in practice to produce many cases that regress. But the question for this thread is really more about whether we're OK with using ad-hoc bit swizzling to reduce the size of xlog records, or whether we want to insist on the use of a uniform varint encoding. Heikki and Andres both seem to favor the latter. IIRC, I was initially more optimistic about ad-hoc bit swizzling being a potentially acceptable technique, but I'm not convinced enough about it to argue against two very smart committers, both of whom know more about micro-optimizing performance than I do, and nobody else seems to be making this argument on this thread either, so I just don't really see how this patch is ever going to go anywhere in its current form.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 22/01/2024 19:23, Robert Haas wrote:
> In the case of this particular patch, I think the problem is that
> there's no consensus on the design. There's not a ton of debate on
> this thread, but thread [1] linked in the original post contains a lot
> of vigorous debate about what the right thing to do is here, and I
> don't believe we reached any meeting of the minds.

Yeah, so it seems.

> It looks like I never replied to
> https://www.postgresql.org/message-id/20221019192130.ebjbycpw6bzjry4v%40awork3.anarazel.de
> but, FWIW, I agree with Andres that applying the same technique to
> multiple fields that are stored together (DB OID, TS OID, rel #, block
> #) is unlikely in practice to produce many cases that regress. But the
> question for this thread is really more about whether we're OK with
> using ad-hoc bit swizzling to reduce the size of xlog records, or
> whether we want to insist on the use of a uniform varint encoding.
> Heikki and Andres both seem to favor the latter. IIRC, I was initially
> more optimistic about ad-hoc bit swizzling being a potentially
> acceptable technique, but I'm not convinced enough about it to argue
> against two very smart committers, both of whom know more about
> micro-optimizing performance than I do, and nobody else seems to be
> making this argument on this thread either, so I just don't really see
> how this patch is ever going to go anywhere in its current form.

I don't have a clear idea of how to proceed with this either. Some thoughts I have:

Using varint encoding makes sense for length fields. The common values are small, and if the length of anything is large, then the size of the length field itself is insignificant compared to the actual data.

I don't like using varint encoding for OIDs. They might be small in common cases, but it feels wrong to rely on that. They're just arbitrary numbers: we could pick them randomly, and it's just an implementation detail that we use a counter to choose the next one. I really dislike the idea that someone would do a pg_dump + restore just to get smaller OIDs and smaller WAL as a result. It does make sense to have a fast path (small path?) for 0 OIDs, though.

To shrink OID fields, you could refer to earlier WAL records. A special value for "same relation as in previous record", or something like that. Now we're just re-inventing LZ-style compression, though. Might as well use LZ4 or Snappy or something to compress the whole WAL stream. It's a bit tricky to get the crash-safety right, but it shouldn't be impossible.

Has anyone seriously considered implementing wholesale compression of WAL?

--
Heikki Linnakangas
Neon (https://neon.tech)
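To make the "same relation as in previous record" idea concrete, a very rough sketch (all names here are hypothetical; this is not a worked-out design, and it glosses over how replay reconstructs the running state after a checkpoint):

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Stand-in for PostgreSQL's RelFileLocator (three 4-byte OIDs). */
typedef struct RelLoc
{
    uint32_t spcOid;        /* tablespace */
    uint32_t dbOid;         /* database */
    uint32_t relNumber;     /* relation */
} RelLoc;

#define REC_SAME_REL_AS_PREV 0x01   /* hypothetical record flag */

/*
 * Omit the 12-byte locator when it matches the previous record's,
 * setting a flag bit instead; the decoder keeps the same running state.
 */
static size_t
write_locator(uint8_t *out, uint8_t *flags,
              const RelLoc *cur, RelLoc *prev, int have_prev)
{
    if (have_prev && memcmp(cur, prev, sizeof(RelLoc)) == 0)
    {
        *flags |= REC_SAME_REL_AS_PREV;     /* saves 12 bytes */
        return 0;
    }
    memcpy(out, cur, sizeof(RelLoc));
    *prev = *cur;
    return sizeof(RelLoc);
}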
On Fri, Feb 2, 2024 at 8:52 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> To shrink OID fields, you could refer to earlier WAL records. A special
> value for "same relation as in previous record", or something like
> that. Now we're just re-inventing LZ-style compression, though. Might
> as well use LZ4 or Snappy or something to compress the whole WAL
> stream. It's a bit tricky to get the crash-safety right, but it
> shouldn't be impossible.
>
> Has anyone seriously considered implementing wholesale compression of WAL?

I thought about the idea of referring to earlier WAL and/or undo records when working on zheap. It seems tricky, because what if replay starts after those WAL records and you can't refer back to them? It's OK if you can make sure that you never depend on anything prior to the latest checkpoint, but the natural way to make that work is to add more looping like what we already do for FPIs, and then you have to worry about whether that extra logic is going to be more expensive than what you save. FPIs are so expensive that we can afford to go to a lot of trouble to avoid them and still come out ahead, but storing the full OID instead of an OID reference isn't nearly in the same category.

I also thought about trying to refer to earlier items on the same page, thinking that the locality would make things easier. But it doesn't, because we don't know which page will ultimately contain the WAL record until quite late, so we can't reason about what's on the same page while constructing it.

Wholesale compression of WAL might run into some of the same issues. For example, if you don't want to compress each record individually, you can't compress until you know the insert position. And even then, if you want the previous data to be present in the compressor as context, you almost need all the WAL compression to be done by a single process. But if the LSNs are relative to the compressed stream, you have to wait for that compression to finish before you can determine the LSN of the next record, which seems super-painful; and if they're relative to the uncompressed stream, then mapping them onto fixed-size files gets tricky.

My hunch is that we can squeeze more out of the existing architecture with a lot less work than it would take to do major rearchitecture like compressing everything. I don't know how we can agree on a way of doing that, because everybody's got slightly different ideas about the right way to do this. But if agreeing on how to evolve the system we've got seems harder than rewriting it, we need to stop worrying about WAL overhead and learn how to work together better.

--
Robert Haas
EDB: http://www.enterprisedb.com