Discussion: Something is wrong with wal_compression
The symptom being exhibited by Michael's new BF animal tanager is
perfectly reproducible elsewhere.

$ cat /home/postgres/tmp/temp_config
#default_toast_compression = lz4
wal_compression = lz4
$ export TEMP_CONFIG=/home/postgres/tmp/temp_config
$ cd ~/pgsql/src/test/recovery
$ make check PROVE_TESTS=t/011_crash_recovery.pl
...
+++ tap check in src/test/recovery +++
t/011_crash_recovery.pl .. 1/?
#   Failed test 'new xid after restart is greater'
#   at t/011_crash_recovery.pl line 53.
#     '729'
#         >
#     '729'
#   Failed test 'xid is aborted after crash'
#   at t/011_crash_recovery.pl line 57.
#          got: 'committed'
#     expected: 'aborted'
# Looks like you failed 2 tests of 3.

Maybe this is somehow the test script's fault, but I don't see how.
It fails the same way with 'wal_compression = pglz', so I think it's
generic to that whole feature rather than specific to LZ4.

			regards, tom lane
On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:
> The symptom being exhibited by Michael's new BF animal tanager
> is perfectly reproducible elsewhere.

I think these tests have always failed with wal_compression ?

https://www.postgresql.org/message-id/20210308.173242.463790587797836129.horikyota.ntt%40gmail.com
https://www.postgresql.org/message-id/20210313012820.GJ29463@telsasoft.com
https://www.postgresql.org/message-id/20220222231948.GJ9008@telsasoft.com

https://www.postgresql.org/message-id/YNqWd2GSMrnqWIfx@paquier.xyz
|My buildfarm machine has been changed to use wal_compression = lz4,
|while on it for HEAD runs.
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:
>> The symptom being exhibited by Michael's new BF animal tanager
>> is perfectly reproducible elsewhere.

> I think these tests have always failed with wal_compression ?

If that's a known problem, and we've done nothing about it, that is
pretty horrid.  That test case is demonstrating fundamental database
corruption after a crash.

			regards, tom lane
On Thu, Jan 26, 2023 at 02:08:27PM -0600, Justin Pryzby wrote:
> On Thu, Jan 26, 2023 at 02:43:29PM -0500, Tom Lane wrote:
> > The symptom being exhibited by Michael's new BF animal tanager
> > is perfectly reproducible elsewhere.
>
> I think these tests have always failed with wal_compression ?
>
> https://www.postgresql.org/message-id/20210308.173242.463790587797836129.horikyota.ntt%40gmail.com
> https://www.postgresql.org/message-id/20210313012820.GJ29463@telsasoft.com
> https://www.postgresql.org/message-id/20220222231948.GJ9008@telsasoft.com

+ https://www.postgresql.org/message-id/c86ce84f-dd38-9951-102f-13a931210f52%40dunslane.net
On Thu, Jan 26, 2023 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> That test case is demonstrating fundamental
> database corruption after a crash.
>

Not exactly corruption.  The XID was not persisted, and the buffer
data did not hit disk.  The database is in a correct state.  It was
discussed long before WAL compression, here [0].  The thing is, it is
easier to reproduce with compression, but compression has nothing to
do with it, as far as I understand.

A proposed fix is here [1], but I think it's better to fix the test.
It should not verify the XID, but rather the side effects of "CREATE
TABLE mine(x integer);".

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/565FB155-C6B0-41E2-8C44-7B514DC25132%2540yandex-team.ru
[1] https://www.postgresql.org/message-id/flat/20210313012820.GJ29463%40telsasoft.com#0f18d3a4d593ea656fdc761e026fee81
Andrey Borodin <amborodin86@gmail.com> writes:
> On Thu, Jan 26, 2023 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That test case is demonstrating fundamental
>> database corruption after a crash.

> Not exactly corruption. XID was not persisted and buffer data did not
> hit a disk. Database is in the correct state.

Really?  I don't see how this part is even a little bit okay:

[00:40:50.744](0.046s) not ok 3 - xid is aborted after crash
[00:40:50.745](0.001s)
[00:40:50.745](0.000s) #   Failed test 'xid is aborted after crash'
#   at t/011_crash_recovery.pl line 57.
[00:40:50.746](0.001s) #          got: 'committed'
#     expected: 'aborted'

If any tuples made by that transaction had reached disk,
we'd have a problem.

			regards, tom lane
On Fri, Jan 27, 2023 at 11:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrey Borodin <amborodin86@gmail.com> writes:
> > On Thu, Jan 26, 2023 at 12:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> That test case is demonstrating fundamental
> >> database corruption after a crash.
>
> > Not exactly corruption. XID was not persisted and buffer data did not
> > hit a disk. Database is in the correct state.
>
> Really?  I don't see how this part is even a little bit okay:
>
> [00:40:50.744](0.046s) not ok 3 - xid is aborted after crash
> [00:40:50.745](0.001s)
> [00:40:50.745](0.000s) #   Failed test 'xid is aborted after crash'
> #   at t/011_crash_recovery.pl line 57.
> [00:40:50.746](0.001s) #          got: 'committed'
> #     expected: 'aborted'
>
> If any tuples made by that transaction had reached disk,
> we'd have a problem.

The problem is that the WAL wasn't flushed, allowing the same xid to
be allocated again after crash recovery.  But for any data pages to
hit the disk, we'd have to flush WAL first, so then it couldn't
happen, no?

FWIW, I also re-complained about the dangers of anyone relying on
pg_xact_status() for its stated purpose after seeing tanager's
failure[1].

[1] https://www.postgresql.org/message-id/CA%2BhUKGJ9p2JPPMA4eYAKq%3Dr9d_4_8vziet_tS1LEBbiny5-ypA%40mail.gmail.com
Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Jan 27, 2023 at 11:14 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If any tuples made by that transaction had reached disk,
>> we'd have a problem.

> The problem is that the WAL wasn't flushed, allowing the same xid to
> be allocated again after crash recovery.  But for any data pages to
> hit the disk, we'd have to flush WAL first, so then it couldn't
> happen, no?

Ah, now I get the point: the "committed xact" seen after restart isn't
the same one as we saw before the crash, but a new one that was given
the same XID because nothing about the old one had made it to disk yet.

> FWIW I also re-complained about the dangers of anyone
> relying on pg_xact_status() for its stated purpose after seeing
> tanager's failure[1].

Indeed, it seems like this behavior makes pg_xact_status() basically
useless as things stand.

			regards, tom lane
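[Editor's note: the failure mode described above can be illustrated with a toy model. This is plain Python, not PostgreSQL code; `ToyServer` and all of its fields are invented for illustration. The point is that the xid counter advances only in memory: unless the WAL covering the allocation is flushed, a crash rewinds the counter and a later transaction is handed the same xid, so the CLOG status the test reads belongs to a different transaction.]

```python
# Toy model of xid allocation without a WAL flush.  "next_xid_on_disk"
# stands in for the xid horizon recoverable from flushed WAL.

class ToyServer:
    def __init__(self):
        self.next_xid_on_disk = 729   # durable xid horizon (from flushed WAL)
        self.next_xid = self.next_xid_on_disk
        self.clog = {}                # xid -> 'committed' / 'aborted'

    def begin(self, flush_wal=False):
        xid = self.next_xid
        self.next_xid += 1
        if flush_wal:
            # Flushing WAL would make the allocation durable.
            self.next_xid_on_disk = self.next_xid
        return xid

    def commit(self, xid):
        self.clog[xid] = 'committed'

    def crash_and_recover(self):
        # Nothing about the unflushed allocation survives: rewind.
        self.next_xid = self.next_xid_on_disk

srv = ToyServer()
old_xid = srv.begin(flush_wal=False)  # session A gets xid 729, then the server crashes
srv.crash_and_recover()
new_xid = srv.begin()                 # a fresh transaction is handed xid 729 again
srv.commit(new_xid)

assert old_xid == new_xid == 729
# Asking for the status of the old xid now reports 'committed', even
# though session A's transaction never committed -- the test failure above.
assert srv.clog.get(old_xid) == 'committed'
```

Under this (simplified) model, both test expectations fail for the same reason: the post-restart xid is not greater, and the old xid appears committed.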
On Thu, Jan 26, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Indeed, it seems like this behavior makes pg_xact_status() basically
> useless as things stand.
>

If we agree that xid allocation is not something persistent, let's fix
the test?  We can replace the check with "select * from pg_class" or,
maybe, add an amcheck run.

As far as I recollect, this test was introduced to test this new
function in 857ee8e391f.

Best regards, Andrey Borodin.
Andrey Borodin <amborodin86@gmail.com> writes:
> On Thu, Jan 26, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Indeed, it seems like this behavior makes pg_xact_status() basically
>> useless as things stand.

> If we agree that xid allocation is not something persistent, let's fix
> the test?

If we're not going to fix this behavior, we need to fix the docs to
disclaim that pg_xact_status() is of use for what it's said to be
good for.

			regards, tom lane
On Thu, Jan 26, 2023 at 04:14:57PM -0800, Andrey Borodin wrote:
> If we agree that xid allocation is not something persistent, let's fix
> the test? We can replace a check with select * from pg_class or,
> maybe, add an amcheck run.
> As far as I recollect, this test was introduced to test this new
> function in 857ee8e391f.

My opinion would be to make this function more reliable, FWIW, even if
that involves a performance impact when it is called in a tight loop,
by forcing more WAL flushes to ensure the durability and consistency
of its reports.  As things stand, this is basically unreliable, and we
document it as something applications can *use*.  Adding a note in the
docs to say that this function can be unstable in some edge cases does
not make much sense to me, either.  Commit 857ee8e itself says that we
can use it if a database connection is lost, which could happen on a
crash.
--
Michael
On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Jan 26, 2023 at 04:14:57PM -0800, Andrey Borodin wrote:
> > If we agree that xid allocation is not something persistent, let's fix
> > the test? We can replace a check with select * from pg_class or,
> > maybe, add an amcheck run.
> > As far as I recollect, this test was introduced to test this new
> > function in 857ee8e391f.
>
> My opinion would be to make this function more reliable, FWIW, even if
> that involves a performance impact when called in a close loop by
> forcing more WAL flushes to ensure its report durability and
> consistency. As things stand, this is basically unreliable, and we
> document it as something applications can *use*.

Yeah, the other thread has a patch for that.  But it would hurt some
workloads.  A better patch would do some kind of amortisation
(reserving N xids at a time or some such scheme, while being careful
to make sure the right CLOG pages etc. exist if you crash and skip a
bunch of xids on recovery), but it would be more complicated.

For the record, back before release 13 added the 64 bit xid allocator,
these functions (or rather their txid_XXX ancestors) were broken in a
different way: they didn't track epochs reliably, the discovery of
which led to the new xid8-based functions.  So that might provide a
natural back-patching range, if a back-patchable solution can be
agreed on.
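[Editor's note: the amortisation idea mentioned above can be sketched in a toy allocator. This is not the actual patch; `RESERVE`, `AmortizedAllocator`, and the field names are all invented. The idea is to flush WAL once per batch of N xids rather than once per xid; after a crash, recovery skips to the end of the last reserved batch, wasting some xids but never reusing one.]

```python
# Toy sketch of batched xid reservation: one WAL flush covers RESERVE
# allocations, and crash recovery jumps past the whole reserved range.

RESERVE = 64  # hypothetical batch size

class AmortizedAllocator:
    def __init__(self):
        self.next_xid = 1000
        self.durable_horizon = 1000  # highest xid covered by flushed WAL
        self.wal_flushes = 0

    def allocate(self):
        xid = self.next_xid
        self.next_xid += 1
        if self.next_xid > self.durable_horizon:
            # Reserve a whole batch with a single flush, instead of
            # flushing once per allocated xid.
            self.durable_horizon = self.next_xid + RESERVE - 1
            self.wal_flushes += 1
        return xid

    def crash_and_recover(self):
        # Skip past everything that might have been handed out.  Some
        # xids are wasted, but none can be allocated twice.
        self.next_xid = self.durable_horizon + 1

a = AmortizedAllocator()
before = [a.allocate() for _ in range(10)]
flushes = a.wal_flushes          # one flush covered all 10 allocations
a.crash_and_recover()
after = a.allocate()

assert flushes == 1
assert after > max(before)       # no reuse after the crash
```

The caveat Thomas raises, making sure the CLOG pages for the skipped range exist, is exactly what this toy model glosses over.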
Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier <michael@paquier.xyz> wrote:
>> My opinion would be to make this function more reliable, FWIW, even if
>> that involves a performance impact when called in a close loop by
>> forcing more WAL flushes to ensure its report durability and
>> consistency.

> Yeah, the other thread has a patch for that.  But it would hurt some
> workloads.

I think we need to get the thing correct first and worry about
performance later.  What's wrong with simply making pg_xact_status
write and flush a record of the XID's existence before returning it?
Yeah, it will cost you if you use that function, but not if you don't.

> A better patch would do some kind of amortisation
> (reserving N xids at a time or some such scheme, while being careful
> to make sure the right CLOG pages etc exist if you crash and skip a
> bunch of xids on recovery) but be more complicated.

Maybe that would be appropriate for HEAD, but I'd be wary of adding
anything complicated to the back branches.  This is clearly a very
under-tested area.

			regards, tom lane
On Fri, Jan 27, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier <michael@paquier.xyz> wrote:
> >> My opinion would be to make this function more reliable, FWIW, even if
> >> that involves a performance impact when called in a close loop by
> >> forcing more WAL flushes to ensure its report durability and
> >> consistency.
>
> > Yeah, the other thread has a patch for that. But it would hurt some
> > workloads.
>
> I think we need to get the thing correct first and worry about
> performance later. What's wrong with simply making pg_xact_status
> write and flush a record of the XID's existence before returning it?
> Yeah, it will cost you if you use that function, but not if you don't.

It would be pg_current_xact_id() that would have to pay the cost of
the WAL flush, not pg_xact_status() itself, but yeah, that's what the
patch does (with some optimisations).  I guess one question is whether
there are any other reasonable real-world uses of
pg_current_xact_id(), other than the original goal[1].  If not, then
at least you are penalising the right users, even though they probably
only actually call pg_xact_status() in extremely rare circumstances
(if COMMIT hangs up).  But I wouldn't be surprised if people have
found other reasons to be interested in xid observability, related to
distributed transactions and snapshots and suchlike.  There is no
doubt that the current situation is unacceptable, though, so maybe we
really should just do it and make a faster one later.  Anyone else
want to vote on this?

[1] https://www.postgresql.org/message-id/flat/CAMsr%2BYHQiWNEi0daCTboS40T%2BV5s_%2Bdst3PYv_8v2wNVH%2BXx4g%40mail.gmail.com
Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Jan 27, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think we need to get the thing correct first and worry about
>> performance later. What's wrong with simply making pg_xact_status
>> write and flush a record of the XID's existence before returning it?
>> Yeah, it will cost you if you use that function, but not if you don't.

> It would be pg_current_xact_id() that would have to pay the cost of
> the WAL flush, not pg_xact_status() itself,

Right, typo on my part.

			regards, tom lane
On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:
> On Fri, Jan 27, 2023 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > On Fri, Jan 27, 2023 at 1:30 PM Michael Paquier <michael@paquier.xyz> wrote:
> > > > My opinion would be to make this function more reliable, FWIW, even if
> > > > that involves a performance impact when called in a close loop by
> > > > forcing more WAL flushes to ensure its report durability and
> > > > consistency.
> > >
> > > Yeah, the other thread has a patch for that. But it would hurt some
> > > workloads.
> >
> > I think we need to get the thing correct first and worry about
> > performance later. What's wrong with simply making pg_xact_status
> > write and flush a record of the XID's existence before returning it?
> > Yeah, it will cost you if you use that function, but not if you don't.
>
> There is no
> doubt that the current situation is unacceptable, though, so maybe we
> really should just do it and make a faster one later. Anyone else
> want to vote on this?

I wasn't aware of the existence of pg_xact_status, so I suspect that
it is not a widely known and used feature.  After reading the
documentation, I'd say that anybody who uses it will want it to give a
reliable answer.  So I'd agree that it is better to make it more
expensive, but live up to its promise.

Yours,
Laurenz Albe
On Fri, Jan 27, 2023 at 06:06:05AM +0100, Laurenz Albe wrote:
> On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:
>> There is no
>> doubt that the current situation is unacceptable, though, so maybe we
>> really should just do it and make a faster one later. Anyone else
>> want to vote on this?
>
> I wasn't aware of the existence of pg_xact_status, so I suspect that it
> is not a widely known and used feature. After reading the documentation,
> I'd say that anybody who uses it will want it to give a reliable answer.
> So I'd agree that it is better to make it more expensive, but live up to
> its promise.

A code search within the Debian packages (codesearch.debian.net) and
GitHub does not show that it is actually used; pg_xact_status() shows
up only in copies of the Postgres code in the regression tests.

FWIW, my vote goes for a more expensive but reliable function, even in
stable branches.  Even 857ee8e mentions that this could be used on a
lost connection, so we don't even satisfy the use case of the original
commit as things stand (right?), because a lost connection could just
be the result of a crash, and if crash recovery reassigns the XID,
then the client gets it wrong.
--
Michael
Hi,

On 2023-01-27 16:15:08 +1300, Thomas Munro wrote:
> It would be pg_current_xact_id() that would have to pay the cost of
> the WAL flush, not pg_xact_status() itself, but yeah that's what the
> patch does (with some optimisations). I guess one question is whether
> there are any other reasonable real world uses of
> pg_current_xact_id(), other than the original goal[1].

txid_current() is a lot older than pg_current_xact_id(), and they're
backed by the same code afaict.  8.4, I think.

Unfortunately, txid_current() is used in plenty of monitoring setups IME.

I don't think it's a good idea to make a function that was quite cheap
for 15 years suddenly be several orders of magnitude more expensive...

Greetings,

Andres Freund
On Fri, Jan 27, 2023, 18:58 Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2023-01-27 16:15:08 +1300, Thomas Munro wrote:
> > It would be pg_current_xact_id() that would have to pay the cost of
> > the WAL flush, not pg_xact_status() itself, but yeah that's what the
> > patch does (with some optimisations). I guess one question is whether
> > there are any other reasonable real world uses of
> > pg_current_xact_id(), other than the original goal[1].
>
> txid_current() is a lot older than pg_current_xact_id(), and they're backed by
> the same code afaict. 8.4 I think.
>
> Unfortunately txid_current() is used in plenty of monitoring setups IME.
>
> I don't think it's a good idea to make a function that was quite cheap for 15
> years, suddenly be several orders of magnitude more expensive...
As someone working on a monitoring tool that uses it (well, both), +1. We'd have to rethink a few things if this becomes a performance concern.
Thanks,
Maciek
Hi,

On 2023-01-28 11:38:50 +0900, Michael Paquier wrote:
> On Fri, Jan 27, 2023 at 06:06:05AM +0100, Laurenz Albe wrote:
> > On Fri, 2023-01-27 at 16:15 +1300, Thomas Munro wrote:
> >> There is no
> >> doubt that the current situation is unacceptable, though, so maybe we
> >> really should just do it and make a faster one later. Anyone else
> >> want to vote on this?
> >
> > I wasn't aware of the existence of pg_xact_status, so I suspect that it
> > is not a widely known and used feature. After reading the documentation,
> > I'd say that anybody who uses it will want it to give a reliable answer.
> > So I'd agree that it is better to make it more expensive, but live up to
> > its promise.
>
> A code search within the Debian packages (codesearch.debian.net) and
> GitHub does not show that it is actually used; pg_xact_status() shows
> up only in copies of the Postgres code in the regression tests.

Not finding a user at codesearch.debian.net provides useful
information for C APIs, but a negative result for an SQL-exposed
function doesn't provide any information.  Those callers will largely
be in application code, which largely won't be in Debian.

And as noted two messages up, we wouldn't need to flush in
pg_xact_status(); we'd need to flush in
pg_current_xact_id()/txid_current().

> FWIW, my vote goes for a more expensive but reliable function even in
> stable branches.

I very strenuously object.  If we make txid_current() (by way of
pg_current_xact_id()) flush WAL, we'll cause outages.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-01-28 11:38:50 +0900, Michael Paquier wrote:
>> FWIW, my vote goes for a more expensive but reliable function even in
>> stable branches.

> I very strenuously object. If we make txid_current() (by way of
> pg_current_xact_id()) flush WAL, we'll cause outages.

What are you using it for, that you don't care whether the answer
is trustworthy?

			regards, tom lane
Hi,

On 2023-01-27 22:39:56 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-01-28 11:38:50 +0900, Michael Paquier wrote:
> >> FWIW, my vote goes for a more expensive but reliable function even in
> >> stable branches.
>
> > I very strenuously object. If we make txid_current() (by way of
> > pg_current_xact_id()) flush WAL, we'll cause outages.
>
> What are you using it for, that you don't care whether the answer
> is trustworthy?

It's quite commonly used as part of trigger-based replication tools
(IIRC that's its origin), monitoring, client-side logging, and
snapshot management.

txid_current() predates pg_xact_status() by well over 10 years.
Clearly we had lots of uses for it before pg_xact_status() was around.

Greetings,

Andres Freund
On Fri, Jan 27, 2023 at 7:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> What are you using it for, that you don't care whether the answer
> is trustworthy?
>

It's not trustworthy anyway.  Xid wraparound might happen during a
reconnect.  I suspect we can design a test showing that it does not
always give correct results during xid->2PC conversion (there is a
point in time when the xid is neither in the regular proc array nor in
2PC, and I'm not sure ProcArrayLock is held).  Maybe there are other
edge cases.

Anyway, if a user wants to know the status of an xid in case of a
disconnection, they have prepared xacts.

Best regards, Andrey Borodin.
Hi,

On 2023-01-27 19:49:17 -0800, Andres Freund wrote:
> It's quite commonly used as part of trigger based replication tools (IIRC
> that's its origin), monitoring, as part of client side logging, as part of
> snapshot management.

Forgot one: queues.

The way it's used for trigger-based replication, queues, and also some
materialized aggregation tooling is that there's a trigger that
inserts into a "log" table, and that log table has a column into which
txid_current() will be inserted.  Together with txid_current_snapshot()
etc., that's used to get a (at least semi) "transactional" order out
of such log tables.

I believe that's originally been invented by londiste / skytools;
later slony migrated to it.  The necessary C code was added as
contrib/txid in 1f92630fc4e on 2007-10-07 and then moved into core a
few days later in 18e3fcc31e7.

For those cases, making txid_current() flush would approximately
double the WAL flush rate.

Greetings,

Andres Freund
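[Editor's note: the londiste/slony-style batching Andres describes can be sketched in a toy model. This is invented, heavily simplified Python, not any of those tools' actual code: log rows carry the writer's xid, and a consumer takes a snapshot (xmin, xmax, in-progress xids) and processes only rows whose writing transaction is certainly finished, deferring the rest to a later batch.]

```python
# Toy sketch of snapshot-based batching over a log table.

def consumable(rows, snapshot):
    """rows: list of (xid, payload); snapshot: (xmin, xmax, in-progress xids).

    xmin is the oldest transaction still running when the snapshot was
    taken, so any row with xid < xmin was written by a transaction that
    has definitely finished and can be consumed in this batch.
    """
    xmin, xmax, xip = snapshot
    ready = [(x, p) for x, p in rows if x < xmin]      # certainly finished
    deferred = [(x, p) for x, p in rows if x >= xmin]  # possibly in progress
    return ready, deferred

log = [(101, 'a'), (105, 'b'), (103, 'c'), (107, 'd')]
snap = (103, 110, {103, 105})   # xid 103 is the oldest still-running xact
ready, deferred = consumable(log, snap)

assert [x for x, _ in ready] == [101]
assert sorted(x for x, _ in deferred) == [103, 105, 107]
```

This also shows why per-allocation WAL flushes would hurt here: every trigger-logged row allocates an xid, so the flush cost would be paid on the hot write path, not just when someone later asks about a transaction's status.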
Hi,

On 2023-01-27 19:57:35 -0800, Andrey Borodin wrote:
> On Fri, Jan 27, 2023 at 7:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > What are you using it for, that you don't care whether the answer
> > is trustworthy?
> >
> It's not trustworthy anyway. Xid wraparound might happen during
> reconnect.

I think that part would be approximately fine, as long as you can live
with an answer of "too old".  The xid returned by
txid_current()/pg_current_xact_id() is 64 bit, and there is code to
verify that the relevant range is covered by the clog.

However - there's nothing preventing the xid from becoming too old in
case of a crash.  If you have an open connection, you can prevent the
clog from being truncated by holding an open snapshot.  But you can't
really do that if you want to handle crashes, without using e.g. 2PC -
obviously snapshots don't survive them.

I really don't think txid_status() can be used for anything but
informational probing of the clog / procarray.

> I suspect we can design a test that will show that it does not always show
> correct results during xid->2pc conversion (there is a point in time when
> xid is not in regular and not in 2pc, and I'm not sure ProcArrayLock is
> held). Maybe there are other edge cases.

Unless I am missing something, that would be very bad [TM], completely
independent of txid_status().  The underlying functions like
TransactionIdIsInProgress() are used for MVCC.

Greetings,

Andres Freund
On Sat, Jan 28, 2023 at 4:57 PM Andrey Borodin <amborodin86@gmail.com> wrote:
> It's not trustworthy anyway. Xid wraparound might happen during
> reconnect. I suspect we can design a test that will show that it does
> not always show correct results during xid->2pc conversion (there is a
> point in time when xid is not in regular and not in 2pc, and I'm not
> sure ProcArrayLock is held). Maybe there are other edge cases.

I'm not sure I understand the edge cases, but it is true that this can
only give you the answer until the CLOG is truncated, which is pretty
arbitrary, and you could be unlucky.  I guess a reliable version of
this would have new policies about CLOG retention, and CLOG segment
filenames derived from 64 bit xids so they don't wrap around.

> Anyway, if a user wants to know the status of xid in case of
> disconnection they have prepared xacts.

Yeah.  The original proposal mentioned that, but said that this was a
"lighter" alternative.  Reading Andres's comments and realising how
relatively young txid_status() is compared to txid_current(), I'm now
wondering if we shouldn't just disclaim the whole thing in the back
branches.  Maybe if we want to rescue it in master, there could be a
"reliable" argument, defaulting to false, or whatever, and we could
eventually make the amortisation improvement.
Thomas Munro <thomas.munro@gmail.com> writes:
> Reading Andres's comments and realising how relatively young
> txid_status() is compared to txid_current(), I'm now wondering if we
> shouldn't just disclaim the whole thing in back branches.

My thoughts were trending in that direction too.  It's starting
to sound like we aren't going to be able to make a fix that
we'd be willing to risk back-patching, even if it were completely
compatible at the user level.

Still, the idea that txid_status() isn't trustworthy is rather
scary.  I wonder whether there is a failure mode here that's
exhibitable without using that.

			regards, tom lane
On Sat, Jan 28, 2023 at 12:02:23AM -0500, Tom Lane wrote:
> My thoughts were trending in that direction too. It's starting
> to sound like we aren't going to be able to make a fix that
> we'd be willing to risk back-patching, even if it were completely
> compatible at the user level.
>
> Still, the idea that txid_status() isn't trustworthy is rather
> scary. I wonder whether there is a failure mode here that's
> exhibitable without using that.

Okay, as far as I can see, the consensus would be to not do anything
about the performance impact of these functions:
20210305.115011.558061052471425531.horikyota.ntt@gmail.com

Three of my buildfarm machines are unstable because of that; they need
something for stable branches as well, so I'd like to lay out the
options.  Based on what's been mentioned, we can:
1) tweak the test with an extra checkpoint to make sure that the XIDs
   are flushed, like in the patch posted on [1].
2) tweak the test to rely on the state of the table, as mentioned by
   Andrey.
3) remove the test entirely, because as introduced it does not
   actually test what it should.

2) is not really interesting, IMO, because the test checks for two
things:
- an in-progress XID, which we already cover in the main regression
  test suite.
- a post-crash state, and switching to an approach where some data is
  for example scanned is no different from a lot of the other recovery
  tests.
1) means more test cycles, and perhaps we could enforce compression of
WAL while on it?  At the end, my vote would just go for 3) and drop
the whole scenario, though there may be an argument for 1).

[1]: https://www.postgresql.org/message-id/20210305.115011.558061052471425531.horikyota.ntt@gmail.com
--
Michael
On Mon, Jan 30, 2023 at 02:57:13PM +0900, Michael Paquier wrote:
> 1) means more test cycles, and perhaps we could enforce compression of
> WAL while on it? At the end, my vote would just go for 3) and drop
> the whole scenario, though there may be an argument for 1).

And actually, I was under the impression that 1) is not completely
stable either, because the test relies on the return result of
txid_current() with IPC::Run::start, so a checkpoint forcing a flush
may not be able to do its work.  In order to bring all my animals back
to green, I have removed the test.
--
Michael