Discussion: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
Hi,

The buildfarm identified an issue at [1] where the system could not detect that a partial WAL record spanning two pages had been written before an immediate shutdown.

Consider a scenario where a WAL record is split across multiple WAL pages. If the server crashes before the entire WAL record is written, the recovery process detects the incomplete (broken) record. In such cases, a special flag, XLP_FIRST_IS_OVERWRITE_CONTRECORD, is set in the page header to indicate this condition.

When we attempt to retrieve changes using a logical replication slot, the system reads WAL data based on the record header. In one specific instance, a total of 8133 bytes were required to reconstruct the WAL record. Of this, 277 bytes were expected to be present in the subsequent WAL page. However, only 248 bytes were available, because no new WAL records had been generated: no operations were performed on the system after the crash. As a result, the slot_get_changes function continuously checks whether the current WAL flush position has reached 0x29004115 (the offset corresponding to the required 277 bytes in the next WAL page). Since no new WAL is being generated, the flush position always returns 0x290040F8 (the 248-byte offset), causing the function to wait indefinitely in the read_local_xlog_page_guts function.

Currently, the logic attempts to read the complete WAL record based on the size obtained before the crash, even though only a partial record was written. It then checks the page header to determine whether the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set only after reading the complete WAL record in the XLogDecodeNextRecord function, but since that much WAL data was not available in the system, we never get a chance to check the header after this. To address this issue, a more robust approach would be to first read the page header, check if the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed to read the WAL record size if the record is not marked as a partial overwrite. This would prevent the system from waiting for WAL data that will never arrive. Attached partial_wal_record_fix.patch for this.

I don't have a consistent test to reproduce this issue; currently it can be reproduced by running the 046_checkpoint_logical_slot test about 50 times. This test was reverted recently after it caused a few buildfarm failures discussed at [2]. The same test is also attached as reverted_test_046_checkpoint_logical_slot.patch.

I have added the members who were involved in the discussion of this issue.

Thoughts?

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2025-06-19+23%3A47%3A08
[2] - https://www.postgresql.org/message-id/CAPpHfdsO9s5he3xHWNFtwvXtvsftu3nNz%3DPV9fdN32UOh-Z3tA%40mail.gmail.com

Regards,
Vignesh
Attachments
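As an illustration of the hang described above, the loop below paraphrases the wait performed by the page-read callback (modeled loosely on read_local_xlog_page_guts(); the variable names and the literal control flow are illustrative, not the actual source). The reader keeps comparing the LSN it needs against the current flush pointer, and since the tail of the record was never written, the flush pointer can never reach it:

```c
/*
 * Illustration only, paraphrasing the wait inside the local page-read
 * callback; the LSNs are the ones from the report above.
 */
XLogRecPtr  target = 0x29004115;        /* end of the data the reader asked for */

for (;;)
{
    /* flush pointer stays at 0x290040F8: the record's tail was never written */
    XLogRecPtr  flushed = GetFlushRecPtr(NULL);

    if (target <= flushed)
        break;                          /* never true, so we loop forever */

    /* sleep on a latch and retry */
}
```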
Re: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: Michael Paquier
On Wed, Jun 25, 2025 at 10:19:55PM +0530, vignesh C wrote:
> Currently, the logic attempts to read the complete WAL record based on the size obtained before the crash, even though only a partial record was written. It then checks the page header to determine whether the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set only after reading the complete WAL record in the XLogDecodeNextRecord function, but since that much WAL data was not available in the system, we never get a chance to check the header after this. To address this issue, a more robust approach would be to first read the page header, check if the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed to read the WAL record size if the record is not marked as a partial overwrite. This would prevent the system from waiting for WAL data that will never arrive. Attached partial_wal_record_fix.patch for this.

So you are suggesting the addition of an extra ReadPageInternal() that forces a read of only the header, performs the checks on the header, then reads the rest. After reading SizeOfXLogShortPHD worth of data, shouldn't the checks on xlp_rem_len be done a bit earlier than what you are proposing in this patch?

> I don't have a consistent test to reproduce this issue; currently it can be reproduced by running the 046_checkpoint_logical_slot test about 50 times. This test was reverted recently after it caused a few buildfarm failures discussed at [2]. The same test is also attached as reverted_test_046_checkpoint_logical_slot.patch.

Seeing the noise that this originally created in the CFbot and the buildfarm, even if the issue would only be triggered after a timeout, I'd vote for relying on this test being good enough for the purpose of this race condition. Another reliable approach would be to make the code wait before reading the record in the internal loop of ReadPageInternal() with an injection point when we know that we have a contrecord, but I'm not really excited about this prospect in xlogreader.c, which can also be used in the frontend.

--
Michael
Attachments
RE: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: "Hayato Kuroda (Fujitsu)"
Dear Michael, Vignesh,

> On Wed, Jun 25, 2025 at 10:19:55PM +0530, vignesh C wrote:
> > Currently, the logic attempts to read the complete WAL record based on the size obtained before the crash, even though only a partial record was written. It then checks the page header to determine whether the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set only after reading the complete WAL record in the XLogDecodeNextRecord function, but since that much WAL data was not available in the system, we never get a chance to check the header after this. To address this issue, a more robust approach would be to first read the page header, check if the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed to read the WAL record size if the record is not marked as a partial overwrite. This would prevent the system from waiting for WAL data that will never arrive. Attached partial_wal_record_fix.patch for this.
>
> So you are suggesting the addition of an extra ReadPageInternal() that forces a read of only the header, performs the checks on the header, then reads the rest. After reading SizeOfXLogShortPHD worth of data, shouldn't the checks on xlp_rem_len be done a bit earlier than what you are proposing in this patch?

I have a concern from the performance perspective. This approach must read the page twice in all cases, right? The workaround is needed only for the corner case but affects every pass. Or is it actually negligible?

> Another reliable approach would be to make the code wait before reading the record in the internal loop of ReadPageInternal() with an injection point when we know that we have a contrecord, but I'm not really excited about this prospect in xlogreader.c, which can also be used in the frontend.

Per my understanding, an injection point must be added while flushing a WAL record, to emulate the incomplete-WAL-record issue. To confirm, how can it be used in ReadPageInternal()?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Jun 26, 2025 at 6:22 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Jun 25, 2025 at 10:19:55PM +0530, vignesh C wrote:
> > Currently, the logic attempts to read the complete WAL record based on the size obtained before the crash, even though only a partial record was written. It then checks the page header to determine whether the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set only after reading the complete WAL record in the XLogDecodeNextRecord function, but since that much WAL data was not available in the system, we never get a chance to check the header after this. To address this issue, a more robust approach would be to first read the page header, check if the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed to read the WAL record size if the record is not marked as a partial overwrite. This would prevent the system from waiting for WAL data that will never arrive. Attached partial_wal_record_fix.patch for this.

Yeah, this is a problem. I am not sure at the moment that I can think of anything better than just reading the header first and checking the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag.

> So you are suggesting the addition of an extra ReadPageInternal() that forces a read of only the header, performs the checks on the header, then reads the rest. After reading SizeOfXLogShortPHD worth of data, shouldn't the checks on xlp_rem_len be done a bit earlier than what you are proposing in this patch?

I did not get the point; IMHO it has to be validated after the record on the next page has been read.

--
Regards,
Dilip Kumar
Google
On Thu, Jun 26, 2025 at 11:47 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Jun 26, 2025 at 08:48:32AM +0530, Dilip Kumar wrote:
> > On Thu, Jun 26, 2025 at 6:22 AM Michael Paquier <michael@paquier.xyz> wrote:
> >> So you are suggesting the addition of an extra ReadPageInternal() that forces a read of only the header, performs the checks on the header, then reads the rest. After reading SizeOfXLogShortPHD worth of data, shouldn't the checks on xlp_rem_len be done a bit earlier than what you are proposing in this patch?
> >
> > I did not get the point; IMHO it has to be validated after the record on the next page has been read.
>
> As proposed in the patch, the first ReadPageInternal() reads SizeOfXLogShortPHD worth of data (right?), so we should have access to XLogPageHeaderData.xlp_rem_len before attempting the second ReadPageInternal() for Min(total_len - gotlen + SizeOfXLogShortPHD, XLOG_BLCKSZ) worth of data. The check that fails with "invalid contrecord length %u" relies on xlp_rem_len, total_len, RecPtr and gotlen. We know all this information before the second ReadPageInternal().

Yeah, you are right, this can be moved before the second ReadPageInternal(), as we have already read the XLogPageHeader.

--
Regards,
Dilip Kumar
Google
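To make the ordering agreed on above concrete, here is an abbreviated sketch of the contrecord branch of XLogDecodeNextRecord() with the proposed change (paraphrased from xlogreader.c; error-message details, the nonblocking return path, and the enclosing loop are omitted, and the committed patch may differ in detail):

```c
/* 1. Ask only for the short page header of the continuation page. */
readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogShortPHD);
if (readOff < 0)
    goto err;

pageHeader = (XLogPageHeader) state->readBuf;

/*
 * 2. If the rest of the record was abandoned after a crash, the flag tells
 *    us it will never be written, so restart from here instead of waiting
 *    for WAL that cannot arrive.
 */
if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
{
    state->overwrittenRecPtr = RecPtr;
    RecPtr = targetPagePtr;
    goto restart;
}

/*
 * 3. The continuation must be present, and xlp_rem_len must account for
 *    exactly the bytes still missing; this is the check that previously ran
 *    only after the full-size read.
 */
if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD) ||
    pageHeader->xlp_rem_len == 0 ||
    total_len != (pageHeader->xlp_rem_len + gotlen))
    goto err;                   /* report "invalid contrecord length" */

/* 4. Only now request the remaining bytes of the record from this page. */
readOff = ReadPageInternal(state, targetPagePtr,
                           Min(total_len - gotlen + SizeOfXLogShortPHD,
                               XLOG_BLCKSZ));
```

The point of the reordering is that both the overwrite check and the xlp_rem_len cross-check only need SizeOfXLogShortPHD bytes, so they can run before the reader commits to waiting for the rest of the record.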
RE: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: "Hayato Kuroda (Fujitsu)"
> > Another reliable approach would be to make the code wait before reading the record in the internal loop of ReadPageInternal() with an injection point when we know that we have a contrecord, but I'm not really excited about this prospect in xlogreader.c, which can also be used in the frontend.
>
> Per my understanding, an injection point must be added while flushing a WAL record, to emulate the incomplete-WAL-record issue. To confirm, how can it be used in ReadPageInternal()?

I've spent time on how we could use an injection point to cause the same situation, which generates the OVERWRITE_CONTRECORD in the middle of a page, but it seems difficult. XLogFlush()->XLogWrite() has the responsibility to flush WAL records, but it does not write/flush per page. It tries to write as many pages as possible and flushes the result at once. A corner case is when the segment is changed, but this is not the same situation in which we observed the failure.

So... I have no idea how to create a deterministic reproducer; it is OK for me to use the 046 test for the purpose.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Re: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: Michael Paquier
On Thu, Jun 26, 2025 at 09:40:05AM +0000, Hayato Kuroda (Fujitsu) wrote:
> I've spent time on how we could use an injection point to cause the same situation, which generates the OVERWRITE_CONTRECORD in the middle of a page, but it seems difficult.

Hmm. I was wondering first about detecting an OVERWRITE_CONTRECORD while looping in the loop of XLogDecodeNextRecord() with an IS_INJECTION_POINT_ATTACHED(), then doing a wait. But this method means that the injection point is linked to the fix we are trying to bundle here, because it would need to be placed between the two ReadPageInternal() calls, as we need to read the header first. Usually I try to place points so that they are generic enough, but that's a vague philosophy.

> So... I have no idea how to create a deterministic reproducer; it is OK for me to use the 046 test for the purpose.

Let's reuse 046 and call it a day. It would not be the first reproducer that fails on timeout if a problem arises. The important fact is to be informed about the failure, while keeping tests fast in the "normal" cases. We have a few tests like that already in the tree, where a timeout is the best thing we can rely on when the code fails its test.

--
Michael
Attachments
Re: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: Michael Paquier
On Thu, Jun 26, 2025 at 05:25:42PM +0530, vignesh C wrote:
> On Thu, 26 Jun 2025 at 06:22, Michael Paquier <michael@paquier.xyz> wrote:
>> So you are suggesting the addition of an extra ReadPageInternal() that forces a read of only the header, performs the checks on the header, then reads the rest. After reading SizeOfXLogShortPHD worth of data, shouldn't the checks on xlp_rem_len be done a bit earlier than what you are proposing in this patch?
>
> Modified

It seems to me that this assert can be moved after the second page read:
    Assert(SizeOfXLogShortPHD <= readOff);

Coming back to the point of Kuroda-san about performance, did you do some checks related to that and did you measure any difference? I suspect none of that, because in most cases we are just going to fetch the next page and we would trigger the fast-exit path of ReadPageInternal() on the second call when fetching the rest. I still need to get an idea of all that by myself, probably with various lengths of logical message records.

Perhaps this code could be improved in the future with fewer page reads. Anyway, what you are doing here is simple enough that it is a no-brainer for the back-branches, because we are just forcing our way through with a new short header validation, so logically that's sound as far as I can see.

--
Michael
Attachments
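For readers wondering why the extra call is cheap, the "fast-exit path" referred to above is the cache check at the top of ReadPageInternal(): the page-read callback usually returns more than the handful of header bytes that were requested, so the second call finds the data already buffered and returns without invoking the callback again. Roughly (paraphrased; the field names follow recent xlogreader.c and are shown only to illustrate the idea):

```c
/*
 * Paraphrase of the fast-exit check at the start of ReadPageInternal():
 * if the requested page is the one already held in state->readBuf and the
 * caller asks for no more bytes than were read last time, return at once
 * without calling the page_read callback again.
 */
if (targetSegNo == state->seg.ws_segno &&
    targetPageOff == state->segoff &&
    reqLen <= state->readLen)
    return state->readLen;      /* fast exit: no second physical read */
```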
On Fri, 27 Jun 2025 at 07:05, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Jun 26, 2025 at 05:25:42PM +0530, vignesh C wrote:
> > On Thu, 26 Jun 2025 at 06:22, Michael Paquier <michael@paquier.xyz> wrote:
> >> So you are suggesting the addition of an extra ReadPageInternal() that forces a read of only the header, performs the checks on the header, then reads the rest. After reading SizeOfXLogShortPHD worth of data, shouldn't the checks on xlp_rem_len be done a bit earlier than what you are proposing in this patch?
> >
> > Modified
>
> It seems to me that this assert can be moved after the second page read:
>     Assert(SizeOfXLogShortPHD <= readOff);
>
> Coming back to the point of Kuroda-san about performance, did you do some checks related to that and did you measure any difference? I suspect none of that, because in most cases we are just going to fetch the next page and we would trigger the fast-exit path of ReadPageInternal() on the second call when fetching the rest. I still need to get an idea of all that by myself, probably with various lengths of logical message records.
>
> Perhaps this code could be improved in the future with fewer page reads. Anyway, what you are doing here is simple enough that it is a no-brainer for the back-branches, because we are just forcing our way through with a new short header validation, so logically that's sound as far as I can see.

I was unable to reproduce the same test failure on the PG17 branch, even after running the test around 500 times. However, on the master branch, the failure consistently reproduces approximately once in every 50 runs. I also noticed that while the buildfarm has reported multiple failures for this test on the master branch, none of them appear to be on the PG17 branch. I'm not yet sure why this discrepancy exists.

Regards,
Vignesh
On Fri, Jun 27, 2025 at 9:29 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 27 Jun 2025 at 07:05, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Jun 26, 2025 at 05:25:42PM +0530, vignesh C wrote:
> > > On Thu, 26 Jun 2025 at 06:22, Michael Paquier <michael@paquier.xyz> wrote:
> > >> So you are suggesting the addition of an extra ReadPageInternal() that forces a read of only the header, performs the checks on the header, then reads the rest. After reading SizeOfXLogShortPHD worth of data, shouldn't the checks on xlp_rem_len be done a bit earlier than what you are proposing in this patch?
> > >
> > > Modified
> >
> > It seems to me that this assert can be moved after the second page read:
> >     Assert(SizeOfXLogShortPHD <= readOff);
>
> I felt this Assert can stay here to ensure we've read SizeOfXLogShortPHD before checking the page header contents. But for the second page read, the following existing Assert should be enough:
>     Assert(pageHeaderSize <= readOff);

Thanks, makes sense.

> > Coming back to the point of Kuroda-san about performance, did you do some checks related to that and did you measure any difference? I suspect none of that, because in most cases we are just going to fetch the next page and we would trigger the fast-exit path of ReadPageInternal() on the second call when fetching the rest. I still need to get an idea of all that by myself, probably with various lengths of logical message records.
>
> The test execution times are measured in microseconds. In the results table below, the first row indicates the message size, and each value represents the median of 5 test runs. The attached script was used to run these tests. In the attached script, the MSG_SIZE variable should be changed to 1000 (1 page), 10000 (about 2 pages), 25000 (about 3 pages), 50000 (about 6 pages), 100000 (about 12 pages), 1000000 (about 122 pages), 10000000 (about 1220 pages) and 100000000 (about 12207 pages) before each run. Test execution times can be taken from the run_*.dat files that will be generated.
>
> Size  | 1000   | 10000  | 25000   | 50000   | 100000  | 1000000
> ------|--------|--------|---------|---------|---------|--------
> Head  | 9297.1 | 9895.4 | 10844.2 | 12946.5 | 16945.1 | 86187.1
> Patch | 9222.7 | 9889   | 10897.1 | 12904.2 | 16858.4 | 87115.5
>
> Size  | 10000000 | 100000000
> ------|----------|----------
> HEAD  | 804965.6 | 331639.7
> Patch | 804942.6 | 321198.6
>
> The performance results show that the patch does not introduce any noticeable overhead across varying message sizes; I felt there was no impact because of the additional page header read.

Yeah, that doesn't seem like a regression.

--
Regards,
Dilip Kumar
Google
RE: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: "Hayato Kuroda (Fujitsu)"
Dear Vignesh,

> I was unable to reproduce the same test failure on the PG17 branch, even after running the test around 500 times. However, on the master branch, the failure consistently reproduces approximately once in every 50 runs. I also noticed that while the buildfarm has reported multiple failures for this test on the master branch, none of them appear to be on the PG17 branch. I'm not yet sure why this discrepancy exists.

I was also not able to reproduce it as-is. After analyzing a bit more, I found that on PG17 the workload cannot generate an FPI_FOR_HINT. That type of WAL record is longer than a page, so on HEAD there was a possibility that the WAL record could be flushed partially; on PG17 that cannot happen, so the OVERWRITE_CONTRECORD does not appear.

I modified the test code as in [1] and confirmed that the same hang can happen on PG17. It generates a long record which crosses a page boundary and can be flushed partially.

[1]:
```
--- a/src/test/recovery/t/046_checkpoint_logical_slot.pl
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -123,6 +123,10 @@ $node->safe_psql('postgres',
 $node->safe_psql('postgres',
 	q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});

+# Generate a long WAL record
+$node->safe_psql('postgres',
+	q{select pg_logical_emit_message(false, '', repeat('123456789', 1000))});
+
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Re: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: Michael Paquier
On Mon, Jun 30, 2025 at 01:21:24PM +0530, Dilip Kumar wrote:
> On Fri, Jun 27, 2025 at 9:29 PM vignesh C <vignesh21@gmail.com> wrote:
>> The test execution times are measured in microseconds. In the results table below, the first row indicates the message size, and each value represents the median of 5 test runs. The attached script was used to run these tests. In the attached script, the MSG_SIZE variable should be changed to 1000 (1 page), 10000 (about 2 pages), 25000 (about 3 pages), 50000 (about 6 pages), 100000 (about 12 pages), 1000000 (about 122 pages), 10000000 (about 1220 pages) and 100000000 (about 12207 pages) before each run. Test execution times can be taken from the run_*.dat files that will be generated.
>>
>> Size  | 1000   | 10000  | 25000   | 50000   | 100000  | 1000000
>> ------|--------|--------|---------|---------|---------|--------
>> Head  | 9297.1 | 9895.4 | 10844.2 | 12946.5 | 16945.1 | 86187.1
>> Patch | 9222.7 | 9889   | 10897.1 | 12904.2 | 16858.4 | 87115.5
>>
>> Size  | 10000000 | 100000000
>> ------|----------|----------
>> HEAD  | 804965.6 | 331639.7
>> Patch | 804942.6 | 321198.6
>>
>> The performance results show that the patch does not introduce any noticeable overhead across varying message sizes; I felt there was no impact because of the additional page header read.

That's encouraging, yes, thanks for these numbers. I'll try to get some numbers myself next week. One thing that really stresses me with this patch is that we have to do this additional header check even in the physical replication path, but as we're just adding it for the case where a record needs to be reconstructed across two pages, perhaps I'm just worrying too much.

> Yeah, that doesn't seem like a regression.

Yep.

--
Michael
Attachments
Re: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: Michael Paquier
On Mon, Jun 30, 2025 at 05:51:51PM +0530, vignesh C wrote:
> Thanks, Kuroda-san. I've prepared a similar test that doesn't rely on injection points. The issue reproduced consistently across all branches up to PG13. You can use the attached 049_slot_get_changes_wait_continously_pg17.pl script (found in the 049_slot_get_changes_wait_continously_pg17.zip file) to verify this. Just copy the script to src/test/recovery and run the test to observe the problem.

+# Abruptly stop the server (1 second should be enough for the checkpoint
+# to finish; it would be better).
+$node->stop('immediate');

While v3-0001 is good enough to reproduce the original issue after a few dozen runs here, we cannot reintroduce the test without the proper wait_for_log() logic making sure that the checkpoint is completed before we stop the server.

I like the addition of an extra pg_logical_emit_message() in test 046 anyway, down to v17, for all the branches. Even if the reproduction is sporadic, we have seen it pretty quickly in the CI and in the buildfarm, so it would not go unnoticed for a long time if we mess up with this stuff again.

We are lucky enough that the existing test does not fail in v17, TBH, so let's tackle all the issues on this thread step by step:
1) Fix: let's add the header check.
2) Let's reintroduce a fixed version of 046 on HEAD and on v18, with an extra pg_logical_emit_message() that travels across WAL pages, forcing a reconstruction of the record and the extra header check.
3) Let's fix recovery test 046 currently in v17, for the checkpoint wait logic and the extra pg_logical_emit_message().
4) Let's improve the stability of 047 everywhere for the checkpoint wait logic, as already done by Alexander.

I have been doing some tests with the patch vs HEAD (thanks for the script, it has saved some time), and I can summarize my results by using a message size of 819200000, hence worth 100k pages of WAL or so. Then, taking 20 runs, and eliminating the three highest and lowest numbers to eliminate some of the variance, I am getting an average runtime when re-assembling the record of:
- 695473.09us for HEAD
- 695679.18us for the patch.
So I am not going to stress more on this point.

Attached is the result of all that for HEAD. There is also one extra patch dedicated to v17, where I have checked that the extra pg_logical_emit_message() is enough to reproduce the problem without the patch, and that the problem is fixed with the patch. The patch set for v17 is simpler, of course, as test 046 still exists on REL_17_STABLE. Note that I have moved this extra contrecord to be generated before the checkpoint is completed, leading to the same result.

With all that said, I'll move on with this stuff once the embargo for v18 beta2 is lifted and the tag is pushed. That should happen in 24h or so, I guess.

--
Michael
Attachments
RE: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: "Hayato Kuroda (Fujitsu)"
Dear Michael,

> I like the addition of an extra pg_logical_emit_message() in test 046 anyway, down to v17, for all the branches. Even if the reproduction is sporadic, we have seen it pretty quickly in the CI and in the buildfarm, so it would not go unnoticed for a long time if we mess up with this stuff again.

I put pg_logical_emit_message() after injection_points_wakeup(), but your patch puts it before. I verified that even your patch can reproduce the issue, but is there a reason? (This is not a comment which must be fixed; I just want to know the reason, to study.)

> With all that said, I'll move on with this stuff once the embargo for v18 beta2 is lifted and the tag is pushed. That should happen in 24h or so, I guess.

The provided patches look good to me.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Re: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: Alexander Korotkov
Hi, Michael!

On Tue, Jul 15, 2025 at 6:31 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Jun 30, 2025 at 05:51:51PM +0530, vignesh C wrote:
> > Thanks, Kuroda-san. I've prepared a similar test that doesn't rely on injection points. The issue reproduced consistently across all branches up to PG13. You can use the attached 049_slot_get_changes_wait_continously_pg17.pl script (found in the 049_slot_get_changes_wait_continously_pg17.zip file) to verify this. Just copy the script to src/test/recovery and run the test to observe the problem.
>
> +# Abruptly stop the server (1 second should be enough for the checkpoint
> +# to finish; it would be better).
> +$node->stop('immediate');
>
> While v3-0001 is good enough to reproduce the original issue after a few dozen runs here, we cannot reintroduce the test without the proper wait_for_log() logic making sure that the checkpoint is completed before we stop the server.
>
> I like the addition of an extra pg_logical_emit_message() in test 046 anyway, down to v17, for all the branches. Even if the reproduction is sporadic, we have seen it pretty quickly in the CI and in the buildfarm, so it would not go unnoticed for a long time if we mess up with this stuff again.
>
> We are lucky enough that the existing test does not fail in v17, TBH, so let's tackle all the issues on this thread step by step:
> 1) Fix: let's add the header check.
> 2) Let's reintroduce a fixed version of 046 on HEAD and on v18, with an extra pg_logical_emit_message() that travels across WAL pages, forcing a reconstruction of the record and the extra header check.
> 3) Let's fix recovery test 046 currently in v17, for the checkpoint wait logic and the extra pg_logical_emit_message().
> 4) Let's improve the stability of 047 everywhere for the checkpoint wait logic, as already done by Alexander.
>
> I have been doing some tests with the patch vs HEAD (thanks for the script, it has saved some time), and I can summarize my results by using a message size of 819200000, hence worth 100k pages of WAL or so. Then, taking 20 runs, and eliminating the three highest and lowest numbers to eliminate some of the variance, I am getting an average runtime when re-assembling the record of:
> - 695473.09us for HEAD
> - 695679.18us for the patch.
> So I am not going to stress more on this point.
>
> Attached is the result of all that for HEAD. There is also one extra patch dedicated to v17, where I have checked that the extra pg_logical_emit_message() is enough to reproduce the problem without the patch, and that the problem is fixed with the patch. The patch set for v17 is simpler, of course, as test 046 still exists on REL_17_STABLE. Note that I have moved this extra contrecord to be generated before the checkpoint is completed, leading to the same result.
>
> With all that said, I'll move on with this stuff once the embargo for v18 beta2 is lifted and the tag is pushed. That should happen in 24h or so, I guess.

Thank you for your efforts on this subject. Actually, I was planning to work on pushing this after the release freeze ends for v18. I'd like to do this at least for the tests, as they were initially committed by me.

Anyway, please hold on pushing this for ~1 day to let me do a final review of this.

------
Regards,
Alexander Korotkov
Supabase
Re: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: Michael Paquier
On Tue, Jul 15, 2025 at 09:19:50PM +0300, Alexander Korotkov wrote:
> Thank you for your efforts on this subject. Actually, I was planning to work on pushing this after the release freeze ends for v18. I'd like to do this at least for the tests, as they were initially committed by me.
>
> Anyway, please hold on pushing this for ~1 day to let me do a final review of this.

If you want to apply all of this yourself, please feel free, of course.

--
Michael
Attachments
Re: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: Michael Paquier
On Tue, Jul 15, 2025 at 09:14:27AM +0000, Hayato Kuroda (Fujitsu) wrote:
> I put pg_logical_emit_message() after injection_points_wakeup(), but your patch puts it before. I verified that even your patch can reproduce the issue, but is there a reason? (This is not a comment which must be fixed; I just want to know the reason, to study.)

It's for increasing the odds, so that the cross-page record finds its way to disk when doing the recovery phase. I was wondering about forcing a flush of pg_logical_emit_message() as well, but the reproduction rate was still good enough without it.

>> With all that said, I'll move on with this stuff once the embargo for v18 beta2 is lifted and the tag is pushed. That should happen in 24h or so, I guess.
>
> The provided patches look good to me.

Thanks for the reviews!

--
Michael
Attachments
Re: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: Alexander Korotkov
On Wed, Jul 16, 2025 at 1:12 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Jul 15, 2025 at 09:19:50PM +0300, Alexander Korotkov wrote:
> > Thank you for your efforts on this subject. Actually, I was planning to work on pushing this after the release freeze ends for v18. I'd like to do this at least for the tests, as they were initially committed by me.
> >
> > Anyway, please hold on pushing this for ~1 day to let me do a final review of this.
>
> If you want to apply all of this yourself, please feel free, of course.

Thank you! I went through the patchset. Everything looks good to me. I only did some improvements to comments and commit messages. I'm going to push this if there are no objections.

------
Regards,
Alexander Korotkov
Supabase
Attachments
Alexander Korotkov <aekorotkov@gmail.com> writes:
> I went through the patchset. Everything looks good to me. I only did some improvements to comments and commit messages. I'm going to push this if there are no objections.

There's apparently something wrong in the v17 branch, as three separate buildfarm members have now hit timeout failures in 046_checkpoint_logical_slot.pl [1][2][3]. I tried to reproduce this locally, and didn't have much luck initially. However, if I build with a configuration similar to grassquit's, it will hang up maybe one time in ten:

export ASAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0'
export UBSAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1'
./configure ... usual flags plus ... CFLAGS='-O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -fsanitize=address -fno-sanitize-recover=all' --enable-injection-points

The fact that 046_checkpoint_logical_slot.pl is skipped in non-injection-point builds is probably reducing the number of buildfarm failures, since only a minority of animals have that turned on yet.

I don't see anything obviously wrong in the test changes, and the postmaster log from the failures looks pretty clearly like what is hanging up is the pg_logical_slot_get_changes call:

2025-07-19 16:10:07.276 CEST [3458309][client backend][0/2:0] LOG: statement: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG: starting logical decoding for slot "slot_logical"
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL: Streaming transactions committing after 0/290000F8, reading WAL from 0/1540F40.
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG: logical decoding found consistent point at 0/1540F40
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL: There are no running transactions.
2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
2025-07-19 16:59:56.828 CEST [3458140][postmaster][:0] LOG: received immediate shutdown request
2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] LOG: could not send data to client: Broken pipe
2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
2025-07-19 16:59:56.851 CEST [3458140][postmaster][:0] LOG: database system is shut down

So my impression is that the bug is not reliably fixed in 17.

One other interesting thing is that once it's hung, the test does not stop after PG_TEST_TIMEOUT_DEFAULT elapses. You can see above that olingo took nearly 50 minutes to give up, and in manual testing it doesn't seem to stop either (though I've not got the patience to wait 50 minutes...)

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2025-07-19%2014%3A07%3A23
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2025-07-19%2014%3A07%3A56
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2025-07-19%2016%3A29%3A32
Re: pg_logical_slot_get_changes waits continuously for a partial WAL record spanning across 2 pages
From: Alexander Korotkov
On Sat, Jul 19, 2025 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > I went through the patchset. Everything looks good to me. I only did some improvements to comments and commit messages. I'm going to push this if there are no objections.
>
> There's apparently something wrong in the v17 branch, as three separate buildfarm members have now hit timeout failures in 046_checkpoint_logical_slot.pl [1][2][3]. I tried to reproduce this locally, and didn't have much luck initially. However, if I build with a configuration similar to grassquit's, it will hang up maybe one time in ten:
>
> export ASAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0'
> export UBSAN_OPTIONS='print_stacktrace=1:disable_coredump=0:abort_on_error=1'
> ./configure ... usual flags plus ... CFLAGS='-O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -fsanitize=address -fno-sanitize-recover=all' --enable-injection-points
>
> The fact that 046_checkpoint_logical_slot.pl is skipped in non-injection-point builds is probably reducing the number of buildfarm failures, since only a minority of animals have that turned on yet.
>
> I don't see anything obviously wrong in the test changes, and the postmaster log from the failures looks pretty clearly like what is hanging up is the pg_logical_slot_get_changes call:
>
> 2025-07-19 16:10:07.276 CEST [3458309][client backend][0/2:0] LOG: statement: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG: starting logical decoding for slot "slot_logical"
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL: Streaming transactions committing after 0/290000F8, reading WAL from 0/1540F40.
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] LOG: logical decoding found consistent point at 0/1540F40
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] DETAIL: There are no running transactions.
> 2025-07-19 16:10:07.278 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:59:56.828 CEST [3458140][postmaster][:0] LOG: received immediate shutdown request
> 2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] LOG: could not send data to client: Broken pipe
> 2025-07-19 16:59:56.841 CEST [3458309][client backend][0/2:0] STATEMENT: select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);
> 2025-07-19 16:59:56.851 CEST [3458140][postmaster][:0] LOG: database system is shut down
>
> So my impression is that the bug is not reliably fixed in 17.
>
> One other interesting thing is that once it's hung, the test does not stop after PG_TEST_TIMEOUT_DEFAULT elapses. You can see above that olingo took nearly 50 minutes to give up, and in manual testing it doesn't seem to stop either (though I've not got the patience to wait 50 minutes...)

Thank you for pointing this out! Apparently I've backpatched d3917d8f13e7 everywhere but not in REL_17_STABLE. Will be fixed now.

------
Regards,
Alexander Korotkov
Supabase
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Sat, Jul 19, 2025 at 10:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So my impression is that the bug is not reliably fixed in 17.

> Thank you for pointing this out! Apparently I've backpatched d3917d8f13e7 everywhere but not in REL_17_STABLE. Will be fixed now.

Hah ... well, at least we know the test works!

regards, tom lane