Обсуждение: more descriptive message for process termination due to max_slot_wal_keep_size
more descriptive message for process termination due to max_slot_wal_keep_size
Hello. As complained in pgsql-bugs [1], when a process is terminated due to max_slot_wal_keep_size, the related messages don't mention the root cause for *the termination*. Note that the third message does not show for temporary replication slots. [pid=a] LOG: terminating process x to release replication slot "s" [pid=x] LOG: FATAL: terminating connection due to administrator command [pid=a] LOG: invalidting slot "s" because its restart_lsn X/X exceeds max_slot_wal_keep_size The attached patch attaches a DETAIL line to the first message. > [17605] LOG: terminating process 17614 to release replication slot "s1" + [17605] DETAIL: The slot's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size. > [17614] FATAL: terminating connection due to administrator command > [17605] LOG: invalidating slot "s1" because its restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size Somewhat the second and fourth lines look inconsistent each other but that wouldn't be such a problem. I don't think we want to concatenate the two lines together as the result is a bit too long. > LOG: terminating process 17614 to release replication slot "s1" because it's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size. What do you think about this? [1] https://www.postgresql.org/message-id/20211214.101137.379073733372253470.horikyota.ntt%40gmail.com -- Kyotaro Horiguchi NTT Open Source Software Center From b0c27dc80aff37ef984592b79f1dd20d052299fa Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Tue, 14 Dec 2021 10:50:00 +0900 Subject: [PATCH] Make an error message about process termination more descriptive If checkpointer kills a process due to a temporary replication slot exceeding max_slot_wal_keep_size, the messages fails to describe the cause of the termination. It is because the message that describes the reason that is emitted for persistent slots does not show for temporary slots. Add a DETAIL line to the message common to all types of slot to describe the cause. Reported-by: Alex Enachioaie <alex@altmetric.com> Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org --- src/backend/replication/slot.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 90ba9b417d..cba9a29113 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1254,7 +1254,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, { ereport(LOG, (errmsg("terminating process %d to release replication slot \"%s\"", - active_pid, NameStr(slotname)))); + active_pid, NameStr(slotname)), + errdetail("The slot's restart_lsn %X/%X exceeds max_slot_wal_keep_size.", LSN_FORMAT_ARGS(restart_lsn)))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; -- 2.27.0
Re: more descriptive message for process termination due to max_slot_wal_keep_size
On Tue, Dec 14, 2021 at 9:35 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Hello. > > As complained in pgsql-bugs [1], when a process is terminated due to > max_slot_wal_keep_size, the related messages don't mention the root > cause for *the termination*. Note that the third message does not > show for temporary replication slots. > > [pid=a] LOG: terminating process x to release replication slot "s" > [pid=x] LOG: FATAL: terminating connection due to administrator command > [pid=a] LOG: invalidting slot "s" because its restart_lsn X/X exceeds max_slot_wal_keep_size > > The attached patch attaches a DETAIL line to the first message. > > > [17605] LOG: terminating process 17614 to release replication slot "s1" > + [17605] DETAIL: The slot's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size. > > [17614] FATAL: terminating connection due to administrator command > > [17605] LOG: invalidating slot "s1" because its restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size > > Somewhat the second and fourth lines look inconsistent each other but > that wouldn't be such a problem. I don't think we want to concatenate > the two lines together as the result is a bit too long. > > > LOG: terminating process 17614 to release replication slot "s1" because it's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size. > > What do you think about this? Agree. I think we should also specify the restart_lsn value which would be within max_slot_wal_keep_size for better understanding. -- Best Wishes, Ashutosh Bapat
Re: more descriptive message for process termination due to max_slot_wal_keep_size
On Tue, Dec 14, 2021 at 9:35 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Hello. > > As complained in pgsql-bugs [1], when a process is terminated due to > max_slot_wal_keep_size, the related messages don't mention the root > cause for *the termination*. Note that the third message does not > show for temporary replication slots. > > [pid=a] LOG: "terminating process %d to release replication slot \"%s\"" > [pid=x] LOG: FATAL: terminating connection due to administrator command > [pid=a] LOG: invalidting slot "s" because its restart_lsn X/X exceeds max_slot_wal_keep_size > > The attached patch attaches a DETAIL line to the first message. > > > [17605] LOG: terminating process 17614 to release replication slot "s1" > + [17605] DETAIL: The slot's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size. > > [17614] FATAL: terminating connection due to administrator command > > [17605] LOG: invalidating slot "s1" because its restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size > > Somewhat the second and fourth lines look inconsistent each other but > that wouldn't be such a problem. I don't think we want to concatenate > the two lines together as the result is a bit too long. > > > LOG: terminating process 17614 to release replication slot "s1" because it's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size. > > What do you think about this? > > [1] https://www.postgresql.org/message-id/20211214.101137.379073733372253470.horikyota.ntt%40gmail.com +1 to give more context to the "terminating process %d to release replication slot \"%s\"" message. How about having below, instead of adding errdetail: "terminating process %d to release replication slot \"%s\" whose restart_lsn %X/%X exceeds max_slot_wal_keep_size"? I think we can keep the "invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size" message as-is. We may not see "terminating process ..." and "invalidation slot ..." messages together for the same slot, so having slightly different wording is fine IMO. Regards, Bharath Rupireddy.
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Tue, 14 Dec 2021 19:31:21 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in > On Tue, Dec 14, 2021 at 9:35 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > [17605] LOG: terminating process 17614 to release replication slot "s1" > > + [17605] DETAIL: The slot's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size. > > > [17614] FATAL: terminating connection due to administrator command > > > [17605] LOG: invalidating slot "s1" because its restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size > > > > Somewhat the second and fourth lines look inconsistent each other but > > that wouldn't be such a problem. I don't think we want to concatenate > > the two lines together as the result is a bit too long. > > > > > LOG: terminating process 17614 to release replication slot "s1" because it's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size. > > > > What do you think about this? > > Agree. I think we should also specify the restart_lsn value which > would be within max_slot_wal_keep_size for better understanding. Thanks! It seems to me the main message of the "invalidating" log has no room for further detail. So I split the reason out to DETAILS line the same way with the "terminating" message in the attached second patch. (It is separated from the first patch just for review) I believe someone can make the DETAIL message simpler or more natural. The attached patch set emits the following message. > LOG: invalidating slot "s1" > DETAIL: The slot's restart_lsn 0/10000D68 is behind the limit 0/11000000 defined by max_slot_wal_keep_size. The second line could be changed like the following or anything other. > DETAIL: The slot's restart_lsn 0/10000D68 got behind the limit 0/11000000 determined by max_slot_wal_keep_size. ..... regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 6909d854a0d48f2504cbb6dfbc7eb571fd32bcd8 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Tue, 14 Dec 2021 10:50:00 +0900 Subject: [PATCH v2 1/2] Make an error message about process termination more descriptive If checkpointer kills a process due to a temporary replication slot exceeding max_slot_wal_keep_size, the messages fails to describe the cause of the termination. It is finally shown for persistent slots in a subsequent message, but that message doesn't show for temporary slots. It's better to attach an explainaion to the termination message directly. Reported-by: Alex Enachioaie <alex@altmetric.com> Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org --- src/backend/replication/slot.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 90ba9b417d..cba9a29113 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1254,7 +1254,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, { ereport(LOG, (errmsg("terminating process %d to release replication slot \"%s\"", - active_pid, NameStr(slotname)))); + active_pid, NameStr(slotname)), + errdetail("The slot's restart_lsn %X/%X exceeds max_slot_wal_keep_size.", LSN_FORMAT_ARGS(restart_lsn)))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; -- 2.27.0 From 265a84fa6a20ee7c5178ace41096df1a4b00465c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Wed, 15 Dec 2021 13:08:20 +0900 Subject: [PATCH v2 2/2] Make the message further detailed --- src/backend/replication/slot.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index cba9a29113..695151e484 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1255,7 +1255,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, ereport(LOG, (errmsg("terminating process %d to release replication slot \"%s\"", active_pid, NameStr(slotname)), - errdetail("The slot's restart_lsn %X/%X exceeds max_slot_wal_keep_size.", LSN_FORMAT_ARGS(restart_lsn)))); + errdetail("The slot's restart_lsn %X/%X is behind the limit %X/%X defined by max_slot_wal_keep_size.", + LSN_FORMAT_ARGS(restart_lsn), + LSN_FORMAT_ARGS(oldestLSN)))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; @@ -1292,9 +1294,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, ReplicationSlotRelease(); ereport(LOG, - (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size", - NameStr(slotname), - LSN_FORMAT_ARGS(restart_lsn)))); + (errmsg("invalidating slot \"%s\"", + NameStr(slotname)), + errdetail("The slot's restart_lsn %X/%X is behind the limit %X/%X defined by max_slot_wal_keep_size.", + LSN_FORMAT_ARGS(restart_lsn), + LSN_FORMAT_ARGS(oldestLSN)))); /* done with this slot for now */ break; -- 2.27.0
Re: more descriptive message for process termination due to max_slot_wal_keep_size
On Wed, Dec 15, 2021 at 9:42 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 14 Dec 2021 19:31:21 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in > > On Tue, Dec 14, 2021 at 9:35 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > [17605] LOG: terminating process 17614 to release replication slot "s1" > > > + [17605] DETAIL: The slot's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size. > > > > [17614] FATAL: terminating connection due to administrator command > > > > [17605] LOG: invalidating slot "s1" because its restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size > > > > > > Somewhat the second and fourth lines look inconsistent each other but > > > that wouldn't be such a problem. I don't think we want to concatenate > > > the two lines together as the result is a bit too long. > > > > > > > LOG: terminating process 17614 to release replication slot "s1" because it's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size. > > > > > > What do you think about this? > > > > Agree. I think we should also specify the restart_lsn value which > > would be within max_slot_wal_keep_size for better understanding. > > Thanks! It seems to me the main message of the "invalidating" log has > no room for further detail. So I split the reason out to DETAILS line > the same way with the "terminating" message in the attached second > patch. (It is separated from the first patch just for review) I > believe someone can make the DETAIL message simpler or more natural. > > The attached patch set emits the following message. > > > LOG: invalidating slot "s1" > > DETAIL: The slot's restart_lsn 0/10000D68 is behind the limit 0/11000000 defined by max_slot_wal_keep_size. > > The second line could be changed like the following or anything other. > > > DETAIL: The slot's restart_lsn 0/10000D68 got behind the limit 0/11000000 determined by max_slot_wal_keep_size. > ..... > The second version looks better as it gives more details. I am fine with either of the above wordings. I would prefer everything in the same message though since "invalidating slot ..." is too short a LOG message. Not everybody enabled details always. -- Best Wishes, Ashutosh Bapat
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Thu, 23 Dec 2021 18:08:08 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in > On Wed, Dec 15, 2021 at 9:42 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > LOG: invalidating slot "s1" > > > DETAIL: The slot's restart_lsn 0/10000D68 is behind the limit 0/11000000 defined by max_slot_wal_keep_size. > > > > The second line could be changed like the following or anything other. > > > > > DETAIL: The slot's restart_lsn 0/10000D68 got behind the limit 0/11000000 determined by max_slot_wal_keep_size. > > ..... > > > > The second version looks better as it gives more details. I am fine > with either of the above wordings. > > I would prefer everything in the same message though since > "invalidating slot ..." is too short a LOG message. Not everybody > enabled details always. Mmm. Right. I have gone too much to the same way with the process-termination message. I rearranged the meesages as follows in the attached version. (at master) > LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size > DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size. > LOG: invalidating slot \"%s\" because its restart_LSN %X/%X exceeds max_slot_wal_keep_size c> DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size. The messages is actually incomplete even in 13 so I think the change to the errmsg() message of the first message is worth back-patching. - v3-0001-Make-a-message-on-process-termination-more-dscrip.patch Changes only the first main message and it can be back-patched to 14. - v3-0001-Make-a-message-on-process-termination-more-dscrip_13.patch The same to the above but for 13, which doesn't have LSN_FORMAT_ARGS. - v3-0002-Add-detailed-information-to-slot-invalidation-mes.patch Attaches the DETAIL line shown above to both messages, only for the master. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 4bfbb58801e9d2a0e0ab8320dd95bc850a0a397b Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 24 Dec 2021 12:52:07 +0900 Subject: [PATCH v3 1/2] Make a message on process termination more dscriptive The message at process termination due to slot limit doesn't provide the reason. In the major scenario the message is followed by another message about slot invalidatation, which shows the detail for the termination. However the second message is missing if the slot is temporary one. Augment the first message with the reason same as the second message. Backpatch through to 13 where the message was introduced. Reported-by: Alex Enachioaie <alex@altmetric.com> Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org Backpatch-through: 13 --- src/backend/replication/slot.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 90ba9b417d..0ceac3a54d 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1253,8 +1253,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, if (last_signaled_pid != active_pid) { ereport(LOG, - (errmsg("terminating process %d to release replication slot \"%s\"", - active_pid, NameStr(slotname)))); + (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%Xexceeds max_slot_wal_keep_size", + active_pid, NameStr(slotname), + LSN_FORMAT_ARGS(restart_lsn)))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; -- 2.27.0 From 444b3bb64ee907280606696e0cb90f6b022c9cd6 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 24 Dec 2021 13:23:54 +0900 Subject: [PATCH v3] Make a message on process termination more dscriptive The message at process termination due to slot limit doesn't provide the reason. In the major scenario the message is followed by another message about slot invalidatation, which shows the detail for the termination. However the second message is missing if the slot is temporary one. Augment the first message with the reason same as the second message. Backpatch through to 13 where the message was introduced. Reported-by: Alex Enachioaie <alex@altmetric.com> Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org Backpatch-through: 13 --- src/backend/replication/slot.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 02047ea920..15b8934ae2 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1228,8 +1228,10 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, if (last_signaled_pid != active_pid) { ereport(LOG, - (errmsg("terminating process %d to release replication slot \"%s\"", - active_pid, NameStr(slotname)))); + (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%Xexceeds max_slot_wal_keep_size", + active_pid, NameStr(slotname), + (uint32) (restart_lsn >> 32), + (uint32) restart_lsn))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; -- 2.27.0 From 4d37e4e6f2af557779d970f5fe4dba98c937ed49 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 24 Dec 2021 12:58:25 +0900 Subject: [PATCH v3 2/2] Add detailed information to slot-invalidation messages The messages says just "exceeds max_slot_wal_keep_size" but doesn't tell the concrete LSN of the limit. That information helps operators' understanding on the issue. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org --- src/backend/replication/slot.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 0ceac3a54d..2b394fe0d7 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1255,7 +1255,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, ereport(LOG, (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%Xexceeds max_slot_wal_keep_size", active_pid, NameStr(slotname), - LSN_FORMAT_ARGS(restart_lsn)))); + LSN_FORMAT_ARGS(restart_lsn)), + errdetail("The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.", + LSN_FORMAT_ARGS(oldestLSN)))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; @@ -1294,7 +1296,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, ereport(LOG, (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size", NameStr(slotname), - LSN_FORMAT_ARGS(restart_lsn)))); + LSN_FORMAT_ARGS(restart_lsn)), + errdetail("The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.", + LSN_FORMAT_ARGS(oldestLSN)))); /* done with this slot for now */ break; -- 2.27.0
Re: more descriptive message for process termination due to max_slot_wal_keep_size
On Fri, Dec 24, 2021 at 1:42 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 23 Dec 2021 18:08:08 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in > > On Wed, Dec 15, 2021 at 9:42 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > LOG: invalidating slot "s1" > > > > DETAIL: The slot's restart_lsn 0/10000D68 is behind the limit 0/11000000 defined by max_slot_wal_keep_size. > > > > > > The second line could be changed like the following or anything other. > > > > > > > DETAIL: The slot's restart_lsn 0/10000D68 got behind the limit 0/11000000 determined by max_slot_wal_keep_size. > > > ..... > > > > > > > The second version looks better as it gives more details. I am fine > > with either of the above wordings. > > > > I would prefer everything in the same message though since > > "invalidating slot ..." is too short a LOG message. Not everybody > > enabled details always. > > Mmm. Right. I have gone too much to the same way with the > process-termination message. > > I rearranged the meesages as follows in the attached version. (at master) Thank you for the patch! +1 for improving the messages. > > > LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size > > DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size. > > > LOG: invalidating slot \"%s\" because its restart_LSN %X/%X exceeds max_slot_wal_keep_size > c> DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size. - LSN_FORMAT_ARGS(restart_lsn)))); + LSN_FORMAT_ARGS(restart_lsn)), + errdetail("The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.", + LSN_FORMAT_ARGS(oldestLSN)))); Isn't oldestLSN calculated not only by max_slot_wal_keep_size but also by wal_keep_size? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: more descriptive message for process termination due to max_slot_wal_keep_size
Thank you for the comment. At Fri, 24 Dec 2021 17:06:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in > Thank you for the patch! +1 for improving the messages. > > > > > > LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size > > > DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size. > > > > > LOG: invalidating slot \"%s\" because its restart_LSN %X/%X exceeds max_slot_wal_keep_size > > c> DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size. > > - > LSN_FORMAT_ARGS(restart_lsn)))); > + > LSN_FORMAT_ARGS(restart_lsn)), > + errdetail("The slot > got behind the limit %X/%X determined by max_slot_wal_keep_size.", > + > LSN_FORMAT_ARGS(oldestLSN)))); > > Isn't oldestLSN calculated not only by max_slot_wal_keep_size but also > by wal_keep_size? Right. But I believe the two are not assumed to be used at once. One can set wal_keep_size larger than max_slot_wal_keep_size but it is actually a kind of ill setting. LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size and wal_keep_size. Mmm. I don't like this. I feel we don't need such detail in the message.. I'd like to hear opinions from others, please. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: more descriptive message for process termination due to max_slot_wal_keep_size
On Fri, Dec 24, 2021 at 5:30 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Thank you for the comment. > > At Fri, 24 Dec 2021 17:06:57 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in > > Thank you for the patch! +1 for improving the messages. > > > > > > > > > LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size > > > > DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size. > > > > > > > LOG: invalidating slot \"%s\" because its restart_LSN %X/%X exceeds max_slot_wal_keep_size > > > c> DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size. > > > > - > > LSN_FORMAT_ARGS(restart_lsn)))); > > + > > LSN_FORMAT_ARGS(restart_lsn)), > > + errdetail("The slot > > got behind the limit %X/%X determined by max_slot_wal_keep_size.", > > + > > LSN_FORMAT_ARGS(oldestLSN)))); > > > > Isn't oldestLSN calculated not only by max_slot_wal_keep_size but also > > by wal_keep_size? > > Right. But I believe the two are not assumed to be used at once. One > can set wal_keep_size larger than max_slot_wal_keep_size but it is > actually a kind of ill setting. > > LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size > DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size and wal_keep_size. > > Mmm. I don't like this. I feel we don't need such detail in the > message. How about something like: LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit DETAIL: The slot got behind the limit %X/%X HINT: You might need to increase max_slot_wal_keep_size or wal_keep_size. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Fri, 24 Dec 2021 20:23:29 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in > On Fri, Dec 24, 2021 at 5:30 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > Right. But I believe the two are not assumed to be used at once. One > > can set wal_keep_size larger than max_slot_wal_keep_size but it is > > actually a kind of ill setting. > > > > LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size > > DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size and wal_keep_size. > > > > Mmm. I don't like this. I feel we don't need such detail in the > > message. > > How about something like: > > LOG: terminating process %d to release replication slot \"%s\" > because its restart_lsn %X/%X exceeds the limit > DETAIL: The slot got behind the limit %X/%X > HINT: You might need to increase max_slot_wal_keep_size or wal_keep_size. The message won't be seen when max_slot_wal_keep_size is not set. So we don't recommend to increase wal_keep_size in that case. We might need inhibit (or warn)the two parameters from being activated at once, but it would be another issue. Another point is how people determine the value for the parameter. I suppose (or believe) max_slot_wal_keep_size is not a kind to set to minimal first then increase later but a kind to set to maximum allowable first. On the other hand we suggest as the follows for too-small max_wal_size so we could do the same for this parameter. > HINT: Consider increasing the configuration parameter \"max_wal_size\". Also, I don't like we have three lines for this message. If the DETAIL adds only the specific value of the limit, I think it'd better append it to the main message. So what do you say if I propose the following? LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X HINT: You might need to increase max_slot_wal_keep_size. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Tue, 04 Jan 2022 10:29:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > So what do you say if I propose the following? > > LOG: terminating process %d to release replication slot \"%s\" > because its restart_lsn %X/%X exceeds the limit %X/%X > HINT: You might need to increase max_slot_wal_keep_size. This version emits the following message. [35785:checkpointer] LOG: terminating process 36368 to release replication slot "s1" because its restart_lsn 0/1F000148exceeds the limit 0/21000000 [35785:checkpointer] HINT: You might need to increase max_slot_wal_keep_size. [36368:walsender] FATAL: terminating connection due to administrator command [36368:walsender] STATEMENT: START_REPLICATION SLOT "s1" 0/1F000000 TIMELINE 1 [35785:checkpointer] LOG: invalidating slot "s1" because its restart_lsn 0/1F000148 exceeds the limit 0/21000000 [35785:checkpointer] HINT: You might need to increase max_slot_wal_keep_size. We can omit the HINT line from the termination log for non-persistent slots but I think we don't want to bother that considering its low frequency. The CI was confused by the mixed patches for multiple PG versions. In this version the patchset for master are attached as .patch and that for PG13 as .txt. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 4909db2893f0b33ab6bca1ffc3ad802cd159c577 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 24 Dec 2021 12:52:07 +0900 Subject: [PATCH v4 1/2] Make a message on process termination more dscriptive The message at process termination due to slot limit doesn't provide the reason. In the major scenario the message is followed by another message about slot invalidatation, which shows the detail for the termination. However the second message is missing if the slot is temporary one. Augment the first message with the reason same as the second message. Backpatch through to 13 where the message was introduced. Reported-by: Alex Enachioaie <alex@altmetric.com> Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org Backpatch-through: 13 --- src/backend/replication/slot.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index caa6b29756..92f19bcb35 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1300,8 +1300,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, if (last_signaled_pid != active_pid) { ereport(LOG, - (errmsg("terminating process %d to release replication slot \"%s\"", - active_pid, NameStr(slotname)))); + (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%Xexceeds max_slot_wal_keep_size", + active_pid, NameStr(slotname), + LSN_FORMAT_ARGS(restart_lsn)))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; -- 2.27.0 From f31014f85c7dec160bd42e1c48f1c28a7dc22af2 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 24 Dec 2021 12:58:25 +0900 Subject: [PATCH v4 2/2] Add detailed information to slot-invalidation messages The messages says just "exceeds max_slot_wal_keep_size" but doesn't tell the concrete LSN of the limit. That information helps operators' understanding on the issue. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org --- src/backend/replication/slot.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 92f19bcb35..be21b62add 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1300,9 +1300,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, if (last_signaled_pid != active_pid) { ereport(LOG, - (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%Xexceeds max_slot_wal_keep_size", + (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%Xexceeds the limit %X/%X", active_pid, NameStr(slotname), - LSN_FORMAT_ARGS(restart_lsn)))); + LSN_FORMAT_ARGS(restart_lsn), + LSN_FORMAT_ARGS(oldestLSN)), + errhint("You might need to increase max_slot_wal_keep_size."))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; @@ -1345,9 +1347,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, ReplicationSlotRelease(); ereport(LOG, - (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size", + (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X", NameStr(slotname), - LSN_FORMAT_ARGS(restart_lsn)))); + LSN_FORMAT_ARGS(restart_lsn), + LSN_FORMAT_ARGS(oldestLSN)), + errhint("You might need to increase max_slot_wal_keep_size."))); /* done with this slot for now */ break; -- 2.27.0 From 6c5a680842521b26c0c899c3d0675bd53e58ac11 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 24 Dec 2021 13:23:54 +0900 Subject: [PATCH v4] Make a message on process termination more dscriptive The message at process termination due to slot limit doesn't provide the reason. In the major scenario the message is followed by another message about slot invalidatation, which shows the detail for the termination. However the second message is missing if the slot is temporary one. Augment the first message with the reason same as the second message. Backpatch through to 13 where the message was introduced. Reported-by: Alex Enachioaie <alex@altmetric.com> Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org Backpatch-through: 13 --- src/backend/replication/slot.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 02047ea920..15b8934ae2 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1228,8 +1228,10 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, if (last_signaled_pid != active_pid) { ereport(LOG, - (errmsg("terminating process %d to release replication slot \"%s\"", - active_pid, NameStr(slotname)))); + (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%Xexceeds max_slot_wal_keep_size", + active_pid, NameStr(slotname), + (uint32) (restart_lsn >> 32), + (uint32) restart_lsn))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; -- 2.27.0
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Wed, 02 Mar 2022 15:37:19 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > The CI was confused by the mixed patches for multiple PG versions. In > this version the patchset for master are attached as .patch and that > for PG13 as .txt. Yeah.... It is of course the relevant check should be fixed. The attached v5 adjusts 019_replslot_limit.pl. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 4909db2893f0b33ab6bca1ffc3ad802cd159c577 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 24 Dec 2021 12:52:07 +0900 Subject: [PATCH v5 1/2] Make a message on process termination more dscriptive The message at process termination due to slot limit doesn't provide the reason. In the major scenario the message is followed by another message about slot invalidatation, which shows the detail for the termination. However the second message is missing if the slot is temporary one. Augment the first message with the reason same as the second message. Backpatch through to 13 where the message was introduced. Reported-by: Alex Enachioaie <alex@altmetric.com> Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org Backpatch-through: 13 --- src/backend/replication/slot.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index caa6b29756..92f19bcb35 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1300,8 +1300,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, if (last_signaled_pid != active_pid) { ereport(LOG, - (errmsg("terminating process %d to release replication slot \"%s\"", - active_pid, NameStr(slotname)))); + (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%Xexceeds max_slot_wal_keep_size", + active_pid, NameStr(slotname), + LSN_FORMAT_ARGS(restart_lsn)))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; -- 2.27.0 From 7a9e571e8cf4e5ffc13d101733c1b6fc455b1aec Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 24 Dec 2021 12:58:25 +0900 Subject: [PATCH v5 2/2] Add detailed information to slot-invalidation messages The messages says just "exceeds max_slot_wal_keep_size" but doesn't tell the concrete LSN of the limit. That information helps operators' understanding on the issue. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org --- src/backend/replication/slot.c | 12 ++++++++---- src/test/recovery/t/019_replslot_limit.pl | 2 +- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 92f19bcb35..be21b62add 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1300,9 +1300,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, if (last_signaled_pid != active_pid) { ereport(LOG, - (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%Xexceeds max_slot_wal_keep_size", + (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%Xexceeds the limit %X/%X", active_pid, NameStr(slotname), - LSN_FORMAT_ARGS(restart_lsn)))); + LSN_FORMAT_ARGS(restart_lsn), + LSN_FORMAT_ARGS(oldestLSN)), + errhint("You might need to increase max_slot_wal_keep_size."))); (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; @@ -1345,9 +1347,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, ReplicationSlotRelease(); ereport(LOG, - (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size", + (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X", NameStr(slotname), - LSN_FORMAT_ARGS(restart_lsn)))); + LSN_FORMAT_ARGS(restart_lsn), + LSN_FORMAT_ARGS(oldestLSN)), + errhint("You might need to increase max_slot_wal_keep_size."))); /* done with this slot for now */ break; diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl index 9bb71b62c0..bac496a9cf 100644 --- a/src/test/recovery/t/019_replslot_limit.pl +++ b/src/test/recovery/t/019_replslot_limit.pl @@ -188,7 +188,7 @@ for (my $i = 0; $i < 10000; $i++) { if (find_in_log( $node_primary, - "invalidating slot \"rep1\" because its restart_lsn [0-9A-F/]+ exceeds max_slot_wal_keep_size", + "invalidating slot \"rep1\" because its restart_lsn [0-9A-F/]+ exceeds the limit [0-9A-F/]+", $logstart)) { $invalidated = 1; -- 2.27.0
Re: more descriptive message for process termination due to max_slot_wal_keep_size
Hi,
At Tue, 04 Jan 2022 10:29:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote inSo what do you say if I propose the following? LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X HINT: You might need to increase max_slot_wal_keep_size.This version emits the following message. [35785:checkpointer] LOG: terminating process 36368 to release replication slot "s1" because its restart_lsn 0/1F000148 exceeds the limit 0/21000000 [35785:checkpointer] HINT: You might need to increase max_slot_wal_keep_size.
As the hint is to increase max_slot_wal_keep_size, what about reporting the difference in size (rather than the limit lsn)? Something along those lines?
[35785:checkpointer] LOG: terminating process 36368 to release replication slot "s1" because its restart_lsn 0/1F000148 exceeds the limit by <NNN MB>.
Regards,
--
Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Mon, 5 Sep 2022 11:56:33 +0200, "Drouvot, Bertrand" <bdrouvot@amazon.com> wrote in > Hi, > > On 3/2/22 7:37 AM, Kyotaro Horiguchi wrote: > > At Tue, 04 Jan 2022 10:29:31 +0900 (JST), Kyotaro > > Horiguchi<horikyota.ntt@gmail.com> wrote in > >> So what do you say if I propose the following? > >> > >> LOG: terminating process %d to release replication slot \"%s\" > >> because its restart_lsn %X/%X exceeds the limit %X/%X > >> HINT: You might need to increase max_slot_wal_keep_size. > > This version emits the following message. > > > > [35785:checkpointer] LOG: terminating process 36368 to release > > replication slot "s1" because its restart_lsn 0/1F000148 exceeds the > > limit 0/21000000 > > [35785:checkpointer] HINT: You might need to increase > > max_slot_wal_keep_size. > > As the hint is to increase max_slot_wal_keep_size, what about > reporting the difference in size (rather than the limit lsn)? > Something along those lines? > > [35785:checkpointer] LOG: terminating process 36368 to release > replication slot "s1" because its restart_lsn 0/1F000148 exceeds the > limit by <NNN MB>. Thanks! That might be more sensible exactly for the reason you mentioned. One issue doing that is size_pretty is dbsize.c local function. Since the size is less than kB in many cases, we cannot use fixed unit for that. 0001 and 0002 are the same with v5. 0003 exposes byte_size_pretty() to other modules. 0004 does the change by using byte_size_pretty() After 0004 applied, they look like this. > LOG: terminating process 108413 to release replication slot "rep3" because its restart_lsn 0/7000D8 exceeds the limitby 1024 kB > HINT: You might need to increase max_slot_wal_keep_size. The reason for "1024 kB" instead of "1 MB" is the precise value is a bit less than 1024 * 1024. regards. - Kyotaro Horiguchi NTT Open Source Software Center
Вложения
Re: more descriptive message for process termination due to max_slot_wal_keep_size
Hi,
At Mon, 5 Sep 2022 11:56:33 +0200, "Drouvot, Bertrand" <bdrouvot@amazon.com> wrote inHi, On 3/2/22 7:37 AM, Kyotaro Horiguchi wrote:At Tue, 04 Jan 2022 10:29:31 +0900 (JST), Kyotaro Horiguchi<horikyota.ntt@gmail.com> wrote inSo what do you say if I propose the following? LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X HINT: You might need to increase max_slot_wal_keep_size.This version emits the following message. [35785:checkpointer] LOG: terminating process 36368 to release replication slot "s1" because its restart_lsn 0/1F000148 exceeds the limit 0/21000000 [35785:checkpointer] HINT: You might need to increase max_slot_wal_keep_size.As the hint is to increase max_slot_wal_keep_size, what about reporting the difference in size (rather than the limit lsn)? Something along those lines? [35785:checkpointer] LOG: terminating process 36368 to release replication slot "s1" because its restart_lsn 0/1F000148 exceeds the limit by <NNN MB>.Thanks! That might be more sensible exactly for the reason you mentioned. One issue doing that is size_pretty is dbsize.c local function. Since the size is less than kB in many cases, we cannot use fixed unit for that.
Thanks for the new patch version!. I did not realized (sorry about that) that we'd need to expose byte_size_pretty(). Now I wonder if we should not simply report the number of bytes (like I can see it is done in many places). So something like:
@@ -1298,9 +1298,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
byte_size_pretty(buf, sizeof(buf),
oldestLSN - restart_lsn);
ereport(LOG,
- (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit by %s",
+ (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit by %lu bytes",
active_pid, NameStr(slotname),
- LSN_FORMAT_ARGS(restart_lsn), buf),
+ LSN_FORMAT_ARGS(restart_lsn), oldestLSN - restart_lsn),
errhint("You might need to increase max_slot_wal_keep_size.")));
and then forget about exposing/using byte_size_pretty() (that would be more consistent with the same kind of reporting in the existing code).
What do you think?
Regards,
-- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: more descriptive message for process termination due to max_slot_wal_keep_size
(I noticed I sent a wrong version..) At Tue, 6 Sep 2022 10:54:35 +0200, "Drouvot, Bertrand" <bdrouvot@amazon.com> wrote in > Thanks for the new patch version!. I did not realized (sorry about > that) that we'd need to expose byte_size_pretty(). Now I wonder if we I didn't think we need the units larger than MB, but I used pretty_print to prevent small number from rounding to exactly zero. On the other hand, in typical cases it is longer than 6 digits in bytes, which is a bit hard to read a glance. > LOG: terminating process 16034 to release replication slot "rep1" because its restart_lsn 0/3158000 exceeds the limitby 15368192 bytes > should not simply report the number of bytes (like I can see it is > done in many places). So something like: .. > + (errmsg("terminating process %d to release replication slot \"%s\" > because its restart_lsn %X/%X exceeds the limit by %lu bytes", .. > and then forget about exposing/using byte_size_pretty() (that would be > more consistent with the same kind of reporting in the existing code). > > What do you think? An alterntive would be rounding up to the whole MB, or a sub-MB. > ereport(LOG, > (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds thelimit by %.1lf MB", > active_pid, NameStr(slotname), > LSN_FORMAT_ARGS(restart_lsn), > /* round-up at sub-MB */ > ceil((double) (oldestLSN - restart_lsn) / 1024 / 102.4) / 10), > LOG: terminating process 49539 to release replication slot "rep1" because its restart_lsn 0/3038000 exceeds the limitby 15.8 MB If the distance were 1 byte, it is shown as "0.1 MB". regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: more descriptive message for process termination due to max_slot_wal_keep_size
Hi,
(I noticed I sent a wrong version..) At Tue, 6 Sep 2022 10:54:35 +0200, "Drouvot, Bertrand" <bdrouvot@amazon.com> wrote inThanks for the new patch version!. I did not realized (sorry about that) that we'd need to expose byte_size_pretty(). Now I wonder if weI didn't think we need the units larger than MB, but I used pretty_print to prevent small number from rounding to exactly zero.
Yeah makes sense.
Also, rounding to zero wouldn't occur with "just" displaying "oldestLSN - restart_lsn" (as proposed upthread).
On the other hand, in typical cases it is longer than 6 digits in bytes, which is a bit hard to read a glance.
Yeah right, but that's already the case in some part of the code, like for example in arrayfuncs.c:
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("array size exceeds the maximum allowed (%d)",
(int) MaxAllocSize)));
LOG: terminating process 16034 to release replication slot "rep1" because its restart_lsn 0/3158000 exceeds the limit by 15368192 bytesshould not simply report the number of bytes (like I can see it is done in many places). So something like:..+ (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit by %lu bytes",..and then forget about exposing/using byte_size_pretty() (that would be more consistent with the same kind of reporting in the existing code). What do you think?An alterntive would be rounding up to the whole MB, or a sub-MB.ereport(LOG, (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit by %.1lf MB", active_pid, NameStr(slotname), LSN_FORMAT_ARGS(restart_lsn), /* round-up at sub-MB */ ceil((double) (oldestLSN - restart_lsn) / 1024 / 102.4) / 10),
typo "/ 102.4" ?
LOG: terminating process 49539 to release replication slot "rep1" because its restart_lsn 0/3038000 exceeds the limit by 15.8 MBIf the distance were 1 byte, it is shown as "0.1 MB".
Right and I'm -1 on it, I think we should stick to the "pretty" or the "bytes only" approach (my preference being the bytes only one).
Regards,
-- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Wed, 7 Sep 2022 12:16:29 +0200, "Drouvot, Bertrand" <bdrouvot@amazon.com> wrote in > Also, rounding to zero wouldn't occur with "just" displaying > "oldestLSN - restart_lsn" (as proposed upthread). .. > Yeah right, but that's already the case in some part of the code, like > for example in arrayfuncs.c: Fair points. > >> ereport(LOG, > >> (errmsg("terminating process %d to release replication slot > >> \"%s\" because its restart_lsn %X/%X exceeds the limit by %.1lf > >> MB", > >> active_pid, NameStr(slotname), > >> LSN_FORMAT_ARGS(restart_lsn), > >> /* round-up at sub-MB */ > >> ceil((double) (oldestLSN - restart_lsn) / 1024 / 102.4) / > >> 10), > > typo "/ 102.4" ? No, it rounds the difference up to one decimal place. So it is devided by 10 after ceil():p > >> LOG: terminating process 49539 to release replication slot "rep1" > >> because its restart_lsn 0/3038000 exceeds the limit by 15.8 MB > > If the distance were 1 byte, it is shown as "0.1 MB". > > Right and I'm -1 on it, I think we should stick to the "pretty" or the > "bytes only" approach (my preference being the bytes only one). Okay. the points you brought up above are sufficient grounds for not doing so. Now they are in the following format. >> LOG: terminating process 16034 to release replication slot "rep1" >> because its restart_lsn 0/3158000 exceeds the limit by 15368192 bytes Thank you for the discussion, Bertrand! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
Re: more descriptive message for process termination due to max_slot_wal_keep_size
Hi,
At Wed, 7 Sep 2022 12:16:29 +0200, "Drouvot, Bertrand" <bdrouvot@amazon.com> wrote inLOG: terminating process 49539 to release replication slot "rep1" because its restart_lsn 0/3038000 exceeds the limit by 15.8 MBIf the distance were 1 byte, it is shown as "0.1 MB".Right and I'm -1 on it, I think we should stick to the "pretty" or the "bytes only" approach (my preference being the bytes only one).Okay. the points you brought up above are sufficient grounds for not doing so. Now they are in the following format.LOG: terminating process 16034 to release replication slot "rep1" because its restart_lsn 0/3158000 exceeds the limit by 15368192 bytesThank you for the discussion, Bertrand!
You are welcome, thanks for the patch!
It looks good to me, barring any objections i think we can mark the CF entry as Ready for Committer.
Regards,
-- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > Okay. the points you brought up above are sufficient grounds for not > doing so. Now they are in the following format. > LOG: terminating process 16034 to release replication slot "rep1" > because its restart_lsn 0/3158000 exceeds the limit by 15368192 bytes This seems to me to be a pretty blatant violation of our first message style guideline [1]: The primary message should be short, factual, and avoid reference to implementation details such as specific function names. “Short” means “should fit on one line under normal conditions”. Use a detail message if needed to keep the primary message short ... I think you should leave the primary message alone and add a DETAIL, as the first version of the patch did. The existing "invalidating slot" message is already in violation of this guideline, so splitting off a DETAIL from that seems indicated as well. regards, tom lane [1] https://www.postgresql.org/docs/devel/error-style-guide.html
... oh, one other point is that using %ld to print an int64 is entirely not portable, as indeed the cfbot is complaining about. I think our best practice on that is to put %lld in the format string and explicitly cast the corresponding argument to "long long". regards, tom lane
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Wed, 28 Sep 2022 16:30:37 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > > Okay. the points you brought up above are sufficient grounds for not > > doing so. Now they are in the following format. > > > LOG: terminating process 16034 to release replication slot "rep1" > > because its restart_lsn 0/3158000 exceeds the limit by 15368192 bytes > > This seems to me to be a pretty blatant violation of our first message > style guideline [1]: Thanks! It seems that I was waiting for a comment on that line. I thought that way at first but finally returned to the current message as the result of discussion (in my memory). I will happily make the main message shorter. > I think you should leave the primary message alone and add a DETAIL, > as the first version of the patch did. > > The existing "invalidating slot" message is already in violation > of this guideline, so splitting off a DETAIL from that seems > indicated as well. So I'm going to change the mssage as: LOG: terminating process %d to release replication slot \"%s\" DETAIL: The slot's restart_lsn %X/%X exceeds the limit by %lld bytes. HINT: You might need to increase max_slot_wal_keep_size. LOG: invalidating *replication* slot \"%s\" DETAILS: (ditto) HINTS: (ditto) It seems that it's no longer useful to split out the first patch so I merged them into one. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > At Wed, 28 Sep 2022 16:30:37 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in >> I think you should leave the primary message alone and add a DETAIL, >> as the first version of the patch did. > So I'm going to change the mssage as: > LOG: terminating process %d to release replication slot \"%s\" > DETAIL: The slot's restart_lsn %X/%X exceeds the limit by %lld bytes. > HINT: You might need to increase max_slot_wal_keep_size. > LOG: invalidating *replication* slot \"%s\" > DETAILS: (ditto) > HINTS: (ditto) I thought the latter was a little *too* short; the primary message should at least give you some clue why that happened, even if it doesn't offer all the detail. After some thought I changed it to LOG: invalidating obsolete replication slot \"%s\" and pushed it that way. regards, tom lane
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Thu, 29 Sep 2022 13:31:00 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > > At Wed, 28 Sep 2022 16:30:37 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > > LOG: invalidating *replication* slot \"%s\" > > DETAILS: (ditto) > > HINTS: (ditto) > > I thought the latter was a little *too* short; the primary message > should at least give you some clue why that happened, even if it > doesn't offer all the detail. After some thought I changed it to Yeah, agreed. It looks better. (I was about to spell it as "invalidating slot "%s"" then changed my mind to add "replication". I felt that it is a bit too short but didn't think about further streaching that by adding "obsolete"..). > LOG: invalidating obsolete replication slot \"%s\" > > and pushed it that way. Thanks. And thanks for fixing the test script, too. By the way, I didn't notice at that time (and forgot about the policy), but the HINT message has variations differing only by the variable name. What do you think about the attached? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > By the way, I didn't notice at that time (and forgot about the > policy), but the HINT message has variations differing only by the > variable name. > What do you think about the attached? Hmm, maybe, but a quick grep for 'You might need to increase' finds about a dozen other cases, and none of them are using %s. If we do this we should change all of them, and they probably need "translator:" hints. I'm not sure whether abstracting away the variable names will make translation harder. regards, tom lane
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Thu, 29 Sep 2022 22:49:00 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > > By the way, I didn't notice at that time (and forgot about the > > policy), but the HINT message has variations differing only by the > > variable name. > > > What do you think about the attached? > > Hmm, maybe, but a quick grep for 'You might need to increase' > finds about a dozen other cases, and none of them are using %s. (Mmm. I didn't find others only in po files..) > If we do this we should change all of them, and they probably > need "translator:" hints. I'm not sure whether abstracting > away the variable names will make translation harder. I expect that dedicated po-editing tools can lookup corresponding code lines, which gives the answer if no hint is attached at least in this specific case. Anyway, thinking calmly, since we are not about to edit these messages, it's unlikely to happen on purpose, I think. So I don't mean to push this so hard. Thanks for the comment! regards. -- Kyotaro Horiguchi NTT Open Source Software Center