Обсуждение: Report reorder buffer size

Поиск
Список
Период
Сортировка

Report reorder buffer size

От
Ashutosh Bapat
Дата:
Hi All,
The question of how to tune logical_decoding_work_mem has come up a
few times in the context of customer cases. If it is too low then the
changes are spilled to disk, slowing down replication and consuming
disk space. In case of streamed transactions, it won't consume space
on primary but will consume resources (transactional resources/disk
space) on replica. If decoding memory is too high, it may cause memory
pressure especially when there are many wal senders and large amounts
of changes queued to reorder buffers frequently.

In order to tune logical_decoding_work_mem optimally we need a measure
of the size of changes belonging to the transactions queued in the
reorder buffer, whether spilled to the disk or in memory or sent
downstream as part of an unfinished streamed transaction. Currently
there is no way to determine that. pg_stat_replication_slots reports
cumulative statistics about spilled or streamed transactions. It helps
to know whether the logical decoding work memory is sufficient or not
to hold all the changes. But it does not provide a way to estimate an
optimal value for it since it doesn't capture the state of the reorder
buffer at a given point in time.

With the attached WIP patch, we can report, at a given point in time,
total size of reorder buffer i.e. sum of the size of all the changes
belonging to the transactions active in the reorder buffer whether
those changes are in memory, on disk or sent downstream as part of an
unfinished streamed transaction. By sampling this value at regular
intervals, one can observe the reorder buffer trendline as shown in
the plots attached. In the plots, the Y-axis is the total size of
reorder buffer and the X axis is time of day.
rbsize_trendlines_pgbench.png shows trendline when just regular
pgbench is run with 3K TPS and
rbsize_trendlines_large_transaction_spikes.png shows the trendline
when there are long and large transactions in addition to pgbench.
Using these trendlines, a user may decide to keep
logical_decoding_work_mem to minimum and thus let all the large
transactions spill to disk or be streamed. Or they may decide to set
it as 100MB, 200MB or 500MB reducing or eliminating the disk spill or
streaming. It makes it easy to quantify the trade-offs in tuning
logical_decoding_work_mem.

Tracking total size of reorder buffer also helps to quantify load on a
wal sender. It is an additional metric that can be useful to debug WAL
buildup, slow replication etc. For example, the plot clearly and
directly shows when the wal sender started processing the large
transaction, when it finished etc.

The patch is not complete. There are many things to be worked on
1. Whether the metrics should be reported in pg_stat_replication or
pg_stat_replication_slots. Given that it's a point-in-time metric, it
fits pg_stat_replication better. But that view does not report logical
decoding activity happening in backends other than the wal senders.
But that's true with the other metrics in that view as well.
pg_stat_replication_slots, however, covers all users of logical
decoding. For now I have used pg_stat_replication, but it can be
changed based on the discussion here.

2. Is the size of the reorder buffer enough or we want to also track
the size of changes on disk and the size of changes sent downstream as
part of unfinished streamed transactions separately? Also the number
of transactions being tracked by the reorder buffer? If we are going
to report so much statistics about the contents of the reorder buffer,
is it better to have a separate view pg_stat_reorder_buffer for the
same?

3. The patch doesn't cover all the places where the reorder buffer's
total size should be maintained. I need to find all such places and
add required maths.

But before I work on those, I would like to know others' opinions,
usefulness and acceptance. FWIW, attached is a tap test to test the
changes

-- 
Best Wishes,
Ashutosh Bapat

Вложения

Re: Report reorder buffer size

От
Bertrand Drouvot
Дата:
Hi,

On Wed, Aug 13, 2025 at 06:09:47PM +0530, Ashutosh Bapat wrote:
> Hi All,
> The question of how to tune logical_decoding_work_mem has come up a
> few times in the context of customer cases.

Same here.

> If it is too low then the
> changes are spilled to disk, slowing down replication and consuming
> disk space. In case of streamed transactions, it won't consume space
> on primary but will consume resources (transactional resources/disk
> space) on replica. If decoding memory is too high, it may cause memory
> pressure especially when there are many wal senders and large amounts
> of changes queued to reorder buffers frequently.
> 
> In order to tune logical_decoding_work_mem optimally we need a measure
> of the size of changes belonging to the transactions queued in the
> reorder buffer, whether spilled to the disk or in memory or sent
> downstream as part of an unfinished streamed transaction.

Fully agree, that's something I started to work on some time ago. Not only
to tune the logical_decoding_work_mem but also to help diagnose issues with high
memory consumption and/or OOM due to logical decoding.

> Currently
> there is no way to determine that. pg_stat_replication_slots reports
> cumulative statistics about spilled or streamed transactions. It helps
> to know whether the logical decoding work memory is sufficient or not
> to hold all the changes. But it does not provide a way to estimate an
> optimal value for it since it doesn't capture the state of the reorder
> buffer at a given point in time.

Agree.

> With the attached WIP patch, we can report, at a given point in time,
> total size of reorder buffer i.e. sum of the size of all the changes
> belonging to the transactions active in the reorder buffer whether
> those changes are in memory, on disk or sent downstream as part of an
> unfinished streamed transaction. By sampling this value at regular
> intervals, one can observe the reorder buffer trendline as shown in
> the plots attached. In the plots, the Y-axis is the total size of
> reorder buffer and the X axis is time of day.
> rbsize_trendlines_pgbench.png shows trendline when just regular
> pgbench is run with 3K TPS and
> rbsize_trendlines_large_transaction_spikes.png shows the trendline
> when there are long and large transactions in addition to pgbench.

> Using these trendlines, a user may decide to keep
> logical_decoding_work_mem to minimum and thus let all the large
> transactions spill to disk or be streamed. 

Right. Also, the issue with spill files is that when they are read back from disk
then the reorder buffer can consume a lot of memory. That's why I think tracking
the xid, sub xid and LSNs could be useful too. That could help diagnose why
it was using that memory.

Attached is what I did prepare a couple of years ago. I did not share it until
now because it "only" logs information in the log file and I wanted to do more.

It's far from being polished and I'm not sure it's 100% correct. I share
it just as food for thought about what information I thought could be useful
to log. It applies on top of yours.

> Tracking total size of reorder buffer also helps to quantify load on a
> wal sender.

Yeah, and memory being used.

> It is an additional metric that can be useful to debug WAL
> buildup, slow replication etc. For example, the plot clearly and
> directly shows when the wal sender started processing the large
> transaction, when it finished etc.

Agree that's useful info to have.

> The patch is not complete. There are many things to be worked on

> 1. Whether the metrics should be reported in pg_stat_replication or
> pg_stat_replication_slots. Given that it's a point-in-time metric, it
> fits pg_stat_replication better.

I do think the same.

> But that view does not report logical
> decoding activity happening in backends other than the wal senders.
> But that's true with the other metrics in that view as well.
> pg_stat_replication_slots, however, covers all users of logical
> decoding. For now I have used pg_stat_replication, but it can be
> changed based on the discussion here.

I think pg_stat_replication is a good place to record "real time" activities.

Maybe we could log 2 metrics: the reorder buffer size before creating the spill
files and the reorder buffer size while reading back the spill files? That would
help make the distinction when logical_decoding_work_mem is involved (i.e before
creating the spill files) and when it is not (reading them back).

pg_stat_replication_slots could be used to add some counters too: like the
number of times logical_decoding_work_mem has been reached while decoding from
that slot. Also maybe record the max value the reorder buffer reached and 
the associated xid, number of sub xids, and LSN? That could be useful to diagnose
high memory consumption and/or OOM issues due to logical decoding.

Also, should we log some of information to the server log?

> But before I work on those, I would like to know others' opinions,
> usefulness and acceptance.

Thanks for working on it. I like the idea and also think that is an area where
more details/metrics should be provided.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Вложения

Re: Report reorder buffer size

От
Ashutosh Bapat
Дата:
Hi Bertrand,
Thanks for your response. I am glad that you have found the proposal
to be generally useful.

On Thu, Aug 14, 2025 at 3:50 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> > Using these trendlines, a user may decide to keep
> > logical_decoding_work_mem to minimum and thus let all the large
> > transactions spill to disk or be streamed.
>
> Right. Also, the issue with spill files is that when they are read back from disk
> then the reorder buffer can consume a lot of memory. That's why I think tracking
> the xid, sub xid and LSNs could be useful too. That could help diagnose why
> it was using that memory.

When the replication is stuck or is slow or WAL builds up a detailed
look at the reorder buffer's contents would be useful. It will need to
be done outside of pg_stat_replication or pg_stat_replication_slots. I
think it will be useful to provide a function which will trigger a
given WAL sender to dump the information about the contents of the
reorder buffer to the server error log or send them to the requesting
client.


>
> I think pg_stat_replication is a good place to record "real time" activities.
>

Ok. Thanks for confirming.

> Maybe we could log 2 metrics: the reorder buffer size before creating the spill
> files and the reorder buffer size while reading back the spill files? That would
> help make the distinction when logical_decoding_work_mem is involved (i.e before
> creating the spill files) and when it is not (reading them back).
>
> pg_stat_replication_slots could be used to add some counters too: like the
> number of times logical_decoding_work_mem has been reached while decoding from
> that slot.

I think pg_stat_replication_slots::spill_count and
pg_stat_replication_slots::stream_count give that value.

> Also maybe record the max value the reorder buffer reached and
> the associated xid, number of sub xids, and LSN? That could be useful to diagnose
> high memory consumption and/or OOM issues due to logical decoding.

A function which dumps the information about reorder buffers can be
used along with the trendline for this purpose. For example, if a user
notices WAL accumulation and an upward trend in the metric exposed by
my patch, they can trigger a dump of reorder buffer contents at that
time and diagnose the problem and even fix it proactively.

>
> Also, should we log some of information to the server log?
>

Information in server error logs is hard to access and analyse,
especially in cloud environment. Having a SQL construct is better.

> > But before I work on those, I would like to know others' opinions,
> > usefulness and acceptance.
>
> Thanks for working on it. I like the idea and also think that is an area where
> more details/metrics should be provided.

Thanks.

--
Best Wishes,
Ashutosh Bapat



Re: Report reorder buffer size

От
Bertrand Drouvot
Дата:
Hi Ashutosh,

On Thu, Aug 21, 2025 at 07:26:41PM +0530, Ashutosh Bapat wrote:
> Hi Bertrand,
> Thanks for your response. I am glad that you have found the proposal
> to be generally useful.
> 
> On Thu, Aug 14, 2025 at 3:50 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > > Using these trendlines, a user may decide to keep
> > > logical_decoding_work_mem to minimum and thus let all the large
> > > transactions spill to disk or be streamed.
> >
> > Right. Also, the issue with spill files is that when they are read back from disk
> > then the reorder buffer can consume a lot of memory. That's why I think tracking
> > the xid, sub xid and LSNs could be useful too. That could help diagnose why
> > it was using that memory.
> 
> When the replication is stuck or is slow or WAL builds up a detailed
> look at the reorder buffer's contents would be useful. It will need to
> be done outside of pg_stat_replication or pg_stat_replication_slots. I
> think it will be useful to provide a function which will trigger a
> given WAL sender to dump the information about the contents of the
> reorder buffer to the server error log or send them to the requesting
> client.

Maybe we could add such functionalities in contrib/pg_logicalinspect?

> > Maybe we could log 2 metrics: the reorder buffer size before creating the spill
> > files and the reorder buffer size while reading back the spill files? That would
> > help make the distinction when logical_decoding_work_mem is involved (i.e before
> > creating the spill files) and when it is not (reading them back).
> >
> > pg_stat_replication_slots could be used to add some counters too: like the
> > number of times logical_decoding_work_mem has been reached while decoding from
> > that slot.
> 
> I think pg_stat_replication_slots::spill_count and
> pg_stat_replication_slots::stream_count give that value.

Yes as long as subtrans are not involved. But if say I've one transaction made of
1500 subtrans, I'd get something like:

  slot_name   | spill_count | stream_count | total_txns
--------------+-------------+--------------+------------
 logical_slot |        1501 |            0 |          1

So we don't know how many times logical_decoding_work_mem has been reached (
except by looking at total_txns).

But as soon as another transaction (that does not involve spill) is decoded:

  slot_name   | spill_count | stream_count | total_txns
--------------+-------------+--------------+------------
 logical_slot |        1501 |            0 |          2

Then we don't know if logical_decoding_work_mem has been reached one or two 
times.

> > Also maybe record the max value the reorder buffer reached and
> > the associated xid, number of sub xids, and LSN? That could be useful to diagnose
> > high memory consumption and/or OOM issues due to logical decoding.
> 
> A function which dumps the information about reorder buffers can be
> used along with the trendline for this purpose. For example, if a user
> notices WAL accumulation and an upward trend in the metric exposed by
> my patch, they can trigger a dump of reorder buffer contents at that
> time and diagnose the problem and even fix it proactively.

Right. That would work for cases where the issue is occuring right now and
the system is still available (OOM not triggered for example).

> > Also, should we log some of information to the server log?
> >
> 
> Information in server error logs is hard to access and analyse,
> especially in cloud environment. Having a SQL construct is better.

That's right as long as the system is still available. I think a mix of log
and SQL API help more use cases: real time, historical and when the system is
back to a usable state (if it was not anymore).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Report reorder buffer size

От
Ashutosh Bapat
Дата:
On Fri, Aug 22, 2025 at 11:43 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi Ashutosh,
>
> On Thu, Aug 21, 2025 at 07:26:41PM +0530, Ashutosh Bapat wrote:
> > Hi Bertrand,
> > Thanks for your response. I am glad that you have found the proposal
> > to be generally useful.
> >
> > On Thu, Aug 14, 2025 at 3:50 PM Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> > >
> > > > Using these trendlines, a user may decide to keep
> > > > logical_decoding_work_mem to minimum and thus let all the large
> > > > transactions spill to disk or be streamed.
> > >
> > > Right. Also, the issue with spill files is that when they are read back from disk
> > > then the reorder buffer can consume a lot of memory. That's why I think tracking
> > > the xid, sub xid and LSNs could be useful too. That could help diagnose why
> > > it was using that memory.
> >
> > When the replication is stuck or is slow or WAL builds up a detailed
> > look at the reorder buffer's contents would be useful. It will need to
> > be done outside of pg_stat_replication or pg_stat_replication_slots. I
> > think it will be useful to provide a function which will trigger a
> > given WAL sender to dump the information about the contents of the
> > reorder buffer to the server error log or send them to the requesting
> > client.
>
> Maybe we could add such functionalities in contrib/pg_logicalinspect?

I don't have an opinion about where to place this. But if place it in
the core, there are higher chances that it will be trusted and used.

>
> > > Maybe we could log 2 metrics: the reorder buffer size before creating the spill
> > > files and the reorder buffer size while reading back the spill files? That would
> > > help make the distinction when logical_decoding_work_mem is involved (i.e before
> > > creating the spill files) and when it is not (reading them back).
> > >
> > > pg_stat_replication_slots could be used to add some counters too: like the
> > > number of times logical_decoding_work_mem has been reached while decoding from
> > > that slot.
> >
> > I think pg_stat_replication_slots::spill_count and
> > pg_stat_replication_slots::stream_count give that value.
>
> Yes as long as subtrans are not involved. But if say I've one transaction made of
> 1500 subtrans, I'd get something like:
>
>   slot_name   | spill_count | stream_count | total_txns
> --------------+-------------+--------------+------------
>  logical_slot |        1501 |            0 |          1
>
> So we don't know how many times logical_decoding_work_mem has been reached (
> except by looking at total_txns).
>
> But as soon as another transaction (that does not involve spill) is decoded:
>
>   slot_name   | spill_count | stream_count | total_txns
> --------------+-------------+--------------+------------
>  logical_slot |        1501 |            0 |          2
>
> Then we don't know if logical_decoding_work_mem has been reached one or two
> times.

I didn't know this is how it works. Thanks for correcting me. In that
case, I think we should
add a column in pg_stat_replication_slots reporting the number of
times the memory limit is reached since the last reset. I am +0.5 on
it being useful.

>
> > > Also maybe record the max value the reorder buffer reached and
> > > the associated xid, number of sub xids, and LSN? That could be useful to diagnose
> > > high memory consumption and/or OOM issues due to logical decoding.
> >
> > A function which dumps the information about reorder buffers can be
> > used along with the trendline for this purpose. For example, if a user
> > notices WAL accumulation and an upward trend in the metric exposed by
> > my patch, they can trigger a dump of reorder buffer contents at that
> > time and diagnose the problem and even fix it proactively.
>
> Right. That would work for cases where the issue is occuring right now and
> the system is still available (OOM not triggered for example).
>
> > > Also, should we log some of information to the server log?
> > >
> >
> > Information in server error logs is hard to access and analyse,
> > especially in cloud environment. Having a SQL construct is better.
>
> That's right as long as the system is still available. I think a mix of log
> and SQL API help more use cases: real time, historical and when the system is
> back to a usable state (if it was not anymore).

The way I envision this is that users will sample the views
periodically or when certain thresholds (amount of WAL, size of
reorder buffers) are crossed. That way they can have historical data
in an easy-to-query manner when the actual incident occurs.

--
Best Wishes,
Ashutosh Bapat



Re: Report reorder buffer size

От
Bertrand Drouvot
Дата:
Hi,

On Mon, Aug 25, 2025 at 01:48:42PM +0530, Ashutosh Bapat wrote:
> On Fri, Aug 22, 2025 at 11:43 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > Yes as long as subtrans are not involved. But if say I've one transaction made of
> > 1500 subtrans, I'd get something like:
> >
> >   slot_name   | spill_count | stream_count | total_txns
> > --------------+-------------+--------------+------------
> >  logical_slot |        1501 |            0 |          1
> >
> > So we don't know how many times logical_decoding_work_mem has been reached (
> > except by looking at total_txns).
> >
> > But as soon as another transaction (that does not involve spill) is decoded:
> >
> >   slot_name   | spill_count | stream_count | total_txns
> > --------------+-------------+--------------+------------
> >  logical_slot |        1501 |            0 |          2
> >
> > Then we don't know if logical_decoding_work_mem has been reached one or two
> > times.
> 
> I didn't know this is how it works. Thanks for correcting me. In that
> case, I think we should
> add a column in pg_stat_replication_slots reporting the number of
> times the memory limit is reached since the last reset. I am +0.5 on
> it being useful.

Thanks. With this in place one could get a ratio like total_txns/total_txns_that_exceeds.
That could help to see if reaching logical_decoding_work_mem is rare or 
frequent enough. If frequent, then maybe there is a need to adjust logical_decoding_work_mem.

> > That's right as long as the system is still available. I think a mix of log
> > and SQL API help more use cases: real time, historical and when the system is
> > back to a usable state (if it was not anymore).
> 
> The way I envision this is that users will sample the views
> periodically or when certain thresholds (amount of WAL, size of
> reorder buffers) are crossed. That way they can have historical data
> in an easy-to-query manner when the actual incident occurs.

Yeah. OTOH, having this information in the log could also help users that
did not think about sampling the views and hit an incident.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Report reorder buffer size

От
Masahiko Sawada
Дата:
Hi,

On Wed, Aug 13, 2025 at 5:40 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> Hi All,
> The question of how to tune logical_decoding_work_mem has come up a
> few times in the context of customer cases. If it is too low then the
> changes are spilled to disk, slowing down replication and consuming
> disk space. In case of streamed transactions, it won't consume space
> on primary but will consume resources (transactional resources/disk
> space) on replica. If decoding memory is too high, it may cause memory
> pressure especially when there are many wal senders and large amounts
> of changes queued to reorder buffers frequently.

Thank you for starting the thread about this topic. This is one of the
topics that I've been eager to address. Here are some random comments:

>
> In order to tune logical_decoding_work_mem optimally we need a measure
> of the size of changes belonging to the transactions queued in the
> reorder buffer, whether spilled to the disk or in memory or sent
> downstream as part of an unfinished streamed transaction. Currently
> there is no way to determine that. pg_stat_replication_slots reports
> cumulative statistics about spilled or streamed transactions. It helps
> to know whether the logical decoding work memory is sufficient or not
> to hold all the changes. But it does not provide a way to estimate an
> optimal value for it since it doesn't capture the state of the reorder
> buffer at a given point in time.
>
> With the attached WIP patch, we can report, at a given point in time,
> total size of reorder buffer i.e. sum of the size of all the changes
> belonging to the transactions active in the reorder buffer whether
> those changes are in memory, on disk or sent downstream as part of an
> unfinished streamed transaction. By sampling this value at regular
> intervals, one can observe the reorder buffer trendline as shown in
> the plots attached.

Alternative idea (or an additional metric) would be the high-watermark
of memory usage. That way, users won't miss memory usage spikes that
might be missed by sampling the total memory usage.

> Tracking total size of reorder buffer also helps to quantify load on a
> wal sender. It is an additional metric that can be useful to debug WAL
> buildup, slow replication etc. For example, the plot clearly and
> directly shows when the wal sender started processing the large
> transaction, when it finished etc.

Agreed.

> The patch is not complete. There are many things to be worked on
> 1. Whether the metrics should be reported in pg_stat_replication or
> pg_stat_replication_slots. Given that it's a point-in-time metric, it
> fits pg_stat_replication better. But that view does not report logical
> decoding activity happening in backends other than the wal senders.
> But that's true with the other metrics in that view as well.
> pg_stat_replication_slots, however, covers all users of logical
> decoding. For now I have used pg_stat_replication, but it can be
> changed based on the discussion here.
>
> 2. Is the size of the reorder buffer enough or we want to also track
> the size of changes on disk and the size of changes sent downstream as
> part of unfinished streamed transactions separately? Also the number
> of transactions being tracked by the reorder buffer?

For the purpose of tuning logical_decoding_work_mem, the additional
metrics and statistics we need might not be many. But in order to make
logical decoding more visible for users for debugging or diagnosing
purposes, more statistics like the number of transactions being
tracked by the reorder buffer might be required.

We need to note that the actual size of changes sent downstream
actually depends on logical decoding plugins. For instance, we have
table, row, and column filters in logical replication cases. It might
be worth considering providing such statistics too. The statistics
like the number of changes filtered out by the table filters or the
change-kind filter (e.g., only publishing INSERTs etc) might be
helpful for users to confirm the effectiveness of the filters they
set.

> If we are going
> to report so much statistics about the contents of the reorder buffer,
> is it better to have a separate view pg_stat_reorder_buffer for the
> same?

Given logical decoding can be used also by regular backend processes,
I guess that such dynamic metrics would fit a system view dedicated to
logical decoding, say pg_stat_reorder_buffer or
pg_stat_logical_decoding.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Report reorder buffer size

От
Ashutosh Bapat
Дата:
On Mon, Aug 25, 2025 at 6:28 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> > I didn't know this is how it works. Thanks for correcting me. In that
> > case, I think we should
> > add a column in pg_stat_replication_slots reporting the number of
> > times the memory limit is reached since the last reset. I am +0.5 on
> > it being useful.
>
> Thanks. With this in place one could get a ratio like total_txns/total_txns_that_exceeds.
> That could help to see if reaching logical_decoding_work_mem is rare or
> frequent enough. If frequent, then maybe there is a need to adjust logical_decoding_work_mem.
>

We are discussing some improvements to pg_stat_replication_slots over
at [1]. Would you be able to create a patch reporting the number of
times logical_decoding_work_mem is reached, in the patchset being
discussed there?

[1] https://www.postgresql.org/message-id/CAExHW5s6KntzUyUoMbKR5dgwRmdV2Ay_2+AnTgYGAzo=Qv61wA@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat



Re: Report reorder buffer size

От
Ashutosh Bapat
Дата:
Hi Masahiko,
Thanks for your inputs.

On Tue, Aug 26, 2025 at 2:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> On Wed, Aug 13, 2025 at 5:40 AM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
>
> Thank you for starting the thread about this topic. This is one of the
> topics that I've been eager to address. Here are some random comments:
>

Glad that more people think there is a need for improving the reorder
buffer statistics.

> >
> > With the attached WIP patch, we can report, at a given point in time,
> > total size of reorder buffer i.e. sum of the size of all the changes
> > belonging to the transactions active in the reorder buffer whether
> > those changes are in memory, on disk or sent downstream as part of an
> > unfinished streamed transaction. By sampling this value at regular
> > intervals, one can observe the reorder buffer trendline as shown in
> > the plots attached.
>
> Alternative idea (or an additional metric) would be the high-watermark
> of memory usage. That way, users won't miss memory usage spikes that
> might be missed by sampling the total memory usage.

I think the trendline is important to make right trade-offs in case
setting logical_decoding_work_mem to maximum is not possible. With a
high watermark the question is how long does a high remain high? As
the load changes, a high that was once may become lower or higher than
the new high and soon irrelevant. Probably we could reset the
high-watermark every time the view is sampled, so that we provide a
trendline for the high-water mark as well. But I want to be cautious
about adding that complexity of tracking maxima accurately and then
maintain it forever. If sampling is frequent enough usually it will
capture a maxima and minima good enough for practical purposes. The
users will need to consider the trendline as approximate anyway since
the load will show slight variations over the time.

Please share your idea of reporting high-watermark.

> >
> > 2. Is the size of the reorder buffer enough or we want to also track
> > the size of changes on disk and the size of changes sent downstream as
> > part of unfinished streamed transactions separately? Also the number
> > of transactions being tracked by the reorder buffer?
>
> For the purpose of tuning logical_decoding_work_mem, the additional
> metrics and statistics we need might not be many. But in order to make
> logical decoding more visible for users for debugging or diagnosing
> purposes, more statistics like the number of transactions being
> tracked by the reorder buffer might be required.
>

Ok. We could add the total number of transactions in the reorder
buffer at a given point in time to the report easily. How about
subtransactions? How about prepared but not committed/aborted
transactions?

> We need to note that the actual size of changes sent downstream
> actually depends on logical decoding plugins. For instance, we have
> table, row, and column filters in logical replication cases. It might
> be worth considering providing such statistics too. The statistics
> like the number of changes filtered out by the table filters or the
> change-kind filter (e.g., only publishing INSERTs etc) might be
> helpful for users to confirm the effectiveness of the filters they
> set.

I have proposed this in [1]. Please check.

>
> > If we are going
> > to report so much statistics about the contents of the reorder buffer,
> > is it better to have a separate view pg_stat_reorder_buffer for the
> > same?
>
> Given logical decoding can be used also by regular backend processes,
> I guess that such dynamic metrics would fit a system view dedicated to
> logical decoding, say pg_stat_reorder_buffer or
> pg_stat_logical_decoding.

Hmm. That means we will have to reserve some area in the shared memory
with as many slots as the number of replication slots (since that is
the maximum number of reorder buffers that can be in the system) and
use each one to maintain the stats. I haven't yet checked whether we
could use the stats maintaining system for it, but it should be
doable.

--
Best Wishes,
Ashutosh Bapat



Re: Report reorder buffer size

От
Bertrand Drouvot
Дата:
Hi,

On Tue, Aug 26, 2025 at 02:56:14PM +0530, Ashutosh Bapat wrote:
> On Mon, Aug 25, 2025 at 6:28 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > > I didn't know this is how it works. Thanks for correcting me. In that
> > > case, I think we should
> > > add a column in pg_stat_replication_slots reporting the number of
> > > times the memory limit is reached since the last reset. I am +0.5 on
> > > it being useful.
> >
> > Thanks. With this in place one could get a ratio like total_txns/total_txns_that_exceeds.
> > That could help to see if reaching logical_decoding_work_mem is rare or
> > frequent enough. If frequent, then maybe there is a need to adjust logical_decoding_work_mem.
> >
> 
> We are discussing some improvements to pg_stat_replication_slots over
> at [1]. Would you be able to create a patch reporting the number of
> times logical_decoding_work_mem is reached, in the patchset being
> discussed there?
> 
> [1] https://www.postgresql.org/message-id/CAExHW5s6KntzUyUoMbKR5dgwRmdV2Ay_2+AnTgYGAzo=Qv61wA@mail.gmail.com

Thanks for pointing me to this thread, I'll look at it and propose a patch
to report the number of times logical_decoding_work_mem is reached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Report reorder buffer size

От
Ashutosh Bapat
Дата:
>
> >
> > > If we are going
> > > to report so much statistics about the contents of the reorder buffer,
> > > is it better to have a separate view pg_stat_reorder_buffer for the
> > > same?
> >
> > Given logical decoding can be used also by regular backend processes,
> > I guess that such dynamic metrics would fit a system view dedicated to
> > logical decoding, say pg_stat_reorder_buffer or
> > pg_stat_logical_decoding.
>
> Hmm. That means we will have to reserve some area in the shared memory
> with as many slots as the number of replication slots (since that is
> the maximum number of reorder buffers that can be in the system) and
> use each one to maintain the stats. I haven't yet checked whether we
> could use the stats maintaining system for it, but it should be
> doable.

I think the beauty of reporting the statistics only for WAL senders,
as is done with the patch, was that we could piggyback stats update on
current logic to update WalSnd. If we want to do it for other
backends, it would be an extra overhead to lock and update the stats
outside WalSnd. And as I said in my email, WAL senders would be
arguably only serious users of logical decoding whose usage will be
worth monitoring - that's why pg_stat_replication reports only stats
about WAL senders and not other backends. Even if we report stats
about only WAL senders, it should be possible to report them through a
view other than pg_stat_replication. What's your opinion about this?

-- 
Best Wishes,
Ashutosh Bapat



Re: Report reorder buffer size

От
Bertrand Drouvot
Дата:
Hi,

On Tue, Aug 26, 2025 at 09:48:04AM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Aug 26, 2025 at 02:56:14PM +0530, Ashutosh Bapat wrote:
> > On Mon, Aug 25, 2025 at 6:28 PM Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> > > > I didn't know this is how it works. Thanks for correcting me. In that
> > > > case, I think we should
> > > > add a column in pg_stat_replication_slots reporting the number of
> > > > times the memory limit is reached since the last reset. I am +0.5 on
> > > > it being useful.
> > >
> > > Thanks. With this in place one could get a ratio like total_txns/total_txns_that_exceeds.
> > > That could help to see if reaching logical_decoding_work_mem is rare or
> > > frequent enough. If frequent, then maybe there is a need to adjust logical_decoding_work_mem.
> > >
> > 
> > We are discussing some improvements to pg_stat_replication_slots over
> > at [1]. Would you be able to create a patch reporting the number of
> > times logical_decoding_work_mem is reached, in the patchset being
> > discussed there?
> > 
> > [1] https://www.postgresql.org/message-id/CAExHW5s6KntzUyUoMbKR5dgwRmdV2Ay_2+AnTgYGAzo=Qv61wA@mail.gmail.com
> 
> Thanks for pointing me to this thread, I'll look at it and propose a patch
> to report the number of times logical_decoding_work_mem is reached.

I preferred to create a dedicated thread for it [1] (I think both may need dedicated
discussions).

[1]: https://www.postgresql.org/message-id/flat/aK6zNcmb%2BPpySFvG%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Report reorder buffer size

От
Masahiko Sawada
Дата:
On Tue, Aug 26, 2025 at 2:45 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> Hi Masahiko,
> Thanks for your inputs.
>
> On Tue, Aug 26, 2025 at 2:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > On Wed, Aug 13, 2025 at 5:40 AM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > Thank you for starting the thread about this topic. This is one of the
> > topics that I've been eager to address. Here are some random comments:
> >
>
> Glad that more people think there is a need for improving the reorder
> buffer statistics.
>
> > >
> > > With the attached WIP patch, we can report, at a given point in time,
> > > total size of reorder buffer i.e. sum of the size of all the changes
> > > belonging to the transactions active in the reorder buffer whether
> > > those changes are in memory, on disk or sent downstream as part of an
> > > unfinished streamed transaction. By sampling this value at regular
> > > intervals, one can observe the reorder buffer trendline as shown in
> > > the plots attached.
> >
> > Alternative idea (or an additional metric) would be the high-watermark
> > of memory usage. That way, users won't miss memory usage spikes that
> > might be missed by sampling the total memory usage.
>
> I think the trendline is important to make right trade-offs in case
> setting logical_decoding_work_mem to maximum is not possible. With a
> high watermark the question is how long does a high remain high? As
> the load changes, a high that was once may become lower or higher than
> the new high and soon irrelevant. Probably we could reset the
> high-watermark every time the view is sampled, so that we provide a
> trendline for the high-water mark as well. But I want to be cautious
> about adding that complexity of tracking maxima accurately and then
> maintain it forever. If sampling is frequent enough usually it will
> capture a maxima and minima good enough for practical purposes. The
> users will need to consider the trendline as approximate anyway since
> the load will show slight variations over the time.

Agreed with the importance of trendline.

> Please share your idea of reporting high-watermark.

I was thinking of this high-watermark idea since it doesn't require
any external system/tool to get the information for tuning
logical_decoding_work_mem. It would be easy to use when users want to
figure out maximum data usage for logical decoding of the particular
workload.

>
> > >
> > > 2. Is the size of the reorder buffer enough or we want to also track
> > > the size of changes on disk and the size of changes sent downstream as
> > > part of unfinished streamed transactions separately? Also the number
> > > of transactions being tracked by the reorder buffer?
> >
> > For the purpose of tuning logical_decoding_work_mem, the additional
> > metrics and statistics we need might not be many. But in order to make
> > logical decoding more visible for users for debugging or diagnosing
> > purposes, more statistics like the number of transactions being
> > tracked by the reorder buffer might be required.
> >
>
> Ok. We could add the total number of transactions in the reorder
> buffer at a given point in time to the report easily. How about
> subtransactions? How about prepared but not committed/aborted
> transactions?

For debugging or diagnosing purposes, this information might be
useful, but I'm not sure we need this information of logical decodings
that are running.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Report reorder buffer size

От
Ashutosh Bapat
Дата:
On Thu, Aug 28, 2025 at 5:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Aug 26, 2025 at 2:45 AM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > Hi Masahiko,
> > Thanks for your inputs.
> >
> > On Tue, Aug 26, 2025 at 2:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > On Wed, Aug 13, 2025 at 5:40 AM Ashutosh Bapat
> > > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > Thank you for starting the thread about this topic. This is one of the
> > > topics that I've been eager to address. Here are some random comments:
> > >
> >
> > Glad that more people think there is a need for improving the reorder
> > buffer statistics.
> >
> > > >
> > > > With the attached WIP patch, we can report, at a given point in time,
> > > > total size of reorder buffer i.e. sum of the size of all the changes
> > > > belonging to the transactions active in the reorder buffer whether
> > > > those changes are in memory, on disk or sent downstream as part of an
> > > > unfinished streamed transaction. By sampling this value at regular
> > > > intervals, one can observe the reorder buffer trendline as shown in
> > > > the plots attached.
> > >
> > > Alternative idea (or an additional metric) would be the high-watermark
> > > of memory usage. That way, users won't miss memory usage spikes that
> > > might be missed by sampling the total memory usage.
> >
> > I think the trendline is important to make right trade-offs in case
> > setting logical_decoding_work_mem to maximum is not possible. With a
> > high watermark the question is how long does a high remain high? As
> > the load changes, a high that was once may become lower or higher than
> > the new high and soon irrelevant. Probably we could reset the
> > high-watermark every time the view is sampled, so that we provide a
> > trendline for the high-water mark as well. But I want to be cautious
> > about adding that complexity of tracking maxima accurately and then
> > maintain it forever. If sampling is frequent enough usually it will
> > capture a maxima and minima good enough for practical purposes. The
> > users will need to consider the trendline as approximate anyway since
> > the load will show slight variations over the time.
>
> Agreed with the importance of trendline.
>
> > Please share your idea of reporting high-watermark.
>
> I was thinking of this high-watermark idea since it doesn't require
> any external system/tool to get the information for tuning
> logical_decoding_work_mem. It would be easy to use when users want to
> figure out maximum data usage for logical decoding of the particular
> workload.

Ok. I am interested in knowing how you think we should keep track of
the high watermark.

Do we keep it updated as new maxima are found? And keep reporting the
last high watermark seen till a new maxima is reached? With that,
users would end up setting logical_decoding_work_mem (and hence
provisioning for it) at a higher value, even if the workload changes
so that the reorder buffer never reaches that high watermark. So, I
think we should reset the high watermark. Do you think so? If yes,
When do we do that? How often do we reset it?

Tracking maxima and minima of other parameters in pg_stat_replication
like replication lag might also help users e.g. to tune their
networks. But we don't track their watermarks. External tools are used
for that. I was envisioning something like that for reorder buffer
size as well.

>
> >
> > > >
> > > > 2. Is the size of the reorder buffer enough or we want to also track
> > > > the size of changes on disk and the size of changes sent downstream as
> > > > part of unfinished streamed transactions separately? Also the number
> > > > of transactions being tracked by the reorder buffer?
> > >
> > > For the purpose of tuning logical_decoding_work_mem, the additional
> > > metrics and statistics we need might not be many. But in order to make
> > > logical decoding more visible for users for debugging or diagnosing
> > > purposes, more statistics like the number of transactions being
> > > tracked by the reorder buffer might be required.
> > >
> >
> > Ok. We could add the total number of transactions in the reorder
> > buffer at a given point in time to the report easily. How about
> > subtransactions? How about prepared but not committed/aborted
> > transactions?
>
> For debugging or diagnosing purposes, this information might be
> useful, but I'm not sure we need this information of logical decodings
> that are running.

I was not clear. Sorry. I meant reporting the number of
subtransactions and prepared transactions in the reorder buffer (not
the actual subtransactions or prepared transactions themselves). Do
you think that reporting just the number of transactions in the
reorder buffer is enough? Do you think that reporting the number of
subtransactions and the number of prepared transactions is not
required?

--
Best Wishes,
Ashutosh Bapat



Re: Report reorder buffer size

От
Masahiko Sawada
Дата:
On Wed, Aug 27, 2025 at 9:36 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Aug 28, 2025 at 5:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Aug 26, 2025 at 2:45 AM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > Hi Masahiko,
> > > Thanks for your inputs.
> > >
> > > On Tue, Aug 26, 2025 at 2:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > > On Wed, Aug 13, 2025 at 5:40 AM Ashutosh Bapat
> > > > <ashutosh.bapat.oss@gmail.com> wrote:
> > > >
> > > > Thank you for starting the thread about this topic. This is one of the
> > > > topics that I've been eager to address. Here are some random comments:
> > > >
> > >
> > > Glad that more people think there is a need for improving the reorder
> > > buffer statistics.
> > >
> > > > >
> > > > > With the attached WIP patch, we can report, at a given point in time,
> > > > > total size of reorder buffer i.e. sum of the size of all the changes
> > > > > belonging to the transactions active in the reorder buffer whether
> > > > > those changes are in memory, on disk or sent downstream as part of an
> > > > > unfinished streamed transaction. By sampling this value at regular
> > > > > intervals, one can observe the reorder buffer trendline as shown in
> > > > > the plots attached.
> > > >
> > > > Alternative idea (or an additional metric) would be the high-watermark
> > > > of memory usage. That way, users won't miss memory usage spikes that
> > > > might be missed by sampling the total memory usage.
> > >
> > > I think the trendline is important to make right trade-offs in case
> > > setting logical_decoding_work_mem to maximum is not possible. With a
> > > high watermark the question is how long does a high remain high? As
> > > the load changes, a high that was once may become lower or higher than
> > > the new high and soon irrelevant. Probably we could reset the
> > > high-watermark every time the view is sampled, so that we provide a
> > > trendline for the high-water mark as well. But I want to be cautious
> > > about adding that complexity of tracking maxima accurately and then
> > > maintain it forever. If sampling is frequent enough usually it will
> > > capture a maxima and minima good enough for practical purposes. The
> > > users will need to consider the trendline as approximate anyway since
> > > the load will show slight variations over the time.
> >
> > Agreed with the importance of trendline.
> >
> > > Please share your idea of reporting high-watermark.
> >
> > I was thinking of this high-watermark idea since it doesn't require
> > any external system/tool to get the information for tuning
> > logical_decoding_work_mem. It would be easy to use when users want to
> > figure out maximum data usage for logical decoding of the particular
> > workload.
>
> Ok. I am interested in knowing how you think we should keep track of
> the high watermark.
>
> Do we keep it updated as new maxima are found? And keep reporting the
> last high watermark seen till a new maxima is reached? With that,
> users would end up setting logical_decoding_work_mem (and hence
> provisioning for it) at a higher value, even if the workload changes
> so that the reorder buffer never reaches that high watermark. So, I
> think we should reset the high watermark. Do you think so? If yes,
> When do we do that? How often do we reset it?

I think it ultimately depends on use cases but I imagine that users
can reset the high watermark value after adjusting
logical_decoding_work_mem, and see how the new value works, then
adjust the parameter and reset the value again, ... repeating.

> Tracking maxima and minima of other parameters in pg_stat_replication
> like replication lag might also help users e.g. to tune their
> networks. But we don't track their watermarks. External tools are used
> for that. I was envisioning something like that for reorder buffer
> size as well.

I think that high watermark value would be a new type of statistics we
collect. As far as I know pg_stat_statement collects similar values
but there is no precedent in the core.

>
> >
> > >
> > > > >
> > > > > 2. Is the size of the reorder buffer enough or we want to also track
> > > > > the size of changes on disk and the size of changes sent downstream as
> > > > > part of unfinished streamed transactions separately? Also the number
> > > > > of transactions being tracked by the reorder buffer?
> > > >
> > > > For the purpose of tuning logical_decoding_work_mem, the additional
> > > > metrics and statistics we need might not be many. But in order to make
> > > > logical decoding more visible for users for debugging or diagnosing
> > > > purposes, more statistics like the number of transactions being
> > > > tracked by the reorder buffer might be required.
> > > >
> > >
> > > Ok. We could add the total number of transactions in the reorder
> > > buffer at a given point in time to the report easily. How about
> > > subtransactions? How about prepared but not committed/aborted
> > > transactions?
> >
> > For debugging or diagnosing purposes, this information might be
> > useful, but I'm not sure we need this information of logical decodings
> > that are running.
>
> I was not clear. Sorry. I meant reporting the number of
> subtransactions and prepared transactions in the reorder buffer (not
> the actual subtransactions or prepared transactions themselves). Do
> you think that reporting just the number of transactions in the
> reorder buffer is enough? Do you think that reporting the number of
> subtransactions and the number of prepared transactions is not
> required?

Given that collecting new statistics and updating the statistics
real-time require costs, it's ultimately rewards vs. costs. In terms
of additional costs, I guess tracking the number of prepared
transactions in the reorderbuffer would not be costly, but I'm not
sure about the number of subtransactions. I personally would avoid
collecting these statistics unless there are explicit use cases.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Report reorder buffer size

От
Ashutosh Bapat
Дата:
On Tue, Sep 23, 2025 at 11:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Aug 27, 2025 at 9:36 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > Ok. I am interested in knowing how you think we should keep track of
> > the high watermark.
> >
> > Do we keep it updated as new maxima are found? And keep reporting the
> > last high watermark seen till a new maxima is reached? With that,
> > users would end up setting logical_decoding_work_mem (and hence
> > provisioning for it) at a higher value, even if the workload changes
> > so that the reorder buffer never reaches that high watermark. So, I
> > think we should reset the high watermark. Do you think so? If yes,
> > When do we do that? How often do we reset it?
>
> I think it ultimately depends on use cases but I imagine that users
> can reset the high watermark value after adjusting
> logical_decoding_work_mem, and see how the new value works, then
> adjust the parameter and reset the value again, ... repeating.
>
> > Tracking maxima and minima of other parameters in pg_stat_replication
> > like replication lag might also help users e.g. to tune their
> > networks. But we don't track their watermarks. External tools are used
> > for that. I was envisioning something like that for reorder buffer
> > size as well.
>
> I think that high watermark value would be a new type of statistics we
> collect. As far as I know pg_stat_statement collects similar values
> but there is no precedent in the core.
>

Thanks for pointing me to pg_stat_statements. I see we store
cumulative statistics like min and max plan and execution times for
queries there. We maintain cumulative statistics about logical
decoding/replication in pg_stat_replication_slot. I think the high
watermark fits there. It also has functionality to reset the
statistics which can be used to reset the high watermark along with
other statistics.

> >
> > >
> > > >
> > > > > >
> > > > > > 2. Is the size of the reorder buffer enough or we want to also track
> > > > > > the size of changes on disk and the size of changes sent downstream as
> > > > > > part of unfinished streamed transactions separately? Also the number
> > > > > > of transactions being tracked by the reorder buffer?
> > > > >
> > > > > For the purpose of tuning logical_decoding_work_mem, the additional
> > > > > metrics and statistics we need might not be many. But in order to make
> > > > > logical decoding more visible for users for debugging or diagnosing
> > > > > purposes, more statistics like the number of transactions being
> > > > > tracked by the reorder buffer might be required.
> > > > >
> > > >
> > > > Ok. We could add the total number of transactions in the reorder
> > > > buffer at a given point in time to the report easily. How about
> > > > subtransactions? How about prepared but not committed/aborted
> > > > transactions?
> > >
> > > For debugging or diagnosing purposes, this information might be
> > > useful, but I'm not sure we need this information of logical decodings
> > > that are running.
> >
> > I was not clear. Sorry. I meant reporting the number of
> > subtransactions and prepared transactions in the reorder buffer (not
> > the actual subtransactions or prepared transactions themselves). Do
> > you think that reporting just the number of transactions in the
> > reorder buffer is enough? Do you think that reporting the number of
> > subtransactions and the number of prepared transactions is not
> > required?
>
> Given that collecting new statistics and updating the statistics
> real-time require costs, it's ultimately rewards vs. costs. In terms
> of additional costs, I guess tracking the number of prepared
> transactions in the reorderbuffer would not be costly, but I'm not
> sure about the number of subtransactions. I personally would avoid
> collecting these statistics unless there are explicit use cases.
>

I agree. I think it's not clear whether statistics about number of
transactions currently active in the reorder buffer is useful. So
let's leave it. If needed we will add it at a later point in time.

I will submit a complete patch soon.

--
Best Wishes,
Ashutosh Bapat



Re: Report reorder buffer size

От
Andres Freund
Дата:
Hi,

On 2025-09-23 17:45:19 +0530, Ashutosh Bapat wrote:
> On Tue, Sep 23, 2025 at 11:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Aug 27, 2025 at 9:36 PM Ashutosh Bapat
> > <ashutosh.bapat.oss@gmail.com> wrote:
> > >
> > > Ok. I am interested in knowing how you think we should keep track of
> > > the high watermark.
> > >
> > > Do we keep it updated as new maxima are found? And keep reporting the
> > > last high watermark seen till a new maxima is reached? With that,
> > > users would end up setting logical_decoding_work_mem (and hence
> > > provisioning for it) at a higher value, even if the workload changes
> > > so that the reorder buffer never reaches that high watermark. So, I
> > > think we should reset the high watermark. Do you think so? If yes,
> > > When do we do that? How often do we reset it?
> >
> > I think it ultimately depends on use cases but I imagine that users
> > can reset the high watermark value after adjusting
> > logical_decoding_work_mem, and see how the new value works, then
> > adjust the parameter and reset the value again, ... repeating.
> >
> > > Tracking maxima and minima of other parameters in pg_stat_replication
> > > like replication lag might also help users e.g. to tune their
> > > networks. But we don't track their watermarks. External tools are used
> > > for that. I was envisioning something like that for reorder buffer
> > > size as well.
> >
> > I think that high watermark value would be a new type of statistics we
> > collect. As far as I know pg_stat_statement collects similar values
> > but there is no precedent in the core.
> >
> 
> Thanks for pointing me to pg_stat_statements. I see we store
> cumulative statistics like min and max plan and execution times for
> queries there. We maintain cumulative statistics about logical
> decoding/replication in pg_stat_replication_slot. I think the high
> watermark fits there. It also has functionality to reset the
> statistics which can be used to reset the high watermark along with
> other statistics.

I think pg_stat_statement is an anti-example, if anything. It's been so
overloaded with barely useful fields that it got so so slow that it starts to
cause contention even in just moderately busy workloads.

Let's make this a minimal change, instead of over-complicating this beyond
usefulness.

Greetings,

Andres Freund



Re: Report reorder buffer size

От
Ashutosh Bapat
Дата:
On Tue, Sep 23, 2025 at 6:04 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-09-23 17:45:19 +0530, Ashutosh Bapat wrote:
> > On Tue, Sep 23, 2025 at 11:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Aug 27, 2025 at 9:36 PM Ashutosh Bapat
> > > <ashutosh.bapat.oss@gmail.com> wrote:
> > > >
> > > > Ok. I am interested in knowing how you think we should keep track of
> > > > the high watermark.
> > > >
> > > > Do we keep it updated as new maxima are found? And keep reporting the
> > > > last high watermark seen till a new maxima is reached? With that,
> > > > users would end up setting logical_decoding_work_mem (and hence
> > > > provisioning for it) at a higher value, even if the workload changes
> > > > so that the reorder buffer never reaches that high watermark. So, I
> > > > think we should reset the high watermark. Do you think so? If yes,
> > > > When do we do that? How often do we reset it?
> > >
> > > I think it ultimately depends on use cases but I imagine that users
> > > can reset the high watermark value after adjusting
> > > logical_decoding_work_mem, and see how the new value works, then
> > > adjust the parameter and reset the value again, ... repeating.
> > >
> > > > Tracking maxima and minima of other parameters in pg_stat_replication
> > > > like replication lag might also help users e.g. to tune their
> > > > networks. But we don't track their watermarks. External tools are used
> > > > for that. I was envisioning something like that for reorder buffer
> > > > size as well.
> > >
> > > I think that high watermark value would be a new type of statistics we
> > > collect. As far as I know pg_stat_statement collects similar values
> > > but there is no precedent in the core.
> > >
> >
> > Thanks for pointing me to pg_stat_statements. I see we store
> > cumulative statistics like min and max plan and execution times for
> > queries there. We maintain cumulative statistics about logical
> > decoding/replication in pg_stat_replication_slot. I think the high
> > watermark fits there. It also has functionality to reset the
> > statistics which can be used to reset the high watermark along with
> > other statistics.
>
> I think pg_stat_statement is an anti-example, if anything. It's been so
> overloaded with barely useful fields that it got so so slow that it starts to
> cause contention even in just moderately busy workloads.
>
> Let's make this a minimal change, instead of over-complicating this beyond
> usefulness.

Thanks Bertrand, Masahiko and Andres for your thoughts.

Summarising the discussion so far, it seems we agree that reporting
the current size of reorder buffer (including the spilled and streamed
transactions) in pg_stat_replication is useful.Including the sizes of
changes from spilled as well as streamed transactions allows users to
see how reorder buffer size would trend if logical_decoding_work_mem
were to be infinite.  If it's sampled frequently enough, the trendline
can be used to find an optimal value of logical_decoding_work_mem
considering various trade-offs.

We did consider other metrics like number of transactions in the
reorder buffer and highest size of reorder buffer. But the opinions
about their usefulness differ. Hence defering them for now.

I will work on producing a complete patch next.

--
Best Wishes,
Ashutosh Bapat