Обсуждение: Get rid of pgstat_count_backend_io_op*() functions

Поиск
Список
Период
Сортировка

Get rid of pgstat_count_backend_io_op*() functions

От
Bertrand Drouvot
Дата:
Hi hackers,

This patch removes the functions that are incrementing the backend IO stats.

Instead, it now copies the IO pending stats to the backend IO pending stats when
the pending IO stats are flushed. This behaves the same way as for some relation
and database stats (see pgstat_relation_flush_cb()).

It's done that way to avoid incrementing the "same" counters twice as it produces
increased overhead in profiles (reported by Andres in [1]).

Please note that per-backend statistics have to be last when we define the
PGSTAT_KIND_% values in pgstat_kind.h, as some of their pending stats are
populated while other stats kinds are flushing.

[1]: https://www.postgresql.org/message-id/7fhpds4xqk6bnudzmzkqi33pinsxammpljwde5gfkjdygvejrj%40ojkzfr7dxkmm

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Вложения

Re: Get rid of pgstat_count_backend_io_op*() functions

От
Michael Paquier
Дата:
On Mon, Sep 01, 2025 at 02:07:27PM +0000, Bertrand Drouvot wrote:
> Instead, it now copies the IO pending stats to the backend IO pending stats when
> the pending IO stats are flushed. This behaves the same way as for some relation
> and database stats (see pgstat_relation_flush_cb()).
>
> It's done that way to avoid incrementing the "same" counters twice as it produces
> increased overhead in profiles (reported by Andres in [1]).

So, is the complaint about the addition of the extra incrementations
for backend counters on top of the existing IO counters in v17,
leading to more instructions generated, the addition of the functions,
or both things at the same time?  Do you have an example of profile
and/or workload where this actually matters?  I am aware that you are
doing crazy things in this area, so..

Also, how much does it matter for v18?  Removing both function calls
and counter incrementations implies refactoring that touches both the
I/O stats and the backend counterparts, which may not be a a good idea
during RC.  I'd be OK in beta 1~2 for this amount of work, the timing
of the discussion is what it is.  So, depending on the arguments
raised this may warrant a revert for the IO part of the backend stats.

> Please note that per-backend statistics have to be last when we define the
> PGSTAT_KIND_% values in pgstat_kind.h, as some of their pending stats are
> populated while other stats kinds are flushing.

+/*
+ * per-backend statistics has to be last, as some of their pending stats are
+ * populated while other stats kinds are flushing.
+ */
+#define PGSTAT_KIND_BACKEND    12    /* per-backend statistics */

I don't think that adding ordering dependencies for the stats kinds is
a good concept in general.  What makes you think that we could not
have a custom pgstats kind willing to increment backend counters as
well, say if we have a table AM with its own stats, for example?

Removing both the function calls and the extra counter incrementations
means to do the same thing as the WAL stats, where we have one
structure in charge of incrementing the counters, then the data diffs
are fed when flushing the entries in all the stats kinds.  So this
would imply the introduction of an equivalent to WalUsage, but applied
to the IO stats, like a IOUsage, or something like that.
--
Michael

Вложения

Re: Get rid of pgstat_count_backend_io_op*() functions

От
Bertrand Drouvot
Дата:
Hi,

On Tue, Sep 02, 2025 at 08:19:22AM +0900, Michael Paquier wrote:
> On Mon, Sep 01, 2025 at 02:07:27PM +0000, Bertrand Drouvot wrote:
> > Instead, it now copies the IO pending stats to the backend IO pending stats when
> > the pending IO stats are flushed. This behaves the same way as for some relation
> > and database stats (see pgstat_relation_flush_cb()).
> > 
> > It's done that way to avoid incrementing the "same" counters twice as it produces
> > increased overhead in profiles (reported by Andres in [1]).
> 
> So, is the complaint about the addition of the extra incrementations
> for backend counters on top of the existing IO counters in v17,
> leading to more instructions generated, the addition of the functions,
> or both things at the same time?  Do you have an example of profile
> and/or workload where this actually matters?  I am aware that you are
> doing crazy things in this area, so..

I did not observe such noticeable overhead myself, so I guess that those questions
are for Andres. That said, I think it makes fully sense to avoid incrementing
the "same" counters twice (and specially in hot path).

> > Please note that per-backend statistics have to be last when we define the
> > PGSTAT_KIND_% values in pgstat_kind.h, as some of their pending stats are
> > populated while other stats kinds are flushing.
> 
> +/*
> + * per-backend statistics has to be last, as some of their pending stats are
> + * populated while other stats kinds are flushing.
> + */
> +#define PGSTAT_KIND_BACKEND    12    /* per-backend statistics */
> 
> I don't think that adding ordering dependencies for the stats kinds is
> a good concept in general.

I agree that it has to be simple. I don't think that just ensuring that the
backends are last introduces much complexity.

> What makes you think that we could not
> have a custom pgstats kind willing to increment backend counters as
> well, say if we have a table AM with its own stats, for example?

Nothing, but the issue would be the same if we keep the backends
at their current position. That's right that it's "only" solving the in-core
ordering.

> Removing both the function calls and the extra counter incrementations
> means to do the same thing as the WAL stats, where we have one
> structure in charge of incrementing the counters, then the data diffs
> are fed when flushing the entries in all the stats kinds.  So this
> would imply the introduction of an equivalent to WalUsage, but applied
> to the IO stats, like a IOUsage, or something like that.

Yeah, that sounds more complicated at first but I think that would solve
the "ordering" issue mentioned above.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Get rid of pgstat_count_backend_io_op*() functions

От
Bertrand Drouvot
Дата:
Hi,

On Tue, Sep 02, 2025 at 07:41:59AM +0000, Bertrand Drouvot wrote:
> On Tue, Sep 02, 2025 at 08:19:22AM +0900, Michael Paquier wrote:
> > Removing both the function calls and the extra counter incrementations
> > means to do the same thing as the WAL stats, where we have one
> > structure in charge of incrementing the counters, then the data diffs
> > are fed when flushing the entries in all the stats kinds.  So this
> > would imply the introduction of an equivalent to WalUsage, but applied
> > to the IO stats, like a IOUsage, or something like that.
> 
> Yeah, that sounds more complicated at first but I think that would solve
> the "ordering" issue mentioned above.

Attached is a draft version that implements an equivalent of WalUsage, that way:

- no ordering concerns anymore
- we can also get rid of PgStat_PendingIO ("replaced" by IOUsage)
- we can also get rid of PgStat_BackendPending (since both IO and WAL stats now
use the WAL's approach)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Вложения

Re: Get rid of pgstat_count_backend_io_op*() functions

От
Andres Freund
Дата:
Hi,

On 2025-09-02 08:19:22 +0900, Michael Paquier wrote:
> On Mon, Sep 01, 2025 at 02:07:27PM +0000, Bertrand Drouvot wrote:
> > Instead, it now copies the IO pending stats to the backend IO pending stats when
> > the pending IO stats are flushed. This behaves the same way as for some relation
> > and database stats (see pgstat_relation_flush_cb()).
> > 
> > It's done that way to avoid incrementing the "same" counters twice as it produces
> > increased overhead in profiles (reported by Andres in [1]).
> 
> So, is the complaint about the addition of the extra incrementations
> for backend counters on top of the existing IO counters in v17,
> leading to more instructions generated, the addition of the functions,
> or both things at the same time?

The latter, I think.


> Do you have an example of profile and/or workload where this actually
> matters?

It's not a large regression by any means - it shows up when micro-benchmarking
seqscans that are consumed where quickly (e.g. OFFSET large; or count(*)).


> Also, how much does it matter for v18?

I think it's ok for 18. We just don't want to go further down the wrong way. I
objected to this approach in the context of the tuple level counters, where it
matters more, because obviously those are incremented a lot more often.


> Removing both the function calls and the extra counter incrementations
> means to do the same thing as the WAL stats, where we have one
> structure in charge of incrementing the counters, then the data diffs
> are fed when flushing the entries in all the stats kinds.

I think that's the wrong direction to go. Diffing stats is far from cheap and
gets more expensive the more detail you add to stats.

EXPLAIN ANALYZE spends a large chunk of the time doing diffing of buffer
access stats, for example. We need to work towards doing less of that stuff,
not more.

Greetings,

Andres Freund



Re: Get rid of pgstat_count_backend_io_op*() functions

От
Michael Paquier
Дата:
On Tue, Sep 02, 2025 at 12:42:54PM -0400, Andres Freund wrote:
> On 2025-09-02 08:19:22 +0900, Michael Paquier wrote:
>> On Mon, Sep 01, 2025 at 02:07:27PM +0000, Bertrand Drouvot wrote:
>>> Instead, it now copies the IO pending stats to the backend IO pending stats when
>>> the pending IO stats are flushed. This behaves the same way as for some relation
>>> and database stats (see pgstat_relation_flush_cb()).
>>>
>>> It's done that way to avoid incrementing the "same" counters twice as it produces
>>> increased overhead in profiles (reported by Andres in [1]).
>>
>> So, is the complaint about the addition of the extra incrementations
>> for backend counters on top of the existing IO counters in v17,
>> leading to more instructions generated, the addition of the functions,
>> or both things at the same time?
>
> The latter, I think.

Another option would be more inlining, applying the same to the
pgstat_io.c parts.  It sounds to me like that limiting the number of
incrementations to happen only once when shared across multiple stats
kinds would be beneficial anyway, so we could just do that.  My whole
picture would then come to:
- Split the IO counters (PendingIOStats) into a separate file, with
perhaps inline functions to have a more elegant layer, somewhat
relevant for pgstat_report_fixed as well.
- Update the stats kinds to apply the diffs in the counters at flush
time.

>> Removing both the function calls and the extra counter incrementations
>> means to do the same thing as the WAL stats, where we have one
>> structure in charge of incrementing the counters, then the data diffs
>> are fed when flushing the entries in all the stats kinds.
>
> I think that's the wrong direction to go. Diffing stats is far from cheap and
> gets more expensive the more detail you add to stats.

Even if we only do the diffs calculations when flushing the pending
stats in the flush callbacks?  That would mean to calculate the diffs
when stats reports are forced at transaction commit or when the delay
threshold is reached in non-force mode for pgstat_report_stat().

> EXPLAIN ANALYZE spends a large chunk of the time doing diffing of buffer
> access stats, for example. We need to work towards doing less of that stuff,
> not more.

Yep, agreed.  I've seen people complaining about the overhead that can
happen in the EXPLAIN path, even for stuff aimed to be maintained
outside of core, that uses the backend hooks.
--
Michael

Вложения

Re: Get rid of pgstat_count_backend_io_op*() functions

От
Bertrand Drouvot
Дата:
Hi,

On Wed, Sep 03, 2025 at 02:47:51PM +0900, Michael Paquier wrote:
> On Tue, Sep 02, 2025 at 12:42:54PM -0400, Andres Freund wrote:
> > I think that's the wrong direction to go. Diffing stats is far from cheap and
> > gets more expensive the more detail you add to stats.
> 
> Even if we only do the diffs calculations when flushing the pending
> stats in the flush callbacks?

If my math are correct we have 3 × 5 × 8 = 120 array positions and for each
we'd do the diff on counts, bytes and times. That means 360 diff operations per
flush. That means 720 diff operations to flush the global and per backend IO
stats. That's certainly more expensive than "just" copying the pending stats
as proposed in v1.

As far the ordering concern for v1, what about:

- let backend kind enum defined as 6
- move the backend flush outside of the loop in pgstat_report_stat()

That way the backend are flushed last and that solves your concern about custom
stats kind.

The backend would not be the "only" exception in pgstat_report_stat(), the db
stats are already handled as exception with pgstat_update_dbstats().

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Get rid of pgstat_count_backend_io_op*() functions

От
Bertrand Drouvot
Дата:
Hi,

On Wed, Sep 03, 2025 at 07:33:37AM +0000, Bertrand Drouvot wrote:
> Hi,
> 
> On Wed, Sep 03, 2025 at 02:47:51PM +0900, Michael Paquier wrote:
> > On Tue, Sep 02, 2025 at 12:42:54PM -0400, Andres Freund wrote:
> > > I think that's the wrong direction to go. Diffing stats is far from cheap and
> > > gets more expensive the more detail you add to stats.
> > 
> > Even if we only do the diffs calculations when flushing the pending
> > stats in the flush callbacks?
> 
> If my math are correct we have 3 × 5 × 8 = 120 array positions and for each
> we'd do the diff on counts, bytes and times. That means 360 diff operations per
> flush. That means 720 diff operations to flush the global and per backend IO
> stats. That's certainly more expensive than "just" copying the pending stats
> as proposed in v1.
> 
> As far the ordering concern for v1, what about:
> 
> - let backend kind enum defined as 6
> - move the backend flush outside of the loop in pgstat_report_stat()
> 
> That way the backend are flushed last and that solves your concern about custom
> stats kind.
> 
> The backend would not be the "only" exception in pgstat_report_stat(), the db
> stats are already handled as exception with pgstat_update_dbstats().

That would give something like v3 attached, thoughts?

Once we agree on the approach to deal with per backend pending stats, I'll make
use of the same in [1] and [2].

[1]: https://www.postgresql.org/message-id/flat/aJDBwNlyiFuJOoPx@ip-10-97-1-34.eu-west-3.compute.internal
[2]: https://www.postgresql.org/message-id/flat/aJrxug4LCg4Hm5Mm@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Вложения

Re: Get rid of pgstat_count_backend_io_op*() functions

От
Michael Paquier
Дата:
On Wed, Sep 24, 2025 at 07:48:32AM +0000, Bertrand Drouvot wrote:
> On Wed, Sep 03, 2025 at 07:33:37AM +0000, Bertrand Drouvot wrote:
>> As far the ordering concern for v1, what about:
>>
>> - let backend kind enum defined as 6
>> - move the backend flush outside of the loop in pgstat_report_stat()
>>
>> That way the backend are flushed last and that solves your concern about custom
>> stats kind.
>>
>> The backend would not be the "only" exception in pgstat_report_stat(), the db
>> stats are already handled as exception with pgstat_update_dbstats().
>
> That would give something like v3 attached, thoughts?
>
> Once we agree on the approach to deal with per backend pending stats, I'll make
> use of the same in [1] and [2].

         for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
         {
[...]
+            if (kind == PGSTAT_KIND_BACKEND)
+                continue;
             if (!kind_info)
                 continue;
             if (!kind_info->flush_static_cb)

             partial_flush |= kind_info->flush_static_cb(nowait);
         }
+
+        /*
+         * Do per-backend last as some of their pending stats are populated
+         * during the flush of other stats kinds.
+         */
+        kind_info = pgstat_get_kind_info(PGSTAT_KIND_BACKEND);
+        partial_flush |= kind_info->flush_static_cb(nowait);
     }

Hmm.  I really have an issue with the ordering dependency this
introduces between stats kinds, assumed to happen inside the code.  Up
to now, we have always considered stats kinds as rather independent
pieces at flush time, simplifying its code (I am not saying counter
increment time).  I get that you are doing what you do here because
the IO pending statistics (PendingIOStats) are reset each time the
stat IO flush callback finishes its job, and we'd lose the amount of
stats saved between two reports.

Putting aside the style issue (aka I like the pgstat_count_backend_*
interface, rather than copying the pending data), I think that we lack
data to be able to clearly identify how and actually how much we
should try to optimize and change this code:
- Should we try to use more inlining?
- Should we try to use more static structures, without any functions
at all.
- Should we try to prefer copies?  In which case, could it make sense
to introduce a concept of dependency between stats kinds?  This way,
we could order how the flush actions are taken rather than having a
for loop based on the sole stats kind IDs.
- Should we try to reduce diff operations at flush time?  This is
something done by the WAL stats and worry about their costs, with
diffs calculated each time between the current and previous states?
With a non-forced report delay of 10s, this does not worry me, short
transactions would, as we force reports each time a transaction
finishes.

For the first and second point, more optimization could be done for
more stats kinds.  For the last point, we'd want to reconsider how
pgstat_wal.c does its job.  As the proposal stands, at least it seems
to me, this sacrifices code readability, though I get that the opinion
of each may differ on the matter.  And perhaps there are practices
that we may be able to use in core for all the stats kinds, as well.
That would mean doing benchmarks with specific workloads on big
hardware, likely.  So, what I am seeing currently here is an
incomplete picture.
--
Michael

Вложения

Re: Get rid of pgstat_count_backend_io_op*() functions

От
Bertrand Drouvot
Дата:
Hi,

On Thu, Sep 25, 2025 at 03:42:33PM +0900, Michael Paquier wrote:
> On Wed, Sep 24, 2025 at 07:48:32AM +0000, Bertrand Drouvot wrote:
> > On Wed, Sep 03, 2025 at 07:33:37AM +0000, Bertrand Drouvot wrote:
> >> As far the ordering concern for v1, what about:
> >> 
> >> - let backend kind enum defined as 6
> >> - move the backend flush outside of the loop in pgstat_report_stat()
> >> 
> >> That way the backend are flushed last and that solves your concern about custom
> >> stats kind.
> >> 
> >> The backend would not be the "only" exception in pgstat_report_stat(), the db
> >> stats are already handled as exception with pgstat_update_dbstats().
> > 
> > That would give something like v3 attached, thoughts?
> > 
> > Once we agree on the approach to deal with per backend pending stats, I'll make
> > use of the same in [1] and [2].
> 
>          for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
>          {
> [...]
> +            if (kind == PGSTAT_KIND_BACKEND)
> +                continue;
>              if (!kind_info)
>                  continue;
>              if (!kind_info->flush_static_cb)
>  
>              partial_flush |= kind_info->flush_static_cb(nowait);
>          }
> +
> +        /*
> +         * Do per-backend last as some of their pending stats are populated
> +         * during the flush of other stats kinds.
> +         */
> +        kind_info = pgstat_get_kind_info(PGSTAT_KIND_BACKEND);
> +        partial_flush |= kind_info->flush_static_cb(nowait);
>      }
> 
> Hmm.  I really have an issue with the ordering dependency this
> introduces between stats kinds, assumed to happen inside the code.  Up
> to now, we have always considered stats kinds as rather independent
> pieces at flush time, simplifying its code (I am not saying counter
> increment time).

We already have such dependencies for some databases/relations stats but that
works well because all of them are "variable" stats and their flush is based
on the pgStatPending list.

> I get that you are doing what you do here because
> the IO pending statistics (PendingIOStats) are reset each time the
> stat IO flush callback finishes its job, and we'd lose the amount of
> stats saved between two reports.

We don't lose the stats, but it takes one more reporting cyle to
get them flushed. The sequence (without the ordering being forced) is:

1. static flush of IO backend (happens before IO because backend's enum < IO's enum)
2. static flush of IO -> pending IOs are copied to backend's pending IOs

later on (i.e next report cycle):

3. static flush of IO backend
4. static flush of IO -> pending IOs are copied to backend's pending IO

During 3. we now see the backend IO stats coming from 2.

So the stats would still be reported correctly, just with a one-cycle backend
IO reporting stats delay. 

> Putting aside the style issue (aka I like the pgstat_count_backend_*
> interface, rather than copying the pending data), I think that we lack
> data to be able to clearly identify how and actually how much we
> should try to optimize and change this code:
> - Should we try to use more inlining?
> - Should we try to use more static structures, without any functions
> at all.
> - Should we try to prefer copies? 

The fundamental issue is that as we add more stats to PGSTAT_KIND_BACKEND,
we increasingly end up doing duplicate work: the same operation gets counted in
both the specialized stat kind (IO, relation, etc.) and the backend stats.

I think that backend stats should serve as a consolidated view of what a particular
backend is doing, not as an independent counting mechanism.

It makes more sense for backend stats to copy from the specialized pending stats
rather than having every operation increment multiple independent counters.

The specific optimization techniques you mention (inlining, static structures, etc.)
don't address the fundamental architectural issue of duplicate counting.

> In which case, could it make sense
> to introduce a concept of dependency between stats kinds?  This way,
> we could order how the flush actions are taken rather than having a
> for loop based on the sole stats kind IDs.

We could, but I don't think we need that complexity yet.
Backend stats are inherently special they're a "consolidated" view of all activity
for a specific backend, which naturally makes them dependent on other stat kinds.

This makes backend stats naturally suited to be "last" in the flush order.

I'm not sure we'll need such ordering dependencies for other kinds. And if we need
more later on, we could always work on a more sophisticated approach.

The v3 implementation is explicit about this ordering (with comments and clear
code structure).

> - Should we try to reduce diff operations at flush time?  This is
> something done by the WAL stats and worry about their costs, with
> diffs calculated each time between the current and previous states?
> With a non-forced report delay of 10s, this does not worry me, short
> transactions would, as we force reports each time a transaction
> finishes.

Right, that might be a concern for short transactions.

The copy approach makes more sense to me: it is just a simple memory copy operation.
That produces less operations than iterating through all the counter arrays doing
arithmetic. I think it's hard to beat (even if we "could" reduce diff operations
at flush time).

> For the first and second point, more optimization could be done for
> more stats kinds.  For the last point, we'd want to reconsider how
> pgstat_wal.c does its job.  As the proposal stands, at least it seems
> to me, this sacrifices code readability, though I get that the opinion
> of each may differ on the matter.  And perhaps there are practices
> that we may be able to use in core for all the stats kinds, as well.

I disagree on the readability point. The copy approach is actually quite readable
and is very clear about what's happening.

The diff approach requires understanding:

- Why we maintain separate previous state variables
- How the diff calculations work across multi-dimensional arrays
- The complex time arithmetic with INSTR_TIME_ACCUM_DIFF

Regarding performance, I think it's hard to beat a simple memory copy.
Even with potential optimizations to other approaches, you're still doing more
work (arithmetic operations across arrays) versus a straightforward copy operation.

As for incrementing counters twice "just" for code readability, that seems backwards
to me.

> That would mean doing benchmarks with specific workloads on big
> hardware, likely.

I don't think we need to benchmark each approach.

I suspect benchmarks would show that avoiding duplicate work in the hot path (I/O operations)
provides benefits.

> So, what I am seeing currently here is an incomplete picture.

I think the picture is actually quite complete for the specific problem we're solving.

Backend stats are unique in that they're designed to "aggregate" activity from
other stat kinds and that's what creates the duplicate counting issue.

The "incomplete picture" would be if we were trying to optimize all stat kinds
uniformly, but that's not the problem here.

The problem is specifically that backend stats create duplicate work by counting
the same operations that are already being counted in their respective dedicated
stats kind.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com