Thread: Get rid of pgstat_count_backend_io_op*() functions
Hi hackers,

This patch removes the functions that increment the backend IO stats. Instead, it now copies the pending IO stats to the backend pending IO stats when the pending IO stats are flushed. This behaves the same way as some relation and database stats do (see pgstat_relation_flush_cb()).

It's done that way to avoid incrementing the "same" counters twice, as that produces increased overhead in profiles (reported by Andres in [1]).

Please note that per-backend statistics have to be last when we define the PGSTAT_KIND_% values in pgstat_kind.h, as some of their pending stats are populated while other stats kinds are flushing.

[1]: https://www.postgresql.org/message-id/7fhpds4xqk6bnudzmzkqi33pinsxammpljwde5gfkjdygvejrj%40ojkzfr7dxkmm

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
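The copy-at-flush idea can be illustrated with a self-contained sketch. The struct and function names below are simplified stand-ins of mine, not the real PostgreSQL types from the patch: the hot path bumps only one set of pending counters, and the backend's pending entry absorbs them once, at flush time.

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for PgStat_PendingIO; not the real PostgreSQL type. */
typedef struct PendingIO
{
	long		counts;
	long		bytes;
} PendingIO;

static PendingIO pending_io;		/* filled on every IO operation */
static PendingIO backend_pending_io;	/* per-backend copy, fed at flush */

/* Hot path: a single counter incrementation per IO operation. */
static void
count_io_op(long nbytes)
{
	pending_io.counts++;
	pending_io.bytes += nbytes;
}

/*
 * Flush callback sketch: copy the pending IO stats into the backend's
 * pending IO stats, then reset them, in the spirit of what
 * pgstat_relation_flush_cb() does for the database counters.
 */
static void
io_flush_cb(void)
{
	backend_pending_io.counts += pending_io.counts;
	backend_pending_io.bytes += pending_io.bytes;
	memset(&pending_io, 0, sizeof(pending_io));
}
```

The point of the sketch is only the shape of the data flow: no pgstat_count_backend_io_op*() call exists anymore, so the backend counters cost nothing on the hot path.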
Attachments
On Mon, Sep 01, 2025 at 02:07:27PM +0000, Bertrand Drouvot wrote:
> Instead, it now copies the IO pending stats to the backend IO pending stats when
> the pending IO stats are flushed. This behaves the same way as for some relation
> and database stats (see pgstat_relation_flush_cb()).
>
> It's done that way to avoid incrementing the "same" counters twice as it produces
> increased overhead in profiles (reported by Andres in [1]).

So, is the complaint about the addition of the extra incrementations for backend counters on top of the existing IO counters in v17, leading to more instructions generated, the addition of the functions, or both things at the same time? Do you have an example of a profile and/or workload where this actually matters? I am aware that you are doing crazy things in this area, so..

Also, how much does it matter for v18? Removing both the function calls and the counter incrementations implies refactoring that touches both the IO stats and their backend counterparts, which may not be a good idea during RC. I'd be OK with this amount of work in beta 1~2; the timing of the discussion is what it is. So, depending on the arguments raised, this may warrant a revert of the IO part of the backend stats.

> Please note that per-backend statistics have to be last when we define the
> PGSTAT_KIND_% values in pgstat_kind.h, as some of their pending stats are
> populated while other stats kinds are flushing.

+/*
+ * per-backend statistics has to be last, as some of their pending stats are
+ * populated while other stats kinds are flushing.
+ */
+#define PGSTAT_KIND_BACKEND	12	/* per-backend statistics */

I don't think that adding ordering dependencies between the stats kinds is a good concept in general. What makes you think that we could not have a custom pgstats kind willing to increment backend counters as well, say if we have a table AM with its own stats, for example?
Removing both the function calls and the extra counter incrementations means doing the same thing as the WAL stats, where we have one structure in charge of incrementing the counters, then the data diffs are fed when flushing the entries in all the stats kinds. So this would imply the introduction of an equivalent to WalUsage, but applied to the IO stats, like an IOUsage, or something like that.
--
Michael
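The WalUsage-style scheme Michael describes can be sketched as follows. The IOUsage type here is hypothetical (it does not exist in PostgreSQL), and the field names are mine: one structure is incremented everywhere, a snapshot of its previous state is kept, and each flush feeds only the diff since the last flush.

```c
#include <assert.h>

/* Hypothetical IOUsage, modeled on WalUsage; not an existing PostgreSQL type. */
typedef struct IOUsage
{
	long		io_ops;
	long		io_bytes;
} IOUsage;

static IOUsage io_usage;		/* the one structure incremented in the hot path */
static IOUsage prev_io_usage;	/* snapshot taken at the previous flush */

/*
 * Flush sketch: feed the diff accumulated since the last flush into a
 * destination entry (global IO stats, backend stats, ...), then remember
 * the current state for the next diff, WalUsage-style.
 */
static void
flush_io_usage(IOUsage *dest)
{
	dest->io_ops += io_usage.io_ops - prev_io_usage.io_ops;
	dest->io_bytes += io_usage.io_bytes - prev_io_usage.io_bytes;
	prev_io_usage = io_usage;
}
```

This is the design whose diffing cost is questioned later in the thread: every flush pays one subtraction per counter per destination, where the copy approach pays a plain memory copy.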
Attachments
Hi,

On Tue, Sep 02, 2025 at 08:19:22AM +0900, Michael Paquier wrote:
> On Mon, Sep 01, 2025 at 02:07:27PM +0000, Bertrand Drouvot wrote:
> > Instead, it now copies the IO pending stats to the backend IO pending stats when
> > the pending IO stats are flushed. This behaves the same way as for some relation
> > and database stats (see pgstat_relation_flush_cb()).
> >
> > It's done that way to avoid incrementing the "same" counters twice as it produces
> > increased overhead in profiles (reported by Andres in [1]).
>
> So, is the complaint about the addition of the extra incrementations
> for backend counters on top of the existing IO counters in v17,
> leading to more instructions generated, the addition of the functions,
> or both things at the same time? Do you have an example of profile
> and/or workload where this actually matters? I am aware that you are
> doing crazy things in this area, so..

I did not observe such noticeable overhead myself, so I guess that those questions are for Andres. That said, I think it fully makes sense to avoid incrementing the "same" counters twice (especially in hot paths).

> > Please note that per-backend statistics have to be last when we define the
> > PGSTAT_KIND_% values in pgstat_kind.h, as some of their pending stats are
> > populated while other stats kinds are flushing.
>
> +/*
> + * per-backend statistics has to be last, as some of their pending stats are
> + * populated while other stats kinds are flushing.
> + */
> +#define PGSTAT_KIND_BACKEND	12	/* per-backend statistics */
>
> I don't think that adding ordering dependencies for the stats kinds is
> a good concept in general.

I agree that it has to be simple. I don't think that just ensuring that the backends are last introduces much complexity.

> What makes you think that we could not
> have a custom pgstats kind willing to increment backend counters as
> well, say if we have a table AM with its own stats, for example?
Nothing, but the issue would be the same if we kept the backends at their current position. It's true that this "only" solves the in-core ordering.

> Removing both the function calls and the extra counter incrementations
> means to do the same thing as the WAL stats, where we have one
> structure in charge of incrementing the counters, then the data diffs
> are fed when flushing the entries in all the stats kinds. So this
> would imply the introduction of an equivalent to WalUsage, but applied
> to the IO stats, like a IOUsage, or something like that.

Yeah, that sounds more complicated at first, but I think it would solve the "ordering" issue mentioned above.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On Tue, Sep 02, 2025 at 07:41:59AM +0000, Bertrand Drouvot wrote:
> On Tue, Sep 02, 2025 at 08:19:22AM +0900, Michael Paquier wrote:
> > Removing both the function calls and the extra counter incrementations
> > means to do the same thing as the WAL stats, where we have one
> > structure in charge of incrementing the counters, then the data diffs
> > are fed when flushing the entries in all the stats kinds. So this
> > would imply the introduction of an equivalent to WalUsage, but applied
> > to the IO stats, like a IOUsage, or something like that.
>
> Yeah, that sounds more complicated at first but I think that would solve
> the "ordering" issue mentioned above.

Attached is a draft version that implements an equivalent of WalUsage. That way:

- no ordering concerns anymore
- we can also get rid of PgStat_PendingIO ("replaced" by IOUsage)
- we can also get rid of PgStat_BackendPending (since both IO and WAL stats now use the WAL's approach)

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments
Hi,

On 2025-09-02 08:19:22 +0900, Michael Paquier wrote:
> On Mon, Sep 01, 2025 at 02:07:27PM +0000, Bertrand Drouvot wrote:
> > Instead, it now copies the IO pending stats to the backend IO pending stats when
> > the pending IO stats are flushed. This behaves the same way as for some relation
> > and database stats (see pgstat_relation_flush_cb()).
> >
> > It's done that way to avoid incrementing the "same" counters twice as it produces
> > increased overhead in profiles (reported by Andres in [1]).
>
> So, is the complaint about the addition of the extra incrementations
> for backend counters on top of the existing IO counters in v17,
> leading to more instructions generated, the addition of the functions,
> or both things at the same time?

The latter, I think.

> Do you have an example of profile and/or workload where this actually
> matters?

It's not a large regression by any means - it shows up when micro-benchmarking seqscans that are consumed rather quickly (e.g. OFFSET large; or count(*)).

> Also, how much does it matter for v18?

I think it's ok for 18. We just don't want to go further down the wrong way. I objected to this approach in the context of the tuple level counters, where it matters more, because obviously those are incremented a lot more often.

> Removing both the function calls and the extra counter incrementations
> means to do the same thing as the WAL stats, where we have one
> structure in charge of incrementing the counters, then the data diffs
> are fed when flushing the entries in all the stats kinds.

I think that's the wrong direction to go. Diffing stats is far from cheap and gets more expensive the more detail you add to stats. EXPLAIN ANALYZE spends a large chunk of the time doing diffing of buffer access stats, for example. We need to work towards doing less of that stuff, not more.

Greetings,

Andres Freund
On Tue, Sep 02, 2025 at 12:42:54PM -0400, Andres Freund wrote:
> On 2025-09-02 08:19:22 +0900, Michael Paquier wrote:
>> On Mon, Sep 01, 2025 at 02:07:27PM +0000, Bertrand Drouvot wrote:
>>> Instead, it now copies the IO pending stats to the backend IO pending stats when
>>> the pending IO stats are flushed. This behaves the same way as for some relation
>>> and database stats (see pgstat_relation_flush_cb()).
>>>
>>> It's done that way to avoid incrementing the "same" counters twice as it produces
>>> increased overhead in profiles (reported by Andres in [1]).
>>
>> So, is the complaint about the addition of the extra incrementations
>> for backend counters on top of the existing IO counters in v17,
>> leading to more instructions generated, the addition of the functions,
>> or both things at the same time?
>
> The latter, I think.

Another option would be more inlining, applying the same to the pgstat_io.c parts. It sounds to me like limiting the incrementations to happen only once when counters are shared across multiple stats kinds would be beneficial anyway, so we could just do that. My whole picture would then come to:

- Split the IO counters (PendingIOStats) into a separate file, with perhaps inline functions to have a more elegant layer, somewhat relevant for pgstat_report_fixed as well.
- Update the stats kinds to apply the diffs in the counters at flush time.

>> Removing both the function calls and the extra counter incrementations
>> means to do the same thing as the WAL stats, where we have one
>> structure in charge of incrementing the counters, then the data diffs
>> are fed when flushing the entries in all the stats kinds.
>
> I think that's the wrong direction to go. Diffing stats is far from cheap and
> gets more expensive the more detail you add to stats.

Even if we only do the diff calculations when flushing the pending stats in the flush callbacks?
That would mean calculating the diffs when stats reports are forced at transaction commit, or when the delay threshold is reached in non-force mode for pgstat_report_stat().

> EXPLAIN ANALYZE spends a large chunk of the time doing diffing of buffer
> access stats, for example. We need to work towards doing less of that stuff,
> not more.

Yep, agreed. I've seen people complain about the overhead that can happen in the EXPLAIN path, even for stuff aimed to be maintained outside of core that uses the backend hooks.
--
Michael
Attachments
Hi,

On Wed, Sep 03, 2025 at 02:47:51PM +0900, Michael Paquier wrote:
> On Tue, Sep 02, 2025 at 12:42:54PM -0400, Andres Freund wrote:
> > I think that's the wrong direction to go. Diffing stats is far from cheap and
> > gets more expensive the more detail you add to stats.
>
> Even if we only do the diffs calculations when flushing the pending
> stats in the flush callbacks?

If my math is correct, we have 3 × 5 × 8 = 120 array positions, and for each we'd do the diff on counts, bytes and times. That means 360 diff operations per flush, so 720 diff operations to flush the global and per-backend IO stats. That's certainly more expensive than "just" copying the pending stats as proposed in v1.

As far as the ordering concern for v1 goes, what about:

- leaving the backend kind enum defined as 6
- moving the backend flush outside of the loop in pgstat_report_stat()

That way the backend stats are flushed last, which solves your concern about custom stats kinds.

The backend stats would not be the "only" exception in pgstat_report_stat(); the db stats are already handled as an exception with pgstat_update_dbstats().

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
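The operation count above can be checked mechanically. The enum names below are placeholders of mine; the dimensions are the IOObject × IOContext × IOOp array sizes that the message cites, with three fields (counts, bytes, times) per cell and one pass per flush target.

```c
#include <assert.h>

/* Array dimensions cited above: IO objects x IO contexts x IO ops. */
enum
{
	IO_OBJECTS = 3,
	IO_CONTEXTS = 5,
	IO_OPS = 8,
	FIELDS_PER_CELL = 3			/* counts, bytes, times */
};

/* Diff operations needed to flush ntargets stats kinds (global, backend, ...). */
static int
diff_ops_per_flush(int ntargets)
{
	return IO_OBJECTS * IO_CONTEXTS * IO_OPS * FIELDS_PER_CELL * ntargets;
}
```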
Hi,

On Wed, Sep 03, 2025 at 07:33:37AM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Wed, Sep 03, 2025 at 02:47:51PM +0900, Michael Paquier wrote:
> > On Tue, Sep 02, 2025 at 12:42:54PM -0400, Andres Freund wrote:
> > > I think that's the wrong direction to go. Diffing stats is far from cheap and
> > > gets more expensive the more detail you add to stats.
> >
> > Even if we only do the diffs calculations when flushing the pending
> > stats in the flush callbacks?
>
> If my math are correct we have 3 × 5 × 8 = 120 array positions and for each
> we'd do the diff on counts, bytes and times. That means 360 diff operations per
> flush. That means 720 diff operations to flush the global and per backend IO
> stats. That's certainly more expensive than "just" copying the pending stats
> as proposed in v1.
>
> As far the ordering concern for v1, what about:
>
> - let backend kind enum defined as 6
> - move the backend flush outside of the loop in pgstat_report_stat()
>
> That way the backend are flushed last and that solves your concern about custom
> stats kind.
>
> The backend would not be the "only" exception in pgstat_report_stat(), the db
> stats are already handled as exception with pgstat_update_dbstats().

That would give something like v3 attached, thoughts?

Once we agree on the approach to deal with per backend pending stats, I'll make use of the same in [1] and [2].

[1]: https://www.postgresql.org/message-id/flat/aJDBwNlyiFuJOoPx@ip-10-97-1-34.eu-west-3.compute.internal
[2]: https://www.postgresql.org/message-id/flat/aJrxug4LCg4Hm5Mm@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments
On Wed, Sep 24, 2025 at 07:48:32AM +0000, Bertrand Drouvot wrote:
> On Wed, Sep 03, 2025 at 07:33:37AM +0000, Bertrand Drouvot wrote:
>> As far the ordering concern for v1, what about:
>>
>> - let backend kind enum defined as 6
>> - move the backend flush outside of the loop in pgstat_report_stat()
>>
>> That way the backend are flushed last and that solves your concern about custom
>> stats kind.
>>
>> The backend would not be the "only" exception in pgstat_report_stat(), the db
>> stats are already handled as exception with pgstat_update_dbstats().
>
> That would give something like v3 attached, thoughts?
>
> Once we agree on the approach to deal with per backend pending stats, I'll make
> use of the same in [1] and [2].

    for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
    {
        [...]
+       if (kind == PGSTAT_KIND_BACKEND)
+           continue;
        if (!kind_info)
            continue;
        if (!kind_info->flush_static_cb)
            continue;

        partial_flush |= kind_info->flush_static_cb(nowait);
    }
+
+   /*
+    * Do per-backend last as some of their pending stats are populated
+    * during the flush of other stats kinds.
+    */
+   kind_info = pgstat_get_kind_info(PGSTAT_KIND_BACKEND);
+   partial_flush |= kind_info->flush_static_cb(nowait);

Hmm. I really have an issue with the ordering dependency this introduces between stats kinds, assumed to happen inside the code. Up to now, we have always considered stats kinds as rather independent pieces at flush time, simplifying the code (I am not talking about counter increment time). I get that you are doing what you do here because the pending IO statistics (PendingIOStats) are reset each time the IO stats flush callback finishes its job, and we'd lose the amount of stats saved between two reports.
Putting aside the style issue (aka I like the pgstat_count_backend_* interface, rather than copying the pending data), I think that we lack data to be able to clearly identify how, and actually how much, we should try to optimize and change this code:

- Should we try to use more inlining?
- Should we try to use more static structures, without any functions at all?
- Should we try to prefer copies? In which case, could it make sense to introduce a concept of dependency between stats kinds? This way, we could order how the flush actions are taken rather than having a for loop based on the sole stats kind IDs.
- Should we try to reduce diff operations at flush time? This is something done by the WAL stats, and I worry about their costs, with diffs calculated each time between the current and previous states. With a non-forced report delay of 10s this does not worry me, but short transactions would, as we force reports each time a transaction finishes.

For the first and second points, more optimization could be done for more stats kinds. For the last point, we'd want to reconsider how pgstat_wal.c does its job. As the proposal stands, at least it seems to me, it sacrifices code readability, though I get that opinions may differ on the matter. And perhaps there are practices that we may be able to use in core for all the stats kinds, as well. That would mean doing benchmarks with specific workloads on big hardware, likely.

So, what I am seeing currently here is an incomplete picture.
--
Michael
Attachments
Hi,

On Thu, Sep 25, 2025 at 03:42:33PM +0900, Michael Paquier wrote:
> On Wed, Sep 24, 2025 at 07:48:32AM +0000, Bertrand Drouvot wrote:
> > On Wed, Sep 03, 2025 at 07:33:37AM +0000, Bertrand Drouvot wrote:
> >> As far the ordering concern for v1, what about:
> >>
> >> - let backend kind enum defined as 6
> >> - move the backend flush outside of the loop in pgstat_report_stat()
> >>
> >> That way the backend are flushed last and that solves your concern about custom
> >> stats kind.
> >>
> >> The backend would not be the "only" exception in pgstat_report_stat(), the db
> >> stats are already handled as exception with pgstat_update_dbstats().
> >
> > That would give something like v3 attached, thoughts?
> >
> > Once we agree on the approach to deal with per backend pending stats, I'll make
> > use of the same in [1] and [2].
>
>     for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
>     {
>         [...]
> +       if (kind == PGSTAT_KIND_BACKEND)
> +           continue;
>         if (!kind_info)
>             continue;
>         if (!kind_info->flush_static_cb)
>             continue;
>
>         partial_flush |= kind_info->flush_static_cb(nowait);
>     }
> +
> +   /*
> +    * Do per-backend last as some of their pending stats are populated
> +    * during the flush of other stats kinds.
> +    */
> +   kind_info = pgstat_get_kind_info(PGSTAT_KIND_BACKEND);
> +   partial_flush |= kind_info->flush_static_cb(nowait);
>
> Hmm. I really have an issue with the ordering dependency this
> introduces between stats kinds, assumed to happen inside the code. Up
> to now, we have always considered stats kinds as rather independent
> pieces at flush time, simplifying its code (I am not saying counter
> increment time).

We already have such dependencies for some database/relation stats, but that works well because all of them are "variable" stats and their flush is based on the pgStatPending list.
> I get that you are doing what you do here because
> the IO pending statistics (PendingIOStats) are reset each time the
> stat IO flush callback finishes its job, and we'd lose the amount of
> stats saved between two reports.

We don't lose the stats, but it takes one more reporting cycle to get them flushed. The sequence (without the ordering being forced) is:

1. static flush of backend IO (happens before IO because the backend's enum < IO's enum)
2. static flush of IO -> pending IOs are copied to the backend's pending IOs

later on (i.e. next report cycle):

3. static flush of backend IO
4. static flush of IO -> pending IOs are copied to the backend's pending IOs

During 3. we now see the backend IO stats coming from 2. So the stats would still be reported correctly, just with a one-cycle delay in backend IO stats reporting.

> Putting aside the style issue (aka I like the pgstat_count_backend_*
> interface, rather than copying the pending data), I think that we lack
> data to be able to clearly identify how and actually how much we
> should try to optimize and change this code:
> - Should we try to use more inlining?
> - Should we try to use more static structures, without any functions
> at all.
> - Should we try to prefer copies?

The fundamental issue is that as we add more stats to PGSTAT_KIND_BACKEND, we increasingly end up doing duplicate work: the same operation gets counted in both the specialized stats kind (IO, relation, etc.) and the backend stats.

I think that backend stats should serve as a consolidated view of what a particular backend is doing, not as an independent counting mechanism. It makes more sense for backend stats to copy from the specialized pending stats rather than having every operation increment multiple independent counters.

The specific optimization techniques you mention (inlining, static structures, etc.) don't address the fundamental architectural issue of duplicate counting.
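The one-cycle delay described earlier in this message can be simulated with a toy model (names and structure are mine, not the patch's): with the backend kind flushed before the IO kind, the copy made by the IO flush is only picked up on the following report cycle.

```c
#include <assert.h>

/* Toy model of the flush ordering question described above. */
static long pending_io;			/* IO stats kind's pending counter */
static long backend_pending_io;	/* backend stats kind's pending counter */
static long backend_reported;	/* what the backend stats kind has flushed */

/* Backend flush: report whatever pending stats it currently holds. */
static void
flush_backend(void)
{
	backend_reported += backend_pending_io;
	backend_pending_io = 0;
}

/* IO flush: copy the pending IO stats into the backend's pending stats. */
static void
flush_io(void)
{
	backend_pending_io += pending_io;
	pending_io = 0;
}

/* One report cycle with the backend flushed before IO (enum 6 < IO's enum). */
static void
report_cycle_backend_first(void)
{
	flush_backend();
	flush_io();
}
```

Running two cycles shows the behavior described: nothing is lost, but the backend's copy of the IO stats lands one report cycle late, which is exactly what forcing the backend flush to happen last would avoid.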
> In which case, could it make sense
> to introduce a concept of dependency between stats kinds? This way,
> we could order how the flush actions are taken rather than having a
> for loop based on the sole stats kind IDs.

We could, but I don't think we need that complexity yet. Backend stats are inherently special: they're a "consolidated" view of all activity for a specific backend, which naturally makes them dependent on other stats kinds. This makes backend stats naturally suited to be "last" in the flush order.

I'm not sure we'll need such ordering dependencies for other kinds, and if we need more later on, we could always work on a more sophisticated approach. The v3 implementation is explicit about this ordering (with comments and a clear code structure).

> - Should we try to reduce diff operations at flush time? This is
> something done by the WAL stats and worry about their costs, with
> diffs calculated each time between the current and previous states?
> With a non-forced report delay of 10s, this does not worry me, short
> transactions would, as we force reports each time a transaction
> finishes.

Right, that might be a concern for short transactions. The copy approach makes more sense to me: it is just a simple memory copy operation. That produces fewer operations than iterating through all the counter arrays doing arithmetic. I think it's hard to beat (even if we "could" reduce diff operations at flush time).

> For the first and second point, more optimization could be done for
> more stats kinds. For the last point, we'd want to reconsider how
> pgstat_wal.c does its job. As the proposal stands, at least it seems
> to me, this sacrifices code readability, though I get that the opinion
> of each may differ on the matter. And perhaps there are practices
> that we may be able to use in core for all the stats kinds, as well.

I disagree on the readability point. The copy approach is actually quite readable and is very clear about what's happening.
The diff approach requires understanding:

- why we maintain separate previous-state variables
- how the diff calculations work across multi-dimensional arrays
- the complex time arithmetic with INSTR_TIME_ACCUM_DIFF

Regarding performance, I think it's hard to beat a simple memory copy. Even with potential optimizations to the other approaches, you're still doing more work (arithmetic operations across arrays) versus a straightforward copy operation. As for incrementing counters twice "just" for code readability, that seems backwards to me.

> That would mean doing benchmarks with specific workloads on big
> hardware, likely.

I don't think we need to benchmark each approach. I suspect benchmarks would show that avoiding duplicate work in the hot path (I/O operations) provides benefits.

> So, what I am seeing currently here is an incomplete picture.

I think the picture is actually quite complete for the specific problem we're solving. Backend stats are unique in that they're designed to "aggregate" activity from other stats kinds, and that's what creates the duplicate counting issue.

The "incomplete picture" would be if we were trying to optimize all stats kinds uniformly, but that's not the problem here. The problem is specifically that backend stats create duplicate work by counting the same operations that are already being counted in their respective dedicated stats kind.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com