Обсуждение: Add support for entry counting in pgstats
Hi all, One obstacle to the move of pg_stat_statements to the shmem pgstats that we have in core is that there is no easy way to track the total number of entries that are stored in the shared hash table of pgstats for variable-sized stats kinds. PGSS currently hash_get_num_entries() as a cheap way to get the number of entries stored in its hash table, but moving PGSS to the in-core stats facility means that we need a different approach to count these entries. I have looked at what can be done, and finished with a rather simple approach, as we have only one code path when a new entry is inserted, and one code path when an entry is removed when the refcount of an entry reaches 0 (includes resets, drops for kind matches, etc.). The counters are stored in a uint64 atomic, one for each stats kind in PgStat_ShmemControl, set only if the option is enabled. I am wondering how much protection we want for the reads, and chose the lightest "lossy" approach, leaving to stats kind what they want to do when deallocating entries. Hence, a stats kind may want to protect the entry deallocations with an extra lock, but that's not required either depending on how aggressive removals should happen. Please find attached that adds an option for variable-sized stats kinds given these the possibility to track the number of entries in the shared hash table. The option is named track_counts. The patch has some coverage added in the test module injection_points, in its TAP test. Thoughts are welcome. -- Michael
Вложения
On Sep 12, 2025, at 15:23, Michael Paquier <michael@paquier.xyz> wrote:
--
Michael
<0001-Add-support-for-entry-counting-in-pgstats.patch><0002-injection_points-Add-entry-counting.patch>
The code overall looks good to me, very clear and neat. Just a few small comments:
1 - 0001
```
+ * set. This counter is incremented each time a new entry is created (not
+ * when reused), and each time an entry is dropped.
+ */
+ pg_atomic_uint64 entry_counts[PGSTAT_KIND_MAX];
```
“and each time an entry is dropped” is a little misleading, it should be “and decremented when an entry is removed”.
2 - 0001
```
+ /* Increment entry count, if required. */
+ if (kind_info->track_counts)
+ pg_atomic_fetch_add_u64(&pgStatLocal.shmem->entry_counts[kind - 1], 1);
```
The code is quite straightforward, so the comment seems unnecessary.
3 - 0001
```
+ if (kind_info->track_counts)
+ {
+ ereport(ERROR,
+ (errmsg("custom cumulative statistics property is invalid"),
+ errhint("Custom cumulative statistics cannot use counter tracking for fixed-numbered objects.")));
+ }
```
{} are not needed. Looks like in the existing code, when “if” has a single statement, {} are not used. There is a similar “if” right above this change.
4 - 0004
```
diff --git a/src/test/modules/injection_points/injection_stats.c b/src/test/modules/injection_points/injection_stats.c
index e3947b23ba57..15ad9883dedb 100644
--- a/src/test/modules/injection_points/injection_stats.c
+++ b/src/test/modules/injection_points/injection_stats.c
@@ -49,6 +49,7 @@ static const PgStat_KindInfo injection_stats = {
.shared_data_len = sizeof(((PgStatShared_InjectionPoint *) 0)->stats),
.pending_size = sizeof(PgStat_StatInjEntry),
.flush_pending_cb = injection_stats_flush_cb,
+ .track_counts = true,
};
```
Would it be better to move “.track_counts” up to be with the other boolean flags? Whey define together to share the same byte, feels better to initialize them together as well.
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
HighGo Software Co., Ltd.
https://www.highgo.com/
Thanks for this patch! > I have looked at what can be done, and finished with a rather simple > approach, as we have only one code path when a new entry is inserted, > and one code path when an entry is removed when the refcount of an > entry reaches 0 (includes resets, drops for kind matches, etc.). The > counters are stored in a uint64 atomic, one for each stats kind in > PgStat_ShmemControl, set only if the option is enabled. The refcount reaches 0 when all backends release their references to the stat, so if something like pg_stat_statements relies on a count for deallocation purposes (to stay within the max), and that value only decrements when all references are released, then it could end up constantly triggering deallocation attempts. So, I am wondering if we actually need another value that tracks "live" entries, or those that have not yet been marked for drop. This means the live entries count is decremented as soon as the entry is set to ->dropped. > Hence, a stats kind may want to protect > the entry deallocations with an extra lock, but that's not required > either depending on how aggressive removals should happen. +1. As a side note, maybe I am missing something here: In [0], should this not say ".... the entry should not be freed" instead of ".... the entry should not be dropped". The refcount deals with the freeing of the entry after all refs are released. [0] https://github.com/postgres/postgres/blob/f2bae51dfd5b2edc460c86071c577a45a1acbfd7/src/include/utils/pgstat_internal.h#L98-L99 -- Sami
On Tue, Sep 23, 2025 at 01:39:00PM -0500, Sami Imseih wrote: > The refcount reaches 0 when all backends release their references to the > stat, so if something like pg_stat_statements relies on a count for > deallocation purposes (to stay within the max), and that value only > decrements when all references are released, then it could end up > constantly triggering deallocation attempts. So, I am wondering if we > actually need another value that tracks "live" entries, or those that > have not yet been marked for drop. This means the live entries count > is decremented as soon as the entry is set to ->dropped. Couldn't that mean a potential bloat in terms of memory in the dshash if we have a large cycle of objects still held around but marked to be gone? That sounds risky to me as it could go out of control. The counter tracking exactly the number of dshash entries would provide a tighter control, and perhaps this would be OK if the actual deallocations are grouped. Your suggestion would not be difficult to implement, as well, requiring an extra tweak when entries are reused, and that's handled in a unique path. > As a side note, maybe I am missing something here: > In [0], should this not say ".... the entry should not be freed" instead of > ".... the entry should not be dropped". The refcount deals with the freeing > of the entry after all refs are released. > > [0] https://github.com/postgres/postgres/blob/f2bae51dfd5b2edc460c86071c577a45a1acbfd7/src/include/utils/pgstat_internal.h#L98-L99 The sentence looks correct to me? The refcount cannot be incremented if an entry is marked for drop, as far as I recall. -- Michael
Вложения
Testing via the Injection Point has been successfully completed. > The option is named track_counts. Regarding the option name track_counts in PgStat_KindInfo. In my personal opinion, I was just wondering that it shares the same name as the GUC track_counts(pgstat_track_counts in the source code). If we want to make it clearer, renaming it to track_entry_counts could be one option. (That said, I feel the GUC track_counts option has a mismatch between its role and its name...) Best Regards, -- Keisuke Kuroda NTT DOCOMO SOLUTIONS, Inc.
On Wed, Sep 24, 2025 at 11:37:25AM +0900, Keisuke Kuroda wrote: > Regarding the option name track_counts in PgStat_KindInfo. > In my personal opinion, I was just wondering that it shares > the same name as the GUC track_counts(pgstat_track_counts in the source code). > If we want to make it clearer, renaming it to track_entry_counts > could be one option. Yes, that's a good point and I have missed the GUC part. What you are suggesting is cleaner overall with the flag added to the pgstats kind info. > (That said, I feel the GUC track_counts option has a mismatch > between its role and its name...) I doubt that we will rename this GUC at any point in the future. Updated patch attached with the new name, as you are suggesting. -- Michael
Вложения
> On Tue, Sep 23, 2025 at 01:39:00PM -0500, Sami Imseih wrote: > > The refcount reaches 0 when all backends release their references to the > > stat, so if something like pg_stat_statements relies on a count for > > deallocation purposes (to stay within the max), and that value only > > decrements when all references are released, then it could end up > > constantly triggering deallocation attempts. So, I am wondering if we > > actually need another value that tracks "live" entries, or those that > > have not yet been marked for drop. This means the live entries count > > is decremented as soon as the entry is set to ->dropped. > > Couldn't that mean a potential bloat in terms of memory in the dshash > if we have a large cycle of objects still held around but marked to be > gone? That sounds risky to me as it could go out of control. I spent a bit of time testing this with a pg_stat_statements like extension using a custom stats kind, and while I think there is value for both "live" ( ! ->dropped) counter and an exact dshash counter ( current proposal), I rather go with the latter at least initially, for the sake of not having 2 atomic counters. Both will allow an extension to trigger some type of a cleanup strategy, and in general, both should be very close in value. That's at least my observation. I do think however that the placement of the decrement is wrong, and that it should go inside pgstat_free_entry, since pgstat_free_entry can be called in multiple code paths. In my high churn workload, v2 ended up increasing way beyond the actual size of the dshash. See the attached for what I did to fix. >> Regarding the option name track_counts in PgStat_KindInfo. >> In my personal opinion, I was just wondering that it shares >> the same name as the GUC track_counts(pgstat_track_counts in the source code). >> If we want to make it clearer, renaming it to track_entry_counts >> could be one option. > Yes, that's a good point and I have missed the GUC part. What you are > suggesting is cleaner overall with the flag added to the pgstats kind > info. IMO, "entry_counts" does not sound correct. We are not tracking more than one count. What about track_num_entries ? -- Sami
Вложения
On Thu, Sep 25, 2025 at 07:47:48PM -0500, Sami Imseih wrote: > I spent a bit of time testing this with a pg_stat_statements like extension > using a custom stats kind, and while I think there is value for both "live" > ( ! ->dropped) counter and an exact dshash counter ( current proposal), > I rather go with the latter at least initially, for the sake of not having 2 > atomic counters. Both will allow an extension to trigger some type of a > cleanup strategy, and in general, both should be very close in value. > That's at least my observation. Thanks for the feedback. I'm feeling encouraged by this opinion about the "hard" counter, so I'd like to move on with it. > I do think however that the placement of the decrement is wrong, and that > it should go inside pgstat_free_entry, since pgstat_free_entry can be called > in multiple code paths. In my high churn workload, v2 ended up increasing way > beyond the actual size of the dshash. See the attached for what > I did to fix. Yes, you are right and the patch is wrong. I've done the following with my previous patch with injection_points loaded and its stats activated, confirming your argument: 1) session with psql running in a tight loop and these queries: select injection_points_attach('popo', 'notice'); select injection_points_run('popo'); select injection_points_detach('popo'); 2) session that fetches the stats entry select injection_points_stats_numcalls('popo'); \watch 0 3) session that checks the number of entries, should be 0 or 1: select injection_points_stats_count(); \watch 1 And the numbers reported by the third session increase over time, which is no good, once my patch is used. Once we move the decrementation to pgstat_free_entry() as you propose, the counter is under control and capped. > IMO, "entry_counts" does not sound correct. We are not tracking more > than one count. What about track_num_entries ? Hmm. track_entry_counts resonates with the field name entry_counts, and it's true that using the plural form is confusing. How about track_entry_count in PgStat_KindInfo instead? -- Michael
Вложения
Thanks for v3. The only remaining comment I have is: This comment seems unnecessary, since refcount is not checked inside pgstat_free_entry, but earlier. + /* + * Entry has been dropped with refcount at 0, hence decrement the + * entry counter. + */ I would just say this, /* Decrement entry count, if required. */ -- Sami
On Fri, Sep 26, 2025 at 12:09:45PM -0500, Sami Imseih wrote: > Thanks for v3. The only remaining comment I have is: Thanks for the extra lookup. I have fixed this one, incorporated the feedback from Chao, and applied the result after more tests with pgbench to check the state of the counter. With more concurrency and an instance set up like that in postgresql.conf: shared_preload_libraries = 'injection_points' injection_points.stats = on And of course that: psql -c 'create extension injection_points' Then with the following flow, checking that the count never gets higher than the number of clients: 1) pgbench -n -T 300 -f create_inj.sql -c 10 $ cat create_inj.sql \set id random(1,100000) select injection_points_attach('popo:id', 'notice'); select injection_points_run('popo:id'); select injection_points_detach('popo:id'); 2) pgbench -n -T 300 -f select_inj.sql -c 10 $ cat select_inj.sql \set id random(1,100000) select injection_points_stats_numcalls('popo:id'); 3) psql session: select injection_points_stats_count(); \watch 1 And I have found, cough, a rather embarrassing bug on the way, unrelated to the proposal of this thread. It's in the module injection_points for its fixed stats, so nothing ground-breaking, still embarrassing. Will post a fix on a new thread shortly.. -- Michael