Thread: Adding wait events statistics

Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi hackers,

Wait events are useful to know what backends are waiting for when there is/was
a performance issue: for this we can sample pg_stat_activity at regular intervals
and record historical data. That’s how it is commonly used.

It could also be useful to observe the engine/backends behavior over time and
help answer questions like:

* Is the engine’s wait pattern the same over time?
* Is application’s "A" wait pattern the same over time?
* I observe a peak in wait event "W": is it because "W" is now waiting longer or
is it because it is hit more frequently?
* I observe a peak in some of them (say for example MultiXact%), is it due to a
workload change?

For the above use cases, we need a way to track the wait events that occur between
samples: please find attached a proof of concept patch series doing so.

The patch series is made of:

0001 - It generates the WaitClassTable[], a lookup table that will be used by
the wait events statistics. 

The array is indexed by classId (derived from the PG_WAIT_* constants), handles
gaps in the class ID numbering and provides metadata for wait events.
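
For illustration, an entry could look roughly like this (just a sketch: the field
names are hypothetical, not necessarily the ones emitted by the script):

typedef struct WaitClassTableEntry
{
    const char *className;      /* e.g. "LWLock", "Lock", "Timeout", ... */
    int         numberOfEvents; /* number of event IDs in the class, holes included */
    int         counterOffset;  /* start of this class in the flat counters array */
    const char **eventNames;    /* per-event names, NULL for the LWLock holes */
} WaitClassTableEntry;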

This new array is generated in generate-wait_event_types.pl, so that:

* it now needs lwlocklist.h and wait_classes.h as input parameters
* WAIT_EVENT_CLASS_MASK and WAIT_EVENT_ID_MASK have been moved away from wait_event.c

In passing it adds several new macros that will be used by 0002.

0002 -  It adds wait events statistics

It adds a new stat kind PGSTAT_KIND_WAIT_EVENT for the wait event statistics.

This new statistic kind is a fixed one because we know the maximum number of wait
events. Indeed:

 * it does not take into account custom wait events as extensions have all they need
 at their disposal to create custom stats on their own wait events should they
 want to (limited by WAIT_EVENT_CUSTOM_HASH_MAX_SIZE though).

 * it does not take into account LWLock > LWTRANCHE_FIRST_USER_DEFINED for the same
 reasons as above. That said, there is no maximum limitation in LWLockNewTrancheId().

 * we don't want to allocate memory in some places where the counters are
 incremented (see 4feba03d8b9). We could still use the same implementation as for
 backend statistics (i.e, make use of flush_static_cb) if we really want/need to
 switch to variable stats.

For the moment only the counters are added (an array of currently 285 counters);
we'll discuss adding the timings once the counters are fully done.

I think we'd have more discussion/debate around the timings (should we add them
by default, add a new GUC, enable them at compile time?...), which is why only
the counters are in this patch series.

I think it makes sense as the counters have merit on their own. We currently have
273 wait events but the array is 285 long: the reason is that some wait event
classes have "holes".

A few questions:

 * Do we need to serialize the stats based on their names (as for
 PGSTAT_KIND_REPLSLOT)? This question is the same as "is the ordering preserved
 if the stats file format is not changed": I think the answer is yes (see
 f98dbdeb51d), which means there is no need for to_serialized_name/from_serialized_name.

 * What if a new wait event is added? We'd need to change the stats file format,
 unless: the wait event stats kind becomes a variable one or we change a bit the
 way fixed stats are written/read to/from the stat file (we could add a new field
 in the PgStat_KindInfo for example).

Note: for some backends the wait event stats are not flushed (the walwriter for
example), so we need to find additional places to flush them.

0003 - It adds the pg_stat_wait_event view

It also adds documentation and regression tests.

0004 - It switches PGSTAT_KIND_WAIT_EVENT to a variable sized stats kind

It might be better for PGSTAT_KIND_WAIT_EVENT to be a variable sized stats kind.
That way:

* It would be easier to add custom wait events if we want to
* It would be possible to add new wait events without having to change the stats
file format

So adding 0004 to see what it would look like to have a variable sized stats kind
instead and decide how we want to move forward. 

It uses the uint32 wait_event_info as the hash key while the hash key is defined
as uint64: that should not be an issue, but the patch does explicit casting.

That said it might be better to use all the 64 bits (means not have the half full
of zeroes) for the hash key (better hashing distribution?): we could imagine
using something like:

((uint64) wait_event_info) | (((uint64) wait_event_info) << 32)

for the hash key.
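
For instance, as a sketch of the idea (not code from the patch):

static inline uint64
wait_event_hash_key(uint32 wait_event_info)
{
    /* duplicate the 32 bits into the upper half so the key is not half zeroes */
    return ((uint64) wait_event_info) | (((uint64) wait_event_info) << 32);
}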

If we decide to go that way (i.e., with the variable sized kind) then a new patch
series will be provided that will not implement the fixed one but will start
directly with the variable one.

The more I think about it, the more I think we should go for the variable sized
proposal: that's more flexible.

Remarks:

* If we want to add some "ereport" calls in waitEventIncrementCounter() then we'd need
to take care of the race condition in ConditionVariableTimedSleep() that it would
produce, see [1].

* Once we agree on fixed vs variable sized stats kind, I'll start measuring if
there is any performance regression and check if there is a need for optimization
(partition the counters array?…).

* pgstat_report_wait_end() is "inlined", but with the additional code added
here, compilers may ignore the inline keyword. We need to check whether that's the
case and what the impact is (see above).

* After it goes in: add timings, add into per-backend stats too

[1]: https://www.postgresql.org/message-id/aBhuTqNhMN3prcqe%40ip-10-97-1-34.eu-west-3.compute.internal 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi,

On Mon, Jul 07, 2025 at 02:56:29PM +0900, Michael Paquier wrote:
> On Mon, Jun 30, 2025 at 01:36:12PM +0000, Bertrand Drouvot wrote:

> >  * Do we need to serialize the stats based on their names (as for
> >  PGSTAT_KIND_REPLSLOT)? This question is the same as "is the ordering preserved
> >  if file stats format is not changed": I think the answer is yes (see f98dbdeb51d)
> >  , which means there is no need for to_serialized_name/from_serialized_name.
> 
> The serialization is important for the custom wait events, much less
> for wait events related to injection points (where we don't really
> care) which could be filtered out when reporting the stats.  It seems
> to me that associating the names (perhaps as $CLASS/$NAME on disk as a
> single string, two strings with two lengths are perhaps better) is
> going to be needed.  When the stats are read, the wait event ID could
> be allocated from the startup process in the from_serialized_name
> callback associated to the new stats kind, before extensions would try
> to allocate it.

Thanks for looking at it!

Yeah, I think that this is needed for the custom wait events. But do we need
to handle the custom wait events case? As, I think, the extensions have all they
need at their disposal to count/increment their own counters when their custom
wait events are triggered.

I think the question is: if the extension owner does not increment it, do we want
to do it in core on their behalf? I'm not sure, as it's still up to them to make
use of custom wait events: so some of them will, some not, making the "in core"
counters inconsistent depending on the extension. That could result in end
users asking why they see counters for some and not for others.

I also have the feeling that adding this extra complexity for (I think) a really
small number of cases may impact the "in core" wait events stats negatively (from
a complexity and maybe performance point of view).

I was thinking to just start with "in core" wait events and see if there is a
need/ask to expand it to custom wait events. Thoughts?

> >  * What if a new wait event is added? We'd need to change the stats file format,
> >  unless: the wait event stats kind becomes a variable one or we change a bit the
> >  way fixed stats are written/read to/from the stat file (we could add a new field
> >  in the PgStat_KindInfo for example).
> 
> I am pretty sure that everybody is going to forget about this
> dependency, myself included.  Please note that we add wait events in
> stable branches, sometimes as a portion of a bug fix, and we cannot
> bump the version of the stats file or stats would be lost on restart.
> So these stats should be variable,

Agreed, it's better to go the "variable" stats kind way.

> For the stats waiting to be flushed, you could
> allocate a single array of WAIT_EVENT_CUSTOM_HASH_MAX_SIZE elements in
> the TopMemoryContext of each backend, as we know that the number of
> custom wait events is never going to be higher than that, indexed by
> ID assigned.

That's an idea, but I'm not 100% sure we need to care about custom wait events
for now (see above).

> It may be simpler to allocate a series of arrays, one for each class,
> as well, sized depending on the wait event enum sizes.  These could be
> indexed based on the enums generated in wait_event_types.h, minus
> their initial value (for the array of events in WaitEventTimeout,
> minus PG_WAIT_TIMEOUT needs to be applied).  Deriving the contents
> that you need for the sizing and indexing of the pending data based on
> WaitClassTableEntry seems indeed necessary at some degree.

I'm not sure that would be simpler (or maybe that would simplify the perl
script a bit) but I think that might complicate the C code a bit. I mean, given
a wait_event_info we'd need to know in which array to increment the related wait
event counter. So, (at least) some switch on the wait class would be needed (and
that would probably need to be generated by the perl script, so I'm not sure
that would simplify the perl script after all).

Also in pg_stat_get_wait_event(), we'd need to iterate on all the arrays so
we'd need to know and identify them.

> your proposal feels overcomplicated, and that the names may not be
> necessary as we already have calls able to do class name and event
> name lookups based on the uint32 class/name values.

I started without the names in the array (making use of pgstat_get_wait_event())
but that did not work well for LWLOCK. Indeed, for LWLOCK we have "holes" in the
eventID numbering (see lwlocklist.h). That's why there is "NULL" in some
names for PG_WAIT_LWLOCK in the WaitClassTable[]. So, instead of putting random
stuff in the non-NULL entries, I thought it would be OK to put the names there
(and avoid calling pgstat_get_wait_event()).

> This is not new.  This limitation exists for some auxiliary processes
> or processes within their main loops.  As long as we have APIs that
> can be called to do the flushes, IMO we are fine, like we do for WAL,
> IO, etc.

I do agree. My point was that I did not add those in the patch series yet.

> > It might be better for PGSTAT_KIND_WAIT_EVENT to be a variable sized stats kind.
> > That way:
> > 
> > * It would be easier to add custom wait events if we want to
> > * It would be possible to add new wait events without having to change the stats
> > file format
> > 
> > So adding 0004 to see what it would look like to have a variable sized stats kind
> > instead and decide how we want to move forward. 
> 
> IMO, I think that we should just do that.  ABI compatibility and
> custom wait events make fixed-sized stats unfit with what you are
> trying to achieve here.

Yeah, the next version will do that. I'm just waiting on an agreement about:

- custom wait events
- splitting the array into multiple arrays

before sharing v2.

> > That said it might be better to use all the 64 bits (means not have the half full
> > of zeroes) for the hash key (better hashing distribution?): we could imagine
> > using something like:
> > 
> > ((uint64) wait_event_info) | (((uint64) wait_event_info) << 32)
> > 
> > for the hash key.
> 
> Using uint32 for the hash key is fine.  At least it's not an issue for
> the pgstats machinery.

Yeah, my doubt was more about the hashing distribution if all the keys have
the exact same bits (the upper 32 bits) set to zero (which is what using a
uint32 key would produce).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi,

On Tue, Jul 08, 2025 at 09:36:53AM +0900, Michael Paquier wrote:
> On Mon, Jul 07, 2025 at 03:07:12PM +0000, Bertrand Drouvot wrote:
> 
> > I think the question is: if the extension owner does not increment it, do we want
> > to do it in core on their behalf? I'm not sure as it's still up to them to make
> > use of custom wait events: so some of them will, some not, making the "in core"
> > counters not consistent depending of the extension. That could result in end
> > users asking why they see counters for some and not for some.
> 
> My take would be yes.  That feels more natural than ignoring things.

Yeah ignoring the custom wait events would not make sense, but "waiting" to see
if there is a need/ask could have made sense.

> > I was thinking to just start with "in core" wait events and see if there is a
> > need/ask to expand it to custom wait events. Thoughts?
> 
> This statement is perhaps true.  Now we do use custom wait events even
> in contrib/ modules, so the move feels natural to me.

Okay, I had in mind to start with something simpler that would have covered the
vast majority (I think) of end users' needs, but I will look at it for v2 then.

> > I'm not sure that would be simpler (or maybe that would simplify the perl
> > script a bit) but I think that might complicate the C code a bit. I mean, given
> > a wait_event_info we'd need to know in which array to increment the related wait
> > event counter. So, (at least) some switch on the wait class would be needed (and
> > that would probably need to be generated by the perl script, so I'm not sure
> > that would simplify the perl script after all).
> 
> Noted.
> 
> > Also in pg_stat_get_wait_event(), we'd need to iterate on all the arrays so
> > we'd need to know and identify them.
> 
> Yes, you may need some level of meta-data generated for the wait
> classes generated when the perl script generating this data is run.
> It would be nice to keep the footprint of the code generated minimal if we
> can.  It's one of these things where I would try both approaches,

Let's see what a multi-array approach would look like and decide.

> > Yeah, my doubt was more about the hashing distribution if all the keys have
> > the exact same bits (the upper 32 bits) sets to zero (which is what using a
> > uint32 key would produce).
> 
> Hmm.  Not sure about that, but that would not be difficult to check.
> It is true that it would not be a good thing if all the stats get
> pushed through the same partition in the pgstat dshash table, but I'm
> pretty sure that we don't need to worry much about that, like for
> neighboring OIDs.

Probably yes, but I will try to figure it out, out of curiosity (I think that would
be good to know generally speaking).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi,

On Tue, Jul 08, 2025 at 04:46:28AM +0000, Bertrand Drouvot wrote:
> On Tue, Jul 08, 2025 at 09:36:53AM +0900, Michael Paquier wrote:
> > Yes, you may need some level of meta-data generated for the wait
> > classes generated when the perl script generating this data is run.
> > It would be nice to keep the footprint of the code generated minimal if we
> > can.  It's one of these things where I would try both approaches,

I did some testing and experiments and I think a "hybrid" approach mixing multiple
arrays and one "global" one is the best approach.

So, something like:

"
ClassInfo LWLockClassInfo = {
    .numberOfEvents = NB_LWLOCK_WAIT_EVENTS,
    .pendingCounts = PendingWaitEventStats.lwlock_counts
};

WaitClassInfo LockClassInfo = {
    .numberOfEvents = NB_LOCK_WAIT_EVENTS,
    .pendingCounts = PendingWaitEventStats.lock_counts
};
.
.
.
.
"

and then the "global" one:

"
WaitClassInfo *AllWaitClasses[] = {
    NULL,  /* gap - no class with ID 0 */
    &LWLockClassInfo,
    NULL,  /* gap - no class with ID 2 */
    &LockClassInfo,
    &BufferPinClassInfo,
    &ActivityClassInfo,
    &ClientClassInfo,
    &ExtensionClassInfo,
    &IPCClassInfo,
    &TimeoutClassInfo,
    &IOClassInfo,
};
"

That way:

- we do not need a mapping between ClassID and "Arrays", and the gaps
in the ClassID numbering are handled. So we can directly access the class of
interest knowing its classID.
- we can easily iterate (again, no mapping needed) over all the pending stats
thanks to AllWaitClasses[] (see the sketch below)
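
For example, flushing the pending stats could then be a simple loop like this
(sketch only; flush_one_wait_event() is a hypothetical helper):

for (int classId = 0; classId < lengthof(AllWaitClasses); classId++)
{
    WaitClassInfo *info = AllWaitClasses[classId];

    if (info == NULL)
        continue;               /* gap in the class ID numbering */

    for (int eventId = 0; eventId < info->numberOfEvents; eventId++)
        flush_one_wait_event(classId, eventId, info->pendingCounts[eventId]);
}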

What do you think? It looks like the most elegant approach to me.

Bonus point, as you can see above, while working on this approach I've been able
to remove storing the className and the array of eventName per wait class (would
have worked in v1 too).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Adding wait events statistics

From:
Michael Paquier
Date:
On Tue, Jul 08, 2025 at 03:25:26PM +0000, Bertrand Drouvot wrote:
> So, something like:
>
> ClassInfo LWLockClassInfo = {
>     .numberOfEvents = NB_LWLOCK_WAIT_EVENTS,
>     .pendingCounts = PendingWaitEventStats.lwlock_counts
> };
>
> and then the "global" one:
>
> WaitClassInfo *AllWaitClasses[] = {
>     NULL,  /* gap - no class with ID 0 */
>     &LWLockClassInfo,
>     NULL,  /* gap - no class with ID 2 */
>     &LockClassInfo,
>     &BufferPinClassInfo,
>     &ActivityClassInfo,
>     &ClientClassInfo,
>     &ExtensionClassInfo,
>     &IPCClassInfo,
>     &TimeoutClassInfo,
>     &IOClassInfo,
> };

Okay.  I'm guessing that all this information is automatically built
based on the contents of wait_event_names.txt (with some tweaks for
the LWLock part?), so why not even if there are gaps as long as we
know the length of AllWaitClasses.  The stats code would be simple by
doing an O(n^2) across the array when flushing the individual contents
of each wait event.

> That way:
>
> - we do not need a mapping between ClassID and "Arrays" and the gaps
> in the ClassID are handled. So we can acces directly the class of interest
> knowing its classID.
> - we can easily iterate (again no mapping needed) over all the pending stats
> thanks to AllWaitClasses[]
>
> What do you think? It looks like the most elegant approach to me.

I think that's pretty cool.

> Bonus point, as you can see above, while working on this approach I've been able
> to remove storing the className and the array of eventName per wait class (would
> have worked in v1 too).

You mean the extra set of names generated?  Yes, I was getting the
impression while reading your patch that this did not require this
complexity when generating the code for the pending data.  Getting rid
of this part would be a nice outcome.
--
Michael

Attachments

Re: Adding wait events statistics

From:
Andres Freund
Date:
Hi,

On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> Wait events are useful to know what backends are waiting for when there is/was
> a performance issue: for this we can sample pg_stat_activity at regular intervals
> and record historical data. That’s how it is commonly used.
>
> It could also be useful to observe the engine/backends behavior over time and
> help answer questions like:
>
> * Is the engine’s wait pattern the same over time?
> * Is application’s "A" wait pattern the same over time?
> * I observe a peak in wait event "W": is it because "W" is now waiting longer or
> is it because it is hit more frequently?
> * I observe a peak in some of them (say for example MultiXact%), is it due to a
> workload change?
>
> For the above use cases, we need a way to track the wait events that occur between
> samples: please find attached a proof of concept patch series doing so.
> >
> 0002 -  It adds wait events statistics
>
> It adds a new stat kind PGSTAT_KIND_WAIT_EVENT for the wait event statistics.
>
> This new statistic kind is a fixed one because we know the maximum number of wait
> events. Indeed:
>
>  * it does not take into account custom wait events as extensions have all they need
>  at their disposal to create custom stats on their own wait events should they
>  want to (limited by WAIT_EVENT_CUSTOM_HASH_MAX_SIZE though).
>
>  * it does not take into account LWLock > LWTRANCHE_FIRST_USER_DEFINED for the same
>  reasons as above. That said, there is no maximum limitation in LWLockNewTrancheId().
>
>  * we don't want to allocate memory in some places where the counters are
>  incremented (see 4feba03d8b9). We could still use the same implementation as for
>  backend statistics (i.e, make use of flush_static_cb) if we really want/need to
>  switch to variable stats.

Even without allocating memory, this makes wait events considerably more
expensive. Going from ~4 instructions to something more like ~40.  Including
external function calls.

Today:

start wait:
   0x0000000000a08e8f <+63>:    lea    0x735662(%rip),%r12        # 0x113e4f8 <my_wait_event_info>
   0x0000000000a08e96 <+70>:    mov    (%r12),%rax
   0x0000000000a08e9a <+74>:    movl   $0x9000006,(%rax)

   < do wait operation >


end wait:
   0x0000000000a08eaa <+90>:    mov    (%r12),%rax


After:

start wait:
   0x0000000000a0b39f <+63>:    lea    0x73673a(%rip),%r12        # 0x1141ae0 <my_wait_event_info>
   0x0000000000a0b3a6 <+70>:    mov    (%r12),%rax
   0x0000000000a0b3aa <+74>:    movl   $0x9000006,(%rax)


   < do wait operation >

end wait:

   0x0000000000a0b3ba <+90>:    mov    (%r12),%rax
   0x0000000000a0b3be <+94>:    mov    (%rax),%edi
   0x0000000000a0b3c0 <+96>:    rex call 0xa452e0 <waitEventIncrementCounter>
   0x0000000000a0b3c6 <+102>:    mov    (%r12),%rax

Where waitEventIncrementCounter ends up as:

   0x0000000000a452e0 <+0>:    lea    0x6fc7f9(%rip),%rax        # 0x1141ae0 <my_wait_event_info>
   0x0000000000a452e7 <+7>:    mov    (%rax),%rax
   0x0000000000a452ea <+10>:    mov    (%rax),%eax
   0x0000000000a452ec <+12>:    mov    %eax,%edx
   0x0000000000a452ee <+14>:    and    $0xff000000,%edx
   0x0000000000a452f4 <+20>:    jne    0xa45300 <waitEventIncrementCounter+32>
   0x0000000000a452f6 <+22>:    test   %ax,%ax
   0x0000000000a452f9 <+25>:    jne    0xa45300 <waitEventIncrementCounter+32>
   0x0000000000a452fb <+27>:    ret
   0x0000000000a452fc <+28>:    nopl   0x0(%rax)
   0x0000000000a45300 <+32>:    cmp    $0x1000000,%edx
   0x0000000000a45306 <+38>:    jne    0xa4530e <waitEventIncrementCounter+46>
   0x0000000000a45308 <+40>:    cmp    $0x5e,%ax
   0x0000000000a4530c <+44>:    ja     0xa452fb <waitEventIncrementCounter+27>
   0x0000000000a4530e <+46>:    cmp    $0x7000000,%edx
   0x0000000000a45314 <+52>:    sete   %cl
   0x0000000000a45317 <+55>:    test   %ax,%ax
   0x0000000000a4531a <+58>:    setne  %al
   0x0000000000a4531d <+61>:    test   %al,%cl
   0x0000000000a4531f <+63>:    jne    0xa452fb <waitEventIncrementCounter+27>
   0x0000000000a45321 <+65>:    cmp    $0xb000000,%edx
   0x0000000000a45327 <+71>:    je     0xa452fb <waitEventIncrementCounter+27>
   0x0000000000a45329 <+73>:    mov    %edi,%eax
   0x0000000000a4532b <+75>:    movzwl %di,%edi
   0x0000000000a4532e <+78>:    lea    0x714e4b(%rip),%rdx        # 0x115a180 <PendingWaitEventStats>
   0x0000000000a45335 <+85>:    shr    $0x18,%eax
   0x0000000000a45338 <+88>:    shl    $0x5,%rax
   0x0000000000a4533c <+92>:    add    0x4e1735(%rip),%rax        # 0xf26a78
   0x0000000000a45343 <+99>:    movslq 0x8(%rax),%rax
   0x0000000000a45347 <+103>:    add    %rdi,%rax
   0x0000000000a4534a <+106>:    incq   (%rdx,%rax,8)
   0x0000000000a4534e <+110>:    lea    0x714e0b(%rip),%rax        # 0x115a160 <have_wait_event_stats>
   0x0000000000a45355 <+117>:    movb   $0x1,(%rax)
   0x0000000000a45358 <+120>:    ret


That's about an order of magnitude increase in overhead, just based on the
number of instructions, and probably considerably more in runtime (you're
adding external function calls etc).


I think to make this viable, we would have to remove some wait events and be
more careful about adding them.  And optimize this code a good bit. This
really shouldn't end up with this much code.


Greetings,

Andres Freund



Re: Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi,

On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote:
> On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> >  * we don't want to allocate memory in some places where the counters are
> >  incremented (see 4feba03d8b9). We could still use the same implementation as for
> >  backend statistics (i.e, make use of flush_static_cb) if we really want/need to
> >  switch to variable stats.
> 
> Even without allocating memory, this makes wait events considerably more
> expensive. Going from ~4 instructions to something more like ~40.  Including
> external function calls.

Yeah, looking at the performance impact and discussing it was on my todo list (as
mentioned above), thanks for looking at it!

> end wait:
>    0x0000000000a08eaa <+90>:    mov    (%r12),%rax

Yeah this one is unbeatable, it will have to be more expensive...

On my side, I'm getting:

=> 0x000000000091a320 <+0>:     mov    0x5337f9(%rip),%rax        # 0xe4db20 <my_wait_event_info>
   0x000000000091a327 <+7>:     mov    (%rax),%eax
   0x000000000091a329 <+9>:     mov    %eax,%edx
   0x000000000091a32b <+11>:    and    $0xff000000,%edx
   0x000000000091a331 <+17>:    jne    0x91a340 <waitEventIncrementCounter+32>
   0x000000000091a333 <+19>:    test   %ax,%ax
   0x000000000091a336 <+22>:    jne    0x91a340 <waitEventIncrementCounter+32>
   0x000000000091a338 <+24>:    ret
   0x000000000091a339 <+25>:    nopl   0x0(%rax)
   0x000000000091a340 <+32>:    cmp    $0x1000000,%edx
   0x000000000091a346 <+38>:    jne    0x91a34e <waitEventIncrementCounter+46>
   0x000000000091a348 <+40>:    cmp    $0x5e,%ax
   0x000000000091a34c <+44>:    ja     0x91a338 <waitEventIncrementCounter+24>
   0x000000000091a34e <+46>:    cmp    $0x7000000,%edx
   0x000000000091a354 <+52>:    sete   %cl
   0x000000000091a357 <+55>:    test   %ax,%ax
   0x000000000091a35a <+58>:    setne  %al
   0x000000000091a35d <+61>:    test   %al,%cl
   0x000000000091a35f <+63>:    jne    0x91a338 <waitEventIncrementCounter+24>
   0x000000000091a361 <+65>:    cmp    $0xb000000,%edx
   0x000000000091a367 <+71>:    je     0x91a338 <waitEventIncrementCounter+24>
   0x000000000091a369 <+73>:    mov    %edi,%eax
   0x000000000091a36b <+75>:    movzwl %di,%edi
   0x000000000091a36e <+78>:    shr    $0x18,%eax
   0x000000000091a371 <+81>:    shl    $0x5,%rax
   0x000000000091a375 <+85>:    movslq 0xe4d9c8(%rax),%rax
   0x000000000091a37c <+92>:    movb   $0x1,0x56045d(%rip)        # 0xe7a7e0 <have_wait_event_stats>
   0x000000000091a383 <+99>:    add    %rdi,%rax
   0x000000000091a386 <+102>:   addq   $0x1,0xe7a800(,%rax,8)
   0x000000000091a38f <+111>:   ret

> That's about an order of magnitude increase in overhead, just based on the
> number of instructions, and probably considerably xmore in runtime (you're
> adding external function calls etc).

Right.

> I think to make this viable, we would have to remove some wait events and be
> more careful about adding them.

You mean incrementing the counters for wait events for which we expect that
the increment is "negligible" as compared to the wait event itself (like DATAFILE
IO and such)?

If so I agree, and that's something I have also discussed with a few people (also if
we add the timings later): we could/should be more granular. If we go that way
we could also imagine a GUC to enable the stats for all of them (default being off)?

> And optimize this code a good bit. This
> really shouldn't end up with this mouch code.

Yes, with the idea mentioned in [1] (code not shared yet), I end up with:

Dump of assembler code for function waitEventIncrementCounter:
=> 0x000000000091b320 <+0>:     mov    %edi,%eax
   0x000000000091b322 <+2>:     and    $0xff000000,%eax
   0x000000000091b327 <+7>:     jne    0x91b330 <waitEventIncrementCounter+16>
   0x000000000091b329 <+9>:     test   %di,%di
   0x000000000091b32c <+12>:    jne    0x91b330 <waitEventIncrementCounter+16>
   0x000000000091b32e <+14>:    ret
   0x000000000091b32f <+15>:    nop
   0x000000000091b330 <+16>:    cmp    $0x1000000,%eax
   0x000000000091b335 <+21>:    jne    0x91b33d <waitEventIncrementCounter+29>
   0x000000000091b337 <+23>:    cmp    $0x5e,%di
   0x000000000091b33b <+27>:    ja     0x91b32e <waitEventIncrementCounter+14>
   0x000000000091b33d <+29>:    cmp    $0x7000000,%eax
   0x000000000091b342 <+34>:    sete   %cl
   0x000000000091b345 <+37>:    test   %di,%di
   0x000000000091b348 <+40>:    setne  %dl
   0x000000000091b34b <+43>:    test   %dl,%cl
   0x000000000091b34d <+45>:    jne    0x91b32e <waitEventIncrementCounter+14>
   0x000000000091b34f <+47>:    cmp    $0xb000000,%eax
   0x000000000091b354 <+52>:    je     0x91b32e <waitEventIncrementCounter+14>
   0x000000000091b356 <+54>:    mov    %edi,%eax
   0x000000000091b358 <+56>:    movzwl %di,%edi
   0x000000000091b35b <+59>:    shr    $0x18,%eax
   0x000000000091b35e <+62>:    mov    0xe4e9c0(,%rax,8),%rax
   0x000000000091b366 <+70>:    movb   $0x1,0x5603f3(%rip)        # 0xe7b760 <have_wait_event_stats>
   0x000000000091b36d <+77>:    mov    0x8(%rax),%rax
   0x000000000091b371 <+81>:    addq   $0x1,(%rax,%rdi,8)
   0x000000000091b376 <+86>:    ret

That looks better before the first branch (could be done in v1 too) as it uses
registers only, but looks worse when it's time to increment the counter, even though
the code is much simpler:

"
    WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info);

    AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++;
"

instead of

"
    counter = waitEventGetCounter(PendingWaitEventStats.counts, waitInfo.classId,
                                  waitInfo.eventId);

    (*counter)++;

"

with the associated machinery in waitEventGetCounter().

Indeed, waitEventGetCounter() produces only one memory read (and one write):

movslq 0xe4d9c8(%rax),%rax
addq   $0x1,0xe7a800(,%rax,8)

while the AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++ produces 2
pointer dereferences and 1 write:

mov    0xe4e9c0(,%rax,8),%rax
mov    0x8(%rax),%rax
addq   $0x1,(%rax,%rdi,8)

I still need to work on it and try to make the code simpler and more efficient.

[1]: https://www.postgresql.org/message-id/aG04Zk%2Bb3SRN0pBe%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi,

On Wed, Jul 09, 2025 at 09:56:10AM +0900, Michael Paquier wrote:
> On Tue, Jul 08, 2025 at 03:25:26PM +0000, Bertrand Drouvot wrote:
> > So, something like:
> > 
> > ClassInfo LWLockClassInfo = {
> >     .numberOfEvents = NB_LWLOCK_WAIT_EVENTS,
> >     .pendingCounts = PendingWaitEventStats.lwlock_counts
> > };
> > 
> > and then the "global" one:
> > 
> > WaitClassInfo *AllWaitClasses[] = {
> >     NULL,  /* gap - no class with ID 0 */
> >     &LWLockClassInfo,
> >     NULL,  /* gap - no class with ID 2 */
> >     &LockClassInfo,
> >     &BufferPinClassInfo,
> >     &ActivityClassInfo,
> >     &ClientClassInfo,
> >     &ExtensionClassInfo,
> >     &IPCClassInfo,
> >     &TimeoutClassInfo,
> >     &IOClassInfo,
> > };
> 
> Okay.  I'm guessing that all this information is automatically built
> based on the contents of wait_event_names.txt (with some tweaks for
> the LWLock part?),

Yeah, fully generated by generate-wait_event_types.pl.

> > That way:
> > 
> > - we do not need a mapping between ClassID and "Arrays" and the gaps
> > in the ClassID are handled. So we can acces directly the class of interest
> > knowing its classID.
> > - we can easily iterate (again no mapping needed) over all the pending stats
> > thanks to AllWaitClasses[]
> > 
> > What do you think? It looks like the most elegant approach to me.
> 
> I think that's pretty cool.

Yeah, but from what I can see the counter increment results in one more memory
read (as compared to v1), see [1]. So the code looks simpler but is more
expensive... I'll try to see what can be done to improve it.

> > Bonus point, as you can see above, while working on this approach I've been able
> > to remove storing the className and the array of eventName per wait class (would
> > have worked in v1 too).
> 
> You mean the extra set of names generated?

Yeah, those are not needed; the only thing that's important is that we ensure
that the counters and the pending ones contain the same "gap" (the LWLock ones),
so that the classID mapping is correct.

[1]: https://www.postgresql.org/message-id/aG4ZjQ1qKensHBOY%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Adding wait events statistics

From:
Andres Freund
Date:
Hi,

On 2025-07-09 07:26:05 +0000, Bertrand Drouvot wrote:
> On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote:
> > On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> > >  * we don't want to allocate memory in some places where the counters are
> > >  incremented (see 4feba03d8b9). We could still use the same implementation as for
> > >  backend statistics (i.e, make use of flush_static_cb) if we really want/need to
> > >  switch to variable stats.
> >
> > Even without allocating memory, this makes wait events considerably more
> > expensive. Going from ~4 instructions to something more like ~40.  Including
> > external function calls.
>
> Yeah looking at the performance impacts and discuss those was in my todo (as
> mentioned above), thanks for looking at it!

TBH, the more I think about this, the less convinced I am that a pure counter
provides *any* additional value over stats we already have, in which case the
overhead is moot.

We commonly use wait events unconditionally around system calls, e.g.:

    pgstat_report_wait_start(wait_event_info);
    returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
    pgstat_report_wait_end();

A counter for the number of wait events is just a less informative version of
what we already have in pg_stat_io. There's literally no value provided by
tracking the number of times that wait event is hit.

A pretty large number of wait events operate that way. For those the
interesting information would be to measure *how long* you are in the wait
event, not the number of times the wait event was entered for a very short
amount of time.

The obvious counter to that is that you likely are planning to add timing
information at some point down the line - leaving overhead aside, that would
*also* just duplicate information for the above case, namely track_io_timing.


The wait events we have as-is are useful because they tell you what's
currently happening, and you can correlate that with other things that are
going on. But a pure counter of wait events encountered doesn't allow for
that.


> > end wait:
> >    0x0000000000a08eaa <+90>:    mov    (%r12),%rax
>
> Yeah this one is unbeatable, it will have to be more expensive...
>
> On my side, I'm getting:
>
> => 0x000000000091a320 <+0>:     mov    0x5337f9(%rip),%rax        # 0xe4db20 <my_wait_event_info>
>    0x000000000091a327 <+7>:     mov    (%rax),%eax
>    0x000000000091a329 <+9>:     mov    %eax,%edx
>    0x000000000091a32b <+11>:    and    $0xff000000,%edx
>    0x000000000091a331 <+17>:    jne    0x91a340 <waitEventIncrementCounter+32>
>    0x000000000091a333 <+19>:    test   %ax,%ax
>    0x000000000091a336 <+22>:    jne    0x91a340 <waitEventIncrementCounter+32>
>    0x000000000091a338 <+24>:    ret
>    0x000000000091a339 <+25>:    nopl   0x0(%rax)
>    0x000000000091a340 <+32>:    cmp    $0x1000000,%edx
>    0x000000000091a346 <+38>:    jne    0x91a34e <waitEventIncrementCounter+46>
>    0x000000000091a348 <+40>:    cmp    $0x5e,%ax
>    0x000000000091a34c <+44>:    ja     0x91a338 <waitEventIncrementCounter+24>
>    0x000000000091a34e <+46>:    cmp    $0x7000000,%edx
>    0x000000000091a354 <+52>:    sete   %cl
>    0x000000000091a357 <+55>:    test   %ax,%ax
>    0x000000000091a35a <+58>:    setne  %al
>    0x000000000091a35d <+61>:    test   %al,%cl
>    0x000000000091a35f <+63>:    jne    0x91a338 <waitEventIncrementCounter+24>
>    0x000000000091a361 <+65>:    cmp    $0xb000000,%edx
>    0x000000000091a367 <+71>:    je     0x91a338 <waitEventIncrementCounter+24>
>    0x000000000091a369 <+73>:    mov    %edi,%eax
>    0x000000000091a36b <+75>:    movzwl %di,%edi
>    0x000000000091a36e <+78>:    shr    $0x18,%eax
>    0x000000000091a371 <+81>:    shl    $0x5,%rax
>    0x000000000091a375 <+85>:    movslq 0xe4d9c8(%rax),%rax
>    0x000000000091a37c <+92>:    movb   $0x1,0x56045d(%rip)        # 0xe7a7e0 <have_wait_event_stats>
>    0x000000000091a383 <+99>:    add    %rdi,%rax
>    0x000000000091a386 <+102>:   addq   $0x1,0xe7a800(,%rax,8)
>    0x000000000091a38f <+111>:   ret

I think adding 6 conditional branches to this is just out of the question...


> > I think to make this viable, we would have to remove some wait events and be
> > more careful about adding them.
>
> You mean incrementing the counters for wait events for which we expect that
> the increment is "negligible" as compare to the wait event itself? (Like DATAFILE
> IO and such)?

I mean that we would need to either *remove* a bunch of wait events, or change
the API so that there are "cheap" wait events that just have the current
overhead, and expensive wait events that do this tracking.


> If so I agree and that's something I also have discussed with a few (also if we
> add the timings later), that we could/should be more granular. If we go that way
> we could also imagine a GUC to set the stats for all of them? (default being
> off).

Any conditional code that you add makes this stuff more expensive...


> > And optimize this code a good bit. This
> > really shouldn't end up with this mouch code.
>
> Yes, with the idea mentioned in [1] (code not shared yet), I end up with:
>
> Dump of assembler code for function waitEventIncrementCounter:
> => 0x000000000091b320 <+0>:     mov    %edi,%eax
>    0x000000000091b322 <+2>:     and    $0xff000000,%eax
>    0x000000000091b327 <+7>:     jne    0x91b330 <waitEventIncrementCounter+16>
>    0x000000000091b329 <+9>:     test   %di,%di
>    0x000000000091b32c <+12>:    jne    0x91b330 <waitEventIncrementCounter+16>
>    0x000000000091b32e <+14>:    ret
>    0x000000000091b32f <+15>:    nop
>    0x000000000091b330 <+16>:    cmp    $0x1000000,%eax
>    0x000000000091b335 <+21>:    jne    0x91b33d <waitEventIncrementCounter+29>
>    0x000000000091b337 <+23>:    cmp    $0x5e,%di
>    0x000000000091b33b <+27>:    ja     0x91b32e <waitEventIncrementCounter+14>
>    0x000000000091b33d <+29>:    cmp    $0x7000000,%eax
>    0x000000000091b342 <+34>:    sete   %cl
>    0x000000000091b345 <+37>:    test   %di,%di
>    0x000000000091b348 <+40>:    setne  %dl
>    0x000000000091b34b <+43>:    test   %dl,%cl
>    0x000000000091b34d <+45>:    jne    0x91b32e <waitEventIncrementCounter+14>
>    0x000000000091b34f <+47>:    cmp    $0xb000000,%eax
>    0x000000000091b354 <+52>:    je     0x91b32e <waitEventIncrementCounter+14>
>    0x000000000091b356 <+54>:    mov    %edi,%eax
>    0x000000000091b358 <+56>:    movzwl %di,%edi
>    0x000000000091b35b <+59>:    shr    $0x18,%eax
>    0x000000000091b35e <+62>:    mov    0xe4e9c0(,%rax,8),%rax
>    0x000000000091b366 <+70>:    movb   $0x1,0x5603f3(%rip)        # 0xe7b760 <have_wait_event_stats>
>    0x000000000091b36d <+77>:    mov    0x8(%rax),%rax
>    0x000000000091b371 <+81>:    addq   $0x1,(%rax,%rdi,8)
>    0x000000000091b376 <+86>:    ret

That's still way too many jumps.


> That looks better before the first branch (could be done in v1 too) as it uses
> registers only but looks worse when it's time to increment the counter while
> the code is much simpler:
>
> "
>     WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info);
>
>     AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++;
> "

Too many indirections. The increment should be doable without needing to
dereference intermediary memory locations.

Greetings,

Andres Freund



Re: Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi,

On Wed, Jul 09, 2025 at 11:19:00AM -0400, Andres Freund wrote:
> Hi,
> 
> On 2025-07-09 07:26:05 +0000, Bertrand Drouvot wrote:
> > On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote:
> > > On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> 
> TBH, the more I think about this, the less convinced a pure counter provides
> *any* additional value over stats we already have, in which case the overhead
> is moot.

I think there are a lot of wait events that don't have a stats counterpart:
the Controlfile ones, some of the DataFile ones (prefetch..), the Dsm ones, to name a few.

> We use commonly use wait events unconditionally around system calls, e.g.:
> 
>     pgstat_report_wait_start(wait_event_info);
>     returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
>     pgstat_report_wait_end();
> 
> A counter for the number of wait events is just less informative version of
> what we already have in pg_stat_io. There's literally no value provided by
> tracking the number of times that wait event is hit.

For this particular one I do agree, IO is kind of an exception here.

> A pretty large number of wait events operate that way. For those the
> interesting information would be to measure *how long* you are in the wait
> event, not the number of times the wait event was entered for a very short
> amount of time.

The counters on the wait events could be seen as "system/engine stats" rather than
as wait events. They could help describe the engine's behavior over time.

> The obvious counter to that is that you likely are planning to add timing
> information at some point down the line

Yes, and then the counters also become even more valuable. Also, timings without
counters would not make that much sense, so counters are kind of a mandatory
first step.

> - leaving overhead aside, that would
> *also* just duplicate information for the above case, namely track_io_timing.

Agreed for this particular case, but we currently have 273 in-core wait events.

> The wait events we have as-is are useful because they tell you what's
> currently happening, and you can correlate that with other things that are
> going on. But a pure counter of wait events encountered doesn't allow for
> that.

Yes, I think the pure counters could/should be seen as engine statistics instead.
And they would become even more valuable once we add the timings.

> I think adding 6 conditional branches to this is just out of question...

This has to be optimized, I do agree.

> > > I think to make this viable, we would have to remove some wait events and be
> > > more careful about adding them.
> >
> > You mean incrementing the counters for wait events for which we expect that
> > the increment is "negligible" as compare to the wait event itself? (Like DATAFILE
> > IO and such)?
> 
> I mean that we would need to either *remove* a bunch of wait events, or change
> the API so that there are "cheap" wait events that just have the current
> overhead, and expensive wait events that do this tracking.

I had the latter in mind, as it's still good to know what the engine is doing/waiting
for, even if that's "only" at a regular sampling interval (we would lose that
information if we were to *remove* a bunch of wait events).

> > If so I agree and that's something I also have discussed with a few (also if we
> > add the timings later), that we could/should be more granular. If we go that way
> > we could also imagine a GUC to set the stats for all of them? (default being
> > off).
> 
> Any conditional code that you add makes this stuff more expensive...

Yes, or it could be a compile-time flag to avoid the condition being hit
at runtime. That said, I like your API change idea: we could decide which code
paths use the tracking and which do not.
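
Something like this rough sketch of the opt-in idea (names are hypothetical;
waitEventAccumulate() would be a new helper bumping the counter and accumulating
the duration):

static inline void
pgstat_report_wait_start_extended(uint32 wait_event_info, instr_time *start)
{
    pgstat_report_wait_start(wait_event_info);
    INSTR_TIME_SET_CURRENT(*start);
}

static inline void
pgstat_report_wait_end_extended(uint32 wait_event_info, instr_time *start)
{
    instr_time  duration;

    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, *start);
    waitEventAccumulate(wait_event_info, INSTR_TIME_GET_MICROSEC(duration));
    pgstat_report_wait_end();
}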

> That's still way too many jumps.

Yes, I did not try to optimize that part yet. The above idea would certainly help
though; I mean we could probably remove most of the jumps.

> > That looks better before the first branch (could be done in v1 too) as it uses
> > registers only but looks worse when it's time to increment the counter while
> > the code is much simpler:
> >
> > "
> >     WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info);
> >
> >     AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++;
> > "
> 
> Too many indirections. The increment should be doable without needing to
> dereference intermediary memory locations.

Right. I can think of 2 approaches:

1. Automatically build a function (thanks to generate-wait_event_types.pl) that
returns the offset of the wait event, something like:

static inline int
get_wait_event_counter_index(uint32 wait_event_info)
{
    uint32 classId = wait_event_info & WAIT_EVENT_CLASS_MASK;
    uint32 eventId = wait_event_info & WAIT_EVENT_ID_MASK;

    switch (classId) {
        case PG_WAIT_LWLOCK:
            if (eventId >= 95) return -1;
            return 0 + eventId;
        case PG_WAIT_LOCK:
            if (eventId >= 12) return -1;
            return 95 + eventId;
        case PG_WAIT_BUFFERPIN:
.
.
.

Then the offsets are "hardcoded" and there are no memory reads to get the offset
(confirmed by the assembly code):

"
cmp    $0x8,%edi
ja     0x91b32e
add    $0x7e,%edi
"

But the switch is not that good because it also produces many more cmp instructions,
so the last case will "suffer" more than the first one, for example.

2. Make use of a lookup table, like:

static const int WaitClassOffsets[11] = {
    -1, 
    0,
    -1,
    95,
    107,
    108,
    126,
    135,
    136,
    194,
    204,
};

but then one indirection would still be needed, so a memory read to get
WaitClassOffsets[classId].
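
To make the comparison concrete, the increment path with the lookup table would be
something like this (sketch only; PendingWaitEventCounts is a hypothetical flat
array of counters, and the shift assumes the classId sits in the top byte as
WAIT_EVENT_CLASS_MASK suggests):

static inline void
waitEventIncrementCounter(uint32 wait_event_info)
{
    uint32      classId = (wait_event_info & WAIT_EVENT_CLASS_MASK) >> 24;
    uint32      eventId = wait_event_info & WAIT_EVENT_ID_MASK;
    int         offset;

    if (classId >= lengthof(WaitClassOffsets))
        return;                 /* e.g. custom wait events, not tracked here */

    offset = WaitClassOffsets[classId];
    if (offset >= 0)
        PendingWaitEventCounts[offset + eventId]++;
}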

I think that the read could cost more CPU cycles (than the worst switch case)
depending on where it is read from (L1, L2, L3).

That said, given its size and how frequently it would be accessed, it's
likely to be in L1, and if that's the case then it's probably better than the
switch idea. Thoughts?

Let me try to produce an action plan:

- Let's define a list of "cheap" and a list of "expensive" wait events
- Change the code path to increment counters only for the "expensive" ones
- Optimize the counter incrementation as much as possible

Does that sound reasonable to you?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Adding wait events statistics

From:
Andres Freund
Date:
Hi,

On 2025-07-10 08:21:56 +0000, Bertrand Drouvot wrote:
> - Let's define a list of "cheap" and a list of "expensive" wait events
> - Change the code path to increment counters only for the "expensive" ones
> - Optimize the counter incrementation as much as possible
> 
> Does that sound reasonable to you?

That does seem like the minimum.

Unfortunately I'm rather doubtful this provides enough value to be worth the
cost. But I'm rather willing to be proven wrong.

Greetings,

Andres Freund



Re: Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi,

On Tue, Jul 15, 2025 at 04:13:49PM +0200, Bertrand Drouvot wrote:
> Hi,
> 
> On Fri, Jul 11, 2025 at 11:59:39AM -0400, Andres Freund wrote:
> > Hi,
> > 
> > On 2025-07-10 08:21:56 +0000, Bertrand Drouvot wrote:
> > > - Let's define a list of "cheap" and a list of "expensive" wait events
> > > - Change the code path to increment counters only for the "expensive" ones
> > > - Optimize the counter incrementation as much as possible
> > > 
> > > Does that sound reasonable to you?
> > 
> > That does seem like the minimum.
> > 
> > Unfortunately I'm rather doubtful this provides enough value to be worth the
> > cost. But I'm rather willing to be proven wrong.
> > 
> 
> I guess that would (among other things) depend on how many wait events would be
> tracked. I'll do some study to measure the counters' impact on the wait events.
>                                                                                 

Here is what I've done:

1. Added 2 probes: one in pgstat_report_wait_start() and one in pgstat_report_wait_end()
to measure the wait events duration. Then, I ran make check-world and computed 
the duration percentiles (in microseconds) per wait event (see attached).

Note that this is without the counters patch, just to get an idea of wait events
duration on my lab.

Due to the way I ran the eBPF script I’ve been able to trap 159 wait events
(out of 273) during "make check-world" (I think that’s enough for a first
analysis, see below). 

2. Applied the patch with the counters (one that I did not share yet), removed
the probes in pgstat_report_wait_start() and pgstat_report_wait_end() and added
one probe before incrementing the counters (i.e before calling waitEventIncrementCounter())
in pgstat_report_wait_end() and one probe after incrementing the counters to measure
the counters increment duration.

Then I ran make check-world and computed counter increment duration percentiles
(in nanoseconds) and got:

P50.00: 1138.000000
P90.00: 2674.000000
P95.00: 3741.000000
P99.00: 7657.000000
P99.90: 16973.881000
P99.99: 55614.830000

Then I compared the P50.00 for the counter and for the wait events and rejected
the wait events where incrementing the counter produces more than 5% overhead.

-> 59 wait events (out of the 159 observed during make check-world) had less
than 5% overhead; they are spread across classes, but among those 59 we have:

All the trapped ones in the Lock class (8 events) (0.7% avg impact):

spectoken
advisory
tuple
relation
transactionid
virtualxid
extend
applytransaction
 
All the trapped ones in the Timeout class (7 events) (0.03% avg impact):

RECOVERY_APPLY_DELAY
RECOVERY_RETRIEVE_RETRY_INTERVAL
CHECKPOINT_WRITE_DELAY
VACUUM_TRUNCATE
BASE_BACKUP_THROTTLE
PG_SLEEP
SPIN_DELAY

And the BufferPin one (about 4%).

Note that I would get the same behavior (just a little bit more overhead) if I
add the wait events duration (INSTR_*) to the game.
 
With the above, it looks like it would be possible to add the counters and
their timings for the Lock and Timeout classes without producing a noticeable
performance impact.

I think that:

- That’s easy enough to reason about and kind of expected for those 2 classes
- We would miss other wait events but I think the trade-off is worth the simplicity
- Having the ability to get the counters and the timings for those 2 wait classes
is better than nothing
- Having those metrics for the Lock class would be a nice addition to pg_locks

Thoughts?

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Adding wait events statistics

From:
Andres Freund
Date:
Hi,

On 2025-07-18 06:04:38 +0000, Bertrand Drouvot wrote:
> Here what I’ve done:
>
> 1. Added 2 probes: one in pgstat_report_wait_start() and one in pgstat_report_wait_end()
> to measure the wait events duration. Then, I ran make check-world and computed
> the duration percentiles (in microseconds) per wait event (see attached).

> Note that this is without the counters patch, just to get an idea of wait events
> duration on my lab.
>
> Due to the way I ran the eBPF script I’ve been able to trap 159 wait events
> (out of 273) during "make check-world" (I think that’s enough for a first
> analysis, see below).

It's important to note that eBPF tracepoints can add significant overhead,
i.e. your baseline measurement is influenced by that overhead.  To showcase
that, in a ridiculously exaggerated way, I added probes with bpftrace, on start
and exit of heap_getnextslot(), measuring the time in the function, and
compared the performance of that to not having the probes.

Without probes my seqscan takes 1651.221ms, with probes 15417.930ms. I.e. it
slowed down by about an order of magnitude.

I don't think you can use numbers gathered from this as a baseline measurement
to evaluate the overhead added by adding explicit counters. The overhead of
the explicit counter is going to be *WAY* lower than the eBPF based probes.


You can't usefully use eBPF probes to measure costs that are this small: each
time you hit a probe it has to trap into kernel space, run a bit of code, and
return to userspace. That'll be a few hundred cycles at the very least. If you
actually want to compute a semi-accurate duration with eBPF, the duration of
what you measure needs to be considerably longer.


> 2. Applied the patch with the counters (one that I did not share yet), removed
> the probes in pgstat_report_wait_start() and pgstat_report_wait_end() and added
> one probe before incrementing the counters (i.e before calling waitEventIncrementCounter())
> in pgstat_report_wait_end() and one probe after incrementing the counters to measure
> the counters increment duration.
>
> Then I ran make check-world and computed counter increment duration percentiles
> (in nanoseconds) and got:
>
> P50.00: 1138.000000
> P90.00: 2674.000000
> P95.00: 3741.000000
> P99.00: 7657.000000
> P99.90: 16973.881000
> P99.99: 55614.830000

The above confirms my suspicion. It's inconceivable that the explicit counting
you added costs ~1k nanoseconds on average. Multiply that by the frequency of
the CPU and the fact that CPUs are superscalar (can execute multiple instructions
at once), and the overhead you compute here would be roughly 10k
instructions. Obviously that's not true. If it were, we'd have to just mark
this patch as rejected and never talk about it again.


Even if your baseline measurement did provide an accurate baseline number,
it'd still not be particularly helpful, I suspect. Whether adding additional
instructions actually causes slowdowns isn't just determined by the raw number
of instructions, it's also depends on what work is being done in the
CPU. E.g. if you add an extra 50 cheap instructions into a codepath that's
completely bottlenecked by memory latency, you'll often not increase the total
time taken meaningfully, as all those instructions are executed while waiting
for memory (which has a latency longer than those 50 cheap instructions added
together). But if you add 50 cheap instructions to a workload that's purely
CPU bound, you're much more likely to slow that down measurably.


> I think that:
>
> - That’s easy enough to reason about and kind of expected for those 2 classes
> - We would miss other wait events but I think the trade-off is worth the simplicity
> - Having the ability to get the counters and the timings for those 2 wait classes
> is better than nothing
> - Having those metrics for the Lock class would be a nice addition to pg_locks
>
> Thoughts?

Unfortunately, due to the issues explained above, I don't think we can learn a
lot from this experiment :(

I also continue to not believe that pure event counters are going to be useful
for the majority of wait events. I'm not sure it is really interesting for
*any* wait event that we don't already have independent stats for.

I think if we do want to have wait events that have more details, we need to:

a) make them explicitly opt-in, i.e. code has to be changed over to use the
   extended wait events
b) the extended wait events need to count both the number of encounters as
   well as the duration, the number of encounters is not useful on its own
c) for each callsite that is converted to the extended wait event, you either
   need to reason why the added overhead is ok, or do a careful experiment


Personally I'd rather have an in-core sampling collector, counting how often
it sees certain wait events when sampling. It then also can correlate those
samples with other things on the system, e.g. by counting the number of active
backends together with each sample. And eventually also memory usage etc.

Greetings,

Andres Freund



Re: Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi,

On Fri, Jul 18, 2025 at 11:43:47AM -0400, Andres Freund wrote:
> Hi,
> 
> On 2025-07-18 06:04:38 +0000, Bertrand Drouvot wrote:
> > Here what I’ve done:
> >
> > 1. Added 2 probes: one in pgstat_report_wait_start() and one in pgstat_report_wait_end()
> > to measure the wait events duration. Then, I ran make check-world and computed
> > the duration percentiles (in microseconds) per wait event (see attached).
> 
> > Note that this is without the counters patch, just to get an idea of wait events
> > duration on my lab.
> >
> > Due to the way I ran the eBPF script I’ve been able to trap 159 wait events
> > (out of 273) during "make check-world" (I think that’s enough for a first
> > analysis, see below).
> 
> It's important to note that eBPF tracepoints can add significant overhead,
> i.e. your baseline measurement is influenced by that overhead.

Thanks for looking at it! I do agree that eBPF tracepoints can add significant
overhead.

> The overhead of
> the explicit counter is going to be *WAY* lower than the eBPF based probes.

Agree.

> Unfortunately, due to the issues explained above, I don't think we can learn a
> lot from this experiment :(

I think that statement is debatable. While it's true that eBPF tracepoints
add overhead, I used them to get both the wait event timing percentiles and
the counter increment timing percentiles. Then I used those percentiles
(both of which include the overhead) to extract the wait events/classes for
which the p50 counter increment would represent less than 5% of the p50 wait
time, i.e.:

100 * p50_counter_timing / p50_wait_time < 5

And we have seen that we got 0.7% for the Lock class and 0.03% for the Timeout
class. With those numbers, and even though the eBPF tracepoints add overhead, I
think we can learn that it would be safe to add the counters for those two wait
classes.

Anyway, let's forget about eBPF, I ran another experiment by counting the cycles
with:

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

and then calling this function before and after waitEventIncrementCounter()
and also at wait_start() and wait_end() (without the counter increment patch).
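
Roughly, the counter-increment side of the instrumentation looked like this
(simplified sketch; the exact signature of waitEventIncrementCounter() and the
way the deltas are accumulated are not shown as-is here):

    uint64_t start = rdtsc();
    waitEventIncrementCounter(wait_event_info);      /* the patched code path */
    uint64_t cycles = rdtsc() - start;               /* cost of the increments */
    record_cycles_sample(wait_event_info, cycles);   /* made-up accumulator */

and, separately, one rdtsc() call in wait_start() and another in wait_end() to
get the per-wait-event cycles.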

This gives the following number of cycles for the counter increments (I also
added INSTR* timings to get the full picture, as we want them anyway):

p50: 146.000
p90: 722.000
p95: 1148.000
p99: 3474.000
p99.9: 13600.132

So that we can compare with the percentile cycles per wait events (see attached).

We can see that, for those wait classes, all their wait events overhead would be
< 5% and more precisely:

Overhead on the lock class is about 0.03%
Overhead on the timeout class is less than 0.01%

and now we can also see that:

Overhead on the lwlock class is about 1%
Overhead on the client class is about 0.5%
Overhead on the bufferpin class is about 0.2%

while the io and ipc classes have mixed results.

So based on the cycles metric I think it looks pretty safe to implement for the
vast majority of classes.

> I also continue to not believe that pure event counters are going to be useful
> for the majority of wait events. I'm not sure it is really interesting for
> *any* wait event that we don't already have independent stats for.

For pure counters only I can see your point, but for counters + timings are you
also not convinced?

> I think if we do want to have wait events that have more details, we need to:
> 
> a) make them explicitly opt-in, i.e. code has to be changed over to use the
>    extended wait events
> b) the extended wait events need to count both the number of encounters as
>    well as the duration, the number of encounters is not useful on its own
> c) for each callsite that is converted to the extended wait event, you either
>    need to reason why the added overhead is ok, or do a careful experiment
>

I do agree with the above. What do you think about this latest experiment counting
the cycles?

> 
> Personally I'd rather have an in-core sampling collector, counting how often
> it sees certain wait events when sampling.

Yeah but even if we are okay with losing "counters" by sampling, we'd still not get
the duration. For the duration to be meaningful we also need the exact counts.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Adding wait events statistics

From:
Andres Freund
Date:
Hi,

On 2025-07-22 12:24:46 +0000, Bertrand Drouvot wrote:
> Anyway, let's forget about eBPF, I ran another experiment by counting the cycles
> with:
>
> static inline uint64_t rdtsc(void) {
>     uint32_t lo, hi;
>     __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
>     return ((uint64_t)hi << 32) | lo;
> }
>
> and then calling this function before and after waitEventIncrementCounter()
> and also at wait_start() and wait_end() (without the counter increment patch).

I think you're still going to get massively increased baseline numbers that
way - the normal cost of a wait event is under 10 cycles. Doing two rdtscs
costs somewhere between 60-90 cycles. Which means that any increase due to
counters & timers will look a lot cheaper if compared to that increased
baseline, than if you compared to the actual current cost of a wait event.
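
To put rough numbers on that (purely as an illustration): if reporting a wait
event really costs ~10 cycles and the two rdtscs add ~80, an extra 66 cycles of
counting looks like roughly a 70% increase over the ~90-cycle measured
baseline, whereas it is more than a 6x increase over the true ~10-cycle cost.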


> So that we can compare with the percentile cycles per wait events (see attached).
>
> We can see that, for those wait classes, all their wait events overhead would be
> < 5% and more precisely:
>
> Overhead on the lock class is about 0.03%
> Overhead on the timeout class is less than 0.01%
>
> and now we can also see that:
>
> Overhead on the lwlock class is about 1%
> Overhead on the client class is about 0.5%
> Overhead on the bufferpin class is about 0.2%

I think that's largely because there are relatively few such wait events,
because there is very very little contention in the regression tests and we
just don't do a whole lot of intensive things in the tests. I suspect that at
least some of the high events here will actually be due to tests that
explicitly test the contention behaviour, and thus will have very high wait
times.

E.g. if you measure client timings, the overhead here will be fairly low,
because we're not going to be CPU bound by the back/forth between client and
server, and thus many of the waits will be longer. If you instead measure a
single client readonly pgbench, it'll look different.  Similar, if you have
lwlock contention in a real world workload, most of the waits will be
incredibly short, but in our tests that will not necessarily be the case.


> while the io and ipc classes have mixed results.
>
> So based on the cycles metric I think it looks pretty safe to implement for the
> vast majority of classes.

This precisely is why I am scared of this effort. If you only look at it in
the right light, it'll look cheap, but in other cases it'll cause measurable
slowdowns.


> > I also continue to not believe that pure event counters are going to be useful
> > for the majority of wait events. I'm not sure it is really interesting for
> > *any* wait event that we don't already have independent stats for.
>
> For pure counters only I can see your point, but for counters + timings are you
> also not convinced?

For counters + timings I can see that it'd be useful. But I don't believe it's
close to as cheap as you say it is.


> > I think if we do want to have wait events that have more details, we need to:
> >
> > a) make them explicitly opt-in, i.e. code has to be changed over to use the
> >    extended wait events
> > b) the extended wait events need to count both the number of encounters as
> >    well as the duration, the number of encounters is not useful on its own
> > c) for each callsite that is converted to the extended wait event, you either
> >    need to reason why the added overhead is ok, or do a careful experiment
> >
>
> I do agree with the above. What do you think about this latest experiment counting
> the cycles?

I continue to not believe it at all, sorry.  Even if the counting method were
accurate, you can't use our tests to measure the relative overhead, as they
aren't actually exercising the paths leading to waits.


> > Personally I'd rather have an in-core sampling collector, counting how often
> > it sees certain wait events when sampling.
>
> Yeah but even if we are okay with losing "counters" by sampling, we'd still not get
> the duration. For the duration to be meaningful we also need the exact counts.

You don't need precise duration to see what wait events are a problem. If you
see that some event is sampled a lot, you know it's because there either are a
*lot* of those wait events or the wait events are entered into for a long
time.

Greetings,

Andres Freund



Re: Adding wait events statistics

From:
Robert Haas
Date:
On Tue, Jul 22, 2025 at 8:24 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> So based on the cycles metric I think it looks pretty safe to implement for the
> vast majority of classes.

I'm not convinced that this is cheap enough to implement, and I
don't understand the value proposition, either. I see the first couple
of messages in the thread say that this is important for
troubleshooting problems in "our user base" (not sure about the
antecedent of "our") but the description of what those problems are
seems pretty vague to me. I've tried counting wait events of a
particular type and the resulting data was extremely misleading and
therefore useless.

I think it's probably a mistake to even be thinking about this in
terms of wait events. It seems very reasonable to want more data about
what PostgreSQL backends are doing, but I don't really see any reason
why that should happen to line up with whatever wait events do. For
instance, if you asked "what information is useful to gather about
heavyweight locks?" you might say "well, we'd like to know how many
times we tried to acquire one, and how many of those times we had to
wait, and how many of those times we waited for longer than
deadlock_timeout". And then you could design a facility to answer
those questions. Or you might say "we'd like a history of all the
different heavyweight locks that a certain backend has tried to
acquire," and then you could design a tracing facility to give you
that. Either of those hypothetical facilities involves providing more
information than you would get from just counting wait events, or
counting+timing wait events, or recording a complete history of all
wait events.
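
(To illustrate the first idea only: the counters could be as simple as
something like the struct below, where all names are hypothetical and this is
in no way a worked-out proposal.)

typedef struct LockAcquireStats
{
    uint64      acquisitions;               /* times we tried to acquire the lock */
    uint64      waits;                      /* ... of which we had to wait */
    uint64      waits_past_deadlock_timeout;    /* ... of which we waited longer
                                                 * than deadlock_timeout */
} LockAcquireStats;

One entry per lock type, bumped at the point where the lock manager decides to
sleep and at the point where the deadlock timeout fires, would already be
enough to answer the "what percentage of lock acquisitions are contended?"
question below.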

And, I would say that, for more or less that exact reason, there's a
real use case for either of them. Maybe either or both are valuable
enough to add and maybe they are not, and maybe the overhead is
acceptable or maybe it isn't, but I think the arguments are much
better than for a facility that just counts and times all the wait
events. For instance, the former facility would let you answer the
question "what percentage of lock acquisitions are contended?" whereas
a pure wait event count just lets you answer the question "how many
contended lock acquisitions were there?". The latter isn't useless,
but the former is a lot better, particularly since as I proposed it,
you can also judge how many of those contention events involved a
brief wait vs. a long one.

But if, on the other hand, you look at LWLocks, it's a totally
different situation, IMHO. I'm almost sure that measuring LWLock wait
times is going to be too expensive to be practical, and I don't really
see why you'd want to: the right approach is sampling, which is cheap
and in my experience highly effective. Measuring counts doesn't seem
very useful either: knowing the number of times that somebody tried to
acquire a relation lock or a tuple lock arguably tells you something
about your workload that you might want to know, whereas I would argue
that knowing the number of times that somebody tried to acquire a
buffer lock doesn't really tell you anything at all. What you might
want to know is how many buffers you accessed, which is why we've
already got a system for tracking that. That's not to say that there's
nothing at all that somebody could want to know about LWLocks that you
can't already find out today: for example, a system that identifies
which buffers are experiencing significant buffer lock contention, by
relation OID and block number, sounds quite handy. But just counting
wait events, or counting and timing them, will not give you that.
Knowing which SLRUs are being heavily used could also be useful, but I
think there's a good argument that having some instrumentation that
cuts across SLRU types and exposes a bunch of useful numbers for each
could be more useful than just hoping you can figure it out from
LWLock wait event counts.

In short, I feel like just counting, or counting+timing, all the wait
events is painting with too broad a brush. Wait events get inserted
for a specific purpose: so you can know why a certain backend is
off-CPU without having to use debugging tools. They serve that purpose
pretty well, but that doesn't mean that they serve other purposes
well, and I find it kind of hard to see the argument that just
sticking a bunch of counters or timers in the same places where we put
the wait event calls would be the right thing in general.
Introspection is an important topic and, IMHO, deserves much more
specific and nuanced thought about what we're trying to accomplish and
how we're going about it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi,

On Tue, Jul 22, 2025 at 10:07:30AM -0400, Andres Freund wrote:
> Hi,
> 
> On 2025-07-22 12:24:46 +0000, Bertrand Drouvot wrote:
> > Anyway, let's forget about eBPF, I ran another experiment by counting the cycles
> > with:
> >
> > static inline uint64_t rdtsc(void) {
> >     uint32_t lo, hi;
> >     __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
> >     return ((uint64_t)hi << 32) | lo;
> > }
> >
> > and then calling this function before and after waitEventIncrementCounter()
> > and also at wait_start() and wait_end() (without the counter increment patch).
> 
> I think you're still going to get massively increased baseline numbers that
> way - the normal cost of a wait event is under 10 cycles. Doing two rdtscs
> costs somewhere between 60-90 cycles. Which means that any increase due to
> counters & timers will look a lot cheaper if compared to that increased
> baseline, than if you compared to the actual current cost of a wait event.

Let's call the rdtsc overhead O; what is being computed is:

100 * (actual p50_counter_timing + O) / (actual p50_wait_time + O)

Assuming the rdtscs cost is say 80 cycles, that gives:

100 * (actual p50_counter_timing + 80) / (actual p50_wait_time + 80)

which means (as the measured p50_counter_timing is 146, i.e. 66 + 80):

100 * (66 + 80) / (actual p50_wait_time + 80)

So, if we look at, say, the lock class result of 0.03%, it means:

actual p50_wait_time + 80 = 100 * 146 / 0.03 ≈ 486667 cycles

which means that the 66-cycle counter increment overhead is negligible.

So, when:

100 * (actual p50_counter_timing + O) / (actual p50_wait_time + O)

results in very small numbers, I think we can still rely on the ratio to draw
conclusions. I think the current computation could produce false negatives but
not false positives: when the ratio is very small, we can reliably conclude
that the counters' overhead is negligible.
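
(More formally: with c = actual p50_counter_timing, w = actual p50_wait_time
and w > c, we have (c + O) / (w + O) > c / w, so the measured ratio can only
overestimate the true one, hence no false positives.)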

> I think that's largely because there are relatively few such wait events,

The exact numbers are:

# cat make_check-world_cycles_percentiles.csv | grep ",lock" | awk -F, '{print $3,$4}'
advisory 332
applytransaction 44
extend 156
frozenid 2
object 2
relation 409
spectoken 1
transactionid 693
tuple 17
virtualxid 257

# cat make_check-world_cycles_percentiles.csv | grep ",timeout" | awk -F, '{print $3,$4}'
BASE_BACKUP_THROTTLE 5
CHECKPOINT_WRITE_DELAY 1262
PG_SLEEP 55
RECOVERY_APPLY_DELAY 4
RECOVERY_RETRIEVE_RETRY_INTERVAL 62
SPIN_DELAY 46
VACUUM_TRUNCATE 200

# cat make_check-world_cycles_percentiles.csv | grep ",lwlock" | awk -F, '{print $3,$4}'
AioWorkerSubmissionQueue 1498
BufferContent 44825
BufferMapping 808
CheckpointerComm 431
ControlFile 6
DynamicSharedMemoryControl 67
LockFastPath 1096
LockManager 1565
OidGen 1264
ParallelHashJoin 33
ParallelQueryDSA 17
PerSessionDSA 2
PgStatsDSA 136
PgStatsData 41
PgStatsHash 28
ProcArray 263
RelCacheInit 156
RelationMapping 3
ReplicationSlotAllocation 5
ReplicationSlotControl 1
SInvalRead 577
SInvalWrite 166
SerializableXactHash 2
SharedTidBitmap 12
TwoPhaseState 8
WALInsert 4935
WALWrite 5875
XactBuffer 1
XactSLRU 98
XidGen 56
unknown_event_95 1662

# cat make_check-world_cycles_percentiles.csv | grep ",client" | awk -F, '{print $3,$4}'
CLIENT_READ 330753
CLIENT_WRITE 6382
LIBPQWALRECEIVER_CONNECT 2137
LIBPQWALRECEIVER_RECEIVE 3875
WAL_SENDER_WAIT_FOR_WAL 7
WAL_SENDER_WRITE_DATA 2740
unknown_event_8 48

# cat make_check-world_cycles_percentiles.csv | grep ",bufferpin" | awk -F, '{print $3,$4}'
BUFFER_PIN 89

Some of them have only a few occurrences, while others have more decent numbers.

> because there is very very little contention in the regression tests and we
> just don't do a whole lot of intensive things in the tests. I suspect that at
> least some of the high events here will actually be due to tests that
> explicitly test the contention behaviour, and thus will have very high wait
> times.

Yeah, probably; I would need to check in more detail.

> E.g. if you measure client timings, the overhead here will be fairly low,
> because we're not going to be CPU bound by the back/forth between client and
> server, and thus many of the waits will be longer. If you instead measure a
> single client readonly pgbench, it'll look different.  Similar, if you have
> lwlock contention in a real world workload, most of the waits will be
> incredibly short, but in our tests that will not necessarily be the case.

Agreed.

> For counters + timings I can see that it'd be useful. But I don't believe it's
> close to as cheap as you say it is.

Thanks for the confirmation.

> You don't need precise duration to see what wait events are a problem. If you
> see that some event is sampled a lot, you know it's because there either are a
> *lot* of those wait events or the wait events are entered into for a long
> time.

That's right, but tracking the duration would help to determine which case it is:

a *lot* of those wait events or the wait events are entered into for a long time.

That said, I also understand your concerns (and the ones shared by Robert in
this thread [1]), even if I do not necessarily agree with all of them, and I also
understand that you both think that there is room for instrumentation improvements.

I'll try to think of other approaches (like the ones mentioned by Robert), as
I think that fully makes sense and they are probably even better (meaning they
provide more context).

In the meantime, I'll put the wait events stats approach on hold (as both of you
are skeptical about it).

Regards,

[1]:
https://www.postgresql.org/message-id/CA%2BTgmobptuUWo7X5zcQrWKh22qeAn4eL%2B%3Dwtb8_ajCOR%2B7_tcw%40mail.gmail.com

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi,

On Wed, Jul 23, 2025 at 11:38:07AM -0400, Robert Haas wrote:
> On Tue, Jul 22, 2025 at 8:24 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > So based on the cycles metric I think it looks pretty safe to implement for the
> > vast majority of classes.
> 
> I'm not convinced that this is cheap enough to implement,

Thanks for looking at it and providing your thoughts!

> and I
> don't understand the value proposition, either. I see the first couple
> of messages in the thread say that this is important for
> troubleshooting problems in "our user base" (not sure about the
> antecedent of "our") but the description of what those problems are
> seems pretty vague to me.

Well, the idea was more: as we speak about "wait" events, it would make
sense to add their duration. And then, to have meaningful data to interpret the
durations, it would make sense to add the counters. So that one could answer
questions like:

* Is the engine’s wait pattern the same over time?
* Is application’s "A" wait pattern the same over time?
* I observe a peak in wait event "W": is it because "W" is now waiting longer or
is it because it is hit more frequently?

> I think it's probably a mistake to even be thinking about this in
> terms of wait events.

I think it feels natural to add "wait time" to "wait events"; that's where this
started.

> For
> instance, if you asked "what information is useful to gather about
> heavyweight locks?" you might say "well, we'd like to know how many
> times we tried to acquire one, and how many of those times we had to
> wait, and how many of those times we waited for longer than
> deadlock_timeout".

Yes.

> And then you could design a facility to answer
> those questions.

I think the current proposal would provide a way to get some of those answers
and a way to "estimate" some others. But designing a facility to answer all of
those would be better.

> Or you might say "we'd like a history of all the
> different heavyweight locks that a certain backend has tried to
> acquire," and then you could design a tracing facility to give you
> that. Either of those hypothetical facilities involve providing more
> information than you would get from just counting wait events, or
> counting+timing wait events, or recording a complete history of all
> wait events.

Agreed.

> And, I would say that, for more or less that exact reason, there's a
> real use case for either of them. Maybe either or both are valuable
> enough to add and maybe they are not, and maybe the overhead is
> acceptable or maybe it isn't, but I think the arguments are much
> better than for a facility that just counts and times all the wait
> events.

If we add more knowledge per wait class/wait event by providing "dedicated"
facilities, I agree that would provide more value than just "counts and durations".

> For instance, the former facility would let you answer the
> question "what percentage of lock acquisitions are contended?" whereas
> a pure wait event count just lets you answer the question "how many
> contended lock acquisitions were there?".

Right.

> The latter isn't useless,
> but the former is a lot better,

Agreed.

> I'm almost sure that measuring LWLock wait
> times is going to be too expensive to be practical,

On my lab it added 60 cycles, I'm not sure that is too expensive. But even
if we think this is, maybe we could provide an option to turn this "overhead" off/on
with a GUC or compilation flag.
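
Something along the lines of track_io_timing, e.g. a boolean GUC entry like
this (name and wording made up, just to illustrate the shape):

    {
        {"track_wait_event_timing", PGC_SUSET, STATS_CUMULATIVE,
            gettext_noop("Collects timing statistics for wait events."),
            NULL
        },
        &track_wait_event_timing,
        false,
        NULL, NULL, NULL
    },

defaulting to off, so that the timing overhead is only paid by those who opt in.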

> and I don't really
> see why you'd want to: the right approach is sampling, which is cheap
> and in my experience highly effective.

I think the sampling is very useful too (that's why I created the pgsentinel
extension [1]). But I think the sampling and the current proposal do not
answer the same questions. Most (if not all) of the questions that I mentioned
above could not be answered accurately with the sampling approach.

> Measuring counts doesn't seem
> very useful either: knowing the number of times that somebody tried to
> acquire a relation lock or a tuple lock arguably tells you something
> about your workload that you might want to know, whereas I would argue
> that knowing the number of times that somebody tried to acquire a
> buffer lock doesn't really tell you anything at all.

If you add the duration to the mix, that's more useful. And if you add the buffer
relfile information to the mix, that's even more insightful.

One could spot hot buffers with this data in hand.

> What you might
> want to know is how many buffers you accessed, which is why we've
> already got a system for tracking that. That's not to say that there's
> nothing at all that somebody could want to know about LWLocks that you
> can't already find out today: for example, a system that identifies
> which buffers are experiencing significant buffer lock contention, by
> relation OID and block number, sounds quite handy.

Could not agree more.

> But just counting
> wait events, or counting and timing them, will not give you that.

Right, I had in mind a step-by-step approach: counters, timings, and then providing
more details for wait events, something like:

   pid   | wait_event_type |  wait_event  |                            infos
---------+-----------------+--------------+-------------------------------------------------------------
 2560105 | IO              | DataFileRead | {"blocknum" : "9272", "dbnode" : "5", "relnode" : "16407"}
 2560135 | IO              | WalSync      | {"segno" : "1", "tli" : "1"}
 2560138 | IO              | DataFileRead | {"blocknum" : "78408", "dbnode" : "5", "relnode" : "16399"}
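
(Purely illustrative: for a DataFileRead-style wait the extra payload could be
something as small as the struct below, all field names being made up.)

typedef struct WaitEventExtraInfo
{
    uint32      wait_event_info;    /* classId | eventId, as today */
    Oid         dbnode;             /* database the file belongs to */
    Oid         relnode;            /* relation file node */
    BlockNumber blocknum;           /* block being read */
} WaitEventExtraInfo;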

But this approach also has its own challenges, as you and Andres pointed out on
the Discord server (in the PG #core-hacking channel).

> Knowing which SLRUs are being heavily used could also be useful, but I
> think there's a good argument that having some instrumentation that
> cuts across SLRU types and exposes a bunch of useful numbers for each
> could be more useful than just hoping you can figure it out from
> LWLock wait event counts.

Right, I'm all in for even more details.

> In short, I feel like just counting, or counting+timing, all the wait
> events is painting with too broad a brush. Wait events get inserted
> for a specific purpose: so you can know why a certain backend is
> off-CPU without having to use debugging tools. They serve that purpose
> pretty well, but that doesn't mean that they serve other purposes
> well, and I find it kind of hard to see the argument that just
> sticking a bunch of counters or timers in the same places where we put
> the wait event calls would be the right thing in general.
> Introspection is an important topic and, IMHO, deserves much more
> specific and nuanced thought about what we're trying to accomplish and
> how we're going about it.

I see what you mean and also think that providing more instrumentation is an
important topic and that it deserves improvement.

The current proposal was maybe too "generic", as you pointed out, and maybe wait
events are not the best place to add this instrumentation/these details.

If we could agree on extra details/instrumentation per area (like the few examples
you provided) then I would be happy to work on it: as said, I think the instrumentation
is an area that deserves improvement.

Your approach would probably take longer but OTOH would surely be even more meaningful.

[1]: https://github.com/pgsentinel/pgsentinel

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Adding wait events statistics

From:
Robert Haas
Date:
On Thu, Jul 24, 2025 at 7:52 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Well, the idea was more: as we speak about "wait" events, it would make
> sense to add their duration. And then, to have meaningful data to interpret the
> durations, it would make sense to add the counters. So that one could answer
> questions like:
>
> * Is the engine’s wait pattern the same over time?
> * Is application’s "A" wait pattern the same over time?
> * I observe a peak in wait event "W": is it because "W" is now waiting longer or
> is it because it is hit more frequently?

Forgive me if I'm beating a dead horse here, but I just don't believe
the first two of these at all. If you want to know if the wait pattern
is the same over time, take samples over various intervals and compare
the samples. It's true that you might notice some differences by
comparing counts that you wouldn't notice by comparing sample
profiles, but that's true of ANY additional data that you choose to
collect. It's not a good argument for collecting
number-of-times-we-waited-on-each-wait-event specifically. Because (in
my experience) counts of the number of times we hit each wait event
point are nearly meaningless, if you detect a change by comparing
those across time periods, all you will really know is that something
changed. You won't really get any understanding of what it was. I
think with a more careful choice of what data to gather, we can do
better.

The third one -- are the waits longer or are we waiting more often --
is a much better argument. I acknowledge that's something that a user
might want to know. But I think it's worth drilling down into that a
little bit more -- when and why would the user want to know that? I
might be wrong here, but I feel like the most likely case where you
would care is something having to do with I/O, which gets back to
Andres's point about the AIO instrumentation. If it's strictly
processes fighting over in-memory LWLocks, I don't expect the lock
hold times to vary widely, and I'm not sure what the user is supposed
to think about it or do about it if they do. I actually wrote some
instrumentation kind of like this many years ago and it was useful for
understanding some of the internal locking mechanics of PostgreSQL, so
that I could think about whether things could be improved in the code,
but I never had the sense that what I wrote on that occasion would
have been useful to end-users. I don't know if you have any more
detailed thoughts about this to share.

> > I'm almost sure that measuring LWLock wait
> > times is going to be too expensive to be practical,
>
> On my lab it added 60 cycles, I'm not sure that is too expensive. But even
> if we think this is, maybe we could provide an option to turn this "overhead" off/on
> with a GUC or compilation flag.

I don't know exactly what you tested, but there's a huge difference
between running a query normally and running it under EXPLAIN ANALYZE,
and a lot of that is timing overhead. There are probably cases where
queries never actually have to wait very much - e.g. all data in
shared_buffers, no contention, maybe a bunch of CPU cycles going into
computation rather than data access - but I think there will also be
cases where there are a LOT of wait events. For instance, if data fits
well into the OS cache but poorly into shared_buffers and you
construct a workload that has to access a lot of buffers in quick
succession, like say a nested loop over a parameterized inner index
scan, then I would think we would just be doing a ton of extra timing
operations compared to right now, and since the waits would be short
(since the data is coming from the OS rather than the disk) I would
expect it to hurt quite a bit. If not, then why does EXPLAIN need a
TIMING OFF option?

> > Measuring counts doesn't seem
> > very useful either: knowing the number of times that somebody tried to
> > acquire a relation lock or a tuple lock arguably tells you something
> > about your workload that you might want to know, whereas I would argue
> > that knowing the number of times that somebody tried to acquire a
> > buffer lock doesn't really tell you anything at all.
>
> If you add the duration to the mix, that's more useful. And if you add the buffer
> relfile information to the mix, that's even more insightful.

But also costly. If we create a system that has significantly more
overhead than wait events, we probably won't be able to have it
enabled all the time, and busy systems may not even be able to afford
turning it on temporarily. The cost of a system of this type has a huge
huge impact on how usable it actually is.

> One could spot hot buffers with this data in hand.

I actually don't believe we need duration to spot hot buffers. Some
kind of sampling approach seems like it should work fine. If you
expose some kind of information about buffer access in shared memory
such that it can be sampled (deliberate hand-waving here) and somebody
takes 100 samples, with each sample covering all backends in the
system, over a period of 50 or 100 or 200 or 500 seconds, the hot
buffers are going to pop right up to the top of the list. You won't
necessarily be able to see buffers that are just a tiny bit hot, but
with a decent number of samples you should be able to clearly see the
stuff that's really hot. People often say things to me like "well what
if I just get really unlucky and always miss what is truly the hottest
buffer," but the law of large numbers says you don't really need to
worry about that as long as you collect a sufficiently large number of
samples, and that's not really difficult to do.
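
(To put rough numbers on it: a buffer that is contended during, say, 5% of the
observation points will be missed by all of 100 independent samples with
probability about 0.95^100, i.e. well under 1%, so anything genuinely hot
surfaces very quickly.)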

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi,

On Thu, Jul 24, 2025 at 10:38:49AM -0400, Robert Haas wrote:
> On Thu, Jul 24, 2025 at 7:52 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > Well, the idea was more: as we speak about "wait" events, it would make
> > sense to add their duration. And then, to have meaningful data to interpret the
> > durations, it would make sense to add the counters. So that one could answer
> > questions like:
> >
> > * Is the engine’s wait pattern the same over time?
> > * Is application’s "A" wait pattern the same over time?
> > * I observe a peak in wait event "W": is it because "W" is now waiting longer or
> > is it because it is hit more frequently?
> 
> The third one -- are the waits longer or are we waiting more often --
> is a much better argument. I acknowledge that's something that a user
> might want to know. But I think it's worth drilling down into that a
> little bit more -- when and why would the user want to know that? I
> might be wrong here, but I feel like the most likely case where you
> would care is something having to do with I/O, which gets back to
> Andres's point about the AIO instrumentation.

Yeah, I/O is definitely an area where exposing the duration makes sense.

> If it's strictly
> processes fighting over in-memory LWLocks, I don't expect the lock
> hold times to vary widely, and I'm not sure what the user is supposed
> to think about it or do about it if they do.

Yeah, I also think not all of them are worth getting the duration for.

> I actually wrote some
> instrumentation kind of like this many years ago and it was useful for
> understanding some of the internal locking mechanics of PostgreSQL, so
> that I could think about whether things could be improved in the code,
> but I never had the sense that what I wrote on that occasion would
> have been useful to end-users. I don't know if you have any more
> detailed thoughts about this to share.

Focusing on LWLocks, I think that OidGenLock, for example, is one for which it
would be interesting to have the duration. That way one could start to
investigate whether there is a very long run of used OID values with no gaps in
TOAST tables or whether there is higher concurrency. I do agree that might not
be useful for all the LWLocks (those for which the users have no actionable
workaround/solution) but for some I think it could be useful (like the one
mentioned above).

Outside of LWLocks, I think it makes sense for heavyweight locks to answer
questions like: Is the locker holding a lock for longer? 

> > On my lab it added 60 cycles, I'm not sure that is too expensive. But even
> > if we think this is, maybe we could provide an option to turn this "overhead" off/on
> > with a GUC or compilation flag.
> 
> I don't know exactly what you tested,

I counted the cycles with:

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

calling this function in 2 ways:

W1: before and after incrementing the counters and collecting their durations
W2: in wait_start() and in wait_end()

Each of the above was done without the other one in place, and the deltas
were computed while running check-world.

That way I generated percentiles for the values measured in W1 (shared in [1])
and also per wait event (also shared in [1]).

P50 for the counter increments and duration collection is 146 cycles; if we
remove the estimated rdtsc() call overhead (say 80 cycles) that gives 66 cycles
of overhead.


[1]: https://www.postgresql.org/message-id/aH%2BDDoiLujt%2Bb9cV%40ip-10-97-1-34.eu-west-3.compute.internal

> but there's a huge difference
> between running a query normally and running it under EXPLAIN ANALYZE,
> and a lot of that is timing overhead. There are probably cases where
> queries never actually have to wait very much - e.g. all data in
> shared_buffers, no contention, maybe a bunch of CPU cycles going into
> computation rather than data access - but I think there will also be
> cases where there are a LOT of wait events.

I see, you are saying that a LOT of wait events could add a LOT of little
cycles overhead which could lead to much more overhead than those 66 cycles.

That's right, that's why the idea was to add the counters and timings only on
wait classes for which the counters and durations overhead was relatively small
in regard to the waits themselves.

From [1]:

"
Overhead on the lock class is about 0.03%
Overhead on the timeout class is less than 0.01%

and now we can also see that:

Overhead on the lwlock class is about 1%
Overhead on the client class is about 0.5%
Overhead on the bufferpin class is about 0.2%
"

> For instance, if data fits
> well into the OS cache but poorly into shared_buffers and you
> construct a workload that has to access a lot of buffers in quick
> succession, like say a nested loop over a parameterized inner index
> scan, then I would think we would just be doing a ton of extra timing
> operations compared to right now, and since the waits would be short
> (since the data is coming from the OS rather than the disk) I would
> expect it to hurt quite a bit. If not, then why does EXPLAIN need a
> TIMING OFF option?

That's right, that's why the reasoning shared in [1] was based on filtering
the wait classes by focusing on very small overhead (relative, not absolute).

> > If you add the duration to the mix, that's more useful. And if you add the buffer
> > relfile information to the mix, that's even more insightful.
> 
> But also costly.

Right, everything that adds cycles adds cost. I think that comparing the
relative costs as done in [1] makes sense though; that helps to categorize
the costs.

> > One could spot hot buffers with this data in hand.
> 
> I actually don't believe we need duration to spot hot buffers. Some
> kind of sampling approach seems like it should work fine. If you
> expose some kind of information about buffer access in shared memory
> such that it can be sampled (deliberate hand-waving here) and somebody
> takes 100 samples, with each sample covering all backends in the
> system, over a period of 50 or 100 or 200 or 500 seconds, the hot
> buffers are going to pop right up to the top of the list.

That's right, the ones you would spot would be hot buffers, but I think it's
also possible that you'd miss some (the smaller the sampling interval is, the
better, though).

But still without the durations you can't say if the ones holding the buffers are
holding longer than before or if there is an increase of concurrency.

Between 2 samples that produced the same wait on a buffer: has the wait event
changed or was it the same? 

If that's the latter we could increment the duration by the sampling interval,
but we just don't know.

> People often say things to me like "well what
> if I just get really unlucky and always miss what is truly the hottest
> buffer," but the law of large numbers says you don't really need to
> worry about that as long as you collect a sufficiently large number of
> samples, and that's not really difficult to do.

Yeah, probably, but still, without the durations it's almost impossible to know
if there is an increase in concurrency or if the lock is held for longer. Indeed,
if the lock is held for longer one might see an increase in the number of waiters
and "wrongly" interpret that as an increase in concurrency.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Adding wait events statistics

From:
Robert Haas
Date:
On Mon, Jul 28, 2025 at 8:35 AM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
> Focusing on LWLocks, I think that OidGenLock, for example, is one for which it
> would be interesting to have the duration. That way one could start to
> investigate whether there is a very long run of used OID values with no gaps in
> TOAST tables or whether there is higher concurrency. I do agree that might not
> be useful for all the LWLocks (those for which the users have no actionable
> workaround/solution) but for some I think it could be useful (like the one
> mentioned above).

Yeah, maybe. That's an interesting example. One could make the
argument that we ought to have a system that specifically tests for
OID oversaturation, but admittedly that feels pretty special-purpose.
But on the other hand, I still don't think timing every LWLock wait is
going to work out well from a performance standpoint.

> Outside of LWLocks, I think it makes sense for heavyweight locks to answer
> questions like: Is the locker holding a lock for longer?

I think the existing log_lock_waits does a pretty good job with this case.

> I see, you are saying that a LOT of wait events could add a LOT of little
> cycles overhead which could lead to much more overhead than those 66 cycles.

Yes.

> That's right, that's why the idea was to add the counters and timings only on
> wait classes for which the counters and durations overhead was relatively small
> in regard to the waits themselves.
>
> From [1]:
>
> "
> Overhead on the lock class is about 0.03%
> Overhead on the timeout class is less than 0.01%
>
> and now we can also see that:
>
> Overhead on the lwlock class is about 1%
> Overhead on the client class is about 0.5%
> Overhead on the bufferpin class is about 0.2%
> "

I have a hard time reconciling this with the overhead of EXPLAIN
ANALYZE, which we know to be much larger than this. To be fair, it's
quite possible that we switch between executor nodes a lot more often
than we go off-CPU, so maybe the wait event instrumentation is just a
lot cheaper for that reason. But I'm still suspicious that something
is wrong with these measurements: that you're testing on a platform
where timer events are particularly cheap, or a workload where they
are not as common as in some other workloads, or... I don't really know exactly.
Even if the overhead for measuring LWLock wait events is truly only
1%, I'm skeptical about whether that's worth it for the average user.
At that level, it might be OK as an optional feature that people can
enable if they need it.

But as I've said before, it's super-important that we don't get
ourselves into a situation where hackers don't want to add wait events to
places that should have them for fear of the overhead being too large.
We have mostly avoided that problem up until now, as far as I am
aware.

> That's right, the ones you would spot would be hot buffers, but I think it's
> also possible that you'd miss some (the smaller the sampling interval is, the
> better, though).

But like ... who cares? Nobody needs an absolutely exhaustive list of
buffers above a certain level of hotness.

> But still without the durations you can't say if the ones holding the buffers are
> holding longer than before or if there is an increase of concurrency.

I bet you can make a pretty good guess by looking at how many active
backends have any sort of heavyweight lock on the relation in
question. I have a lot of difficulty imagining this as a real point of
confusion in a customer troubleshooting situation.

I mean, theoretically, you could imagine a situation where the
concurrency on a relation is high but you can't tell whether they're
all focusing on the same buffers for shorter times or different
buffers for longer times. But I don't really see how this could occur
in practice. If it's an index, the contention almost has to be on the
root/upper-level pages; and if it's a heap you're either doing
sequential scans and the access pattern is uniform or you're doing
some kind of index scan and the contention is probably focused on the
upper-level index pages rather than the heap pages. You could maybe
imagine some crazy corner case where there's heap contention created
by TID scans, but that's too obscure a situation to justify adding
machinery of this kind.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Adding wait events statistics

From:
Bertrand Drouvot
Date:
Hi,

On Wed, Jul 23, 2025 at 11:38:07AM -0400, Robert Haas wrote:
> I think it's probably a mistake to even be thinking about this in
> terms of wait events. It seems very reasonable to want more data about
> what PostgreSQL backends are doing, but I don't really see any reason
> why that should happen to line up with whatever wait events do. For
> instance, if you asked "what information is useful to gather about
> heavyweight locks?" you might say "well, we'd like to know how many
> times we tried to acquire one, and how many of those times we had to
> wait, and how many of those times we waited for longer than
> deadlock_timeout". 

I did find this idea interesting, and gave it a stab in [1].

[1]: https://www.postgresql.org/message-id/aIyNxBWFCybgBZBS%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com