Discussion: Adding wait events statistics
Hi hackers,

Wait events are useful to know what backends are waiting for when there is/was a performance issue: for this we can sample pg_stat_activity at regular intervals and record historical data. That's how it is commonly used.

It could also be useful to observe the engine/backends behavior over time and help answer questions like:

* Is the engine's wait pattern the same over time?
* Is application "A"'s wait pattern the same over time?
* I observe a peak in wait event "W": is it because "W" is now waiting longer or because it is hit more frequently?
* I observe a peak in some of them (say, for example, MultiXact%): is it due to a workload change?

For the above use cases, we need a way to track the wait events that occur between samples: please find attached a proof of concept patch series doing so. The patch series is made of:

0001 - It generates the WaitClassTable[], a lookup table that will be used by the wait events statistics.

The array is indexed by classId (derived from the PG_WAIT_* constants), handles gaps in the class ID numbering, and provides metadata for wait events. This new array is generated in generate-wait_event_types.pl, so that:

* it now needs lwlocklist.h and wait_classes.h as input parameters
* WAIT_EVENT_CLASS_MASK and WAIT_EVENT_ID_MASK have been moved out of wait_event.c

In passing it adds several new macros that will be used by 0002.

0002 - It adds wait events statistics

It adds a new stats kind, PGSTAT_KIND_WAIT_EVENT, for the wait event statistics. This new statistics kind is a fixed one because we know the maximum number of wait events. Indeed:

* it does not take into account custom wait events, as extensions have all they need at their disposal to create custom stats on their own wait events should they want to (limited by WAIT_EVENT_CUSTOM_HASH_MAX_SIZE though).

* it does not take into account LWLock tranches > LWTRANCHE_FIRST_USER_DEFINED, for the same reasons as above. That said, there is no maximum limitation in LWLockNewTrancheId().

* we don't want to allocate memory in some places where the counters are incremented (see 4feba03d8b9). We could still use the same implementation as for backend statistics (i.e., make use of flush_static_cb) if we really want/need to switch to variable stats.

For the moment only the counters are added (an array of currently 285 counters); we'll study/discuss adding the timings once the counters are fully done. I think we'd have more discussion/debate around the timings (should we add them by default, add a new GUC, enable them at compilation time?...), which is why only the counters are in this patch series. I think it makes sense, as the counters have merit on their own.

We currently have 273 wait events but the array is 285 long: the reason is that some wait event classes have "holes".

A few questions:

* Do we need to serialize the stats based on their names (as for PGSTAT_KIND_REPLSLOT)? This question is the same as "is the ordering preserved if the stats file format is not changed": I think the answer is yes (see f98dbdeb51d), which means there is no need for to_serialized_name/from_serialized_name.

* What if a new wait event is added? We'd need to change the stats file format, unless: the wait event stats kind becomes a variable one, or we change a bit the way fixed stats are written/read to/from the stats file (we could add a new field in the PgStat_KindInfo for example).

Note: for some backends the wait events stats are not flushed (walwriter for example), so we need to find additional places to flush the wait events stats.

0003 - It adds the pg_stat_wait_event view

It also adds documentation and regression tests.

0004 - It switches PGSTAT_KIND_WAIT_EVENT to variable sized

It might be better for PGSTAT_KIND_WAIT_EVENT to be a variable sized stats kind. That way:

* It would be easier to add custom wait events if we want to
* It would be possible to add new wait events without having to change the stats file format

So I'm adding 0004 to see what it would look like to have a variable sized stats kind instead and decide how we want to move forward. It uses the uint32 wait_event_info as the hash key while the hash key is defined as uint64: that should not be an issue, but this patch does explicit casting. That said, it might be better to use all 64 bits for the hash key (i.e., not have the upper half full of zeroes; better hashing distribution?): we could imagine using something like:

((uint64) wait_event_info) | (((uint64) wait_event_info) << 32)

for the hash key.

If we decide to go that way (i.e., with the variable sized kind) then a new patch series will be provided that will not implement the fixed one but will start directly with the variable one. The more I think about it, the more I think we should go for the variable sized proposal: that's more flexible.

Remarks:

* If we want to add some "ereport" in waitEventIncrementCounter() then we'd need to take care of the race condition in ConditionVariableTimedSleep() that that would produce, see [1].
* Once we agree on fixed vs variable sized stats kind, I'll start measuring if there is any performance regression and check if there is a need for optimization (partition the counters array?...).
* pgstat_report_wait_end() is "inlined", but with the additional code added here, compilers may ignore the inline keyword. Need to check if that's the case and the impact (see above).
* After it goes in: add timings, add into per-backend stats too.

[1]: https://www.postgresql.org/message-id/aBhuTqNhMN3prcqe%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
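To make the classId indexing and the gap handling concrete, here is a minimal sketch of a WaitClassTable[]-style lookup. The mask values mirror the ones moved out of wait_event.c by 0001; the class names and event counts below are illustrative placeholders, not the generated table:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mask values as used in wait_event.c */
#define WAIT_EVENT_CLASS_MASK 0xFF000000U
#define WAIT_EVENT_ID_MASK    0x0000FFFFU

/* Minimal stand-in for a WaitClassTable[] entry: per-class metadata,
 * with NULL/zero rows for gaps in the class ID numbering.
 * Names and sizes here are placeholders, not the generated data. */
typedef struct WaitClassEntry
{
	const char *className;
	int         numberOfEvents;
} WaitClassEntry;

static const WaitClassEntry WaitClassTable[] = {
	{NULL, 0},			/* gap - no class with ID 0 */
	{"LWLock", 95},		/* illustrative size only */
	{NULL, 0},			/* gap - no class with ID 2 */
	{"Lock", 12},		/* illustrative size only */
};

/* Decode the class part of a wait_event_info, usable as an array index. */
static inline uint32_t
wait_class(uint32_t wait_event_info)
{
	return (wait_event_info & WAIT_EVENT_CLASS_MASK) >> 24;
}

/* Decode the event part of a wait_event_info. */
static inline uint32_t
wait_event_id(uint32_t wait_event_info)
{
	return wait_event_info & WAIT_EVENT_ID_MASK;
}
```

The 0x09000006 value that shows up later in the assembly dumps decodes to class 9, event 6 with these helpers.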
Attachments
Hi,

On Mon, Jul 07, 2025 at 02:56:29PM +0900, Michael Paquier wrote:
> On Mon, Jun 30, 2025 at 01:36:12PM +0000, Bertrand Drouvot wrote:
> > * Do we need to serialize the stats based on their names (as for
> > PGSTAT_KIND_REPLSLOT)? This question is the same as "is the ordering preserved
> > if the stats file format is not changed": I think the answer is yes (see
> > f98dbdeb51d), which means there is no need for to_serialized_name/from_serialized_name.
>
> The serialization is important for the custom wait events, much less
> for wait events related to injection points (where we don't really
> care) which could be filtered out when reporting the stats. It seems
> to me that associating the names (perhaps as $CLASS/$NAME on disk as a
> single string, two strings with two lengths are perhaps better) is
> going to be needed. When the stats are read, the wait event ID could
> be allocated from the startup process in the from_serialized_name
> callback associated to the new stats kind, before extensions would try
> to allocate it.

Thanks for looking at it!

Yeah, I think that this is needed for the custom wait events. But do we need to handle the custom wait events case at all? As, I think, the extensions have all they need at their disposal to count/increment their own counters when their custom wait events are triggered.

I think the question is: if the extension owner does not increment it, do we want to do it in core on their behalf? I'm not sure, as it's still up to them to make use of custom wait events: some of them will, some not, making the "in core" counters inconsistent depending on the extension. That could result in end users asking why they see counters for some and not for others.

I also have the feeling that adding this extra complexity for (I think) a really small number of cases may impact the "in core" wait events stats negatively (from a complexity and maybe performance point of view).
I was thinking of just starting with "in core" wait events and seeing if there is a need/ask to expand it to custom wait events. Thoughts?

> > * What if a new wait event is added? We'd need to change the stats file format,
> > unless: the wait event stats kind becomes a variable one or we change a bit the
> > way fixed stats are written/read to/from the stats file (we could add a new field
> > in the PgStat_KindInfo for example).
>
> I am pretty sure that everybody is going to forget about this
> dependency, myself included. Please note that we add wait events in
> stable branches, sometimes as a portion of a bug fix, and we cannot
> bump the version of the stats file or stats would be lost on restart.
> So these stats should be variable,

Agreed, it's better to go the "variable" stats kind way.

> For the stats waiting to be flushed, you could
> allocate a single array of WAIT_EVENT_CUSTOM_HASH_MAX_SIZE elements in
> the TopMemoryContext of each backend, as we know that the number of
> custom wait events is never going to be higher than that, indexed by
> ID assigned.

That's an idea, but I'm not 100% sure we need to care about custom wait events for now (see above).

> It may be simpler to allocate a series of arrays, one for each class,
> as well, sized depending on the wait event enum sizes. These could be
> indexed based on the enums generated in wait_event_types.h, minus
> their initial value (for the array of events in WaitEventTimeout,
> minus PG_WAIT_TIMEOUT needs to be applied). Deriving the contents
> that you need for the sizing and indexing of the pending data based on
> WaitClassTableEntry seems indeed necessary at some degree.

I'm not sure that would be simpler (though maybe it would simplify the perl script a bit), and I think it might complicate the C code a bit. I mean, given a wait_event_info, we'd need to know in which array to increment the related wait event counter.
So, (at least) some switch on the wait class would be needed (and that would probably need to be generated by the perl script, so I'm not sure that would simplify the perl script after all). Also, in pg_stat_get_wait_event(), we'd need to iterate over all the arrays, so we'd need to know and identify them.

> your proposal feels overcomplicated, and that the names may not be
> necessary as we already have calls able to do class name and event
> name lookups based on the uint32 class/name values.

I started without the names in the array (making use of pgstat_get_wait_event()) but that did not work well for LWLOCK. Indeed, for LWLOCK we have "holes" in the eventID numbering (see lwlocklist.h). That's why there is "NULL" in some names for PG_WAIT_LWLOCK in the WaitClassTable[]. So, instead of putting random stuff in the non-NULL slots, I thought it would be ok to put the names there (and avoid calling pgstat_get_wait_event()).

> This is not new. This limitation exists for some auxiliary processes
> or processes within their main loops. As long as we have APIs that
> can be called to do the flushes, IMO we are fine, like we do for WAL,
> IO, etc.

I do agree. My point was that I did not add those in the patch series yet.

> > It might be better for PGSTAT_KIND_WAIT_EVENT to be a variable sized stats kind.
> > That way:
> >
> > * It would be easier to add custom wait events if we want to
> > * It would be possible to add new wait events without having to change the stats
> > file format
> >
> > So adding 0004 to see what it would look like to have a variable sized stats kind
> > instead and decide how we want to move forward.
>
> IMO, I think that we should just do that. ABI compatibility and
> custom wait events make fixed-sized stats unfit with what you are
> trying to achieve here.

Yeah, the next version will do that. Just waiting on an agreement about:

- custom wait events
- splitting the array into multiple arrays

before sharing v2.
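The per-class dispatch worried about above would look roughly like this (hypothetical names; the class ID values follow the PG_WAIT_* numbering discussed in the thread, and in the real tree such a switch would be generated by generate-wait_event_types.pl):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-class pending arrays, one per wait class;
 * sizes are placeholders. */
static uint64_t lwlock_counts[96];
static uint64_t lock_counts[12];
static uint64_t timeout_counts[12];

/* Given a classId, pick the array to increment: this is the per-class
 * dispatch that a multiple-arrays layout needs somewhere. */
static uint64_t *
pending_array_for_class(uint32_t classId)
{
	switch (classId)
	{
		case 1:					/* LWLock class */
			return lwlock_counts;
		case 3:					/* Lock class */
			return lock_counts;
		case 9:					/* Timeout class */
			return timeout_counts;
		default:
			return NULL;		/* gap or unsupported class */
	}
}
```

Every new class means another case here, which is part of why generating it (and keeping a flat alternative in mind) matters.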
> > That said it might be better to use all the 64 bits (i.e., not have the half full
> > of zeroes) for the hash key (better hashing distribution?): we could imagine
> > using something like:
> >
> > ((uint64) wait_event_info) | (((uint64) wait_event_info) << 32)
> >
> > for the hash key.
>
> Using uint32 for the hash key is fine. At least it's not an issue for
> the pgstats machinery.

Yeah, my doubt was more about the hashing distribution if all the keys have the exact same bits (the upper 32 bits) set to zero (which is what using a uint32 key would produce).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
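For reference, the key-widening trick discussed above as a compilable helper (the function name is invented):

```c
#include <assert.h>
#include <stdint.h>

/* Duplicate the 32-bit wait_event_info into both halves of the 64-bit
 * hash key, so the upper 32 bits are no longer always zero. */
static inline uint64_t
wait_event_hash_key(uint32_t wait_event_info)
{
	return ((uint64_t) wait_event_info) | (((uint64_t) wait_event_info) << 32);
}
```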
Hi,

On Tue, Jul 08, 2025 at 09:36:53AM +0900, Michael Paquier wrote:
> On Mon, Jul 07, 2025 at 03:07:12PM +0000, Bertrand Drouvot wrote:
> > I think the question is: if the extension owner does not increment it, do we want
> > to do it in core on their behalf? I'm not sure as it's still up to them to make
> > use of custom wait events: so some of them will, some not, making the "in core"
> > counters inconsistent depending on the extension. That could result in end
> > users asking why they see counters for some and not for others.
>
> My take would be yes. That feels more natural than ignoring things.

Yeah, ignoring the custom wait events would not make sense, but "waiting" to see if there is a need/ask could have made sense.

> > I was thinking of just starting with "in core" wait events and seeing if there is a
> > need/ask to expand it to custom wait events. Thoughts?
>
> This statement is perhaps true. Now we do use custom wait events even
> in contrib/ modules, so the move feels natural to me.

Okay. I had in mind to start with something simpler that would have covered the vast majority (I think) of end users' needs, but I will look at it for v2 then.

> > I'm not sure that would be simpler (though maybe it would simplify the perl
> > script a bit), and I think it might complicate the C code a bit. I mean, given
> > a wait_event_info, we'd need to know in which array to increment the related wait
> > event counter. So, (at least) some switch on the wait class would be needed (and
> > that would probably need to be generated by the perl script, so I'm not sure
> > that would simplify the perl script after all).
>
> Noted.
>
> > Also, in pg_stat_get_wait_event(), we'd need to iterate over all the arrays, so
> > we'd need to know and identify them.
>
> Yes, you may need some level of meta-data generated for the wait
> classes generated when the perl script generating this data is run.
> It would be nice to keep the footprint of the code generated minimal if we
> can.
> It's one of these things where I would try both approaches.

Let's see what a multi-arrays approach would look like and decide.

> > Yeah, my doubt was more about the hashing distribution if all the keys have
> > the exact same bits (the upper 32 bits) set to zero (which is what using a
> > uint32 key would produce).
>
> Hmm. Not sure about that, but that would not be difficult to check.
> It is true that it would not be a good thing if all the stats get
> pushed through the same partition in the pgstat dshash table, but I'm
> pretty sure that we don't need to worry much about that, like for
> neighboring OIDs.

Probably, yes. I will still try to figure it out, out of curiosity (I think that would be good to know generally speaking).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
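The "not difficult to check" experiment can indeed be tiny: push both key shapes through a 64-bit mixer and count how many partitions get hit. The sketch below uses the splitmix64 finalizer purely as a stand-in for dshash's actual hash function, and 16 partitions as an arbitrary placeholder:

```c
#include <assert.h>
#include <stdint.h>

/* splitmix64 finalizer: a well-known 64-bit mixer, used here only as a
 * stand-in for the real dshash hash function. */
static uint64_t
mix64(uint64_t x)
{
	x += 0x9E3779B97F4A7C15ULL;
	x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
	x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
	return x ^ (x >> 31);
}

/* Count how many of 16 partitions receive at least one of n consecutive
 * fake wait_event_info keys, with and without the upper-half widening. */
static int
partitions_hit(int widen, int n)
{
	int			hit[16] = {0};
	int			distinct = 0;

	for (int i = 0; i < n; i++)
	{
		uint32_t	key32 = 0x01000000U + (uint32_t) i;
		uint64_t	key = widen
			? (((uint64_t) key32) | (((uint64_t) key32) << 32))
			: (uint64_t) key32;
		int			p = (int) (mix64(key) & 15);

		if (!hit[p])
		{
			hit[p] = 1;
			distinct++;
		}
	}
	return distinct;
}
```

With a decent mixer both shapes should spread over many partitions; the worry would only materialize if the real hash were weak on zero upper halves.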
Hi,

On Tue, Jul 08, 2025 at 04:46:28AM +0000, Bertrand Drouvot wrote:
> On Tue, Jul 08, 2025 at 09:36:53AM +0900, Michael Paquier wrote:
> > Yes, you may need some level of meta-data generated for the wait
> > classes generated when the perl script generating this data is run.
> > It would be nice to keep the footprint of the code generated minimal if we
> > can. It's one of these things where I would try both approaches.

I did some testing and experiments and I think a "hybrid" approach mixing multiple arrays and one "global" one is the best approach.

So, something like:

"
ClassInfo LWLockClassInfo = {
    .numberOfEvents = NB_LWLOCK_WAIT_EVENTS,
    .pendingCounts = PendingWaitEventStats.lwlock_counts
};

WaitClassInfo LockClassInfo = {
    .numberOfEvents = NB_LOCK_WAIT_EVENTS,
    .pendingCounts = PendingWaitEventStats.lock_counts
};
.
.
.
"

and then the "global" one:

"
WaitClassInfo *AllWaitClasses[] = {
    NULL,  /* gap - no class with ID 0 */
    &LWLockClassInfo,
    NULL,  /* gap - no class with ID 2 */
    &LockClassInfo,
    &BufferPinClassInfo,
    &ActivityClassInfo,
    &ClientClassInfo,
    &ExtensionClassInfo,
    &IPCClassInfo,
    &TimeoutClassInfo,
    &IOClassInfo,
};
"

That way:

- we do not need a mapping between ClassID and "Arrays", and the gaps in the ClassID numbering are handled. So we can directly access the class of interest knowing its classID.
- we can easily iterate (again, no mapping needed) over all the pending stats thanks to AllWaitClasses[]

What do you think? It looks like the most elegant approach to me.

Bonus point: as you can see above, while working on this approach I've been able to remove storing the className and the array of eventName per wait class (that would have worked in v1 too).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
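A self-contained sketch of that hybrid layout, with placeholder event counts (the real structs and sizes would be generated):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct WaitClassInfo
{
	int			numberOfEvents;
	uint64_t   *pendingCounts;
} WaitClassInfo;

/* Placeholder sizes; the generated code would use NB_*_WAIT_EVENTS. */
static uint64_t lwlock_counts[96];
static uint64_t lock_counts[12];

static WaitClassInfo LWLockClassInfo = {96, lwlock_counts};
static WaitClassInfo LockClassInfo = {12, lock_counts};

/* Gaps in the class numbering are NULL slots, so classId indexes directly. */
static WaitClassInfo *AllWaitClasses[] = {
	NULL,				/* gap - no class with ID 0 */
	&LWLockClassInfo,
	NULL,				/* gap - no class with ID 2 */
	&LockClassInfo,
};

/* The increment path: direct access by classId, then by eventId. */
static void
wait_event_increment(uint32_t classId, uint32_t eventId)
{
	AllWaitClasses[classId]->pendingCounts[eventId]++;
}
```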
On Tue, Jul 08, 2025 at 03:25:26PM +0000, Bertrand Drouvot wrote:
> So, something like:
>
> ClassInfo LWLockClassInfo = {
>     .numberOfEvents = NB_LWLOCK_WAIT_EVENTS,
>     .pendingCounts = PendingWaitEventStats.lwlock_counts
> };
>
> and then the "global" one:
>
> WaitClassInfo *AllWaitClasses[] = {
>     NULL,  /* gap - no class with ID 0 */
>     &LWLockClassInfo,
>     NULL,  /* gap - no class with ID 2 */
>     &LockClassInfo,
>     &BufferPinClassInfo,
>     &ActivityClassInfo,
>     &ClientClassInfo,
>     &ExtensionClassInfo,
>     &IPCClassInfo,
>     &TimeoutClassInfo,
>     &IOClassInfo,
> };

Okay. I'm guessing that all this information is automatically built based on the contents of wait_event_names.txt (with some tweaks for the LWLock part?), so why not, even with the gaps, as long as we know the length of AllWaitClasses. The stats code would be simple, doing an O(n^2) pass across the array when flushing the individual contents of each wait event.

> That way:
>
> - we do not need a mapping between ClassID and "Arrays", and the gaps
> in the ClassID numbering are handled. So we can directly access the class of interest
> knowing its classID.
> - we can easily iterate (again, no mapping needed) over all the pending stats
> thanks to AllWaitClasses[]
>
> What do you think? It looks like the most elegant approach to me.

I think that's pretty cool.

> Bonus point: as you can see above, while working on this approach I've been able
> to remove storing the className and the array of eventName per wait class (that would
> have worked in v1 too).

You mean the extra set of names generated? Yes, I was getting the impression while reading your patch that generating the code for the pending data did not require this complexity. Getting rid of this part would be a nice outcome.
--
Michael
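The flush pass described above amounts to iterating AllWaitClasses[], skipping the NULL gaps, and walking each class's events. A sketch with tiny placeholder arrays (here the per-event counters are just summed instead of being reported to shared stats):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct WaitClassInfo
{
	int			numberOfEvents;
	uint64_t   *pendingCounts;
} WaitClassInfo;

/* Tiny placeholder pending data, pre-filled for the demonstration. */
static uint64_t lwlock_counts[3] = {4, 0, 2};
static uint64_t lock_counts[2] = {0, 7};

static WaitClassInfo LWLockClassInfo = {3, lwlock_counts};
static WaitClassInfo LockClassInfo = {2, lock_counts};

static WaitClassInfo *AllWaitClasses[] = {
	NULL,				/* gap */
	&LWLockClassInfo,
	NULL,				/* gap */
	&LockClassInfo,
};

/* Flush pass: visit every class, skip the gaps, walk every event. */
static uint64_t
flush_all_pending(void)
{
	uint64_t	total = 0;
	int			nclasses = (int) (sizeof(AllWaitClasses) / sizeof(AllWaitClasses[0]));

	for (int c = 0; c < nclasses; c++)
	{
		if (AllWaitClasses[c] == NULL)
			continue;			/* gap in class IDs */
		for (int e = 0; e < AllWaitClasses[c]->numberOfEvents; e++)
			total += AllWaitClasses[c]->pendingCounts[e];
	}
	return total;
}
```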
Attachments
Hi,

On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> Wait events are useful to know what backends are waiting for when there is/was
> a performance issue: for this we can sample pg_stat_activity at regular intervals
> and record historical data. That's how it is commonly used.
>
> It could also be useful to observe the engine/backends behavior over time and
> help answer questions like:
>
> * Is the engine's wait pattern the same over time?
> * Is application "A"'s wait pattern the same over time?
> * I observe a peak in wait event "W": is it because "W" is now waiting longer or
> is it because it is hit more frequently?
> * I observe a peak in some of them (say for example MultiXact%), is it due to a
> workload change?
>
> For the above use cases, we need a way to track the wait events that occur between
> samples: please find attached a proof of concept patch series doing so.
>
> 0002 - It adds wait events statistics
>
> It adds a new stats kind PGSTAT_KIND_WAIT_EVENT for the wait event statistics.
>
> This new statistics kind is a fixed one because we know the maximum number of wait
> events. Indeed:
>
> * it does not take into account custom wait events as extensions have all they need
> at their disposal to create custom stats on their own wait events should they
> want to (limited by WAIT_EVENT_CUSTOM_HASH_MAX_SIZE though).
>
> * it does not take into account LWLock tranches > LWTRANCHE_FIRST_USER_DEFINED for the same
> reasons as above. That said, there is no maximum limitation in LWLockNewTrancheId().
>
> * we don't want to allocate memory in some places where the counters are
> incremented (see 4feba03d8b9). We could still use the same implementation as for
> backend statistics (i.e., make use of flush_static_cb) if we really want/need to
> switch to variable stats.

Even without allocating memory, this makes wait events considerably more expensive. Going from ~4 instructions to something more like ~40. Including external function calls.
Today:

start wait:
   0x0000000000a08e8f <+63>:	lea    0x735662(%rip),%r12   # 0x113e4f8 <my_wait_event_info>
   0x0000000000a08e96 <+70>:	mov    (%r12),%rax
   0x0000000000a08e9a <+74>:	movl   $0x9000006,(%rax)

< do wait operation >

end wait:
   0x0000000000a08eaa <+90>:	mov    (%r12),%rax

After:

start wait:
   0x0000000000a0b39f <+63>:	lea    0x73673a(%rip),%r12   # 0x1141ae0 <my_wait_event_info>
   0x0000000000a0b3a6 <+70>:	mov    (%r12),%rax
   0x0000000000a0b3aa <+74>:	movl   $0x9000006,(%rax)

< do wait operation >

end wait:
   0x0000000000a0b3ba <+90>:	mov    (%r12),%rax
   0x0000000000a0b3be <+94>:	mov    (%rax),%edi
   0x0000000000a0b3c0 <+96>:	rex call 0xa452e0 <waitEventIncrementCounter>
   0x0000000000a0b3c6 <+102>:	mov    (%r12),%rax

Where waitEventIncrementCounter ends up as:

   0x0000000000a452e0 <+0>:	lea    0x6fc7f9(%rip),%rax   # 0x1141ae0 <my_wait_event_info>
   0x0000000000a452e7 <+7>:	mov    (%rax),%rax
   0x0000000000a452ea <+10>:	mov    (%rax),%eax
   0x0000000000a452ec <+12>:	mov    %eax,%edx
   0x0000000000a452ee <+14>:	and    $0xff000000,%edx
   0x0000000000a452f4 <+20>:	jne    0xa45300 <waitEventIncrementCounter+32>
   0x0000000000a452f6 <+22>:	test   %ax,%ax
   0x0000000000a452f9 <+25>:	jne    0xa45300 <waitEventIncrementCounter+32>
   0x0000000000a452fb <+27>:	ret
   0x0000000000a452fc <+28>:	nopl   0x0(%rax)
   0x0000000000a45300 <+32>:	cmp    $0x1000000,%edx
   0x0000000000a45306 <+38>:	jne    0xa4530e <waitEventIncrementCounter+46>
   0x0000000000a45308 <+40>:	cmp    $0x5e,%ax
   0x0000000000a4530c <+44>:	ja     0xa452fb <waitEventIncrementCounter+27>
   0x0000000000a4530e <+46>:	cmp    $0x7000000,%edx
   0x0000000000a45314 <+52>:	sete   %cl
   0x0000000000a45317 <+55>:	test   %ax,%ax
   0x0000000000a4531a <+58>:	setne  %al
   0x0000000000a4531d <+61>:	test   %al,%cl
   0x0000000000a4531f <+63>:	jne    0xa452fb <waitEventIncrementCounter+27>
   0x0000000000a45321 <+65>:	cmp    $0xb000000,%edx
   0x0000000000a45327 <+71>:	je     0xa452fb <waitEventIncrementCounter+27>
   0x0000000000a45329 <+73>:	mov    %edi,%eax
   0x0000000000a4532b <+75>:	movzwl %di,%edi
   0x0000000000a4532e <+78>:	lea    0x714e4b(%rip),%rdx   # 0x115a180 <PendingWaitEventStats>
   0x0000000000a45335 <+85>:	shr    $0x18,%eax
   0x0000000000a45338 <+88>:	shl    $0x5,%rax
   0x0000000000a4533c <+92>:	add    0x4e1735(%rip),%rax   # 0xf26a78
   0x0000000000a45343 <+99>:	movslq 0x8(%rax),%rax
   0x0000000000a45347 <+103>:	add    %rdi,%rax
   0x0000000000a4534a <+106>:	incq   (%rdx,%rax,8)
   0x0000000000a4534e <+110>:	lea    0x714e0b(%rip),%rax   # 0x115a160 <have_wait_event_stats>
   0x0000000000a45355 <+117>:	movb   $0x1,(%rax)
   0x0000000000a45358 <+120>:	ret

That's about an order of magnitude increase in overhead, just based on the number of instructions, and probably considerably more in runtime (you're adding external function calls etc).

I think to make this viable, we would have to remove some wait events and be more careful about adding them. And optimize this code a good bit. This really shouldn't end up with this much code.

Greetings,

Andres Freund
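As one possible direction for the "optimize this code" point above (a sketch, not the patch's actual code): if the per-call filtering were decided elsewhere, the hot path could collapse to a single flat-array update through a precomputed per-class offset table, which is roughly what the movslq/add/incq tail of the dump above already does. All names and sizes below are invented:

```c
#include <assert.h>
#include <stdint.h>

/* Precomputed start offset of each class inside one flat counter array;
 * -1 marks a gap. In the real tree this table would be generated. */
static const int32_t WaitClassOffsets[4] = {-1, 0, -1, 96};

/* One flat array covering all classes back to back (96 + 12 slots here). */
static uint64_t PendingCounts[96 + 12];

/* Branch-free increment: one offset load, one add, one memory increment. */
static inline void
wait_event_increment(uint32_t classId, uint32_t eventId)
{
	PendingCounts[WaitClassOffsets[classId] + (int32_t) eventId]++;
}
```

The trade-off is that callers must never pass a gap classId, so any validation has to happen at a colder spot (or be guaranteed by construction).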
Hi,

On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote:
> On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
> > * we don't want to allocate memory in some places where the counters are
> > incremented (see 4feba03d8b9). We could still use the same implementation as for
> > backend statistics (i.e., make use of flush_static_cb) if we really want/need to
> > switch to variable stats.
>
> Even without allocating memory, this makes wait events considerably more
> expensive. Going from ~4 instructions to something more like ~40. Including
> external function calls.

Yeah, looking at the performance impact and discussing it was on my todo list (as mentioned above); thanks for looking at it!

> end wait:
> 0x0000000000a08eaa <+90>:	mov    (%r12),%rax

Yeah, this one is unbeatable; anything else will have to be more expensive...

On my side, I'm getting:

=> 0x000000000091a320 <+0>:	mov    0x5337f9(%rip),%rax   # 0xe4db20 <my_wait_event_info>
   0x000000000091a327 <+7>:	mov    (%rax),%eax
   0x000000000091a329 <+9>:	mov    %eax,%edx
   0x000000000091a32b <+11>:	and    $0xff000000,%edx
   0x000000000091a331 <+17>:	jne    0x91a340 <waitEventIncrementCounter+32>
   0x000000000091a333 <+19>:	test   %ax,%ax
   0x000000000091a336 <+22>:	jne    0x91a340 <waitEventIncrementCounter+32>
   0x000000000091a338 <+24>:	ret
   0x000000000091a339 <+25>:	nopl   0x0(%rax)
   0x000000000091a340 <+32>:	cmp    $0x1000000,%edx
   0x000000000091a346 <+38>:	jne    0x91a34e <waitEventIncrementCounter+46>
   0x000000000091a348 <+40>:	cmp    $0x5e,%ax
   0x000000000091a34c <+44>:	ja     0x91a338 <waitEventIncrementCounter+24>
   0x000000000091a34e <+46>:	cmp    $0x7000000,%edx
   0x000000000091a354 <+52>:	sete   %cl
   0x000000000091a357 <+55>:	test   %ax,%ax
   0x000000000091a35a <+58>:	setne  %al
   0x000000000091a35d <+61>:	test   %al,%cl
   0x000000000091a35f <+63>:	jne    0x91a338 <waitEventIncrementCounter+24>
   0x000000000091a361 <+65>:	cmp    $0xb000000,%edx
   0x000000000091a367 <+71>:	je     0x91a338 <waitEventIncrementCounter+24>
   0x000000000091a369 <+73>:	mov    %edi,%eax
   0x000000000091a36b <+75>:	movzwl %di,%edi
   0x000000000091a36e <+78>:	shr    $0x18,%eax
   0x000000000091a371 <+81>:	shl    $0x5,%rax
   0x000000000091a375 <+85>:	movslq 0xe4d9c8(%rax),%rax
   0x000000000091a37c <+92>:	movb   $0x1,0x56045d(%rip)   # 0xe7a7e0 <have_wait_event_stats>
   0x000000000091a383 <+99>:	add    %rdi,%rax
   0x000000000091a386 <+102>:	addq   $0x1,0xe7a800(,%rax,8)
   0x000000000091a38f <+111>:	ret

> That's about an order of magnitude increase in overhead, just based on the
> number of instructions, and probably considerably more in runtime (you're
> adding external function calls etc).

Right.

> I think to make this viable, we would have to remove some wait events and be
> more careful about adding them.

You mean incrementing the counters only for wait events for which we expect the increment to be "negligible" as compared to the wait event itself (like DATAFILE IO and such)?

If so I agree, and that's something I have also discussed with a few people (also relevant if we add the timings later): we could/should be more granular. If we go that way, we could also imagine a GUC to enable the stats for all of them (default being off)?

> And optimize this code a good bit. This really shouldn't end up with this
> much code.
Yes, with the idea mentioned in [1] (code not shared yet), I end up with:

Dump of assembler code for function waitEventIncrementCounter:
=> 0x000000000091b320 <+0>:	mov    %edi,%eax
   0x000000000091b322 <+2>:	and    $0xff000000,%eax
   0x000000000091b327 <+7>:	jne    0x91b330 <waitEventIncrementCounter+16>
   0x000000000091b329 <+9>:	test   %di,%di
   0x000000000091b32c <+12>:	jne    0x91b330 <waitEventIncrementCounter+16>
   0x000000000091b32e <+14>:	ret
   0x000000000091b32f <+15>:	nop
   0x000000000091b330 <+16>:	cmp    $0x1000000,%eax
   0x000000000091b335 <+21>:	jne    0x91b33d <waitEventIncrementCounter+29>
   0x000000000091b337 <+23>:	cmp    $0x5e,%di
   0x000000000091b33b <+27>:	ja     0x91b32e <waitEventIncrementCounter+14>
   0x000000000091b33d <+29>:	cmp    $0x7000000,%eax
   0x000000000091b342 <+34>:	sete   %cl
   0x000000000091b345 <+37>:	test   %di,%di
   0x000000000091b348 <+40>:	setne  %dl
   0x000000000091b34b <+43>:	test   %dl,%cl
   0x000000000091b34d <+45>:	jne    0x91b32e <waitEventIncrementCounter+14>
   0x000000000091b34f <+47>:	cmp    $0xb000000,%eax
   0x000000000091b354 <+52>:	je     0x91b32e <waitEventIncrementCounter+14>
   0x000000000091b356 <+54>:	mov    %edi,%eax
   0x000000000091b358 <+56>:	movzwl %di,%edi
   0x000000000091b35b <+59>:	shr    $0x18,%eax
   0x000000000091b35e <+62>:	mov    0xe4e9c0(,%rax,8),%rax
   0x000000000091b366 <+70>:	movb   $0x1,0x5603f3(%rip)   # 0xe7b760 <have_wait_event_stats>
   0x000000000091b36d <+77>:	mov    0x8(%rax),%rax
   0x000000000091b371 <+81>:	addq   $0x1,(%rax,%rdi,8)
   0x000000000091b376 <+86>:	ret

That looks better before the first branch (could be done in v1 too), as it uses registers only, but looks worse when it's time to increment the counter, while the C code is much simpler:

"
WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info);

AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++;
"

instead of

"
counter = waitEventGetCounter(PendingWaitEventStats.counts, waitInfo.classId, waitInfo.eventId);

(*counter)++;
"

with the associated machinery in waitEventGetCounter().

Indeed, waitEventGetCounter() produced only one memory read (and one write):

   movslq 0xe4d9c8(%rax),%rax
   addq   $0x1,0xe7a800(,%rax,8)

while AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++ produces 2 pointer dereferences and 1 write:

   mov    0xe4e9c0(,%rax,8),%rax
   mov    0x8(%rax),%rax
   addq   $0x1,(%rax,%rdi,8)

I still need to work on it and try to make the code both simpler and effective.

[1]: https://www.postgresql.org/message-id/aG04Zk%2Bb3SRN0pBe%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
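The difference measured above can be seen directly in C. Here is a tiny compilable contrast between the two addressing patterns (all names, sizes, and class IDs are invented placeholders; only class 3 is populated in the per-class table):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout A: one flat array plus a precomputed offset table.
 * The increment needs one offset load before the memory update. */
static uint64_t flat_counts[108];
static const int32_t class_offsets[4] = {-1, 0, -1, 96};	/* -1 = gap */

static void
increment_flat(uint32_t classId, uint32_t eventId)
{
	flat_counts[class_offsets[classId] + (int32_t) eventId]++;
}

/* Layout B: per-class structs reached through a pointer array.
 * The increment chases two pointers before the memory update. */
typedef struct WaitClassInfo
{
	int			numberOfEvents;
	uint64_t   *pendingCounts;
} WaitClassInfo;

static uint64_t lock_counts[12];
static WaitClassInfo LockClassInfo = {12, lock_counts};
static WaitClassInfo *AllWaitClasses[4] = {NULL, NULL, NULL, &LockClassInfo};

static void
increment_per_class(uint32_t classId, uint32_t eventId)
{
	AllWaitClasses[classId]->pendingCounts[eventId]++;
}
```

Both forms end at the same counter; the question raised in the thread is purely how many dependent loads sit in front of the increment on the hot path.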
Hi,

On Wed, Jul 09, 2025 at 09:56:10AM +0900, Michael Paquier wrote:
> On Tue, Jul 08, 2025 at 03:25:26PM +0000, Bertrand Drouvot wrote:
> > So, something like:
> >
> > ClassInfo LWLockClassInfo = {
> >     .numberOfEvents = NB_LWLOCK_WAIT_EVENTS,
> >     .pendingCounts = PendingWaitEventStats.lwlock_counts
> > };
> > [...]
>
> Okay. I'm guessing that all this information is automatically built
> based on the contents of wait_event_names.txt (with some tweaks for
> the LWLock part?),

Yeah, fully generated by generate-wait_event_types.pl.

> > That way:
> >
> > - we do not need a mapping between ClassID and "Arrays", and the gaps
> > in the ClassID numbering are handled. So we can directly access the class of interest
> > knowing its classID.
> > - we can easily iterate (again, no mapping needed) over all the pending stats
> > thanks to AllWaitClasses[]
> >
> > What do you think? It looks like the most elegant approach to me.
>
> I think that's pretty cool.

Yeah, but from what I can see the counter increment results in one more memory read (as compared to v1), see [1]. So the code looks simpler but is more expensive... I'll try to see what can be done to improve it.

> > Bonus point: as you can see above, while working on this approach I've been able
> > to remove storing the className and the array of eventName per wait class (that would
> > have worked in v1 too).
>
> You mean the extra set of names generated?

Yeah, those are not needed; the only thing that's important is that we ensure that the counters and the pending ones contain the same "gaps" (the LWLock ones), so that the classID mapping is correct.

[1]: https://www.postgresql.org/message-id/aG4ZjQ1qKensHBOY%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi, On 2025-07-09 07:26:05 +0000, Bertrand Drouvot wrote: > On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote: > > On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote: > > > * we don't want to allocate memory in some places where the counters are > > > incremented (see 4feba03d8b9). We could still use the same implementation as for > > > backend statistics (i.e, make use of flush_static_cb) if we really want/need to > > > switch to variable stats. > > > > Even without allocating memory, this makes wait events considerably more > > expensive. Going from ~4 instructions to something more like ~40. Including > > external function calls. > > Yeah looking at the performance impacts and discuss those was in my todo (as > mentioned above), thanks for looking at it! TBH, the more I think about this, the less convinced a pure counter provides *any* additional value over stats we already have, in which case the overhead is moot. We use commonly use wait events unconditionally around system calls, e.g.: pgstat_report_wait_start(wait_event_info); returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset); pgstat_report_wait_end(); A counter for the number of wait events is just less informative version of what we already have in pg_stat_io. There's literally no value provided by tracking the number of times that wait event is hit. A pretty large number of wait events operate that way. For those the interesting information would be to measure *how long* you are in the wait event, not the number of times the wait event was entered for a very short amount of time. The obvious counter to that is that you likely are planning to add timing information at some point down the line - leaving overhead aside, that would *also* just duplicate information for the above case, namely track_io_timing. The wait events we have as-is are useful because they tell you what's currently happening, and you can correlate that with other things that are going on. 
But a pure counter of wait events encountered doesn't allow for that.

> > end wait:
> > 0x0000000000a08eaa <+90>: mov (%r12),%rax
>
> Yeah this one is unbeatable, it will have to be more expensive...
>
> On my side, I'm getting:
>
> => 0x000000000091a320 <+0>: mov 0x5337f9(%rip),%rax # 0xe4db20 <my_wait_event_info>
> 0x000000000091a327 <+7>: mov (%rax),%eax
> 0x000000000091a329 <+9>: mov %eax,%edx
> 0x000000000091a32b <+11>: and $0xff000000,%edx
> 0x000000000091a331 <+17>: jne 0x91a340 <waitEventIncrementCounter+32>
> 0x000000000091a333 <+19>: test %ax,%ax
> 0x000000000091a336 <+22>: jne 0x91a340 <waitEventIncrementCounter+32>
> 0x000000000091a338 <+24>: ret
> 0x000000000091a339 <+25>: nopl 0x0(%rax)
> 0x000000000091a340 <+32>: cmp $0x1000000,%edx
> 0x000000000091a346 <+38>: jne 0x91a34e <waitEventIncrementCounter+46>
> 0x000000000091a348 <+40>: cmp $0x5e,%ax
> 0x000000000091a34c <+44>: ja 0x91a338 <waitEventIncrementCounter+24>
> 0x000000000091a34e <+46>: cmp $0x7000000,%edx
> 0x000000000091a354 <+52>: sete %cl
> 0x000000000091a357 <+55>: test %ax,%ax
> 0x000000000091a35a <+58>: setne %al
> 0x000000000091a35d <+61>: test %al,%cl
> 0x000000000091a35f <+63>: jne 0x91a338 <waitEventIncrementCounter+24>
> 0x000000000091a361 <+65>: cmp $0xb000000,%edx
> 0x000000000091a367 <+71>: je 0x91a338 <waitEventIncrementCounter+24>
> 0x000000000091a369 <+73>: mov %edi,%eax
> 0x000000000091a36b <+75>: movzwl %di,%edi
> 0x000000000091a36e <+78>: shr $0x18,%eax
> 0x000000000091a371 <+81>: shl $0x5,%rax
> 0x000000000091a375 <+85>: movslq 0xe4d9c8(%rax),%rax
> 0x000000000091a37c <+92>: movb $0x1,0x56045d(%rip) # 0xe7a7e0 <have_wait_event_stats>
> 0x000000000091a383 <+99>: add %rdi,%rax
> 0x000000000091a386 <+102>: addq $0x1,0xe7a800(,%rax,8)
> 0x000000000091a38f <+111>: ret

I think adding 6 conditional branches to this is just out of the question...

> > I think to make this viable, we would have to remove some wait events and be more careful about adding them.
> You mean incrementing the counters for wait events for which we expect that the increment is "negligible" as compared to the wait event itself? (Like DATAFILE IO and such)?

I mean that we would need to either *remove* a bunch of wait events, or change the API so that there are "cheap" wait events that just have the current overhead, and expensive wait events that do this tracking.

> If so I agree and that's something I also have discussed with a few (also if we add the timings later), that we could/should be more granular. If we go that way we could also imagine a GUC to set the stats for all of them? (default being off).

Any conditional code that you add makes this stuff more expensive...

> > And optimize this code a good bit. This really shouldn't end up with this much code.
>
> Yes, with the idea mentioned in [1] (code not shared yet), I end up with:
>
> Dump of assembler code for function waitEventIncrementCounter:
> => 0x000000000091b320 <+0>: mov %edi,%eax
> 0x000000000091b322 <+2>: and $0xff000000,%eax
> 0x000000000091b327 <+7>: jne 0x91b330 <waitEventIncrementCounter+16>
> 0x000000000091b329 <+9>: test %di,%di
> 0x000000000091b32c <+12>: jne 0x91b330 <waitEventIncrementCounter+16>
> 0x000000000091b32e <+14>: ret
> 0x000000000091b32f <+15>: nop
> 0x000000000091b330 <+16>: cmp $0x1000000,%eax
> 0x000000000091b335 <+21>: jne 0x91b33d <waitEventIncrementCounter+29>
> 0x000000000091b337 <+23>: cmp $0x5e,%di
> 0x000000000091b33b <+27>: ja 0x91b32e <waitEventIncrementCounter+14>
> 0x000000000091b33d <+29>: cmp $0x7000000,%eax
> 0x000000000091b342 <+34>: sete %cl
> 0x000000000091b345 <+37>: test %di,%di
> 0x000000000091b348 <+40>: setne %dl
> 0x000000000091b34b <+43>: test %dl,%cl
> 0x000000000091b34d <+45>: jne 0x91b32e <waitEventIncrementCounter+14>
> 0x000000000091b34f <+47>: cmp $0xb000000,%eax
> 0x000000000091b354 <+52>: je 0x91b32e <waitEventIncrementCounter+14>
> 0x000000000091b356 <+54>: mov %edi,%eax
> 0x000000000091b358 <+56>: movzwl %di,%edi
> 0x000000000091b35b <+59>: shr $0x18,%eax
> 0x000000000091b35e <+62>: mov 0xe4e9c0(,%rax,8),%rax
> 0x000000000091b366 <+70>: movb $0x1,0x5603f3(%rip) # 0xe7b760 <have_wait_event_stats>
> 0x000000000091b36d <+77>: mov 0x8(%rax),%rax
> 0x000000000091b371 <+81>: addq $0x1,(%rax,%rdi,8)
> 0x000000000091b376 <+86>: ret

That's still way too many jumps.

> That looks better before the first branch (could be done in v1 too) as it uses registers only, but looks worse when it's time to increment the counter, while the code is much simpler:
>
> "
> WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info);
>
> AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++;
> "

Too many indirections. The increment should be doable without needing to dereference intermediary memory locations.

Greetings,

Andres Freund
Hi,

On Wed, Jul 09, 2025 at 11:19:00AM -0400, Andres Freund wrote:
> On 2025-07-09 07:26:05 +0000, Bertrand Drouvot wrote:
> > On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote:
> > > On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote:
>
> TBH, the more I think about this, the less convinced I am that a pure counter provides *any* additional value over stats we already have, in which case the overhead is moot.

I think there are a lot of wait events that don't have a stat counterpart: the Controlfile ones, some of the DataFile ones (prefetch..), the Dsm ones, to name a few.

> We commonly use wait events unconditionally around system calls, e.g.:
>
> pgstat_report_wait_start(wait_event_info);
> returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
> pgstat_report_wait_end();
>
> A counter for the number of wait events is just a less informative version of what we already have in pg_stat_io. There's literally no value provided by tracking the number of times that wait event is hit.

For this particular one I do agree, IO is kind of an exception here.

> A pretty large number of wait events operate that way. For those the interesting information would be to measure *how long* you are in the wait event, not the number of times the wait event was entered for a very short amount of time.

The counters on the wait events could be seen as "system/engine stats", not as wait events. That could help describe the engine behavior over time.

> The obvious counter to that is that you likely are planning to add timing information at some point down the line

Yes, and then the counters also become even more valuable. Also, timings without counters would not make that much sense, so counters are kind of a mandatory first step.

> - leaving overhead aside, that would *also* just duplicate information for the above case, namely track_io_timing.

Agree for this particular case, but we currently have 273 in-core wait events.
> The wait events we have as-is are useful because they tell you what's currently happening, and you can correlate that with other things that are going on. But a pure counter of wait events encountered doesn't allow for that.

Yes, I think the pure counters could/should be seen as engine statistics instead. And then they become even more valuable once we add the timings.

> I think adding 6 conditional branches to this is just out of the question...

This has to be optimized, I do agree.

> > > I think to make this viable, we would have to remove some wait events and be more careful about adding them.
> >
> > You mean incrementing the counters for wait events for which we expect that the increment is "negligible" as compared to the wait event itself? (Like DATAFILE IO and such)?
>
> I mean that we would need to either *remove* a bunch of wait events, or change the API so that there are "cheap" wait events that just have the current overhead, and expensive wait events that do this tracking.

I had the latter in mind, as it's still good to know what the engine is doing/waiting for even if that's "only" at a regular sampling interval (we would lose that information if we were to *remove* a bunch of wait events).

> > If so I agree and that's something I also have discussed with a few (also if we add the timings later), that we could/should be more granular. If we go that way we could also imagine a GUC to set the stats for all of them? (default being off).
>
> Any conditional code that you add makes this stuff more expensive...

Yes, or it could be a compilation-time flag to avoid the condition being hit at runtime. That said, I like your API change idea: we could decide which code paths use the tracking and which do not.

> That's still way too many jumps.

Yes, I did not try to optimize that part yet. The above idea would certainly help though, I mean probably we could remove most of the jumps.
> > That looks better before the first branch (could be done in v1 too) as it uses registers only, but looks worse when it's time to increment the counter, while the code is much simpler:
> >
> > "
> > WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info);
> >
> > AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++;
> > "
>
> Too many indirections. The increment should be doable without needing to dereference intermediary memory locations.

Right. I can think of 2 approaches:

1. Automatically build a function (thanks to generate-wait_event_types.pl) that returns the offset of the wait event, something like:

    static inline int
    get_wait_event_counter_index(uint32 wait_event_info)
    {
        uint32 classId = wait_event_info & WAIT_EVENT_CLASS_MASK;
        uint32 eventId = wait_event_info & WAIT_EVENT_ID_MASK;

        switch (classId)
        {
            case PG_WAIT_LWLOCK:
                if (eventId >= 95)
                    return -1;
                return 0 + eventId;
            case PG_WAIT_LOCK:
                if (eventId >= 12)
                    return -1;
                return 95 + eventId;
            case PG_WAIT_BUFFERPIN:
            .
            .
            .

Then the offsets are "hardcoded" and there are no memory reads to get the offset (confirmed by the assembly code):

    cmp $0x8,%edi
    ja 0x91b32e
    add $0x7e,%edi

But the switch is not that good because it also produces many more cmp instructions, so the last case will "suffer" more than the "first" one, for example.

2. Make use of a lookup table, like:

    static const int WaitClassOffsets[11] = {
        -1, 0, -1, 95, 107, 108, 126, 135, 136, 194, 204,
    };

but then one indirection would still be needed, so a memory read to get WaitClassOffsets[classId]. That read may need more CPU cycles (than the worst switch case) depending on where it is read from (L1, L2, L3). That said, given its size and how frequently it would be accessed, it is likely to be in L1, and if that's the case then that's probably better than the switch idea. Thoughts?
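To make the comparison concrete, here is a minimal, self-contained sketch of approach 2 (the offsets, mask layout, and function name are taken from the snippets above; the per-class eventId bounds checks and the PostgreSQL typedefs are simplified, so treat this as an illustration rather than the actual patch):

```c
#include <stdint.h>

/* Mask layout as used upthread: class in the top byte, event id in the low 16 bits. */
#define WAIT_EVENT_CLASS_MASK 0xFF000000u
#define WAIT_EVENT_ID_MASK    0x0000FFFFu

/* Per-class offsets into one flat counter array; -1 marks unused class IDs. */
static const int WaitClassOffsets[11] = {
    -1, 0, -1, 95, 107, 108, 126, 135, 136, 194, 204,
};

/* Single read of a small, hot table instead of a chain of cmp/jmp. */
static inline int
get_wait_event_counter_index(uint32_t wait_event_info)
{
    uint32_t classId = (wait_event_info & WAIT_EVENT_CLASS_MASK) >> 24;
    uint32_t eventId = wait_event_info & WAIT_EVENT_ID_MASK;

    if (classId >= 11 || WaitClassOffsets[classId] < 0)
        return -1;
    return WaitClassOffsets[classId] + (int) eventId;
}
```

With a flat uint64 counter array, the increment then becomes a single indexed add (`WaitEventCounts[idx]++`), which is the "no intermediary dereference" shape Andres asked for.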
Let me try to produce an action plan:

- Let's define a list of "cheap" and a list of "expensive" wait events
- Change the code path to increment counters only for the "expensive" ones
- Optimize the counter incrementation as much as possible

Does that sound reasonable to you?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On 2025-07-10 08:21:56 +0000, Bertrand Drouvot wrote:
> - Let's define a list of "cheap" and a list of "expensive" wait events
> - Change the code path to increment counters only for the "expensive" ones
> - Optimize the counter incrementation as much as possible
>
> Does that sound reasonable to you?

That does seem like the minimum.

Unfortunately I'm rather doubtful this provides enough value to be worth the cost. But I'm rather willing to be proven wrong.

Greetings,

Andres Freund
Hi,

On Tue, Jul 15, 2025 at 04:13:49PM +0200, Bertrand Drouvot wrote:
> On Fri, Jul 11, 2025 at 11:59:39AM -0400, Andres Freund wrote:
> > On 2025-07-10 08:21:56 +0000, Bertrand Drouvot wrote:
> > > - Let's define a list of "cheap" and a list of "expensive" wait events
> > > - Change the code path to increment counters only for the "expensive" ones
> > > - Optimize the counter incrementation as much as possible
> > >
> > > Does that sound reasonable to you?
> >
> > That does seem like the minimum.
> >
> > Unfortunately I'm rather doubtful this provides enough value to be worth the cost. But I'm rather willing to be proven wrong.
>
> I guess that would (among other things) depend on how many wait events would be tracked. I'll do some study to measure the counters' impact on the wait events.

Here is what I've done:

1. Added 2 probes: one in pgstat_report_wait_start() and one in pgstat_report_wait_end() to measure the wait events' duration. Then, I ran make check-world and computed the duration percentiles (in microseconds) per wait event (see attached). Note that this is without the counters patch, just to get an idea of wait events' duration on my lab.

Due to the way I ran the eBPF script I've been able to trap 159 wait events (out of 273) during "make check-world" (I think that's enough for a first analysis, see below).

2. Applied the patch with the counters (one that I did not share yet), removed the probes in pgstat_report_wait_start() and pgstat_report_wait_end(), and added one probe before incrementing the counters (i.e. before calling waitEventIncrementCounter()) in pgstat_report_wait_end() and one probe after incrementing the counters, to measure the counter increment duration.
Then I ran make check-world and computed counter increment duration percentiles (in nanoseconds) and got:

P50.00: 1138.000000
P90.00: 2674.000000
P95.00: 3741.000000
P99.00: 7657.000000
P99.90: 16973.881000
P99.99: 55614.830000

Then I compared the P50.00 for the counter and for the wait events, and rejected the wait events where incrementing the counter produces more than 5% overhead.

-> 59 wait events (out of the 159 observed during make check-world) got less than 5% overhead. They are mixed across classes, but among those 59 we have:

All the trapped ones in the Lock class (8 events) (0.7% avg impact):

spectoken
advisory
tuple
relation
transactionid
virtualxid
extend
applytransaction

All the trapped ones in the Timeout class (7 events) (0.03% avg impact):

RECOVERY_APPLY_DELAY
RECOVERY_RETRIEVE_RETRY_INTERVAL
CHECKPOINT_WRITE_DELAY
VACUUM_TRUNCATE
BASE_BACKUP_THROTTLE
PG_SLEEP
SPIN_DELAY

And the BufferPin one (about 4%).

Note that I would get the same behavior (just a little bit more overhead) if I add the wait events' duration (INSTR_*) to the game.

With the above it looks like it would be possible to add the counters and their timings on the Lock and Timeout classes without producing a noticeable performance impact. I think that:

- That's easy enough to reason about and kind of expected for those 2 classes
- We would miss other wait events but I think the trade-off is worth the simplicity
- Having the ability to get the counters and the timings for those 2 wait classes is better than nothing
- Having those metrics for the Lock class would be a nice addition to pg_locks

Thoughts?

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On 2025-07-18 06:04:38 +0000, Bertrand Drouvot wrote:
> Here is what I've done:
>
> 1. Added 2 probes: one in pgstat_report_wait_start() and one in pgstat_report_wait_end() to measure the wait events' duration. Then, I ran make check-world and computed the duration percentiles (in microseconds) per wait event (see attached). Note that this is without the counters patch, just to get an idea of wait events' duration on my lab.
>
> Due to the way I ran the eBPF script I've been able to trap 159 wait events (out of 273) during "make check-world" (I think that's enough for a first analysis, see below).

It's important to note that eBPF tracepoints can add significant overhead, i.e. your baseline measurement is influenced by that overhead.

To showcase that, in a ridiculously exaggerated way, I added probes with bpftrace on start and exit of heap_getnextslot(), measuring the time in the function, and compared the performance of that to not having the probes. Without probes my seqscan takes 1651.221ms, with probes 15417.930ms. I.e. it slowed down by about an order of magnitude.

I don't think you can use numbers gathered from this as a baseline measurement to evaluate the overhead added by adding explicit counters. The overhead of the explicit counter is going to be *WAY* lower than the eBPF based probes.

You can't usefully use eBPF probes to measure costs that are this small: each time you hit a probe it has to trap into kernel space, run a bit of code, and return to userspace. That'll be a few hundred cycles at the very least. If you actually want to compute a semi-accurate duration with eBPF, the duration of what you measure needs to be considerably longer.

> 2. Applied the patch with the counters (one that I did not share yet), removed the probes in pgstat_report_wait_start() and pgstat_report_wait_end(), and added one probe before incrementing the counters (i.e. before calling waitEventIncrementCounter()) in pgstat_report_wait_end() and one probe after incrementing the counters, to measure the counter increment duration.
>
> Then I ran make check-world and computed counter increment duration percentiles (in nanoseconds) and got:
>
> P50.00: 1138.000000
> P90.00: 2674.000000
> P95.00: 3741.000000
> P99.00: 7657.000000
> P99.90: 16973.881000
> P99.99: 55614.830000

The above confirms my suspicion. It's inconceivable that the explicit counting you added costs ~1k nanoseconds on average. Multiply that by the frequency of the CPU, and by the fact that CPUs are superscalar (can execute multiple instructions at once), and the overhead you compute here would be roughly 10k instructions. Obviously that's not true. If it were, we'd have to just mark this patch as rejected and never talk about it again.

Even if your baseline measurement did provide an accurate baseline number, it'd still not be particularly helpful, I suspect. Whether adding additional instructions actually causes slowdowns isn't just determined by the raw number of instructions; it also depends on what work is being done in the CPU. E.g. if you add an extra 50 cheap instructions into a codepath that's completely bottlenecked by memory latency, you'll often not increase the total time taken meaningfully, as all those instructions are executed while waiting for memory (which has a latency longer than those 50 cheap instructions added together). But if you add 50 cheap instructions to a workload that's purely CPU bound, you're much more likely to slow that down measurably.
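The "roughly 10k instructions" figure can be reproduced with back-of-the-envelope arithmetic; the 3 GHz clock and 3 instructions-per-cycle numbers below are assumptions for illustration (the email does not state them):

```c
/* Convert a measured duration in nanoseconds into an approximate
 * instruction budget, given an assumed clock frequency in GHz
 * (i.e. cycles per nanosecond) and sustained instructions per cycle. */
static double
estimated_instructions(double ns, double ghz, double ipc)
{
    return ns * ghz * ipc;
}
```

With the measured p50 of 1138 ns: 1138 x 3 x 3 is about 10,200 instructions, which is why a ~1k ns per-increment cost is implausible for a handful of instructions.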
> I think that:
>
> - That's easy enough to reason about and kind of expected for those 2 classes
> - We would miss other wait events but I think the trade-off is worth the simplicity
> - Having the ability to get the counters and the timings for those 2 wait classes is better than nothing
> - Having those metrics for the Lock class would be a nice addition to pg_locks
>
> Thoughts?

Unfortunately, due to the issues explained above, I don't think we can learn a lot from this experiment :(

I also continue to not believe that pure event counters are going to be useful for the majority of wait events. I'm not sure it is really interesting for *any* wait event that we don't already have independent stats for.

I think if we do want to have wait events that have more details, we need to:

a) make them explicitly opt-in, i.e. code has to be changed over to use the extended wait events
b) the extended wait events need to count both the number of encounters as well as the duration, the number of encounters is not useful on its own
c) for each callsite that is converted to the extended wait event, you either need to reason why the added overhead is ok, or do a careful experiment

Personally I'd rather have an in-core sampling collector, counting how often it sees certain wait events when sampling. It then also can correlate those samples with other things on the system, e.g. by counting the number of active backends together with each sample. And eventually also memory usage etc.

Greetings,

Andres Freund
Hi,

On Fri, Jul 18, 2025 at 11:43:47AM -0400, Andres Freund wrote:
> On 2025-07-18 06:04:38 +0000, Bertrand Drouvot wrote:
> > Here is what I've done:
> >
> > 1. Added 2 probes: one in pgstat_report_wait_start() and one in pgstat_report_wait_end() to measure the wait events' duration. Then, I ran make check-world and computed the duration percentiles (in microseconds) per wait event (see attached). Note that this is without the counters patch, just to get an idea of wait events' duration on my lab.
> >
> > Due to the way I ran the eBPF script I've been able to trap 159 wait events (out of 273) during "make check-world" (I think that's enough for a first analysis, see below).
>
> It's important to note that eBPF tracepoints can add significant overhead, i.e. your baseline measurement is influenced by that overhead.

Thanks for looking at it!

I do agree that eBPF tracepoints can add significant overhead.

> The overhead of the explicit counter is going to be *WAY* lower than the eBPF based probes.

Agree.

> Unfortunately, due to the issues explained above, I don't think we can learn a lot from this experiment :(

I think that statement is debatable. While it's true that eBPF tracepoints add overhead, I used them to get both the wait events' percentile timings and the counter increment timings. And then I used those percentiles (both include the overhead) to extract the wait events/classes for which the p50 counter increments would represent less than 5%, meaning:

100 * p50_counter_timing / p50_wait_time < 5

And we have seen that we got 0.7% for the Lock class and 0.03% for the Timeout class. With those numbers, and even if the eBPF tracepoints add overhead, I think that we can learn that it would be safe to add the counters for those two wait classes.
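As a minimal sketch of the acceptance rule above (the 5% threshold and the p50 quantities are from this thread; the helper function itself is hypothetical):

```c
/* Returns 1 when the p50 counter-increment time stays below the given
 * percentage of the p50 wait time, i.e. when the counter overhead is
 * considered acceptable for that wait event. */
static int
counter_overhead_acceptable(double p50_counter_timing, double p50_wait_time,
                            double threshold_pct)
{
    if (p50_wait_time <= 0.0)
        return 0;               /* no meaningful baseline to compare to */
    return 100.0 * p50_counter_timing / p50_wait_time < threshold_pct;
}
```

For example, a 1138 ns increment against a 500 microsecond wait is ~0.23% and passes; the same increment against a 10 microsecond wait is ~11% and is rejected.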
Anyway, let's forget about eBPF. I ran another experiment by counting the cycles with:

    static inline uint64_t rdtsc(void) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }

and then calling this function before and after waitEventIncrementCounter() and also at wait_start() and wait_end() (without the increment counters patches).

This gives for counter increments (also added INSTR* timings to get the full picture as we want them anyway) the following number of cycles:

p50: 146.000
p90: 722.000
p95: 1148.000
p99: 3474.000
p99.9: 13600.132

So that we can compare with the percentile cycles per wait event (see attached). We can see that, for those wait classes, all their wait events' overhead would be < 5%, and more precisely:

Overhead on the lock class is about 0.03%
Overhead on the timeout class is less than 0.01%

and now we can also see that:

Overhead on the lwlock class is about 1%
Overhead on the client class is about 0.5%
Overhead on the bufferpin class is about 0.2%

while the io and ipc classes have mixed results.

So based on the cycles metric I think it looks pretty safe to implement for the vast majority of classes.

> I also continue to not believe that pure event counters are going to be useful for the majority of wait events. I'm not sure it is really interesting for *any* wait event that we don't already have independent stats for.

For pure counters only I can see your point, but for counters + timings are you also not convinced?

> I think if we do want to have wait events that have more details, we need to:
>
> a) make them explicitly opt-in, i.e. code has to be changed over to use the extended wait events
> b) the extended wait events need to count both the number of encounters as well as the duration, the number of encounters is not useful on its own
> c) for each callsite that is converted to the extended wait event, you either need to reason why the added overhead is ok, or do a careful experiment

I do agree with the above. What do you think about this latest experiment counting the cycles?

> Personally I'd rather have an in-core sampling collector, counting how often it sees certain wait events when sampling.

Yeah, but even if we are okay with losing "counters" by sampling, we'd still not get the duration. For the duration to be meaningful we also need the exact number of counters.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On 2025-07-22 12:24:46 +0000, Bertrand Drouvot wrote:
> Anyway, let's forget about eBPF. I ran another experiment by counting the cycles with:
>
> static inline uint64_t rdtsc(void) {
>     uint32_t lo, hi;
>     __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
>     return ((uint64_t)hi << 32) | lo;
> }
>
> and then calling this function before and after waitEventIncrementCounter() and also at wait_start() and wait_end() (without the increment counters patches).

I think you're still going to get massively increased baseline numbers that way - the normal cost of a wait event is under 10 cycles. Doing two rdtscs costs somewhere between 60-90 cycles. Which means that any increase due to counters & timers will look a lot cheaper compared to that increased baseline than if you compared it to the actual current cost of a wait event.

> So that we can compare with the percentile cycles per wait event (see attached).
>
> We can see that, for those wait classes, all their wait events' overhead would be < 5%, and more precisely:
>
> Overhead on the lock class is about 0.03%
> Overhead on the timeout class is less than 0.01%
>
> and now we can also see that:
>
> Overhead on the lwlock class is about 1%
> Overhead on the client class is about 0.5%
> Overhead on the bufferpin class is about 0.2%

I think that's largely because there are relatively few such wait events, because there is very, very little contention in the regression tests and we just don't do a whole lot of intensive things in the tests. I suspect that at least some of the high events here will actually be due to tests that explicitly test the contention behaviour, and thus will have very high wait times.

E.g. if you measure client timings, the overhead here will be fairly low, because we're not going to be CPU bound by the back/forth between client and server, and thus many of the waits will be longer. If you instead measure a single client readonly pgbench, it'll look different.
Similarly, if you have lwlock contention in a real world workload, most of the waits will be incredibly short, but in our tests that will not necessarily be the case.

> while the io and ipc classes have mixed results.
>
> So based on the cycles metric I think it looks pretty safe to implement for the vast majority of classes.

This precisely is why I am scared of this effort. If you only look at it in the right light, it'll look cheap, but in other cases it'll cause measurable slowdowns.

> > I also continue to not believe that pure event counters are going to be useful for the majority of wait events. I'm not sure it is really interesting for *any* wait event that we don't already have independent stats for.
>
> For pure counters only I can see your point, but for counters + timings are you also not convinced?

For counters + timings I can see that it'd be useful. But I don't believe it's close to as cheap as you say it is.

> > I think if we do want to have wait events that have more details, we need to:
> >
> > a) make them explicitly opt-in, i.e. code has to be changed over to use the extended wait events
> > b) the extended wait events need to count both the number of encounters as well as the duration, the number of encounters is not useful on its own
> > c) for each callsite that is converted to the extended wait event, you either need to reason why the added overhead is ok, or do a careful experiment
>
> I do agree with the above. What do you think about this latest experiment counting the cycles?

I continue to not believe it at all, sorry. Even if the counting method were accurate, you can't use our tests to measure the relative overhead, as they aren't actually exercising the paths leading to waits.

> > Personally I'd rather have an in-core sampling collector, counting how often it sees certain wait events when sampling.
>
> Yeah, but even if we are okay with losing "counters" by sampling, we'd still not get the duration.
> For the duration to be meaningful we also need the exact number of counters.

You don't need a precise duration to see which wait events are a problem. If you see that some event is sampled a lot, you know it's because there either are a *lot* of those wait events, or the wait events are entered for a long time.

Greetings,

Andres Freund
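The in-core sampling collector Andres describes is not spelled out in the thread; as a toy user-space sketch of the idea (all names hypothetical), instead of touching a counter on every wait entry/exit, a sampler periodically reads each backend's currently reported wait event and bumps a histogram:

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_TRACKED_EVENTS 1024

/* Hypothetical per-backend state: the wait event currently reported,
 * 0 meaning "not waiting" (mirroring my_wait_event_info upthread). */
typedef struct BackendState
{
    uint32_t wait_event_info;
} BackendState;

static uint64_t sample_counts[MAX_TRACKED_EVENTS];

/* One sampling pass: read every backend's current wait event and bump its
 * bucket.  A real collector would run this off a timer, and could record
 * other system state (active backend count, etc.) alongside each pass. */
static void
sample_wait_events(const BackendState *backends, size_t nbackends)
{
    for (size_t i = 0; i < nbackends; i++)
    {
        uint32_t info = backends[i].wait_event_info;

        if (info != 0)
            sample_counts[info % MAX_TRACKED_EVENTS]++;
    }
}
```

The key trade-off: the hot wait-event path stays untouched (sampling cost is paid by the collector, proportional to the sampling rate), at the price of seeing only frequently-observed events rather than exact counts.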
On Tue, Jul 22, 2025 at 8:24 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
> So based on the cycles metric I think it looks pretty safe to implement for the vast majority of classes.

I'm not convinced that this is cheap enough to implement, and I don't understand the value proposition, either. I see the first couple of messages in the thread say that this is important for troubleshooting problems in "our user base" (not sure about the antecedent of "our") but the description of what those problems are seems pretty vague to me. I've tried counting wait events of a particular type, and the resulting data was extremely misleading and therefore useless.

I think it's probably a mistake to even be thinking about this in terms of wait events. It seems very reasonable to want more data about what PostgreSQL backends are doing, but I don't really see any reason why that should happen to line up with whatever wait events do. For instance, if you asked "what information is useful to gather about heavyweight locks?" you might say "well, we'd like to know how many times we tried to acquire one, and how many of those times we had to wait, and how many of those times we waited for longer than deadlock_timeout". And then you could design a facility to answer those questions. Or you might say "we'd like a history of all the different heavyweight locks that a certain backend has tried to acquire," and then you could design a tracing facility to give you that.

Either of those hypothetical facilities involves providing more information than you would get from just counting wait events, or counting+timing wait events, or recording a complete history of all wait events. And, I would say that, for more or less that exact reason, there's a real use case for either of them.
Maybe either or both are valuable enough to add and maybe they are not, and maybe the overhead is acceptable or maybe it isn't, but I think the arguments are much better than for a facility that just counts and times all the wait events. For instance, the former facility would let you answer the question "what percentage of lock acquisitions are contended?" whereas a pure wait event count just lets you answer the question "how many contended lock acquisitions were there?". The latter isn't useless, but the former is a lot better, particularly since, as I proposed it, you can also judge how many of those contention events involved a brief wait vs. a long one.

But if, on the other hand, you look at LWLocks, it's a totally different situation, IMHO. I'm almost sure that measuring LWLock wait times is going to be too expensive to be practical, and I don't really see why you'd want to: the right approach is sampling, which is cheap and in my experience highly effective. Measuring counts doesn't seem very useful either: knowing the number of times that somebody tried to acquire a relation lock or a tuple lock arguably tells you something about your workload that you might want to know, whereas I would argue that knowing the number of times that somebody tried to acquire a buffer lock doesn't really tell you anything at all. What you might want to know is how many buffers you accessed, which is why we've already got a system for tracking that.

That's not to say that there's nothing at all that somebody could want to know about LWLocks that you can't already find out today: for example, a system that identifies which buffers are experiencing significant buffer lock contention, by relation OID and block number, sounds quite handy. But just counting wait events, or counting and timing them, will not give you that.
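The heavyweight-lock facility described above can be sketched as a tiny counter set (entirely hypothetical — an illustration of the proposed facility, not existing PostgreSQL code); note how it answers "what percentage of acquisitions are contended?", which a bare wait event count cannot:

```c
#include <stdint.h>

/* Hypothetical per-lock-type counters, as described above. */
typedef struct LockAcquireStats
{
    uint64_t acquisitions;     /* total acquisition attempts */
    uint64_t waited;           /* attempts that had to wait */
    uint64_t waited_long;      /* waits longer than deadlock_timeout */
} LockAcquireStats;

/* Percentage of acquisitions that were contended (0 when none). */
static double
contended_pct(const LockAcquireStats *s)
{
    if (s->acquisitions == 0)
        return 0.0;
    return 100.0 * (double) s->waited / (double) s->acquisitions;
}
```

A wait event counter only ever sees the `waited` bucket; without `acquisitions` as a denominator there is no way to tell whether 25 contended acquisitions is alarming or negligible.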
Knowing which SLRUs are being heavily used could also be useful, but I think there's a good argument that having some instrumentation that cuts across SLRU types and exposes a bunch of useful numbers for each could be more useful than just hoping you can figure it out from LWLock wait event counts. In short, I feel like just counting, or counting+timing, all the wait events is painting with too broad a brush. Wait events get inserted for a specific purpose: so you can know why a certain backend is off-CPU without having to use debugging tools. They serve that purpose pretty well, but that doesn't mean that they serve other purposes well, and I find it kind of hard to see the argument that just sticking a bunch of counters or timers in the same places where we put the wait event calls would be the right thing in general. Introspection is an important topic and, IMHO, deserves much more specific and nuanced thought about what we're trying to accomplish and how we're going about it. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

On Tue, Jul 22, 2025 at 10:07:30AM -0400, Andres Freund wrote:
> Hi,
>
> On 2025-07-22 12:24:46 +0000, Bertrand Drouvot wrote:
> > Anyway, let's forget about eBPF, I ran another experiment by counting the cycles
> > with:
> >
> > static inline uint64_t rdtsc(void) {
> >     uint32_t lo, hi;
> >     __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
> >     return ((uint64_t)hi << 32) | lo;
> > }
> >
> > and then calling this function before and after waitEventIncrementCounter()
> > and also at wait_start() and wait_end() (without the increment counters patches).
>
> I think you're still going to get massively increased baseline numbers that
> way - the normal cost of a wait event is under 10 cycles. Doing two rdtscs
> costs somewhere between 60-90 cycles. Which means that any increase due to
> counters & timers will look a lot cheaper if compared to that increased
> baseline, than if you compared to the actual current cost of a wait event.

Let's name the overhead O. What is computed is:

100 * (actual p50_counter_timing + O) / (actual p50_wait_time + O)

Assuming the rdtsc cost is, say, 80 cycles, that gives:

100 * (actual p50_counter_timing + 80) / (actual p50_wait_time + 80)

which means (as p50_counter_timing is 146):

100 * (66 + 80) / (actual p50_wait_time + 80)

So, if we look at, say, the lock class result of 0.03%, it means:

actual p50_wait_time + 80 = 100 * 146 / 0.03 = 486667 cycles

which means that the 66-cycle counter increment overhead becomes negligible.

So, when:

100 * (actual p50_counter_timing + O) / (actual p50_wait_time + O)

results in very small numbers, I think that we can still rely on the ratio to
draw conclusions. The current computation could produce false negatives but
not false positives: when the ratio is very small, we can reliably conclude
that the counters' overhead is negligible.
> I think that's largely because there is relatively few such wait events,

The exact numbers are:

# cat make_check-world_cycles_percentiles.csv | grep ",lock" | awk -F, '{print $3,$4}'
advisory 332
applytransaction 44
extend 156
frozenid 2
object 2
relation 409
spectoken 1
transactionid 693
tuple 17
virtualxid 257

# cat make_check-world_cycles_percentiles.csv | grep ",timeout" | awk -F, '{print $3,$4}'
BASE_BACKUP_THROTTLE 5
CHECKPOINT_WRITE_DELAY 1262
PG_SLEEP 55
RECOVERY_APPLY_DELAY 4
RECOVERY_RETRIEVE_RETRY_INTERVAL 62
SPIN_DELAY 46
VACUUM_TRUNCATE 200

# cat make_check-world_cycles_percentiles.csv | grep ",lwlock" | awk -F, '{print $3,$4}'
AioWorkerSubmissionQueue 1498
BufferContent 44825
BufferMapping 808
CheckpointerComm 431
ControlFile 6
DynamicSharedMemoryControl 67
LockFastPath 1096
LockManager 1565
OidGen 1264
ParallelHashJoin 33
ParallelQueryDSA 17
PerSessionDSA 2
PgStatsDSA 136
PgStatsData 41
PgStatsHash 28
ProcArray 263
RelCacheInit 156
RelationMapping 3
ReplicationSlotAllocation 5
ReplicationSlotControl 1
SInvalRead 577
SInvalWrite 166
SerializableXactHash 2
SharedTidBitmap 12
TwoPhaseState 8
WALInsert 4935
WALWrite 5875
XactBuffer 1
XactSLRU 98
XidGen 56
unknown_event_95 1662

# cat make_check-world_cycles_percentiles.csv | grep ",client" | awk -F, '{print $3,$4}'
CLIENT_READ 330753
CLIENT_WRITE 6382
LIBPQWALRECEIVER_CONNECT 2137
LIBPQWALRECEIVER_RECEIVE 3875
WAL_SENDER_WAIT_FOR_WAL 7
WAL_SENDER_WRITE_DATA 2740
unknown_event_8 48

# cat make_check-world_cycles_percentiles.csv | grep ",bufferpin" | awk -F, '{print $3,$4}'
BUFFER_PIN 89

Some wait events occur only a handful of times, while others show more
substantial counts.

> because there is very very little contention in the regression tests and we
> just don't do a whole lot intensive things in the tests. I suspect that at
> least some of the high events here will actually be due to tests that
> explicitly test the contention behaviour, and thus will have very high wait
> times.
Yeah, probably; I would need to check in more detail.

> E.g. if you measure client timings, the overhead here will be fairly low,
> because we're not going to be CPU bound by the back/forth between client and
> server, and thus many of the waits will be longer. If you instead measure a
> single client readonly pgbench, it'll look different. Similar, if you have
> lwlock contention in a real world workload, most of the waits will be
> incredibly short, but in our tests that will not necessarily be the case.

Agreed.

> For counters + timings I can see that it'd be useful. But i don't believe it's
> close to as cheap as you say it is.

Thanks for the confirmation.

> You don't need precise duration to see what wait events are a problem. If you
> see that some event is samples a lot you know it's because there either are a
> *lot* of those wait events or the wait events are entered into for a long
> time.

That's right, but tracking the duration would help to determine which case it
is: a *lot* of those wait events, or wait events that are entered into for a
long time.

That said, I also understand your concerns (and the ones shared by Robert in
this thread [1]), even if I do not necessarily agree with all of them, and I
also understand that you both think that there is room for instrumentation
improvements.

I'll try to think of other approaches (like the ones mentioned by Robert), as
I think that fully makes sense and they are probably even better (meaning they
provide more context). In the meantime, I'll put the wait events stats
approach on hold (as both of you are skeptical about it).

Regards,

[1]: https://www.postgresql.org/message-id/CA%2BTgmobptuUWo7X5zcQrWKh22qeAn4eL%2B%3Dwtb8_ajCOR%2B7_tcw%40mail.gmail.com

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On Wed, Jul 23, 2025 at 11:38:07AM -0400, Robert Haas wrote:
> On Tue, Jul 22, 2025 at 8:24 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > So based on the cycles metric I think it looks pretty safe to implement for the
> > whole majority of classes.
>
> I'm not convinced that this is either cheap enough to implement,

Thanks for looking at it and providing your thoughts!

> and I
> don't understand the value proposition, either. I see the first couple
> of messages in the thread say that this is important for
> troubleshooting problems in "our user base" (not sure about the
> antecedent of "our") but the description of what those problems are
> seems pretty vague to me.

Well, the idea was more that, since we are speaking about "wait" events, it
would make sense to add their duration. And then, to have meaningful data to
interpret the durations, it would make sense to add the counters. So that one
could answer questions like:

* Is the engine’s wait pattern the same over time?
* Is application "A"'s wait pattern the same over time?
* I observe a peak in wait event "W": is it because "W" is now waiting longer,
  or is it because it is hit more frequently?

> I think it's probably a mistake to even be thinking about this in
> terms of wait events.

It feels natural to add "wait time" to "wait events"; that is where this
started.

> For
> instance, if you asked "what information is useful to gather about
> heavyweight locks?" you might say "well, we'd like to know how many
> times we tried to acquire one, and how many of those times we had to
> wait, and how many of those times we waited for longer than
> deadlock_timeout".

Yes.

> And then you could design a facility to answer
> those questions.

I think the current proposal would provide a way to get some of those answers
and a way to "estimate" some others. But designing a dedicated facility to
answer all of them would be better.
> Or you might say "we'd like a history of all the
> different heavyweight locks that a certain backend has tried to
> acquire," and then you could design a tracing facility to give you
> that. Either of those hypothetical facilities involve providing more
> information than you would get from just counting wait events, or
> counting+timing wait events, or recording a complete history of all
> wait events.

Agreed.

> And, I would say that, for more or less that exact reason, there's a
> real use case for either of them. Maybe either or both are valuable
> enough to add and maybe they are not, and maybe the overhead is
> acceptable or maybe it isn't, but I think the arguments are much
> better than for a facility that just counts and times all the wait
> events.

If we add more knowledge per wait class/wait event by providing "dedicated"
facilities, I agree that would provide more value than just "counts and
durations".

> For instance, the former facility would let you answer the
> question "what percentage of lock acquisitions are contended?" whereas
> a pure wait event count just lets you answer the question "how many
> contended lock acquisitions were there?".

Right.

> The latter isn't useless,
> but the former is a lot better,

Agreed.

> I'm almost sure that measuring LWLock wait
> times is going to be too expensive to be practical,

In my lab it added 60 cycles; I'm not sure that is too expensive. But even if
we think it is, maybe we could provide an option to turn this "overhead"
on/off with a GUC or a compilation flag.

> and I don't really
> see why you'd want to: the right approach is sampling, which is cheap
> and in my experience highly effective.

I think the sampling is very useful too (that's why I created the pgsentinel
extension [1]). But the sampling and the current proposal do not answer the
same questions. Most (if not all) of the questions that I mentioned above
could not be answered accurately with the sampling approach.
> Measuring counts doesn't seem
> very useful either: knowing the number of times that somebody tried to
> acquire a relation lock or a tuple lock arguably tells you something
> about your workload that you might want to know, whereas I would argue
> that knowing the number of times that somebody tried to acquire a
> buffer lock doesn't really tell you anything at all.

If you add the duration to the mix, that's more useful. And if you add the
buffer relfile information to the mix, that's even more insightful: one could
spot hot buffers with this data in hand.

> What you might
> want to know is how many buffers you accessed, which is why we've
> already got a system for tracking that. That's not to say that there's
> nothing at all that somebody could want to know about LWLocks that you
> can't already find out today: for example, a system that identifies
> which buffers are experiencing significant buffer lock contention, by
> relation OID and block number, sounds quite handy.

Could not agree more.

> But just counting
> wait events, or counting and timing them, will not give you that.

Right, I had in mind a step-by-step approach: counters, timings, and then
providing more details for wait events, something like:

   pid   | wait_event_type |  wait_event  |                           infos
---------+-----------------+--------------+-------------------------------------------------------------
 2560105 | IO              | DataFileRead | {"blocknum" : "9272", "dbnode" : "5", "relnode" : "16407"}
 2560135 | IO              | WalSync      | {"segno" : "1", "tli" : "1"}
 2560138 | IO              | DataFileRead | {"blocknum" : "78408", "dbnode" : "5", "relnode" : "16399"}

But it also has its own challenges, as you and Andres pointed out on the
Discord server (in the PG #core-hacking channel).
> Knowing which SLRUs are being heavily used could also be useful, but I
> think there's a good argument that having some instrumentation that
> cuts across SLRU types and exposes a bunch of useful numbers for each
> could be more useful than just hoping you can figure it out from
> LWLock wait event counts.

Right, I'm all in for even more details.

> In short, I feel like just counting, or counting+timing, all the wait
> events is painting with too broad a brush. Wait events get inserted
> for a specific purpose: so you can know why a certain backend is
> off-CPU without having to use debugging tools. They serve that purpose
> pretty well, but that doesn't mean that they serve other purposes
> well, and I find it kind of hard to see the argument that just
> sticking a bunch of counters or timers in the same places where we put
> the wait event calls would be the right thing in general.
> Introspection is an important topic and, IMHO, deserves much more
> specific and nuanced thought about what we're trying to accomplish and
> how we're going about it.

I see what you mean, and I also think that instrumentation is an important
area that deserves improvement. The current proposal was maybe too "generic",
as you pointed out, and maybe wait events are not the best place to add this
instrumentation/detail. If we could agree on extra details/instrumentation per
area (like the few examples you provided), then I would be happy to work on
it. Your approach would probably take longer, but OTOH it would surely be even
more meaningful.

[1]: https://github.com/pgsentinel/pgsentinel

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Jul 24, 2025 at 7:52 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > Well, the idea was more: as we speak about "wait" events then it would make > sense to add their duration. And then, to have meaningful data to interpret the > durations then it would make sense to add the counters. So that one could answer > questions like: > > * Is the engine’s wait pattern the same over time? > * Is application’s "A" wait pattern the same over time? > * I observe a peak in wait event "W": is it because "W" is now waiting longer or > is it because it is hit more frequently? Forgive me if I'm beating a dead horse here, but I just don't believe the first two of these at all. If you want to know if the wait pattern is the same over time, take samples over various intervals and compare the samples. It's true that you might notice some differences by comparing counts that you wouldn't notice by comparing sample profiles, but that's true of ANY additional data that you choose to collect. It's not a good argument for collecting number-of-times-we-waited-on-each-wait-event specifically. Because (in my experience) counts of the number of times we hit each wait event point are nearly meaningless, if you detect a change by comparing those across time periods, all you will really know is that something changed. You won't really get any understanding of what it was. I think with a more careful choice of what data to gather, we can do better. The third one -- are the waits longer or are we waiting more often -- is a much better argument. I acknowledge that's something that a user might want to know. But I think it's worth drilling down into that a little bit more -- when and why would the user want to know that? I might be wrong here, but I feel like the most likely case where you would care is something having to do with I/O, which gets back to Andres's point about the AIO instrumentation. 
If it's strictly processes fighting over in-memory LWLocks, I don't expect the lock hold times to vary widely, and I'm not sure what the user is supposed to think about it or do about it if they do. I actually wrote some instrumentation kind of like this many years ago and it was useful for understanding some of the internal locking mechanics of PostgreSQL, so that I could think about whether things could be improved in the code, but I never had the sense that what I wrote on that occasion would have been useful to end-users. I don't know if you have any more detailed thoughts about this to share. > > I'm almost sure that measuring LWLock wait > > times is going to be too expensive to be practical, > > On my lab it added 60 cycles, I'm not sure that is too expensive. But even > if we think this is, maybe we could provide an option to turn this "overhead" off/on > with a GUC or compilation flag. I don't know exactly what you tested, but there's a huge difference between running a query normally and running it under EXPLAIN ANALYZE, and a lot of that is timing overhead. There are probably cases where queries never actually have to wait very much - e.g. all data in shared_buffers, no contention, maybe a bunch of CPU cycles going into computation rather than data access - but I think there will also be cases where there are a LOT of wait events. For instance, if data fits well into the OS cache but poorly into shared_buffers and you construct a workload that has to access a lot of buffers in quick succession, like say a nested loop over a parameterized inner index scan, then I would think we would just be doing a ton of extra timing operations compared to right now, and since the waits would be short (since the data is coming from the OS rather than the disk) I would expect it to hurt quite a bit. If not, then why does EXPLAIN need a TIMING OFF option? 
> > Measuring counts doesn't seem
> > very useful either: knowing the number of times that somebody tried to
> > acquire a relation lock or a tuple lock arguably tells you something
> > about your workload that you might want to know, whereas I would argue
> > that knowing the number of times that somebody tried to acquire a
> > buffer lock doesn't really tell you anything at all.
>
> If you add the duration to the mix, that's more useful. And if you add the buffer
> relfile information to the mix, that's even more insightful.

But also costly. If we create a system that has significantly more overhead
than wait events, we probably won't be able to have it enabled all the time,
and busy systems may not even be able to afford turning it on temporarily. The
cost of a system of this kind has a huge impact on how usable it actually is.

> One could spot hot buffers with this data in hand.

I actually don't believe we need duration to spot hot buffers. Some kind of
sampling approach seems like it should work fine. If you expose some kind of
information about buffer access in shared memory such that it can be sampled
(deliberate hand-waving here) and somebody takes 100 samples, with each sample
covering all backends in the system, over a period of 50 or 100 or 200 or 500
seconds, the hot buffers are going to pop right up to the top of the list. You
won't necessarily be able to see buffers that are just a tiny bit hot, but
with a decent number of samples you should be able to clearly see the stuff
that's really hot. People often say things to me like "well what if I just get
really unlucky and always miss what is truly the hottest buffer," but the law
of large numbers says you don't really need to worry about that as long as you
collect a sufficiently large number of samples, and that's not really
difficult to do.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On Thu, Jul 24, 2025 at 10:38:49AM -0400, Robert Haas wrote:
> On Thu, Jul 24, 2025 at 7:52 AM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > Well, the idea was more: as we speak about "wait" events then it would make
> > sense to add their duration. And then, to have meaningful data to interpret the
> > durations then it would make sense to add the counters. So that one could answer
> > questions like:
> >
> > * Is the engine’s wait pattern the same over time?
> > * Is application’s "A" wait pattern the same over time?
> > * I observe a peak in wait event "W": is it because "W" is now waiting longer or
> > is it because it is hit more frequently?
>
> The third one -- are the waits longer or are we waiting more often --
> is a much better argument. I acknowledge that's something that a user
> might want to know. But I think it's worth drilling down into that a
> little bit more -- when and why would the user want to know that? I
> might be wrong here, but I feel like the most likely case where you
> would care is something having to do with I/O, which gets back to
> Andres's point about the AIO instrumentation.

Yeah, I/O is definitely an area where exposing the duration makes sense.

> If it's strictly
> processes fighting over in-memory LWLocks, I don't expect the lock
> hold times to vary widely, and I'm not sure what the user is supposed
> to think about it or do about it if they do.

Yeah, I also think not all of them are worth tracking the duration for.

> I actually wrote some
> instrumentation kind of like this many years ago and it was useful for
> understanding some of the internal locking mechanics of PostgreSQL, so
> that I could think about whether things could be improved in the code,
> but I never had the sense that what I wrote on that occasion would
> have been useful to end-users. I don't know if you have any more
> detailed thoughts about this to share.
Focusing on LWLocks, I think that OidGenLock, for example, is one it would be
interesting to have the duration for. That way one could start to investigate
whether there is a very long run of used OID values with no gaps in TOAST
tables, or whether there is higher concurrency. I do agree that it might not
be useful for all the LWLocks (those for which users have no actionable
workaround/solution), but for some I think it could be useful (like the one
mentioned above).

Outside of LWLocks, I think it makes sense for heavyweight locks, to answer
questions like: is the locker holding a lock for longer?

> > On my lab it added 60 cycles, I'm not sure that is too expensive. But even
> > if we think this is, maybe we could provide an option to turn this "overhead" off/on
> > with a GUC or compilation flag.
>
> I don't know exactly what you tested,

I counted the cycles with:

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

calling this function in two ways:

W1: before and after incrementing the counters and collecting their durations
W2: in wait_start() and in wait_end()

Each of the above was done without the other one in place, and the deltas were
computed while running check-world. That way I generated percentiles for the
values measured in W1 (shared in [1]) and also per wait event (also shared in
[1]). P50 for the counter increments and durations is 146 cycles; if we remove
the estimated rdtsc() call overhead (say 80 cycles), that gives 66 cycles of
overhead.

[1]: https://www.postgresql.org/message-id/aH%2BDDoiLujt%2Bb9cV%40ip-10-97-1-34.eu-west-3.compute.internal

> but there's a huge difference
> between running a query normally and running it under EXPLAIN ANALYZE,
> and a lot of that is timing overhead. There are probably cases where
> queries never actually have to wait very much - e.g.
> all data in
> shared_buffers, no contention, maybe a bunch of CPU cycles going into
> computation rather than data access - but I think there will also be
> cases where there are a LOT of wait events.

I see: you are saying that a LOT of wait events could each add a small cycle
overhead, which could add up to much more than those 66 cycles. That's right;
that's why the idea was to add the counters and timings only on wait classes
for which the counters and durations overhead was relatively small compared to
the waits themselves.

From [1]:

"
Overhead on the lock class is about 0.03%
Overhead on the timeout class is less than 0.01%

and now we can also see that:

Overhead on the lwlock class is about 1%
Overhead on the client class is about 0.5%
Overhead on the bufferpin class is about 0.2%
"

> For instance, if data fits
> well into the OS cache but poorly into shared_buffers and you
> construct a workload that has to access a lot of buffers in quick
> succession, like say a nested loop over a parameterized inner index
> scan, then I would think we would just be doing a ton of extra timing
> operations compared to right now, and since the waits would be short
> (since the data is coming from the OS rather than the disk) I would
> expect it to hurt quite a bit. If not, then why does EXPLAIN need a
> TIMING OFF option?

That's right; that's why the reasoning shared in [1] was based on filtering
the wait classes by focusing on very small overhead (relative overhead, not
absolute).

> > If you add the duration to the mix, that's more useful. And if you add the buffer
> > relfile information to the mix, that's even more insightful.
>
> But also costly.

Right, everything that adds cycles adds cost. I think that comparing the
relative costs as done in [1] makes sense, though; that helps to categorize
the costs.

> > One could spot hot buffers with this data in hand.
>
> I actually don't believe we need duration to spot hot buffers.
> Some
> kind of sampling approach seems like it should work fine. If you
> expose some kind of information about buffer access in shared memory
> such that it can be sampled (deliberate hand-waving here) and somebody
> takes 100 samples, with each sample covering all backends in the
> system, over a period of 50 or 100 or 200 or 500 seconds, the hot
> buffers are going to pop right up to the top of the list.

That's right: the ones you would spot would be hot buffers, but I think it's
also possible that you miss some (the smaller the sampling interval, the
better, though). But still, without the durations you can't say whether the
ones holding the buffers are holding them longer than before or whether there
is an increase in concurrency. Between 2 samples that produced the same wait
on a buffer: has the wait event changed, or is it the same one? If it's the
latter, we could increment the duration by the sampling interval, but we just
don't know.

> People often say things to me like "well what
> if I just get really unlucky and always miss what is truly the hottest
> buffer," but the law of large numbers says you don't really need to
> worry about that as long as you collect a sufficiently large number of
> samples, and that's not really difficult to do.

Yeah, probably, but still, without the durations it's almost impossible to
know whether there is an increase in concurrency or whether the lock is held
for longer. Indeed, if the lock is held for longer, one might see an increase
in the number of waiters and "wrongly" interpret that as an increase in
concurrency.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Jul 28, 2025 at 8:35 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
> Focusing on LWLocks, I think that for example the OidGenLock one is one
> it would be interesting to have the duration for. That way one could start to
> investigate if there is a very long run of used OID values with no gaps in
> TOAST tables or if there is higher concurrency. I do agree that might not be
> useful for all the LWLocks (those for which the users have no actionable
> workaround/solutions) but for some I think it could be useful (like the one
> mentioned above).

Yeah, maybe. That's an interesting example. One could make the argument that
we ought to have a system that specifically tests for OID oversaturation, but
admittedly that feels pretty special-purpose. But on the other hand, I still
don't think timing every LWLock wait is going to work out well from a
performance standpoint.

> Outside of LWLocks, I think it makes sense for heavyweight locks to answer
> questions like: Is the locker holding a lock for longer?

I think the existing log_lock_waits does a pretty good job with this case.

> I see, you are saying that a LOT of wait events could add a LOT of little
> cycles overhead which could lead to much more overhead than those 66 cycles.

Yes.

> That's right, that's why the idea was to add the counters and timings only on
> wait classes for which the counters and durations overhead was relatively small
> compared to the waits themselves.
>
> From [1]:
>
> "
> Overhead on the lock class is about 0.03%
> Overhead on the timeout class is less than 0.01%
>
> and now we can also see that:
>
> Overhead on the lwlock class is about 1%
> Overhead on the client class is about 0.5%
> Overhead on the bufferpin class is about 0.2%
> "

I have a hard time reconciling this with the overhead of EXPLAIN ANALYZE,
which we know to be much larger than this.
To be fair, it's quite possible that we switch between executor nodes a lot
more often than we go off-CPU, so maybe the wait event instrumentation is just
a lot cheaper for that reason. But I'm still suspicious that something is
wrong with these measurements: that you're testing on a platform where timer
events are particularly cheap, or a workload where they are not as common as
in some workloads, or, I don't really know exactly.

Even if the overhead for measuring LWLock wait events is truly only 1%, I'm
skeptical about whether that's worth it for the average user. At that level,
it might be OK as an optional feature that people can enable if they need it.
But as I've said before, it's super-important that we don't get ourselves into
a situation where hackers don't add wait events to places that should have
them for fear of the overhead being too large. We have mostly avoided that
problem up until now, as far as I am aware.

> That's right, the ones you would spot would be hot buffers but I think that's
> also possible that you missed some (the smaller the sampling interval is, the
> better though).

But like ... who cares? Nobody needs an absolutely exhaustive list of buffers
above a certain level of hotness.

> But still without the durations you can't say if the ones holding the buffers are
> holding longer than before or if there is an increase of concurrency.

I bet you can make a pretty good guess by looking at how many active backends
have any sort of heavyweight lock on the relation in question. I have a lot of
difficulty imagining this as a real point of confusion in a customer
troubleshooting situation. I mean, theoretically, you could imagine a
situation where the concurrency on a relation is high but you can't tell
whether they're all focusing on the same buffers for shorter times or
different buffers for longer times. But I don't really see how this could
occur in practice.
If it's an index, the contention almost has to be on the root/upper-level pages; and if it's a heap you're either doing sequential scans and the access pattern is uniform or you're doing some kind of index scan and the contention is probably focused on the upper-level index pages rather than the heap pages. You could maybe imagine some crazy corner case where there's heap contention created by TID scans, but that's too obscure a situation to justify adding machinery of this kind. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On Wed, Jul 23, 2025 at 11:38:07AM -0400, Robert Haas wrote: > I think it's probably a mistake to even be thinking about this in > terms of wait events. It seems very reasonable to want more data about > what PostgreSQL backends are doing, but I don't really see any reason > why that should happen to line up with whatever wait events do. For > instance, if you asked "what information is useful to gather about > heavyweight locks?" you might say "well, we'd like to know how many > times we tried to acquire one, and how many of those times we had to > wait, and how many of those times we waited for longer than > deadlock_timeout". I did find this idea interesting, and gave it a stab in [1]. [1]: https://www.postgresql.org/message-id/aIyNxBWFCybgBZBS%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com