Discussion: Optimize LISTEN/NOTIFY
Hi hackers,

The current LISTEN/NOTIFY implementation is well-suited for use-cases like cache invalidation where many backends listen on the same channel. However, its scalability is limited when many backends listen on distinct channels. The root of the problem is that Async_Notify must signal every listening backend in the database, as it lacks central knowledge of which backend is interested in which channel. This results in an O(N) number of kill(pid, SIGUSR1) syscalls as the listener count grows.

The attached proof-of-concept patch proposes a straightforward optimization for the single-listener case. It introduces a shared-memory hash table mapping (dboid, channelname) to the ProcNumber of a single listener. When NOTIFY is issued, we first check this table. If a single listener is found, we signal only that backend. Otherwise, we fall back to the existing broadcast behavior.

The performance impact for this pattern is significant. A benchmark [1] measuring a NOTIFY "ping-pong" between two connections, while adding a variable number of idle listeners, shows the following:

master (8893c3a):
   0 extra listeners: 9126 TPS
  10 extra listeners: 6233 TPS
 100 extra listeners: 2020 TPS
1000 extra listeners:  238 TPS

0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener.patch:
   0 extra listeners: 9152 TPS
  10 extra listeners: 9352 TPS
 100 extra listeners: 9320 TPS
1000 extra listeners: 8937 TPS

As you can see, the patched version's performance is near O(1) with respect to the number of idle listeners, while the current implementation shows the expected O(N) degradation.

This patch is a first step. It uses a simple boolean has_multiple_listeners flag in the hash entry. Once a channel gets a second listener, this flag is set and, crucially, never cleared. The entry will then permanently indicate "multiple listeners", even after all backends on that channel disconnect.

A more complete solution would likely use reference counting for each channel's listeners. This would solve the "stuck entry" problem and could also enable a further optimization: targeted signaling to all listeners of a channel with multiple listeners, avoiding the database-wide broadcast entirely.

The patch also includes a "wake only tail" optimization (contributed by Marko Tikkaja) to help prevent backends from falling too far behind. Instead of waking all lagging backends at once and creating a "thundering herd", this logic signals only the single backend that is currently at the queue tail. This ensures the global queue tail can always advance, relying on a chain reaction to get backends caught up efficiently. This seems like a sensible improvement in its own right.

Thoughts?

/Joel

[1] Benchmark tool and full results: https://github.com/joelonsql/pg-bench-listen-notify
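To make the proposed data structure concrete, here is a minimal illustrative sketch (not the patch itself) of what the hash entry and the NOTIFY-side fast path could look like; the type and function names are hypothetical, and locking and error handling are omitted:

```c
#include "postgres.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
#include "utils/hsearch.h"

/* Hypothetical key and entry layout for the shared channel hash. */
typedef struct ListenChannelKey
{
	Oid			dboid;						/* database OID */
	char		channel[NAMEDATALEN];		/* channel name, NUL-padded */
} ListenChannelKey;

typedef struct ListenChannelEntry
{
	ListenChannelKey key;					/* hash key (must be first) */
	ProcNumber	listener;					/* the sole listener, if any */
	bool		has_multiple_listeners;		/* sticky "fall back to broadcast" flag */
} ListenChannelEntry;

/* NOTIFY-side fast path: returns true if a targeted signal was sent. */
static bool
TrySignalSingleListener(HTAB *channelHash, Oid dboid, const char *channel)
{
	ListenChannelKey key;
	ListenChannelEntry *entry;

	memset(&key, 0, sizeof(key));
	key.dboid = dboid;
	strlcpy(key.channel, channel, NAMEDATALEN);

	entry = (ListenChannelEntry *) hash_search(channelHash, &key, HASH_FIND, NULL);
	if (entry == NULL || entry->has_multiple_listeners)
		return false;			/* fall back to signaling every listener */

	/* Exactly one known listener: signal just that backend. */
	SendProcSignal(ProcGlobal->allProcs[entry->listener].pid,
				   PROCSIG_NOTIFY_INTERRUPT, entry->listener);
	return true;
}
```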
Attachments
"Joel Jacobson" <joel@compiler.org> writes: > The attached proof-of-concept patch proposes a straightforward > optimization for the single-listener case. It introduces a shared-memory > hash table mapping (dboid, channelname) to the ProcNumber of a single > listener. What does that do to the cost and parallelizability of LISTEN/UNLISTEN? > The patch also includes a "wake only tail" optimization (contributed by > Marko Tikkaja) to help prevent backends from falling too far behind. Coulda sworn we dealt with that case some years ago. In any case, if it's independent of the other idea it should probably get its own thread. regards, tom lane
On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> The attached proof-of-concept patch proposes a straightforward
>> optimization for the single-listener case. It introduces a shared-memory
>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>> listener.
>
> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?

Good point. The previous patch would effectively force all LISTEN/UNLISTEN to be serialized, which would at least hurt parallelizability. New benchmarks confirm this hypothesis.

A new patch is attached that combines two complementary approaches, which together seem to scale well for both common-channel and unique-channel scenarios:

1. Partitioned hash locking

The channel hash now uses HASH_PARTITION, with an array of NUM_NOTIFY_PARTITIONS lightweight locks. A given channel is mapped to a partition lock using a custom hash function on (dboid, channelname). This allows LISTEN/UNLISTEN operations on different channels to proceed concurrently without fighting over a single global lock, addressing the "many distinct channels" use-case.

2. Optimistic read-locking

For the "many backends on one channel" use-case, lock acquisition now follows a read-then-upgrade pattern. We first acquire a LW_SHARED lock to check the channel's state. If the channel is already marked as has_multiple_listeners, we can return immediately without any need for a write. Only if we are the first or second listener on a channel do we release the shared lock and acquire an LW_EXCLUSIVE lock to modify the hash entry. After getting the exclusive lock, we re-verify the state to guard against race conditions. This avoids serializing the third and all subsequent listeners for a popular channel.

BENCHMARK

https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_overview_connections_equal_jobs.png
https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_overview_fixed_connections.png

I didn't want to attach the images to this email because they are quite large, due to all the details in the images. However, since it's important that this mailing list contains all relevant data discussed, I've also included all data from the graphs formatted in ASCII/Markdown: performance_overview.md

I've also included the raw parsed data from the pgbench output, which has been used as input to create performance_overview.md as well as the images: pgbench_results_combined.csv

I've benchmarked five times per measurement, in random order. All raw measurements are included in the Markdown document within { curly braces }, sorted, next to the average values, to give an idea of the variance. Stddev felt possibly misleading since I'm not sure the data points are normally distributed, given that this is benchmarking data.

I've run the benchmarks on my MacBook Pro Apple M3 Max, using `caffeinate -dims pgbench ...`.

>> The patch also includes a "wake only tail" optimization (contributed by
>> Marko Tikkaja) to help prevent backends from falling too far behind.
>
> Coulda sworn we dealt with that case some years ago. In any case,
> if it's independent of the other idea it should probably get its
> own thread.

Maybe it's been dealt with by some other part of the system, but I can't find any such code anywhere; it's only async.c that currently sends PROCSIG_NOTIFY_INTERRUPT.
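As an illustration of the read-then-upgrade pattern, here is a hypothetical sketch (reusing the illustrative ListenChannelKey/ListenChannelEntry types sketched earlier in the thread; again, this is a sketch, not the patch, and partition lookup and error handling are omitted):

```c
/*
 * Illustrative sketch of optimistic read-then-upgrade locking for LISTEN,
 * assuming a partitioned hash protected by per-partition LWLocks.
 */
static void
RegisterListener(HTAB *channelHash, const ListenChannelKey *key,
				 LWLock *partitionLock, ProcNumber me)
{
	ListenChannelEntry *entry;
	bool		found;

	/* Fast path: a shared lock suffices if the channel is already "popular". */
	LWLockAcquire(partitionLock, LW_SHARED);
	entry = (ListenChannelEntry *) hash_search(channelHash, key, HASH_FIND, NULL);
	if (entry && entry->has_multiple_listeners)
	{
		LWLockRelease(partitionLock);
		return;					/* no shared state needs to change */
	}
	LWLockRelease(partitionLock);

	/* Slow path: we may be the first or second listener; take the lock exclusively. */
	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
	entry = (ListenChannelEntry *) hash_search(channelHash, key, HASH_ENTER, &found);
	if (!found)
	{
		/* First listener on this channel. */
		entry->listener = me;
		entry->has_multiple_listeners = false;
	}
	else if (!entry->has_multiple_listeners && entry->listener != me)
	{
		/* Second listener: flip the sticky flag (state re-checked under the lock). */
		entry->has_multiple_listeners = true;
	}
	LWLockRelease(partitionLock);
}
```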
The wake only tail mechanism seems almost perfect, but I can think of at least one edge case where we could still get a problem situation: with lots of idle backends, the rate of this one-by-one catch-up may not be fast enough to outpace the queue's advancement, causing other idle backends to eventually lag by more than the QUEUE_CLEANUP_DELAY threshold.

To ensure all backends are eventually processed without re-introducing the thundering herd problem, an additional mechanism seems necessary. I see two main options:

1. Extend the chain reaction

Once woken, a backend could signal the next backend at the queue tail, propagating the catch-up process. This would need to be managed carefully, perhaps with some kind of global advisory lock, to prevent multiple cascades from running at once.

2. Centralize the work

We already have the autovacuum daemon; maybe it could also be made responsible for kicking lagging backends?

Other ideas?

/Joel

Attached:

* pgbench-scripts.tar.gz
  pgbench scripts to reproduce the results, report and images.
* performance_overview.md
  Same results as in the images, but in ASCII/Markdown format.
* pgbench_results_combined.csv
  Parsed output from pgbench runs, used to create performance_overview.md as well as the linked images.
* 0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener-v2.patch
  Old patch, just renamed to -v2.
* 0002-Partition-channel-hash-to-improve-LISTEN-UNLISTEN-v2.patch
  New patch with the approach explained above.
Attachments
On Tue, Jul 15, 2025, at 09:20, Joel Jacobson wrote:
> On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
>> "Joel Jacobson" <joel@compiler.org> writes:
>>> The attached proof-of-concept patch proposes a straightforward
>>> optimization for the single-listener case. It introduces a shared-memory
>>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>>> listener.
>>
>> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
>
> Good point. The previous patch would effectively force all LISTEN/UNLISTEN
> to be serialized, which would at least hurt parallelizability.
>
> New benchmarks confirm this hypothesis.
>
> New patch attached that combines two complementary approaches, which together
> seem to scale well for both common-channel and unique-channel scenarios:

Thanks to the FreeBSD animal failing, I see I made a shared memory blunder. New squashed patch attached.

/Joel
Attachments
On Tue, Jul 15, 2025, at 22:56, Joel Jacobson wrote:
> On Tue, Jul 15, 2025, at 09:20, Joel Jacobson wrote:
>> On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
>>> "Joel Jacobson" <joel@compiler.org> writes:
>>>> The attached proof-of-concept patch proposes a straightforward
>>>> optimization for the single-listener case. It introduces a shared-memory
>>>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>>>> listener.
>>>
>>> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
>>
>> Good point. The previous patch would effectively force all LISTEN/UNLISTEN
>> to be serialized, which would at least hurt parallelizability.
>>
>> New benchmarks confirm this hypothesis.
>>
>> New patch attached that combines two complementary approaches, which together
>> seem to scale well for both common-channel and unique-channel scenarios:
>
> Thanks to the FreeBSD animal failing, I see I made a shared memory blunder.
> New squashed patch attached.
>
> /Joel
> Attachments:
> * 0001-Subject-Optimize-LISTEN-NOTIFY-signaling-for-scalabi-v3.patch

(cfbot is not picking up my patch; I wonder if some filename length limit is exceeded. Trying a shorter filename; apologies for spamming.)

/Joel
Attachments
Hi Joel,

Thanks for sharing the patch. I have a few questions based on a cursory first look.

> If a single listener is found, we signal only that backend.
> Otherwise, we fall back to the existing broadcast behavior.

The idea of not wanting to wake up all backends makes sense to me, but I don't understand why we want this optimization only for the case where there is a single backend listening on a channel.

Is there a pattern of usage in LISTEN/NOTIFY where users typically have either just one or several backends listening on a channel?

If we are doing this optimization, why not maintain a list of backends for each channel, and only wake up those backends?

Thanks,
Rishu
On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
> Hi Joel,
>
> Thanks for sharing the patch.
> I have a few questions based on a cursory first look.
>
>> If a single listener is found, we signal only that backend.
>> Otherwise, we fall back to the existing broadcast behavior.
>
> The idea of not wanting to wake up all backends makes sense to me,
> but I don't understand why we want this optimization only for the case
> where there is a single backend listening on a channel.
>
> Is there a pattern of usage in LISTEN/NOTIFY where users typically
> have either just one or several backends listening on a channel?
>
> If we are doing this optimization, why not maintain a list of backends
> for each channel, and only wake up those backends?

Thanks for the thoughtful question. You've hit on the central design trade-off in this optimization: how to provide targeted signaling for some workloads without degrading performance for others.

While we don't have telemetry on real-world usage patterns of LISTEN/NOTIFY, it seems likely that most applications fall into one of three categories, which I've been thinking of in networking terms:

1. Broadcast-style ("hub mode")

Many backends listening on the *same* channel (e.g., for cache invalidation). The current implementation is already well-optimized for this, behaving like an Ethernet hub that broadcasts to all ports. Waking all listeners is efficient because they all need the message.

2. Targeted notifications ("switch mode")

Each backend listens on its own private channel (e.g., for session events or worker queues). This is where the current implementation scales poorly, as every NOTIFY wakes up all listeners regardless of relevance. My patch is designed to make this behave like an efficient Ethernet switch.

3. Selective multicast-style ("group mode")

A subset of backends shares a channel, but not all. This is the tricky middle ground.

Your question, "why not maintain a list of backends for each channel, and only wake up those backends?", is exactly the right one to ask. A full listener list seems like the obvious path to optimizing for *all* cases. However, the devil is in the details of concurrency and performance. Managing such a list would require heavier locking, which would create a new bottleneck and degrade the scalability of LISTEN/UNLISTEN operations, especially for the "hub mode" case where many backends rapidly subscribe to the same popular channel.

This patch makes a deliberate architectural choice: prioritize a massive, low-risk win for "switch mode" while rigorously protecting the performance of "hub mode". It introduces a targeted fast path for single-listener channels and cleanly falls back to the existing, well-performing broadcast model for everything else.

This brings us back to "group mode", which remains an open optimization problem. A possible approach could be to track listeners up to a small threshold *K* (e.g., store up to 4 ProcNumbers in the hash entry). If the count exceeds *K*, we would flip a "broadcast" flag and revert to hub-mode behavior. However, this path has critical drawbacks:

1. Performance penalty for hub mode

With the current patch, after the second listener joins a channel, the has_multiple_listeners flag is set. Every subsequent listener can acquire a shared lock, see the flag is true, and immediately continue. This is a highly concurrent, read-only operation that does not require mutating shared state.

In contrast, the K-listener approach would force every new listener (from the third up to the K-th) to acquire an exclusive lock to mutate the shared listener array. This would serialize LISTEN operations on popular channels, creating the very contention point this patch successfully avoids, and directly harming the hub-mode use case that currently works well.

2. Uncertainty

Compounding this, without clear data on typical "group" sizes, choosing a value for *K* is a shot in the dark. A small *K* might not help much, while a large *K* would increase the shared memory footprint and worsen the serialization penalty.

For these reasons, attempting to build a switch that also optimizes for multicast risks undermining the architectural clarity and performance of both the switch and hub models. This patch, therefore, draws a clean line. It provides a precise, low-cost path for switch-mode workloads and preserves the existing, well-performing path for hub-mode workloads. While this leaves "group mode" unoptimized for now, it ensures we make two common use cases better without making any use case worse. The new infrastructure is flexible, leaving the door open should a better approach for "group mode" emerge in the future, one that doesn't compromise the other two.

Benchmarks updated, showing master vs 0001-optimize_listen_notify-v3.patch:

https://github.com/joelonsql/pg-bench-listen-notify/raw/master/plot.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_connections_equal_jobs.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_fixed_connections.png

I've not included the benchmark CSV data in this mail, since it's quite heavy, 160 kB, and I couldn't see any significant performance changes since v2.

/Joel
On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
> If we are doing this optimization, why not maintain a list of backends
> for each channel, and only wake up those backends?

Thanks for contributing a great idea; it actually turned out to work really well in practice!

The attached new v4 of the patch implements your multicast idea:

---
Improve NOTIFY scalability with multicast signaling

Previously, NOTIFY would signal all listening backends in a database for any channel with more than one listener. This broadcast approach scales poorly for workloads that rely on targeted notifications to small groups of backends, as every NOTIFY could wake up many unrelated processes.

This commit introduces a multicast signaling optimization to improve scalability for such use-cases. A new GUC, `notify_multicast_threshold`, is added to control the maximum number of listeners to track per channel. When a NOTIFY is issued, if the number of listeners is at or below this threshold, only those specific backends are signaled. If the limit is exceeded, the system falls back to the original broadcast behavior.

The default for this threshold is set to 16. Benchmarks show this provides a good balance, with significant performance gains for small to medium-sized listener groups and diminishing returns for higher values. Setting the threshold to 0 disables multicast signaling, forcing a fallback to the broadcast path for all notifications.

To implement this, a new partitioned hash table is introduced in shared memory to track listeners. Locking is managed with an optimistic read-then-upgrade pattern. This allows concurrent LISTEN/UNLISTEN operations on *different* channels to proceed in parallel, as they will only acquire locks on their respective partitions.

For correctness and to prevent deadlocks, a strict lock ordering hierarchy (NotifyQueueLock before any partition lock) is observed. The signaling path in NOTIFY must acquire the global NotifyQueueLock first before consulting the partitioned hash table, which serializes concurrent NOTIFYs. The primary concurrency win is for LISTEN/UNLISTEN operations, which are now much more scalable.

The "wake only tail" optimization, which signals backends that are far behind in the queue, is also included to ensure the global queue tail can always advance.

Thanks to Rishu Bagga for the multicast idea.
---

BENCHMARK

To find the optimal default notify_multicast_threshold value, I created a new benchmark tool that spawns one "ping" worker that sends notifications to a channel, and multiple "pong" workers that listen on channels and all immediately reply back to the "ping" worker; when all replies have been received, the cycle repeats. By measuring how many complete round-trips can be performed per second, it evaluates the impact of different multicast threshold settings.

The results below show the effect of setting notify_multicast_threshold just below, or exactly at, the number of backends per channel, to compare broadcast vs multicast for different sizes of multicast groups (where 1 corresponds to the old single-listener targeted mode that earlier patch versions specifically optimized for).
K = notify_multicast_threshold

With 2 backends per channel (32 channels total):
  patch-v4 (K=1):  8,477 TPS
  patch-v4 (K=2): 27,748 TPS (3.3x improvement)

With 4 backends per channel (16 channels total):
  patch-v4 (K=1):  7,367 TPS
  patch-v4 (K=4): 18,777 TPS (2.6x improvement)

With 8 backends per channel (8 channels total):
  patch-v4 (K=1): 5,892 TPS
  patch-v4 (K=8): 8,620 TPS (1.5x improvement)

With 16 backends per channel (4 channels total):
  patch-v4 (K=1):  4,202 TPS
  patch-v4 (K=16): 4,750 TPS (1.1x improvement)

I also reran the old ping-pong as well as the pgbench benchmarks, and I couldn't detect any negative impact, testing with notify_multicast_threshold {1, 8, 16}.

Ping-pong benchmark:

Extra Connections: 0
-------------------------------------------------------------------------------------
Version             Max TPS    vs Master    All Values (sorted)
-------------------------------------------------------------------------------------
master                 9119    baseline     {9088, 9095, 9119}
patch-v4 (t=1)         9116    -0.0%        {9082, 9090, 9116}
patch-v4 (t=8)         9106    -0.2%        {9086, 9102, 9106}
patch-v4 (t=16)        9134    +0.2%        {9082, 9116, 9134}

Extra Connections: 10
-------------------------------------------------------------------------------------
Version             Max TPS    vs Master    All Values (sorted)
-------------------------------------------------------------------------------------
master                 6237    baseline     {6224, 6227, 6237}
patch-v4 (t=1)         9358    +50.0%       {9302, 9345, 9358}
patch-v4 (t=8)         9348    +49.9%       {9266, 9312, 9348}
patch-v4 (t=16)        9408    +50.8%       {9339, 9407, 9408}

Extra Connections: 100
-------------------------------------------------------------------------------------
Version             Max TPS    vs Master    All Values (sorted)
-------------------------------------------------------------------------------------
master                 2028    baseline     {2026, 2027, 2028}
patch-v4 (t=1)         9278    +357.3%      {9222, 9235, 9278}
patch-v4 (t=8)         9227    +354.8%      {9184, 9207, 9227}
patch-v4 (t=16)        9250    +355.9%      {9180, 9243, 9250}

Extra Connections: 1000
-------------------------------------------------------------------------------------
Version             Max TPS    vs Master    All Values (sorted)
-------------------------------------------------------------------------------------
master                  239    baseline     {239, 239, 239}
patch-v4 (t=1)         8841    +3594.1%     {8819, 8840, 8841}
patch-v4 (t=8)         8835    +3591.7%     {8802, 8826, 8835}
patch-v4 (t=16)        8855    +3599.8%     {8787, 8843, 8855}

Among my pgbench benchmarks, results seem unaffected in these benchmarks:

listen_unique.sql
listen_common.sql
listen_unlisten_unique.sql
listen_unlisten_common.sql

The listen_notify_unique.sql benchmark shows similar improvements for all notify_multicast_threshold values tested, which is expected, since this benchmark uses unique channels, so a higher notify_multicast_threshold shouldn't affect the results, which it didn't:

# TEST `listen_notify_unique.sql`

```sql
LISTEN channel_:client_id;
NOTIFY channel_:client_id;
```

## 1 Connection, 1 Job

- **master**: 63696 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 63377 TPS (-0.5%)
- **optimize_listen_notify_v4 (t=8.0)**: 62890 TPS (-1.3%)
- **optimize_listen_notify_v4 (t=16.0)**: 63114 TPS (-0.9%)

## 2 Connections, 2 Jobs

- **master**: 90967 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 109423 TPS (+20.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 109107 TPS (+19.9%)
- **optimize_listen_notify_v4 (t=16.0)**: 109608 TPS (+20.5%)

## 4 Connections, 4 Jobs

- **master**: 114333 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 140986 TPS (+23.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 141263 TPS (+23.6%)
- **optimize_listen_notify_v4 (t=16.0)**: 141327 TPS (+23.6%)

## 8 Connections, 8 Jobs

- **master**: 64429 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 93787 TPS (+45.6%)
- **optimize_listen_notify_v4 (t=8.0)**: 93828 TPS (+45.6%)
- **optimize_listen_notify_v4 (t=16.0)**: 93875 TPS (+45.7%)

## 16 Connections, 16 Jobs

- **master**: 41704 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 84791 TPS (+103.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 88330 TPS (+111.8%)
- **optimize_listen_notify_v4 (t=16.0)**: 84827 TPS (+103.4%)

## 32 Connections, 32 Jobs

- **master**: 25988 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 83197 TPS (+220.1%)
- **optimize_listen_notify_v4 (t=8.0)**: 83453 TPS (+221.1%)
- **optimize_listen_notify_v4 (t=16.0)**: 83576 TPS (+221.6%)

## 1000 Connections, 1 Job

- **master**: 105 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3097 TPS (+2852.1%)
- **optimize_listen_notify_v4 (t=8.0)**: 3079 TPS (+2835.1%)
- **optimize_listen_notify_v4 (t=16.0)**: 3080 TPS (+2835.9%)

## 1000 Connections, 2 Jobs

- **master**: 108 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2981 TPS (+2671.7%)
- **optimize_listen_notify_v4 (t=8.0)**: 3091 TPS (+2774.4%)
- **optimize_listen_notify_v4 (t=16.0)**: 3097 TPS (+2779.6%)

## 1000 Connections, 4 Jobs

- **master**: 105 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2947 TPS (+2705.5%)
- **optimize_listen_notify_v4 (t=8.0)**: 2994 TPS (+2751.0%)
- **optimize_listen_notify_v4 (t=16.0)**: 2992 TPS (+2748.7%)

## 1000 Connections, 8 Jobs

- **master**: 107 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3064 TPS (+2777.0%)
- **optimize_listen_notify_v4 (t=8.0)**: 2981 TPS (+2698.5%)
- **optimize_listen_notify_v4 (t=16.0)**: 2979 TPS (+2696.8%)

## 1000 Connections, 16 Jobs

- **master**: 101 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3068 TPS (+2923.2%)
- **optimize_listen_notify_v4 (t=8.0)**: 2950 TPS (+2806.4%)
- **optimize_listen_notify_v4 (t=16.0)**: 2940 TPS (+2796.8%)

## 1000 Connections, 32 Jobs

- **master**: 102 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2980 TPS (+2815.0%)
- **optimize_listen_notify_v4 (t=8.0)**: 3034 TPS (+2867.9%)
- **optimize_listen_notify_v4 (t=16.0)**: 2962 TPS (+2798.0%)

Here are some plots that include the above results:

https://github.com/joelonsql/pg-bench-listen-notify/raw/master/plot-v4.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_connections_equal_jobs-v4.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_fixed_connections-v4.png

/Joel
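If it helps review, here is a rough sketch of how the multicast decision could look on the NOTIFY side, assuming an entry that tracks up to the threshold's worth of ProcNumbers plus an overflow flag; the struct and function names are made up for illustration and do not necessarily match the patch:

```c
#include "postgres.h"
#include "storage/proc.h"
#include "storage/procsignal.h"

int			notify_multicast_threshold = 16;	/* the patch's GUC, default 16 */

#define NOTIFY_MULTICAST_MAX 16

typedef struct MulticastChannelEntry
{
	Oid			dboid;
	char		channel[NAMEDATALEN];
	int			nlisteners;						/* valid slots in listeners[] */
	bool		overflowed;						/* more listeners than we track */
	ProcNumber	listeners[NOTIFY_MULTICAST_MAX];
} MulticastChannelEntry;

/* Returns true if targeted signaling was possible, false to broadcast. */
static bool
MulticastSignal(const MulticastChannelEntry *entry)
{
	if (entry == NULL || entry->overflowed ||
		entry->nlisteners > notify_multicast_threshold)
		return false;			/* fall back to waking every listening backend */

	for (int i = 0; i < entry->nlisteners; i++)
	{
		ProcNumber	procno = entry->listeners[i];

		SendProcSignal(ProcGlobal->allProcs[procno].pid,
					   PROCSIG_NOTIFY_INTERRUPT, procno);
	}
	return true;
}
```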
Attachments
On Thu, Jul 17, 2025, at 09:43, Joel Jacobson wrote:
> On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
>> If we are doing this optimization, why not maintain a list of backends
>> for each channel, and only wake up those backends?
>
> Thanks for contributing a great idea, it actually turned out to work
> really well in practice!
>
> The attached new v4 of the patch implements your multicast idea:

Hi hackers,

While my previous attempts at $subject have only focused on optimizing the multi-channel scenario, I thought it would be really nice if LISTEN/NOTIFY could be optimized in the general case, benefiting all users, including those who just listen on a single channel. To my surprise, this was not only possible, but actually quite simple.

The main idea in this patch is to introduce an atomic state machine, with three states, IDLE, SIGNALLED, and PROCESSING, so that we don't interrupt backends that are already in the process of catching up.

Thanks to Thomas Munro for making me aware of his, Heikki Linnakangas's, and others' work in the "Interrupts vs signals" [1] thread. Maybe my patch is redundant due to their patch set; I'm not really sure? Their patch seems to refactor the underlying wakeup mechanism. It replaces the old, complex chain of events (SIGUSR1 signal -> handler -> flag -> latch) with a single, direct function call: SendInterrupt(). For async.c, this seems to be a low-level plumbing change that simplifies how a notification wakeup is delivered. My patch optimizes the high-level notification protocol. It introduces a state machine (IDLE, SIGNALLED, PROCESSING) to only signal backends when needed.

In their patch, in async.c's SignalBackends(), they do SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't seem to check if the backend is already signalled or not, but maybe SendInterrupt() has signal coalescing built in, so it would be a no-op with almost no cost?

I'm happy to rebase my LISTEN/NOTIFY work on top of [1], but I could also see benefits of doing the opposite. I'm also happy to help with benchmarking of your work in [1].

Note that this patch doesn't contain the hash table to keep track of listeners per backend, as proposed in earlier patches. I will propose such a patch again later, but first we need to figure out if I should rebase onto [1] or master (HEAD).

--- PATCH ---

Optimize NOTIFY signaling to avoid redundant backend signals

Previously, a NOTIFY would send SIGUSR1 to all listening backends, which could lead to a "thundering herd" of redundant signals under high traffic.

To address this inefficiency, this patch replaces the simple volatile notifyInterruptPending flag with a per-backend atomic state machine, stored in asyncQueueControl->backend[i].state. This state variable can be in one of three states: IDLE (awaiting signal), SIGNALLED (signal received, work pending), or PROCESSING (actively reading the queue).

From the notifier's perspective, SignalBackends now uses an atomic compare-and-swap (CAS) to transition a listener from IDLE to SIGNALLED. Only on a successful transition is a signal sent. If the listener is already SIGNALLED, or another notifier wins the race, no redundant signal is sent. If the listener is in the PROCESSING state, the notifier will also transition it to SIGNALLED to ensure the listener re-scans the queue after its current work is done.

On the listener side, ProcessIncomingNotify first transitions its state from SIGNALLED to PROCESSING. After reading notifications, it attempts to transition from PROCESSING back to IDLE. If this CAS fails, it means a new notification arrived during processing and a notifier has already set the state back to SIGNALLED. The listener then simply re-latches itself to process the new notifications, avoiding a tight loop.

The primary benefit is a significant reduction in syscall overhead and unnecessary kernel wakeups in high-traffic scenarios. This dramatically improves performance for workloads with many concurrent notifiers. Benchmarks show a substantial increase in NOTIFY-only transaction throughput, with gains exceeding 200% at higher concurrency levels.

 src/backend/commands/async.c | 209 ++++++++++++++++++++++++++++++-----------
 src/backend/tcop/postgres.c  |   4 ++--
 src/include/commands/async.h |   4 +++-
 3 files changed, 185 insertions(+), 32 deletions(-)

--- BENCHMARK ---

The attached benchmark script does LISTEN on one connection, and then uses pgbench to send NOTIFY on a varying number of connections and jobs, to cause a high procsignal load.

I've run the benchmark on my MacBook Pro M3 Max, 10 seconds per run, 3 runs. (I reused the same benchmark script as in the other thread, "Optimize ProcSignal to avoid redundant SIGUSR1 signals".)

 Connections=Jobs | TPS (master) | TPS (patch) | Relative Diff (%) | StdDev (master) | StdDev (patch)
------------------+--------------+-------------+-------------------+-----------------+----------------
                1 |       118833 |      151510 |            27.50% |             484 |            923
                2 |       156005 |      239051 |            53.23% |            3145 |           1596
                4 |       177351 |      250910 |            41.48% |            4305 |           4891
                8 |       116597 |      171944 |            47.47% |            1549 |           2752
               16 |        40835 |      165482 |           305.25% |            2695 |           2825
               32 |        37940 |      145150 |           282.58% |            2533 |           1566
               64 |        35495 |      131836 |           271.42% |            1837 |            573
              128 |        40193 |      121333 |           201.88% |            2254 |            874
(8 rows)

/Joel

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B3MkS21yK4jL4cgZywdnnGKiBg0jatoV6kzaniBmcqbQ%40mail.gmail.com
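To make the state machine easier to follow, here is a compressed, hypothetical sketch of the two sides using PostgreSQL's atomics. It is written as if it lived in async.c (so it can call asyncQueueReadAllNotifications()); the state constants are invented names, and unlike the actual patch, which re-latches instead, this sketch simply loops when the final CAS fails:

```c
#include "postgres.h"
#include "port/atomics.h"
#include "storage/proc.h"
#include "storage/procsignal.h"

/* Per-listener wakeup state in shared memory (names are illustrative). */
#define ASYNC_STATE_IDLE		0	/* awaiting a signal */
#define ASYNC_STATE_SIGNALLED	1	/* signal sent, work pending */
#define ASYNC_STATE_PROCESSING	2	/* currently reading the queue */

/* Notifier side: send a signal only on a successful IDLE -> SIGNALLED CAS. */
static void
MaybeSignalListener(pg_atomic_uint32 *state, pid_t pid, ProcNumber procno)
{
	uint32		expected = ASYNC_STATE_IDLE;

	if (pg_atomic_compare_exchange_u32(state, &expected, ASYNC_STATE_SIGNALLED))
		SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procno);
	else if (expected == ASYNC_STATE_PROCESSING)
	{
		/*
		 * Listener is busy; mark that it must re-scan when done.  (A real
		 * implementation would handle a concurrent state change here.)
		 */
		pg_atomic_compare_exchange_u32(state, &expected, ASYNC_STATE_SIGNALLED);
	}
	/* If it was already SIGNALLED, another notifier has done the work. */
}

/* Listener side, conceptually called from ProcessIncomingNotify(). */
static void
ReadNotificationsWithStateMachine(pg_atomic_uint32 *state)
{
	uint32		expected;

	do
	{
		expected = ASYNC_STATE_SIGNALLED;
		pg_atomic_compare_exchange_u32(state, &expected, ASYNC_STATE_PROCESSING);

		asyncQueueReadAllNotifications();	/* existing async.c routine */

		/* If a notifier flipped us back to SIGNALLED meanwhile, go again. */
		expected = ASYNC_STATE_PROCESSING;
	} while (!pg_atomic_compare_exchange_u32(state, &expected, ASYNC_STATE_IDLE));
}
```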
Attachments
On Wed, Jul 23, 2025 at 1:39 PM Joel Jacobson <joel@compiler.org> wrote:
> In their patch, in async.c's SignalBackends(), they do
> SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
> SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
> seem to check if the backend is already signalled or not, but maybe
> SendInterrupt() has signal coalescing built in, so it would be a no-op
> with almost no cost?

Yeah:

+    old_pending = pg_atomic_fetch_or_u32(&proc->pendingInterrupts, interruptMask);
+
+    /*
+     * If the process is currently blocked waiting for an interrupt to arrive,
+     * and the interrupt wasn't already pending, wake it up.
+     */
+    if ((old_pending & (interruptMask | SLEEPING_ON_INTERRUPTS)) == SLEEPING_ON_INTERRUPTS)
+        WakeupOtherProc(proc);
On Wed, Jul 23, 2025, at 04:44, Thomas Munro wrote:
> On Wed, Jul 23, 2025 at 1:39 PM Joel Jacobson <joel@compiler.org> wrote:
>> In their patch, in async.c's SignalBackends(), they do
>> SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
>> SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
>> seem to check if the backend is already signalled or not, but maybe
>> SendInterrupt() has signal coalescing built in, so it would be a no-op
>> with almost no cost?
>
> Yeah:
>
> +    old_pending = pg_atomic_fetch_or_u32(&proc->pendingInterrupts, interruptMask);
> +
> +    /*
> +     * If the process is currently blocked waiting for an interrupt to arrive,
> +     * and the interrupt wasn't already pending, wake it up.
> +     */
> +    if ((old_pending & (interruptMask | SLEEPING_ON_INTERRUPTS)) == SLEEPING_ON_INTERRUPTS)
> +        WakeupOtherProc(proc);

Thanks for confirming the coalescing logic in SendInterrupt. That's a great low-level optimization.

It's clear we're both targeting the same problem of redundant wake-ups under contention, but approaching it from different architectural levels. The core difference, as I see it, is *where* the state management resides. The "Interrupts vs signals" patch set creates a unified machinery where the 'pending' state for all subsystems is combined into a single atomic bitmask. This is a valid approach.

However, I've been exploring an alternative pattern that decouples the state management from the signaling machinery, allowing each subsystem to manage its own state independently. I believe this leads to a simpler, more modular migration path. I've developed a two-patch series for `async.c` to demonstrate this concept.

1. The first patch introduces a lock-free, atomic finite state machine (FSM) entirely within async.c. By using a subsystem-specific atomic integer and CAS operations, async.c can now robustly manage its own listener states (IDLE, SIGNALLED, PROCESSING). This solves the redundant signal problem at the source, as notifiers can now observe a listener's state and refrain from sending a wakeup if one is already pending.

2. The second patch demonstrates that once state is managed locally, the wakeup mechanism becomes trivial. The expensive `SendProcSignal` call is replaced with a direct `SetLatch`. This leverages the existing, highly-optimized `WaitEventSet` infrastructure as a simple, efficient "poke."

This suggests a powerful, incremental migration pattern: first, fix a subsystem's state management internally; second, replace its wakeup mechanism. This vertical, module-by-module approach seems complementary to the horizontal, layer-by-layer refactoring in the "Interrupts vs signals" thread. I'll post a more detailed follow-up in that thread to discuss the broader architectural implications.

Attached are the two patches, reframed to better illustrate this two-step pattern.

/Joel
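For reference, the second step can be this small. Here is a hypothetical sketch of the latch-based "poke"; the field and function calls follow existing PostgreSQL APIs, but the helper itself is invented for illustration:

```c
#include "postgres.h"
#include "storage/latch.h"
#include "storage/proc.h"

/* Wake a listening backend without going through ProcSignal/SIGUSR1. */
static void
PokeListener(ProcNumber procno)
{
	PGPROC	   *proc = &ProcGlobal->allProcs[procno];

	SetLatch(&proc->procLatch);	/* wakes the backend's WaitEventSet wait */
}
```

A backend blocked in WaitLatch()/WaitEventSetWait() then notices the latch, checks its async state, and reads the queue, with no signal handler involved.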
Attachments
On Thu, Jul 24, 2025, at 23:03, Joel Jacobson wrote:
> * 0001-Optimize-LISTEN-NOTIFY-signaling-with-a-lock-free-at.patch
> * 0002-Optimize-LISTEN-NOTIFY-wakeup-by-replacing-signal-wi.patch

I'm withdrawing the latest patches, since they won't fix the scalability problems, but only provide some performance improvements by eliminating redundant IPC signalling. This could also be improved outside of async.c, by optimizing ProcSignal [1] or by removing ProcSignal, as "Interrupts vs Signals" [2] is working on.

There seem to be two different scalability problems, which appear to be orthogonal:

First, there is the thundering herd problem that I tried to solve initially in this thread, by introducing a hash table in shared memory to keep track of which backends listen to which channels, to avoid immediate wakeup of all listening backends for every notification.

Second, there is the heavyweight lock in PreCommit_Notify(), which prevents parallelism of NOTIFY. Tom Lane has an idea [3] on how to improve this.

My perf+pgbench experiments indicate that which of these two scalability problems is the bottleneck depends on the workload.

I think the idea of keeping track of channels per backend has merit, but I want to take a step back and see what others think about the idea first.

I guess my main question is whether we should fix one problem first, then the other, both at the same time, or only one or the other?

I've attached some benchmarks using pgbench and running postgres under perf, which I hope can provide some insights.

/Joel

[1] https://www.postgresql.org/message-id/flat/a0b12a70-8200-4bd4-9e24-56796314bdce%40app.fastmail.com
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B3MkS21yK4jL4cgZywdnnGKiBg0jatoV6kzaniBmcqbQ%40mail.gmail.com
[3] https://www.postgresql.org/message-id/1878165.1752858390%40sss.pgh.pa.us
Attachments
[ getting back to this... ]

"Joel Jacobson" <joel@compiler.org> writes:
> I'm withdrawing the latest patches, since they won't fix the scalability
> problems, but only provide some performance improvements by eliminating
> redundant IPC signalling. This could also be improved outside of
> async.c, by optimizing ProcSignal [1] or by removing ProcSignal, as
> "Interrupts vs Signals" [2] is working on.

> There seem to be two different scalability problems, which appear to be
> orthogonal:

> First, there is the thundering herd problem that I tried to solve initially
> in this thread, by introducing a hash table in shared memory to keep
> track of which backends listen to which channels, to avoid immediate
> wakeup of all listening backends for every notification.

> Second, there is the heavyweight lock in PreCommit_Notify(), which prevents
> parallelism of NOTIFY. Tom Lane has an idea [3] on how to improve this.

I concur that these are orthogonal issues, but I don't understand why you withdrew your patches --- don't they constitute a solution to the first scalability bottleneck?

> I guess my main question is whether we should fix one problem first,
> then the other, both at the same time, or only one or the other?

I imagine we'd eventually want to fix both, but it doesn't have to be done in the same patch.

			regards, tom lane
On Tue, Sep 23, 2025, at 18:27, Tom Lane wrote:
> I concur that these are orthogonal issues, but I don't understand
> why you withdrew your patches --- don't they constitute a solution
> to the first scalability bottleneck?

Thanks for getting back to this thread. I was unhappy with not finding a solution that would improve all use-cases; I had a feeling it would be possible to find one, and I think I've done so now.

>> I guess my main question is whether we should fix one problem first,
>> then the other, both at the same time, or only one or the other?
>
> I imagine we'd eventually want to fix both, but it doesn't have to
> be done in the same patch.

I've attached a new patch with a new pragmatic approach that specifically addresses the context switching cost. The patch is based on the assumption that some extra LISTEN/NOTIFY latency would be acceptable to most users, as a trade-off, in order to improve throughput.

One nice thing with this approach is that it has the potential to improve throughput both for users with just a single listening backend, and also for users with lots of listening backends.

More details in the commit message of the patch.

Curious to hear thoughts on this approach.

/Joel
Attachments
Hi Joel,
Thanks for the patch. After reviewing it, I have a few comments.
On Sep 25, 2025, at 04:34, Joel Jacobson <joel@compiler.org> wrote:
> Curious to hear thoughts on this approach.
> /Joel
> <0001-LISTEN-NOTIFY-make-the-latency-throughput-trade-off-.patch>
1.
```
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -35,6 +35,7 @@ typedef enum TimeoutId
IDLE_SESSION_TIMEOUT,
IDLE_STATS_UPDATE_TIMEOUT,
CLIENT_CONNECTION_CHECK_TIMEOUT,
+ NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
STARTUP_PROGRESS_TIMEOUT,
```
Can we define the new one after STARTUP_PROGRESS_TIMEOUT to try to preserve the existing enum values?

2.
```
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -766,6 +766,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#notify_latency_target = 0 # in milliseconds, 0 is disabled
#bytea_output = 'hex' # hex, escape
```
I think we should add one more tab so that the comment aligns with the comment on the previous line.
3.
```
/* GUC parameters */
bool Trace_notify = false;
+int notify_latency_target;
```
I know the compiler will automatically initialize notify_latency_target to 0, but all other global and static variables around it are explicitly initialized, so it would look better to assign 0 to it, just to keep the coding style consistent.
4.
```
+ /*
+ * Throttling check: if we were last active too recently, defer. This
+ * check is safe without a lock because it's based on a backend-local
+ * timestamp.
+ */
+ if (notify_latency_target > 0 &&
+ !TimestampDifferenceExceeds(last_wakeup_start_time,
+ GetCurrentTimestamp(),
+ notify_latency_target))
+ {
+ /*
+ * Too soon. We leave wakeup_pending_flag untouched (it must be true,
+ * or we wouldn't have been signaled) to tell senders we are
+ * intentionally delaying. Arm a timer to re-awaken and process the
+ * backlog later.
+ */
+ enable_timeout_after(NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
+ notify_latency_target);
+ return;
+ }
+
```
Should we avoid enabling duplicate timeouts? Currently, whenever a duplicate notification is avoided, a new timeout is enabled. I think we can add another variable to remember whether a timeout has already been enabled.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, Sep 25, 2025, at 10:25, Chao Li wrote:
> Hi Joel,
>
> Thanks for the patch. After reviewing it, I have a few comments.

Thanks for reviewing!

>> On Sep 25, 2025, at 04:34, Joel Jacobson <joel@compiler.org> wrote:
> 1. ...
> Can we define the new one after STARTUP_PROGRESS_TIMEOUT to try to
> preserve the existing enum values?

Fixed.

> 2. ...
> I think we should add one more tab so that the comment aligns with the
> comment on the previous line.

Fixed.

> 3. ...
> I know the compiler will automatically initialize notify_latency_target
> to 0, but all other global and static variables around it are explicitly
> initialized, so it would look better to assign 0 to it, just to keep the
> coding style consistent.

Fixed.

> 4. ...
> Should we avoid enabling duplicate timeouts? Currently, whenever a
> duplicate notification is avoided, a new timeout is enabled. I think we
> can add another variable to remember whether a timeout has already been
> enabled.

Hmm, I don't see how a duplicate timeout could happen?

Once we decide to defer the wakeup, wakeup_pending_flag remains set, which avoids further signals from notifiers, so I don't see how we could re-enter ProcessIncomingNotify(), since notifyInterruptPending is reset when ProcessIncomingNotify() is called, and notifyInterruptPending is only set when a signal is received (or set directly when in the same process).

New patch attached with 1-3 fixed.

/Joel
Attachments
On Sep 26, 2025, at 05:13, Joel Jacobson <joel@compiler.org> wrote:
> Hmm, I don't see how a duplicate timeout could happen?
>
> Once we decide to defer the wakeup, wakeup_pending_flag remains set,
> which avoids further signals from notifiers, so I don't see how we could
> re-enter ProcessIncomingNotify(), since notifyInterruptPending is reset
> when ProcessIncomingNotify() is called, and notifyInterruptPending is
> only set when a signal is received (or set directly when in the same
> process).
I think what you explained is partially correct.
Based on my understanding, any backend process may call SignalBackends(), which means that it’s possible that multiple backend processes may call SignalBackends() concurrently.
Looking at your code, between checking QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true, there is a block of code (the "if-else") that runs, so it's possible that multiple backend processes pass the QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check. In that case multiple signals would be sent to the same process, which would lead to duplicate timeouts being enabled in the receiving process.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:
> I think what you explained is partially correct.
>
> Based on my understanding, any backend process may call
> SignalBackends(), which means that it's possible that multiple backend
> processes may call SignalBackends() concurrently.
>
> Looking at your code, between checking
> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true, there is
> a block of code (the "if-else") that runs, so it's possible that
> multiple backend processes pass the
> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check. In that case multiple signals
> would be sent to the same process, which would lead to duplicate timeouts
> being enabled in the receiving process.

I don't see how that can happen; we're checking wakeup_pending_flag while holding an exclusive lock, so I don't see how multiple backend processes could be within the region where we check/set wakeup_pending_flag at the same time?

/Joel
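For anyone following along, the pattern under discussion is roughly this. A hypothetical sketch, written as if it lived in async.c so the queue macros are visible; QUEUE_BACKEND_WAKEUP_PENDING_FLAG() is the patch's per-backend flag accessor, and the "remember" step is elided:

```c
static void
MarkWakeupsPending(void)
{
	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
	for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER;
		 i = QUEUE_NEXT_LISTENER(i))
	{
		if (QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i))
			continue;			/* a signal is already on its way; skip */

		/* Claim the wakeup while still holding the exclusive lock. */
		QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) = true;
		/* ... remember this backend so we can signal it after releasing ... */
	}
	LWLockRelease(NotifyQueueLock);
}
```

Because both the check and the set happen under the same exclusive NotifyQueueLock, two concurrent notifiers cannot both observe the flag as clear for the same backend.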
On Sep 26, 2025, at 17:32, Joel Jacobson <joel@compiler.org> wrote:
> On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:
>> I think what you explained is partially correct.
>>
>> Based on my understanding, any backend process may call
>> SignalBackends(), which means that it's possible that multiple backend
>> processes may call SignalBackends() concurrently.
>>
>> Looking at your code, between checking
>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true, there is
>> a block of code (the "if-else") that runs, so it's possible that
>> multiple backend processes pass the
>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check. In that case multiple signals
>> would be sent to the same process, which would lead to duplicate timeouts
>> being enabled in the receiving process.
>
> I don't see how that can happen; we're checking wakeup_pending_flag
> while holding an exclusive lock, so I don't see how multiple backend
> processes could be within the region where we check/set
> wakeup_pending_flag at the same time?
>
> /Joel
I might have missed the fact that an exclusive lock is held. I will revisit that part again.
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Fri, Sep 26, 2025, at 11:44, Chao Li wrote:
>> On Sep 26, 2025, at 17:32, Joel Jacobson <joel@compiler.org> wrote:
>>
>> On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:
>>
>>> I think what you explained is partially correct.
>>>
>>> Based on my understanding, any backend process may call
>>> SignalBackends(), which means that it's possible that multiple backend
>>> processes may call SignalBackends() concurrently.
>>>
>>> Looking at your code, between checking
>>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true, there is
>>> a block of code (the "if-else") that runs, so it's possible that
>>> multiple backend processes pass the
>>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check. In that case multiple signals
>>> would be sent to the same process, which would lead to duplicate timeouts
>>> being enabled in the receiving process.
>>
>> I don't see how that can happen; we're checking wakeup_pending_flag
>> while holding an exclusive lock, so I don't see how multiple backend
>> processes could be within the region where we check/set
>> wakeup_pending_flag at the same time?
>>
>> /Joel
>
> I might have missed the fact that an exclusive lock is held. I will
> revisit that part again.

I've re-read this entire thread, and I actually think my original approaches are more promising, that is, the 0001-optimize_listen_notify-v4.patch patch, doing multicast targeted signaling.

Therefore, please consider the latest patch merely a PoC with some possibly interesting ideas.

Before this patch, I had never used PostgreSQL's timeout mechanism, so I didn't consider it when thinking about how to solve the remaining problems with 0001-optimize_listen_notify-v4.patch, which currently can't guarantee that all listening backends will eventually catch up, since it just kicks one of the most lagging ones for each notification. This could be a problem in practice if there is a long period of time with no notifications coming in. Then some listening backends could end up not being signaled and would stay behind, preventing the queue tail from advancing.

I'm thinking maybe somehow we can use the timeout mechanism here, but I'm not sure how yet. Any ideas?

/Joel
On Sep 28, 2025, at 18:24, Joel Jacobson <joel@compiler.org> wrote:
>> I might have missed the fact that an exclusive lock is held. I will
>> revisit that part again.
>
> I've re-read this entire thread, and I actually think my original
> approaches are more promising, that is, the
> 0001-optimize_listen_notify-v4.patch patch, doing multicast targeted
> signaling.
>
> Therefore, please consider the latest patch merely a PoC with some
> possibly interesting ideas.
>
> Before this patch, I had never used PostgreSQL's timeout mechanism,
> so I didn't consider it when thinking about how to solve the
> remaining problems with 0001-optimize_listen_notify-v4.patch, which
> currently can't guarantee that all listening backends will eventually
> catch up, since it just kicks one of the most lagging ones for each
> notification. This could be a problem in practice if there is a long
> period of time with no notifications coming in. Then some listening
> backends could end up not being signaled and would stay behind,
> preventing the queue tail from advancing.
>
> I'm thinking maybe somehow we can use the timeout mechanism here, but
> I'm not sure how yet. Any ideas?
>
> /Joel
Hi Joel,
I never had a concern about using the timeout mechanism. My comment was about enabling the timeout more than once.

I just revisited the code, and now I agree that I was over-worried, because I had missed considering NotifyQueueLock. With the lock protection, a backend process's QUEUE_BACKEND_WAKEUP_PENDING_FLAG has no race condition, so no duplicate signals should be sent to the same backend process. Then, in the backend process, you have "last_wakeup_start_time", which avoids duplicate timeouts within the configured period, and you reset last_wakeup_start_time in asyncQueueReadAllNotifications() together with clearing the QUEUE_BACKEND_WAKEUP_PENDING_FLAG.
So, overall v2 looks good to me.
One last tiny comment is about the naming of last_wakeup_start_time. I think it could be renamed to "last_wakeup_time", because the variable just records when asyncQueueReadAllNotifications() was last called; there doesn't seem to be any notion of a "start" involved.
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Mon, Sep 29, 2025, at 04:33, Chao Li wrote:
> I never had a concern about using the timeout mechanism. My comment was
> about enabling the timeout more than once.

Thanks for reviewing. However, as said in my previous email, I'm sorry, but I don't believe in my suggested throughput/latency approach. I unfortunately managed to derail from the IMO more promising approaches I worked on initially.

What I couldn't find a solution to back then was the problem of possibly ending up in a situation where some lagging backends would never catch up.

In this new patch, I've simply introduced a new bgworker, given the specific task of kicking lagging backends. I wish of course we could do without the bgworker, but I don't see how that would be possible.

---
optimize_listen_notify-v5.patch:

Fix LISTEN/NOTIFY so it scales with idle listening backends

Currently, idle listening backends cause a dramatic slowdown due to context switching when they are signaled and wake up. This is wasteful when they are not listening to the channel being notified. Just 10 extra idle listening connections cause a slowdown from 8700 TPS to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra it falls to 250 TPS.

To improve scalability with the number of idle listening backends, this patch introduces a shared hash table to keep track of channels per listening backend. This hash table is partitioned to reduce contention on concurrent LISTEN/UNLISTEN operations.

We keep track of up to NOTIFY_MULTICAST_THRESHOLD (16) listeners per channel. Benchmarks indicated diminishing gains above this level. Setting it lower seems unnecessary, so a constant seemed fine; a GUC did not seem motivated.

This patch also adds a wakeup_pending flag to each backend's queue status, to avoid redundant signaling when a wakeup is already pending and the backend would otherwise be signaled again. The flag is set when a backend is signaled and cleared before processing the queue. This order is important to ensure correctness.

It was also necessary to add a new bgworker, notify_bgworker, whose sole responsibility is to wake up lagging listening backends, ensuring they are kicked when they are about to fall too far behind. This bgworker is always started at postmaster startup, but is only activated upon NOTIFY by signaling it, unless it is already active.

The notify_bgworker staggers the signaling of lagging listening backends by sleeping 100 ms between each signal, to prevent the thundering herd problem we would otherwise get if all listening backends woke up at the same time. It loops until there are no more lagging listening backends, and then becomes inactive.

/Joel
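To give a feel for the bgworker's behavior, here is a minimal sketch of what its main loop could look like; FindMostLaggingListener() and KickListener() are invented helper names, the wait-event classification is simplified to PG_WAIT_EXTENSION, and the real patch may be structured differently:

```c
#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/latch.h"
#include "storage/procnumber.h"
#include "utils/wait_event.h"

extern ProcNumber FindMostLaggingListener(void);	/* hypothetical */
extern void KickListener(ProcNumber procno);		/* hypothetical */

void
NotifyBgworkerMain(Datum main_arg)
{
	BackgroundWorkerUnblockSignals();

	for (;;)
	{
		ProcNumber	laggard;

		/* Sleep until a NOTIFY sender activates us by setting our latch. */
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1L,
						 PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);

		/* Kick lagging listeners one at a time, staggered by 100 ms. */
		while ((laggard = FindMostLaggingListener()) != INVALID_PROC_NUMBER)
		{
			KickListener(laggard);

			/* Stagger wakeups to avoid a thundering herd. */
			(void) WaitLatch(MyLatch, WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, 100L,
							 PG_WAIT_EXTENSION);
			ResetLatch(MyLatch);
		}
		/* No laggards left: become inactive until signaled again. */
	}
}
```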
Attachments
On Tue, Sep 30, 2025, at 20:56, Joel Jacobson wrote:
> Attachments:
> * optimize_listen_notify-v5.patch

Changes since v5:

*) Added missing #include "nodes/pg_list.h" to fix a List type error in headerscheck.
*) Add NOTIFY_DEFERRED_WAKEUP_MAIN to wait_event_names.txt and rename WAIT_EVENT_NOTIFY_DEFERRED_WAKEUP to WAIT_EVENT_NOTIFY_DEFERRED_WAKEUP_MAIN.

/Joel
Attachments
"Joel Jacobson" <joel@compiler.org> writes: > Thanks for reviewing. However, like said in my previous email, I'm > sorry, but don't believe in my suggested throughput/latency approach. I > unfortunately managed to derail from the IMO more promising approaches I > worked on initially. > What I couldn't find a solution to then, was the problem of possibly > ending up in a situation where some lagging backends would never catch > up. > In this new patch, I've simply introduced a new bgworker, given the > specific task of kicking lagging backends. I wish of course we could do > without the bgworker, but I don't see how that would be possible. I don't understand why you feel you need a bgworker. The existing code does not have any provision that guarantees a lost signal will eventually be re-sent --- it will be if there is continuing NOTIFY traffic, but not if all the senders suddenly go quiet. AFAIR we've had zero complaints about that in 25+ years. So I'm perfectly content to continue the approach of "check for laggards during NOTIFY". (This could be gated behind an overall check on how long the notify queue is, so that we don't expend the cycles when things are performing as-expected.) If you feel that that's not robust enough, you should split it out as a separate patch that's advertised as a robustness improvement not a performance improvement, and see if you can get anyone to bite. The other thing I'm concerned about with this patch is the new shared hash table. I don't think we have anywhere near a good enough fix on how big it needs to be, and that is problematic because of the frozen-at-startup size of main shared memory. We could imagine inventing YA GUC to let the user tell us how big to make it, but I think there is now a better way: use a dshash table (src/backend/lib/dshash.c). That offers the additional win that we don't have to create it at all in an installation that never uses LISTEN/NOTIFY. We could also rethink whether we really need the NOTIFY_MULTICAST_THRESHOLD limit: rather than having two code paths, we could just say that all listeners are registered for every channel. regards, tom lane