Discussion: Optimize LISTEN/NOTIFY
Hi hackers,

The current LISTEN/NOTIFY implementation is well-suited for use-cases like cache invalidation where many backends listen on the same channel. However, its scalability is limited when many backends listen on distinct channels. The root of the problem is that Async_Notify must signal every listening backend in the database, as it lacks central knowledge of which backend is interested in which channel. This results in an O(N) number of kill(pid, SIGUSR1) syscalls as the listener count grows.

The attached proof-of-concept patch proposes a straightforward optimization for the single-listener case. It introduces a shared-memory hash table mapping (dboid, channelname) to the ProcNumber of a single listener. When NOTIFY is issued, we first check this table. If a single listener is found, we signal only that backend. Otherwise, we fall back to the existing broadcast behavior.

The performance impact for this pattern is significant. A benchmark [1] measuring a NOTIFY "ping-pong" between two connections, while adding a variable number of idle listeners, shows the following:

master (8893c3a):
   0 extra listeners: 9126 TPS
  10 extra listeners: 6233 TPS
 100 extra listeners: 2020 TPS
1000 extra listeners:  238 TPS

0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener.patch:
   0 extra listeners: 9152 TPS
  10 extra listeners: 9352 TPS
 100 extra listeners: 9320 TPS
1000 extra listeners: 8937 TPS

As you can see, the patched version's performance is near O(1) with respect to the number of idle listeners, while the current implementation shows the expected O(N) degradation.

This patch is a first step. It uses a simple boolean has_multiple_listeners flag in the hash entry. Once a channel gets a second listener, this flag is set and, crucially, never cleared. The entry will then permanently indicate "multiple listeners", even after all backends on that channel disconnect.

A more complete solution would likely use reference counting for each channel's listeners. This would solve the "stuck entry" problem and could also enable a further optimization: targeted signaling to all listeners of a channel with multiple listeners, avoiding the database-wide broadcast entirely.

The patch also includes a "wake only tail" optimization (contributed by Marko Tikkaja) to help prevent backends from falling too far behind. Instead of waking all lagging backends at once and creating a "thundering herd", this logic signals only the single backend that is currently at the queue tail. This ensures the global queue tail can always advance, relying on a chain reaction to get backends caught up efficiently. This seems like a sensible improvement in its own right.

Thoughts?

/Joel

[1] Benchmark tool and full results: https://github.com/joelonsql/pg-bench-listen-notify
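To make the proposed data structure concrete, here is a minimal illustrative sketch (not the patch itself) of what the hash entry and the NOTIFY-side fast path could look like; the type and function names are hypothetical, and locking and error handling are omitted:

```c
#include "postgres.h"
#include "storage/proc.h"
#include "storage/procsignal.h"
#include "utils/hsearch.h"

/* Hypothetical key and entry layout for the shared channel hash. */
typedef struct ListenChannelKey
{
	Oid			dboid;						/* database OID */
	char		channel[NAMEDATALEN];		/* channel name, NUL-padded */
} ListenChannelKey;

typedef struct ListenChannelEntry
{
	ListenChannelKey key;					/* hash key (must be first) */
	ProcNumber	listener;					/* the sole listener, if any */
	bool		has_multiple_listeners;		/* sticky "fall back to broadcast" flag */
} ListenChannelEntry;

/* NOTIFY-side fast path: returns true if a targeted signal was sent. */
static bool
TrySignalSingleListener(HTAB *channelHash, Oid dboid, const char *channel)
{
	ListenChannelKey key;
	ListenChannelEntry *entry;

	memset(&key, 0, sizeof(key));
	key.dboid = dboid;
	strlcpy(key.channel, channel, NAMEDATALEN);

	entry = (ListenChannelEntry *) hash_search(channelHash, &key, HASH_FIND, NULL);
	if (entry == NULL || entry->has_multiple_listeners)
		return false;			/* fall back to signaling every listener */

	/* Exactly one known listener: signal just that backend. */
	SendProcSignal(ProcGlobal->allProcs[entry->listener].pid,
				   PROCSIG_NOTIFY_INTERRUPT, entry->listener);
	return true;
}
```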
Attachments
"Joel Jacobson" <joel@compiler.org> writes: > The attached proof-of-concept patch proposes a straightforward > optimization for the single-listener case. It introduces a shared-memory > hash table mapping (dboid, channelname) to the ProcNumber of a single > listener. What does that do to the cost and parallelizability of LISTEN/UNLISTEN? > The patch also includes a "wake only tail" optimization (contributed by > Marko Tikkaja) to help prevent backends from falling too far behind. Coulda sworn we dealt with that case some years ago. In any case, if it's independent of the other idea it should probably get its own thread. regards, tom lane
On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> The attached proof-of-concept patch proposes a straightforward
>> optimization for the single-listener case. It introduces a shared-memory
>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>> listener.
>
> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?

Good point. The previous patch would effectively force all LISTEN/UNLISTEN to be serialized, which would at least hurt parallelizability. New benchmarks confirm this hypothesis.

A new patch is attached that combines two complementary approaches, which together seem to scale well for both common-channel and unique-channel scenarios:

1. Partitioned hash locking

The channel hash now uses HASH_PARTITION, with an array of NUM_NOTIFY_PARTITIONS lightweight locks. A given channel is mapped to a partition lock using a custom hash function on (dboid, channelname). This allows LISTEN/UNLISTEN operations on different channels to proceed concurrently without fighting over a single global lock, addressing the "many distinct channels" use-case.

2. Optimistic read-locking

For the "many backends on one channel" use-case, lock acquisition now follows a read-then-upgrade pattern. We first acquire a LW_SHARED lock to check the channel's state. If the channel is already marked as has_multiple_listeners, we can return immediately without any need for a write. Only if we are the first or second listener on a channel do we release the shared lock and acquire an LW_EXCLUSIVE lock to modify the hash entry. After getting the exclusive lock, we re-verify the state to guard against race conditions. This avoids serializing the third and all subsequent listeners for a popular channel.

BENCHMARK

https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_overview_connections_equal_jobs.png
https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_overview_fixed_connections.png

I didn't want to attach the images to this email because they are quite large, due to all the details in the images. However, since it's important that this mailing list contains all relevant data discussed, I've also included all data from the graphs formatted in ASCII/Markdown: performance_overview.md

I've also included the raw parsed data from the pgbench output, which has been used as input to create performance_overview.md as well as the images: pgbench_results_combined.csv

I've benchmarked five times per measurement, in random order. All raw measurements are included in the Markdown document within { curly braces }, sorted, next to the average values, to give an idea of the variance. Stddev felt possibly misleading since I'm not sure the data points are normally distributed, given that this is benchmarking data.

I've run the benchmarks on my MacBook Pro Apple M3 Max, using `caffeinate -dims pgbench ...`.

>> The patch also includes a "wake only tail" optimization (contributed by
>> Marko Tikkaja) to help prevent backends from falling too far behind.
>
> Coulda sworn we dealt with that case some years ago. In any case,
> if it's independent of the other idea it should probably get its
> own thread.

Maybe it's been dealt with by some other part of the system, but I can't find any such code anywhere; it's only async.c that currently sends PROCSIG_NOTIFY_INTERRUPT.
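As an illustration of the read-then-upgrade pattern, here is a hypothetical sketch (reusing the illustrative ListenChannelKey/ListenChannelEntry types sketched earlier in the thread; again, this is a sketch, not the patch, and partition lookup and error handling are omitted):

```c
/*
 * Illustrative sketch of optimistic read-then-upgrade locking for LISTEN,
 * assuming a partitioned hash protected by per-partition LWLocks.
 */
static void
RegisterListener(HTAB *channelHash, const ListenChannelKey *key,
				 LWLock *partitionLock, ProcNumber me)
{
	ListenChannelEntry *entry;
	bool		found;

	/* Fast path: a shared lock suffices if the channel is already "popular". */
	LWLockAcquire(partitionLock, LW_SHARED);
	entry = (ListenChannelEntry *) hash_search(channelHash, key, HASH_FIND, NULL);
	if (entry && entry->has_multiple_listeners)
	{
		LWLockRelease(partitionLock);
		return;					/* no shared state needs to change */
	}
	LWLockRelease(partitionLock);

	/* Slow path: we may be the first or second listener; take the lock exclusively. */
	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
	entry = (ListenChannelEntry *) hash_search(channelHash, key, HASH_ENTER, &found);
	if (!found)
	{
		/* First listener on this channel. */
		entry->listener = me;
		entry->has_multiple_listeners = false;
	}
	else if (!entry->has_multiple_listeners && entry->listener != me)
	{
		/* Second listener: flip the sticky flag (state re-checked under the lock). */
		entry->has_multiple_listeners = true;
	}
	LWLockRelease(partitionLock);
}
```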
The wake only tail mechanism seems almost perfect, but I can think of at least one edge case where we could still get a problem situation: with lots of idle backends, the rate of this one-by-one catch-up may not be fast enough to outpace the queue's advancement, causing other idle backends to eventually lag by more than the QUEUE_CLEANUP_DELAY threshold.

To ensure all backends are eventually processed without re-introducing the thundering herd problem, an additional mechanism seems necessary. I see two main options:

1. Extend the chain reaction

Once woken, a backend could signal the next backend at the queue tail, propagating the catch-up process. This would need to be managed carefully, perhaps with some kind of global advisory lock, to prevent multiple cascades from running at once.

2. Centralize the work

We already have the autovacuum daemon; maybe it could also be made responsible for kicking lagging backends?

Other ideas?

/Joel

Attached:

* pgbench-scripts.tar.gz
  pgbench scripts to reproduce the results, report and images.
* performance_overview.md
  Same results as in the images, but in ASCII/Markdown format.
* pgbench_results_combined.csv
  Parsed output from pgbench runs, used to create performance_overview.md as well as the linked images.
* 0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener-v2.patch
  Old patch, just renamed to -v2.
* 0002-Partition-channel-hash-to-improve-LISTEN-UNLISTEN-v2.patch
  New patch with the approach explained above.
Attachments
On Tue, Jul 15, 2025, at 09:20, Joel Jacobson wrote:
> On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
>> "Joel Jacobson" <joel@compiler.org> writes:
>>> The attached proof-of-concept patch proposes a straightforward
>>> optimization for the single-listener case. It introduces a shared-memory
>>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>>> listener.
>>
>> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
>
> Good point. The previous patch would effectively force all LISTEN/UNLISTEN
> to be serialized, which would at least hurt parallelizability.
>
> New benchmarks confirm this hypothesis.
>
> New patch attached that combines two complementary approaches, which together
> seem to scale well for both common-channel and unique-channel scenarios:

Thanks to the FreeBSD animal failing, I see I made a shared memory blunder. New squashed patch attached.

/Joel
Attachments
On Tue, Jul 15, 2025, at 22:56, Joel Jacobson wrote:
> On Tue, Jul 15, 2025, at 09:20, Joel Jacobson wrote:
>> On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
>>> "Joel Jacobson" <joel@compiler.org> writes:
>>>> The attached proof-of-concept patch proposes a straightforward
>>>> optimization for the single-listener case. It introduces a shared-memory
>>>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>>>> listener.
>>>
>>> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
>>
>> Good point. The previous patch would effectively force all LISTEN/UNLISTEN
>> to be serialized, which would at least hurt parallelizability.
>>
>> New benchmarks confirm this hypothesis.
>>
>> New patch attached that combines two complementary approaches, which together
>> seem to scale well for both common-channel and unique-channel scenarios:
>
> Thanks to the FreeBSD animal failing, I see I made a shared memory blunder.
> New squashed patch attached.
>
> /Joel
> Attachments:
> * 0001-Subject-Optimize-LISTEN-NOTIFY-signaling-for-scalabi-v3.patch

(cfbot is not picking up my patch; I wonder if some filename length limit is exceeded. Trying a shorter filename; apologies for spamming.)

/Joel
Attachments
Hi Joel,

Thanks for sharing the patch. I have a few questions based on a cursory first look.

> If a single listener is found, we signal only that backend.
> Otherwise, we fall back to the existing broadcast behavior.

The idea of not wanting to wake up all backends makes sense to me, but I don't understand why we want this optimization only for the case where there is a single backend listening on a channel.

Is there a pattern of usage in LISTEN/NOTIFY where users typically have either just one or several backends listening on a channel?

If we are doing this optimization, why not maintain a list of backends for each channel, and only wake up those backends?

Thanks,
Rishu
On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
> Hi Joel,
>
> Thanks for sharing the patch.
> I have a few questions based on a cursory first look.
>
>> If a single listener is found, we signal only that backend.
>> Otherwise, we fall back to the existing broadcast behavior.
>
> The idea of not wanting to wake up all backends makes sense to me,
> but I don't understand why we want this optimization only for the case
> where there is a single backend listening on a channel.
>
> Is there a pattern of usage in LISTEN/NOTIFY where users typically
> have either just one or several backends listening on a channel?
>
> If we are doing this optimization, why not maintain a list of backends
> for each channel, and only wake up those backends?

Thanks for the thoughtful question. You've hit on the central design trade-off in this optimization: how to provide targeted signaling for some workloads without degrading performance for others.

While we don't have telemetry on real-world usage patterns of LISTEN/NOTIFY, it seems likely that most applications fall into one of three categories, which I've been thinking of in networking terms:

1. Broadcast-style ("hub mode")

Many backends listening on the *same* channel (e.g., for cache invalidation). The current implementation is already well-optimized for this, behaving like an Ethernet hub that broadcasts to all ports. Waking all listeners is efficient because they all need the message.

2. Targeted notifications ("switch mode")

Each backend listens on its own private channel (e.g., for session events or worker queues). This is where the current implementation scales poorly, as every NOTIFY wakes up all listeners regardless of relevance. My patch is designed to make this behave like an efficient Ethernet switch.

3. Selective multicast-style ("group mode")

A subset of backends shares a channel, but not all. This is the tricky middle ground.

Your question, "why not maintain a list of backends for each channel, and only wake up those backends?", is exactly the right one to ask. A full listener list seems like the obvious path to optimizing for *all* cases. However, the devil is in the details of concurrency and performance. Managing such a list would require heavier locking, which would create a new bottleneck and degrade the scalability of LISTEN/UNLISTEN operations, especially for the "hub mode" case where many backends rapidly subscribe to the same popular channel.

This patch makes a deliberate architectural choice: prioritize a massive, low-risk win for "switch mode" while rigorously protecting the performance of "hub mode". It introduces a targeted fast path for single-listener channels and cleanly falls back to the existing, well-performing broadcast model for everything else.

This brings us back to "group mode", which remains an open optimization problem. A possible approach could be to track listeners up to a small threshold *K* (e.g., store up to 4 ProcNumbers in the hash entry). If the count exceeds *K*, we would flip a "broadcast" flag and revert to hub-mode behavior. However, this path has critical drawbacks:

1. Performance penalty for hub mode

With the current patch, after the second listener joins a channel, the has_multiple_listeners flag is set. Every subsequent listener can acquire a shared lock, see the flag is true, and immediately continue. This is a highly concurrent, read-only operation that does not require mutating shared state.

In contrast, the K-listener approach would force every new listener (from the third up to the K-th) to acquire an exclusive lock to mutate the shared listener array. This would serialize LISTEN operations on popular channels, creating the very contention point this patch successfully avoids, and directly harming the hub-mode use case that currently works well.

2. Uncertainty

Compounding this, without clear data on typical "group" sizes, choosing a value for *K* is a shot in the dark. A small *K* might not help much, while a large *K* would increase the shared memory footprint and worsen the serialization penalty.

For these reasons, attempting to build a switch that also optimizes for multicast risks undermining the architectural clarity and performance of both the switch and hub models. This patch, therefore, draws a clean line. It provides a precise, low-cost path for switch-mode workloads and preserves the existing, well-performing path for hub-mode workloads. While this leaves "group mode" unoptimized for now, it ensures we make two common use cases better without making any use case worse. The new infrastructure is flexible, leaving the door open should a better approach for "group mode" emerge in the future, one that doesn't compromise the other two.

Benchmarks updated, showing master vs 0001-optimize_listen_notify-v3.patch:

https://github.com/joelonsql/pg-bench-listen-notify/raw/master/plot.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_connections_equal_jobs.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_fixed_connections.png

I've not included the benchmark CSV data in this mail, since it's quite heavy, 160 kB, and I couldn't see any significant performance changes since v2.

/Joel
On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
> If we are doing this optimization, why not maintain a list of backends
> for each channel, and only wake up those backends?

Thanks for contributing a great idea; it actually turned out to work really well in practice!

The attached new v4 of the patch implements your multicast idea:

---
Improve NOTIFY scalability with multicast signaling

Previously, NOTIFY would signal all listening backends in a database for any channel with more than one listener. This broadcast approach scales poorly for workloads that rely on targeted notifications to small groups of backends, as every NOTIFY could wake up many unrelated processes.

This commit introduces a multicast signaling optimization to improve scalability for such use-cases. A new GUC, `notify_multicast_threshold`, is added to control the maximum number of listeners to track per channel. When a NOTIFY is issued, if the number of listeners is at or below this threshold, only those specific backends are signaled. If the limit is exceeded, the system falls back to the original broadcast behavior.

The default for this threshold is set to 16. Benchmarks show this provides a good balance, with significant performance gains for small to medium-sized listener groups and diminishing returns for higher values. Setting the threshold to 0 disables multicast signaling, forcing a fallback to the broadcast path for all notifications.

To implement this, a new partitioned hash table is introduced in shared memory to track listeners. Locking is managed with an optimistic read-then-upgrade pattern. This allows concurrent LISTEN/UNLISTEN operations on *different* channels to proceed in parallel, as they will only acquire locks on their respective partitions.

For correctness and to prevent deadlocks, a strict lock ordering hierarchy (NotifyQueueLock before any partition lock) is observed. The signaling path in NOTIFY must acquire the global NotifyQueueLock first before consulting the partitioned hash table, which serializes concurrent NOTIFYs. The primary concurrency win is for LISTEN/UNLISTEN operations, which are now much more scalable.

The "wake only tail" optimization, which signals backends that are far behind in the queue, is also included to ensure the global queue tail can always advance.

Thanks to Rishu Bagga for the multicast idea.
---

BENCHMARK

To find the optimal default notify_multicast_threshold value, I created a new benchmark tool that spawns one "ping" worker that sends notifications to a channel, and multiple "pong" workers that listen on channels and all immediately reply back to the "ping" worker; when all replies have been received, the cycle repeats. By measuring how many complete round-trips can be performed per second, it evaluates the impact of different multicast threshold settings.

The results below show the effect of setting notify_multicast_threshold just below, or exactly at, the number of backends per channel, to compare broadcast vs multicast for different sizes of multicast groups (where 1 corresponds to the old single-listener targeted mode that earlier patch versions specifically optimized for).
K = notify_multicast_threshold

With 2 backends per channel (32 channels total):
  patch-v4 (K=1):  8,477 TPS
  patch-v4 (K=2): 27,748 TPS (3.3x improvement)

With 4 backends per channel (16 channels total):
  patch-v4 (K=1):  7,367 TPS
  patch-v4 (K=4): 18,777 TPS (2.6x improvement)

With 8 backends per channel (8 channels total):
  patch-v4 (K=1): 5,892 TPS
  patch-v4 (K=8): 8,620 TPS (1.5x improvement)

With 16 backends per channel (4 channels total):
  patch-v4 (K=1):  4,202 TPS
  patch-v4 (K=16): 4,750 TPS (1.1x improvement)

I also reran the old ping-pong as well as the pgbench benchmarks, and I couldn't detect any negative impact, testing with notify_multicast_threshold {1, 8, 16}.

Ping-pong benchmark:

Extra Connections: 0
-------------------------------------------------------------------------------------
Version             Max TPS    vs Master    All Values (sorted)
-------------------------------------------------------------------------------------
master                 9119    baseline     {9088, 9095, 9119}
patch-v4 (t=1)         9116    -0.0%        {9082, 9090, 9116}
patch-v4 (t=8)         9106    -0.2%        {9086, 9102, 9106}
patch-v4 (t=16)        9134    +0.2%        {9082, 9116, 9134}

Extra Connections: 10
-------------------------------------------------------------------------------------
Version             Max TPS    vs Master    All Values (sorted)
-------------------------------------------------------------------------------------
master                 6237    baseline     {6224, 6227, 6237}
patch-v4 (t=1)         9358    +50.0%       {9302, 9345, 9358}
patch-v4 (t=8)         9348    +49.9%       {9266, 9312, 9348}
patch-v4 (t=16)        9408    +50.8%       {9339, 9407, 9408}

Extra Connections: 100
-------------------------------------------------------------------------------------
Version             Max TPS    vs Master    All Values (sorted)
-------------------------------------------------------------------------------------
master                 2028    baseline     {2026, 2027, 2028}
patch-v4 (t=1)         9278    +357.3%      {9222, 9235, 9278}
patch-v4 (t=8)         9227    +354.8%      {9184, 9207, 9227}
patch-v4 (t=16)        9250    +355.9%      {9180, 9243, 9250}

Extra Connections: 1000
-------------------------------------------------------------------------------------
Version             Max TPS    vs Master    All Values (sorted)
-------------------------------------------------------------------------------------
master                  239    baseline     {239, 239, 239}
patch-v4 (t=1)         8841    +3594.1%     {8819, 8840, 8841}
patch-v4 (t=8)         8835    +3591.7%     {8802, 8826, 8835}
patch-v4 (t=16)        8855    +3599.8%     {8787, 8843, 8855}

Among my pgbench benchmarks, results seem unaffected in these benchmarks:

listen_unique.sql
listen_common.sql
listen_unlisten_unique.sql
listen_unlisten_common.sql

The listen_notify_unique.sql benchmark shows similar improvements for all notify_multicast_threshold values tested, which is expected, since this benchmark uses unique channels, so a higher notify_multicast_threshold shouldn't affect the results, which it didn't:

# TEST `listen_notify_unique.sql`

```sql
LISTEN channel_:client_id;
NOTIFY channel_:client_id;
```

## 1 Connection, 1 Job

- **master**: 63696 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 63377 TPS (-0.5%)
- **optimize_listen_notify_v4 (t=8.0)**: 62890 TPS (-1.3%)
- **optimize_listen_notify_v4 (t=16.0)**: 63114 TPS (-0.9%)

## 2 Connections, 2 Jobs

- **master**: 90967 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 109423 TPS (+20.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 109107 TPS (+19.9%)
- **optimize_listen_notify_v4 (t=16.0)**: 109608 TPS (+20.5%)

## 4 Connections, 4 Jobs

- **master**: 114333 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 140986 TPS (+23.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 141263 TPS (+23.6%)
- **optimize_listen_notify_v4 (t=16.0)**: 141327 TPS (+23.6%)

## 8 Connections, 8 Jobs

- **master**: 64429 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 93787 TPS (+45.6%)
- **optimize_listen_notify_v4 (t=8.0)**: 93828 TPS (+45.6%)
- **optimize_listen_notify_v4 (t=16.0)**: 93875 TPS (+45.7%)

## 16 Connections, 16 Jobs

- **master**: 41704 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 84791 TPS (+103.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 88330 TPS (+111.8%)
- **optimize_listen_notify_v4 (t=16.0)**: 84827 TPS (+103.4%)

## 32 Connections, 32 Jobs

- **master**: 25988 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 83197 TPS (+220.1%)
- **optimize_listen_notify_v4 (t=8.0)**: 83453 TPS (+221.1%)
- **optimize_listen_notify_v4 (t=16.0)**: 83576 TPS (+221.6%)

## 1000 Connections, 1 Job

- **master**: 105 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3097 TPS (+2852.1%)
- **optimize_listen_notify_v4 (t=8.0)**: 3079 TPS (+2835.1%)
- **optimize_listen_notify_v4 (t=16.0)**: 3080 TPS (+2835.9%)

## 1000 Connections, 2 Jobs

- **master**: 108 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2981 TPS (+2671.7%)
- **optimize_listen_notify_v4 (t=8.0)**: 3091 TPS (+2774.4%)
- **optimize_listen_notify_v4 (t=16.0)**: 3097 TPS (+2779.6%)

## 1000 Connections, 4 Jobs

- **master**: 105 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2947 TPS (+2705.5%)
- **optimize_listen_notify_v4 (t=8.0)**: 2994 TPS (+2751.0%)
- **optimize_listen_notify_v4 (t=16.0)**: 2992 TPS (+2748.7%)

## 1000 Connections, 8 Jobs

- **master**: 107 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3064 TPS (+2777.0%)
- **optimize_listen_notify_v4 (t=8.0)**: 2981 TPS (+2698.5%)
- **optimize_listen_notify_v4 (t=16.0)**: 2979 TPS (+2696.8%)

## 1000 Connections, 16 Jobs

- **master**: 101 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3068 TPS (+2923.2%)
- **optimize_listen_notify_v4 (t=8.0)**: 2950 TPS (+2806.4%)
- **optimize_listen_notify_v4 (t=16.0)**: 2940 TPS (+2796.8%)

## 1000 Connections, 32 Jobs

- **master**: 102 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2980 TPS (+2815.0%)
- **optimize_listen_notify_v4 (t=8.0)**: 3034 TPS (+2867.9%)
- **optimize_listen_notify_v4 (t=16.0)**: 2962 TPS (+2798.0%)

Here are some plots that include the above results:

https://github.com/joelonsql/pg-bench-listen-notify/raw/master/plot-v4.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_connections_equal_jobs-v4.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_fixed_connections-v4.png

/Joel
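If it helps review, here is a rough sketch of how the multicast decision could look on the NOTIFY side, assuming an entry that tracks up to the threshold's worth of ProcNumbers plus an overflow flag; the struct and function names are made up for illustration and do not necessarily match the patch:

```c
#include "postgres.h"
#include "storage/proc.h"
#include "storage/procsignal.h"

int			notify_multicast_threshold = 16;	/* the patch's GUC, default 16 */

#define NOTIFY_MULTICAST_MAX 16

typedef struct MulticastChannelEntry
{
	Oid			dboid;
	char		channel[NAMEDATALEN];
	int			nlisteners;						/* valid slots in listeners[] */
	bool		overflowed;						/* more listeners than we track */
	ProcNumber	listeners[NOTIFY_MULTICAST_MAX];
} MulticastChannelEntry;

/* Returns true if targeted signaling was possible, false to broadcast. */
static bool
MulticastSignal(const MulticastChannelEntry *entry)
{
	if (entry == NULL || entry->overflowed ||
		entry->nlisteners > notify_multicast_threshold)
		return false;			/* fall back to waking every listening backend */

	for (int i = 0; i < entry->nlisteners; i++)
	{
		ProcNumber	procno = entry->listeners[i];

		SendProcSignal(ProcGlobal->allProcs[procno].pid,
					   PROCSIG_NOTIFY_INTERRUPT, procno);
	}
	return true;
}
```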
Attachments
On Thu, Jul 17, 2025, at 09:43, Joel Jacobson wrote:
> On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
>> If we are doing this optimization, why not maintain a list of backends
>> for each channel, and only wake up those backends?
>
> Thanks for contributing a great idea, it actually turned out to work
> really well in practice!
>
> The attached new v4 of the patch implements your multicast idea:

Hi hackers,

While my previous attempts at $subject have only focused on optimizing the multi-channel scenario, I thought it would be really nice if LISTEN/NOTIFY could be optimized in the general case, benefiting all users, including those who just listen on a single channel. To my surprise, this was not only possible, but actually quite simple.

The main idea in this patch is to introduce an atomic state machine, with three states, IDLE, SIGNALLED, and PROCESSING, so that we don't interrupt backends that are already in the process of catching up.

Thanks to Thomas Munro for making me aware of his, Heikki Linnakangas's, and others' work in the "Interrupts vs signals" [1] thread. Maybe my patch is redundant due to their patch set; I'm not really sure? Their patch seems to refactor the underlying wakeup mechanism. It replaces the old, complex chain of events (SIGUSR1 signal -> handler -> flag -> latch) with a single, direct function call: SendInterrupt(). For async.c, this seems to be a low-level plumbing change that simplifies how a notification wakeup is delivered. My patch optimizes the high-level notification protocol. It introduces a state machine (IDLE, SIGNALLED, PROCESSING) to only signal backends when needed.

In their patch, in async.c's SignalBackends(), they do SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't seem to check if the backend is already signalled or not, but maybe SendInterrupt() has signal coalescing built in, so it would be a no-op with almost no cost?

I'm happy to rebase my LISTEN/NOTIFY work on top of [1], but I could also see benefits of doing the opposite. I'm also happy to help with benchmarking of your work in [1].

Note that this patch doesn't contain the hash table to keep track of listeners per backend, as proposed in earlier patches. I will propose such a patch again later, but first we need to figure out if I should rebase onto [1] or master (HEAD).

--- PATCH ---

Optimize NOTIFY signaling to avoid redundant backend signals

Previously, a NOTIFY would send SIGUSR1 to all listening backends, which could lead to a "thundering herd" of redundant signals under high traffic.

To address this inefficiency, this patch replaces the simple volatile notifyInterruptPending flag with a per-backend atomic state machine, stored in asyncQueueControl->backend[i].state. This state variable can be in one of three states: IDLE (awaiting signal), SIGNALLED (signal received, work pending), or PROCESSING (actively reading the queue).

From the notifier's perspective, SignalBackends now uses an atomic compare-and-swap (CAS) to transition a listener from IDLE to SIGNALLED. Only on a successful transition is a signal sent. If the listener is already SIGNALLED, or another notifier wins the race, no redundant signal is sent. If the listener is in the PROCESSING state, the notifier will also transition it to SIGNALLED to ensure the listener re-scans the queue after its current work is done.

On the listener side, ProcessIncomingNotify first transitions its state from SIGNALLED to PROCESSING. After reading notifications, it attempts to transition from PROCESSING back to IDLE. If this CAS fails, it means a new notification arrived during processing and a notifier has already set the state back to SIGNALLED. The listener then simply re-latches itself to process the new notifications, avoiding a tight loop.

The primary benefit is a significant reduction in syscall overhead and unnecessary kernel wakeups in high-traffic scenarios. This dramatically improves performance for workloads with many concurrent notifiers. Benchmarks show a substantial increase in NOTIFY-only transaction throughput, with gains exceeding 200% at higher concurrency levels.

 src/backend/commands/async.c | 209 ++++++++++++++++++++++++++++++-----------
 src/backend/tcop/postgres.c  |   4 ++--
 src/include/commands/async.h |   4 +++-
 3 files changed, 185 insertions(+), 32 deletions(-)

--- BENCHMARK ---

The attached benchmark script does LISTEN on one connection, and then uses pgbench to send NOTIFY on a varying number of connections and jobs, to cause a high procsignal load.

I've run the benchmark on my MacBook Pro M3 Max, 10 seconds per run, 3 runs. (I reused the same benchmark script as in the other thread, "Optimize ProcSignal to avoid redundant SIGUSR1 signals".)

 Connections=Jobs | TPS (master) | TPS (patch) | Relative Diff (%) | StdDev (master) | StdDev (patch)
------------------+--------------+-------------+-------------------+-----------------+----------------
                1 |       118833 |      151510 |            27.50% |             484 |            923
                2 |       156005 |      239051 |            53.23% |            3145 |           1596
                4 |       177351 |      250910 |            41.48% |            4305 |           4891
                8 |       116597 |      171944 |            47.47% |            1549 |           2752
               16 |        40835 |      165482 |           305.25% |            2695 |           2825
               32 |        37940 |      145150 |           282.58% |            2533 |           1566
               64 |        35495 |      131836 |           271.42% |            1837 |            573
              128 |        40193 |      121333 |           201.88% |            2254 |            874
(8 rows)

/Joel

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B3MkS21yK4jL4cgZywdnnGKiBg0jatoV6kzaniBmcqbQ%40mail.gmail.com
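To make the state machine easier to follow, here is a compressed, hypothetical sketch of the two sides using PostgreSQL's atomics. It is written as if it lived in async.c (so it can call asyncQueueReadAllNotifications()); the state constants are invented names, and unlike the actual patch, which re-latches instead, this sketch simply loops when the final CAS fails:

```c
#include "postgres.h"
#include "port/atomics.h"
#include "storage/proc.h"
#include "storage/procsignal.h"

/* Per-listener wakeup state in shared memory (names are illustrative). */
#define ASYNC_STATE_IDLE		0	/* awaiting a signal */
#define ASYNC_STATE_SIGNALLED	1	/* signal sent, work pending */
#define ASYNC_STATE_PROCESSING	2	/* currently reading the queue */

/* Notifier side: send a signal only on a successful IDLE -> SIGNALLED CAS. */
static void
MaybeSignalListener(pg_atomic_uint32 *state, pid_t pid, ProcNumber procno)
{
	uint32		expected = ASYNC_STATE_IDLE;

	if (pg_atomic_compare_exchange_u32(state, &expected, ASYNC_STATE_SIGNALLED))
		SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procno);
	else if (expected == ASYNC_STATE_PROCESSING)
	{
		/*
		 * Listener is busy; mark that it must re-scan when done.  (A real
		 * implementation would handle a concurrent state change here.)
		 */
		pg_atomic_compare_exchange_u32(state, &expected, ASYNC_STATE_SIGNALLED);
	}
	/* If it was already SIGNALLED, another notifier has done the work. */
}

/* Listener side, conceptually called from ProcessIncomingNotify(). */
static void
ReadNotificationsWithStateMachine(pg_atomic_uint32 *state)
{
	uint32		expected;

	do
	{
		expected = ASYNC_STATE_SIGNALLED;
		pg_atomic_compare_exchange_u32(state, &expected, ASYNC_STATE_PROCESSING);

		asyncQueueReadAllNotifications();	/* existing async.c routine */

		/* If a notifier flipped us back to SIGNALLED meanwhile, go again. */
		expected = ASYNC_STATE_PROCESSING;
	} while (!pg_atomic_compare_exchange_u32(state, &expected, ASYNC_STATE_IDLE));
}
```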
Attachments
On Wed, Jul 23, 2025 at 1:39 PM Joel Jacobson <joel@compiler.org> wrote:
> In their patch, in async.c's SignalBackends(), they do
> SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
> SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
> seem to check if the backend is already signalled or not, but maybe
> SendInterrupt() has signal coalescing built in, so it would be a no-op
> with almost no cost?

Yeah:

+    old_pending = pg_atomic_fetch_or_u32(&proc->pendingInterrupts, interruptMask);
+
+    /*
+     * If the process is currently blocked waiting for an interrupt to arrive,
+     * and the interrupt wasn't already pending, wake it up.
+     */
+    if ((old_pending & (interruptMask | SLEEPING_ON_INTERRUPTS)) == SLEEPING_ON_INTERRUPTS)
+        WakeupOtherProc(proc);
On Wed, Jul 23, 2025, at 04:44, Thomas Munro wrote:
> On Wed, Jul 23, 2025 at 1:39 PM Joel Jacobson <joel@compiler.org> wrote:
>> In their patch, in async.c's SignalBackends(), they do
>> SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
>> SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
>> seem to check if the backend is already signalled or not, but maybe
>> SendInterrupt() has signal coalescing built in, so it would be a no-op
>> with almost no cost?
>
> Yeah:
>
> +    old_pending = pg_atomic_fetch_or_u32(&proc->pendingInterrupts, interruptMask);
> +
> +    /*
> +     * If the process is currently blocked waiting for an interrupt to arrive,
> +     * and the interrupt wasn't already pending, wake it up.
> +     */
> +    if ((old_pending & (interruptMask | SLEEPING_ON_INTERRUPTS)) == SLEEPING_ON_INTERRUPTS)
> +        WakeupOtherProc(proc);

Thanks for confirming the coalescing logic in SendInterrupt. That's a great low-level optimization.

It's clear we're both targeting the same problem of redundant wake-ups under contention, but approaching it from different architectural levels. The core difference, as I see it, is *where* the state management resides. The "Interrupts vs signals" patch set creates a unified machinery where the 'pending' state for all subsystems is combined into a single atomic bitmask. This is a valid approach.

However, I've been exploring an alternative pattern that decouples the state management from the signaling machinery, allowing each subsystem to manage its own state independently. I believe this leads to a simpler, more modular migration path. I've developed a two-patch series for `async.c` to demonstrate this concept.

1. The first patch introduces a lock-free, atomic finite state machine (FSM) entirely within async.c. By using a subsystem-specific atomic integer and CAS operations, async.c can now robustly manage its own listener states (IDLE, SIGNALLED, PROCESSING). This solves the redundant signal problem at the source, as notifiers can now observe a listener's state and refrain from sending a wakeup if one is already pending.

2. The second patch demonstrates that once state is managed locally, the wakeup mechanism becomes trivial. The expensive `SendProcSignal` call is replaced with a direct `SetLatch`. This leverages the existing, highly-optimized `WaitEventSet` infrastructure as a simple, efficient "poke."

This suggests a powerful, incremental migration pattern: first, fix a subsystem's state management internally; second, replace its wakeup mechanism. This vertical, module-by-module approach seems complementary to the horizontal, layer-by-layer refactoring in the "Interrupts vs signals" thread. I'll post a more detailed follow-up in that thread to discuss the broader architectural implications.

Attached are the two patches, reframed to better illustrate this two-step pattern.

/Joel
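For reference, the second step can be this small. Here is a hypothetical sketch of the latch-based "poke"; the field and function calls follow existing PostgreSQL APIs, but the helper itself is invented for illustration:

```c
#include "postgres.h"
#include "storage/latch.h"
#include "storage/proc.h"

/* Wake a listening backend without going through ProcSignal/SIGUSR1. */
static void
PokeListener(ProcNumber procno)
{
	PGPROC	   *proc = &ProcGlobal->allProcs[procno];

	SetLatch(&proc->procLatch);	/* wakes the backend's WaitEventSet wait */
}
```

A backend blocked in WaitLatch()/WaitEventSetWait() then notices the latch, checks its async state, and reads the queue, with no signal handler involved.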
Attachments
On Thu, Jul 24, 2025, at 23:03, Joel Jacobson wrote:
> * 0001-Optimize-LISTEN-NOTIFY-signaling-with-a-lock-free-at.patch
> * 0002-Optimize-LISTEN-NOTIFY-wakeup-by-replacing-signal-wi.patch

I'm withdrawing the latest patches, since they won't fix the scalability problems, but only provide some performance improvements by eliminating redundant IPC signalling. This could also be improved outside of async.c, by optimizing ProcSignal [1] or by removing ProcSignal, as "Interrupts vs Signals" [2] is working on.

There seem to be two different scalability problems, which appear to be orthogonal:

First, there is the thundering herd problem that I tried to solve initially in this thread, by introducing a hash table in shared memory to keep track of which backends listen to which channels, to avoid immediate wakeup of all listening backends for every notification.

Second, there is the heavyweight lock in PreCommit_Notify(), which prevents parallelism of NOTIFY. Tom Lane has an idea [3] on how to improve this.

My perf+pgbench experiments indicate that which of these two scalability problems is the bottleneck depends on the workload.

I think the idea of keeping track of channels per backend has merit, but I want to take a step back and see what others think about the idea first.

I guess my main question is whether we should fix one problem first, then the other, both at the same time, or only one or the other?

I've attached some benchmarks using pgbench and running postgres under perf, which I hope can provide some insights.

/Joel

[1] https://www.postgresql.org/message-id/flat/a0b12a70-8200-4bd4-9e24-56796314bdce%40app.fastmail.com
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B3MkS21yK4jL4cgZywdnnGKiBg0jatoV6kzaniBmcqbQ%40mail.gmail.com
[3] https://www.postgresql.org/message-id/1878165.1752858390%40sss.pgh.pa.us
Attachments
[ getting back to this... ]

"Joel Jacobson" <joel@compiler.org> writes:
> I'm withdrawing the latest patches, since they won't fix the scalability
> problems, but only provide some performance improvements by eliminating
> redundant IPC signalling. This could also be improved outside of
> async.c, by optimizing ProcSignal [1] or by removing ProcSignal, as
> "Interrupts vs Signals" [2] is working on.

> There seem to be two different scalability problems, which appear to be
> orthogonal:

> First, there is the thundering herd problem that I tried to solve initially
> in this thread, by introducing a hash table in shared memory to keep
> track of which backends listen to which channels, to avoid immediate
> wakeup of all listening backends for every notification.

> Second, there is the heavyweight lock in PreCommit_Notify(), which prevents
> parallelism of NOTIFY. Tom Lane has an idea [3] on how to improve this.

I concur that these are orthogonal issues, but I don't understand why you withdrew your patches --- don't they constitute a solution to the first scalability bottleneck?

> I guess my main question is whether we should fix one problem first,
> then the other, both at the same time, or only one or the other?

I imagine we'd eventually want to fix both, but it doesn't have to be done in the same patch.

			regards, tom lane
On Tue, Sep 23, 2025, at 18:27, Tom Lane wrote:
> I concur that these are orthogonal issues, but I don't understand
> why you withdrew your patches --- don't they constitute a solution
> to the first scalability bottleneck?

Thanks for getting back to this thread. I was unhappy with not finding a solution that would improve all use-cases; I had a feeling it would be possible to find one, and I think I've done so now.

>> I guess my main question is whether we should fix one problem first,
>> then the other, both at the same time, or only one or the other?
>
> I imagine we'd eventually want to fix both, but it doesn't have to
> be done in the same patch.

I've attached a new patch with a new pragmatic approach that specifically addresses the context switching cost. The patch is based on the assumption that some extra LISTEN/NOTIFY latency would be acceptable to most users, as a trade-off, in order to improve throughput.

One nice thing with this approach is that it has the potential to improve throughput both for users with just a single listening backend, and also for users with lots of listening backends.

More details in the commit message of the patch.

Curious to hear thoughts on this approach.

/Joel
Attachments
Hi Joel,
Thanks for the patch. After reviewing it, I have a few comments.
On Sep 25, 2025, at 04:34, Joel Jacobson <joel@compiler.org> wrote:
> Curious to hear thoughts on this approach.
> /Joel
> <0001-LISTEN-NOTIFY-make-the-latency-throughput-trade-off-.patch>
1.
```
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -35,6 +35,7 @@ typedef enum TimeoutId
IDLE_SESSION_TIMEOUT,
IDLE_STATS_UPDATE_TIMEOUT,
CLIENT_CONNECTION_CHECK_TIMEOUT,
+ NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
STARTUP_PROGRESS_TIMEOUT,
```
Can we define the new one after STARTUP_PROGRESS_TIMEOUT to try to preserve the existing enum values?

2.
```
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -766,6 +766,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#notify_latency_target = 0 # in milliseconds, 0 is disabled
#bytea_output = 'hex' # hex, escape
```
I think we should add one more tab so that the comment aligns with the comment on the previous line.
3.
```
/* GUC parameters */
bool Trace_notify = false;
+int notify_latency_target;
```
I know the compiler will automatically initialize notify_latency_target to 0, but all other global and static variables around it are explicitly initialized, so it would look better to assign 0 to it, just to keep the coding style consistent.
4.
```
+ /*
+ * Throttling check: if we were last active too recently, defer. This
+ * check is safe without a lock because it's based on a backend-local
+ * timestamp.
+ */
+ if (notify_latency_target > 0 &&
+ !TimestampDifferenceExceeds(last_wakeup_start_time,
+ GetCurrentTimestamp(),
+ notify_latency_target))
+ {
+ /*
+ * Too soon. We leave wakeup_pending_flag untouched (it must be true,
+ * or we wouldn't have been signaled) to tell senders we are
+ * intentionally delaying. Arm a timer to re-awaken and process the
+ * backlog later.
+ */
+ enable_timeout_after(NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
+ notify_latency_target);
+ return;
+ }
+
```
Should we avoid enabling duplicate timeouts? Currently, whenever a duplicate notification is avoided, a new timeout is enabled. I think we can add another variable to remember whether a timeout has already been enabled.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, Sep 25, 2025, at 10:25, Chao Li wrote:
> Hi Joel,
>
> Thanks for the patch. After reviewing it, I have a few comments.

Thanks for reviewing!

>> On Sep 25, 2025, at 04:34, Joel Jacobson <joel@compiler.org> wrote:
> 1. ...
> Can we define the new one after STARTUP_PROGRESS_TIMEOUT to try to
> preserve the existing enum values?

Fixed.

> 2. ...
> I think we should add one more tab so that the comment aligns with the
> comment on the previous line.

Fixed.

> 3. ...
> I know the compiler will automatically initialize notify_latency_target
> to 0, but all other global and static variables around it are explicitly
> initialized, so it would look better to assign 0 to it, just to keep the
> coding style consistent.

Fixed.

> 4. ...
> Should we avoid enabling duplicate timeouts? Currently, whenever a
> duplicate notification is avoided, a new timeout is enabled. I think we
> can add another variable to remember whether a timeout has already been
> enabled.

Hmm, I don't see how a duplicate timeout could happen?

Once we decide to defer the wakeup, wakeup_pending_flag remains set, which avoids further signals from notifiers, so I don't see how we could re-enter ProcessIncomingNotify(), since notifyInterruptPending is reset when ProcessIncomingNotify() is called, and notifyInterruptPending is only set when a signal is received (or set directly when in the same process).

New patch attached with 1-3 fixed.

/Joel
Attachments
On Sep 26, 2025, at 05:13, Joel Jacobson <joel@compiler.org> wrote:
> Hmm, I don't see how a duplicate timeout could happen?
>
> Once we decide to defer the wakeup, wakeup_pending_flag remains set,
> which avoids further signals from notifiers, so I don't see how we could
> re-enter ProcessIncomingNotify(), since notifyInterruptPending is reset
> when ProcessIncomingNotify() is called, and notifyInterruptPending is
> only set when a signal is received (or set directly when in the same
> process).
I think what you explained is partially correct.
Based on my understanding, any backend process may call SignalBackends(), which means that it’s possible that multiple backend processes may call SignalBackends() concurrently.
Looking at your code, between checking QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true, there is a block of code (the "if-else") that runs, so it's possible that multiple backend processes pass the QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check. In that case multiple signals would be sent to the same process, which would lead to duplicate timeouts being enabled in the receiving process.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:
> I think what you explained is partially correct.
>
> Based on my understanding, any backend process may call
> SignalBackends(), which means that it's possible that multiple backend
> processes may call SignalBackends() concurrently.
>
> Looking at your code, between checking
> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true, there is
> a block of code (the "if-else") that runs, so it's possible that
> multiple backend processes pass the
> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check. In that case multiple signals
> would be sent to the same process, which would lead to duplicate timeouts
> being enabled in the receiving process.

I don't see how that can happen; we're checking wakeup_pending_flag while holding an exclusive lock, so I don't see how multiple backend processes could be within the region where we check/set wakeup_pending_flag at the same time?

/Joel
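For anyone following along, the pattern under discussion is roughly this. A hypothetical sketch, written as if it lived in async.c so the queue macros are visible; QUEUE_BACKEND_WAKEUP_PENDING_FLAG() is the patch's per-backend flag accessor, and the "remember" step is elided:

```c
static void
MarkWakeupsPending(void)
{
	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
	for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER;
		 i = QUEUE_NEXT_LISTENER(i))
	{
		if (QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i))
			continue;			/* a signal is already on its way; skip */

		/* Claim the wakeup while still holding the exclusive lock. */
		QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) = true;
		/* ... remember this backend so we can signal it after releasing ... */
	}
	LWLockRelease(NotifyQueueLock);
}
```

Because both the check and the set happen under the same exclusive NotifyQueueLock, two concurrent notifiers cannot both observe the flag as clear for the same backend.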
On Sep 26, 2025, at 17:32, Joel Jacobson <joel@compiler.org> wrote:
> On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:
>> I think what you explained is partially correct.
>>
>> Based on my understanding, any backend process may call
>> SignalBackends(), which means that it's possible that multiple backend
>> processes may call SignalBackends() concurrently.
>>
>> Looking at your code, between checking
>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true, there is
>> a block of code (the "if-else") that runs, so it's possible that
>> multiple backend processes pass the
>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check. In that case multiple signals
>> would be sent to the same process, which would lead to duplicate timeouts
>> being enabled in the receiving process.
>
> I don't see how that can happen; we're checking wakeup_pending_flag
> while holding an exclusive lock, so I don't see how multiple backend
> processes could be within the region where we check/set
> wakeup_pending_flag at the same time?
>
> /Joel
I might have missed the fact that an exclusive lock is held. I will revisit that part again.
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Fri, Sep 26, 2025, at 11:44, Chao Li wrote:
>> On Sep 26, 2025, at 17:32, Joel Jacobson <joel@compiler.org> wrote:
>>
>> On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:
>>
>>> I think what you explained is partially correct.
>>>
>>> Based on my understanding, any backend process may call
>>> SignalBackends(), which means that it's possible that multiple backend
>>> processes may call SignalBackends() concurrently.
>>>
>>> Looking at your code, between checking
>>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true, there is
>>> a block of code (the "if-else") that runs, so it's possible that
>>> multiple backend processes pass the
>>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check. In that case multiple signals
>>> would be sent to the same process, which would lead to duplicate timeouts
>>> being enabled in the receiving process.
>>
>> I don't see how that can happen; we're checking wakeup_pending_flag
>> while holding an exclusive lock, so I don't see how multiple backend
>> processes could be within the region where we check/set
>> wakeup_pending_flag at the same time?
>>
>> /Joel
>
> I might have missed the fact that an exclusive lock is held. I will
> revisit that part again.

I've re-read this entire thread, and I actually think my original approaches are more promising, that is, the 0001-optimize_listen_notify-v4.patch patch, doing multicast targeted signaling.

Therefore, please consider the latest patch merely a PoC with some possibly interesting ideas.

Before this patch, I had never used PostgreSQL's timeout mechanism, so I didn't consider it when thinking about how to solve the remaining problems with 0001-optimize_listen_notify-v4.patch, which currently can't guarantee that all listening backends will eventually catch up, since it just kicks one of the most lagging ones for each notification. This could be a problem in practice if there is a long period of time with no notifications coming in. Then some listening backends could end up not being signaled and would stay behind, preventing the queue tail from advancing.

I'm thinking maybe somehow we can use the timeout mechanism here, but I'm not sure how yet. Any ideas?

/Joel
On Sep 28, 2025, at 18:24, Joel Jacobson <joel@compiler.org> wrote:
>> I might have missed the fact that an exclusive lock is held. I will
>> revisit that part again.
>
> I've re-read this entire thread, and I actually think my original
> approaches are more promising, that is, the
> 0001-optimize_listen_notify-v4.patch patch, doing multicast targeted
> signaling.
>
> Therefore, please consider the latest patch merely a PoC with some
> possibly interesting ideas.
>
> Before this patch, I had never used PostgreSQL's timeout mechanism,
> so I didn't consider it when thinking about how to solve the
> remaining problems with 0001-optimize_listen_notify-v4.patch, which
> currently can't guarantee that all listening backends will eventually
> catch up, since it just kicks one of the most lagging ones for each
> notification. This could be a problem in practice if there is a long
> period of time with no notifications coming in. Then some listening
> backends could end up not being signaled and would stay behind,
> preventing the queue tail from advancing.
>
> I'm thinking maybe somehow we can use the timeout mechanism here, but
> I'm not sure how yet. Any ideas?
>
> /Joel
Hi Joel,
I never had a concern about using the timeout mechanism. My comment was about enabling the timeout more than once.

I just revisited the code, and now I agree that I was over-worried, because I had missed considering NotifyQueueLock. With the lock protection, a backend process's QUEUE_BACKEND_WAKEUP_PENDING_FLAG has no race condition, so no duplicate signals should be sent to the same backend process. Then, in the backend process, you have "last_wakeup_start_time", which avoids duplicate timeouts within the configured period, and you reset last_wakeup_start_time in asyncQueueReadAllNotifications() together with clearing the QUEUE_BACKEND_WAKEUP_PENDING_FLAG.
So, overall v2 looks good to me.
One last tiny comment is about the naming of last_wakeup_start_time. I think it could be renamed to "last_wakeup_time", because the variable just records when asyncQueueReadAllNotifications() was last called; there doesn't seem to be any notion of a "start" involved.
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Mon, Sep 29, 2025, at 04:33, Chao Li wrote:
> I never had a concern about using the timeout mechanism. My comment was
> about enabling the timeout more than once.

Thanks for reviewing. However, as said in my previous email, I'm sorry, but I don't believe in my suggested throughput/latency approach. I unfortunately managed to derail from the IMO more promising approaches I worked on initially.

What I couldn't find a solution to back then was the problem of possibly ending up in a situation where some lagging backends would never catch up.

In this new patch, I've simply introduced a new bgworker, given the specific task of kicking lagging backends. I wish of course we could do without the bgworker, but I don't see how that would be possible.

---
optimize_listen_notify-v5.patch:

Fix LISTEN/NOTIFY so it scales with idle listening backends

Currently, idle listening backends cause a dramatic slowdown due to context switching when they are signaled and wake up. This is wasteful when they are not listening to the channel being notified. Just 10 extra idle listening connections cause a slowdown from 8700 TPS to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra it falls to 250 TPS.

To improve scalability with the number of idle listening backends, this patch introduces a shared hash table to keep track of channels per listening backend. This hash table is partitioned to reduce contention on concurrent LISTEN/UNLISTEN operations.

We keep track of up to NOTIFY_MULTICAST_THRESHOLD (16) listeners per channel. Benchmarks indicated diminishing gains above this level. Setting it lower seems unnecessary, so a constant seemed fine; a GUC did not seem motivated.

This patch also adds a wakeup_pending flag to each backend's queue status, to avoid redundant signaling when a wakeup is already pending and the backend would otherwise be signaled again. The flag is set when a backend is signaled and cleared before processing the queue. This order is important to ensure correctness.

It was also necessary to add a new bgworker, notify_bgworker, whose sole responsibility is to wake up lagging listening backends, ensuring they are kicked when they are about to fall too far behind. This bgworker is always started at postmaster startup, but is only activated upon NOTIFY by signaling it, unless it is already active.

The notify_bgworker staggers the signaling of lagging listening backends by sleeping 100 ms between each signal, to prevent the thundering herd problem we would otherwise get if all listening backends woke up at the same time. It loops until there are no more lagging listening backends, and then becomes inactive.

/Joel
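To give a feel for the bgworker's behavior, here is a minimal sketch of what its main loop could look like; FindMostLaggingListener() and KickListener() are invented helper names, the wait-event classification is simplified to PG_WAIT_EXTENSION, and the real patch may be structured differently:

```c
#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/latch.h"
#include "storage/procnumber.h"
#include "utils/wait_event.h"

extern ProcNumber FindMostLaggingListener(void);	/* hypothetical */
extern void KickListener(ProcNumber procno);		/* hypothetical */

void
NotifyBgworkerMain(Datum main_arg)
{
	BackgroundWorkerUnblockSignals();

	for (;;)
	{
		ProcNumber	laggard;

		/* Sleep until a NOTIFY sender activates us by setting our latch. */
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1L,
						 PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);

		/* Kick lagging listeners one at a time, staggered by 100 ms. */
		while ((laggard = FindMostLaggingListener()) != INVALID_PROC_NUMBER)
		{
			KickListener(laggard);

			/* Stagger wakeups to avoid a thundering herd. */
			(void) WaitLatch(MyLatch, WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, 100L,
							 PG_WAIT_EXTENSION);
			ResetLatch(MyLatch);
		}
		/* No laggards left: become inactive until signaled again. */
	}
}
```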
Attachments
On Tue, Sep 30, 2025, at 20:56, Joel Jacobson wrote:
> Attachments:
> * optimize_listen_notify-v5.patch

Changes since v5:

*) Added missing #include "nodes/pg_list.h" to fix a List type error in headerscheck.
*) Add NOTIFY_DEFERRED_WAKEUP_MAIN to wait_event_names.txt and rename WAIT_EVENT_NOTIFY_DEFERRED_WAKEUP to WAIT_EVENT_NOTIFY_DEFERRED_WAKEUP_MAIN.

/Joel
Attachments
"Joel Jacobson" <joel@compiler.org> writes: > Thanks for reviewing. However, like said in my previous email, I'm > sorry, but don't believe in my suggested throughput/latency approach. I > unfortunately managed to derail from the IMO more promising approaches I > worked on initially. > What I couldn't find a solution to then, was the problem of possibly > ending up in a situation where some lagging backends would never catch > up. > In this new patch, I've simply introduced a new bgworker, given the > specific task of kicking lagging backends. I wish of course we could do > without the bgworker, but I don't see how that would be possible. I don't understand why you feel you need a bgworker. The existing code does not have any provision that guarantees a lost signal will eventually be re-sent --- it will be if there is continuing NOTIFY traffic, but not if all the senders suddenly go quiet. AFAIR we've had zero complaints about that in 25+ years. So I'm perfectly content to continue the approach of "check for laggards during NOTIFY". (This could be gated behind an overall check on how long the notify queue is, so that we don't expend the cycles when things are performing as-expected.) If you feel that that's not robust enough, you should split it out as a separate patch that's advertised as a robustness improvement not a performance improvement, and see if you can get anyone to bite. The other thing I'm concerned about with this patch is the new shared hash table. I don't think we have anywhere near a good enough fix on how big it needs to be, and that is problematic because of the frozen-at-startup size of main shared memory. We could imagine inventing YA GUC to let the user tell us how big to make it, but I think there is now a better way: use a dshash table (src/backend/lib/dshash.c). That offers the additional win that we don't have to create it at all in an installation that never uses LISTEN/NOTIFY. We could also rethink whether we really need the NOTIFY_MULTICAST_THRESHOLD limit: rather than having two code paths, we could just say that all listeners are registered for every channel. regards, tom lane