Proposal: Out-of-Order NOTIFY via GUC to Improve LISTEN/NOTIFY Throughput
Hi all,

There’s been some renewed attention on improving the performance of the LISTEN/NOTIFY system, which historically hasn’t scaled well under high notify frequency. Joel Jacobson recently shared some work on optimizing the LISTEN path [1], and I wanted to follow up with a proposal focused on the NOTIFY side.

One of the main bottlenecks in the current implementation is the global lock taken in `PreCommit_Notify`, which serializes all notifications. In many use cases (especially where NOTIFY is used for non-mission-critical caching or coordination), users may not care about strict notification ordering or delivery semantics in the event of a transaction rollback.

To explore this further, I’ve drafted a patch that introduces a new GUC: `publish_out_of_order_notifications`. When enabled, this skips the global lock in `PreCommit_Notify`, allowing notifications to be queued in parallel. This comes at the cost of possible out-of-order delivery and the potential for notifications to be delivered from rolled-back transactions.

For benchmarking, I used pgbench with a custom SQL script that sends a single NOTIFY message per transaction. The test was run using 8 connections and 2000 transactions per client. Here are the results on a MacBook Air (Apple M2 chip, 8 cores, 16 GB memory):

publish_out_of_order_notifications = off:

• Run 1: 158,190 TPS (latency: 0.051 ms)
• Run 2: 189,771 TPS (latency: 0.042 ms)
• Run 3: 189,401 TPS (latency: 0.042 ms)
• Run 4: 190,288 TPS (latency: 0.042 ms)
• Run 5: 185,001 TPS (latency: 0.043 ms)

publish_out_of_order_notifications = on:

• Run 1: 298,982 TPS (latency: 0.027 ms)
• Run 2: 345,162 TPS (latency: 0.023 ms)
• Run 3: 351,309 TPS (latency: 0.023 ms)
• Run 4: 333,035 TPS (latency: 0.024 ms)
• Run 5: 353,834 TPS (latency: 0.023 ms)

This shows roughly a 2x improvement in TPS in this basic benchmark. I believe this could serve as a useful knob for users who want performance over guarantees, and it may help guide future efforts to reduce contention in NOTIFY more generally. I also have some ideas for stricter-but-faster implementations that preserve ordering, but I wanted to start with a minimal and illustrative patch.

I'd appreciate thoughts on the direction and whether this seems worth pursuing further.

Relevant prior discussions:

[1] https://www.postgresql.org/message-id/flat/6899c044-4a82-49be-8117-e6f669765f7e%40app.fastmail.com
[2] https://www.postgresql.org/message-id/flat/CAM527d_s8coiXDA4xbJRyVOcNnnjnf%2BezPYpn214y3-5ixn75w%40mail.gmail.com

Thanks,
Rishu
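The script itself wasn't shown inline; a minimal one matching the description — a single NOTIFY per transaction on one shared channel (file and channel names assumed here; an equivalent script appears later in the thread) — would be:

% cat notify.sql
NOTIFY channel_common;

% pgbench -f notify.sql -c 8 -t 2000 -n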
Attachments
Rishu Bagga <rishu.postgres@gmail.com> writes:
> To explore this further, I’ve drafted a patch that introduces a new GUC:
> `publish_out_of_order_notifications`.

We have generally found that GUCs that change query semantics turn out to be very regrettable choices. They break applications that aren't expecting it, and the granularity of the effect is frequently not what you want either. In the case at hand, I fear that a USERSET GUC is particularly inappropriate, because what this proposal does is to break the notification order guarantees system-wide, even if only one issuer of notifications has it set.

> When enabled, this skips the global
> lock in `PreCommit_Notify`, allowing notifications to be queued in parallel.

How much does that really help, considering that we'll still serialize on the NotifyQueueLock? I think that you'd need some rather fundamental redesign to allow truly parallel queueing.

Stepping back a bit, my recollection is that "queue entries appear in commit order" is a rather fundamental assumption in async.c, which we rely on while dequeuing notifications. If that stops being true, I think you'd have cases where listening backends fail to collect available (committed) notifications because they appear in the queue beyond not-yet-committed notifications. Maybe the window in which a notification would remain uncommitted is short enough that we could avert our eyes from that problem, but I'm not convinced.

So I sympathize with concerns about how well the notification code scales, but I think you're going to have to do a ton more work than this to get to anything that would pass muster to get committed. In particular, I'd really want to see something that involves explicitly opting into out-of-order delivery on a per-NOTIFY basis, because anything else will break too many applications. The underlying queue mechanism is going to need a serious rethink, too. My guess is that we'd need to move to something involving multiple queues rather than just one, and I'm not very sure how that ought to work. (But perhaps queuing in-order notifications separately from not-guaranteed-in-order notifications would help? Or maybe the latter should be transmitted in some other way entirely.)

regards, tom lane
From: "Joel Jacobson"
On Fri, Jul 18, 2025, at 03:28, Tom Lane wrote:
> So I sympathize with concerns about how well the notification code
> scales, but I think you're going to have to do a ton more work than
> this to get to anything that would pass muster to get committed.
> In particular, I'd really want to see something that involves
> explicitly opting into out-of-order delivery on a per-NOTIFY basis,
> because anything else will break too many applications. The
> underlying queue mechanism is going to need a serious rethink, too.
> My guess is that we'd need to move to something involving multiple
> queues rather than just one, and I'm not very sure how that ought
> to work. (But perhaps queuing in-order notifications separately
> from not-guaranteed-in-order notifications would help? Or maybe
> the latter should be transmitted in some other way entirely.)

I agree opting into out-of-order delivery on a per-NOTIFY basis sounds like a great idea.

For all the existing users that rely on in-order delivery, and those who are not really sure they would dare to make the switch, do we want to try to do something to improve their use-cases?

It doesn't seem possible to do better than what we already do when all backends listen on the same channel. However, for cases when up to K backends listen on the same channel (multicast), my patch and benchmark in the other thread suggest we can achieve massive improvements in parallelization of both LISTEN/UNLISTEN and NOTIFY without introducing that much extra complexity.

Do we find this goal worth pursuing on its own?

Or should we only focus on exposing a new third parameter to pg_notify() to indicate out-of-order delivery?

/Joel
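For concreteness, one hypothetical shape for such a parameter (the signature, argument name, and default below are invented for illustration — no such argument exists in PostgreSQL today, and this statement would error on current servers):

-- Hypothetical third pg_notify() argument opting a single
-- notification out of ordering guarantees:
--   pg_notify(channel text, payload text, in_order boolean DEFAULT true)
SELECT pg_notify('cache_invalidation', 'key=42', false);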
From: "Joel Jacobson"
On Fri, Jul 18, 2025, at 02:49, Rishu Bagga wrote:
> Hi all,
>
> There’s been some renewed attention on improving the performance of the
> LISTEN/NOTIFY system, which historically hasn’t scaled well under high
> notify frequency. Joel Jacobson recently shared some work on optimizing
> the LISTEN path [1], and I wanted to follow up with a proposal focused on
> the NOTIFY side.

To clarify, my patch optimizes parallelizability of NOTIFY, without degrading parallelizability of LISTEN/UNLISTEN.

/Joel
From: "Joel Jacobson"
On Fri, Jul 18, 2025, at 08:46, Joel Jacobson wrote:
> However, for cases when up to K backends listen on the same channel (multicast),
> my patch and benchmark in the other thread suggest we can achieve massive
> improvements in parallelization of both LISTEN/UNLISTEN as well as NOTIFY
> without introducing that much extra complexity.
>
> Do we find this goal worth pursuing on its own?
>
> Or should we only focus on exposing a new third parameter to pg_notify()
> to indicate out-of-order delivery?

I guess the best would be if we could do both, i.e. improve existing use-cases as well as support out-of-order delivery per notification, within acceptable levels of increased complexity?

One thing I wonder though, that I haven't yet benchmarked, is whether even more parallelization than my in-order optimizations of NOTIFY already achieve would actually significantly improve real workloads, where you do some actual DML in the same txn that sends the NOTIFY. My in-order optimizations now scale to 3000 tps at 1000 connections. I wonder if PostgreSQL really can push that much DML tps, or if the serialization effect of LISTEN/NOTIFY would be marginal compared to other serialization caused by the DML.

/Joel
From: "Joel Jacobson"
On Fri, Jul 18, 2025, at 10:15, Joel Jacobson wrote:
> On Fri, Jul 18, 2025, at 02:49, Rishu Bagga wrote:
>> Hi all,
>>
>> There’s been some renewed attention on improving the performance of the
>> LISTEN/NOTIFY system, which historically hasn’t scaled well under high
>> notify frequency. Joel Jacobson recently shared some work on optimizing
>> the LISTEN path [1], and I wanted to follow up with a proposal focused on
>> the NOTIFY side.
>
> To clarify, my patch optimizes parallelizability of NOTIFY, without
> degrading parallelizability of LISTEN/UNLISTEN.

I realize my above clarification is not technically true; what I meant to say is:

My patch improves NOTIFY TPS when many backends are listening on multiple channels by eliminating unnecessary syscall wake‑ups, but it doesn't increase the internal parallelism of the NOTIFY queue itself.

/Joel
"Joel Jacobson" <joel@compiler.org> writes: > My patch improves NOTIFY TPS when many backends are listening on multiple > channels by eliminating unnecessary syscall wake‑ups, but it doesn't increase > the internal parallelism of the NOTIFY queue itself. After thinking about this for awhile, I have a rough idea of something we could do to improve parallelism of NOTIFY. As a bonus, this'd allow processes on hot standby servers to receive NOTIFYs from processes on the primary, which is a feature many have asked for. The core thought here was to steal some implementation ideas from two-phase commit. I initially thought maybe we could remove the SLRU queue entirely, and maybe we can still find a way to do that, but in this sketch it's still there with substantially reduced traffic. The idea basically is to use the WAL log rather than SLRU as transport for notify messages. 1. In PreCommit_Notify(), gather up all the notifications this transaction has emitted, and write them into a WAL log message. Remember the LSN of this message. (I think this part should be parallelizable, because of work that's previously been done to allow parallel writes to WAL.) 2. When writing the transaction's commit WAL log entry, include the LSN of the previous notify-data entry. 3. Concurrently with writing the commit entry, send a message to the notify SLRU queue. This would be a small fixed-size message with the transaction's XID, database ID, and the LSN of the notify-data WAL entry. (The DBID is there to let listeners quickly ignore traffic from senders in other DBs.) 4. Signal listening backends to check the queue, as we do now. 5. Listeners read the SLRU queue and then, if in same database, pull the notify data out of the WAL log. (I believe we already have enough infrastructure to make that cheap, because 2-phase commit does it too.) In the simplest implementation of this idea, step 3 would still require a global lock, to ensure that SLRU entries are made in commit order. However, that lock only needs to be held for the duration of step 3, which is much shorter than what happens now. A more advanced idea could be to send the SLRU message in step 1, as soon as we've pushed out the notify-data message. In this approach, listening backends would become responsible for figuring out whether senders have committed yet and processing the messages in correct commit order. This is quite handwavy yet because I don't have a clear idea of how they'd do that reliably, but maybe it's possible. In a hot standby server, the WAL replay process would simply have to send the proper SLRU message and issue signals when it sees a commit message containing a notify-data LSN. (One small detail to be worked out is who's responsible for truncating the notify SLRU in a hot standby server. In current usage the sending backends do it, but there won't be any in hot standby, and there aren't necessarily any listeners either.) An area that needs a bit of thought is how to ensure that we don't truncate away WAL that contains still-unread notify messages. We have mechanisms already to prevent too-soon truncation of WAL, so I doubt there's anything too difficult here. (Also note that we have an existing unsolved problem of preventing CLOG truncation while the notify SLRU still contains references to some old XID. Perhaps that could be dealt with at the same time.) This isn't something I'm likely to work on anytime soon, but perhaps someone else would like to push on these ideas. regards, tom lane
From: "Joel Jacobson"
On Fri, Jul 18, 2025, at 19:06, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> My patch improves NOTIFY TPS when many backends are listening on multiple
>> channels by eliminating unnecessary syscall wake‑ups, but it doesn't increase
>> the internal parallelism of the NOTIFY queue itself.
>
> After thinking about this for awhile, I have a rough idea of
> something we could do to improve parallelism of NOTIFY.
> As a bonus, this'd allow processes on hot standby servers to
> receive NOTIFYs from processes on the primary, which is a
> feature many have asked for.

LISTEN/NOTIFY on standbys sounds interesting on its own.

However, I don't think reducing the time we hold the exclusive lock would do anything at all to help the users who have been reporting problems with LISTEN/NOTIFY. I'll explain why I think so.

I assume Rishu in his original post, with "renewed attention", was referring to the post "Postgres LISTEN/NOTIFY does not scale" [1] that was on the front page of Hacker News with 319 comments [2].

I think the "still waiting for AccessExclusiveLock" they saw in the logs is probably just a *symptom* but not the *cause* of their problems.

Unfortunately, the author of [1] jumped to the conclusion that the global lock was the problem. I'm quite sure it is probably not, because:

We know for sure that current users do LISTEN and NOTIFY in the same database. And there is no point in doing NOTIFY unless you also do LISTEN.

Their plots show a y-axis with a few hundred "active sessions". If we assume at least ~100 of them would be listening backends, that would explain their problems, due to the syscall thundering-herd wake-up bomb that each NOTIFY currently causes.

So instead of saying "Postgres LISTEN/NOTIFY does not scale", like in the article [1], I think it would be much more fair and meaningful to say "Postgres LISTEN/NOTIFY does not scale with the number of listening backends".

All my benchmarks support this hypothesis. I've already posted a lot of them, but can of course provide more specific additional benchmarks if desired.

/Joel

[1] https://www.recall.ai/blog/postgres-listen-notify-does-not-scale
[2] https://news.ycombinator.com/item?id=44490510
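The wake-up cost being described comes from the post-commit signaling step; roughly (a simplified paraphrase of SignalBackends() in async.c, not verbatim source):

/*
 * Simplified paraphrase of the current post-commit signaling:
 * a committing notifier signals every listening backend that may
 * still need to read the queue.
 *
 *     for each listening backend b:
 *         if (b may have queue entries left to read)
 *             SendProcSignal(b->pid, PROCSIG_NOTIFY_INTERRUPT, ...);
 *             // ultimately a kill(pid, SIGUSR1) syscall
 *
 * So with ~100 listeners, every NOTIFY-ing commit can issue on the
 * order of 100 syscalls, regardless of which channels they listen on.
 */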
> "Joel Jacobson" <joel@compiler.org> writes: > > Unfortunately, the author of [1] jumped to conclusion and assumed > the global lock was the problem. I'm quite sure it is probably not, > because: > > We know for sure, that current users do LISTEN and NOTIFY > in the same database. And there is no point in doing NOTIFY > unless you also do LISTEN. I agree that, in practice, there is likely no use of NOTIFY without LISTEN. That being said, I disagree that the global lock is not at least one of the bottlenecks from a performance perspective. The global lock is not released until after the call to RecordTransactionCommit and the reset of CommitTransaction. So if there are multiple transactions sending notifications, each must become durable before the next can proceed, which introduces a significant bottleneck. This becomes especially expensive in environments where compute and storage are separated, such as certain cloud-based variants of Postgres. > Tom Lane <tgl@sss.pgh.pa.us> writes: > > 1. In PreCommit_Notify(), gather up all the notifications this > transaction has emitted, and write them into a WAL log message. > Remember the LSN of this message. (I think this part should be > parallelizable, because of work that's previously been done to > allow parallel writes to WAL.) > > 2. When writing the transaction's commit WAL log entry, include > the LSN of the previous notify-data entry. Thanks for the input — this is a compelling idea. I'll work on implementing a proof of concept. Thanks, Rishu
From: "Joel Jacobson"
On Mon, Jul 21, 2025, at 00:06, Rishu Bagga wrote:
>> "Joel Jacobson" <joel@compiler.org> writes:
>>
>> Unfortunately, the author of [1] jumped to the conclusion that
>> the global lock was the problem. I'm quite sure it is probably not,
>> because:
>>
>> We know for sure that current users do LISTEN and NOTIFY
>> in the same database. And there is no point in doing NOTIFY
>> unless you also do LISTEN.
>
> I agree that, in practice, there is likely no use of NOTIFY without
> LISTEN. That being said, I disagree that the global lock is not at
> least one of the bottlenecks from a performance perspective.
>
> The global lock is not released until after the call to
> RecordTransactionCommit and the rest of CommitTransaction. So if
> there are multiple transactions sending notifications, each must
> become durable before the next can proceed, which introduces a
> significant bottleneck.
>
> This becomes especially expensive in environments where compute and
> storage are separated, such as certain cloud-based variants of Postgres.

Can you please share the benchmark script and pgbench commands from your initial post?

When you skipped the global lock, TPS went from max 190k to max 350k. Based on how high these TPS numbers are, I would guess you're not doing LISTEN in your benchmark, but only NOTIFY?

Since there is no point in just doing NOTIFY if nobody is LISTENing, a realistic benchmark would also need to do LISTEN. What you will then see is that TPS will be severely impacted, and the gains from removing the global exclusive lock will drown in the huge cost of all the syscalls.

On my end, I will work on reducing the expensive syscalls, which I believe are the current bottleneck. Once that's fixed though, the next bottleneck might be the global exclusive lock, so I think it's great you're working on improving that as well.

/Joel
From: "Joel Jacobson"
On Mon, Jul 21, 2025, at 08:16, Joel Jacobson wrote:
> Since there is no point in just doing NOTIFY if nobody is LISTENing,
> a realistic benchmark would also need to do LISTEN.
> What you will then see is that TPS will be severely impacted,
> and the gains from removing the global exclusive lock will
> drown in the huge cost of all the syscalls.

I thought I'd better put my money where my mouth is, and decided to try to replicate Rishu's benchmark results and, in addition, benchmark my own hypothesis above: that when doing both LISTEN+NOTIFY rather than just NOTIFY, the 2x improvement would be heavily reduced.

The benchmarks I've run below confirm this hypothesis. The observed 2x improvement is reduced to a ~14% improvement when doing both LISTEN+NOTIFY.

Benchmark from original post:

> Here are the results on a MacBook Air (Apple M2 chip, 8 cores, 16 GB memory):
>
> publish_out_of_order_notifications = off:
>
> • Run 1: 158,190 TPS (latency: 0.051 ms)
> • Run 2: 189,771 TPS (latency: 0.042 ms)
> • Run 3: 189,401 TPS (latency: 0.042 ms)
> • Run 4: 190,288 TPS (latency: 0.042 ms)
> • Run 5: 185,001 TPS (latency: 0.043 ms)
>
> publish_out_of_order_notifications = on:
>
> • Run 1: 298,982 TPS (latency: 0.027 ms)
> • Run 2: 345,162 TPS (latency: 0.023 ms)
> • Run 3: 351,309 TPS (latency: 0.023 ms)
> • Run 4: 333,035 TPS (latency: 0.024 ms)
> • Run 5: 353,834 TPS (latency: 0.023 ms)
>
> This shows roughly a 2x improvement in TPS in this basic benchmark.

# Benchmarks on my MacBook Pro (Apple M3 Max, 16 cores, 128 GB memory)

## NOTIFY only

~160k TPS master (HEAD)
~340k TPS (0001-allow-out-of-order-notifications.patch)

=> ~125% improvement

### master (HEAD)

% cat notify_common.sql
NOTIFY channel_common;

% for n in `seq 1 5` ; do pgbench -f notify_common.sql -c 8 -t 2000 -n | grep -E '^(latency|tps)' ; done
latency average = 0.052 ms
tps = 154326.941626 (without initial connection time)
latency average = 0.049 ms
tps = 162334.368215 (without initial connection time)
latency average = 0.050 ms
tps = 160703.883008 (without initial connection time)
latency average = 0.048 ms
tps = 165296.086615 (without initial connection time)
latency average = 0.049 ms
tps = 163706.310878 (without initial connection time)

### 0001-allow-out-of-order-notifications.patch

% for n in `seq 1 5` ; do PGOPTIONS='-c publish_notifications_out_of_order=true' pgbench -f notify_common.sql -c 8 -t 2000 -n | grep -E '^(latency|tps)' ; done
latency average = 0.026 ms
tps = 310149.647205 (without initial connection time)
latency average = 0.021 ms
tps = 380427.029340 (without initial connection time)
latency average = 0.025 ms
tps = 320108.837005 (without initial connection time)
latency average = 0.024 ms
tps = 333500.083375 (without initial connection time)
latency average = 0.022 ms
tps = 357965.859006 (without initial connection time)

## LISTEN+NOTIFY

~73k TPS master (HEAD)
~83k TPS (0001-allow-out-of-order-notifications.patch)

=> ~14% improvement

### master (HEAD)

% cat listen_notify_common.sql
LISTEN channel_common;
NOTIFY channel_common;

% for n in `seq 1 5` ; do pgbench -f listen_notify_common.sql -c 8 -t 2000 -n | grep -E '^(latency|tps)' ; done
latency average = 0.112 ms
tps = 71677.201722 (without initial connection time)
latency average = 0.109 ms
tps = 73228.220325 (without initial connection time)
latency average = 0.109 ms
tps = 73310.423826 (without initial connection time)
latency average = 0.108 ms
tps = 73995.625009 (without initial connection time)
latency average = 0.113 ms
tps = 70970.431944 (without initial connection time)

### 0001-allow-out-of-order-notifications.patch

% for n in `seq 1 5` ; do PGOPTIONS='-c publish_notifications_out_of_order=true' pgbench -f listen_notify_common.sql -c 8 -t 2000 -n | grep -E '^(latency|tps)' ; done
latency average = 0.098 ms
tps = 81620.992919 (without initial connection time)
latency average = 0.095 ms
tps = 84173.755675 (without initial connection time)
latency average = 0.096 ms
tps = 83634.329802 (without initial connection time)
latency average = 0.095 ms
tps = 84311.700356 (without initial connection time)
latency average = 0.096 ms
tps = 83340.278357 (without initial connection time)

For a normal PostgreSQL installation with the CPU and storage on the same physical machine, I think the results above clearly demonstrate that the global exclusive lock is at least not the bottleneck; I strongly believe that is instead the flood of unnecessary kill(pid, SIGUSR1) syscalls.

If anyone has access to a cloud environment with compute and storage separated, as suggested by Rishu, it would be interesting to see what benchmark results you get.

/Joel
From: "Joel Jacobson"
On Tue, Jul 22, 2025, at 14:48, Joel Jacobson wrote:
> Benchmark from original post:
...
> For a normal PostgreSQL installation with the CPU and storage on the same
> physical machine, I think the results above clearly demonstrate that the
> global exclusive lock is at least not the bottleneck; I strongly believe
> that is instead the flood of unnecessary kill(pid, SIGUSR1) syscalls.

I was wrong here. This is much more complex than I initially thought.

After some additional benchmarking and analyzing perf results, I realize the bottleneck depends on the workload, and is either the kill() syscalls *or* the heavyweight lock.

Here is one scenario where the heavyweight lock actually *is* the bottleneck:

1 session does LISTEN

% pgbench -f notify.sql -c 1000 -j 8 -T 60 -n

Simply commenting out the heavyweight lock gives a dramatic difference:

tps = 7679 (with heavyweight lock; in commit order)
tps = 95430 (without heavyweight lock; not in commit order)

My conclusion so far is that we would greatly benefit both from reducing/eliminating kill() syscalls, as well as from finding ways to avoid the heavyweight lock while preserving commit order.

/Joel
On Fri, Jul 18, 2025 at 10:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After thinking about this for awhile, I have a rough idea of
> something we could do to improve parallelism of NOTIFY.
> As a bonus, this'd allow processes on hot standby servers to
> receive NOTIFYs from processes on the primary, which is a
> feature many have asked for.
>
> The core thought here was to steal some implementation ideas
> from two-phase commit. I initially thought maybe we could
> remove the SLRU queue entirely, and maybe we can still find
> a way to do that, but in this sketch it's still there with
> substantially reduced traffic.
>
> The idea basically is to use the WAL log rather than SLRU
> as transport for notify messages.
>
> 1. In PreCommit_Notify(), gather up all the notifications this
> transaction has emitted, and write them into a WAL log message.
> Remember the LSN of this message. (I think this part should be
> parallelizable, because of work that's previously been done to
> allow parallel writes to WAL.)
>
> 2. When writing the transaction's commit WAL log entry, include
> the LSN of the previous notify-data entry.
>
> 3. Concurrently with writing the commit entry, send a message
> to the notify SLRU queue. This would be a small fixed-size
> message with the transaction's XID, database ID, and the LSN
> of the notify-data WAL entry. (The DBID is there to let
> listeners quickly ignore traffic from senders in other DBs.)
>
> 4. Signal listening backends to check the queue, as we do now.
>
> 5. Listeners read the SLRU queue and then, if in same database,
> pull the notify data out of the WAL log. (I believe we already
> have enough infrastructure to make that cheap, because 2-phase
> commit does it too.)
>
> In the simplest implementation of this idea, step 3 would still
> require a global lock, to ensure that SLRU entries are made in
> commit order. However, that lock only needs to be held for the
> duration of step 3, which is much shorter than what happens now.

Attached is an initial patch that implements this idea.

There is still some work to be done around how to handle truncation / vacuum for the new approach, and testing replication of notifications onto a reader instance.

That being said, I ran some basic benchmarking to stress concurrent notifications.

With the following SQL script, I ran
pgbench -T 100 -c 100 -j 8 -f pgbench_transaction_notify.sql -d postgres

BEGIN;
INSERT INTO test VALUES(1);
NOTIFY benchmark_channel, 'transaction_completed';
COMMIT;

With the patch, 3 runs showed the following TPS:

tps = 66372.705917
tps = 63445.909465
tps = 64412.544339

Without the patch, we got the following TPS:

tps = 30212.390982
tps = 30908.865812
tps = 29191.388601

So, there is about a 2x increase in TPS at 100 connections, which establishes some promise in the approach.

Additionally, this would help solve the issue being discussed in a separate thread [1], where listeners currently rely on the transaction log to verify that a transaction they read has indeed committed, but it is possible that that portion of the transaction log has been truncated by vacuum.

Would appreciate any thoughts on the direction of this patch.

Thanks,
Rishu

[1] https://www.postgresql.org/message-id/CAK98qZ3wZLE-RZJN_Y%2BTFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw%40mail.gmail.com
Attachments
From: "Joel Jacobson"
On Fri, Sep 5, 2025, at 00:53, Rishu Bagga wrote:
> With the following SQL script, I ran
> pgbench -T 100 -c 100 -j 8 -f pgbench_transaction_notify.sql -d postgres
>
> BEGIN;
> INSERT INTO test VALUES(1);
> NOTIFY benchmark_channel, 'transaction_completed';
> COMMIT;

Thanks for working on this. I haven't looked at the code yet, but have some questions on the benchmark.

What's the definition of the test table?

> With the patch, 3 runs showed the following TPS:
>
> tps = 66372.705917
> tps = 63445.909465
> tps = 64412.544339
>
> Without the patch, we got the following TPS:
>
> tps = 30212.390982
> tps = 30908.865812
> tps = 29191.388601
>
> So, there is about a 2x increase in TPS at 100 connections, which establishes
> some promise in the approach.

In your benchmark, how many backends were doing `LISTEN benchmark_channel;`?

It would be interesting if you could run the benchmark and vary the number of listeners, e.g. with 1, 10, 100 listeners, to understand the effect of this patch for different types of workloads.

> Additionally, this would help solve the issue being discussed in a
> separate thread [1], where listeners currently rely on the transaction
> log to verify that a transaction they read has indeed committed, but it
> is possible that that portion of the transaction log has been truncated
> by vacuum.

Nice!

/Joel
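One simple way to hold N idle listeners open for such a run (a sketch only; the channel name and database are assumed from the benchmark above):

% cat listeners.sh
#!/bin/sh
# Hold N listening backends open while the notify benchmark runs.
# The periodic SELECT gives each backend a command boundary at which
# it drains pending notifications, so the queue tail can advance.
N=${1:-10}
for i in $(seq 1 "$N"); do
  ( echo "LISTEN benchmark_channel;"
    while true; do echo "SELECT 1;"; sleep 1; done
  ) | psql -q -d postgres >/dev/null &
done
wait   # interrupt with Ctrl-C (or kill the jobs) once the benchmark finishes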
From: Arseniy Mukhin
Hi,

On Fri, Sep 5, 2025 at 1:53 AM Rishu Bagga <rishu.postgres@gmail.com> wrote:
>
> Attached is an initial patch that implements this idea.

Thank you for sharing the patch and for working on this! I briefly read the patch and there are some points:

1) There are some compilation warnings about unused functions.

2) I thought the idea was to add notification_data_entry_lsn to the transaction's commit WAL entry, but the patch introduced a new XLOG_ASYNC_NOTIFY_COMMIT WAL entry and put the LSN in it. Also, XLOG_ASYNC_NOTIFY_COMMIT is written in AtCommit_Notify, when the transaction is already committed, so we write it after the commit entry. I don't think we can write anything about the transaction to the WAL after we have written the transaction commit record. In my understanding, all transaction stuff in WAL should come before the commit / abort entries, at least because during committing we sync WAL to disk up to the commit entry, and we can't guarantee that anything after a commit entry will survive a crash. Anyway, I think we just don't need this new WAL entry.

3) I see you removed the global lock, but I don't see where you use a lock for step 3 [0] to provide the commit order of notifications. I thought step 3 should look something like this:

if (have_notifications)
    acquire_lock();
...
XactLogCommitRecord();
...
if (have_notifications)
{
    asyncQueueAddCompactEntry();
    release_lock();
}

This way we will have the same order for commit records in WAL and entries in the listen/notify queue. I'm not sure at what level we should add this stuff, but it seems that to minimise the time we hold the lock we need to release it before calling XLogFlush(). I'm not sure about it.

Best regards,
Arseniy Mukhin
From: "Matheus Alcantara"
On Thu Sep 4, 2025 at 7:53 PM -03, Rishu Bagga wrote:
> Attached is an initial patch that implements this idea.

Thanks for working on this! I haven't looked at the code yet, but have some questions related to the issue discussed at [1].

> Additionally, this would help solve the issue being discussed in a
> separate thread [1], where listeners currently rely on the transaction
> log to verify that a transaction they read has indeed committed, but it
> is possible that that portion of the transaction log has been truncated
> by vacuum.

Does your patch already aim to fix the issue? On [2] I implemented a TAP test that reproduces the issue, and when I tried executing it with your patch applied I still see the error. I'm attaching the TAP test isolated; maybe we could incorporate it into your patch series to ensure that the issue is fixed? What do you think?

[1] https://www.postgresql.org/message-id/CAK98qZ3wZLE-RZJN_Y%2BTFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAFY6G8cJm73_MM9SuynZUqtqcaTuepUDgDuvS661oLW7U0dgsg%40mail.gmail.com

--
Matheus Alcantara
Attachments
Hi Joel, Arseniy, Matheus, thanks for taking a look. I’ve attached an updated patch, rebased on the latest commits, that fixes the correctness issues.

On Fri, Sep 5, 2025 at 2:38 AM Joel Jacobson <joel@compiler.org> wrote:

> What's the definition of the test table?

It’s just a one-column integer table, defined as:

CREATE TABLE test(x int);

> In your benchmark, how many backends were doing
> `LISTEN benchmark_channel;`?

I just tested the notify codepath throughput, with no listeners.

> It would be interesting if you could run the benchmark and vary the
> number of listeners, e.g. with 1, 10, 100 listeners, to understand the
> effect of this patch for different types of workloads.

IIUC, the current benchmark wouldn’t be affected by listeners — the commit of the notifying transaction is independent of the listener reading the transaction. I agree that it would be useful to test listener performance as well. I will work on this shortly.

On Fri, Sep 5, 2025 at 10:31 AM Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> wrote:

> 1) There are some compilation warnings about unused functions.

Taken care of in the attached patch.

> 2) I thought the idea was to add notification_data_entry_lsn to the
> transaction's commit WAL entry, but the patch introduced a new
> XLOG_ASYNC_NOTIFY_COMMIT WAL entry and put the LSN in it. Also,
> XLOG_ASYNC_NOTIFY_COMMIT is written in AtCommit_Notify, when the
> transaction is already committed, so we write it after the commit
> entry. I don't think we can write anything about the transaction to
> the WAL after we have written the transaction commit record. In my
> understanding, all transaction stuff in WAL should come before the
> commit / abort entries, at least because during committing we sync
> WAL to disk up to the commit entry, and we can't guarantee that
> anything after a commit entry will survive a crash. Anyway, I think
> we just don't need this new WAL entry.

I removed the XLOG_ASYNC_NOTIFY_COMMIT record in the new patch, in favor of including the LSN of the notification as a new field in the commit record itself. The reasoning here is that notify data is WAL'd before transaction commit, so during WAL replay we need some way to know whether the notification should be added to the notify queue. In the previous patch we were using the XLOG_ASYNC_NOTIFY_COMMIT record, but actually this doesn’t work, because we could write this record before commit, or we could commit and crash before writing this record. So the best solution is to have a field within the commit record that tells us which log record contains the notifications for that transaction, if there are any.

> 3) I see you removed the global lock, but I don't see where you use a lock
> for step 3 [0] to provide the commit order of notifications.

Fixed this in the updated patch.

On Sat, Sep 6, 2025 at 7:52 AM Matheus Alcantara <matheusssilv97@gmail.com> wrote:

> Does your patch already aim to fix the issue? On [2] I implemented a TAP
> test that reproduces the issue, and when I tried executing it with your
> patch applied I still see the error. I'm attaching the TAP test isolated;
> maybe we could incorporate it into your patch series to ensure that the
> issue is fixed? What do you think?

I wasn’t able to run the TAP tests; however, in the updated patch, we can be confident that entries in the queue are from committed transactions. If there is a crash before committing and after writing to the queue, this would be within the critical section, so a notification from an uncommitted transaction would never be read from the queue. That allows us to remove the XidInMVCCSnapshot and TransactionIdDidCommit checks.

Thanks,
Rishu
Attachments
From: "Matheus Alcantara"
On Mon Sep 8, 2025 at 9:08 PM -03, Rishu Bagga wrote:
> Hi Joel, Arseniy, Matheus, thanks for taking a look. I’ve attached an
> updated patch, rebased on the latest commits, that fixes the
> correctness issues.

I think your latest patch is missing the asyncdesc.c file. I'm getting a compile error:

src/backend/access/rmgrdesc/meson.build:4:20: ERROR: File asyncdesc.c does not exist.

--
Matheus Alcantara
On Tue, Sep 9, 2025 at 3:34 PM Matheus Alcantara <matheusssilv97@gmail.com> wrote:
> On Mon Sep 8, 2025 at 9:08 PM -03, Rishu Bagga wrote:
>> Hi Joel, Arseniy, Matheus, thanks for taking a look. I’ve attached an
>> updated patch, rebased on the latest commits, that fixes the
>> correctness issues.
>
> I think your latest patch is missing the asyncdesc.c file. I'm
> getting a compile error:
>
> src/backend/access/rmgrdesc/meson.build:4:20: ERROR: File asyncdesc.c does not exist.
>
> --
> Matheus Alcantara

Sorry about that — added the asyncdesc.c file in the attached patch.
Attachments
From: "Matheus Alcantara"
On Tue Sep 9, 2025 at 7:49 PM -03, Rishu Bagga wrote:
> On Tue, Sep 9, 2025 at 3:34 PM Matheus Alcantara
> <matheusssilv97@gmail.com> wrote:
>> On Mon Sep 8, 2025 at 9:08 PM -03, Rishu Bagga wrote:
>>> Hi Joel, Arseniy, Matheus, thanks for taking a look. I’ve attached an
>>> updated patch, rebased on the latest commits, that fixes the
>>> correctness issues.
>>
>> I think your latest patch is missing the asyncdesc.c file. I'm
>> getting a compile error:
>>
>> src/backend/access/rmgrdesc/meson.build:4:20: ERROR: File asyncdesc.c does not exist.
>
> Sorry about that — added the asyncdesc.c file in the attached patch.

I'm still seeing some compiler errors:

FAILED: src/backend/postgres_lib.a.p/access_rmgrdesc_asyncdesc.c.o
ccache cc -Isrc/backend/postgres_lib.a.p -Isrc/include -I../src/include -I/opt/homebrew/Cellar/icu4c@77/77.1/include -I/opt/homebrew/opt/lz4/include -I/opt/homebrew/Cellar/openssl@3/3.5.2/include -I/opt/homebrew/opt/zstd/include -I/opt/homebrew/opt/openssl@3/include -fdiagnostics-color=always -Wall -Winvalid-pch -O0 -g -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX15.2.sdk -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wmissing-prototypes -Wpointer-arith -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -Wdeclaration-after-statement -Wmissing-variable-declarations -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -Wno-cast-function-type-strict -DBUILDING_DLL -MD -MQ src/backend/postgres_lib.a.p/access_rmgrdesc_asyncdesc.c.o -MF src/backend/postgres_lib.a.p/access_rmgrdesc_asyncdesc.c.o.d -o src/backend/postgres_lib.a.p/access_rmgrdesc_asyncdesc.c.o -c ../src/backend/access/rmgrdesc/asyncdesc.c
../src/backend/access/rmgrdesc/asyncdesc.c:17:10: fatal error: 'access/async_xlog.h' file not found
   17 | #include "access/async_xlog.h"
      |          ^~~~~~~~~~~~~~~~~~~~~
1 error generated.

Am I missing something here? I'm using meson+ninja; I removed the build directory, ran meson setup, and then ninja -C build install, and got this error.

--
Matheus Alcantara
On Tue, Sep 9, 2025 at 4:02 PM Matheus Alcantara <matheusssilv97@gmail.com> wrote:
>
> FAILED: src/backend/postgres_lib.a.p/access_rmgrdesc_asyncdesc.c.o
> ccache cc [...] -o src/backend/postgres_lib.a.p/access_rmgrdesc_asyncdesc.c.o -c ../src/backend/access/rmgrdesc/asyncdesc.c
> ../src/backend/access/rmgrdesc/asyncdesc.c:17:10: fatal error: 'access/async_xlog.h' file not found
>    17 | #include "access/async_xlog.h"
>       |          ^~~~~~~~~~~~~~~~~~~~~
> 1 error generated.
>
> Am I missing something here? I'm using meson+ninja; I removed the build
> directory, ran meson setup, and then ninja -C build install, and got
> this error.

Oops again — I didn't "git add" the new files, so they weren't showing up in the patch. I added async_xlog.h as well now, and tested to make sure the patch applies and compiles. Sorry about that; it should work now.
Attachments
From: "Matheus Alcantara"
On Tue Sep 9, 2025 at 8:14 PM -03, Rishu Bagga wrote:
> Oops again — I didn't "git add" the new files, so they weren't showing
> up in the patch. I added async_xlog.h as well now, and tested to make
> sure the patch applies and compiles. Sorry about that; it should work now.

Thanks, it's compiling now.

You wrote:

> I wasn’t able to run the TAP tests; however, in the updated patch, we
> can be confident that entries in the queue are from committed
> transactions. If there is a crash before committing and after writing to
> the queue, this would be within the critical section, so a notification
> from an uncommitted transaction would never be read from the queue.
> That allows us to remove the XidInMVCCSnapshot and
> TransactionIdDidCommit checks.

I've executed the TAP test that I attached on [1] and the test is failing due to an assert failure:

TRAP: failed Assert("CritSectionCount == 0 || (context)->allowInCritSection"), File: "../src/backend/utils/mmgr/mcxt.c", Line: 1372, PID: 79968
0   postgres   0x0000000105661860 ExceptionalCondition + 236
1   postgres   0x00000001056b7a0c palloc + 248
2   postgres   0x0000000104f96850 SignalBackends + 36
3   postgres   0x0000000104edd50c XactLogCommitRecord + 1632
4   postgres   0x0000000104ee0188 RecordTransactionCommit + 860
5   postgres   0x0000000104edcb8c CommitTransaction + 896
6   postgres   0x0000000104ed7b38 CommitTransactionCommandInternal + 256
7   postgres   0x0000000104ed7a24 CommitTransactionCommand + 16
8   postgres   0x0000000105402220 finish_xact_command + 32
9   postgres   0x00000001053ffb60 exec_simple_query + 1556
10  postgres   0x00000001053feb78 PostgresMain + 3424
11  postgres   0x00000001053f5504 BackendInitialize + 0
12  postgres   0x00000001052c11f4 postmaster_child_launch + 492
13  postgres   0x00000001052c9660 BackendStartup + 336
14  postgres   0x00000001052c70e0 ServerLoop + 432
15  postgres   0x00000001052c5a38 PostmasterMain + 7096
16  postgres   0x0000000105166488 main + 952
17  dyld       0x000000019145ab98 start + 6076

There are also some other tests failing, such as isolation, regress, and others.

[1] https://www.postgresql.org/message-id/DCLSWKOKDAX4.3HS2NBE53P0M2%40gmail.com

--
Matheus Alcantara
From: Arseniy Mukhin
Hi,

Thank you for the new version! There are several points I noticed:

1)

On Tue, Sep 9, 2025 at 3:08 AM Rishu Bagga <rishu.postgres@gmail.com> wrote:
>
> ...
> On Sat, Sep 6, 2025 at 7:52 AM Matheus Alcantara
> <matheusssilv97@gmail.com> wrote:
>
>> Does your patch already aim to fix the issue? On [2] I implemented a TAP
>> test that reproduces the issue, and when I tried executing it with your
>> patch applied I still see the error. I'm attaching the TAP test isolated;
>> maybe we could incorporate it into your patch series to ensure that the
>> issue is fixed? What do you think?
>
> I wasn’t able to run the TAP tests; however, in the updated patch, we
> can be confident that entries in the queue are from committed
> transactions. If there is a crash before committing and after writing to
> the queue, this would be within the critical section, so a notification
> from an uncommitted transaction would never be read from the queue.
> That allows us to remove the XidInMVCCSnapshot and
> TransactionIdDidCommit checks.

I agree that listeners no longer need to check whether the notifying transaction committed, but it seems that listeners still need to wait until the notifying transaction is completed before sending its notifications. I believe we should continue using XidInMVCCSnapshot as the current version does.

2) Now we have two calls to SignalBackends (is it intentional?). The first, in AtCommit_Notify(), seems correct to me. The second, in XactLogCommitRecord(), seems too early, because other backends can't process new notifications at this point (if the point about XidInMVCCSnapshot is correct). And there is an Assert failure (Matheus' report is about it).

3) I think we still need to check whether the queue is full while adding new entries, and ereport if it is (entries are smaller now, but we can still run out of space). And it seems to me that we should do this check before entering the critical section, because in the critical section it will result in a PANIC. Master results in an ERROR in case of a full queue, so should we do the same here?

4) I don't know, but it seems to me that XactLogCommitRecord is not the right place for adding listen/notify logic; it seems to be only about WAL stuff.

5) Do we want to use NotifyQueueLock as the lock that provides the commit order? Maybe we can introduce another lock to avoid unnecessary contention? For example, if we use NotifyQueueLock, we block listeners that want to read new notifications while we are inserting a commit record, which seems unnecessary.

6) We have fixed-size queue entries, so I think we don't need the "padding" logic at the end of the page anymore, because we know how many entries we can have on each page.

BTW, what do you think about creating a separate thread for the patch? The current thread's subject seems a bit irrelevant.

Best regards,
Arseniy Mukhin
From: Masahiko Sawada
On Thu, Sep 4, 2025 at 3:53 PM Rishu Bagga <rishu.postgres@gmail.com> wrote:
>
> On Fri, Jul 18, 2025 at 10:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > After thinking about this for awhile, I have a rough idea of
> > something we could do to improve parallelism of NOTIFY.
> > As a bonus, this'd allow processes on hot standby servers to
> > receive NOTIFYs from processes on the primary, which is a
> > feature many have asked for.
> >
> > The core thought here was to steal some implementation ideas
> > from two-phase commit. I initially thought maybe we could
> > remove the SLRU queue entirely, and maybe we can still find
> > a way to do that, but in this sketch it's still there with
> > substantially reduced traffic.
> >
> > The idea basically is to use the WAL log rather than SLRU
> > as transport for notify messages.
> >
> > 1. In PreCommit_Notify(), gather up all the notifications this
> > transaction has emitted, and write them into a WAL log message.
> > Remember the LSN of this message. (I think this part should be
> > parallelizable, because of work that's previously been done to
> > allow parallel writes to WAL.)
> >
> > 2. When writing the transaction's commit WAL log entry, include
> > the LSN of the previous notify-data entry.
> >
> > 3. Concurrently with writing the commit entry, send a message
> > to the notify SLRU queue. This would be a small fixed-size
> > message with the transaction's XID, database ID, and the LSN
> > of the notify-data WAL entry. (The DBID is there to let
> > listeners quickly ignore traffic from senders in other DBs.)
> >
> > 4. Signal listening backends to check the queue, as we do now.
> >
> > 5. Listeners read the SLRU queue and then, if in same database,
> > pull the notify data out of the WAL log. (I believe we already
> > have enough infrastructure to make that cheap, because 2-phase
> > commit does it too.)
> >
> > In the simplest implementation of this idea, step 3 would still
> > require a global lock, to ensure that SLRU entries are made in
> > commit order. However, that lock only needs to be held for the
> > duration of step 3, which is much shorter than what happens now.
>
> Attached is an initial patch that implements this idea.
>
> There is still some work to be done around how to handle truncation /
> vacuum for the new approach, and testing replication of notifications
> onto a reader instance.
>
> That being said, I ran some basic benchmarking to stress concurrent
> notifications.
>
> With the following SQL script, I ran
> pgbench -T 100 -c 100 -j 8 -f pgbench_transaction_notify.sql -d postgres
>
> BEGIN;
> INSERT INTO test VALUES(1);
> NOTIFY benchmark_channel, 'transaction_completed';
> COMMIT;
>
> With the patch, 3 runs showed the following TPS:
>
> tps = 66372.705917
> tps = 63445.909465
> tps = 64412.544339
>
> Without the patch, we got the following TPS:
>
> tps = 30212.390982
> tps = 30908.865812
> tps = 29191.388601
>
> So, there is about a 2x increase in TPS at 100 connections, which establishes
> some promise in the approach.

Looks like a promising improvement.

> Additionally, this would help solve the issue being discussed in a
> separate thread [1], where listeners currently rely on the transaction
> log to verify that a transaction they read has indeed committed, but it
> is possible that that portion of the transaction log has been truncated
> by vacuum.

With your patch, since the backends get notifications by reading WAL records, do we need to prevent WAL records that potentially contain unconsumed notifications from being removed by the checkpointer? Or we could save unconsumed notifications somewhere during the checkpoint, as we do for 2PC transactions.

Also, could you add this patch to the next commit fest [1] if you haven't yet?

Regards,

[1] https://commitfest.postgresql.org/56/

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From: Masahiko Sawada
On Wed, Sep 10, 2025 at 12:00 PM Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> wrote:
>
> 6) We have fixed-size queue entries, so I think we don't need the
> "padding" logic at the end of the page anymore, because we know how
> many entries we can have on each page.

+1

Probably we also no longer need to use a pair of page number and offset number as a queue entry's position, since AsyncQueueEntry is now a fixed size.

> BTW, what do you think about creating a separate thread for the patch?
> The current thread's subject seems a bit irrelevant.

+1

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com