Discussion: Logical replication prefetch
There is a well-known Postgres problem: a logical replication subscriber cannot catch up with the publisher simply because LR changes are applied by a single worker, while at the publisher the changes are made by multiple concurrent backends. The problem is not specific to logical replication: the physical replication stream is also handled by a single walreceiver. But for physical replication Postgres now implements prefetch: looking at the blocks referenced by a WAL record, it is quite easy to predict which pages will be required for redo and prefetch them. With logical replication the situation is much more complicated.

My first idea was to implement parallel apply of transactions. But to do that we need to track dependencies between transactions. Right now Postgres can apply transactions in parallel, but only if they are streamed (which is done only for large transactions), and even then it serializes them at commit. It is possible to enforce parallel apply of short transactions using `debug_logical_replication_streaming`, but then performance is about 2x slower than sequential apply by a single worker. By removing the serialization at commit, it is possible to speed up apply about 3x and make the subscriber apply changes faster than the producer can generate them, even with multiple clients. But this is possible only if the transactions are independent, and that can be enforced only by tracking dependencies, which seems to be very non-trivial and invasive.

I still have not completely given up on the dependency-tracking approach, but decided first to try a simpler solution - prefetching. It is already used for physical replication. Certainly, in the case of physical replication it is much simpler, because each WAL record contains the list of accessed blocks.

In the case of logical replication, prefetching can be done either by prefetching accesses through the replica identity index (usually the primary key), or by executing the replication command in some background worker. The first case is certainly much easier: we just perform the index lookup in a prefetch worker, which loads the accessed index and heap pages into shared buffers, so the main apply worker does not need to read anything from disk. But it works well only for DELETE and HOT UPDATE operations.

In the second case we normally execute the LR command in a background worker and then abort the transaction. Certainly, in this case we are doing the same work twice. But the assumption is the same: parallel prefetch workers should load the affected pages, speeding up the work of the main apply worker.

I have implemented a PoC (see attached patch) and obtained first results on the efficiency of such prefetching.

*** First scenario (update-only).

Publisher:
```
create table t(pk integer primary key, counter integer, filler text default repeat('x', 1000)) with (fillfactor=10);
insert into t values (generate_series(1,100000), 0);
create publication pub1 for table t;
```

Subscriber:
```
create table t(pk integer primary key, counter integer, filler text default repeat('x', 1000)) with (fillfactor=10);
create subscription sub1 connection 'port=54321 dbname=postgres' publication pub1;
```

Then I wait until replication is synced, stop the subscriber and do random point updates in 10 sessions at the publisher:

```
pgbench -T 100 -c 10 -M prepared -n -f update.sql -p 54321 -d postgres
```

where update.sql is:

```
\set pk random(1, 100000)
update t set counter=counter+1 where pk=:pk;
```

Then I start the subscriber and measure how much time it needs to catch up.
Results:

no prefetch: 2:00 min
prefetch (replica identity only): 0:55 min
prefetch (all): 1:10 min

This is definitely the best case for replica-identity-only prefetch (update-only and no other indexes). How to interpret these results?

Without prefetch, applying the updates takes about twice as long at the subscriber as performing the updates at the publisher. It means that under a heavy workload the subscriber has no chance to catch up.

With prefetching through the replica identity index, apply time is even smaller than the time needed to perform the updates at the publisher. Performing the whole operation and aborting the transaction certainly adds more overhead, but the improvement is still quite significant.

Please also notice that these results were obtained on a system with a large amount of RAM (64GB) and a fast SSD. With a data set not fitting in RAM and much slower disks, the difference is expected to be more significant. I have tried to simulate that by adding a 0.1 msec delay to pg_preadv. With the artificial 0.1 msec `preadv` delay, I got the following results:

no prefetch: 7:40
prefetch (replica identity only): 3:10 min
prefetch (all): 3:09

In this case apply takes much more time than the 100 seconds during which the updates are performed at the publisher. Prefetch can improve speed about two times, but it doesn't allow the subscriber to catch up.

*** Second scenario: inserts with a secondary random key.

Publisher:

```
create table t(pk serial primary key, sk integer, counter integer default 0);
insert into t (sk) select random()*10000000 from generate_series(1,10000000);
create index on t(sk);
create publication pub1 for table t;
```

Subscriber:

```
create table t(pk integer primary key, sk integer, counter integer);
create index on t(sk);
create subscription sub1 connection 'port=54321 dbname=postgres' publication pub1;
```

workload:

```
pgbench -T 100 -c 10 -M prepared -n -f insert.sql -p 54321 -d postgres
```

where insert.sql is:

```
INSERT INTO t (sk) VALUES (random()*10000000);
```

Results (with the 0.1 msec delay) are the following:

no prefetch: 10:10 min
prefetch (identity): 8:25 min
prefetch (full): 5:50 min

Here, as expected, prefetching only the primary key doesn't provide a big improvement. But replaying the insert command in a prefetch worker speeds up apply almost twice.

Please notice that this approach requires minimal changes in Postgres, because all the infrastructure of parallel apply workers is already present and we can reuse the same apply code (with minimal changes) for performing prefetch. I only had to introduce extra tuple lock types (no-lock and try-lock) to minimize overhead and lock conflicts between the prefetch and main apply workers. Still, this cannot completely prevent lock conflicts and deadlocks in prefetch workers. It looks like more work is needed here. Also, I set `wal_level=minimal` in prefetch workers to avoid WAL-logging overhead.

The number of prefetch workers is specified by the `max_parallel_prefetch_workers_per_subscription` GUC. If it is zero (the default), no prefetching is performed. The prefetch mode is controlled by the `prefetch_replica_identity_only` GUC. By default it is true, which makes prefetch efficient for HOT updates, deletes, or inserts into a table with just one index (the primary key).

Attached please find the patch and the two shell scripts used to produce these test results. It may also be more convenient to inspect this patch as a PR: https://github.com/knizhnik/postgres/pull/3

I wonder whether such an LR prefetching approach is considered useful? Or is it better to investigate other ways to improve LR apply speed (parallel apply)?
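To make the replica-identity-only mode concrete, here is a minimal sketch (a hypothetical helper, not the code of the attached patch) of what a prefetch worker does for an incoming UPDATE/DELETE: it repeats the index lookup that the main apply worker will perform later, so that the index and heap pages are already in shared buffers when the change is really applied. The lock mode is where the patch differs (it adds no-lock/try-lock modes); an existing mode is used here just for illustration.

```c
#include "postgres.h"

#include "access/tableam.h"
#include "executor/executor.h"
#include "executor/tuptable.h"
#include "nodes/lockoptions.h"
#include "utils/rel.h"

/*
 * Hypothetical prefetch helper: perform the same replica-identity index
 * lookup as the apply worker, only to pull the index and heap pages into
 * shared buffers.  The found tuple itself is discarded.
 */
static void
prefetch_tuple_by_replica_identity(Relation rel, Oid idxoid,
								   TupleTableSlot *searchslot)
{
	TupleTableSlot *outslot = table_slot_create(rel, NULL);

	/* The lookup is the prefetch; its result is not used. */
	(void) RelationFindReplTupleByIndex(rel, idxoid,
										LockTupleNoKeyExclusive,
										searchslot, outslot);

	ExecDropSingleTupleTableSlot(outslot);
}
```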
Attachments
On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > There is well known Postgres problem that logical replication subscriber > can not caught-up with publisher just because LR changes are applied by > single worker and at publisher changes are made by > multiple concurrent backends. The problem is not logical replication > specific: physical replication stream is also handled by single > walreceiver. But for physical replication Postgres now implements > prefetch: looking at WAL record blocks it is quite easy to predict which > pages will be required for redo and prefetch them. With logical > replication situation is much more complicated. > > My first idea was to implement parallel apply of transactions. But to do > it we need to track dependencies between transactions. Right now > Postgres can apply transactions in parallel, but only if they are > streamed (which is done only for large transactions) and serialize them > by commits. It is possible to enforce parallel apply of short > transactions using `debug_logical_replication_streaming` but then > performance is ~2x times slower than in case of sequential apply by > single worker. > What is the reason of such a large slow down? Is it because the amount of network transfer has increased without giving any significant advantage because of the serialization of commits? > By removing serialization by commits, it is possible to > speedup apply 3x times and make subscriber apply changes faster then > producer can produce them even with multiple clients. But it is possible > only if transactions are independent and it can be enforced only by > tracking dependencies which seems to be very non-trivial and invasive. > > I still do not completely give up with tracking dependencies approach, > but decided first to try more simple solution - prefetching. > Sounds reasonable, but in the long run, we should track transaction dependencies and allow parallel apply of all the transactions. It is > already used for physical replication. Certainly in case of physical > replication it is much simpler, because each WAL record contains list of > accessed blocks. > > In case of logical replication prefetching can be done either by > prefetching access to replica identity index (usually primary key), > either by executing replication command by some background worker > Certainly first case is much more easy. > It seems there is only one case described, so what exactly are you referring to first and second? We just perform index lookup in > prefetch worker and it loads accessed index and heap pages in shared > buffer, so main apply worker does not need to read something from disk. > But it works well only for DELETE and HOT UPDATE operations. > > In the second case we normally execute the LR command in background > worker and then abort transaction. Certainly in this case we are doing > the same work twice. But assumption is the same: parallel prefetch > workers should load affected pages, speeding up work of the main apply > worker. > > I have implemented some PoC (see attached patch). And get first results > of efficiency of such prefetching. > > *** First scenario (update-only). 
> > Publisher: > ``` > create table t(pk integer primary key, counter integer, filler text > default repeat('x', 1000)) with (fillfactor=10); > insert into t values (generate_series(1,100000), 0); > create publication pub1 for table t; > ``` > > Subscriber: > ``` > create table t(pk integer primary key, counter integer, filler text > default repeat('x', 1000)) with (fillfactor=10); > create subscription sub1 connection 'port=54321 dbname=postgres' > publication pub1; > ``` > > Then I wait until replication is synced, stop subscriber and do random > dot updates in 10 sessions at publisher: > > ``` > pgbench -T 100 -c 10 -M prepared -n -f update.sql -p 54321 -d postgres > ``` > > where update.sql is: > > ``` > \set pk random(1, 100000) > update t set counter=counter+1 where pk=:pk; > ``` > > Then I start subscriber and measure how much time is needed for it to > caught up. > Results: > > no prefetch: 2:00 min > prefetch (replica identity only): 0:55 min > prefetch (all): 1:10 min > > This is definitely the best case for replica-identity index only > prefetch (update-only and no other indexes). > How to interpret this results? > > Without prefetch applying updates takes about two times more at > subscriber than performing this updates at publisher. > It means that under huge workload subscriber has no chances to caught up. > > With prefetching replica identity index, apply time is even smaller than > time needed to perform updates at publisher. > Performing the whole operation and transaction abort certainly adds more > overhead. But still improvement is quite significant. > > Please also notice that this results were obtains at the system with > larger amount of RAM (64Gb) and fast SSD. > With data set not fitting in RAM and much slower disks, the difference > is expected to be more significant. > But what about worst cases where these additional pre-fetches could lead to removing some pages from shared_buffers, which are required by the workload on the subscriber? I think you try such workloads as well. > I have tried to simulate it be adding 0.1msec delay to pg_preadv. > When I add artificial 0.1msec `preadv` delay, I got the following results: > > no prefetch: 7:40 > prefetch (replica identity only): 3:10 min > prefetch (all): 3:09 > > > In this case apply takes much more time than 100 seconds during which > updates are performed at publisher. Prefetch can improve speed about two > times, > but it doesn't allow subcriber to caught-up. > ... > > Please notice that this approach requires minimal changes in Postgres, > because all infrastructure of parallel apply workers is already present > and we can reuse the same apply code (with minimal changes) for > performing prefetch. I only have to introduce extra tuple lock types > (no-lock and try-lock) to minimize overhead and lock conflicts between > prefetch and main apply workers. Still it can not completely prevent > locks conflicts and deadlocks in prefetch workers. Looks like more work > is needed here. > I understand that it is just a POC, so you haven't figured out all the details, but it would be good to know the reason of these deadlocks. > > I wonder if such LR prefetching approach is considered to be useful? > Or it is better to investigate other ways to improve LR apply speed > (parallel apply)? > I think it could be a good intermediate step till we are able to find a solution for tracking the dependencies. Do you think this work will be useful once we have parallel apply, and if so how? -- With Regards, Amit Kapila.
On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:

It is possible to enforce parallel apply of short transactions using `debug_logical_replication_streaming` but then performance is ~2x times slower than in case of sequential apply by single worker.

What is the reason of such a large slow down? Is it because the amount of network transfer has increased without giving any significant advantage because of the serialization of commits?
No, I do not think that network traffic is somehow increased.
If I remove the locks (just by commenting out the bodies of the `pa_lock_stream` and `pa_unlock_stream` functions and the call of `pa_wait_for_xact_finish`), I get a 3x speed improvement (with 4 parallel apply workers) compared with normal mode
(when transactions are applied by the main logical replication worker). So the main reason is lock overhead/contention and de-facto serialization of transactions (in `top` I see that only one worker is active most of the time).
Even with simulated 0.1msec read delay, results of update tests are the following:
normal mode: 7:40
forced parallel mode: 8:30
forced parallel mode (no locks): 1:45
By removing serialization by commits, it is possible to speedup apply 3x times and make subscriber apply changes faster then producer can produce them even with multiple clients. But it is possible only if transactions are independent and it can be enforced only by tracking dependencies which seems to be very non-trivial and invasive. I still do not completely give up with tracking dependencies approach, but decided first to try more simple solution - prefetching.

Sounds reasonable, but in the long run, we should track transaction dependencies and allow parallel apply of all the transactions.
I agree.
I see two different approaches:
1. Build a dependency graph: track dependencies between xids when transactions are executed at the publisher and then include this graph in the commit record.
2. Calculate hashes of the replica identity keys and check that the data sets of the transactions do not intersect (this certainly will not work if there are some triggers). A rough sketch of this idea follows below.
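Here is a tiny sketch of what approach (2) could look like (purely illustrative; the structure and names are hypothetical, not part of any patch): each in-flight transaction carries a small Bloom-filter-like bitmap of the replica identity key hashes it touches, and a new transaction may be dispatched to a parallel apply worker only if its bitmap does not intersect the bitmaps of transactions that are still being applied.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-transaction working set: one bit per hash bucket of the
 * replica identity keys the transaction modifies.
 */
typedef struct TxnWorkingSet
{
	uint64		bits;
} TxnWorkingSet;

static void
working_set_add(TxnWorkingSet *ws, uint64 identity_hash)
{
	ws->bits |= UINT64_C(1) << (identity_hash % 64);
}

/*
 * Conservative intersection test: may report a false conflict (forcing
 * serialization), but never misses a real one.
 */
static bool
working_sets_may_conflict(const TxnWorkingSet *a, const TxnWorkingSet *b)
{
	return (a->bits & b->bits) != 0;
}
```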
It is already used for physical replication. Certainly in case of physical replication it is much simpler, because each WAL record contains list of accessed blocks. In case of logical replication prefetching can be done either by prefetching access to replica identity index (usually primary key), either by executing replication command by some background worker Certainly first case is much more easy.

It seems there is only one case described, so what exactly are you referring to first and second?
First: perform a lookup in the replica identity index (primary key). It will prefetch the index pages and the referenced heap page.
Second: execute the LR operation (insert/update) in a prefetch worker and then roll back the transaction (sketched below).
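A minimal sketch of this second mode, under the assumption that the prefetch worker can reuse the regular apply path (the helper and its callback are hypothetical names, not from the patch):

```c
#include "postgres.h"

#include "access/xact.h"

/*
 * Hypothetical "full" prefetch for one logical replication change: execute
 * the operation through the normal apply code so that all index and heap
 * pages it needs are read into shared buffers, then throw the work away by
 * aborting the transaction.  The pages remain cached for the main worker.
 */
static void
prefetch_change_by_replay(void (*apply_change) (void *arg), void *arg)
{
	StartTransactionCommand();
	apply_change(arg);			/* same apply code the main worker uses */
	AbortCurrentTransaction();	/* discard effects; pages stay in cache */
}
```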
But what about worst cases where these additional pre-fetches could lead to removing some pages from shared_buffers, which are required by the workload on the subscriber? I think you try such workloads as well.
It is the common problem of all prefetch algorithms: if the size of the cache where prefetch results are stored (shared buffers, OS file cache, ...) is not large enough to keep a prefetched result until it is used,
then prefetch will not provide any performance improvement and may even cause some degradation.
So it is a really challenging task to choose the optimal time for a prefetch operation: too early - and its results will be thrown away before they are requested; too late - and the executor has to wait for prefetch completion or load the page itself. Certainly there is some kind of autotuning: a worker performing prefetch has to wait for IO completion, while the executor, which picks up pages from the cache, processes requests faster and so should catch up with the prefetch workers. Then it has to perform IO itself and starts to fall behind the prefetch workers again.
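One simple way to express this trade-off (an illustrative sketch only, not code from the patch; in a real implementation the counters would live in shared memory) is to bound how far the prefetch workers may run ahead of the apply worker:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed tuning knob: max operations prefetched but not yet applied. */
#define MAX_PREFETCH_DISTANCE 128

typedef struct PrefetchWindow
{
	uint64		prefetched;		/* operations prefetched so far */
	uint64		applied;		/* operations applied so far */
} PrefetchWindow;

/*
 * Prefetch workers pause when they are too far ahead (pages would risk
 * eviction before use) and resume once the apply worker catches up.
 */
static bool
prefetch_may_proceed(const PrefetchWindow *w)
{
	return (w->prefetched - w->applied) < MAX_PREFETCH_DISTANCE;
}
```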
I understand that it is just a POC, so you haven't figured out all the details, but it would be good to know the reason of these deadlocks.
Will investigate it.
I wonder if such LR prefetching approach is considered to be useful? Or it is better to investigate other ways to improve LR apply speed (parallel apply)?

I think it could be a good intermediate step till we are able to find a solution for tracking the dependencies. Do you think this work will be useful once we have parallel apply, and if so how?
I think that once we have parallel apply, prefetch is not needed.
At least that is what I see now in Neon (open source serverless Postgres which separates compute and storage).
We have implemented prefetch for seqscan and indexscan because of the relatively large access latency of the page server. And it can really significantly improve performance - 4 or even more times.
But the same effect can be achieved by forcing a parallel plan with a larger number of parallel workers. Unfortunately the effects of these two optimizations do not multiply, so a parallel plan + prefetch shows almost the same speed as either optimization alone.
On Wed, Jul 9, 2025 at 12:08 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > On 08/07/2025 2:51 pm, Amit Kapila wrote: > > On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > It is possible to enforce parallel apply of short > transactions using `debug_logical_replication_streaming` but then > performance is ~2x times slower than in case of sequential apply by > single worker. > > What is the reason of such a large slow down? Is it because the amount > of network transfer has increased without giving any significant > advantage because of the serialization of commits? > > No, I do not think that network traffic is somehow increased. > If I removed locks (just by commenting body of `pa_lock_stream` and `pa_unlock_stream` functions and callof `pa_wait_for_xact_finish`),I get 3x speed improvement (with 4 parallel apply workers) comparing with normal mode > (when transactions are applied by main logical replication worker). So the main reason is lock overhead/contention andde-facto serialization of transactions (in `top` I see that only one worker is active most the time. > > Even with simulated 0.1msec read delay, results of update tests are the following: > > normal mode: 7:40 > forced parallel mode: 8:30 > forced parallel mode (no locks): 1:45 > > By removing serialization by commits, it is possible to > speedup apply 3x times and make subscriber apply changes faster then > producer can produce them even with multiple clients. But it is possible > only if transactions are independent and it can be enforced only by > tracking dependencies which seems to be very non-trivial and invasive. > > I still do not completely give up with tracking dependencies approach, > but decided first to try more simple solution - prefetching. > > Sounds reasonable, but in the long run, we should track transaction > dependencies and allow parallel apply of all the transactions. > > I agree. > I see two different approaches: > > 1. Build dependency graph: track dependency between xids when transaction is executed at publisher and then include thisgraph in commit record. > 2. Calculate hash of replica identity key and check that data sets of transactions do no intersect (certainly will notworkif there are some triggers). > I think it is better to compute transaction dependencies on the subscriber side because there could be many transactions that could be filtered because the containing tables or rows are not published. > > > But what about worst cases where these additional pre-fetches could > lead to removing some pages from shared_buffers, which are required by > the workload on the subscriber? I think you try such workloads as > well. > > It is the common problem of all prefetch algorithms: if size of cache where prefetch results are stored (shared buffers,OS file cache,...) is not larger enough to keep prefetch result until it will be used, > then prefetch will not provide any improvement of performance and may be even cause some degradation. > So it is really challenged task to choose optimal time for prefetch operation: too early - and its results will be thrownaway before requested, too late - executor has to wait prefetch completion or load page itself. Certainly there issome kind of autotuning: worker performing prefetch has to wait for IO completion and executor whichm pickup page fromcache process requests faster and so should catch up prefetch workers. Then it has to perform IO itself and start fallbehind prefetch workers. 
> > > > I understand that it is just a POC, so you haven't figured out all the > details, but it would be good to know the reason of these deadlocks. > > Will investigate it. > > > I wonder if such LR prefetching approach is considered to be useful? > Or it is better to investigate other ways to improve LR apply speed > (parallel apply)? > > I think it could be a good intermediate step till we are able to find > a solution for tracking the dependencies. Do you think this work will > be useful once we have parallel apply, and if so how? > > > I think that if we will have parallel apply, prefetch is not needed. > At least that is what I see now in Neon (open source serverless Postgres which separates compute and storage). > We have implemented prefetch for seqscan and indexscan because of relatively large acess latency with page server. Andit can really significantly improve performance - 4 or even more times. > But the same effect cna be achieved by forcing parallel plan with larger number of parallel workers. Unfortunately effectof this two optimizations is not multiplied, so parallel plan + prefetch shows almost the same speed as any of thisoptimizations. > > I think our case is a bit different, and prefetch could even be used when we are able to track dependencies and achieve true parallelism. We can consider using prefetch to speed up dependent transactions that can't be parallelized. -- With Regards, Amit Kapila.
On 11/07/2025 11:52 am, Amit Kapila wrote: > On Wed, Jul 9, 2025 at 12:08 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote: >> On 08/07/2025 2:51 pm, Amit Kapila wrote: >> >> On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: >> >> It is possible to enforce parallel apply of short >> transactions using `debug_logical_replication_streaming` but then >> performance is ~2x times slower than in case of sequential apply by >> single worker. >> >> What is the reason of such a large slow down? Is it because the amount >> of network transfer has increased without giving any significant >> advantage because of the serialization of commits? >> >> No, I do not think that network traffic is somehow increased. >> If I removed locks (just by commenting body of `pa_lock_stream` and `pa_unlock_stream` functions and callof `pa_wait_for_xact_finish`),I get 3x speed improvement (with 4 parallel apply workers) comparing with normal mode >> (when transactions are applied by main logical replication worker). So the main reason is lock overhead/contention andde-facto serialization of transactions (in `top` I see that only one worker is active most the time. >> >> Even with simulated 0.1msec read delay, results of update tests are the following: >> >> normal mode: 7:40 >> forced parallel mode: 8:30 >> forced parallel mode (no locks): 1:45 >> >> By removing serialization by commits, it is possible to >> speedup apply 3x times and make subscriber apply changes faster then >> producer can produce them even with multiple clients. But it is possible >> only if transactions are independent and it can be enforced only by >> tracking dependencies which seems to be very non-trivial and invasive. >> >> I still do not completely give up with tracking dependencies approach, >> but decided first to try more simple solution - prefetching. >> >> Sounds reasonable, but in the long run, we should track transaction >> dependencies and allow parallel apply of all the transactions. >> >> I agree. >> I see two different approaches: >> >> 1. Build dependency graph: track dependency between xids when transaction is executed at publisher and then include thisgraph in commit record. >> 2. Calculate hash of replica identity key and check that data sets of transactions do no intersect (certainly will notworkif there are some triggers). >> > I think it is better to compute transaction dependencies on the > subscriber side because there could be many transactions that could be > filtered because the containing tables or rows are not published. It is certainly true. Also tracking dependencies at subscriber doesn't require to change protocol and makes it possible to do parallel apply for old publishers. But from the other hand it seems to be too late. It will be nice that just after receiving first transaction statement, apply process can make a decision to which parallel apply worker it should be sent. At subscriber side it seems to be easier to calculate hash of replica identity key and based on this hash (or more precisely the set of hashes representing transaction working set) make a decision whther transaction interleave with some prior transaction or not and schedule it accordingly. >> >> But what about worst cases where these additional pre-fetches could >> lead to removing some pages from shared_buffers, which are required by >> the workload on the subscriber? I think you try such workloads as >> well. 
>> >> It is the common problem of all prefetch algorithms: if size of cache where prefetch results are stored (shared buffers, OS file cache,...) is not larger enough to keep prefetch result until it will be used, >> then prefetch will not provide any improvement of performance and may be even cause some degradation. >> So it is really challenged task to choose optimal time for prefetch operation: too early - and its results will be thrown away before requested, too late - executor has to wait prefetch completion or load page itself. Certainly there is some kind of autotuning: worker performing prefetch has to wait for IO completion and executor which pickup page from cache process requests faster and so should catch up prefetch workers. Then it has to perform IO itself and start fall behind prefetch workers. >> >> >> >> I understand that it is just a POC, so you haven't figured out all the >> details, but it would be good to know the reason of these deadlocks. >> >> Will investigate it.

I found the reason: a conflict happens between the main apply worker and a prefetch worker that managed to catch up with the main worker, so both try to apply the same statement. I fixed the problem by adding an extra parameter to ExecSimpleRelationUpdate/Insert and handling prefetch as a kind of speculative operation. With this change the results for the insert test are the following:

no prefetch: 10 min
prefetch (identity): 8 min
prefetch (full): 3 min

> I think our case is a bit different, and prefetch could even be used > when we are able to track dependencies and achieve true parallelism. > We can consider using prefetch to speed up dependent transactions that > can't be parallelized.

Makes sense.
On 08/07/2025 2:51 pm, Amit Kapila wrote: > On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: >> There is well known Postgres problem that logical replication subscriber >> can not caught-up with publisher just because LR changes are applied by >> single worker and at publisher changes are made by >> multiple concurrent backends. The problem is not logical replication >> specific: physical replication stream is also handled by single >> walreceiver. But for physical replication Postgres now implements >> prefetch: looking at WAL record blocks it is quite easy to predict which >> pages will be required for redo and prefetch them. With logical >> replication situation is much more complicated. >> >> My first idea was to implement parallel apply of transactions. But to do >> it we need to track dependencies between transactions. Right now >> Postgres can apply transactions in parallel, but only if they are >> streamed (which is done only for large transactions) and serialize them >> by commits. It is possible to enforce parallel apply of short >> transactions using `debug_logical_replication_streaming` but then >> performance is ~2x times slower than in case of sequential apply by >> single worker. >> > What is the reason of such a large slow down? Is it because the amount > of network transfer has increased without giving any significant > advantage because of the serialization of commits?

It is not directly related to the subject, but I do not understand this code:

```
    /*
     * Stop the worker if there are enough workers in the pool.
     *
     * XXX Additionally, we also stop the worker if the leader apply worker
     * serialize part of the transaction data due to a send timeout. This is
     * because the message could be partially written to the queue and there
     * is no way to clean the queue other than resending the message until it
     * succeeds. Instead of trying to send the data which anyway would have
     * been serialized and then letting the parallel apply worker deal with
     * the spurious message, we stop the worker.
     */
    if (winfo->serialize_changes ||
        list_length(ParallelApplyWorkerPool) >
        (max_parallel_apply_workers_per_subscription / 2))
    {
        logicalrep_pa_worker_stop(winfo);
        pa_free_worker_info(winfo);

        return;
    }
```

It stops the worker if the number of workers in the pool is more than half of `max_parallel_apply_workers_per_subscription`. What I see is that `pa_launch_parallel_worker` spawns new workers, and after completion of a transaction the worker is immediately terminated. Actually this leads to an awful slowdown of the apply process. If I just disable this check, so that all `max_parallel_apply_workers_per_subscription` workers are actually used for applying transactions, then the time of parallel apply with 4 workers is 6 minutes, compared with 10 minutes for applying all transactions by the main apply worker. It is still not such a large improvement, but at least it is an improvement and not a degradation.
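For reference, the experiment described above ("just disable" the early termination) roughly corresponds to relaxing that condition so that workers are retained up to the full limit. A hedged sketch of such a variant (not necessarily the exact change that was tested):

```c
	/*
	 * Hypothetical variant of the check quoted above: keep parallel apply
	 * workers in the pool up to the full
	 * max_parallel_apply_workers_per_subscription limit instead of stopping
	 * them once half of the limit is reached.  (Sketch only; the actual
	 * experiment simply disabled the early stop.)
	 */
	if (winfo->serialize_changes ||
		list_length(ParallelApplyWorkerPool) >
		max_parallel_apply_workers_per_subscription)
	{
		logicalrep_pa_worker_stop(winfo);
		pa_free_worker_info(winfo);

		return;
	}
```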
On Fri, Jul 11, 2025 at 7:49 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > On 08/07/2025 2:51 pm, Amit Kapila wrote: > > On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > >> There is well known Postgres problem that logical replication subscriber > >> can not caught-up with publisher just because LR changes are applied by > >> single worker and at publisher changes are made by > >> multiple concurrent backends. The problem is not logical replication > >> specific: physical replication stream is also handled by single > >> walreceiver. But for physical replication Postgres now implements > >> prefetch: looking at WAL record blocks it is quite easy to predict which > >> pages will be required for redo and prefetch them. With logical > >> replication situation is much more complicated. > >> > >> My first idea was to implement parallel apply of transactions. But to do > >> it we need to track dependencies between transactions. Right now > >> Postgres can apply transactions in parallel, but only if they are > >> streamed (which is done only for large transactions) and serialize them > >> by commits. It is possible to enforce parallel apply of short > >> transactions using `debug_logical_replication_streaming` but then > >> performance is ~2x times slower than in case of sequential apply by > >> single worker. > >> > > What is the reason of such a large slow down? Is it because the amount > > of network transfer has increased without giving any significant > > advantage because of the serialization of commits? > > > It is not directly related with subj, but I do not understand this code: > > ``` > /* > * Stop the worker if there are enough workers in the pool. > * > * XXX Additionally, we also stop the worker if the leader apply worker > * serialize part of the transaction data due to a send timeout. > This is > * because the message could be partially written to the queue and > there > * is no way to clean the queue other than resending the message > until it > * succeeds. Instead of trying to send the data which anyway would have > * been serialized and then letting the parallel apply worker deal with > * the spurious message, we stop the worker. > */ > if (winfo->serialize_changes || > list_length(ParallelApplyWorkerPool) > > (max_parallel_apply_workers_per_subscription / 2)) > { > logicalrep_pa_worker_stop(winfo); > pa_free_worker_info(winfo); > > return; > } > ``` > > It stops worker if number fo workers in pool is more than half of > `max_parallel_apply_workers_per_subscription`. > What I see is that `pa_launch_parallel_worker` spawns new workers and > after completion of transaction it is immediately terminated. > Actually this leads to awful slowdown of apply process. > I didn't understand your scenario. pa_launch_parallel_worker() should spawn a new worker only if all the workers in the pool are busy, and then it will free the worker if the pool already has enough workers. So, do you mean to say that the workers in the pool are always busy in your workload which lead spawn/exit of new workers? Can you please explain your scenario in some more detail? -- With Regards, Amit Kapila.
On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > There is well known Postgres problem that logical replication subscriber > can not caught-up with publisher just because LR changes are applied by > single worker and at publisher changes are made by > multiple concurrent backends. > BTW, do you know how users deal with this lag? For example, one can imagine creating multiple pub-sub pairs for different sets of tables so that the workload on the subscriber could also be shared by multiple apply workers. I can also think of splitting the workload among multiple pub-sub pairs by using row filters. -- With Regards, Amit Kapila.
On 13/07/2025 9:28 am, Amit Kapila wrote: > I didn't understand your scenario. pa_launch_parallel_worker() should > spawn a new worker only if all the workers in the pool are busy, and > then it will free the worker if the pool already has enough workers. > So, do you mean to say that the workers in the pool are always busy in > your workload which lead spawn/exit of new workers? Can you please > explain your scenario in some more detail?

Current LR apply logic does not work well for applying small OLTP transactions. First of all, by default the reorder buffer at the publisher will buffer them and so prevent parallel apply at the subscriber. The publisher switches to streaming mode only if a transaction is too large or `debug_logical_replication_streaming=immediate` is set.

But even if we force the publisher to stream short transactions, the subscriber will try to launch a new parallel apply worker for each transaction (if all existing workers are busy). If there are 100 active backends at the publisher, then the subscriber will try to launch 100 parallel apply workers. Most likely this fails because of the limit on the maximal number of workers. In this case the leader will serialize such transactions. So if there are 100 streamed transactions and 10 parallel apply workers, then 10 transactions are started in parallel and 90 will be serialized to disk. That is not very efficient for short transactions; it would be better to wait for some time until some of the workers become vacant.

But the worst thing happens when a parallel apply worker completes its transaction. If the number of parallel apply workers in the pool exceeds `max_parallel_apply_workers_per_subscription / 2`, then this parallel apply worker is terminated. So instead of having `max_parallel_apply_workers_per_subscription` workers applying transactions at maximal possible speed, and a leader which distributes transactions between them and stops receiving new data from the publisher if there is no vacant worker, we have the leader serializing and writing transactions to disk (and then definitely reading them back from disk) and permanently starting and terminating parallel apply worker processes. This leads to awful performance.

Certainly the originally intended use case was different: parallel apply is performed only for large transactions. The number of such transactions is not that big, so there should be enough parallel apply workers in the pool to process them. And if there are not enough workers, it is not a problem to spawn a new one and terminate it after completion of the transaction (because the transaction is long, the overhead of spawning a process is not so large compared with the redo of a large transaction).

But if we want to efficiently replicate an OLTP workload, then we definitely need some other approach. Prefetch is actually more compatible with the current implementation, because prefetch operations don't need to be grouped by transaction and can be executed by any prefetch worker.
On 13/07/2025 1:28 pm, Amit Kapila wrote: > On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: >> There is well known Postgres problem that logical replication subscriber >> can not caught-up with publisher just because LR changes are applied by >> single worker and at publisher changes are made by >> multiple concurrent backends. >> > BTW, do you know how users deal with this lag? For example, one can > imagine creating multiple pub-sub pairs for different sets of tables > so that the workload on the subscriber could also be shared by > multiple apply workers. I can also think of splitting the workload > among multiple pub-sub pairs by using row filters

Yes, I have seen that users start several subscriptions/publications to receive and apply changes in parallel. But it cannot be considered a universal solution:

1. There are not always multiple tables (or partitions of one table), so it is not always possible to split them between multiple publications.
2. It violates transactional behavior (consistency): if transactions update several tables included in different publications, then by applying these changes independently we can observe at the replica a state where one table has been updated and another has not. The same is true for row filters.
3. Each walsender has to scan the WAL, so with N subscriptions we have to read and decode the WAL N times.
On Tuesday, July 8, 2025 2:36 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > There is well known Postgres problem that logical replication subscriber can > not caught-up with publisher just because LR changes are applied by single > worker and at publisher changes are made by multiple concurrent backends. > The problem is not logical replication > specific: physical replication stream is also handled by single walreceiver. But > for physical replication Postgres now implements > prefetch: looking at WAL record blocks it is quite easy to predict which pages > will be required for redo and prefetch them. With logical replication situation is > much more complicated. > > My first idea was to implement parallel apply of transactions. But to do it we > need to track dependencies between transactions. Right now Postgres can > apply transactions in parallel, but only if they are streamed (which is done > only for large transactions) and serialize them by commits. It is possible to > enforce parallel apply of short transactions using > `debug_logical_replication_streaming` but then performance is ~2x times > slower than in case of sequential apply by single worker. By removing > serialization by commits, it is possible to speedup apply 3x times and make > subscriber apply changes faster then producer can produce them even with > multiple clients. But it is possible only if transactions are independent and it > can be enforced only by tracking dependencies which seems to be very > non-trivial and invasive. > > I still do not completely give up with tracking dependencies approach, but > decided first to try more simple solution - prefetching. It is already used for > physical replication. Certainly in case of physical replication it is much simpler, > because each WAL record contains list of accessed blocks. > > In case of logical replication prefetching can be done either by prefetching > access to replica identity index (usually primary key), either by executing > replication command by some background worker Certainly first case is much > more easy. We just perform index lookup in prefetch worker and it loads > accessed index and heap pages in shared buffer, so main apply worker does > not need to read something from disk. > But it works well only for DELETE and HOT UPDATE operations. > > In the second case we normally execute the LR command in background > worker and then abort transaction. Certainly in this case we are doing the same > work twice. But assumption is the same: parallel prefetch workers should load > affected pages, speeding up work of the main apply worker. > > I have implemented some PoC (see attached patch). And get first results of > efficiency of such prefetching. > > *** First scenario (update-only). 
> > Publisher: > ``` > create table t(pk integer primary key, counter integer, filler text default repeat('x', > 1000)) with (fillfactor=10); insert into t values (generate_series(1,100000), 0); > create publication pub1 for table t; ``` > > Subscriber: > ``` > create table t(pk integer primary key, counter integer, filler text default repeat('x', > 1000)) with (fillfactor=10); create subscription sub1 connection 'port=54321 > dbname=postgres' > publication pub1; > ``` > > Then I wait until replication is synced, stop subscriber and do random dot > updates in 10 sessions at publisher: > > ``` > pgbench -T 100 -c 10 -M prepared -n -f update.sql -p 54321 -d postgres ``` > > where update.sql is: > > ``` > \set pk random(1, 100000) > update t set counter=counter+1 where pk=:pk; ``` > > Then I start subscriber and measure how much time is needed for it to caught > up. > Results: > > no prefetch: 2:00 min > prefetch (replica identity only): 0:55 min prefetch (all): 1:10 min > > This is definitely the best case for replica-identity index only prefetch > (update-only and no other indexes). > How to interpret this results? > > Without prefetch applying updates takes about two times more at subscriber > than performing this updates at publisher. > It means that under huge workload subscriber has no chances to caught up. > > With prefetching replica identity index, apply time is even smaller than time > needed to perform updates at publisher. > Performing the whole operation and transaction abort certainly adds more > overhead. But still improvement is quite significant. > > Please also notice that this results were obtains at the system with larger > amount of RAM (64Gb) and fast SSD. > With data set not fitting in RAM and much slower disks, the difference is > expected to be more significant. > I have tried to simulate it be adding 0.1msec delay to pg_preadv. > When I add artificial 0.1msec `preadv` delay, I got the following results: > > no prefetch: 7:40 > prefetch (replica identity only): 3:10 min prefetch (all): 3:09 > > > In this case apply takes much more time than 100 seconds during which > updates are performed at publisher. Prefetch can improve speed about two > times, but it doesn't allow subcriber to caught-up. > > > > *** Second scenario: inserts with secondary random key. > > > Publisher: > > ``` > create table t(pk serial primary key, sk integer, counter integer default 0) insert > into t (sk) select random()*10000000 from generate_series(1,10000000) create > index on t(sk) create publication pub1 for table t ``` > > Subscriber: > ``` > > create table t(pk integer primary key, sk integer, counter integer) create index on > t(sk) create subscription sub1 connection 'port=54321 dbname=postgres' > publication pub1 > ``` > > workload: > > ``` > pgbench -T 100 -c 10 -M prepared -n -f insert.sql -p 54321 -d postgres > > ``` > > where insert.sql: > > ``` > INSERT INTO t (sk) VALUES (random()*10000000); ``` > > Results (with 0.1msec delay) are the followingL > > no prefetch: 10:10 min > prefetch (identity): 8:25 min > prefetch (full): 5:50min > > Here as expected prefetching only primary key doesn't provide some big > improvement. But replaying insert command in prefetch worker allows to > speedup apply almost twice. > > Please notice that this approach requires minimal changes in Postgres, > because all infrastructure of parallel apply workers is already present and we > can reuse the same apply code (with minimal changes) for performing prefetch. 
> I only have to introduce extra tuple lock types (no-lock and try-lock) to minimize > overhead and lock conflicts between prefetch and main apply workers. Still it > can not completely prevent locks conflicts and deadlocks in prefetch workers. > Looks like more work is needed here. Also I set `wal_level=minimal` in > prefetch workers to avoid WAL-logging overhead. > > Number of prefetch workers is specified by > `max_parallel_prefetch_workers_per_subscription` GUC. If it is zero > (default) then no prefetching is performed. > Prefetch mode is controlled by `prefetch_replica_identity_only` GUC . By > default it is true which makes prefetch efficient for hot updates, deletes or > inserts in table with just one index (primary key). > > > Attached please find patch and two shell scripts used to produce this test > results. > Also it may be more convenient to inspect this patch as PR: > https://github.com/knizhnik/postgres/pull/3 > > I wonder if such LR prefetching approach is considered to be useful? > Or it is better to investigate other ways to improve LR apply speed (parallel > apply)? Thank you for the proposal ! I find it to be a very interesting feature。 I tested the patch you shared in your original email and encountered potential deadlocks when testing pgbench TPC-B like workload. Could you please provide an updated patch version so that I can conduct further performance experiments ? Additionally, I was also exploring ways to improve performance and have tried an alternative version of prefetch for experimentation. The alternative design is that we assigns each non-streaming transaction to a parallel apply worker, while strictly maintaining the order of commits. During parallel apply, if the transactions that need to be committed before the current transaction are not yet finished, the worker performs pre-fetch operations. Specifically, for updates and deletes, the worker finds and caches the target local tuple to be updated/deleted. Once all preceding transactions are committed, the parallel apply worker uses these cached tuples to execute the actual updates or deletes. What do you think about this alternative ? I think the alternative might offer more stability in scenarios where shared buffer elimination occurs frequently and avoids leaving dead tuples in the buffer. However, it also presents some drawbacks, such as the need to add wait events to maintain commit order, compared to the approach discussed in this thread. (Note that, due to time constraints, I have not implemented the pre-fetch for Inserts and the code is not in reviewable shape and lacks comments and documentation, but just share the POC patch for reference). Best Regards, Hou zj
Attachments
Thank you for the proposal! I find it to be a very interesting feature. I tested the patch you shared in your original email and encountered potential deadlocks when testing a pgbench TPC-B like workload. Could you please provide an updated patch version so that I can conduct further performance experiments?
Sorry, it was fixed in my repo: https://github.com/knizhnik/postgres/pull/3
Updated patch is attached.
Additionally, I was also exploring ways to improve performance and have tried an alternative version of prefetch for experimentation. The alternative design is that we assign each non-streaming transaction to a parallel apply worker, while strictly maintaining the order of commits. During parallel apply, if the transactions that need to be committed before the current transaction are not yet finished, the worker performs pre-fetch operations. Specifically, for updates and deletes, the worker finds and caches the target local tuple to be updated/deleted. Once all preceding transactions are committed, the parallel apply worker uses these cached tuples to execute the actual updates or deletes. What do you think about this alternative? I think the alternative might offer more stability in scenarios where shared buffer elimination occurs frequently, and it avoids leaving dead tuples in the buffer. However, it also presents some drawbacks, such as the need to add wait events to maintain commit order, compared to the approach discussed in this thread.
So, as far as I understand, your PoC is doing the same as approach 1 in my proposal - prefetch via the replica identity - but it is done not by parallel prefetch workers but by normal parallel apply workers while they have to wait until the previous transaction is committed. I consider it to be more complex, but it may be more efficient than my approach.
The obvious drawback of both your approach and mine is that they prefetch only pages of the primary index (replica identity). If there are other indexes whose keys are changed by an update, then the pages of such indexes will be read from disk when you apply the update. The same is also true for insert (in this case you always have to insert the new tuple into all indexes) - this is why I have also implemented another approach: apply the operation in a prefetch worker and then roll back the transaction.
Also, I do not quite understand how you handle invalidations. Assume that we have two transactions - T1 and T2:
T1: ... W1 Commit
T2: ... W1
So T1 writes tuple 1 and then commits the transaction. Then T2 updates tuple 1.
If I correctly understand your approach, the parallel apply worker for T2 will try to prefetch tuple 1 before T1 is committed.
But in this case it will get the old version of the tuple. It is not a problem if the parallel apply worker repeats the lookup rather than just using the cached tuple.
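For what it's worth, a hedged sketch of such a re-check (a hypothetical helper, not code from either patch): the worker first tries to lock the tuple it cached during prefetch, and if that version turns out to be stale it simply repeats the replica-identity lookup.

```c
#include "postgres.h"

#include "access/tableam.h"
#include "access/xact.h"
#include "executor/executor.h"
#include "executor/tuptable.h"
#include "nodes/lockoptions.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/*
 * Hypothetical re-validation of a tuple cached during prefetch: lock it by
 * its old TID; if that version is stale (e.g. TM_Updated), fall back to a
 * fresh lookup through the replica identity index.
 */
static bool
lock_cached_or_refetch(Relation rel, Oid idxoid,
					   TupleTableSlot *searchslot,
					   TupleTableSlot *cachedslot,
					   TupleTableSlot *outslot)
{
	TM_FailureData tmfd;
	TM_Result	res;

	res = table_tuple_lock(rel, &cachedslot->tts_tid,
						   GetLatestSnapshot(), outslot,
						   GetCurrentCommandId(false),
						   LockTupleExclusive, LockWaitBlock,
						   0 /* flags */ , &tmfd);

	if (res == TM_Ok)
		return true;			/* cached tuple is still the right version */

	/* Cached version is stale: repeat the index lookup. */
	return RelationFindReplTupleByIndex(rel, idxoid, LockTupleExclusive,
										searchslot, outslot);
}
```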
One more point. Since you assign each non-streaming transaction to a parallel apply worker, the number of such transactions is limited by the number of background workers. Usually it is not so large (~10). So if there were 100 parallel transactions at the publisher, at the subscriber you would still be able to execute concurrently no more than a few of them. In this sense my approach with separate prefetch workers is more flexible: each prefetch worker can prefetch as many operations as it can.
Attachments
On Sun, Jul 13, 2025 at 6:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > On 13/07/2025 1:28 pm, Amit Kapila wrote: > > On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > >> There is well known Postgres problem that logical replication subscriber > >> can not caught-up with publisher just because LR changes are applied by > >> single worker and at publisher changes are made by > >> multiple concurrent backends. > >> > > BTW, do you know how users deal with this lag? For example, one can > > imagine creating multiple pub-sub pairs for different sets of tables > > so that the workload on the subscriber could also be shared by > > multiple apply workers. I can also think of splitting the workload > > among multiple pub-sub pairs by using row filters > > > Yes, I saw that users starts several subscriptions/publications to > receive and apply changes in parallel. > But it can not be considered as universal solution: > 1. Not always there are multiple tables (or partitions of one one table) > so that it it possible to split them between multiple publications. > 2. It violates transactional behavior (consistency): if transactions > update several tables included in different publications then applying > this changes independently, we can observe at replica behaviour when one > table is update - and another - not. The same is true for row filters. > 3. Each walsender will have to scan WAL, so having N subscriptions we > have to read and decode WAL N times. > I agree that it is not a solution which can be applied in all cases and neither I want to say that we shouldn't pursue the idea of prefetch or parallel apply to improve the speed of apply. It was just to know/discuss how users try to workaround lag for cases where the lag is large. -- With Regards, Amit Kapila.
On Sun, Jul 13, 2025 at 5:59 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > > Certainly originally intended use case was different: parallel apply is > performed only for large transactions. Number of of such transactions is > not so big and > so there should be enough parallel apply workers in pool to proceed > them. And if there are not enough workers, it is not a problem to spawn > new one and terminate > it after completion of transaction (because transaction is long, > overhead of spawning process is not so larger comparing with redo of > large transaction). > Right. > But if we want to efficiently replicate OLTP workload, then we > definitely need some other approach. > Agreed, for simplicity, for now we can have a GUC to decide the size of the pool. There is a note in the code for this as well, see: " XXX This worker pool threshold is arbitrary and we can provide a GUC variable for this in the future if required. I think we can think of some dynamic strategy where we remove from the pool if the workers are not in use for some threshold period of time or something on those lines. But at this stage it is better to use something simple and try to come up with a good way to perform pre-fetch or parallelization of short transactions. -- With Regards, Amit Kapila.
On Monday, July 14, 2025 2:36 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > On 14/07/2025 4:20 am, Zhijie Hou (Fujitsu) wrote: > > Additionally, I was also exploring ways to improve performance and have tried an > > alternative version of prefetch for experimentation. The alternative design is > > that we assigns each non-streaming transaction to a parallel apply worker, while > > strictly maintaining the order of commits. During parallel apply, if the > > transactions that need to be committed before the current transaction are not > > yet finished, the worker performs pre-fetch operations. Specifically, for > > updates and deletes, the worker finds and caches the target local tuple to be > > updated/deleted. Once all preceding transactions are committed, the parallel > > apply worker uses these cached tuples to execute the actual updates or deletes. > > What do you think about this alternative ? I think the alternative might offer > > more stability in scenarios where shared buffer elimination occurs frequently > > and avoids leaving dead tuples in the buffer. However, it also presents some > > drawbacks, such as the need to add wait events to maintain commit order, > > compared to the approach discussed in this thread. > > So as far as I understand your PoC is doing the same as approach 1 in my > proposal - prefetch of replica identity, but it is done not by parallel prefetch > workers, but normal parallel apply workers when they have to wait until previous > transaction is committed. I consider it to be more complex but may be more > efficient than my approach. > > The obvious drawback of both your's and my approaches is that it prefetch only > pages of primary index (replica identity). If there are some other indexes > which keys are changed by update, then pages of such indexes will be read from > the disk when you apply update. The same is also true for insert (in this case > you always has to include new tuple in all indexes) - this is why I have also > implemented another approach: apply operation in prefetch worker and then > rollback transaction. Thank you for your reply! I agree that indexes other than RI do not benefit from the pre-fetch. Regarding the apply operation and rollback approach, I have some concerns about the possible side effects, particularly the accumulation of dead tuples in the shared buffer. This is because all changes are performed by the pre-fetch worker at once before being aborted. I haven't delved deeply into this yet, but do you think this could potentially introduce additional overhead ? > > > Also I do not quite understand how you handle invalidations? During the pre-fetch phase of my patch, the execution of table_tuple_lock() is postponed until all preceding transactions have been finalized. If the cached tuple was modified by other transactions, table_tuple_lock() will return TM_Updated, signifying that the cached tuple is no longer valid. In these cases, the parallel apply worker will re-fetch the tuple. > Assume that we have two transactions - T1 and T2: > > T1: ... W1 Commit > T2: ... W1 > > So T1 writes tuple 1 and then commits transaction. Then T2 updates tuple 1. > If I correctly understand your approach, parallel apply worker for T2 will try > to prefetch tuple 1 before T1 is committed. > > But in this case it will get old version of the tuple. It is not a problem if > parallel apply worker will repeat lookup and not just use cached tuple. Yes, it is done like that. > > One more moment. 
> As far as you assign each non-streaming transaction to a parallel apply
> worker, the number of such transactions is limited by the number of
> background workers. Usually it is not so large (~10). So if there were 100
> parallel transactions at the publisher, then at the subscriber you will
> still be able to execute concurrently no more than a few of them. In this
> sense my approach with separate prefetch workers is more flexible: each
> prefetch worker can prefetch as many operations as it can.

Yes, that's true. I have been analyzing some performance issues in logical
replication, specifically under scenarios where both the publisher and
subscriber are subjected to high workloads. In these situations, the shared
buffer is frequently updated, which prompted me to consider the alternative
approach I mentioned. I plan to perform additional tests and analysis on
these approaches, thanks!

Best Regards,
Hou zj
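To make the cache-then-revalidate flow described above more concrete, here is a rough hypothetical sketch (not code from either patch): the pre-fetch phase remembers which tuple was found, and only when the worker's turn in commit order arrives is the tuple actually locked via table_tuple_lock(); a TM_Updated/TM_Deleted result means the cached tuple is stale and must be re-fetched. CachedChange, refetch_target_tuple() and apply_update() are placeholder names, and details such as snapshot and command id handling are simplified.

```c
#include "postgres.h"

#include "access/tableam.h"
#include "access/xact.h"
#include "executor/tuptable.h"
#include "nodes/lockoptions.h"
#include "storage/itemptr.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/* Hypothetical record produced by the pre-fetch phase for one UPDATE. */
typedef struct CachedChange
{
	Relation	rel;			/* target table */
	ItemPointerData tid;		/* tuple located during pre-fetch */
	TupleTableSlot *newtuple;	/* decoded new tuple to apply */
} CachedChange;

extern bool refetch_target_tuple(CachedChange *change);	/* placeholder */
extern void apply_update(CachedChange *change);				/* placeholder */

/* Apply one cached UPDATE after all preceding transactions have committed. */
static void
apply_cached_update(CachedChange *change)
{
	TupleTableSlot *oldslot = table_slot_create(change->rel, NULL);
	TM_FailureData tmfd;
	TM_Result	res;

	/*
	 * The tuple lock (and hence the visibility check) is postponed until
	 * now.  If the cached tuple was modified by another transaction in the
	 * meantime, table_tuple_lock() reports that it is no longer current.
	 */
	res = table_tuple_lock(change->rel, &change->tid, GetLatestSnapshot(),
						   oldslot, GetCurrentCommandId(false),
						   LockTupleExclusive, LockWaitBlock,
						   TUPLE_LOCK_FLAG_FIND_LAST_VERSION, &tmfd);

	if (res == TM_Ok)
		apply_update(change);	/* cached tuple is still the right one */
	else if (res == TM_Updated || res == TM_Deleted)
	{
		/* cached tuple is stale: redo the lookup, then apply */
		if (refetch_target_tuple(change))
			apply_update(change);
	}

	ExecDropSingleTupleTableSlot(oldslot);
}
```

The important property is that a stale cache entry costs only an extra lookup; correctness still comes from taking the tuple lock in commit order.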
On Mon, Jul 14, 2025 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jul 13, 2025 at 6:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> >
> > On 13/07/2025 1:28 pm, Amit Kapila wrote:
> > > On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> > >> There is well known Postgres problem that logical replication
> > >> subscriber can not caught-up with publisher just because LR changes
> > >> are applied by single worker and at publisher changes are made by
> > >> multiple concurrent backends.
> > >>
> > > BTW, do you know how users deal with this lag? For example, one can
> > > imagine creating multiple pub-sub pairs for different sets of tables
> > > so that the workload on the subscriber could also be shared by
> > > multiple apply workers. I can also think of splitting the workload
> > > among multiple pub-sub pairs by using row filters.
> >
> > Yes, I saw that users start several subscriptions/publications to
> > receive and apply changes in parallel. But it can not be considered a
> > universal solution:
> > 1. There are not always multiple tables (or partitions of one table), so
> > it is not always possible to split them between multiple publications.
> > 2. It violates transactional behavior (consistency): if transactions
> > update several tables included in different publications, then applying
> > these changes independently we can observe at the replica a state where
> > one table is updated and another is not. The same is true for row
> > filters.
> > 3. Each walsender will have to scan WAL, so having N subscriptions we
> > have to read and decode WAL N times.
>
> I agree that it is not a solution which can be applied in all cases,
> and neither do I want to say that we shouldn't pursue the idea of
> prefetch or parallel apply to improve the speed of apply. It was just
> to know/discuss how users try to work around the lag for cases where
> the lag is large.

If you are interested, I would like to know your opinion on a somewhat
related topic, which has triggered my interest in your patch. We are working
on an update_delete conflict detection patch. The exact problem was
explained in the initial email [1]. The basic idea to resolve the problem is
that on the subscriber we maintain a slot that helps in retaining dead
tuples for a certain period of time, till the concurrent transactions have
been applied to the subscriber. You can read the commit message of the first
patch in email [2]. Now, the problem we are facing is that because of
replication lag in a scenario similar to what we are discussing here, such
as when there are many clients on the publisher and a single apply worker on
the subscriber, the slot takes more time to get advanced. This leads to
retention of dead tuples, which further slows down the apply worker,
especially for update workloads. Apart from apply, the other transactions
running on the system (say a pgbench-kind of workload on the subscriber)
also become slower because of the retention of dead tuples.

Now, for the workloads where the lag is not there, like when one splits the
workload with the options mentioned above (split the workload among pub-sub
pairs in some way) or the workload doesn't consist of a large number of
clients operating on the publisher and subscriber at the same time, etc., we
don't observe any major slowdown on the subscriber.
We would like to solicit your opinion, as you seem to have some experience
with LR users: whether one can use this feature where required by enabling
it at the subscription level. Users will have the facility to disable it if
they face any performance regression or additional bloat. Now, after having
that feature, we can work on additional features such as prefetch or
parallel apply that will reduce the chances of lag, making the feature more
broadly used. Does that sound reasonable to you? Feel free to ignore giving
your opinion if you are not interested in that work.

[1] - https://www.postgresql.org/message-id/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/message-id/OS0PR01MB5716ECC539008C85E7AB65C5944FA%40OS0PR01MB5716.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
On 15/07/2025 2:31 PM, Amit Kapila wrote:
> If you are interested, I would like to know your opinion on a somewhat
> related topic, which has triggered my interest in your patch. We are
> working on an update_delete conflict detection patch. The exact problem
> was explained in the initial email [1]. The basic idea to resolve the
> problem is that on the subscriber we maintain a slot that helps in
> retaining dead tuples for a certain period of time, till the concurrent
> transactions have been applied to the subscriber. You can read the commit
> message of the first patch in email [2]. Now, the problem we are facing is
> that because of replication lag in a scenario similar to what we are
> discussing here, such as when there are many clients on the publisher and
> a single apply worker on the subscriber, the slot takes more time to get
> advanced. This leads to retention of dead tuples, which further slows down
> the apply worker, especially for update workloads. Apart from apply, the
> other transactions running on the system (say a pgbench-kind of workload
> on the subscriber) also become slower because of the retention of dead
> tuples.
>
> Now, for the workloads where the lag is not there, like when one splits
> the workload with the options mentioned above (split the workload among
> pub-sub pairs in some way) or the workload doesn't consist of a large
> number of clients operating on the publisher and subscriber at the same
> time, etc., we don't observe any major slowdown on the subscriber.
>
> We would like to solicit your opinion, as you seem to have some experience
> with LR users: whether one can use this feature where required by enabling
> it at the subscription level. Users will have the facility to disable it
> if they face any performance regression or additional bloat. Now, after
> having that feature, we can work on additional features such as prefetch
> or parallel apply that will reduce the chances of lag, making the feature
> more broadly used. Does that sound reasonable to you? Feel free to ignore
> giving your opinion if you are not interested in that work.
>
> [1] - https://www.postgresql.org/message-id/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> [2] - https://www.postgresql.org/message-id/OS0PR01MB5716ECC539008C85E7AB65C5944FA%40OS0PR01MB5716.jpnprd01.prod.outlook.com

I am very sorry for the delay with my answer - it was a very busy week.

I hope that I understand the problem and the proposed approach to solve it
(it actually seems to be quite straightforward and similar to
`hot_standby_feedback`), and it definitely suffers from the same problem: a
bloated database because of lagging slots. But it is really hard to propose
some other solution (other than a backward scan of WAL, which seems to be
completely unacceptable).

Concerning user experience... First of all, a disclaimer: I am first of all
a programmer and not a DBA. Yes, I have investigated many support cases, but
it is still hard to claim that I have the full picture.

There is even no consensus concerning `hot_standby_feedback`! It is still
disabled by default in Postgres and in Neon. It makes sense for vanilla
Postgres, where replicas are first of all used for HA and only secondarily
for load balancing of read-only queries. But in Neon HA is provided in a
different way, and the only point of creating RO replicas is load balancing
(mostly for OLAP queries).
Executing heavy OLAP queries without `hot_standby_feedback` is a kind of
"Russian roulette", because the probability of a conflict with recovery is
very high. But we still use the Postgres default.

The situation with feedback slots may be different: as far as I understand,
it is mostly needed for bidirectional replication and automatic conflict
resolution. So it is assumed to be part of some distributed system (like
BDR), rather than a feature used directly by Postgres users.

I am still a little bit depressed by the complexity of LR and all related
aspects. But it is unlikely that something more elegant and simpler can be
invented :)
Here is a completely rewritten version of the prefetch patch. Now prefetch
workers do not try to apply the LR operation and then roll back the
transaction. They just perform an index lookup and so prefetch the index and
the referenced heap pages. So no hacks are needed to prevent lock conflicts
or WAL logging.

Performance results are the following (the test scenario was explained in a
previous message, as well as the shell scripts used):

update:
prefetch (2): 5:20
prefetch (4): 3:20
prefetch (8): 2:05
no prefetch: 8:30

insert:
pk (4) prefetch: 9:55
pk+sk (4) prefetch: 5:20
pk+sk (8) prefetch: 3:08
no prefetch: 9:20

The number in parentheses specifies the number of prefetch workers. For
example, to spawn 8 prefetch workers I used the following settings in
postgresql.conf.replica:

```
prefetch_replica_identity_only=off
max_worker_processes=16
max_logical_replication_workers=16
max_parallel_apply_workers_per_subscription=8
max_parallel_prefetch_workers_per_subscription=8
port=54322
```

I also ran a continuous test with a long (3 hours) update workload on the
publisher with logical replication to the subscriber. With 8 prefetch
workers the replica is able to catch up with the primary where 10 backends
are performing updates! After the end of these updates the replica was in
the same state as the primary, while without prefetch it had processed only
half of the generated WAL and needed another 5:30 hours to catch up.
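For illustration only, here is a rough sketch of what such a prefetch worker could do for an UPDATE or DELETE: run an equality scan on the replica identity index and throw the result away, so that the index pages and the referenced heap page end up in shared buffers before the main apply worker needs them. This is hypothetical code, not the attached patch: prefetch_replica_identity() is an invented name, a single-column replica identity is assumed, and core function signatures (e.g. index_beginscan()) differ between Postgres versions.

```c
#include "postgres.h"

#include "access/genam.h"
#include "access/sdir.h"
#include "access/skey.h"
#include "access/stratnum.h"
#include "access/tableam.h"
#include "executor/tuptable.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/*
 * Warm the cache for one incoming change: look up "keyval" (the replica
 * identity key decoded from the logical change) through the already opened
 * replica identity index "ridx" of table "rel".  "eqproc" is the OID of the
 * equality function for the key column.
 */
static void
prefetch_replica_identity(Relation rel, Relation ridx, Datum keyval, Oid eqproc)
{
	ScanKeyData key;
	IndexScanDesc scan;
	TupleTableSlot *slot;
	Snapshot	snap = GetLatestSnapshot();

	/* equality condition on the single replica identity column */
	ScanKeyInit(&key, 1, BTEqualStrategyNumber, eqproc, keyval);

	slot = table_slot_create(rel, NULL);
	scan = index_beginscan(rel, ridx, snap, 1, 0);
	index_rescan(scan, &key, 1, NULL, 0);

	/*
	 * The fetched tuple itself is discarded: the only goal is the side
	 * effect of reading the index and heap pages into shared buffers so
	 * that the main apply worker finds them already cached.
	 */
	(void) index_getnext_slot(scan, ForwardScanDirection, slot);

	index_endscan(scan);
	ExecDropSingleTupleTableSlot(slot);
}
```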