Discussion: Logical replication prefetch
There is a well-known Postgres problem: a logical replication subscriber cannot catch up with the publisher simply because LR changes are applied by a single worker, while at the publisher the changes are made by multiple concurrent backends. The problem is not specific to logical replication: the physical replication stream is also handled by a single walreceiver. But for physical replication Postgres now implements prefetch: looking at the blocks referenced by a WAL record, it is quite easy to predict which pages will be required for redo and prefetch them. With logical replication the situation is much more complicated.

My first idea was to implement parallel apply of transactions. But to do that we need to track dependencies between transactions. Right now Postgres can apply transactions in parallel, but only if they are streamed (which is done only for large transactions), and even then it serializes them at commit. It is possible to enforce parallel apply of short transactions using `debug_logical_replication_streaming`, but then performance is about 2x slower than sequential apply by a single worker. By removing the serialization at commit, it is possible to speed up apply about 3x and make the subscriber apply changes faster than the producer can generate them, even with multiple clients. But this is possible only if the transactions are independent, and that can be enforced only by tracking dependencies, which seems to be very non-trivial and invasive.

I still have not completely given up on the dependency-tracking approach, but decided first to try a simpler solution - prefetching. It is already used for physical replication. Certainly, in the case of physical replication it is much simpler, because each WAL record contains the list of accessed blocks.

In the case of logical replication, prefetching can be done either by prefetching accesses through the replica identity index (usually the primary key), or by executing the replication command in some background worker. The first case is certainly much easier: we just perform the index lookup in a prefetch worker, which loads the accessed index and heap pages into shared buffers, so the main apply worker does not need to read anything from disk. But it works well only for DELETE and HOT UPDATE operations.

In the second case we normally execute the LR command in a background worker and then abort the transaction. Certainly, in this case we are doing the same work twice. But the assumption is the same: parallel prefetch workers should load the affected pages, speeding up the work of the main apply worker.

I have implemented a PoC (see attached patch) and obtained first results on the efficiency of such prefetching.

*** First scenario (update-only).

Publisher:
```
create table t(pk integer primary key, counter integer, filler text default repeat('x', 1000)) with (fillfactor=10);
insert into t values (generate_series(1,100000), 0);
create publication pub1 for table t;
```

Subscriber:
```
create table t(pk integer primary key, counter integer, filler text default repeat('x', 1000)) with (fillfactor=10);
create subscription sub1 connection 'port=54321 dbname=postgres' publication pub1;
```

Then I wait until replication is synced, stop the subscriber and do random point updates in 10 sessions at the publisher:

```
pgbench -T 100 -c 10 -M prepared -n -f update.sql -p 54321 -d postgres
```

where update.sql is:

```
\set pk random(1, 100000)
update t set counter=counter+1 where pk=:pk;
```

Then I start the subscriber and measure how much time it needs to catch up.
Results:

no prefetch: 2:00 min
prefetch (replica identity only): 0:55 min
prefetch (all): 1:10 min

This is definitely the best case for replica-identity-only prefetch (update-only and no other indexes). How to interpret these results?

Without prefetch, applying the updates takes about twice as long at the subscriber as performing the updates at the publisher. It means that under a heavy workload the subscriber has no chance to catch up.

With prefetching through the replica identity index, apply time is even smaller than the time needed to perform the updates at the publisher. Performing the whole operation and aborting the transaction certainly adds more overhead, but the improvement is still quite significant.

Please also notice that these results were obtained on a system with a large amount of RAM (64GB) and a fast SSD. With a data set not fitting in RAM and much slower disks, the difference is expected to be more significant. I have tried to simulate that by adding a 0.1 msec delay to pg_preadv. With the artificial 0.1 msec `preadv` delay, I got the following results:

no prefetch: 7:40
prefetch (replica identity only): 3:10 min
prefetch (all): 3:09

In this case apply takes much more time than the 100 seconds during which the updates are performed at the publisher. Prefetch can improve speed about two times, but it doesn't allow the subscriber to catch up.

*** Second scenario: inserts with a secondary random key.

Publisher:

```
create table t(pk serial primary key, sk integer, counter integer default 0);
insert into t (sk) select random()*10000000 from generate_series(1,10000000);
create index on t(sk);
create publication pub1 for table t;
```

Subscriber:

```
create table t(pk integer primary key, sk integer, counter integer);
create index on t(sk);
create subscription sub1 connection 'port=54321 dbname=postgres' publication pub1;
```

workload:

```
pgbench -T 100 -c 10 -M prepared -n -f insert.sql -p 54321 -d postgres
```

where insert.sql is:

```
INSERT INTO t (sk) VALUES (random()*10000000);
```

Results (with the 0.1 msec delay) are the following:

no prefetch: 10:10 min
prefetch (identity): 8:25 min
prefetch (full): 5:50 min

Here, as expected, prefetching only the primary key doesn't provide a big improvement. But replaying the insert command in a prefetch worker speeds up apply almost twice.

Please notice that this approach requires minimal changes in Postgres, because all the infrastructure of parallel apply workers is already present and we can reuse the same apply code (with minimal changes) for performing prefetch. I only had to introduce extra tuple lock types (no-lock and try-lock) to minimize overhead and lock conflicts between the prefetch and main apply workers. Still, this cannot completely prevent lock conflicts and deadlocks in prefetch workers. It looks like more work is needed here. Also, I set `wal_level=minimal` in prefetch workers to avoid WAL-logging overhead.

The number of prefetch workers is specified by the `max_parallel_prefetch_workers_per_subscription` GUC. If it is zero (the default), no prefetching is performed. The prefetch mode is controlled by the `prefetch_replica_identity_only` GUC. By default it is true, which makes prefetch efficient for HOT updates, deletes, or inserts into a table with just one index (the primary key).

Attached please find the patch and the two shell scripts used to produce these test results. It may also be more convenient to inspect this patch as a PR: https://github.com/knizhnik/postgres/pull/3

I wonder whether such an LR prefetching approach is considered useful? Or is it better to investigate other ways to improve LR apply speed (parallel apply)?
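To make the replica-identity-only mode concrete, here is a minimal sketch (a hypothetical helper, not the code of the attached patch) of what a prefetch worker does for an incoming UPDATE/DELETE: it repeats the index lookup that the main apply worker will perform later, so that the index and heap pages are already in shared buffers when the change is really applied. The lock mode is where the patch differs (it adds no-lock/try-lock modes); an existing mode is used here just for illustration.

```c
#include "postgres.h"

#include "access/tableam.h"
#include "executor/executor.h"
#include "executor/tuptable.h"
#include "nodes/lockoptions.h"
#include "utils/rel.h"

/*
 * Hypothetical prefetch helper: perform the same replica-identity index
 * lookup as the apply worker, only to pull the index and heap pages into
 * shared buffers.  The found tuple itself is discarded.
 */
static void
prefetch_tuple_by_replica_identity(Relation rel, Oid idxoid,
								   TupleTableSlot *searchslot)
{
	TupleTableSlot *outslot = table_slot_create(rel, NULL);

	/* The lookup is the prefetch; its result is not used. */
	(void) RelationFindReplTupleByIndex(rel, idxoid,
										LockTupleNoKeyExclusive,
										searchslot, outslot);

	ExecDropSingleTupleTableSlot(outslot);
}
```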
Attachments
On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > There is well known Postgres problem that logical replication subscriber > can not caught-up with publisher just because LR changes are applied by > single worker and at publisher changes are made by > multiple concurrent backends. The problem is not logical replication > specific: physical replication stream is also handled by single > walreceiver. But for physical replication Postgres now implements > prefetch: looking at WAL record blocks it is quite easy to predict which > pages will be required for redo and prefetch them. With logical > replication situation is much more complicated. > > My first idea was to implement parallel apply of transactions. But to do > it we need to track dependencies between transactions. Right now > Postgres can apply transactions in parallel, but only if they are > streamed (which is done only for large transactions) and serialize them > by commits. It is possible to enforce parallel apply of short > transactions using `debug_logical_replication_streaming` but then > performance is ~2x times slower than in case of sequential apply by > single worker. > What is the reason of such a large slow down? Is it because the amount of network transfer has increased without giving any significant advantage because of the serialization of commits? > By removing serialization by commits, it is possible to > speedup apply 3x times and make subscriber apply changes faster then > producer can produce them even with multiple clients. But it is possible > only if transactions are independent and it can be enforced only by > tracking dependencies which seems to be very non-trivial and invasive. > > I still do not completely give up with tracking dependencies approach, > but decided first to try more simple solution - prefetching. > Sounds reasonable, but in the long run, we should track transaction dependencies and allow parallel apply of all the transactions. It is > already used for physical replication. Certainly in case of physical > replication it is much simpler, because each WAL record contains list of > accessed blocks. > > In case of logical replication prefetching can be done either by > prefetching access to replica identity index (usually primary key), > either by executing replication command by some background worker > Certainly first case is much more easy. > It seems there is only one case described, so what exactly are you referring to first and second? We just perform index lookup in > prefetch worker and it loads accessed index and heap pages in shared > buffer, so main apply worker does not need to read something from disk. > But it works well only for DELETE and HOT UPDATE operations. > > In the second case we normally execute the LR command in background > worker and then abort transaction. Certainly in this case we are doing > the same work twice. But assumption is the same: parallel prefetch > workers should load affected pages, speeding up work of the main apply > worker. > > I have implemented some PoC (see attached patch). And get first results > of efficiency of such prefetching. > > *** First scenario (update-only). 
> > Publisher: > ``` > create table t(pk integer primary key, counter integer, filler text > default repeat('x', 1000)) with (fillfactor=10); > insert into t values (generate_series(1,100000), 0); > create publication pub1 for table t; > ``` > > Subscriber: > ``` > create table t(pk integer primary key, counter integer, filler text > default repeat('x', 1000)) with (fillfactor=10); > create subscription sub1 connection 'port=54321 dbname=postgres' > publication pub1; > ``` > > Then I wait until replication is synced, stop subscriber and do random > dot updates in 10 sessions at publisher: > > ``` > pgbench -T 100 -c 10 -M prepared -n -f update.sql -p 54321 -d postgres > ``` > > where update.sql is: > > ``` > \set pk random(1, 100000) > update t set counter=counter+1 where pk=:pk; > ``` > > Then I start subscriber and measure how much time is needed for it to > caught up. > Results: > > no prefetch: 2:00 min > prefetch (replica identity only): 0:55 min > prefetch (all): 1:10 min > > This is definitely the best case for replica-identity index only > prefetch (update-only and no other indexes). > How to interpret this results? > > Without prefetch applying updates takes about two times more at > subscriber than performing this updates at publisher. > It means that under huge workload subscriber has no chances to caught up. > > With prefetching replica identity index, apply time is even smaller than > time needed to perform updates at publisher. > Performing the whole operation and transaction abort certainly adds more > overhead. But still improvement is quite significant. > > Please also notice that this results were obtains at the system with > larger amount of RAM (64Gb) and fast SSD. > With data set not fitting in RAM and much slower disks, the difference > is expected to be more significant. > But what about worst cases where these additional pre-fetches could lead to removing some pages from shared_buffers, which are required by the workload on the subscriber? I think you try such workloads as well. > I have tried to simulate it be adding 0.1msec delay to pg_preadv. > When I add artificial 0.1msec `preadv` delay, I got the following results: > > no prefetch: 7:40 > prefetch (replica identity only): 3:10 min > prefetch (all): 3:09 > > > In this case apply takes much more time than 100 seconds during which > updates are performed at publisher. Prefetch can improve speed about two > times, > but it doesn't allow subcriber to caught-up. > ... > > Please notice that this approach requires minimal changes in Postgres, > because all infrastructure of parallel apply workers is already present > and we can reuse the same apply code (with minimal changes) for > performing prefetch. I only have to introduce extra tuple lock types > (no-lock and try-lock) to minimize overhead and lock conflicts between > prefetch and main apply workers. Still it can not completely prevent > locks conflicts and deadlocks in prefetch workers. Looks like more work > is needed here. > I understand that it is just a POC, so you haven't figured out all the details, but it would be good to know the reason of these deadlocks. > > I wonder if such LR prefetching approach is considered to be useful? > Or it is better to investigate other ways to improve LR apply speed > (parallel apply)? > I think it could be a good intermediate step till we are able to find a solution for tracking the dependencies. Do you think this work will be useful once we have parallel apply, and if so how? -- With Regards, Amit Kapila.
On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:

It is possible to enforce parallel apply of short transactions using `debug_logical_replication_streaming` but then performance is ~2x times slower than in case of sequential apply by single worker.

What is the reason of such a large slow down? Is it because the amount of network transfer has increased without giving any significant advantage because of the serialization of commits?
No, I do not think that network traffic is somehow increased.
If I remove the locks (just by commenting out the bodies of the `pa_lock_stream` and `pa_unlock_stream` functions and the call of `pa_wait_for_xact_finish`), I get a 3x speed improvement (with 4 parallel apply workers) compared with normal mode
(when transactions are applied by the main logical replication worker). So the main reason is lock overhead/contention and de-facto serialization of transactions (in `top` I see that only one worker is active most of the time).
Even with simulated 0.1msec read delay, results of update tests are the following:
normal mode: 7:40
forced parallel mode: 8:30
forced parallel mode (no locks): 1:45
By removing serialization by commits, it is possible to speedup apply 3x times and make subscriber apply changes faster then producer can produce them even with multiple clients. But it is possible only if transactions are independent and it can be enforced only by tracking dependencies which seems to be very non-trivial and invasive. I still do not completely give up with tracking dependencies approach, but decided first to try more simple solution - prefetching.

Sounds reasonable, but in the long run, we should track transaction dependencies and allow parallel apply of all the transactions.
I agree.
I see two different approaches:
1. Build a dependency graph: track dependencies between xids when transactions are executed at the publisher and then include this graph in the commit record.
2. Calculate hashes of the replica identity keys and check that the data sets of the transactions do not intersect (this certainly will not work if there are some triggers). A rough sketch of this idea follows below.
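Here is a tiny sketch of what approach (2) could look like (purely illustrative; the structure and names are hypothetical, not part of any patch): each in-flight transaction carries a small Bloom-filter-like bitmap of the replica identity key hashes it touches, and a new transaction may be dispatched to a parallel apply worker only if its bitmap does not intersect the bitmaps of transactions that are still being applied.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-transaction working set: one bit per hash bucket of the
 * replica identity keys the transaction modifies.
 */
typedef struct TxnWorkingSet
{
	uint64		bits;
} TxnWorkingSet;

static void
working_set_add(TxnWorkingSet *ws, uint64 identity_hash)
{
	ws->bits |= UINT64_C(1) << (identity_hash % 64);
}

/*
 * Conservative intersection test: may report a false conflict (forcing
 * serialization), but never misses a real one.
 */
static bool
working_sets_may_conflict(const TxnWorkingSet *a, const TxnWorkingSet *b)
{
	return (a->bits & b->bits) != 0;
}
```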
It is already used for physical replication. Certainly in case of physical replication it is much simpler, because each WAL record contains list of accessed blocks. In case of logical replication prefetching can be done either by prefetching access to replica identity index (usually primary key), either by executing replication command by some background worker Certainly first case is much more easy.

It seems there is only one case described, so what exactly are you referring to first and second?
First: perform a lookup in the replica identity index (primary key). It will prefetch the index pages and the referenced heap page.
Second: execute the LR operation (insert/update) in a prefetch worker and then roll back the transaction (sketched below).
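A minimal sketch of this second mode, under the assumption that the prefetch worker can reuse the regular apply path (the helper and its callback are hypothetical names, not from the patch):

```c
#include "postgres.h"

#include "access/xact.h"

/*
 * Hypothetical "full" prefetch for one logical replication change: execute
 * the operation through the normal apply code so that all index and heap
 * pages it needs are read into shared buffers, then throw the work away by
 * aborting the transaction.  The pages remain cached for the main worker.
 */
static void
prefetch_change_by_replay(void (*apply_change) (void *arg), void *arg)
{
	StartTransactionCommand();
	apply_change(arg);			/* same apply code the main worker uses */
	AbortCurrentTransaction();	/* discard effects; pages stay in cache */
}
```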
But what about worst cases where these additional pre-fetches could lead to removing some pages from shared_buffers, which are required by the workload on the subscriber? I think you try such workloads as well.
It is the common problem of all prefetch algorithms: if the size of the cache where prefetch results are stored (shared buffers, OS file cache, ...) is not large enough to keep a prefetched result until it is used,
then prefetch will not provide any performance improvement and may even cause some degradation.
So it is a really challenging task to choose the optimal time for a prefetch operation: too early - and its results will be thrown away before they are requested; too late - and the executor has to wait for prefetch completion or load the page itself. Certainly there is some kind of autotuning: a worker performing prefetch has to wait for IO completion, while the executor, which picks up pages from the cache, processes requests faster and so should catch up with the prefetch workers. Then it has to perform IO itself and starts to fall behind the prefetch workers again.
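One simple way to express this trade-off (an illustrative sketch only, not code from the patch; in a real implementation the counters would live in shared memory) is to bound how far the prefetch workers may run ahead of the apply worker:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed tuning knob: max operations prefetched but not yet applied. */
#define MAX_PREFETCH_DISTANCE 128

typedef struct PrefetchWindow
{
	uint64		prefetched;		/* operations prefetched so far */
	uint64		applied;		/* operations applied so far */
} PrefetchWindow;

/*
 * Prefetch workers pause when they are too far ahead (pages would risk
 * eviction before use) and resume once the apply worker catches up.
 */
static bool
prefetch_may_proceed(const PrefetchWindow *w)
{
	return (w->prefetched - w->applied) < MAX_PREFETCH_DISTANCE;
}
```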
I understand that it is just a POC, so you haven't figured out all the details, but it would be good to know the reason of these deadlocks.
Will investigate it.
I wonder if such LR prefetching approach is considered to be useful? Or it is better to investigate other ways to improve LR apply speed (parallel apply)?

I think it could be a good intermediate step till we are able to find a solution for tracking the dependencies. Do you think this work will be useful once we have parallel apply, and if so how?
I think that once we have parallel apply, prefetch is not needed.
At least that is what I see now in Neon (open source serverless Postgres which separates compute and storage).
We have implemented prefetch for seqscan and indexscan because of the relatively large access latency of the page server. And it can really significantly improve performance - 4 or even more times.
But the same effect can be achieved by forcing a parallel plan with a larger number of parallel workers. Unfortunately the effects of these two optimizations do not multiply, so a parallel plan + prefetch shows almost the same speed as either optimization alone.
On Wed, Jul 9, 2025 at 12:08 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > On 08/07/2025 2:51 pm, Amit Kapila wrote: > > On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > It is possible to enforce parallel apply of short > transactions using `debug_logical_replication_streaming` but then > performance is ~2x times slower than in case of sequential apply by > single worker. > > What is the reason of such a large slow down? Is it because the amount > of network transfer has increased without giving any significant > advantage because of the serialization of commits? > > No, I do not think that network traffic is somehow increased. > If I removed locks (just by commenting body of `pa_lock_stream` and `pa_unlock_stream` functions and callof `pa_wait_for_xact_finish`),I get 3x speed improvement (with 4 parallel apply workers) comparing with normal mode > (when transactions are applied by main logical replication worker). So the main reason is lock overhead/contention andde-facto serialization of transactions (in `top` I see that only one worker is active most the time. > > Even with simulated 0.1msec read delay, results of update tests are the following: > > normal mode: 7:40 > forced parallel mode: 8:30 > forced parallel mode (no locks): 1:45 > > By removing serialization by commits, it is possible to > speedup apply 3x times and make subscriber apply changes faster then > producer can produce them even with multiple clients. But it is possible > only if transactions are independent and it can be enforced only by > tracking dependencies which seems to be very non-trivial and invasive. > > I still do not completely give up with tracking dependencies approach, > but decided first to try more simple solution - prefetching. > > Sounds reasonable, but in the long run, we should track transaction > dependencies and allow parallel apply of all the transactions. > > I agree. > I see two different approaches: > > 1. Build dependency graph: track dependency between xids when transaction is executed at publisher and then include thisgraph in commit record. > 2. Calculate hash of replica identity key and check that data sets of transactions do no intersect (certainly will notworkif there are some triggers). > I think it is better to compute transaction dependencies on the subscriber side because there could be many transactions that could be filtered because the containing tables or rows are not published. > > > But what about worst cases where these additional pre-fetches could > lead to removing some pages from shared_buffers, which are required by > the workload on the subscriber? I think you try such workloads as > well. > > It is the common problem of all prefetch algorithms: if size of cache where prefetch results are stored (shared buffers,OS file cache,...) is not larger enough to keep prefetch result until it will be used, > then prefetch will not provide any improvement of performance and may be even cause some degradation. > So it is really challenged task to choose optimal time for prefetch operation: too early - and its results will be thrownaway before requested, too late - executor has to wait prefetch completion or load page itself. Certainly there issome kind of autotuning: worker performing prefetch has to wait for IO completion and executor whichm pickup page fromcache process requests faster and so should catch up prefetch workers. Then it has to perform IO itself and start fallbehind prefetch workers. 
> > > > I understand that it is just a POC, so you haven't figured out all the > details, but it would be good to know the reason of these deadlocks. > > Will investigate it. > > > I wonder if such LR prefetching approach is considered to be useful? > Or it is better to investigate other ways to improve LR apply speed > (parallel apply)? > > I think it could be a good intermediate step till we are able to find > a solution for tracking the dependencies. Do you think this work will > be useful once we have parallel apply, and if so how? > > > I think that if we will have parallel apply, prefetch is not needed. > At least that is what I see now in Neon (open source serverless Postgres which separates compute and storage). > We have implemented prefetch for seqscan and indexscan because of relatively large acess latency with page server. Andit can really significantly improve performance - 4 or even more times. > But the same effect cna be achieved by forcing parallel plan with larger number of parallel workers. Unfortunately effectof this two optimizations is not multiplied, so parallel plan + prefetch shows almost the same speed as any of thisoptimizations. > > I think our case is a bit different, and prefetch could even be used when we are able to track dependencies and achieve true parallelism. We can consider using prefetch to speed up dependent transactions that can't be parallelized. -- With Regards, Amit Kapila.
On 11/07/2025 11:52 am, Amit Kapila wrote: > On Wed, Jul 9, 2025 at 12:08 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote: >> On 08/07/2025 2:51 pm, Amit Kapila wrote: >> >> On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: >> >> It is possible to enforce parallel apply of short >> transactions using `debug_logical_replication_streaming` but then >> performance is ~2x times slower than in case of sequential apply by >> single worker. >> >> What is the reason of such a large slow down? Is it because the amount >> of network transfer has increased without giving any significant >> advantage because of the serialization of commits? >> >> No, I do not think that network traffic is somehow increased. >> If I removed locks (just by commenting body of `pa_lock_stream` and `pa_unlock_stream` functions and callof `pa_wait_for_xact_finish`),I get 3x speed improvement (with 4 parallel apply workers) comparing with normal mode >> (when transactions are applied by main logical replication worker). So the main reason is lock overhead/contention andde-facto serialization of transactions (in `top` I see that only one worker is active most the time. >> >> Even with simulated 0.1msec read delay, results of update tests are the following: >> >> normal mode: 7:40 >> forced parallel mode: 8:30 >> forced parallel mode (no locks): 1:45 >> >> By removing serialization by commits, it is possible to >> speedup apply 3x times and make subscriber apply changes faster then >> producer can produce them even with multiple clients. But it is possible >> only if transactions are independent and it can be enforced only by >> tracking dependencies which seems to be very non-trivial and invasive. >> >> I still do not completely give up with tracking dependencies approach, >> but decided first to try more simple solution - prefetching. >> >> Sounds reasonable, but in the long run, we should track transaction >> dependencies and allow parallel apply of all the transactions. >> >> I agree. >> I see two different approaches: >> >> 1. Build dependency graph: track dependency between xids when transaction is executed at publisher and then include thisgraph in commit record. >> 2. Calculate hash of replica identity key and check that data sets of transactions do no intersect (certainly will notworkif there are some triggers). >> > I think it is better to compute transaction dependencies on the > subscriber side because there could be many transactions that could be > filtered because the containing tables or rows are not published. It is certainly true. Also tracking dependencies at subscriber doesn't require to change protocol and makes it possible to do parallel apply for old publishers. But from the other hand it seems to be too late. It will be nice that just after receiving first transaction statement, apply process can make a decision to which parallel apply worker it should be sent. At subscriber side it seems to be easier to calculate hash of replica identity key and based on this hash (or more precisely the set of hashes representing transaction working set) make a decision whther transaction interleave with some prior transaction or not and schedule it accordingly. >> >> But what about worst cases where these additional pre-fetches could >> lead to removing some pages from shared_buffers, which are required by >> the workload on the subscriber? I think you try such workloads as >> well. 
>> >> It is the common problem of all prefetch algorithms: if size of cache where prefetch results are stored (shared buffers, OS file cache,...) is not larger enough to keep prefetch result until it will be used, >> then prefetch will not provide any improvement of performance and may be even cause some degradation. >> So it is really challenged task to choose optimal time for prefetch operation: too early - and its results will be thrown away before requested, too late - executor has to wait prefetch completion or load page itself. Certainly there is some kind of autotuning: worker performing prefetch has to wait for IO completion and executor which pickup page from cache process requests faster and so should catch up prefetch workers. Then it has to perform IO itself and start fall behind prefetch workers. >> >> >> >> I understand that it is just a POC, so you haven't figured out all the >> details, but it would be good to know the reason of these deadlocks. >> >> Will investigate it.

I found the reason: a conflict happens between the main apply worker and a prefetch worker that managed to catch up with the main worker, so both try to apply the same statement. I fixed the problem by adding an extra parameter to ExecSimpleRelationUpdate/Insert and handling prefetch as a kind of speculative operation. With this change the results for the insert test are the following:

no prefetch: 10 min
prefetch (identity): 8 min
prefetch (full): 3 min

> I think our case is a bit different, and prefetch could even be used > when we are able to track dependencies and achieve true parallelism. > We can consider using prefetch to speed up dependent transactions that > can't be parallelized.

Makes sense.
On 08/07/2025 2:51 pm, Amit Kapila wrote: > On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: >> There is well known Postgres problem that logical replication subscriber >> can not caught-up with publisher just because LR changes are applied by >> single worker and at publisher changes are made by >> multiple concurrent backends. The problem is not logical replication >> specific: physical replication stream is also handled by single >> walreceiver. But for physical replication Postgres now implements >> prefetch: looking at WAL record blocks it is quite easy to predict which >> pages will be required for redo and prefetch them. With logical >> replication situation is much more complicated. >> >> My first idea was to implement parallel apply of transactions. But to do >> it we need to track dependencies between transactions. Right now >> Postgres can apply transactions in parallel, but only if they are >> streamed (which is done only for large transactions) and serialize them >> by commits. It is possible to enforce parallel apply of short >> transactions using `debug_logical_replication_streaming` but then >> performance is ~2x times slower than in case of sequential apply by >> single worker. >> > What is the reason of such a large slow down? Is it because the amount > of network transfer has increased without giving any significant > advantage because of the serialization of commits?

It is not directly related to the subject, but I do not understand this code:

```
    /*
     * Stop the worker if there are enough workers in the pool.
     *
     * XXX Additionally, we also stop the worker if the leader apply worker
     * serialize part of the transaction data due to a send timeout. This is
     * because the message could be partially written to the queue and there
     * is no way to clean the queue other than resending the message until it
     * succeeds. Instead of trying to send the data which anyway would have
     * been serialized and then letting the parallel apply worker deal with
     * the spurious message, we stop the worker.
     */
    if (winfo->serialize_changes ||
        list_length(ParallelApplyWorkerPool) >
        (max_parallel_apply_workers_per_subscription / 2))
    {
        logicalrep_pa_worker_stop(winfo);
        pa_free_worker_info(winfo);

        return;
    }
```

It stops the worker if the number of workers in the pool is more than half of `max_parallel_apply_workers_per_subscription`. What I see is that `pa_launch_parallel_worker` spawns new workers, and after completion of a transaction the worker is immediately terminated. Actually this leads to an awful slowdown of the apply process. If I just disable this check, so that all `max_parallel_apply_workers_per_subscription` workers are actually used for applying transactions, then the time of parallel apply with 4 workers is 6 minutes, compared with 10 minutes for applying all transactions by the main apply worker. It is still not such a large improvement, but at least it is an improvement and not a degradation.
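For reference, the experiment described above ("just disable" the early termination) roughly corresponds to relaxing that condition so that workers are retained up to the full limit. A hedged sketch of such a variant (not necessarily the exact change that was tested):

```c
	/*
	 * Hypothetical variant of the check quoted above: keep parallel apply
	 * workers in the pool up to the full
	 * max_parallel_apply_workers_per_subscription limit instead of stopping
	 * them once half of the limit is reached.  (Sketch only; the actual
	 * experiment simply disabled the early stop.)
	 */
	if (winfo->serialize_changes ||
		list_length(ParallelApplyWorkerPool) >
		max_parallel_apply_workers_per_subscription)
	{
		logicalrep_pa_worker_stop(winfo);
		pa_free_worker_info(winfo);

		return;
	}
```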
On Fri, Jul 11, 2025 at 7:49 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > On 08/07/2025 2:51 pm, Amit Kapila wrote: > > On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > >> There is well known Postgres problem that logical replication subscriber > >> can not caught-up with publisher just because LR changes are applied by > >> single worker and at publisher changes are made by > >> multiple concurrent backends. The problem is not logical replication > >> specific: physical replication stream is also handled by single > >> walreceiver. But for physical replication Postgres now implements > >> prefetch: looking at WAL record blocks it is quite easy to predict which > >> pages will be required for redo and prefetch them. With logical > >> replication situation is much more complicated. > >> > >> My first idea was to implement parallel apply of transactions. But to do > >> it we need to track dependencies between transactions. Right now > >> Postgres can apply transactions in parallel, but only if they are > >> streamed (which is done only for large transactions) and serialize them > >> by commits. It is possible to enforce parallel apply of short > >> transactions using `debug_logical_replication_streaming` but then > >> performance is ~2x times slower than in case of sequential apply by > >> single worker. > >> > > What is the reason of such a large slow down? Is it because the amount > > of network transfer has increased without giving any significant > > advantage because of the serialization of commits? > > > It is not directly related with subj, but I do not understand this code: > > ``` > /* > * Stop the worker if there are enough workers in the pool. > * > * XXX Additionally, we also stop the worker if the leader apply worker > * serialize part of the transaction data due to a send timeout. > This is > * because the message could be partially written to the queue and > there > * is no way to clean the queue other than resending the message > until it > * succeeds. Instead of trying to send the data which anyway would have > * been serialized and then letting the parallel apply worker deal with > * the spurious message, we stop the worker. > */ > if (winfo->serialize_changes || > list_length(ParallelApplyWorkerPool) > > (max_parallel_apply_workers_per_subscription / 2)) > { > logicalrep_pa_worker_stop(winfo); > pa_free_worker_info(winfo); > > return; > } > ``` > > It stops worker if number fo workers in pool is more than half of > `max_parallel_apply_workers_per_subscription`. > What I see is that `pa_launch_parallel_worker` spawns new workers and > after completion of transaction it is immediately terminated. > Actually this leads to awful slowdown of apply process. > I didn't understand your scenario. pa_launch_parallel_worker() should spawn a new worker only if all the workers in the pool are busy, and then it will free the worker if the pool already has enough workers. So, do you mean to say that the workers in the pool are always busy in your workload which lead spawn/exit of new workers? Can you please explain your scenario in some more detail? -- With Regards, Amit Kapila.
On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > There is well known Postgres problem that logical replication subscriber > can not caught-up with publisher just because LR changes are applied by > single worker and at publisher changes are made by > multiple concurrent backends. > BTW, do you know how users deal with this lag? For example, one can imagine creating multiple pub-sub pairs for different sets of tables so that the workload on the subscriber could also be shared by multiple apply workers. I can also think of splitting the workload among multiple pub-sub pairs by using row filters. -- With Regards, Amit Kapila.
On 13/07/2025 9:28 am, Amit Kapila wrote: > I didn't understand your scenario. pa_launch_parallel_worker() should > spawn a new worker only if all the workers in the pool are busy, and > then it will free the worker if the pool already has enough workers. > So, do you mean to say that the workers in the pool are always busy in > your workload which lead spawn/exit of new workers? Can you please > explain your scenario in some more detail?

Current LR apply logic does not work well for applying small OLTP transactions. First of all, by default the reorder buffer at the publisher will buffer them and so prevent parallel apply at the subscriber. The publisher switches to streaming mode only if a transaction is too large or `debug_logical_replication_streaming=immediate` is set.

But even if we force the publisher to stream short transactions, the subscriber will try to launch a new parallel apply worker for each transaction (if all existing workers are busy). If there are 100 active backends at the publisher, then the subscriber will try to launch 100 parallel apply workers. Most likely this fails because of the limit on the maximal number of workers. In this case the leader will serialize such transactions. So if there are 100 streamed transactions and 10 parallel apply workers, then 10 transactions are started in parallel and 90 will be serialized to disk. That is not very efficient for short transactions; it would be better to wait for some time until some of the workers become vacant.

But the worst thing happens when a parallel apply worker completes its transaction. If the number of parallel apply workers in the pool exceeds `max_parallel_apply_workers_per_subscription / 2`, then this parallel apply worker is terminated. So instead of having `max_parallel_apply_workers_per_subscription` workers applying transactions at maximal possible speed, and a leader which distributes transactions between them and stops receiving new data from the publisher if there is no vacant worker, we have the leader serializing and writing transactions to disk (and then definitely reading them back from disk) and permanently starting and terminating parallel apply worker processes. This leads to awful performance.

Certainly the originally intended use case was different: parallel apply is performed only for large transactions. The number of such transactions is not that big, so there should be enough parallel apply workers in the pool to process them. And if there are not enough workers, it is not a problem to spawn a new one and terminate it after completion of the transaction (because the transaction is long, the overhead of spawning a process is not so large compared with the redo of a large transaction).

But if we want to efficiently replicate an OLTP workload, then we definitely need some other approach. Prefetch is actually more compatible with the current implementation, because prefetch operations don't need to be grouped by transaction and can be executed by any prefetch worker.
On 13/07/2025 1:28 pm, Amit Kapila wrote: > On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: >> There is well known Postgres problem that logical replication subscriber >> can not caught-up with publisher just because LR changes are applied by >> single worker and at publisher changes are made by >> multiple concurrent backends. >> > BTW, do you know how users deal with this lag? For example, one can > imagine creating multiple pub-sub pairs for different sets of tables > so that the workload on the subscriber could also be shared by > multiple apply workers. I can also think of splitting the workload > among multiple pub-sub pairs by using row filters

Yes, I have seen that users start several subscriptions/publications to receive and apply changes in parallel. But it cannot be considered a universal solution:

1. There are not always multiple tables (or partitions of one table), so it is not always possible to split them between multiple publications.
2. It violates transactional behavior (consistency): if transactions update several tables included in different publications, then by applying these changes independently we can observe at the replica a state where one table has been updated and another has not. The same is true for row filters.
3. Each walsender has to scan the WAL, so with N subscriptions we have to read and decode the WAL N times.
On Tuesday, July 8, 2025 2:36 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > There is well known Postgres problem that logical replication subscriber can > not caught-up with publisher just because LR changes are applied by single > worker and at publisher changes are made by multiple concurrent backends. > The problem is not logical replication > specific: physical replication stream is also handled by single walreceiver. But > for physical replication Postgres now implements > prefetch: looking at WAL record blocks it is quite easy to predict which pages > will be required for redo and prefetch them. With logical replication situation is > much more complicated. > > My first idea was to implement parallel apply of transactions. But to do it we > need to track dependencies between transactions. Right now Postgres can > apply transactions in parallel, but only if they are streamed (which is done > only for large transactions) and serialize them by commits. It is possible to > enforce parallel apply of short transactions using > `debug_logical_replication_streaming` but then performance is ~2x times > slower than in case of sequential apply by single worker. By removing > serialization by commits, it is possible to speedup apply 3x times and make > subscriber apply changes faster then producer can produce them even with > multiple clients. But it is possible only if transactions are independent and it > can be enforced only by tracking dependencies which seems to be very > non-trivial and invasive. > > I still do not completely give up with tracking dependencies approach, but > decided first to try more simple solution - prefetching. It is already used for > physical replication. Certainly in case of physical replication it is much simpler, > because each WAL record contains list of accessed blocks. > > In case of logical replication prefetching can be done either by prefetching > access to replica identity index (usually primary key), either by executing > replication command by some background worker Certainly first case is much > more easy. We just perform index lookup in prefetch worker and it loads > accessed index and heap pages in shared buffer, so main apply worker does > not need to read something from disk. > But it works well only for DELETE and HOT UPDATE operations. > > In the second case we normally execute the LR command in background > worker and then abort transaction. Certainly in this case we are doing the same > work twice. But assumption is the same: parallel prefetch workers should load > affected pages, speeding up work of the main apply worker. > > I have implemented some PoC (see attached patch). And get first results of > efficiency of such prefetching. > > *** First scenario (update-only). 
> > Publisher: > ``` > create table t(pk integer primary key, counter integer, filler text default repeat('x', > 1000)) with (fillfactor=10); insert into t values (generate_series(1,100000), 0); > create publication pub1 for table t; ``` > > Subscriber: > ``` > create table t(pk integer primary key, counter integer, filler text default repeat('x', > 1000)) with (fillfactor=10); create subscription sub1 connection 'port=54321 > dbname=postgres' > publication pub1; > ``` > > Then I wait until replication is synced, stop subscriber and do random dot > updates in 10 sessions at publisher: > > ``` > pgbench -T 100 -c 10 -M prepared -n -f update.sql -p 54321 -d postgres ``` > > where update.sql is: > > ``` > \set pk random(1, 100000) > update t set counter=counter+1 where pk=:pk; ``` > > Then I start subscriber and measure how much time is needed for it to caught > up. > Results: > > no prefetch: 2:00 min > prefetch (replica identity only): 0:55 min prefetch (all): 1:10 min > > This is definitely the best case for replica-identity index only prefetch > (update-only and no other indexes). > How to interpret this results? > > Without prefetch applying updates takes about two times more at subscriber > than performing this updates at publisher. > It means that under huge workload subscriber has no chances to caught up. > > With prefetching replica identity index, apply time is even smaller than time > needed to perform updates at publisher. > Performing the whole operation and transaction abort certainly adds more > overhead. But still improvement is quite significant. > > Please also notice that this results were obtains at the system with larger > amount of RAM (64Gb) and fast SSD. > With data set not fitting in RAM and much slower disks, the difference is > expected to be more significant. > I have tried to simulate it be adding 0.1msec delay to pg_preadv. > When I add artificial 0.1msec `preadv` delay, I got the following results: > > no prefetch: 7:40 > prefetch (replica identity only): 3:10 min prefetch (all): 3:09 > > > In this case apply takes much more time than 100 seconds during which > updates are performed at publisher. Prefetch can improve speed about two > times, but it doesn't allow subcriber to caught-up. > > > > *** Second scenario: inserts with secondary random key. > > > Publisher: > > ``` > create table t(pk serial primary key, sk integer, counter integer default 0) insert > into t (sk) select random()*10000000 from generate_series(1,10000000) create > index on t(sk) create publication pub1 for table t ``` > > Subscriber: > ``` > > create table t(pk integer primary key, sk integer, counter integer) create index on > t(sk) create subscription sub1 connection 'port=54321 dbname=postgres' > publication pub1 > ``` > > workload: > > ``` > pgbench -T 100 -c 10 -M prepared -n -f insert.sql -p 54321 -d postgres > > ``` > > where insert.sql: > > ``` > INSERT INTO t (sk) VALUES (random()*10000000); ``` > > Results (with 0.1msec delay) are the followingL > > no prefetch: 10:10 min > prefetch (identity): 8:25 min > prefetch (full): 5:50min > > Here as expected prefetching only primary key doesn't provide some big > improvement. But replaying insert command in prefetch worker allows to > speedup apply almost twice. > > Please notice that this approach requires minimal changes in Postgres, > because all infrastructure of parallel apply workers is already present and we > can reuse the same apply code (with minimal changes) for performing prefetch. 
> I only have to introduce extra tuple lock types (no-lock and try-lock) to minimize > overhead and lock conflicts between prefetch and main apply workers. Still it > can not completely prevent locks conflicts and deadlocks in prefetch workers. > Looks like more work is needed here. Also I set `wal_level=minimal` in > prefetch workers to avoid WAL-logging overhead. > > Number of prefetch workers is specified by > `max_parallel_prefetch_workers_per_subscription` GUC. If it is zero > (default) then no prefetching is performed. > Prefetch mode is controlled by `prefetch_replica_identity_only` GUC . By > default it is true which makes prefetch efficient for hot updates, deletes or > inserts in table with just one index (primary key). > > > Attached please find patch and two shell scripts used to produce this test > results. > Also it may be more convenient to inspect this patch as PR: > https://github.com/knizhnik/postgres/pull/3 > > I wonder if such LR prefetching approach is considered to be useful? > Or it is better to investigate other ways to improve LR apply speed (parallel > apply)? Thank you for the proposal ! I find it to be a very interesting feature。 I tested the patch you shared in your original email and encountered potential deadlocks when testing pgbench TPC-B like workload. Could you please provide an updated patch version so that I can conduct further performance experiments ? Additionally, I was also exploring ways to improve performance and have tried an alternative version of prefetch for experimentation. The alternative design is that we assigns each non-streaming transaction to a parallel apply worker, while strictly maintaining the order of commits. During parallel apply, if the transactions that need to be committed before the current transaction are not yet finished, the worker performs pre-fetch operations. Specifically, for updates and deletes, the worker finds and caches the target local tuple to be updated/deleted. Once all preceding transactions are committed, the parallel apply worker uses these cached tuples to execute the actual updates or deletes. What do you think about this alternative ? I think the alternative might offer more stability in scenarios where shared buffer elimination occurs frequently and avoids leaving dead tuples in the buffer. However, it also presents some drawbacks, such as the need to add wait events to maintain commit order, compared to the approach discussed in this thread. (Note that, due to time constraints, I have not implemented the pre-fetch for Inserts and the code is not in reviewable shape and lacks comments and documentation, but just share the POC patch for reference). Best Regards, Hou zj
Attachments
Thank you for the proposal! I find it to be a very interesting feature. I tested the patch you shared in your original email and encountered potential deadlocks when testing a pgbench TPC-B like workload. Could you please provide an updated patch version so that I can conduct further performance experiments?
Sorry, it was fixed in my repo: https://github.com/knizhnik/postgres/pull/3
Updated patch is attached.
Additionally, I was also exploring ways to improve performance and have tried an alternative version of prefetch for experimentation. The alternative design is that we assign each non-streaming transaction to a parallel apply worker, while strictly maintaining the order of commits. During parallel apply, if the transactions that need to be committed before the current transaction are not yet finished, the worker performs pre-fetch operations. Specifically, for updates and deletes, the worker finds and caches the target local tuple to be updated/deleted. Once all preceding transactions are committed, the parallel apply worker uses these cached tuples to execute the actual updates or deletes. What do you think about this alternative? I think the alternative might offer more stability in scenarios where shared buffer elimination occurs frequently, and it avoids leaving dead tuples in the buffer. However, it also presents some drawbacks, such as the need to add wait events to maintain commit order, compared to the approach discussed in this thread.
So, as far as I understand, your PoC is doing the same as approach 1 in my proposal - prefetch via the replica identity - but it is done not by parallel prefetch workers but by normal parallel apply workers while they have to wait until the previous transaction is committed. I consider it to be more complex, but it may be more efficient than my approach.
The obvious drawback of both your approach and mine is that they prefetch only pages of the primary index (replica identity). If there are other indexes whose keys are changed by an update, then the pages of such indexes will be read from disk when you apply the update. The same is also true for insert (in this case you always have to insert the new tuple into all indexes) - this is why I have also implemented another approach: apply the operation in a prefetch worker and then roll back the transaction.
Also, I do not quite understand how you handle invalidations. Assume that we have two transactions - T1 and T2:
T1: ... W1 Commit
T2: ... W1
So T1 writes tuple 1 and then commits the transaction. Then T2 updates tuple 1.
If I correctly understand your approach, the parallel apply worker for T2 will try to prefetch tuple 1 before T1 is committed.
But in this case it will get the old version of the tuple. It is not a problem if the parallel apply worker repeats the lookup rather than just using the cached tuple.
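For what it's worth, a hedged sketch of such a re-check (a hypothetical helper, not code from either patch): the worker first tries to lock the tuple it cached during prefetch, and if that version turns out to be stale it simply repeats the replica-identity lookup.

```c
#include "postgres.h"

#include "access/tableam.h"
#include "access/xact.h"
#include "executor/executor.h"
#include "executor/tuptable.h"
#include "nodes/lockoptions.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/*
 * Hypothetical re-validation of a tuple cached during prefetch: lock it by
 * its old TID; if that version is stale (e.g. TM_Updated), fall back to a
 * fresh lookup through the replica identity index.
 */
static bool
lock_cached_or_refetch(Relation rel, Oid idxoid,
					   TupleTableSlot *searchslot,
					   TupleTableSlot *cachedslot,
					   TupleTableSlot *outslot)
{
	TM_FailureData tmfd;
	TM_Result	res;

	res = table_tuple_lock(rel, &cachedslot->tts_tid,
						   GetLatestSnapshot(), outslot,
						   GetCurrentCommandId(false),
						   LockTupleExclusive, LockWaitBlock,
						   0 /* flags */ , &tmfd);

	if (res == TM_Ok)
		return true;			/* cached tuple is still the right version */

	/* Cached version is stale: repeat the index lookup. */
	return RelationFindReplTupleByIndex(rel, idxoid, LockTupleExclusive,
										searchslot, outslot);
}
```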
One more point. Since you assign each non-streaming transaction to a parallel apply worker, the number of such transactions is limited by the number of background workers. Usually it is not so large (~10). So if there were 100 parallel transactions at the publisher, at the subscriber you would still be able to execute concurrently no more than a few of them. In this sense my approach with separate prefetch workers is more flexible: each prefetch worker can prefetch as many operations as it can.
Attachments
On Sun, Jul 13, 2025 at 6:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > On 13/07/2025 1:28 pm, Amit Kapila wrote: > > On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > >> There is well known Postgres problem that logical replication subscriber > >> can not caught-up with publisher just because LR changes are applied by > >> single worker and at publisher changes are made by > >> multiple concurrent backends. > >> > > BTW, do you know how users deal with this lag? For example, one can > > imagine creating multiple pub-sub pairs for different sets of tables > > so that the workload on the subscriber could also be shared by > > multiple apply workers. I can also think of splitting the workload > > among multiple pub-sub pairs by using row filters > > > Yes, I saw that users starts several subscriptions/publications to > receive and apply changes in parallel. > But it can not be considered as universal solution: > 1. Not always there are multiple tables (or partitions of one one table) > so that it it possible to split them between multiple publications. > 2. It violates transactional behavior (consistency): if transactions > update several tables included in different publications then applying > this changes independently, we can observe at replica behaviour when one > table is update - and another - not. The same is true for row filters. > 3. Each walsender will have to scan WAL, so having N subscriptions we > have to read and decode WAL N times. > I agree that it is not a solution which can be applied in all cases and neither I want to say that we shouldn't pursue the idea of prefetch or parallel apply to improve the speed of apply. It was just to know/discuss how users try to workaround lag for cases where the lag is large. -- With Regards, Amit Kapila.
On Sun, Jul 13, 2025 at 5:59 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > > > Certainly originally intended use case was different: parallel apply is > performed only for large transactions. Number of of such transactions is > not so big and > so there should be enough parallel apply workers in pool to proceed > them. And if there are not enough workers, it is not a problem to spawn > new one and terminate > it after completion of transaction (because transaction is long, > overhead of spawning process is not so larger comparing with redo of > large transaction). > Right. > But if we want to efficiently replicate OLTP workload, then we > definitely need some other approach. > Agreed, for simplicity, for now we can have a GUC to decide the size of the pool. There is a note in the code for this as well, see: " XXX This worker pool threshold is arbitrary and we can provide a GUC variable for this in the future if required. I think we can think of some dynamic strategy where we remove from the pool if the workers are not in use for some threshold period of time or something on those lines. But at this stage it is better to use something simple and try to come up with a good way to perform pre-fetch or parallelization of short transactions. -- With Regards, Amit Kapila.
On Monday, July 14, 2025 2:36 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > On 14/07/2025 4:20 am, Zhijie Hou (Fujitsu) wrote: > > Additionally, I was also exploring ways to improve performance and have tried an > > alternative version of prefetch for experimentation. The alternative design is > > that we assigns each non-streaming transaction to a parallel apply worker, while > > strictly maintaining the order of commits. During parallel apply, if the > > transactions that need to be committed before the current transaction are not > > yet finished, the worker performs pre-fetch operations. Specifically, for > > updates and deletes, the worker finds and caches the target local tuple to be > > updated/deleted. Once all preceding transactions are committed, the parallel > > apply worker uses these cached tuples to execute the actual updates or deletes. > > What do you think about this alternative ? I think the alternative might offer > > more stability in scenarios where shared buffer elimination occurs frequently > > and avoids leaving dead tuples in the buffer. However, it also presents some > > drawbacks, such as the need to add wait events to maintain commit order, > > compared to the approach discussed in this thread. > > So as far as I understand your PoC is doing the same as approach 1 in my > proposal - prefetch of replica identity, but it is done not by parallel prefetch > workers, but normal parallel apply workers when they have to wait until previous > transaction is committed. I consider it to be more complex but may be more > efficient than my approach. > > The obvious drawback of both your's and my approaches is that it prefetch only > pages of primary index (replica identity). If there are some other indexes > which keys are changed by update, then pages of such indexes will be read from > the disk when you apply update. The same is also true for insert (in this case > you always has to include new tuple in all indexes) - this is why I have also > implemented another approach: apply operation in prefetch worker and then > rollback transaction. Thank you for your reply! I agree that indexes other than RI do not benefit from the pre-fetch. Regarding the apply operation and rollback approach, I have some concerns about the possible side effects, particularly the accumulation of dead tuples in the shared buffer. This is because all changes are performed by the pre-fetch worker at once before being aborted. I haven't delved deeply into this yet, but do you think this could potentially introduce additional overhead ? > > > Also I do not quite understand how you handle invalidations? During the pre-fetch phase of my patch, the execution of table_tuple_lock() is postponed until all preceding transactions have been finalized. If the cached tuple was modified by other transactions, table_tuple_lock() will return TM_Updated, signifying that the cached tuple is no longer valid. In these cases, the parallel apply worker will re-fetch the tuple. > Assume that we have two transactions - T1 and T2: > > T1: ... W1 Commit > T2: ... W1 > > So T1 writes tuple 1 and then commits transaction. Then T2 updates tuple 1. > If I correctly understand your approach, parallel apply worker for T2 will try > to prefetch tuple 1 before T1 is committed. > > But in this case it will get old version of the tuple. It is not a problem if > parallel apply worker will repeat lookup and not just use cached tuple. Yes, it is done like that. > > One more moment. 
> As far as you assign each non-streaming transaction to a parallel apply
> worker, the number of such transactions is limited by the number of
> background workers. Usually it is not so large (~10). So if there were 100
> parallel transactions at the publisher, then at the subscriber you will
> still be able to execute concurrently no more than a few of them. In this
> sense my approach with separate prefetch workers is more flexible: each
> prefetch worker can prefetch as many operations as it can.

Yes, that's true. I have been analyzing some performance issues in logical
replication, specifically under scenarios where both the publisher and
subscriber are subjected to high workloads. In these situations, the shared
buffer is frequently updated, which prompted me to consider the alternative
approach I mentioned. I plan to perform additional tests and analysis on
these approaches, thanks!

Best Regards,
Hou zj
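To make the cache-then-revalidate flow described above more concrete, here is a rough hypothetical sketch (not code from either patch): the pre-fetch phase remembers which tuple was found, and only when the worker's turn in commit order arrives is the tuple actually locked via table_tuple_lock(); a TM_Updated/TM_Deleted result means the cached tuple is stale and must be re-fetched. CachedChange, refetch_target_tuple() and apply_update() are placeholder names, and details such as snapshot and command id handling are simplified.

```c
#include "postgres.h"

#include "access/tableam.h"
#include "access/xact.h"
#include "executor/tuptable.h"
#include "nodes/lockoptions.h"
#include "storage/itemptr.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/* Hypothetical record produced by the pre-fetch phase for one UPDATE. */
typedef struct CachedChange
{
	Relation	rel;			/* target table */
	ItemPointerData tid;		/* tuple located during pre-fetch */
	TupleTableSlot *newtuple;	/* decoded new tuple to apply */
} CachedChange;

extern bool refetch_target_tuple(CachedChange *change);	/* placeholder */
extern void apply_update(CachedChange *change);				/* placeholder */

/* Apply one cached UPDATE after all preceding transactions have committed. */
static void
apply_cached_update(CachedChange *change)
{
	TupleTableSlot *oldslot = table_slot_create(change->rel, NULL);
	TM_FailureData tmfd;
	TM_Result	res;

	/*
	 * The tuple lock (and hence the visibility check) is postponed until
	 * now.  If the cached tuple was modified by another transaction in the
	 * meantime, table_tuple_lock() reports that it is no longer current.
	 */
	res = table_tuple_lock(change->rel, &change->tid, GetLatestSnapshot(),
						   oldslot, GetCurrentCommandId(false),
						   LockTupleExclusive, LockWaitBlock,
						   TUPLE_LOCK_FLAG_FIND_LAST_VERSION, &tmfd);

	if (res == TM_Ok)
		apply_update(change);	/* cached tuple is still the right one */
	else if (res == TM_Updated || res == TM_Deleted)
	{
		/* cached tuple is stale: redo the lookup, then apply */
		if (refetch_target_tuple(change))
			apply_update(change);
	}

	ExecDropSingleTupleTableSlot(oldslot);
}
```

The important property is that a stale cache entry costs only an extra lookup; correctness still comes from taking the tuple lock in commit order.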
On Mon, Jul 14, 2025 at 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jul 13, 2025 at 6:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> >
> > On 13/07/2025 1:28 pm, Amit Kapila wrote:
> > > On Tue, Jul 8, 2025 at 12:06 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
> > >> There is well known Postgres problem that logical replication
> > >> subscriber can not caught-up with publisher just because LR changes
> > >> are applied by single worker and at publisher changes are made by
> > >> multiple concurrent backends.
> > >>
> > > BTW, do you know how users deal with this lag? For example, one can
> > > imagine creating multiple pub-sub pairs for different sets of tables
> > > so that the workload on the subscriber could also be shared by
> > > multiple apply workers. I can also think of splitting the workload
> > > among multiple pub-sub pairs by using row filters.
> >
> > Yes, I saw that users start several subscriptions/publications to
> > receive and apply changes in parallel. But it can not be considered a
> > universal solution:
> > 1. There are not always multiple tables (or partitions of one table), so
> > it is not always possible to split them between multiple publications.
> > 2. It violates transactional behavior (consistency): if transactions
> > update several tables included in different publications, then applying
> > these changes independently we can observe at the replica a state where
> > one table is updated and another is not. The same is true for row
> > filters.
> > 3. Each walsender will have to scan WAL, so having N subscriptions we
> > have to read and decode WAL N times.
>
> I agree that it is not a solution which can be applied in all cases,
> and neither do I want to say that we shouldn't pursue the idea of
> prefetch or parallel apply to improve the speed of apply. It was just
> to know/discuss how users try to work around the lag for cases where
> the lag is large.

If you are interested, I would like to know your opinion on a somewhat
related topic, which has triggered my interest in your patch. We are working
on an update_delete conflict detection patch. The exact problem was
explained in the initial email [1]. The basic idea to resolve the problem is
that on the subscriber we maintain a slot that helps in retaining dead
tuples for a certain period of time, till the concurrent transactions have
been applied to the subscriber. You can read the commit message of the first
patch in email [2]. Now, the problem we are facing is that because of
replication lag in a scenario similar to what we are discussing here, such
as when there are many clients on the publisher and a single apply worker on
the subscriber, the slot takes more time to get advanced. This leads to
retention of dead tuples, which further slows down the apply worker,
especially for update workloads. Apart from apply, the other transactions
running on the system (say a pgbench-kind of workload on the subscriber)
also become slower because of the retention of dead tuples.

Now, for the workloads where the lag is not there, like when one splits the
workload with the options mentioned above (split the workload among pub-sub
pairs in some way) or the workload doesn't consist of a large number of
clients operating on the publisher and subscriber at the same time, etc., we
don't observe any major slowdown on the subscriber.
We would like to solicit your opinion, as you seem to have some experience
with LR users: whether one can use this feature where required by enabling
it at the subscription level. Users will have the facility to disable it if
they face any performance regression or additional bloat. Now, after having
that feature, we can work on additional features such as prefetch or
parallel apply that will reduce the chances of lag, making the feature more
broadly used. Does that sound reasonable to you? Feel free to ignore giving
your opinion if you are not interested in that work.

[1] - https://www.postgresql.org/message-id/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/message-id/OS0PR01MB5716ECC539008C85E7AB65C5944FA%40OS0PR01MB5716.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
On 15/07/2025 2:31 PM, Amit Kapila wrote:
> If you are interested, I would like to know your opinion on a somewhat
> related topic, which has triggered my interest in your patch. We are
> working on an update_delete conflict detection patch. The exact problem
> was explained in the initial email [1]. The basic idea to resolve the
> problem is that on the subscriber we maintain a slot that helps in
> retaining dead tuples for a certain period of time, till the concurrent
> transactions have been applied to the subscriber. You can read the commit
> message of the first patch in email [2]. Now, the problem we are facing is
> that because of replication lag in a scenario similar to what we are
> discussing here, such as when there are many clients on the publisher and
> a single apply worker on the subscriber, the slot takes more time to get
> advanced. This leads to retention of dead tuples, which further slows down
> the apply worker, especially for update workloads. Apart from apply, the
> other transactions running on the system (say a pgbench-kind of workload
> on the subscriber) also become slower because of the retention of dead
> tuples.
>
> Now, for the workloads where the lag is not there, like when one splits
> the workload with the options mentioned above (split the workload among
> pub-sub pairs in some way) or the workload doesn't consist of a large
> number of clients operating on the publisher and subscriber at the same
> time, etc., we don't observe any major slowdown on the subscriber.
>
> We would like to solicit your opinion, as you seem to have some experience
> with LR users: whether one can use this feature where required by enabling
> it at the subscription level. Users will have the facility to disable it
> if they face any performance regression or additional bloat. Now, after
> having that feature, we can work on additional features such as prefetch
> or parallel apply that will reduce the chances of lag, making the feature
> more broadly used. Does that sound reasonable to you? Feel free to ignore
> giving your opinion if you are not interested in that work.
>
> [1] - https://www.postgresql.org/message-id/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> [2] - https://www.postgresql.org/message-id/OS0PR01MB5716ECC539008C85E7AB65C5944FA%40OS0PR01MB5716.jpnprd01.prod.outlook.com

I am very sorry for the delay with my answer - it was a very busy week.

I hope that I understand the problem and the proposed approach to solve it
(it actually seems to be quite straightforward and similar to
`hot_standby_feedback`), and it definitely suffers from the same problem: a
bloated database because of lagging slots. But it is really hard to propose
some other solution (other than a backward scan of WAL, which seems to be
completely unacceptable).

Concerning user experience... First of all, a disclaimer: I am first of all
a programmer and not a DBA. Yes, I have investigated many support cases, but
it is still hard to claim that I have the full picture.

There is even no consensus concerning `hot_standby_feedback`! It is still
disabled by default in Postgres and in Neon. It makes sense for vanilla
Postgres, where replicas are first of all used for HA and only secondarily
for load balancing of read-only queries. But in Neon HA is provided in a
different way, and the only point of creating RO replicas is load balancing
(mostly for OLAP queries).
Executing heavy OLAP queries without `hot_standby_feedback` is a kind of
"Russian roulette", because the probability of a conflict with recovery is
very high. But we still use the Postgres default.

The situation with feedback slots may be different: as far as I understand,
it is mostly needed for bidirectional replication and automatic conflict
resolution. So it is assumed to be part of some distributed system (like
BDR), rather than a feature used directly by Postgres users.

I am still a little bit depressed by the complexity of LR and all related
aspects. But it is unlikely that something more elegant and simpler can be
invented :)
Here is a completely rewritten version of the prefetch patch. Now prefetch
workers do not try to apply the LR operation and then roll back the
transaction. They just perform an index lookup and so prefetch the index and
the referenced heap pages. So no hacks are needed to prevent lock conflicts
or WAL logging.

Performance results are the following (the test scenario was explained in a
previous message, as well as the shell scripts used):

update:
prefetch (2): 5:20
prefetch (4): 3:20
prefetch (8): 2:05
no prefetch: 8:30

insert:
pk (4) prefetch: 9:55
pk+sk (4) prefetch: 5:20
pk+sk (8) prefetch: 3:08
no prefetch: 9:20

The number in parentheses specifies the number of prefetch workers. For
example, to spawn 8 prefetch workers I used the following settings in
postgresql.conf.replica:

```
prefetch_replica_identity_only=off
max_worker_processes=16
max_logical_replication_workers=16
max_parallel_apply_workers_per_subscription=8
max_parallel_prefetch_workers_per_subscription=8
port=54322
```

I also ran a continuous test with a long (3 hours) update workload on the
publisher with logical replication to the subscriber. With 8 prefetch
workers the replica is able to catch up with the primary where 10 backends
are performing updates! After the end of these updates the replica was in
the same state as the primary, while without prefetch it had processed only
half of the generated WAL and needed another 5:30 hours to catch up.
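For illustration only, here is a rough sketch of what such a prefetch worker could do for an UPDATE or DELETE: run an equality scan on the replica identity index and throw the result away, so that the index pages and the referenced heap page end up in shared buffers before the main apply worker needs them. This is hypothetical code, not the attached patch: prefetch_replica_identity() is an invented name, a single-column replica identity is assumed, and core function signatures (e.g. index_beginscan()) differ between Postgres versions.

```c
#include "postgres.h"

#include "access/genam.h"
#include "access/sdir.h"
#include "access/skey.h"
#include "access/stratnum.h"
#include "access/tableam.h"
#include "executor/tuptable.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/*
 * Warm the cache for one incoming change: look up "keyval" (the replica
 * identity key decoded from the logical change) through the already opened
 * replica identity index "ridx" of table "rel".  "eqproc" is the OID of the
 * equality function for the key column.
 */
static void
prefetch_replica_identity(Relation rel, Relation ridx, Datum keyval, Oid eqproc)
{
	ScanKeyData key;
	IndexScanDesc scan;
	TupleTableSlot *slot;
	Snapshot	snap = GetLatestSnapshot();

	/* equality condition on the single replica identity column */
	ScanKeyInit(&key, 1, BTEqualStrategyNumber, eqproc, keyval);

	slot = table_slot_create(rel, NULL);
	scan = index_beginscan(rel, ridx, snap, 1, 0);
	index_rescan(scan, &key, 1, NULL, 0);

	/*
	 * The fetched tuple itself is discarded: the only goal is the side
	 * effect of reading the index and heap pages into shared buffers so
	 * that the main apply worker finds them already cached.
	 */
	(void) index_getnext_slot(scan, ForwardScanDirection, slot);

	index_endscan(scan);
	ExecDropSingleTupleTableSlot(slot);
}
```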