Thread: Sending unflushed WAL in physical replication
Hi,
Please find attached a POC patch that introduces changes to the WAL sender and
receiver, allowing WAL records to be sent to standbys before they are flushed
to disk on the primary during physical replication. This is intended to improve
replication latency by reducing the amount of WAL read from disk.
For large transactions, this approach ensures that the bulk of the transaction’s
WAL records are already sent to the standby before the flush occurs on the primary.
As a result, the flush on the primary and standby happen closer together,
reducing replication lag.
Observations from the benchmark:
1. The patch improves TPS by ~13% in the sync replication setup. In repeated runs,
I see that the TPS increase is anywhere between 5% and 13%.
2. The WAL sender reads significantly less WAL from disk, indicating more efficient use
of WAL buffers and reduced disk I/O.
Following are some of the details of the implementation:
1. The primary does not wait for a flush before starting to send data, so it is likely to
send smaller chunks of data. To prevent network overload, changes are made to
avoid sending excessively small packets.
2. The sender includes the current flush pointer in the replication protocol
messages, so the standby knows up to which point WAL has been safely flushed
on the primary.
3. The logic ensures that standbys do not apply transactions that have not
been flushed on the primary, by updating the flushedUpto position on the standby
only up to the flushPtr received from the primary.
4. WAL records received from the primary are written and can be flushed to disk on the
standby, but are only marked as flushed up to the flushPtr reported by the primary.
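To make points 3 and 4 above concrete, here is a minimal, self-contained sketch of the receiver-side rule. This is not code from the patch; the struct and function names are invented for illustration only:

#include <stdint.h>

typedef uint64_t XLogRecPtr;         /* byte position in the WAL stream */

/* Hypothetical receiver-side state, names invented for this sketch. */
typedef struct WalRcvSketchState
{
    XLogRecPtr  writtenUpto;         /* WAL written to disk on the standby */
    XLogRecPtr  primaryFlushPtr;     /* latest flushPtr reported by the sender */
    XLogRecPtr  flushedUpto;         /* position exposed to recovery/apply */
} WalRcvSketchState;

/*
 * Called after new WAL has been written locally and/or a newer flushPtr
 * has arrived from the primary.  Recovery is only allowed to apply up to
 * flushedUpto, so records that are still unflushed on the primary are
 * never replayed on the standby, even if they are already on its disk.
 */
static void
AdvanceFlushedUpto(WalRcvSketchState *state)
{
    XLogRecPtr  candidate = state->writtenUpto;

    if (candidate > state->primaryFlushPtr)
        candidate = state->primaryFlushPtr;

    if (candidate > state->flushedUpto)
        state->flushedUpto = candidate;
}

Under this rule the standby can still write, and even flush, everything it has received, but nothing beyond the primary's flushPtr becomes visible to replay.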
Benchmark details are as follows:
Synchronous replication with remote write enabled.
Two Azure VMs: Central India (primary), Central US (standby).
OS: Ubuntu 24.04, VM size D4s (4 vCPUs, 16 GiB RAM).
With patch:
TPS: 115
WAL read from disk by WAL sender: ~40 MB (read bytes from pg_stat_io)
WAL generated during the test: 772705760 bytes

Without the patch:
TPS: 102
WAL read from disk by WAL sender: ~79 MB (read bytes from pg_stat_io)
WAL generated during the test: 760060792 bytes
Commit hash: b1187266e0
pgbench -c 32 -j 4 postgres -T 300 -f wal_test.sql
wal_test.sql (each transaction generates ~36KB of WAL):
\set delta random(1, 500)
BEGIN;
INSERT INTO wal_bloat_:delta (data)
SELECT repeat('x', 8000)
FROM generate_series(1, 80);
END;
TODO:
1. Ensure there is a robust mechanism on the receiver to prevent WAL records
that are not flushed on the primary from being applied on the standby, under any
circumstances.
2. When smaller chunks of WAL are received on the standby, they can lead to more
frequent disk write operations. Employing WAL buffers on the standby could be a
more effective way to mitigate this; evaluate the performance impact of using
WAL buffers on the standby.
A similar idea was proposed here:
Proposal: Allow walsenders to send WAL directly from wal_buffers to replicas
This idea was also discussed here recently:
https://www.postgresql.org/message-id/fa2e932eeff472250e2dbacb49d8c43ad282fea9.camel%40j-davis.com
Kindly let me know your thoughts.
Thank you,
Rahila Syed
Hi Rahila,
On Thu, Sep 25, 2025 at 12:02 PM Rahila Syed <rahilasyed90@gmail.com> wrote:
> Hi,
>
> Please find attached a POC patch that introduces changes to the WAL sender and
> receiver, allowing WAL records to be sent to standbys before they are flushed
> to disk on the primary during physical replication. This is intended to improve
> replication latency by reducing the amount of WAL read from disk.
>
> For large transactions, this approach ensures that the bulk of the transaction’s
> WAL records are already sent to the standby before the flush occurs on the primary.
> As a result, the flush on the primary and standby happen closer together,
> reducing replication lag.
At a high level, the idea LGTM.
> Observations from the benchmark:
> 1. The patch improves TPS by ~13% in the sync replication setup. In repeated runs,
> I see that the TPS increase is anywhere between 5% and 13%.
> 2. The WAL sender reads significantly less WAL from disk, indicating more efficient use
> of WAL buffers and reduced disk I/O.
Can you please measure the transaction commit latency improvement as well?
Commit latency = Primary_Disk_Flush_time + Standby_disk_flush_time + network_roundtrip_time
> Following are some of the details of the implementation:
> 1. The primary does not wait for a flush before starting to send data, so it is likely to
> send smaller chunks of data. To prevent network overload, changes are made to
> avoid sending excessively small packets.
> 2. The sender includes the current flush pointer in the replication protocol
> messages, so the standby knows up to which point WAL has been safely flushed
> on the primary.
> 3. The logic ensures that standbys do not apply transactions that have not
> been flushed on the primary, by updating the flushedUpto position on the standby
> only up to the flushPtr received from the primary.
> 4. WAL records received from the primary are written and can be flushed to disk on the
> standby, but are only marked as flushed up to the flushPtr reported by the primary.
What happens in crash recovery scenarios? For example, when a standby crash-restarts,
it replays until the end of WAL. In this case, it may end up replaying WAL that was
never flushed on the primary (if the primary does a crash recovery).
Shouldn't the archiver on the standby hold off on archiving WAL until it has been flushed on the primary?
The same applies to pg_receivewal.
Thanks,
Satya
> On 26 Sep 2025, at 00:02, Rahila Syed <rahilasyed90@gmail.com> wrote:
>
> Kindly let me know your thoughts.

What about crash recovery? Unflushed WAL might get overwritten after crash recovery. The primary must switch to a new timeline to prevent problems related to this situation.

Best regards, Andrey Borodin.
Hi Rahila,

> Please find attached a POC patch that introduces changes to the WAL sender and
> receiver, allowing WAL records to be sent to standbys before they are flushed
> to disk on the primary during physical replication. [..]

I didn't look at the code but your description of the design sounds OK.

I wanted to clarify: what happens if master doesn't increase flushPtr and replica runs out of memory for WAL records?

> Benchmark details are as follows:
> Synchronous replication with remote write enabled.
> Two Azure VMs: Central India (primary), Central US (standby).
> [...]

I'm curious what happens:
1. When master and replica are located in the same datacenter.
2. What happens for small transactions?

--
Best regards,
Aleksander Alekseev
Hi,
> At a high level, the idea LGTM.
Thank you for looking into it.
> Observations from the benchmark:
> 1. The patch improves TPS by ~13% in the sync replication setup. In repeated runs,
> I see that the TPS increase is anywhere between 5% and 13%.
> 2. The WAL sender reads significantly less WAL from disk, indicating more efficient use
> of WAL buffers and reduced disk I/O.
>
> Can you please measure the transaction commit latency improvement as well?
> Commit latency = Primary_Disk_Flush_time + Standby_disk_flush_time + network_roundtrip_time
The pgbench average latency should capture this, since it measures the time from
the start to the end of a transaction. In synchronous replication, each transaction waits
for write confirmation from the standby before committing, and that additional wait time is
included in the latency measurement. I will post that with the next benchmark results.
> What happens in crash recovery scenarios? For example, when a standby crash-restarts,
> it replays until the end of WAL. In this case, it may end up replaying WAL that was
> never flushed on the primary (if the primary does a crash recovery).
> Shouldn't the archiver on the standby hold off on archiving WAL until it has been flushed on the primary?
> The same applies to pg_receivewal.
The current solution isn’t sufficient for situations where we rely solely on the WAL files to identify
what needs to be replayed. In these cases, we need to either write the unflushed WAL data to a buffer and
then to temporary files until the primary flush occurs, or store the flush pointer so that the recovery process
knows up to which point it should replay the WAL.
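As a rough illustration of the second option, the receiver could persist the last flushPtr it received so that a later standalone crash recovery knows where to stop. The file name, format, and function names below are made up for this sketch, and a real version would need fsync and an atomic rename:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* Persist the last flushPtr received from the primary (sketch only). */
static bool
SaveSafeReplayEnd(XLogRecPtr flushPtr)
{
    FILE *f = fopen("safe_replay_end", "wb");

    if (f == NULL)
        return false;
    if (fwrite(&flushPtr, sizeof(flushPtr), 1, f) != 1)
    {
        fclose(f);
        return false;
    }
    return fclose(f) == 0;
}

/*
 * During standalone crash recovery on the standby: stop replaying once
 * the next record would end beyond the saved pointer.
 */
static bool
ShouldStopReplay(XLogRecPtr recordEndLsn, XLogRecPtr safeReplayEnd)
{
    return recordEndLsn > safeReplayEnd;
}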
As mentioned in the TODO section of my previous email, I am currently working on a more robust method to
manage unflushed WAL on the receiver. The goal is to ensure this does not disrupt recovery or affect tools that
expect the WAL files on standby to only contain WAL records that have already been flushed on the primary.
Thank you,
Rahila Syed
Hi,
> Please find attached a POC patch that introduces changes to the WAL sender and
> receiver, allowing WAL records to be sent to standbys before they are flushed
> to disk on the primary during physical replication. [..]
> I didn't look at the code but your description of the design sounds OK.
Thanks for looking into it.
> I wanted to clarify: what happens if master doesn't increase flushPtr
> and replica runs out of memory for WAL records?
This is a great question. I'm currently working on implementing a solution for this.
One possible solution is to write the records to a spill file when the flush pointer
indicates that none have been flushed on the primary. Once they have been flushed
on the primary, the records can then be copied from the spill file to the WAL segments.
While this method may lead to increased I/O, if such spills are infrequent, the overall
performance impact should be minimal.
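To sketch what that receive path could look like (all names below are invented, and draining the spilled bytes back into the WAL segments once the flush pointer advances is omitted here):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical in-memory stand-in for a spill file. */
typedef struct SpillBuffer
{
    XLogRecPtr  startLsn;            /* LSN of the first spilled byte */
    size_t      used;
    char        data[1024 * 1024];
} SpillBuffer;

/*
 * Split an incoming chunk starting at 'lsn': bytes below the primary's
 * flushPtr go straight to the WAL segments, the remainder is spilled
 * until a later message moves flushPtr forward.  Handling a full spill
 * buffer and draining it later are left out of this sketch.
 */
static void
ReceiveChunk(SpillBuffer *spill, XLogRecPtr lsn, const char *buf, size_t len,
             XLogRecPtr primaryFlushPtr,
             void (*write_to_segments)(XLogRecPtr, const char *, size_t))
{
    size_t  durable = 0;

    if (primaryFlushPtr > lsn)
        durable = (size_t) (primaryFlushPtr - lsn);
    if (durable > len)
        durable = len;

    if (durable > 0)
        write_to_segments(lsn, buf, durable);

    if (durable < len && spill->used + (len - durable) <= sizeof(spill->data))
    {
        if (spill->used == 0)
            spill->startLsn = lsn + durable;
        memcpy(spill->data + spill->used, buf + durable, len - durable);
        spill->used += len - durable;
    }
}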
Another option would be to notify the sender that there is no more space available
and to pause sending additional data until records are flushed on the sender side.
However, this approach could reintroduce some of the replication lag or network
latency that we are aiming to minimize.
Kindly let me know your views.
Thank you,
Rahila Syed
Hi,

> This is a great question. I'm currently working on implementing a solution for this.
> One possible solution is to write the records to a spill file when the flush pointer
> indicates that none have been flushed on the primary. Once they have been flushed
> on the primary, the records can then be copied from the spill file to the WAL segments.
> While this method may lead to increased I/O, if such spills are infrequent, the overall
> performance impact should be minimal.
>
> Another option would be to notify the sender that there is no more space available
> and to pause sending additional data until records are flushed on the sender side.
> However, this approach could reintroduce some of the replication lag or network
> latency that we are aiming to minimize.
>
> Kindly let me know your views.

Neither option strikes me as a great design choice. A proper solution IMO would be to send and record flushPtr as a usual WAL record (a new resource manager, perhaps). When a replica is promoted to master or restarted, it should truncate WAL according to the latest recorded flushPtr. The only thing we will have to worry about is to make sure the latest recorded flushPtr is never truncated, including during regular recycling of WAL segments. Everything else will work as it is now, including cascaded replication, for instance.

--
Best regards,
Aleksander Alekseev
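P.S. Just to sketch the shape of what I mean; the struct and function below are invented for illustration and are not an actual resource manager definition:

#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Hypothetical payload of the proposed record: the primary would
 * periodically log how far its own WAL is known to be flushed.
 */
typedef struct xl_primary_flush_ptr
{
    XLogRecPtr  flushPtr;
} xl_primary_flush_ptr;

/*
 * On promotion or a standalone restart, the replica truncates any local
 * WAL beyond the newest flushPtr it has replayed before it starts
 * writing on a new timeline.  truncate_wal_after() stands in for the
 * real work; care is needed so the record itself is never recycled.
 */
static void
TruncateToLastPrimaryFlush(XLogRecPtr lastReplayedFlushPtr,
                           XLogRecPtr localEndOfWal,
                           void (*truncate_wal_after)(XLogRecPtr))
{
    if (lastReplayedFlushPtr < localEndOfWal)
        truncate_wal_after(lastReplayedFlushPtr);
}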