Thread: Re: Conflict detection for update_deleted in logical replication

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Hi hackers,
>
> I am starting a new thread to discuss and propose the conflict detection for
> update_deleted scenarios during logical replication. This conflict occurs when
> the apply worker cannot find the target tuple to be updated, as the tuple might
> have been removed by another origin.
>
> ---
> BACKGROUND
> ---
>
> Currently, when the apply worker cannot find the target tuple during an update,
> an update_missing conflict is logged. However, to facilitate future automatic
> conflict resolution, it has been agreed[1][2] that we need to detect both
> update_missing and update_deleted conflicts. Specifically, we will detect an
> update_deleted conflict if any dead tuple matching the old key value of the
> update operation is found; otherwise, it will be classified as update_missing.
>
> Detecting both update_deleted and update_missing conflicts is important for
> achieving eventual consistency in a bidirectional cluster, because the
> resolution for each conflict type can differ. For example, for an
> update_missing conflict, a feasible solution might be converting the update to
> an insert and applying it, while for an update_deleted conflict the preferred
> approach could be to skip the update, or to compare the timestamp of the delete
> transaction with the remote update transaction's and choose the most recent
> one. For additional context, please refer to [3], which gives examples of
> how these differences could lead to data divergence.
>
> ---
> ISSUES and SOLUTION
> ---
>
> To detect update_deleted conflicts, we need to search for dead tuples in the
> table. However, dead tuples can be removed by VACUUM at any time. Therefore, to
> ensure consistent and accurate conflict detection, tuples deleted by other
> origins must not be removed by VACUUM before the conflict detection process. If
> the tuples are removed prematurely, it might lead to incorrect conflict
> identification and resolution, causing data divergence between nodes.
>
> Here is an example of how VACUUM could affect conflict detection and how to
> prevent this issue. Assume we have a bidirectional cluster with two nodes, A
> and B.
>
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1);
>   T2: DELETE FROM t WHERE id = 1;
>
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;
>
> To retain the deleted tuples, the initial idea was that once transaction T2 had
> been applied to both nodes, there was no longer a need to preserve the dead
> tuple on Node A. However, a scenario arises where transactions T3 and T2 occur
> concurrently, with T3 committing slightly earlier than T2. In this case, if
> Node B applies T2 and Node A removes the dead tuple (1,1) via VACUUM, and then
> Node A applies T3 after the VACUUM operation, it can only result in an
> update_missing conflict. Given that the default resolution for update_missing
> conflicts is apply_or_skip (e.g. convert update to insert if possible and apply
> the insert), Node A will eventually hold a row (1,2) while Node B becomes
> empty, causing data inconsistency.
>
> Therefore, the strategy needs to be expanded as follows: Node A cannot remove
> the dead tuple until:
> (a) The DELETE operation is replayed on all remote nodes, *AND*
> (b) The transactions on logical standbys occurring before the replay of Node
> A's DELETE are replayed on Node A as well.
>
> ---
> THE DESIGN
> ---
>
> To achieve the above, we plan to allow the logical walsender to maintain and
> advance the slot.xmin to protect the data in the user table and introduce a new
> logical standby feedback message. This message reports a WAL position such that
> the position has been replayed on the logical standby *AND* the changes that
> occurred on the logical standby before that position have also been replayed to
> the walsender's node (where the walsender is running). After receiving the new
> feedback message, the walsender will advance the slot.xmin based on the flush
> info, similar to the advancement of catalog_xmin. Currently, the
> effective_xmin/xmin of a logical slot are unused during logical replication, so I
> think it's safe and won't cause side effects to reuse the xmin for this feature.
>
> We have introduced a new subscription option (feedback_slots='slot1,...'),
> where these slots will be used to check condition (b): the transactions on
> logical standbys occurring before the replay of Node A's DELETE are replayed on
> Node A as well. Therefore, on Node B, users should specify the slots
> corresponding to Node A in this option. The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making this automatic, e.g.
> letting the apply worker select the slots acquired by the walsenders that connect
> to the same remote server (e.g. if the apply worker's connection info or some
> other flag is the same as the walsender's connection info). But it seems tricky
> because if some slots are inactive, which means the walsenders are not there, the
> apply worker could not find the correct slots to check unless we save the host
> along with the slot's persistence data.
>
> The new feedback message is sent only if feedback_slots is not NULL. If the
> slots in feedback_slots are removed, a final message containing
> InvalidXLogRecPtr will be sent to inform the walsender to forget about the
> slot.xmin.
>
> To detect update_deleted conflicts during update operations, if the target row
> cannot be found, we perform an additional scan of the table using snapshotAny.
> This scan aims to locate the most recently deleted row that matches the old
> column values from the remote update operation and has not yet been removed by
> VACUUM. If any such tuples are found, we report the update_deleted conflict
> along with the origin and transaction information that deleted the tuple.
>
> Please refer to the attached POC patch set which implements the above design.
> The patch set is split into several parts to make the initial review easier.
> Please note that the patches are interdependent and cannot work independently.
>
> Thanks a lot to Kuroda-San and Amit for the off-list discussion.
>
> Suggestions and comments are highly appreciated !
>

Thank you, Hou-San, for explaining the design. But to make it easier to
understand, would you be able to explain the sequence/timeline of the
*new* actions performed by the walsender and the apply processes for
the given example, along with the new feedback_slot configuration needed?

Node A: (Procs: walsenderA, applyA)
  T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM

Node B: (Procs: walsenderB, applyB)
  T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 10, 2024 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
> > ---
> > THE DESIGN
> > ---
> >
> > To achieve the above, we plan to allow the logical walsender to
> > maintain and advance the slot.xmin to protect the data in the user
> > table and introduce a new logical standby feedback message. This
> > message reports the WAL position that has been replayed on the logical
> > standby *AND* the changes occurring on the logical standby before the
> > WAL position are also replayed to the walsender's node (where the
> > walsender is running). After receiving the new feedback message, the
> > walsender will advance the slot.xmin based on the flush info, similar
> > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > of logical slot are unused during logical replication, so I think it's safe and
> won't cause side-effect to reuse the xmin for this feature.
> >
> > We have introduced a new subscription option
> > (feedback_slots='slot1,...'), where these slots will be used to check
> > condition (b): the transactions on logical standbys occurring before
> > the replay of Node A's DELETE are replayed on Node A as well.
> > Therefore, on Node B, users should specify the slots corresponding to
> > Node A in this option. The apply worker will get the oldest confirmed
> > flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making it an automaic way, e.g.
> > let apply worker select the slots that acquired by the walsenders
> > which connect to the same remote server(e.g. if apply worker's
> > connection info or some other flags is same as the walsender's
> > connection info). But it seems tricky because if some slots are
> > inactive which means the walsenders are not there, the apply worker
> > could not find the correct slots to check unless we save the host along with
> the slot's persistence data.
> >
> > The new feedback message is sent only if feedback_slots is not NULL.
> > If the slots in feedback_slots are removed, a final message containing
> > InvalidXLogRecPtr will be sent to inform the walsender to forget about
> > the slot.xmin.
> >
> > To detect update_deleted conflicts during update operations, if the
> > target row cannot be found, we perform an additional scan of the table using
> snapshotAny.
> > This scan aims to locate the most recently deleted row that matches
> > the old column values from the remote update operation and has not yet
> > been removed by VACUUM. If any such tuples are found, we report the
> > update_deleted conflict along with the origin and transaction information
> that deleted the tuple.
> >
> > Please refer to the attached POC patch set which implements above
> > design. The patch set is split into some parts to make it easier for the initial
> review.
> > Please note that each patch is interdependent and cannot work
> independently.
> >
> > Thanks a lot to Kuroda-San and Amit for the off-list discussion.
> >
> > Suggestions and comments are highly appreciated !
> >
> 
> Thank You Hou-San for explaining the design. But to make it easier to
> understand, would you be able to explain the sequence/timeline of the
> *new* actions performed by the walsender and the apply processes for the
> given example along with new feedback_slot config needed
> 
> Node A: (Procs: walsenderA, applyA)
>   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
>   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> 
> Node B: (Procs: walsenderB, applyB)
>   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM

Thanks for reviewing! Let me elaborate further on the example:

On Node A, feedback_slots should include the logical slot that is used to replicate
changes from Node A to Node B. On Node B, feedback_slots should include the logical
slot that replicates changes from Node B to Node A.

Assume the slot.xmin on Node A has been initialized to a valid number (740) before
the following flow:

Node A executed T1                                                        - 10.00 AM
T1 replicated and applied on Node B                                       - 10.0001 AM
Node B executed T3                                                        - 10.01 AM
Node A executed T2 (741)                                                  - 10.02 AM
T2 replicated and applied on Node B (delete_missing)                      - 10.03 AM
T3 replicated and applied on Node A (new action, detect update_deleted)   - 10.04 AM

(new action) Apply worker on Node B has confirmed that T2 has been applied
locally and that the transactions before T2 (e.g., T3) have been replicated and
applied to Node A (e.g. feedback_slot.confirmed_flush_lsn >= LSN of the locally
replayed T2), and thus sends the new feedback message to Node A.          - 10.05 AM

(new action) Walsender on Node A receives the message and advances the
slot.xmin.                                                                - 10.06 AM

Then, after the slot.xmin is advanced to a number greater than 741, the VACUUM would be able to
remove the dead tuple on Node A.
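
As a rough sketch of the state involved (the slot names below are made up for
this example; the pg_replication_slots columns shown already exist, while
maintaining xmin for a logical slot is what this patch set would add):

  -- On Node A: the slot replicating changes from A to B; its xmin is what the
  -- new feedback message from Node B's apply worker would advance.
  SELECT slot_name, slot_type, xmin, confirmed_flush_lsn
  FROM pg_replication_slots
  WHERE slot_name = 'slot_a_to_b';

  -- On Node B: the apply worker would read confirmed_flush_lsn of the feedback
  -- slot (the slot replicating B to A) before sending the feedback message.
  SELECT slot_name, confirmed_flush_lsn
  FROM pg_replication_slots
  WHERE slot_name = 'slot_b_to_a';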

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 10, 2024 2:45 PM shveta malik <shveta.malik@gmail.com> wrote:
> > > ---
> > > THE DESIGN
> > > ---
> > >
> > > To achieve the above, we plan to allow the logical walsender to
> > > maintain and advance the slot.xmin to protect the data in the user
> > > table and introduce a new logical standby feedback message. This
> > > message reports the WAL position that has been replayed on the logical
> > > standby *AND* the changes occurring on the logical standby before the
> > > WAL position are also replayed to the walsender's node (where the
> > > walsender is running). After receiving the new feedback message, the
> > > walsender will advance the slot.xmin based on the flush info, similar
> > > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > > of logical slot are unused during logical replication, so I think it's safe and
> > won't cause side-effect to reuse the xmin for this feature.
> > >
> > > We have introduced a new subscription option
> > > (feedback_slots='slot1,...'), where these slots will be used to check
> > > condition (b): the transactions on logical standbys occurring before
> > > the replay of Node A's DELETE are replayed on Node A as well.
> > > Therefore, on Node B, users should specify the slots corresponding to
> > > Node A in this option. The apply worker will get the oldest confirmed
> > > flush LSN among the specified slots and send the LSN as a feedback
> > message to the walsender. -- I also thought of making it an automaic way, e.g.
> > > let apply worker select the slots that acquired by the walsenders
> > > which connect to the same remote server(e.g. if apply worker's
> > > connection info or some other flags is same as the walsender's
> > > connection info). But it seems tricky because if some slots are
> > > inactive which means the walsenders are not there, the apply worker
> > > could not find the correct slots to check unless we save the host along with
> > the slot's persistence data.
> > >
> > > The new feedback message is sent only if feedback_slots is not NULL.
> > > If the slots in feedback_slots are removed, a final message containing
> > > InvalidXLogRecPtr will be sent to inform the walsender to forget about
> > > the slot.xmin.
> > >
> > > To detect update_deleted conflicts during update operations, if the
> > > target row cannot be found, we perform an additional scan of the table using
> > snapshotAny.
> > > This scan aims to locate the most recently deleted row that matches
> > > the old column values from the remote update operation and has not yet
> > > been removed by VACUUM. If any such tuples are found, we report the
> > > update_deleted conflict along with the origin and transaction information
> > that deleted the tuple.
> > >
> > > Please refer to the attached POC patch set which implements above
> > > design. The patch set is split into some parts to make it easier for the initial
> > review.
> > > Please note that each patch is interdependent and cannot work
> > independently.
> > >
> > > Thanks a lot to Kuroda-San and Amit for the off-list discussion.
> > >
> > > Suggestions and comments are highly appreciated !
> > >
> >
> > Thank You Hou-San for explaining the design. But to make it easier to
> > understand, would you be able to explain the sequence/timeline of the
> > *new* actions performed by the walsender and the apply processes for the
> > given example along with new feedback_slot config needed
> >
> > Node A: (Procs: walsenderA, applyA)
> >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> >
> > Node B: (Procs: walsenderB, applyB)
> >   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM
>
> Thanks for reviewing! Let me elaborate further on the example:
>
> On node A, feedback_slots should include the logical slot that used to replicate changes
> from Node A to Node B. On node B, feedback_slots should include the logical
> slot that replicate changes from Node B to Node A.
>
> Assume the slot.xmin on Node A has been initialized to a valid number(740) before the
> following flow:
>
> Node A executed T1                                                                      - 10.00 AM
> T1 replicated and applied on Node B                                                     - 10.0001 AM
> Node B executed T3                                                                      - 10.01 AM
> Node A executed T2 (741)                                                                - 10.02 AM
> T2 replicated and applied on Node B     (delete_missing)                                - 10.03 AM

Not related to this feature, but do you mean delete_origin_differ here?

> T3 replicated and applied on Node A     (new action, detect update_deleted)             - 10.04 AM
>
> (new action) Apply worker on Node B has confirmed that T2 has been applied
> locally and the transactions before T2 (e.g., T3) has been replicated and
> applied to Node A (e.g. feedback_slot.confirmed_flush_lsn >= lsn of the local
> replayed T2), thus send the new feedback message to Node A.                             - 10.05 AM
>
> (new action) Walsender on Node A received the message and would advance the slot.xmin.- 10.06 AM
>
> Then, after the slot.xmin is advanced to a number greater than 741, the VACUUM would be able to
> remove the dead tuple on Node A.
>

Thanks for the example. Can you please review below and let me know if
my understanding is correct.

1)
In a bidirectional replication setup, the user has to create slots in
such a way that Node A's subscription's slot is Node B's feedback_slot and
Node B's subscription's slot is Node A's feedback_slot. Only then will this
feature work well; is that correct to say?

2)
Now coming back to multiple feedback_slots in a subscription, is the
below correct:

Say Node A has publications and subscriptions as follow:
------------------
A_pub1

A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)


Say Node B has publications and subscriptions as follow:
------------------
B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)

B_pub1
B_pub2
B_pub3

Then what will be the feedback_slot configuration for all
subscriptions of A and B? Is below correct:
------------------
A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3

3)
If the above is true, then do we have a way to make sure that the user
has given this configuration in exactly the above way? If users end up
giving feedback_slots as some random slot (say A_slot4, or an incomplete
list), do we validate that? (I have not looked at the code yet, just
trying to understand the design first.)

4)
Now coming to this:

> The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender.

There will be one apply worker on B, which will be due to B_sub1, so
will it check the confirmed_lsn of all the slots A_sub1, A_sub2, A_sub3? Won't
it be sufficient to check the confirmed_lsn of, say, slot A_sub1 alone, which
has subscribed to table 't' on which the delete has been performed? The rest
of the slots (A_sub2, A_sub3) might have subscribed to different
tables?

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 10, 2024 5:56 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Tuesday, September 10, 2024 2:45 PM shveta malik
> <shveta.malik@gmail.com> wrote:
> > >
> > > Thank You Hou-San for explaining the design. But to make it easier
> > > to understand, would you be able to explain the sequence/timeline of
> > > the
> > > *new* actions performed by the walsender and the apply processes for
> > > the given example along with new feedback_slot config needed
> > >
> > > Node A: (Procs: walsenderA, applyA)
> > >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> > >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> > >
> > > Node B: (Procs: walsenderB, applyB)
> > >   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM
> >
> > Thanks for reviewing! Let me elaborate further on the example:
> >
> > On node A, feedback_slots should include the logical slot that used to
> > replicate changes from Node A to Node B. On node B, feedback_slots
> > should include the logical slot that replicate changes from Node B to Node A.
> >
> > Assume the slot.xmin on Node A has been initialized to a valid
> > number(740) before the following flow:
> >
> > Node A executed T1                                                                      - 10.00 AM
> > T1 replicated and applied on Node B                                                     - 10.0001 AM
> > Node B executed T3                                                                      - 10.01 AM
> > Node A executed T2 (741)                                                                - 10.02 AM
> > T2 replicated and applied on Node B     (delete_missing)                                - 10.03 AM
> 
> Not related to this feature, but do you mean delete_origin_differ here?

Oh sorry, that was a mistake. I meant delete_origin_differ.

> 
> > T3 replicated and applied on Node A     (new action, detect
> update_deleted)             - 10.04 AM
> >
> > (new action) Apply worker on Node B has confirmed that T2 has been
> > applied locally and the transactions before T2 (e.g., T3) has been
> > replicated and applied to Node A (e.g. feedback_slot.confirmed_flush_lsn
> >= lsn of the local
> > replayed T2), thus send the new feedback message to Node A.
> - 10.05 AM
> >
> > (new action) Walsender on Node A received the message and would
> > advance the slot.xmin.- 10.06 AM
> >
> > Then, after the slot.xmin is advanced to a number greater than 741,
> > the VACUUM would be able to remove the dead tuple on Node A.
> >
> 
> Thanks for the example. Can you please review below and let me know if my
> understanding is correct.
> 
> 1)
> In a bidirectional replication setup, the user has to create slots in a way that
> NodeA's sub's slot is Node B's feedback_slot and Node B's sub's slot is Node
> A's feedback slot. And then only this feature will work well, is it correct to say?

Yes, your understanding is correct.

> 
> 2)
> Now coming back to multiple feedback_slots in a subscription, is the below
> correct:
> 
> Say Node A has publications and subscriptions as follow:
> ------------------
> A_pub1
> 
> A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> 
> 
> Say Node B has publications and subscriptions as follow:
> ------------------
> B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> 
> B_pub1
> B_pub2
> B_pub3
> 
> Then what will be the feedback_slot configuration for all subscriptions of A and
> B? Is below correct:
> ------------------
> A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3

Right. The above configurations are correct.
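
To illustrate, the setup could look roughly like the following. Note that
feedback_slots is only the option proposed in this patch set, so the exact
syntax below is a sketch rather than something that exists today, and the slot
names assume the default slot_name of each subscription:

  -- On Node A (subscribing to Node B's publications):
  CREATE SUBSCRIPTION a_sub1 CONNECTION 'host=nodeB dbname=postgres'
    PUBLICATION b_pub1 WITH (origin = none, feedback_slots = 'b_sub1');
  CREATE SUBSCRIPTION a_sub2 CONNECTION 'host=nodeB dbname=postgres'
    PUBLICATION b_pub2 WITH (origin = none, feedback_slots = 'b_sub1');
  CREATE SUBSCRIPTION a_sub3 CONNECTION 'host=nodeB dbname=postgres'
    PUBLICATION b_pub3 WITH (origin = none, feedback_slots = 'b_sub1');

  -- On Node B (subscribing to Node A's publication):
  CREATE SUBSCRIPTION b_sub1 CONNECTION 'host=nodeA dbname=postgres'
    PUBLICATION a_pub1
    WITH (origin = none, feedback_slots = 'a_sub1,a_sub2,a_sub3');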

> 
> 3)
> If the above is true, then do we have a way to make sure that the user  has
> given this configuration exactly the above way? If users end up giving
> feedback_slots as some random slot  (say A_slot4 or incomplete list), do we
> validate that? (I have not looked at code yet, just trying to understand design
> first).

The patch doesn't validate whether the feedback slots belong to the correct
subscriptions on the remote server. It only validates that the slot is an existing,
valid, logical slot. I think there are a few challenges in validating it further.
E.g., we need a way to identify which server the slot is replicating
changes to, which could be tricky as the slot currently doesn't have any info
to identify the remote server. Besides, the slot could be temporarily inactive
due to some subscriber-side error, in which case we cannot verify the
subscription that uses it.
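
For reference, roughly the kind of validation described above can be expressed
against pg_replication_slots (this only illustrates the condition, it is not the
patch's actual code):

  -- Each named feedback slot must exist, be logical, and still be usable.
  SELECT slot_name
  FROM pg_replication_slots
  WHERE slot_name IN ('a_sub1', 'a_sub2', 'a_sub3')
    AND slot_type = 'logical'
    AND wal_status IS DISTINCT FROM 'lost';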

> 
> 4)
> Now coming to this:
> 
> > The apply worker will get the oldest
> > confirmed flush LSN among the specified slots and send the LSN as a
> > feedback message to the walsender.
> 
>  There will be one apply worker on B which will be due to B_sub1, so will it
> check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3? Won't it be
> sufficient to check confimed_lsn of say slot A_sub1 alone which has
> subscribed to table 't' on which delete has been performed? Rest of the  lots
> (A_sub2, A_sub3) might have subscribed to different tables?

I think it's theoretically correct to only check A_sub1. We could document
that the user can do this by identifying the tables that each subscription
replicates, but it may not be user-friendly.

Best Regards,
Hou zj


Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 10, 2024 5:56 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > On Tuesday, September 10, 2024 2:45 PM shveta malik
> > <shveta.malik@gmail.com> wrote:
> > > >
> > > > Thank You Hou-San for explaining the design. But to make it easier
> > > > to understand, would you be able to explain the sequence/timeline of
> > > > the
> > > > *new* actions performed by the walsender and the apply processes for
> > > > the given example along with new feedback_slot config needed
> > > >
> > > > Node A: (Procs: walsenderA, applyA)
> > > >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> > > >   T2: DELETE FROM t WHERE id = 1;               ts=10.02 AM
> > > >
> > > > Node B: (Procs: walsenderB, applyB)
> > > >   T3: UPDATE t SET value = 2 WHERE id = 1;     ts=10.01 AM
> > >
> > > Thanks for reviewing! Let me elaborate further on the example:
> > >
> > > On node A, feedback_slots should include the logical slot that used to
> > > replicate changes from Node A to Node B. On node B, feedback_slots
> > > should include the logical slot that replicate changes from Node B to Node A.
> > >
> > > Assume the slot.xmin on Node A has been initialized to a valid
> > > number(740) before the following flow:
> > >
> > > Node A executed T1                                                                      - 10.00 AM
> > > T1 replicated and applied on Node B                                                     - 10.0001 AM
> > > Node B executed T3                                                                      - 10.01 AM
> > > Node A executed T2 (741)                                                                - 10.02 AM
> > > T2 replicated and applied on Node B     (delete_missing)                                - 10.03 AM
> >
> > Not related to this feature, but do you mean delete_origin_differ here?
>
> Oh sorry, It's a miss. I meant delete_origin_differ.
>
> >
> > > T3 replicated and applied on Node A     (new action, detect
> > update_deleted)             - 10.04 AM
> > >
> > > (new action) Apply worker on Node B has confirmed that T2 has been
> > > applied locally and the transactions before T2 (e.g., T3) has been
> > > replicated and applied to Node A (e.g. feedback_slot.confirmed_flush_lsn
> > >= lsn of the local
> > > replayed T2), thus send the new feedback message to Node A.
> > - 10.05 AM
> > >
> > > (new action) Walsender on Node A received the message and would
> > > advance the slot.xmin.- 10.06 AM
> > >
> > > Then, after the slot.xmin is advanced to a number greater than 741,
> > > the VACUUM would be able to remove the dead tuple on Node A.
> > >
> >
> > Thanks for the example. Can you please review below and let me know if my
> > understanding is correct.
> >
> > 1)
> > In a bidirectional replication setup, the user has to create slots in a way that
> > NodeA's sub's slot is Node B's feedback_slot and Node B's sub's slot is Node
> > A's feedback slot. And then only this feature will work well, is it correct to say?
>
> Yes, your understanding is correct.
>
> >
> > 2)
> > Now coming back to multiple feedback_slots in a subscription, is the below
> > correct:
> >
> > Say Node A has publications and subscriptions as follow:
> > ------------------
> > A_pub1
> >
> > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> >
> >
> > Say Node B has publications and subscriptions as follow:
> > ------------------
> > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> >
> > B_pub1
> > B_pub2
> > B_pub3
> >
> > Then what will be the feedback_slot configuration for all subscriptions of A and
> > B? Is below correct:
> > ------------------
> > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
>
> Right. The above configurations are correct.

Okay. It seems difficult to understand the configuration from the user's perspective.

> >
> > 3)
> > If the above is true, then do we have a way to make sure that the user  has
> > given this configuration exactly the above way? If users end up giving
> > feedback_slots as some random slot  (say A_slot4 or incomplete list), do we
> > validate that? (I have not looked at code yet, just trying to understand design
> > first).
>
> The patch doesn't validate if the feedback slots belong to the correct
> subscriptions on remote server. It only validates if the slot is an existing,
> valid, logical slot. I think there are few challenges to validate it further.
> E.g. We need a way to identify the which server the slot is replicating
> changes to, which could be tricky as the slot currently doesn't have any info
> to identify the remote server. Besides, the slot could be inactive temporarily
> due to some subscriber side error, in which case we cannot verify the
> subscription that used it.

Okay, I understand the challenges here.

> >
> > 4)
> > Now coming to this:
> >
> > > The apply worker will get the oldest
> > > confirmed flush LSN among the specified slots and send the LSN as a
> > > feedback message to the walsender.
> >
> >  There will be one apply worker on B which will be due to B_sub1, so will it
> > check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3? Won't it be
> > sufficient to check confimed_lsn of say slot A_sub1 alone which has
> > subscribed to table 't' on which delete has been performed? Rest of the  lots
> > (A_sub2, A_sub3) might have subscribed to different tables?
>
> I think it's theoretically correct to only check the A_sub1. We could document
> that user can do this by identifying the tables that each subscription
> replicates, but it may not be user friendly.
>

Sorry, I fail to understand how the user can identify the tables and set
feedback_slots accordingly. I thought feedback_slots is a one-time
configuration done when replication is set up (or when the setup changes in
the future); it cannot keep changing with each query. Or am I missing
something?

IMO, it is something which should be identified internally. Since the
query is on table 't1', the feedback slot which is for 't1' should be used
to check the LSN. But on rethinking, this optimization may not be worth the
effort; the identification part could be tricky, so it might be okay
to check all the slots.

~~

Another query is about a 3-node setup. I couldn't figure out what the
feedback_slots setting would be when the setup is not bidirectional. Consider
the case where there are three nodes A, B, C. Node C is subscribing to
both Node A and Node B. Node A and Node B are the ones doing the
concurrent "update" and "delete", which will both be replicated to Node
C. In this case, what will be the feedback_slots setting on Node C? We
don't have any slots here which will be replicating changes from Node
C to Node A or from Node C to Node B. This is given in [3] in your first
email ([1]).

[1]:
https://www.postgresql.org/message-id/OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, September 11, 2024 12:18 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Tuesday, September 10, 2024 5:56 PM shveta malik
> <shveta.malik@gmail.com> wrote:
> > >
> > > Thanks for the example. Can you please review below and let me know
> > > if my understanding is correct.
> > >
> > > 1)
> > > In a bidirectional replication setup, the user has to create slots
> > > in a way that NodeA's sub's slot is Node B's feedback_slot and Node
> > > B's sub's slot is Node A's feedback slot. And then only this feature will
> work well, is it correct to say?
> >
> > Yes, your understanding is correct.
> >
> > >
> > > 2)
> > > Now coming back to multiple feedback_slots in a subscription, is the
> > > below
> > > correct:
> > >
> > > Say Node A has publications and subscriptions as follow:
> > > ------------------
> > > A_pub1
> > >
> > > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> > >
> > >
> > > Say Node B has publications and subscriptions as follow:
> > > ------------------
> > > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> > >
> > > B_pub1
> > > B_pub2
> > > B_pub3
> > >
> > > Then what will be the feedback_slot configuration for all
> > > subscriptions of A and B? Is below correct:
> > > ------------------
> > > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
> >
> > Right. The above configurations are correct.
> 
> Okay. It seems difficult to understand configuration from user's perspective.

Right. I think we could give an example in the document to make it clear.

> 
> > >
> > > 3)
> > > If the above is true, then do we have a way to make sure that the
> > > user  has given this configuration exactly the above way? If users
> > > end up giving feedback_slots as some random slot  (say A_slot4 or
> > > incomplete list), do we validate that? (I have not looked at code
> > > yet, just trying to understand design first).
> >
> > The patch doesn't validate if the feedback slots belong to the correct
> > subscriptions on remote server. It only validates if the slot is an
> > existing, valid, logical slot. I think there are few challenges to validate it
> further.
> > E.g. We need a way to identify the which server the slot is
> > replicating changes to, which could be tricky as the slot currently
> > doesn't have any info to identify the remote server. Besides, the slot
> > could be inactive temporarily due to some subscriber side error, in
> > which case we cannot verify the subscription that used it.
> 
> Okay, I understand the challenges here.
> 
> > >
> > > 4)
> > > Now coming to this:
> > >
> > > > The apply worker will get the oldest confirmed flush LSN among the
> > > > specified slots and send the LSN as a feedback message to the
> > > > walsender.
> > >
> > >  There will be one apply worker on B which will be due to B_sub1, so
> > > will it check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3?
> > > Won't it be sufficient to check confimed_lsn of say slot A_sub1
> > > alone which has subscribed to table 't' on which delete has been
> > > performed? Rest of the  lots (A_sub2, A_sub3) might have subscribed to
> different tables?
> >
> > I think it's theoretically correct to only check the A_sub1. We could
> > document that user can do this by identifying the tables that each
> > subscription replicates, but it may not be user friendly.
> >
> 
> Sorry, I fail to understand how user can identify the tables and give
> feedback_slots accordingly? I thought feedback_slots is a one time
> configuration when replication is setup (or say setup changes in future); it can
> not keep on changing with each query. Or am I missing something?

I meant that the user has all the publication information (including the tables
added to a publication) that the subscription subscribes to, and could also
have the slot_name, so I think it's possible to identify the tables that each
subscription includes and add the feedback_slots correspondingly before
starting the replication. It would be pretty complicated, although possible, so I
prefer not to mention it in the first place if it would not bring much
benefit.
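
For what it's worth, the information needed for that mapping is already exposed
in existing views, e.g. on the publisher side (this only shows that the mapping
is possible, not that we expect users to do it):

  -- Which tables does each publication (and hence each remote subscription)
  -- replicate?
  SELECT pubname, schemaname, tablename
  FROM pg_publication_tables
  ORDER BY pubname;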

> 
> IMO, it is something which should be identified internally. Since the query is on
> table 't1', feedback-slot which is for 't1' shall be used to check lsn. But on
> rethinking,this optimization may not be worth the effort, the identification part
> could be tricky, so it might be okay to check all the slots.

I agree that identifying these internally would add complexity.

> 
> ~~
> 
> Another query is about 3 node setup. I couldn't figure out what would be
> feedback_slots setting when it is not bidirectional, as in consider the case
> where there are three nodes A,B,C. Node C is subscribing to both Node A and
> Node B. Node A and Node B are the ones doing concurrent "update" and
> "delete" which will both be replicated to Node C. In this case what will be the
> feedback_slots setting on Node C? We don't have any slots here which will be
> replicating changes from Node C to Node A and Node C to Node B. This is given
> in [3] in your first email ([1])

Thanks for pointing this out; the link was a bit misleading. I think the solution
proposed in this thread is only intended to allow detecting update_deleted reliably
in a bidirectional cluster. For non-bidirectional cases, it would be more
tricky to predict until when we should retain the dead tuples.


> 
> [1]:
> https://www.postgresql.org/message-id/OS0PR01MB5716BE80DAEB0EE2A
> 6A5D1F5949D2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Wed, Sep 11, 2024 at 10:15 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, September 11, 2024 12:18 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > On Tuesday, September 10, 2024 5:56 PM shveta malik
> > <shveta.malik@gmail.com> wrote:
> > > >
> > > > Thanks for the example. Can you please review below and let me know
> > > > if my understanding is correct.
> > > >
> > > > 1)
> > > > In a bidirectional replication setup, the user has to create slots
> > > > in a way that NodeA's sub's slot is Node B's feedback_slot and Node
> > > > B's sub's slot is Node A's feedback slot. And then only this feature will
> > work well, is it correct to say?
> > >
> > > Yes, your understanding is correct.
> > >
> > > >
> > > > 2)
> > > > Now coming back to multiple feedback_slots in a subscription, is the
> > > > below
> > > > correct:
> > > >
> > > > Say Node A has publications and subscriptions as follow:
> > > > ------------------
> > > > A_pub1
> > > >
> > > > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > > > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > > > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> > > >
> > > >
> > > > Say Node B has publications and subscriptions as follow:
> > > > ------------------
> > > > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> > > >
> > > > B_pub1
> > > > B_pub2
> > > > B_pub3
> > > >
> > > > Then what will be the feedback_slot configuration for all
> > > > subscriptions of A and B? Is below correct:
> > > > ------------------
> > > > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > > > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
> > >
> > > Right. The above configurations are correct.
> >
> > Okay. It seems difficult to understand configuration from user's perspective.
>
> Right. I think we could give an example in the document to make it clear.
>
> >
> > > >
> > > > 3)
> > > > If the above is true, then do we have a way to make sure that the
> > > > user  has given this configuration exactly the above way? If users
> > > > end up giving feedback_slots as some random slot  (say A_slot4 or
> > > > incomplete list), do we validate that? (I have not looked at code
> > > > yet, just trying to understand design first).
> > >
> > > The patch doesn't validate if the feedback slots belong to the correct
> > > subscriptions on remote server. It only validates if the slot is an
> > > existing, valid, logical slot. I think there are few challenges to validate it
> > further.
> > > E.g. We need a way to identify the which server the slot is
> > > replicating changes to, which could be tricky as the slot currently
> > > doesn't have any info to identify the remote server. Besides, the slot
> > > could be inactive temporarily due to some subscriber side error, in
> > > which case we cannot verify the subscription that used it.
> >
> > Okay, I understand the challenges here.
> >
> > > >
> > > > 4)
> > > > Now coming to this:
> > > >
> > > > > The apply worker will get the oldest confirmed flush LSN among the
> > > > > specified slots and send the LSN as a feedback message to the
> > > > > walsender.
> > > >
> > > >  There will be one apply worker on B which will be due to B_sub1, so
> > > > will it check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3?
> > > > Won't it be sufficient to check confimed_lsn of say slot A_sub1
> > > > alone which has subscribed to table 't' on which delete has been
> > > > performed? Rest of the  lots (A_sub2, A_sub3) might have subscribed to
> > different tables?
> > >
> > > I think it's theoretically correct to only check the A_sub1. We could
> > > document that user can do this by identifying the tables that each
> > > subscription replicates, but it may not be user friendly.
> > >
> >
> > Sorry, I fail to understand how user can identify the tables and give
> > feedback_slots accordingly? I thought feedback_slots is a one time
> > configuration when replication is setup (or say setup changes in future); it can
> > not keep on changing with each query. Or am I missing something?
>
> I meant that user have all the publication information(including the tables
> added in a publication) that the subscription subscribes to, and could also
> have the slot_name, so I think it's possible to identify the tables that each
> subscription includes and add the feedback_slots correspondingly before
> starting the replication. It would be pretty complicate although possible, so I
> prefer to not mention it in the first place if it could not bring much
> benefits.
>
> >
> > IMO, it is something which should be identified internally. Since the query is on
> > table 't1', feedback-slot which is for 't1' shall be used to check lsn. But on
> > rethinking,this optimization may not be worth the effort, the identification part
> > could be tricky, so it might be okay to check all the slots.
>
> I agree that identifying these internally would add complexity.
>
> >
> > ~~
> >
> > Another query is about 3 node setup. I couldn't figure out what would be
> > feedback_slots setting when it is not bidirectional, as in consider the case
> > where there are three nodes A,B,C. Node C is subscribing to both Node A and
> > Node B. Node A and Node B are the ones doing concurrent "update" and
> > "delete" which will both be replicated to Node C. In this case what will be the
> > feedback_slots setting on Node C? We don't have any slots here which will be
> > replicating changes from Node C to Node A and Node C to Node B. This is given
> > in [3] in your first email ([1])
>
> Thanks for pointing this, the link was a bit misleading. I think the solution
> proposed in this thread is only used to allow detecting update_deleted reliably
> in a bidirectional cluster.  For non- bidirectional cases, it would be more
> tricky to predict the timing till when should we retain the dead tuples.
>

So in brief, this solution is only for a bidirectional setup? For
non-bidirectional setups, feedback_slots is non-configurable and thus
irrelevant.

Irrespective of the above, if the user ends up setting feedback_slots to some
random but existing slot which is not consuming changes at all, then
it may so happen that the node will never send a feedback message to the other
node, resulting in an accumulation of dead tuples on that node. Is that
a possibility?

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, September 11, 2024 1:03 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Wed, Sep 11, 2024 at 10:15 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, September 11, 2024 12:18 PM shveta malik
> <shveta.malik@gmail.com> wrote:
> > >
> > > ~~
> > >
> > > Another query is about 3 node setup. I couldn't figure out what
> > > would be feedback_slots setting when it is not bidirectional, as in
> > > consider the case where there are three nodes A,B,C. Node C is
> > > subscribing to both Node A and Node B. Node A and Node B are the
> > > ones doing concurrent "update" and "delete" which will both be
> > > replicated to Node C. In this case what will be the feedback_slots
> > > setting on Node C? We don't have any slots here which will be
> > > replicating changes from Node C to Node A and Node C to Node B. This
> > > is given in [3] in your first email ([1])
> >
> > Thanks for pointing this, the link was a bit misleading. I think the
> > solution proposed in this thread is only used to allow detecting
> > update_deleted reliably in a bidirectional cluster.  For non-
> > bidirectional cases, it would be more tricky to predict the timing till when
> should we retain the dead tuples.
> >
> 
> So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> feedback_slots is non-configurable and thus irrelevant.

Right.

> 
> Irrespective of above, if user ends up setting feedback_slot to some random but
> existing slot which is not at all consuming changes, then it may so happen that
> the node will never send feedback msg to another node resulting in
> accumulation of dead tuples on another node. Is that a possibility?

Yes, it's possible. I think this is a common situation for this kind of
user-specified option. For example, user DML will be blocked if any inactive
standby names are added to synchronous_standby_names.
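
As a concrete analogy of that existing behaviour (the standby name below is
assumed not to correspond to any running standby):

  ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (not_running_standby)';
  SELECT pg_reload_conf();
  -- Committing writes will now wait until such a standby reports back.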

Best Regards,
Hou zj



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Sep 11, 2024 at 11:07 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, September 11, 2024 1:03 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > > >
> > > > Another query is about 3 node setup. I couldn't figure out what
> > > > would be feedback_slots setting when it is not bidirectional, as in
> > > > consider the case where there are three nodes A,B,C. Node C is
> > > > subscribing to both Node A and Node B. Node A and Node B are the
> > > > ones doing concurrent "update" and "delete" which will both be
> > > > replicated to Node C. In this case what will be the feedback_slots
> > > > setting on Node C? We don't have any slots here which will be
> > > > replicating changes from Node C to Node A and Node C to Node B. This
> > > > is given in [3] in your first email ([1])
> > >
> > > Thanks for pointing this, the link was a bit misleading. I think the
> > > solution proposed in this thread is only used to allow detecting
> > > update_deleted reliably in a bidirectional cluster.  For non-
> > > bidirectional cases, it would be more tricky to predict the timing till when
> > should we retain the dead tuples.
> > >
> >
> > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > feedback_slots is non-configurable and thus irrelevant.
>
> Right.
>

One possible idea to address the non-bidirectional case raised by
Shveta is to use a time-based cut-off to remove dead tuples. As
mentioned earlier in my email [1], we can define a new GUC parameter,
say vacuum_committs_age, which would indicate that we will allow rows
to be removed only if the time elapsed since the tuple's modification, as
indicated by the committs module, is greater than vacuum_committs_age. We could
keep this parameter a table-level option without introducing a GUC, as it
may not apply to all tables. I checked and found that some other
replication solutions like GoldenGate also allow a similar parameter
(tombstone_deletes) to be specified at the table level [2]. The other
advantage of allowing it at the table level is that it won't hamper the
performance of hot-pruning or vacuum in general. Note that I am careful
here because, to decide whether or not to remove a dead tuple, we need
to compare its committs_time both during hot-pruning and vacuum.
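
To make the table-level idea concrete, usage could look something like the
following (the option name, its unit, and the syntax are all hypothetical at
this point; it also relies on commit timestamps being tracked):

  -- Enable commit timestamps, which the retention check would rely on.
  ALTER SYSTEM SET track_commit_timestamp = on;   -- requires a server restart

  -- Hypothetical: keep dead tuples in this table until at least 300 seconds
  -- have passed since the deleting transaction committed.
  ALTER TABLE t SET (vacuum_committs_age = 300);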

Note that tombstone_deletes is a general concept used by replication
solutions to detect update_deleted conflicts, and time-based purging is
recommended. See [3][4]. We previously discussed having tombstone
tables to keep the deleted records information but it was suggested to
prevent the vacuum from removing the required dead tuples as that
would be simpler than inventing a new kind of tables/store for
tombstone_deletes [5]. So, we came up with the idea of feedback slots
discussed in this email but that didn't work out in all cases and
appears difficult to configure as pointed out by Shveta. So, now, we
are back to one of the other ideas [1] discussed previously to solve
this problem.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1Lj-PWrP789KnKxZydisHajd38rSihWXO8MVBLDwxG1Kg%40mail.gmail.com
[2] -
BEGIN
  DBMS_GOLDENGATE_ADM.ALTER_AUTO_CDR(
    schema_name       => 'hr',
    table_name        => 'employees',
    tombstone_deletes => TRUE);
END;
/
[3] - https://en.wikipedia.org/wiki/Tombstone_(data_store)
[4] -
https://docs.oracle.com/en/middleware/goldengate/core/19.1/oracle-db/automatic-conflict-detection-and-resolution1.html#GUID-423C6EE8-1C62-4085-899C-8454B8FB9C92
[5] - https://www.postgresql.org/message-id/e4cdb849-d647-4acf-aabe-7049ae170fbf%40enterprisedb.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > >
> > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > feedback_slots is non-configurable and thus irrelevant.
> >
> > Right.
> >
>
> One possible idea to address the non-bidirectional case raised by
> Shveta is to use a time-based cut-off to remove dead tuples. As
> mentioned earlier in my email [1], we can define a new GUC parameter
> say vacuum_committs_age which would indicate that we will allow rows
> to be removed only if the modified time of the tuple as indicated by
> committs module is greater than the vacuum_committs_age. We could keep
> this parameter a table-level option without introducing a GUC as this
> may not apply to all tables. I checked and found that some other
> replication solutions like GoldenGate also allowed similar parameters
> (tombstone_deletes) to be specified at table level [2]. The other
> advantage of allowing it at table level is that it won't hamper the
> performance of hot-pruning or vacuum in general. Note, I am careful
> here because to decide whether to remove a dead tuple or not we need
> to compare its committs_time both during hot-pruning and vacuum.

+1 on the idea, but IIUC this value doesn't need to be significant; it
can be limited to just a few minutes, one that is sufficient to
handle replication delays caused by network lag or other factors,
assuming clock skew has already been addressed.

This new parameter is necessary only for cases where an UPDATE and
DELETE on the same row occur concurrently, but the replication order
to a third node is not preserved, which could result in data
divergence. Consider the following example:

Node A:
   T1: INSERT INTO t (id, value) VALUES (1,1);  (10.01 AM)
   T2: DELETE FROM t WHERE id = 1;             (10.03 AM)

Node B:
   T3: UPDATE t SET value = 2 WHERE id = 1;    (10.02 AM)

Assume a third node (Node C) subscribes to both Node A and Node B. The
"correct" order of messages received by Node C would be T1-T3-T2, but
it could also receive them in the order T1-T2-T3, wherein, say, T3 is
received with a lag of 2 minutes. In such a scenario, T3 should be
able to recognize that the row was deleted by T2 on Node C, thereby
detecting the update_deleted conflict and skipping the apply.

The 'vacuum_committs_age' parameter should account for this lag, which
could lead to the order reversal of UPDATE and DELETE operations.
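
For measuring such lags, the commit timestamps of the transactions involved can
already be inspected once track_commit_timestamp is enabled (an existing
facility; the XID below is just a placeholder for the deleting transaction):

  -- Commit time of a given transaction, e.g. the one that deleted the row.
  SELECT pg_xact_commit_timestamp('741'::xid);

  -- Commit time of the transaction that last modified a still-visible row.
  SELECT pg_xact_commit_timestamp(xmin) FROM t WHERE id = 1;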

Any subsequent attempt to update the same row after conflict detection
and resolution should not pose an issue. For example, if Node A
triggers the following at 10:20 AM:
UPDATE t SET value = 3 WHERE id = 1;

Since the row has already been deleted, the UPDATE will not proceed
and therefore will not generate a replication operation on the other
nodes, indicating that vacuum need not preserve the dead row for that
long.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Fri, Sep 13, 2024 at 12:56 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > >
> > > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > > feedback_slots is non-configurable and thus irrelevant.
> > >
> > > Right.
> > >
> >
> > One possible idea to address the non-bidirectional case raised by
> > Shveta is to use a time-based cut-off to remove dead tuples. As
> > mentioned earlier in my email [1], we can define a new GUC parameter
> > say vacuum_committs_age which would indicate that we will allow rows
> > to be removed only if the modified time of the tuple as indicated by
> > committs module is greater than the vacuum_committs_age. We could keep
> > this parameter a table-level option without introducing a GUC as this
> > may not apply to all tables. I checked and found that some other
> > replication solutions like GoldenGate also allowed similar parameters
> > (tombstone_deletes) to be specified at table level [2]. The other
> > advantage of allowing it at table level is that it won't hamper the
> > performance of hot-pruning or vacuum in general. Note, I am careful
> > here because to decide whether to remove a dead tuple or not we need
> > to compare its committs_time both during hot-pruning and vacuum.
>
> +1 on the idea,

I agree that this idea is much simpler than the idea originally
proposed in this thread.

IIUC vacuum_committs_age specifies a time rather than an XID age. But
how can we implement it? If it ends up affecting the vacuum cutoff, we
should be careful not to end up with the same result as
vacuum_defer_cleanup_age, which was discussed before [1]. Also, I think
the implementation must not affect the performance of
ComputeXidHorizons().

> but IIUC this value doesn’t need to be significant; it
> can be limited to just a few minutes. The one which is sufficient to
> handle replication delays caused by network lag or other factors,
> assuming clock skew has already been addressed.

I think that in a non-bidirectional case the value could need to be a
large number. Is that right?

Regards,

[1] https://www.postgresql.org/message-id/20230317230930.nhsgk3qfk7f4axls%40awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Sep 13, 2024 at 12:56 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > >
> > > > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > > > feedback_slots is non-configurable and thus irrelevant.
> > > >
> > > > Right.
> > > >
> > >
> > > One possible idea to address the non-bidirectional case raised by
> > > Shveta is to use a time-based cut-off to remove dead tuples. As
> > > mentioned earlier in my email [1], we can define a new GUC parameter
> > > say vacuum_committs_age which would indicate that we will allow rows
> > > to be removed only if the modified time of the tuple as indicated by
> > > committs module is greater than the vacuum_committs_age. We could keep
> > > this parameter a table-level option without introducing a GUC as this
> > > may not apply to all tables. I checked and found that some other
> > > replication solutions like GoldenGate also allowed similar parameters
> > > (tombstone_deletes) to be specified at table level [2]. The other
> > > advantage of allowing it at table level is that it won't hamper the
> > > performance of hot-pruning or vacuum in general. Note, I am careful
> > > here because to decide whether to remove a dead tuple or not we need
> > > to compare its committs_time both during hot-pruning and vacuum.
> >
> > +1 on the idea,
>
> I agree that this idea is much simpler than the idea originally
> proposed in this thread.
>
> IIUC vacuum_committs_age specifies a time rather than an XID age.
>

Your understanding is correct that vacuum_committs_age specifies a time.

>
> But
> how can we implement it? If it ends up affecting the vacuum cutoff, we
> should be careful not to end up with the same result of
> vacuum_defer_cleanup_age that was discussed before[1]. Also, I think
> the implementation needs not to affect the performance of
> ComputeXidHorizons().
>

I haven't thought about the implementation details yet but I think
during pruning (for example in heap_prune_satisfies_vacuum()), apart
from checking if the tuple satisfies
HeapTupleSatisfiesVacuumHorizon(), we should also check whether the age
of the tuple's committs exceeds the configured vacuum_committs_age (for
the table) to decide whether the tuple can be removed. One thing to
consider is what to do in the case of aggressive vacuum, where we expect
relfrozenxid to be advanced to FreezeLimit (at a minimum). We may want
to just ignore vacuum_committs_age during aggressive vacuum and LOG if
we end up removing some tuple. This will allow users to retain deleted
tuples while still respecting the freeze limits, which also avoids XID
wraparound. We can't retain tuples forever if the user has misconfigured
vacuum_committs_age, and to avoid that we can put a maximum limit on
this parameter of, say, an hour or so. Also, users can tune the freeze
parameters if they want to retain tuples for longer.
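
For illustration, a rough sketch of what such an extra check during
pruning could look like is below. The function name, the per-table
'committs_age_secs' value and the 'aggressive' flag are assumptions made
for this sketch; only the commit_ts and timestamp routines are existing
facilities (access/htup_details.h, access/commit_ts.h, utils/timestamp.h):

/*
 * Sketch only: decide whether a DEAD tuple must be kept back because its
 * deleting transaction committed less than committs_age_secs seconds ago.
 * Assumes xmax is a normal committed xid; the caller would consult this
 * only after HeapTupleSatisfiesVacuumHorizon() reports the tuple removable.
 */
static bool
tuple_retained_by_committs_age(HeapTupleHeader tuple,
                               int committs_age_secs, bool aggressive)
{
    TransactionId xmax = HeapTupleHeaderGetUpdateXid(tuple);
    TimestampTz   commit_ts;

    if (committs_age_secs <= 0)
        return false;   /* option not set for this table */

    if (aggressive)
        return false;   /* ignore the option (and LOG) during aggressive vacuum */

    if (!TransactionIdGetCommitTsData(xmax, &commit_ts, NULL))
        return false;   /* no commit timestamp recorded for xmax */

    /* keep the tuple while it is younger than the configured age */
    return !TimestampDifferenceExceeds(commit_ts, GetCurrentTimestamp(),
                                       committs_age_secs * 1000);
}

Consulting such a check only for tuples that are otherwise removable
would keep the commit_ts lookup off the common path.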

> > but IIUC this value doesn’t need to be significant; it
> > can be limited to just a few minutes. The one which is sufficient to
> > handle replication delays caused by network lag or other factors,
> > assuming clock skew has already been addressed.
>
> I think that in a non-bidirectional case the value could need to be a
> large number. Is that right?
>

As per my understanding, even for non-bidirectional cases, the value
should be small. For example, in the case pointed out by Shveta [1],
where the updates from two nodes are received by a third node, this
setting is expected to be small. This setting primarily deals with
concurrent transactions on multiple nodes, so a small value should
suffice, but I could be missing something.

[1] - https://www.postgresql.org/message-id/CAJpy0uAzzOzhXGH-zBc7Zt8ndXRf6r4OnLzgRrHyf8cvd%2Bfpwg%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Sep 13, 2024 at 12:56 AM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > >
> > > > > > So in brief, this solution is only for bidrectional setup? For non-bidirectional,
> > > > > > feedback_slots is non-configurable and thus irrelevant.
> > > > >
> > > > > Right.
> > > > >
> > > >
> > > > One possible idea to address the non-bidirectional case raised by
> > > > Shveta is to use a time-based cut-off to remove dead tuples. As
> > > > mentioned earlier in my email [1], we can define a new GUC parameter
> > > > say vacuum_committs_age which would indicate that we will allow rows
> > > > to be removed only if the modified time of the tuple as indicated by
> > > > committs module is greater than the vacuum_committs_age. We could keep
> > > > this parameter a table-level option without introducing a GUC as this
> > > > may not apply to all tables. I checked and found that some other
> > > > replication solutions like GoldenGate also allowed similar parameters
> > > > (tombstone_deletes) to be specified at table level [2]. The other
> > > > advantage of allowing it at table level is that it won't hamper the
> > > > performance of hot-pruning or vacuum in general. Note, I am careful
> > > > here because to decide whether to remove a dead tuple or not we need
> > > > to compare its committs_time both during hot-pruning and vacuum.
> > >
> > > +1 on the idea,
> >
> > I agree that this idea is much simpler than the idea originally
> > proposed in this thread.
> >
> > IIUC vacuum_committs_age specifies a time rather than an XID age.
> >
>
> Your understanding is correct that vacuum_committs_age specifies a time.
>
> >
> > But
> > how can we implement it? If it ends up affecting the vacuum cutoff, we
> > should be careful not to end up with the same result of
> > vacuum_defer_cleanup_age that was discussed before[1]. Also, I think
> > the implementation needs not to affect the performance of
> > ComputeXidHorizons().
> >
>
> I haven't thought about the implementation details yet but I think
> during pruning (for example in heap_prune_satisfies_vacuum()), apart
> from checking if the tuple satisfies
> HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> committs is greater than configured vacuum_committs_age (for the
> table) to decide whether tuple can be removed.

Sounds very costly. I think we need to do performance tests. Even if
the vacuum gets slower only on the particular table having the
vacuum_committs_age setting, it would affect overall autovacuum
performance. Also, it would affect HOT pruning performance.

>
> > > but IIUC this value doesn’t need to be significant; it
> > > can be limited to just a few minutes. The one which is sufficient to
> > > handle replication delays caused by network lag or other factors,
> > > assuming clock skew has already been addressed.
> >
> > I think that in a non-bidirectional case the value could need to be a
> > large number. Is that right?
> >
>
> As per my understanding, even for non-bidirectional cases, the value
> should be small. For example, in the case, pointed out by Shveta [1],
> where the updates from 2 nodes are received by a third node, this
> setting is expected to be small. This setting primarily deals with
> concurrent transactions on multiple nodes, so it should be small but I
> could be missing something.
>

I might be missing something, but the scenario I was thinking of is
like the following.

Suppose that we set up uni-directional logical replication between Node
A and Node B (e.g., Node A -> Node B) and both nodes have the same row
with key = 1:

Node A:
    T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
      -> This change is applied on Node B at 10:01 AM.

Node B:
    T2: DELETE FROM t WHERE key = 1;         (05:00 AM)

If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
Node A would raise an "update_missing" conflict. On the other hand, if
a vacuum runs on Node B at 11:00 AM, the change would raise an
"update_deleted" conflict. It looks like whether we detect an
"update_deleted" or an "update_missing" depends on the timing of
vacuum, and to avoid such a situation, we would need to set
vacuum_committs_age to more than 5 hours.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I haven't thought about the implementation details yet but I think
> > during pruning (for example in heap_prune_satisfies_vacuum()), apart
> > from checking if the tuple satisfies
> > HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> > committs is greater than configured vacuum_committs_age (for the
> > table) to decide whether tuple can be removed.
>
> Sounds very costly. I think we need to do performance tests. Even if
> the vacuum gets slower only on the particular table having the
> vacuum_committs_age setting, it would affect overall autovacuum
> performance. Also, it would affect HOT pruning performance.
>

Agreed that we should do some performance testing and additionally
think about whether there is a better way to implement it. I think the
cost won't be much if the tuples to be removed are from a single
transaction, because the required commit_ts information would be
cached, but when the tuples are from different transactions, we could
see a noticeable impact. We need to test before saying anything
concrete on this.

> >
> > > > but IIUC this value doesn’t need to be significant; it
> > > > can be limited to just a few minutes. The one which is sufficient to
> > > > handle replication delays caused by network lag or other factors,
> > > > assuming clock skew has already been addressed.
> > >
> > > I think that in a non-bidirectional case the value could need to be a
> > > large number. Is that right?
> > >
> >
> > As per my understanding, even for non-bidirectional cases, the value
> > should be small. For example, in the case, pointed out by Shveta [1],
> > where the updates from 2 nodes are received by a third node, this
> > setting is expected to be small. This setting primarily deals with
> > concurrent transactions on multiple nodes, so it should be small but I
> > could be missing something.
> >
>
> I might be missing something but the scenario I was thinking of is
> something below.
>
> Suppose that we setup uni-directional logical replication between Node
> A and Node B (e.g., Node A -> Node B) and both nodes have the same row
> with key = 1:
>
> Node A:
>     T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
>       -> This change is applied on Node B at 10:01 AM.
>
> Node B:
>     T2: DELETE FROM t WHERE key = 1;         (05:00 AM)
>
> If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> Node A would raise an "update_missing" conflict. On the other hand, if
> a vacuum runs on Node B at 11:00 AM, the change would raise an
> "update_deleted" conflict. It looks whether we detect an
> "update_deleted" or an "updated_missing" depends on the timing of
> vacuum, and to avoid such a situation, we would need to set
> vacuum_committs_age to more than 5 hours.
>

Yeah, in this case, it would detect a different conflict (if we don't
set vacuum_committs_age to greater than 5 hours), but as per my
understanding, the primary purpose of conflict detection and
resolution is to avoid data inconsistency in a bi-directional setup.
Assuming the above case is a bi-directional setup, we want to have the
same data in both nodes. Now, if there are other cases like the one
you mentioned that require detecting the conflict reliably, then I
agree this value could be large and probably not the best way to
achieve it. I think we can mention in the docs that the primary
purpose of this is to achieve data consistency among bi-directional
kinds of setups.

Having said that, even in the above case, the result should be the same
whether the vacuum has removed the row or not. Say the vacuum has not
yet removed the row (due to vacuum_committs_age or otherwise): because
the incoming update has a later timestamp, we will convert the update
to an insert as per the last_update_wins resolution method, the same
handling as for update_missing. And say the vacuum has removed the row
and the conflict detected is update_missing: then also we will convert
the update to an insert. In short, if the UPDATE has the lower
commit-ts, the DELETE should win, and if the UPDATE has the higher
commit-ts, the UPDATE should win.

So, we can expect data consistency in bidirectional cases and
deterministic behavior in other cases (i.e. the final data in a table
does not depend on the order of applying the transactions from other
nodes).
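
To spell out that rule, here is a tiny sketch of the decision; every
name in it is invented for illustration and nothing here is existing
PostgreSQL code:

/*
 * Sketch only: the commit-ts comparison described above. delete_commit_ts
 * is the commit timestamp of the transaction that deleted the local row
 * (taken from the dead tuple's xmax); if vacuum has already removed the
 * tuple, dead_tuple_found is false and we behave as for update_missing.
 * The two helpers are placeholders, not real functions.
 */
static void apply_remote_update_as_insert(void);    /* placeholder */
static void skip_remote_update(void);                /* placeholder */

static void
resolve_remote_update(bool dead_tuple_found,
                      TimestampTz delete_commit_ts,
                      TimestampTz update_commit_ts)
{
    if (!dead_tuple_found || update_commit_ts > delete_commit_ts)
    {
        /* UPDATE has the higher commit-ts (or no dead tuple was found):
         * the UPDATE wins and is applied as an INSERT. */
        apply_remote_update_as_insert();
    }
    else
    {
        /* DELETE has the higher commit-ts: the DELETE wins, skip the UPDATE. */
        skip_remote_update();
    }
}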

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Tue, Sep 17, 2024 at 9:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > I haven't thought about the implementation details yet but I think
> > > during pruning (for example in heap_prune_satisfies_vacuum()), apart
> > > from checking if the tuple satisfies
> > > HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
> > > committs is greater than configured vacuum_committs_age (for the
> > > table) to decide whether tuple can be removed.
> >
> > Sounds very costly. I think we need to do performance tests. Even if
> > the vacuum gets slower only on the particular table having the
> > vacuum_committs_age setting, it would affect overall autovacuum
> > performance. Also, it would affect HOT pruning performance.
> >
>
> Agreed that we should do some performance testing and additionally
> think of any better way to implement. I think the cost won't be much
> if the tuples to be removed are from a single transaction because the
> required commit_ts information would be cached but when the tuples are
> from different transactions, we could see a noticeable impact. We need
> to test to say anything concrete on this.

Agreed.

>
> > >
> > > > > but IIUC this value doesn’t need to be significant; it
> > > > > can be limited to just a few minutes. The one which is sufficient to
> > > > > handle replication delays caused by network lag or other factors,
> > > > > assuming clock skew has already been addressed.
> > > >
> > > > I think that in a non-bidirectional case the value could need to be a
> > > > large number. Is that right?
> > > >
> > >
> > > As per my understanding, even for non-bidirectional cases, the value
> > > should be small. For example, in the case, pointed out by Shveta [1],
> > > where the updates from 2 nodes are received by a third node, this
> > > setting is expected to be small. This setting primarily deals with
> > > concurrent transactions on multiple nodes, so it should be small but I
> > > could be missing something.
> > >
> >
> > I might be missing something but the scenario I was thinking of is
> > something below.
> >
> > Suppose that we setup uni-directional logical replication between Node
> > A and Node B (e.g., Node A -> Node B) and both nodes have the same row
> > with key = 1:
> >
> > Node A:
> >     T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
> >       -> This change is applied on Node B at 10:01 AM.
> >
> > Node B:
> >     T2: DELETE FROM t WHERE key = 1;         (05:00 AM)
> >
> > If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> > Node A would raise an "update_missing" conflict. On the other hand, if
> > a vacuum runs on Node B at 11:00 AM, the change would raise an
> > "update_deleted" conflict. It looks whether we detect an
> > "update_deleted" or an "updated_missing" depends on the timing of
> > vacuum, and to avoid such a situation, we would need to set
> > vacuum_committs_age to more than 5 hours.
> >
>
> Yeah, in this case, it would detect a different conflict (if we don't
> set vacuum_committs_age to greater than 5 hours) but as per my
> understanding, the primary purpose of conflict detection and
> resolution is to avoid data inconsistency in a bi-directional setup.
> Assume, in the above case it is a bi-directional setup, then we want
> to have the same data in both nodes. Now, if there are other cases
> like the one you mentioned that require to detect the conflict
> reliably than I agree this value could be large and probably not the
> best way to achieve it. I think we can mention in the docs that the
> primary purpose of this is to achieve data consistency among
> bi-directional kind of setups.
>
> Having said that even in the above case, the result should be the same
> whether the vacuum has removed the row or not. Say, if the vacuum has
> not yet removed the row (due to vacuum_committs_age or otherwise) then
> also because the incoming update has a later timestamp, we will
> convert the update to insert as per last_update_wins resolution
> method, so the conflict will be considered as update_missing. And,
> say, the vacuum has removed the row and the conflict detected is
> update_missing, then also we will convert the update to insert. In
> short, if UPDATE has lower commit-ts, DELETE should win and if UPDATE
> has higher commit-ts, UPDATE should win.
>
> So, we can expect data consistency in bidirectional cases and expect a
> deterministic behavior in other cases (e.g. the final data in a table
> does not depend on the order of applying the transactions from other
> nodes).

Agreed.

I think that such a time-based configuration parameter would be a
reasonable solution. The current concerns are that it might affect
vacuum performance and lead to a bug similar to the one we had with
vacuum_defer_cleanup_age.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:

> -----Original Message-----
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> Sent: Friday, September 20, 2024 2:49 AM
> To: Amit Kapila <amit.kapila16@gmail.com>
> Cc: shveta malik <shveta.malik@gmail.com>; Hou, Zhijie/侯 志杰
> <houzj.fnst@fujitsu.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
> Subject: Re: Conflict detection for update_deleted in logical replication
> 
> On Tue, Sep 17, 2024 at 9:29 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I haven't thought about the implementation details yet but I think
> > > > during pruning (for example in heap_prune_satisfies_vacuum()),
> > > > apart from checking if the tuple satisfies
> > > > HeapTupleSatisfiesVacuumHorizon(), we should also check if the
> > > > tuple's committs is greater than configured vacuum_committs_age
> > > > (for the
> > > > table) to decide whether tuple can be removed.
> > >
> > > Sounds very costly. I think we need to do performance tests. Even if
> > > the vacuum gets slower only on the particular table having the
> > > vacuum_committs_age setting, it would affect overall autovacuum
> > > performance. Also, it would affect HOT pruning performance.
> > >
> >
> > Agreed that we should do some performance testing and additionally
> > think of any better way to implement. I think the cost won't be much
> > if the tuples to be removed are from a single transaction because the
> > required commit_ts information would be cached but when the tuples are
> > from different transactions, we could see a noticeable impact. We need
> > to test to say anything concrete on this.
> 
> Agreed.
> 
> >
> > > >
> > > > > > but IIUC this value doesn’t need to be significant; it can be
> > > > > > limited to just a few minutes. The one which is sufficient to
> > > > > > handle replication delays caused by network lag or other
> > > > > > factors, assuming clock skew has already been addressed.
> > > > >
> > > > > I think that in a non-bidirectional case the value could need to
> > > > > be a large number. Is that right?
> > > > >
> > > >
> > > > As per my understanding, even for non-bidirectional cases, the
> > > > value should be small. For example, in the case, pointed out by
> > > > Shveta [1], where the updates from 2 nodes are received by a third
> > > > node, this setting is expected to be small. This setting primarily
> > > > deals with concurrent transactions on multiple nodes, so it should
> > > > be small but I could be missing something.
> > > >
> > >
> > > I might be missing something but the scenario I was thinking of is
> > > something below.
> > >
> > > Suppose that we setup uni-directional logical replication between
> > > Node A and Node B (e.g., Node A -> Node B) and both nodes have the
> > > same row with key = 1:
> > >
> > > Node A:
> > >     T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
> > >       -> This change is applied on Node B at 10:01 AM.
> > >
> > > Node B:
> > >     T2: DELETE FROM t WHERE key = 1;         (05:00 AM)
> > >
> > > If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> > > Node A would raise an "update_missing" conflict. On the other hand,
> > > if a vacuum runs on Node B at 11:00 AM, the change would raise an
> > > "update_deleted" conflict. It looks whether we detect an
> > > "update_deleted" or an "updated_missing" depends on the timing of
> > > vacuum, and to avoid such a situation, we would need to set
> > > vacuum_committs_age to more than 5 hours.
> > >
> >
> > Yeah, in this case, it would detect a different conflict (if we don't
> > set vacuum_committs_age to greater than 5 hours) but as per my
> > understanding, the primary purpose of conflict detection and
> > resolution is to avoid data inconsistency in a bi-directional setup.
> > Assume, in the above case it is a bi-directional setup, then we want
> > to have the same data in both nodes. Now, if there are other cases
> > like the one you mentioned that require to detect the conflict
> > reliably than I agree this value could be large and probably not the
> > best way to achieve it. I think we can mention in the docs that the
> > primary purpose of this is to achieve data consistency among
> > bi-directional kind of setups.
> >
> > Having said that even in the above case, the result should be the same
> > whether the vacuum has removed the row or not. Say, if the vacuum has
> > not yet removed the row (due to vacuum_committs_age or otherwise) then
> > also because the incoming update has a later timestamp, we will
> > convert the update to insert as per last_update_wins resolution
> > method, so the conflict will be considered as update_missing. And,
> > say, the vacuum has removed the row and the conflict detected is
> > update_missing, then also we will convert the update to insert. In
> > short, if UPDATE has lower commit-ts, DELETE should win and if UPDATE
> > has higher commit-ts, UPDATE should win.
> >
> > So, we can expect data consistency in bidirectional cases and expect a
> > deterministic behavior in other cases (e.g. the final data in a table
> > does not depend on the order of applying the transactions from other
> > nodes).
> 
> Agreed.
> 
> I think that such a time-based configuration parameter would be a reasonable
> solution. The current concerns are that it might affect vacuum performance and
> lead to a similar bug we had with vacuum_defer_cleanup_age.

Thanks for the feedback!

I am working on the POC patch and doing some initial performance tests on this idea.
I will share the results after finishing.

Apart from the vacuum_defer_cleanup_age idea, we've given more thought to our
approach for retaining dead tuples and have come up with another idea that can
reliably detect conflicts without requiring users to choose a wise value for
vacuum_committs_age. This new idea could also reduce the performance
impact. Thanks a lot to Amit for the off-list discussion.

The concept of the new idea is that dead tuples are only useful for detecting
conflicts when applying *concurrent* transactions from remote nodes. Any
subsequent UPDATE from a remote node after the dead tuples have been removed
should have a later timestamp, meaning it's reasonable to detect an
update_missing scenario and convert the UPDATE to an INSERT when applying it.

To achieve the above, we can create an additional replication slot on the
subscriber side, maintained by the apply worker. This slot is used to retain
the dead tuples. The apply worker will advance the slot.xmin after confirming
that all the concurrent transactions on the publisher have been applied
locally.

The process of advancing the slot.xmin could be:

1) The apply worker calls GetRunningTransactionData() to get the
'oldestRunningXid' and considers this as 'candidate_xmin'.
2) The apply worker sends a new message to the walsender to request the latest
WAL flush position (GetFlushRecPtr) on the publisher, and saves it as
'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
extend the existing keepalive message (e.g. extend the requestReply bit in the
keepalive message to add a 'request_wal_position' value).
3) The apply worker can continue to apply changes. After applying all the WAL
up to 'candidate_remote_wal_lsn', the apply worker can then advance the
slot.xmin to 'candidate_xmin'.

This approach ensures that dead tuples are not removed until all concurrent
transactions have been applied. It can be effective for both bidirectional and
non-bidirectional replication cases.
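
To make the three steps above a bit more concrete, below is a rough
sketch of the apply worker's side of this cycle.
request_publisher_flush_lsn(), advance_conflict_slot_xmin() and the
last_applied_remote_lsn argument are invented for this sketch;
GetRunningTransactionData() is an existing function (which returns with
ProcArrayLock and XidGenLock held, so the caller releases them):

/*
 * Sketch only: one iteration of the proposed slot.xmin advancement cycle,
 * called periodically from the apply loop. Only GetRunningTransactionData()
 * is existing code; everything else is hypothetical.
 */
static void
maybe_advance_conflict_slot(XLogRecPtr last_applied_remote_lsn)
{
    static TransactionId candidate_xmin = InvalidTransactionId;
    static XLogRecPtr    candidate_remote_wal_lsn = InvalidXLogRecPtr;

    if (!TransactionIdIsValid(candidate_xmin))
    {
        /* Step 1: take the local oldest running xid as the candidate. */
        RunningTransactions running = GetRunningTransactionData();

        candidate_xmin = running->oldestRunningXid;
        LWLockRelease(ProcArrayLock);
        LWLockRelease(XidGenLock);

        /* Step 2: ask the publisher for its current WAL flush position
         * (a new feedback/keepalive message would carry the reply). */
        candidate_remote_wal_lsn = request_publisher_flush_lsn();
    }
    else if (last_applied_remote_lsn >= candidate_remote_wal_lsn)
    {
        /* Step 3: all remote changes up to the candidate LSN are applied
         * locally, so slot.xmin can safely move forward. */
        advance_conflict_slot_xmin(candidate_xmin);
        candidate_xmin = InvalidTransactionId;
    }
}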

We could introduce a boolean subscription option (retain_dead_tuples) to
control whether this feature is enabled. Each subscription intending to detect
update-delete conflicts should set retain_dead_tuples to true.

The following explains how it works in different cases to achieve data
consistency:

--
2 nodes, bidirectional case 1:
--
Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;            ts=10.02 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.01 AM

subscription retain_dead_tuples = true/false

After executing T2, the apply worker on Node A will check the latest WAL flush
location on Node B. By that time, T3 should have finished, so the xmin
will be advanced only after applying the WAL that is later than T3. So, the
dead tuple will not be removed before applying T3, which means the
update_deleted conflict can be detected.

--
2 nodes, bidirectional case 2:
--
Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;            ts=10.01 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.02 AM

After executing T2, the apply worker on Node A will request the latest WAL
flush location on Node B. At that point, T3 is either running concurrently or
has not started. In both cases, T3 must have a later timestamp. So, even if the
dead tuple is removed in this case and update_missing is detected, the default
resolution is to convert the UPDATE to an INSERT, which is OK because the data
are still consistent on Node A and B.

--
3 nodes, non-bidirectional, Node C subscribes to both Node A and Node B:
--

Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1;            ts=10.01 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.02 AM

Node C:
    apply T1, T2, T3

After applying T2, the apply worker on Node C will check the latest WAL flush
location on Node B. By that time, T3 should have finished, so the xmin
will be advanced only after applying the WAL that is later than T3. So, the
dead tuple will not be removed before applying T3, which means the
update_deleted conflict can be detected.

Your feedback on this idea would be greatly appreciated.

Best Regards,
Hou zj



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, September 20, 2024 10:55 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> On Friday, September 20, 2024 2:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > 
> >
> > I think that such a time-based configuration parameter would be a
> > reasonable solution. The current concerns are that it might affect
> > vacuum performance and lead to a similar bug we had with
> vacuum_defer_cleanup_age.
> 
> Thanks for the feedback!
> 
> I am working on the POC patch and doing some initial performance tests on
> this idea.
> I will share the results after finishing.
> 
> Apart from the vacuum_defer_cleanup_age idea. we’ve given more thought to
> our approach for retaining dead tuples and have come up with another idea that
> can reliably detect conflicts without requiring users to choose a wise value for
> the vacuum_committs_age. This new idea could also reduce the performance
> impact. Thanks a lot to Amit for off-list discussion.
> 
> The concept of the new idea is that, the dead tuples are only useful to detect
> conflicts when applying *concurrent* transactions from remotes. Any
> subsequent UPDATE from a remote node after removing the dead tuples
> should have a later timestamp, meaning it's reasonable to detect an
> update_missing scenario and convert the UPDATE to an INSERT when
> applying it.
> 
> To achieve above, we can create an additional replication slot on the subscriber
> side, maintained by the apply worker. This slot is used to retain the dead tuples.
> The apply worker will advance the slot.xmin after confirming that all the
> concurrent transaction on publisher has been applied locally.
> 
> The process of advancing the slot.xmin could be:
> 
> 1) the apply worker call GetRunningTransactionData() to get the
> 'oldestRunningXid' and consider this as 'candidate_xmin'.
> 2) the apply worker send a new message to walsender to request the latest wal
> flush position(GetFlushRecPtr) on publisher, and save it to
> 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> message or extend the existing keepalive message(e,g extends the
> requestReply bit in keepalive message to add a 'request_wal_position' value)
> 3) The apply worker can continue to apply changes. After applying all the WALs
> upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> slot.xmin to 'candidate_xmin'.
> 
> This approach ensures that dead tuples are not removed until all concurrent
> transactions have been applied. It can be effective for both bidirectional and
> non-bidirectional replication cases.
> 
> We could introduce a boolean subscription option (retain_dead_tuples) to
> control whether this feature is enabled. Each subscription intending to detect
> update-delete conflicts should set retain_dead_tuples to true.
> 
> The following explains how it works in different cases to achieve data
> consistency:
...
> --
> 3 nodes, non-bidirectional, Node C subscribes to both Node A and Node B:
> --

Sorry, there was a typo here: the times of T2 and T3 were reversed.
Please see the following correction:

> 
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1);        ts=10.00 AM
>   T2: DELETE FROM t WHERE id = 1;            ts=10.01 AM

Here T2 should be at ts=10.02 AM

> 
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;        ts=10.02 AM

T3 should be at ts=10.01 AM

> 
> Node C:
>     apply T1, T2, T3
> 
> After applying T2, the apply worker on Node C will check the latest wal flush
> location on Node B. Till that time, the T3 should have finished, so the xmin will
> be advanced only after applying the WALs that is later than T3. So, the dead
> tuple will not be removed before applying the T3, which means the
> update_delete can be detected.
> 
> Your feedback on this idea would be greatly appreciated.
> 

Best Regards,
Hou zj 


Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Apart from the vacuum_defer_cleanup_age idea.
>

I think you meant to say vacuum_committs_age idea.

> we’ve given more thought to our
> approach for retaining dead tuples and have come up with another idea that can
> reliably detect conflicts without requiring users to choose a wise value for
> the vacuum_committs_age. This new idea could also reduce the performance
> impact. Thanks a lot to Amit for off-list discussion.
>
> The concept of the new idea is that, the dead tuples are only useful to detect
> conflicts when applying *concurrent* transactions from remotes. Any subsequent
> UPDATE from a remote node after removing the dead tuples should have a later
> timestamp, meaning it's reasonable to detect an update_missing scenario and
> convert the UPDATE to an INSERT when applying it.
>
> To achieve above, we can create an additional replication slot on the
> subscriber side, maintained by the apply worker. This slot is used to retain
> the dead tuples. The apply worker will advance the slot.xmin after confirming
> that all the concurrent transaction on publisher has been applied locally.
>
> The process of advancing the slot.xmin could be:
>
> 1) the apply worker call GetRunningTransactionData() to get the
> 'oldestRunningXid' and consider this as 'candidate_xmin'.
> 2) the apply worker send a new message to walsender to request the latest wal
> flush position(GetFlushRecPtr) on publisher, and save it to
> 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
> extend the existing keepalive message(e,g extends the requestReply bit in
> keepalive message to add a 'request_wal_position' value)
> 3) The apply worker can continue to apply changes. After applying all the WALs
> upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> slot.xmin to 'candidate_xmin'.
>
> This approach ensures that dead tuples are not removed until all concurrent
> transactions have been applied. It can be effective for both bidirectional and
> non-bidirectional replication cases.
>
> We could introduce a boolean subscription option (retain_dead_tuples) to
> control whether this feature is enabled. Each subscription intending to detect
> update-delete conflicts should set retain_dead_tuples to true.
>

As each apply worker needs a separate slot to retain deleted rows, the
requirement for slots will increase. The other possibility is to
maintain one slot by the launcher or some other central process that
traverses all subscriptions and remembers the ones marked with
retain_dead_rows (let's call this list retain_sub_list). Then, using
running_transactions, get the oldest running xact, get the remote
flush location from the other node (publisher node), and store those
as candidate values (candidate_xmin and candidate_remote_wal_lsn) in
the slot. We can probably reuse the existing candidate variables of
the slot. Next, we can check the remote_flush locations from all the
origins corresponding to retain_sub_list, and if all are ahead of
candidate_remote_wal_lsn, we can update the slot's xmin to
candidate_xmin.

I think in the above idea we can add an optimization to combine the
requests for the remote WAL LSN from different subscriptions pointing
to the same node, to avoid sending multiple requests to the same node.
I am not sure if using pg_subscription.subconninfo is sufficient for
this; if not, we can probably leave out this optimization.

If this idea is feasible then it would reduce the number of slots
required to retain the deleted rows, but the launcher needs to get the
remote WAL location corresponding to each publisher node. There are
two ways to achieve that: (a) the launcher requests one of the apply
workers corresponding to subscriptions pointing to the same publisher
node to get this information; (b) the launcher launches another worker
to get the remote WAL flush location.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
Hi,

Thank you for considering another idea.

On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Apart from the vacuum_defer_cleanup_age idea.
> >
>
> I think you meant to say vacuum_committs_age idea.
>
> > we’ve given more thought to our
> > approach for retaining dead tuples and have come up with another idea that can
> > reliably detect conflicts without requiring users to choose a wise value for
> > the vacuum_committs_age. This new idea could also reduce the performance
> > impact. Thanks a lot to Amit for off-list discussion.
> >
> > The concept of the new idea is that, the dead tuples are only useful to detect
> > conflicts when applying *concurrent* transactions from remotes. Any subsequent
> > UPDATE from a remote node after removing the dead tuples should have a later
> > timestamp, meaning it's reasonable to detect an update_missing scenario and
> > convert the UPDATE to an INSERT when applying it.
> >
> > To achieve above, we can create an additional replication slot on the
> > subscriber side, maintained by the apply worker. This slot is used to retain
> > the dead tuples. The apply worker will advance the slot.xmin after confirming
> > that all the concurrent transaction on publisher has been applied locally.

Will the replication slot used for this purpose be a physical one or a
logical one? And IIUC such a slot doesn't need to retain WAL, but if we
do that, how do we advance the LSN of the slot?

> > 2) the apply worker send a new message to walsender to request the latest wal
> > flush position(GetFlushRecPtr) on publisher, and save it to
> > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
> > extend the existing keepalive message(e,g extends the requestReply bit in
> > keepalive message to add a 'request_wal_position' value)

The apply worker sends a keepalive message when it hasn't received
anything for more than wal_receiver_timeout / 2. So in a very active
system, we cannot rely on piggybacking new information onto the
keepalive messages to get the latest remote flush LSN.

> > 3) The apply worker can continue to apply changes. After applying all the WALs
> > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > slot.xmin to 'candidate_xmin'.
> >
> > This approach ensures that dead tuples are not removed until all concurrent
> > transactions have been applied. It can be effective for both bidirectional and
> > non-bidirectional replication cases.
> >
> > We could introduce a boolean subscription option (retain_dead_tuples) to
> > control whether this feature is enabled. Each subscription intending to detect
> > update-delete conflicts should set retain_dead_tuples to true.
> >

I'm still studying this idea but let me confirm the following scenario.

Suppose both Node-A and Node-B have the same row (1,1) in table t, and
XIDs and commit LSNs of T2 and T3 are the following:

Node A
  T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000

Node B
  T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500, commit-LSN:5000

Further suppose that it's now 10:05 AM, and the latest XID and the
latest flush WAL position of Node-A and Node-B are as follows:

Node A
  current XID: 300
  latest flush LSN: 3000

Node B
  current XID: 700
  latest flush LSN: 7000

Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
(i.e., the logical replication is delayed by 5 min).

Consider the following scenario:

1. The apply worker on Node-A calls GetRunningTransactionData() and
gets 301 (set as candidate_xmin).
2. The apply worker on Node-A requests the latest WAL flush position
from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
4. The apply worker on Node-A continues applying changes, and applies
the transactions up to remote (commit) LSN 7100.
5. Now that the apply worker on Node-A applied all changes smaller
than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
301 (candidate_xmin).
6. On Node-A, vacuum runs and physically removes the tuple that was
deleted by T2.

Here, on Node-B, there might be a transaction between LSN 7100 and 8000
that requires the tuple deleted by T2.

For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
On Node-A, whether we detect "update_deleted" or "update_missing"
still depends on when vacuum removes the tuple deleted by T2.

If applying T4 raises an "update_missing" (i.e. the changes are
applied in the order of T2->T3->(vacuum)->T4), it converts into an
insert, resulting in the table having a row with value = 3.

If applying T4 raises an "update_deleted" (i.e. the changes are
applied in the order of T2->T3->T4->(vacuum)), it's skipped, resulting
in the table having no row.

On the other hand, in this scenario, Node-B applies changes in the
order of T3->T4->T2, and applying T2 raises a "delete_origin_differ",
resulting in the table having a row with val=3 (assuming
latest_committs_win is the default resolver for this conflict).

Please confirm this scenario as I might be missing something.

>
> As each apply worker needs a separate slot to retain deleted rows, the
> requirement for slots will increase. The other possibility is to
> maintain one slot by launcher or some other central process that
> traverses all subscriptions, remember the ones marked with
> retain_dead_rows (let's call this list as retain_sub_list). Then using
> running_transactions get the oldest running_xact, and then get the
> remote flush location from the other node (publisher node) and store
> those as candidate values (candidate_xmin and
> candidate_remote_wal_lsn) in slot. We can probably reuse existing
> candidate variables of the slot. Next, we can check the remote_flush
> locations from all the origins corresponding in retain_sub_list and if
> all are ahead of candidate_remote_wal_lsn, we can update the slot's
> xmin to candidate_xmin.

Does it mean that we use one candidate_remote_wal_lsn in a slot for all
subscriptions (in retain_sub_list)? IIUC candidate_remote_wal_lsn is an
LSN of one of the publishers, so other publishers could have completely
different LSNs. How do we compare the candidate_remote_wal_lsn to the
remote_flush locations from all the origins?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> Thank you for considering another idea.

Thanks for reviewing the idea!

> 
> On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Apart from the vacuum_defer_cleanup_age idea.
> > >
> >
> > I think you meant to say vacuum_committs_age idea.
> >
> > > we’ve given more thought to our
> > > approach for retaining dead tuples and have come up with another idea
> that can
> > > reliably detect conflicts without requiring users to choose a wise value for
> > > the vacuum_committs_age. This new idea could also reduce the
> performance
> > > impact. Thanks a lot to Amit for off-list discussion.
> > >
> > > The concept of the new idea is that, the dead tuples are only useful to
> detect
> > > conflicts when applying *concurrent* transactions from remotes. Any
> subsequent
> > > UPDATE from a remote node after removing the dead tuples should have a
> later
> > > timestamp, meaning it's reasonable to detect an update_missing scenario
> and
> > > convert the UPDATE to an INSERT when applying it.
> > >
> > > To achieve above, we can create an additional replication slot on the
> > > subscriber side, maintained by the apply worker. This slot is used to retain
> > > the dead tuples. The apply worker will advance the slot.xmin after
> confirming
> > > that all the concurrent transaction on publisher has been applied locally.
> 
> The replication slot used for this purpose will be a physical one or
> logical one? And IIUC such a slot doesn't need to retain WAL but if we
> do that, how do we advance the LSN of the slot?

I think it would be a logical slot. We can keep the
restart_lsn/confirmed_flush_lsn invalid because we don't need to retain
WAL for decoding purposes.

> 
> > > 2) the apply worker send a new message to walsender to request the latest
> wal
> > > flush position(GetFlushRecPtr) on publisher, and save it to
> > > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> message or
> > > extend the existing keepalive message(e,g extends the requestReply bit in
> > > keepalive message to add a 'request_wal_position' value)
> 
> The apply worker sends a keepalive message when it didn't receive
> anything more than wal_receiver_timeout / 2. So in a very active
> system, we cannot rely on piggybacking new information to the
> keepalive messages to get the latest remote flush LSN.

Right. I think we need to send this new message at some interval independent of
wal_receiver_timeout.
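
For example, a rough sketch of requesting the flush position on its own
timer inside the apply loop could look like the following; the interval
constant and send_wal_position_request() are invented for this sketch,
while GetCurrentTimestamp() and TimestampDifferenceExceeds() exist today:

/*
 * Sketch only: request the publisher's flush position at a fixed interval,
 * independently of wal_receiver_timeout. The helper and the interval are
 * hypothetical.
 */
#define WAL_POSITION_REQUEST_INTERVAL_MS    10000   /* assumed 10s interval */

static void
maybe_request_publisher_flush_lsn(void)
{
    static TimestampTz last_request_time = 0;
    TimestampTz        now = GetCurrentTimestamp();

    if (last_request_time == 0 ||
        TimestampDifferenceExceeds(last_request_time, now,
                                   WAL_POSITION_REQUEST_INTERVAL_MS))
    {
        send_wal_position_request();    /* hypothetical new feedback message */
        last_request_time = now;
    }
}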

> 
> > > 3) The apply worker can continue to apply changes. After applying all the
> WALs
> > > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > > slot.xmin to 'candidate_xmin'.
> > >
> > > This approach ensures that dead tuples are not removed until all
> concurrent
> > > transactions have been applied. It can be effective for both bidirectional
> and
> > > non-bidirectional replication cases.
> > >
> > > We could introduce a boolean subscription option (retain_dead_tuples) to
> > > control whether this feature is enabled. Each subscription intending to
> detect
> > > update-delete conflicts should set retain_dead_tuples to true.
> > >
> 
> I'm still studying this idea but let me confirm the following scenario.
> 
> Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> XIDs and commit LSNs of T2 and T3 are the following:
> 
> Node A
>   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> 
> Node B
>   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> commit-LSN:5000
> 
> Further suppose that it's now 10:05 AM, and the latest XID and the
> latest flush WAL position of Node-A and Node-B are following:
> 
> Node A
>   current XID: 300
>   latest flush LSN; 3000
> 
> Node B
>   current XID: 700
>   latest flush LSN: 7000
> 
> Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> (i.e., the logical replication is delaying for 5 min).
> 
> Consider the following scenario:
> 
> 1. The apply worker on Node-A calls GetRunningTransactionData() and
> gets 301 (set as candidate_xmin).
> 2. The apply worker on Node-A requests the latest WAL flush position
> from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
> 4. The apply worker on Node-A continues applying changes, and applies
> the transactions up to remote (commit) LSN 7100.
> 5. Now that the apply worker on Node-A applied all changes smaller
> than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> 301 (candidate_xmin).
> 6. On Node-A, vacuum runs and physically removes the tuple that was
> deleted by T2.
> 
> Here, on Node-B, there might be a transition between LSN 7100 and 8000
> that might require the tuple that is deleted by T2.
> 
> For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> On Node-A, whether we detect "update_deleted" or "update_missing"
> still depends on when vacuum removes the tuple deleted by T2.

I think in this case, no matter whether we detect "update_deleted" or
"update_missing", the final data is the same, because T4's commit timestamp
should be later than T2's on Node A. In the case of "update_deleted", we will
compare the commit timestamp of the deleted tuple's xmax with T4's timestamp,
and T4 should win, which means we will convert the update into an insert and
apply it. Even if the dead tuple has already been removed and "update_missing"
is detected, the update will still be converted into an insert and applied.
So, the result is the same.

> 
> If applying T4 raises an "update_missing" (i.e. the changes are
> applied in the order of T2->T3->(vacuum)->T4), it converts into an
> insert, resulting in the table having a row with value = 3.
> 
> If applying T4 raises an "update_deleted" (i.e. the changes are
> applied in the order of T2->T3->T4->(vacuum)), it's skipped, resulting
> in the table having no row.
> 
> On the other hand, in this scenario, Node-B applies changes in the
> order of T3->T4->T2, and applying T2 raises a "delete_origin_differ",
> resulting in the table having a row with val=3 (assuming
> latest_committs_win is the default resolver for this confliction).
> 
> Please confirm this scenario as I might be missing something.

As explained above, I think the data can be consistent in this case as well.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 24, 2024 at 2:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >
> > As each apply worker needs a separate slot to retain deleted rows, the
> > requirement for slots will increase. The other possibility is to
> > maintain one slot by launcher or some other central process that
> > traverses all subscriptions, remember the ones marked with
> > retain_dead_rows (let's call this list as retain_sub_list). Then using
> > running_transactions get the oldest running_xact, and then get the
> > remote flush location from the other node (publisher node) and store
> > those as candidate values (candidate_xmin and
> > candidate_remote_wal_lsn) in slot. We can probably reuse existing
> > candidate variables of the slot. Next, we can check the remote_flush
> > locations from all the origins corresponding in retain_sub_list and if
> > all are ahead of candidate_remote_wal_lsn, we can update the slot's
> > xmin to candidate_xmin.
>
> Does it mean that we use one candiate_remote_wal_lsn in a slot for all
> subscriptions (in retain_sub_list)? IIUC candiate_remote_wal_lsn is a
> LSN of one of publishers, so other publishers could have completely
> different LSNs. How do we compare the candidate_remote_wal_lsn to
> remote_flush locations from all the origins?
>

This should be an array/list with one element per publisher. We can
copy candidate_xmin to the actual xmin only when the
candidate_remote_wal_lsns corresponding to all publishers have been
applied, i.e. their remote_flush locations (present in the origins) are
ahead. The advantages I see with this are that (a) it reduces the
number of slots required to achieve the retention of deleted rows for
conflict detection, and (b) in some cases we can avoid sending messages
to the publisher, because with this we only need to send a message to a
particular publisher once rather than from all the apply workers
corresponding to the same publisher node.
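
Roughly, the launcher-side check could then look like the per-publisher
loop below; every structure and helper in it is invented for
illustration, nothing here is existing code:

/*
 * Sketch only: per-publisher candidates for the single retention slot
 * maintained by the launcher.
 */
typedef struct PublisherCandidate
{
    Oid         publisherid;                /* identifies the publisher node */
    XLogRecPtr  candidate_remote_wal_lsn;   /* flush LSN obtained from it */
} PublisherCandidate;

static void
maybe_advance_shared_slot_xmin(TransactionId candidate_xmin,
                               PublisherCandidate *candidates,
                               int ncandidates)
{
    for (int i = 0; i < ncandidates; i++)
    {
        /* remote_flush_lsn_for_publisher() would look at the replication
         * origins of the subscriptions in retain_sub_list (hypothetical). */
        XLogRecPtr  applied =
            remote_flush_lsn_for_publisher(candidates[i].publisherid);

        if (applied < candidates[i].candidate_remote_wal_lsn)
            return;     /* this publisher's concurrent changes not yet applied */
    }

    /* All publishers have applied past their candidate LSNs locally. */
    advance_conflict_slot_xmin(candidate_xmin);
}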

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Sep 24, 2024 at 9:02 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for considering another idea.
>
> Thanks for reviewing the idea!
>
> >
> > On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Apart from the vacuum_defer_cleanup_age idea.
> > > >
> > >
> > > I think you meant to say vacuum_committs_age idea.
> > >
> > > > we’ve given more thought to our
> > > > approach for retaining dead tuples and have come up with another idea
> > that can
> > > > reliably detect conflicts without requiring users to choose a wise value for
> > > > the vacuum_committs_age. This new idea could also reduce the
> > performance
> > > > impact. Thanks a lot to Amit for off-list discussion.
> > > >
> > > > The concept of the new idea is that, the dead tuples are only useful to
> > detect
> > > > conflicts when applying *concurrent* transactions from remotes. Any
> > subsequent
> > > > UPDATE from a remote node after removing the dead tuples should have a
> > later
> > > > timestamp, meaning it's reasonable to detect an update_missing scenario
> > and
> > > > convert the UPDATE to an INSERT when applying it.
> > > >
> > > > To achieve above, we can create an additional replication slot on the
> > > > subscriber side, maintained by the apply worker. This slot is used to retain
> > > > the dead tuples. The apply worker will advance the slot.xmin after
> > confirming
> > > > that all the concurrent transaction on publisher has been applied locally.
> >
> > The replication slot used for this purpose will be a physical one or
> > logical one? And IIUC such a slot doesn't need to retain WAL but if we
> > do that, how do we advance the LSN of the slot?
>
> I think it would be a logical slot. We can keep the
> restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain the
> WALs for decoding purpose.
>

As per my understanding, one of the main reasons to keep it logical is
to allow syncing it to standbys (slotsync functionality). It is
required because, after promotion, the subscriptions replicated to the
standby could be enabled to make it a subscriber. If that is not
possible for any reason, then we can consider it to be a physical slot
as well.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Thank you for considering another idea.
>
> Thanks for reviewing the idea!
>
> >
> > On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Apart from the vacuum_defer_cleanup_age idea.
> > > >
> > >
> > > I think you meant to say vacuum_committs_age idea.
> > >
> > > > we’ve given more thought to our
> > > > approach for retaining dead tuples and have come up with another idea
> > that can
> > > > reliably detect conflicts without requiring users to choose a wise value for
> > > > the vacuum_committs_age. This new idea could also reduce the
> > performance
> > > > impact. Thanks a lot to Amit for off-list discussion.
> > > >
> > > > The concept of the new idea is that, the dead tuples are only useful to
> > detect
> > > > conflicts when applying *concurrent* transactions from remotes. Any
> > subsequent
> > > > UPDATE from a remote node after removing the dead tuples should have a
> > later
> > > > timestamp, meaning it's reasonable to detect an update_missing scenario
> > and
> > > > convert the UPDATE to an INSERT when applying it.
> > > >
> > > > To achieve above, we can create an additional replication slot on the
> > > > subscriber side, maintained by the apply worker. This slot is used to retain
> > > > the dead tuples. The apply worker will advance the slot.xmin after
> > confirming
> > > > that all the concurrent transaction on publisher has been applied locally.
> >
> > The replication slot used for this purpose will be a physical one or
> > logical one? And IIUC such a slot doesn't need to retain WAL but if we
> > do that, how do we advance the LSN of the slot?
>
> I think it would be a logical slot. We can keep the
> restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain the
> WALs for decoding purpose.
>
> >
> > > > 2) the apply worker send a new message to walsender to request the latest
> > wal
> > > > flush position(GetFlushRecPtr) on publisher, and save it to
> > > > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> > message or
> > > > extend the existing keepalive message(e,g extends the requestReply bit in
> > > > keepalive message to add a 'request_wal_position' value)
> >
> > The apply worker sends a keepalive message when it didn't receive
> > anything more than wal_receiver_timeout / 2. So in a very active
> > system, we cannot rely on piggybacking new information to the
> > keepalive messages to get the latest remote flush LSN.
>
> Right. I think we need to send this new message at some interval independent of
> wal_receiver_timeout.
>
> >
> > > > 3) The apply worker can continue to apply changes. After applying all the
> > WALs
> > > > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > > > slot.xmin to 'candidate_xmin'.
> > > >
> > > > This approach ensures that dead tuples are not removed until all
> > concurrent
> > > > transactions have been applied. It can be effective for both bidirectional
> > and
> > > > non-bidirectional replication cases.
> > > >
> > > > We could introduce a boolean subscription option (retain_dead_tuples) to
> > > > control whether this feature is enabled. Each subscription intending to
> > detect
> > > > update-delete conflicts should set retain_dead_tuples to true.
> > > >
> >
> > I'm still studying this idea but let me confirm the following scenario.
> >
> > Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> > XIDs and commit LSNs of T2 and T3 are the following:
> >
> > Node A
> >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> >
> > Node B
> >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > commit-LSN:5000
> >
> > Further suppose that it's now 10:05 AM, and the latest XID and the
> > latest flush WAL position of Node-A and Node-B are following:
> >
> > Node A
> >   current XID: 300
> >   latest flush LSN; 3000
> >
> > Node B
> >   current XID: 700
> >   latest flush LSN: 7000
> >
> > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > (i.e., the logical replication is delaying for 5 min).
> >
> > Consider the following scenario:
> >
> > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > gets 301 (set as candidate_xmin).
> > 2. The apply worker on Node-A requests the latest WAL flush position
> > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
> > 4. The apply worker on Node-A continues applying changes, and applies
> > the transactions up to remote (commit) LSN 7100.
> > 5. Now that the apply worker on Node-A applied all changes smaller
> > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > 301 (candidate_xmin).
> > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > deleted by T2.
> >
> > Here, on Node-B, there might be a transition between LSN 7100 and 8000
> > that might require the tuple that is deleted by T2.
> >
> > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > On Node-A, whether we detect "update_deleted" or "update_missing"
> > still depends on when vacuum removes the tuple deleted by T2.
>
> I think in this case, no matter we detect "update_delete" or "update_missing",
> the final data is the same. Because T4's commit timestamp should be later than
> T2 on node A, so in the case of "update_deleted", it will compare the commit
> timestamp of the deleted tuple's xmax with T4's timestamp, and T4 should win,
> which means we will convert the update into insert and apply. Even if the
> deleted tuple is deleted and "update_missing" is detected, the update will
> still be converted into insert and applied. So, the result is the same.

The "latest_timestamp_wins" is the default resolution method for
"update_deleted"? When I checked the wiki page[1], the "skip" was the
default solution method for that.

Regards,

[1] https://wiki.postgresql.org/wiki/Conflict_Detection_and_Resolution#Defaults

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From:
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 24, 2024 2:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > > I'm still studying this idea but let me confirm the following scenario.
> > >
> > > Suppose both Node-A and Node-B have the same row (1,1) in table t,
> > > and XIDs and commit LSNs of T2 and T3 are the following:
> > >
> > > Node A
> > >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100,
> commit-LSN:1000
> > >
> > > Node B
> > >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > > commit-LSN:5000
> > >
> > > Further suppose that it's now 10:05 AM, and the latest XID and the
> > > latest flush WAL position of Node-A and Node-B are following:
> > >
> > > Node A
> > >   current XID: 300
> > >   latest flush LSN; 3000
> > >
> > > Node B
> > >   current XID: 700
> > >   latest flush LSN: 7000
> > >
> > > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > > (i.e., the logical replication is delaying for 5 min).
> > >
> > > Consider the following scenario:
> > >
> > > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > > gets 301 (set as candidate_xmin).
> > > 2. The apply worker on Node-A requests the latest WAL flush position
> > > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now
> 8000.
> > > 4. The apply worker on Node-A continues applying changes, and
> > > applies the transactions up to remote (commit) LSN 7100.
> > > 5. Now that the apply worker on Node-A applied all changes smaller
> > > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > > 301 (candidate_xmin).
> > > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > > deleted by T2.
> > >
> > > Here, on Node-B, there might be a transition between LSN 7100 and
> > > 8000 that might require the tuple that is deleted by T2.
> > >
> > > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > > On Node-A, whether we detect "update_deleted" or "update_missing"
> > > still depends on when vacuum removes the tuple deleted by T2.
> >
> > I think in this case, no matter we detect "update_delete" or
> > "update_missing", the final data is the same. Because T4's commit
> > timestamp should be later than
> > T2 on node A, so in the case of "update_deleted", it will compare the
> > commit timestamp of the deleted tuple's xmax with T4's timestamp, and
> > T4 should win, which means we will convert the update into insert and
> > apply. Even if the deleted tuple is deleted and "update_missing" is
> > detected, the update will still be converted into insert and applied. So, the
> result is the same.
> 
> The "latest_timestamp_wins" is the default resolution method for
> "update_deleted"? When I checked the wiki page[1], the "skip" was the default
> solution method for that.

Right, I think the wiki needs some update.

I think using 'skip' as the default for update_deleted could easily cause
data divergence when the dead tuple was deleted by an old transaction while
the UPDATE has a newer timestamp, as in the case you mentioned. It's
necessary to follow the last-update-wins strategy when the incoming update
has a later timestamp, which means converting the update to an insert.
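
For illustration, here is a minimal sketch of the decision described above
(the helper and type names are invented for this sketch and are not the
actual patch code). Under last-update-wins, a remote UPDATE with a newer
commit timestamp is applied as an INSERT whether the conflict is reported
as update_deleted or update_missing:

```
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
    APPLY_AS_INSERT,            /* convert the remote UPDATE to an INSERT */
    SKIP_REMOTE_UPDATE          /* local DELETE wins, drop the remote UPDATE */
} UpdateResolution;

/* Resolution for a remote UPDATE whose target row is not found live locally. */
static UpdateResolution
resolve_missing_target(bool dead_tuple_found,
                       int64_t local_delete_ts,   /* commit ts of the delete */
                       int64_t remote_update_ts)  /* commit ts of the update */
{
    if (!dead_tuple_found)
        return APPLY_AS_INSERT;        /* update_missing */

    /* update_deleted: compare the delete's and the update's timestamps. */
    if (remote_update_ts > local_delete_ts)
        return APPLY_AS_INSERT;        /* the remote UPDATE is newer */

    return SKIP_REMOTE_UPDATE;         /* the local DELETE is newer */
}
```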

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From:
Masahiko Sawada
Date:
On Tue, Sep 24, 2024 at 12:14 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 24, 2024 2:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > > I'm still studying this idea but let me confirm the following scenario.
> > > >
> > > > Suppose both Node-A and Node-B have the same row (1,1) in table t,
> > > > and XIDs and commit LSNs of T2 and T3 are the following:
> > > >
> > > > Node A
> > > >   T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100,
> > commit-LSN:1000
> > > >
> > > > Node B
> > > >   T3: UPDATE t SET value = 2 WHERE id 1 (10:01 AM) XID:500,
> > > > commit-LSN:5000
> > > >
> > > > Further suppose that it's now 10:05 AM, and the latest XID and the
> > > > latest flush WAL position of Node-A and Node-B are following:
> > > >
> > > > Node A
> > > >   current XID: 300
> > > >   latest flush LSN; 3000
> > > >
> > > > Node B
> > > >   current XID: 700
> > > >   latest flush LSN: 7000
> > > >
> > > > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > > > (i.e., the logical replication is delaying for 5 min).
> > > >
> > > > Consider the following scenario:
> > > >
> > > > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > > > gets 301 (set as candidate_xmin).
> > > > 2. The apply worker on Node-A requests the latest WAL flush position
> > > > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > > > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now
> > 8000.
> > > > 4. The apply worker on Node-A continues applying changes, and
> > > > applies the transactions up to remote (commit) LSN 7100.
> > > > 5. Now that the apply worker on Node-A applied all changes smaller
> > > > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > > > 301 (candidate_xmin).
> > > > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > > > deleted by T2.
> > > >
> > > > Here, on Node-B, there might be a transition between LSN 7100 and
> > > > 8000 that might require the tuple that is deleted by T2.
> > > >
> > > > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > > > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > > > On Node-A, whether we detect "update_deleted" or "update_missing"
> > > > still depends on when vacuum removes the tuple deleted by T2.
> > >
> > > I think in this case, no matter we detect "update_delete" or
> > > "update_missing", the final data is the same. Because T4's commit
> > > timestamp should be later than
> > > T2 on node A, so in the case of "update_deleted", it will compare the
> > > commit timestamp of the deleted tuple's xmax with T4's timestamp, and
> > > T4 should win, which means we will convert the update into insert and
> > > apply. Even if the deleted tuple is deleted and "update_missing" is
> > > detected, the update will still be converted into insert and applied. So, the
> > result is the same.
> >
> > The "latest_timestamp_wins" is the default resolution method for
> > "update_deleted"? When I checked the wiki page[1], the "skip" was the default
> > solution method for that.
>
> Right, I think the wiki needs some update.
>
> I think using 'skip' as default for update_delete could easily cause data
> divergence when the dead tuple is deleted by an old transaction while the
> UPDATE has a newer timestamp like the case you mentioned. It's necessary to
> follow the last update win strategy when the incoming update has later
> timestamp, which is to convert update to insert.

Right. If "latest_timestamp_wins" is the default resolution for
"update_deleted", I think your idea works fine unless I'm missing
corner cases.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From:
Masahiko Sawada
Date:
On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Apart from the vacuum_defer_cleanup_age idea.
> >
>
> I think you meant to say vacuum_committs_age idea.
>
> > we’ve given more thought to our
> > approach for retaining dead tuples and have come up with another idea that can
> > reliably detect conflicts without requiring users to choose a wise value for
> > the vacuum_committs_age. This new idea could also reduce the performance
> > impact. Thanks a lot to Amit for off-list discussion.
> >
> > The concept of the new idea is that, the dead tuples are only useful to detect
> > conflicts when applying *concurrent* transactions from remotes. Any subsequent
> > UPDATE from a remote node after removing the dead tuples should have a later
> > timestamp, meaning it's reasonable to detect an update_missing scenario and
> > convert the UPDATE to an INSERT when applying it.
> >
> > To achieve above, we can create an additional replication slot on the
> > subscriber side, maintained by the apply worker. This slot is used to retain
> > the dead tuples. The apply worker will advance the slot.xmin after confirming
> > that all the concurrent transaction on publisher has been applied locally.
> >
> > The process of advancing the slot.xmin could be:
> >
> > 1) the apply worker call GetRunningTransactionData() to get the
> > 'oldestRunningXid' and consider this as 'candidate_xmin'.
> > 2) the apply worker send a new message to walsender to request the latest wal
> > flush position(GetFlushRecPtr) on publisher, and save it to
> > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
> > extend the existing keepalive message(e,g extends the requestReply bit in
> > keepalive message to add a 'request_wal_position' value)
> > 3) The apply worker can continue to apply changes. After applying all the WALs
> > upto 'candidate_remote_wal_lsn', the apply worker can then advance the
> > slot.xmin to 'candidate_xmin'.
> >
> > This approach ensures that dead tuples are not removed until all concurrent
> > transactions have been applied. It can be effective for both bidirectional and
> > non-bidirectional replication cases.
> >
> > We could introduce a boolean subscription option (retain_dead_tuples) to
> > control whether this feature is enabled. Each subscription intending to detect
> > update-delete conflicts should set retain_dead_tuples to true.
> >
>
> As each apply worker needs a separate slot to retain deleted rows, the
> requirement for slots will increase. The other possibility is to
> maintain one slot by launcher or some other central process that
> traverses all subscriptions, remember the ones marked with
> retain_dead_rows (let's call this list as retain_sub_list). Then using
> running_transactions get the oldest running_xact, and then get the
> remote flush location from the other node (publisher node) and store
> those as candidate values (candidate_xmin and
> candidate_remote_wal_lsn) in slot. We can probably reuse existing
> candidate variables of the slot. Next, we can check the remote_flush
> locations from all the origins corresponding in retain_sub_list and if
> all are ahead of candidate_remote_wal_lsn, we can update the slot's
> xmin to candidate_xmin.

Yeah, I think such an idea to reduce the number of required slots
would be necessary.

>
> I think in the above idea we can an optimization to combine the
> request for remote wal LSN from different subscriptions pointing to
> the same node to avoid sending multiple requests to the same node. I
> am not sure if using pg_subscription.subconninfo is sufficient for
> this, if not we can probably leave this optimization.
>
> If this idea is feasible then it would reduce the number of slots
> required to retain the deleted rows but the launcher needs to get the
> remote wal location corresponding to each publisher node. There are
> two ways to achieve that (a) launcher requests one of the apply
> workers corresponding to subscriptions pointing to the same publisher
> node to get this information; (b) launcher launches another worker to
> get the remote wal flush location.

I think the remote WAL flush location is requested using a replication
protocol. Therefore, if a new worker is responsible for asking for the WAL
flush location from multiple publishers (like idea (b)), a corresponding
process would need to be launched on the publisher side and logical
replication would also need to start on each connection. I think it would
be better to get the remote WAL flush location using the existing logical
replication connection (i.e., between the logical walsender and the apply
worker), and advertise the locations in shared memory. Then, the central
process that holds the slot to retain the deleted row versions traverses
them and increases slot.xmin if possible.

The cost of requesting the remote WAL flush location would not be huge if
we don't ask for it very frequently. So we can probably start by having
each apply worker (in the retain_sub_list) ask for the remote WAL flush
location, and leave the optimization of avoiding duplicate requests for the
same publisher for later.
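
As a rough sketch of the division of work being discussed (the structure
and function names are invented for illustration and the logic is
simplified, e.g. XID wraparound is ignored; this is not the actual patch
code): each apply worker advertises its candidate values and progress in
shared memory, and the central process that owns the retention slot
advances slot.xmin only once every worker has applied past its candidate
remote LSN.

```
#include <stdbool.h>
#include <stdint.h>

typedef struct ApplyWorkerProgress
{
    uint32_t candidate_xmin;        /* oldest running XID when requested */
    uint64_t candidate_remote_lsn;  /* publisher flush LSN at that time */
    uint64_t last_applied_lsn;      /* remote commit LSN applied and flushed */
} ApplyWorkerProgress;

/* Run by the central process (e.g. the launcher) holding the single slot. */
static bool
maybe_advance_slot_xmin(const ApplyWorkerProgress *workers, int nworkers,
                        uint32_t *slot_xmin)
{
    uint32_t new_xmin = UINT32_MAX;

    for (int i = 0; i < nworkers; i++)
    {
        /* A worker has not yet applied all concurrent transactions. */
        if (workers[i].last_applied_lsn < workers[i].candidate_remote_lsn)
            return false;

        if (workers[i].candidate_xmin < new_xmin)
            new_xmin = workers[i].candidate_xmin;
    }

    /* Dead tuples older than new_xmin are no longer needed for detection. */
    if (nworkers > 0 && new_xmin > *slot_xmin)
    {
        *slot_xmin = new_xmin;
        return true;
    }
    return false;
}
```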

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From:
Amit Kapila
Date:
On Mon, Sep 30, 2024 at 12:02 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, September 25, 2024 2:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I think the remote wal flush location is asked using a replication protocol.
> > Therefore, if a new worker is responsible for asking wal flush location from
> > multiple publishers (like the idea (b)), the corresponding process would need
> > to be launched on publisher sides and logical replication would also need to
> > start on each connection. I think it would be better to get the remote wal flush
> > location using the existing logical replication connection (i.e., between the
> > logical wal sender and the apply worker), and advertise the locations on the
> > shared memory. Then, the central process who holds the slot to retain the
> > deleted row versions traverses them and increases slot.xmin if possible.
> >
> > The cost of requesting the remote wal flush location would not be huge if we
> > don't ask it very frequently. So probably we can start by having each apply
> > worker (in the retain_sub_list) ask the remote wal flush location and can leave
> > the optimization of avoiding sending the request for the same publisher.
>
> Agreed. Here is the POC patch set based on this idea.
>
> The implementation is as follows:
>
> A subscription option is added to allow users to specify whether dead
> tuples on the subscriber, which are useful for detecting update_deleted
> conflicts, should be retained. The default setting is false. If set to true,
> the detection of update_deleted will be enabled,
>

I find the option name retain_dead_tuples a bit misleading because, from
the name alone, one can't make out its purpose. It would be better to name
it detect_update_deleted or something along those lines.

> and an additional replication
> slot named pg_conflict_detection will be created on the subscriber to prevent
> dead tuples from being removed. Note that if multiple subscriptions on one node
> enable this option, only one replication slot will be created.
>

In general, we should have done this by default, but as detecting the
update_deleted type of conflict has some overhead in terms of retaining
dead tuples for a longer time, having an option seems reasonable. But I
suggest keeping this as a separate, last patch. If we can make the core
idea work by default, then we can enable it via an option at the end.

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From:
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

Thanks for updating the patch! Here are my comments.
They are not organized by the file containing the change, and the ordering
may be random.

1.
```
+       and <link
linkend="sql-createsubscription-params-with-detect-update-deleted"><literal>detect_conflict</literal></link>
+       are enabled.
```
"detect_conflict" still exists, it should be "detect_update_deleted".

2. maybe_advance_nonremovable_xid
```
+        /* Send a wal position request message to the server */
+        walrcv_send(LogRepWorkerWalRcvConn, "x", sizeof(uint8))
```
I think the character was chosen for PoC purposes, so it's about time we change it.
How about:

- 'W', because it requests the WAL location, or
- 'S', because it is associated with the 's' message.

3. maybe_advance_nonremovable_xid
```
+        if (!AllTablesyncsReady())
+            return;
```
If we do not update oldest_nonremovable_xid during the sync, why do we send
the status message? I feel we can return in any phase if !AllTablesyncsReady().

4. advance_conflict_slot_xmin
```
+            ReplicationSlotCreate(CONFLICT_DETECTION_SLOT, false,
+                                  RS_PERSISTENT, false, false, false);
```
Hmm. You said the slot would be logical, but now it is physical. Which is correct?

5. advance_conflict_slot_xmin
```
+            xmin_horizon = GetOldestSafeDecodingTransactionId(true);
```
Since the slot won't do logical decoding, we do not have to use the
oldest-safe-decoding XID. I feel it is OK to use the latest XID.

6. advance_conflict_slot_xmin
```
+    /* No need to update xmin if the slot has been invalidated */
+    if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
```
I feel the slot won't be invalidated. According to
InvalidatePossiblyObsoleteSlot(), a physical slot cannot be invalidated if
it has an invalid restart_lsn.

7. ApplyLauncherMain
```
+            retain_dead_tuples |= sub->detectupdatedeleted;
```
Can you tell me why it must be updated even if the sub is disabled?

8. ApplyLauncherMain

If a subscription with detect_update_deleted = true exists but wal_receiver_status_interval = 0,
the slot won't be advanced and dead tuples are retained forever... is that right? Can we
avoid that somehow?

9. FindMostRecentlyDeletedTupleInfo

It looks to me like the scan does not use an index even if one exists, but I feel it should.
Am I missing something, or is there a reason?

[1]:
https://www.postgresql.org/message-id/OS0PR01MB5716E0A283D1B66954CDF5A694682%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Conflict detection for update_deleted in logical replication

From:
"Zhijie Hou (Fujitsu)"
Date:
On Friday, October 11, 2024 4:35 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> 
> Attach the V4 patch set which addressed above comments.
> 

While reviewing the patch, I noticed that the current design cannot work in
a non-bidirectional cluster (publisher -> subscriber) when the publisher is
also a physical standby. (We recently added support for logical decoding on
a physical standby, so it's possible to use a physical standby as a logical
publisher.)

The cluster looks like:

    physical primary -> physical standby (also publisher) -> logical subscriber (detect_update_deleted)

The issue arises because the physical standby (acting as the publisher) might
lag behind its primary. As a result, the logical walsender on the standby might
not be able to get the latest WAL position when requested by the logical
subscriber. We can only get the WAL replay position, but there may be more WAL
still being replicated from the primary, and that WAL could contain transactions
with older commit timestamps. (Note that a transaction has the same commit
timestamp on both the primary and the standby.)

So, the logical walsender might send an outdated WAL position as feedback.
This, in turn, can cause the replication slot on the subscriber to advance
prematurely, leading to the unwanted removal of dead tuples. As a result, the
apply worker may fail to correctly detect update_deleted conflicts.

We thought of a few options to fix this:

1) Add a Time-Based Subscription Option:

We could add a new time-based option for subscriptions, such as
retain_dead_tuples = '5s'. In the logical launcher, after obtaining the
candidate XID, the launcher will wait for the specified time before advancing
the slot.xmin. This ensures that deleted tuples are retained for at least the
duration defined by this new option.

This approach is designed to simulate the functionality of the GUC
(vacuum_committs_age), but with a simpler implementation that does not impact
vacuum performance. We can maintain both this time-based method and the current
automatic method. If a user does not specify the time-based option, we will
continue using the existing approach to retain dead tuples until all concurrent
transactions from the remote node have been applied.

2) Modification to the Logical Walsender

On the logical walsender, which runs on a physical standby, we can open an
additional connection to the physical primary to obtain the latest WAL
position. This position would then be sent as feedback to the logical
subscriber.

A potential concern is that this requires the walsender to use the walreceiver
API, which may seem a bit unnatural. Also, it starts an additional walsender
process on the primary, as the logical walsender on the physical standby will
need to communicate with that walsender to fetch the WAL position.

3) Documentation of Restrictions

As an alternative, we could simply document the restriction that detecting
update_deleted is not supported if the publisher is also acting as a physical
standby.


Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From:
Amit Kapila
Date:
On Mon, Oct 14, 2024 at 9:09 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> While reviewing the patch, I noticed that the current design could not work in
> a non-bidirectional cluster (publisher -> subscriber) when the publisher is
> also a physical standby. (We supported logical decoding on a physical standby
> recently, so it's possible to take a physical standby as a logical publisher).
>
> The cluster looks like:
>
>         physical primary -> physical standby (also publisher) -> logical subscriber (detect_update_deleted)
>
> The issue arises because the physical standby (acting as the publisher) might
> lag behind its primary. As a result, the logical walsender on the standby might
> not be able to get the latest WAL position when requested by the logical
> subscriber. We can only get the WAL replay position but there may be more WALs
> that are being replicated from the primary and those WALs could have older
> commit timestamp. (Note that transactions on both primary and standby have
> the same commit timestamp).
>
> So, the logical walsender might send an outdated WAL position as feedback.
> This, in turn, can cause the replication slot on the subscriber to advance
> prematurely, leading to the unwanted removal of dead tuples. As a result, the
> apply worker may fail to correctly detect update-delete conflicts.
>
> We thought of few options to fix this:
>
> 1) Add a Time-Based Subscription Option:
>
> We could add a new time-based option for subscriptions, such as
> retain_dead_tuples = '5s'. In the logical launcher, after obtaining the
> candidate XID, the launcher will wait for the specified time before advancing
> the slot.xmin. This ensures that deleted tuples are retained for at least the
> duration defined by this new option.
>
> This approach is designed to simulate the functionality of the GUC
> (vacuum_committs_age), but with a simpler implementation that does not impact
> vacuum performance. We can maintain both this time-based method and the current
> automatic method. If a user does not specify the time-based option, we will
> continue using the existing approach to retain dead tuples until all concurrent
> transactions from the remote node have been applied.
>
> 2) Modification to the Logical Walsender
>
> On the logical walsender, which is as a physical standby, we can build an
> additional connection to the physical primary to obtain the latest WAL
> position. This position will then be sent as feedback to the logical
> subscriber.
>
> A potential concern is that this requires the walsender to use the walreceiver
> API, which may seem a bit unnatural. And, it starts an additional walsender
> process on the primary, as the logical walsender on the physical standby will
> need to communicate with this walsender to fetch the WAL position.
>

This idea is worth considering, but I think it may not be a good
approach if the physical standby is cascading. We would need to restrict
update_deleted conflict detection if the standby is cascading, right?

The other approach is that we send current_timestamp from the
subscriber and somehow check whether the physical standby has applied the
commit LSN up to that commit timestamp; if so, it can send that WAL
position to the subscriber, otherwise it waits for the WAL to be applied.
If we do this, then we don't need to add a restriction for cascaded
physical standbys. I think the subscriber anyway needs to wait for such an
LSN to be applied on the standby before advancing the xmin, even if we get
it from the primary. This is because the subscriber can only apply and
flush the WAL once it has been applied on the standby. Am I missing
something?

This approach has the disadvantage that we are relying on clocks being
synced on both nodes, which we anyway need for conflict resolution, as
discussed in the thread [1]. We also need to consider the commit timestamp
and LSN inversion issue discussed in another thread [2] if we want to
pursue this approach, because we may miss an LSN that has an earlier
timestamp.

> 3) Documentation of Restrictions
>
> As an alternative, we could simply document the restriction that detecting
> update_delete is not supported if the publisher is also acting as a physical
> standby.
>

If we don't want to go for something along the lines of the approach
mentioned in (2), then I think we can do a combination of (1) and (3),
where we error out if the user has not provided retain_dead_tuples and the
publisher is a physical standby.

[1] - https://www.postgresql.org/message-id/CABdArM4%3D152B9PoyF4kggQ4LniYtjBCdUjL%3DqBwD-jcogP2BPQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAJpy0uBxEJnabEp3JS%3Dn9X19Vx2ZK3k5AR7N0h-cSMtOwYV3fA%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From:
Amit Kapila
Date:
On Tue, Oct 15, 2024 at 5:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Oct 14, 2024 at 9:09 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > We thought of few options to fix this:
> >
> > 1) Add a Time-Based Subscription Option:
> >
> > We could add a new time-based option for subscriptions, such as
> > retain_dead_tuples = '5s'. In the logical launcher, after obtaining the
> > candidate XID, the launcher will wait for the specified time before advancing
> > the slot.xmin. This ensures that deleted tuples are retained for at least the
> > duration defined by this new option.
> >
> > This approach is designed to simulate the functionality of the GUC
> > (vacuum_committs_age), but with a simpler implementation that does not impact
> > vacuum performance. We can maintain both this time-based method and the current
> > automatic method. If a user does not specify the time-based option, we will
> > continue using the existing approach to retain dead tuples until all concurrent
> > transactions from the remote node have been applied.
> >
> > 2) Modification to the Logical Walsender
> >
> > On the logical walsender, which is as a physical standby, we can build an
> > additional connection to the physical primary to obtain the latest WAL
> > position. This position will then be sent as feedback to the logical
> > subscriber.
> >
> > A potential concern is that this requires the walsender to use the walreceiver
> > API, which may seem a bit unnatural. And, it starts an additional walsender
> > process on the primary, as the logical walsender on the physical standby will
> > need to communicate with this walsender to fetch the WAL position.
> >
>
> This idea is worth considering, but I think it may not be a good
> approach if the physical standby is cascading. We need to restrict the
> update_delete conflict detection, if the standby is cascading, right?
>
> The other approach is that we send current_timestamp from the
> subscriber and somehow check if the physical standby has applied
> commit_lsn up to that commit_ts, if so, it can send that WAL position
> to the subscriber, otherwise, wait for it to be applied. If we do this
> then we don't need to add a restriction for cascaded physical standby.
> I think the subscriber anyway needs to wait for such an LSN to be
> applied on standby before advancing the xmin even if we get it from
> the primary. This is because the subscriber can only be able to apply
> and flush the WAL once it is applied on the standby. Am, I missing
> something?
>
> This approach has a disadvantage that we are relying on clocks to be
> synced on both nodes which we anyway need for conflict resolution as
> discussed in the thread [1]. We also need to consider the Commit
> Timestamp and LSN inversion issue as discussed in another thread [2]
> if we want to pursue this approach because we may miss an LSN that has
> a prior timestamp.
>

The problem due to the commit timestamp and LSN inversion is that the
standby may not consider the WAL LSN from an earlier timestamp, which
could lead to the removal of required dead rows on the subscriber.

The other problem pointed out by Hou-San offlist due to the commit
timestamp and LSN inversion is that we could miss sending the WAL LSN
that the subscriber requires to retain dead rows for update_deleted
conflict detection. For example, consider the following two-node,
bidirectional setup:

Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1); ts=10.00 AM
  T2: DELETE FROM t WHERE id = 1; ts=10.02 AM

Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM

Say the subscription is created with retain_dead_tuples = true/false.

After executing T2, the apply worker on Node A will check the latest
WAL flush location on Node B. By that time, T3 should have finished, so
the xmin will be advanced only after applying the WAL that is later than
T3. So, the dead tuple will not be removed before applying T3, which means
update_deleted can be detected.

As there is a gap between when we acquire the commit timestamp and when we
reserve the commit LSN, it is possible that T3 has not yet flushed its WAL
even though it committed earlier than T2. If this happens, then we won't be
able to detect the update_deleted conflict reliably.

Now, one simpler idea is to acquire the commit timestamp and reserve the
WAL (LSN) under the same spinlock in ReserveXLogInsertLocation(), but that
could be costly, as discussed in the thread [1]. The other, more localized
solution is to acquire the timestamp after reserving the commit WAL LSN,
outside the lock, which would solve this particular problem.

[1] - https://www.postgresql.org/message-id/CAJpy0uBxEJnabEp3JS%3Dn9X19Vx2ZK3k5AR7N0h-cSMtOwYV3fA%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From:
Amit Kapila
Date:
On Fri, Oct 11, 2024 at 2:04 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Attach the V4 patch set which addressed above comments.
>

A few minor comments:
1.
+ * Retaining the dead tuples for this period is sufficient because any
+ * subsequent transaction from the publisher will have a later timestamp.
+ * Therefore, it is acceptable if dead tuples are removed by vacuum and an
+ * update_missing conflict is detected, as the correct resolution for the
+ * last-update-wins strategy in this case is to convert the UPDATE to an INSERT
+ * and apply it anyway.
+ *
+ * The 'remote_wal_pos' will be reset after sending a new request to walsender.
+ */
+static void
+maybe_advance_nonremovable_xid(XLogRecPtr *remote_wal_pos,
+    DeadTupleRetainPhase *phase)

We should cover the key point of retaining dead tuples, which is to
avoid wrongly converting updates to inserts (by treating the conflict as
update_missing), in the comments above and also in the commit message.

2. In maybe_advance_nonremovable_xid(), all three phases are handled by
separate if blocks, but as per my understanding the phase value will be
unique within one call to the function. So, shouldn't it be handled with
else if?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From:
Peter Smith
Date:
Hi Hou-san, here are my review comments for patch v5-0001.

======
General

1.
Sometimes in the commit message and code comments the patch refers to
"transaction id" and other times to "transaction ID". The patch should
use the same wording everywhere.

======
Commit message.

2.
"While for concurrent remote transactions with earlier timestamps,..."

I think this means:
"But, for concurrent remote transactions with earlier timestamps than
the DELETE,..."

Maybe it is clearer expressed this way.

~~~

3.
... the resolution would be to convert the update to an insert.

Change this to uppercase like done elsewhere:
"... the resolution would be to convert the UPDATE to an INSERT.

======
doc/src/sgml/protocol.sgml

4.
            +       <varlistentry
id="protocol-replication-primary-wal-status-update">
+        <term>Primary WAL status update (B)</term>
+        <listitem>
+         <variablelist>
+          <varlistentry>
+           <term>Byte1('s')</term>
+           <listitem>
+            <para>
+             Identifies the message as a primary WAL status update.
+            </para>
+           </listitem>
+          </varlistentry>

I felt it would be better if this is described as just a "Primary
status update" instead of a "Primary WAL status update". Doing this
makes it more flexible in case there is a future requirement to put
more status values in here which may not be strictly WAL related.

~~~

5.
+       <varlistentry id="protocol-replication-standby-wal-status-request">
+        <term>Standby WAL status request (F)</term>
+        <listitem>
+         <variablelist>
+          <varlistentry>
+           <term>Byte1('W')</term>
+           <listitem>
+            <para>
+             Identifies the message as a request for the WAL status
on the primary.
+            </para>
+           </listitem>
+          </varlistentry>
+         </variablelist>
+        </listitem>
+       </varlistentry>

5a.
Ditto the previous comment #4. Perhaps you should just call this a
"Primary status request".

~

5b.
Also, the letter 'W' seems to have been chosen because of WAL. But it might be
more flexible if those identifiers were more generic.

e.g.
's' = the request for primary status update
'S' = the primary status update

======
src/backend/replication/logical/worker.c

6.
+ else if (c == 's')
+ {
+ TimestampTz timestamp;
+
+ remote_lsn = pq_getmsgint64(&s);
+ timestamp = pq_getmsgint64(&s);
+
+ maybe_advance_nonremovable_xid(&remote_lsn, &phase);
+ UpdateWorkerStats(last_received, timestamp, false);
+ }

Since there's no equivalent #define or enum value, IMO it is too hard
to know the intent of this code without already knowing the meaning of
the magic letter 's'. At least there could be a comment here to
explain that this is for handling an incoming "Primary status update"
message.
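
For example, a sketch of the kind of change being suggested (the constant
name is invented for illustration and is not part of the patch; the body of
the branch mirrors the quoted code):

```
/* Protocol byte for an incoming "Primary status update" message. */
#define PRIMARY_STATUS_UPDATE_MSG 's'

/* ... earlier branches of the message-type dispatch ... */
else if (c == PRIMARY_STATUS_UPDATE_MSG)
{
    /* Handle a "Primary status update" message from the walsender. */
    TimestampTz timestamp;

    remote_lsn = pq_getmsgint64(&s);
    timestamp = pq_getmsgint64(&s);

    maybe_advance_nonremovable_xid(&remote_lsn, &phase);
    UpdateWorkerStats(last_received, timestamp, false);
}
```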

~~~

maybe_advance_nonremovable_xid:

7.
+ * The oldest_nonremovable_xid is maintained in shared memory to prevent dead
+ * rows from being removed prematurely when the apply worker still needs them
+ * to detect update-delete conflicts.

/update-delete/update_deleted/

~

8.
+ * applied and flushed locally. The process involves:
+ *
+ * DTR_REQUEST_WALSENDER_WAL_POS - Call GetOldestActiveTransactionId() to get
+ * the candidate xmin and send a message to request the remote WAL position
+ * from the walsender.
+ *
+ * DTR_WAIT_FOR_WALSENDER_WAL_POS - Wait for receiving the WAL position from
+ * the walsender.
+ *
+ * DTR_WAIT_FOR_LOCAL_FLUSH - Advance the non-removable transaction ID if the
+ * current flush location has reached or surpassed the received WAL position.

8a.
This part would be easier to read if those 3 phases were indented from
the rest of this function comment.

~

8b.
/Wait for receiving/Wait to receive/

~

9.
+ * Retaining the dead tuples for this period is sufficient for ensuring
+ * eventual consistency using last-update-wins strategy, which involves
+ * converting an UPDATE to an INSERT and applying it if remote transactions

The commit message referred to a "latest_timestamp_wins". I suppose
that is the same as what this function comment called
"last-update-wins". The patch should use consistent terminology.

It would be better if the commit message and (parts of) this function
comment were just cut/pasted to be identical. Currently, they seem to
be saying the same thing, but using slightly different wording.

~

10.
+ static TimestampTz xid_advance_attemp_time = 0;
+ static FullTransactionId candidate_xid;

typo in var name - "attemp"

~

11.
+ *phase = DTR_WAIT_FOR_LOCAL_FLUSH;
+
+ /*
+ * Do not return here because the apply worker might have already
+ * applied all changes up to remote_wal_pos. Proceeding to the next
+ * phase to check if we can immediately advance the transaction ID.
+ */

11a.
IMO this comment should be above the *phase assignment.

11b.
/Proceeding to the next phase to check.../Instead, proceed to the next
phase to check.../

~

12.
+ /*
+ * Advance the non-removable transaction id if the remote wal position
+ * has been received, and all transactions up to that position on the
+ * publisher have been applied and flushed locally.
+ */

Some minor re-wording would help clarify this comment.

SUGGESTION
Reaching here means the remote wal position has been received, and all
transactions up to that position on the
publisher have been applied and flushed locally. So, now we can
advance the non-removable transaction id.

~

13.
+ *phase = DTR_REQUEST_WALSENDER_WAL_POS;
+
+ /*
+ * Do not return here as enough time might have passed since the last
+ * wal position request. Proceeding to the next phase to determine if
+ * we can send the next request.
+ */

13a.
IMO this comment should be above the *phase assignment.

~

13b.
This comment should have the same wording here as in the previous
if-block (see #11b).

/Proceeding to the next phase to determine.../Instead, proceed to the
next phase to check.../

~

14.
+ FullTransactionId next_full_xix;
+ FullTransactionId full_xid;

You probably mean 'next_full_xid' (not xix)

~

15.
+ /*
+ * Exit early if the user has disabled sending messages to the
+ * publisher.
+ */
+ if (wal_receiver_status_interval <= 0)
+ return;

What are the implications of this early exit? If the update request is
not possible, then I guess the update status is never received, but
then I suppose that means none of this update_deleted logic is
possible. If that is correct, will there be some documented
warning/caution about the conflict-handling implications of disabling
that GUC?

======
src/backend/replication/walsender.c

16.
+/*
+ * Process the standby message requesting the latest WAL write position.
+ */
+static void
+ProcessStandbyWalPosRequestMessage(void)

Ideally, this function comment should refer to the message we are creating
by the same name it is given in the documentation. For example, something
like:

"Process the request for a primary status update message."

======
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Conflict detection for update_deleted in logical replication

From:
Nisha Moond
Date:
> Here is the V5 patch set which addressed above comments.
>
Here are a couple of comments on the v5 patch set -

1) In FindMostRecentlyDeletedTupleInfo(),

+ /* Try to find the tuple */
+ while (index_getnext_slot(scan, ForwardScanDirection, scanslot))
+ {
+ Assert(tuples_equal(scanslot, searchslot, eq));
+ update_recent_dead_tuple_info(scanslot, oldestXmin, delete_xid,
+   delete_time, delete_origin);
+ }

In my tests, I found that the above Assert() triggers during
unidirectional replication of an update on a table. A replica identity
index scan can only ensure that the indexed column values match, but the
current Assert() assumes that all column values match, which seems wrong
(see the sketch after these comments).

2) Since update_deleted detection requires both 'track_commit_timestamp'
and 'detect_update_deleted' to be enabled, should we raise an error in the
CREATE and ALTER SUBSCRIPTION commands when track_commit_timestamp=OFF
but the user specifies detect_update_deleted=true?
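
Regarding point 1, a sketch of one possible fix (tuples_equal_on_attrs and
indexed_attrs are hypothetical names used only for illustration; this is
not the patch code): restrict the sanity check to the replica identity key
columns, since those are the only ones the index scan guarantees to match.

```
/* Try to find the tuple */
while (index_getnext_slot(scan, ForwardScanDirection, scanslot))
{
    /* Only the replica identity key columns are guaranteed to be equal. */
    Assert(tuples_equal_on_attrs(scanslot, searchslot, eq, indexed_attrs));

    update_recent_dead_tuple_info(scanslot, oldestXmin, delete_xid,
                                  delete_time, delete_origin);
}
```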



Re: Conflict detection for update_deleted in logical replication

From:
Michail Nikolaev
Date:
Hello, Hayato!

> Thanks for updating the patch! While reviewing yours, I found a corner case that
> a recently deleted tuple cannot be detected when index scan is chosen.
> This can happen when indices are re-built during the replication.
> Unfortunately, I don't have any solutions for it.

I just happened to see your message, so I could be wrong and out of context - sorry in advance.

But as far as I know, to solve this problem, we need to wait for slot.xmin during [0] (WaitForOlderSnapshots) while creating an index concurrently.


Best regards,
Mikhail.

Re: Conflict detection for update_deleted in logical replication

From:
Peter Smith
Date:
Hi Hou-San, here are a few trivial comments remaining for patch v6-0001.

======
General.

1.
There are multiple comments in this patch mentioning 'wal' which
probably should say 'WAL' (uppercase).

~~~

2.
There are multiple comments in this patch missing periods (.)

======
doc/src/sgml/protocol.sgml

3.
+        <term>Primary status update (B)</term>
+        <listitem>
+         <variablelist>
+          <varlistentry>
+           <term>Byte1('s')</term>

Currently, there are identifiers 's' for the "Primary status update"
message, and 'S' for the "Primary status request" message.

As mentioned in the previous review ([1] #5b) I preferred it to be the
other way around:
'S' = status from primary
's' = request status from primary

Of course, it doesn't make any difference, but "S" seems more
important than "s", so therefore "S" being the main msg and coming
from the *primary* seemed more natural to me.

~~~

4.
+       <varlistentry id="protocol-replication-standby-wal-status-request">
+        <term>Primary status request (F)</term>

Is it better to call this slightly differently to emphasise this is
only the request?

/Primary status request/Request primary status update/

======
src/backend/replication/logical/worker.c

5.
+ * Retaining the dead tuples for this period is sufficient for ensuring
+ * eventual consistency using last-update-wins strategy, as dead tuples are
+ * useful for detecting conflicts only during the application of concurrent

As mentioned in review [1] #9, this is still called "last-update-wins
strategy" here in the comment, but was called the "last update win
strategy" strategy in the commit message. Those terms should be the
same -- e.g. the 'last-update-wins' strategy.

======
[1] https://www.postgresql.org/message-id/CAHut%2BPs3sgXh2%3DrHDaqjU%3Dp28CK5rCgCLJZgPByc6Tr7_P2imw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia



RE: Conflict detection for update_deleted in logical replication

From:
"Hayato Kuroda (Fujitsu)"
Date:
Dear Mikhail,

Thanks for giving comments!

> But as far as I know, to solve this problem, we need to wait for slot.xmin during the [0]
> (WaitForOlderSnapshots) while creating index concurrently.

WaitForOlderSnapshots() waits for other transactions that can access tuples
older than the specified (=current) transaction, right? I think it does not solve our issue.

Assuming the same workload as [1] is executed, slot.xmin on node2 is arbitrarily
older than the noted SQL, and WaitForOlderSnapshots(slot.xmin) is added in
ReindexRelationConcurrently(). In this case, no transaction older than slot.xmin
exists at step 5, so the REINDEX will finish immediately. Then, the worker
receives changes at step 7, so it is problematic if the worker uses the reindexed index.

From another point of view... this approach has to modify the REINDEX code, but we
should avoid modifying other components of the code as much as possible. This feature
is related to replication, so changes should be kept within the replication subdirectory.

[1]:
https://www.postgresql.org/message-id/TYAPR01MB5692541820BCC365C69442FFF54F2%40TYAPR01MB5692.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Conflict detection for update_deleted in logical replication

From:
Michail Nikolaev
Date:
Hello Hayato,

> WaitForOlderSnapshots() waits other transactions which can access older tuples
> than the specified (=current) transaction, right? I think it does not solve our issue.

Oh, I actually described the idea a bit incorrectly. The goal isn't simply to call WaitForOlderSnapshots(slot.xmin);
rather, it's to ensure that we wait for slot.xmin in the same way we wait for regular snapshots (xmin).
The reason WaitForOlderSnapshots is used in ReindexConcurrently and DefineIndex is to guarantee that any transaction
needing to view rows not included in the index has completed before the index is marked as valid.
The same logic should apply here: we need to wait for the xmin of the slot used in conflict detection as well.

> From another point of view... this approach must fix REINDEX code, but we should
> not modify other component of codes as much as possible. This feature is related
> with the replication so that changes should be closed within the replication subdir.

One possible solution here would be to register a snapshot with slot.xmin for the worker backend.
This way, WaitForOlderSnapshots will account for it.

By the way, WaitForOlderSnapshots is also used in partitioning and other areas for similar reasons,
so these might be good places to check for any related issues.

Best regards,
Mikhail.

RE: Conflict detection for update_deleted in logical replication

From:
"Hayato Kuroda (Fujitsu)"
Date:
Dear Mikhail,

Thanks for describing more detail!

> Oh, I actually described the idea a bit incorrectly. The goal isn’t simply to call WaitForOlderSnapshots(slot.xmin);
> rather, it’s to ensure that we wait for slot.xmin in the same way we wait for regular snapshots (xmin).
> ...
> One possible solution here would be to register a snapshot with slot.xmin for the worker backend.
> This way, WaitForOlderSnapshots will account for it.

Note that apply workers can stop for various reasons (e.g., disabled
subscriptions, errors, deadlocks...). In this case, the snapshot cannot be
registered by the worker, and the index can be rebuilt during that period.

If we cannot assume the existence of workers, we must somehow check
slot.xmin directly and wait until it has advanced past the REINDEXing
transaction. I still think it is risky and a separate topic.

Anyway, this topic introduces huge complexity and is not mandatory for
update_deleted detection. We can work on it in later versions based on
need.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Conflict detection for update_deleted in logical replication

From:
"Hayato Kuroda (Fujitsu)"
Date:
Dear Hou,

Thanks for updating the patch! Here are my comments.

01. CreateSubscription
```
+    if (opts.detectupdatedeleted && !track_commit_timestamp)
+        ereport(ERROR,
+                errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("detecting update_deleted conflicts requires \"%s\" to be enabled",
+                       "track_commit_timestamp"));
```

I don't think this guard is sufficient. I found two cases:

* Create a subscription with detect_update_deleted = false and track_commit_timestamp = false,
  then alter detect_update_deleted to true.
* Create a subscription with detect_update_deleted = true and track_commit_timestamp = true,
  then set track_commit_timestamp to off and restart the instance.

Based on that, how about detecting the inconsistency in the apply worker? It
could check the parameters and error out when it starts or re-reads the
catalog. If we want to detect this in SQL commands, it can be done in
parse_subscription_options().
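
For example, a sketch of the worker-side guard being suggested (a fragment
mirroring the quoted CreateSubscription check; the field name follows the
patch's sub->detectupdatedeleted, and the error code here is chosen only
for illustration):

```
/*
 * Re-check the combination when the apply worker starts or re-reads
 * pg_subscription, so that a later ALTER SUBSCRIPTION or a GUC change
 * followed by a restart cannot leave the settings inconsistent.
 */
if (MySubscription->detectupdatedeleted && !track_commit_timestamp)
    ereport(ERROR,
            errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
            errmsg("detecting update_deleted conflicts requires \"%s\" to be enabled",
                   "track_commit_timestamp"));
```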

02. AlterSubscription()
```
+                    ApplyLauncherWakeupAtCommit();
```

The reason why launcher should wake up is different from other parts. Can we add comments
that it is needed to track/untrack the xmin?

03. build_index_column_bitmap()
```
+    for (int i = 0; i < indexinfo->ii_NumIndexAttrs; i++)
+    {
+        int         keycol = indexinfo->ii_IndexAttrNumbers[i];
+
+        index_bitmap = bms_add_member(index_bitmap, keycol);
+    }
```

I feel we can assert that ii_IndexAttrNumbers contains valid attribute numbers, because the passed index is a replica identity key.
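
For example, a sketch of the suggested assertion inside the quoted loop
(illustration only, not the patch code); a replica identity index cannot
contain expression columns, so each attribute number should be valid:

```
for (int i = 0; i < indexinfo->ii_NumIndexAttrs; i++)
{
    int         keycol = indexinfo->ii_IndexAttrNumbers[i];

    Assert(AttributeNumberIsValid(keycol));

    index_bitmap = bms_add_member(index_bitmap, keycol);
}
```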

04. LogicalRepApplyLoop()

Can we move the definition of "phase" to the maybe_advance_nonremovable_xid() and
make it static? The variable is used only by the function.

05. LogicalRepApplyLoop()
```
+                        UpdateWorkerStats(last_received, timestamp, false);
```

The statistics seems not correct. last_received is not sent at "timestamp", it had
already been sent earlier. Do we really have to update here?

06. ErrorOnReservedSlotName()

I feel we should document that the slot name 'pg_conflict_detection' is
reserved and cannot be specified by users.

07. General

update_deleted can happen without a DELETE command. Should we rename the
conflict reason to something like 'update_target_modified'?

E.g., there is a 2-way replication system and below transactions happen:

Node A:
  T1: INSERT INTO t (id, value) VALUES (1,1); ts = 10.00
  T2: UPDATE t SET id = 2 WHERE id = 1; ts = 10.02
Node B:
  T3: UPDATE t SET value = 2 WHERE id = 1; ts = 10.01

Then, T3 arrives at Node A after T2 has been executed. T3 tries to find id = 1
but finds a dead tuple instead. In this case, 'update_deleted' happens without a
delete.

08. Others

Also, here is an analysis related to the truncation of commit timestamps. I was
worried about the case where commit timestamp entries might be removed so that
the detection would not work well. But it seems entries can only be removed once
they are behind GetOldestNonRemovableTransactionId(NULL), i.e.,
horizons.shared_oldest_nonremovable. That value is affected by replication
slots, so the commit_ts entries of interest to us are not removed.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Re: Conflict detection for update_deleted in logical replication

From:
Michail Nikolaev
Date:
Hello Hayato!

> Note that apply workers can stop due to some reasons (e.g., disabling subscriptions,
> error out, deadlock...). In this case, the snapshot cannot eb registered by the
> worker and index can be re-built during the period.

However, the xmin of a slot affects replication_slot_xmin in ProcArrayStruct, so it might
be straightforward to wait for it during concurrent index builds. We could consider adding
a separate conflict_resolution_replication_slot_xmin to wait only for that.

> Anyway, this topic introduces huge complexity and is not mandatory for update_deleted
> detection. We can work on it in later versions based on the needs.

From my perspective, this is critical for databases. REINDEX CONCURRENTLY is typically run
in production databases on a regular basis, so any master-master system should be unaffected by it.

Best regards,
Mikhail.

RE: Conflict detection for update_deleted in logical replication

From:
"Hayato Kuroda (Fujitsu)"
Date:
Dear Mikhail,

Thanks for the reply!

> > Anyway, this topic introduces huge complexity and is not mandatory for update_deleted
> > detection. We can work on it in later versions based on the needs.
>
> From my perspective, this is critical for databases. REINDEX CONCURRENTLY is typically run
> in production databases on regular basic, so any master-master system should be unaffected by it.

I think you did not understand what I said correctly. The main point here is
that an index scan is not needed to detect update_deleted. In the first
version, workers can do a normal sequential scan instead. This workaround
definitely does not affect REINDEX CONCURRENTLY.
Once the patch is in good shape or pushed, we can support using an index to
find the dead tuple; at that time we can consider how to ensure the index
contains entries for dead tuples.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Conflict detection for update_deleted in logical replication

From:
"Zhijie Hou (Fujitsu)"
Date:
On Monday, October 28, 2024 1:40 PM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Hi Hou-San, here are a few trivial comments remaining for patch v6-0001.

Thanks for the comments!

> 
> ======
> doc/src/sgml/protocol.sgml
> 
> 3.
> +        <term>Primary status update (B)</term>
> +        <listitem>
> +         <variablelist>
> +          <varlistentry>
> +           <term>Byte1('s')</term>
> 
> Currently, there are identifiers 's' for the "Primary status update"
> message, and 'S' for the "Primary status request" message.
> 
> As mentioned in the previous review ([1] #5b) I preferred it to be the other way
> around:
> 'S' = status from primary
> 's' = request status from primary
> 
> Of course, it doesn't make any difference, but "S" seems more important than
> "s", so therefore "S" being the main msg and coming from the *primary*
> seemed more natural to me.

I am not sure if one message is more important than another, so I prefer to
keep the current style. Since this is a minor issue, we can easily revise it in
future version patches if we receive additional feedback.

The other comments look good to me, and I will address them in the V7 patch set.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Tue, Nov 12, 2024 at 2:19 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, October 18, 2024 5:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Oct 15, 2024 at 5:03 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > On Mon, Oct 14, 2024 at 9:09 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > We thought of few options to fix this:
> > > >
> > > > 1) Add a Time-Based Subscription Option:
> > > >
> > > > We could add a new time-based option for subscriptions, such as
> > > > retain_dead_tuples = '5s'. In the logical launcher, after obtaining
> > > > the candidate XID, the launcher will wait for the specified time
> > > > before advancing the slot.xmin. This ensures that deleted tuples are
> > > > retained for at least the duration defined by this new option.
> > > >
> > > > This approach is designed to simulate the functionality of the GUC
> > > > (vacuum_committs_age), but with a simpler implementation that does
> > > > not impact vacuum performance. We can maintain both this time-based
> > > > method and the current automatic method. If a user does not specify
> > > > the time-based option, we will continue using the existing approach
> > > > to retain dead tuples until all concurrent transactions from the remote node
> > have been applied.
> > > >
> > > > 2) Modification to the Logical Walsender
> > > >
> > > > On the logical walsender, which runs on a physical standby, we can
> > > > build an additional connection to the physical primary to obtain the
> > > > latest WAL position. This position will then be sent as feedback to
> > > > the logical subscriber.
> > > >
> > > > A potential concern is that this requires the walsender to use the
> > > > walreceiver API, which may seem a bit unnatural. And, it starts an
> > > > additional walsender process on the primary, as the logical
> > > > walsender on the physical standby will need to communicate with this
> > walsender to fetch the WAL position.
> > > >
> > >
> > > This idea is worth considering, but I think it may not be a good
> > > approach if the physical standby is cascading. We need to restrict the
> > > update_delete conflict detection, if the standby is cascading, right?
> > >
> > > The other approach is that we send current_timestamp from the
> > > subscriber and somehow check if the physical standby has applied
> > > commit_lsn up to that commit_ts, if so, it can send that WAL position
> > > to the subscriber, otherwise, wait for it to be applied. If we do this
> > > then we don't need to add a restriction for cascaded physical standby.
> > > I think the subscriber anyway needs to wait for such an LSN to be
> > > applied on standby before advancing the xmin even if we get it from
> > > the primary. This is because the subscriber can only be able to apply
> > > and flush the WAL once it is applied on the standby. Am I missing
> > > something?
> > >
> > > This approach has a disadvantage that we are relying on clocks to be
> > > synced on both nodes which we anyway need for conflict resolution as
> > > discussed in the thread [1]. We also need to consider the Commit
> > > Timestamp and LSN inversion issue as discussed in another thread [2]
> > > if we want to pursue this approach because we may miss an LSN that has
> > > a prior timestamp.
> > >
>
> For the "publisher is also a standby" issue, I have modified the V8 patch to
> report a warning in this case. As I personally feel this is not the main use case
> for conflict detection, we can revisit it later after pushing the main patches
> and receiving some user feedback.
>
> >
> > The problem due to Commit Timestamp and LSN inversion is that the standby
> > may not consider the WAL LSN from an earlier timestamp, which could lead to
> > the removal of required dead rows on the subscriber.
> >
> > The other problem pointed out by Hou-San offlist due to Commit Timestamp
> > and LSN inversion is that we could miss sending the WAL LSN that the
> > subscriber requires to retain dead rows for update_delete conflict. For example,
> > consider the following two-node, bidirectional setup:
> >
> > Node A:
> >   T1: INSERT INTO t (id, value) VALUES (1,1); ts=10.00 AM
> >   T2: DELETE FROM t WHERE id = 1; ts=10.02 AM
> >
> > Node B:
> >   T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM
> >
> > Say subscription is created with retain_dead_tuples = true/false
> >
> > After executing T2, the apply worker on Node A will check the latest wal flush
> > location on Node B. By that time, T3 should have finished, so the xmin will
> > be advanced only after applying the WALs that are later than T3. So, the dead
> > tuple will not be removed before applying the T3, which means the
> > update_delete can be detected.
> >
> > As there is a gap between when we acquire the commit_timestamp and the
> > commit LSN, it is possible that T3 might not yet have flushed its WAL even
> > though it is committed earlier than T2. If this happens then we won't be able to
> > detect update_deleted conflict reliably.
> >
> > Now, the one simpler idea is to acquire the commit timestamp and reserve WAL
> > (LSN) under the same spinlock in
> > ReserveXLogInsertLocation() but that could be costly as discussed in the
> > thread [1]. The other more localized solution is to acquire a timestamp after
> > reserving the commit WAL LSN outside the lock which will solve this particular
> > problem.
>
> Since the discussion of the WAL/LSN inversion issue is ongoing, I also thought
> about another approach that can fix the issue independently. This idea is to
> delay the non-removable xid advancement until all the remote concurrent
> transactions that may have been assigned an earlier timestamp have been committed.
>
> The implementation is:
>
> On the walsender, after receiving a request, it can send the oldest xid and
> next xid along with the latest WAL write position.
>
> In response, the apply worker can safely advance the non-removable XID if
> oldest_committing_xid == nextXid, indicating that there are no race conditions.
>
> Alternatively, if oldest_committing_xid != nextXid, the apply worker might send
> a second request after some interval. If the subsequently obtained
> oldest_committing_xid surpasses or equals the initial nextXid, it indicates
> that all previously risky transactions have committed, and therefore the
> non-removable transaction ID can be advanced.
>
>
> Attach the V8 patch set. Note that I put the new approach for above race
> condition in a temp patch " v8-0001_2-Maintain-xxx.patch.txt", because the
> approach may or may not be accepted based on the discussion in WAL/LSN
> inversion thread.

I've started to review this patch series. I've reviewed only the 0001
patch for now but let me share some comments:

---
+        if (*phase == DTR_WAIT_FOR_WALSENDER_WAL_POS)
+        {
+                Assert(xid_advance_attempt_time);

What is this assertion for? If we want to check here that we have sent
a request message for the publisher, I think it's clearer if we have
"Assert(xid_advance_attempt_time > 0)". I'm not sure we really need
this assertion though since it's never false once we set
xid_advance_attempt_time.

---
+                /*
+                 * Do not return here because the apply worker might
have already
+                 * applied all changes up to remote_lsn. Instead,
proceed to the
+                 * next phase to check if we can immediately advance
the transaction
+                 * ID.
+                 */
+                *phase = DTR_WAIT_FOR_LOCAL_FLUSH;
+        }

If we always proceed to the next phase, is this phase really
necessary? IIUC even if we jump to DTR_WAIT_FOR_LOCAL_FLUSH phase
after DTR_REQUEST_WALSENDER_WAL_POS and have a check if we received
the remote WAL position in DTR_WAIT_FOR_LOCAL_FLUSH phase, it would
work fine.

---
+                /*
+                 * Reaching here means the remote WAL position has
been received, and
+                 * all transactions up to that position on the
publisher have been
+                 * applied and flushed locally. So, now we can advance the
+                 * non-removable transaction ID.
+                 */
+                SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+                MyLogicalRepWorker->oldest_nonremovable_xid = candidate_xid;
+                SpinLockRelease(&MyLogicalRepWorker->relmutex);

How about adding a debug log message showing new
oldest_nonremovable_xid and related LSN for making the
debug/investigation easier? For example,

elog(LOG, "confirmed remote flush up to %X/%X: new oldest_nonremovable_xid %u",
     LSN_FORMAT_ARGS(*remote_lsn),
     XidFromFullTransactionId(candidate_xid));

---
+                /*
+                 * Exit early if the user has disabled sending messages to the
+                 * publisher.
+                 */
+                if (wal_receiver_status_interval <= 0)
+                        return;

In send_feedback(), we send a feedback message if the publisher
requests, even if wal_receiver_status_interval is 0. On the other
hand, the above code means that we don't send a WAL position request
and therefore never update oldest_nonremovable_xid if
wal_receiver_status_interval is 0. I'm concerned it could be a pitfall
for users.

---
% git show | grep update_delete
    This set of patches aims to support the detection of
update_deleted conflicts,
    transactions with earlier timestamps than the DELETE, detecting
update_delete
    We assume that the appropriate resolution for update_deleted conflicts, to
    that when detecting the update_deleted conflict, and the remote update has a
+ * to detect update_deleted conflicts.
+ * update_deleted is necessary, as the UPDATEs in remote transactions should be
+        * to allow for the detection of update_delete conflicts when applying

There are mixed uses of 'update_delete' and 'update_deleted' in the commit
message and the code. I think it's better to use 'update_deleted' consistently.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Nisha Moond
Дата:
On Thu, Nov 14, 2024 at 8:24 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the V9 patch set which addressed above comments.
>

Reviewed v9 patch-set and here are my comments for below changes:

@@ -1175,10 +1189,29 @@ ApplyLauncherMain(Datum main_arg)
  long elapsed;

  if (!sub->enabled)
+ {
+ can_advance_xmin = false;
+ xmin = InvalidFullTransactionId;
  continue;
+ }

  LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
  w = logicalrep_worker_find(sub->oid, InvalidOid, false);
+
+ if (can_advance_xmin && w != NULL)
+ {
+ FullTransactionId nonremovable_xid;
+
+ SpinLockAcquire(&w->relmutex);
+ nonremovable_xid = w->oldest_nonremovable_xid;
+ SpinLockRelease(&w->relmutex);
+
+ if (!FullTransactionIdIsValid(xmin) ||
+ !FullTransactionIdIsValid(nonremovable_xid) ||
+ FullTransactionIdPrecedes(nonremovable_xid, xmin))
+ xmin = nonremovable_xid;
+ }
+

1) In Patch-0002, could you please add a comment above "+ if
(can_advance_xmin && w != NULL)" to briefly explain the purpose of
finding the minimum XID at this point?

2) In Patch-0004, with the addition of the 'detect_update_deleted'
option, I see the following two issues in the above code:
2a) Currently, all enabled subscriptions are considered when comparing
and finding the minimum XID, even if detect_update_deleted is disabled
for some subscriptions.
I suggest excluding the oldest_nonremovable_xid of subscriptions where
detect_update_deleted=false by updating the check as follows:

    if (sub->detectupdatedeleted && can_advance_xmin && w != NULL)

2b) I understand why advancing xmin is not allowed when a subscription
is disabled. However, the current check allows a disabled subscription
with detect_update_deleted=false to block xmin advancement, which
seems incorrect. Should the check also account for
detect_update_deleted? For example:
  if (sub->detectupdatedeleted &&  !sub->enabled)
+ {
+ can_advance_xmin = false;
+ xmin = InvalidFullTransactionId;
  continue;
+ }

However, I'm not sure if this is the right fix, as it could lead to
inconsistencies if detect_update_deleted is set to false after
disabling the subscription.
Thoughts?

--
Thanks,
Nisha



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Thu, Nov 21, 2024 at 3:03 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the V10 patch set which addressed above comments
> and fixed a CFbot warning due to un-initialized variable.
>

We should make v10_2-0001* the first main patch for review till
we have a consensus on how to resolve the LSN<->Timestamp inversion issue. This
is because v10_2 doesn't rely on the correctness of the LSN<->Timestamp
mapping. Now, say in some later release we fix the LSN<->Timestamp
inversion issue; then we can simply avoid sending the remote_xact information
and it will behave the same as your v10_1 approach.

Comments on v10_2_0001*:
======================
1.
+/*
+ * The phases involved in advancing the non-removable transaction ID.
+ *
+ * Refer to maybe_advance_nonremovable_xid() for details on how the function
+ * transitions between these phases.
+ */
+typedef enum
+{
+ DTR_GET_CANDIDATE_XID,
+ DTR_REQUEST_PUBLISHER_STATUS,
+ DTR_WAIT_FOR_PUBLISHER_STATUS,
+ DTR_WAIT_FOR_LOCAL_FLUSH
+} DeadTupleRetainPhase;

First, can we have a better name for this enum like
RetainConflictInfoPhase or something like that? Second, the phase
transition is not very clear from the comments atop
maybe_advance_nonremovable_xid. You can refer to comments atop
tablesync.c or snapbuild.c to see other cases where we have explained
phase transitions.
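
For illustration, a comment in that style could look roughly like the block below
(the wording is only illustrative, based on the phases quoted above; the real comment
would need to match the final design):

```
/*
 * Sketch of a phase-transition overview comment:
 *
 * DTR_GET_CANDIDATE_XID
 *    Take the oldest running transaction ID on the subscriber as the
 *    candidate for oldest_nonremovable_xid.
 * DTR_REQUEST_PUBLISHER_STATUS
 *    Ask the walsender for the publisher status (latest WAL position and
 *    concurrent transaction information).
 * DTR_WAIT_FOR_PUBLISHER_STATUS
 *    Wait for the reply; loop back to DTR_REQUEST_PUBLISHER_STATUS while
 *    concurrent remote transactions are still in progress.
 * DTR_WAIT_FOR_LOCAL_FLUSH
 *    Wait until the local flush position has passed the reported remote
 *    WAL position, advance oldest_nonremovable_xid, and return to
 *    DTR_GET_CANDIDATE_XID for the next round.
 */
```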

2.
+ *   Wait for the status from the walsender. After receiving the first status
+ *   after acquiring a new candidate transaction ID, do not proceed if there
+ *   are ongoing concurrent remote transactions.

In this part of the comments: " .. after acquiring a new candidate
transaction ID ..." appears misplaced.

3. In maybe_advance_nonremovable_xid(), the handling of each phase
looks ad-hoc, though I see that you have done it that way so that you can
handle the phase-change functionality immediately after changing the phase.
If we have to ever extend this functionality, it will be
tricky to handle the new phase or at least the code will become
complicated. How about handling each phase in the order of their
occurrence and having separate functions for each phase as we have in
apply_dispatch()? That way it would be convenient to invoke the phase
handling functionality even if it needs to be called multiple times in
the same function.
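
As a rough sketch of that structure (illustrative only; the handler names are
those that appear in later versions of the patch, and the per-worker state struct
is assumed):

```
static void
maybe_advance_nonremovable_xid(RetainConflictInfoData *data)
{
	switch (data->phase)
	{
		case DTR_GET_CANDIDATE_XID:
			get_candidate_xid(data);
			break;
		case DTR_REQUEST_PUBLISHER_STATUS:
			request_publisher_status(data);
			break;
		case DTR_WAIT_FOR_PUBLISHER_STATUS:
			wait_for_publisher_status(data);
			break;
		case DTR_WAIT_FOR_LOCAL_FLUSH:
			wait_for_local_flush(data);
			break;
	}
}
```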

4.
/*
+ * An invalid position indicates the publisher is also
+ * a physical standby. In this scenario, advancing the
+ * non-removable transaction ID is not supported. This
+ * is because the logical walsender on the standby can
+ * only get the WAL replay position but there may be
+ * more WALs that are being replicated from the
+ * primary and those WALs could have earlier commit
+ * timestamp. Refer to
+ * maybe_advance_nonremovable_xid() for details.
+ */
+ if (XLogRecPtrIsInvalid(remote_lsn))
+ {
+ ereport(WARNING,
+ errmsg("cannot get the latest WAL position from the publisher"),
+ errdetail("The connected publisher is also a standby server."));
+
+ /*
+ * Continuously revert to the request phase until
+ * the standby server (publisher) is promoted, at
+ * which point a valid WAL position will be
+ * received.
+ */
+ phase = DTR_REQUEST_PUBLISHER_STATUS;
+ }

Shouldn't this be an ERROR as the patch doesn't support this case? The
same should be true for later patches for the subscription option.
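
For reference, the ERROR variant would look roughly like this (the errcode choice
is mine, not the patch's):

```
ereport(ERROR,
		errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
		errmsg("cannot get the latest WAL position from the publisher"),
		errdetail("The connected publisher is also a standby server."));
```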

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Tue, Nov 26, 2024 at 1:50 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>

Few comments on the latest 0001 patch:
1.
+ * - RCI_REQUEST_PUBLISHER_STATUS:
+ *   Send a message to the walsender requesting the publisher status, which
+ *   includes the latest WAL write position and information about running
+ *   transactions.

Shall we make the later part of this comment (".. information about
running transactions.") accurate w.r.t the latest changes of
requesting xacts that are known to be in the process of committing?

2.
+ * The overall state progression is: GET_CANDIDATE_XID ->
+ * REQUEST_PUBLISHER_STATUS -> WAIT_FOR_PUBLISHER_STATUS -> (loop to
+ * REQUEST_PUBLISHER_STATUS if concurrent remote transactions persist) ->
+ * WAIT_FOR_LOCAL_FLUSH.

This state-machine progression fails to mention that, after we have waited
for the local flush, the state moves back to GET_CANDIDATE_XID.

3.
+request_publisher_status(RetainConflictInfoData *data)
+{
...
+ /* Send a WAL position request message to the server */
+ walrcv_send(LogRepWorkerWalRcvConn,
+ reply_message->data, reply_message->len);

This message requests more than a WAL write position but the comment
is incomplete.

4.
+/*
+ * Process the request for a primary status update message.
+ */
+static void
+ProcessStandbyPSRequestMessage(void)
...
+ /*
+ * Information about running transactions and the WAL write position is
+ * only available on a non-standby server.
+ */
+ if (!RecoveryInProgress())
+ {
+ oldestXidInCommit = GetOldestTransactionIdInCommit();
+ nextFullXid = ReadNextFullTransactionId();
+ lsn = GetXLogWriteRecPtr();
+ }

Shall we ever reach here in the standby case? If not, shouldn't that be an ERROR?

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Hou,

Thanks for updating the patch! Here are my comments mainly for 0001.

01. protocol.sgml

I think the ordering of attributes in "Primary status update" is incorrect.
The second entry is the LSN, not the oldest running xid.

02. maybe_advance_nonremovable_xid

```
+        case RCI_REQUEST_PUBLISHER_STATUS:
+            request_publisher_status(data);
+            break;
```

I think this part is not reachable because the transition
RCI_REQUEST_PUBLISHER_STATUS->RCI_WAIT_FOR_PUBLISHER_STATUS is done in
get_candidate_xid()->request_publisher_status().
Can we remove this?

03. RetainConflictInfoData

```
+    Timestamp   xid_advance_attempt_time;   /* when the candidate_xid is
+                                             * decided */
+    Timestamp   reply_time;     /* when the publisher responds with status */
+
+} RetainConflictInfoData;
```

The datatype should be TimestampTz.

04. get_candidate_xid

```
+    if (!TimestampDifferenceExceeds(data->xid_advance_attempt_time, now,
+                                    wal_receiver_status_interval * 1000))
+        return;
```

I think data->xid_advance_attempt_time can be accessed without initialization
on the first try. I've found that the patch could not pass the tests for a
32-bit build for this reason.


05. request_publisher_status

```
+    if (!reply_message)
+    {
+        MemoryContext oldctx = MemoryContextSwitchTo(ApplyContext);
+
+        reply_message = makeStringInfo();
+        MemoryContextSwitchTo(oldctx);
+    }
+    else
+        resetStringInfo(reply_message);
```

Same lines exist in two functions: can we provide an inline function?
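
Something like the following helper could be shared by both callers (a sketch
only; the function name is made up):

```
/* Sketch: lazily create reply_message in ApplyContext, or reset it for reuse. */
static void
prepare_reply_message(void)
{
	if (!reply_message)
	{
		MemoryContext oldctx = MemoryContextSwitchTo(ApplyContext);

		reply_message = makeStringInfo();
		MemoryContextSwitchTo(oldctx);
	}
	else
		resetStringInfo(reply_message);
}
```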

06. wait_for_publisher_status

```
+    if (!FullTransactionIdIsValid(data->last_phase_at))
+        data->last_phase_at = FullTransactionIdFromEpochAndXid(data->remote_epoch,
+                                                               data->remote_nextxid);
+
```

Not sure: is there any possibility that data->last_phase_at is valid here? It is initialized
just before transitioning to RCI_WAIT_FOR_PUBLISHER_STATUS.

07. wait_for_publisher_status

I think all the calculations and checks in the function can be done on the
walsender as well. Based on this, I came up with an idea to reduce the message size:
the walsender could just send a boolean status indicating whether there are any running
transactions, instead of the oldest xid, next xid, and their epoch. Or is it more important
to reduce the amount of calculation on the publisher side?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Nov 29, 2024 at 4:05 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> 07. wait_for_publisher_status
>
> I think all calculations and checking in the function can be done even on the
> walsender. Based on this, I come up with an idea to reduce the message size:
> walsender can just send a status (boolean) whether there are any running transactions
> instead of oldest xid, next xid and their epoch. Or, it is more important to reduce the
> amount of calc. on publisher side?
>

Won't it be tricky to implement this tracking on the publisher side?
Because we not only need to check that there is no running xact but
also that the oldest_running_xact that was present the last time the
status message arrived has finished. Won't this need more bookkeeping
on the publisher's side?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Nov 29, 2024 at 4:05 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> 02. maybe_advance_nonremovable_xid
>
> ```
> +        case RCI_REQUEST_PUBLISHER_STATUS:
> +            request_publisher_status(data);
> +            break;
> ```
>
> I think the part is not reachable because the transit
> RCI_REQUEST_PUBLISHER_STATUS->RCI_WAIT_FOR_PUBLISHER_STATUS is done in
> get_candidate_xid()->request_publisher_status().
> Can we remove this?
>

After changing the phase to RCI_REQUEST_PUBLISHER_STATUS, we directly
invoke request_publisher_status(), and similarly, after changing the phase
to RCI_WAIT_FOR_LOCAL_FLUSH, we call wait_for_local_flush(). Won't it be
better if, in both these cases and other similar cases, we instead invoke
maybe_advance_nonremovable_xid()? This would make
maybe_advance_nonremovable_xid() the only function with the knowledge
to take action based on the phase, rather than spreading the knowledge of
phase-related actions across various functions. Then we should also add a
comment at the end of request_publisher_status() where we change the
phase but don't do anything; the comment should explain the reason for
that.

One more point: it seems that on a busy server, the patch won't be able to
advance nonremovable_xid. We should call
maybe_advance_nonremovable_xid() at all the places where we call
send_feedback(), and additionally, we should also call it after
applying some threshold number (say 100) of messages. The latter is to
avoid cases where we won't invoke the required functionality on a
busy server with large sender/receiver timeouts.
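
For example, something along these lines in the main apply loop (a sketch only;
the counter and threshold names are made up):

```
/* Sketch: in LogicalRepApplyLoop(), alongside the existing bookkeeping. */
#define NONREMOVABLE_XID_CHANGE_THRESHOLD 100	/* hypothetical name/value */

	if (++nchanges >= NONREMOVABLE_XID_CHANGE_THRESHOLD)
	{
		maybe_advance_nonremovable_xid(&rci_data);	/* rci_data: assumed local state */
		nchanges = 0;
	}
```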

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Friday, November 29, 2024 6:35 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> 
> Dear Hou,
> 
> Thanks for updating the patch! Here are my comments mainly for 0001.

Thanks for the comments!

> 
> 02. maybe_advance_nonremovable_xid
> 
> ```
> +        case RCI_REQUEST_PUBLISHER_STATUS:
> +            request_publisher_status(data);
> +            break;
> ```
> 
> I think the part is not reachable because the transit
> RCI_REQUEST_PUBLISHER_STATUS->RCI_WAIT_FOR_PUBLISHER_STATU
> S is done in get_candidate_xid()->request_publisher_status().
> Can we remove this?

I changed the code to call maybe_advance_nonremovable_xid() after changing the phase
in get_candidate_xid/wait_for_publisher_status, so that this code is reachable.

> 
> 
> 05. request_publisher_status
> 
> ```
> +    if (!reply_message)
> +    {
> +        MemoryContext oldctx = MemoryContextSwitchTo(ApplyContext);
> +
> +        reply_message = makeStringInfo();
> +        MemoryContextSwitchTo(oldctx);
> +    }
> +    else
> +        resetStringInfo(reply_message);
> ```
> 
> Same lines exist in two functions: can we provide an inline function?

I personally feel this code may not be worth a separate function since it's simple,
so I didn't change it in this version.

> 
> 06. wait_for_publisher_status
> 
> ```
> +    if (!FullTransactionIdIsValid(data->last_phase_at))
> +        data->last_phase_at =
> FullTransactionIdFromEpochAndXid(data->remote_epoch,
> +
> + data->remote_nextxid);
> +
> ```
> 
> Not sure, is there a possibility that data->last_phase_at is valid here? It is
> initialized just before transiting to RCI_WAIT_FOR_PUBLISHER_STATUS.

Oh. I think last_phase_at should be initialized only in the first phase. Fixed.

Other comments look good to me and have been addressed in V13.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Mon, Dec 9, 2024 at 3:20 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> Here is a summary of tests targeted to the Publisher node in a
> Publisher-Subscriber setup.
> (All tests done with v14 patch-set)
>
> ----------------------------
> Performance Tests:
> ----------------------------
> Test machine details:
> Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz CPU(s) :120 - 800GB RAM
>
> Setup:
> - Created two nodes ( 'Pub' and 'Sub'), with logical replication.
> - Configurations for Both Nodes:
>
>     shared_buffers = 40GB
>     max_worker_processes = 32
>     max_parallel_maintenance_workers = 24
>     max_parallel_workers = 32
>     checkpoint_timeout = 1d
>     max_wal_size = 24GB
>     min_wal_size = 15GB
>     autovacuum = off
>
> - Additional setting on Sub: 'track_commit_timestamp = on' (required
> for the feature).
> - Initial data insertion via 'pgbench' with scale factor 100 on both nodes.
>
> Workload:
> - Ran pgbench with 60 clients for the publisher.
> - The duration was 120s, and the measurement was repeated 10 times.
>

You didn't mention whether these are READONLY or READWRITE tests, but I think it is
the latter. I feel it is better to run these tests for 15 minutes, repeat
them 3 times, and take the median of those runs. Also, try to run them
for lower client counts like 2, 16, 32. Overall, the conclusion may be the
same, but it will rule out the possibility of any anomaly.

With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Wednesday, December 11, 2024 1:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Dec 6, 2024 at 1:28 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Thursday, December 5, 2024 6:00 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > A few more comments:
> > > 1.
> > > +static void
> > > +wait_for_local_flush(RetainConflictInfoData *data)
> > > {
> > > ...
> > > +
> > > + data->phase = RCI_GET_CANDIDATE_XID;
> > > +
> > > + maybe_advance_nonremovable_xid(data);
> > > +}
> > >
> > > Isn't it better to reset all the fields of data before the next
> > > round of GET_CANDIDATE_XID phase? If we do that then we don't need
> > > to reset
> > > data->remote_lsn = InvalidXLogRecPtr; and data->last_phase_at =
> > > InvalidFullTransactionId; individually in request_publisher_status()
> > > and
> > > get_candidate_xid() respectively. Also, it looks clean and logical
> > > to me unless I am missing something.
> >
> > The remote_lsn was used to determine whether a status is received, so
> > was reset each time in request_publisher_status. To make it more
> > straightforward, I added a new function parameter 'status_received',
> > which would be set to true when calling
> > maybe_advance_nonremovable_xid() on receiving the status. After this
> change, there is no need to reset the remote_lsn.
> >
> 
> As part of the above comment, I had asked for three things (a) avoid setting
> data->remote_lsn = InvalidXLogRecPtr; in request_publisher_status(); (b)
> avoid setting data->last_phase_at =InvalidFullTransactionId; in
> get_candidate_xid(); (c) reset data in
> wait_for_local_flush() after wait is over. You only did (a) in the patch and didn't
> mention anything about (b) or (c). Is that intentional? If so, what is the reason?

I think I misunderstood the intention, so will address in next version.

> 
> *
> +static bool
> +can_advance_nonremovable_xid(RetainConflictInfoData *data) {
> +
> 
> Isn't it better to make this an inline function as it contains just one check?

Agreed. Will address in next version.

> 
> *
> + /*
> + * The non-removable transaction ID for a subscription is centrally
> + * managed by the main apply worker.
> + */
> + if (!am_leader_apply_worker())
> 
> I have tried to improve this comment in the attached.

Thanks, will check and merge the next version.

Best Regards,
Hou zj


Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Wed, Dec 11, 2024 at 2:32 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the V16 patch set which addressed above comments.
>
> There is a new 0002 patch where I tried to dynamically adjust the interval for
> advancing the transaction ID. Instead of always waiting for
> wal_receiver_status_interval, we can start with a short interval and increase
> it if there is no activity (no xid assigned on subscriber), but not beyond
> wal_receiver_status_interval.
>
> The intention is to more effectively advance xid to avoid retaining too much
> dead tuples. My colleague will soon share detailed performance data and
> analysis related to this enhancement.

I am starting to review the patches and trying to understand
how you are preventing vacuum from removing a dead tuple
that might be required by a concurrent remote update. I was looking
at the commit message, which explains the idea quite clearly, but I have
one question.

The process of advancing the non-removable transaction ID in the apply worker
involves:

== copied from commit message of 0001 start==
1) Call GetOldestActiveTransactionId() to take oldestRunningXid as the
candidate xid.
2) Send a message to the walsender requesting the publisher status, which
includes the latest WAL write position and information about transactions
that are in the commit phase.
3) Wait for the status from the walsender. After receiving the first status, do
not proceed if there are concurrent remote transactions that are still in the
commit phase. These transactions might have been assigned an earlier commit
timestamp but have not yet written the commit WAL record. Continue to request
the publisher status until all these transactions have completed.
4) Advance the non-removable transaction ID if the current flush location has
reached or surpassed the last received WAL position.
== copied from commit message of 0001 start==

So IIUC, in step 2) we send the message and get the list of all the
transactions which are in the commit phase? What exactly do you mean by a
transaction which is in the commit phase? Can I assume these are
transactions which are currently running on the publisher? And in
step 3) we wait for all the transactions to get committed which we saw
running (or in the commit phase), and we anyway don't worry about the newly
started transactions as they would not be problematic for us. And in step 4)
we would wait for the local flush location to reach the "last received WAL
position"; here my question is what exactly will be the "last received WAL
position"? I assume it would be a position somewhere after the position of
the commit WAL of all the transactions we were interested in on the
publisher?

At a high level, the overall idea looks promising to me, but I wanted to put
more thought into the lower-level details of exactly which transactions we
are waiting for and which WAL LSN we are waiting to get flushed.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Monday, December 16, 2024 7:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Hi,

> 
> On Wed, Dec 11, 2024 at 2:32 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attach the V16 patch set which addressed above comments.
> >
> > There is a new 0002 patch where I tried to dynamically adjust the interval for
> > advancing the transaction ID. Instead of always waiting for
> > wal_receiver_status_interval, we can start with a short interval and increase
> > it if there is no activity (no xid assigned on subscriber), but not beyond
> > wal_receiver_status_interval.
> >
> > The intention is to more effectively advance xid to avoid retaining too much
> > dead tuples. My colleague will soon share detailed performance data and
> > analysis related to this enhancement.
> 
> I am starting to review the patches, and trying to understand the
> concept that how you are preventing vacuum to remove the dead tuple
> which might required by the concurrent remote update, so I was looking
> at the commit message which explains the idea quite clearly but I have
> one question

Thanks for the review!

> 
> The process of advancing the non-removable transaction ID in the apply worker
> involves:
> 
> == copied from commit message of 0001 start==
> 1) Call GetOldestActiveTransactionId() to take oldestRunningXid as the
> candidate xid.
> 2) Send a message to the walsender requesting the publisher status, which
> includes the latest WAL write position and information about transactions
> that are in the commit phase.
> 3) Wait for the status from the walsender. After receiving the first status, do
> not proceed if there are concurrent remote transactions that are still in the
> commit phase. These transactions might have been assigned an earlier commit
> timestamp but have not yet written the commit WAL record. Continue to
> request
> the publisher status until all these transactions have completed.
> 4) Advance the non-removable transaction ID if the current flush location has
> reached or surpassed the last received WAL position.
> == copied from commit message of 0001 start==
> 
> So IIUC in step 2) we send the message and get the list of all the
> transactions which are in the commit phase? What do you exactly mean by a
> transaction which is in the commit phase?

I was referring to transactions calling RecordTransactionCommit() and have
entered the commit critical section. In the patch, we checked if the proc has
marked the new flag DELAY_CHKPT_IN_COMMIT in 'MyProc->delayChkptFlags'.

> Can I assume transactions which are currently running on the publisher?

I think it's a subset of the running transactions. We only get the transactions
in the commit phase, with the intention of avoiding delays caused by waiting for
long-running transactions to complete, which can result in long retention
of dead tuples.

We decided to wait for running (committing) transactions due to the WAL/LSN
inversion issue[1]. The original idea was to directly return the latest WAL
write position without checking running transactions. But since there is a gap
between when we acquire the commit_timestamp and the commit LSN, it's possible
that transactions might have been assigned an earlier commit timestamp but have
not yet written the commit WAL record.

> And in step 3) we wait for all the transactions to get committed which we saw
> running (or in the commit phase) and we anyway don't worry about the newly
> started transactions as they would not be problematic for us. And in step 4)
> we would wait for all the flush location to reach "last received WAL
> position", here my question is what exactly will be the "last received WAL
> position" I assume it would be the position somewhere after the position of
> the commit WAL of all the transaction we were interested on the publisher?

Yes, your understanding is correct. It's a position after the positions of all
the interesting transactions. In the patch, we get the latest WAL write
position (GetXLogWriteRecPtr()) in the walsender after all interesting transactions
have finished and reply with it to the apply worker.

> At high level the overall idea looks promising to me but wanted to put
> more thought on lower level details about what transactions exactly we
> are waiting for and what WAL LSN we are waiting to get flushed.

Yeah, that makes sense, thanks.

[1]
https://www.postgresql.org/message-id/OS0PR01MB571628594B26B4CC2346F09294592%40OS0PR01MB5716.jpnprd01.prod.outlook.com>

Best Regards,
Hou zj



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Tue, Dec 17, 2024 at 8:54 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 16, 2024 7:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > So IIUC in step 2) we send the message and get the list of all the
> > transactions which are in the commit phase? What do you exactly mean by a
> > transaction which is in the commit phase?
>
> I was referring to transactions calling RecordTransactionCommit() and have
> entered the commit critical section. In the patch, we checked if the proc has
> marked the new flag DELAY_CHKPT_IN_COMMIT in 'MyProc->delayChkptFlags'.
>
> > Can I assume transactions which are currently running on the publisher?
>
> I think it's a subset of the running transactions. We only get the transactions
> in commit phase with the intention to avoid delays caused by waiting for
> long-running transactions to complete, which can result in the long retention
> of dead tuples.

Ok

> We decided to wait for running(committing) transactions due to the WAL/LSN
> inversion issue[1]. The original idea is to directly return the latest WAL
> write position without checking running transactions. But since there is a gap
> between when we acquire the commit_timestamp and the commit LSN, it's possible
> the transactions might have been assigned an earlier commit timestamp but have
> not yet written the commit WAL record.

Yes, that makes sense.

> > And in step 3) we wait for all the transactions to get committed which we saw
> > running (or in the commit phase) and we anyway don't worry about the newly
> > started transactions as they would not be problematic for us. And in step 4)
> > we would wait for all the flush location to reach "last received WAL
> > position", here my question is what exactly will be the "last received WAL
> > position" I assume it would be the position somewhere after the position of
> > the commit WAL of all the transaction we were interested on the publisher?
>
> Yes, your understanding is correct. It's a position after the position of all
> the interesting transactions. In the patch, we get the latest WAL write
> position(GetXLogWriteRecPtr()) in walsender after all interesting transactions
> have finished and reply it to apply worker.

Got it, thanks.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Conflict detection for update_deleted in logical replication

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Hou,

Thanks for updating the patch. Few comments:

01. worker.c

```
+/*
+ * The minimum (100ms) and maximum (3 minutes) intervals for advancing
+ * non-removable transaction IDs.
+ */
+#define MIN_XID_ADVANCEMENT_INTERVAL 100
+#define MAX_XID_ADVANCEMENT_INTERVAL 180000L
```

Since the max_interval is an integer variable, it can be s/180000L/180000/.


02.  ErrorOnReservedSlotName()

Currently the function is called from three points - create_physical_replication_slot(),
create_logical_replication_slot() and CreateReplicationSlot().
Can we move the check to ReplicationSlotCreate(), or combine it into ReplicationSlotValidateName()?

03. advance_conflict_slot_xmin()

```
    Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
```

Assume the case where the launcher crashed just after ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
After the restart, the slot can be acquired since SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
is true, but the process would fail the assert because data.xmin is still invalid.

I think we should re-create the slot when the xmin is invalid. Thoughts?

04. documentation

Should we update "Configuration Settings" section in logical-replication.sgml
because an additional slot is required?

05. check_remote_recovery()

Can we add a test case related to this?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Thu, Dec 19, 2024 at 4:34 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Sunday, December 15, 2024 9:39 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> >
> > 5. The apply worker needs to at least twice get the publisher status message to
> > advance oldest_nonremovable_xid once. It then uses the remote_lsn of the last
> > such message to ensure that it has been applied locally. Such a remote_lsn
> > could be a much later value than required leading to delay in advancing
> > oldest_nonremovable_xid. How about if while first time processing the
> > publisher_status message on walsender, we get the
> > latest_transaction_in_commit by having a function
> > GetLatestTransactionIdInCommit() instead of
> > GetOldestTransactionIdInCommit() and then simply wait till that proc has
> > written commit WAL (aka wait till it clears DELAY_CHKPT_IN_COMMIT)?
> > Then get the latest LSN wrote and send that to apply worker waiting for the
> > publisher_status message. If this is feasible then we should be able to
> > advance oldest_nonremovable_xid with just one publisher_status message.
> > Won't that be an improvement over current? If so, we can even further try to
> > improve it by just using commit_LSN of the transaction returned by
> > GetLatestTransactionIdInCommit(). One idea is that we can try to use
> > MyProc->waitLSN which we are using in synchronous replication for our
> > purpose. See SyncRepWaitForLSN.
>
> I will do more performance tests on this and address if it improves
> the performance.
>

Did you check this idea? Again, thinking about this, I see a downside
to the new proposal: the walsender needs to somehow wait for the
transactions in the commit phase, which essentially means that it may
delay decoding and sending the decoded WAL. But it is still worth
checking the impact of such a change; if nothing else, we can add a
short comment in the code to note that such an improvement is not
worthwhile.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Wed, Dec 25, 2024 at 8:13 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure about moving the check into these functions, because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter to these functions, and I am not sure if it's worth it at this
> stage.
>

But why would it prevent the launcher from creating the slot? I think
we should add this check in the function
ReplicationSlotValidateName(). Another related point:

+ErrorOnReservedSlotName(const char *name)
+{
+ if (strcmp(name, CONFLICT_DETECTION_SLOT) == 0)
+ ereport(ERROR,
+ errcode(ERRCODE_RESERVED_NAME),
+ errmsg("replication slot name \"%s\" is reserved",
+    name));

Won't it be sufficient to check using the existing IsReservedName()?
Even if not, we should still keep that as part of the check,
similar to what we are doing in pg_replication_origin_create().

> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>

Sounds reasonable, but OTOH, all other places that create physical
slots (which is what we are doing here) don't use this trick. So, don't they
need similar reliability? Also, add some comments as to why we are
initially creating the slot as RS_EPHEMERAL, as we do at other places.
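
For reference, the pattern being discussed is roughly the following (a sketch only;
the trailing boolean arguments to ReplicationSlotCreate() are approximate and depend
on the branch):

```
/*
 * Sketch of the create-ephemeral-then-persist pattern (illustrative; the
 * trailing boolean arguments to ReplicationSlotCreate() are approximate).
 */
ReplicationSlotCreate(CONFLICT_DETECTION_SLOT, false /* db_specific */,
					  RS_EPHEMERAL, false, false, false);

/* ... initialize the slot's xmin and recompute the global xmin horizon ... */
ReplicationSlotsComputeRequiredXmin(false);

/* Only after full initialization, make the slot survive a restart. */
ReplicationSlotPersist();
```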

Few other comments on 0003
=======================
1.
+ if (sublist)
+ {
+ bool updated;
+
+ if (!can_advance_xmin)
+ xmin = InvalidFullTransactionId;
+
+ updated = advance_conflict_slot_xmin(xmin);

How will it help to try advancing slot_xmin when xmin is invalid?

2.
@@ -1167,14 +1181,43 @@ ApplyLauncherMain(Datum main_arg)
  long elapsed;

  if (!sub->enabled)
+ {
+ can_advance_xmin = false;

In ApplyLauncherMain(), if one of the subscriptions is disabled (say
the last one in sublist), then can_advance_xmin will become false in
the above code. Now, later, as quoted in comment-1, the patch
overrides xmin to InvalidFullTransactionId if can_advance_xmin is
false. Won't that lead to the wrong computation of xmin?

3.
+ slot_maybe_exist = true;
+ }
+
+ /*
+ * Drop the slot if we're no longer retaining dead tuples.
+ */
+ else if (slot_maybe_exist)
+ {
+ drop_conflict_slot_if_exists();
+ slot_maybe_exist = false;

Can't we use MyReplicationSlot instead of introducing a new boolean
slot_maybe_exist?

In any case, how does the above code deal with the case where the
launcher is restarted for some reason and there is no subscription
after that? Will it be possible to drop the slot in that case?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
vignesh C
Дата:
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure about moving the check into these functions, because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter to these functions, and I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to setup
> a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.
>
> Based on some off-list discussions with Sawada-san and Amit, it would be better
> if the apply worker can avoid reporting an ERROR if the publisher's clock
> lags behind that of the subscriber, so I implemented a new 0007 patch to allow
> the apply worker to wait for the clock skew to pass and then send a new request
> to the publisher for the latest status. The implementation is as follows:
>
> Since we have the time (reply_time) on the walsender when it confirms that all
> the committing transactions have finished, any subsequent transactions
> on the publisher should be assigned a commit timestamp later than reply_time.
> Similarly, we have the time (candidate_xid_time) when the apply worker determines
> the oldest active xid. Any old transactions on the publisher that have finished
> should have a commit timestamp earlier than the candidate_xid_time.
>
> The apply worker can compare the candidate_xid_time with reply_time. If
> candidate_xid_time is less than the reply_time, then it's OK to advance the xid
> immediately. If candidate_xid_time is greater than reply_time, it means the
> clock of the publisher is behind that of the subscriber, so the apply worker can
> wait for the skew to pass before advancing the xid.
>
> Since this is considered an improvement, we can focus on this after
> pushing the main patches.

An update of a truncated row is detected as update_missing, while an
update of a deleted row is detected as update_deleted. I was not sure whether
updates of truncated rows should also be detected as update_deleted, as the
documentation says the truncate operation "has the same effect as an
unqualified DELETE on each table" [1].

I tried with the following three node(N1,N2 & N3) setup with
subscriber on N3 subscribing to the publisher pub1 in N1 and publisher
pub2 in N2:
N1 - pub1
N2 - pub2
N3 - sub1 -> pub1(N1) and sub2 -> pub2(N2)

-- Insert a record in N1
insert into t1 values(1);

-- Insert a record in N2
insert into t1 values(1);

-- Now N3 has the above inserts from N1 and N2
N3=# select * from t1;
 c1
----
  1
  1
(2 rows)

-- Truncate t1 from N2
N2=# truncate t1;
TRUNCATE TABLE

-- Now N3 has no records:
N3=# select * from t1;
 c1
----
(0 rows)

-- Update from N1 to generated a conflict
postgres=# update t1 set c1 = 2;
UPDATE 1
N1=# select * from t1;
 c1
----
  2
(1 row)

--- N3 logs the conflict as update_missing
2025-01-02 12:21:37.388 IST [24803] LOG:  conflict detected on
relation "public.t1": conflict=update_missing
2025-01-02 12:21:37.388 IST [24803] DETAIL:  Could not find the row to
be updated.
        Remote tuple (2); replica identity full (1).
2025-01-02 12:21:37.388 IST [24803] CONTEXT:  processing remote data
for replication origin "pg_16387" during message type "UPDATE" for
replication target relation "public.t1" in transaction 757, finished
at 0/17478D0

-- Insert a record with value 2 in N2
N2=# insert into t1 values(2);
INSERT 0 1

-- Now N3 has the above inserted records:
N3=# select * from t1;
 c1
----
  2
(1 row)

-- Delete this record from N2:
N2=# delete from t1;
DELETE 1

-- Now N3 has no records:
N3=# select * from t1;
 c1
----
(0 rows)

-- Update from N1 to generate a conflict
postgres=# update t1 set c1 = 3;
UPDATE 1

--- N3 logs the conflict as update_deleted
2025-01-02 12:22:38.036 IST [24803] LOG:  conflict detected on
relation "public.t1": conflict=update_deleted
2025-01-02 12:22:38.036 IST [24803] DETAIL:  The row to be updated was
deleted by a different origin "pg_16388" in transaction 764 at
2025-01-02 12:22:29.025347+05:30.
        Remote tuple (3); replica identity full (2).
2025-01-02 12:22:38.036 IST [24803] CONTEXT:  processing remote data
for replication origin "pg_16387" during message type "UPDATE" for
replication target relation "public.t1" in transaction 758, finished
at 0/174D240

I'm not sure if this behavior is expected or not. If it is expected,
can we mention this in the documentation so that users can handle
conflict resolution accordingly in these cases?
Thoughts?

[1] - https://www.postgresql.org/docs/devel/sql-truncate.html

Regards,
Vignesh



Re: Conflict detection for update_deleted in logical replication

От
vignesh C
Дата:
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure about moving the check into these functions, because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter to these functions, and I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to setup
> a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.

Few suggestions:
1) If we have a subscription with the detect_update_deleted option and we
try to upgrade it with default settings (in case the DBA forgot to set
track_commit_timestamp), the upgrade will fail, but only after completing a lot of
steps, as shown below:
Setting locale and encoding for new cluster                   ok
Analyzing all rows in the new cluster                         ok
Freezing all rows in the new cluster                          ok
Deleting files from new pg_xact                               ok
Copying old pg_xact to new server                             ok
Setting oldest XID for new cluster                            ok
Setting next transaction ID and epoch for new cluster         ok
Deleting files from new pg_multixact/offsets                  ok
Copying old pg_multixact/offsets to new server                ok
Deleting files from new pg_multixact/members                  ok
Copying old pg_multixact/members to new server                ok
Setting next multixact ID and offset for new cluster          ok
Resetting WAL archives                                        ok
Setting frozenxid and minmxid counters in new cluster         ok
Restoring global objects in the new cluster                   ok
Restoring database schemas in the new cluster
  postgres
*failure*

We should detect this at an earlier point, somewhere like
check_new_cluster_subscription_configuration(), and throw an error from
there.

2) Also, should we account for an additional slot for the
pg_conflict_detection slot while checking max_replication_slots?
Though this error will occur only after the upgrade is completed, it may be
better to account for the slot during the upgrade itself so that the DBA need
not handle this error separately after the upgrade is completed.

3) We have reserved the pg_conflict_detection name in this version, so
if there was a replication slot with the name pg_conflict_detection in
the older version, the upgrade will fail at a very late stage, like the
earlier upgrade shown above. I feel we should check whether the old cluster has
any slot with the name pg_conflict_detection and throw an error
earlier:
+void
+ErrorOnReservedSlotName(const char *name)
+{
+       if (strcmp(name, CONFLICT_DETECTION_SLOT) == 0)
+               ereport(ERROR,
+                               errcode(ERRCODE_RESERVED_NAME),
+                               errmsg("replication slot name \"%s\"
is reserved",
+                                          name));
+}

4) We should also mention something like the below in the documentation so
that the user can be aware of it:
A replication slot cannot be created with the name pg_conflict_detection, as this
name is reserved for logical replication conflict detection.

Regards,
Vignesh



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Wed, Dec 25, 2024 at 8:13 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the new version patch set which addressed all other comments.
>

Some more miscellaneous comments:
=============================
1.
@@ -1431,9 +1431,9 @@ RecordTransactionCommit(void)
  * modifying it.  This makes checkpoint's determination of which xacts
  * are delaying the checkpoint a bit fuzzy, but it doesn't matter.
  */
- Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+ Assert((MyProc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0);
  START_CRIT_SECTION();
- MyProc->delayChkptFlags |= DELAY_CHKPT_START;
+ MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;

  /*
  * Insert the commit XLOG record.
@@ -1536,7 +1536,7 @@ RecordTransactionCommit(void)
  */
  if (markXidCommitted)
  {
- MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+ MyProc->delayChkptFlags &= ~DELAY_CHKPT_IN_COMMIT;
  END_CRIT_SECTION();

The comments related to this change should be updated in EndPrepare()
and RecordTransactionCommitPrepared(); they still refer to the
DELAY_CHKPT_START flag. The comments should also explain why a similar
change is not required for prepare or commit_prepared, if such a reason
exists.

2.
 static bool
 tuples_equal(TupleTableSlot *slot1, TupleTableSlot *slot2,
- TypeCacheEntry **eq)
+ TypeCacheEntry **eq, Bitmapset *columns)
 {
  int attrnum;

@@ -337,6 +340,14 @@ tuples_equal(TupleTableSlot *slot1, TupleTableSlot *slot2,
  if (att->attisdropped || att->attgenerated)
  continue;

+ /*
+ * Ignore columns that are not listed for checking.
+ */
+ if (columns &&
+ !bms_is_member(att->attnum - FirstLowInvalidHeapAttributeNumber,
+    columns))
+ continue;

Update the comment atop tuples_equal to reflect this change.

3.
+FindMostRecentlyDeletedTupleInfo(Relation rel, TupleTableSlot *searchslot,
+ TransactionId *delete_xid,
+ RepOriginId *delete_origin,
+ TimestampTz *delete_time)
...
...
+ /* Try to find the tuple */
+ while (table_scan_getnextslot(scan, ForwardScanDirection, scanslot))
+ {
+ bool dead = false;
+ TransactionId xmax;
+ TimestampTz localts;
+ RepOriginId localorigin;
+
+ if (!tuples_equal(scanslot, searchslot, eq, indexbitmap))
+ continue;
+
+ tuple = ExecFetchSlotHeapTuple(scanslot, false, NULL);
+ buf = hslot->buffer;
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ if (HeapTupleSatisfiesVacuum(tuple, oldestXmin, buf) == HEAPTUPLE_RECENTLY_DEAD)
+ dead = true;
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (!dead)
+ continue;

Why do we need to check only for HEAPTUPLE_RECENTLY_DEAD and not
HEAPTUPLE_DEAD? IIUC, we came here because we couldn't find a live
tuple; whether the tuple is DEAD or RECENTLY_DEAD, why should that
matter for detecting the update_deleted conflict?
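
For what it's worth, a minimal sketch of the adjustment being asked about,
assuming the scan loop shown above (illustrative only, not code from the
patch):

		HTSV_Result status;

		LockBuffer(buf, BUFFER_LOCK_SHARE);
		status = HeapTupleSatisfiesVacuum(tuple, oldestXmin, buf);
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);

		/*
		 * Whether the tuple is fully DEAD or only RECENTLY_DEAD, no live
		 * version was found, so it qualifies as a candidate for
		 * update_deleted detection.
		 */
		if (status != HEAPTUPLE_DEAD && status != HEAPTUPLE_RECENTLY_DEAD)
			continue;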

4. In FindMostRecentlyDeletedTupleInfo(), add comments to state why we
need to use SnapshotAny.
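
As a reference point, one possible wording and placement for such a comment,
assuming the function scans the table with SnapshotAny (a sketch, not the
patch's text):

	/*
	 * Scan with SnapshotAny so that deleted (dead) tuple versions are still
	 * returned; an MVCC snapshot would skip them, making it impossible to
	 * find the most recently deleted row matching the key.
	 */
	scan = table_beginscan(rel, SnapshotAny, 0, NULL);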

5.
+
+      <varlistentry id="sql-createsubscription-params-with-detect-update-deleted">
+        <term><literal>detect_update_deleted</literal> (<type>boolean</type>)</term>
+        <listitem>
+         <para>
+          Specifies whether the detection of <xref linkend="conflict-update-deleted"/>
+          is enabled. The default is <literal>false</literal>. If set to
+          true, the dead tuples on the subscriber that are still useful for
+          detecting <xref linkend="conflict-update-deleted"/>
+          are retained,

One of the purposes of retaining dead tuples is to detect the
update_deleted conflict. But I also see the following in 0001's commit
message: "Since the mechanism relies on a single replication slot, it
not only assists in retaining dead tuples but also preserves commit
timestamps and origin data. These information will be displayed in the
additional logs generated for logical replication conflicts.
Furthermore, the preserved commit timestamps and origin data are
essential for consistently detecting update_origin_differs conflicts."
which indicates there are other cases where retaining dead tuples can
help. So, I was wondering whether to name this new option
retain_dead_tuples or something along those lines.

BTW, it is not clear how retaining dead tuples will help the detection
of update_origin_differs. Will it happen when the tuple is inserted or
updated on the subscriber, and then, when we try to update the same
tuple due to a remote update, the commit_ts information of the xact is
not available because it has already been removed by vacuum? This
should apply to the update case for the new row generated by the
update operation, as that is what is used in the comparison. Can you
please show this with a test case, even if it is a manual one?

Can't it happen for delete_origin_differs as well for the same reason?

6. I feel we should keep 0004 as a later patch. We can ideally
consider committing 0001, 0002, 0003, 0005, and 0006 (or part of 0006
to get some tests that are relevant) as one unit and then the patch to
detect and report update_delete conflict. What do you think?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Tue, Dec 24, 2024 at 6:43 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure because moving the check into these functions because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, but I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to setup
> a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.
>
> Based on some off-list discussions with Sawada-san and Amit, it would be better
> if the apply worker could avoid reporting an ERROR when the publisher's clock
> lags behind that of the subscriber, so I implemented a new 0007 patch to allow
> the apply worker to wait for the clock skew to pass and then send a new request
> to the publisher for the latest status. The implementation is as follows:
>
> Since we have the time (reply_time) at which the walsender confirms that all
> the committing transactions have finished, any subsequent transactions on the
> publisher should be assigned a commit timestamp later than reply_time.
> Similarly, we have the time (candidate_xid_time) at which the oldest active
> xid is determined; any old transactions on the publisher that have finished
> should have a commit timestamp earlier than candidate_xid_time.
>
> The apply worker can compare candidate_xid_time with reply_time. If
> candidate_xid_time is less than reply_time, then it's OK to advance the xid
> immediately. If candidate_xid_time is greater than reply_time, it means the
> clock of the publisher is behind that of the subscriber, so the apply worker
> can wait for the skew to pass before advancing the xid.
>
> Since this is considered an improvement, we can focus on it after pushing
> the main patches.
>

Thank you for updating the patches!

I have one comment on the 0001 patch:

+       /*
+        * The changes made by this and later transactions are still non-removable
+        * to allow for the detection of update_deleted conflicts when applying
+        * changes in this logical replication worker.
+        *
+        * Note that this info cannot directly protect dead tuples from being
+        * prematurely frozen or removed. The logical replication launcher
+        * asynchronously collects this info to determine whether to advance the
+        * xmin value of the replication slot.
+        *
+        * Therefore, FullTransactionId that includes both the transaction ID and
+        * its epoch is used here instead of a single Transaction ID. This is
+        * critical because without considering the epoch, the transaction ID
+        * alone may appear as if it is in the future due to transaction ID
+        * wraparound.
+        */
+       FullTransactionId oldest_nonremovable_xid;

The last paragraph of the comment mentions that we need to use
FullTransactionId to properly compare XIDs even after an XID
wraparound happens. But once we set the oldest-nonremovable-xid,
doesn't it prevent XIDs from wrapping around? I mean that workers'
oldest-nonremovable-xid values and the slot's non-removal xid (i.e.,
its xmin) are never more than 2^31 XIDs apart.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
vignesh C
Дата:
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure because moving the check into these functions because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, but I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to setup
> a standby to test the ERROR case, so didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.

Few comments:
1) In case there are no free logical replication worker slots, the launcher
process just logs a warning "out of logical replication worker slots"
and continues, whereas if the "pg_conflict_detection" replication slot
cannot be created, the launcher throws an error and exits, as shown
below. Can we throw a warning in this case too?
2025-01-02 10:24:41.899 IST [4280] ERROR:  all replication slots are in use
2025-01-02 10:24:41.899 IST [4280] HINT:  Free one or increase
"max_replication_slots".
2025-01-02 10:24:42.148 IST [4272] LOG:  background worker "logical
replication launcher" (PID 4280) exited with exit code 1

2) Currently, if track_commit_timestamp is disabled, we do not detect
this immediately after the worker starts for such a subscription;
instead, it is detected later during conflict detection. Since changing
the track_commit_timestamp GUC requires a server restart, it would be
beneficial for DBAs if the error were raised immediately. This way,
DBAs would notice the issue while monitoring the server restart and
could take the necessary corrective action without delay.
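
A minimal sketch of the kind of early check being suggested, to be run when
the apply worker starts; the subscription field name "detectupdatedeleted" is
hypothetical, not taken from the patch:

	if (MySubscription->detectupdatedeleted && !track_commit_timestamp)
		ereport(ERROR,
				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				errmsg("detect_update_deleted requires \"track_commit_timestamp\" to be enabled"));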

3) Tab completion for CREATE SUBSCRIPTION does not include the
detect_update_deleted option:
postgres=# create subscription sub3 CONNECTION 'dbname=postgres
host=localhost port=5432' publication pub1 with (
BINARY              COPY_DATA           DISABLE_ON_ERROR    FAILOVER
         PASSWORD_REQUIRED   SLOT_NAME           SYNCHRONOUS_COMMIT
CONNECT             CREATE_SLOT         ENABLED             ORIGIN
         RUN_AS_OWNER        STREAMING           TWO_PHASE

4) Similarly, tab completion for ALTER SUBSCRIPTION does not include the
detect_update_deleted option:
ALTER SUBSCRIPTION sub3 SET (
BINARY              FAILOVER            PASSWORD_REQUIRED   SLOT_NAME
         SYNCHRONOUS_COMMIT
DISABLE_ON_ERROR    ORIGIN              RUN_AS_OWNER        STREAMING
         TWO_PHASE
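
A hedged sketch of the corresponding tab-complete.c addition; the existing
keyword list is reproduced from the completions shown above, with the new
option slotted in alphabetically:

	/* CREATE SUBSCRIPTION ... WITH ( */
	else if (HeadMatches("CREATE", "SUBSCRIPTION") && TailMatches("WITH", "("))
		COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
					  "detect_update_deleted", "disable_on_error", "enabled",
					  "failover", "origin", "password_required", "run_as_owner",
					  "slot_name", "streaming", "synchronous_commit", "two_phase");
	/* The ALTER SUBSCRIPTION ... SET ( list would get the same addition. */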

5) Copyright year can be updated to 2025:
+++ b/src/test/subscription/t/035_confl_update_deleted.pl
@@ -0,0 +1,169 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test the CREATE SUBSCRIPTION 'detect_update_deleted' parameter and its
+# interaction with the xmin value of replication slots.
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;

6) This include is not required; I was able to compile without it:
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -173,12 +173,14 @@
 #include "replication/logicalrelation.h"
 #include "replication/logicalworker.h"
 #include "replication/origin.h"
+#include "replication/slot.h"
 #include "replication/walreceiver.h"

Regards,
Vignesh



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Jan 3, 2025 at 12:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I have one comment on the 0001 patch:
>
> +       /*
> +        * The changes made by this and later transactions are still
> non-removable
> +        * to allow for the detection of update_deleted conflicts when applying
> +        * changes in this logical replication worker.
> +        *
> +        * Note that this info cannot directly protect dead tuples from being
> +        * prematurely frozen or removed. The logical replication launcher
> +        * asynchronously collects this info to determine whether to advance the
> +        * xmin value of the replication slot.
> +        *
> +        * Therefore, FullTransactionId that includes both the
> transaction ID and
> +        * its epoch is used here instead of a single Transaction ID. This is
> +        * critical because without considering the epoch, the transaction ID
> +        * alone may appear as if it is in the future due to transaction ID
> +        * wraparound.
> +        */
> +       FullTransactionId oldest_nonremovable_xid;
>
> The last paragraph of the comment mentions that we need to use
> FullTransactionId to properly compare XIDs even after the XID
> wraparound happens. But once we set the oldest-nonremovable-xid it
> prevents XIDs from being wraparound, no? I mean that workers'
> oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> xmin) are never away from more than 2^31 XIDs.
>

I also think that the slot's non-removal xid should ensure that we
never allow XIDs to advance to a point where they could wrap around
relative to the oldest-nonremovable-xid value stored in each worker,
because the slot's value is the minimum across all workers. Now, if
both of us are missing something, then it is probably better to write
some more detailed comments explaining how this can happen.

Along the same lines, I was wondering whether
RetainConflictInfoData->last_phase_at really needs to be a
FullTransactionId, but I think that is correct because we can't stop
wraparound from happening on the remote node, right?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Jan 3, 2025 at 2:34 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Few comments:
> 1) In case there are no logical replication workers, the launcher
> process just logs a warning "out of logical replication worker slots"
> and continues. Whereas in case of "pg_conflict_detection" replication
> slot creation launcher throw an error and the launcher exits, can we
> throw a warning in this case too:
> 2025-01-02 10:24:41.899 IST [4280] ERROR:  all replication slots are in use
> 2025-01-02 10:24:41.899 IST [4280] HINT:  Free one or increase
> "max_replication_slots".
> 2025-01-02 10:24:42.148 IST [4272] LOG:  background worker "logical
> replication launcher" (PID 4280) exited with exit code 1
>

This case is not the same, because if we give just a WARNING and allow
the launcher to proceed, then we won't be able to protect dead rows from
removal. We don't want the apply workers to keep working and making
progress until this slot is created. Am I missing something? If not, we
probably need to ensure this if it is not already ensured. Also, we
should mention in the docs that the 'max_replication_slots' setting
should account for this additional slot.

> 2) Currently, we do not detect when the track_commit_timestamp setting
> is disabled for a subscription immediately after the worker starts.
> Instead, it is detected later during conflict detection.
>

I am not sure an ERROR is required in the first place. Shouldn't we
simply not detect update_deleted in that case? It should be documented
that 'track_commit_timestamp' must be enabled to detect this conflict.
Don't we do the same thing for the *_origin_differs types of conflicts?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Thu, Jan 2, 2025 at 2:57 PM vignesh C <vignesh21@gmail.com> wrote:
>
> An update of a truncated row is detected as update_missing, whereas an
> update of a deleted row is detected as update_deleted. I was not sure
> whether updates of truncated rows should also be detected as
> update_deleted, since the documentation says the truncate operation
> "has the same effect as an unqualified DELETE on each table" [1].
>

This is expected behavior because TRUNCATE would immediately reclaim
space and remove all the data. So, there is no way to retain the
removed row.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Dec 20, 2024 at 12:41 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> In the test scenarios already shared on -hackers [1], where pgbench was run
> only on the publisher node in a pub-sub setup, no performance degradation was
> observed on either node.
>
> In contrast, when pgbench was run only on the subscriber side with
> detect_update_deleted=on [2], the TPS performance was reduced due to dead
> tuple accumulation. This performance drop depended on the
> wal_receiver_status_interval—larger intervals resulted in more dead tuple
> accumulation on the subscriber node. However, after the improvement in patch
> v16-0002, which dynamically tunes the status request, the default TPS
> reduction was limited to only 1%.
>
> We performed more benchmarks with the v16 patches where pgbench was run on
> both the publisher and subscriber, focusing on TPS performance. To summarize
> the key observations:
>
>  - No performance impact on the publisher, as dead tuple accumulation does
>    not occur on the publisher.
>
>  - The performance is reduced on the subscriber side (TPS reduction of ~50%
>    [3]) due to dead tuple retention for conflict detection when
>    detect_update_deleted=on.
>
>  - The performance reduction happens only on the subscriber side, as the
>    workload on the publisher is pretty high and the apply workers must wait
>    for the transactions with earlier timestamps to be applied and flushed
>    before advancing the non-removable XID to remove dead tuples.
>
>  - To validate this further, we modified the patch to check only each
>    transaction's commit_time and advance the non-removable XID if the
>    commit_time is greater than candidate_xid_time. The benchmark results [4]
>    remained consistent, showing a similar performance reduction. This
>    confirms that the performance impact on the subscriber side is reasonable
>    behavior if we want to detect the update_deleted conflict reliably.
>
> We have also tested similar scenarios in physical streaming replication, to
> see the effect of enabling hot_standby_feedback and recovery_min_apply_delay.
> The benchmark results [5] showed a performance reduction in these cases as
> well, though the impact was less than in the update_deleted scenario because
> the physical walreceiver does not need to wait for the specified WAL to be
> applied before sending the hot-standby feedback message. However, as
> recovery_min_apply_delay increased, a similar TPS reduction (~50%) was
> observed, aligning with the behavior seen in the update_deleted case.
>

The first impression after seeing such a performance dip will be not
to use such a setting, but since the primary reason is that one
purposefully wants to retain dead tuples, both in physical replication
and in a pub-sub environment, it is an expected outcome. Now, it is
possible that in the real world people may not use exactly the setup we
have used to check the worst-case performance. For example, for a
pub-sub setup, one could imagine that writes happen on two nodes N1
and N2 (both publisher nodes) and then all the changes from both nodes
are assembled on a third node N3 (a subscriber node). Or the subscriber
node may not be set up for aggressive writes, or one may be okay with
not detecting update_deleted conflicts with complete accuracy.

>
> Based on the above, I think the performance reduction observed with the
> update_deleted patch is expected and necessary because the patch's main goal
> is to retain dead tuples for reliable conflict detection. Reducing this
> retention period would compromise the accuracy of update_deleted detection.
>

The point related to dead tuple accumulation (or database bloat) with
this setting should be documented similarly to what we document for
hot_standby_feedback. See hot_standby_feedback description in docs
[1].

[1] - https://www.postgresql.org/docs/devel/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> 
> I have one comment on the 0001 patch:

Thanks for the comments!

> 
> +       /*
> +        * The changes made by this and later transactions are still
> non-removable
> +        * to allow for the detection of update_deleted conflicts when
> applying
> +        * changes in this logical replication worker.
> +        *
> +        * Note that this info cannot directly protect dead tuples from being
> +        * prematurely frozen or removed. The logical replication launcher
> +        * asynchronously collects this info to determine whether to advance
> the
> +        * xmin value of the replication slot.
> +        *
> +        * Therefore, FullTransactionId that includes both the
> transaction ID and
> +        * its epoch is used here instead of a single Transaction ID. This is
> +        * critical because without considering the epoch, the transaction ID
> +        * alone may appear as if it is in the future due to transaction ID
> +        * wraparound.
> +        */
> +       FullTransactionId oldest_nonremovable_xid;
> 
> The last paragraph of the comment mentions that we need to use
> FullTransactionId to properly compare XIDs even after the XID wraparound
> happens. But once we set the oldest-nonremovable-xid it prevents XIDs from
> being wraparound, no? I mean that workers'
> oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> xmin) are never away from more than 2^31 XIDs.

I think the issue is that the launcher may create the replication slot after
the apply worker has already set 'oldest_nonremovable_xid', because the
launcher does that asynchronously. So, before the slot is created, there's
a window where transaction IDs might wrap around. If the apply worker has
initially computed a candidate_xid (755) and the XID wraps around before the
launcher creates the slot, causing the new current xid to be (740), then the
old candidate_xid (755) looks like an xid in the future, and the launcher
could advance the xmin to 755, which would cause the dead tuples to be
removed prematurely. (We are trying to reproduce this to confirm that it's
a real issue and will share the results once finished.)
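
To make the failure mode concrete, here is a small illustration (not code
from the patch) of why a plain 32-bit comparison misjudges the
pre-wraparound candidate while a FullTransactionId comparison does not,
using the numbers from the example above:

	/* candidate_xid = 755 computed in epoch 0; current xid = 740 in epoch 1 */
	FullTransactionId candidate = FullTransactionIdFromEpochAndXid(0, 755);
	FullTransactionId current = FullTransactionIdFromEpochAndXid(1, 740);

	/* 64-bit comparison: the old candidate correctly precedes the current XID */
	Assert(FullTransactionIdPrecedes(candidate, current));

	/* 32-bit comparison: 755 wrongly appears to be in the future of 740 */
	Assert(!TransactionIdPrecedes(755, 740));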

We thought of another approach, which is to create/drop this slot as soon as
one enables/disables detect_update_deleted (e.g., create/drop the slot during
DDL). But it seems complicated to control concurrent slot creation and
dropping. For example, if backend A enables detect_update_deleted, it will
create a slot. But if another backend B is disabling detect_update_deleted at
the same time, then the newly created slot may be dropped by backend B. I
thought about checking the number of subscriptions that enable
detect_update_deleted before dropping the slot in backend B, but the
subscription changes made by backend A may not be visible yet (e.g., not
committed yet).

Does that make sense to you, or do you have some other ideas?

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Nisha Moond
Дата:
On Mon, Jan 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> >
> > I have one comment on the 0001 patch:
>
> Thanks for the comments!
>
> >
> > +       /*
> > +        * The changes made by this and later transactions are still
> > non-removable
> > +        * to allow for the detection of update_deleted conflicts when
> > applying
> > +        * changes in this logical replication worker.
> > +        *
> > +        * Note that this info cannot directly protect dead tuples from being
> > +        * prematurely frozen or removed. The logical replication launcher
> > +        * asynchronously collects this info to determine whether to advance
> > the
> > +        * xmin value of the replication slot.
> > +        *
> > +        * Therefore, FullTransactionId that includes both the
> > transaction ID and
> > +        * its epoch is used here instead of a single Transaction ID. This is
> > +        * critical because without considering the epoch, the transaction ID
> > +        * alone may appear as if it is in the future due to transaction ID
> > +        * wraparound.
> > +        */
> > +       FullTransactionId oldest_nonremovable_xid;
> >
> > The last paragraph of the comment mentions that we need to use
> > FullTransactionId to properly compare XIDs even after the XID wraparound
> > happens. But once we set the oldest-nonremovable-xid it prevents XIDs from
> > being wraparound, no? I mean that workers'
> > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > xmin) are never away from more than 2^31 XIDs.
>
> I think the issue is that the launcher may create the replication slot after
> the apply worker has already set the 'oldest_nonremovable_xid' because the
> launcher are doing that asynchronously. So, Before the slot is created, there's
> a window where transaction IDs might wrap around. If initially the apply worker
> has computed a candidate_xid (755) and the xid wraparound before the launcher
> creates the slot, causing the new current xid to be (740), then the old
> candidate_xid(755) looks like a xid in the future, and the launcher could
> advance the xmin to 755 which cause the dead tuples to be removed prematurely.
> (We are trying to reproduce this to ensure that it's a real issue and will
> share after finishing)
>

I tried to reproduce the issue described above, where an
xid_wraparound occurs before the launcher creates the conflict slot,
and the apply worker retains a very old xid (from before the
wraparound) as its oldest_nonremovable_xid.

In this scenario, the launcher will not update the apply worker's
older epoch xid (oldest_nonremovable_xid = 755) as the conflict slot's
xmin. This is because advance_conflict_slot_xmin() ensures proper
handling by comparing the full 64-bit xids. However, this could lead
to real issues if 32-bit TransactionID were used instead of 64-bit
FullTransactionID. The detailed test steps and results are as below:

Setup:  A Publisher-Subscriber setup with logical replication.

Steps done to reproduce the test scenario -
On Sub -
1) Created a subscription with detect_update_deleted=off, so no
conflict slot to start with.
2) Attached gdb to the launcher and put a breakpoint at
advance_conflict_slot_xmin().
3) Run "alter subscription ..... (detect_update_deleted=ON);"
4) Stopped the launcher at the start of advance_conflict_slot_xmin(),
blocking the creation of the conflict slot.
5) Attached another gdb session to the apply worker and made sure it
had set an oldest_nonremovable_xid, in maybe_advance_nonremovable_xid():

  (gdb) p MyLogicalRepWorker->oldest_nonremovable_xid
  $3 = {value = 760}
  -- so apply worker's oldest_nonremovable_xid = 760

6) Consumed ~4.2 billion xids to let the xid_wraparound happen. After
the wraparound, the next_xid was "705", which is less than "760".
7) Released the launcher from gdb, while the apply worker remained stopped in gdb.
8) The slot gets created with xmin=705:

  postgres=# select slot_name, slot_type, active, xmin, catalog_xmin, restart_lsn, inactive_since, confirmed_flush_lsn from pg_replication_slots;
        slot_name        | slot_type | active | xmin | catalog_xmin | restart_lsn | inactive_since | confirmed_flush_lsn
  -----------------------+-----------+--------+------+--------------+-------------+----------------+---------------------
   pg_conflict_detection | physical  | t      |  705 |              |             |                |
  (1 row)

Next, when the launcher tries to advance the slot's xmin in
advance_conflict_slot_xmin() with new_xmin set to the apply worker's
oldest_nonremovable_xid (760), it returns without updating the slot's
xmin because of the check below:
````
  if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin))
    return false;
````
We are comparing the full 64-bit xids in
FullTransactionIdPrecedesOrEquals(), and in this case the values are:
  new_xmin=760
  full_xmin=4294968001 (w.r.t. xid=705)

As "760 <= 4294968001", the launcher returns from here and does not
update the slot's xmin to "760". The above check will always be true in
such scenarios.
Note: The launcher would have updated the slot's xmin to 760 if 32-bit
XIDs were being compared, since "760 <= 705" would not hold.

--
Thanks,
Nisha



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Mon, Jan 6, 2025 at 3:22 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> >
> > I have one comment on the 0001 patch:
>
> Thanks for the comments!
>
> >
> > +       /*
> > +        * The changes made by this and later transactions are still
> > non-removable
> > +        * to allow for the detection of update_deleted conflicts when
> > applying
> > +        * changes in this logical replication worker.
> > +        *
> > +        * Note that this info cannot directly protect dead tuples from being
> > +        * prematurely frozen or removed. The logical replication launcher
> > +        * asynchronously collects this info to determine whether to advance
> > the
> > +        * xmin value of the replication slot.
> > +        *
> > +        * Therefore, FullTransactionId that includes both the
> > transaction ID and
> > +        * its epoch is used here instead of a single Transaction ID. This is
> > +        * critical because without considering the epoch, the transaction ID
> > +        * alone may appear as if it is in the future due to transaction ID
> > +        * wraparound.
> > +        */
> > +       FullTransactionId oldest_nonremovable_xid;
> >
> > The last paragraph of the comment mentions that we need to use
> > FullTransactionId to properly compare XIDs even after the XID wraparound
> > happens. But once we set the oldest-nonremovable-xid it prevents XIDs from
> > being wraparound, no? I mean that workers'
> > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > xmin) are never away from more than 2^31 XIDs.
>
> I think the issue is that the launcher may create the replication slot after
> the apply worker has already set the 'oldest_nonremovable_xid' because the
> launcher are doing that asynchronously. So, Before the slot is created, there's
> a window where transaction IDs might wrap around. If initially the apply worker
> has computed a candidate_xid (755) and the xid wraparound before the launcher
> creates the slot, causing the new current xid to be (740), then the old
> candidate_xid(755) looks like a xid in the future, and the launcher could
> advance the xmin to 755 which cause the dead tuples to be removed prematurely.
> (We are trying to reproduce this to ensure that it's a real issue and will
> share after finishing)

The slot's first xmin is calculated by
GetOldestSafeDecodingTransactionId(false). Could the initially computed
candidate_xid be newer than this xid?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tuesday, January 7, 2025 2:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> 
> On Mon, Jan 6, 2025 at 3:22 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Friday, January 3, 2025 2:36 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > >
> > > I have one comment on the 0001 patch:
> >
> > Thanks for the comments!
> >
> > >
> > > +       /*
> > > +        * The changes made by this and later transactions are still
> > > non-removable
> > > +        * to allow for the detection of update_deleted conflicts
> > > + when
> > > applying
> > > +        * changes in this logical replication worker.
> > > +        *
> > > +        * Note that this info cannot directly protect dead tuples from
> being
> > > +        * prematurely frozen or removed. The logical replication launcher
> > > +        * asynchronously collects this info to determine whether to
> > > + advance
> > > the
> > > +        * xmin value of the replication slot.
> > > +        *
> > > +        * Therefore, FullTransactionId that includes both the
> > > transaction ID and
> > > +        * its epoch is used here instead of a single Transaction ID. This is
> > > +        * critical because without considering the epoch, the transaction
> ID
> > > +        * alone may appear as if it is in the future due to transaction ID
> > > +        * wraparound.
> > > +        */
> > > +       FullTransactionId oldest_nonremovable_xid;
> > >
> > > The last paragraph of the comment mentions that we need to use
> > > FullTransactionId to properly compare XIDs even after the XID
> > > wraparound happens. But once we set the oldest-nonremovable-xid it
> > > prevents XIDs from being wraparound, no? I mean that workers'
> > > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > > xmin) are never away from more than 2^31 XIDs.
> >
> > I think the issue is that the launcher may create the replication slot
> > after the apply worker has already set the 'oldest_nonremovable_xid'
> > because the launcher are doing that asynchronously. So, Before the
> > slot is created, there's a window where transaction IDs might wrap
> > around. If initially the apply worker has computed a candidate_xid
> > (755) and the xid wraparound before the launcher creates the slot,
> > causing the new current xid to be (740), then the old
> > candidate_xid(755) looks like a xid in the future, and the launcher
> > could advance the xmin to 755 which cause the dead tuples to be removed
> prematurely.
> > (We are trying to reproduce this to ensure that it's a real issue and
> > will share after finishing)
> 
> The slot's first xmin is calculated by
> GetOldestSafeDecodingTransactionId(false). The initial computed
> cancidate_xid could be newer than this xid?

I think the issue occurs when the slot is created after an XID wraparound. As a
result, GetOldestSafeDecodingTransactionId() returns the current XID
(after wraparound), which appears older than the computed candidate_xid (e.g.,
oldest_nonremovable_xid). Nisha has shared detailed steps to reproduce the
issue in [1]. What do you think?

[1] https://www.postgresql.org/message-id/CABdArM6P0zoEVRN%2B3YHNET_oOaAVOKc-EPUnXiHkcBJ-uDKQVw%40mail.gmail.com

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Mon, Jan 6, 2025 at 10:40 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, January 7, 2025 2:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> >
> > On Mon, Jan 6, 2025 at 3:22 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > On Friday, January 3, 2025 2:36 PM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > >
> > > >
> > > > I have one comment on the 0001 patch:
> > >
> > > Thanks for the comments!
> > >
> > > >
> > > > +       /*
> > > > +        * The changes made by this and later transactions are still
> > > > non-removable
> > > > +        * to allow for the detection of update_deleted conflicts
> > > > + when
> > > > applying
> > > > +        * changes in this logical replication worker.
> > > > +        *
> > > > +        * Note that this info cannot directly protect dead tuples from
> > being
> > > > +        * prematurely frozen or removed. The logical replication launcher
> > > > +        * asynchronously collects this info to determine whether to
> > > > + advance
> > > > the
> > > > +        * xmin value of the replication slot.
> > > > +        *
> > > > +        * Therefore, FullTransactionId that includes both the
> > > > transaction ID and
> > > > +        * its epoch is used here instead of a single Transaction ID. This is
> > > > +        * critical because without considering the epoch, the transaction
> > ID
> > > > +        * alone may appear as if it is in the future due to transaction ID
> > > > +        * wraparound.
> > > > +        */
> > > > +       FullTransactionId oldest_nonremovable_xid;
> > > >
> > > > The last paragraph of the comment mentions that we need to use
> > > > FullTransactionId to properly compare XIDs even after the XID
> > > > wraparound happens. But once we set the oldest-nonremovable-xid it
> > > > prevents XIDs from being wraparound, no? I mean that workers'
> > > > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > > > xmin) are never away from more than 2^31 XIDs.
> > >
> > > I think the issue is that the launcher may create the replication slot
> > > after the apply worker has already set the 'oldest_nonremovable_xid'
> > > because the launcher are doing that asynchronously. So, Before the
> > > slot is created, there's a window where transaction IDs might wrap
> > > around. If initially the apply worker has computed a candidate_xid
> > > (755) and the xid wraparound before the launcher creates the slot,
> > > causing the new current xid to be (740), then the old
> > > candidate_xid(755) looks like a xid in the future, and the launcher
> > > could advance the xmin to 755 which cause the dead tuples to be removed
> > prematurely.
> > > (We are trying to reproduce this to ensure that it's a real issue and
> > > will share after finishing)
> >
> > The slot's first xmin is calculated by
> > GetOldestSafeDecodingTransactionId(false). The initial computed
> > cancidate_xid could be newer than this xid?
>
> I think the issue occurs when the slot is created after an XID wraparound. As a
> result, GetOldestSafeDecodingTransactionId() returns the current XID
> (after wraparound), which appears older than the computed candidate_xid (e.g.,
> oldest_nonremovable_xid). Nisha has shared detailed steps to reproduce the
> issue in [1]. What do you think ?

I agree that the scenario Nisha shared could happen with the current
patch. On the other hand, I think that if the slot's initial xmin is
always newer than or equal to the initially computed non-removable xid
(i.e., the oldest of the workers' oldest_nonremovable_xid values), we
can always use the slot's first xmin. And I think that might be true,
although I'm concerned about the fact that the worker's
oldest_nonremovable_xid and the slot's initial xmin are calculated
differently (GetOldestActiveTransactionId() and
GetOldestSafeDecodingTransactionId(), respectively). That way,
subsequent comparisons between the slot's xmin and the computed
candidate_xid won't need to take care of the epoch. IOW, the worker's
non-removable-xid values effectively are not used until the slot is
created.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Jan 3, 2025 at 11:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 5.
> +
> +      <varlistentry
> id="sql-createsubscription-params-with-detect-update-deleted">
> +        <term><literal>detect_update_deleted</literal>
> (<type>boolean</type>)</term>
> +        <listitem>
> +         <para>
> +          Specifies whether the detection of <xref
> linkend="conflict-update-deleted"/>
> +          is enabled. The default is <literal>false</literal>. If set to
> +          true, the dead tuples on the subscriber that are still useful for
> +          detecting <xref linkend="conflict-update-deleted"/>
> +          are retained,
>
> One of the purposes of retaining dead tuples is to detect
> update_delete conflict. But, I also see the following in 0001's commit
> message: "Since the mechanism relies on a single replication slot, it
> not only assists in retaining dead tuples but also preserves commit
> timestamps and origin data. These information will be displayed in the
> additional logs generated for logical replication conflicts.
> Furthermore, the preserved commit timestamps and origin data are
> essential for consistently detecting update_origin_differs conflicts."
> which indicates there are other cases where retaining dead tuples can
> help. So, I was thinking about whether to name this new option as
> retain_dead_tuples or something along those lines?
>

Another possible option name could be retain_conflict_info.
Sawada-san and others, do you have any preference for the name of
this option?

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tuesday, January 7, 2025 3:05 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> On Mon, Jan 6, 2025 at 10:40 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, January 7, 2025 2:00 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > Hi,
> >
> > >
> > > On Mon, Jan 6, 2025 at 3:22 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com>
> > > wrote:
> > > >
> > > > On Friday, January 3, 2025 2:36 PM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > >
> > > > > I have one comment on the 0001 patch:
> > > >
> > > > Thanks for the comments!
> > > >
> > > > >
> > > > > +       /*
> > > > > +        * The changes made by this and later transactions are still
> > > > > non-removable
> > > > > +        * to allow for the detection of update_deleted conflicts
> > > > > + when
> > > > > applying
> > > > > +        * changes in this logical replication worker.
> > > > > +        *
> > > > > +        * Note that this info cannot directly protect dead tuples from
> > > being
> > > > > +        * prematurely frozen or removed. The logical replication
> launcher
> > > > > +        * asynchronously collects this info to determine whether to
> > > > > + advance
> > > > > the
> > > > > +        * xmin value of the replication slot.
> > > > > +        *
> > > > > +        * Therefore, FullTransactionId that includes both the
> > > > > transaction ID and
> > > > > +        * its epoch is used here instead of a single Transaction ID.
> This is
> > > > > +        * critical because without considering the epoch, the
> transaction
> > > ID
> > > > > +        * alone may appear as if it is in the future due to transaction
> ID
> > > > > +        * wraparound.
> > > > > +        */
> > > > > +       FullTransactionId oldest_nonremovable_xid;
> > > > >
> > > > > The last paragraph of the comment mentions that we need to use
> > > > > FullTransactionId to properly compare XIDs even after the XID
> > > > > wraparound happens. But once we set the oldest-nonremovable-xid it
> > > > > prevents XIDs from being wraparound, no? I mean that workers'
> > > > > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > > > > xmin) are never away from more than 2^31 XIDs.
> > > >
> > > > I think the issue is that the launcher may create the replication slot
> > > > after the apply worker has already set the 'oldest_nonremovable_xid'
> > > > because the launcher are doing that asynchronously. So, Before the
> > > > slot is created, there's a window where transaction IDs might wrap
> > > > around. If initially the apply worker has computed a candidate_xid
> > > > (755) and the xid wraparound before the launcher creates the slot,
> > > > causing the new current xid to be (740), then the old
> > > > candidate_xid(755) looks like a xid in the future, and the launcher
> > > > could advance the xmin to 755 which cause the dead tuples to be
> removed
> > > prematurely.
> > > > (We are trying to reproduce this to ensure that it's a real issue and
> > > > will share after finishing)
> > >
> > > The slot's first xmin is calculated by
> > > GetOldestSafeDecodingTransactionId(false). The initial computed
> > > cancidate_xid could be newer than this xid?
> >
> > I think the issue occurs when the slot is created after an XID wraparound. As
> a
> > result, GetOldestSafeDecodingTransactionId() returns the current XID
> > (after wraparound), which appears older than the computed candidate_xid
> (e.g.,
> > oldest_nonremovable_xid). Nisha has shared detailed steps to reproduce the
> > issue in [1]. What do you think ?
> 
> I agree that the scenario Nisha shared could happen with the current
> patch. On the other hand, I think that if slot's initial xmin is
> always newer than or equal to the initial computed non-removable-xid
> (i.e., the oldest of workers' oldest_nonremovable_xid values), we can
> always use slot's first xmin. And I think it might be true while I'm
> concerned the fact that worker's oldest_nonremoable_xid and the slot's
> initial xmin is calculated differently (GetOldestActiveTransactionId()
> and GetOldestSafeDecodingTransactionId(), respectively). That way,
> subsequent comparisons between slot's xmin and computed candidate_xid
> won't need to take care of the epoch. IOW, the worker's
> non-removable-xid values effectively are not used until the slot is
> created.

I might be missing something, so could you please elaborate a bit more on this
idea?

Initially, I thought you meant delaying the initialization of slot.xmin until
after the worker computes the oldest_nonremovable_xid. However, I think the
same issue would occur with this approach as well [1], with the difference
being that the slot would directly use a future XID as xmin, which seems
inappropriate to me.

Or do you mean the opposite, that we delay the initialization of
oldest_nonremovable_xid until after the creation of the slot?

[1]
> So, Before the slot is created, there's a window where transaction IDs might
> wrap around. If initially the apply worker has computed a candidate_xid (755)
> and the xid wraparound before the launcher creates the slot, causing the new
> current xid to be (740), then the old candidate_xid(755) looks like a xid in
> the future, and the launcher could advance the xmin to 755 which cause the
> dead tuples to be removed prematurely.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Mon, Jan 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >
> > +       /*
> > +        * The changes made by this and later transactions are still
> > non-removable
> > +        * to allow for the detection of update_deleted conflicts when
> > applying
> > +        * changes in this logical replication worker.
> > +        *
> > +        * Note that this info cannot directly protect dead tuples from being
> > +        * prematurely frozen or removed. The logical replication launcher
> > +        * asynchronously collects this info to determine whether to advance
> > the
> > +        * xmin value of the replication slot.
> > +        *
> > +        * Therefore, FullTransactionId that includes both the
> > transaction ID and
> > +        * its epoch is used here instead of a single Transaction ID. This is
> > +        * critical because without considering the epoch, the transaction ID
> > +        * alone may appear as if it is in the future due to transaction ID
> > +        * wraparound.
> > +        */
> > +       FullTransactionId oldest_nonremovable_xid;
> >
> > The last paragraph of the comment mentions that we need to use
> > FullTransactionId to properly compare XIDs even after the XID wraparound
> > happens. But once we set the oldest-nonremovable-xid it prevents XIDs from
> > being wraparound, no? I mean that workers'
> > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > xmin) are never away from more than 2^31 XIDs.
>
> I think the issue is that the launcher may create the replication slot after
> the apply worker has already set the 'oldest_nonremovable_xid' because the
> launcher are doing that asynchronously. So, Before the slot is created, there's
> a window where transaction IDs might wrap around. If initially the apply worker
> has computed a candidate_xid (755) and the xid wraparound before the launcher
> creates the slot, causing the new current xid to be (740), then the old
> candidate_xid(755) looks like a xid in the future, and the launcher could
> advance the xmin to 755 which cause the dead tuples to be removed prematurely.
> (We are trying to reproduce this to ensure that it's a real issue and will
> share after finishing)
>
> We thought of another approach, which is to create/drop this slot first as
> soon as one enables/disables detect_update_deleted (E.g. create/drop slot
> during DDL). But it seems complicate to control the concurrent slot
> create/drop. For example, if one backend A enables detect_update_deteled, it
> will create a slot. But if another backend B is disabling the
> detect_update_deteled at the same time, then the newly created slot may be
> dropped by backend B. I thought about checking the number of subscriptions that
> enables detect_update_deteled before dropping the slot in backend B, but the
> subscription changes caused by backend A may not visable yet (e.g. not
> committed yet).
>

This means that for the transaction whose changes are not yet visible,
we may have already created the slot and the backend B would end up
dropping it. Is it possible that during the change of this new option
via DDL, we take AccessExclusiveLock on pg_subscription as we do in
DropSubscription() to ensure that concurrent transactions can't drop
the slot? Will that help in solving the above scenario?
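
A minimal sketch of what idea (1) could look like when the option is toggled
in AlterSubscription(), mirroring the catalog locking that DropSubscription()
already does (sketch only, not the patch):

	/*
	 * Serialize concurrent enable/disable of the option so that one backend
	 * cannot drop the pg_conflict_detection slot while another is creating it.
	 */
	rel = table_open(SubscriptionRelationId, AccessExclusiveLock);
	/* ... update the subscription option, create or drop the slot ... */
	table_close(rel, NoLock);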

The second idea could be that each worker first checks whether the slot
exists along with the subscription flag (new option). Checking the
existence of a slot each time would be costly, so we would somehow need
to cache it. But if we do that, then we need to invent some cache
invalidation mechanism for the slot, and I am not sure we can design a
race-free mechanism for that. I mean, we need a solution for the race
conditions between the launcher and the apply workers to ensure that,
after dropping the slot, the launcher doesn't recreate it (say, if some
subscription enables this option) before all the workers have cleared
their existing values of oldest_nonremovable_xid.

The third idea to avoid the race condition could be that in
InitializeLogRepWorker(), after CommitTransactionCommand(), we check
whether the retain_dead_tuples flag is true for MySubscription and, if
so, whether the system slot exists. If it exists, then go ahead;
otherwise, wait till the slot is created. It could cost some additional
cycles during worker startup, but it is a one-time effort and only when
the flag is set. In addition to this, we anyway need to create the slot
in the launcher before launching the workers, and after re-reading the
subscription, a change in the retain_dead_tuples flag (off->on) should
cause a worker restart.
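
A rough sketch of idea (3); the subscription field name "retaindeadtuples"
and the polling interval are illustrative assumptions, not the patch's code:

	if (MySubscription->retaindeadtuples)
	{
		/* Wait until the launcher has created the conflict-detection slot. */
		while (SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT, true) == NULL)
		{
			CHECK_FOR_INTERRUPTS();
			pg_usleep(100 * 1000L);	/* recheck every 100ms */
		}
	}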

Now, in the third idea, the issue can still arise if, after waiting
for the slot to be created, the user sets retain_dead_tuples to false
and back to true again immediately. The launcher may have noticed the
"retain_dead_tuples=false" operation and dropped the slot, while the
apply worker has not noticed and still holds an old candidate_xid. The
xid may wrap around in this window before retain_dead_tuples is set
back to true. And the apply worker would not restart, because by the
time it calls maybe_reread_subscription(), retain_dead_tuples would
have been set back to true again. Again, to avoid this race condition,
the launcher can wait for each worker to reset its
oldest_nonremovable_xid before dropping the slot.

Even after doing the above, the third idea could still have another
race condition:
1. The launcher creates the replication slot and starts a worker with
retain_dead_tuples = true; the worker is waiting for the publisher
status and has not yet set oldest_nonremovable_xid.
2. The user sets retain_dead_tuples to false; the launcher notices that
and drops the replication slot.
3. The worker receives the status and sets oldest_nonremovable_xid to a
valid value (say 750).
4. An XID wraparound happens at this point and, say, the new available
XID becomes 740.
5. The user sets retain_dead_tuples = true again.

After the above steps, the apply worker holds an old
oldest_nonremovable_xid (750) and will not restart if it does not call
maybe_reread_subscription() before step 5. So, such a case can again
lead to an incorrect slot->xmin value. We could probably try to find
some way to avoid this race condition as well, but I haven't thought
more about it, as this idea sounds a bit risky and bug-prone to me.

Among the above ideas, the first one, taking AccessExclusiveLock on
pg_subscription, sounds safest to me. I haven't evaluated the changes
for that approach, so I could be missing something that makes it
difficult to achieve, but I think it is worth investigating unless we
have better ideas or we think that the current approach used in the
patch, relying on FullTransactionId, is okay.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
vignesh C
Дата:
On Wed, 25 Dec 2024 at 08:13, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, December 23, 2024 2:15 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Hou,
> >
> > Thanks for updating the patch. Few comments:
>
> Thanks for the comments!
>
> > 02.  ErrorOnReservedSlotName()
> >
> > Currently the function is callsed from three points -
> > create_physical_replication_slot(),
> > create_logical_replication_slot() and CreateReplicationSlot().
> > Can we move them to the ReplicationSlotCreate(), or combine into
> > ReplicationSlotValidateName()?
>
> I am not sure because moving the check into these functions because that would
> prevent the launcher from creating the slot as well unless we add a new
> parameter for these functions, but I am not sure if it's worth it at this
> stage.
>
> >
> > 03. advance_conflict_slot_xmin()
> >
> > ```
> >       Assert(TransactionIdIsValid(MyReplicationSlot->data.xmin));
> > ```
> >
> > Assuming the case that the launcher crashed just after
> > ReplicationSlotCreate(CONFLICT_DETECTION_SLOT).
> > After the restart, the slot can be acquired since
> > SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT)
> > is true, but the process would fail the assert because data.xmin is still invalid.
> >
> > I think we should re-create the slot when the xmin is invalid. Thought?
>
> After thinking more, the standard approach to me would be to mark the slot as
> EPHEMERAL during creation and persist it after initializing, so changed like
> that.
>
> > 05. check_remote_recovery()
> >
> > Can we add a test case related with this?
>
> I think the code path is already tested, and I am a bit unsure if we want to set up
> a standby to test the ERROR case, so I didn't add this.
>
> ---
>
> Attach the new version patch set which addressed all other comments.

I was doing a backward compatibility test by creating a publication on
PG17 and a subscription with the patch on HEAD.
Currently, we are able to create a subscription with the
detect_update_deleted option for a publication on PG17:
postgres=# create subscription sub1 connection 'dbname=postgres
host=localhost port=5432' publication pub1 with
(detect_update_deleted=true);
NOTICE:  created replication slot "sub1" on publisher
CREATE SUBSCRIPTION

This should not be allowed now, as the subscriber will request
publisher status from the publisher, for which no handling is
available in a PG17 publisher:
+static void
+request_publisher_status(RetainConflictInfoData *data)
+{
...
+       pq_sendbyte(request_message, 'p');
+       pq_sendint64(request_message, GetCurrentTimestamp());
...
+}

I felt this should not be allowed.

Regards,
Vignesh



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, January 2, 2025 6:34 PM vignesh C <vignesh21@gmail.com> wrote:
> 
> Few suggestions:
> 1) If we have a subscription with the detect_update_deleted option and we
> try to upgrade it with default settings (in case the DBA forgot to set
> track_commit_timestamp), the upgrade will fail after doing a lot of
> steps, as shown below:
> Setting locale and encoding for new cluster                   ok
> Analyzing all rows in the new cluster                         ok
> Freezing all rows in the new cluster                          ok
> Deleting files from new pg_xact                               ok
> Copying old pg_xact to new server                             ok
> Setting oldest XID for new cluster                            ok
> Setting next transaction ID and epoch for new cluster         ok
> Deleting files from new pg_multixact/offsets                  ok
> Copying old pg_multixact/offsets to new server                ok
> Deleting files from new pg_multixact/members                  ok
> Copying old pg_multixact/members to new server                ok
> Setting next multixact ID and offset for new cluster          ok
> Resetting WAL archives                                        ok
> Setting frozenxid and minmxid counters in new cluster         ok
> Restoring global objects in the new cluster                   ok
> Restoring database schemas in the new cluster
>   postgres
> *failure*
> 
> We should detect this at an earlier point somewhere like in
> check_new_cluster_subscription_configuration and throw an error from
> there.
> 
> 2) Also should we include an additional slot for the
> pg_conflict_detection slot while checking max_replication_slots.
> Though this error will occur after the upgrade is completed, it may be
> better to include the slot during upgrade itself so that the DBA need
> not handle this error separately after the upgrade is completed.

Thanks for the comments!

I added the suggested changes but didn't add more tests to verify each error
message in this version, because it seems a rare case to me, so I am not sure
if it's worth increasing the testing time for these errors. But I am OK to add
if people think it's worth the effort and I will also test this locally.

Best Regards,
Hou zj

RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, January 2, 2025 2:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> Sounds reasonable but OTOH, all other places that create physical
> slots (which we are doing here) don't use this trick. So, don't they
> need similar reliability?

I have not figured out the reason for the existing physical slots'
handling, but I will think about it more.

> Also, add some comments as to why we are
> initially creating the RS_EPHEMERAL slot as we have at other places.

Added.

> 
> Few other comments on 0003
> =======================
> 1.
> + if (sublist)
> + {
> + bool updated;
> +
> + if (!can_advance_xmin)
> + xmin = InvalidFullTransactionId;
> +
> + updated = advance_conflict_slot_xmin(xmin);
> 
> How will it help to try advancing slot_xmin when xmin is invalid?

It was intended to create the slot without updating the xmin in this case,
but the function name seems misleading. So, I will think more about this and
modify it in the next version, because it may also be affected by the
discussion in [1].

> 
> 2.
> @@ -1167,14 +1181,43 @@ ApplyLauncherMain(Datum main_arg)
>   long elapsed;
> 
>   if (!sub->enabled)
> + {
> + can_advance_xmin = false;
> 
> In ApplyLauncherMain(), if one of the subscriptions is disabled (say
> the last one in sublist), then can_advance_xmin will become false in
> the above code. Now, later, as quoted in comment-1, the patch
> overrides xmin to InvalidFullTransactionId if can_advance_xmin is
> false. Won't that lead to the wrong computation of xmin?

advance_conflict_slot_xmin() would skip updating the slot.xmin
if the input value is invalid. But I will think about how to improve this
in the next version.

> 
> 3.
> + slot_maybe_exist = true;
> + }
> +
> + /*
> + * Drop the slot if we're no longer retaining dead tuples.
> + */
> + else if (slot_maybe_exist)
> + {
> + drop_conflict_slot_if_exists();
> + slot_maybe_exist = false;
> 
> Can't we use MyReplicationSlot instead of introducing a new boolean
> slot_maybe_exist?
> 
> In any case, how does the above code deal with the case where the
> launcher is restarted for some reason and there is no subscription
> after that? Will it be possible to drop the slot in that case?

Since the initial value of slot_maybe_exist is true, I think the launcher would
always check the slot once and drop it if it is not needed, even if the
launcher restarted.
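
As a side note, that behaviour can be sketched with a tiny self-contained C
simulation (the flag and function names follow the quoted patch hunk;
everything else here is simplified and hypothetical):

#include <stdbool.h>
#include <stdio.h>

static bool slot_exists = true;			/* pretend a slot survived the restart */
static bool slot_maybe_exist = true;	/* the launcher's initial value */

static void
drop_conflict_slot_if_exists(void)
{
	if (slot_exists)
	{
		slot_exists = false;
		printf("dropped leftover conflict-detection slot\n");
	}
}

static void
launcher_iteration(bool any_sub_retains_conflict_info)
{
	if (any_sub_retains_conflict_info)
	{
		/* create or advance the slot here ... */
		slot_maybe_exist = true;
	}
	else if (slot_maybe_exist)
	{
		/* Drop the slot if we're no longer retaining dead tuples. */
		drop_conflict_slot_if_exists();
		slot_maybe_exist = false;
	}
}

int
main(void)
{
	/*
	 * Launcher restarts with no subscription using the feature: the first
	 * iteration still drops any leftover slot, later ones do nothing.
	 */
	launcher_iteration(false);
	launcher_iteration(false);
	return 0;
}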

[1] https://www.postgresql.org/message-id/CAA4eK1Li8XLJ5f-pYvPJ8pXxyA3G-QsyBLNzHY940amF7jm%3D3A%40mail.gmail.com

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
vignesh C
Date:
On Tue, 7 Jan 2025 at 18:04, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 3, 2025 1:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Dec 25, 2024 at 8:13 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Attach the new version patch set which addressed all other comments.
> > >
> >
> > Some more miscellaneous comments:
>
> Thanks for the comments!
>
> > =============================
> > 1.
> > @@ -1431,9 +1431,9 @@ RecordTransactionCommit(void)
> >   * modifying it.  This makes checkpoint's determination of which xacts
> >   * are delaying the checkpoint a bit fuzzy, but it doesn't matter.
> >   */
> > - Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
> > + Assert((MyProc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0);
> >   START_CRIT_SECTION();
> > - MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> > + MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;
> >
> >   /*
> >   * Insert the commit XLOG record.
> > @@ -1536,7 +1536,7 @@ RecordTransactionCommit(void)
> >   */
> >   if (markXidCommitted)
> >   {
> > - MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> > + MyProc->delayChkptFlags &= ~DELAY_CHKPT_IN_COMMIT;
> >   END_CRIT_SECTION();
> >
> > The comments related to this change should be updated in EndPrepare()
> > and RecordTransactionCommitPrepared(). They still refer to the
> > DELAY_CHKPT_START flag. We should update the comments explaining why
> > a
> > similar change is not required for prepare or commit_prepare, if there
> > is one.
>
> After considering more, I think we need to use the new flag in
> RecordTransactionCommitPrepared() as well, because it is assigned a commit
> timestamp and would be replicated as normal transaction if sub's two_phase is
> not enabled.
>
> > 3.
> > +FindMostRecentlyDeletedTupleInfo(Relation rel, TupleTableSlot *searchslot,
> > + TransactionId *delete_xid,
> > + RepOriginId *delete_origin,
> > + TimestampTz *delete_time)
> > ...
> > ...
> > + /* Try to find the tuple */
> > + while (table_scan_getnextslot(scan, ForwardScanDirection, scanslot))
> > + {
> > + bool dead = false;
> > + TransactionId xmax;
> > + TimestampTz localts;
> > + RepOriginId localorigin;
> > +
> > + if (!tuples_equal(scanslot, searchslot, eq, indexbitmap))
> > + continue;
> > +
> > + tuple = ExecFetchSlotHeapTuple(scanslot, false, NULL);
> > + buf = hslot->buffer;
> > +
> > + LockBuffer(buf, BUFFER_LOCK_SHARE);
> > +
> > + if (HeapTupleSatisfiesVacuum(tuple, oldestXmin, buf) ==
> > HEAPTUPLE_RECENTLY_DEAD)
> > + dead = true;
> > +
> > + LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> > +
> > + if (!dead)
> > + continue;
> >
> > Why do we need to check only for HEAPTUPLE_RECENTLY_DEAD and not
> > HEAPTUPLE_DEAD? IIUC, we came here because we couldn't find the live
> > tuple, now whether the tuple is DEAD or RECENTLY_DEAD, why should it
> > matter to detect update_delete conflict?
>
> The HEAPTUPLE_DEAD could indicate tuples whose inserting transaction was
> aborted, in which case we could not get the commit timestamp or origin for the
> transaction. Or it could indicate tuples deleted by a transaction older than
> oldestXmin(we would take the new replication slot's xmin into account when
> computing this value), which means any subsequent transaction would have commit
> timestamp later than that old delete transaction, so I think it's OK to ignore
> this dead tuple and even detect update_missing because the resolution is to
> apply the subsequent UPDATEs anyway (assuming we are using last update win
> strategy). I added some comments along these lines in the patch.
>
> >
> > 5.
> > +
> > +      <varlistentry
> > id="sql-createsubscription-params-with-detect-update-deleted">
> > +        <term><literal>detect_update_deleted</literal>
> > (<type>boolean</type>)</term>
> > +        <listitem>
> > +         <para>
> > +          Specifies whether the detection of <xref
> > linkend="conflict-update-deleted"/>
> > +          is enabled. The default is <literal>false</literal>. If set to
> > +          true, the dead tuples on the subscriber that are still useful for
> > +          detecting <xref linkend="conflict-update-deleted"/>
> > +          are retained,
> >
> > One of the purposes of retaining dead tuples is to detect
> > update_delete conflict. But, I also see the following in 0001's commit
> > message: "Since the mechanism relies on a single replication slot, it
> > not only assists in retaining dead tuples but also preserves commit
> > timestamps and origin data. These information will be displayed in the
> > additional logs generated for logical replication conflicts.
> > Furthermore, the preserved commit timestamps and origin data are
> > essential for consistently detecting update_origin_differs conflicts."
> > which indicates there are other cases where retaining dead tuples can
> > help. So, I was thinking about whether to name this new option as
> > retain_dead_tuples or something along those lines?
>
> I used the retain_conflict_info in this version as it looks more general and we
> are already using similar name in patch(RetainConflictInfoData), but we can
> change it later if people have better ideas.
>
> Attached the V19 patch which addressed comments in [1][2][3][4][5][6][7].

Few comments:
1) All other options are ordered; we can mention retain_conflict_info
after password_required to keep it consistent. I think it got
misplaced because of the name change from detect_update_deleted to
retain_conflict_info:
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index bbd08770c3..9d07fbf07a 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -2278,9 +2278,10 @@ match_previous_words(int pattern_id,
                COMPLETE_WITH("(", "PUBLICATION");
        /* ALTER SUBSCRIPTION <name> SET ( */
        else if (Matches("ALTER", "SUBSCRIPTION", MatchAny, MatchAnyN,
"SET", "("))
-               COMPLETE_WITH("binary", "disable_on_error",
"failover", "origin",
-                                         "password_required",
"run_as_owner", "slot_name",
-                                         "streaming",
"synchronous_commit", "two_phase");
+               COMPLETE_WITH("binary", "retain_conflict_info",
"disable_on_error",
+                                         "failover", "origin",
"password_required",
+                                         "run_as_owner", "slot_name",
"streaming",
+                                         "synchronous_commit", "two_phase");

2) Similarly here too:
        /* Complete "CREATE SUBSCRIPTION <name> ...  WITH ( <opt>" */
        else if (Matches("CREATE", "SUBSCRIPTION", MatchAnyN, "WITH", "("))
                COMPLETE_WITH("binary", "connect", "copy_data", "create_slot",
-                                         "disable_on_error",
"enabled", "failover", "origin",
-                                         "password_required",
"run_as_owner", "slot_name",
-                                         "streaming",
"synchronous_commit", "two_phase");
+                                         "retain_conflict_info",
"disable_on_error", "enabled",

3) Now that the option detect_update_deleted is changed to
retain_conflict_info, we can change this to "Retain conflict info":
+               if (pset.sversion >= 180000)
+                       appendPQExpBuffer(&buf,
+                                                         ", subretainconflictinfo AS \"%s\"\n",
+                                                         gettext_noop("Detect update deleted"));

4) The corresponding test changes also should be updated:
+++ b/src/test/regress/expected/subscription.out
@@ -116,18 +116,18 @@ CREATE SUBSCRIPTION regress_testsub4 CONNECTION
'dbname=regress_doesnotexist' PU
 WARNING:  subscription was created, but is not connected
 HINT:  To initiate replication, you must manually create the
replication slot, enable the subscription, and refresh the
subscription.
 \dRs+ regress_testsub4
-                                           List of subscriptions
-       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Password required | Run as owner? | Failover | Synchronous commit |          Conninfo           | Skip LSN
-------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-------------------+---------------+----------+--------------------+-----------------------------+----------
- regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | none   | t                 | f             | f        | off                | dbname=regress_doesnotexist | 0/0
+                                                        List of subscriptions
+       Name       |           Owner           | Enabled | Publication | Binary | Streaming | Two-phase commit | Disable on error | Origin | Password required | Run as owner? | Failover | Detect update deleted | Synchronous commit |          Conninfo           | Skip LSN
+------------------+---------------------------+---------+-------------+--------+-----------+------------------+------------------+--------+-------------------+---------------+----------+-----------------------+--------------------+-----------------------------+----------
+ regress_testsub4 | regress_subscription_user | f       | {testpub}   | f      | parallel  | d                | f                | none   | t                 | f             | f        | f                     | off                | dbname=regress_doesnotexist | 0/0

5) It is not very easy to understand that this part of the code is there
for handling wraparound; could we add some comments here:
+       if (!TimestampDifferenceExceeds(data->candidate_xid_time, now,
+                                       data->xid_advance_interval))
+               return;
+
+       data->candidate_xid_time = now;
+
+       oldest_running_xid = GetOldestActiveTransactionId();
+       next_full_xid = ReadNextFullTransactionId();
+       epoch = EpochFromFullTransactionId(next_full_xid);
+
+       /* Compute the epoch of the oldest_running_xid */
+       if (oldest_running_xid > XidFromFullTransactionId(next_full_xid))
+               epoch--;
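
For reference, here is a standalone sketch of what that epoch adjustment
does, using plain 64-bit arithmetic in place of the FullTransactionId
macros (the function name and the sample values are only illustrative,
not taken from the patch):

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

static uint64_t
full_xid_of_running_xid(uint64_t next_full_xid, uint32_t oldest_running_xid)
{
	uint32_t	epoch = (uint32_t) (next_full_xid >> 32);
	uint32_t	next_xid = (uint32_t) next_full_xid;

	/*
	 * A running xid can never be newer than the next xid to be assigned, so
	 * if it compares numerically greater it must have been assigned in the
	 * previous epoch.
	 */
	if (oldest_running_xid > next_xid)
		epoch--;

	return ((uint64_t) epoch << 32) | oldest_running_xid;
}

int
main(void)
{
	/* Next full xid is (epoch 6, xid 100); the much larger running xid
	 * therefore belongs to epoch 5. */
	uint64_t	next_full_xid = ((uint64_t) 6 << 32) | 100;
	uint64_t	full_xid = full_xid_of_running_xid(next_full_xid, 4000000000u);

	printf("epoch=%" PRIu64 " xid=%" PRIu64 "\n",
		   full_xid >> 32, full_xid & 0xFFFFFFFF);
	return 0;
}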

Regards,
Vignesh



Re: Conflict detection for update_deleted in logical replication

From
Nisha Moond
Date:
On Tue, Jan 7, 2025 at 6:04 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Attached the V19 patch which addressed comments in [1][2][3][4][5][6][7].
>

Here are a couple of initial review comments on v19 patch set:

1) The subscription option 'retain_conflict_info' remains set to
"true" for a subscription even after restarting the server with
'track_commit_timestamp=off', which can lead to incorrect behavior.
  Steps to reproduce:
   1. Start the server with 'track_commit_timestamp=ON'.
   2. Create a subscription with (retain_conflict_info=ON).
   3. Restart the server with 'track_commit_timestamp=OFF'.

 - The apply worker starts successfully, and the subscription retains
'retain_conflict_info=true'. However, in this scenario, the
update_deleted conflict detection will not function correctly without
'track_commit_timestamp'.
```
postgres=# show track_commit_timestamp;
 track_commit_timestamp
------------------------
 off
(1 row)

postgres=# select subname, subretainconflictinfo from pg_subscription;
 subname | subretainconflictinfo
---------+-----------------------
 sub21   | t
 sub22   | t
```

2) With the new parameter name change to "retain_conflict_info", the
error message for both the 'CREATE SUBSCRIPTION' and 'ALTER
SUBSCRIPTION' commands needs to be updated accordingly.

  postgres=# create subscription sub11 connection 'dbname=postgres'
publication pub1 with (retain_conflict_info=on);
  ERROR:  detecting update_deleted conflicts requires
"track_commit_timestamp" to be enabled
  postgres=# alter subscription sub12 set (retain_conflict_info=on);
  ERROR:  detecting update_deleted conflicts requires
"track_commit_timestamp" to be enabled

 - Change the message to something like: "retaining conflict info
requires "track_commit_timestamp" to be enabled".

--
Thanks,
Nisha



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Tue, Jan 7, 2025 at 2:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 6, 2025 at 4:52 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > >
> > > +       /*
> > > +        * The changes made by this and later transactions are still
> > > non-removable
> > > +        * to allow for the detection of update_deleted conflicts when
> > > applying
> > > +        * changes in this logical replication worker.
> > > +        *
> > > +        * Note that this info cannot directly protect dead tuples from being
> > > +        * prematurely frozen or removed. The logical replication launcher
> > > +        * asynchronously collects this info to determine whether to advance
> > > the
> > > +        * xmin value of the replication slot.
> > > +        *
> > > +        * Therefore, FullTransactionId that includes both the
> > > transaction ID and
> > > +        * its epoch is used here instead of a single Transaction ID. This is
> > > +        * critical because without considering the epoch, the transaction ID
> > > +        * alone may appear as if it is in the future due to transaction ID
> > > +        * wraparound.
> > > +        */
> > > +       FullTransactionId oldest_nonremovable_xid;
> > >
> > > The last paragraph of the comment mentions that we need to use
> > > FullTransactionId to properly compare XIDs even after the XID wraparound
> > > happens. But once we set the oldest-nonremovable-xid it prevents XIDs from
> > > being wraparound, no? I mean that workers'
> > > oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> > > xmin) are never away from more than 2^31 XIDs.
> >
> > I think the issue is that the launcher may create the replication slot after
> > the apply worker has already set the 'oldest_nonremovable_xid', because the
> > launcher is doing that asynchronously. So, before the slot is created, there's
> > a window where transaction IDs might wrap around. If the apply worker has
> > initially computed a candidate_xid (755) and the XID wraps around before the
> > launcher creates the slot, causing the new current xid to be (740), then the old
> > candidate_xid (755) looks like an xid in the future, and the launcher could
> > advance the xmin to 755, which causes the dead tuples to be removed prematurely.
> > (We are trying to reproduce this to ensure that it's a real issue and will
> > share after finishing.)
> >
> > We thought of another approach, which is to create/drop this slot first, as
> > soon as one enables/disables detect_update_deleted (e.g. create/drop the slot
> > during DDL). But it seems complicated to control the concurrent slot
> > create/drop. For example, if one backend A enables detect_update_deleted, it
> > will create a slot. But if another backend B is disabling
> > detect_update_deleted at the same time, then the newly created slot may be
> > dropped by backend B. I thought about checking the number of subscriptions that
> > enable detect_update_deleted before dropping the slot in backend B, but the
> > subscription changes caused by backend A may not be visible yet (e.g. not
> > committed yet).
> >
>
> This means that for the transaction whose changes are not yet visible,
> we may have already created the slot and the backend B would end up
> dropping it. Is it possible that during the change of this new option
> via DDL, we take AccessExclusiveLock on pg_subscription as we do in
> DropSubscription() to ensure that concurrent transactions can't drop
> the slot? Will that help in solving the above scenario?

If we create/stop the slot during DDL, how do we support rollback DDLs?

>
> The second idea could be that each worker first checks whether a slot
> exists along with a subscription flag (new option). Checking the
> existence of a slot each time would be costly, so we somehow need to
> cache it. But if we do that then we need to invent some cache
> invalidation mechanism for the slot. I am not sure if we can design a
> race-free mechanism for that. I mean we need to think of a solution
> for race conditions between the launcher and apply workers to ensure
> that after dropping the slot, launcher doesn't recreate the slot (say
> if some subscription enables this option) before all the workers can
> clear their existing values of oldest_nonremovable_xid.
>
> The third idea to avoid the race condition could be that in the
> function InitializeLogRepWorker() after CommitTransactionCommand(), we
> check if the retain_dead_tuples flag is true for MySubscription then
> we check whether the system slot exists. If exits then go ahead,
> otherwise, wait till the slot is created. It could be some additional
> cycles during worker start up but it is a one-time effort and that too
> only when the flag is set. In addition to this, we anyway need to
> create the slot in the launcher before launching the workers, and
> after re-reading the subscription, the change in retain_dead_tuples
> flag (off->on) should cause the worker restart.
>
> Now, in the third idea, the issue can still arise if, after waiting
> for the slot to be created, the user sets the retain_dead_tuples to
> false and back to true again immediately. Because the launcher may
> have noticed the "retain_dead_tuples=false" operation and dropped the
> slot, while the apply worker has not noticed and still holds an old
> candidate_xid. The xid may wraparound in this window before setting
> the retain_dead_tuples back to true. And, the apply worker would not
> restart because after it calls maybe_reread_subscription(), the
> retain_dead_tuples would have been set back to true again. Again, to
> avoid this race condition, the launcher can wait for each worker to
> reset the oldest_nonremovamble_xid before dropping the slot.
>
> Even after doing the above, the third idea could still have another
> race condition:
> 1. The launcher creates the replication slot and starts a worker with
> retain_dead_tuples = true, the worker is waiting for publish status
> and has not set oldest_nonremovable_xid.
> 2. The user set the option retain_dead_tuples to false, the launcher
> noticed that and drop the replication slot.
> 3. The worker received the status and set oldest_nonremovable_xid to a
> valid value (say 750).
> 4. Xid wraparound happened at this point and say new_available_xid becomes 740
> 5. User set retain_dead_tuples = true again.
>
> After the above steps, the apply worker holds an old
> oldest_nonremovable_xid (750) and will not restart if it does not call
> maybe_reread_subscription() before step 5. So, such a case can again
> create a problem of incorrect slot->xmin value. We can probably try to
> find some way to avoid this race condition as well but I haven't
> thought more about this as this idea sounds a bit risky and bug-prone
> to me.
>
> Among the above ideas, the first idea of taking AccessExclusiveLock on
> pg_subscription sounds safest to me. I haven't evaluated the changes
> for the first approach so I could be missing something that makes it
> difficult to achieve but I think it is worth investigating unless we
> have better ideas or we think that the current approach used in patch
> to use FullTransactionId is okay.

Thank you for considering some ideas. As I mentioned above, we might
need to consider a case where 'CREATE SUBSCRIPTION ..
(retain_conflict_info = true)' is rolled back. Having said that, this
comment is just for simplifying the logic. If using TransactionId
instead makes other parts complex, it would not make sense. I'm okay
with leaving this part and improving the comment for
oldest_nonremovable_xid, say, by mentioning that there is a window for
XID wraparound to happen between workers computing their
oldest_nonremovable_xid and the pg_conflict_detection slot being created.

BTW, while reviewing the code, I realized that changing the
retain_conflict_info value doesn't cause the worker to relaunch, and we
don't clear the worker's oldest_nonremovable_xid value in this case.
Is that okay? I'm concerned about a case where the
RetainConflictInfoPhase state transition is paused by disabling
retain_conflict_info and resumed by re-enabling it with an old
RetainConflictInfoData value.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Jan 8, 2025 at 2:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jan 7, 2025 at 2:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > We thought of another approach, which is to create/drop this slot first, as
> > > soon as one enables/disables detect_update_deleted (e.g. create/drop the slot
> > > during DDL). But it seems complicated to control the concurrent slot
> > > create/drop. For example, if one backend A enables detect_update_deleted, it
> > > will create a slot. But if another backend B is disabling
> > > detect_update_deleted at the same time, then the newly created slot may be
> > > dropped by backend B. I thought about checking the number of subscriptions that
> > > enable detect_update_deleted before dropping the slot in backend B, but the
> > > subscription changes caused by backend A may not be visible yet (e.g. not
> > > committed yet).
> > >
> >
> > This means that for the transaction whose changes are not yet visible,
> > we may have already created the slot and the backend B would end up
> > dropping it. Is it possible that during the change of this new option
> > via DDL, we take AccessExclusiveLock on pg_subscription as we do in
> > DropSubscription() to ensure that concurrent transactions can't drop
> > the slot? Will that help in solving the above scenario?
>
> If we create/stop the slot during DDL, how do we support rollback DDLs?
>

We will prevent changing this setting in a transaction block, as we
already do for the slot-related cases. See the use of
PreventInTransactionBlock() in subscriptioncmds.c.

>
> Thank you for considering some ideas. As I mentioned above, we might
> need to consider a case like where 'CREATE SUBSCRIPTION ..
> (retain_conflict_info = true)' is rolled back. Having said that, this
> comment is just for simplifying the logic. If using TransactionId
> instead makes other parts complex, it would not make sense. I'm okay
> with leaving this part and improving the comment for
> oldest_nonremovable_xid, say, by mentioning that there is a window for
> XID wraparound happening between workers computing their
> oldst_nonremovable_xid and pg_conflict_detection slot being created.
>

Fair enough. Let us see what you think about my above response first.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> Here is further performance test analysis with v16 patch-set.
>
>
> In the test scenarios already shared on -hackers [1], where pgbench was run only on the publisher node in a pub-sub setup, no performance degradation was observed on either node.
>
>
>
> In contrast, when pgbench was run only on the subscriber side with detect_update_deleted=on [2], the TPS performance was reduced due to dead tuple accumulation. This performance drop depended on the wal_receiver_status_interval; larger intervals resulted in more dead tuple accumulation on the subscriber node. However, after the improvement in patch v16-0002, which dynamically tunes the status request, the default TPS reduction was limited to only 1%.
>
>
>
> We performed more benchmarks with the v16-patches where pgbench was run on both the publisher and subscriber, focusing on TPS performance. To summarize the key observations:
>
>  - No performance impact on the publisher as dead tuple accumulation does not occur on the publisher.

Nice. It means that frequently getting in-commit-phase transactions by
the subscriber didn't have a negative impact on the publisher's
performance.

>
>  - The performance is reduced on the subscriber side (TPS reduction (~50%) [3]) due to dead tuple retention for the conflict detection when detect_update_deleted=on.
>
>  - Performance reduction happens only on the subscriber side, as workload on the publisher is pretty high and the apply workers must wait for the amount of transactions with earlier timestamps to be applied and flushed before advancing the non-removable XID to remove dead tuples.

Assuming that the performance dip happened due to dead tuple retention
for the conflict detection, would TPS on other databases also be
affected?

>
>
> [3] Test with pgbench run on both publisher and subscriber.
>
>
>
> Test setup:
>
> - Tests performed on pgHead + v16 patches
>
> - Created a pub-sub replication system.
>
> - Parameters for both instances were:
>
>
>
>    share_buffers = 30GB
>
>    min_wal_size = 10GB
>
>    max_wal_size = 20GB
>
>    autovacuum = false

Since you disabled autovacuum on the subscriber, dead tuples created
by non-hot updates are accumulated anyway regardless of
detect_update_deleted setting, is that right?

> Test Run:
>
> - Ran pgbench(read-write) on both the publisher and the subscriber with 30 clients for a duration of 120 seconds, collecting data over 5 runs.
>
> - Note that pgbench was running for different tables on pub and sub.
>
> (The scripts used for test "case1-2_measure.sh" and case1-2_setup.sh" are attached).
>
>
>
> Results:
>
>
>
> Run#                   pub TPS              sub TPS
>
> 1                         32209   13704
>
> 2                         32378   13684
>
> 3                         32720   13680
>
> 4                         31483   13681
>
> 5                         31773   13813
>
> median               32209   13684
>
> regression          7%         -53%

What was the TPS on the subscriber when detect_update_deleted = false?
And how much were the tables bloated compared to when
detect_update_deleted = false?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
> >
> > Here is further performance test analysis with v16 patch-set.
> >
> >
> > In the test scenarios already shared on -hackers [1], where pgbench was run only on the publisher node in a pub-sub setup, no performance degradation was observed on either node.
> >
> >
> >
> > In contrast, when pgbench was run only on the subscriber side with detect_update_deleted=on [2], the TPS performance was reduced due to dead tuple accumulation. This performance drop depended on the wal_receiver_status_interval; larger intervals resulted in more dead tuple accumulation on the subscriber node. However, after the improvement in patch v16-0002, which dynamically tunes the status request, the default TPS reduction was limited to only 1%.
> >
> >
> >
> > We performed more benchmarks with the v16-patches where pgbench was run on both the publisher and subscriber, focusing on TPS performance. To summarize the key observations:
> >
> >  - No performance impact on the publisher as dead tuple accumulation does not occur on the publisher.
>
> Nice. It means that frequently getting in-commit-phase transactions by
> the subscriber didn't have a negative impact on the publisher's
> performance.
>
> >
> >  - The performance is reduced on the subscriber side (TPS reduction (~50%) [3]) due to dead tuple retention for the conflict detection when detect_update_deleted=on.
> >
> >  - Performance reduction happens only on the subscriber side, as workload on the publisher is pretty high and the apply workers must wait for the amount of transactions with earlier timestamps to be applied and flushed before advancing the non-removable XID to remove dead tuples.
>
> Assuming that the performance dip happened due to dead tuple retention
> for the conflict detection, would TPS on other databases also be
> affected?
>

As we use slot->xmin to retain dead tuples, shouldn't the impact be
global (meaning on all databases)? Or maybe I am missing something.

> >
> >
> > [3] Test with pgbench run on both publisher and subscriber.
> >
> >
> >
> > Test setup:
> >
> > - Tests performed on pgHead + v16 patches
> >
> > - Created a pub-sub replication system.
> >
> > - Parameters for both instances were:
> >
> >
> >
> >    share_buffers = 30GB
> >
> >    min_wal_size = 10GB
> >
> >    max_wal_size = 20GB
> >
> >    autovacuum = false
>
> Since you disabled autovacuum on the subscriber, dead tuples created
> by non-hot updates are accumulated anyway regardless of
> detect_update_deleted setting, is that right?
>

I think hot-pruning mechanism during the update operation will remove
dead tuples even when autovacuum is disabled.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Wed, Jan 8, 2025 at 1:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
> > >
> > > Here is further performance test analysis with v16 patch-set.
> > >
> > >
> > > In the test scenarios already shared on -hackers [1], where pgbench was run only on the publisher node in a pub-sub setup, no performance degradation was observed on either node.
> > >
> > >
> > >
> > > In contrast, when pgbench was run only on the subscriber side with detect_update_deleted=on [2], the TPS performance was reduced due to dead tuple accumulation. This performance drop depended on the wal_receiver_status_interval; larger intervals resulted in more dead tuple accumulation on the subscriber node. However, after the improvement in patch v16-0002, which dynamically tunes the status request, the default TPS reduction was limited to only 1%.
> > >
> > >
> > >
> > > We performed more benchmarks with the v16-patches where pgbench was run on both the publisher and subscriber, focusing on TPS performance. To summarize the key observations:
> > >
> > >  - No performance impact on the publisher as dead tuple accumulation does not occur on the publisher.
> >
> > Nice. It means that frequently getting in-commit-phase transactions by
> > the subscriber didn't have a negative impact on the publisher's
> > performance.
> >
> > >
> > >  - The performance is reduced on the subscriber side (TPS reduction (~50%) [3]) due to dead tuple retention for the conflict detection when detect_update_deleted=on.
> > >
> > >  - Performance reduction happens only on the subscriber side, as workload on the publisher is pretty high and the apply workers must wait for the amount of transactions with earlier timestamps to be applied and flushed before advancing the non-removable XID to remove dead tuples.
> >
> > Assuming that the performance dip happened due to dead tuple retention
> > for the conflict detection, would TPS on other databases also be
> > affected?
> >
>
> As we use slot->xmin to retain dead tuples, shouldn't the impact be
> global (means on all databases)?

I think so too.

>
> > >
> > >
> > > [3] Test with pgbench run on both publisher and subscriber.
> > >
> > >
> > >
> > > Test setup:
> > >
> > > - Tests performed on pgHead + v16 patches
> > >
> > > - Created a pub-sub replication system.
> > >
> > > - Parameters for both instances were:
> > >
> > >
> > >
> > >    share_buffers = 30GB
> > >
> > >    min_wal_size = 10GB
> > >
> > >    max_wal_size = 20GB
> > >
> > >    autovacuum = false
> >
> > Since you disabled autovacuum on the subscriber, dead tuples created
> > by non-hot updates are accumulated anyway regardless of
> > detect_update_deleted setting, is that right?
> >
>
> I think hot-pruning mechanism during the update operation will remove
> dead tuples even when autovacuum is disabled.

True, but why was autovacuum disabled? It seems that
case1-2_setup.sh doesn't specify fillfactor, which makes hot updates
less likely to happen.

I understand that a certain performance dip happens due to dead tuple
retention, which is fine, but I'm surprised that the TPS decreased by
50% within 120 seconds. Does the TPS get even worse for a longer test? I
did a quick benchmark where I completely disabled removing dead tuples
(by autovacuum=off and a logical slot) and ran pgbench, but I didn't
see such a precipitous dip.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, January 8, 2025 6:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> On Wed, Jan 8, 2025 at 1:53 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond
> <nisha.moond412@gmail.com> wrote:
> > > >
> > > >
> > > > [3] Test with pgbench run on both publisher and subscriber.
> > > >
> > > >
> > > >
> > > > Test setup:
> > > >
> > > > - Tests performed on pgHead + v16 patches
> > > >
> > > > - Created a pub-sub replication system.
> > > >
> > > > - Parameters for both instances were:
> > > >
> > > >
> > > >
> > > >    share_buffers = 30GB
> > > >
> > > >    min_wal_size = 10GB
> > > >
> > > >    max_wal_size = 20GB
> > > >
> > > >    autovacuum = false
> > >
> > > Since you disabled autovacuum on the subscriber, dead tuples created
> > > by non-hot updates are accumulated anyway regardless of
> > > detect_update_deleted setting, is that right?
> > >
> >
> > I think hot-pruning mechanism during the update operation will remove
> > dead tuples even when autovacuum is disabled.
> 
> True, but why did it disable autovacuum? It seems that case1-2_setup.sh
> doesn't specify fillfactor, which makes hot-updates less likely to happen.

IIUC, we disable autovacuum as a general practice in read-write tests for
stable TPS numbers.

> 
> I understand that a certain performance dip happens due to dead tuple
> retention, which is fine, but I'm surprised that the TPS decreased by 50% within
> 120 seconds. The TPS goes even worse for a longer test?

We will try to increase the time and run the test again.

> I did a quick
> benchmark where I completely disabled removing dead tuples (by
> autovacuum=off and a logical slot) and ran pgbench but I didn't see such a
> precipitous dip.

I think a logical slot only retains the dead tuples of system catalogs,
so the TPS on user tables would not be affected that much.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
vignesh C
Date:
On Tue, 7 Jan 2025 at 18:04, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 3, 2025 1:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Dec 25, 2024 at 8:13 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Attach the new version patch set which addressed all other comments.
> > >
> >
> > Some more miscellaneous comments:
>
> Thanks for the comments!
>
> > =============================
> > 1.
> > @@ -1431,9 +1431,9 @@ RecordTransactionCommit(void)
> >   * modifying it.  This makes checkpoint's determination of which xacts
> >   * are delaying the checkpoint a bit fuzzy, but it doesn't matter.
> >   */
> > - Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
> > + Assert((MyProc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0);
> >   START_CRIT_SECTION();
> > - MyProc->delayChkptFlags |= DELAY_CHKPT_START;
> > + MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;
> >
> >   /*
> >   * Insert the commit XLOG record.
> > @@ -1536,7 +1536,7 @@ RecordTransactionCommit(void)
> >   */
> >   if (markXidCommitted)
> >   {
> > - MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> > + MyProc->delayChkptFlags &= ~DELAY_CHKPT_IN_COMMIT;
> >   END_CRIT_SECTION();
> >
> > The comments related to this change should be updated in EndPrepare()
> > and RecordTransactionCommitPrepared(). They still refer to the
> > DELAY_CHKPT_START flag. We should update the comments explaining why
> > a
> > similar change is not required for prepare or commit_prepare, if there
> > is one.
>
> After considering more, I think we need to use the new flag in
> RecordTransactionCommitPrepared() as well, because it is assigned a commit
> timestamp and would be replicated as normal transaction if sub's two_phase is
> not enabled.
>
> > 3.
> > +FindMostRecentlyDeletedTupleInfo(Relation rel, TupleTableSlot *searchslot,
> > + TransactionId *delete_xid,
> > + RepOriginId *delete_origin,
> > + TimestampTz *delete_time)
> > ...
> > ...
> > + /* Try to find the tuple */
> > + while (table_scan_getnextslot(scan, ForwardScanDirection, scanslot))
> > + {
> > + bool dead = false;
> > + TransactionId xmax;
> > + TimestampTz localts;
> > + RepOriginId localorigin;
> > +
> > + if (!tuples_equal(scanslot, searchslot, eq, indexbitmap))
> > + continue;
> > +
> > + tuple = ExecFetchSlotHeapTuple(scanslot, false, NULL);
> > + buf = hslot->buffer;
> > +
> > + LockBuffer(buf, BUFFER_LOCK_SHARE);
> > +
> > + if (HeapTupleSatisfiesVacuum(tuple, oldestXmin, buf) ==
> > HEAPTUPLE_RECENTLY_DEAD)
> > + dead = true;
> > +
> > + LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> > +
> > + if (!dead)
> > + continue;
> >
> > Why do we need to check only for HEAPTUPLE_RECENTLY_DEAD and not
> > HEAPTUPLE_DEAD? IIUC, we came here because we couldn't find the live
> > tuple, now whether the tuple is DEAD or RECENTLY_DEAD, why should it
> > matter to detect update_delete conflict?
>
> The HEAPTUPLE_DEAD could indicate tuples whose inserting transaction was
> aborted, in which case we could not get the commit timestamp or origin for the
> transaction. Or it could indicate tuples deleted by a transaction older than
> oldestXmin(we would take the new replication slot's xmin into account when
> computing this value), which means any subsequent transaction would have commit
> timestamp later than that old delete transaction, so I think it's OK to ignore
> this dead tuple and even detect update_missing because the resolution is to
> apply the subsequent UPDATEs anyway (assuming we are using last update win
> strategy). I added some comments along these lines in the patch.
>
> >
> > 5.
> > +
> > +      <varlistentry
> > id="sql-createsubscription-params-with-detect-update-deleted">
> > +        <term><literal>detect_update_deleted</literal>
> > (<type>boolean</type>)</term>
> > +        <listitem>
> > +         <para>
> > +          Specifies whether the detection of <xref
> > linkend="conflict-update-deleted"/>
> > +          is enabled. The default is <literal>false</literal>. If set to
> > +          true, the dead tuples on the subscriber that are still useful for
> > +          detecting <xref linkend="conflict-update-deleted"/>
> > +          are retained,
> >
> > One of the purposes of retaining dead tuples is to detect
> > update_delete conflict. But, I also see the following in 0001's commit
> > message: "Since the mechanism relies on a single replication slot, it
> > not only assists in retaining dead tuples but also preserves commit
> > timestamps and origin data. These information will be displayed in the
> > additional logs generated for logical replication conflicts.
> > Furthermore, the preserved commit timestamps and origin data are
> > essential for consistently detecting update_origin_differs conflicts."
> > which indicates there are other cases where retaining dead tuples can
> > help. So, I was thinking about whether to name this new option as
> > retain_dead_tuples or something along those lines?
>
> I used the retain_conflict_info in this version as it looks more general and we
> are already using similar name in patch(RetainConflictInfoData), but we can
> change it later if people have better ideas.
>
> Attached the V19 patch which addressed comments in [1][2][3][4][5][6][7].

Consider an LR setup with retain_conflict_info=true for a table t1:
Publisher:
insert into t1 values(1);
-- Have an open transaction before the delete operation on the subscriber
begin;

Subscriber:
-- delete the record that was replicated
delete from t1;

-- Now commit the transaction in publisher
Publisher:
update t1 set c1 = 2;
commit;

In the normal case, the update_deleted conflict is detected:
2025-01-08 15:41:38.529 IST [112744] LOG:  conflict detected on
relation "public.t1": conflict=update_deleted
2025-01-08 15:41:38.529 IST [112744] DETAIL:  The row to be updated
was deleted locally in transaction 751 at 2025-01-08
15:41:29.811566+05:30.
        Remote tuple (2); replica identity full (1).
2025-01-08 15:41:38.529 IST [112744] CONTEXT:  processing remote data
for replication origin "pg_16387" during message type "UPDATE" for
replication target relation "public.t1" in transaction 747, finished
at 0/16FBCA0

Now execute the same case as above, but with a pre-setup that consumes all
the replication slots in the system by executing
pg_create_logical_replication_slot before the subscription is created;
in this case the conflict is not detected correctly:
2025-01-08 15:39:17.931 IST [112551] LOG:  conflict detected on
relation "public.t1": conflict=update_missing
2025-01-08 15:39:17.931 IST [112551] DETAIL:  Could not find the row
to be updated.
        Remote tuple (2); replica identity full (1).
2025-01-08 15:39:17.931 IST [112551] CONTEXT:  processing remote data
for replication origin "pg_16387" during message type "UPDATE" for
replication target relation "public.t1" in transaction 747, finished
at 0/16FBC68
2025-01-08 15:39:18.266 IST [112582] ERROR:  all replication slots are in use
2025-01-08 15:39:18.266 IST [112582] HINT:  Free one or increase
"max_replication_slots".

This is because even though CREATE SUBSCRIPTION reports success,
the launcher has not yet created the replication slot.

There are a few observations from this test:
1) CREATE SUBSCRIPTION does not wait for the slot to be created by the
launcher and starts applying the changes. Should CREATE SUBSCRIPTION
wait till the slot is created by the launcher process?
2) Currently the launcher is exiting continuously and trying to create
the replication slot. Should the launcher wait for the
wal_retrieve_retry_interval configuration before retrying to create the
slot, instead of filling the logs continuously?
3) If we try to create a similar subscription with the
retain_conflict_info and disable_on_error options and there is an error
in replication slot creation, currently the subscription does not get
disabled. Should we consider disable_on_error for these cases and
disable the subscription if we are not able to create the slot?

Regards,
Vignesh



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Wed, Jan 8, 2025 at 3:00 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, January 8, 2025 6:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> > On Wed, Jan 8, 2025 at 1:53 AM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond
> > <nisha.moond412@gmail.com> wrote:
> > > > >
> > > > >
> > > > > [3] Test with pgbench run on both publisher and subscriber.
> > > > >
> > > > >
> > > > >
> > > > > Test setup:
> > > > >
> > > > > - Tests performed on pgHead + v16 patches
> > > > >
> > > > > - Created a pub-sub replication system.
> > > > >
> > > > > - Parameters for both instances were:
> > > > >
> > > > >
> > > > >
> > > > >    share_buffers = 30GB
> > > > >
> > > > >    min_wal_size = 10GB
> > > > >
> > > > >    max_wal_size = 20GB
> > > > >
> > > > >    autovacuum = false
> > > >
> > > > Since you disabled autovacuum on the subscriber, dead tuples created
> > > > by non-hot updates are accumulated anyway regardless of
> > > > detect_update_deleted setting, is that right?
> > > >
> > >
> > > I think hot-pruning mechanism during the update operation will remove
> > > dead tuples even when autovacuum is disabled.
> >
> > True, but why did it disable autovacuum? It seems that case1-2_setup.sh
> > doesn't specify fillfactor, which makes hot-updates less likely to happen.
>
> IIUC, we disable autovacuum as a general practice in read-write tests for
> stable TPS numbers.

Okay. TBH I'm not sure what we can say with these results. At a
glance, in a typical bi-directional-like setup, we can interpret
these results as meaning that if users turn retain_conflict_info on,
the TPS goes down by 50%. But I'm not sure this 50% dip is the worst
case that users could possibly face. It could be better in practice
thanks to autovacuum, or it could also go even worse due to further
bloat if we run the test longer.

Suppose that users see a 50% performance dip due to dead tuple retention
for update_deleted detection; is there any way for them to improve
the situation? For example, trying to advance slot.xmin more
frequently might help to reduce dead tuple accumulation. I think it
would be good if we could have a way to balance the publisher
performance against the subscriber performance.

In test case 3, we observed a -53% performance dip, which is worse
than the results of test case 5 with wal_receiver_status_interval =
100s. Given that in test case 5 with wal_receiver_status_interval =
100s we cannot remove dead tuples for most of the whole 120s test
time, probably we could not remove dead tuples for a long time in
test case 3 either. I expected that the apply worker gets remote
transaction XIDs and tries to advance slot.xmin more frequently, so
this performance dip surprised me. I would like to know how many times
the apply worker gets remote transaction XIDs and succeeds in advancing
slot.xmin during the test.

>
> >
> > I understand that a certain performance dip happens due to dead tuple
> > retention, which is fine, but I'm surprised that the TPS decreased by 50% within
> > 120 seconds. The TPS goes even worse for a longer test?
>
> We will try to increase the time and run the test again.
>
> > I did a quick
> > benchmark where I completely disabled removing dead tuples (by
> > autovacuum=off and a logical slot) and ran pgbench but I didn't see such a
> > precipitous dip.
>
> I think a logical slot only retain the dead tuples on system catalog,
> so the TPS on user table would not be affected that much.

You're right, I missed it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, January 9, 2025 9:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> 
> On Wed, Jan 8, 2025 at 3:00 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Wednesday, January 8, 2025 6:33 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > Hi,
> >
> > > On Wed, Jan 8, 2025 at 1:53 AM Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > > On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond
> > > <nisha.moond412@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > [3] Test with pgbench run on both publisher and subscriber.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Test setup:
> > > > > >
> > > > > > - Tests performed on pgHead + v16 patches
> > > > > >
> > > > > > - Created a pub-sub replication system.
> > > > > >
> > > > > > - Parameters for both instances were:
> > > > > >
> > > > > >
> > > > > >
> > > > > >    share_buffers = 30GB
> > > > > >
> > > > > >    min_wal_size = 10GB
> > > > > >
> > > > > >    max_wal_size = 20GB
> > > > > >
> > > > > >    autovacuum = false
> > > > >
> > > > > Since you disabled autovacuum on the subscriber, dead tuples
> > > > > created by non-hot updates are accumulated anyway regardless of
> > > > > detect_update_deleted setting, is that right?
> > > > >
> > > >
> > > > I think hot-pruning mechanism during the update operation will
> > > > remove dead tuples even when autovacuum is disabled.
> > >
> > > True, but why did it disable autovacuum? It seems that
> > > case1-2_setup.sh doesn't specify fillfactor, which makes hot-updates less
> likely to happen.
> >
> > IIUC, we disable autovacuum as a general practice in read-write tests
> > for stable TPS numbers.
> 
> Okay. TBH I'm not sure what we can say with these results. At a glance, in a
> typical bi-directional-like setup,  we can interpret these results as that if
> users turn retain_conflict_info on the TPS goes 50% down.  But I'm not sure
> this 50% dip is the worst case that users possibly face. It could be better in
> practice thanks to autovacuum, or it also could go even worse due to further
> bloats if we run the test longer.

I think it shouldn't get worse, because ideally the amount of bloat would not
increase beyond what we see here due to this patch unless there is some
misconfiguration that leads to one of the nodes not working properly (say it is
down). However, my colleague is running longer tests and we will share the
results soon.

> Suppose that users had 50% performance dip due to dead tuple retention for
> update_deleted detection, is there any way for users to improve the situation?
> For example, trying to advance slot.xmin more frequently might help to reduce
> dead tuple accumulation. I think it would be good if we could have a way to
> balance between the publisher performance and the subscriber performance.

AFAICS, most of the time in each xid advancement is spent on waiting for the
target remote_lsn to be applied and flushed, so increasing the frequency would
not help. This is supported by test case 4 shared by Nisha [1]: in that test,
we do not request a remote_lsn but simply wait for the commit_ts of the
incoming transaction to exceed the candidate_xid_time, and the regression is
still the same. I think it indicates that we indeed need to wait for this
amount of time before applying all the transactions that have an earlier
commit timestamp. IOW, the performance impact on the subscriber side is a
reasonable behavior if we want to detect the update_deleted conflict reliably.

[1] https://www.postgresql.org/message-id/CABdArM4OEwmh_31dQ8_F__VmHwk2ag_M%3DYDD4H%2ByYQBG%2BbHGzg%40mail.gmail.com
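
To make the waiting concrete, here is a rough way to observe how far the
subscriber's apply/flush position lags behind what has been received, which is
essentially what each xid advancement has to wait out. This is only a sketch
using existing monitoring views, not something the patch adds:

-- Sketch: how far apply/flush lags behind what the apply worker has received.
-- received_lsn and latest_end_lsn are existing pg_stat_subscription columns;
-- the gap is roughly the WAL the xid advancement still has to wait for.
SELECT subname,
       received_lsn,
       latest_end_lsn,
       pg_size_pretty(pg_wal_lsn_diff(received_lsn, latest_end_lsn)) AS unconfirmed
FROM pg_stat_subscription
WHERE pid IS NOT NULL;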

Best Regards,
Hou zj

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Thursday, January 9, 2025 9:48 AM Masahiko Sawada <sawada.mshk@gmail.com>

Hi,

> 
> On Wed, Jan 8, 2025 at 3:00 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Wednesday, January 8, 2025 6:33 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > Hi,
> >
> > > On Wed, Jan 8, 2025 at 1:53 AM Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > > On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond
> > > <nisha.moond412@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > [3] Test with pgbench run on both publisher and subscriber.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Test setup:
> > > > > >
> > > > > > - Tests performed on pgHead + v16 patches
> > > > > >
> > > > > > - Created a pub-sub replication system.
> > > > > >
> > > > > > - Parameters for both instances were:
> > > > > >
> > > > > >
> > > > > >
> > > > > >    share_buffers = 30GB
> > > > > >
> > > > > >    min_wal_size = 10GB
> > > > > >
> > > > > >    max_wal_size = 20GB
> > > > > >
> > > > > >    autovacuum = false
> > > > >
> > > > > Since you disabled autovacuum on the subscriber, dead tuples
> > > > > created by non-hot updates are accumulated anyway regardless of
> > > > > detect_update_deleted setting, is that right?
> > > > >
> > > >
> > > > I think hot-pruning mechanism during the update operation will
> > > > remove dead tuples even when autovacuum is disabled.
> > >
> > > True, but why did it disable autovacuum? It seems that
> > > case1-2_setup.sh doesn't specify fillfactor, which makes hot-updates less
> likely to happen.
> >
> > IIUC, we disable autovacuum as a general practice in read-write tests
> > for stable TPS numbers.
>
...
> In test case 3, we observed a -53% performance dip, which is worse than the
> results of test case 5 with wal_receiver_status_interval = 100s. Given that
> in test case 5 with wal_receiver_status_interval = 100s we cannot remove dead
> tuples for the most of the whole 120s test time, probably we could not remove
> dead tuples for a long time also in test case 3. I expected that the apply
> worker gets remote transaction XIDs and tries to advance slot.xmin more
> frequently, so this performance dip surprised me.
 
As noted in my previous email[1], the delay primarily occurs during the final
phase (RCI_WAIT_FOR_LOCAL_FLUSH), where we wait for concurrent transactions
from the publisher to be applied and flushed locally (e.g., last_flushpos <
data->remote_lsn). I think that while the interval between each transaction ID
advancement is brief, the duration of each advancement itself is significant.
 
> I would like to know how many times the apply worker gets remote transaction
> XIDs and succeeds in advance slot.xmin during the test.
 
My colleague will collect and share the data soon.

[1]
https://www.postgresql.org/message-id/OS0PR01MB57164C9A65F29875AE63F0BD94132%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Wed, Jan 8, 2025 at 2:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 8, 2025 at 2:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Jan 7, 2025 at 2:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > We thought of another approach, which is to create/drop this slot first as
> > > > soon as one enables/disables detect_update_deleted (E.g. create/drop slot
> > > > during DDL). But it seems complicate to control the concurrent slot
> > > > create/drop. For example, if one backend A enables detect_update_deteled, it
> > > > will create a slot. But if another backend B is disabling the
> > > > detect_update_deteled at the same time, then the newly created slot may be
> > > > dropped by backend B. I thought about checking the number of subscriptions that
> > > > enables detect_update_deteled before dropping the slot in backend B, but the
> > > > subscription changes caused by backend A may not visable yet (e.g. not
> > > > committed yet).
> > > >
> > >
> > > This means that for the transaction whose changes are not yet visible,
> > > we may have already created the slot and the backend B would end up
> > > dropping it. Is it possible that during the change of this new option
> > > via DDL, we take AccessExclusiveLock on pg_subscription as we do in
> > > DropSubscription() to ensure that concurrent transactions can't drop
> > > the slot? Will that help in solving the above scenario?
> >
> > If we create/stop the slot during DDL, how do we support rollback DDLs?
> >
>
> We will prevent changing this setting in a transaction block as we
> already do for slot related case. See use of
> PreventInTransactionBlock() in subscriptioncmds.c.
>

On further thinking, even if we prevent this command in a transaction
block, there is still a small chance of rollback. Say, we created the
slot as the last operation after making database changes, but still,
the transaction can fail in the commit code path. So, it is still not
bulletproof. However, we already create a remote_slot at the end of
CREATE SUBSCRIPTION, so, if by any chance the transaction fails in the
commit code path, we will end up having a dangling slot on the remote
node. The same can happen in the DROP SUBSCRIPTION code path as well.
We can follow that or the other option is to allow creation of the
slot by the backend and let drop be handled by the launcher which can
even take care of dangling slots. However, I feel it will be better to
give the responsibility to the launcher for creating and dropping the
slot as the patch is doing and use the FullTransactionId for each
worker. What do you think?
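
As an aside, a dangling slot left behind by a failed commit can be spotted on
the remote node with a query like the one below. This is only a sketch of the
kind of check a user or DBA might run, not something the patch adds:

-- Sketch: logical slots on the publisher that no walsender is currently using;
-- a slot belonging to a subscription that no longer exists is a candidate for
-- manual cleanup with pg_drop_replication_slot().
SELECT slot_name, plugin, active, restart_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical' AND NOT active;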

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Wednesday, January 8, 2025 7:03 PM vignesh C <vignesh21@gmail.com> wrote:

Hi,

> Consider a LR setup with retain_conflict_info=true for a table t1:
> Publisher:
> insert into t1 values(1);
> -- Have a open transaction before delete operation in subscriber begin;
> 
> Subscriber:
> -- delete the record that was replicated delete from t1;
> 
> -- Now commit the transaction in publisher
> Publisher:
> update t1 set c1 = 2;
> commit;
> 
> In normal case update_deleted conflict is detected
> 2025-01-08 15:41:38.529 IST [112744] LOG:  conflict detected on relation
> "public.t1": conflict=update_deleted
> 2025-01-08 15:41:38.529 IST [112744] DETAIL:  The row to be updated was
> deleted locally in transaction 751 at 2025-01-08 15:41:29.811566+05:30.
>         Remote tuple (2); replica identity full (1).
> 2025-01-08 15:41:38.529 IST [112744] CONTEXT:  processing remote data for
> replication origin "pg_16387" during message type "UPDATE" for replication
> target relation "public.t1" in transaction 747, finished at 0/16FBCA0
> 
> Now execute the same above case by having a presetup to consume all the
> replication slots in the system by executing pg_create_logical_replication_slot
> before the subscription is created, in this case the conflict is not detected
> correctly.
> 2025-01-08 15:39:17.931 IST [112551] LOG:  conflict detected on relation
> "public.t1": conflict=update_missing
> 2025-01-08 15:39:17.931 IST [112551] DETAIL:  Could not find the row to be
> updated.
>         Remote tuple (2); replica identity full (1).
> 2025-01-08 15:39:17.931 IST [112551] CONTEXT:  processing remote data for
> replication origin "pg_16387" during message type "UPDATE" for replication
> target relation "public.t1" in transaction 747, finished at 0/16FBC68
> 2025-01-08 15:39:18.266 IST [112582] ERROR:  all replication slots are in use
> 2025-01-08 15:39:18.266 IST [112582] HINT:  Free one or increase
> "max_replication_slots".
> 
> This is because even though we say create subscription is successful, the
> launcher has not yet created the replication slot.

I think some missed detections right after enabling the option are acceptable.
Even if we let the launcher create the slot before starting the workers, some
dead tuples could already have been removed during this period, so
update_missing could still be detected. I have added some documentation to
clarify that the information can be safely retained only after the slot is
created.
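
For what it's worth, a user who wants to know whether the retention slot is in
place (and whether there was room to create it at all, as in the slot-exhaustion
scenario above) could check something like the sketch below; the slot name
'pg_conflict_detection' follows the naming discussed later in this thread and
is not final:

-- Sketch: has the launcher managed to create its slot yet?
SELECT slot_name, active, xmin
FROM pg_replication_slots
WHERE slot_name = 'pg_conflict_detection';

-- Sketch: is there a free slot at all?
SELECT count(*) AS used_slots,
       current_setting('max_replication_slots') AS max_slots
FROM pg_replication_slots;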

> 
> There are few observations from this test:
> 1) Create subscription does not wait for the slot to be created by the launcher
> and starts applying the changes. Should create a subscription wait till the slot
> is created by the launcher process.

I think the DDL cannot wait for the slot creation, because the launcher would
not create the slot until the DDL is committed. Instead, I have changed the
code to create the slot before starting the workers, so that at least the worker
would not unnecessarily maintain the oldest non-removable xid.

> 2) Currently launcher is exiting continuously and trying to create replication
> slots. Should the launcher wait for wal_retrieve_retry_interval configuration
> before trying to create the slot instead of filling the logs continuously.

Since the launcher already has a 5s (bgw_restart_time) restart interval, I
feel it would not consume too many resources in this case.

> 3) If we try to create a similar subscription with retain_conflict_info and
> disable_on_error option and there is an error in replication slot creation,
> currently the subscription does not get disabled. Should we consider
> disable_on_error for these cases and disable the subscription if we are not able
> to create the slots.

Currently, since only ERRORs in the apply worker would trigger disable_on_error,
I am not sure it's worth the effort to teach the apply worker to catch the
launcher's errors, because it doesn't seem like a common scenario.

Best Regards,
Hou zj



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Wednesday, January 8, 2025 3:49 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
> 
> On Tue, Jan 7, 2025 at 6:04 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> >
> > Attached the V19 patch which addressed comments in [1][2][3][4][5][6][7].
> >
> 
> Here are a couple of initial review comments on v19 patch set:
> 
> 1) The subscription option 'retain_conflict_info' remains set to "true" for a
> subscription even after restarting the server with
> 'track_commit_timestamp=off', which can lead to incorrect behavior.
>   Steps to reproduce:
>    1. Start the server with 'track_commit_timestamp=ON'.
>    2. Create a subscription with (retain_conflict_info=ON).
>    3. Restart the server with 'track_commit_timestamp=OFF'.
> 
>  - The apply worker starts successfully, and the subscription retains
> 'retain_conflict_info=true'. However, in this scenario, the update_deleted
> conflict detection will not function correctly without
> 'track_commit_timestamp'.
> ```

IIUC, track_commit_timestamp is a GUC designed mainly for conflict detection,
so it seems like unreasonable behavior to me if users enable it when creating
the subscription but disable it afterwards. Besides, we documented that the
update_deleted conflict would not be detected when track_commit_timestamp is
not enabled, so I am not sure it's worth more effort to add checks for this
case.
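
For clarity, the expected setup on the subscriber looks roughly like the sketch
below (retain_conflict_info is the option name used by the patch under
discussion, so the exact syntax may still change):

-- Sketch of the intended configuration on the subscriber.
ALTER SYSTEM SET track_commit_timestamp = on;  -- requires a server restart
-- after the restart:
CREATE SUBSCRIPTION sub1
    CONNECTION 'dbname=postgres'
    PUBLICATION pub1
    WITH (retain_conflict_info = on);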

> 
> 2) With the new parameter name change to "retain_conflict_info", the error
> message for both the 'CREATE SUBSCRIPTION' and 'ALTER SUBSCRIPTION'
> commands needs to be updated accordingly.
> 
>   postgres=# create subscription sub11 connection 'dbname=postgres'
> publication pub1 with (retain_conflict_info=on);
>   ERROR:  detecting update_deleted conflicts requires
> "track_commit_timestamp" to be enabled
>   postgres=# alter subscription sub12 set (retain_conflict_info=on);
>   ERROR:  detecting update_deleted conflicts requires
> "track_commit_timestamp" to be enabled
> 
>  - Change the message to something similar - "retaining conflict info requires
> "track_commit_timestamp" to be enabled".

After thinking about it more, I changed this to a warning for now, because to
detect all necessary conflicts, users must enable the option anyway, and the
same has been documented for the update/delete_origin_differs conflicts as well.

Best Regards,
Hou zj



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Wed, Jan 8, 2025 at 7:26 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, January 9, 2025 9:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> >
> > On Wed, Jan 8, 2025 at 3:00 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > On Wednesday, January 8, 2025 6:33 PM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > > On Wed, Jan 8, 2025 at 1:53 AM Amit Kapila <amit.kapila16@gmail.com>
> > > > wrote:
> > > > > On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada
> > > > <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond
> > > > <nisha.moond412@gmail.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > [3] Test with pgbench run on both publisher and subscriber.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Test setup:
> > > > > > >
> > > > > > > - Tests performed on pgHead + v16 patches
> > > > > > >
> > > > > > > - Created a pub-sub replication system.
> > > > > > >
> > > > > > > - Parameters for both instances were:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >    share_buffers = 30GB
> > > > > > >
> > > > > > >    min_wal_size = 10GB
> > > > > > >
> > > > > > >    max_wal_size = 20GB
> > > > > > >
> > > > > > >    autovacuum = false
> > > > > >
> > > > > > Since you disabled autovacuum on the subscriber, dead tuples
> > > > > > created by non-hot updates are accumulated anyway regardless of
> > > > > > detect_update_deleted setting, is that right?
> > > > > >
> > > > >
> > > > > I think hot-pruning mechanism during the update operation will
> > > > > remove dead tuples even when autovacuum is disabled.
> > > >
> > > > True, but why did it disable autovacuum? It seems that
> > > > case1-2_setup.sh doesn't specify fillfactor, which makes hot-updates less
> > likely to happen.
> > >
> > > IIUC, we disable autovacuum as a general practice in read-write tests
> > > for stable TPS numbers.
> >
> > Okay. TBH I'm not sure what we can say with these results. At a glance, in a
> > typical bi-directional-like setup,  we can interpret these results as that if
> > users turn retain_conflict_info on the TPS goes 50% down.  But I'm not sure
> > this 50% dip is the worst case that users possibly face. It could be better in
> > practice thanks to autovacuum, or it also could go even worse due to further
> > bloats if we run the test longer.
>
> I think it shouldn't go worse because ideally the amount of bloat would not
> increase beyond what we see here due to this patch unless there is some
> misconfiguration that leads to one of the node not working properly (say it is
> down). However, my colleague is running longer tests and we will share the
> results soon.
>
> > Suppose that users had 50% performance dip due to dead tuple retention for
> > update_deleted detection, is there any way for users to improve the situation?
> > For example, trying to advance slot.xmin more frequently might help to reduce
> > dead tuple accumulation. I think it would be good if we could have a way to
> > balance between the publisher performance and the subscriber performance.
>
> AFAICS, most of the time in each xid advancement is spent on waiting for the
> target remote_lsn to be applied and flushed, so increasing the frequency could
> not help. This can be proved to be reasonable in the testcase 4 shared by
> Nisha[1], in that test, we do not request a remote_lsn but simply wait for the
> commit_ts of incoming transaction to exceed the candidate_xid_time, the
> regression is still the same.

True, but I think what matters is not only asking the publisher for its status
more frequently, but also having the apply worker try to advance the
RetainConflictInfoPhase and the launcher try to advance the slot.xmin more
frequently.

> I think it indicates that we indeed need to wait
> for this amount of time before applying all the transactions that have earlier
> commit timestamp. IOW, the performance impact on the subscriber side is a
> reasonable behavior if we want to detect the update_deleted conflict reliably.

It's reasonable behavior for this approach, but it might not be a reasonable
outcome for users if they can be affected by such a performance dip with no way
to avoid it.

To look closely at what is happening in the apply worker and the launcher, I
did a quick test with the same setup, running pgbench with 30 clients against
each of the publisher and subscriber (on different pgbench tables so conflicts
don't happen on the subscriber), and I recorded how often the worker and the
launcher tried to update the worker's xmin and the slot's xmin, respectively.
During the 120 second test I observed that the apply worker advanced its
oldest_nonremovable_xid 10 times in 43 attempts and the launcher advanced the
slot's xmin 5 times in 20 attempts, which seems rather infrequent. And there
seems to be no way for users to increase these frequencies. Actually, these XID
advancements happened only early in the test, and in the later part there was
almost no attempt to advance XIDs (I describe the reason below). Therefore,
after the 120 second test, the slot's xmin was 2366291 XIDs behind (TPS on the
publisher and subscriber were 15728 and 18052, respectively).
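
For reference, the kind of lag I'm describing can be read from the slot itself
with something like the query below (a sketch; the slot name follows the
'pg_conflict_detection' naming discussed in this thread and is not final):

-- Sketch: how many XIDs the retention slot's xmin is behind the current
-- transaction counter, i.e. roughly how much dead-tuple cleanup is held back.
SELECT slot_name, xmin, age(xmin) AS xmin_age
FROM pg_replication_slots
WHERE slot_name = 'pg_conflict_detection';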

I think there are 3 things we need to deal with:

1. The launcher could still be sleeping even after the worker updates
its oldest_nonremovable_xid. We compute the launcher's sleep time by
doubling the sleep time, with a 3 min maximum. When I started the
test, the launcher had already entered the 3 min sleep, and it took a
long time to advance the slot.xmin for the first time. I think we can
improve this situation by having the worker send a signal to the
launcher after updating the worker's oldest_nonremovable_xid so that it
can quickly update the slot.xmin.

2. The apply worker doesn't advance RetainConflictInfoPhase from the
RCI_WAIT_FOR_LOCAL_FLUSH phase when it's busy. Regarding the phase
transition from RCI_WAIT_FOR_LOCAL_FLUSH to RCI_GET_CANDIDATE_XID, we
rely on calling maybe_advance_nonremovable_xid() (1) right after
transitioning to RCI_WAIT_FOR_LOCAL_FLUSH phase, (2) after receiving
'k' message, and (3) there is no available incoming data. If we miss
(1) opportunity (because we still need to wait for the local flush),
we effectively need to consume all available data to call
maybe_advance_nonremovable_xid() (note that the publisher doesn't need
to send 'k' (keepalive) message if it sends data frequently). In the
test, since I ran pgbench with 30 clients on the publisher and
therefore there were some apply delays, the apply worker took 25 min
to get out of the inner apply loop in LogicalRepApplyLoop() and advance
its oldest_nonremovable_xid. I think we need to consider having more
opportunities to check the local flush LSN.

3. If the apply worker cannot catch up, it could enter a bad loop:
the publisher sends a huge amount of data -> the apply worker cannot
catch up -> it needs to wait for a longer time to advance its
oldest_nonremovable_xid -> more garbage is accumulated, which makes
the apply even slower -> (looping). I'm not sure how to deal with this
point TBH. We might be able to avoid entering this bad loop once we
resolve the other two points.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Nisha Moond
Дата:
On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
> > [3] Test with pgbench run on both publisher and subscriber.
> >
> > Test setup:
> > - Tests performed on pgHead + v16 patches
> > - Created a pub-sub replication system.
> > - Parameters for both instances were:
> >
> >    share_buffers = 30GB
> >    min_wal_size = 10GB
> >    max_wal_size = 20GB
> >    autovacuum = false
>
> Since you disabled autovacuum on the subscriber, dead tuples created
> by non-hot updates are accumulated anyway regardless of
> detect_update_deleted setting, is that right?
>
> > Test Run:
> > - Ran pgbench(read-write) on both the publisher and the subscriber with 30 clients for a duration of 120 seconds, collecting data over 5 runs.
> > - Note that pgbench was running for different tables on pub and sub.
> > (The scripts used for test "case1-2_measure.sh" and case1-2_setup.sh" are attached).
> >
> > Results:
> > Run#                   pub TPS              sub TPS
> > 1                         32209   13704
> > 2                         32378   13684
> > 3                         32720   13680
> > 4                         31483   13681
> > 5                         31773   13813
> > median               32209   13684
> > regression          7%         -53%
>
> What was the TPS on the subscriber when detect_update_deleted = false?
> And how much were the tables bloated compared to when
> detect_update_deleted = false?
>

Test results with 'retain_conflict_info=false', tested on v20 patches
where the parameter name is changed.
With 'retain_conflict_info' disabled, both the Publisher and
Subscriber sustain similar TPS, with no performance reduction observed
on either node.

Test Setup:
(used same setup as above test)
 - Tests performed on pgHead+v20 patches
 - Created a pub-sub replication setup.
 - Parameters for both instances were:
   autovacuum = false
   shared_buffers = '30GB'
   max_wal_size = 20GB
   min_wal_size = 10GB
 Note: 'track_commit_timestamp' is disabled on Sub as not required for
retain_conflict_info=false.

Test Run:
- Pub and Sub had different pgbench tables with initial data of scale=100.
- Ran pgbench(read-write) on both the publisher and the subscriber
with 30 clients for a duration of 15 minutes, collecting data over 3
runs.

Results:
Run#            pub TPS             sub TPS
      1             30533.29878     29161.33335
      2             29931.30723     29520.89321
      3             30665.54192     29440.92953
Median         30533.29878     29440.92953
pgHead median 30112.31203     28933.75013
regression     1%                    2%

- Both Pub and Sub nodes have similar TPS in all runs, which is 1-2%
better than pgHead.

--
Thanks,
Nisha



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Friday, January 10, 2025 8:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

> 
> On Wed, Jan 8, 2025 at 7:26 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Thursday, January 9, 2025 9:48 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > Hi,
> >
> > >
> > > On Wed, Jan 8, 2025 at 3:00 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com>
> > > wrote:
> > > >
> > > > On Wednesday, January 8, 2025 6:33 PM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > > On Wed, Jan 8, 2025 at 1:53 AM Amit Kapila
> <amit.kapila16@gmail.com>
> > > > > wrote:
> > > > > > On Wed, Jan 8, 2025 at 3:02 PM Masahiko Sawada
> > > > > <sawada.mshk@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Dec 19, 2024 at 11:11 PM Nisha Moond
> > > > > <nisha.moond412@gmail.com> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > [3] Test with pgbench run on both publisher and subscriber.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Test setup:
> > > > > > > >
> > > > > > > > - Tests performed on pgHead + v16 patches
> > > > > > > >
> > > > > > > > - Created a pub-sub replication system.
> > > > > > > >
> > > > > > > > - Parameters for both instances were:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >    share_buffers = 30GB
> > > > > > > >
> > > > > > > >    min_wal_size = 10GB
> > > > > > > >
> > > > > > > >    max_wal_size = 20GB
> > > > > > > >
> > > > > > > >    autovacuum = false
> > > > > > >
> > > > > > > Since you disabled autovacuum on the subscriber, dead tuples
> > > > > > > created by non-hot updates are accumulated anyway regardless of
> > > > > > > detect_update_deleted setting, is that right?
> > > > > > >
> > > > > >
> > > > > > I think hot-pruning mechanism during the update operation will
> > > > > > remove dead tuples even when autovacuum is disabled.
> > > > >
> > > > > True, but why did it disable autovacuum? It seems that
> > > > > case1-2_setup.sh doesn't specify fillfactor, which makes hot-updates
> less
> > > likely to happen.
> > > >
> > > > IIUC, we disable autovacuum as a general practice in read-write tests
> > > > for stable TPS numbers.
> > >
> > > Okay. TBH I'm not sure what we can say with these results. At a glance, in
> a
> > > typical bi-directional-like setup,  we can interpret these results as that if
> > > users turn retain_conflict_info on the TPS goes 50% down.  But I'm not
> sure
> > > this 50% dip is the worst case that users possibly face. It could be better in
> > > practice thanks to autovacuum, or it also could go even worse due to
> further
> > > bloats if we run the test longer.
> > > Suppose that users had 50% performance dip due to dead tuple retention
> for
> > > update_deleted detection, is there any way for users to improve the
> situation?
> > > For example, trying to advance slot.xmin more frequently might help to
> reduce
> > > dead tuple accumulation. I think it would be good if we could have a way to
> > > balance between the publisher performance and the subscriber
> performance.
> >
> > AFAICS, most of the time in each xid advancement is spent on waiting for the
> > target remote_lsn to be applied and flushed, so increasing the frequency
> could
> > not help. This can be proved to be reasonable in the testcase 4 shared by
> > Nisha[1], in that test, we do not request a remote_lsn but simply wait for the
> > commit_ts of incoming transaction to exceed the candidate_xid_time, the
> > regression is still the same.
> 
> True, but I think that not only more frequently asking the publisher
> its status but also the apply worker frequently trying to advance the
> RetainConflictInfoPhase and the launcher frequently trying to advance
> the slot.xmin are important.

I agree.

> 
> > I think it indicates that we indeed need to wait
> > for this amount of time before applying all the transactions that have earlier
> > commit timestamp. IOW, the performance impact on the subscriber side is a
> > reasonable behavior if we want to detect the update_deleted conflict reliably.
> 
> It's reasonable behavior for this approach but it might not be a
> reasonable outcome for users if they could be affected by such a
> performance dip without no way to avoid it.
> 
> To closely look at what is happening in the apply worker and the
> launcher, I did a quick test with the same setup, where running
> pgbench with 30 clients to each of the publisher and subscriber (on
> different pgbench tables so conflicts don't happen on the subscriber),
> and I recorded how often the worker and the launcher tried to update
> the worker's xmin and slot's xmin, respectively.  During the 120
> seconds test I observed that the apply worker advanced its
> oldest_nonremovable_xid 10 times with 43 attempts and the launcher
> advanced the slot's xmin 5 times with 20 attempts, which seems to be
> less frequent. And there seems no way for users to increase these
> frequencies. Actually, these XID advancements happened only early in
> the test and in the later part there was almost no attempt to advance
> XIDs (I described the reason below). Therefore, after 120 secs tests,
> slot's xmin was 2366291 XIDs behind (TPS on the publisher and
> subscriber were 15728 and 18052, respectively).

Thanks for testing! It appears that the frequency observed in your tests is
higher than what we've experienced locally. Could you please share the scripts
you used and possibly the machine configuration? This information will help us
verify the differences in the data you've shared.

> I think there 3 things we need to deal with:

Thanks for the suggestions. We will analyze them and share some top-up patches
for the suggested changes later.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Jan 10, 2025 at 6:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> 3. If the apply worker cannot catch up, it could enter to a bad loop;
> the publisher sends huge amount of data -> the apply worker cannot
> catch up -> it needs to wait for a longer time to advance its
> oldest_nonremovable_xid -> more garbage are accumulated and then have
> the apply more slow -> (looping). I'm not sure how to deal with this
> point TBH. We might be able to avoid entering this bad loop once we
> resolve the other two points.
>

I don't think we can avoid accumulating garbage, especially when the
workload on the publisher is higher. Consider the current case being
discussed: on the publisher, we have 30 clients performing read-write
operations, and there is only one pair of reader (walsender) and writer
(apply_worker) to perform all those write operations on the
subscriber. It can never match the speed, and the subscriber side is
bound to have lower performance (or accumulate more bloat) irrespective
of its workload. If there is one client on the publisher performing
operations, we won't see much degradation, but as the number of clients
increases, the performance degradation (and bloat) will keep on
increasing.

There are other scenarios that can lead to the same situation, such as
a large table sync, the subscriber node being down for some time, etc.
Basically, any case where the apply side lags behind the remote node by
a large amount.

One idea to prevent the performance degradation or bloat increase is
to invalidate the slot once we notice that the subscriber lags (in terms
of WAL apply) behind the publisher by a certain threshold. Say we have a
max_lag (or max_lag_behind_remote) subscription option (defined in terms
of seconds) which allows us to stop calculating
oldest_nonremovable_xid for that subscription. We can indicate that
via some worker-level parameter. Once all the subscriptions on a node
that have enabled retain_conflict_info have stopped calculating
oldest_nonremovable_xid, we can invalidate the slot. Now, users can
check this and need to disable/enable retain_conflict_info to again
start retaining the required information. The other way could be that
instead of invalidating the slot, we directly drop/re-create the slot
or increase its xmin. If we choose to advance the slot automatically
without user intervention, we need to let users know via a LOG message
and/or via information in the view.
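
If we go the invalidation route, the user-visible check could be as simple as
the sketch below (the slot name and what, if anything, would show up as the
invalidation reason are both hypothetical at this point):

-- Sketch: checking whether the conflict-detection slot is still valid.
-- invalidation_reason is an existing pg_replication_slots column; the value
-- this feature would put there is not decided.
SELECT slot_name, active, invalidation_reason
FROM pg_replication_slots
WHERE slot_name = 'pg_conflict_detection';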

I think such a mechanism via the new option max_lag will address your
concern: "It's reasonable behavior for this approach but it might not
be a reasonable outcome for users if they could be affected by such a
performance dip with no way to avoid it.", as it will provide a way
to avoid the performance dip only when there is a possibility of such a
dip.

I mentioned max_lag as a subscription option instead of a GUC because
it applies only to subscriptions that have enabled
retain_conflict_info, but we can consider making it a GUC if you and
others think so, provided the above proposal sounds reasonable. Also,
max_lag could be defined in terms of LSN as well, but I think time
would be easier to configure.

Thoughts?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Sun, Jan 12, 2025 at 10:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 6:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > 3. If the apply worker cannot catch up, it could enter to a bad loop;
> > the publisher sends huge amount of data -> the apply worker cannot
> > catch up -> it needs to wait for a longer time to advance its
> > oldest_nonremovable_xid -> more garbage are accumulated and then have
> > the apply more slow -> (looping). I'm not sure how to deal with this
> > point TBH. We might be able to avoid entering this bad loop once we
> > resolve the other two points.
> >
>
> I don't think we can avoid accumulating garbage especially when the
> workload on the publisher is more. Consider the current case being
> discussed, on the publisher, we have 30 clients performing read-write
> operations and there is only one pair of reader (walsender) and writer
> (apply_worker) to perform all those write operations on the
> subscriber. It can never match the speed and the subscriber side is
> bound to have less performance (or accumulate more bloat) irrespective
> of its workload. If there is one client on the publisher performing
> operation, we won't see much degradation but as the number of clients
> increases, the performance degradation (and bloat) will keep on
> increasing.
>
> There are other scenarios that can lead to the same situation, such as
> a large table sync, the subscriber node being down for sometime, etc.
> Basically, any case where apply_side lags by a large amount from the
> remote node.
>
> One idea to prevent the performance degradation or bloat increase is
> to invalidate the slot, once we notice that subscriber lags (in terms
> of WAL apply) behind the publisher by a certain threshold. Say we have
> max_lag (or max_lag_behind_remote) (defined in terms of seconds)
> subscription option which allows us to stop calculating
> oldest_nonremovable_xid for that subscription. We can indicate that
> via some worker_level parameter. Once all the subscriptions on a node
> that has enabled retain_conflict_info have stopped calculating
> oldest_nonremovable_xid, we can invalidate the slot. Now, users can
> check this and need to disable/enable retain_conflict_info to again
> start retaining the required information. The other way could be that
> instead of invalidating the slot, we directly drop/re-create the slot
> or increase its xmin. If we choose to advance the slot automatically
> without user intervention, we need to let users know via LOG and or
> via information in the view.
>
> I think such a mechanism via the new option max_lag will address your
> concern: "It's reasonable behavior for this approach but it might not
> be a reasonable outcome for users if they could be affected by such a
> performance dip without no way to avoid it." as it will provide a way
> to avoid performance dip only when there is a possibility of such a
> dip.
>
> I mentioned max_lag as a subscription option instead of a GUC because
> it applies only to subscriptions that have enabled
> retain_conflict_info but we can consider it to be a GUC if you and
> others think so provided the above proposal sounds reasonable. Also,
> max_lag could be defined in terms of LSN as well but I think time
> would be easy to configure.
>
> Thoughts?

I agree that we cannot avoid accumulating dead tuples when the
workload on the publisher is higher, which affects the subscriber
performance. What we need to do is update the slot's xmin as quickly as
possible to minimize the dead tuple accumulation, at least when the
subscriber is not much behind. If there is a tradeoff for doing so
(e.g., vs. the publisher performance), we need to provide a way for
users to balance it. The max_lag idea sounds interesting for the case
where the subscriber is much behind. Probably we can revisit this idea
as a new feature after completing this feature.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Tue, Jan 14, 2025 at 7:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Jan 12, 2025 at 10:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I don't think we can avoid accumulating garbage especially when the
> > workload on the publisher is more. Consider the current case being
> > discussed, on the publisher, we have 30 clients performing read-write
> > operations and there is only one pair of reader (walsender) and writer
> > (apply_worker) to perform all those write operations on the
> > subscriber. It can never match the speed and the subscriber side is
> > bound to have less performance (or accumulate more bloat) irrespective
> > of its workload. If there is one client on the publisher performing
> > operation, we won't see much degradation but as the number of clients
> > increases, the performance degradation (and bloat) will keep on
> > increasing.
> >
> > There are other scenarios that can lead to the same situation, such as
> > a large table sync, the subscriber node being down for sometime, etc.
> > Basically, any case where apply_side lags by a large amount from the
> > remote node.
> >
> > One idea to prevent the performance degradation or bloat increase is
> > to invalidate the slot, once we notice that subscriber lags (in terms
> > of WAL apply) behind the publisher by a certain threshold. Say we have
> > max_lag (or max_lag_behind_remote) (defined in terms of seconds)
> > subscription option which allows us to stop calculating
> > oldest_nonremovable_xid for that subscription. We can indicate that
> > via some worker_level parameter. Once all the subscriptions on a node
> > that has enabled retain_conflict_info have stopped calculating
> > oldest_nonremovable_xid, we can invalidate the slot. Now, users can
> > check this and need to disable/enable retain_conflict_info to again
> > start retaining the required information. The other way could be that
> > instead of invalidating the slot, we directly drop/re-create the slot
> > or increase its xmin. If we choose to advance the slot automatically
> > without user intervention, we need to let users know via LOG and or
> > via information in the view.
> >
> > I think such a mechanism via the new option max_lag will address your
> > concern: "It's reasonable behavior for this approach but it might not
> > be a reasonable outcome for users if they could be affected by such a
> > performance dip without no way to avoid it." as it will provide a way
> > to avoid performance dip only when there is a possibility of such a
> > dip.
> >
> > I mentioned max_lag as a subscription option instead of a GUC because
> > it applies only to subscriptions that have enabled
> > retain_conflict_info but we can consider it to be a GUC if you and
> > others think so provided the above proposal sounds reasonable. Also,
> > max_lag could be defined in terms of LSN as well but I think time
> > would be easy to configure.
> >
> > Thoughts?
>
> I agree that we cannot avoid accumulating dead tuples when the
> workload on the publisher is more, and which affects the subscriber
> performance. What we need to do is to update slot's xmin as quickly as
> possible to minimize the dead tuple accumulation at least when the
> subscriber is not much behind. If there is a tradeoff for doing so
> (e.g., vs. the publisher performance), we need to provide a way for
> users to balance it.
>

As of now, I can't think of a way to throttle the publisher when the
apply_worker lags. Basically, we need some way to throttle (reduce the
speed of backends) when the apply worker is lagging behind a threshold
margin. Can you think of some way? I thought if one notices frequent
invalidation of the launcher's slot due to max_lag, then they can
rebalance their workload on the publisher.

>
  The max_lag idea sounds interesting for the case
> where the subscriber is much behind. Probably we can visit this idea
> as a new feature after completing this feature.
>

Sure, but what will be our answer to users for cases where the
performance tanks due to bloat accumulation? The tests show that once
the apply_lag becomes large, it becomes almost impossible for the
apply worker to catch up (or it takes a very long time) and advance the
slot's xmin. Users can disable retain_conflict_info to bring back
the performance and get rid of the bloat, but I thought it would be
easier for users if we had some knob so that they don't need to wait
until the problem of bloat/performance dip actually happens.
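
To be concrete, the manual escape hatch being referred to is just the
subscription option itself, roughly as below (option name per the patch under
discussion):

-- Sketch: manually giving up update_deleted detection for a subscription so
-- that dead tuples stop being retained and vacuum can reclaim them again.
ALTER SUBSCRIPTION sub1 SET (retain_conflict_info = off);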

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Mon, Jan 13, 2025 at 8:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 14, 2025 at 7:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sun, Jan 12, 2025 at 10:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > I don't think we can avoid accumulating garbage especially when the
> > > workload on the publisher is more. Consider the current case being
> > > discussed, on the publisher, we have 30 clients performing read-write
> > > operations and there is only one pair of reader (walsender) and writer
> > > (apply_worker) to perform all those write operations on the
> > > subscriber. It can never match the speed and the subscriber side is
> > > bound to have less performance (or accumulate more bloat) irrespective
> > > of its workload. If there is one client on the publisher performing
> > > operation, we won't see much degradation but as the number of clients
> > > increases, the performance degradation (and bloat) will keep on
> > > increasing.
> > >
> > > There are other scenarios that can lead to the same situation, such as
> > > a large table sync, the subscriber node being down for sometime, etc.
> > > Basically, any case where apply_side lags by a large amount from the
> > > remote node.
> > >
> > > One idea to prevent the performance degradation or bloat increase is
> > > to invalidate the slot, once we notice that subscriber lags (in terms
> > > of WAL apply) behind the publisher by a certain threshold. Say we have
> > > max_lag (or max_lag_behind_remote) (defined in terms of seconds)
> > > subscription option which allows us to stop calculating
> > > oldest_nonremovable_xid for that subscription. We can indicate that
> > > via some worker_level parameter. Once all the subscriptions on a node
> > > that has enabled retain_conflict_info have stopped calculating
> > > oldest_nonremovable_xid, we can invalidate the slot. Now, users can
> > > check this and need to disable/enable retain_conflict_info to again
> > > start retaining the required information. The other way could be that
> > > instead of invalidating the slot, we directly drop/re-create the slot
> > > or increase its xmin. If we choose to advance the slot automatically
> > > without user intervention, we need to let users know via LOG and or
> > > via information in the view.
> > >
> > > I think such a mechanism via the new option max_lag will address your
> > > concern: "It's reasonable behavior for this approach but it might not
> > > be a reasonable outcome for users if they could be affected by such a
> > > performance dip without no way to avoid it." as it will provide a way
> > > to avoid performance dip only when there is a possibility of such a
> > > dip.
> > >
> > > I mentioned max_lag as a subscription option instead of a GUC because
> > > it applies only to subscriptions that have enabled
> > > retain_conflict_info but we can consider it to be a GUC if you and
> > > others think so provided the above proposal sounds reasonable. Also,
> > > max_lag could be defined in terms of LSN as well but I think time
> > > would be easy to configure.
> > >
> > > Thoughts?
> >
> > I agree that we cannot avoid accumulating dead tuples when the
> > workload on the publisher is more, and which affects the subscriber
> > performance. What we need to do is to update slot's xmin as quickly as
> > possible to minimize the dead tuple accumulation at least when the
> > subscriber is not much behind. If there is a tradeoff for doing so
> > (e.g., vs. the publisher performance), we need to provide a way for
> > users to balance it.
> >
>
> As of now, I can't think of a way to throttle the publisher when the
> apply_worker lags. Basically, we need some way to throttle (reduce the
> speed of backends) when the apply worker is lagging behind a threshold
> margin. Can you think of some way? I thought if one notices frequent
> invalidation of the launcher's slot due to max_lag, then they can
> rebalance their workload on the publisher.

I don't have any ideas other than invalidating the launcher's slot
when the apply lag is huge. We can think of invalidating the
launcher's slot for reasons such as the replay lag, the age of the
slot's xmin, and the duration.

>
> >
>   The max_lag idea sounds interesting for the case
> > where the subscriber is much behind. Probably we can visit this idea
> > as a new feature after completing this feature.
> >
>
> Sure, but what will be our answer to users for cases where the
> performance tanks due to bloat accumulation? The tests show that once
> the apply_lag becomes large, it becomes almost impossible for the
> apply worker to catch up (or take a very long time) and advance the
> slot's xmin. The users can disable retain_conflict_info to bring back
> the performance and get rid of bloat but I thought it would be easier
> for users to do that if we have some knob where they don't need to
> wait till actually the problem of bloat/performance dip happens.

Probably retaining dead tuples based on the time duration or their age
might be another solution; it would increase the risk of not being able
to detect the update_deleted conflict, though. I think, in any case, as
long as we accumulate dead tuples to detect update_deleted conflicts, it
will be a tradeoff between reliably detecting update_deleted
conflicts and the performance.

As for detecting update_deleted conflicts, we probably don't need the
whole tuple data of deleted tuples. It would be sufficient if we can
check the XIDs of deleted tuples to get their origins and commit
timestamps. Probably the same is true for the old version of an updated
tuple in terms of detecting update_origin_differs conflicts. If my
understanding is right, we can probably remove only the tuple data of
dead tuples that are older than an xmin horizon (excluding the
launcher's xmin), while leaving the heap tuple header, which can
minimize the table bloat.
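
To illustrate the kind of per-tuple information this needs: given the deleting
transaction's XID, track_commit_timestamp already lets us look up its commit
timestamp and origin. The XID below is just the one from the log excerpt
earlier in this thread; this is a sketch of the idea, not a claim about how the
patch implements it:

-- Sketch: what detection fundamentally needs about a deleted tuple's xmax.
-- Requires track_commit_timestamp = on; pg_xact_commit_timestamp_origin()
-- returns the commit timestamp and replication origin id for an XID.
SELECT * FROM pg_xact_commit_timestamp_origin('751'::xid);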

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Wed, Jan 15, 2025 at 5:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jan 13, 2025 at 8:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > As of now, I can't think of a way to throttle the publisher when the
> > apply_worker lags. Basically, we need some way to throttle (reduce the
> > speed of backends) when the apply worker is lagging behind a threshold
> > margin. Can you think of some way? I thought if one notices frequent
> > invalidation of the launcher's slot due to max_lag, then they can
> > rebalance their workload on the publisher.
>
> I don't have any ideas other than invalidating the launcher's slot
> when the apply lag is huge. We can think of invalidating the
> launcher's slot for some reasons such as the replay lag, the age of
> slot's xmin, and the duration.
>

Right, this is exactly where we are heading. I think we can add
reasons step-wise. For example, as a first step, we can invalidate the
slot due to replay LAG. Then, slowly, we can add other reasons as
well.

One thing that needs more discussion is the exact way to invalidate a
slot. I have mentioned a couple of ideas in my previous email which I
am writing again: "If we just invalidate the slot, users can check the
status of the slot and need to disable/enable retain_conflict_info
again to start retaining the required information. This would be
required because we can't allow system slots (slots created
internally) to be created by users. The other way could be that
instead of invalidating the slot, we directly drop/re-create the slot
or increase its xmin. If we choose to advance the slot automatically
without user intervention, we need to let users know via LOG and or
via information in the view."

> >
> > >
> >   The max_lag idea sounds interesting for the case
> > > where the subscriber is much behind. Probably we can visit this idea
> > > as a new feature after completing this feature.
> > >
> >
> > Sure, but what will be our answer to users for cases where the
> > performance tanks due to bloat accumulation? The tests show that once
> > the apply_lag becomes large, it becomes almost impossible for the
> > apply worker to catch up (or take a very long time) and advance the
> > slot's xmin. The users can disable retain_conflict_info to bring back
> > the performance and get rid of bloat but I thought it would be easier
> > for users to do that if we have some knob where they don't need to
> > wait till actually the problem of bloat/performance dip happens.
>
> Probably retaining dead tuples based on the time duration or its age
> might be other solutions, it would increase a risk of not being able
> to detect update_deleted conflict though. I think in any way as long
> as we accumulate dead tulpes to detect update_deleted conflicts, it
> would be a tradeoff between reliably detecting update_deleted
> conflicts and the performance.
>

Right, and users have an option for it. Say, if they set max_lag to -1
(or some special value), we won't invalidate the slot, so the
update_deleted conflict can be detected with complete reliability. At
this stage, it is okay if this information is LOGGED and displayed via
a system view. We will need more thought while working on the CONFLICT
RESOLUTION patch, such as possibly displaying an additional WARNING
or ERROR if the remote tuple's commit_time is earlier than the last
time the slot was invalidated. I don't want to go into a detailed
discussion at this point but just wanted you to know that we will need
additional work for the resolution of update_deleted conflicts to avoid
inconsistency.

> As for detecting update_deleted conflicts, we probably don't need the
> whole tuple data of deleted tuples. It would be sufficient if we can
> check XIDs of deleted tuple to get their origins and commit
> timestamps. Probably the same is true for the old version of updated
> tuple in terms of detecting update_origin_differ conflicts. If my
> understanding is right, probably we can remove only the tuple data of
> dead tuples that are older than a xmin horizon (excluding the
> launcher's xmin), while leaving the heap tuple header, which can
> minimize the table bloat.
>

I am afraid that is not possible because, even to detect the conflict,
we first need to find the matching tuple on the subscriber node. If
the replica_identity or primary key is present in the table, we could
try to save that along with the transaction info, but that won't be
simple either. Also, if an RI or primary key is not there, we need the
entire tuple to match. We would need a concept of tombstone tables (or
we can call it a dead-rows-store) where old data is stored reliably
until we no longer need it. We briefly discussed that idea previously
[1][2] and decided to move forward with the idea of retaining dead
tuples, based on the theory that we already use similar ideas in other
places.

BTW, a related point to note is that we need to retain the
conflict_info even to detect origin_differ conflict with complete
reliability. We need only commit_ts information for that case. See
analysis [3].

[1] - https://www.postgresql.org/message-id/CAJpy0uCov4JfZJeOvY0O21_gk9bcgNUDp4jf8%2BBbMp%2BEAv8cVQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/e4cdb849-d647-4acf-aabe-7049ae170fbf%40enterprisedb.com
[3] -
https://www.postgresql.org/message-id/OSCPR01MB14966F6B816880165E387758AF5112%40OSCPR01MB14966.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Wed, Jan 15, 2025 at 2:20 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> In the latest version, we implemented a simpler approach that allows the apply
> worker to directly advance the oldest_nonremovable_xid if the waiting time
> exceeds the newly introduced option's limit. I've named this option
> "max_conflict_retention_duration," as it aligns better with the conflict
> detection concept and the "retain_conflict_info" option.
>
> During the last phase (RCI_WAIT_FOR_LOCAL_FLUSH), the apply worker evaluates
> how much time it has spent waiting. If this duration exceeds the
> max_conflict_retention_duration, the worker directly advances the
> oldest_nonremovable_xid and logs a message indicating the forced advancement of
> the non-removable transaction ID.
>
> This approach is a bit like a time-based option that discussed before.
> Compared to the slot invalidation approach, this approach is simpler because we
> can avoid adding 1) new slot invalidation type due to apply lag, 2) new field
> lag_behind in shared memory (MyLogicalRepWorker) to indicate when the lag
> exceeds the limit, and 3) additional logic in the launcher to handle each
> worker's lag status.
>
> In the slot invalidation, user would be able to confirm if the current by
> checking if the slot in pg_replication_slot in invalidated or not, while in the
> simpler approach mentioned, user could only confirm that by checking the LOGs.
>

The user needs to check the LOGs corresponding to all subscriptions on
the node. I see the simplicity of the approach you used but still the
slot_invalidation idea sounds better to me on the grounds that it will
be convenient for users/DBA to know when to rely on the update_missing
type conflict if there is a valid and active slot with the name
'pg_conflict_detection' (or whatever name we decide to give) then
users can rely on the detected conflict. Sawada-San, and others, do
you have any preference on this matter?

Do we want to prohibit the combination of copy_data=true and
retain_conflict_info=true? I understand that with the new parameter
'max_conflict_retention_duration', for large copies the slot would
anyway be invalidated, but I don't want to give users more ways to see
this slot invalidated right at the beginning. Similarly, during ALTER
SUBSCRIPTION, if the initial sync is in progress, we can disallow
enabling retain_conflict_info. Later, if there is a real demand for
such a combination, we can always enable it.
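
For illustration, the combination in question is simply the sketch below;
whether it would be rejected outright or merely discouraged is exactly what is
being asked here (syntax per the patch under discussion):

-- Sketch: an initial table copy together with dead-tuple retention, which
-- could immediately push the retention slot towards
-- max_conflict_retention_duration.
CREATE SUBSCRIPTION sub_copy
    CONNECTION 'dbname=postgres'
    PUBLICATION pub1
    WITH (copy_data = true, retain_conflict_info = true);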

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Fri, Jan 3, 2025 at 4:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jan 2, 2025 at 2:57 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Conflict detection of truncated updates is detected as update_missing
> and deleted update is detected as update_deleted. I was not sure if
> truncated updates should also be detected as update_deleted, as the
> document says truncate operation is "It has the same effect as an
> unqualified DELETE on each table" at [1].
>

This is expected behavior because TRUNCATE would immediately reclaim
space and remove all the data. So, there is no way to retain the
removed row.

I’m not sure whether to call this expected behavior or simply acknowledge that we have no way to control it. Logically, it would have been preferable if it behaved like a DELETE, but we are constrained by the way TRUNCATE works. At least, that's my opinion on this case.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Thu, Jan 16, 2025 at 3:45 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jan 3, 2025 at 4:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Thu, Jan 2, 2025 at 2:57 PM vignesh C <vignesh21@gmail.com> wrote:
>> >
>> > Conflict detection of truncated updates is detected as update_missing
>> > and deleted update is detected as update_deleted. I was not sure if
>> > truncated updates should also be detected as update_deleted, as the
>> > document says truncate operation is "It has the same effect as an
>> > unqualified DELETE on each table" at [1].
>> >
>>
>> This is expected behavior because TRUNCATE would immediately reclaim
>> space and remove all the data. So, there is no way to retain the
>> removed row.
>
>
> I’m not sure whether to call this expected behavior or simply acknowledge that we have no way to control it. Logically, it would have been preferable if it behaved like a DELETE, but we are constrained by the way TRUNCATE works.
>

I see your point. So, it is probably better to add a Note about this.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Thu, Jan 16, 2025 at 4:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jan 16, 2025 at 3:45 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jan 3, 2025 at 4:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Thu, Jan 2, 2025 at 2:57 PM vignesh C <vignesh21@gmail.com> wrote:
>> >
>> > Conflict detection of truncated updates is detected as update_missing
>> > and deleted update is detected as update_deleted. I was not sure if
>> > truncated updates should also be detected as update_deleted, as the
>> > document says truncate operation is "It has the same effect as an
>> > unqualified DELETE on each table" at [1].
>> >
>>
>> This is expected behavior because TRUNCATE would immediately reclaim
>> space and remove all the data. So, there is no way to retain the
>> removed row.
>
>
> I’m not sure whether to call this expected behavior or simply acknowledge that we have no way to control it. Logically, it would have been preferable if it behaved like a DELETE, but we are constrained by the way TRUNCATE works.
>

I see your point. So, it is probably better to add a Note about this.

+1 

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Wed, Jan 15, 2025 at 9:38 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 15, 2025 at 5:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Probably retaining dead tuples based on the time duration or its age
> > might be other solutions, though it would increase the risk of not being
> > able to detect update_deleted conflicts. I think in any case, as long
> > as we accumulate dead tuples to detect update_deleted conflicts, it
> > would be a tradeoff between reliably detecting update_deleted
> > conflicts and the performance.
> >
>
> Right, and users have an option for it. Say, if they set max_lag as -1
> (or some special value), we won't invalidate the slot, so the
> update_deleted conflict can be detected with complete reliability. At
> this stage, it is okay if this information is LOGGED and displayed via
> a system view. We need more thought while working on the CONFLICT
> RESOLUTION patch, such as whether we need to additionally display a WARNING
> or ERROR if the remote tuple's commit_time is earlier than the last
> time the slot was invalidated.
>

The more reliable way to do something in this regard would be: if
there is a valid and active pg_conflict_detection slot (or whatever we name
this slot), then consider the detected update_missing conflict as
reliable. Otherwise, the conflict_type will depend on whether the
vacuum has removed the dead row. So, the conflict management system or
users would easily know when to rely on this update_missing conflict
type.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Thu, Jan 16, 2025 at 2:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 15, 2025 at 2:20 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > In the latest version, we implemented a simpler approach that allows the apply
> > worker to directly advance the oldest_nonremovable_xid if the waiting time
> > exceeds the newly introduced option's limit. I've named this option
> > "max_conflict_retention_duration," as it aligns better with the conflict
> > detection concept and the "retain_conflict_info" option.
> >
> > During the last phase (RCI_WAIT_FOR_LOCAL_FLUSH), the apply worker evaluates
> > how much time it has spent waiting. If this duration exceeds the
> > max_conflict_retention_duration, the worker directly advances the
> > oldest_nonremovable_xid and logs a message indicating the forced advancement of
> > the non-removable transaction ID.
> >
> > This approach is a bit like a time-based option that discussed before.
> > Compared to the slot invalidation approach, this approach is simpler because we
> > can avoid adding 1) new slot invalidation type due to apply lag, 2) new field
> > lag_behind in shared memory (MyLogicalRepWorker) to indicate when the lag
> > exceeds the limit, and 3) additional logic in the launcher to handle each
> > worker's lag status.
> >
> > In the slot invalidation approach, the user would be able to confirm the
> > current status by checking whether the slot in pg_replication_slots is
> > invalidated or not, while in the simpler approach mentioned, the user could
> > only confirm that by checking the LOGs.
> >
>
> The user needs to check the LOGs corresponding to all subscriptions on
> the node. I see the simplicity of the approach you used but still the
> slot_invalidation idea sounds better to me on the grounds that it will
> be convenient for users/DBA to know when to rely on the update_missing
> type conflict if there is a valid and active slot with the name
> 'pg_conflict_detection' (or whatever name we decide to give) then
> users can rely on the detected conflict. Sawada-San, and others, do
> you have any preference on this matter?

I also think that it would be convenient for users if they could check
if there was a valid and active pg_conflict_detection slot to know
when to rely on detected conflicts. On the other hand, I think it
would not be convenient for users if we always required user
intervention to re-create the slot. Once the slot is invalidated or
dropped, we can no longer guarantee that update_deleted conflicts are
detected reliably, but the logical replication would still be running.
That means we might have already been missing update_deleted
conflicts. From the user perspective, it would be cumbersome to
disable/enable retain_conflict_info (and check if the slot was
re-created) just to make retain_conflict_info work again.
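For concreteness, the manual cycle described above would look roughly like this
(a sketch only; retain_conflict_info and the slot name are proposals from this
thread, and sub1 is a made-up subscription name):

ALTER SUBSCRIPTION sub1 SET (retain_conflict_info = false);
ALTER SUBSCRIPTION sub1 SET (retain_conflict_info = true);

-- and then verify that the conflict-detection slot is back and valid:
SELECT slot_name, active
FROM pg_replication_slots
WHERE slot_name = 'pg_conflict_detection';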

> Do we want to prohibit the combination copy_data as true and
> retain_conflict_info=true?  I understand that with the new parameter
> 'max_conflict_retention_duration', for large copies slot would anyway
> be invalidated but I don't want to give users more ways to see this
> slot invalidated in the beginning itself. Similarly during ALTER
> SUBSCRIPTION, if the initial synch is in progress, we can disallow
> enabling retain_conflict_info. Later, if there is a real demand for
> such a combination, we can always enable it.

Does it mean that whenever users want to start the initial sync they
need to disable retain_conflict_info on all subscriptions? That
doesn't seem very convenient.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Jan 17, 2025 at 1:37 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jan 16, 2025 at 2:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 15, 2025 at 2:20 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > In the latest version, we implemented a simpler approach that allows the apply
> > > worker to directly advance the oldest_nonremovable_xid if the waiting time
> > > exceeds the newly introduced option's limit. I've named this option
> > > "max_conflict_retention_duration," as it aligns better with the conflict
> > > detection concept and the "retain_conflict_info" option.
> > >
> > > During the last phase (RCI_WAIT_FOR_LOCAL_FLUSH), the apply worker evaluates
> > > how much time it has spent waiting. If this duration exceeds the
> > > max_conflict_retention_duration, the worker directly advances the
> > > oldest_nonremovable_xid and logs a message indicating the forced advancement of
> > > the non-removable transaction ID.
> > >
> > > This approach is a bit like a time-based option that discussed before.
> > > Compared to the slot invalidation approach, this approach is simpler because we
> > > can avoid adding 1) new slot invalidation type due to apply lag, 2) new field
> > > lag_behind in shared memory (MyLogicalRepWorker) to indicate when the lag
> > > exceeds the limit, and 3) additional logic in the launcher to handle each
> > > worker's lag status.
> > >
> > > In the slot invalidation approach, the user would be able to confirm the
> > > current status by checking whether the slot in pg_replication_slots is
> > > invalidated or not, while in the simpler approach mentioned, the user could
> > > only confirm that by checking the LOGs.
> > >
> >
> > The user needs to check the LOGs corresponding to all subscriptions on
> > the node. I see the simplicity of the approach you used but still the
> > slot_invalidation idea sounds better to me on the grounds that it will
> > be convenient for users/DBA to know when to rely on the update_missing
> > type conflict if there is a valid and active slot with the name
> > 'pg_conflict_detection' (or whatever name we decide to give) then
> > users can rely on the detected conflict. Sawada-San, and others, do
> > you have any preference on this matter?
>
> I also think that it would be convenient for users if they could check
> if there was a valid and active pg_conflict_detection slot to know
> when to rely on detected conflicts. On the other hand, I think it
> would not be convenient for users if we always required user
> intervention to re-create the slot. Once the slot is invalidated or
> dropped, we can no longer guarantee that update_deleted conflicts are
> detected reliably, but the logical replication would still be running.
> That means we might have already been missing update_deleted
> conflicts. From the user perspective, it would be cumbersome to
> disable/enable retain_conflict_info (and check if the slot was
> re-created) just to make retain_conflict_info work again.
>

True, ideally, we can recreate the slot automatically or use the idea
of directly advancing oldest_nonremovable_xid as Hou-San proposed, or
directly advance the slot's xmin. However, we won't be able to detect
the update_deleted conflict until the publisher's load is adjusted (or
reduced), because the apply worker will keep lagging till that point, even
if we advance the slot's xmin automatically. So, we will keep re-creating
the slot or advancing it at regular intervals
(max_conflict_retention_duration) without any additional reliability.
This will lead to bloat retention and/or a performance dip on the
subscriber workload without anyone being able to detect the
update_missing type of conflict reliably.

The other possibilities to avoid/reduce user intervention could be
that once the subscriber catches up with the publisher in terms of
applying WAL, we re-create/advance the slot. We could do this in
multiple ways: (a) say, when the last_received_pos from the publisher
equals the last flush position on the subscriber, or (b) the apply worker
keeps performing the xid advancement phases but does not actually advance
the xid; it's only intended to check the latest lag. If the lag becomes
less than the max_conflict_retention_duration, then notify the launcher to
re-create the slot.
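Option (a) is essentially the following check on the subscriber (a rough
sketch only; it assumes the default per-subscription origin name 'pg_<subid>',
looks only at the main apply worker, and treats pg_replication_origin_status's
remote_lsn as the locally flushed apply position):

SELECT s.subname,
       s.received_lsn,
       o.remote_lsn AS last_applied_lsn,
       s.received_lsn = o.remote_lsn AS caught_up
FROM pg_stat_subscription s
JOIN pg_replication_origin_status o ON o.external_id = 'pg_' || s.subid
WHERE s.relid IS NULL;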

I feel these are some optimizations that could reduce the need to
re-enable retain_conflict_info but users can still do it manually if
they wish to.

> > Do we want to prohibit the combination copy_data as true and
> > retain_conflict_info=true?  I understand that with the new parameter
> > 'max_conflict_retention_duration', for large copies slot would anyway
> > be invalidated but I don't want to give users more ways to see this
> > slot invalidated in the beginning itself. Similarly during ALTER
> > SUBSCRIPTION, if the initial synch is in progress, we can disallow
> > enabling retain_conflict_info. Later, if there is a real demand for
> > such a combination, we can always enable it.
>
> Does it mean that whenever users want to start the initial sync they
> need to disable reatin_conflict_info on all subscriptions? Which
> doesn't seem very convenient.
>

I agree it is inconvenient, but OTOH, if it leads to a large copy then
anyway the slot may not be able to progress, leading to invalidation.
As it is difficult to predict, we may allow it but document that large
copies could lead to slot invalidation, as during that time there is a
possibility that we may not be able to apply any WAL.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Nisha Moond
Дата:
On Sat, Jan 18, 2025 at 9:15 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Here is the V24 patch set. I modified 0004 patch to implement the slot
> Invalidation part. Since the automatic recovery could be an optimization and
> the discussion is in progress, I didn't implement that part.

A few comments for patch 0004
====
src/backend/replication/slot.c

1) Need to describe the new macro RS_INVAL_CONFLICT_RETENTION_DURATION
in the comments above InvalidateObsoleteReplicationSlots(), where all
other invalidation causes are explained.
...
 * Whether a slot needs to be invalidated depends on the cause. A slot is
 * removed if it:
 * - RS_INVAL_WAL_REMOVED: requires a LSN older than the given segment
 * - RS_INVAL_HORIZON: requires a snapshot <= the given horizon in the given
 *   db; dboid may be InvalidOid for shared relations
 * - RS_INVAL_WAL_LEVEL: is logical
...

2) Can we mention the GUC parameter that defines this "maximum limit"
while reporting?

+
+ case RS_INVAL_CONFLICT_RETENTION_DURATION:
+ appendStringInfo(&err_detail, _("The duration for retaining conflict information exceeds the maximum limit."));
+ break;
+

Something like -
  "The duration for retaining conflict information exceeds the maximum
limit configured in \"%s\".","max_conflict_retention_duration"

=====
doc/src/sgml/ref/create_subscription.sgml

3)
+         <para>
+          Note that setting a non-zero value for this option could lead to
+          conflict information being removed prematurely, potentially missing
+          some conflict detections.
+         </para>

Should we add the above info as a “Warning” in the docs?



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Thu, Jan 23, 2025 at 3:47 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, January 22, 2025 7:54 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> > On Saturday, January 18, 2025 11:45 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > > I think invalidating the slot is OK, and we could also let the apply
> > > worker perform automatic recovery as suggested in [1].
> > >
> > > Here is the V24 patch set. I modified 0004 patch to implement the slot
> > > Invalidation part. Since the automatic recovery could be an
> > > optimization and the discussion is in progress, I didn't implement that part.
> >
> > The implementation is in progress and I will include it in next version.
> >
> > Here is the V25 patch set that includes the following change:
> >
> > 0001
> >
> > * Per off-list discussion with Amit, I added a few comments to mention the
> > reason for skipping xid advancement when table sync is in progress and to mention
> > that the advancement will not be delayed if changes are filtered out on the
> > publisher via a row/table filter.
> >
> > 0004
> >
> > * Fixed a bug that the launcher would advance the slot.xmin when some apply
> >   workers have not yet started.
> >
> > * Fixed a bug that the launcher did not advance the slot.xmin even if one of the
> >   apply workers had stopped conflict retention due to the lag.
> >
> > * Add a retain_conflict_info column in the pg_stat_subscription view to
> >   indicate whether the apply worker is effectively retaining conflict
> >   information. The value is set to true only if retain_conflict_info is enabled
> >   for the associated subscription, and the retention duration for conflict
> >   detection by the apply worker has not exceeded
> >   max_conflict_retention_duration. Thanks to Kuroda-san for contributing code
> >   off-list.
>
> Here is V25 patch set which includes the following changes:
>
> 0004
> * Addressed Nisha's comments[1].
> * Fixed a cfbot failure[2] in the doc.

I have one question about the 0004 patch; it implemented
max_conflict_retention_duration as a subscription parameter. But the
launcher invalidates the pg_conflict_detection slot only if all
subscriptions with retain_conflict_info stopped retaining dead tuples
due to the max_conflict_retention_duration parameter. Therefore, even
if users set the parameter to a low value to avoid table bloat, it
would not make sense if other subscriptions set it to a larger value.
Is my understanding correct?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Jan 31, 2025 at 4:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I have one question about the 0004 patch; it implemented
> max_conflict_retntion_duration as a subscription parameter. But the
> launcher invalidates the pg_conflict_detection slot only if all
> subscriptions with retain_conflict_info stopped retaining dead tuples
> due to the max_conflict_retention_duration parameter. Therefore, even
> if users set the parameter to a low value to avoid table bloats, it
> would not make sense if other subscriptions set it to a larger value.
> Is my understanding correct?
>

Yes, your understanding is correct. I think this could be helpful
during resolution because the worker for which the duration has been
exceeded cannot detect conflicts reliably, but others can. So, this
info can be useful while performing resolutions. Do you have an
opinion/suggestion on this matter?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Thu, Jan 30, 2025 at 10:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 31, 2025 at 4:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I have one question about the 0004 patch; it implemented
> > max_conflict_retntion_duration as a subscription parameter. But the
> > launcher invalidates the pg_conflict_detection slot only if all
> > subscriptions with retain_conflict_info stopped retaining dead tuples
> > due to the max_conflict_retention_duration parameter. Therefore, even
> > if users set the parameter to a low value to avoid table bloats, it
> > would not make sense if other subscriptions set it to a larger value.
> > Is my understanding correct?
> >
>
> Yes, your understanding is correct. I think this could be helpful
> during resolution because the worker for which the duration has
> exceeded cannot detect conflicts reliably but others can. So, this
> info can be useful while performing resolutions. Do you have an
> opinion/suggestion on this matter?

I imagined a scenario where two apply workers are running and
have different max_conflict_retention_duration values (say '5 min' and
'15 min'). Suppose both workers are roughly equally behind the
publisher(s); when both workers cannot advance their xmin
values for 5 min or longer, one worker stops retaining dead tuples.
However, the pg_conflict_detection slot is not invalidated yet since
another worker is still using it, so both workers would continue to
get slower. The subscriber would end up retaining dead tuples
until both workers are behind for 15 min or longer, before
invalidating the slot. In this case, stopping dead tuple retention on
the first worker would help neither advance the slot's xmin nor
improve the other worker's performance. I was not sure of the point of
making max_conflict_retention_duration a per-subscription
parameter.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Sat, Feb 1, 2025 at 2:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jan 30, 2025 at 10:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jan 31, 2025 at 4:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > I have one question about the 0004 patch; it implemented
> > > max_conflict_retntion_duration as a subscription parameter. But the
> > > launcher invalidates the pg_conflict_detection slot only if all
> > > subscriptions with retain_conflict_info stopped retaining dead tuples
> > > due to the max_conflict_retention_duration parameter. Therefore, even
> > > if users set the parameter to a low value to avoid table bloats, it
> > > would not make sense if other subscriptions set it to a larger value.
> > > Is my understanding correct?
> > >
> >
> > Yes, your understanding is correct. I think this could be helpful
> > during resolution because the worker for which the duration has
> > exceeded cannot detect conflicts reliably but others can. So, this
> > info can be useful while performing resolutions. Do you have an
> > opinion/suggestion on this matter?
>
> I imagined a scenario like where two apply workers are running and
> have different max_conflict_retention_duration values (say '5 min' and
> '15 min'). Suppose both workers are roughly the same behind the
> publisher(s), when both workers cannot advance the workers' xmin
> values for 5 min or longer, one worker stops retaining dead tuples.
> However, the pg_conflict_detection slot is not invalidated yet since
> another worker is still using it, so both workers would continue to be
> getting slower. The subscriber would end up retaining dead tuples
> until both workers are behind for 15 min or longer, before
> invalidating the slot. In this case, stopping dead tuple retention on
> the first worker would help neither advance the slot's xmin nor
> improve another worker's performance.

Won't the same be true for 'retain_conflict_info' option as well? I
mean even if one worker is retaining dead tuples, the performance of
others will also be impacted.

>
> I was not sure of the point of
> making the max_conflict_retention_duration a per-subscription
> parameter.
>

The idea is to keep it at the same level as the other related
parameter 'retain_conflict_info'. It could be useful for cases where
the publishers are two different nodes (NP1 and NP2) and we have
separate subscriptions for both nodes. Now, it is possible that users
won't expect conflicts on the tables from one of the nodes, say NP1;
then they could choose to enable 'retain_conflict_info' and
'max_conflict_retention_duration' only for the subscription pointing
to publisher NP2.

Now, say the publisher node that can generate conflicts (NP2) has
fewer writes and the corresponding apply worker could easily catch up
and almost always be in sync with the publisher. In contrast, the
other node that has no conflicts has a large number of writes. In such
cases, giving new options at the subscription level will be helpful.

If we want to provide it at the global level, then the performance or
dead tuple control may not be any better than with the current patch,
but it won't allow for the above kinds of cases. Second, adding
two new GUCs is another thing I want to avoid. But OTOH, the
implementation could be slightly simpler if we provide these options
as GUCs, though I am not completely sure of that point. Having said
that, I am open to changing it to a non-subscription level. Do you
think it would be better to provide one or both of these parameters as
GUCs, or do you have something else in mind?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Fri, Jan 31, 2025 at 9:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Feb 1, 2025 at 2:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Jan 30, 2025 at 10:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Jan 31, 2025 at 4:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I have one question about the 0004 patch; it implemented
> > > > max_conflict_retntion_duration as a subscription parameter. But the
> > > > launcher invalidates the pg_conflict_detection slot only if all
> > > > subscriptions with retain_conflict_info stopped retaining dead tuples
> > > > due to the max_conflict_retention_duration parameter. Therefore, even
> > > > if users set the parameter to a low value to avoid table bloats, it
> > > > would not make sense if other subscriptions set it to a larger value.
> > > > Is my understanding correct?
> > > >
> > >
> > > Yes, your understanding is correct. I think this could be helpful
> > > during resolution because the worker for which the duration has
> > > exceeded cannot detect conflicts reliably but others can. So, this
> > > info can be useful while performing resolutions. Do you have an
> > > opinion/suggestion on this matter?
> >
> > I imagined a scenario like where two apply workers are running and
> > have different max_conflict_retention_duration values (say '5 min' and
> > '15 min'). Suppose both workers are roughly the same behind the
> > publisher(s), when both workers cannot advance the workers' xmin
> > values for 5 min or longer, one worker stops retaining dead tuples.
> > However, the pg_conflict_detection slot is not invalidated yet since
> > another worker is still using it, so both workers would continue to be
> > getting slower. The subscriber would end up retaining dead tuples
> > until both workers are behind for 15 min or longer, before
> > invalidating the slot. In this case, stopping dead tuple retention on
> > the first worker would help neither advance the slot's xmin nor
> > improve another worker's performance.
>
> Won't the same be true for 'retain_conflict_info' option as well? I
> mean even if one worker is retaining dead tuples, the performance of
> others will also be impacted.

I guess the situation might be a bit different. It's a user's choice
to disable retain_conflict_info, and it should be done manually. That
is, in this case, I think users will be able to figure out that both
apply workers are equally behind the publishers and that they need to
disable retain_conflict_info on both subscriptions in order to remove
accumulated dead tuples (which is the cause of the performance dip).

On the other hand, ISTM max_conflict_retention_duration is something
like a switch to recover the system performance by automatically
disabling retain_conflict_info (and it will automatically go back to
being enabled again). I guess users who use
max_conflict_retention_duration would expect that the system
performance will tend to recover by automatically disabling
retain_conflict_info if the apply worker is lagging for longer than
the specified value. However, there are cases where this cannot be
expected.

>
> >
> > I was not sure of the point of
> > making the max_conflict_retention_duration a per-subscription
> > parameter.
> >
>
> The idea is to keep it at the same level as the other related
> parameter 'retain_conflict_info'. It could be useful for cases where
> publishers are from two different nodes (NP1 and  NP2) and we have
> separate subscriptions for both nodes. Now, it is possible that users
> won't expect conflicts on the tables from one of the nodes NP1 then
> she could choose to enable 'retain_conflict_info' and
> 'max_conflict_retention_duration' only for the subscription pointing
> to publisher NP2.
>
> Now, say the publisher node that can generate conflicts (NP2) has
> fewer writes and the corresponding apply worker could easily catch up
> and almost always be in sync with the publisher. In contrast, the
> other node that has no conflicts has a large number of writes. In such
> cases, giving new options at the subscription level will be helpful.
>
> If we want to provide it at the global level, then the performance or
> dead tuple control may not be any better than the current patch but
> won't allow the provision for the above kinds of cases. Second, adding
> two new GUCs is another thing I want to prevent. But OTOH, the
> implementation could be slightly simpler if we provide these options
> as GUC though I am not completely sure of that point. Having said
> that, I am open to changing it to a non-subscription level. Do you
> think it would be better to provide one or both of these parameters as
> GUCs or do you have something else in mind?

It makes sense to me to have the retain_conflict_info as a
subscription-level parameter. I was thinking of making only
max_conflict_retention_duration a global parameter, but I might be
missing something. With a subscription-level
max_conflict_retention_duration, how can users choose the setting
values for each subscription, and is there a case that can be covered
only by a subscription-level max_conflict_retention_duration?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Sat, Feb 1, 2025 at 10:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Feb 1, 2025 at 2:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Jan 30, 2025 at 10:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Jan 31, 2025 at 4:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I have one question about the 0004 patch; it implemented
> > > > max_conflict_retntion_duration as a subscription parameter. But the
> > > > launcher invalidates the pg_conflict_detection slot only if all
> > > > subscriptions with retain_conflict_info stopped retaining dead tuples
> > > > due to the max_conflict_retention_duration parameter. Therefore, even
> > > > if users set the parameter to a low value to avoid table bloats, it
> > > > would not make sense if other subscriptions set it to a larger value.
> > > > Is my understanding correct?
> > > >
> > >
> > > Yes, your understanding is correct. I think this could be helpful
> > > during resolution because the worker for which the duration has
> > > exceeded cannot detect conflicts reliably but others can. So, this
> > > info can be useful while performing resolutions. Do you have an
> > > opinion/suggestion on this matter?
> >
> > I imagined a scenario like where two apply workers are running and
> > have different max_conflict_retention_duration values (say '5 min' and
> > '15 min'). Suppose both workers are roughly the same behind the
> > publisher(s), when both workers cannot advance the workers' xmin
> > values for 5 min or longer, one worker stops retaining dead tuples.
> > However, the pg_conflict_detection slot is not invalidated yet since
> > another worker is still using it, so both workers would continue to be
> > getting slower. The subscriber would end up retaining dead tuples
> > until both workers are behind for 15 min or longer, before
> > invalidating the slot. In this case, stopping dead tuple retention on
> > the first worker would help neither advance the slot's xmin nor
> > improve another worker's performance.
>
> Won't the same be true for 'retain_conflict_info' option as well? I
> mean even if one worker is retaining dead tuples, the performance of
> others will also be impacted.


+1

>
> >
> > I was not sure of the point of
> > making the max_conflict_retention_duration a per-subscription
> > parameter.
> >
>
> The idea is to keep it at the same level as the other related
> parameter 'retain_conflict_info'. It could be useful for cases where
> publishers are from two different nodes (NP1 and  NP2) and we have
> separate subscriptions for both nodes. Now, it is possible that users
> won't expect conflicts on the tables from one of the nodes NP1 then
> she could choose to enable 'retain_conflict_info' and
> 'max_conflict_retention_duration' only for the subscription pointing
> to publisher NP2.
>
> Now, say the publisher node that can generate conflicts (NP2) has
> fewer writes and the corresponding apply worker could easily catch up
> and almost always be in sync with the publisher. In contrast, the
> other node that has no conflicts has a large number of writes. In such
> cases, giving new options at the subscription level will be helpful.
>
> If we want to provide it at the global level, then the performance or
> dead tuple control may not be any better than the current patch but
> won't allow the provision for the above kinds of cases. Second, adding
> two new GUCs is another thing I want to prevent. But OTOH, the
> implementation could be slightly simpler if we provide these options
> as GUC though I am not completely sure of that point. Having said
> that, I am open to changing it to a non-subscription level. Do you
> think it would be better to provide one or both of these parameters as
> GUCs or do you have something else in mind?
>

I agree with this analogy. It seems that
'max_conflict_retention_duration' is quite similar to
'retain_conflict_info'. In both cases, the slot for retaining dead
tuples is shared among all subscribers. However, these subscribers may
be receiving data from different publishers and even different nodes.
Therefore, the decision on whether to wait and for how long should be
made at the subscriber level.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Wed, Feb 5, 2025 at 6:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jan 31, 2025 at 9:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > I was not sure of the point of
> > > making the max_conflict_retention_duration a per-subscription
> > > parameter.
> > >
> >
> > The idea is to keep it at the same level as the other related
> > parameter 'retain_conflict_info'. It could be useful for cases where
> > publishers are from two different nodes (NP1 and  NP2) and we have
> > separate subscriptions for both nodes. Now, it is possible that users
> > won't expect conflicts on the tables from one of the nodes NP1 then
> > she could choose to enable 'retain_conflict_info' and
> > 'max_conflict_retention_duration' only for the subscription pointing
> > to publisher NP2.
> >
> > Now, say the publisher node that can generate conflicts (NP2) has
> > fewer writes and the corresponding apply worker could easily catch up
> > and almost always be in sync with the publisher. In contrast, the
> > other node that has no conflicts has a large number of writes. In such
> > cases, giving new options at the subscription level will be helpful.
> >
> > If we want to provide it at the global level, then the performance or
> > dead tuple control may not be any better than the current patch but
> > won't allow the provision for the above kinds of cases. Second, adding
> > two new GUCs is another thing I want to prevent. But OTOH, the
> > implementation could be slightly simpler if we provide these options
> > as GUC though I am not completely sure of that point. Having said
> > that, I am open to changing it to a non-subscription level. Do you
> > think it would be better to provide one or both of these parameters as
> > GUCs or do you have something else in mind?
>
> It makes sense to me to have the retain_conflict_info as a
> subscription-level parameter. I was thinking of making only
> max_conflict_retention_duration a global parameter, but I might be
> missing something. With a subscription-level
> max_conflict_retention_duration, how can users choose the setting
> values for each subscription, and is there a case that can be covered
> only by a subscription-level max_conflict_retention_duration?
>

Users can configure it depending on the workload of the publisher,
considering the publishers are different nodes, as explained in my
previous response. Also, I think it will help in resolutions: the
worker for which the duration for updating the worker-level xmin has
not exceeded the max_conflict_retention_duration can still reliably
detect update_deleted. Then this parameter will only be required for
subscriptions that have enabled retain_conflict_info. I am not
completely sure if these are reasons enough to keep it at the
subscription level, but OTOH Dilip also seems to favor keeping
max_conflict_retention_duration at the subscription level.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Thu, Jan 23, 2025 at 5:17 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
I was reviewing the v26 patch set and have some comments. So far I have
reviewed 0001, so most of the comments/questions are about this patch.

comments on v26-0001

1.
+ next_full_xid = ReadNextFullTransactionId();
+ epoch = EpochFromFullTransactionId(next_full_xid);
+
+ /*
+ * Adjust the epoch if the next transaction ID is less than the oldest
+ * running transaction ID. This handles the case where transaction ID
+ * wraparound has occurred.
+ */
+ if (oldest_running_xid > XidFromFullTransactionId(next_full_xid))
+ epoch--;
+
+ full_xid = FullTransactionIdFromEpochAndXid(epoch, oldest_running_xid);

I think you can directly use the 'AdjustToFullTransactionId()'
function here; maybe we can move it somewhere else and make it a
non-static function.


2.
+ /*
+ * We expect the publisher and subscriber clocks to be in sync using time
+ * sync service like NTP. Otherwise, we will advance this worker's
+ * oldest_nonremovable_xid prematurely, leading to the removal of rows
+ * required to detect update_delete conflict.
+ *
+ * XXX Consider waiting for the publisher's clock to catch up with the
+ * subscriber's before proceeding to the next phase.
+ */
+ if (TimestampDifferenceExceeds(data->reply_time,
+    data->candidate_xid_time, 0))
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("oldest_nonremovable_xid transaction ID may be advanced prematurely"),
+ errdetail("The clock on the publisher is behind that of the subscriber."));


I don't fully understand the purpose of this check. Based on the
comments in RetainConflictInfoData, if I understand correctly,
candidate_xid_time represents the time when the candidate is
determined, and reply_time indicates the time of the reply from the
publisher. Why do we expect these two timestamps to have zero
difference to ensure clock synchronization?

3.
+ /*
+ * Use last_recv_time when applying changes in the loop; otherwise, get
+ * the latest timestamp.
+ */
+ now = data->last_recv_time ? data->last_recv_time : GetCurrentTimestamp();

Can you explain in the comment what's the logic behind using
last_recv_time here?  Why not just compare 'candidate_xid_time' vs
current timestamp?

4.
The comments in v26-0004 don't clearly explain whether, once retention
has stopped after reaching 'max_conflict_retention_duration', it will
resume again.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Tue, Feb 4, 2025 at 10:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 5, 2025 at 6:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Jan 31, 2025 at 9:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > > I was not sure of the point of
> > > > making the max_conflict_retention_duration a per-subscription
> > > > parameter.
> > > >
> > >
> > > The idea is to keep it at the same level as the other related
> > > parameter 'retain_conflict_info'. It could be useful for cases where
> > > publishers are from two different nodes (NP1 and  NP2) and we have
> > > separate subscriptions for both nodes. Now, it is possible that users
> > > won't expect conflicts on the tables from one of the nodes NP1 then
> > > she could choose to enable 'retain_conflict_info' and
> > > 'max_conflict_retention_duration' only for the subscription pointing
> > > to publisher NP2.
> > >
> > > Now, say the publisher node that can generate conflicts (NP2) has
> > > fewer writes and the corresponding apply worker could easily catch up
> > > and almost always be in sync with the publisher. In contrast, the
> > > other node that has no conflicts has a large number of writes. In such
> > > cases, giving new options at the subscription level will be helpful.
> > >
> > > If we want to provide it at the global level, then the performance or
> > > dead tuple control may not be any better than the current patch but
> > > won't allow the provision for the above kinds of cases. Second, adding
> > > two new GUCs is another thing I want to prevent. But OTOH, the
> > > implementation could be slightly simpler if we provide these options
> > > as GUC though I am not completely sure of that point. Having said
> > > that, I am open to changing it to a non-subscription level. Do you
> > > think it would be better to provide one or both of these parameters as
> > > GUCs or do you have something else in mind?
> >
> > It makes sense to me to have the retain_conflict_info as a
> > subscription-level parameter. I was thinking of making only
> > max_conflict_retention_duration a global parameter, but I might be
> > missing something. With a subscription-level
> > max_conflict_retention_duration, how can users choose the setting
> > values for each subscription, and is there a case that can be covered
> > only by a subscription-level max_conflict_retention_duration?
> >
>
> Users can configure depending on the workload of the publisher
> considering the publishers are different nodes as explained in my
> previous response. Also, I think it will help in resolutions where the
> worker for which the duration for updating the worker_level xmin has
> not exceeded the max_conflict_retention_duration can reliably detect
> update_delete. Then this parameter will only be required for
> subscriptions that have enabled retain_conflict_info. I am not
> completely sure if these are reasons enough to keep at the
> subscription level but OTOH Dilip also seems to favor keeping
> max_conflict_retention_duration at susbcription-level.

I'd like to confirm what users would expect of this
max_conflict_retention_duration option and whether it works as expected.
IIUC users would want to use this option when they want to balance
the reliable update_deleted conflict detection against the performance. I
think they want to detect update_deleted reliably as much as possible
but, at the same time, would like to avoid a huge performance dip
caused by it. IOW, once the apply lag becomes larger than the limit,
they would expect to prioritize the performance (recovery) over the
reliable update_deleted conflict detection.

With the subscription-level max_conflict_retention_duration, users can
set it to '5min' for one subscription, SUB1, while not setting it for
another subscription, SUB2 (assuming here that both subscriptions set
retain_conflict_info = true). This setting works fine if SUB2 can
easily catch up while SUB1 is lagging, because in this case SUB1
would stop updating its xmin when lagging for 5 min or longer, so the
slot's xmin can advance based only on SUB2's xmin, which is good
because it ultimately allows vacuum to remove dead tuples and
contributes to better performance. On the other hand, in cases where
SUB2 is as delayed as or more than SUB1, even if SUB1 stopped updating
its xmin, the slot's xmin would not be able to advance. IIUC the
pg_conflict_detection slot won't be invalidated as long as there is at
least one subscription that sets retain_conflict_info = true and
doesn't set max_conflict_retention_duration, even if other
subscriptions set max_conflict_retention_duration.
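For concreteness, the setup being described would be created roughly like this
(syntax as proposed in this thread, not available in released PostgreSQL;
connection strings and names are made up):

CREATE SUBSCRIPTION sub1
    CONNECTION 'host=pub1 dbname=postgres' PUBLICATION pub1
    WITH (retain_conflict_info = true,
          max_conflict_retention_duration = '5min');

CREATE SUBSCRIPTION sub2
    CONNECTION 'host=pub2 dbname=postgres' PUBLICATION pub2
    WITH (retain_conflict_info = true);  -- no cap, so it keeps the slot's xmin pinned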

I'm not really sure that these behaviors are the expected behavior of
users who set max_conflict_retention_duration to some subscriptions.
Or I might have set the wrong expectation or assumption on this
parameter. I'm fine with a subscription-level
max_conflict_retention_duration if it's clear this option works as
expected by users who want to use it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Feb 7, 2025 at 2:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I'd like to confirm what users would expect of this
> max_conflict_retention_duration option and it works as expected. IIUC
> users would want to use this option when they want to balance between
> the reliable update_deleted conflict detection and the performance. I
> think they want to detect updated_deleted reliably as much as possible
> but, at the same time, would like to avoid a huge performance dip
> caused by it. IOW, once the apply lag becomes larger than the limit,
> they would expect to prioritize the performance (recovery) over the
> reliable update_deleted conflict detection.
>

Yes, this understanding is correct.

> With the subscription-level max_conflict_retention_duration, users can
> set it to '5min' to a subscription, SUB1, while not setting it to
> another subscription, SUB2, (assuming here that both subscriptions set
> retain_conflict_info = true). This setting works fine if SUB2 could
> easily catch up while SUB1 is delaying, because in this case, SUB1
> would stop updating its xmin when delaying for 5 min or longer so the
> slot's xmin can advance based only on SUB2's xmin. Which is good
> because it ultimately allow vacuum to remove dead tuples and
> contributes to better performance. On the other hand, in cases where
> SUB2 is as delayed as or more than SUB1, even if SUB1 stopped updating
> its xmin, the slot's xmin would not be able to advance. IIUC
> pg_conflict_detection slot won't be invalidated as long as there is at
> least one subscription that sets retain_conflict_info = true and
> doesn't set max_conflict_retention_duration, even if other
> subscriptions set max_conflict_retention_duration.
>

Right.

> I'm not really sure that these behaviors are the expected behavior of
> users who set max_conflict_retention_duration to some subscriptions.
> Or I might have set the wrong expectation or assumption on this
> parameter. I'm fine with a subscription-level
> max_conflict_retention_duration if it's clear this option works as
> expected by users who want to use it.
>

It seems you are not convinced about providing this parameter at the
subscription level. Anyway, providing it as a GUC will simplify the
implementation, and it would probably be easier for users to configure
than setting it at the subscription level for all subscriptions
that have retain_conflict_info set to true. I guess in the future
we can provide it at the subscription level as well if there is a
clear use case for it. Does that make sense to you?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Thu, Feb 6, 2025 at 9:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 7, 2025 at 2:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I'd like to confirm what users would expect of this
> > max_conflict_retention_duration option and it works as expected. IIUC
> > users would want to use this option when they want to balance between
> > the reliable update_deleted conflict detection and the performance. I
> > think they want to detect updated_deleted reliably as much as possible
> > but, at the same time, would like to avoid a huge performance dip
> > caused by it. IOW, once the apply lag becomes larger than the limit,
> > they would expect to prioritize the performance (recovery) over the
> > reliable update_deleted conflict detection.
> >
>
> Yes, this understanding is correct.
>
> > With the subscription-level max_conflict_retention_duration, users can
> > set it to '5min' to a subscription, SUB1, while not setting it to
> > another subscription, SUB2, (assuming here that both subscriptions set
> > retain_conflict_info = true). This setting works fine if SUB2 could
> > easily catch up while SUB1 is delaying, because in this case, SUB1
> > would stop updating its xmin when delaying for 5 min or longer so the
> > slot's xmin can advance based only on SUB2's xmin. Which is good
> > because it ultimately allow vacuum to remove dead tuples and
> > contributes to better performance. On the other hand, in cases where
> > SUB2 is as delayed as or more than SUB1, even if SUB1 stopped updating
> > its xmin, the slot's xmin would not be able to advance. IIUC
> > pg_conflict_detection slot won't be invalidated as long as there is at
> > least one subscription that sets retain_conflict_info = true and
> > doesn't set max_conflict_retention_duration, even if other
> > subscriptions set max_conflict_retention_duration.
> >
>
> Right.
>
> > I'm not really sure that these behaviors are the expected behavior of
> > users who set max_conflict_retention_duration to some subscriptions.
> > Or I might have set the wrong expectation or assumption on this
> > parameter. I'm fine with a subscription-level
> > max_conflict_retention_duration if it's clear this option works as
> > expected by users who want to use it.
> >
>
> It seems you are not convinced to provide this parameter at the
> subscription level and anyway providing it as GUC will simplify the
> implementation and it would probably be easier for users to configure
> rather than giving it at the subscription level for all subscriptions
> that have set retain_conflict_info set to true. I guess in the future
> we can provide it at the subscription level as well if there is a
> clear use case for it. Does that make sense to you?

Yes, that makes sense to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Fri, Feb 7, 2025 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 7, 2025 at 2:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I'd like to confirm what users would expect of this
> > max_conflict_retention_duration option and it works as expected. IIUC
> > users would want to use this option when they want to balance between
> > the reliable update_deleted conflict detection and the performance. I
> > think they want to detect updated_deleted reliably as much as possible
> > but, at the same time, would like to avoid a huge performance dip
> > caused by it. IOW, once the apply lag becomes larger than the limit,
> > they would expect to prioritize the performance (recovery) over the
> > reliable update_deleted conflict detection.
> >
>
> Yes, this understanding is correct.
>
> > With the subscription-level max_conflict_retention_duration, users can
> > set it to '5min' to a subscription, SUB1, while not setting it to
> > another subscription, SUB2, (assuming here that both subscriptions set
> > retain_conflict_info = true). This setting works fine if SUB2 could
> > easily catch up while SUB1 is delaying, because in this case, SUB1
> > would stop updating its xmin when delaying for 5 min or longer so the
> > slot's xmin can advance based only on SUB2's xmin. Which is good
> > because it ultimately allow vacuum to remove dead tuples and
> > contributes to better performance. On the other hand, in cases where
> > SUB2 is as delayed as or more than SUB1, even if SUB1 stopped updating
> > its xmin, the slot's xmin would not be able to advance. IIUC
> > pg_conflict_detection slot won't be invalidated as long as there is at
> > least one subscription that sets retain_conflict_info = true and
> > doesn't set max_conflict_retention_duration, even if other
> > subscriptions set max_conflict_retention_duration.
> >

That seems like a valid point.

>
> > I'm not really sure that these behaviors are the expected behavior of
> > users who set max_conflict_retention_duration to some subscriptions.
> > Or I might have set the wrong expectation or assumption on this
> > parameter. I'm fine with a subscription-level
> > max_conflict_retention_duration if it's clear this option works as
> > expected by users who want to use it.
> >
>
> It seems you are not convinced to provide this parameter at the
> subscription level and anyway providing it as GUC will simplify the
> implementation and it would probably be easier for users to configure
> rather than giving it at the subscription level for all subscriptions
> that have set retain_conflict_info set to true. I guess in the future
> we can provide it at the subscription level as well if there is a
> clear use case for it. Does that make sense to you?

Would it make sense to introduce a GUC parameter for this value, where
subscriptions can override it individually, but only up to the limit
set by the GUC? This would allow flexibility in certain cases --
subscribers could opt to wait for a shorter duration than the GUC value
if needed.

Although a concrete use case isn't immediately clear, consider a
hypothetical scenario: Suppose a subscriber connected to Node1 must
wait for a long period to avoid an incorrect conflict decision. In such
cases, it would rely on the default high value set by the GUC.
However, since Node1 is generally responsive and rarely has
long-running transactions, this long wait would only be necessary in
rare cases. On the other hand, a subscriber pulling from Node2 may not
require as stringent conflict detection. If Node2 frequently has
long-running transactions, waiting too long could lead to excessive
delays.

The idea here is that the Node1 subscriber can wait for the full
max_conflict_retention_duration set by the GUC when necessary, while
the Node2 subscriber can choose a shorter wait time to avoid
unnecessary delays caused by frequent long transactions.
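A rough sketch of that combination (entirely hypothetical; neither the GUC nor
the subscription-level override exists, and the names simply follow this
thread):

-- postgresql.conf on the subscriber: the upper bound for every subscription
-- max_conflict_retention_duration = '15min'

-- Node1 subscriber relies on the GUC default and waits the full duration:
CREATE SUBSCRIPTION sub_node1
    CONNECTION 'host=node1 dbname=postgres' PUBLICATION pub1
    WITH (retain_conflict_info = true);

-- Node2 subscriber opts for a shorter wait, within the GUC limit:
CREATE SUBSCRIPTION sub_node2
    CONNECTION 'host=node2 dbname=postgres' PUBLICATION pub2
    WITH (retain_conflict_info = true,
          max_conflict_retention_duration = '2min');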

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Mon, Feb 10, 2025 at 10:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Feb 7, 2025 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > > I'm not really sure that these behaviors are the expected behavior of
> > > users who set max_conflict_retention_duration to some subscriptions.
> > > Or I might have set the wrong expectation or assumption on this
> > > parameter. I'm fine with a subscription-level
> > > max_conflict_retention_duration if it's clear this option works as
> > > expected by users who want to use it.
> > >
> >
> > It seems you are not convinced to provide this parameter at the
> > subscription level and anyway providing it as GUC will simplify the
> > implementation and it would probably be easier for users to configure
> > rather than giving it at the subscription level for all subscriptions
> > that have set retain_conflict_info set to true. I guess in the future
> > we can provide it at the subscription level as well if there is a
> > clear use case for it. Does that make sense to you?
>
> Would it make sense to introduce a GUC parameter for this value, where
> subscribers can overwrite it for their specific subscriptions, but
> only up to the limit set by the GUC? This would allow flexibility in
> certain cases --subscribers could opt to wait for a shorter duration
> than the GUC value if needed.
>
> Although a concrete use case isn't immediately clear, consider a
> hypothetical scenario: Suppose a subscriber connected to Node1 must
> wait for long period to avoid an incorrect conflict decision. In such
> cases, it would rely on the default high value set by the GUC.
> However, since Node1 is generally responsive and rarely has
> long-running transactions, this long wait would only be necessary in
> rare cases. On the other hand, a subscriber pulling from Node2 may not
> require as stringent conflict detection. If Node2 frequently has
> long-running transactions, waiting too long could lead to excessive
> delays.
>
> The idea here is that the Node1 subscriber can wait for the full
> max_conflict_retention_duration set by the GUC when necessary, while
> the Node2 subscriber can choose a shorter wait time to avoid
> unnecessary delays caused by frequent long transactions.
>

I see that users can have some cases like this where it can be helpful
to provide the option to set max_conflict_retention_duration both as a
GUC and as a subscription parameter. However, I suggest we go a
bit slower in adding more options for this particular feature. In the
first version of this work, let's add a GUC and then let it bake for
some time, after which we can again discuss adding a subscription
option based on feedback from the field.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Mon, Feb 10, 2025 at 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 10, 2025 at 10:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Feb 7, 2025 at 11:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > > I'm not really sure that these behaviors are the expected behavior of
> > > > users who set max_conflict_retention_duration to some subscriptions.
> > > > Or I might have set the wrong expectation or assumption on this
> > > > parameter. I'm fine with a subscription-level
> > > > max_conflict_retention_duration if it's clear this option works as
> > > > expected by users who want to use it.
> > > >
> > >
> > > It seems you are not convinced to provide this parameter at the
> > > subscription level and anyway providing it as GUC will simplify the
> > > implementation and it would probably be easier for users to configure
> > > rather than giving it at the subscription level for all subscriptions
> > > that have set retain_conflict_info set to true. I guess in the future
> > > we can provide it at the subscription level as well if there is a
> > > clear use case for it. Does that make sense to you?
> >
> > Would it make sense to introduce a GUC parameter for this value, where
> > subscribers can overwrite it for their specific subscriptions, but
> > only up to the limit set by the GUC? This would allow flexibility in
> > certain cases --subscribers could opt to wait for a shorter duration
> > than the GUC value if needed.
> >
> > Although a concrete use case isn't immediately clear, consider a
> > hypothetical scenario: Suppose a subscriber connected to Node1 must
> > wait for long period to avoid an incorrect conflict decision. In such
> > cases, it would rely on the default high value set by the GUC.
> > However, since Node1 is generally responsive and rarely has
> > long-running transactions, this long wait would only be necessary in
> > rare cases. On the other hand, a subscriber pulling from Node2 may not
> > require as stringent conflict detection. If Node2 frequently has
> > long-running transactions, waiting too long could lead to excessive
> > delays.
> >
> > The idea here is that the Node1 subscriber can wait for the full
> > max_conflict_retention_duration set by the GUC when necessary, while
> > the Node2 subscriber can choose a shorter wait time to avoid
> > unnecessary delays caused by frequent long transactions.
> >
>
> I see that users can have some cases like this where it can be helpful
> to provide the option to set max_conflict_retention_duration both at
> GUC as well as a subscription parameter. However, I suggest let's go a
> bit slower in adding more options for this particular stuff. In the
> first version of this work, let's add a GUC and then let it bake for
> some time after which we can discuss again adding a subscription
> option based on some feedback from the field.

I am fine with that.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Friday, February 7, 2025 1:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Feb 7, 2025 at 2:18 AM Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
> >
> > I'd like to confirm what users would expect of this
> > max_conflict_retention_duration option and it works as expected. IIUC
> > users would want to use this option when they want to balance between
> > the reliable update_deleted conflict detection and the performance. I
> > think they want to detect updated_deleted reliably as much as possible
> > but, at the same time, would like to avoid a huge performance dip
> > caused by it. IOW, once the apply lag becomes larger than the limit,
> > they would expect to prioritize the performance (recovery) over the
> > reliable update_deleted conflict detection.
> >
> 
> Yes, this understanding is correct.
> 
> > With the subscription-level max_conflict_retention_duration, users can
> > set it to '5min' to a subscription, SUB1, while not setting it to
> > another subscription, SUB2, (assuming here that both subscriptions set
> > retain_conflict_info = true). This setting works fine if SUB2 could
> > easily catch up while SUB1 is delaying, because in this case, SUB1
> > would stop updating its xmin when delaying for 5 min or longer so the
> > slot's xmin can advance based only on SUB2's xmin. Which is good
> > because it ultimately allow vacuum to remove dead tuples and
> > contributes to better performance. On the other hand, in cases where
> > SUB2 is as delayed as or more than SUB1, even if SUB1 stopped updating
> > its xmin, the slot's xmin would not be able to advance. IIUC
> > pg_conflict_detection slot won't be invalidated as long as there is at
> > least one subscription that sets retain_conflict_info = true and
> > doesn't set max_conflict_retention_duration, even if other
> > subscriptions set max_conflict_retention_duration.
> >
> 
> Right.
> 
> > I'm not really sure that these behaviors are the expected behavior of
> > users who set max_conflict_retention_duration to some subscriptions.
> > Or I might have set the wrong expectation or assumption on this
> > parameter. I'm fine with a subscription-level
> > max_conflict_retention_duration if it's clear this option works as
> > expected by users who want to use it.
> >

Here is the v28 patch set, which converts the subscription option
max_conflict_retention_duration into a GUC. Other logic remains unchanged.

Best Regards,
Hou zj


Вложения

Re: Conflict detection for update_deleted in logical replication

От
vignesh C
Дата:
On Thu, 20 Feb 2025 at 12:50, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Here is the v28 patch set, which converts the subscription option
> max_conflict_retention_duration into a GUC. Other logic remains unchanged.

After discussing with Hou internally, I have moved this to the next
CommitFest since it will not be committed in the current release. This
also allows reviewers to focus on the remaining patches in the current
CommitFest.

Regards,
Vignesh



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Wed, Mar 12, 2025 at 7:36 PM vignesh C wrote:

> 
> On Thu, 20 Feb 2025 at 12:50, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> >
> > Here is the v28 patch set, which converts the subscription option
> > max_conflict_retention_duration into a GUC. Other logic remains
> unchanged.
> 
> After discussing with Hou internally, I have moved this to the next CommitFest
> since it will not be committed in the current release. This also allows reviewers
> to focus on the remaining patches in the current CommitFest.

Thanks!

Here's a rebased version of the patch series.

Best Regards,
Hou zj

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Hou,

Thanks for updating the patch! I can finally come back to the thread.

Regarding max_conflict_retention_duration, I prefer the GUC approach
because GUCs already have a mechanism for converting units; a
subscription option does not.

Below are my comments:

01. check_new_cluster_subscription_configuration

```
@@ -2024,6 +2025,7 @@ check_new_cluster_subscription_configuration(void)
        PGresult   *res;
        PGconn     *conn;
        int                     max_active_replication_origins;
+       int                     max_replication_slots;
```

I feel max_replication_slots is needed only when old_cluster.sub_retain_conflict_info is true.

02. check_old_cluster_for_valid_slots
```
+                       /*
+                        * The name "pg_conflict_detection" (defined as
+                        * CONFLICT_DETECTION_SLOT) has been reserved for logical
+                        * replication conflict detection since PG18.
+                        */
+                       if (GET_MAJOR_VERSION(new_cluster.major_version) >= 1800 &&
+                               strcmp(slot->slotname, "pg_conflict_detection") == 0)
```

IIUC, we can assume that the version of new_cluster is the same as
pg_upgrade's, so there is no need to check the major version here.

03.

Can we add a test for upgrading a subscriber node with retain_conflict_info in 004_subscription?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Wed, Mar 26, 2025 at 4:17 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here's a rebased version of the patch series.
>

Thanks, Hou-San, for the patches. I am going through this long thread
and the patches. One doubt I have: whenever there is a chance of a
conflict-slot update (either its xmin or the possibility of its
invalidation), the apply worker gives a wake-up call to the launcher
(ApplyLauncherWakeup). Shouldn't that suffice to wake up the launcher
irrespective of its nap time? Do we actually need to introduce
MIN/MAX_NAPTIME_PER_SLOT_UPDATE in the launcher and the logic around
it?

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Wed, Apr 16, 2025 at 10:30 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Wed, Mar 26, 2025 at 4:17 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Here's a rebased version of the patch series.
> >
>

A few comments for patch004:
Config.sgml:
1)
+       <para>
+        Maximum duration (in milliseconds) for which conflict
+        information can be retained for conflict detection by the apply worker.
+        The default value is <literal>0</literal>, indicating that conflict
+        information is retained until it is no longer needed for detection
+        purposes.
+       </para>

IIUC, the above is not entirely accurate. Suppose the subscriber
manages to catch up and sets oldest_nonremovable_xid to 100, which is
then updated in the slot. After this, the apply worker takes a nap and
begins a new xid update cycle. Now, let’s say the next candidate_xid
is 200, but this time the subscriber fails to keep up and exceeds
max_conflict_retention_duration. As a result, it sets
oldest_nonremovable_xid to InvalidTransactionId, and the launcher
skips updating the slot’s xmin. However, the previous xmin value (100)
is still there in the slot, causing its data to be retained beyond
max_conflict_retention_duration. The xid 200, which actually honors
max_conflict_retention_duration, was never marked for retention. If my
understanding is correct, the documentation doesn’t fully capture this
scenario.


2)
+        The replication slot
+        <quote><literal>pg_conflict_detection</literal></quote> that used to
+        retain conflict information will be invalidated if all apply workers
+        associated with the subscription, where

Subscription --> subscriptions

3)
The name stop_conflict_retention in MyLogicalRepWorker is confusing.
Shall it be stop_conflict_info_retention?

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Nisha Moond
Дата:
On Wed, Mar 26, 2025 at 4:17 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here's a rebased version of the patch series.
>

Thanks for the patches.

While testing the GUC "max_conflict_retention_duration", I noticed a
behavior that seems to bypass its intended purpose.

On Pub, if a txn is stuck in the COMMIT phase for a long time, the
apply_worker on the sub keeps looping in wait_for_publisher_status()
until that Pub's concurrent txn completes its commit.
Due to this, the apply worker can't advance its
oldest_nonremovable_xid and keeps waiting for the Pub's txn to finish.
In such a case, even if the wait time exceeds the configured
max_conflict_retention_duration, conflict retention doesn't stop for
the apply_worker. The conflict info retention is stopped only once
the Pub's txn is committed and the apply_worker moves to
wait_for_local_flush().

Doesn't this defeat the purpose of max_conflict_retention_duration?
The apply worker has exceeded the max wait time but still retains the
conflict info.
I think we should consider applying the same max time limit check
inside wait_for_publisher_status() as well.
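
For reference, the kind of check I mean is a simple elapsed-time test
that both wait loops could share; a standalone sketch with simplified
types and names (not the patch's code):

```
#include <stdbool.h>
#include <time.h>

/* GUC, in milliseconds; 0 means "retain until no longer needed" */
static int max_conflict_retention_duration = 900000;

typedef long long TimestampMs;

static TimestampMs
current_time_ms(void)
{
	return (TimestampMs) time(NULL) * 1000;
}

/*
 * Hypothetical predicate: true once the time spent trying to advance
 * the candidate xid exceeds the configured limit, so the caller can
 * stop retaining conflict info. Usable from either wait phase.
 */
static bool
retention_duration_exceeded(TimestampMs candidate_xid_time)
{
	if (max_conflict_retention_duration == 0)
		return false;

	return current_time_ms() - candidate_xid_time >
		max_conflict_retention_duration;
}
```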

--
Thanks,
Nisha



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Thu, Apr 17, 2025 at 12:19 PM shveta malik wrote:

> 
> On Wed, Apr 16, 2025 at 10:30 AM shveta malik <shveta.malik@gmail.com>
> wrote:
> >
> > On Wed, Mar 26, 2025 at 4:17 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Here's a rebased version of the patch series.
> > >
> >

> 
> Thanks Hou-San for the patches. I am going through this long thread and
> patches. One doubt I have is whenever there is a chance of conflict-slot update
> (either xmin or possibility of its invalidation), apply-worker gives a wake-up call
> to the launcher (ApplyLauncherWakeup). Shouldn't that suffice to wake-up
> launcher irrespective of its nap-time? Do we actually need to introduce
> MIN/MAX_NAPTIME_PER_SLOT_UPDATE in the launcher and the logic
> around it?

Thanks for reviewing. After rethinking, I agree that the wakeup is
sufficient, so I removed the nap-time logic in this version.

> Few comments for patch004:
> Config.sgml:
> 1)
> +       <para>
> +        Maximum duration (in milliseconds) for which conflict
> +        information can be retained for conflict detection by the apply worker.
> +        The default value is <literal>0</literal>, indicating that conflict
> +        information is retained until it is no longer needed for detection
> +        purposes.
> +       </para>
> 
> IIUC, the above is not entirely accurate. Suppose the subscriber manages to
> catch up and sets oldest_nonremovable_xid to 100, which is then updated in
> slot. After this, the apply worker takes a nap and begins a new xid update cycle.
> Now, let’s say the next candidate_xid is 200, but this time the subscriber fails
> to keep up and exceeds max_conflict_retention_duration. As a result, it sets
> oldest_nonremovable_xid to InvalidTransactionId, and the launcher skips
> updating the slot’s xmin. 

If the time exceeds max_conflict_retention_duration, the launcher
would invalidate the slot instead of skipping the update. So the
conflict info (e.g., dead tuples) would not be retained anymore.

> However, the previous xmin value (100) is still there
> in the slot, causing its data to be retained beyond the
> max_conflict_retention_duration. The xid 200 which actually honors
> max_conflict_retention_duration was never marked for retention. If my
> understanding is correct, then the documentation doesn’t fully capture this
> scenario.

As mentioned above, the strategy here is to invalidate the slot.

> 
> 2)
> +        The replication slot
> +        <quote><literal>pg_conflict_detection</literal></quote> that
> used to
> +        retain conflict information will be invalidated if all apply workers
> +        associated with the subscription, where
> 
> Subscription --> subscriptions
> 
> 3)
> Name stop_conflict_retention in MyLogicalRepWorker is confusing. Shall it be
> stop_conflict_info_retention?

Changed.

Here is the V30 patch set, which includes the following changes:

* Addressed above comments.
* Added the retention timeout check in wait_for_local_flush(), as suggested by Nisha[1].
* Improved the upgrade code and added a test for upgrading with the retain_conflict_info option,
  as suggested by Kuroda-san[2].

[1] https://www.postgresql.org/message-id/CABdArM4Ft8q3dZv4Bqw%3DrbS5_LFMXDJMRr3vC8a_KMCX1qatpg%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/OSCPR01MB14966269726272F2F2B2BD3B0F5B22%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Thu, Apr 24, 2025 at 6:11 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

> > Few comments for patch004:
> > Config.sgml:
> > 1)
> > +       <para>
> > +        Maximum duration (in milliseconds) for which conflict
> > +        information can be retained for conflict detection by the apply worker.
> > +        The default value is <literal>0</literal>, indicating that conflict
> > +        information is retained until it is no longer needed for detection
> > +        purposes.
> > +       </para>
> >
> > IIUC, the above is not entirely accurate. Suppose the subscriber manages to
> > catch up and sets oldest_nonremovable_xid to 100, which is then updated in
> > slot. After this, the apply worker takes a nap and begins a new xid update cycle.
> > Now, let’s say the next candidate_xid is 200, but this time the subscriber fails
> > to keep up and exceeds max_conflict_retention_duration. As a result, it sets
> > oldest_nonremovable_xid to InvalidTransactionId, and the launcher skips
> > updating the slot’s xmin.
>
> If the time exceeds the max_conflict_retention_duration, the launcher would
> Invalidate the slot, instead of skipping updating it. So the conflict info(e.g.,
> dead tuples) would not be retained anymore.
>

The launcher will not invalidate the slot until all subscriptions have
stopped conflict_info retention. So the info of dead tuples for a
particular oldest_xmin of a particular apply worker could be retained
for much longer than the configured duration. If other apply workers
are actively working (catching up with the primary), then they should
keep advancing the xmin of the shared slot; but if the xmin of the
shared slot remains the same for, say, 15min+15min+15min for 3 apply
workers (assuming they mark themselves with stop_conflict_retention
one after the other and the slot's xmin has not been advanced), then
the first apply worker having marked itself with
stop_conflict_retention still has access to the oldest_xmin's data for
45 mins instead of 15 mins (where max_conflict_retention_duration = 15
mins). Please let me know if my understanding is wrong.

> > However, the previous xmin value (100) is still there
> > in the slot, causing its data to be retained beyond the
> > max_conflict_retention_duration. The xid 200 which actually honors
> > max_conflict_retention_duration was never marked for retention. If my
> > understanding is correct, then the documentation doesn’t fully capture this
> > scenario.
>
> As mentioned above, the strategy here is to invalidate the slot.

Please consider the case with multiple subscribers. Sorry if I failed
to mention in my previous email that it was a multi-sub case.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
>
> Here is V30 patch set includes the following changes:
>

Thank you for the patch. Please find a few concerns:

1)
By looking at the code of ApplyLauncherMain(), it appears that we stop
advancing the shared slot's xmin if any of the subscriptions with
retain_conflict_info is disabled. If a subscription is not being used
and is disabled forever (or for quite long), that means the xmin will
never be advanced and we will keep accumulating dead rows even if
other subscribers with retain_conflict_info are actively setting their
oldest_xmin. This could be problematic. Here too, there should be some
way to stop conflict retention for such a subscription, as we do for
the 'subscriber not able to catch up' case.
But I understand it can be complex to implement, as we do not know for
how long a subscription is disabled. If we do not find a simpler way
to implement it, then at least we can document such cases and the
risks associated with a disabled subscription that has
'retain_conflict_info' enabled. Thoughts?

2)
In wait_for_local_flush(), we have
should_stop_conflict_info_retention() before the 'AllTablesyncsReady'
check. Should we discount table-sync time and avoid stopping conflict
retention while table sync is going on? This is because table sync is
a one-time operation (or done only on subscription refresh), so we
should not count time spent in table sync towards
'max_conflict_retention_duration'. We can reset our timer if table
sync is observed to be in progress. Thoughts?

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Fri, Apr 25, 2025 at 4:05 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> >
> > Here is V30 patch set includes the following changes:
> >
>
> Thank You for the patch, please find few concerns:
>

Please find a few more concerns:

3)
In get_candidate_xid(), we first set candidate_xid_time and later
candidate_xid. Between these two, there is a chance that we return
without updating candidate_xid; see the 'Return if the
oldest_nonremovable_xid cannot be advanced' comment. That will leave
'candidate_xid_time' set to a new value while 'candidate_xid' is not
yet set.

4)
Do you think there should be some relation between
'xid_advance_interval' and 'max_conflict_retention_duration'? Should
the max of 'xid_advance_interval' be limited by
'max_conflict_retention_duration'? Currently, 'xid_advance_interval'
can go up to 3 mins; what if 'max_conflict_retention_duration' is set
to 2 mins? In that case we will not even check for new xids before the
3 mins are over, while 'max_conflict_retention_duration' sets a limit
of 2 mins for dead tuple retention.
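
One way to express that relation, as a standalone sketch (the interval
bounds and the helper name are illustrative, not from the patch):

```
#include <stdbool.h>

#define MIN_XID_ADVANCE_INTERVAL_MS	1000	/* illustrative */
#define MAX_XID_ADVANCE_INTERVAL_MS	180000	/* 3 min, illustrative */

/*
 * Hypothetical helper: back off when no new xid was found, but never
 * sleep past the configured retention limit (0 means no limit).
 */
static int
compute_xid_advance_interval(bool new_xid_found,
							 int max_conflict_retention_duration)
{
	int		interval = new_xid_found ? MIN_XID_ADVANCE_INTERVAL_MS
									 : MAX_XID_ADVANCE_INTERVAL_MS;

	if (max_conflict_retention_duration > 0 &&
		interval > max_conflict_retention_duration)
		interval = max_conflict_retention_duration;

	return interval;
}
```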


thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Apr 25, 2025 at 4:06 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> >
> > Here is V30 patch set includes the following changes:
> >
>
> Thank You for the patch, please find few concerns:
>
> 1)
> By looking at code of ApplyLauncherMain(), it appears that we stopped
> advancing shared-slot's xmin if any of the subscriptions with
> retain_conflict_info is disabled. If a subscription is not being used
> and is disabled forever (or for quite long), that means xmin will
> never be advanced and we will keep accumulating dead-rows even if
> other subscribers with retain_conflict_info are actively setting their
> oldest_xmin. This could be problematic. Here too, there should be some
> way to set stop-conflict-rettention for such a subscription like we do
> for 'subscriber not able to catch-up case'.
> But I understand it can be complex to implement as we do not know for
> how long a subscription is disabled. If we do not find a simpler way
> to implement it, then at least we can document  such cases and the
> risks associated with disabled subscription which has
> 'retain_conflict_info' enabled. Thoughts?
>

Yeah, I agree that this is the case to be worried about, but OTOH, in
such cases, even the corresponding logical slot on the primary won't
be advanced, leading to bloat on the primary as well. So, we can
probably give a WARNING when the user tries to disable a subscription
or create a disabled subscription with the retention flag true. We may
need to write a LOG or raise a WARNING when, while exiting, the apply
worker disables the subscription due to the disable_on_error option.
In addition to that, we can even document this case. Will that address
your concern?

A few minor comments on 0001:
1.
+ if (TimestampDifferenceExceeds(data->reply_time,
+    data->candidate_xid_time, 0))
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("oldest_nonremovable_xid transaction ID may be advanced prematurely"),

Shouldn't this be an elog as this is an internal message? And, instead
of "... may be ..", shall we use ".. could be .." in the error
message, as the oldest_nonremovable_xid is not yet advanced by this
time.

2.
+ * It's necessary to use FullTransactionId here to mitigate potential race
+ * conditions. Such scenarios might occur if the replication slot is not
+ * yet created by the launcher while the apply worker has already
+ * initialized this field.

IIRC, we discussed why it isn't easy to close this race condition. Can
we capture that in comments separately?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Apr 25, 2025 at 4:06 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> 2)
> in wait_for_local_flush(), we have
> should_stop_conflict_info_retention() before 'AllTablesyncsReady'
> check. Should we give a discount for table-sync time and avoid doing
> stop-conflict-retention when table-sync is going on? This is because
> table-sync is one time operation (or done only on
> subscription-refresh), so we shall not count time spent in table-sync
> for 'max_conflict_retention_duration'. We can reset our timer if
> table-sync is observed to be going on. Thoughts?
>

Sounds reasonable to me.

>
> 3)
> In get_candidate_xid(), we first set candidate_xid_time and later
> candidate_xid. And between these 2 there are chances that we return
> without updating candidate_xid. See 'Return if the
> oldest_nonremovable_xid cannot be advanced ' comment. That will leave
> 'candidate_xid_time' set to new value while  'candidate_xid' is not
> yet set.
>

Good point. I think we should set 'candidate_xid_time' along with
candidate_xid (just after setting candidate_xid).

> 4)
> Do you think there should be some relation between
> 'xid_advance_interval' and 'max_conflict_retention_duration'? Should
> max of  'xid_advance_interval' be limited by
> 'max_conflict_retention_duration'. Currently  say
> xid_advance_interval' is set to max 3 mins, what if
> 'max_conflict_retention_duration' is set to 2 mins? In that case we
> will not even check for new xids before 3 mins are over, while
> 'max_conflict_retention_duration' sets a  limit of 2 mins for dead
> tuples retention.
>

Right, ideally, the 'xid_advance_interval' should be set to a value
less than 'max_conflict_retention_duration' when no new_xid is found.

BTW, another related point is that when we decide to stop retaining
dead tuples (via should_stop_conflict_info_retention), should we also
consider the case that the apply worker didn't even try to get the
publisher status because previously it decided that
oldest_nonremovable_xid cannot be advanced due to its
OldestActiveTransactionId?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Thu, May 15, 2025 at 6:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Apr 25, 2025 at 4:06 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > >
> > > Here is V30 patch set includes the following changes:
> > >
> >
> > Thank You for the patch, please find few concerns:
> >
> > 1)
> > By looking at code of ApplyLauncherMain(), it appears that we stopped
> > advancing shared-slot's xmin if any of the subscriptions with
> > retain_conflict_info is disabled. If a subscription is not being used
> > and is disabled forever (or for quite long), that means xmin will
> > never be advanced and we will keep accumulating dead-rows even if
> > other subscribers with retain_conflict_info are actively setting their
> > oldest_xmin. This could be problematic. Here too, there should be some
> > way to set stop-conflict-rettention for such a subscription like we do
> > for 'subscriber not able to catch-up case'.
> > But I understand it can be complex to implement as we do not know for
> > how long a subscription is disabled. If we do not find a simpler way
> > to implement it, then at least we can document  such cases and the
> > risks associated with disabled subscription which has
> > 'retain_conflict_info' enabled. Thoughts?
> >
>
> Yeah, I agree that this is the case to be worried about but OTOH, in
> such cases, even the corresponding logical_slot on the primary won't
> be advanced and lead to bloat on the primary as well. So, we can
> probably give WARNING when user tries to disable a subscription or
> create a disabled subscription with retention flag true. We may need
> to  write a LOG or raise a WARNING when while exiting apply worker
> disables the subscription due disable_on_error option. In addition to
> that, we can even document this case. Will that address your concern?
>

Yes, it should solve the purpose. Thanks.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Fri, May 16, 2025 at 11:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Apr 25, 2025 at 4:06 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > 2)
> > in wait_for_local_flush(), we have
> > should_stop_conflict_info_retention() before 'AllTablesyncsReady'
> > check. Should we give a discount for table-sync time and avoid doing
> > stop-conflict-retention when table-sync is going on? This is because
> > table-sync is one time operation (or done only on
> > subscription-refresh), so we shall not count time spent in table-sync
> > for 'max_conflict_retention_duration'. We can reset our timer if
> > table-sync is observed to be going on. Thoughts?
> >
>
> Sounds reasonable to me.
>
> >
> > 3)
> > In get_candidate_xid(), we first set candidate_xid_time and later
> > candidate_xid. And between these 2 there are chances that we return
> > without updating candidate_xid. See 'Return if the
> > oldest_nonremovable_xid cannot be advanced ' comment. That will leave
> > 'candidate_xid_time' set to new value while  'candidate_xid' is not
> > yet set.
> >
>
> Good point. I think we should set 'candidate_xid_time' along with
> candidate_xid (just after setting candidate_xid).
>
> > 4)
> > Do you think there should be some relation between
> > 'xid_advance_interval' and 'max_conflict_retention_duration'? Should
> > max of  'xid_advance_interval' be limited by
> > 'max_conflict_retention_duration'. Currently  say
> > xid_advance_interval' is set to max 3 mins, what if
> > 'max_conflict_retention_duration' is set to 2 mins? In that case we
> > will not even check for new xids before 3 mins are over, while
> > 'max_conflict_retention_duration' sets a  limit of 2 mins for dead
> > tuples retention.
> >
>
> Right, ideally, the 'xid_advance_interval' should be set to a value
> less than 'max_conflict_retention_duration' when no new_xid is found.
>
> BTW, another related point is that when we decide to stop retaining
> dead tuples (via should_stop_conflict_info_retention), should we also
> consider the case that the apply worker didn't even try to get the
> publisher status because previously it decided that
> oldest_nonremovable_xid cannot be advanced due to its
> OldestActiveTransactionId?
>

Do you mean we should avoid stopping conflict retention in such a
case, since the apply worker itself did not request status from the
publisher? If I understood your point correctly, we can do that by
advancing the timer to a new value even if we did not update
candidate_xid and did not ask for the status from the publisher. I
think it is already happening in get_candidate_xid(): it updates the
timer but not the xid (my concern #3 can be ignored then).

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Apr 25, 2025 at 10:08 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Thu, Apr 24, 2025 at 6:11 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
>
> > > Few comments for patch004:
> > > Config.sgml:
> > > 1)
> > > +       <para>
> > > +        Maximum duration (in milliseconds) for which conflict
> > > +        information can be retained for conflict detection by the apply worker.
> > > +        The default value is <literal>0</literal>, indicating that conflict
> > > +        information is retained until it is no longer needed for detection
> > > +        purposes.
> > > +       </para>
> > >
> > > IIUC, the above is not entirely accurate. Suppose the subscriber manages to
> > > catch up and sets oldest_nonremovable_xid to 100, which is then updated in
> > > slot. After this, the apply worker takes a nap and begins a new xid update cycle.
> > > Now, let’s say the next candidate_xid is 200, but this time the subscriber fails
> > > to keep up and exceeds max_conflict_retention_duration. As a result, it sets
> > > oldest_nonremovable_xid to InvalidTransactionId, and the launcher skips
> > > updating the slot’s xmin.
> >
> > If the time exceeds the max_conflict_retention_duration, the launcher would
> > Invalidate the slot, instead of skipping updating it. So the conflict info(e.g.,
> > dead tuples) would not be retained anymore.
> >
>
> launcher will not invalidate the slot until all subscriptions have
> stopped conflict_info retention. So info of dead tuples for a
> particular oldest_xmin of a particular apply worker could be retained
> for much longer than this configured duration. If other apply workers
> are actively working (catching up with primary), then they should keep
> on advancing xmin of shared slot but if xmin of shared slot remains
> same for say 15min+15min+15min for 3 apply-workers (assuming they are
> marking themselves with stop_conflict_retention one after other and
> xmin of slot has not been advanced), then the first apply worker
> having marked itself with stop_conflict_retention still has access to
> the oldest_xmin's data for 45 mins instead of 15 mins. (where
> max_conflict_retention_duration=15 mins). Please let me know if my
> understanding is wrong.
>

IIUC, the current code will stop updating the slot even if one of the
apply workers has set stop_conflict_info_retention. The other apply
workers will keep on maintaining their oldest_nonremovable_xid without
advancing the slot. If this is correct, then what behavior do we
expect here instead? Do we want the slot to keep advancing as long as
any worker is actively maintaining oldest_nonremovable_xid? To some
extent, this matches the cases where the user has set
retain_conflict_info for some subscriptions but not for others.

If so, how will users eventually know for which tables they can expect
to reliably detect update_deleted? One possibility is that users can
check which apply workers have stopped maintaining
oldest_nonremovable_xid via the pg_stat_subscription view and then see
the tables corresponding to those subscriptions. Also, what will we do
as part of the resolutions in the apply workers where
stop_conflict_info_retention is set? Shall we simply LOG that we can't
resolve this conflict and continue until the user takes some action,
or simply error out in such cases?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, May 16, 2025 at 12:01 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, May 16, 2025 at 11:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > BTW, another related point is that when we decide to stop retaining
> > dead tuples (via should_stop_conflict_info_retention), should we also
> > consider the case that the apply worker didn't even try to get the
> > publisher status because previously it decided that
> > oldest_nonremovable_xid cannot be advanced due to its
> > OldestActiveTransactionId?
> >
>
> Do you mean avoid  stop-conflict-retention in such a case as apply
> worker itself did not request status from the publisher? If I
> understood your point correctly, then we can do that by advancing the
> timer to a new value even if we did not update candidate-xid and did
> not ask the status from the publisher.
>

But candidate_xid_time is also used in wait_for_local_flush() to check
clock_skew between publisher and subscriber, so for that purpose, it
is better to set it along with candidate_xid. However, can't we rely
on the valid value of candidate_xid to ensure that apply worker didn't
send any request? Note that we always reset candidate_xid once we have
updated oldest_nonremovable_xid.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Fri, May 16, 2025 at 12:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Apr 25, 2025 at 10:08 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Thu, Apr 24, 2025 at 6:11 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> >
> > > > Few comments for patch004:
> > > > Config.sgml:
> > > > 1)
> > > > +       <para>
> > > > +        Maximum duration (in milliseconds) for which conflict
> > > > +        information can be retained for conflict detection by the apply worker.
> > > > +        The default value is <literal>0</literal>, indicating that conflict
> > > > +        information is retained until it is no longer needed for detection
> > > > +        purposes.
> > > > +       </para>
> > > >
> > > > IIUC, the above is not entirely accurate. Suppose the subscriber manages to
> > > > catch up and sets oldest_nonremovable_xid to 100, which is then updated in
> > > > slot. After this, the apply worker takes a nap and begins a new xid update cycle.
> > > > Now, let’s say the next candidate_xid is 200, but this time the subscriber fails
> > > > to keep up and exceeds max_conflict_retention_duration. As a result, it sets
> > > > oldest_nonremovable_xid to InvalidTransactionId, and the launcher skips
> > > > updating the slot’s xmin.
> > >
> > > If the time exceeds the max_conflict_retention_duration, the launcher would
> > > Invalidate the slot, instead of skipping updating it. So the conflict info(e.g.,
> > > dead tuples) would not be retained anymore.
> > >
> >
> > launcher will not invalidate the slot until all subscriptions have
> > stopped conflict_info retention. So info of dead tuples for a
> > particular oldest_xmin of a particular apply worker could be retained
> > for much longer than this configured duration. If other apply workers
> > are actively working (catching up with primary), then they should keep
> > on advancing xmin of shared slot but if xmin of shared slot remains
> > same for say 15min+15min+15min for 3 apply-workers (assuming they are
> > marking themselves with stop_conflict_retention one after other and
> > xmin of slot has not been advanced), then the first apply worker
> > having marked itself with stop_conflict_retention still has access to
> > the oldest_xmin's data for 45 mins instead of 15 mins. (where
> > max_conflict_retention_duration=15 mins). Please let me know if my
> > understanding is wrong.
> >
>
> IIUC, the current code will stop updating the slot even if one of the
> apply workers has set stop_conflict_info_retention. The other apply
> workers will keep on maintaining their oldest_nonremovable_xid without
> advancing the slot. If this is correct, then what behavior instead we
> expect here?

I think this is not the current behaviour.

> Do we want the slot to keep advancing till any worker is
> actively maintaining oldest_nonremovable_xid?

In fact, this one is the current behaviour of the v30 patch.

> To some extent, this
> matches with the cases where the user has set retain_conflict_info for
> some subscriptions but not for others.
>
> If so, how will users eventually know for which tables they can expect
> to reliably detect update_delete? One possibility is that users can
> check which apply workers have stopped maintaining
> oldest_nonremovable_xid via pg_stat_subscription view and then see the
> tables corresponding to those subscriptions.

Yes, it is a possibility, but I feel it will be too much to monitor
from the user's perspective.

> Also, what will we do as
> part of the resolutions in the applyworkers where
> stop_conflict_info_retention is set? Shall we simply LOG that we can't
> resolve this conflict and continue till the user takes some action, or
> simply error out in such cases?

We can LOG. Erroring out would again prevent the subscriber from
proceeding, and the subscriber initially reached this state due to
falling behind, which led to stop_conflict_retention=true. Still, if
we go with erroring out, I am not very sure what action users can take
in this situation. The subscriber is still lagging, and if the user
recreates the slot as a solution, the apply worker will soon go to the
'stop_conflict_retention=true' state again, provided the subscriber is
still not able to catch up.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, May 16, 2025 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, May 16, 2025 at 12:01 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Fri, May 16, 2025 at 11:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > BTW, another related point is that when we decide to stop retaining
> > > dead tuples (via should_stop_conflict_info_retention), should we also
> > > consider the case that the apply worker didn't even try to get the
> > > publisher status because previously it decided that
> > > oldest_nonremovable_xid cannot be advanced due to its
> > > OldestActiveTransactionId?
> > >
> >
> > Do you mean avoid  stop-conflict-retention in such a case as apply
> > worker itself did not request status from the publisher? If I
> > understood your point correctly, then we can do that by advancing the
> > timer to a new value even if we did not update candidate-xid and did
> > not ask the status from the publisher.
> >
>
> But candidate_xid_time is also used in wait_for_local_flush() to check
> clock_skew between publisher and subscriber, so for that purpose, it
> is better to set it along with candidate_xid. However, can't we rely
> on the valid value of candidate_xid to ensure that apply worker didn't
> send any request? Note that we always reset candidate_xid once we have
> updated oldest_nonremovable_xid.
>

I think this is automatically taken care of because we call
should_stop_conflict_info_retention() only during the 'wait' phases,
which happen after candidate_xid is set. Having said that, we should
have an assert for candidate_xid in
should_stop_conflict_info_retention() and also note in the comments
that it should be called only during the 'wait' phases. Additionally,
we can have an assert that should_stop_conflict_info_retention() is
indeed called only during a 'wait' phase.
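
Concretely, something like this at the top of
should_stop_conflict_info_retention() (a rough sketch only; the
enumerator name for the publisher-status wait phase is my assumption):

```
	/* Illustrative sketch; RCI_WAIT_FOR_PUBLISHER_STATUS is an assumed name */
	Assert(data->phase == RCI_WAIT_FOR_PUBLISHER_STATUS ||
		   data->phase == RCI_WAIT_FOR_LOCAL_FLUSH);
	Assert(FullTransactionIdIsValid(data->candidate_xid));
```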

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Thu, May 15, 2025 at 6:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few minor comments on 0001:
> 1.
> + if (TimestampDifferenceExceeds(data->reply_time,
> +    data->candidate_xid_time, 0))
> + ereport(ERROR,
> + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("oldest_nonremovable_xid transaction ID may be advanced prematurely"),
>
> Shouldn't this be elog as this is an internal message? And, instead of
> "... may be ..", shall we use ".. could be .." in the error message as
> the oldest_nonremovable_xid is not yet advanced by this time.
>
> 2.
> + * It's necessary to use FullTransactionId here to mitigate potential race
> + * conditions. Such scenarios might occur if the replication slot is not
> + * yet created by the launcher while the apply worker has already
> + * initialized this field.
>
> IIRC, we discussed why it isn't easy to close this race condition. Can
> we capture that in comments separately?
>

A few more comments:
=================
3.
maybe_advance_nonremovable_xid(RetainConflictInfoData *data,
   bool status_received)
{
/* Exit early if retaining conflict information is not required */
if (!MySubscription->retainconflictinfo)
return;

/*
* It is sufficient to manage non-removable transaction ID for a
* subscription by the main apply worker to detect update_deleted conflict
* even for table sync or parallel apply workers.
*/
if (!am_leader_apply_worker())
return;

/* Exit early if we have already stopped retaining */
if (MyLogicalRepWorker->stop_conflict_info_retention)
return;
...

get_candidate_xid()
{
...
if (!TimestampDifferenceExceeds(data->candidate_xid_time, now,
data->xid_advance_interval))
return;

Would it be better to encapsulate all these preliminary checks, which
decide whether we can move on to computing oldest_nonremovable_xid,
into a separate function? The check in get_candidate_xid would require
some additional conditions because it is not required in every phase.
Additionally, we can move the core phase-processing logic into a
separate function. We can try this to see if the code looks better
with such a refactoring.
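
For instance, the early returns quoted above could fold into a single
predicate; a rough sketch (the helper name is just a suggestion, not
the patch's code):

```
/* Sketch of the suggested refactoring; the function name is illustrative */
static bool
can_advance_nonremovable_xid(void)
{
	/* retaining conflict information is not requested */
	if (!MySubscription->retainconflictinfo)
		return false;

	/* only the leader apply worker maintains the non-removable xid */
	if (!am_leader_apply_worker())
		return false;

	/* this worker has already stopped retaining conflict information */
	if (MyLogicalRepWorker->stop_conflict_info_retention)
		return false;

	return true;
}
```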

4.
+ /*
+ * Check if all remote concurrent transactions that were active at the
+ * first status request have now completed. If completed, proceed to the
+ * next phase; otherwise, continue checking the publisher status until
+ * these transactions finish.
+ */
+ if (FullTransactionIdPrecedesOrEquals(data->last_phase_at,
+   remote_full_xid))
+ data->phase = RCI_WAIT_FOR_LOCAL_FLUSH;

I think there is a possibility of optimization here for cases where
there are no new transactions on the publisher side across the next
cycle of advancement of oldest_nonremovable_xid. We can simply set
candidate_xid as oldest_nonremovable_xid instead of going into the
RCI_WAIT_FOR_LOCAL_FLUSH phase. If you want to keep the code simple
for the first version, then at least note that down in the comments,
but OTOH, if it is simple to achieve, let's do it now.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
Please find a few more comments:

1)
ProcessStandbyPSRequestMessage:
+ /*
+ * This shouldn't happen because we don't support getting primary status
+ * message from standby.
+ */
+ if (RecoveryInProgress())
+ elog(ERROR, "the primary status is unavailable during recovery");


a) This error is not clear. Is it supposed to be a user-oriented error
or an internal error? As a user, it is difficult to interpret this
error and take some action.

b) What I understood is that there is no use case for enabling
'retain_conflict_info' on a subscription that subscribes to a
publisher which is also a physical standby. So shall we emit such an
ERROR during 'Create Sub (retain_conflict_info = on)' itself? But that
would need checking whether the publisher is a physical standby or not
during CREATE-SUB. Is that a possibility?

2)

----------
  send_feedback(last_received, requestReply, requestReply);

+ maybe_advance_nonremovable_xid(&data, false);
+
  /*
  * Force reporting to ensure long idle periods don't lead to
  * arbitrarily delayed stats. Stats can only be reported outside
----------

Why do we need this call to 'maybe_advance_nonremovable_xid' towards
the end of LogicalRepApplyLoop(), i.e. the last call? Can it make any
further 'data.phase' change here? IIUC, there are two triggers for a
'data.phase' change through LogicalRepApplyLoop(). The first is the
very first time we start this loop: it sets data.phase to 0
(RCI_GET_CANDIDATE_XID) and calls maybe_advance_nonremovable_xid().
After that, LogicalRepApplyLoop() can trigger a 'data.phase' change
only when it receives a response from the publisher. Shouldn't the
first 4 calls to maybe_advance_nonremovable_xid() from
LogicalRepApplyLoop() suffice?

3)
The code is almost the same for GetOldestActiveTransactionId() and
GetOldestTransactionIdInCommit(). I think we can merge these two:
GetOldestActiveTransactionId() can take a new argument
"getTxnInCommit", and GetOldestTransactionIdInCommit() can call
GetOldestActiveTransactionId() with that argument as true, whereas the
other two callers can pass it as false.
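
Something like the following shape is what I mean (signatures are
illustrative, not the actual patch code):

```
/* Illustrative sketch of the suggested merge */
TransactionId
GetOldestTransactionIdInCommit(void)
{
	return GetOldestActiveTransactionId(true /* getTxnInCommit */);
}
```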

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Tue, May 20, 2025 at 8:38 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> Please find few more comments:
>
> 1)
> ProcessStandbyPSRequestMessage:
> + /*
> + * This shouldn't happen because we don't support getting primary status
> + * message from standby.
> + */
> + if (RecoveryInProgress())
> + elog(ERROR, "the primary status is unavailable during recovery");
>
>
> a) This error is not clear. Is it supposed to be user oriented error
> or internal error? As a user, it is difficult to interpret this error
> and take some action.
>

This is an internal error for developers to understand that they have
sent the wrong message. Do you have any suggestions to change it?

> b) What I understood is that there is no user of enabling
> 'retain_conflict_info' for a subscription which is subscribing to a
> publisher which is physical standby too. So shall we emit such an
> ERROR during 'Create Sub(retain_conflict_info=on)' itself? But that
> would need checking whether the publisher is physical standby or not
> during CREATE-SUB. Is that a possibility?
>

The 0003 patch already took care of this, see check_conflict_info_retaintion.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Tue, May 20, 2025 at 10:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 20, 2025 at 8:38 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > Please find few more comments:
> >
> > 1)
> > ProcessStandbyPSRequestMessage:
> > + /*
> > + * This shouldn't happen because we don't support getting primary status
> > + * message from standby.
> > + */
> > + if (RecoveryInProgress())
> > + elog(ERROR, "the primary status is unavailable during recovery");
> >
> >
> > a) This error is not clear. Is it supposed to be user oriented error
> > or internal error? As a user, it is difficult to interpret this error
> > and take some action.
> >
>
> This is an internal error for developers to understand that they have
> sent the wrong message. Do you have any suggestions to change it?

The current message is fine if point b) is already taken care of.

>
> > b) What I understood is that there is no user of enabling
> > 'retain_conflict_info' for a subscription which is subscribing to a
> > publisher which is physical standby too. So shall we emit such an
> > ERROR during 'Create Sub(retain_conflict_info=on)' itself? But that
> > would need checking whether the publisher is physical standby or not
> > during CREATE-SUB. Is that a possibility?
> >
>
> The 0003 patch already took care of this, see check_conflict_info_retaintion.
>

Okay, thanks. Missed it somehow during review.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Tue, May 20, 2025 at 8:38 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> Please find few more comments:
>
> 1)
> ProcessStandbyPSRequestMessage:
> + /*
> + * This shouldn't happen because we don't support getting primary status
> + * message from standby.
> + */
> + if (RecoveryInProgress())
> + elog(ERROR, "the primary status is unavailable during recovery");
>
>
> a) This error is not clear. Is it supposed to be user oriented error
> or internal error? As a user, it is difficult to interpret this error
> and take some action.
>
> b) What I understood is that there is no user of enabling
> 'retain_conflict_info' for a subscription which is subscribing to a
> publisher which is physical standby too. So shall we emit such an
> ERROR during 'Create Sub(retain_conflict_info=on)' itself? But that
> would need checking whether the publisher is physical standby or not
> during CREATE-SUB. Is that a possibility?
>
> 2)
>
> ----------
>   send_feedback(last_received, requestReply, requestReply);
>
> + maybe_advance_nonremovable_xid(&data, false);
> +
>   /*
>   * Force reporting to ensure long idle periods don't lead to
>   * arbitrarily delayed stats. Stats can only be reported outside
> ----------
>
> Why do we need this call to 'maybe_advance_nonremovable_xid' towards
> end of LogicalRepApplyLoop() i.e. the last call? Can it make any
> further 'data.phase' change here? IIUC, there are 2 triggers for
> 'data.phase' change through LogicalRepApplyLoop(). First one is for
> the very first time when we start this loop, it will set data.phase to
> 0  (RCI_GET_CANDIDATE_XID) and will call
> maybe_advance_nonremovable_xid(). After that, LogicalRepApplyLoop()
> function can trigger a 'data.phase' change only when it receives a
> response from the publisher. Shouldn't the first 4 calls
>  to maybe_advance_nonremovable_xid() from LogicalRepApplyLoop() suffice?
>
> 3)
> Code is almost the same for GetOldestActiveTransactionId() and
> GetOldestTransactionIdInCommit(). I think we can merge these two.
> GetOldestActiveTransactionId() can take new arg "getTxnInCommit".
> GetOldestTransactionIdInCommit() can call
> GetOldestActiveTransactionId() with that arg as true, whereas other 2
> callers can pass it as false.
>


A few more comments, mostly for patch001:

4)
For this feature, since we are only interested in remote UPDATEs
happening concurrently, shall we ask the primary to provide the oldest
"UPDATE" transaction ID in the commit phase rather than any
operation's transaction ID? This may avoid unnecessary waiting and
pinging at the subscriber's end in order to advance
oldest_nonremovable_xid. Thoughts?

5)
+
+/*
+ * GetOldestTransactionIdInCommit()
+ *
+ * Similar to GetOldestActiveTransactionId but returns the oldest
transaction ID
+ * that is currently in the commit phase.
+ */
+TransactionId
+GetOldestTransactionIdInCommit(void)

If there is no transaction currently in the 'commit' phase, this
function will return the next transaction ID. Please mention this in
the comments. I think the 'protocol-replication.html' doc should also
be updated for the same.

6)

+ data->flushpos_update_time = 0;

Do we really need to reset this 'flushpos_update_time' at the end of
wait_for_local_flush()? Even in the next cycle (when the phase
restarts from RCI_GET_CANDIDATE_XID), we can reuse this time to decide
whether we should call get_flush_position() again or skip it, when in
wait_for_local_flush().

7)

+/*
+ * The remote WAL position that has been applied and flushed locally. We
+ * record this information while sending feedback to the server and use this
+ * both while sending feedback and advancing oldest_nonremovable_xid.
+ */
+static XLogRecPtr last_flushpos = InvalidXLogRecPtr;

I think we record/update last_flushpos in wait_for_local_flush() as
well. Shall we update comments accordingly?

8)
Shall we rename "check_conflict_info_retaintion" to
"check_conflict_info_retention" or
"check_remote_for_retainconflictinfo"? ('retaintion' is not a correct
word)

9)
In RetainConflictInfoData, we can keep reply_time along with other
remote_* variables. The idea is to separate the variables received in
remote's response from the ones purely set and reset by the local
node.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, May 16, 2025 at 5:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>

Please find some more comments on the 0001 patch:
1.
We need to know about such transactions
+ * for conflict detection and resolution in logical replication. See
+ * GetOldestTransactionIdInCommit and its use.

Do we need to mention resolution in the above sentence? This patch is
all about detecting conflicts reliably.

2. In wait_for_publisher_status(), we use remote_epoch,
remote_nextxid, and remote_oldestxid to compute full transaction IDs.
Why can't we send FullTransactionIds for remote_oldestxid and
remote_nextxid from the publisher? If these are required as-is, maybe
a comment somewhere explaining that would be good.

3.
/*
+ * Note it is important to set committs value after marking ourselves as
+ * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This is because
+ * we want to ensure all such transactions are finished before we allow
+ * the logical replication client to advance its xid which is used to hold
+ * back dead rows for conflict detection. See
+ * maybe_advance_nonremovable_xid.
+ */
+ committs = GetCurrentTimestamp();

How does setting committs after setting DELAY_CHKPT_IN_COMMIT help in
advancing the client-side xid? IIUC, on the client side, we simply
wait for such an xid to finish based on the remote_oldestxid and
remote_nextxid sent by the server. So, the above comment is not
completely clear to me. I have updated this and the related comments
in the attached diff patch to make it clear. See if that makes sense
to you.

4.
In 0001's commit message, we have: "Furthermore, the preserved commit
timestamps and origin data are essential for consistently detecting
update_origin_differs conflicts." But it is not clear how this patch
helps in consistently detecting update_origin_differs conflicts.

5. I have tried to add some more details in comments on why
oldest_nonremovable_xid needs to be FullTransactionId. See attached.

--
With Regards,
Amit Kapila.

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tue, May 20, 2025 at 6:30 PM shveta malik wrote:
> 
> Few more comments mostly for patch001:

Thanks for the comments!

> 
> 4)
> For this feature, since we are only interested in remote UPDATEs happening
> concurrently, so shall we ask primary to provide oldest "UPDATE"
> transaction-id in commit-phase rather than any operation's transaction-id?
> This may avoid unnecessarily waiting and pinging at subscriber's end in order
> to advance oldest_nonremovable-xid.
> Thoughts?

It is possible, but considering the potential complexity/cost to track UPDATE
operations in top-level and sub-transactions, coupled with its limited benefit
for workloads featuring frequent UPDATEs on publishers such as observed during
TPC-B performance tests, I have opted to document this possibility in comments
instead of implementing it in the patch set.

> 
> 5)
> +
> +/*
> + * GetOldestTransactionIdInCommit()
> + *
> + * Similar to GetOldestActiveTransactionId but returns the oldest
> transaction ID
> + * that is currently in the commit phase.
> + */
> +TransactionId
> +GetOldestTransactionIdInCommit(void)
> 
> If there is no transaction currently in 'commit' phase, this function will then
> return the next transaction-id. Please mention this in the comments. I think the
> doc 'protocol-replication.html' should also be updated for the same.

I added this info in the doc. But since we have merged this function with
GetOldestActiveTransactionId(), which has the same behavior, I am
not adding more code comments for the existing function.

Best Regards,
Hou zj

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tue, May 20, 2025 at 11:08 AM shveta malik wrote:
> 
> Please find few more comments:

Thanks for the comments!

> 
> 2)
> 
> ----------
>   send_feedback(last_received, requestReply, requestReply);
> 
> + maybe_advance_nonremovable_xid(&data, false);
> +
>   /*
>   * Force reporting to ensure long idle periods don't lead to
>   * arbitrarily delayed stats. Stats can only be reported outside
> ----------
> 
> Why do we need this call to 'maybe_advance_nonremovable_xid' towards end
> of LogicalRepApplyLoop() i.e. the last call? Can it make any further 'data.phase'
> change here? IIUC, there are 2 triggers for 'data.phase' change through
> LogicalRepApplyLoop(). First one is for the very first time when we start this
> loop, it will set data.phase to
> 0  (RCI_GET_CANDIDATE_XID) and will call
> maybe_advance_nonremovable_xid(). After that, LogicalRepApplyLoop()
> function can trigger a 'data.phase' change only when it receives a response
> from the publisher. Shouldn't the first 4 calls  to
> maybe_advance_nonremovable_xid() from LogicalRepApplyLoop() suffice?

I think each invocation of maybe_advance_nonremovable_xid() has a chance to
complete the final RCI_WAIT_FOR_LOCAL_FLUSH phase, as it could be waiting for
changes to be flushed. The function call was added with the intention of
enhancing the likelihood of advancing the transaction ID, particularly when it
is waiting for flushed changes. Although we could add the same check at other
call sites as well, I think it's OK to keep the last check.
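
For reference, the phase handling being discussed looks roughly like this (a
sketch only; apart from wait_for_publisher_status() and wait_for_local_flush(),
which are mentioned elsewhere in this thread, the phase and function names are
assumptions):

	switch (rci_data->phase)
	{
		case RCI_GET_CANDIDATE_XID:
			get_candidate_xid(rci_data);		/* assumed name */
			break;
		case RCI_REQUEST_PUBLISHER_STATUS:
			request_publisher_status(rci_data);	/* assumed name */
			break;
		case RCI_WAIT_FOR_PUBLISHER_STATUS:		/* assumed name */
			wait_for_publisher_status(rci_data);
			break;
		case RCI_WAIT_FOR_LOCAL_FLUSH:
			/*
			 * This phase can complete without any new data from the
			 * publisher, which is why one extra call at the end of
			 * LogicalRepApplyLoop() can still make progress.
			 */
			wait_for_local_flush(rci_data);
			break;
	}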


Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Thu, May 22, 2025 at 8:28 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attaching the V31 patch set which addressed comments in [1]~[8].
>

Few comments:
1.
<para>
+             The oldest transaction ID along that is currently in the commit
+             phase on the server, along with its epoch.

The first 'along' in the sentence looks redundant. I've removed this
in the attached.

2.
+ data.remote_oldestxid = FullTransactionIdFromU64(pq_getmsgint64(&s));
+ data.remote_nextxid = FullTransactionIdFromU64(pq_getmsgint64(&s));

Shouldn't we need to typecast the result of pq_getmsgint64(&s) with
uint64 as we do at similar other places in pg_snapshot_recv?
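
That is, something along these lines (a sketch of the suggested change only):

+ data.remote_oldestxid = FullTransactionIdFromU64((uint64) pq_getmsgint64(&s));
+ data.remote_nextxid = FullTransactionIdFromU64((uint64) pq_getmsgint64(&s));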

3.

+ pq_sendint64(&output_message,
U64FromFullTransactionId(fullOldestXidInCommit));
+ pq_sendint64(&output_message, U64FromFullTransactionId(nextFullXid));

Similarly, here also we should typecast with uint64

4.
+ * XXX In phase RCI_REQUEST_PUBLISHER_STATUS, a potential enhancement could be
+ * requesting transaction information specifically for those containing
+ * UPDATEs. However, this approach introduces additional complexities in
+ * tracking UPDATEs for transactions on the publisher, and it may not
+ * effectively address scenarios with frequent UPDATEs.

I think, since the patch needs the oldest_nonremovable_xid idea even to
detect update_origin_differs and delete_origin_differs reliably (as
mentioned in 0001's commit message), is it sufficient to track only update
transactions? Don't we need to track deletes as well? I have
removed this note for now and updated the comment to mention that it is
required to detect update_origin_differs and delete_origin_differs
conflicts reliably.

Apart from the above comments, I made a few other cosmetic changes in
the attached.

--
With Regards,
Amit Kapila.

Вложения

Re: Conflict detection for update_deleted in logical replication

От
Xuneng Zhou
Дата:

Hi Zhijie,


Thanks for the effort on the patches. I did a quick look at them before diving into the logic and discussion. Below are a few minor typos found in version 31.



1. Spelling of “non-removable”

[PATCH v31 1/7]

In the docs and code, “removeable” and “removable” are used interchangeably, and the hyphen in “non-removable” is sometimes omitted.



2. Double “arise” in SGML

[PATCH v31 7/7]

In doc/src/sgml/logical-replication.sgml, under the <varlistentry id="conflict-update-deleted">, there is a duplicated “arise”:


+       are enabled. Note that if a tuple cannot be found due to the table being
+       truncated only a <literal>update_missing</literal> conflict will arise.
+       arise



3. Commit-message typos

[PATCH v31 1/7] (typo “tranasction”)

Subject: [PATCH v30 1/7] Maintain the oldest non removeable tranasction ID by
 apply worker

Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Fri, May 23, 2025 at 9:21 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
Looking at v31-0001 again, most of it looks fine except this logic of
getting the commit_ts after marking the transaction in commit.  I see
that in RecordTransactionCommit() we set this flag
(DELAY_CHKPT_IN_COMMIT) to put the transaction in the commit state [1],
and after that we insert the commit log [2]. But I noticed that there we
call GetCurrentTransactionStopTimestamp() for acquiring the commit-ts,
and IIUC we want to ensure that the commit-ts timestamp is taken after we
set the transaction in commit with (DELAY_CHKPT_IN_COMMIT). The
question is: is it guaranteed that the place where we are calling
GetCurrentTransactionStopTimestamp() will always give us the current
timestamp? Because if you look at this function, it may return
'xactStopTimestamp' as well if that is already set.  I am still
digging a bit more. Is there a possibility that 'xactStopTimestamp' is
already set during some interrupt handling, when
GetCurrentTransactionStopTimestamp() has already been called by
pgstat_report_stat(), or is it guaranteed that during
RecordTransactionCommit() we call it for the first time?

If we have already ensured this then I think adding a comment to
explain the same will be really useful.

[1]
@@ -1537,7 +1541,7 @@ RecordTransactionCommit(void)
  */
  if (markXidCommitted)
  {
- MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
+ MyProc->delayChkptFlags &= ~DELAY_CHKPT_IN_COMMIT;
  END_CRIT_SECTION();
  }

[2]
/*
* Insert the commit XLOG record.
*/
XactLogCommitRecord(GetCurrentTransactionStopTimestamp(),
                                       nchildren, children, nrels, rels,
                                       ndroppedstats, droppedstats,
                                       nmsgs, invalMessages,
                                       RelcacheInitFileInval,
                                       MyXactFlags,
                                       InvalidTransactionId, NULL /*
plain commit */ );
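
For reference, the behaviour in question is roughly the following (a paraphrase
of the function being discussed, not the exact source):

TimestampTz
GetCurrentTransactionStopTimestamp(void)
{
	/* If someone already recorded the stop time, return the cached value. */
	if (xactStopTimestamp != 0)
		return xactStopTimestamp;

	/* Otherwise take the current time and remember it. */
	xactStopTimestamp = GetCurrentTimestamp();
	return xactStopTimestamp;
}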

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Sat, May 24, 2025 at 10:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, May 23, 2025 at 9:21 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> Looking at v31-0001 again, most of it looks fine except this logic of
> getting the commit_ts after marking the transaction in commit.  I see
> in RecordTransactionCommit(), we are setting this flag
> (DELAY_CHKPT_IN_COMMIT) to put the transaction in commit state[1], and
> after that we insert the commit log[2], but I noticed that there we
> call GetCurrentTransactionStopTimestamp() for acquiring the commit-ts
> and IIUC we want to ensure that commit-ts timestamp should be after we
> set the transaction in commit with (DELAY_CHKPT_IN_COMMIT), but
> question is, is it guaranteed that the place we are calling
> GetCurrentTransactionStopTimestamp() will always give us the current
> timestamp? Because if you see this function, it may return
> 'xactStopTimestamp' as well if that is already set.  I am still
> digging a bit more. Is there a possibility that 'xactStopTimestamp' is
> already set during some interrupt handling when
> GetCurrentTransactionStopTimestamp() is already called by
> pgstat_report_stat(), or is it guaranteed that during
> RecordTransactionCommit we will call this first time?
>
> If we have already ensured this then I think adding a comment to
> explain the same will be really useful.
>
> [1]
> @@ -1537,7 +1541,7 @@ RecordTransactionCommit(void)
>   */
>   if (markXidCommitted)
>   {
> - MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
> + MyProc->delayChkptFlags &= ~DELAY_CHKPT_IN_COMMIT;
>   END_CRIT_SECTION();
>   }
>
> [2]
> /*
> * Insert the commit XLOG record.
> */
> XactLogCommitRecord(GetCurrentTransactionStopTimestamp(),
>                                        nchildren, children, nrels, rels,
>                                        ndroppedstats, droppedstats,
>                                        nmsgs, invalMessages,
>                                        RelcacheInitFileInval,
>                                        MyXactFlags,
>                                        InvalidTransactionId, NULL /*
> plain commit */ );

IMHO, this should not be an issue as the only case where
'xactStopTimestamp' is set for the current process is from
ProcessInterrupts->pgstat_report_stat->
GetCurrentTransactionStopTimestamp, and this call sequence is only
possible when transaction blockState is TBLOCK_DEFAULT.  And that is
only set after RecordTransactionCommit() is called, so logically,
RecordTransactionCommit() should always be the first one to set the
'xactStopTimestamp'.  But I still think this is a candidate for
comments, or even better, if somehow it can be ensured by some
assertion, maybe by passing a parameter to
GetCurrentTransactionStopTimestamp() indicating that if it is called from
RecordTransactionCommit() then 'xactStopTimestamp' must not already be
set.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Sat, May 24, 2025 at 10:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, May 24, 2025 at 10:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, May 23, 2025 at 9:21 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > Looking at v31-0001 again, most of it looks fine except this logic of
> > getting the commit_ts after marking the transaction in commit.  I see
> > in RecordTransactionCommit(), we are setting this flag
> > (DELAY_CHKPT_IN_COMMIT) to put the transaction in commit state[1], and
> > after that we insert the commit log[2], but I noticed that there we
> > call GetCurrentTransactionStopTimestamp() for acquiring the commit-ts
> > and IIUC we want to ensure that commit-ts timestamp should be after we
> > set the transaction in commit with (DELAY_CHKPT_IN_COMMIT), but
> > question is, is it guaranteed that the place we are calling
> > GetCurrentTransactionStopTimestamp() will always give us the current
> > timestamp? Because if you see this function, it may return
> > 'xactStopTimestamp' as well if that is already set.  I am still
> > digging a bit more. Is there a possibility that 'xactStopTimestamp' is
> > already set during some interrupt handling when
> > GetCurrentTransactionStopTimestamp() is already called by
> > pgstat_report_stat(), or is it guaranteed that during
> > RecordTransactionCommit we will call this first time?
> >
> > If we have already ensured this then I think adding a comment to
> > explain the same will be really useful.
> >
...
>
> IMHO, this should not be an issue as the only case where
> 'xactStopTimestamp' is set for the current process is from
> ProcessInterrupts->pgstat_report_stat->
> GetCurrentTransactionStopTimestamp, and this call sequence is only
> possible when transaction blockState is TBLOCK_DEFAULT.  And that is
> only set after RecordTransactionCommit() is called, so logically,
> RecordTransactionCommit() should always be the first one to set the
> 'xactStopTimestamp'.  But I still think this is a candidate for
> comments, or even better,r if somehow it can be ensured by some
> assertion, maybe by passing a parameter in
> GetCurrentTransactionStopTimestamp() that if this is called from
> RecordTransactionCommit() then 'xactStopTimestamp' must not already be
> set.
>

We can add an assertion as you are suggesting, but I feel that adding
a parameter for this purpose looks slightly odd. I suggest adding
comments and probably a test case, if possible, so that if we acquire
commit_ts before setting DELAY_CHKPT_IN_COMMIT flag, then the test
should fail. I haven't checked the feasibility of such a test, so it
may be that it is not feasible or may require some odd injection
points, but even then, it seems better to add comments for this case.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Sat, May 24, 2025 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, May 24, 2025 at 10:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, May 24, 2025 at 10:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Fri, May 23, 2025 at 9:21 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > >
> > > Looking at v31-0001 again, most of it looks fine except this logic of
> > > getting the commit_ts after marking the transaction in commit.  I see
> > > in RecordTransactionCommit(), we are setting this flag
> > > (DELAY_CHKPT_IN_COMMIT) to put the transaction in commit state[1], and
> > > after that we insert the commit log[2], but I noticed that there we
> > > call GetCurrentTransactionStopTimestamp() for acquiring the commit-ts
> > > and IIUC we want to ensure that commit-ts timestamp should be after we
> > > set the transaction in commit with (DELAY_CHKPT_IN_COMMIT), but
> > > question is, is it guaranteed that the place we are calling
> > > GetCurrentTransactionStopTimestamp() will always give us the current
> > > timestamp? Because if you see this function, it may return
> > > 'xactStopTimestamp' as well if that is already set.  I am still
> > > digging a bit more. Is there a possibility that 'xactStopTimestamp' is
> > > already set during some interrupt handling when
> > > GetCurrentTransactionStopTimestamp() is already called by
> > > pgstat_report_stat(), or is it guaranteed that during
> > > RecordTransactionCommit we will call this first time?
> > >
> > > If we have already ensured this then I think adding a comment to
> > > explain the same will be really useful.
> > >
> ...
> >
> > IMHO, this should not be an issue as the only case where
> > 'xactStopTimestamp' is set for the current process is from
> > ProcessInterrupts->pgstat_report_stat->
> > GetCurrentTransactionStopTimestamp, and this call sequence is only
> > possible when transaction blockState is TBLOCK_DEFAULT.  And that is
> > only set after RecordTransactionCommit() is called, so logically,
> > RecordTransactionCommit() should always be the first one to set the
> > 'xactStopTimestamp'.  But I still think this is a candidate for
> > comments, or even better,r if somehow it can be ensured by some
> > assertion, maybe by passing a parameter in
> > GetCurrentTransactionStopTimestamp() that if this is called from
> > RecordTransactionCommit() then 'xactStopTimestamp' must not already be
> > set.
> >
>
> We can add an assertion as you are suggesting, but I feel that adding
> a parameter for this purpose looks slightly odd.


Yeah, that's true. Another option is to add an assert,
Assert(xactStopTimestamp == 0), right before calling
XactLogCommitRecord(). With that, we don't need to pass an extra
parameter, and since we are in a critical section, this process cannot
be interrupted, so once we have ensured that 'xactStopTimestamp' is 0
before calling the API, it cannot change afterwards. And we can add a
comment atop this assertion.
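
In concrete terms, the proposal amounts to something like this (a sketch; the
XactLogCommitRecord() call is the one already quoted above):

	/*
	 * The stop timestamp must not have been taken yet; it is only taken
	 * after DELAY_CHKPT_IN_COMMIT has been set, and we are inside a
	 * critical section, so it cannot be set concurrently either.
	 */
	Assert(xactStopTimestamp == 0);

	/* Insert the commit XLOG record. */
	XactLogCommitRecord(GetCurrentTransactionStopTimestamp(),
						nchildren, children, nrels, rels,
						ndroppedstats, droppedstats,
						nmsgs, invalMessages,
						RelcacheInitFileInval,
						MyXactFlags,
						InvalidTransactionId, NULL /* plain commit */ );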

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Sat, May 24, 2025 at 3:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, May 24, 2025 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, May 24, 2025 at 10:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Sat, May 24, 2025 at 10:04 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Fri, May 23, 2025 at 9:21 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > > >
> > > > Looking at v31-0001 again, most of it looks fine except this logic of
> > > > getting the commit_ts after marking the transaction in commit.  I see
> > > > in RecordTransactionCommit(), we are setting this flag
> > > > (DELAY_CHKPT_IN_COMMIT) to put the transaction in commit state[1], and
> > > > after that we insert the commit log[2], but I noticed that there we
> > > > call GetCurrentTransactionStopTimestamp() for acquiring the commit-ts
> > > > and IIUC we want to ensure that commit-ts timestamp should be after we
> > > > set the transaction in commit with (DELAY_CHKPT_IN_COMMIT), but
> > > > question is, is it guaranteed that the place we are calling
> > > > GetCurrentTransactionStopTimestamp() will always give us the current
> > > > timestamp? Because if you see this function, it may return
> > > > 'xactStopTimestamp' as well if that is already set.  I am still
> > > > digging a bit more. Is there a possibility that 'xactStopTimestamp' is
> > > > already set during some interrupt handling when
> > > > GetCurrentTransactionStopTimestamp() is already called by
> > > > pgstat_report_stat(), or is it guaranteed that during
> > > > RecordTransactionCommit we will call this first time?
> > > >
> > > > If we have already ensured this then I think adding a comment to
> > > > explain the same will be really useful.
> > > >
> > ...
> > >
> > > IMHO, this should not be an issue as the only case where
> > > 'xactStopTimestamp' is set for the current process is from
> > > ProcessInterrupts->pgstat_report_stat->
> > > GetCurrentTransactionStopTimestamp, and this call sequence is only
> > > possible when transaction blockState is TBLOCK_DEFAULT.  And that is
> > > only set after RecordTransactionCommit() is called, so logically,
> > > RecordTransactionCommit() should always be the first one to set the
> > > 'xactStopTimestamp'.  But I still think this is a candidate for
> > > comments, or even better,r if somehow it can be ensured by some
> > > assertion, maybe by passing a parameter in
> > > GetCurrentTransactionStopTimestamp() that if this is called from
> > > RecordTransactionCommit() then 'xactStopTimestamp' must not already be
> > > set.
> > >
> >
> > We can add an assertion as you are suggesting, but I feel that adding
> > a parameter for this purpose looks slightly odd.
>
>
> Yeah, that's true. Another option is to add an assert as
> Assert(xactStopTimestamp == 0) right before calling
> XactLogCommitRecord()?  With that, we don't need to pass an extra
> parameter, and since we are in a critical section, this process can
> not be interrupted, so it's fine even if we have ensured that
> 'xactStopTimestamp' is 0 before calling the API, as this can not be
> changed.  And we can add a comment atop this assertion.
>

This sounds reasonable to me. Let us see what others think.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Sat, May 24, 2025 at 4:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> This sounds reasonable to me. Let us see what others think.
>

I was looking into the logic for getting the transaction status from
the publisher. What I would assume this patch should be doing is to
request the publisher status the first time, and if some transactions
are still in commit, wait for them to complete.  But in the
current design it is possible that while we are waiting for the in-commit
transactions to commit, older running transactions enter the commit
phase and then we wait for them again; is my understanding correct?

Maybe this is a very corner case: there are thousands of old running
transactions, and every time we request the status we find some
transactions in the commit phase, so the process keeps running for a long
time until all the old running transactions eventually get committed.

I am thinking, can't we make it more deterministic, such that when we
get the status the first time and find some transactions that are in the
commit phase, we only wait for those transactions to get
committed?  One idea is to get the list of xids in the commit phase, and the
next time we get the list we can just compare: if the next status
does not contain any xids that were in the commit phase during the
previous status, then we are done.  But I am not sure whether this is worth
the complexity.  Maybe not, but shall we add some comments explaining the
case and also explaining why this corner case is not harmful?
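
A rough sketch of the kind of check the deterministic variant would need
(purely hypothetical; none of these names exist in the patch):

/*
 * Return true once none of the xids that were in the commit phase when the
 * first status was received are still in the commit phase now.
 */
static bool
initial_commits_finished(TransactionId *initial_xids, int n_initial,
						 TransactionId *in_commit_now, int n_now)
{
	for (int i = 0; i < n_initial; i++)
	{
		for (int j = 0; j < n_now; j++)
		{
			if (TransactionIdEquals(initial_xids[i], in_commit_now[j]))
				return false;	/* an originally in-commit xid is still committing */
		}
	}

	return true;
}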

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Sun, May 25, 2025 at 4:36 PM Dilip Kumar wrote:

> 
> On Sat, May 24, 2025 at 4:46 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > This sounds reasonable to me. Let us see what others think.
> >
> 
> I was looking into the for getting the transaction status from
> publisher, what I would assume this patch should be doing is request
> the publisher status first time, and if some transactions are still in
> commit, then we need to wait for them to get completed.  But in the
> current design its possible that while we are waiting for in-commit
> transactions to get committed the old running transaction might come
> in commit phase and then we wait for them again, is my understanding
> not correct?

Thanks for reviewing the patch. And yes, your understanding is correct.

> 
> Maybe this is very corner case that there are thousands of old running
> transaction and everytime we request the status we find some
> transactions is in commit phase and the process keep running for long
> time until all the old running transaction eventually get committed.
> 
> I am thinking can't we make it more deterministic such that when we
> get the status first time if we find some transactions that are in
> commit phase then we should just wait for those transaction to get
> committed?  One idea is to get the list of xids in commit phase and
> next time when we get the list we can just compare and in next status
> if we don't get any xids in commit phase which were in commit phase
> during previous status then we are done.  But not sure is this worth
> the complexity?  Mabe not but shall we add some comment explaining the
> case and also explaining why this corner case is not harmful?

I also think it's not worth the complexity for this corner case which is
rare. So, I have added some comments in wait_for_publisher_status() to
mention the same.

Best Regards,
Hou zj

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Sat, May 24, 2025 at 6:28 PM Dilip Kumar wrote:

> 
> On Sat, May 24, 2025 at 11:00 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Sat, May 24, 2025 at 10:29 AM Dilip Kumar <dilipbalaut@gmail.com>
> wrote:
> > >
> > > On Sat, May 24, 2025 at 10:04 AM Dilip Kumar <dilipbalaut@gmail.com>
> wrote:
> > > >
> > > > On Fri, May 23, 2025 at 9:21 PM Xuneng Zhou
> <xunengzhou@gmail.com> wrote:
> > > > >
> > > > Looking at v31-0001 again, most of it looks fine except this logic
> > > > of getting the commit_ts after marking the transaction in commit.
> > > > I see in RecordTransactionCommit(), we are setting this flag
> > > > (DELAY_CHKPT_IN_COMMIT) to put the transaction in commit state[1],
> > > > and after that we insert the commit log[2], but I noticed that
> > > > there we call GetCurrentTransactionStopTimestamp() for acquiring
> > > > the commit-ts and IIUC we want to ensure that commit-ts timestamp
> > > > should be after we set the transaction in commit with
> > > > (DELAY_CHKPT_IN_COMMIT), but question is, is it guaranteed that
> > > > the place we are calling
> > > > GetCurrentTransactionStopTimestamp() will always give us the
> > > > current timestamp? Because if you see this function, it may return
> > > > 'xactStopTimestamp' as well if that is already set.  I am still
> > > > digging a bit more. Is there a possibility that
> > > > 'xactStopTimestamp' is already set during some interrupt handling
> > > > when
> > > > GetCurrentTransactionStopTimestamp() is already called by
> > > > pgstat_report_stat(), or is it guaranteed that during
> > > > RecordTransactionCommit we will call this first time?
> > > >
> > > > If we have already ensured this then I think adding a comment to
> > > > explain the same will be really useful.
> > > >
> > ...
> > >
> > > IMHO, this should not be an issue as the only case where
> > > 'xactStopTimestamp' is set for the current process is from
> > > ProcessInterrupts->pgstat_report_stat->
> > > GetCurrentTransactionStopTimestamp, and this call sequence is only
> > > possible when transaction blockState is TBLOCK_DEFAULT.  And that is
> > > only set after RecordTransactionCommit() is called, so logically,
> > > RecordTransactionCommit() should always be the first one to set the
> > > 'xactStopTimestamp'.  But I still think this is a candidate for
> > > comments, or even better,r if somehow it can be ensured by some
> > > assertion, maybe by passing a parameter in
> > > GetCurrentTransactionStopTimestamp() that if this is called from
> > > RecordTransactionCommit() then 'xactStopTimestamp' must not already
> > > be set.
> > >
> >
> > We can add an assertion as you are suggesting, but I feel that adding
> > a parameter for this purpose looks slightly odd.
> 
> 
> Yeah, that's true. Another option is to add an assert as
> Assert(xactStopTimestamp == 0) right before calling
> XactLogCommitRecord()?  With that, we don't need to pass an extra
> parameter, and since we are in a critical section, this process can not be
> interrupted, so it's fine even if we have ensured that 'xactStopTimestamp' is 0
> before calling the API, as this can not be changed.  And we can add a
> comment atop this assertion.

Thanks for the suggestion !

I think adding an Assert as suggested is OK. I am not adding more comments
atop the assert because we already have comments very close by that
explain the importance of setting the flag first.

Best Regards,
Hou zj

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Fri, May 23, 2025 at 11:51 PM Xuneng Zhou wrote:

> Thanks for the effort on the patches. I did a quick look on them before
> diving into the logic and discussion. Below are a few minor typos found in
> version 31.

Thanks for the comments! I have fixed these typos in the latest version.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Sun, May 25, 2025 at 4:36 PM Dilip Kumar wrote:
>
> >
> > I am thinking can't we make it more deterministic such that when we
> > get the status first time if we find some transactions that are in
> > commit phase then we should just wait for those transaction to get
> > committed?  One idea is to get the list of xids in commit phase and
> > next time when we get the list we can just compare and in next status
> > if we don't get any xids in commit phase which were in commit phase
> > during previous status then we are done.  But not sure is this worth
> > the complexity?  Mabe not but shall we add some comment explaining the
> > case and also explaining why this corner case is not harmful?
>
> I also think it's not worth the complexity for this corner case which is
> rare.

Yeah, complexity is one part, but I feel that improving such less frequent
cases could add a performance burden for the more frequent cases, where we need
to either maintain and invalidate a cache on the publisher or send
the list of all such xids to the subscriber over the network.

> So, I have added some comments in wait_for_publisher_status() to
> mention the same.
>

I agree that at this stage it is good to note down in comments, and if
we face such cases often, then we can improve it in the future.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attaching the V32 patch set which addressed comments in [1]~[5].

Thanks for the patch. I am still reviewing the patches; please find a
few trivial comments for patch001:

1)

+ FullTransactionId last_phase_at; /* publisher transaction ID that must
+ * be awaited to complete before
+ * entering the final phase
+ * (RCI_WAIT_FOR_LOCAL_FLUSH) */

'last_phase_at' seems like we are talking about the phase in the past.
(similar to 'last' in last_recv_time).
Perhaps we should name it as 'final_phase_at'


2)
RetainConflictInfoData data = {0};

We can change this name as well to rci_data.

3)
+ /*
+ * Compute FullTransactionId for the oldest running transaction ID. This
+ * handles the case where transaction ID wraparound has occurred.
+ */
+ full_xid = FullTransactionIdFromAllowableAt(next_full_xid,
oldest_running_xid);

Shall we rename it to full_oldest_xid for better clarity?


4)
+ /*
+ * Update and check the remote flush position if we are applying changes
+ * in a loop. This is done at most once per WalWriterDelay to avoid
+ * performing costy operations in get_flush_position() too frequently
+ * during change application.
+ */
+ if (last_flushpos < rci_data->remote_lsn && rci_data->last_recv_time &&
+ TimestampDifferenceExceeds(rci_data->flushpos_update_time,
+    rci_data->last_recv_time, WalWriterDelay))

a) costy --> costly

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Tue, May 27, 2025 at 11:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Sun, May 25, 2025 at 4:36 PM Dilip Kumar wrote:
> >
> > >
> > > I am thinking can't we make it more deterministic such that when we
> > > get the status first time if we find some transactions that are in
> > > commit phase then we should just wait for those transaction to get
> > > committed?  One idea is to get the list of xids in commit phase and
> > > next time when we get the list we can just compare and in next status
> > > if we don't get any xids in commit phase which were in commit phase
> > > during previous status then we are done.  But not sure is this worth
> > > the complexity?  Mabe not but shall we add some comment explaining the
> > > case and also explaining why this corner case is not harmful?
> >
> > I also think it's not worth the complexity for this corner case which is
> > rare.
>
> Yeah, complexity is one part, but I feel improving such less often
> cases could add performance burden for more often cases where we need
> to either maintain and invalidate the cache on the publisher or send
> the list of all such xids to the subscriber over the network.

Yeah, that's a valid point.

> > So, I have added some comments in wait_for_publisher_status() to
> > mention the same.
> >
>
> I agree that at this stage it is good to note down in comments, and if
> we face such cases often, then we can improve it in the future.

+1

--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Tue, May 27, 2025 at 3:59 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attaching the V32 patch set which addressed comments in [1]~[5].
>
> Thanks for the patch, I am still reviewing the patches, please find
> few trivial comments for patch001:
>
> 1)
>
> + FullTransactionId last_phase_at; /* publisher transaction ID that must
> + * be awaited to complete before
> + * entering the final phase
> + * (RCI_WAIT_FOR_LOCAL_FLUSH) */
>
> 'last_phase_at' seems like we are talking about the phase in the past.
> (similar to 'last' in last_recv_time).
> Perhaps we should name it as 'final_phase_at'
>
>
> 2)
> RetainConflictInfoData data = {0};
>
> We can change this name as well to rci_data.
>
> 3)
> + /*
> + * Compute FullTransactionId for the oldest running transaction ID. This
> + * handles the case where transaction ID wraparound has occurred.
> + */
> + full_xid = FullTransactionIdFromAllowableAt(next_full_xid,
> oldest_running_xid);
>
> Shall we name it to full_oldest_xid for better clarity?
>
>
> 4)
> + /*
> + * Update and check the remote flush position if we are applying changes
> + * in a loop. This is done at most once per WalWriterDelay to avoid
> + * performing costy operations in get_flush_position() too frequently
> + * during change application.
> + */
> + if (last_flushpos < rci_data->remote_lsn && rci_data->last_recv_time &&
> + TimestampDifferenceExceeds(rci_data->flushpos_update_time,
> +    rci_data->last_recv_time, WalWriterDelay))
>
> a) costy --> costly
>

Please find a few comments on the docs of v32-003:


1)

logical-replication.sgml:

    <para>
     <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
     must be set to at least the number of subscriptions expected to connect,
-    plus some reserve for table synchronization.
+    plus some reserve for table synchronization and one if
+    <link
linkend="sql-createsubscription-params-with-retain-conflict-info"><literal>retain_conflict_info</literal></link>
is enabled.
    </para>

The above doc is updated under the Publishers section:

<sect2 id="logical-replication-config-publisher">
   <title>Publishers</title>

But the conflict slot is created on the subscriber, so this info should be
moved to the subscriber section. The Subscribers section currently does not
have the 'max_replication_slots' parameter under it, but I guess with the
conflict slot created on subscribers, we need to have that parameter
under 'Subscribers' too.


2)
catalogs.sgml:

+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>subretainconflictinfo</structfield> <type>bool</type>
+      </para>
+      <para>
+       If true, the detection of <xref linkend="conflict-update-deleted"/> is
+       enabled and the information (e.g., dead tuples, commit timestamps, and
+       origins) on the subscriber that is still useful for conflict detection
+       is retained.
+      </para></entry>
+     </row>
+

2a) In the absence of the rest of the patches atop the 3rd patch, this failed to
compile due to the missing xref link. Error:
>element xref: validity error : IDREF attribute linked references an unknown ID "conflict-update-deleted"

2b) Also, if 'subretainconflictinfo' is true, IMO, it does not enable
update_deleted detection; it just provides information which can
be used for detection. We should rephrase this doc.
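
For reference, the ordering that the quoted comment describes is roughly the
following (a sketch pieced together from the fragments in this thread; the
surrounding RecordTransactionCommit() code is abbreviated):

	START_CRIT_SECTION();

	/* First, mark ourselves as being in the commit critical section. */
	MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;

	/*
	 * Only afterwards take the timestamp that will become the commit
	 * timestamp, so the logical replication client never sees a commit
	 * timestamp that predates our "in commit" marking.
	 */
	committs = GetCurrentTimestamp();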


3)
 <xref linkend="conflict-update-deleted"/> is used in
create_subscription.sgml as well. That too needs correction.

4)
+ if (!sub_enabled)
+ ereport(WARNING,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("information for detecting conflicts cannot be purged when
the subscription is disabled"));

A WARNING is good, but not enough to clarify the problem and our
recommendation in such a case. Shall we update the docs as well,
explaining that such a situation may result in system bloat and thus,
if the subscription is disabled for longer, it is good to have
retain_conflict_info disabled as well?

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attaching the V32 patch set which addressed comments in [1]~[5].
>

Review comments:
===============
*
+advance_conflict_slot_xmin(FullTransactionId new_xmin)
+{
+ FullTransactionId full_xmin;
+ FullTransactionId next_full_xid;
+
+ Assert(MyReplicationSlot);
+ Assert(FullTransactionIdIsValid(new_xmin));
+
+ next_full_xid = ReadNextFullTransactionId();
+
+ /*
+ * Compute FullTransactionId for the current xmin. This handles the case
+ * where transaction ID wraparound has occurred.
+ */
+ full_xmin = FullTransactionIdFromAllowableAt(next_full_xid,
+ MyReplicationSlot->data.xmin);
+
+ if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin))
+ return;

The above code suggests that the launcher could compute a new xmin
that is less than slot's xmin. At first, this looks odd to me, but
IIUC, this can happen when the user toggles retain_conflict_info flag
at some random time when the launcher is trying to compute the new
xmin value for the slot. One of the possible combinations of steps for
this race could be as follows:

1. The subscriber has two subscriptions, A and B. Subscription A has
retain_conflict_info as true, and B has retain_conflict_info as false
2. Say the launcher calls get_subscription_list(), and worker A is
already alive.
3. Assuming the apply worker will restart on changing
retain_conflict_info, the user enables retain_conflict_info for
subscription B.
4. The launcher processes the subscription B first in the first cycle,
and starts worker B. Say, worker B gets 759 as candidate_xid.
5. The launcher creates the conflict detection slot, xmin = 759
6. Say a new txn happens, worker A gets 760 as candidate_xid and
updates it to oldest_nonremovable_xid.
7. The launcher processes the subscription A in the first cycle, and
the final xmin value is 760, because it only checks the
oldest_nonremovable_xid from worker A. The launcher then updates the
value to slot.xmin.
8. In the next cycle, the launcher finds that worker B has an older
oldest_nonremovable_xid 759, so the minimal xid would now be 759. The
launcher would have retreated the slot's xmin unless we had the above
check in the quoted code.

I think the above race is possible because the system lets changed
subscription values take effect asynchronously via the
workers. One more similar race condition handled by the patch is
as follows:

*
...
+ * It's necessary to use FullTransactionId here to mitigate potential race
+ * conditions. Such scenarios might occur if the replication slot is not
+ * yet created by the launcher while the apply worker has already
+ * initialized this field. During this period, a transaction ID wraparound
+ * could falsely make this ID appear as if it originates from the future
+ * w.r.t the transaction ID stored in the slot maintained by launcher. See
+ * advance_conflict_slot_xmin.
...
+ FullTransactionId oldest_nonremovable_xid;

This case can happen if the user disables and immediately enables the
retain_conflict_info option. In this case, the launcher may drop the
slot after noticing the disable. But the apply worker may not notice
the disable and it only notices that the retain_conflict_info is still
enabled, so it will keep maintaining oldest_nonremovable_xid when the
slot is not created.

It is okay to handle both the race conditions, but I am worried we may
miss some such race conditions which could lead to difficult-to-find
bugs. So, at least for the first version of this option (aka for
patches 0001 to 0003), we can add a condition that allows us to change
retain_conflict_info only on disabled subscriptions. This will
simplify the patch. We can make a separate patch to allow changing
retain_conflict_info option for enabled subscriptions. That will make
it easier to evaluate such race conditions and the solutions more
deeply.
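
In code terms, the suggested restriction would amount to something like this
(a sketch only; the error message mirrors the one quoted later in this thread,
while the surrounding variable names and the error code are assumptions):

	/* In AlterSubscription(), sketch only: */
	if (sub->enabled && opts.retainconflictinfo != sub->retainconflictinfo)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("cannot set option %s for enabled subscription",
						"retain_conflict_info")));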

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Mon, Jun 2, 2025 at 12:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attaching the V32 patch set which addressed comments in [1]~[5].
> >
>
> Review comments:
> ===============
> *
> +advance_conflict_slot_xmin(FullTransactionId new_xmin)
> +{
> + FullTransactionId full_xmin;
> + FullTransactionId next_full_xid;
> +
> + Assert(MyReplicationSlot);
> + Assert(FullTransactionIdIsValid(new_xmin));
> +
> + next_full_xid = ReadNextFullTransactionId();
> +
> + /*
> + * Compute FullTransactionId for the current xmin. This handles the case
> + * where transaction ID wraparound has occurred.
> + */
> + full_xmin = FullTransactionIdFromAllowableAt(next_full_xid,
> + MyReplicationSlot->data.xmin);
> +
> + if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin))
> + return;
>
> The above code suggests that the launcher could compute a new xmin
> that is less than slot's xmin. At first, this looks odd to me, but
> IIUC, this can happen when the user toggles retain_conflict_info flag
> at some random time when the launcher is trying to compute the new
> xmin value for the slot. One of the possible combinations of steps for
> this race could be as follows:
>
> 1. The subscriber has two subscriptions, A and B. Subscription A has
> retain_conflict_info as true, and B has retain_conflict_info as false
> 2. Say the launcher calls get_subscription_list(), and worker A is
> already alive.
> 3. Assuming the apply worker will restart on changing
> retain_conflict_info, the user enables retain_conflict_info for
> subscription B.
> 4. The launcher processes the subscription B first in the first cycle,
> and starts worker B. Say, worker B gets 759 as candidate_xid.
> 5. The launcher creates the conflict detection slot, xmin = 759
> 6. Say a new txn happens, worker A gets 760 as candidate_xid and
> updates it to oldest_nonremovable_xid.
> 7. The launcher processes the subscription A in the first cycle, and
> the final xmin value is 760, because it only checks the
> oldest_nonremovable_xid from worker A. The launcher then updates the
> value to slot.xmin.
> 8. In the next cycle, the launcher finds that worker B has an older
> oldest_nonremovable_xid 759, so the minimal xid would now be 759. The
> launher would have retreated the slot's xmin unless we had the above
> check in the quoted code.
>
> I think the above race is possible because the system lets the changed
> subscription values of a subscription take effect asynchronously by
> workers. The one more similar race condition handled by the patch is
> as follows:
>
> *
> ...
> + * It's necessary to use FullTransactionId here to mitigate potential race
> + * conditions. Such scenarios might occur if the replication slot is not
> + * yet created by the launcher while the apply worker has already
> + * initialized this field. During this period, a transaction ID wraparound
> + * could falsely make this ID appear as if it originates from the future
> + * w.r.t the transaction ID stored in the slot maintained by launcher. See
> + * advance_conflict_slot_xmin.
> ...
> + FullTransactionId oldest_nonremovable_xid;
>
> This case can happen if the user disables and immediately enables the
> retain_conflict_info option. In this case, the launcher may drop the
> slot after noticing the disable. But the apply worker may not notice
> the disable and it only notices that the retain_conflict_info is still
> enabled, so it will keep maintaining oldest_nonremovable_xid when the
> slot is not created.
>

Another case to handle is similar to the above, with the only difference that
no transaction ID wraparound has happened. In such a case, the
launcher may end up using the worker's oldest_nonremovable_xid from the
previous cycle, before the user disabled and re-enabled retain_conflict_info.
This may result in the slot moving backward in the absence of the suggested
check in advance_conflict_slot_xmin().

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Mon, Jun 2, 2025 at 2:39 PM Amit Kapila wrote:
> 
> On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attaching the V32 patch set which addressed comments in [1]~[5].
> >
> 
> Review comments:
> ===============
> *
> +advance_conflict_slot_xmin(FullTransactionId new_xmin) {
> +FullTransactionId full_xmin;  FullTransactionId next_full_xid;
> +
> + Assert(MyReplicationSlot);
> + Assert(FullTransactionIdIsValid(new_xmin));
> +
> + next_full_xid = ReadNextFullTransactionId();
> +
> + /*
> + * Compute FullTransactionId for the current xmin. This handles the
> + case
> + * where transaction ID wraparound has occurred.
> + */
> + full_xmin = FullTransactionIdFromAllowableAt(next_full_xid,
> + MyReplicationSlot->data.xmin);
> +
> + if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin)) return;
> 
> The above code suggests that the launcher could compute a new xmin that is
> less than slot's xmin. At first, this looks odd to me, but IIUC, this can happen
> when the user toggles retain_conflict_info flag at some random time when the
> launcher is trying to compute the new xmin value for the slot. One of the
> possible combinations of steps for this race could be as follows:
> 
> 1. The subscriber has two subscriptions, A and B. Subscription A has
> retain_conflict_info as true, and B has retain_conflict_info as false 2. Say the
> launcher calls get_subscription_list(), and worker A is already alive.
> 3. Assuming the apply worker will restart on changing retain_conflict_info, the
> user enables retain_conflict_info for subscription B.
> 4. The launcher processes the subscription B first in the first cycle, and starts
> worker B. Say, worker B gets 759 as candidate_xid.
> 5. The launcher creates the conflict detection slot, xmin = 759 6. Say a new txn
> happens, worker A gets 760 as candidate_xid and updates it to
> oldest_nonremovable_xid.
> 7. The launcher processes the subscription A in the first cycle, and the final
> xmin value is 760, because it only checks the oldest_nonremovable_xid from
> worker A. The launcher then updates the value to slot.xmin.
> 8. In the next cycle, the launcher finds that worker B has an older
> oldest_nonremovable_xid 759, so the minimal xid would now be 759. The
> launher would have retreated the slot's xmin unless we had the above check in
> the quoted code.
> 
> I think the above race is possible because the system lets the changed
> subscription values of a subscription take effect asynchronously by workers.
> The one more similar race condition handled by the patch is as follows:
> 
> *
> ...
> + * It's necessary to use FullTransactionId here to mitigate potential
> + race
> + * conditions. Such scenarios might occur if the replication slot is
> + not
> + * yet created by the launcher while the apply worker has already
> + * initialized this field. During this period, a transaction ID
> + wraparound
> + * could falsely make this ID appear as if it originates from the
> + future
> + * w.r.t the transaction ID stored in the slot maintained by launcher.
> + See
> + * advance_conflict_slot_xmin.
> ...
> + FullTransactionId oldest_nonremovable_xid;
> 
> This case can happen if the user disables and immediately enables the
> retain_conflict_info option. In this case, the launcher may drop the slot after
> noticing the disable. But the apply worker may not notice the disable and it only
> notices that the retain_conflict_info is still enabled, so it will keep maintaining
> oldest_nonremovable_xid when the slot is not created.
> 
> It is okay to handle both the race conditions, but I am worried we may miss
> some such race conditions which could lead to difficult-to-find bugs. So, at
> least for the first version of this option (aka for patches 0001 to 0003), we can
> add a condition that allows us to change retain_conflict_info only on disabled
> subscriptions. This will simplify the patch.

Agreed.

> We can make a separate patch to
> allow changing retain_conflict_info option for enabled subscriptions. That will
> make it easier to evaluate such race conditions and the solutions more deeply.

I will prepare a separate patch soon.

Here is the V33 patch set which includes the following changes:

0001:
* Renaming and typo fixes based on Shveta's comments [1]
* Comment changes suggested by Amit [2]
* Changed oldest_nonremovable_xid from FullTransactionId to TransactionId.
* Code refactoring in drop_conflict_slot_if_exists()

0002:
* Documentation updates suggested by Amit [2]
* Code modifications to adapt to the TransactionId oldest_nonremovable_xid

0003:
* Documentation improvements suggested by Shveta [3]
* Added restriction: disallow changing retain_conflict_info when sub
  is enabled or worker is alive

0004:
* Simplified race condition handling due to the new restriction from 0003

0005:
* Code updates to accommodate both the TransactionId type for
  oldest_nonremovable_xid and the new restriction from 0003

0006:
* New test case for the restriction introduced in 0003

0007:
No changes


[1] https://www.postgresql.org/message-id/CAJpy0uBSsRuVOeuo-i8R_aO0CMiORHTnEBZ9z-TDq941WqhyLA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAA4eK1KUTHbgroBRNp8_dy3Lrc%2BetPm19O1RcyRcDBgCp7EFcg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAJpy0uAJUTmSx7fAE3gbnBUzp9ZDOgkLrP5gdoysKUGbvf7vGg%40mail.gmail.com


Best Regards,
Hou zj

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Wed, Jun 4, 2025 at 6:43 PM Zhijie Hou (Fujitsu) wrote:
> 
> On Mon, Jun 2, 2025 at 2:39 PM Amit Kapila wrote:
> >
> > On Mon, May 26, 2025 at 12:46 PM Zhijie Hou (Fujitsu) 
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Attaching the V32 patch set which addressed comments in [1]~[5].
> > >
> >
> > Review comments:
> > ===============
> > *
> > +advance_conflict_slot_xmin(FullTransactionId new_xmin) { 
> > +FullTransactionId full_xmin;  FullTransactionId next_full_xid;
> > +
> > + Assert(MyReplicationSlot);
> > + Assert(FullTransactionIdIsValid(new_xmin));
> > +
> > + next_full_xid = ReadNextFullTransactionId();
> > +
> > + /*
> > + * Compute FullTransactionId for the current xmin. This handles the 
> > + case
> > + * where transaction ID wraparound has occurred.
> > + */
> > + full_xmin = FullTransactionIdFromAllowableAt(next_full_xid,
> > + MyReplicationSlot->data.xmin);
> > +
> > + if (FullTransactionIdPrecedesOrEquals(new_xmin, full_xmin)) 
> > + return;
> >
> > The above code suggests that the launcher could compute a new xmin 
> > that is less than slot's xmin. At first, this looks odd to me, but 
> > IIUC, this can happen when the user toggles retain_conflict_info 
> > flag at some random time when the launcher is trying to compute the 
> > new xmin value for the slot. One of the possible combinations of 
> > steps for this
> race could be as follows:
> >
> > 1. The subscriber has two subscriptions, A and B. Subscription A has 
> > retain_conflict_info as true, and B has retain_conflict_info as 
> > false 2. Say the launcher calls get_subscription_list(), and worker 
> > A is already
> alive.
> > 3. Assuming the apply worker will restart on changing 
> > retain_conflict_info, the user enables retain_conflict_info for subscription B.
> > 4. The launcher processes the subscription B first in the first 
> > cycle, and starts worker B. Say, worker B gets 759 as candidate_xid.
> > 5. The launcher creates the conflict detection slot, xmin = 759 6. 
> > Say a new txn happens, worker A gets 760 as candidate_xid and 
> > updates it to oldest_nonremovable_xid.
> > 7. The launcher processes the subscription A in the first cycle, and 
> > the final xmin value is 760, because it only checks the 
> > oldest_nonremovable_xid from worker A. The launcher then updates the
> value to slot.xmin.
> > 8. In the next cycle, the launcher finds that worker B has an older 
> > oldest_nonremovable_xid 759, so the minimal xid would now be 759. 
> > The launher would have retreated the slot's xmin unless we had the 
> > above check in the quoted code.
> >
> > I think the above race is possible because the system lets the 
> > changed subscription values of a subscription take effect asynchronously by workers.
> > The one more similar race condition handled by the patch is as follows:
> >
> > *
> > ...
> > + * It's necessary to use FullTransactionId here to mitigate 
> > + potential race
> > + * conditions. Such scenarios might occur if the replication slot 
> > + is not
> > + * yet created by the launcher while the apply worker has already
> > + * initialized this field. During this period, a transaction ID 
> > + wraparound
> > + * could falsely make this ID appear as if it originates from the 
> > + future
> > + * w.r.t the transaction ID stored in the slot maintained by launcher.
> > + See
> > + * advance_conflict_slot_xmin.
> > ...
> > + FullTransactionId oldest_nonremovable_xid;
> >
> > This case can happen if the user disables and immediately enables 
> > the retain_conflict_info option. In this case, the launcher may drop 
> > the slot after noticing the disable. But the apply worker may not 
> > notice the disable and it only notices that the retain_conflict_info 
> > is still enabled, so it will keep maintaining 
> > oldest_nonremovable_xid when the slot
> is not created.
> >
> > It is okay to handle both the race conditions, but I am worried we 
> > may miss some such race conditions which could lead to 
> > difficult-to-find bugs. So, at least for the first version of this 
> > option (aka for patches 0001 to 0003), we can add a condition that 
> > allows us to change retain_conflict_info only on disabled 
> > subscriptions. This will simplify the
> patch.
> 
> Agreed.
> 
> > We can make a separate patch to
> > allow changing retain_conflict_info option for enabled subscriptions.
> > That will make it easier to evaluate such race conditions and the 
> > solutions
> more deeply.
> 
> I will prepare a separate patch soon.
> 
> Here is the V33 patch set which includes the following changes:
> 
> 0001:
> * Renaming and typo fixes based on Shveta's comments [1]
> * Comment changes suggested by Amit [2]
> * Changed oldest_nonremoable_xid from FullTransactionID to TransactionID.
> * Code refactoring in drop_conflict_slot_if_exists()
> 
> 0002:
> * Documentation updates suggested by Amit [2]
> * Code modifications to adapt to TransactionID oldest_nonremoable_xid
> 
> 0003:
> * Documentation improvements suggested by Shveta [3]
> * Added restriction: disallow changing retain_conflict_info when sub
>   is enabled or worker is alive

One thing I forgot to mention. After adding the restriction, one race condition
still exists:
 
Consider an enabled subscription A with retain_conflict_info=false. If the
launcher has fetched the subscription info from get_subscription_list() but
hasn't yet started a worker for subscription A, and the user executes the
following DDLs[1] during this period, it would create a race condition where
the worker starts maintaining oldest_nonremovable_xid while the slot hasn't
been created.

To handle this, I added a check in InitializeLogRepWorker() that restarts the
worker if the retain_conflict_info option changes during startup. This would
ensure the worker runs with the latest configuration. This is also similar to
the existing enable option check in that function.
 
[1]
ALTER SUBSCRIPTION A DISABLE;
ALTER SUBSCRIPTION A SET (retain_conflict_info=true);
ALTER SUBSCRIPTION A ENABLE;
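
A sketch of the startup check described above (all names and messages here are
assumptions for illustration, not the patch code; started_with_retain_conflict_info
stands for whatever value the launcher saw when it launched this worker):

	/* In InitializeLogRepWorker(), after (re)loading the subscription: */
	if (MySubscription->retainconflictinfo != started_with_retain_conflict_info)
	{
		ereport(LOG,
				(errmsg("logical replication worker for subscription \"%s\" will restart because retain_conflict_info was changed during startup",
						MySubscription->name)));
		proc_exit(0);			/* the launcher will start a fresh worker */
	}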

> 
> 0004:
> * Simplified race condition handling due to the new restriction from 
> 0003
> 
> 0005:
> * Code updates to accommodate both the TransactionID type for
>   oldest_nonremoable_xid and the new restriction from 0003
> 
> 0006:
> * New test case for the restriction introduced in 0003
> 
> 0007:
> No changes

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Wed, Jun 4, 2025 at 4:12 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is the V33 patch set which includes the following changes:
>

Thank You for the patches, please find few comments for patch003:

1)
+ /*
+ * Skip the track_commit_timestamp check by passing it as
+ * true, since it has already been validated during CREATE
+ * SUBSCRIPTION and ALTER SUBSCRIPTION SET commands.
+ */
+ CheckSubConflictInfoRetention(sub->retainconflictinfo,
+   true, opts.enabled);
+

Is there a special reason for disabling the WARNING while enabling the
subscription? If an rci subscription was created in a disabled state and
track_commit_timestamp was enabled at that time, then there will be no
WARNING. But while enabling the sub at a later stage, it may be
possible that track_commit_timestamp is off but rci is ON.

2)

  * The worker has not yet started, so there is no valid
  * non-removable transaction ID available for advancement.
  */
+ if (sub->retainconflictinfo)
+ can_advance_xmin = false;

Shall we change comment to:
Only advance xmin when all workers for rci enabled subscriptions are
up and running.


3)

  Assert(MyReplicationSlot);
- Assert(TransactionIdIsValid(new_xmin));
  Assert(TransactionIdPrecedesOrEquals(MyReplicationSlot->data.xmin,
  new_xmin));

+ if (!TransactionIdIsValid(new_xmin))
+ return;


a)
Why have we replaced Assert with 'if' check? In which scenario do we
expect new_xmin as Invalid here?

b)
Even if we have the if-check, shall it come before:
  Assert(TransactionIdPrecedesOrEquals(MyReplicationSlot->data.xmin,
  new_xmin));
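To make the suggestion concrete, a sketch of the proposed ordering (based only on the quoted hunk, not on the actual patch):

	Assert(MyReplicationSlot);

	/* bail out before comparing an invalid xid against the slot's xmin */
	if (!TransactionIdIsValid(new_xmin))
		return;

	Assert(TransactionIdPrecedesOrEquals(MyReplicationSlot->data.xmin, new_xmin));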

4)
DisableSubscriptionAndExit:
+ /*
+ * Skip the track_commit_timestamp check by passing it as true, since it
+ * has already been validated during CREATE SUBSCRIPTION and ALTER
+ * SUBSCRIPTION SET commands.
+ */
+ CheckSubConflictInfoRetention(MySubscription->retainconflictinfo, true,
+   false);

This comment makes sense during alter-sub enable; here, shall we change it to:
Skip the track_commit_timestamp check by passing it as true, as it does
not need to be checked during subscription disable.


5)
postgres=# alter subscription sub3 set (retain_conflict_info=false);
ERROR:  cannot set option retain_conflict_info for enabled subscription

Do we need this restriction during disable of rci as well?

6)
+     <para>
+      If the <link
linkend="sql-createsubscription-params-with-retain-conflict-info"><literal>retain_conflict_info</literal></link>
+      option is altered to <literal>false</literal> and no other subscription
+      has this option enabled, the additional replication slot that was created
+      to retain conflict information will be dropped.
+     </para>

It will be good to mention the slot name as well.


7)
+ * Verify that the max_active_replication_origins and max_replication_slots
+ * configurations specified are enough for creating the subscriptions. This is
+ * required to create the replication origin and the conflict detection slot
+ * for each subscription.
  */

We should rephrase the comment; it gives the feeling that a 'conflict
detection slot' is needed for each subscription.

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Fri, Jun 6, 2025 at 1:49 PM shveta malik wrote:
> 
> On Wed, Jun 4, 2025 at 4:12 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Here is the V33 patch set which includes the following changes:
> >
> 
> please find few comments for patch003:

Thanks for the comments!

> 
> 1)
> + /*
> + * Skip the track_commit_timestamp check by passing it as
> + * true, since it has already been validated during CREATE
> + * SUBSCRIPTION and ALTER SUBSCRIPTION SET commands.
> + */
> + CheckSubConflictInfoRetention(sub->retainconflictinfo,
> +   true, opts.enabled);
> +
> 
> Is there a special reason for disabling WARNING while enabling the
> subscription? If rci subscription was created in disabled state and
> track_commit_timestamp was enabled at that time, then there will be no
> WARNING. But while enabling the sub at a later stage, it may be
> possible that track_commit_timestamp is off but rci as ON.

I feel reporting a WARNING related to track_commit_timestamp during
subscription enable DDL is a bit unnatural, since it's not directly related to
this DDL. Also, I think we do not intend to capture scenarios where
track_commit_timestamp is disabled afterwards.

> 
> 2)
> 
>   * The worker has not yet started, so there is no valid
>   * non-removable transaction ID available for advancement.
>   */
> + if (sub->retainconflictinfo)
> + can_advance_xmin = false;
> 
> Shall we change comment to:
> Only advance xmin when all workers for rci enabled subscriptions are
> up and running.

Adjusted according to your suggestion.

> 
> 
> 3)
> 
>   Assert(MyReplicationSlot);
> - Assert(TransactionIdIsValid(new_xmin));
>   Assert(TransactionIdPrecedesOrEquals(MyReplicationSlot->data.xmin,
>   new_xmin));
> 
> + if (!TransactionIdIsValid(new_xmin))
> + return;
> 
> 
> a)
> Why have we replaced Assert with 'if' check? In which scenario do we
> expect new_xmin as Invalid here?

I think it's not needed now, so removed.

> 4)
> DisableSubscriptionAndExit:
> + /*
> + * Skip the track_commit_timestamp check by passing it as true, since it
> + * has already been validated during CREATE SUBSCRIPTION and ALTER
> + * SUBSCRIPTION SET commands.
> + */
> + CheckSubConflictInfoRetention(MySubscription->retainconflictinfo, true,
> +   false);
> 
> This comment makes sense during alter-sub enable, here shall we change it
> to:
> Skip the track_commit_timestamp check by passing it as true as it is
> not needed to be checked during subscription-disable.

Changed.

> 
> 
> 5)
> postgres=# alter subscription sub3 set (retain_conflict_info=false);
> ERROR:  cannot set option retain_conflict_info for enabled subscription
> 
> Do we need this restriction during disable of rci as well?

I prefer to maintain the restriction on both enabling and disabling operations
for the sake of simplicity, since the primary aim of this restriction is to
keep the logic straightforward and eliminate the need to think and address all
potential race conditions. I think restricting only the enable operation is
also OK and would not introduce new issues, but it might be more prudent to
keep things simple in the first version. Once the main patches stabilize, we
can consider easing or removing the entire restriction.

> 
> 6)
> +     <para>
> +      If the <link
> linkend="sql-createsubscription-params-with-retain-conflict-info"><literal
> >retain_conflict_info</literal></link>
> +      option is altered to <literal>false</literal> and no other subscription
> +      has this option enabled, the additional replication slot that was created
> +      to retain conflict information will be dropped.
> +     </para>
> 
> It will be good to mention the slot name as well.

Added.

> 
> 
> 7)
> + * Verify that the max_active_replication_origins and max_replication_slots
> + * configurations specified are enough for creating the subscriptions. This is
> + * required to create the replication origin and the conflict detection slot
> + * for each subscription.
>   */
> 
> We shall rephrase the comment, it gives the feeling that a 'conflict
> detection slot' is needed for each subscription.

Right, changed.

Here is the V34 patch set which includes the following changes:

0001:
* pgindent

0002:
* pgindent

0003:
* pgindent
* Addressed the above comments from Shveta
* Improved the comments atop the new restriction.
* Ensured that the worker restarts when the retain_conflict_info was enabled
  during startup regardless of the existence of the slot.

  In V33, we relied on the existence of the slot to decide whether the worker
  needs to restart on a startup option change. But we found that even if the
  slot exists when launching the apply worker with (retain_conflict_info=false), the
  slot could be removed soon by the launcher since the launcher might find
  there is no subscription that enables retain_conflict_info. So the worker
  could start to maintain the oldest_nonremovable_xid when the slot is not yet
  created.

0004:
* pgindent
* Fixed some inaccurate and wrong descriptions in the document.

0005:
* pgindent

0006:
* pgindent

0007:
* pgindent

0008:
* A new patch to remove the restriction on altering retain_conflict_info when
the subscription is enabled, and to resolve race condition issues caused by the
new option value being asynchronously acknowledged by the launcher and apply
workers. It changes the oldest_nonremovable_xid to FullTransactionId so that
even if a wraparound happens, we can correctly identify whether the transaction
ID is an old or new one. Additionally, it adds a safeguard check when advancing
slot.xmin to prevent backward movement.

Patch 0008 is kept as a .txt file to prevent the buildfarm (BF) from testing it and failing at this stage.

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Wed, Jun 4, 2025 at 4:12 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is the V33 patch set which includes the following changes:
>

Few comments:
1.
+ if (sub->enabled)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot set option %s for enabled subscription",
+ "retain_conflict_info")));

Isn't it better to call CheckAlterSubOption() for this check, as we do
for failover and two_phase options?
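For illustration, a rough sketch of that reuse, assuming retain_conflict_info goes through the same helper and signature as failover and two_phase (no publisher-side slot change is needed for this option):

	CheckAlterSubOption(sub, "retain_conflict_info",
						false /* slot_needs_update */ , isTopLevel);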

2.
postgres=# Alter subscription sub1 set (retain_conflict_info=true);
ERROR:  cannot set option retain_conflict_info for enabled subscription
postgres=# Alter subscription sub1 disable;
ALTER SUBSCRIPTION
postgres=# Alter subscription sub1 set (retain_conflict_info=true);
WARNING:  information for detecting conflicts cannot be purged when
the subscription is disabled
ALTER SUBSCRIPTION

The above looks odd to me because first we didn't allow setting the
option for enabled subscription, and then when the user disabled the
subscription, a WARNING is issued. Isn't it better to give NOTICE
like: "enable the subscription to avoid accumulating deleted rows for
detecting conflicts" in the above case?

Also in this,
postgres=# Alter subscription sub1 set (retain_conflict_info=true);
WARNING:  information for detecting conflicts cannot be fully retained
when "track_commit_timestamp" is disabled
WARNING:  information for detecting conflicts cannot be purged when
the subscription is disabled
ALTER SUBSCRIPTION

What do we mean by this WARNING message? If track_commit_timestamp is
not enabled, we won't be able to detect certain conflicts, including
update_delete, but how can it lead to not retaining information
required for conflict detection? BTW, shall we consider giving ERROR
instead of WARNING for this case because without
track_commit_timestamp, there is no benefit in retaining deleted rows?
If we just want to make the user aware to enable
track_commit_timestamp to detect conflicts, then we can even consider
making this a NOTICE.

postgres=# Alter subscription sub1 Enable;
ALTER SUBSCRIPTION
postgres=# Alter subscription sub1 set (retain_conflict_info=false);
ERROR:  cannot set option retain_conflict_info for enabled subscription
postgres=# Alter subscription sub1 disable;
WARNING:  information for detecting conflicts cannot be purged when
the subscription is disabled
ALTER SUBSCRIPTION

Here, we should have a WARNING like: "deleted rows to detect conflicts
would not be removed till the subscription is enabled"; this should be
followed by errdetail like: "Consider setting retain_conflict_info to
false."

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Fri, Jun 6, 2025 at 7:34 PM Amit Kapila wrote:

> 
> On Wed, Jun 4, 2025 at 4:12 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Here is the V33 patch set which includes the following changes:
> >
> 
> Few comments:
> 1.
> + if (sub->enabled)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot set option %s for enabled subscription",
> + "retain_conflict_info")));
> 
> Isn't it better to call CheckAlterSubOption() for this check, as we do
> for failover and two_phase options?

Moved.

> 
> 2.
> postgres=# Alter subscription sub1 set (retain_conflict_info=true);
> ERROR:  cannot set option retain_conflict_info for enabled subscription
> postgres=# Alter subscription sub1 disable;
> ALTER SUBSCRIPTION
> postgres=# Alter subscription sub1 set (retain_conflict_info=true);
> WARNING:  information for detecting conflicts cannot be purged when
> the subscription is disabled
> ALTER SUBSCRIPTION
> 
> The above looks odd to me because first we didn't allow setting the
> option for enabled subscription, and then when the user disabled the
> subscription, a WARNING is issued. Isn't it better to give NOTICE
> like: "enable the subscription to avoid accumulating deleted rows for
> detecting conflicts" in the above case?

Yes, a NOTICE would be better.

I think we normally only describe the current situation of the operation in a
NOTICE message, and the suggested message sounds like a hint.
So I used the following message:

"deleted rows will continue to accumulate for detecting conflicts until the subscription is enabled"

> 
> Also in this,
> postgres=# Alter subscription sub1 set (retain_conflict_info=true);
> WARNING:  information for detecting conflicts cannot be fully retained
> when "track_commit_timestamp" is disabled
> WARNING:  information for detecting conflicts cannot be purged when
> the subscription is disabled
> ALTER SUBSCRIPTION
> 
> What do we mean by this WARNING message? If track_commit_timestamp is
> not enabled, we won't be able to detect certain conflicts, including
> update_delete, but how can it lead to not retaining information
> required for conflict detection? BTW, shall we consider giving ERROR
> instead of WARNING for this case because without
> track_commit_timestamp, there is no benefit in retaining deleted rows?
> If we just want to make the user aware to enable
> track_commit_timestamp to detect conflicts, then we can even consider
> making this a NOTICE.

I think it's an unexpected case that track_commit_timestamp is not enabled, so
NOTICE may not be appropriate. Giving ERROR is also OK, but since the user can
change the track_commit_timestamp setting at any time after creating/modifying a
subscription, we can't catch all cases, so we considered simply issuing a
warning directly and documenting this case.

> postgres=# Alter subscription sub1 Enable;
> ALTER SUBSCRIPTION
> postgres=# Alter subscription sub1 set (retain_conflict_info=false);
> ERROR:  cannot set option retain_conflict_info for enabled subscription
> postgres=# Alter subscription sub1 disable;
> WARNING:  information for detecting conflicts cannot be purged when
> the subscription is disabled
> ALTER SUBSCRIPTION
> 
> Here, we should have a WARNING like: "deleted rows to detect conflicts
> would not be removed till the subscription is enabled"; this should be
> followed by errdetail like: "Consider setting retain_conflict_info to
> false."

Changed as suggested.

Here is the V35 patch set which includes the following changes:

0001:
No change.

0002:
* Added an errdetail for reserved slot name error per off-list discussion with Shveta.
* Moved the code in the launcher's foreach loop to a new function to improve readability.

0003:
* Addressed all above comments sent by Amit.
* Adjusted some comments per off-list discussion with Amit.
* Check track_commit_timestamp when enabling the subscription. This is to avoid
 passing track_commit_timestamp as a parameter to the check function.

0004:
Rebased

0005:
Rebased

0006:
Rebased

0007:
Rebased

0008:
Rebased

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Tue, Jun 10, 2025 at 11:55 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is the V35 patch set which includes the following changes:
>

Thank You for the patches, few comments:

1)
compute_min_nonremovable_xid:

+ /*
+ * Stop advancing xmin if an invalid non-removable transaction ID is
+ * found, otherwise update xmin.
+ */
+ if (!TransactionIdIsValid(nonremovable_xid))
+ *can_advance_xmin = false;

IMO, we cannot have an invalid nonremovable_xid here. It may be a
possibility in a later patch where we do stop-conflict-detection for a
worker and reset nonremovable_xid to Invalid. But in patch 0003, should
this rather be an Assert, as we want to ensure that the worker's
oldest_nonremovable_xid is set to the slot's xmin by this time?
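A sketch of the suggested change for the patch-0003 version (variable name taken from the quoted hunk; the rest of the function stays as is):

	/* every rci-enabled worker should have set its xid by this point */
	Assert(TransactionIdIsValid(nonremovable_xid));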

2)
postgres=# create subscription sub1  connection 'dbname=postgres
host=localhost user=shveta port=5433' publication pub1 WITH
(retain_conflict_info = true);
NOTICE:  deleted rows will continue to accumulate for detecting
conflicts until the subscription is enabled
NOTICE:  created replication slot "sub1" on publisher
CREATE SUBSCRIPTION

postgres=# alter subscription sub2 enable;
NOTICE:  deleted rows will continue to accumulate for detecting
conflicts until the subscription is enabled

We should not have this 'NOTICE:  deleted rows' in the above two commands.


3)
+
+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
+DROP SUBSCRIPTION regress_testsub;

Shall we enable regress_testsub as well before we drop it to catch
unexpected notices/warnings if any?


4)
postgres=# alter subscription sub2 disable;
WARNING:  commit timestamp and origin data required for detecting
conflicts won't be retained
HINT:  Consider setting "track_commit_timestamp" to true.

Do we need "track_commit_timestamp" related WARNING while disabling
rci sub as well? I feel it should be there while enabling it alone.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Thu, Jun 12, 2025 at 11:34 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Here is the V35 patch set which includes the following changes:
>

Thank You for the patches. Few comments:

1)
Since we now create a slot for rci-enabled subscriptions, it will require
wal_level >= replica even on subscribers. We should mention this in the
docs.

2)
postgres=# alter subscription sub3 set (retain_conflict_info=true);
NOTICE:  deleted rows will continue to accumulate for detecting
conflicts until the subscription is enabled
ALTER SUBSCRIPTION

postgres=# alter subscription sub3 disable;
WARNING:  deleted rows to detect conflicts would not be removed until
the subscription is enabled
HINT:  Consider setting retain_conflict_info to false.

I feel we should have the same message in both cases, as we are
trying to convey the exact same thing concerning deleted-row
accumulation.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Thu, Jun 12, 2025 at 4:22 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Thu, Jun 12, 2025 at 11:34 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> >
> > Here is the V35 patch set which includes the following changes:
> >
>
> Thank You for the patches. Few comments:
>
> 1)
> Since now we create slot for rci enabled subscription, it will require
> wal_level >= replica even on subscribers. We shall mention this in the
> docs.
>
> 2)
> postgres=# alter subscription sub3 set (retain_conflict_info=true);
> NOTICE:  deleted rows will continue to accumulate for detecting
> conflicts until the subscription is enabled
> ALTER SUBSCRIPTION
>
> postgres=# alter subscription sub3 disable;
> WARNING:  deleted rows to detect conflicts would not be removed until
> the subscription is enabled
> HINT:  Consider setting retain_conflict_info to false.
>
> I feel we shall have the same message in both the cases as we are
> trying to convey the exact same thing concerning deleted rows
> accumulation..
>

Few more comments:

3)

+static void
+create_conflict_slot_if_not_exists(void)
+{
- /* Exit early if the replication slot is already created and acquired */
+ /* Exit early, if the replication slot is already created and acquired */
  if (MyReplicationSlot)
  return;

- /* If the replication slot exists, acquire it and exit */
+ /* If the replication slot exists, acquire it, and exit */
  if (acquire_conflict_slot_if_exists())
  return;

We never release the slot explicitly in the launcher and thus do not
need the 'If the replication slot exists, acquire it' code part here. For
the cases 1) when the slot is dropped and recreated, and 2) when the slot is
acquired by the launcher on restart, the 'if (MyReplicationSlot)' check is enough.

4)
/*
 * Common checks for altering failover and two_phase options.
 */
 @@ -1051,7 +1075,8 @@ CheckAlterSubOption(Subscription *sub, const
char *option,
  * two_phase options.
  */
  Assert(strcmp(option, "failover") == 0 ||
-    strcmp(option, "two_phase") == 0);
+    strcmp(option, "two_phase") == 0 ||
+    strcmp(option, "retain_conflict_info") == 0);

Please update the comment atop CheckAlterSubOption to mention
retain_conflict_info as well.

5)
In CheckAlterSubOption(), we need to ensure that when
'slot_needs_update' is true, it is either failover or two_phase but
not rci. Such an Assert can be added.
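For example, a sketch of such an Assert, mirroring the strcmp checks quoted in comment 4 above:

	/* only failover and two_phase are expected to require a slot update */
	Assert(!slot_needs_update ||
		   strcmp(option, "failover") == 0 ||
		   strcmp(option, "two_phase") == 0);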

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Mon, Jun 16, 2025 at 9:29 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> 0001:
> * Removed the slot acquisition as suggested above.
>
> 0002:
> * Addressed the comments above.
>

Thanks for the patches.

In advance_conflict_slot_xmin(), if new_xmin is the same as the slot's current
xmin, then shall we simply return without doing any slot update?
The below Assert indicates that new_xmin can be the same as the slot's current
xmin:
Assert(TransactionIdPrecedesOrEquals(MyReplicationSlot->data.xmin, new_xmin));
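A sketch of the suggested early return (illustration only, not the patch code):

	/* nothing to do if xmin would not move */
	if (TransactionIdEquals(MyReplicationSlot->data.xmin, new_xmin))
		return;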

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Tue, Jun 17, 2025 at 4:26 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Mon, Jun 16, 2025 at 7:37 PM Amit Kapila wrote:
> >
> >
> > 3. Isn't the new check for logical slots in
> > check_new_cluster_subscription_configuration() somewhat redundant with
> > the previous check done in check_new_cluster_logical_replication_slots()?
> > Can't we combine both?
>
> Merged as suggested.
>

Okay, it is better to update the comments atop
check_new_cluster_replication_slots() to reflect the new functionality
as well.

After the upgrade, there will be a window where the launcher could
take some time to create the conflict_slot if there exists any
subscription that has retain_conflict_info enabled, and during that
window, the update_delete conflicts won't be detected reliably. To
close that, we can probably create the conflict_slot during upgrade,
if required.

Now, because we don't copy commit_ts during upgrade, we still won't be
able to detect origin_differ conflicts reliably after upgrade. This
can happen in cases where we want to detect conflicts for the
transactions that were pending to replicate before the upgrade. Note,
we don't ensure that the subscriber receives and apply all the
transactions before the upgrade.

Similarly, since no slot information is preserved by the upgrade to protect
the rows from being vacuumed on the upgraded subscriber, the
update_delete conflict also may not be detected reliably for the
transactions pending before the upgrade. I think we need to mention
this in the docs so that users can try to ensure that all pending
transactions have been applied before upgrading the subscriber, if
they want to detect all possible conflicts reliably.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
>
> Here is the V38 patch set which includes the following changes:
>

Thank You for the patches. Few comments:

1)
+   <para>
+    Note that commit timestamps and origin data retained by enabling the
+    <link
linkend="sql-createsubscription-params-with-retain-conflict-info"><literal>retain_conflict_info</literal></link>
+    option will not be preserved during the upgrade. As a
+    result, the upgraded subscriber might be unable to detect conflicts or log
+    relevant commit timestamps and origins when applying changes from the
+    publisher occurring during the upgrade.
+   </para>

This statement is true even for changes pending from 'before' the
upgrade. So we should change the last line where we mention 'during the
upgrade'.

2)
+         <para>
+          Note that the information for conflict detection cannot be purged if
+          the subscription is disabled; thus, the information will accumulate
+          until the subscription is enabled. To prevent excessive accumulation,
+          it is recommended to disable <literal>retain_conflict_info</literal>
+          if the subscription will be inactive for an extended period.
+         </para>

I think this can be put in WARNING or CAUTION tags, as this is
something which, if neglected, can result in system bloat.

3)
postgres=# create subscription sub3 connection 'dbname=postgres
host=localhost user=shveta port=5433' publication pub2 WITH (failover
= true, retain_conflict_info = true);
WARNING:  commit timestamp and origin data required for detecting
conflicts won't be retained
HINT:  Consider setting "track_commit_timestamp" to true.
ERROR:  subscription "sub3" already exists

In CreateSubscription(), we should move CheckSubConflictInfoRetention()
after the duplicate-subscription check. The above WARNING combined with the
existing-subscription ERROR looks odd.

4)
In check_new_cluster_replication_slots(), earlier we were not doing
any checks for 'count of logical slots on new cluster' if there were
no logical slots on old cluster (i.e. nslots_on_old is 0). Now we are
doing a 'nslots_on_new' related check even when 'nslots_on_old' is 0
for the case when RCI is enabled. Shouldn't we skip 'nslots_on_new'
check when 'nslots_on_old' is 0?

5)
We refer to 'update_deleted' in patch 0001's comments even though that
conflict type is not yet introduced there. Is that okay?

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Thu, Jun 19, 2025 at 4:34 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> >
> > Here is the V38 patch set which includes the following changes:
> >
>
> Thank You for the patches. Few comments:
>
> 1)
> +   <para>
> +    Note that commit timestamps and origin data retained by enabling the
> +    <link
linkend="sql-createsubscription-params-with-retain-conflict-info"><literal>retain_conflict_info</literal></link>
> +    option will not be preserved during the upgrade. As a
> +    result, the upgraded subscriber might be unable to detect conflicts or log
> +    relevant commit timestamps and origins when applying changes from the
> +    publisher occurring during the upgrade.
> +   </para>
>
> This statement is true even for changes pending from 'before' the
> upgrade.  So we shall change last line where we mention 'during the
> upgrade'
>
> 2)
> +         <para>
> +          Note that the information for conflict detection cannot be purged if
> +          the subscription is disabled; thus, the information will accumulate
> +          until the subscription is enabled. To prevent excessive accumulation,
> +          it is recommended to disable <literal>retain_conflict_info</literal>
> +          if the subscription will be inactive for an extended period.
> +         </para>
>
> I think this can be put in WARNING or CAUTION tags as this is
> something which if neglected can result in system bloat.
>
> 3)
> postgres=# create subscription sub3 connection 'dbname=postgres
> host=localhost user=shveta port=5433' publication pub2 WITH (failover
> = true, retain_conflict_info = true);
> WARNING:  commit timestamp and origin data required for detecting
> conflicts won't be retained
> HINT:  Consider setting "track_commit_timestamp" to true.
> ERROR:  subscription "sub3" already exists
>
> In CreateSubscription(), we shall move CheckSubConflictInfoRetention()
> after sub-duplicity check. Above WARNING with the existing-sub ERROR
> looks odd.
>
> 4)
> In check_new_cluster_replication_slots(), earlier we were not doing
> any checks for 'count of logical slots on new cluster' if there were
> no logical slots on old cluster (i.e. nslots_on_old is 0). Now we are
> doing a 'nslots_on_new' related check even when 'nslots_on_old' is 0
> for the case when RCI is enabled. Shouldn't we skip 'nslots_on_new'
> check when 'nslots_on_old' is 0?
>
> 5)
> We refer to 'update_deleted' in patch1's comment when the conflict is
> not yet created. Is it okay?
>

Please find few more comments:

6)
We can add in the docs that pg_conflict_detection is a physical slot with
no WAL reserved.

7)
We should error or give a warning (whichever is appropriate) in
ReplicationSlotAcquire() (similar to ReplicationSlotValidateName()) so
that if it is the pg_conflict_detection slot, then acquiring it is possible
only if the process is the launcher. This will prevent:

a) manual/accidental drop of the slot by a user before the launcher could acquire it.
b) usage of the slot in primary_slot_name before the launcher could acquire it.

It will also make the slot-advance error more meaningful. Currently it
gives the below error:

postgres=# select pg_replication_slot_advance
('pg_conflict_detection', pg_current_wal_lsn());
ERROR:  replication slot "pg_conflict_detection" cannot be advanced
DETAIL:  This slot has never previously reserved WAL, or it has been
invalidated.
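For illustration, a rough sketch of such a guard in ReplicationSlotAcquire() (the slot variable name and the exact launcher check are assumptions):

	if (strcmp(NameStr(s->data.name), "pg_conflict_detection") == 0 &&
		!IsLogicalLauncher())
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("replication slot \"%s\" is reserved for conflict detection",
						NameStr(s->data.name))));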

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Jun 20, 2025 at 4:48 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is the V39 patch set which includes the following changes:
>

1.
-static void
-create_conflict_slot_if_not_exists(void)
+void
+ApplyLauncherCreateConflictDetectionSlot(void)

I am not so sure about adding ApplyLauncher in front of this function
name. I see most other functions exposed from this file add such a prefix, but
this one looks odd to me as it has nothing specific to the launcher, even
though we use it in the launcher. How about
CreateConflictDetectionSlot(void)?

2.
 static void
 create_logical_replication_slots(void)
 {
+ if (!count_old_cluster_logical_slots())
+ return;
+

Doing this count twice (once here and once at the caller of
create_logical_replication_slots) seems redundant.

Apart from the above, please find attached a diff patch atop 0001,
0002, and 0003. I think the first three patches look to be in reasonable shape
now; can we merge them (0001, 0002, 0003)?

--
With Regards,
Amit Kapila.

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
>
> Here is the V39 patch set which includes the following changes:
>

Few trivial comments:

1)
Currently we have this error and detail:

ERROR:  Enabling retain_conflict_info requires "wal_level" >= "replica"
DETAIL:  A replication slot must be created to retain conflict information.

Shall we change it to something like:

msg: "wal_level is insufficient to create slot required by retain_conflict_info"
hint: "wal_level must be set to replica or logical at server start"

2)

+   <para>
+    Note that commit timestamps and origin data are not preserved during the
+    upgrade. Consequently, even with
+    <link
linkend="sql-createsubscription-params-with-retain-conflict-info"><literal>retain_conflict_info</literal></link>
+    enabled, the upgraded subscriber might be unable to detect conflicts or log
+    relevant commit timestamps and origins when applying changes from the
+    publisher occurring before or during the upgrade. To prevent this
issue, the
+    user must ensure that all potentially conflicting changes are fully
+    replicated to the subscriber before proceeding with the upgrade.
+   </para>

Shall we have a NOTE tag here? This page has existing NOTE and WARNING
tags for similar situations where we are advising something to users.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Mon, Jun 23, 2025 at 4:20 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Here is the V40 patch set

Thanks for the patches. Few comments:

1)
In get_subscription_info(), we are doing COUNT of rci-subscriptions
using below query:
SELECT count(*) AS nsub, COUNT(CASE WHEN subretainconflictinfo THEN 1
END) AS retain_conflict_info FROM pg_catalog.pg_subscription;

And then we are doing:
cluster->sub_retain_conflict_info = (strcmp(PQgetvalue(res, 0,
i_retain_conflict_info), "1") == 0);

i.e. get the value and compare it with "1". If the count of such subs is,
say, 2, won't it fail and set sub_retain_conflict_info to 0?

2)
create_logical_replication_slots(void)
{
+ if (!count_old_cluster_logical_slots())
+ return;
+

We shall get rid of count_old_cluster_logical_slots() here as the
caller is checking it already.

3)
We can move the 'old_cluster.sub_retain_conflict_info' check from
create_conflict_detection_slot() to its caller. Then it will be more
informative and consistent with how we check migrate_logical_slots
outside of create_conflict_detection_slot()

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tue, Jun 24, 2025 at 6:22 PM shveta malik wrote:
> 
> On Mon, Jun 23, 2025 at 4:20 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> >
> > Here is the V40 patch set
> 
> Thanks for the patches. Few comments:
> 
> 1)
> In get_subscription_info(), we are doing COUNT of rci-subscriptions using
> below query:
> SELECT count(*) AS nsub, COUNT(CASE WHEN subretainconflictinfo THEN 1
> END) AS retain_conflict_info FROM pg_catalog.pg_subscription;
> 
> And then we are doing:
> cluster->sub_retain_conflict_info = (strcmp(PQgetvalue(res, 0,
> i_retain_conflict_info), "1") == 0);
> 
> i.e. get the value and compare with "1".  If the count of such subs is say 2,
> won't it fail and will set sub_retain_conflict_info as 0?

Right, it could return wrong results. I have changed it to count(*) xx > 0
so that it directly returns a boolean value.
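For reference, a sketch of how the adjusted query and check could look in get_subscription_info() (the FILTER form is just one way to produce a boolean; conn, res and i_retain_conflict_info come from the surrounding pg_upgrade code):

	res = executeQueryOrDie(conn,
							"SELECT count(*) AS nsub, "
							"count(*) FILTER (WHERE subretainconflictinfo) > 0 "
							"AS retain_conflict_info "
							"FROM pg_catalog.pg_subscription");

	/* a boolean comes back from libpq as "t" or "f" */
	cluster->sub_retain_conflict_info =
		(strcmp(PQgetvalue(res, 0, i_retain_conflict_info), "t") == 0);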

> 
> 2)
> create_logical_replication_slots(void)
> {
> + if (!count_old_cluster_logical_slots())
> + return;
> +
> 
> We shall get rid of count_old_cluster_logical_slots() here as the caller is
> checking it already.

Removed.

> 
> 3)
> We can move the 'old_cluster.sub_retain_conflict_info' check from
> create_conflict_detection_slot() to its caller. Then it will be more informative
> and consistent with how we check migrate_logical_slots outside of
> create_conflict_detection_slot()

Moved.

Here is the V41 patch set which includes the following changes:

0001:
* Rebased due to recent commit fd51941.
* Addressed the comments above.
* Improved some documentation stuff.
* Improved the status message when creating
  conflict detection slot in pg_upgrade

0002:
No change

0003:
No change

0004:
No change

0005:
No change

0006:
Rebased due to recent commit fd51941.

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
>
> Here is the V41 patch set which includes the following changes:
>

Thanks for the patches. Few trivial things:

1)
In ReplicationSlotAcquire(), does it make more sense to move the error
after checking the slot's existence first? If a user is trying to use
a slot which does not exist, they should first get that error instead of
the 'slot is reserved' error.

2)
When the max_replication_slots limit is reached and the user is trying to
enable rci for the first time, the launcher will give an error in the log file:

ERROR:  all replication slots are in use
HINT:  Free one or increase "max_replication_slots".
LOG:  background worker "logical replication launcher" (PID 13147)
exited with exit code 1

It is not clear from this message what the launcher was actually
trying to create. A log message in CreateConflictDetectionSlot() saying
"Creating conflict-detection slot" may help here.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Wed, Jun 25, 2025 at 8:38 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is the V41 patch set which includes the following changes:
>

Few comments on 0004
===================
1.
+
+# Remember the next transaction ID to be assigned
+my $next_xid = $node_A->safe_psql('postgres', "SELECT txid_current() + 1;");
+
+# Confirm that the xmin value is updated
+ok( $node_A->poll_query_until(
+ 'postgres',
+ "SELECT xmin = $next_xid from pg_replication_slots WHERE slot_name =
'pg_conflict_detection'"
+ ),
+ "the xmin value of slot 'pg_conflict_detection' is updated on Node A");
+

Why use an indirect way to verify that the vacuum can now remove rows?
Even if we want to check that the conflict slot is getting updated
properly, we should verify that the vacuum has removed the deleted
rows. Also, please improve comments for this test, as it is not very
clear why you are expecting the latest xid value of conflict_slot.

2.
+# Alter failover for enabled subscription
+my ($cmdret, $stdout, $stderr) = $node_A->psql('postgres',
+ "ALTER SUBSCRIPTION $subname_AB SET (retain_conflict_info = true)");
+ok( $stderr =~
+   /ERROR:  cannot set option \"retain_conflict_info\" for enabled
subscription/,
+ "altering retain_conflict_info is not allowed for enabled subscription");
+
+# Disable the subscription
+($cmdret, $stdout, $stderr) = $node_A->psql('postgres',
+ "ALTER SUBSCRIPTION $subname_AB DISABLE;");
+ok( $stderr =~
+   /WARNING:  deleted rows to detect conflicts would not be removed
until the subscription is enabled/,
+ "A warning is raised on disabling the subscription if
retain_conflict_info is enabled");
+
+# Alter failover for disabled subscription
+($cmdret, $stdout, $stderr) = $node_A->psql('postgres',
+ "ALTER SUBSCRIPTION $subname_AB SET (retain_conflict_info = true);");
+ok( $stderr =~
+   /NOTICE:  deleted rows to detect conflicts would not be removed
until the subscription is enabled/,
+ "altering retain_conflict_info is allowed for disabled subscription");

In all places, the comments use failover as an option name, whereas it
is testing retain_conflict_info.

3. It is better to merge the 0004 into 0001 as it tests the core part
of the functionality added by 0001.

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Wed, Jun 25, 2025 at 2:57 PM shveta malik wrote:
> 
> >
> > Here is the V41 patch set which includes the following changes:
> >
> 
> Thanks for the patches. Few trivial things:
> 
> 1)
> In ReplicationSlotAcquire(), does it make more sense to move the error after
> checking the slot's existence first? If a user is trying to use a slot which does
> not exist, he should first get that error instead of 'slot is reserved' error.
> 
> 2)
> When max_replication_slots limit is reached and user is trying to enable rci for
> the first time, launcher will give error in log file:
> 
> ERROR:  all replication slots are in use
> HINT:  Free one or increase "max_replication_slots".
> LOG:  background worker "logical replication launcher" (PID 13147) exited
> with exit code 1
> 
> It is not clear from this message as to what launcher was actually trying to
> create.  A log-msg  in CreateConflictDetectionSlot() saying "Creating
> conflict-detection slot" may help here.

Thanks for the comments. All of them look good to me and
have been addressed in V42.

Best Regards,
Hou zj

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Wed, Jun 25, 2025 at 7:27 PM Amit Kapila wrote:
> 
> On Wed, Jun 25, 2025 at 8:38 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Here is the V41 patch set which includes the following changes:
> >
> 
> Few comments on 0004
> ===================
> 1.
> +
> +# Remember the next transaction ID to be assigned my $next_xid =
> +$node_A->safe_psql('postgres', "SELECT txid_current() + 1;");
> +
> +# Confirm that the xmin value is updated ok( $node_A->poll_query_until(
> +'postgres',  "SELECT xmin = $next_xid from pg_replication_slots WHERE
> +slot_name =
> 'pg_conflict_detection'"
> + ),
> + "the xmin value of slot 'pg_conflict_detection' is updated on Node
> + A");
> +
> 
> Why use an indirect way to verify that the vacuum can now remove rows?
> Even if we want to check that the conflict slot is getting updated properly, we
> should verify that the vacuum has removed the deleted rows. Also, please
> improve comments for this test, as it is not very clear why you are expecting the
> latest xid value of conflict_slot.

I agree that testing VACUUM is straightforward. But I think there is a gap
between applying remote changes and updating slot.xmin in the launcher.
Therefore, it's necessary to wait for the launcher to update the slot before
testing whether VACUUM can remove the dead tuple.

I have improved the comments and added the VACUUM test as
suggested after the slot.xmin test.

> 
> 2.
> +# Alter failover for enabled subscription my ($cmdret, $stdout,
> +$stderr) = $node_A->psql('postgres',  "ALTER SUBSCRIPTION
> $subname_AB
> +SET (retain_conflict_info = true)"); ok( $stderr =~
> +   /ERROR:  cannot set option \"retain_conflict_info\" for enabled
> subscription/,
> + "altering retain_conflict_info is not allowed for enabled
> + subscription");
> +
> +# Disable the subscription
> +($cmdret, $stdout, $stderr) = $node_A->psql('postgres',  "ALTER
> +SUBSCRIPTION $subname_AB DISABLE;"); ok( $stderr =~
> +   /WARNING:  deleted rows to detect conflicts would not be removed
> until the subscription is enabled/,
> + "A warning is raised on disabling the subscription if
> retain_conflict_info is enabled");
> +
> +# Alter failover for disabled subscription ($cmdret, $stdout, $stderr)
> += $node_A->psql('postgres',  "ALTER SUBSCRIPTION $subname_AB SET
> +(retain_conflict_info = true);"); ok( $stderr =~
> +   /NOTICE:  deleted rows to detect conflicts would not be removed
> until the subscription is enabled/,
> + "altering retain_conflict_info is allowed for disabled subscription");
> 
> In all places, the comments use failover as an option name, whereas it is
> testing retain_conflict_info.

Changed.

> 
> 3. It is better to merge the 0004 into 0001 as it tests the core part of the
> functionality added by 0001.

Merged.

Here is the V41 patch set which includes the following changes:

0001:
* ran pgindent
* Merge the original v41-0004 tap-test.
* Addressed the comments above.
* Addressed the Shveta's comments[1].

0002:
* ran pgindent

0003:
* ran pgindent

0004:
* ran pgindent

0005:
* ran pgindent

[1] https://www.postgresql.org/message-id/CAJpy0uAg1mTcy00nR%3DVAx1nTJYRkQF84YOY4_YKh8L53A1t6sA%40mail.gmail.com

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Thu, Jun 26, 2025 at 8:31 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Thanks for the comments. All of them look good to me and
> have been addressed in V42.
>

Thank You for the patches. Few comments.

t/035_conflicts.pl:

1)
Both the subscriptions subname_BA and subname_AB have rci enabled
during CREATE sub itself. And later, in the second test, we are trying
to enable rci of subname_AB to test the WARNING and NOTICE, but rci is
already enabled. Shall we have one CREATE sub with rci enabled while the
other CREATE sub uses the default rci? Then we can try to enable rci of
the second sub later and check that the pg_conflict_detection slot has been
created once we enable rci. This way, it will cover more scenarios.

2)
+$node_B->safe_psql('postgres', "UPDATE tab SET b = 3 WHERE a = 1;");
+$node_A->safe_psql('postgres', "DELETE FROM tab WHERE a = 1;");
+
+$node_A->wait_for_catchup($subname_BA);

Can you please help me understand why we are doing wait_for_catchup
here? Do we want the DELETE to be replicated from A to B? IMO, this step
is not essential for our test, as we already poll on node_A until
xmin = $next_xid in pg_conflict_detection, and that should suffice to
ensure both the DELETE and the UPDATE are replicated from one node to the other.

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Thu, Jun 26, 2025 at 4:28 PM shveta malik wrote:
> 
> On Thu, Jun 26, 2025 at 8:31 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Thanks for the comments. All of them look good to me and
> > have been addressed in V42.
> >
> 
> Thank You for the patches. Few comments.
> 
> t/035_conflicts.pl:
> 
> 1)
> Both the subscriptions subname_BA and subname_AB have rci enabled
> during CREATE sub itself. And later in the second test, we are trying
> to enable rci of subname_AB  to test WARNING and NOTICE, but rci is
> already enabled. Shall we have one CREATE sub with rci enabled while
> another CREATE sub with default rci. And then we try to enable rci of
> the second sub later and check pg_conflict_detection slot has been
> created once we enabled rci. This way, it will cover more scenarios.

Agreed and changed as suggested. I removed the test for WARNING since the
message is the same as the NOTICE and it seems not worthwhile to disable
the subscription again to verify one message.

> 
> 2)
> +$node_B->safe_psql('postgres', "UPDATE tab SET b = 3 WHERE a = 1;");
> +$node_A->safe_psql('postgres', "DELETE FROM tab WHERE a = 1;");
> +
> +$node_A->wait_for_catchup($subname_BA);
> 
> Can you please help me understand why we are doing  wait_for_catchup
> here? Do we want DELETE to be replicated from A to B? IMO, this step
> is not essential for our test as we have node_A->poll_query  until
> xmin = $next_xid in pg_conflict_detection and that should suffice to
> ensure both DELETE and UPDATE are replicated from one to other.

I think this step belongs to a later patch to ensure the DELETE operation is
replicated to Node B, allowing us to verify the `delete_origin_differ`
conflicts detected there. So, I moved it to the later patches.

Here is the V43 patch set which includes the following changes:

0001:
* Addressed the comments above.

0002:
No change.

0003:
No change.

0004:
* Moved some tests from 0001 to here.

0005:
No change.


Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Fri, Jun 27, 2025 at 7:58 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Thu, Jun 26, 2025 at 4:28 PM shveta malik wrote:
> >
> > On Thu, Jun 26, 2025 at 8:31 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Thanks for the comments. All of them look good to me and
> > > have been addressed in V42.
> > >
> >
> > Thank You for the patches. Few comments.
> >
> > t/035_conflicts.pl:
> >
> > 1)
> > Both the subscriptions subname_BA and subname_AB have rci enabled
> > during CREATE sub itself. And later in the second test, we are trying
> > to enable rci of subname_AB  to test WARNING and NOTICE, but rci is
> > already enabled. Shall we have one CREATE sub with rci enabled while
> > another CREATE sub with default rci. And then we try to enable rci of
> > the second sub later and check pg_conflict_detection slot has been
> > created once we enabled rci. This way, it will cover more scenarios.
>
> Agreed and changed as suggested. I removed the test for WARNING since the
> message is the same as the NOITCE and it seems not worthwhile to disable
> the subscription again to verify one message.
>

Okay. Agreed.

patch001 looks good to me now.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Jun 27, 2025 at 7:58 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is the V43 patch set which includes the following changes:
>

Few minor comments:
1.
@@ -29645,8 +29651,10 @@ postgres=# SELECT '0/0'::pg_lsn +
pd.segment_number * ps.setting::int + :offset
         Copies an existing logical replication slot
         named <parameter>src_slot_name</parameter> to a logical replication
         slot named <parameter>dst_slot_name</parameter>, optionally changing
-        the output plugin and persistence.  The copied logical slot starts
-        from the same <acronym>LSN</acronym> as the source logical slot.  Both
+        the output plugin and persistence.  The name cannot be
+        <literal>pg_conflict_detection</literal> as it is reserved for
+        the conflict detection.  The copied logical slot starts from the same
+        <acronym>LSN</acronym> as the source logical slot.  Both

/The name/The new slot name

2. The server version checks can be modified to 19000 as a new branch
is created now.

3.
This ensures that if the launcher loses track of the slot after
+   * a restart, it will remember to drop the slot when it is no longer
+   * requested by any subscription.

The link between the first part of the sentence, before the comma, and
the remaining part of the sentence is not clear. How about writing
it as: "Acquire the conflict detection slot at startup to ensure it
can be dropped if no longer needed after a restart."?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Mon, Jun 30, 2025 at 6:59 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Mon, Jun 30, 2025 at 7:22 PM Amit Kapila wrote:
> >

I was looking at 0001; it mostly looks fine to me except for this one
case. Here we need to ensure that the commit timestamp is acquired after
marking the flag. Don't you think we need to enforce strict statement
ordering using a memory barrier, or do we think it's not required, and if
so, why?

RecordTransactionCommitPrepared()
{
..
+ MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;
+
+ /*
+ * Note it is important to set committs value after marking ourselves as
+ * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This is because
+ * we want to ensure all transactions that have acquired commit timestamp
+ * are finished before we allow the logical replication client to advance
+ * its xid which is used to hold back dead rows for conflict detection.
+ * See maybe_advance_nonremovable_xid.
+ */
+ committs = GetCurrentTimestamp();
}

--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Tue, Jul 1, 2025 at 10:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jun 30, 2025 at 6:59 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Mon, Jun 30, 2025 at 7:22 PM Amit Kapila wrote:
> > >
>
> I was looking at 0001, it mostly looks fine to me except this one
> case.  So here we need to ensure that commits must be acquired after
> marking the flag, don't you think we need to ensure strict statement
> ordering using memory barrier, or we think it's not required and if so
> why?
>
> RecordTransactionCommitPrepared()
> {
> ..
> + MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;
> +
> + /*
> + * Note it is important to set committs value after marking ourselves as
> + * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This is because
> + * we want to ensure all transactions that have acquired commit timestamp
> + * are finished before we allow the logical replication client to advance
> + * its xid which is used to hold back dead rows for conflict detection.
> + * See maybe_advance_nonremovable_xid.
> + */
> + committs = GetCurrentTimestamp();
> }

I'm unsure whether the function call inherently acts as a memory
barrier, preventing the compiler from reordering these operations.
This needs to be confirmed.


--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Tue, Jul 1, 2025 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jul 1, 2025 at 10:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jun 30, 2025 at 6:59 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Mon, Jun 30, 2025 at 7:22 PM Amit Kapila wrote:
> > > >
> >
> > I was looking at 0001, it mostly looks fine to me except this one
> > case.  So here we need to ensure that commits must be acquired after
> > marking the flag, don't you think we need to ensure strict statement
> > ordering using memory barrier, or we think it's not required and if so
> > why?
> >

Good point. I also think we need a barrier here, but a write barrier
should be sufficient as we want ordering of two store operations.

> > RecordTransactionCommitPrepared()
> > {
> > ..
> > + MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;
> > +
> > + /*
> > + * Note it is important to set committs value after marking ourselves as
> > + * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This is because
> > + * we want to ensure all transactions that have acquired commit timestamp
> > + * are finished before we allow the logical replication client to advance
> > + * its xid which is used to hold back dead rows for conflict detection.
> > + * See maybe_advance_nonremovable_xid.
> > + */
> > + committs = GetCurrentTimestamp();
> > }
>
> I'm unsure whether the function call inherently acts as a memory
> barrier, preventing the compiler from reordering these operations.
> This needs to be confirmed.
>

As per my understanding, a function call won't act as a memory barrier. In
this regard, we need a similar change in RecordTransactionCommit() as
well.
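For illustration, a sketch of the proposed ordering in RecordTransactionCommitPrepared() (and similarly in RecordTransactionCommit()), using the existing pg_write_barrier() primitive:

	MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;

	/*
	 * Order the store to delayChkptFlags before reading the commit
	 * timestamp; the function call by itself does not guarantee this.
	 */
	pg_write_barrier();

	committs = GetCurrentTimestamp();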

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Tue, Jul 1, 2025 at 2:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 1, 2025 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Tue, Jul 1, 2025 at 10:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Mon, Jun 30, 2025 at 6:59 PM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Mon, Jun 30, 2025 at 7:22 PM Amit Kapila wrote:
> > > > >
> > >
> > > I was looking at 0001, it mostly looks fine to me except this one
> > > case.  So here we need to ensure that commits must be acquired after
> > > marking the flag, don't you think we need to ensure strict statement
> > > ordering using memory barrier, or we think it's not required and if so
> > > why?
> > >
>
> Good point. I also think we need a barrier here, but a write barrier
> should be sufficient as we want ordering of two store operations.

+1

> > > RecordTransactionCommitPrepared()
> > > {
> > > ..
> > > + MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;
> > > +
> > > + /*
> > > + * Note it is important to set committs value after marking ourselves as
> > > + * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This is because
> > > + * we want to ensure all transactions that have acquired commit timestamp
> > > + * are finished before we allow the logical replication client to advance
> > > + * its xid which is used to hold back dead rows for conflict detection.
> > > + * See maybe_advance_nonremovable_xid.
> > > + */
> > > + committs = GetCurrentTimestamp();
> > > }
> >
> > I'm unsure whether the function call inherently acts as a memory
> > barrier, preventing the compiler from reordering these operations.
> > This needs to be confirmed.
> >
>
> As per my understanding, function calls won't be a memory barrier. In
> this regard, we need a similar change in RecordTransactionCommit as
> well.

Right, we need this in RecordTransactionCommit() as well.

--
Regards,
Dilip Kumar
Google



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tue, Jul 1, 2025 at 5:07 PM Dilip Kumar wrote:
> 
> On Tue, Jul 1, 2025 at 2:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jul 1, 2025 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Tue, Jul 1, 2025 at 10:31 AM Dilip Kumar <dilipbalaut@gmail.com>
> wrote:
> > > >
> > > > On Mon, Jun 30, 2025 at 6:59 PM Zhijie Hou (Fujitsu)
> > > > <houzj.fnst@fujitsu.com> wrote:
> > > > >
> > > > > On Mon, Jun 30, 2025 at 7:22 PM Amit Kapila wrote:
> > > > > >
> > > >
> > > > I was looking at 0001, it mostly looks fine to me except this one
> > > > case.  So here we need to ensure that commits must be acquired
> > > > after marking the flag, don't you think we need to ensure strict
> > > > statement ordering using memory barrier, or we think it's not
> > > > required and if so why?
> > > >
> >
> > Good point. I also think we need a barrier here, but a write barrier
> > should be sufficient as we want ordering of two store operations.
> 
> +1
> 
> > > > RecordTransactionCommitPrepared()
> > > > {
> > > > ..
> > > > + MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;
> > > > +
> > > > + /*
> > > > + * Note it is important to set committs value after marking
> > > > + ourselves as
> > > > + * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This
> > > > + is because
> > > > + * we want to ensure all transactions that have acquired commit
> > > > + timestamp
> > > > + * are finished before we allow the logical replication client to
> > > > + advance
> > > > + * its xid which is used to hold back dead rows for conflict detection.
> > > > + * See maybe_advance_nonremovable_xid.
> > > > + */
> > > > + committs = GetCurrentTimestamp();
> > > > }
> > >
> > > I'm unsure whether the function call inherently acts as a memory
> > > barrier, preventing the compiler from reordering these operations.
> > > This needs to be confirmed.
> > >
> >
> > As per my understanding, function calls won't be a memory barrier. In
> > this regard, we need a similar change in RecordTransactionCommit as
> > well.
> 
> Right, we need this in RecordTransactionCommit() as well.

Thanks for the comments! I also agree that the barrier is needed.

Here is V45 patch set.

I modified 0001, added write barriers, and improved some comments.

Best Regards,
Hou zj

Attachments

Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Tue, Jul 1, 2025 at 3:39 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tue, Jul 1, 2025 at 5:07 PM Dilip Kumar wrote:
> >
> > On Tue, Jul 1, 2025 at 2:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Jul 1, 2025 at 10:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > On Tue, Jul 1, 2025 at 10:31 AM Dilip Kumar <dilipbalaut@gmail.com>
> > wrote:
> > > > >
> > > > > On Mon, Jun 30, 2025 at 6:59 PM Zhijie Hou (Fujitsu)
> > > > > <houzj.fnst@fujitsu.com> wrote:
> > > > > >
> > > > > > On Mon, Jun 30, 2025 at 7:22 PM Amit Kapila wrote:
> > > > > > >
> > > > >
> > > > > I was looking at 0001, it mostly looks fine to me except this one
> > > > > case.  So here we need to ensure that commits must be acquired
> > > > > after marking the flag, don't you think we need to ensure strict
> > > > > statement ordering using memory barrier, or we think it's not
> > > > > required and if so why?
> > > > >
> > >
> > > Good point. I also think we need a barrier here, but a write barrier
> > > should be sufficient as we want ordering of two store operations.
> >
> > +1
> >
> > > > > RecordTransactionCommitPrepared()
> > > > > {
> > > > > ..
> > > > > + MyProc->delayChkptFlags |= DELAY_CHKPT_IN_COMMIT;
> > > > > +
> > > > > + /*
> > > > > + * Note it is important to set committs value after marking
> > > > > + ourselves as
> > > > > + * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This
> > > > > + is because
> > > > > + * we want to ensure all transactions that have acquired commit
> > > > > + timestamp
> > > > > + * are finished before we allow the logical replication client to
> > > > > + advance
> > > > > + * its xid which is used to hold back dead rows for conflict detection.
> > > > > + * See maybe_advance_nonremovable_xid.
> > > > > + */
> > > > > + committs = GetCurrentTimestamp();
> > > > > }
> > > >
> > > > I'm unsure whether the function call inherently acts as a memory
> > > > barrier, preventing the compiler from reordering these operations.
> > > > This needs to be confirmed.
> > > >
> > >
> > > As per my understanding, function calls won't be a memory barrier. In
> > > this regard, we need a similar change in RecordTransactionCommit as
> > > well.
> >
> > Right, we need this in RecordTransactionCommit() as well.
>
> Thanks for the comments! I also agree that the barrier is needed.
>
> Here is V45 patch set.
>
> I modified 0001, added write barriers, and improved some comments.

Thanks for working on this; I will have a look at it by tomorrow at the latest.


--
Regards,
Dilip Kumar
Google



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tue, Jul 1, 2025 at 6:10 PM Zhijie Hou (Fujitsu) wrote:
> Here is V45 patch set.

With the main patch set now stable, I am summarizing the performance tests
conducted before for reference.

In earlier tests [1], we confirmed that in a pub-sub cluster with a high
workload on the publisher (via pgbench), the patch had no impact on TPS
(Transactions Per Second) on the publisher. This indicates that the
modifications to the walsender responsible for reporting the publisher status
do not introduce noticeable overhead.

Additionally, we confirmed that the patch, with its latest mechanism for
dynamically tuning the frequency of advancing slot.xmin, does not affect TPS on
the subscriber when minimal changes occur on the publisher. This test[2]
involved creating a pub-sub cluster and running pgbench on the subscriber to
monitor TPS. It further suggests that the logic for maintaining non-removable
xid in the apply worker does not introduce noticeable overhead for concurrent
user DMLs.

Furthermore, we tested running pgbench on both publisher and subscriber[3].
Some regression was observed in TPS on the subscriber, because workload on the
publisher is pretty high and the apply workers must wait for the amount of
transactions with earlier timestamps to be applied and flushed before advancing
the non-removable XID to remove dead tuples. This is the expected behavior of
this approach since the patch's main goal is to retain dead tuples for reliable
conflict detection.

When discussing the regression, we considered providing a workaround for users
to recover from the regression (0002 of the latest patch set). It introduces a
GUC option, max_conflict_retention_duration, designed to prevent excessive
accumulation of dead tuples when a subscription with retain_conflict_info
enabled is present and the apply worker cannot catch up with the publisher's
workload. In short, the conflict detection replication slot will be
invalidated if the lag time exceeds the specified GUC value.
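For illustration, the invalidation condition could conceptually look like the
sketch below (names such as candidate_xid_time and
invalidate_conflict_detection_slot() are placeholders rather than the patch's
actual code, and the GUC is assumed to be stored in milliseconds):

    /* Conceptual sketch only, not the patch's implementation. */
    if (max_conflict_retention_duration > 0 &&
        TimestampDifferenceExceeds(candidate_xid_time,
                                   GetCurrentTimestamp(),
                                   max_conflict_retention_duration))
        invalidate_conflict_detection_slot();   /* placeholder helper */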

In performance tests[4], we confirmed that the slot would be invalidated as
expected when the workload on the publisher was high, and that it would not
get invalidated anymore after reducing the workload. This shows that even if
the slot has been invalidated once, users can continue to detect the
update_deleted conflict by reducing the workload on the publisher.

The design of the patch set was not changed since the last performance test;
only some code enhancements have been made. Therefore, I think the results and
findings from the previous performance tests are still valid. However, if
necessary, we can rerun all the tests on the latest patch set to verify the
same.

[1] https://www.postgresql.org/message-id/CABdArM5SpMyGvQTsX0-d%3Db%2BJAh0VQjuoyf9jFqcrQ3JLws5eOw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/TYAPR01MB5692B0182356F041DC9DE3B5F53E2%40TYAPR01MB5692.jpnprd01.prod.outlook.com
[3] https://www.postgresql.org/message-id/CABdArM4OEwmh_31dQ8_F__VmHwk2ag_M%3DYDD4H%2ByYQBG%2BbHGzg%40mail.gmail.com
[4]
https://www.postgresql.org/message-id/OSCPR01MB14966F39BE1732B9E433023BFF5E72%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Wed, Jul 2, 2025 at 12:58 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>

> During local testing, I discovered a bug caused by my oversight in assigning
> the new xmin to slot.effective, which resulted in dead tuples remaining
> non-removable until restart. I apologize for the error and have provided
> corrected patches. Kindly use the latest patch set for performance testing.

Your changes related to the write barrier LGTM; however, I have a question
regarding the change below. IIUC, in logical replication
MyReplicationSlot->effective_xmin should be the xmin value which has
been flushed to disk, but here we are just setting "data.xmin =
new_xmin;" and marking the slot dirty, so I believe it has not yet been
flushed to disk, right?

+advance_conflict_slot_xmin(TransactionId new_xmin)
+{
+ Assert(MyReplicationSlot);
+ Assert(TransactionIdIsValid(new_xmin));
+ Assert(TransactionIdPrecedesOrEquals(MyReplicationSlot->data.xmin, new_xmin));
+
+ /* Return if the xmin value of the slot cannot be advanced */
+ if (TransactionIdEquals(MyReplicationSlot->data.xmin, new_xmin))
+ return;
+
+ SpinLockAcquire(&MyReplicationSlot->mutex);
+ MyReplicationSlot->effective_xmin = new_xmin;
+ MyReplicationSlot->data.xmin = new_xmin;
+ SpinLockRelease(&MyReplicationSlot->mutex);
+
+ elog(DEBUG1, "updated xmin: %u", MyReplicationSlot->data.xmin);
+
+ ReplicationSlotMarkDirty();
+ ReplicationSlotsComputeRequiredXmin(false);
..
}

--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Thu, Jul 3, 2025 at 10:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jul 2, 2025 at 12:58 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
>
> > During local testing, I discovered a bug caused by my oversight in assigning
> > the new xmin to slot.effective, which resulted in dead tuples remaining
> > non-removable until restart. I apologize for the error and have provided
> > corrected patches. Kindly use the latest patch set for performance testing.
>
> You changes related to write barrier LGTM, however I have question
> regarding below change, IIUC, in logical replication
> MyReplicationSlot->effective_xmin should be the xmin value which has
> been flushed to the disk, but here we are just setting "data.xmin =
> new_xmin;" and marking the slot dirty so I believe its not been yet
> flushed to the disk right?
>

Yes, because this is a physical slot and we need to follow
PhysicalConfirmReceivedLocation()/PhysicalReplicationSlotNewXmin().
The patch has kept a comment in advance_conflict_slot_xmin() as to why
it is okay not to flush the slot immediately.
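For reference, PhysicalConfirmReceivedLocation() follows the same pattern
(abridged here, so treat it as a sketch rather than the exact code): the
in-memory value is updated under the slot's spinlock and the slot is only
marked dirty, with no immediate ReplicationSlotSave():

    SpinLockAcquire(&slot->mutex);
    slot->data.restart_lsn = lsn;
    SpinLockRelease(&slot->mutex);

    ReplicationSlotMarkDirty();
    ReplicationSlotsComputeRequiredLSN();
    /* no ReplicationSlotSave() here; losing this update on a crash is harmless */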

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Thu, Jul 3, 2025 at 10:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 3, 2025 at 10:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jul 2, 2025 at 12:58 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> >
> > > During local testing, I discovered a bug caused by my oversight in assigning
> > > the new xmin to slot.effective, which resulted in dead tuples remaining
> > > non-removable until restart. I apologize for the error and have provided
> > > corrected patches. Kindly use the latest patch set for performance testing.
> >
> > You changes related to write barrier LGTM, however I have question
> > regarding below change, IIUC, in logical replication
> > MyReplicationSlot->effective_xmin should be the xmin value which has
> > been flushed to the disk, but here we are just setting "data.xmin =
> > new_xmin;" and marking the slot dirty so I believe its not been yet
> > flushed to the disk right?
> >
>
> Yes, because this is a physical slot and we need to follow
> PhysicalConfirmReceivedLocation()/PhysicalReplicationSlotNewXmin().
> The patch has kept a comment in advance_conflict_slot_xmin() as to why
> it is okay not to flush the slot immediately.

Oh right, I forgot it's a physical slot.  I think we are good, thanks.


--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Thu, Jul 3, 2025 at 10:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Jul 3, 2025 at 10:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jul 3, 2025 at 10:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > >
> > > You changes related to write barrier LGTM, however I have question
> > > regarding below change, IIUC, in logical replication
> > > MyReplicationSlot->effective_xmin should be the xmin value which has
> > > been flushed to the disk, but here we are just setting "data.xmin =
> > > new_xmin;" and marking the slot dirty so I believe its not been yet
> > > flushed to the disk right?
> > >
> >
> > Yes, because this is a physical slot and we need to follow
> > PhysicalConfirmReceivedLocation()/PhysicalReplicationSlotNewXmin().
> > The patch has kept a comment in advance_conflict_slot_xmin() as to why
> > it is okay not to flush the slot immediately.
>
> Oh right, I forgot its physical slot.  I think we are good, thanks.
>

BTW, I wanted to clarify one more point related to this part of the
patch. One difference between PhysicalReplicationSlotNewXmin() and
advance_conflict_slot_xmin() is that the former updates both
catalog_xmin and xmin for the slot, but the latter updates only the slot's
xmin. Can we see any reason to update both in our case? For example,
there is one case where the caller of
ProcArrayGetReplicationSlotXmin() expects that the catalog_xmin must
be set when required, but as far as I can see, it is required only
when logical slots are present, so we should be okay with that case.

The other case to consider is vacuum calling
GetOldestNonRemovableTransactionId() to get the cutoff xid to remove
deleted rows. This returns the xmin horizon based on the type of table
(user table, catalog table, etc.). Now, in this case,
ComputeXidHorizons() will first initialize catalog_oldest_nonremovable
from the data xmin, and then, if slot_catalog_xmin is smaller, it uses
that value. So, for this computation as well, setting just the slot's
xmin in advance_conflict_slot_xmin() should be sufficient,
as we will anyway set both xmin and catalog_xmin to the same values.
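Roughly, the relevant part of ComputeXidHorizons() does the following
(paraphrased as a sketch, not the exact upstream code):

    /* Start the catalog horizon from the data horizon ... */
    h->catalog_oldest_nonremovable = h->data_oldest_nonremovable;

    /* ... and pull it back further if a slot's catalog_xmin is older. */
    if (TransactionIdIsValid(h->slot_catalog_xmin) &&
        TransactionIdPrecedes(h->slot_catalog_xmin,
                              h->catalog_oldest_nonremovable))
        h->catalog_oldest_nonremovable = h->slot_catalog_xmin;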

By this theory, it doesn't matter whether we set catalog_xmin for
physical slots or not till we are setting the slot's xmin. IIUC,
catalog_xmin is required to be set for logical slots because during
logical decoding, we access only catalog tables, so we need to protect
those, and the catalog_xmin value is used for that.

Thoughts?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Thu, Jul 3, 2025 at 4:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 3, 2025 at 10:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Jul 3, 2025 at 10:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Jul 3, 2025 at 10:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > >
> > > > You changes related to write barrier LGTM, however I have question
> > > > regarding below change, IIUC, in logical replication
> > > > MyReplicationSlot->effective_xmin should be the xmin value which has
> > > > been flushed to the disk, but here we are just setting "data.xmin =
> > > > new_xmin;" and marking the slot dirty so I believe its not been yet
> > > > flushed to the disk right?
> > > >
> > >
> > > Yes, because this is a physical slot and we need to follow
> > > PhysicalConfirmReceivedLocation()/PhysicalReplicationSlotNewXmin().
> > > The patch has kept a comment in advance_conflict_slot_xmin() as to why
> > > it is okay not to flush the slot immediately.
> >
> > Oh right, I forgot its physical slot.  I think we are good, thanks.
> >
>
> BTW, I wanted to clarify one more point related to this part of the
> patch. One difference between PhysicalReplicationSlotNewXmin() and
> advance_conflict_slot_xmin() is that the former updates both
> catalog_xmin and xmin for the slot, but later updates only the slot's
> xmin. Can we see any reason to update both in our case?

IMHO the purposes of these two functions are different. I think
PhysicalReplicationSlotNewXmin() updates the xmin in response to
hot_standby_feedback, and its purpose is to avoid removing anything on
the primary that is still required by the standby, so that we do not
cause conflicts or query cancellations. So it has to consider the data
required by active queries, physical/logical replication slots on the
standby, etc.  Whereas the purpose of advance_conflict_slot_xmin() is
to prevent the removal of tuples on the subscriber that might be
required for conflict detection on this node; the other aspect, not
removing tuples required by the logical/physical replication slots on
this node, is already taken care of by other slots.  So I think this is
a purpose-built slot with just one specific task, and I feel we are
good with what we have.

--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Fri, Jul 4, 2025 at 4:48 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wed, Jul 2, 2025 at 3:28 PM Hou, Zhijie wrote:
> > Kindly use the latest patch set for performance testing.
>
> During testing, we observed a limitation in cascading logical replication
> setups, such as (A -> B -> C). When retain_conflict_info is enabled on Node C,
> it may not retain information necessary for conflict detection when applying
> changes originally replicated from Node A. This happens because Node C only
> waits for locally originated changes on Node B to be applied before advancing
> the non-removable transaction ID.
>
> For example, Consider a logical replication setup as mentioned above : A -> B -> C.
>  - All three nodes have a table t1 with two tuples (1,1) (2,2).
>  - Node B subscribed to all changes of t1 from Node A
>  - Node-C subscribed to all changes from Node B.
>  - Subscriptions use the default origin=ANY, as this is not a bidirectional
>    setup.
>
> Now, consider two concurrent operations:
>   - @9:00 Node A - UPDATE (1,1) -> (1,11)
>
>   - @9:02 Node C - DELETE (1,1)
>
> Assume a slight delay at Node B before it applies the update from Node A.
>
>  @9:03 Node C - advances the non-removable XID because it sees no concurrent
>  transactions from Node B. It is unaware of Node A’s concurrent update.
>
>   @9:04 Node B - receives Node A's UPDATE and applies (1,1) -> (1,11)
>   t1 has tuples : (1,11), (2,2)
>
>   @9:05 Node C - receives the UPDATE (1,1) -> (1,11)
>     - As conflict slot’s xmin is advanced, the deleted tuple may already have
>       been removed.
>     - Conflict resolution fails to detect update_deleted and instead raises
>       update_missing.
>
> Note that, as per decoding logic Node C sees the commit timestamp of the update
> as 9:00 (origin commit_ts from Node A), not 9:04 (commit time on Node B). In
> this case, since the UPDATE's timestamp is earlier than the DELETE, Node C
> should ideally detect an update_deleted conflict. However, it cannot, because
> it no longer retains the deleted tuple.
>
> Even if Node C attempts to retrieve the latest WAL position from Node A, Node C
> doesn't maintain any LSN which we could use to compare with it.
>
> This scenario is similar to another restriction in the patch where
> retain_conflict_info is not supported if the publisher is also a physical
> standby, as the required transaction information from the original primary is
> unavailable. Moreover, this limitation is relevant only when the subscription
> origin option is set to ANY, as only in that case changes from other origins
> can be replicated. Since retain_conflict_info is primarily useful for conflict
> detection in bidirectional clusters where the origin option is set to NONE,
> this limitation appears acceptable.
>
> Given these findings, to help users avoid unintended configurations, we plan to
> issue a warning in scenarios where replicated changes may include origins other
> than the direct publisher, similar to the existing checks in the
> check_publications_origin() function.
>
> Here is the latest patch that implements the warning and documents
> this case. Only 0001 is modified for this.
>
> A big thanks to Nisha for invaluable assistance in identifying this
> case and preparing the analysis for it.

In this setup, if we have A->B->C->A, then after we implement conflict
resolution, is it possible that node A will be left with just (2,2),
because (1,11) will be deleted while applying the changes from Node C,
whereas node C has detected the indirect conflicting update from Node
A as update_missing, has inserted the row, and will be left with (1,11)
and (2,2)?  So can it cause divergence as I explained here, or will it
not?  If not, can you explain how?

--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Sat, Jul 5, 2025 at 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jul 4, 2025 at 4:48 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wed, Jul 2, 2025 at 3:28 PM Hou, Zhijie wrote:
> > > Kindly use the latest patch set for performance testing.
> >
> > During testing, we observed a limitation in cascading logical replication
> > setups, such as (A -> B -> C). When retain_conflict_info is enabled on Node C,
> > it may not retain information necessary for conflict detection when applying
> > changes originally replicated from Node A. This happens because Node C only
> > waits for locally originated changes on Node B to be applied before advancing
> > the non-removable transaction ID.
> >
> > For example, Consider a logical replication setup as mentioned above : A -> B -> C.
> >  - All three nodes have a table t1 with two tuples (1,1) (2,2).
> >  - Node B subscribed to all changes of t1 from Node A
> >  - Node-C subscribed to all changes from Node B.
> >  - Subscriptions use the default origin=ANY, as this is not a bidirectional
> >    setup.
> >
> > Now, consider two concurrent operations:
> >   - @9:00 Node A - UPDATE (1,1) -> (1,11)
> >
> >   - @9:02 Node C - DELETE (1,1)
> >
> > Assume a slight delay at Node B before it applies the update from Node A.
> >
> >  @9:03 Node C - advances the non-removable XID because it sees no concurrent
> >  transactions from Node B. It is unaware of Node A’s concurrent update.
> >
> >   @9:04 Node B - receives Node A's UPDATE and applies (1,1) -> (1,11)
> >   t1 has tuples : (1,11), (2,2)
> >
> >   @9:05 Node C - receives the UPDATE (1,1) -> (1,11)
> >     - As conflict slot’s xmin is advanced, the deleted tuple may already have
> >       been removed.
> >     - Conflict resolution fails to detect update_deleted and instead raises
> >       update_missing.
> >
> > Note that, as per decoding logic Node C sees the commit timestamp of the update
> > as 9:00 (origin commit_ts from Node A), not 9:04 (commit time on Node B). In
> > this case, since the UPDATE's timestamp is earlier than the DELETE, Node C
> > should ideally detect an update_deleted conflict. However, it cannot, because
> > it no longer retains the deleted tuple.
> >
> > Even if Node C attempts to retrieve the latest WAL position from Node A, Node C
> > doesn't maintain any LSN which we could use to compare with it.
> >
> > This scenario is similar to another restriction in the patch where
> > retain_conflict_info is not supported if the publisher is also a physical
> > standby, as the required transaction information from the original primary is
> > unavailable. Moreover, this limitation is relevant only when the subscription
> > origin option is set to ANY, as only in that case changes from other origins
> > can be replicated. Since retain_conflict_info is primarily useful for conflict
> > detection in bidirectional clusters where the origin option is set to NONE,
> > this limitation appears acceptable.
> >
> > Given these findings, to help users avoid unintended configurations, we plan to
> > issue a warning in scenarios where replicated changes may include origins other
> > than the direct publisher, similar to the existing checks in the
> > check_publications_origin() function.
> >
> > Here is the latest patch that implements the warning and documents
> > this case. Only 0001 is modified for this.
> >
> > A big thanks to Nisha for invaluable assistance in identifying this
> > case and preparing the analysis for it.
>
> In this setup if we have A->B->C->A then after we implement conflict
> resolution is it possible that node A will just left with (2,2),
> because (1,11) will be deleted while applying the changes from Node C
> whereas node C has detected the indirect conflicting update from Node
> A as update missing and has inserted the row and it will left with
> (1,11) and (2,2).  So can it cause divergence as I explained here, or
> it will not?  If not then can you explain how?

Thinking further, I believe this will lead to data divergence.
However, in this specific setup, that's not a concern. If a user needs
to guarantee consistency across all nodes, they'll have to configure a
one-to-one publication-subscription relationship between each pair of
nodes: A to B and B to A, B to C and C to B, and A to C and C to A. In
a cascading setup, however, we cannot expect all nodes to contain
identical data.  So I think I am fine with giving a WARNING, as you
have done in your patch.

--
Regards,
Dilip Kumar
Google



RE: Conflict detection for update_deleted in logical replication

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

For confirmation purposes, I did performance testing with the four workloads
we ran before.

Highlights
==========
The retests on the latest patch set v46 show results consistent with previous
observations:
 - There is no performance impact on the publisher side
 - There is no performance impact on the subscriber side if the workload is
   running only on the subscriber.
 - The performance is reduced on the subscriber side (~50% TPS reduction [Test-03])
   when retain_conflict_info=on and pgbench is running on both sides. This is
   because of dead tuple retention for conflict detection. With a high workload
   on the publisher, the apply workers must wait for transactions with earlier
   timestamps to be applied and flushed before advancing the non-removable XID
   to remove dead tuples.
 - Subscriber-side TPS improves when the workload on the publisher is reduced.
 - Performance on the subscriber can also be improved by tuning the
   max_conflict_retention_duration GUC properly.

Used source
===========
pgHead commit fd7d7b7191 + v46 patchset

Machine details
===============
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz CPU(s) :88 cores, - 503 GiB RAM

01. pgbench on publisher
========================
The workload is mostly the same as in [1].

Workload:
 - Ran pgbench with 40 clients for the publisher.
 - The duration was 120s, and the measurement was repeated 10 times.

(pubtest.tar.gz can run the same workload)

Test Scenarios & Results:
 - pgHead : Median TPS = 39809.84925
 - pgHead + patch : Median TPS = 40102.88108

Observation:
 - No performance regression observed with the patch applied.
 - The results were consistent across runs.

Detailed Results Table:
  - each cell shows the TPS in each case.
  - patch(ON) means patched and retain_conflict_info=ON is set.

run#    pgHEAD         pgHead+patch(ON) 
1    40106.88834        40356.60039
2    39854.17244        40087.18077
3    39516.26983        40063.34688
4    39746.45715        40389.40549
5    40014.83857        40537.24
6    39819.26374        40016.78705
7    39800.43476        38774.9827
8    39884.2691        40163.35257
9    39753.11246        39902.02755
10    39427.2353        40118.58138
median    39809.84925        40102.88108

02. pgbench on subscriber
========================
The workload is mostly the same as in [2].

Workload:
 - Ran pgbench with 40 clients for the *subscriber*.
 - The duration was 120s, and the measurement was repeated 10 times.

(subtest.tar.gz can run the same workload)

Test Scenarios & Results:
 - pgHead : Median TPS = 41564.64591
 - pgHead + patch : Median TPS = 41083.09555

Observation:
 - No performance regression observed with the patch applied.
 - The results were consistent across runs.

Detailed Results Table:

run#    pgHEAD         pgHead+patch(ON)
1    41605.88999        41106.93126
2    41555.76448        40975.9575
3    41505.76161        41223.92841
4    41722.50373        41049.52787
5    41400.48427        41262.15085
6    41386.47969        41059.25985
7    41679.7485        40916.93053
8    41563.60036        41178.82461
9    41565.69145        41672.41773
10    41765.11049        40958.73512
median    41564.64591        41083.09555

03. pgbench on both sides
========================
The workload is mostly the same as in [3].

Workload:
 - Ran pgbench with 40 clients on *both sides*.
 - The duration was 120s, and the measurement was repeated 10 times.

(bothtest.tar.gz can run the same workload)

Test Scenarios & Results:
Publisher:
 - pgHead : Median TPS = 16799.67659
 - pgHead + patch : Median TPS = 17338.38423
Subscriber:
 - pgHead : Median TPS = 16552.60515
 - pgHead + patch : Median TPS = 8367.133693

Observation:
 - No performance regression observed on the publisher with the patch applied.
 - The performance is reduced on the subscriber side (~50% TPS reduction) due
   to dead tuple retention for conflict detection.

Detailed Results Table:

On publisher:
run#    pgHEAD         pgHead+patch(ON) 
1    16735.53391        17369.89325
2    16957.01458        17077.96864
3    16838.07008        17480.08206
4    16743.67772        17531.00493
5    16776.74723        17511.4314
6    16784.73354        17235.76573
7    16871.63841        17255.04538
8    16814.61964        17460.33946
9    16903.14424        17024.77703
10    16556.05636        17306.87522
median    16799.67659        17338.38423

On subscriber:
run#    pgHEAD     pgHead+patch(ON) 
1    16505.27302    8381.200661
2    16765.38292    8353.310973
3    16899.41055    8396.901652
4    16305.05353    8413.058805
5    16722.90536    8320.833085
6    16587.64864    8327.217432
7    16508.45076    8369.205438
8    16357.05337    8394.34603
9    16724.90296    8351.718212
10    16517.56167    8365.061948
median    16552.60515    8367.133693

04. pgbench on both side, and max_conflict_retention_duration was tuned
========================================================================
The workload is mostly the same as in [4].

Workload:
- Initially ran pgbench with 40 clients on *both sides*.
- Set max_conflict_retention_duration = {60, 120}.
- When the slot is invalidated on the subscriber side, stop the benchmark and
  wait until the subscriber has caught up. Then the number of clients on the
  publisher is halved.
  In this test the conflict slot could be invalidated as expected when the
  workload on the publisher was high, and it did not get invalidated anymore
  after reducing the workload. This shows that even if the slot has been
  invalidated once, users can continue to detect the update_deleted conflict
  by reducing the workload on the publisher.
- The total period of the test was 900s for each case.

(max_conflixt.tar.gz can run the same workload)

Observation:
 - Parallelism on the publisher side is reduced from 15 -> 7 -> 3 clients, and
   finally the conflict slot is no longer invalidated.
 - TPS on the subscriber side improves when the concurrency is reduced. This is
   because dead tuple accumulation on the subscriber is reduced due to the
   lower workload on the publisher.
 - When the publisher has Nclients=3, there is no regression in the
   subscriber's TPS.

Detailed Results Table:
    For max_conflict_retention_duration = 60s
    On publisher:
        Nclients        duration [s]    TPS
        15        72        14079.1
        7        82        9307
        3        446        4133.2

    On subscriber:
        Nclients        duration [s]    TPS
        15        72        6827
        15        81        7200
        15        446        19129.4


    For max_conflict_retention_duration = 120s
    On publisher:
        Nclients        duration [s]    TPS
        15        162        17835.3
        7        152        9503.8
        3        283        4243.9

    On subscriber:
        Nclients        duration [s]    TPS
        15        162        4571.8
        15        152        4707
        15        283        19568.4

Thanks to Nisha-san and Hou-san for helping with this work.

[1]: https://www.postgresql.org/message-id/CABdArM5SpMyGvQTsX0-d%3Db%2BJAh0VQjuoyf9jFqcrQ3JLws5eOw%40mail.gmail.com
[2]:
https://www.postgresql.org/message-id/TYAPR01MB5692B0182356F041DC9DE3B5F53E2%40TYAPR01MB5692.jpnprd01.prod.outlook.com
[3]: https://www.postgresql.org/message-id/CABdArM4OEwmh_31dQ8_F__VmHwk2ag_M%3DYDD4H%2ByYQBG%2BbHGzg%40mail.gmail.com
[4]:
https://www.postgresql.org/message-id/OSCPR01MB14966F39BE1732B9E433023BFF5E72%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Fri, Jul 4, 2025 at 8:18 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wed, Jul 2, 2025 at 3:28 PM Hou, Zhijie wrote:
> > Kindly use the latest patch set for performance testing.
>
> During testing, we observed a limitation in cascading logical replication
> setups, such as (A -> B -> C). When retain_conflict_info is enabled on Node C,
> it may not retain information necessary for conflict detection when applying
> changes originally replicated from Node A. This happens because Node C only
> waits for locally originated changes on Node B to be applied before advancing
> the non-removable transaction ID.
>
> For example, Consider a logical replication setup as mentioned above : A -> B -> C.
>  - All three nodes have a table t1 with two tuples (1,1) (2,2).
>  - Node B subscribed to all changes of t1 from Node A
>  - Node-C subscribed to all changes from Node B.
>  - Subscriptions use the default origin=ANY, as this is not a bidirectional
>    setup.
>
> Now, consider two concurrent operations:
>   - @9:00 Node A - UPDATE (1,1) -> (1,11)
>
>   - @9:02 Node C - DELETE (1,1)
>
> Assume a slight delay at Node B before it applies the update from Node A.
>
>  @9:03 Node C - advances the non-removable XID because it sees no concurrent
>  transactions from Node B. It is unaware of Node A’s concurrent update.
>
>   @9:04 Node B - receives Node A's UPDATE and applies (1,1) -> (1,11)
>   t1 has tuples : (1,11), (2,2)
>
>   @9:05 Node C - receives the UPDATE (1,1) -> (1,11)
>     - As conflict slot’s xmin is advanced, the deleted tuple may already have
>       been removed.
>     - Conflict resolution fails to detect update_deleted and instead raises
>       update_missing.
>
> Note that, as per decoding logic Node C sees the commit timestamp of the update
> as 9:00 (origin commit_ts from Node A), not 9:04 (commit time on Node B). In
> this case, since the UPDATE's timestamp is earlier than the DELETE, Node C
> should ideally detect an update_deleted conflict. However, it cannot, because
> it no longer retains the deleted tuple.
>
> Even if Node C attempts to retrieve the latest WAL position from Node A, Node C
> doesn't maintain any LSN which we could use to compare with it.
>
> This scenario is similar to another restriction in the patch where
> retain_conflict_info is not supported if the publisher is also a physical
> standby, as the required transaction information from the original primary is
> unavailable. Moreover, this limitation is relevant only when the subscription
> origin option is set to ANY, as only in that case changes from other origins
> can be replicated. Since retain_conflict_info is primarily useful for conflict
> detection in bidirectional clusters where the origin option is set to NONE,
> this limitation appears acceptable.
>
> Given these findings, to help users avoid unintended configurations, we plan to
> issue a warning in scenarios where replicated changes may include origins other
> than the direct publisher, similar to the existing checks in the
> check_publications_origin() function.
>
> Here is the latest patch that implements the warning and documents
> this case. Only 0001 is modified for this.
>
> A big thanks to Nisha for invaluable assistance in identifying this
> case and preparing the analysis for it.

I'm still reviewing the 0001 patch but let me share some comments and
questions I have so far:

---
It seems there is no place where we describe the overall idea of
reliably detecting update_deleted conflicts. The comment atop
maybe_advance_nonremovable_xid() describes why the implemented
algorithm works for that purpose but doesn't how it is implemented,
for example the relationship with pg_conflict_detection slot. I'm not
sure the long comment atop maybe_advance_nonremovable_xid() is the
right place as it seems to be a description beyond explaining
maybe_advance_nonremovable_xid() function. Probably we can move that
comment and explain the overall idea somewhere for example atop
worker.c or in README.

---
The new parameter name "retain_conflict_info" sounds to me like we
keep the conflict information somewhere that has expired at some time
such as how many times insert_exists or update_origin_differs
happened. How about choosing a name that indicates retain dead tuples
more explicitly for example retain_dead_tuples?

---
You mentioned in the previous email:

> Furthermore, we tested running pgbench on both publisher and subscriber[3].
> Some regression was observed in TPS on the subscriber, because workload on the
> publisher is pretty high and the apply workers must wait for the amount of
> transactions with earlier timestamps to be applied and flushed before advancing
> the non-removable XID to remove dead tuples. This is the expected behavior of
> this approach since the patch's main goal is to retain dead tuples for reliable
> conflict detection.

Have you conducted any performance testing of a scenario where a
publisher replicates a large number of databases (say 64) to a
subscriber? I'm particularly interested in a configuration where
retain_conflict_info is set to true, and there are 64 apply workers
running on the subscriber side. In such a setup, even when running
pgbench exclusively on the publisher's databases, I suspect the
replication lag would likely increase quickly, as all apply workers on
the subscriber would be impacted by the overhead of retaining dead
tuples.

---
@@ -71,8 +72,9 @@
 #define SUBOPT_PASSWORD_REQUIRED   0x00000800
 #define SUBOPT_RUN_AS_OWNER            0x00001000
 #define SUBOPT_FAILOVER                0x00002000
-#define SUBOPT_LSN                 0x00004000
-#define SUBOPT_ORIGIN              0x00008000
+#define SUBOPT_RETAIN_CONFLICT_INFO    0x00004000
+#define SUBOPT_LSN                 0x00008000
+#define SUBOPT_ORIGIN              0x00010000

Why do we need to change the existing options' value?

---
+                * This is required to ensure that we don't advance the xmin
+                * of CONFLICT_DETECTION_SLOT even if one of the subscriptions
+                * is not enabled. Otherwise, we won't be able to detect

I guess the "even" in the first sentence is not necessary.

---
+/*
+ * Determine the minimum non-removable transaction ID across all apply workers
+ * for subscriptions that have retain_conflict_info enabled. Store the result
+ * in *xmin.
+ *
+ * If the replication slot cannot be advanced during this cycle, due to either
+ * a disabled subscription or an inactive worker, set *can_advance_xmin to
+ * false.
+ */
+static void
+compute_min_nonremovable_xid(LogicalRepWorker *worker,
+                            bool retain_conflict_info, TransactionId *xmin,
+                            bool *can_advance_xmin)

I think this function is quite confusing for several reasons. For
instance, it's doing more things than described in the comments such
as trying to create the CONFLICT_DETECTION_SLOT if no worker is
passed. Also, one of the callers describes:

+               /*
+                * This is required to ensure that we don't advance the xmin
+                * of CONFLICT_DETECTION_SLOT even if one of the subscriptions
+                * is not enabled. Otherwise, we won't be able to detect
+                * conflicts reliably for such a subscription even though it
+                * has set the retain_conflict_info option.
+                */
+               compute_min_nonremovable_xid(NULL, sub->retainconflictinfo,
+                                            &xmin, &can_advance_xmin);

but it's unclear to me from the function name that it tries to create
the replication slot. Furthermore, in this path it doesn't actually
compute xmin. I guess we can try to create CONFLICT_DETECTION_SLOT in
the loop of "foreach(lc, sublist)" and set false to can_advance_xmin
if either the subscription is disabled or the worker is not running.

---
+   FullTransactionId remote_oldestxid; /* oldest transaction ID that was in
+                                        * the commit phase on the publisher.
+                                        * Use FullTransactionId to prevent
+                                        * issues with transaction ID
+                                        * wraparound, where a new
+                                        * remote_oldestxid could falsely
+                                        * appear to originate from the past
+                                        * and block advancement */
+   FullTransactionId remote_nextxid;   /* next transaction ID to be assigned
+                                        * on the publisher. Use
+                                        * FullTransactionId for consistency
+                                        * and to allow straightforward
+                                        * comparisons with remote_oldestxid. */

I think it would be readable if we could write them above each field.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Sun, Jul 6, 2025 at 8:03 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear hackers,
>
> As a confirmation purpose, I did performance testing with four workloads
> we did before.

Thank you for doing the performance tests!

>
> 03. pgbench on both sides
> ========================
> The workload is mostly same as [3].
>
> Workload:
>  - Ran pgbench with 40 clients for the *both side*.
>  - The duration was 120s, and the measurement was repeated 10 times.
>
> (bothtest.tar.gz can run the same workload)
>
> Test Scenarios & Results:
> Publisher:
>  - pgHead : Median TPS = 16799.67659
>  - pgHead + patch : Median TPS = 17338.38423
> Subscriber:
>  - pgHead : Median TPS = 16552.60515
>  - pgHead + patch : Median TPS = 8367.133693

My first impression is that 40 clients is a small number at which a
50% performance degradation occurs in 120s. Did you test how many
clients are required to trigger the same level of performance regression
with retain_conflict_info = off?

>
> 04. pgbench on both side, and max_conflict_retention_duration was tuned
> ========================================================================
> The workload is mostly same as [4].
>
> Workload:
> - Initially ran pgbench with 40 clients for the *both side*.
> - Set max_conflict_retention_duration = {60, 120}
> - When the slot is invalidated on the subscriber side, stop the benchmark and
>   wait until the subscriber would be caught up. Then the number of clients on
>   the publisher would be half.
>   In this test the conflict slot could be invalidated as expected when the workload
>   on the publisher was high, and it would not get invalidated anymore after
>   reducing the workload. This shows even if the slot has been invalidated once,
>   users can continue to detect the update_deleted conflict by reduce the
>   workload on the publisher.
> - Total period of the test was 900s for each cases.
>
> (max_conflixt.tar.gz can run the same workload)
>
> Observation:
>  -
>  - Parallelism of the publisher side is reduced till 15->7->3 and finally the
>    conflict slot is not invalidated.
>  - TPS on the subscriber side is improved when the concurrency was reduced.
>    This is because the dead tuple accumulation is reduced on subscriber due to
>    the reduced workload on the publisher.
>  - when publisher has Nclients=3, no regression in subscriber's TPS

I think that users typically cannot control the amount of workload in
production, meaning that once the performance regression starts to
happen, the subscriber could enter a loop of invalidating the slot,
recovering the performance, recreating the slot, and hitting the
performance problem again.

>
> Detailed Results Table:
>         For max_conflict_retention_duration = 60s
>         On publisher:
>                 Nclients                duration [s]    TPS
>                 15              72              14079.1
>                 7               82              9307
>                 3               446             4133.2
>
>         On subscriber:
>                 Nclients                duration [s]    TPS
>                 15              72              6827
>                 15              81              7200
>                 15              446             19129.4
>
>
>         For max_conflict_retention_duration = 120s
>         On publisher:
>                 Nclients                duration [s]    TPS
>                 15                              162                             17835.3
>                 7                               152                             9503.8
>                 3                               283                             4243.9
>
>
>         On subscriber:
>                 Nclients                duration [s]    TPS
>                 15                              162                             4571.8
>                 15                              152                             4707
>                 15                              283                             19568.4

What does each duration mean in these results? Can we interpret the
test case of max_conflict_retention_duration=120s as meaning that when
7 clients and 15 clients are working on the publisher and the
subscriber respectively, the TPS on the subscriber was about one
fourth (17835.3 vs. 4707)?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Sun, Jul 6, 2025 at 10:51 PM Masahiko Sawada wrote:
> 
> On Fri, Jul 4, 2025 at 8:18 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Wed, Jul 2, 2025 at 3:28 PM Hou, Zhijie wrote:
> > > Kindly use the latest patch set for performance testing.
> >
> > During testing, we observed a limitation in cascading logical
> > replication setups, such as (A -> B -> C). When retain_conflict_info
> > is enabled on Node C, it may not retain information necessary for
> > conflict detection when applying changes originally replicated from
> > Node A. This happens because Node C only waits for locally originated
> > changes on Node B to be applied before advancing the non-removable
> transaction ID.
> >
> > For example, Consider a logical replication setup as mentioned above : A -> B
> -> C.
> >  - All three nodes have a table t1 with two tuples (1,1) (2,2).
> >  - Node B subscribed to all changes of t1 from Node A
> >  - Node-C subscribed to all changes from Node B.
> >  - Subscriptions use the default origin=ANY, as this is not a bidirectional
> >    setup.
> >
> > Now, consider two concurrent operations:
> >   - @9:00 Node A - UPDATE (1,1) -> (1,11)
> >
> >   - @9:02 Node C - DELETE (1,1)
> >
> > Assume a slight delay at Node B before it applies the update from Node A.
> >
> >  @9:03 Node C - advances the non-removable XID because it sees no
> > concurrent  transactions from Node B. It is unaware of Node A’s concurrent
> update.
> >
> >   @9:04 Node B - receives Node A's UPDATE and applies (1,1) -> (1,11)
> >   t1 has tuples : (1,11), (2,2)
> >
> >   @9:05 Node C - receives the UPDATE (1,1) -> (1,11)
> >     - As conflict slot’s xmin is advanced, the deleted tuple may already
> have
> >       been removed.
> >     - Conflict resolution fails to detect update_deleted and instead raises
> >       update_missing.
> >
> > Note that, as per decoding logic Node C sees the commit timestamp of
> > the update as 9:00 (origin commit_ts from Node A), not 9:04 (commit
> > time on Node B). In this case, since the UPDATE's timestamp is earlier
> > than the DELETE, Node C should ideally detect an update_deleted
> > conflict. However, it cannot, because it no longer retains the deleted tuple.
> >
> > Even if Node C attempts to retrieve the latest WAL position from Node
> > A, Node C doesn't maintain any LSN which we could use to compare with it.
> >
> > This scenario is similar to another restriction in the patch where
> > retain_conflict_info is not supported if the publisher is also a
> > physical standby, as the required transaction information from the
> > original primary is unavailable. Moreover, this limitation is relevant
> > only when the subscription origin option is set to ANY, as only in
> > that case changes from other origins can be replicated. Since
> > retain_conflict_info is primarily useful for conflict detection in
> > bidirectional clusters where the origin option is set to NONE, this limitation
> appears acceptable.
> >
> > Given these findings, to help users avoid unintended configurations,
> > we plan to issue a warning in scenarios where replicated changes may
> > include origins other than the direct publisher, similar to the
> > existing checks in the
> > check_publications_origin() function.
> >
> > Here is the latest patch that implements the warning and documents
> > this case. Only 0001 is modified for this.
> >
> > A big thanks to Nisha for invaluable assistance in identifying this
> > case and preparing the analysis for it.
> 
> I'm still reviewing the 0001 patch but let me share some comments and
> questions I have so far:

Thanks for the comments!

> 
> ---
> It seems there is no place where we describe the overall idea of reliably
> detecting update_deleted conflicts. The comment atop
> maybe_advance_nonremovable_xid() describes why the implemented
> algorithm works for that purpose but doesn't how it is implemented, for
> example the relationship with pg_conflict_detection slot. I'm not sure the long
> comment atop maybe_advance_nonremovable_xid() is the right place as it
> seems to be a description beyond explaining
> maybe_advance_nonremovable_xid() function. Probably we can move that
> comment and explain the overall idea somewhere for example atop worker.c or
> in README.

I think it makes sense to explain it atop worker.c; will do.

> 
> ---
> The new parameter name "retain_conflict_info" sounds to me like we keep the
> conflict information somewhere that has expired at some time such as how
> many times insert_exists or update_origin_differs happened. How about
> choosing a name that indicates retain dead tuples more explicitly for example
> retain_dead_tuples?

We considered the name you suggested, but we wanted to convey that this option
not only retains dead tuples but also preserves commit timestamps and origin
data for conflict detection, hence we opted for a more general name. Do you
have better suggestions?

> 
> ---
> You mentioned in the previous email:
> 
> > Furthermore, we tested running pgbench on both publisher and
> subscriber[3].
> > Some regression was observed in TPS on the subscriber, because
> > workload on the publisher is pretty high and the apply workers must
> > wait for the amount of transactions with earlier timestamps to be
> > applied and flushed before advancing the non-removable XID to remove
> > dead tuples. This is the expected behavior of this approach since the
> > patch's main goal is to retain dead tuples for reliable conflict detection.
> 
> Have you conducted any performance testing of a scenario where a publisher
> replicates a large number of databases (say 64) to a subscriber? I'm particularly
> interested in a configuration where retain_conflict_info is set to true, and there
> are 64 apply workers running on the subscriber side. In such a setup, even
> when running pgbench exclusively on the publisher's databases, I suspect the
> replication lag would likely increase quickly, as all apply workers on the
> subscriber would be impacted by the overhead of retaining dead tuples.

We will try this workload and share the feedback.


> 
> ---
> @@ -71,8 +72,9 @@
>  #define SUBOPT_PASSWORD_REQUIRED   0x00000800
>  #define SUBOPT_RUN_AS_OWNER            0x00001000
>  #define SUBOPT_FAILOVER                0x00002000
> -#define SUBOPT_LSN                 0x00004000
> -#define SUBOPT_ORIGIN              0x00008000
> +#define SUBOPT_RETAIN_CONFLICT_INFO    0x00004000
> +#define SUBOPT_LSN                 0x00008000
> +#define SUBOPT_ORIGIN              0x00010000
> 
> Why do we need to change the existing options' value?

The intention is to position the new option after the failover option, ensuring
consistency with the order in the pg_subscription catalog. I think modifying existing
options in a major version is acceptable, as we have done similarly in commits
4826759 and 776621a.


> 
> ---
> +                * This is required to ensure that we don't advance the xmin
> +                * of CONFLICT_DETECTION_SLOT even if one of the
> subscriptions
> +                * is not enabled. Otherwise, we won't be able to detect
> 
> I guess the "even" in the first sentence is not necessary.

Agreed.

> 
> ---
> +/*
> + * Determine the minimum non-removable transaction ID across all apply
> +workers
> + * for subscriptions that have retain_conflict_info enabled. Store the
> +result
> + * in *xmin.
> + *
> + * If the replication slot cannot be advanced during this cycle, due to
> +either
> + * a disabled subscription or an inactive worker, set *can_advance_xmin
> +to
> + * false.
> + */
> +static void
> +compute_min_nonremovable_xid(LogicalRepWorker *worker,
> +                            bool retain_conflict_info, TransactionId *xmin,
> +                            bool *can_advance_xmin)
> 
> I think this function is quite confusing for several reasons. For instance, it's
> doing more things than described in the comments such as trying to create the
> CONFLICT_DETECTION_SLOT if no worker is passed. Also, one of the caller
> describes:
> 
> +               /*
> +                * This is required to ensure that we don't advance the xmin
> +                * of CONFLICT_DETECTION_SLOT even if one of the
> subscriptions
> +                * is not enabled. Otherwise, we won't be able to detect
> +                * conflicts reliably for such a subscription even though it
> +                * has set the retain_conflict_info option.
> +                */
> +               compute_min_nonremovable_xid(NULL,
> sub->retainconflictinfo,
> +                                            &xmin, &can_advance_xmin);
> 
> but it's unclear to me from the function name that it tries to create the
> replication slot. Furthermore, in this path it doesn't actually compute xmin. I
> guess we can try to create CONFLICT_DETECTION_SLOT in the loop of
> "foreach(lc, sublist)" and set false to can_advance_xmin if either the
> subscription is disabled or the worker is not running.

I understand. The original code was similar to your suggestion, but we decided
to encapsulate it within a separate function to maintain a clean and concise
main loop. However, your suggestion also makes sense, so I will proceed with
the change.
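Roughly along these lines (a sketch with assumed names; locking and error
handling omitted, and the helper/field names are taken loosely from the patch
rather than being final):

    foreach(lc, sublist)
    {
        Subscription *sub = (Subscription *) lfirst(lc);
        LogicalRepWorker *worker;

        if (!sub->retainconflictinfo)
            continue;

        /* Ensure the conflict detection slot exists before computing xmin. */
        CreateConflictDetectionSlot();      /* assumed helper name */

        worker = logicalrep_worker_find(sub->oid, InvalidOid, false);

        if (!sub->enabled || worker == NULL)
        {
            can_advance_xmin = false;
            continue;
        }

        /* Accumulate the minimum non-removable xid across apply workers. */
        if (!TransactionIdIsValid(xmin) ||
            TransactionIdPrecedes(worker->oldest_nonremovable_xid, xmin))
            xmin = worker->oldest_nonremovable_xid;
    }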


> 
> ---
> +   FullTransactionId remote_oldestxid; /* oldest transaction ID that was in
> +                                        * the commit phase on the
> publisher.
> +                                        * Use FullTransactionId to prevent
> +                                        * issues with transaction ID
> +                                        * wraparound, where a new
> +                                        * remote_oldestxid could falsely
> +                                        * appear to originate from the past
> +                                        * and block advancement */
> +   FullTransactionId remote_nextxid;   /* next transaction ID to be assigned
> +                                        * on the publisher. Use
> +                                        * FullTransactionId for consistency
> +                                        * and to allow straightforward
> +                                        * comparisons with
> + remote_oldestxid. */
> 
> I think it would be readable if we could write them above each field.

Will adjust.
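For example, the layout could become (same wording as in the quoted patch,
just placed above each field):

    /*
     * Oldest transaction ID that was in the commit phase on the publisher.
     * Use FullTransactionId to prevent issues with transaction ID wraparound,
     * where a new remote_oldestxid could falsely appear to originate from the
     * past and block advancement.
     */
    FullTransactionId remote_oldestxid;

    /*
     * Next transaction ID to be assigned on the publisher. Use
     * FullTransactionId for consistency and to allow straightforward
     * comparisons with remote_oldestxid.
     */
    FullTransactionId remote_nextxid;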

Best Regards,
Hou zj

RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Sun, Jul 6, 2025 at 10:51 PM Masahiko Sawada wrote:

> 
> On Sun, Jul 6, 2025 at 8:03 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear hackers,
> >
> > As a confirmation purpose, I did performance testing with four
> > workloads we did before.
> 
> Thank you for doing the performance tests!
> 
> >
> > 03. pgbench on both sides
> > ========================
> > The workload is mostly same as [3].
> >
> > Workload:
> >  - Ran pgbench with 40 clients for the *both side*.
> >  - The duration was 120s, and the measurement was repeated 10 times.
> >
> > (bothtest.tar.gz can run the same workload)
> >
> > Test Scenarios & Results:
> > Publisher:
> >  - pgHead : Median TPS = 16799.67659
> >  - pgHead + patch : Median TPS = 17338.38423
> > Subscriber:
> >  - pgHead : Median TPS = 16552.60515
> >  - pgHead + patch : Median TPS = 8367.133693
> 
> My first impression is that 40 clients is a small number at which a 50%
> performance degradation occurs in 120s. Did you test how many clients are
> required to trigger the same level performance regression with
> retain_conflict_info = off?

Could you please elaborate further on the intention behind the suggested tests
and what outcomes are expected? I ask because we anticipate that disabling
retain_conflict_info should not cause regression, given that dead tuples will
not be retained.


> 
> >
> > 04. pgbench on both side, and max_conflict_retention_duration was
> > tuned
> >
> ================================================
> ======================
> > ==
> > The workload is mostly same as [4].
> >
> > Workload:
> > - Initially ran pgbench with 40 clients for the *both side*.
> > - Set max_conflict_retention_duration = {60, 120}
> > - When the slot is invalidated on the subscriber side, stop the benchmark
> and
> >   wait until the subscriber would be caught up. Then the number of clients
> on
> >   the publisher would be half.
> >   In this test the conflict slot could be invalidated as expected when the
> workload
> >   on the publisher was high, and it would not get invalidated anymore after
> >   reducing the workload. This shows even if the slot has been invalidated
> once,
> >   users can continue to detect the update_deleted conflict by reduce the
> >   workload on the publisher.
> > - Total period of the test was 900s for each cases.
> >
> > (max_conflixt.tar.gz can run the same workload)
> >
> > Observation:
> >  -
> >  - Parallelism of the publisher side is reduced till 15->7->3 and finally the
> >    conflict slot is not invalidated.
> >  - TPS on the subscriber side is improved when the concurrency was
> reduced.
> >    This is because the dead tuple accumulation is reduced on subscriber
> due to
> >    the reduced workload on the publisher.
> >  - when publisher has Nclients=3, no regression in subscriber's TPS
> 
> I think that users typically cannot control the amount of workloads in
> production, meaning that once the performance regression starts to happen
> the subscriber could enter the loop where invalidating the slot, recovreing the
> performance, creating the slot, and having the performance problem.

Yes, you are right. The test is designed to demonstrate that the slot can
be invalidated under high workload conditions as expected, while
remaining valid if the workload is reduced. In production systems where
workload reduction may not be possible, it’s recommended to disable
`retain_conflict_info` to enhance performance. This decision involves balancing
the need for reliable conflict detection with optimal system performance.
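
For illustration, turning the option off on an existing subscription would look
roughly like the following (a minimal sketch; 'sub_node_a' is a made-up
subscription name, and retain_conflict_info is the option proposed in this
patch set):

    ALTER SUBSCRIPTION sub_node_a SET (retain_conflict_info = off);
    -- dead tuples are no longer retained for conflict detection, so
    -- update_deleted can no longer be reliably distinguished from
    -- update_missing on this subscriber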

I think the hot standby feedback also has a similar impact on the performance
of the primary, which is done to prevent the early removal of data necessary
for the standby, ensuring that it remains accessible when needed.


Best Regards,
Hou zj

RE: Conflict detection for update_deleted in logical replication

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Sawada-san,

> What does each duration mean in these results? Can we interpret the
> test case of max_conflict_retention_duration=120s that when 7 clients
> and 15 clients are working on the publisher and the subscriber
> respectively, the TPS on the subscriber was about one fourth (17835.3
> vs. 4707)?

Firstly, this workload is meant to prove that users can tune their workload to
keep update_deleted detection enabled. Let me describe what happened with a
timetable measured from the start of the test.

0-162s:
The number of clients on both publisher and subscriber was 15. TPS was 17835.3
on the publisher and 4571.8 on the subscriber. This means that retained dead
tuples on the subscriber may reduce performance to around 1/4 of the publisher's,
and the workload on the publisher is too heavy to keep update_deleted detection
working.

163-314s:
The number of clients was 7 on the publisher and 15 on the subscriber. TPS was
9503.8 on the publisher and 4707 on the subscriber. This means that N=7 on the
publisher was still too high, so the conflict slot had to be invalidated.

315-597s:
The number of clients was 3 on the publisher and 15 on the subscriber. TPS was
4243.9 on the publisher and 19568.4 on the subscriber. Here the conflict slot
could survive the benchmark because concurrency on the publisher was reduced.
Performance improved on the subscriber side because fewer dead tuples
accumulated.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Mon, Jul 7, 2025 at 11:03 AM Zhijie Hou (Fujitsu) wrote:
> 
> On Sun, Jul 6, 2025 at 10:51 PM Masahiko Sawada wrote:
> ================================================
> > ======================
> > > ==
> > > The workload is mostly same as [4].
> > >
> > > Workload:
> > > - Initially ran pgbench with 40 clients for the *both side*.
> > > - Set max_conflict_retention_duration = {60, 120}
> > > - When the slot is invalidated on the subscriber side, stop the
> > > benchmark
> > and
> > >   wait until the subscriber would be caught up. Then the number of
> > > clients
> > on
> > >   the publisher would be half.
> > >   In this test the conflict slot could be invalidated as expected
> > > when the
> > workload
> > >   on the publisher was high, and it would not get invalidated anymore
> after
> > >   reducing the workload. This shows even if the slot has been
> > > invalidated
> > once,
> > >   users can continue to detect the update_deleted conflict by reduce the
> > >   workload on the publisher.
> > > - Total period of the test was 900s for each cases.
> > >
> > > (max_conflixt.tar.gz can run the same workload)
> > >
> > > Observation:
> > >  -
> > >  - Parallelism of the publisher side is reduced till 15->7->3 and finally the
> > >    conflict slot is not invalidated.
> > >  - TPS on the subscriber side is improved when the concurrency was
> > reduced.
> > >    This is because the dead tuple accumulation is reduced on
> > > subscriber
> > due to
> > >    the reduced workload on the publisher.
> > >  - when publisher has Nclients=3, no regression in subscriber's TPS
> >
> > I think that users typically cannot control the amount of workloads in
> > production, meaning that once the performance regression starts to
> > happen the subscriber could enter the loop where invalidating the
> > slot, recovreing the performance, creating the slot, and having the
> performance problem.
> 
> Yes, you are right. The test is designed to demonstrate that the slot can be
> invalidated under high workload conditions as expected, while remaining valid
> if the workload is reduced. In production systems where workload reduction
> may not be possible, it’s recommended to disable `retain_conflict_info` to
> enhance performance. This decision involves balancing the need for reliable
> conflict detection with optimal system performance.
> 
> I think the hot standby feedback also has a similar impact on the performance
> of the primary, which is done to prevent the early removal of data necessary for
> the standby, ensuring that it remains accessible when needed.

For reference, we conducted a test[1] to evaluate the impact of enabling hot
standby feedback in a physical replication setup, and observed approximately
a 50% regression in TPS on the primary as well.

[1] https://www.postgresql.org/message-id/CABdArM4OEwmh_31dQ8_F__VmHwk2ag_M%3DYDD4H%2ByYQBG%2BbHGzg%40mail.gmail.com
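
For anyone reproducing that setup, hot_standby_feedback is an existing GUC set
on the physical standby, e.g. (sketch only):

    -- on the standby
    ALTER SYSTEM SET hot_standby_feedback = on;
    SELECT pg_reload_conf();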

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Mon, Jul 7, 2025 at 12:03 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Sun, Jul 6, 2025 at 10:51 PM Masahiko Sawada wrote:
>
> >
> > On Sun, Jul 6, 2025 at 8:03 PM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > >
> > > Dear hackers,
> > >
> > > As a confirmation purpose, I did performance testing with four
> > > workloads we did before.
> >
> > Thank you for doing the performance tests!
> >
> > >
> > > 03. pgbench on both sides
> > > ========================
> > > The workload is mostly same as [3].
> > >
> > > Workload:
> > >  - Ran pgbench with 40 clients for the *both side*.
> > >  - The duration was 120s, and the measurement was repeated 10 times.
> > >
> > > (bothtest.tar.gz can run the same workload)
> > >
> > > Test Scenarios & Results:
> > > Publisher:
> > >  - pgHead : Median TPS = 16799.67659
> > >  - pgHead + patch : Median TPS = 17338.38423
> > > Subscriber:
> > >  - pgHead : Median TPS = 16552.60515
> > >  - pgHead + patch : Median TPS = 8367.133693
> >
> > My first impression is that 40 clients is a small number at which a 50%
> > performance degradation occurs in 120s. Did you test how many clients are
> > required to trigger the same level performance regression with
> > retain_conflict_info = off?
>
> Could you please elaborate further on the intention behind the suggested tests
> and what outcomes are expected? I ask because we anticipate that disabling
> retain_conflict_info should not cause regression, given that dead tuples will
> not be retained.

I think these performance regressions occur because at some point the
subscriber can no longer keep up with the changes occurring on the
publisher. This is because the publisher runs multiple transactions
simultaneously, while the Subscriber applies them with one apply
worker. When retain_conflict_info = on, the performance of the apply
worker deteriorates because it retains dead tuples, and as a result it
gradually cannot keep up with the publisher, the table bloats, and the
TPS of pgbench executed on the subscriber is also affected. This
happened when only 40 clients (or 15 clients according to the results
of test 4?) were running simultaneously.

I think that even with retain_conflict_info = off, there is probably a
point at which the subscriber can no longer keep up with the
publisher. For example, if with retain_conflict_info = off we can
withstand 100 clients running at the same time, then the fact that
this performance degradation occurred with 15 clients explains that
performance degradation is much more likely to occur because of
retain_conflict_info = on.

Test cases 3 and 4 are typical cases where this feature is used, since
the conflicts actually happen on the subscriber, so I think it's
important to look at the performance in these cases. The worst case
scenario for this feature is that when this feature is turned on, the
subscriber cannot keep up even with a small load, and with
max_conflict_retention_duration we enter a loop of slot invalidation
and re-creation, which means that conflicts cannot be detected
reliably.

> >
> > >
> > > 04. pgbench on both side, and max_conflict_retention_duration was
> > > tuned
> > >
> > ================================================
> > ======================
> > > ==
> > > The workload is mostly same as [4].
> > >
> > > Workload:
> > > - Initially ran pgbench with 40 clients for the *both side*.
> > > - Set max_conflict_retention_duration = {60, 120}
> > > - When the slot is invalidated on the subscriber side, stop the benchmark
> > and
> > >   wait until the subscriber would be caught up. Then the number of clients
> > on
> > >   the publisher would be half.
> > >   In this test the conflict slot could be invalidated as expected when the
> > workload
> > >   on the publisher was high, and it would not get invalidated anymore after
> > >   reducing the workload. This shows even if the slot has been invalidated
> > once,
> > >   users can continue to detect the update_deleted conflict by reduce the
> > >   workload on the publisher.
> > > - Total period of the test was 900s for each cases.
> > >
> > > (max_conflixt.tar.gz can run the same workload)
> > >
> > > Observation:
> > >  -
> > >  - Parallelism of the publisher side is reduced till 15->7->3 and finally the
> > >    conflict slot is not invalidated.
> > >  - TPS on the subscriber side is improved when the concurrency was
> > reduced.
> > >    This is because the dead tuple accumulation is reduced on subscriber
> > due to
> > >    the reduced workload on the publisher.
> > >  - when publisher has Nclients=3, no regression in subscriber's TPS
> >
> > I think that users typically cannot control the amount of workloads in
> > production, meaning that once the performance regression starts to happen
> > the subscriber could enter the loop where invalidating the slot, recovreing the
> > performance, creating the slot, and having the performance problem.
>
> Yes, you are right. The test is designed to demonstrate that the slot can
> be invalidated under high workload conditions as expected, while
> remaining valid if the workload is reduced. In production systems where
> workload reduction may not be possible, it’s recommended to disable
> `retain_conflict_info` to enhance performance. This decision involves balancing
> the need for reliable conflict detection with optimal system performance.

Agreed. I'm a bit concerned that the range in which users can achieve
this balance is small.

> I think the hot standby feedback also has a similar impact on the performance
> of the primary, which is done to prevent the early removal of data necessary
> for the standby, ensuring that it remains accessible when needed.

Right. I think it's likely to happen if there is a long running
read-only query on the standby. But does it happen also when there are
only short read-only transactions on the standbys?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Mon, Jul 7, 2025 at 12:29 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > What does each duration mean in these results? Can we interpret the
> > test case of max_conflict_retention_duration=120s that when 7 clients
> > and 15 clients are working on the publisher and the subscriber
> > respectively, the TPS on the subscriber was about one fourth (17835.3
> > vs. 4707)?
>
> Firstly, this workload is done to prove that users can tune their workload to keep
> enabling the update_deleted detections. Let me describe what happened there with
> the timetable since the test starts.
>
> 0-162s:
> Number of clients on both publisher/subscriber was 15. TPS was 17835.3 on the
> publisher and 4571.8 on the subscriber. This means that retained dead tuples on
> the subscriber may reduce the performance to around 1/4 compared with publisher,
> and the workload on the publisher is too heavy to keep working the update_deleted
> detection.
>
> 163-314s:
> Number of clients was 7 on publisher, and 15 on subscriber. TPS was 9503.8 on
> the publisher and 4707 on the subscriber. This means that N=7 on the publisher
> was still too many thus conflict slot must be invalidated.
>
> 315-597s:
> Number of clients was 3 on publisher, and 15 on subscriber. TPS was 4243.9 on
> the publisher and 19568.4 on the subscriber. Here the conflict slot could survive
> during the benchmark because concurrency on the publisher was reduced.
> Performance could be improved on the subscriber side because dead tuples can be
> reduced here.

Thank you for your explanation!

Since this feature is designed to reliably detect conflicts on the
subscriber side, this scenario, where both the publisher and
subscriber are under load, represents a typical use case.

The fact that the subscriber can withstand the case where N=7 on the
publisher and N=15 on the subscriber with retain_conflict_info =
false, but fails to do so when retain_conflict_info = true, might
suggest a significant performance impact from enabling this feature.
In these test cases, was autovacuum disabled? I'm curious whether
users would experience permanently reduced transaction throughput, or
if this performance drop is temporary and recovers after autovacuum
runs.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Tue, Jul 8, 2025 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Jul 7, 2025 at 12:03 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
>
> I think these performance regressions occur because at some point the
> subscriber can no longer keep up with the changes occurring on the
> publisher. This is because the publisher runs multiple transactions
> simultaneously, while the Subscriber applies them with one apply
> worker. When retain_conflict_info = on, the performance of the apply
> worker deteriorates because it retains dead tuples, and as a result it
> gradually cannot keep up with the publisher, the table bloats, and the
> TPS of pgbench executed on the subscriber is also affected. This
> happened when only 40 clients (or 15 clients according to the results
> of test 4?) were running simultaneously.
>

I think the primary reason here is the speed of one apply worker vs.
15 or 40 clients working on the publisher, with all the data being
replicated. We don't see a regression at 3 clients, which suggests the
apply worker is able to keep up with that much workload. Now, we have
checked that if the workload is slightly different, such that fewer
clients (say 1-3) work on the same set of tables and we create a
separate pub-sub pair for each such set of clients (for example, 3
clients working on tables t1 and t2 and another 3 clients working on
tables t3 and t4, giving 2 pub-sub pairs, one for tables t1 and t2 and
the other for t3 and t4, as sketched below), then there is almost
negligible regression after enabling retain_conflict_info.
Additionally, for very large transactions that can be parallelized, we
shouldn't see any regression because those can be applied in parallel.
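
As a rough sketch of the above (table, connection, and subscription names are
just examples; retain_conflict_info is the option from this patch set), the
two pub-sub pairs could be defined as:

    -- on the publisher
    CREATE PUBLICATION pub_set1 FOR TABLE t1, t2;
    CREATE PUBLICATION pub_set2 FOR TABLE t3, t4;

    -- on the subscriber, one apply worker per table set
    CREATE SUBSCRIPTION sub_set1 CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION pub_set1 WITH (retain_conflict_info = on);
    CREATE SUBSCRIPTION sub_set2 CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION pub_set2 WITH (retain_conflict_info = on);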

> I think that even with retain_conflict_info = off, there is probably a
> point at which the subscriber can no longer keep up with the
> publisher. For example, if with retain_conflict_info = off we can
> withstand 100 clients running at the same time, then the fact that
> this performance degradation occurred with 15 clients explains that
> performance degradation is much more likely to occur because of
> retain_conflict_info = on.
>
> Test cases 3 and 4 are typical cases where this feature is used since
> the  conflicts actually happen on the subscriber, so I think it's
> important to look at the performance in these cases. The worst case
> scenario for this feature is that when this feature is turned on, the
> subscriber cannot keep up even with a small load, and with
> max_conflict_retetion_duration we enter a loop of slot invalidation
> and re-creating, which means that conflict cannot be detected
> reliably.
>

As per the above observations, it is less a regression of this
feature and more a lack of parallel apply or some kind of pre-fetch
for apply, as recently proposed [1]. I feel there are use cases, as
explained above, for which this feature would work without any
downside, but due to the lack of some sort of parallel apply, we may not
be able to use it without any downside for cases where the contention
is only on a smaller set of tables. We have not tried it, but in
cases where contention is on a smaller set of tables, if users
distribute the workload among different pub-sub pairs by using row
filters, we may also see less regression there. We can try that
as well.

>
> > I think the hot standby feedback also has a similar impact on the performance
> > of the primary, which is done to prevent the early removal of data necessary
> > for the standby, ensuring that it remains accessible when needed.
>
> Right. I think it's likely to happen if there is a long running
> read-only query on the standby. But does it happen also when there are
> only short read-only transactions on the standbys?
>

IIUC, the regression happens simply by increasing the value of
recovery_min_apply_delay. See case 5 in email [2]. This shows that we
can see some regression in physical replication when there is a delay
in replication.

[1] - https://www.postgresql.org/message-id/7b60e4e1-de40-4956-8135-cb1dc2be62e9%40garret.ru
[2] - https://www.postgresql.org/message-id/CABdArM4OEwmh_31dQ8_F__VmHwk2ag_M%3DYDD4H%2ByYQBG%2BbHGzg%40mail.gmail.com

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear hackers,

All the tests in [1] were done with autovacuum=off, so I checked how things
would look in the autovacuum=on case.

Highlights
==========
* The regression on the subscriber side became a bit larger than in the
  autovacuum=off case when pgbench was run on both sides
* Other than that, the results were mostly the same.

Used source
===========
pgHead commit fd7d7b7191 + v46 patchset

Machine details
===============
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, CPU(s): 88 cores, 503 GiB RAM

01. pgbench on publisher
========================
The workload is mostly the same as [1], but autovacuum was enabled.

Workload:
 - Ran pgbench with 40 clients for the publisher.
 - The duration was 300s, and the measurement was repeated 3 times.

Test Scenarios & Results:
 - pgHead : Median TPS = 39757.24404
 - pgHead + patch : Median TPS = 40871.37782

Observation:
 - same trend as autovacuum=off
 - No performance regression observed with the patch applied.
 - The results were consistent across runs.

Detailed Results Table:
  - each cell shows the TPS in each case.
  - patch(ON) means patched and retain_conflict_info=ON is set.

run#    pgHEAD         pgHead+patch(ON) 
1    40020.19703        40871.37782
2    39650.27381        40299.31806
3    39757.24404        41017.21089
median    39757.24404        40871.37782

02. pgbench on subscriber
========================
The workload is mostly the same as [1].

Workload:
 - Ran pgbench with 40 clients for the *subscriber*.
 - The duration was 300s, and the measurement was repeated 3 times.


Test Scenarios & Results:
 - pgHead : Median TPS = 41552.27857
 - pgHead + patch : Median TPS = 41677.02942

Observation:
 - same trend as autovacuum=off
 - No performance regression observed with the patch applied.
 - The results were consistent across runs.

Detailed Results Table:

run#    pgHEAD         pgHead+patch(ON) 
1    41656.71589        41673.42577
2    41552.27857        41677.02942
3    41504.98347        42114.66964
median    41552.27857        41677.02942


03. pgbench on both sides
========================
The workload is mostly the same as [1].

Workload:
 - Ran pgbench with 15 clients on *both sides*.
 - The duration was 300s, and the measurement was repeated 3 times.

Test Scenarios & Results:
Publisher:
 - pgHead : Median TPS = 17355.08998
 - pgHead + patch : Median TPS = 18382.41213

Subscriber:
 - pgHead : Median TPS = 16998.14496
 - pgHead + patch : Median TPS = 5804.129821


Observation:
 - The regression became larger than with autovacuum = off (from ~50% to ~60%)
 - No performance regression observed on the publisher with the patch applied.
 - The performance is reduced on the subscriber side (TPS reduction of ~60%) due
   to dead tuple retention for conflict detection

Detailed Results Table:

On publisher:
run#    pgHEAD     pgHead+patch(ON) 
1    17537.52375    18382.41213
2    17355.08998    18408.0712
3    17286.78467    18119.77276
median    17355.08998    18382.41213

On subscriber:
run#    pgHEAD     pgHead+patch(ON) 
1    17130.63876    5886.375748
2    16998.14496    5737.799408
3    16891.2713    5804.129821
median    16998.14496    5804.129821

[1]
https://www.postgresql.org/message-id/OSCPR01MB1496663AED8EEC566074DFBC9F54CA%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Conflict detection for update_deleted in logical replication

От
Nisha Moond
Дата:
On Wed, Jul 9, 2025 at 5:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 8, 2025 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jul 7, 2025 at 12:03 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> >
> > I think these performance regressions occur because at some point the
> > subscriber can no longer keep up with the changes occurring on the
> > publisher. This is because the publisher runs multiple transactions
> > simultaneously, while the Subscriber applies them with one apply
> > worker. When retain_conflict_info = on, the performance of the apply
> > worker deteriorates because it retains dead tuples, and as a result it
> > gradually cannot keep up with the publisher, the table bloats, and the
> > TPS of pgbench executed on the subscriber is also affected. This
> > happened when only 40 clients (or 15 clients according to the results
> > of test 4?) were running simultaneously.
> >
>
> I think here the primary reason is the speed of one apply worker vs.
> 15 or 40 clients working on the publisher, and all the data is being
> replicated. We don't see regression at 3 clients, which suggests apply
> worker is able to keep up with that much workload. Now, we have
> checked that if the workload is slightly different such that fewer
> clients (say 1-3) work on same set of tables and then we make
> different set of pub-sub pairs for all such different set of clients
> (for example, 3 clients working on tables t1 and t2, other 3 clients
> working on tables t3 and t4; then we can have 2 pub-sub pairs, one for
> tables t1, t2, and other for t3-t4 ) then there is almost negligible
> regression after enabling retain_conflict_info. Additionally, for very
> large transactions that can be parallelized, we shouldn't see any
> regression because those can be applied in parallel.
>

Yes, in test case-03 [1], the performance drop (~50%) observed on the
subscriber side was primarily due to a single apply worker handling
changes from 40 concurrent clients on the publisher, which led to the
accumulation of dead tuples.

To validate this and simulate a more realistic workload, we designed a
test as suggested above, where multiple clients update different
tables and multiple subscriptions exist on the subscriber (one per
table set).
A custom pgbench script was created to run pgbench on the publisher,
with each client updating a unique set of tables. On the subscriber
side, one subscription was created per set of tables. Each
publication-subscription pair handles a distinct table set.

Highlights
==========
- Two tests were done with two different workloads - 15 and 45
concurrent clients, respectively.
- No regression was observed when publisher changes were processed by
multiple apply workers on the subscriber.

Used source
===========
pgHead commit 62a17a92833 + v47 patch set

Machine details
===============
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, CPU(s): 88 cores, 503 GiB RAM


01. pgbench on both sides (with 15 clients)
=====================================
Setup:
 - Publisher and Subscriber nodes are created with configurations:
    autovacuum = false
    shared_buffers = '30GB'
    -- Also, worker and logical replication related parameters were
increased as per requirement (see attached scripts for details).

Workload:
 - The publisher has 15 sets of pgbench tables: Each set includes four
tables: pgbench_accounts, pgbench_tellers, pgbench_branches, and
pgbench_history, named as:
     pgbench_accounts_0, pgbench_tellers_0, ..., pgbench_accounts_14,
pgbench_tellers_14, etc.
 - Ran pgbench with 15 clients on *both sides*.
   -- On publisher, each client updates *only one* set of pgbench
tables: e.g., client '0' updates the pgbench_xx_0 tables,  client '1'
updates pgbench_xx_1 tables, and so on.
   -- On Subscriber, there exists one subscription per set of tables
of the publisher, i.e, there is one apply worker consuming changes
corresponding to each client. So, #subscriptions on subscriber(15) =
#clients on publisher(15).
 - On subscriber, the default pgbench workload is also run with 15 clients.
 - The duration was 5 minutes, and the measurement was repeated 3 times.
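
Roughly, the per-set publications and subscriptions used above could be
created like this (a sketch only; publication/subscription names and the
connection string are placeholders, not the exact script used):

    -- on the publisher, one publication per table set (0..14)
    CREATE PUBLICATION pub_0 FOR TABLE pgbench_accounts_0, pgbench_tellers_0,
        pgbench_branches_0, pgbench_history_0;
    -- ... and so on up to pub_14

    -- on the subscriber, one subscription (hence one apply worker) per set
    CREATE SUBSCRIPTION sub_0 CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION pub_0 WITH (retain_conflict_info = on);
    -- ... and so on up to sub_14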

Test Scenarios & Results:
Publisher:
 - pgHead : Median TPS = 10386.93507
 - pgHead + patch : Median TPS = 10187.0887 (TPS reduced ~2%)

Subscriber:
 - pgHead : Median TPS = 10006.3903
 - pgHead + patch : Median TPS = 9986.269682 (TPS reduced ~0.2%)

Observation:
 - No performance regression was observed on either the publisher or
subscriber with the patch applied.
 - The TPS drop was under 2% on both sides, within expected case to
case variation range.

Detailed Results Table:
On publisher:
#run  pgHEAD    pgHead+patch(ON)
1   10477.26438   10029.36155
2   10261.63429   10187.0887
3   10386.93507   10750.86231
median  10386.93507   10187.0887

On subscriber:
#run   pgHEAD    pgHead+patch(ON)
1   10261.63429    9813.114002
2   9962.914457    9986.269682
3   10006.3903    10580.13015
median   10006.3903    9986.269682

~~~~

02. pgbench on both sides (with 45 clients)
=====================================
Setup:
 - same as case 01.

Workload:
 - Publisher has the same 15 sets of pgbench tables as in case-01 and
3 clients will be updating one set of tables.
 - Ran pgbench with 45 clients on *both sides*.
   -- On publisher, each set of pgbench tables is updated by *three* clients:
e.g., clients '0', '15' and '30' update pgbench_xx_0 tables, clients
'1', '16', and '31' update pgbench_xx_1 tables, and so on.
   -- On Subscriber, there exists one subscription per set of tables
of the publisher, i.e, there is one apply worker consuming changes
corresponding to *three* clients of the publisher.
 - On subscriber, the default pgbench workload is also run with 45 clients.
 - The duration was 5 minutes, and the measurement was repeated 3 times.

Test Scenarios & Results:
Publisher:
 - pgHead : Median TPS = 13845.7381
 - pgHead + patch : Median TPS = 13553.682 (TPS reduced ~2%)

Subscriber:
 - pgHead : Median TPS = 10080.54686
 - pgHead + patch : Median TPS = 9908.304381 (TPS reduced ~1.7%)

Observation:
 - No significant performance regression observed on either the
publisher or subscriber with the patch applied.
 - The TPS drop was under 2% on both sides, within expected case to
case variation range.

Detailed Results Table:
On publisher:
#run   pgHEAD   pgHead+patch(ON)
1   14446.62404   13616.81375
2   12988.70504   13425.22938
3   13845.7381   13553.682
median  13845.7381  13553.682

On subscriber:
#run   pgHEAD    pgHead+patch(ON)
1   10505.47481   9908.304381
2   9963.119531   9843.280308
3   10080.54686   9987.983147
median  10080.54686   9908.304381

~~~~

The scripts used to perform above tests are attached.

[1]
https://www.postgresql.org/message-id/OSCPR01MB1496663AED8EEC566074DFBC9F54CA%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear hackers,

> As per the above observations, it is less of a regression of this
> feature but more of a lack of parallel apply or some kind of pre-fetch
> for apply, as is recently proposed [1]. I feel there are use cases, as
> explained above, for which this feature would work without any
> downside, but due to a lack of some sort of parallel apply, we may not
> be able to use it without any downside for cases where the contention
> is only on a smaller set of tables. We have not tried, but may in
> cases where contention is on a smaller set of tables, if users
> distribute workload among different pub-sub pairs by using row
> filters, there also, we may also see less regression. We can try that
> as well.

I verified the row filter idea with a benchmark, and it proved to be a valid
approach. Please see the report below.

Highlights
=======
- No regression was observed when publisher changes were processed by multiple
  apply workers on the subscriber.

Used source
=========
pgHead commit 62a17a92833 + v47 patch set

Machine details
===========
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, CPU(s): 88 cores, 503 GiB RAM

Setup
====
 - Publisher and Subscriber nodes are created with configurations:
    autovacuum = false
    shared_buffers = '30GB'
    -- Also, worker and logical replication related parameters were increased
       as needed (see attached scripts for details).

Workload
======
 - The publisher has 4 pgbench tables: pgbench_pub_accounts, pgbench_pub_tellers,
   pgbench_pub_branches, and pgbench_pub_history
 - The publisher also has 15 publications, say pub_[0..14]. Each publication
   publishes only the tuples whose PK % 15 matches its suffix
 - Ran pgbench with 15 clients on *both sides*.
 - On the subscriber, there were 15 subscriptions, each subscribing to one of
   the publications
 - On the subscriber, the default pgbench workload is also run.
 - The duration was 5 minutes, and the measurement was repeated 3 times.
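
Sketching the above in SQL (the subscription names, filter columns, and
connection string are assumptions, not the exact script used):

    -- on the publisher: 15 publications, each taking 1/15 of the rows
    CREATE PUBLICATION pub_0 FOR TABLE pgbench_pub_accounts WHERE (aid % 15 = 0),
        pgbench_pub_tellers WHERE (tid % 15 = 0),
        pgbench_pub_branches WHERE (bid % 15 = 0);
    -- ... similarly pub_1 (% 15 = 1) through pub_14 (% 15 = 14)

    -- on the subscriber: one subscription per publication
    CREATE SUBSCRIPTION sub_0 CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION pub_0 WITH (retain_conflict_info = on);
    -- ... through sub_14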

Test Scenarios & Results:
Publisher:
 - pgHead : Median TPS = 12201.92205
 - pgHead + patch : Median TPS = 12368.58001
(TPS improved ~1.4%)

Subscriber:
 - pgHead : Median TPS = 11264.78483
 - pgHead + patch : Median TPS = 11471.8107
(TPS improved ~1.8%)

Observation:
 - No performance regression was observed on either the publisher or subscriber
   with the patch applied.

Detailed Results Table
======================
Publisher:
#run    head        patched
1    12201.92205    12368.58001
2    12263.03531    12410.21465
3    12171.24214    12330.47522
median    12201.92205    12368.58001

Subscriber:
#run    head        patched
1    11383.51717    11471.8107
2    11264.78483    11422.47011
3    11146.6676    11518.8403
median    11264.78483    11471.8107

Best regards,
Hayato Kuroda
FUJITSU LIMITED

RE: Conflict detection for update_deleted in logical replication

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear hackers,

>     -- Also, worker and logical replication related parameters were increased
>        as needed (see attached scripts for details).

Sorry, I forgot to attach scripts.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Wed, Jul 9, 2025 at 9:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 8, 2025 at 12:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Jul 7, 2025 at 12:03 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> >
> > I think these performance regressions occur because at some point the
> > subscriber can no longer keep up with the changes occurring on the
> > publisher. This is because the publisher runs multiple transactions
> > simultaneously, while the Subscriber applies them with one apply
> > worker. When retain_conflict_info = on, the performance of the apply
> > worker deteriorates because it retains dead tuples, and as a result it
> > gradually cannot keep up with the publisher, the table bloats, and the
> > TPS of pgbench executed on the subscriber is also affected. This
> > happened when only 40 clients (or 15 clients according to the results
> > of test 4?) were running simultaneously.
> >
>
> I think here the primary reason is the speed of one apply worker vs.
> 15 or 40 clients working on the publisher, and all the data is being
> replicated. We don't see regression at 3 clients, which suggests apply
> worker is able to keep up with that much workload. Now, we have
> checked that if the workload is slightly different such that fewer
> clients (say 1-3) work on same set of tables and then we make
> different set of pub-sub pairs for all such different set of clients
> (for example, 3 clients working on tables t1 and t2, other 3 clients
> working on tables t3 and t4; then we can have 2 pub-sub pairs, one for
> tables t1, t2, and other for t3-t4 ) then there is almost negligible
> regression after enabling retain_conflict_info. Additionally, for very
> large transactions that can be parallelized, we shouldn't see any
> regression because those can be applied in parallel.

I find that it could make the system vulnerable to replication delays.
If the subscriber can't keep up even for a little while, it will enter
a negative loop. In order to avoid this, users have to reduce the
number of changes to each table set to a few clients, which may not be
user-friendly.

>
> > I think that even with retain_conflict_info = off, there is probably a
> > point at which the subscriber can no longer keep up with the
> > publisher. For example, if with retain_conflict_info = off we can
> > withstand 100 clients running at the same time, then the fact that
> > this performance degradation occurred with 15 clients explains that
> > performance degradation is much more likely to occur because of
> > retain_conflict_info = on.
> >
> > Test cases 3 and 4 are typical cases where this feature is used since
> > the  conflicts actually happen on the subscriber, so I think it's
> > important to look at the performance in these cases. The worst case
> > scenario for this feature is that when this feature is turned on, the
> > subscriber cannot keep up even with a small load, and with
> > max_conflict_retetion_duration we enter a loop of slot invalidation
> > and re-creating, which means that conflict cannot be detected
> > reliably.
> >
>
> As per the above observations, it is less of a regression of this
> feature but more of a lack of parallel apply or some kind of pre-fetch
> for apply, as is recently proposed [1]. I feel there are use cases, as
> explained above, for which this feature would work without any
> downside, but due to a lack of some sort of parallel apply, we may not
> be able to use it without any downside for cases where the contention
> is only on a smaller set of tables. We have not tried, but may in
> cases where contention is on a smaller set of tables, if users
> distribute workload among different pub-sub pairs by using row
> filters, there also, we may also see less regression. We can try that
> as well.

While I understand that there are some possible solutions we have
today to reduce the contention, I'm not really sure these are really
practical solutions, as they increase the operational costs instead.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Thu, Jul 10, 2025 at 6:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jul 9, 2025 at 9:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > > I think that even with retain_conflict_info = off, there is probably a
> > > point at which the subscriber can no longer keep up with the
> > > publisher. For example, if with retain_conflict_info = off we can
> > > withstand 100 clients running at the same time, then the fact that
> > > this performance degradation occurred with 15 clients explains that
> > > performance degradation is much more likely to occur because of
> > > retain_conflict_info = on.
> > >
> > > Test cases 3 and 4 are typical cases where this feature is used since
> > > the  conflicts actually happen on the subscriber, so I think it's
> > > important to look at the performance in these cases. The worst case
> > > scenario for this feature is that when this feature is turned on, the
> > > subscriber cannot keep up even with a small load, and with
> > > max_conflict_retetion_duration we enter a loop of slot invalidation
> > > and re-creating, which means that conflict cannot be detected
> > > reliably.
> > >
> >
> > As per the above observations, it is less of a regression of this
> > feature but more of a lack of parallel apply or some kind of pre-fetch
> > for apply, as is recently proposed [1]. I feel there are use cases, as
> > explained above, for which this feature would work without any
> > downside, but due to a lack of some sort of parallel apply, we may not
> > be able to use it without any downside for cases where the contention
> > is only on a smaller set of tables. We have not tried, but may in
> > cases where contention is on a smaller set of tables, if users
> > distribute workload among different pub-sub pairs by using row
> > filters, there also, we may also see less regression. We can try that
> > as well.
>
> While I understand that there are some possible solutions we have
> today to reduce the contention, I'm not really sure these are really
> practical solutions as it increases the operational costs instead.
>

I assume by operational costs you mean defining the replication
definitions such that workload is distributed among multiple apply
workers via subscriptions either by row_filters, or by defining
separate pub-sub pairs of a set of tables, right? If so, I agree with
you, but I can't think of a better alternative. Even without this
feature, we know that in such cases the replication lag could be
large, as is evident in a recent thread [1] and in some offlist feedback
from people using native logical replication. As per a POC in that
thread [1], by parallelizing apply or by using some prefetch, we could
reduce the lag, but we need to wait for that work to mature to see its
actual effect.

The path I see with this work is to clearly document the cases
(configuration) where this feature could be used without much downside
and keep the default value of subscription option to enable this as
false (which is already the case with the patch). Do you see any
better alternative for moving forward?

[1]: https://www.postgresql.org/message-id/7b60e4e1-de40-4956-8135-cb1dc2be62e9%40garret.ru

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Fri, Jul 11, 2025 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 10, 2025 at 6:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Jul 9, 2025 at 9:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > > I think that even with retain_conflict_info = off, there is probably a
> > > > point at which the subscriber can no longer keep up with the
> > > > publisher. For example, if with retain_conflict_info = off we can
> > > > withstand 100 clients running at the same time, then the fact that
> > > > this performance degradation occurred with 15 clients explains that
> > > > performance degradation is much more likely to occur because of
> > > > retain_conflict_info = on.
> > > >
> > > > Test cases 3 and 4 are typical cases where this feature is used since
> > > > the  conflicts actually happen on the subscriber, so I think it's
> > > > important to look at the performance in these cases. The worst case
> > > > scenario for this feature is that when this feature is turned on, the
> > > > subscriber cannot keep up even with a small load, and with
> > > > max_conflict_retetion_duration we enter a loop of slot invalidation
> > > > and re-creating, which means that conflict cannot be detected
> > > > reliably.
> > > >
> > >
> > > As per the above observations, it is less of a regression of this
> > > feature but more of a lack of parallel apply or some kind of pre-fetch
> > > for apply, as is recently proposed [1]. I feel there are use cases, as
> > > explained above, for which this feature would work without any
> > > downside, but due to a lack of some sort of parallel apply, we may not
> > > be able to use it without any downside for cases where the contention
> > > is only on a smaller set of tables. We have not tried, but may in
> > > cases where contention is on a smaller set of tables, if users
> > > distribute workload among different pub-sub pairs by using row
> > > filters, there also, we may also see less regression. We can try that
> > > as well.
> >
> > While I understand that there are some possible solutions we have
> > today to reduce the contention, I'm not really sure these are really
> > practical solutions as it increases the operational costs instead.
> >
>
> I assume by operational costs you mean defining the replication
> definitions such that workload is distributed among multiple apply
> workers via subscriptions either by row_filters, or by defining
> separate pub-sub pairs of a set of tables, right? If so, I agree with
> you but I can't think of a better alternative. Even without this
> feature as well, we know in such cases the replication lag could be
> large as is evident in recent thread [1] and some offlist feedback by
> people using native logical replication. As per a POC in the
> thread[1], parallelizing apply or by using some prefetch, we could
> reduce the lag but we need to wait for that work to mature to see the
> actual effect of it.
>
> The path I see with this work is to clearly document the cases
> (configuration) where this feature could be used without much downside
> and keep the default value of subscription option to enable this as
> false (which is already the case with the patch). Do you see any
> better alternative for moving forward?

I was just thinking about what are the most practical use cases where
a user would need multiple active writer nodes. Most applications
typically function well with a single active writer node. While it's
beneficial to have multiple nodes capable of writing for immediate
failover (e.g., if the current writer goes down), or they select a
primary writer via consensus algorithms like Raft/Paxos, I rarely
encounter use cases where users require multiple active writer nodes
for scaling write workloads. However, others may have different
perspectives on this.

One common use case for multiple active writer nodes is in
geographically distributed systems. Here, a dedicated writer in each
zone can significantly reduce write latency by sending writes to the
nearest zone.

In a multi-zone replication setup with an active writer in each zone
and data replicated across all zones, performance can be impacted by
factors like network latency. However, if such configurations are
implemented wisely and subscriptions are managed effectively, this
performance impact can be minimized.

IMHO, the same principle applies to this case when
‘retain_conflict_info’ is set to ON. If this setting is enabled, it
should only be used where absolutely essential. Additionally, the user
or DBA must carefully consider other factors. For instance, if they
use a single subscriber in each zone and subscribe to everything
across all zones, performance will significantly degrade. However, if
managed properly by subscribing only to data relevant to each zone and
using multiple subscribers for parallel apply of different
tables/partitions to reduce delay, it should work fine.
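
As a rough illustration of subscribing only to zone-relevant data (a purely
hypothetical schema: the 'region' column and all object names are assumptions),
a zone A node could pull only zone B's rows from the zone B writer:

    -- on the zone B writer, publish only zone B's rows
    -- (note: the row filter column must be part of the replica identity
    --  for UPDATE/DELETE changes to be publishable)
    CREATE PUBLICATION pub_zone_b FOR TABLE orders WHERE (region = 'b');

    -- on the zone A node
    CREATE SUBSCRIPTION sub_zone_b CONNECTION 'host=writer-b dbname=postgres'
        PUBLICATION pub_zone_b WITH (retain_conflict_info = on);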

Anyway, this is just my thought and others may think differently. I
am open to hearing others' thoughts as well.

--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Thu, Jul 17, 2025 at 9:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jul 11, 2025 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jul 10, 2025 at 6:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Jul 9, 2025 at 9:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > > > I think that even with retain_conflict_info = off, there is probably a
> > > > > point at which the subscriber can no longer keep up with the
> > > > > publisher. For example, if with retain_conflict_info = off we can
> > > > > withstand 100 clients running at the same time, then the fact that
> > > > > this performance degradation occurred with 15 clients explains that
> > > > > performance degradation is much more likely to occur because of
> > > > > retain_conflict_info = on.
> > > > >
> > > > > Test cases 3 and 4 are typical cases where this feature is used since
> > > > > the  conflicts actually happen on the subscriber, so I think it's
> > > > > important to look at the performance in these cases. The worst case
> > > > > scenario for this feature is that when this feature is turned on, the
> > > > > subscriber cannot keep up even with a small load, and with
> > > > > max_conflict_retetion_duration we enter a loop of slot invalidation
> > > > > and re-creating, which means that conflict cannot be detected
> > > > > reliably.
> > > > >
> > > >
> > > > As per the above observations, it is less of a regression of this
> > > > feature but more of a lack of parallel apply or some kind of pre-fetch
> > > > for apply, as is recently proposed [1]. I feel there are use cases, as
> > > > explained above, for which this feature would work without any
> > > > downside, but due to a lack of some sort of parallel apply, we may not
> > > > be able to use it without any downside for cases where the contention
> > > > is only on a smaller set of tables. We have not tried, but may in
> > > > cases where contention is on a smaller set of tables, if users
> > > > distribute workload among different pub-sub pairs by using row
> > > > filters, there also, we may also see less regression. We can try that
> > > > as well.
> > >
> > > While I understand that there are some possible solutions we have
> > > today to reduce the contention, I'm not really sure these are really
> > > practical solutions as it increases the operational costs instead.
> > >
> >
> > I assume by operational costs you mean defining the replication
> > definitions such that workload is distributed among multiple apply
> > workers via subscriptions either by row_filters, or by defining
> > separate pub-sub pairs of a set of tables, right? If so, I agree with
> > you but I can't think of a better alternative. Even without this
> > feature as well, we know in such cases the replication lag could be
> > large as is evident in recent thread [1] and some offlist feedback by
> > people using native logical replication. As per a POC in the
> > thread[1], parallelizing apply or by using some prefetch, we could
> > reduce the lag but we need to wait for that work to mature to see the
> > actual effect of it.
> >
> > The path I see with this work is to clearly document the cases
> > (configuration) where this feature could be used without much downside
> > and keep the default value of subscription option to enable this as
> > false (which is already the case with the patch). Do you see any
> > better alternative for moving forward?
>
> I was just thinking about what are the most practical use cases where
> a user would need multiple active writer nodes. Most applications
> typically function well with a single active writer node. While it's
> beneficial to have multiple nodes capable of writing for immediate
> failover (e.g., if the current writer goes down), or they select a
> primary writer via consensus algorithms like Raft/Paxos, I rarely
> encounter use cases where users require multiple active writer nodes
> for scaling write workloads.

Thank you for the feedback. In the scenario with a single writer node
and a subscriber with RCI enabled, we have not observed any
regression. Please refer to the test report at [1], specifically test
cases 1 and 2, which involve a single writer node. Next, we can test a
scenario with multiple (2-3) writer nodes publishing changes and a
subscriber node subscribing to those writers with RCI enabled, which
can even serve as a good use case for the conflict detection we are
targeting with RCI.
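
For example (a sketch only; node names, publication names, and connection
strings are placeholders), the subscriber could subscribe to two writer nodes
like this:

    -- on the subscriber node
    CREATE SUBSCRIPTION sub_writer1 CONNECTION 'host=writer1 dbname=postgres'
        PUBLICATION pub_writer1 WITH (retain_conflict_info = on);
    CREATE SUBSCRIPTION sub_writer2 CONNECTION 'host=writer2 dbname=postgres'
        PUBLICATION pub_writer2 WITH (retain_conflict_info = on);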

>
> One common use case for multiple active writer nodes is in
> geographically distributed systems. Here, a dedicated writer in each
> zone can significantly reduce write latency by sending writes to the
> nearest zone.
>
> In a multi-zone replication setup with an active writer in each zone
> and data replicated across all zones, performance can be impacted by
> factors like network latency. However, if such configurations are
> implemented wisely and subscriptions are managed effectively, this
> performance impact can be minimized.
>
> IMHO, the same principle applies to this case when
> ‘retain_conflict_info’ is set to ON. If this setting is enabled, it
> should only be used where absolutely essential. Additionally, the user
> or DBA must carefully consider other factors. For instance, if they
> use a single subscriber in each zone and subscribe to everything
> across all zones, performance will significantly degrade. However, if
> managed properly by subscribing only to data relevant to each zone and
> using multiple subscribers for parallel apply of different
> tables/partitions to reduce delay, it should work fine.
>

Strongly agree with this. We tested scenarios involving multiple
subscribers, each subscribing to exclusive data,  as well as
publishers using row filters. In both cases, no regressions were
observed. Please refer to the test results at [2] and [3].

[1]: https://www.postgresql.org/message-id/OSCPR01MB1496663AED8EEC566074DFBC9F54CA%40OSCPR01MB14966.jpnprd01.prod.outlook.com

[2]: row filter -
https://www.postgresql.org/message-id/OSCPR01MB149660DD40A9D7C18E2E11C97F548A%40OSCPR01MB14966.jpnprd01.prod.outlook.com

[3]: Multiple subscriptions -
https://www.postgresql.org/message-id/CABdArM5kvA7mPLLwy6XEDkHi0MNs1RidvAcYmm2uVd95U%3DyzwQ%40mail.gmail.com

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Fri, Jul 11, 2025 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 10, 2025 at 6:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Jul 9, 2025 at 9:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > > I think that even with retain_conflict_info = off, there is probably a
> > > > point at which the subscriber can no longer keep up with the
> > > > publisher. For example, if with retain_conflict_info = off we can
> > > > withstand 100 clients running at the same time, then the fact that
> > > > this performance degradation occurred with 15 clients explains that
> > > > performance degradation is much more likely to occur because of
> > > > retain_conflict_info = on.
> > > >
> > > > Test cases 3 and 4 are typical cases where this feature is used since
> > > > the  conflicts actually happen on the subscriber, so I think it's
> > > > important to look at the performance in these cases. The worst case
> > > > scenario for this feature is that when this feature is turned on, the
> > > > subscriber cannot keep up even with a small load, and with
> > > > max_conflict_retetion_duration we enter a loop of slot invalidation
> > > > and re-creating, which means that conflict cannot be detected
> > > > reliably.
> > > >
> > >
> > > As per the above observations, it is less of a regression of this
> > > feature but more of a lack of parallel apply or some kind of pre-fetch
> > > for apply, as is recently proposed [1]. I feel there are use cases, as
> > > explained above, for which this feature would work without any
> > > downside, but due to a lack of some sort of parallel apply, we may not
> > > be able to use it without any downside for cases where the contention
> > > is only on a smaller set of tables. We have not tried, but may in
> > > cases where contention is on a smaller set of tables, if users
> > > distribute workload among different pub-sub pairs by using row
> > > filters, there also, we may also see less regression. We can try that
> > > as well.
> >
> > While I understand that there are some possible solutions we have
> > today to reduce the contention, I'm not really sure these are really
> > practical solutions as it increases the operational costs instead.
> >
>
> I assume by operational costs you mean defining the replication
> definitions such that workload is distributed among multiple apply
> workers via subscriptions either by row_filters, or by defining
> separate pub-sub pairs of a set of tables, right? If so, I agree with
> you but I can't think of a better alternative. Even without this
> feature as well, we know in such cases the replication lag could be
> large as is evident in recent thread [1] and some offlist feedback by
> people using native logical replication. As per a POC in the
> thread[1], parallelizing apply or by using some prefetch, we could
> reduce the lag but we need to wait for that work to mature to see the
> actual effect of it.

I don't have a better alternative either.

I agree that this feature will work without any problem when logical
replication is properly configured. It's a good point that
update_deleted conflicts can be detected reliably without additional
performance overhead in scenarios with minimal replication lag.
However, this approach requires users to pay careful attention to
replication performance and potential delays. My primary
concern is that, given the current logical replication performance
limitations, most users who want to use this feature will likely need
such dedicated care for replication lag. Nevertheless, most features
involve certain trade-offs. Given that this is an opt-in feature and
future performance improvements will reduce these challenges for
users, it would be reasonable to have this feature at this stage.

>
> The path I see with this work is to clearly document the cases
> (configuration) where this feature could be used without much downside
> and keep the default value of subscription option to enable this as
> false (which is already the case with the patch).

+1

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Thu, Jul 17, 2025 at 4:44 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Thu, Jul 17, 2025 at 9:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I was just thinking about what are the most practical use cases where
> > a user would need multiple active writer nodes. Most applications
> > typically function well with a single active writer node. While it's
> > beneficial to have multiple nodes capable of writing for immediate
> > failover (e.g., if the current writer goes down), or they select a
> > primary writer via consensus algorithms like Raft/Paxos, I rarely
> > encounter use cases where users require multiple active writer nodes
> > for scaling write workloads.
>
> Thank you for the feedback. In the scenario with a single writer node
> and a subscriber with RCI enabled, we have not observed any
> regression.  Please refer to the test report at [1], specifically test
> cases 1 and 2, which involve a single writer node. Next, we can test a
> scenario with multiple (2-3) writer nodes publishing changes, and a
> subscriber node subscribing to those writers with RCI enabled, which
> can even serve as a good use case of the conflict detection we are
> targeting through RCI enabling.

+1

> > One common use case for multiple active writer nodes is in
> > geographically distributed systems. Here, a dedicated writer in each
> > zone can significantly reduce write latency by sending writes to the
> > nearest zone.
> >
> > In a multi-zone replication setup with an active writer in each zone
> > and data replicated across all zones, performance can be impacted by
> > factors like network latency. However, if such configurations are
> > implemented wisely and subscriptions are managed effectively, this
> > performance impact can be minimized.
> >
> > IMHO, the same principle applies to this case when
> > ‘retain_conflict_info’ is set to ON. If this setting is enabled, it
> > should only be used where absolutely essential. Additionally, the user
> > or DBA must carefully consider other factors. For instance, if they
> > use a single subscriber in each zone and subscribe to everything
> > across all zones, performance will significantly degrade. However, if
> > managed properly by subscribing only to data relevant to each zone and
> > using multiple subscribers for parallel apply of different
> > tables/partitions to reduce delay, it should work fine.
> >
>
> Strongly agree with this. We tested scenarios involving multiple
> subscribers, each subscribing to exclusive data,  as well as
> publishers using row filters. In both cases, no regressions were
> observed. Please refer to the test results at [2] and [3].

Right, thanks for pointing me to the related test cases.
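
As a rough illustration, a zone-wise, row-filtered setup of the kind
discussed above could look like the sketch below (object names are
purely illustrative, retain_conflict_info is the subscription option
proposed in this thread, and the referenced tests used their own
scripts):

-- On the zone-A publisher, publish only zone-A rows:
CREATE PUBLICATION pub_zone_a FOR TABLE orders WHERE (zone = 'A');

-- On the zone-A subscriber, subscribe only to zone-relevant data and
-- enable dead-tuple retention for reliable conflict detection:
CREATE SUBSCRIPTION sub_zone_a
    CONNECTION 'host=pub_a_host dbname=appdb'
    PUBLICATION pub_zone_a
    WITH (retain_conflict_info = on);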

--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

From
Nisha Moond
Date:
On Thu, Jul 17, 2025 at 4:44 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Thu, Jul 17, 2025 at 9:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Jul 11, 2025 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Jul 10, 2025 at 6:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Wed, Jul 9, 2025 at 9:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > >
> > > > > > I think that even with retain_conflict_info = off, there is probably a
> > > > > > point at which the subscriber can no longer keep up with the
> > > > > > publisher. For example, if with retain_conflict_info = off we can
> > > > > > withstand 100 clients running at the same time, then the fact that
> > > > > > this performance degradation occurred with 15 clients explains that
> > > > > > performance degradation is much more likely to occur because of
> > > > > > retain_conflict_info = on.
> > > > > >
> > > > > > Test cases 3 and 4 are typical cases where this feature is used since
> > > > > > the  conflicts actually happen on the subscriber, so I think it's
> > > > > > important to look at the performance in these cases. The worst case
> > > > > > scenario for this feature is that when this feature is turned on, the
> > > > > > subscriber cannot keep up even with a small load, and with
> > > > > > max_conflict_retetion_duration we enter a loop of slot invalidation
> > > > > > and re-creating, which means that conflict cannot be detected
> > > > > > reliably.
> > > > > >
> > > > >
> > > > > As per the above observations, it is less of a regression of this
> > > > > feature but more of a lack of parallel apply or some kind of pre-fetch
> > > > > for apply, as is recently proposed [1]. I feel there are use cases, as
> > > > > explained above, for which this feature would work without any
> > > > > downside, but due to a lack of some sort of parallel apply, we may not
> > > > > be able to use it without any downside for cases where the contention
> > > > > is only on a smaller set of tables. We have not tried, but may in
> > > > > cases where contention is on a smaller set of tables, if users
> > > > > distribute workload among different pub-sub pairs by using row
> > > > > filters, there also, we may also see less regression. We can try that
> > > > > as well.
> > > >
> > > > While I understand that there are some possible solutions we have
> > > > today to reduce the contention, I'm not really sure these are really
> > > > practical solutions as it increases the operational costs instead.
> > > >
> > >
> > > I assume by operational costs you mean defining the replication
> > > definitions such that workload is distributed among multiple apply
> > > workers via subscriptions either by row_filters, or by defining
> > > separate pub-sub pairs of a set of tables, right? If so, I agree with
> > > you but I can't think of a better alternative. Even without this
> > > feature as well, we know in such cases the replication lag could be
> > > large as is evident in recent thread [1] and some offlist feedback by
> > > people using native logical replication. As per a POC in the
> > > thread[1], parallelizing apply or by using some prefetch, we could
> > > reduce the lag but we need to wait for that work to mature to see the
> > > actual effect of it.
> > >
> > > The path I see with this work is to clearly document the cases
> > > (configuration) where this feature could be used without much downside
> > > and keep the default value of subscription option to enable this as
> > > false (which is already the case with the patch). Do you see any
> > > better alternative for moving forward?
> >
> > I was just thinking about what are the most practical use cases where
> > a user would need multiple active writer nodes. Most applications
> > typically function well with a single active writer node. While it's
> > beneficial to have multiple nodes capable of writing for immediate
> > failover (e.g., if the current writer goes down), or they select a
> > primary writer via consensus algorithms like Raft/Paxos, I rarely
> > encounter use cases where users require multiple active writer nodes
> > for scaling write workloads.
>
> Thank you for the feedback. In the scenario with a single writer node
> and a subscriber with RCI enabled, we have not observed any
> regression.  Please refer to the test report at [1], specifically test
> cases 1 and 2, which involve a single writer node. Next, we can test a
> scenario with multiple (2-3) writer nodes publishing changes, and a
> subscriber node subscribing to those writers with RCI enabled, which
> can even serve as a good use case of the conflict detection we are
> targeting through RCI enabling.
>

I did a workload test for the setup as suggested above - "we can test
a scenario with multiple (2-3) writer nodes publishing changes, and a
subscriber node subscribing to those writers with RCI enabled".

Here are the results :

Highlights
==========
- Two tests were done with two different workloads - 15 and 40
concurrent clients, respectively.
- No regression was observed on any of the nodes.

Used source
===========
pgHead commit 62a17a92833 + v47 patch set

Machine details
===============
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 88 CPU cores, 503 GiB RAM

01. pgbench with 15 clients
========================
Setup:
 - Two publishers and one subscriber:
  pub1 --> sub
  pub2 --> sub
 - All three nodes have the same pgbench tables (scale=60) and are configured with:
    autovacuum = false
    shared_buffers = '30GB'
    -- Also, worker- and logical-replication-related parameters were
increased as needed (see attached scripts for details).
 - The topology is such that pub1 & pub2 are independent writers. The
sub acts as a reader (no writes) and has subscribed to all the changes
from both pub1 and pub2 (a minimal SQL sketch of this topology follows).
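
A minimal SQL sketch of this topology (object names are illustrative;
the attached scripts contain the actual setup, and retain_conflict_info
is the subscription option added by the patch set under test):

-- On pub1 and on pub2 (independent writers):
CREATE PUBLICATION pub_all FOR ALL TABLES;

-- On sub (reader only), one subscription per writer with RCI enabled:
CREATE SUBSCRIPTION sub_from_pub1
    CONNECTION 'host=pub1_host dbname=postgres'
    PUBLICATION pub_all
    WITH (retain_conflict_info = on);

CREATE SUBSCRIPTION sub_from_pub2
    CONNECTION 'host=pub2_host dbname=postgres'
    PUBLICATION pub_all
    WITH (retain_conflict_info = on);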

Workload:
 - pgbench (read-write) was run on both pub1 and pub2 (15 clients,
duration = 5 minutes)
 - pgbench (read-only) was run on sub (15 clients, duration = 5 minutes)
 - The measurement was repeated 2 times.

Observation:
 - No performance regression was observed on either the writer nodes
(publishers) or the reader node (subscriber) with the patch applied.
 - TPS on both publishers was slightly better than on pgHead. This
could be because all nodes run on the same machine - under high
publisher load, the subscriber's apply worker performs I/O more slowly
due to dead tuple retention, giving publisher-side pgbench more I/O
bandwidth to complete writes. We can investigate further if needed.


Detailed Results Table:
On publishers:
#run     pgHead_Pub1_TPS   pgHead_Pub2_TPS   patched_pub1_TPS   patched_pub2_TPS
1        13440.47394       13459.71296       14325.81026        14345.34077
2        13529.29649       13553.65741       14382.32144        14332.94777
median   13484.88521       13506.68518       14354.06585        14339.14427
   - No regression

On subscriber:
#run     pgHead_sub_TPS   patched_sub_TPS
1        127009.0631      126894.9649
2        127767.4083      127207.8632
median   127388.2357      127051.4141
  - No regression

~~~~

02. pgbench with 40 clients
======================
Setup:
 - same as case-01

Workload:
 - pgbench (read-write) was run on both pub1 and pub2 (40 clients,
duration = 10 minutes)
 - pgbench (read-only) was run on sub (40 clients, duration = 10 minutes)
 - The measurement was repeated 2 times.

Observation:
 - No performance regression was observed on any of the writer nodes
(i.e., the publishers) or on the reader node (i.e., the subscriber) with
the patch applied.
 - Similar to case-01, TPS on both publishers was slightly higher than
on pgHead.

Detailed Results Table:
On publisher:
#run     pgHead_Pub1_TPS   patched_pub1_TPS   pgHead_Pub2_TPS   patched_pub2_TPS
1        17818.12479       18602.42504        17744.77163       18620.90056
2        17759.3144        18660.44407        17774.47442       18230.63849
median   17788.7196        18631.43455        17759.62302       18425.76952
   - No regression

On subscriber:
#run     pgHead_sub_TPS   patched_sub_TPS
1        281075.3732      279438.4882
2        275988.1383      277388.6316
median   278531.7557      278413.5599
   - No regression

~~~~
The scripts used to perform the above tests are attached.

--
Thanks,
Nisha

Attachments

RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, July 18, 2025 1:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Fri, Jul 11, 2025 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jul 10, 2025 at 6:46 PM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Jul 9, 2025 at 9:09 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > >
> > > > > I think that even with retain_conflict_info = off, there is
> > > > > probably a point at which the subscriber can no longer keep up
> > > > > with the publisher. For example, if with retain_conflict_info =
> > > > > off we can withstand 100 clients running at the same time, then
> > > > > the fact that this performance degradation occurred with 15
> > > > > clients explains that performance degradation is much more
> > > > > likely to occur because of retain_conflict_info = on.
> > > > >
> > > > > Test cases 3 and 4 are typical cases where this feature is used
> > > > > since the  conflicts actually happen on the subscriber, so I
> > > > > think it's important to look at the performance in these cases.
> > > > > The worst case scenario for this feature is that when this
> > > > > feature is turned on, the subscriber cannot keep up even with a
> > > > > small load, and with max_conflict_retetion_duration we enter a
> > > > > loop of slot invalidation and re-creating, which means that
> > > > > conflict cannot be detected reliably.
> > > > >
> > > >
> > > > As per the above observations, it is less of a regression of this
> > > > feature but more of a lack of parallel apply or some kind of
> > > > pre-fetch for apply, as is recently proposed [1]. I feel there are
> > > > use cases, as explained above, for which this feature would work
> > > > without any downside, but due to a lack of some sort of parallel
> > > > apply, we may not be able to use it without any downside for cases
> > > > where the contention is only on a smaller set of tables. We have
> > > > not tried, but may in cases where contention is on a smaller set
> > > > of tables, if users distribute workload among different pub-sub
> > > > pairs by using row filters, there also, we may also see less
> > > > regression. We can try that as well.
> > >
> > > While I understand that there are some possible solutions we have
> > > today to reduce the contention, I'm not really sure these are really
> > > practical solutions as it increases the operational costs instead.
> > >
> >
> > I assume by operational costs you mean defining the replication
> > definitions such that workload is distributed among multiple apply
> > workers via subscriptions either by row_filters, or by defining
> > separate pub-sub pairs of a set of tables, right? If so, I agree with
> > you but I can't think of a better alternative. Even without this
> > feature as well, we know in such cases the replication lag could be
> > large as is evident in recent thread [1] and some offlist feedback by
> > people using native logical replication. As per a POC in the
> > thread[1], parallelizing apply or by using some prefetch, we could
> > reduce the lag but we need to wait for that work to mature to see the
> > actual effect of it.
> 
> I don't have a better alternative either.
> 
> I agree that this feature will work without any problem when logical replication
> is properly configured. It's a good point that update-delete conflicts can be
> detected reliably without additional performance overhead in scenarios with
> minimal replication lag.
> However, this approach requires users to carefully pay particular attention to
> replication performance and potential delays. My primary concern is that, given
> the current logical replication performance limitations, most users who want to
> use this feature will likely need such dedicated care for replication lag.
> Nevertheless, most features involve certain trade-offs. Given that this is an
> opt-in feature and future performance improvements will reduce these
> challenges for users, it would be reasonable to have this feature at this stage.
> 
> >
> > The path I see with this work is to clearly document the cases
> > (configuration) where this feature could be used without much downside
> > and keep the default value of subscription option to enable this as
> > false (which is already the case with the patch).
> 
> +1

Thanks for the discussion. Here is the V49 patch which includes the suggested
doc change in 0002. I will rebase the remaining patches once the first one is
pushed.

Thanks to Shveta for preparing the doc change.

Best Regards,
Hou zj

Attachments

Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Mon, Jul 7, 2025 at 3:31 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Mon, Jul 7, 2025 at 10:13 AM Zhijie Hou (Fujitsu) wrote:
> >
> > On Sun, Jul 6, 2025 at 10:51 PM Masahiko Sawada wrote:
> > >
> > > On Fri, Jul 4, 2025 at 8:18 PM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com>
> > > wrote:
> > > >
> > > > On Wed, Jul 2, 2025 at 3:28 PM Hou, Zhijie wrote:
> > > > > Kindly use the latest patch set for performance testing.
> > > >
> > > > During testing, we observed a limitation in cascading logical
> > > > replication setups, such as (A -> B -> C). When retain_conflict_info
> > > > is enabled on Node C, it may not retain information necessary for
> > > > conflict detection when applying changes originally replicated from
> > > > Node A. This happens because Node C only waits for locally
> > > > originated changes on Node B to be applied before advancing the
> > > > non-removable
> > > transaction ID.
> > > >
> > > > For example, Consider a logical replication setup as mentioned above
> > > > : A -> B
> > > -> C.
> > > >  - All three nodes have a table t1 with two tuples (1,1) (2,2).
> > > >  - Node B subscribed to all changes of t1 from Node A
> > > >  - Node-C subscribed to all changes from Node B.
> > > >  - Subscriptions use the default origin=ANY, as this is not a bidirectional
> > > >    setup.
> > > >
> > > > Now, consider two concurrent operations:
> > > >   - @9:00 Node A - UPDATE (1,1) -> (1,11)
> > > >
> > > >   - @9:02 Node C - DELETE (1,1)
> > > >
> > > > Assume a slight delay at Node B before it applies the update from Node A.
> > > >
> > > >  @9:03 Node C - advances the non-removable XID because it sees no
> > > > concurrent  transactions from Node B. It is unaware of Node A’s
> > > > concurrent
> > > update.
> > > >
> > > >   @9:04 Node B - receives Node A's UPDATE and applies (1,1) -> (1,11)
> > > >   t1 has tuples : (1,11), (2,2)
> > > >
> > > >   @9:05 Node C - receives the UPDATE (1,1) -> (1,11)
> > > >     - As conflict slot’s xmin is advanced, the deleted tuple may
> > > > already
> > > have
> > > >       been removed.
> > > >     - Conflict resolution fails to detect update_deleted and instead raises
> > > >       update_missing.
> > > >
> > > > Note that, as per decoding logic Node C sees the commit timestamp of
> > > > the update as 9:00 (origin commit_ts from Node A), not 9:04 (commit
> > > > time on Node B). In this case, since the UPDATE's timestamp is
> > > > earlier than the DELETE, Node C should ideally detect an
> > > > update_deleted conflict. However, it cannot, because it no longer retains
> > the deleted tuple.
> > > >
> > > > Even if Node C attempts to retrieve the latest WAL position from
> > > > Node A, Node C doesn't maintain any LSN which we could use to compare
> > with it.
> > > >
> > > > This scenario is similar to another restriction in the patch where
> > > > retain_conflict_info is not supported if the publisher is also a
> > > > physical standby, as the required transaction information from the
> > > > original primary is unavailable. Moreover, this limitation is
> > > > relevant only when the subscription origin option is set to ANY, as
> > > > only in that case changes from other origins can be replicated.
> > > > Since retain_conflict_info is primarily useful for conflict
> > > > detection in bidirectional clusters where the origin option is set
> > > > to NONE, this limitation
> > > appears acceptable.
> > > >
> > > > Given these findings, to help users avoid unintended configurations,
> > > > we plan to issue a warning in scenarios where replicated changes may
> > > > include origins other than the direct publisher, similar to the
> > > > existing checks in the
> > > > check_publications_origin() function.
> > > >
> > > > Here is the latest patch that implements the warning and documents
> > > > this case. Only 0001 is modified for this.
> > > >
> > > > A big thanks to Nisha for invaluable assistance in identifying this
> > > > case and preparing the analysis for it.
> > >
> > > I'm still reviewing the 0001 patch but let me share some comments and
> > > questions I have so far:
> >
> > Thanks for the comments!
> >
> > >
> > > ---
> > > +/*
> > > + * Determine the minimum non-removable transaction ID across all
> > > +apply workers
> > > + * for subscriptions that have retain_conflict_info enabled. Store
> > > +the result
> > > + * in *xmin.
> > > + *
> > > + * If the replication slot cannot be advanced during this cycle, due
> > > +to either
> > > + * a disabled subscription or an inactive worker, set
> > > +*can_advance_xmin to
> > > + * false.
> > > + */
> > > +static void
> > > +compute_min_nonremovable_xid(LogicalRepWorker *worker,
> > > +                            bool retain_conflict_info, TransactionId
> > *xmin,
> > > +                            bool *can_advance_xmin)
> > >
> > > I think this function is quite confusing for several reasons. For
> > > instance, it's doing more things than described in the comments such
> > > as trying to create the CONFLICT_DETECTION_SLOT if no worker is
> > > passed. Also, one of the caller
> > > describes:
> > >
> > > +               /*
> > > +                * This is required to ensure that we don't advance the xmin
> > > +                * of CONFLICT_DETECTION_SLOT even if one of the
> > > subscriptions
> > > +                * is not enabled. Otherwise, we won't be able to detect
> > > +                * conflicts reliably for such a subscription even though it
> > > +                * has set the retain_conflict_info option.
> > > +                */
> > > +               compute_min_nonremovable_xid(NULL,
> > > sub->retainconflictinfo,
> > > +                                            &xmin,
> > > + &can_advance_xmin);
> > >
> > > but it's unclear to me from the function name that it tries to create
> > > the replication slot. Furthermore, in this path it doesn't actually
> > > compute xmin. I guess we can try to create CONFLICT_DETECTION_SLOT in
> > > the loop of "foreach(lc, sublist)" and set false to can_advance_xmin
> > > if either the subscription is disabled or the worker is not running.
> >
> > I understand. The original code was similar to your suggestion, but we decided
> > to encapsulate it within a separate function to maintain a clean and concise
> > main loop. However, your suggestion also makes sense, so I will proceed with
> > the change.
>
> I have made this change in the 0002 patch for reference. What do you think ? If
> there are no objections, I plan to merge it in the next version.
>

The changes in the 0002 patch look good to me. I've attached the patch
for some minor suggestions. Please incorporate these changes if you
agree.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Fri, Jul 18, 2025 at 5:03 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, July 18, 2025 1:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Jul 11, 2025 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Jul 10, 2025 at 6:46 PM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Wed, Jul 9, 2025 at 9:09 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > > >
> > > > >
> > > > > > I think that even with retain_conflict_info = off, there is
> > > > > > probably a point at which the subscriber can no longer keep up
> > > > > > with the publisher. For example, if with retain_conflict_info =
> > > > > > off we can withstand 100 clients running at the same time, then
> > > > > > the fact that this performance degradation occurred with 15
> > > > > > clients explains that performance degradation is much more
> > > > > > likely to occur because of retain_conflict_info = on.
> > > > > >
> > > > > > Test cases 3 and 4 are typical cases where this feature is used
> > > > > > since the  conflicts actually happen on the subscriber, so I
> > > > > > think it's important to look at the performance in these cases.
> > > > > > The worst case scenario for this feature is that when this
> > > > > > feature is turned on, the subscriber cannot keep up even with a
> > > > > > small load, and with max_conflict_retetion_duration we enter a
> > > > > > loop of slot invalidation and re-creating, which means that
> > > > > > conflict cannot be detected reliably.
> > > > > >
> > > > >
> > > > > As per the above observations, it is less of a regression of this
> > > > > feature but more of a lack of parallel apply or some kind of
> > > > > pre-fetch for apply, as is recently proposed [1]. I feel there are
> > > > > use cases, as explained above, for which this feature would work
> > > > > without any downside, but due to a lack of some sort of parallel
> > > > > apply, we may not be able to use it without any downside for cases
> > > > > where the contention is only on a smaller set of tables. We have
> > > > > not tried, but may in cases where contention is on a smaller set
> > > > > of tables, if users distribute workload among different pub-sub
> > > > > pairs by using row filters, there also, we may also see less
> > > > > regression. We can try that as well.
> > > >
> > > > While I understand that there are some possible solutions we have
> > > > today to reduce the contention, I'm not really sure these are really
> > > > practical solutions as it increases the operational costs instead.
> > > >
> > >
> > > I assume by operational costs you mean defining the replication
> > > definitions such that workload is distributed among multiple apply
> > > workers via subscriptions either by row_filters, or by defining
> > > separate pub-sub pairs of a set of tables, right? If so, I agree with
> > > you but I can't think of a better alternative. Even without this
> > > feature as well, we know in such cases the replication lag could be
> > > large as is evident in recent thread [1] and some offlist feedback by
> > > people using native logical replication. As per a POC in the
> > > thread[1], parallelizing apply or by using some prefetch, we could
> > > reduce the lag but we need to wait for that work to mature to see the
> > > actual effect of it.
> >
> > I don't have a better alternative either.
> >
> > I agree that this feature will work without any problem when logical replication
> > is properly configured. It's a good point that update-delete conflicts can be
> > detected reliably without additional performance overhead in scenarios with
> > minimal replication lag.
> > However, this approach requires users to carefully pay particular attention to
> > replication performance and potential delays. My primary concern is that, given
> > the current logical replication performance limitations, most users who want to
> > use this feature will likely need such dedicated care for replication lag.
> > Nevertheless, most features involve certain trade-offs. Given that this is an
> > opt-in feature and future performance improvements will reduce these
> > challenges for users, it would be reasonable to have this feature at this stage.
> >
> > >
> > > The path I see with this work is to clearly document the cases
> > > (configuration) where this feature could be used without much downside
> > > and keep the default value of subscription option to enable this as
> > > false (which is already the case with the patch).
> >
> > +1
>
> Thanks for the discussion. Here is the V49 patch which includes the suggested
> doc change in 0002. I will rebase the remaining patches once the first one is
> pushed.

Thank you for updating the patch!

Here are some review comments and questions:

+   /*
+    * Do not allow users to acquire the reserved slot. This scenario may
+    * occur if the launcher that owns the slot has terminated unexpectedly
+    * due to an error, and a backend process attempts to reuse the slot.
+    */
+   if (!IsLogicalLauncher() && IsReservedSlotName(name))
+       ereport(ERROR,
+               errcode(ERRCODE_UNDEFINED_OBJECT),
+               errmsg("cannot acquire replication slot \"%s\"", name),
+               errdetail("The slot is reserved for conflict detection
and can only be acquired by logical replication launcher."));

I think it might be better to rename IsReservedSlotName() to be more
specific to the conflict detection because we might want to add more
reserved slot names in the future that would not necessarily be
acquired only by the launcher process.

---
+       if (inCommitOnly &&
+           (proc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0)
+           continue;
+

I've not verified yet but is it possible that we exclude XIDs of
processes who are running on other databases?

---
Regarding the option name we've discussed:

> > The new parameter name "retain_conflict_info" sounds to me like we keep the
> > conflict information somewhere that has expired at some time such as how
> > many times insert_exists or update_origin_differs happened. How about
> > choosing a name that indicates retain dead tuples more explicitly for example
> > retain_dead_tuples?
>
> We considered the name you suggested, but we wanted to convey that this option
> not only retains dead tuples but also preserves commit timestamps and origin
> data for conflict detection, hence we opted for a more general name. Do you
> have better suggestions?

I think the commit timestamp and origin data would be retained as a
result of retaining dead tuples. While such a general name could
convey more than retaining dead tuples, I'm concerned that it could be
ambiguous as to what exactly the subscription retains. How about the
following names or something along those lines?

- retain_dead_tuples_for_conflict
- delay_vacuum_for_conflict
- keep_dead_tuples_for_conflict

---
+check_pub_conflict_info_retention(WalReceiverConn *wrconn, bool
retain_conflict_info)
+{
+   WalRcvExecResult *res;
+   Oid         RecoveryRow[1] = {BOOLOID};
+   TupleTableSlot *slot;
+   bool        isnull;
+   bool        remote_in_recovery;
+
+   if (!retain_conflict_info)
+       return;

It seems that retain_conflict_info is used only for this check, for a
quick exit from this function. How about calling this function only
when the caller knows retain_conflict_info is true instead of adding
it as a function argument?

---
+void
+CheckSubConflictInfoRetention(bool retain_conflict_info, bool check_guc,
+                             bool sub_disabled, int elevel_for_sub_disabled)

This function seems not to be specific to the apply workers but to
subscriptions in general. Is there any reason why we define this
function in worker.c?

---
+   if (check_guc && wal_level < WAL_LEVEL_REPLICA)
+       ereport(ERROR,
+               errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+               errmsg("\"wal_level\" is insufficient to create the
replication slot required by retain_conflict_info"),
+               errhint("\"wal_level\" must be set to \"replica\" or
\"logical\" at server start."));
+

Why does (retain_conflict_info == false && wal_level == minimal) not work?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Sat, Jul 19, 2025 at 3:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jul 18, 2025 at 5:03 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
>
> Here are some review comments and questions:
>
>
> ---
> +       if (inCommitOnly &&
> +           (proc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0)
> +           continue;
> +
>
> I've not verified yet but is it possible that we exclude XIDs of
> processes who are running on other databases?
>

I can't see how; even the comment atop the function says: "We look at
all databases, though there is no need to include WALSender since this
has no effect on hot standby conflicts.", which indicates that it
shouldn't exclude XIDs of procs that are running on other databases.

> ---
> Regarding the option name we've discussed:
>
> > > The new parameter name "retain_conflict_info" sounds to me like we keep the
> > > conflict information somewhere that has expired at some time such as how
> > > many times insert_exists or update_origin_differs happened. How about
> > > choosing a name that indicates retain dead tuples more explicitly for example
> > > retain_dead_tuples?
> >
> > We considered the name you suggested, but we wanted to convey that this option
> > not only retains dead tuples but also preserves commit timestamps and origin
> > data for conflict detection, hence we opted for a more general name. Do you
> > have better suggestions?
>
> I think the commit timestamp and origin data would be retained as a
> result of retaining dead tuples. While such a general name could
> convey more than retaining dead tuples, I'm concerned that it could be
> ambiguous what exactly to retain by the subscription. How about the
> following names or something along those lines?
>
> - retain_dead_tuples_for_conflict
> - delay_vacuum_for_conflict
> - keep_dead_tuples_for_conflict
>

Among these, the first option is the best, but I think it is better to
name it just retain_dead_tuples. The explanation of the option will
describe its use. It is similar to other options like binary or
streaming; we are not naming them something like
request_data_binary_format to make the meaning apparent. There is value
in keeping names succinct.
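
For illustration, the succinct name would read naturally at the SQL
level (a hedged sketch only; the option name is still being discussed in
this thread, and the object names are illustrative):

CREATE SUBSCRIPTION sub1
    CONNECTION 'host=node_a dbname=postgres'
    PUBLICATION pub1
    WITH (retain_dead_tuples = on);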

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Saturday, July 19, 2025 5:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> Here are some review comments and questions:

Thanks for the comments!

> 
> +   /*
> +    * Do not allow users to acquire the reserved slot. This scenario may
> +    * occur if the launcher that owns the slot has terminated unexpectedly
> +    * due to an error, and a backend process attempts to reuse the slot.
> +    */
> +   if (!IsLogicalLauncher() && IsReservedSlotName(name))
> +       ereport(ERROR,
> +               errcode(ERRCODE_UNDEFINED_OBJECT),
> +               errmsg("cannot acquire replication slot \"%s\"", name),
> +               errdetail("The slot is reserved for conflict detection
> and can only be acquired by logical replication launcher."));
> 
> I think it might be better to rename IsReservedSlotName() to be more specific to
> the conflict detection because we might want to add more reserved slot names
> in the future that would not necessarily be acquired only by the launcher
> process.

Agreed. I have renamed it to IsSlotForConflictCheck.

> 
> ---
> Regarding the option name we've discussed:
> 
> > > The new parameter name "retain_conflict_info" sounds to me like we
> > > keep the conflict information somewhere that has expired at some
> > > time such as how many times insert_exists or update_origin_differs
> > > happened. How about choosing a name that indicates retain dead
> > > tuples more explicitly for example retain_dead_tuples?
> >
> > We considered the name you suggested, but we wanted to convey that
> > this option not only retains dead tuples but also preserves commit
> > timestamps and origin data for conflict detection, hence we opted for
> > a more general name. Do you have better suggestions?
> 
> I think the commit timestamp and origin data would be retained as a result of
> retaining dead tuples. While such a general name could convey more than
> retaining dead tuples, I'm concerned that it could be ambiguous what exactly to
> retain by the subscription. How about the following names or something along
> those lines?
> 
> - retain_dead_tuples_for_conflict
> - delay_vacuum_for_conflict
> - keep_dead_tuples_for_conflict

OK, I used the shorter version retain_dead_tuples as suggested by Amit[1].

> 
> ---
> +check_pub_conflict_info_retention(WalReceiverConn *wrconn, bool
> retain_conflict_info)
> +{
> +   WalRcvExecResult *res;
> +   Oid         RecoveryRow[1] = {BOOLOID};
> +   TupleTableSlot *slot;
> +   bool        isnull;
> +   bool        remote_in_recovery;
> +
> +   if (!retain_conflict_info)
> +       return;
> 
> It seems that retain_conflict_info is used only for this check to quick exit from
> this function. How about calling this function only when the caller knows
> retain_conflict_info is true instead of adding it as a function argument?

Changed as suggested.

> 
> ---
> +void
> +CheckSubConflictInfoRetention(bool retain_conflict_info, bool check_guc,
> +                             bool sub_disabled, int
> +elevel_for_sub_disabled)
> 
> This function seems not to be specific to the apply workers but to
> subscriptions in general. Is there any reason why we define this function in
> worker.c?

I do not have a special reason, so I moved it to subscriptioncmds.c
where most of these checks are performed.

> 
> ---
> +   if (check_guc && wal_level < WAL_LEVEL_REPLICA)
> +       ereport(ERROR,
> +
> errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +               errmsg("\"wal_level\" is insufficient to create the
> replication slot required by retain_conflict_info"),
> +               errhint("\"wal_level\" must be set to \"replica\" or
> \"logical\" at server start."));
> +
> 
> Why does (retain_conflict_info == false && wal_level == minimal) not work?

I think it works because the check is skipped when rci is false. BTW, to be consistent with
check_pub_conflict_info_retention, I moved the retain_conflict_info check outside of this
function.

[1] https://www.postgresql.org/message-id/CAA4eK1JdCJK0KtV5WGYWoQVe7S3uqx-7J7t1qFpCki_rdLQFmw%40mail.gmail.com

Best Regards,
Hou zj



Attachments

Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Sat, Jul 19, 2025 at 10:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jul 19, 2025 at 3:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Jul 18, 2025 at 5:03 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> >
> > Here are some review comments and questions:
> >
> >
> > ---
> > +       if (inCommitOnly &&
> > +           (proc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0)
> > +           continue;
> > +
> >
> > I've not verified yet but is it possible that we exclude XIDs of
> > processes who are running on other databases?
> >
>
> I can't see how, even the comments atop function says: " We look at
> all databases, though there is no need to include WALSender since this
> has no effect on hot standby conflicts." which indicate that it
> shouldn't exlude XIDs of procs who are running on other databases.
>

I think I misunderstood your question. You were asking whether we
should exclude XIDs of processes running on other databases in the
above check, as we don't need those for our purpose. If so, I agree
with you; we don't need XIDs of other databases as a logical WALSender
won't process transactions in other databases anyway, so we can
exclude those. The function GetOldestActiveTransactionId() is called
from two places in patch get_candidate_xid() and
ProcessStandbyPSRequestMessage(). We don't need to care for XIDs in
other databases at both places but care for
Commit_Critical_Section_Phase when called from
ProcessStandbyPSRequestMessage(). So, we probably need two parameters
to distinguish those cases.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Nisha Moond
Date:
On Mon, Jul 21, 2025 at 9:30 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jul 19, 2025 at 10:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Jul 19, 2025 at 3:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Jul 18, 2025 at 5:03 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > >
> > > Here are some review comments and questions:
> > >
> > >
> > > ---
> > > +       if (inCommitOnly &&
> > > +           (proc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0)
> > > +           continue;
> > > +
> > >
> > > I've not verified yet but is it possible that we exclude XIDs of
> > > processes who are running on other databases?
> > >
> >
> > I can't see how, even the comments atop function says: " We look at
> > all databases, though there is no need to include WALSender since this
> > has no effect on hot standby conflicts." which indicate that it
> > shouldn't exlude XIDs of procs who are running on other databases.
> >
>
> I think I misunderstood your question. You were asking if possible, we
> should exclude XIDs of processes running on other databases in the
> above check as for our purpose, we don't need those. If so, I agree
> with you, we don't need XIDs of other databases as logical WALSender
> will anyway won't process transactions in other databases, so we can
> exclude those. The function GetOldestActiveTransactionId() is called
> from two places in patch get_candidate_xid() and
> ProcessStandbyPSRequestMessage(). We don't need to care for XIDs in
> other databases at both places but care for
> Commit_Critical_Section_Phase when called from
> ProcessStandbyPSRequestMessage(). So, we probably need two parameters
> to distinguish those cases.
>

It seems unnecessary to track transactions on other databases, as they
won't be replicated to the subscriber.
So, a new parameter 'allDbs' is introduced to control the filtering of
transactions from other databases.

Attached is the updated V51 patch.
Thank you, Hou-san, for updating the patch for this change.

--
Thanks,
Nisha

Attachments

Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Sun, Jul 20, 2025 at 9:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jul 19, 2025 at 10:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Jul 19, 2025 at 3:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Jul 18, 2025 at 5:03 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > >
> > > Here are some review comments and questions:
> > >
> > >
> > > ---
> > > +       if (inCommitOnly &&
> > > +           (proc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0)
> > > +           continue;
> > > +
> > >
> > > I've not verified yet but is it possible that we exclude XIDs of
> > > processes who are running on other databases?
> > >
> >
> > I can't see how, even the comments atop function says: " We look at
> > all databases, though there is no need to include WALSender since this
> > has no effect on hot standby conflicts." which indicate that it
> > shouldn't exlude XIDs of procs who are running on other databases.
> >
>
> I think I misunderstood your question. You were asking if possible, we
> should exclude XIDs of processes running on other databases in the
> above check as for our purpose, we don't need those.

Right.

> If so, I agree
> with you, we don't need XIDs of other databases as logical WALSender
> will anyway won't process transactions in other databases, so we can
> exclude those. The function GetOldestActiveTransactionId() is called
> from two places in patch get_candidate_xid() and
> ProcessStandbyPSRequestMessage(). We don't need to care for XIDs in
> other databases at both places but care for
> Commit_Critical_Section_Phase when called from
> ProcessStandbyPSRequestMessage(). So, we probably need two parameters
> to distinguish those cases.

Why do we need to include all XIDs even in the cases called from
ProcessStandbyPSRequestMessage()? I guess that there is no chance that
the changes happening on other (non-subscribed) databases could
conflict with something on the subscriber.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Mon, Jul 21, 2025 at 11:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Jul 20, 2025 at 9:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> > If so, I agree
> > with you, we don't need XIDs of other databases as logical WALSender
> > will anyway won't process transactions in other databases, so we can
> > exclude those. The function GetOldestActiveTransactionId() is called
> > from two places in patch get_candidate_xid() and
> > ProcessStandbyPSRequestMessage(). We don't need to care for XIDs in
> > other databases at both places but care for
> > Commit_Critical_Section_Phase when called from
> > ProcessStandbyPSRequestMessage(). So, we probably need two parameters
> > to distinguish those cases.
>
> Why do we need to include all XIDs even in the cases called from
> ProcessStandbyPSRequestMessage()?
>

No, we don't need all XIDs even in the case of
ProcessStandbyPSRequestMessage(). That is what I wrote: "The function
GetOldestActiveTransactionId() is called from two places in patch
get_candidate_xid() and ProcessStandbyPSRequestMessage(). We don't
need to care for XIDs in other databases at both places ...". Am I
missing something, or did you misread it?

> I guess that there is no chance that
> the changes happening on other (non-subscribed) databases could
> conflict with something on the subscriber.
>

Right.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Masahiko Sawada
Date:
On Mon, Jul 21, 2025 at 8:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jul 21, 2025 at 11:27 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sun, Jul 20, 2025 at 9:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > > If so, I agree
> > > with you, we don't need XIDs of other databases as logical WALSender
> > > will anyway won't process transactions in other databases, so we can
> > > exclude those. The function GetOldestActiveTransactionId() is called
> > > from two places in patch get_candidate_xid() and
> > > ProcessStandbyPSRequestMessage(). We don't need to care for XIDs in
> > > other databases at both places but care for
> > > Commit_Critical_Section_Phase when called from
> > > ProcessStandbyPSRequestMessage(). So, we probably need two parameters
> > > to distinguish those cases.
> >
> > Why do we need to include all XIDs even in the cases called from
> > ProcessStandbyPSRequestMessage()?
> >
>
> No, we don't need all XIDs even in the case of
> ProcessStandbyPSRequestMessage(). That is what I wrote: "The function
> GetOldestActiveTransactionId() is called from two places in patch
> get_candidate_xid() and ProcessStandbyPSRequestMessage(). We don't
> need to care for XIDs in other databases at both places ...". Am I
> missing something or you misread it?

Oh I misread it. Sorry for the noise.

>
> > I guess that there is no chance that
> > the changes happening on other (non-subscribed) databases could
> > conflict with something on the subscriber.
> >
>
> Right.

I've reviewed the 0001 patch and it looks good to me. The patch still
has XXX comments at several places. Do we want to keep all of them as
they are (i.e., as something like TODO or FIXME)?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Jul 23, 2025 at 3:51 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've reviewed the 0001 patch and it looks good to me.
>

Thanks, I have pushed the 0001 patch.

> The patch still
> has XXX comments at several places. Do we want to keep all of them
>

Yes, those are primarily ideas for future optimization and/or
special notes for some not-so-obvious design decisions.

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, July 23, 2025 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Wed, Jul 23, 2025 at 3:51 AM Masahiko Sawada
> <sawada.mshk@gmail.com> wrote:
> >
> > I've reviewed the 0001 patch and it looks good to me.
> >
> 
> Thanks, I have pushed the 0001 patch.

Thanks for pushing. I have rebased the remaining patches.

I have reordered the patches to prioritize the detection of update_deleted as
the initial patch. This can give us more time to consider the new GUC, since the
performance-related aspects have been documented.

One previous patch, used to prove the possibility of allowing
retain_dead_tuples to be changed for an enabled subscription, has not yet
been rebased. I will rebase that once all the main patches are stable.

Best Regards,
Hou zj


Attachments

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Wed, Jul 23, 2025 at 12:53 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, July 23, 2025 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jul 23, 2025 at 3:51 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > >
> > > I've reviewed the 0001 patch and it looks good to me.
> > >
> >
> > Thanks, I have pushed the 0001 patch.
>
> Thanks for pushing. I have rebased the remaining patches.
>
> I have reordered the patches to prioritize the detection of update_deleted as
> the initial patch. This can give us more time to consider the new GUC, since the
> performance-related aspects have been documented.
>

Thank you for the patches. Please find a few comments for 0001:

1)
+               if (HeapTupleSatisfiesVacuum(tuple, oldestXmin, buf)
== HEAPTUPLE_RECENTLY_DEAD)
+                       dead = true;

Shall we name the variable 'recently_dead', as 'dead' could also mean
HEAPTUPLE_DEAD?

2)
+               if (MySubscription->retaindeadtuples &&
+                       FindMostRecentlyDeletedTupleInfo(localrel, remoteslot,
+                                                        &conflicttuple.xmin,
+                                                        &conflicttuple.origin,
+                                                        &conflicttuple.ts) &&
+                       conflicttuple.origin != replorigin_session_origin)
+                       type = CT_UPDATE_DELETED;
+               else
+                       type = CT_UPDATE_MISSING;

Shall the conflict be detected as update_deleted irrespective of origin?

3)
+               /*
+                * We do not consider HEAPTUPLE_DEAD status because it indicates
+                * either tuples whose inserting transaction was aborted, meaning
+                * there is no commit timestamp or origin, or tuples deleted by a
+                * transaction older than oldestXmin, making it safe to ignore them
+                * during conflict detection (See comments atop
+                * maybe_advance_nonremovable_xid() for details).
+                */

a) We can use parentheses for the part below, otherwise the sentence
becomes confusing due to multiple 'or's used with 'either':
 (meaning there is no commit timestamp or origin)

b) We can change the last line to: See comments atop worker.c for details

4)
+      <para>
+       The tuple to be updated was deleted by another origin. The update will
+       simply be skipped in this scenario.
+       Note that this conflict can only be detected when
+       <xref linkend="guc-track-commit-timestamp"/>
+       and <link
linkend="sql-createsubscription-params-with-retain-dead-tuples"><literal>retain_dead_tuples</literal></link>
+       are enabled. Note that if a tuple cannot be found due to the table being
+       truncated only a <literal>update_missing</literal> conflict will
+       arise
+      </para>

4a)
Can we please make the link as:
<link linkend="guc-track-commit-timestamp"><varname>track_commit_timestamp</varname></link>

This is because the current usage of 'xref linkend' gives it a
slightly bigger font in the html file, while all other references to this
GUC use the normal font.

4b) We need a comma after truncated:
Note that if a tuple cannot be found due to the table being truncated,
only a update_missing conflict will arise.

5)
monitoring.sgml:
+      <para>
+       Number of times the tuple to be updated was deleted by another origin
+       during the application of changes. See <xref
linkend="conflict-update-deleted"/>
+       for details about this conflict.
+      </para></entry>

Here we are using the term 'by another origin', while in the rest of
the doc (see confl_update_origin_differs, confl_delete_origin_differs)
we use the term 'by another source'. Shall we keep it the same?
OTOH, I think using 'origin' is better, but the rest of the page is
using 'source'. So perhaps changing 'source' to 'origin' everywhere is
better. Thoughts?
This can be changed if needed once we decide on point 2 above.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Jul 23, 2025 at 12:53 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Thanks for pushing. I have rebased the remaining patches.
>

+ * This function performs a full table scan instead of using indexes because
+ * index scans could miss deleted tuples if an index has been re-indexed or
+ * re-created during change applications.

IIUC, once the tuple is not found during an update, the patch does an
additional scan with SnapshotAny to find the DEAD tuple, so that it
can report an update_deleted conflict, if one is found. The reasoning in
the comments for doing a sequential scan in such cases sounds reasonable,
but I was wondering whether we could do an index scan if the pg_conflict_*
slot's xmin is ahead of the index tuple's xmin of the RI (or any usable
index that can be used during the scan)? Note, we use a similar check with
the indcheckxmin parameter in pg_index, though the purpose of that is
different. If this can happen, then the index scan will still happen in
most cases.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Thu, Jul 24, 2025 at 9:12 AM shveta malik <shveta.malik@gmail.com> wrote:
>
>
> 2)
> +               if (MySubscription->retaindeadtuples &&
> +                       FindMostRecentlyDeletedTupleInfo(localrel, remoteslot,
> +
>                   &conflicttuple.xmin,
> +
>                   &conflicttuple.origin,
> +
>                   &conflicttuple.ts) &&
> +                       conflicttuple.origin != replorigin_session_origin)
> +                       type = CT_UPDATE_DELETED;
> +               else
> +                       type = CT_UPDATE_MISSING;
>
> Shall the conflict be detected as update_deleted irrespective of origin?
>

Thinking about this more, I think we may have the possibility of an
UPDATE arriving after a DELETE from the same origin only when a
publication selectively publishes certain operations.

1)
Consider a publication that only publishes UPDATE and DELETE
operations. On the publisher, we may perform operations like DELETE,
INSERT, and UPDATE. On the subscriber, only DELETE and UPDATE events
are received. In this case, should we treat the incoming UPDATE as
update_deleted or update_missing?
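
For example, such a selective publication could be created as below (a
sketch using the standard publish-list syntax; names are illustrative):

-- The publisher sends only UPDATE and DELETE; INSERTs are never replicated.
CREATE PUBLICATION pub_upd_del FOR TABLE t
    WITH (publish = 'update, delete');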

2)
Another topology could be:
pub1 --> pub2 --> sub (origin=any)
pub1 --> sub

- pub1 publishing only DELETEs to pub2 and the same are published to
sub.
- pub1 publishing only UPDATEs to sub.

Now, consider that on pub1, an UPDATE occurs first, followed by a
DELETE. But on the sub, the events are received in reverse order:
DELETE arrives before UPDATE. Since both operations originated from
the same source (pub1), how should the delayed UPDATE's conflict be
interpreted? Should it be detected as update_deleted or
update_missing? Logically, since the DELETE is the more recent
operation, it should be the final one and UPDATE should be ignored.
But if we detect it as update_missing, we might incorrectly apply the
UPDATE.
Thoughts?

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Jul 25, 2025 at 12:37 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Thu, Jul 24, 2025 at 9:12 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> >
> > 2)
> > +               if (MySubscription->retaindeadtuples &&
> > +                       FindMostRecentlyDeletedTupleInfo(localrel, remoteslot,
> > +
> >                   &conflicttuple.xmin,
> > +
> >                   &conflicttuple.origin,
> > +
> >                   &conflicttuple.ts) &&
> > +                       conflicttuple.origin != replorigin_session_origin)
> > +                       type = CT_UPDATE_DELETED;
> > +               else
> > +                       type = CT_UPDATE_MISSING;
> >
> > Shall the conflict be detected as update_deleted irrespective of origin?
> >
>
> On thinking more here, I think that we may have the possibility of
> UPDATE after DELETE from the same origin only when a publication
> selectively publishes certain operations.
>
> 1)
> Consider a publication that only publishes UPDATE and DELETE
> operations. On the publisher, we may perform operations like DELETE,
> INSERT, and UPDATE. On the subscriber, only DELETE and UPDATE events
> are received. In this case, should we treat the incoming UPDATE as
> update_deleted or update_missing?
>

If the user has subscribed only to certain operations like Update or
Delete, she may not be interested in eventual consistency, as some of
the data may not be replicated, so conflict detection followed by any
resolution may not be helpful.

The other point is that if we report update_deleted in such cases, it
won't be reliable: sometimes it can be update_missing because vacuum
may have removed the row. OTOH, if we report update_missing, it will
always be the same conflict, and we can document it.

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, July 25, 2025 2:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Wed, Jul 23, 2025 at 12:53 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Thanks for pushing. I have rebased the remaining patches.
> >
> 
> + * This function performs a full table scan instead of using indexes
> + because
> + * index scans could miss deleted tuples if an index has been
> + re-indexed or
> + * re-created during change applications.
> 
> IIUC, once the tuple is not found during update, the patch does an additional
> scan with SnapshotAny to find the DEAD tuple, so that it can report
> update_deleted conflict, if one is found. The reason in the comments to do
> sequential scan in such cases sound reasonable but I was thinking if we can
> do index scans if the pg_conflict_* slot's xmin is ahead of the RI (or any usable
> index that can be used during scan) index_tuple's xmin? Note, we use a similar
> check with the indcheckxmin parameter in pg_index though the purpose of
> that is different. If this can happen then still in most cases the index scan will
> happen.

Right, I think it makes sense to use the index scan when the index's xmin is
less than the conflict detection xmin, as that ensures that all the tuples
deleted before the index creation or re-indexing are irrelevant for conflict
detection.

I have implemented this in the V53 patch set and improved the test to verify
both the index and the sequential scan for dead tuples.

The V53-0001 also includes Shveta's comments in [1].

Apart from the above issue,
I'd like to clarify why the patch scans all matching dead tuples in the
relation to find the most recently deleted one, and I will share an example
for the same.

The main reason is that only the latest deletion information is relevant for
resolving conflicts. If the first tuple retrieved is outdated while a newer
deleted tuple exists, users may incorrectly resolve the remote change when
applying a last-update-win strategy. Here is an example:

1. In a BI-cluster setup, if both nodes initially contain empty data:
 
Node A: tbl (empty)
Node B: tbl (empty)
 
2. Then the user does the following operations on Node A and waits for them to
be replicated to Node B:
 
INSERT (pk, 1)
DELETE (pk, 1) @9:00
INSERT (pk, 1)
 
The data on both nodes looks like:
 
Node A: tbl (pk, 1) - live tuple
           (pk, 1) - dead tuple - @9:00
Node B: tbl (pk, 1) - live tuple
           (pk, 1) - dead tuple - @9:00
 
3. The user does DELETE (pk) on Node B @9:02, and does UPDATE (pk, 1)->(pk, 2) on Node A
   @9:01.
 
When applying the UPDATE on Node B, it cannot find the target tuple, so it will
search the dead tuples, but there are two dead tuples:
 
Node B: tbl (pk, 1) - live tuple
           (pk, 1) - dead tuple - @9:00
           (pk, 1) - dead tuple - @9:02
 
If we only fetch the first tuple in the scan, it could be either a) the tuple
deleted @9:00, which is older than the remote UPDATE, or b) the tuple deleted
@9:02, which is newer than the remote UPDATE @9:01. The user may choose to apply
the UPDATE for case a), which can cause data inconsistency between nodes
(using a last-update-win strategy).

Ideally, we should give the resolver the newer dead tuple deleted @9:02, so the
resolver can choose to ignore the remote UPDATE, keeping the data consistent.
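
For reference, the per-tuple bookkeeping in the scan essentially boils down to
keeping only the newest committed deletion, roughly like the sketch below
(names are simplified, not the exact patch code):

#include "postgres.h"
#include "access/commit_ts.h"
#include "replication/origin.h"

/*
 * Illustrative sketch: among all matching dead tuples, remember only the one
 * whose deleting transaction (xmax) has the newest commit timestamp, so the
 * resolver sees the 9:02 deletion rather than the stale 9:00 one above.
 */
static void
remember_most_recent_deletion(TransactionId xmax,
							  TransactionId *delete_xid,
							  RepOriginId *delete_origin,
							  TimestampTz *delete_time)
{
	TimestampTz localts;
	RepOriginId localorigin;

	if (TransactionIdGetCommitTsData(xmax, &localts, &localorigin) &&
		(*delete_time == 0 || localts > *delete_time))
	{
		*delete_xid = xmax;
		*delete_origin = localorigin;
		*delete_time = localts;
	}
}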

[1] https://www.postgresql.org/message-id/CAJpy0uDiyjDzLU-%3DNGO7PnXB4OLy4%2BRxJiAySdw%3Da%2BYO62JO2g%40mail.gmail.com

Best Regards,
Hou zj

Attachments

RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, July 24, 2025 11:42 AM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Wed, Jul 23, 2025 at 12:53 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, July 23, 2025 12:08 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Jul 23, 2025 at 3:51 AM Masahiko Sawada
> > > <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I've reviewed the 0001 patch and it looks good to me.
> > > >
> > >
> > > Thanks, I have pushed the 0001 patch.
> >
> > Thanks for pushing. I have rebased the remaining patches.

Thanks for the comments!

> >
> > I have reordered the patches to prioritize the detection of
> > update_deleted as the initial patch. This can give us more time to
> > consider the new GUC, since the performance-related aspects have been
> documented.
> >
> 
> 2)
> +               if (MySubscription->retaindeadtuples &&
> +                       FindMostRecentlyDeletedTupleInfo(localrel,
> + remoteslot,
> +
>                   &conflicttuple.xmin,
> +
>                   &conflicttuple.origin,
> +
>                   &conflicttuple.ts) &&
> +                       conflicttuple.origin != replorigin_session_origin)
> +                       type = CT_UPDATE_DELETED;
> +               else
> +                       type = CT_UPDATE_MISSING;
> 
> Shall the conflict be detected as update_deleted irrespective of origin?

According to the discussion[1], I kept the current behavior.

> 
> 
> 5)
> monitoring.sgml:
> +      <para>
> +       Number of times the tuple to be updated was deleted by another origin
> +       during the application of changes. See <xref
> linkend="conflict-update-deleted"/>
> +       for details about this conflict.
> +      </para></entry>
> 
> Here we are using the term 'by another origin', while in the rest of the doc (see
> confl_update_origin_differs, confl_delete_origin_differs) we use the term 'by
> another source'. Shall we keep it the same?
> OTOH, I think using 'origin' is better but the rest of the page  is using source.
> So perhaps changing source to origin everywhere is better. Thoughts?
> This can be changed if needed once we decide on point 2 above.

Yes, 'origin' may be better. But for now, I have changed it to 'source' here to
be consistent with the descriptions around it, and we can improve it in a
separate patch if needed.

Other comments have been addressed in the V53 patch set.

[1] https://www.postgresql.org/message-id/CAA4eK1L09u_A0HFRydA4xc%3DHpPkCh%2B7h-%2B_WRhKw1Cksp5_5zQ%40mail.gmail.com

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Nisha Moond
Date:
Hi All,

We conducted performance testing of a bi-directional logical
replication setup, focusing on the primary use case of the
update_deleted feature.
To simulate a realistic scenario, we used a high workload with limited
concurrent updates, and well-distributed writes among servers.

Used source
===========
pgHead commit 62a17a92833 + v47 patch set

Machine details
===============
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz CPU(s) :88 cores, - 503 GiB RAM

Test-1: Distributed Write Load
==============================
Highlight:
-----------
 - In a bi-directional logical replication setup, with
well-distributed write workloads and a thoughtfully tuned
configuration to minimize lag (e.g., through row filters), TPS
regression is minimal or even negligible.
 - Performance can be sustained with significantly fewer apply workers
compared to the number of client connections on the publisher.

Setup:
--------
 - 2 nodes (node1 and node2) are created (on the same machine) with the same
configuration -
    autovacuum = false
    shared_buffers = '30GB'
    -- Also, worker and logical replication related parameters were
increased as per requirement (see attached scripts for details).
 - Both nodes have two sets of pgbench tables initialized with *scale=300*:
   -- set1: pgbench_pub_accounts, pgbench_pub_tellers,
pgbench_pub_branches, and pgbench_pub_history
   -- set2: pgbench_accounts, pgbench_tellers, pgbench_branches, and
pgbench_history
 - Node1 is publishing all changes for set1 tables and Node2 has
subscribed to the same.
 - Node2 is publishing all changes for set2 tables and Node1 has
subscribed to the same.
Note: In all the tests, subscriptions are created with (origin=NONE)
as it is a bi-directional replication.

Workload Run:
---------------
 - On node1, pgbench(read-write) with option "-b simple-update" is run
on set1 tables.
 - On node2, pgbench(read-write) with option "-b simple-update" is run
on set2 tables.
 - #clients = 40
 - pgbench run duration = 10 minutes.
 - results were measured for 3 runs of each case.

Test Runs:
- Six tests were done with varying #pub-sub pairs and below is TPS
reduction in both nodes for all the cases:

| Case | # Pub-Sub Pairs | TPS Reduction  |
| ---- | --------------- | -------------- |
| 01   | 30              | 0–1%           |
| 02   | 15              | 6–7%           |
| 03   | 5               | 7–8%           |
| 04   | 3               | 0-1%           |
| 05   | 2               | 14–15%         |
| 06   | 1 (no filters)  | 37–40%         |

 - With appropriate row filters and distribution of load across apply
workers, the performance impact of the update_deleted patch can be
minimized.
 - Just 3 pub-sub pairs are enough to keep TPS close to the baseline
for the given workload.
 - Poor distribution of replication workload (e.g., only 1–2 pub-sub
pairs) leads to higher overhead due to increased apply worker
contention.
~~~~

Detailed results for all the above cases:

case-01:
---------
 - Created 30 pub-sub pairs to distribute the replication load between
30 apply workers on each node.

Results:
| #run       | pgHead_Node1_TPS | patched_Node1_TPS | pgHead_Node2_TPS | patched_Node2_TPS |
| ---------- | ---------------- | ----------------- | ---------------- | ----------------- |
| 1          | 5633.377165      | 5579.244492       | 6385.839585      | 6482.775975       |
| 2          | 5926.328644      | 5947.035275       | 6216.045707      | 6416.113723       |
| 3          | 5522.804663      | 5542.380108       | 6541.031535      | 6190.123097       |
| median     | 5633.377165      | 5579.244492       | 6385.839585      | 6416.113723       |
| regression |                  | -1%               |                  | 0%                |

 - No regression
~~~~

case-02:
---------
 - #pub-sub pairs = 15

Results:
| #run       | pgHead_Node1_TPS | patched_Node1_TPS | pgHead_Node2_TPS | patched_Node2_TPS |
| ---------- | ---------------- | ----------------- | ---------------- | ----------------- |
| 1          | 8207.708475      | 7584.288026       | 8854.017934      | 8204.301497       |
| 2          | 8120.979334      | 7404.735801       | 8719.451895      | 8169.697482       |
| 3          | 7877.859139      | 7536.762733       | 8542.896669      | 8177.853563       |
| median     | 8120.979334      | 7536.762733       | 8719.451895      | 8177.853563       |
| regression |                  | -7%               |                  | -6%               |

 - There was a 6-7% TPS reduction on both nodes, which seems to be within the acceptable range.
~~~

case-03:
---------
 - #pub-sub pairs = 5

Results:
| #run       | pgHead_Node1_TPS | patched_Node1_TPS | pgHead_Node2_TPS | patched_Node2_TPS |
| ---------- | ---------------- | ----------------- | ---------------- | ----------------- |
| 1          | 12325.90315      | 11664.7445        | 12997.47104      | 12324.025         |
| 2          | 12060.38753      | 11370.52775       | 12728.41287      | 12127.61208       |
| 3          | 12390.3677       | 11367.10255       | 13135.02558      | 12036.71502       |
| median     | 12325.90315      | 11370.52775       | 12997.47104      | 12127.61208       |
| regression |                  | -8%               |                  | -7%               |

 - There was a 7-8% TPS reduction on both nodes, which seems to be within the acceptable range.
~~~

case-04:
---------
 -  #pub-sub pairs = 3

Results:
| #run       | pgHead_Node1_TPS | patched_Node1_TPS | pgHead_Node2_TPS | patched_Node2_TPS |
| ---------- | ---------------- | ----------------- | ---------------- | ----------------- |
| 1          | 13186.22898      | 12464.42604       | 13973.8394       | 13370.45596       |
| 2          | 13038.15817      | 13014.03906       | 13866.51966      | 13866.47395       |
| 3          | 13881.10513      | 13868.71971       | 14687.67444      | 14516.33854       |
| median     | 13186.22898      | 13014.03906       | 13973.8394       | 13866.47395       |
| regression |                  | -1%               |                  | -1%               |

 - No regression observed


case-05:
---------
 -  #pub-sub pairs = 2

Results:
| #run       | pgHead_Node1_TPS | patched_Node1_TPS | pgHead_Node2_TPS | patched_Node2_TPS |
| ---------- | ---------------- | ----------------- | ---------------- | ----------------- |
| 1          | 15936.98792      | 13563.98476       | 16734.35292      | 14527.22942       |
| 2          | 16031.23003      | 13648.24979       | 16958.49609      | 14657.80008       |
| 3          | 16113.79935      | 13550.68329       | 17029.5035       | 14509.84068       |
| median     | 16031.23003      | 13563.98476       | 16958.49609      | 14527.22942       |
| regression |                  | -15%              |                  | -14%              |

 - The TPS reduced by 14-15% on both nodes.
~~~

case-06:
---------
 - #pub-sub pairs = 1 , no row filter is used on both nodes

Results:
| #run       | pgHead_Node1_TPS | patched_Node1_TPS | pgHead_Node2_TPS | patched_Node2_TPS |
| ---------- | ---------------- | ----------------- | ---------------- | ----------------- |
| 1          | 22900.06507      | 13609.60639       | 23254.25113      | 14592.25271       |
| 2          | 22110.98426      | 13907.62583       | 22755.89945      | 14805.73717       |
| 3          | 22719.88901      | 13246.41484       | 23055.70406      | 14256.54223       |
| median     | 22719.88901      | 13609.60639       | 23055.70406      | 14592.25271       |
| regression |                  | -40%              |                  | -37%              |

- The regression observed is 37-40% on both nodes.
~~~~


Test-2: High concurrency
===========================
Highlight:
------------
 Despite poor write distribution across servers and high concurrent
updates, distributing replication load across multiple apply workers
limited the TPS drop to just 15–18%.

Setup:
---------------
 - 2 nodes (node1 and node2) are created with the same configuration as in Test-01
 - Both nodes have the same set of pgbench tables initialized with
scale=60 (small tables to increase concurrent updates)
 - Both nodes are subscribed to each other for all the changes.
  -- 15 pub-sub pairs are created using row filters to distribute the
load, and all the subscriptions are created with (origin = NONE).

Workload Run:
---------------
 - On both nodes, the default pgbench (read-write) is run on the tables.
 - #clients = 15
 - pgbench run duration = 5 minutes.
 - results were measured for 2 runs of each case.

Results:

Node1 TPS:
#run   pgHead_Node1_TPS   patched_Node1_TPS
1   9585.470749   7660.645249
2   9442.364918   8035.531482
median   9513.917834   7848.088366
regression     -18%

Node2 TPS:

#run   pgHead_Node2_TPS   patched_Node2_TPS
1   9485.232611   8248.783417
2   9468.894086   7938.991136
median  9477.063349   8093.887277
regression    -15%

- Under high concurrent writes to the same small tables, contention
increases and the TPS drop is 15-18% on both nodes.
~~~~

The scripts used for above tests are attached.

--
Thanks,
Nisha

Attachments

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Fri, Jul 25, 2025 at 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 25, 2025 at 12:37 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Thu, Jul 24, 2025 at 9:12 AM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > >
> > > 2)
> > > +               if (MySubscription->retaindeadtuples &&
> > > +                       FindMostRecentlyDeletedTupleInfo(localrel, remoteslot,
> > > +
> > >                   &conflicttuple.xmin,
> > > +
> > >                   &conflicttuple.origin,
> > > +
> > >                   &conflicttuple.ts) &&
> > > +                       conflicttuple.origin != replorigin_session_origin)
> > > +                       type = CT_UPDATE_DELETED;
> > > +               else
> > > +                       type = CT_UPDATE_MISSING;
> > >
> > > Shall the conflict be detected as update_deleted irrespective of origin?
> > >
> >
> > On thinking more here, I think that we may have the possibility of
> > UPDATE after DELETE from the same origin only when a publication
> > selectively publishes certain operations.
> >
> > 1)
> > Consider a publication that only publishes UPDATE and DELETE
> > operations. On the publisher, we may perform operations like DELETE,
> > INSERT, and UPDATE. On the subscriber, only DELETE and UPDATE events
> > are received. In this case, should we treat the incoming UPDATE as
> > update_deleted or update_missing?
> >
>
> If the user is doing subscription only for certain operations like
> Update or Delete, she may not be interested in eventual consistency as
> some of the data may not be replicated, so a conflict detection
> followed by any resolution may not be helpful.
>
> The other point is that if we report update_delete in such cases, it
> won't be reliable, sometimes it can be update_missing as vacuum would
> have removed the row, OTOH, if we report update_missing, it will
> always be the same conflict, and we can document it.
>

Agree with both the points. We can keep the current behaviour as it is.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Jul 25, 2025 at 4:38 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Right, I think it makes sense to do with the index scan when the index's xmin is
> less than the conflict detection xmin, as that can ensure that all the tuples
> deleted before the index creation or re-indexing are irrelevant for conflict
> detection.
>
> I have implemented in the V53 patch set and improved the test to verify both
> index and seq scan for dead tuples.
>

Thanks. Following are a few comments on 0001 patch:

1.
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1397,6 +1397,7 @@ CREATE VIEW pg_stat_subscription_stats AS
         ss.apply_error_count,
         ss.sync_error_count,
         ss.confl_insert_exists,
+        ss.confl_update_deleted,
…
Datum
 pg_stat_get_subscription_stats(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_SUBSCRIPTION_STATS_COLS 11
+#define PG_STAT_GET_SUBSCRIPTION_STATS_COLS 12

Can we consider splitting the stats into a separate patch? It will help us
to first focus on the core functionality of detecting the update_deleted
conflict.

2.
While this approach may be slow on large tables,
+ * it is considered acceptable because it is only used in rare conflict cases
+ * where the target row for an update cannot be found.

Here we should add at the end "and no usable index is found".

3.
+ * We scan all matching dead tuples in the relation to find the most recently
+ * deleted one, rather than stopping at the first match. This is because only
+ * the latest deletion information is relevant for resolving conflicts.
+ * Returning solely the first, potentially outdated tuple can lead users to
+ * mistakenly apply remote changes using a last-update-win strategy,
even when a
+ * more recent deleted tuple is available. See comments atop worker.c for
+ * details.

I think we can share a short example of cases when this can happen.
And probably a test which will fail if the user only fetches the first
dead tuple?

4.
executor\execReplication.c(671) : warning C4700: uninitialized local
variable 'eq' used

Please fix this warning.

5.
+ /* Build scan key. */
+ skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
+
+ /* Start an index scan. */
+ scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);

While scanning with SnapshotAny, isn't it possible that we find some
tuple whose xact is still not committed, or that was inserted
successfully just before the scan started?

I think such tuples shouldn't be considered for reporting update_deleted.
It seems the patch handles this later in
update_recent_dead_tuple_info(), where it uses the following check: "if
(HeapTupleSatisfiesVacuum(tuple, oldestxmin, buf) ==
HEAPTUPLE_RECENTLY_DEAD)". Is my understanding correct? If so, we
should add some comments for it.
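
In other words, I am assuming the patch effectively does something like the
following for each tuple returned by the scan (a sketch with illustrative
names, not the patch's code):

#include "postgres.h"
#include "access/heapam.h"
#include "storage/bufmgr.h"

/*
 * Illustrative helper: classify one tuple returned by the SnapshotAny scan.
 * Only RECENTLY_DEAD tuples -- deleted by a committed transaction that is
 * still newer than oldestxmin -- remain candidates for update_deleted;
 * live, in-progress, aborted, and fully DEAD tuples are filtered out.
 */
static bool
tuple_is_recently_dead(HeapTuple tuple, TransactionId oldestxmin, Buffer buf)
{
	HTSV_Result res;

	/* HeapTupleSatisfiesVacuum() requires the buffer content lock */
	LockBuffer(buf, BUFFER_LOCK_SHARE);
	res = HeapTupleSatisfiesVacuum(tuple, oldestxmin, buf);
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);

	return res == HEAPTUPLE_RECENTLY_DEAD;
}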

6.
FindRecentlyDeletedTupleInfoSeq()
{
…
+ /* Get the index column bitmap for tuples_equal */
+ indexbitmap = RelationGetIndexAttrBitmap(rel,
+ INDEX_ATTR_BITMAP_IDENTITY_KEY);
+
+ /* fallback to PK if no replica identity */
+ if (!indexbitmap)
+ indexbitmap = RelationGetIndexAttrBitmap(rel,
+ INDEX_ATTR_BITMAP_PRIMARY_KEY);
…
...
+ if (!tuples_equal(scanslot, searchslot, eq, indexbitmap))
+ continue;

We don't do any such thing in RelationFindReplTupleSeq(), so, if we do
something differently here, it should be explained in the comments.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Fri, Jul 25, 2025 at 4:38 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> The V53-0001 also includes Shveta's comments in [1].
>

Thanks, I have not yet completed the review, but please find a few
comments on 001:

1)
IsIndexUsableForFindingDeletedTuple()
We first have:
+ /*
+ * A frozen transaction ID indicates that it must fall behind the conflict
+ * detection slot.xmin.
+ */
+ if (HeapTupleHeaderXminFrozen(index_tuple->t_data))
+ return true;

then this:
+ index_xmin = HeapTupleHeaderGetRawXmin(index_tuple->t_data);

Shall we use HeapTupleHeaderGetXmin() instead of the above two? We can check
whether the xid returned by HeapTupleHeaderGetXmin() is FrozenTransactionId or
a normal one and then do further processing.


2)
Both FindRecentlyDeletedTupleInfoByIndex and
FindRecentlyDeletedTupleInfoSeq() has:
+ /* Exit early if the commit timestamp data is not available */
+ if (!track_commit_timestamp)
+ return false;

We should either move this check to FindDeletedTupleInLocalRel() or to
the caller of FindDeletedTupleInLocalRel() where we check
'MySubscription->retaindeadtuples'. Moving it to the caller of
FindDeletedTupleInLocalRel() looks better, as there is no need to call
FindDeletedTupleInLocalRel() at all if the preconditions are not met.


3)
FindRecentlyDeletedTupleInfoSeq():
+ * during change applications. While this approach may be slow on large tables,
+ * it is considered acceptable because it is only used in rare conflict cases
+ * where the target row for an update cannot be found.
+ */
Shall we extend the last line to:
where the target row for an update cannot be found and index scan can
not be used.


4)
catalogs.sgml:
+       If true, he detection of <xref linkend="conflict-update-deleted"/> is
he --> the

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Mon, Jul 28, 2025 at 4:38 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, Jul 25, 2025 at 4:38 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> >
> > The V53-0001 also includes Shveta's comments in [1].
> >
>
> Thanks, I have not yet completed the review, but please find a few
> comments on 001:
>
> 1)
> IsIndexUsableForFindingDeletedTuple()
> We first have:
> + /*
> + * A frozen transaction ID indicates that it must fall behind the conflict
> + * detection slot.xmin.
> + */
> + if (HeapTupleHeaderXminFrozen(index_tuple->t_data))
> + return true;
>
> thent his:
> + index_xmin = HeapTupleHeaderGetRawXmin(index_tuple->t_data);
>
> Shall we use HeapTupleHeaderGetXmin() instead of above 2? We can check
> if xid returned by HeapTupleHeaderGetXmin() is FrozenTransactionId or
> normal one and then do further processing.
>
>
> 2)
> Both FindRecentlyDeletedTupleInfoByIndex and
> FindRecentlyDeletedTupleInfoSeq() has:
> + /* Exit early if the commit timestamp data is not available */
> + if (!track_commit_timestamp)
> + return false;
>
> We shall either move this check to FindDeletedTupleInLocalRel() or in
> the caller of FindDeletedTupleInLocalRel() where we check
> 'MySubscription->retaindeadtuples'. Moving to the caller of
> FindDeletedTupleInLocalRel() looks better as there is no need to call
> FindDeletedTupleInLocalRel itself if pre-conditions are not met.
>
>
> 3)
> FindRecentlyDeletedTupleInfoSeq():
> + * during change applications. While this approach may be slow on large tables,
> + * it is considered acceptable because it is only used in rare conflict cases
> + * where the target row for an update cannot be found.
> + */
> Shall we extend the last line to:
> where the target row for an update cannot be found and index scan can
> not be used.
>
>
> 4)
> catalogs.sgml:
> +       If true, he detection of <xref linkend="conflict-update-deleted"/> is
> he --> the
>

5)
FindRecentlyDeletedTupleInfoSeq():

+ /* Get the cutoff xmin for HeapTupleSatisfiesVacuum */
+ oldestxmin = GetOldestNonRemovableTransactionId(rel);

Another point is which xid should be used as the threshold in
HeapTupleSatisfiesVacuum() to decide whether a tuple is DEAD or RECENTLY_DEAD
for the update_deleted case. Currently we are using
GetOldestNonRemovableTransactionId(), but the xid returned here could
be older than pg_conflict_detection's xmin in the presence of other
logical slots that have an older effective_xmin. Shall we use
pg_conflict_detection's xmin as the threshold instead, or the worker's
oldest_nonremovable_xid?

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Monday, July 28, 2025 7:43 PM shveta malik <shveta.malik@gmail.com> wrote:
> On Mon, Jul 28, 2025 at 4:38 PM shveta malik <shveta.malik@gmail.com>
> wrote:
> >
> > On Fri, Jul 25, 2025 at 4:38 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > >
> > > The V53-0001 also includes Shveta's comments in [1].
> > >
> >
> > Thanks, I have not yet completed the review, but please find a few
> > comments on 001:
> >
> > 1)
> > IsIndexUsableForFindingDeletedTuple()
> > We first have:
> > + /*
> > + * A frozen transaction ID indicates that it must fall behind the
> > + conflict
> > + * detection slot.xmin.
> > + */
> > + if (HeapTupleHeaderXminFrozen(index_tuple->t_data))
> > + return true;
> >
> > thent his:
> > + index_xmin = HeapTupleHeaderGetRawXmin(index_tuple->t_data);
> >
> > Shall we use HeapTupleHeaderGetXmin() instead of above 2? We can check
> > if xid returned by HeapTupleHeaderGetXmin() is FrozenTransactionId or
> > normal one and then do further processing.

I have simplified the code to avoid unnecessary checks and added generic
error handling and proper resource release, which was overlooked in the
previous version.

> >
> >
> > 2)
> > Both FindRecentlyDeletedTupleInfoByIndex and
> > FindRecentlyDeletedTupleInfoSeq() has:
> > + /* Exit early if the commit timestamp data is not available */ if
> > + (!track_commit_timestamp) return false;
> >
> > We shall either move this check to FindDeletedTupleInLocalRel() or in
> > the caller of FindDeletedTupleInLocalRel() where we check
> > 'MySubscription->retaindeadtuples'. Moving to the caller of
> > FindDeletedTupleInLocalRel() looks better as there is no need to call
> > FindDeletedTupleInLocalRel itself if pre-conditions are not met.

Moved.

> >
> >
> > 3)
> > FindRecentlyDeletedTupleInfoSeq():
> > + * during change applications. While this approach may be slow on
> > + large tables,
> > + * it is considered acceptable because it is only used in rare
> > + conflict cases
> > + * where the target row for an update cannot be found.
> > + */
> > Shall we extend the last line to:
> > where the target row for an update cannot be found and index scan can
> > not be used.

Changed.

> >
> >
> > 4)
> > catalogs.sgml:
> > +       If true, he detection of <xref
> > + linkend="conflict-update-deleted"/> is
> > he --> the
> >

Fixed.

> 
> 5)
> FindRecentlyDeletedTupleInfoSeq():
> 
> + /* Get the cutoff xmin for HeapTupleSatisfiesVacuum */ oldestxmin =
> + GetOldestNonRemovableTransactionId(rel);
> 
> Another point is which xid should be used as threshold in
> HeapTupleSatisfiesVacuum() to decide if tuple is DEAD or RECENTLY-DEAD
> for update_deleted case? Currently we are using
> GetOldestNonRemovableTransactionId() but the xid returned here could be
> older than pg_conflict_detection's xmin in presence of other logical slots which
> have older effective_xmin. Shall we use pg_conflict_detection's xmin instead
> as threshold or worker's oldest_nonremovable_xid?

I thought both approaches are fine in terms of correctness, because even if the
apply worker uses the outdated xmin to determine the recently dead tuples, the
user can still compare the timestamps to choose the correct resolution.

OTOH, I agree that it would be better to use a stricter rule, so I changed it to
use the apply worker's oldest_nonremovable_xid to determine recently dead tuples.

To be consistent, I also used this xid instead of the conflict detection slot.xmin
to determine whether an index is usable or not.

This is the V54 patch set, with only patch 0001 updated to address the latest
comments.

Patch 0002 includes separate code for statistics but must be applied together
with 0001 to pass the regression tests. This is because the current
implementation assumes the number of conflict types in the statistics matches
the general ConflictType enum; otherwise, it may access an invalid memory area
during stats collection.
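
To illustrate the coupling (a rough sketch with a hypothetical stats struct,
assuming the CONFLICT_NUM_TYPES macro from conflict.h), the stats code indexes
its counters directly by the enum value, so the two definitions must stay in
sync:

#include "postgres.h"
#include "replication/conflict.h"

/* hypothetical per-subscription counters, sized directly from the enum */
typedef struct SubConflictCounts
{
	int64		counts[CONFLICT_NUM_TYPES];
} SubConflictCounts;

static inline void
count_conflict(SubConflictCounts *stats, ConflictType type)
{
	/* indexing by the enum value is what ties the two definitions together */
	Assert(type < CONFLICT_NUM_TYPES);
	stats->counts[type]++;
}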

Best Regards,
Hou zj

Attachments

RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Monday, July 28, 2025 5:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Jul 25, 2025 at 4:38 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > Right, I think it makes sense to do with the index scan when the
> > index's xmin is less than the conflict detection xmin, as that can
> > ensure that all the tuples deleted before the index creation or
> > re-indexing are irrelevant for conflict detection.
> >
> > I have implemented in the V53 patch set and improved the test to
> > verify both index and seq scan for dead tuples.
> >
> 
> Thanks. Following are a few comments on 0001 patch:
> 
> 1.
> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -1397,6 +1397,7 @@ CREATE VIEW pg_stat_subscription_stats AS
>          ss.apply_error_count,
>          ss.sync_error_count,
>          ss.confl_insert_exists,
> +        ss.confl_update_deleted,
> …
> Datum
>  pg_stat_get_subscription_stats(PG_FUNCTION_ARGS)
>  {
> -#define PG_STAT_GET_SUBSCRIPTION_STATS_COLS 11
> +#define PG_STAT_GET_SUBSCRIPTION_STATS_COLS 12
> 
> Can we consider splitting stats into a separate patch? It will help us to first
> focus on core functionality of detecting update_delete conflict.

OK, split.

> 
> 2.
> While this approach may be slow on large tables,
> + * it is considered acceptable because it is only used in rare conflict
> + cases
> + * where the target row for an update cannot be found.
> 
> Here we should add at end "and no usable index is found"

Added.

> 
> 3.
> + * We scan all matching dead tuples in the relation to find the most
> + recently
> + * deleted one, rather than stopping at the first match. This is
> + because only
> + * the latest deletion information is relevant for resolving conflicts.
> + * Returning solely the first, potentially outdated tuple can lead
> + users to
> + * mistakenly apply remote changes using a last-update-win strategy,
> even when a
> + * more recent deleted tuple is available. See comments atop worker.c
> + for
> + * details.
> 
> I think we can share a short example of cases when this can happen.

Added the comments.

> And probably a test which will fail if the user only fetches the first dead tuple?

I am still thinking about how to write a test that can distinguish two different
dead tuples and will add one in the next version if possible.

> 
> 4.
> executor\execReplication.c(671) : warning C4700: uninitialized local variable
> 'eq' used
> 
> Please fix this warning.

Fixed. Sorry for the miss.

> 
> 5.
> + /* Build scan key. */
> + skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
> +
> + /* Start an index scan. */
> + scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff,
> + 0);
> 
> While scanning with SnapshotAny, isn't it possible that we find some tuple for
> which the xact is still not committed or are inserted successfully just before the
> scan is started?
> 
> I think such tuples shouldn't be considered for giving update_deleted.
> It seems the patch will handle it later during
> update_recent_dead_tuple_info() where it uses following check: "if
> (HeapTupleSatisfiesVacuum(tuple, oldestxmin, buf) ==
> HEAPTUPLE_RECENTLY_DEAD)", is my understanding correct? If so, we
> should add some comments for it.

Added.

> 
> 6.
> FindRecentlyDeletedTupleInfoSeq()
> {
> …
> + /* Get the index column bitmap for tuples_equal */ indexbitmap =
> + RelationGetIndexAttrBitmap(rel, INDEX_ATTR_BITMAP_IDENTITY_KEY);
> +
> + /* fallback to PK if no replica identity */ if (!indexbitmap)
> + indexbitmap = RelationGetIndexAttrBitmap(rel,
> + INDEX_ATTR_BITMAP_PRIMARY_KEY);
> …
> ...
> + if (!tuples_equal(scanslot, searchslot, eq, indexbitmap)) continue;
> 
> We don't do any such thing in RelationFindReplTupleSeq(), so, if we do
> something differently here, it should be explained in the comments.

Added.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Tue, Jul 29, 2025 at 10:51 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> This is the V54 patch set, with only patch 0001 updated to address the latest
> comments.
>

Thanks for the patch.

While performing tests on the latest patch, I found an assert failure in
the tablesync worker in FindDeletedTupleInLocalRel (see
Assert(TransactionIdIsValid(oldestxmin))). Logs at [1].

It seems the table sync worker is trying to apply changes and reaches the
update_deleted conflict detection path, but is not able to find
MyLogicalRepWorker->oldest_nonremovable_xid, as this xid is set only
for the apply worker.

[1]:
TRAP: failed Assert("TransactionIdIsValid(oldestxmin)"), File:
"worker.c", Line: 3237, PID: 96003
postgres: logical replication tablesync worker for subscription 16394
sync 16469 (ExceptionalCondition+0xbb)[0x619c4117031b]
postgres: logical replication tablesync worker for subscription 16394
sync 16469 (+0x65f4ca)[0x619c40ea94ca]
postgres: logical replication tablesync worker for subscription 16394
sync 16469 (+0x65fa23)[0x619c40ea9a23]
postgres: logical replication tablesync worker for subscription 16394
sync 16469 (+0x65e860)[0x619c40ea8860]
postgres: logical replication tablesync worker for subscription 16394
sync 16469 (apply_dispatch+0xa2)[0x619c40eaa725]
postgres: logical replication tablesync worker for subscription 16394
sync 16469 (+0x660db9)[0x619c40eaadb9]
postgres: logical replication tablesync worker for subscription 16394
sync 16469 (start_apply+0x81)[0x619c40eace64]
postgres: logical replication tablesync worker for subscription 16394
sync 16469 (+0x659f1f)[0x619c40ea3f1f]
postgres: logical replication tablesync worker for subscription 16394
sync 16469 (TablesyncWorkerMain+0x33)[0x619c40ea3f69]


thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Jul 29, 2025 at 10:51 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> This is the V54 patch set, with only patch 0001 updated to address the latest
> comments.
>

Few minor comments:
1.
/* The row to be updated was deleted by a different origin */
CT_UPDATE_DELETED,
/* The row to be updated was modified by a different origin */
CT_UPDATE_ORIGIN_DIFFERS,
/* The updated row value violates unique constraint */
CT_UPDATE_EXISTS,
/* The row to be updated is missing */
CT_UPDATE_MISSING,

Is there a reason to keep CT_UPDATE_DELETED before
CT_UPDATE_ORIGIN_DIFFERS? I mean why not keep it just before
CT_UPDATE_MISSING on the grounds that they are always handled
together?

2. Will it be better to name FindRecentlyDeletedTupleInfoByIndex as
RelationFindDeletedTupleInfoByIndex to make it similar to existing
function RelationFindReplTupleByIndex? If you agree then make a
similar change for FindRecentlyDeletedTupleInfoSeq as well.

Apart from above, please find a number of comment edits and other
cosmetic changes in the attached.

--
With Regards,
Amit Kapila.

Attachments

RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, July 31, 2025 5:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Tue, Jul 29, 2025 at 10:51 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > This is the V54 patch set, with only patch 0001 updated to address the
> > latest comments.
> >
> 
> Few minor comments:

Thanks for the comments.

> 1.
> /* The row to be updated was deleted by a different origin */
> CT_UPDATE_DELETED,
> /* The row to be updated was modified by a different origin */
> CT_UPDATE_ORIGIN_DIFFERS,
> /* The updated row value violates unique constraint */ CT_UPDATE_EXISTS,
> /* The row to be updated is missing */
> CT_UPDATE_MISSING,
> 
> Is there a reason to keep CT_UPDATE_DELETED before
> CT_UPDATE_ORIGIN_DIFFERS? I mean why not keep it just before
> CT_UPDATE_MISSING on the grounds that they are always handled together?

I agree that it makes more sense to put it before update_missing, and changed it.

> 
> 2. Will it be better to name FindRecentlyDeletedTupleInfoByIndex as
> RelationFindDeletedTupleInfoByIndex to make it similar to existing function
> RelationFindReplTupleByIndex? If you agree then make a similar change for
> FindRecentlyDeletedTupleInfoSeq as well.

Yes, the suggested name looks better.

> 
> Apart from above, please find a number of comment edits and other cosmetic
> changes in the attached.

Thanks, I have addressed the above comments and merged the patch into 0001.

Here is V55 patch set.

Best Regards,
Hou zj

Attachments

RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, July 31, 2025 5:26 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Tue, Jul 29, 2025 at 10:51 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> >
> > This is the V54 patch set, with only patch 0001 updated to address the
> > latest comments.
> >
> 
> Thanks for the patch.
> 
> While performing tests on the latest patch, I found an assert in tablesync
> worker in FindDeletedTupleInLocalRel (see
> Assert(TransactionIdIsValid(oldestxmin))). Logs at [1].
> 
> It seems table sync worker is trying to apply changes and going to
> update-deleted conflict detection patch but is not able to find
> MyLogicalRepWorker->oldest_nonremovable_xid as this xid is set only
> for apply-worker.

Thanks for reporting. I have fixed it by referring to the conflict detection
slot's xmin instead of the leader worker's oldest_nonremovable_xid. This should
be safe because the slot.xmin is always valid.
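
Roughly, the threshold is now obtained as in the sketch below (the helper name
is made up and error handling is omitted):

#include "postgres.h"
#include "access/transam.h"
#include "replication/slot.h"
#include "storage/spin.h"

/*
 * Illustrative helper: read the conflict-detection slot's xmin to use as the
 * threshold.  CONFLICT_DETECTION_SLOT is the name of the physical slot that
 * is created when retain_dead_tuples is enabled.
 */
static TransactionId
get_conflict_detection_xmin(void)
{
	ReplicationSlot *slot;
	TransactionId xmin = InvalidTransactionId;

	slot = SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT, true);
	if (slot != NULL)
	{
		SpinLockAcquire(&slot->mutex);
		xmin = slot->data.xmin;
		SpinLockRelease(&slot->mutex);
	}

	return xmin;
}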

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Thu, Jul 31, 2025 at 3:49 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, July 31, 2025 5:26 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Jul 29, 2025 at 10:51 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > >
> > > This is the V54 patch set, with only patch 0001 updated to address the
> > > latest comments.
> > >
> >
> > Thanks for the patch.
> >
> > While performing tests on the latest patch, I found an assert in tablesync
> > worker in FindDeletedTupleInLocalRel (see
> > Assert(TransactionIdIsValid(oldestxmin))). Logs at [1].
> >
> > It seems table sync worker is trying to apply changes and going to
> > update-deleted conflict detection patch but is not able to find
> > MyLogicalRepWorker->oldest_nonremovable_xid as this xid is set only
> > for apply-worker.
>
> Thanks for reporting. I have fixed it by referring to conflict detection slot's
> xmin instead of the leader worker's oldest_nonremovable_xid. This should
> be safe because the slot.xmin is always valid.
>

Thanks for fixing. In the same context, the comment below still
mentions oldest_nonremovable_xid; it can be corrected.

+  /*
+   * No need to check for a frozen transaction ID, as
+   * TransactionIdPrecedes() manages it internally, treating it as falling
+   * behind the oldest_nonremovable_xid.
+   */

~

Also we may mention 'concurrently deleted' in the comment below as
that makes more sense.

/* The row to be updated was deleted by a different origin */
  CT_UPDATE_DELETED,

~

Apart from these trivial comments changes, patch001 and patch002 look
good to me.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Thu, Jul 31, 2025 at 3:49 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, July 31, 2025 5:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jul 29, 2025 at 10:51 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > This is the V54 patch set, with only patch 0001 updated to address the
> > > latest comments.
> > >
> >
> > Few minor comments:
>
> Thanks for the comments.
>
> > 1.
> > /* The row to be updated was deleted by a different origin */
> > CT_UPDATE_DELETED,
> > /* The row to be updated was modified by a different origin */
> > CT_UPDATE_ORIGIN_DIFFERS,
> > /* The updated row value violates unique constraint */ CT_UPDATE_EXISTS,
> > /* The row to be updated is missing */
> > CT_UPDATE_MISSING,
> >
> > Is there a reason to keep CT_UPDATE_DELETED before
> > CT_UPDATE_ORIGIN_DIFFERS? I mean why not keep it just before
> > CT_UPDATE_MISSING on the grounds that they are always handled together?
>
> I agree that it makes more sense to put it before update_missing, and changed it.
>
> >
> > 2. Will it be better to name FindRecentlyDeletedTupleInfoByIndex as
> > RelationFindDeletedTupleInfoByIndex to make it similar to existing function
> > RelationFindReplTupleByIndex? If you agree then make a similar change for
> > FindRecentlyDeletedTupleInfoSeq as well.
>
> Yes, the suggested name looks better.
>
> >
> > Apart from above, please find a number of comment edits and other cosmetic
> > changes in the attached.
>
> Thanks, I have addressed above comments and merge the patch into 0001.

I have few comments in 0001

1.
+   /* Select the dead tuple with the most recent commit timestamp */
+   if (TransactionIdGetCommitTsData(xmax, &localts, &localorigin) &&
+       (TimestampDifferenceExceeds(*delete_time, localts, 0) ||
+        *delete_time == 0))

IMHO the "*delete_time == 0" is a redundant check, because if
*delete_time is 0 then TimestampDifferenceExceeds will always be true
as localts can not be 0.

2.

+          If set to <literal>true</literal>, the detection of
+          <xref linkend="conflict-update-deleted"/> is enabled, and a physical
+          replication slot named
<quote><literal>pg_conflict_detection</literal></quote>
           created on the subscriber to prevent the conflict information from
           being removed.

"to prevent the conflict information from being removed." should be rewritten as
"to prevent removal of tuple required for conflict detection"

3.
+   /* Return if the commit timestamp data is not available */
+   if (!track_commit_timestamp)
+       return false;

Shouldn't the caller take care of this? I mean, if 'retaindeadtuples' and
'track_commit_timestamp' are not set, then the caller shouldn't even call
this function.

4.
+   /*
+    * Instead of invoking GetOldestNonRemovableTransactionId() for conflict
+    * detection, we use the conflict detection slot.xmin. This value will be
+    * greater than or equal to the other threshold and provides a more direct
+    * and efficient way to identify recently deleted dead tuples relevant to
+    * the conflict detection. The oldest_nonremovable_xid is not used here,
+    * as it is maintained only by the leader apply worker and unavailable to
+    * table sync and parallel apply workers.
+    */
+   slot = SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT, true);

This comment seems a bit confusing to me. Isn't it actually correct to
just use the "conflict detection slot.xmin" even without any other
reasoning?


--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Aug 1, 2025 at 3:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> 4.
> +   /*
> +    * Instead of invoking GetOldestNonRemovableTransactionId() for conflict
> +    * detection, we use the conflict detection slot.xmin. This value will be
> +    * greater than or equal to the other threshold and provides a more direct
> +    * and efficient way to identify recently deleted dead tuples relevant to
> +    * the conflict detection. The oldest_nonremovable_xid is not used here,
> +    * as it is maintained only by the leader apply worker and unavailable to
> +    * table sync and parallel apply workers.
> +    */
> +   slot = SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT, true);
>
> This comment seems a bit confusing to me, Isn't it actually correct to
> just use the "conflict detection slot.xmin" even without any other
> reasoning?
>

But it is *not* wrong to use even GetOldestNonRemovableTransactionId(),
because it will anyway consider the conflict detection slot's xmin.
However, the value returned by that function could be much older, so
the slot's xmin is a better choice. Similarly, it would be sufficient to use
the apply worker's oldest_nonremovable_xid value, and that would ideally be
better than the slot's xmin because it could give update_deleted in fewer
cases; however, we can't use that for the reasons mentioned in the
comments. Do you think this comment needs improvement for clarity, and
if so, do you have any proposal?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Fri, Aug 1, 2025 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 1, 2025 at 3:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > 4.
> > +   /*
> > +    * Instead of invoking GetOldestNonRemovableTransactionId() for conflict
> > +    * detection, we use the conflict detection slot.xmin. This value will be
> > +    * greater than or equal to the other threshold and provides a more direct
> > +    * and efficient way to identify recently deleted dead tuples relevant to
> > +    * the conflict detection. The oldest_nonremovable_xid is not used here,
> > +    * as it is maintained only by the leader apply worker and unavailable to
> > +    * table sync and parallel apply workers.
> > +    */
> > +   slot = SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT, true);
> >
> > This comment seems a bit confusing to me, Isn't it actually correct to
> > just use the "conflict detection slot.xmin" even without any other
> > reasoning?
> >
>
> But it is *not* wrong to use even GetOldestNonRemovableTransactionId()
> because it will anyway consider conflict detection slot's xmin.
> However, the value returned by that function could be much older, so
> slot's xmin is a better choice. Similarly, it is sufficient to use
> oldest_nonremovable_xid value of apply worker and ideally would be
> better than slot's xmin because it could give update_deleted in fewer
> cases, however, we can't use that because of reasons mentioned in the
> comments. Do you think this comment needs improvement for clarity and
> if so, do you have any proposal?
>

How about something like:
/*
* For conflict detection, we use the conflict slot's xmin value instead of
* invoking GetOldestNonRemovableTransactionId(). The slot.xmin acts as a
* threshold to identify tuples that were recently deleted. These tuples are
* not visible to concurrent transactions, but we log an update_deleted conflict
* if such a tuple matches the remote update being applied.
*
* Although GetOldestNonRemovableTransactionId() can return a value older than
* the slot's xmin, for our current purpose it is acceptable to treat tuples
* deleted by transactions prior to slot.xmin as update_missing conflicts.
*
* Ideally, we would use oldest_nonremovable_xid, which is directly maintained
* by the leader apply worker. However, this value is not available to table
* synchronization or parallel apply workers, making slot.xmin a practical
* alternative in those contexts.
*/

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:

> -----Original Message-----
> From: Dilip Kumar <dilipbalaut@gmail.com>
> Sent: Friday, August 1, 2025 6:29 PM
> To: Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com>
> Cc: Amit Kapila <amit.kapila16@gmail.com>; shveta malik
> <shveta.malik@gmail.com>; Kuroda, Hayato/黒田 隼人
> <kuroda.hayato@fujitsu.com>; pgsql-hackers
> <pgsql-hackers@postgresql.org>; vignesh C <vignesh21@gmail.com>; Nisha
> Moond <nisha.moond412@gmail.com>; Masahiko Sawada
> <sawada.mshk@gmail.com>
> Subject: Re: Conflict detection for update_deleted in logical replication
> 
> On Thu, Jul 31, 2025 at 3:49 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Thursday, July 31, 2025 5:29 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Tue, Jul 29, 2025 at 10:51 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com>
> > > wrote:
> > > >
> > > > This is the V54 patch set, with only patch 0001 updated to address
> > > > the latest comments.
> > > >
> > >
> > > Few minor comments:
> >
> > Thanks for the comments.
> >
> > > 1.
> > > /* The row to be updated was deleted by a different origin */
> > > CT_UPDATE_DELETED,
> > > /* The row to be updated was modified by a different origin */
> > > CT_UPDATE_ORIGIN_DIFFERS,
> > > /* The updated row value violates unique constraint */
> > > CT_UPDATE_EXISTS,
> > > /* The row to be updated is missing */ CT_UPDATE_MISSING,
> > >
> > > Is there a reason to keep CT_UPDATE_DELETED before
> > > CT_UPDATE_ORIGIN_DIFFERS? I mean why not keep it just before
> > > CT_UPDATE_MISSING on the grounds that they are always handled
> together?
> >
> > I agree that it makes more sense to put it before update_missing, and
> changed it.
> >
> > >
> > > 2. Will it be better to name FindRecentlyDeletedTupleInfoByIndex as
> > > RelationFindDeletedTupleInfoByIndex to make it similar to existing
> > > function RelationFindReplTupleByIndex? If you agree then make a
> > > similar change for FindRecentlyDeletedTupleInfoSeq as well.
> >
> > Yes, the suggested name looks better.
> >
> > >
> > > Apart from above, please find a number of comment edits and other
> > > cosmetic changes in the attached.
> >
> > Thanks, I have addressed above comments and merge the patch into 0001.
> 
> I have few comments in 0001
> 

Thanks for the comments!


> 
> 2.
> 
> +          If set to <literal>true</literal>, the detection of
> +          <xref linkend="conflict-update-deleted"/> is enabled, and a
> physical
> +          replication slot named
> <quote><literal>pg_conflict_detection</literal></quote>
>            created on the subscriber to prevent the conflict information from
>            being removed.
> 
> "to prevent the conflict information from being removed." should be rewritten
> as "to prevent removal of tuple required for conflict detection"

It appears the documentation you commented on is already committed. I think the
intention was to make a general statement that neither dead tuples nor commit
timestamp data would be removed.

> 
> 3.
> +   /* Return if the commit timestamp data is not available */
> +   if (!track_commit_timestamp)
> +       return false;
> 
> Shouldn't caller should take care of this?  I mean if the 'retaindeadtuples' and
> 'track_commit_timestamp' is not set then caller shouldn't even call this
> function.

I feel moving the checks into a single central function would streamline the
caller, reducing code duplication. So, maybe we could move the retaindeadtuples
check into this function as well for consistency. Thoughts?
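
A minimal sketch of such a combined guard, assuming the existing globals (the
wrapper name is hypothetical):

#include "postgres.h"
#include "access/commit_ts.h"
#include "catalog/pg_subscription.h"
#include "replication/worker_internal.h"

/*
 * Hypothetical wrapper: both preconditions in one place, so callers of the
 * dead-tuple lookup don't need to repeat them.
 */
static bool
update_deleted_detection_possible(void)
{
	return MySubscription->retaindeadtuples && track_commit_timestamp;
}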

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Fri, Aug 1, 2025 at 4:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Aug 1, 2025 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Aug 1, 2025 at 3:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > 4.
> > > +   /*
> > > +    * Instead of invoking GetOldestNonRemovableTransactionId() for conflict
> > > +    * detection, we use the conflict detection slot.xmin. This value will be
> > > +    * greater than or equal to the other threshold and provides a more direct
> > > +    * and efficient way to identify recently deleted dead tuples relevant to
> > > +    * the conflict detection. The oldest_nonremovable_xid is not used here,
> > > +    * as it is maintained only by the leader apply worker and unavailable to
> > > +    * table sync and parallel apply workers.
> > > +    */
> > > +   slot = SearchNamedReplicationSlot(CONFLICT_DETECTION_SLOT, true);
> > >
> > > This comment seems a bit confusing to me, Isn't it actually correct to
> > > just use the "conflict detection slot.xmin" even without any other
> > > reasoning?
> > >
> >
> > But it is *not* wrong to use even GetOldestNonRemovableTransactionId()
> > because it will anyway consider conflict detection slot's xmin.
> > However, the value returned by that function could be much older, so
> > slot's xmin is a better choice. Similarly, it is sufficient to use
> > oldest_nonremovable_xid value of apply worker and ideally would be
> > better than slot's xmin because it could give update_deleted in fewer
> > cases, however, we can't use that because of reasons mentioned in the
> > comments.

Got it. It makes sense to mention the other possibility and why we chose slot.xmin.

> >
>
> How about something like:
> /*
> * For conflict detection, we use the conflict slot's xmin value instead of
> * invoking GetOldestNonRemovableTransactionId(). The slot.xmin acts as a
> * threshold to identify tuples that were recently deleted. These tuples are
> * not visible to concurrent transactions, but we log an update_deleted conflict
> * if such a tuple matches the remote update being applied.
> *
> * Although GetOldestNonRemovableTransactionId() can return a value older than
> * the slot's xmin, for our current purpose it is acceptable to treat tuples
> * deleted by transactions prior to slot.xmin as update_missing conflicts.
> *
> * Ideally, we would use oldest_nonremovable_xid, which is directly maintained
> * by the leader apply worker. However, this value is not available to table
> * synchronization or parallel apply workers, making slot.xmin a practical
> * alternative in those contexts.
> */

I think this is much better.

--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Fri, Aug 1, 2025 at 5:02 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> > 2.
> >
> > +          If set to <literal>true</literal>, the detection of
> > +          <xref linkend="conflict-update-deleted"/> is enabled, and a
> > physical
> > +          replication slot named
> > <quote><literal>pg_conflict_detection</literal></quote>
> >            created on the subscriber to prevent the conflict information from
> >            being removed.
> >
> > "to prevent the conflict information from being removed." should be rewritten
> > as "to prevent removal of tuple required for conflict detection"
>
> It appears the document you commented is already committed. I think the
> intention was to make a general statement that neither dead tuples nor commit
> timestamp data would be removed.

Okay, got it. So instead of "conflict information", should we say
"information for detecting conflicts" or "conflict detection
information"? "Conflict information" sounds like we want to preserve
information about a conflict that has already happened, whereas we are
actually preserving information that is required for detecting the
conflict. Does this make sense?

I know this is already committed, but it is part of the whole
patch set, so we can always improve it.

> >
> > 3.
> > +   /* Return if the commit timestamp data is not available */
> > +   if (!track_commit_timestamp)
> > +       return false;
> >
> > Shouldn't caller should take care of this?  I mean if the 'retaindeadtuples' and
> > 'track_commit_timestamp' is not set then caller shouldn't even call this
> > function.
>
> I feel moving the checks into a single central function would streamline the
> caller, reducing code duplication. So, maybe we could move the retaindeadtuple
> check into this function as well for consistency. Thoughts ?

Fine with either way; actually, I wanted both the 'retaindeadtuple' and
'track_commit_timestamp' checks in the same place.

--
Regards,
Dilip Kumar
Google



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, August 1, 2025 7:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 
> On Fri, Aug 1, 2025 at 5:02 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > > 2.
> > >
> > > +          If set to <literal>true</literal>, the detection of
> > > +          <xref linkend="conflict-update-deleted"/> is enabled, and
> > > + a
> > > physical
> > > +          replication slot named
> > > <quote><literal>pg_conflict_detection</literal></quote>
> > >            created on the subscriber to prevent the conflict information
> from
> > >            being removed.
> > >
> > > "to prevent the conflict information from being removed." should be
> > > rewritten as "to prevent removal of tuple required for conflict detection"
> >
> > It appears the document you commented is already committed. I think
> > the intention was to make a general statement that neither dead tuples
> > nor commit timestamp data would be removed.
> 
> Okay got it, so instead of "conflict information" should we say "information for
> detecting conflicts" or "conflict detection information", conflict information
> looks like we want to prevent the information about the conflict which has
> already happened, instead we are preventing information which are required
> for detecting the conflict, does this make sense?

It makes sense to me, so changed.

> 
> I know this is already committed, but actually this is part of the whole patch set
> so we can always improvise it.
> 
> > >
> > > 3.
> > > +   /* Return if the commit timestamp data is not available */
> > > +   if (!track_commit_timestamp)
> > > +       return false;
> > >
> > > Shouldn't caller should take care of this?  I mean if the
> > > 'retaindeadtuples' and 'track_commit_timestamp' is not set then
> > > caller shouldn't even call this function.
> >
> > I feel moving the checks into a single central function would
> > streamline the caller, reducing code duplication. So, maybe we could
> > move the retaindeadtuple check into this function as well for consistency.
> Thoughts ?
> 
> Fine with either way, actually I wanted both the check 'retaindeadtuple' and
> 'track_commit_timestamp' at the same place.

Thanks for confirming. Here is V56 patch set which addressed all the
comments including the comments from Amit[1] and Shveta[2].

I have merged V55-0002 into 0001 and updated the list of author
and reviewers based on my knowledge.

[1] https://www.postgresql.org/message-id/CAA4eK1%2B2tZ0rGowwpfmPQA03KdBOaeaK6D5omBN76UTP2EPx6w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAJpy0uDNhP%2BQeH-zGLBgMnRY1JZGVeoZ_dxff5S6HmpnRcWk8A%40mail.gmail.com

Best Regards,
Hou zj

Attachments

Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Fri, Aug 1, 2025 at 9:16 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, August 1, 2025 7:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Aug 1, 2025 at 5:02 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > > 2.
> > > >
> > > > +          If set to <literal>true</literal>, the detection of
> > > > +          <xref linkend="conflict-update-deleted"/> is enabled, and
> > > > + a
> > > > physical
> > > > +          replication slot named
> > > > <quote><literal>pg_conflict_detection</literal></quote>
> > > >            created on the subscriber to prevent the conflict information
> > from
> > > >            being removed.
> > > >
> > > > "to prevent the conflict information from being removed." should be
> > > > rewritten as "to prevent removal of tuple required for conflict detection"
> > >
> > > It appears the document you commented is already committed. I think
> > > the intention was to make a general statement that neither dead tuples
> > > nor commit timestamp data would be removed.
> >
> > Okay got it, so instead of "conflict information" should we say "information for
> > detecting conflicts" or "conflict detection information", conflict information
> > looks like we want to prevent the information about the conflict which has
> > already happened, instead we are preventing information which are required
> > for detecting the conflict, does this make sense?
>
> It makes sense to me, so changed.
>
> >
> > I know this is already committed, but actually this is part of the whole patch set
> > so we can always improvise it.
> >
> > > >
> > > > 3.
> > > > +   /* Return if the commit timestamp data is not available */
> > > > +   if (!track_commit_timestamp)
> > > > +       return false;
> > > >
> > > > Shouldn't caller should take care of this?  I mean if the
> > > > 'retaindeadtuples' and 'track_commit_timestamp' is not set then
> > > > caller shouldn't even call this function.
> > >
> > > I feel moving the checks into a single central function would
> > > streamline the caller, reducing code duplication. So, maybe we could
> > > move the retaindeadtuple check into this function as well for consistency.
> > Thoughts ?
> >
> > Fine with either way, actually I wanted both the check 'retaindeadtuple' and
> > 'track_commit_timestamp' at the same place.
>
> Thanks for confirming. Here is V56 patch set which addressed all the
> comments including the comments from Amit[1] and Shveta[2].
>
> I have merged V55-0002 into 0001 and updated the list of author
> and reviewers based on my knowledge.

Now this LGTM. I would suggest some modification to the commit
message: the purpose of this improvement is not very clear to me. It
says "supports detecting update_deleted conflicts during update
operations", but I think we can be clearer about when we want to
detect such conflicts. Here is my proposal, feel free to edit/revise:

"This commit improves conflict detection for updates. When an update
arrives for a row that has been locally deleted, the subscriber now
performs a secondary scan to check for the recently deleted tuple.
This check uses the old column values from remote tuple to find
a match among tuples that are visible to 'SnapshotAny' but not yet
vacuumed away.

If a matching, deleted tuple is found, an update_deleted conflict is
raised.  This provides a more robust and accurate conflict resolution
process, preventing unexpected behavior by correctly identifying cases
where a remote update clashes with a local deletion."
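
A minimal sketch of the scenario this targets, showing one direction of
a bidirectional pair (the subscription, publication, and connection
names are illustrative; the exact conflict log wording depends on the
committed patch, and track_commit_timestamp must also be enabled for
conflict detection):

    -- On each node: the replicated table.
    CREATE TABLE t (id int PRIMARY KEY, value int);

    -- On node A, subscribing to node B, with dead-tuple retention enabled
    -- so that a locally deleted row stays visible to the SnapshotAny scan.
    CREATE SUBSCRIPTION sub_a_from_b
        CONNECTION 'host=node_b dbname=postgres'
        PUBLICATION pub_b
        WITH (retain_dead_tuples = true);

    -- Node A:                 DELETE FROM t WHERE id = 1;
    -- Node B (concurrently):  UPDATE t SET value = 2 WHERE id = 1;
    --
    -- When node A's apply worker receives the UPDATE after the local
    -- DELETE, the secondary scan finds the matching dead tuple and an
    -- update_deleted conflict is logged instead of update_missing.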

--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Fri, Aug 1, 2025 at 9:16 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Thanks for confirming. Here is V56 patch set which addressed all the
> comments including the comments from Amit[1] and Shveta[2].
>
> I have merged V55-0002 into 0001 and updated the list of author
> and reviewers based on my knowledge.
>

Thank You Hou-San for the patches. Please find a few initial comments on 002:


1)
src/sgml/system-views.sgml:
+         <para>
+          <literal>conflict_retention_exceeds_max_duration</literal> means that
+          the duration for retaining conflict information, which is used
+          in logical replication conflict detection, has exceeded the maximum
+          allowable limit. It is set only for the slot
+          <literal>pg_conflict_detection</literal>, which is created when
+          <link
linkend="sql-createsubscription-params-with-retain-dead-tuples"><literal>retain_dead_tuples</literal></link>
+          is enabled.
+         </para>

We can mention 'max_conflict_retention_duration' here i.e:
...has exceeded the maximum allowable limit of max_conflict_retention_duration.


2)
Shall we rename 'conflict_retention_exceeds_max_duration' as
'conflict_info_retention_exceeds_limit'? It is better to incorporate
'info' keyword, but then
'conflict_info_retention_exceeds_max_duration' becomes too long and
thus I suggest 'conflict_info_retention_exceeds_limit'. Thoughts?


3)
src/sgml/monitoring.sgml:
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>retain_dead_tuples</structfield> <type>boolean</type>
+      </para>
+      <para>
+       True if <link
linkend="sql-createsubscription-params-with-retain-dead-tuples"><literal>retain_dead_tuples</literal></link>
+       is enabled and the duration for which conflict information is
+       retained for conflict detection by this apply worker does not exceed
+       <link linkend="guc-max-conflict-retention-duration"><literal>max_conflict_retention_duration</literal></link>;
NULL for
+       parallel apply workers and table synchronization workers.
+      </para></entry>
+     </row>

a)
In the html file, the link does not take me to
'max_conflict_retention_duration' GUC. It takes to that page but to
some other location.

b)
+       the duration for which conflict information is
+       retained for conflict detection by this apply worker

Shall this be better:  'the duration for which information useful for
conflict detection is retained by this apply worker'

4)
src/sgml/config.sgml:

a)
+        Maximum duration for which each apply worker can request to retain the
+        information useful for conflict detection when
+        <literal>retain_dead_tuples</literal> is enabled for the associated
+        subscriptions.

Shall it be :
"Maximum duration for which each apply worker is allowed to retain.."
or "can retain"?

b)
src/sgml/config.sgml
+        subscriptions. The default value is <literal>0</literal>, indicating
+        that conflict information is retained until it is no longer needed for
+        detection purposes. If this value is specified without units, it is
+        taken as milliseconds.

'that conflict information is retained' --> 'that information useful
for conflict detection is retained'

c)
src/sgml/config.sgml
+        The replication slot
+        <quote><literal>pg_conflict_detection</literal></quote> that used to
+        retain conflict information will be invalidated if all apply workers
+        associated with the subscriptions, where

'that used' --> 'that is used'

5)
ApplyLauncherMain():
+ /*
+ * Stop the conflict information retention only if all workers
+ * for subscriptions with retain_dead_tuples enabled have
+ * requested it.
+ */
+ stop_retention &= sub->enabled;

This comment is not clear. By enabling or disabling subscription, how
can it request for 'stop or continue conflict info retention'?

Do you mean we can not 'invalidate the slot' and thus stop retention
if a sub with rdt=ON is disabled? If so, we can pair it up with the
previous comment itself where we have mentioned that we can not
advance xmin when sub is disabled, as that comment indicates a clear
reason too.

6)
Above brings me to a point that in doc, shall we mention that if a sub
with rdt=on is disabled, even 'max_conflict_retention_duration' is not
considered for other subs which have rdt=ON.

7)
Shall we rename 'max_conflict_retention_duration' to
'max_conflict_info_retention_duration' as the latter one is more
clear?

8)
+ * nonremovable xid. Similarly, stop the conflict information
+ * retention only if all workers for subscriptions with
+ * retain_dead_tuples enabled have requested it.

Shall we rephrase to:
 Similarly, can't stop the conflict information retention unless all
such workers are running.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Mon, Aug 4, 2025 at 11:46 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> 7)
> Shall we rename 'max_conflict_retention_duration' to
> 'max_conflict_info_retention_duration' as the latter one is more
> clear?
>

Before bikeshedding on the name of this option, I would like us to
once again consider whether we should provide it at the subscription
level or as a GUC.

The rationale behind considering it as a subscription option is that
different subscriptions may have different requirements for dead tuple
retention. For a particular subscription the workload may not always
be high, so even if the apply lag temporarily exceeds the new option's
value, things should recover on their own. In such a case users may
not want to configure max_conflict_retention_duration for that
subscription, which would otherwise stop detection of the
update_deleted conflict for it.

The other point is that it is only related to the retain_dead_tuples
option of the subscription, so providing this new option at the same
level would appear consistent.

I remember that previously Sawada-San advocated providing it as a
GUC[1], but I think the recent tests suggest that users should define
the pub-sub topology carefully when enabling the retain_dead_tuples
option, as mentioned in the docs[2], so it is worth considering
providing it at the subscription level.

Thoughts?
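
To make the trade-off concrete, a subscription-level option would allow
per-subscription settings such as the following sketch (syntax and the
millisecond value follow the patch set under discussion, not a released
server; connection strings are illustrative):

    -- Bursty, low-priority workload: keep update_deleted detection
    -- indefinitely, tolerating temporary apply lag.
    CREATE SUBSCRIPTION sub_low_traffic
        CONNECTION 'host=pub1 dbname=postgres'
        PUBLICATION pub1
        WITH (retain_dead_tuples = true);

    -- High-volume subscription where unbounded dead-tuple retention is
    -- unacceptable: stop retaining once the accumulated lag exceeds
    -- 10 minutes (value in milliseconds).
    CREATE SUBSCRIPTION sub_high_traffic
        CONNECTION 'host=pub2 dbname=postgres'
        PUBLICATION pub2
        WITH (retain_dead_tuples = true,
              max_conflict_retention_duration = 600000);

A GUC, by contrast, would impose a single limit on all such
subscriptions.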

[1] - https://www.postgresql.org/message-id/CAD21AoCbjVTjejQxBkyo9kop2HMw85wSJqpB%3DJapsSE%2BKw_iRg%40mail.gmail.com
[2] - https://www.postgresql.org/docs/devel/sql-createsubscription.html

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Monday, August 4, 2025 2:16 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Fri, Aug 1, 2025 at 9:16 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> >
> > Thanks for confirming. Here is V56 patch set which addressed all the
> > comments including the comments from Amit[1] and Shveta[2].
> >
> > I have merged V55-0002 into 0001 and updated the list of author and
> > reviewers based on my knowledge.
> >
> 
> Thank You Hou-San for the patches. Please find a few initial comments on 002:
> 
> 
> 1)
> src/sgml/system-views.sgml:
> +         <para>
> +          <literal>conflict_retention_exceeds_max_duration</literal>
> means that
> +          the duration for retaining conflict information, which is used
> +          in logical replication conflict detection, has exceeded the
> maximum
> +          allowable limit. It is set only for the slot
> +          <literal>pg_conflict_detection</literal>, which is created when
> +          <link
> linkend="sql-createsubscription-params-with-retain-dead-tuples"><literal
> >retain_dead_tuples</literal></link>
> +          is enabled.
> +         </para>
> 
> We can mention 'max_conflict_retention_duration' here i.e:
> ...has exceeded the maximum allowable limit of
> max_conflict_retention_duration.

Added.

> 
> 
> 2)
> Shall we rename 'conflict_retention_exceeds_max_duration' as
> 'conflict_info_retention_exceeds_limit'? It is better to incorporate 'info' keyword,
> but then 'conflict_info_retention_exceeds_max_duration' becomes too long
> and thus I suggest 'conflict_info_retention_exceeds_limit'. Thoughts?

I will think on this.

> 
> 3)
> src/sgml/monitoring.sgml:
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>retain_dead_tuples</structfield>
> <type>boolean</type>
> +      </para>
> +      <para>
> +       True if <link
> linkend="sql-createsubscription-params-with-retain-dead-tuples"><literal
> >retain_dead_tuples</literal></link>
> +       is enabled and the duration for which conflict information is
> +       retained for conflict detection by this apply worker does not exceed
> +       <link
> + linkend="guc-max-conflict-retention-duration"><literal>max_conflict_re
> + tention_duration</literal></link>;
> NULL for
> +       parallel apply workers and table synchronization workers.
> +      </para></entry>
> +     </row>
> 
> a)
> In the html file, the link does not take me to 'max_conflict_retention_duration'
> GUC. It takes to that page but to some other location.
> 
> b)
> +       the duration for which conflict information is
> +       retained for conflict detection by this apply worker
> 
> Shall this be better:  'the duration for which information useful for conflict
> detection is retained by this apply worker'

Changed.

> 
> 4)
> src/sgml/config.sgml:
> 
> a)
> +        Maximum duration for which each apply worker can request to retain
> the
> +        information useful for conflict detection when
> +        <literal>retain_dead_tuples</literal> is enabled for the associated
> +        subscriptions.
> 
> Shall it be :
> "Maximum duration for which each apply worker is allowed to retain.."
> or "can retain"?

Changed to "is allowed to"

> 
> b)
> src/sgml/config.sgml
> +        subscriptions. The default value is <literal>0</literal>, indicating
> +        that conflict information is retained until it is no longer needed for
> +        detection purposes. If this value is specified without units, it is
> +        taken as milliseconds.
> 
> 'that conflict information is retained' --> 'that information useful for conflict
> detection is retained'

I changed to "the information" because the nearby texts have already
mentioned the usage of this information.

> 
> c)
> src/sgml/config.sgml
> +        The replication slot
> +        <quote><literal>pg_conflict_detection</literal></quote> that
> used to
> +        retain conflict information will be invalidated if all apply workers
> +        associated with the subscriptions, where
> 
> 'that used' --> 'that is used'

Fixed.

> 
> 5)
> ApplyLauncherMain():
> + /*
> + * Stop the conflict information retention only if all workers
> + * for subscriptions with retain_dead_tuples enabled have
> + * requested it.
> + */
> + stop_retention &= sub->enabled;
> 
> This comment is not clear. By enabling or disabling subscription, how can it
> request for 'stop or continue conflict info retention'?
> 
> Do you mean we can not 'invalidate the slot' and thus stop retention if a sub
> with rdt=ON is disabled?

Yes, there is no apply worker for disabled
subscription, thus no way to request that.

> If so, we can pair it up with the previous comment
> itself where we have mentioned that we can not advance xmin when sub is
> disabled, as that comment indicates a clear reason too.

Changed.

> 
> 6)
> Above brings me to a point that in doc, shall we mention that if a sub with
> rdt=on is disabled, even 'max_conflict_retention_duration' is not considered
> for other subs which have rdt=ON.

I think the documentation specifies that only active apply workers can make such
requests, which appears sufficient to me.

> 
> 7)
> Shall we rename 'max_conflict_retention_duration' to
> 'max_conflict_info_retention_duration' as the latter one is more clear?

I will think on it.

> 
> 8)
> + * nonremovable xid. Similarly, stop the conflict information
> + * retention only if all workers for subscriptions with
> + * retain_dead_tuples enabled have requested it.
> 
> Shall we rephrase to:
>  Similarly, can't stop the conflict information retention unless all such workers
> are running.

Changed.

Here is V57 patch set which addressed most of comments.

In this version, I also fixed a bug where the apply worker continued to
find dead tuples even after it had already stopped retaining dead tuples.

Best Regards,
Hou zj



Attachments

RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, August 5, 2025 10:09 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> Here is V57 patch set which addressed most of comments.
> 
> In this version, I also fixed a bug that the apply worker continued to find dead
> tuples even if it has already stop retaining dead tuples.

Here is the V58 patch set, which improves a few things found during internal review:

0001:

* Remove the slot invalidation.

Initially, we thought it would be convenient for users to determine if they can
reliably detect update_deleted by checking the validity of the conflict
detection slot. However, after re-thinking, even if the slot is valid, it
doesn't guarantee that each apply worker can reliably detect conflicts. Some
apply workers might have stopped retention, yet the slot remains valid due to
other active workers continuing retention.

Instead of querying the slot, users should verify the ability of a specific
apply worker to reliably detect conflicts by checking the view
pg_stat_subscription.retain_dead_tuples.
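
For example (a sketch; the column name follows the V58 patch and is
renamed later in the thread):

    -- Per apply worker, check whether conflict-detection information is
    -- still being retained.
    SELECT subname, worker_type, retain_dead_tuples
    FROM pg_stat_subscription
    WHERE worker_type = 'apply';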

So, slot invalidation would not be necessary. We could set slot.xmin to
invalid instead, to allow dead tuples to be removed when all apply workers
stop retention. This approach simplifies the implementation and avoids
introducing a new invalidation type solely for one internal slot.

* Fixed a bug where a parallel apply worker continued to search for dead
  tuples when retention had already stopped. The parallel apply and table
  sync workers referred to their own stop_conflict_info_retention flag, but
  they should refer to the leader's retention flag instead, because only the
  leader manages this flag.

0002:

* Allow the apply worker to wait for the slot to recover after resuming dead
  tuple retention, instead of restarting the apply worker.

Best Regards,
Hou zj

Attachments

Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Mon, Aug 4, 2025 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 4, 2025 at 11:46 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > 7)
> > Shall we rename 'max_conflict_retention_duration' to
> > 'max_conflict_info_retention_duration' as the latter one is more
> > clear?
> >
>
> Before bikeshedding on the name of this option, I would like us to
> once again consider whether we should provide this option at
> subscription-level or GUC?
>
> The rationale behind considering it as a subscription option is that
> the different subscriptions may have different requirements for dead
> tuple retention which means that for some particular subscription, the
> workload may not be always high which means that even if temporarily
> the lag_duration (of apply) has exceeded the new option's value, it
> should become okay. So, in such a case users may not want to configure
> max_conflict_retention_duration for a subscription which would
> otherwise lead to stop detection of update_deleted conflict for that
> subscription.

Yes, valid point. It's also possible that for some subscriptions the
user is okay with not retaining dead tuples once retention crosses a
certain duration, while for other subscriptions retaining dead tuples
is critical even if the user has to take some performance hit, so they
might want a higher threshold for those.

> The other point is that it is only related to the retain_dead_tuples
> option of the subscription, so providing this new option at the same
> level would appear consistent.

Yes, that's a valid argument: only users who set retain_dead_tuples
for a subscription need to consider setting the duration.

> I remember that previously Sawada-San has advocated it to provide as
> GUC but I think the recent tests suggest that users should define
> pub-sub topology carefuly to enable retain_dead_tuples option as even
> mentioned in docs[2], so, it is worth considering to provide it at
> subscription-level.

IMHO, it should be fine to provide the subscription option first and
if we see complaints about inconvenience we may consider GUC as well
in the future.

--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Thu, Aug 7, 2025 at 6:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Aug 4, 2025 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Aug 4, 2025 at 11:46 AM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > 7)
> > > Shall we rename 'max_conflict_retention_duration' to
> > > 'max_conflict_info_retention_duration' as the latter one is more
> > > clear?
> > >
> >
> > Before bikeshedding on the name of this option, I would like us to
> > once again consider whether we should provide this option at
> > subscription-level or GUC?
> >
> > The rationale behind considering it as a subscription option is that
> > the different subscriptions may have different requirements for dead
> > tuple retention which means that for some particular subscription, the
> > workload may not be always high which means that even if temporarily
> > the lag_duration (of apply) has exceeded the new option's value, it
> > should become okay. So, in such a case users may not want to configure
> > max_conflict_retention_duration for a subscription which would
> > otherwise lead to stop detection of update_deleted conflict for that
> > subscription.
>
> Yes valid point, and it's also possible that for some subscription
> user is okay to not retain dead tuple if it crosses certain duration
> OTOH for some subscription it is too critical to retain dead tuple
> even if user has to take some performance hit, so might want to have
> higher threshold for those slots.
>
> > The other point is that it is only related to the retain_dead_tuples
> > option of the subscription, so providing this new option at the same
> > level would appear consistent.
>
> Yes that's a valid argument, because if the user is setting retain
> dead tuples for subscription then only they need to consider setting
> duration.
>
> > I remember that previously Sawada-San has advocated it to provide as
> > GUC but I think the recent tests suggest that users should define
> > pub-sub topology carefuly to enable retain_dead_tuples option as even
> > mentioned in docs[2], so, it is worth considering to provide it at
> > subscription-level.
>
> IMHO, it should be fine to provide the subscription option first and
> if we see complaints about inconvenience we may consider GUC as well
> in the future.

+1

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Thu, Aug 7, 2025 at 10:10 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, August 5, 2025 10:09 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> > Here is V57 patch set which addressed most of comments.
> >
> > In this version, I also fixed a bug that the apply worker continued to find dead
> > tuples even if it has already stop retaining dead tuples.
>
> Here is a V58 patch set which improved few things by internal review:
>
> 0001:
>

Thank You for the patches, please find  a few comments on 001 alone:

1)
+ /*
+ * Return if the wait time has not exceeded the maximum limit
+ * (max_conflict_retention_duration).
+ */
+ if (!TimestampDifferenceExceeds(rdt_data->candidate_xid_time, now,
+ max_conflict_retention_duration +
+ rdt_data->table_sync_wait_time))

We can add comments here as in why we are adding table-sync time to
max_conflict_retention_duration.

2)
relmutex comment says:

        /* Used for initial table synchronization. */
        Oid                     relid;
        char            relstate;
        XLogRecPtr      relstate_lsn;
        slock_t         relmutex;

We should update this comment, as we now use it for other purposes as
well. Also, the name is specific to relations (it was originally created
for the table-sync case). We could rename it to something more general so
that it can also be used for oldest-xid access.

3)
+ Assert(TransactionIdIsValid(rdt_data->candidate_xid));
+ Assert(rdt_data->phase == RDT_WAIT_FOR_PUBLISHER_STATUS ||
+    rdt_data->phase == RDT_WAIT_FOR_LOCAL_FLUSH);
+
+ if (!max_conflict_retention_duration)
+ return false;

Shall we move 'max_conflict_retention_duration' NULL check as the
first step. Or do you think it will be better to move it to the caller
before should_stop_conflict_info_retention is invoked?

4)
+        The information useful for conflict detection is no longer retained if
+        all apply workers associated with the subscriptions, where
+        <literal>retain_dead_tuples</literal> is enabled, confirm that the
+        retention duration exceeded the
+        <literal>max_conflict_retention_duration</literal>. To re-enable
+        retention, you can disable <literal>retain_dead_tuples</literal> and
+        re-enable it after confirming this replication slot has been dropped.

But the replication slot will not be dropped unless all the
subscriptions have disabled retain_dead_tuples. So shall the above doc
somehow mention this part as well otherwise it could be misleading for
users.


5)
pg_stat_subscription_stats: retain_dead_tuples

Can it cause confusion that both the subscription's parameter and the
pg_stat_subscription_stats column have the same name while possibly
having different values? Shall the stats one be named
'effective_retain_dead_tuples'?

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Friday, August 8, 2025 2:34 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Thu, Aug 7, 2025 at 10:10 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, August 5, 2025 10:09 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> > > Here is V57 patch set which addressed most of comments.
> > >
> > > In this version, I also fixed a bug that the apply worker continued
> > > to find dead tuples even if it has already stop retaining dead tuples.
> >
> > Here is a V58 patch set which improved few things by internal review:
> >
> > 0001:
> >
> 
> Thank You for the patches, please find  a few comments on 001 alone:

Thanks for the comments.

> 
> 1)
> + /*
> + * Return if the wait time has not exceeded the maximum limit
> + * (max_conflict_retention_duration).
> + */
> + if (!TimestampDifferenceExceeds(rdt_data->candidate_xid_time, now,
> + max_conflict_retention_duration +
> + rdt_data->table_sync_wait_time))
> 
> We can add comments here as in why we are adding table-sync time to
> max_conflict_retention_duration.

Added.

> 
> 2)
> relmutex comment says:
> 
>         /* Used for initial table synchronization. */
>         Oid                     relid;
>         char            relstate;
>         XLogRecPtr      relstate_lsn;
>         slock_t         relmutex;
> 
> We shall update this comment as now we are using it for other purposes. Also
> name is specific to relation (due to originally created for table-sync case). We
> can rename it to be more general so that it can be used for oldest-xid access
> purposes as well.

Changed the name and added comments.

> 
> 3)
> + Assert(TransactionIdIsValid(rdt_data->candidate_xid));
> + Assert(rdt_data->phase == RDT_WAIT_FOR_PUBLISHER_STATUS ||
> +    rdt_data->phase == RDT_WAIT_FOR_LOCAL_FLUSH);
> +
> + if (!max_conflict_retention_duration)
> + return false;
> 
> Shall we move 'max_conflict_retention_duration' NULL check as the first step.
> Or do you think it will be better to move it to the caller before
> should_stop_conflict_info_retention is invoked?

I think these Asserts are good to have, even if the GUC is not
specified, so I kept the current style.

> 
> 4)
> +        The information useful for conflict detection is no longer retained if
> +        all apply workers associated with the subscriptions, where
> +        <literal>retain_dead_tuples</literal> is enabled, confirm that the
> +        retention duration exceeded the
> +        <literal>max_conflict_retention_duration</literal>. To re-enable
> +        retention, you can disable <literal>retain_dead_tuples</literal> and
> +        re-enable it after confirming this replication slot has been dropped.
> 
> But the replication slot will not be dropped unless all the subscriptions have
> disabled retain_dead_tuples. So shall the above doc somehow mention this
> part as well otherwise it could be misleading for users.

Added.

> 5)
> pg_stat_subscription_stats: retain_dead_tuples
> 
> Can it cause confusion as both subscription's parameter and
> pg_stat_subscription_stats's column have the same name while may have
> different values. Shall the stats one be named as
> 'effective_retain_dead_tuples'?

I think the prefix "effective_" is typically used for non-boolean options (such
as effective_cache_size or effective_io_concurrency). So, I opted for the name
"dead_tuple_retention_active" as it aligns with some existing names like
"row_security_active."

Here is V59 patch set which addressed above comments in 0001.

Best Regards,
Hou zj

Attachments

RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, August 7, 2025 8:35 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Aug 4, 2025 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Mon, Aug 4, 2025 at 11:46 AM shveta malik <shveta.malik@gmail.com>
> wrote:
> > >
> > > 7)
> > > Shall we rename 'max_conflict_retention_duration' to
> > > 'max_conflict_info_retention_duration' as the latter one is more
> > > clear?
> > >
> >
> > Before bikeshedding on the name of this option, I would like us to
> > once again consider whether we should provide this option at
> > subscription-level or GUC?
> >
> > The rationale behind considering it as a subscription option is that
> > the different subscriptions may have different requirements for dead
> > tuple retention which means that for some particular subscription, the
> > workload may not be always high which means that even if temporarily
> > the lag_duration (of apply) has exceeded the new option's value, it
> > should become okay. So, in such a case users may not want to configure
> > max_conflict_retention_duration for a subscription which would
> > otherwise lead to stop detection of update_deleted conflict for that
> > subscription.
> 
> Yes valid point, and it's also possible that for some subscription user is okay to
> not retain dead tuple if it crosses certain duration OTOH for some subscription
> it is too critical to retain dead tuple even if user has to take some performance
> hit, so might want to have higher threshold for those slots.
> 
> > The other point is that it is only related to the retain_dead_tuples
> > option of the subscription, so providing this new option at the same
> > level would appear consistent.
> 
> Yes that's a valid argument, because if the user is setting retain dead tuples for
> subscription then only they need to consider setting duration.
> 
> > I remember that previously Sawada-San has advocated it to provide as
> > GUC but I think the recent tests suggest that users should define
> > pub-sub topology carefuly to enable retain_dead_tuples option as even
> > mentioned in docs[2], so, it is worth considering to provide it at
> > subscription-level.
> 
> IMHO, it should be fine to provide the subscription option first and if we see
> complaints about inconvenience we may consider GUC as well in the future.

I agree. So, following the above points and some off-list discussions, I have
revised the option to be a subscription option in the V60 version.

Best Regards,
Hou zj

Attachments

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Mon, Aug 11, 2025 at 2:40 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> I agree. So, following the above points and some off-list discussions, I have
> revised the option to be a subscription option in the V60 version.
>

Thank you for the patches. I tried to test the new sub-level parameter
and have a few comments:


1)
Let's say a commit on the publisher is taking time and the worker has
stopped retention, and meanwhile we alter
max_conflict_retention_duration to 0; the expectation is that the
worker should immediately resume retention. But that does not happen:
it does not restart conflict retention until the publisher's commit is
finished. 'dead_tuple_retention_active' remains 'f' until then.

postgres=# select subname, subretaindeadtuples, maxconflretention from
pg_subscription order by subname;
 subname | subretaindeadtuples | maxconflretention
---------+---------------------+-------------------
 sub1    | t                   |                 0


postgres=# select subname, worker_type, dead_tuple_retention_active
from pg_stat_subscription order by subname;
 subname | worker_type | dead_tuple_retention_active
---------+-------------+-----------------------------
 sub1    | apply       | f

I think we should reset the 'stop_conflict_info_retention' flag in
should_stop_conflict_info_retention() if maxconflretention is 0 and
the flag is currently true.
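
(The state above was reached roughly as follows; a sketch using the
option name from this patch set.)

    -- With the worker already past the limit and retention stopped,
    -- remove the cap; the expectation is that retention resumes promptly.
    ALTER SUBSCRIPTION sub1 SET (max_conflict_retention_duration = 0);

    -- Then re-check dead_tuple_retention_active as in the query above.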


2)
postgres=# create subscription sub2  connection 'dbname=postgres
host=localhost user=shveta port=5433' publication pub2 WITH
(retain_dead_tuples = false, max_conflict_retention_duration=1000);
NOTICE:  created replication slot "sub2" on publisher
CREATE SUBSCRIPTION

Shall we give notice that max_conflict_retention_duration is ignored
as retain_dead_tuples is false.


3)
When worker stops retention, it gives message:

LOG:  logical replication worker for subscription "sub1" will stop
retaining the information for detecting conflicts
DETAIL:  The time spent advancing the non-removable transaction ID has
exceeded the maximum limit of 100 ms.

Will it be more informative if we mention the parameter name
'max_conflict_retention_duration' either in DETAIL or in additional
HINT, as then the user can easily map this behaviour to the parameter
configured.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Tue, Aug 12, 2025 at 2:06 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> 2)
> postgres=# create subscription sub2  connection 'dbname=postgres
> host=localhost user=shveta port=5433' publication pub2 WITH
> (retain_dead_tuples = false, max_conflict_retention_duration=1000);
> NOTICE:  created replication slot "sub2" on publisher
> CREATE SUBSCRIPTION
>
> Shall we give notice that max_conflict_retention_duration is ignored
> as retain_dead_tuples is false.
>

How about disallowing this combination?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Tue, Aug 12, 2025 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Aug 12, 2025 at 2:06 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > 2)
> > postgres=# create subscription sub2  connection 'dbname=postgres
> > host=localhost user=shveta port=5433' publication pub2 WITH
> > (retain_dead_tuples = false, max_conflict_retention_duration=1000);
> > NOTICE:  created replication slot "sub2" on publisher
> > CREATE SUBSCRIPTION
> >
> > Shall we give notice that max_conflict_retention_duration is ignored
> > as retain_dead_tuples is false.
> >
>
> How about disallowing this combination?

+1 to disallow that.



--
Regards,
Dilip Kumar
Google



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, August 12, 2025 5:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Hi,

> 
> On Tue, Aug 12, 2025 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Tue, Aug 12, 2025 at 2:06 PM shveta malik <shveta.malik@gmail.com>
> wrote:
> > >
> > > 2)
> > > postgres=# create subscription sub2  connection 'dbname=postgres
> > > host=localhost user=shveta port=5433' publication pub2 WITH
> > > (retain_dead_tuples = false, max_conflict_retention_duration=1000);
> > > NOTICE:  created replication slot "sub2" on publisher CREATE
> > > SUBSCRIPTION
> > >
> > > Shall we give notice that max_conflict_retention_duration is ignored
> > > as retain_dead_tuples is false.
> > >
> >
> > How about disallowing this combination?
> 
> +1 to disallow that.

I think disallowing this case may not suffice, as users could initially set
(retain_dead_tuples=on, max_conflict_retention_duration=100) but later disable
retain_dead_tuples. This would result in the same state as
(retain_dead_tuples=off, max_conflict_retention_duration=100) unless we disallow
disabling rdt in this case as well.
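
A sketch of the sequence being described (option names as in this patch
set; depending on the patch version, altering retain_dead_tuples may
require the subscription to be disabled first):

    -- Allowed: retention enabled together with a cap.
    CREATE SUBSCRIPTION sub2
        CONNECTION 'host=node_a dbname=postgres'   -- illustrative
        PUBLICATION pub2
        WITH (retain_dead_tuples = true,
              max_conflict_retention_duration = 100);

    -- Later, disabling retention leaves the cap configured but
    -- meaningless, i.e. the same state as creating the subscription with
    -- (retain_dead_tuples = false, max_conflict_retention_duration = 100).
    ALTER SUBSCRIPTION sub2 SET (retain_dead_tuples = false);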

So, do you think we should disallow both cases, or should we only
disallow setting max_conflict_retention_duration for a disabled
retain_dead_tuples and give a NOTICE in the other case?

I personally prefer consistent behavior, e.g., either we allow both
cases and give NOTICEs, or we disallow both cases. This is because, if
the goal here is to prevent potential misconfiguration by users, the
scenario where a user disables retain_dead_tuples might also be
considered a similar misconfiguration. So, I'm a bit concerned that the
benefits of imposing a partial restriction may not outweigh the risk of
inconsistent behavior. (And I did not see a similar precedent.)

What do you think?

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From
Dilip Kumar
Date:
On Tue, Aug 12, 2025 at 3:02 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, August 12, 2025 5:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Hi,
>
> >
> > On Tue, Aug 12, 2025 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > On Tue, Aug 12, 2025 at 2:06 PM shveta malik <shveta.malik@gmail.com>
> > wrote:
> > > >
> > > > 2)
> > > > postgres=# create subscription sub2  connection 'dbname=postgres
> > > > host=localhost user=shveta port=5433' publication pub2 WITH
> > > > (retain_dead_tuples = false, max_conflict_retention_duration=1000);
> > > > NOTICE:  created replication slot "sub2" on publisher CREATE
> > > > SUBSCRIPTION
> > > >
> > > > Shall we give notice that max_conflict_retention_duration is ignored
> > > > as retain_dead_tuples is false.
> > > >
> > >
> > > How about disallowing this combination?
> >
> > +1 to disallow that.
>
> I think disallowing this case may not suffice, as users could initially set
> (retain_dead_tuples=on, max_conflict_retention_duration=100) but later disable
> retain_dead_tuples. This would result in the same state as
> (retain_dead_tuples=off, max_conflict_retention_duration=100) unless we disallow
> disabling rdt in this case as well.
>
> So, do you think we should disallow both cases, or we only disallow setting
> max_conflict_retention_duration for disabled retain_dead_tuples and give NOTICE
> in another case ?
>
> I personally prefer a consistent behavior, e.g., either we allow both cases and
> give NOTICEs, or we disallow both cases. This is because, if the goal here to
> prevent potential misconfigurations by users, the scenario where a user disables
> retain_dead_tuples might also be considered a similar misconfiguration. So, I'm
> a bit concerned that the benefits of imposing a partial restriction may not
> outweigh the risk of generating inconsistent behavior. (And I did not see
> similar precedent).

I think setting 'retain_dead_tuples' to off should implicitly reset
'max_conflict_retention_duration'. But someone may argue that if we
set 'retain_dead_tuples' back to on, we should restore
'max_conflict_retention_duration' to its original value, and that
could add extra complexity.

So, after thinking about it again, IMHO we can just go with the
current behavior: allow max_conflict_retention_duration to be set to a
positive value without setting retain_dead_tuples, and issue a NOTICE
(in both cases).

--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Tue, Aug 12, 2025 at 3:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 12, 2025 at 3:02 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, August 12, 2025 5:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Hi,
> >
> > >
> > > On Tue, Aug 12, 2025 at 2:21 PM Amit Kapila <amit.kapila16@gmail.com>
> > > wrote:
> > > >
> > > > On Tue, Aug 12, 2025 at 2:06 PM shveta malik <shveta.malik@gmail.com>
> > > wrote:
> > > > >
> > > > > 2)
> > > > > postgres=# create subscription sub2  connection 'dbname=postgres
> > > > > host=localhost user=shveta port=5433' publication pub2 WITH
> > > > > (retain_dead_tuples = false, max_conflict_retention_duration=1000);
> > > > > NOTICE:  created replication slot "sub2" on publisher CREATE
> > > > > SUBSCRIPTION
> > > > >
> > > > > Shall we give notice that max_conflict_retention_duration is ignored
> > > > > as retain_dead_tuples is false.
> > > > >
> > > >
> > > > How about disallowing this combination?
> > >
> > > +1 to disallow that.
> >
> > I think disallowing this case may not suffice, as users could initially set
> > (retain_dead_tuples=on, max_conflict_retention_duration=100) but later disable
> > retain_dead_tuples. This would result in the same state as
> > (retain_dead_tuples=off, max_conflict_retention_duration=100) unless we disallow
> > disabling rdt in this case as well.
> >
> > So, do you think we should disallow both cases, or we only disallow setting
> > max_conflict_retention_duration for disabled retain_dead_tuples and give NOTICE
> > in another case ?
> >
> > I personally prefer a consistent behavior, e.g., either we allow both cases and
> > give NOTICEs, or we disallow both cases. This is because, if the goal here to
> > prevent potential misconfigurations by users, the scenario where a user disables
> > retain_dead_tuples might also be considered a similar misconfiguration. So, I'm
> > a bit concerned that the benefits of imposing a partial restriction may not
> > outweigh the risk of generating inconsistent behavior. (And I did not see
> > similar precedent).
>
> I think setting 'retain_dead_tuples' to off should implicitly reset
> 'max_conflict_retention_duration', yeah but someone may argue if we
> again set 'retain_dead_tuples' to on then we should set
> 'max_conflict_retention_duration' to its original value and that could
> add an extra complexity.

I have the exact same opinion here.

>
> So maybe after thinking again IMHO, we can just go with the current
> behavior, like we allow max_conflict_retention_duration to be set to a
> positive value without setting retain_dead_tuples and issue a NOTICE
> (in both cases).
>

I feel that displaying a NOTICE in all such commands where
'max_conflict_retention_duration' will have no meaning is good enough.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Nisha Moond
Date:
On Mon, Aug 11, 2025 at 2:40 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> I agree. So, following the above points and some off-list discussions, I have
> revised the option to be a subscription option in the V60 version.
>

Thanks Hou-san for the patches.
I have tested the patches and they are working as expected. I have a
minor comment for patch v60-0001.

@@ -4642,6 +4791,10 @@ adjust_xid_advance_interval(RetainDeadTuplesData *rdt_data, bool new_xid_found)
   */
   rdt_data->xid_advance_interval = Min(rdt_data->xid_advance_interval * 2,
                                        max_interval);
+
+ /* Ensure the wait time remains within the maximum limit */
+ rdt_data->xid_advance_interval = Min(rdt_data->xid_advance_interval,
+                                      MySubscription->maxconflretention);

The function comment needs an update as per the above change. Currently, it says -
 * The interval is reset to a minimum value of 100ms once there is some
 * activity on the node.

But if MySubscription->maxconflretention is < 100ms, then the interval
will be set to the maxconflretention value and not 100ms.

--
Thanks,
Nisha



RE: Conflict detection for update_deleted in logical replication

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, August 12, 2025 4:37 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Mon, Aug 11, 2025 at 2:40 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > I agree. So, following the above points and some off-list discussions,
> > I have revised the option to be a subscription option in the V60 version.
> >
> 
> Thank You for the patches. Tried to test the new sub-level parameter, have few
> comments:

Thanks for the comments.

> 
> 1)
> Let's say commit on pub is taking time and worker has stopped retention and
> meanwhile we alter max_conflict_retention_duration=0,
> then the expectation is immediately worker should resume retention.
> But it does not happen, it does not restart conflict-retention until the pub's
> commit is finished. The 'dead_tuple_retention_active'
> remains 'f' till then.
> 
> postgres=# select subname, subretaindeadtuples, maxconflretention from
> pg_subscription order by subname;  subname | subretaindeadtuples |
> maxconflretention
> ---------+---------------------+-------------------
>  sub1       | t                                |                 0
> 
> 
> postgres=# select subname, worker_type, dead_tuple_retention_active from
> pg_stat_subscription order by subname;  subname | worker_type |
> dead_tuple_retention_active
> ---------+-------------+-----------------------------
>  sub1      | apply              | f
> 
> I think we shall reset 'stop_conflict_info_retention' flag in
> should_stop_conflict_info_retention() if maxconflretention is 0 and the flag is
> originally true.

Agreed. Thinking about it more, I think we can resume retention in more
places, since each phase could offer an opportunity to resume, so changed.

> 2)
> postgres=# create subscription sub2  connection 'dbname=postgres
> host=localhost user=shveta port=5433' publication pub2 WITH
> (retain_dead_tuples = false, max_conflict_retention_duration=1000);
> NOTICE:  created replication slot "sub2" on publisher CREATE
> SUBSCRIPTION
> 
> Shall we give notice that max_conflict_retention_duration is ignored as
> retain_dead_tuples is false.

Agreed. In addition to this command, I added the NOTICE for all
the cases when the max_conflict_retention_duration is ineffective as 
discussed[1].


> 
> 3)
> When worker stops retention, it gives message:
> 
> LOG:  logical replication worker for subscription "sub1" will stop retaining the
> information for detecting conflicts
> DETAIL:  The time spent advancing the non-removable transaction ID has
> exceeded the maximum limit of 100 ms.
> 
> Will it be more informative if we mention the parameter name
> 'max_conflict_retention_duration' either in DETAIL or in additional HINT, as
> then the user can easily map this behaviour to the parameter configured.

Added.

Here is the V61 patch set which addressed above comments and the comment by Nisha[2].

[1] https://www.postgresql.org/message-id/CAJpy0uC81YgAmrA2ji2ZKbJK_qfvajuV6%3DyvcCWuFsQKqiED%2BA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CABdArM7G1sSDDOEC-nmJRnJMCZoBsLqOMz08UotX_h_wqxHWCg%40mail.gmail.com

Best Regards,
Hou zj

Attachments

Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Wed, Aug 13, 2025 at 10:41 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Here is the V61 patch set which addressed above comments and the comment by Nisha[2].
>

Thank You for the patch. I tested the patch, please find a few comments:

1)
Now when it stops-retention and later resumes it due to the fact that
max_duration is meanwhile altered to 0, I get log:

LOG:  logical replication worker for subscription "sub1" resumes
retaining the information for detecting conflicts
DETAIL:  The time spent applying changes up to LSN 0/17DD728 is now
within the maximum limit of 0 ms.

I did not get which lsn it is pointing to? Is it some dangling lsn
from when it was retaining info?  Also the msg looks odd, when it says
'is now within the maximum limit of 0 ms.'

2)
While stopping the message is:
LOG:  logical replication worker for subscription "sub1" will stop
retaining conflict information
DETAIL:  The time spent advancing the non-removable transaction ID has
exceeded the maximum limit of 1000 ms.

And while resuming:
logical replication worker for subscription "sub1" resumes retaining
the information for detecting conflicts
----------

We can make both similar. Both can have 'retaining the information for
detecting conflicts' instead of 'conflict information' in first one.

3)
I believe the tenses should also be updated. When stopping, we can say:

Logical replication worker for subscription "sub1" has stopped...

This is appropriate because it has already stopped by pre-setting
oldest_nonremovable_xid to Invalid.

When resuming, we can say:
Logical replication worker for subscription "sub1" will resume...

This is because it will begin resuming from the next cycle onward,
specifically after the launcher sets its oldest_xid.

4)
For the DETAIL part of resume and stop messages, how about these:

The retention duration for information used in conflict detection has
exceeded the limit of xx.
The retention duration for information used in conflict detection is
now within the acceptable limit of xx.
The retention duration for information used in conflict detection is
now indefinite.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
Amit Kapila
Date:
On Wed, Aug 13, 2025 at 4:15 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Wed, Aug 13, 2025 at 10:41 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> >
> > Here is the V61 patch set which addressed above comments and the comment by Nisha[2].
> >
>
> Thank You for the patch. I tested the patch, please find a few comments:
>
> 1)
> Now when it stops-retention and later resumes it due to the fact that
> max_duration is meanwhile altered to 0, I get log:
>
> LOG:  logical replication worker for subscription "sub1" resumes
> retaining the information for detecting conflicts
> DETAIL:  The time spent applying changes up to LSN 0/17DD728 is now
> within the maximum limit of 0 ms.
>
> I did not get which lsn it is pointing to? Is it some dangling lsn
> from when it was retaining info?  Also the msg looks odd, when it says
> 'is now within the maximum limit of 0 ms.'
>
> 2)
> While stopping the message is:
> LOG:  logical replication worker for subscription "sub1" will stop
> retaining conflict information
> DETAIL:  The time spent advancing the non-removable transaction ID has
> exceeded the maximum limit of 1000 ms.
>
> And while resuming:
> logical replication worker for subscription "sub1" resumes retaining
> the information for detecting conflicts
> ----------
>
> We can make both similar. Both can have 'retaining the information for
> detecting conflicts' instead of 'conflict information' in first one.
>

Can we try to keep it slightly short by saying "retaining conflict
detection info"?

>
> 4)
> For the DETAIL part of resume and stop messages, how about these:
>
> The retention duration for information used in conflict detection has
> exceeded the limit of xx.
> The retention duration for information used in conflict detection is
> now within the acceptable limit of xx.
> The retention duration for information used in conflict detection is
> now indefinite.
>

Similar to the previous point, will it be better to keep it short by
using "conflict detection info", for example, it will lead to message
like "The retention duration for conflict detection info is now
indefinite."?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Wed, Aug 13, 2025 at 5:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > 4)
> > For the DETAIL part of resume and stop messages, how about these:
> >
> > The retention duration for information used in conflict detection has
> > exceeded the limit of xx.
> > The retention duration for information used in conflict detection is
> > now within the acceptable limit of xx.
> > The retention duration for information used in conflict detection is
> > now indefinite.
> >
>
> Similar to the previous point, will it be better to keep it short by
> using "conflict detection info", for example, it will lead to message
> like "The retention duration for conflict detection info is now
> indefinite."?
>

Works for me as I understand the context. But could it be confusing
for users? Could it be inferred as info about conflicts rather than
info used while detecting conflicts?

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From
shveta malik
Date:
On Wed, Aug 13, 2025 at 4:15 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Wed, Aug 13, 2025 at 10:41 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> >
> > Here is the V61 patch set which addressed above comments and the comment by Nisha[2].
> >
>
> Thank You for the patch. I tested the patch, please find a few comments:
>
> 1)
> Now when it stops-retention and later resumes it due to the fact that
> max_duration is meanwhile altered to 0, I get log:
>
> LOG:  logical replication worker for subscription "sub1" resumes
> retaining the information for detecting conflicts
> DETAIL:  The time spent applying changes up to LSN 0/17DD728 is now
> within the maximum limit of 0 ms.
>
> I did not get which lsn it is pointing to? Is it some dangling lsn
> from when it was retaining info?  Also the msg looks odd, when it says
> 'is now within the maximum limit of 0 ms.'
>
> 2)
> While stopping the message is:
> LOG:  logical replication worker for subscription "sub1" will stop
> retaining conflict information
> DETAIL:  The time spent advancing the non-removable transaction ID has
> exceeded the maximum limit of 1000 ms.
>
> And while resuming:
> logical replication worker for subscription "sub1" resumes retaining
> the information for detecting conflicts
> ----------
>
> We can make both similar. Both can have 'retaining the information for
> detecting conflicts' instead of 'conflict information' in first one.
>
> 3)
> I believe the tenses should also be updated. When stopping, we can say:
>
> Logical replication worker for subscription "sub1" has stopped...
>
> This is appropriate because it has already stopped by pre-setting
> oldest_nonremovable_xid to Invalid.
>
> When resuming, we can say:
> Logical replication worker for subscription "sub1" will resume...
>
> This is because it will begin resuming from the next cycle onward,
> specifically after the launcher sets its oldest_xid.
>
> 4)
> For the DETAIL part of resume and stop messages, how about these:
>
> The retention duration for information used in conflict detection has
> exceeded the limit of xx.
> The retention duration for information used in conflict detection is
> now within the acceptable limit of xx.
> The retention duration for information used in conflict detection is
> now indefinite.
>

5)
Say there are 2-3 subs, all of which have stopped retention, and the slot is
set to have an invalid xmin; now if I create a new sub, it will start with the
stopped flag set to true because the slot has an invalid xmin to begin with.
But then it will immediately emit a resume message. This looks odd, since it
never actually stopped, being a new sub.
Is there anything we can do to improve this situation?

Logs:
2025-08-13 15:13:01.197 IST [61926] LOG:  logical replication apply
worker for subscription "sub4" has started
2025-08-13 15:13:01.482 IST [61926] LOG:  logical replication worker
for subscription "sub4" resumes retaining the information for
detecting conflicts
2025-08-13 15:13:01.482 IST [61926] DETAIL:  The time spent applying
changes up to LSN 0/17C9348 is now within the maximum limit of 50000
ms.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
Few more comments:

1)
src/sgml/monitoring.sgml:

+      <para>
+       True if <link linkend="sql-createsubscription-params-with-retain-dead-tuples"><literal>retain_dead_tuples</literal></link>
+       is enabled and the duration for which information useful for conflict
+       detection is retained by this apply worker does not exceed
+       <link linkend="sql-createsubscription-params-with-max-conflict-retention-duration"><literal>max_conflict_retention_duration</literal></link>;
+       NULL for parallel apply workers and table synchronization workers.
+      </para></entry>

How about:
True for apply workers if retain_dead_tuples is enabled and the
conflict detection information retention time is within
max_conflict_retention_duration; NULL for parallel apply and table
synchronization workers.

(I have used 'conflict detection information' here, as suggested by
Amit in another email. But if we plan to stick to another version in
LOG message, please do the same here as well)


2)
create_subscription.sgml: retain_dead_tuples:

          If set to <literal>true</literal>, the detection of
          <xref linkend="conflict-update-deleted"/> is enabled, and a physical
          replication slot named
<quote><literal>pg_conflict_detection</literal></quote>
          created on the subscriber to prevent the information for detecting
          conflicts from being removed.

'created on the subscriber' --> 'is created on the subscriber'.
'is' missing.  It belongs to the committed patch, but shall we change
it here?

3)
+          Maximum duration that the apply worker, according to this
subscription,
+          is allowed to retain the information useful for conflict
detection when
+          <literal>retain_dead_tuples</literal> is enabled for the associated
+          subscriptions.

Shall we say:
'Maximum duration for which this subscription's apply worker is
allowed to retain...'.

4)

if (MyLogicalRepWorker->stop_conflict_info_retention &&
    !MySubscription->maxconflretention)
{
    resume_conflict_info_retention(rdt_data);
    return;
}

The above logic is present at the beginning of all phases except 'request
status'. Do you think we can shift it to maybe_advance_nonremovable_xid()
before we call process_rdt_phase_transition()? We would not need the 'return'
there; once the reset is done in the resume call, it can proceed with
get_candidate_xid immediately in the same call (see the sketch below).
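
For illustration, a rough sketch of what that hoisted check could look like;
the names (maybe_advance_nonremovable_xid, stop_conflict_info_retention,
maxconflretention, resume_conflict_info_retention,
process_rdt_phase_transition) are taken from the patch under discussion, so
treat this as an assumption about its structure rather than actual patch code:

static void
maybe_advance_nonremovable_xid(RetainDeadTuplesData *rdt_data,
                               bool status_received)
{
    /*
     * If retention was stopped earlier but max_conflict_retention_duration
     * has since been cleared, reset the phase data here once, instead of
     * repeating this check at the start of every phase handler.
     */
    if (MyLogicalRepWorker->stop_conflict_info_retention &&
        !MySubscription->maxconflretention)
        resume_conflict_info_retention(rdt_data);

    /*
     * No early return: after the reset above, the transition can proceed
     * with the get_candidate_xid phase in this same call.
     */
    process_rdt_phase_transition(rdt_data, status_received);
}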

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Thu, Aug 14, 2025 at 9:02 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Wed, Aug 13, 2025 at 5:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > 4)
> > > For the DETAIL part of resume and stop messages, how about these:
> > >
> > > The retention duration for information used in conflict detection has
> > > exceeded the limit of xx.
> > > The retention duration for information used in conflict detection is
> > > now within the acceptable limit of xx.
> > > The retention duration for information used in conflict detection is
> > > now indefinite.
> > >
> >
> > Similar to the previous point, will it be better to keep it short by
> > using "conflict detection info", for example, it will lead to message
> > like "The retention duration for conflict detection info is now
> > indefinite."?
> >
>
> Works for me as I understand the context. But could it be confusing
> for users? Could it be inferred as info about conflicts rather than
> info used while detecting conflicts?
>

Fair point. We can retain your proposed message.

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Thursday, August 14, 2025 11:46 AM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Wed, Aug 13, 2025 at 4:15 PM shveta malik <shveta.malik@gmail.com>
> wrote:
> >
> > On Wed, Aug 13, 2025 at 10:41 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > >
> > > Here is the V61 patch set which addressed above comments and the
> comment by Nisha[2].
> > >
> >
> > Thank You for the patch. I tested the patch, please find a few comments:
> >
> > 1)
> > Now when it stops-retention and later resumes it due to the fact that
> > max_duration is meanwhile altered to 0, I get log:
> >
> > LOG:  logical replication worker for subscription "sub1" resumes
> > retaining the information for detecting conflicts
> > DETAIL:  The time spent applying changes up to LSN 0/17DD728 is now
> > within the maximum limit of 0 ms.
> >
> > I did not get which lsn it is pointing to? Is it some dangling lsn
> > from when it was retaining info?  Also the msg looks odd, when it says
> > 'is now within the maximum limit of 0 ms.'
> >
> > 2)
> > While stopping the message is:
> > LOG:  logical replication worker for subscription "sub1" will stop
> > retaining conflict information
> > DETAIL:  The time spent advancing the non-removable transaction ID has
> > exceeded the maximum limit of 1000 ms.
> >
> > And while resuming:
> > logical replication worker for subscription "sub1" resumes retaining
> > the information for detecting conflicts
> > ----------
> >
> > We can make both similar. Both can have 'retaining the information for
> > detecting conflicts' instead of 'conflict information' in first one.
> >
> > 3)
> > I believe the tenses should also be updated. When stopping, we can say:
> >
> > Logical replication worker for subscription "sub1" has stopped...
> >
> > This is appropriate because it has already stopped by pre-setting
> > oldest_nonremovable_xid to Invalid.
> >
> > When resuming, we can say:
> > Logical replication worker for subscription "sub1" will resume...
> >
> > This is because it will begin resuming from the next cycle onward,
> > specifically after the launcher sets its oldest_xid.
> >
> > 4)
> > For the DETAIL part of resume and stop messages, how about these:
> >
> > The retention duration for information used in conflict detection has
> > exceeded the limit of xx.
> > The retention duration for information used in conflict detection is
> > now within the acceptable limit of xx.
> > The retention duration for information used in conflict detection is
> > now indefinite.
> >

Thanks for the comments, I have adjusted the log messages
according to the suggestions.


> 
> 5)
> Say there 2-3 subs, all have stopped-retention and the slot is set to have invalid
> xmin; now if I  create a new sub, it will start with stopped-flag set to true due to
> the fact that slot has invalid xmin to begin with. But then immediately, it will
> dump a resume message. It looks odd, as at first, it has not even stopped, as it
> is a new sub.
> Is there anything we can do to improve this situation?

I changed the logic to recover the slot immediately on starting a new worker
that has retain_dead_tuples enabled.

Here is the V62 patch set which addressed above comments and [1].

[1] https://www.postgresql.org/message-id/CAJpy0uBW8G2RNY%3DJjxzr_ootQ2MTxPQG98hz%3D-wdJzn86yapVg%40mail.gmail.com

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Thu, Aug 14, 2025 at 10:46 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, August 14, 2025 11:46 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Wed, Aug 13, 2025 at 4:15 PM shveta malik <shveta.malik@gmail.com>
> > wrote:
> > >
> > > On Wed, Aug 13, 2025 at 10:41 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > >
> > > > Here is the V61 patch set which addressed above comments and the
> > comment by Nisha[2].
> > > >
> > >
> > > Thank You for the patch. I tested the patch, please find a few comments:
> > >
> > > 1)
> > > Now when it stops-retention and later resumes it due to the fact that
> > > max_duration is meanwhile altered to 0, I get log:
> > >
> > > LOG:  logical replication worker for subscription "sub1" resumes
> > > retaining the information for detecting conflicts
> > > DETAIL:  The time spent applying changes up to LSN 0/17DD728 is now
> > > within the maximum limit of 0 ms.
> > >
> > > I did not get which lsn it is pointing to? Is it some dangling lsn
> > > from when it was retaining info?  Also the msg looks odd, when it says
> > > 'is now within the maximum limit of 0 ms.'
> > >
> > > 2)
> > > While stopping the message is:
> > > LOG:  logical replication worker for subscription "sub1" will stop
> > > retaining conflict information
> > > DETAIL:  The time spent advancing the non-removable transaction ID has
> > > exceeded the maximum limit of 1000 ms.
> > >
> > > And while resuming:
> > > logical replication worker for subscription "sub1" resumes retaining
> > > the information for detecting conflicts
> > > ----------
> > >
> > > We can make both similar. Both can have 'retaining the information for
> > > detecting conflicts' instead of 'conflict information' in first one.
> > >
> > > 3)
> > > I believe the tenses should also be updated. When stopping, we can say:
> > >
> > > Logical replication worker for subscription "sub1" has stopped...
> > >
> > > This is appropriate because it has already stopped by pre-setting
> > > oldest_nonremovable_xid to Invalid.
> > >
> > > When resuming, we can say:
> > > Logical replication worker for subscription "sub1" will resume...
> > >
> > > This is because it will begin resuming from the next cycle onward,
> > > specifically after the launcher sets its oldest_xid.
> > >
> > > 4)
> > > For the DETAIL part of resume and stop messages, how about these:
> > >
> > > The retention duration for information used in conflict detection has
> > > exceeded the limit of xx.
> > > The retention duration for information used in conflict detection is
> > > now within the acceptable limit of xx.
> > > The retention duration for information used in conflict detection is
> > > now indefinite.
> > >
>
> Thanks for the comments, I have adjusted the log messages
> according to the suggestions.
>
>
> >
> > 5)
> > Say there 2-3 subs, all have stopped-retention and the slot is set to have invalid
> > xmin; now if I  create a new sub, it will start with stopped-flag set to true due to
> > the fact that slot has invalid xmin to begin with. But then immediately, it will
> > dump a resume message. It looks odd, as at first, it has not even stopped, as it
> > is a new sub.
> > Is there anything we can do to improve this situation?
>
> I changed the logic to recover the slot immediately on starting a new worker
> that has retain_dead_tuples enabled.
>
> Here is the V62 patch set which addressed above comments and [1].
>

Here are review comments on v62 patch:

Regarding the subscription-level option vs. GUC, I don't disagree with
the current approach.

For the record, while I agree that the subscription-level option is
more consistent with the existing retain_dead_tuples option and can
work for different requirements, my biggest concern is that if users
set different values to different subscriptions, they might think it
doesn't work as expected. This is because even if a subscription
disables retaining dead tuples due to max_conflict_retention_duration,
the database cluster doesn't stop retaining dead tuples unless all
other subscriptions disable it, meaning that the performance on that
database might not get recovered. I proposed the GUC parameter as I
thought it's less confusing from a user perspective. I'm not sure it's
sufficient to mention that in the documentation but I don't have a
better idea.

---
+        else if (IsSet(supported_opts, SUBOPT_MAX_CONFLICT_RETENTION_DURATION) &&
+                 strcmp(defel->defname, "max_conflict_retention_duration") == 0)
+        {
+            if (IsSet(opts->specified_opts, SUBOPT_MAX_CONFLICT_RETENTION_DURATION))
+                errorConflictingDefElem(defel, pstate);
+
+            opts->specified_opts |= SUBOPT_MAX_CONFLICT_RETENTION_DURATION;
+            opts->maxconflretention = defGetInt32(defel);
+        }

The new subscription parameter accepts only integers and interprets the value
as milliseconds, but I think it would be relatively rare for users to set this
parameter to less than 1 second. I guess it would be a good idea to also
accept a string representation of a duration, such as '10 min', as we do when
parsing GUC parameter values (a sketch follows below).
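
As a minimal sketch (not part of the posted patch), the option value could be
run through the existing GUC unit parser; parse_int(), defGetString() and
GUC_UNIT_MS exist in the backend today, while the helper name
def_get_duration_ms is made up here:

#include "postgres.h"

#include "commands/defrem.h"
#include "utils/guc.h"

/* Hypothetical helper: parse a DefElem as a duration in milliseconds. */
static int
def_get_duration_ms(DefElem *defel)
{
    const char *hintmsg;
    int         result;

    /* Accepts plain integers as well as values like '10 min' or '2h'. */
    if (!parse_int(defGetString(defel), &result, GUC_UNIT_MS, &hintmsg))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("invalid value for option \"%s\"", defel->defname),
                 hintmsg ? errhint("%s", _(hintmsg)) : 0));

    return result;
}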

---
+static void
+notify_ineffective_max_conflict_retention(bool update_maxconflretention)
+{
+   ereport(NOTICE,
+           errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+           update_maxconflretention
+           ? errmsg("max_conflict_retention_duration has no effect when retain_dead_tuples is disabled")
+           : errmsg("disabling retain_dead_tuples will render max_conflict_retention_duration ineffective"));
+}

Given that max_conflict_retention_duration works only when
retain_dead_tuples is enabled, why not merge these two parameters? For
example, setting max_conflict_retention_duration=-1 means to disable
retaining dead tuples behavior and =0 means that dead tuples are
retained until they are no longer needed for detection purposes.

---
+       /*
+        * Only the leader apply worker manages conflict retention (see
+        * maybe_advance_nonremovable_xid() for details).
+        */
+       if (!isParallelApplyWorker(&worker) && !isTablesyncWorker(&worker))
+           values[10] = TransactionIdIsValid(worker.oldest_nonremovable_xid);
+       else
+           nulls[10] = true;

I think that adding a new column to the pg_stat_subscription view
should be implemented in a separate patch since it needs to bump the
catalog version while introducing max_conflict_retention_duration
subscription option doesn't require that.

---
Even if an apply worker disables retaining dead tuples due to
max_conflict_retention_duration, it enables again after the server
restarts. I guess restarting a server doesn't necessarily mean that
the subscription caught up to the publisher and can resume retaining
dead tuples but is this behavior intentional?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Saturday, August 16, 2025 7:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Here are review comments on v62 patch:

Thanks for the comments!

> 
> 
> ---
> +                else if (IsSet(supported_opts,
> SUBOPT_MAX_CONFLICT_RETENTION_DURATION) &&
> +                                 strcmp(defel->defname,
> "max_conflict_retention_duration") == 0)
> +                {
> +                        if (IsSet(opts->specified_opts,
> SUBOPT_MAX_CONFLICT_RETENTION_DURATION))
> +                                errorConflictingDefElem(defel, pstate);
> +
> +                        opts->specified_opts |=
> SUBOPT_MAX_CONFLICT_RETENTION_DURATION;
> +                        opts->maxconflretention = defGetInt32(defel);
> +                }
> 
> The new subscription parameter accepts only integers and takes it as
> milliseconds, but I think it would be relatively rare that users
> specify this parameter to less than 1 second. I guess it would be a
> good idea to accept string representation of a duration too such as
> '10 min' like we do for parsing GUC parameter values.

We can consider implementing this. However, currently, other similar non-GUC
time-based options do not support unit specification, such as
autovacuum_vacuum_cost_delay and log_autovacuum_min_duration. As such, including
it in max_conf_xx_retention would require new parsing logic. Perhaps we can
treat this as a separate improvement and explore its implementation later, based
on user feedback ?

Best Regards,
Hou zj


Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Sat, Aug 16, 2025 at 5:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Regarding the subscription-level option vs. GUC, I don't disagree with
> the current approach.
>
> For the record, while I agree that the subscription-level option is
> more consistent with the existing retain_dead_tuples option and can
> work for different requirements, my biggest concern is that if users
> set different values to different subscriptions, they might think it
> doesn't work as expected. This is because even if a subscription
> disables retaining dead tuples due to max_conflict_retention_duration,
> the database cluster doesn't stop retaining dead tuples unless all
> other subscriptions disable it, meaning that the performance on that
> database might not get recovered. I proposed the GUC parameter as I
> thought it's less confusing from a user perspective. I'm not sure it's
> sufficient to mention that in the documentation but I don't have a
> better idea.
>

I think we might want to have a GUC as well in the future, as both
(subscription option and GUC) can be used in different scenarios, but it is
better to wait for some user feedback before going that far. We can document
this in the option to make users aware of how to use it in such situations.

>
> ---
> +static void
> +notify_ineffective_max_conflict_retention(bool update_maxconflretention)
> +{
> +   ereport(NOTICE,
> +           errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +           update_maxconflretention
> +           ? errmsg("max_conflict_retention_duration has no effect
> when retain_dead_tuples is disabled")
> +           : errmsg("disabling retain_dead_tuples will render
> max_conflict_retention_duration ineffective"));
> +}
>
> Given that max_conflict_retention_duration works only when
> retain_dead_tuples is enabled, why not merge these two parameters? For
> example, setting max_conflict_retention_duration=-1 means to disable
> retaining dead tuples behavior and =0 means that dead tuples are
> retained until they are no longer needed for detection purposes.
>

I think it can be a source of confusion for users, if not now, then in
the future. Consider that in the future, we also have a GUC to set the
retention duration which will be used for all subscriptions. Now, say,
we define the behaviour such that if this value is set for
subscription, then use that, otherwise, use the GUC value. Then, with
this proposal, if the user sets max_conflict_retention_duration=0, it
will lead to retaining tuples until they are no longer needed but with
the behaviour proposed in patch, one could have simply set
retain_dead_tuples=true and used the GUC value. I understand that it
is debatable how we will design the GUC behaviour in future but I used
it as an example how trying to achieve different things with one
option can be a source of confusion. Even if we decide not to
introduce GUC or define its behaviour differently, I find having
different options in this case is easy to understand and use.

> ---
> +       /*
> +        * Only the leader apply worker manages conflict retention (see
> +        * maybe_advance_nonremovable_xid() for details).
> +        */
> +       if (!isParallelApplyWorker(&worker) && !isTablesyncWorker(&worker))
> +           values[10] = TransactionIdIsValid(worker.oldest_nonremovable_xid);
> +       else
> +           nulls[10] = true;
>
> I think that adding a new column to the pg_stat_subscription view
> should be implemented in a separate patch since it needs to bump the
> catalog version while introducing max_conflict_retention_duration
> subscription option doesn't require that.
>

Won't the following change in pg_subscription.h anyway require us to bump
catversion?
@@ -81,6 +81,10 @@
CATALOG(pg_subscription,6100,SubscriptionRelationId)
BKI_SHARED_RELATION BKI_ROW
  bool subretaindeadtuples; /* True if dead tuples useful for
  * conflict detection are retained */

+ int32 maxconflretention; /* The maximum duration (in milliseconds)
+ * for which information useful for
+ * conflict detection can be retained */
+

> ---
> Even if an apply worker disables retaining dead tuples due to
> max_conflict_retention_duration, it enables again after the server
> restarts.
>

I also find this behaviour questionable because this also means that
it is possible that before restart one would deduce that the
update_deleted conflict won't be reliably detected for a particular
subscription but after restart it could lead to the opposite
conclusion. But note that to make it behave similarly we need to store
this value persistently in pg_subscription unless you have better
ideas for this. Theoretically, there are two places where we can
persist this information, one is with pg_subscription, and other in
origin. I find it is closer to pg_subscription.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Mon, Aug 18, 2025 at 10:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>

> > Given that max_conflict_retention_duration works only when
> > retain_dead_tuples is enabled, why not merge these two parameters? For
> > example, setting max_conflict_retention_duration=-1 means to disable
> > retaining dead tuples behavior and =0 means that dead tuples are
> > retained until they are no longer needed for detection purposes.
> >
>
> I think it can be a source of confusion for users, if not now, then in
> the future. Consider that in the future, we also have a GUC to set the
> retention duration which will be used for all subscriptions. Now, say,
> we define the behaviour such that if this value is set for
> subscription, then use that, otherwise, use the GUC value. Then, with
> this proposal, if the user sets max_conflict_retention_duration=0, it
> will lead to retaining tuples until they are no longer needed but with
> the behaviour proposed in patch, one could have simply set
> retain_dead_tuples=true and used the GUC value. I understand that it
> is debatable how we will design the GUC behaviour in future but I used
> it as an example how trying to achieve different things with one
> option can be a source of confusion. Even if we decide not to
> introduce GUC or define its behaviour differently, I find having
> different options in this case is easy to understand and use.

I agree that merging these 2 will create confusion and usability issues.

>
> > ---
> > +       /*
> > +        * Only the leader apply worker manages conflict retention (see
> > +        * maybe_advance_nonremovable_xid() for details).
> > +        */
> > +       if (!isParallelApplyWorker(&worker) && !isTablesyncWorker(&worker))
> > +           values[10] = TransactionIdIsValid(worker.oldest_nonremovable_xid);
> > +       else
> > +           nulls[10] = true;
> >
> > I think that adding a new column to the pg_stat_subscription view
> > should be implemented in a separate patch since it needs to bump the
> > catalog version while introducing max_conflict_retention_duration
> > subscription option doesn't require that.
> >
>
> Won't the following change in pg_subscription.h anyway require us to bump
> catversion?
> @@ -81,6 +81,10 @@
> CATALOG(pg_subscription,6100,SubscriptionRelationId)
> BKI_SHARED_RELATION BKI_ROW
>   bool subretaindeadtuples; /* True if dead tuples useful for
>   * conflict detection are retained */
>
> + int32 maxconflretention; /* The maximum duration (in milliseconds)
> + * for which information useful for
> + * conflict detection can be retained */
> +
>
> > ---
> > Even if an apply worker disables retaining dead tuples due to
> > max_conflict_retention_duration, it enables again after the server
> > restarts.
> >
>
> I also find this behaviour questionable because this also means that
> it is possible that before restart one would deduce that the
> update_deleted conflict won't be reliably detected for a particular
> subscription but after restart it could lead to the opposite
> conclusion. But note that to make it behave similarly we need to store
> this value persistently in pg_subscription unless you have better
> ideas for this. Theoretically, there are two places where we can
> persist this information, one is with pg_subscription, and other in
> origin. I find it is closer to pg_subscription.

I think it makes sense to store this in pg_subscription to preserve
the decision across restart.

--
Regards,
Dilip Kumar
Google



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Monday, August 18, 2025 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 
> On Mon, Aug 18, 2025 at 10:36 AM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > > ---
> > > Even if an apply worker disables retaining dead tuples due to
> > > max_conflict_retention_duration, it enables again after the server
> > > restarts.
> > >
> >
> > I also find this behaviour questionable because this also means that
> > it is possible that before restart one would deduce that the
> > update_deleted conflict won't be reliably detected for a particular
> > subscription but after restart it could lead to the opposite
> > conclusion. But note that to make it behave similarly we need to store
> > this value persistently in pg_subscription unless you have better
> > ideas for this. Theoretically, there are two places where we can
> > persist this information, one is with pg_subscription, and other in
> > origin. I find it is closer to pg_subscription.
> 
> I think it makes sense to store this in pg_subscription to preserve the decision
> across restart.

Thanks for sharing the opinion!

Regarding this, I'd like to clarify some implementation details for persisting the
retention status in pg_subscription.

Since the logical launcher does not connect to a specific database, it cannot
update the catalog, as this would trigger a FATAL error (e.g.,
CatalogTupleUpdate -> ... -> ScanPgRelation -> FATAL: cannot read pg_class
without having selected a database). Therefore, the apply worker should take
responsibility for updating the catalog.

To achieve that, ideally, the apply worker should update pg_subscription in a
separate transaction, rather than using the transaction started during the
application of changes. This implies that we must wait for the current
transaction to complete before proceeding with the catalog update. So I think we
could add an additional phase, RDT_MARK_RETENTION_INACTIVE, to manage the
catalog update once the existing transaction finishes.
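
For illustration only, the catalog update in such a dedicated transaction
could follow the usual syscache-update pattern; the subretentionactive column
and its Anum_pg_subscription_subretentionactive constant are proposed by the
patch and do not exist in committed code, so this is just a sketch:

#include "postgres.h"

#include "access/htup_details.h"
#include "access/table.h"
#include "access/xact.h"
#include "catalog/indexing.h"
#include "catalog/pg_subscription.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"

/* Sketch: run by the apply worker, outside the change-apply transaction. */
static void
mark_retention_inactive(Oid subid)
{
    Relation    rel;
    HeapTuple   tup;
    Datum       values[Natts_pg_subscription];
    bool        nulls[Natts_pg_subscription];
    bool        replaces[Natts_pg_subscription];

    StartTransactionCommand();
    PushActiveSnapshot(GetTransactionSnapshot());

    rel = table_open(SubscriptionRelationId, RowExclusiveLock);

    tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid));
    if (!HeapTupleIsValid(tup))
        elog(ERROR, "cache lookup failed for subscription %u", subid);

    memset(values, 0, sizeof(values));
    memset(nulls, false, sizeof(nulls));
    memset(replaces, false, sizeof(replaces));

    /* Hypothetical column added by the patch under discussion. */
    values[Anum_pg_subscription_subretentionactive - 1] = BoolGetDatum(false);
    replaces[Anum_pg_subscription_subretentionactive - 1] = true;

    tup = heap_modify_tuple(tup, RelationGetDescr(rel),
                            values, nulls, replaces);
    CatalogTupleUpdate(rel, &tup->t_self, tup);
    heap_freetuple(tup);

    table_close(rel, NoLock);

    PopActiveSnapshot();
    CommitTransactionCommand();
}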

If we proceed in this manner, it suggests that the apply worker could set the
shared memory flag first and then the catalog flag. So, if the apply worker
encounters an error after setting the shared memory flag but before updating
the catalog, it may lead to issues similar to the one mentioned by Sawada-San,
e.g., the apply worker restarts but retains the dead tuples again because the
status had not been persisted. This seems like a rare case, so I'm not sure
whether it's necessary to address it. If we do decide to handle it, we could
update the catalog upon an ERROR using a PG_CATCH block, similar to
DisableSubscriptionAndExit().

Another way could be to remove the shared flag and simply depend on the
catalog flag. The launcher would only check the retention status in the
catalog to decide whether to invalidate the slot or skip collecting the
oldest_xid for an apply worker. However, this approach could retain the dead
tuples for longer than the value specified in the max_retention option, due to
waiting for a large transaction to finish before updating the catalog.

What do you think ?

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Mon, Aug 18, 2025 at 5:05 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, August 18, 2025 2:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Aug 18, 2025 at 10:36 AM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > > ---
> > > > Even if an apply worker disables retaining dead tuples due to
> > > > max_conflict_retention_duration, it enables again after the server
> > > > restarts.
> > > >
> > >
> > > I also find this behaviour questionable because this also means that
> > > it is possible that before restart one would deduce that the
> > > update_deleted conflict won't be reliably detected for a particular
> > > subscription but after restart it could lead to the opposite
> > > conclusion. But note that to make it behave similarly we need to store
> > > this value persistently in pg_subscription unless you have better
> > > ideas for this. Theoretically, there are two places where we can
> > > persist this information, one is with pg_subscription, and other in
> > > origin. I find it is closer to pg_subscription.
> >
> > I think it makes sense to store this in pg_subscription to preserve the decision
> > across restart.
>
> Thanks for sharing the opinion!
>
> Regarding this, I'd like to clarify some implementation details for persisting the
> retention status in pg_subscription.
>
> Since the logical launcher does not connect to a specific database, it cannot
> update the catalog, as this would trigger a FATAL error (e.g.,
> CatalogTupleUpdate -> ... -> ScanPgRelation -> FATAL: cannot read pg_class
> without having selected a database). Therefore, the apply worker should take
> responsibility for updating the catalog.
>
> To achieve that, ideally, the apply worker should update pg_subscription in a
> separate transaction, rather than using the transaction started during the
> application of changes. This implies that we must wait for the current
> transaction to complete before proceeding with the catalog update. So I think we
> could an additional phase, RDT_MARK_RETENTION_INACTIVE, to manage the
> catalog update once the existing transaction finishes.
>
> If we proceed in this manner, it suggests that the apply worker could set the
> shared memory flag first and then catalog flag. So, if the apply worker
> encounters an error after setting the shared memory flag but before updating the
> catalog, it may lead to issues similar to the one mentioned by Sawada-San,
> e.g., the apply worker restart but would retain the dead tuples again because
> the status had not persisted.

In this approach, why do we need to set the shared memory flag in the
first place; can't we rely on the catalog values? I understand there is some
delay between when we detect that retention should stop and when we actually
update the catalog, but it shouldn't be big enough to matter for small
transactions because we will update the catalog at the next transaction
boundary. For large transactions, we can always update it at the next
stream_stop message.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Sun, Aug 17, 2025 at 10:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Aug 16, 2025 at 5:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Regarding the subscription-level option vs. GUC, I don't disagree with
> > the current approach.
> >
> > For the record, while I agree that the subscription-level option is
> > more consistent with the existing retain_dead_tuples option and can
> > work for different requirements, my biggest concern is that if users
> > set different values to different subscriptions, they might think it
> > doesn't work as expected. This is because even if a subscription
> > disables retaining dead tuples due to max_conflict_retention_duration,
> > the database cluster doesn't stop retaining dead tuples unless all
> > other subscriptions disable it, meaning that the performance on that
> > database might not get recovered. I proposed the GUC parameter as I
> > thought it's less confusing from a user perspective. I'm not sure it's
> > sufficient to mention that in the documentation but I don't have a
> > better idea.
> >
>
> I think we might want to have a GUC as well in the future as both
> (subscription option and GUC) can be used in different scenarios but
> better to wait for some user feedback before going that far. We can
> document this in the option to make users aware how to use it in such
> situations.

Okay.

>
> >
> > ---
> > +static void
> > +notify_ineffective_max_conflict_retention(bool update_maxconflretention)
> > +{
> > +   ereport(NOTICE,
> > +           errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +           update_maxconflretention
> > +           ? errmsg("max_conflict_retention_duration has no effect
> > when retain_dead_tuples is disabled")
> > +           : errmsg("disabling retain_dead_tuples will render
> > max_conflict_retention_duration ineffective"));
> > +}
> >
> > Given that max_conflict_retention_duration works only when
> > retain_dead_tuples is enabled, why not merge these two parameters? For
> > example, setting max_conflict_retention_duration=-1 means to disable
> > retaining dead tuples behavior and =0 means that dead tuples are
> > retained until they are no longer needed for detection purposes.
> >
>
> I think it can be a source of confusion for users, if not now, then in
> the future. Consider that in the future, we also have a GUC to set the
> retention duration which will be used for all subscriptions. Now, say,
> we define the behaviour such that if this value is set for
> subscription, then use that, otherwise, use the GUC value. Then, with
> this proposal, if the user sets max_conflict_retention_duration=0, it
> will lead to retaining tuples until they are no longer needed but with
> the behaviour proposed in patch, one could have simply set
> retain_dead_tuples=true and used the GUC value. I understand that it
> is debatable how we will design the GUC behaviour in future but I used
> it as an example how trying to achieve different things with one
> option can be a source of confusion. Even if we decide not to
> introduce GUC or define its behaviour differently, I find having
> different options in this case is easy to understand and use.

Agreed.

>
> > ---
> > +       /*
> > +        * Only the leader apply worker manages conflict retention (see
> > +        * maybe_advance_nonremovable_xid() for details).
> > +        */
> > +       if (!isParallelApplyWorker(&worker) && !isTablesyncWorker(&worker))
> > +           values[10] = TransactionIdIsValid(worker.oldest_nonremovable_xid);
> > +       else
> > +           nulls[10] = true;
> >
> > I think that adding a new column to the pg_stat_subscription view
> > should be implemented in a separate patch since it needs to bump the
> > catalog version while introducing max_conflict_retention_duration
> > subscription option doesn't require that.
> >
>
> Won't the following change in pg_subscription.h anyway require us to bump
> catversion?

Oops, you're right. I missed that part.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Monday, August 18, 2025 8:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Mon, Aug 18, 2025 at 5:05 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Monday, August 18, 2025 2:32 PM Dilip Kumar <dilipbalaut@gmail.com>
> wrote:
> > >
> > > On Mon, Aug 18, 2025 at 10:36 AM Amit Kapila
> <amit.kapila16@gmail.com>
> > > wrote:
> > > >
> > > > > ---
> > > > > Even if an apply worker disables retaining dead tuples due to
> > > > > max_conflict_retention_duration, it enables again after the server
> > > > > restarts.
> > > > >
> > > >
> > > > I also find this behaviour questionable because this also means that
> > > > it is possible that before restart one would deduce that the
> > > > update_deleted conflict won't be reliably detected for a particular
> > > > subscription but after restart it could lead to the opposite
> > > > conclusion. But note that to make it behave similarly we need to store
> > > > this value persistently in pg_subscription unless you have better
> > > > ideas for this. Theoretically, there are two places where we can
> > > > persist this information, one is with pg_subscription, and other in
> > > > origin. I find it is closer to pg_subscription.
> > >
> > > I think it makes sense to store this in pg_subscription to preserve the
> decision
> > > across restart.
> >
> > Thanks for sharing the opinion!
> >
> > Regarding this, I'd like to clarify some implementation details for persisting
> the
> > retention status in pg_subscription.
> >
> > Since the logical launcher does not connect to a specific database, it cannot
> > update the catalog, as this would trigger a FATAL error (e.g.,
> > CatalogTupleUpdate -> ... -> ScanPgRelation -> FATAL: cannot read
> pg_class
> > without having selected a database). Therefore, the apply worker should take
> > responsibility for updating the catalog.
> >
> > To achieve that, ideally, the apply worker should update pg_subscription in a
> > separate transaction, rather than using the transaction started during the
> > application of changes. This implies that we must wait for the current
> > transaction to complete before proceeding with the catalog update. So I think we
> > could add an additional phase, RDT_MARK_RETENTION_INACTIVE, to manage the
> > catalog update once the existing transaction finishes.
> >
> > If we proceed in this manner, it suggests that the apply worker could set the
> > shared memory flag first and then catalog flag. So, if the apply worker
> > encounters an error after setting the shared memory flag but before updating
> the
> > catalog, it may lead to issues similar to the one mentioned by Sawada-San,
> > e.g., the apply worker restart but would retain the dead tuples again because
> > the status had not persisted.
> 
> In this approach, why do we need to set the shared memory flag in the
> first place, can't we rely on the catalog values? I understand there
> is some delay when we detect to stop retention and when we actually
> update the catalog but it shouldn't be big enough to matter for small
> transactions because we will update it at the next transaction
> boundary. For large transactions, we can always update it at the next
> stream_stop message.

I agree. Here is V63 version which implements this approach.

The retention status is recorded in the pg_subscription catalog
(subretentionactive) to prevent unnecessary retention initiation upon server
restarts. The apply worker is responsible for updating this flag based on the
retention duration. Meanwhile, the column is set to true when retain_dead_tuples
is enabled or when creating a new subscription with retain_dead_tuples enabled,
and it is set to false when retain_dead_tuples is disabled.

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Wed, Aug 20, 2025 at 12:12 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> I agree. Here is V63 version which implements this approach.
>

Thank You for the patches.

> The retention status is recorded in the pg_subscription catalog
> (subretentionactive) to prevent unnecessary retention initiation upon server
> restarts. The apply worker is responsible for updating this flag based on the
> retention duration. Meanwhile, the column is set to true when retain_dead_tuples
> is enabled or when creating a new subscription with retain_dead_tuples enabled,
> and it is set to false when retain_dead_tuples is disabled.
>

+1 on the idea.

Please find some initial testing feedback:

1)
When it stops, it does not resume until we restart the server. It keeps
on waiting in wait_for_publisher_status and it never receives one.

2)
When we do: alter subscription sub1 set (max_conflict_retention_duration=0);

It does not resume in this scenario either;
should_resume_retention_immediately() does not return true because it is
still waiting for the publisher status.

3)
AlterSubscription():
 * retention will be stopped gain soon in such cases, and

stopped gain --> stopped again

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Nisha Moond
Дата:
On Wed, Aug 20, 2025 at 12:12 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> I agree. Here is V63 version which implements this approach.
>

Thank you Hou-san for the patches. Here are couple of comments:

1) Once retention is stopped for all subscriptions and
conflict_slot.xmin is reset to NULL, we are no longer retaining dead
tuples. In that case, the warning shown during subscription disable
looks misleading.

For example sub has already stopped the retention and when disabled -
postgres=# alter subscription sub1 disable;
WARNING:  deleted rows to detect conflicts would not be removed until
the subscription is enabled
HINT:  Consider setting retain_dead_tuples to false.
ALTER SUBSCRIPTION

I think we should check if retention is active or not here.

2) Regarding the logic in the launcher for advancing the slot’s xmin:
Consider a case where two subscriptions exist, and one of them is
disabled after it has already stopped retention.
Example subscriptions in state:
 subname | subenabled | subretaindeadtuples | submaxconflretention | subretentionactive
---------+------------+---------------------+----------------------+--------------------
sub1    | t          | t                   |                  100 | t
sub2    | f          | t                   |                  100 | f

Here, sub2 is disabled, and since subretentionactive = 'f', it is not
retaining dead tuples anymore. But, the current launcher logic still
blocks xmin advancement as one of the subscriptions with
retain_dead_tuples is disabled.
I think the launcher should consider the subretentionactive value and
the xmin should be allowed to advance. Thoughts?

--
Thanks,
Nisha



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Thursday, August 21, 2025 3:47 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
> 
> On Wed, Aug 20, 2025 at 12:12 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > I agree. Here is V63 version which implements this approach.
> >
> 
> Thank you Hou-san for the patches. Here are couple of comments:
> 
> 1) Once retention is stopped for all subscriptions and conflict_slot.xmin is
> reset to NULL, we are no longer retaining dead tuples. In that case, the warning
> shown during subscription disable looks misleading.
> 
> For example sub has already stopped the retention and when disabled -
> postgres=# alter subscription sub1 disable;
> WARNING:  deleted rows to detect conflicts would not be removed until the
> subscription is enabled
> HINT:  Consider setting retain_dead_tuples to false.
> ALTER SUBSCRIPTION
> 
> I think we should check if retention is active or not here.
> 
> 2) Regarding the logic in the launcher for advancing the slot’s xmin:
> Consider a case where two subscriptions exist, and one of them is disabled
> after it has already stopped retention.
> Example subscriptions in state:
> ... 
> Here, sub2 is disabled, and since subretentionactive = 'f', it is not retaining
> dead tuples anymore. But, the current launcher logic still blocks xmin
> advancement as one of the subscriptions with retain_dead_tuples is disabled.
> I think the launcher should consider the subretentionactive value and the xmin
> should be allowed to advance. Thoughts?

I agree that retentionactive needs to be checked in the cases mentioned above.
Here is the V64 patch set addressing this concern. This version also resolves
the bug reported by Shveta[1], where retention could not resume and was stuck
waiting for the publisher status.

In addition, I also improved the comments related to the new phases and
retentionactive flag.

[1] https://www.postgresql.org/message-id/CAJpy0uCP7x_pdVysYohvrjpk0Vtmk36%2BXfnC_DOPiegekxfBLA%40mail.gmail.com

Best Regards,
Hou zj




Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Thursday, August 21, 2025 2:01 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Wed, Aug 20, 2025 at 12:12 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> >
> > I agree. Here is V63 version which implements this approach.
> >
> 
> Thank You for the patches.
> 
> > The retention status is recorded in the pg_subscription catalog
> > (subretentionactive) to prevent unnecessary retention initiation upon
> > server restarts. The apply worker is responsible for updating this
> > flag based on the retention duration. Meanwhile, the column is set to
> > true when retain_dead_tuples is enabled or when creating a new
> > subscription with retain_dead_tuples enabled, and it is set to false when
> retain_dead_tuples is disabled.
> >
> 
> +1 on the idea.
> 
> Please find few initial testing feedback:

Thanks for the comments.

> 
> 1)
> When it stops, it does not resume until we restart the server. It keeps on waiting
> in wait_for_publisher_status and it never receives one.
> 
> 2)
> When we do: alter subscription sub1 set (max_conflict_retention_duration=0);
> 
> It does not resume in this scenario too.
> should_resume_retention_immediately() does not return true due to
> wait-status on publisher.

Fixed in the V64 patches.


> 3)
> AlterSubscription():
>  * retention will be stopped gain soon in such cases, and
> 
> stopped gain --> stopped again

Sorry, I missed this typo in V64, I will fix it in the next version.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Mihail Nikalayeu
Дата:
Hello!

Sorry for being noisy - just a reminder: the conflict detection system
(including the new update_deleted) is giving invalid reports because of
the issue related to SnapshotDirty vs. btree, so it is not possible to
rely on it at the moment.

You may check the TAP tests here [0] and some explanation here [1] and
here [2]. Set $simulate_race_condition to disable the race condition in
the tests.

Invalid conflicts so far (valid/detected):
* delete_origin_differs/delete_missing
* update_origin_differs/update_missing
* update_origin_differs/update_deleted

Best regards,
Mikhail.

[0]: https://commitfest.postgresql.org/patch/5151/
[1]:
https://www.postgresql.org/message-id/flat/CADzfLwWC49oanFSGPTf%3D6FJoTw-kAnpPZV8nVqAyR5KL68LrHQ%40mail.gmail.com#5f6b3be849f8d95c166decfae541df09
[2]:
https://www.postgresql.org/message-id/flat/CADzfLwWuXh8KO%3DOZvB71pZnQ8nH0NYXfuGbFU6FBiVZUbmuFGg%40mail.gmail.com#76f98a9ae3479bbaf5ee9262322d466e



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Thu, Aug 21, 2025 at 2:09 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, August 21, 2025 2:01 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Wed, Aug 20, 2025 at 12:12 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > >
> > > I agree. Here is V63 version which implements this approach.
> > >
> >
> > Thank You for the patches.
> >
> > > The retention status is recorded in the pg_subscription catalog
> > > (subretentionactive) to prevent unnecessary retention initiation upon
> > > server restarts. The apply worker is responsible for updating this
> > > flag based on the retention duration. Meanwhile, the column is set to
> > > true when retain_dead_tuples is enabled or when creating a new
> > > subscription with retain_dead_tuples enabled, and it is set to false when
> > retain_dead_tuples is disabled.
> > >
> >
> > +1 on the idea.
> >
> > Please find few initial testing feedback:
>
> Thanks for the comments.
>
> >
> > 1)
> > When it stops, it does not resume until we restart the server. It keeps on waiting
> > in wait_for_publisher_status and it never receives one.
> >
> > 2)
> > When we do: alter subscription sub1 set (max_conflict_retention_duration=0);
> >
> > It does not resume in this scenario too.
> > should_resume_retention_immediately() does not return true due to
> > wait-status on publisher.
>
> Fixed in the V64 patches.
>
>
> > 3)
> > AlterSubscription():
> >  * retention will be stopped gain soon in such cases, and
> >
> > stopped gain --> stopped again
>
> Sorry, I missed this typo in V64, I will fix it in the next version.
>

Sure. Thanks.
Please find a few more comments:

1)
There is an issue with retention resumption. The issue is observed in a
multi pub-sub setup where one sub is retaining info while another has
stopped retention. Now even if I set
max_conflict_retention_duration=0 for the one which has stopped
retention, it does not resume. I have attached the steps in the txt file.

2)
In the same test case, sub1 does not resume otherwise either, i.e. even if
we do not set max_conflict_retention_duration to 0, it should resume after
a while since there is no other transaction on the publisher stuck in the
commit phase. In a single pub-sub setup it works well; only the multi
pub-sub setup has this issue.

3)
ApplyLauncherMain() has some processing under 'if (sub->retaindeadtuples)'
that depends on sub->retentionactive. Would it be better to write it as:

    if (sub->retaindeadtuples)
    {
        retain_dead_tuples = true;
        CreateConflictDetectionSlot();

        if (sub->retentionactive)
        {
            retention_inactive = false;
            can_advance_xmin &= sub->enabled;

            if (!TransactionIdIsValid(MyReplicationSlot->data.xmin))
                init_conflict_slot_xmin();
        }
    }

Keeping all the 'sub->retentionactive' based logic under one 'if' would be
easier to understand.

thanks
Shveta

Вложения

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Thu, Aug 21, 2025 at 2:01 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is the V64 patch set addressing this concern.
>

Few minor comments:
1.
static void
 process_syncing_tables_for_sync(XLogRecPtr current_lsn)
 {
- SpinLockAcquire(&MyLogicalRepWorker->relmutex);
+ SpinLockAcquire(&MyLogicalRepWorker->mutex);

Why is this change part of this patch? Please extract it as a separate
patch unless this change is related to this patch.

2.
 pg_stat_get_subscription(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_SUBSCRIPTION_COLS 10
+#define PG_STAT_GET_SUBSCRIPTION_COLS 11
  Oid subid = PG_ARGISNULL(0) ? InvalidOid : PG_GETARG_OID(0);
  int i;
  ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
@@ -1595,6 +1614,22 @@ pg_stat_get_subscription(PG_FUNCTION_ARGS)
  elog(ERROR, "unknown worker type");
  }

+ /*
+ * Use the worker's oldest_nonremovable_xid instead of
+ * pg_subscription.subretentionactive to determine whether retention
+ * is active, as retention resumption might not be complete even when
+ * subretentionactive is set to true; this is because the launcher
+ * assigns the initial oldest_nonremovable_xid after the apply worker
+ * updates the catalog (see resume_conflict_info_retention).
+ *
+ * Only the leader apply worker manages conflict retention (see
+ * maybe_advance_nonremovable_xid() for details).
+ */
+ if (!isParallelApplyWorker(&worker) && !isTablesyncWorker(&worker))
+ values[10] = TransactionIdIsValid(worker.oldest_nonremovable_xid);
+ else

The theory given in the comment sounds good to me but I still suggest
it is better to extract it into a separate patch, so that we can
analyse/test it separately. Also, it will reduce the patch size as
well.

3.
  /* Ensure that we can enable retain_dead_tuples */
  if (opts.retaindeadtuples)
- CheckSubDeadTupleRetention(true, !opts.enabled, WARNING);
+ CheckSubDeadTupleRetention(true, !opts.enabled, WARNING, true);
+
+ /* Notify that max_conflict_retention_duration is ineffective */
+ else if (opts.maxconflretention)
+ notify_ineffective_max_conflict_retention(true);

Can't we combine these checks by passing both parameters to
CheckSubDeadTupleRetention() and let that function handle all
inappropriate value cases? BTW, even for other places, see if you can
reduce the length of the function name
notify_ineffective_max_conflict_retention.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Mon, Aug 4, 2025 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 4, 2025 at 11:46 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > 7)
> > Shall we rename 'max_conflict_retention_duration' to
> > 'max_conflict_info_retention_duration' as the latter one is more
> > clear?
> >
>
> Before bikeshedding on the name of this option, I would like us to
> once again consider whether we should provide this option at
> subscription-level or GUC?
>

Now that we have decided to go with the subscription option, another
alternative name for this new option could be max_retention_duration. The
explanation should clarify that it is used with the retain_dead_tuples
option. The other proposed names appear a bit long to me.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Fri, Aug 22, 2025 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 4, 2025 at 3:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Aug 4, 2025 at 11:46 AM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > 7)
> > > Shall we rename 'max_conflict_retention_duration' to
> > > 'max_conflict_info_retention_duration' as the latter one is more
> > > clear?
> > >
> >
> > Before bikeshedding on the name of this option, I would like us to
> > once again consider whether we should provide this option at
> > subscription-level or GUC?
> >
>
> Now that we decided that we would like to go with the subscription
> option. The other alternative to name this new option could be
> max_retention_duration. The explanation should clarify that it is used
> with the retain_dead_tuples option. I think the other proposed names
> appear a bit long to me.
>

'max_retention_duration' looks good to me.

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Friday, August 22, 2025 7:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Aug 21, 2025 at 2:01 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Here is the V64 patch set addressing this concern.
> >
> 
> Few minor comments:
> 1.
> static void
>  process_syncing_tables_for_sync(XLogRecPtr current_lsn)
>  {
> - SpinLockAcquire(&MyLogicalRepWorker->relmutex);
> + SpinLockAcquire(&MyLogicalRepWorker->mutex);
> 
> Why is this change part of this patch? Please extract it as a separate
> patch unless this change is related to this patch.

Removed these changes for now, will post again once the
main patches get pushed.

> 
> 2.
>  pg_stat_get_subscription(PG_FUNCTION_ARGS)
>  {
> -#define PG_STAT_GET_SUBSCRIPTION_COLS 10
> +#define PG_STAT_GET_SUBSCRIPTION_COLS 11
>   Oid subid = PG_ARGISNULL(0) ? InvalidOid : PG_GETARG_OID(0);
>   int i;
>   ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
> @@ -1595,6 +1614,22 @@ pg_stat_get_subscription(PG_FUNCTION_ARGS)
>   elog(ERROR, "unknown worker type");
>   }
> 
> + /*
> + * Use the worker's oldest_nonremovable_xid instead of
> + * pg_subscription.subretentionactive to determine whether retention
> + * is active, as retention resumption might not be complete even when
> + * subretentionactive is set to true; this is because the launcher
> + * assigns the initial oldest_nonremovable_xid after the apply worker
> + * updates the catalog (see resume_conflict_info_retention).
> + *
> + * Only the leader apply worker manages conflict retention (see
> + * maybe_advance_nonremovable_xid() for details).
> + */
> + if (!isParallelApplyWorker(&worker) && !isTablesyncWorker(&worker))
> + values[10] = TransactionIdIsValid(worker.oldest_nonremovable_xid);
> + else
> 
> The theory given in the comment sounds good to me but I still suggest
> it is better to extract it into a separate patch, so that we can
> analyse/test it separately. Also, it will reduce the patch size as
> well.

OK, I have moved these changes into the 0003 patch in the
latest version.

> 
> 3.
>   /* Ensure that we can enable retain_dead_tuples */
>   if (opts.retaindeadtuples)
> - CheckSubDeadTupleRetention(true, !opts.enabled, WARNING);
> + CheckSubDeadTupleRetention(true, !opts.enabled, WARNING, true);
> +
> + /* Notify that max_conflict_retention_duration is ineffective */
> + else if (opts.maxconflretention)
> + notify_ineffective_max_conflict_retention(true);
> 
> Can't we combine these checks by passing both parameters to
> CheckSubDeadTupleRetention() and let that function handle all
> inappropriate value cases? BTW, even for other places, see if you can
> reduce the length of the function name
> notify_ineffective_max_conflict_retention.

Attach the V65 patch set which addressed above and
Shveta's comments[1].

[1] https://www.postgresql.org/message-id/CAJpy0uBFB6K2ZoLebLCBfG%2B2edu63dU5oS1C6MqcnfcQj4CofQ%40mail.gmail.com

Best Regards,
Hou zj


Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Mon, Aug 25, 2025 at 10:06 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the V65 patch set which addressed above and
> Shveta's comments[1].
>

Thank You for the patches, please find a few comments on v64 itself (I
think valid on v65 as well):

1) in resume_conflict_info_retention(), shall we rewrite the code as:

resume_conflict_info_retention(RetainDeadTuplesData *rdt_data)
{
   if (TransactionIdIsValid(MyLogicalRepWorker->oldest_nonremovable_xid))
   {
       Assert(MySubscription->retentionactive);
       reset_retention_data_fields(rdt_data);

       /* process the next phase */
        process_rdt_phase_transition(rdt_data, false);
        return;
    }

--rest of the code to start a transaction and update the catalog to
set retentionactive=true.
}

When 'MyLogicalRepWorker->oldest_nonremovable_xid' is valid, it means either
that retention has not been stopped yet (due to the inability to start a
txn), or that we are at the stage of resume (subsequent calls) where the
catalog has been updated and the launcher has initialized slot.xmin and
assigned it to oldest_nonremovable_xid. In both cases 'retentionactive'
should be true by now, and thus the Assert can be added (as above). The
dependency on 'wait_for_initial_xid' can be removed if we do it like this.

2)
Also, I feel the code part which does txn and snapshot processing
(Start, Push, Pop, Commit, launcher-wakeup) and update catalog can be
shifted to one function and both stop_conflict_info_retention() and
resume_conflict_info_retention() can call it.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Mon, Aug 25, 2025 at 12:09 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Mon, Aug 25, 2025 at 10:06 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Attach the V65 patch set which addressed above and
> > Shveta's comments[1].
> >
>
> Thank You for the patches, please find a few comments on v64 itself (I
> think valid on v65 as well):
>

All previously reported bugs seem to be fixed on the latest patch.

Please find a few comments on v65 though:

1)

 The comment atop adjust_xid_advance_interval() says:
~~
* The interval is reset to the lesser of 100ms and
* max_conflict_retention_duration once there is some activities on the node.
~~

 But we are not taking the minimum of the two values when we find a new xid
(i.e., when we see activity on the node):

                /*
                 * A new transaction ID was found or the interval is not yet
                 * initialized, so set the interval to the minimum value.
                 */
                rdt_data->xid_advance_interval = MIN_XID_ADVANCE_INTERVAL;

2)
* max_conflict_retention_duration once there is some activities on the node.

activities-->activity

3)
After adding some logging, when all subs have stopped retention, the values are:

LOG:  ***** can_advance_xmin:1, retention_inactive:0, xmin:0
LOG:  ***** can_advance_xmin:1, retention_inactive:0, xmin:0

can_advance_xmin:1 and xmin=0 seem contradictory. Can we do something
here to improve this situation?

4)
Also, I fail to think of a scenario where retention_inactive will be
useful, as everything seems to be handled using can_advance_xmin already,
i.e., the slot is updated to have an invalid xmin only through
'can_advance_xmin', without relying on retention_inactive.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Mon, Aug 25, 2025 at 10:06 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Attach the V65 patch set which addressed above and
> Shveta's comments[1].
>

A few comments on 0001:
1.
- if (opts.retaindeadtuples)
- CheckSubDeadTupleRetention(true, !sub->enabled, NOTICE);
+ CheckSubDeadTupleRetention(true, !sub->enabled, NOTICE,
+    opts.retaindeadtuples,
+    retention_active, false,
+    sub->maxconflretention);

  /*
  * Notify the launcher to manage the replication slot for
@@ -1434,6 +1487,20 @@ AlterSubscription(ParseState *pstate,
AlterSubscriptionStmt *stmt,

  check_pub_rdt = opts.retaindeadtuples;
  retain_dead_tuples = opts.retaindeadtuples;
+
+ ineffective_maxconflretention = (!opts.retaindeadtuples &&
+ sub->maxconflretention);

Why can't we handle this special ineffective_maxconflretention case
inside CheckSubDeadTupleRetention? If so, then we can directly give
the NOTICE in case of SUBOPT_MAX_CONFLICT_RETENTION_DURATION without
having a separate notify_ineffective_max_retention() function.

2.
- if (sub->retaindeadtuples && can_advance_xmin)
+ if (sub->retaindeadtuples && sub->retentionactive &&
+ can_advance_xmin)

This coding pattern looks odd; you can have one condition per line.

3. Are we setting retention_inactive in launcher.c to true ever?

4.
this is because the launcher assigns
+ * the initial oldest_nonremovable_xid after the apply worker updates the
+ * catalog (see resume_conflict_info_retention).

I don't see resume_conflict_info_retention in 0001, so I couldn't make
sense of this part of the comment.

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Mon, Aug 25, 2025 at 5:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> A few comments on 0001:
>

Some more comments:
1.
+ /*
+ * Return false if the leader apply worker has stopped retaining
+ * information for detecting conflicts. This implies that update_deleted
+ * can no longer be reliably detected.
+ */
+ if (!retention_active)
+ return false;
+
  /*
  * For conflict detection, we use the conflict slot's xmin value instead
  * of invoking GetOldestNonRemovableTransactionId(). The slot.xmin acts as
@@ -3254,7 +3315,15 @@ FindDeletedTupleInLocalRel(Relation localrel,
Oid localidxoid,
  oldestxmin = slot->data.xmin;
  SpinLockRelease(&slot->mutex);

- Assert(TransactionIdIsValid(oldestxmin));
+ /*
+ * Return false if the conflict detection slot.xmin is set to
+ * InvalidTransactionId. This situation arises if the current worker is
+ * either a table synchronization or parallel apply worker, and the leader
+ * stopped retention immediately after checking the
+ * oldest_nonremovable_xid above.
+ */
+ if (!TransactionIdIsValid(oldestxmin))
+ return false;

If the current worker is tablesync or parallel_apply, it should have
exited from the above check of retention_active as we get the leader's
oldest_nonremovable_xid to decide that. What am I missing? This made
me wonder whether we need to use slot's xmin after we have fetched
leader's oldest_nonremovable_xid to find the deleted tuple?

2.
- * The interval is reset to a minimum value of 100ms once there is some
- * activity on the node.
+ * The interval is reset to the lesser of 100ms and
+ * max_conflict_retention_duration once there is some activities on the node.
AFAICS, this is not adhered to in the code, because you are using it when
there is no activity, aka when new_xid_found is false. Is the comment
wrong, or does the code need updating?

3.
+
+ /* Ensure the wait time remains within the maximum limit */
+ rdt_data->xid_advance_interval = Min(rdt_data->xid_advance_interval,
+ MySubscription->maxconflretention);
Can't we combine it with the calculation of max_interval a few lines above
this change? And also adjust the comments atop
adjust_xid_advance_interval() accordingly?

4.
  if (am_leader_apply_worker() &&
- MySubscription->retaindeadtuples &&
+ MySubscription->retaindeadtuples && MySubscription->retentionactive &&
  !TransactionIdIsValid(MyLogicalRepWorker->oldest_nonremovable_xid))

I think this code would look neater with one condition per line.

Apart from above comments, I have tried to improve some code comments
in the attached.

--
With Regards,
Amit Kapila.

Вложения

Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Tue, Aug 26, 2025 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 25, 2025 at 5:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >

Some comments on latest patch
0001:

1.
+          <para>
+           Note that setting a non-zero value for this option could lead to
+           information for conflict detection being removed prematurely,
+           potentially missing some conflict detections.
+          </para>

We can improve this wording by saying "potentially incorrectly
detecting some conflict"

2.
@@ -1175,6 +1198,8 @@ AlterSubscription(ParseState *pstate,
AlterSubscriptionStmt *stmt,
  bool update_failover = false;
  bool update_two_phase = false;
  bool check_pub_rdt = false;
+ bool ineffective_maxconflretention = false;
+ bool update_maxretention = false;

For making variable names more consistent, better to change
'ineffective_maxconflretention' to 'ineffective_maxretention' so that
this will be more consistent with 'update_maxretention'

3.
+/*
+ * Report a NOTICE to inform users that max_conflict_retention_duration is
+ * ineffective when retain_dead_tuples is disabled for a subscription. An ERROR
+ * is not issued because setting max_conflict_retention_duration
causes no harm,
+ * even when it is ineffective.
+ */
+static void
+notify_ineffective_max_retention(bool update_maxretention)
+{
+ ereport(NOTICE,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ update_maxretention
+ ? errmsg("max_conflict_retention_duration has no effect when
retain_dead_tuples is disabled")
+ : errmsg("disabling retain_dead_tuples will render
max_conflict_retention_duration ineffective"));
 }

I really don't like to make a function for a single ereport, even if
this is being called from multiple places.


--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
Please find some more comments:

1)
In CheckSubDeadTupleRetention(), shall we have the below instead of the
retain_dead_tuples check in all conditions?

if (retain_dead_tuples)
    guc checks (wal_level and track_commit_timestamp)
else
    max retention check

2)
Currently stop and resume messages are:
~~
LOG:  logical replication worker for subscription "sub2" has stopped
retaining the information for detecting conflicts
DETAIL:  The retention duration for information used in conflict
detection has exceeded the maximum limit of 10000 ms.
HINT:  You might need to increase "max_conflict_retention_duration".
 --
LOG:  logical replication worker for subscription "sub2" will resume
retaining the information for detecting conflicts
DETAIL:  The retention duration for information used in conflict
detection is now within the acceptable limit of 10000 ms.
~~

The resume message does not mention the GUC, while the stop message does
mention it in the HINT. Shall we have both the stop and resume DETAIL
messages mention the GUC, as:

Stop:
DETAIL: Retention of information used for conflict detection has
exceeded max_conflict_retention_duration of 10000 ms.

Resume:
DETAIL: Retention of information used for conflict detection is now
within the max_conflict_retention_duration of 1000 ms.

I think we should get rid of the HINT in the stop message, as that is not
something we should be suggesting without knowing the system workload/bloat
condition. The hint seems oversimplified, and thus incomplete, given the
possibilities we may have here.


3)
CREATE SUBSCRIPTION sub CONNECTION '...' PUBLICATION pub WITH (connect
= false, retain_dead_tuples = true, max_conflict_retention_duration =
5000);
WARNING:  deleted rows to detect conflicts would not be removed until
the subscription is enabled
HINT:  Consider setting retain_dead_tuples to false.
WARNING:  subscription was created, but is not connected
HINT:  To initiate replication, you must manually create the
replication slot, enable the subscription, and refresh the
subscription.
CREATE SUBSCRIPTION

With connect=false, we get above messages. Reverse order of WARNINGs
will make more sense as 'not connected' WARNING and HINT clarifies a
few things including that the sub is disabled and needs to be enabled.
Can we attempt doing it provided it does not over-complicate code?

4)
postgres=# \dRs+
     List of subscriptions
.. | Retain dead tuples | Max conflict retention duration | Dead tuple retention active | ..

Here shall we have  'Max retention duration' and 'Retention Active'
instead of 'Max conflict retention duration' and 'Dead tuple retention
active'?

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tuesday, August 26, 2025 5:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 
> On Tue, Aug 26, 2025 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Mon, Aug 25, 2025 at 5:05 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> 
> Some comments on latest patch

Thanks for the comments!

> 0001:
> 
> 1.
> +          <para>
> +           Note that setting a non-zero value for this option could lead to
> +           information for conflict detection being removed prematurely,
> +           potentially missing some conflict detections.
> +          </para>
> 
> We can improve this wording by saying "potentially incorrectly detecting some
> conflict"

I slightly reworded it to "potentially resulting in incorrect conflict detection."

> 
> 2.
> @@ -1175,6 +1198,8 @@ AlterSubscription(ParseState *pstate,
> AlterSubscriptionStmt *stmt,
>   bool update_failover = false;
>   bool update_two_phase = false;
>   bool check_pub_rdt = false;
> + bool ineffective_maxconflretention = false; bool update_maxretention =
> + false;
> 
> For making variable names more consistent, better to change
> 'ineffective_maxconflretention' to 'ineffective_maxretention' so that this will be
> more consistent with 'update_maxretention'
> 
> 3.
> +/*
> + * Report a NOTICE to inform users that max_conflict_retention_duration
> +is
> + * ineffective when retain_dead_tuples is disabled for a subscription.
> +An ERROR
> + * is not issued because setting max_conflict_retention_duration
> causes no harm,
> + * even when it is ineffective.
> + */
> +static void
> +notify_ineffective_max_retention(bool update_maxretention) {
> +ereport(NOTICE,  errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + update_maxretention
> + ? errmsg("max_conflict_retention_duration has no effect when
> retain_dead_tuples is disabled")
> + : errmsg("disabling retain_dead_tuples will render
> max_conflict_retention_duration ineffective"));  }
> 
> I really don't like to make a function for a single ereport, even if this is being
> called from multiple places.

I refactored this part based on some other comments, so these points
are addressed in the V66 patch set as well.

Best Regards,
Hou zj

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tuesday, August 26, 2025 2:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Aug 25, 2025 at 5:05 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > A few comments on 0001:
> >
> 
> Some more comments:

Thanks for the comments!

> 1.
> + /*
> + * Return false if the leader apply worker has stopped retaining
> + * information for detecting conflicts. This implies that
> + update_deleted
> + * can no longer be reliably detected.
> + */
> + if (!retention_active)
> + return false;
> +
>   /*
>   * For conflict detection, we use the conflict slot's xmin value instead
>   * of invoking GetOldestNonRemovableTransactionId(). The slot.xmin acts as
> @@ -3254,7 +3315,15 @@ FindDeletedTupleInLocalRel(Relation localrel, Oid
> localidxoid,
>   oldestxmin = slot->data.xmin;
>   SpinLockRelease(&slot->mutex);
> 
> - Assert(TransactionIdIsValid(oldestxmin));
> + /*
> + * Return false if the conflict detection slot.xmin is set to
> + * InvalidTransactionId. This situation arises if the current worker is
> + * either a table synchronization or parallel apply worker, and the
> + leader
> + * stopped retention immediately after checking the
> + * oldest_nonremovable_xid above.
> + */
> + if (!TransactionIdIsValid(oldestxmin))
> + return false;
> 
> If the current worker is tablesync or parallel_apply, it should have exited from
> the above check of retention_active as we get the leader's
> oldest_nonremovable_xid to decide that. What am I missing? This made me
> wonder whether we need to use slot's xmin after we have fetched leader's
> oldest_nonremovable_xid to find the deleted tuple?

There was a race condition where the leader could stop retention immediately
after the pa worker fetched its oldest_nonremovable_xid. But I agree that it
would be simpler to directly use oldest_nonremovable_xid to find the deleted
tuple, so I changed it accordingly; this also simplifies the logic.

> 
> 2.
> - * The interval is reset to a minimum value of 100ms once there is some
> - * activity on the node.
> + * The interval is reset to the lesser of 100ms and
> + * max_conflict_retention_duration once there is some activities on the node.
> AFAICS, this is not adhered to in the code, because you are using it when there is
> no activity, aka when new_xid_found is false. Is the comment wrong, or does the
> code need updating?

I think both need to be updated. I have adjusted the code to consider
max_retention in both cases and updated the comments.

All the comments have been addressed in V66 patch set.

Best Regards,
Hou zj

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tuesday, August 26, 2025 6:45 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> Please find some more comments:

Thanks for the comments!

> 
> 
> 3)
> CREATE SUBSCRIPTION sub CONNECTION '...' PUBLICATION pub WITH
> (connect = false, retain_dead_tuples = true, max_conflict_retention_duration
> = 5000);
> WARNING:  deleted rows to detect conflicts would not be removed until the
> subscription is enabled
> HINT:  Consider setting retain_dead_tuples to false.
> WARNING:  subscription was created, but is not connected
> HINT:  To initiate replication, you must manually create the replication slot,
> enable the subscription, and refresh the subscription.
> CREATE SUBSCRIPTION
> 
> With connect=false, we get above messages. Reverse order of WARNINGs will
> make more sense as 'not connected' WARNING and HINT clarifies a few things
> including that the sub is disabled and needs to be enabled.
> Can we attempt doing it provided it does not over-complicate code?

I agree it makes sense to reverse the messages. But since this behavior isn't
caused by the remaining patches, we can consider it a separate improvement after
the main patch is pushed.

All other comments look good to me, so addressed in the latest patches.

Here is V66 patch set which includes the following changes:

[0001]
* Enhanced documentation according to Dilip's comments [1].
* Simplified logic by directly using the leader's oldest_nonremovable_xid to
  locate deleted tuples, according to Amit's comments [2].
* Merged Amit's diff [2].
* Integrated the new NOTICE into the existing CheckSubDeadTupleRetention
  function according to Amit's feedback [3].
* Some other adjustments according to feedback from [2] [3].

[0002]
* Refactored code according to Shveta's comments [4].
* Removed an unnecessary variable according to Shveta's suggestions [5].
* Some other adjustments according to feedback from [4][5][6].

[1] https://www.postgresql.org/message-id/CAFiTN-v2-Jv3UFYQ48pbX%2Bjb%2BMXWoxrfsRXQ_J1s1xqPq8P_zg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAA4eK1%2BHcrkKfXAwEsXK0waDK8VSx1qjBVj95SmZKPM0vMF%3DQg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAA4eK1JhYwJhU4vYPGeh8Y46S%2BFS5ciATw5beJKPrkF5ZAu2AQ%40mail.gmail.com
[4] https://www.postgresql.org/message-id/CAJpy0uCrHtwN3wgnC26G8M4jQfaMJG3EUU3OY%2BzpwQPeifjmTg%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAJpy0uDpX6jQWC3-cyA38ANT0L_L_qQeWPy2cATzSpLNDha1%3DA%40mail.gmail.com
[6] https://www.postgresql.org/message-id/CAJpy0uBQ_v0D3ceuZfJrx%3DzH6-59ORLqj%2BaqZJ7AQnw3vRRcSA%40mail.gmail.com

Best Regards,
Hou zj

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Wednesday, August 27, 2025 11:30 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> 
> Here is V66 patch set which includes the following changes:
> 
> [0001]
> * Enhanced documentation according to Dilip's comments [1].
> * Simplified logic by directly using the leader's oldest_nonremovable_xid to
>   locate deleted tuples, according to Amit's comments [2].
> * Merged Amit's diff [2].
> * Integrated the new NOTICE into the existing CheckSubDeadTupleRetention
>   function according to Amit's feedback [3].
> * Some other adjustments according to feedback from [2] [3].
> 
> [0002]
> * Refactored code according to Shveta's comments [4].
> * Removed an unnecessary variable according to Shveta's suggestions [5].
> * Some other adjustments according to feedback from [4][5][6].

I noticed that Cfbot failed to compile the document due to a typo after renaming
the subscription option. Here are the updated V67 patches to fix that, only the doc
in 0001 is modified.

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Thu, Aug 28, 2025 at 8:02 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> I noticed that Cfbot failed to compile the document due to a typo after renaming
> the subscription option. Here are the updated V67 patches to fix that, only the doc
> in 0001 is modified.
>

Please find a  few comments:

patch001 :
1)
Assert is hit while testing patch001 alone:
TRAP: failed Assert("TransactionIdIsValid(nonremovable_xid)"), File:
"launcher.c", Line: 1394, PID: 55350.

Scenario:
I have 2 subs, both have stopped retention. Now on one of the sub, if I do this:

--switch off rdt:
alter subscription sub1 disable;
alter subscription sub1 set (retain_dead_tuples=off);
alter subscription sub1 enable;

--switch back rdt, launcher asserts after this
alter subscription sub1 disable;
alter subscription sub1 set (retain_dead_tuples=on);
alter subscription sub1 enable;

2)
+          Maximum duration for which this subscription's apply worker
is allowed
+          to retain the information useful for conflict detection when
+          <literal>retain_dead_tuples</literal> is enabled for the associated
+          subscriptions.

associated subscriptions --> associated subscription (since we are
talking about 'this subscription's apply worker')

3)
+          The information useful for conflict detection is no longer
retained if
+          all apply workers associated with the subscriptions, where
+          <literal>retain_dead_tuples</literal> is enabled, confirm that the
+          retention duration exceeded the

A trivial thing: 'retention duration has exceeded' sounds better to me.

~~

patch002:
(feel free to defer these comments if we are focusing on patch001 right now):

4)
stop_conflict_info_retention() has the correct message and detail in
patch01 while in patch02, it is switched back to the older one (wrong
one). Perhaps some merge mistake

5)
resume_conflict_info_retention() still refers to the old GUC name:
max_conflict_retention_duration.

6)
In compute_min_nonremovable_xid(), shall we have
Assert(TransactionIdIsValid(nonremovable_xid)) before assigning it to
worker->oldest_nonremovable_xid? Here:

+ nonremovable_xid = MyReplicationSlot->data.xmin;
+
+ SpinLockAcquire(&worker->relmutex);
+ worker->oldest_nonremovable_xid = nonremovable_xid;
+ SpinLockRelease(&worker->relmutex);

7)
Now since compute_min_nonremovable_xid() is also taking care of
assigning valid xid to the worker, shall we update that in comment as
well? We can have one more line:
'Additionally if any of the apply workers has invalid xid, assign
slot's xmin to it.'

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Thu, Aug 28, 2025 at 8:02 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> I noticed that Cfbot failed to compile the document due to a typo after renaming
> the subscription option. Here are the updated V67 patches to fix that, only the doc
> in 0001 is modified.
>

I have made a number of cosmetic and comment changes in the attached
atop 0001 patch. Kindly include in next version, if you think they are
good to include.

--
With Regards,
Amit Kapila.

Вложения

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Thu, Aug 28, 2025 at 4:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 28, 2025 at 8:02 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > I noticed that Cfbot failed to compile the document due to a typo after renaming
> > the subscription option. Here are the updated V67 patches to fix that, only the doc
> > in 0001 is modified.
> >
>
> I have made a number of cosmetic and comment changes in the attached
> atop 0001 patch. Kindly include in next version, if you think they are
> good to include.
>

Sorry, by mistake, I attached the wrong file. Please find the correct one now.

--
With Regards,
Amit Kapila.

Вложения

Re: Conflict detection for update_deleted in logical replication

От
Nisha Moond
Дата:
Hi,

As per the test results in [1], the TPS drop observed on the
subscriber with update_deleted enabled was mainly because only a
single apply worker was handling the replication workload from
multiple concurrent publisher clients.
The following performance benchmarks were conducted to evaluate the
improvements using parallel apply when update_deleted
(retain_dead_tuples) is enabled, under heavy workloads, without
leveraging row filters or multiple subscriptions to distribute the
load.

Note: The earlier tests from [1] are repeated with a few workload
modifications to see the improvements using parallel apply.

Highlights
===============
 - No regression was observed when running pgbench individually on
either Pub or Sub nodes.
 - When pgbench was run on both Pub and Sub, performance improved
significantly with the parallel apply patch. With just 4 workers, Sub
was able to catch up with Pub without regression.
 - With max_conflict_retention_duration=60s, retention on Sub was not
stopped when using 4 or more parallel workers.

Machine details
===============
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 88 cores, 503 GiB RAM

Source code:
===============
 - pgHead(e9a31c0cc60) and parallel-apply v1 patch[2]
 - additionally used v64-0001 of update_deleted for
max_conflict_retention_duration related tests.

Test-01: pgbench on publisher
============================
Setup:
---------
Pub --> Sub
 - Two nodes created in pub-sub logical replication setup.
 - Both nodes have the same set of pgbench tables created with scale=60.
 - The Sub node is subscribed to all the changes from the Pub's
pgbench tables and the subscription has retain_dead_tuples = on

Workload:
--------------
 - Run default pgbench(read-write) only on Pub with #clients=40 and
run duration=10 minutes

Results:
------------
#run    pgHead_TPS    pgHead+v1_patch_TPS
   1    41135.71241    39922.7163
   2    40466.23865    39980.29034
   3    40578.16374    39867.44929
median   40578.16374    39922.7163
 - No regression.
~~~

Test-02: pgbench on subscriber
========================
Setup: same as Test-01

Workload:
--------------
 - Run default pgbench(read-write) only on Sub node with #clients=40
and run duration=10 minutes

Results:
-----------
#run    pgHead_TPS    pgHead+v1_patch_TPS
    1    42173.90504    42071.18058
    2    41836.10027    41839.84837
    3    41921.81233    41494.9918
median    41921.81233    41839.84837
 - No regression.
~~~

Test-03: pgbench on both sides
========================
Setup:
------
Pub --> Sub
 - Two nodes created in a pub-sub logical replication setup.
 - Both nodes have different sets of pgbench tables created with scale=60.
 - The sub node also has Pub's pgbench tables and is subscribed to all
the changes.

Workload:
--------------
 - Run default pgbench(read-write) on both Pub and Sub nodes for their
respective pgbench tables
 - Both pgbench runs are with #clients=15 and duration=10 minutes

Observations:
--------------
 - On pgHead when retain_dead_tuples=ON, the Sub's TPS reduced by ~76%
 - With the parallel apply patch, performance improves significantly
as parallel workers increase, since conflict_slot.xmin advances more
quickly.
 - With just 4 parallel workers, subscription TPS matches the baseline
(no regression).
 - Performance remains consistent at 8 and 16 workers.

Detailed results:
------------------
case-1:
 - The base case pgHead(e9a31c0cc60) and retain_dead_tuples=OFF

#run    Pub_tps    Sub_tps
     1    17140.08647    16994.63269
     2    17421.28513    17445.62828
     3    17167.57801    17070.86979
median    17167.57801    17070.86979

case-2:
 - pgHead(e9a31c0cc60) and retain_dead_tuples=ON

#run    Pub_tps    Sub_tps
    1    18667.29343    4135.884924
    2    18200.90297    4178.713784
    3    18309.87093    4227.330234
median    18309.87093    4178.713784
 - The Sub sees a ~76% TPS reduction by default on head.

case-3:
 - pgHead(e9a31c0cc60)+ v1_parallel_apply_patch and retain_dead_tuples=ON
 - number of parallel apply workers varied as 2,4,8,16

3a) #workers=2
#run    Pub_tps    Sub_tps
    1   18336.98565    4244.072357
    2    18629.96658    4231.887288
    3    18152.92036    4253.293648
median    18336.98565    4244.072357
 - There is no significant TPS improvement with 2 parallel workers
(the ~76% TPS reduction remains)

3b) #workers=4
#run    Pub_tps    Sub_tps
    1    16796.49468    16850.05362
    2    16834.06057    16757.73115
    3    16647.78486    16762.9107
median 16796.49468    16762.9107
 - No regression

3c) #workers=8
#run    Pub_tps    Sub_tps
    1    17105.38794    16778.38209
    2    16783.5085    16780.20492
    3   16806.97569    16642.87521
median    16806.97569    16778.38209
 - No regression

3d) #workers=16
#run    Pub_tps    Sub_tps
    1   16827.20615    16770.92293
    2    16860.10188    16745.2968
    3    16808.2148    16668.47974
median    16827.20615   16745.2968
 - No regression.
~~~

Test-04. pgbench on both sides, and max_conflict_retention_duration was tuned
 ========================================================================
Setup:
-------
Pub --> Sub
 - setup is same as Test-03(above)
 - Additionally, subscription option max_conflict_retention_duration=60s

Workload:
-------------
 - Run default pgbench(read-write) on both Pub and Sub nodes for their
respective pgbench tables
 - Started with 15 clients on both sides.
 - When conflict_slot.xmin becomes NULL on Sub, pgbench was paused to
let the subscription catch up. Then reduced publisher clients by half
and resumed pgbench. Here, slot.xmin becomes NULL to indicate conflict
retention is stopped under high publisher load but stays non-NULL when
Sub is able to catch up with Pub's load.
- Total duration of pgbench run is 10 minutes (600s).
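
For reference, the conflict slot state on the Sub can be checked with a query
along these lines (illustrative; the slot and column names are as shown
elsewhere in this thread):

    SELECT slot_name, active, xmin
      FROM pg_replication_slots
     WHERE slot_name = 'pg_conflict_detection';
    -- xmin being NULL means conflict-info retention has been stopped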

Observations:
------------------
 - Without the parallel apply patch, publisher clients reduced from
15->7->3, and finally the retention was not stopped at 3 clients and
slot.xmin remained non-NULL.
 - With the parallel apply patch, using 2 workers the subscription
handled up to 7 publisher clients without stopping the conflict
retention.
 - With 4+ workers, retention continued for the full 10 minutes and
Sub TPS showed no regression.

Detailed results:
-----------------
case-1:
 - pgHead(e9a31c0cc60) + v64-001 and retain_dead_tuples=ON

On publisher:
    #clients    duration[s]    TPS
    15     73     17953.52
    7     100     9141.9
    3     426     4286.381132
On Subscriber:
    #clients     duration[s]     TPS
    15     73     10626.67
    15     99     10271.35
    15     431   19467.07612
~~~

case-2:
 - pgHead(e9a31c0cc60) + v64-001 + v1_parallel-apply patch[2] and
retain_dead_tuples=ON
 - number of parallel apply workers varied as 2,4,8

2a) #workers=2
On publisher:
    #clients     duration[s]     TPS
    15     87     17318.3
    7     512     9063.506025

On Subscriber:
    #clients     duration[s]     TPS
    15     87     10299.66
    15     512     18426.44818

2b) #workers=4
On publisher:
    #clients     duration[s]     TPS
    15     600     16953.40302

On Subscriber:
    #clients     duration[s]     TPS
    15     600     16812.15289

2c) #workers=8
On publisher:
    #clients     duration[s]     TPS
    15     600     16946.91636

On Subscriber:
    #clients     duration[s]     TPS
    15     600     16708.12774

~~~~
The scripts used for all the tests are attached.

[1]
https://www.postgresql.org/message-id/OSCPR01MB1496663AED8EEC566074DFBC9F54CA%40OSCPR01MB14966.jpnprd01.prod.outlook.com
[2]
https://www.postgresql.org/message-id/OS0PR01MB5716D43CB68DB8FFE73BF65D942AA%40OS0PR01MB5716.jpnprd01.prod.outlook.com

--
Thanks,
Nisha

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Thursday, August 28, 2025 7:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Aug 28, 2025 at 8:02 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > I noticed that Cfbot failed to compile the document due to a typo
> > after renaming the subscription option. Here are the updated V67
> > patches to fix that, only the doc in 0001 is modified.
> >
> 
> I have made a number of cosmetic and comment changes in the attached atop
> 0001 patch. Kindly include in next version, if you think they are good to include.

Thanks! The changes look good to me; I have merged them into the V68 patch set.

Here is the new version patch set which also addressed Shveta's comments[1].

[1] https://www.postgresql.org/message-id/CAJpy0uD%3DMLy77JZ78_J4H3XCV1mCA%3DiUPHuFC5Vt4EKyj6zfjg%40mail.gmail.com

Best Regards,
Hou zj

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Thursday, August 28, 2025 6:07 PM shveta malik <shveta.malik@gmail.com> wrote:
> On Thu, Aug 28, 2025 at 8:02 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> >
> > I noticed that Cfbot failed to compile the document due to a typo
> > after renaming the subscription option. Here are the updated V67
> > patches to fix that, only the doc in 0001 is modified.
> >
> 
> Please find a  few comments:

Thanks for the comments.

> 
> patch001 :
> 1)
> Assert is hit while testing patch001 alone:
> TRAP: failed Assert("TransactionIdIsValid(nonremovable_xid)"), File:
> "launcher.c", Line: 1394, PID: 55350.
> 
> Scenario:
> I have 2 subs, both have stopped retention. Now on one of the sub, if I do this:
> 
> --switch off rdt:
> alter subscription sub1 disable;
> alter subscription sub1 set (retain_dead_tuples=off); alter subscription sub1
> enable;
> 
> --switch back rdt, launcher asserts after this alter subscription sub1 disable;
> alter subscription sub1 set (retain_dead_tuples=on); alter subscription sub1
> enable;

I fixed it by skipping collecting the xid from workers when the overall
retention has been stopped.

When fixing the reported Assert failure, I noticed another race condition that
can hit this Assert: If the launcher has obtained
pg_subscription.retentionactive and has not yet called
compute_min_nonremovable_xid(), then if the apply worker stops retention and
sets oldest_nonremovable_xid to invalid, the launcher will hit the assertion
again when calling compute_min_nonremovable_xid(). I fixed it by adding a check
in the compute function:

    /*
     * Return if the apply worker has stopped retention concurrently.
     *
     * Although this function is invoked only when retentionactive is true, the
     * apply worker might stop retention after the launcher fetches the
     * retentionactive flag.
     */
    if (!TransactionIdIsValid(nonremovable_xid))
        return;

The logic in 0002 is also adjusted to avoid assigning the slot.xmin to the worker
in this case.

> 
> 2)
> +          Maximum duration for which this subscription's apply worker
> is allowed
> +          to retain the information useful for conflict detection when
> +          <literal>retain_dead_tuples</literal> is enabled for the
> associated
> +          subscriptions.
> 
> associated subscriptions --> associated subscription (since we are talking
> about 'this subscription's apply worker')

I have reworded this part based on Amit's suggestion.

> 
> 3)
> +          The information useful for conflict detection is no longer
> retained if
> +          all apply workers associated with the subscriptions, where
> +          <literal>retain_dead_tuples</literal> is enabled, confirm that the
> +          retention duration exceeded the
> 
> A trivial thing: 'retention duration has exceeded' sounds better to me.

Changed

> 
> ~~
> 
> patch002:
> (feel free to defer these comments if we are focusing on patch001 right now):
> 
> 4)
> stop_conflict_info_retention() has the correct message and detail in
> patch01 while in patch02, it is switched back to the older one (wrong one).
> Perhaps some merge mistake

Right, it was a miss when rebasing 0002. Fixed.

> 
> 5)
> resume_conflict_info_retention() still refers to the old GUC name:
> max_conflict_retention_duration.

Fixed.

> 
> 6)
> In compute_min_nonremovable_xid(), shall we have
> Assert(TransactionIdIsValid(nonremovable_xid)) before assigning it to
> worker->oldest_nonremovable_xid? Here:
> 
> + nonremovable_xid = MyReplicationSlot->data.xmin;
> +
> + SpinLockAcquire(&worker->relmutex);
> + worker->oldest_nonremovable_xid = nonremovable_xid;
> + SpinLockRelease(&worker->relmutex);

Added.

> 7)
> Now since compute_min_nonremovable_xid() is also taking care of assigning
> valid xid to the worker, shall we update that in comment as well? We can have
> one more line:
> 'Additionally if any of the apply workers has invalid xid, assign slot's xmin to it.'

Added.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Fri, Aug 29, 2025 at 11:49 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is the new version patch set which also addressed Shveta's comments[1].
>

Thanks for the patch.

On 001 alone, I’m observing a behavior where, if sub1 has stopped
retention, and I then create a new subscription sub2, the worker for
sub2 fails to start successfully. It repeatedly starts and exits,
logging the following message:

LOG:  logical replication worker for subscription "sub2" will restart
because the option retain_dead_tuples was enabled during startup

Same things happen when I disable and re-enable 'retain_dead_tuple' of
any sub once the slot has invalid xmin.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Nisha Moond
Дата:
On Fri, Aug 29, 2025 at 11:49 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is the new version patch set which also addressed Shveta's comments[1].
>

Thanks for the patches here, I tested the v68-001 patch alone, please
find review comments -

1) If a sub is created with retain_dead_tuples=on but disabled, e.g.
postgres=# create subscription sub3 CONNECTION 'dbname=postgres
host=localhost port=8841' PUBLICATION pub3 WITH
(enabled=false,retain_dead_tuples=on);
WARNING:  deleted rows to detect conflicts would not be removed until
the subscription is enabled
HINT:  Consider setting retain_dead_tuples to false.
NOTICE:  created replication slot "sub3" on publisher
CREATE SUBSCRIPTION

In this case, the conflict slot is not created until the sub is
enabled. Also, if the slot already exists but all other subscriptions
have stopped retaining (slot.xmin=NULL), the dead tuple retention will
not start until the slot is recreated.
To me, the above warning seems misleading in this case.

2) A similar situation can happen with ALTER SUBSCRIPTION. For
example, consider two subscriptions where retention has stopped for
both and slot.xmin is NULL:

 subname | subenabled | subretaindeadtuples | submaxretention | subretentionactive
---------+------------+---------------------+-----------------+--------------------
 sub2    | t          | t                   |             100 | f
 sub1    | t          | t                   |             100 | f

postgres=# select slot_name,active,xmin from pg_replication_slots ;
       slot_name       | active | xmin
-----------------------+--------+------
 pg_conflict_detection | t      |

If we try to resume retention only for sub1 by toggling retain_dead_tuples:

postgres=# alter subscription sub1 set (retain_dead_tuples =off);
NOTICE:  max_retention_duration is ineffective when retain_dead_tuples
is disabled
ALTER SUBSCRIPTION
postgres=# alter subscription sub1 set (retain_dead_tuples =on);
NOTICE:  deleted rows to detect conflicts would not be removed until
the subscription is enabled
ALTER SUBSCRIPTION

2a) Here also the retention NOTICE is ambiguous as slot.xmin remains
NULL. Though, the above steps don't strictly follow the docs (i.e.
slot should be recreated to resume the retention), still the notice
can be confusing for users.

2b) Also, the retention is not resumed for sub1(expected), but still
the subretentionactive is changed to true.

 subname | subenabled | subretaindeadtuples | submaxretention | subretentionactive
---------+------------+---------------------+-----------------+--------------------
 sub1    | f          | t                   |             100 | t
 sub2    | t          | t                   |             100 | f

I think we should avoid changing subretentionactive to true in such
cases until the slot is recreated and retention is actually resumed.
Thoughts?

--
Thanks,
Nisha



Re: Conflict detection for update_deleted in logical replication

От
Nisha Moond
Дата:
On Sat, Aug 30, 2025 at 10:17 AM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> On Fri, Aug 29, 2025 at 11:49 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Here is the new version patch set which also addressed Shveta's comments[1].
> >
>
> Thanks for the patches here, I tested the v68-001 patch alone, please
> find review comments -
>

Further review comments for v68-0001 patch -

3) v68 seems to have introduced a bug:

@@ -1254,9 +1257,12 @@ ApplyLauncherMain(Datum main_arg)
  /*
  * Compute the minimum xmin required to protect dead tuples
  * required for conflict detection among all running apply
- * workers that enables retain_dead_tuples.
+ * workers.
  */
- if (sub->retaindeadtuples && can_advance_xmin)
+ if (TransactionIdIsValid(MyReplicationSlot->data.xmin) &&
+ sub->retaindeadtuples &&
+ sub->retentionactive &&
+ can_update_xmin)
  compute_min_nonremovable_xid(w, &xmin);

The new check "TransactionIdIsValid(MyReplicationSlot->data.xmin)" can
cause a segmentation fault in the launcher when a default subscription
is created (i.e., retain_dead_tuples=off) and the conflict slot does
not exist.
Perhaps it should check "sub->retaindeadtuples" first, before accessing the slot.
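
For illustration, the crash scenario is simply a default subscription created
while the conflict slot does not exist, e.g. (hypothetical names and
connection string):

    CREATE SUBSCRIPTION sub_default
        CONNECTION 'dbname=postgres host=localhost port=8841'
        PUBLICATION pub_default;
    -- retain_dead_tuples defaults to off, so the launcher must not touch
    -- MyReplicationSlot->data.xmin for this subscription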

--
Thanks,
Nisha



Re: Conflict detection for update_deleted in logical replication

От
Nisha Moond
Дата:
On Thu, Aug 28, 2025 at 6:00 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> Test-04. pgbench on both sides, and max_conflict_retention_duration was tuned
>  ========================================================================
> Setup:
> -------
> Pub --> Sub
>  - setup is same as Test-03(above)
>  - Additionally, subscription option max_conflict_retention_duration=60s
>
> Workload:
> -------------
>  - Run default pgbench(read-write) on both Pub and Sub nodes for their
> respective pgbench tables
>  - Started with 15 clients on both sides.
>  - When conflict_slot.xmin becomes NULL on Sub, pgbench was paused to
> let the subscription catch up. Then reduced publisher clients by half
> and resumed pgbench. Here, slot.xmin becomes NULL to indicate conflict
> retention is stopped under high publisher load but stays non-NULL when
> Sub is able to catch up with Pub's load.
> - Total duration of pgbench run is 10 minutes (600s).
>
> Observations:
> ------------------
>  - Without the parallel apply patch, publisher clients reduced from
> 15->7->3,and finally the retention was not stopped at 3 clients and
> slot.xmin remained non-NULL.
>  - With the parallel apply patch, using 2 workers the subscription
> handled up to 7 publisher clients without stopping the conflict
> retention.
>  - With 4+ workers, retention continued for the full 10 minutes and
> Sub TPS showed no regression.
>

I repeated the test with 4 parallel workers over a 12-hour pgbench
run. Retention continued for the full duration without stopping, and
no regression was observed on either publisher or subscriber.

Here is the pgbench result for the run on the publisher node -
number of clients: 15
number of threads: 15
maximum number of tries: 1
duration: 43200 s
number of transactions actually processed: 725093972
number of failed transactions: 0 (0.000%)
latency average = 0.893 ms
latency stddev = 1.176 ms
initial connection time = 8.879 ms
tps = 16784.585718 (without initial connection time)
~~~

Subscriber's TPS =  16668.322700

--
Thanks,
Nisha



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Friday, August 29, 2025 6:28 PM shveta malik <shveta.malik@gmail.com>:
> 
> On Fri, Aug 29, 2025 at 11:49 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > Here is the new version patch set which also addressed Shveta's
> comments[1].
> >
> 
> Thanks for the patch.
> 
> On 001 alone, I’m observing a behavior where, if sub1 has stopped retention,
> and I then create a new subscription sub2, the worker for
> sub2 fails to start successfully. It repeatedly starts and exits, logging the
> following message:
> 
> LOG:  logical replication worker for subscription "sub2" will restart because
> the option retain_dead_tuples was enabled during startup
> 
> Same things happen when I disable and re-enable 'retain_dead_tuple' of any
> sub once the slot has invalid xmin.

I think this behavior is because slot.xmin is set to an invalid number, and 0001
patch has no slot recovery logic, so even if retentionactive is true, newly
created subscriptions cannot have a valid oldest_nonremovable_xid.

After thinking more, I decided to add slot recovery functionality to 0001 as
well, thus avoiding the need for additional checks here. I also adjusted
the documents accordingly.

Here is the V69 patch set which addressed above comments and the
latest comment from Nisha[1].

[1] https://www.postgresql.org/message-id/CABdArM7GBa8kXCdOQw4U--tKgapj5j0hAVzL%3D%3DB3-fkg8Gzmdg%40mail.gmail.com

Best Regards,
Hou zj

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Saturday, August 30, 2025 12:48 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
> 
> On Fri, Aug 29, 2025 at 11:49 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > Here is the new version patch set which also addressed Shveta's
> comments[1].
> >
> 
> Thanks for the patches here, I tested the v68-001 patch alone, please find
> review comments -

Thanks for the comments!

> 
> 1) If a sub is created with retain_dead_tuples=on but disabled, e.g.
> postgres=# create subscription sub3 CONNECTION 'dbname=postgres
> host=localhost port=8841' PUBLICATION pub3 WITH
> (enabled=false,retain_dead_tuples=on);
> WARNING:  deleted rows to detect conflicts would not be removed until the
> subscription is enabled
> HINT:  Consider setting retain_dead_tuples to false.
> NOTICE:  created replication slot "sub3" on publisher CREATE
> SUBSCRIPTION
> 
> In this case, the conflict slot is not created until the sub is enabled.

I think this is a separate issue: the sub creation command does
not wake up the launcher to create the slot in time. So, I will prepare a
fix in another thread.

> Also, if the
> slot already exists but all other subscriptions have stopped retaining
> (slot.xmin=NULL), the dead tuple retention will not start until the slot is
> recreated.
> To me, the above warning seems misleading in this case.
> 
> 2) A similar situation can happen with ALTER SUBSCRIPTION. For example,
> consider two subscriptions where retention has stopped for both and slot.xmin
> is NULL:
> 
>  subname | subenabled | subretaindeadtuples | submaxretention | subretentionactive
> ---------+------------+---------------------+-----------------+--------------------
>  sub2    | t          | t                   |             100 | f
>  sub1    | t          | t                   |             100 | f
> 
> postgres=# select slot_name,active,xmin from pg_replication_slots ;
>        slot_name       | active | xmin
> -----------------------+--------+------
>  pg_conflict_detection | t      |
> 
> If we try to resume retention only for sub1 by toggling retain_dead_tuples:
> 
> postgres=# alter subscription sub1 set (retain_dead_tuples =off);
> NOTICE:  max_retention_duration is ineffective when retain_dead_tuples is
> disabled ALTER SUBSCRIPTION postgres=# alter subscription sub1 set
> (retain_dead_tuples =on);
> NOTICE:  deleted rows to detect conflicts would not be removed until the
> subscription is enabled ALTER SUBSCRIPTION
> 
> 2a) Here also the retention NOTICE is ambiguous as slot.xmin remains NULL.
> Though, the above steps don't strictly follow the docs (i.e.
> slot should be recreated to resume the retention), still the notice can be
> confusing for users.
> 
> 2b) Also, the retention is not resumed for sub1(expected), but still the
> subretentionactive is changed to true.
> 
>  subname | subenabled | subretaindeadtuples | submaxretention | subretentionactive
> ---------+------------+---------------------+-----------------+--------------------
>  sub1    | f          | t                   |             100 | t
>  sub2    | t          | t                   |             100 | f
> 
> I think we should avoid changing subretentionactive to true in such cases until
> the slot is recreated and retention is actually resumed.
> Thoughts?

Since I have added slot recovery functionality to 0001, these comments
should also be addressed.

Best Regards,
Hou zj

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Monday, September 1, 2025 12:45 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> 
> On Friday, August 29, 2025 6:28 PM shveta malik <shveta.malik@gmail.com>:
> >
> > On Fri, Aug 29, 2025 at 11:49 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > Here is the new version patch set which also addressed Shveta's
> > comments[1].
> > >
> >
> > Thanks for the patch.
> >
> > On 001 alone, I’m observing a behavior where, if sub1 has stopped
> > retention, and I then create a new subscription sub2, the worker for
> > sub2 fails to start successfully. It repeatedly starts and exits,
> > logging the following message:
> >
> > LOG:  logical replication worker for subscription "sub2" will restart
> > because the option retain_dead_tuples was enabled during startup
> >
> > Same things happen when I disable and re-enable 'retain_dead_tuple' of
> > any sub once the slot has invalid xmin.
> 
> I think this behavior is because slot.xmin is set to an invalid number, and 0001
> patch has no slot recovery logic, so even if retentionactive is true, newly created
> subscriptions cannot have a valid oldest_nonremovable_xid.
> 
> After thinking more, I decided to add slot recovery functionality to 0001 as well,
> thus avoiding the need for additional checks here. I also adjusted the
> documents accordingly.
> 
> Here is the V69 patch set which addressed above comments and the latest
> comment from Nisha[1].

I reviewed the patch internally and tweaked a small detail of the apply worker
to reduce the waiting time in the main loop when max_retention_duration is
defined (set wait_time = min(wait_time, max_retention_duration)). Also, I added
a simple test in 035_conflicts.pl of 0001 to verify the new sub option.

Here is V70 patch set.

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Mon, Sep 1, 2025 at 5:07 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> I reviewed the patch internally and tweaked a small detail of the apply worker
> to reduce the waiting time in the main loop when max_retention_duration is
> defined (set wait_time = min(wait_time, max_retention_duration)). Also, I added
> a simple test in 035_conflicts.pl of 0001 to verify the new sub option.
>
> Here is V70 patch set.
>

The patch v70-0001 looks good to me. Verified, all the old issues are resolved.

Will resume review of v70-0002 now.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Mon, Sep 1, 2025 at 5:45 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> >
> > Here is V70 patch set.
> >
>
> The patch v70-0001 looks good to me. Verified, all the old issues are resolved.
>
> Will resume review of v70-0002 now.
>

Please find a few comments on v70-0002:

1)
- * Note: Retention won't be resumed automatically. The user must manually
- * disable retain_dead_tuples and re-enable it after confirming that the
- * replication slot maintained by the launcher has been dropped.
+ * The retention will resume automatically if the worker has confirmed that the
+ * retention duration is now within the max_retention_duration.

Do we need this comment atop stop, as it does not directly concern
stop? Aren't the details regarding RDT_RESUME_CONFLICT_INFO_RETENTION
in the file-header section sufficient?

2)
+ /* Stop retention if not yet */
+ if (MySubscription->retentionactive)
+ {
+ rdt_data->phase = RDT_STOP_CONFLICT_INFO_RETENTION;

- /* process the next phase */
- process_rdt_phase_transition(rdt_data, false);
+ /* process the next phase */
+ process_rdt_phase_transition(rdt_data, false);
+ }
+
+ reset_retention_data_fields(rdt_data);

should_stop_conflict_info_retention() does reset_retention_data_fields()
every time the wait time exceeds the limit, and when it actually stops,
i.e., calls stop_conflict_info_retention() through a phase change, the stop
function also does reset_retention_data_fields() and calls
process_rdt_phase_transition(). Can we optimize this code path by
consolidating the reset_retention_data_fields() and
process_rdt_phase_transition() calls into
should_stop_conflict_info_retention() itself, eliminating the redundancy?

3)
Shall we update 035_conflicts.pl to add a resume test by setting
max_retention_duration to 0 after the stop-retention test (see the sketch
after this list)?

4)
+          subscription. The retention will be automatically resumed
once at least
+          one apply worker confirms that the retention duration is within the
+          specified limit, or a new subscription is created with
+          <literal>retain_dead_tuples = true</literal>, or the user manually
+          re-enables <literal>retain_dead_tuples</literal>.

Shall we rephrase it slightly to:

Retention will automatically resume when at least one apply worker
confirms that the retention duration is within the specified limit, or
when a new subscription is created with <literal>retain_dead_tuples =
true</literal>. Alternatively, retention can be manually
resumed by re-enabling <literal>retain_dead_tuples</literal>.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Tue, Sep 2, 2025 at 3:30 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> >
> > >
> > > Here is V70 patch set.
> > >
> >

Please find a few comments on v70-0003:

1)
Doc of dead_tuple_retention_active says:
True if retain_dead_tuples is enabled and the retention duration for
information used in conflict detection is within
max_retention_duration

Doc of subretentionactive says:
The retention status of information (e.g., dead tuples, commit
timestamps, and origins) useful for conflict detection. True if
retain_dead_tuples is enabled, and the retention duration has not
exceeded max_retention_duration, when defined.

There is hardly any difference between the two. Do we really need to
have 'dead_tuple_retention_active' when we already have
'subretentionactive'?

2)
Doc-wise, there is no difference between the two, but there is a small
window when the sub's subretentionactive will show true while the stat's
dead_tuple_retention_active will show false. This happens when the worker
is waiting for the launcher to assign its oldest-xid after it has
marked itself as 'resuming'.
If we decide to retain 'dead_tuple_retention_active', then do we need
to indicate the small difference between the two fields in the doc?

3)
We can add a test when we stop retention to see if this field shows
false. Currently there are two places in the test where we check this
field to see if it is true. I think we can shift both into the same
test: one check before stop-retention, one check after stop-retention.

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
Hi,

As reported by Robert[1], it is worth adding a test for the race condition in
the RecordTransactionCommitPrepared() function to reduce the risk of future code
changes:

    /*
     * Note it is important to set committs value after marking ourselves as
     * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This is because
     * we want to ensure all transactions that have acquired commit timestamp
     * are finished before we allow the logical replication client to advance
     * its xid which is used to hold back dead rows for conflict detection.
     * See comments atop worker.c.
     */
     committs = GetCurrentTimestamp();

While writing the test, I noticed a bug that conflict-relevant data could be
prematurely removed before applying prepared transactions on the publisher that
are in the commit phase. This occurred because GetOldestActiveTransactionId()
was called on the publisher, which failed to account for the backend executing
COMMIT PREPARED, as this backend does not have an xid stored in PGPROC.

Since this issue overlaps with the race condition related to timestamp
acquisition, I've prepared a bug fix along with a test for the race condition.
The 0001 patch fixes this issue by introducing a new function that iterates over
global transactions and identifies prepared transactions that are in the commit
phase. 0002 adds injection points and a TAP test to cover the bug and the
timestamp acquisition.
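For readers following along, the idea of the new function in 0001 is roughly
the sketch below; the function name is hypothetical and the real patch may
differ in details. The caller would combine its result with
GetOldestActiveTransactionId() so that prepared transactions already in the
commit phase still hold back advancement:

    /* Hypothetical sketch, not the actual patch code. */
    static TransactionId
    GetOldestTwoPhaseXidInCommit(void)
    {
        TransactionId oldest = InvalidTransactionId;

        LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
        for (int i = 0; i < TwoPhaseState->numPrepXacts; i++)
        {
            GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
            PGPROC     *commitproc = GetPGProcByNumber(gxact->pgprocno);

            /* Consider only transactions already in the commit phase. */
            if ((commitproc->delayChkptFlags & DELAY_CHKPT_IN_COMMIT) == 0)
                continue;

            if (!TransactionIdIsValid(oldest) ||
                TransactionIdPrecedes(gxact->xid, oldest))
                oldest = gxact->xid;
        }
        LWLockRelease(TwoPhaseStateLock);

        return oldest;
    }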

[1] https://www.postgresql.org/message-id/CA%2BTgmoaQtB%3DcnMJwCA33bDrGt7x5ysoW770uJ2Z56AU_NVNdbw%40mail.gmail.com

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Thu, Sep 4, 2025 at 3:30 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Hi,
>
> As reported by Robert[1], it is worth adding a test for the race condition in
> the RecordTransactionCommitPrepared() function to reduce the risk of future code
> changes:
>
>         /*
>          * Note it is important to set committs value after marking ourselves as
>          * in the commit critical section (DELAY_CHKPT_IN_COMMIT). This is because
>          * we want to ensure all transactions that have acquired commit timestamp
>          * are finished before we allow the logical replication client to advance
>          * its xid which is used to hold back dead rows for conflict detection.
>          * See comments atop worker.c.
>          */
>          committs = GetCurrentTimestamp();
>
> While writing the test, I noticed a bug that conflict-relevant data could be
> prematurely removed before applying prepared transactions on the publisher that
> are in the commit phase. This occurred because GetOldestActiveTransactionId()
> was called on the publisher, which failed to account for the backend executing
> COMMIT PREPARED, as this backend does not have an xid stored in PGPROC.
>
> Since this issue overlaps with the race condition related to timestamp
> acquisition, I've prepared a bug fix along with a test for the race condition.
> The 0001 patch fixes this issue by introducing a new function that iterates over
> global transactions and identifies prepared transactions during the commit
> phase. 0002 added injection points and tap-test to test the bug and timestamp
> acquisition.
>

Thank you for the patches. Verified 0001, works well. Just one minor comment:

1)
+ PGPROC    *proc = GetPGProcByNumber(gxact->pgprocno);
+ PGPROC    *commitproc;
We first get proc and then commitproc; proc is used only to get
databaseId, so why don't we get databaseId from commitproc itself?

~~

Few trivial comments for 002:
1)
+# Test that publisher's transactions marked with DELAY_CHKPT_IN_COMMIT prevent
+# concurrently deleted tuples on the subscriber from being removed.

Here shall we also mention something like:
This test also acts as a safeguard to prevent developers from moving
the timestamp acquisition before marking DELAY_CHKPT_IN_COMMIT in
RecordTransactionCommitPrepared.

2)
# This test depends on an injection point to block the transaction commit after
# marking DELAY_CHKPT_IN_COMMIT flag.

Shall we say:
..to block the prepared transaction commit..

3)
+# Create a publisher node. Disable autovacuum to stablized the tests related to
+# manual VACUUM and transaction ID.

to stablized --> to stabilize

4)
+# Confirm that the dead tuple can be removed now
+my ($cmdret, $stdout, $stderr) =
+  $node_subscriber->psql('postgres', qq(VACUUM (verbose) public.tab;));
+
+ok($stderr =~ qr/1 are dead but not yet removable/,
+ 'the deleted column is non-removable');

can be removed now --> cannot be removed now

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Friday, September 5, 2025 2:01 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Thu, Sep 4, 2025 at 3:30 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Hi,
> >
> > As reported by Robert[1], it is worth adding a test for the race condition in
> > the RecordTransactionCommitPrepared() function to reduce the risk of
> future code
> > changes:
> >
> >         /*
> >          * Note it is important to set committs value after marking ourselves
> as
> >          * in the commit critical section (DELAY_CHKPT_IN_COMMIT).
> This is because
> >          * we want to ensure all transactions that have acquired commit
> timestamp
> >          * are finished before we allow the logical replication client to
> advance
> >          * its xid which is used to hold back dead rows for conflict
> detection.
> >          * See comments atop worker.c.
> >          */
> >          committs = GetCurrentTimestamp();
> >
> > While writing the test, I noticed a bug that conflict-relevant data could be
> > prematurely removed before applying prepared transactions on the publisher
> that
> > are in the commit phase. This occurred because
> GetOldestActiveTransactionId()
> > was called on the publisher, which failed to account for the backend
> executing
> > COMMIT PREPARED, as this backend does not have an xid stored in
> PGPROC.
> >
> > Since this issue overlaps with the race condition related to timestamp
> > acquisition, I've prepared a bug fix along with a test for the race condition.
> > The 0001 patch fixes this issue by introducing a new function that iterates
> over
> > global transactions and identifies prepared transactions during the commit
> > phase. 0002 added injection points and tap-test to test the bug and
> timestamp
> > acquisition.
> >
> 
> Thank You for the patches.Verified 001, works well. Just one minor comment:
> 
> 1)
> + PGPROC    *proc = GetPGProcByNumber(gxact->pgprocno);
> + PGPROC    *commitproc;
> We get first proc and then commitproc. proc is used only to get
> databaseId, why don't we get databaseId from commitproc itself?

I think it's OK to directly refer to commitproc, so changed.

> 
> ~~
> 
> Few trivial comments for 002:

Thanks for the comments.

> 1)
> +# Test that publisher's transactions marked with
> DELAY_CHKPT_IN_COMMIT prevent
> +# concurrently deleted tuples on the subscriber from being removed.
> 
> Here shall we also mention something like:
> This test also acts as a safeguard to prevent developers from moving
> the timestamp acquisition before marking DELAY_CHKPT_IN_COMMIT in
> RecordTransactionCommitPrepared.

Added.

> 
> 2)
> # This test depends on an injection point to block the transaction commit after
> # marking DELAY_CHKPT_IN_COMMIT flag.
> 
> Shall we say:
> ..to block the prepared transaction commit..

Changed.

> 
> 3)
> +# Create a publisher node. Disable autovacuum to stablized the tests related
> to
> +# manual VACUUM and transaction ID.
> 
> to stablized --> to stabilize

Fixed.

> 
> 4)
> +# Confirm that the dead tuple can be removed now
> +my ($cmdret, $stdout, $stderr) =
> +  $node_subscriber->psql('postgres', qq(VACUUM (verbose) public.tab;));
> +
> +ok($stderr =~ qr/1 are dead but not yet removable/,
> + 'the deleted column is non-removable');
> 
> can be removed now --> cannot be removed now

Fixed.

Here are v2 patches which addressed above comments.

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
Tom Lane
Дата:
Coverity is not happy with commit a850be2fe:

/srv/coverity/git/pgsql-git/postgresql/src/backend/replication/logical/worker.c: 3276             in
FindDeletedTupleInLocalRel()
3270              * maybe_advance_nonremovable_xid() for details).
3271              */
3272             LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
3273             leader = logicalrep_worker_find(MyLogicalRepWorker->subid,
3274                                             InvalidOid, false);
3275
>>>     CID 1665367:         Null pointer dereferences  (NULL_RETURNS)
>>>     Dereferencing a pointer that might be "NULL" "&leader->relmutex" when calling "tas".
3276             SpinLockAcquire(&leader->relmutex);
3277             oldestxmin = leader->oldest_nonremovable_xid;
3278             SpinLockRelease(&leader->relmutex);
3279             LWLockRelease(LogicalRepWorkerLock);
3280         }

I think Coverity has a point.  AFAICS every other call of
logicalrep_worker_find() guards against a NULL result,
so why is it okay for this one to dump core on NULL?

            regards, tom lane



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Sep 5, 2025 at 5:03 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here are v2 patches which addressed above comments.
>

I have pushed the first patch. I find that the test can't reliably
fail without a fix. Can you please investigate it?

--
With Regards,
Amit Kapila.



Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Mon, Sep 8, 2025 at 3:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Coverity is not happy with commit a850be2fe:
>
> /srv/coverity/git/pgsql-git/postgresql/src/backend/replication/logical/worker.c: 3276             in
FindDeletedTupleInLocalRel()
> 3270                     * maybe_advance_nonremovable_xid() for details).
> 3271                     */
> 3272                    LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> 3273                    leader = logicalrep_worker_find(MyLogicalRepWorker->subid,
> 3274                                                                                    InvalidOid, false);
> 3275
> >>>     CID 1665367:         Null pointer dereferences  (NULL_RETURNS)
> >>>     Dereferencing a pointer that might be "NULL" "&leader->relmutex" when calling "tas".
> 3276                    SpinLockAcquire(&leader->relmutex);
> 3277                    oldestxmin = leader->oldest_nonremovable_xid;
> 3278                    SpinLockRelease(&leader->relmutex);
> 3279                    LWLockRelease(LogicalRepWorkerLock);
> 3280            }
>
> I think Coverity has a point.  AFAICS every other call of
> logicalrep_worker_find() guards against a NULL result,
> so why is it okay for this one to dump core on NULL?
>

Thanks for pointing it out. It was a miss.
I attempted to reproduce a SIGSEGV in this flow. It appears that a
SIGSEGV can occur when the tablesync worker is catching up and is in
FindDeletedTupleInLocalRel(), and meanwhile a DROP SUBSCRIPTION is
executed in another session. Here is the sequence that triggers the issue:

1) Pause the tablesync worker while it's in FindDeletedTupleInLocalRel().
2) Issue a 'DROP SUBSCRIPTION sub'.
3) Allow DropSubscription to proceed to logicalrep_worker_stop() for
the apply worker, but block it using the debugger before it attempts
to stop the tablesync worker.
4) Simultaneously, hold the launcher process using the debugger before
it can restart the apply worker.
5) Now, resume the tablesync worker. It ends up with a NULL leader
worker and hits a SIGSEGV.

Since this issue can be reliably reproduced with a simple DROP
SUBSCRIPTION, I thought it would be appropriate to add the new error
as a user-facing error.

Additionally, the issue can also be reproduced if the apply worker is
forcefully made to error out in wait_for_relation_state_change() while
the tablesync worker is applying changes and is in
FindDeletedTupleInLocalRel().

Attached a patch to address the issue.
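For illustration, the guard is conceptually the following sketch (the actual
error code and message wording in the attached patch may differ):

    LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
    leader = logicalrep_worker_find(MyLogicalRepWorker->subid,
                                    InvalidOid, false);

    /* The leader can be gone, e.g. after DROP SUBSCRIPTION. */
    if (leader == NULL)
    {
        LWLockRelease(LogicalRepWorkerLock);
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("could not determine the oldest non-removable transaction ID because the leader apply worker has exited")));
    }

    SpinLockAcquire(&leader->relmutex);
    oldestxmin = leader->oldest_nonremovable_xid;
    SpinLockRelease(&leader->relmutex);
    LWLockRelease(LogicalRepWorkerLock);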

thanks
Shveta

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Monday, September 8, 2025 3:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Sep 5, 2025 at 5:03 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > Here are v2 patches which addressed above comments.
> >
> 
> I have pushed the first patch. I find that the test can't reliably fail without a fix.
> Can you please investigate it?

Thank you for catching this issue. I confirmed that the test may have run
VACUUM before slot.xmin was advanced. Therefore, to improve the test, I modified
the test to wait for the publisher's request message to appear twice, as after
the fix, the apply worker should keep waiting for the publisher status until the
prepared txn is committed.

Also, to reduce test time, I moved the test into the existing 035 test.

Here is the updated test.

Best Regards,
Hou zj

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tuesday, September 2, 2025 6:00 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Mon, Sep 1, 2025 at 5:45 PM shveta malik <shveta.malik@gmail.com>
> wrote:
> >
> > >
> > > Here is V70 patch set.
> > >
> >
> > The patch v70-0001 looks good to me. Verified, all the old issues are resolved.
> >
> > Will resume review of v70-0002 now.
> >
> 
> Please find a few comments on v70-0002:
> 
> 1)
> - * Note: Retention won't be resumed automatically. The user must manually
> - * disable retain_dead_tuples and re-enable it after confirming that the
> - * replication slot maintained by the launcher has been dropped.
> + * The retention will resume automatically if the worker has confirmed
> + that the
> + * retention duration is now within the max_retention_duration.
> 
> Do we need this comment atop stop as it does not directly concern stop? Isn't
> the details regarding RDT_RESUME_CONFLICT_INFO_RETENTION
> in the file-header section sufficient?

Agreed. I removed this comment.

> 
> 2)
> + /* Stop retention if not yet */
> + if (MySubscription->retentionactive)
> + {
> + rdt_data->phase = RDT_STOP_CONFLICT_INFO_RETENTION;
> 
> - /* process the next phase */
> - process_rdt_phase_transition(rdt_data, false);
> + /* process the next phase */
> + process_rdt_phase_transition(rdt_data, false); }
> +
> + reset_retention_data_fields(rdt_data);
> 
> should_stop_conflict_info_retention() does reset_retention_data_fields
> everytime when wait-time exceeds the limit, and when it actually stops i.e.
> calls stop_conflict_info_retention through phase change; the stop function
> also does reset_retention_data_fields and calls process_rdt_phase_transition.
> Can we optimize this code part by consolidating the
> reset_retention_data_fields() and
> process_rdt_phase_transition() calls into
> should_stop_conflict_info_retention() itself, eliminating redundancy?

Agreed. I improved the code here.

> 
> 3)
> Shall we update 035_conflicts.pl to have a resume test by setting
> max_retention_duration to 0 after stop-retention test?

Added.

> 
> 4)
> +          subscription. The retention will be automatically resumed
> once at least
> +          one apply worker confirms that the retention duration is within the
> +          specified limit, or a new subscription is created with
> +          <literal>retain_dead_tuples = true</literal>, or the user manually
> +          re-enables <literal>retain_dead_tuples</literal>.
> 
> Shall we rephrase it slightly to:
> 
> Retention will automatically resume when at least one apply worker confirms
> that the retention duration is within the specified limit, or when a new
> subscription is created with <literal>retain_dead_tuples = true</literal>.
> Alternatively, retention can be manually resumed by re-enabling
> <literal>retain_dead_tuples</literal>.

Changed as suggested.

Here is V71 patch set which addressed above comments and [1].

[1] https://www.postgresql.org/message-id/CAJpy0uC8w442wGEJ0gyR23ojAyvd-s_g-m8fUbixy0V9yOmrcg%40mail.gmail.com

Best Regards,
Hou zj

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Wednesday, September 3, 2025 12:19 PM shveta malik <shveta.malik@gmail.com> wrote:
> On Tue, Sep 2, 2025 at 3:30 PM shveta malik <shveta.malik@gmail.com>
> wrote:
> >
> > >
> > > >
> > > > Here is V70 patch set.
> > > >
> > >
> 
> Please find a few comments on v70-003:
> 
> 1)
> Doc of dead_tuple_retention_active says:
> True if retain_dead_tuples is enabled and the retention duration for information
> used in conflict detection is within max_retention_duration
> 
> Doc of subretentionactive says:
> The retention status of information (e.g., dead tuples, commit timestamps, and
> origins) useful for conflict detection. True if retain_dead_tuples is enabled, and
> the retention duration has not exceeded max_retention_duration, when
> defined.
> 
> There is hardly any difference between the two. Do we really need to have
> 'dead_tuple_retention_active' when we already have 'subretentionactive'?

I felt the retentionactive in pg_subscription can be considered an internal
flag, and users might prefer to access public views over catalogs when possible,
so I'm keeping this new column in the view for now. However, if we find it's not
worthwhile, we can discuss removing this patch.

> 
> 2)
> Doc wise, there is no difference between the two, but there is a small window
> when sub's subretentionactive will show true while stat's
> dead_tuple_retention_active will show false. This will be when worker is
> waiting for the launcher to assign its oldest-xid after it has marked itself as
> 'resuming'.
> If we decide to retain 'dead_tuple_retention_active', then do we need to
> indicate the small difference between the 2 fields in the doc?

Yes, this is also the difference.

I tried to explain the difference in a user visible way, that is, if
dead_tuple_retention_active is true, then update_deleted detection is enabled
(pg_subscription.retentionactive does not ensure this).

> 
> 3)
> We can add a test when we stop-retention to see if this is showing false.
> Currently there are 2 places in the test where we check this field to see if it is
> true. I think we can shift both in the same test. One check before stop-retention,
> one check after stop-retention.

Changed.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Here is V71 patch set which addressed above comments and [1].
>

Thank You for the patches. Please find a few comments on 001:

1)
In compute_min_nonremovable_xid, when 'wait_for_xid' is true, we
should have Assert(!worker->oldest_nonremovable_xid) to ensure it is
always Invalid if reached here.

Or we can rewrite the current if-else-if code logic with the worker's
oldest-xid as the main criterion, as that will be invalid in both blocks:

if (!TransactionIdIsValid(nonremovable_xid))
{
  /* resume case */
  if(wait_for_xid)
    set worker's oldest-xid using slot's xmin
  else
  /* stop case */
    return;
}

It will be slightly easier to understand.

2)
In stop_conflict_info_retention(), there may be a case where due to an
ongoing transaction, it could not update retentionactive to false. But
even in such cases, the function still sets oldest_nonremovable_xid to
Invalid, which seems wrong.

3)
Similar in resume flow, it still sets wait_for_initial_xid=true even
when it could not update retentionactive=true.

4)
resume_conflict_info_retention():
+ /*
+ * Return if the launcher has not initialized oldest_nonremovable_xid.
+ *
+ */
+ if (!TransactionIdIsValid(nonremovable_xid))
+ return;

I think we should have
'Assert(MyLogicalRepWorker->wait_for_initial_xid)' before 'return'
here.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is V71 patch set which addressed above comments and [1].
>

IIUC, after stopping the retention, this patch immediately starts
retrying to resume by transitioning through various phases. This can
consume CPU and network resources if the apply worker takes a long
time to catch up.

Few other comments:
1.
+ /*
+ * Return if the launcher has not initialized oldest_nonremovable_xid.
+ *
+ * It might seem feasible to directly check the conflict detection
+ * slot.xmin instead of relying on the launcher to assign the worker's
+ * oldest_nonremovable_xid; however, that could lead to a race condition
+ * where slot.xmin is set to InvalidTransactionId immediately after the
+ * check. In such cases, oldest_nonremovable_xid would no longer be
+ * protected by a replication slot and could become unreliable if a
+ * wraparound occurs.
+ */
+ if (!TransactionIdIsValid(nonremovable_xid))
+ return;

I understand the reason why we get assigned the worker's
non_removable_xid from launcher but all this makes the code complex to
understand. Isn't there any other way to achieve it? One naïve and
inefficient way could be to just restart the worker probably after
updating its retentionactive flag. I am not suggesting to make this
change but just a brainstorming point.

2. The function should_stop_conflict_info_retention() is invoked from
wait_for_local_flush() and then it can lead further state
transitioning which doesn't appear neat and makes code difficult to
understand.

3.
+ /*
+ * If conflict info retention was previously stopped due to a timeout, and
+ * the time required to advance the non-removable transaction ID has now
+ * decreased to within acceptable limits, resume the rentention.
+ */
+ if (!MySubscription->retentionactive)
+ {
+ rdt_data->phase = RDT_RESUME_CONFLICT_INFO_RETENTION;
+ process_rdt_phase_transition(rdt_data, false);
+ return;
+ }

In this check, where do we check that the time is within the acceptable
range? Can you update the comments to make it clear?

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > Here is V71 patch set which addressed above comments and [1].
> >
> 
> IIUC, this patch after stopping the retention, it immediately starts retrying to
> resume by transitioning through various phases. This can consume CPU and
> network resources, if the apply_worker takes a long time to catch up.

I agree. I think one way is to increase the interval in each cycle when the retention has
been stopped and the worker is retrying to resume the retention. I have
updated the patch for the same.
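Conceptually, the backoff can be as simple as the sketch below (the field and
constant names are illustrative, not necessarily those used in the patch):

    /*
     * Sketch: while retention is stopped, double the interval between
     * retry cycles, up to some maximum, instead of retrying at the
     * normal cadence.
     */
    if (!MySubscription->retentionactive)
        rdt_data->xid_advance_interval =
            Min(rdt_data->xid_advance_interval * 2,
                MAX_XID_ADVANCE_INTERVAL);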

> 
> Few other comments:
> 1.
> + /*
> + * Return if the launcher has not initialized oldest_nonremovable_xid.
> + *
> + * It might seem feasible to directly check the conflict detection
> + * slot.xmin instead of relying on the launcher to assign the worker's
> + * oldest_nonremovable_xid; however, that could lead to a race
> + condition
> + * where slot.xmin is set to InvalidTransactionId immediately after the
> + * check. In such cases, oldest_nonremovable_xid would no longer be
> + * protected by a replication slot and could become unreliable if a
> + * wraparound occurs.
> + */
> + if (!TransactionIdIsValid(nonremovable_xid))
> + return;
> 
> I understand the reason why we get assigned the worker's non_removable_xid
> from launcher but all this makes the code complex to understand. Isn't there
> any other way to achieve it? One naïve and inefficient way could be to just
> restart the worker probably after updating its retentionactive flag. I am not
> suggesting to make this change but just a brainstorming point.

I'll keep thinking about it, and for now, I've added a comment mentioning
that rebooting is a simpler solution.

> 
> 2. The function should_stop_conflict_info_retention() is invoked from
> wait_for_local_flush() and then it can lead further state transitioning which
> doesn't appear neat and makes code difficult to understand.

I changed the logic to avoid proceeding to the next phase when
the retention is stopped.

> 
> 3.
> + /*
> + * If conflict info retention was previously stopped due to a timeout,
> + and
> + * the time required to advance the non-removable transaction ID has
> + now
> + * decreased to within acceptable limits, resume the rentention.
> + */
> + if (!MySubscription->retentionactive)
> + {
> + rdt_data->phase = RDT_RESUME_CONFLICT_INFO_RETENTION;
> + process_rdt_phase_transition(rdt_data, false); return; }
> 
> In this check, where do we check the time has come in acceptable range? Can
> you update comments to make it clear?

I updated the comments to mention that the code can reach here only
when the time is within max_retention_duration.

Here is the V72 patch set which addressed above and Shveta's comments[1].

[1] https://www.postgresql.org/message-id/CAJpy0uDw7SmCN_jOvpNUzo_sE4ZsgpqQ5_vHLjm4aCm10eBApA%40mail.gmail.com

Best Regards,
Hou zj



Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tuesday, September 9, 2025 5:17 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> >
> > Here is V71 patch set which addressed above comments and [1].
> >
> 
> Thank You for the patches. Please find a few comments on 001:
> 
> 1)
> In compute_min_nonremovable_xid, when 'wait_for_xid' is true, we should
> have Assert(!worker->oldest_nonremovable_xid) to ensure it is always Invalid
> if reached here.

Added.

> 
> Or we can rewrite the current if-else-if code logic based on worker's oldest-xid
> as main criteria as that will be NULL in both the blocks:
> 
> if (!TransactionIdIsValid(nonremovable_xid))
> {
>   /* resume case */
>   if(wait_for_xid)
>     set worker's oldest-xid using slot's xmin
>   else
>   /* stop case */
>     return;
> }

I am personally not in favor of this style because it adds more conditions, so I
did not change it in this version.

> 
> 2)
> In stop_conflict_info_retention(), there may be a case where due to an ongoing
> transaction, it could not update retentionactive to false. But even in such cases,
> the function still sets oldest_nonremovable_xid to Invalid, which seems wrong.
> 
> 3)
> Similar in resume flow, it still sets wait_for_initial_xid=true even when it could
> not update retentionactive=true.

Right, I missed the return when the update fails. Fixed in this version.

> 
> 4)
> resume_conflict_info_retention():
> + /*
> + * Return if the launcher has not initialized oldest_nonremovable_xid.
> + *
> + */
> + if (!TransactionIdIsValid(nonremovable_xid))
> + return;
> 
> I think we should have
> 'Assert(MyLogicalRepWorker->wait_for_initial_xid)' before 'return'
> here.

Added.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> >
> > Few other comments:
> > 1.
> > + /*
> > + * Return if the launcher has not initialized oldest_nonremovable_xid.
> > + *
> > + * It might seem feasible to directly check the conflict detection
> > + * slot.xmin instead of relying on the launcher to assign the worker's
> > + * oldest_nonremovable_xid; however, that could lead to a race
> > + condition
> > + * where slot.xmin is set to InvalidTransactionId immediately after the
> > + * check. In such cases, oldest_nonremovable_xid would no longer be
> > + * protected by a replication slot and could become unreliable if a
> > + * wraparound occurs.
> > + */
> > + if (!TransactionIdIsValid(nonremovable_xid))
> > + return;
> >
> > I understand the reason why we get assigned the worker's non_removable_xid
> > from launcher but all this makes the code complex to understand. Isn't there
> > any other way to achieve it? One naïve and inefficient way could be to just
> > restart the worker probably after updating its retentionactive flag. I am not
> > suggesting to make this change but just a brainstorming point.
>
> I'll keep thinking about it, and for now, I've added a comment mentioning
> that rebooting is a simpler solution.
>

Say we have an LWLock, LogicalRepRetentionLock. We acquire it in SHARED mode
as soon as we encounter the first subscription with retain_dead_tuples
set during the traversal of the sublist. We release it only after
updating xmin outside the sublist traversal. We then acquire it in
EXCLUSIVE mode in the worker to resume the retention, for the
period till we fetch the slot's xmin.

This will close the above race condition, but acquiring an LWLock while
traversing the subscription list is not advisable as that will make it
uninterruptible. The other possibility is to use some heavy-weight
lock here, say a lock on the pg_subscription catalog, but that has the
drawback that it can conflict with DDL operations. Yet another way is
to invent a new lock type for this.

OTOH, the simple strategy to let apply-worker restart to resume
retention will keep the handling simpler. We do something similar at
the start of apply-worker if we find that some subscription parameter
is changed. I think we need more opinions on this matter.
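For concreteness, the restart-based alternative could be as simple as the
following sketch, modeled on how the apply worker already restarts when a
subscription parameter changes; the flag name and message wording here are
illustrative only:

    /*
     * Sketch: instead of re-initializing oldest_nonremovable_xid on the
     * fly, just exit and let the launcher restart the worker, which then
     * starts with retention active again.
     */
    if (resume_retention)
    {
        ereport(LOG,
                (errmsg("logical replication worker for subscription \"%s\" will restart to resume conflict information retention",
                        MySubscription->name)));
        proc_exit(0);
    }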

One other comment:
+ if (!MySubscription->retentionactive)
+ {
+ rdt_data->phase = RDT_RESUME_CONFLICT_INFO_RETENTION;
+ process_rdt_phase_transition(rdt_data, false);
+ return;
+ }

It is better that the caller processes the next phase, otherwise, this
looks a bit ad hoc. Similarly, please check other places.

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Monday, September 8, 2025 7:21 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> 
> On Monday, September 8, 2025 3:13 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Sep 5, 2025 at 5:03 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > Here are v2 patches which addressed above comments.
> > >
> >
> > I have pushed the first patch. I find that the test can't reliably fail without a fix.
> > Can you please investigate it?
> 
> Thank you for catching this issue. I confirmed that the test may have tested
> VACCUM before slot.xmin was advanced. Therefore, to improve the test, I
> modified test to wait for the publisher's request message appearing twice, as
> after the fix, the apply worker should keep waiting for publisher status until the
> prepared txn is committed.
> 
> Also, to reduce test time, I moved the test into the existing 035 test.
> 
> Here is the updated test.

I noticed a BF failure[1] on this test. The log shows that the apply worker
advances the non-removable xid to the latest state before waiting for the
prepared transaction to commit. Upon reviewing the log, I didn't find any clues
of a bug in the code. One potential explanation is that the prepared transaction
hasn't reached the injection point before the apply worker requests the
publisher status.

The log lacks the timing for when the injection point is triggered and only
includes:

pub: 2025-09-11 03:40:05.667 CEST [396867][client backend][8/3:0] LOG:  statement: COMMIT PREPARED
'txn_with_later_commit_ts';
..
sub: 2025-09-11 03:40:05.684 CEST [396798][logical replication apply worker][16/0:0] DEBUG:  sending publisher status request message
 

Although the statement on the publisher appears before the publisher request,
the statement log is generated prior to command execution. Thus, it's possible
the injection point is triggered after responding to the publisher status.

After checking some other TAP tests using injection points, most of them ensure
the injection point is triggered before proceeding with the test (by waiting for
the injection point's wait event). We could also add this to the test:

$node_B->wait_for_event('client backend', 'commit-after-delay-checkpoint');

Here is a small patch.

[1]
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=scorpion&dt=2025-09-11%2001%3A17%3A25&stg=subscription-check

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Thu, Sep 11, 2025 at 2:29 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, September 8, 2025 7:21 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> >
> > On Monday, September 8, 2025 3:13 PM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Sep 5, 2025 at 5:03 PM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com>
> > > wrote:
> > > >
> > > > Here are v2 patches which addressed above comments.
> > > >
> > >
> > > I have pushed the first patch. I find that the test can't reliably fail without a fix.
> > > Can you please investigate it?
> >
> > Thank you for catching this issue. I confirmed that the test may have tested
> > VACCUM before slot.xmin was advanced. Therefore, to improve the test, I
> > modified test to wait for the publisher's request message appearing twice, as
> > after the fix, the apply worker should keep waiting for publisher status until the
> > prepared txn is committed.
> >
> > Also, to reduce test time, I moved the test into the existing 035 test.
> >
> > Here is the updated test.
>
> I noticed a BF failure[1] on this test. The log shows that the apply worker
> advances the non-removable xid to the latest state before waiting for the
> prepared transaction to commit. Upon reviewing the log, I didn't find any clues
> of a bug in the code. One potential explanation is that the prepared transaction
> hasn't reached the injection point before the apply worker requests the
> publisher status.
>
> The log lacks the timing for when the injection point is triggered and only
> includes:
>
> pub: 2025-09-11 03:40:05.667 CEST [396867][client backend][8/3:0] LOG:  statement: COMMIT PREPARED
'txn_with_later_commit_ts';
> ..
> sub: 2025-09-11 03:40:05.684 CEST [396798][logical replication apply worker][16/0:0] DEBUG:  sending publisher status
requestmessage 
>
> Although the statement on the publisher appears before the publisher request,
> the statement log is generated prior to command execution. Thus, it's possible
> the injection point is triggered after responding to the publisher status.
>
> After checking some other tap tests using injection points, most of them ensure
> the injection is triggered before proceeding with the test (by waiting for the
> wait event of injection point). We could also add this in the test:
>
> $node_B->wait_for_event('client backend', 'commit-after-delay-checkpoint');
>
> Here is a small patch.
>

Agree with the analysis. The patch looks good.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

От
Dilip Kumar
Дата:
On Wed, Sep 10, 2025 at 2:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > > wrote:
> > >
> > > Few other comments:
> > > 1.
> > > + /*
> > > + * Return if the launcher has not initialized oldest_nonremovable_xid.
> > > + *
> > > + * It might seem feasible to directly check the conflict detection
> > > + * slot.xmin instead of relying on the launcher to assign the worker's
> > > + * oldest_nonremovable_xid; however, that could lead to a race
> > > + condition
> > > + * where slot.xmin is set to InvalidTransactionId immediately after the
> > > + * check. In such cases, oldest_nonremovable_xid would no longer be
> > > + * protected by a replication slot and could become unreliable if a
> > > + * wraparound occurs.
> > > + */
> > > + if (!TransactionIdIsValid(nonremovable_xid))
> > > + return;
> > >
> > > I understand the reason why we get assigned the worker's non_removable_xid
> > > from launcher but all this makes the code complex to understand. Isn't there
> > > any other way to achieve it? One naïve and inefficient way could be to just
> > > restart the worker probably after updating its retentionactive flag. I am not
> > > suggesting to make this change but just a brainstorming point.
> >
> > I'll keep thinking about it, and for now, I've added a comment mentioning
> > that rebooting is a simpler solution.
> >
>
> Say we have a LW LogicalRepRetentionLock. We acquire it in SHARED mode
> as soon as we encounter the first subscription with retain_dead_tuples
> set during the traversal of the sublist. We release it only after
> updating xmin outside the sublist traversal. We then acquire it in
> EXCLUSIVE mode to fetch the resume the retention in worker for the
> period till we fetch slot's xmin.
>
> This will close the above race condition but acquiring LWLock while
> traversing subscription is not advisable as that will make it
> uninterruptible. The other possibility is to use some heavy-weight
> lock here, say a lock on pg_subscription catalog but that has a
> drawback that it can conflict with DDL operations. Yet another way is
> to invent a new lock-type for this.
>
> OTOH, the simple strategy to let apply-worker restart to resume
> retention will keep the handling simpler. We do something similar at
> the start of apply-worker if we find that some subscription parameter
> is changed. I think we need more opinions on this matter.

IMHO the situation of retention being disabled and re-enabled is not a
common occurrence. It typically happens in specific scenarios where
there's a significant replication lag or the user has not configured
the retention timeout correctly. Because these are corner cases, I
believe we should avoid over-engineering a solution and simply restart
the worker, as Amit suggested.

--
Regards,
Dilip Kumar
Google



Re: Conflict detection for update_deleted in logical replication

От
Masahiko Sawada
Дата:
On Thu, Sep 11, 2025 at 3:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Sep 10, 2025 at 2:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, September 9, 2025 7:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > > > wrote:
> > > >
> > > > Few other comments:
> > > > 1.
> > > > + /*
> > > > + * Return if the launcher has not initialized oldest_nonremovable_xid.
> > > > + *
> > > > + * It might seem feasible to directly check the conflict detection
> > > > + * slot.xmin instead of relying on the launcher to assign the worker's
> > > > + * oldest_nonremovable_xid; however, that could lead to a race
> > > > + condition
> > > > + * where slot.xmin is set to InvalidTransactionId immediately after the
> > > > + * check. In such cases, oldest_nonremovable_xid would no longer be
> > > > + * protected by a replication slot and could become unreliable if a
> > > > + * wraparound occurs.
> > > > + */
> > > > + if (!TransactionIdIsValid(nonremovable_xid))
> > > > + return;
> > > >
> > > > I understand the reason why we get assigned the worker's non_removable_xid
> > > > from launcher but all this makes the code complex to understand. Isn't there
> > > > any other way to achieve it? One naïve and inefficient way could be to just
> > > > restart the worker probably after updating its retentionactive flag. I am not
> > > > suggesting to make this change but just a brainstorming point.
> > >
> > > I'll keep thinking about it, and for now, I've added a comment mentioning
> > > that rebooting is a simpler solution.
> > >
> >
> > Say we have a LW LogicalRepRetentionLock. We acquire it in SHARED mode
> > as soon as we encounter the first subscription with retain_dead_tuples
> > set during the traversal of the sublist. We release it only after
> > updating xmin outside the sublist traversal. We then acquire it in
> > EXCLUSIVE mode to fetch the resume the retention in worker for the
> > period till we fetch slot's xmin.
> >
> > This will close the above race condition but acquiring LWLock while
> > traversing subscription is not advisable as that will make it
> > uninterruptible. The other possibility is to use some heavy-weight
> > lock here, say a lock on pg_subscription catalog but that has a
> > drawback that it can conflict with DDL operations. Yet another way is
> > to invent a new lock-type for this.
> >
> > OTOH, the simple strategy to let apply-worker restart to resume
> > retention will keep the handling simpler. We do something similar at
> > the start of apply-worker if we find that some subscription parameter
> > is changed. I think we need more opinions on this matter.
>
> IMHO the situation of retention being disabled and re-enabled is not a
> common occurrence. It typically happens in specific scenarios where
> there's a significant replication lag or the user has not configured
> the retention timeout correctly. Because these are corner cases, I
> believe we should avoid over-engineering a solution and simply restart
> the worker, as Amit suggested.

+1

While it's ideal if workers could initialize their
oldest_nonremovable_xid values on the fly, I believe we can begin with
the simple solution, given that stopping and resuming retention of
conflict info would not happen so often. In fact, frequent stops and
restarts would typically be a sign that users might not be configuring
the options properly for their systems. IIUC, if the workers are able
to do that, we can support activating retain_conflict_info even for
enabled subscriptions. I think we can leave it for future improvements
if necessary.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Friday, September 12, 2025 2:39 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Thu, Sep 11, 2025 at 3:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Sep 10, 2025 at 2:09 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > >
> > > On Wed, Sep 10, 2025 at 9:08 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Tuesday, September 9, 2025 7:01 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > > > > On Tue, Sep 9, 2025 at 11:47 AM Zhijie Hou (Fujitsu)
> > > > > <houzj.fnst@fujitsu.com>
> > > > > wrote:
> > > > >
> > > > > Few other comments:
> > > > > 1.
> > > > > + /*
> > > > > + * Return if the launcher has not initialized oldest_nonremovable_xid.
> > > > > + *
> > > > > + * It might seem feasible to directly check the conflict
> > > > > + detection
> > > > > + * slot.xmin instead of relying on the launcher to assign the
> > > > > + worker's
> > > > > + * oldest_nonremovable_xid; however, that could lead to a race
> > > > > + condition
> > > > > + * where slot.xmin is set to InvalidTransactionId immediately
> > > > > + after the
> > > > > + * check. In such cases, oldest_nonremovable_xid would no
> > > > > + longer be
> > > > > + * protected by a replication slot and could become unreliable
> > > > > + if a
> > > > > + * wraparound occurs.
> > > > > + */
> > > > > + if (!TransactionIdIsValid(nonremovable_xid))
> > > > > + return;
> > > > >
> > > > > I understand the reason why we get assigned the worker's
> > > > > non_removable_xid from launcher but all this makes the code
> > > > > complex to understand. Isn't there any other way to achieve it?
> > > > > One naïve and inefficient way could be to just restart the
> > > > > worker probably after updating its retentionactive flag. I am not
> suggesting to make this change but just a brainstorming point.
> > > >
> > > > I'll keep thinking about it, and for now, I've added a comment
> > > > mentioning that rebooting is a simpler solution.
> > > >
> > >
> > > Say we have a LW LogicalRepRetentionLock. We acquire it in SHARED
> > > mode as soon as we encounter the first subscription with
> > > retain_dead_tuples set during the traversal of the sublist. We
> > > release it only after updating xmin outside the sublist traversal.
> > > We then acquire it in EXCLUSIVE mode to fetch the resume the
> > > retention in worker for the period till we fetch slot's xmin.
> > >
> > > This will close the above race condition but acquiring LWLock while
> > > traversing subscription is not advisable as that will make it
> > > uninterruptible. The other possibility is to use some heavy-weight
> > > lock here, say a lock on pg_subscription catalog but that has a
> > > drawback that it can conflict with DDL operations. Yet another way
> > > is to invent a new lock-type for this.
> > >
> > > OTOH, the simple strategy to let apply-worker restart to resume
> > > retention will keep the handling simpler. We do something similar at
> > > the start of apply-worker if we find that some subscription
> > > parameter is changed. I think we need more opinions on this matter.
> >
> > IMHO the situation of retention being disabled and re-enabled is not a
> > common occurrence. It typically happens in specific scenarios where
> > there's a significant replication lag or the user has not configured
> > the retention timeout correctly. Because these are corner cases, I
> > believe we should avoid over-engineering a solution and simply restart
> > the worker, as Amit suggested.
> 
> +1
> 
> While it's ideal if workers could initialize their oldest_nonremovable_xid values
> on-the-fly, I believe we can begin with the simple solution given that stopping
> and resuming retaining of conflict info would not happen so often. In fact,
> frequent stops and restarts would typically be a sign that users might be not
> configuring the options properly for their systems. IIUC if the workers are able to
> do that, we can support to activate retain_conflict_info even for enabled
> subscriptions. I think we can leave it for future improvements if necessary.

I agree. Here is a V73 patch that will restart the worker if the retention
resumes. I also addressed other comments posted by Amit[1].

[1] https://www.postgresql.org/message-id/CAA4eK1LA7mnvKT9L8Nx_h%2B0TCvq-Ob2BGPO1bQKhkGHtoZKsow%40mail.gmail.com

Best Regards,
Hou zj

Вложения

Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Sep 12, 2025 at 8:55 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> I agree. Here is a V73 patch that will restart the worker if the retention
> resumes. I also addressed other comments posted by Amit[1].
>

Few comments:
============
* In adjust_xid_advance_interval(), for the case when retention is not
active, we still cap the interval by wal_receiver_status_interval. Is
that required, or do we let it go till 3 minutes? We can add a new
else-if branch to keep the code clear, if you agree with this.

*
+ /*
+ * Resume retention immediately if required. (See
+ * should_resume_retention_immediately() for details).
+ */
+ if (should_resume_retention_immediately(rdt_data, status_received))
+ rdt_data->phase = RDT_RESUME_CONFLICT_INFO_RETENTION;

Is this optimization for the case when we are in stop_phase or after
stop_phase and someone has changed maxretention to 0? If so, I don't
think it is worth optimizing for such a rare case at the cost of
making code difficult to understand.

Apart from this, I have changed a few comments in the attached.

--
With Regards,
Amit Kapila.

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
On Fri, Sep 12, 2025 at 8:55 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> I agree. Here is a V73 patch that will restart the worker if the retention
> resumes. I also addressed other comments posted by Amit[1].
>

Thanks for the patch. Few comments:

1)
There is a small window where the worker can exit while resuming
retention and the launcher can end up accessing stale worker info.

Let's say the launcher is at a stage where it has fetched the worker:
w = logicalrep_worker_find(sub->oid, InvalidOid, false);

And after this point, before the launcher could do
compute_min_nonremovable_xid(), the worker has stopped retention and
resumed as well. Now the worker has exited, but in
compute_min_nonremovable_xid(), the launcher will still use the
worker info fetched previously.

2)

  if (should_stop_conflict_info_retention(rdt_data))
+  {
+    /*
+     * Stop retention if not yet. Otherwise, reset to the initial phase to
+     * retry all phases. This is required to recalculate the current wait
+     * time and resume retention if the time falls within
+     * max_retention_duration.
+     */
+    if (MySubscription->retentionactive)
+      rdt_data->phase = RDT_STOP_CONFLICT_INFO_RETENTION;
+    else
+      reset_retention_data_fields(rdt_data);
+
    return;
+  }



Shall we have an Assert( !MyLogicalRepWorker->oldest_nonremovable_xid)
in 'else' part above?

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Friday, September 12, 2025 4:48 PM shveta malik <shveta.malik@gmail.com> wrote:
> On Fri, Sep 12, 2025 at 8:55 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> >
> > I agree. Here is a V73 patch that will restart the worker if the
> > retention resumes. I also addressed other comments posted by Amit[1].
> >
> 
> Thanks for the patch. Few comments:

Thanks for the comments!

> 
> 1)
> There is a small window where worker can exit  while resuming retention and
> launcher can end up acessign stale worker info.
> 
> Lets say launcher is at a stage where it has fetched worker:
> w = logicalrep_worker_find(sub->oid, InvalidOid, false);
> 
> And after this point, before the launcher could do
> compute_min_nonremovable_xid(), the worker has stopped retention and
> resumed as well. Now the worker has exited but in
> compute_min_nonremovable_xid(), launcher will still use the worker-info
> fetched previously.

Thanks for catching this. I have fixed it by computing the xid under
LogicalRepWorkerLock.
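For clarity, the fix being described amounts to holding the lock across both
steps, roughly as in the sketch below (the call to
compute_min_nonremovable_xid() is illustrative; its exact signature follows
the patch):

    /*
     * Sketch: keep LogicalRepWorkerLock held from the lookup until the
     * worker's oldest_nonremovable_xid has been read, so the entry
     * cannot be recycled in between.
     */
    LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);

    w = logicalrep_worker_find(sub->oid, InvalidOid, false);
    if (w != NULL)
        compute_min_nonremovable_xid(w, &xmin);     /* illustrative call */

    LWLockRelease(LogicalRepWorkerLock);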

> 
> 2)
> 
>   if (should_stop_conflict_info_retention(rdt_data))
> +  {
> +    /*
> +     * Stop retention if not yet. Otherwise, reset to the initial phase
> +to
> +     * retry all phases. This is required to recalculate the current
> +wait
> +     * time and resume retention if the time falls within
> +     * max_retention_duration.
> +     */
> +    if (MySubscription->retentionactive)
> +      rdt_data->phase = RDT_STOP_CONFLICT_INFO_RETENTION;
> +    else
> +      reset_retention_data_fields(rdt_data);
> +
>     return;
> +  }
> 
> 
> 
> Shall we have an Assert( !MyLogicalRepWorker->oldest_nonremovable_xid)
> in 'else' part above?

Added.

Here is the V74 patch which addressed all comments.

Best Regards,
Hou zj

Вложения

RE: Conflict detection for update_deleted in logical replication

От
"Zhijie Hou (Fujitsu)"
Дата:
On Friday, September 12, 2025 4:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Sep 12, 2025 at 8:55 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > I agree. Here is a V73 patch that will restart the worker if the
> > retention resumes. I also addressed other comments posted by Amit[1].
> >
> 
> Few comments:
> ============
> * In adjust_xid_advance_interval(), for the case when retention is not active,
> we still cap the interval by wal_receiver_status_interval. Is that required or do
> we let it go till 3 minutes? We can add a new else if loop to keep the code clear,
> if you agree with this.

I agree we can let it go till 3 mins, and changed the patch as suggested.

> 
> *
> + /*
> + * Resume retention immediately if required. (See
> + * should_resume_retention_immediately() for details).
> + */
> + if (should_resume_retention_immediately(rdt_data, status_received))
> + rdt_data->phase = RDT_RESUME_CONFLICT_INFO_RETENTION;
> 
> Is this optimization for the case when we are in stop_phase or after stop_phase
> and someone has changed maxretention to 0? If so, I don't think it is worth
> optimizing for such a rare case at the cost of making code difficult to
> understand.

Agreed. I removed this in V74.

> 
> Apart from this, I have changed a few comments in the attached.

Thanks for the patch, it looks good to me. I have merged it in V74.

Best Regards,
Hou zj



Re: Conflict detection for update_deleted in logical replication

От
Amit Kapila
Дата:
On Fri, Sep 12, 2025 at 3:39 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is the V74 patch which addressed all comments.
>

+ ereport(LOG,
+ errmsg("logical replication worker for subscription \"%s\" will
resume retaining the information for detecting conflicts",
+    MySubscription->name),
+ MySubscription->maxretention
+ ? errdetail("Retention of information used for conflict detection is
now within the max_retention_duration of %u ms.",
+ MySubscription->maxretention)
+ : errdetail("Retention of information used for conflict detection is
now indefinite."));

The detail message doesn't seem to convey the correct meaning, as the
duration is compared with something vague. How about changing the
errdetail messages as follows:
"Retention is re-enabled as the apply process is advancing its xmin
within the configured max_retention_duration of %u ms."
"Retention is re-enabled as max_retention_duration is set to unlimited."

If you agree with the above then we can consider changing the existing
errdetail related to stop_retention functionality as follows:
"Retention is stopped as the apply process is not advancing its xmin
within the configured max_retention_duration of %u ms."

Apart from these, I have made some cosmetic changes in the attached.

--
With Regards,
Amit Kapila.

Вложения

Re: Conflict detection for update_deleted in logical replication

От
shveta malik
Дата:
One concern:

  if (should_stop_conflict_info_retention(rdt_data))
+ {
+ /*
+ * Stop retention if not yet. Otherwise, reset to the initial phase to
+ * retry all phases. This is required to recalculate the current wait
+ * time and resume retention if the time falls within
+ * max_retention_duration.
+ */
+ if (MySubscription->retentionactive)
+ rdt_data->phase = RDT_STOP_CONFLICT_INFO_RETENTION;
+ else
+ reset_retention_data_fields(rdt_data);
+
  return;
+ }

 Instead of above code changes, shall we have:

 if (should_stop_conflict_info_retention(rdt_data))
     rdt_data->phase = RDT_STOP_CONFLICT_INFO_RETENTION;  (always)

And then stop_conflict_info_retention() should have these checks:

if (MySubscription->retentionactive)
{
    ...update flag and perform stop (current functionality)
}
else
{
        Assert(!TransactionIdIsValid(MyLogicalRepWorker->oldest_nonremovable_xid));
        reset_retention_data_fields(rdt_data);
}

thanks
Shveta



RE: Conflict detection for update_deleted in logical replication

From:
"Zhijie Hou (Fujitsu)"
Date:
On Monday, September 15, 2025 12:55 PM shveta malik <shveta.malik@gmail.com> wrote:
> 
> One concern:
> 
>   if (should_stop_conflict_info_retention(rdt_data))
> + {
> + /*
> + * Stop retention if not yet. Otherwise, reset to the initial phase to
> + * retry all phases. This is required to recalculate the current wait
> + * time and resume retention if the time falls within
> + * max_retention_duration.
> + */
> + if (MySubscription->retentionactive)
> + rdt_data->phase = RDT_STOP_CONFLICT_INFO_RETENTION;
> + else
> + reset_retention_data_fields(rdt_data);
> +
>   return;
> + }
> 
>  Instead of above code changes, shall we have:
> 
>  if (should_stop_conflict_info_retention(rdt_data))
>      rdt_data->phase = RDT_STOP_CONFLICT_INFO_RETENTION;
> (always)
> 
> And then stop_conflict_info_retention() should have these checks:
> 
> if (MySubscription->retentionactive)
> {
>     ...update flag and perform stop (current functionality)
> }
> else
> {
> 
> Assert(!TransactionIdIsValid(MyLogicalRepWorker->oldest_nonremovable_xid));
>         reset_retention_data_fields(rdt_data);
> }

Thanks for the suggestion. I refactored the code accordingly.

Here is the V75 patch, which also addressed Amit's comments[1].

Best Regards,
Hou zj

Attachments

RE: Conflict detection for update_deleted in logical replication

From:
"Zhijie Hou (Fujitsu)"
Date:
On Monday, September 15, 2025 12:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Sep 12, 2025 at 3:39 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > Here is the V74 patch which addressed all comments.
> >
> 
> + ereport(LOG,
> + errmsg("logical replication worker for subscription \"%s\" will
> resume retaining the information for detecting conflicts",
> +    MySubscription->name),
> + MySubscription->maxretention
> + ? errdetail("Retention of information used for conflict detection is
> now within the max_retention_duration of %u ms.",
> + MySubscription->maxretention)
> + : errdetail("Retention of information used for conflict detection is
> now indefinite."));
> 
> The detail message doesn't seem to convey the correct meaning as the
> duration is compared with something vague. How about changing errdetail
> messages as follows:
> "Retention is re-enabled as the apply process is advancing its xmin within the
> configured max_retention_duration of %u ms."
> "Retention is re-enabled as max_retention_duration is set to unlimited."
> 
> If you agree with the above then we can consider changing the existing errdetail
> related to stop_retention functionality as follows:
> "Retention is stopped as the apply process is not advancing its xmin within the
> configured max_retention_duration of %u ms."
> 
> Apart from these, I have made some cosmetic changes in the attached.

Thanks, the changes look good to me. I have merged them in V75 patch.

Best Regards,
Hou zj

Re: Conflict detection for update_deleted in logical replication

From:
Amit Kapila
Date:
On Mon, Sep 15, 2025 at 1:07 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Thanks, the changes look good to me. I have merged them in V75 patch.
>

Pushed. I see a BF failure which is not related to this commit but to a
previous commit for the work in this thread.

See LOGs [1]:
regress_log_035_conflicts
-----------------------------------
[11:16:47.604](0.015s) not ok 24 - the deleted column is removed
[11:16:47.605](0.002s) #   Failed test 'the deleted column is removed'
#   at /home/bf/bf-build/kestrel/HEAD/pgsql/src/test/subscription/t/035_conflicts.pl
line 562.

Then the corresponding subscriber LOG:

2025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO:  vacuuming "postgres.public.tab"
2025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO:  finished vacuuming "postgres.public.tab": index scans: 0
pages: 0 removed, 1 remain, 1 scanned (100.00% of total), 0 eagerly scanned
tuples: 0 removed, 1 remain, 0 are dead but not yet removable
tuples missed: 1 dead from 1 pages not removed due to cleanup lock contention
removable cutoff: 787, which was 0 XIDs old when operation ended
...

This indicates that VACUUM is not able to remove the row even after the
slot is advanced, because some other background process holds a lock/pin
on the page. I think that is possible because the page was dirty due to
the apply of the update operation, and the bgwriter/checkpointer could
try to write such a page.

I'll analyze more tomorrow and share if I have any new findings.


[1]:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=kestrel&dt=2025-09-15%2009%3A08%3A07&stg=subscription-check

--
With Regards,
Amit Kapila.



RE: Conflict detection for update_deleted in logical replication

From:
"Zhijie Hou (Fujitsu)"
Date:
On Monday, September 15, 2025 8:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Mon, Sep 15, 2025 at 1:07 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > Thanks, the changes look good to me. I have merged them in V75 patch.
> >
> 
> Pushed. I see a BF which is not related with this commit but a previous commit
> for the work in this thread.
> 
> See LOGs [1]:
> regress_log_035_conflicts
> -----------------------------------
> [11:16:47.604](0.015s) not ok 24 - the deleted column is removed
> [11:16:47.605](0.002s) #   Failed test 'the deleted column is removed'
> #   at
> /home/bf/bf-build/kestrel/HEAD/pgsql/src/test/subscription/t/035_conflict
> s.pl
> line 562.
> 
> Then the corresponding subscriber LOG:
> 
> 025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO:
> vacuuming "postgres.public.tab"
> 2025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO:
> finished vacuuming "postgres.public.tab": index scans: 0
> pages: 0 removed, 1 remain, 1 scanned (100.00% of total), 0 eagerly scanned
> tuples: 0 removed, 1 remain, 0 are dead but not yet removable tuples missed: 1
> dead from 1 pages not removed due to cleanup lock contention removable
> cutoff: 787, which was 0 XIDs old when operation ended ...
> 
> This indicates that the Vacuum is not able to remove the row even after the slot
> is advanced because some other background process holds a lock/pin on the
> page. I think that is possible because the page was dirty due to apply of update
> operation and bgwriter/checkpointer could try to write such a page.
> 
> I'll analyze more tomorrow and share if I have any new findings.

I agree with the analysis. I attempted to delete a tuple from a table and, while
executing VACUUM (VERBOSE) on this table, I executed a checkpoint concurrently.
Using the debugger, I stopped in SyncOneBuffer() after it had acquired the
page's buffer. This allowed me to reproduce the same log where the deleted row
could not be removed:

--
tuples: 0 removed, 1 remain, 0 are dead but not yet removable
tuples missed: 1 dead from 1 pages not removed due to cleanup lock contention
--
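
For reference, a rough SQL outline of that reproduction (the table name and
values are illustrative; pausing the checkpointer inside SyncOneBuffer() is
done with a debugger, as noted in the comments):

-- session 1: a one-page table with a single dead tuple
CREATE TABLE vac_test (id int PRIMARY KEY, value int);
INSERT INTO vac_test VALUES (1, 1);
DELETE FROM vac_test WHERE id = 1;

-- session 2: run a checkpoint and, with a debugger, pause it inside
-- SyncOneBuffer() after the buffer of the table's page is acquired
CHECKPOINT;

-- session 1: while the checkpointer holds the buffer, VACUUM cannot get
-- the cleanup lock and skips pruning the dead tuple
VACUUM (VERBOSE) vac_test;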

I think we can remove the VACUUM that is used to remove the deleted column. We
have already confirmed that the replication slot's xmin has advanced, which
should be sufficient to prove that the feature works correctly.
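
For instance, the test could keep verifying only the slot state, along these
lines (assuming the conflict-detection slot created by this feature is named
pg_conflict_detection):

-- the slot's xmin advancing past the remote DELETE already shows that the
-- retained information has been released; pruning by VACUUM is then only a
-- matter of timing, not of correctness
SELECT slot_name, xmin
FROM pg_replication_slots
WHERE slot_name = 'pg_conflict_detection';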

Best Regards,
Hou zj

Attachments

RE: Conflict detection for update_deleted in logical replication

From:
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 16, 2025 11:54 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> 
> On Monday, September 15, 2025 8:11 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Sep 15, 2025 at 1:07 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Thanks, the changes look good to me. I have merged them in V75 patch.
> > >
> >
> > Pushed. I see a BF which is not related with this commit but a
> > previous commit for the work in this thread.
> >
> > See LOGs [1]:
> > regress_log_035_conflicts
> > -----------------------------------
> > [11:16:47.604](0.015s) not ok 24 - the deleted column is removed
> > [11:16:47.605](0.002s) #   Failed test 'the deleted column is removed'
> > #   at
> > /home/bf/bf-build/kestrel/HEAD/pgsql/src/test/subscription/t/035_confl
> > ict
> > s.pl
> > line 562.
> >
> > Then the corresponding subscriber LOG:
> >
> > 025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO:
> > vacuuming "postgres.public.tab"
> > 2025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO:
> > finished vacuuming "postgres.public.tab": index scans: 0
> > pages: 0 removed, 1 remain, 1 scanned (100.00% of total), 0 eagerly
> > scanned
> > tuples: 0 removed, 1 remain, 0 are dead but not yet removable tuples
> > missed: 1 dead from 1 pages not removed due to cleanup lock contention
> > removable
> > cutoff: 787, which was 0 XIDs old when operation ended ...
> >
> > This indicates that the Vacuum is not able to remove the row even
> > after the slot is advanced because some other background process holds
> > a lock/pin on the page. I think that is possible because the page was
> > dirty due to apply of update operation and bgwriter/checkpointer could try to
> write such a page.
> >
> > I'll analyze more tomorrow and share if I have any new findings.
> 
> I agree with the analysis. I attempted to delete a tuple from a table and, while
> executing VACUUM(verbose) on this table, I executed a checkpoint
> concurrently.
> Using the debugger, I stoped in SyncOneBuffer() after acquiring the page
> block.
> This allowed me to reproduce the same log where the deleted row could not be
> removed:
> 
> --
> tuples: 0 removed, 1 remain, 0 are dead but not yet removable tuples missed: 1
> dead from 1 pages not removed due to cleanup lock contention
> --
> 
> I think we can remove the VACUUM for removing the deleted column. We have
> already confirmed that the replication slot.xmin has advanced, which should
> be sufficient to prove that the feature works correctly.

Apart from the above fix, I noticed another BF failure[1].

--
timed out waiting for match: (?^:logical replication worker for subscription "tap_sub_a_b" will resume retaining the information for detecting conflicts
--

It is clear from the log[2] that the apply worker resumes retention immediately
after the synchronized_standby_slots configuration is removed, but before
max_retention_duration is set to 0. We expected resumption to occur only after
adjusting max_retention_duration to 0, so the test missed the expected log
message. To ensure stability, we should postpone the removal of
synchronized_standby_slots until after setting max_retention_duration to 0.
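
In other words, the intended ordering in the test is roughly as follows
(illustrative statements only; the physical slot name is a placeholder, and
max_retention_duration is assumed to be the subscription option introduced by
this patch set):

-- subscriber: make retention unlimited first, so that the later resume is
-- attributed to this setting rather than to the old 1 ms limit
ALTER SUBSCRIPTION tap_sub_a_b SET (max_retention_duration = 0);

-- publisher: only afterwards clear synchronized_standby_slots and drop the
-- physical slot, which unblocks the apply worker
ALTER SYSTEM RESET synchronized_standby_slots;
SELECT pg_reload_conf();
SELECT pg_drop_replication_slot('standby_physical_slot');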

I can reproduce it locally by adding "sleep 10;" after resetting the
synchronized_standby_slots GUC and before the resume test.

I updated the patch to fix this as well.

[1]
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2025-09-15%2009%3A07%3A45&stg=subscription-check

[2]
2025-09-15 11:53:41.450 UTC [3604:23] LOG:  logical replication worker for subscription "tap_sub_a_b" will resume retaining the information for detecting conflicts
2025-09-15 11:53:41.450 UTC [3604:24] DETAIL:  Retention is re-enabled as the apply process is advancing its xmin within the configured max_retention_duration of 1 ms.

Best Regards,
Hou zj

Attachments

Re: Conflict detection for update_deleted in logical replication

From:
shveta malik
Date:
On Tue, Sep 16, 2025 at 10:08 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 16, 2025 11:54 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> >
> > On Monday, September 15, 2025 8:11 PM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Sep 15, 2025 at 1:07 PM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Thanks, the changes look good to me. I have merged them in V75 patch.
> > > >
> > >
> > > Pushed. I see a BF which is not related with this commit but a
> > > previous commit for the work in this thread.
> > >
> > > See LOGs [1]:
> > > regress_log_035_conflicts
> > > -----------------------------------
> > > [11:16:47.604](0.015s) not ok 24 - the deleted column is removed
> > > [11:16:47.605](0.002s) #   Failed test 'the deleted column is removed'
> > > #   at
> > > /home/bf/bf-build/kestrel/HEAD/pgsql/src/test/subscription/t/035_confl
> > > ict
> > > s.pl
> > > line 562.
> > >
> > > Then the corresponding subscriber LOG:
> > >
> > > 025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO:
> > > vacuuming "postgres.public.tab"
> > > 2025-09-15 11:16:47.600 CEST [1888170][client backend][1/13:0] INFO:
> > > finished vacuuming "postgres.public.tab": index scans: 0
> > > pages: 0 removed, 1 remain, 1 scanned (100.00% of total), 0 eagerly
> > > scanned
> > > tuples: 0 removed, 1 remain, 0 are dead but not yet removable tuples
> > > missed: 1 dead from 1 pages not removed due to cleanup lock contention
> > > removable
> > > cutoff: 787, which was 0 XIDs old when operation ended ...
> > >
> > > This indicates that the Vacuum is not able to remove the row even
> > > after the slot is advanced because some other background process holds
> > > a lock/pin on the page. I think that is possible because the page was
> > > dirty due to apply of update operation and bgwriter/checkpointer could try to
> > write such a page.
> > >
> > > I'll analyze more tomorrow and share if I have any new findings.
> >
> > I agree with the analysis. I attempted to delete a tuple from a table and, while
> > executing VACUUM(verbose) on this table, I executed a checkpoint
> > concurrently.
> > Using the debugger, I stoped in SyncOneBuffer() after acquiring the page
> > block.
> > This allowed me to reproduce the same log where the deleted row could not be
> > removed:
> >
> > --
> > tuples: 0 removed, 1 remain, 0 are dead but not yet removable tuples missed: 1
> > dead from 1 pages not removed due to cleanup lock contention
> > --
> >
> > I think we can remove the VACUUM for removing the deleted column. We have
> > already confirmed that the replication slot.xmin has advanced, which should
> > be sufficient to prove that the feature works correctly.
>
> Apart from above fix, I noticed another BF failure[1].
>
> --
> timed out waiting for match: (?^:logical replication worker for subscription "tap_sub_a_b" will resume retaining the information for detecting conflicts
> --
>
> It is clear from the log[2] that the apply worker resumes retention immediately
> after the synchronized_standby_slots configuration is removed, but before the
> max_retention_duration is set to 0. We expected resumption to occur only after
> adjusting max_retention_duration to 0, thus overlooking the log. To ensure
> stability, we should postpone the removal of synchronized_standby_slots until
> setting max_retention_duration to 0.
>
> I can reproduce it locally by adding "sleep 10;" after resetting the
> synchronized_standby_slots GUC and before resume test
>
> I updated the patch to fix this as well.
>

Thank You for the patch. Fix looks good.

Shall we update the comment:
+# Drop the physical slot and reset the synchronized_standby_slots setting. We
+# change this after setting max_retention_duration to 0, ensuring the apply
+# worker does not resume prematurely without noticing the updated
+# max_retention_duration value.

to:
Drop the physical slot and reset the synchronized_standby_slots
setting. This adjustment is made after setting max_retention_duration
to 0, ensuring consistent results in the test case as the resumption
becomes possible immediately after resetting
synchronized_standby_slots, due to the smaller max_retention_duration
value of 1ms previously.

thanks
Shveta



Re: Conflict detection for update_deleted in logical replication

From:
Amit Kapila
Date:
On Tue, Sep 16, 2025 at 11:13 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> >
> > I updated the patch to fix this as well.
> >
>
> Thank You for the patch. Fix looks good.
>
> Shall we update the comment:
> +# Drop the physical slot and reset the synchronized_standby_slots setting. We
> +# change this after setting max_retention_duration to 0, ensuring the apply
> +# worker does not resume prematurely without noticing the updated
> +# max_retention_duration value.
>
> to:
> Drop the physical slot and reset the synchronized_standby_slots
> setting. This adjustment is made after setting max_retention_duration
> to 0, ensuring consistent results in the test case as the resumption
> becomes possible immediately after resetting
> synchronized_standby_slots, due to the smaller max_retention_duration
> value of 1ms previously.
>

Thanks, I have modified the patch based on what you suggested and pushed.

--
With Regards,
Amit Kapila.