Discussion: [BUG?] check_exclusion_or_unique_constraint false negative
Hello, everyone!
While reviewing [1], I noticed that check_exclusion_or_unique_constraint occasionally returns false negatives for btree unique indexes during UPSERT operations.
Although this doesn't cause any real issues with INSERT ON CONFLICT, I wanted to bring it to your attention, as it might indicate an underlying problem.
Attached is a patch to reproduce the issue.
make -C src/test/modules/test_misc/ check PROVE_TESTS='t/006_*'
....
Failed test 'concurrent INSERTs status (got 2 vs expected 0)'
# at t/006_concurrently_unique_fail.pl line 26.
# Failed test 'concurrent INSERTs stderr /(?^:^$)/'
# at t/006_concurrently_unique_fail.pl line 26.
# 'pgbench: error: client 34 script 0 aborted in command 0 query 0: ERROR: we know 31337 in the index!
Best regards,
Mikhail.
Hello, Andres.
Sorry to bother you, but I feel it's necessary to have this possible issue validated by someone who can decide whether it is okay or not.
The issue is reproducible starting from the first UPSERT implementation (your commit 168d5805e4c08bed7b95d351bf097cff7c07dd65 from 2015) up to now.
The problem appears as follows:
* A unique index contains a specific value (in the test, it is the only value for the entire index).
* check_exclusion_or_unique_constraint returns FALSE for that value in some random cases.
* Technically, this means index_getnext finds 0 records, even though we know the value exists in the index.
I was able to reproduce this only with an UNLOGGED table.
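For illustration only, here is roughly the kind of workload the attached test drives with pgbench (table and column names are placeholders, not the exact test):

CREATE UNLOGGED TABLE upsert_tab (i int PRIMARY KEY, data text);
INSERT INTO upsert_tab VALUES (31337, 'x');

-- run concurrently from many clients:
INSERT INTO upsert_tab VALUES (31337, 'y')
    ON CONFLICT (i) DO UPDATE SET data = EXCLUDED.data;

-- the row with i = 31337 is present the whole time, yet the pre-check in
-- check_exclusion_or_unique_constraint occasionally reports no conflicting row,
-- which the attached patch appears to surface as the
-- "we know 31337 in the index!" error seen in the test output above.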
I can't find any scenarios that are actually broken (since the issue is resolved by speculative insertion later), but this looks suspicious to me. It could be a symptom of some tricky race condition in the btree.
Best regards,
Mikhail
Hello, everyone!
Updates so far:
* the issue happens with both LOGGED and UNLOGGED relations
* the issue happens with DirtySnapshot
* it does not happen with SnapshotSelf
* it does not happen with SnapshotAny
* it is not related to speculatively inserted tuples - I commented out the code that inserts them, and the issue continues to occur
Best regards,
Mikhail.
It seems like I've identified the cause of the issue.
Currently, any DirtySnapshot (or SnapshotSelf) scan over a B-tree index may skip (not find the TID for) some records in the case of parallel updates.
The following scenario is possible (a SQL-level sketch follows the list):
* Session 1 reads a B-tree page using SnapshotDirty and copies item X to the buffer.
* Session 2 updates item X, inserting a new TID Y into the same page.
* Session 2 commits its transaction.
* Session 1 starts to fetch from the heap and tries to fetch X, but it was already deleted by session 2. So, it goes to the B-tree for the next TID.
* The B-tree goes to the next page, skipping Y.
* Therefore, the search finds nothing, but tuple Y is still alive.
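To make the timeline concrete, here is a minimal SQL-level sketch (names are illustrative; the extra index only exists to prevent HOT updates, so the UPDATE really produces a new index entry Y):

CREATE TABLE tab (i int PRIMARY KEY, data text);
CREATE INDEX tab_data_idx ON tab (data);   -- disables HOT for updates of "data"
INSERT INTO tab VALUES (13, 'old');        -- item X

-- Session 1: runs a DirtySnapshot lookup of i = 13, e.g. the pre-check inside
--            INSERT INTO tab VALUES (13, 'z') ON CONFLICT (i) DO UPDATE SET data = 'z';
-- Session 2, at the same moment:
BEGIN;
UPDATE tab SET data = 'new' WHERE i = 13;  -- new TID Y on the same B-tree page
COMMIT;

-- If session 2 commits after session 1 has copied the page items but before it
-- fetches heap tuple X, session 1's scan can return nothing at all, even though
-- a row with i = 13 exists the whole time.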
This situation is somewhat controversial. One might expect DirtySnapshot to show more (or more recent, even uncommitted) data than MVCC, but never less. Yet a DirtySnapshot scan over a B-tree does not actually provide that guarantee, as far as I understand.
Why does it work for MVCC? Because tuple X will be visible due to the snapshot, making Y unnecessary.
This might be "as designed," but I think it needs to be clearly documented (I couldn't find any documentation on this particular case, only _bt_drop_lock_and_maybe_pin - related).
Here are the potential consequences of the issue:
* check_exclusion_or_unique_constraint
It may not find a record in a UNIQUE index during INSERT ON CONFLICT UPDATE. However, this is just a minor performance issue.
* Exclusion constraints with B-tree, like ADD CONSTRAINT exclusion_data EXCLUDE USING btree (data WITH =)
It should work correctly because the first inserter may "skip" the TID from a concurrent inserter, but the second one should still find the TID from the first (see the sketch after this list).
* RelationFindReplTupleByIndex
Amit, this is why I've included you in this previously solo thread :)
RelationFindReplTupleByIndex uses DirtySnapshot and may not find some records if they are updated by a parallel transaction. This could lead to lost deletes/updates, especially in the case of streaming=parallel mode.
I'm not familiar with how parallel workers apply transactions, so maybe this isn't possible.
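Returning to the exclusion-constraint item above, a minimal sketch of that case (names are illustrative):

CREATE TABLE excl_tab (data int);
ALTER TABLE excl_tab
    ADD CONSTRAINT exclusion_data EXCLUDE USING btree (data WITH =);

-- Two sessions run the same statement concurrently:
INSERT INTO excl_tab VALUES (1);

-- Each inserter checks the index after inserting its own entry. Even if one
-- check misses the other's freshly inserted TID (the race described above),
-- the other check still sees the first entry, so at least one of the two
-- transactions detects the conflict and the constraint holds.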
Best regards,
Mikhail
Dear Michail,

Thanks for pointing out the issue!

> * RelationFindReplTupleByIndex
>
> Amit, this is why I've included you in this previously solo thread :)
> RelationFindReplTupleByIndex uses DirtySnapshot and may not find some records
> if they are updated by a parallel transaction. This could lead to lost
> deletes/updates, especially in the case of streaming=parallel mode.
> I'm not familiar with how parallel workers apply transactions, so maybe this
> isn't possible.

IIUC, the issue can happen when two concurrent transactions using DirtySnapshot access the same tuples, which is not specific to the parallel apply. Consider that two subscriptions exist and publishers modify the same tuple of the same table. In this case, two workers access the tuple, so one of the changes may be missed by the scenario you said.

I feel we do not need special treatments for parallel apply.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hello, Hayato!
> Thanks for pointing out the issue!
Thanks for your attention!
> IIUC, the issue can happen when two concurrent transactions using DirtySnapshot access
> the same tuples, which is not specific to the parallel apply
Not exactly, it happens for any DirtySnapshot scan over a B-tree index with some other transaction updating the same index page (even using the MVCC snapshot).
So, the logical replication related scenario looks like this:
* the subscriber worker receives a tuple update/delete from the publisher
* it calls RelationFindReplTupleByIndex to find the tuple in the local table
* some other transaction updates the tuple in the local table (on the subscriber side) in parallel
* RelationFindReplTupleByIndex may not find the tuple because it uses DirtySnapshot
* the update/delete is lost
Parallel apply mode looks more dangerous because it uses multiple workers on the subscriber side, so the probability of the issue is higher.
In that case, "some other transaction" is just another worker applying the changes of a different transaction in parallel.
Best regards,
Mikhail.
On Thu, Aug 1, 2024 at 2:55 PM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
>
> > Thanks for pointing out the issue!
>
> Thanks for your attention!
>
> > IIUC, the issue can happen when two concurrent transactions using DirtySnapshot access
> > the same tuples, which is not specific to the parallel apply
>
> Not exactly, it happens for any DirtySnapshot scan over a B-tree index with some other transaction updating the same index page (even using the MVCC snapshot).
>
> So, the logical replication related scenario looks like this:
>
> * the subscriber worker receives a tuple update/delete from the publisher
> * it calls RelationFindReplTupleByIndex to find the tuple in the local table
> * some other transaction updates the tuple in the local table (on the subscriber side) in parallel
> * RelationFindReplTupleByIndex may not find the tuple because it uses DirtySnapshot
> * the update/delete is lost
>
> Parallel apply mode looks more dangerous because it uses multiple workers on the subscriber side, so the probability of the issue is higher.
> In that case, "some other transaction" is just another worker applying the changes of a different transaction in parallel.
>

I think it is rather less likely or not possible in a parallel apply case because such conflicting updates (updates on the same tuple) should be serialized at the publisher itself. So one of the updates will be after the commit that has the second update.

I haven't tried the test based on your description of the general problem with DirtySnapshot scan. In case of logical replication, we will LOG update_missing type of conflict and the user may need to take some manual action based on that. I have not tried a test so I could be wrong as well. I am not sure we can do anything specific to logical replication for this but feel free to suggest if you have ideas to solve this problem in general or specific to logical replication.

--
With Regards,
Amit Kapila.
Hello, Amit!
> I think it is rather less likely or not possible in a parallel apply
> case because such conflicting updates (updates on the same tuple)
> should be serialized at the publisher itself. So one of the updates
> will be after the commit that has the second update.
Glad to hear! But anyway, such logic looks very fragile to me.
> I haven't tried the test based on your description of the general
> problem with DirtySnapshot scan. In case of logical replication, we
> will LOG update_missing type of conflict and the user may need to take
> some manual action based on that.
Currently it is just DEBUG1, so it will probably be missed by the user.
> * XXX should this be promoted to ereport(LOG) perhaps?
> */
> elog(DEBUG1,
> "logical replication did not find row to be updated "
> "in replication target relation \"%s\"",
> RelationGetRelationName(localrel));
> }
> I have not tried a test so I could
> be wrong as well. I am not sure we can do anything specific to logical
> replication for this but feel free to suggest if you have ideas to
> solve this problem in general or specific to logical replication.
I've implemented a solution to address the problem more generally; the patch is attached (see also the link [1]).
Here's a summary of the changes:
* For each tuple skipped because it was deleted, we now accumulate the maximum xmax.
* Before the scan begins, we store the value of the latest completed transaction.
* If no tuples are found in the index, we check the max(xmax) value. If this value is newer than the latest completed transaction stored before the scan, it indicates that a tuple was deleted by another transaction after the scan started. To ensure all tuples are correctly processed we then rescan the index.
I also added a test case to cover this scenario using the new injection point mechanism and updated the B-tree index documentation to include a description of this case.
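As a SQL-visible illustration of the ingredients (conceptual only - the patch works with the xids it encounters inside the index scan; the table name is made up):

CREATE TABLE xtab (i int PRIMARY KEY);
INSERT INTO xtab VALUES (13);

-- Session A:
BEGIN;
DELETE FROM xtab WHERE i = 13;           -- stamps the tuple's xmax with A's xid
-- Session B, before A commits:
SELECT xmax, * FROM xtab WHERE i = 13;   -- old version still visible, xmax shows A's xid
SELECT pg_snapshot_xmax(pg_current_snapshot());  -- roughly the "latest completed
                                                 -- transaction" boundary

-- The patch keeps the largest such xmax seen on skipped (deleted) tuples and,
-- if it is newer than the boundary taken before the scan started, rescans the index.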
I'll add this into the next commitfest.
Best regards,
Mikhail.
[1]: https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:concurrent_unique
On Fri, Aug 2, 2024 at 10:38 PM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
>
> > I think it is rather less likely or not possible in a parallel apply
> > case because such conflicting updates (updates on the same tuple)
> > should be serialized at the publisher itself. So one of the updates
> > will be after the commit that has the second update.
>
> Glad to hear! But anyway, such logic looks very fragile to me.
>
> > I haven't tried the test based on your description of the general
> > problem with DirtySnapshot scan. In case of logical replication, we
> > will LOG update_missing type of conflict and the user may need to take
> > some manual action based on that.
>
> Currently it is just DEBUG1, so it will probably be missed by the user.
>
> > * XXX should this be promoted to ereport(LOG) perhaps?
> > */
> > elog(DEBUG1,
> > "logical replication did not find row to be updated "
> > "in replication target relation \"%s\"",
> > RelationGetRelationName(localrel));
> > }
>

Right, but we are extending this functionality to detect and resolve such conflicts [1][2]. I am hoping after that such updates won't be missed.

[1] - https://commitfest.postgresql.org/49/5064/
[2] - https://commitfest.postgresql.org/49/5021/

--
With Regards,
Amit Kapila.
Hello!
> Right, but we are extending this functionality to detect and resolve
> such conflicts [1][2]. I am hoping after that such updates won't be
> missed.
Yes, this is a nice feature. However, without the DirtySnapshot index scan fix, it will fail in numerous instances, especially in master-master replication.
The update_missing feature is helpful in this case, but it is still not the correct event because a real tuple exists, and we should receive update_differ instead. As a result, some conflict resolution systems may malfunction. For example, if the resolution method is set to apply_or_skip, it will insert the new row, causing two rows to exist. This system is quite fragile, and I am sure there are many more complicated scenarios that could arise.
Best regards,
Mikhail.
Hi,

Thanks for reporting the issue!

I tried to reproduce this in logical replication but failed. If possible, could you please share some steps to reproduce it in logicalrep context?

In my test, if the tuple is updated and new tuple is in the same page, heapam_index_fetch_tuple should find the new tuple using HOT chain. So, it's a bit unclear to me how the updated tuple is missing. Maybe I missed some other conditions for this issue.

It would be better if we can reproduce this by adding some breakpoints using gdb, which may help us to write a tap test using injection point to reproduce this reliably. I see the tap test you shared used pgbench to reproduce this, it works, but It would be great if we can analyze the issue more deeply by debugging the code.

And I have few questions related the steps you shared:

> * Session 1 reads a B-tree page using SnapshotDirty and copies item X to the buffer.
> * Session 2 updates item X, inserting a new TID Y into the same page.
> * Session 2 commits its transaction.
> * Session 1 starts to fetch from the heap and tries to fetch X, but it was
> already deleted by session 2. So, it goes to the B-tree for the next TID.
> * The B-tree goes to the next page, skipping Y.
> * Therefore, the search finds nothing, but tuple Y is still alive.

I am wondering at which point should the update happen? Should it happen after calling index_getnext_tid and before index_fetch_heap? It would be great if you could give more details in above steps.

Thanks!

Best Regards,
Hou zj
Hello, Hou zj!
> In my test, if the tuple is updated and new tuple is in the same page,
> heapam_index_fetch_tuple should find the new tuple using HOT chain. So, it's a
> bit unclear to me how the updated tuple is missing. Maybe I missed some other
> conditions for this issue.
Yeah, I think the pgbench-based reproducer may also cause page splits in btree.
But we may add an index to the table to disable HOT.
I have attached a reproducer for this case using a spec and injection points.
I hope it helps, check the attached files.
Best regards,
Mikhail.
On Monday, August 12, 2024 7:11 PM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
>
> > In my test, if the tuple is updated and new tuple is in the same page,
> > heapam_index_fetch_tuple should find the new tuple using HOT chain. So, it's a
> > bit unclear to me how the updated tuple is missing. Maybe I missed some other
> > conditions for this issue.
>
> Yeah, I think the pgbench-based reproducer may also cause page splits in btree.
> But we may add an index to the table to disable HOT.
>
> I have attached a reproducer for this case using a spec and injection points.
>
> I hope it helps, check the attached files.

Thanks a lot for the steps!

I successfully reproduced the issue you mentioned in the context of logical replication[1]. As you said, it could increase the possibility of tuple missing when applying updates or deletes in the logical apply worker. I think this is a long-standing issue and I will investigate the fix you proposed.

In addition, I think the bug is not a blocker for the conflict detection feature. As the feature simply reports the current behavior of the logical apply worker (either unique violation or tuple missing) without introducing any new functionality. Furthermore, I think that the new ExecCheckIndexConstraints call after ExecInsertIndexTuples() is not affected by the dirty snapshot bug. This is because a tuple has already been inserted into the btree before the dirty snapshot scan, which means that a concurrent non-HOT update would not be possible (it would be blocked after finding the just inserted tuple and wait for the apply worker to commit the current transaction).

It would be good if others could also share their opinion on this.

[1] The steps to reproduce the tuple missing in logical replication.

1. Setup pub/sub env, and publish a table with 1 row.

pub:
CREATE TABLE t(a int primary key, b int);
INSERT INTO t VALUES(1,1);
CREATE PUBLICATION pub FOR TABLE t;

sub:
CREATE TABLE t (a int primary key, b int check (b < 5));
CREATE INDEX t_b_idx ON t(b);
CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres port=$port_publisher' PUBLICATION pub;

2. Execute an UPDATE (UPDATE t set b = b + 1) on the publisher and use gdb to stop the apply worker at the point after index_getnext_tid() and before index_fetch_heap().

3. Execute a concurrent update (UPDATE t set b = b + 100) on the subscriber to update a non-key column value and commit the update.

4. Release the apply worker and it would report the update_missing conflict.

Best Regards,
Hou zj
Hello!
> In addition, I think the bug is not a blocker for the conflict detection
> feature. As the feature simply reports the current behavior of the logical
> apply worker (either unique violation or tuple missing) without introducing any
> new functionality. Furthermore, I think that the new ExecCheckIndexConstraints
> call after ExecInsertIndexTuples() is not affected by the dirty snapshot bug.
> This is because a tuple has already been inserted into the btree before the
> dirty snapshot scan, which means that a concurrent non-HOT update would not be
> possible (it would be blocked after finding the just inserted tuple and wait
> for the apply worker to commit the current transaction).
> It would be good if others could also share their opinion on this.
Yes, you are right. At least, I can't find any scenario for that case.
Best regards,
Mikhail.
Hello, Hou!
I have sent a reproducer [0] in the context of conflict detection and resolution to the original thread.
Hello!
I realized the proposed solution does not guarantee the absence of false negative cases...
This happens because I am looking only at XID values, but they have nothing to do with transaction commit order in the general case.
I'll look for some other option.
Best regards,
Mikhail.
Hello, everyone and Peter!
Peter, I have added you because you may be interested in (or already know about) this btree-related issue.
Short description of the problem:
I noticed a concurrency issue in btree index scans that affects SnapshotDirty and SnapshotSelf scan types.
When using these non-MVCC snapshot types, a scan could miss tuples if concurrent transactions delete existing tuples and insert new ones with different TIDs on the same page.
The problem occurs because:
1. The scan reads a page and caches its tuples in backend-local storage
2. A concurrent transaction deletes a tuple and inserts a new one with a different TID
3. The old version is skipped because it was already deleted by the committed transaction and does not pass the visibility check
4. The new version on the page is missed because it is not among the cached tuples
This may cause issues with:
- logical replication (RelationFindReplTupleByIndex failure): an invalid conflict message (MISSING instead of ORIGIN_DIFFERS), and probably other issues with the upcoming conflict resolution for logical replication
- check_exclusion_or_unique_constraint false negative (but currently it does not cause any real issues as far as I can see)
The fix implemented in this version of the patch:
- Retains the read lock on a page for SnapshotDirty and SnapshotSelf scans until we're completely done with all tuples from that page
- Introduces a new 'extra_unlock' field in BTScanPos to track when a lock is being held longer than usual
- Updates documentation to explain this special locking behavior
Yes, it may cause some degradation in performance because of that additional lock.
Another possible idea is to use a fresh MVCC snapshot for such cases (but I think it is still better to fix or at least document that issue anyway).
Best regards,
Mikhail.
Hello, everyone!

Rebased + a fix for compilation due to the new INJECTION_POINT signature.

Best regards,
Mikhail.
Hello!

Rebased/reworked to align with the changes of [0].

Best regards.

[0]: https://www.postgresql.org/message-id/flat/CAH2-WznZBhWqDBDVGh1VhVBLgLqaYHEkPhmVV7mJCr1Y3ZQhQQ%40mail.gmail.com#d6f1debb1405d1b4a983cbb46b24f41b
Hello, everyone!

Issue description is available at [0] (in a few words: a SnapshotDirty scan may miss a tuple in an index because of a race condition with an update of that tuple).

But I have realised there are cases much more severe than invalid conflict messages for logical replication - lost deletes/updates in logical replication.

New tests with reproducers are included in the new patch version.

Short description of the issues:

1) Lost delete

Setup:
CREATE TABLE conf_tab(a int PRIMARY key, data text);
CREATE INDEX data_index ON conf_tab(data);
INSERT INTO conf_tab(a, data) VALUES (1,'frompub');

On publisher:
DELETE FROM conf_tab WHERE a=1;

On subscriber:
UPDATE conf_tab SET data = 'fromsubnew' WHERE (a=1);

Expected result: the tuple is deleted on both subscriber and publisher.

Actual result: either as expected, or the tuple is deleted on the publisher, but 'fromsubnew' remains on the subscriber.

2) Lost update

Setup on publisher:
CREATE TABLE conf_tab(a int PRIMARY key, data text);
INSERT INTO conf_tab(a, data) VALUES (1,'frompub');

Setup on subscriber:
-- note the additional subscriber-only column - i
CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
CREATE INDEX i_index ON conf_tab(i);

On publisher:
UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);

On subscriber:
UPDATE conf_tab SET i = 1 WHERE (a=1);

Expected result: on the subscriber, the tuple is (a=1, data='frompubnew', i=1).

Actual result: either as expected, or the publisher's update is lost, leaving (a=1, data='frompub', i=1) on the subscriber.

Best regards,
Mikhail.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwWuXh8KO%3DOZvB71pZnQ8nH0NYXfuGbFU6FBiVZUbmuFGg%40mail.gmail.com#76f98a9ae3479bbaf5ee9262322d466e
Hello,

Added one more test - for invalid "update_deleted" conflict detection.

Best regards,
Mikhail.
On Wed, Mar 12, 2025 at 6:36 AM Mihail Nikalayeu <michail.nikolaev@gmail.com> wrote:
>
> Hello, everyone and Peter!
>
> Peter, I have added you because you may be interested in (or already know about) this btree-related issue.
>
> Short description of the problem:
>
> I noticed a concurrency issue in btree index scans that affects SnapshotDirty and SnapshotSelf scan types.
> When using these non-MVCC snapshot types, a scan could miss tuples if concurrent transactions delete existing tuples and insert new ones with different TIDs on the same page.
>
> The problem occurs because:
> 1. The scan reads a page and caches its tuples in backend-local storage
> 2. A concurrent transaction deletes a tuple and inserts a new one with a different TID
> 3. The scan misses the new tuple because it was already deleted by a committed transaction and does not pass visibility check
> 4. But new version on the page is missed, because not in cached tuples
>

IIUC, the problem you are worried about can happen with DELETE+INSERT in the same transaction on the subscriber, right? If so, this should happen with DELETE and INSERT in a separate transaction as well. If that happens then we anyway may not be able to detect such an INSERT if it happens on a page earlier than the current page.

BTW, as the update (or DELETE+INSERT) happens at a later time than the publisher's update/delete, so once we have the last_write_win resolution strategy implemented, it is the subscriber operation that will win. So, the current behavior shouldn't cause any problem.

--
With Regards,
Amit Kapila.
Hello, Amit,

> IIUC, the problem you are worried about can happen with DELETE+INSERT

It seems there was some misunderstanding due to my bad explanation and wording.
I wrote "A concurrent transaction deletes a tuple and inserts a new one with a different TID" - but I meant a logical UPDATE causing a new TID to appear in the index page because HOT was applied...

Let's try again, I hope this explanation is better.

At the start, we have a table with a primary key and one extra index (to disable HOT), and a tuple with i=13:

CREATE TABLE tbl (i int PRIMARY KEY, data text);
CREATE INDEX no_more_hot_data_index ON tbl (data);
INSERT INTO tbl (i, data) VALUES (13, 'data');

A btree scan using SnapshotDirty can miss tuples because of internal locking logic. Here's how the bug shows up:

1) we have a tuple in the index (i=13), committed long ago
2) transaction A starts an index search for that tuple using SnapshotDirty (WHERE i = 13)
3) in parallel, transaction B updates that tuple (SET data='updated' WHERE i=13) and commits (creating a new index entry because HOT is not applied)
4) the scan from step 2 returns nothing at all - as if the tuple never existed

In other words, if you start a SnapshotDirty btree scan for i=13 and update that row i=13 at the same physical moment, the scan may:
* return the TID of the pre-update version - correct behavior
* return the TID of the post-update version - also correct
* return nothing - this is the broken case

More broadly: any SnapshotDirty scan may completely miss existing data when there are concurrent updates.

SnapshotDirty usage in Postgres is limited, so the impact isn't huge, but every case I found is reproducible with the tests from the first commit of v10 in my previous email.

* check_exclusion_or_unique_constraint: only a minor performance impact, handled by retry logic
* logical replication TAP tests: multiple scenarios fail because RelationFindReplTupleByIndex cannot find existing committed tuples

These scenarios look like:
1) logical replication tries to apply a change for tuple X received from the publisher
2) meanwhile, the subscriber updates the same tuple X and commits in a parallel transaction
3) due to the bug, RelationFindReplTupleByIndex concludes that tuple X does not exist at all, leading to bad outcomes, including:
* incorrect conflict-type messages (and, in the future, potentially wrong conflict-resolution choices)
* lost updates (see scenario 2 from [0])

If you look at the tests and play with the $simulate_race_condition flag, you can see the behavior directly.

The second commit (a possible fix) in v10 also includes documentation updates that try to explain the issue in a more appropriate context.

I'm happy to provide additional reproducers or explanations if that would help.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwWC49oanFSGPTf%3D6FJoTw-kAnpPZV8nVqAyR5KL68LrHQ%40mail.gmail.com#5f6b3be849f8d95c166decfae541df09

Best regards,
Mikhail.
Oh,

> in index page appear because HOT was applied...

I mean "HOT was NOT applied", sorry for the inconvenience.
Amit, a few more explanations related to your message.

> IIUC, the problem you are worried about can happen with DELETE+INSERT
> in the same transaction on the subscriber, right?

Technically, yes - this can occur during a single UPDATE, as well as a DELETE followed by an INSERT of the same key within the same transaction (which is effectively equivalent to an UPDATE).
However, it should NOT occur, because at no point in the timeline does a row with that key fail to exist; therefore, no scan should return "there is no such row in the index."

> If so, this should
> happen with DELETE and INSERT in a separate transaction as well.

Yes, it may happen - and in that case, it is correct. This is because there is a moment between the DELETE and the INSERT when the row does not exist. Therefore, it is acceptable for a scan to check the index at that particular moment and find nothing.

> BTW, as the update (or DELETE+INSERT) happens at a later time than the
> publisher's update/delete, so once we have the last_write_win
> resolution strategy implemented, it is the subscriber operation that
> will win. So, the current behavior shouldn't cause any problem.

For the last_write_win and UPDATE vs UPDATE case - yes, probably, but only by luck.
However, there are many scenarios that cannot be implemented correctly, for example:
* DELETE always wins
* UPDATE with a higher version (column value) wins
* first_write_win
* etc.

Also, the cases from [0] are clearly wrong without any conflict resolution. In particular, case 2 - there are no real conflicts at all (since different sets of columns are involved), but an incorrect result may still be produced.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwWC49oanFSGPTf%3D6FJoTw-kAnpPZV8nVqAyR5KL68LrHQ%40mail.gmail.com#5f6b3be849f8d95c166decfae541df09
On Fri, Aug 22, 2025 at 9:12 PM Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
>
> > BTW, as the update (or DELETE+INSERT) happens at a later time than the
> > publisher's update/delete, so once we have the last_write_win
> > resolution strategy implemented, it is the subscriber operation that
> > will win. So, the current behavior shouldn't cause any problem.
>
> For the last_write_win and UPDATE vs UPDATE case - yes, probably, but
> only by luck.
>

Why only by luck?

> However, there are many scenarios that cannot be implemented
> correctly, for example:
> * DELETE always wins
> * UPDATE with a higher version (column value) wins
> * first_write_win
> * etc.
>

Then these may not lead to eventual consistency for such cases. So, not sure one should anyway rely on these.

> Also, the cases from [0] are clearly wrong without any conflict
> resolution. In particular, case 2 - there are no real conflicts at all
> (since different sets of columns are involved), but an incorrect
> result may still be produced.
>

I think this questions whether we consider the SnapshotDirty results correct or not. The case of logical replication giving wrong results [0] is the behavior from the beginning of logical replication. Now, I would like to know the opinion of others who were involved in the initial commit, so added Peter E. to see what he thinks of the same. If we don't get the opinion here (say people missed to read because of an unrelated title) then I suggest you start a separate email thread to discuss just that case and see what others think.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwWC49oanFSGPTf%3D6FJoTw-kAnpPZV8nVqAyR5KL68LrHQ%40mail.gmail.com#5f6b3be849f8d95c166decfae541df09

--
With Regards,
Amit Kapila.
On Fri, Aug 22, 2025 at 9:12 PM Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
>
> Amit, a few more explanations related to your message.
>
> > IIUC, the problem you are worried about can happen with DELETE+INSERT
> > in the same transaction on the subscriber, right?
>
> Technically, yes - this can occur during a single UPDATE, as well as a
> DELETE followed by an INSERT of the same key within the same
> transaction (which is effectively equivalent to an UPDATE).
>

BTW, then isn't it possible that INSERT happens on a different page?

--
With Regards,
Amit Kapila.
Hello!

> Why only by luck?

I mean last_write_win provides the same results in the following cases:
* we found the tuple, detected a conflict, and decided to ignore the update coming from the publisher
* we were unable to find the tuple, logged an error about it, and ignored the update coming from the publisher

In both cases, the result is the same: the subscriber version remains in the table.

> Then these may not lead to eventual consistency for such cases. So,
> not sure one should anyway rely on these.

But with the fixed snapshot dirty scan, it becomes possible to implement such strategies.
Also, some strategies require some kind of merge function for tuples.
In my understanding, even last_write_win should probably compare timestamps to determine which version is "newer" because time in distributed systems can be tricky.
Therefore, we have to find the tuple if it exists.

> BTW, then isn't it possible that INSERT happens on a different page?

Yes, it is possible - in that case, the bug does not occur. It only happens if a new TID of some logical tuple is added to the same page.
Just to clarify, this is about B-tree pages, not the heap.

> I think this questions whether we consider the SnapshotDirty results
> correct or not.

In my understanding, this is clearly wrong:
* such behavior is not documented anywhere
* usage patterns assume that such things cannot happen
* new features struggle with it. For example, the new update_deleted logging may fail to behave correctly (038_update_missing_with_retain.pl in the patch) - so how should it be used? It might be correct, but it also might not be...

Another option is to document the behavior and rename it to SnapshotMaybe :)
By the way, SnapshotSelf is also affected.

> The case of logical replication giving wrong results
> [0] is the behavior from the beginning of logical replication.

Logical replication was mainly focused on replication without any concurrent updates on the subscriber side. So, I think this is why the issue was overlooked.

Best regards,
Mikhail.
On Mon, Aug 25, 2025 at 4:19 PM Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
>
> > Why only by luck?
>
> I mean last_write_win provides the same results in the following cases:
> * we found the tuple, detected a conflict, and decided to ignore the
> update coming from the publisher
> * we were unable to find the tuple, logged an error about it, and
> ignored the update coming from the publisher
>
> In both cases, the result is the same: the subscriber version remains
> in the table.
>

Right, so we can say that it will be consistent.

> > Then these may not lead to eventual consistency for such cases. So,
> > not sure one should anyway rely on these.
>
> But with the fixed snapshot dirty scan, it becomes possible to
> implement such strategies.
> Also, some strategies require some kind of merge function for tuples.
> In my understanding, even last_write_win should probably compare
> timestamps to determine which version is "newer" because time in
> distributed systems can be tricky.
> Therefore, we have to find the tuple if it exists.
>
> > BTW, then isn't it possible that INSERT happens on a different page?
>
> Yes, it is possible - in that case, the bug does not occur. It only
> happens if a new TID of some logical tuple is added to the same page.
>

What if the new insert happens in a page prior to the current page? I mean that the scan won't encounter the page where Insert happens.

> Just to clarify, this is about B-tree pages, not the heap.
>
> > I think this questions whether we consider the SnapshotDirty results
> > correct or not.
>
> In my understanding, this is clearly wrong:
> * such behavior is not documented anywhere
>

I agree. This is where we need inputs.

> * usage patterns assume that such things cannot happen
> * new features struggle with it. For example, the new update_deleted
> logging may fail to behave correctly
> (038_update_missing_with_retain.pl in the patch) - so how should it be
> used? It might be correct, but it also might not be...
>
> Another option is to document the behavior and rename it to SnapshotMaybe :)
> By the way, SnapshotSelf is also affected.
>

BTW, do we know the reason behind using SnapshotDirty in the first place? I don't see any comments in the nearby code unless I am missing something.

> > The case of logical replication giving wrong results
> > [0] is the behavior from the beginning of logical replication.
>
> Logical replication was mainly focused on replication without any
> concurrent updates on the subscriber side. So, I think this is why the
> issue was overlooked.
>

The other possibility is that as this is a rare scenario so we didn't consider it.

--
With Regards,
Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com>:

> What if the new insert happens in a page prior to the current page? I
> mean that the scan won't encounter the page where Insert happens.

Hmm.... Yes - if the TID lands on a page to the left of the current position, we'll miss it as well.
A lock-based solution (the version in v10) would require keeping all pages with the same key under a read lock, which feels too expensive.

> BTW, do we know the reason behind using SnapshotDirty in the first
> place? I don't see any comments in the nearby code unless I am missing
> something.

I think this is simply an attempt to lock the newest version of the logical tuple, including INSERT cases.
For an existing tuple, the same can be achieved using MVCC snapshot + retry.
However, in the case of a not-yet-committed INSERT, a different type of snapshot is required.

But I'm not sure if it provides any advantages.
On Mon, Aug 25, 2025 at 7:02 PM Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com>:
>
> > What if the new insert happens in a page prior to the current page? I
> > mean that the scan won't encounter the page where Insert happens.
>
> Hmm.... Yes - if the TID lands on a page to the left of the current
> position, we'll miss it as well.
> A lock-based solution (the version in v10) would require keeping all
> pages with the same key under a read lock, which feels too expensive.
>

Right.

> > BTW, do we know the reason behind using SnapshotDirty in the first
> > place? I don't see any comments in the nearby code unless I am missing
> > something.
>
> I think this is simply an attempt to lock the newest version of the
> logical tuple, including INSERT cases.
> For an existing tuple, the same can be achieved using MVCC snapshot + retry.
> However, in the case of a not-yet-committed INSERT, a different type
> of snapshot is required.
>
> But I'm not sure if it provides any advantages.
>

I think it is better to document this race somewhere in a logical replication document for now unless we have a consensus on a way to move forward.

--
With Regards,
Amit Kapila.
Hello, Amit!

Amit Kapila <amit.kapila16@gmail.com>:
> Now, I
> would like to know the opinion of others who were involved in the
> initial commit, so added Peter E. to see what he thinks of the same.

Seems like you added another Peter in [0] - I added Peter Eisentraut :)

> > Hmm.... Yes - if the TID lands on a page to the left of the current
> > position, we'll miss it as well.
> > A lock-based solution (the version in v10) would require keeping all
> > pages with the same key under a read lock, which feels too expensive.
>
> Right.

I think it is possible to achieve the same guarantees and logic using GetLatestSnapshot + HeapTupleSatisfiesDirty, but without the "tuple not found" case - I'll try to experiment with it.
GetLatestSnapshot is called before the tuple lock anyway.

> I think it is better to document this race somewhere in a logical
> replication document for now unless we have a consensus on a way to
> move forward.

Yes, it is an option, but such documentation is going to be strange:
* there is the delete_missing type of conflict stats/logs, but be aware it may be wrong (it is reported as delete_missing when actually it is not)
* the same for update_missing vs update_origin_differs
* the same for update_deleted vs update_origin_differs
* also, a DELETE or UPDATE from the publisher may be missed in case of an update on the subscriber, even if that update touches only subscriber-only columns

It looks like "if something is updating on the subscriber - no guarantees".
And the worst thing - it is the actual state.

[0]: https://www.postgresql.org/message-id/flat/CAA4eK1LZxzORgAoDhix9MWrOqYOsNZuZLW2sTfGsJFM99yRgrg%40mail.gmail.com#02be86f7e2d24a038878f03ac1b93e95

Best regards,
Mikhail.
Hello, Amit!
> Now, I
> would like to know the opinion of others who were involved in the
> initial commit, so added Peter E. to see what he thinks of the same.
Peter answered in [0]:
> I don’t remember. I was just the committer.
I’ve attached a new version of the proposed solution.
The first commit includes tests, some README updates, and an
additional pgbench test that reproduces the issue without explicitly
simulating the wait/resume race. This last test is heavy and isn't
intended to be committed.
Instead of adding extra locking in btree, a more lightweight approach
is used: since we already call GetLatestSnapshot before
table_tuple_lock, we can simply call it before each scan attempt and
use that snapshot for the scan.
As a result:
* MVCC scan will not miss updated tuples, while DirtyScan may
* in both cases, table_tuple_lock will wait for the updating
transaction to commit before retrying
* MVCC scan cannot see not-yet-committed new rows, while DirtyScan
can. However, this does not provide any stronger guarantee: in the
case of INSERT vs INSERT, two parallel inserts are still possible.
DirtyScan only slightly reduces the probability, but if the scan does
not find the row, there is still no guarantee that it won’t be
inserted immediately afterward.
Therefore, the MVCC version appears to provide the same guarantees,
without missing tuples, and with the same performance.
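At the SQL level, the guarantee being relied on here is essentially the ordinary READ COMMITTED locking behavior (a conceptual analogue only, not the internal code path; table name reused from the earlier sketch):

-- With a table like CREATE TABLE tab (i int PRIMARY KEY, data text):
BEGIN;
SELECT * FROM tab WHERE i = 13 FOR UPDATE;
-- The scan uses a fresh snapshot; if the row was just updated by a concurrent
-- transaction, the lock waits for that transaction and then follows the update
-- chain to the latest committed version instead of reporting "not found".
COMMIT;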
Best regards,
Mikhail.
[0]: https://discord.com/channels/1258108670710124574/1407753138991009913/1411303541900841090
Rebased.

Also, a separate thread with some additional explanation is here:
https://www.postgresql.org/message-id/flat/CADzfLwXZVmbo11tFS_G2i+6TfFVwHU4VUUSeoqb+8UQfuoJs8A@mail.gmail.com