Обсуждение: Inconsistent update in the MERGE command
Hi hackers, I noticed an inconsistent update when executing MERGE commands, which looks more like a bug. In my test example, the value of 'val' should increase in an ascending monotonous sequence. Test system =========== - Architecture: x86_64 - OS: Ubuntu 24.04.3 LTS (Noble Numbat) - Tested postgres version(s): - latest 17 (17.6) Steps to reproduce ================== postgres=# create table t_merge (id int primary key, val int); CREATE TABLE postgres=# create table t_merge_chk (val int primary key); CREATE TABLE postgres=# insert into t_merge values (1,0); INSERT 0 1 pgbench --no-vacuum --exit-on-abort -c 10 --file=/dev/stdin <<'EOF' begin; merge into t_merge t using (select 1 id) s on (t.id = s.id) when matched then update set val = t.val + 1 returning val \gset -- Checking the uniqueness of a value insert into t_merge_chk (val) values (:val); commit; EOF pgbench: error: client 3 script 0 aborted in command 2 query 0: ERROR: duplicate key value violates unique constraint "t_merge_chk_pkey" DETAIL: Key (val)=(2) already exists. What do you think about this? Best regards, Dmitry
On Thu, 21 Aug 2025 16:30:09 +0300 Dmitry <dsy.075@yandex.ru> wrote: > Hi hackers, > > I noticed an inconsistent update when executing MERGE commands, which > looks more like a bug. > In my test example, the value of 'val' should increase in an ascending > monotonous sequence. > > Test system > =========== > - Architecture: x86_64 > - OS: Ubuntu 24.04.3 LTS (Noble Numbat) > - Tested postgres version(s): > - latest 17 (17.6) > > Steps to reproduce > ================== > postgres=# create table t_merge (id int primary key, val int); > CREATE TABLE > postgres=# create table t_merge_chk (val int primary key); > CREATE TABLE > postgres=# insert into t_merge values (1,0); > INSERT 0 1 > > pgbench --no-vacuum --exit-on-abort -c 10 --file=/dev/stdin <<'EOF' > begin; > merge into t_merge t > using (select 1 id) s on (t.id = s.id) > when matched then update set val = t.val + 1 > returning val \gset > > -- Checking the uniqueness of a value > insert into t_merge_chk (val) values (:val); > commit; > EOF > > pgbench: error: client 3 script 0 aborted in command 2 query 0: > ERROR: duplicate key value violates unique constraint "t_merge_chk_pkey" > DETAIL: Key (val)=(2) already exists. > > > What do you think about this? I confirmed this issue by executing the following query concurrently in three transactions. (With only two transactions, the issue does not occur.) merge into t_merge t using (select 1 id) s on (t.id = s.id) when matched then update set val = t.val + 1 returning old.ctid,old.val,new.ctid,new.val; Theoretically, each transaction should increment the val value independently, so the total result should be an increment of three. However, sometimes (not always, but often) the second and third committed transactions end up producing the same result. For example, when the first committed transaction returns: ctid | val | ctid | val ---------+-----+---------+----- (2,158) | 4 | (2,159) | 5 (1 row) the second committed transaction could return: ctid | val | ctid | val ---------+-----+---------+----- (2,159) | 5 | (2,160) | 6 (1 row) and the third committed transaction could return: ctid | val | ctid | val ---------+-----+---------+----- (2,159) | 5 | (2,161) | 6 (1 row) This is wrong, since the old.val and new.val should be 6 and 7, respectively. I don't completely understand how this race condition occurs, but I believe the bug is due to the misuse of TM_FailureData returned by table_tuple_lock in ExecMergeMatched(). Currently, TM_FailureData.ctid is used as a reference to the latest version of oldtuple, but this is not always correct. Instead, the tupleid passed to table_tuple_lock should be used. I've attached a patch to fix this. Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>
Вложения
On 24.08.2025 20:34, Yugo Nagata wrote: > I've attached a patch to fix this. I ran tests with this patch [1] and it seems to have fixed the issue. Thank you very much for your work. [1] https://www.postgresql.org/message-id/attachment/180494/0001-Fix-misuse-of-TM_FailureData.ctid-in-ExecMergeMatche.patch -- Best regards, Dmitry.
On Sun, 24 Aug 2025 at 18:34, Yugo Nagata <nagata@sraoss.co.jp> wrote: > > I confirmed this issue by executing the following query concurrently > in three transactions. (With only two transactions, the issue does not occur.) Yes, I think 3 transactions are required to reproduce this (2 separate concurrent updates). > I don't completely understand how this race condition occurs, > but I believe the bug is due to the misuse of TM_FailureData > returned by table_tuple_lock in ExecMergeMatched(). > > Currently, TM_FailureData.ctid is used as a reference to the > latest version of oldtuple, but this is not always correct. > Instead, the tupleid passed to table_tuple_lock should be used. > > I've attached a patch to fix this. Thanks. That makes sense. I think we also should update the isolation tests to test this. Attached is an update to the merge-match-recheck isolation test, doing so. As you found, it doesn't always seem to fail with the unpatched code (though I didn't look to see why), but with your patch, it always passes. Regards, Dean
Вложения
On Aug 25, 2025, at 01:34, Yugo Nagata <nagata@sraoss.co.jp> wrote:Currently, TM_FailureData.ctid is used as a reference to the
latest version of oldtuple, but this is not always correct.
Instead, the tupleid passed to table_tuple_lock should be used.
At line 3386:
result = table_tuple_lock(resultRelationDesc, tupleid,
estate->es_snapshot,
inputslot, estate->es_output_cid,
lockmode, LockWaitBlock,
TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
&context->tmfd);
When calling table_tuple_lock(), “tupleid” is used to lock the tuple, and “&context->tmfd” is used to carry out lasted version of tuple info if table_tuple_lock() fails.
In this case, at line 3394, it drops into:
switch (result)
{
case TM_Ok:
Result is TM_Ok, meaning table_tuple_lock() didn’t fail, then we should not use info from “context->tmfd”.
So I think this fix makes sense.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
HighGo Software Co., Ltd.
https://www.highgo.com/
On Wed, 27 Aug 2025 14:44:55 +0100 Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > On Sun, 24 Aug 2025 at 18:34, Yugo Nagata <nagata@sraoss.co.jp> wrote: > > > > I confirmed this issue by executing the following query concurrently > > in three transactions. (With only two transactions, the issue does not occur.) > > Yes, I think 3 transactions are required to reproduce this (2 separate > concurrent updates). > > > I don't completely understand how this race condition occurs, > > but I believe the bug is due to the misuse of TM_FailureData > > returned by table_tuple_lock in ExecMergeMatched(). > > > > Currently, TM_FailureData.ctid is used as a reference to the > > latest version of oldtuple, but this is not always correct. > > Instead, the tupleid passed to table_tuple_lock should be used. > > > > I've attached a patch to fix this. > > Thanks. That makes sense. > > I think we also should update the isolation tests to test this. > Attached is an update to the merge-match-recheck isolation test, doing > so. As you found, it doesn't always seem to fail with the unpatched > code (though I didn't look to see why), but with your patch, it always > passes. Thank you for your suggestion and the test patch. The test looks good to me, so I’ve attached an updated patch including it. Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>
Вложения
On Thu, 28 Aug 2025 at 15:10, Yugo Nagata <nagata@sraoss.co.jp> wrote: > > Thank you for your suggestion and the test patch. The test looks good > to me, so I’ve attached an updated patch including it. Thanks, that mostly looks good to me. There's one other place in ExecMergeMatched() that's using TM_FailureData.ctid when it shouldn't, and that's this check: if (ItemPointerIndicatesMovedPartitions(&context->tmfd.ctid)) ereport(ERROR, (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), errmsg("tuple to be merged was already moved to another partition due to concurrent update"))); I've updated that to use tupleid in the attached v3 patch, and added a couple more isolation tests. In practice, however, I don't think that error can ever happen because this check follows table_tuple_lock() which has a similar test which will always error out first. I was tempted to simply remove this check from ExecMergeMatched(), but I think perhaps it's worth keeping just in case. Also, I thought that it's worth updating the comments for table_tuple_lock() and TM_FailureData to make it clearer that it updates its tid argument, and which fields of TM_FailureData can be relied upon in the different cases. Regards, Dean
Вложения
On Thu, 4 Sep 2025 13:50:41 +0100 Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > On Thu, 28 Aug 2025 at 15:10, Yugo Nagata <nagata@sraoss.co.jp> wrote: > > > > Thank you for your suggestion and the test patch. The test looks good > > to me, so I’ve attached an updated patch including it. > > Thanks, that mostly looks good to me. Thank you for updating the patch. > There's one other place in ExecMergeMatched() that's using > TM_FailureData.ctid when it shouldn't, and that's this check: > > if (ItemPointerIndicatesMovedPartitions(&context->tmfd.ctid)) > ereport(ERROR, > (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), > errmsg("tuple to be merged was already moved to > another partition due to concurrent update"))); I overlooked this check. > I've updated that to use tupleid in the attached v3 patch, and added a > couple more isolation tests. In practice, however, I don't think that > error can ever happen because this check follows table_tuple_lock() > which has a similar test which will always error out first. I was > tempted to simply remove this check from ExecMergeMatched(), but I > think perhaps it's worth keeping just in case. ItemPointerIndicatesMovedPartitions() is checked in heapam_tuple_lock(), but other table access methods might not perform this check (though I'm not sure if this is acceptable), so it may still make sense to keep it. > Also, I thought that it's worth updating the comments for > table_tuple_lock() and TM_FailureData to make it clearer that it > updates its tid argument, and which fields of TM_FailureData can be > relied upon in the different cases. +1 This should help prevent misuse of the argument in the future. Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>
On Thu, 4 Sept 2025 at 16:03, Yugo Nagata <nagata@sraoss.co.jp> wrote: > > > I've updated that to use tupleid in the attached v3 patch, and added a > > couple more isolation tests. In practice, however, I don't think that > > error can ever happen because this check follows table_tuple_lock() > > which has a similar test which will always error out first. I was > > tempted to simply remove this check from ExecMergeMatched(), but I > > think perhaps it's worth keeping just in case. > > ItemPointerIndicatesMovedPartitions() is checked in heapam_tuple_lock(), > but other table access methods might not perform this check (though I'm > not sure if this is acceptable), so it may still make sense to keep it. Good point. > > Also, I thought that it's worth updating the comments for > > table_tuple_lock() and TM_FailureData to make it clearer that it > > updates its tid argument, and which fields of TM_FailureData can be > > relied upon in the different cases. > > +1 > This should help prevent misuse of the argument in the future. OK, thanks. Pushed. In v15 and v16 the code was a little different, and it was actually much more obvious that it thought that table_tuple_lock() didn't update the tupleid passed to it, because it was explicitly updating it from tmfd->ctid. It was also slightly harder to trigger the error in v15 and v16, so I had to tweak the tests a little. In doing so, I realised that it's not actually necessary to have 3 sessions to reproduce this error -- it only requires 2 sessions, as long as the "other" session does 2 updates. So I did it that way in all branches, which simplified them a bit. Regards, Dean
On Fri, 5 Sep 2025 08:47:19 +0100 Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > On Thu, 4 Sept 2025 at 16:03, Yugo Nagata <nagata@sraoss.co.jp> wrote: > > > > > I've updated that to use tupleid in the attached v3 patch, and added a > > > couple more isolation tests. In practice, however, I don't think that > > > error can ever happen because this check follows table_tuple_lock() > > > which has a similar test which will always error out first. I was > > > tempted to simply remove this check from ExecMergeMatched(), but I > > > think perhaps it's worth keeping just in case. > > > > ItemPointerIndicatesMovedPartitions() is checked in heapam_tuple_lock(), > > but other table access methods might not perform this check (though I'm > > not sure if this is acceptable), so it may still make sense to keep it. > > Good point. > > > > Also, I thought that it's worth updating the comments for > > > table_tuple_lock() and TM_FailureData to make it clearer that it > > > updates its tid argument, and which fields of TM_FailureData can be > > > relied upon in the different cases. > > > > +1 > > This should help prevent misuse of the argument in the future. > > OK, thanks. Pushed. > > In v15 and v16 the code was a little different, and it was actually > much more obvious that it thought that table_tuple_lock() didn't > update the tupleid passed to it, because it was explicitly updating it > from tmfd->ctid. > > It was also slightly harder to trigger the error in v15 and v16, so I > had to tweak the tests a little. In doing so, I realised that it's not > actually necessary to have 3 sessions to reproduce this error -- it > only requires 2 sessions, as long as the "other" session does 2 > updates. So I did it that way in all branches, which simplified them a > bit. Thank you for committing it, and for your additional work on improving the test and the backpatch. Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>