Обсуждение: Inconsistent update in the MERGE command

Поиск
Список
Период
Сортировка

Inconsistent update in the MERGE command

От
Dmitry
Дата:
Hi hackers,

I noticed an inconsistent update when executing MERGE commands, which
looks more like a bug.
In my test example, the value of 'val' should increase in an ascending
monotonous sequence.

Test system
===========
- Architecture: x86_64
- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
- Tested postgres version(s):
      - latest 17 (17.6)

Steps to reproduce
==================
   postgres=# create table t_merge (id int primary key, val int);
   CREATE TABLE
   postgres=# create table t_merge_chk (val int primary key);
   CREATE TABLE
   postgres=# insert into t_merge values (1,0);
   INSERT 0 1

   pgbench --no-vacuum --exit-on-abort -c 10 --file=/dev/stdin <<'EOF'
   begin;
   merge into t_merge t
   using (select 1 id) s on (t.id = s.id)
   when matched then update set val = t.val + 1
   returning val \gset

   -- Checking the uniqueness of a value
   insert into t_merge_chk (val) values (:val);
   commit;
   EOF

   pgbench: error: client 3 script 0 aborted in command 2 query 0:
ERROR:  duplicate key value violates unique constraint "t_merge_chk_pkey"
   DETAIL:  Key (val)=(2) already exists.


What do you think about this?

Best regards,
Dmitry




Re: Inconsistent update in the MERGE command

От
Yugo Nagata
Дата:
On Thu, 21 Aug 2025 16:30:09 +0300
Dmitry <dsy.075@yandex.ru> wrote:

> Hi hackers,
> 
> I noticed an inconsistent update when executing MERGE commands, which 
> looks more like a bug.
> In my test example, the value of 'val' should increase in an ascending 
> monotonous sequence.
> 
> Test system
> ===========
> - Architecture: x86_64
> - OS: Ubuntu 24.04.3 LTS (Noble Numbat)
> - Tested postgres version(s):
>       - latest 17 (17.6)
> 
> Steps to reproduce
> ==================
>    postgres=# create table t_merge (id int primary key, val int);
>    CREATE TABLE
>    postgres=# create table t_merge_chk (val int primary key);
>    CREATE TABLE
>    postgres=# insert into t_merge values (1,0);
>    INSERT 0 1
> 
>    pgbench --no-vacuum --exit-on-abort -c 10 --file=/dev/stdin <<'EOF'
>    begin;
>    merge into t_merge t
>    using (select 1 id) s on (t.id = s.id)
>    when matched then update set val = t.val + 1
>    returning val \gset
> 
>    -- Checking the uniqueness of a value
>    insert into t_merge_chk (val) values (:val);
>    commit;
>    EOF
> 
>    pgbench: error: client 3 script 0 aborted in command 2 query 0: 
> ERROR:  duplicate key value violates unique constraint "t_merge_chk_pkey"
>    DETAIL:  Key (val)=(2) already exists.
> 
> 
> What do you think about this?

I confirmed this issue by executing the following query concurrently
in three transactions. (With only two transactions, the issue does not occur.)

 merge into t_merge t
 using (select 1 id) s on (t.id = s.id)
 when matched then update set val = t.val + 1
 returning old.ctid,old.val,new.ctid,new.val;

Theoretically, each transaction should increment the val value
independently, so the total result should be an increment of three.
However, sometimes (not always, but often) the second and third committed
transactions end up producing the same result.

For example, when the first committed transaction returns:

  ctid   | val |  ctid   | val 
---------+-----+---------+-----
 (2,158) |   4 | (2,159) |   5
(1 row)

the second committed transaction could return:

  ctid   | val |  ctid   | val 
---------+-----+---------+-----
 (2,159) |   5 | (2,160) |   6
(1 row)


and the third committed transaction could return:

  ctid   | val |  ctid   | val 
---------+-----+---------+-----
 (2,159) |   5 | (2,161) |   6
(1 row)

This is wrong, since the old.val and new.val should be 6 and 7, respectively.

I don't completely understand how this race condition occurs,
but I believe the bug is due to the misuse of TM_FailureData
returned by table_tuple_lock in ExecMergeMatched().

Currently, TM_FailureData.ctid is used as a reference to the
latest version of oldtuple, but this is not always correct. 
Instead, the tupleid passed to table_tuple_lock should be used.

I've attached a patch to fix this.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>

Вложения

Re: Inconsistent update in the MERGE command

От
Dmitry
Дата:
On 24.08.2025 20:34, Yugo Nagata wrote:
> I've attached a patch to fix this.
I ran tests with this patch [1] and it seems to have fixed the issue.
Thank you very much for your work.


[1] 

https://www.postgresql.org/message-id/attachment/180494/0001-Fix-misuse-of-TM_FailureData.ctid-in-ExecMergeMatche.patch

--
Best regards,
Dmitry.




Re: Inconsistent update in the MERGE command

От
Dean Rasheed
Дата:
On Sun, 24 Aug 2025 at 18:34, Yugo Nagata <nagata@sraoss.co.jp> wrote:
>
> I confirmed this issue by executing the following query concurrently
> in three transactions. (With only two transactions, the issue does not occur.)

Yes, I think 3 transactions are required to reproduce this (2 separate
concurrent updates).

> I don't completely understand how this race condition occurs,
> but I believe the bug is due to the misuse of TM_FailureData
> returned by table_tuple_lock in ExecMergeMatched().
>
> Currently, TM_FailureData.ctid is used as a reference to the
> latest version of oldtuple, but this is not always correct.
> Instead, the tupleid passed to table_tuple_lock should be used.
>
> I've attached a patch to fix this.

Thanks. That makes sense.

I think we also should update the isolation tests to test this.
Attached is an update to the merge-match-recheck isolation test, doing
so. As you found, it doesn't always seem to fail with the unpatched
code (though I didn't look to see why), but with your patch, it always
passes.

Regards,
Dean

Вложения

Re: Inconsistent update in the MERGE command

От
Chao Li
Дата:


On Aug 25, 2025, at 01:34, Yugo Nagata <nagata@sraoss.co.jp> wrote:

Currently, TM_FailureData.ctid is used as a reference to the
latest version of oldtuple, but this is not always correct. 
Instead, the tupleid passed to table_tuple_lock should be used.

Based on my understanding:

At line 3386:
result = table_tuple_lock(resultRelationDesc, tupleid,
estate->es_snapshot,
inputslot, estate->es_output_cid,
lockmode, LockWaitBlock,
TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
&context->tmfd);

When calling table_tuple_lock(), “tupleid” is used to lock the tuple, and “&context->tmfd” is used to carry out lasted version of tuple info if table_tuple_lock() fails.

In this case, at line 3394, it drops into:

switch (result)
{
case TM_Ok:

Result is TM_Ok, meaning table_tuple_lock() didn’t fail, then we should not use info from “context->tmfd”.

So I think this fix makes sense.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: Inconsistent update in the MERGE command

От
Yugo Nagata
Дата:
On Wed, 27 Aug 2025 14:44:55 +0100
Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

> On Sun, 24 Aug 2025 at 18:34, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> >
> > I confirmed this issue by executing the following query concurrently
> > in three transactions. (With only two transactions, the issue does not occur.)
> 
> Yes, I think 3 transactions are required to reproduce this (2 separate
> concurrent updates).
> 
> > I don't completely understand how this race condition occurs,
> > but I believe the bug is due to the misuse of TM_FailureData
> > returned by table_tuple_lock in ExecMergeMatched().
> >
> > Currently, TM_FailureData.ctid is used as a reference to the
> > latest version of oldtuple, but this is not always correct.
> > Instead, the tupleid passed to table_tuple_lock should be used.
> >
> > I've attached a patch to fix this.
> 
> Thanks. That makes sense.
> 
> I think we also should update the isolation tests to test this.
> Attached is an update to the merge-match-recheck isolation test, doing
> so. As you found, it doesn't always seem to fail with the unpatched
> code (though I didn't look to see why), but with your patch, it always
> passes.

Thank you for your suggestion and the test patch. The test looks good
to me, so I’ve attached an updated patch including it.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>

Вложения

Re: Inconsistent update in the MERGE command

От
Dean Rasheed
Дата:
On Thu, 28 Aug 2025 at 15:10, Yugo Nagata <nagata@sraoss.co.jp> wrote:
>
> Thank you for your suggestion and the test patch. The test looks good
> to me, so I’ve attached an updated patch including it.

Thanks, that mostly looks good to me.

There's one other place in ExecMergeMatched() that's using
TM_FailureData.ctid when it shouldn't, and that's this check:

    if (ItemPointerIndicatesMovedPartitions(&context->tmfd.ctid))
        ereport(ERROR,
                (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                 errmsg("tuple to be merged was already moved to
another partition due to concurrent update")));

I've updated that to use tupleid in the attached v3 patch, and added a
couple more isolation tests. In practice, however, I don't think that
error can ever happen because this check follows table_tuple_lock()
which has a similar test which will always error out first. I was
tempted to simply remove this check from ExecMergeMatched(), but I
think perhaps it's worth keeping just in case.

Also, I thought that it's worth updating the comments for
table_tuple_lock() and TM_FailureData to make it clearer that it
updates its tid argument, and which fields of TM_FailureData can be
relied upon in the different cases.

Regards,
Dean

Вложения

Re: Inconsistent update in the MERGE command

От
Yugo Nagata
Дата:
On Thu, 4 Sep 2025 13:50:41 +0100
Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

> On Thu, 28 Aug 2025 at 15:10, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> >
> > Thank you for your suggestion and the test patch. The test looks good
> > to me, so I’ve attached an updated patch including it.
> 
> Thanks, that mostly looks good to me.

Thank you for updating the patch.

> There's one other place in ExecMergeMatched() that's using
> TM_FailureData.ctid when it shouldn't, and that's this check:
> 
>     if (ItemPointerIndicatesMovedPartitions(&context->tmfd.ctid))
>         ereport(ERROR,
>                 (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>                  errmsg("tuple to be merged was already moved to
> another partition due to concurrent update")));

I overlooked this check.
 
> I've updated that to use tupleid in the attached v3 patch, and added a
> couple more isolation tests. In practice, however, I don't think that
> error can ever happen because this check follows table_tuple_lock()
> which has a similar test which will always error out first. I was
> tempted to simply remove this check from ExecMergeMatched(), but I
> think perhaps it's worth keeping just in case.

ItemPointerIndicatesMovedPartitions() is checked in heapam_tuple_lock(),
but other table access methods might not perform this check (though I'm
not sure if this is acceptable), so it may still make sense to keep it.

> Also, I thought that it's worth updating the comments for
> table_tuple_lock() and TM_FailureData to make it clearer that it
> updates its tid argument, and which fields of TM_FailureData can be
> relied upon in the different cases.

+1
This should help prevent misuse of the argument in the future.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>



Re: Inconsistent update in the MERGE command

От
Dean Rasheed
Дата:
On Thu, 4 Sept 2025 at 16:03, Yugo Nagata <nagata@sraoss.co.jp> wrote:
>
> > I've updated that to use tupleid in the attached v3 patch, and added a
> > couple more isolation tests. In practice, however, I don't think that
> > error can ever happen because this check follows table_tuple_lock()
> > which has a similar test which will always error out first. I was
> > tempted to simply remove this check from ExecMergeMatched(), but I
> > think perhaps it's worth keeping just in case.
>
> ItemPointerIndicatesMovedPartitions() is checked in heapam_tuple_lock(),
> but other table access methods might not perform this check (though I'm
> not sure if this is acceptable), so it may still make sense to keep it.

Good point.

> > Also, I thought that it's worth updating the comments for
> > table_tuple_lock() and TM_FailureData to make it clearer that it
> > updates its tid argument, and which fields of TM_FailureData can be
> > relied upon in the different cases.
>
> +1
> This should help prevent misuse of the argument in the future.

OK, thanks. Pushed.

In v15 and v16 the code was a little different, and it was actually
much more obvious that it thought that table_tuple_lock() didn't
update the tupleid passed to it, because it was explicitly updating it
from tmfd->ctid.

It was also slightly harder to trigger the error in v15 and v16, so I
had to tweak the tests a little. In doing so, I realised that it's not
actually necessary to have 3 sessions to reproduce this error -- it
only requires 2 sessions, as long as the "other" session does 2
updates. So I did it that way in all branches, which simplified them a
bit.

Regards,
Dean



Re: Inconsistent update in the MERGE command

От
Yugo Nagata
Дата:
On Fri, 5 Sep 2025 08:47:19 +0100
Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

> On Thu, 4 Sept 2025 at 16:03, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> >
> > > I've updated that to use tupleid in the attached v3 patch, and added a
> > > couple more isolation tests. In practice, however, I don't think that
> > > error can ever happen because this check follows table_tuple_lock()
> > > which has a similar test which will always error out first. I was
> > > tempted to simply remove this check from ExecMergeMatched(), but I
> > > think perhaps it's worth keeping just in case.
> >
> > ItemPointerIndicatesMovedPartitions() is checked in heapam_tuple_lock(),
> > but other table access methods might not perform this check (though I'm
> > not sure if this is acceptable), so it may still make sense to keep it.
> 
> Good point.
> 
> > > Also, I thought that it's worth updating the comments for
> > > table_tuple_lock() and TM_FailureData to make it clearer that it
> > > updates its tid argument, and which fields of TM_FailureData can be
> > > relied upon in the different cases.
> >
> > +1
> > This should help prevent misuse of the argument in the future.
> 
> OK, thanks. Pushed.
> 
> In v15 and v16 the code was a little different, and it was actually
> much more obvious that it thought that table_tuple_lock() didn't
> update the tupleid passed to it, because it was explicitly updating it
> from tmfd->ctid.
> 
> It was also slightly harder to trigger the error in v15 and v16, so I
> had to tweak the tests a little. In doing so, I realised that it's not
> actually necessary to have 3 sessions to reproduce this error -- it
> only requires 2 sessions, as long as the "other" session does 2
> updates. So I did it that way in all branches, which simplified them a
> bit.

Thank you for committing it, and for your additional work on improving
the test and the backpatch.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>