Discussion: In logical replication concurrent update of partition key creates a duplicate record on standby.

In logical replication concurrent update of partition key creates a duplicate record on standby.

From
amul sul
Date:
Hi,

If an update of the partition key involves tuple movement from one partition to
another partition, then it is carried out as a separate delete on one partition
and an insert on the other partition.

In logical replication, if an update is performed on the master and the standby
at the same moment, the replication worker tries to replay the delete + insert
operations on the standby. While replaying the master's changes on the standby,
the worker will log a "concurrent update, retrying" message for the delete
operation (because the update on the standby has already deleted the tuple) and
move forward to replay the next insert operation. The standby's update also
performed the same delete + insert as part of the partition key update; as a
result, there will be two records inserted on the standby.

Here is a quick demonstration:

== NODE 1 ==

postgres=# insert into foo values(1, 'initial insert');
INSERT 0 1
postgres=# select tableoid::regclass, * from foo;
 tableoid | a |       b
----------+---+----------------
 foo1     | 1 | initial insert
(1 row)


== NODE 2 ==

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |       b
----------+---+----------------
 foo1     | 1 | initial insert
(1 row)


-- Now attach GDB to the replication worker & break on heap_lock_tuple() call


== NODE 1 ==

postgres=#  update foo set a=2, b='node1_update' where a=1;
UPDATE 1

<---- replication worker hits breakpoint on heap_lock_tuple() --->


== NODE 2 ==

postgres=# update foo set a=2, b='node2_update' where a=1;
UPDATE 1

<---- continue replication worker --->

postgres=# 2018-02-07 13:32:46.307 IST [81613] LOG:  concurrent update, retrying


== NODE 1 ==

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo2     | 2 | node1_update
(1 row)


== NODE 2 ==

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo2     | 2 | node2_update
 foo2     | 2 | node1_update
(2 rows)

== Script to create partitioned table, publication & subscription ==
-- node1
CREATE TABLE foo (a int2, b text) PARTITION BY LIST (a);
CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);

CREATE PUBLICATION update_row_mov_pub FOR ALL TABLES;

ALTER TABLE foo REPLICA IDENTITY FULL;
ALTER TABLE foo1 REPLICA IDENTITY FULL;
ALTER TABLE foo2 REPLICA IDENTITY FULL;

-- node2
CREATE TABLE foo (a int2, b text) PARTITION BY LIST (a);
CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);

CREATE SUBSCRIPTION update_row_mov_sub CONNECTION 'host=localhost
dbname=postgres' PUBLICATION update_row_mov_pub;

== END ==

Here is a link to the previous discussion:
https://postgr.es/m/CAAJ_b97w_GGV-k4ErxWTpz5sAgFJ4aUYMx0khfySvANmWRzsag@mail.gmail.com

Regards,
Amul Sul


Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

From
Amit Khandekar
Date:
On 7 February 2018 at 13:53, amul sul <sulamul@gmail.com> wrote:
> Hi,
>
> If an update of the partition key involves tuple movement from one partition to
> another partition, then it is carried out as a separate delete on one partition
> and an insert on the other partition.
>
> In logical replication, if an update is performed on the master and the standby
> at the same moment, the replication worker tries to replay the delete + insert
> operations on the standby. While replaying the master's changes on the standby,
> the worker will log a "concurrent update, retrying" message for the delete
> operation (because the update on the standby has already deleted the tuple) and
> move forward to replay the next insert operation. The standby's update also
> performed the same delete + insert as part of the partition key update; as a
> result, there will be two records inserted on the standby.

Thinking quickly about how to resolve this makes me wonder whether we can
manage to pass some information through logical decoding indicating that the
delete is part of a partition key update. This is analogous to how we set some
information locally in the tuple by setting tp.t_data->t_ctid.ip_blkid to
InvalidBlockNumber.
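
For context, a minimal sketch of that local marking, in the style of
heap_delete() in src/backend/access/heap/heapam.c (BlockIdSet,
InvalidBlockNumber, and the t_ctid field are real PostgreSQL identifiers;
the surrounding condition is invented for illustration):

/* If this delete is really one half of a cross-partition UPDATE,
 * overwrite the dead tuple's ctid block number with an invalid value,
 * so that a concurrent updater can tell "moved" apart from "deleted". */
if (changing_partition)        /* hypothetical flag */
    BlockIdSet(&(tp.t_data->t_ctid.ip_blkid), InvalidBlockNumber);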

I guess that at node 2, where this issue reproduces, the issue can also be
reproduced if there is a non-partition-key UPDATE going on and the tuple gets
deleted as part of replaying the partition-key update. That UPDATE will skip
the row, thinking the tuple is deleted. This is similar to what's happening
now in the case of local concurrent updates, for which a fix is being worked
on.


Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

From
amul sul
Date:
On Wed, Feb 7, 2018 at 3:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 7 February 2018 at 13:53, amul sul <sulamul@gmail.com> wrote:
>> Hi,
>>
>> If an update of the partition key involves tuple movement from one partition to
>> another partition, then it is carried out as a separate delete on one partition
>> and an insert on the other partition.
>>
>> In logical replication, if an update is performed on the master and the standby
>> at the same moment, the replication worker tries to replay the delete + insert
>> operations on the standby. While replaying the master's changes on the standby,
>> the worker will log a "concurrent update, retrying" message for the delete
>> operation (because the update on the standby has already deleted the tuple) and
>> move forward to replay the next insert operation. The standby's update also
>> performed the same delete + insert as part of the partition key update; as a
>> result, there will be two records inserted on the standby.
>
> Thinking quickly about how to resolve this makes me wonder whether we can
> manage to pass some information through logical decoding indicating that the
> delete is part of a partition key update. This is analogous to how we set some
> information locally in the tuple by setting tp.t_data->t_ctid.ip_blkid to
> InvalidBlockNumber.
>

+1. Also, if the worker fails to replay the delete operation on the standby,
then we need to decide what the next step will be: should we skip the following
insert operation, error out, or do something else?

> I guess that at node 2, where this issue reproduces, the issue can also be
> reproduced if there is a non-partition-key UPDATE going on and the tuple gets
> deleted as part of replaying the partition-key update. That UPDATE will skip
> the row, thinking the tuple is deleted. This is similar to what's happening
> now in the case of local concurrent updates, for which a fix is being worked
> on.

Yes, you are correct; at node 2 the reported issue is also reproducible without
an update of the partition key.

== NODE 2 ==

postgres=# update foo set  b='node2_update' where a=1;
UPDATE 1
postgres=# select * from foo;
 a |      b
---+--------------
 1 | node2_update
(1 row)

<---- continue replication worker --->

postgres=# 2018-02-07 15:26:53.323 IST [86449] LOG:  concurrent update, retrying

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo1     | 1 | node2_update
 foo2     | 2 | node1_update
(2 rows)

Regards,
Amul Sul


Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

From
Craig Ringer
Date:
On 7 February 2018 at 17:33, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
 

Thinking quickly about how to resolve this makes me wonder whether we can
manage to pass some information through logical decoding indicating that the
delete is part of a partition key update. This is analogous to how we set some
information locally in the tuple by setting tp.t_data->t_ctid.ip_blkid to
InvalidBlockNumber.


We already do something similar for UPDATEs that change the REPLICA IDENTITY; we include the old key in extra WAL.

The main question is whether the required knowledge is available at a suitable level. 
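
To make the analogy concrete: just as the old key is extracted (see
ExtractReplicaIdentity() in src/backend/access/heap/heapam.c) and attached to
the update/delete WAL record, the delete half of a partition-key update could
set an extra flag bit for logical decoding to inspect. A hedged sketch only;
the flag name, bit position, and placement below are all invented:

/* Hypothetical addition to the xl_heap_delete flags, by analogy with
 * the existing old-key logging; not current PostgreSQL code. */
#define XLH_DELETE_IS_PARTITION_MOVE    (1 << 4)

/* in heap_delete(), while filling in the WAL record: */
if (changing_partition)                 /* hypothetical flag */
    xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;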

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

From
Amit Kapila
Date:
On Wed, Feb 7, 2018 at 3:42 PM, amul sul <sulamul@gmail.com> wrote:
> On Wed, Feb 7, 2018 at 3:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>> On 7 February 2018 at 13:53, amul sul <sulamul@gmail.com> wrote:
>>> Hi,
>>>
>>> If an update of the partition key involves tuple movement from one partition to
>>> another partition, then it is carried out as a separate delete on one partition
>>> and an insert on the other partition.
>>>
>>> In logical replication, if an update is performed on the master and the standby
>>> at the same moment, the replication worker tries to replay the delete + insert
>>> operations on the standby. While replaying the master's changes on the standby,
>>> the worker will log a "concurrent update, retrying" message for the delete
>>> operation (because the update on the standby has already deleted the tuple) and
>>> move forward to replay the next insert operation. The standby's update also
>>> performed the same delete + insert as part of the partition key update; as a
>>> result, there will be two records inserted on the standby.
>>
>> Thinking quickly about how to resolve this makes me wonder whether we can
>> manage to pass some information through logical decoding indicating that the
>> delete is part of a partition key update. This is analogous to how we set some
>> information locally in the tuple by setting tp.t_data->t_ctid.ip_blkid to
>> InvalidBlockNumber.
>>
>
> +1.
>

I also mentioned the same thing in the other thread [1], but I think that
alone won't solve the dual-record problem you are seeing.  I think we need to
do something for the next insert, as you are suggesting.

> Also, if the worker fails to replay the delete operation on the standby, then
> we need to decide what the next step will be: should we skip the following
> insert operation, error out, or do something else?
>

That would be tricky; do you see any simple way of doing either of those?


[1] - https://www.postgresql.org/message-id/CAA4eK1%2BHopDbA3h0oYXE1kuhsU0rLT-hONeeS0SoG36YpeSnGw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

From
Amit Kapila
Date:
On Wed, Feb 7, 2018 at 6:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 7, 2018 at 3:42 PM, amul sul <sulamul@gmail.com> wrote:
>> On Wed, Feb 7, 2018 at 3:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> On 7 February 2018 at 13:53, amul sul <sulamul@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> If an update of the partition key involves tuple movement from one partition to
>>>> another partition, then it is carried out as a separate delete on one partition
>>>> and an insert on the other partition.
>>>>
>>>> In logical replication, if an update is performed on the master and the standby
>>>> at the same moment, the replication worker tries to replay the delete + insert
>>>> operations on the standby. While replaying the master's changes on the standby,
>>>> the worker will log a "concurrent update, retrying" message for the delete
>>>> operation (because the update on the standby has already deleted the tuple) and
>>>> move forward to replay the next insert operation. The standby's update also
>>>> performed the same delete + insert as part of the partition key update; as a
>>>> result, there will be two records inserted on the standby.
>>>
>>> Thinking quickly about how to resolve this makes me wonder whether we can
>>> manage to pass some information through logical decoding indicating that the
>>> delete is part of a partition key update. This is analogous to how we set some
>>> information locally in the tuple by setting tp.t_data->t_ctid.ip_blkid to
>>> InvalidBlockNumber.
>>>
>>
>> +1.
>>
>
> I also mentioned the same thing in the other thread [1], but I think that
> alone won't solve the dual-record problem you are seeing.  I think we need to
> do something for the next insert, as you are suggesting.
>

Can you please check what the behavior was before the update tuple routing
patch (commit 2f178441) went in?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

From
amul sul
Date:
On Wed, Feb 7, 2018 at 6:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 7, 2018 at 3:42 PM, amul sul <sulamul@gmail.com> wrote:
>> On Wed, Feb 7, 2018 at 3:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>> On 7 February 2018 at 13:53, amul sul <sulamul@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> If an update of the partition key involves tuple movement from one partition to
>>>> another partition, then it is carried out as a separate delete on one partition
>>>> and an insert on the other partition.
>>>>
>>>> In logical replication, if an update is performed on the master and the standby
>>>> at the same moment, the replication worker tries to replay the delete + insert
>>>> operations on the standby. While replaying the master's changes on the standby,
>>>> the worker will log a "concurrent update, retrying" message for the delete
>>>> operation (because the update on the standby has already deleted the tuple) and
>>>> move forward to replay the next insert operation. The standby's update also
>>>> performed the same delete + insert as part of the partition key update; as a
>>>> result, there will be two records inserted on the standby.
>>>
>>> Thinking quickly about how to resolve this makes me wonder whether we can
>>> manage to pass some information through logical decoding indicating that the
>>> delete is part of a partition key update. This is analogous to how we set some
>>> information locally in the tuple by setting tp.t_data->t_ctid.ip_blkid to
>>> InvalidBlockNumber.
>>>
>>
>> +1.
>>
>
> I also mentioned the same thing in the other thread [1], but I think that
> alone won't solve the dual-record problem you are seeing.  I think we need to
> do something for the next insert, as you are suggesting.
>
>> Also, if the worker fails to replay the delete operation on the standby, then
>> we need to decide what the next step will be: should we skip the following
>> insert operation, error out, or do something else?
>>
>
> That would be tricky; do you see any simple way of doing either of those?
>

Not really. As in ExecUpdate for an update of the partition key, if the delete
fails then the subsequent insert is skipped; but you are correct, it might be
trickier than I think -- there is no guarantee that the next insert operation
the replication worker is trying to replicate is part of the partition key
update mechanism.  How can one identify that an insert operation on one
relation is related to a previous delete operation on some other relation?
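
For reference, the local control flow I am comparing to, roughly as
ExecUpdate() handles it in src/backend/executor/nodeModifyTable.c after commit
2f178441 (a simplified sketch with argument lists elided, not the actual
source; exact signatures differ):

bool        tuple_deleted;

/* The new tuple violates this partition's constraint, so route it:
 * delete from this partition, then insert into the correct one.  If
 * the delete finds the tuple already gone (e.g. removed by a
 * concurrent operation), the insert is skipped. */
ExecDelete(/* args elided */ &tuple_deleted /* args elided */);
if (!tuple_deleted)
    return NULL;            /* concurrently deleted: skip the insert */
return ExecInsert(/* args elided */);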

Regards,
Amul


Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

From
amul sul
Date:
On Thu, Feb 8, 2018 at 5:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 7, 2018 at 6:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Wed, Feb 7, 2018 at 3:42 PM, amul sul <sulamul@gmail.com> wrote:
>>> On Wed, Feb 7, 2018 at 3:03 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>>>> On 7 February 2018 at 13:53, amul sul <sulamul@gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> If an update of the partition key involves tuple movement from one partition to
>>>>> another partition, then it is carried out as a separate delete on one partition
>>>>> and an insert on the other partition.
>>>>>
>>>>> In logical replication, if an update is performed on the master and the standby
>>>>> at the same moment, the replication worker tries to replay the delete + insert
>>>>> operations on the standby. While replaying the master's changes on the standby,
>>>>> the worker will log a "concurrent update, retrying" message for the delete
>>>>> operation (because the update on the standby has already deleted the tuple) and
>>>>> move forward to replay the next insert operation. The standby's update also
>>>>> performed the same delete + insert as part of the partition key update; as a
>>>>> result, there will be two records inserted on the standby.
>>>>
>>>> Thinking quickly about how to resolve this makes me wonder whether we can
>>>> manage to pass some information through logical decoding indicating that the
>>>> delete is part of a partition key update. This is analogous to how we set some
>>>> information locally in the tuple by setting tp.t_data->t_ctid.ip_blkid to
>>>> InvalidBlockNumber.
>>>>
>>>
>>> +1.
>>>
>>
>> I also mentioned the same thing in the other thread [1], but I think that
>> alone won't solve the dual-record problem you are seeing.  I think we need to
>> do something for the next insert, as you are suggesting.
>>
>
> Can you please check what the behavior was before the update tuple routing
> patch (commit 2f178441) went in?
>

Before this commit, such an update would fail with the following error:

postgres=#  update foo set a=2, b='node1_update' where a=1;
ERROR:  new row for relation "foo1" violates partition constraint
DETAIL:  Failing row contains (2, node1_update).

Regards,
Amul


Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

From
Peter Eisentraut
Date:
On 2/8/18 10:54, amul sul wrote:
> Not really. As in ExecUpdate for an update of the partition key, if the delete
> fails then the subsequent insert is skipped; but you are correct, it might be
> trickier than I think -- there is no guarantee that the next insert operation
> the replication worker is trying to replicate is part of the partition key
> update mechanism.  How can one identify that an insert operation on one
> relation is related to a previous delete operation on some other relation?

I think you somehow need to stitch this back together in logical
decoding and publish it as an update operation.  Otherwise, wrong things
happen.  For example, what happens to a publication that is configured
to only publish inserts?  What happens to update triggers on the
receiving table?  What if the subscriber side is partitioned differently?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

From
Amit Kapila
Date:
On Tue, Feb 13, 2018 at 5:25 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 2/8/18 10:54, amul sul wrote:
>> Not really. As in ExecUpdate for an update of the partition key, if the delete
>> fails then the subsequent insert is skipped; but you are correct, it might be
>> trickier than I think -- there is no guarantee that the next insert operation
>> the replication worker is trying to replicate is part of the partition key
>> update mechanism.  How can one identify that an insert operation on one
>> relation is related to a previous delete operation on some other relation?
>
> I think you somehow need to stitch this back together in logical
> decoding and publish it as an update operation.
>

I am not sure what you have in mind, but it seems difficult to piggyback it on
the existing delete operation, because the delete and the insert are separate
operations.  What might work is a separate, special update WAL record that
covers both the delete and the insert.  Now, that is easier said than done,
because the current update tuple routing mechanism uses the existing
infrastructure for delete and insert, and if we want to use a different WAL
record, then we might need to change that infrastructure as well to avoid
writing three WAL records (one for the delete, one for the insert, and one for
the special update) for this operation.
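
As a thought experiment, here is a self-contained toy model of the "stitching"
idea downstream of such a special record: the decoder holds back a delete that
is flagged as a partition move and, on seeing the paired insert, emits a
single update.  This is plain C with every name invented -- it shares nothing
with the actual PostgreSQL decoding code:

#include <stdio.h>
#include <stdbool.h>

typedef enum { CH_NONE, CH_INSERT, CH_DELETE } ChangeKind;

typedef struct
{
    ChangeKind  kind;
    bool        partition_move;     /* would come from a WAL flag */
    int         key;
} Change;

/* Publish one decoded change, buffering the delete half of a
 * partition-key update until its insert half arrives. */
static void
publish(Change *pending, Change c)
{
    if (c.kind == CH_DELETE && c.partition_move)
    {
        *pending = c;               /* hold it; wait for the paired insert */
        return;
    }
    if (c.kind == CH_INSERT && pending->kind == CH_DELETE)
    {
        printf("UPDATE key=%d (stitched from delete + insert)\n", c.key);
        pending->kind = CH_NONE;    /* pair consumed */
        return;
    }
    printf("%s key=%d\n", c.kind == CH_INSERT ? "INSERT" : "DELETE", c.key);
}

int
main(void)
{
    Change      pending = {CH_NONE, false, 0};

    /* What decoding sees for a cross-partition UPDATE on the publisher: */
    publish(&pending, (Change) {CH_DELETE, true, 1});
    publish(&pending, (Change) {CH_INSERT, false, 1});
    return 0;
}

The output would be a single "UPDATE key=1" line, which is what the subscriber
needs in order to apply one update instead of a delete plus an insert.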

>  Otherwise, wrong things
> happen.  For example, what happens to a publication that is configured
> to only publish inserts?  What happens to update triggers on the
> receiving table?  What if the subscriber side is partitioned differently?
>

All of these seem to be valid points, and they would get addressed if we
decide to tinker with the WAL format and logical decoding for this operation.
OTOH, we might want to document this behavior for v11, as we are discussing
for some other related issues [1][2].

[1] -
https://www.postgresql.org/message-id/CA%2BTgmoY_h%2B3J46zShEZD0_KLRHa1NsJkGrC4Ou%3DBqt%3DKRboHtg%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CA%2BTgmoacGfUSWSMVRci-duVFSGOoevgq43mSY9Sztd1RRhiHjg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com