Discussion: Logical decoding restart problems


Logical decoding restart problems

From:
konstantin knizhnik
Date:
Hi,

We are using logical decoding in multimaster and we are facing the problem that inconsistent transactions are sent to the replica.
Briefly, multimaster uses logical decoding in this way:
1. Each multimaster node is connected to every other node through a logical decoding channel, so each pair of nodes
has its own replication slot.
2. In the normal scenario each replication channel is used to replicate only those transactions which originated at the source node.
We are using the origin mechanism to skip "foreign" transactions.
3. When an offline cluster node is returned to the multimaster we need to recover this node to the current cluster state.
Recovery is performed from one of the cluster's nodes, so we use only one replication channel to receive all (self and foreign) transactions.
Only in this case can we guarantee a consistent order of applying transactions at the recovered node.
After the end of recovery we need to recreate the replication slots with all other cluster nodes (because we have already replayed transactions from those nodes).
To restart logical decoding we first drop the existing slot, then create a new one, and then start logical replication from WAL position 0/0 (invalid LSN).
In this case recovery should start from the last consistent point.

The problem is that for some reason the consistent point is not so consistent, and we get partly decoded transactions.
I.e. the transaction body consists of two UPDATEs, but reorder_buffer extracts only one (the last) update and sends this truncated transaction to the destination, causing a consistency violation at the replica. I started investigating the logical decoding code and found several things which I do not understand.

Assume that we have transactions T1={start_lsn=100, end_lsn=400} and T2={start_lsn=200, end_lsn=300}.
Transaction T2 is sent to the replica and the replica confirms that flush_lsn=300.
If we now want to restart logical decoding, we cannot start at a position less than 300, because CreateDecodingContext doesn't allow it:

 * start_lsn
 * The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
 * from the slot's confirmed_flush; otherwise, start from the specified
 * location (but move it forwards to confirmed_flush if it's older than
 * that, see below).
 *
else if (start_lsn < slot->data.confirmed_flush)
{
/*
* It might seem like we should error out in this case, but it's
* pretty common for a client to acknowledge a LSN it doesn't have to
* do anything for, and thus didn't store persistently, because the
* xlog records didn't result in anything relevant for logical
* decoding. Clients have to be able to do that to support synchronous
* replication.
*/

So does it mean that we have no chance to restore T1?
What is worse, if there are valid T2 transaction records with lsn >= 300, then we can partly decode T1 and send this T1' to the replica.
Did I miss something here?

Is there any alternative way to "seek" the slot to the proper position without actually fetching data from it or recreating the slot?
Is there any mechanism in xlog which can enforce consistent decoding of a transaction (so that no transaction records are missed)?
Maybe I missed something, but I didn't find any "record_number" or anything else which can identify the first record of a transaction.

Thanks in advance,
Konstantin Knizhnik,
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: Logical decoding restart problems

From:
Petr Jelinek
Date:
On 19/08/16 09:34, konstantin knizhnik wrote:
>
> We are using logical decoding in multimaster and we are faced with the
> problem that inconsistent transactions are sent to replica.
> Briefly, multimaster is using logical decoding in this way:
> 1. Each multimaster node is connected with each other using logical
> decoding channel and so each pair of nodes
> has its own replication slot.
> 2. In normal scenario each replication channel is used to replicate only
> those transactions which were originated at the source node.
> We are using origin mechanism to skip "foreign" transactions.
> When offline cluster node is returned back to the multimaster we need
> to recover this node to the current cluster state.
> Recovery is performed from one of the cluster's node. So we are using
> only one replication channel to receive all (self and foreign) transactions.
> Only in this case we can guarantee consistent order of applying
> transactions at recovered node.
> After the end of recovery we need to recreate replication slots with all
> other cluster nodes (because we have already replied transactions from
> this nodes).
> To restart logical decoding we first drop existed slot, then create new
> one and then start logical replication from the WAL position 0/0
> (invalid LSN).
> In this case recovery should be started from the last consistent point.
>

I don't think this will work correctly; there will be a gap between the 
drop of the old slot and the point where the new slot starts to decode, 
since the new slot first needs to build a snapshot.

Do I understand correctly that you are not using replication origins?

> The problem is that for some reasons consistent point is not so
> consistent and we get partly decoded transactions.
> I.e. transaction body consists of two UPDATE but reorder_buffer extracts
> only the one (last) update and sent this truncated transaction to
> destination causing consistency violation at replica.  I started
> investigation of logical decoding code and found several things which I
> do not understand.

Never seen this happen. Do you have more details about what exactly is 
happening?

>
> Assume that we have transactions T1={start_lsn=100, end_lsn=400} and
> T2={start_lsn=200, end_lsn=300}.
> Transaction T2 is sent to the replica and replica confirms that
> flush_lsn=300.
> If now we want to restart logical decoding, we can not start with
> position less than 300, because CreateDecodingContext doesn't allow it:
>
>  * start_lsn
>  *The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
>  *from the slot's confirmed_flush; otherwise, start from the specified
>  *location (but move it forwards to confirmed_flush if it's older than
>  *that, see below).
>  *
> else if (start_lsn < slot->data.confirmed_flush)
> {
> /*
> * It might seem like we should error out in this case, but it's
> * pretty common for a client to acknowledge a LSN it doesn't have to
> * do anything for, and thus didn't store persistently, because the
> * xlog records didn't result in anything relevant for logical
> * decoding. Clients have to be able to do that to support synchronous
> * replication.
> */
>
> So it means that we have no chance to restore T1?
> What is worse, if there are valid T2 transaction records with lsn >=
> 300, then we can partly decode T1 and send this T1' to the replica.
> I missed something here?

Decoding starts from the slot's restart_lsn; start_lsn is only used for 
skipping the transactions that commit before it.

> Are there any alternative way to "seek" slot to the proper position
> without  actual fetching data from it or recreation of the slot?

You can seek forward just fine, just specify the start position in the 
START_REPLICATION command.
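
For example, over the walsender connection (the slot name and LSN here are 
just placeholders):

START_REPLICATION SLOT "mtm_slot_1" LOGICAL 0/1A00000;

Decoding still physically starts at the slot's restart_lsn, but nothing is 
sent to the client for transactions that commit before the position you pass.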

> Is there any mechanism in xlog which can enforce consistent decoding of
> transaction (so that no transaction records are missed)?
> May be I missed something but I didn't find any "record_number" or
> something else which can identify first record of transaction.

As I mentioned above, what you probably want to do is use replication 
origins. When you use those you get origin info when decoding the 
transaction, which you can then send downstream so it can update its 
idea of where it is for that origin. This is especially useful for the 
transaction forwarding you are doing (see BDR and/or pglogical code for 
an example of that).
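
For reference, the SQL-level pieces of the origin machinery look roughly 
like this (the origin name and LSN are made up; a real apply worker would 
normally drive this through the C API instead):

-- once, on the downstream node:
SELECT pg_replication_origin_create('mtm_node_1');
-- after replaying a forwarded transaction that originated on node 1,
-- record how far we have gotten for that origin:
SELECT pg_replication_origin_advance('mtm_node_1', '0/1A2B3C4');
-- later, inspect the recorded progress:
SELECT * FROM pg_replication_origin_status;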

-- 
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding restart problems

From:
Konstantin Knizhnik
Date:

On 19.08.2016 19:06, Petr Jelinek wrote:
>
> I don't think this will work correctly, there will be gap between when 
> the new slot starts to decode and the drop of the old one as the new 
> slot first needs to make snapshot.
>
> Do I understand correctly that you are not using replication origins?

No, we are using replication origins to prevent recursive replication.
The gap is not the biggest problem, because in multimaster we commit only 
those transactions which are acknowledged by all live nodes.
So if some transaction falls into the gap... it is unlucky and will be 
aborted.

>
>> The problem is that for some reasons consistent point is not so
>> consistent and we get partly decoded transactions.
>> I.e. transaction body consists of two UPDATE but reorder_buffer extracts
>> only the one (last) update and sent this truncated transaction to
>> destination causing consistency violation at replica.  I started
>> investigation of logical decoding code and found several things which I
>> do not understand.
>
> Never seen this happen. Do you have more details about what exactly is 
> happening?
>

This is the transaction at the primary node:

root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec 
-ti node1 bash
postgres@9a04a0c9f246:/pg$  pg_xlogdump 
data/pg_xlog/000000010000000000000001  | fgrep "tx: 6899"
rmgr: Heap        len (rec/tot):      7/    53, tx:       6899, lsn: 
0/019EFBD8, prev 0/019EFB80, desc: LOCK off 12: xid 6899 LOCK_ONLY 
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 60
rmgr: Heap        len (rec/tot):     14/    74, tx:       6899, lsn: 
0/019EFC10, prev 0/019EFBD8, desc: HOT_UPDATE off 12 xmax 6899 ; new off 
224 xmax 6899, blkref #0: rel 1663/12407/16421 blk 60
rmgr: Heap        len (rec/tot):      7/    53, tx:       6899, lsn: 
0/019EFC60, prev 0/019EFC10, desc: LOCK off 153: xid 6899 LOCK_ONLY 
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 49
rmgr: Heap        len (rec/tot):     14/    82, tx:       6899, lsn: 
0/019EFC98, prev 0/019EFC60, desc: UPDATE off 153 xmax 6899 ; new off 55 
xmax 6899, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel 
1663/12407/16421 blk 49
rmgr: Btree       len (rec/tot):      2/    64, tx:       6899, lsn: 
0/019EFCF0, prev 0/019EFC98, desc: INSERT_LEAF off 294, blkref #0: rel 
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot):    236/   265, tx:       6899, lsn: 
0/019EFD30, prev 0/019EFCF0, desc: PREPARE
pg_xlogdump: FATAL:  error in WAL record at 0/1AC0E70: invalid record 
length at 0/1AC0EA8: wanted 24, got 0


This is the replicated transaction at another node (it is not clear from 
the trace, but believe me, it is the same transaction):


root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec 
-ti node2 bash
postgres@e5b16d82ce06:/pg$ pg_xlogdump 
data/pg_xlog/000000010000000000000001  | fgrep "tx:       6882"
rmgr: Heap        len (rec/tot):     14/    74, tx:       6882, lsn: 
0/019F3240, prev 0/019F31F0, desc: HOT_UPDATE off 113 xmax 6882 ; new 
off 219 xmax 0, blkref #0: rel 1663/12407/16421 blk 53
rmgr: Heap        len (rec/tot):     14/    82, tx:       6882, lsn: 
0/019F5CB8, prev 0/019F5C60, desc: UPDATE off 163 xmax 6882 ; new off 4 
xmax 0, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel 
1663/12407/16421 blk 51
rmgr: Btree       len (rec/tot):      2/    64, tx:       6882, lsn: 
0/019F5D10, prev 0/019F5CB8, desc: INSERT_LEAF off 284, blkref #0: rel 
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot):    248/   274, tx:       6882, lsn: 
0/019F61F8, prev 0/019F60E8, desc: PREPARE
pg_xlogdump: FATAL:  error in WAL record at 0/1AD3AD8: invalid record 
length at 0/1AD3B10: wanted 24, got 0


And "shorten" version of the same transaction at the third (recovered) node:

root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec 
-ti node3 bash
postgres@b4066d4211bc:/pg$  pg_xlogdump 
data/pg_xlog/000000010000000000000001  | fgrep "tx: 6893"
rmgr: Heap        len (rec/tot):      7/    53, tx:       6893, lsn: 
0/01A29828, prev 0/01A297E0, desc: LOCK off 172: xid 6893 LOCK_ONLY 
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 50
rmgr: Heap        len (rec/tot):     14/    82, tx:       6893, lsn: 
0/01A29860, prev 0/01A29828, desc: UPDATE+INIT off 172 xmax 6893 ; new 
off 1 xmax 6893, blkref #0: rel 1663/12407/16421 blk 64, blkref #1: rel 
1663/12407/16421 blk 50
rmgr: Btree       len (rec/tot):      2/    64, tx:       6893, lsn: 
0/01A298B8, prev 0/01A29860, desc: INSERT_LEAF off 314, blkref #0: rel 
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot):    236/   265, tx:       6893, lsn: 
0/01A298F8, prev 0/01A298B8, desc: PREPARE
pg_xlogdump: FATAL:  error in WAL record at 0/1ACBBF8: invalid record 
length at 0/1ACBC30: wanted 24, got 0

You can see one update instead of two.

Sorry, I have not saved the trace with the output of the logical decoder. But it 
really decodes just one update!
What I have done:

DROP_REPLICATION_SLOT "mtm_slot_1";
CREATE_REPLICATION_SLOT "mtm_slot_1" LOGICAL "multimaster";
START_REPLICATION SLOT "mtm_slot_1" LOGICAL 0/0 ("startup_params_format" 
'1', "max_proto_version" '1', "min_proto_version" '1', 
"forward_changesets" '1', "mtm_replication_mode" 'recovered');

I have also tried to calculate the last origin LSN for this node and 
explicitly specify it in START_REPLICATION.
But it doesn't help: the same problem persists.





-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Logical decoding restart problems

From:
Petr Jelinek
Date:
On 19/08/16 18:36, Konstantin Knizhnik wrote:
>
>
> On 19.08.2016 19:06, Petr Jelinek wrote:
>>
>> I don't think this will work correctly, there will be gap between when
>> the new slot starts to decode and the drop of the old one as the new
>> slot first needs to make snapshot.
>>
>> Do I understand correctly that you are not using replication origins?
>
> No, we are using replication origins to prevent recursive replication.
> The gap is not a biggest problem because in multimaster we commit only
> those transaction which are acknowledged by all live nodes.
> So if some transaction falls in the gap... it is unlucky and will be
> aborted.
>

Ah okay, forgot you have sync MM.

>>
>>> The problem is that for some reasons consistent point is not so
>>> consistent and we get partly decoded transactions.
>>> I.e. transaction body consists of two UPDATE but reorder_buffer extracts
>>> only the one (last) update and sent this truncated transaction to
>>> destination causing consistency violation at replica.  I started
>>> investigation of logical decoding code and found several things which I
>>> do not understand.
>>
>> Never seen this happen. Do you have more details about what exactly is
>> happening?
>>
>
> This is transaction at primary node:
>
> root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec
> -ti node1 bash
> postgres@9a04a0c9f246:/pg$  pg_xlogdump
> data/pg_xlog/000000010000000000000001  | fgrep "tx: 6899"
> rmgr: Heap        len (rec/tot):      7/    53, tx:       6899, lsn:
> 0/019EFBD8, prev 0/019EFB80, desc: LOCK off 12: xid 6899 LOCK_ONLY
> EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 60
> rmgr: Heap        len (rec/tot):     14/    74, tx:       6899, lsn:
> 0/019EFC10, prev 0/019EFBD8, desc: HOT_UPDATE off 12 xmax 6899 ; new off
> 224 xmax 6899, blkref #0: rel 1663/12407/16421 blk 60
> rmgr: Heap        len (rec/tot):      7/    53, tx:       6899, lsn:
> 0/019EFC60, prev 0/019EFC10, desc: LOCK off 153: xid 6899 LOCK_ONLY
> EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 49
> rmgr: Heap        len (rec/tot):     14/    82, tx:       6899, lsn:
> 0/019EFC98, prev 0/019EFC60, desc: UPDATE off 153 xmax 6899 ; new off 55
> xmax 6899, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel
> 1663/12407/16421 blk 49
> rmgr: Btree       len (rec/tot):      2/    64, tx:       6899, lsn:
> 0/019EFCF0, prev 0/019EFC98, desc: INSERT_LEAF off 294, blkref #0: rel
> 1663/12407/16424 blk 23
> rmgr: Transaction len (rec/tot):    236/   265, tx:       6899, lsn:
> 0/019EFD30, prev 0/019EFCF0, desc: PREPARE
> pg_xlogdump: FATAL:  error in WAL record at 0/1AC0E70: invalid record
> length at 0/1AC0EA8: wanted 24, got 0
>
>
> This is the replicated transaction at other node (it is ont clear from
> the trace, but believe me, it is the same transaction):
>
>
> root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec
> -ti node2 bash
> postgres@e5b16d82ce06:/pg$ pg_xlogdump
> data/pg_xlog/000000010000000000000001  | fgrep "tx:       6882"
> rmgr: Heap        len (rec/tot):     14/    74, tx:       6882, lsn:
> 0/019F3240, prev 0/019F31F0, desc: HOT_UPDATE off 113 xmax 6882 ; new
> off 219 xmax 0, blkref #0: rel 1663/12407/16421 blk 53
> rmgr: Heap        len (rec/tot):     14/    82, tx:       6882, lsn:
> 0/019F5CB8, prev 0/019F5C60, desc: UPDATE off 163 xmax 6882 ; new off 4
> xmax 0, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel
> 1663/12407/16421 blk 51
> rmgr: Btree       len (rec/tot):      2/    64, tx:       6882, lsn:
> 0/019F5D10, prev 0/019F5CB8, desc: INSERT_LEAF off 284, blkref #0: rel
> 1663/12407/16424 blk 23
> rmgr: Transaction len (rec/tot):    248/   274, tx:       6882, lsn:
> 0/019F61F8, prev 0/019F60E8, desc: PREPARE
> pg_xlogdump: FATAL:  error in WAL record at 0/1AD3AD8: invalid record
> length at 0/1AD3B10: wanted 24, got 0
>
>
> And "shorten" version of the same transaction at the third (recovered)
> node:
>
> root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec
> -ti node3 bash
> postgres@b4066d4211bc:/pg$  pg_xlogdump
> data/pg_xlog/000000010000000000000001  | fgrep "tx: 6893"
> rmgr: Heap        len (rec/tot):      7/    53, tx:       6893, lsn:
> 0/01A29828, prev 0/01A297E0, desc: LOCK off 172: xid 6893 LOCK_ONLY
> EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 50
> rmgr: Heap        len (rec/tot):     14/    82, tx:       6893, lsn:
> 0/01A29860, prev 0/01A29828, desc: UPDATE+INIT off 172 xmax 6893 ; new
> off 1 xmax 6893, blkref #0: rel 1663/12407/16421 blk 64, blkref #1: rel
> 1663/12407/16421 blk 50
> rmgr: Btree       len (rec/tot):      2/    64, tx:       6893, lsn:
> 0/01A298B8, prev 0/01A29860, desc: INSERT_LEAF off 314, blkref #0: rel
> 1663/12407/16424 blk 23
> rmgr: Transaction len (rec/tot):    236/   265, tx:       6893, lsn:
> 0/01A298F8, prev 0/01A298B8, desc: PREPARE
> pg_xlogdump: FATAL:  error in WAL record at 0/1ACBBF8: invalid record
> length at 0/1ACBC30: wanted 24, got 0
>
> You can see one update instead of two.
>
> Sorry, I have not saved trace  with output of logical decoder. Bu t it
> really decodes just one update!
> What I have done:
>
> DROP_REPLICATION_SLOT "mtm_slot_1";
> CREATE_REPLICATION_SLOT "mtm_slot_1" LOGICAL "multimaster";
> START_REPLICATION SLOT "mtm_slot_1" LOGICAL 0/0 ("startup_params_format"
> '1', "max_proto_version" '1', "min_proto_version" '1',
> "forward_changesets" '1', "mtm_replication_mode" 'recovered');
>
> I have also tried to calculate last origin LSN for this node and
> explicitly specify it in START_REPLICATION.
> But it doesn't help: the same problem persists.
>

Well okay, that says that there was a different number of UPDATEs on the 
subscriber, but that does not mean it's the fault of decoding. The 
difference in WAL on the downstream can, for example, be because it didn't 
find a matching key for the update due to some other desync issue. You'll need 
to check what the output plugin outputs and go from there. If you have a 
reproducible use-case/script I can help debug it.

-- 
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding restart problems

From:
Craig Ringer
Date:
On 19 August 2016 at 15:34, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
Hi,

We are using logical decoding in multimaster and we are faced with the problem that inconsistent transactions are sent to replica.
Briefly, multimaster is using logical decoding in this way:
1. Each multimaster node is connected with each other using logical decoding channel and so each pair of nodes 
has its own replication slot.

Makes sense. Same as BDR.
 
2. In normal scenario each replication channel is used to replicate only those transactions which were originated at the source node.
We are using origin mechanism to skip "foreign" transactions.

Again, makes sense. Same as BDR.
 
2. When offline cluster node is returned back to the multimaster we need to recover this node to the current cluster state.
Recovery is performed from one of the cluster's node. So we are using only one replication channel to receive all (self and foreign) transactions.

I'm planning on doing this for BDR soon, for the case where we lose a node unrecoverably, and it's what we already do during node join for the same reasons you're doing it. Glad to hear you're doing something similar.
 
Only in this case we can guarantee consistent order of applying transactions at recovered node.
After the end of recovery we need to recreate replication slots with all other cluster nodes (because we have already replied transactions from this nodes).

No, you don't need to recreate them. Just advance your replication identifier downstream and request a replay position in the future. Let the existing slot skip over unwanted data and resume where you want to start replay.

You can advance the replication origins on the peers as you replay forwarded xacts from your master.

Have a look at how the BDR code does this during "catchup mode" replay.
 
So while your problem discussed below seems concerning, you don't have to drop and recreate slots like you are currently doing.
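
A rough sketch of that switchover, keeping the existing slots (the LSN is a 
placeholder and picking the right cut-over position is up to you):

-- on the recovered node, after finishing replay over the single recovery
-- channel, check what has already been applied from each peer:
SELECT external_id, remote_lsn FROM pg_replication_origin_status;

-- then, on each peer's walsender connection, resume the existing slot and
-- ask for replay to start no earlier than that position:
START_REPLICATION SLOT "mtm_slot_1" LOGICAL 0/2000000;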

To restart logical decoding we first drop existed slot, then create new one and then start logical replication from the WAL position 0/0 (invalid LSN).
In this case recovery should be started from the last consistent point.

How do you create the new slot? SQL interface? walsender interface? Direct C calls?
 
The problem is that for some reasons consistent point is not so consistent and we get partly decoded transactions.
I.e. transaction body consists of two UPDATE but reorder_buffer extracts only the one (last) update and sent this truncated transaction to destination causing consistency violation at replica.  I started investigation of logical decoding code and found several things which I do not understand.

Yeah, that sounds concerning and shouldn't happen.

Assume that we have transactions T1={start_lsn=100, end_lsn=400} and T2={start_lsn=200, end_lsn=300}.
Transaction T2 is sent to the replica and replica confirms that flush_lsn=300.
If now we want to restart logical decoding, we can not start with position less than 300, because CreateDecodingContext doesn't allow it:


Right. You've already confirmed receipt of T2, so you can't receive it again.
 
So it means that we have no chance to restore T1?

Wrong. You can, because the slot's restart_lsn will still be some LSN <= 100. The slot keeps track of in-progress transactions (using xl_running_xacts records) and knows it can't discard WAL past lsn 100 because xact T1 is still in progress, so it must be able to decode it from the start.

When you create a decoding context, decoding starts at restart_lsn, not at confirmed_flush_lsn. confirmed_flush_lsn is the limit at which commits start resulting in decoded data being sent to you. So in your case, T1 commits at lsn=400, which is > 300, so you'll receive the whole xact for T1.
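
You can see both positions from SQL if you want to convince yourself of that 
(column names as of 9.6):

SELECT slot_name, restart_lsn, confirmed_flush_lsn, catalog_xmin
FROM pg_replication_slots;
-- restart_lsn         : where decoding physically restarts; always at or
--                       before the start of the oldest transaction that
--                       still has to be replayed
-- confirmed_flush_lsn : only commits after this point result in data being
--                       sent to the client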
 
What is worse, if there are valid T2 transaction records with lsn >= 300, then we can partly decode T1 and send this T1' to the replica.
I missed something here?

That shouldn't be possible. Have you seen this in the wild?

If so can you boil it down to a test case separate to your whole MM framework?
 
Are there any alternative way to "seek" slot to the proper position without  actual fetching data from it or recreation of the slot?

Yes, send replication feedback, or just request a future LSN at START_REPLICATION time.

You cannot rewind a slot; there's no way to seek backwards. At all. The only backwards movement is that done automatically and internally within the slot in the form of decoding restarting at restart_lsn at reconnect time.
 
Is there any mechanism in xlog which can enforce consistent decoding of transaction (so that no transaction records are missed)?

It's all already there. See logical decoding's use of xl_running_xacts.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding restart problems

From:
Craig Ringer
Date:
On 20 August 2016 at 00:36, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
 

Sorry, I have not saved trace  with output of logical decoder. Bu t it really decodes just one update!

That's definitely a worry. Can you turn this into a reproducible, self-contained test case?
 
DROP_REPLICATION_SLOT "mtm_slot_1";
CREATE_REPLICATION_SLOT "mtm_slot_1" LOGICAL "multimaster";
START_REPLICATION SLOT "mtm_slot_1" LOGICAL 0/0 ("startup_params_format" '1', "max_proto_version" '1', "min_proto_version" '1', "forward_changesets" '1', "mtm_replication_mode" 'recovered');

That'll make a new slot by looking for a recent/future non-overflowed xl_running_xacts record to set as restart_lsn. Every xact that starts _after_ that xl_running_xacts record will be decoded and replayed to the client in its entirety. Every xact started before that xl_running_xacts record, i.e. listed as running in the record, will not be replayed to the client at all, whether it commits or rolls back, even though the commit will be after the slot creation lsn.
 
I have also tried to calculate last origin LSN for this node and explicitly specify it in START_REPLICATION.

That will have no effect because of the logic you pointed out around confirmed_flush_lsn.
 
Dropping and re-creating the slot is the wrong thing to do if you want to maintain consistency. Just advance it over the changes you already saw from the other node. Read BDR's catchup mode code.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding restart problems

From:
konstantin knizhnik
Date:
Thank you for the answers.

No, you don't need to recreate them. Just advance your replication identifier downstream and request a replay position in the future. Let the existing slot skip over unwanted data and resume where you want to start replay.

You can advance the replication origins on the peers as you replay forwarded xacts from your master.

Have a look at how the BDR code does this during "catchup mode" replay.
 
So while your problem discussed below seems concerning, you don't have to drop and recreate slots like are currently doing. 

The only reason for recreating the slot is that I want to move it to the current "horizon" and skip all pending transactions without explicitly specifying the restart position.
If I do not drop the slot and just restart replication specifying position 0/0 (invalid LSN), then replication will be continued from the current slot position in WAL, won't it?
So there is no way to specify something like "start replication from the end of WAL", like lseek(0, SEEK_END).
Right now I am trying to overcome this limitation by explicitly calculating the position from which we should continue replication.
But unfortunately the problem with partly decoded transactions persists.
Next week I will try to create an example reproducing the problem without any multimaster stuff, just using the standard logical decoding plugin.



To restart logical decoding we first drop existed slot, then create new one and then start logical replication from the WAL position 0/0 (invalid LSN).
In this case recovery should be started from the last consistent point.

How do you create the new slot? SQL interface? walsender interface? Direct C calls?

The slot is created by the peer node using a standard libpq connection with a database=replication connection string.

 
The problem is that for some reasons consistent point is not so consistent and we get partly decoded transactions.
I.e. transaction body consists of two UPDATE but reorder_buffer extracts only the one (last) update and sent this truncated transaction to destination causing consistency violation at replica.  I started investigation of logical decoding code and found several things which I do not understand.

Yeah, that sounds concerning and shouldn't happen.

I looked at the replication code more closely and understand that my first concerns were wrong.
Confirming the flush position should not prevent replaying transactions with smaller LSNs.
But unfortunately the problem is really present. Maybe it is caused by race conditions (although most logical decoder data is local to the backend).
This is why I will try to create a reproducing scenario without multimaster.


Assume that we have transactions T1={start_lsn=100, end_lsn=400} and T2={start_lsn=200, end_lsn=300}.
Transaction T2 is sent to the replica and replica confirms that flush_lsn=300.
If now we want to restart logical decoding, we can not start with position less than 300, because CreateDecodingContext doesn't allow it:


Right. You've already confirmed receipt of T2, so you can't receive it again.
 
So it means that we have no chance to restore T1?

Wrong. You can, because the slot's restart_lsn still be will be some LSN <= 100. The slot keeps track of inprogress transactions (using xl_running_xacts records) and knows it can't discard WAL past lsn 100 because xact T1 is still in-progress, so it must be able to decode from the start of it.

When you create a decoding context decoding starts at restart_lsn not at confirmed_flush_lsn. confirmed_flush_lsn is the limit at which commits start resulting in decoded data being sent to you. So in your case, T1 commits at lsn=400, which is >300, so you'll receive the whole xact for T1.

Yeah, but unfortunately it happens. I need to understand why...

 
It's all already there. See logical decoding's use of xl_running_xacts.


But how is this information persisted?
What will happen if the walsender is restarted?


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding restart problems

From:
Craig Ringer
Date:


On 20 August 2016 at 14:56, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
Thank you for answers.

No, you don't need to recreate them. Just advance your replication identifier downstream and request a replay position in the future. Let the existing slot skip over unwanted data and resume where you want to start replay.

You can advance the replication origins on the peers as you replay forwarded xacts from your master.

Have a look at how the BDR code does this during "catchup mode" replay.
 
So while your problem discussed below seems concerning, you don't have to drop and recreate slots like are currently doing. 

The only reason for recreation of slot is that I want to move it to the current "horizont" and skip all pending transaction without explicit specification of the restart position.

Why not just specify the restart position as the upstream server's xlog insert position?

Anyway, you _should_ specify the restart position. Otherwise, if there's concurrent write activity, you might have a gap between when you stop replaying from your forwarding slot on the recovery node and start replaying from the other nodes.

Again, really, go read the BDR catchup mode code. Really.
 
If I do not drop the slot and just restart replication specifying position 0/0 (invalid LSN), then replication will be continued from the current slot position in WAL, will not it?

The "current slot position" isn't in WAL. It's stored in the replication slot in pg_replslot/ . But yes, if you pass 0/0 it'll use the stored confirmed_flush_lsn from the replication slot.
 
So there  is no way to specify something "start replication from the end of WAL", like lseek(0, SEEK_END).

Correct, but you can fetch the server's xlog insert position separately and pass it.
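
Something like this, roughly (pg_current_xlog_insert_location() as of 9.6; 
the LSN below is just an example of what it might return):

-- on the node that owns the slot:
SELECT pg_current_xlog_insert_location();
-- then hand the returned position to the walsender when resuming the slot:
START_REPLICATION SLOT "mtm_slot_1" LOGICAL 0/2A000000;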

I guess I can see it being a little bit useful to be able to say "start decoding at the first commit after this command". Send a patch, see if Andres agrees.

I still think your whole approach is wrong and you need to use replication origins or similar to co-ordinate a consistent switchover.
 
 
Slot is created by peer node using standard libpq connection with database=replication connection string.


So the walsender interface, then.
 
The problem is that for some reasons consistent point is not so consistent and we get partly decoded transactions.
I.e. transaction body consists of two UPDATE but reorder_buffer extracts only the one (last) update and sent this truncated transaction to destination causing consistency violation at replica.  I started investigation of logical decoding code and found several things which I do not understand.

Yeah, that sounds concerning and shouldn't happen.

I looked at replication code more precisely and understand that my first concerns were wrong.
Confirming flush position should not prevent replaying transactions with smaller LSNs.

Strictly, confirming the flush position does not prevent replay of transactions *with changes* at lower LSNs. It does prevent replay of transactions that *commit* at lower LSNs.
 
But unfortunately the problem is really present. May be it is caused by race conditions (although most logical decoder data is local to backend).
This is why I will try to create reproducing scenario without multimaster.
 
Yeh, but unfortunately it happens. Need to understand why...

Yes. I think we need a simple standalone test case. I've never yet seen a partially decoded transaction like this.

It's all already there. See logical decoding's use of xl_running_xacts.

But how this information is persisted?

restart_lsn points to an xl_running_xacts record in WAL, which is of course persistent. The restart_lsn is persisted in the replication slot, as are catalog_xmin and confirmed_flush_lsn.
 
What will happen if wal_sender is restarted?

That's why the restart_lsn exists. Decoding restarts from the restart_lsn when you START_REPLICATION on the new walsender. It continues without sending data to the client until it decodes the first commit > confirmed_flush_lsn or some greater-than-that LSN that you requested by passing it to the START_REPLICATION command.

The snapshot builder is also involved; see snapbuild.c and the comments there.

I'll wait for a test case or some more detail.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding restart problems

From:
Stas Kelvich
Date:
> On 20 Aug 2016, at 15:59, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> I'll wait for a test case or some more detail.

Thanks for the clarification about how restart_lsn works.

Digging slightly deeper into this topic revealed that the problem was in two-phase decoding, not in logical decoding itself.
While I was writing DecodePrepare() I copied the call to SnapBuildCommitTxn() from DecodeCommit(),
which removes the current transaction from the running list, and that's obviously the wrong thing to do for a prepared tx.

So I ended up with the following workaround:

--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -950,7 +950,7 @@ SnapBuildAbortTxn(SnapBuild *builder, XLogRecPtr lsn,
  */
 void
 SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
-                   int nsubxacts, TransactionId *subxacts)
+                   int nsubxacts, TransactionId *subxacts, bool isCommit)
 {
     int         nxact;
@@ -1026,7 +1026,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
      * Make sure toplevel txn is not tracked in running txn's anymore, switch
      * state to consistent if possible.
      */
-    SnapBuildEndTxn(builder, lsn, xid);
+    if (isCommit)
+        SnapBuildEndTxn(builder, lsn, xid);


I call SnapBuildCommitTxn with isCommit=true from commit and false from prepare. However, while I'm not 
observing partially decoded transactions anymore, I'm not sure that this is the right way to build a proper snapshot 
and that something else isn't broken.

Also, while I was playing with the psycopg2 logical decoding client, I did see empty transactions containing only 
BEGIN/COMMIT (with the test_decoding output plugin and current postgres master).
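
If those empty transactions are just noise for the client, test_decoding can 
be asked to drop them; something like the following should work (the slot 
name is arbitrary):

SELECT * FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
                                          'skip-empty-xacts', '1');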


--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: Logical decoding restart problems

From:
Craig Ringer
Date:
<p dir="ltr">On 25 Aug. 2016 20:03, "Stas Kelvich" <<a
href="mailto:s.kelvich@postgrespro.ru">s.kelvich@postgrespro.ru</a>>wrote:<br /> ><br /> > > On 20 Aug
2016,at 15:59, Craig Ringer <<a href="mailto:craig@2ndquadrant.com">craig@2ndquadrant.com</a>> wrote:<br /> >
><br/> > > I'll wait for a test case or some more detail.<br /> ><br /> > Thanks for clarification about
howrestart_lsn is working.<br /> ><br /> > Digging slightly deeper into this topic revealed that problem was in
twophase decoding, not it logical decoding itself.<p dir="ltr">Good to know. The behavior didn't make much sense for
logicaldecoding but bugs usually don't.<p dir="ltr">Do you plan to submit a patch for logical decoding of prepared
transactionsto 10.0? 

Re: Logical decoding restart problems

From:
Stas Kelvich
Date:
> On 31 Aug 2016, at 03:28, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> On 25 Aug. 2016 20:03, "Stas Kelvich" <s.kelvich@postgrespro.ru> wrote:
> >
> > Thanks for clarification about how restart_lsn is working.
> >
> > Digging slightly deeper into this topic revealed that problem was in two phase decoding, not it logical decoding itself.
>
> Good to know. The behavior didn't make much sense for logical decoding but bugs usually don't.
>
> Do you plan to submit a patch for logical decoding of prepared transactions to 10.0?
>

I do, probably in CF2. There is an issue with locks that Andres pointed me to; also, twophase.c was written before logical 
decoding happened and it deserves some refactoring to avoid duplicated functionality between 2PC decoding and restoring 
of old prepared xacts. I want to include that in the patch too.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company