Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From | Andres Freund |
---|---|
Subject | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
Date | |
Msg-id | 20221121200836.wov46biwtramawmq@alap3.anarazel.de |
In response to | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (Amit Kapila <amit.kapila16@gmail.com>) |
Responses | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
List | pgsql-bugs |
Hi,

On 2022-11-21 19:56:20 +0530, Amit Kapila wrote:
> I think this problem could arise when walsender exits due to some
> error like "terminating walsender process due to replication timeout".
> Here is the theory I came up with:
>
> 1. Initially the restart_lsn is updated to 1039D/83825958. This will
> allow all files till 000000000001039D00000082 to be removed.
> 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8.
> 3. walsender restarts due to replication timeout.
> 4. After restart, it starts reading WAL from 1039D/83825958 as that
> was restart_lsn.
> 5. walsender gets a message to update write, flush, apply, etc. As
> part of that, it invokes
> ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation.
> 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and
> replicationSlotMinLSN will also be computed to the same value allowing
> to remove of all files older than 000000000001039D0000008A. This will
> allow removing 000000000001039D00000083, 000000010001039D00000084,
> etc.

This would require that the client acknowledged an LSN that we haven't sent
out, no? Shouldn't the

    MyReplicationSlot->candidate_restart_valid <= lsn

from LogicalConfirmReceivedLocation() prevented this from happening unless the
client acknowledges up to candidate_restart_valid?

> 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958.
> Remember from step 1, we are still reading WAL from that location.

I don't think LogicalIncreaseRestartDecodingForSlot() would do anything in
that case, because of the

    /* don't overwrite if have a newer restart lsn */

check.

> If this diagnosis is correct, I think we need to clear
> candidate_restart_lsn and friends during ReplicationSlotRelease().

Possible, but I don't quite see it yet.

Greetings,

Andres Freund
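For readers following the segment arithmetic in the quoted steps, here is a minimal sketch (not part of the thread) of how an LSN maps to a WAL segment file name. It assumes the default 16 MB `wal_segment_size`. The helper names `parse_lsn` and `wal_file_name` are illustrative, and timeline 0 is passed only to match most of the file names as quoted above.

```python
# Assumption: default wal_segment_size of 16 MB; other sizes change the mapping.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text):
    """Convert an LSN written as 'HI/LO' (two hex numbers) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline, lsn, seg_size=WAL_SEGMENT_SIZE):
    """Name of the WAL segment file that contains the given LSN."""
    segno = lsn // seg_size
    # Number of segments covered by one 4 GB "xlogid" (0x100 for 16 MB segments).
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


# restart_lsn from step 1 and candidate_restart_lsn from step 2
print(wal_file_name(0, parse_lsn("1039D/83825958")))  # 000000000001039D00000083
print(wal_file_name(0, parse_lsn("1039D/8B5773D8")))  # 000000000001039D0000008B
```

Under these assumptions, a restart_lsn of 1039D/83825958 lies in segment ...83, so segments up to ...82 are no longer needed by the slot. Once restart_lsn moves to 1039D/8B5773D8, segments ...83 onward become removable as well, which matters if the walsender is still reading from the old position.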