Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
От | Amit Kapila |
---|---|
Тема | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
Дата | |
Msg-id | CAA4eK1JvyWHzMwhO9jzPquctE_ha6bz3EkB3KE6qQJx63StErQ@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (hubert depesz lubaczewski <depesz@depesz.com>) |
Ответы |
Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
|
Список | pgsql-bugs |
On Fri, Nov 18, 2022 at 4:57 PM hubert depesz lubaczewski <depesz@depesz.com> wrote: > > On Thu, Nov 17, 2022 at 10:55:29AM -0800, Andres Freund wrote: > > Hi, > > > > On 2022-11-17 23:22:12 +0900, Masahiko Sawada wrote: > > > On Thu, Nov 17, 2022 at 5:03 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > 4. wal sender restarts for some reason (or server crashed). > > > > > > > > I don't think walsender alone restarting should change anything, but > > > > crash-restart obviously would. > > > > > > Right. I've confirmed this scenario is possible to happen with crash-restart. > > > > Depesz, were there any crashes / immediate restarts on the PG 12 side? If so, > > then we know what the problem likely is and can fix it. If not... > > > > > > Just to confirm, the problem has "newly" occurred after you've upgraded to > > 12.12? I couldn't quite tell from the thread. > > No crashes that I can find any track of. > I think this problem could arise when walsender exits due to some error like "terminating walsender process due to replication timeout". Here is the theory I came up with: 1. Initially the restart_lsn is updated to 1039D/83825958. This will allow all files till 000000000001039D00000082 to be removed. 2. Next the slot->candidate_restart_lsn is updated to a 1039D/8B5773D8. 3. walsender restarts due to replication timeout. 4. After restart, it starts reading WAL from 1039D/83825958 as that was restart_lsn. 5. walsender gets a message to update write, flush, apply, etc. As part of that, it invokes ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation. 6. Due to step 5, the restart_lsn is updated to 1039D/8B5773D8 and replicationSlotMinLSN will also be computed to the same value allowing to remove of all files older than 000000000001039D0000008A. This will allow removing 000000000001039D00000083, 000000010001039D00000084, etc. 7. Now, we got new slot->candidate_restart_lsn as 1039D/83825958. Remember from step 1, we are still reading WAL from that location. 8. At the next update for write, flush, etc. as part of processing standby reply message, we will invoke ProcessStandbyReplyMessage->LogicalConfirmReceivedLocation. This updates restart_lsn to 1039D/83825958. 9. walsender restarts due to replication timeout. 10. After restart, it starts reading WAL from 1039D/83825958 as that was restart_lsn and gets an error "requested WAL segment 000000010001039D00000083 has already been removed" because the same is removed as part of point-6. 11. After that walsender will keep on getting the same error after restart as the restart_lsn is never progressed. What do you think? If this diagnosis is correct, I think we need to clear candidate_restart_lsn and friends during ReplicationSlotRelease(). -- With Regards, Amit Kapila.
В списке pgsql-bugs по дате отправления: