Re: Potential data loss due to race condition during logical replication slot creation
From | Amit Kapila |
---|---|
Subject | Re: Potential data loss due to race condition during logical replication slot creation |
Date | |
Msg-id | CAA4eK1JWGXcnT2=PYW4U--j8GME4o3+SR6k+bs_r0thG9RVSpQ@mail.gmail.com |
In response to | Re: Potential data loss due to race condition during logical replication slot creation (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses | Re: Potential data loss due to race condition during logical replication slot creation |
List | pgsql-bugs |
On Tue, Apr 16, 2024 at 9:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Pondering further, I came across the question; in what case we would
> need to restore the snapshot and jump to the consistent state when
> finding the initial start point?
>
> When creating a slot, in ReplicationSlotReserveWal(), we determine the
> restart_lsn and write a RUNNING_XACT record. This record would be the
> first RUNNING_XACT record that the logical decoding decodes in most
> cases.
>

This probably won't be true on a standby, as we don't log a
RUNNING_XACT record in that case.

> In SnapBuildFindSnapshot(), if the record satisfies (a) (i.e.
> running->oldestRunningXid == running->nextXid), it can jump to the
> consistent state. If not, it means there were running transactions at
> that time of the RUNNING_XACT record being produced. Therefore, we
> must not restore the snapshot and jump to the consistent state.
> Because otherwise, we end up deciding the start point in the middle of
> a transaction that started before the RUNNING_XACT record. Probably
> the same is true for all subsequent snapbuild states.
>

One thing to consider here is that we can only restore the snapshot
if, by that time, some other get-changes call (in your case, step
"s2_get_changes_slot0" in your draft patch) has serialized the
consistent snapshot at that LSN. One could ask: if we can't reach a
consistent state at a particular LSN, why was the snapshot serialized
at that LSN in the first place? The answer could be that it is okay to
use such a serialized snapshot after initial snapshot creation, because
we know that the restart_lsn of a slot in such cases would be a
location where we won't see the data for partial transactions. This is
because, after the very first time (after the initdecodingcontext), the
restart_lsn would be set to a location after we reach the consistent
point.

> I might be missing something but I could not find the case where we
> can or want to restore the serialized snapshot when finding the
> initial start point. If my analysis is correct, we can go with either
> (a) or (c) I proposed before[1]. Between these two options, it also
> could be an option that (a) is for backbranches for safety and (c) is
> for master.
>

Approach (a) has a downside: it will lead to tracking more
(non-catalog) transactions than required, without any benefit for the
user. Given that, I wouldn't prefer that approach.

--
With Regards,
Amit Kapila.
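
Editorial sketch (not part of the original message): the reasoning in
the thread about when decoding may jump straight to the consistent
state can be modeled in a few lines of C. This is a simplified model,
not PostgreSQL source. The struct RunningXactsSketch, the helper names,
and the two boolean flags are hypothetical; only the comparison
running->oldestRunningXid == running->nextXid comes from the
discussion above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two RUNNING_XACT fields named in the thread. */
typedef uint32_t TransactionId;

typedef struct RunningXactsSketch
{
    TransactionId oldestRunningXid;
    TransactionId nextXid;
} RunningXactsSketch;

/* Condition (a): no transaction was running when the record was written. */
static bool
no_running_xacts(const RunningXactsSketch *running)
{
    return running->oldestRunningXid == running->nextXid;
}

/*
 * Model of the argument in the thread (an assumption, not the actual patch):
 * - if condition (a) holds, jumping to the consistent state is safe;
 * - while finding the initial start point, a serialized snapshot must not
 *   be restored, because the start point could fall in the middle of a
 *   transaction that began before the RUNNING_XACT record;
 * - after the initial snapshot has been built, a serialized snapshot at
 *   this LSN may be restored.
 */
static bool
can_jump_to_consistent(const RunningXactsSketch *running,
                       bool finding_initial_start_point,
                       bool have_serialized_snapshot)
{
    if (no_running_xacts(running))
        return true;
    if (finding_initial_start_point)
        return false;
    return have_serialized_snapshot;
}
```

In this model, a record with oldestRunningXid = 100 and nextXid = 105
does not allow the jump during slot creation even if a serialized
snapshot exists at that LSN, while the same record does allow it once
the slot has already reached its first consistent point.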