Re: Potential data loss due to race condition during logical replication slot creation
От | Amit Kapila |
---|---|
Тема | Re: Potential data loss due to race condition during logical replication slot creation |
Дата | |
Msg-id | CAA4eK1KrVfUWw9WnCF191iDDhoiMQ+jmOPExCRLTbq4dPoK9Zw@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Potential data loss due to race condition during logical replication slot creation (Masahiko Sawada <sawada.mshk@gmail.com>) |
Ответы |
Re: Potential data loss due to race condition during logical replication slot creation
|
Список | pgsql-bugs |
On Tue, Jul 9, 2024 at 11:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jul 8, 2024 at 11:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Jul 8, 2024 at 6:01 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > Dear Sawada-san, > > > > > > Thanks for creating the patch! > > > > > > > I've attached updated patches for HEAD and pg17 for now (I will create > > > > the patch for other backbranches). > > > > > > > > In the patches, I used a different approach in between HEAD and > > > > backbranches. On HEAD, we store a flag indicating whether or not we > > > > should skip snapshot restores into the SnapBuild struct and set it > > > > only while finding the start point. Therefore we have to bump > > > > SNAPBUILD_VERSION. On backbranches, I used the approach we discussed > > > > above; store the flag in LogicalDecodingContext and set it when > > > > creating the LogicalDecodingContext for a new logical slot. A possible > > > > downside of the approach taken for backbranches is that we implicitly > > > > require for users to use the same LogicalDecodingContext for both > > > > initializing the context for a new slot and finding its start point. > > > > IIUC it was not strictly required. This restriction would not be a > > > > problem at least in the core, but I'm not sure if there are no > > > > external extensions that create a logical slot in that way. This is > > > > the main reason why I used a different approach on HEAD and > > > > backbranches. Therefore, if it does not matter, I think we can use the > > > > same approach on all branches, which is better in terms of > > > > maintainability. > > > > > > I want to confirm your point. You meant that you wanted to unify appraoches, > > > especially you wanted to store the flag in LogicalDecodingContext, rigth? > > > > Yes. Ideally I'd like to use the same approach in all branches > > regardless of how to fix it for better maintainability. > > The difference is minor so using slightly different approaches should be okay. In the ideal case, I agree that using the same approach makes sense but for future extendability, it is better to keep this new variable in SnapBuild at least in HEAD and PG17. > > I've attached the new version patches for all branches. > Few comments: 1. @@ -650,6 +650,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx) { ReplicationSlot *slot = ctx->slot; + /* Let snapshot builder start to find the start point */ + SnapBuildSetFindStartPoint(ctx->snapshot_builder, true); + /* Initialize from where to start reading WAL. */ XLogBeginRead(ctx->reader, slot->data.restart_lsn); @@ -683,6 +686,9 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx) if (slot->data.two_phase) slot->data.two_phase_at = ctx->reader->EndRecPtr; SpinLockRelease(&slot->mutex); + + /* Complete to find the start point */ + SnapBuildSetFindStartPoint(ctx->snapshot_builder, false); I wonder why you didn't choose to set this variable in AllocateSnapshotBuilder()? If we do so, then we may not need set/reset in DecodingContextFindStartpoint(). The one advantage of using set/reset for the minimal time as done here is that we can avoid any impact of this new variable but I still feel setting it in AllocateSnapshotBuilder() seems more natural. 2. Since in this case + * the restart LSN could be in the middle of transactions we need to + * find the start point where we won't see the data for partial + * transactions. There is a connecting word missing between *transactions* and *we*. Can we use a slightly different wording like: "Can't use this method while finding the start point for decoding changes as the restart LSN would be an arbitrary LSN but we need to find the start point to extract changes where we won't see the data for partial transactions."? -- With Regards, Amit Kapila.
В списке pgsql-bugs по дате отправления: