Re: Assertion failure in SnapBuildInitialSnapshot()
From | Masahiko Sawada |
---|---|
Subject | Re: Assertion failure in SnapBuildInitialSnapshot() |
Date | |
Msg-id | CAD21AoDKJBB6p4X-+057Vz44Xyc-zDFbWJ+g9FL6qAF5PC2iFg@mail.gmail.com |
In reply to | Re: Assertion failure in SnapBuildInitialSnapshot() (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Assertion failure in SnapBuildInitialSnapshot()
RE: Assertion failure in SnapBuildInitialSnapshot()
Re: Assertion failure in SnapBuildInitialSnapshot() |
List | pgsql-hackers |
On Mon, Nov 21, 2022 at 4:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Nov 19, 2022 at 6:35 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2022-11-18 11:20:36 +0530, Amit Kapila wrote:
> > > Okay, updated the patch accordingly.
> >
> > Assuming it passes tests etc, this'd work for me.
> >
>
> Thanks, Pushed.

The same assertion failure has been reported on another thread[1]. Since I could reproduce this issue several times in my environment, I've investigated the root cause.

I think there is a race condition between CreateInitDecodingContext() and LogicalConfirmReceivedLocation() when they update procArray->replication_slot_xmin.

What I observed in the test was that a walsender process called:

SnapBuildProcessRunningXacts()
  LogicalIncreaseXminForSlot()
    LogicalConfirmReceivedLocation()
      ReplicationSlotsComputeRequiredXmin(false)

In ReplicationSlotsComputeRequiredXmin(), it acquired ReplicationSlotControlLock and got 0 as the minimum xmin, since no walsender had an effective_xmin. Before it called ProcArraySetReplicationSlotXmin() (i.e. before it acquired ProcArrayLock), another walsender process called CreateInitDecodingContext(), acquired ProcArrayLock, computed slot->effective_catalog_xmin, and called ReplicationSlotsComputeRequiredXmin(true). Since its effective_catalog_xmin had been set, it got 39968 as the minimum xmin and updated replication_slot_xmin. However, as soon as the second walsender released ProcArrayLock, the first walsender updated replication_slot_xmin to 0. After that, the second walsender called SnapBuildInitialSnapshot(), and GetOldestSafeDecodingTransactionId() returned an XID newer than snap->xmin.

One idea to fix this issue is that in ReplicationSlotsComputeRequiredXmin(), we compute the minimum xmin while holding both ProcArrayLock and ReplicationSlotControlLock, and release only ReplicationSlotControlLock before updating replication_slot_xmin.
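
To make the interleaving concrete, here is a toy Python model of the sequence described above. It is not PostgreSQL code and is only a sketch under stated assumptions: each walsender is reduced to a list of atomic steps, a single "published" value stands in for procArray->replication_slot_xmin, and every function name in it is hypothetical. It enumerates all orderings of the steps and collects the final published value for the unpatched locking (compute, then publish separately) and for the proposed locking (compute and publish as one step under ProcArrayLock).

```python
from itertools import combinations

INVALID_XID = 0        # stands in for InvalidTransactionId
NEW_SLOT_XMIN = 39968  # the catalog xmin seen in the report


def min_valid_xmin(slots):
    """Smallest valid xmin across all slots, or INVALID_XID if none is set."""
    valid = [xmin for xmin in slots.values() if xmin != INVALID_XID]
    return min(valid) if valid else INVALID_XID


# Walsender A, unpatched: the minimum is computed without ProcArrayLock and
# published later, so another process can run between the two steps.
def a_compute(state):
    state["a_local"] = min_valid_xmin(state["slots"])


def a_publish(state):
    state["published"] = state["a_local"]


# Walsender A with the proposed idea: compute and publish are one atomic
# step because ProcArrayLock is held across both.
def a_compute_and_publish(state):
    a_compute(state)
    a_publish(state)


# Walsender B (the CreateInitDecodingContext() path): it holds ProcArrayLock
# for the whole sequence, so it is modelled as a single atomic step.
def b_reserve_and_publish(state):
    state["slots"]["B"] = NEW_SLOT_XMIN
    state["published"] = min_valid_xmin(state["slots"])


def interleavings(a_steps, b_steps):
    """Yield every merge of the two step lists that keeps each list's order."""
    total = len(a_steps) + len(b_steps)
    for a_positions in combinations(range(total), len(a_steps)):
        a_iter, b_iter = iter(a_steps), iter(b_steps)
        yield [next(a_iter) if i in a_positions else next(b_iter)
               for i in range(total)]


def run(schedule):
    """Execute one interleaving and return the final published horizon."""
    state = {"slots": {"A": INVALID_XID, "B": INVALID_XID},
             "published": INVALID_XID, "a_local": None}
    for step in schedule:
        step(state)
    return state["published"]


def final_horizons(a_steps, b_steps):
    """All distinct final horizons reachable over every interleaving."""
    return sorted({run(s) for s in interleavings(a_steps, b_steps)})


unpatched = final_horizons([a_compute, a_publish], [b_reserve_and_publish])
patched = final_horizons([a_compute_and_publish], [b_reserve_and_publish])
print("unpatched:", unpatched)
print("patched:  ", patched)
```

In this model the unpatched step lists can end with the published value reset to 0 even though slot B holds 39968, which corresponds to the ordering observed in the test. With compute and publish merged into one step, every ordering ends at 39968.
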
I'm concerned that holding ProcArrayLock during the computation will increase contention on it, but I've attached the patch for discussion.

Regards,

[1] https://www.postgresql.org/message-id/tencent_7EB71DA5D7BA00EB0B429DCE45D0452B6406%40qq.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com