Re: Deadlock between logrep apply worker and tablesync worker
From | vignesh C |
---|---|
Subject | Re: Deadlock between logrep apply worker and tablesync worker |
Date | |
Msg-id | CALDaNm2toSP3n3xaMPga_=y5zZ88P_cyZ-kyB2dyKKp3dE3uMg@mail.gmail.com |
In reply to | Re: Deadlock between logrep apply worker and tablesync worker (vignesh C <vignesh21@gmail.com>) |
Responses | RE: Deadlock between logrep apply worker and tablesync worker |
List | pgsql-hackers |
On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > One thing that looks a bit odd is that we will anyway have a similar
> > > > > check in replorigin_drop_guts() which is a static function and
> > > > > called from only one place, so, will it be required to check at both places?
> > > >
> > > > There is a possibility that the initial check to verify if replication
> > > > origin exists in replorigin_drop_by_name was successful but later one
> > > > of either table sync worker or apply worker process might have dropped
> > > > the replication origin,
> > > >
> > >
> > > Won't locking on the particular origin prevent concurrent drops? IIUC, the
> > > drop happens after the patch acquires the lock on the origin.
> >
> > Yes, I think the existence check in replorigin_drop_guts is unnecessary as we
> > already lock the origin before that. I think the check in replorigin_drop_guts
> > is a custom check after calling SearchSysCache1 to get the tuple, but the error
> > should not happen as no concurrent drop can be performed.
> >
> > To make it simpler, one idea is to move the code that getting the tuple from
> > system cache to the replorigin_drop_by_name(). After locking the origin, we
> > can try to get the tuple and do the existence check, and we can reuse
> > this tuple to perform origin delete. In this approach we only need to check
> > origin existence once after locking. BTW, if we do this, then we'd better rename the
> > replorigin_drop_guts() to something like replorigin_state_clear() as the function
> > only clear the in-memory information after that.
> >
> > The code could be like:
> >
> > -------
> > replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> > ...
> > /*
> >  * Lock the origin to prevent concurrent drops. We keep the lock until the
> >  * end of transaction.
> >  */
> > LockSharedObject(ReplicationOriginRelationId, roident, 0,
> >                  AccessExclusiveLock);
> >
> > tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
> > if (!HeapTupleIsValid(tuple))
> > {
> >     if (!missing_ok)
> >         elog(ERROR, "cache lookup failed for replication origin with ID %d",
> >              roident);
> >
> >     return;
> > }
> >
> > replorigin_state_clear(rel, roident, nowait);
> >
> > /*
> >  * Now, we can delete the catalog entry.
> >  */
> > CatalogTupleDelete(rel, &tuple->t_self);
> > ReleaseSysCache(tuple);
> >
> > CommandCounterIncrement();

The attached updated patch has the changes to handle this. I had not merged one of my local changes earlier; please find the updated patch including that change. Sorry for missing it.

Regards,
Vignesh
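For context, a minimal caller sketch (not taken from the attached patch) of how the restructured drop would be exercised by a worker. It assumes the tablesync origin name format "pg_%u_%u" and the replorigin_drop_by_name() signature quoted above; subid and relid are placeholder variables:

    /*
     * Hypothetical caller sketch: both the apply worker and the tablesync
     * worker may attempt the drop.  With the lookup and existence check done
     * once after the origin lock is taken, whichever worker acquires the
     * lock second finds the tuple already gone and, because missing_ok is
     * true, simply returns instead of raising an error.
     */
    char        originname[NAMEDATALEN];

    snprintf(originname, sizeof(originname), "pg_%u_%u", subid, relid);
    replorigin_drop_by_name(originname, true /* missing_ok */, false /* nowait */);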
Attachments