Обсуждение: Re: [COMMITTERS] pgsql: Fix inadequate locking during get_rel_oids().
Hi Tom, On 2017-09-29 20:26:42 +0000, Tom Lane wrote: > get_rel_oids used to not take any relation locks at all, but that stopped > being a good idea with commit 3c3bb9933, which inserted a syscache lookup > into the function. A concurrent DROP TABLE could now produce "cache lookup > failed", which we don't want to have happen in normal operation. The best > solution seems to be to transiently take a lock on the relation named by > the RangeVar (which also makes the result of RangeVarGetRelid a lot less > spongy). But we shouldn't hold the lock beyond this function, because we > don't want VACUUM to lock more than one table at a time. (That would not > be a big problem right now, but it will become one after the pending > feature patch to allow multiple tables to be named in VACUUM.) I'm not sure how big a problem this is, but I suspect it is one. Autovacuum used to skip relations when they're locked, and the vacuum isn't an anti-wraparound one. But after this change it appears we'll get stuck behind this new lock, even if VACOPT_NOWAIT is specified. That's bad because now relations that are AEL locked (e.g. because of some longrunning DDL) can block global autovacuum progress. I noticed this when reviewing a patch that adds NOWAIT (now renamed to SKIP LOCKED) as a user visible knob and it getting stuck behind an AEL. The updated version of the patch http://archives.postgresql.org/message-id/7327B413-1A57-477F-A6A0-6FD80CB5482D%40amazon.com has an attempt at fixing the issue, although I've not yet looked at it in any sort of detail. I suspect we should break out that part once reviewed, and backpatch to 10? Greetings, Andres Freund
On 2018-03-05 16:11:52 -0800, Andres Freund wrote: > Hi Tom, > > On 2017-09-29 20:26:42 +0000, Tom Lane wrote: > > get_rel_oids used to not take any relation locks at all, but that stopped > > being a good idea with commit 3c3bb9933, which inserted a syscache lookup > > into the function. A concurrent DROP TABLE could now produce "cache lookup > > failed", which we don't want to have happen in normal operation. The best > > solution seems to be to transiently take a lock on the relation named by > > the RangeVar (which also makes the result of RangeVarGetRelid a lot less > > spongy). But we shouldn't hold the lock beyond this function, because we > > don't want VACUUM to lock more than one table at a time. (That would not > > be a big problem right now, but it will become one after the pending > > feature patch to allow multiple tables to be named in VACUUM.) > > I'm not sure how big a problem this is, but I suspect it is > one. Autovacuum used to skip relations when they're locked, and the > vacuum isn't an anti-wraparound one. But after this change it appears > we'll get stuck behind this new lock, even if VACOPT_NOWAIT is > specified. That's bad because now relations that are AEL locked > (e.g. because of some longrunning DDL) can block global autovacuum > progress. Scratch that, we should be going down the /* If caller supplied OID, there's nothing we need do here. */ if (OidIsValid(vrel->oid)) { oldcontext = MemoryContextSwitchTo(vac_context); vacrels = lappend(vacrels, vrel); MemoryContextSwitchTo(oldcontext); } branch in expand_vacuum_rel() for autovacuum, so this shouldn't matter. Sorry for the noise Greetings, Andres Freund
On 2018-03-05 16:11:52 -0800, Andres Freund wrote: > Hi Tom, > > On 2017-09-29 20:26:42 +0000, Tom Lane wrote: > > get_rel_oids used to not take any relation locks at all, but that stopped > > being a good idea with commit 3c3bb9933, which inserted a syscache lookup > > into the function. A concurrent DROP TABLE could now produce "cache lookup > > failed", which we don't want to have happen in normal operation. The best > > solution seems to be to transiently take a lock on the relation named by > > the RangeVar (which also makes the result of RangeVarGetRelid a lot less > > spongy). But we shouldn't hold the lock beyond this function, because we > > don't want VACUUM to lock more than one table at a time. (That would not > > be a big problem right now, but it will become one after the pending > > feature patch to allow multiple tables to be named in VACUUM.) > > I'm not sure how big a problem this is, but I suspect it is > one. Autovacuum used to skip relations when they're locked, and the > vacuum isn't an anti-wraparound one. But after this change it appears > we'll get stuck behind this new lock, even if VACOPT_NOWAIT is > specified. That's bad because now relations that are AEL locked > (e.g. because of some longrunning DDL) can block global autovacuum > progress. Scratch that, we should be going down the /* If caller supplied OID, there's nothing we need do here. */ if (OidIsValid(vrel->oid)) { oldcontext = MemoryContextSwitchTo(vac_context); vacrels = lappend(vacrels, vrel); MemoryContextSwitchTo(oldcontext); } branch in expand_vacuum_rel() for autovacuum, so this shouldn't matter. Sorry for the noise Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2017-09-29 20:26:42 +0000, Tom Lane wrote: >> get_rel_oids used to not take any relation locks at all, but that stopped >> being a good idea with commit 3c3bb9933, which inserted a syscache lookup >> into the function. A concurrent DROP TABLE could now produce "cache lookup >> failed", which we don't want to have happen in normal operation. The best >> solution seems to be to transiently take a lock on the relation named by >> the RangeVar (which also makes the result of RangeVarGetRelid a lot less >> spongy). But we shouldn't hold the lock beyond this function, because we >> don't want VACUUM to lock more than one table at a time. (That would not >> be a big problem right now, but it will become one after the pending >> feature patch to allow multiple tables to be named in VACUUM.) > I'm not sure how big a problem this is, but I suspect it is > one. Autovacuum used to skip relations when they're locked, and the > vacuum isn't an anti-wraparound one. But after this change it appears > we'll get stuck behind this new lock, even if VACOPT_NOWAIT is > specified. That's bad because now relations that are AEL locked > (e.g. because of some longrunning DDL) can block global autovacuum > progress. Hm. Maybe we should take this lock with nowait if the vacuum option is specified? Another idea is to revert both this patch and 3c3bb9933, and instead handle partitioning more like we handle recursion for toast tables, ie make no decisions until after we do have lock on a relation. The really fundamental problem here is that 3c3bb9933 thought it is OK to do a syscache lookup on a table we have no lock on, which is flat wrong. But we don't necessarily have to do it exactly like that. regards, tom lane
Andres Freund <andres@anarazel.de> writes: > Scratch that, we should be going down the > /* If caller supplied OID, there's nothing we need do here. */ > branch in expand_vacuum_rel() for autovacuum, so this shouldn't > matter. Sorry for the noise But you said you'd seen blocking behind AEL with NOWAIT, so there's still a problem for manual vacuums no? regards, tom lane
On Mon, Mar 05, 2018 at 04:34:09PM -0800, Andres Freund wrote: > Scratch that, we should be going down the > /* If caller supplied OID, there's nothing we need do here. */ > if (OidIsValid(vrel->oid)) > { > oldcontext = MemoryContextSwitchTo(vac_context); > vacrels = lappend(vacrels, vrel); > MemoryContextSwitchTo(oldcontext); > } > branch in expand_vacuum_rel() for autovacuum, so this shouldn't > matter. Sorry for the noise Yes, I don't see a problem. However I can understand that it is easy to be confused on those code paths if you are not used to them and this area has changed quite a bit the last years. Perhaps we could add an Assert(IsAutoVacuumWorkerProcess()) to reduce the confusion? -- Michael
Вложения
On Mon, Mar 05, 2018 at 04:34:09PM -0800, Andres Freund wrote: > Scratch that, we should be going down the > /* If caller supplied OID, there's nothing we need do here. */ > if (OidIsValid(vrel->oid)) > { > oldcontext = MemoryContextSwitchTo(vac_context); > vacrels = lappend(vacrels, vrel); > MemoryContextSwitchTo(oldcontext); > } > branch in expand_vacuum_rel() for autovacuum, so this shouldn't > matter. Sorry for the noise Yes, I don't see a problem. However I can understand that it is easy to be confused on those code paths if you are not used to them and this area has changed quite a bit the last years. Perhaps we could add an Assert(IsAutoVacuumWorkerProcess()) to reduce the confusion? -- Michael
Вложения
On 2018-03-05 19:53:23 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Scratch that, we should be going down the > > /* If caller supplied OID, there's nothing we need do here. */ > > branch in expand_vacuum_rel() for autovacuum, so this shouldn't > > matter. Sorry for the noise > > But you said you'd seen blocking behind AEL with NOWAIT, so there's > still a problem for manual vacuums no? Right. But that facility isn't yet exposed to SQL, just in the patch I'm reviewing. So there's no issue with the code from the commit I'm replying to. In [1] I wrote about one idea how to resolve this for the proposed patch. Greetings, Andres Freund [1] https://www.postgresql.org/message-id/20180306005349.b65whmvj7z6hbe2y@alap3.anarazel.de