Re: race condition in pg_class
From | Noah Misch |
---|---|
Subject | Re: race condition in pg_class |
Date | 2024-05-12 23:29:23 |
Msg-id | 20240512232923.aa.nmisch@google.com |
In reply to | Re: race condition in pg_class (Noah Misch <noah@leadboat.com>) |
Replies | Re: race condition in pg_class (×4) |
List | pgsql-hackers |
I'm attaching patches implementing the LockTuple() design.  It turns out we
don't just lose inplace updates.  We also overwrite unrelated tuples,
reproduced at inplace.spec.  Good starting points are README.tuplock and the
heap_inplace_update_scan() header comment.

On Wed, Nov 01, 2023 at 08:09:15PM -0700, Noah Misch wrote:
> On Fri, Oct 27, 2023 at 04:26:12PM -0700, Noah Misch wrote:
> > On Fri, Oct 27, 2023 at 06:40:55PM -0400, Tom Lane wrote:
> > > We could perhaps make this work by using the existing tuple-lock
> > > infrastructure, rather than inventing new locktags (a choice that
> > > spills to a lot of places including clients that examine pg_locks).
> > > I'd also like to find a solution that fixes the case of a conflicting
> > > manual UPDATE (although certainly that's a stretch goal we may not be
> > > able to reach).

I implemented that; search for ri_needLockTagTuple.

> - GRANT passes to heapam the fixed-size portion of the pre-modification
>   tuple.  heap_update() compares those bytes to the oldtup in shared
>   buffers to see if an inplace update happened.  (HEAD could get the bytes
>   from a new heap_update() parameter, while back branches would need a
>   different passing approach.)

This could have been fine, but ...

> - LockTuple(), as seen in its attached prototype.  I like this least at
>   the moment, because it changes pg_locks content without having a clear
>   advantage over the previous option.

... I settled on the LockTuple() design for these reasons:

- Solves more conflicts by waiting, instead of by ERROR or by retry loops.

- Extensions wanting inplace updates don't have a big disadvantage over
  core code inplace updates.

- One could use this to stop "tuple concurrently updated" for pg_class
  rows, by using SearchSysCacheLocked1() for all pg_class DDL and making
  that function wait for any existing xmax like inplace_xmax_lock() does.
  I don't expect to write that, but it's a nice option to have.

- pg_locks shows the new lock acquisitions.

Separable, nontrivial things not fixed in the attached patch stack:

- Inplace update uses transactional CacheInvalidateHeapTuple().  ROLLBACK
  of CREATE INDEX wrongly discards the inval, leading to the relhasindex=t
  loss still seen in inplace-inval.spec.  CacheInvalidateRelmap() does this
  right.

- AtEOXact_Inval(true) is outside the RecordTransactionCommit() critical
  section, but it is critical.  We must not commit transactional DDL
  without other backends receiving an inval.  (When the inplace inval
  becomes nontransactional, it will face the same threat.)

- Trouble is possible, I bet, if the system crashes between the
  inplace-update memcpy() and XLogInsert().  See the new XXX comment below
  the memcpy().  Might solve this by inplace update setting DELAY_CHKPT,
  writing WAL, and finally issuing memcpy() into the buffer.  A sketch of
  that ordering follows this list.

- [consequences limited to transient failure] Since a PROC_IN_VACUUM
  backend's xmin does not stop pruning, an MVCC scan in that backend can
  find zero tuples when one is live.  This is like what all backends got in
  the days of SnapshotNow catalog scans.  See the pgbench test suite
  addition.  (Perhaps the fix is to make VACUUM do its MVCC scans outside
  of PROC_IN_VACUUM, setting that flag later and unsetting it earlier.)
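To make the third item above concrete, here's a minimal sketch of the
suggested ordering (an illustration of the idea, not the patch's code).
It assumes heap_inplace_update()'s existing locals (relation, buffer,
page, htup, tuple, newlen) and the v15+ MyProc->delayChkptFlags field,
and it glosses over full-page-image interactions:

    /* Keep a checkpoint's redo pointer out of the WAL-to-buffer window. */
    Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;

    START_CRIT_SECTION();

    if (RelationNeedsWAL(relation))
    {
        xl_heap_inplace xlrec;
        XLogRecPtr  recptr;

        xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, SizeOfHeapInplace);
        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
        /* Register the incoming bytes; the buffer is not yet modified. */
        XLogRegisterBufData(0,
                            (char *) tuple->t_data + tuple->t_data->t_hoff,
                            newlen);
        recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
        PageSetLSN(page, recptr);
    }

    /* Only now overwrite the tuple in shared buffers. */
    memcpy((char *) htup + htup->t_hoff,
           (char *) tuple->t_data + tuple->t_data->t_hoff,
           newlen);
    MarkBufferDirty(buffer);

    END_CRIT_SECTION();

    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;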
If you find decisions in this thread's patches are tied to any of those
such that I should not separate those, let's discuss that.  Topics in the
patches that I feel are most fruitful for debate:

- This makes inplace update block if the tuple has an updater.  It's like
  one GRANT blocking another, except an inplace updater won't get "ERROR:
  tuple concurrently updated" like one of the GRANTs would.  I had
  implemented versions that avoided this blocking by mutating each tuple in
  the updated tuple chain.  That worked, but it had corner cases bad for
  maintainability, listed in the inplace_xmax_lock() header comment.  I'd
  rather accept the blocking, so hackers can rule out those corner cases.
  A long-running GRANT already hurts VACUUM progress more just by keeping
  an XID running.

- Pre-checks could make heap_inplace_update_cancel() calls rarer.  Avoiding
  one of those avoids an exclusive buffer lock, and it avoids waiting on
  concurrent heap_update() if any.  We'd pre-check the syscache tuple.
  EventTriggerOnLogin() does it that way, because the code was already in
  that form.  I expect only vac_update_datfrozenxid() concludes !dirty
  often enough to matter.  I didn't bother with the optimization, but it
  would be simple; a rough sketch appears after the sign-off.

- If any citus extension user feels like porting its heap_inplace_update()
  call to this, I'd value hearing about your experience.

- I paid more than my usual attention to test coverage, considering the
  patch stack's intensity compared to most back-patch bug fixes.

I've kept all the above topics brief; feel free to ask for more details.

Thanks,
nm
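P.S. For the pre-check topic above, a rough sketch (again not the patch's
code), modeled on vac_update_datfrozenxid() and assuming the patch's
heap_inplace_update_scan() entry point; the real function's repair of
bogus (future) values is omitted:

    HeapTuple   cachetup;
    Form_pg_database cacheform;
    bool        dirty = false;

    /*
     * Read the syscache copy first.  A stale read is harmless: skipping is
     * only an optimization, and when we don't skip, the locked scan
     * re-fetches the tuple authoritatively.
     */
    cachetup = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
    if (!HeapTupleIsValid(cachetup))
        elog(ERROR, "could not find tuple for database %u", MyDatabaseId);
    cacheform = (Form_pg_database) GETSTRUCT(cachetup);

    /* Would the inplace update change anything? */
    if (TransactionIdPrecedes(cacheform->datfrozenxid, newFrozenXid) ||
        MultiXactIdPrecedes(cacheform->datminmxid, newMinMulti))
        dirty = true;

    ReleaseSysCache(cachetup);

    if (!dirty)
        return;     /* no exclusive buffer lock, no LockTuple() wait */

    /* ... otherwise proceed with heap_inplace_update_scan() ... */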
Attachments
- inplace005-UNEXPECTEDPASS-tap-meson-v1.patch
- inplace010-tests-v1.patch
- inplace040-waitfuncs-v1.patch
- inplace050-tests-inj-v1.patch
- inplace060-nodeModifyTable-comments-v1.patch
- inplace070-rel-locks-missing-v1.patch
- inplace080-catcache-detoast-inplace-stale-v1.patch
- inplace090-LOCKTAG_TUPLE-eoxact-v1.patch
- inplace110-successors-v1.patch
- inplace120-locktag-v1.patch