Thread: Inval reliability, especially for inplace updates


Inval reliability, especially for inplace updates

From
Noah Misch
Date:
https://postgr.es/m/20240512232923.aa.nmisch@google.com wrote:
> Separable, nontrivial things not fixed in the attached patch stack:
> 
> - Inplace update uses transactional CacheInvalidateHeapTuple().  ROLLBACK of
>   CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
>   still seen in inplace-inval.spec.  CacheInvalidateRelmap() does this right.

I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
inside the critical section.  Send it in heap_xlog_inplace(), too.  The
interesting decision is how to handle RelationCacheInitFilePreInvalidate(),
which has an unlink_initfile() that can fail with e.g. EIO.  Options:

1. Unlink during critical section, and accept that EIO becomes PANIC.  Replay
   may reach the same EIO, and the system won't reopen to connections until
   the storage starts cooperating.a  Interaction with checkpoints is not ideal.
   If we checkpoint and then crash between inplace XLogInsert() and inval,
   we'd be relying on StartupXLOG() -> RelationCacheInitFileRemove().  That
   uses elevel==LOG, so replay would neglect to PANIC on EIO.

2. Unlink before critical section, so normal xact abort suffices.  This would
   hold RelCacheInitLock and a buffer content lock at the same time.  In
   RecordTransactionCommit(), it would hold RelCacheInitLock and e.g. slru
   locks at the same time.

The PANIC risk of (1) seems similar to the risk of PANIC at
RecordTransactionCommit() -> XLogFlush(), which hasn't been a problem.  The
checkpoint-related risk bothers me more, and (1) generally makes it harder to
reason about checkpoint interactions.  The lock order risk of (2) feels
tolerable.  I'm leaning toward (2), but that might change.  Other preferences?

Another decision is what to do about LogLogicalInvalidations().  Currently,
inplace update invalidations do reach WAL via LogLogicalInvalidations() at the
next CCI.  Options:

a. Within logical decoding, cease processing invalidations for inplace
   updates.  Inplace updates don't affect storage or interpretation of table
   rows, so they don't affect logicalrep_write_tuple() outcomes.  If they did,
   invalidations wouldn't make it work.  Decoding has no way to retrieve a
   snapshot-appropriate version of the inplace-updated value.

b. Make heap_decode() of XLOG_HEAP_INPLACE recreate the invalidation.  This
   would be, essentially, cheap insurance against invalidations having a
   benefit I missed in (a).

I plan to pick (a).

> - AtEOXact_Inval(true) is outside the RecordTransactionCommit() critical
>   section, but it is critical.  We must not commit transactional DDL without
>   other backends receiving an inval.  (When the inplace inval becomes
>   nontransactional, it will face the same threat.)

This faces the same RelationCacheInitFilePreInvalidate() decision, and I think
the conclusion should be the same as for inplace update.

Thanks,
nm



Re: Inval reliability, especially for inplace updates

From
Noah Misch
Date:
On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:
> https://postgr.es/m/20240512232923.aa.nmisch@google.com wrote:
> > Separable, nontrivial things not fixed in the attached patch stack:
> > 
> > - Inplace update uses transactional CacheInvalidateHeapTuple().  ROLLBACK of
> >   CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
> >   still seen in inplace-inval.spec.  CacheInvalidateRelmap() does this right.
> 
> I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
> inside the critical section.  Send it in heap_xlog_inplace(), too.

> a. Within logical decoding, cease processing invalidations for inplace

I'm attaching the implementation.  This applies atop the v3 patch stack from
https://postgr.es/m/20240614003549.c2.nmisch@google.com, but the threads are
mostly orthogonal and intended for independent review.  Translating a tuple
into inval messages uses more infrastructure than relmapper, which needs just
a database ID.  Hence, this ended up more like a miniature of inval.c's
participation in the transaction commit sequence.

I waffled on whether to back-patch inplace150-inval-durability-atcommit.  The
consequences of that bug are plenty bad, but reaching them requires an error
between TransactionIdCommitTree() and AtEOXact_Inval().  I've not heard
reports of that, and I don't have a recipe for making it happen on demand.
For now, I'm leaning toward back-patch.  The main risk would be me overlooking
an LWLock deadlock scenario reachable from the new, earlier RelCacheInitLock
timing.  Alternatives for RelCacheInitLock:

- RelCacheInitLock before PreCommit_Notify(), because notify concurrency
  matters more than init file concurrency.  I chose this.
- RelCacheInitLock after PreCommit_Notify(), because PreCommit_Notify() uses a
  heavyweight lock, giving it less risk of undetected deadlock.
- Replace RelCacheInitLock with a heavyweight lock, and keep it before
  PreCommit_Notify().
- Fold PreCommit_Inval() back into AtCommit_Inval(), accepting that EIO in
  unlink_initfile() will PANIC.

Opinions on that?

The patch changes xl_heap_inplace of XLOG_HEAP_INPLACE.  For back branches, we
could choose between:

- Same change, no WAL version bump.  Standby must update before primary.  This
  is best long-term, but the transition is more disruptive.  I'm leaning
  toward this one, but the second option isn't bad:

- heap_xlog_inplace() could set the shared-inval-queue overflow signal on
  every backend.  This is more wasteful, but inplace updates might be rare
  enough (~once per VACUUM) to make it tolerable.

- Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE.  This isn't
  correct if one ends recovery between the two records, but you'd need to be
  unlucky to notice.  Noticing would need a procedure like the following.  A
  hot standby backend populates a relcache entry, then does DDL on the rel
  after recovery ends.

Future cleanup work could eliminate LogStandbyInvalidations() and the case of
!markXidCommitted && nmsgs != 0.  Currently, the src/test/regress suite still
reaches that case:

- AlterDomainDropConstraint() queues an inval even if !found; it can stop
  that.

- ON COMMIT DELETE ROWS nontransactionally rebuilds an index, which sends a
  relcache inval.  The point of that inval is, I think, to force access
  methods like btree and hash to reload the metapage copy that they store in
  rd_amcache.  Since no assigned XID implies no changes to the temp index, the
  no-XID case could simply skip the index rebuild.  (temp.sql reaches this
  with a read-only transaction that selects from an ON COMMIT DELETE ROWS
  table.  Realistic usage will tend not to do that.)  ON COMMIT DELETE ROWS
  has another preexisting problem for indexes, mentioned in a code comment.

Thanks,
nm

Attachments

Re: Inval reliability, especially for inplace updates

From
Noah Misch
Date:
On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:
> I'm attaching the implementation.

I'm withdrawing inplace150-inval-durability-atcommit-v1.patch, having found
two major problems so far:

1. It sends transactional invalidation messages before
   ProcArrayEndTransaction(), so other backends can read stale data.

2. It didn't make the equivalent changes for COMMIT PREPARED.



Re: Inval reliability, especially for inplace updates

От
Noah Misch
Дата:
On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:
> On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:
> > https://postgr.es/m/20240512232923.aa.nmisch@google.com wrote:
> > > Separable, nontrivial things not fixed in the attached patch stack:
> > > 
> > > - Inplace update uses transactional CacheInvalidateHeapTuple().  ROLLBACK of
> > >   CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
> > >   still seen in inplace-inval.spec.  CacheInvalidateRelmap() does this right.
> > 
> > I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
> > inside the critical section.  Send it in heap_xlog_inplace(), too.
> 
> > a. Within logical decoding, cease processing invalidations for inplace
> 
> I'm attaching the implementation.  This applies atop the v3 patch stack from
> https://postgr.es/m/20240614003549.c2.nmisch@google.com, but the threads are
> mostly orthogonal and intended for independent review.  Translating a tuple
> into inval messages uses more infrastructure than relmapper, which needs just
> a database ID.  Hence, this ended up more like a miniature of inval.c's
> participation in the transaction commit sequence.
> 
> I waffled on whether to back-patch inplace150-inval-durability-atcommit

That inplace150 patch turned out to be unnecessary.  Contrary to the
"noncritical resource releasing" comment some lines above
AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
PANIC.  An ERROR just before or after sending invals becomes PANIC, "cannot
abort transaction %u, it was already committed".  Since
inplace130-AtEOXact_RelationCache-comments existed to clear the way for
inplace150, inplace130 also becomes unnecessary.  I've removed both from the
attached v2 patch stack.

> The patch changes xl_heap_inplace of XLOG_HEAP_INPLACE.  For back branches, we
> could choose between:
> 
> - Same change, no WAL version bump.  Standby must update before primary.  This
>   is best long-term, but the transition is more disruptive.  I'm leaning
>   toward this one, but the second option isn't bad:
> 
> - heap_xlog_inplace() could set the shared-inval-queue overflow signal on
>   every backend.  This is more wasteful, but inplace updates might be rare
>   enough (~once per VACUUM) to make it tolerable.
> 
> - Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE.  This isn't
>   correct if one ends recovery between the two records, but you'd need to be
>   unlucky to notice.  Noticing would need a procedure like the following.  A
>   hot standby backend populates a relcache entry, then does DDL on the rel
>   after recovery ends.

That still holds.

Attachments

Re: Inval reliability, especially for inplace updates

From
Andres Freund
Date:
Hi,

On 2024-06-17 16:58:54 -0700, Noah Misch wrote:
> On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:
> > On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:
> > > https://postgr.es/m/20240512232923.aa.nmisch@google.com wrote:
> > > > Separable, nontrivial things not fixed in the attached patch stack:
> > > >
> > > > - Inplace update uses transactional CacheInvalidateHeapTuple().  ROLLBACK of
> > > >   CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
> > > >   still seen in inplace-inval.spec.  CacheInvalidateRelmap() does this right.
> > >
> > > I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
> > > inside the critical section.  Send it in heap_xlog_inplace(), too.

I'm worried this might cause its own set of bugs, e.g. if there are any places
that, possibly accidentally, rely on the invalidation from the inplace update
to also cover separate changes.

Have you considered instead submitting these invalidations during abort as
well?


> > > a. Within logical decoding, cease processing invalidations for inplace
> >
> > I'm attaching the implementation.  This applies atop the v3 patch stack from
> > https://postgr.es/m/20240614003549.c2.nmisch@google.com, but the threads are
> > mostly orthogonal and intended for independent review.  Translating a tuple
> > into inval messages uses more infrastructure than relmapper, which needs just
> > a database ID.  Hence, this ended up more like a miniature of inval.c's
> > participation in the transaction commit sequence.
> >
> > I waffled on whether to back-patch inplace150-inval-durability-atcommit
>
> That inplace150 patch turned out to be unnecessary.  Contrary to the
> "noncritical resource releasing" comment some lines above
> AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
> PANIC.  An ERROR just before or after sending invals becomes PANIC, "cannot
> abort transaction %u, it was already committed".

Relying on that, instead of explicit critical sections, seems fragile to me.
IIRC some of the behaviour around errors around transaction commit/abort has
changed a bunch of times. Tying correctness into something that could be
changed for unrelated reasons doesn't seem great.

I'm not sure it holds true even today - what if the transaction didn't have an
xid? Then RecordTransactionAbort() wouldn't trigger
     "cannot abort transaction %u, it was already committed"
I think?



> > - Same change, no WAL version bump.  Standby must update before primary.  This
> >   is best long-term, but the transition is more disruptive.  I'm leaning
> >   toward this one, but the second option isn't bad:

Hm. The inplace record doesn't use the length of the "main data" record
segment for anything, from what I can tell. If records by an updated primary
were replayed by an old standby, it'd just ignore the additional data, afaict?

I think with the code as-is, the situation with an updated standby replaying
an old primary's record would actually be worse - it'd afaict just assume the
now-longer record contained valid fields, despite those just pointing into
uninitialized memory.  I think the replay routine would have to check the
length of the main data and execute the invalidation conditionally.


> > - heap_xlog_inplace() could set the shared-inval-queue overflow signal on
> >   every backend.  This is more wasteful, but inplace updates might be rare
> >   enough (~once per VACUUM) to make it tolerable.

We already set that surprisingly frequently, as
a) The size of the sinval queue is small
b) If a backend is busy, it does not process catchup interrupts
   (i.e. executing queries, waiting for a lock prevents processing)
c) There's no deduplication of invals, we often end up sending the same inval
   over and over.

So I suspect this might not be too bad, compared to the current badness.


At least for core code. I guess there could be extension code triggering
inplace updates more frequently? But I'd hope they'd do it not on catalog
tables... Except that we wouldn't know that that's the case during replay,
it's not contained in the record.




> > - Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE.  This isn't
> >   correct if one ends recovery between the two records, but you'd need to be
> >   unlucky to notice.  Noticing would need a procedure like the following.  A
> >   hot standby backend populates a relcache entry, then does DDL on the rel
> >   after recovery ends.

Hm. The problematic cases presumably involve an access exclusive lock? If so,
could we do LogStandbyInvalidations() *before* logging the WAL record for the
inplace update? The invalidations can't be processed by other backends until
the exclusive lock has been released, which should avoid the race?


Greetings,

Andres Freund



Re: Inval reliability, especially for inplace updates

From
Noah Misch
Date:
On Mon, Jun 17, 2024 at 06:57:30PM -0700, Andres Freund wrote:
> On 2024-06-17 16:58:54 -0700, Noah Misch wrote:
> > On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:
> > > On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:
> > > > https://postgr.es/m/20240512232923.aa.nmisch@google.com wrote:
> > > > > Separable, nontrivial things not fixed in the attached patch stack:
> > > > >
> > > > > - Inplace update uses transactional CacheInvalidateHeapTuple().  ROLLBACK of
> > > > >   CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
> > > > >   still seen in inplace-inval.spec.  CacheInvalidateRelmap() does this right.
> > > >
> > > > I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
> > > > inside the critical section.  Send it in heap_xlog_inplace(), too.
> 
> I'm worried this might cause its own set of bugs, e.g. if there are any places
> that, possibly accidentally, rely on the invalidation from the inplace update
> to also cover separate changes.

Good point.  I do have index_update_stats() still doing an ideally-superfluous
relcache update for that reason.  Taking that further, it would be cheap
insurance to have the inplace update do a transactional inval in addition to
its immediate inval.  Future master-only work could remove the transactional
one.  How about that?

> Have you considered instead submitting these invalidations during abort as
> well?

I had not.  Hmmm.  If the lock protocol in README.tuplock (after patch
inplace120) told SearchSysCacheLocked1() to do systable scans instead of
syscache reads, that could work.  Would need to ensure a PANIC if transaction
abort doesn't reach the inval submission.  Overall, it would be harder to
reason about the state of caches, but I suspect the patch would be smaller.
How should we choose between those strategies?

> > > > a. Within logical decoding, cease processing invalidations for inplace
> > >
> > > I'm attaching the implementation.  This applies atop the v3 patch stack from
> > > https://postgr.es/m/20240614003549.c2.nmisch@google.com, but the threads are
> > > mostly orthogonal and intended for independent review.  Translating a tuple
> > > into inval messages uses more infrastructure than relmapper, which needs just
> > > a database ID.  Hence, this ended up more like a miniature of inval.c's
> > > participation in the transaction commit sequence.
> > >
> > > I waffled on whether to back-patch inplace150-inval-durability-atcommit
> >
> > That inplace150 patch turned out to be unnecessary.  Contrary to the
> > "noncritical resource releasing" comment some lines above
> > AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
> > PANIC.  An ERROR just before or after sending invals becomes PANIC, "cannot
> > abort transaction %u, it was already committed".
> 
> Relying on that, instead of explicit critical sections, seems fragile to me.
> IIRC some of the behaviour around errors around transaction commit/abort has
> changed a bunch of times. Tying correctness into something that could be
> changed for unrelated reasons doesn't seem great.

Fair enough.  It could still be a good idea for master, but given I missed a
bug in inplace150-inval-durability-atcommit-v1.patch far worse than the ones
$SUBJECT fixes, let's not risk it in back branches.

> I'm not sure it holds true even today - what if the transaction didn't have an
> xid? Then RecordTransactionAbort() wouldn't trigger
>      "cannot abort transaction %u, it was already committed"
> I think?

I think that's right.  As the inplace160-inval-durability-inplace-v2.patch
edits to xact.c say, the concept of invals in XID-less transactions is buggy
at its core.  Fortunately, after that patch, we use them only for two things
that could themselves stop with something roughly as simple as the attached.

> > > - Same change, no WAL version bump.  Standby must update before primary.  This
> > >   is best long-term, but the transition is more disruptive.  I'm leaning
> > >   toward this one, but the second option isn't bad:
> 
> Hm. The inplace record doesn't use the length of the "main data" record
> segment for anything, from what I can tell. If records by an updated primary
> were replayed by an old standby, it'd just ignore the additional data, afaict?

Agreed, but ...

> I think with the code as-is, the situation with an updated standby replaying
> an old primary's record would actually be worse - it'd afaict just assume the
> now-longer record contained valid fields, despite those just pointing into
> uninitialized memory.  I think the replay routine would have to check the
> length of the main data and execute the invalidation conditionally.

I anticipated back branches supporting a new XLOG_HEAP_INPLACE_WITH_INVAL
alongside the old XLOG_HEAP_INPLACE.  Updated standbys would run both fine,
and old binaries consuming new WAL would PANIC, "heap_redo: unknown op code".

> > > - heap_xlog_inplace() could set the shared-inval-queue overflow signal on
> > >   every backend.  This is more wasteful, but inplace updates might be rare
> > >   enough (~once per VACUUM) to make it tolerable.
> 
> We already set that surprisingly frequently, as
> a) The size of the sinval queue is small
> b) If a backend is busy, it does not process catchup interrupts
>    (i.e. executing queries, waiting for a lock prevents processing)
> c) There's no deduplication of invals, we often end up sending the same inval
>    over and over.
> 
> So I suspect this might not be too bad, compared to the current badness.

That is good.  We might be able to do the overflow signal once at end of
recovery, like RelationCacheInitFileRemove() does for the init file.  That's
mildly harder to reason about, but it would be cheaper.  Hmmm.

> At least for core code. I guess there could be extension code triggering
> inplace updates more frequently? But I'd hope they'd do it not on catalog
> tables... Except that we wouldn't know that that's the case during replay,
> it's not contained in the record.

For what it's worth, from a grep of PGXN, only citus does inplace updates.

> > > - Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE.  This isn't
> > >   correct if one ends recovery between the two records, but you'd need to be
> > >   unlucky to notice.  Noticing would need a procedure like the following.  A
> > >   hot standby backend populates a relcache entry, then does DDL on the rel
> > >   after recovery ends.
> 
> Hm. The problematic cases presumably involve an access exclusive lock? If so,
> could we do LogStandbyInvalidations() *before* logging the WAL record for the
> inplace update? The invalidations can't be processed by other backends until
> the exclusive lock has been released, which should avoid the race?

A lock forces a backend to drain the inval queue before using the locked
object, but it doesn't stop the backend from draining the queue and
repopulating cache entries earlier.  For example, pg_describe_object() can
query many syscaches without locking underlying objects.  Hence, the inval
system relies on the buffer change getting fully visible to catcache queries
before the sinval message enters the shared queue.

Thanks,
nm



Re: Inval reliability, especially for inplace updates

From
Noah Misch
Date:
On Tue, Jun 18, 2024 at 08:23:49AM -0700, Noah Misch wrote:
> On Mon, Jun 17, 2024 at 06:57:30PM -0700, Andres Freund wrote:
> > On 2024-06-17 16:58:54 -0700, Noah Misch wrote:
> > > On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote:
> > > > On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote:
> > > > > https://postgr.es/m/20240512232923.aa.nmisch@google.com wrote:
> > > > > > Separable, nontrivial things not fixed in the attached patch stack:
> > > > > >
> > > > > > - Inplace update uses transactional CacheInvalidateHeapTuple().  ROLLBACK of
> > > > > >   CREATE INDEX wrongly discards the inval, leading to the relhasindex=t loss
> > > > > >   still seen in inplace-inval.spec.  CacheInvalidateRelmap() does this right.
> > > > >
> > > > > I plan to fix that like CacheInvalidateRelmap(): send the inval immediately,
> > > > > inside the critical section.  Send it in heap_xlog_inplace(), too.
> > 
> > I'm worried this might cause its own set of bugs, e.g. if there are any places
> > that, possibly accidentally, rely on the invalidation from the inplace update
> > to also cover separate changes.
> 
> Good point.  I do have index_update_stats() still doing an ideally-superfluous
> relcache update for that reason.  Taking that further, it would be cheap
> insurance to have the inplace update do a transactional inval in addition to
> its immediate inval.  Future master-only work could remove the transactional
> one.  How about that?
> 
> > Have you considered instead submitting these invalidations during abort as
> > well?
> 
> I had not.  Hmmm.  If the lock protocol in README.tuplock (after patch
> inplace120) told SearchSysCacheLocked1() to do systable scans instead of
> syscache reads, that could work.  Would need to ensure a PANIC if transaction
> abort doesn't reach the inval submission.  Overall, it would be harder to
> reason about the state of caches, but I suspect the patch would be smaller.
> How should we choose between those strategies?
> 
> > > > > a. Within logical decoding, cease processing invalidations for inplace
> > > >
> > > > I'm attaching the implementation.  This applies atop the v3 patch stack from
> > > > https://postgr.es/m/20240614003549.c2.nmisch@google.com, but the threads are
> > > > mostly orthogonal and intended for independent review.  Translating a tuple
> > > > into inval messages uses more infrastructure than relmapper, which needs just
> > > > a database ID.  Hence, this ended up more like a miniature of inval.c's
> > > > participation in the transaction commit sequence.
> > > >
> > > > I waffled on whether to back-patch inplace150-inval-durability-atcommit
> > >
> > > That inplace150 patch turned out to be unnecessary.  Contrary to the
> > > "noncritical resource releasing" comment some lines above
> > > AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
> > > PANIC.  An ERROR just before or after sending invals becomes PANIC, "cannot
> > > abort transaction %u, it was already committed".
> > 
> > Relying on that, instead of explicit critical sections, seems fragile to me.
> > IIRC some of the behaviour around errors around transaction commit/abort has
> > changed a bunch of times. Tying correctness into something that could be
> > changed for unrelated reasons doesn't seem great.
> 
> Fair enough.  It could still be a good idea for master, but given I missed a
> bug in inplace150-inval-durability-atcommit-v1.patch far worse than the ones
> $SUBJECT fixes, let's not risk it in back branches.
> 
> > I'm not sure it holds true even today - what if the transaction didn't have an
> > xid? Then RecordTransactionAbort() wouldn't trigger
> >      "cannot abort transaction %u, it was already committed"
> > I think?
> 
> I think that's right.  As the inplace160-inval-durability-inplace-v2.patch
> edits to xact.c say, the concept of invals in XID-less transactions is buggy
> at its core.  Fortunately, after that patch, we use them only for two things
> that could themselves stop with something roughly as simple as the attached.

Now actually attached.

> > > > - Same change, no WAL version bump.  Standby must update before primary.  This
> > > >   is best long-term, but the transition is more disruptive.  I'm leaning
> > > >   toward this one, but the second option isn't bad:
> > 
> > Hm. The inplace record doesn't use the length of the "main data" record
> > segment for anything, from what I can tell. If records by an updated primary
> > were replayed by an old standby, it'd just ignore the additional data, afaict?
> 
> Agreed, but ...
> 
> > I think with the code as-is, the situation with an updated standby replaying
> > an old primary's record would actually be worse - it'd afaict just assume the
> > now-longer record contained valid fields, despite those just pointing into
> > uninitialized memory.  I think the replay routine would have to check the
> > length of the main data and execute the invalidation conditionally.
> 
> I anticipated back branches supporting a new XLOG_HEAP_INPLACE_WITH_INVAL
> alongside the old XLOG_HEAP_INPLACE.  Updated standbys would run both fine,
> and old binaries consuming new WAL would PANIC, "heap_redo: unknown op code".
> 
> > > > - heap_xlog_inplace() could set the shared-inval-queue overflow signal on
> > > >   every backend.  This is more wasteful, but inplace updates might be rare
> > > >   enough (~once per VACUUM) to make it tolerable.
> > 
> > We already set that surprisingly frequently, as
> > a) The size of the sinval queue is small
> > b) If a backend is busy, it does not process catchup interrupts
> >    (i.e. executing queries, waiting for a lock prevents processing)
> > c) There's no deduplication of invals, we often end up sending the same inval
> >    over and over.
> > 
> > So I suspect this might not be too bad, compared to the current badness.
> 
> That is good.  We might be able to do the overflow signal once at end of
> recovery, like RelationCacheInitFileRemove() does for the init file.  That's
> mildly harder to reason about, but it would be cheaper.  Hmmm.
> 
> > At least for core code. I guess there could be extension code triggering
> > inplace updates more frequently? But I'd hope they'd do it not on catalog
> > tables... Except that we wouldn't know that that's the case during replay,
> > it's not contained in the record.
> 
> For what it's worth, from a grep of PGXN, only citus does inplace updates.
> 
> > > > - Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE.  This isn't
> > > >   correct if one ends recovery between the two records, but you'd need to be
> > > >   unlucky to notice.  Noticing would need a procedure like the following.  A
> > > >   hot standby backend populates a relcache entry, then does DDL on the rel
> > > >   after recovery ends.
> > 
> > Hm. The problematic cases presumably involve an access exclusive lock? If so,
> > could we do LogStandbyInvalidations() *before* logging the WAL record for the
> > inplace update? The invalidations can't be processed by other backends until
> > the exclusive lock has been released, which should avoid the race?
> 
> A lock forces a backend to drain the inval queue before using the locked
> object, but it doesn't stop the backend from draining the queue and
> repopulating cache entries earlier.  For example, pg_describe_object() can
> query many syscaches without locking underlying objects.  Hence, the inval
> system relies on the buffer change getting fully visible to catcache queries
> before the sinval message enters the shared queue.
> 
> Thanks,
> nm

Attachments

Re: Inval reliability, especially for inplace updates

From
Noah Misch
Date:
On Mon, Jun 17, 2024 at 04:58:54PM -0700, Noah Misch wrote:
> attached v2 patch stack.

Rebased.  This applies on top of three patches from
https://postgr.es/m/20240629024251.03.nmisch@google.com.  I'm attaching those
to placate cfbot, but this thread is for review of the last patch only.

Attachments

Re: Inval reliability, especially for inplace updates

От
Noah Misch
Дата:
On Tue, Jun 18, 2024 at 08:23:49AM -0700, Noah Misch wrote:
> On Mon, Jun 17, 2024 at 06:57:30PM -0700, Andres Freund wrote:
> > On 2024-06-17 16:58:54 -0700, Noah Misch wrote:
> > > That inplace150 patch turned out to be unnecessary.  Contrary to the
> > > "noncritical resource releasing" comment some lines above
> > > AtEOXact_Inval(true), the actual behavior is already to promote ERROR to
> > > PANIC.  An ERROR just before or after sending invals becomes PANIC, "cannot
> > > abort transaction %u, it was already committed".
> > 
> > Relying on that, instead of explicit critical sections, seems fragile to me.
> > IIRC some of the behaviour around errors around transaction commit/abort has
> > changed a bunch of times. Tying correctness into something that could be
> > changed for unrelated reasons doesn't seem great.
> 
> Fair enough.  It could still be a good idea for master, but given I missed a
> bug in inplace150-inval-durability-atcommit-v1.patch far worse than the ones
> $SUBJECT fixes, let's not risk it in back branches.

What are your thoughts on whether a change to explicit critical sections
should be master-only vs. back-patched?  I have a feeling your comment pointed
to something I'm still missing, but I don't know where to look next.



Re: Inval reliability, especially for inplace updates

From
Nitin Motiani
Date:
On Sat, Oct 12, 2024 at 5:47 PM Noah Misch <noah@leadboat.com> wrote:
>
> Rebased.

Hi,

I have a couple of questions :

1. In heap_inplace_update_and_unlock, currently both buffer and tuple
are unlocked outside the critical section. Why do we have to move the
buffer unlock within the critical section here? My guess is that it
needs to be unlocked for the inplace invals to be processed. But what
is the reasoning behind that?

2. Is there any benefit in CacheInvalidateHeapTupleCommon taking the
prepare_callback argument? Wouldn't it be simpler to just pass an
InvalidationInfo* to the function?

Also is inval-requires-xid-v0.patch planned to be fixed up to inplace160?

Thanks



Re: Inval reliability, especially for inplace updates

From
Noah Misch
Date:
On Sat, Oct 12, 2024 at 06:05:06PM +0530, Nitin Motiani wrote:
> 1. In heap_inplace_update_and_unlock, currently both buffer and tuple
> are unlocked outside the critical section. Why do we have to move the
> buffer unlock within the critical section here? My guess is that it
> needs to be unlocked for the inplace invals to be processed. But what
> is the reasoning behind that?

AtInplace_Inval() acquires SInvalWriteLock.  There are two reasons to want to
release the buffer lock before acquiring SInvalWriteLock:

1. Otherwise, we'd need to maintain the invariant that no other part of the
   system tries to lock the buffer while holding SInvalWriteLock.  (That would
   cause an undetected deadlock.)

2. Concurrency is better if we release a no-longer-needed LWLock before doing
   something time-consuming, like acquiring another LWLock potentially is.

Inplace invals do need to happen in the critical section, because we've
already written the change to shared buffers, making it the new authoritative
value.  If we fail to invalidate, other backends may continue operating with
stale caches.
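A standalone sketch (illustrative only, not PostgreSQL source) of the ordering described above: the tuple change hits shared buffers first, the buffer content lock is dropped before SInvalWriteLock is taken, and the invals are sent before the critical section ends.

```c
#include <assert.h>

/* Hypothetical sketch of the step ordering in
 * heap_inplace_update_and_unlock().  The names are stand-ins; only the
 * ordering reflects the discussion above. */

enum step
{
    WRITE_TUPLE,                /* modify tuple in shared buffers: now authoritative */
    UNLOCK_BUFFER,              /* release buffer content lock first ... */
    ACQUIRE_SINVAL,             /* ... only then acquire SInvalWriteLock */
    SEND_INVALS,                /* AtInplace_Inval(): flush stale caches elsewhere */
    END_CRITICAL                /* any failure before this point must PANIC */
};

#define MAX_STEPS 8
static enum step trace[MAX_STEPS];
static int ntrace = 0;

static void
record(enum step s)
{
    trace[ntrace++] = s;
}

static void
inplace_update_and_unlock_sketch(void)
{
    ntrace = 0;
    record(WRITE_TUPLE);
    record(UNLOCK_BUFFER);      /* before ACQUIRE_SINVAL: avoids the
                                 * undetected deadlock from reason (1) */
    record(ACQUIRE_SINVAL);
    record(SEND_INVALS);
    record(END_CRITICAL);
}
```

The point of the ordering is that UNLOCK_BUFFER strictly precedes ACQUIRE_SINVAL, while SEND_INVALS still precedes END_CRITICAL.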

> 2. Is there any benefit in CacheInvalidateHeapTupleCommon taking the
> prepare_callback argument? Wouldn't it be simpler to just pass an
> InvalidationInfo* to the function?

CacheInvalidateHeapTupleCommon() has three conditions that cause it to return
without invoking the callback.  Every heap_update() calls
CacheInvalidateHeapTuple().  In typical performance-critical systems, non-DDL
changes dwarf DDL.  Hence, the overwhelming majority of heap_update() calls
involve !IsCatalogRelation().  I wouldn't want to allocate InvalidationInfo in
DDL-free transactions.  To pass in InvalidationInfo*, I suppose I'd move those
three conditions to a function and make the callers look like:

CacheInvalidateHeapTuple(Relation relation,
                         HeapTuple tuple,
                         HeapTuple newtuple)
{
    if (NeedsInvalidateHeapTuple(relation))
        CacheInvalidateHeapTupleCommon(relation, tuple, newtuple,
                                       PrepareInvalidationState());
}

I don't have a strong preference between that and the callback way.
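The gate-before-allocate pattern sketched above can be shown in a self-contained form. Only the !IsCatalogRelation() condition is named in the mail; the other two early-return conditions here are illustrative stand-ins, as are the struct fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: hoist the early-return checks into a predicate so
 * the common (non-DDL) heap_update() path never allocates
 * InvalidationInfo. */

typedef struct Relation
{
    bool        is_catalog;     /* IsCatalogRelation() */
    bool        skip_cond_a;    /* stand-in for the second early-return condition */
    bool        skip_cond_b;    /* stand-in for the third */
} Relation;

static int  allocations = 0;    /* counts PrepareInvalidationState() calls */

static bool
NeedsInvalidateHeapTuple(const Relation *rel)
{
    return rel->is_catalog && !rel->skip_cond_a && !rel->skip_cond_b;
}

static void *
PrepareInvalidationState(void)
{
    allocations++;              /* the allocation we want to skip for non-DDL */
    return &allocations;
}

static void
CacheInvalidateHeapTuple(const Relation *rel)
{
    if (NeedsInvalidateHeapTuple(rel))
        PrepareInvalidationState();     /* then CacheInvalidateHeapTupleCommon(...) */
}
```

With this shape, a DDL-free transaction touching only user tables performs zero allocations, which is the property the mail wants to preserve.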

> Also is inval-requires-xid-v0.patch planned to be fixed up to inplace160?

I figure I'll pursue that on a different thread, after inplace160 and
inplace180.  If there's cause to pursue it earlier, let me know.

Thanks,
nm



Re: Inval reliability, especially for inplace updates

From
Nitin Motiani
Date:
On Sun, Oct 13, 2024 at 6:15 AM Noah Misch <noah@leadboat.com> wrote:
>
> On Sat, Oct 12, 2024 at 06:05:06PM +0530, Nitin Motiani wrote:
> > 1. In heap_inplace_update_and_unlock, currently both buffer and tuple
> > are unlocked outside the critical section. Why do we have to move the
> > buffer unlock within the critical section here? My guess is that it
> > needs to be unlocked for the inplace invals to be processed. But what
> > is the reasoning behind that?
>
> AtInplace_Inval() acquires SInvalWriteLock.  There are two reasons to want to
> release the buffer lock before acquiring SInvalWriteLock:
>
> 1. Otherwise, we'd need to maintain the invariant that no other part of the
>    system tries to lock the buffer while holding SInvalWriteLock.  (That would
>    cause an undetected deadlock.)
>
> 2. Concurrency is better if we release a no-longer-needed LWLock before doing
>    something time-consuming, like acquiring another LWLock potentially is.
>
> Inplace invals do need to happen in the critical section, because we've
> already written the change to shared buffers, making it the new authoritative
> value.  If we fail to invalidate, other backends may continue operating with
> stale caches.
>

Thanks for the clarification.

> > 2. Is there any benefit in CacheInvalidateHeapTupleCommon taking the
> > prepare_callback argument? Wouldn't it be simpler to just pass an
> > InvalidationInfo* to the function?
>
> CacheInvalidateHeapTupleCommon() has three conditions that cause it to return
> without invoking the callback.  Every heap_update() calls
> CacheInvalidateHeapTuple().  In typical performance-critical systems, non-DDL
> changes dwarf DDL.  Hence, the overwhelming majority of heap_update() calls
> involve !IsCatalogRelation().  I wouldn't want to allocate InvalidationInfo in
> DDL-free transactions.  To pass in InvalidationInfo*, I suppose I'd move those
> three conditions to a function and make the callers look like:
>
> CacheInvalidateHeapTuple(Relation relation,
>                                                  HeapTuple tuple,
>                                                  HeapTuple newtuple)
> {
>         if (NeedsInvalidateHeapTuple(relation))
>                 CacheInvalidateHeapTupleCommon(relation, tuple, newtuple,
>                                                                            PrepareInvalidationState());
> }
>
> I don't have a strong preference between that and the callback way.
>

Thanks. I would have probably done it using the
NeedsInvalidateHeapTuple. But I don't have a strong enough preference
to change it from the callback way. So the current approach seems
good.

> > Also is inval-requires-xid-v0.patch planned to be fixed up to inplace160?
>
> I figure I'll pursue that on a different thread, after inplace160 and
> inplace180.  If there's cause to pursue it earlier, let me know.
>
Sure. Can be done in a different thread.

Thanks,
Nitin Motiani
Google



Re: Inval reliability, especially for inplace updates

From
Nitin Motiani
Date:
On Mon, Oct 14, 2024 at 3:15 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
> On Sun, Oct 13, 2024 at 6:15 AM Noah Misch <noah@leadboat.com> wrote:
> >
> > On Sat, Oct 12, 2024 at 06:05:06PM +0530, Nitin Motiani wrote:
> > > 1. In heap_inplace_update_and_unlock, currently both buffer and tuple
> > > are unlocked outside the critical section. Why do we have to move the
> > > buffer unlock within the critical section here? My guess is that it
> > > needs to be unlocked for the inplace invals to be processed. But what
> > > is the reasoning behind that?
> >
> > AtInplace_Inval() acquires SInvalWriteLock.  There are two reasons to want to
> > release the buffer lock before acquiring SInvalWriteLock:
> >
> > 1. Otherwise, we'd need to maintain the invariant that no other part of the
> >    system tries to lock the buffer while holding SInvalWriteLock.  (That would
> >    cause an undetected deadlock.)
> >
> > 2. Concurrency is better if we release a no-longer-needed LWLock before doing
> >    something time-consuming, like acquiring another LWLock potentially is.
> >
> > Inplace invals do need to happen in the critical section, because we've
> > already written the change to shared buffers, making it the new authoritative
> > value.  If we fail to invalidate, other backends may continue operating with
> > stale caches.
> >
>
> Thanks for the clarification.
>
> > > 2. Is there any benefit in CacheInvalidateHeapTupleCommon taking the
> > > prepare_callback argument? Wouldn't it be simpler to just pass an
> > > InvalidationInfo* to the function?
> >
> > CacheInvalidateHeapTupleCommon() has three conditions that cause it to return
> > without invoking the callback.  Every heap_update() calls
> > CacheInvalidateHeapTuple().  In typical performance-critical systems, non-DDL
> > changes dwarf DDL.  Hence, the overwhelming majority of heap_update() calls
> > involve !IsCatalogRelation().  I wouldn't want to allocate InvalidationInfo in
> > DDL-free transactions.  To pass in InvalidationInfo*, I suppose I'd move those
> > three conditions to a function and make the callers look like:
> >
> > CacheInvalidateHeapTuple(Relation relation,
> >                                                  HeapTuple tuple,
> >                                                  HeapTuple newtuple)
> > {
> >         if (NeedsInvalidateHeapTuple(relation))
> >                 CacheInvalidateHeapTupleCommon(relation, tuple, newtuple,
> >                                                                            PrepareInvalidationState());
> > }
> >
> > I don't have a strong preference between that and the callback way.
> >
>
> Thanks. I would have probably done it using the
> NeedsInvalidateHeapTuple. But I don't have a strong enough preference
> to change it from the callback way. So the current approach seems
> good.
>
> > > Also is inval-requires-xid-v0.patch planned to be fixed up to inplace160?
> >
> > I figure I'll pursue that on a different thread, after inplace160 and
> > inplace180.  If there's cause to pursue it earlier, let me know.
> >
> Sure. Can be done in a different thread.
>

I tested the patch locally and it works, and I have no other questions
regarding the structure. So this patch looks good to me to commit.

Thanks,
Nitin Motiani
Google



Re: Inval reliability, especially for inplace updates

From
Nitin Motiani
Date:
On Thu, Oct 24, 2024 at 8:24 AM Noah Misch <noah@leadboat.com> wrote:
>
> With the releases wrapping in 2.5 weeks, I'm ambivalent about pushing this
> before the release or after.  Pushing before means fewer occurrences of
> corruption, but pushing after gives more bake time to discover these changes
> were defective.  It's hard to predict which helps users more, on a
> risk-adjusted basis.  I'm leaning toward pushing this week.  Opinions?
>

I lean towards pushing after the release. This is based on my
assumption that since this bug has been around for a while, it is
(probably) not hit often. And a few weeks delay is better than
introducing a new defect.

Thanks



Re: Inval reliability, especially for inplace updates

From
Noah Misch
Date:
On Mon, Oct 28, 2024 at 02:27:03PM +0530, Nitin Motiani wrote:
> On Thu, Oct 24, 2024 at 8:24 AM Noah Misch <noah@leadboat.com> wrote:
> > With the releases wrapping in 2.5 weeks, I'm ambivalent about pushing this
> > before the release or after.  Pushing before means fewer occurrences of
> > corruption, but pushing after gives more bake time to discover these changes
> > were defective.  It's hard to predict which helps users more, on a
> > risk-adjusted basis.  I'm leaning toward pushing this week.  Opinions?
> 
> I lean towards pushing after the release. This is based on my
> assumption that since this bug has been around for a while, it is
> (probably) not hit often. And a few weeks delay is better than
> introducing a new defect.

I had pushed this during the indicated week, before your mail.  Reverting it
is an option.  Let's see if more opinions arrive.



Re: Inval reliability, especially for inplace updates

From
Noah Misch
Date:
On Thu, Oct 31, 2024 at 01:01:39PM -0700, Noah Misch wrote:
> On Thu, Oct 31, 2024 at 05:00:02PM +0300, Alexander Lakhin wrote:
> > I've accidentally discovered an incorrect behaviour caused by commit
> > 4eac5a1fa. Running this script:
> 
> Thanks.  This looks important.
> 
> > parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40`
> 
> This didn't reproduce it for me, at -j20, -j40, or -j80.  I tested at commit
> fb7e27a.  At what commit(s) does it reproduce for you?  At what commits, if
> any, did your test not reproduce this?

I reproduced this using a tmpfs current working directory.

> > All three autovacuum workers (1143263, 1143320, 1143403) are also waiting
> > for the same buffer lock:
> > #5  0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327
> > #6  0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
> > #7  0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182
> 
> Can you share the full backtrace for the autovacuum workers?

Here, one of the autovacuum workers had the guilty stack trace, appearing at
the end of this message.  heap_inplace_update_and_unlock() calls
CacheInvalidateHeapTupleInplace() while holding BUFFER_LOCK_EXCLUSIVE on a
buffer of pg_class.  CacheInvalidateHeapTupleInplace() may call
CatalogCacheInitializeCache(), which opens the cache's rel.  If there's not a
valid relcache entry for the catcache's rel, we scan pg_class to make a valid
relcache entry.  The ensuing hang makes sense.

Tomorrow, I'll think more about fixes.  Two that might work:

1. Call CacheInvalidateHeapTupleInplace() before locking the buffer.  Each
   time we need to re-find the tuple, discard the previous try's inplace
   invals and redo CacheInvalidateHeapTupleInplace().  That's because
   concurrent activity may have changed cache key fields like relname.

2. Add some function that we call before locking the buffer.  Its charter is
   to ensure PrepareToInvalidateCacheTuple() won't have to call
   CatalogCacheInitializeCache().  I think nothing resets catcache to the
   extent that CatalogCacheInitializeCache() must happen again, so this should
   suffice regardless of concurrent sinval traffic, debug_discard_caches, etc.

What else is worth considering?  Any preferences among those?
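Option (1)'s retry shape can be sketched as a loop. All names here are hypothetical stand-ins (the integer "version" models the tuple changing under us), not the committed PostgreSQL code; the point is only that invals are built before the buffer lock and discarded on each retry, since cache key fields like relname may have changed.

```c
#include <assert.h>

/* Hypothetical sketch of fix option (1): prepare inplace invals before
 * locking the buffer; on each re-find of the tuple, discard the previous
 * attempt's invals and rebuild them against the current tuple. */

static int built_for_version = -1;      /* which tuple version the invals describe */

static void
build_inplace_invals(int tuple_version)
{
    built_for_version = tuple_version;  /* CacheInvalidateHeapTupleInplace() */
}

static void
forget_inplace_invals(void)
{
    built_for_version = -1;             /* discard the stale invals */
}

/* versions[] models successive tuple versions seen on each re-find; the
 * last entry is the one we finally lock.  Returns the version the invals
 * were built against, which must match the locked tuple. */
static int
inplace_lock_with_invals(const int *versions, int nversions)
{
    int         i = 0;

    for (;;)
    {
        build_inplace_invals(versions[i]);      /* before LockBuffer() */
        /* ... LockBuffer(); re-find the tuple ... */
        if (i == nversions - 1)
            return built_for_version;           /* tuple unchanged: proceed */
        /* concurrent change: unlock, discard stale invals, retry */
        forget_inplace_invals();
        i++;
    }
}
```

The invariant worth asserting is that the invals in hand always describe the tuple version we actually locked, never an earlier one.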



#6  0x0000555e8a3c3afc in LWLockAcquire (lock=0x7f1683a65424, mode=LW_SHARED) at lwlock.c:1287
#7  0x0000555e8a08cec8 in heap_prepare_pagescan (sscan=sscan@entry=0x7f168e99ba48) at heapam.c:512
#8  0x0000555e8a08d6f1 in heapgettup_pagemode (scan=scan@entry=0x7f168e99ba48, dir=<optimized out>, nkeys=<optimized out>, key=<optimized out>) at heapam.c:979
#9  0x0000555e8a08dc9e in heap_getnextslot (sscan=0x7f168e99ba48, direction=<optimized out>, slot=0x7f168e9a58c0) at heapam.c:1299
#10 0x0000555e8a0aad3d in table_scan_getnextslot (direction=ForwardScanDirection, slot=<optimized out>, sscan=<optimized out>) at ../../../../src/include/access/tableam.h:1080
#11 systable_getnext (sysscan=sysscan@entry=0x7f168e9a79c8) at genam.c:538
#12 0x0000555e8a501391 in ScanPgRelation (targetRelId=<optimized out>, indexOK=false, force_non_historic=force_non_historic@entry=false) at relcache.c:388
#13 0x0000555e8a507ea4 in RelationReloadIndexInfo (relation=0x7f168f1b8808) at relcache.c:2272
#14 RelationRebuildRelation (relation=0x7f168f1b8808) at relcache.c:2573
#15 0x0000555e8a5083e5 in RelationFlushRelation (relation=0x7f168f1b8808) at relcache.c:2848
#16 RelationCacheInvalidateEntry (relationId=<optimized out>) at relcache.c:2910
#17 0x0000555e8a4fafaf in LocalExecuteInvalidationMessage (msg=0x7ffec99ff8a0) at inval.c:795
#18 0x0000555e8a3b711a in ReceiveSharedInvalidMessages (invalFunction=invalFunction@entry=0x555e8a4faf10 <LocalExecuteInvalidationMessage>, resetFunction=resetFunction@entry=0x555e8a4fa650 <InvalidateSystemCaches>) at sinval.c:88
#19 0x0000555e8a4fa677 in AcceptInvalidationMessages () at inval.c:865
#20 0x0000555e8a3bc809 in LockRelationOid (relid=1259, lockmode=1) at lmgr.c:135
#21 0x0000555e8a05155d in relation_open (relationId=relationId@entry=1259, lockmode=lockmode@entry=1) at relation.c:55
#22 0x0000555e8a0dd809 in table_open (relationId=relationId@entry=1259, lockmode=lockmode@entry=1) at table.c:44
#23 0x0000555e8a501359 in ScanPgRelation (targetRelId=<optimized out>, indexOK=indexOK@entry=true, force_non_historic=force_non_historic@entry=false) at relcache.c:371
#24 0x0000555e8a507f9a in RelationReloadNailed (relation=0x7f168f1738c8) at relcache.c:2380
#25 RelationRebuildRelation (relation=0x7f168f1738c8) at relcache.c:2579
#26 0x0000555e8a5083e5 in RelationFlushRelation (relation=0x7f168f1738c8) at relcache.c:2848
#27 RelationCacheInvalidateEntry (relationId=<optimized out>) at relcache.c:2910
#28 0x0000555e8a4fafaf in LocalExecuteInvalidationMessage (msg=0x7ffec99ffc80) at inval.c:795
#29 0x0000555e8a3b71aa in ReceiveSharedInvalidMessages (invalFunction=invalFunction@entry=0x555e8a4faf10 <LocalExecuteInvalidationMessage>, resetFunction=resetFunction@entry=0x555e8a4fa650 <InvalidateSystemCaches>) at sinval.c:118
#30 0x0000555e8a4fa677 in AcceptInvalidationMessages () at inval.c:865
#31 0x0000555e8a3bc809 in LockRelationOid (relid=1259, lockmode=1) at lmgr.c:135
#32 0x0000555e8a05155d in relation_open (relationId=1259, lockmode=lockmode@entry=1) at relation.c:55
#33 0x0000555e8a0dd809 in table_open (relationId=<optimized out>, lockmode=lockmode@entry=1) at table.c:44
#34 0x0000555e8a4f7211 in CatalogCacheInitializeCache (cache=cache@entry=0x555eb26d2180) at catcache.c:1045
#35 0x0000555e8a4f9a68 in PrepareToInvalidateCacheTuple (relation=relation@entry=0x7f168f1738c8, tuple=tuple@entry=0x7f168e9a74e8, newtuple=newtuple@entry=0x0, function=function@entry=0x555e8a4fa220 <RegisterCatcacheInvalidation>, context=context@entry=0x7f168e9a7428) at catcache.c:2326
#36 0x0000555e8a4fa500 in CacheInvalidateHeapTupleCommon (relation=relation@entry=0x7f168f1738c8, tuple=tuple@entry=0x7f168e9a74e8, newtuple=newtuple@entry=0x0, prepare_callback=prepare_callback@entry=0x555e8a4fa320 <PrepareInplaceInvalidationState>) at inval.c:1391
#37 0x0000555e8a4fabcb in CacheInvalidateHeapTupleCommon (prepare_callback=0x555e8a4fa320 <PrepareInplaceInvalidationState>, newtuple=newtuple@entry=0x0, tuple=tuple@entry=0x7f168e9a74e8, relation=relation@entry=0x7f168f1738c8) at inval.c:1504
#38 0x0000555e8a094f7c in heap_inplace_update_and_unlock (relation=0x7f168f1738c8, oldtup=0x555eb277ea58, tuple=0x7f168e9a74e8, buffer=3) at heapam.c:6354
#39 0x0000555e8a0ab35a in systable_inplace_update_finish (state=0x7f168e9a72c8, tuple=<optimized out>) at genam.c:891
#40 0x0000555e8a1fbdc4 in vac_update_relstats (relation=relation@entry=0x7f168f1beae8, num_pages=num_pages@entry=19, num_tuples=<optimized out>, num_all_visible_pages=<optimized out>, hasindex=hasindex@entry=true, frozenxid=frozenxid@entry=0, minmulti=minmulti@entry=0, frozenxid_updated=0x0, minmulti_updated=0x0, in_outer_xact=false) at vacuum.c:1545
#41 0x0000555e8a18c97c in do_analyze_rel (onerel=onerel@entry=0x7f168f1beae8, params=params@entry=0x555eb272e2e4, va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>, relpages=19, inh=inh@entry=false, in_outer_xact=false, elevel=13) at analyze.c:643
#42 0x0000555e8a18e122 in analyze_rel (relid=<optimized out>, relation=<optimized out>, params=params@entry=0x555eb272e2e4, va_cols=0x0, in_outer_xact=<optimized out>, bstrategy=bstrategy@entry=0x555eb2741238) at analyze.c:249
#43 0x0000555e8a1fc883 in vacuum (relations=0x555eb274b370, relations@entry=0x555eb2749390, params=params@entry=0x555eb272e2e4, bstrategy=bstrategy@entry=0x555eb2741238, vac_context=vac_context@entry=0x555eb274b220, isTopLevel=isTopLevel@entry=true) at vacuum.c:635
#44 0x0000555e8a330e69 in autovacuum_do_vac_analyze (bstrategy=<optimized out>, tab=<optimized out>) at autovacuum.c:3108
#45 do_autovacuum () at autovacuum.c:2425
#46 0x0000555e8a33132f in AutoVacWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at autovacuum.c:1571
...
(gdb) p num_held_lwlocks
$2 = 1



Re: Inval reliability, especially for inplace updates

From
Noah Misch
Date:
On Sun, Nov 03, 2024 at 10:29:25AM -0800, Noah Misch wrote:
> Pushed as 0bada39.
> 
> Buildfarm member hornet REL_15_STABLE was in the same hang.  Other buildfarm
> runs 2024-10-25T13:51:02Z - 2024-11-02T16:04:56Z may hang the same way.  It's
> early to make a comprehensive list of hung buildfarm members, since many
> reported infrequently even before this period.  I'll wait a week or two and
> then contact the likely-hung member owners.  I regret the damage.

Buildfarm members plover and (less likely) sevengill might be hung in this
way.  Owners (bcc'd): if convenient, please check whether a buildfarm run that
started before 2024-11-02T16:04:56Z is still ongoing.  If it is, please use
"kill -QUIT" on one of the long-running autovacuum workers.

If it's the same hang, one autovacuum worker will have stack frames like:

#37 0x0000555e8a4fabcb in CacheInvalidateHeapTupleCommon (prepare_callback=0x555e8a4fa320 <PrepareInplaceInvalidationState>, newtuple=newtuple@entry=0x0, tuple=tuple@entry=0x7f168e9a74e8, relation=relation@entry=0x7f168f1738c8) at inval.c:1504
#38 0x0000555e8a094f7c in heap_inplace_update_and_unlock (relation=0x7f168f1738c8, oldtup=0x555eb277ea58, tuple=0x7f168e9a74e8, buffer=3) at heapam.c:6354



Re: Inval reliability, especially for inplace updates

From
Noah Misch
Date:
On Sun, Nov 03, 2024 at 10:29:25AM -0800, Noah Misch wrote:
> On Fri, Nov 01, 2024 at 04:38:29PM -0700, Noah Misch wrote:
> > This was a near miss to having a worst-in-years regression in a minor release,
> > so I'm proposing this sequence:
> > 
> > - Revert from non-master branches commits 8e7e672 (inplace180, "WAL-log
> >   inplace update before revealing it to other sessions.") and 243e9b4
> >   (inplace160, "For inplace update, send nontransactional invalidations.").
> > 
> > - Back-patch inplace230-index_update_stats-io-before-buflock to harden commit
> >   a07e03f (inplace110, "Fix data loss at inplace update after heap_update()").
> > 
> > - Push attached inplace240 to master.
> > 
> > - Make the commitfest entry a request for review of v17 inplace160+inplace240.
> >   After some amount of additional review and master bake time, the reverted
> >   patches would return to non-master branches.

> Pushed as 0bada39.

> I'm attaching the v17 patch that now becomes the commitfest submission

That still applies cleanly.  I'm adding a back-patch of commit f4ece89, which
adds assertions intended to build confidence about the main patch.  The
commitfest entry requests a review of the first patch only, but the second
patch may answer questions that a review of the first would otherwise raise.

Attachments

Re: Inval reliability, especially for inplace updates

From
Paul A Jungwirth
Date:
Ian Ilyasov and I reviewed this patch. We think it is ready to commit
to back branches.

The attached patch applies to REL_17_STABLE but not other stable
branches, so we assume Noah will adjust it as needed.

We were able to reproduce Alexander Lakhin's hang when we tested
against 0a0a0f2c59 (i.e. before the previous version was reverted),
although adding the delay was necessary. With this patch applied, we
don't see the hang (even with the delay).

We agree that the new assertions are a good idea to prevent similar
errors in the future.

We couldn't devise other ways to break this patch. Surely more
experienced hackers could be more creative, but nonetheless it's
reassuring that the patch's twin has been in v18devel since November.

We assume you will also un-revert 8e7e672cda ("WAL-log inplace update
before revealing it to other sessions.")? We didn't look closely at
that patch, but it seems like there are no known problems with it. It
was just reverted because it depends on this patch.

Since this is a backpatch, it doesn't seem right to give non-essential
feedback. Here are a few thoughts anyway. Consider them notes for
future work rather than reasons to delay backpatching or drift from
the patch on master.

Is there any way to add more testing around non-transactional
invalidations? It is a new "feature" but it is not really tested
anywhere. I don't think we could do this with regress tests, but
perhaps isolation tests would be suitable.

Some of the comments felt a bit compressed. They make sense in the
context of this fix, but reading them cold seems like it will be
challenging. For example this took a lot of thinking to follow:

     * Construct shared cache inval if necessary.  Because we pass a tuple
     * version without our own inplace changes or inplace changes other
     * sessions complete while we wait for locks, inplace update mustn't
     * change catcache lookup keys.  But we aren't bothering with index
     * updates either, so that's true a fortiori.

Or this:

     * WAL contains likely-unnecessary commit-time invals from the
     * CacheInvalidateHeapTuple() call in heap_inplace_update().

Why likely-unnecessary? I know you explain it at that callsite, but
some hint might help here.

It's a bit surprising that wrongly leaving relhasindex=t is safe (for
example after BEGIN; CREATE INDEX; ROLLBACK;). I guess this column is
just to save us a lookup for tables with no index, and no harm is done
if we do the lookup needlessly but find no indexes. And vacuum can
repair it later. Still it's a little unnerving.

On Thu, Oct 31, 2024 at 09:20:52PM -0700, Noah Misch wrote:
> Here, one of the autovacuum workers had the guilty stack trace, appearing at
> the end of this message.  heap_inplace_update_and_unlock() calls
> CacheInvalidateHeapTupleInplace() while holding BUFFER_LOCK_EXCLUSIVE on a
> buffer of pg_class.  CacheInvalidateHeapTupleInplace() may call
> CatalogCacheInitializeCache(), which opens the cache's rel.  If there's not a
> valid relcache entry for the catcache's rel, we scan pg_class to make a valid
> relcache entry.  The ensuing hang makes sense.

Personally I never expected that catcache could depend on relcache,
since it seems lower-level. But it makes sense that you need a
relcache of pg_class at least, so their relationship is more
complicated than just layers.

I'm struggling to understand how your explanation incorporates
*concurrency*, since a self-deadlock only involves locks from one
backend. But the point is that a concurrent invalidation causes the
relcache entry to vanish, so that we need to rebuild it. (We can't get
this far without having built the relcache for pg_class once already.)

Specifically, we drop the table while autovacuum is updating its
statistics. But how is that possible? Don't both those things
exclusive-lock the row in pg_class? I must be misunderstanding.

> Tomorrow, I'll think more about fixes.  Two that might work:
>
> 1. Call CacheInvalidateHeapTupleInplace() before locking the buffer.  Each
>    time we need to re-find the tuple, discard the previous try's inplace
>    invals and redo CacheInvalidateHeapTupleInplace().  That's because
>    concurrent activity may have changed cache key fields like relname.

We agree that choice 1 is a good approach.

 PrepareToInvalidateCacheTuple(Relation relation,
                               HeapTuple tuple,
                               HeapTuple newtuple,
-                              void (*function) (int, uint32, Oid))
+                              void (*function) (int, uint32, Oid, void *),
+                              void *context)

It's a little odd that PrepareToInvalidateCacheTuple takes a callback
function when it only has one caller, so it always calls
RegisterCatcacheInvalidation. Is it just to avoid adding dependencies
to inval.c? But it already #includes catcache.h and contains lots of
knowledge about catcache specifics. Maybe originally
PrepareToInvalidateCacheTuple was built to take
RegisterRelcacheInvalidation as well? Is it worth still passing the
callback?

@@ -6511,6 +6544,7 @@ heap_inplace_unlock(Relation relation,
 {
     LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
     UnlockTuple(relation, &oldtup->t_self, InplaceUpdateTupleLock);
+    ForgetInplace_Inval();
 }

Is this the right place to add this? We think on balance yes, but the
question crossed my mind: Clearing the invals seems like a separate
responsibility from unlocking the buffer & tuple. After this patch,
our only remaining caller of heap_inplace_unlock is
systable_inplace_update_cancel, so perhaps it should call
ForgetInplace_Inval itself? OTOH we like that putting it here
guarantees it gets called, as a complement to building the invals in
heap_inplace_lock.

Yours,

-- 
Paul              ~{:-)
pj@illuminatedcomputing.com



Re: Inval reliability, especially for inplace updates

From
Paul A Jungwirth
Date:
On Thu, Jul 31, 2025 at 9:53 AM Paul A Jungwirth
<pj@illuminatedcomputing.com> wrote:
> On Thu, Oct 31, 2024 at 09:20:52PM -0700, Noah Misch wrote:
> > Here, one of the autovacuum workers had the guilty stack trace, appearing at
> > the end of this message.  heap_inplace_update_and_unlock() calls
> > CacheInvalidateHeapTupleInplace() while holding BUFFER_LOCK_EXCLUSIVE on a
> > buffer of pg_class.  CacheInvalidateHeapTupleInplace() may call
> > CatalogCacheInitializeCache(), which opens the cache's rel.  If there's not a
> > valid relcache entry for the catcache's rel, we scan pg_class to make a valid
> > relcache entry.  The ensuing hang makes sense.
>
> Personally I never expected that catcache could depend on relcache,
> since it seems lower-level. But it makes sense that you need a
> relcache of pg_class at least, so their relationship is more
> complicated than just layers.
>
> I'm struggling to understand how your explanation incorporates
> *concurrency*, since a self-deadlock only involves locks from one
> backend. But the point is that a concurrent invalidation causes the
> relcache entry to vanish, so that we need to rebuild it. (We can't get
> this far without having built the relcache for pg_class once already.)
>
> Specifically, we drop the table while autovacuum is updating its
> statistics. But how is that possible? Don't both those things
> exclusive-lock the row in pg_class? I must be misunderstanding.

I was curious so I decided to look into this some more. We have a
self-deadlock, but it's rare. Concurrency is necessary to trigger it.
The patch fixes the double-locking, but what was the sequence that
caused the problem?

The only thing reproi.sh does is add & drop tables over and over.

I looked at a few reproductions, and the situation was always the
same: one of the autovacuum workers was stuck analyzing pg_attribute,
with a stack trace like this (no different than Noah's):

#0  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7fa76af77238, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>, cancel=cancel@entry=true) at ./nptl/futex-internal.c:103
#1  0x00007fa7742d5f7b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fa76af77238, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#2  0x00007fa7742e0cff in do_futex_wait (sem=sem@entry=0x7fa76af77238, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#3  0x00007fa7742e0d90 in __new_sem_wait_slow64 (sem=sem@entry=0x7fa76af77238, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#4  0x00007fa7742e0df9 in __new_sem_wait (sem=sem@entry=0x7fa76af77238) at ./nptl/sem_wait.c:42
#5  0x00005583c641a6a8 in PGSemaphoreLock (sema=0x7fa76af77238) at pg_sema.c:327
#6  0x00005583c652e1da in LWLockAcquire (lock=0x7fa76b4fb4e4, mode=mode@entry=LW_SHARED) at lwlock.c:1318
#7  0x00005583c64df801 in LockBuffer (buffer=36, mode=mode@entry=1) at bufmgr.c:4182
#8  0x00005583c5eda4b3 in heapam_index_fetch_tuple (scan=0x7fa76adc1b30, tid=0x7fa76adc3fb8, snapshot=0x5583e8ede3d8, slot=0x7fa76adc1030, call_again=0x7fa76adc3fbe, all_dead=0x7ffebf643e6f) at heapam_handler.c:146
#9  0x00005583c5efb548 in table_index_fetch_tuple (scan=<optimized out>, tid=tid@entry=0x7fa76adc3fb8, snapshot=<optimized out>, slot=slot@entry=0x7fa76adc1030, call_again=call_again@entry=0x7fa76adc3fbe, all_dead=all_dead@entry=0x7ffebf643e6f) at ../../../../src/include/access/tableam.h:1226
#10 0x00005583c5efd1a7 in index_fetch_heap (scan=scan@entry=0x7fa76adc3f58, slot=slot@entry=0x7fa76adc1030) at indexam.c:622
#11 0x00005583c5efd362 in index_getnext_slot (scan=0x7fa76adc3f58, direction=direction@entry=ForwardScanDirection, slot=0x7fa76adc1030) at indexam.c:682
#12 0x00005583c5efa32a in systable_getnext (sysscan=sysscan@entry=0x7fa76adc3dd0) at genam.c:512
#13 0x00005583c677fd20 in ScanPgRelation (targetRelId=<optimized out>, indexOK=indexOK@entry=true, force_non_historic=force_non_historic@entry=false) at relcache.c:385
#14 0x00005583c6780cf8 in RelationReloadNailed (relation=relation@entry=0x7fa7740fd698) at relcache.c:2381
#15 0x00005583c678b67e in RelationClearRelation (relation=relation@entry=0x7fa7740fd698, rebuild=<optimized out>) at relcache.c:2527
#16 0x00005583c678ce85 in RelationFlushRelation (relation=0x7fa7740fd698) at relcache.c:2839
#17 0x00005583c678cf00 in RelationCacheInvalidateEntry (relationId=<optimized out>) at relcache.c:2900
#18 0x00005583c67714fe in LocalExecuteInvalidationMessage (msg=0x7ffebf644220) at inval.c:666
#19 0x00005583c651585b in ReceiveSharedInvalidMessages (invalFunction=invalFunction@entry=0x5583c677130d <LocalExecuteInvalidationMessage>, resetFunction=resetFunction@entry=0x5583c67700cf <InvalidateSystemCaches>) at sinval.c:121
#20 0x00005583c677012c in AcceptInvalidationMessages () at inval.c:766
#21 0x00005583c651f8d2 in LockRelationOid (relid=<optimized out>, lockmode=1) at lmgr.c:137
#22 0x00005583c5e503db in relation_open (relationId=relationId@entry=1259, lockmode=lockmode@entry=1) at relation.c:56
#23 0x00005583c5f595e5 in table_open (relationId=relationId@entry=1259, lockmode=lockmode@entry=1) at table.c:43
#24 0x00005583c677fc8c in ScanPgRelation (targetRelId=<optimized out>, indexOK=<optimized out>, force_non_historic=force_non_historic@entry=false) at relcache.c:368
#25 0x00005583c67807c1 in RelationReloadIndexInfo (relation=relation@entry=0x7fa774141028) at relcache.c:2257
#26 0x00005583c6780dff in RelationReloadNailed (relation=relation@entry=0x7fa774141028) at relcache.c:2359
#27 0x00005583c678b67e in RelationClearRelation (relation=relation@entry=0x7fa774141028, rebuild=<optimized out>) at relcache.c:2527
#28 0x00005583c678ce85 in RelationFlushRelation (relation=0x7fa774141028) at relcache.c:2839
#29 0x00005583c678cf00 in RelationCacheInvalidateEntry (relationId=<optimized out>) at relcache.c:2900
#30 0x00005583c67714fe in LocalExecuteInvalidationMessage
(msg=0x7ffebf644640) at inval.c:666
#31 0x00005583c651585b in ReceiveSharedInvalidMessages (
    invalFunction=invalFunction@entry=0x5583c677130d
<LocalExecuteInvalidationMessage>,
    resetFunction=resetFunction@entry=0x5583c67700cf
<InvalidateSystemCaches>) at sinval.c:121
#32 0x00005583c677012c in AcceptInvalidationMessages () at inval.c:766
#33 0x00005583c651f8d2 in LockRelationOid (relid=<optimized out>,
lockmode=1) at lmgr.c:137
#34 0x00005583c5e503db in relation_open
(relationId=relationId@entry=1259, lockmode=lockmode@entry=1) at
relation.c:56
#35 0x00005583c5f595e5 in table_open
(relationId=relationId@entry=1259, lockmode=lockmode@entry=1) at
table.c:43
#36 0x00005583c677fc8c in ScanPgRelation (targetRelId=<optimized out>,
indexOK=indexOK@entry=true,
    force_non_historic=force_non_historic@entry=false) at relcache.c:368
#37 0x00005583c6780cf8 in RelationReloadNailed
(relation=relation@entry=0x7fa7740fd698) at relcache.c:2381
#38 0x00005583c678b67e in RelationClearRelation
(relation=relation@entry=0x7fa7740fd698, rebuild=<optimized out>) at
relcache.c:2527
#39 0x00005583c678ce85 in RelationFlushRelation
(relation=0x7fa7740fd698) at relcache.c:2839
#40 0x00005583c678cf00 in RelationCacheInvalidateEntry
(relationId=<optimized out>) at relcache.c:2900
#41 0x00005583c67714fe in LocalExecuteInvalidationMessage
(msg=0x7ffebf644a40) at inval.c:666
#42 0x00005583c651585b in ReceiveSharedInvalidMessages (
    invalFunction=invalFunction@entry=0x5583c677130d
<LocalExecuteInvalidationMessage>,
    resetFunction=resetFunction@entry=0x5583c67700cf
<InvalidateSystemCaches>) at sinval.c:121
#43 0x00005583c677012c in AcceptInvalidationMessages () at inval.c:766
#44 0x00005583c651f8d2 in LockRelationOid (relid=<optimized out>,
lockmode=1) at lmgr.c:137
#45 0x00005583c5e503db in relation_open (relationId=1259,
lockmode=lockmode@entry=1) at relation.c:56
#46 0x00005583c5f595e5 in table_open (relationId=<optimized out>,
lockmode=lockmode@entry=1) at table.c:43
#47 0x00005583c676b11a in CatalogCacheInitializeCache
(cache=cache@entry=0x5583e8e79f80) at catcache.c:929
#48 0x00005583c676e3c2 in PrepareToInvalidateCacheTuple
(relation=relation@entry=0x7fa7740fd698,
tuple=tuple@entry=0x7fa76adc3a88,
    newtuple=newtuple@entry=0x0,
function=function@entry=0x5583c676fac8 <RegisterCatcacheInvalidation>,
    context=context@entry=0x7fa76adc38b8) at catcache.c:2164
#49 0x00005583c676fdc3 in CacheInvalidateHeapTupleCommon
(relation=relation@entry=0x7fa7740fd698,
tuple=tuple@entry=0x7fa76adc3a88,
    newtuple=newtuple@entry=0x0,
prepare_callback=prepare_callback@entry=0x5583c676fe93
<PrepareInplaceInvalidationState>) at inval.c:1350
#50 0x00005583c6770ab9 in CacheInvalidateHeapTupleInplace
(relation=relation@entry=0x7fa7740fd698,
tuple=tuple@entry=0x7fa76adc3a88,
    newtuple=newtuple@entry=0x0) at inval.c:1461
#51 0x00005583c5ec8dc9 in heap_inplace_update_and_unlock
(relation=0x7fa7740fd698, oldtup=<optimized out>,
tuple=0x7fa76adc3a88, buffer=36)
    at heapam.c:6311
#52 0x00005583c5efb33a in systable_inplace_update_finish
(state=0x7fa76adc3748, tuple=<optimized out>) at genam.c:884
#53 0x00005583c61fbc92 in vac_update_relstats
(relation=relation@entry=0x7fa774102b20, num_pages=num_pages@entry=89,
num_tuples=4460,
    num_all_visible_pages=<optimized out>,
hasindex=hasindex@entry=true, frozenxid=frozenxid@entry=0,
minmulti=minmulti@entry=0,
    frozenxid_updated=0x0, minmulti_updated=0x0, in_outer_xact=false)
at vacuum.c:1455
#54 0x00005583c60ec3f2 in do_analyze_rel
(onerel=onerel@entry=0x7fa774102b20,
params=params@entry=0x5583e8e9bda4,
    va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>,
relpages=89, inh=inh@entry=false, in_outer_xact=false, elevel=13)
    at analyze.c:645
#55 0x00005583c60ed155 in analyze_rel (relid=<optimized out>,
relation=0x5583e8ef9fa0, params=params@entry=0x5583e8e9bda4,
va_cols=0x0,
    in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:262
#56 0x00005583c61fd2d4 in vacuum (relations=0x5583e8efc008,
params=params@entry=0x5583e8e9bda4, bstrategy=<optimized out>,
    bstrategy@entry=0x5583e8ef2e00, isTopLevel=isTopLevel@entry=true)
at vacuum.c:493
#57 0x00005583c641e58b in autovacuum_do_vac_analyze
(tab=tab@entry=0x5583e8e9bda0,
bstrategy=bstrategy@entry=0x5583e8ef2e00)
    at autovacuum.c:3180
#58 0x00005583c6420f84 in do_autovacuum () at autovacuum.c:2503
#59 0x00005583c64219ae in AutoVacWorkerMain (argc=argc@entry=0,
argv=argv@entry=0x0) at autovacuum.c:1716
#60 0x00005583c6421bb3 in StartAutoVacWorker () at autovacuum.c:1494
#61 0x00005583c6433c22 in StartAutovacuumWorker () at postmaster.c:5536
#62 0x00005583c64344ca in sigusr1_handler
(postgres_signal_arg=<optimized out>) at postmaster.c:5241
#63 <signal handler called>
#64 0x00007fa77434e904 in __GI___select (nfds=nfds@entry=8,
readfds=readfds@entry=0x7ffebf6461c0, writefds=writefds@entry=0x0,
    exceptfds=exceptfds@entry=0x0,
timeout=timeout@entry=0x7ffebf6461b0) at
../sysdeps/unix/sysv/linux/select.c:69
#65 0x00005583c6435442 in ServerLoop () at postmaster.c:1773
#66 0x00005583c6437c5e in PostmasterMain (argc=argc@entry=1,
argv=argv@entry=0x5583e8e0fe00) at postmaster.c:1481
#67 0x00005583c62d57f5 in main (argc=1, argv=0x5583e8e0fe00) at main.c:202

In frame 55 we are analyzing relation 1249 (pg_attribute):

(gdb) f 55
#55 0x00005583c60ed155 in analyze_rel (relid=<optimized out>,
relation=0x5583e8ef9fa0, params=params@entry=0x5583e8e9bda4,
va_cols=0x0,
    in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:262
262                     do_analyze_rel(onerel, params, va_cols, acquirefunc,
(gdb) p *relation
$1 = {type = T_RangeVar, catalogname = 0x0, schemaname =
0x5583e8ef3308 "pg_catalog", relname = 0x5583e8ef32e0 "pg_attribute",
inh = true,
  relpersistence = 112 'p', alias = 0x0, location = -1}
(gdb) p *onerel
$2 = {rd_node = {spcNode = 1663, dbNode = 5, relNode = 1249}, rd_smgr
= 0x5583e8eb98c0, rd_refcnt = 2, rd_backend = -1,
  rd_islocaltemp = false, rd_isnailed = true, rd_isvalid = true,
rd_indexvalid = true, rd_statvalid = false, rd_createSubid = 0,
  rd_newRelfilenodeSubid = 0, rd_firstRelfilenodeSubid = 0,
rd_droppedSubid = 0, rd_rel = 0x7fa774102d38, rd_att = 0x7fa774102e50,
  rd_id = 1249, rd_lockInfo = {lockRelId = {relId = 1249, dbId = 5}},
rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0, rd_rsdesc = 0x0,
  rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkey = 0x0,
rd_partkeycxt = 0x0, rd_partdesc = 0x0, rd_pdcxt = 0x0,
  rd_partdesc_nodetached = 0x0, rd_pddcxt = 0x0,
rd_partdesc_nodetached_xmin = 0, rd_partcheck = 0x0, rd_partcheckvalid
= false,
  rd_partcheckcxt = 0x0, rd_indexlist = 0x7fa774120298, rd_pkindex =
2659, rd_replidindex = 0, rd_statlist = 0x0, rd_indexattr = 0x0,
  rd_keyattr = 0x0, rd_pkattr = 0x0, rd_idattr = 0x0, rd_pubdesc =
0x0, rd_options = 0x0, rd_amhandler = 3,
  rd_tableam = 0x5583c6b0bf00 <heapam_methods>, rd_index = 0x0,
rd_indextuple = 0x0, rd_indexcxt = 0x0, rd_indam = 0x0, rd_opfamily =
0x0,
  rd_opcintype = 0x0, rd_support = 0x0, rd_supportinfo = 0x0,
rd_indoption = 0x0, rd_indexprs = 0x0, rd_indpred = 0x0, rd_exclops =
0x0,
  rd_exclprocs = 0x0, rd_exclstrats = 0x0, rd_indcollation = 0x0,
rd_opcoptions = 0x0, rd_amcache = 0x0, rd_fdwroutine = 0x0,
  rd_toastoid = 0, pgstat_enabled = true, pgstat_info = 0x5583e8eb0090}

It makes sense that pg_attribute would experience a lot of churn,
since we just add & drop tables.

We call vac_update_relstats, which does an in-place update to pg_class
(frame 53).

Within vac_update_relstats, we call systable_inplace_update_begin, which
LWLocks the buffer we want to update. That's the first part of the
self-deadlock, but it's no problem yet. We proceed in
vac_update_relstats to systable_inplace_update_finish...

That sends an inplace invalidation message for the pg_class tuple
(frame 50). So we need to replace the tuple in the pg_class catcache.
But before we do that, we make sure the catcache is initialized (frame
47):

    /* Just in case cache hasn't finished initialization yet... */
    if (ccp->cc_tupdesc == NULL)
        CatalogCacheInitializeCache(ccp);

That locks pg_class with AccessShareLock, which looks for pending
invalidation messages (frame 43).

It finds one invalidating the relcache for pg_class:

(gdb) f 41
#41 0x00005583c67714fe in LocalExecuteInvalidationMessage
(msg=0x7ffebf644a40) at inval.c:666
666                     RelationCacheInvalidateEntry(msg->rc.relId);
(gdb) p *msg
$9 = {id = -2 '\376', cc = {id = -2 '\376', dbId = 5, hashValue =
1259}, cat = {id = -2 '\376', dbId = 5, catId = 1259}, rc = {
    id = -2 '\376', dbId = 5, relId = 1259}, sm = {id = -2 '\376',
backend_hi = -47 '\321', backend_lo = 59642, rnode = {spcNode = 5,
      dbNode = 1259, relNode = 32679}}, rm = {id = -2 '\376', dbId =
5}, sn = {id = -2 '\376', dbId = 5, relId = 1259}}

(Btw where did this message come from?)

So we have to rebuild the relcache for pg_class. Obviously we need to
look up its entry in (heh) pg_class (frame 37). So we open that table
with an AccessShareLock, which looks for pending invalidation messages
(frame 32). We find one telling us to invalidate 2662
(pg_class_oid_index):

(gdb) f 30
#30 0x00005583c67714fe in LocalExecuteInvalidationMessage
(msg=0x7ffebf644640) at inval.c:666
666                     RelationCacheInvalidateEntry(msg->rc.relId);
(gdb) p *msg
$12 = {id = -2 '\376', cc = {id = -2 '\376', dbId = 5, hashValue =
2662}, cat = {id = -2 '\376', dbId = 5, catId = 2662}, rc = {
    id = -2 '\376', dbId = 5, relId = 2662}, sm = {id = -2 '\376',
backend_hi = -47 '\321', backend_lo = 59642, rnode = {spcNode = 5,
      dbNode = 2662, relNode = 32679}}, rm = {id = -2 '\376', dbId =
5}, sn = {id = -2 '\376', dbId = 5, relId = 2662}}

I guess that is coming from the add & drop table commands.

So we reload the relcache entry for 2662 (frame 26). That opens
pg_class with AccessShareLock, which looks for pending invalidation
messages (frame 20). We find one telling us to invalidate pg_class:

(gdb) f 18
#18 0x00005583c67714fe in LocalExecuteInvalidationMessage
(msg=0x7ffebf644220) at inval.c:666
666                     RelationCacheInvalidateEntry(msg->rc.relId);
(gdb) p *msg
$14 = {id = -2 '\376', cc = {id = -2 '\376', dbId = 5, hashValue =
1259}, cat = {id = -2 '\376', dbId = 5, catId = 1259}, rc = {
    id = -2 '\376', dbId = 5, relId = 1259}, sm = {id = -2 '\376',
backend_hi = 75 'K', backend_lo = 48996, rnode = {spcNode = 5,
      dbNode = 1259, relNode = 32679}}, rm = {id = -2 '\376', dbId =
5}, sn = {id = -2 '\376', dbId = 5, relId = 1259}}

So we rebuild the relcache for pg_class, which means scanning pg_class
(frame 14). We try to lock one of its buffers (frame 7), and we
self-deadlock.

So the problem is triggered by an invalidation message for the
pg_class relcache entry. But what sends that message? (Actually it
seems we got two of them.) Why does anything we're doing invalidate
pg_class? Normally an inplace update to pg_class sends an invalidation
message for the relation described by that tuple, not all of pg_class
itself (see CacheInvalidateHeapTupleCommon).

I couldn't figure that out, until I put a breakpoint at the top of
RegisterRelcacheInvalidation and ran `vacuum analyze`. When we analyze
pg_class (and btw the repro script gives it a lot of churn), we call
vac_update_relstats on it, so we are doing an inplace update of
pg_class on the tuple *about* pg_class. So that's why we send the
invalidation about the whole table.

The concurrency, then, is analyzing pg_attribute in one worker and
pg_class in another. No locks would prevent that. It doesn't have to
be pg_attribute; it could be some user table. But since the repro
script adds & drops tables, pg_attribute is getting churned.

Yours,

--
Paul              ~{:-)
pj@illuminatedcomputing.com



Re: Inval reliability, especially for inplace updates

From
Noah Misch
Date:
On Thu, Jul 31, 2025 at 09:53:20AM -0700, Paul A Jungwirth wrote:
> Ian Ilyasov and I reviewed this patch. We think it is ready to commit
> to back branches.

Thank you!

> We assume you will also un-revert 8e7e672cda ("WAL-log inplace update
> before revealing it to other sessions.")? We didn't look closely at
> that patch, but it seems like there are no known problems with it. It
> was just reverted because it depends on this patch.

Right.  I'm fine self-certifying that one.

> Is there any way to add more testing around non-transactional
> invalidations? It is a new "feature" but it is not really tested
> anywhere. I don't think we could do this with regress tests, but
> perhaps isolation tests would be suitable.

I think src/test/isolation/specs/inplace-inval.spec is testing it.

> Some of the comments felt a bit compressed. They make sense in the
> context of this fix, but reading them cold seems like it will be
> challenging. For example this took a lot of thinking to follow:
> 
>      * Construct shared cache inval if necessary.  Because we pass a tuple
>      * version without our own inplace changes or inplace changes other
>      * sessions complete while we wait for locks, inplace update mustn't
>      * change catcache lookup keys.  But we aren't bothering with index
>      * updates either, so that's true a fortiori.

What do you think of the attached rewrite?  I also removed this part:

-     * If we're mutating a tuple visible only to this transaction, there's an
-     * equivalent transactional inval from the action that created the tuple,
-     * and this inval is superfluous.

That would have needed s/this transaction/this command/ to be correct, and I
didn't feel it was saying something important enough to keep.  There are
plenty of ways for invals to be redundant.

> Or this:
> 
>      * WAL contains likely-unnecessary commit-time invals from the
>      * CacheInvalidateHeapTuple() call in heap_inplace_update().
> 
> Why likely-unnecessary? I know you explain it at that callsite, but
> some hint might help here.

On further study, I think one could construct a logical decoding output plugin
for which it's necessary.  I've detailed that in the attached edits.  This was
in the heap_decode() changes, which is the part of the patch I understand the
least.  I likely should have made it a separate patch.  Here's the surrounding
change I made in 2024, in context diff format:

*** 508,530 **** heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
  
              /*
               * Inplace updates are only ever performed on catalog tuples and
!              * can, per definition, not change tuple visibility.  Since we
!              * don't decode catalog tuples, we're not interested in the
!              * record's contents.
               *
!              * In-place updates can be used either by XID-bearing transactions
!              * (e.g.  in CREATE INDEX CONCURRENTLY) or by XID-less
!              * transactions (e.g.  VACUUM).  In the former case, the commit
!              * record will include cache invalidations, so we mark the
!              * transaction as catalog modifying here. Currently that's
!              * redundant because the commit will do that as well, but once we
!              * support decoding in-progress relations, this will be important.
               */
-             if (!TransactionIdIsValid(xid))
-                 break;
- 
-             (void) SnapBuildProcessChange(builder, xid, buf->origptr);
-             ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
              break;
  
          case XLOG_HEAP_CONFIRM:
--- 508,526 ----
  
              /*
               * Inplace updates are only ever performed on catalog tuples and
!              * can, per definition, not change tuple visibility.  Inplace
!              * updates don't affect storage or interpretation of table rows,
!              * so they don't affect logicalrep_write_tuple() outcomes.  Hence,
!              * we don't process invalidations from the original operation.  If
!              * inplace updates did affect those things, invalidations wouldn't
!              * make it work, since there are no snapshot-specific versions of
!              * inplace-updated values.  Since we also don't decode catalog
!              * tuples, we're not interested in the record's contents.
               *
!              * WAL contains likely-unnecessary commit-time invals from the
!              * CacheInvalidateHeapTuple() call in heap_inplace_update().
!              * Excess invalidation is safe.
               */
              break;
  
          case XLOG_HEAP_CONFIRM:


This code had been unchanged since commit b89e1510 (v9.4) introduced logical
decoding.  I'm making the following assumptions about the old code:

- I guessed "decoding in-progress relations" was a typo for "decoding
  in-progress transactions", something we've supported since at least commit
  4648243 "Add support for streaming to built-in logical replication"
  (2020-09, v14).  If it's not a typo, I don't know what "in-progress
  relations" denoted here.

- It said "redundant because the commit will do that as well", but I didn't
  find such code.  I bet that it referenced the DecodeCommit() lines removed
  in commit c55040c "WAL Log invalidations at command end with
  wal_level=logical" (2020-07, v14).  The same commit likely made the
  ReorderBufferXidSetCatalogChanges() call obsolete.

- I had no idea why we call SnapBuildProcessChange().  Every other caller uses
  its return value.  I guess there was a desire for its snapshot side effects,
  but I didn't follow that.  Nothing snapshot-relevant happens at
  XLOG_HEAP_INPLACE.  Removing this call doesn't break any test on v13 or on
  v9.4.  Similarly, no test fails after removing both this and the
  ReorderBufferXidSetCatalogChanges() call.

- In v9.4, this area of code (per-heapam-record-type decoding) had inval
  responsibilities.  That went away in v14, so there's no need for the comment
  here to keep discussing invals.


I now think it would be prudent to omit the non-comment heap_decode() changes
from the back-patch.  While the above assumptions argue against needing the
removed ReorderBufferXidSetCatalogChanges(), that's the sort of speculation I
should keep out of back-branch changes.

> It's a bit surprising that wrongly leaving relhasindex=t is safe (for
> example after BEGIN; CREATE INDEX; ROLLBACK;). I guess this column is
> just to save us a lookup for tables with no index, and no harm is done
> if we do the lookup needlessly but find no indexes.

> And vacuum can
> repair it later. Still it's a little unnerving.

DROP INDEX doesn't clear it, either.  That's longstanding, and it doesn't
involve narrow race conditions.  Hence, I'm not worrying about it.  If it were
broken, we'd have heard by now.

> On Thu, Oct 31, 2024 at 09:20:52PM -0700, Noah Misch wrote:
> > Here, one of the autovacuum workers had the guilty stack trace, appearing at
> > the end of this message.  heap_inplace_update_and_unlock() calls
> > CacheInvalidateHeapTupleInplace() while holding BUFFER_LOCK_EXCLUSIVE on a
> > buffer of pg_class.  CacheInvalidateHeapTupleInplace() may call
> > CatalogCacheInitializeCache(), which opens the cache's rel.  If there's not a
> > valid relcache entry for the catcache's rel, we scan pg_class to make a valid
> > relcache entry.  The ensuing hang makes sense.
> 
> Personally I never expected that catcache could depend on relcache,
> since it seems lower-level. But it makes sense that you need a
> relcache of pg_class at least, so their relationship is more
> complicated than just layers.

Yes, relcache.c is indeed ... eclectic.

>  PrepareToInvalidateCacheTuple(Relation relation,
>                                HeapTuple tuple,
>                                HeapTuple newtuple,
> -                              void (*function) (int, uint32, Oid))
> +                              void (*function) (int, uint32, Oid, void *),
> +                              void *context)
> 
> It's a little odd that PrepareToInvalidateCacheTuple takes a callback
> function when it only has one caller, so it always calls
> RegisterCatcacheInvalidation. Is it just to avoid adding dependencies
> to inval.c? But it already #includes catcache.h and contains lots of
> knowledge about catcache specifics. Maybe originally
> PrepareToInvalidateCacheTuple was built to take
> RegisterRelcacheInvalidation as well? Is it worth still passing the
> callback?

Looking at the history, it did have two callbacks in Postgres95 1.01.  It was
down to one callback when it got the name PrepareToInvalidateCacheTuple() in
commit 81d08fc of 2001-01.  I think the main alternative to today's callback
would be to make RegisterCatcacheInvalidation() an extern for catcache.c to
call.  All the other Register* functions would remain static.  I left things
unchanged since that would be cleaner in one way and dirtier in another.

> @@ -6511,6 +6544,7 @@ heap_inplace_unlock(Relation relation,
>  {
>      LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>      UnlockTuple(relation, &oldtup->t_self, InplaceUpdateTupleLock);
> +    ForgetInplace_Inval();
>  }
> 
> Is this the right place to add this? We think on balance yes, but the
> question crossed my mind: Clearing the invals seems like a separate
> responsibility from unlocking the buffer & tuple. After this patch,
> our only remaining caller of heap_inplace_unlock is
> systable_inplace_update_cancel, so perhaps it should call
> ForgetInplace_Inval itself? OTOH we like that putting it here
> guarantees it gets called, as a complement to building the invals in
> heap_inplace_lock.

Style principles behind the placement:

- Inplace update puts some responsibilities in genam.c and others in heapam.c.
  A given task, e.g. draining the inval queue, should consistently appear on
  the same side of that boundary.

- The heapam.c side of inplace update should cover roughly as much as the
  heapam.c side of heap_update().  Since heap_update() handles invals and the
  critical section, so should the heapam.c side of inplace update.

It wouldn't take a strong reason to override those principles.

postgr.es/m/20240831011043.2b.nmisch@google.com has an inventory of the ways
to assign inval responsibility to heapam.c vs. genam.c.
postgr.es/m/20241013004511.9c.nmisch@google.com has a related discussion,
about the need for AtInplace_Inval() to be in the critical section.

On Sun, Aug 03, 2025 at 12:32:31PM -0700, Paul A Jungwirth wrote:

[details of the hang]

> The concurrency, then, is analyzing pg_attribute in one worker and
> pg_class in another. No locks would prevent that. It doesn't have to
> be pg_attribute; it could be some user table. But since the repro
> script adds & drops tables, pg_attribute is getting churned.

Thanks for that detailed write-up!  I agree with your findings.

Attachments