Discussion: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
Tom Lane
Date:
I believe I have reproduced the behavior described by Andrew Hammond in
http://archives.postgresql.org/pgsql-general/2011-10/msg00928.php

This is using the regression database:

1. In session 1, do

set default_statistics_target TO 10000;
analyze tenk1;

(We need the large stats target to ensure that tenk1's pg_statistic
entries require toasting.)

2. Attach to session 1 with a debugger and set a breakpoint at
CommitTransaction's call to CallXactCallbacks (or anyplace after
ProcArrayEndTransaction and before AtEOXact_Inval).

3. In session 2, do
select count(*) from tenk1 where fivethous < 2500;

(This loads up session 2's syscaches with toasted pg_statistic entries.)

4. In session 1, again do
analyze tenk1;

and wait for it to stop at the breakpoint.

5. In session 3 (or you can use session 2 for this), do

vacuum verbose pg_statistic;

You should see it removing toast entries that were generated in step 1
and obsoleted in step 4.

6. In session 2, again do
select count(*) from tenk1 where fivethous < 2500;

and voila:
ERROR:  missing chunk number 0 for toast value 53668 in pg_toast_2619

What has happened here is that the second ANALYZE has marked itself
committed in pg_clog and no longer running in the ProcArray, so VACUUM
feels entitled to remove toast tuples that the ANALYZE deleted.  However,
the ANALYZE has not yet sent out the sinval messages that would inform
session 2 that its syscache entries are obsolete.  In Andrew's report,
presumably the machine was under enough load to slow down ANALYZE at
just this point, and there was a concurrent autovacuum that would have
done the rest of the deed.  The problem could only be seen for a short
interval, which squares with his report, and with a similar one from
Tim Uckun back in September.

Ordinarily, sending out sinval messages post-commit is okay because we
don't release locks until after that, and we suppose that our locks
prevent any other transactions from getting to the point of using
syscache entries that might have been invalidated by our transaction.
However, *we have carefully hacked on ANALYZE until it doesn't take any
locks that would block concurrent queries on the analyzed table.*  So
the normal protection against stale syscache entries simply doesn't
work for pg_statistic fetches.
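The window can be made explicit with a toy Python model (all names here are illustrative, not PostgreSQL code): VACUUM's removal test keys off commit visibility in pg_clog and the ProcArray, while the reader's cache validity keys off a sinval message that has not yet been sent.

```python
# Toy model of the race.  VACUUM removes a dead toast tuple once its
# deleter is committed and no longer running -- exactly the state the
# second ANALYZE is in while it sits between ProcArrayEndTransaction
# and AtEOXact_Inval.  Names are illustrative, not PostgreSQL APIs.

class ToastStore:
    def __init__(self):
        self.chunks = {53668: "stats from step 1"}  # toasted pg_statistic value
        self.deleted_by = {}                        # toast value -> deleting xid

    def delete(self, val, xid):
        self.deleted_by[val] = xid                  # dead, but not yet reclaimed

    def vacuum(self, oldest_running_xid, committed):
        for val, xid in list(self.deleted_by.items()):
            if xid in committed and xid < oldest_running_xid:
                del self.chunks[val]                # reclaim the dead chunk

store = ToastStore()
cache = {("tenk1", "stats"): 53668}   # session 2's syscache holds a toast pointer

analyze_xid = 100
store.delete(53668, analyze_xid)      # second ANALYZE outdates the entry
committed = {analyze_xid}             # marked committed in pg_clog and
                                      # gone from the ProcArray ...
store.vacuum(oldest_running_xid=200, committed=committed)

# ... but the sinval message is still unsent, so session 2 re-uses its
# cached pointer and dereferences a chunk that is no longer there:
ptr = cache[("tenk1", "stats")]
print(ptr in store.chunks)            # False -> "missing chunk number 0"
```

Any VACUUM that runs inside that interval leaves session 2 holding a dangling toast pointer, which is the short-lived window that squares with the field reports.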

I'm not sure about a good way to fix this.  When we last dealt with a
similar failure, Heikki suggested that we forcibly detoast all fields in
a tuple that we're putting into the syscaches:
http://archives.postgresql.org/pgsql-hackers/2011-08/msg00661.php
I don't much like that, though, as it seems expensive, and I'm worried
about possible circularity of needing to know about all toastable fields
while making a syscache entry, and anyway it's papering over a symptom
rather than solving the actual problem that we're relying on a stale
syscache entry.

We could fix it by not using a syscache anymore for pg_statistic
entries, but that's probably not acceptable from a performance
standpoint.

A clean fix would be to add locking that blocks would-be users of
pg_statistic entries when an ANALYZE is about to commit.  This isn't
much fun from a performance standpoint either, but at least it should be
relatively cheap most of the time.

Thoughts?
        regards, tom lane


Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
Robert Haas
Date:
On Tue, Oct 25, 2011 at 9:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What has happened here is that the second ANALYZE has marked itself
> committed in pg_clog and no longer running in the ProcArray, so VACUUM
> feels entitled to remove toast tuples that the ANALYZE deleted.  However,
> the ANALYZE has not yet sent out the sinval messages that would inform
> session 2 that its syscache entries are obsolete.  In Andrew's report,
> presumably the machine was under enough load to slow down ANALYZE at
> just this point, and there was a concurrent autovacuum that would have
> done the rest of the deed.  The problem could only be seen for a short
> interval, which squares with his report, and with a similar one from
> Tim Uckun back in September.
>
> Ordinarily, sending out sinval messages post-commit is okay because we
> don't release locks until after that, and we suppose that our locks
> prevent any other transactions from getting to the point of using
> syscache entries that might have been invalidated by our transaction.
> However, *we have carefully hacked on ANALYZE until it doesn't take any
> locks that would block concurrent queries on the analyzed table.*  So
> the normal protection against stale syscache entries simply doesn't
> work for pg_statistic fetches.

This is very similar to one of the issues that reared its ugly head in
regards to Simon's now-reverted patch to lower DDL locking strength.
You identified some other issues there as well, but *one* of the
issues was that, as in this case, the sinval mechanism fails to
provide the necessary synchronization guarantees unless the lock
required to reread the updated data conflicts with the lock required
to change the data.  In that case, "the data" meant "the pg_class
entry" or "the pg_attribute" entry whereas here it means "the
pg_statistic entry", but I believe the principle is the same.  And
there as here, (1) there is a fundamental conflict between what the
sinval mechanism requires for correctness and what is actually
desirable in terms of lock levels from a user experience point of view
and (2) it is relatively easy to write code that looks superficially
safe but which actually contains subtle race conditions.  IIRC, you
never thought Simon's patch looked safe, but I'm guessing that this
pg_statistic bug has been around for a long time.

So I'm wondering if we ought to rethink our position that users of the
sinval machinery must provide their own external synchronization
through heavyweight locking, and instead build the synchronization
into the sinval mechanism itself.  One idea I had was to include the
XID of the sending transaction in every sinval message,
and to force clients receiving a message to do XactLockTableWait() for
each such XID.  That would force the backend reloading its cache to
wait until the committing transaction reaches the lock-release phase.
If we sent out sinval messages just before removing ourselves from the
ProcArray, I think that would more-or-less fix this bug (although
maybe I'm missing some reason why it's not practical to send them that
early) except that I don't see any way to handle the sinval-reset
case, which seems to more or less kill this idea in its tracks.

But maybe there's some other mechanism whereby we could combine
sending the sinval messages slightly earlier (before
ProcArrayEndTransaction) with blocking anyone who processes those
messages until after the committing backend finishes
ProcArrayEndTransaction.  For example, you could add an additional
LWLock, which has to be held in exclusive mode by a committing
transaction that sends any sinval messages.  It must be acquired
before sending the sinval messages and can't be released until after
ProcArrayEndTransaction() is complete.  Anyone processing a sinval
message must acquire and release the lock in shared mode before
reloading their caches, so that we guarantee that at the time you
reread the catalogs, any transactions involved in sending those
messages are visible.

That's actually a bit coarse-grained; there's probably a better
mechanism, but I'm just throwing this out to see if the basic idea has
any legs.
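The coarse-grained version of the proposal can be sketched with Python threads (hypothetical names; a real LWLock has a cheap shared mode, approximated here with a single mutex, so this is only a model of the ordering guarantee, not of the cost):

```python
import threading

# Toy model of the proposed protocol: a committer holds an exclusive
# "sinval commit lock" from just before broadcasting its messages until
# ProcArrayEndTransaction() is done; a backend consuming a message takes
# the same lock (in shared mode, in the real proposal) before rebuilding
# its cache, so the committer's changes are visible by the time it reloads.

commit_lock = threading.Lock()   # stand-in for the proposed LWLock
inbox = []                       # stand-in for the sinval queue
log = []

def committer():
    with commit_lock:                            # "exclusive" mode
        inbox.append("invalidate pg_statistic")  # send sinval early ...
        log.append("commit visible")             # ... then ProcArrayEndTransaction()
    # lock released only after visibility is established

def reader():
    while not inbox:
        pass                     # busy-wait: toy substitute for signaling
    with commit_lock:            # "shared" mode: wait out the committer
        pass
    log.append("cache rebuilt")  # safe: commit is already visible

t1 = threading.Thread(target=committer)
t2 = threading.Thread(target=reader)
t2.start(); t1.start()
t1.join(); t2.join()
print(log)   # ["commit visible", "cache rebuilt"] -- never the reverse
```

Because the message is appended while the lock is held, a reader that sees it can never pass the lock before the commit is visible, which is the whole point of the scheme.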

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Oct 25, 2011 at 9:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ordinarily, sending out sinval messages post-commit is okay because we
>> don't release locks until after that, and we suppose that our locks
>> prevent any other transactions from getting to the point of using
>> syscache entries that might have been invalidated by our transaction.
>> However, *we have carefully hacked on ANALYZE until it doesn't take any
>> locks that would block concurrent queries on the analyzed table.*  So
>> the normal protection against stale syscache entries simply doesn't
>> work for pg_statistic fetches.

> This is very similar to one of the issues that reared its ugly head in
> regards to Simon's now-reverted patch to lower DDL locking strength.
> You identified some other issues there as well, but *one* of the
> issues was that, as in this case, the sinval mechanism fails to
> provide the necessary synchronization guarantees unless the lock
> required to reread the updated data conflicts with the lock required
> to change the data.

Right.  We may take as little as AccessShareLock on a relation before
examining its pg_statistic entries, and ANALYZE isn't taking anything
that would block that.

> So I'm wondering if we ought to rethink our position that users of the
> sinval machinery must provide their own external synchronization
> through heavyweight locking, and instead build the synchronization
> into the sinval mechanism itself.

Yeah, it's starting to feel like we need a basic redesign of sinval
... although I'd not care to back-patch that, so we also need to think
of a sane solution for the back branches.

> If we sent out sinval messages just before removing ourselves from the
> ProcArray, I think that would more-or-less fix this bug (although
> maybe I'm missing some reason why it's not practical to send them that
> early) except that I don't see any way to handle the sinval-reset
> case, which seems to more or less kill this idea in its tracks.

The other reason that doesn't work is there's a race condition: someone
might load their cache entry immediately after the sinval message went
past, but before the updating transaction commits.

> But maybe there's some other mechanism whereby we could combine
> sending the sinval messages slightly earlier (before
> ProcArrayEndTransaction) with blocking anyone who processes those
> messages until after the committing backend finishes
> ProcArrayEndTransaction.  For example, you could add an additional
> LWLock, which has to be held in exclusive mode by a committing
> transaction that sends any sinval messages.

Doesn't sound very scalable :-(.

Even given your recent changes to reduce the overhead of checking for
sinval messages, I'm not sure that it'd be practical to move the sinval
message processing to just-before-we-look-up-a-cache-entry.  Right now,
we do AcceptInvalidationMessages basically once per table per query
(or maybe it's twice or so, but anyway a very small multiplier on that).
If we try to do it every time through SearchSysCache, we are probably
talking two to three orders of magnitude more checks, which ISTM is
certain to push the sinval queue back up to the top of the heap for
contention.

But in any case, this isn't the core of the problem.  The real point
here is that we need a guarantee that a syscache entry we're going to
use is/was valid as of some suitable time point later than the start of
our current transaction.  (Once we have taken a snapshot, VACUUM will
know that it can't remove any tuples that were deleted after the time of
that snapshot; so even for SnapshotNow fetches, it's important to have
an MVCC snapshot to protect toast-table dereferences.)  Perhaps rather
than tying the problem into SearchSysCache, we should attach the
overhead to GetTransactionSnapshot, which is called appealingly few
times per query.  But right offhand it seems like that only protects us
against the toast-tuple-deletion problem, not against the more general
one of getting a stale view of the status of some relation.
        regards, tom lane


Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
Simon Riggs
Date:
On Wed, Oct 26, 2011 at 4:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Even given your recent changes to reduce the overhead of checking for
> sinval messages, I'm not sure that it'd be practical to move the sinval
> message processing to just-before-we-look-up-a-cache-entry.  Right now,
> we do AcceptInvalidationMessages basically once per table per query
> (or maybe it's twice or so, but anyway a very small multiplier on that).
> If we try to do it every time through SearchSysCache, we are probably
> talking two to three orders of magnitude more checks, which ISTM is
> certain to push the sinval queue back up to the top of the heap for
> contention.

Initially I provided an implementation of eager locking, as Robert
suggests, but the above is a great argument against doing it that way.

Incidentally, I'd like to focus on the causal reason for these
problems. We have static type definitions, which allow us to
completely avoid dynamic type management at runtime. That is a
performance advantage for us in normal running, but it can also give
operational problems when we look to make changes.

> But in any case, this isn't the core of the problem.  The real point
> here is that we need a guarantee that a syscache entry we're going to
> use is/was valid as of some suitable time point later than the start of
> our current transaction.  (Once we have taken a snapshot, VACUUM will
> know that it can't remove any tuples that were deleted after the time of
> that snapshot; so even for SnapshotNow fetches, it's important to have
> an MVCC snapshot to protect toast-table dereferences.)  Perhaps rather
> than tying the problem into SearchSysCache, we should attach the
> overhead to GetTransactionSnapshot, which is called appealingly few
> times per query.  But right offhand it seems like that only protects us
> against the toast-tuple-deletion problem, not against the more general
> one of getting a stale view of the status of some relation.

Do we need to guarantee that? It seems strange to me to use the words
cache and strict in the same sentence.

I'm sure there are many points in the code where strictness is
required though also in many cases, it should be acceptable to say
that a cache miss is not a problem, so does not require strict
guarantees.

Do we need a redesign? Or do we need a way to use
relaxation/optimistic techniques?

I'm aware that it could be a huge undertaking to examine all call
points. If you have any ideas of where to investigate or parts of the
problem to assist with, I'll be happy to work on this. This seems
important to me.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
Robert Haas
Date:
On Tue, Oct 25, 2011 at 11:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Even given your recent changes to reduce the overhead of checking for
> sinval messages, I'm not sure that it'd be practical to move the sinval
> message processing to just-before-we-look-up-a-cache-entry.

Wait a minute: I'm confused.  What's at issue here, at least AIUI, is
taking a TOAST pointer and fetching the corresponding value.  But that
must involve reading from the TOAST table, and our usual paradigm for
reading from system catalogs is (1) take AccessShareLock, (2) read the
relevant rows, (3) release AccessShareLock.  If we're doing that here,
then AcceptInvalidationMessages() is already getting called.  If we're
not, this seems horribly unsafe; aside from the current bug, somebody
could rewrite the table in the interim.

> Right now,
> we do AcceptInvalidationMessages basically once per table per query
> (or maybe it's twice or so, but anyway a very small multiplier on that).
> If we try to do it every time through SearchSysCache, we are probably
> talking two to three orders of magnitude more checks, which ISTM is
> certain to push the sinval queue back up to the top of the heap for
> contention.

I think we've pretty much got the *contention* licked, barring an
increase in the number of things that generate sinval messages, but
calling it too often will still be slow, of course.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Wait a minute: I'm confused.  What's at issue here, at least AIUI, is
> taking a TOAST pointer and fetching the corresponding value.  But that
> must involve reading from the TOAST table, and our usual paradigm for
> reading from system catalogs is (1) take AccessShareLock, (2) read the
> relevant rows, (3) release AccessShareLock.  If we're doing that here,
> then AcceptInvalidationMessages() is already getting called.  If we're
> not, this seems horribly unsafe; aside from the current bug, somebody
> could rewrite the table in the interim.

Yes, the TOAST fetch will call AcceptInvalidationMessages().  But
(1) we are starting from a TOAST pointer that is within a now-stale
syscache entry, and even if we get an sinval message telling us it's
stale when we open the toast table, we're still going to try to fetch
that toast OID.  Worse, (2) there is no guarantee that the inval message
has even been sent yet, because there is no lock keeping us from trying
to use the syscache entry before the inval has been sent.

After further reflection I believe that pg_statistic entries invalidated
by ANALYZE are not the only problem.  Consider a pg_proc row whose
prosrc field is toasted (not too unlikely), and suppose that somebody
executes CREATE OR REPLACE FUNCTION to update it to a different toasted
value.  Meanwhile, somebody else is about to use the function, and they
have a copy of its pg_proc row in syscache.  There is no lock that will
block that second backend from fetching the toast tuples before it's
seen the sinval message from the CREATE OR REPLACE FUNCTION.  So the
exact same type of race can occur, if the first backend gets delayed
between committing and broadcasting its sinval messages, and an
overeager vacuum comes along to clean up the dead toast tuples.

I am currently thinking that we will have to go with Heikki's solution
of detoasting any out-of-line fields before we put a tuple into
syscache.  That does fix the problem, so long as we hold a transaction
snapshot at the instant we fetch any catalog tuple that needs this
treatment, because at the time we fetch the tuple we make a SnapshotNow
test that shows the tuple is good (and its dependent toast tuples are
too).  Even if somebody has outdated the tuple and commits immediately
afterward, vacuum cannot remove the toast tuples before we fetch them,
because the outdater is beyond our advertised MyProc->xmin.  The risk
factor comes in when we hold a syscache entry across transactions; then
this guarantee is lost.  (In both the original example and the pg_proc
case above, the second backend is only at risk when it starts its
transaction after the first one's commit, and in between VACUUM was able
to compute an OldestXmin greater than the first one's XID.)
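The xmin argument can be sketched the same way (toy model, illustrative names): while our advertised xmin is at or below the outdater's XID, VACUUM's OldestXmin cannot pass it, so the dead toast tuples survive until we have copied them out.

```python
# Toy model of the OldestXmin guarantee.  VACUUM may reclaim a dead
# tuple only if its deleter precedes every advertised xmin; holding a
# snapshot therefore pins the dead toast tuples in place.

def oldest_xmin(advertised_xmins, next_xid):
    return min(advertised_xmins, default=next_xid)

def vacuum_removes(deleter_xid, advertised_xmins, next_xid):
    return deleter_xid < oldest_xmin(advertised_xmins, next_xid)

outdater_xid = 500   # transaction that replaced the toasted row

# While we hold a snapshot, our advertised xmin (<= the outdater's XID)
# pins OldestXmin, so the dead toast tuples cannot be reclaimed:
print(vacuum_removes(outdater_xid, advertised_xmins=[500], next_xid=600))  # False

# Once the syscache entry is carried into a later transaction, the pin
# is gone and the window described upthread opens:
print(vacuum_removes(outdater_xid, advertised_xmins=[], next_xid=600))     # True
```

This is why detoasting at fetch time closes the hole: the copy is made while the pin still holds, and nothing dereferences the toast pointer afterward.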

I guess an alternative approach would be to discard syscache entries
at transaction end if HeapTupleHasExternal is true, or equivalently
mark them as to which transaction read them and not use them in later
transactions.  But that seems more fragile, because it would have to
be tied into SnapshotResetXmin --- any time we discard our last snapshot
intra-transaction, toasted syscache entries would have to be invalidated
since our advertised xmin is no longer protecting them.

The main concern I had about detoast before caching is the risk of
circularity, ie, needing detoastable cache entries in order to figure
out how to detoast.  But I think it's probably okay.  The current list
of catalogs with toast tables is

 pg_attrdef
 pg_constraint
 pg_database
 pg_db_role_setting
 pg_description
 pg_proc
 pg_rewrite
 pg_seclabel
 pg_shdescription
 pg_statistic
 pg_trigger

Of these, only pg_proc is even conceivably consulted during a toast
table fetch, and we can be sure that functions needed for such a fetch
don't have toasted entries.  But we will have to be very wary of any
future proposal for adding a toast table to pg_class, pg_index, etc.

Detoast-before-caching also seems sufficiently safe and localized to
be a back-patchable fix, whereas I'd be very scared of making any
significant overhaul of the sinval mechanism in the back branches.

Barring objections, I'll see about making this happen.
        regards, tom lane


Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of vie oct 28 15:37:43 -0300 2011:

> The main concern I had about detoast before caching is the risk of
> circularity, ie, needing detoastable cache entries in order to figure
> out how to detoast.  But I think it's probably okay.  The current list
> of catalogs with toast tables is
> 
>  pg_attrdef
>  pg_constraint
>  pg_database
>  pg_db_role_setting
>  pg_description
>  pg_proc
>  pg_rewrite
>  pg_seclabel
>  pg_shdescription
>  pg_statistic
>  pg_trigger
> 
> Of these, only pg_proc is even conceivably consulted during a toast
> table fetch, and we can be sure that functions needed for such a fetch
> don't have toasted entries.  But we will have to be very wary of any
> future proposal for adding a toast table to pg_class, pg_index, etc.

BTW we had previous discussions about dropping pg_database's toast
table.  Maybe this is a good time to do it, even if there's no risk of
this bug (or the hypothetical circularity detoasting problem) showing up
there.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
Alvaro Herrera
Date:
Excerpts from Alvaro Herrera's message of vie oct 28 16:47:13 -0300 2011:

> BTW we had previous discussions about dropping pg_database's toast
> table.  Maybe this is a good time to do it, even if there's no risk of
> this bug (or the hypothetical circularity detoasting problem) showing up
> there.

Oh, something unrelated I just remembered: we have
pg_database.datlastsysoid which seems unused.  Perhaps we should remove
that column for cleanliness.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The risk factor comes in when we hold a syscache entry across
> transactions; then this guarantee is lost.  (In both the original
> example and the pg_proc case above, the second backend is only at
> risk when it starts its transaction after the first one's commit,
> and in between VACUUM was able to compute an OldestXmin greater
> than the first one's XID.)

If we made the commit sequence number more generally available,
incrementing it at the point of visibility change under cover of
ProcArrayLock, and including the then-current value in a Snapshot
object when built, would that help with this at all?  Since it's
something we might need to do to optimize SSI for 9.2, and the
issues sound vaguely echoic of the reasons this is maintained for
SSI, it seems worth asking the question, anyway.

-Kevin


Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> If we made the commit sequence number more generally available,
> incrementing it at the point of visibility change under cover of
> ProcArrayLock, and including the then-current value in a Snapshot
> object when built, would that help with this at all?

No, because we need a back-patchable fix.  Even going forward,
I don't find the idea of flushing syscache entries at transaction
end to be especially appealing.
        regards, tom lane


Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Alvaro Herrera's message of vie oct 28 16:47:13 -0300 2011:
>> BTW we had previous discussions about dropping pg_database's toast
>> table.  Maybe this is a good time to do it, even if there's no risk of
>> this bug (or the hypothetical circularity detoasting problem) showing up
>> there.

No objection from me.

> Oh, something unrelated I just remembered: we have
> pg_database.datlastsysoid which seems unused.  Perhaps we should remove
> that column for cleanliness.

I have a vague recollection that some client-side code uses this to
figure out what's the dividing line between user and system OIDs.
pg_dump used to need to know that number, and it can still be useful
with casts and other things that are hard to tell whether they're built-in.
While in principle people could use FirstNormalObjectId instead, that
could come back to bite us if we ever have to increase that constant
in future releases.  I'm inclined to leave this one alone.
        regards, tom lane


Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
Merlin Moncure
Date:
On Fri, Oct 28, 2011 at 3:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Excerpts from Alvaro Herrera's message of vie oct 28 16:47:13 -0300 2011:
>>> BTW we had previous discussions about dropping pg_database's toast
>>> table.  Maybe this is a good time to do it, even if there's no risk of
>>> this bug (or the hypothetical circularity detoasting problem) showing up
>>> there.
>
> No objection from me.
>
>> Oh, something unrelated I just remembered: we have
>> pg_database.datlastsysoid which seems unused.  Perhaps we should remove
>> that column for cleanliness.
>
> I have a vague recollection that some client-side code uses this to
> figure out what's the dividing line between user and system OIDs.
> pg_dump used to need to know that number, and it can still be useful
> with casts and other things that are hard to tell whether they're built-in.
> While in principle people could use FirstNormalObjectId instead, that
> could come back to bite us if we ever have to increase that constant
> in future releases.  I'm inclined to leave this one alone.

A cursory search of the archives turned up that pgadmin is explicitly
querying it out (or at least did in 2009).  Also, there are some
fairly ancient references of discussion about using it via pg_dump to
deal with the 'not dumping user casts wrapping system objects' issue,
but that was deemed unreliable and is probably not relevant.

merlin


Re: TOAST versus VACUUM, or "missing chunk number 0 for toast value" identified

From
Bruce Momjian
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Excerpts from Alvaro Herrera's message of vie oct 28 16:47:13 -0300 2011:
> >> BTW we had previous discussions about dropping pg_database's toast
> >> table.  Maybe this is a good time to do it, even if there's no risk of
> >> this bug (or the hypothetical circularity detoasting problem) showing up
> >> there.
> 
> No objection from me.
> 
> > Oh, something unrelated I just remembered: we have
> > pg_database.datlastsysoid which seems unused.  Perhaps we should remove
> > that column for cleanliness.
> 
> I have a vague recollection that some client-side code uses this to
> figure out what's the dividing line between user and system OIDs.
> pg_dump used to need to know that number, and it can still be useful
> with casts and other things that are hard to tell whether they're built-in.
> While in principle people could use FirstNormalObjectId instead, that
> could come back to bite us if we ever have to increase that constant
> in future releases.  I'm inclined to leave this one alone.

Maybe add a C comment?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +