Обсуждение: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used

Поиск
Список
Период
Сортировка

BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      18014
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16beta2
Operating system:   Ubuntu 22.04
Description:

Yesterday's test failure on prion:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-07-03%2010%3A13%3A03
made me wonder, what's going on there and whether it's yet another issue
with invalidating relcache (bug #17994).
(
SELECT schema_to_xmlschema('testxmlschema', false, true, '');
ERROR:  relation with OID 29598 does not exist
CONTEXT:  SQL statement "SELECT oid FROM pg_catalog.pg_class WHERE
relnamespace = 29597 AND relkind IN ('r','m','v') AND
pg_catalog.has_table_privilege (oid, 'SELECT') ORDER BY relname;"

Other failures of that kind:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=prion&dt=2023-06-20%2001%3A56%3A04&stg=check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=prion&dt=2023-04-15%2017%3A17%3A09&stg=check
)

I managed to construct a simple reproducer for the error:
for ((n=1;n<=30;n++)); do
echo "ITERATION $n"

numclients=30
for ((c=1;c<=$numclients;c++)); do
cat << EOF | psql >psql_$c.log &
CREATE SCHEMA testxmlschema_$c;

SELECT format('CREATE TABLE testxmlschema_$c.test_%s (a int);', g) FROM
generate_series(1, 30) g
\\gexec

SET parallel_setup_cost = 1;
SET min_parallel_table_scan_size = '1kB';
SELECT schema_to_xmlschema('testxmlschema_$c', true, false, '');

SELECT format('DROP TABLE testxmlschema_$c.test_%s', g) FROM
generate_series(1, 30) g
\\gexec

DROP SCHEMA testxmlschema_$c;
EOF
done
wait
grep 'ERROR:' server.log && break;
done

With a server compiled as follows:
CPPFLAGS="-O0 -DCATCACHE_FORCE_RELEASE" ./configure -q --enable-debug
--enable-cassert --enable-tap-tests --with-libxml && make ...
(More precisely, "#ifndef CATCACHE_FORCE_RELEASE" in ReleaseCatCache()
does matter here.)

I get errors as in the test in question:
...
ITERATION 9
ITERATION 10
ERROR:  relation with OID 59777 does not exist
CONTEXT:  parallel worker
SQL statement "SELECT oid FROM pg_catalog.pg_class WHERE relnamespace =
57162 AND relkind IN ('r','m','v') AND pg_catalog.has_table_privilege (oid,
'SELECT') ORDER BY relname;"
2023-07-04 12:48:14.205 MSK [3111661] ERROR:  relation with OID 59777 does
not exist
2023-07-04 12:48:14.206 MSK [3111598] ERROR:  relation with OID 59777 does
not exist

With a debug logging added in src/backend/utils/adt/acl.c, I see that
SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid) returns true in
has_table_privilege_id(), but later, in
pg_class_aclcheck()/pg_class_aclmask_ext(), 
SearchSysCache1(RELOID, ObjectIdGetDatum(table_oid)) returns NULL.

This is reproduced on REL_10_STABLE .. master. 
The first commit that demonstrates the issue is 61c2e1a95 (it improved
access to parallelism for SPI users, one of which is
schema_to_xmlschema_internal() (see also schema_get_xml_visible_tables())).


04.07.2023 14:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      18014
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 16beta2
> Operating system:   Ubuntu 22.04
> Description:
>
> Yesterday's test failure on prion:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-07-03%2010%3A13%3A03
> made me wonder, what's going on there and whether it's yet another issue
> with invalidating relcache (bug #17994).
> (
> SELECT schema_to_xmlschema('testxmlschema', false, true, '');
> ERROR:  relation with OID 29598 does not exist
> CONTEXT:  SQL statement "SELECT oid FROM pg_catalog.pg_class WHERE
> relnamespace = 29597 AND relkind IN ('r','m','v') AND
> pg_catalog.has_table_privilege (oid, 'SELECT') ORDER BY relname;"

I investigated this case and would like to share my findings.
I added in has_table_privilege_id(), just below
      if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
          PG_RETURN_NULL();

the following loop:
for (int i = 0; i < 100; i++) {
   bool sce = SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid));
   if (!sce)
     elog(LOG, "has_table_privilege_id(): no syscache entry on iteration %d", i);
     break;
   }
}

and discovered that when the reproducing script uses parallel worker(s), the
syscache entry disappears during this loop execution. But that's not
happening when the query "SELECT oid FROM pg_catalog.pg_class WHERE ..."
is executed in a regular backend.
AFAICS, the difference is in the LockRelationOid():
      res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);

     /*
      * Now that we have the lock, check for invalidation messages, so that we
      * will update or flush any stale relcache entry before we try to use it.
      * RangeVarGetRelid() specifically relies on us for this.  We can skip
      * this in the not-uncommon case that we already had the same type of lock
      * being requested, since then no one else could have modified the
      * relcache entry in an undesirable way.  (In the case where our own xact
      * modifies the rel, the relcache update happens via
      * CommandCounterIncrement, not here.)
      *
      * However, in corner cases where code acts on tables (usually catalogs)
      * recursively, we might get here while still processing invalidation
      * messages in some outer execution of this function or a sibling.  The
      * "cleared" status of the lock tells us whether we really are done
      * absorbing relevant inval messages.
      */
     if (res != LOCKACQUIRE_ALREADY_CLEAR)
     {
         AcceptInvalidationMessages();
         MarkLockClear(locallock);
     }
when LockRelationOid() is called for pg_class_oid_index inside
SearchCatCacheMiss() -> systable_beginscan() -> index_open() -> relation_open().

The parallel worker doesn't have a lock on pg_class_oid_index before
executing the query, so it gets the lock and res == LOCKACQUIRE_OK (not
LOCKACQUIRE_ALREADY_CLEAR as in a regular backend case), after that it
processes invalidation messages (this can make the backend use a newer
catalog snapshot), and at the end it does systable_endscan() ->
index_close() -> UnlockRelationId() -> LockRelease()...
Thus, on a next iteration it gets the lock anew, with the res == LOCKACQUIRE_OK
again, and all that ceremony repeated.

It's not clear to me, whether this parallel worker behavior is expected and
if so, what to fix to avoid the test failure.

Best regards,
Alexander



At Sun, 16 Jul 2023 23:00:01 +0300, Alexander Lakhin <exclusion@gmail.com> wrote in 
> The parallel worker doesn't have a lock on pg_class_oid_index before
> executing the query, so it gets the lock and res == LOCKACQUIRE_OK
> (not
> LOCKACQUIRE_ALREADY_CLEAR as in a regular backend case), after that it
> processes invalidation messages (this can make the backend use a newer
> catalog snapshot), and at the end it does systable_endscan() ->
> index_close() -> UnlockRelationId() -> LockRelease()...
> Thus, on a next iteration it gets the lock anew, with the res ==
> LOCKACQUIRE_OK
> again, and all that ceremony repeated.
> 
> It's not clear to me, whether this parallel worker behavior is
> expected and
> if so, what to fix to avoid the test failure.

That is, the function is not parallel-safe. In fact it is marked as
'r' in pg_proc.proparallel. So, the real question appears to be how it
ended up running in a paralell worker.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



At Thu, 20 Jul 2023 16:28:02 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> That is, the function is not parallel-safe. In fact it is marked as
> 'r' in pg_proc.proparallel. So, the real question appears to be how it
> ended up running in a paralell worker.

Stupid. What we should do here would be ensuring the funtion doesn't
invoke parallel workers, maybe.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



At Thu, 20 Jul 2023 16:44:59 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Stupid. What we should do here would be ensuring the funtion doesn't
> invoke parallel workers, maybe.

So, for testing, I didn't see the error with applying the attached patch.
There are other SPI calls but I didn't see them at all.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Вложения
Hello Horiguchi-san,

20.07.2023 11:29, Kyotaro Horiguchi wrote:
> At Thu, 20 Jul 2023 16:44:59 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
>> Stupid. What we should do here would be ensuring the funtion doesn't
>> invoke parallel workers, maybe.
> So, for testing, I didn't see the error with applying the attached patch.
> There are other SPI calls but I didn't see them at all.
>

Thank you for looking into it!

I think that we need to determine the level where the problem that should
be fixed is:
1) test xmlmap fails sporadically due to the catalog changes caused by
  parallel tests activity
2) schema_to_xmlschemaX() can fail when parallel workers are used
3) has_table_privilegeX() can fail sporadically when executed within a
  parallel worker
4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
  when repeated in a parallel worker

The patch proposed fixes the issue on level 2, but that error still can be
seen on level 3. For example, you can replace
SELECT schema_to_xmlschema('testxmlschema_$c', true, false, '');
in the above script with
SELECT oid FROM pg_catalog.pg_class WHERE relnamespace = 1 AND relkind IN ('r', 'm', 'v') AND 
pg_catalog.has_table_privilege (oid, 'SELECT');

Probably, similar errors can also occur with
"SELECT * FROM information_schema.tables" or with other
information_schema views, that call has_table_privilege().

But if it's okey to have sporadic errors on lower levels, then may be just
fix the issue on level 1. Please look at the patch attached, that makes
schema_get_xml_visible_tables(Oid nspid) isolated from changes in other
namespaces.

Best regards,
Alexander
Вложения
Alexander Lakhin <exclusion@gmail.com> writes:
> I think that we need to determine the level where the problem that should
> be fixed is:
> 1) test xmlmap fails sporadically due to the catalog changes caused by
>   parallel tests activity
> 2) schema_to_xmlschemaX() can fail when parallel workers are used
> 3) has_table_privilegeX() can fail sporadically when executed within a
>   parallel worker
> 4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
>   when repeated in a parallel worker

Yeah, that's not immediately obvious.  IIUC, the situation we are
looking at is that SearchSysCacheExists can succeed even though the
tuple we found is already dead at the instant that the function
exits (thanks to absorption of inval messages during relation_close).
The fact that that only happens in parallel workers is pure chance
really.  It is not okay for has_table_privilegeX to depend on the
fact that the surrounding query already has some lock on pg_class.
So this means that the approach has_table_privilegeX uses of
assuming that successful SearchSysCacheExists means it can call
pg_class_aclcheck without fear is just broken.

If we suppose that that assumption is only being made in the
has_foo_privilege functions, then one way we could fix it is to extend
the API of pg_class_aclcheck etc to add a no-error-on-not-found flag,
and get rid of the separate SearchSysCacheExists check.  However,
I can't avoid the suspicion that we have other places assuming the
same thing.  So I think what we really ought to be doing is one
of two things:

1. Hack SearchSysCacheExists to account for this issue, by making it
loop if it finds a syscache entry but sees that the entry is already
dead.  (We have to loop, not just return false, in case the row was
updated rather than deleted.)  Maybe all the syscache lookup
functions need to do likewise; it's certainly not intuitively
reasonable for them to return already-known-stale entries.

2. Figure out how come we are executing a cache inval on the way
out of syscache entry creation, and stop that from happening.

I like #2 better if it's not hard to do cleanly.  However, I'm not
quite sure how we are getting to an inval during relation close;
maybe that's not something we want to prevent.

            regards, tom lane



Hello Tom,

Thank you for your considerations!

21.07.2023 20:20, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>> I think that we need to determine the level where the problem that should
>> be fixed is:
>> 1) test xmlmap fails sporadically due to the catalog changes caused by
>>    parallel tests activity
>> 2) schema_to_xmlschemaX() can fail when parallel workers are used
>> 3) has_table_privilegeX() can fail sporadically when executed within a
>>    parallel worker
>> 4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
>>    when repeated in a parallel worker
> Yeah, that's not immediately obvious.  IIUC, the situation we are
> looking at is that SearchSysCacheExists can succeed even though the
> tuple we found is already dead at the instant that the function
> exits (thanks to absorption of inval messages during relation_close).
> The fact that that only happens in parallel workers is pure chance
> really.  It is not okay for has_table_privilegeX to depend on the
> fact that the surrounding query already has some lock on pg_class.
> So this means that the approach has_table_privilegeX uses of
> assuming that successful SearchSysCacheExists means it can call
> pg_class_aclcheck without fear is just broken.
>
> If we suppose that that assumption is only being made in the
> has_foo_privilege functions, then one way we could fix it is to extend
> the API of pg_class_aclcheck etc to add a no-error-on-not-found flag,
> and get rid of the separate SearchSysCacheExists check.  However,
> I can't avoid the suspicion that we have other places assuming the
> same thing.

Running through SearchSysCacheExistsX() calls, I've found an interesting
(and optimistic) comment in src/backend/catalog/namespace.c:

  * ...  The underlying IsVisible functions
  * always use an up-to-date snapshot and so might see the object as already
  * gone when it's still visible to the transaction snapshot.  (There is no race
  * condition in the current coding because we don't accept sinval messages
  * between the SearchSysCacheExists test and the subsequent lookup.)
  */

Datum
pg_table_is_visible(PG_FUNCTION_ARGS)
{
     Oid         oid = PG_GETARG_OID(0);

     if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(oid)))
         PG_RETURN_NULL();

     PG_RETURN_BOOL(RelationIsVisible(oid));
}

...

bool
RelationIsVisible(Oid relid)
{
     HeapTuple   reltup;
     Form_pg_class relform;
     Oid         relnamespace;
     bool        visible;

     reltup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(reltup))
         elog(ERROR, "cache lookup failed for relation %u", relid);

So that's exactly the same coding pattern, thus fixing pg_class_aclcheck
is not an option, apparently.

>    So I think what we really ought to be doing is one
> of two things:
>
> 1. Hack SearchSysCacheExists to account for this issue, by making it
> loop if it finds a syscache entry but sees that the entry is already
> dead.  (We have to loop, not just return false, in case the row was
> updated rather than deleted.)  Maybe all the syscache lookup
> functions need to do likewise; it's certainly not intuitively
> reasonable for them to return already-known-stale entries.
>
> 2. Figure out how come we are executing a cache inval on the way
> out of syscache entry creation, and stop that from happening.

I wrote about LockRelationOid() before, and I still think that invalidation
messages are processed in that call (reached via SearchCatCacheMiss() ->
systable_beginscan() -> index_open() -> relation_open(), not via
relation_close()). So it seems that SearchSysCacheX calls find an entry,
that is not dead, but that entry can be dead (not found) for the next call
if invalidation messages are processed.

> I like #2 better if it's not hard to do cleanly.  However, I'm not
> quite sure how we are getting to an inval during relation close;
> maybe that's not something we want to prevent.

Yes, there is a detailed comment in LockRelationOid(), that explains why
AcceptInvalidationMessages() is called. (I've tried to remove that call
now, just for testing, and get 6 tests failed during `make check`.)

Best regards,
Alexander



Alexander Lakhin <exclusion@gmail.com> writes:
> 21.07.2023 20:20, Tom Lane wrote:
>> I like #2 better if it's not hard to do cleanly.  However, I'm not
>> quite sure how we are getting to an inval during relation close;
>> maybe that's not something we want to prevent.

> Yes, there is a detailed comment in LockRelationOid(), that explains why
> AcceptInvalidationMessages() is called. (I've tried to remove that call
> now, just for testing, and get 6 tests failed during `make check`.)

Yes, we certainly want to do that during LockRelationOid.  But what
seems to be happening here is an inval while we are closing/unlocking
the catalog we got the syscache entry from.  That is, the expected
behavior here is:

SearchSysCacheExists:

  * is entry present-and-valid?
    No, so...

  * open and lock relevant catalog (with possible inval)

  * scan catalog, find desired row, create valid syscache entry

  * close and unlock catalog

  * return success

SearchSysCache1 (from pg_class_aclmask_ext):

  * is entry present-and-valid?
    Yes, so increment its refcount and return it

There is no inval in the entry-already-present code path in syscache
lookup.  So if we are seeing this failure, ISTM it must mean that an
inval is happening during "close and unlock catalog", which seems like
something that we don't want.  But I've not traced exactly how that
happens.

            regards, tom lane



Hi Tom,

21.07.2023 22:21, Tom Lane wrote:
> Yes, we certainly want to do that during LockRelationOid.  But what
> seems to be happening here is an inval while we are closing/unlocking
> the catalog we got the syscache entry from.  That is, the expected
> behavior here is:
>
> SearchSysCacheExists:
>
>    * is entry present-and-valid?
>      No, so...
>
>    * open and lock relevant catalog (with possible inval)
>
>    * scan catalog, find desired row, create valid syscache entry
>
>    * close and unlock catalog
>
>    * return success
>
> SearchSysCache1 (from pg_class_aclmask_ext):
>
>    * is entry present-and-valid?
>      Yes, so increment its refcount and return it
>
> There is no inval in the entry-already-present code path in syscache
> lookup.  So if we are seeing this failure, ISTM it must mean that an
> inval is happening during "close and unlock catalog", which seems like
> something that we don't want.  But I've not traced exactly how that
> happens.

Yes, but here we deal with -DCATCACHE_FORCE_RELEASE (added to config_env
on prion), so the cache entry, that was just found in
SearchSysCacheExists(), is removed immediately because of
SearchSysCacheExists() ->  ReleaseSysCache(tuple) -> ReleaseCatCache(tuple).

So, while the construction "if (SearchSysCacheExists()) ... SearchSysCache1()"
seems robust for normal conditions, it might be broken when catcache entries
released forcefully. Thus, if the worst consequence of the issue is sporadic
test failures on prion, then may be fix it in a least invasive way (on level 1).

Best regards,
Alexander



At Tue, 25 Jul 2023 13:00:00 +0300, Alexander Lakhin <exclusion@gmail.com> wrote in
> Hi Tom,
>
> 21.07.2023 22:21, Tom Lane wrote:
> > Yes, we certainly want to do that during LockRelationOid.  But what
> > seems to be happening here is an inval while we are closing/unlocking
> > the catalog we got the syscache entry from.  That is, the expected
> > behavior here is:
> >
> > SearchSysCacheExists:
> >
> >    * is entry present-and-valid?
> >      No, so...
> >
> >    * open and lock relevant catalog (with possible inval)
> >
> >    * scan catalog, find desired row, create valid syscache entry
> >
> >    * close and unlock catalog
> >
> >    * return success
> >
> > SearchSysCache1 (from pg_class_aclmask_ext):
> >
> >    * is entry present-and-valid?
> >      Yes, so increment its refcount and return it
> >
> > There is no inval in the entry-already-present code path in syscache
> > lookup.  So if we are seeing this failure, ISTM it must mean that an
> > inval is happening during "close and unlock catalog", which seems like
> > something that we don't want.  But I've not traced exactly how that
> > happens.
>
> Yes, but here we deal with -DCATCACHE_FORCE_RELEASE (added to
> config_env
> on prion), so the cache entry, that was just found in
> SearchSysCacheExists(), is removed immediately because of
> SearchSysCacheExists() ->  ReleaseSysCache(tuple) ->
> ReleaseCatCache(tuple).
>
> So, while the construction "if (SearchSysCacheExists())
> ... SearchSysCache1()"
> seems robust for normal conditions, it might be broken when catcache

I agree about the safety of the construct.

> entries
> released forcefully. Thus, if the worst consequence of the issue is
> sporadic
> test failures on prion, then may be fix it in a least invasive way (on
> level 1).

> 1) test xmlmap fails sporadically due to the catalog changes caused by
>  parallel tests activity
> 2) schema_to_xmlschemaX() can fail when parallel workers are used

> 3) has_table_privilegeX() can fail sporadically when executed within a
>  parallel worker

Doesn't this imply that the function isn't parallel-safe? The issue is
gone by marking it and all variants as parallel-restricted. It seems
to be a reasolable way to address this issue.

> 4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
>  when repeated in a parallel worker

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> At Tue, 25 Jul 2023 13:00:00 +0300, Alexander Lakhin <exclusion@gmail.com> wrote in
>> 21.07.2023 22:21, Tom Lane wrote:
>>> There is no inval in the entry-already-present code path in syscache
>>> lookup.  So if we are seeing this failure, ISTM it must mean that an
>>> inval is happening during "close and unlock catalog", which seems like
>>> something that we don't want.  But I've not traced exactly how that
>>> happens.

>> Yes, but here we deal with -DCATCACHE_FORCE_RELEASE (added to config_env
>> on prion), so the cache entry, that was just found in
>> SearchSysCacheExists(), is removed immediately because of
>> SearchSysCacheExists() ->  ReleaseSysCache(tuple) ->
>> ReleaseCatCache(tuple).

Oh!  So are you saying that this case cannot happen in the wild
(that is, in a non-cache-clobbering build)?  If so, I think that
there's a good case to be made that the cache-clobbering behavior
is too strict, and we should change that (not sure just how) rather
than complicating callers that are perfectly safe in reality.

> Doesn't this imply that the function isn't parallel-safe? The issue is
> gone by marking it and all variants as parallel-restricted.

As I said earlier, I think that's a purely coincidental "fix" for
this specific manifestation.  Either SearchSysCacheExists followed
by a syscache lookup of the same tuple should be considered safe,
or it shouldn't.  If it should be considered safe, we need to fix the
cache-clobber test scaffolding to not give a false positive.  While if
it shouldn't, we need to get rid of that coding pattern, not apply
high-level band-aids that remove just one particular path to reaching
the problem.  I'm not dead set on either answer at this point, but
I think those are the plausible alternatives.

            regards, tom lane



At Wed, 26 Jul 2023 11:29:23 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Tue, 25 Jul 2023 13:00:00 +0300, Alexander Lakhin <exclusion@gmail.com> wrote in 
> > 3) has_table_privilegeX() can fail sporadically when executed within a
> >  parallel worker
> 
> Doesn't this imply that the function isn't parallel-safe? The issue is
> gone by marking it and all variants as parallel-restricted. It seems
> to be a reasolable way to address this issue.

Mmm. This may give somewhat different impression from my intention.

I meant that it seems that the function may return different values on
different workers for the same parameter. This means the function is
not "stable" or "stable but parallel-unsafe". I think the latter is
true.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



At Tue, 25 Jul 2023 22:41:57 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> > Doesn't this imply that the function isn't parallel-safe? The issue is
> > gone by marking it and all variants as parallel-restricted.
> 
> As I said earlier, I think that's a purely coincidental "fix" for
> this specific manifestation.  Either SearchSysCacheExists followed
> by a syscache lookup of the same tuple should be considered safe,
> or it shouldn't.  If it should be considered safe, we need to fix the

(So, it came to this after all..)

Yeah, as I posted at the same time, what I meant is not that the
sequence is unsafe. It is safe even on a parallel worker.  What I
meant was that there seems to be a case where it returns different
result for the same parameters if they are called on different
workers.

> cache-clobber test scaffolding to not give a false positive.  While if
> it shouldn't, we need to get rid of that coding pattern, not apply
> high-level band-aids that remove just one particular path to reaching
> the problem.  I'm not dead set on either answer at this point, but
> I think those are the plausible alternatives.

Right.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Tue, Jul 25, 2023 at 10:41:57PM -0400, Tom Lane wrote:
> As I said earlier, I think that's a purely coincidental "fix" for
> this specific manifestation.  Either SearchSysCacheExists followed
> by a syscache lookup of the same tuple should be considered safe,
> or it shouldn't.  If it should be considered safe, we need to fix the
> cache-clobber test scaffolding to not give a false positive.  While if
> it shouldn't, we need to get rid of that coding pattern, not apply
> high-level band-aids that remove just one particular path to reaching
> the problem.  I'm not dead set on either answer at this point, but
> I think those are the plausible alternatives.

FWIW, I'm having a hard time thinking about a reason that we should
not support SearchSysCacheExists()+lookup to be a valid pattern, even
if the cache is clobbered.  I am pretty sure that there are other code
paths in the tree, not mentioned on this thread, that do exactly that
(haven't checked, but indexcmds.c is one coming in mind).
--
Michael

Вложения
Michael Paquier <michael@paquier.xyz> writes:
> FWIW, I'm having a hard time thinking about a reason that we should
> not support SearchSysCacheExists()+lookup to be a valid pattern, even
> if the cache is clobbered.  I am pretty sure that there are other code
> paths in the tree, not mentioned on this thread, that do exactly that
> (haven't checked, but indexcmds.c is one coming in mind).

Yeah.  As I see it, there are several questions here.

First, is the SearchSysCacheExists()+lookup pattern actually unsafe,
that is is there a way to break it without any cache-clobbering debug
options enabled?  If so, there's no question we have to get rid of it.

If it isn't really unsafe, then we have a choice whether to get rid of it
or adjust the cache-clobbering options to not fail.

I think the main argument for keeping it is exactly that we've probably
depended on the idea in multiple places, and finding them all might be
hard (and not re-introducing more later, even harder).

On the other hand, I have to concede that such coding patterns are
inherently fragile: you can't introduce any new opportunities for a
cache inval between the SearchSysCacheExists() and the lookup, or
there's definitely a real hazard --- which we might not find for
awhile, if this example is anything to go by.

These has_foo_privilege() functions are arguably a good example of the
least safe way to go about it, in that the SearchSysCacheExists and
the subsequent lookup aren't textually close or even in the same
module.  Nor do we have any comments warning against introducing more
logic into the critical code paths.

So another conclusion we could arrive at is that the pattern isn't
inherently unsafe, but we should only allow it when there's not much
code between the two calls, which would probably lead to wanting to
rewrite the has_foo_privilege() family after all.

I don't yet have an opinion about which way to go.

            regards, tom lane



At Wed, 26 Jul 2023 11:59:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Yeah, as I posted at the same time, what I meant is not that the
> sequence is unsafe. It is safe even on a parallel worker.  What I
> meant was that there seems to be a case where it returns different
> result for the same parameters if they are called on different
> workers.

Okey, I'm convinced that that won't happen, due to command counter,
which was missing in the above.

Sorry for the noise.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Hi:
 
I think that we need to determine the level where the problem that should
be fixed is:

Since this thread has not been updated in the last 20 days, I'm not
sure if any people are still working on this. If not, I think considering
reducing the level of the fix is an option, to avoid the unstable testcase. 
 
Please look at the patch attached, that makes
schema_get_xml_visible_tables(Oid nspid) isolated from changes in other
namespaces.

I read the patch and it looks good and the change is pretty mild. so
personally +1  for this solution.  Another solution suggested by me
is to force the query to go with index scan for the test case only [1], 
which we get the similar result but it just works in this test case.
I think Alexander's method is better,  I just put my findings here just
in case I missed something. 



--
Best Regards
Andy Fan
I wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> FWIW, I'm having a hard time thinking about a reason that we should
>> not support SearchSysCacheExists()+lookup to be a valid pattern, even
>> if the cache is clobbered.  I am pretty sure that there are other code
>> paths in the tree, not mentioned on this thread, that do exactly that
>> (haven't checked, but indexcmds.c is one coming in mind).

> Yeah.  As I see it, there are several questions here.

> First, is the SearchSysCacheExists()+lookup pattern actually unsafe,
> that is is there a way to break it without any cache-clobbering debug
> options enabled?  If so, there's no question we have to get rid of it.

I came back to this question today, and after more thought I'm going
to cast my vote for "let's not trust SearchSysCacheExists()+lookup".
It just seems like that's too fragile.  The case reported in this
thread of it failing in parallel workers but not main should scare
anybody, and the future course of development seems far more likely
to introduce new problems than remove them.

I spent some time looking through existing SearchSysCacheExists calls,
and I could only find two sets of routines where we seem to be
depending on SearchSysCacheExists to protect a subsequent lookup
somewhere else, and there isn't any lock on the object in question.
Those are the has_foo_privilege functions discussed here, and the
foo_is_visible functions near the bottom of namespace.c.  I'm not
sure why we've not heard complaints traceable to the foo_is_visible
family.  Maybe nobody has tried hard to break them, or maybe they
are just less likely to be used in ways that are at risk.

It turns out to not be that hard to get rid of the problem in the
has_foo_privilege family, as per attached patches.  I've not looked
at fixing the foo_is_visible family, but that probably ought to be a
separate patch anyway.

BTW, while nosing around I found what seems like a very nasty related
bug.  Suppose that a catalog tuple being loaded into syscache contains
some toasted fields.  CatalogCacheCreateEntry will flatten the tuple,
involving fetches from toast tables that will certainly cause
AcceptInvalidationMessages calls.  What if one of those should have
invalidated this tuple?  We will not notice, because it's not in
the hashtable yet.  When we do add it, we will mark it not-dead,
meaning that the stale entry looks fine and could persist for a long
while.

Anyway, as to the attached patches: I split this into two parts
just to make it easier to review.  0001 deals with the functions
that use pg_class_aclcheck and related, while 0002 deals with the
ones that go through object_aclcheck.  In both cases, we're
basically extending the API convention somebody added awhile ago
that foo_aclcheck and foo_aclmask should have extended versions
foo_aclcheck_ext and foo_aclmask_ext that take an "is_missing"
argument.  That wasn't terribly well documented IMO, so 0001
starts out with massaging the comments to make it clearer.
I also changed some existing places in pg_attribute_aclmask_ext and
pg_attribute_aclcheck_all to conform to the is_missing convention,
rather than hard-wiring a decision that it's okay to ignore lookup
failure.

pg_attribute_aclcheck_all also had a most curious decision that
it could call pg_attribute_aclmask, which not only opens it right
back up to potential lookup failure but is quite inefficient,
requiring in three syscache lookups instead of one for each
column having a non-null attacl.  It just takes a few more lines
to call aclmask() directly.

0001's changes in acl.c are straightforward, but it's worth noting
that the has_sequence_privilege functions hadn't gotten the memo
about not failing when a bogus relation OID is passed.

0002 is pretty straightforward as well, just adding an "_ext"
version of object_aclcheck/object_aclmask and using those
where appropriate.

It looks like 0001 could be back-patched as far as v14 without
too much trouble.  Before that, there isn't pg_class_aclcheck_ext,
and I'm not sure it's worth the trouble to add.  0002 will only
work in HEAD and v16 because object_aclcheck wasn't there earlier.
I'm not sure whether to bother back-patching at all, given that
we've only seen this problem in test builds.

Comments?

            regards, tom lane

diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index d1f5dcd8be..95b0f8c7b0 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -13,6 +13,27 @@
  * NOTES
  *      See acl.h.
  *
+ *      The xxx_aclmask() functions in this file are wrappers around
+ *      acl.c's aclmask() function; see that for basic usage information.
+ *      The wrapper functions add object-type-specific lookup capability.
+ *      Generally, they will throw error if the object doesn't exist.
+ *
+ *      The xxx_aclmask_ext() functions add the ability to not throw
+ *      error if the object doesn't exist.  If their "is_missing" argument
+ *      isn't NULL, then when the object isn't found they will set
+ *      *is_missing = true and return zero (no privileges) instead of
+ *      throwing an error.  Caller must initialize *is_missing = false.
+ *
+ *      The xxx_aclcheck() functions are simplified wrappers around the
+ *      corresponding xxx_aclmask() functions, simply returning ACLCHECK_OK
+ *      if any of the privileges specified in "mode" are held, and otherwise
+ *      a suitable error code (in practice, always ACLCHECK_NO_PRIV).
+ *      Again, they will throw error if the object doesn't exist.
+ *
+ *      The xxx_aclcheck_ext() functions add the ability to not throw
+ *      error if the object doesn't exist.  Their "is_missing" argument
+ *      works similarly to the xxx_aclmask_ext() functions.
+ *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
@@ -3149,10 +3170,7 @@ pg_attribute_aclmask(Oid table_oid, AttrNumber attnum, Oid roleid,
 }

 /*
- * Routine for examining a user's privileges for a column
- *
- * Does the bulk of the work for pg_attribute_aclmask(), and allows other
- * callers to avoid the missing attribute ERROR when is_missing is non-NULL.
+ * Routine for examining a user's privileges for a column, with is_missing
  */
 static AclMode
 pg_attribute_aclmask_ext(Oid table_oid, AttrNumber attnum, Oid roleid,
@@ -3226,15 +3244,24 @@ pg_attribute_aclmask_ext(Oid table_oid, AttrNumber attnum, Oid roleid,
      * Must get the relation's ownerId from pg_class.  Since we already found
      * a pg_attribute entry, the only likely reason for this to fail is that a
      * concurrent DROP of the relation committed since then (which could only
-     * happen if we don't have lock on the relation).  We prefer to report "no
-     * privileges" rather than failing in such a case, so as to avoid unwanted
-     * failures in has_column_privilege() tests.
+     * happen if we don't have lock on the relation).  Treat that similarly to
+     * not finding the attribute entry.
      */
     classTuple = SearchSysCache1(RELOID, ObjectIdGetDatum(table_oid));
     if (!HeapTupleIsValid(classTuple))
     {
         ReleaseSysCache(attTuple);
-        return 0;
+        if (is_missing != NULL)
+        {
+            /* return "no privileges" instead of throwing an error */
+            *is_missing = true;
+            return 0;
+        }
+        else
+            ereport(ERROR,
+                    (errcode(ERRCODE_UNDEFINED_TABLE),
+                     errmsg("relation with OID %u does not exist",
+                            table_oid)));
     }
     classForm = (Form_pg_class) GETSTRUCT(classTuple);

@@ -3267,10 +3294,7 @@ pg_class_aclmask(Oid table_oid, Oid roleid,
 }

 /*
- * Routine for examining a user's privileges for a table
- *
- * Does the bulk of the work for pg_class_aclmask(), and allows other
- * callers to avoid the missing relation ERROR when is_missing is non-NULL.
+ * Routine for examining a user's privileges for a table, with is_missing
  */
 static AclMode
 pg_class_aclmask_ext(Oid table_oid, Oid roleid, AclMode mask,
@@ -3784,10 +3808,8 @@ pg_attribute_aclcheck(Oid table_oid, AttrNumber attnum,


 /*
- * Exported routine for checking a user's access privileges to a column
- *
- * Does the bulk of the work for pg_attribute_aclcheck(), and allows other
- * callers to avoid the missing attribute ERROR when is_missing is non-NULL.
+ * Exported routine for checking a user's access privileges to a column,
+ * with is_missing
  */
 AclResult
 pg_attribute_aclcheck_ext(Oid table_oid, AttrNumber attnum,
@@ -3822,23 +3844,47 @@ pg_attribute_aclcheck_ext(Oid table_oid, AttrNumber attnum,
 AclResult
 pg_attribute_aclcheck_all(Oid table_oid, Oid roleid, AclMode mode,
                           AclMaskHow how)
+{
+    return pg_attribute_aclcheck_all_ext(table_oid, roleid, mode, how, NULL);
+}
+
+/*
+ * Exported routine for checking a user's access privileges to any/all columns,
+ * with is_missing
+ */
+AclResult
+pg_attribute_aclcheck_all_ext(Oid table_oid, Oid roleid,
+                              AclMode mode, AclMaskHow how,
+                              bool *is_missing)
 {
     AclResult    result;
     HeapTuple    classTuple;
     Form_pg_class classForm;
+    Oid            ownerId;
     AttrNumber    nattrs;
     AttrNumber    curr_att;

     /*
-     * Must fetch pg_class row to check number of attributes.  As in
-     * pg_attribute_aclmask, we prefer to return "no privileges" instead of
-     * throwing an error if we get any unexpected lookup errors.
+     * Must fetch pg_class row to get owner ID and number of attributes.
      */
     classTuple = SearchSysCache1(RELOID, ObjectIdGetDatum(table_oid));
     if (!HeapTupleIsValid(classTuple))
-        return ACLCHECK_NO_PRIV;
+    {
+        if (is_missing != NULL)
+        {
+            /* return "no privileges" instead of throwing an error */
+            *is_missing = true;
+            return ACLCHECK_NO_PRIV;
+        }
+        else
+            ereport(ERROR,
+                    (errcode(ERRCODE_UNDEFINED_TABLE),
+                     errmsg("relation with OID %u does not exist",
+                            table_oid)));
+    }
     classForm = (Form_pg_class) GETSTRUCT(classTuple);

+    ownerId = classForm->relowner;
     nattrs = classForm->relnatts;

     ReleaseSysCache(classTuple);
@@ -3852,11 +3898,20 @@ pg_attribute_aclcheck_all(Oid table_oid, Oid roleid, AclMode mode,
     for (curr_att = 1; curr_att <= nattrs; curr_att++)
     {
         HeapTuple    attTuple;
+        Datum        aclDatum;
+        bool        isNull;
+        Acl           *acl;
         AclMode        attmask;

         attTuple = SearchSysCache2(ATTNUM,
                                    ObjectIdGetDatum(table_oid),
                                    Int16GetDatum(curr_att));
+
+        /*
+         * Lookup failure probably indicates that the table was just dropped,
+         * but we'll treat it the same as a dropped column rather than
+         * throwing error.
+         */
         if (!HeapTupleIsValid(attTuple))
             continue;

@@ -3867,16 +3922,27 @@ pg_attribute_aclcheck_all(Oid table_oid, Oid roleid, AclMode mode,
             continue;
         }

+        aclDatum = SysCacheGetAttr(ATTNUM, attTuple, Anum_pg_attribute_attacl,
+                                   &isNull);
+
         /*
          * Here we hard-wire knowledge that the default ACL for a column
          * grants no privileges, so that we can fall out quickly in the very
          * common case where attacl is null.
          */
-        if (heap_attisnull(attTuple, Anum_pg_attribute_attacl, NULL))
+        if (isNull)
             attmask = 0;
         else
-            attmask = pg_attribute_aclmask(table_oid, curr_att, roleid,
-                                           mode, ACLMASK_ANY);
+        {
+            /* detoast column's ACL if necessary */
+            acl = DatumGetAclP(aclDatum);
+
+            attmask = aclmask(acl, roleid, ownerId, mode, ACLMASK_ANY);
+
+            /* if we have a detoasted copy, free it */
+            if ((Pointer) acl != DatumGetPointer(aclDatum))
+                pfree(acl);
+        }

         ReleaseSysCache(attTuple);

@@ -3911,10 +3977,8 @@ pg_class_aclcheck(Oid table_oid, Oid roleid, AclMode mode)
 }

 /*
- * Exported routine for checking a user's access privileges to a table
- *
- * Does the bulk of the work for pg_class_aclcheck(), and allows other
- * callers to avoid the missing relation ERROR when is_missing is non-NULL.
+ * Exported routine for checking a user's access privileges to a table,
+ * with is_missing
  */
 AclResult
 pg_class_aclcheck_ext(Oid table_oid, Oid roleid,
diff --git a/src/backend/utils/adt/acl.c b/src/backend/utils/adt/acl.c
index 27eabb80ab..809cb6f03f 100644
--- a/src/backend/utils/adt/acl.c
+++ b/src/backend/utils/adt/acl.c
@@ -1915,14 +1915,15 @@ has_table_privilege_name_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = get_role_oid_or_public(NameStr(*username));
     mode = convert_table_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
-        PG_RETURN_NULL();
+    aclresult = pg_class_aclcheck_ext(tableoid, roleid, mode, &is_missing);

-    aclresult = pg_class_aclcheck(tableoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -1941,14 +1942,15 @@ has_table_privilege_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = GetUserId();
     mode = convert_table_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
-        PG_RETURN_NULL();
+    aclresult = pg_class_aclcheck_ext(tableoid, roleid, mode, &is_missing);

-    aclresult = pg_class_aclcheck(tableoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -1989,13 +1991,14 @@ has_table_privilege_id_id(PG_FUNCTION_ARGS)
     text       *priv_type_text = PG_GETARG_TEXT_PP(2);
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     mode = convert_table_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
-        PG_RETURN_NULL();
+    aclresult = pg_class_aclcheck_ext(tableoid, roleid, mode, &is_missing);

-    aclresult = pg_class_aclcheck(tableoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -2134,6 +2137,7 @@ has_sequence_privilege_name_id(PG_FUNCTION_ARGS)
     AclMode        mode;
     AclResult    aclresult;
     char        relkind;
+    bool        is_missing = false;

     roleid = get_role_oid_or_public(NameStr(*username));
     mode = convert_sequence_priv_string(priv_type_text);
@@ -2146,7 +2150,10 @@ has_sequence_privilege_name_id(PG_FUNCTION_ARGS)
                  errmsg("\"%s\" is not a sequence",
                         get_rel_name(sequenceoid))));

-    aclresult = pg_class_aclcheck(sequenceoid, roleid, mode);
+    aclresult = pg_class_aclcheck_ext(sequenceoid, roleid, mode, &is_missing);
+
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -2166,6 +2173,7 @@ has_sequence_privilege_id(PG_FUNCTION_ARGS)
     AclMode        mode;
     AclResult    aclresult;
     char        relkind;
+    bool        is_missing = false;

     roleid = GetUserId();
     mode = convert_sequence_priv_string(priv_type_text);
@@ -2178,7 +2186,11 @@ has_sequence_privilege_id(PG_FUNCTION_ARGS)
                  errmsg("\"%s\" is not a sequence",
                         get_rel_name(sequenceoid))));

-    aclresult = pg_class_aclcheck(sequenceoid, roleid, mode);
+
+    aclresult = pg_class_aclcheck_ext(sequenceoid, roleid, mode, &is_missing);
+
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -2225,6 +2237,7 @@ has_sequence_privilege_id_id(PG_FUNCTION_ARGS)
     AclMode        mode;
     AclResult    aclresult;
     char        relkind;
+    bool        is_missing = false;

     mode = convert_sequence_priv_string(priv_type_text);
     relkind = get_rel_relkind(sequenceoid);
@@ -2236,7 +2249,11 @@ has_sequence_privilege_id_id(PG_FUNCTION_ARGS)
                  errmsg("\"%s\" is not a sequence",
                         get_rel_name(sequenceoid))));

-    aclresult = pg_class_aclcheck(sequenceoid, roleid, mode);
+
+    aclresult = pg_class_aclcheck_ext(sequenceoid, roleid, mode, &is_missing);
+
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -2345,18 +2362,22 @@ has_any_column_privilege_name_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = get_role_oid_or_public(NameStr(*username));
     mode = convert_column_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
-        PG_RETURN_NULL();
-
     /* First check at table level, then examine each column if needed */
-    aclresult = pg_class_aclcheck(tableoid, roleid, mode);
+    aclresult = pg_class_aclcheck_ext(tableoid, roleid, mode, &is_missing);
     if (aclresult != ACLCHECK_OK)
-        aclresult = pg_attribute_aclcheck_all(tableoid, roleid, mode,
-                                              ACLMASK_ANY);
+    {
+        if (is_missing)
+            PG_RETURN_NULL();
+        aclresult = pg_attribute_aclcheck_all_ext(tableoid, roleid, mode,
+                                                  ACLMASK_ANY, &is_missing);
+        if (is_missing)
+            PG_RETURN_NULL();
+    }

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -2375,18 +2396,22 @@ has_any_column_privilege_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = GetUserId();
     mode = convert_column_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
-        PG_RETURN_NULL();
-
     /* First check at table level, then examine each column if needed */
-    aclresult = pg_class_aclcheck(tableoid, roleid, mode);
+    aclresult = pg_class_aclcheck_ext(tableoid, roleid, mode, &is_missing);
     if (aclresult != ACLCHECK_OK)
-        aclresult = pg_attribute_aclcheck_all(tableoid, roleid, mode,
-                                              ACLMASK_ANY);
+    {
+        if (is_missing)
+            PG_RETURN_NULL();
+        aclresult = pg_attribute_aclcheck_all_ext(tableoid, roleid, mode,
+                                                  ACLMASK_ANY, &is_missing);
+        if (is_missing)
+            PG_RETURN_NULL();
+    }

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -2431,17 +2456,21 @@ has_any_column_privilege_id_id(PG_FUNCTION_ARGS)
     text       *priv_type_text = PG_GETARG_TEXT_PP(2);
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     mode = convert_column_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
-        PG_RETURN_NULL();
-
     /* First check at table level, then examine each column if needed */
-    aclresult = pg_class_aclcheck(tableoid, roleid, mode);
+    aclresult = pg_class_aclcheck_ext(tableoid, roleid, mode, &is_missing);
     if (aclresult != ACLCHECK_OK)
-        aclresult = pg_attribute_aclcheck_all(tableoid, roleid, mode,
-                                              ACLMASK_ANY);
+    {
+        if (is_missing)
+            PG_RETURN_NULL();
+        aclresult = pg_attribute_aclcheck_all_ext(tableoid, roleid, mode,
+                                                  ACLMASK_ANY, &is_missing);
+        if (is_missing)
+            PG_RETURN_NULL();
+    }

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
diff --git a/src/include/utils/acl.h b/src/include/utils/acl.h
index aba1afa971..331a87d0e6 100644
--- a/src/include/utils/acl.h
+++ b/src/include/utils/acl.h
@@ -250,6 +250,9 @@ extern AclResult pg_attribute_aclcheck_ext(Oid table_oid, AttrNumber attnum,
                                            bool *is_missing);
 extern AclResult pg_attribute_aclcheck_all(Oid table_oid, Oid roleid,
                                            AclMode mode, AclMaskHow how);
+extern AclResult pg_attribute_aclcheck_all_ext(Oid table_oid, Oid roleid,
+                                               AclMode mode, AclMaskHow how,
+                                               bool *is_missing);
 extern AclResult pg_class_aclcheck(Oid table_oid, Oid roleid, AclMode mode);
 extern AclResult pg_class_aclcheck_ext(Oid table_oid, Oid roleid,
                                        AclMode mode, bool *is_missing);
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 95b0f8c7b0..67354cea92 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -160,6 +160,9 @@ static AclMode pg_aclmask(ObjectType objtype, Oid object_oid, AttrNumber attnum,
                           Oid roleid, AclMode mask, AclMaskHow how);
 static AclMode object_aclmask(Oid classid, Oid objectid, Oid roleid,
                               AclMode mask, AclMaskHow how);
+static AclMode object_aclmask_ext(Oid classid, Oid objectid, Oid roleid,
+                                  AclMode mask, AclMaskHow how,
+                                  bool *is_missing);
 static AclMode pg_attribute_aclmask(Oid table_oid, AttrNumber attnum,
                                     Oid roleid, AclMode mask, AclMaskHow how);
 static AclMode pg_attribute_aclmask_ext(Oid table_oid, AttrNumber attnum,
@@ -172,10 +175,12 @@ static AclMode pg_parameter_acl_aclmask(Oid acl_oid, Oid roleid,
                                         AclMode mask, AclMaskHow how);
 static AclMode pg_largeobject_aclmask_snapshot(Oid lobj_oid, Oid roleid,
                                                AclMode mask, AclMaskHow how, Snapshot snapshot);
-static AclMode pg_namespace_aclmask(Oid nsp_oid, Oid roleid,
-                                    AclMode mask, AclMaskHow how);
-static AclMode pg_type_aclmask(Oid type_oid, Oid roleid,
-                               AclMode mask, AclMaskHow how);
+static AclMode pg_namespace_aclmask_ext(Oid nsp_oid, Oid roleid,
+                                        AclMode mask, AclMaskHow how,
+                                        bool *is_missing);
+static AclMode pg_type_aclmask_ext(Oid type_oid, Oid roleid,
+                                   AclMode mask, AclMaskHow how,
+                                   bool *is_missing);
 static void recordExtensionInitPriv(Oid objoid, Oid classoid, int objsubid,
                                     Acl *new_acl);
 static void recordExtensionInitPrivWorker(Oid objoid, Oid classoid, int objsubid,
@@ -3085,6 +3090,18 @@ pg_aclmask(ObjectType objtype, Oid object_oid, AttrNumber attnum, Oid roleid,
 static AclMode
 object_aclmask(Oid classid, Oid objectid, Oid roleid,
                AclMode mask, AclMaskHow how)
+{
+    return object_aclmask_ext(classid, objectid, roleid, mask, how, NULL);
+}
+
+/*
+ * Generic routine for examining a user's privileges for an object,
+ * with is_missing
+ */
+static AclMode
+object_aclmask_ext(Oid classid, Oid objectid, Oid roleid,
+                   AclMode mask, AclMaskHow how,
+                   bool *is_missing)
 {
     int            cacheid;
     AclMode        result;
@@ -3098,9 +3115,11 @@ object_aclmask(Oid classid, Oid objectid, Oid roleid,
     switch (classid)
     {
         case NamespaceRelationId:
-            return pg_namespace_aclmask(objectid, roleid, mask, how);
+            return pg_namespace_aclmask_ext(objectid, roleid, mask, how,
+                                            is_missing);
         case TypeRelationId:
-            return pg_type_aclmask(objectid, roleid, mask, how);
+            return pg_type_aclmask_ext(objectid, roleid, mask, how,
+                                       is_missing);
     }

     /* Even more special cases */
@@ -3113,16 +3132,26 @@ object_aclmask(Oid classid, Oid objectid, Oid roleid,
         return mask;

     /*
-     * Get the objects's ACL from its catalog
+     * Get the object's ACL from its catalog
      */

     cacheid = get_object_catcache_oid(classid);

     tuple = SearchSysCache1(cacheid, ObjectIdGetDatum(objectid));
     if (!HeapTupleIsValid(tuple))
-        ereport(ERROR,
-                (errcode(ERRCODE_UNDEFINED_DATABASE),
-                 errmsg("%s with OID %u does not exist", get_object_class_descr(classid), objectid)));
+    {
+        if (is_missing != NULL)
+        {
+            /* return "no privileges" instead of throwing an error */
+            *is_missing = true;
+            return 0;
+        }
+        else
+            ereport(ERROR,
+                    (errcode(ERRCODE_UNDEFINED_OBJECT),
+                     errmsg("%s with OID %u does not exist",
+                            get_object_class_descr(classid), objectid)));
+    }

     ownerId = DatumGetObjectId(SysCacheGetAttrNotNull(cacheid,
                                                       tuple,
@@ -3609,11 +3638,12 @@ pg_largeobject_aclmask_snapshot(Oid lobj_oid, Oid roleid,
 }

 /*
- * Routine for examining a user's privileges for a namespace
+ * Routine for examining a user's privileges for a namespace, with is_missing
  */
 static AclMode
-pg_namespace_aclmask(Oid nsp_oid, Oid roleid,
-                     AclMode mask, AclMaskHow how)
+pg_namespace_aclmask_ext(Oid nsp_oid, Oid roleid,
+                         AclMode mask, AclMaskHow how,
+                         bool *is_missing)
 {
     AclMode        result;
     HeapTuple    tuple;
@@ -3647,8 +3677,8 @@ pg_namespace_aclmask(Oid nsp_oid, Oid roleid,
      */
     if (isTempNamespace(nsp_oid))
     {
-        if (object_aclcheck(DatabaseRelationId, MyDatabaseId, roleid,
-                            ACL_CREATE_TEMP) == ACLCHECK_OK)
+        if (object_aclcheck_ext(DatabaseRelationId, MyDatabaseId, roleid,
+                                ACL_CREATE_TEMP, is_missing) == ACLCHECK_OK)
             return mask & ACL_ALL_RIGHTS_SCHEMA;
         else
             return mask & ACL_USAGE;
@@ -3659,9 +3689,18 @@ pg_namespace_aclmask(Oid nsp_oid, Oid roleid,
      */
     tuple = SearchSysCache1(NAMESPACEOID, ObjectIdGetDatum(nsp_oid));
     if (!HeapTupleIsValid(tuple))
-        ereport(ERROR,
-                (errcode(ERRCODE_UNDEFINED_SCHEMA),
-                 errmsg("schema with OID %u does not exist", nsp_oid)));
+    {
+        if (is_missing != NULL)
+        {
+            /* return "no privileges" instead of throwing an error */
+            *is_missing = true;
+            return 0;
+        }
+        else
+            ereport(ERROR,
+                    (errcode(ERRCODE_UNDEFINED_SCHEMA),
+                     errmsg("schema with OID %u does not exist", nsp_oid)));
+    }

     ownerId = ((Form_pg_namespace) GETSTRUCT(tuple))->nspowner;

@@ -3701,20 +3740,20 @@ pg_namespace_aclmask(Oid nsp_oid, Oid roleid,
 }

 /*
- * Routine for examining a user's privileges for a type.
+ * Routine for examining a user's privileges for a type, with is_missing
  */
 static AclMode
-pg_type_aclmask(Oid type_oid, Oid roleid, AclMode mask, AclMaskHow how)
+pg_type_aclmask_ext(Oid type_oid, Oid roleid, AclMode mask, AclMaskHow how,
+                    bool *is_missing)
 {
     AclMode        result;
     HeapTuple    tuple;
+    Form_pg_type typeForm;
     Datum        aclDatum;
     bool        isNull;
     Acl           *acl;
     Oid            ownerId;

-    Form_pg_type typeForm;
-
     /* Bypass permission checks for superusers */
     if (superuser_arg(roleid))
         return mask;
@@ -3724,10 +3763,19 @@ pg_type_aclmask(Oid type_oid, Oid roleid, AclMode mask, AclMaskHow how)
      */
     tuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(type_oid));
     if (!HeapTupleIsValid(tuple))
-        ereport(ERROR,
-                (errcode(ERRCODE_UNDEFINED_OBJECT),
-                 errmsg("type with OID %u does not exist",
-                        type_oid)));
+    {
+        if (is_missing != NULL)
+        {
+            /* return "no privileges" instead of throwing an error */
+            *is_missing = true;
+            return 0;
+        }
+        else
+            ereport(ERROR,
+                    (errcode(ERRCODE_UNDEFINED_OBJECT),
+                     errmsg("type with OID %u does not exist",
+                            type_oid)));
+    }
     typeForm = (Form_pg_type) GETSTRUCT(tuple);

     /*
@@ -3741,9 +3789,20 @@ pg_type_aclmask(Oid type_oid, Oid roleid, AclMode mask, AclMaskHow how)
         ReleaseSysCache(tuple);

         tuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(elttype_oid));
-        /* this case is not a user-facing error, so elog not ereport */
         if (!HeapTupleIsValid(tuple))
-            elog(ERROR, "cache lookup failed for type %u", elttype_oid);
+        {
+            if (is_missing != NULL)
+            {
+                /* return "no privileges" instead of throwing an error */
+                *is_missing = true;
+                return 0;
+            }
+            else
+                ereport(ERROR,
+                        (errcode(ERRCODE_UNDEFINED_OBJECT),
+                         errmsg("type with OID %u does not exist",
+                                elttype_oid)));
+        }
         typeForm = (Form_pg_type) GETSTRUCT(tuple);
     }

@@ -3783,7 +3842,20 @@ pg_type_aclmask(Oid type_oid, Oid roleid, AclMode mask, AclMaskHow how)
 AclResult
 object_aclcheck(Oid classid, Oid objectid, Oid roleid, AclMode mode)
 {
-    if (object_aclmask(classid, objectid, roleid, mode, ACLMASK_ANY) != 0)
+    return object_aclcheck_ext(classid, objectid, roleid, mode, NULL);
+}
+
+/*
+ * Exported generic routine for checking a user's access privileges to an object
+ * with is_missing
+ */
+AclResult
+object_aclcheck_ext(Oid classid, Oid objectid,
+                    Oid roleid, AclMode mode,
+                    bool *is_missing)
+{
+    if (object_aclmask_ext(classid, objectid, roleid, mode, ACLMASK_ANY,
+                           is_missing) != 0)
         return ACLCHECK_OK;
     else
         return ACLCHECK_NO_PRIV;
diff --git a/src/backend/utils/adt/acl.c b/src/backend/utils/adt/acl.c
index 809cb6f03f..798833011b 100644
--- a/src/backend/utils/adt/acl.c
+++ b/src/backend/utils/adt/acl.c
@@ -3010,14 +3010,17 @@ has_database_privilege_name_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = get_role_oid_or_public(NameStr(*username));
     mode = convert_database_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(DATABASEOID, ObjectIdGetDatum(databaseoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(DatabaseRelationId, databaseoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(DatabaseRelationId, databaseoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3036,14 +3039,17 @@ has_database_privilege_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = GetUserId();
     mode = convert_database_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(DATABASEOID, ObjectIdGetDatum(databaseoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(DatabaseRelationId, databaseoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(DatabaseRelationId, databaseoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3084,13 +3090,16 @@ has_database_privilege_id_id(PG_FUNCTION_ARGS)
     text       *priv_type_text = PG_GETARG_TEXT_PP(2);
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     mode = convert_database_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(DATABASEOID, ObjectIdGetDatum(databaseoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(DatabaseRelationId, databaseoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(DatabaseRelationId, databaseoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3207,14 +3216,17 @@ has_foreign_data_wrapper_privilege_name_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = get_role_oid_or_public(NameStr(*username));
     mode = convert_foreign_data_wrapper_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(FOREIGNDATAWRAPPEROID, ObjectIdGetDatum(fdwid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(ForeignDataWrapperRelationId, fdwid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(ForeignDataWrapperRelationId, fdwid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3233,14 +3245,17 @@ has_foreign_data_wrapper_privilege_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = GetUserId();
     mode = convert_foreign_data_wrapper_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(FOREIGNDATAWRAPPEROID, ObjectIdGetDatum(fdwid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(ForeignDataWrapperRelationId, fdwid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(ForeignDataWrapperRelationId, fdwid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3281,13 +3296,16 @@ has_foreign_data_wrapper_privilege_id_id(PG_FUNCTION_ARGS)
     text       *priv_type_text = PG_GETARG_TEXT_PP(2);
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     mode = convert_foreign_data_wrapper_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(FOREIGNDATAWRAPPEROID, ObjectIdGetDatum(fdwid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(ForeignDataWrapperRelationId, fdwid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(ForeignDataWrapperRelationId, fdwid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3398,14 +3416,17 @@ has_function_privilege_name_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = get_role_oid_or_public(NameStr(*username));
     mode = convert_function_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(PROCOID, ObjectIdGetDatum(functionoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(ProcedureRelationId, functionoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(ProcedureRelationId, functionoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3424,14 +3445,17 @@ has_function_privilege_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = GetUserId();
     mode = convert_function_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(PROCOID, ObjectIdGetDatum(functionoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(ProcedureRelationId, functionoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(ProcedureRelationId, functionoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3472,13 +3496,16 @@ has_function_privilege_id_id(PG_FUNCTION_ARGS)
     text       *priv_type_text = PG_GETARG_TEXT_PP(2);
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     mode = convert_function_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(PROCOID, ObjectIdGetDatum(functionoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(ProcedureRelationId, functionoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(ProcedureRelationId, functionoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3598,14 +3625,17 @@ has_language_privilege_name_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = get_role_oid_or_public(NameStr(*username));
     mode = convert_language_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(LANGOID, ObjectIdGetDatum(languageoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(LanguageRelationId, languageoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(LanguageRelationId, languageoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3624,14 +3654,17 @@ has_language_privilege_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = GetUserId();
     mode = convert_language_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(LANGOID, ObjectIdGetDatum(languageoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(LanguageRelationId, languageoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(LanguageRelationId, languageoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3672,13 +3705,16 @@ has_language_privilege_id_id(PG_FUNCTION_ARGS)
     text       *priv_type_text = PG_GETARG_TEXT_PP(2);
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     mode = convert_language_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(LANGOID, ObjectIdGetDatum(languageoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(LanguageRelationId, languageoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(LanguageRelationId, languageoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3789,14 +3825,17 @@ has_schema_privilege_name_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = get_role_oid_or_public(NameStr(*username));
     mode = convert_schema_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(NAMESPACEOID, ObjectIdGetDatum(schemaoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(NamespaceRelationId, schemaoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(NamespaceRelationId, schemaoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3815,14 +3854,17 @@ has_schema_privilege_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = GetUserId();
     mode = convert_schema_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(NAMESPACEOID, ObjectIdGetDatum(schemaoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(NamespaceRelationId, schemaoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(NamespaceRelationId, schemaoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3863,13 +3905,16 @@ has_schema_privilege_id_id(PG_FUNCTION_ARGS)
     text       *priv_type_text = PG_GETARG_TEXT_PP(2);
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     mode = convert_schema_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(NAMESPACEOID, ObjectIdGetDatum(schemaoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(NamespaceRelationId, schemaoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(NamespaceRelationId, schemaoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -3982,14 +4027,17 @@ has_server_privilege_name_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = get_role_oid_or_public(NameStr(*username));
     mode = convert_server_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(FOREIGNSERVEROID, ObjectIdGetDatum(serverid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(ForeignServerRelationId, serverid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(ForeignServerRelationId, serverid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -4008,14 +4056,17 @@ has_server_privilege_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = GetUserId();
     mode = convert_server_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(FOREIGNSERVEROID, ObjectIdGetDatum(serverid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(ForeignServerRelationId, serverid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(ForeignServerRelationId, serverid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -4056,13 +4107,16 @@ has_server_privilege_id_id(PG_FUNCTION_ARGS)
     text       *priv_type_text = PG_GETARG_TEXT_PP(2);
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     mode = convert_server_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(FOREIGNSERVEROID, ObjectIdGetDatum(serverid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(ForeignServerRelationId, serverid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(ForeignServerRelationId, serverid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -4173,14 +4227,17 @@ has_tablespace_privilege_name_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = get_role_oid_or_public(NameStr(*username));
     mode = convert_tablespace_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(TABLESPACEOID, ObjectIdGetDatum(tablespaceoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(TableSpaceRelationId, tablespaceoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(TableSpaceRelationId, tablespaceoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -4199,14 +4256,17 @@ has_tablespace_privilege_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = GetUserId();
     mode = convert_tablespace_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(TABLESPACEOID, ObjectIdGetDatum(tablespaceoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(TableSpaceRelationId, tablespaceoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(TableSpaceRelationId, tablespaceoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -4247,13 +4307,16 @@ has_tablespace_privilege_id_id(PG_FUNCTION_ARGS)
     text       *priv_type_text = PG_GETARG_TEXT_PP(2);
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     mode = convert_tablespace_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(TABLESPACEOID, ObjectIdGetDatum(tablespaceoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(TableSpaceRelationId, tablespaceoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(TableSpaceRelationId, tablespaceoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -4363,14 +4426,17 @@ has_type_privilege_name_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = get_role_oid_or_public(NameStr(*username));
     mode = convert_type_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(TYPEOID, ObjectIdGetDatum(typeoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(TypeRelationId, typeoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(TypeRelationId, typeoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -4389,14 +4455,17 @@ has_type_privilege_id(PG_FUNCTION_ARGS)
     Oid            roleid;
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     roleid = GetUserId();
     mode = convert_type_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(TYPEOID, ObjectIdGetDatum(typeoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(TypeRelationId, typeoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(TypeRelationId, typeoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
@@ -4437,13 +4506,16 @@ has_type_privilege_id_id(PG_FUNCTION_ARGS)
     text       *priv_type_text = PG_GETARG_TEXT_PP(2);
     AclMode        mode;
     AclResult    aclresult;
+    bool        is_missing = false;

     mode = convert_type_priv_string(priv_type_text);

-    if (!SearchSysCacheExists1(TYPEOID, ObjectIdGetDatum(typeoid)))
-        PG_RETURN_NULL();
+    aclresult = object_aclcheck_ext(TypeRelationId, typeoid,
+                                    roleid, mode,
+                                    &is_missing);

-    aclresult = object_aclcheck(TypeRelationId, typeoid, roleid, mode);
+    if (is_missing)
+        PG_RETURN_NULL();

     PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
 }
diff --git a/src/include/utils/acl.h b/src/include/utils/acl.h
index 331a87d0e6..02bc4d08d6 100644
--- a/src/include/utils/acl.h
+++ b/src/include/utils/acl.h
@@ -239,8 +239,12 @@ extern void RemoveRoleFromObjectACL(Oid roleid, Oid classid, Oid objid);
 extern AclMode pg_class_aclmask(Oid table_oid, Oid roleid,
                                 AclMode mask, AclMaskHow how);

-/* generic function */
-extern AclResult object_aclcheck(Oid classid, Oid objectid, Oid roleid, AclMode mode);
+/* generic functions */
+extern AclResult object_aclcheck(Oid classid, Oid objectid,
+                                 Oid roleid, AclMode mode);
+extern AclResult object_aclcheck_ext(Oid classid, Oid objectid,
+                                     Oid roleid, AclMode mode,
+                                     bool *is_missing);

 /* special cases */
 extern AclResult pg_attribute_aclcheck(Oid table_oid, AttrNumber attnum,

Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used

От
Alexander Lakhin
Дата:
Hello Tom,

13.10.2023 00:01, Tom Lane wrote:
> I came back to this question today, and after more thought I'm going
> to cast my vote for "let's not trust SearchSysCacheExists()+lookup".
> It just seems like that's too fragile.  The case reported in this
> thread of it failing in parallel workers but not main should scare
> anybody, and the future course of development seems far more likely
> to introduce new problems than remove them.

Thanks for digging into this!

> I spent some time looking through existing SearchSysCacheExists calls,
> and I could only find two sets of routines where we seem to be
> depending on SearchSysCacheExists to protect a subsequent lookup
> somewhere else, and there isn't any lock on the object in question.
> Those are the has_foo_privilege functions discussed here, and the
> foo_is_visible functions near the bottom of namespace.c.  I'm not
> sure why we've not heard complaints traceable to the foo_is_visible
> family.  Maybe nobody has tried hard to break them, or maybe they
> are just less likely to be used in ways that are at risk.

I'll try to research/break xxx_is_visible and share my findings tomorrow.

> It turns out to not be that hard to get rid of the problem in the
> has_foo_privilege family, as per attached patches.  I've not looked
> at fixing the foo_is_visible family, but that probably ought to be a
> separate patch anyway.

Yeah, the attached patches greatly improve consistency. The only
inconsistency I've found in the patches is a missing comma here:
+ * Exported generic routine for checking a user's access privileges to an object
+ * with is_missing

You removed "this case is not a user-facing error, so elog not ereport",
and I think it's good for consistency too, as all aclmask/aclcheck
functions now use ereport().

> BTW, while nosing around I found what seems like a very nasty related
> bug.  Suppose that a catalog tuple being loaded into syscache contains
> some toasted fields.  CatalogCacheCreateEntry will flatten the tuple,
> involving fetches from toast tables that will certainly cause
> AcceptInvalidationMessages calls.  What if one of those should have
> invalidated this tuple?  We will not notice, because it's not in
> the hashtable yet.  When we do add it, we will mark it not-dead,
> meaning that the stale entry looks fine and could persist for a long
> while.

Yeah, it's an interesting case that needs investigation, IMO.
I'll try to look into this and construct the test case in the background.

> 0001's changes in acl.c are straightforward, but it's worth noting
> that the has_sequence_privilege functions hadn't gotten the memo
> about not failing when a bogus relation OID is passed.

I've looked through all functions has_*priv*_id and all they have
"if (is_missing) PG_RETURN_NULL()" now (with the patches applied).

Best regards,
Alexander



Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used

От
Alexander Lakhin
Дата:
13.10.2023 18:00, Alexander Lakhin wrote:

>
>> I spent some time looking through existing SearchSysCacheExists calls,
>> and I could only find two sets of routines where we seem to be
>> depending on SearchSysCacheExists to protect a subsequent lookup
>> somewhere else, and there isn't any lock on the object in question.
>> Those are the has_foo_privilege functions discussed here, and the
>> foo_is_visible functions near the bottom of namespace.c.  I'm not
>> sure why we've not heard complaints traceable to the foo_is_visible
>> family.  Maybe nobody has tried hard to break them, or maybe they
>> are just less likely to be used in ways that are at risk.
>
> I'll try to research/break xxx_is_visible and share my findings tomorrow.
>

I tried the script based on the initial reproducer [1]:
for ((n=1;n<=30;n++)); do
echo "ITERATION $n"

numclients=30
for ((c=1;c<=$numclients;c++)); do
cat << EOF | psql >psql_$c.log &
CREATE SCHEMA testxmlschema_$c;

SELECT format('CREATE TABLE testxmlschema_$c.test_%s (a int);', g) FROM
generate_series(1, 30) g
\\gexec

SET parallel_setup_cost = 1;
SET min_parallel_table_scan_size = '1kB';

SELECT oid FROM pg_catalog.pg_class WHERE relnamespace = 1 AND
  relkind IN ('r', 'm', 'v') AND pg_catalog.pg_table_is_visible(oid);

SELECT format('DROP TABLE testxmlschema_$c.test_%s', g) FROM
generate_series(1, 30) g
\\gexec

DROP SCHEMA testxmlschema_$c;
EOF
done
wait
grep 'ERROR:' server.log && break;
done

And couldn't get the error, for multiple runs. (Here SELECT oid ... is
based on the query executed by schema_to_xmlschema().)
But I could reliably get the error with
s/pg_table_is_visible(oid)/has_table_privilege (oid, 'SELECT')/.
So there is a difference between these two functions. And the difference is
in their costs.
If I do "ALTER FUNCTION pg_table_is_visible COST 1" before the script,
I get the error as expected.
With cost 10 I see the following plan:
  Index Scan using pg_class_relname_nsp_index on pg_class (cost=0.42..2922.38 rows=1 width=4)
    Index Cond: (relnamespace = '1'::oid)
    Filter: ((relkind = ANY ('{r,m,v}'::"char"[])) AND pg_table_is_visible(oid))

But with cost 1:
  Gather  (cost=1.00..257.10 rows=1 width=4)
    Workers Planned: 2
    Workers Launched: 2
    ->  Parallel Seq Scan on pg_class  (cost=0.00..256.00 rows=1 width=4)
          Filter: (pg_table_is_visible(oid) AND (relnamespace = '1'::oid) AND (relkind = ANY ('{r,m,v}'::"char"[])))
          Rows Removed by Filter: 405

The cost of the pg_foo_is_visible functions was increased in a80889a73.
But all the has_xxx_privilige functions have cost 1, except for
has_any_column_privilege, which cost was also increased in 7449427a1.

So to see the issue we need several ingredients:
1) The mode CATCACHE_FORCE_RELEASE enabled (may be some other way is
  possible, I don't know of);
     - Thanks to prion for that.
2) A function with the coding pattern
  "SearchSysCacheExistsX();  SearchSysCacheX();" called in a parallel worker;
     - Thanks to "debug_parallel_query = regress" and low cost of
       has_table_privilege() called by schema_to_xmlschema().
3) The catalog cache invalidated by some concurrent activity.
     - Thanks to running the test xmlmap in parallel with 16 other tests.

[1] https://www.postgresql.org/message-id/18014-28c81cb79d44295d%40postgresql.org

Best regards,
Alexander



Alexander Lakhin <exclusion@gmail.com> writes:
> 13.10.2023 18:00, Alexander Lakhin wrote:
>> I'll try to research/break xxx_is_visible and share my findings tomorrow.

> I tried the script based on the initial reproducer [1]:
> ...
> And couldn't get the error, for multiple runs. (Here SELECT oid ... is
> based on the query executed by schema_to_xmlschema().)
> But I could reliably get the error with
> s/pg_table_is_visible(oid)/has_table_privilege (oid, 'SELECT')/.
> So there is a difference between these two functions. And the difference is
> in their costs.

Ah, thanks for poking at it.  I believe the reason for the cost issue
is that your query already has a very selective indexable condition,
so it tends not to bother with a parallel scan.  I removed the
relnamespace condition:

SELECT oid FROM pg_catalog.pg_class WHERE -- relnamespace = 1 AND
  relkind IN ('r', 'm', 'v') AND pg_catalog.pg_table_is_visible(oid);

and then I get a parallel plan without messing with the cost, and it
falls over almost immediately (in a CATCACHE_FORCE_RELEASE build,
anyway).

ITERATION 1
ITERATION 2
ERROR:  cache lookup failed for relation 208139
CONTEXT:  parallel worker
ERROR:  cache lookup failed for relation 208471
CONTEXT:  parallel worker
2023-10-14 12:30:24.747 EDT [1762290] ERROR:  cache lookup failed for relation 208139
2023-10-14 12:30:24.747 EDT [1762266] ERROR:  cache lookup failed for relation 208139
2023-10-14 12:30:24.782 EDT [1762310] ERROR:  cache lookup failed for relation 208471
2023-10-14 12:30:24.782 EDT [1762252] ERROR:  cache lookup failed for relation 208471

So we do need to fix that.

            regards, tom lane



I wrote:
> So we do need to fix that.

I've fixed both sets of functions as of now.  We still need to look
into the question of whether detoasting syscache entries is safe.

            regards, tom lane



Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used

От
Alexander Lakhin
Дата:
14.10.2023 23:20, Tom Lane wrote:
> I've fixed both sets of functions as of now.  We still need to look
> into the question of whether detoasting syscache entries is safe.

Unfortunately, the answer is "no". Please look at the proof:
https://www.postgresql.org/message-id/18163-859bad19a43edcf6%40postgresql.org

Best regards,
Alexander