Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used

Поиск

Список

Период

Сортировка

От	Alexander Lakhin
Тема	Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Дата	21 июля 2023 г. 19:00:01
Msg-id	00e24aa6-5d03-99b6-ab41-495a35fc93f6@gmail.com обсуждение исходный текст
Ответ на	Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы	Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Список	pgsql-bugs

Дерево обсуждения

Hello Tom,

Thank you for your considerations!

21.07.2023 20:20, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>> I think that we need to determine the level where the problem that should
>> be fixed is:
>> 1) test xmlmap fails sporadically due to the catalog changes caused by
>>    parallel tests activity
>> 2) schema_to_xmlschemaX() can fail when parallel workers are used
>> 3) has_table_privilegeX() can fail sporadically when executed within a
>>    parallel worker
>> 4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
>>    when repeated in a parallel worker
> Yeah, that's not immediately obvious.  IIUC, the situation we are
> looking at is that SearchSysCacheExists can succeed even though the
> tuple we found is already dead at the instant that the function
> exits (thanks to absorption of inval messages during relation_close).
> The fact that that only happens in parallel workers is pure chance
> really.  It is not okay for has_table_privilegeX to depend on the
> fact that the surrounding query already has some lock on pg_class.
> So this means that the approach has_table_privilegeX uses of
> assuming that successful SearchSysCacheExists means it can call
> pg_class_aclcheck without fear is just broken.
>
> If we suppose that that assumption is only being made in the
> has_foo_privilege functions, then one way we could fix it is to extend
> the API of pg_class_aclcheck etc to add a no-error-on-not-found flag,
> and get rid of the separate SearchSysCacheExists check.  However,
> I can't avoid the suspicion that we have other places assuming the
> same thing.

Running through SearchSysCacheExistsX() calls, I've found an interesting
(and optimistic) comment in src/backend/catalog/namespace.c:

  * ...  The underlying IsVisible functions
  * always use an up-to-date snapshot and so might see the object as already
  * gone when it's still visible to the transaction snapshot.  (There is no race
  * condition in the current coding because we don't accept sinval messages
  * between the SearchSysCacheExists test and the subsequent lookup.)
  */

Datum
pg_table_is_visible(PG_FUNCTION_ARGS)
{
     Oid         oid = PG_GETARG_OID(0);

     if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(oid)))
         PG_RETURN_NULL();

     PG_RETURN_BOOL(RelationIsVisible(oid));
}

...

bool
RelationIsVisible(Oid relid)
{
     HeapTuple   reltup;
     Form_pg_class relform;
     Oid         relnamespace;
     bool        visible;

     reltup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(reltup))
         elog(ERROR, "cache lookup failed for relation %u", relid);

So that's exactly the same coding pattern, thus fixing pg_class_aclcheck
is not an option, apparently.

>    So I think what we really ought to be doing is one
> of two things:
>
> 1. Hack SearchSysCacheExists to account for this issue, by making it
> loop if it finds a syscache entry but sees that the entry is already
> dead.  (We have to loop, not just return false, in case the row was
> updated rather than deleted.)  Maybe all the syscache lookup
> functions need to do likewise; it's certainly not intuitively
> reasonable for them to return already-known-stale entries.
>
> 2. Figure out how come we are executing a cache inval on the way
> out of syscache entry creation, and stop that from happening.

I wrote about LockRelationOid() before, and I still think that invalidation
messages are processed in that call (reached via SearchCatCacheMiss() ->
systable_beginscan() -> index_open() -> relation_open(), not via
relation_close()). So it seems that SearchSysCacheX calls find an entry,
that is not dead, but that entry can be dead (not found) for the next call
if invalidation messages are processed.

> I like #2 better if it's not hard to do cleanly.  However, I'm not
> quite sure how we are getting to an inval during relation close;
> maybe that's not something we want to prevent.

Yes, there is a detailed comment in LockRelationOid(), that explains why
AcceptInvalidationMessages() is called. (I've tried to remove that call
now, just for testing, and get 6 tests failed during `make check`.)

Best regards,
Alexander

В списке pgsql-bugs по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used