Обсуждение: some namespace.c refactoring

Поиск
Список
Период
Сортировка

some namespace.c refactoring

От
Peter Eisentraut
Дата:
Here are two patches that refactor the mostly repetitive "${object} is 
visible" and get_${object}_oid() functions in namespace.c.  This uses 
the functions in objectaddress.c to look up the appropriate per-catalog 
system caches and attribute numbers, similar to other refactoring 
patches I have posted recently.

In both cases, there are some functions that have special behaviors that 
are not easy to unify, so I left those alone for now.

Notes on 0001-Refactor-is-visible-functions.patch:

Among the functions that are being unified, some check temp schemas and 
some skip them.  I suppose that this is because some (most) object types 
cannot normally be in temp schemas, but this isn't made explicit in the 
code.  I added a code comment about this, the way I understand it.

That said, you can create objects explicitly in temp schemas, so I'm not 
sure the existing code is completely correct.

Notes on 0002-Refactor-common-parts-of-get_-_oid-functions.patch:

Here, I only extracted the common parts of each function but left the 
actual functions alone, because they each have to produce their own 
error message.  There is a possibility to generalize this further, 
perhaps in the style of does_not_exist_skipping(), but that looked like 
a separate step to me.

Вложения

Re: some namespace.c refactoring

От
Tom Lane
Дата:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> Here are two patches that refactor the mostly repetitive "${object} is 
> visible" and get_${object}_oid() functions in namespace.c.  This uses 
> the functions in objectaddress.c to look up the appropriate per-catalog 
> system caches and attribute numbers, similar to other refactoring 
> patches I have posted recently.

This does not look like a simple refactoring patch to me.  I have
very serious concerns first about whether it even preserves the
existing semantics, and second about whether there is a performance
penalty.

            regards, tom lane



Re: some namespace.c refactoring

От
Alvaro Herrera
Дата:
On 2023-Feb-15, Tom Lane wrote:

> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> > Here are two patches that refactor the mostly repetitive "${object} is 
> > visible" and get_${object}_oid() functions in namespace.c.  This uses 
> > the functions in objectaddress.c to look up the appropriate per-catalog 
> > system caches and attribute numbers, similar to other refactoring 
> > patches I have posted recently.
> 
> This does not look like a simple refactoring patch to me.  I have
> very serious concerns first about whether it even preserves the
> existing semantics, and second about whether there is a performance
> penalty.

I suppose there are two possible questionable angles from a performance
POV:

1. the new code uses get_object_property_data() when previously there
was a straight dereference after casting to the right struct type.  So
how expensive is that?  I think the answer to that is not great, because
it does a linear array scan on ObjectProperty.  Maybe we need a better
answer.

2. other accesses to the data are done using SysCacheGetAttr instead of
straight struct access dereferences.  I expect that most of the fields
being accessed have attcacheoff set, which allows pretty fast access to
the field in question, so it's not *that* bad.  (For cases where
attcacheoff is not set, then the original code would also have to deform
the tuple.)  Still, it's going to have nontrivial impact in any
microbenchmarking.

That said, I think most of this code is invoked for DDL, where
performance is not so critical; probably just fixing
get_object_property_data to not be so naïve would suffice.

Queries are another matter.  I can't think of a way to determine which
of these paths are used for queries, so that we can optimize more (IOW:
just plain not rewrite those); manually going through the callers seems
a bit laborious, but perhaps doable.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays), for
surely where thou typest "foo" someone someday shall type
"supercalifragilisticexpialidocious" (5th Commandment for C programmers)



Re: some namespace.c refactoring

От
Noah Misch
Дата:
On Tue, Feb 14, 2023 at 02:32:04PM +0100, Peter Eisentraut wrote:
> Notes on 0001-Refactor-is-visible-functions.patch:
> 
> Among the functions that are being unified, some check temp schemas and some
> skip them.  I suppose that this is because some (most) object types cannot
> normally be in temp schemas, but this isn't made explicit in the code.  I
> added a code comment about this, the way I understand it.
> 
> That said, you can create objects explicitly in temp schemas, so I'm not
> sure the existing code is completely correct.

> +            /*
> +             * Do not look in temp namespace for object types that don't
> +             * support temporary objects
> +             */
> +            if (!(classid == RelationRelationId || classid == TypeRelationId) &&
> +                namespaceId == myTempNamespace)
> +                continue;

I think the reason for the class-specific *IsVisible behavior is alignment
with the lookup rules that CVE-2007-2138 introduced (commit aa27977).  "CREATE
FUNCTION pg_temp.f(...)" works, but calling the resulting function requires a
schema-qualified name regardless of search_path.  Since *IsVisible functions
determine whether you can reach the object without schema qualification, their
outcomes shall reflect those CVE-2007-2138 rules.



Re: some namespace.c refactoring

От
Peter Eisentraut
Дата:
On 15.02.23 19:04, Alvaro Herrera wrote:
> That said, I think most of this code is invoked for DDL, where
> performance is not so critical; probably just fixing
> get_object_property_data to not be so naïve would suffice.

Ok, I'll look into that.

> Queries are another matter.  I can't think of a way to determine which
> of these paths are used for queries, so that we can optimize more (IOW:
> just plain not rewrite those); manually going through the callers seems
> a bit laborious, but perhaps doable.

The "is visible" functions are only used for the likes of psql, pg_dump, 
query tree reversing, object descriptions -- nothing that is in a normal 
query unless you explicitly call it.

The get_*_oid() functions are used mostly for DDL to find conflicting 
objects.  The text-search related ones can be invoked via some user 
functions, if you specify a TS object other than the default one.  I 
will check into what the impact of that is.




Re: some namespace.c refactoring

От
Peter Eisentraut
Дата:
On 15.02.23 06:04, Tom Lane wrote:
> I have
> very serious concerns first about whether it even preserves the
> existing semantics, and second about whether there is a performance
> penalty.

We can work out the performance issues, but what are your concerns about 
semantics?



Re: some namespace.c refactoring

От
Peter Eisentraut
Дата:
On 20.02.23 15:03, Peter Eisentraut wrote:
> On 15.02.23 19:04, Alvaro Herrera wrote:
>> That said, I think most of this code is invoked for DDL, where
>> performance is not so critical; probably just fixing
>> get_object_property_data to not be so naïve would suffice.
> 
> Ok, I'll look into that.

I did a variety of performance testing on this now.

I wrote a C function that calls the "is visible" functions in a tight loop:

Datum
pg_test_visible(PG_FUNCTION_ARGS)
{
     int32 count = PG_GETARG_INT32(0);
     Oid relid = PG_GETARG_OID(1);
     Oid typid = PG_GETARG_OID(2);

     for (int i = 0; i < count; i++)
     {
         RelationIsVisible(relid);
         TypeIsVisible(typid);
         //ObjectIsVisible(RelationRelationId, relid);
         //ObjectIsVisible(TypeRelationId, typid);
     }

     PG_RETURN_VOID();
}

(It's calling two different ones to defeat the caching in 
get_object_property_data().)

Here are some run times:

unpatched:

select pg_test_visible(100_000_000, 'pg_class', 'int4');
Time: 4536.747 ms (00:04.537)

select pg_test_visible(100_000_000, 'tenk1', 'widget');
Time: 10828.802 ms (00:10.829)

(Note that the "is visible" functions special case system catalogs.)

patched:

select pg_test_visible(100_000_000, 'pg_class', 'int4');
Time: 11409.948 ms (00:11.410)

select pg_test_visible(100_000_000, 'tenk1', 'widget');
Time: 18649.496 ms (00:18.649)

So, it's slower, but it's not clear whether it matters in practice, 
considering this test.

I also wondered if this is visible through a normal external function 
call, so I tried

do $$ begin perform pg_get_statisticsobjdef(28999) from 
generate_series(1, 1_000_000); end $$;

(where that is the OID of the first object from select * from 
pg_statistic_ext; in the regression database).

unpatched:

Time: 6952.259 ms (00:06.952)

patched (first patch only):

Time: 6993.655 ms (00:06.994)

patched (both patches):

Time: 7114.290 ms (00:07.114)

So there is some visible impact, but again, the test isn't realistic.

Then I tried a few ways to make get_object_property_data() faster.  I 
tried building a class_id+index cache that is qsort'ed (once) and then 
bsearch'ed, that helped only minimally, not enough to make up the 
difference.  I also tried just searching the class_id+index cache 
linearly, hoping maybe that if the cache is smaller it would be more 
efficient to access, but that actually made things (minimally) worse, 
probably because of the indirection.  So it might be hard to get much 
more out of this.  I also thought about PerfectHash, but I didn't code 
that up yet.

Another way would be to not use get_object_property_data() at all but 
write a "common" function that we pass in all it needs hardcodedly, like

bool
RelationIsVisible(Oid relid)
{
     return IsVisible_common(RELOID,
                             Anum_pg_class_relname
                             Anum_pg_class_relnamespace);
}

This would still save a lot of duplicate code.

But again, I don't think the micro-performance really matters here.




Re: some namespace.c refactoring

От
vignesh C
Дата:
On Thu, 23 Feb 2023 at 16:38, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> On 20.02.23 15:03, Peter Eisentraut wrote:
> > On 15.02.23 19:04, Alvaro Herrera wrote:
> >> That said, I think most of this code is invoked for DDL, where
> >> performance is not so critical; probably just fixing
> >> get_object_property_data to not be so naïve would suffice.
> >
> > Ok, I'll look into that.
>
> I did a variety of performance testing on this now.
>
> I wrote a C function that calls the "is visible" functions in a tight loop:
>
> Datum
> pg_test_visible(PG_FUNCTION_ARGS)
> {
>      int32 count = PG_GETARG_INT32(0);
>      Oid relid = PG_GETARG_OID(1);
>      Oid typid = PG_GETARG_OID(2);
>
>      for (int i = 0; i < count; i++)
>      {
>          RelationIsVisible(relid);
>          TypeIsVisible(typid);
>          //ObjectIsVisible(RelationRelationId, relid);
>          //ObjectIsVisible(TypeRelationId, typid);
>      }
>
>      PG_RETURN_VOID();
> }
>
> (It's calling two different ones to defeat the caching in
> get_object_property_data().)
>
> Here are some run times:
>
> unpatched:
>
> select pg_test_visible(100_000_000, 'pg_class', 'int4');
> Time: 4536.747 ms (00:04.537)
>
> select pg_test_visible(100_000_000, 'tenk1', 'widget');
> Time: 10828.802 ms (00:10.829)
>
> (Note that the "is visible" functions special case system catalogs.)
>
> patched:
>
> select pg_test_visible(100_000_000, 'pg_class', 'int4');
> Time: 11409.948 ms (00:11.410)
>
> select pg_test_visible(100_000_000, 'tenk1', 'widget');
> Time: 18649.496 ms (00:18.649)
>
> So, it's slower, but it's not clear whether it matters in practice,
> considering this test.
>
> I also wondered if this is visible through a normal external function
> call, so I tried
>
> do $$ begin perform pg_get_statisticsobjdef(28999) from
> generate_series(1, 1_000_000); end $$;
>
> (where that is the OID of the first object from select * from
> pg_statistic_ext; in the regression database).
>
> unpatched:
>
> Time: 6952.259 ms (00:06.952)
>
> patched (first patch only):
>
> Time: 6993.655 ms (00:06.994)
>
> patched (both patches):
>
> Time: 7114.290 ms (00:07.114)
>
> So there is some visible impact, but again, the test isn't realistic.
>
> Then I tried a few ways to make get_object_property_data() faster.  I
> tried building a class_id+index cache that is qsort'ed (once) and then
> bsearch'ed, that helped only minimally, not enough to make up the
> difference.  I also tried just searching the class_id+index cache
> linearly, hoping maybe that if the cache is smaller it would be more
> efficient to access, but that actually made things (minimally) worse,
> probably because of the indirection.  So it might be hard to get much
> more out of this.  I also thought about PerfectHash, but I didn't code
> that up yet.
>
> Another way would be to not use get_object_property_data() at all but
> write a "common" function that we pass in all it needs hardcodedly, like
>
> bool
> RelationIsVisible(Oid relid)
> {
>      return IsVisible_common(RELOID,
>                              Anum_pg_class_relname
>                              Anum_pg_class_relnamespace);
> }
>
> This would still save a lot of duplicate code.
>
> But again, I don't think the micro-performance really matters here.

I'm seeing that there has been no activity in this thread for almost a
year now, I'm planning to close this in the current commitfest unless
someone is planning to take it forward.

Regards,
Vignesh