Discussion: Re: Extended Statistics set/restore/clear functions.

Re: Extended Statistics set/restore/clear functions.

From: Tomas Vondra
Date:
Hi,

Thanks for continuing to work on this.

On 1/22/25 19:17, Corey Huinker wrote:
> This is a separate thread for work started in [1] but focused purely on
> getting the following functions working:
> 
> * pg_set_extended_stats
> * pg_clear_extended_stats
> * pg_restore_extended_stats
> 
> These functions are analogous to their relation/attribute counterparts,
> use the same calling conventions, and build upon the same basic
> infrastructure.
> 
> I think it is important that we get these implemented because they close
> the gap that was left in terms of the ability to modify existing
> statistics and to round out the work being done to carry over statistics
> via dump/restore and pg_upgrade in [1].
> 
> The purpose of each patch is as follows (adapted from previous thread):
> 
> 0001 - This makes the input function for pg_ndistinct functional.
> 
> 0002 - This makes the input function for pg_dependencies functional.
> 

I only quickly skimmed the patches, but a couple comments:


1) I think it makes perfect sense to use the JSON parsing for the input
functions, but maybe it'd be better to adjust the format a bit to make
that even easier?

Right now the JSON "keys" have structure, which means we need some ad
hoc parsing. Maybe we should make it "proper JSON" by moving that into
separate key/value, e.g. for ndistinct we might replace this:

  {"1, 2": 2323, "1, 3" : 3232, ...}

with this:

  [ {"keys": [1, 2], "ndistinct" : 2323},
    {"keys": [1, 3], "ndistinct" : 3232},
    ... ]

so a regular JSON array of objects, with keys an "array". And similarly
for dependencies.
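
For pg_dependencies the same idea would apply, e.g. (sketch only) turning
the current {"1 => 2": 1.000000, ...} into something like:

  [ {"attributes": [1], "dependency": 2, "degree": 1.000000},
    ... ]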

Yes, it's more verbose, but maybe better for "mechanical" processing?


2) Do we need some sort of validation? Perhaps this was discussed in the
other thread and I missed that, but isn't it a problem that we happily
accept e.g. this?

  {"6666, 6666" : 1, "1, -222": 14, ...}

That has duplicate keys with bogus attribute numbers, stats on (bogus)
system attributes, etc. I suspect this may easily cause problems during
planning (if it happens to visit those statistics).

Maybe that's acceptable - ultimately the user could import something
broken in a much subtler way, of course. But the pg_set_attribute_stats
seems somewhat more protected against this, because it gets the attr as
a separate argument.


I recall I wished to have the attnum in the output function, but that
was not quite possible because we don't know the relid (and thus the
descriptor) in that function.

Is it a good idea to rely on the input/output format directly? How will
that deal with cross-version differences? Doesn't it mean the in/out
format is effectively fixed, or at least has to be backwards compatible
(i.e. new version has to handle any value from older versions)?

Or what if I want to import the stats for a table with slightly
different structure (e.g. because dump/restore skips dropped columns).
Won't that be a problem with the format containing raw attnums? Or is
this a use case we don't expect to work?

For the per-attribute stats it's probably fine, because that's mostly
just a collection of regular data types (scalar values or arrays of
values, ...) and we're not modifying them except for maybe adding new
fields. But extended stats seem more complex, so maybe it's different?

I remember a long discussion about the format at the very beginning of
this patch series, and the conclusion clearly was to have a function
that imports stats for one attribute at a time. And that seems to be
working fine, but extended stats values have more internal structure, so
perhaps they need to do something more complicated.

> 0003 - Makes several static functions in attribute_stats.c public for use
> by extended stats. One of those is get_stat_attr_type(), which in the last
> patchset was modified to take an attribute name rather than attnum, thus
> saving a syscache lookup. However, extended stats identifies attributes by
> attnum not name, so that optimization had to be set aside, at least
> temporarily.
> 
> 0004 - These implement the functions pg_set_extended_stats(),
> pg_clear_extended_stats(), and pg_restore_extended_stats() and behave like
> their relation/attribute equivalents. If we can get these committed and
> used by pg_dump, then we don't have to debate how to handle post-upgrade
> steps for users who happen to have extended stats vs the approximately
> 99.75% of users who do not have extended stats.
> 

I see there's a couple of MCV-specific functions in extended_stats.c.
Shouldn't those go into mcv.c instead?

FWIW there's a bunch of whitespace issues during git apply.

> This patchset does not presently include any work to integrate these
> functions into pg_dump, but may do so once that work is settled, or it
> may become its own thread.
> 

OK. Thanks for the patch!


regards

-- 
Tomas Vondra




Re: Extended Statistics set/restore/clear functions.

From: Corey Huinker
Date:
On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas@vondra.me> wrote:
Hi,

Thanks for continuing to work on this.

On 1/22/25 19:17, Corey Huinker wrote:
> This is a separate thread for work started in [1] but focused purely on
> getting the following functions working:
>
> * pg_set_extended_stats
> * pg_clear_extended_stats
> * pg_restore_extended_stats
>
> These functions are analogous to their relation/attribute counterparts,
> use the same calling conventions, and build upon the same basic
> infrastructure.
>
> I think it is important that we get these implemented because they close
> the gap that was left in terms of the ability to modify existing
> statistics and to round out the work being done to carry over statistics
> via dump/restore and pg_upgrade in [1].
>
> The purpose of each patch is as follows (adapted from previous thread):
>
> 0001 - This makes the input function for pg_ndistinct functional.
>
> 0002 - This makes the input function for pg_dependencies functional.
>

I only quickly skimmed the patches, but a couple comments:


1) I think it makes perfect sense to use the JSON parsing for the input
functions, but maybe it'd be better to adjust the format a bit to make
that even easier?

Right now the JSON "keys" have structure, which means we need some ad
hoc parsing. Maybe we should make it "proper JSON" by moving that into
separate key/value, e.g. for ndistinct we might replace this:

  {"1, 2": 2323, "1, 3" : 3232, ...}

with this:

  [ {"keys": [1, 2], "ndistinct" : 2323},
    {"keys": [1, 3], "ndistinct" : 3232},
    ... ]

so a regular JSON array of objects, with keys an "array". And similarly
for dependencies.

That is almost exactly what I did back when the stats import functions took a nested JSON argument.

The biggest problem with changing that format is that the old format would still show up in the system being exported, so we would have to process that format as well as the new one.
 
Yes, it's more verbose, but maybe better for "mechanical" processing?

It absolutely would be better for processing, but we'd still have to read the old format from older systems. I suppose the pg_dump code could do some SQL gymnastics to convert the old json-but-sad format into the processing-friendly format of the future, and I could easily adapt what I wrote over a year ago to that task. Ultimately it's a matter of getting the community behind changing the output format to enable a better input format.
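
Something like this would do for ndistinct (sketch only, assuming the
current comma-separated-keys text format and the array-of-objects format
proposed above):

    SELECT jsonb_agg(jsonb_build_object(
             'keys', (SELECT jsonb_agg(k::int)
                        FROM unnest(string_to_array(d.key, ', ')) AS k),
             'ndistinct', d.value::numeric))
      FROM jsonb_each_text('{"1, 2": 2323, "1, 3": 3232}'::jsonb)
             AS d(key, value);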

 
2) Do we need some sort of validation? Perhaps this was discussed in the
other thread and I missed that, but isn't it a problem that we happily
accept e.g. this?

  {"6666, 6666" : 1, "1, -222": 14, ...}

That has duplicate keys with bogus attribute numbers, stats on (bogus)
system attributes, etc. I suspect this may easily cause problems during
planning (if it happens to visit those statistics).

We used to have _lots_ of validation for data quality issues, much of which was removed at the request of reviewers. That discussion, however, was mostly about the health of the statistics; these are standalone data types, so maybe they're held to a higher standard. If so, what sort of checks do you think would be needed and/or wanted?

So far, I can imagine the following:

* no negative attnums in key list
* no duplicate attnums in key list

anything else?
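
For example (hypothetical, assuming the 0001 pg_ndistinct input function),
those two checks would reject inputs like:

    SELECT '{"1, 1": 50}'::pg_ndistinct;     -- duplicate attnum in key list
    SELECT '{"1, -222": 14}'::pg_ndistinct;  -- negative attnum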

Maybe that's acceptable - ultimately the user could import something
broken in a much subtler way, of course. But the pg_set_attribute_stats
seems somewhat more protected against this, because it gets the attr as
a separate argument.

The datatype itself exists in isolation, but once we've created a valid pg_ndistinct or pg_dependencies, there's nothing stopping us from validating the values in it against the statistics object and the relation it belongs to. That might, however, meet the same resistance I got to, say, ensuring that frequency lists were monotonically decreasing.
 
I recall I wished to have the attnum in the output function, but that
was not quite possible because we don't know the relid (and thus the
descriptor) in that function.

Is it a good idea to rely on the input/output format directly? How will
that deal with cross-version differences? Doesn't it mean the in/out
format is effectively fixed, or at least has to be backwards compatible
(i.e. new version has to handle any value from older versions)?

Presently there are no cross-version differences, though I addressed the pros and cons of changing the format above. No matter what, the burden of supplying a valid format is on the user in the case of pg_set_extended_stats(). pg_restore_extended_stats() has a server version number associated with it, so the option of handling a format change could be baked in, but then we're doing version tests and input typecasts like we do with ANYARRAY types. Not impossible, but definitely more work.
 
Or what if I want to import the stats for a table with slightly
different structure (e.g. because dump/restore skips dropped columns).
Won't that be a problem with the format containing raw attnums? Or is
this a use case we don't expect to work?

The family of pg_set_*_stats functions expects the input to be meaningful and correct for that relation on that server version. Any attnum translation would have to be done by the user to adapt to the new or changed relation.

The family of pg_restore_*_stats functions is designed to be forward compatible and to work across versions, but for basically the same relation, or a relation of the same shape. Basically, they're for pg_restore and pg_upgrade, so no changes in attnums would be expected.
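
For instance, a dump-emitted call might look something like this (a
hypothetical sketch; the parameter names follow the kwargs convention of
the existing pg_restore_*_stats functions, but nothing is settled):

    SELECT pg_restore_extended_stats(
        'statistics_schemaname', 'public',
        'statistics_name', 'my_stats',
        'version', 180000,
        'n_distinct', '{"1, 2": 2323, "1, 3": 3232}'
    );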
 

For the per-attribute stats it's probably fine, because that's mostly
just a collection of regular data types (scalar values or arrays of
values, ...) and we're not modifying them except for maybe adding new
fields. But extended stats seem more complex, so maybe it's different?

I had that working by attname matching way back in the early days, but it would involve creating an intermediate format for translating the attnums to attnames on export, and then re-translating them on the way back in.

I suppose someone could write the following utility functions

    pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) -> json
    pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) -> pg_ndistinct

and that would bridge the gap for the special case where you want to adapt pg_ndistinct from one table structure to a slightly different one.

 

I remember a long discussion about the format at the very beginning of
this patch series, and the conclusion clearly was to have a function
that imports stats for one attribute at a time. And that seems to be
working fine, but extended stats values have more internal structure, so
perhaps they need to do something more complicated.

I believe this *is* doing something more complicated, especially with the MCVList, though I was very pleased to see how much of the existing infrastructure I was able to reuse.


 

> 0003 - Makes several static functions in attribute_stats.c public for use
> by extended stats. One of those is get_stat_attr_type(), which in the last
> patchset was modified to take an attribute name rather than attnum, thus
> saving a syscache lookup. However, extended stats identifies attributes by
> attnum not name, so that optimization had to be set aside, at least
> temporarily.
>
> 0004 - These implement the functions pg_set_extended_stats(),
> pg_clear_extended_stats(), and pg_restore_extended_stats() and behave like
> their relation/attribute equivalents. If we can get these committed and
> used by pg_dump, then we don't have to debate how to handle post-upgrade
> steps for users who happen to have extended stats vs the approximately
> 99.75% of users who do not have extended stats.
>

I see there's a couple of MCV-specific functions in extended_stats.c.
Shouldn't those go into mcv.c instead?

I wanted to put it there, but there was a reason I didn't and I've now forgotten what it was. I'll make an effort to relocate it to mcv.c.

For that matter, it might make sense to break out the expressions code into its own file, because every other stat attribute has its own. Thoughts on that?
 

FWIW there's a bunch of whitespace issues during git apply.

+1
 

OK. Thanks for the patch!

Thanks for the feedback, please keep it coming. I think it's really important that extended stats, though used rarely, be included in our dump/restore/upgrade changes so as to make for a more consistent user experience.
 

Re: Extended Statistics set/restore/clear functions.

From: Tomas Vondra
Date:

On 1/23/25 15:51, Corey Huinker wrote:
> On Wed, Jan 22, 2025 at 5:50 PM Tomas Vondra <tomas@vondra.me> wrote:
> 
>     Hi,
> 
>     Thanks for continuing to work on this.
> 
>     On 1/22/25 19:17, Corey Huinker wrote:
>     > This is a separate thread for work started in [1] but focused
>     purely on
>     > getting the following functions working:
>     >
>     > * pg_set_extended_stats
>     > * pg_clear_extended_stats
>     > * pg_restore_extended_stats
>     >
>     > These functions are analogous to their relation/attribute
>     counterparts,
>     > use the same calling conventions, and build upon the same basic
>     > infrastructure.
>     >
>     > I think it is important that we get these implemented because they
>     close
>     > the gap that was left in terms of the ability to modify existing
>     > statistics and to round out the work being done to carry over
>     statistics
>     > via dump/restore and pg_upgrade in [1].
>     >
>     > The purpose of each patch is as follows (adapted from previous
>     thread):
>     >
>     > 0001 - This makes the input function for pg_ndistinct functional.
>     >
>     > 0002 - This makes the input function for pg_dependencies functional.
>     >
> 
>     I only quickly skimmed the patches, but a couple comments:
> 
> 
>     1) I think it makes perfect sense to use the JSON parsing for the input
>     functions, but maybe it'd be better to adjust the format a bit to make
>     that even easier?
> 
>     Right now the JSON "keys" have structure, which means we need some ad
>     hoc parsing. Maybe we should make it "proper JSON" by moving that into
>     separate key/value, e.g. for ndistinct we might replace this:
> 
>       {"1, 2": 2323, "1, 3" : 3232, ...}
> 
>     with this:
> 
>       [ {"keys": [1, 2], "ndistinct" : 2323},
>         {"keys": [1, 3], "ndistinct" : 3232},
>         ... ]
> 
>     so a regular JSON array of objects, with keys an "array". And similarly
>     for dependencies.
> 
> 
> That is almost exactly what I did back when the stats import functions
> took a nested JSON argument.
> 
> The biggest problem with changing that format is that the old format
> would still show up in the system being exported, so we would have to
> process that format as well as the new one.
>  
> 
>     Yes, it's more verbose, but maybe better for "mechanical" processing?
> 
> 
> It absolutely would be better for processing, but we'd still have to
> read the old format from older systems. I suppose the pg_dump code
> could do some SQL gymnastics to convert the old json-but-sad format into
> the processing-friendly format of the future, and I could easily adapt
> what I wrote over a year ago to that task. Ultimately it's a matter of
> getting the community behind changing the output format to enable a
> better input format.
> 

D'oh! I always forget about the backwards compatibility issue, i.e. that
we still need to ingest values from already released versions. Yeah,
that makes the format change less beneficial.

>  
> 
>     2) Do we need some sort of validation? Perhaps this was discussed in the
>     other thread and I missed that, but isn't it a problem that we happily
>     accept e.g. this?
> 
>       {"6666, 6666" : 1, "1, -222": 14, ...}
> 
>     That has duplicate keys with bogus attribute numbers, stats on (bogus)
>     system attributes, etc. I suspect this may easily cause problems during
>     planning (if it happens to visit those statistics).
> 
> 
> We used to have _lots_ of validation for data quality issues, much of
> which was removed at the request of reviewers. That discussion, however,
> was mostly about the health of the statistics; these are standalone data
> types, so maybe they're held to a higher standard. If so, what sort of
> checks do you think would be needed and/or wanted?
> 
> So far, I can imagine the following:
> 
> * no negative attnums in key list
> * no duplicate attnums in key list
> 
> anything else?
> 

Yeah, I recall there were a lot of checks initially and we dropped them
over time. I'm not asking to reinstate all of those thorough checks.

At this point I was really thinking only about validating the attnums,
i.e. making sure each one is a valid attribute in the table / statistics
object. That is something pg_set_attribute_stats() enforces too, thanks
to having a separate argument for the attribute name.

That's where I'd stop. I don't want to do checks on the statistics
content, like verifying the frequencies in the MCV sum up to 1.0 or
stuff like that. I think we're not doing that for pg_set_attribute_stats
either (and I'd bet one could cause a lot of "fun" this way).

> 
>     Maybe that's acceptable - ultimately the user could import something
>     broken in a much subtler way, of course. But the pg_set_attribute_stats
>     seems somewhat more protected against this, because it gets the attr as
>     a separate argument.
> 
> 
> The datatype itself exists in isolation, but once we've created a valid
> pg_ndistinct or pg_dependencies, there's nothing stopping us from
> validating the values in it against the statistics object and the
> relation it belongs to. That might, however, meet the same resistance I
> got to, say, ensuring that frequency lists were monotonically decreasing.
>  

Understood. IMHO it's fine to say we're not validating that the statistics
are "consistent", but I think we should check they match the definition.

> 
>     I recall I wished to have the attnum in the output function, but that
>     was not quite possible because we don't know the relid (and thus the
>     descriptor) in that function.
> 
>     Is it a good idea to rely on the input/output format directly? How will
>     that deal with cross-version differences? Doesn't it mean the in/out
>     format is effectively fixed, or at least has to be backwards compatible
>     (i.e. new version has to handle any value from older versions)?
> 
> 
> Presently there are no cross-version differences, though I addressed the
> pros and cons of changing the format above. No matter what, the burden
> of supplying a valid format is on the user in the case of
> pg_set_extended_stats(). pg_restore_extended_stats() has a server
> version number associated with it, so the option of handling a format
> change could be baked in, but then we're doing version tests and input
> typecasts like we do with ANYARRAY types. Not impossible, but definitely
> more work.
> 

OK, makes sense.
 
> 
>     Or what if I want to import the stats for a table with slightly
>     different structure (e.g. because dump/restore skips dropped columns).
>     Won't that be a problem with the format containing raw attnums? Or is
>     this a use case we don't expect to work?
> 
> 
> The family of pg_set_*_stats functions expects the input to be
> meaningful and correct for that relation on that server version. Any
> attnum translation would have to be done by the user to adapt to the new
> or changed relation.
> 
> The family of pg_restore_*_stats functions is designed to be forward
> compatible and to work across versions, but for basically the same
> relation, or a relation of the same shape. Basically, they're for
> pg_restore and pg_upgrade, so no changes in attnums would be expected.
>  
> 

OK

> 
>     For the per-attribute stats it's probably fine, because that's mostly
>     just a collection of regular data types (scalar values or arrays of
>     values, ...) and we're not modifying them except for maybe adding new
>     fields. But extended stats seem more complex, so maybe it's different?
> 
> 
> I had that working by attname matching way back in the early days, but
> it would involve creating an intermediate format for translating the
> attnums to attnames on export, and then re-translating them on the way
> back in.
> 
> I suppose someone could write the following utility functions
> 
>     pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) -> json
>     pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) -> pg_ndistinct
> 
> and that would bridge the gap for the special case where you want to
> adapt pg_ndistinct from one table structure to a slightly different one.
> 
> 

OK
 
> 
> 
>     I remember a long discussion about the format at the very beginning of
>     this patch series, and the conclusion clearly was to have a function
>     that imports stats for one attribute at a time. And that seems to be
>     working fine, but extended stats values have more internal structure, so
>     perhaps they need to do something more complicated.
> 
> 
> I believe this *is* doing something more complicated, especially with
> the MCVList, though I was very pleased to see how much of the existing
> infrastructure I was able to reuse.
> 
> 

OK

>  
> 
> 
>     > 0003 - Makes several static functions in attribute_stats.c public
>     for use
>     > by extended stats. One of those is get_stat_attr_type(), which in
>     the last
>     > patchset was modified to take an attribute name rather than
>     attnum, thus
>     > saving a syscache lookup. However, extended stats identifies
>     attributes by
>     > attnum not name, so that optimization had to be set aside, at least
>     > temporarily.
>     >
>     > 0004 - These implement the functions pg_set_extended_stats(),
>     > pg_clear_extended_stats(), and pg_restore_extended_stats() and
>     behave like
>     > their relation/attribute equivalents. If we can get these
>     committed and
>     > used by pg_dump, then we don't have to debate how to handle post-
>     upgrade
>     > steps for users who happen to have extended stats vs the approximately
>     > 99.75% of users who do not have extended stats.
>     >
> 
>     I see there's a couple of MCV-specific functions in extended_stats.c.
>     Shouldn't those go into mcv.c instead?
> 
> 
> I wanted to put it there, but there was a reason I didn't and I've now
> forgotten what it was. I'll make an effort to relocate it to mcv.c.
> 
> For that matter, it might make sense to break out the expressions code
> into its own file, because every other stat attribute has its own.
> Thoughts on that?
>  

+1 to that, if it reduces unnecessary code duplication

> 
> 
>     FWIW there's a bunch of whitespace issues during git apply.
> 
> 
> +1
>  
> 
> 
>     OK. Thanks for the patch!
> 
> 
> Thanks for the feedback, please keep it coming. I think it's really
> important that extended stats, though used rarely, be included in our
> dump/restore/upgrade changes so as to make for a more consistent user
> experience.
>  

I agree, and I appreciate you working on it.



-- 
Tomas Vondra




Re: Extended Statistics set/restore/clear functions.

From: Corey Huinker
Date:


> * no negative attnums in key list

Disregard this suggestion - a negative attnum means the Nth expression in the extended stats object, though it boggles the mind how we could have 222 expressions...

 
> * no duplicate attnums in key list

This one is still live; I'm considering it.

At this point I was really thinking only about validating the attnums,
i.e. making sure each one is a valid attribute in the table / statistics
object. That is something pg_set_attribute_stats() enforces too, thanks
to having a separate argument for the attribute name.

That's where I'd stop. I don't want to do checks on the statistics
content, like verifying the frequencies in the MCV sum up to 1.0 or
stuff like that. I think we're not doing that for pg_set_attribute_stats

Agreed. 
 
either (and I'd bet one could cause a lot of "fun" this way).

If by "fun" you mean "create a fuzzing tool", then yes.

As an aside, the "big win" in all these functions is the ability to dump a database --no-data but keep all the schema and statistics, which allows checking query plans on existing databases with sensitive data without actually exposing the data (except MCVs, obviously) or spending the I/O to load it.
 
Understood. IMHO it's fine to say we're not validating that the statistics
are "consistent", but I think we should check they match the definition.

+1

 
> I suppose someone could write the following utility functions
>
>     pg_xlat_ndistinct_to_attnames(relation reloid, ndist pg_ndistinct) -> json
>     pg_xlat_ndistinct_from_attnames(relation reloid, ndist json) -> pg_ndistinct
>
> and that would bridge the gap for the special case where you want to
> adapt pg_ndistinct from one table structure to a slightly different one.
>
>

OK

As they'll be pure-SQL functions, I'll likely post the definitions here, but not put them into a patch unless it draws interest.
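
A first cut might look like this (sketch only; it assumes the current
comma-separated-keys text format and ignores expression references, i.e.
negative attnums):

    CREATE FUNCTION pg_xlat_ndistinct_to_attnames(relation regclass,
                                                  ndist pg_ndistinct)
    RETURNS jsonb
    LANGUAGE sql STABLE
    AS $$
      SELECT jsonb_object_agg(
               (SELECT string_agg(a.attname, ', ' ORDER BY k.ord)
                  FROM unnest(string_to_array(d.key, ', '))
                         WITH ORDINALITY AS k(attnum, ord)
                  JOIN pg_attribute a
                    ON a.attrelid = relation
                   AND a.attnum = k.attnum::int2),
               d.value::numeric)
        FROM jsonb_each_text(ndist::text::jsonb) AS d(key, value)
    $$;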

> For that matter, it might make sense to break out the expressions code
> into its own file, because every other stat attribute has its own.
> Thoughts on that?
>  

+1 to that, if it reduces unnecessary code duplication

I'm uncertain that it actually would deduplicate any code, but I'll certainly try.
 

Re: Extended Statistics set/restore/clear functions.

From: Corey Huinker
Date:
I see there's a couple of MCV-specific functions in extended_stats.c.
Shouldn't those go into mcv.c instead?

I wanted to put it there, but there was a reason I didn't and I've now forgotten what it was. I'll make an effort to relocate it to mcv.c.

Looking at it now, I see that check_mcvlist_array() expects access to extarginfo, so either I add a bunch of Datum-aware and extarginfo-aware code to mcv.c, or I do those checks outside of import_mcvlist() and leave check_mcvlist_array() where it is. The latter is my current inclination, though obviously it's not a perfectly clean break.