Discussion: Statistics Import and Export


Statistics Import and Export

From
Corey Huinker
Date:
pg_stats_export is a view that aggregates pg_statistic data by relation
oid and stores all of the column statistical data in a system-independent (i.e.
no oids, collation information removed, all MCV values rendered as text)
jsonb format, along with the relation's relname, reltuples, and relpages
from pg_class, as well as the schemaname from pg_namespace.
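
For example, inspecting or capturing one relation's stats might look like this (a sketch based on the description above; "stats" is an assumed name for the jsonb column, not necessarily what the patch calls it):

  -- look at the exported stats for one relation
  SELECT schemaname, relname, reltuples, relpages, stats
    FROM pg_stats_export
   WHERE schemaname = 'public'
     AND relname = 'orders';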

pg_import_rel_stats is a function which takes a relation oid,
server_version_num, num_tuples, num_pages, and a column_stats jsonb in
a format matching that of pg_stats_export, and applies that data to
the pg_class and pg_statistic rows for the specified relation.

The most common use-case for such a function is in upgrades and
dump/restore, wherein the upgrade process would capture the output of
pg_stats_export into a regular table, perform the upgrade, and then
join that data to the existing pg_class rows, updating statistics to be
a close approximation of what they were just prior to the upgrade. The
hope is that these statistics are better than the early stages of
--analyze-in-stages and can be applied faster, thus reducing system
downtime.
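
In rough terms, that workflow might look like the following (a sketch only; the "stats" and "server_version_num" column names are assumptions about the export view, and the argument order follows the description above):

  -- before the upgrade: capture the exported stats in a regular table
  CREATE TABLE saved_stats AS SELECT * FROM pg_stats_export;

  -- after the upgrade: join back to pg_class and re-apply
  SELECT pg_import_rel_stats(c.oid, s.server_version_num,
                             s.reltuples, s.relpages, s.stats)
    FROM saved_stats s
    JOIN pg_namespace n ON n.nspname = s.schemaname
    JOIN pg_class c ON c.relnamespace = n.oid AND c.relname = s.relname;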

The values applied to pg_class are done inline, which is to say
non-transactionally. The values applied to pg_statistic are applied
transactionally, as if an ANALYZE operation were reading from a
cheat-sheet.

This function and view will need to be followed up with corresponding
ones for pg_statistic_ext and pg_statistic_ext_data, and while we would
likely never backport the import functions, we can have user programs
do the same work as the export views such that statistics can be brought
forward from versions as far back as there is jsonb to store it.

While the primary purpose of the import function(s) is to reduce downtime
during an upgrade, it is not hard to see that they could also be used to
facilitate tuning and development operations, asking questions like "how might
this query plan change if this table has 1000x rows in it?", without actually
putting those rows into the table.
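
For instance, something along these lines (illustrative only; note that, per the above, the pg_class changes would not roll back):

  BEGIN;
  -- pretend the table has 1000x the rows and pages, then look at the plan
  SELECT pg_import_rel_stats('public.orders'::regclass,
                             current_setting('server_version_num')::int,
                             e.reltuples * 1000, e.relpages * 1000, e.stats)
    FROM pg_stats_export e
   WHERE e.schemaname = 'public' AND e.relname = 'orders';
  EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
  ROLLBACK;  -- pg_statistic changes revert; the pg_class counts do not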
Attachments

Re: Statistics Import and Export

From
Ashutosh Bapat
Date:
On Thu, Aug 31, 2023 at 12:17 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>
> While the primary purpose of the import function(s) are to reduce downtime
> during an upgrade, it is not hard to see that they could also be used to
> facilitate tuning and development operations, asking questions like "how might
> this query plan change if this table has 1000x rows in it?", without actually
> putting those rows into the table.

Thanks. I think this may be used with postgres_fdw to import
statistics directly from the foreign server, whenever possible,
rather than fetching the rows and building them locally. If it's known
that the stats on foreign and local servers match for a foreign table,
we will be one step closer to accurately estimating the cost of a
foreign plan locally rather than through EXPLAIN.
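
Something roughly like this, I imagine (purely a sketch of the idea; dblink stands in for whatever mechanism postgres_fdw would actually use, and the column and function names are assumptions):

  -- apply the remote table's exported stats to the local foreign table,
  -- instead of sampling remote rows locally
  SELECT pg_import_rel_stats('orders_ft'::regclass, r.server_version_num,
                             r.reltuples, r.relpages, r.stats)
    FROM dblink('dbname=remote_db port=5432',
                $$ SELECT current_setting('server_version_num')::int,
                          reltuples, relpages, stats
                     FROM pg_stats_export
                    WHERE schemaname = 'public' AND relname = 'orders' $$)
         AS r(server_version_num int, reltuples real, relpages int, stats jsonb);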

--
Best Wishes,
Ashutosh Bapat



Re: Statistics Import and Export

From
Corey Huinker
Date:

Thanks. I think this may be used with postgres_fdw to import
statistics directly from the foreign server, whenever possible,
rather than fetching the rows and building them locally. If it's known
that the stats on foreign and local servers match for a foreign table,
we will be one step closer to accurately estimating the cost of a
foreign plan locally rather than through EXPLAIN.


Yeah, that use makes sense as well, and if so then postgres_fdw would likely need to be aware of the appropriate query for several versions back - they change, not by much, but they do change. So now we'd have each query text in three places: a system view, postgres_fdw, and the bin/scripts pre-upgrade program. So I probably should consider the best way to share those in the codebase.

 

Re: Statistics Import and Export

From
Corey Huinker
Date:

Yeah, that use makes sense as well, and if so then postgres_fdw would likely need to be aware of the appropriate query for several versions back - they change, not by much, but they do change. So now we'd have each query text in three places: a system view, postgres_fdw, and the bin/scripts pre-upgrade program. So I probably should consider the best way to share those in the codebase.


Attached is v2 of this patch.

New features:
* imports index statistics. This is not strictly accurate: it re-computes index statistics the same as ANALYZE does, which is to say it derives those stats entirely from table column stats, which are imported, so in that sense we're getting index stats without touching the heap.
* now supports extended statistics except for MCV, which is currently serialized as a difficult-to-decompose bytea field.
* bare-bones CLI script pg_export_stats, which extracts stats on databases back to v12 (tested) and could work back to v10.
* bare-bones CLI script pg_import_stats, which obviously only works on current devel dbs, but can take exports from older versions.
 
Attachments

Re: Statistics Import and Export

From
Tomas Vondra
Date:
On 10/31/23 08:25, Corey Huinker wrote:
>
> Attached is v2 of this patch.
> 
> New features:
> * imports index statistics. This is not strictly accurate: it 
> re-computes index statistics the same as ANALYZE does, which is to
> say it derives those stats entirely from table column stats, which
> are imported, so in that sense we're getting index stats without
> touching the heap.

Maybe I just don't understand, but I'm pretty sure ANALYZE does not
derive index stats from column stats. It actually builds them from the
row sample.

> * now support extended statistics except for MCV, which is currently 
> serialized as an difficult-to-decompose bytea field.

Doesn't pg_mcv_list_items() already do all the heavy work?
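
(For reference, decomposing an MCV list into rows can be done like this, per the documented usage of pg_mcv_list_items(); 'stts' stands in for whatever the statistics object is named:)

  SELECT m.*
    FROM pg_statistic_ext s
    JOIN pg_statistic_ext_data d ON d.stxoid = s.oid,
         pg_mcv_list_items(d.stxdmcv) m
   WHERE s.stxname = 'stts';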


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

From
Corey Huinker
Date:


Maybe I just don't understand, but I'm pretty sure ANALYZE does not
derive index stats from column stats. It actually builds them from the
row sample.

That is correct, my error.
 

> * now support extended statistics except for MCV, which is currently
> serialized as an difficult-to-decompose bytea field.

Doesn't pg_mcv_list_items() already do all the heavy work?

Thanks! I'll look into that.

The comment below in mcv.c made me think there was no easy way to get output.

/*
 * pg_mcv_list_out      - output routine for type pg_mcv_list.
 *
 * MCV lists are serialized into a bytea value, so we simply call byteaout()
 * to serialize the value into text. But it'd be nice to serialize that into
 * a meaningful representation (e.g. for inspection by people).
 *
 * XXX This should probably return something meaningful, similar to what
 * pg_dependencies_out does. Not sure how to deal with the deduplicated
 * values, though - do we want to expand that or not?
 */

Re: Statistics Import and Export

From
Tomas Vondra
Date:
On 11/2/23 06:01, Corey Huinker wrote:
> 
> 
>     Maybe I just don't understand, but I'm pretty sure ANALYZE does not
>     derive index stats from column stats. It actually builds them from the
>     row sample.
> 
> 
> That is correct, my error.
>  
> 
> 
>     > * now support extended statistics except for MCV, which is currently
>     > serialized as an difficult-to-decompose bytea field.
> 
>     Doesn't pg_mcv_list_items() already do all the heavy work?
> 
> 
> Thanks! I'll look into that.
> 
> The comment below in mcv.c made me think there was no easy way to get
> output.
> 
> /*
>  * pg_mcv_list_out      - output routine for type pg_mcv_list.
>  *
>  * MCV lists are serialized into a bytea value, so we simply call byteaout()
>  * to serialize the value into text. But it'd be nice to serialize that into
>  * a meaningful representation (e.g. for inspection by people).
>  *
>  * XXX This should probably return something meaningful, similar to what
>  * pg_dependencies_out does. Not sure how to deal with the deduplicated
>  * values, though - do we want to expand that or not?
>  */
> 

Yeah, that was the simplest output function possible, it didn't seem
worth it to implement something more advanced. pg_mcv_list_items() is
more convenient for most needs, but it's quite far from the on-disk
representation.

That's actually a good question - how closely should the exported data
be to the on-disk format? I'd say we should keep it abstract, not tied
to the details of the on-disk format (which might easily change between
versions).

I'm a bit confused about the JSON schema used in pg_statistic_export
view, though. It simply serializes stakinds, stavalues, stanumbers into
arrays ... which works, but why not use the JSON nesting? I mean,
there could be a nested document for histogram, MCV, ... with just the
correct fields.

  {
    ...
    histogram : { stavalues: [...] },
    mcv : { stavalues: [...], stanumbers: [...] },
    ...
  }

and so on. Also, what does TRIVIAL stand for?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

From
Shubham Khanna
Date:
On Mon, Nov 6, 2023 at 4:16 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>>
>>
>> Yeah, that use makes sense as well, and if so then postgres_fdw would likely need to be aware of the appropriate
query for several versions back - they change, not by much, but they do change. So now we'd have each query text in
three places: a system view, postgres_fdw, and the bin/scripts pre-upgrade program. So I probably should consider the
best way to share those in the codebase.
>>
>
> Attached is v2 of this patch.

While applying the patch, I noticed a few indentation issues:
1) D:\Project\Postgres>git am v2-0003-Add-pg_import_rel_stats.patch
.git/rebase-apply/patch:1265: space before tab in indent.
                                        errmsg("invalid statistics
format, stxndeprs must be array or null");
.git/rebase-apply/patch:1424: trailing whitespace.
                                   errmsg("invalid statistics format,
stxndistinct attnums elements must be strings, but one is %s",
.git/rebase-apply/patch:1315: new blank line at EOF.
+
warning: 3 lines add whitespace errors.
Applying: Add pg_import_rel_stats().

2) D:\Project\Postgres>git am v2-0004-Add-pg_export_stats-pg_import_stats.patch
.git/rebase-apply/patch:282: trailing whitespace.
const char *export_query_v14 =
.git/rebase-apply/patch:489: trailing whitespace.
const char *export_query_v12 =
.git/rebase-apply/patch:648: trailing whitespace.
const char *export_query_v10 =
.git/rebase-apply/patch:826: trailing whitespace.

.git/rebase-apply/patch:1142: trailing whitespace.
        result = PQexec(conn,
warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.
Applying: Add pg_export_stats, pg_import_stats.

Thanks and Regards,
Shubham Khanna.



Re: Statistics Import and Export

From
Ashutosh Bapat
Date:
On Tue, Oct 31, 2023 at 12:55 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>>
>>
>> Yeah, that use makes sense as well, and if so then postgres_fdw would likely need to be aware of the appropriate
query for several versions back - they change, not by much, but they do change. So now we'd have each query text in
three places: a system view, postgres_fdw, and the bin/scripts pre-upgrade program. So I probably should consider the
best way to share those in the codebase.
>>
>
> Attached is v2 of this patch.
>
> New features:
> * imports index statistics. This is not strictly accurate: it re-computes index statistics the same as ANALYZE does,
which is to say it derives those stats entirely from table column stats, which are imported, so in that sense we're
getting index stats without touching the heap.
> * now support extended statistics except for MCV, which is currently serialized as an difficult-to-decompose bytea
field.
> * bare-bones CLI script pg_export_stats, which extracts stats on databases back to v12 (tested) and could work back
to v10.
> * bare-bones CLI script pg_import_stats, which obviously only works on current devel dbs, but can take exports from
older versions.
>

I did a small experiment with your patches. In a separate database
"fdw_dst" I created a table t1 and populated it with 100K rows
#create table t1 (a int, b int);
#insert into t1 select i, i + 1 from generate_series(1, 100000) i;
#analyse t1;

In database "postgres" on the same server, I created a foreign table
pointing to t1
#create server fdw_dst_server foreign data wrapper postgres_fdw
OPTIONS ( dbname 'fdw_dst', port '5432');
#create user mapping for public server fdw_dst_server ;
#create foreign table t1 (a int, b int) server fdw_dst_server;

The estimates are off
#explain select * from t1 where a = 100;
                        QUERY PLAN
-----------------------------------------------------------
 Foreign Scan on t1  (cost=100.00..142.26 rows=13 width=8)
(1 row)

Export and import stats for table t1
$ pg_export_stats -d fdw_dst | pg_import_stats -d postgres

gives accurate estimates
#explain select * from t1 where a = 100;
                        QUERY PLAN
-----------------------------------------------------------
 Foreign Scan on t1  (cost=100.00..1793.02 rows=1 width=8)
(1 row)

In this simple case it's working like a charm.

Then I wanted to replace all ANALYZE commands in postgres_fdw.sql with
import and export of statistics. But I can not do that since it
requires table names to match. Foreign table metadata stores the
mapping between local and remote table as well as column names. Import
can use that mapping to install the statistics appropriately. We may
want to support a command or function in postgres_fdw to import
statistics of all the tables that point to a given foreign server.
That may be some future work based on your current patches.

I have not looked at the code though.

--
Best Wishes,
Ashutosh Bapat



Re: Statistics Import and Export

From
Corey Huinker
Date:
Yeah, that was the simplest output function possible, it didn't seem
worth it to implement something more advanced. pg_mcv_list_items() is
more convenient for most needs, but it's quite far from the on-disk
representation.

I was able to make it work.
 

That's actually a good question - how closely should the exported data
be to the on-disk format? I'd say we should keep it abstract, not tied
to the details of the on-disk format (which might easily change between
versions).

For the most part, I chose the exported data's JSON types and formats in a way that was the most accommodating to cstring input functions. So, while many of the statistic values are obviously only ever integers/floats, those get stored as a numeric data type which lacks direct numeric->int/float4/float8 functions (though we could certainly create them, and I'm not against that); casting them to text lets us leverage pg_strtoint16, etc.
 

I'm a bit confused about the JSON schema used in pg_statistic_export
view, though. It simply serializes stakinds, stavalues, stanumbers into
arrays ... which works, but why not to use the JSON nesting? I mean,
there could be a nested document for histogram, MCV, ... with just the
correct fields.

  {
    ...
    histogram : { stavalues: [...] },
    mcv : { stavalues: [...], stanumbers: [...] },
    ...
  }

That's a very good question. I went with this format because it was fairly straightforward to code in SQL using existing JSON/JSONB functions, and that's what we will need if we want to export statistics on any server currently in existence. I'm certainly not locked in with the current format, and if it can be shown how to transform the data into a superior format, I'd happily do so.

and so on. Also, what does TRIVIAL stand for?

It's currently serving double-duty for "there are no stats in this slot" and the situations where the stats computation could draw no conclusions about the data.

Attached is v3 of this patch. Key features are:

* Handles regular pg_statistic stats for any relation type.
* Handles extended statistics.
* Export views pg_statistic_export and pg_statistic_ext_export to allow inspection of existing stats and saving those values for later use.
* Import functions pg_import_rel_stats() and pg_import_ext_stats() which take Oids as input. This is intentional to allow stats from one object to be imported into another object.
* User scripts pg_export_stats and pg_import_stats, which offer a primitive way to serialize all the statistics of one database and import them into another.
* Has regression test coverage for both with a variety of data types.
* Passes my own manual test of extracting all of the stats from a v15 version of the popular "dvdrental" example database, as well as some additional extended statistics objects, and importing them into a development database.
* Import operations never touch the heap of any relation outside of pg_catalog. As such, this should be significantly faster than even the most cursory analyze operation, and therefore should be useful in upgrade situations, allowing the database to work with "good enough" stats more quickly, while still allowing for regular autovacuum to recalculate the stats "for real" at some later point.

The relation statistics code was adapted from similar features in analyze.c, but is now done in a query context. As before, the rowcount/pagecount values are updated on pg_class in a non-transactional fashion to avoid table bloat, while the updates to pg_statistic and pg_statistic_ext_data are done transactionally.

The existing statistics _store() functions were leveraged wherever practical, so much so that the extended statistics import is mostly just adapting the existing _build() functions into _import() functions which pull their values from JSON rather than computing the statistics.

Current concerns are:

1. I had to code a special-case exception for MCELEM stats on array data types, so that the array_in() call uses the element type rather than the array type. I had assumed that the existing examine_attribute() functions would have properly derived the typoid for that column, but it appears not to be the case, and I'm clearly missing how the existing code gets it right.
2. This hasn't been tested with external custom datatypes, but if they have a custom typanalyze function things should be ok.
3. While I think I have cataloged all of the schema-structural changes to pg_statistic[_ext[_data]] since version 10, I may have missed a case where the schema stayed the same, but the values are interpreted differently.
4. I don't yet have a complete vision for how these tools will be used by pg_upgrade and pg_dump/restore, the places where these will provide the biggest win for users.
 
Attachments

Re: Statistics Import and Export

From
Andrei Lepikhov
Date:
On 13/12/2023 17:26, Corey Huinker wrote:
> 4. I don't yet have a complete vision for how these tools will be used
> by pg_upgrade and pg_dump/restore, the places where these will provide
> the biggest win for users.

Some issues here with docs:

func.sgml:28465: parser error : Opening and ending tag mismatch: sect1 
line 26479 and sect2
   </sect2>
           ^

Also, as I remember, we already had some attempts to invent dump/restore
statistics [1,2]. They were stopped by the problem of type
verification. What if the definition of the type has changed between the
dump and restore? As I see in the code, when importing statistics you just
check the column name and don't look into the type.

[1] Backup and recovery of pg_statistic
https://www.postgresql.org/message-id/flat/724322880.K8vzik8zPz%40abook
[2] Re: Ideas about a better API for postgres_fdw remote estimates
https://www.postgresql.org/message-id/7a40707d-1758-85a2-7bb1-6e5775518e64%40postgrespro.ru

-- 
regards,
Andrei Lepikhov
Postgres Professional




Re: Statistics Import and Export

From
Corey Huinker
Date:


On Fri, Dec 15, 2023 at 3:36 AM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
On 13/12/2023 17:26, Corey Huinker wrote:
> 4. I don't yet have a complete vision for how these tools will be used
> by pg_upgrade and pg_dump/restore, the places where these will provide
> the biggest win for users.

Some issues here with docs:

func.sgml:28465: parser error : Opening and ending tag mismatch: sect1
line 26479 and sect2
   </sect2>
           ^

Apologies, will fix.
 

Also, as I remember, we already had some attempts to invent dump/restore
statistics [1,2]. They were stopped by the problem of type
verification. What if the definition of the type has changed between the
dump and restore? As I see in the code, when importing statistics you just
check the column name and don't look into the type.

We look up the imported statistics via column name, that is correct.

However, the values in stavalues and mcv and such are stored purely as text, so they must be cast using the input functions for that particular datatype. If that column definition changed, or the underlying input function changed, the stats import of that particular table would fail. It should be noted, however, that those same input functions were used to bring the data into the table via restore, so it would have already failed on that step. Either way, the structure of the table has effectively changed, so failure to import those statistics would be a good thing.
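
In other words, the round trip leans on nothing more exotic than casting through the column type's input function, e.g. (illustrative only):

  -- imported stavalues arrive as text and are cast back via the input function
  SELECT '{1,2,3}'::text::integer[];   -- fine while the column is integer
  SELECT '{1,2,3}'::text::point[];     -- fails if the column type changed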
 

[1] Backup and recovery of pg_statistic
https://www.postgresql.org/message-id/flat/724322880.K8vzik8zPz%40abook

That proposal sought to serialize enough information on the old server such that rows could be directly inserted into pg_statistic on the new server. As was pointed out at the time, version N of a server cannot know what the format of pg_statistic will be in version N+1. 

This patch avoids that problem by inspecting the structure of the object to be faux-analyzed, and using that to determine what parts of the JSON to fetch and what datatype to cast values to in cases like mcv and stavaluesN. The exported JSON has no oids in it whatsoever, all elements subject to casting on import have already been cast to text, and the record returned has the server version number of the producing system, which the import function can use to determine how to interpret the data it finds.
 

[2] Re: Ideas about a better API for postgres_fdw remote estimates
https://www.postgresql.org/message-id/7a40707d-1758-85a2-7bb1-6e5775518e64%40postgrespro.ru


This one seems to be pulling oids from the remote server, and we can't guarantee their stability across systems, especially for objects and operators from extensions. I tried to go the route of extracting the full text name of an operator, but discovered that the qualified names, in addition to being unsightly, were irrelevant because we can't insert stats that disagree about type with the attribute/expression. So it didn't matter what type the remote system thought it had, the local system was going to coerce it into the expected data type or ereport() trying.

I think there is hope for having do_analyze() run a remote query fetching the remote table's exported stats and then storing them locally, possibly after some modification, and that would save us from having to sample a remote table.

Re: Statistics Import and Export

From
Tomas Vondra
Date:
Hi,

I finally had time to look at the last version of the patch, so here are a
couple of thoughts and questions in somewhat random order. Please take this
as a bit of brainstorming and push back if you disagree with some of my
comments.

In general, I like the goal of this patch - not having statistics is a
common issue after an upgrade, and people sometimes don't even realize
they need to run analyze. So, it's definitely worth improving.

I'm not entirely sure about the other use case - allowing people to
tweak optimizer statistics on a running cluster, to see what would be
the plan in that case. Or more precisely - I agree that would be an
interesting and useful feature, but maybe the interface should not be
the same as for the binary upgrade use case?


interfaces
----------

When I thought about the ability to dump/load statistics in the past, I
usually envisioned some sort of DDL that would do the export and import.
So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
or something like that, and that'd do all the work. This would mean
stats are "first-class citizens" and it'd be fairly straightforward to
add this into pg_dump, for example. Or at least I think so ...

Alternatively we could have the usual "functional" interface, with
functions to export/import statistics, replacing the DDL commands.

Unfortunately, none of this works for the pg_upgrade use case, because
existing cluster versions would not support this new interface, of
course. That's a significant flaw, as it'd make this useful only for
upgrades of future versions.

So I think for the pg_upgrade use case, we don't have much choice other
than using "custom" export through a view, which is what the patch does.

However, for the other use case (tweaking optimizer stats) this is not
really an issue - that always happens on the same instance, so no issue
with not having the "export" function and so on. I'd bet there are more
convenient ways to do this than using the export view. I'm sure it could
share a lot of the infrastructure, ofc.

I suggest we focus on the pg_upgrade use case for now. In particular, I
think we really need to find a good way to integrate this into
pg_upgrade. I'm not against having custom CLI commands, but it's still a
manual thing - I wonder if we could extend pg_dump to dump stats, or
make it built-in into pg_upgrade in some way (possibly disabled by
default, or something like that).


JSON format
-----------

As for the JSON format, I wonder if we need that at all? Isn't that an
unnecessary layer of indirection? Couldn't we simply dump pg_statistic
and pg_statistic_ext_data in CSV, or something like that? The amount of
new JSONB code seems to be very small, so it's OK I guess.

I'm still a bit unsure about the "right" JSON schema. I find it a bit
inconvenient that the JSON objects mimic the pg_statistic schema very
closely. In particular, it has one array for stakind values, another
array for stavalues, array for stanumbers etc. I understand generating
this JSON in SQL is fairly straightforward, and for the pg_upgrade use
case it's probably OK. But my concern is it's not very convenient for
the "manual tweaking" use case, because the "related" fields are
scattered in different parts of the JSON.

That's pretty much why I envisioned a format "grouping" the arrays for a
particular type of statistics (MCV, histogram) into the same object, as
for example in

 {
   "mcv" : {"values" : [...], "frequencies" : [...]}
   "histogram" : {"bounds" : [...]}
 }

But that's probably much harder to generate from plain SQL (at least I
think so, I haven't tried).


data missing in the export
--------------------------

I think the data needs to include more information. Maybe not for the
pg_upgrade use case, where it's mostly guaranteed not to change, but for
the "manual tweak" use case it can change. And I don't think we want two
different formats - we want one, working for everything.

Consider for example the staopN and stacollN fields - if we clone
the stats from one table to the other, and the table uses different
collations, will that still work? Similarly, I think we should include
the type of each column, because it's absolutely not guaranteed the
import function will fail if the type changes. For example, if the type
changes from integer to text, that will work, but the ordering will
absolutely not be the same. And so on.

For the extended statistics export, I think we need to include also the
attribute names and expressions, because these can be different between
the statistics. And not only that - the statistics values reference the
attributes by positions, but if the two tables have the attributes in a
different order (when ordered by attnum), that will break stuff.


more strict checks
------------------

I think the code should be a bit more "defensive" when importing stuff,
and do at least some sanity checks. For the pg_upgrade use case this
should be mostly non-issue (except for maybe helping to detect bugs
earlier), but for the "manual tweak" use case it's much more important.

By this I mean checks like:

* making sure the frequencies in MCV lists are not obviously wrong
  (outside [0,1], sum exceeding 1.0, etc.)

* cross-checking that stanumbers/stavalues make sense (e.g. that MCV has
  both arrays while histogram has only stavalues, that the arrays have
  the same length for MCV, etc.)

* checking there are no duplicate stakind values (e.g. two MCV lists)

This is another reason why I was thinking the current JSON format may be
a bit inconvenient, because it loads the fields separately, making the
checks harder. But I guess it could be done after loading everything, as
a separate phase.

Not sure if all the checks need to be regular elog(ERROR), perhaps some
could/should be just asserts.
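
For instance, the MCV frequency check could be about this simple (a sketch against the existing pg_stats view rather than the patch's JSON, just to show the shape of it):

  -- flag columns whose MCV frequencies fall outside [0,1] or sum past 1.0
  SELECT s.schemaname, s.tablename, s.attname
    FROM pg_stats s,
         LATERAL (SELECT min(f) AS minf, max(f) AS maxf, sum(f) AS sumf
                    FROM unnest(s.most_common_freqs) f) m
   WHERE s.most_common_freqs IS NOT NULL
     AND (m.minf < 0 OR m.maxf > 1 OR m.sumf > 1.0);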


minor questions
---------------

1) Should the views be called pg_statistic_export or pg_stats_export?
Perhaps pg_stats_export is better, because the format is meant to be
human-readable (rather than 100% internal).

2) It's not very clear what "non-transactional update" of pg_class
fields actually means. Does that mean we update the fields in-place,
can't be rolled back, is not subject to MVCC or what? I suspect users
won't know unless the docs say that explicitly.

3) The "statistics.c" code should really document the JSON structure. Or
maybe if we plan to use this for other purposes, it should be documented
in the SGML?

Actually, this means that the supported use cases determine whether the
expected JSON structure is part of the API. For pg_upgrade we could keep
it as "internal" and maybe change it as needed, but for "manual tweak"
it'd become part of the public API.

4) Why do we need the separate "replaced" flags in import_stakinds? Can
it happen that collreplaces/opreplaces differ from kindreplaces?

5) What happens if we import statistics for a table that already has
some statistics? Will this discard the existing statistics, or will this
merge them somehow? (I think we should always discard the existing
stats, and keep only the new version.)

6) What happens if we import extended stats with mismatching definition?
For example, what if the "new" statistics object does not have "mcv"
enabled, but the imported data do include MCV? What if the statistics do
have the same number of "dimensions" but not the same number of columns
and expressions?

7) The func.sgml additions in 0007 seems a bit strange, particularly the
first sentence of the paragraph.

8) While experimenting with the patch, I noticed this:

  create table t (a int, b int, c text);
  create statistics s on a, b, c, (a+b), (a-b) from t;

  create table t2 (a text, b text, c text);
  create statistics s2 on a, c from t2;

  select pg_import_ext_stats(
    (select oid from pg_statistic_ext where stxname = 's2'),
    (select server_version_num from pg_statistic_ext_export
                              where ext_stats_name = 's'),
    (select stats from pg_statistic_ext_export
                 where ext_stats_name = 's'));

WARNING:  statistics import has 5 mcv dimensions, but the expects 2.
Skipping excess dimensions.
ERROR:  statistics import has 5 mcv dimensions, but the expects 2.
Skipping excess dimensions.

I guess we should not trigger WARNING+ERROR with the same message.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Tue, Dec 26, 2023 at 02:18:56AM +0100, Tomas Vondra wrote:
> interfaces
> ----------
> 
> When I thought about the ability to dump/load statistics in the past, I
> usually envisioned some sort of DDL that would do the export and import.
> So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
> or something like that, and that'd do all the work. This would mean
> stats are "first-class citizens" and it'd be fairly straightforward to
> add this into pg_dump, for example. Or at least I think so ...
> 
> Alternatively we could have the usual "functional" interface, with a
> functions to export/import statistics, replacing the DDL commands.
> 
> Unfortunately, none of this works for the pg_upgrade use case, because
> existing cluster versions would not support this new interface, of
> course. That's a significant flaw, as it'd make this useful only for
> upgrades of future versions.
> 
> So I think for the pg_upgrade use case, we don't have much choice other
> than using "custom" export through a view, which is what the patch does.
> 
> However, for the other use case (tweaking optimizer stats) this is not
> really an issue - that always happens on the same instance, so no issue
> with not having the "export" function and so on. I'd bet there are more
> convenient ways to do this than using the export view. I'm sure it could
> share a lot of the infrastructure, ofc.
> 
> I suggest we focus on the pg_upgrade use case for now. In particular, I
> think we really need to find a good way to integrate this into
> pg_upgrade. I'm not against having custom CLI commands, but it's still a
> manual thing - I wonder if we could extend pg_dump to dump stats, or
> make it built-in into pg_upgrade in some way (possibly disabled by
> default, or something like that).

I have some thoughts on this too.  I understand the desire to add
something that can be used for upgrades _to_ PG 17, but I am concerned
that this will give us a cumbersome API that will hamper future
development.  I think we should develop the API we want, regardless of
how useful it is for upgrades _to_ PG 17, and then figure out what
short-term hacks we can add to get it working for upgrades _to_ PG 17; 
these hacks can eventually be removed.  Even if they can't be removed,
they are export-only and we can continue developing the import SQL
command cleanly, and I think import is going to need the most long-term
maintenance.

I think we need a robust API to handle two cases:

*  changes in how we store statistics
*  changes in how data type values are represented in the statistics

We have had such changes in the past, and I think these two issues are
what have prevented import/export of statistics up to this point.
Developing an API that doesn't cleanly handle these will cause long-term
pain.

In summary, I think we need an SQL-level command for this.  I think we
need to embed the Postgres export version number into the statistics
export file (maybe in the COPY header), and then load the file via COPY
internally (not JSON) into a temporary table that we know matches the
exported Postgres version.  We then need to use SQL to make any
adjustments to it before loading it into pg_statistic.  Doing that
internally in JSON just isn't efficient.  If people want JSON for such
cases, I suggest we add a JSON format to COPY.

I think we can then look at pg_upgrade to see if we can simulate the
export action which can use the statistics import SQL command.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> I think we need a robust API to handle two cases:

> *  changes in how we store statistics
> *  changes in how how data type values are represented in the statistics

> We have had such changes in the past, and I think these two issues are
> what have prevented import/export of statistics up to this point.
> Developing an API that doesn't cleanly handle these will cause long-term
> pain.

Agreed.

> In summary, I think we need an SQL-level command for this.

I think a SQL command is an actively bad idea.  It'll just add development
and maintenance overhead that we don't need.  When I worked on this topic
years ago at Salesforce, I had things set up with simple functions, which
pg_dump would invoke by writing more or less

    SELECT pg_catalog.load_statistics(....);

This has a number of advantages, not least of which is that an extension
could plausibly add compatible functions to older versions.  The trick,
as you say, is to figure out what the argument lists ought to be.
Unfortunately I recall few details of what I wrote for Salesforce,
but I think I had it broken down in a way where there was a separate
function call occurring for each pg_statistic "slot", thus roughly

load_statistics(table regclass, attname text, stakind int, stavalue ...);

I might have had a separate load_statistics_xxx function for each
stakind, which would ease the issue of deciding what the datatype
of "stavalue" is.  As mentioned already, we'd also need some sort of
version identifier, and we'd expect the load_statistics() functions
to be able to transform the data if the old version used a different
representation.  I agree with the idea that an explicit representation
of the source table attribute's type would be wise, too.
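
Concretely, dump output under that scheme might look something like this (purely illustrative, extending the hypothetical signature above with the version and type identifiers mentioned):

  -- one call per pg_statistic slot, values rendered as text by the old server
  SELECT pg_catalog.load_statistics('public.orders'::regclass, 'customer_id',
                                    160000,               -- source version
                                    'integer',            -- source column type
                                    1,                    -- stakind (MCV)
                                    '{42,7,13}',          -- stavalues
                                    '{0.04,0.03,0.02}');  -- stanumbers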

            regards, tom lane



Re: Statistics Import and Export

From
Tomas Vondra
Date:
On 12/26/23 20:19, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> I think we need a robust API to handle two cases:
> 
>> *  changes in how we store statistics
>> *  changes in how how data type values are represented in the statistics
> 
>> We have had such changes in the past, and I think these two issues are
>> what have prevented import/export of statistics up to this point.
>> Developing an API that doesn't cleanly handle these will cause long-term
>> pain.
> 
> Agreed.
> 

I agree the format is important - we don't want to end up with a format
that's cumbersome or inconvenient to use. But I don't think the proposed
format is particularly bad in those respects - it mostly reflects how we
store statistics, and if I were designing a format for humans, it might
look a bit different. But that's not the goal here, IMHO.

I don't quite understand the two cases above. Why should this affect how
we store statistics? Surely, making the statistics easy to use for the
optimizer is much more important than occasional export/import.

>> In summary, I think we need an SQL-level command for this.
> 
> I think a SQL command is an actively bad idea.  It'll just add development
> and maintenance overhead that we don't need.  When I worked on this topic
> years ago at Salesforce, I had things set up with simple functions, which
> pg_dump would invoke by writing more or less
> 
>     SELECT pg_catalog.load_statistics(....);
> 
> This has a number of advantages, not least of which is that an extension
> could plausibly add compatible functions to older versions.  The trick,
> as you say, is to figure out what the argument lists ought to be.
> Unfortunately I recall few details of what I wrote for Salesforce,
> but I think I had it broken down in a way where there was a separate
> function call occurring for each pg_statistic "slot", thus roughly
> 
> load_statistics(table regclass, attname text, stakind int, stavalue ...);
> 
> I might have had a separate load_statistics_xxx function for each
> stakind, which would ease the issue of deciding what the datatype
> of "stavalue" is.  As mentioned already, we'd also need some sort of
> version identifier, and we'd expect the load_statistics() functions
> to be able to transform the data if the old version used a different
> representation.  I agree with the idea that an explicit representation
> of the source table attribute's type would be wise, too.
> 

Yeah, this is pretty much what I meant by "functional" interface. But if
I said the format implemented by the patch is maybe too close to
how we store the statistics, then this has exactly the same issue. And
it has other issues too, I think - it breaks down the stats into
multiple function calls, so ensuring the sanity/correctness of whole
sets of statistics gets much harder, I think.

I'm not sure about the extension idea. Yes, we could have an extension
providing such functions, but do we have any precedent of making
pg_upgrade dependent on an external extension? I'd much rather have
something built-in that just works, especially if we intend to make it
the default behavior (which I think should be our aim here).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

From
Bruce Momjian
Date:
On Wed, Dec 27, 2023 at 01:08:47PM +0100, Tomas Vondra wrote:
> On 12/26/23 20:19, Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> >> I think we need a robust API to handle two cases:
> > 
> >> *  changes in how we store statistics
> >> *  changes in how how data type values are represented in the statistics
> > 
> >> We have had such changes in the past, and I think these two issues are
> >> what have prevented import/export of statistics up to this point.
> >> Developing an API that doesn't cleanly handle these will cause long-term
> >> pain.
> > 
> > Agreed.
> > 
> 
> I agree the format is important - we don't want to end up with a format
> that's cumbersome or inconvenient to use. But I don't think the proposed
> format is somewhat bad in those respects - it mostly reflects how we
> store statistics and if I was designing a format for humans, it might
> look a bit differently. But that's not the goal here, IMHO.
> 
> I don't quite understand the two cases above. Why should this affect how
> we store statistics? Surely, making the statistics easy to use for the
> optimizer is much more important than occasional export/import.

The two items above were to focus on getting a solution that can easily
handle future statistics storage changes.  I figured we would want to
manipulate the data as a table internally so I am confused why we would
export JSON instead of a COPY format.  I didn't think we were changing
how we internally store or use the statistics.

> >> In summary, I think we need an SQL-level command for this.
> > 
> > I think a SQL command is an actively bad idea.  It'll just add development
> > and maintenance overhead that we don't need.  When I worked on this topic
> > years ago at Salesforce, I had things set up with simple functions, which
> > pg_dump would invoke by writing more or less
> > 
> >     SELECT pg_catalog.load_statistics(....);
> > 
> > This has a number of advantages, not least of which is that an extension
> > could plausibly add compatible functions to older versions.  The trick,
> > as you say, is to figure out what the argument lists ought to be.
> > Unfortunately I recall few details of what I wrote for Salesforce,
> > but I think I had it broken down in a way where there was a separate
> > function call occurring for each pg_statistic "slot", thus roughly
> > 
> > load_statistics(table regclass, attname text, stakind int, stavalue ...);
> > 
> > I might have had a separate load_statistics_xxx function for each
> > stakind, which would ease the issue of deciding what the datatype
> > of "stavalue" is.  As mentioned already, we'd also need some sort of
> > version identifier, and we'd expect the load_statistics() functions
> > to be able to transform the data if the old version used a different
> > representation.  I agree with the idea that an explicit representation
> > of the source table attribute's type would be wise, too.
> > 
> 
> Yeah, this is pretty much what I meant by "functional" interface. But if
> I said maybe the format implemented by the patch is maybe too close to
> how we store the statistics, then this has exactly the same issue. And
> it has other issues too, I think - it breaks down the stats into
> multiple function calls, so ensuring the sanity/correctness of whole
> sets of statistics gets much harder, I think.

I was suggesting an SQL command because this feature is going to need a
lot of options and do a lot of different things, I am afraid, and a
single function might be too complex to manage.

> I'm not sure about the extension idea. Yes, we could have an extension
> providing such functions, but do we have any precedent of making
> pg_upgrade dependent on an external extension? I'd much rather have
> something built-in that just works, especially if we intend to make it
> the default behavior (which I think should be our aim here).

Uh, an extension seems nice to allow people in back branches to install
it, but not for normal usage.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

From
Corey Huinker
Date:
On Mon, Dec 25, 2023 at 8:18 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hi,

I finally had time to look at the last version of the patch, so here's a
couple thoughts and questions in somewhat random order. Please take this
as a bit of a brainstorming and push back if you disagree some of my
comments.

In general, I like the goal of this patch - not having statistics is a
common issue after an upgrade, and people sometimes don't even realize
they need to run analyze. So, it's definitely worth improving.

I'm not entirely sure about the other use case - allowing people to
tweak optimizer statistics on a running cluster, to see what would be
the plan in that case. Or more precisely - I agree that would be an
interesting and useful feature, but maybe the interface should not be
the same as for the binary upgrade use case?


interfaces
----------

When I thought about the ability to dump/load statistics in the past, I
usually envisioned some sort of DDL that would do the export and import.
So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
or something like that, and that'd do all the work. This would mean
stats are "first-class citizens" and it'd be fairly straightforward to
add this into pg_dump, for example. Or at least I think so ...

Alternatively we could have the usual "functional" interface, with a
functions to export/import statistics, replacing the DDL commands.

Unfortunately, none of this works for the pg_upgrade use case, because
existing cluster versions would not support this new interface, of
course. That's a significant flaw, as it'd make this useful only for
upgrades of future versions.

This was the reason I settled on the interface that I did: while we can create whatever interface we want for importing the statistics, we would need to be able to extract stats from databases using only the facilities available in those same databases, and then store that in a medium that could be conveyed across databases, either by text files or by saving them off in a side table prior to upgrade. JSONB met the criteria.
 

So I think for the pg_upgrade use case, we don't have much choice other
than using "custom" export through a view, which is what the patch does.

However, for the other use case (tweaking optimizer stats) this is not
really an issue - that always happens on the same instance, so no issue
with not having the "export" function and so on. I'd bet there are more
convenient ways to do this than using the export view. I'm sure it could
share a lot of the infrastructure, ofc.

So, there is a third use case - foreign data wrappers. When analyzing a foreign table, at least one in the postgres_fdw family of foreign servers, we should be able to send a query specific to the version and dialect of that server, get back the JSONB, and import those results. That use case may be more tangible to you than the tweak/tuning case.
 



JSON format
-----------

As for the JSON format, I wonder if we need that at all? Isn't that an
unnecessary layer of indirection? Couldn't we simply dump pg_statistic
and pg_statistic_ext_data in CSV, or something like that? The amount of
new JSONB code seems to be very small, so it's OK I guess.

I see a few problems with dumping pg_statistic[_ext_data]. The first is that the importer now has to understand all of the past formats of those two tables. The next is that the tables are chock full of Oids that don't necessarily carry forward. I could see us having a text-ified version of those two tables, but we'd need that for all previous iterations of those table formats. Instead, I put the burden on the stats export to de-oid the data and make it *_in() function friendly.
 
That's pretty much why I envisioned a format "grouping" the arrays for a
particular type of statistics (MCV, histogram) into the same object, as
for example in

 {
   "mcv" : {"values" : [...], "frequencies" : [...]}
   "histogram" : {"bounds" : [...]}
 }

I agree that would be a lot more readable, and probably a lot more debuggable. But I went into this unsure if there could be more than one stats slot of a given kind per table. Knowing that they must be unique helps.
 
But that's probably much harder to generate from plain SQL (at least I
think so, I haven't tried).

I think it would be harder, but far from impossible.

 
data missing in the export
--------------------------

I think the data needs to include more information. Maybe not for the
pg_upgrade use case, where it's mostly guaranteed not to change, but for
the "manual tweak" use case it can change. And I don't think we want two
different formats - we want one, working for everything.

I"m not against this at all, and I started out doing that, but the qualified names of operators got _ugly_, and I quickly realized that what I was generating wouldn't matter, either the input data would make sense for the attribute's stats or it would fail trying.
 
Consider for example about the staopN and stacollN fields - if we clone
the stats from one table to the other, and the table uses different
collations, will that still work? Similarly, I think we should include
the type of each column, because it's absolutely not guaranteed the
import function will fail if the type changes. For example, if the type
changes from integer to text, that will work, but the ordering will
absolutely not be the same. And so on.

I can see including the type of the column, that's a lot cleaner than the operator names for sure, and I can see us rejecting stats or sections of stats in certain situations. Like in your example, if the collation changed, then reject all "<" op stats but keep the "=" ones.
 
For the extended statistics export, I think we need to include also the
attribute names and expressions, because these can be different between
the statistics. And not only that - the statistics values reference the
attributes by positions, but if the two tables have the attributes in a
different order (when ordered by attnum), that will break stuff.

Correct me if I'm wrong, but I thought expression parse trees change _a lot_ from version to version?

Attribute reordering is a definite vulnerability of the current implementation, so an attribute name export might be a way to mitigate that.
 

* making sure the frequencies in MCV lists are not obviously wrong
  (outside [0,1], sum exceeding > 1.0, etc.)

+1
 

* cross-checking that stanumbers/stavalues make sense (e.g. that MCV has
  both arrays while histogram has only stavalues, that the arrays have
  the same length for MCV, etc.)

To this end, there's an edge-case hack in the code where I have to derive the array elemtype. I had thought that examine_attribute() or std_typanalyze() was going to do that for me, but it didn't. Very much want your input there.
 

* checking there are no duplicate stakind values (e.g. two MCV lists)

Per previous comment, it's good to learn these restrictions.
 
Not sure if all the checks need to be regular elog(ERROR), perhaps some
could/should be just asserts.

For this first pass, all errors were one-size-fits-all, save for the WARNING vs ERROR.
 


minor questions
---------------

1) Should the views be called pg_statistic_export or pg_stats_export?
Perhaps pg_stats_export is better, because the format is meant to be
human-readable (rather than 100% internal).

I have no opinion on what the best name would be, and will go with consensus.
 

2) It's not very clear what "non-transactional update" of pg_class
fields actually means. Does that mean we update the fields in-place,
can't be rolled back, is not subject to MVCC or what? I suspect users
won't know unless the docs say that explicitly.

Correct. Cannot be rolled back, not subject to MVCC.

 
3) The "statistics.c" code should really document the JSON structure. Or
maybe if we plan to use this for other purposes, it should be documented
in the SGML?

I agree, but I also didn't expect the format to survive first contact with reviewers, so I held back.
 

4) Why do we need the separate "replaced" flags in import_stakinds? Can
it happen that collreplaces/opreplaces differ from kindreplaces?

That was initially done to maximize the amount of code that could be copied from do_analyze(). In retrospect, I like how extended statistics just deletes all the pg_statistic_ext_data rows and replaces them and I would like to do the same for pg_statistic before this is all done.
 

5) What happens in we import statistics for a table that already has
some statistics? Will this discard the existing statistics, or will this
merge them somehow? (I think we should always discard the existing
stats, and keep only the new version.)

In the case of pg_statistic_ext_data, the stats are thrown out and replaced by the imported ones.

In the case of pg_statistic, it's basically an upsert, and any values that were missing in the JSON are not updated on the existing row. That's appealing in a tweak situation where you want to only alter one or two bits of a stat, but not really useful in other situations. Per previous comment, I'd prefer a clean slate and forcing tweaking use cases to fill in all the blanks.
 

6) What happens if we import extended stats with mismatching definition?
For example, what if the "new" statistics object does not have "mcv"
enabled, but the imported data do include MCV? What if the statistics do
have the same number of "dimensions" but not the same number of columns
and expressions?

The importer is currently driven by the types of stats to be expected for that pg_attribute/pg_statistic_ext. It only looks for things that are possible for that stat type, and any extra JSON values are ignored.
 

Re: Statistics Import and Export

From
Corey Huinker
Date:

  As mentioned already, we'd also need some sort of
version identifier, and we'd expect the load_statistics() functions
to be able to transform the data if the old version used a different
representation.  I agree with the idea that an explicit representation
of the source table attribute's type would be wise, too.

There is a version identifier currently (its own column, not embedded in the JSON), but I discovered that I was able to put the burden on the export queries to spackle over the changes in the table structures over time. Still, I knew that we'd need the version number in there eventually.

Re: Statistics Import and Export

From
Corey Huinker
Date:
Yeah, this is pretty much what I meant by "functional" interface. But if
I said maybe the format implemented by the patch is maybe too close to
how we store the statistics, then this has exactly the same issue. And
it has other issues too, I think - it breaks down the stats into
multiple function calls, so ensuring the sanity/correctness of whole
sets of statistics gets much harder, I think.

Export functions was my original plan, for simplicity, maintenance, etc, but it seemed like I'd be adding quite a few functions, so the one view made more sense for an initial version. Also, I knew that pg_dump or some other stats exporter would have to inline the guts of those functions into queries for older versions, and adapting a view definition seemed more straightforward for the reader than function definitions.
 

Re: Statistics Import and Export

From
Tom Lane
Date:
Corey Huinker <corey.huinker@gmail.com> writes:
> Export functions was my original plan, for simplicity, maintenance, etc,
> but it seemed like I'd be adding quite a few functions, so the one view
> made more sense for an initial version. Also, I knew that pg_dump or some
> other stats exporter would have to inline the guts of those functions into
> queries for older versions, and adapting a view definition seemed more
> straightforward for the reader than function definitions.

Hmm, I'm not sure we are talking about the same thing at all.

What I am proposing is *import* functions.  I didn't say anything about
how pg_dump obtains the data it prints; however, I would advocate that
we keep that part as simple as possible.  You cannot expect export
functionality to know the requirements of future server versions,
so I don't think it's useful to put much intelligence there.

So I think pg_dump should produce a pretty literal representation of
what it finds in the source server's catalog, and then rely on the
import functions in the destination server to make sense of that
and do whatever slicing-n-dicing is required.

That being the case, I don't see a lot of value in a view -- especially
not given the requirement to dump from older server versions.
(Conceivably we could just say that we won't dump stats from server
versions predating the introduction of this feature, but that's hardly
a restriction that supports doing this via a view.)

            regards, tom lane



Re: Statistics Import and Export

От
Bruce Momjian
Дата:
On Wed, Dec 27, 2023 at 09:41:31PM -0500, Corey Huinker wrote:
>     When I thought about the ability to dump/load statistics in the past, I
>     usually envisioned some sort of DDL that would do the export and import.
>     So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
>     or something like that, and that'd do all the work. This would mean
>     stats are "first-class citizens" and it'd be fairly straightforward to
>     add this into pg_dump, for example. Or at least I think so ...
> 
>     Alternatively we could have the usual "functional" interface, with a
>     functions to export/import statistics, replacing the DDL commands.
> 
>     Unfortunately, none of this works for the pg_upgrade use case, because
>     existing cluster versions would not support this new interface, of
>     course. That's a significant flaw, as it'd make this useful only for
>     upgrades of future versions.
> 
> 
> This was the reason I settled on the interface that I did: while we can create
> whatever interface we want for importing the statistics, we would need to be
> able to extract stats from databases using only the facilities available in
> those same databases, and then store that in a medium that could be conveyed
> across databases, either by text files or by saving them off in a side table
> prior to upgrade. JSONB met the criteria.

Uh, it wouldn't be crazy to add this capability to pg_upgrade/pg_dump in
a minor version upgrade if it wasn't enabled by default, and if we were
very careful.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

От
Corey Huinker
Дата:
On Wed, Dec 27, 2023 at 10:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Corey Huinker <corey.huinker@gmail.com> writes:
> Export functions was my original plan, for simplicity, maintenance, etc,
> but it seemed like I'd be adding quite a few functions, so the one view
> made more sense for an initial version. Also, I knew that pg_dump or some
> other stats exporter would have to inline the guts of those functions into
> queries for older versions, and adapting a view definition seemed more
> straightforward for the reader than function definitions.

Hmm, I'm not sure we are talking about the same thing at all.

Right, I was conflating two things.
 

What I am proposing is *import* functions.  I didn't say anything about
how pg_dump obtains the data it prints; however, I would advocate that
we keep that part as simple as possible.  You cannot expect export
functionality to know the requirements of future server versions,
so I don't think it's useful to put much intelligence there.

True, but presumably you'd be using the pg_dump/pg_upgrade of that future version to do the exporting, so the export format would always be tailored to the importer's needs.
 

So I think pg_dump should produce a pretty literal representation of
what it finds in the source server's catalog, and then rely on the
import functions in the destination server to make sense of that
and do whatever slicing-n-dicing is required.

Obviously it can't be purely literal, as we have to replace the oid values with whatever text representation we feel helps us carry forward. In addition, we're setting the number of tuples and number of pages directly in pg_class, and doing so non-transactionally just like ANALYZE does. We could separate that out into its own import function, but then we're locking every relation twice, once for the tuples/pages and once again for the pg_statistic import.

My current line of thinking was that the stats import call, if enabled, would immediately follow the CREATE statement of the object itself, but that requires us to have everything we need to know for the import passed into the import function, so we'd be needing a way to serialize _that_. If you're thinking that we have one big bulk stats import, that might work, but it also means that we're less tolerant of failures in the import step. 

Re: Statistics Import and Export

От
Bruce Momjian
Дата:
On Thu, Dec 28, 2023 at 12:28:06PM -0500, Corey Huinker wrote:
>     What I am proposing is *import* functions.  I didn't say anything about
>     how pg_dump obtains the data it prints; however, I would advocate that
>     we keep that part as simple as possible.  You cannot expect export
>     functionality to know the requirements of future server versions,
>     so I don't think it's useful to put much intelligence there.
> 
> True, but presumably you'd be using the pg_dump/pg_upgrade of that future
> version to do the exporting, so the export format would always be tailored to
> the importer's needs.

I think the question is whether we will have the export functionality in
the old cluster, or if it will be queries run by pg_dump and therefore
also run by pg_upgrade calling pg_dump.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

От
Tomas Vondra
Дата:
On 12/13/23 11:26, Corey Huinker wrote:
>     Yeah, that was the simplest output function possible, it didn't seem
>
>     worth it to implement something more advanced. pg_mcv_list_items() is
>     more convenient for most needs, but it's quite far from the on-disk
>     representation.
>
>
> I was able to make it work.
>
>
>
>     That's actually a good question - how closely should the exported data
>     be to the on-disk format? I'd say we should keep it abstract, not tied
>     to the details of the on-disk format (which might easily change between
>     versions).
>
>
> For the most part, I chose the exported data json types and formats in a
> way that was the most accommodating to cstring input functions. So,
> while so many of the statistic values are obviously only ever
> integers/floats, those get stored as a numeric data type which lacks
> direct numeric->int/float4/float8 functions (though we could certainly
> create them, and I'm not against that), casting them to text lets us
> leverage pg_strtoint16, etc.
>
>
>
>     I'm a bit confused about the JSON schema used in pg_statistic_export
>     view, though. It simply serializes stakinds, stavalues, stanumbers into
>     arrays ... which works, but why not to use the JSON nesting? I mean,
>     there could be a nested document for histogram, MCV, ... with just the
>     correct fields.
>
>       {
>         ...
>         histogram : { stavalues: [...] },
>         mcv : { stavalues: [...], stanumbers: [...] },
>         ...
>       }
>
>
> That's a very good question. I went with this format because it was
> fairly straightforward to code in SQL using existing JSON/JSONB
> functions, and that's what we will need if we want to export statistics
> on any server currently in existence. I'm certainly not locked in with
> the current format, and if it can be shown how to transform the data
> into a superior format, I'd happily do so.
>
>     and so on. Also, what does TRIVIAL stand for?
>
>
> It's currently serving double-duty for "there are no stats in this slot"
> and the situations where the stats computation could draw no conclusions
> about the data.
>
> Attached is v3 of this patch. Key features are:
>
> * Handles regular pg_statistic stats for any relation type.
> * Handles extended statistics.
> * Export views pg_statistic_export and pg_statistic_ext_export to allow
> inspection of existing stats and saving those values for later use.
> * Import functions pg_import_rel_stats() and pg_import_ext_stats() which
> take Oids as input. This is intentional to allow stats from one object
> to be imported into another object.
> * User scripts pg_export_stats and pg_import stats, which offer a
> primitive way to serialize all the statistics of one database and import
> them into another.
> * Has regression test coverage for both with a variety of data types.
> * Passes my own manual test of extracting all of the stats from a v15
> version of the popular "dvdrental" example database, as well as some
> additional extended statistics objects, and importing them into a
> development database.
> * Import operations never touch the heap of any relation outside of
> pg_catalog. As such, this should be significantly faster than even the
> most cursory analyze operation, and therefore should be useful in
> upgrade situations, allowing the database to work with "good enough"
> stats more quickly, while still allowing for regular autovacuum to
> recalculate the stats "for real" at some later point.
>
> The relation statistics code was adapted from similar features in
> analyze.c, but is now done in a query context. As before, the
> rowcount/pagecount values are updated on pg_class in a non-transactional
> fashion to avoid table bloat, while the updates to pg_statistic and
> pg_statistic_ext_data are done transactionally.
>
> The existing statistics _store() functions were leveraged wherever
> practical, so much so that the extended statistics import is mostly just
> adapting the existing _build() functions into _import() functions which
> pull their values from JSON rather than computing the statistics.
>
> Current concerns are:
>
> 1. I had to code a special-case exception for MCELEM stats on array data
> types, so that the array_in() call uses the element type rather than the
> array type. I had assumed that the existing examine_attribute()
> functions would have properly derived the typoid for that column, but it
> appears to not be the case, and I'm clearly missing how the existing
> code gets it right.
Hmm, after looking at this, I'm not sure it's such an ugly hack ...

The way this works for ANALYZE is that examine_attribute() eventually
calls the typanalyze function:

  if (OidIsValid(stats->attrtype->typanalyze))
    ok = DatumGetBool(OidFunctionCall1(stats->attrtype->typanalyze,
                                       PointerGetDatum(stats)));

which for arrays is array_typanalyze, and this sets stats->extra_data to
ArrayAnalyzeExtraData with all the interesting info about the array
element type, and then also std_extra_data with info about the array
type itself.

  stats -> extra_data -> std_extra_data

compute_array_stats then "restores" std_extra_data to compute standard
stats for the whole array, and then uses the ArrayAnalyzeExtraData to
calculate stats for the elements.

It's not exactly pretty, because there are global variables and so on.

And examine_rel_attribute() does the same thing - calls typanalyze, so
if I break after it returns, I see this for int[] column:

(gdb) p * (ArrayAnalyzeExtraData *) stat->extra_data

$1 = {type_id = 23, eq_opr = 96, coll_id = 0, typbyval = true, typlen =
4, typalign = 105 'i', cmp = 0x2e57920, hash = 0x2e57950,
std_compute_stats = 0x6681b8 <compute_scalar_stats>, std_extra_data =
0x2efe670}
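
(As an aside, one can confirm from SQL which typanalyze function a given array type carries; a small illustrative query:)

    SELECT typname, typanalyze
      FROM pg_type
     WHERE typname = '_int4';   -- the array type over int4; shows array_typanalyze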

I think the "problem" will be how to use this in import_stavalues(). You
can't just do this for any array type, I think. I could create an array
type (with ELEMENT=X) but with a custom analyze function, in which case
the extra_data may be something entirely different.

I suppose the correct solution would be to add an "import" function into
the pg_type catalog (next to typanalyze). Or maybe it'd be enough to set
it from the typanalyze? After all, that's what sets compute_stats.

But maybe it's enough to just do what you did - if we get an MCELEM
slot, can it ever contain anything else than array of elements of the
attribute array type? I'd bet that'd cause all sorts of issues, no?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

От
Corey Huinker
Дата:
But maybe it's enough to just do what you did - if we get an MCELEM
slot, can it ever contain anything else than array of elements of the
attribute array type? I'd bet that'd cause all sorts of issues, no?


Thanks for the explanation of why it wasn't working for me. Knowing that the case of MCELEM + is-array-type is the only case where we'd need to do that puts me at ease.

Re: Statistics Import and Export

От
Tomas Vondra
Дата:
On 12/29/23 17:27, Corey Huinker wrote:
>     But maybe it's enough to just do what you did - if we get an MCELEM
>     slot, can it ever contain anything else than array of elements of the
>     attribute array type? I'd bet that'd cause all sorts of issues, no?
> 
> 
> Thanks for the explanation of why it wasn't working for me. Knowing that
> the case of MCELEM + is-array-type is the only case where we'd need to
> do that puts me at ease.
> 

But I didn't claim MCELEM is the only slot where this might be an issue.
I merely asked if a MCELEM slot can ever contain an array with element
type different from the "original" attribute.

After thinking about this a bit more, and doing a couple experiments
with a trivial custom data type, I think this is true:

1) MCELEM slots for "real" array types are OK

I don't think we allow "real" arrays created by users directly, all
arrays are created implicitly by the system. Those types always have
array_typanalyze, which guarantees MCELEM has the correct element type.

I haven't found a way to either inject my custom array type or alter the
typanalyze to some custom function. So I think this is OK.


2) I'm not sure we can extend this to regular data types / other slots

For example, I think I can implement a data type with custom typanalyze
function (and custom compute_stats function) that fills slots with some
other / strange stuff. For example I might build MCV with hashes of the
original data, a CountMin sketch, or something like that.

Yes, I don't think people do that often, but as long as the type also
implements custom selectivity functions for the operators, I think this
would work.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

От
Peter Smith
Дата:
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4538/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4538

Kind Regards,
Peter Smith.



Re: Statistics Import and Export

От
Corey Huinker
Дата:


On Mon, Jan 22, 2024 at 1:09 AM Peter Smith <smithpb2250@gmail.com> wrote:
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4538/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4538

Kind Regards,
Peter Smith.

Attached is v4 of the statistics export/import patch.

This version has been refactored to match the design feedback received previously.

The system views are gone. These were mostly there to serve as a baseline for what an export query would look like. That role is temporarily reassigned to pg_export_stats.c, but hopefully they will be integrated into pg_dump in the next version. The regression test also contains the version of each query suitable for the current server version.

The export format is far closer to the raw format of pg_statistic and pg_statistic_ext_data, respectively. This format involves exporting oid values for types, collations, operators, and attributes - values which are specific to the server they were created on. To make sense of those values, a subset of the columns of pg_type, pg_attribute, pg_collation, and pg_operator are exported as well, which allows pg_import_rel_stats() and pg_import_ext_stats() to reconstitute the data structure as it existed on the old server, and adapt it to the modern structure and local schema objects.

pg_import_rel_stats matches up local columns with the exported stats by column name, not attnum. This allows for stats to be imported when columns have been dropped, added, or reordered.

pg_import_ext_stats can also handle column reordering, though it currently would get confused by changes in expressions that maintain the same result data type. I'm not yet brave enough to handle importing nodetrees, nor do I think it's wise to try. I think we'd be better off validating that the destination extended stats object is identical in structure, and to fail the import of that one object if it isn't perfect.

Export formats go back to v10.

Re: Statistics Import and Export

От
Corey Huinker
Дата:
(hit send before attaching patches, reposting message as well)

Attached is v4 of the statistics export/import patch.

This version has been refactored to match the design feedback received previously.

The system views are gone. These were mostly there to serve as a baseline for what an export query would look like. That role is temporarily reassigned to pg_export_stats.c, but hopefully they will be integrated into pg_dump in the next version. The regression test also contains the version of each query suitable for the current server version.

The export format is far closer to the raw format of pg_statistic and pg_statistic_ext_data, respectively. This format involves exporting oid values for types, collations, operators, and attributes - values which are specific to the server they were created on. To make sense of those values, a subset of the columns of pg_type, pg_attribute, pg_collation, and pg_operator are exported as well, which allows pg_import_rel_stats() and pg_import_ext_stats() to reconstitute the data structure as it existed on the old server, and adapt it to the modern structure and local schema objects.

pg_import_rel_stats matches up local columns with the exported stats by column name, not attnum. This allows for stats to be imported when columns have been dropped, added, or reordered.

pg_import_ext_stats can also handle column reordering, though it currently would get confused by changes in expressions that maintain the same result data type. I'm not yet brave enough to handle importing nodetrees, nor do I think it's wise to try. I think we'd be better off validating that the destination extended stats object is identical in structure, and to fail the import of that one object if it isn't perfect.

Export formats go back to v10.


On Mon, Jan 22, 2024 at 1:09 AM Peter Smith <smithpb2250@gmail.com> wrote:
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/4538/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4538

Kind Regards,
Peter Smith.
Вложения

Re: Statistics Import and Export

От
Tomas Vondra
Дата:
Hi,

I took a quick look at the v4 patches. I haven't done much testing yet,
so only some basic review.

0001

- The SGML docs for pg_import_rel_stats may need some changes. It starts
with a description of what gets overwritten (non-)transactionally (which
gets repeated twice), but that seems more like an implementation detail.
But it does not really say which pg_class fields get updated. Then it
speculates about the possible use case (pg_upgrade). I think it'd be
better to focus on the overall goal of updating statistics, explain what
gets updated/how, and only then maybe mention the pg_upgrade use case.

Also, it says "statistics are replaced" but it's not quite clear if that
applies only to matching statistics or if all stats are deleted first
and then the new stuff is inserted. (FWIW remove_pg_statistics clearly
deletes all pre-existing stats).


- import_pg_statistics: I somewhat dislike that we're passing arguments
as datum[] array - it's hard to say what the elements are expected to
be, etc. Maybe we should expand this, to make it clear. How do we even
know the array is large enough?

- I don't quite understand why we need examine_rel_attribute. It sets a
lot of fields in the VacAttrStats struct, but then we only use attrtypid
and attrtypmod from it - so why bother and not simply load just these
two fields? Or maybe I miss something.

- examine_rel_attribute can return NULL, but get_attrinfo does not check
for NULL and just dereferences the pointer. Surely that can lead to
segfaults?

- validate_no_duplicates and the other validate functions would deserve
better docs, explaining what exactly is checked (it took me a while to
realize we check just for duplicates), what the parameters do etc.

- Do we want to make the validate_ functions part of the public API? I
realize we want to use them from multiple places (regular and extended
stats), but maybe it'd be better to have an "internal" header file, just
like we have extended_stats_internal?

- I'm not sure we do "\set debug f" elsewhere. It took me a while to
realize why the query outputs are empty ...


0002

- I'd rename create_stat_ext_entry to statext_create_entry.

- Do we even want to include OIDs from the source server? Why not to
just have object names and resolve those? Seems safer - if the target
server has the OID allocated to a different object, that could lead to
confusing / hard to detect issues.

- What happens if we import statistics which includes data for extended
statistics object which does not exist on the target machine?

- pg_import_ext_stats seems to not use require_match_oids - bug?


0003

- no SGML docs for the new tools?

- The help() seems to be wrong / copied from "clusterdb" or something
like that, right?


On 2/2/24 09:37, Corey Huinker wrote:
> (hit send before attaching patches, reposting message as well)
> 
> Attached is v4 of the statistics export/import patch.
> 
> This version has been refactored to match the design feedback received
> previously.
> 
> The system views are gone. These were mostly there to serve as a baseline
> for what an export query would look like. That role is temporarily
> reassigned to pg_export_stats.c, but hopefully they will be integrated into
> pg_dump in the next version. The regression test also contains the version
> of each query suitable for the current server version.
> 

OK

> The export format is far closer to the raw format of pg_statistic and
> pg_statistic_ext_data, respectively. This format involves exporting oid
> values for types, collations, operators, and attributes - values which are
> specific to the server they were created on. To make sense of those values,
> a subset of the columns of pg_type, pg_attribute, pg_collation, and
> pg_operator are exported as well, which allows pg_import_rel_stats() and
> pg_import_ext_stats() to reconstitute the data structure as it existed on
> the old server, and adapt it to the modern structure and local schema
> objects.

I have no opinion on the proposed format - still JSON, but closer to the
original data. Works for me, but I wonder what Tom thinks about it,
considering he suggested making it closer to the raw data.

> 
> pg_import_rel_stats matches up local columns with the exported stats by
> column name, not attnum. This allows for stats to be imported when columns
> have been dropped, added, or reordered.
> 

Makes sense. What will happen if we try to import data for extended
statistics (or index) that does not exist on the target server?

> pg_import_ext_stats can also handle column reordering, though it currently
> would get confused by changes in expressions that maintain the same result
> data type. I'm not yet brave enough to handle importing nodetrees, nor do I
> think it's wise to try. I think we'd be better off validating that the
> destination extended stats object is identical in structure, and to fail
> the import of that one object if it isn't perfect.
> 

Yeah, column reordering is something we probably need to handle. The
stats order them by attnum, so if we want to allow import on a system
where the attributes were dropped/created in a different way, this is
necessary. I haven't tested this - is there a regression test for this?

I agree expressions are hard. I don't think it's feasible to import
nodetree from other server versions, but why don't we simply deparse the
expression on the source, and either parse it on the target (and then
compare the two nodetrees), or deparse the target too and compare the
two deparsed expressions? I suspect the deparsing may produce slightly
different results on the two versions (causing false mismatches), but
perhaps the deparse on source + parse on target + compare nodetrees
would work? Haven't tried, though.

> Export formats go back to v10.
> 

Do we even want/need to go beyond 12? All earlier versions are EOL.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Statistics Import and Export

От
Corey Huinker
Дата:
Also, it says "statistics are replaced" but it's not quite clear if that
applies only to matching statistics or if all stats are deleted first
and then the new stuff is inserted. (FWIW remove_pg_statistics clearly
deletes all pre-existing stats).

All are now deleted first, both in the pg_statistic and pg_statistic_ext_data tables. The previous version was taking a more "replace it if we find a new value" approach, but that's overly complicated, so following the example set by extended statistics seemed best.

 
- import_pg_statistics: I somewhat dislike that we're passing arguments
as datum[] array - it's hard to say what the elements are expected to
be, etc. Maybe we should expand this, to make it clear. How do we even
know the array is large enough?

Completely fair. Initially that was done with the expectation that the array would be the same for both regular stats and extended stats, but that was no longer the case.
 
- I don't quite understand why we need examine_rel_attribute. It sets a
lot of fields in the VacAttrStats struct, but then we only use attrtypid
and attrtypmod from it - so why bother and not simply load just these
two fields? Or maybe I miss something.
 
I think you're right, we don't need it anymore for regular statistics. We still need it in extended stats because statext_store() takes a subset of the vacattrstats rows as an input.

Which leads to a side issue. We currently have 3 functions: examine_rel_attribute and the two varieties of examine_attribute (one in analyze.c and the other in extended stats). These are highly similar but just different enough that I didn't feel comfortable refactoring them into a one-size-fits-all function, and I was particularly reluctant to modify existing code for the ANALYZE path.
 

- examine_rel_attribute can return NULL, but get_attrinfo does not check
for NULL and just dereferences the pointer. Surely that can lead to
segfaults?

Good catch, and it highlights how little we need VacAttrStats for regular statistics.
 

- validate_no_duplicates and the other validate functions would deserve
better docs, explaining what exactly is checked (it took me a while to
realize we check just for duplicates), what the parameters do etc.

Those functions are in a fairly formative phase - I expect a conversation about what sort of validations we want to do to ensure that the statistics being imported make sense, and under what circumstances we would forego some of those checks.
 

- Do we want to make the validate_ functions part of the public API? I
realize we want to use them from multiple places (regular and extended
stats), but maybe it'd be better to have an "internal" header file, just
like we have extended_stats_internal?

I see no need to have them be a part of the public API. Will move.
 

- I'm not sure we do "\set debug f" elsewhere. It took me a while to
realize why the query outputs are empty ...

That was an experiment that arose out of the difficulty in determining _where_ a difference was when the set-difference checks failed. So far I like it, and I'm hoping it catches on.

 


0002

- I'd rename create_stat_ext_entry to statext_create_entry.

- Do we even want to include OIDs from the source server? Why not to
just have object names and resolve those? Seems safer - if the target
server has the OID allocated to a different object, that could lead to
confusing / hard to detect issues.

The import functions would obviously never use the imported oids to look up objects on the destination system. Rather, they're there to verify that the local object oid matches the exported object oid, which is true in the case of a binary upgrade.

The export format is an attempt to export the pg_statistic[_ext_data] for that object as-is, and, as Tom suggested, let the import function do the transformations. We can of course remove them if they truly have no purpose for validation.
 

- What happens if we import statistics which includes data for extended
statistics object which does not exist on the target machine?

The import function takes an oid of the object (relation or extstat object), and the json payload is supposed to be the stats for ONE corresponding object. Multiple objects of data really don't fit into the json format, and statistics exported for an object that does not exist on the destination system would have no meaningful invocation. I envision the dump file looking like this

    CREATE TABLE public.foo (....);

    SELECT pg_import_rel_stats('public.foo'::regclass, <json blob>, option flag, option flag);

So a call against a nonexistent object would fail on the regclass cast.
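
For example (a hypothetical invocation; the exact parameter list is still in flux), against a table that does not exist on the destination:

    SELECT pg_import_rel_stats('public.bar'::regclass, <json blob>, ...);
    ERROR:  relation "public.bar" does not exist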
 

- pg_import_ext_stats seems to not use require_match_oids - bug?

I haven't yet seen a good way to make use of matching oids in extended stats. Checking matching operator/collation oids would make sense, but little else.
 


0003

- no SGML docs for the new tools?

Correct. I foresee the export tool being folded into pg_dump(), and the import tool going away entirely as psql could handle it.
 

- The help() seems to be wrong / copied from "clusterdb" or something
like that, right?

Correct, for the reason above.

 
>
> pg_import_rel_stats matches up local columns with the exported stats by
> column name, not attnum. This allows for stats to be imported when columns
> have been dropped, added, or reordered.
>

Makes sense. What will happen if we try to import data for extended
statistics (or index) that does not exist on the target server?

One of the parameters to the function is the oid of the object that is the target of the stats. The importer will not seek out objects with matching names and each JSON payload is limited to holding one object, though clearly someone could encapsulate the existing format in a format that has a manifest of objects to import.
 

> pg_import_ext_stats can also handle column reordering, though it currently
> would get confused by changes in expressions that maintain the same result
> data type. I'm not yet brave enough to handle importing nodetrees, nor do I
> think it's wise to try. I think we'd be better off validating that the
> destination extended stats object is identical in structure, and to fail
> the import of that one object if it isn't perfect.
>

Yeah, column reordering is something we probably need to handle. The
stats order them by attnum, so if we want to allow import on a system
where the attributes were dropped/created in a different way, this is
necessary. I haven't tested this - is there a regression test for this?

The overlong transformation SQL starts with the object to be imported (whose local oid was specified), and it:

1. grabs all the attributes (or exprs, for extended stats) of that object.
2. looks for columns/exprs in the exported json for an attribute with a matching name
3. takes the exported attnum of that exported attribute for use in things like stdexprs
4. looks up the type, collation, and operators for the exported attribute.

So we get a situation where there might not be importable stats for an attribute of the destination table, and we'd import nothing for that column. Stats for exported columns with no matching local column would never be referenced.
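
As a rough sketch of the name-matching step (illustrative SQL only, not the actual transformation query; it assumes, purely for illustration, a JSON document with a "columns" object keyed by column name, with :'stats' standing in for the exported JSON):

    SELECT a.attnum, a.atttypid, col.stats
      FROM pg_attribute a
      JOIN jsonb_each(:'stats'::jsonb -> 'columns') AS col(attname, stats)
        ON col.attname = a.attname
     WHERE a.attrelid = 'public.foo'::regclass
       AND a.attnum > 0
       AND NOT a.attisdropped;
    -- exported columns with no matching local attribute simply drop out of the join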

Yes, there should be a test of this.


I agree expressions are hard. I don't think it's feasible to import
nodetree from other server versions, but why don't we simply deparse the
expression on the source, and either parse it on the target (and then
compare the two nodetrees), or deparse the target too and compare the
two deparsed expressions? I suspect the deparsing may produce slightly
different results on the two versions (causing false mismatches), but
perhaps the deparse on source + parse on target + compare nodetrees
would work? Haven't tried, though.

> Export formats go back to v10.
>

Do we even want/need to go beyond 12? All earlier versions are EOL.

True, but we had pg_dump and pg_restore support back to 7.x until fairly recently, and a major friction point in getting customers to upgrade their instances off of unsupported versions is the downtime caused by an upgrade, so why wouldn't we make it easier for them?

Re: Statistics Import and Export

От
Corey Huinker
Дата:
Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(), which address many of the concerns listed earlier.

Leaving the export/import scripts off for the time being, as they haven't  changed and the next likely change is to fold them into pg_dump.


Вложения

Re: Statistics Import and Export

От
Corey Huinker
Дата:
On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker <corey.huinker@gmail.com> wrote:
Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(), which address many of the concerns listed earlier.

Leaving the export/import scripts off for the time being, as they haven't  changed and the next likely change is to fold them into pg_dump.



v6 posted below.

Changes:

- Additional documentation about the overall process.
- Rewording of SGML docs.
- removed a fair number of columns from the transformation queries.
- enabled require_match_oids in extended statistics, but I'm having my doubts about the value of that.
- moved stats extraction functions to an fe_utils file stats_export.c that will be used by both pg_export_stats and pg_dump.
- pg_export_stats now generates SQL statements rather than a tsv, and has boolean flags to set the validate and require_match_oids parameters in the calls to pg_import_(rel|ext)_stats.
- pg_import_stats is gone, as importing can now be done with psql.

I'm hoping to get feedback on a few areas.

1. The checks for matching oids. On the one hand, in a binary upgrade situation, we would of course want the oid of the relation to match what was exported, as well as all of the atttypids of the attributes to match the type ids exported, same for collations, etc. However, the binary upgrade is the one place where there are absolutely no middle steps that could have altered either the stats jsons or the source tables. Given that and that oid simply will never match in any situation other than a binary upgrade, it may be best to discard those checks.

2. The checks for relnames matching, and typenames of attributes matching (they are already matched by name, so the column order can change without the import missing a beat) seem so necessary that there shouldn't be an option to enable/disable them. But if that's true, then the initial relation parameter becomes somewhat unnecessary, and anyone using these functions for tuning or FDW purposes could easily transform the JSON using SQL to put in the proper relname.

3. The data integrity validation functions may belong in a separate function rather than being a parameter on the existing import functions.

4. Lastly, pg_dump. Each relation object and extended statistics object will have a statistics import statement. From my limited experience with pg_dump, it seems like we would add an additional Stmt variable (statsStmt) to the TOC entry for each object created, and the restore process would check the value of --with-statistics and in cases where the statistics flag was set AND a stats import statement exists, then execute that stats statement immediately after the creation of the object. This assumes that there is no case where additional attributes are added to a relation after its initial CREATE statement. Indexes are independent relations in this regard.
 
Вложения

Re: Statistics Import and Export

От
Stephen Frost
Дата:
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker <corey.huinker@gmail.com>
> wrote:
> > Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(),
> > which address many of the concerns listed earlier.
> >
> > Leaving the export/import scripts off for the time being, as they haven't
> > changed and the next likely change is to fold them into pg_dump.

> v6 posted below.
>
> Changes:
>
> - Additional documentation about the overall process.
> - Rewording of SGML docs.
> - removed a fair number of columns from the transformation queries.
> - enabled require_match_oids in extended statistics, but I'm having my
> doubts about the value of that.
> - moved stats extraction functions to an fe_utils file stats_export.c that
> will be used by both pg_export_stats and pg_dump.
> - pg_export_stats now generates SQL statements rather than a tsv, and has
> boolean flags to set the validate and require_match_oids parameters in the
> calls to pg_import_(rel|ext)_stats.
> - pg_import_stats is gone, as importing can now be done with psql.

Having looked through this thread and discussed a bit with Corey
off-line, the approach that Tom laid out up-thread seems like it would
make the most sense overall- that is, eliminate the JSON bits and the
SPI and instead export the stats data by running queries from the new
version of pg_dump/server (in the FDW case) against the old server
with the intelligence of how to transform the data into the format
needed for the current pg_dump/server to accept, through function calls
where the function calls generally map up to the rows/information being
updated- a call to update the information in pg_class for each relation
and then a call for each attribute to update the information in
pg_statistic.

Part of this process would include mapping from OIDs/attrnum's to names
on the source side and then from those names to the appropriate
OIDs/attrnum's on the destination side.

As this code would be used by both pg_dump and the postgres_fdw, it
seems logical that it would go into the common library.  Further, it
would make sense to have this code be able to handle multiple major
versions for the foreign side, such as how postgres_fdw and pg_dump
already do.

In terms of working to ensure that newer versions support loading from
older dumps (that is, that v18 would be able to load a dump file created
by a v17 pg_dump against a v17 server in the face of changes having been
made to the statistics system in v18), we could have the functions take
a version parameter (to handle cases where the data structure is the
same but the contents have to be handled differently), use overloaded
functions, or have version-specific names for the functions.  I'm also
generally supportive of the idea that we, perhaps initially, only
support dumping/loading stats with pg_dump when in binary-upgrade mode,
which removes our need to be concerned with this (perhaps that would be
a good v1 of this feature?) as the version of pg_dump needs to match
that of pg_upgrade and the destination server for various other reasons.
Including a switch to exclude stats on restore might also be an
acceptable answer, or even simply excluding them by default when going
between major versions except in binary-upgrade mode.
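
For the version-parameter idea above, a purely hypothetical call shape (not an actual proposed signature) might be:

    SELECT pg_import_rel_stats('public.foo'::regclass, 170000, <json blob>, ...);

where the second argument identifies the source server version so the import function can adjust how it interprets the payload.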

Along those same lines when it comes to a 'v1', I'd say that we may wish
to consider excluding extended statistics, which I am fairly confident
Corey's heard a number of times previously already but thought I would
add my own support for that.  To the extent that we do want to make
extended stats work down the road, we should probably have some
pre-patches to flush out the missing _in/_recv functions for those types
which don't have them today- and that would include modifying the _out
of those types to use names instead of OIDs/attrnums.  In thinking about
this, I was reviewing specifically pg_dependencies.  To the extent that
there are people who depend on the current output, I would think that
they'd actually appreciate this change.

I don't generally feel like we need to be checking that the OIDs between
the old server and the new server match- I appreciate that that should
be the case in a binary-upgrade situation but it still feels unnecessary
and complicated and clutters up the output and the function calls.

Overall, I definitely think this is a good project to work on as it's an
often, rightfully, complained about issue when it comes to pg_upgrade
and the amount of downtime required for it before the upgraded system
can be reasonably used again.

Thanks,

Stephen

Вложения

Re: Statistics Import and Export

От
Corey Huinker
Дата:


Having looked through this thread and discussed a bit with Corey
off-line, the approach that Tom laid out up-thread seems like it would
make the most sense overall- that is, eliminate the JSON bits and the
SPI and instead export the stats data by running queries from the new
version of pg_dump/server (in the FDW case) against the old server
with the intelligence of how to transform the data into the format
needed for the current pg_dump/server to accept, through function calls
where the function calls generally map up to the rows/information being
updated- a call to update the information in pg_class for each relation
and then a call for each attribute to update the information in
pg_statistic.

Thanks for the excellent summary of our conversation, though I do add that we discussed a problem with per-attribute functions: each function would be acquiring locks on both the relation (so it doesn't go away) and pg_statistic, and that lock thrashing would add up. Whether that overhead is judged significant or not is up for discussion. If it is significant, it makes sense to package up all the attributes into one call, passing in an array of some new pg_statistic-esque special type....the very issue that sent me down the JSON path.

I certainly see the flexibility in having per-attribute functions, but am concerned about non-binary-upgrade situations where the attnums won't line up, and if we're passing them by name then the function has to dig around looking for the right matching attnum, and that's overhead too. In the whole-table approach, we just iterate over the attributes that exist, and find the matching parameter row.







 

Re: Statistics Import and Export

От
Stephen Frost
Дата:
Greetings,

On Thu, Feb 29, 2024 at 17:48 Corey Huinker <corey.huinker@gmail.com> wrote:
Having looked through this thread and discussed a bit with Corey
off-line, the approach that Tom laid out up-thread seems like it would
make the most sense overall- that is, eliminate the JSON bits and the
SPI and instead export the stats data by running queries from the new
version of pg_dump/server (in the FDW case) against the old server
with the intelligence of how to transform the data into the format
needed for the current pg_dump/server to accept, through function calls
where the function calls generally map up to the rows/information being
updated- a call to update the information in pg_class for each relation
and then a call for each attribute to update the information in
pg_statistic.

Thanks for the excellent summary of our conversation, though I do add that we discussed a problem with per-attribute functions: each function would be acquiring locks on both the relation (so it doesn't go away) and pg_statistic, and that lock thrashing would add up. Whether that overhead is judged significant or not is up for discussion. If it is significant, it makes sense to package up all the attributes into one call, passing in an array of some new pg_statistic-esque special type....the very issue that sent me down the JSON path.

I certainly see the flexibility in having a per-attribute functions, but am concerned about non-binary-upgrade situations where the attnums won't line up, and if we're passing them by name then the function has dig around looking for the right matching attnum, and that's overhead too. In the whole-table approach, we just iterate over the attributes that exist, and find the matching parameter row.

That’s certainly a fair point and my initial reaction (which could certainly be wrong) is that it’s unlikely to be an issue- but also, if you feel you could make it work with an array and passing all the attribute info in with one call, which I suspect would be possible but just a bit more complex to build, then sure, go for it. If it ends up being overly unwieldy then perhaps the  per-attribute call would be better and we could perhaps acquire the lock before the function calls..?  Doing a check to see if we have already locked it would be cheaper than trying to acquire a new lock, I’m fairly sure.

Also per our prior discussion- this makes sense to include in post-data section, imv, and also because then we have the indexes we may wish to load stats for, but further that also means it’ll be in the parallelizable part of the process, making me a bit less concerned overall about the individual timing. 

Thanks!

Stephen

Re: Statistics Import and Export

От
Corey Huinker
Дата:

That’s certainly a fair point and my initial reaction (which could certainly be wrong) is that it’s unlikely to be an issue- but also, if you feel you could make it work with an array and passing all the attribute info in with one call, which I suspect would be possible but just a bit more complex to build, then sure, go for it. If it ends up being overly unwieldy then perhaps the  per-attribute call would be better and we could perhaps acquire the lock before the function calls..?  Doing a check to see if we have already locked it would be cheaper than trying to acquire a new lock, I’m fairly sure.

Well the do_analyze() code was already ok with acquiring the lock once for non-inherited stats and again for inherited stats, so the locks were already not the end of the world. However, that's at most a 2x of the locking required, and this would be natts * x, quite a bit more. Having the procedures check for a pre-existing lock seems like a good compromise.
 
Also per our prior discussion- this makes sense to include in post-data section, imv, and also because then we have the indexes we may wish to load stats for, but further that also means it’ll be in the paralleliziable part of the process, making me a bit less concerned overall about the individual timing. 

The ability to parallelize is pretty persuasive. But is that per-statement parallelization or do we get transaction blocks? i.e. if we ended up importing stats like this:

BEGIN;
LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
LOCK TABLE pg_catalog.pg_statistic IN ROW EXCLUSIVE MODE;
SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
SELECT pg_import_pg_statistic('schema.relation', 'name', ...);
SELECT pg_import_pg_statistic('schema.relation', 'description', ...);
...
COMMIT;

Re: Statistics Import and Export

От
Nathan Bossart
Дата:
On Thu, Feb 29, 2024 at 10:55:20PM -0500, Corey Huinker wrote:
>> That’s certainly a fair point and my initial reaction (which could
>> certainly be wrong) is that it’s unlikely to be an issue- but also, if you
>> feel you could make it work with an array and passing all the attribute
>> info in with one call, which I suspect would be possible but just a bit
>> more complex to build, then sure, go for it. If it ends up being overly
>> unwieldy then perhaps the  per-attribute call would be better and we could
>> perhaps acquire the lock before the function calls..?  Doing a check to see
>> if we have already locked it would be cheaper than trying to acquire a new
>> lock, I’m fairly sure.
> 
> Well the do_analyze() code was already ok with acquiring the lock once for
> non-inherited stats and again for inherited stats, so the locks were
> already not the end of the world. However, that's at most a 2x of the
> locking required, and this would natts * x, quite a bit more. Having the
> procedures check for a pre-existing lock seems like a good compromise.

I think this is a reasonable starting point.  If the benchmarks show that
the locking is a problem, we can reevaluate, but otherwise IMHO we should
try to keep it as simple/flexible as possible.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Statistics Import and Export

От
Stephen Frost
Дата:
Greetings,

On Fri, Mar 1, 2024 at 12:14 Nathan Bossart <nathandbossart@gmail.com> wrote:
On Thu, Feb 29, 2024 at 10:55:20PM -0500, Corey Huinker wrote:
>> That’s certainly a fair point and my initial reaction (which could
>> certainly be wrong) is that it’s unlikely to be an issue- but also, if you
>> feel you could make it work with an array and passing all the attribute
>> info in with one call, which I suspect would be possible but just a bit
>> more complex to build, then sure, go for it. If it ends up being overly
>> unwieldy then perhaps the  per-attribute call would be better and we could
>> perhaps acquire the lock before the function calls..?  Doing a check to see
>> if we have already locked it would be cheaper than trying to acquire a new
>> lock, I’m fairly sure.
>
> Well the do_analyze() code was already ok with acquiring the lock once for
> non-inherited stats and again for inherited stats, so the locks were
> already not the end of the world. However, that's at most a 2x of the
> locking required, and this would natts * x, quite a bit more. Having the
> procedures check for a pre-existing lock seems like a good compromise.

I think this is a reasonable starting point.  If the benchmarks show that
the locking is a problem, we can reevaluate, but otherwise IMHO we should
try to keep it as simple/flexible as possible.

Yeah, this was my general feeling as well.  If it does become an issue, it certainly seems like we would have ways to improve it in the future. Even with this locking it is surely going to be better than having to re-analyze the entire database which is where we are at now.

Thanks,

Stephen

Re: Statistics Import and Export

От
Bertrand Drouvot
Дата:
Hi,

On Tue, Feb 20, 2024 at 02:24:52AM -0500, Corey Huinker wrote:
> On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker <corey.huinker@gmail.com>
> wrote:
> 
> > Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(),
> > which address many of the concerns listed earlier.
> >
> > Leaving the export/import scripts off for the time being, as they haven't
> > changed and the next likely change is to fold them into pg_dump.
> >
> >
> >
> v6 posted below.

Thanks!

I had in mind to look at it but it looks like a rebase is needed.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Statistics Import and Export

От
Matthias van de Meent
Дата:
On Fri, 1 Mar 2024, 04:55 Corey Huinker, <corey.huinker@gmail.com> wrote:
>> Also per our prior discussion- this makes sense to include in post-data section, imv, and also because then we have
>> the indexes we may wish to load stats for, but further that also means it'll be in the parallelizable part of the
>> process, making me a bit less concerned overall about the individual timing.
>
>
> The ability to parallelize is pretty persuasive. But is that per-statement parallelization or do we get transaction
> blocks? i.e. if we ended up importing stats like this:
>
> BEGIN;
> LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
> LOCK TABLE pg_catalog.pg_statistic IN ROW EXCLUSIVE MODE;
> SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
> SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
> SELECT pg_import_pg_statistic('schema.relation', 'name', ...);

How well would this simplify to the following:

SELECT pg_import_statistic('schema.relation', attname, ...)
FROM (VALUES ('id', ...), ...) AS relation_stats (attname, ...);

Or even just one VALUES for the whole statistics loading?

I suspect the main issue with combining this into one statement
(transaction) is that failure to load one column's statistics implies
you'll have to redo all the other statistics (or fail to load the
statistics at all), which may be problematic at the scale of thousands
of relations with tens of columns each.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Statistics Import and Export

От
Stephen Frost
Дата:
Greetings,

On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Fri, 1 Mar 2024, 04:55 Corey Huinker, <corey.huinker@gmail.com> wrote:
>> Also per our prior discussion- this makes sense to include in post-data section, imv, and also because then we have the indexes we may wish to load stats for, but further that also means it’ll be in the parallelizable part of the process, making me a bit less concerned overall about the individual timing.
>
>
> The ability to parallelize is pretty persuasive. But is that per-statement parallelization or do we get transaction blocks? i.e. if we ended up importing stats like this:
>
> BEGIN;
> LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
> LOCK TABLE pg_catalog.pg_statistic IN ROW EXCLUSIVE MODE;
> SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
> SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
> SELECT pg_import_pg_statistic('schema.relation', 'name', ...);

How well would this simplify to the following:

SELECT pg_import_statistic('schema.relation', attname, ...)
FROM (VALUES ('id', ...), ...) AS relation_stats (attname, ...);

Using a VALUES construct for this does seem like it might make it cleaner, so +1 for investigating that idea.
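
For concreteness, a minimal sketch of that shape (the function name and parameter list here are purely illustrative, not the patch's actual signature):

    SELECT pg_import_pg_statistic('schema.relation'::regclass, attname, null_frac, avg_width, n_distinct)
      FROM (VALUES
              ('id',   0.0,  4,  -1.0),
              ('name', 0.01, 12, 250.0)
           ) AS relation_stats (attname, null_frac, avg_width, n_distinct);

One call per attribute, but issued as a single statement against the relation.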

Or even just one VALUES for the whole statistics loading?

I don’t think we’d want to go beyond one relation at a time as then it can be parallelized, we won’t be trying to lock a whole bunch of objects at once, and any failures would only impact that one relation’s stats load.

I suspect the main issue with combining this into one statement
(transaction) is that failure to load one column's statistics implies
you'll have to redo all the other statistics (or fail to load the
statistics at all), which may be problematic at the scale of thousands
of relations with tens of columns each.

I’m pretty skeptical that “stats fail to load and lead to a failed transaction” is a likely scenario that we have to spend a lot of effort on.  I’m pretty bullish on the idea that this simply won’t happen except in very exceptional cases under a pg_upgrade (where the pg_dump that’s used must match the target server version) and where it happens under a pg_dump it’ll be because it’s an older pg_dump’s output and the answer will likely need to be “you’re using a pg_dump file generated using an older version of pg_dump and need to exclude stats entirely from the load and instead run analyze on the data after loading it.”

What are the cases where we would be seeing stats reloads failing where it would make sense to re-try on a subset of columns, or just generally, if we know that the pg_dump version matches the target server version?

Thanks!

Stephen

Re: Statistics Import and Export

От
Matthias van de Meent
Дата:
On Wed, 6 Mar 2024 at 11:33, Stephen Frost <sfrost@snowman.net> wrote:
> On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>> Or even just one VALUES for the whole statistics loading?
>
>
> I don’t think we’d want to go beyond one relation at a time as then it can be parallelized, we won’t be trying to
> lock a whole bunch of objects at once, and any failures would only impact that one relation’s stats load.

That also makes sense.

>> I suspect the main issue with combining this into one statement
>> (transaction) is that failure to load one column's statistics implies
>> you'll have to redo all the other statistics (or fail to load the
>> statistics at all), which may be problematic at the scale of thousands
>> of relations with tens of columns each.
>
>
> I’m pretty skeptical that “stats fail to load and lead to a failed transaction” is a likely scenario that we have to spend a lot of effort on.

Agreed on the "don't have to spend a lot of time on it", but I'm not
so sure on the "unlikely" part while the autovacuum daemon is
involved, specifically for non-upgrade pg_restore. I imagine (haven't
checked) that autoanalyze is disabled during pg_upgrade, but
pg_restore doesn't do that, while it would have to be able to restore
statistics of a table if it is included in the dump (and the version
matches).

What are the cases where we would be seeing stats reloads failing where it would make sense to re-try on a subset of columns, or just generally, if we know that the pg_dump version matches the target server version?

Last time I checked, pg_restore's default is to load data on a
row-by-row basis without --single-transaction or --exit-on-error. Of
course, pg_upgrade uses its own set of flags, but if a user is
restoring stats with pg_restore, I suspect they'd rather have some
columns' stats loaded than no stats at all; so I would assume this
requires one separate pg_import_pg_statistic()-transaction for every
column.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

* Matthias van de Meent (boekewurm+postgres@gmail.com) wrote:
> On Wed, 6 Mar 2024 at 11:33, Stephen Frost <sfrost@snowman.net> wrote:
> > On Wed, Mar 6, 2024 at 11:07 Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> >> Or even just one VALUES for the whole statistics loading?
> > I don’t think we’d want to go beyond one relation at a time as then it can be parallelized, we won’t be trying to lock a whole bunch of objects at once, and any failures would only impact that one relation’s stats load.
>
> That also makes sense.

Great, thanks.

> >> I suspect the main issue with combining this into one statement
> >> (transaction) is that failure to load one column's statistics implies
> >> you'll have to redo all the other statistics (or fail to load the
> >> statistics at all), which may be problematic at the scale of thousands
> >> of relations with tens of columns each.
> >
> >
> > I’m pretty skeptical that “stats fail to load and lead to a failed transaction” is a likely scenario that we have to spend a lot of effort on.
>
> Agreed on the "don't have to spend a lot of time on it", but I'm not
> so sure on the "unlikely" part while the autovacuum deamon is
> involved, specifically for non-upgrade pg_restore. I imagine (haven't
> checked) that autoanalyze is disabled during pg_upgrade, but
> pg_restore doesn't do that, while it would have to be able to restore
> statistics of a table if it is included in the dump (and the version
> matches).

Even if autovacuum was running and it kicked off an auto-analyze of the
relation at just the time that we were trying to load the stats, there
would be appropriate locking happening to keep them from causing an
outright ERROR and transaction failure, or if not, that's a lack of
locking and should be fixed.  With the per-attribute-function-call
approach, that could lead to a situation where some stats are from the
auto-analyze and some are from the stats being loaded but I'm not sure
if that's a big concern or not.

For users of this, I would think we'd generally encourage them to
disable autovacuum on the tables they're loading as otherwise they'll
end up with the stats going back to whatever an auto-analyze ends up
finding.  That may be fine in some cases, but not in others.
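
As a concrete sketch of that guidance (autovacuum_enabled is the existing
per-table reloption; the pg_import_* calls and placeholders mirror the
example quoted above and are in no way final):

    ALTER TABLE schema.relation SET (autovacuum_enabled = off);
    SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
    SELECT pg_import_pg_statistic('schema.relation', 'id', ...);  -- one call per column
    ALTER TABLE schema.relation RESET (autovacuum_enabled);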

A couple questions to think about though: Should pg_dump explicitly ask
autovacuum to ignore these tables while we're loading them?
Should these functions only perform a load when there aren't any
existing stats?  Should the latter be an argument to the functions to
allow the caller to decide?

> > What are the cases where we would be seeing stats reloads failing where it would make sense to re-try on a subset of columns, or just generally, if we know that the pg_dump version matches the target server version?
>
> Last time I checked, pg_restore's default is to load data on a
> row-by-row basis without --single-transaction or --exit-on-error. Of
> course, pg_upgrade uses it's own set of flags, but if a user is
> restoring stats with  pg_restore, I suspect they'd rather have some
> column's stats loaded than no stats at all; so I would assume this
> requires one separate pg_import_pg_statistic()-transaction for every
> column.

Having some discussion around that would be useful.  Is it better to
have a situation where there are stats for some columns but no stats for
other columns?  There would be a good chance that this would lead to a
set of queries that were properly planned out and a set which end up
with unexpected and likely poor query plans due to lack of stats.
Arguably that's better overall, but either way an ANALYZE needs to be
done to address the lack of stats for those columns and then that
ANALYZE is going to blow away whatever stats got loaded previously
anyway and all we did with a partial stats load was maybe have a subset
of queries have better plans in the interim, after having expended the
cost to try and individually load the stats and dealing with the case of
some of them succeeding and some failing.

Overall, I'd suggest we wait to see what Corey comes up with in terms of
doing the stats load for all attributes in a single function call,
perhaps using the VALUES construct as you suggested up-thread, and then
we can contemplate if that's clean enough to work or if it's so grotty
that the better plan would be to do per-attribute function calls.  If it
ends up being the latter, then we can revisit this discussion and try to
answer some of the questions raised above.

Thanks!

Stephen

Attachments

Re: Statistics Import and Export

From
Corey Huinker
Date:

> BEGIN;
> LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
> LOCK TABLE pg_catalog.pg_statistic IN ROW UPDATE EXCLUSIVE MODE;
> SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
> SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
> SELECT pg_import_pg_statistic('schema.relation', 'name', ...);

How well would this simplify to the following:

SELECT pg_import_statistic('schema.relation', attname, ...)
FROM (VALUES ('id', ...), ...) AS relation_stats (attname, ...);

Or even just one VALUES for the whole statistics loading?

I'm sorry, I don't quite understand what you're suggesting here. I'm about to post the new functions, so perhaps you can rephrase this in the context of those functions.
 
I suspect the main issue with combining this into one statement
(transaction) is that failure to load one column's statistics implies
you'll have to redo all the other statistics (or fail to load the
statistics at all), which may be problematic at the scale of thousands
of relations with tens of columns each.

Yes, that is a concern, and I can see value in having it both ways: one failure fails the whole table's worth of set_something() functions, but I can also see emitting a warning instead of an error and returning false. I'm eager to get feedback on which the community would prefer, or perhaps even make it a parameter.
 

Re: Statistics Import and Export

From
Corey Huinker
Date:


Having some discussion around that would be useful.  Is it better to
have a situation where there are stats for some columns but no stats for
other columns?  There would be a good chance that this would lead to a
set of queries that were properly planned out and a set which end up
with unexpected and likely poor query plans due to lack of stats.
Arguably that's better overall, but either way an ANALYZE needs to be
done to address the lack of stats for those columns and then that
ANALYZE is going to blow away whatever stats got loaded previously
anyway and all we did with a partial stats load was maybe have a subset
of queries have better plans in the interim, after having expended the
cost to try and individually load the stats and dealing with the case of
some of them succeeding and some failing.

It is my (incomplete and entirely second-hand) understanding that pg_upgrade doesn't STOP autovacuum, but sets a delay to a very long value and then resets it on completion, presumably because analyzing a table before its data is loaded and indexes are created would just be a waste of time.

 

Overall, I'd suggest we wait to see what Corey comes up with in terms of
doing the stats load for all attributes in a single function call,
perhaps using the VALUES construct as you suggested up-thread, and then
we can contemplate if that's clean enough to work or if it's so grotty
that the better plan would be to do per-attribute function calls.  If it
ends up being the latter, then we can revisit this discussion and try to
answer some of the questions raised above.

In the patch below, I ended up doing per-attribute function calls, mostly because it allowed me to avoid creating a custom data type for the portable version of pg_statistic. This comes at the cost of a very high number of parameters, but that's the breaks.
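
To illustrate what a single call ends up looking like (all values here are invented; the parameter order follows the pg_set_attribute_stats() signature quoted in the review below: relation, attname, stainherit, stanullfrac, stawidth, stadistinct, then stakind1..5, stanumbers1..5, stavalues1..5):

    SELECT pg_set_attribute_stats('public.foo'::regclass, 'id'::name, false,
               0.0::float4, 4, -1.0::float4,                  -- stanullfrac, stawidth, stadistinct
               2, 3, 0, 0, 0,                                 -- stakind1..5 (histogram, correlation)
               NULL, '{0.9997}'::float4[], NULL, NULL, NULL,  -- stanumbers1..5
               '{1,5000,10000}', NULL, NULL, NULL, NULL);     -- stavalues1..5, passed as text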

I am a bit concerned about the number of locks on pg_statistic and the relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per attribute rather than once per relation. But I also see that this will mostly get used at a time when no other traffic is on the machine, and whatever it costs, it's still faster than the smallest table sample (insert joke about "don't have to be faster than the bear" here).

This raises questions about whether a failure in one attribute update statement should cause the others in that relation to roll back or not, and I can see situations where both would be desirable.

I'm putting this out there ahead of the pg_dump / fe_utils work, mostly because what I do there heavily depends on how this is received.

Also, I'm still seeking confirmation of whether I can create a pg_dump TOC entry with a chain of commands (e.g. BEGIN; ...  COMMIT; ) or whether I have to fan them out into multiple entries.

Anyway, here's v7. Eagerly awaiting feedback.
Attachments

Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> > Having some discussion around that would be useful.  Is it better to
> > have a situation where there are stats for some columns but no stats for
> > other columns?  There would be a good chance that this would lead to a
> > set of queries that were properly planned out and a set which end up
> > with unexpected and likely poor query plans due to lack of stats.
> > Arguably that's better overall, but either way an ANALYZE needs to be
> > done to address the lack of stats for those columns and then that
> > ANALYZE is going to blow away whatever stats got loaded previously
> > anyway and all we did with a partial stats load was maybe have a subset
> > of queries have better plans in the interim, after having expended the
> > cost to try and individually load the stats and dealing with the case of
> > some of them succeeding and some failing.
>
> It is my (incomplete and entirely second-hand) understanding is that
> pg_upgrade doesn't STOP autovacuum, but sets a delay to a very long value
> and then resets it on completion, presumably because analyzing a table
> before its data is loaded and indexes are created would just be a waste of
> time.

No, pg_upgrade starts the postmaster with -b (undocumented
binary-upgrade mode) which prevents autovacuum (and logical replication
workers) from starting, so we don't need to worry about autovacuum
coming in and causing issues during binary upgrade.  That doesn't
entirely eliminate the concerns around pg_dump vs. autovacuum because we
may restore a dump into a non-binary-upgrade-in-progress system though,
of course.

> > Overall, I'd suggest we wait to see what Corey comes up with in terms of
> > doing the stats load for all attributes in a single function call,
> > perhaps using the VALUES construct as you suggested up-thread, and then
> > we can contemplate if that's clean enough to work or if it's so grotty
> > that the better plan would be to do per-attribute function calls.  If it
> > ends up being the latter, then we can revisit this discussion and try to
> > answer some of the questions raised above.
>
> In the patch below, I ended up doing per-attribute function calls, mostly
> because it allowed me to avoid creating a custom data type for the portable
> version of pg_statistic. This comes at the cost of a very high number of
> parameters, but that's the breaks.

Perhaps it's just me ... but it doesn't seem like it's all that many
parameters.

> I am a bit concerned about the number of locks on pg_statistic and the
> relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per
> attribute rather than once per relation. But I also see that this will
> mostly get used at a time when no other traffic is on the machine, and
> whatever it costs, it's still faster than the smallest table sample (insert
> joke about "don't have to be faster than the bear" here).

I continue to not be too concerned about this until and unless it's
actually shown to be an issue.  Keeping things simple and
straight-forward has its own value.

> This raises questions about whether a failure in one attribute update
> statement should cause the others in that relation to roll back or not, and
> I can see situations where both would be desirable.
>
> I'm putting this out there ahead of the pg_dump / fe_utils work, mostly
> because what I do there heavily depends on how this is received.
>
> Also, I'm still seeking confirmation that I can create a pg_dump TOC entry
> with a chain of commands (e.g. BEGIN; ...  COMMIT; ) or if I have to fan
> them out into multiple entries.

If we do go with this approach, we'd certainly want to make sure to do
it in a manner which would allow pg_restore's single-transaction mode to
still work, which definitely complicates this some.

Given that and the other general feeling that the locking won't be a big
issue, I'd suggest the simple approach on the pg_dump side of just
dumping the stats out without the BEGIN/COMMIT.

> Anyway, here's v7. Eagerly awaiting feedback.

> Subject: [PATCH v7] Create pg_set_relation_stats, pg_set_attribute_stats.

> diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
> index 291ed876fc..d12b6e3ca3 100644
> --- a/src/include/catalog/pg_proc.dat
> +++ b/src/include/catalog/pg_proc.dat
> @@ -8818,7 +8818,6 @@
>  { oid => '3813', descr => 'generate XML text node',
>    proname => 'xmltext', proisstrict => 't', prorettype => 'xml',
>    proargtypes => 'text', prosrc => 'xmltext' },
> -
>  { oid => '2923', descr => 'map table contents to XML',
>    proname => 'table_to_xml', procost => '100', provolatile => 's',
>    proparallel => 'r', prorettype => 'xml',
> @@ -12163,8 +12162,24 @@
>
>  # GiST stratnum implementations
>  { oid => '8047', descr => 'GiST support',
> -  proname => 'gist_stratnum_identity', prorettype => 'int2',
> +  proname => 'gist_stratnum_identity',prorettype => 'int2',
>    proargtypes => 'int2',
>    prosrc => 'gist_stratnum_identity' },

Random whitespace hunks shouldn't be included

> diff --git a/src/backend/statistics/statistics.c b/src/backend/statistics/statistics.c
> new file mode 100644
> index 0000000000..999aebdfa9
> --- /dev/null
> +++ b/src/backend/statistics/statistics.c
> @@ -0,0 +1,360 @@
> +/*------------------------------------------------------------------------- * * statistics.c *
> + * IDENTIFICATION
> + *       src/backend/statistics/statistics.c
> + *
> + *-------------------------------------------------------------------------
> + */

Top-of-file comment should be cleaned up.

> +/*
> + * Set statistics for a given pg_class entry.
> + *
> + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> + *
> + * This does an in-place (i.e. non-transactional) update of pg_class, just as
> + * is done in ANALYZE.
> + *
> + */
> +Datum
> +pg_set_relation_stats(PG_FUNCTION_ARGS)
> +{
> +    const char *param_names[] = {
> +        "relation",
> +        "reltuples",
> +        "relpages",
> +    };
> +
> +    Oid                relid;
> +    Relation        rel;
> +    HeapTuple        ctup;
> +    Form_pg_class    pgcform;
> +
> +    for (int i = 0; i <= 2; i++)
> +        if (PG_ARGISNULL(i))
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                     errmsg("%s cannot be NULL", param_names[i])));

Why not just mark this function as strict..?  Or perhaps we should allow
NULLs to be passed in and just not update the current value in that
case?  Also, in some cases we allow the function to be called with a
NULL but then make it a no-op rather than throwing an ERROR (eg, if the
OID ends up being NULL).  Not sure if that makes sense here or not
offhand but figured I'd mention it as something to consider.

> +    pgcform = (Form_pg_class) GETSTRUCT(ctup);
> +    pgcform->reltuples = PG_GETARG_FLOAT4(1);
> +    pgcform->relpages = PG_GETARG_INT32(2);

Shouldn't we include relallvisible?

Also, perhaps we should use the approach that we have in ANALYZE, and
only actually do something if the values are different rather than just
always doing an update.

> +/*
> + * Import statistics for a given relation attribute
> + *
> + * pg_set_attribute_stats(relation Oid, attname name, stainherit bool,
> + *                        stanullfrac float4, stawidth int, stadistinct float4,
> + *                        stakind1 int2, stakind2 int2, stakind3 int3,
> + *                        stakind4 int2, stakind5 int2, stanumbers1 float4[],
> + *                        stanumbers2 float4[], stanumbers3 float4[],
> + *                        stanumbers4 float4[], stanumbers5 float4[],
> + *                        stanumbers1 float4[], stanumbers2 float4[],
> + *                        stanumbers3 float4[], stanumbers4 float4[],
> + *                        stanumbers5 float4[], stavalues1 text,
> + *                        stavalues2 text, stavalues3 text,
> + *                        stavalues4 text, stavalues5 text);
> + *
> + *
> + */

Don't know that it makes sense to just repeat the function declaration
inside a comment like this- it'll just end up out of date.

> +Datum
> +pg_set_attribute_stats(PG_FUNCTION_ARGS)

> +    /* names of columns that cannot be null */
> +    const char *required_param_names[] = {
> +        "relation",
> +        "attname",
> +        "stainherit",
> +        "stanullfrac",
> +        "stawidth",
> +        "stadistinct",
> +        "stakind1",
> +        "stakind2",
> +        "stakind3",
> +        "stakind4",
> +        "stakind5",
> +    };

Same comment here as above wrt NULL being passed in.

> +    for (int k = 0; k < 5; k++)

Shouldn't we use STATISTIC_NUM_SLOTS here?

Thanks!

Stephen

Attachments

Re: Statistics Import and Export

From
Corey Huinker
Date:
Perhaps it's just me ... but it doesn't seem like it's all that many
parameters.

It's more than I can memorize, so that feels like a lot to me. Clearly it's not insurmountable.

 
> I am a bit concerned about the number of locks on pg_statistic and the
> relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per
> attribute rather than once per relation. But I also see that this will
> mostly get used at a time when no other traffic is on the machine, and
> whatever it costs, it's still faster than the smallest table sample (insert
> joke about "don't have to be faster than the bear" here).

I continue to not be too concerned about this until and unless it's
actually shown to be an issue.  Keeping things simple and
straight-forward has its own value.

Ok, I'm sold on that plan.
 

> +/*
> + * Set statistics for a given pg_class entry.
> + *
> + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> + *
> + * This does an in-place (i.e. non-transactional) update of pg_class, just as
> + * is done in ANALYZE.
> + *
> + */
> +Datum
> +pg_set_relation_stats(PG_FUNCTION_ARGS)
> +{
> +     const char *param_names[] = {
> +             "relation",
> +             "reltuples",
> +             "relpages",
> +     };
> +
> +     Oid                             relid;
> +     Relation                rel;
> +     HeapTuple               ctup;
> +     Form_pg_class   pgcform;
> +
> +     for (int i = 0; i <= 2; i++)
> +             if (PG_ARGISNULL(i))
> +                     ereport(ERROR,
> +                                     (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                      errmsg("%s cannot be NULL", param_names[i])));

Why not just mark this function as strict..?  Or perhaps we should allow
NULLs to be passed in and just not update the current value in that
case?

Strict could definitely apply here, and I'm inclined to make it so.

 
Also, in some cases we allow the function to be called with a
NULL but then make it a no-op rather than throwing an ERROR (eg, if the
OID ends up being NULL).

Thoughts on it emitting a WARN or NOTICE before returning false?

 
  Not sure if that makes sense here or not
offhand but figured I'd mention it as something to consider.

> +     pgcform = (Form_pg_class) GETSTRUCT(ctup);
> +     pgcform->reltuples = PG_GETARG_FLOAT4(1);
> +     pgcform->relpages = PG_GETARG_INT32(2);

Shouldn't we include relallvisible?

Yes. No idea why I didn't have that in there from the start.
 
Also, perhaps we should use the approach that we have in ANALYZE, and
only actually do something if the values are different rather than just
always doing an update.

That was how it worked back in v1, more for the possibility that there was no matching JSON to set values.

Looking again at analyze.c (currently lines 1751-1780), we just check if there is a row in the way, and if so we replace it. I don't see where we compare existing values to new values.
 

> +/*
> + * Import statistics for a given relation attribute
> + *
> + * pg_set_attribute_stats(relation Oid, attname name, stainherit bool,
> + *                        stanullfrac float4, stawidth int, stadistinct float4,
> + *                        stakind1 int2, stakind2 int2, stakind3 int3,
> + *                        stakind4 int2, stakind5 int2, stanumbers1 float4[],
> + *                        stanumbers2 float4[], stanumbers3 float4[],
> + *                        stanumbers4 float4[], stanumbers5 float4[],
> + *                        stanumbers1 float4[], stanumbers2 float4[],
> + *                        stanumbers3 float4[], stanumbers4 float4[],
> + *                        stanumbers5 float4[], stavalues1 text,
> + *                        stavalues2 text, stavalues3 text,
> + *                        stavalues4 text, stavalues5 text);
> + *
> + *
> + */

Don't know that it makes sense to just repeat the function declaration
inside a comment like this- it'll just end up out of date.

Historical artifact - previous versions had a long explanation of the JSON format.

 

> +Datum
> +pg_set_attribute_stats(PG_FUNCTION_ARGS)

> +     /* names of columns that cannot be null */
> +     const char *required_param_names[] = {
> +             "relation",
> +             "attname",
> +             "stainherit",
> +             "stanullfrac",
> +             "stawidth",
> +             "stadistinct",
> +             "stakind1",
> +             "stakind2",
> +             "stakind3",
> +             "stakind4",
> +             "stakind5",
> +     };

Same comment here as above wrt NULL being passed in.

In this case, the last 10 params (stanumbersN and stavaluesN) can be null, and are NULL more often than not.
 

> +     for (int k = 0; k < 5; k++)

Shouldn't we use STATISTIC_NUM_SLOTS here?

Yes, I had it that way in the past. Not sure why I didn't again. 

Re: Statistics Import and Export

From
Bharath Rupireddy
Date:
On Fri, Mar 8, 2024 at 12:06 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>
> Anyway, here's v7. Eagerly awaiting feedback.

Thanks for working on this. It looks useful to have the ability to
restore the stats after upgrade and restore. But, the exported stats
are valid only until the next ANALYZE is run on the table. IIUC,
postgres collects stats during VACUUM, autovacuum and ANALYZE, right?
Perhaps there are other ways to collect stats. I'm wondering what
problems the user faces if they are just asked to run ANALYZE on
the tables (I'm assuming ANALYZE doesn't block concurrent access to
the tables) instead of automatically exporting stats.

Here are some comments on the v7 patch. I've not dived into the whole
thread, so some comments may be repeats or may need
clarification. Please bear with me.

1. The following two are unnecessary changes in pg_proc.dat, please remove them.

--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8818,7 +8818,6 @@
 { oid => '3813', descr => 'generate XML text node',
   proname => 'xmltext', proisstrict => 't', prorettype => 'xml',
   proargtypes => 'text', prosrc => 'xmltext' },
-
 { oid => '2923', descr => 'map table contents to XML',
   proname => 'table_to_xml', procost => '100', provolatile => 's',
   proparallel => 'r', prorettype => 'xml',
@@ -12163,8 +12162,24 @@

 # GiST stratnum implementations
 { oid => '8047', descr => 'GiST support',
-  proname => 'gist_stratnum_identity', prorettype => 'int2',
+  proname => 'gist_stratnum_identity',prorettype => 'int2',
   proargtypes => 'int2',
   prosrc => 'gist_stratnum_identity' },

2.
+        they are replaced by the next auto-analyze. This function is used by
+        <command>pg_upgrade</command> and <command>pg_restore</command> to
+        convey the statistics from the old system version into the new one.
+       </para>

Is there any demonstration of pg_set_relation_stats and
pg_set_attribute_stats being used either in pg_upgrade or in
pg_restore? Perhaps, having them as 0002, 0003 and so on patches might
show real need for functions like this. It also clarifies how these
functions pull stats from tables on the old cluster to the tables on
the new cluster.

3. pg_set_relation_stats and pg_set_attribute_stats seem to be writing
to pg_class and might affect the plans, as stats can be tampered with. Can
we REVOKE the execute permissions from public out of the box in
src/backend/catalog/system_functions.sql? This way one can decide who
to give permissions to.

4.
+SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
3.6::float4, 15000);
+ pg_set_relation_stats
+-----------------------
+ t
+(1 row)
+
+SELECT reltuples, relpages FROM pg_class WHERE oid =
'stats_export_import.test'::regclass;
+ reltuples | relpages
+-----------+----------
+       3.6 |    15000

Isn't this test case showing a misuse of these functions? The table
actually has no rows, but we are lying to the postgres optimizer about
its stats. I think altering stats of a table mustn't be that easy for the
end user. As mentioned in comment #3, permissions need to be
tightened. In addition, we can also mark the functions pg_upgrade only
with CHECK_IS_BINARY_UPGRADE, but that might not work for pg_restore
(or I don't know if we have a way to know within the server that the
server is running for pg_restore).

5. In continuation of comment #2, is pg_dump supposed to generate
pg_set_relation_stats and pg_set_attribute_stats statements for each
table? When pg_dump does that, pg_restore can automatically load the
stats.

6.
+/*-------------------------------------------------------------------------
* * statistics.c *
+ * IDENTIFICATION
+ *       src/backend/statistics/statistics.c
+ *
+ *-------------------------------------------------------------------------

A description of what the new file statistics.c does is missing.

7. pgindent isn't happy with new file statistics.c, please check.

8.
+/*
+ * Import statistics for a given relation attribute
+ *
+ * pg_set_attribute_stats(relation Oid, attname name, stainherit bool,
+ *                        stanullfrac float4, stawidth int, stadistinct float4,

Having function definition in the function comment isn't necessary -
it's hard to keep it consistent with pg_proc.dat in future. If
required, one can either look at pg_proc.dat or docs.

9. Isn't it good to add a test case where the plan of a query on a table,
after the stats are imported, remains the same as that of the original
table from which the stats were exported? IMO, this is more realistic
than just comparing pg_statistic of the tables because this is what an
end-user wants eventually.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Statistics Import and Export

From
Corey Huinker
Date:


On Sun, Mar 10, 2024 at 11:57 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Fri, Mar 8, 2024 at 12:06 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>
> Anyway, here's v7. Eagerly awaiting feedback.

Thanks for working on this. It looks useful to have the ability to
restore the stats after upgrade and restore. But, the exported stats
are valid only until the next ANALYZE is run on the table. IIUC,
postgres collects stats during VACUUM, autovacuum and ANALYZE, right?
 Perhaps there are other ways to collect stats. I'm thinking what
problems does the user face if they are just asked to run ANALYZE on
the tables (I'm assuming ANALYZE doesn't block concurrent access to
the tables) instead of automatically exporting stats.

Correct. These are just as temporary as any other analyze of the table. Another analyze will happen later, probably through autovacuum, and will wipe out these values. This is designed to QUICKLY get stats into a table to enable the database to be operational sooner. This is especially important after an upgrade/restore, when all stats were wiped out. Other uses could be adapting this for use in postgres_fdw so that we don't have to do table sampling on the remote table, and of course statistics injection to test the query planner.
 
2.
+        they are replaced by the next auto-analyze. This function is used by
+        <command>pg_upgrade</command> and <command>pg_restore</command> to
+        convey the statistics from the old system version into the new one.
+       </para>

Is there any demonstration of pg_set_relation_stats and
pg_set_attribute_stats being used either in pg_upgrade or in
pg_restore? Perhaps, having them as 0002, 0003 and so on patches might
show real need for functions like this. It also clarifies how these
functions pull stats from tables on the old cluster to the tables on
the new cluster.

That code was adapted from do_analyze(), and yes, there is a patch for pg_dump, but as I noted earlier it is on hold pending feedback.
 

3. pg_set_relation_stats and pg_set_attribute_stats seem to be writing
to pg_class and might affect the plans as stats can get tampered. Can
we REVOKE the execute permissions from the public out of the box in
src/backend/catalog/system_functions.sql? This way one can decide who
to give permissions to.

You'd have to be the table owner to alter the stats. I can envision these functions getting a special role, but they could also be fine as superuser-only.
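
If we did want to tighten that further, the usual pattern would presumably apply (a sketch only; the argument types and the stats_admin role are hypothetical):

    REVOKE EXECUTE ON FUNCTION pg_set_relation_stats(regclass, real, integer) FROM PUBLIC;
    GRANT EXECUTE ON FUNCTION pg_set_relation_stats(regclass, real, integer) TO stats_admin;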
 

4.
+SELECT pg_set_relation_stats('stats_export_import.test'::regclass,
3.6::float4, 15000);
+ pg_set_relation_stats
+-----------------------
+ t
+(1 row)
+
+SELECT reltuples, relpages FROM pg_class WHERE oid =
'stats_export_import.test'::regclass;
+ reltuples | relpages
+-----------+----------
+       3.6 |    15000

Isn't this test case showing a misuse of these functions? The table
actually has no rows, but we are lying to the postgres optimizer about
its stats.

Consider this case. You want to know at what point the query planner will start using a given index. You can generate dummy data for a thousand, a million, a billion rows, and wait for that to complete, or you can just tell the table "I say you have a billion rows, twenty million pages, etc" and see when it changes.
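
A minimal sketch of that workflow (numbers invented; the call uses the current pg_set_relation_stats(relation, reltuples, relpages) form):

    -- pretend the table is enormous and see whether the plan flips to the index
    SELECT pg_set_relation_stats('public.foo'::regclass, 1000000000::float4, 20000000);
    EXPLAIN SELECT * FROM public.foo WHERE id = 42;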

But again, in most cases, you're setting the values to the same values the table had on the old database just before the restore/upgrade. 
 
I think altering stats of a table mustn't be that easy for the
end user.

Only easy for the end users that happen to be the table owner or a superuser.
 
As mentioned in comment #3, permissions need to be
tightened. In addition, we can also mark the functions pg_upgrade only
with CHECK_IS_BINARY_UPGRADE, but that might not work for pg_restore
(or I don't know if we have a way to know within the server that the
server is running for pg_restore).

I think they will have usage both in postgres_fdw and for tuning.
 

5. In continuation to the comment #2, is pg_dump supposed to generate
pg_set_relation_stats and pg_set_attribute_stats statements for each
table? When pg_dump does that , pg_restore can automatically load the
stats.

Current plan is to have one TOC entry in the post-data section with a dependency on the table/index/matview. That lets us leverage existing filters. The TOC entry will have a series of statements in it, one pg_set_relation_stats() and one pg_set_attribute_stats() per attribute.
 
9. Isn't it good to add a test case where the plan of a query on table
after exporting the stats would remain same as that of the original
table from which the stats are exported? IMO, this is a more realistic
than just comparing pg_statistic of the tables because this is what an
end-user wants eventually.

I'm sure we can add something like that, but query plan formats change a lot and are greatly dependent on database configuration, so maintaining such a test would be a lot of work.

Re: Statistics Import and Export

From
Bertrand Drouvot
Date:
Hi,

On Fri, Mar 08, 2024 at 01:35:40AM -0500, Corey Huinker wrote:
> Anyway, here's v7. Eagerly awaiting feedback.

Thanks!

A few random comments:

1 ===

+        The purpose of this function is to apply statistics values in an
+        upgrade situation that are "good enough" for system operation until

Worth adding a few words about the "influencing the planner" use case?

2 ===

+#include "catalog/pg_type.h"
+#include "fmgr.h"

Are those 2 needed?

3 ===

+       if (!HeapTupleIsValid(ctup))
+               elog(ERROR, "pg_class entry for relid %u vanished during statistics import",

s/during statistics import/when setting statistics/?

4 ===

+Datum
+pg_set_relation_stats(PG_FUNCTION_ARGS)
+{
.
.
+       table_close(rel, ShareUpdateExclusiveLock);
+
+       PG_RETURN_BOOL(true);

Why return a bool? (I mean, we'd either throw an error or return true.)

5 ===

+ */
+Datum
+pg_set_attribute_stats(PG_FUNCTION_ARGS)
+{

This function is not that simple; worth explaining its logic in a comment above?

6 ===

+       if (!HeapTupleIsValid(tuple))
+       {
+               relation_close(rel, NoLock);
+               PG_RETURN_BOOL(false);
+       }
+
+       attr = (Form_pg_attribute) GETSTRUCT(tuple);
+       if (attr->attisdropped)
+       {
+               ReleaseSysCache(tuple);
+               relation_close(rel, NoLock);
+               PG_RETURN_BOOL(false);
+       }

Why is it returning "false" and not throwing an error? (if ok, then I think
we can get rid of returning a bool).

7 ===

+        * If this relation is an index and that index has expressions in
+        * it, and the attnum specified

s/is an index and that index has/is an index that has/?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Statistics Import and Export

From
Stephen Frost
Date:
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> > > +/*
> > > + * Set statistics for a given pg_class entry.
> > > + *
> > > + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> > > + *
> > > + * This does an in-place (i.e. non-transactional) update of pg_class,
> > just as
> > > + * is done in ANALYZE.
> > > + *
> > > + */
> > > +Datum
> > > +pg_set_relation_stats(PG_FUNCTION_ARGS)
> > > +{
> > > +     const char *param_names[] = {
> > > +             "relation",
> > > +             "reltuples",
> > > +             "relpages",
> > > +     };
> > > +
> > > +     Oid                             relid;
> > > +     Relation                rel;
> > > +     HeapTuple               ctup;
> > > +     Form_pg_class   pgcform;
> > > +
> > > +     for (int i = 0; i <= 2; i++)
> > > +             if (PG_ARGISNULL(i))
> > > +                     ereport(ERROR,
> > > +
> >  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +                                      errmsg("%s cannot be NULL",
> > param_names[i])));
> >
> > Why not just mark this function as strict..?  Or perhaps we should allow
> > NULLs to be passed in and just not update the current value in that
> > case?
>
> Strict could definitely apply here, and I'm inclined to make it so.

Having thought about it a bit more, I generally like the idea of being
able to just update one stat instead of having to update all of them at
once (and therefore having to go look up what the other values currently
are...).  That said, per below, perhaps making it strict is the better
plan.

> > Also, in some cases we allow the function to be called with a
> > NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> > OID ends up being NULL).
>
> Thoughts on it emitting a WARN or NOTICE before returning false?

Eh, I don't think so?

Where this is coming from is that we can often end up with functions
like these being called inside of larger queries, and having them spit
out WARN or NOTICE will just make them noisy.

That leads to my general feeling of just returning NULL if called with a
NULL OID, as we would get with setting the function strict.

> >   Not sure if that makes sense here or not
> > offhand but figured I'd mention it as something to consider.
> >
> > > +     pgcform = (Form_pg_class) GETSTRUCT(ctup);
> > > +     pgcform->reltuples = PG_GETARG_FLOAT4(1);
> > > +     pgcform->relpages = PG_GETARG_INT32(2);
> >
> > Shouldn't we include relallvisible?
>
> Yes. No idea why I didn't have that in there from the start.

Ok.

> > Also, perhaps we should use the approach that we have in ANALYZE, and
> > only actually do something if the values are different rather than just
> > always doing an update.
>
> That was how it worked back in v1, more for the possibility that there was
> no matching JSON to set values.
>
> Looking again at analyze.c (currently lines 1751-1780), we just check if
> there is a row in the way, and if so we replace it. I don't see where we
> compare existing values to new values.

Well, that code is for pg_statistic while I was looking at pg_class (in
vacuum.c:1428-1443, where we track if we're actually changing anything
and only make the pg_class change if there's actually something
different):

vacuum.c:1531
    /* If anything changed, write out the tuple. */
    if (dirty)
        heap_inplace_update(rd, ctup);

Not sure why we don't treat both the same way though ... although it's
probably the case that it's much less likely to have an entire
pg_statistic row be identical than the few values in pg_class.

> > > +Datum
> > > +pg_set_attribute_stats(PG_FUNCTION_ARGS)
> >
> > > +     /* names of columns that cannot be null */
> > > +     const char *required_param_names[] = {
> > > +             "relation",
> > > +             "attname",
> > > +             "stainherit",
> > > +             "stanullfrac",
> > > +             "stawidth",
> > > +             "stadistinct",
> > > +             "stakind1",
> > > +             "stakind2",
> > > +             "stakind3",
> > > +             "stakind4",
> > > +             "stakind5",
> > > +     };
> >
> > Same comment here as above wrt NULL being passed in.
>
> In this case, the last 10 params (stanumbersN and stavaluesN) can be null,
> and are NULL more often than not.

Hmm, that's a valid point, so a NULL passed in would need to set that
value actually to NULL, presumably.  Perhaps then we should have
pg_set_relation_stats() be strict and have pg_set_attribute_stats()
handle NULLs passed in appropriately, returning NULL if the relation
itself, attname, or another required (not NULL'able) argument is
passed in as NULL.

(What I'm trying to drive at here is a consistent interface for these
functions, but one which does a no-op instead of returning an ERROR on
values being passed in which aren't allowable; it can be quite
frustrating trying to get a query to work where one of the functions
decides to return ERROR instead of just ignoring things passed in which
aren't valid.)

> > > +     for (int k = 0; k < 5; k++)
> >
> > Shouldn't we use STATISTIC_NUM_SLOTS here?
>
> Yes, I had in the past. Not sure why I didn't again.

No worries.

Thanks!

Stephen

Attachments

Re: Statistics Import and Export

From
Corey Huinker
Date:



Having thought about it a bit more, I generally like the idea of being
able to just update one stat instead of having to update all of them at
once (and therefore having to go look up what the other values currently
are...).  That said, per below, perhaps making it strict is the better
plan.

v8 has it as strict.
 

> > Also, in some cases we allow the function to be called with a
> > NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> > OID ends up being NULL).
>
> Thoughts on it emitting a WARN or NOTICE before returning false?

Eh, I don't think so?

Where this is coming from is that we can often end up with functions
like these being called inside of larger queries, and having them spit
out WARN or NOTICE will just make them noisy.

That leads to my general feeling of just returning NULL if called with a
NULL OID, as we would get with setting the function strict.

In which case we're failing nearly silently, yes, there is a null returned, but we have no idea why there is a null returned. If I were using this function manually I'd want to know what I did wrong, what parameter I skipped, etc.
 
Well, that code is for pg_statistic while I was looking at pg_class (in
vacuum.c:1428-1443, where we track if we're actually changing anything
and only make the pg_class change if there's actually something
different):

I can do that, especially since it's only 3 tuples of known types, but my reservations are summed up in the next comment.

 
Not sure why we don't treat both the same way though ... although it's
probably the case that it's much less likely to have an entire
pg_statistic row be identical than the few values in pg_class.

That would also involve comparing ANYARRAY values, yuk. Also, a matched record will never be the case when used for the primary purpose of the function (upgrades), and it's not a big deal in the other future cases (if we use it in ANALYZE on foreign tables instead of remote table samples, or users experimenting with tuning queries under hypothetical workloads).


 
Hmm, that's a valid point, so a NULL passed in would need to set that
value actually to NULL, presumably.  Perhaps then we should have
pg_set_relation_stats() be strict and have pg_set_attribute_stats()
handles NULLs passed in appropriately, and return NULL if the relation
itself or attname, or other required (not NULL'able) argument passed in
cause the function to return NULL.

That's how I have relstats done in v8, and could make it do that for attr stats.

(What I'm trying to drive at here is a consistent interface for these
functions, but one which does a no-op instead of returning an ERROR on
values being passed in which aren't allowable; it can be quite
frustrating trying to get a query to work where one of the functions
decides to return ERROR instead of just ignoring things passed in which
aren't valid.)

I like the symmetry of a consistent interface, but we've already got an asymmetry in that the pg_class update is done non-transactionally (like ANALYZE does).

One persistent problem is that there is no _safe equivalent to ARRAY_IN, so that can always fail on us, though it should only do so if the string passed in wasn't a valid array input format, or the values in the array can't coerce to the attribute's basetype.

I should also point out that we've lost the ability to check if the export values were of a type, and if the destination column is also of that type. That's a non-issue in binary upgrades, but of course if a field changed from integers to text the histograms would now be highly misleading. Thoughts on adding a typname parameter that the function uses as a cheap validity check?
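
For example, hypothetically (the 'int4' argument is the proposed typname check and is not in v8; its position in the argument list is made up):

    SELECT pg_set_attribute_stats('public.foo'::regclass, 'id'::name, 'int4'::name, ...);

where the call would be rejected if 'int4' no longer matched the column's actual type.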

v8 attached, incorporating these suggestions plus those of Bharath and Bertrand. Still no pg_dump.

As for pg_dump, I'm currently leading toward the TOC entry having either a series of commands:

    SELECT pg_set_relation_stats('foo.bar'::regclass, ...); pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...); ...

Or one compound command

    SELECT pg_set_relation_stats(t.oid, ...),
         pg_set_attribute_stats(t.oid, 'id'::name, ...),
         pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
         ...
    FROM (VALUES('foo.bar'::regclass)) AS t(oid);

The second one has the feature that if any one attribute fails, then the whole update fails, except, of course, for the in-place update of pg_class. This avoids having an explicit transaction block, but we could get that back by having restore wrap the list of commands in a transaction block (and adding the explicit lock commands) when it is safe to do so.
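
Spelled out, the wrapped form would presumably look something like this (lock mode and ellipses as in the sketches above; none of it final):

    BEGIN;
    LOCK TABLE foo.bar IN SHARE UPDATE EXCLUSIVE MODE;
    SELECT pg_set_relation_stats(t.oid, ...),
           pg_set_attribute_stats(t.oid, 'id'::name, ...),
           pg_set_attribute_stats(t.oid, 'last_name'::name, ...)
    FROM (VALUES('foo.bar'::regclass)) AS t(oid);
    COMMIT;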

Attachments

Re: Statistics Import and Export

From
Corey Huinker
Date:
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> > Having thought about it a bit more, I generally like the idea of being
> > able to just update one stat instead of having to update all of them at
> > once (and therefore having to go look up what the other values currently
> > are...).  That said, per below, perhaps making it strict is the better
> > plan.
>
> v8 has it as strict.

Ok.

> > > > Also, in some cases we allow the function to be called with a
> > > > NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> > > > OID ends up being NULL).
> > >
> > > Thoughts on it emitting a WARN or NOTICE before returning false?
> >
> > Eh, I don't think so?
> >
> > Where this is coming from is that we can often end up with functions
> > like these being called inside of larger queries, and having them spit
> > out WARN or NOTICE will just make them noisy.
> >
> > That leads to my general feeling of just returning NULL if called with a
> > NULL OID, as we would get with setting the function strict.
>
> In which case we're failing nearly silently, yes, there is a null returned,
> but we have no idea why there is a null returned. If I were using this
> function manually I'd want to know what I did wrong, what parameter I
> skipped, etc.

I can see it both ways and don't feel super strongly about it ... I just
know that I've had some cases where we returned an ERROR or otherwise
were a bit noisy on NULL values getting passed into a function and it
was much more on the annoying side than on the helpful side; to the
point where we've gone back and pulled out ereport(ERROR) calls from
functions before because they were causing issues in otherwise pretty
reasonable queries (consider things like functions getting pushed down
to below WHERE clauses and such...).

> > Well, that code is for pg_statistic while I was looking at pg_class (in
> > vacuum.c:1428-1443, where we track if we're actually changing anything
> > and only make the pg_class change if there's actually something
> > different):
>
> I can do that, especially since it's only 3 tuples of known types, but my
> reservations are summed up in the next comment.

> > Not sure why we don't treat both the same way though ... although it's
> > probably the case that it's much less likely to have an entire
> > pg_statistic row be identical than the few values in pg_class.
>
> That would also involve comparing ANYARRAY values, yuk. Also, a matched
> record will never be the case when used in primary purpose of the function
> (upgrades), and not a big deal in the other future cases (if we use it in
> ANALYZE on foreign tables instead of remote table samples, users
> experimenting with tuning queries under hypothetical workloads).

Sure.  Not a huge deal either way, was just pointing out the difference.
I do think it'd be good to match what ANALYZE does here, so checking if
the values in pg_class are different and only updating if they are,
while keeping the code for pg_statistic where it'll just always update.

> > Hmm, that's a valid point, so a NULL passed in would need to set that
> > value actually to NULL, presumably.  Perhaps then we should have
> > pg_set_relation_stats() be strict and have pg_set_attribute_stats()
> > handles NULLs passed in appropriately, and return NULL if the relation
> > itself or attname, or other required (not NULL'able) argument passed in
> > cause the function to return NULL.
> >
>
> That's how I have relstats done in v8, and could make it do that for attr
> stats.

That'd be my suggestion, at least, but as I mention above, it's not a
position I hold very strongly.

> > (What I'm trying to drive at here is a consistent interface for these
> > functions, but one which does a no-op instead of returning an ERROR on
> > values being passed in which aren't allowable; it can be quite
> > frustrating trying to get a query to work where one of the functions
> > decides to return ERROR instead of just ignoring things passed in which
> > aren't valid.)
>
> I like the symmetry of a consistent interface, but we've already got an
> asymmetry in that the pg_class update is done non-transactionally (like
> ANALYZE does).

Don't know that I really consider that to be the same kind of thing when
it comes to talking about the interface as the other aspects we're
discussing ...

> One persistent problem is that there is no _safe equivalent to ARRAY_IN, so
> that can always fail on us, though it should only do so if the string
> passed in wasn't a valid array input format, or the values in the array
> can't coerce to the attribute's basetype.

That would happen before we even get to being called and there's not
much to do about it anyway.

> I should also point out that we've lost the ability to check if the export
> values were of a type, and if the destination column is also of that type.
> That's a non-issue in binary upgrades, but of course if a field changed
> from integers to text the histograms would now be highly misleading.
> Thoughts on adding a typname parameter that the function uses as a cheap
> validity check?

Seems reasonable to me.

> v8 attached, incorporating these suggestions plus those of Bharath and
> Bertrand. Still no pg_dump.
>
> As for pg_dump, I'm currently leading toward the TOC entry having either a
> series of commands:
>
>     SELECT pg_set_relation_stats('foo.bar'::regclass, ...);
> pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...); ...

I'm guessing the above was intended to be SELECT ..; SELECT ..;

> Or one compound command
>
>     SELECT pg_set_relation_stats(t.oid, ...)
>          pg_set_attribute_stats(t.oid, 'id'::name, ...),
>          pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
>          ...
>     FROM (VALUES('foo.bar'::regclass)) AS t(oid);
>
> The second one has the feature that if any one attribute fails, then the
> whole update fails, except, of course, for the in-place update of pg_class.
> This avoids having an explicit transaction block, but we could get that
> back by having restore wrap the list of commands in a transaction block
> (and adding the explicit lock commands) when it is safe to do so.

Hm, I like this approach as it should essentially give us the
transaction block we had been talking about wanting but without needing
to explicitly do a begin/commit, which would add in some annoying
complications.  This would hopefully also reduce the locking concern
mentioned previously, since we'd get the lock needed in the first
function call and then the others would be able to just see that we've
already got the lock pretty quickly.

> Subject: [PATCH v8] Create pg_set_relation_stats, pg_set_attribute_stats.

[...]

> +Datum
> +pg_set_relation_stats(PG_FUNCTION_ARGS)

[...]

> +    ctup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
> +    if (!HeapTupleIsValid(ctup))
> +        elog(ERROR, "pg_class entry for relid %u vanished during statistics import",
> +             relid);

Maybe drop the 'during statistics import' part of this message?  Also
wonder if maybe we should make it a regular ereport() instead, since it
might be possible for a user to end up seeing this?

> +    pgcform = (Form_pg_class) GETSTRUCT(ctup);
> +
> +    reltuples = PG_GETARG_FLOAT4(P_RELTUPLES);
> +    relpages = PG_GETARG_INT32(P_RELPAGES);
> +    relallvisible = PG_GETARG_INT32(P_RELALLVISIBLE);
> +
> +    /* Do not update pg_class unless there is no meaningful change */

This comment doesn't seem quite right.  Maybe it would be better if it
was in the positive, eg: Only update pg_class if there is a meaningful
change.

Rest of it looks pretty good to me, at least.

Thanks!

Stephen

Attachments

Re: Statistics Import and Export

From
Corey Huinker
Date:
> In which case we're failing nearly silently, yes, there is a null returned,
> but we have no idea why there is a null returned. If I were using this
> function manually I'd want to know what I did wrong, what parameter I
> skipped, etc.

I can see it both ways and don't feel super strongly about it ... I just
know that I've had some cases where we returned an ERROR or otherwise
were a bit noisy on NULL values getting passed into a function and it
was much more on the annoying side than on the helpful side; to the
point where we've gone back and pulled out ereport(ERROR) calls from
functions before because they were causing issues in otherwise pretty
reasonable queries (consider things like functions getting pushed down
to below WHERE clauses and such...).

I don't have strong feelings either. I think we should get more input on this. Regardless, it's easy to change...for now.

 

Sure.  Not a huge deal either way, was just pointing out the difference.
I do think it'd be good to match what ANALYZE does here, so checking if
the values in pg_class are different and only updating if they are,
while keeping the code for pg_statistic where it'll just always update.

I agree that mirroring ANALYZE wherever possible is the ideal.

 
> I like the symmetry of a consistent interface, but we've already got an
> asymmetry in that the pg_class update is done non-transactionally (like
> ANALYZE does).

Don't know that I really consider that to be the same kind of thing when
it comes to talking about the interface as the other aspects we're
discussing ...

Fair.


 

> One persistent problem is that there is no _safe equivalent to ARRAY_IN, so
> that can always fail on us, though it should only do so if the string
> passed in wasn't a valid array input format, or the values in the array
> can't coerce to the attribute's basetype.

That would happen before we even get to being called and there's not
much to do about it anyway.

Not sure I follow you here. The ARRAY_IN function calls happen once for every non-null stavaluesN parameter, and it's done inside the function because the result type could be the base type for a domain/array type, or could be the type itself. I suppose we could move that determination to the caller, but then we'd need to call get_base_element_type() inside a client, and that seems wrong if it's even possible.
 
> I should also point out that we've lost the ability to check if the export
> values were of a type, and if the destination column is also of that type.
> That's a non-issue in binary upgrades, but of course if a field changed
> from integers to text the histograms would now be highly misleading.
> Thoughts on adding a typname parameter that the function uses as a cheap
> validity check?

Seems reasonable to me.

I'd like to hear what Tomas thinks about this, as he was the initial advocate for it.
 
> As for pg_dump, I'm currently leading toward the TOC entry having either a
> series of commands:
>
>     SELECT pg_set_relation_stats('foo.bar'::regclass, ...);
> pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...); ...

I'm guessing the above was intended to be SELECT ..; SELECT ..;

Yes.
 

> Or one compound command
>
>     SELECT pg_set_relation_stats(t.oid, ...)
>          pg_set_attribute_stats(t.oid, 'id'::name, ...),
>          pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
>          ...
>     FROM (VALUES('foo.bar'::regclass)) AS t(oid);
>
> The second one has the feature that if any one attribute fails, then the
> whole update fails, except, of course, for the in-place update of pg_class.
> This avoids having an explicit transaction block, but we could get that
> back by having restore wrap the list of commands in a transaction block
> (and adding the explicit lock commands) when it is safe to do so.

Hm, I like this approach as it should essentially give us the
transaction block we had been talking about wanting but without needing
to explicitly do a begin/commit, which would add in some annoying
complications.  This would hopefully also reduce the locking concern
mentioned previously, since we'd get the lock needed in the first
function call and then the others would be able to just see that we've
already got the lock pretty quickly.

True, we'd get the lock needed in the first function call, but wouldn't we also release that very lock before the subsequent call? Obviously we'd be shrinking the window in which another process could get in line and take a superior lock, and the universe of other processes that would even want a lock that blocks us is nil in the case of an upgrade, identical to existing behavior in the case of an FDW ANALYZE, and perfectly fine in the case of someone tinkering with stats.
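To make the alternative concrete, the restore output in that case would look roughly like this (a sketch only; the lock mode is a guess based on what ANALYZE takes, and the argument lists are elided just as in the examples above):

    BEGIN;
    LOCK TABLE foo.bar IN SHARE UPDATE EXCLUSIVE MODE;
    SELECT pg_set_relation_stats('foo.bar'::regclass, ...);
    SELECT pg_set_attribute_stats('foo.bar'::regclass, 'id'::name, ...);
    SELECT pg_set_attribute_stats('foo.bar'::regclass, 'last_name'::name, ...);
    COMMIT;

That keeps the all-or-nothing behavior of the compound statement while holding the relation lock across all of the calls.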
 

> Subject: [PATCH v8] Create pg_set_relation_stats, pg_set_attribute_stats.

[...]

> +Datum
> +pg_set_relation_stats(PG_FUNCTION_ARGS)

[...]

> +     ctup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
> +     if (!HeapTupleIsValid(ctup))
> +             elog(ERROR, "pg_class entry for relid %u vanished during statistics import",
> +                      relid);

Maybe drop the 'during statistics import' part of this message?  Also
wonder if maybe we should make it a regular ereport() instead, since it
might be possible for a user to end up seeing this?

Agreed and agreed. It was copypasta from ANALYZE.

 

This comment doesn't seem quite right.  Maybe it would be better if it
was in the positive, eg: Only update pg_class if there is a meaningful
change.

+1

Re: Statistics Import and Export

От
Stephen Frost
Дата:
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> > > One persistent problem is that there is no _safe equivalent to ARRAY_IN,
> > so
> > > that can always fail on us, though it should only do so if the string
> > > passed in wasn't a valid array input format, or the values in the array
> > > can't coerce to the attribute's basetype.
> >
> > That would happen before we even get to being called and there's not
> > much to do about it anyway.
>
> Not sure I follow you here. The ARRAY_IN function calls happen once for
> every non-null stavaluesN parameter, and it's done inside the function
> because the result type could be the base type for a domain/array type, or
> could be the type itself. I suppose we could move that determination to the
> caller, but then we'd need to call get_base_element_type() inside a client,
> and that seems wrong if it's even possible.

Ah, yeah, ok, I see what you're saying here and sure, there's a risk
those might ERROR too, but that's outright invalid data then as opposed
to a NULL getting passed in.

> > > Or one compound command
> > >
> > >     SELECT pg_set_relation_stats(t.oid, ...)
> > >          pg_set_attribute_stats(t.oid, 'id'::name, ...),
> > >          pg_set_attribute_stats(t.oid, 'last_name'::name, ...),
> > >          ...
> > >     FROM (VALUES('foo.bar'::regclass)) AS t(oid);
> > >
> > > The second one has the feature that if any one attribute fails, then the
> > > whole update fails, except, of course, for the in-place update of
> > pg_class.
> > > This avoids having an explicit transaction block, but we could get that
> > > back by having restore wrap the list of commands in a transaction block
> > > (and adding the explicit lock commands) when it is safe to do so.
> >
> > Hm, I like this approach as it should essentially give us the
> > transaction block we had been talking about wanting but without needing
> > to explicitly do a begin/commit, which would add in some annoying
> > complications.  This would hopefully also reduce the locking concern
> > mentioned previously, since we'd get the lock needed in the first
> > function call and then the others would be able to just see that we've
> > already got the lock pretty quickly.
>
> True, we'd get the lock needed in the first function call, but wouldn't we
> also release that very lock before the subsequent call? Obviously we'd be
> shrinking the window in which another process could get in line and take a
> superior lock, and the universe of other processes that would even want a
> lock that blocks us is nil in the case of an upgrade, identical to existing
> behavior in the case of an FDW ANALYZE, and perfectly fine in the case of
> someone tinkering with stats.

No, we should be keeping the lock until the end of the transaction
(which in this case would be just the one statement, but it would be the
whole statement and all of the calls in it).  See analyze.c:268 or
so, where we call relation_close(onerel, NoLock); meaning we're closing
the relation but we're *not* releasing the lock on it- it'll get
released at the end of the transaction.

Thanks!

Stephen

Вложения

Re: Statistics Import and Export

От
Corey Huinker
Дата:
No, we should be keeping the lock until the end of the transaction
(which in this case would be just the one statement, but it would be the
whole statement and all of the calls in it).  See analyze.c:268 or
so, where we call relation_close(onerel, NoLock); meaning we're closing
the relation but we're *not* releasing the lock on it- it'll get
released at the end of the transaction.


If that's the case, then changing the two table_close() statements to NoLock should resolve any remaining concern.
 

Re: Statistics Import and Export

От
Stephen Frost
Дата:
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> > No, we should be keeping the lock until the end of the transaction
> > (which in this case would be just the one statement, but it would be the
> > whole statement and all of the calls in it).  See analyze.c:268 or
> > so, where we call relation_close(onerel, NoLock); meaning we're closing
> > the relation but we're *not* releasing the lock on it- it'll get
> > released at the end of the transaction.
>
> If that's the case, then changing the two table_close() statements to
> NoLock should resolve any remaining concern.

Note that there's two different things we're talking about here- the
lock on the relation that we're analyzing and then the lock on the
pg_statistic (or pg_class) catalog itself.  Currently, at least, it
looks like in the three places in the backend that we open
StatisticRelationId, we release the lock when we close it rather than
waiting for transaction end.  I'd be inclined to keep it that way in
these functions also.  I doubt that one lock will end up causing much in
the way of issues to acquire/release it multiple times and it would keep
the code consistent with the way ANALYZE works.

If it can be shown to be an issue then we could certainly revisit this.

Thanks,

Stephen

Вложения

Re: Statistics Import and Export

От
Corey Huinker
Дата:
Note that there's two different things we're talking about here- the
lock on the relation that we're analyzing and then the lock on the
pg_statistic (or pg_class) catalog itself.  Currently, at least, it
looks like in the three places in the backend that we open
StatisticRelationId, we release the lock when we close it rather than
waiting for transaction end.  I'd be inclined to keep it that way in
these functions also.  I doubt that one lock will end up causing much in
the way of issues to acquire/release it multiple times and it would keep
the code consistent with the way ANALYZE works.

ANALYZE takes out one lock on StatisticRelationId per relation, not one per attribute like we do now. If we didn't release the lock after every attribute, and we only called the function outside of a larger transaction (as we plan to do with pg_restore), then that is the closest we're going to get to being consistent with ANALYZE.

Re: Statistics Import and Export

От
Corey Huinker
Дата:
ANALYZE takes out one lock on StatisticRelationId per relation, not one per attribute like we do now. If we didn't release the lock after every attribute, and we only called the function outside of a larger transaction (as we plan to do with pg_restore), then that is the closest we're going to get to being consistent with ANALYZE.

v9 attached. This adds pg_dump support. It works in tests against existing databases such as dvdrental, though I was surprised at how few indexes have attribute stats there.

Statistics are preserved by default, but this can be disabled with the option --no-statistics. This follows the prevailing option pattern in pg_dump, etc.

There are currently several failing TAP tests around pg_dump/pg_restore/pg_upgrade. I'm looking at those, but in the mean time I'm seeking feedback on the progress so far.
 
Вложения

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Fri, 2024-03-15 at 03:55 -0400, Corey Huinker wrote:
>
> Statistics are preserved by default, but this can be disabled with
> the option --no-statistics. This follows the prevailing option
> pattern in pg_dump, etc.

I'm not sure if saving statistics should be the default in 17. I'm
inclined to make it opt-in.

> There are currently several failing TAP tests around
> pg_dump/pg_restore/pg_upgrade.

It is a permissions problem. When the user running pg_dump is not a
superuser, they don't have permission to access pg_statistic. That
causes an error in exportRelationStatsStmt(), which returns NULL, and
then the caller segfaults.

> I'm looking at those, but in the mean time I'm seeking feedback on
> the progress so far.

Still looking, but one quick comment is that the third argument of
dumpRelationStats() should be const, which eliminates a warning.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Fri, 2024-03-15 at 15:30 -0700, Jeff Davis wrote:
> Still looking, but one quick comment is that the third argument of
> dumpRelationStats() should be const, which eliminates a warning.

A few other comments:

* pg_set_relation_stats() needs to do an ACL check so you can't set the
stats on someone else's table. I suggest honoring the new MAINTAIN
privilege as well.

* If possible, reading from pg_stats (instead of pg_statistic) would be
ideal because pg_stats already does the right checks at read time, so a
non-superuser can export stats, too.

* If reading from pg_stats, should you change the signature of
pg_set_relation_stats() to have argument names matching the columns of
pg_stats (e.g. most_common_vals instead of stakind/stavalues)?

In other words, make this a slightly higher level: conceptually
exporting/importing pg_stats rather than pg_statistic. This may also
make the SQL export queries simpler.

Also, I'm wondering about error handling. Is some kind of error thrown
by pg_set_relation_stats() going to abort an entire restore? That might
be easy to prevent with pg_restore, because it can just omit the stats,
but harder if it's in a SQL file.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Corey Huinker
Дата:

* pg_set_relation_stats() needs to do an ACL check so you can't set the
stats on someone else's table. I suggest honoring the new MAINTAIN
privilege as well.

Added.
 

* If possible, reading from pg_stats (instead of pg_statistic) would be
ideal because pg_stats already does the right checks at read time, so a
non-superuser can export stats, too.

Done. That was sorta how it was originally, so returning to that wasn't too hard.
 

* If reading from pg_stats, should you change the signature of
pg_set_relation_stats() to have argument names matching the columns of
pg_stats (e.g. most_common_vals instead of stakind/stavalues)?

Done.
 

In other words, make this a slightly higher level: conceptually
exporting/importing pg_stats rather than pg_statistic. This may also
make the SQL export queries simpler.

Eh, about the same.
 
Also, I'm wondering about error handling. Is some kind of error thrown
by pg_set_relation_stats() going to abort an entire restore? That might
be easy to prevent with pg_restore, because it can just omit the stats,
but harder if it's in a SQL file.

Aside from the oid being invalid, there's not a whole lot that can go wrong in set_relation_stats(). The error checking I did closely mirrors that in analyze.c.

Aside from the changes you suggested, as well as the error reporting change you suggested for pg_dump, I also filtered out attempts to dump stats on views.

A few TAP tests are still failing and I haven't been able to diagnose why, though the failures in parallel dump seem to be that it tries to import stats on indexes that haven't been created yet, which is odd because I sent the dependency.

All those changes are available in the patches attached.
 
Вложения

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Sun, 2024-03-17 at 23:33 -0400, Corey Huinker wrote:
>
> A few TAP tests are still failing and I haven't been able to diagnose
> why, though the failures in parallel dump seem to be that it tries to
> import stats on indexes that haven't been created yet, which is odd
> because I sent the dependency.

From testrun/pg_dump/002_pg_dump/log/regress_log_002_pg_dump, search
for the "not ok" and then look at what it tried to do right before
that. I see:

pg_dump: error: prepared statement failed: ERROR:  syntax error at or
near "%"
LINE 1: ..._histogram => %L::real[]) coalesce($2, format('%I.%I',
a.nsp...

> All those changes are available in the patches attached.

How about if you provided "get" versions of the functions that return a
set of rows that match what the "set" versions expect? That would make
0001 essentially a complete feature itself.

I think it would also make the changes in pg_dump simpler, and the
tests in 0001 a lot simpler.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Corey Huinker
Дата:



From testrun/pg_dump/002_pg_dump/log/regress_log_002_pg_dump, search
for the "not ok" and then look at what it tried to do right before
that. I see:

pg_dump: error: prepared statement failed: ERROR:  syntax error at or
near "%"
LINE 1: ..._histogram => %L::real[]) coalesce($2, format('%I.%I',
a.nsp...

Thanks. Unfamiliar turf for me.
 

> All those changes are available in the patches attached.

How about if you provided "get" versions of the functions that return a
set of rows that match what the "set" versions expect? That would make
0001 essentially a complete feature itself.

That's tricky. At the base level, those functions would just be an encapsulation of "SELECT * FROM pg_stats WHERE schemaname = $1 AND tablename = $2" which isn't all that much of a savings. Perhaps we can make the documentation more explicit about the source and nature of the parameters going into the pg_set_ functions.
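To illustrate, such a get function wouldn't be much more than the following (a rough sketch; the function name and signature are hypothetical):

    CREATE FUNCTION pg_get_attribute_stats(schema_name name, table_name name)
    RETURNS SETOF pg_catalog.pg_stats
    LANGUAGE sql STABLE
    AS $$
        SELECT * FROM pg_catalog.pg_stats s
        WHERE s.schemaname = schema_name AND s.tablename = table_name
    $$;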

Per conversation, it would be trivial to add helper functions that replace the parameters after the initial oid with a pg_class rowtype, and that would dissect the values needed and call the more complex function:

pg_set_relation_stats( oid, pg_class)
pg_set_attribute_stats( oid, pg_stats)
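Hypothetical usage, assuming such wrappers existed (the wrappers themselves don't exist yet; table names are borrowed from the regression tests):

    SELECT pg_set_relation_stats('stats_export_import.test_clone'::regclass, c)
    FROM pg_class AS c
    WHERE c.oid = 'stats_export_import.test'::regclass;

    SELECT pg_set_attribute_stats('stats_export_import.test_clone'::regclass, s)
    FROM pg_stats AS s
    WHERE s.schemaname = 'stats_export_import' AND s.tablename = 'test';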
 

I think it would also make the changes in pg_dump simpler, and the
tests in 0001 a lot simpler.

I agree. The tests are currently showing that a fidelity copy can be made from one table to another, but to do so we have to conceal the actual stats values because those are 1. not deterministic/known and 2. subject to change from version to version.

I can add some sets to arbitrary values like was done for pg_set_relation_stats().

Re: Statistics Import and Export

От
Corey Huinker
Дата:
v11 attached.

- TAP tests passing (the big glitch was that indexes that are used in constraints should have their stats dependent on the constraint, not the index, thanks Jeff)
- The new range-specific statistics types are now supported. I'm not happy with the typid machinations I do to get them to work, but it is working so far. These are stored out-of-stakind-order (7 before 6), which is odd because all other types seem to store stakinds in ascending order. It shouldn't matter; it was just odd.
- regression tests now make simpler calls with arbitrary stats to demonstrate the function usage more cleanly
- pg_set_*_stats function now have all of their parameters in the same order as the table/view they pull from

Вложения

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Tue, 2024-03-19 at 05:16 -0400, Corey Huinker wrote:
> v11 attached.

Thank you.

Comments on 0001:

This test:

   +SELECT
   +    format('SELECT pg_catalog.pg_set_attribute_stats( '
   ...

seems misplaced. It's generating SQL that can be used to restore or
copy the stats -- that seems like the job of pg_dump, and shouldn't be
tested within the plain SQL regression tests.

And can the other tests use pg_stats rather than pg_statistic?

The function signature for pg_set_attribute_stats could be more
friendly -- how about there are a few required parameters, and then it
only sets the stats that are provided and the other ones are either
left to the existing value or get some reasonable default?

Make sure all error paths ReleaseSysCache().

Why are you calling checkCanModifyRelation() twice?

I'm confused about when the function should return false and when it
should throw an error. I'm inclined to think the return type should be
void and all failures should be reported as ERROR.

replaces[] is initialized to {true}, which means only the first element
is initialized to true. Try following the pattern in AlterDatabase (or
similar) which reads the catalog tuple first, then updates a few fields
selectively, setting the corresponding element of replaces[] along the
way.

The test also sets the most_common_freqs in an ascending order, which
is weird.

Relatedly, I got worried recently about the idea of plain users
updating statistics. In theory, that should be fine, and the planner
should be robust to whatever pg_statistic contains; but in practice
there's some risk of mischief there until everyone understands that the
contents of pg_stats should not be trusted. Fortunately I didn't find
any planner crashes or even errors after a brief test.

One thing we can do is some extra validation for consistency, like
checking that the arrays are properly sorted, check for negative
numbers in the wrong place, or fractions larger than 1.0, etc.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Corey Huinker
Дата:


On Thu, Mar 21, 2024 at 2:29 AM Jeff Davis <pgsql@j-davis.com> wrote:
On Tue, 2024-03-19 at 05:16 -0400, Corey Huinker wrote:
> v11 attached.

Thank you.

Comments on 0001:

This test:

   +SELECT
   +    format('SELECT pg_catalog.pg_set_attribute_stats( '
   ...

seems misplaced. It's generating SQL that can be used to restore or
copy the stats -- that seems like the job of pg_dump, and shouldn't be
tested within the plain SQL regression tests.

Fair enough.
 

And can the other tests use pg_stats rather than pg_statistic?

They can, but part of what I wanted to show was that the values that aren't directly passed in as parameters (staopN, stacollN) get set to the correct values, and those values aren't guaranteed to match across databases, hence testing them in the regression test rather than in a TAP test. I'd still like to be able to test that.
 

The function signature for pg_set_attribute_stats could be more
friendly -- how about there are a few required parameters, and then it
only sets the stats that are provided and the other ones are either
left to the existing value or get some reasonable default?

That would be problematic.

1. We'd have to compare the stats provided against the stats that are already there, make that list in-memory, and then re-order what remains
2. There would be no way to un-set statistics of a given stakind, unless we added an "actually set it null" boolean for each parameter that can be null. 
3. I tried that with the JSON formats, it made the code even messier than it already was.

Make sure all error paths ReleaseSysCache().

+1
 

Why are you calling checkCanModifyRelation() twice?

Once for the relation itself, and once for pg_statistic.
 
I'm confused about when the function should return false and when it
should throw an error. I'm inclined to think the return type should be
void and all failures should be reported as ERROR.

I go back and forth on that. I can see making it void and returning an error for everything that we currently return false for, but if we do that, then a statement with one pg_set_relation_stats, and N pg_set_attribute_stats (which we lump together in one command for the locking benefits and atomic transaction) would fail entirely if one of the set_attributes named a column that we had dropped. It's up for debate whether that's the right behavior or not.

replaces[] is initialized to {true}, which means only the first element
is initialized to true. Try following the pattern in AlterDatabase (or
similar) which reads the catalog tuple first, then updates a few fields
selectively, setting the corresponding element of replaces[] along the
way.

+1.
 

The test also sets the most_common_freqs in an ascending order, which
is weird.

I pulled most of the hardcoded values from pg_stats itself. The sample set is trivially small, and the values inserted were in-order-ish. So maybe that's why.
 
Relatedly, I got worried recently about the idea of plain users
updating statistics. In theory, that should be fine, and the planner
should be robust to whatever pg_statistic contains; but in practice
there's some risk of mischief there until everyone understands that the
contents of pg_stats should not be trusted. Fortunately I didn't find
any planner crashes or even errors after a brief test.

Maybe we could have the functions restricted to a role or roles:

1. pg_write_all_stats (can modify stats on ANY table)
2. pg_write_own_stats (can modify stats on tables owned by user)

I'm iffy on the need for the first one; I list it first purely to show how I derived the name for the second.
 
One thing we can do is some extra validation for consistency, like
checking that the arrays are properly sorted, check for negative
numbers in the wrong place, or fractions larger than 1.0, etc.

+1. All suggestions of validation checks welcome.

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Thu, 2024-03-21 at 03:27 -0400, Corey Huinker wrote:
>
> They can, but part of what I wanted to show was that the values that
> aren't directly passed in as parameters (staopN, stacollN) get set to
> the correct values, and those values aren't guaranteed to match
> across databases, hence testing them in the regression test rather
> than in a TAP test. I'd still like to be able to test that.

OK, that's fine.

> > The function signature for pg_set_attribute_stats could be more
> > friendly
...
> 1. We'd have to compare the stats provided against the stats that are
> already there, make that list in-memory, and then re-order what
> remains
> 2. There would be no way to un-set statistics of a given stakind,
> unless we added an "actually set it null" boolean for each parameter
> that can be null. 
> 3. I tried that with the JSON formats, it made the code even messier
> than it already was.

How about just some defaults then? Many of them have a reasonable
default, like NULL or an empty array. Some are parallel arrays and
either both should be specified or neither (e.g.
most_common_vals+most_common_freqs), but you can check for that.

> > Why are you calling checkCanModifyRelation() twice?
>
> Once for the relation itself, and once for pg_statistic.

Nobody has the privileges to modify pg_statistic except superuser,
right? I thought the point of a privilege check is that users could
modify statistics for their own tables, or the tables they maintain.

>
> I can see making it void and returning an error for everything that
> we currently return false for, but if we do that, then a statement
> with one pg_set_relation_stats, and N pg_set_attribute_stats (which
> we lump together in one command for the locking benefits and atomic
> transaction) would fail entirely if one of the set_attributes named a
> column that we had dropped. It's up for debate whether that's the
> right behavior or not.

I'd probably make the dropped column a WARNING with a message like
"skipping dropped column whatever". Regardless, have some kind of
explanatory comment.

>
> I pulled most of the hardcoded values from pg_stats itself. The
> sample set is trivially small, and the values inserted were in-order-
> ish. So maybe that's why.

In my simple test, most_common_freqs is descending:

   CREATE TABLE a(i int);
   INSERT INTO a VALUES(1);
   INSERT INTO a VALUES(2);
   INSERT INTO a VALUES(2);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   ANALYZE a;
   SELECT most_common_vals, most_common_freqs
     FROM pg_stats WHERE tablename='a';
    most_common_vals | most_common_freqs
   ------------------+-------------------
    {4,3,2}          | {0.4,0.3,0.2}
   (1 row)

Can you show an example where it's not?

>
> Maybe we could have the functions restricted to a role or roles:
>
> 1. pg_write_all_stats (can modify stats on ANY table)
> 2. pg_write_own_stats (can modify stats on tables owned by user)

If we go that route, we are giving up on the ability for users to
restore stats on their own tables. Let's just be careful about
validating data to mitigate this risk.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Corey Huinker
Дата:
How about just some defaults then? Many of them have a reasonable
default, like NULL or an empty array. Some are parallel arrays and
either both should be specified or neither (e.g.
most_common_vals+most_common_freqs), but you can check for that.

+1
Default NULL has been implemented for all parameters after n_distinct.
 

> > Why are you calling checkCanModifyRelation() twice?
>
> Once for the relation itself, and once for pg_statistic.

Nobody has the privileges to modify pg_statistic except superuser,
right? I thought the point of a privilege check is that users could
modify statistics for their own tables, or the tables they maintain.

In which case, wouldn't the checkCanModify on pg_statistic be a proxy for is_superuser/has_special_role_we_havent_created_yet?

 

>
> I can see making it void and returning an error for everything that
> we currently return false for, but if we do that, then a statement
> with one pg_set_relation_stats, and N pg_set_attribute_stats (which
> we lump together in one command for the locking benefits and atomic
> transaction) would fail entirely if one of the set_attributes named a
> column that we had dropped. It's up for debate whether that's the
> right behavior or not.

I'd probably make the dropped column a WARNING with a message like
"skipping dropped column whatever". Regardless, have some kind of
explanatory comment.

That's certainly do-able.


 

>
> I pulled most of the hardcoded values from pg_stats itself. The
> sample set is trivially small, and the values inserted were in-order-
> ish. So maybe that's why.

In my simple test, most_common_freqs is descending:

   CREATE TABLE a(i int);
   INSERT INTO a VALUES(1);
   INSERT INTO a VALUES(2);
   INSERT INTO a VALUES(2);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(3);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   INSERT INTO a VALUES(4);
   ANALYZE a;
   SELECT most_common_vals, most_common_freqs
     FROM pg_stats WHERE tablename='a';
    most_common_vals | most_common_freqs
   ------------------+-------------------
    {4,3,2}          | {0.4,0.3,0.2}
   (1 row)

Can you show an example where it's not?

Not off hand, no.

 

>
> Maybe we could have the functions restricted to a role or roles:
>
> 1. pg_write_all_stats (can modify stats on ANY table)
> 2. pg_write_own_stats (can modify stats on tables owned by user)

If we go that route, we are giving up on the ability for users to
restore stats on their own tables. Let's just be careful about
validating data to mitigate this risk.

A great many test cases coming in the next patch.

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Thu, 2024-03-21 at 15:10 -0400, Corey Huinker wrote:
>
> In which case, wouldn't the checkCanModify on pg_statistic be a
> proxy for is_superuser/has_special_role_we_havent_created_yet?

So if someone pg_dumps their table and gets the statistics in the SQL,
then they will get errors loading it unless they are a member of a
special role?

If so we'd certainly need to make --no-statistics the default, and have
some way of skipping stats during reload of the dump (perhaps make the
set function a no-op based on a GUC?).

But ideally we'd just make it safe to dump and reload stats on your own
tables, and then not worry about it.

> Not off hand, no.

To me it seems like inconsistent data to have most_common_freqs in
anything but descending order, and we should prevent it.

> >
Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Corey Huinker
Дата:


But ideally we'd just make it safe to dump and reload stats on your own
tables, and then not worry about it.

That is my strong preference, yes.
 

> Not off hand, no.

To me it seems like inconsistent data to have most_common_freqs in
anything but descending order, and we should prevent it.

Sorry, I misunderstood; I thought we were talking about the values, not the frequencies. Yes, the frequencies should be monotonically non-increasing (i.e. they can go down or stay flat from N to N+1). I'll add a test case for that.
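Roughly along these lines, using the current parameter names and illustrative values; the increasing frequencies are what the new check would reject:

SELECT pg_catalog.pg_set_attribute_stats(
    relation => 'stats_export_import.test'::regclass,
    attname => 'id'::name,
    inherited => false::boolean,
    null_frac => 0.1::real,
    avg_width => 4::integer,
    n_distinct => -0.1::real,
    most_common_vals => '{1,2,3}'::text,
    most_common_freqs => '{0.1,0.2,0.3}'::real[]  -- increasing, should raise an error once the check is in place
    );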

Re: Statistics Import and Export

От
Corey Huinker
Дата:
v12 attached.

0001 - 

The functions pg_set_relation_stats() and pg_set_attribute_stats() now return void. There just weren't enough error conditions considered recoverable to justify a non-void return value. This may mean that combining multiple pg_set_attribute_stats calls into one compound statement is no longer desirable, but that's just one of the places where I'd like feedback on how pg_dump/pg_restore should use these functions.

The function pg_set_attribute_stats() now has NULL defaults for all stakind-based statistics types. Thus, you can set statistics on a more terse basis, like so:

SELECT pg_catalog.pg_set_attribute_stats(
    relation => 'stats_export_import.test'::regclass,
    attname => 'id'::name,
    inherited => false::boolean,
    null_frac => 0.5::real,
    avg_width => 2::integer,
    n_distinct => -0.1::real,
    most_common_vals => '{2,1,3}'::text,
    most_common_freqs => '{0.3,0.25,0.05}'::real[]
    );

This would generate a pg_statistic row with exactly one stakind in it, and replaces whatever statistics previously existed for that attribute.

It now checks for many types of data inconsistencies, and most (35) of those have test coverage in the regression tests. There are a few areas still uncovered, mostly surrounding histograms where the datatype is dependent on the attribute.
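For example, a call like this (illustrative values only) is the sort of thing the new checks are intended to reject, because the null fraction is outside [0, 1]:

SELECT pg_catalog.pg_set_attribute_stats(
    relation => 'stats_export_import.test'::regclass,
    attname => 'id'::name,
    inherited => false::boolean,
    null_frac => 1.5::real,      -- fraction outside [0, 1], should error
    avg_width => 2::integer,
    n_distinct => -0.1::real
    );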

The functions both require that the caller be the owner of the table/index.

The function pg_set_relation_stats is largely unchanged from previous versions.

Key areas where I'm seeking feedback:

- What additional checks can be made to ensure validity of statistics?
- What additional regression tests would be desirable?
- What extra information can we add to the error messages to give the user an idea of how to fix the error?
- What are some edge cases we should test concerning putting bad stats in a table to get an undesirable outcome?


0002 -

This patch concerns invoking the functions in 0001 via pg_restore/pg_upgrade. Little has changed here. Dumping statistics is currently the default for pg_dump/pg_restore/pg_upgrade, and can be switched off with the --no-statistics switch. Some have expressed concern about whether stats dumping should be the default. I have a slight preference for making it the default, for the following reasons:

- The existing commandline switches are all --no-something based, and this follows the pattern.
- Complaints about poor performance post-upgrade are often the result of the user not knowing about vacuumdb --analyze-in-stages or the need to manually ANALYZE. If they don't know about that, how can we expect them to know about new switches in pg_upgrade?
- The failure condition means that the user has a table with no stats in it (or possibly partial stats, if we change how we make the calls), which is exactly where they were before they made the call.
- Any performance regressions will be remedied with the next autovacuum or manual ANALYZE.
- If we had a positive flag (e.g. --with-statistics or just --statistics), and we then changed the default, that could be considered a POLA violation.


Key areas where I'm seeking feedback:

- What level of errors in a restore will a user tolerate, and what should be done to the error messages to indicate that the data itself is fine, but a manual operation to update stats on that particular table is now warranted?
- To what degree could pg_restore/pg_upgrade take that recovery action automatically?
- Should the individual attribute/class set function calls be grouped by relation, so that they all succeed/fail together, or should they be called separately, each able to succeed or fail on their own?
- Any other concerns about how to best use these new functions.


Вложения

Re: Statistics Import and Export

От
Corey Huinker
Дата:
On Fri, Mar 22, 2024 at 9:51 PM Corey Huinker <corey.huinker@gmail.com> wrote:
v12 attached.


v13 attached. All the same features as v12, but with a lot more type checking, bounds checking, value inspection, etc. Perhaps the most notable feature is that we're now ensuring that histogram values are in ascending order. This could come in handy for detecting when we are applying stats to a column of the wrong type, or the right type but with a different collation. It's not a guarantee of validity, of course, but it would detect egregious changes in sort order.
 
Вложения

Re: Statistics Import and Export

От
Ashutosh Bapat
Дата:
Hi Corey,


On Sat, Mar 23, 2024 at 7:21 AM Corey Huinker <corey.huinker@gmail.com> wrote:
v12 attached.

0001 - 


Some random comments

+SELECT
+    format('SELECT pg_catalog.pg_set_attribute_stats( '
+            || 'relation => %L::regclass::oid, attname => %L::name, '
+            || 'inherited => %L::boolean, null_frac => %L::real, '
+            || 'avg_width => %L::integer, n_distinct => %L::real, '
+            || 'most_common_vals => %L::text, '
+            || 'most_common_freqs => %L::real[], '
+            || 'histogram_bounds => %L::text, '
+            || 'correlation => %L::real, '
+            || 'most_common_elems => %L::text, '
+            || 'most_common_elem_freqs => %L::real[], '
+            || 'elem_count_histogram => %L::real[], '
+            || 'range_length_histogram => %L::text, '
+            || 'range_empty_frac => %L::real, '
+            || 'range_bounds_histogram => %L::text) ',
+        'stats_export_import.' || s.tablename || '_clone', s.attname,
+        s.inherited, s.null_frac,
+        s.avg_width, s.n_distinct,
+        s.most_common_vals, s.most_common_freqs, s.histogram_bounds,
+        s.correlation, s.most_common_elems, s.most_common_elem_freqs,
+        s.elem_count_histogram, s.range_length_histogram,
+        s.range_empty_frac, s.range_bounds_histogram)
+FROM pg_catalog.pg_stats AS s
+WHERE s.schemaname = 'stats_export_import'
+AND s.tablename IN ('test', 'is_odd')
+\gexec

Why do we need to construct the command and execute? Can we instead execute the function directly? That would also avoid ECHO magic.

+   <table id="functions-admin-statsimport">
+    <title>Database Object Statistics Import Functions</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Function
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>

COMMENT: The functions throw many validation errors. Do we want to list the acceptable/unacceptable input values in the documentation corresponding to those? I don't expect one line per argument validation. Something like "these, these and these arguments can not be NULL" or "both arguments in each of the pairs x and y, a and b, and c and d should be non-NULL or NULL respectively".
 
 
The functions pg_set_relation_stats() and pg_set_attribute_stats() now return void. There just weren't enough conditions where a condition was considered recoverable to justify having it. This may mean that combining multiple pg_set_attribute_stats calls into one compound statement may no longer be desirable, but that's just one of the places where I'd like feedback on how pg_dump/pg_restore use these functions.


0002 -

This patch concerns invoking the functions in 0001 via pg_restore/pg_upgrade. Little has changed here. Dumping statistics is currently the default for pg_dump/pg_restore/pg_upgrade, and can be switched off with the switch --no-statistics. Some have expressed concern about whether stats dumping should be the default. I have a slight preference for making it the default, for the following reasons:


+ /* Statistics are dependent on the definition, not the data */
+ /* Views don't have stats */
+ if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
+ (tbinfo->relkind == RELKIND_VIEW))
+ dumpRelationStats(fout, &tbinfo->dobj, reltypename,
+  tbinfo->dobj.dumpId);
+

Statistics are about data. Whenever pg_dump dumps some filtered data, the
statistics collected for the whole table are useless. We should avoid dumping
statistics in such a case. E.g. when only the schema is dumped, what good are the
statistics? Similarly, the statistics on a partitioned table may not be useful
if some of its partitions are not dumped. That said, dumping statistics for a foreign
table makes sense, since it does not contain data but the statistics are still meaningful.
 

Key areas where I'm seeking feedback:

- What level of errors in a restore will a user tolerate, and what should be done to the error messages to indicate that the data itself is fine, but a manual operation to update stats on that particular table is now warranted?
- To what degree could pg_restore/pg_upgrade take that recovery action automatically?
- Should the individual attribute/class set function calls be grouped by relation, so that they all succeed/fail together, or should they be called separately, each able to succeed or fail on their own?
- Any other concerns about how to best use these new functions.



Whether or not I pass --no-statistics, there is no difference in the dump output. Am I missing something?
$ pg_dump -d postgres > /tmp/dump_no_arguments.out
$ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
$ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
$

IIUC, pg_dump includes statistics by default. That means all our pg_dump-related tests will have statistics output by default. That's good, since the functionality will always be tested. 1. We need additional tests to ensure that the statistics are installed after restore. 2. Some of those tests compare dumps before and after restore. If the statistics change because of auto-analyze happening post-restore, these tests will fail.

I believe, in order to import statistics through IMPORT FOREIGN SCHEMA, postgresImportForeignSchema() will need to add SELECT commands invoking pg_set_relation_stats() on each imported table and pg_set_attribute_stats() on each of its attributes. Am I right? Do we want to make that happen in the first cut of the feature? How do you expect these functions to be used to update statistics of foreign tables?

--
Best Wishes,
Ashutosh Bapat

Re: Statistics Import and Export

От
Tomas Vondra
Дата:
On 3/25/24 09:27, Corey Huinker wrote:
> On Fri, Mar 22, 2024 at 9:51 PM Corey Huinker <corey.huinker@gmail.com>
> wrote:
> 
>> v12 attached.
>>
>>
> v13 attached. All the same features as v12, but with a lot more type
> checking, bounds checking, value inspection, etc. Perhaps the most notable
> feature is that we're now ensuring that histogram values are in ascending
> order. This could come in handy for detecting when we are applying stats to
> a column of the wrong type, or the right type but with a different
> collation. It's not a guarantee of validity, of course, but it would detect
> egregious changes in sort order.
> 

Hi,

I did take a closer look at v13 today. I have a bunch of comments and
some minor whitespace fixes in the attached review patches.

0001
----

1) The docs say this:

  <para>
   The purpose of this function is to apply statistics values in an
   upgrade situation that are "good enough" for system operation until
   they are replaced by the next <command>ANALYZE</command>, usually via
   <command>autovacuum</command> This function is used by
   <command>pg_upgrade</command> and <command>pg_restore</command> to
   convey the statistics from the old system version into the new one.
  </para>

I find this a bit confusing, considering the pg_dump/pg_restore changes
are only in 0002, not in this patch.

2) Also, I'm not sure about this:

  <parameter>relation</parameter>, the parameters in this are all
  derived from <structname>pg_stats</structname>, and the values
  given are most often extracted from there.

How do we know where the values come from "most often"? I mean, where
else would they come from?

3) The function pg_set_attribute_stats() is veeeeery long - 1000 lines
or so, that's way too many for me to think about. I agree the flow is
pretty simple, but I still wonder if there's a way to maybe split it
into some smaller "meaningful" steps.

4) It took me *ages* to realize the enums at the beginning of some of
the functions are actually indexes of arguments in PG_FUNCTION_ARGS.
That'd surely deserve a comment explaining this.

5) The comment for param_names in pg_set_attribute_stats says this:

    /* names of columns that cannot be null */
    const char *param_names[] = { ... }

but isn't that actually incorrect? I think that applies only to a couple
initial arguments, but then other fields (MCV, mcelem stats, ...) can be
NULL, right?

6) There's a couple minor whitespace fixes or comments etc.


0002
----

1) I don't understand why we have exportExtStatsSupported(). Seems
pointless - nothing calls it, and even if it did, we don't know how to export
the stats.

2) I think this condition in dumpTableSchema() is actually incorrect:

  if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
      (tbinfo->relkind == RELKIND_VIEW))
      dumpRelationStats(fout, &tbinfo->dobj, reltypename,

Aren't indexes pretty much exactly the thing for which we don't want to
dump statistics? In fact this skips dumping statistics for table - if
you dump a database with a single table (-Fc), pg_restore -l will tell
you this:

217; 1259 16385 TABLE public t user
3403; 0 16385 TABLE DATA public t user

Which is not surprising, because a table is not a view. With an expression
index you get this:

217; 1259 16385 TABLE public t user
3404; 0 16385 TABLE DATA public t user
3258; 1259 16418 INDEX public t_expr_idx user
3411; 0 0 STATS IMPORT public INDEX t_expr_idx

Unfortunately, fixing the condition does not work:

  $ pg_dump -Fc test > test.dump
  pg_dump: warning: archive items not in correct section order

This happens for a very simple reason - the statistics are marked as
SECTION_POST_DATA, which for the index works, because indexes are in
post-data section. But the table stats are dumped right after data,
still in the "data" section.

IMO that's wrong, the statistics should be delayed to the post-data
section. Which probably means there needs to be a separate dumpable
object for statistics on table/index, with a dependency on the object.

3) I don't like the "STATS IMPORT" description. For extended statistics
we dump the definition as "STATISTICS", so why shorten it to "STATS"
here? And "IMPORT" seems more like the process of loading data, not the
data itself. So I suggest "STATISTICS DATA".


regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Вложения

Re: Statistics Import and Export

От
Corey Huinker
Дата:


+\gexec

Why do we need to construct the command and execute? Can we instead execute the function directly? That would also avoid ECHO magic.

We don't strictly need it, but I've found the set-difference operation to be incredibly useful in diagnosing problems. Additionally, the values are subject to change due to changes in test data, there's no guarantee that the output of ANALYZE is deterministic, etc. But most of all, it's because the test cares about the correct copying of values, not the values themselves.
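For instance, a set-difference query along these lines (a sketch, not the actual regression query) makes it obvious which attribute or value didn't survive the copy:

    SELECT attname, inherited, null_frac, avg_width, n_distinct,
           most_common_vals::text, most_common_freqs,
           histogram_bounds::text, correlation
    FROM pg_stats
    WHERE schemaname = 'stats_export_import' AND tablename = 'test'
    EXCEPT
    SELECT attname, inherited, null_frac, avg_width, n_distinct,
           most_common_vals::text, most_common_freqs,
           histogram_bounds::text, correlation
    FROM pg_stats
    WHERE schemaname = 'stats_export_import' AND tablename = 'test_clone';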
 

+   <table id="functions-admin-statsimport">
+    <title>Database Object Statistics Import Functions</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Function
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>

COMMENT: The functions throw many validation errors. Do we want to list the acceptable/unacceptable input values in the documentation corresponding to those? I don't expect one line per argument validation. Something like "these, these and these arguments can not be NULL" or "both arguments in each of the pairs x and y, a and b, and c and d should be non-NULL or NULL respectively".

Yes. It should.
 
 Statistics are about data. Whenever pg_dump dumps some filtered data, the
statistics collected for the whole table are useless. We should avoid dumping
statistics in such a case. E.g. when only the schema is dumped, what good are the
statistics? Similarly, the statistics on a partitioned table may not be useful
if some of its partitions are not dumped. That said, dumping statistics for a foreign
table makes sense, since it does not contain data but the statistics are still meaningful.

Good points, but I'm not immediately sure how to enforce those rules.
 
 

Key areas where I'm seeking feedback:

- What level of errors in a restore will a user tolerate, and what should be done to the error messages to indicate that the data itself is fine, but a manual operation to update stats on that particular table is now warranted?
- To what degree could pg_restore/pg_upgrade take that recovery action automatically?
- Should the individual attribute/class set function calls be grouped by relation, so that they all succeed/fail together, or should they be called separately, each able to succeed or fail on their own?
- Any other concerns about how to best use these new functions.



Whether or not I pass --no-statistics, there is no difference in the dump output. Am I missing something?
$ pg_dump -d postgres > /tmp/dump_no_arguments.out
$ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
$ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
$

IIUC, pg_dump includes statistics by default. That means all our pg_dump-related tests will have statistics output by default. That's good, since the functionality will always be tested. 1. We need additional tests to ensure that the statistics are installed after restore. 2. Some of those tests compare dumps before and after restore. If the statistics change because of auto-analyze happening post-restore, these tests will fail.

+1
 
I believe, in order to import statistics through IMPORT FOREIGN SCHEMA, postgresImportForeignSchema() will need to add SELECT commands invoking pg_set_relation_stats() on each imported table and pg_set_attribute_stats() on each of its attributes. Am I right? Do we want to make that happen in the first cut of the feature? How do you expect these functions to be used to update statistics of foreign tables?

I don't think there's time to get it into this release. I think we'd want to extend this functionality to both IMPORT FOREIGN SCHEMA and ANALYZE for foreign tables, in both cases with a server/table option to do regular remote sampling. In both cases, they'd do a remote query very similar to what pg_dump does (hence putting it in fe_utils), with some filters on which columns/tables it believes it can trust. The remote table might itself be a view (in which case the query would turn up nothing) or column data types may change across the wire, and in those cases we'd have to fall back to sampling.
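The remote query would presumably look much like what pg_dump issues, something along these lines (a sketch only; the schema and table names are placeholders, and the column list would need version-specific adjustments):

    SELECT s.attname, s.inherited, s.null_frac, s.avg_width, s.n_distinct,
           s.most_common_vals, s.most_common_freqs, s.histogram_bounds,
           s.correlation, s.most_common_elems, s.most_common_elem_freqs,
           s.elem_count_histogram
    FROM pg_catalog.pg_stats AS s
    WHERE s.schemaname = 'public' AND s.tablename = 'remote_table';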

Re: Statistics Import and Export

От
Corey Huinker
Дата:
1) The docs say this:

  <para>
   The purpose of this function is to apply statistics values in an
   upgrade situation that are "good enough" for system operation until
   they are replaced by the next <command>ANALYZE</command>, usually via
   <command>autovacuum</command> This function is used by
   <command>pg_upgrade</command> and <command>pg_restore</command> to
   convey the statistics from the old system version into the new one.
  </para>

I find this a bit confusing, considering the pg_dump/pg_restore changes
are only in 0002, not in this patch.

True, I'll split the docs.
 

2) Also, I'm not sure about this:

  <parameter>relation</parameter>, the parameters in this are all
  derived from <structname>pg_stats</structname>, and the values
  given are most often extracted from there.

How do we know where do the values come from "most often"? I mean, where
else would it come from?

The next most likely sources would be 1. stats from another similar table and 2. the imagination of a user testing hypothetical query plans.
 

3) The function pg_set_attribute_stats() is veeeeery long - 1000 lines
or so, that's way too many for me to think about. I agree the flow is
pretty simple, but I still wonder if there's a way to maybe split it
into some smaller "meaningful" steps.

I wrestle with that myself. I think there are some pieces that can be factored out.


4) It took me *ages* to realize the enums at the beginning of some of
the functions are actually indexes of arguments in PG_FUNCTION_ARGS.
That'd surely deserve a comment explaining this.

My apologies, it definitely deserves a comment.
 

5) The comment for param_names in pg_set_attribute_stats says this:

    /* names of columns that cannot be null */
    const char *param_names[] = { ... }

but isn't that actually incorrect? I think that applies only to a couple
initial arguments, but then other fields (MCV, mcelem stats, ...) can be
NULL, right?

Yes, that is vestigial, I'll remove it.
 

6) There's a couple minor whitespace fixes or comments etc.


0002
----

1) I don't understand why we have exportExtStatsSupported(). Seems
pointless - nothing calls it, even if it did we don't know how to export
the stats.

It's not strictly necessary. 
 

2) I think this condition in dumpTableSchema() is actually incorrect:

IMO that's wrong, the statistics should be delayed to the post-data
section. Which probably means there needs to be a separate dumpable
object for statistics on table/index, with a dependency on the object.

Good points.
 

3) I don't like the "STATS IMPORT" description. For extended statistics
we dump the definition as "STATISTICS" so why to shorten it to "STATS"
here? And "IMPORT" seems more like the process of loading data, not the
data itself. So I suggest "STATISTICS DATA".

+1
 

Re: Statistics Import and Export

От
Jeff Davis
Дата:
Hi Tom,

Comparing the current patch set to your advice below:

On Tue, 2023-12-26 at 14:19 -0500, Tom Lane wrote:
> I had things set up with simple functions, which
> pg_dump would invoke by writing more or less
>
>         SELECT pg_catalog.load_statistics(....);
>
> This has a number of advantages, not least of which is that an
> extension
> could plausibly add compatible functions to older versions.

Check.

>   The trick,
> as you say, is to figure out what the argument lists ought to be.
> Unfortunately I recall few details of what I wrote for Salesforce,
> but I think I had it broken down in a way where there was a separate
> function call occurring for each pg_statistic "slot", thus roughly
>
> load_statistics(table regclass, attname text, stakind int, stavalue
> ...);

The problem with basing the function on pg_statistic directly is that
it can only be exported by the superuser.

The current patches instead base it on the pg_stats view, which already
does the privilege checking. Technically, information about which
stakinds go in which slots is lost, but I don't think that's a problem
as long as the stats make it in, right? It's also more user-friendly to
have nice names for the function arguments. The only downside I see is
that it's slightly asymmetric: exporting from pg_stats and importing
into pg_statistic.

I do have some concerns about letting non-superusers import their own
statistics: how robust is the rest of the code to handle malformed
stats once they make it into pg_statistic? Corey has addressed that
with basic input validation, so I think it's fine, but perhaps I'm
missing something.

>  As mentioned already, we'd also need some sort of
> version identifier, and we'd expect the load_statistics() functions
> to be able to transform the data if the old version used a different
> representation.

You mean a version argument to the function, which would appear in the
exported stats data? That's not in the current patch set.

It's relying on the new version of pg_dump understanding the old
statistics data, and dumping it out in a form that the new server will
understand.

>   I agree with the idea that an explicit representation
> of the source table attribute's type would be wise, too.

That's not in the current patch set, either.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Tue, 2024-03-26 at 00:16 +0100, Tomas Vondra wrote:
> I did take a closer look at v13 today. I have a bunch of comments and
> some minor whitespace fixes in the attached review patches.

I also attached a patch implementing a different approach to the
pg_dump support. Instead of trying to create a query that uses SQL
"format()" to create more SQL, I did all the formatting in C. It turned
out to be about 30% fewer lines, and I find it more understandable and
consistent with the way other stuff in pg_dump happens.

The attached patch is pretty rough -- not many comments, and perhaps
some things should be moved around. I only tested very basic
dump/reload in SQL format.

Regards,
    Jeff Davis


Вложения

Re: Statistics Import and Export

От
Corey Huinker
Дата:
On Fri, Mar 29, 2024 at 2:25 AM Jeff Davis <pgsql@j-davis.com> wrote:
I also attached a patch implementing a different approach to the
pg_dump support. Instead of trying to create a query that uses SQL
"format()" to create more SQL, I did all the formatting in C. It turned
out to be about 30% fewer lines, and I find it more understandable and
consistent with the way other stuff in pg_dump happens.

That is fairly close to what I came up with per our conversation (attached below), but I really like the att_stats_arginfo construct and I definitely want to adopt that and expand it to a third dimension that flags the fields that cannot be null. I will incorporate that into v15.

As for v14, here are the highlights:

0001:
- broke up pg_set_attribute_stats() into many functions. Every stat kind gets its own validation function. Type derivation is now done in its own function.
- removed check on inherited stats flag that required the table be partitioned. that was in error
- added check for most_common_values to be unique in ascending order, and tests to match
- no more mention of pg_dump in the function documentation
- function documentation cites pg-stats-view as reference for the parameter's data requirements

0002:
- All relstats and attrstats calls are now their own statement instead of a compound statement
- moved the archive TOC entry from post-data back to SECTION_NONE (as it was modeled on object COMMENTs), which seems to work better.
- remove meta-query in favor of more conventional query building
- removed all changes to fe_utils/

Вложения

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Fri, 2024-03-29 at 05:32 -0400, Corey Huinker wrote:
> That is fairly close to what I came up with per our conversation
> (attached below), but I really like the att_stats_arginfo construct
> and I definitely want to adopt that and expand it to a third
> dimension that flags the fields that cannot be null. I will
> incorporate that into v15.

Sounds good. I think it cuts down on the boilerplate.

> 0002:
> - All relstats and attrstats calls are now their own statement
> instead of a compound statement
> - moved the archive TOC entry from post-data back to SECTION_NONE (as
> it was modeled on object COMMENTs), which seems to work better.
> - remove meta-query in favor of more conventional query building
> - removed all changes to fe_utils/

Can we get a consensus on whether the default should be with stats or
without? That seems like the most important thing remaining in the
pg_dump changes.

There are still failures in the pg_upgrade TAP test. One seems to be an
ordering issue, so perhaps we need to ORDER BY the attribute number. Others
seem to be missing relstats and I'm not sure why yet. I suggest doing
some manual pg_upgrade tests and comparing the before/after dumps to
see if you can reproduce a smaller version of the problem.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Stephen Frost
Дата:
Greetings,

On Fri, Mar 29, 2024 at 11:05 Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2024-03-29 at 05:32 -0400, Corey Huinker wrote:
> 0002:
> - All relstats and attrstats calls are now their own statement
> instead of a compound statement
> - moved the archive TOC entry from post-data back to SECTION_NONE (as
> it was modeled on object COMMENTs), which seems to work better.
> - remove meta-query in favor of more conventional query building
> - removed all changes to fe_utils/

Can we get a consensus on whether the default should be with stats or
without? That seems like the most important thing remaining in the
pg_dump changes.

I’d certainly think “with stats” would be the preferred default of our users.

Thanks!

Stephen

Re: Statistics Import and Export

От
Corey Huinker
Дата:

There are still failures in the pg_upgrade TAP test. One seems to be an
ordering issue, so perhaps we need to ORDER BY the attribute number. Others
seem to be missing relstats and I'm not sure why yet. I suggest doing
some manual pg_upgrade tests and comparing the before/after dumps to
see if you can reproduce a smaller version of the problem.

That's fixed in my current working version, as is a tsvector-specific issue. Working on the TAP issue.
 

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.

I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.

Also, it's new behavior, so it may cause some minor surprises, or there
might be minor interactions to work out. For instance, dumping stats
doesn't make a lot of sense if pg_upgrade (or something else) is just
going to run analyze anyway.

What do you think about starting off with it as non-default, and then
switching it to default in 18?

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
>> I’d certainly think “with stats” would be the preferred default of
>> our users.

> What do you think about starting off with it as non-default, and then
> switching it to default in 18?

I'm with Stephen: I find it very hard to imagine that there's any
users who wouldn't want this as default.  If we do what you suggest,
then there will be three historical behaviors to cope with, not two.
That doesn't sound like it'll make anyone's life better.

As for the "it might break" argument, that could be leveled against
any nontrivial patch.  You're at least offering an opt-out switch,
which is something we more often don't do.

(I've not read the patch yet, but I assume the switch works like
other pg_dump filters in that you can apply it on the restore
side?)

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:


On Fri, Mar 29, 2024 at 7:34 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.

I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.

To that end, I'm going to be modifying the "Optimizer statistics are not transferred by pg_upgrade..." message when stats _were_ transferred, with additional instructions that the user should treat any stats-ish error messages encountered as a reason to manually analyze that table. We should probably say something about extended stats as well.
 

Re: Statistics Import and Export

От
Corey Huinker
Дата:
(I've not read the patch yet, but I assume the switch works like
other pg_dump filters in that you can apply it on the restore
side?)

Correct. It follows the existing --no-something pattern. 

Re: Statistics Import and Export

От
Stephen Frost
Дата:
Greetings,

On Fri, Mar 29, 2024 at 19:35 Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.

I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.

I understand that point.

Also, it's new behavior, so it may cause some minor surprises, or there
might be minor interactions to work out. For instance, dumping stats
doesn't make a lot of sense if pg_upgrade (or something else) is just
going to run analyze anyway.

But we don’t expect anything to run analyze … do we?  So I’m not sure why it makes sense to raise this as a concern. 

What do you think about starting off with it as non-default, and then
switching it to default in 18?

What’s different, given the above arguments, in making the change with 18 instead of now?  I also suspect that if we say “we will change the default later” … that later won’t ever come and we will end up making our users always have to remember to say “with-stats” instead.

The stats are important which is why the effort is being made in the first place. If just doing an analyze after loading the data was good enough then this wouldn’t be getting worked on.

Independently, I had a thought around doing an analyze as the data is being loaded... but we can't do that for indexes (though we could perhaps analyze the indexed values as we build the index). This works when we do a truncate or create the table in the same transaction, so we would tie into some of the existing logic that we have around that. We would also adjust COPY to accept an option that specifies the anticipated number of rows being loaded (which we can figure out reasonably well during the dump phase). Perhaps this would lead to a pg_dump option to do the data load as a transaction with a truncate before the copy (the point being to still be able to do a parallel load while getting the benefits from knowing that we are completely reloading the table). Just some other thoughts, which I don't intend to take away from the current effort at all, which I see as valuable and should be enabled by default.
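A rough sketch of that load shape, with the table name and file path as placeholders:

    BEGIN;
    -- truncating in the same transaction tells the server the table is
    -- being completely rewritten, which is what the idea above relies on
    TRUNCATE ONLY mytable;
    COPY mytable FROM '/path/to/mytable.dat';
    COMMIT;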

Thanks!

Stephen

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Fri, 2024-03-29 at 20:54 -0400, Stephen Frost wrote:
> What’s different, given the above arguments, in making the change
> with 18 instead of now?

Acknowledged. You, Tom, and Corey (and perhaps everyone else) seem to
be aligned here, so that's consensus enough for me. Default is with
stats, --no-statistics to disable them.

> Independently, I had a thought around doing an analyze as the data is
> being loaded ..

Right, I think there are some interesting things to pursue here. I also
had an idea to use logical decoding to get a streaming sample, which
would be better randomness than block sampling. At this point that's
just an idea, I haven't looked into it seriously.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Corey Huinker
Дата:
Right, I think there are some interesting things to pursue here. I also
had an idea to use logical decoding to get a streaming sample, which
would be better randomness than block sampling. At this point that's
just an idea, I haven't looked into it seriously.

Regards,
        Jeff Davis



v15 attached

0001:
- fixed an error involving tsvector types
- only derive element type if element stats available
- general cleanup

0002:

- 002_pg_upgrade.pl now dumps before/after databases with --no-statistics. I tried to find out why some tables were getting their relstats either not set, or set and reset, never affecting the attribute stats. I even tried turning autovacuum off for both instances, but nothing seemed to change the fact that the same tables were having their relstats reset.

TODO list:

- decision on whether suppressing stats in the pg_upgrade TAP check is for the best
- pg_upgrade option to suppress stats import, there is no real pattern to follow there
- what message text to convey to the user about the potential stats import errors and their remediation, and to what degree that replaces the "you ought to run vacuumdb" message.
- what additional error context we want to add to the array_in() imports of anyarray strings

Вложения

Re: Statistics Import and Export

От
Magnus Hagander
Дата:


On Sat, Mar 30, 2024 at 1:26 AM Corey Huinker <corey.huinker@gmail.com> wrote:


On Fri, Mar 29, 2024 at 7:34 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2024-03-29 at 18:02 -0400, Stephen Frost wrote:
> I’d certainly think “with stats” would be the preferred default of
> our users.

I'm concerned there could still be paths that lead to an error. For
pg_restore, or when loading a SQL file, a single error isn't fatal
(unless -e is specified), but it still could be somewhat scary to see
errors during a reload.

To that end, I'm going to be modifying the "Optimizer statistics are not transferred by pg_upgrade..." message when stats _were_ transferred, with additional instructions that the user should treat any stats-ish error messages encountered as a reason to manually analyze that table. We should probably say something about extended stats as well.
 

I'm getting late into this discussion and I apologize if I've missed this being discussed before. But.

Please don't.

That will make it *really* hard for any form of automation or drivers of this. The information needs to go somewhere where such tools can easily consume it, and an informational message during runtime (which is also likely to be translated in many environments) is the exact opposite of that.

Surely we can come up with something better. Otherwise, I think all those tools are just going to have to end up assuming that it always failed and proceed based on that, and that would be a shame.

--

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Sat, 2024-03-30 at 01:34 -0400, Corey Huinker wrote:
>
> - 002_pg_upgrade.pl now dumps before/after databases with --no-
> statistics. I tried to find out why some tables were getting their
> relstats either not set, or set and reset, never affecting the
> attribute stats. I even tried turning autovacuum off for both
> instances, but nothing seemed to change the fact that the same tables
> were having their relstats reset.

I think I found out why this is happening: a schema-only dump first
creates the table, then sets the relstats, then creates indexes. The
index creation updates the relstats, but because the dump was schema-
only, it overwrites the relstats with zeros.

That exposes an interesting dependency, which is that relstats must be
set after index creation, otherwise they will be lost -- at least in
the case of pg_upgrade.
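
A minimal way to see that ordering effect, using a direct catalog update as
a stand-in for the stats-setting call (just an illustration, not how the
patch applies stats, and it requires superuser):

    -- schema-only restore ordering: stats applied before CREATE INDEX
    CREATE TABLE t (a int);
    UPDATE pg_class SET relpages = 10, reltuples = 1000
      WHERE oid = 't'::regclass;
    CREATE INDEX t_a_idx ON t (a);
    -- the index build recomputes the heap's relpages/reltuples from the
    -- (empty) table, so the values set above are back to zero
    SELECT relpages, reltuples FROM pg_class WHERE oid = 't'::regclass;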

This re-raises the question of whether stats are part of a schema-only
dump or not. Did we settle conclusively that they are?

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> This re-raises the question of whether stats are part of a schema-only
> dump or not. Did we settle conclusively that they are?

Surely they are data, not schema.  It would make zero sense to restore
them if you aren't restoring the data they describe.

Hence, it'll be a bit messy if we can't put them in the dump's DATA
section.  Maybe we need to revisit CREATE INDEX's behavior rather
than assuming it's graven in stone?

            regards, tom lane



Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Sat, 2024-03-30 at 13:18 -0400, Tom Lane wrote:
> Surely they are data, not schema.  It would make zero sense to
> restore
> them if you aren't restoring the data they describe.

The complexity is that pg_upgrade does create the data, but relies on a
schema-only dump. So we'd need to at least account for that somehow,
either with a separate stats-only dump, or make a special case in
binary upgrade mode that dumps schema+stats (and resolves the CREATE
INDEX issue).

> Maybe we need to revisit CREATE INDEX's behavior rather
> than assuming it's graven in stone?

Would there be a significant cost to just not doing that? Or are you
suggesting that we special-case the behavior, or turn it off during
restore with a GUC?

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2024-03-30 at 13:18 -0400, Tom Lane wrote:
>> Surely they are data, not schema.  It would make zero sense to
>> restore them if you aren't restoring the data they describe.

> The complexity is that pg_upgrade does create the data, but relies on a
> schema-only dump. So we'd need to at least account for that somehow,
> either with a separate stats-only dump, or make a special case in
> binary upgrade mode that dumps schema+stats (and resolves the CREATE
> INDEX issue).

Ah, good point.  But binary-upgrade mode is special in tons of ways
already.  I don't see a big problem with allowing it to dump stats
even though --schema-only would normally imply not doing that.

(You could also imagine an explicit positive --stats switch that would
override --schema-only, but I don't see that it's worth the trouble.)

>> Maybe we need to revisit CREATE INDEX's behavior rather
>> than assuming it's graven in stone?

> Would there be a significant cost to just not doing that? Or are you
> suggesting that we special-case the behavior, or turn it off during
> restore with a GUC?

I didn't have any specific proposal in mind, was just trying to think
outside the box.

            regards, tom lane



Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Sat, 2024-03-30 at 13:39 -0400, Tom Lane wrote:
> (You could also imagine an explicit positive --stats switch that
> would
> override --schema-only, but I don't see that it's worth the trouble.)

That would have its own utility for reproducing planner problems
outside of production systems.

(That could be a separate feature, though, and doesn't need to be a
part of this patch set.)

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Corey Huinker
Дата:
I'm getting late into this discussion and I apologize if I've missed this being discussed before. But.

Please don't.

That will make it *really* hard for any form of automation or drivers of this. The information needs to go somewhere where such tools can easily consume it, and an informational message during runtime (which is also likely to be translated in many environments) is the exact opposite of that.

That makes a lot of sense. I'm not sure what form it would take (file, pseudo-table, something else?). Open to suggestions.

Re: Statistics Import and Export

От
Corey Huinker
Дата:
I didn't have any specific proposal in mind, was just trying to think
outside the box.

What if we added a separate section SECTION_STATISTICS which is run following post-data?

Re: Statistics Import and Export

От
Tom Lane
Дата:
Corey Huinker <corey.huinker@gmail.com> writes:
>> I didn't have any specific proposal in mind, was just trying to think
>> outside the box.

> What if we added a separate section SECTION_STATISTICS which is run
> following post-data?

Maybe, but that would have a lot of side-effects on pg_dump's API
and probably on some end-user scripts.  I'd rather not.

I haven't looked at the details, but I'm really a bit surprised
by Jeff's assertion that CREATE INDEX destroys statistics on the
base table.  That seems wrong from here, and maybe something we
could have it not do.  (I do realize that it recalculates reltuples
and relpages, but so what?  If it updates those, the results should
be perfectly accurate.)

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:
That will make it *really* hard for any form of automation or drivers of this. The information needs to go somewhere where such tools can easily consume it, and an informational message during runtime (which is also likely to be translated in many environments) is the exact opposite of that.

Having given this some thought, I'd be inclined to create a view, pg_stats_missing, with the same security barrier as pg_stats, but looking for tables that lack stats on at least one column, or lack stats on an extended statistics object.

Table structure would be 
schemaname name
tablename name
attnames text[]
ext_stats text[]
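
A rough sketch of the column-stats half of such a view (the ext_stats column
and the security barrier are omitted here; nothing below is committed code):

    CREATE VIEW pg_stats_missing AS
    SELECT n.nspname AS schemaname,
           c.relname AS tablename,
           array_agg(a.attname::text ORDER BY a.attnum) AS attnames
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    JOIN pg_attribute a ON a.attrelid = c.oid
                       AND a.attnum > 0
                       AND NOT a.attisdropped
    WHERE c.relkind IN ('r', 'm')
      AND NOT EXISTS (SELECT 1 FROM pg_statistic s
                      WHERE s.starelid = c.oid
                        AND s.staattnum = a.attnum)
    GROUP BY n.nspname, c.relname;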

The informational message, if it changes at all, could reference this new view as the place to learn about how well the stats import went.

vacuumdb might get a --missing-only option in addition to --analyze-in-stages.

For that matter, we could add --analyze-missing options to pg_restore and pg_upgrade to do the mopping up themselves.

Re: Statistics Import and Export

От
Tom Lane
Дата:
Corey Huinker <corey.huinker@gmail.com> writes:
> Having given this some thought, I'd be inclined to create a view,
> pg_stats_missing, with the same security barrier as pg_stats, but looking
> for tables that lack stats on at least one column, or lack stats on an
> extended statistics object.

The week before feature freeze is no time to be designing something
like that, unless you've abandoned all hope of getting this into v17.

There's a bigger issue though: AFAICS this patch set does nothing
about dumping extended statistics.  I surely don't want to hold up
the patch insisting that that has to happen before we can commit the
functionality proposed here.  But we cannot rip out pg_upgrade's
support for post-upgrade ANALYZE processing before we do something
about extended statistics, and that means it's premature to be
designing any changes to how that works.  So I'd set that whole
topic on the back burner.

It's possible that we could drop the analyze-in-stages recommendation,
figuring that this functionality will get people to the
able-to-limp-along level immediately and that all that is needed is a
single mop-up ANALYZE pass.  But I think we should leave that till we
have a bit more than zero field experience with this feature.

            regards, tom lane



Re: Statistics Import and Export

От
Tom Lane
Дата:
My apologies for having paid so little attention to this thread for
months.  I got around to reading the v15 patches today, and while
I think they're going in more or less the right direction, there's
a long way to go IMO.

I concur with the plan of extracting data from pg_stats not
pg_statistics, and with emitting a single "set statistics"
call per attribute.  (I think at one point I'd suggested a call
per stakind slot, but that would lead to a bunch of UPDATEs on
existing pg_attribute tuples and hence a bunch of dead tuples
at the end of an import, so it's not the way to go.  A series
of UPDATEs would likely also play poorly with a background
auto-ANALYZE happening concurrently.)

I do not like the current design for pg_set_attribute_stats' API
though: I don't think it's at all future-proof.  What happens when
somebody adds a new stakind (and hence new pg_stats column)?
You could try to add an overloaded pg_set_attribute_stats
version with more parameters, but I'm pretty sure that would
lead to "ambiguous function call" failures when trying to load
old dump files containing only the original parameters.  The
present design is also fragile in that an unrecognized parameter
will lead to a parse-time failure and no function call happening,
which is less robust than I'd like.  As lesser points,
the relation argument ought to be declared regclass not oid for
convenience of use, and I really think that we need to provide
the source server's major version number --- maybe we will never
need that, but if we do and we don't have it we will be sad.

So this leads me to suggest that we'd be best off with a VARIADIC
ANY signature, where the variadic part consists of alternating
parameter labels and values:

pg_set_attribute_stats(table regclass, attribute name,
                       inherited bool, source_version int,
                       variadic "any") returns void

where a call might look like

SELECT pg_set_attribute_stats('public.mytable', 'mycolumn',
                              false, -- not inherited
                  16, -- source server major version
                              -- pairs of labels and values follow
                              'null_frac', 0.4,
                              'avg_width', 42,
                              'histogram_bounds',
                              array['a', 'b', 'c']::text[],
                              ...);

Note a couple of useful things here:

* AFAICS we could label the function strict and remove all those ad-hoc
null checks.  If you don't have a value for a particular stat, you
just leave that pair of arguments out (exactly as the existing code
in 0002 does, just using a different notation).  This also means that
we don't need any default arguments and so no need for hackery in
system_functions.sql.

* If we don't recognize a parameter label at runtime, we can treat
that as a warning rather than a hard error, and press on.  This case
would mostly be useful in major version downgrades I suppose, but
that will be something people will want eventually.

* We can require the calling statement to cast arguments, particularly
arrays, to the proper type, removing the need for conversions within
the stats-setting function.  (But instead, it'd need to check that the
next "any" argument is the type it ought to be based on the type of
the target column.)

If we write the labels as undecorated string literals as I show
above, I think they will arrive at the function as "unknown"-type
constants, which is a little weird but doesn't seem like it's
really a big problem.  The alternative is to cast them all to text
explicitly, but that's adding notation to no great benefit IMO.

pg_set_relation_stats is simpler in that the set of stats values
to be set will probably remain fairly static, and there seems little
reason to allow only part of them to be supplied (so personally I'd
drop the business about accepting nulls there too).  If we do grow
another value or values for it to set there shouldn't be much problem
with overloading it with another version with more arguments.
Still needs to take regclass not oid though ...

I've not read the patches in great detail, but I did note a
few low-level issues:

* why is check_relation_permissions looking up the pg_class row?
There's already a copy of that in the Relation struct.  Likewise
for the other caller of can_modify_relation (but why is that
caller not using check_relation_permissions?)  That all looks
overly complicated and duplicative.  I think you don't need two
layers of function there.

* I find the stuff with enums and "const char *param_names" to
be way too cute and unlike anything we do elsewhere.  Please
don't invent your own notations for coding patterns that have
hundreds of existing instances.  pg_set_relation_stats, for
example, has absolutely no reason not to look like the usual

    Oid    relid = PG_GETARG_OID(0);
    float4    relpages = PG_GETARG_FLOAT4(1);
    ... etc ...

* The array manipulations seem to me to be mostly not well chosen.
There's no reason to use expanded arrays here, since you won't be
modifying the arrays in-place; all that's doing is wasting memory.
I'm also noting a lack of defenses against nulls in the arrays.
I'd suggest using deconstruct_array to disassemble the arrays,
if indeed they need disassembled at all.  (Maybe they don't, see
next item.)

* I'm dubious that we can fully vet the contents of these arrays,
and even a little dubious that we need to try.  As an example,
what's the worst that's going to happen if a histogram array isn't
sorted precisely?  You might get bogus selectivity estimates
from the planner, but that's no worse than you would've got with
no stats at all.  (It used to be that selfuncs.c would use a
histogram even if its contents didn't match the query's collation.
The comments justifying that seem to be gone, but I think it's
still the case that the code isn't *really* dependent on the sort
order being exactly so.)  The amount of hastily-written code in the
patch for checking this seems a bit scary, and it's well within the
realm of possibility that it introduces more bugs than it prevents.
We do need to verify data types, lack of nulls, and maybe
1-dimensional-ness, which could break the accessing code at a fairly
low level; but I'm not sure that we need more than that.

* There's a lot of ERROR cases that maybe we ought to downgrade
to WARN-and-press-on, in the service of not breaking the restore
completely in case of trouble.

* 0002 is confused about whether the tag for these new TOC
entries is "STATISTICS" or "STATISTICS DATA".  I also think
they need to be in SECTION_DATA not SECTION_NONE, and I'd be
inclined to make them dependent on the table data objects
not the table declarations.  We don't really want a parallel
restore to load them before the data is loaded: that just
increases the risk of bad interactions with concurrent
auto-analyze.

* It'd definitely not be OK to put BEGIN/COMMIT into the commands
in these TOC entries.  But I don't think we need to.

* dumpRelationStats seems to be dumping the relation-level
stats twice.

* Why exactly are you suppressing testing of statistics upgrade
in 002_pg_upgrade??

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:
On Sun, Mar 31, 2024 at 2:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Corey Huinker <corey.huinker@gmail.com> writes:
> Having given this some thought, I'd be inclined to create a view,
> pg_stats_missing, with the same security barrier as pg_stats, but looking
> for tables that lack stats on at least one column, or lack stats on an
> extended statistics object.

The week before feature freeze is no time to be designing something
like that, unless you've abandoned all hope of getting this into v17.

It was a response to the suggestion that there be some way for tools/automation to read the status of stats. I would view it as a separate patch, as such a view would be useful now for knowing which tables to ANALYZE, regardless of whether this patch goes in or not.
 
There's a bigger issue though: AFAICS this patch set does nothing
about dumping extended statistics.  I surely don't want to hold up
the patch insisting that that has to happen before we can commit the
functionality proposed here.  But we cannot rip out pg_upgrade's
support for post-upgrade ANALYZE processing before we do something
about extended statistics, and that means it's premature to be
designing any changes to how that works.  So I'd set that whole
topic on the back burner.

So Extended Stats _were_ supported by earlier versions where the medium of communication was JSON. However, there were several problems with adapting that to the current model where we match params to stat types:

* Several of the column types do not have functional input functions, so we must construct the data structure internally and pass them to statext_store().
* The output functions for some of those column types have lists of attnums, with negative values representing positional expressions in the stat definition. This information is not translatable to another system without also passing along the attnum/attname mapping of the source system.

At least three people told me "nobody uses extended stats" and to just drop that from the initial version. Unhappy with this assessment, I inquired as to whether my employer (AWS) had some internal databases that used extended stats so that I could get good test data, and came up with nothing, nor did anyone know of customers who used the feature. So when the fourth person told me that nobody uses extended stats, and not to let a rarely-used feature get in the way of a feature that would benefit nearly 100% of users, I dropped it.
 
It's possible that we could drop the analyze-in-stages recommendation,
figuring that this functionality will get people to the
able-to-limp-along level immediately and that all that is needed is a
single mop-up ANALYZE pass.  But I think we should leave that till we
have a bit more than zero field experience with this feature.

It may be that we leave the recommendation exactly as it is.

Perhaps we enhance the error messages in pg_set_*_stats() to indicate what command would remediate the issue.
 

Re: Statistics Import and Export

От
Tom Lane
Дата:
Corey Huinker <corey.huinker@gmail.com> writes:
> On Sun, Mar 31, 2024 at 2:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> There's a bigger issue though: AFAICS this patch set does nothing
>> about dumping extended statistics.  I surely don't want to hold up
>> the patch insisting that that has to happen before we can commit the
>> functionality proposed here.  But we cannot rip out pg_upgrade's
>> support for post-upgrade ANALYZE processing before we do something
>> about extended statistics, and that means it's premature to be
>> designing any changes to how that works.  So I'd set that whole
>> topic on the back burner.

> So Extended Stats _were_ supported by earlier versions where the medium of
> communication was JSON. However, there were several problems with adapting
> that to the current model where we match params to stat types:

> * Several of the column types do not have functional input functions, so we
> must construct the data structure internally and pass them to
> statext_store().
> * The output functions for some of those column types have lists of
> attnums, with negative values representing positional expressions in the
> stat definition. This information is not translatable to another system
> without also passing along the attnum/attname mapping of the source system.

I wonder if the right answer to that is "let's enhance the I/O
functions for those types".  But whether that helps or not, it's
v18-or-later material for sure.

> At least three people told me "nobody uses extended stats" and to just drop
> that from the initial version.

I can't quibble with that view of what has priority.  I'm just
suggesting that redesigning what pg_upgrade does in this area
should come later than doing something about extended stats.

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:
I wonder if the right answer to that is "let's enhance the I/O
functions for those types".  But whether that helps or not, it's
v18-or-later material for sure.

That was Stephen's take as well, and I agreed given that I had to throw the kitchen-sink of source-side oid mappings (attname, types, collations, operators) into the JSON to work around the limitation.
 
I can't quibble with that view of what has priority.  I'm just
suggesting that redesigning what pg_upgrade does in this area
should come later than doing something about extended stats.

I mostly agree, with the caveat that pg_upgrade's existing message saying that optimizer stats were not carried over wouldn't be 100% true anymore.
 

Re: Statistics Import and Export

От
Tom Lane
Дата:
Corey Huinker <corey.huinker@gmail.com> writes:
>> I can't quibble with that view of what has priority.  I'm just
>> suggesting that redesigning what pg_upgrade does in this area
>> should come later than doing something about extended stats.

> I mostly agree, with the caveat that pg_upgrade's existing message saying
> that optimizer stats were not carried over wouldn't be 100% true anymore.

I think we can tweak the message wording.  I just don't want to be
doing major redesign of the behavior, nor adding fundamentally new
monitoring capabilities.

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:

I concur with the plan of extracting data from pg_stats not
pg_statistics, and with emitting a single "set statistics"
call per attribute.  (I think at one point I'd suggested a call
per stakind slot, but that would lead to a bunch of UPDATEs on
existing pg_attribute tuples and hence a bunch of dead tuples
at the end of an import, so it's not the way to go.  A series
of UPDATEs would likely also play poorly with a background
auto-ANALYZE happening concurrently.)

That was my reasoning as well.

 
I do not like the current design for pg_set_attribute_stats' API
though: I don't think it's at all future-proof.  What happens when
somebody adds a new stakind (and hence new pg_stats column)?
You could try to add an overloaded pg_set_attribute_stats
version with more parameters, but I'm pretty sure that would
lead to "ambiguous function call" failures when trying to load
old dump files containing only the original parameters.

I don't think we'd overload, we'd just add new parameters to the function signature.
 
  The
present design is also fragile in that an unrecognized parameter
will lead to a parse-time failure and no function call happening,
which is less robust than I'd like.

There was a lot of back-and-forth about what sorts of failures were error-worthy, and which were warn-worthy. I'll discuss further below.
 
  As lesser points,
the relation argument ought to be declared regclass not oid for
convenience of use,

+1
 
and I really think that we need to provide
the source server's major version number --- maybe we will never
need that, but if we do and we don't have it we will be sad.

The JSON had it, and I never did use it. Not against having it again.
 

So this leads me to suggest that we'd be best off with a VARIADIC
ANY signature, where the variadic part consists of alternating
parameter labels and values:

pg_set_attribute_stats(table regclass, attribute name,
                       inherited bool, source_version int,
                       variadic "any") returns void

where a call might look like

SELECT pg_set_attribute_stats('public.mytable', 'mycolumn',
                              false, -- not inherited
                              16, -- source server major version
                              -- pairs of labels and values follow
                              'null_frac', 0.4,
                              'avg_width', 42,
                              'histogram_bounds',
                              array['a', 'b', 'c']::text[],
                              ...);

Note a couple of useful things here:

* AFAICS we could label the function strict and remove all those ad-hoc
null checks.  If you don't have a value for a particular stat, you
just leave that pair of arguments out (exactly as the existing code
in 0002 does, just using a different notation).  This also means that
we don't need any default arguments and so no need for hackery in
system_functions.sql.

I'm not aware of how strict works with variadics. Would the lack of any variadic parameters trigger it?

Also going with strict means that an inadvertent explicit NULL in one parameter would cause the entire attribute import to fail silently. I'd rather fail loudly.

 
* If we don't recognize a parameter label at runtime, we can treat
that as a warning rather than a hard error, and press on.  This case
would mostly be useful in major version downgrades I suppose, but
that will be something people will want eventually.

Interesting.

* We can require the calling statement to cast arguments, particularly
arrays, to the proper type, removing the need for conversions within
the stats-setting function.  (But instead, it'd need to check that the
next "any" argument is the type it ought to be based on the type of
the target column.)

So, that's tricky. The type of the values is not always the attribute type: for expression indexes, we call exprType() and exprCollation(), in which case we always use the expression type over the attribute type, but only use the expression's collation if the attribute had no collation. This mimics the behavior of ANALYZE.

Then, for the MCELEM and DECHIST stakinds we have to find the type's element type, and that has special logic for tsvectors, ranges, and other non-scalars, borrowing from the various *_typanalyze() functions. For that matter, the existing typanalyze functions don't grab the < operator, which I need for later data validations, so using examine_attribute() was simultaneously overkill and insufficient.

None of this functionality is accessible from a client program, so we'd have to pull in a lot of backend stuff to pg_dump to make it resolve the typecasts correctly. Text and array_in() was just easier.
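
To illustrate the expression-index case (a standalone example, not from the patch): the stats row for the index column is typed by the expression's result, not by the underlying column.

    CREATE TABLE t3 (x int);
    INSERT INTO t3 SELECT generate_series(1, 1000);
    CREATE INDEX t3_expr ON t3 (lower(x::text));
    ANALYZE t3;
    -- the index column's histogram values are text, even though the
    -- underlying column x is an integer
    SELECT tablename, attname, histogram_bounds
    FROM pg_stats
    WHERE tablename = 't3_expr';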
 
pg_set_relation_stats is simpler in that the set of stats values
to be set will probably remain fairly static, and there seems little
reason to allow only part of them to be supplied (so personally I'd
drop the business about accepting nulls there too).  If we do grow
another value or values for it to set there shouldn't be much problem
with overloading it with another version with more arguments.
Still needs to take regclass not oid though ...

I'm still iffy about the silent failures of strict. 

I looked it up, and the only change needed for changing oid to regclass is in pg_proc.dat (and the docs, of course). So I'm already on board.
 
* why is check_relation_permissions looking up the pg_class row?
There's already a copy of that in the Relation struct.  Likewise
for the other caller of can_modify_relation (but why is that
caller not using check_relation_permissions?)  That all looks
overly complicated and duplicative.  I think you don't need two
layers of function there.

To prove that the caller is the owner (or better) of the table.
  

* The array manipulations seem to me to be mostly not well chosen.
There's no reason to use expanded arrays here, since you won't be
modifying the arrays in-place; all that's doing is wasting memory.
I'm also noting a lack of defenses against nulls in the arrays.

Easily remedied in light of the deconstruct_array() suggestion below, but I do want to add that value_not_null_array_len() does check for nulls, and that function is used to generate all but one of the arrays (and for that one we're just verifying that its length matches the length of the other array). There's even a regression test that checks it (search for: "elem_count_histogram null element").
 
I'd suggest using deconstruct_array to disassemble the arrays,
if indeed they need disassembled at all.  (Maybe they don't, see
next item.)

+1 
 

* I'm dubious that we can fully vet the contents of these arrays,
and even a little dubious that we need to try.  As an example,
what's the worst that's going to happen if a histogram array isn't
sorted precisely?  You might get bogus selectivity estimates
from the planner, but that's no worse than you would've got with
no stats at all.  (It used to be that selfuncs.c would use a
histogram even if its contents didn't match the query's collation.
The comments justifying that seem to be gone, but I think it's
still the case that the code isn't *really* dependent on the sort
order being exactly so.)  The amount of hastily-written code in the
patch for checking this seems a bit scary, and it's well within the
realm of possibility that it introduces more bugs than it prevents.
We do need to verify data types, lack of nulls, and maybe
1-dimensional-ness, which could break the accessing code at a fairly
low level; but I'm not sure that we need more than that.

A lot of the feedback I got on this patch over the months concerned giving inaccurate, nonsensical, or malicious data to the planner. Surely the planner does do *some* defensive programming when fetching these values, but this is the first time those values were potentially set by a user, not by our own internal code. We can try to match types, collations, etc from source to dest, but even that would fall victim to another glibc-level collation change. Verifying that the list the source system said was sorted is actually sorted when put on the destination system is the truest test we're ever going to get, albeit for sampled elements.
 

* There's a lot of ERROR cases that maybe we ought to downgrade
to WARN-and-press-on, in the service of not breaking the restore
completely in case of trouble.

All cases were made errors precisely to spark debate about which cases we'd want to continue from and which we'd want to error from. Also, I was under the impression it was bad form to follow up a NOTICE/WARN with an ERROR in the same function call.

 
* 0002 is confused about whether the tag for these new TOC
entries is "STATISTICS" or "STATISTICS DATA".  I also think
they need to be in SECTION_DATA not SECTION_NONE, and I'd be
inclined to make them dependent on the table data objects
not the table declarations.  We don't really want a parallel
restore to load them before the data is loaded: that just
increases the risk of bad interactions with concurrent
auto-analyze.

SECTION_NONE works best, but we're getting some situations where relpages/reltuples/relallvisible get reset to 0s in pg_class. Hence the temporary --no-statistics in the pg_upgrade TAP test.

SECTION_POST_DATA (a previous suggestion) causes something weird to happen where certain GRANT/REVOKEs happen outside of their expected section. 

In work I've done since v15, I tried giving the table stats archive entry a dependency on every index (and index constraint) as well as the table itself, thinking that would get us past all resets of pg_class, but it hasn't worked.
 
* It'd definitely not be OK to put BEGIN/COMMIT into the commands
in these TOC entries.  But I don't think we need to.

Agreed. Don't need to, each function call now sinks or swims on its own.
 

* dumpRelationStats seems to be dumping the relation-level
stats twice.

+1 

* Why exactly are you suppressing testing of statistics upgrade
in 002_pg_upgrade??

Temporary. Related to the pg_class overwrite issue above.

Re: Statistics Import and Export

От
Tom Lane
Дата:
Corey Huinker <corey.huinker@gmail.com> writes:
>> and I really think that we need to provide
>> the source server's major version number --- maybe we will never
>> need that, but if we do and we don't have it we will be sad.

> The JSON had it, and I never did use it. Not against having it again.

Well, you don't need it now seeing that the definition of pg_stats
columns hasn't changed in the past ... but there's no guarantee we
won't want to change them in the future.

>> So this leads me to suggest that we'd be best off with a VARIADIC
>> ANY signature, where the variadic part consists of alternating
>> parameter labels and values:
>> pg_set_attribute_stats(table regclass, attribute name,
>>                        inherited bool, source_version int,
>>                        variadic "any") returns void

> I'm not aware of how strict works with variadics. Would the lack of any
> variadic parameters trigger it?

IIRC, "variadic any" requires having at least one variadic parameter.
But that seems fine --- what would be the point, or even the
semantics, of calling pg_set_attribute_stats with no data fields?

> Also going with strict means that an inadvertent explicit NULL in one
> parameter would cause the entire attribute import to fail silently. I'd
> rather fail loudly.

Not really convinced that that is worth any trouble...

> * We can require the calling statement to cast arguments, particularly
>> arrays, to the proper type, removing the need for conversions within
>> the stats-setting function.  (But instead, it'd need to check that the
>> next "any" argument is the type it ought to be based on the type of
>> the target column.)

> So, that's tricky. The type of the values is not always the attribute type,

Hmm.  You would need to have enough smarts in pg_set_attribute_stats
to identify the appropriate array type in any case: as coded, it needs
that for coercion, whereas what I'm suggesting would only require it
for checking, but either way you need it.  I do concede that pg_dump
(or other logic generating the calls) needs to know more under my
proposal than before.  I had been thinking that it would not need to
hard-code that because it could look to see what the actual type is
of the array it's dumping.  However, I see that pg_typeof() doesn't
work for that because it just returns anyarray.  Perhaps we could
invent a new backend function that extracts the actual element type
of a non-null anyarray argument.
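
For example (an aside, not something in the patch), querying a pg_stats
array column directly only reports the polymorphic type:

    SELECT pg_typeof(histogram_bounds)
    FROM pg_stats
    WHERE histogram_bounds IS NOT NULL
    LIMIT 1;
    -- returns "anyarray", not the concrete array type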

Another way we could get to no-coercions is to stick with your
signature but declare the relevant parameters as anyarray instead of
text.  I still think though that we'd be better off to leave the
parameter matching to runtime, so that we-don't-recognize-that-field
can be a warning not an error.

>> * why is check_relation_permissions looking up the pg_class row?
>> There's already a copy of that in the Relation struct.

> To prove that the caller is the owner (or better) of the table.

I think you missed my point: you're doing that inefficiently,
and maybe even with race conditions.  Use the relcache's copy
of the pg_class row.

>> * I'm dubious that we can fully vet the contents of these arrays,
>> and even a little dubious that we need to try.

> A lot of the feedback I got on this patch over the months concerned giving
> inaccurate, nonsensical, or malicious data to the planner. Surely the
> planner does do *some* defensive programming when fetching these values,
> but this is the first time those values were potentially set by a user, not
> by our own internal code. We can try to match types, collations, etc from
> source to dest, but even that would fall victim to another glibc-level
> collation change.

That sort of concern is exactly why I think the planner has to, and
does, defend itself.  Even if you fully vet the data at the instant
of loading, we might have the collation change under us later.

It could be argued that feeding bogus data to the planner for testing
purposes is a valid use-case for this feature.  (Of course, as
superuser we could inject bogus data into pg_statistic manually,
so it's not necessary to have this feature for that purpose.)
I guess I'm a great deal more sanguine than other people about the
planner's ability to tolerate inconsistent data; but in any case
I don't have a lot of faith in relying on checks in
pg_set_attribute_stats to substitute for that ability.  That idea
mainly leads to having a whole lot of code that has to be kept in
sync with other code that's far away from it and probably isn't
coded in a parallel fashion either.

>> * There's a lot of ERROR cases that maybe we ought to downgrade
>> to WARN-and-press-on, in the service of not breaking the restore
>> completely in case of trouble.

> All cases were made error precisely to spark debate about which cases we'd
> want to continue from and which we'd want to error from.

Well, I'm here to debate it if you want, but I'll just note that *one*
error will be enough to abort a pg_upgrade entirely, and most users
these days get scared by errors during manual dump/restore too.  So we
had better not be throwing errors except for cases that we don't think
pg_dump could ever emit.

> Also, I was under
> the impression it was bad form to follow up NOTICE/WARN with an ERROR in
> the same function call.

Seems like nonsense to me.  WARN then ERROR about the same condition
would be annoying, but that's not what we are talking about here.

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:
IIRC, "variadic any" requires having at least one variadic parameter.
But that seems fine --- what would be the point, or even the
semantics, of calling pg_set_attribute_stats with no data fields?

If my pg_dump run emitted a bunch of stats that could never be imported, I'd want to know. With silent failures, I don't.

 
Perhaps we could
invent a new backend function that extracts the actual element type
of a non-null anyarray argument.

A backend function that we can't guarantee exists on the source system. :(
 
Another way we could get to no-coercions is to stick with your
signature but declare the relevant parameters as anyarray instead of
text.  I still think though that we'd be better off to leave the
parameter matching to runtime, so that we-don't-recognize-that-field
can be a warning not an error.

I'm a bit confused here. AFAIK we can't construct an anyarray in SQL:

# select '{1,2,3}'::anyarray;
ERROR:  cannot accept a value of type anyarray
 
I think you missed my point: you're doing that inefficiently,
and maybe even with race conditions.  Use the relcache's copy
of the pg_class row.

Roger Wilco.


Well, I'm here to debate it if you want, but I'll just note that *one*
error will be enough to abort a pg_upgrade entirely, and most users
these days get scared by errors during manual dump/restore too.  So we
had better not be throwing errors except for cases that we don't think
pg_dump could ever emit.

That's pretty persuasive. It also means that we need to trap for error in the array_in() calls, as that function does not yet have a _safe() mode.

Re: Statistics Import and Export

От
Ashutosh Bapat
Дата:
Hi Corey,


On Mon, Mar 25, 2024 at 3:38 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
Hi Corey,


On Sat, Mar 23, 2024 at 7:21 AM Corey Huinker <corey.huinker@gmail.com> wrote:
v12 attached.

0001 - 


Some random comments

+SELECT
+    format('SELECT pg_catalog.pg_set_attribute_stats( '
+            || 'relation => %L::regclass::oid, attname => %L::name, '
+            || 'inherited => %L::boolean, null_frac => %L::real, '
+            || 'avg_width => %L::integer, n_distinct => %L::real, '
+            || 'most_common_vals => %L::text, '
+            || 'most_common_freqs => %L::real[], '
+            || 'histogram_bounds => %L::text, '
+            || 'correlation => %L::real, '
+            || 'most_common_elems => %L::text, '
+            || 'most_common_elem_freqs => %L::real[], '
+            || 'elem_count_histogram => %L::real[], '
+            || 'range_length_histogram => %L::text, '
+            || 'range_empty_frac => %L::real, '
+            || 'range_bounds_histogram => %L::text) ',
+        'stats_export_import.' || s.tablename || '_clone', s.attname,
+        s.inherited, s.null_frac,
+        s.avg_width, s.n_distinct,
+        s.most_common_vals, s.most_common_freqs, s.histogram_bounds,
+        s.correlation, s.most_common_elems, s.most_common_elem_freqs,
+        s.elem_count_histogram, s.range_length_histogram,
+        s.range_empty_frac, s.range_bounds_histogram)
+FROM pg_catalog.pg_stats AS s
+WHERE s.schemaname = 'stats_export_import'
+AND s.tablename IN ('test', 'is_odd')
+\gexec

Why do we need to construct the command and execute? Can we instead execute the function directly? That would also avoid ECHO magic.

Addressed in v15
 

+   <table id="functions-admin-statsimport">
+    <title>Database Object Statistics Import Functions</title>
+    <tgroup cols="1">
+     <thead>
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        Function
+       </para>
+       <para>
+        Description
+       </para></entry>
+      </row>
+     </thead>

COMMENT: The functions throw many validation errors. Do we want to list the acceptable/unacceptable input values in the documentation corresponding to those? I don't expect one line per argument validation. Something like "these, these and these arguments can not be NULL" or "both arguments in each of the pairs x and y, a and b, and c and d should be non-NULL or NULL respectively".

Addressed in v15.
 
+ /* Statistics are dependent on the definition, not the data */
+ /* Views don't have stats */
+ if ((tbinfo->dobj.dump & DUMP_COMPONENT_STATISTICS) &&
+ (tbinfo->relkind == RELKIND_VIEW))
+ dumpRelationStats(fout, &tbinfo->dobj, reltypename,
+  tbinfo->dobj.dumpId);
+

Statistics are about data. Whenever pg_dump dumps some filtered data, the
statistics collected for the whole table are useless. We should avoid dumping
statistics in such a case. E.g., when only the schema is dumped, what good are
the statistics? Similarly, the statistics on a partitioned table may not be useful
if some of its partitions are not dumped. That said, dumping statistics on a foreign
table makes sense, since it does not contain data but the statistics are still meaningful.

Dumping statistics without data is required for pg_upgrade. This is being discussed in the same thread. But I don't see some of the suggestions, e.g. using the binary-mode switch, applied in v15.

Also, should we handle sequences and composite types the same way? The latter is probably not dumped, but just in case.
 

Whether or not I pass --no-statistics, there is no difference in the dump output. Am I missing something?
$ pg_dump -d postgres > /tmp/dump_no_arguments.out
$ pg_dump -d postgres --no-statistics > /tmp/dump_no_statistics.out
$ diff /tmp/dump_no_arguments.out /tmp/dump_no_statistics.out
$

IIUC, pg_dump includes statistics by default. That means all our pg_dump related tests will have statistics output by default. That's good since the functionality will always be tested. 1. We need additional tests to ensure that the statistics are installed after restore. 2. Some of those tests compare dumps before and after restore. If the statistics are changed because of auto-analyze happening post-restore, these tests will fail.

Fixed.

Thanks for addressing those comments.
 
--
Best Wishes,
Ashutosh Bapat

Re: Statistics Import and Export

От
Ashutosh Bapat
Дата:
Hi Corey,

Some more comments on v15.

+/*
+ * A more encapsulated version of can_modify_relation for when the the
+ * HeapTuple and Form_pg_class are not needed later.
+ */
+static void
+check_relation_permissions(Relation rel)

This function is used at exactly one place, so usually it wouldn't make much sense to write a separate function. But given that the caller is so long, this seems OK. If this function returned the cached tuple when permission checks succeed, it could be used at the other place as well. The caller would be responsible for releasing the tuple or updating it.

The attached patch contains a test that invokes this function on a view. ANALYZE throws a WARNING when a view is passed to it. Similarly, this function should refuse to update the statistics on relations for which ANALYZE throws a warning. A warning instead of an error seems fine.
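
For reference, the ANALYZE behavior being pointed to looks like this (a minimal reproduction, not the patch's test):

    CREATE VIEW some_view AS SELECT 1 AS a;
    ANALYZE some_view;
    -- WARNING:  skipping "some_view" --- cannot analyze non-tables or special system tables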

+
+ const float4 min = 0.0;
+ const float4 max = 1.0;

When reading the validation condition, I have to look up the variable values. Can that be avoided by using the values directly in the condition itself? If there's some dependency elsewhere in the code, we can use macros. But I have not seen constant variables used in such a way elsewhere in the code.

+ values[Anum_pg_statistic_starelid - 1] = ObjectIdGetDatum(relid);
+ values[Anum_pg_statistic_staattnum - 1] = Int16GetDatum(attnum);
+ values[Anum_pg_statistic_stainherit - 1] = PG_GETARG_DATUM(P_INHERITED);

For a partitioned table this value has to be true. For a normal table, when setting this value to true, the function should at least make sure that the table has at least one child; otherwise it should throw an error. Blindly accepting the given value may render the statistics unusable. The prologue of the function needs to be updated accordingly.
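
A minimal SQL form of the has-a-child check suggested here, with 'some_table' as a placeholder:

    -- true if the table has at least one inheritance child or partition
    SELECT EXISTS (SELECT 1
                   FROM pg_inherits
                   WHERE inhparent = 'some_table'::regclass);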

I have fixed a documentation error in the patch as well. Please incorporate it in your next patchset.
--
Best Wishes,
Ashutosh Bapat
Вложения

Re: Statistics Import and Export

От
Tom Lane
Дата:
Corey Huinker <corey.huinker@gmail.com> writes:
>> IIRC, "variadic any" requires having at least one variadic parameter.
>> But that seems fine --- what would be the point, or even the
>> semantics, of calling pg_set_attribute_stats with no data fields?

> If my pg_dump run emitted a bunch of stats that could never be imported,
> I'd want to know. With silent failures, I don't.

What do you think would be silent about that?  If there's a complaint
to be made, it's that it'd be a hard failure ("no such function").

To be clear, I'm ok with emitting ERROR for something that pg_dump
clearly did wrong, which in this case would be emitting a
set_statistics call for an attribute it had exactly no stats values
for.  What I think needs to be WARN is conditions that the originating
pg_dump couldn't have foreseen, for example cross-version differences.
If we do try to check things like sort order, that complaint obviously
has to be WARN, since it's checking something potentially different
from what was correct at the source server.

>> Perhaps we could
>> invent a new backend function that extracts the actual element type
>> of a non-null anyarray argument.

> A backend function that we can't guarantee exists on the source system. :(

[ shrug... ] If this doesn't work for source servers below v17, that
would be a little sad, but it wouldn't be the end of the world.
I see your point that that is an argument for finding another way,
though.

>> Another way we could get to no-coercions is to stick with your
>> signature but declare the relevant parameters as anyarray instead of
>> text.

> I'm a bit confused here. AFAIK we can't construct an anyarray in SQL:

> # select '{1,2,3}'::anyarray;
> ERROR:  cannot accept a value of type anyarray

That's not what I suggested at all.  The function parameters would
be declared anyarray, but the values passed to them would be coerced
to the correct concrete array types.  So as far as the coercion rules
are concerned this'd be equivalent to the variadic-any approach.
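
For illustration, this is the same coercion that existing anyarray-taking
functions such as array_ndims() already rely on:

select array_ndims('{1,2,3}'::int4[]);   -- concrete int4[] coerces into the
                                         -- anyarray parameter, returns 1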

> That's pretty persuasive. It also means that we need to trap for error in
> the array_in() calls, as that function does not yet have a _safe() mode.

Well, the approach I'm advocating for would have the array input and
coercion done by the calling query before control ever reaches
pg_set_attribute_stats, so that any incorrect-for-the-data-type values
would result in hard errors.  I think that's okay for the same reason
you probably figured you didn't have to trap array_in: it's the fault
of the originating pg_dump if it offers a value that doesn't coerce to
the datatype it claims the value is of.  My formulation is a bit safer
though in that it's the originating pg_dump, not the receiving server,
that is in charge of saying which type that is.  (If that type doesn't
agree with what the receiving server thinks it should be, that's a
condition that pg_set_attribute_stats itself will detect, and then it
can WARN and move on to the next thing.)

            regards, tom lane



Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Sat, 2024-03-30 at 20:08 -0400, Tom Lane wrote:
> I haven't looked at the details, but I'm really a bit surprised
> by Jeff's assertion that CREATE INDEX destroys statistics on the
> base table.  That seems wrong from here, and maybe something we
> could have it not do.  (I do realize that it recalculates reltuples
> and relpages, but so what?  If it updates those, the results should
> be perfectly accurate.)

In the v15 of the patch I was looking at, "pg_dump -s" included the
statistics. The stats appeared first in the dump, followed by the
CREATE INDEX commands. The latter overwrote the relpages/reltuples set
by the former.

While zeros are the right answers for a schema-only dump, it defeated
the purpose of including relpages/reltuples stats in the dump, and
caused the pg_upgrade TAP test to fail.

You're right that there are a number of ways this could be resolved --
I don't think it's an inherent problem.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Bruce Momjian
Дата:
Reality check --- are we still targeting this feature for PG 17?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2024-03-30 at 20:08 -0400, Tom Lane wrote:
>> I haven't looked at the details, but I'm really a bit surprised
>> by Jeff's assertion that CREATE INDEX destroys statistics on the
>> base table.  That seems wrong from here, and maybe something we
>> could have it not do.  (I do realize that it recalculates reltuples
>> and relpages, but so what?  If it updates those, the results should
>> be perfectly accurate.)

> In the v15 of the patch I was looking at, "pg_dump -s" included the
> statistics. The stats appeared first in the dump, followed by the
> CREATE INDEX commands. The latter overwrote the relpages/reltuples set
> by the former.

> While zeros are the right answers for a schema-only dump, it defeated
> the purpose of including relpages/reltuples stats in the dump, and
> caused the pg_upgrade TAP test to fail.

> You're right that there are a number of ways this could be resolved --
> I don't think it's an inherent problem.

I'm inclined to call it not a problem at all.  While I do agree there
are use-cases for injecting false statistics with these functions,
I do not think that pg_dump has to cater to such use-cases.

In any case, I remain of the opinion that stats are data and should
not be included in a -s dump (with some sort of exception for
pg_upgrade).  If the data has been loaded, then a subsequent
overwrite by CREATE INDEX should not be a problem.

            regards, tom lane



Re: Statistics Import and Export

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> Reality check --- are we still targeting this feature for PG 17?

I'm not sure.  I think if we put our heads down we could finish
the changes I'm suggesting and resolve the other issues this week.
However, it is starting to feel like the sort of large, barely-ready
patch that we often regret cramming in at the last minute.  Maybe
we should agree that the first v18 CF would be a better time to
commit it.

            regards, tom lane



Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Mon, 2024-04-01 at 13:11 -0400, Bruce Momjian wrote:
> Reality check --- are we still targeting this feature for PG 17?

I see a few useful pieces here:

1. Support import of statistics (i.e.
pg_set_{relation|attribute}_stats()).

2. Support pg_dump of stats

3. Support pg_upgrade with stats

It's possible that not all of them make it, but let's not dismiss the
entire feature yet.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Bruce Momjian
Дата:
On Sun, Mar 31, 2024 at 07:04:47PM -0400, Tom Lane wrote:
> Corey Huinker <corey.huinker@gmail.com> writes:
> >> I can't quibble with that view of what has priority.  I'm just
> >> suggesting that redesigning what pg_upgrade does in this area
> >> should come later than doing something about extended stats.
> 
> > I mostly agree, with the caveat that pg_upgrade's existing message saying
> > that optimizer stats were not carried over wouldn't be 100% true anymore.
> 
> I think we can tweak the message wording.  I just don't want to be
> doing major redesign of the behavior, nor adding fundamentally new
> monitoring capabilities.

I think pg_upgrade could check for the existence of extended statistics
in any database and adjust the analyze recommendation wording
accordingly.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Sun, 2024-03-31 at 14:48 -0400, Tom Lane wrote:
> What happens when
> somebody adds a new stakind (and hence new pg_stats column)?
> You could try to add an overloaded pg_set_attribute_stats
> version with more parameters, but I'm pretty sure that would
> lead to "ambiguous function call" failures when trying to load
> old dump files containing only the original parameters.

Why would you need to overload in this case? Wouldn't we just define a
new function with more optional named parameters?

>   The
> present design is also fragile in that an unrecognized parameter
> will lead to a parse-time failure and no function call happening,
> which is less robust than I'd like.

I agree on this point; I found this annoying when testing the feature.

> So this leads me to suggest that we'd be best off with a VARIADIC
> ANY signature, where the variadic part consists of alternating
> parameter labels and values:

I didn't consider this and I think it has a lot of advantages. It's
slightly unfortunate that we can't make them explicitly name/value
pairs, but pg_dump can use whitespace or even SQL comments to make it
more readable.
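
To sketch it (the fixed leading arguments and the stat names here are
illustrative only, not a settled signature), a call might look like:

SELECT pg_set_attribute_stats(
    'public.test_table'::regclass,   -- relation
    'id'::name,                      -- attribute (hypothetical column)
    false,                           -- inherited
    170000,                          -- source server version
    /* alternating label/value pairs from here on */
    'null_frac',        0.02::float4,
    'avg_width',        4::int4,
    'n_distinct',       -1::float4,
    'histogram_bounds', '{1,500,1000}'::int4[]);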

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2024-04-01 at 13:11 -0400, Bruce Momjian wrote:
>> Reality check --- are we still targeting this feature for PG 17?

> I see a few useful pieces here:

> 1. Support import of statistics (i.e.
> pg_set_{relation|attribute}_stats()).

> 2. Support pg_dump of stats

> 3. Support pg_upgrade with stats

> It's possible that not all of them make it, but let's not dismiss the
> entire feature yet.

The unresolved questions largely have to do with the interactions
between these pieces.  I think we would seriously regret setting
any one of them in stone before all three are ready to go.

            regards, tom lane



Re: Statistics Import and Export

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sun, 2024-03-31 at 14:48 -0400, Tom Lane wrote:
>> What happens when
>> somebody adds a new stakind (and hence new pg_stats column)?

> Why would you need to overload in this case? Wouldn't we just define a
> new function with more optional named parameters?

Ah, yeah, you could change the function to have more parameters,
given the assumption that all calls will be named-parameter style.
I still suggest that my proposal is more robust for the case where
the dump lists parameters that the receiving system doesn't have.

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:
That's not what I suggested at all.  The function parameters would
be declared anyarray, but the values passed to them would be coerced
to the correct concrete array types.  So as far as the coercion rules
are concerned this'd be equivalent to the variadic-any approach.

+1

 

> That's pretty persuasive. It also means that we need to trap for error in
> the array_in() calls, as that function does not yet have a _safe() mode.

Well, the approach I'm advocating for would have the array input and
coercion done by the calling query before control ever reaches
pg_set_attribute_stats, so that any incorrect-for-the-data-type values
would result in hard errors.  I think that's okay for the same reason
you probably figured you didn't have to trap array_in: it's the fault
of the originating pg_dump if it offers a value that doesn't coerce to
the datatype it claims the value is of.

+1
 

Re: Statistics Import and Export

От
Corey Huinker
Дата:
I think pg_upgrade could check for the existence of extended statistics
in any database and adjust the analyze recommendation wording
accordingly.

+1

Re: Statistics Import and Export

От
Corey Huinker
Дата:
Ah, yeah, you could change the function to have more parameters,
given the assumption that all calls will be named-parameter style.
I still suggest that my proposal is more robust for the case where
the dump lists parameters that the receiving system doesn't have.

So what's the behavior when the user fails to supply a parameter that is currently NOT NULL checked (example: avg_width)? Is that a WARN-and-exit?
 

Re: Statistics Import and Export

От
Tom Lane
Дата:
Corey Huinker <corey.huinker@gmail.com> writes:
> So what's the behavior when the user fails to supply a parameter that is
> currently NOT NULL checked (example: avg_width)? Is that a WARN-and-exit?

I still think that we could just declare the function strict, if we
use the variadic-any approach.  Passing a null in any position is
indisputable caller error.  However, if you're allergic to silently
doing nothing in such a case, we could have pg_set_attribute_stats
check each argument and throw an error.  (Or warn and keep going;
but according to the design principle I posited earlier, this'd be
the sort of thing we don't need to tolerate.)

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:

I still think that we could just declare the function strict, if we
use the variadic-any approach.  Passing a null in any position is
indisputable caller error.  However, if you're allergic to silently
doing nothing in such a case, we could have pg_set_attribute_stats
check each argument and throw an error.  (Or warn and keep going;
but according to the design principle I posited earlier, this'd be
the sort of thing we don't need to tolerate.)

Any thoughts about going back to having a return value, so a caller could then see that the function returned NULL rather than whatever the expected value was (example: TRUE)?

Re: Statistics Import and Export

От
Tom Lane
Дата:
Corey Huinker <corey.huinker@gmail.com> writes:
> Any thoughts about going back to having a return value, a caller could then
> see that the function returned NULL rather than whatever the expected value
> was (example: TRUE)?

If we are envisioning that the function might emit multiple warnings
per call, a useful definition could be to return the number of
warnings (so zero is good, not-zero is bad).  But I'm not sure that's
really better than a boolean result.  pg_dump/pg_restore won't notice
anyway, but perhaps other programs using these functions would care.

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:
If we are envisioning that the function might emit multiple warnings
per call, a useful definition could be to return the number of
warnings (so zero is good, not-zero is bad).  But I'm not sure that's
really better than a boolean result.  pg_dump/pg_restore won't notice
anyway, but perhaps other programs using these functions would care.

A boolean is what we had before, I'm quite comfortable with that, and it addresses my silent-failure concerns.
 

Re: Statistics Import and Export

От
Tom Lane
Дата:
Corey Huinker <corey.huinker@gmail.com> writes:
> A boolean is what we had before, I'm quite comfortable with that, and it
> addresses my silent-failure concerns.

WFM.

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:
Here's a one-liner patch for disabling update of pg_class relpages/reltuples/relallvisible during a binary upgrade.

This was causing pg_upgrade tests to fail in the existing stats import work.
Вложения

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Tue, 2024-04-02 at 05:38 -0400, Corey Huinker wrote:
> Here's a one-liner patch for disabling update of pg_class
> relpages/reltuples/relallvisible during a binary upgrade.

This change makes sense to me regardless of the rest of the work.
Updating the relpages/reltuples/relallvisible during pg_upgrade before
the data is there will store the wrong stats.

It could use a brief comment, though.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Corey Huinker
Дата:
I have refactored pg_set_relation_stats to be variadic, and I'm working on pg_set_attribute_stats, but I'm encountering an issue with the anyarray values.

Jeff suggested looking at anyarray_send as a way of extracting the type, and with some extra twiddling we can get and cast the type. However, some of the ANYARRAYs have element types that are themselves arrays, and near as I can tell, such a construct is not expressible in SQL. So, rather than getting an anyarray of an array type, you instead get an array of one higher dimension.  Like so:

# select schemaname, tablename, attname,
         substring(substring(anyarray_send(histogram_bounds) from 9 for 4)::text,2)::bit(32)::integer::regtype,
         substring(substring(anyarray_send(histogram_bounds::text::text[][]) from 9 for 4)::text,2)::bit(32)::integer::regtype
  from pg_stats
  where histogram_bounds is not null
    and tablename = 'pg_proc' and attname = 'proargnames';

 schemaname | tablename |   attname   | substring | substring
------------+-----------+-------------+-----------+-----------
 pg_catalog | pg_proc   | proargnames | text[]    | text
Luckily, passing in such a value would have done all of the element typechecking for us, so we would just need to move the data into an array of one dimension less, typed elem[]. If there's an easy way to do that, I don't know of it.

What remains is just checking the input types against the expected type of the array, stepping down the dimension if need be, and skipping if the type doesn't meet expectations.

 

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Tue, 2024-04-02 at 12:59 -0400, Corey Huinker wrote:
>  However, some of the ANYARRAYs have element types that are
> themselves arrays, and near as I can tell, such a construct is not
> expressible in SQL. So, rather than getting an anyarray of an array
> type, you instead get an array of one higher dimension.

Fundamentally, you want to recreate the exact same anyarray values on
the destination system as they existed on the source. There's some
complexity to that on both the export side as well as the import side,
but I believe the problems are solvable.

On the export side, the problem is that the element type (and
dimensionality and maybe hasnull) is an important part of the anyarray
value, but it's not part of the output of anyarray_out(). For new
versions, we can add a scalar function that simply outputs the
information we need. For old versions, we can hack it by parsing the
output of anyarray_send(), which contains the information we need
(binary outputs are under-specified, but I believe they are specified
enough in this case). There may be other hacks to get the information
from the older systems; that's just an idea. To get the actual data,
doing histogram_bounds::text::text[] seems to be enough: that seems to
always give a one-dimensional array with element type "text", even if
the element type is an array. (Note: this means we need the function's
API to also include this extra information about the anyarray values,
so it might be slightly more complex than name/value pairs).
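
For example, against the same pg_proc.proargnames stats row used earlier in
the thread, the text round trip comes back one-dimensional:

SELECT array_ndims(histogram_bounds::text::text[]) AS ndims,
       pg_typeof(histogram_bounds::text::text[])   AS typ
  FROM pg_stats
 WHERE tablename = 'pg_proc' AND attname = 'proargnames';
-- expected: ndims = 1, typ = text[]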

On the import side, the problem is that there may not be an input
function to go from a 1-D array of text to a 1-D array of any element
type we want. For example, there's no input function that will create a
1-D array with element type float4[] (that's because Postgres doesn't
really have arrays-of-arrays, it has multi-dimensional arrays).
Instead, don't use the input function, pass each element of the 1-D
text array to the element type's input function (which may be scalar or
not) and then construct a 1-D array out of that with the appropriate
element type (which may be scalar or not).
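
To illustrate the gap at the SQL level: extra brackets in a cast are purely
cosmetic, and a nested literal just produces a multi-dimensional array rather
than a 1-D array of float4[]:

SELECT pg_typeof('{{1,2},{3,4}}'::float4[][]),    -- real[]
       array_ndims('{{1,2},{3,4}}'::float4[][]);  -- 2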

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On the export side, the problem is that the element type (and
> dimensionality and maybe hasnull) is an important part of the anyarray
> value, but it's not part of the output of anyarray_out(). For new
> versions, we can add a scalar function that simply outputs the
> information we need. For old versions, we can hack it by parsing the
> output of anyarray_send(), which contains the information we need
> (binary outputs are under-specified, but I believe they are specified
> enough in this case).

Yeah, I was thinking yesterday about pulling the anyarray columns in
binary and looking at the header fields.  However, I fear there is a
showstopper problem: anyarray_send will fail if the element type
doesn't have a typsend function, which is entirely possible for
user-defined types (and I'm not even sure we've provided them for
every type in the core distro).  I haven't thought of a good answer
to that other than a new backend function.  However ...

> On the import side, the problem is that there may not be an input
> function to go from a 1-D array of text to a 1-D array of any element
> type we want. For example, there's no input function that will create a
> 1-D array with element type float4[] (that's because Postgres doesn't
> really have arrays-of-arrays, it has multi-dimensional arrays).
> Instead, don't use the input function, pass each element of the 1-D
> text array to the element type's input function (which may be scalar or
> not) and then construct a 1-D array out of that with the appropriate
> element type (which may be scalar or not).

Yup.  I had hoped that we could avoid doing any array-munging inside
pg_set_attribute_stats, but this array-of-arrays problem seems to
mean we have to.  In turn, that means that the whole idea of
declaring the function inputs as anyarray rather than text[] is
probably pointless.  And that means that we don't need the sending
side to know the element type anyway.  So, I apologize for sending
us down a useless side path.  We may as well stick to the function
signature as shown in the v15 patch --- although maybe variadic
any is still worthwhile so that an unrecognized field name doesn't
need to be a hard error?

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:
side to know the element type anyway.  So, I apologize for sending
us down a useless side path.  We may as well stick to the function
signature as shown in the v15 patch --- although maybe variadic
any is still worthwhile so that an unrecognized field name doesn't
need to be a hard error?

Variadic is nearly done. This issue was the main blocking point. I can go back to array_in() as we know that code works.


 

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Tue, 2024-04-02 at 17:31 -0400, Tom Lane wrote:
> And that means that we don't need the sending
> side to know the element type anyway.

We need to get the original element type on the import side somehow,
right? Otherwise it will be hard to tell whether '{1, 2, 3, 4}' has
element type "int4" or "text", which affects the binary representation
of the anyarray value in pg_statistic.

Either we need to get it at export time (which seems the most reliable
in principle, but problematic for older versions) and pass it as an
argument to pg_set_attribute_stats(); or we need to derive it reliably
from the table schema on the destination side, right?

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> We need to get the original element type on the import side somehow,
> right? Otherwise it will be hard to tell whether '{1, 2, 3, 4}' has
> element type "int4" or "text", which affects the binary representation
> of the anyarray value in pg_statistic.

Yeah, but that problem exists no matter what.  I haven't read enough
of the patch to find where it's determining that, but I assume there's
code in there to intuit the statistics storage type depending on the
table column's data type and the statistics kind.

> Either we need to get it at export time (which seems the most reliable
> in principle, but problematic for older versions) and pass it as an
> argument to pg_set_attribute_stats(); or we need to derive it reliably
> from the table schema on the destination side, right?

We could not trust the exporting side to tell us the correct answer;
for one reason, it might be different across different releases.
So "derive it reliably on the destination" is really the only option.

I think that it's impossible to do this in the general case, since
type-specific typanalyze functions can store pretty nearly whatever
they like.  However, the pg_stats view isn't going to show nonstandard
statistics kinds anyway, so we are going to be lossy for custom
statistics kinds.

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:
Yeah, but that problem exists no matter what.  I haven't read enough
of the patch to find where it's determining that, but I assume there's
code in there to intuit the statistics storage type depending on the
table column's data type and the statistics kind.

Correct. It borrows a lot from examine_attribute() and the *_typanalyze() functions. Actually using VacAttrStats proved problematic, but that can be revisited at some point.
 
We could not trust the exporting side to tell us the correct answer;
for one reason, it might be different across different releases.
So "derive it reliably on the destination" is really the only option.

+1
 
I think that it's impossible to do this in the general case, since
type-specific typanalyze functions can store pretty nearly whatever
they like.  However, the pg_stats view isn't going to show nonstandard
statistics kinds anyway, so we are going to be lossy for custom
statistics kinds.

Sadly true.

Re: Statistics Import and Export

От
Corey Huinker
Дата:
v16 attached.

- both functions now use variadics for anything that can be considered a stat.
- most consistency checks removed, null element tests remain
- functions strive to not ERROR unless absolutely necessary. The biggest exposure is the call to array_in().
- docs have not yet been updated, pending general acceptance of the variadic over the named arg version.

Having variadic arguments is definitely a bit more work to manage, and the shift from ERROR to WARN removes a lot of the easy exits the code previously had, as well as requiring some extra type checking that we got for free with fixed arguments. Still, I don't think the readability suffers too much, and we are now able to support downgrades as well as upgrades.

Вложения

Re: Statistics Import and Export

От
Tom Lane
Дата:
Corey Huinker <corey.huinker@gmail.com> writes:
> - functions strive to not ERROR unless absolutely necessary. The biggest
> exposure is the call to array_in().

As far as that goes, it shouldn't be that hard to deal with, at least
not for "soft" errors which hopefully cover most input-function
failures these days.  You should be invoking array_in via
InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
(Look at pg_input_error_info() for useful precedent.)
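
At the SQL level the same machinery is already visible via the existing
helpers, e.g. with an intentionally bad int4[] literal:

SELECT pg_input_is_valid('{1,2,banana}', 'int4[]');   -- f, no ERROR raised
SELECT message FROM pg_input_error_info('{1,2,banana}', 'int4[]');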

There might be something to be said for handling all the error
cases via an ErrorSaveContext and use of ereturn() instead of
ereport().  Not sure if it's worth the trouble or not.

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:

As far as that goes, it shouldn't be that hard to deal with, at least
not for "soft" errors which hopefully cover most input-function
failures these days.  You should be invoking array_in via
InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
(Look at pg_input_error_info() for useful precedent.)

Ah, my understanding may be out of date. I was under the impression that that mechanism relied on the cooperation of the per-element input function, so even if we got all the builtin datatypes to play nice with *Safe(), we were always going to be at risk with a user-defined input function.
 
There might be something to be said for handling all the error
cases via an ErrorSaveContext and use of ereturn() instead of
ereport().  Not sure if it's worth the trouble or not.

It would help us tailor the user experience. Right now we have several endgames. To recap:

1. NULL input => Return NULL. (because strict).
2. Actual error (permissions, cache lookup not found, etc) => Raise ERROR (thus ruining binary upgrade)
3. Call values are so bad (examples: attname not found, required stat missing) that nothing can recover => WARN, return FALSE.
4. At least one stakind-stat is wonky (impossible for datatype, missing stat pair, wrong type on input parameter), but that's the worst of it => 1 to N WARNs, write stats that do make sense, return TRUE.
5. Hunky-dory. => No warns. Write all stats. return TRUE.

Which of those seem like good ereturn candidates to you?

Re: Statistics Import and Export

От
Tom Lane
Дата:
Corey Huinker <corey.huinker@gmail.com> writes:
>> As far as that goes, it shouldn't be that hard to deal with, at least
>> not for "soft" errors which hopefully cover most input-function
>> failures these days.  You should be invoking array_in via
>> InputFunctionCallSafe and passing a suitably-set-up ErrorSaveContext.
>> (Look at pg_input_error_info() for useful precedent.)

> Ah, my understanding may be out of date. I was under the impression that
> that mechanism relied on the cooperation of the per-element input
> function, so even if we got all the builtin datatypes to play nice with
> *Safe(), we were always going to be at risk with a user-defined input
> function.

That's correct, but it's silly not to do what we can.  Also, I imagine
that there is going to be high evolutionary pressure on UDTs to
support soft error mode for COPY, so over time the problem will
decrease --- as long as we invoke the soft error mode.

> 1. NULL input =>  Return NULL. (because strict).
> 2. Actual error (permissions, cache lookup not found, etc) => Raise ERROR
> (thus ruining binary upgrade)
> 3. Call values are so bad (examples: attname not found, required stat
> missing) that nothing can recover => WARN, return FALSE.
> 4. At least one stakind-stat is wonky (impossible for datatype, missing
> stat pair, wrong type on input parameter), but that's the worst of it => 1
> to N WARNs, write stats that do make sense, return TRUE.
> 5. Hunky-dory. => No warns. Write all stats. return TRUE.

> Which of those seem like good ereturn candidates to you?

I'm good with all those behaviors.  On reflection, the design I was
vaguely imagining wouldn't cope with case 4 (multiple WARNs per call)
so never mind that.

            regards, tom lane



Re: Statistics Import and Export

От
Michael Paquier
Дата:
On Mon, Apr 01, 2024 at 01:21:53PM -0400, Tom Lane wrote:
> I'm not sure.  I think if we put our heads down we could finish
> the changes I'm suggesting and resolve the other issues this week.
> However, it is starting to feel like the sort of large, barely-ready
> patch that we often regret cramming in at the last minute.  Maybe
> we should agree that the first v18 CF would be a better time to
> commit it.

There are still 4 days remaining, so there's still time, but my
overall experience on the matter with my RMT hat on is telling me that
we should not rush this patch set.  Redesigning portions close to the
end of a dev cycle is not a good sign, I am afraid, especially if the
sub-parts of the design don't fit well in the global picture as that
could mean more maintenance work on stable branches in the long term.
Still, it is very good to be aware of the problems because you'd know
what to tackle to reach the goals of this patch set.
--
Michael

Вложения

Re: Statistics Import and Export

От
Corey Huinker
Дата:

I'm good with all those behaviors.  On reflection, the design I was
vaguely imagining wouldn't cope with case 4 (multiple WARNs per call)
so never mind that.

                        regards, tom lane

v17

0001
- array_in now repackages cast errors as warnings and skips the stat, test added
- version parameter added, though it's mostly for future compatibility, tests modified
- both functions delay object/attribute locking until absolutely necessary
- general cleanup

0002
- added version parameter to dumps
- --schema-only will not dump stats unless in binary upgrade mode
- stats are dumped SECTION_NONE
- general cleanup

I think that covers the outstanding issues. 
Вложения

Re: Statistics Import and Export

От
Corey Huinker
Дата:
For a partitioned table this value has to be true. For a normal table when setting this value to true, it should at least make sure that the table has at least one child. Otherwise it should throw an error. Blindly accepting the given value may render the statistics unusable. Prologue of the function needs to be updated accordingly.

I can see rejecting non-inherited stats for a partitioned table. The reverse, however, isn't true, because a table may end up being inherited by another, so those statistics may be legit. Having said that, a great deal of the data validation I was doing was seen as unnecessary, so I'm not sure where this check would fall on that line. It's a trivial check if we do add it.

Re: Statistics Import and Export

От
Ashutosh Bapat
Дата:


On Fri, Apr 5, 2024 at 7:00 AM Corey Huinker <corey.huinker@gmail.com> wrote:
For a partitioned table this value has to be true. For a normal table when setting this value to true, it should at least make sure that the table has at least one child. Otherwise it should throw an error. Blindly accepting the given value may render the statistics unusable. Prologue of the function needs to be updated accordingly.

I can see rejecting non-inherited stats for a partitioned table. The reverse, however, isn't true, because a table may end up being inherited by another, so those statistics may be legit. Having said that, a great deal of the data validation I was doing was seen as unnecessary, so I'm not sure where this check would fall on that line. It's a trivial check if we do add it.

I read that discussion, and it may be ok for the pg_upgrade/pg_dump use case, and maybe also for IMPORT FOREIGN SCHEMA, where the SQL is generated by PostgreSQL itself. But not for simulating statistics. In that case, if the function happily installs statistics cooked by the user and those aren't used anywhere, users may be misled by the plans that are generated subsequently, thus negating the very purpose of simulating statistics. Once the feature is out there, we won't be able to restrict its usage unless we document the possible anomalies.

--
Best Wishes,
Ashutosh Bapat

Re: Statistics Import and Export

От
Tom Lane
Дата:
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> I read that discussion, and it may be ok for pg_upgrade/pg_dump usecase and
> maybe also for IMPORT foreign schema where the SQL is generated by
> PostgreSQL itself. But not for simulating statistics. In that case, if the
> function happily installs statistics cooked by the user and those aren't
> used anywhere, users may be misled by the plans that are generated
> subsequently. Thus negating the very purpose of simulating
> statistics.

I'm not sure what you think the "purpose of simulating statistics" is,
but it seems like you have an extremely narrow-minded view of it.
I think we should allow injecting any stats that won't actively crash
the backend.  Such functionality could be useful for stress-testing
the planner, for example, or even just to see what it would do in
a situation that is not what you have.

Note that I don't think pg_dump or pg_upgrade need to support
injection of counterfactual statistics.  But direct calls of the
stats insertion functions should be able to do so.

            regards, tom lane



Re: Statistics Import and Export

От
Ashutosh Bapat
Дата:


On Fri, Apr 5, 2024 at 10:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> I read that discussion, and it may be ok for pg_upgrade/pg_dump usecase and
> maybe also for IMPORT foreign schema where the SQL is generated by
> PostgreSQL itself. But not for simulating statistics. In that case, if the
> function happily installs statistics cooked by the user and those aren't
> used anywhere, users may be misled by the plans that are generated
> subsequently. Thus negating the very purpose of simulating
> statistics.

I'm not sure what you think the "purpose of simulating statistics" is,
but it seems like you have an extremely narrow-minded view of it.
I think we should allow injecting any stats that won't actively crash
the backend.  Such functionality could be useful for stress-testing
the planner, for example, or even just to see what it would do in
a situation that is not what you have.

My reply was in the following context
For a partitioned table this value has to be true. For a normal table when setting this value to true, it should at least make sure that the table has at least one child. Otherwise it should throw an error. Blindly accepting the given value may render the statistics unusable. Prologue of the function needs to be updated accordingly.

I can see rejecting non-inherited stats for a partitioned table. The reverse, however, isn't true, because a table may end up being inherited by another, so those statistics may be legit. Having said that, a great deal of the data validation I was doing was seen as unnecessary, so I'm not sure where this check would fall on that line. It's a trivial check if we do add it.

If a user installs inherited stats for a non-inherited table by accidentally passing true to the corresponding argument, those stats won't even be used. The user wouldn't know that those stats are not used. Yet, they would think that any change in the plans is the result of their stats. So whatever simulation experiment they are running would lead to wrong conclusions. This could be easily avoided by raising an error. Similarly for installing non-inherited stats for a partitioned table. There might be other scenarios where the error won't be required.

--
Best Wishes,
Ashutosh Bapat

Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Thu, 2024-04-04 at 00:30 -0400, Corey Huinker wrote:
>
> v17
>
> 0001
> - array_in now repackages cast errors as warnings and skips the stat,
> test added
> - version parameter added, though it's mostly for future
> compatibility, tests modified
> - both functions delay object/attribute locking until absolutely
> necessary
> - general cleanup
>
> 0002
> - added version parameter to dumps
> - --schema-only will not dump stats unless in binary upgrade mode
> - stats are dumped SECTION_NONE
> - general cleanup
>
> I think that covers the outstanding issues. 

Thank you, this has improved a lot and the fundamentals are very close.

I think it could benefit from a bit more time to settle on a few
issues:

1. SECTION_NONE. Conceptually, stats are more like data, and so
intuitively I would expect this in the SECTION_DATA or
SECTION_POST_DATA. However, the two most important use cases (in my
opinion) don't involve dumping the data: pg_upgrade (data doesn't come
from the dump) and planner simulations/repros. Perhaps the section we
place it in is not a critical decision, but we will need to stick with
it for a long time, and I'm not sure that we have consensus on that
point.

2. We changed the stats import function API to be VARIADIC very
recently. After we have a bit of time to think on it, I'm not 100% sure
we will want to stick with that new API. It's not easy to document,
which is something I always like to consider.

3. The error handling also changed recently to change soft errors (i.e.
type input errors) to warnings. I like this change but I'd need a bit
more time to get comfortable with how this is done, there is not a lot
of precedent for doing this kind of thing. This is connected to the
return value, as well as the machine-readability concern that Magnus
raised.

Additionally, a lot of people are simply very busy around this time,
and may not have had a chance to opine on all the recent changes yet.

Regards,
    Jeff Davis





Re: Statistics Import and Export

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> Thank you, this has improved a lot and the fundamentals are very close.
> I think it could benefit from a bit more time to settle on a few
> issues:

Yeah ... it feels like we aren't quite going to manage to get this
over the line for v17.  We could commit with the hope that these
last details will get sorted later, but that path inevitably leads
to a mess.

> 1. SECTION_NONE. Conceptually, stats are more like data, and so
> intuitively I would expect this in the SECTION_DATA or
> SECTION_POST_DATA. However, the two most important use cases (in my
> opinion) don't involve dumping the data: pg_upgrade (data doesn't come
> from the dump) and planner simulations/repros. Perhaps the section we
> place it in is not a critical decision, but we will need to stick with
> it for a long time, and I'm not sure that we have consensus on that
> point.

I think it'll be a serious, serious error for this not to be
SECTION_DATA.  Maybe POST_DATA is OK, but even that seems like
an implementation compromise not "the way it ought to be".

> 2. We changed the stats import function API to be VARIADIC very
> recently. After we have a bit of time to think on it, I'm not 100% sure
> we will want to stick with that new API. It's not easy to document,
> which is something I always like to consider.

Perhaps.  I think the argument of wanting to be able to salvage
something even in the presence of unrecognized stats types is
stronger, but I agree this could use more time in the oven.
Unlike many other things in this patch, this would be nigh
impossible to reconsider later.

> 3. The error handling also changed recently to change soft errors (i.e.
> type input errors) to warnings. I like this change but I'd need a bit
> more time to get comfortable with how this is done, there is not a lot
> of precedent for doing this kind of thing.

I don't think there's much disagreement that that's the right thing,
but yeah there could be bugs or some more to do in this area.

            regards, tom lane



Re: Statistics Import and Export

От
Corey Huinker
Дата:


I think it'll be a serious, serious error for this not to be
SECTION_DATA.  Maybe POST_DATA is OK, but even that seems like
an implementation compromise not "the way it ought to be".

We'd have to split them on account of when the underlying object is created. Index statistics would be SECTION_POST_DATA, and everything else would be SECTION_DATA. Looking ahead, statistics data for extended statistics objects would also be POST. That's not a big change, but my first attempt at that resulted in a bunch of unrelated grants dumping in the wrong section.

Re: Statistics Import and Export

От
Corey Huinker
Дата:
On Sat, Apr 6, 2024 at 5:23 PM Corey Huinker <corey.huinker@gmail.com> wrote:


I think it'll be a serious, serious error for this not to be
SECTION_DATA.  Maybe POST_DATA is OK, but even that seems like
an implementation compromise not "the way it ought to be".

We'd have to split them on account of when the underlying object is created. Index statistics would be SECTION_POST_DATA, and everything else would be SECTION_DATA. Looking ahead, statistics data for extended statistics objects would also be POST. That's not a big change, but my first attempt at that resulted in a bunch of unrelated grants dumping in the wrong section.

At the request of a few people, attached is an attempt to move stats to DATA/POST-DATA, and the TAP test failure that results from that.

The relevant errors are confusing, in that they all concern GRANT/REVOKE, despite the fact that I made no changes to the TAP test itself.

$ grep 'not ok' build/meson-logs/testlog.txt
not ok 9347 - section_data: should not dump GRANT INSERT(col1) ON TABLE test_second_table
not ok 9348 - section_data: should not dump GRANT SELECT (proname ...) ON TABLE pg_proc TO public
not ok 9349 - section_data: should not dump GRANT SELECT ON TABLE measurement
not ok 9350 - section_data: should not dump GRANT SELECT ON TABLE measurement_y2006m2
not ok 9351 - section_data: should not dump GRANT SELECT ON TABLE test_table
not ok 9379 - section_data: should not dump REVOKE SELECT ON TABLE pg_proc FROM public
not ok 9788 - section_pre_data: should dump CREATE TABLE test_table
not ok 9837 - section_pre_data: should dump GRANT INSERT(col1) ON TABLE test_second_table
not ok 9838 - section_pre_data: should dump GRANT SELECT (proname ...) ON TABLE pg_proc TO public
not ok 9839 - section_pre_data: should dump GRANT SELECT ON TABLE measurement
not ok 9840 - section_pre_data: should dump GRANT SELECT ON TABLE measurement_y2006m2
not ok 9841 - section_pre_data: should dump GRANT SELECT ON TABLE test_table
not ok 9869 - section_pre_data: should dump REVOKE SELECT ON TABLE pg_proc FROM public
 
Вложения

Re: Statistics Import and Export

От
Nathan Bossart
Дата:
On Thu, Apr 11, 2024 at 03:54:07PM -0400, Corey Huinker wrote:
> At the request of a few people, attached is an attempt to move stats to
> DATA/POST-DATA, and the TAP test failure that results from that.
> 
> The relevant errors are confusing, in that they all concern GRANT/REVOKE,
> and the fact that I made no changes to the TAP test itself.
> 
> $ grep 'not ok' build/meson-logs/testlog.txt
> not ok 9347 - section_data: should not dump GRANT INSERT(col1) ON TABLE
> test_second_table

It looks like the problem is that the ACLs are getting dumped in the data
section when we are also dumping stats.  I'm able to get the tests to pass
by moving the call to dumpRelationStats() that's in dumpTableSchema() to
dumpTableData().  I'm not entirely sure why that fixes it yet, but if we're
treating stats as data, then it intuitively makes sense for us to dump it
in dumpTableData().  However, that seems to prevent the stats from getting
exported in the --schema-only/--binary-upgrade scenario, which presents a
problem for pg_upgrade.  ISTM we'll need some extra hacks to get this to
work as desired.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Wed, 2024-04-17 at 11:50 -0500, Nathan Bossart wrote:
> It looks like the problem is that the ACLs are getting dumped in the
> data
> section when we are also dumping stats.  I'm able to get the tests to
> pass
> by moving the call to dumpRelationStats() that's in dumpTableSchema()
> to
> dumpTableData().  I'm not entirely sure why that fixes it yet, but if
> we're
> treating stats as data, then it intuitively makes sense for us to
> dump it
> in dumpTableData().

Would it make sense to have a new SECTION_STATS?

>  However, that seems to prevent the stats from getting
> exported in the --schema-only/--binary-upgrade scenario, which
> presents a
> problem for pg_upgrade.  ISTM we'll need some extra hacks to get this
> to
> work as desired.

Philosophically, I suppose stats are data, but I still don't understand
why considering stats to be data is so important in pg_dump.

Practically, I want to dump stats XOR data. That's because, if I dump
the data, it's so costly to reload and rebuild indexes that it's not
very important to avoid a re-ANALYZE.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> Would it make sense to have a new SECTION_STATS?

Perhaps, but the implications for pg_dump's API would be nontrivial,
eg would we break any applications that know about the current
options for --section.  And you still have to face up to the question
"does --data-only include this stuff?".

> Philosophically, I suppose stats are data, but I still don't understand
> why considering stats to be data is so important in pg_dump.
> Practically, I want to dump stats XOR data. That's because, if I dump
> the data, it's so costly to reload and rebuild indexes that it's not
> very important to avoid a re-ANALYZE.

Hmm, interesting point.  But the counterargument to that is that
the cost of building indexes will also dwarf the cost of installing
stats, so why not do so?  Loading data without stats, and hoping
that auto-analyze will catch up sooner not later, is exactly the
current behavior that we're doing all this work to get out of.
I don't really think we want it to continue to be the default.

            regards, tom lane



Re: Statistics Import and Export

От
Jeff Davis
Дата:
On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote:
> Loading data without stats, and hoping
> that auto-analyze will catch up sooner not later, is exactly the
> current behavior that we're doing all this work to get out of.

That's the disconnect, I think. For me, the main reason I'm excited
about this work is as a way to solve the bad-plans-after-upgrade
problem and to repro planner issues outside of production. Avoiding the
need to ANALYZE at the end of a data load is also a nice convenience,
but not a primary driver (for me).

Should we just itemize some common use cases for pg_dump, and then
choose the defaults that are least likely to cause surprise?

As for the section, I'm not sure what to do about that. Based on this
thread it seems that SECTION_NONE (or a SECTION_STATS?) is easiest to
implement, but I don't understand the long-term consequences of that
choice.

Regards,
    Jeff Davis




Re: Statistics Import and Export

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote:
>> Loading data without stats, and hoping
>> that auto-analyze will catch up sooner not later, is exactly the
>> current behavior that we're doing all this work to get out of.

> That's the disconnect, I think. For me, the main reason I'm excited
> about this work is as a way to solve the bad-plans-after-upgrade
> problem and to repro planner issues outside of production. Avoiding the
> need to ANALYZE at the end of a data load is also a nice convenience,
> but not a primary driver (for me).

Oh, I don't doubt that there are use-cases for dumping stats without
data.  I'm just dubious about the reverse.  I think data+stats should
be the default, even if only because pg_dump's default has always
been to dump everything.  Then there should be a way to get stats
only, and maybe a way to get data only.  Maybe this does argue for a
four-section definition, despite the ensuing churn in the pg_dump API.

> Should we just itemize some common use cases for pg_dump, and then
> choose the defaults that are least likely to cause surprise?

Per above, I don't find any difficulty in deciding what should be the
default.  What I think we need to consider is what the pg_dump and
pg_restore switch sets should be.  There's certainly a few different
ways we could present that; maybe we should sketch out the details for
a couple of ways.

            regards, tom lane



Re: Statistics Import and Export

От
Matthias van de Meent
Дата:
On Tue, 23 Apr 2024, 05:52 Tom Lane, <tgl@sss.pgh.pa.us> wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > On Mon, 2024-04-22 at 16:19 -0400, Tom Lane wrote:
> >> Loading data without stats, and hoping
> >> that auto-analyze will catch up sooner not later, is exactly the
> >> current behavior that we're doing all this work to get out of.
>
> > That's the disconnect, I think. For me, the main reason I'm excited
> > about this work is as a way to solve the bad-plans-after-upgrade
> > problem and to repro planner issues outside of production. Avoiding the
> > need to ANALYZE at the end of a data load is also a nice convenience,
> > but not a primary driver (for me).
>
> Oh, I don't doubt that there are use-cases for dumping stats without
> data.  I'm just dubious about the reverse.  I think data+stats should
> be the default, even if only because pg_dump's default has always
> been to dump everything.  Then there should be a way to get stats
> only, and maybe a way to get data only.  Maybe this does argue for a
> four-section definition, despite the ensuing churn in the pg_dump API.

I've heard of use cases where dumping stats without data would help
with production database planner debugging on a non-prod system.

Sure, some planner inputs would have to be taken into account too, but
having an exact copy of production stats is at least a start and can
help build models and alerts for what'll happen when the tables grow
larger with the current stats.

As for other planner inputs: table size is relatively easy to shim
with sparse files; cumulative statistics can be copied from a donor
replica if needed, and btree indexes really only need to
contain their highest and lowest values (and need their height set
correctly).

Kind regards,

Matthias van de Meent



Re: Statistics Import and Export

От
Corey Huinker
Дата:
I've heard of use cases where dumping stats without data would help
with production database planner debugging on a non-prod system.


So far, I'm seeing these use cases:

1. Binary upgrade. (schema: on, data: off, stats: on)
2. Dump to file/dir and restore elsewhere. (schema: on, data: on, stats: on)
3. Dump stats for one or more objects, either to directly apply those stats to a remote database, or to allow a developer to edit/experiment with those stats. (schema: off, data: off, stats: on)
4. restore situations where stats are not wanted and/or not trusted (whatever: on, stats: off)

Case #1 is handled via pg_upgrade and special case flags in pg_dump.
Case #2 uses the default pg_dump options, so that's covered.
Case #3 would require a --statistics-only option mutually exclusive with --data-only and --schema-only. Alternatively, I could reanimate the script pg_export_statistics, but we'd end up duplicating a lot of filtering options that pg_dump already has solved. Similarly, we may want server-side functions that generate the statements for us (pg_get_*_stats paired with each pg_set_*_stats).
Case #4 is handled via --no-statistics.


Attached is v19, which attempts to put table stats in SECTION_DATA and matview/index stats in SECTION_POST_DATA. It's still failing one TAP test (004_pg_dump_parallel: parallel restore as inserts). I'm still unclear as to why using SECTION_NONE is a bad idea, but I'm willing to go along with DATA/POST_DATA, assuming we can make it work.

 
Вложения

Re: Statistics Import and Export

От
Bruce Momjian
Дата:
On Tue, Apr 23, 2024 at 06:33:48PM +0200, Matthias van de Meent wrote:
> I've heard of use cases where dumping stats without data would help
> with production database planner debugging on a non-prod system.
> 
> Sure, some planner inputs would have to be taken into account too, but
> having an exact copy of production stats is at least a start and can
> help build models and alerts for what'll happen when the tables grow
> larger with the current stats.
> 
> As for other planner inputs: table size is relatively easy to shim
> with sparse files; cumulative statistics can be copied from a donor
> replica if needed, and btree indexes only really really need to
> contain their highest and lowest values (and need their height set
> correctly).

Is it possible to prevent stats from being updated by autovacuum and
other methods?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Statistics Import and Export

От
Matthias van de Meent
Дата:
On Wed, 24 Apr 2024 at 21:31, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Apr 23, 2024 at 06:33:48PM +0200, Matthias van de Meent wrote:
> > I've heard of use cases where dumping stats without data would help
> > with production database planner debugging on a non-prod system.
> >
> > Sure, some planner inputs would have to be taken into account too, but
> > having an exact copy of production stats is at least a start and can
> > help build models and alerts for what'll happen when the tables grow
> > larger with the current stats.
> >
> > As for other planner inputs: table size is relatively easy to shim
> > with sparse files; cumulative statistics can be copied from a donor
> > replica if needed, and btree indexes only really really need to
> > contain their highest and lowest values (and need their height set
> > correctly).
>
> Is it possible to prevent stats from being updated by autovacuum

You can set autovacuum_analyze_threshold and *_scale_factor to
excessively high values, which has the effect of disabling autoanalyze
until it has had similarly excessive tuple churn. But that won't
guarantee autoanalyze won't run; that guarantee only exists with
autovacuum = off.
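
Roughly, with illustrative numbers (and, as said, still no hard guarantee
short of turning autovacuum off entirely):

ALTER SYSTEM SET autovacuum_analyze_scale_factor = 100;
ALTER SYSTEM SET autovacuum_analyze_threshold = 2000000000;
SELECT pg_reload_conf();

-- the only real guarantee:
ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();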

> and other methods?

No nice ways. AFAIK there is no command (or command sequence) that can
"disable" only ANALYZE and which also guarantee statistics won't be
updated until ANALYZE is manually "re-enabled" for that table. An
extension could maybe do this, but I'm not aware of any extension
points where this would hook into PostgreSQL in a nice way.

You can limit maintenance access on the table to only trusted roles
that you know won't go in and run ANALYZE for those tables, or even
only your superuser (so only they can run ANALYZE, and have them
promise they won't). Alternatively, you can also constantly keep a
lock on the table that conflicts with ANALYZE. The last few are just
workarounds though, and not something I'd suggest running on a
production database.

Kind regards,

Matthias van de Meent



Re: Statistics Import and Export

From
Corey Huinker
Date:

> You can set autovacuum_analyze_threshold and *_scale_factor to
> excessively high values, which has the effect of disabling autoanalyze
> until it has had similarly excessive tuple churn. But that won't
> guarantee autoanalyze won't run; that guarantee only exists with
> autovacuum = off.

I'd be a bit afraid to set those values so high, for fear that they wouldn't get reset when normal operations resumed, and that nobody would notice until things got bad.

v20 is attached. It resolves the dependency issue in v19, so while I'm still unclear as to why we want it this way vs the simplicity of SECTION_NONE, I'm going to roll with it.

The next question is how to handle --statistics-only or an equivalent. The option would be mutually exclusive with --schema-only and --data-only, and it would be mildly incongruous if it didn't have a short option like the others, so I'm suggesting -P for Probability / Percentile / ρ (correlation) / etc.

One wrinkle with having three mutually exclusive options instead of two is that the existing code could assume that a single option being true was enough to bail out of certain dumpXYZ() functions; now those tests have to check two options. That makes me think we should add three new DumpOptions that are the non-exclusive positives (yesSchema, yesData, yesStats) and set those in addition to the schemaOnly, dataOnly, and statsOnly flags. Thoughts?
Attachments

Re: Statistics Import and Export

From
Corey Huinker
Date:

> The next question is how to handle --statistics-only or an equivalent. The option would be mutually exclusive with --schema-only and --data-only, and it would be mildly incongruous if it didn't have a short option like the others, so I'm suggesting -P for Probability / Percentile / ρ (correlation) / etc.
>
> One wrinkle with having three mutually exclusive options instead of two is that the existing code could assume that a single option being true was enough to bail out of certain dumpXYZ() functions; now those tests have to check two options. That makes me think we should add three new DumpOptions that are the non-exclusive positives (yesSchema, yesData, yesStats) and set those in addition to the schemaOnly, dataOnly, and statsOnly flags. Thoughts?

v21 attached.

0001 is the same.

0002 is a preparatory change to pg_dump introducing the DumpOption/RestoreOption variables dumpSchema and dumpData. The current code makes heavy use of the fact that schemaOnly and dataOnly are mutually exclusive and logically opposite. That will not be the case once statisticsOnly is introduced, so I decided to add new variables whose values are derived entirely from the existing command flags but which resolve the complexities of those interactions in one spot, since those complexities are about to jump with the new options.
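
To illustrate (this is only a sketch with invented names, not the actual 0002 code), the idea is to resolve the exclusive switches into positive flags once, right after option parsing, so each dump function tests a single flag rather than reasoning about which "*Only" combinations exclude it:

    #include <stdbool.h>

    /* Hypothetical stand-in for pg_dump's real options struct. */
    typedef struct SketchDumpOptions
    {
        bool schemaOnly;        /* --schema-only */
        bool dataOnly;          /* --data-only */
        bool statisticsOnly;    /* --statistics-only (proposed) */
        bool no_statistics;     /* --no-statistics */

        /* derived, non-exclusive positives */
        bool dumpSchema;
        bool dumpData;
        bool dumpStatistics;
    } SketchDumpOptions;

    /*
     * Resolve the mutually exclusive "*Only" switches in one place: a
     * section is dumped unless some other section was requested
     * exclusively, or it was suppressed outright.
     */
    static void
    resolve_dump_sections(SketchDumpOptions *dopt)
    {
        dopt->dumpSchema = !dopt->dataOnly && !dopt->statisticsOnly;
        dopt->dumpData = !dopt->schemaOnly && !dopt->statisticsOnly;
        dopt->dumpStatistics = !dopt->schemaOnly && !dopt->dataOnly &&
                               !dopt->no_statistics;
    }

With that in place, a dumpXYZ() function only needs a check like "if (!dopt->dumpData) return;", no matter how many exclusive modes exist.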

0003 contains the statistics changes to pg_dump, adding the options -X / --statistics-only and the derived boolean statisticsOnly. The -P option is already used by pg_restore, so I chose -X instead: its passing resemblance to chi, as in the chi-square test, makes it vaguely statistics-ish. If someone has a better letter, I'm listening.

With that change, people should be able to use pg_dump -X --table=foo to dump existing stats for a table and its dependent indexes, and then tweak those calls to do tuning work. Have fun with it. If this becomes a common use-case then it may make sense to get functions to fetch relation/attribute stats for a given relation, either as a formed SQL statement or as the parameter values.
 
Attachments

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Mon, 2024-05-06 at 23:43 -0400, Corey Huinker wrote:
>
> v21 attached.
>
> 0003 is the statistics changes to pg_dump, adding the options -X / --
> statistics-only, and the derivative boolean statisticsOnly. The -P
> option is already used by pg_restore, so instead I chose -X because
> of the passing resemblance to Chi as in the chi-square statistics
> test makes it vaguely statistics-ish. If someone has a better letter,
> I'm listening.
>
> With that change, people should be able to use pg_dump -X --table=foo
> to dump existing stats for a table and its dependent indexes, and
> then tweak those calls to do tuning work. Have fun with it. If this
> becomes a common use-case then it may make sense to get functions to
> fetch relation/attribute stats for a given relation, either as a
> formed SQL statement or as the parameter values.

Can you explain what you did with the
SECTION_NONE/SECTION_DATA/SECTION_POST_DATA over v19-v21 and why?

Regards,
    Jeff Davis




Re: Statistics Import and Export

From
Corey Huinker
Date:
> Can you explain what you did with the
> SECTION_NONE/SECTION_DATA/SECTION_POST_DATA over v19-v21 and why?

Initially, I got things to work by having statistics import behave like COMMENTs, which meant that they were run immediately after the table/matview/index/constraint that created the pg_class/pg_attribute entries, but they could be suppressed with a --noX flag.

Per previous comments, it was suggested by others that:

- having them in SECTION_NONE was a grave mistake
- Everything that could belong in SECTION_DATA should, and the rest should be in SECTION_POST_DATA
- This would almost certainly require the statistics import commands to be TOC objects (one object per pg_class entry, not one object per function call)

Turning them into TOC objects was a multi-phase process.

1. The TOC entries are generated with dependencies (the parent pg_class object as well as the potential unique/pk constraint in the case of indexes), but no statements are generated (in case the stats are filtered out or the parent object is filtered out). This TOC entry must have everything we'll need to later generate the function calls. So far, that information is the parent name, parent schema, and relkind of the parent object.

2. The TOC entries get sorted by dependencies, and additional dependencies are added which enforce the PRE/DATA/POST boundaries. This is where knowing the parent object's relkind is required, as that determines the DATA/POST section (a simplified sketch follows after this list).

3. Now the TOC entry is able to stand on its own and generate the statements if it survives the dump/restore filters. Most of the later versions of the patch were efforts to get the objects to fall into the right PRE/DATA/POST sections, and the central bug was that the dependencies passed into ARCHIVE_OPTS were incorrect: the object passed in as the dependency was now the new TOC object itself, not the parent TOC object. Once that was resolved, things fell into place.
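
As a rough, standalone illustration of phases 1 and 2 (the struct and names below are invented for the sketch, not pg_dump's actual ones): the stats entry records its parent's identity and relkind, the relkind picks the DATA/POST_DATA section, and the dependency recorded for the entry must be the parent's dump id, not the stats entry's own:

    typedef enum
    {
        SKETCH_SECTION_DATA,
        SKETCH_SECTION_POST_DATA
    } SketchSection;

    /* Everything the stats TOC entry needs in order to generate its
     * statements later, after filtering and sorting. */
    typedef struct SketchStatsEntry
    {
        const char   *parent_schema;   /* schema of the parent pg_class object */
        const char   *parent_name;     /* name of the parent pg_class object */
        char          parent_relkind;  /* 'r' table, 'm' matview, 'i' index, ... */
        int           parent_dumpid;   /* dependency: the parent's TOC id */
        SketchSection section;
    } SketchStatsEntry;

    /* Phase 2: the parent's relkind decides which section the stats land in. */
    static SketchSection
    stats_section_for(char parent_relkind)
    {
        /* Table stats can travel with the data; matview and index stats only
         * make sense once the post-data steps have built those objects. */
        return (parent_relkind == 'r') ? SKETCH_SECTION_DATA
                                       : SKETCH_SECTION_POST_DATA;
    }

In these terms, the bug described above amounts to handing ARCHIVE_OPTS the new entry's own id as the dependency rather than parent_dumpid.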
 

Re: Statistics Import and Export

From
Jeff Davis
Date:
On Thu, 2024-05-16 at 05:25 -0400, Corey Huinker wrote:
>
> Per previous comments, it was suggested by others that:
>
> - having them in SECTION_NONE was a grave mistake
> - Everything that could belong in SECTION_DATA should, and the rest
> should be in SECTION_POST_DATA

I don't understand the gravity of the choice here: what am I missing?

To be clear: I'm not arguing against it, but I'd like to understand it
better. Perhaps it has to do with the relationship between the sections
and the dependencies?

Regards,
    Jeff Davis