Re: Statistics Import and Export

From: Corey Huinker
Subject: Re: Statistics Import and Export
Date:
Msg-id: CADkLM=dracL=rSf+2v+1P4NN_4jq--kyv6Ywh77t4MiiRX98og@mail.gmail.com
In reply to: Re: Statistics Import and Export  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Statistics Import and Export  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

> I concur with the plan of extracting data from pg_stats not
> pg_statistics, and with emitting a single "set statistics"
> call per attribute.  (I think at one point I'd suggested a call
> per stakind slot, but that would lead to a bunch of UPDATEs on
> existing pg_attribute tuples and hence a bunch of dead tuples
> at the end of an import, so it's not the way to go.  A series
> of UPDATEs would likely also play poorly with a background
> auto-ANALYZE happening concurrently.)

That was my reasoning as well.

 
> I do not like the current design for pg_set_attribute_stats' API
> though: I don't think it's at all future-proof.  What happens when
> somebody adds a new stakind (and hence new pg_stats column)?
> You could try to add an overloaded pg_set_attribute_stats
> version with more parameters, but I'm pretty sure that would
> lead to "ambiguous function call" failures when trying to load
> old dump files containing only the original parameters.

I don't think we'd overload; we'd just add new parameters to the existing function signature.
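Roughly like this (toy function and stat names, not the actual API): a defaulted trailing parameter keeps calls from older dumps resolving to the same function, with no second overload in sight.

CREATE FUNCTION set_stats_demo(rel regclass, att name,
                               null_frac real, avg_width integer,
                               new_stat real DEFAULT NULL)
RETURNS void LANGUAGE plpgsql
AS $$ BEGIN NULL; END $$;  -- body irrelevant to the resolution question

-- A call written against the old four-parameter form still resolves,
-- because the new trailing parameter is filled in by its default:
SELECT set_stats_demo('pg_class', 'relname', 0.0::real, 64);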
 
> The
> present design is also fragile in that an unrecognized parameter
> will lead to a parse-time failure and no function call happening,
> which is less robust than I'd like.

There was a lot of back-and-forth about what sorts of failures were error-worthy, and which were warn-worthy. I'll discuss further below.
 
> As lesser points,
> the relation argument ought to be declared regclass not oid for
> convenience of use,

+1
 
> and I really think that we need to provide
> the source server's major version number --- maybe we will never
> need that, but if we do and we don't have it we will be sad.

The JSON format had it, and I never did use it. I'm not against having it again.
 

> So this leads me to suggest that we'd be best off with a VARIADIC
> ANY signature, where the variadic part consists of alternating
> parameter labels and values:
>
> pg_set_attribute_stats(table regclass, attribute name,
>                        inherited bool, source_version int,
>                        variadic "any") returns void
>
> where a call might look like
>
> SELECT pg_set_attribute_stats('public.mytable', 'mycolumn',
>                               false, -- not inherited
>                               16, -- source server major version
>                               -- pairs of labels and values follow
>                               'null_frac', 0.4,
>                               'avg_width', 42,
>                               'histogram_bounds',
>                               array['a', 'b', 'c']::text[],
>                               ...);

> Note a couple of useful things here:
>
> * AFAICS we could label the function strict and remove all those ad-hoc
> null checks.  If you don't have a value for a particular stat, you
> just leave that pair of arguments out (exactly as the existing code
> in 0002 does, just using a different notation).  This also means that
> we don't need any default arguments and so no need for hackery in
> system_functions.sql.

I'm not sure how strict works with variadics. Would the lack of any variadic parameters trigger it?

Also going with strict means that an inadvertent explicit NULL in one parameter would cause the entire attribute import to fail silently. I'd rather fail loudly.
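For my own reference, here's a throwaway probe of how STRICT behaves around a variadic parameter. It has to use VARIADIC text[] rather than VARIADIC "any" (the latter is only available to C-language functions), so it only approximates the proposed signature:

CREATE FUNCTION strict_variadic_probe(fixed integer, VARIADIC rest text[])
RETURNS text STRICT LANGUAGE sql
AS $$ SELECT fixed::text || ':' || array_to_string(rest, ',', '<null>') $$;

SELECT strict_variadic_probe(1, 'a', 'b');              -- ordinary call
SELECT strict_variadic_probe(1, 'a', NULL);             -- NULL element: the array itself is
                                                        -- non-NULL, so STRICT doesn't fire
SELECT strict_variadic_probe(1, VARIADIC NULL::text[]); -- whole array NULL: STRICT returns NULL
SELECT strict_variadic_probe(1);                        -- zero variadic args: probes whether
                                                        -- this even resolves, which is the
                                                        -- question above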

 
> * If we don't recognize a parameter label at runtime, we can treat
> that as a warning rather than a hard error, and press on.  This case
> would mostly be useful in major version downgrades I suppose, but
> that will be something people will want eventually.

Interesting.
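To make the downgrade case concrete, a sketch under the proposed signature (the 'future_stat' label and version number are invented):

SELECT pg_set_attribute_stats('public.mytable', 'mycolumn',
                              false,  -- not inherited
                              18,     -- source server major version (hypothetical)
                              'null_frac', 0.4,
                              'future_stat', 0.25);  -- unknown label: WARNING, press on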

> * We can require the calling statement to cast arguments, particularly
> arrays, to the proper type, removing the need for conversions within
> the stats-setting function.  (But instead, it'd need to check that the
> next "any" argument is the type it ought to be based on the type of
> the target column.)

So, that's tricky. The type of the values is not always the attribute type. For expression indexes we call exprType() and exprCollation(); in that case we always use the expression's type over the attribute type, but only use the expression's collation if the attribute has no collation of its own. This mimics the behavior of ANALYZE.

Then, for the MCELEM and DECHIST stakinds we have to find the type's element type, and that has special logic for tsvectors, ranges, and other non-scalars, borrowing from the various *_typanalyze() functions. For that matter, the existing typanalyze functions don't grab the < operator, which I need for later data validations, so using examine_attribute() was simultaneously overkill and insufficient.

None of this functionality is accessible from a client program, so we'd have to pull a lot of backend code into pg_dump to make it resolve the typecasts correctly. Text and array_in() were just easier.
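A made-up example of the expression-index wrinkle:

CREATE TABLE t (txt text);
CREATE INDEX t_len_idx ON t ((length(txt)));  -- indexed expression is integer, column is text
INSERT INTO t SELECT repeat('x', i % 20) FROM generate_series(1, 1000) AS i;
ANALYZE t;
-- pg_stats rows where tablename = 't_len_idx' now carry integer values, so
-- restoring its stats arrays would mean casting to integer[], not text[].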
 
> pg_set_relation_stats is simpler in that the set of stats values
> to be set will probably remain fairly static, and there seems little
> reason to allow only part of them to be supplied (so personally I'd
> drop the business about accepting nulls there too).  If we do grow
> another value or values for it to set there shouldn't be much problem
> with overloading it with another version with more arguments.
> Still needs to take regclass not oid though ...

I'm still iffy about the silent failures of strict. 

I looked it up, and the only change needed to go from oid to regclass is in pg_proc.dat (and the docs, of course), so I'm already on board.
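The call-site difference, with placeholder values (sketch):

-- regclass parameter: a plain qualified name is enough
SELECT pg_set_relation_stats('public.mytable', 10, 1000::real, 10);
-- oid parameter: the caller has to spell out the lookup
SELECT pg_set_relation_stats('public.mytable'::regclass::oid, 10, 1000::real, 10);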
 
> * why is check_relation_permissions looking up the pg_class row?
> There's already a copy of that in the Relation struct.  Likewise
> for the other caller of can_modify_relation (but why is that
> caller not using check_relation_permissions?)  That all looks
> overly complicated and duplicative.  I think you don't need two
> layers of function there.

To prove that the caller is the owner (or better) of the table.
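In other words, the intent is something like this (sketch of the intended behavior, made-up role and table names, not a test from the patch):

CREATE ROLE stats_bystander;
SET ROLE stats_bystander;
SELECT pg_set_relation_stats('public.mytable', 10, 1000::real, 10);  -- expected: ERROR,
                                                                     -- caller isn't the owner
RESET ROLE;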
  

> * The array manipulations seem to me to be mostly not well chosen.
> There's no reason to use expanded arrays here, since you won't be
> modifying the arrays in-place; all that's doing is wasting memory.
> I'm also noting a lack of defenses against nulls in the arrays.

Easily remedied in light of the deconstruct_array() suggestion below, but I do want to add that value_not_null_array_len() does check for nulls, and that function is used to generate all but one of the arrays (and for that one we're just verifying that its length matches the length of the other array). There's even a regression test that checks it (search for "elem_count_histogram null element").
 
> I'd suggest using deconstruct_array to disassemble the arrays,
> if indeed they need disassembled at all.  (Maybe they don't, see
> next item.)

+1 
 

> * I'm dubious that we can fully vet the contents of these arrays,
> and even a little dubious that we need to try.  As an example,
> what's the worst that's going to happen if a histogram array isn't
> sorted precisely?  You might get bogus selectivity estimates
> from the planner, but that's no worse than you would've got with
> no stats at all.  (It used to be that selfuncs.c would use a
> histogram even if its contents didn't match the query's collation.
> The comments justifying that seem to be gone, but I think it's
> still the case that the code isn't *really* dependent on the sort
> order being exactly so.)  The amount of hastily-written code in the
> patch for checking this seems a bit scary, and it's well within the
> realm of possibility that it introduces more bugs than it prevents.
> We do need to verify data types, lack of nulls, and maybe
> 1-dimensional-ness, which could break the accessing code at a fairly
> low level; but I'm not sure that we need more than that.

A lot of the feedback I got on this patch over the months concerned giving inaccurate, nonsensical, or malicious data to the planner. Surely the planner does *some* defensive programming when fetching these values, but this is the first time those values could have been set by a user rather than by our own internal code. We can try to match types, collations, etc. from source to destination, but even that would fall victim to another glibc-level collation change. Verifying that the list the source system said was sorted actually is sorted when placed on the destination system is the truest test we're ever going to get, albeit only for the sampled elements.
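As an aside, this is the sort of ad-hoc check a user could run on the destination for a text column after an import (illustration only, not what the patch's validation code does):

SELECT tablename, attname,
       histogram_bounds::text::text[] =
         (SELECT array_agg(b ORDER BY b)
            FROM unnest(histogram_bounds::text::text[]) AS b) AS is_sorted
FROM pg_stats
WHERE tablename = 'mytable' AND attname = 'mycolumn';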
 

> * There's a lot of ERROR cases that maybe we ought to downgrade
> to WARN-and-press-on, in the service of not breaking the restore
> completely in case of trouble.

All cases were made errors precisely to spark debate about which cases we'd want to continue from and which we'd want to error on. Also, I was under the impression that it was bad form to follow a NOTICE/WARNING with an ERROR in the same function call.

 
> * 0002 is confused about whether the tag for these new TOC
> entries is "STATISTICS" or "STATISTICS DATA".  I also think
> they need to be in SECTION_DATA not SECTION_NONE, and I'd be
> inclined to make them dependent on the table data objects
> not the table declarations.  We don't really want a parallel
> restore to load them before the data is loaded: that just
> increases the risk of bad interactions with concurrent
> auto-analyze.

SECTION_NONE works best, but we're getting some situations where relpages/reltuples/relallvisible get reset to zero in pg_class. Hence the temporary --no-statistics in the pg_upgrade TAP test.

SECTION_POST_DATA (a previous suggestion) causes something weird where certain GRANT/REVOKEs end up outside of their expected section.

In work I've done since v15, I tried giving the table stats archive entry a dependency on every index (and index constraint) as well as the table itself, thinking that would get us past all resets of pg_class, but it hasn't worked.
 
> * It'd definitely not be OK to put BEGIN/COMMIT into the commands
> in these TOC entries.  But I don't think we need to.

Agreed. We don't need to; each function call now sinks or swims on its own.
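As I understand it, a statistics TOC entry then boils down to a run of independent calls with no transaction control around them, along these lines (placeholder values):

SELECT pg_set_relation_stats('public.mytable', 10, 1000::real, 10);
SELECT pg_set_attribute_stats('public.mytable', 'mycolumn', false, 16,
                              'null_frac', 0.4, 'avg_width', 42);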
 

> * dumpRelationStats seems to be dumping the relation-level
> stats twice.

+1 

> * Why exactly are you suppressing testing of statistics upgrade
> in 002_pg_upgrade??

Temporary. Related to the pg_class overwrite issue above.
