Re: Statistics Import and Export

Поиск

Список

Период

Сортировка

От	Corey Huinker
Тема	Re: Statistics Import and Export
Дата	28 декабря 2023 г. 02:41:31
Msg-id	CADkLM=e4Uo1kTDzzUfHNzOyfz80PfJpnFurEsM7spoH9=zdrYA@mail.gmail.com обсуждение исходный текст
Ответ на	Re: Statistics Import and Export (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Ответы	Re: Statistics Import and Export
Список	pgsql-hackers

Дерево обсуждения

On Mon, Dec 25, 2023 at 8:18 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

I finally had time to look at the last version of the patch, so here's a
couple thoughts and questions in somewhat random order. Please take this
as a bit of a brainstorming and push back if you disagree some of my
comments.

In general, I like the goal of this patch - not having statistics is a
common issue after an upgrade, and people sometimes don't even realize
they need to run analyze. So, it's definitely worth improving.

I'm not entirely sure about the other use case - allowing people to
tweak optimizer statistics on a running cluster, to see what would be
the plan in that case. Or more precisely - I agree that would be an
interesting and useful feature, but maybe the interface should not be
the same as for the binary upgrade use case?

interfaces
----------

When I thought about the ability to dump/load statistics in the past, I
usually envisioned some sort of DDL that would do the export and import.
So for example we'd have EXPORT STATISTICS / IMPORT STATISTICS commands,
or something like that, and that'd do all the work. This would mean
stats are "first-class citizens" and it'd be fairly straightforward to
add this into pg_dump, for example. Or at least I think so ...

Alternatively we could have the usual "functional" interface, with a
functions to export/import statistics, replacing the DDL commands.

Unfortunately, none of this works for the pg_upgrade use case, because
existing cluster versions would not support this new interface, of
course. That's a significant flaw, as it'd make this useful only for
upgrades of future versions.

This was the reason I settled on the interface that I did: while we can create whatever interface we want for importing the statistics, we would need to be able to extract stats from databases using only the facilities available in those same databases, and then store that in a medium that could be conveyed across databases, either by text files or by saving them off in a side table prior to upgrade. JSONB met the criteria.

So I think for the pg_upgrade use case, we don't have much choice other
than using "custom" export through a view, which is what the patch does.

However, for the other use case (tweaking optimizer stats) this is not
really an issue - that always happens on the same instance, so no issue
with not having the "export" function and so on. I'd bet there are more
convenient ways to do this than using the export view. I'm sure it could
share a lot of the infrastructure, ofc.

So, there is a third use case - foreign data wrappers. When analyzing a foreign table, at least one in the postgresql_fdw family of foreign servers, we should be able to send a query specific to the version and dialect of that server, get back the JSONB, and import those results. That use case may be more tangible to you than the tweak/tuning case.

JSON format
-----------

As for the JSON format, I wonder if we need that at all? Isn't that an
unnecessary layer of indirection? Couldn't we simply dump pg_statistic
and pg_statistic_ext_data in CSV, or something like that? The amount of
new JSONB code seems to be very small, so it's OK I guess.

I see a few problems with dumping pg_statistic[_ext_data]. The first is that the importer now has to understand all of the past formats of those two tables. The next is that the tables are chock full of Oids that don't necessarily carry forward. I could see us having a text-ified version of those two tables, but we'd need that for all previous iterations of those table formats. Instead, I put the burden on the stats export to de-oid the data and make it *_in() function friendly.

That's pretty much why I envisioned a format "grouping" the arrays for a
particular type of statistics (MCV, histogram) into the same object, as
for example in

{
"mcv" : {"values" : [...], "frequencies" : [...]}
"histogram" : {"bounds" : [...]}
}

I agree that would be a lot more readable, and probably a lot more debuggable. But I went into this unsure if there could be more than one stats slot of a given kind per table. Knowing that they must be unique helps.

But that's probably much harder to generate from plain SQL (at least I
think so, I haven't tried).

I think it would be harder, but far from impossible.

data missing in the export
--------------------------

I think the data needs to include more information. Maybe not for the
pg_upgrade use case, where it's mostly guaranteed not to change, but for
the "manual tweak" use case it can change. And I don't think we want two
different formats - we want one, working for everything.

I"m not against this at all, and I started out doing that, but the qualified names of operators got _ugly_, and I quickly realized that what I was generating wouldn't matter, either the input data would make sense for the attribute's stats or it would fail trying.

Consider for example about the staopN and stacollN fields - if we clone
the stats from one table to the other, and the table uses different
collations, will that still work? Similarly, I think we should include
the type of each column, because it's absolutely not guaranteed the
import function will fail if the type changes. For example, if the type
changes from integer to text, that will work, but the ordering will
absolutely not be the same. And so on.

I can see including the type of the column, that's a lot cleaner than the operator names for sure, and I can see us rejecting stats or sections of stats in certain situations. Like in your example, if the collation changed, then reject all "<" op stats but keep the "=" ones.

For the extended statistics export, I think we need to include also the
attribute names and expressions, because these can be different between
the statistics. And not only that - the statistics values reference the
attributes by positions, but if the two tables have the attributes in a
different order (when ordered by attnum), that will break stuff.

Correct me if I'm wrong, but I thought expression parse trees change _a lot_ from version to version?

Attribute reordering is a definite vulnerability of the current implementation, so an attribute name export might be a way to mitigate that.

* making sure the frequencies in MCV lists are not obviously wrong
(outside [0,1], sum exceeding > 1.0, etc.)

* cross-checking that stanumbers/stavalues make sense (e.g. that MCV has
both arrays while histogram has only stavalues, that the arrays have
the same length for MCV, etc.)

To this end, there's an edge-case hack in the code where I have to derive the array elemtype. I had thought that examine_attribute() or std_typanalyze() was going to do that for me, but it didn't. Very much want your input there.

* checking there are no duplicate stakind values (e.g. two MCV lists)

Per previous comment, it's good to learn these restrictions.

Not sure if all the checks need to be regular elog(ERROR), perhaps some
could/should be just asserts.

For this first pass, all errors were one-size fits all, safe for the WARNING vs ERROR.

minor questions
---------------

1) Should the views be called pg_statistic_export or pg_stats_export?
Perhaps pg_stats_export is better, because the format is meant to be
human-readable (rather than 100% internal).

I have no opinion on what the best name would be, and will go with consensus.

2) It's not very clear what "non-transactional update" of pg_class
fields actually means. Does that mean we update the fields in-place,
can't be rolled back, is not subject to MVCC or what? I suspect users
won't know unless the docs say that explicitly.

Correct. Cannot be rolled back, not subject to MVCC.

3) The "statistics.c" code should really document the JSON structure. Or
maybe if we plan to use this for other purposes, it should be documented
in the SGML?

I agree, but I also didn't expect the format to survive first contact with reviewers, so I held back.

4) Why do we need the separate "replaced" flags in import_stakinds? Can
it happen that collreplaces/opreplaces differ from kindreplaces?

That was initially done to maximize the amount of code that could be copied from do_analyze(). In retrospect, I like how extended statistics just deletes all the pg_statistic_ext_data rows and replaces them and I would like to do the same for pg_statistic before this is all done.

5) What happens in we import statistics for a table that already has
some statistics? Will this discard the existing statistics, or will this
merge them somehow? (I think we should always discard the existing
stats, and keep only the new version.)

In the case of pg_statistic_ext_data, the stats are thrown out and replaced by the imported ones.

In the case of pg_statistic, it's basically an upsert, and any values that were missing in the JSON are not updated on the existing row. That's appealing in a tweak situation where you want to only alter one or two bits of a stat, but not really useful in other situations. Per previous comment, I'd prefer a clean slate and forcing tweaking use cases to fill in all the blanks.

6) What happens if we import extended stats with mismatching definition?
For example, what if the "new" statistics object does not have "mcv"
enabled, but the imported data do include MCV? What if the statistics do
have the same number of "dimensions" but not the same number of columns
and expressions?

The importer is currently driven by the types of stats to be expected for that pg_attribute/pg_statistic_ext. It only looks for things that are possible for that stat type, and any extra JSON values are ignored.

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Statistics Import and Export