Re: Import Statistics in postgres_fdw before resorting to sampling.

Поиск
Список
Период
Сортировка
От Corey Huinker
Тема Re: Import Statistics in postgres_fdw before resorting to sampling.
Дата
Msg-id CADkLM=dSJk8h+14Wi7dGGcO1168R_nCGO2EHRTcC6b2Zu0rhpg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Import Statistics in postgres_fdw before resorting to sampling.  (Michael Paquier <michael@paquier.xyz>)
Список pgsql-hackers
I am a bit uncomfortable regarding the design you are using here,
where the ImportStatistics callback, if defined, takes priority over
the existing AnalyzeForeignTable, especially regarding the fact that
both callbacks attempt to retrieve the same data, except that the
existing callback has a different idea of the timing to use to
retrieve reltuples and relpages.  The original callback
AnalyzeForeignTable retrieves immediately the total number of pages
via SQL, to feed ANALYZE.  reltuples is then fed to ANALYZE from a
callback function that that's defined in AnalyzeForeignTable().

They don't try to retrieve the same data. AnalyzeForeignTable tries to retrieve a table sample which it feeds into the normal ANALYZE process. If that sample is going to be any good, it has to be a lot of rows, that that's a lot of network traffic.

ImportStatistics just grabs the results that ANALYZE computed for the remote table, using a far better sample than we'd want to pull across the wire.

 
My point is: rather than trying to implement a second solution with a
new callback, shouldn't we try to rework the existing callback so as
it would fit better with what you are trying to do here: feed data
that ANALYZE would then be in charge of inserting?

To do that, we would have to somehow generate fake data locally from the pg_stats data that we did pull over the wire, just to have ANALYZE compute it back down to the data we already had.
 
  Relying on
pg_restore_relation_stats() for the job feels incorrect to me knowing
that ANALYZE is the code path in charge of updating the stats of a
relation. 

That sounds like an argument for expanding ANALYZE to have syntax that will digest pre-computed rows, essentially taking over what pg_restore_relation_stats and pg_restore_attribute_stats already do, and that idea was dismissed fairly early on in development, though I wasn't against it at the time.

As it is, those two functions were developed to match what ANALYZE does. pg_restore_relation_stats even briefly had inplace updates because that's what ANALYZE did.

 
The new callback is a shortcut that bypasses what ANALYZE
does, so the problem, at least it seems to me, is that we want to
retrieve all the data in a single step, like your new callback, not in
two steps, something that only the existing callback allows.  Hence,
wouldn't it be more natural to retrieve the total number of pages and
reltuples from one callback, meaning that for local relations we
should delay RelationGetNumberOfBlocks() inside the existing
"acquire_sample_rows" callback (renaming it would make sense)?  This
requires some redesign of AnalyzeForeignTable and the internals of
analyze.c, but it would let FDW extensions know about the potential
efficiency gain.

I wanted to make minimal disruption to the existing callbacks, but that may have been misguided.

One problem I do see, however, is that the queries for fetching relation (pg_class) stats should never fail and always return one row, even if the values returned are meaningless. It's the query against pg_stats that lets us know if 1) the database on the other side is a real-enough postgres and 2) the remote table is itself analyzed. Only once we're happy that we have good attribute stats should we bother with the relation stats. All of this logic is specific to postgres, so it was confined to postgres_fdw. I don't know that other FDWs would be that much different, but minimizing the generic impact was a goal.

I'll look at this again, but I'm not sure I'm going to come up with much different.
 
There is also a performance concern to be made with the patch, but as
it's an ANALYZE path that may be acceptable: if we fail the first
callback, then we may call the second callback.

I think the big win is the network traffic savings.
 

Fujita-san, what do you think?

Very interested to know as well. 

В списке pgsql-hackers по дате отправления: