Discussion: Import Statistics in postgres_fdw before resorting to sampling.
Attached is my current work on adding remote fetching of statistics to postgres_fdw, and opening the possibility of doing so to other foreign data wrappers.
This involves adding two new options to postgres_fdw at the server and table level.
The first option, fetch_stats, defaults to true at both levels. If enabled, it will cause an ANALYZE of a postgres_fdw foreign table to first attempt to fetch relation and attribute statistics from the remote table. If this succeeds, then those statistics are imported into the local foreign table, allowing us to skip a potentially expensive sampling operation.
The second option, remote_analyze, defaults to false at both levels, and only comes into play if the first fetch succeeds but no attribute statistics (i.e. the stats from pg_stats) are found. If enabled then the function will attempt to ANALYZE the remote table, and if that is successful then a second attempt at fetching remote statistics will be made.
If no statistics were fetched, then the operation will fall back to the normal sampling operation per settings.
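The decision flow described above can be sketched roughly as follows. This is only an illustration in Python, not the patch's C code: `analyze_foreign_table`, `fetch_remote`, and `run_remote_analyze` are hypothetical names, and a fetch is assumed to return `None` on failure or a dict whose `attribute_stats` entry may be empty.

```python
from typing import Callable, Optional


def analyze_foreign_table(
    fetch_stats: bool,
    remote_analyze: bool,
    fetch_remote: Callable[[], Optional[dict]],
    run_remote_analyze: Callable[[], bool],
) -> str:
    """Return 'imported' if remote statistics were used, else 'sampled'."""
    if fetch_stats:
        stats = fetch_remote()  # hypothetical: None means the fetch failed
        # remote_analyze only matters when the fetch succeeded
        # but returned no attribute statistics.
        if stats is not None and not stats.get("attribute_stats"):
            if remote_analyze and run_remote_analyze():
                stats = fetch_remote()  # second attempt after remote ANALYZE
        if stats is not None and stats.get("attribute_stats"):
            return "imported"
    # Nothing usable was fetched: fall back to normal sampling.
    return "sampled"
```

For example, with `fetch_stats` enabled and a remote table that has never been analyzed, the result is `'sampled'` unless `remote_analyze` is also enabled and the remote ANALYZE succeeds.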
On Tue, Aug 12, 2025 at 10:33 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>
> Attached is my current work on adding remote fetching of statistics to postgres_fdw, and opening the possibility of doing so to other foreign data wrappers.
>
> This involves adding two new options to postgres_fdw at the server and table level.
>
> The first option, fetch_stats, defaults to true at both levels. If enabled, it will cause an ANALYZE of a postgres_fdw foreign table to first attempt to fetch relation and attribute statistics from the remote table. If this succeeds, then those statistics are imported into the local foreign table, allowing us to skip a potentially expensive sampling operation.
>
> The second option, remote_analyze, defaults to false at both levels, and only comes into play if the first fetch succeeds but no attribute statistics (i.e. the stats from pg_stats) are found. If enabled then the function will attempt to ANALYZE the remote table, and if that is successful then a second attempt at fetching remote statistics will be made.
>
> If no statistics were fetched, then the operation will fall back to the normal sampling operation per settings.
>
> Note patches 0001 and 0002 are already a part of a separate thread https://www.postgresql.org/message-id/flat/CADkLM%3DcpUiJ3QF7aUthTvaVMmgQcm7QqZBRMDLhBRTR%2BgJX-Og%40mail.gmail.com regarding a bug (0001) and a nitpick (0002) that came about as a side-effect to this effort, but I expect those to be resolved one way or another soon. Any feedback on those two can be handled there.

I think this is very useful to avoid fetching rows from the foreign server and analyzing them locally.

This isn't a full review. I looked at the patches mainly to find out how it fits into the current method of analysing a foreign table.

Right now, do_analyze_rel() is called with an FDW-specific acquireFunc, which collects a sample of rows. The sample is passed to the attribute-specific compute_stats, which fills VacAttrStats for that attribute. VacAttrStats for all the attributes is passed to update_attstats(), which updates pg_statistic. The patch changes that to fetch the statistics from the foreign server and call pg_restore_attribute_stats for each attribute. Instead I was expecting that after fetching the stats from the foreign server, it would construct VacAttrStats and call update_attstats(). That might be marginally faster since it avoids an SPI call and updates stats for all the attributes. Did you consider this alternate approach, and why was it discarded?

If a foreign table points to an inheritance parent on the foreign server, we will receive two rows for that table: one with inherited = false and the other with inherited = true, in that order. I think the stats with inherited = true are relevant to the local server since querying the parent will fetch rows from children. Since those stats are applied last, pg_statistic will retain the intended statistics. But why fetch two rows in the first place and waste computing cycles?

--
Best Wishes,
Ashutosh Bapat
> This isn't a full review. I looked at the patches mainly to find out
> how it fits into the current method of analysing a foreign table.
Any degree of review is welcome. We're chasing views, reviews, etc.
> Right now, do_analyze_rel() is called with an FDW-specific acquireFunc,
> which collects a sample of rows. The sample is passed to the
> attribute-specific compute_stats, which fills VacAttrStats for that
> attribute. VacAttrStats for all the attributes is passed to
> update_attstats(), which updates pg_statistic. The patch changes that
> to fetch the statistics from the foreign server and call
> pg_restore_attribute_stats for each attribute.
That recap is accurate.
> Instead I was expecting that after fetching the
> stats from the foreign server, it would construct VacAttrStats and
> call update_attstats(). That might be marginally faster since it
> avoids an SPI call and updates stats for all the attributes. Did you
> consider this alternate approach, and why was it discarded?
It might be marginally faster, but it would duplicate a lot of the pair-checking logic (a most_common_freqs must come with a most_common_vals, etc.) and type-checking logic (the values in most_common_vals must all input-coerce to the correct datatype for the _destination_ column, etc.), and we've already got that in pg_restore_attribute_stats. There used to be a non-fcinfo function that took a long list of Datum and isnull boolean pairs, but that wasn't pretty to look at and was replaced with the positional fcinfo version we have today. This use case might be a reason to bring that back, or to expose the existing positional fcinfo function (presently static), if we want to avoid SPI badly enough. As it is, the SPI code is fairly future-proof in that it does not have to be changed to add new stat types as they are created.
My first attempt at this patch tried to make a FunctionCallInvoke() on the variadic pg_restore_attribute_stats, but that would require a filled-out fn_expr, and to get that we'd have to duplicate a lot of logic from the parser, so the infrastructure isn't available to easily call a variadic function.
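To make the duplication concern concrete, here is a minimal Python sketch of the two kinds of checks mentioned above. It is not the code in pg_restore_attribute_stats: `validate_attribute_stats`, `PAIRED_STATS`, and the `coerce` callable (standing in for the destination column's input function) are hypothetical, and only the one pair named in this thread is listed.

```python
from typing import Any, Callable, Dict, List

# Only the pair named in the thread; the real function checks more ("etc.").
PAIRED_STATS = [("most_common_vals", "most_common_freqs")]


def validate_attribute_stats(
    stats: Dict[str, Any], coerce: Callable[[str], Any]
) -> List[str]:
    """Return a list of problems; an empty list means the stats look usable."""
    problems = []
    # Pair check: both halves of a pair must be present, or neither.
    for left, right in PAIRED_STATS:
        if (stats.get(left) is None) != (stats.get(right) is None):
            problems.append(f"{left} and {right} must be supplied together")
    # Type check: every value must coerce to the destination column's type.
    for raw in stats.get("most_common_vals") or []:
        try:
            coerce(raw)
        except (TypeError, ValueError):
            problems.append(
                f"most_common_vals element {raw!r} does not coerce "
                "to the destination type"
            )
    return problems
```

Rebuilding VacAttrStats by hand would mean repeating checks of this shape for every statistic kind, which is the cost being weighed against the SPI call.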
> If a foreign table points to an inheritance parent on the foreign
> server, we will receive two rows for that table: one with inherited =
> false and the other with inherited = true, in that order. I think the
> stats with inherited = true are relevant to the local server since
> querying the parent will fetch rows from children. Since those stats
> are applied last, pg_statistic will retain the intended statistics.
> But why fetch two rows in the first place and waste computing cycles?
Glad you agree that we're fetching the right statistics.
That was the only way I could think of to do one client fetch and still get exactly one row back.
Anything else involved fetching the inh=true statistics first and, if that failed, fetching the inh=false statistics. That adds extra round-trips, especially given that inherited statistics are rarer than non-inherited statistics. Moreover, we're making decisions (analyze yes/no, fallback to sampling yes/no) based on whether or not we got statistics back from the foreign server at all, and having to consider the result of two fetches instead of one makes that logic more complicated.
If, however, you were referring to the work we're handing to the remote server, I'm open to queries that you think would be more lightweight. However, the pg_stats view is a security barrier view, so we reduce overhead by passing through that barrier as few times as possible.
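The "inherited = true wins when present" behaviour discussed above can be illustrated with a small Python sketch. It assumes, hypothetically, that the single fetch yields one dict per pg_stats row for a given attribute, each with a boolean `inherited` key; `pick_stats_row` is an illustrative name, not something from the patch.

```python
from typing import List, Optional


def pick_stats_row(rows: List[dict]) -> Optional[dict]:
    """Apply rows in inherited=false, inherited=true order; the last one wins."""
    chosen = None
    # False sorts before True, so an inherited=true row (if any) is applied last.
    for row in sorted(rows, key=lambda r: r["inherited"]):
        chosen = row
    return chosen
```

An ordinary remote table produces only the inherited = false row, which is then used as-is. An inheritance parent produces both, and the inherited = true row overrides the other, all from one round-trip.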