Обсуждение: Import Statistics in postgres_fdw before resorting to sampling.

Поиск

Список

Период

Сортировка

Import Statistics in postgres_fdw before resorting to sampling.

От

Corey Huinker

Дата:

12 августа, 20:02:39

Attached is my current work on adding remote fetching of statistics to postgres_fdw, and opening the possibility of doing so to other foreign data wrappers.

This involves adding two new options to postgres_fdw at the server and table level.

The first option, fetch_stats, defaults to true at both levels. If enabled, it will cause an ANALYZE of a postgres_fdw foreign table to first attempt to fetch relation and attribute statistics from the remote table. If this succeeds, then those statistics are imported into the local foreign table, allowing us to skip a potentially expensive sampling operation.

The second option, remote_analyze, defaults to false at both levels, and only comes into play if the first fetch succeeds but no attribute statistics (i.e. the stats from pg_stats) are found. If enabled then the function will attempt to ANALYZE the remote table, and if that is successful then a second attempt at fetching remote statistics will be made.

If no statistics were fetched, then the operation will fall back to the normal sampling operation per settings.

Note patches 0001 and 0002 are already a part of a separate thread https://www.postgresql.org/message-id/flat/CADkLM%3DcpUiJ3QF7aUthTvaVMmgQcm7QqZBRMDLhBRTR%2BgJX-Og%40mail.gmail.com regarding a bug (0001) and a nitpick (0002) that came about as a side-effect to this effort, and but I expect those to be resolved one way or another soon. Any feedback on those two can be handled there.

Вложения

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Ashutosh Bapat

Дата:

14 августа, 10:37:58

On Tue, Aug 12, 2025 at 10:33 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>
>
> Attached is my current work on adding remote fetching of statistics to postgres_fdw, and opening the possibility of
doingso to other foreign data wrappers. 
>
> This involves adding two new options to postgres_fdw at the server and table level.
>
> The first option, fetch_stats, defaults to true at both levels. If enabled, it will cause an ANALYZE of a
postgres_fdwforeign table to first attempt to fetch relation and attribute statistics from the remote table. If this
succeeds,then those statistics are imported into the local foreign table, allowing us to skip a potentially expensive
samplingoperation. 
>
> The second option, remote_analyze, defaults to false at both levels, and only comes into play if the first fetch
succeedsbut no attribute statistics (i.e. the stats from pg_stats) are found. If enabled then the function will attempt
toANALYZE the remote table, and if that is successful then a second attempt at fetching remote statistics will be made. 
>
> If no statistics were fetched, then the operation will fall back to the normal sampling operation per settings.
>
> Note patches 0001 and 0002 are already a part of a separate thread
https://www.postgresql.org/message-id/flat/CADkLM%3DcpUiJ3QF7aUthTvaVMmgQcm7QqZBRMDLhBRTR%2BgJX-Og%40mail.gmail.com
regardinga bug (0001) and a nitpick (0002) that came about as a side-effect to this effort, and but I expect those to
beresolved one way or another soon. Any feedback on those two can be handled there. 

I think this is very useful to avoid fetching rows from foreign server
and analyzing them locally.

This isn't a full review. I looked at the patches mainly to find out
how does it fit into the current method of analysing a foreign table.
Right now, do_analyze_rel() is called with FDW specific acquireFunc,
which collects a sample of rows. The sample is passed to attribute
specific compute_stats which fills VacAttrStats for that attribute.
VacAttrStats for all the attributes is passed to update_attstats(),
which updates pg_statistics. The patch changes that to fetch the
statistics from the foreign server and call pg_restore_attribute_stats
for each attribute. Instead I was expecting that after fetching the
stats from the foreign server, it would construct VacAttrStats and
call update_attstats(). That might be marginally faster since it
avoids SPI call and updates stats for all the attributes. Did you
consider this alternate approach and why it was discarded?

If a foreign table points to an inheritance parent on the foreign
server, we will receive two rows for that table - one with inherited =
false and other with true in that order. I think the stats with
inherited=true are relevant to the local server since querying the
parent will fetch rows from children. Since that stats is applied
last, the pg_statistics will retain the intended statistics. But why
to fetch two rows in the first place and waste computing cycles?

--
Best Wishes,
Ashutosh Bapat

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Corey Huinker

Дата:

14 августа, 21:34:43

This isn't a full review. I looked at the patches mainly to find out
how does it fit into the current method of analysing a foreign table.

Any degree of review is welcome. We're chasing views, reviews, etc.

Right now, do_analyze_rel() is called with FDW specific acquireFunc,
which collects a sample of rows. The sample is passed to attribute
specific compute_stats which fills VacAttrStats for that attribute.
VacAttrStats for all the attributes is passed to update_attstats(),
which updates pg_statistics. The patch changes that to fetch the
statistics from the foreign server and call pg_restore_attribute_stats
for each attribute.

That recap is accurate.

Instead I was expecting that after fetching the
stats from the foreign server, it would construct VacAttrStats and
call update_attstats(). That might be marginally faster since it
avoids SPI call and updates stats for all the attributes. Did you
consider this alternate approach and why it was discarded?

It might be marginally faster, but it would duplicate a lot of the pair-checking (must have a most-common-freqs with a most-common-vals, etc) and type-checking logic (the vals in a most-common vals must all input coerce to the correct datatype for the _destination_ column, etc), and we've already got that in pg_restore_attribute_stats. There used to be a non-fcinfo function that took a long list of Datums and isnull boolean pairs, but that wasn't pretty to look at and was replaced with the positional fcinfo version we have today. This use case might be a reason to bring that back, or expose the existing positional fcinfo function (presently static) if we want to avoid SPI badly enough. As it is, the SPI code is fairly future proof in that it isn't required to add new stat types as they are created. My first attempt at this patch attempted to make a FunctionCallInvoke() on the variadic pg_restore_attribute_stats, but that would require a filled out fn_expr, and to get that we'd have to duplicate a lot of logic from the parser, so the infrastructure isn't available to easily call a variadic function.

If a foreign table points to an inheritance parent on the foreign
server, we will receive two rows for that table - one with inherited =
false and other with true in that order. I think the stats with
inherited=true are relevant to the local server since querying the
parent will fetch rows from children. Since that stats is applied
last, the pg_statistics will retain the intended statistics. But why
to fetch two rows in the first place and waste computing cycles?

Glad you agree that we're fetching the right statistics.

That was the only way I could think of to do one client fetch and still get exactly one row back.

Anything else involved fetching the inh=true first, and if that failed fetching the inh=false statistics. That adds extra round-trips especially given that inherited statistics are more rare than non-inherited statistics. Moreoever, we're making decisions (analyze yes/no, fallback to sampling yes/no) based on whether or not we got statistics back from the foreign server at all, and having to consider the result of two fetches instead of one makes that logic more complicated.

If, however, you were referring to the work we're handing the remote server, I'm open to queries that you think would be more lightweight. However, the pg_stats view is a security barrier view, so we reduce overhead by passing through that barrier as few times as possible.

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Corey Huinker

Дата:

12 сентября, 08:17:01

Rebased.

Вложения

v2-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Corey Huinker

Дата:

19 октября, 03:32:24

On Fri, Sep 12, 2025 at 1:17 AM Corey Huinker <corey.huinker@gmail.com> wrote:

Rebased.

Rebased again.

Вложения

v3-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Michael Paquier

Дата:

20 октября, 09:45:14

On Sat, Oct 18, 2025 at 08:32:24PM -0400, Corey Huinker wrote:
> Rebased again.

Hearing an opinion from Fujita-san would be interesting here, so I am
adding him in CC.  I have been looking a little bit at this patch.

+    ImportStatistics_function ImportStatistics;

All FDW callbacks are documented in fdwhandler.sgml.  This new one is
not.

I am a bit uncomfortable regarding the design you are using here,
where the ImportStatistics callback, if defined, takes priority over
the existing AnalyzeForeignTable, especially regarding the fact that
both callbacks attempt to retrieve the same data, except that the
existing callback has a different idea of the timing to use to
retrieve reltuples and relpages.  The original callback
AnalyzeForeignTable retrieves immediately the total number of pages
via SQL, to feed ANALYZE.  reltuples is then fed to ANALYZE from a
callback function that that's defined in AnalyzeForeignTable().

My point is: rather than trying to implement a second solution with a
new callback, shouldn't we try to rework the existing callback so as
it would fit better with what you are trying to do here: feed data
that ANALYZE would then be in charge of inserting?  Relying on
pg_restore_relation_stats() for the job feels incorrect to me knowing
that ANALYZE is the code path in charge of updating the stats of a
relation.  The new callback is a shortcut that bypasses what ANALYZE
does, so the problem, at least it seems to me, is that we want to
retrieve all the data in a single step, like your new callback, not in
two steps, something that only the existing callback allows.  Hence,
wouldn't it be more natural to retrieve the total number of pages and
reltuples from one callback, meaning that for local relations we
should delay RelationGetNumberOfBlocks() inside the existing
"acquire_sample_rows" callback (renaming it would make sense)?  This
requires some redesign of AnalyzeForeignTable and the internals of
analyze.c, but it would let FDW extensions know about the potential
efficiency gain.

There is also a performance concern to be made with the patch, but as
it's an ANALYZE path that may be acceptable: if we fail the first
callback, then we may call the second callback.

Fujita-san, what do you think?
--
Michael

Вложения

signature.asc

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Michael Paquier

Дата:

20 октября, 11:07:40

On Mon, Oct 20, 2025 at 03:45:14PM +0900, Michael Paquier wrote:
> On Sat, Oct 18, 2025 at 08:32:24PM -0400, Corey Huinker wrote:
> > Rebased again.
>
> Hearing an opinion from Fujita-san would be interesting here, so I am
> adding him in CC.  I have been looking a little bit at this patch.

By the way, as far as I can see this patch is still in the past commit
fest:
https://commitfest.postgresql.org/patch/5959/

You may want to move it if you are planning to continue working on
that.
--
Michael

Вложения

signature.asc

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Corey Huinker

Дата:

03 ноября, 08:26:42

I am a bit uncomfortable regarding the design you are using here,
where the ImportStatistics callback, if defined, takes priority over
the existing AnalyzeForeignTable, especially regarding the fact that
both callbacks attempt to retrieve the same data, except that the
existing callback has a different idea of the timing to use to
retrieve reltuples and relpages. The original callback
AnalyzeForeignTable retrieves immediately the total number of pages
via SQL, to feed ANALYZE. reltuples is then fed to ANALYZE from a
callback function that that's defined in AnalyzeForeignTable().

They don't try to retrieve the same data. AnalyzeForeignTable tries to retrieve a table sample which it feeds into the normal ANALYZE process. If that sample is going to be any good, it has to be a lot of rows, that that's a lot of network traffic.

ImportStatistics just grabs the results that ANALYZE computed for the remote table, using a far better sample than we'd want to pull across the wire.

My point is: rather than trying to implement a second solution with a
new callback, shouldn't we try to rework the existing callback so as
it would fit better with what you are trying to do here: feed data
that ANALYZE would then be in charge of inserting?

To do that, we would have to somehow generate fake data locally from the pg_stats data that we did pull over the wire, just to have ANALYZE compute it back down to the data we already had.

Relying on
pg_restore_relation_stats() for the job feels incorrect to me knowing
that ANALYZE is the code path in charge of updating the stats of a
relation.

That sounds like an argument for expanding ANALYZE to have syntax that will digest pre-computed rows, essentially taking over what pg_restore_relation_stats and pg_restore_attribute_stats already do, and that idea was dismissed fairly early on in development, though I wasn't against it at the time.

As it is, those two functions were developed to match what ANALYZE does. pg_restore_relation_stats even briefly had inplace updates because that's what ANALYZE did.

The new callback is a shortcut that bypasses what ANALYZE
does, so the problem, at least it seems to me, is that we want to
retrieve all the data in a single step, like your new callback, not in
two steps, something that only the existing callback allows. Hence,
wouldn't it be more natural to retrieve the total number of pages and
reltuples from one callback, meaning that for local relations we
should delay RelationGetNumberOfBlocks() inside the existing
"acquire_sample_rows" callback (renaming it would make sense)? This
requires some redesign of AnalyzeForeignTable and the internals of
analyze.c, but it would let FDW extensions know about the potential
efficiency gain.

I wanted to make minimal disruption to the existing callbacks, but that may have been misguided.

One problem I do see, however, is that the queries for fetching relation (pg_class) stats should never fail and always return one row, even if the values returned are meaningless. It's the query against pg_stats that lets us know if 1) the database on the other side is a real-enough postgres and 2) the remote table is itself analyzed. Only once we're happy that we have good attribute stats should we bother with the relation stats. All of this logic is specific to postgres, so it was confined to postgres_fdw. I don't know that other FDWs would be that much different, but minimizing the generic impact was a goal.

I'll look at this again, but I'm not sure I'm going to come up with much different.

There is also a performance concern to be made with the patch, but as
it's an ANALYZE path that may be acceptable: if we fail the first
callback, then we may call the second callback.

I think the big win is the network traffic savings.

Fujita-san, what do you think?

Very interested to know as well.

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Etsuro Fujita

Дата:

21 ноября, 21:35:17

Hi Corey,

This is an important feature.  Thanks for working on it.

On Mon, Nov 3, 2025 at 2:26 PM Corey Huinker <corey.huinker@gmail.com> wrote:

>> My point is: rather than trying to implement a second solution with a
>> new callback, shouldn't we try to rework the existing callback so as
>> it would fit better with what you are trying to do here: feed data
>> that ANALYZE would then be in charge of inserting?

> To do that, we would have to somehow generate fake data locally from the pg_stats data that we did pull over the
wire,just to have ANALYZE compute it back down to the data we already had. 

>>   Relying on
>> pg_restore_relation_stats() for the job feels incorrect to me knowing
>> that ANALYZE is the code path in charge of updating the stats of a
>> relation.

> That sounds like an argument for expanding ANALYZE to have syntax that will digest pre-computed rows, essentially
takingover what pg_restore_relation_stats and pg_restore_attribute_stats already do, and that idea was dismissed fairly
earlyon in development, though I wasn't against it at the time. 
>
> As it is, those two functions were developed to match what ANALYZE does. pg_restore_relation_stats even briefly had
inplaceupdates because that's what ANALYZE did. 

To me it seems like a good idea to rely on pg_restore_relation_stats
and pg_restore_attribute_stats, rather than doing some rework on
analyze.c; IMO I don't think it's a good idea to do such a thing for
something rather special like this.

>> The new callback is a shortcut that bypasses what ANALYZE
>> does, so the problem, at least it seems to me, is that we want to
>> retrieve all the data in a single step, like your new callback, not in
>> two steps, something that only the existing callback allows.  Hence,
>> wouldn't it be more natural to retrieve the total number of pages and
>> reltuples from one callback, meaning that for local relations we
>> should delay RelationGetNumberOfBlocks() inside the existing
>> "acquire_sample_rows" callback (renaming it would make sense)?  This
>> requires some redesign of AnalyzeForeignTable and the internals of
>> analyze.c, but it would let FDW extensions know about the potential
>> efficiency gain.

> I wanted to make minimal disruption to the existing callbacks, but that may have been misguided.

+1 for the minimal disruption.

Other initial comments:

The commit message says:

    This is managed via two new options, fetch_stats and remote_analyze,
    both are available at the server level and table level. If fetch_stats
    is true, then the ANALYZE command will first attempt to fetch statistics
    from the remote table and import those statistics locally.

    If remote_analyze is true, and if the first attempt to fetch remote
    statistics found no attribute statistics, then an attempt will be made
    to ANALYZE the remote table before a second and final attempt to fetch
    remote statistics.

    If no statistics are found, then ANALYZE will fall back to the normal
    behavior of sampling and local analysis.

I think the first step assumes that the remote stats are up-to-date;
if they aren't, it would cause a regression.  (If the remote relation
is a plain table, they are likely to be up-to-date, but for example,
if it is a foreign table, it's possible that they are stale.)  So how
about making it the user's responsibility to make them up-to-date?  If
doing so, we wouldn't need to do the second and third steps anymore,
making the patch simple.

On the other hand:

    This operation will only work on remote relations that can have stored
    statistics: tables, partitioned tables, and materialized views. If the
    remote relation is a view then remote fetching/analyzing is just wasted
    effort and the user is better of setting fetch_stats to false for that
    table.

I'm not sure the waste effort is acceptable; IMO, if the remote table
is a view, I think that the system should detect that in some way, and
then just do the normal ANALYZE processing.

That's it for now.

My apologies for the delayed response.

Best regards,
Etsuro Fujita

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Corey Huinker

Дата:

22 ноября, 00:31:23

Other initial comments:

The commit message says:

This is managed via two new options, fetch_stats and remote_analyze,
both are available at the server level and table level. If fetch_stats
is true, then the ANALYZE command will first attempt to fetch statistics
from the remote table and import those statistics locally.

If remote_analyze is true, and if the first attempt to fetch remote
statistics found no attribute statistics, then an attempt will be made
to ANALYZE the remote table before a second and final attempt to fetch
remote statistics.

If no statistics are found, then ANALYZE will fall back to the normal
behavior of sampling and local analysis.

I think the first step assumes that the remote stats are up-to-date;
if they aren't, it would cause a regression. (If the remote relation
is a plain table, they are likely to be up-to-date, but for example,
if it is a foreign table, it's possible that they are stale.) So how
about making it the user's responsibility to make them up-to-date? If
doing so, we wouldn't need to do the second and third steps anymore,
making the patch simple.

Obviously there is no way to know the quality/freshness of remote stats if they are found.

The analyze option was borne of feedback from other postgres hackers while brainstorming on what this option might look like. I don't think we *need* this extra option for the feature to be a success, but it's relative simplicity did make me want to put it out there to see who else liked it.

On the other hand:

This operation will only work on remote relations that can have stored
statistics: tables, partitioned tables, and materialized views. If the
remote relation is a view then remote fetching/analyzing is just wasted
effort and the user is better of setting fetch_stats to false for that
table.

I'm not sure the waste effort is acceptable; IMO, if the remote table
is a view, I think that the system should detect that in some way, and
then just do the normal ANALYZE processing.

The stats fetch query is pretty light, but I can see fetching the relkind along with the relstats, and making decisions on whether to continue from there, only applying the relstats after attrstats have been successfully applied.

That's it for now.

I'll see what I can do to make that work.

My apologies for the delayed response.

Valuable responses are worth waiting for.

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Corey Huinker

Дата:

23 ноября, 10:27:49

My apologies for the delayed response.

Valuable responses are worth waiting for.

I've reorganized some things a bit, mostly to make resource cleanup simpler.

Вложения

v4-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Chao Li

Дата:

24 ноября, 12:25:13


> On Nov 23, 2025, at 15:27, Corey Huinker <corey.huinker@gmail.com> wrote:
>
>   My apologies for the delayed response.
>
> Valuable responses are worth waiting for.
>
> I've reorganized some things a bit, mostly to make resource cleanup simpler.
> <v4-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch>


Few comments:

1 - commit message
```
effort and the user is better of setting fetch_stats to false for that
```

I guess “of” should be “off”

2 - postgres-fdw.sgml
```
+       server, determines wheter an <command>ANALYZE</command> on a foreign
```

Typo: wheter => whether

3 - postgres-fdw.sgml
```
+       data sampling if no statistics were availble. This option is only
```

Typo: availble => available

4 - option.c
```
+        /* fetch_size is available on both server and table */
+        {"fetch_stats", ForeignServerRelationId, false},
+        {"fetch_stats", ForeignTableRelationId, false},
```

In the comment, I guess fetch_size should be fetch_stats.

5 - analyze.c
```
+             * XXX: Should this be it's own fetch type? If not, then there might be
```

Typo: “it’s own” => “its own”

6 - analyze.c
```
+                case FDW_IMPORT_STATS_NOTFOUND:
+                    ereport(INFO,
+                            errmsg("Found no remote statistics for \"%s\"",
+                                   RelationGetRelationName(onerel)));
```

`Found no remote statistics for \"%s\””` could be rephrased as `""No remote statistics found for foreign table
\"%s\””`,sounds better wording in server log. 

Also, I wonder if this message at INFO level is too noisy?

7 - postgres_fdw.c
```
+        default:
+            ereport(INFO,
+                    errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                    errmsg("Remote table %s.%s does not support statistics.",
+                           quote_identifier(remote_schemaname),
+                           quote_identifier(remote_relname)),
+                    errdetail("Remote relation if of relkind \"%c\"", relkind));
```

I think “if of” should be “is of”.

8 - postgres_fdw.c
```
+                        errmsg("Attribute statistics found for %s.%s but no columns matched",
+                               quote_identifier(schemaname),
+                               quote_identifier(relname))));
```

Instead of using "%s.%s” and calling quote_identifier() twice, there is a simple function to use:
quote_qualified_identifier(schemaname,relname). 

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Etsuro Fujita

Дата:

27 ноября, 15:46:09

On Sat, Nov 22, 2025 at 6:31 AM Corey Huinker <corey.huinker@gmail.com> wrote:
>> Other initial comments:
>>
>> The commit message says:
>>
>>     This is managed via two new options, fetch_stats and remote_analyze,
>>     both are available at the server level and table level. If fetch_stats
>>     is true, then the ANALYZE command will first attempt to fetch statistics
>>     from the remote table and import those statistics locally.
>>
>>     If remote_analyze is true, and if the first attempt to fetch remote
>>     statistics found no attribute statistics, then an attempt will be made
>>     to ANALYZE the remote table before a second and final attempt to fetch
>>     remote statistics.
>>
>>     If no statistics are found, then ANALYZE will fall back to the normal
>>     behavior of sampling and local analysis.
>>
>> I think the first step assumes that the remote stats are up-to-date;
>> if they aren't, it would cause a regression.  (If the remote relation
>> is a plain table, they are likely to be up-to-date, but for example,
>> if it is a foreign table, it's possible that they are stale.)  So how
>> about making it the user's responsibility to make them up-to-date?  If
>> doing so, we wouldn't need to do the second and third steps anymore,
>> making the patch simple.

> Obviously there is no way to know the quality/freshness of remote stats if they are found.
>
> The analyze option was borne of feedback from other postgres hackers while brainstorming on what this option might
looklike. I don't think we *need* this extra option for the feature to be a success, but it's relative simplicity did
makeme want to put it out there to see who else liked it. 

Actually, I have some concerns about the ANALYZE and fall-back
options.  As for the former, if the remote user didn't have the
MAINTAIN privilege on the remote table, remote ANALYZE would be just a
waste effort.  As for the latter, imagine the situation where a user
ANALYZEs a foreign table whose remote table is significantly large.
When the previous attempts fail, the user might want to re-try to
import remote stats after ANALYZEing the remote table in the remote
side in some way, rather than postgres_fdw automatically falling back
to the normal lengthy processing.  I think just throwing an error if
the first attempt fails would make the system not only simple but
reliable while providing some flexibility to users.

>> On the other hand:
>>
>>     This operation will only work on remote relations that can have stored
>>     statistics: tables, partitioned tables, and materialized views. If the
>>     remote relation is a view then remote fetching/analyzing is just wasted
>>     effort and the user is better of setting fetch_stats to false for that
>>     table.
>>
>> I'm not sure the waste effort is acceptable; IMO, if the remote table
>> is a view, I think that the system should detect that in some way, and
>> then just do the normal ANALYZE processing.
>
>
> The stats fetch query is pretty light, but I can see fetching the relkind along with the relstats, and making
decisionson whether to continue from there, only applying the relstats after attrstats have been successfully applied. 

Good idea!  I would vote for throwing an error if the relkind is view,
making the user set fetch_stats to false for the foreign table.

Best regards,
Etsuro Fujita

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Etsuro Fujita

Дата:

27 ноября, 15:48:34

On Sun, Nov 23, 2025 at 4:28 PM Corey Huinker <corey.huinker@gmail.com> wrote:
> I've reorganized some things a bit, mostly to make resource cleanup simpler.

Thanks for updating the patch!  I will look into it.

Best regards,
Etsuro Fujita

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Corey Huinker

Дата:

12 декабря, 00:59:19

On Thu, Nov 27, 2025 at 7:48 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:

On Sun, Nov 23, 2025 at 4:28 PM Corey Huinker <corey.huinker@gmail.com> wrote:
> I've reorganized some things a bit, mostly to make resource cleanup simpler.

Thanks for updating the patch! I will look into it.

Best regards,
Etsuro Fujita

Rebase, no changes.

Вложения

v5-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Chao Li

Дата:

12 декабря, 09:11:31


> On Dec 12, 2025, at 05:59, Corey Huinker <corey.huinker@gmail.com> wrote:
>
> On Thu, Nov 27, 2025 at 7:48 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Sun, Nov 23, 2025 at 4:28 PM Corey Huinker <corey.huinker@gmail.com> wrote:
> > I've reorganized some things a bit, mostly to make resource cleanup simpler.
>
> Thanks for updating the patch!  I will look into it.
>
> Best regards,
> Etsuro Fujita
>
> Rebase, no changes.
> <v5-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch>

A kind reminder, I don’t see my comments are addressed:

https://postgr.es/m/F9C73EF2-F977-46E4-9F61-B6CF72BF1A69@gmail.com

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Etsuro Fujita

Дата:

12 декабря, 16:45:46

On Thu, Nov 27, 2025 at 9:46 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Sat, Nov 22, 2025 at 6:31 AM Corey Huinker <corey.huinker@gmail.com> wrote:
> >> Other initial comments:
> >>
> >> The commit message says:
> >>
> >>     This is managed via two new options, fetch_stats and remote_analyze,
> >>     both are available at the server level and table level. If fetch_stats
> >>     is true, then the ANALYZE command will first attempt to fetch statistics
> >>     from the remote table and import those statistics locally.
> >>
> >>     If remote_analyze is true, and if the first attempt to fetch remote
> >>     statistics found no attribute statistics, then an attempt will be made
> >>     to ANALYZE the remote table before a second and final attempt to fetch
> >>     remote statistics.
> >>
> >>     If no statistics are found, then ANALYZE will fall back to the normal
> >>     behavior of sampling and local analysis.
> >>
> >> I think the first step assumes that the remote stats are up-to-date;
> >> if they aren't, it would cause a regression.  (If the remote relation
> >> is a plain table, they are likely to be up-to-date, but for example,
> >> if it is a foreign table, it's possible that they are stale.)  So how
> >> about making it the user's responsibility to make them up-to-date?  If
> >> doing so, we wouldn't need to do the second and third steps anymore,
> >> making the patch simple.
>
> > Obviously there is no way to know the quality/freshness of remote stats if they are found.
> >
> > The analyze option was borne of feedback from other postgres hackers while brainstorming on what this option might
looklike. I don't think we *need* this extra option for the feature to be a success, but it's relative simplicity did
makeme want to put it out there to see who else liked it. 
>
> Actually, I have some concerns about the ANALYZE and fall-back
> options.  As for the former, if the remote user didn't have the
> MAINTAIN privilege on the remote table, remote ANALYZE would be just a
> waste effort.  As for the latter, imagine the situation where a user
> ANALYZEs a foreign table whose remote table is significantly large.
> When the previous attempts fail, the user might want to re-try to
> import remote stats after ANALYZEing the remote table in the remote
> side in some way, rather than postgres_fdw automatically falling back
> to the normal lengthy processing.  I think just throwing an error if
> the first attempt fails would make the system not only simple but
> reliable while providing some flexibility to users.

I think I was narrow-minded here; as for the ANALYZE option, if the
remote user has the privilege, it would work well, so +1 for it, but I
don't think it's a must-have option, so I'd vote for making it a
separate patch.  As for the fall-back behavior, however, sorry, I
still think it reduces flexibility.

Best regards,
Etsuro Fujita

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Etsuro Fujita

Дата:

12 декабря, 17:04:26

On Fri, Dec 12, 2025 at 6:59 AM Corey Huinker <corey.huinker@gmail.com> wrote:
> Rebase, no changes.

Thanks for rebasing!  I reviewed the changes made to the core:

@@ -196,13 +196,56 @@ analyze_rel(Oid relid, RangeVar *relation,
    {
        /*
         * For a foreign table, call the FDW's hook function to see whether it
-        * supports analysis.
+        * supports statistics import and/or analysis.
         */
        FdwRoutine *fdwroutine;
        bool        ok = false;

        fdwroutine = GetFdwRoutineForRelation(onerel, false);

+       if (fdwroutine->ImportStatistics != NULL)
+       {
+           FdwImportStatsResult    res;
+
+           /*
+            * Fetching pre-existing remote stats is not guaranteed to
be a quick
+            * operation.
+            *
+            * XXX: Should this be it's own fetch type? If not, then
there might be
+            * confusion when a long stats-fetch fails, followed by a
regular analyze,
+            * which would make it look like the table was analyzed twice.
+            */
+           pgstat_progress_start_command(PROGRESS_COMMAND_ANALYZE,
+                                         RelationGetRelid(onerel));
+
+           res = fdwroutine->ImportStatistics(onerel);
+
+           pgstat_progress_end_command();
+
+           /*
+            * If we were able to import statistics, then there is no
need to collect
+            * samples for local analysis.
+            */
+           switch(res)
+           {
+               case FDW_IMPORT_STATS_OK:
+                   relation_close(onerel, ShareUpdateExclusiveLock);
+                   return;
+               case FDW_IMPORT_STATS_DISABLED:
+                   break;
+               case FDW_IMPORT_STATS_NOTFOUND:
+                   ereport(INFO,
+                           errmsg("Found no remote statistics for \"%s\"",
+                                  RelationGetRelationName(onerel)));
+                   break;
+               case FDW_IMPORT_STATS_FAILED:
+               default:
+                   ereport(INFO,
+                           errmsg("Fetching remote statistics from
\"%s\" failed",
+                                  RelationGetRelationName(onerel)));
+           }
+       }

Returning in the FDW_IMPORT_STATS_OK case isn't 100% correct; if the
foreign table is an inheritance parent, we would fail to do inherited
stats.

IIUC, the FDW needs to do two things within the ImportStatistics
callback function: check the importability, and if ok, do the work.  I
think that would make the API complicated.  To avoid that, how about
1) splitting the callback function into the two functions shown below
and then 2) rewriting the above to something like the attached?  The
attached addresses the returning issue mentioned above as well.

bool
StatisticsAreImportable(Relation relation)

Checks the importability, and if ok, returns true, in which case the
following callback function is called.  In the postgres_fdw case, we
could implement this to check if the fetch_stats flag for the foreign
table is set to true, and if so, return true.

void
ImportStatistics(Relation relation, List *va_cols, int elevel)

Imports the stats for the foreign table from the foreign server.  As
mentioned in previous emails, I don't think it's a good idea to fall
back to the normal processing when the attempt to import the stats
fails, in which case I think we should just throw an error (or
warning) so that the user can re-try to import the stats after fixing
the foreign side in some way.  So I re-defined this as a void
function.  Note that this re-definition removes the concern mentioned
in the comment starting with "XXX:".  In the postgres_fdw case, as
mentioned in a previous email, I think it would be good to implement
this so that it checks whether the remote table is a view or not when
importing the relation stats from the remote server, and if so, just
throws an error (or warning), making the user reset the fetch_stats
flag.

I added two arguments to the callback function: va_cols, for
supporting the column list in the ANALYZE command, and elevel, for
proper logging.  I'm not sure if VaccumParams should be added as well.

That's it for now.  I'll continue to review the patch.

Best regards,
Etsuro Fujita

Вложения

FDW-API-for-importing-stats.patch

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Corey Huinker

Дата:

12 декабря, 22:03:55

A kind reminder, I don’t see my comments are addressed:

My apologies. Will get into the next rev.

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Corey Huinker

Дата:

12 декабря, 22:45:42

CCing Jonathan Katz and Nathan Bossart as they had been sounding boards for me when I started designing this feature.

Returning in the FDW_IMPORT_STATS_OK case isn't 100% correct; if the
foreign table is an inheritance parent, we would fail to do inherited
stats.

Perhaps I'm not understanding completely, but I believe what we're doing now should be ok.

The local table of type 'f' can be a member of a partition, but can't be a partition itself, so whatever stats we get for it, we're storing them as inh = false.

On the remote side, the table could be an inheritance parent, in which case we ONLY want the inh=true stats, but we will still store them locally as inh = false.

The DISTINCT ON(a.attname)....ORDER BY a.attname, s.inherited DESC part of the query ensures that we get inh=true stats if they're there in preference to the inh=false steps.

I will grant you that in an old-style inheritance (i.e. not formal partitions) situation we'd probably want some weighted mix of the inherited and non-inherited stats, but 1) very few people use old-style inheritance anymore, 2) few of those export tables via a FDW, and 3) there's no way to do that weighting so we should fall back to sampling anyway.

None of this takes away from your suggestions down below.

IIUC, the FDW needs to do two things within the ImportStatistics
callback function: check the importability, and if ok, do the work. I
think that would make the API complicated. To avoid that, how about
1) splitting the callback function into the two functions shown below
and then 2) rewriting the above to something like the attached? The
attached addresses the returning issue mentioned above as well.

bool
StatisticsAreImportable(Relation relation)

Checks the importability, and if ok, returns true, in which case the
following callback function is called. In the postgres_fdw case, we
could implement this to check if the fetch_stats flag for the foreign
table is set to true, and if so, return true.

void
ImportStatistics(Relation relation, List *va_cols, int elevel)

Imports the stats for the foreign table from the foreign server. As
mentioned in previous emails, I don't think it's a good idea to fall
back to the normal processing when the attempt to import the stats
fails, in which case I think we should just throw an error (or
warning) so that the user can re-try to import the stats after fixing
the foreign side in some way. So I re-defined this as a void
function. Note that this re-definition removes the concern mentioned
in the comment starting with "XXX:". In the postgres_fdw case, as
mentioned in a previous email, I think it would be good to implement
this so that it checks whether the remote table is a view or not when
importing the relation stats from the remote server, and if so, just
throws an error (or warning), making the user reset the fetch_stats
flag.

I think I'm ok with this design as the decision, as it still leaves open the fdw-specific options of how to handle initially finding no remote stats.

I can still see a situation where a local table expects the remote table to eventually have proper statistics on it, but until that happens it will fall back to table samples. This design decision means that either the user lives without any statistics for a while, or alters the foreign table options and hopefully remembers to set them back. While I understand the desire to first implement something very simple, I think that adding the durability that fallback allows for will be harder to implement if we don't build it in from the start. I'm interested to hear with Nathan and/or Jonathan have to say about that.

I added two arguments to the callback function: va_cols, for
supporting the column list in the ANALYZE command, and elevel, for
proper logging. I'm not sure if VaccumParams should be added as well.

Good catch, I forgot about that one.

Going to think some more on this before I work incorporating your

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Corey Huinker

Дата:

13 декабря, 22:12:16

On Fri, Dec 12, 2025 at 2:45 PM Corey Huinker <corey.huinker@gmail.com> wrote:

CCing Jonathan Katz and Nathan Bossart as they had been sounding boards for me when I started designing this feature.

Returning in the FDW_IMPORT_STATS_OK case isn't 100% correct; if the
foreign table is an inheritance parent, we would fail to do inherited
stats.

Perhaps I'm not understanding completely, but I believe what we're doing now should be ok.

The local table of type 'f' can be a member of a partition, but can't be a partition itself, so whatever stats we get for it, we're storing them as inh = false.

On the remote side, the table could be an inheritance parent, in which case we ONLY want the inh=true stats, but we will still store them locally as inh = false.

The DISTINCT ON(a.attname)....ORDER BY a.attname, s.inherited DESC part of the query ensures that we get inh=true stats if they're there in preference to the inh=false steps.

I will grant you that in an old-style inheritance (i.e. not formal partitions) situation we'd probably want some weighted mix of the inherited and non-inherited stats, but 1) very few people use old-style inheritance anymore, 2) few of those export tables via a FDW, and 3) there's no way to do that weighting so we should fall back to sampling anyway.

None of this takes away from your suggestions down below.

IIUC, the FDW needs to do two things within the ImportStatistics
callback function: check the importability, and if ok, do the work. I
think that would make the API complicated. To avoid that, how about
1) splitting the callback function into the two functions shown below
and then 2) rewriting the above to something like the attached? The
attached addresses the returning issue mentioned above as well.

bool
StatisticsAreImportable(Relation relation)

Checks the importability, and if ok, returns true, in which case the
following callback function is called. In the postgres_fdw case, we
could implement this to check if the fetch_stats flag for the foreign
table is set to true, and if so, return true.

+1

void
ImportStatistics(Relation relation, List *va_cols, int elevel)

Imports the stats for the foreign table from the foreign server. As
mentioned in previous emails, I don't think it's a good idea to fall
back to the normal processing when the attempt to import the stats
fails, in which case I think we should just throw an error (or
warning) so that the user can re-try to import the stats after fixing
the foreign side in some way. So I re-defined this as a void
function. Note that this re-definition removes the concern mentioned
in the comment starting with "XXX:". In the postgres_fdw case, as
mentioned in a previous email, I think it would be good to implement
this so that it checks whether the remote table is a view or not when
importing the relation stats from the remote server, and if so, just
throws an error (or warning), making the user reset the fetch_stats
flag.

I think I'm ok with this design as the decision, as it still leaves open the fdw-specific options of how to handle initially finding no remote stats.

I can still see a situation where a local table expects the remote table to eventually have proper statistics on it, but until that happens it will fall back to table samples. This design decision means that either the user lives without any statistics for a while, or alters the foreign table options and hopefully remembers to set them back. While I understand the desire to first implement something very simple, I think that adding the durability that fallback allows for will be harder to implement if we don't build it in from the start. I'm interested to hear with Nathan and/or Jonathan have to say about that.

I added two arguments to the callback function: va_cols, for
supporting the column list in the ANALYZE command, and elevel, for
proper logging. I'm not sure if VaccumParams should be added as well.

Good catch, I forgot about that one.

Going to think some more on this before I work incorporating your

Heh, the word "changes" got cut off.

Anyway, here's v6 incorporating both threads of feedback.

Вложения

v6-0001-Add-remote-statistics-fetching-to-postgres_fdw.patch

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Etsuro Fujita

Дата:

14 декабря, 21:40:49

On Sun, Dec 14, 2025 at 4:12 AM Corey Huinker <corey.huinker@gmail.com> wrote:
> On Fri, Dec 12, 2025 at 2:45 PM Corey Huinker <corey.huinker@gmail.com> wrote:
>> CCing Jonathan Katz and Nathan Bossart as they had been sounding boards for me when I started designing this
feature.

>>> Returning in the FDW_IMPORT_STATS_OK case isn't 100% correct; if the
>>> foreign table is an inheritance parent, we would fail to do inherited
>>> stats.

>> Perhaps I'm not understanding completely, but I believe what we're doing now should be ok.
>>
>> The local table of type 'f' can be a member of a partition, but can't be a partition itself, so whatever stats we
getfor it, we're storing them as inh = false. 
>>
>> On the remote side, the table could be an inheritance parent, in which case we ONLY want the inh=true stats, but we
willstill store them locally as inh = false. 
>>
>> The DISTINCT ON(a.attname)....ORDER BY a.attname, s.inherited DESC part of the query ensures that we get inh=true
statsif they're there in preference to the inh=false steps. 
>>
>> I will grant you that in an old-style inheritance (i.e. not formal partitions) situation we'd probably want some
weightedmix of the inherited and non-inherited stats, but 1) very few people use old-style inheritance anymore, 2) few
ofthose export tables via a FDW, and 3) there's no way to do that weighting so we should fall back to sampling anyway. 

Ah, I mean the case where the foreign table is an inheritance parent
on the *local* side.  In that case, the return would cause us to skip
the recursive ANALYZE (i.e., do_analyze_rel() with inh=true), leading
to no inherited stats.  I agree that the case is minor, but I don't
think that that's acceptable.

>>> void
>>> ImportStatistics(Relation relation, List *va_cols, int elevel)
>>>
>>> Imports the stats for the foreign table from the foreign server.  As
>>> mentioned in previous emails, I don't think it's a good idea to fall
>>> back to the normal processing when the attempt to import the stats
>>> fails, in which case I think we should just throw an error (or
>>> warning) so that the user can re-try to import the stats after fixing
>>> the foreign side in some way.  So I re-defined this as a void
>>> function.  Note that this re-definition removes the concern mentioned
>>> in the comment starting with "XXX:".  In the postgres_fdw case, as
>>> mentioned in a previous email, I think it would be good to implement
>>> this so that it checks whether the remote table is a view or not when
>>> importing the relation stats from the remote server, and if so, just
>>> throws an error (or warning), making the user reset the fetch_stats
>>> flag.

>> I think I'm ok with this design as the decision, as it still leaves open the fdw-specific options of how to handle
initiallyfinding no remote stats. 
>>
>> I can still see a situation where a local table expects the remote table to eventually have proper statistics on it,
butuntil that happens it will fall back to table samples. This design decision means that either the user lives without
anystatistics for a while, or alters the foreign table options and hopefully remembers to set them back. While I
understandthe desire to first implement something very simple, I think that adding the durability that fallback allows
forwill be harder to implement if we don't build it in from the start. I'm interested to hear with Nathan and/or
Jonathanhave to say about that. 

My concern about the fall-back behavior is that it reduces flexibility
in some cases, as mentioned upthread.  Maybe that could be addressed
by making the behavior an option, but my another (bigger) concern is
that considering that it's the user's responsibility to make remote
stats up-to-date when he/she uses this feature, the "no remote stats
found" result would be his/her fault; it might not be worth
complicating the code for something like that.

Anyway, I too would like to hear the opinions of them (or anyone else).

> Anyway, here's v6 incorporating both threads of feedback.

Thanks for updating the patch!  I will review the postgres_fdw changes next.

Best regards,
Etsuro Fujita

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Corey Huinker

Дата:

14 декабря, 23:01:35

Ah, I mean the case where the foreign table is an inheritance parent
on the *local* side. In that case, the return would cause us to skip
the recursive ANALYZE (i.e., do_analyze_rel() with inh=true), leading
to no inherited stats. I agree that the case is minor, but I don't
think that that's acceptable.

When such a corner case occurs (stats import configured to true, but table is an inheritance parent), should we raise an error, or raise a warning and return false on the CanImportStats() call? I guess the answer may depend on the feedback we get.

Re: Import Statistics in postgres_fdw before resorting to sampling.

От

Etsuro Fujita

Дата:

15 декабря, 13:01:44

On Mon, Dec 15, 2025 at 5:01 AM Corey Huinker <corey.huinker@gmail.com> wrote:

>> Ah, I mean the case where the foreign table is an inheritance parent
>> on the *local* side.  In that case, the return would cause us to skip
>> the recursive ANALYZE (i.e., do_analyze_rel() with inh=true), leading
>> to no inherited stats.  I agree that the case is minor, but I don't
>> think that that's acceptable.

> When such a corner case occurs (stats import configured to true, but table is an inheritance parent), should we raise
anerror, or raise a warning and return false on the CanImportStats() call? I guess the answer may depend on the
feedbackwe get. 

As mentioned upthread, the FDW API that I proposed addresses this
issue; even in such a case it allows the FDW to import stats, instead
of doing the normal non-recursive ANALYZE, and then do the recursive
ANALYZE, for the inherited stats.

Best regards,
Etsuro Fujita

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Import Statistics in postgres_fdw before resorting to sampling.

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения