Thread: Indexes on expressions with multiple columns and operators
Hello, in the following, I don't understand why:

1) the expression index isn't used in the first EXPLAIN

2) the number of estimated rows is completely off in the second EXPLAIN,
whereas the planner could easily use the statistics of foo_f_idx.

(SQL script attached, tested with master and v17)

DROP TABLE IF EXISTS foo;

CREATE UNLOGGED TABLE foo (id bigint, ackid int, crit text);

ALTER TABLE foo ALTER COLUMN crit SET statistics 400;

INSERT INTO foo SELECT i, NULL, CASE WHEN i%100=1 THEN 'WARNING' ELSE 'INFO' END
FROM generate_series(1,100000) AS T(i);

UPDATE foo SET ackid = random()*10000 WHERE id%100=1 AND id > 500;

CREATE INDEX foo_expr_idx ON foo ((ackid IS NULL AND crit = 'WARNING'));

ANALYZE foo;

EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo
WHERE ackid IS NULL AND crit = 'WARNING';

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..1797.00 rows=990 width=17) (actual time=0.012..23.932 rows=5.00 loops=1)
   Filter: ((ackid IS NULL) AND (crit = 'WARNING'::text))
   Rows Removed by Filter: 99995
(3 rows)

CREATE OR REPLACE FUNCTION f(crit text, ackid int) RETURNS bool
LANGUAGE plpgsql IMMUTABLE AS
$$
BEGIN
    RETURN crit = 'WARNING' AND ackid IS NULL;
END
$$;

CREATE INDEX foo_f_idx ON foo (f(crit, ackid));

ANALYZE foo;

EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo WHERE f(crit, ackid);

                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using foo_f_idx on foo  (cost=0.29..8.39 rows=33333 width=17) (actual time=0.021..0.028 rows=5.00 loops=1)
   Index Cond: (f(crit, ackid) = true)
   Index Searches: 1
(3 rows)

SELECT tablename, most_common_vals, most_common_freqs FROM pg_stats
WHERE tablename LIKE 'foo_%';

  tablename   | most_common_vals | most_common_freqs
--------------+------------------+-------------------
 foo_expr_idx | {f,t}            | {0.99995,5e-05}
 foo_f_idx    | {f,t}            | {0.99995,5e-05}

SELECT 100000 * 5e-05 AS the_row_estimate_that_the_planner_should_use;

 the_row_estimate_that_the_planner_should_use
-----------------------------------------------
                                       5.00000

Best regards,
Frédéric
On Wed, 2025-09-17 at 15:55 +0200, Frédéric Yhuel wrote:
> Hello, in the following, I don't understand why:
>
> 1) the expression index isn't used in the first EXPLAIN
>
> 2) the number of estimated rows is completely off in the second EXPLAIN,
> whereas the planner could easily use the statistics of foo_f_idx.
>
> (SQL script attached, tested with master and v17)
>
> DROP TABLE IF EXISTS foo;
>
> CREATE UNLOGGED TABLE foo (id bigint, ackid int, crit text);
>
> ALTER TABLE foo ALTER COLUMN crit SET statistics 400;
>
> INSERT INTO foo SELECT i, NULL, CASE WHEN i%100=1 THEN 'WARNING' ELSE
> 'INFO' END FROM generate_series(1,100000) AS T(i);
>
> UPDATE foo SET ackid = random()*10000 WHERE id%100=1 AND id > 500;
>
> CREATE INDEX foo_expr_idx ON foo ((ackid IS NULL AND crit = 'WARNING'));
>
> ANALYZE foo;
>
> EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo WHERE
> ackid IS NULL AND crit = 'WARNING';
>                                                QUERY PLAN
> -------------------------------------------------------------------------------------------------------
>  Seq Scan on foo  (cost=0.00..1797.00 rows=990 width=17) (actual time=0.012..23.932 rows=5.00 loops=1)
>    Filter: ((ackid IS NULL) AND (crit = 'WARNING'::text))
>    Rows Removed by Filter: 99995
> (3 rows)

As far as I know, PostgreSQL considers "ackid IS NULL" and
"crit = 'WARNING'" as two different RestrictInfos.  See the comment:

/*
 * Restriction clause info.
 *
 * We create one of these for each AND sub-clause of a restriction condition
 * (WHERE or JOIN/ON clause).  Since the restriction clauses are logically
 * ANDed, we can use any one of them or any subset of them to filter out
 * tuples, without having to evaluate the rest.  The RestrictInfo node itself
 * stores data used by the optimizer while choosing the best query plan.

PostgreSQL doesn't consider the (exotic) case where someone creates an
index on an expression that contains an AND.

You could do this:

EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF)
SELECT * FROM foo WHERE (ackid IS NULL AND crit = 'WARNING') IS TRUE;

                                                       QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 Index Scan using foo_expr_idx on foo  (cost=0.29..8.39 rows=5 width=17) (actual time=0.011..0.018 rows=5.00 loops=1)
   Index Cond: (((ackid IS NULL) AND (crit = 'WARNING'::text)) = true)
   Index Searches: 1
(3 rows)

But really, you should create a better index, perhaps

CREATE INDEX ON foo (crit) WHERE ackid IS NULL;

Yours,
Laurenz Albe
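A back-of-the-envelope check of the rows=990 estimate above, assuming the
planner simply multiplies the two per-clause selectivities (the counts
follow from the INSERT and UPDATE in the original script):

-- null_frac(ackid): ids with id%100=1 AND id>500 are 501, 601, ..., 99901,
-- i.e. 995 rows got a non-NULL ackid, so null_frac = 99005/100000 = 0.99005
-- freq(crit = 'WARNING'): 1000/100000 = 0.01
-- independent product: 100000 * 0.99005 * 0.01 = 990.05, hence rows=990
SELECT round(100000 * 0.99005 * 0.01) AS seq_scan_row_estimate;  -- 990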
Frédéric Yhuel <frederic.yhuel@dalibo.com> writes:
> Hello, in the following, I don't understand why:
> 1) the expression index isn't used in the first EXPLAIN

The planner doesn't look for multi-clause matches of that sort.
You could apply a little ju-jitsu perhaps:

regression=# EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo WHERE (ackid IS NULL AND crit = 'WARNING') is true;
                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Scan using foo_expr_idx on foo  (cost=0.29..8.39 rows=5 width=17) (actual time=0.013..0.016 rows=5.00 loops=1)
   Index Cond: (((ackid IS NULL) AND (crit = 'WARNING'::text)) = true)
   Index Searches: 1
(3 rows)

but my own tendency would be to use a partial index rather than a
boolean-valued index:

regression=# CREATE INDEX foo_partial_idx ON foo (id) WHERE ackid IS NULL AND crit = 'WARNING';
CREATE INDEX
regression=# EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo WHERE ackid IS NULL AND crit = 'WARNING';
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using foo_partial_idx on foo  (cost=0.13..107.18 rows=990 width=17) (actual time=0.010..0.014 rows=5.00 loops=1)
   Index Searches: 1
(2 rows)

The advantage of a partial index is you might be able to have the
index entries themselves carry some other column(s), allowing
more queries to be made into index-only scans.  I put "id" here,
which might or might not be of any use in this specific toy example.

> 2) the number of estimated rows is completely off in the second EXPLAIN,
> whereas the planner could easily use the statistics of foo_f_idx.

Hmm, not sure about that.  Again, boolean-valued indexes aren't
something we've worked on too hard, but I don't see why that
would affect this case.

			regards, tom lane
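A hedged sketch of that covering-index idea (INCLUDE exists since
PostgreSQL 11; the column choice is illustrative, not from the original
mail):

-- Carry crit in the leaf entries; ackid is implied by the predicate:
CREATE INDEX foo_partial_cov_idx ON foo (id) INCLUDE (crit)
    WHERE ackid IS NULL AND crit = 'WARNING';
-- Every column this query touches is now in the index, so it can become
-- an index-only scan once the visibility map is up to date:
EXPLAIN SELECT id, crit FROM foo WHERE ackid IS NULL AND crit = 'WARNING';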
Thank you Laurenz and Tom! I'm going to quote Tom's email here:

On 9/17/25 16:41, Tom Lane wrote:
> Frédéric Yhuel <frederic.yhuel@dalibo.com> writes:
>> Hello, in the following, I don't understand why:
>> 1) the expression index isn't used in the first EXPLAIN
>
> The planner doesn't look for multi-clause matches of that sort.
> You could apply a little ju-jitsu perhaps:
>
> regression=# EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo WHERE (ackid IS NULL AND crit = 'WARNING') is true;
>                                                        QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
>  Index Scan using foo_expr_idx on foo  (cost=0.29..8.39 rows=5 width=17) (actual time=0.013..0.016 rows=5.00 loops=1)
>    Index Cond: (((ackid IS NULL) AND (crit = 'WARNING'::text)) = true)
>    Index Searches: 1
> (3 rows)

Thanks, it works well indeed.

> but my own tendency would be to use a partial index rather than a
> boolean-valued index:
>
> regression=# CREATE INDEX foo_partial_idx ON foo (id) WHERE ackid IS NULL AND crit = 'WARNING';
> CREATE INDEX
> regression=# EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo WHERE ackid IS NULL AND crit = 'WARNING';
>                                                           QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------
>  Index Scan using foo_partial_idx on foo  (cost=0.13..107.18 rows=990 width=17) (actual time=0.010..0.014 rows=5.00 loops=1)
>    Index Searches: 1
> (2 rows)
>
> The advantage of a partial index is you might be able to have the
> index entries themselves carry some other column(s), allowing
> more queries to be made into index-only scans.  I put "id" here,
> which might or might not be of any use in this specific toy example.

Yes, Laurenz made a similar suggestion, but the problem is that I'm
mostly interested in the estimated number of output rows... because in
the real query, there's a very bad Hash Join above (the Nested Loop is
*much* faster).

>> 2) the number of estimated rows is completely off in the second EXPLAIN,
>> whereas the planner could easily use the statistics of foo_f_idx.
>
> Hmm, not sure about that.  Again, boolean-valued indexes aren't
> something we've worked on too hard, but I don't see why that
> would affect this case.

OK, thanks anyway, I think the ju-jitsu mentioned above will do, even
though the application code will have to be patched.
On 9/17/25 16:57, Frédéric Yhuel wrote:
> Yes, Laurenz made a similar suggestion, but the problem is that I'm
> mostly interested in the estimated number of output rows... because in
> the real query, there's a very bad Hash Join above (the Nested Loop is
> *much* faster).

BTW, I've also tested another solution: partitioning on 'criticity', with
one partition for 'INFO' (99% of the rows), and another one for the other
values ('WARNING' and 'ALARM' in the real case). The statistics are much
better... however an expression index would be an easier fix.

Multivariate MCV statistics don't work well in the real case (100M rows
in the table).
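A minimal sketch of the partitioning scheme Frédéric describes, assuming
the table can be rebuilt (the names are invented for illustration):

CREATE TABLE foo_part (id bigint, ackid int, crit text) PARTITION BY LIST (crit);
CREATE TABLE foo_part_info  PARTITION OF foo_part FOR VALUES IN ('INFO');
CREATE TABLE foo_part_other PARTITION OF foo_part DEFAULT;  -- 'WARNING', 'ALARM', ...
-- ANALYZE gathers per-partition statistics, so ackid's null_frac is
-- computed separately for the small non-INFO partition:
ANALYZE foo_part;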
Hi there,

I think this discussion has a nice solution, thank you!

However, while poking around this issue yesterday, we also found something
surprising between estimated rows and costs when using a function. Below
is the scenario to apply on top of Frédéric's one to quickly expose the
weirdness:

CREATE OR REPLACE FUNCTION s(crit text, ackid int) RETURNS bool
LANGUAGE plpgsql IMMUTABLE AS
$$
BEGIN
    RETURN crit = 'WARNING' AND ackid IS NULL;
END
$$;

CREATE INDEX foo_s_idx ON foo (s(crit, ackid));

ANALYZE foo;

EXPLAIN SELECT * FROM foo WHERE s(crit, ackid);
EXPLAIN SELECT * FROM foo WHERE (ackid IS NULL AND crit = 'WARNING') is true;

SELECT most_common_vals, most_common_freqs FROM pg_stats
WHERE tablename = 'foo_s_idx';

On a fresh instance from HEAD with its default configuration, it shows:

 Index Scan using foo_s_idx on foo  (cost=0.29..8.39 rows=33333 width=13)
   Index Cond: (s(crit, ackid) = true)

 Index Scan using foo_expr_idx on foo  (cost=0.29..8.39 rows=5 width=13)
   Index Cond: (((ackid IS NULL) AND (crit = 'WARNING'::text)) = true)

 most_common_vals | most_common_freqs
------------------+-------------------
 {f,t}            | {0.99995,5e-05}

The statistics shown in the "pg_stats" view for function "s()" look good.
The query even has the same costs as the query using the syntax tip you
provided before. However, the estimated row count seems wrong with regard
to the costs shown and the statistics. It looks like a default hardcoded
33% has been applied.

As Frédéric said earlier, this bad row estimate drives the upper join to
the wrong method.

Is this a known planner behavior?

(Again, the ju-jitsu syntax provided by Laurenz and you is definitely on
point here; I just hijack the thread to discuss this weirdness.)

Regards,
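For what it's worth, 33333 is exactly what a hardcoded one-third default
selectivity would produce on this 100000-row table:

SELECT round(100000 * 0.3333333) AS suspected_default_estimate;  -- 33333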
Frédéric Yhuel <frederic.yhuel@dalibo.com> writes:
> On 9/17/25 16:41, Tom Lane wrote:
>> Frédéric Yhuel <frederic.yhuel@dalibo.com> writes:
>>> 2) the number of estimated rows is completely off in the second EXPLAIN,
>>> whereas the planner could easily use the statistics of foo_f_idx.

>> Hmm, not sure about that.  Again, boolean-valued indexes aren't
>> something we've worked on too hard, but I don't see why that
>> would affect this case.

> OK, thanks anyway, I think the ju-jitsu mentioned above will do, even
> though the application code will have to be patched.

Sigh ... so the answer is this used to work (since commit 39df0f150)
and then I carelessly broke it in commit a391ff3c3.  If you try this
test case in versions 9.5..11 you get a spot-on rowcount estimate.
Serves me right for not having a test case I guess, but I'm astonished
that nobody complained sooner.

			regards, tom lane
I wrote:
> Sigh ... so the answer is this used to work (since commit 39df0f150)
> and then I carelessly broke it in commit a391ff3c3.  If you try this
> test case in versions 9.5..11 you get a spot-on rowcount estimate.
> Serves me right for not having a test case I guess, but I'm astonished
> that nobody complained sooner.

The attached fixes things so it works like it did pre-a391ff3c3.

I spent some time trying to devise a test case, and was reminded
of why I didn't have one before: it's hard to make a case that
will be robust enough to not show diffs in the buildfarm.
I'll keep thinking about that though.

			regards, tom lane

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 5d51f97f219..d0f516b7645 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -874,6 +874,10 @@ clause_selectivity_ext(PlannerInfo *root,
 									varRelid,
 									jointype,
 									sjinfo);
+
+		/* If no support, fall back on boolvarsel */
+		if (s1 < 0)
+			s1 = boolvarsel(root, clause, varRelid);
 	}
 	else if (IsA(clause, ScalarArrayOpExpr))
 	{
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index f8641204a67..da5d901ec3c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -2143,9 +2143,8 @@ join_selectivity(PlannerInfo *root,
 /*
  * function_selectivity
  *
- *	  Returns the selectivity of a specified boolean function clause.
- *	  This code executes registered procedures stored in the
- *	  pg_proc relation, by calling the function manager.
+ *	  Attempt to estimate the selectivity of a specified boolean function clause
+ *	  by asking its support function.  If the function lacks support, return -1.
  *
  * See clause_selectivity() for the meaning of the additional parameters.
  */
@@ -2163,15 +2162,8 @@ function_selectivity(PlannerInfo *root,
 	SupportRequestSelectivity req;
 	SupportRequestSelectivity *sresult;
 
-	/*
-	 * If no support function is provided, use our historical default
-	 * estimate, 0.3333333.  This seems a pretty unprincipled choice, but
-	 * Postgres has been using that estimate for function calls since 1992.
-	 * The hoariness of this behavior suggests that we should not be in too
-	 * much hurry to use another value.
-	 */
 	if (!prosupport)
-		return (Selectivity) 0.3333333;
+		return (Selectivity) -1;	/* no support function */
 
 	req.type = T_SupportRequestSelectivity;
 	req.root = root;
@@ -2188,9 +2180,8 @@ function_selectivity(PlannerInfo *root,
 		DatumGetPointer(OidFunctionCall1(prosupport,
 										 PointerGetDatum(&req)));
 
-	/* If support function fails, use default */
 	if (sresult != &req)
-		return (Selectivity) 0.3333333;
+		return (Selectivity) -1;	/* function did not honor request */
 
 	if (req.selectivity < 0.0 || req.selectivity > 1.0)
 		elog(ERROR, "invalid function selectivity: %f", req.selectivity);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 1c480cfaaf7..e5e066a5537 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -1528,6 +1528,17 @@ boolvarsel(PlannerInfo *root, Node *arg, int varRelid)
 		selec = var_eq_const(&vardata, BooleanEqualOperator, InvalidOid,
 							 BoolGetDatum(true), false, true, false);
 	}
+	else if (is_funcclause(arg))
+	{
+		/*
+		 * If we have no stats and it's a function call, estimate 0.3333333.
+		 * This seems a pretty unprincipled choice, but Postgres has been
+		 * using that estimate for function calls since 1992.  The hoariness
+		 * of this behavior suggests that we should not be in too much hurry
+		 * to use another value.
+		 */
+		selec = 0.3333333;
+	}
 	else
 	{
 		/* Otherwise, the default estimate is 0.5 */
On Thu, Sep 18, 2025 at 13:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Sigh ... so the answer is this used to work (since commit 39df0f150)
>> and then I carelessly broke it in commit a391ff3c3.  If you try this
>> test case in versions 9.5..11 you get a spot-on rowcount estimate.
>> Serves me right for not having a test case I guess, but I'm astonished
>> that nobody complained sooner.
>
> The attached fixes things so it works like it did pre-a391ff3c3.
> I spent some time trying to devise a test case, and was reminded
> of why I didn't have one before: it's hard to make a case that
> will be robust enough to not show diffs in the buildfarm.
> I'll keep thinking about that though.
One question: would it make a difference to return -1.0 (floating-point
notation)? If not, static analyzers would have less work.
best regards,
Ranier Vilela
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:
> On a fresh instance from HEAD with its default configuration, it shows:

> Index Scan using foo_s_idx on foo  (cost=0.29..8.39 rows=33333 width=13)
>   Index Cond: (s(crit, ackid) = true)

> The statistics shown in the "pg_stats" view for function "s()" look good.
> The query even has the same costs as the query using the syntax tip you
> provided before.

> However, the estimated row count seems wrong with regard to the costs
> shown and the statistics.

Yeah.  The problem is that clause_selectivity_ext fails to consider
use of statistics if the clause looks like "bool_valued_function(...)".
If it looks like "bool_valued_function(...) = true", that goes down
a different code path that does the right thing.

Additional factors:

* If you just write "WHERE bool_valued_function(...) = true", that gets
stripped down to "WHERE bool_valued_function(...)" in the name of making
equivalent expressions look equivalent.  (IS TRUE doesn't get stripped,
which is why you have to use that wording to avoid that.)

* Index condition building puts back the "= true" in order to construct
something that satisfies the index AM API.  And then it uses that form
to get a selectivity estimate for costing purposes --- so the right
number goes into the indexscan cost estimate.

* But the rowcount estimate is made on the form without "= true".
That's the number shown in EXPLAIN and used when considering joins.

			regards, tom lane
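A quick sketch lining the three spellings up against foo_s_idx, following
Tom's explanation (the row estimates noted are the ones reported earlier
in the thread):

EXPLAIN SELECT * FROM foo WHERE s(crit, ackid);          -- bare function: default estimate, rows=33333
EXPLAIN SELECT * FROM foo WHERE s(crit, ackid) = true;   -- "= true" is stripped early: same estimate
EXPLAIN SELECT * FROM foo WHERE s(crit, ackid) IS TRUE;  -- IS TRUE survives and can use the index stats: rows=5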
On 9/18/25 18:40, Tom Lane wrote:
> The attached fixes things so it works like it did pre-a391ff3c3.

Indeed, it works well!

> I spent some time trying to devise a test case, and was reminded
> of why I didn't have one before: it's hard to make a case that
> will be robust enough to not show diffs in the buildfarm.
> I'll keep thinking about that though.

It looks like an interesting homework assignment for an aspiring hacker :-)
So I'll try to devise one, too.
On Thu, 18 Sep 2025 12:59:11 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:
>> On a fresh instance from HEAD with its default configuration, it shows:
>
>> Index Scan using foo_s_idx on foo  (cost=0.29..8.39 rows=33333 width=13)
>>   Index Cond: (s(crit, ackid) = true)
>
>> The statistics shown in the "pg_stats" view for function "s()" look good.
>> The query even has the same costs as the query using the syntax tip you
>> provided before.
>
>> However, the estimated row count seems wrong with regard to the costs
>> shown and the statistics.
>
> Yeah.  The problem is that clause_selectivity_ext fails to consider
> use of statistics if the clause looks like "bool_valued_function(...)".
> If it looks like "bool_valued_function(...) = true", that goes down
> a different code path that does the right thing.

Oh, OK, I understand. Thanks for your explanations!

Regards,
Frédéric Yhuel <frederic.yhuel@dalibo.com> writes:
> On 9/18/25 18:40, Tom Lane wrote:
>> The attached fixes things so it works like it did pre-a391ff3c3.

> Indeed, it works well!

Thanks for testing!

>> I spent some time trying to devise a test case, and was reminded
>> of why I didn't have one before: it's hard to make a case that
>> will be robust enough to not show diffs in the buildfarm.
>> I'll keep thinking about that though.

> It looks like an interesting homework assignment for an aspiring hacker :-)
> So I'll try to devise one, too.

I concluded that maybe I was overthinking this part.  We only really
need to check the rowcount estimate, since the indexscan cost estimate
is already okay.  And stats_ext.sql seems to have gotten away with
assuming that rowcount estimates are reliably reproducible on
not-too-large tables.  This bug affects use of extended statistics
too, so a test using those is good enough to show it's fixed; we don't
really need to use an expression index for the purpose.  So, I added a
test case in stats_ext.sql and pushed it.

I wish we could back-patch this, since it's so obviously a bug fix.
But we avoid changing plans in released branches, and at this point
it's even too late for v18.

Thank you again for the report.

			regards, tom lane
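The committed test lives in stats_ext.sql; a sketch in the same spirit,
applied to this thread's example (the statistics name is mine, and
expression statistics require PostgreSQL 14 or later):

CREATE STATISTICS foo_f_stx ON (f(crit, ackid)) FROM foo;
ANALYZE foo;
-- With the fix, the bare boolean function clause falls back on boolvarsel(),
-- which can consult these statistics instead of the 0.3333333 default:
EXPLAIN SELECT * FROM foo WHERE f(crit, ackid);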
On 9/20/25 18:51, Tom Lane wrote:
> I concluded that maybe I was overthinking this part.  We only really
> need to check the rowcount estimate, since the indexscan cost estimate
> is already okay.  And stats_ext.sql seems to have gotten away with
> assuming that rowcount estimates are reliably reproducible on
> not-too-large tables.  This bug affects use of extended statistics
> too, so a test using those is good enough to show it's fixed; we don't
> really need to use an expression index for the purpose.  So, I added a
> test case in stats_ext.sql and pushed it.

OK, great! Thanks for the detailed explanation.

Regarding extended statistics, it's unfortunate that they cannot be used
in this case. Multivariate MCV statistics work as long as the number of
rows in the table is reasonably small (100K) and the STATISTICS value for
a column is high enough (it doesn't matter which column, because only the
sample size matters).

I wonder if this is an argument in favour of decoupling the sample size
and the precision of the statistics. Here, we basically want the sample
size to be as big as the table in order to include the few (NULL,
WARNING) values.

Or maybe we need a different kind of extended statistics?
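For reference, the coupling described above: ANALYZE samples roughly
300 × (the largest per-column statistics target) rows, so today the only
lever is the column target itself (a sketch, not a recommendation):

-- Raising any column's target to 10000 makes ANALYZE sample ~3,000,000
-- rows AND keep up to 10000 MCV/histogram entries for that column:
ALTER TABLE foo ALTER COLUMN crit SET STATISTICS 10000;
ANALYZE foo;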
On 22/9/2025 15:37, Frédéric Yhuel wrote:
> I wonder if this is an argument in favour of decoupling the sample size
> and the precision of the statistics. Here, we basically want the sample
> size to be as big as the table in order to include the few (NULL,
> WARNING) values.

I have also seen how repeating ANALYZE on the same database drastically
changes query plans ;(.

It seems to me that with massive samples, many of the ANALYZE algorithms
would have to be rewritten. In principle, statistics hooks exist, so it
is possible to invent an independent table analyser that scans the whole
table to get precise statistics.

-- 
regards, Andrei Lepikhov
On 9/22/25 15:57, Andrei Lepikhov wrote:
> On 22/9/2025 15:37, Frédéric Yhuel wrote:
>> I wonder if this is an argument in favour of decoupling the sample
>> size and the precision of the statistics. Here, we basically want the
>> sample size to be as big as the table in order to include the few
>> (NULL, WARNING) values.
>
> I have also seen how repeating ANALYZE on the same database drastically
> changes query plans ;(.
>
> It seems to me that with massive samples, many of the ANALYZE algorithms
> would have to be rewritten. In principle, statistics hooks exist, so it
> is possible to invent an independent table analyser that scans the whole
> table to get precise statistics.

Interesting! I wonder how difficult it would be.

However, in this specific case, I realised that it wouldn't solve the
issue of ANALYZE being triggered when there are zero rows with (ackid,
crit) = (NULL, WARNING).

Partitioning would still work in this case, though, because ackid's
null_frac would be zero for the partition containing the 'WARNING' value.

I wonder if we could devise another kind of extended statistic that would
provide these "partitioned statistics" without actually partitioning.
On 22/9/2025 18:09, Frédéric Yhuel wrote:
> I wonder if we could devise another kind of extended statistic that
> would provide these "partitioned statistics" without actually
> partitioning.

I'm not sure I fully understand your case, but SQL Server demonstrates an
interesting approach: they have a WHERE clause attached to statistics.
So, having implemented this, you could separate the whole range of values
inside the table into 'partitions' by such a WHERE condition.

It may solve at least one issue with the 'dependencies' statistics: a
single number describing the dependency between any two values in the
columns often leads to incorrect estimations, as I see it.

-- 
regards, Andrei Lepikhov
On 9/22/25 23:15, Andrei Lepikhov wrote:
> I'm not sure I fully understand your case, but SQL Server demonstrates
> an interesting approach: they have a WHERE clause attached to
> statistics. So, having implemented this, you could separate the whole
> range of values inside the table into 'partitions' by such a WHERE
> condition.

Yes, from what I understood of the documentation [1], this is exactly
what I would like!

> It may solve at least one issue with the 'dependencies' statistics: a
> single number describing the dependency between any two values in the
> columns often leads to incorrect estimations, as I see it.

For what it's worth, I've never encountered a case in my life as a
PostgreSQL support engineer where the 'dependencies' kind could be
useful. I only successfully used the 'mcv' kind once (and that was only
partially successful, as it fixed the estimates but not the plan).

[1] https://learn.microsoft.com/en-us/sql/t-sql/statements/create-statistics-transact-sql?view=sql-server-ver17#c-use-create-statistics-to-create-filtered-statistics
On 23/9/2025 12:20, Frédéric Yhuel wrote:
> On 9/22/25 23:15, Andrei Lepikhov wrote:
>> It may solve at least one issue with the 'dependencies' statistics: a
>> single number describing the dependency between any two values in the
>> columns often leads to incorrect estimations, as I see it.
>
> For what it's worth, I've never encountered a case in my life as a
> PostgreSQL support engineer where the 'dependencies' kind could be
> useful. I only successfully used the 'mcv' kind once (and that was only
> partially successful, as it fixed the estimates but not the plan).

Thanks for your feedback! I also don't think the 'dependencies'
statistics are very useful now, especially considering how many
computational resources they need when multiple columns are involved.

But is it the same for the 'ndistinct' statistics? It seems you should
love it -- the number of groups in GROUP BY, DISTINCT, and even Hash
Join should be estimated more precisely, no?

-- 
regards, Andrei Lepikhov
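A hedged illustration of the 'ndistinct' kind Andrei mentions (the table
and columns are invented for the example):

CREATE TABLE orders_demo (city text, zip text);
CREATE STATISTICS orders_nd (ndistinct) ON city, zip FROM orders_demo;
ANALYZE orders_demo;
-- The group-count estimate for this pair now comes from the multivariate
-- ndistinct coefficient rather than ndistinct(city) * ndistinct(zip):
EXPLAIN SELECT city, zip FROM orders_demo GROUP BY city, zip;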
On 9/23/25 12:43, Andrei Lepikhov wrote:
> But is it the same for the 'ndistinct' statistics? It seems you should
> love it -- the number of groups in GROUP BY, DISTINCT, and even Hash
> Join should be estimated more precisely, no?

I think it has more potential, and I would love to use this weapon, but
I haven't had the opportunity yet. It would be interesting to know how
much it is used in real life.

To get back to the topic of partitioned statistics, do you know if SQL
Server is smart enough to handle this case [1] that we discussed last
year? (with filtered statistics)

[1] https://www.postgresql.org/message-id/flat/b860c71a-7cab-4d88-ad87-8c1f2eea9ae8%40dalibo.com
On 9/23/25 12:20, Frédéric Yhuel wrote:
> On 9/22/25 23:15, Andrei Lepikhov wrote:
>> I'm not sure I fully understand your case, but SQL Server demonstrates
>> an interesting approach: they have a WHERE clause attached to
>> statistics. So, having implemented this, you could separate the whole
>> range of values inside the table into 'partitions' by such a WHERE
>> condition.
>
> Yes, from what I understood of the documentation [1], this is exactly
> what I would like!

I've tested it and I can confirm that it works very well. So, on SQL
Server, you can do this:

CREATE STATISTICS FooStats ON foo (ackid, crit) WHERE crit = 'WARNING';

It would be great to have a similar feature in PostgreSQL.
On 9/23/25 15:31, Frédéric Yhuel wrote:
> To get back to the topic of partitioned statistics, do you know if SQL
> Server is smart enough to handle this case [1] that we discussed last
> year? (with filtered statistics)
>
> [1] https://www.postgresql.org/message-id/flat/b860c71a-7cab-4d88-ad87-8c1f2eea9ae8%40dalibo.com

Sorry, it doesn't make any sense. First of all, it's not possible to do
something like this with SQL Server:

CREATE STATISTICS OrdersStats ON orders (id, product_id)
WHERE product_id IN (SELECT id FROM products WHERE name = 'babar');

This is because you need to use simple scalar expressions in the filter
clause. And even if it were possible... it would be completely useless
in this case. Sorry for the noise.