Thread: Indexes on expressions with multiple columns and operators
Hello, in the following, I don't understand why:

1) the expression index isn't used in the first EXPLAIN

2) the number of estimated rows is completely off in the second EXPLAIN,
whereas the planner could easily use the statistics of foo_f_idx.

(SQL script attached, tested with master and v17)

DROP TABLE IF EXISTS foo;

CREATE UNLOGGED TABLE foo (id bigint, ackid int, crit text);

ALTER TABLE foo ALTER COLUMN crit SET statistics 400;

INSERT INTO foo SELECT i, NULL, CASE WHEN i%100=1 THEN 'WARNING' ELSE 'INFO' END
FROM generate_series(1,100000) AS T(i);

UPDATE foo SET ackid = random()*10000 WHERE id%100=1 AND id > 500;

CREATE INDEX foo_expr_idx ON foo ((ackid IS NULL AND crit = 'WARNING'));

ANALYZE foo;

EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo
WHERE ackid IS NULL AND crit = 'WARNING';

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..1797.00 rows=990 width=17) (actual time=0.012..23.932 rows=5.00 loops=1)
   Filter: ((ackid IS NULL) AND (crit = 'WARNING'::text))
   Rows Removed by Filter: 99995
(3 rows)

CREATE OR REPLACE FUNCTION f(crit text, ackid int) RETURNS bool
LANGUAGE plpgsql IMMUTABLE AS
$$
BEGIN
    RETURN crit = 'WARNING' AND ackid IS NULL;
END
$$;

CREATE INDEX foo_f_idx ON foo (f(crit, ackid));

ANALYZE foo;

EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo WHERE f(crit, ackid);

                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using foo_f_idx on foo  (cost=0.29..8.39 rows=33333 width=17) (actual time=0.021..0.028 rows=5.00 loops=1)
   Index Cond: (f(crit, ackid) = true)
   Index Searches: 1
(3 rows)

SELECT tablename, most_common_vals, most_common_freqs FROM pg_stats
WHERE tablename LIKE 'foo_%';

  tablename   | most_common_vals | most_common_freqs
--------------+------------------+-------------------
 foo_expr_idx | {f,t}            | {0.99995,5e-05}
 foo_f_idx    | {f,t}            | {0.99995,5e-05}

SELECT 100000 * 5e-05 AS the_row_estimate_that_the_planner_should_use;

 the_row_estimate_that_the_planner_should_use
-----------------------------------------------
                                       5.00000

Best regards,
Frédéric
On Wed, 2025-09-17 at 15:55 +0200, Frédéric Yhuel wrote:
> Hello, in the following, I don't understand why:
>
> 1) the expression index isn't used in the first EXPLAIN
>
> 2) the number of estimated rows is completely off in the second EXPLAIN,
> whereas the planner could easily use the statistics of foo_f_idx.
>
> (SQL script attached, tested with master and v17)
>
> DROP TABLE IF EXISTS foo;
>
> CREATE UNLOGGED TABLE foo (id bigint, ackid int, crit text);
>
> ALTER TABLE foo ALTER COLUMN crit SET statistics 400;
>
> INSERT INTO foo SELECT i, NULL, CASE WHEN i%100=1 THEN 'WARNING' ELSE
> 'INFO' END FROM generate_series(1,100000) AS T(i);
>
> UPDATE foo SET ackid = random()*10000 WHERE id%100=1 AND id > 500;
>
> CREATE INDEX foo_expr_idx ON foo ((ackid IS NULL AND crit = 'WARNING'));
>
> ANALYZE foo;
>
> EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo WHERE
> ackid IS NULL AND crit = 'WARNING';
>                                                QUERY PLAN
> -------------------------------------------------------------------------------------------------------
>  Seq Scan on foo  (cost=0.00..1797.00 rows=990 width=17) (actual time=0.012..23.932 rows=5.00 loops=1)
>    Filter: ((ackid IS NULL) AND (crit = 'WARNING'::text))
>    Rows Removed by Filter: 99995
> (3 rows)

As far as I know, PostgreSQL considers "ackid IS NULL" and
"crit = 'WARNING'" as two different RestrictInfos.  See the comment:

/*
 * Restriction clause info.
 *
 * We create one of these for each AND sub-clause of a restriction condition
 * (WHERE or JOIN/ON clause).  Since the restriction clauses are logically
 * ANDed, we can use any one of them or any subset of them to filter out
 * tuples, without having to evaluate the rest.  The RestrictInfo node itself
 * stores data used by the optimizer while choosing the best query plan.

PostgreSQL doesn't consider the (exotic) case where someone creates an
index on an expression that contains an AND.

You could do this:

EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF)
SELECT * FROM foo WHERE (ackid IS NULL AND crit = 'WARNING') IS TRUE;

                                                       QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 Index Scan using foo_expr_idx on foo  (cost=0.29..8.39 rows=5 width=17) (actual time=0.011..0.018 rows=5.00 loops=1)
   Index Cond: (((ackid IS NULL) AND (crit = 'WARNING'::text)) = true)
   Index Searches: 1
(3 rows)

But really, you should create a better index, perhaps

CREATE INDEX ON foo (crit) WHERE ackid IS NULL;

Yours,
Laurenz Albe
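A back-of-the-envelope check of the rows=990 estimate above, assuming the
planner simply multiplies the two per-clause selectivities (the counts
follow from the INSERT and UPDATE in the original script):

-- null_frac(ackid): ids with id%100=1 AND id>500 are 501, 601, ..., 99901,
-- i.e. 995 rows got a non-NULL ackid, so null_frac = 99005/100000 = 0.99005
-- freq(crit = 'WARNING'): 1000/100000 = 0.01
-- independent product: 100000 * 0.99005 * 0.01 = 990.05, hence rows=990
SELECT round(100000 * 0.99005 * 0.01) AS seq_scan_row_estimate;  -- 990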
Frédéric Yhuel <frederic.yhuel@dalibo.com> writes:
> Hello, in the following, I don't understand why:
> 1) the expression index isn't used in the first EXPLAIN

The planner doesn't look for multi-clause matches of that sort.
You could apply a little ju-jitsu perhaps:

regression=# EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo WHERE (ackid IS NULL AND crit = 'WARNING') is true;
                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Scan using foo_expr_idx on foo  (cost=0.29..8.39 rows=5 width=17) (actual time=0.013..0.016 rows=5.00 loops=1)
   Index Cond: (((ackid IS NULL) AND (crit = 'WARNING'::text)) = true)
   Index Searches: 1
(3 rows)

but my own tendency would be to use a partial index rather than a
boolean-valued index:

regression=# CREATE INDEX foo_partial_idx ON foo (id) WHERE ackid IS NULL AND crit = 'WARNING';
CREATE INDEX
regression=# EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo WHERE ackid IS NULL AND crit = 'WARNING';
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using foo_partial_idx on foo  (cost=0.13..107.18 rows=990 width=17) (actual time=0.010..0.014 rows=5.00 loops=1)
   Index Searches: 1
(2 rows)

The advantage of a partial index is you might be able to have the
index entries themselves carry some other column(s), allowing
more queries to be made into index-only scans.  I put "id" here,
which might or might not be of any use in this specific toy example.

> 2) the number of estimated rows is completely off in the second EXPLAIN,
> whereas the planner could easily use the statistics of foo_f_idx.

Hmm, not sure about that.  Again, boolean-valued indexes aren't
something we've worked on too hard, but I don't see why that
would affect this case.

			regards, tom lane
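A hedged sketch of that covering-index idea (INCLUDE exists since
PostgreSQL 11; the column choice is illustrative, not from the original
mail):

-- Carry crit in the leaf entries; ackid is implied by the predicate:
CREATE INDEX foo_partial_cov_idx ON foo (id) INCLUDE (crit)
    WHERE ackid IS NULL AND crit = 'WARNING';
-- Every column this query touches is now in the index, so it can become
-- an index-only scan once the visibility map is up to date:
EXPLAIN SELECT id, crit FROM foo WHERE ackid IS NULL AND crit = 'WARNING';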
Thank you Laurenz and Tom! I'm going to quote Tom's email here:

On 9/17/25 16:41, Tom Lane wrote:
> Frédéric Yhuel <frederic.yhuel@dalibo.com> writes:
>> Hello, in the following, I don't understand why:
>> 1) the expression index isn't used in the first EXPLAIN
>
> The planner doesn't look for multi-clause matches of that sort.
> You could apply a little ju-jitsu perhaps:
>
> regression=# EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo WHERE (ackid IS NULL AND crit = 'WARNING') is true;
>                                                        QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
>  Index Scan using foo_expr_idx on foo  (cost=0.29..8.39 rows=5 width=17) (actual time=0.013..0.016 rows=5.00 loops=1)
>    Index Cond: (((ackid IS NULL) AND (crit = 'WARNING'::text)) = true)
>    Index Searches: 1
> (3 rows)

Thanks, it works well indeed.

> but my own tendency would be to use a partial index rather than a
> boolean-valued index:
>
> regression=# CREATE INDEX foo_partial_idx ON foo (id) WHERE ackid IS NULL AND crit = 'WARNING';
> CREATE INDEX
> regression=# EXPLAIN (ANALYZE, SUMMARY OFF, BUFFERS OFF) SELECT * FROM foo WHERE ackid IS NULL AND crit = 'WARNING';
>                                                           QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------
>  Index Scan using foo_partial_idx on foo  (cost=0.13..107.18 rows=990 width=17) (actual time=0.010..0.014 rows=5.00 loops=1)
>    Index Searches: 1
> (2 rows)
>
> The advantage of a partial index is you might be able to have the
> index entries themselves carry some other column(s), allowing
> more queries to be made into index-only scans.  I put "id" here,
> which might or might not be of any use in this specific toy example.

Yes, Laurenz made a similar suggestion, but the problem is that I'm
mostly interested in the estimated number of output rows... because in
the real query, there's a very bad Hash Join above (the Nested Loop is
*much* faster).

>> 2) the number of estimated rows is completely off in the second EXPLAIN,
>> whereas the planner could easily use the statistics of foo_f_idx.
>
> Hmm, not sure about that.  Again, boolean-valued indexes aren't
> something we've worked on too hard, but I don't see why that
> would affect this case.

OK, thanks anyway, I think the ju-jitsu mentioned above will do, even
though the application code will have to be patched.
On 9/17/25 16:57, Frédéric Yhuel wrote:
> Yes, Laurenz made a similar suggestion, but the problem is that I'm
> mostly interested in the estimated number of output rows... because in
> the real query, there's a very bad Hash Join above (the Nested Loop is
> *much* faster).

BTW, I've also tested another solution: partitioning on 'criticity', with
one partition for 'INFO' (99% of the rows), and another one for the other
values ('WARNING' and 'ALARM' in the real case). The statistics are much
better... however an expression index would be an easier fix.

Multivariate MCV statistics don't work well in the real case (100M rows
in the table).
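A minimal sketch of the partitioning scheme Frédéric describes, assuming
the table can be rebuilt (the names are invented for illustration):

CREATE TABLE foo_part (id bigint, ackid int, crit text) PARTITION BY LIST (crit);
CREATE TABLE foo_part_info  PARTITION OF foo_part FOR VALUES IN ('INFO');
CREATE TABLE foo_part_other PARTITION OF foo_part DEFAULT;  -- 'WARNING', 'ALARM', ...
-- ANALYZE gathers per-partition statistics, so ackid's null_frac is
-- computed separately for the small non-INFO partition:
ANALYZE foo_part;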
Hi there,

I think this discussion has a nice solution, thank you!

However, while poking around this issue yesterday, we also found something
surprising between estimated rows and costs when using a function. Below
is the scenario to apply on top of Frédéric's one to quickly expose the
weirdness:

CREATE OR REPLACE FUNCTION s(crit text, ackid int) RETURNS bool
LANGUAGE plpgsql IMMUTABLE AS
$$
BEGIN
    RETURN crit = 'WARNING' AND ackid IS NULL;
END
$$;

CREATE INDEX foo_s_idx ON foo (s(crit, ackid));

ANALYZE foo;

EXPLAIN SELECT * FROM foo WHERE s(crit, ackid);
EXPLAIN SELECT * FROM foo WHERE (ackid IS NULL AND crit = 'WARNING') is true;

SELECT most_common_vals, most_common_freqs FROM pg_stats
WHERE tablename = 'foo_s_idx';

On a fresh instance from HEAD with its default configuration, it shows:

 Index Scan using foo_s_idx on foo  (cost=0.29..8.39 rows=33333 width=13)
   Index Cond: (s(crit, ackid) = true)

 Index Scan using foo_expr_idx on foo  (cost=0.29..8.39 rows=5 width=13)
   Index Cond: (((ackid IS NULL) AND (crit = 'WARNING'::text)) = true)

 most_common_vals | most_common_freqs
------------------+-------------------
 {f,t}            | {0.99995,5e-05}

The statistics shown in the "pg_stats" view for function "s()" look good.
The query even has the same costs as the query using the syntax tip you
provided before. However, the estimated row count seems wrong with regard
to the costs shown and the statistics. It looks like a default hardcoded
33% has been applied.

As Frédéric said earlier, this bad row estimate drives the upper join to
the wrong method.

Is this a known planner behavior?

(Again, the ju-jitsu syntax provided by Laurenz and you is definitely on
point here; I just hijack the thread to discuss this weirdness.)

Regards,
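For what it's worth, 33333 is exactly what a hardcoded one-third default
selectivity would produce on this 100000-row table:

SELECT round(100000 * 0.3333333) AS suspected_default_estimate;  -- 33333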
Frédéric Yhuel <frederic.yhuel@dalibo.com> writes:
> On 9/17/25 16:41, Tom Lane wrote:
>> Frédéric Yhuel <frederic.yhuel@dalibo.com> writes:
>>> 2) the number of estimated rows is completely off in the second EXPLAIN,
>>> whereas the planner could easily use the statistics of foo_f_idx.

>> Hmm, not sure about that.  Again, boolean-valued indexes aren't
>> something we've worked on too hard, but I don't see why that
>> would affect this case.

> OK, thanks anyway, I think the ju-jitsu mentioned above will do, even
> though the application code will have to be patched.

Sigh ... so the answer is this used to work (since commit 39df0f150)
and then I carelessly broke it in commit a391ff3c3.  If you try this
test case in versions 9.5..11 you get a spot-on rowcount estimate.
Serves me right for not having a test case I guess, but I'm astonished
that nobody complained sooner.

			regards, tom lane
I wrote:
> Sigh ... so the answer is this used to work (since commit 39df0f150)
> and then I carelessly broke it in commit a391ff3c3.  If you try this
> test case in versions 9.5..11 you get a spot-on rowcount estimate.
> Serves me right for not having a test case I guess, but I'm astonished
> that nobody complained sooner.

The attached fixes things so it works like it did pre-a391ff3c3.

I spent some time trying to devise a test case, and was reminded
of why I didn't have one before: it's hard to make a case that
will be robust enough to not show diffs in the buildfarm.
I'll keep thinking about that though.

			regards, tom lane

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 5d51f97f219..d0f516b7645 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -874,6 +874,10 @@ clause_selectivity_ext(PlannerInfo *root,
 									varRelid,
 									jointype,
 									sjinfo);
+
+		/* If no support, fall back on boolvarsel */
+		if (s1 < 0)
+			s1 = boolvarsel(root, clause, varRelid);
 	}
 	else if (IsA(clause, ScalarArrayOpExpr))
 	{
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index f8641204a67..da5d901ec3c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -2143,9 +2143,8 @@ join_selectivity(PlannerInfo *root,
 /*
  * function_selectivity
  *
- *	  Returns the selectivity of a specified boolean function clause.
- *	  This code executes registered procedures stored in the
- *	  pg_proc relation, by calling the function manager.
+ *	  Attempt to estimate the selectivity of a specified boolean function clause
+ *	  by asking its support function.  If the function lacks support, return -1.
  *
  * See clause_selectivity() for the meaning of the additional parameters.
  */
@@ -2163,15 +2162,8 @@ function_selectivity(PlannerInfo *root,
 	SupportRequestSelectivity req;
 	SupportRequestSelectivity *sresult;
 
-	/*
-	 * If no support function is provided, use our historical default
-	 * estimate, 0.3333333.  This seems a pretty unprincipled choice, but
-	 * Postgres has been using that estimate for function calls since 1992.
-	 * The hoariness of this behavior suggests that we should not be in too
-	 * much hurry to use another value.
-	 */
 	if (!prosupport)
-		return (Selectivity) 0.3333333;
+		return (Selectivity) -1;	/* no support function */
 
 	req.type = T_SupportRequestSelectivity;
 	req.root = root;
@@ -2188,9 +2180,8 @@ function_selectivity(PlannerInfo *root,
 		DatumGetPointer(OidFunctionCall1(prosupport,
 										 PointerGetDatum(&req)));
 
-	/* If support function fails, use default */
 	if (sresult != &req)
-		return (Selectivity) 0.3333333;
+		return (Selectivity) -1;	/* function did not honor request */
 
 	if (req.selectivity < 0.0 || req.selectivity > 1.0)
 		elog(ERROR, "invalid function selectivity: %f", req.selectivity);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 1c480cfaaf7..e5e066a5537 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -1528,6 +1528,17 @@ boolvarsel(PlannerInfo *root, Node *arg, int varRelid)
 		selec = var_eq_const(&vardata, BooleanEqualOperator, InvalidOid,
 							 BoolGetDatum(true), false, true, false);
 	}
+	else if (is_funcclause(arg))
+	{
+		/*
+		 * If we have no stats and it's a function call, estimate 0.3333333.
+		 * This seems a pretty unprincipled choice, but Postgres has been
+		 * using that estimate for function calls since 1992.  The hoariness
+		 * of this behavior suggests that we should not be in too much hurry
+		 * to use another value.
+		 */
+		selec = 0.3333333;
+	}
 	else
 	{
 		/* Otherwise, the default estimate is 0.5 */
On Thu, Sep 18, 2025 at 13:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Sigh ... so the answer is this used to work (since commit 39df0f150)
>> and then I carelessly broke it in commit a391ff3c3.  If you try this
>> test case in versions 9.5..11 you get a spot-on rowcount estimate.
>> Serves me right for not having a test case I guess, but I'm astonished
>> that nobody complained sooner.
>
> The attached fixes things so it works like it did pre-a391ff3c3.
> I spent some time trying to devise a test case, and was reminded
> of why I didn't have one before: it's hard to make a case that
> will be robust enough to not show diffs in the buildfarm.
> I'll keep thinking about that though.
One question: would it make a difference to return -1.0 (floating-point
notation)? If not, static analyzers would have less work.
best regards,
Ranier Vilela
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:
> On a fresh instance from HEAD with its default configuration, it shows:

> Index Scan using foo_s_idx on foo  (cost=0.29..8.39 rows=33333 width=13)
>   Index Cond: (s(crit, ackid) = true)

> The statistics shown in the "pg_stats" view for function "s()" look good.
> The query even has the same costs as the query using the syntax tip you
> provided before.

> However, the estimated row count seems wrong with regard to the costs
> shown and the statistics.

Yeah.  The problem is that clause_selectivity_ext fails to consider
use of statistics if the clause looks like "bool_valued_function(...)".
If it looks like "bool_valued_function(...) = true", that goes down
a different code path that does the right thing.

Additional factors:

* If you just write "WHERE bool_valued_function(...) = true", that gets
stripped down to "WHERE bool_valued_function(...)" in the name of making
equivalent expressions look equivalent.  (IS TRUE doesn't get stripped,
which is why you have to use that wording to avoid that.)

* Index condition building puts back the "= true" in order to construct
something that satisfies the index AM API.  And then it uses that form
to get a selectivity estimate for costing purposes --- so the right
number goes into the indexscan cost estimate.

* But the rowcount estimate is made on the form without "= true".
That's the number shown in EXPLAIN and used when considering joins.

			regards, tom lane
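A quick sketch lining the three spellings up against foo_s_idx, following
Tom's explanation (the row estimates noted are the ones reported earlier
in the thread):

EXPLAIN SELECT * FROM foo WHERE s(crit, ackid);          -- bare function: default estimate, rows=33333
EXPLAIN SELECT * FROM foo WHERE s(crit, ackid) = true;   -- "= true" is stripped early: same estimate
EXPLAIN SELECT * FROM foo WHERE s(crit, ackid) IS TRUE;  -- IS TRUE survives and can use the index stats: rows=5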
On 9/18/25 18:40, Tom Lane wrote:
> The attached fixes things so it works like it did pre-a391ff3c3.

Indeed, it works well!

> I spent some time trying to devise a test case, and was reminded
> of why I didn't have one before: it's hard to make a case that
> will be robust enough to not show diffs in the buildfarm.
> I'll keep thinking about that though.

It looks like an interesting homework assignment for an aspiring hacker :-)
So I'll try to devise one, too.
On Thu, 18 Sep 2025 12:59:11 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> writes:
>> On a fresh instance from HEAD with its default configuration, it shows:
>
>> Index Scan using foo_s_idx on foo  (cost=0.29..8.39 rows=33333 width=13)
>>   Index Cond: (s(crit, ackid) = true)
>
>> The statistics shown in the "pg_stats" view for function "s()" look good.
>> The query even has the same costs as the query using the syntax tip you
>> provided before.
>
>> However, the estimated row count seems wrong with regard to the costs
>> shown and the statistics.
>
> Yeah.  The problem is that clause_selectivity_ext fails to consider
> use of statistics if the clause looks like "bool_valued_function(...)".
> If it looks like "bool_valued_function(...) = true", that goes down
> a different code path that does the right thing.

Oh, OK, I understand. Thanks for your explanations!

Regards,
Frédéric Yhuel <frederic.yhuel@dalibo.com> writes:
> On 9/18/25 18:40, Tom Lane wrote:
>> The attached fixes things so it works like it did pre-a391ff3c3.

> Indeed, it works well!

Thanks for testing!

>> I spent some time trying to devise a test case, and was reminded
>> of why I didn't have one before: it's hard to make a case that
>> will be robust enough to not show diffs in the buildfarm.
>> I'll keep thinking about that though.

> It looks like an interesting homework assignment for an aspiring hacker :-)
> So I'll try to devise one, too.

I concluded that maybe I was overthinking this part.  We only really
need to check the rowcount estimate, since the indexscan cost estimate
is already okay.  And stats_ext.sql seems to have gotten away with
assuming that rowcount estimates are reliably reproducible on
not-too-large tables.  This bug affects use of extended statistics
too, so a test using those is good enough to show it's fixed; we don't
really need to use an expression index for the purpose.  So, I added a
test case in stats_ext.sql and pushed it.

I wish we could back-patch this, since it's so obviously a bug fix.
But we avoid changing plans in released branches, and at this point
it's even too late for v18.

Thank you again for the report.

			regards, tom lane
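The committed test lives in stats_ext.sql; a sketch in the same spirit,
applied to this thread's example (the statistics name is mine, and
expression statistics require PostgreSQL 14 or later):

CREATE STATISTICS foo_f_stx ON (f(crit, ackid)) FROM foo;
ANALYZE foo;
-- With the fix, the bare boolean function clause falls back on boolvarsel(),
-- which can consult these statistics instead of the 0.3333333 default:
EXPLAIN SELECT * FROM foo WHERE f(crit, ackid);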
On 9/20/25 18:51, Tom Lane wrote:
> I concluded that maybe I was overthinking this part.  We only really
> need to check the rowcount estimate, since the indexscan cost estimate
> is already okay.  And stats_ext.sql seems to have gotten away with
> assuming that rowcount estimates are reliably reproducible on
> not-too-large tables.  This bug affects use of extended statistics
> too, so a test using those is good enough to show it's fixed; we don't
> really need to use an expression index for the purpose.  So, I added a
> test case in stats_ext.sql and pushed it.

OK, great! Thanks for the detailed explanation.

Regarding extended statistics, it's unfortunate that they cannot be used
in this case. Multivariate MCV statistics work as long as the number of
rows in the table is reasonably small (100K) and the STATISTICS value for
a column is high enough (it doesn't matter which column, because only the
sample size matters).

I wonder if this is an argument in favour of decoupling the sample size
and the precision of the statistics. Here, we basically want the sample
size to be as big as the table in order to include the few (NULL,
WARNING) values.

Or maybe we need a different kind of extended statistics?
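For reference, the coupling described above: ANALYZE samples roughly
300 × (the largest per-column statistics target) rows, so today the only
lever is the column target itself (a sketch, not a recommendation):

-- Raising any column's target to 10000 makes ANALYZE sample ~3,000,000
-- rows AND keep up to 10000 MCV/histogram entries for that column:
ALTER TABLE foo ALTER COLUMN crit SET STATISTICS 10000;
ANALYZE foo;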
On 22/9/2025 15:37, Frédéric Yhuel wrote:
> I wonder if this is an argument in favour of decoupling the sample size
> and the precision of the statistics. Here, we basically want the sample
> size to be as big as the table in order to include the few (NULL,
> WARNING) values.

I have also seen how repeating ANALYZE on the same database drastically
changes query plans ;(.

It seems to me that with massive samples, many of the ANALYZE algorithms
would have to be rewritten. In principle, statistics hooks exist, so it
is possible to invent an independent table analyser that scans the whole
table to get precise statistics.

-- 
regards, Andrei Lepikhov
On 9/22/25 15:57, Andrei Lepikhov wrote:
> On 22/9/2025 15:37, Frédéric Yhuel wrote:
>> I wonder if this is an argument in favour of decoupling the sample
>> size and the precision of the statistics. Here, we basically want the
>> sample size to be as big as the table in order to include the few
>> (NULL, WARNING) values.
>
> I have also seen how repeating ANALYZE on the same database drastically
> changes query plans ;(.
>
> It seems to me that with massive samples, many of the ANALYZE algorithms
> would have to be rewritten. In principle, statistics hooks exist, so it
> is possible to invent an independent table analyser that scans the whole
> table to get precise statistics.

Interesting! I wonder how difficult it would be.

However, in this specific case, I realised that it wouldn't solve the
issue of ANALYZE being triggered when there are zero rows with (ackid,
crit) = (NULL, WARNING).

Partitioning would still work in this case, though, because ackid's
null_frac would be zero for the partition containing the 'WARNING' value.

I wonder if we could devise another kind of extended statistic that would
provide these "partitioned statistics" without actually partitioning.
On 22/9/2025 18:09, Frédéric Yhuel wrote:
> I wonder if we could devise another kind of extended statistic that
> would provide these "partitioned statistics" without actually
> partitioning.

I'm not sure I fully understand your case, but SQL Server demonstrates an
interesting approach: they have a WHERE clause attached to statistics.
So, having implemented this, you could separate the whole range of values
inside the table into 'partitions' by such a WHERE condition.

It may solve at least one issue with the 'dependencies' statistics: a
single number describing the dependency between any two values in the
columns often leads to incorrect estimations, as I see it.

-- 
regards, Andrei Lepikhov
On 9/22/25 23:15, Andrei Lepikhov wrote:
> I'm not sure I fully understand your case, but SQL Server demonstrates
> an interesting approach: they have a WHERE clause attached to
> statistics. So, having implemented this, you could separate the whole
> range of values inside the table into 'partitions' by such a WHERE
> condition.

Yes, from what I understood of the documentation [1], this is exactly
what I would like!

> It may solve at least one issue with the 'dependencies' statistics: a
> single number describing the dependency between any two values in the
> columns often leads to incorrect estimations, as I see it.

For what it's worth, I've never encountered a case in my life as a
PostgreSQL support engineer where the 'dependencies' kind could be
useful. I only successfully used the 'mcv' kind once (and that was only
partially successful, as it fixed the estimates but not the plan).

[1] https://learn.microsoft.com/en-us/sql/t-sql/statements/create-statistics-transact-sql?view=sql-server-ver17#c-use-create-statistics-to-create-filtered-statistics
On 23/9/2025 12:20, Frédéric Yhuel wrote:
> On 9/22/25 23:15, Andrei Lepikhov wrote:
>> It may solve at least one issue with the 'dependencies' statistics: a
>> single number describing the dependency between any two values in the
>> columns often leads to incorrect estimations, as I see it.
>
> For what it's worth, I've never encountered a case in my life as a
> PostgreSQL support engineer where the 'dependencies' kind could be
> useful. I only successfully used the 'mcv' kind once (and that was only
> partially successful, as it fixed the estimates but not the plan).

Thanks for your feedback! I also don't think the 'dependencies'
statistics are very useful now, especially considering how many
computational resources they need when multiple columns are involved.

But is it the same for the 'ndistinct' statistics? It seems you should
love it -- the number of groups in GROUP BY, DISTINCT, and even Hash
Join should be estimated more precisely, no?

-- 
regards, Andrei Lepikhov
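A hedged illustration of the 'ndistinct' kind Andrei mentions (the table
and columns are invented for the example):

CREATE TABLE orders_demo (city text, zip text);
CREATE STATISTICS orders_nd (ndistinct) ON city, zip FROM orders_demo;
ANALYZE orders_demo;
-- The group-count estimate for this pair now comes from the multivariate
-- ndistinct coefficient rather than ndistinct(city) * ndistinct(zip):
EXPLAIN SELECT city, zip FROM orders_demo GROUP BY city, zip;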
On 9/23/25 12:43, Andrei Lepikhov wrote:
> But is it the same for the 'ndistinct' statistics? It seems you should
> love it -- the number of groups in GROUP BY, DISTINCT, and even Hash
> Join should be estimated more precisely, no?

I think it has more potential, and I would love to use this weapon, but
I haven't had the opportunity yet. It would be interesting to know how
much it is used in real life.

To get back to the topic of partitioned statistics, do you know if SQL
Server is smart enough to handle this case [1] that we discussed last
year? (with filtered statistics)

[1] https://www.postgresql.org/message-id/flat/b860c71a-7cab-4d88-ad87-8c1f2eea9ae8%40dalibo.com
On 9/23/25 12:20, Frédéric Yhuel wrote:
> On 9/22/25 23:15, Andrei Lepikhov wrote:
>> I'm not sure I fully understand your case, but SQL Server demonstrates
>> an interesting approach: they have a WHERE clause attached to
>> statistics. So, having implemented this, you could separate the whole
>> range of values inside the table into 'partitions' by such a WHERE
>> condition.
>
> Yes, from what I understood of the documentation [1], this is exactly
> what I would like!

I've tested it and I can confirm that it works very well. So, on SQL
Server, you can do this:

CREATE STATISTICS FooStats ON foo (ackid, crit) WHERE crit = 'WARNING';

It would be great to have a similar feature in PostgreSQL.
On 9/23/25 15:31, Frédéric Yhuel wrote:
> To get back to the topic of partitioned statistics, do you know if SQL
> Server is smart enough to handle this case [1] that we discussed last
> year? (with filtered statistics)
>
> [1] https://www.postgresql.org/message-id/flat/b860c71a-7cab-4d88-ad87-8c1f2eea9ae8%40dalibo.com

Sorry, it doesn't make any sense. First of all, it's not possible to do
something like this with SQL Server:

CREATE STATISTICS OrdersStats ON orders (id, product_id)
WHERE product_id IN (SELECT id FROM products WHERE name = 'babar');

This is because you need to use simple scalar expressions in the filter
clause. And even if it were possible... it would be completely useless
in this case. Sorry for the noise.