Discussion: PoC/WIP: Extended statistics on expressions
Hi,

Attached is a PoC/WIP patch adding support for extended statistics on expressions. This is by no means "ready" - most of the stuff works, but often in a rather hackish way. I certainly don't expect this to pass regression tests, for example. There's an example demonstrating how this works for two queries at the end of this message.

Now let's talk about the main parts of the patch:

1) extending grammar to allow expressions, not just plain columns

Fairly straightforward, I think. I'm sure the logic deciding which expressions are allowed is not 100% complete (e.g. volatile functions etc.) but that's a detail we can deal with later.

2) store the expressions in pg_statistic_ext catalog

I ended up adding a separate column, similar to indexprs, except that the order of columns/expressions does not matter, so we don't need to bother with storing 0s in stxkeys - we simply consider expressions to be "after" all the simple columns.

3) build statistics

This works for all three types of statistics we have (mcv, dependencies and ndistinct), although the code changes are often very hackish "to make it work". The main challenge here was how to represent the expressions in the statistics - e.g. in ndistinct, which tracks ndistinct estimates for combinations of parameters, and so far we used attnums for that. I decided the easiest way is to keep doing that, but offset the expressions by MaxHeapAttributeNumber. That seems to work, but maybe there's a better way.

4) apply the statistics

This is the hard part, really, and the exact state of the support depends on the type of statistics.

For ndistinct coefficients, it generally works. I'm sure there may be bugs in estimate_num_groups, etc. but in principle it works.

For MCV lists, it generally works too - you can define statistics on the expressions and the estimates should improve. The main downside here is that it requires at least two expressions, otherwise we can't build/apply the extended statistics. So for example

SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0

may be estimated "correctly", but once you drop either of the conditions it gets much worse, as we don't have stats for the individual expressions. That's rather annoying - it does not break the extended MCV, but the behavior will certainly cause confusion.

For functional dependencies, the estimation does not work yet. Also, the missing per-column statistics have a bigger impact than on MCV, because while MCV can work fine without them, the dependencies rely heavily on the per-column estimates. We only apply "corrections" based on the dependency degree, so we still need (good) per-column estimates, which does not quite work with the expressions.

Of course, the lack of per-expression statistics may be somewhat fixed by adding indexes on expressions, but that's kinda expensive.

Now, a simple example demonstrating how this improves estimates - let's create a table with 1M rows, and run queries with mod() expressions on it. It might be date_trunc() or something similar - that'd work too.
table with 1M rows
==================

test=# create table t (a int);
test=# insert into t select i from generate_series(1,1000000) s(i);
test=# analyze t;

poorly estimated queries
========================

test=# explain (analyze, timing off) select * from t where mod(a,3) = 0 and mod(a,7) = 0;

                                    QUERY PLAN
----------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=25 width=4) (actual rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
 Planning Time: 0.329 ms
 Execution Time: 156.675 ms
(5 rows)

test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t group by 1, 2;

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 HashAggregate  (cost=75675.00..98487.50 rows=1000000 width=8) (actual rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Planned Partitions: 32  Batches: 1  Memory Usage: 1561kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.277 ms
 Execution Time: 502.803 ms
(6 rows)

improved estimates
==================

test=# create statistics s1 (ndistinct) on mod(a,3), mod(a,7) from t;
test=# analyze t;
test=# explain (analyze, timing off) select mod(a,3), mod(a,7) from t group by 1, 2;

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 HashAggregate  (cost=24425.00..24425.31 rows=21 width=8) (actual rows=21 loops=1)
   Group Key: mod(a, 3), mod(a, 7)
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on t  (cost=0.00..19425.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.135 ms
 Execution Time: 500.092 ms
(6 rows)

test=# create statistics s2 (mcv) on mod(a,3), mod(a,7) from t;
test=# analyze t;
test=# explain (analyze, timing off) select * from t where mod(a,3) = 0 and mod(a,7) = 0;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..24425.00 rows=46433 width=4) (actual rows=47619 loops=1)
   Filter: ((mod(a, 3) = 0) AND (mod(a, 7) = 0))
   Rows Removed by Filter: 952381
 Planning Time: 0.702 ms
 Execution Time: 152.280 ms
(5 rows)

Clearly, estimates for both queries are significantly improved. Of course, this example is kinda artificial/simplistic.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On 11/16/20 2:49 PM, Tomas Vondra wrote:
> Hi,
>
> ...
>
> 4) apply the statistics
>
> This is the hard part, really, and the exact state of the support
> depends on type of statistics.
>
> For ndistinct coefficients, it generally works. I'm sure there may be
> bugs in estimate_num_groups, etc. but in principle it works.
>
> For MCV lists, it generally works too - you can define statistics on
> the expressions and the estimates should improve. The main downside
> here is that it requires at least two expressions, otherwise we can't
> build/apply the extended statistics. So for example
>
> SELECT * FROM t WHERE mod(a,100) = 10 AND mod(b,11) = 0
>
> may be estimated "correctly", once you drop any of the conditions it
> gets much worse as we don't have stats for individual expressions.
> That's rather annoying - it does not break the extended MCV, but the
> behavior will certainly cause confusion.
>
> For functional dependencies, the estimation does not work yet. Also,
> the missing per-column statistics have bigger impact than on MCV,
> because while MCV can work fine without it, the dependencies heavily
> rely on the per-column estimates. We only apply "corrections" based
> on the dependency degree, so we still need (good) per-column
> estimates, which does not quite work with the expressions.
>
> Of course, the lack of per-expression statistics may be somewhat
> fixed by adding indexes on expressions, but that's kinda expensive.
>

FWIW after re-reading [1], I think the plan to build pg_statistic rows for expressions and stash them in pg_statistic_ext_data is the way to go.

I was thinking that maybe we'll need some new statistics type to request this, e.g.

CREATE STATISTICS s (expressions) ON ...

but on second thought I think we should just build this whenever there are expressions in the definition. It'll require some changes (e.g. we require at least two items in the list, but we'll want to allow building stats on a single expression too, I guess), but that's doable. Of course, we don't have any catalogs with composite types yet, so it's not 100% sure this will work, but it's worth a try.

regards

[1] https://www.postgresql.org/message-id/flat/6331.1579041473%40sss.pgh.pa.us#5ec6af7583e84cef2ca6a9e8a713511e

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

attached is a significantly improved version of the patch, allowing extended statistics to be defined on expressions. This fixes most of the problems in the previous WIP version and AFAICS it does pass all regression tests (including under valgrind). There's a bunch of FIXMEs and a couple of loose ends, but overall I think it's ready for reviews.

Overall, the patch does two main things:

* it adds a new "expressions" statistics kind, building per-expression statistics (i.e. it's similar to having an expression index)

* it allows using expressions in the definition of extended statistics, and properly handles that in all existing statistics kinds (dependencies, mcv, ndistinct)

The expression handling mostly copies what we do for indexes, with similar restrictions - no volatile functions, aggregates etc. The list of expressions is stored in the pg_statistic_ext catalog, but unlike for indexes we don't need to worry about the exact order of elements, so there are no "0" entries for expressions in stxkeys etc. We simply assume the expressions come after the simple columns, and that's it.

To reference expressions in the built statistics (e.g. in a dependency) we use "special attnums" computed from the expression index by adding MaxHeapAttributeNumber. So the first expression has attnum 1601 etc. This mapping of expressions to attnums is used both while building and while applying the statistics to clauses, as it makes the whole process much simpler than dealing with attnums and expressions entirely separately.
The first part allows us to do something like this:

CREATE TABLE t (a int);
INSERT INTO t SELECT i FROM generate_series(1,1000000) s(i);
ANALYZE t;

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE mod(a,10) = 0;

CREATE STATISTICS s (expressions) ON mod(a,10) FROM t;
ANALYZE t;

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE mod(a,10) = 0;

Without the statistics we get this:

                              QUERY PLAN
--------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..19425.00 rows=5000 width=4) (actual rows=100000 loops=1)
   Filter: (mod(a, 10) = 0)
   Rows Removed by Filter: 900000
 Planning Time: 0.216 ms
 Execution Time: 157.552 ms
(5 rows)

while with the statistics we get this:

                              QUERY PLAN
----------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..19425.00 rows=100900 width=4) (actual rows=100000 loops=1)
   Filter: (mod(a, 10) = 0)
   Rows Removed by Filter: 900000
 Planning Time: 0.399 ms
 Execution Time: 157.530 ms
(5 rows)

So that's a pretty nice improvement. In practice you could get the same effect by creating an expression index

CREATE INDEX ON t (mod(a,10));

but of course that's far from free - there's a cost to maintain the index, it blocks HOT, and it takes space on disk. The statistics have none of these issues.

Implementation-wise, this simply builds per-column statistics for each expression, and stashes them into a new column in the pg_statistic_ext_data catalog as an array of elements of the pg_statistic composite type. Then, in selfuncs.c, we look not just at indexes but also at this when looking for expression stats. So that gives us the per-expression stats.

This is enabled by default when you don't specify the statistics type and the definition includes any expression that is not a simple column reference. Otherwise you may also request it explicitly by using "expressions" in the CREATE.

Now, the second part is really just a natural extension of the existing stats to also work with expressions.
The easiest thing is probably to show some examples, so consider this:

CREATE TABLE t (a INT, b INT, c INT);
INSERT INTO t SELECT i, i, i FROM generate_series(1,1000000) s(i);
ANALYZE t;

which without any statistics gives us this:

EXPLAIN (ANALYZE, TIMING OFF) SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0;

                             QUERY PLAN
------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..25406.00 rows=25 width=4) (actual rows=100000 loops=1)
   Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0))
   Rows Removed by Filter: 900000
 Planning Time: 0.080 ms
 Execution Time: 161.445 ms
(5 rows)

EXPLAIN (ANALYZE, TIMING OFF) SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5);

                                     QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=76656.00..99468.50 rows=1000000 width=12) (actual rows=10 loops=1)
   Group Key: mod(a, 10), mod(b, 5)
   Planned Partitions: 32  Batches: 1  Memory Usage: 1561kB
   ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.232 ms
 Execution Time: 514.446 ms
(6 rows)

and now let's add statistics on the expressions:

CREATE STATISTICS s ON mod(a,10), mod(b,5) FROM t;
ANALYZE t;

which ends up with these spot-on estimates:

EXPLAIN (ANALYZE, TIMING OFF) SELECT 1 FROM t WHERE mod(a,10) = 0 AND mod(b,5) = 0;

                             QUERY PLAN
---------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..25406.00 rows=97400 width=4) (actual rows=100000 loops=1)
   Filter: ((mod(a, 10) = 0) AND (mod(b, 5) = 0))
   Rows Removed by Filter: 900000
 Planning Time: 0.366 ms
 Execution Time: 159.207 ms
(5 rows)

EXPLAIN (ANALYZE, TIMING OFF) SELECT 1 FROM t GROUP BY mod(a,10), mod(b,5);

                                     QUERY PLAN
-----------------------------------------------------------------------------------------
 HashAggregate  (cost=25406.00..25406.15 rows=10 width=12) (actual rows=10 loops=1)
   Group Key: mod(a, 10), mod(b, 5)
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.299 ms
 Execution Time: 530.793 ms
(6 rows)

Of course, this is a very simple query, but hopefully you get the idea. There are two main areas where issues might be hiding:

1) We're kinda faking the pg_statistic entries, and I suppose there might be some loose ends (e.g. with respect to ACLs etc.).

2) Memory management while evaluating the expressions during analyze is kinda simplistic; we're probably keeping the memory around longer than needed etc.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:
> attached is a significantly improved version of the patch, allowing
> defining extended statistics on expressions. This fixes most of the
> problems in the previous WIP version and AFAICS it does pass all
> regression tests (including under valgrind). There's a bunch of FIXMEs
> and a couple loose ends, but overall I think it's ready for reviews.

I was looking at the previous patch, so now read this one instead, and attach some proposed fixes.

+ * This matters especially for * expensive expressions, of course.

+ The expression can refer only to columns of the underlying table, but
+ it can use all columns, not just the ones the statistics is defined
+ on.

I don't know what these are trying to say?

+ errmsg("statistics expressions and predicates can refer only to the table being indexed")));
+ * partial-index predicates. Create it in the per-index context to be

I think these are copied and shouldn't mention "indexes" or "predicates". Or should statistics support predicates, too?

Idea: if a user specifies no stakinds, and there's no expression specified, then you automatically build everything except for expressional stats. But if they specify only one statistics "column", it gives an error. If that's a non-simple column reference, should that instead build *only* expressional stats (possibly with a NOTICE, since the user might be intending to make MV stats)?

I think pg_stats_ext should allow inspecting the pg_statistic data in pg_statistic_ext_data.stxdexprs. I guess array_agg() should be ordered by something, so maybe it should use ORDINALITY (?)

I hacked more on bootstrap.c so included that here.

-- 
Justin
Attachments
On 11/23/20 3:26 AM, Justin Pryzby wrote:
> On Sun, Nov 22, 2020 at 08:03:51PM +0100, Tomas Vondra wrote:
>> attached is a significantly improved version of the patch, allowing
>> defining extended statistics on expressions. This fixes most of the
>> problems in the previous WIP version and AFAICS it does pass all
>> regression tests (including under valgrind). There's a bunch of FIXMEs
>> and a couple loose ends, but overall I think it's ready for reviews.
>
> I was looking at the previous patch, so now read this one instead, and attach
> some proposed fixes.
>
> + * This matters especially for * expensive expressions, of course.

The point this was trying to make is that we evaluate the expressions only once, and use the results to build all extended statistics, instead of leaving it up to every "build" to re-evaluate them.

> + The expression can refer only to columns of the underlying table, but
> + it can use all columns, not just the ones the statistics is defined
> + on.
>
> I don't know what these are trying to say?

D'oh. That's a bogus para, copied from the CREATE INDEX docs (where it talked about the index predicate, which is irrelevant here).

> + errmsg("statistics expressions and predicates can refer only to the table being indexed")));
> + * partial-index predicates. Create it in the per-index context to be
>
> I think these are copied and shouldn't mention "indexes" or "predicates". Or
> should statistics support predicates, too ?

Right. Stupid copy-pasto.

> Idea: if a user specifies no stakinds, and there's no expression specified,
> then you automatically build everything except for expressional stats. But if
> they specify only one statistics "column", it gives an error. If that's a
> non-simple column reference, should that instead build *only* expressional
> stats (possibly with a NOTICE, since the user might be intending to make MV
> stats).
>

Right, that was the intention - but I messed up and it works only if you specify the "expressions" kind explicitly (and I also added the ERROR message to the expected output by mistake). I agree we should handle this automatically, so that

CREATE STATISTICS s ON (a+b) FROM t

works and only creates statistics for the expression.

> I think pg_stats_ext should allow inspecting the pg_statistic data in
> pg_statistic_ext_data.stxdexprs. I guess array_agg() should be ordered by
> something, so maybe it should use ORDINALITY (?)

I agree we should expose the expression statistics, but I'm not convinced we should do that in the pg_stats_ext view itself. The problem is that it's essentially a table nested in a table, with non-trivial structure, so I was thinking about adding a separate view exposing just this one part. Something like pg_stats_ext_expressions, with about the same structure as pg_stats, or something.

> I hacked more on bootstrap.c so included that here.

Thanks. As for the 0004-0007 patches:

0004 - Seems fine. IMHO not really "silly errors" but OK.

0005 - Mostly OK. The docs wording mostly comes from CREATE INDEX docs, though. The paragraph about "t1" is old, so if we want to reword it then maybe we should backpatch too.

0006 - Not sure. I think CreateStatistics can be fixed with less code, keeping it more like PG13 (good for backpatching). Not sure why rename extended statistics to multi-variate statistics - we use "extended" everywhere. Not sure what's the point of the serialize_expr_stats changes - that code is mostly copy-paste from update_attstats.

0007 - I suspect this makes pg_stats_ext too complex to work with; IMHO we should move this to a separate view.

Thanks for the review! I'll try to look more closely at those patches sometime next week, and merge most of it.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

Attached is an updated version of the patch series, merging some of the changes proposed by Justin. I've kept the bootstrap patches separate, at least for now. As for the individual 0004-0007 patches:

1) 0004 - merged as is

2) 0005 - I've merged some of the docs changes, but some of the wording was copied from the CREATE INDEX docs, in which case I've kept that. I've also not merged changes to pre-existing docs, like the t1 example which is unrelated to this patch. OTOH I've corrected the t3 example description, which was somewhat bogus and unrelated to what the example actually did. I've also removed the irrelevant para which originally described index predicates and was copied from the CREATE INDEX docs by mistake.

3) 0006 - I've committed something similar / less invasive, achieving the same goals (I think), and I've added a couple of regression tests.

4) 0007 - I agreed we need a way to expose the stats, but including this in pg_stats_ext seems rather inconvenient (a table in a table is difficult to work with). Instead I've added a new catalog pg_stats_ext_exprs with a structure similar to pg_stats. I've also added the expressions to the pg_stats_ext catalog, which was only showing the attributes, and some basic docs for the catalog changes.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
> 0004 - Seems fine. IMHO not really "silly errors" but OK.

This is one of the same issues you pointed out - shadowing a variable. Could be backpatched.

On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
>> + errmsg("statistics expressions and predicates can refer only to the table being indexed")));
>> + * partial-index predicates. Create it in the per-index context to be
>>
>> I think these are copied and shouldn't mention "indexes" or "predicates". Or
>> should statistics support predicates, too ?
>
> Right. Stupid copy-pasto.

Right, but then I was wondering if CREATE STATS should actually support predicates, since one use case is to do what indexes do without their overhead. I haven't thought about it enough yet.

> 0006 - Not sure. I think CreateStatistics can be fixed with less code,
> keeping it more like PG13 (good for backpatching). Not sure why rename
> extended statistics to multi-variate statistics - we use "extended"
> everywhere.

- if (build_expressions && (list_length(stxexprs) == 0))
+ if (!build_expressions_only && (list_length(stmt->exprs) < 2))
      ereport(ERROR,
              (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
-              errmsg("extended expression statistics require at least one expression")));
+              errmsg("multi-variate statistics require at least two columns")));

I think all of "CREATE STATISTICS" has been known as "extended stats", so I think it may be confusing to say that it requires two columns for the general facility.

> Not sure what's the point of serialize_expr_stats changes,
> that code is mostly copy-paste from update_attstats.

Right. I think "i" is a poor variable name when it isn't a loop variable and isn't of limited scope.

> 0007 - I suspect this makes the pg_stats_ext too complex to work with,
> IMHO we should move this to a separate view.

Right - then unnest() the whole thing and return one row per expression rather than an array, as you've done.
Maybe the docs should say that this returns one row per expression.

Looking quickly at your new patch: I guess you know there's a bunch of lingering references to "indexes" and "predicates".

I don't know if you want to go to the effort to prohibit this:

postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t;
CREATE STATISTICS

I think a lot of people will find this confusing:

postgres=# CREATE STATISTICS asf ON i FROM t;
ERROR:  extended statistics require at least 2 columns
postgres=# CREATE STATISTICS asf ON (i) FROM t;
CREATE STATISTICS

postgres=# CREATE STATISTICS asf (expressions) ON i FROM t;
ERROR:  extended expression statistics require at least one expression
postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t;
CREATE STATISTICS

I haven't looked, but is it possible to make it work without parens?

-- 
Justin
On 11/24/20 5:23 PM, Justin Pryzby wrote:
> On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
>> 0004 - Seems fine. IMHO not really "silly errors" but OK.
>
> This is one of the same issues you pointed out - shadowing a variable.
> Could be backpatched.
>
> On Mon, Nov 23, 2020 at 04:30:26AM +0100, Tomas Vondra wrote:
>>> + errmsg("statistics expressions and predicates can refer only to the table being indexed")));
>>> + * partial-index predicates. Create it in the per-index context to be
>>>
>>> I think these are copied and shouldn't mention "indexes" or "predicates". Or
>>> should statistics support predicates, too ?
>>
>> Right. Stupid copy-pasto.
>
> Right, but then I was wondering if CREATE STATS should actually support
> predicates, since one use case is to do what indexes do without their overhead.
> I haven't thought about it enough yet.

Well, it's not supported now, so the message is bogus. I'm not against supporting "partial statistics" with predicates in the future, but it's going to be a non-trivial project on its own. It's not something I can bolt onto the current patch easily.

>> 0006 - Not sure. I think CreateStatistics can be fixed with less code,
>> keeping it more like PG13 (good for backpatching). Not sure why rename
>> extended statistics to multi-variate statistics - we use "extended"
>> everywhere.
>
> - if (build_expressions && (list_length(stxexprs) == 0))
> + if (!build_expressions_only && (list_length(stmt->exprs) < 2))
>       ereport(ERROR,
>               (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> -              errmsg("extended expression statistics require at least one expression")));
> +              errmsg("multi-variate statistics require at least two columns")));
>
> I think all of "CREATE STATISTICS" has been known as "extended stats", so I
> think it may be confusing to say that it requires two columns for the general
> facility.
>
>> Not sure what's the point of serialize_expr_stats changes,
>> that code is mostly copy-paste from update_attstats.
>
> Right. I think "i" is poor variable name when it isn't a loop variable and not
> of limited scope.

OK, I understand. I'll consider tweaking that.

>> 0007 - I suspect this makes the pg_stats_ext too complex to work with,
>> IMHO we should move this to a separate view.
>
> Right - then unnest() the whole thing and return one row per expression rather
> than array, as you've done. Maybe the docs should say that this returns one
> row per expression.
>
> Looking quickly at your new patch: I guess you know there's a bunch of
> lingering references to "indexes" and "predicates":
>
> I don't know if you want to go to the effort to prohibit this.
> postgres=# CREATE STATISTICS asf ON (tableoid::int+1) FROM t;
> CREATE STATISTICS

Hmm, we're already rejecting system attributes, so I suppose we should do the same thing for expressions on system attributes.

> I think a lot of people will find this confusing:
>
> postgres=# CREATE STATISTICS asf ON i FROM t;
> ERROR:  extended statistics require at least 2 columns
> postgres=# CREATE STATISTICS asf ON (i) FROM t;
> CREATE STATISTICS
>
> postgres=# CREATE STATISTICS asf (expressions) ON i FROM t;
> ERROR:  extended expression statistics require at least one expression
> postgres=# CREATE STATISTICS asf (expressions) ON (i) FROM t;
> CREATE STATISTICS
>
> I haven't looked, but is it possible to make it work without parens ?

Hmm, you're right that may be surprising. I suppose we could walk the expressions while creating the statistics, and replace such trivial expressions with the nested variable, but I haven't tried. I wonder what the CREATE INDEX behavior would be in these cases.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

Attached is a patch series rebased on top of 25a9e54d2d, which improves estimation of OR clauses. There were only a couple of minor conflicts.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Attached is a patch series rebased on top of 25a9e54d2d.

After reading this thread and [1], I think I prefer the name "standard" rather than "expressions", because it is meant to describe the kind of statistics being built rather than what they apply to, but maybe that name doesn't actually need to be exposed to the end user.

Looking at the current behaviour, there are a couple of things that seem a little odd, even though they are understandable. For example, the fact that

CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;

fails, but

CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;

succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax

CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;

tends to suggest that it's going to create statistics on the pair of expressions, describing their correlation, when actually it builds 2 independent statistics. Also, this error text isn't entirely accurate:

CREATE STATISTICS s ON col FROM tbl;
ERROR:  extended statistics require at least 2 columns

because extended statistics don't always require 2 columns; they can also have just an expression, or multiple expressions and 0 or 1 columns.

I think a lot of this stems from treating "expressions" in the same way as the other (multi-column) stats kinds, and it might actually be neater to have separate documented syntaxes for single- and multi-column statistics:

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    ON (expression)
    FROM table_name

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON { column_name | (expression) } , { column_name | (expression) } [, ...]
    FROM table_name

The first syntax would create single-column stats, and wouldn't accept a statistics_kind argument, because there is only one kind of single-column statistic.
Maybe that might change in the future, but if so, it's likely that the kinds of single-column stats will be different from the kinds of multi-column stats. In the second syntax, the only accepted kinds would be the current multi-column stats kinds (ndistinct, dependencies, and mcv), and it would always build stats describing the correlations between the columns listed. It would continue to build standard/expression stats on any expressions in the list, but that's more of an implementation detail. It would no longer be possible to do "CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to issue 2 separate "CREATE STATISTICS" commands, but that seems more logical, because they're independent stats. The parsing code might not change much, but some of the errors would be different. For example, the errors "building only extended expression statistics on simple columns not allowed" and "extended expression statistics require at least one expression" would go away, and the error "extended statistics require at least 2 columns" might become more specific, depending on the stats kind. Regards, Dean [1] https://www.postgresql.org/message-id/flat/1009.1579038764%40sss.pgh.pa.us#8624792a20ae595683b574f5933dae53
On 12/7/20 10:56 AM, Dean Rasheed wrote:
> On Thu, 3 Dec 2020 at 15:23, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Attached is a patch series rebased on top of 25a9e54d2d.
>
> After reading this thread and [1], I think I prefer the name
> "standard" rather than "expressions", because it is meant to describe
> the kind of statistics being built rather than what they apply to, but
> maybe that name doesn't actually need to be exposed to the end user:
>
> Looking at the current behaviour, there are a couple of things that
> seem a little odd, even though they are understandable. For example,
> the fact that
>
> CREATE STATISTICS s (expressions) ON (expr), col FROM tbl;
>
> fails, but
>
> CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl;
>
> succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax
>
> CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl;
>
> tends to suggest that it's going to create statistics on the pair of
> expressions, describing their correlation, when actually it builds 2
> independent statistics. Also, this error text isn't entirely accurate:
>
> CREATE STATISTICS s ON col FROM tbl;
> ERROR: extended statistics require at least 2 columns
>
> because extended statistics don't always require 2 columns, they can
> also just have an expression, or multiple expressions and 0 or 1
> columns.
>
> I think a lot of this stems from treating "expressions" in the same
> way as the other (multi-column) stats kinds, and it might actually be
> neater to have separate documented syntaxes for single- and
> multi-column statistics:
>
> CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> ON (expression)
> FROM table_name
>
> CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> [ ( statistics_kind [, ... ] ) ]
> ON { column_name | (expression) } , { column_name | (expression) } [, ...]
> FROM table_name
>
> The first syntax would create single-column stats, and wouldn't accept
> a statistics_kind argument, because there is only one kind of
> single-column statistic. Maybe that might change in the future, but if
> so, it's likely that the kinds of single-column stats will be
> different from the kinds of multi-column stats.
>
> In the second syntax, the only accepted kinds would be the current
> multi-column stats kinds (ndistinct, dependencies, and mcv), and it
> would always build stats describing the correlations between the
> columns listed. It would continue to build standard/expression stats
> on any expressions in the list, but that's more of an implementation
> detail.
>
> It would no longer be possible to do "CREATE STATISTICS s
> (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to
> issue 2 separate "CREATE STATISTICS" commands, but that seems more
> logical, because they're independent stats.
>
> The parsing code might not change much, but some of the errors would
> be different. For example, the errors "building only extended
> expression statistics on simple columns not allowed" and "extended
> expression statistics require at least one expression" would go away,
> and the error "extended statistics require at least 2 columns" might
> become more specific, depending on the stats kind.
>

I think it makes sense in general. I see two issues with this approach, though:

* By adding expression/standard stats for individual statistics, it makes the list of statistics longer - I wonder if this might have measurable impact on lookups in this list.

* I'm not sure it's a good idea that the second syntax would always build the per-expression stats. Firstly, it seems a bit strange that it behaves differently than the other kinds. Secondly, I wonder if there are cases where it'd be desirable to explicitly disable building these per-expression stats. For example, what if we have multiple extended statistics objects, overlapping on a couple expressions. It seems pointless to build the stats for all of them.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
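To make the proposed split concrete, the two syntaxes would be used roughly like this (a hypothetical sketch only - the table, columns, and expressions are invented for illustration, and none of this is settled grammar):

```sql
-- Variant 1: single-expression statistics; no statistics_kind accepted,
-- because there is only one kind of single-column statistic.
CREATE STATISTICS s1 ON (date_trunc('day', ts)) FROM tbl;

-- Variant 2: multi-column statistics; only the multi-column kinds are
-- accepted, and standard stats on any expressions in the list would be
-- built implicitly, as an implementation detail.
CREATE STATISTICS s2 (ndistinct, mcv) ON a, (date_trunc('day', ts)) FROM tbl;

-- Two independent per-expression statistics would need two commands:
CREATE STATISTICS s3 ON (a + b) FROM tbl;
CREATE STATISTICS s4 ON (a * b) FROM tbl;
```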
On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 12/7/20 10:56 AM, Dean Rasheed wrote:
> > it might actually be
> > neater to have separate documented syntaxes for single- and
> > multi-column statistics:
> >
> > CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> > ON (expression)
> > FROM table_name
> >
> > CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
> > [ ( statistics_kind [, ... ] ) ]
> > ON { column_name | (expression) } , { column_name | (expression) } [, ...]
> > FROM table_name
>
> I think it makes sense in general. I see two issues with this approach,
> though:
>
> * By adding expression/standard stats for individual statistics, it
> makes the list of statistics longer - I wonder if this might have
> measurable impact on lookups in this list.
>
> * I'm not sure it's a good idea that the second syntax would always
> build the per-expression stats. Firstly, it seems a bit strange that it
> behaves differently than the other kinds. Secondly, I wonder if there
> are cases where it'd be desirable to explicitly disable building these
> per-expression stats. For example, what if we have multiple extended
> statistics objects, overlapping on a couple expressions. It seems
> pointless to build the stats for all of them.
>

Hmm, I'm not sure it would really be a good idea to build MCV stats on expressions without also building the standard stats for those expressions, otherwise the assumptions that mcv_combine_selectivities() makes about simple_sel and mcv_basesel wouldn't really hold. But then, if multiple MCV stats shared the same expression, it would be quite wasteful to build standard stats on the expression more than once.

It feels like it should build a single extended stats object for each unique expression, with appropriate dependencies for any MCV stats that used those expressions, but I'm not sure how complex that would be. Dropping the last MCV stat object using a standard expression stat object might reasonably drop the expression stats ... except if they were explicitly created by the user, independently of any MCV stats. That could get quite messy.

Regards,
Dean
On 12/7/20 5:02 PM, Dean Rasheed wrote:
> On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 12/7/20 10:56 AM, Dean Rasheed wrote:
>>> it might actually be
>>> neater to have separate documented syntaxes for single- and
>>> multi-column statistics:
>>>
>>> CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>> ON (expression)
>>> FROM table_name
>>>
>>> CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>> [ ( statistics_kind [, ... ] ) ]
>>> ON { column_name | (expression) } , { column_name | (expression) } [, ...]
>>> FROM table_name
>>
>> I think it makes sense in general. I see two issues with this approach,
>> though:
>>
>> * By adding expression/standard stats for individual statistics, it
>> makes the list of statistics longer - I wonder if this might have
>> measurable impact on lookups in this list.
>>
>> * I'm not sure it's a good idea that the second syntax would always
>> build the per-expression stats. Firstly, it seems a bit strange that it
>> behaves differently than the other kinds. Secondly, I wonder if there
>> are cases where it'd be desirable to explicitly disable building these
>> per-expression stats. For example, what if we have multiple extended
>> statistics objects, overlapping on a couple expressions. It seems
>> pointless to build the stats for all of them.
>>
>
> Hmm, I'm not sure it would really be a good idea to build MCV stats on
> expressions without also building the standard stats for those
> expressions, otherwise the assumptions that
> mcv_combine_selectivities() makes about simple_sel and mcv_basesel
> wouldn't really hold. But then, if multiple MCV stats shared the same
> expression, it would be quite wasteful to build standard stats on the
> expression more than once.
>

Yeah. You're right it'd be problematic to build MCV on expressions without having the per-expression stats. In fact, that's exactly the problem that forced me to add the per-expression stats to this patch. Originally I planned to address it in a later patch, but I had to move it forward. So I think you're right that we need to ensure we have standard stats for each expression at least once, to make this work well.

> It feels like it should build a single extended stats object for each
> unique expression, with appropriate dependencies for any MCV stats
> that used those expressions, but I'm not sure how complex that would
> be. Dropping the last MCV stat object using a standard expression stat
> object might reasonably drop the expression stats ... except if they
> were explicitly created by the user, independently of any MCV stats.
> That could get quite messy.
>

Possibly. But I don't think it's worth the extra complexity. I don't expect people to have a lot of overlapping stats, so the amount of wasted space and CPU time is expected to be fairly limited. So I don't think it's worth spending too much time on this now. Let's just do what you proposed, and revisit this later if needed.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 8 Dec 2020 at 12:44, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Possibly. But I don't think it's worth the extra complexity. I don't
> expect people to have a lot of overlapping stats, so the amount of
> wasted space and CPU time is expected to be fairly limited.
>
> So I don't think it's worth spending too much time on this now. Let's
> just do what you proposed, and revisit this later if needed.
>

Yes, I think that's a reasonable approach to take. As long as the documentation makes it clear that building MCV stats also causes standard expression stats to be built on any expressions included in the list, then the user will know and can avoid duplication most of the time. I don't think there's any need for code to try to prevent that -- just as we don't bother with code to prevent a user building multiple indexes on the same column.

The only case where duplication won't be avoidable is where there are multiple MCV stats sharing the same expression, but that's probably quite unlikely in practice, and it seems acceptable to leave improving that as a possible future optimisation.

Regards,
Dean
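The unavoidable-duplication case Dean mentions - multiple MCV statistics sharing the same expression - would look something like this (hypothetical sketch; the table and column names are invented):

```sql
-- Both objects reference (a + b), so under the agreed approach each of
-- them would build its own standard stats for that expression -
-- duplicated work, accepted for now as a possible future optimisation.
CREATE STATISTICS s_ab_c (mcv) ON (a + b), c FROM tbl;
CREATE STATISTICS s_ab_d (mcv) ON (a + b), d FROM tbl;
```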
On 12/11/20 1:58 PM, Dean Rasheed wrote:
> On Tue, 8 Dec 2020 at 12:44, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> Possibly. But I don't think it's worth the extra complexity. I don't
>> expect people to have a lot of overlapping stats, so the amount of
>> wasted space and CPU time is expected to be fairly limited.
>>
>> So I don't think it's worth spending too much time on this now. Let's
>> just do what you proposed, and revisit this later if needed.
>>
>
> Yes, I think that's a reasonable approach to take. As long as the
> documentation makes it clear that building MCV stats also causes
> standard expression stats to be built on any expressions included in
> the list, then the user will know and can avoid duplication most of
> the time. I don't think there's any need for code to try to prevent
> that -- just as we don't bother with code to prevent a user building
> multiple indexes on the same column.
>
> The only case where duplication won't be avoidable is where there are
> multiple MCV stats sharing the same expression, but that's probably
> quite unlikely in practice, and it seems acceptable to leave improving
> that as a possible future optimisation.
>

OK. Attached is an updated version, reworking it this way.

I tried tweaking the grammar to differentiate these two syntax variants, but that led to shift/reduce conflicts with the existing rules. I tried fixing that, but in the end I did the differentiation in CreateStatistics() instead.

The other thing is that we probably can't tie this to just MCV, because functional dependencies need the per-expression stats too. So I simply build expression stats whenever there's at least one expression.

I also decided to keep the "expressions" statistics kind - it's not possible to specify it in CREATE STATISTICS, but it's useful internally because it lets us decide in a single place whether to build the stats. Otherwise we'd need to make that decision every time we build the statistics, etc.

I added a brief explanation to the sgml docs, not sure if that's good enough - maybe it needs more details.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Fri, 11 Dec 2020 at 20:17, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> OK. Attached is an updated version, reworking it this way.

Cool. I think this is an exciting development, so I hope it makes it into the next release. I have started looking at it. So far I have only looked at the catalog, parser and client changes, but I thought it's worth posting my comments so far.

> I tried tweaking the grammar to differentiate these two syntax variants,
> but that led to shift/reduce conflicts with the existing ones. I tried
> fixing that, but I ended up doing that in CreateStatistics().

Yeah, that makes sense. I wasn't expecting the grammar to change.

> The other thing is that we probably can't tie this to just MCV, because
> functional dependencies need the per-expression stats too. So I simply
> build expression stats whenever there's at least one expression.

Makes sense.

> I also decided to keep the "expressions" statistics kind - it's not
> allowed to specify it in CREATE STATISTICS, but it's useful internally
> as it allows deciding whether to build the stats in a single place.
> Otherwise we'd need to do that every time we build the statistics, etc.

Yes, I thought that would be the easiest way to do it. Essentially the "expressions" stats kind is an internal implementation detail, hidden from the user, because it's built automatically when required, so you don't need to (and can't) explicitly ask for it. This new behaviour seems much more logical to me.

> I added a brief explanation to the sgml docs, not sure if that's good
> enough - maybe it needs more details.

Yes, I think that could use a little tidying up, but I haven't looked too closely yet.

Some other comments:

* I'm not sure I understand the need for 0001. Wasn't there an earlier version of this patch that just did it by re-populating the type array, but which still had it as an array rather than turning it into a list? Making it a list falsifies some of the comments and function/variable name choices in that file.

* There's a comment typo in catalog/Makefile -- "are are reputedly other...", should be "there are reputedly other...".

* Looking at the pg_stats_ext view, I think perhaps expressions stats should be omitted entirely from that view, since it doesn't show any useful information about them. So it could remove "e" from the "kinds" array, and exclude rows whose only kind is "e", since such rows have no interesting data in them. Essentially, the new view pg_stats_ext_exprs makes having any expression stats in pg_stats_ext redundant. Hiding this data in pg_stats_ext would also be consistent with making the "expressions" stats kind hidden from the user.

* In gram.y, it wasn't quite obvious why you converted the column list for CREATE STATISTICS from an expr_list to a stats_params list. I figured it out, and it makes sense, but I think it could use a comment, perhaps something along the lines of the one for index_elem, e.g.:

/*
 * Statistics attributes can be either simple column references, or arbitrary
 * expressions in parens. For compatibility with index attributes permitted
 * in CREATE INDEX, we allow an expression that's just a function call to be
 * written without parens.
 */

* In parse_func.c and parse_agg.c, there are a few new error strings that use the abbreviation "stats expressions", whereas most other errors refer to "statistics expressions". For consistency, I think they should all be the latter.

* In generateClonedExtStatsStmt(), I think the "expressions" stats kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE ...) fails if the source table has expression stats.

* CreateStatistics() uses ShareUpdateExclusiveLock, but in tcop/utility.c the relation is opened with a ShareLock. ISTM that the latter lock mode should be made to match CreateStatistics().

* Why does the new code in tcop/utility.c not use RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation? That seems preferable to doing the ACL check in CreateStatistics(). For one thing, as it stands, it allows the lock to be taken even if the user doesn't own the table. Is it intentional that the current code allows extended stats to be created on system catalogs? That would be one thing that using RangeVarCallbackOwnsRelation would change, but I can't see a reason to allow it.

* In src/bin/psql/describe.c, I think the \d output should also exclude the "expressions" stats kind and just list the other kinds (or have no kinds list at all, if there are no other kinds), to make it consistent with the CREATE STATISTICS syntax.

* The pg_dump output for a stats object whose only kind is "expressions" is broken -- it includes a spurious "()" for the kinds list.

That's it for now. I'll look at the optimiser changes next, and try to post more comments later this week.

Regards,
Dean
On 1/4/21 4:34 PM, Dean Rasheed wrote:
>
> ...
>
> Some other comments:
>
> * I'm not sure I understand the need for 0001. Wasn't there an earlier
> version of this patch that just did it by re-populating the type
> array, but which still had it as an array rather than turning it into
> a list? Making it a list falsifies some of the comments and
> function/variable name choices in that file.
>

That's a bit down to Justin - I think it's fine to use the older version repopulating the type array, but that question is somewhat unrelated to this patch.

> * There's a comment typo in catalog/Makefile -- "are are reputedly
> other...", should be "there are reputedly other...".
>
> * Looking at the pg_stats_ext view, I think perhaps expressions stats
> should be omitted entirely from that view, since it doesn't show any
> useful information about them. So it could remove "e" from the "kinds"
> array, and exclude rows whose only kind is "e", since such rows have
> no interesting data in them. Essentially, the new view
> pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
> redundant. Hiding this data in pg_stats_ext would also be consistent
> with making the "expressions" stats kind hidden from the user.
>

Hmmm, not sure. I'm not sure removing 'e' from the array is a good idea. On the one hand it's an internal detail, on the other hand most of that view is internal detail too. Excluding rows with only 'e' seems reasonable, though. I need to think about this.

> * In gram.y, it wasn't quite obvious why you converted the column list
> for CREATE STATISTICS from an expr_list to a stats_params list. I
> figured it out, and it makes sense, but I think it could use a
> comment, perhaps something along the lines of the one for index_elem,
> e.g.:
>
> /*
>  * Statistics attributes can be either simple column references, or arbitrary
>  * expressions in parens. For compatibility with index attributes permitted
>  * in CREATE INDEX, we allow an expression that's just a function call to be
>  * written without parens.
>  */
>

Oh, right. I'd have trouble figuring this out myself, even though I wrote that code only a month or two ago.

> * In parse_func.c and parse_agg.c, there are a few new error strings
> that use the abbreviation "stats expressions", whereas most other
> errors refer to "statistics expressions". For consistency, I think
> they should all be the latter.
>

OK, will fix.

> * In generateClonedExtStatsStmt(), I think the "expressions" stats
> kind needs to be explicitly excluded, otherwise CREATE TABLE (LIKE
> ...) fails if the source table has expression stats.
>

Yeah, will fix. I guess this also means we're missing some tests.

> * CreateStatistics() uses ShareUpdateExclusiveLock, but in
> tcop/utility.c the relation is opened with a ShareLock. ISTM that the
> latter lock mode should be made to match CreateStatistics().
>

Not sure, will check.

> * Why does the new code in tcop/utility.c not use
> RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
> That seems preferable to doing the ACL check in CreateStatistics().
> For one thing, as it stands, it allows the lock to be taken even if
> the user doesn't own the table. Is it intentional that the current
> code allows extended stats to be created on system catalogs? That
> would be one thing that using RangeVarCallbackOwnsRelation would
> change, but I can't see a reason to allow it.
>

I think I copied the code from somewhere - probably expression indexes, or something like that. Not proof that it's the right/better way to do this, though.

> * In src/bin/psql/describe.c, I think the \d output should also
> exclude the "expressions" stats kind and just list the other kinds (or
> have no kinds list at all, if there are no other kinds), to make it
> consistent with the CREATE STATISTICS syntax.
>

Not sure I understand. Why would this make it consistent with CREATE STATISTICS? Can you elaborate?

> * The pg_dump output for a stats object whose only kind is
> "expressions" is broken -- it includes a spurious "()" for the kinds
> list.
>

Will fix. Again, this suggests there are TAP tests missing.

> That's it for now. I'll look at the optimiser changes next, and try to
> post more comments later this week.
>

Thanks!

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 5 Jan 2021 at 00:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 1/4/21 4:34 PM, Dean Rasheed wrote:
> >
> > * In src/bin/psql/describe.c, I think the \d output should also
> > exclude the "expressions" stats kind and just list the other kinds (or
> > have no kinds list at all, if there are no other kinds), to make it
> > consistent with the CREATE STATISTICS syntax.
>
> Not sure I understand. Why would this make it consistent with CREATE
> STATISTICS? Can you elaborate?
>

This isn't absolutely essential, but I think it would be neater. For example, if I have a table with stats like this:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo;

then the \d output is as follows:

\d foo
                Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Statistics objects:
    "public"."foo_s_ab" (mcv) ON a, b FROM foo

and the stats line matches the DDL used to create the stats. It could, for example, be copy-pasted and tweaked to create similar stats on another table, but even if that's not very likely, it's neat that it reflects how the stats were created.

OTOH, if there are expressions in the list, it produces something like this:

                Table "public.foo"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Statistics objects:
    "public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo

which no longer matches the DDL used, and isn't part of an accepted syntax, so seems a bit inconsistent.

In general, if we're making the "expressions" kind an internal implementation detail that just gets built automatically when needed, then I think we should hide it from this sort of output, so the list of kinds matches the list that the user used when the stats were created.

Regards,
Dean
On 1/5/21 3:10 PM, Dean Rasheed wrote:
> On Tue, 5 Jan 2021 at 00:45, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 1/4/21 4:34 PM, Dean Rasheed wrote:
>>>
>>> * In src/bin/psql/describe.c, I think the \d output should also
>>> exclude the "expressions" stats kind and just list the other kinds (or
>>> have no kinds list at all, if there are no other kinds), to make it
>>> consistent with the CREATE STATISTICS syntax.
>>
>> Not sure I understand. Why would this make it consistent with CREATE
>> STATISTICS? Can you elaborate?
>>
>
> This isn't absolutely essential, but I think it would be neater. For
> example, if I have a table with stats like this:
>
> CREATE TABLE foo(a int, b int);
> CREATE STATISTICS foo_s_ab (mcv) ON a,b FROM foo;
>
> then the \d output is as follows:
>
> \d foo
>                 Table "public.foo"
>  Column |  Type   | Collation | Nullable | Default
> --------+---------+-----------+----------+---------
>  a      | integer |           |          |
>  b      | integer |           |          |
> Statistics objects:
>     "public"."foo_s_ab" (mcv) ON a, b FROM foo
>
> and the stats line matches the DDL used to create the stats. It could,
> for example, be copy-pasted and tweaked to create similar stats on
> another table, but even if that's not very likely, it's neat that it
> reflects how the stats were created.
>
> OTOH, if there are expressions in the list, it produces something like this:
>
>                 Table "public.foo"
>  Column |  Type   | Collation | Nullable | Default
> --------+---------+-----------+----------+---------
>  a      | integer |           |          |
>  b      | integer |           |          |
> Statistics objects:
>     "public"."foo_s_ab" (mcv, expressions) ON a, b, ((a * b)) FROM foo
>
> which no longer matches the DDL used, and isn't part of an accepted
> syntax, so seems a bit inconsistent.
>
> In general, if we're making the "expressions" kind an internal
> implementation detail that just gets built automatically when needed,
> then I think we should hide it from this sort of output, so the list
> of kinds matches the list that the user used when the stats were
> created.
>

Hmm, I see. You're probably right it's not necessary to show this, given the modified handling of expression stats (which makes them an internal detail, not exposed to users). I'll tweak this.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Looking over the statscmds.c changes, there are a few XXX's and FIXME's that need resolving, and I had a couple of other minor comments:

+ /*
+  * An expression using mutable functions is probably wrong,
+  * since if you aren't going to get the same result for the
+  * same data every time, it's not clear what the index entries
+  * mean at all.
+  */
+ if (CheckMutability((Expr *) expr))
+     ereport(ERROR,

That comment is presumably copied from the index code, so needs updating.

+ /*
+  * Disallow data types without a less-than operator
+  *
+  * XXX Maybe allow this, but only for EXPRESSIONS stats and
+  * prevent building e.g. MCV etc.
+  */
+ atttype = exprType(expr);
+ type = lookup_type_cache(atttype, TYPECACHE_LT_OPR);
+ if (type->lt_opr == InvalidOid)
+     ereport(ERROR,
+             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+              errmsg("expression cannot be used in statistics because its type %s has no default btree operator class",
+                     format_type_be(atttype))));

As the comment suggests, it's probably worth skipping this check if numcols is 1 so that single-column stats can be built for more types of expressions. (I'm assuming that it's basically no more effort to make that work, so I think it falls into the might-as-well-do-it category.)

+ /*
+  * Parse the statistics kinds. Firstly, check that this is not the
+  * variant building statistics for a single expression, in which case
+  * we don't allow specifying any statistis kinds. The simple variant
+  * only has one expression, and does not allow statistics kinds.
+  */
+ if ((list_length(stmt->exprs) == 1) && (list_length(stxexprs) == 1))
+ {

Typo: "statistis"

Nit-picking, this test could just be:

+ if ((numcols == 1) && (list_length(stxexprs) == 1))

which IMO is a little more readable, and matches a similar test a little further down.

+ /*
+  * If there are no simply-referenced columns, give the statistics an
+  * auto dependency on the whole table. In most cases, this will
+  * be redundant, but it might not be if the statistics expressions
+  * contain no Vars (which might seem strange but possible).
+  *
+  * XXX This is copied from index_create, not sure if it's applicable
+  * to extended statistics too.
+  */

Seems right to me.

+ /*
+  * FIXME use 'expr' for expressions, which have empty column names.
+  * For indexes this is handled in ChooseIndexColumnNames, but we
+  * have no such function for stats.
+  */
+ if (!name)
+     name = "expr";

In theory, this function could be made to duplicate the logic used for indexes, creating names like "expr1", "expr2", etc. To be honest though, I don't think it's worth the effort. The code for indexes isn't really bulletproof anyway -- for example there might be a column called "expr" that is or isn't included in the index, which would make the generated name ambiguous. And in any case, a name like "tbl_cola_expr_colb_expr1_colc_stat" isn't really any more useful than "tbl_cola_expr_colb_expr_colc_stat". So I'd be tempted to leave that code as it is.

+
+/*
+ * CheckMutability
+ *     Test whether given expression is mutable
+ *
+ * FIXME copied from indexcmds.c, maybe use some shared function?
+ */
+static bool
+CheckMutability(Expr *expr)
+{

As the comment says, it's quite messy duplicating this code, but I'm wondering whether it would be OK to just skip this check entirely. I think someone else suggested that elsewhere, and I think it might not be a bad idea. For indexes, it could easily lead to wrong query results, but for stats the most likely problem is that the stats would get out of date (which they tend to do all by themselves anyway) and need rebuilding. If you ignore intentionally crazy examples (which are still possible even with this check), then there are probably many legitimate cases where someone might want to use non-immutable functions in stats, and this check just forces them to create an immutable wrapper function.

Regards,
Dean
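For illustration, the wrapper-function workaround mentioned at the end might look like this (a hypothetical sketch - the function, table, and column names are invented, and whether the mutability check should stay at all is exactly what is being debated here; to_char() is only STABLE because its output can depend on locale settings):

```sql
-- Wrap the non-immutable expression in a function that the user declares
-- IMMUTABLE, asserting that lc_time will not change under them:
CREATE FUNCTION day_name(d date) RETURNS text
AS $$ SELECT to_char(d, 'Day') $$
LANGUAGE sql IMMUTABLE;

-- The wrapper can then be used where the bare expression would be
-- rejected by the mutability check:
CREATE STATISTICS s_day (ndistinct) ON day_name(d), region FROM sales;
```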
Starting to look at the planner code, I found an oversight in the way expression stats are read at the start of planning -- it is necessary to call ChangeVarNodes() on any expressions if the relid isn't 1, otherwise the stats expressions may contain Var nodes referring to the wrong relation. Possibly the easiest place to do that would be in get_relation_statistics(), if rel->relid != 1.

Here's a simple test case:

CREATE TABLE foo AS SELECT x FROM generate_series(1,100000) g(x);
CREATE STATISTICS foo_s ON (x%10) FROM foo;
ANALYSE foo;
EXPLAIN SELECT * FROM foo WHERE x%10 = 0;
EXPLAIN SELECT * FROM (SELECT 1) t, foo WHERE x%10 = 0;

(in the second query, the stats don't get applied).

Regards,
Dean
Hi,

Attached is a patch fixing most of the issues. There are a couple of exceptions:

> * Looking at the pg_stats_ext view, I think perhaps expressions stats
> should be omitted entirely from that view, since it doesn't show any
> useful information about them. So it could remove "e" from the "kinds"
> array, and exclude rows whose only kind is "e", since such rows have
> no interesting data in them. Essentially, the new view
> pg_stats_ext_exprs makes having any expression stats in pg_stats_ext
> redundant. Hiding this data in pg_stats_ext would also be consistent
> with making the "expressions" stats kind hidden from the user.

I haven't removed the expression stats from the pg_stats_ext view yet. I'm not 100% sure about it yet.

> * Why does the new code in tcop/utility.c not use
> RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation?
> That seems preferable to doing the ACL check in CreateStatistics().
> For one thing, as it stands, it allows the lock to be taken even if
> the user doesn't own the table. Is it intentional that the current
> code allows extended stats to be created on system catalogs? That
> would be one thing that using RangeVarCallbackOwnsRelation would
> change, but I can't see a reason to allow it.

I haven't switched utility.c to RangeVarGetRelidExtended together with RangeVarCallbackOwnsRelation, because the current code allows checking the object type first. I don't recall why exactly it was done this way, but I didn't feel like changing that in this patch.

You're right, however, that it should not be possible to create statistics on system catalogs. For regular users that should be rejected thanks to the ownership check, but a superuser may create them. I've added a proper check to CreateStatistics() - this is probably worth backpatching.

> * In src/bin/psql/describe.c, I think the \d output should also
> exclude the "expressions" stats kind and just list the other kinds (or
> have no kinds list at all, if there are no other kinds), to make it
> consistent with the CREATE STATISTICS syntax.

I've done this, but I went one step further - we hide the list of kinds using the same rules as pg_dump, i.e. we don't list the kinds if all of them are selected. Not sure if that's the right thing, though.

The rest of the issues should be fixed, I think.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Fri, Jan 08, 2021 at 01:57:29AM +0100, Tomas Vondra wrote:
> Attached is a patch fixing most of the issues. There are a couple
> exceptions:

In the docs:

+ — at the cost that its schema must be extended whenever the structure
+ of statistics <link linkend="catalog-pg-statistic"><structname>pg_statistic</structname></link> changes.

should say "of statistics *IN* pg_statistics changes" ?

+ to an expression index. The full variant allows defining statistics objects
+ on multiple columns and expressions, and pick which statistics kinds will
+ be built. The per-expression statistics are built automatically when there

"and pick" is wrong - maybe say "and selecting which.."

+ and run a query using an expression on that column. Without the

remove "the" ?

+ extended statistics, the planner has no information about data
+ distribution for reasults of those expression, and uses default

*results

+ estimates as illustrated by the first query. The planner also does
+ not realize the value of the second column fully defines the value
+ of the other column, because date truncated to day still identifies
+ the month). Then expression and ndistinct statistics are built on

The ")" is unbalanced

+ /* all parts of thi expression are covered by this statistics */

this

+ * GrouExprInfos, but only if it's not known equal to any of the existing

Group

+ * we don't allow specifying any statistis kinds. The simple variant

statistics

+ * If no statistic type was specified, build them all (but request

Say "kind" not "type" ?

+ * expression is a simple Var. OTOH we check that there's at least one
+ * statistics matching the expression.

one statistic (singular) ?

+ * the future, we might consider
+ */

consider ???

+-- (not it fails, when there are no simple column references)

note?

There's some remaining copy/paste stuff from index expressions:

errmsg("statistics expressions and predicates can refer only to the table being indexed")));
left behind by evaluating the predicate or index expressions.
Set up for predicate or expression evaluation
Need an EState for evaluation of index expressions and
/* Compute and save index expression values */
left behind by evaluating the predicate or index expressions.
Fetch function for analyzing index expressions.
partial-index predicates. Create it in the per-index context to be
* When analyzing an expression index, believe the expression tree's type

--
Justin
On 1/8/21 3:35 AM, Justin Pryzby wrote: > On Fri, Jan 08, 2021 at 01:57:29AM +0100, Tomas Vondra wrote: >> Attached is a patch fixing most of the issues. There are a couple >> exceptions: > > In the docs: > > ... > Thanks! Checking the docs and comments is tedious work; I appreciate you going through all that. I'll fix that in the next version. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attached is an updated version of the patch series, fixing a couple issues: 1) docs issues, pointed out by Justin Pryzby 2) adds ACL check to statext_extract_expression to verify access to attributes in the expression(s) 3) adds comment to statext_is_compatible_clause_internal explaining the ambiguity in extracting expressions for extended stats 4) fixes/improves memory management in compute_expr_stats 5) a bunch of minor comment and code fixes regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Sat, Jan 16, 2021 at 05:48:43PM +0100, Tomas Vondra wrote: > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>expr</structfield> <type>text</type> > + </para> > + <para> > + Expression the extended statistics is defined on > + </para></entry> Expression the extended statistics ARE defined on Or maybe say "on which the extended statistics are defined" > + <para> > + The <command>CREATE STATISTICS</command> command has two basic forms. The > + simple variant allows to build statistics for a single expression, does .. ALLOWS BUILDING statistics for a single expression, AND does (or BUT does) > + Expression statistics are per-expression and are similar to creating an > + index on the expression, except that they avoid the overhead of the index. Maybe say "overhead of index maintenance" > + All functions and operators used in a statistics definition must be > + <quote>immutable</quote>, that is, their results must depend only on > + their arguments and never on any outside influence (such as > + the contents of another table or the current time). This restriction say "outside factor" or "external factor" > + results of those expression, and uses default estimates as illustrated > + by the first query. The planner also does not realize the value of the realize THAT > + second column fully defines the value of the other column, because date > + truncated to day still identifies the month. Then expression and > + ndistinct statistics are built on those two columns: I got an error doing this: CREATE TABLE t AS SELECT generate_series(1,9) AS i; CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t; ANALYZE t; SELECT i+1 FROM t GROUP BY 1; ERROR: corrupt MVNDistinct entry -- Justin
On 1/17/21 12:22 AM, Justin Pryzby wrote: > On Sat, Jan 16, 2021 at 05:48:43PM +0100, Tomas Vondra wrote: >> + <entry role="catalog_table_entry"><para role="column_definition"> >> + <structfield>expr</structfield> <type>text</type> >> + </para> >> + <para> >> + Expression the extended statistics is defined on >> + </para></entry> > > Expression the extended statistics ARE defined on > Or maybe say "on which the extended statistics are defined" > I'm pretty sure "is" is correct because "expression" is singular. >> + <para> >> + The <command>CREATE STATISTICS</command> command has two basic forms. The >> + simple variant allows to build statistics for a single expression, does > > .. ALLOWS BUILDING statistics for a single expression, AND does (or BUT does) > >> + Expression statistics are per-expression and are similar to creating an >> + index on the expression, except that they avoid the overhead of the index. > > Maybe say "overhead of index maintenance" > Yeah, that sounds better. >> + All functions and operators used in a statistics definition must be >> + <quote>immutable</quote>, that is, their results must depend only on >> + their arguments and never on any outside influence (such as >> + the contents of another table or the current time). This restriction > > say "outside factor" or "external factor" > In fact, we've removed the immutability restriction, so this paragraph should have been removed. >> + results of those expression, and uses default estimates as illustrated >> + by the first query. The planner also does not realize the value of the > > realize THAT > OK, changed. >> + second column fully defines the value of the other column, because date >> + truncated to day still identifies the month. 
Then expression and >> + ndistinct statistics are built on those two columns: > > I got an error doing this: > > CREATE TABLE t AS SELECT generate_series(1,9) AS i; > CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t; > ANALYZE t; > SELECT i+1 FROM t GROUP BY 1; > ERROR: corrupt MVNDistinct entry > Thanks. There was a thinko in estimate_multivariate_ndistinct, resulting in mismatching the ndistinct coefficient items. The attached patch fixes that, but I've realized the way we pick the "best" statistics may need some improvements (I added an XXX comment about that). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
Hi,
+ * Check that only the base rel is mentioned. (This should be dead code
+ * now that add_missing_from is history.)
+ */
+ if (list_length(pstate->p_rtable) != 1)
If it is dead code, it can be removed, right?
For statext_mcv_clauselist_selectivity:
+ // bms_free(list_attnums[listidx]);
The commented line can be removed.
+bool
+examine_clause_args2(List *args, Node **exprp, Const **cstp, bool *expronleftp)
It would be better to add a comment for examine_clause_args2(), since examine_clause_args() already exists.
+ if (!ok || stats->compute_stats == NULL || stats->minrows <= 0)
When would stats->minrows have a negative value?
For serialize_expr_stats():
+ sd = table_open(StatisticRelationId, RowExclusiveLock);
+
+ /* lookup OID of composite type for pg_statistic */
+ typOid = get_rel_type_id(StatisticRelationId);
+ if (!OidIsValid(typOid))
+ ereport(ERROR,
Looks like the table_open() call can be made after the typOid check.
+ Datum values[Natts_pg_statistic];
+ bool nulls[Natts_pg_statistic];
+ HeapTuple stup;
+
+ if (!stats->stats_valid)
It seems the local arrays can be declared after the validity check.
+ if (enabled[i] == STATS_EXT_NDISTINCT)
+ ndistinct_enabled = true;
+ if (enabled[i] == STATS_EXT_DEPENDENCIES)
+ dependencies_enabled = true;
+ if (enabled[i] == STATS_EXT_MCV)
The second and third if should be preceded by 'else'.
+ReleaseDummy(HeapTuple tuple)
+{
+ pfree(tuple);
Since ReleaseDummy() is just a single pfree call, maybe you don't need this method - call pfree in its place.
Cheers
On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote: > diff --git a/doc/src/sgml/ref/create_statistics.sgml b/doc/src/sgml/ref/create_statistics.sgml > index 4363be50c3..5b8eb8d248 100644 > --- a/doc/src/sgml/ref/create_statistics.sgml > +++ b/doc/src/sgml/ref/create_statistics.sgml > @@ -21,9 +21,13 @@ PostgreSQL documentation > > <refsynopsisdiv> > <synopsis> > +CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable> > + ON ( <replaceable class="parameter">expression</replaceable> ) > + FROM <replaceable class="parameter">table_name</replaceable> > CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable> > [ ( <replaceable class="parameter">statistics_kind</replaceable> [, ... ] ) ] > - ON <replaceable class="parameter">column_name</replaceable>, <replaceable class="parameter">column_name</replaceable>[, ...] > + ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable>) } [, ...] > FROM <replaceable class="parameter">table_name</replaceable> > </synopsis> I think this part is wrong, since it's not possible to specify a single column in form#2. If I'm right, the current way is: - form#1 allows expression statistics of a single expression, and doesn't allow specifying "kinds"; - form#2 allows specifying "kinds", but always computes expression statistics, and requires multiple columns/expressions. So it'd need to be column_name|expression, column_name|expression [,...] > @@ -39,6 +43,16 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_na > database and will be owned by the user issuing the command. > </para> > > + <para> > + The <command>CREATE STATISTICS</command> command has two basic forms. The > + simple variant allows building statistics for a single expression, does > + not allow specifying any statistics kinds and provides benefits similar > + to an expression index. 
The full variant allows defining statistics objects > + on multiple columns and expressions, and selecting which statistics kinds will > + be built. The per-expression statistics are built automatically when there > + is at least one expression. > + </para> > + <varlistentry> > + <term><replaceable class="parameter">expression</replaceable></term> > + <listitem> > + <para> > + The expression to be covered by the computed statistics. In this case > + only a single expression is required, in which case only the expression > + statistics kind is allowed. The order of expressions is insignificant. I think this part is wrong now ? I guess there's no user-facing expression "kind" left in the CREATE command. I guess "in which case" means "if only one expr is specified". "expression" could be either form#1 or form#2. Maybe it should just say: > + An expression to be covered by the computed statistics. Maybe somewhere else, say: > In the second form of the command, the order of expressions is insignificant. -- Justin
On 1/17/21 3:55 AM, Zhihong Yu wrote: > Hi, > + * Check that only the base rel is mentioned. (This should be dead code > + * now that add_missing_from is history.) > + */ > + if (list_length(pstate->p_rtable) != 1) > > If it is dead code, it can be removed, right ? > Maybe. The question is whether it really is dead code. It's copied from transformIndexStmt so I kept it for now. > For statext_mcv_clauselist_selectivity: > > + // bms_free(list_attnums[listidx]); > > The commented line can be removed. > Actually, this should probably do list_free on the list_exprs. > +bool > +examine_clause_args2(List *args, Node **exprp, Const **cstp, bool > *expronleftp) > > Better add some comment for examine_clause_args2 since there > is examine_clause_args() already. > Yeah, this probably needs a better name. I have a feeling this may need a refactoring to reuse more of the existing code for the expressions. > + if (!ok || stats->compute_stats == NULL || stats->minrows <= 0) > > When would stats->minrows have negative value ? > The typanalyze function (e.g. std_typanalyze) can set it to negative value. This is the same check as in examine_attribute, and we need the same protections I think. > For serialize_expr_stats(): > > + sd = table_open(StatisticRelationId, RowExclusiveLock); > + > + /* lookup OID of composite type for pg_statistic */ > + typOid = get_rel_type_id(StatisticRelationId); > + if (!OidIsValid(typOid)) > + ereport(ERROR, > > Looks like the table_open() call can be made after the typOid check. > Probably, but this failure is unlikely (should never happen) so I don't think this makes any real difference. > + Datum values[Natts_pg_statistic]; > + bool nulls[Natts_pg_statistic]; > + HeapTuple stup; > + > + if (!stats->stats_valid) > > It seems the local arrays can be declared after the validity check. > Nope, that would be against C99. 
> + if (enabled[i] == STATS_EXT_NDISTINCT) > + ndistinct_enabled = true; > + if (enabled[i] == STATS_EXT_DEPENDENCIES) > + dependencies_enabled = true; > + if (enabled[i] == STATS_EXT_MCV) > > the second and third if should be preceded with 'else' > Yeah, although this just moves existing code. But you're right it could use else. > +ReleaseDummy(HeapTuple tuple) > +{ > + pfree(tuple); > > Since ReleaseDummy() is just a single pfree call, maybe you don't need > this method - call pfree in its place. > No, that's not possible because the freefunc callback expects signature void (*)(HeapTupleData *) and pfree() does not match that. thanks -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/17/21 6:29 AM, Justin Pryzby wrote: > On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote: >> diff --git a/doc/src/sgml/ref/create_statistics.sgml b/doc/src/sgml/ref/create_statistics.sgml >> index 4363be50c3..5b8eb8d248 100644 >> --- a/doc/src/sgml/ref/create_statistics.sgml >> +++ b/doc/src/sgml/ref/create_statistics.sgml >> @@ -21,9 +21,13 @@ PostgreSQL documentation >> >> <refsynopsisdiv> >> <synopsis> >> +CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable> >> + ON ( <replaceable class="parameter">expression</replaceable> ) >> + FROM <replaceable class="parameter">table_name</replaceable> > >> CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable> >> [ ( <replaceable class="parameter">statistics_kind</replaceable> [, ... ] ) ] >> - ON <replaceable class="parameter">column_name</replaceable>, <replaceable class="parameter">column_name</replaceable>[, ...] >> + ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable>) } [, ...] >> FROM <replaceable class="parameter">table_name</replaceable> >> </synopsis> > > I think this part is wrong, since it's not possible to specify a single column > in form#2. > > If I'm right, the current way is: > > - form#1 allows expression statistics of a single expression, and doesn't > allow specifying "kinds"; > > - form#2 allows specifying "kinds", but always computes expression statistics, > and requires multiple columns/expressions. > > So it'd need to be column_name|expression, column_name|expression [,...] > Strictly speaking you're probably correct - there should be at least two elements. But I'm somewhat hesitant about making this more complex, because it'll be harder to understand. >> @@ -39,6 +43,16 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_na >> database and will be owned by the user issuing the command. 
>> </para> >> >> + <para> >> + The <command>CREATE STATISTICS</command> command has two basic forms. The >> + simple variant allows building statistics for a single expression, does >> + not allow specifying any statistics kinds and provides benefits similar >> + to an expression index. The full variant allows defining statistics objects >> + on multiple columns and expressions, and selecting which statistics kinds will >> + be built. The per-expression statistics are built automatically when there >> + is at least one expression. >> + </para> > >> + <varlistentry> >> + <term><replaceable class="parameter">expression</replaceable></term> >> + <listitem> >> + <para> >> + The expression to be covered by the computed statistics. In this case >> + only a single expression is required, in which case only the expression >> + statistics kind is allowed. The order of expressions is insignificant. > > I think this part is wrong now ? > I guess there's no user-facing expression "kind" left in the CREATE command. > I guess "in which case" means "if only one expr is specified". > "expression" could be either form#1 or form#2. > > Maybe it should just say: >> + An expression to be covered by the computed statistics. > > Maybe somewhere else, say: >> In the second form of the command, the order of expressions is insignificant. > Yeah, this is a leftover from when there was "expressions" kind. I'll reword this a bit. thanks -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jan 17, 2021 at 01:23:39AM +0100, Tomas Vondra wrote: > > CREATE TABLE t AS SELECT generate_series(1,9) AS i; > > CREATE STATISTICS s ON (i+1) ,(i+1+0) FROM t; > > ANALYZE t; > > SELECT i+1 FROM t GROUP BY 1; > > ERROR: corrupt MVNDistinct entry > > Thanks. There was a thinko in estimate_multivariate_ndistinct, resulting in > mismatching the ndistinct coefficient items. The attached patch fixes that, > but I've realized the way we pick the "best" statistics may need some > improvements (I added an XXX comment about that). That maybe indicates a deficiency in testing and code coverage. | postgres=# CREATE TABLE t(i int); | postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t; | postgres=# \d t | Table "public.t" | Column | Type | Collation | Nullable | Default | --------+---------+-----------+----------+--------- | i | integer | | | | Indexes: | "t_i_idx" btree (i) | Statistics objects: | "public"."s2" (ndistinct, dependencies, mcv) ON FROM t on ... what ? -- Justin
Looking through extended_stats.c, I found a corner case that can lead to a seg-fault: CREATE TABLE foo(); CREATE STATISTICS s ON (1) FROM foo; ANALYSE foo; This crashes in lookup_var_attr_stats(), because it isn't expecting nvacatts to be 0. I can't think of any case where building stats on a table with no analysable columns is useful, so it should probably just exit early in that case. In BuildRelationExtStatistics(), it looks like min_attrs should be declared assert-only. In evaluate_expressions(): + /* set the pointers */ + result = (ExprInfo *) ptr; + ptr += sizeof(ExprInfo); I think that should probably have a MAXALIGN(). A slightly bigger issue that I don't like is the way it assigns attribute numbers for expressions starting from MaxHeapAttributeNumber+1, so the first expression has an attnum of 1601. That leads to pretty inefficient use of Bitmapsets, since most tables only contain a handful of columns, and there's a large block of unused space in the middle the Bitmapset. An alternative approach might be to use regular attnums for columns and use negative indexes -1, -2, -3, ... for expressions in the stored stats. Then when storing and retrieving attnums from Bitmapsets, it could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values in the Bitmapsets, since there can't be more than that many expressions (just like other code stores system attributes using FirstLowInvalidHeapAttributeNumber). That would be a somewhat bigger change, but hopefully fairly mechanical, and then some code like add_expressions_to_attributes() would go away. Looking at the new view pg_stats_ext_exprs, I noticed that it fails to show expressions until the statistics have been built. 
For example: CREATE TABLE foo(a int, b int); CREATE STATISTICS s ON (a+b), (a*b) FROM foo; SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs; statistics_name | tablename | expr | n_distinct -----------------+-----------+------+------------ s | foo | | (1 row) but after populating and analysing the table, this becomes: statistics_name | tablename | expr | n_distinct -----------------+-----------+---------+------------ s | foo | (a + b) | 11 s | foo | (a * b) | 11 (2 rows) I think it should show the expressions even before the stats have been built. Another issue is that it returns rows for non-expression stats as well. For example: CREATE TABLE foo(a int, b int); CREATE STATISTICS s ON a, b FROM foo; SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs; statistics_name | tablename | expr | n_distinct -----------------+-----------+------+------------ s | foo | | (1 row) and those values will never be populated, since they're not expressions, so I would expect them to not be shown in the view. So basically, instead of + LEFT JOIN LATERAL ( + SELECT + * + FROM ( + SELECT + unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr, + unnest(sd.stxdexpr)::pg_statistic AS a + ) x + ) stat ON sd.stxdexpr IS NOT NULL; perhaps just + JOIN LATERAL ( + SELECT + * + FROM ( + SELECT + unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr, + unnest(sd.stxdexpr)::pg_statistic AS a + ) x + ) stat ON true; Regards, Dean
On 1/18/21 4:48 PM, Dean Rasheed wrote: > Looking through extended_stats.c, I found a corner case that can lead > to a seg-fault: > > CREATE TABLE foo(); > CREATE STATISTICS s ON (1) FROM foo; > ANALYSE foo; > > This crashes in lookup_var_attr_stats(), because it isn't expecting > nvacatts to be 0. I can't think of any case where building stats on a > table with no analysable columns is useful, so it should probably just > exit early in that case. > > > In BuildRelationExtStatistics(), it looks like min_attrs should be > declared assert-only. > > > In evaluate_expressions(): > > + /* set the pointers */ > + result = (ExprInfo *) ptr; > + ptr += sizeof(ExprInfo); > > I think that should probably have a MAXALIGN(). > Thanks, I'll fix all of that. > > A slightly bigger issue that I don't like is the way it assigns > attribute numbers for expressions starting from > MaxHeapAttributeNumber+1, so the first expression has an attnum of > 1601. That leads to pretty inefficient use of Bitmapsets, since most > tables only contain a handful of columns, and there's a large block of > unused space in the middle the Bitmapset. > > An alternative approach might be to use regular attnums for columns > and use negative indexes -1, -2, -3, ... for expressions in the stored > stats. Then when storing and retrieving attnums from Bitmapsets, it > could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values > in the Bitmapsets, since there can't be more than that many > expressions (just like other code stores system attributes using > FirstLowInvalidHeapAttributeNumber). > > That would be a somewhat bigger change, but hopefully fairly > mechanical, and then some code like add_expressions_to_attributes() > would go away. > Well, I tried this but unfortunately it's not that simple. We still need to build the bitmaps, so I don't think add_expression_to_attributes can be just removed. I mean, we need to do the offsetting somewhere, even if we change how we do it. 
But the main issue is that in some cases the number of expressions is not really limited by STATS_MAX_DIMENSIONS - for example when applying functional dependencies, we "merge" multiple statistics, so we may end up with more expressions. So we can't just use STATS_MAX_DIMENSIONS. Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts the order of processing (so far we've assumed expressions are after regular attnums). So the changes are more extensive - I tried doing that anyway, and I'm still struggling with crashes and regression failures. Of course, that doesn't mean we shouldn't do it, but it's far from mechanical. (Some of that is probably a sign this code needs a bit more work to polish.) But I wonder if it'd be easier to just calculate the actual max attnum and then use it instead of MaxHeapAttributeNumber ... > > Looking at the new view pg_stats_ext_exprs, I noticed that it fails to > show expressions until the statistics have been built. For example: > > CREATE TABLE foo(a int, b int); > CREATE STATISTICS s ON (a+b), (a*b) FROM foo; > SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs; > > statistics_name | tablename | expr | n_distinct > -----------------+-----------+------+------------ > s | foo | | > (1 row) > > but after populating and analysing the table, this becomes: > > statistics_name | tablename | expr | n_distinct > -----------------+-----------+---------+------------ > s | foo | (a + b) | 11 > s | foo | (a * b) | 11 > (2 rows) > > I think it should show the expressions even before the stats have been built. > > Another issue is that it returns rows for non-expression stats as > well. 
For example: > > CREATE TABLE foo(a int, b int); > CREATE STATISTICS s ON a, b FROM foo; > SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs; > > statistics_name | tablename | expr | n_distinct > -----------------+-----------+------+------------ > s | foo | | > (1 row) > > and those values will never be populated, since they're not > expressions, so I would expect them to not be shown in the view. > > So basically, instead of > > + LEFT JOIN LATERAL ( > + SELECT > + * > + FROM ( > + SELECT > + > unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr, > + unnest(sd.stxdexpr)::pg_statistic AS a > + ) x > + ) stat ON sd.stxdexpr IS NOT NULL; > > perhaps just > > + JOIN LATERAL ( > + SELECT > + * > + FROM ( > + SELECT > + > unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr, > + unnest(sd.stxdexpr)::pg_statistic AS a > + ) x > + ) stat ON true; Will fix. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 19 Jan 2021 at 01:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > > A slightly bigger issue that I don't like is the way it assigns > > attribute numbers for expressions starting from > > MaxHeapAttributeNumber+1, so the first expression has an attnum of > > 1601. That leads to pretty inefficient use of Bitmapsets, since most > > tables only contain a handful of columns, and there's a large block of > > unused space in the middle the Bitmapset. > > > > An alternative approach might be to use regular attnums for columns > > and use negative indexes -1, -2, -3, ... for expressions in the stored > > stats. Then when storing and retrieving attnums from Bitmapsets, it > > could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values > > in the Bitmapsets, since there can't be more than that many > > expressions (just like other code stores system attributes using > > FirstLowInvalidHeapAttributeNumber). > > Well, I tried this but unfortunately it's not that simple. We still need > to build the bitmaps, so I don't think add_expression_to_attributes can > be just removed. I mean, we need to do the offsetting somewhere, even if > we change how we do it. Hmm, I was imagining that the offsetting would happen in each place that adds or retrieves an attnum from a Bitmapset, much like a lot of other code does for system attributes, and then you'd know you had an expression if the resulting attnum was negative. I was also thinking that it would be these negative attnums that would be stored in the stats data, so instead of something like "1, 2 => 1601", it would be "1, 2 => -1", so in some sense "-1" would be the "real" attnum associated with the expression. > But the main issue is that in some cases the number of expressions is > not really limited by STATS_MAX_DIMENSIONS - for example when applying > functional dependencies, we "merge" multiple statistics, so we may end > up with more expressions. So we can't just use STATS_MAX_DIMENSIONS. 
Ah, I see. I hadn't really fully understood what that code was doing. ISTM though that this is really an internal problem to the dependencies code, in that these "merged" Bitmapsets containing attrs from multiple different stats objects do not (and must not) ever go outside that local code that uses them. So that code would be free to use a different offset for its own purposes -- e..g., collect all the distinct expressions across all the stats objects and then offset by the number of distinct expressions. > Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts > the order of processing (so far we've assumed expressions are after > regular attnums). So the changes are more extensive - I tried doing that > anyway, and I'm still struggling with crashes and regression failures. > Of course, that doesn't mean we shouldn't do it, but it's far from > mechanical. (Some of that is probably a sign this code needs a bit more > work to polish.) Interesting. What code assumes expressions come after attributes? Ideally, I think it would be cleaner if no code assumed any particular order, but I can believe that it might be convenient in some circumstances. > But I wonder if it'd be easier to just calculate the actual max attnum > and then use it instead of MaxHeapAttributeNumber ... Hmm, I'm not sure how that would work. There still needs to be an attnum that gets stored in the database, and it has to continue to work if the user adds columns to the table. That's why I was advocating storing negative values, though I haven't actually tried it to see what might go wrong. Regards, Dean
On 1/21/21 12:11 PM, Dean Rasheed wrote: > On Tue, 19 Jan 2021 at 01:57, Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >>> A slightly bigger issue that I don't like is the way it assigns >>> attribute numbers for expressions starting from >>> MaxHeapAttributeNumber+1, so the first expression has an attnum of >>> 1601. That leads to pretty inefficient use of Bitmapsets, since most >>> tables only contain a handful of columns, and there's a large block of >>> unused space in the middle the Bitmapset. >>> >>> An alternative approach might be to use regular attnums for columns >>> and use negative indexes -1, -2, -3, ... for expressions in the stored >>> stats. Then when storing and retrieving attnums from Bitmapsets, it >>> could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values >>> in the Bitmapsets, since there can't be more than that many >>> expressions (just like other code stores system attributes using >>> FirstLowInvalidHeapAttributeNumber). >> >> Well, I tried this but unfortunately it's not that simple. We still need >> to build the bitmaps, so I don't think add_expression_to_attributes can >> be just removed. I mean, we need to do the offsetting somewhere, even if >> we change how we do it. > > Hmm, I was imagining that the offsetting would happen in each place > that adds or retrieves an attnum from a Bitmapset, much like a lot of > other code does for system attributes, and then you'd know you had an > expression if the resulting attnum was negative. > > I was also thinking that it would be these negative attnums that would > be stored in the stats data, so instead of something like "1, 2 => > 1601", it would be "1, 2 => -1", so in some sense "-1" would be the > "real" attnum associated with the expression. 
>
>> But the main issue is that in some cases the number of expressions is
>> not really limited by STATS_MAX_DIMENSIONS - for example when applying
>> functional dependencies, we "merge" multiple statistics, so we may end
>> up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.
>
> Ah, I see. I hadn't really fully understood what that code was doing.
>
> ISTM though that this is really an internal problem to the
> dependencies code, in that these "merged" Bitmapsets containing attrs
> from multiple different stats objects do not (and must not) ever go
> outside that local code that uses them. So that code would be free to
> use a different offset for its own purposes -- e.g., collect all the
> distinct expressions across all the stats objects and then offset by
> the number of distinct expressions.
>
>> Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts
>> the order of processing (so far we've assumed expressions are after
>> regular attnums). So the changes are more extensive - I tried doing that
>> anyway, and I'm still struggling with crashes and regression failures.
>> Of course, that doesn't mean we shouldn't do it, but it's far from
>> mechanical. (Some of that is probably a sign this code needs a bit more
>> work to polish.)
>
> Interesting. What code assumes expressions come after attributes?
> Ideally, I think it would be cleaner if no code assumed any particular
> order, but I can believe that it might be convenient in some
> circumstances.
>

Well, in a bunch of places we look at the index (from the bitmap) and use it to determine whether the value is a regular attribute or an expression, because the values are stored in separate arrays. This is solvable; all I'm saying (both here and in the preceding part about dependencies) is that it's not an entirely mechanical task.
But it might be better to rethink the separation of simple values and expressions, and make it more "unified" so that most of the code does not really need to deal with these differences.

>> But I wonder if it'd be easier to just calculate the actual max attnum
>> and then use it instead of MaxHeapAttributeNumber ...
>
> Hmm, I'm not sure how that would work. There still needs to be an
> attnum that gets stored in the database, and it has to continue to
> work if the user adds columns to the table. That's why I was
> advocating storing negative values, though I haven't actually tried it
> to see what might go wrong.
>

Well, yeah, we need to identify the expression in some statistics (e.g. in dependencies or ndistinct items). And yeah, offsetting the expression attnums by MaxHeapAttributeNumber may be inefficient in this case.

Attached is an updated version of the patch, hopefully addressing all issues pointed out by you, Justin and Zhihong, with the exception of the expression attnums discussed here.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
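For readers following the attnum discussion above, the encoding the patch currently uses can be sketched in a few lines. This is a purely illustrative toy model (the names `encode_expr` and `decode` are not PostgreSQL's API); it only shows why the first expression ends up at attnum 1601 and where the sparse gap in the Bitmapsets comes from.

```python
# Illustrative sketch (not PostgreSQL code) of the offsetting discussed
# above: expressions are stored as attnums above MaxHeapAttributeNumber
# (1600), so one non-negative integer -- and hence one bitmapset member --
# can identify either a plain column or an expression.
MAX_HEAP_ATTRIBUTE_NUMBER = 1600

def encode_expr(expr_index):
    # the first expression (index 0) becomes 1601, the second 1602, ...
    return MAX_HEAP_ATTRIBUTE_NUMBER + 1 + expr_index

def decode(key):
    # returns ('expr', index) or ('attr', attnum)
    if key > MAX_HEAP_ATTRIBUTE_NUMBER:
        return ('expr', key - MAX_HEAP_ATTRIBUTE_NUMBER - 1)
    return ('attr', key)
```

For a table with a handful of columns, set members cluster near 0 and near 1601, which is exactly the inefficient Bitmapset layout Dean points out.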
Attachments
This already needs to be rebased on 55dc86eca. And needs to update rules.out. And doesn't address this one: On Sun, Jan 17, 2021 at 10:53:31PM -0600, Justin Pryzby wrote: > | postgres=# CREATE TABLE t(i int); > | postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t; > | postgres=# \d t > | Table "public.t" > | Column | Type | Collation | Nullable | Default > | --------+---------+-----------+----------+--------- > | i | integer | | | > | Indexes: > | "t_i_idx" btree (i) > | Statistics objects: > | "public"."s2" (ndistinct, dependencies, mcv) ON FROM t > > on ... what ? -- Justin
On 1/22/21 3:29 AM, Justin Pryzby wrote:
> This already needs to be rebased on 55dc86eca.
>
> And needs to update rules.out.
>

Whooops. A fixed version attached.

> And doesn't address this one:
>
> On Sun, Jan 17, 2021 at 10:53:31PM -0600, Justin Pryzby wrote:
>> | postgres=# CREATE TABLE t(i int);
>> | postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
>> | postgres=# \d t
>> | Table "public.t"
>> | Column | Type | Collation | Nullable | Default
>> | --------+---------+-----------+----------+---------
>> | i | integer | | |
>> | Indexes:
>> | "t_i_idx" btree (i)
>> | Statistics objects:
>> | "public"."s2" (ndistinct, dependencies, mcv) ON FROM t
>>
>> on ... what ?
>

Umm, for me that prints:

test=# CREATE TABLE t(i int);
CREATE TABLE
test=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
CREATE STATISTICS
test=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 i      | integer |           |          |
Statistics objects:
    "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t

which I think is OK. But maybe there's something else to trigger the problem?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote: > > > | Statistics objects: > > > | "public"."s2" (ndistinct, dependencies, mcv) ON FROM t > > Umm, for me that prints: > "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t > > which I think is OK. But maybe there's something else to trigger the > problem? Oh. It's because I was using /usr/bin/psql and not ./src/bin/psql. I think it's considered ok if old client's \d commands don't work on new server, but it's not clear to me if it's ok if they misbehave. It's almost better it made an ERROR. In any case, why are there so many parentheses ? -- Justin
On Thu, Jan 21, 2021 at 10:01:01PM -0600, Justin Pryzby wrote: > On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote: > > > > | Statistics objects: > > > > | "public"."s2" (ndistinct, dependencies, mcv) ON FROM t > > > > Umm, for me that prints: > > > "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t > > > > which I think is OK. But maybe there's something else to trigger the > > problem? > > Oh. It's because I was using /usr/bin/psql and not ./src/bin/psql. > I think it's considered ok if old client's \d commands don't work on new > server, but it's not clear to me if it's ok if they misbehave. It's almost > better it made an ERROR. I think you'll maybe have to do something better - this seems a bit too weird: | postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t; | postgres=# \d t | ... | "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t It suggests including additional columns in stxkeys for each expression. Maybe that also helps give direction to response to Dean's concern? That doesn't make old psql do anything more desirable, though. Unless you also added attributes, all you can do is make it say things like "columns: ctid". > In any case, why are there so many parentheses ? -- Justin
On Fri, 22 Jan 2021 at 04:46, Justin Pryzby <pryzby@telsasoft.com> wrote: > > I think you'll maybe have to do something better - this seems a bit too weird: > > | postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t; > | postgres=# \d t > | ... > | "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t > I guess that's not surprising, given that old psql knows nothing about expressions in stats. In general, I think connecting old versions of psql to newer servers is not supported. You're lucky if \d works at all. So it shouldn't be this patch's responsibility to make that output nicer. Regards, Dean
On 1/22/21 10:00 AM, Dean Rasheed wrote:
> On Fri, 22 Jan 2021 at 04:46, Justin Pryzby <pryzby@telsasoft.com> wrote:
>>
>> I think you'll maybe have to do something better - this seems a bit too weird:
>>
>> | postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t;
>> | postgres=# \d t
>> | ...
>> | "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t
>>
>
> I guess that's not surprising, given that old psql knows nothing about
> expressions in stats.
>
> In general, I think connecting old versions of psql to newer servers
> is not supported. You're lucky if \d works at all. So it shouldn't be
> this patch's responsibility to make that output nicer.
>

Yeah. It's not clear to me what exactly we could do about this, without "backpatching" the old psql or making ruleutils.c consider the psql version. Neither of these seems possible/acceptable. I'm sure this is not the only place showing "incomplete" information in old psql on a new server.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 1/22/21 5:01 AM, Justin Pryzby wrote: > On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote: >>>> | Statistics objects: >>>> | "public"."s2" (ndistinct, dependencies, mcv) ON FROM t >> >> Umm, for me that prints: > >> "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t >> >> which I think is OK. But maybe there's something else to trigger the >> problem? > > Oh. It's because I was using /usr/bin/psql and not ./src/bin/psql. > I think it's considered ok if old client's \d commands don't work on new > server, but it's not clear to me if it's ok if they misbehave. It's almost > better it made an ERROR. > Well, how would the server know to throw an error? We can't quite patch the old psql (if we could, we could just tweak the query). > In any case, why are there so many parentheses ? > That's a bug in pg_get_statisticsobj_worker, probably. It shouldn't be adding extra parentheses, on top of what deparse_expression_pretty does. Will fix. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, 22 Jan 2021 at 03:49, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Whooops. A fixed version attached.
>

The change to pg_stats_ext_exprs isn't quite right, because now it cross joins expressions and their stats, which leads to too many rows, with the wrong stats being listed against expressions. For example:

CREATE TABLE foo (a int, b text);
INSERT INTO foo SELECT 1, 'xxx' FROM generate_series(1,1000);
CREATE STATISTICS foo_s ON (a*10), upper(b) FROM foo;
ANALYSE foo;

SELECT tablename, statistics_name, expr, most_common_vals
  FROM pg_stats_ext_exprs;

 tablename | statistics_name |   expr   | most_common_vals
-----------+-----------------+----------+------------------
 foo       | foo_s           | (a * 10) | {10}
 foo       | foo_s           | (a * 10) | {XXX}
 foo       | foo_s           | upper(b) | {10}
 foo       | foo_s           | upper(b) | {XXX}
(4 rows)

More protection is still required for tables with no analysable columns. For example:

CREATE TABLE foo();
CREATE STATISTICS foo_s ON (1) FROM foo;
INSERT INTO foo SELECT FROM generate_series(1,1000);
ANALYSE foo;

Program received signal SIGSEGV, Segmentation fault.
0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598, attrs=0x0,
    exprs=0x216b258, nvacatts=0, vacatts=0x216cb40) at extended_stats.c:664
664         stats[i]->tupDesc = vacatts[0]->tupDesc;

#0  0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598,
    attrs=0x0, exprs=0x216b258, nvacatts=0, vacatts=0x216cb40)
    at extended_stats.c:664
#1  0x000000000090da93 in BuildRelationExtStatistics (onerel=0x7f7766b37598,
    totalrows=1000, numrows=100, rows=0x216d040, natts=0,
    vacattrstats=0x216cb40) at extended_stats.c:161
#2  0x000000000066ea97 in do_analyze_rel (onerel=0x7f7766b37598,
    params=0x7ffc06f7d450, va_cols=0x0,
    acquirefunc=0x66f71a <acquire_sample_rows>, relpages=4, inh=false,
    in_outer_xact=false, elevel=13) at analyze.c:595

Attached is an incremental update fixing those issues, together with a few more suggested improvements:

There was quite a bit of code duplication in extended_stats.c which I attempted to reduce by

1). Deleting examine_opclause_expression() in favour of examine_clause_args().
2). Deleting examine_opclause_expression2() in favour of examine_clause_args2().
3). Merging examine_clause_args() and examine_clause_args2(), renaming it examine_opclause_args() (which was actually the name it had in its original doc comment, despite the name in the code being different).
4). Merging statext_extract_expression() and statext_extract_expression_internal() into statext_is_compatible_clause() and statext_is_compatible_clause_internal() respectively.

That last change goes beyond just removing code duplication.
It allows support for compound clauses that contain a mix of attribute and expression clauses; for example, this simple test case wasn't previously estimated well:

CREATE TABLE foo (a int, b int, c int);
INSERT INTO foo SELECT x/100, x/100, x/100 FROM generate_series(1,10000) g(x);
CREATE STATISTICS foo_s on a,b,(c*c) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE SELECT * FROM foo WHERE a=1 AND (b=1 OR c*c=1);

I didn't add any new regression tests, but perhaps it would be worth adding something to test a case like that.

I changed choose_best_statistics() in a couple of ways. Firstly, I think it wants to only count expressions from fully covered clauses, just as we only count attributes if the stat covers all the attributes from a clause, since otherwise the stat cannot estimate the clause, so it shouldn't count. Secondly, I think the number of expressions in the stat needs to be added to its number of keys, so that the choice of the narrowest stat with the same number of matches counts expressions in the same way as attributes.

I simplified the code in statext_mcv_clauselist_selectivity(), by attempting to handle expressions and attributes together in the same way, making it much closer to the original code. I don't think that the check for the existence of a stat covering all the expressions in a clause was necessary when pre-processing the list of clauses, since that's checked later on, so it's enough to just detect compatible clauses. Also, it now checks for stats that cover both the attributes and the expressions from each clause, rather than one or the other, to cope with examples like the one above. I also updated the check for simple_clauses -- what's wanted there is to identify clauses that only reference a single column or a single expression, so that the later code doesn't apply multi-column estimates to it.
I'm attaching it as an incremental patch (0004) on top of your patches, but if 0003 and 0004 are collapsed together, the total number of diffs is less than 0003 alone.

Regards,
Dean
Attachments
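The statistics-selection rule described in the review above (count expressions only from fully covered clauses, and weigh expressions like attributes when preferring the narrowest stat) might be sketched as follows. This is a toy model with made-up names and data shapes, not the actual choose_best_statistics() code.

```python
# Toy model of the selection rule: among candidate stats objects, pick the
# one matching the most clause attributes plus expressions, breaking ties
# in favour of the narrowest stat, where "width" counts both attributes
# and expressions (the change suggested in the review).
def choose_best_stat(stats, clause_attrs, clause_exprs):
    best = None
    best_matched = 0
    best_width = 0
    for s in stats:
        matched = len(clause_attrs & s['attrs']) + len(clause_exprs & s['exprs'])
        width = len(s['attrs']) + len(s['exprs'])
        if matched > best_matched or (
                matched == best_matched and best is not None and width < best_width):
            best, best_matched, best_width = s, matched, width
    return best
```

With stats on [a, b] and [a, b, (c*c)], a clause on just a and b picks the narrower first stat, while a clause that also references (c*c) picks the second; without counting expressions in the width, the tie-break would treat both stats as equally narrow.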
On 1/27/21 12:02 PM, Dean Rasheed wrote: > On Fri, 22 Jan 2021 at 03:49, Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Whooops. A fixed version attached. >> > > The change to pg_stats_ext_exprs isn't quite right, because now it > cross joins expressions and their stats, which leads to too many rows, > with the wrong stats being listed against expressions. For example: > > CREATE TABLE foo (a int, b text); > INSERT INTO foo SELECT 1, 'xxx' FROM generate_series(1,1000); > CREATE STATISTICS foo_s ON (a*10), upper(b) FROM foo; > ANALYSE foo; > > SELECT tablename, statistics_name, expr, most_common_vals > FROM pg_stats_ext_exprs; > > tablename | statistics_name | expr | most_common_vals > -----------+-----------------+----------+------------------ > foo | foo_s | (a * 10) | {10} > foo | foo_s | (a * 10) | {XXX} > foo | foo_s | upper(b) | {10} > foo | foo_s | upper(b) | {XXX} > (4 rows) > > > More protection is still required for tables with no analysable > columns. For example: > > CREATE TABLE foo(); > CREATE STATISTICS foo_s ON (1) FROM foo; > INSERT INTO foo SELECT FROM generate_series(1,1000); > ANALYSE foo; > > Program received signal SIGSEGV, Segmentation fault. 
> 0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598, attrs=0x0, > exprs=0x216b258, nvacatts=0, vacatts=0x216cb40) at extended_stats.c:664 > 664 stats[i]->tupDesc = vacatts[0]->tupDesc; > > #0 0x000000000090e9d4 in lookup_var_attr_stats (rel=0x7f7766b37598, > attrs=0x0, exprs=0x216b258, nvacatts=0, vacatts=0x216cb40) > at extended_stats.c:664 > #1 0x000000000090da93 in BuildRelationExtStatistics (onerel=0x7f7766b37598, > totalrows=1000, numrows=100, rows=0x216d040, natts=0, > vacattrstats=0x216cb40) at extended_stats.c:161 > #2 0x000000000066ea97 in do_analyze_rel (onerel=0x7f7766b37598, > params=0x7ffc06f7d450, va_cols=0x0, > acquirefunc=0x66f71a <acquire_sample_rows>, relpages=4, inh=false, > in_outer_xact=false, elevel=13) at analyze.c:595 > > > Attached is an incremental update fixing those issues, together with a > few more suggested improvements: > > There was quite a bit of code duplication in extended_stats.c which I > attempted to reduce by > > 1). Deleting examine_opclause_expression() in favour of examine_clause_args(). > 2). Deleting examine_opclause_expression2() in favour of examine_clause_args2(). > 3). Merging examine_clause_args() and examine_clause_args2(), renaming > it examine_opclause_args() (which was actually the name it had in its > original doc comment, despite the name in the code being different). > 4). Merging statext_extract_expression() and > statext_extract_expression_internal() into > statext_is_compatible_clause() and > statext_is_compatible_clause_internal() respectively. > > That last change goes beyond just removing code duplication. 
It allows > support for compound clauses that contain a mix of attribute and > expression clauses, for example, this simple test case wasn't > previously estimated well: > > CREATE TABLE foo (a int, b int, c int); > INSERT INTO foo SELECT x/100, x/100, x/100 FROM generate_series(1,10000) g(x); > CREATE STATISTICS foo_s on a,b,(c*c) FROM foo; > ANALYSE foo; > EXPLAIN ANALYSE SELECT * FROM foo WHERE a=1 AND (b=1 OR c*c=1); > > I didn't add any new regression tests, but perhaps it would be worth > adding something to test a case like that. > > I changed choose_best_statistics() in a couple of ways. Firstly, I > think it wants to only count expressions from fully covered clauses, > just as we only count attributes if the stat covers all the attributes > from a clause, since otherwise the stat cannot estimate the clause, so > it shouldn't count. Secondly, I think the number of expressions in the > stat needs to be added to it's number of keys, so that the choice of > narrowest stat with the same number of matches counts expressions in > the same way as attributes. > > I simplified the code in statext_mcv_clauselist_selectivity(), by > attempting to handle expressions and attributes together in the same > way, making it much closer to the original code. I don't think that > the check for the existence of a stat covering all the expressions in > a clause was necessary when pre-processing the list of clauses, since > that's checked later on, so it's enough to just detect compatible > clauses. Also, it now checks for stats that cover both the attributes > and the expressions from each clause, rather than one or the other, to > cope with examples like the one above. I also updated the check for > simple_clauses -- what's wanted there is to identify clauses that only > reference a single column or a single expression, so that the later > code doesn't apply multi-column estimates to it. 
>
> I'm attaching it as an incremental patch (0004) on top of your patches,
> but if 0003 and 0004 are collapsed together, the total number of diffs
> is less than 0003 alone.
>

Thanks. All of this seems like a clear improvement, both removing the duplicate copy-pasted code, and fixing the handling of the cases that mix plain variables and expressions. FWIW I agree there should be a regression test for this, so I'll add one.

I think the main remaining issue is how we handle the expressions in bitmapsets. I've been experimenting with this a bit, but shifting the regular attnums and stashing expressions before them seems quite complex, especially when we don't know how many expressions there are (e.g. when merging functional dependencies). It's true using attnums above MaxHeapAttributeNumber for expressions wastes ~200B, but is that really an issue, considering it's a very short-lived allocation?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

Attached is a rebased patch series, merging the changes from the last review into the 0003 patch, and with a WIP patch 0004 reworking the tracking of expressions (to address the inefficiency due to relying on MaxHeapAttributeNumber).

The 0004 patch is very much experimental, with a lot of ad hoc changes. It passes make check, but it definitely needs much more work, cleanup and testing. At this point it's more a demonstration of what would be needed to rework it like this.

The main change is that instead of handling expressions by assigning them attnums above MaxHeapAttributeNumber, we assign them system-like attnums, i.e. negative ones. So the first one gets -1, the second one -2, etc. And then we shift all attnums above 0, to allow using the bitmapset as before.

Overall, this works, but the shifting is kinda pointless - it allows us to build a bitmapset, but it's mostly useless because it depends on how many expressions are in the statistics definition. So we can't compare or combine bitmapsets for different statistics, and (more importantly) we can't easily compare bitmapsets on attnums from clauses. Using MaxHeapAttributeNumber allowed using the bitmapsets at least for regular attributes. Not sure if that's a major advantage, outweighing wasting some space.

I wonder if we should just ditch the bitmapsets, and just use simple arrays of attnums. I don't think we expect too many elements here, especially when dealing with individual statistics. So now we're just building and rebuilding the bitmapsets ... seems pointless.

One thing I'd like to improve (independently of what we do with the bitmapsets) is getting rid of the distinction between attributes and expressions when building the statistics - currently all the various places have to care about whether the item is an attribute or an expression, and look either into the tuple or an array of pre-calculated values, do various shifts to get the indexes, etc.
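The 0004 encoding described above, and the reason the shifted bitmapsets are not comparable across statistics objects, can be sketched like this. The helper names are illustrative only, not the patch's code.

```python
# Toy model of the 0004 scheme: expressions get system-like negative
# attnums (-1, -2, ...) and, when building a bitmapset, everything is
# shifted up by the number of expressions so all members are non-negative.
def to_member(attnum, nexprs):
    # attnum > 0: regular column; attnum < 0: expression
    return attnum + nexprs

def from_member(member, nexprs):
    return member - nexprs

# Because the shift depends on nexprs, the same column maps to different
# bitmapset members for statistics with different expression counts --
# which is why such bitmapsets can't be compared or combined directly,
# nor matched against bitmapsets built from clause attnums.
```

For example, with two expressions, attnum -1 maps to member 1 and column 3 to member 5; with four expressions, that same column 3 maps to member 7.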
That's quite tedious, and I've made a lot of errors in that (and I'm sure there are more). So IMO we should simplify this by replacing this with something containing values for both attributes and expressions, handling it in a unified way. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
Hi, On 3/17/21 4:55 PM, Dean Rasheed wrote: > On Sun, 7 Mar 2021 at 21:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: >> >> 2) ndistinct >> >> There's one thing that's bugging me, in how we handle "partial" matches. >> For each expression we track both the original expression and the Vars >> we extract from it. If we can't find a statistics matching the whole >> expression, we try to match those individual Vars, and we remove the >> matching ones from the list. And in the end we multiply the estimates >> for the remaining Vars. >> >> This works fine with one matching ndistinct statistics. Consider for example >> >> GROUP BY (a+b), (c+d) >> >> with statistics on [(a+b),c] - that is, expression and one column. > > I've just been going over this example, and I think it actually works > slightly differently from how you described, though I haven't worked > out the full general implications of that. > >> We parse the expressions into two GroupExprInfo >> >> {expr: (a+b), vars: [a, b]} >> {expr: (c+d), vars: [c, d]} >> > > Here, I think what you actually get, in the presence of stats on > [(a+b),c] is actually the following two GroupExprInfos: > > {expr: (a+b), vars: []} > {expr: (c+d), vars: [c, d]} > Yeah, right. To be precise, we get {expr: (a+b), vars: [(a+b)]} because in the first case we pass NIL, so add_unique_group_expr treats the whole expression as a var (a bit strange, but OK). > because of the code immediately after this comment in estimate_num_groups(): > > /* > * If examine_variable is able to deduce anything about the GROUP BY > * expression, treat it as a single variable even if it's really more > * complicated. > */ > > As it happens, that makes no difference in this case, where there is > just this one stats object, but it does change things when there are > two stats objects. > >> and the statistics matches the first item exactly (the expression). The >> second expression is not in the statistics, but we match "c". 
>> So we end up with an estimate for "(a+b), c" and have one remaining
>> GroupExprInfo:
>>
>> {expr: (c+d), vars: [d]}
>
> Right.
>
>> Without any other statistics we estimate that as ndistinct for "d", so
>> we end up with
>>
>> ndistinct((a+b), c) * ndistinct(d)
>>
>> which mostly makes sense. It assumes ndistinct(c+d) is product of the
>> ndistinct estimates, but that's kinda what we've been always doing.
>
> Yes, that appears to be what happens, and it's probably the best that
> can be done with the available stats.
>
>> But now consider we have another statistics on just (c+d). In the second
>> loop we end up matching this expression exactly, so we end up with
>>
>> ndistinct((a+b), c) * ndistinct((c+d))
>
> In this case, with stats on (c+d) as well, the two GroupExprInfos
> built at the start change to:
>
> {expr: (a+b), vars: []}
> {expr: (c+d), vars: []}
>
> so it actually ends up not using any multivariate stats, but instead
> uses the two univariate expression stats, giving
>
> ndistinct((a+b)) * ndistinct((c+d))
>
> which actually seems pretty good, and gave very good estimates in the
> simple test case I tried.
>

Yeah, that works pretty well in this case. I wonder if we'd be better off extracting the Vars and doing what I mistakenly described as the current behavior. That would essentially mean extracting the Vars even in the case where we now pass NIL.

My concern is that the current behavior (where we prefer expression stats over multi-column stats to some extent) works fine as long as the parts are independent, but once there's dependency it's probably more likely to produce underestimates. I think underestimates for grouping estimates were a risk in the past, so let's not make that worse.

>> i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think
>> what we should do after the first loop is just discarding the whole
>> expression and "expand" into per-variable GroupExprInfo, so in the
>> second step we would not match the (c+d) statistics.
> > Not using the (c+d) stats would give either > > ndistinct((a+b)) * ndistinct(c) * ndistinct(d) > > or > > ndistinct((a+b), c) * ndistinct(d) > > depending on exactly how the algorithm was changed. In my test, these > both gave worse estimates, but there are probably other datasets for > which they might do better. It all depends on how much correlation > there is between (a+b) and c. > > I suspect that there is no optimal strategy for handling overlapping > stats that works best for all datasets, but the current algorithm > seems to do a pretty decent job. > >> Of course, maybe there's a better way to pick the statistics, but I >> think our conclusion so far was that people should just create >> statistics covering all the columns in the query, to not have to match >> multiple statistics like this. > > Yes, I think that's always likely to work better, especially for > ndistinct stats, where all possible permutations of subsets of the > columns are included, so a single ndistinct stat can work well for a > range of different queries. > Yeah, I agree that's a reasonable mitigation. Ultimately, there's no perfect algorithm how to pick and combine stats when we don't know if there even is a statistical dependency between the subsets of columns. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 17 Mar 2021 at 17:26, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > My concern is that the current behavior (where we prefer expression > stats over multi-column stats to some extent) works fine as long as the > parts are independent, but once there's dependency it's probably more > likely to produce underestimates. I think underestimates for grouping > estimates were a risk in the past, so let's not make that worse. > I'm not sure the current behaviour really is preferring expression stats over multi-column stats. In this example, where we're grouping by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of those multi-column stats actually match more than one column/expression. If anything, I'd go the other way and say that it was wrong to use the [(a+b),c] stats in the first case, where they were the only stats available, since those stats aren't really applicable to (c+d), which probably ought to be treated as independent. IOW, it might have been better to estimate the first case as ndistinct((a+b)) * ndistinct(c) * ndistinct(d) and the second case as ndistinct((a+b)) * ndistinct((c+d)) Regards, Dean
On 3/17/21 7:54 PM, Dean Rasheed wrote: > On Wed, 17 Mar 2021 at 17:26, Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> My concern is that the current behavior (where we prefer expression >> stats over multi-column stats to some extent) works fine as long as the >> parts are independent, but once there's dependency it's probably more >> likely to produce underestimates. I think underestimates for grouping >> estimates were a risk in the past, so let's not make that worse. >> > > I'm not sure the current behaviour really is preferring expression > stats over multi-column stats. In this example, where we're grouping > by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of > those multi-column stats actually match more than one > column/expression. If anything, I'd go the other way and say that it > was wrong to use the [(a+b),c] stats in the first case, where they > were the only stats available, since those stats aren't really > applicable to (c+d), which probably ought to be treated as > independent. IOW, it might have been better to estimate the first case > as > > ndistinct((a+b)) * ndistinct(c) * ndistinct(d) > > and the second case as > > ndistinct((a+b)) * ndistinct((c+d)) > OK. I might be confused, but isn't that what the algorithm currently does? Or am I just confused about what the first/second case refers to? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 17 Mar 2021 at 19:07, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 3/17/21 7:54 PM, Dean Rasheed wrote:
> >
> > it might have been better to estimate the first case as
> >
> > ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> >
> > and the second case as
> >
> > ndistinct((a+b)) * ndistinct((c+d))
>
> OK. I might be confused, but isn't that what the algorithm currently
> does? Or am I just confused about what the first/second case refers to?
>

No, it currently estimates the first case as ndistinct((a+b),c) * ndistinct(d).

Having said that, maybe that's OK after all. It at least makes an effort to account for any correlation between (a+b) and (c+d), using the known correlation between (a+b) and c.

For reference, here is the test case I was using (which isn't really very good for catching dependence between columns):

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int, c int, d int);
INSERT INTO foo SELECT x%10, x%11, x%12, x%13 FROM generate_series(1,100000) x;
SELECT COUNT(DISTINCT a) FROM foo; -- 10
SELECT COUNT(DISTINCT b) FROM foo; -- 11
SELECT COUNT(DISTINCT c) FROM foo; -- 12
SELECT COUNT(DISTINCT d) FROM foo; -- 13
SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 20
SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 24
SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 228
SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 478

-- First case: stats on [(a+b),c]
CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
-- Estimate = 2964, Actual = 478
-- This estimate is ndistinct((a+b),c) * ndistinct(d) = 228*13

-- Second case: stats on (c+d) as well
CREATE STATISTICS s2 ON (c+d) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
-- Estimate = 480, Actual = 478
-- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 20*24

I think that's probably pretty reasonable behaviour, given incomplete stats (the estimate with no extended stats is capped at 10000).

Regards,
Dean
On Wed, 17 Mar 2021 at 20:48, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> For reference, here is the test case I was using (which isn't really very good for
> catching dependence between columns):
>

And here's a test case with much more dependence between the columns:

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int, c int, d int);
INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,100000) x;

SELECT COUNT(DISTINCT a) FROM foo; -- 2
SELECT COUNT(DISTINCT b) FROM foo; -- 5
SELECT COUNT(DISTINCT c) FROM foo; -- 10
SELECT COUNT(DISTINCT d) FROM foo; -- 15
SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6
SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20
SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10
SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30

-- First case: stats on [(a+b),c]
CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
-- Estimate = 150, Actual = 30
-- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15,
-- which is much better than ndistinct((a+b)) * ndistinct(c) * ndistinct(d) = 6*10*15 = 900
-- Estimate with no stats = 1500

-- Second case: stats on (c+d) as well
CREATE STATISTICS s2 ON (c+d) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
-- Estimate = 120, Actual = 30
-- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20

Again, I'd say the current behaviour is pretty good.

Regards,
Dean
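[Editorial note: the correlated data set can be cross-checked the same way. This Python sketch (names invented, numbers from the email) reproduces the distinct counts and the three estimates being compared:]

```python
# Rebuild the correlated table: a = x%2, b = x%5, c = x%10, d = x%15.
rows = [(x % 2, x % 5, x % 10, x % 15) for x in range(1, 100001)]

def ndistinct(f):
    """COUNT(DISTINCT f(a, b, c, d)) over the rows."""
    return len({f(*r) for r in rows})

assert ndistinct(lambda a, b, c, d: a + b) == 6
assert ndistinct(lambda a, b, c, d: c + d) == 20
assert ndistinct(lambda a, b, c, d: (a + b, c)) == 10
assert ndistinct(lambda a, b, c, d: (a + b, c + d)) == 30  # actual group count

print(10 * 15)      # 150: ndistinct((a+b),c) * ndistinct(d), first case
print(6 * 20)       # 120: ndistinct((a+b)) * ndistinct((c+d)), second case
print(6 * 10 * 15)  # 900: treating (a+b), c, d as fully independent
```

Here the dependence between (a+b) and c is strong (only 10 joint values out of a possible 60), which is why using the multivariate statistics object cuts the estimate from 900 down to 150.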
On 3/17/21 9:58 PM, Dean Rasheed wrote:
> On Wed, 17 Mar 2021 at 20:48, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>>
>> For reference, here is the test case I was using (which isn't really very good for
>> catching dependence between columns):
>>
>
> And here's a test case with much more dependence between the columns:
>
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (a int, b int, c int, d int);
> INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,100000) x;
> SELECT COUNT(DISTINCT a) FROM foo; -- 2
> SELECT COUNT(DISTINCT b) FROM foo; -- 5
> SELECT COUNT(DISTINCT c) FROM foo; -- 10
> SELECT COUNT(DISTINCT d) FROM foo; -- 15
> SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6
> SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20
> SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10
> SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30
>
> -- First case: stats on [(a+b),c]
> CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE
> SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
> -- Estimate = 150, Actual = 30
> -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15,
> -- which is much better than ndistinct((a+b)) * ndistinct(c) * ndistinct(d) = 6*10*15 = 900
> -- Estimate with no stats = 1500
>
> -- Second case: stats on (c+d) as well
> CREATE STATISTICS s2 ON (c+d) FROM foo;
> ANALYSE foo;
> EXPLAIN ANALYSE
> SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
> -- Estimate = 120, Actual = 30
> -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20
>
> Again, I'd say the current behaviour is pretty good.
>

Thanks! I agree applying at least the [(a+b),c] stats is probably the right approach, as it means we're considering at least the available information about dependence between the columns.

I think to improve this, we'll need to teach the code to use overlapping statistics, a bit like conditional probability.
In this case we might do something like this:

ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c))

Which in this case would be either, for the "less correlated" case

228 * 24 / 12 = 456 (actual = 478, current estimate = 480)

or, for the "more correlated" case

10 * 20 / 10 = 20 (actual = 30, current estimate = 120)

But that's clearly a matter for a future patch, and I'm sure there are cases where this will produce worse estimates.

Anyway, I plan to go over the patches one more time, and start pushing them sometime early next week. I don't want to leave it until the very last moment in the CF.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
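[Editorial note: the "conditional probability" idea sketched above is just arithmetic on ndistinct values, so it can be checked against both test cases from earlier in the thread. The helper below is a made-up name, not planner code; note that 228 * 24 / 12 works out to 456.]

```python
def corrected_ndistinct(nd_ab_c, nd_cd, nd_c):
    # ndistinct((a+b),c) * ndistinct((c+d)) / ndistinct(c):
    # scale the joint estimate by how much (c+d) adds beyond c alone.
    return nd_ab_c * nd_cd // nd_c

# "Less correlated" case: ndistinct((a+b),c)=228, ndistinct((c+d))=24, ndistinct(c)=12
print(corrected_ndistinct(228, 24, 12))  # 456 (actual = 478, current estimate = 480)

# "More correlated" case: ndistinct((a+b),c)=10, ndistinct((c+d))=20, ndistinct(c)=10
print(corrected_ndistinct(10, 20, 10))   # 20 (actual = 30, current estimate = 120)
```

In the correlated case the correction overshoots in the other direction (20 vs the actual 30), which illustrates the point that dividing out the overlap can also produce worse estimates.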
On Wed, 17 Mar 2021 at 21:31, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I agree applying at least the [(a+b),c] stats is probably the right > approach, as it means we're considering at least the available > information about dependence between the columns. > > I think to improve this, we'll need to teach the code to use overlapping > statistics, a bit like conditional probability. In this case we might do > something like this: > > ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c)) Yes, I was thinking the same thing. That would be equivalent to applying a multiplicative "correction" factor of ndistinct(a,b,c,...) / ( ndistinct(a) * ndistinct(b) * ndistinct(c) * ... ) for each multivariate stat applicable to more than one column/expression, regardless of whether those columns were already covered by other multivariate stats. That might well simplify the implementation, as well as probably produce better estimates. > But that's clearly a matter for a future patch, and I'm sure there are > cases where this will produce worse estimates. Agreed. > Anyway, I plan to go over the patches one more time, and start pushing > them sometime early next week. I don't want to leave it until the very > last moment in the CF. +1. I think they're in good enough shape for that process to start. Regards, Dean
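[Editorial note: Dean's reformulation — one multiplicative correction factor per multi-column statistics object, applied to the independence estimate — can be sketched as a toy model. This is not the actual estimate_num_groups code; the function and dictionary below are invented for illustration. For the correlated test case it reproduces the current 150 estimate (here the per-expression inputs are (a+b), c, d, so the "no correction" baseline is 900, not the thread's no-stats estimate of 1500, which uses a and b separately).]

```python
from math import prod

def estimate_groups(nd, stats_objects):
    """Toy model: start from the independence estimate (product of per-expression
    ndistinct values), then for each multivariate statistics object covering
    more than one expression, multiply by ndistinct(joint) / product(individual),
    regardless of whether the expressions are covered by other objects too."""
    num = prod(nd.values())
    den = 1
    for cols, joint in stats_objects:
        num *= joint
        den *= prod(nd[c] for c in cols)
    return num / den

# GROUP BY (a+b), (c+d) on the correlated table; with no stats on (c+d),
# the second expression decomposes into c and d:
nd = {'(a+b)': 6, 'c': 10, 'd': 15}

print(estimate_groups(nd, []))                      # 900.0: no multivariate correction
print(estimate_groups(nd, [(('(a+b)', 'c'), 10)]))  # 150.0: correction 10/(6*10) applied
```

The appeal of this formulation is that overlapping statistics objects would each contribute their own factor, rather than the code having to pick a single "winning" object per group of expressions.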
Hi,

I've pushed the first two parts, dealing with composite types during bootstrap. I've decided to commit both, including the array->list conversion, as that makes the reloads simpler. I've made two tweaks:

1) I've renamed the function to reload_typ_list, which I think is better (and it used to be reload_typ_array).

2) I've removed the did_reread assert. It'd allow just a single reload, which blocks recursive composite types - seems unnecessary, although we don't need that now. I think we can't have infinite recursion, because we can only load types from already defined catalogs (so no cycles).

I've rebased and cleaned up the main part of the patch. There's a bunch of comments slightly reworded / cleaned up, etc. The more significant changes are:

1) The explanation of the example added to create_statistics.sgml was somewhat wrong, so I corrected that.

2) I've renamed StatBuildData to StatsBuildData.

3) I've resolved the FIXMEs in examine_expression. We don't need to do anything special about the collation, because unlike indexes it's not possible to specify "collate" for the attributes. It's possible to do that in the expression, but exprCollation handles that. For statistics target, we simply use the value determined for the statistics itself. There's no way to specify that for expressions.

4) I've updated the comments about ndistinct estimates in selfuncs.c, because some of it was a bit wrong/misleading - per the discussion we had about matching stats to expressions.

5) I've also tweaked the add_unique_group_expr so that we don't have to run examine_variable() repeatedly if we already have it.

6) Resolved the FIXME about acl_ok in examine_variable. Instead of just setting it to 'true' it now mimics what we do for indexes. I think it's correct, but this is probably worth a review.

7) psql now prints expressions only for (version > 14). I've considered tweaking the existing block, but that was quite incomprehensible so I just made a modified copy.
I think this is 99.999% ready to be pushed, so barring objections I'll do that in a day or two. The part 0003 is a small tweak I'll consider, preferring exact matches of expressions over Var matches. It's not a clear win (but what is, in a greedy algorithm), but it does improve one of the regression tests. Minor change, though. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
Most importantly, it looks like this forgets to update catalog documentation for stxexprs and stxkind='e' It seems like you're preferring to use pluralized "statistics" in a lot of places that sound wrong to me. For example: > Currently the first statistics wins, which seems silly. I can write more separately, but I think this is resolved and clarified if you write "statistics object" and not just "statistics". > + Name of schema containing table I don't know about the nearby descriptions, but this one sounds too much like a "schema-containing" table. Say "Name of the schema which contains the table" ? > + Name of table Say "name of table on which the extended statistics are defined" > + Name of extended statistics "Name of the extended statistic object" > + Owner of the extended statistics ..object > + Expression the extended statistics is defined on I think it should say "the extended statistic", or "the extended statistics object". Maybe "..on which the extended statistic is defined" > + of random access to the disk. (This expression is null if the expression > + data type does not have a <literal><</literal> operator.) expression's data type > + much-too-small row count estimate in the first two queries. Moreover, the maybe say "dramatically underestimates the rowcount" > + planner has no information about relationship between the expressions, so it the relationship > + assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal> > + conditions are independent, and multiplies their selectivities together to > + arrive at a much-too-high group count estimate in the aggregate query. severe overestimate ? > + This is further exacerbated by the lack of accurate statistics for the > + expressions, forcing the planner to use default ndistinct estimate for the use *a default > + expression derived from ndistinct for the column. 
With such statistics, the > + planner recognizes that the conditions are correlated and arrives at much > + more accurate estimates. are correlated comma > + if (type->lt_opr == InvalidOid) These could be !OidIsValid > + * expressions. It's either expensive or very easy to defeat for > + * determined used, and there's no risk if we allow such statistics (the > + * statistics is useless, but harmless). I think it's meant to say "for a determined user" ? > + * If there are no simply-referenced columns, give the statistics an auto > + * dependency on the whole table. In most cases, this will be redundant, > + * but it might not be if the statistics expressions contain no Vars > + * (which might seem strange but possible). > + */ > + if (!nattnums) > + { > + ObjectAddressSet(parentobject, RelationRelationId, relid); > + recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO); > + } Can this be unconditional ? > + * Translate the array of indexs to regular attnums for the dependency (we sp: indexes > + * Not found a matching expression, so we can simply skip Found no matching expr > + /* if found a matching, */ matching .. > +examine_attribute(Node *expr) Maybe you should rename this to something distinct ? So it's easy to add a breakpoint there, for example. > + stats->anl_context = CurrentMemoryContext; /* XXX should be using > + * something else? */ > + bool nulls[Natts_pg_statistic]; ... > + * Construct a new pg_statistic tuple > + */ > + for (i = 0; i < Natts_pg_statistic; ++i) > + { > + nulls[i] = false; > + } Shouldn't you just write nulls[Natts_pg_statistic] = {false}; or at least: memset(nulls, 0, sizeof(nulls)); > + * We don't store collations used to build the statistics, but > + * we can use the collation for the attribute itself, as > + * stored in varcollid. We do reset the statistics after a > + * type change (including collation change), so this is OK. We > + * may need to relax this after allowing extended statistics > + * on expressions. 
This text should be updated or removed ?

> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname,
> }
>
> /* print any extended statistics */
> - if (pset.sversion >= 100000)
> + if (pset.sversion >= 140000)
> + {
> + printfPQExpBuffer(&buf,
> + "SELECT oid, "
> + "stxrelid::pg_catalog.regclass, "
> + "stxnamespace::pg_catalog.regnamespace AS nsp, "
> + "stxname,\n"
> + "pg_get_statisticsobjdef_columns(oid) AS columns,\n"
> + " 'd' = any(stxkind) AS ndist_enabled,\n"
> + " 'f' = any(stxkind) AS deps_enabled,\n"
> + " 'm' = any(stxkind) AS mcv_enabled,\n");
> +
> + if (pset.sversion >= 130000)
> + appendPQExpBufferStr(&buf, " stxstattarget\n");
> + else
> + appendPQExpBufferStr(&buf, " -1 AS stxstattarget\n");

>= 130000 is fully determined by >= 140000 :)

> + * type of the opclass, which is not interesting for our purposes. (Note:
> + * if we did anything with non-expression index columns, we'd need to

index is wrong ?

I mentioned a bunch of other references to "index" and "predicate" which are still around:

On Thu, Jan 07, 2021 at 08:35:37PM -0600, Justin Pryzby wrote:
> There's some remaining copy/paste stuff from index expressions:
>
> errmsg("statistics expressions and predicates can refer only to the table being indexed")));
> left behind by evaluating the predicate or index expressions.
> Set up for predicate or expression evaluation
> Need an EState for evaluation of index expressions and
> partial-index predicates. Create it in the per-index context to be
> Fetch function for analyzing index expressions.
I got this crash running sqlsmith:

#1  0x00007f907574b801 in __GI_abort () at abort.c:79
#2  0x00005646b95a35f8 in ExceptionalCondition (conditionName=conditionName@entry=0x5646b97411db "bms_num_members(varnos) == 1", errorType=errorType@entry=0x5646b95fa00b "FailedAssertion", fileName=fileName@entry=0x5646b9739dbe "selfuncs.c", lineNumber=lineNumber@entry=3332) at assert.c:69
#3  0x00005646b955c9a1 in add_unique_group_expr (vars=0x5646bbd9e200, expr=0x5646b9eb0c30, exprinfos=0x5646bbd9e100, root=0x5646ba9a0cb0) at selfuncs.c:3332
#4  add_unique_group_expr (root=0x5646ba9a0cb0, exprinfos=0x5646bbd9e100, expr=0x5646b9eb0c30, vars=0x5646bbd9e200) at selfuncs.c:3307
#5  0x00005646b955d560 in estimate_num_groups () at selfuncs.c:3558
#6  0x00005646b93ad004 in create_distinct_paths (input_rel=<optimized out>, root=0x5646ba9a0cb0) at planner.c:4808
#7  grouping_planner () at planner.c:2238
#8  0x00005646b93ae0ef in subquery_planner (glob=glob@entry=0x5646ba9a0b98, parse=parse@entry=0x5646ba905d80, parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0) at planner.c:1024
#9  0x00005646b93af543 in standard_planner (parse=0x5646ba905d80, query_string=<optimized out>, cursorOptions=256, boundParams=<optimized out>) at planner.c:404
#10 0x00005646b94873ac in pg_plan_query (querytree=0x5646ba905d80, query_string=0x5646b9cd87e0 "select distinct \n  \n  pg_catalog.variance(\n    cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order by subq_0.c0) as c0,\n  subq_0.c2 as c1, \n  sub"..., cursorOptions=256, boundParams=0x0) at postgres.c:821
#11 0x00005646b94874a1 in pg_plan_queries (querytrees=0x5646baaba250, query_string=query_string@entry=0x5646b9cd87e0 "select distinct \n  \n  pg_catalog.variance(\n    cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order by subq_0.c0) as c0,\n  subq_0.c2 as c1, \n  sub"..., cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:912
#12 0x00005646b9487888 in exec_simple_query () at postgres.c:1104

2021-03-24 03:06:12.489 CDT postmaster[11653] LOG:  server process (PID 11696) was terminated by signal 6: Aborted
2021-03-24 03:06:12.489 CDT postmaster[11653] DETAIL:  Failed process was running: select distinct

  pg_catalog.variance(
    cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order by subq_0.c0) as c0,
  subq_0.c2 as c1,
  subq_0.c0 as c2,
  subq_0.c2 as c3,
  subq_0.c1 as c4,
  subq_0.c1 as c5,
  subq_0.c0 as c6
from
  (select
      ref_1.foreign_server_catalog as c0,
      ref_1.authorization_identifier as c1,
      sample_2.tgname as c2,
      ref_1.foreign_server_catalog as c3
    from
      pg_catalog.pg_stat_database_conflicts as ref_0
      left join information_schema._pg_user_mappings as ref_1
        on (ref_0.datname < ref_0.datname)
      inner join pg_catalog.pg_amproc as sample_0 tablesample system (5)
        on (cast(null as uuid) < cast(null as uuid))
      left join pg_catalog.pg_aggregate as sample_1 tablesample system (2.9)
        on (sample_0.amprocnum = sample_1.aggnumdirectargs)
      inner join pg_catalog.pg_trigger as sample_2 tablesampl
Hi Justin,

Unfortunately the query is incomplete, so I can't quite determine what went wrong. Can you extract the full query causing the crash, either from the server log or from a core file?

thanks

On 3/24/21 9:14 AM, Justin Pryzby wrote:
> I got this crash running sqlsmith:
>
> #1  0x00007f907574b801 in __GI_abort () at abort.c:79
> #2  0x00005646b95a35f8 in ExceptionalCondition (conditionName=conditionName@entry=0x5646b97411db "bms_num_members(varnos) == 1", errorType=errorType@entry=0x5646b95fa00b "FailedAssertion", fileName=fileName@entry=0x5646b9739dbe "selfuncs.c", lineNumber=lineNumber@entry=3332) at assert.c:69
> #3  0x00005646b955c9a1 in add_unique_group_expr (vars=0x5646bbd9e200, expr=0x5646b9eb0c30, exprinfos=0x5646bbd9e100, root=0x5646ba9a0cb0) at selfuncs.c:3332
> #4  add_unique_group_expr (root=0x5646ba9a0cb0, exprinfos=0x5646bbd9e100, expr=0x5646b9eb0c30, vars=0x5646bbd9e200) at selfuncs.c:3307
> #5  0x00005646b955d560 in estimate_num_groups () at selfuncs.c:3558
> #6  0x00005646b93ad004 in create_distinct_paths (input_rel=<optimized out>, root=0x5646ba9a0cb0) at planner.c:4808
> #7  grouping_planner () at planner.c:2238
> #8  0x00005646b93ae0ef in subquery_planner (glob=glob@entry=0x5646ba9a0b98, parse=parse@entry=0x5646ba905d80, parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0) at planner.c:1024
> #9  0x00005646b93af543 in standard_planner (parse=0x5646ba905d80, query_string=<optimized out>, cursorOptions=256, boundParams=<optimized out>) at planner.c:404
> #10 0x00005646b94873ac in pg_plan_query (querytree=0x5646ba905d80, query_string=0x5646b9cd87e0 "select distinct \n  \n  pg_catalog.variance(\n    cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order by subq_0.c0) as c0,\n  subq_0.c2 as c1, \n  sub"..., cursorOptions=256, boundParams=0x0) at postgres.c:821
> #11 0x00005646b94874a1 in pg_plan_queries (querytrees=0x5646baaba250, query_string=query_string@entry=0x5646b9cd87e0 "select distinct \n  \n  pg_catalog.variance(\n    cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order by subq_0.c0) as c0,\n  subq_0.c2 as c1, \n  sub"..., cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:912
> #12 0x00005646b9487888 in exec_simple_query () at postgres.c:1104
>
> 2021-03-24 03:06:12.489 CDT postmaster[11653] LOG:  server process (PID 11696) was terminated by signal 6: Aborted
> 2021-03-24 03:06:12.489 CDT postmaster[11653] DETAIL:  Failed process was running: select distinct
>
>   pg_catalog.variance(
>     cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order by subq_0.c0) as c0,
>   subq_0.c2 as c1,
>   subq_0.c0 as c2,
>   subq_0.c2 as c3,
>   subq_0.c1 as c4,
>   subq_0.c1 as c5,
>   subq_0.c0 as c6
> from
>   (select
>       ref_1.foreign_server_catalog as c0,
>       ref_1.authorization_identifier as c1,
>       sample_2.tgname as c2,
>       ref_1.foreign_server_catalog as c3
>     from
>       pg_catalog.pg_stat_database_conflicts as ref_0
>       left join information_schema._pg_user_mappings as ref_1
>         on (ref_0.datname < ref_0.datname)
>       inner join pg_catalog.pg_amproc as sample_0 tablesample system (5)
>         on (cast(null as uuid) < cast(null as uuid))
>       left join pg_catalog.pg_aggregate as sample_1 tablesample system (2.9)
>         on (sample_0.amprocnum = sample_1.aggnumdirectargs)
>       inner join pg_catalog.pg_trigger as sample_2 tablesampl

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 24, 2021 at 09:54:22AM +0100, Tomas Vondra wrote:
> Hi Justin,
>
> Unfortunately the query is incomplete, so I can't quite determine what
> went wrong. Can you extract the full query causing the crash, either
> from the server log or from a core file?

Oh, shoot, I didn't realize it was truncated, and I already destroyed the core and moved on to something else... But this fails well enough, and may be much shorter than the original :)

select distinct
  pg_catalog.variance(
    cast(pg_catalog.pg_stat_get_bgwriter_timed_checkpoints() as int8)) over (partition by subq_0.c2 order by subq_0.c0) as c0,
  subq_0.c2 as c1,
  subq_0.c0 as c2,
  subq_0.c2 as c3,
  subq_0.c1 as c4,
  subq_0.c1 as c5,
  subq_0.c0 as c6
from
  (select
      ref_1.foreign_server_catalog as c0,
      ref_1.authorization_identifier as c1,
      sample_2.tgname as c2,
      ref_1.foreign_server_catalog as c3
    from
      pg_catalog.pg_stat_database_conflicts as ref_0
      left join information_schema._pg_user_mappings as ref_1
        on (ref_0.datname < ref_0.datname)
      inner join pg_catalog.pg_amproc as sample_0 tablesample system (5)
        on (cast(null as uuid) < cast(null as uuid))
      left join pg_catalog.pg_aggregate as sample_1 tablesample system (2.9)
        on (sample_0.amprocnum = sample_1.aggnumdirectargs)
      inner join pg_catalog.pg_trigger as sample_2 tablesample system (1)
        on true
  ) subq_0;

TRAP: FailedAssertion("bms_num_members(varnos) == 1", File: "selfuncs.c", Line: 3332, PID: 16422)

Also ... with this patch CREATE STATISTICS is no longer rejecting multiple tables, and instead does this:

postgres=# CREATE STATISTICS xt ON a FROM t JOIN t ON true;
ERROR:  schema "i" does not exist

--
Justin
Thanks, it seems to be a thinko in the handling of PlaceHolderVars, which seems to break the code's assumptions about varnos. This fixes it for me, but I need to look at it more closely.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Wed, 24 Mar 2021 at 10:22, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Thanks, it seems to be some thinko in handling in PlaceHolderVars, which > seem to break the code's assumptions about varnos. This fixes it for me, > but I need to look at it more closely. > I think that makes sense. Reviewing the docs, I noticed a couple of omissions, and had a few other suggestions (attached). Regards, Dean
Attachments
On 3/24/21 2:36 PM, Dean Rasheed wrote:
> On Wed, 24 Mar 2021 at 10:22, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> Thanks, it seems to be some thinko in handling in PlaceHolderVars, which
>> seem to break the code's assumptions about varnos. This fixes it for me,
>> but I need to look at it more closely.
>>
>
> I think that makes sense.
>

AFAIK the primary issue here is that the two places disagree. While estimate_num_groups does this

    varnos = pull_varnos(root, (Node *) varshere);
    if (bms_membership(varnos) == BMS_SINGLETON)
    { ... }

the add_unique_group_expr does this

    varnos = pull_varnos(root, (Node *) groupexpr);

That is, one looks at the group expression, while the other looks at vars extracted from it by pull_var_clause(). Apparently for PlaceHolderVar this can differ, causing the crash.

So we need to change one of those places - my fix tweaked the second place to also look at the vars, but maybe we should change the other place? Or maybe it's not the right fix for PlaceHolderVars ...

> Reviewing the docs, I noticed a couple of omissions, and had a few
> other suggestions (attached).
>

Thanks! I'll include that in the next version of the patch.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 24 Mar 2021 at 14:48, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> AFAIK the primary issue here is that the two places disagree. While
> estimate_num_groups does this
>
> varnos = pull_varnos(root, (Node *) varshere);
> if (bms_membership(varnos) == BMS_SINGLETON)
> { ... }
>
> the add_unique_group_expr does this
>
> varnos = pull_varnos(root, (Node *) groupexpr);
>
> That is, one looks at the group expression, while the other looks at vars
> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
> this can differ, causing the crash.
>
> So we need to change one of those places - my fix tweaked the second
> place to also look at the vars, but maybe we should change the other
> place? Or maybe it's not the right fix for PlaceHolderVars ...
>

I think that it doesn't make any difference which place is changed.

This is a case of an expression with no stats. With your change, you'll get a single GroupExprInfo containing a list of VariableStatData's for each of its Vars, whereas if you changed it the other way, you'd get a separate GroupExprInfo for each Var. But I think they'd both end up being treated the same by estimate_multivariate_ndistinct(), since there wouldn't be any stats matching the expression, only the individual Vars. Maybe changing the first place would be the more bulletproof fix though.

Regards,
Dean
On 3/24/21 7:24 AM, Justin Pryzby wrote: > Most importantly, it looks like this forgets to update catalog documentation > for stxexprs and stxkind='e' > Good catch. > It seems like you're preferring to use pluralized "statistics" in a lot of > places that sound wrong to me. For example: >> Currently the first statistics wins, which seems silly. > I can write more separately, but I think this is resolved and clarified if you > write "statistics object" and not just "statistics". > OK "statistics object" seems better and more consistent. >> + Name of schema containing table > > I don't know about the nearby descriptions, but this one sounds too much like a > "schema-containing" table. Say "Name of the schema which contains the table" ? > I think the current spelling is OK / consistent with the other catalogs. >> + Name of table > > Say "name of table on which the extended statistics are defined" > I've used "Name of table the statistics object is defined on". >> + Name of extended statistics > > "Name of the extended statistic object" > >> + Owner of the extended statistics > > ..object > OK >> + Expression the extended statistics is defined on > > I think it should say "the extended statistic", or "the extended statistics > object". Maybe "..on which the extended statistic is defined" > OK >> + of random access to the disk. (This expression is null if the expression >> + data type does not have a <literal><</literal> operator.) > > expression's data type > OK >> + much-too-small row count estimate in the first two queries. Moreover, the > > maybe say "dramatically underestimates the rowcount" > I've changed this to "... results in a significant underestimate of row count". 
>> + planner has no information about relationship between the expressions, so it
>
> the relationship
>

OK

>> + assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
>> + conditions are independent, and multiplies their selectivities together to
>> + arrive at a much-too-high group count estimate in the aggregate query.
>
> severe overestimate ?
>

OK

>> + This is further exacerbated by the lack of accurate statistics for the
>> + expressions, forcing the planner to use default ndistinct estimate for the
>
> use *a default
>

OK

>> + expression derived from ndistinct for the column. With such statistics, the
>> + planner recognizes that the conditions are correlated and arrives at much
>> + more accurate estimates.
>
> are correlated comma
>

OK

>> + if (type->lt_opr == InvalidOid)
>
> These could be !OidIsValid
>

Maybe, but it's like this already. I'll leave this alone and then fix/backpatch separately.

>> + * expressions. It's either expensive or very easy to defeat for
>> + * determined used, and there's no risk if we allow such statistics (the
>> + * statistics is useless, but harmless).
>
> I think it's meant to say "for a determined user" ?
>

Right.

>> + * If there are no simply-referenced columns, give the statistics an auto
>> + * dependency on the whole table. In most cases, this will be redundant,
>> + * but it might not be if the statistics expressions contain no Vars
>> + * (which might seem strange but possible).
>> + */
>> + if (!nattnums)
>> + {
>> + ObjectAddressSet(parentobject, RelationRelationId, relid);
>> + recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
>> + }
>
> Can this be unconditional ?
>

What would be the benefit? This behavior is copied from index_create, so I'd prefer keeping it the same for consistency reasons. Presumably it's like that for some reason (a bit of cargo cult programming, I know).
>> + * Translate the array of indexs to regular attnums for the dependency (we > > sp: indexes > OK >> + * Not found a matching expression, so we can simply skip > > Found no matching expr > OK >> + /* if found a matching, */ > > matching .. > Matching dependency. >> +examine_attribute(Node *expr) > > Maybe you should rename this to something distinct ? So it's easy to add a > breakpoint there, for example. > What would be a better name? It's not difficult to add a breakpoint using line number, for example. >> + stats->anl_context = CurrentMemoryContext; /* XXX should be using >> + * something else? */ > >> + bool nulls[Natts_pg_statistic]; > ... >> + * Construct a new pg_statistic tuple >> + */ >> + for (i = 0; i < Natts_pg_statistic; ++i) >> + { >> + nulls[i] = false; >> + } > > Shouldn't you just write nulls[Natts_pg_statistic] = {false}; > or at least: memset(nulls, 0, sizeof(nulls)); > Maybe, but it's a copy of what update_attstats() does, so I prefer keeping it the same. >> + * We don't store collations used to build the statistics, but >> + * we can use the collation for the attribute itself, as >> + * stored in varcollid. We do reset the statistics after a >> + * type change (including collation change), so this is OK. We >> + * may need to relax this after allowing extended statistics >> + * on expressions. > > This text should be updated or removed ? > Yeah, the last sentence is obsolete. Updated. 
>> @@ -2705,7 +2705,108 @@ describeOneTableDetails(const char *schemaname, >> } >> >> /* print any extended statistics */ >> - if (pset.sversion >= 100000) >> + if (pset.sversion >= 140000) >> + { >> + printfPQExpBuffer(&buf, >> + "SELECT oid, " >> + "stxrelid::pg_catalog.regclass, " >> + "stxnamespace::pg_catalog.regnamespace AS nsp, " >> + "stxname,\n" >> + "pg_get_statisticsobjdef_columns(oid) AS columns,\n" >> + " 'd' = any(stxkind) AS ndist_enabled,\n" >> + " 'f' = any(stxkind) AS deps_enabled,\n" >> + " 'm' = any(stxkind) AS mcv_enabled,\n"); >> + >> + if (pset.sversion >= 130000) >> + appendPQExpBufferStr(&buf, " stxstattarget\n"); >> + else >> + appendPQExpBufferStr(&buf, " -1 AS stxstattarget\n"); > > >= 130000 is fully determined by >= 14000 :) > Ah, right. >> + * type of the opclass, which is not interesting for our purposes. (Note: >> + * if we did anything with non-expression index columns, we'd need to > > index is wrong ? > Fixed > I mentioned a bunch of other references to "index" and "predicate" which are > still around: > Whooops, sorry. Fixed. I'll post a cleaned-up version of the patch addressing Dean's review comments too. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 24, 2021 at 01:24:46AM -0500, Justin Pryzby wrote:
> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me. For example:
> > Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".

In HEAD:catalogs.sgml, pg_statistic_ext (the table) says "object":

|Name of the statistics object
|Owner of the statistics object
|An array of attribute numbers, indicating which table columns are covered by this statistics object;

But pg_stats_ext (the view) doesn't say "object", which sounds wrong:

|Name of extended statistics
|Owner of the extended statistics
|Names of the columns the extended statistics is defined on

Other pre-existing issues: should be singular "statistic":

doc/src/sgml/perform.sgml: Another type of statistics stored for each column are most-common value
doc/src/sgml/ref/psql-ref.sgml: The status of each kind of extended statistics is shown in a column

Pre-existing issues: doesn't say "object" but I think it should:

src/backend/commands/statscmds.c: errmsg("statistics creation on system columns is not supported")));
src/backend/commands/statscmds.c: errmsg("cannot have more than %d columns in statistics",
src/backend/commands/statscmds.c: * If we got here and the OID is not valid, it means the statistics does
src/backend/commands/statscmds.c: * Select a nonconflicting name for a new statistics.
src/backend/commands/statscmds.c: * Generate "name2" for a new statistics given the list of column names for it
src/backend/statistics/extended_stats.c: /* compute statistics target for this statistics */
src/backend/statistics/extended_stats.c: * attributes the statistics is defined on, and then the default statistics
src/backend/statistics/mcv.c: * The input is the OID of the statistics, and there are no rows returned if

Should say "for a statistics object" or "for statistics objects":

src/backend/statistics/extended_stats.c: * target for a statistics objects (from the object target, attribute targets

Your patch adds these; they should say "object":

+ * Check if we actually have a matching statistics for the expression.
+ /* evaluate expressions (if the statistics has any) */
+ * for the extended statistics. The second option seems more reasonable.
+ * the statistics had all options enabled on the original version.
+ * But if the statistics is defined on just a single column, it has to
+ /* has the statistics expressions? */
+ /* expression - see if it's in the statistics */
+ * column(s) the statistics depends on. Also require all
+ * statistics is defined on more than one column/expression).
+ * statistics is useless, but harmless).
+ * If there are no simply-referenced columns, give the statistics an auto
+ * Then the first statistics matches no expressions and 3 vars,
+ * while the second statistics matches one expression and 1 var.
+ * Currently the first statistics wins, which seems silly.
+ * [(a+c), d]. But maybe it's better than failing to match the
+ * second statistics?

I can make patches for these (separate patches for HEAD and your patch), but I
don't think your patch has to wait on it, since the user-facing documentation
is consistent with what's already there, and the rest are internal comments.

--
Justin
On 3/24/21 5:28 PM, Dean Rasheed wrote:
> On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> AFAIK the primary issue here is that the two places disagree. While
>> estimate_num_groups does this
>>
>>     varnos = pull_varnos(root, (Node *) varshere);
>>     if (bms_membership(varnos) == BMS_SINGLETON)
>>     { ... }
>>
>> the add_unique_group_expr does this
>>
>>     varnos = pull_varnos(root, (Node *) groupexpr);
>>
>> That is, one looks at the group expression, while the other looks at vars
>> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
>> this can differ, causing the crash.
>>
>> So we need to change one of those places - my fix tweaked the second
>> place to also look at the vars, but maybe we should change the other
>> place? Or maybe it's not the right fix for PlaceHolderVars ...
>>
>
> I think that it doesn't make any difference which place is changed.
>
> This is a case of an expression with no stats. With your change,
> you'll get a single GroupExprInfo containing a list of
> VariableStatData's for each of its Vars, whereas if you changed it
> the other way, you'd get a separate GroupExprInfo for each Var. But I
> think they'd both end up being treated the same by
> estimate_multivariate_ndistinct(), since there wouldn't be any stats
> matching the expression, only the individual Vars. Maybe changing the
> first place would be the more bulletproof fix though.
>

Yeah, I think that's true. I'll do a bit more research / experiments.

As for the changes proposed in the create_statistics docs, do we really
want to use univariate / multivariate there? Yes, the terms are correct,
but I'm not sure how many people looking at CREATE STATISTICS will
understand them.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
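The disagreement between the two call sites can be sketched in standalone C (a toy model only: `ToyExpr`, `relids_of_expr` and friends are hypothetical stand-ins for the real planner structures; the key assumption, matching the discussion, is that a PlaceHolderVar contributes the relids it must be evaluated at, which can be a superset of the relids of the Vars inside it):

```c
#include <stdbool.h>

typedef unsigned Relids;        /* bit i set => range-table entry i (toy bitmapset) */

typedef struct ToyExpr
{
    Relids      var_relids;     /* relids of the plain Vars inside the expression */
    Relids      ph_relids;      /* extra relids contributed by a PlaceHolderVar, or 0 */
} ToyExpr;

/* like pull_varnos() on the whole expression: the PHV contributes its phrels */
static Relids
relids_of_expr(const ToyExpr *e)
{
    return e->var_relids | e->ph_relids;
}

/* like pull_varnos() on the Vars extracted by pull_var_clause() */
static Relids
relids_of_vars(const ToyExpr *e)
{
    return e->var_relids;
}

/* the BMS_SINGLETON test: exactly one bit set */
static bool
is_singleton(Relids r)
{
    return r != 0 && (r & (r - 1)) == 0;
}
```

For a plain expression the two functions agree, but for one containing a PlaceHolderVar evaluated at another relation they can disagree on the singleton test, which is the inconsistency between estimate_num_groups() and add_unique_group_expr() described above.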
On Wed, 24 Mar 2021 at 16:48, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > As for the changes proposed in the create_statistics, do we really want > to use univariate / multivariate there? Yes, the terms are correct, but > I'm not sure how many people looking at CREATE STATISTICS will > understand them. > Hmm, I think "univariate" and "multivariate" are pretty ubiquitous, when used to describe statistics. You could use "single-column" and "multi-column", but then "column" isn't really right anymore, since it might be a column or an expression. I can't think of any other terms that fit. Regards, Dean
On 3/24/21 5:48 PM, Tomas Vondra wrote:
> On 3/24/21 5:28 PM, Dean Rasheed wrote:
>> On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> AFAIK the primary issue here is that the two places disagree. While
>>> estimate_num_groups does this
>>>
>>>     varnos = pull_varnos(root, (Node *) varshere);
>>>     if (bms_membership(varnos) == BMS_SINGLETON)
>>>     { ... }
>>>
>>> the add_unique_group_expr does this
>>>
>>>     varnos = pull_varnos(root, (Node *) groupexpr);
>>>
>>> That is, one looks at the group expression, while the other looks at vars
>>> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
>>> this can differ, causing the crash.
>>>
>>> So we need to change one of those places - my fix tweaked the second
>>> place to also look at the vars, but maybe we should change the other
>>> place? Or maybe it's not the right fix for PlaceHolderVars ...
>>>
>>
>> I think that it doesn't make any difference which place is changed.
>>
>> This is a case of an expression with no stats. With your change,
>> you'll get a single GroupExprInfo containing a list of
>> VariableStatData's for each of its Vars, whereas if you changed it
>> the other way, you'd get a separate GroupExprInfo for each Var. But I
>> think they'd both end up being treated the same by
>> estimate_multivariate_ndistinct(), since there wouldn't be any stats
>> matching the expression, only the individual Vars. Maybe changing the
>> first place would be the more bulletproof fix though.
>>
>
> Yeah, I think that's true. I'll do a bit more research / experiments.
>

Actually, I think we don't need that block at all - there's no point in
keeping the exact expression, because if there was a statistics object
matching it, it'd already have been matched by examine_variable. So if we
get here, we have to just split it into the vars anyway. The second block
is entirely useless.

That however means we don't need the processing with GroupExprInfo and
GroupVarInfo lists, i.e.
we can revert back to the original simpler processing, with a bit of
extra logic to match expressions, that's all. The patch 0003 does this
(it's a bit crude, but hopefully enough to demonstrate).

Here's an updated patch. 0001 should address most of today's review
items regarding comments etc.

0002 is an attempt to fix an issue I noticed today - we need to handle
type changes. Until now we did not have problems with that, because we
only had attnums - so we just reset the statistics (with the exception
of functional dependencies, on the assumption that those remain valid).

With expressions it's a bit more complicated, though:

1) we need to transform the expressions so that the Vars contain the
right type info etc. Otherwise an analyze with the old pg_node_tree
crashes.

2) we need to reset the pg_statistic[] data too, which however makes
keeping the functional dependencies a bit less useful, because those
rely on the expression stats :-(

So I'm wondering what to do about this. I looked into how ALTER TABLE
handles indexes, and 0003 is a PoC to do the same thing for statistics.
Of course, this is a bit unfortunate because it recreates the statistics
(so we don't keep anything, not even functional dependencies).

I think we have two options:

a) Make UpdateStatisticsForTypeChange smarter to also transform and
update the expression string, and reset pg_statistic[] data.

b) Just recreate the statistics, just like we do for indexes. Currently
this does not force analyze, so it just resets all the stats. Maybe it
should do analyze, though.

Any opinions? I need to think about this a bit more, but maybe (b) with
the analyze is the right thing to do. Keeping just some of the stats
always seemed a bit weird. (This is why the 0002 patch breaks one of the
regression tests.)

BTW I wonder how useful the updated statistics actually is.
Consider this example:

========================================================================
CREATE TABLE t (a int, b int, c int);
INSERT INTO t SELECT mod(i,10), mod(i,10), mod(i,10)
  FROM generate_series(1,1000000) s(i);
CREATE STATISTICS s (ndistinct) ON (a+b), (b+c) FROM t;
ANALYZE t;
EXPLAIN ANALYZE SELECT 1 FROM t GROUP BY (a+b), (b+c);

test=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | integer |           |          |
Statistics objects:
    "public"."s" (ndistinct) ON ((a + b)), ((b + c)) FROM t

test=# EXPLAIN SELECT 1 FROM t GROUP BY (a+b), (b+c);
                           QUERY PLAN
-----------------------------------------------------------------
 HashAggregate  (cost=25406.00..25406.15 rows=10 width=12)
   Group Key: (a + b), (b + c)
   ->  Seq Scan on t  (cost=0.00..20406.00 rows=1000000 width=8)
(3 rows)
========================================================================

Great. Now let's change one of the data types to something else:

========================================================================
test=# alter table t alter column c type numeric;
ALTER TABLE
test=# \d t
                 Table "public.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | numeric |           |          |
Statistics objects:
    "public"."s" (ndistinct) ON ((a + b)), (((b)::numeric + c)) FROM t

test=# analyze t;
ANALYZE
test=# EXPLAIN SELECT 1 FROM t GROUP BY (a+b), (b+c);
                            QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=27906.00..27906.17 rows=10 width=40)
   Group Key: (a + b), ((b)::numeric + c)
   ->  Seq Scan on t  (cost=0.00..22906.00 rows=1000000 width=36)
(3 rows)
========================================================================

Great!
Let's change it again:

========================================================================
test=# alter table t alter column c type double precision;
ALTER TABLE
test=# analyze t;
ANALYZE
test=# EXPLAIN SELECT 1 FROM t GROUP BY (a+b), (b+c);
                            QUERY PLAN
------------------------------------------------------------------
 HashAggregate  (cost=27906.00..27923.50 rows=1000 width=16)
   Group Key: (a + b), ((b)::double precision + c)
   ->  Seq Scan on t  (cost=0.00..22906.00 rows=1000000 width=12)
(3 rows)
========================================================================

Well, not that great, apparently. We clearly failed to match the second
expression, so we ended up with (b+c) estimated as (10 * 10). Why?
Because the expression now looks like this:

========================================================================
"public"."s" (ndistinct) ON ((a + b)), ((((b)::numeric)::double
precision + c)) FROM t
========================================================================

But we're matching it to (((b)::double precision + c)), so that fails.
This is not specific to extended statistics - indexes have exactly the
same issue. Not sure how common this is in practice.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
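The failure to match is structural: node trees are compared like equal() does, so the doubly-cast expression kept by the statistics object is a different tree from the singly-cast one the query produces, even though they compute the same value. A minimal sketch in standalone C (the node layout and type ids here are hypothetical simplifications, not the real PostgreSQL Node representation):

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { T_Var, T_Cast, T_OpPlus } ExprTag;

typedef struct ExprNode
{
    ExprTag     tag;
    int         varattno;       /* for T_Var */
    int         casttype;       /* for T_Cast: target type, oid-like */
    const struct ExprNode *left;    /* child / left operand */
    const struct ExprNode *right;   /* right operand of T_OpPlus */
} ExprNode;

/* structural equality, analogous to equal() on parse trees */
static bool
node_equal(const ExprNode *a, const ExprNode *b)
{
    if (a == NULL || b == NULL)
        return a == b;
    if (a->tag != b->tag)
        return false;
    if (a->tag == T_Var)
        return a->varattno == b->varattno;
    if (a->tag == T_Cast && a->casttype != b->casttype)
        return false;
    return node_equal(a->left, b->left) && node_equal(a->right, b->right);
}
```

With this, (((b)::numeric)::double precision + c) vs. ((b)::double precision + c) differ at the inner cast node, so the match fails exactly as in the transcript above.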
Attachments
On Thu, Mar 25, 2021 at 01:05:37AM +0100, Tomas Vondra wrote: > here's an updated patch. 0001 should address most of the today's review > items regarding comments etc. This is still an issue: postgres=# CREATE STATISTICS xt ON a FROM t JOIN t ON true; ERROR: schema "i" does not exist -- Justin
On 3/25/21 1:30 AM, Justin Pryzby wrote: > On Thu, Mar 25, 2021 at 01:05:37AM +0100, Tomas Vondra wrote: >> here's an updated patch. 0001 should address most of the today's review >> items regarding comments etc. > > This is still an issue: > > postgres=# CREATE STATISTICS xt ON a FROM t JOIN t ON true; > ERROR: schema "i" does not exist > Ah, right. That's a weird issue. I was really confused about this, because nothing changes about the grammar or how we check the number of relations. The problem is pretty trivial - the new code in utility.c just grabs the first element and casts it to RangeVar, without checking that it actually is RangeVar. With joins it's a JoinExpr, so we get a bogus error. The attached version fixes it by simply doing the check in utility.c. It's a bit redundant with what's in CreateStatistics() but I don't think we can just postpone it easily - we need to do the transformation here, with access to queryString. But maybe we don't need to pass the relid, when we have the list of relations in CreateStatsStmt itself ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
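The bug pattern, casting the first list element to RangeVar without checking its node tag, can be sketched in standalone C (a toy model: the tags and struct layouts below are hypothetical simplifications of the real parse nodes, and `as_range_var` stands in for the IsA()/castNode() style check):

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { T_RangeVar, T_JoinExpr } ToyTag;

typedef struct { ToyTag tag; } ToyNode;
typedef struct { ToyTag tag; const char *schemaname; const char *relname; } ToyRangeVar;
typedef struct { ToyTag tag; ToyNode *larg; ToyNode *rarg; } ToyJoinExpr;

/* the safe pattern: check the tag before casting; a blind cast of a
 * JoinExpr (as with "... FROM t JOIN t ON true") would reinterpret
 * unrelated pointer fields as schema/relation names, producing a bogus
 * "schema does not exist" style error */
static const ToyRangeVar *
as_range_var(const ToyNode *n)
{
    if (n != NULL && n->tag == T_RangeVar)
        return (const ToyRangeVar *) n;
    return NULL;                /* caller must report a proper error */
}
```

The alternative in the actual fix is equivalent: verify the node really is a RangeVar in utility.c before using it, instead of trusting the grammar to have produced one.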
Attachments
On 3/25/21 1:05 AM, Tomas Vondra wrote:
> ...
>
> 0002 is an attempt to fix an issue I noticed today - we need to handle
> type changes. Until now we did not have problems with that, because we
> only had attnums - so we just reset the statistics (with the exception
> of functional dependencies, on the assumption that those remain valid).
>
> With expressions it's a bit more complicated, though.
>
> 1) we need to transform the expressions so that the Vars contain the
> right type info etc. Otherwise an analyze with the old pg_node_tree
> crashes.
>
> 2) we need to reset the pg_statistic[] data too, which however makes
> keeping the functional dependencies a bit less useful, because those
> rely on the expression stats :-(
>
> So I'm wondering what to do about this. I looked into how ALTER TABLE
> handles indexes, and 0003 is a PoC to do the same thing for statistics.
> Of course, this is a bit unfortunate because it recreates the statistics
> (so we don't keep anything, not even functional dependencies).
>
> I think we have two options:
>
> a) Make UpdateStatisticsForTypeChange smarter to also transform and
> update the expression string, and reset pg_statistic[] data.
>
> b) Just recreate the statistics, just like we do for indexes. Currently
> this does not force analyze, so it just resets all the stats. Maybe it
> should do analyze, though.
>
> Any opinions? I need to think about this a bit more, but maybe (b) with
> the analyze is the right thing to do. Keeping just some of the stats
> always seemed a bit weird. (This is why the 0002 patch breaks one of the
> regression tests.)
>

After thinking about this a bit more I think (b) is the right choice,
and the analyze is not necessary.
The reason is fairly simple - we drop the per-column statistics, because ATExecAlterColumnType does RemoveStatistics(RelationGetRelid(rel), attnum); so the user has to run analyze anyway, to get any reasonable estimates (we keep the functional dependencies, but those still rely on per-column statistics quite a bit). And we'll have to do the same thing with per-expression stats too. It was a nice idea to keep at least the stats that are not outright broken, but unfortunately it's not a very useful optimization. It increases the instability of the system, because now we have estimates with all statistics, no statistics, and something in between after the partial reset. Not nice. So my plan is to get rid of UpdateStatisticsForTypeChange, and just do mostly what we do for indexes. It's not perfect (as demonstrated in last message), but that'd apply even to option (a). Any better ideas? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, 25 Mar 2021 at 00:05, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Actually, I think we don't need that block at all - there's no point in
> keeping the exact expression, because if there was a statistics object
> matching it, it'd be matched by examine_variable. So if we get here, we
> have to just split it into the vars anyway. So the second block is
> entirely useless.

Good point.

> That however means we don't need the processing with GroupExprInfo and
> GroupVarInfo lists, i.e. we can revert back to the original simpler
> processing, with a bit of extra logic to match expressions, that's all.
>
> The patch 0003 does this (it's a bit crude, but hopefully enough to
> demonstrate).

Cool. I did wonder about that, but I didn't fully think it through. I'll
take a look.

> 0002 is an attempt to fix an issue I noticed today - we need to handle
> type changes.
>
> I think we have two options:
>
> a) Make UpdateStatisticsForTypeChange smarter to also transform and
> update the expression string, and reset pg_statistic[] data.
>
> b) Just recreate the statistics, just like we do for indexes. Currently
> this does not force analyze, so it just resets all the stats. Maybe it
> should do analyze, though.

I'd vote for (b) without an analyse, and I agree with getting rid of
UpdateStatisticsForTypeChange(). I've always been a bit skeptical about
trying to preserve extended statistics after a type change, when we
don't preserve regular per-column stats.

> BTW I wonder how useful the updated statistics actually is. Consider
> this example:
> ...
> the expression now looks like this:
>
> ========================================================================
> "public"."s" (ndistinct) ON ((a + b)), ((((b)::numeric)::double
> precision + c)) FROM t
> ========================================================================
>
> But we're matching it to (((b)::double precision + c)), so that fails.
> > This is not specific to extended statistics - indexes have exactly the > same issue. Not sure how common this is in practice. Hmm, that's unfortunate. Maybe it's not that common in practice though. I'm not sure if there is any practical way to fix it, but if there is, I guess we'd want to apply the same fix to both stats and indexes, and that certainly seems out of scope for this patch. Regards, Dean
On Thu, 25 Mar 2021 at 00:05, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > here's an updated patch. 0001 The change to the way that CreateStatistics() records dependencies isn't quite right -- recordDependencyOnSingleRelExpr() will not create any dependencies if the expression uses only a whole-row Var. However, pull_varattnos() will include whole-row Vars, and so nattnums_exprs will be non-zero, and CreateStatistics() will not create a whole-table dependency when it should. I suppose that could be fixed up by inspecting the bitmapset returned by pull_varattnos() in more detail, but I think it's probably safer to revert to the previous code, which matched what index_create() did. Regards, Dean
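Dean's observation hinges on how pull_varattnos() encodes attribute numbers: they are offset so that system attributes and the whole-row reference (attnum 0) fit in a bitmapset, so a whole-row Var makes the set non-empty even though it yields no per-column dependency. A standalone sketch (a toy model; the offset mirrors PostgreSQL's FirstLowInvalidHeapAttributeNumber of -8, but `Attrs` and the helpers are hypothetical):

```c
#include <stdbool.h>

#define FIRST_LOW_INVALID_ATTNUM (-8)

typedef unsigned Attrs;         /* toy bitmapset over offset attnums */

/* like adding a Var's attnum to the pull_varattnos() result set */
static Attrs
add_attnum(Attrs set, int attnum)
{
    return set | (1u << (attnum - FIRST_LOW_INVALID_ATTNUM));
}

/* does the set reference any *user* column (attnum > 0)?  bits
 * 0..8 cover attnums -8..0, i.e. system columns and whole-row */
static bool
has_user_column(Attrs set)
{
    return (set >> (1 - FIRST_LOW_INVALID_ATTNUM)) != 0;
}
```

The point: testing "set is non-empty" (the analogue of `nattnums_exprs != 0`) is the wrong proxy for "we recorded at least one column dependency", because a whole-row-only expression passes the first test but not the second, which is why reverting to the index_create()-style code is safer.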
On 3/25/21 2:33 PM, Dean Rasheed wrote:
> On Thu, 25 Mar 2021 at 00:05, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> here's an updated patch. 0001
>
> The change to the way that CreateStatistics() records dependencies
> isn't quite right -- recordDependencyOnSingleRelExpr() will not create
> any dependencies if the expression uses only a whole-row Var. However,
> pull_varattnos() will include whole-row Vars, and so nattnums_exprs
> will be non-zero, and CreateStatistics() will not create a whole-table
> dependency when it should.
>
> I suppose that could be fixed up by inspecting the bitmapset returned
> by pull_varattnos() in more detail, but I think it's probably safer to
> revert to the previous code, which matched what index_create() did.
>

Ah, good catch. I hadn't realized recordDependencyOnSingleRelExpr works
like that, so I've moved it after the whole-table dependency.

Attached is an updated patch series, with all the changes discussed
here. I've cleaned up the ndistinct stuff a bit more (essentially
reverting back from GroupExprInfo to the GroupVarInfo name), and got rid
of UpdateStatisticsForTypeChange.

I've also looked at speeding up the stats_ext regression tests. The
0002 patch reduces the size of a couple of test tables, and removes a
bunch of queries. I had initially mostly just copied the original tests,
but we don't really need that many queries, I think. This cuts the
runtime roughly in half, so it's mostly in line with other tests.

Some of these changes are in existing tests, so I'll consider moving
that part into a separate patch applied before the main one.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Thu, 25 Mar 2021 at 19:59, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Attached is an updated patch series, with all the changes discussed
> here. I've cleaned up the ndistinct stuff a bit more (essentially
> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of
> the UpdateStatisticsForTypeChange.
>

I've looked over all that and I think it's in pretty good shape. I
particularly like how much simpler the ndistinct code has now become.

Some (hopefully final) review comments:

1). I don't think index.c is the right place for StatisticsGetRelation().
I appreciate that it is very similar to the adjacent IndexGetRelation()
function, but index.c is really only for index-related code, so I think
StatisticsGetRelation() should go in statscmds.c.

2). Perhaps the error message at statscmds.c:293 should read

    "expression cannot be used in multivariate statistics because its
    type %s has no default btree operator class"

(i.e., add the word "multivariate", since the same expression *can* be
used in univariate statistics even though it has no less-than operator).

3). The comment for ATExecAddStatistics() should probably mention that
"ALTER TABLE ADD STATISTICS" isn't a command in the grammar, in a
similar way to other similar functions, e.g.:

    /*
     * ALTER TABLE ADD STATISTICS
     *
     * This is no such command in the grammar, but we use this internally to add
     * AT_ReAddStatistics subcommands to rebuild extended statistics after a table
     * column type change.
     */

4). The comment at the start of ATPostAlterTypeParse() needs updating
to mention CREATE STATISTICS statements.

5).
I think ATPostAlterTypeParse() should also attempt to preserve any
COMMENTs attached to statistics objects, i.e., something like:

--- src/backend/commands/tablecmds.c.orig	2021-03-26 10:39:38.328631864 +0000
+++ src/backend/commands/tablecmds.c	2021-03-26 10:47:21.042279580 +0000
@@ -12619,6 +12619,9 @@
 				CreateStatsStmt *stmt = (CreateStatsStmt *) stm;
 				AlterTableCmd *newcmd;
 
+				/* keep the statistics object's comment */
+				stmt->stxcomment = GetComment(oldId, StatisticExtRelationId, 0);
+
 				newcmd = makeNode(AlterTableCmd);
 				newcmd->subtype = AT_ReAddStatistics;
 				newcmd->def = (Node *) stmt;

6). Comment typo at extended_stats.c:2532 - s/statitics/statistics/

7). I don't think that the big XXX comment near the start of
estimate_multivariate_ndistinct() is really relevant anymore, now that
the code has been simplified and we no longer extract Vars from
expressions, so perhaps it can just be deleted.

Regards,
Dean
On 3/26/21 12:37 PM, Dean Rasheed wrote: > On Thu, 25 Mar 2021 at 19:59, Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> Attached is an updated patch series, with all the changes discussed >> here. I've cleaned up the ndistinct stuff a bit more (essentially >> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of >> the UpdateStatisticsForTypeChange. >> > > I've looked over all that and I think it's in pretty good shape. I > particularly like how much simpler the ndistinct code has now become. > > Some (hopefully final) review comments: > > 1). I don't think index.c is the right place for > StatisticsGetRelation(). I appreciate that it is very similar to the > adjacent IndexGetRelation() function, but index.c is really only for > index-related code, so I think StatisticsGetRelation() should go in > statscmds.c > Ah, right, I forgot about this. I wonder if we should add catalog/statistics.c, similar to catalog/index.c (instead of adding it locally to statscmds.c). > 2). Perhaps the error message at statscmds.c:293 should read > > "expression cannot be used in multivariate statistics because its > type %s has no default btree operator class" > > (i.e., add the word "multivariate", since the same expression *can* be > used in univariate statistics even though it has no less-than > operator). > > 3). The comment for ATExecAddStatistics() should probably mention that > "ALTER TABLE ADD STATISTICS" isn't a command in the grammar, in a > similar way to other similar functions, e.g.: > > /* > * ALTER TABLE ADD STATISTICS > * > * This is no such command in the grammar, but we use this internally to add > * AT_ReAddStatistics subcommands to rebuild extended statistics after a table > * column type change. > */ > > 4). The comment at the start of ATPostAlterTypeParse() needs updating > to mention CREATE STATISTICS statements. > > 5). 
I think ATPostAlterTypeParse() should also attempt to preserve any > COMMENTs attached to statistics objects, i.e., something like: > > --- src/backend/commands/tablecmds.c.orig 2021-03-26 10:39:38.328631864 +0000 > +++ src/backend/commands/tablecmds.c 2021-03-26 10:47:21.042279580 +0000 > @@ -12619,6 +12619,9 @@ > CreateStatsStmt *stmt = (CreateStatsStmt *) stm; > AlterTableCmd *newcmd; > > + /* keep the statistics object's comment */ > + stmt->stxcomment = GetComment(oldId, StatisticExtRelationId, 0); > + > newcmd = makeNode(AlterTableCmd); > newcmd->subtype = AT_ReAddStatistics; > newcmd->def = (Node *) stmt; > > 6). Comment typo at extended_stats.c:2532 - s/statitics/statistics/ > > 7). I don't think that the big XXX comment near the start of > estimate_multivariate_ndistinct() is really relevant anymore, now that > the code has been simplified and we no longer extract Vars from > expressions, so perhaps it can just be deleted. > Thanks! I'll fix these, and then will consider getting it committed sometime later today, once the buildfarm does some testing on the other stuff I already committed. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3/26/21 1:54 PM, Tomas Vondra wrote: > > > On 3/26/21 12:37 PM, Dean Rasheed wrote: >> On Thu, 25 Mar 2021 at 19:59, Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> >>> Attached is an updated patch series, with all the changes discussed >>> here. I've cleaned up the ndistinct stuff a bit more (essentially >>> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of >>> the UpdateStatisticsForTypeChange. >>> >> >> I've looked over all that and I think it's in pretty good shape. I >> particularly like how much simpler the ndistinct code has now become. >> >> Some (hopefully final) review comments: >> >> ... >> > > Thanks! I'll fix these, and then will consider getting it committed > sometime later today, once the buildfarm does some testing on the other > stuff I already committed. > OK, pushed after a bit more polishing and testing. I've noticed one more missing piece in describe (expressions missing in \dX), so I fixed that. May the buildfarm be merciful ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3/27/21 1:17 AM, Tomas Vondra wrote: > On 3/26/21 1:54 PM, Tomas Vondra wrote: >> >> >> On 3/26/21 12:37 PM, Dean Rasheed wrote: >>> On Thu, 25 Mar 2021 at 19:59, Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> >>>> Attached is an updated patch series, with all the changes discussed >>>> here. I've cleaned up the ndistinct stuff a bit more (essentially >>>> reverting back from GroupExprInfo to GroupVarInfo name), and got rid of >>>> the UpdateStatisticsForTypeChange. >>>> >>> >>> I've looked over all that and I think it's in pretty good shape. I >>> particularly like how much simpler the ndistinct code has now become. >>> >>> Some (hopefully final) review comments: >>> >>> ... >>> >> >> Thanks! I'll fix these, and then will consider getting it committed >> sometime later today, once the buildfarm does some testing on the other >> stuff I already committed. >> > > OK, pushed after a bit more polishing and testing. I've noticed one more > missing piece in describe (expressions missing in \dX), so I fixed that. > > May the buildfarm be merciful ... > LOL! It failed on *my* buildfarm machine, because apparently some of the expressions used in stats_ext depend on locale and the machine is using cs_CZ.UTF-8. Will fix later ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I suggest adding some kind of reference to statistics expressions here.

--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml

  <sect2 id="vacuum-for-statistics">
   <title>Updating Planner Statistics</title>

   <indexterm zone="vacuum-for-statistics">
    <primary>statistics</primary>
    <secondary>of the planner</secondary>
   </indexterm>
[...]
@@ -330,10 +330,12 @@
    <para>
     Also, by default there is limited information available about
-    the selectivity of functions.  However, if you create an expression
+    the selectivity of functions.  However, if you create a statistics
+    expression or an expression
     index that uses a function call, useful statistics will be gathered
     about the function, which can greatly improve query plans that use
     the expression index.

--
Justin
On Fri, Jan 22, 2021 at 02:09:04PM +0100, Tomas Vondra wrote: > On 1/22/21 5:01 AM, Justin Pryzby wrote: > > On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote: > > > > > | Statistics objects: > > > > > | "public"."s2" (ndistinct, dependencies, mcv) ON FROM t > > > > > > Umm, for me that prints: > > > > > "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t > > > > > > which I think is OK. But maybe there's something else to trigger the > > > problem? > > > > Oh. It's because I was using /usr/bin/psql and not ./src/bin/psql. > > I think it's considered ok if old client's \d commands don't work on new > > server, but it's not clear to me if it's ok if they misbehave. It's almost > > better it made an ERROR. > > > > Well, how would the server know to throw an error? We can't quite patch the > old psql (if we could, we could just tweak the query). To refresh: stats objects on a v14 server which include expressions are shown by pre-v14 psql client with the expressions elided (because the attnums don't correspond to anything in pg_attribute). I'm mentioning it again since, even though I knew about this earlier in the year, it caused some confusion for me again just now while testing our application. I had the v14 server installed but the psql symlink still pointed to the v13 client. There may not be anything we can do about it. And it may not be a significant issue outside the beta period: more typically, the client version would match the server version. -- Justin
On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote: > OK, pushed after a bit more polishing and testing. This added a "transformed" field to CreateStatsStmt, but it didn't mention that field in src/backend/nodes. Should those functions handle the field?
On 6/6/21 7:37 AM, Noah Misch wrote: > On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote: >> OK, pushed after a bit more polishing and testing. > > This added a "transformed" field to CreateStatsStmt, but it didn't mention > that field in src/backend/nodes. Should those functions handle the field? > Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not sure if it can result in error/failure or just inefficiency (due to transforming the expressions repeatedly), but it should do whatever CREATE INDEX is doing. Thanks for noticing! Fixed by d57ecebd12. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > On 6/6/21 7:37 AM, Noah Misch wrote: >> This added a "transformed" field to CreateStatsStmt, but it didn't mention >> that field in src/backend/nodes. Should those functions handle the field? > Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not > sure if it can result in error/failure or just inefficiency (due to > transforming the expressions repeatedly), but it should do whatever > CREATE INDEX is doing. I'm curious about how come the buildfarm didn't notice this. The animals using COPY_PARSE_PLAN_TREES should have failed. The fact that they didn't implies that there's no test case that makes use of a nonzero value for this field, which seems like a testing gap. regards, tom lane
On 6/6/21 9:17 PM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@enterprisedb.com> writes: >> On 6/6/21 7:37 AM, Noah Misch wrote: >>> This added a "transformed" field to CreateStatsStmt, but it didn't mention >>> that field in src/backend/nodes. Should those functions handle the field? > >> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not >> sure if it can result in error/failure or just inefficiency (due to >> transforming the expressions repeatedly), but it should do whatever >> CREATE INDEX is doing. > > I'm curious about how come the buildfarm didn't notice this. The > animals using COPY_PARSE_PLAN_TREES should have failed. The fact > that they didn't implies that there's no test case that makes use > of a nonzero value for this field, which seems like a testing gap. > AFAICS the reason is pretty simple - the COPY_PARSE_PLAN_TREES checks look like this:

    List *new_list = copyObject(raw_parsetree_list);

    /* This checks both copyObject() and the equal() routines... */
    if (!equal(new_list, raw_parsetree_list))
        elog(WARNING, "copyObject() failed to produce an equal raw parse tree");
    else
        raw_parsetree_list = new_list;

But if the field is missing from all the functions, equal() can't detect that copyObject() did not actually copy it. It'd detect a case when the field was added just to one place, but not this. The CREATE INDEX (which served as an example for CREATE STATISTICS) has exactly the same issue. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > On 6/6/21 9:17 PM, Tom Lane wrote: >> I'm curious about how come the buildfarm didn't notice this. The >> animals using COPY_PARSE_PLAN_TREES should have failed. The fact >> that they didn't implies that there's no test case that makes use >> of a nonzero value for this field, which seems like a testing gap. > AFAICS the reason is pretty simple - the COPY_PARSE_PLAN_TREES checks > look like this: > ... > But if the field is missing from all the functions, equal() can't detect > that copyObject() did not actually copy it. Right, that code would only detect a missing copyfuncs.c line if equalfuncs.c did have the line, which isn't all that likely. However, we then pass the copied node on to further processing, which in principle should result in visible failures when copyfuncs.c is missing a line. I think the reason it didn't is that the transformed field would always be zero (false) in grammar output. We could only detect a problem if we copied already-transformed nodes and then used them further. Even then it *might* not fail, because the consequence would likely be an extra round of parse analysis on the expressions, which is likely to be a no-op. Not sure if there's a good way to improve that. I hope sometime soon we'll be able to auto-generate these functions, and then the risk of this sort of mistake will go away (he says optimistically). regards, tom lane
On Sun, Jun 06, 2021 at 09:13:17PM +0200, Tomas Vondra wrote: > > On 6/6/21 7:37 AM, Noah Misch wrote: > > On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote: > >> OK, pushed after a bit more polishing and testing. > > > > This added a "transformed" field to CreateStatsStmt, but it didn't mention > > that field in src/backend/nodes. Should those functions handle the field? > > > > Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not > sure if it can result in error/failure or just inefficiency (due to > transforming the expressions repeatedly), but it should do whatever > CREATE INDEX is doing. > > Thanks for noticing! Fixed by d57ecebd12. Great. For future reference, this didn't need a catversion bump. readfuncs.c changes need a catversion bump, since the catalogs might contain input for each read function. Other src/backend/nodes functions don't face that. Also, src/backend/nodes generally process fields in the order that they appear in the struct. The order you used in d57ecebd12 is nicer, being more like IndexStmt, so I'm pushing an order change to the struct.
On 6/11/21 6:55 AM, Noah Misch wrote: > On Sun, Jun 06, 2021 at 09:13:17PM +0200, Tomas Vondra wrote: >> >> On 6/6/21 7:37 AM, Noah Misch wrote: >>> On Sat, Mar 27, 2021 at 01:17:14AM +0100, Tomas Vondra wrote: >>>> OK, pushed after a bit more polishing and testing. >>> >>> This added a "transformed" field to CreateStatsStmt, but it didn't mention >>> that field in src/backend/nodes. Should those functions handle the field? >>> >> >> Yup, that's a mistake - it should do whatever CREATE INDEX is doing. Not >> sure if it can result in error/failure or just inefficiency (due to >> transforming the expressions repeatedly), but it should do whatever >> CREATE INDEX is doing. >> >> Thanks for noticing! Fixed by d57ecebd12. > > Great. For future reference, this didn't need a catversion bump. readfuncs.c > changes need a catversion bump, since the catalogs might contain input for > each read function. Other src/backend/nodes functions don't face that. Also, > src/backend/nodes generally process fields in the order that they appear in > the struct. The order you used in d57ecebd12 is nicer, being more like > IndexStmt, so I'm pushing an order change to the struct. > OK, makes sense. Thanks! -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/22/21 5:01 AM, Justin Pryzby wrote: > > In any case, why are there so many parentheses ? On Fri, Jan 22, 2021 at 02:09:04PM +0100, Tomas Vondra wrote: > That's a bug in pg_get_statisticsobj_worker, probably. It shouldn't be > adding extra parentheses, on top of what deparse_expression_pretty does. > Will fix. The extra parens are still here - is it intended ? postgres=# CREATE STATISTICS s ON i, (1+i), (2+i) FROM t; CREATE STATISTICS postgres=# \d t Table "public.t" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- i | integer | | | Statistics objects: "public"."s" ON i, ((1 + i)), ((2 + i)) FROM t -- Justin
On Mon, Dec 07, 2020 at 03:15:17PM +0100, Tomas Vondra wrote: > > Looking at the current behaviour, there are a couple of things that > > seem a little odd, even though they are understandable. For example, > > the fact that > > > > CREATE STATISTICS s (expressions) ON (expr), col FROM tbl; > > > > fails, but > > > > CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl; > > > > succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax > > > > CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl; > > > > tends to suggest that it's going to create statistics on the pair of > > expressions, describing their correlation, when actually it builds 2 > > independent statistics. Also, this error text isn't entirely accurate: > > > > CREATE STATISTICS s ON col FROM tbl; > > ERROR: extended statistics require at least 2 columns > > > > because extended statistics don't always require 2 columns, they can > > also just have an expression, or multiple expressions and 0 or 1 > > columns. > > > > I think a lot of this stems from treating "expressions" in the same > > way as the other (multi-column) stats kinds, and it might actually be > > neater to have separate documented syntaxes for single- and > > multi-column statistics: > > > > CREATE STATISTICS [ IF NOT EXISTS ] statistics_name > > ON (expression) > > FROM table_name > > > > CREATE STATISTICS [ IF NOT EXISTS ] statistics_name > > [ ( statistics_kind [, ... ] ) ] > > ON { column_name | (expression) } , { column_name | (expression) } [, ...] > > FROM table_name > > > > The first syntax would create single-column stats, and wouldn't accept > > a statistics_kind argument, because there is only one kind of > > single-column statistic. Maybe that might change in the future, but if > > so, it's likely that the kinds of single-column stats will be > > different from the kinds of multi-column stats. 
> > > > In the second syntax, the only accepted kinds would be the current > > multi-column stats kinds (ndistinct, dependencies, and mcv), and it > > would always build stats describing the correlations between the > > columns listed. It would continue to build standard/expression stats > > on any expressions in the list, but that's more of an implementation > > detail. > > > > It would no longer be possible to do "CREATE STATISTICS s > > (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to > > issue 2 separate "CREATE STATISTICS" commands, but that seems more > > logical, because they're independent stats. > > > > The parsing code might not change much, but some of the errors would > > be different. For example, the errors "building only extended > > expression statistics on simple columns not allowed" and "extended > > expression statistics require at least one expression" would go away, > > and the error "extended statistics require at least 2 columns" might > > become more specific, depending on the stats kind. This still seems odd: postgres=# CREATE STATISTICS asf ON i FROM t; ERROR: extended statistics require at least 2 columns postgres=# CREATE STATISTICS asf ON (i) FROM t; CREATE STATISTICS It seems wrong that the command works with added parens, but builds expression stats on a simple column (which is redundant with what analyze does without extended stats). -- Justin
On 8/16/21 3:31 AM, Justin Pryzby wrote: > On 1/22/21 5:01 AM, Justin Pryzby wrote: >>> In any case, why are there so many parentheses ? > > On Fri, Jan 22, 2021 at 02:09:04PM +0100, Tomas Vondra wrote: >> That's a bug in pg_get_statisticsobj_worker, probably. It shouldn't be >> adding extra parentheses, on top of what deparse_expression_pretty does. >> Will fix. > > The extra parens are still here - is it intended ? > Ah, thanks for reminding me! I was looking at this, and the problem is that pg_get_statisticsobj_worker only does this: prettyFlags = PRETTYFLAG_INDENT; Changing that to prettyFlags = PRETTYFLAG_INDENT | PRETTYFLAG_PAREN; fixes this (not sure we need the INDENT flag - probably not). I'm a bit confused, though. My assumption was "PRETTYFLAG_PAREN = true" would force the deparsing itself to add the parens, if needed, but in reality it works the other way around. I guess it's more complicated due to deparsing multi-level expressions, but unfortunately, there's no comment explaining what it does. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 8/16/21 3:32 AM, Justin Pryzby wrote: > On Mon, Dec 07, 2020 at 03:15:17PM +0100, Tomas Vondra wrote: >>> Looking at the current behaviour, there are a couple of things that >>> seem a little odd, even though they are understandable. For example, >>> the fact that >>> >>> CREATE STATISTICS s (expressions) ON (expr), col FROM tbl; >>> >>> fails, but >>> >>> CREATE STATISTICS s (expressions, mcv) ON (expr), col FROM tbl; >>> >>> succeeds and creates both "expressions" and "mcv" statistics. Also, the syntax >>> >>> CREATE STATISTICS s (expressions) ON (expr1), (expr2) FROM tbl; >>> >>> tends to suggest that it's going to create statistics on the pair of >>> expressions, describing their correlation, when actually it builds 2 >>> independent statistics. Also, this error text isn't entirely accurate: >>> >>> CREATE STATISTICS s ON col FROM tbl; >>> ERROR: extended statistics require at least 2 columns >>> >>> because extended statistics don't always require 2 columns, they can >>> also just have an expression, or multiple expressions and 0 or 1 >>> columns. >>> >>> I think a lot of this stems from treating "expressions" in the same >>> way as the other (multi-column) stats kinds, and it might actually be >>> neater to have separate documented syntaxes for single- and >>> multi-column statistics: >>> >>> CREATE STATISTICS [ IF NOT EXISTS ] statistics_name >>> ON (expression) >>> FROM table_name >>> >>> CREATE STATISTICS [ IF NOT EXISTS ] statistics_name >>> [ ( statistics_kind [, ... ] ) ] >>> ON { column_name | (expression) } , { column_name | (expression) } [, ...] >>> FROM table_name >>> >>> The first syntax would create single-column stats, and wouldn't accept >>> a statistics_kind argument, because there is only one kind of >>> single-column statistic. Maybe that might change in the future, but if >>> so, it's likely that the kinds of single-column stats will be >>> different from the kinds of multi-column stats. 
>>> >>> In the second syntax, the only accepted kinds would be the current >>> multi-column stats kinds (ndistinct, dependencies, and mcv), and it >>> would always build stats describing the correlations between the >>> columns listed. It would continue to build standard/expression stats >>> on any expressions in the list, but that's more of an implementation >>> detail. >>> >>> It would no longer be possible to do "CREATE STATISTICS s >>> (expressions) ON (expr1), (expr2) FROM tbl". Instead, you'd have to >>> issue 2 separate "CREATE STATISTICS" commands, but that seems more >>> logical, because they're independent stats. >>> >>> The parsing code might not change much, but some of the errors would >>> be different. For example, the errors "building only extended >>> expression statistics on simple columns not allowed" and "extended >>> expression statistics require at least one expression" would go away, >>> and the error "extended statistics require at least 2 columns" might >>> become more specific, depending on the stats kind. > > This still seems odd: > > postgres=# CREATE STATISTICS asf ON i FROM t; > ERROR: extended statistics require at least 2 columns > postgres=# CREATE STATISTICS asf ON (i) FROM t; > CREATE STATISTICS > > It seems wrong that the command works with added parens, but builds expression > stats on a simple column (which is redundant with what analyze does without > extended stats). > Well, yeah. But I think this is a behavior that was discussed somewhere in this thread, and the agreement was that it's not worth the complexity, as this comment explains * XXX We do only the bare minimum to separate simple attribute and * complex expressions - for example "(a)" will be treated as a complex * expression. No matter how elaborate the check is, there'll always be * a way around it, if the user is determined (consider e.g. "(a+0)"), * so it's not worth protecting against it. 
Of course, maybe that wasn't the right decision - it's a bit weird that CREATE INDEX on t ((a), (b)) actually "extracts" the column references and stores that in indkeys, instead of treating that as expressions. Patch 0001 fixes the "double parens" issue discussed elsewhere in this thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple column reference. But I'm not sure 0002 is something we can do without catversion bump. What if someone created such "bogus" statistics? It's mostly harmless, because the statistics is useless anyway (AFAICS we'll just use the regular one we have for the column), but if they do pg_dump, that'll fail because of this new restriction. OTOH we're still "only" in beta, and IIRC the rule is not to bump catversion after rc1. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
> Patch 0001 fixes the "double parens" issue discussed elsewhere in this > thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple > column reference. > From: Tomas Vondra <tomas.vondra@postgresql.org> > Date: Mon, 16 Aug 2021 17:19:33 +0200 > Subject: [PATCH 2/2] fix: identify single-attribute references > diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl > index a4ee54d516..be1f3a5175 100644 > --- a/src/bin/pg_dump/t/002_pg_dump.pl > +++ b/src/bin/pg_dump/t/002_pg_dump.pl > @@ -2811,7 +2811,7 @@ my %tests = ( > create_sql => 'CREATE STATISTICS dump_test.test_ext_stats_expr > ON (2 * col1) FROM dump_test.test_fifth_table', > regexp => qr/^ > - \QCREATE STATISTICS dump_test.test_ext_stats_expr ON ((2 * col1)) FROM dump_test.test_fifth_table;\E > + \QCREATE STATISTICS dump_test.test_ext_stats_expr ON (2 * col1) FROM dump_test.test_fifth_table;\E This hunk should be in 0001, no ? > But I'm not sure 0002 is something we can do without catversion bump. What > if someone created such "bogus" statistics? It's mostly harmless, because > the statistics is useless anyway (AFAICS we'll just use the regular one we > have for the column), but if they do pg_dump, that'll fail because of this > new restriction. I think it's okay if it pg_dump throws an error, since the fix is as easy as dropping the stx object. (It wouldn't be okay if it silently misbehaved.) Andres concluded similarly with the reverted autovacuum patch: https://www.postgresql.org/message-id/20210817105022.e2t4rozkhqy2myhn@alap3.anarazel.de +RMT in case someone wants to argue otherwise. -- Justin
On 8/18/21 5:07 AM, Justin Pryzby wrote: >> Patch 0001 fixes the "double parens" issue discussed elsewhere in this >> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple >> column reference. > >> From: Tomas Vondra <tomas.vondra@postgresql.org> >> Date: Mon, 16 Aug 2021 17:19:33 +0200 >> Subject: [PATCH 2/2] fix: identify single-attribute references > >> diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl >> index a4ee54d516..be1f3a5175 100644 >> --- a/src/bin/pg_dump/t/002_pg_dump.pl >> +++ b/src/bin/pg_dump/t/002_pg_dump.pl >> @@ -2811,7 +2811,7 @@ my %tests = ( >> create_sql => 'CREATE STATISTICS dump_test.test_ext_stats_expr >> ON (2 * col1) FROM dump_test.test_fifth_table', >> regexp => qr/^ >> - \QCREATE STATISTICS dump_test.test_ext_stats_expr ON ((2 * col1)) FROM dump_test.test_fifth_table;\E >> + \QCREATE STATISTICS dump_test.test_ext_stats_expr ON (2 * col1) FROM dump_test.test_fifth_table;\E > > > This hunk should be in 0001, no ? > Yeah, I mixed that up a bit. >> But I'm not sure 0002 is something we can do without catversion bump. What >> if someone created such "bogus" statistics? It's mostly harmless, because >> the statistics is useless anyway (AFAICS we'll just use the regular one we >> have for the column), but if they do pg_dump, that'll fail because of this >> new restriction. > > I think it's okay if it pg_dump throws an error, since the fix is as easy as > dropping the stx object. (It wouldn't be okay if it silently misbehaved.) > > Andres concluded similarly with the reverted autovacuum patch: > https://www.postgresql.org/message-id/20210817105022.e2t4rozkhqy2myhn@alap3.anarazel.de > > +RMT in case someone wants to argue otherwise. > I feel a bit uneasy about it, but if there's a precedent ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 16, 2021 at 05:41:57PM +0200, Tomas Vondra wrote: > > This still seems odd: > > > > postgres=# CREATE STATISTICS asf ON i FROM t; > > ERROR: extended statistics require at least 2 columns > > postgres=# CREATE STATISTICS asf ON (i) FROM t; > > CREATE STATISTICS > > > > It seems wrong that the command works with added parens, but builds expression > > stats on a simple column (which is redundant with what analyze does without > > extended stats). > > Well, yeah. But I think this is a behavior that was discussed somewhere in > this thread, and the agreement was that it's not worth the complexity, as > this comment explains > > * XXX We do only the bare minimum to separate simple attribute and > * complex expressions - for example "(a)" will be treated as a complex > * expression. No matter how elaborate the check is, there'll always be > * a way around it, if the user is determined (consider e.g. "(a+0)"), > * so it's not worth protecting against it. > > Patch 0001 fixes the "double parens" issue discussed elsewhere in this > thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple > column reference. 0002 refuses to create expressional stats on a simple column reference like (a), which I think helps to avoid a user accidentally creating useless ext stats objects (which are redundant with the table's column stats). 0002 does not attempt to refuse cases like (a+0), which I think is fine: we don't try to reject useless cases if someone insists on it. See 240971675, 701fd0bbc. So I am +1 to apply both patches. I added this as an Open Item for increased visibility. -- Justin
On 8/24/21 3:13 PM, Justin Pryzby wrote: > On Mon, Aug 16, 2021 at 05:41:57PM +0200, Tomas Vondra wrote: >>> This still seems odd: >>> >>> postgres=# CREATE STATISTICS asf ON i FROM t; >>> ERROR: extended statistics require at least 2 columns >>> postgres=# CREATE STATISTICS asf ON (i) FROM t; >>> CREATE STATISTICS >>> >>> It seems wrong that the command works with added parens, but builds expression >>> stats on a simple column (which is redundant with what analyze does without >>> extended stats). >> >> Well, yeah. But I think this is a behavior that was discussed somewhere in >> this thread, and the agreement was that it's not worth the complexity, as >> this comment explains >> >> * XXX We do only the bare minimum to separate simple attribute and >> * complex expressions - for example "(a)" will be treated as a complex >> * expression. No matter how elaborate the check is, there'll always be >> * a way around it, if the user is determined (consider e.g. "(a+0)"), >> * so it's not worth protecting against it. >> >> Patch 0001 fixes the "double parens" issue discussed elsewhere in this >> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple >> column reference. > > 0002 refuses to create expressional stats on a simple column reference like > (a), which I think is helps to avoid a user accidentally creating useless ext > stats objects (which are redundant with the table's column stats). > > 0002 does not attempt to refuse cases like (a+0), which I think is fine: > we don't try to reject useless cases if someone insists on it. > See 240971675, 701fd0bbc. > > So I am +1 to apply both patches. > > I added this as an Opened Item for increased visibility. > I've pushed both fixes, so the open item should be resolved. However while polishing the second patch, I realized we're allowing statistics on expressions referencing system attributes. 
So this fails: CREATE STATISTICS s ON ctid, x FROM t; but this passes: CREATE STATISTICS s ON (ctid::text), x FROM t; IMO we should reject such expressions, just like we reject direct references to system attributes - patch attached. Furthermore, I wonder if we should reject expressions without any Vars? This works now: CREATE STATISTICS s ON (11::text) FROM t; but it seems rather silly / useless, so maybe we should reject it. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote: > > > Patch 0001 fixes the "double parens" issue discussed elsewhere in this > > > thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple > > > column reference. > > > > 0002 refuses to create expressional stats on a simple column reference like > > (a), which I think is helps to avoid a user accidentally creating useless ext > > stats objects (which are redundant with the table's column stats). > > > > 0002 does not attempt to refuse cases like (a+0), which I think is fine: > > we don't try to reject useless cases if someone insists on it. > > See 240971675, 701fd0bbc. > > > > So I am +1 to apply both patches. > > > > I added this as an Opened Item for increased visibility. > > I've pushed both fixes, so the open item should be resolved. Thank you - I marked it as such. There are some typos in 537ca68db (refenrece) I'll add them to my typos branch if you don't want to patch them right now or wait to see if someone notices anything else. diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c index 59369f8736..17cbd97808 100644 --- a/src/backend/commands/statscmds.c +++ b/src/backend/commands/statscmds.c @@ -205,27 +205,27 @@ CreateStatistics(CreateStatsStmt *stmt) numcols = list_length(stmt->exprs); if (numcols > STATS_MAX_DIMENSIONS) ereport(ERROR, (errcode(ERRCODE_TOO_MANY_COLUMNS), errmsg("cannot have more than %d columns in statistics", STATS_MAX_DIMENSIONS))); /* * Convert the expression list to a simple array of attnums, but also keep * a list of more complex expressions. While at it, enforce some * constraints - we don't allow extended statistics on system attributes, - * and we require the data type to have less-than operator. + * and we require the data type to have a less-than operator. 
* - * There are many ways how to "mask" a simple attribute refenrece as an + * There are many ways to "mask" a simple attribute reference as an * expression, for example "(a+0)" etc. We can't possibly detect all of - * them, but we handle at least the simple case with attribute in parens. + * them, but we handle at least the simple case with the attribute in parens. * There'll always be a way around this, if the user is determined (like * the "(a+0)" example), but this makes it somewhat consistent with how * indexes treat attributes/expressions. */ foreach(cell, stmt->exprs) { StatsElem *selem = lfirst_node(StatsElem, cell); if (selem->name) /* column reference */ { char *attname;
On 9/1/21 9:38 PM, Justin Pryzby wrote: > On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote: >>>> Patch 0001 fixes the "double parens" issue discussed elsewhere in this >>>> thread, and patch 0002 tweaks CREATE STATISTICS to treat "(a)" as a simple >>>> column reference. >>> >>> 0002 refuses to create expressional stats on a simple column reference like >>> (a), which I think is helps to avoid a user accidentally creating useless ext >>> stats objects (which are redundant with the table's column stats). >>> >>> 0002 does not attempt to refuse cases like (a+0), which I think is fine: >>> we don't try to reject useless cases if someone insists on it. >>> See 240971675, 701fd0bbc. >>> >>> So I am +1 to apply both patches. >>> >>> I added this as an Opened Item for increased visibility. >> >> I've pushed both fixes, so the open item should be resolved. > > Thank you - I marked it as such. > > There are some typos in 537ca68db (refenrece) > I'll add them to my typos branch if you don't want to patch them right now or > wait to see if someone notices anything else. > Yeah, probably better to wait a bit. Any opinions on rejecting expressions referencing system attributes or no attributes at all? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote: > However while polishing the second patch, I realized we're allowing > statistics on expressions referencing system attributes. So this fails; > > CREATE STATISTICS s ON ctid, x FROM t; > > but this passes: > > CREATE STATISTICS s ON (ctid::text), x FROM t; > > IMO we should reject such expressions, just like we reject direct references > to system attributes - patch attached. Right, same as indexes. +1 > Furthermore, I wonder if we should reject expressions without any Vars? This > works now: > > CREATE STATISTICS s ON (11:text) FROM t; > > but it seems rather silly / useless, so maybe we should reject it. To my surprise, this is also allowed for indexes... But (maybe this is what I was remembering) it's prohibited to have a constant expression as a partition key. Expressions without a var seem like a case where the user did something deliberately silly, and dissimilar from the case of making a stats expression on a simple column - that seemed like it could be a legitimate mistake/confusion (it's not unreasonable to write an extra parenthesis, but it's strange if that causes it to behave differently). I think it's not worth too much effort to prohibit this: if they're determined, they can still write an expression with a var which is constant. I'm not going to say it's worth zero effort, though. -- Justin
On 9/3/21 5:56 AM, Justin Pryzby wrote: > On Wed, Sep 01, 2021 at 06:45:29PM +0200, Tomas Vondra wrote: >> However while polishing the second patch, I realized we're allowing >> statistics on expressions referencing system attributes. So this fails; >> >> CREATE STATISTICS s ON ctid, x FROM t; >> >> but this passes: >> >> CREATE STATISTICS s ON (ctid::text), x FROM t; >> >> IMO we should reject such expressions, just like we reject direct references >> to system attributes - patch attached. > > Right, same as indexes. +1 > I've pushed this check, disallowing extended stats on expressions referencing system attributes. This means we'll reject both ctid and (ctid::text), just like for indexes. >> Furthermore, I wonder if we should reject expressions without any Vars? This >> works now: >> >> CREATE STATISTICS s ON (11:text) FROM t; >> >> but it seems rather silly / useless, so maybe we should reject it. > > To my surprise, this is also allowed for indexes... > > But (maybe this is what I was remembering) it's prohibited to have a constant > expression as a partition key. > > Expressions without a var seem like a case where the user did something > deliberately silly, and dis-similar from the case of making a stats expression > on a simple column - that seemed like it could be a legitimate > mistake/confusion (it's not unreasonable to write an extra parenthesis, but > it's strange if that causes it to behave differently). > > I think it's not worth too much effort to prohibit this: if they're determined, > they can still write an expresion with a var which is constant. I'm not going > to say it's worth zero effort, though. > I've decided not to push this. The statistics objects on expressions not referencing any variables seem useless, but maybe not entirely - we allow volatile expressions, like CREATE STATISTICS s ON (random()) FROM t; which I suppose might be useful. And we reject similar cases (except for the volatility, of course) for indexes too. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company