Обсуждение: Allow to collect statistics on virtual generated columns
Hi hackers, Hi hackers, Now we can create a table with a virtual generated column, but when a condition in WHERE clause contains virtual generated column, estimated rows are not correct since no statistics on this is collectef. [Ex.1] test=# CREATE TABLE t (i int, v int GENERATED ALWAYS AS (i+1) VIRTUAL); CREATE TABLE test=# INSERT INTO t SELECT generate_series(1,1000); INSERT 0 1000 test=# INSERT INTO t SELECT 1 FROM generate_series(1,1000); INSERT 0 1000 test=# EXPLAIN ANALYZE SELECT * FROM t WHERE v = 2; QUERY PLAN -------------------------------------------------------------------------------------------------- Seq Scan on t (cost=0.00..36.02 rows=9 width=8) (actual time=0.093..3.059 rows=1001.00 loops=1) Filter: ((i + 1) = 2) Rows Removed by Filter: 999 Buffers: shared hit=9 Planning: Buffers: shared hit=26 Planning Time: 1.142 ms Execution Time: 3.434 ms (8 rows) Therefore, I would like to allow to collect statistics on virtual enerated columns. I think there are at least three approaches for this. (1) Allow the normal ANALYZE to collect statistics on virtual generated columns ANALYZE expands virtual generated columns' expression, and collects statistics on evaluated values. In this approach, the statistics on virtual generated columns are collected in default, but ANALYZE on the table would become a bit expensive. (2) Allow to create an index on virtual generated column This is proposed in [1]. This proposal itself would be useful, I believe it is better to provide a way to collect statistics without cost of creating an index. [1] https://www.postgresql.org/message-id/flat/CACJufxGao-cypdNhifHAdt8jHfK6-HX=tRBovBkgRuxw063GaA@mail.gmail.com (3) Allow to create extended statistics on virtual generated columns In this approach, ANALYZE processes virtual generated columns only if corresponding extended statistics are defined. Although we can create extended statistics on expressions of virtual generated columns even in the current implementation, this enables that users to create a useful statistics this just by specifying a column name without specifying complex expression. I can also think of two variations for this approach. (3a) At the timing when an extended statistics is created, virtual generated columns are expanded, and the statistics is defined on this expression. (3b) At the timing when an extended statistics is created, virtual generated columns are NOT expanded. The statistics is defined on the virtual generated column itself and, the expression is expanded when ANALYZE processes the extended statistics. I've attached a draft patch based on (3a). However, if it is possible we could change the definition of generated columns in future (as proposed in [2]), (3b) might be preferred. [2] https://www.postgresql.org/message-id/flat/CACJufxH3VETr7orF5rW29GnDk3n1wWbOE3WdkHYd3iPGrQ9E_A@mail.gmail.com Here is an example of how the patch works. [Ex.2] test=# CREATE STATISTICS exstat ON v FROM t; CREATE STATISTICS test=# ANALYZE t; ANALYZE test=# EXPLAIN ANALYZE SELECT * FROM t WHERE v = 2; QUERY PLAN ----------------------------------------------------------------------------------------------------- Seq Scan on t (cost=0.00..41.50 rows=1001 width=8) (actual time=0.067..2.422 rows=1001.00 loops=1) Filter: ((i + 1) = 2) Rows Removed by Filter: 999 Buffers: shared hit=9 Planning: Buffers: shared hit=14 Planning Time: 0.785 ms Execution Time: 2.744 ms (8 rows) What do you think of this? Which approach of (1), (3a), or (3b) is good? Or, completely different approach is better? With your feedback, I would like to progress or rework the patch. Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>
Вложения
Hi, On 2025-04-22 18:10:06 +0900, Yugo Nagata wrote: > With your feedback, I would like to progress or rework the patch. Right now the tests seem to always fail: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5714 Fails e.g. with: https://api.cirrus-ci.com/v1/artifact/task/5921189782093824/testrun/build/testrun/regress/regress/regression.diffs diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/stats_ext.out /tmp/cirrus-ci-build/build/testrun/regress/regress/results/stats_ext.out --- /tmp/cirrus-ci-build/src/test/regress/expected/stats_ext.out 2025-05-26 00:59:01.813042000 +0000 +++ /tmp/cirrus-ci-build/build/testrun/regress/regress/results/stats_ext.out 2025-05-26 01:02:20.350387000 +0000 @@ -56,7 +56,6 @@ ERROR: unrecognized statistics kind "unrecognized" -- incorrect expressions CREATE STATISTICS tst ON (y) FROM ext_stats_test; -- single column reference -ERROR: extended statistics require at least 2 columns CREATE STATISTICS tst ON y + z FROM ext_stats_test; -- missing parentheses ERROR: syntax error at or near "+" LINE 1: CREATE STATISTICS tst ON y + z FROM ext_stats_test; @@ -69,25 +68,24 @@ -- statistics on virtual generated column not allowed CREATE TABLE ext_stats_test1 (x int, y int, z int GENERATED ALWAYS AS (x+y) VIRTUAL, w xid); CREATE STATISTICS tst on z from ext_stats_test1; -ERROR: statistics creation on virtual generated columns is not supported CREATE STATISTICS tst on (z) from ext_stats_test1; -ERROR: statistics creation on virtual generated columns is not supported +ERROR: statistics object "tst" already exists CREATE STATISTICS tst on (z+1) from ext_stats_test1; -ERROR: statistics creation on virtual generated columns is not supported +ERROR: statistics object "tst" already exists CREATE STATISTICS tst (ndistinct) ON z from ext_stats_test1; -ERROR: statistics creation on virtual generated columns is not supported +ERROR: statistics object "tst" already exists -- statistics on system column not allowed CREATE STATISTICS tst on tableoid from ext_stats_test1; -ERROR: statistics creation on system columns is not supported +ERROR: statistics object "tst" already exists CREATE STATISTICS tst on (tableoid) from ext_stats_test1; -ERROR: statistics creation on system columns is not supported +ERROR: statistics object "tst" already exists CREATE STATISTICS tst on (tableoid::int+1) from ext_stats_test1; -ERROR: statistics creation on system columns is not supported +ERROR: statistics object "tst" already exists CREATE STATISTICS tst (ndistinct) ON xmin from ext_stats_test1; -ERROR: statistics creation on system columns is not supported +ERROR: statistics object "tst" already exists -- statistics without a less-than operator not supported CREATE STATISTICS tst (ndistinct) ON w from ext_stats_test1; -ERROR: column "w" cannot be used in statistics because its type xid has no default btree operator class +ERROR: statistics object "tst" already exists DROP TABLE ext_stats_test1; -- Ensure stats are dropped sanely, and test IF NOT EXISTS while at it CREATE TABLE ab1 (a INTEGER, b INTEGER, c INTEGER); Greetings, Andres
On Tue, 17 Jun 2025 10:43:41 -0400 Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2025-04-22 18:10:06 +0900, Yugo Nagata wrote: > > With your feedback, I would like to progress or rework the patch. > > Right now the tests seem to always fail: > https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F571 Thank you for letting me know it. I've attached an updated patch to fix the test failure. However, I'm now reconsidering the current approach, where the expression of a virtual generated column is expanded at the time of creating extended statistics. This seems not be ideal, as the statistics would become useless if the expression is later modified. Instead, I'm thinking of an alternative approach: expanding the expression at the time statistics are collected. Best regards, Yugo Nagata > > Fails e.g. with: > https://api.cirrus-ci.com/v1/artifact/task/5921189782093824/testrun/build/testrun/regress/regress/regression.diffs > > diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/stats_ext.out /tmp/cirrus-ci-build/build/testrun/regress/regress/results/stats_ext.out > --- /tmp/cirrus-ci-build/src/test/regress/expected/stats_ext.out 2025-05-26 00:59:01.813042000 +0000 > +++ /tmp/cirrus-ci-build/build/testrun/regress/regress/results/stats_ext.out 2025-05-26 01:02:20.350387000 +0000 > @@ -56,7 +56,6 @@ > ERROR: unrecognized statistics kind "unrecognized" > -- incorrect expressions > CREATE STATISTICS tst ON (y) FROM ext_stats_test; -- single column reference > -ERROR: extended statistics require at least 2 columns > CREATE STATISTICS tst ON y + z FROM ext_stats_test; -- missing parentheses > ERROR: syntax error at or near "+" > LINE 1: CREATE STATISTICS tst ON y + z FROM ext_stats_test; > @@ -69,25 +68,24 @@ > -- statistics on virtual generated column not allowed > CREATE TABLE ext_stats_test1 (x int, y int, z int GENERATED ALWAYS AS (x+y) VIRTUAL, w xid); > CREATE STATISTICS tst on z from ext_stats_test1; > -ERROR: statistics creation on virtual generated columns is not supported > CREATE STATISTICS tst on (z) from ext_stats_test1; > -ERROR: statistics creation on virtual generated columns is not supported > +ERROR: statistics object "tst" already exists > CREATE STATISTICS tst on (z+1) from ext_stats_test1; > -ERROR: statistics creation on virtual generated columns is not supported > +ERROR: statistics object "tst" already exists > CREATE STATISTICS tst (ndistinct) ON z from ext_stats_test1; > -ERROR: statistics creation on virtual generated columns is not supported > +ERROR: statistics object "tst" already exists > -- statistics on system column not allowed > CREATE STATISTICS tst on tableoid from ext_stats_test1; > -ERROR: statistics creation on system columns is not supported > +ERROR: statistics object "tst" already exists > CREATE STATISTICS tst on (tableoid) from ext_stats_test1; > -ERROR: statistics creation on system columns is not supported > +ERROR: statistics object "tst" already exists > CREATE STATISTICS tst on (tableoid::int+1) from ext_stats_test1; > -ERROR: statistics creation on system columns is not supported > +ERROR: statistics object "tst" already exists > CREATE STATISTICS tst (ndistinct) ON xmin from ext_stats_test1; > -ERROR: statistics creation on system columns is not supported > +ERROR: statistics object "tst" already exists > -- statistics without a less-than operator not supported > CREATE STATISTICS tst (ndistinct) ON w from ext_stats_test1; > -ERROR: column "w" cannot be used in statistics because its type xid has no default btree operator class > +ERROR: statistics object "tst" already exists > DROP TABLE ext_stats_test1; > -- Ensure stats are dropped sanely, and test IF NOT EXISTS while at it > CREATE TABLE ab1 (a INTEGER, b INTEGER, c INTEGER); > > Greetings, > > Andres > > -- Yugo Nagata <nagata@sraoss.co.jp>
Вложения
Hi, On Tue, 24 Jun 2025 17:05:33 +0900 Yugo Nagata <nagata@sraoss.co.jp> wrote: > Instead, I'm thinking of an alternative approach: expanding the expression > at the time statistics are collected. I've attached a new patch in this approache. This allows to collect statistics on virtual generated columns. During ANALYZE, generation expressions are expanded, and statistics are computed using compute_expr_stats(). To support this, both compute_expr_stats() and AnlExprData are now exported from extended_stats.c. However, since they are no longer specific to extended statistics, it might be better to move them to analyze.c and vacuum.h. To enable the optimizer to make use of these statistics, a new field named virtual_gencols is added to RelOptInfo. This field holds the expressions of virtual generated columns in the table. In examine_variable(), if an expression in a WHERE clause matches a virtual generated column, the corresponding statistics are used for that expression. Example: - Before applying the patch, the cardinality estimate is erroneous. test=# create table t (i int, j int generated always as (i*10) virtual); CREATE TABLE test=# insert into t select generate_series(1,1000); INSERT 0 1000 test=# insert into t select 1 from generate_series(1,1000); INSERT 0 1000 test=# analyze t; ANALYZE test=# explain analyze select * from t where j = 10; QUERY PLAN ------------------------------------------------------------------------------------------------- Seq Scan on t (cost=0.00..0.02 rows=1 width=8) (actual time=0.031..0.806 rows=1001.00 loops=1) Filter: ((i * 10) = 10) Rows Removed by Filter: 999 Buffers: shared hit=9 Planning: Buffers: shared hit=10 Planning Time: 0.299 ms Execution Time: 0.948 ms (8 rows) - After applying the patch, the cardinality estimate is correct. test=# analyze t; ANALYZE test=# explain analyze select * from t where j = 10; QUERY PLAN ----------------------------------------------------------------------------------------------------- Seq Scan on t (cost=0.00..41.50 rows=1001 width=8) (actual time=0.034..0.871 rows=1001.00 loops=1) Filter: ((i * 10) = 10) Rows Removed by Filter: 999 Buffers: shared hit=9 Planning: Buffers: shared hit=6 Planning Time: 0.374 ms Execution Time: 1.028 ms (8 rows) Note that the patch is still a work in progress, so documentation and tests are not included. Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>
Вложения
On Fri, 1 Aug 2025 00:28:30 +0900 Yugo Nagata <nagata@sraoss.co.jp> wrote: > Hi, > > On Tue, 24 Jun 2025 17:05:33 +0900 > Yugo Nagata <nagata@sraoss.co.jp> wrote: > > > Instead, I'm thinking of an alternative approach: expanding the expression > > at the time statistics are collected. > > I've attached a new patch in this approache. > > This allows to collect statistics on virtual generated columns. > > During ANALYZE, generation expressions are expanded, and statistics are computed > using compute_expr_stats(). To support this, both compute_expr_stats() and AnlExprData > are now exported from extended_stats.c. However, since they are no longer specific > to extended statistics, it might be better to move them to analyze.c and vacuum.h. > > To enable the optimizer to make use of these statistics, a new field named > virtual_gencols is added to RelOptInfo. This field holds the expressions of > virtual generated columns in the table. In examine_variable(), if an expression > in a WHERE clause matches a virtual generated column, the corresponding statistics > are used for that expression. > > Example: > > - Before applying the patch, the cardinality estimate is erroneous. > > test=# create table t (i int, j int generated always as (i*10) virtual); > CREATE TABLE > test=# insert into t select generate_series(1,1000); > INSERT 0 1000 > test=# insert into t select 1 from generate_series(1,1000); > INSERT 0 1000 > test=# analyze t; > ANALYZE > test=# explain analyze select * from t where j = 10; > QUERY PLAN > ------------------------------------------------------------------------------------------------- > Seq Scan on t (cost=0.00..0.02 rows=1 width=8) (actual time=0.031..0.806 rows=1001.00 loops=1) > Filter: ((i * 10) = 10) > Rows Removed by Filter: 999 > Buffers: shared hit=9 > Planning: > Buffers: shared hit=10 > Planning Time: 0.299 ms > Execution Time: 0.948 ms > (8 rows) > > > - After applying the patch, the cardinality estimate is correct. > > test=# analyze t; > ANALYZE > test=# explain analyze select * from t where j = 10; > QUERY PLAN > ----------------------------------------------------------------------------------------------------- > Seq Scan on t (cost=0.00..41.50 rows=1001 width=8) (actual time=0.034..0.871 rows=1001.00 loops=1) > Filter: ((i * 10) = 10) > Rows Removed by Filter: 999 > Buffers: shared hit=9 > Planning: > Buffers: shared hit=6 > Planning Time: 0.374 ms > Execution Time: 1.028 ms > (8 rows) > > > Note that the patch is still a work in progress, so documentation and tests are not included. I've attached an updated patch. I modified the documentation to remove the statement that virtual generated columns do not have statistics. In addition, I added a test to ensure that statistics on virtual generated columns are available. Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>
Вложения
On Fri, 8 Aug 2025 12:21:25 +0900 Yugo Nagata <nagata@sraoss.co.jp> wrote: > On Fri, 1 Aug 2025 00:28:30 +0900 > Yugo Nagata <nagata@sraoss.co.jp> wrote: > > > Hi, > > > > On Tue, 24 Jun 2025 17:05:33 +0900 > > Yugo Nagata <nagata@sraoss.co.jp> wrote: > > > > > Instead, I'm thinking of an alternative approach: expanding the expression > > > at the time statistics are collected. > > > > I've attached a new patch in this approache. > > > > This allows to collect statistics on virtual generated columns. > > > > During ANALYZE, generation expressions are expanded, and statistics are computed > > using compute_expr_stats(). To support this, both compute_expr_stats() and AnlExprData > > are now exported from extended_stats.c. However, since they are no longer specific > > to extended statistics, it might be better to move them to analyze.c and vacuum.h. > > > > To enable the optimizer to make use of these statistics, a new field named > > virtual_gencols is added to RelOptInfo. This field holds the expressions of > > virtual generated columns in the table. In examine_variable(), if an expression > > in a WHERE clause matches a virtual generated column, the corresponding statistics > > are used for that expression. > > > > Example: > > > > - Before applying the patch, the cardinality estimate is erroneous. > > > > test=# create table t (i int, j int generated always as (i*10) virtual); > > CREATE TABLE > > test=# insert into t select generate_series(1,1000); > > INSERT 0 1000 > > test=# insert into t select 1 from generate_series(1,1000); > > INSERT 0 1000 > > test=# analyze t; > > ANALYZE > > test=# explain analyze select * from t where j = 10; > > QUERY PLAN > > ------------------------------------------------------------------------------------------------- > > Seq Scan on t (cost=0.00..0.02 rows=1 width=8) (actual time=0.031..0.806 rows=1001.00 loops=1) > > Filter: ((i * 10) = 10) > > Rows Removed by Filter: 999 > > Buffers: shared hit=9 > > Planning: > > Buffers: shared hit=10 > > Planning Time: 0.299 ms > > Execution Time: 0.948 ms > > (8 rows) > > > > > > - After applying the patch, the cardinality estimate is correct. > > > > test=# analyze t; > > ANALYZE > > test=# explain analyze select * from t where j = 10; > > QUERY PLAN > > ----------------------------------------------------------------------------------------------------- > > Seq Scan on t (cost=0.00..41.50 rows=1001 width=8) (actual time=0.034..0.871 rows=1001.00 loops=1) > > Filter: ((i * 10) = 10) > > Rows Removed by Filter: 999 > > Buffers: shared hit=9 > > Planning: > > Buffers: shared hit=6 > > Planning Time: 0.374 ms > > Execution Time: 1.028 ms > > (8 rows) > > > > > > Note that the patch is still a work in progress, so documentation and tests are not included. > > I've attached an updated patch. > > I modified the documentation to remove the statement that virtual generated columns > do not have statistics. > > In addition, I added a test to ensure that statistics on virtual generated columns > are available. I've attached a rebased patch. Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>
Вложения
On Wed, 20 Aug 2025 14:10:28 +0900 Yugo Nagata <nagata@sraoss.co.jp> wrote: > On Fri, 8 Aug 2025 12:21:25 +0900 > Yugo Nagata <nagata@sraoss.co.jp> wrote: > > > On Fri, 1 Aug 2025 00:28:30 +0900 > > Yugo Nagata <nagata@sraoss.co.jp> wrote: > > > > > Hi, > > > > > > On Tue, 24 Jun 2025 17:05:33 +0900 > > > Yugo Nagata <nagata@sraoss.co.jp> wrote: > > > > > > > Instead, I'm thinking of an alternative approach: expanding the expression > > > > at the time statistics are collected. > > > > > > I've attached a new patch in this approache. > > > > > > This allows to collect statistics on virtual generated columns. > > > > > > During ANALYZE, generation expressions are expanded, and statistics are computed > > > using compute_expr_stats(). To support this, both compute_expr_stats() and AnlExprData > > > are now exported from extended_stats.c. However, since they are no longer specific > > > to extended statistics, it might be better to move them to analyze.c and vacuum.h. > > > > > > To enable the optimizer to make use of these statistics, a new field named > > > virtual_gencols is added to RelOptInfo. This field holds the expressions of > > > virtual generated columns in the table. In examine_variable(), if an expression > > > in a WHERE clause matches a virtual generated column, the corresponding statistics > > > are used for that expression. > > > > > > Example: > > > > > > - Before applying the patch, the cardinality estimate is erroneous. > > > > > > test=# create table t (i int, j int generated always as (i*10) virtual); > > > CREATE TABLE > > > test=# insert into t select generate_series(1,1000); > > > INSERT 0 1000 > > > test=# insert into t select 1 from generate_series(1,1000); > > > INSERT 0 1000 > > > test=# analyze t; > > > ANALYZE > > > test=# explain analyze select * from t where j = 10; > > > QUERY PLAN > > > ------------------------------------------------------------------------------------------------- > > > Seq Scan on t (cost=0.00..0.02 rows=1 width=8) (actual time=0.031..0.806 rows=1001.00 loops=1) > > > Filter: ((i * 10) = 10) > > > Rows Removed by Filter: 999 > > > Buffers: shared hit=9 > > > Planning: > > > Buffers: shared hit=10 > > > Planning Time: 0.299 ms > > > Execution Time: 0.948 ms > > > (8 rows) > > > > > > > > > - After applying the patch, the cardinality estimate is correct. > > > > > > test=# analyze t; > > > ANALYZE > > > test=# explain analyze select * from t where j = 10; > > > QUERY PLAN > > > ----------------------------------------------------------------------------------------------------- > > > Seq Scan on t (cost=0.00..41.50 rows=1001 width=8) (actual time=0.034..0.871 rows=1001.00 loops=1) > > > Filter: ((i * 10) = 10) > > > Rows Removed by Filter: 999 > > > Buffers: shared hit=9 > > > Planning: > > > Buffers: shared hit=6 > > > Planning Time: 0.374 ms > > > Execution Time: 1.028 ms > > > (8 rows) > > > > > > > > > Note that the patch is still a work in progress, so documentation and tests are not included. > > > > I've attached an updated patch. > > > > I modified the documentation to remove the statement that virtual generated columns > > do not have statistics. > > > > In addition, I added a test to ensure that statistics on virtual generated columns > > are available. I've attached a rebased patch. Regards, Yugo Nagata -- Yugo Nagata <nagata@sraoss.co.jp>