Discussion: Postgres picks suboptimal index after building of an extended statistics
Postgres picks suboptimal index after building of an extended statistics
From
"Andrey V. Lepikhov"
Date:
Hi, Ivan Frolkov reported a problem with choosing a non-optimal index during query optimization. This problem appeared after building extended statistics. I prepared a test case (see t.sql in attachment). To reproduce this case we need a composite primary key index and one other index. Before the creation of extended statistics, a SELECT from the table chooses the PK index and returns only one row. But afterwards, this SELECT picks the alternative index and fetches and filters many tuples. The problem is related to a corner case in the btree cost estimation procedure: if postgres detects a unique one-row index scan, it sets numIndexTuples to 1.0. But the selectivity is calculated as usual, by the clauselist_selectivity() routine, and can have a value much larger than the one corresponding to a single tuple. This selectivity value is used later in the code to calculate the number of fetched tuples and can lead to choosing a suboptimal index. The attached patch is my suggestion to fix this problem. -- regards, Andrey Lepikhov Postgres Professional
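The mismatch can be illustrated with a toy model (illustrative Python, not actual PostgreSQL code; the numbers are rough approximations taken from later messages in this thread):

```python
# Toy sketch of the corner case described above: btree costing clamps the
# index tuple count for a unique, fully-matched scan, while the heap-fetch
# estimate is still derived from the clause selectivity.
ntuples = 2001            # rows in the test table
clause_sel = 0.0138       # clauselist_selectivity() result with extended stats

num_index_tuples = 1.0    # forced for a detected unique one-row index scan

# ...but the number of fetched tuples is still computed from the clause
# selectivity, so the same path is costed as if it fetched ~28 tuples:
tuples_fetched = max(1.0, round(clause_sel * ntuples))

print(num_index_tuples, tuples_fetched)   # 1.0 vs 28 -- the inconsistency
```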
Attachments
Re: Postgres picks suboptimal index after building of an extended statistics
From
Peter Geoghegan
Date:
On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote: > Ivan Frolkov reported a problem with choosing a non-optimal index during > a query optimization. This problem appeared after building of an > extended statistics. Any thoughts on this, Tomas? -- Peter Geoghegan
On 8/11/21 2:48 AM, Peter Geoghegan wrote: > On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov > <a.lepikhov@postgrespro.ru> wrote: >> Ivan Frolkov reported a problem with choosing a non-optimal index during >> a query optimization. This problem appeared after building of an >> extended statistics. > > Any thoughts on this, Tomas? > Thanks for reminding me, I missed / forgot about this thread. I agree the current behavior is unfortunate, but I'm not convinced the proposed patch is fixing the right place - doesn't this mean the index costing won't match the row estimates displayed by EXPLAIN? I wonder if we should teach clauselist_selectivity about UNIQUE indexes, and improve the cardinality estimates directly, not just costing for index scans. Also, is it correct that the patch calculates num_sa_scans only when (numIndexTuples >= 0.0)? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Postgres picks suboptimal index after building of an extended statistics
From
"Andrey V. Lepikhov"
Date:
On 8/12/21 4:26 AM, Tomas Vondra wrote: > On 8/11/21 2:48 AM, Peter Geoghegan wrote: >> On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov >> <a.lepikhov@postgrespro.ru> wrote: >>> Ivan Frolkov reported a problem with choosing a non-optimal index during >>> a query optimization. This problem appeared after building of an >>> extended statistics. >> >> Any thoughts on this, Tomas? >> > > Thanks for reminding me, I missed / forgot about this thread. > > I agree the current behavior is unfortunate, but I'm not convinced the > proposed patch is fixing the right place - doesn't this mean the index > costing won't match the row estimates displayed by EXPLAIN? I think, it is not a problem. In EXPLAIN you will see only 1 row with/without this patch. > > I wonder if we should teach clauselist_selectivity about UNIQUE indexes, > and improve the cardinality estimates directly, not just costing for > index scans. This idea looks better. I will try to implement it. > > Also, is it correct that the patch calculates num_sa_scans only when > (numIndexTuples >= 0.0)? Thanks, fixed. -- regards, Andrey Lepikhov Postgres Professional
Attachments
Re: Postgres picks suboptimal index after building of an extended statistics
From
Andrey Lepikhov
Date:
On 12/8/21 04:26, Tomas Vondra wrote: > On 8/11/21 2:48 AM, Peter Geoghegan wrote: >> On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov > I agree the current behavior is unfortunate, but I'm not convinced the > proposed patch is fixing the right place - doesn't this mean the index > costing won't match the row estimates displayed by EXPLAIN? I rewrote the patch. It's now simpler and shorter, and maybe more convenient. Also, it includes a regression test to detect the problem in the future. > > I wonder if we should teach clauselist_selectivity about UNIQUE indexes, > and improve the cardinality estimates directly, not just costing for > index scans. I tried to implement this in different ways. But it causes additional overhead and code complexity - analyzing the list of indexes and matching the clauses of each index with the input clauses in each selectivity estimation. I don't like that approach and propose a new patch in the attachment. -- regards, Andrey Lepikhov Postgres Professional
Attachments
Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes: > On 12/8/21 04:26, Tomas Vondra wrote: >> I wonder if we should teach clauselist_selectivity about UNIQUE indexes, >> and improve the cardinality estimates directly, not just costing for >> index scans. > I tried to implement this in different ways. But it causes additional > overhead and code complexity - analyzing a list of indexes and match > clauses of each index with input clauses in each selectivity estimation. > I don't like that way and propose a new patch in attachment. I looked at this briefly. I do not think that messing with btcostestimate/genericcostestimate is the right response at all. The problem can be demonstrated with no index whatever, as in the attached shortened version of the original example. I get QUERY PLAN --------------------------------------------------- Seq Scan on a (cost=0.00..46.02 rows=1 width=12) Filter: ((x = 1) AND (y = 1) AND (z = 1)) (2 rows) before adding the extended stats, and QUERY PLAN ---------------------------------------------------- Seq Scan on a (cost=0.00..46.02 rows=28 width=12) Filter: ((x = 1) AND (y = 1) AND (z = 1)) (2 rows) afterwards. So the extended stats have made the rowcount estimate significantly worse, which seems like an indicator of a bug somewhere in extended stats. The more so because I can crank default_statistics_target all the way to 10000 without these estimates changing. If we can't get a dead-on estimate for a 2001-row table at that stats level, we're doing something wrong, surely? Also, I found that if I ask only for ndistinct stats, I still get rows=1. The fishiness seems to be directly a problem with dependencies stats. 
regards, tom lane

DROP TABLE IF EXISTS a CASCADE;
DROP STATISTICS IF EXISTS aestat;
CREATE TABLE a AS (
  SELECT gs % 10 AS x,
         (gs % 10 + (gs/10::int4) % 10) % 10 AS y,
         (gs / 100)::int4 AS z
  FROM generate_series(1,1000) AS gs
);
INSERT INTO a (SELECT gs,gs,gs FROM generate_series(1000,2000) AS gs);
-- ALTER TABLE a ADD PRIMARY KEY (x,y,z);
-- CREATE INDEX ON a(x);
ANALYZE a;
EXPLAIN SELECT * FROM a WHERE x=1 AND y=1 AND z=1;
CREATE STATISTICS aestat(dependencies,ndistinct) ON x,y,z FROM a;
ANALYZE a;
EXPLAIN SELECT * FROM a WHERE x=1 AND y=1 AND z=1;
Re: Postgres picks suboptimal index after building of an extended statistics
From
Andrey Lepikhov
Date:
On 7/8/22 03:07, Tom Lane wrote: > Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes: >> On 12/8/21 04:26, Tomas Vondra wrote: >>> I wonder if we should teach clauselist_selectivity about UNIQUE indexes, >>> and improve the cardinality estimates directly, not just costing for >>> index scans. > >> I tried to implement this in different ways. But it causes additional >> overhead and code complexity - analyzing a list of indexes and match >> clauses of each index with input clauses in each selectivity estimation. >> I don't like that way and propose a new patch in attachment. > > I looked at this briefly. I do not think that messing with > btcostestimate/genericcostestimate is the right response at all. > The problem can be demonstrated with no index whatever, as in the > attached shortened version of the original example. I get I partly agree with you. Yes, I see the problem too. But we also have the problem I described above: the optimizer doesn't choose the path with minimal selectivity from a set of selectivities that all correspond to a cardinality of less than 1 (see badestimate2.sql). The new patch (see attachment) fixes this problem. -- Regards Andrey Lepikhov Postgres Professional
Attachments
Hi, On 2022-07-11 12:57:36 +0500, Andrey Lepikhov wrote: > On 7/8/22 03:07, Tom Lane wrote: > > Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes: > > > On 12/8/21 04:26, Tomas Vondra wrote: > > > > I wonder if we should teach clauselist_selectivity about UNIQUE indexes, > > > > and improve the cardinality estimates directly, not just costing for > > > > index scans. > > > > > I tried to implement this in different ways. But it causes additional > > > overhead and code complexity - analyzing a list of indexes and match > > > clauses of each index with input clauses in each selectivity estimation. > > > I don't like that way and propose a new patch in attachment. > > > > I looked at this briefly. I do not think that messing with > > btcostestimate/genericcostestimate is the right response at all. > > The problem can be demonstrated with no index whatever, as in the > > attached shortened version of the original example. I get > > I partly agree with you. Yes, I see the problem too. But also we have a > problem that I described above: optimizer don't choose a path with minimal > selectivity from a set selectivities which shows cardinality less than 1 > (see badestimate2.sql). > New patch (see in attachment), fixes this problem. 
This causes the main regression tests to fail due to a planner change: https://cirrus-ci.com/build/6680222884429824

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/join.out /tmp/cirrus-ci-build/build/testrun/regress/regress/results/join.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/join.out 2022-11-22 12:27:18.852087140 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress/regress/results/join.out 2022-11-22 12:28:47.934938882 +0000
@@ -6671,10 +6671,9 @@
    Merge Cond: (j1.id1 = j2.id1)
    Join Filter: (j2.id2 = j1.id2)
    ->  Index Scan using j1_id1_idx on j1
-   ->  Index Only Scan using j2_pkey on j2
+   ->  Index Scan using j2_id1_idx on j2
          Index Cond: (id1 >= ANY ('{1,5}'::integer[]))
-         Filter: ((id1 % 1000) = 1)
-(7 rows)
+(6 rows)

 select * from j1 inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2

Greetings, Andres Freund
Re: Postgres picks suboptimal index after building of an extended statistics
From
Andrey Lepikhov
Date:
On 12/8/2021 06:26, Tomas Vondra wrote: > On 8/11/21 2:48 AM, Peter Geoghegan wrote: >> On Wed, Jun 23, 2021 at 7:19 AM Andrey V. Lepikhov >> <a.lepikhov@postgrespro.ru> wrote: >>> Ivan Frolkov reported a problem with choosing a non-optimal index during >>> a query optimization. This problem appeared after building of an >>> extended statistics. >> >> Any thoughts on this, Tomas? >> > > Thanks for reminding me, I missed / forgot about this thread. > > I agree the current behavior is unfortunate, but I'm not convinced the > proposed patch is fixing the right place - doesn't this mean the index > costing won't match the row estimates displayed by EXPLAIN? > > I wonder if we should teach clauselist_selectivity about UNIQUE indexes, > and improve the cardinality estimates directly, not just costing for > index scans. > > Also, is it correct that the patch calculates num_sa_scans only when > (numIndexTuples >= 0.0)? I can't stop thinking about this issue. It is bizarre when Postgres chooses a non-unique index if a unique index gives us proof of minimum scan. I don't see a reason to teach the clauselist_selectivity() routine to estimate UNIQUE indexes. We add some cycles, but it will work with btree indexes only. Maybe to change compare_path_costs_fuzzily() and add some heuristic, for example: "If selectivity of both paths gives us no more than 1 row, prefer to use a unique index or an index with least selectivity." -- regards, Andrey Lepikhov Postgres Professional
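The tie-break Andrey sketches here could look roughly like this (hypothetical Python pseudocode, not the actual compare_path_costs_fuzzily() implementation; the `unique` flag and the 1% fuzz factor are assumptions for illustration):

```python
from collections import namedtuple

# Minimal stand-in for a planner path: total cost, estimated rows,
# and whether the underlying index is unique (hypothetical attribute).
Path = namedtuple("Path", "total_cost rows unique")

def cheaper_path(p1, p2, fuzz=1.01):
    # Clearly different costs: keep the usual cost-based behavior.
    if p1.total_cost > p2.total_cost * fuzz:
        return p2
    if p2.total_cost > p1.total_cost * fuzz:
        return p1
    # Fuzzily equal costs and both paths estimated at <= 1 row:
    # prefer the path backed by a unique index.
    if p1.rows <= 1.0 and p2.rows <= 1.0 and p1.unique != p2.unique:
        return p1 if p1.unique else p2
    return p1 if p1.total_cost <= p2.total_cost else p2

pk_scan = Path(total_cost=8.3, rows=1.0, unique=True)
alt_scan = Path(total_cost=8.3, rows=1.0, unique=False)
print(cheaper_path(alt_scan, pk_scan).unique)   # True: unique index wins the tie
```

Note this only changes which of two already-equal-cost paths survives; it does not touch the row estimates themselves, which is the concern raised in the reply below.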
On 9/25/23 06:30, Andrey Lepikhov wrote: > ... > I can't stop thinking about this issue. It is bizarre when Postgres > chooses a non-unique index if a unique index gives us proof of minimum > scan. That's true, but no one implemented this heuristic. So the "proof of minimum scan" is merely hypothetical - the optimizer is unaware of it. > I don't see a reason to teach the clauselist_selectivity() routine to > estimate UNIQUE indexes. We add some cycles, but it will work with btree > indexes only. I'm not sure I understand what this is meant to say. Can you elaborate? We only allow UNIQUE for btree indexes anyway, so what exactly is the problem here? > Maybe to change compare_path_costs_fuzzily() and add some heuristic, for > example: > "If selectivity of both paths gives us no more than 1 row, prefer to use > a unique index or an index with least selectivity." > I don't understand how this would work. What do you mean by "selectivity of a path"? AFAICS the problem here is that we estimate a path to return more rows (while we know there can only be 1, thanks to UNIQUE index). So how would you know the path does not give us more than 1 row? Surely you're not proposing compare_path_costs_fuzzily() to do something expensive like analyzing the paths, or something. Also, how would it work in upper levels? If you just change which path we keep, but leave the inaccurate row estimate in place, that may fix that level, but it's certainly going to confuse later planning steps. IMHO the problem here is that we produce a wrong estimate, so we better fix that, instead of adding band-aids to other places. This happens because functional dependencies are very simple type of statistics - it has some expectations about the input data and also the queries executed on it. For example it assumes the data is reasonably homogeneous, so that we can calculate a global "degree".
But the data in the example directly violates that - it has 1000 rows that are very random (so we'd find no dependencies), and 1000 rows with perfect dependencies. Hence we end with degree=0.5, which approximates the dependencies to all data. Not great, true, but that's the price for simplicity of this statistics kind. So the simplest solution is to disable dependencies on such data sets. It's a bit inconvenient/unfortunate we build dependencies by default, and people may not be aware of their assumptions. Perhaps we could/should make dependency_degree() more pessimistic when we find some "violations" of the rule (we intentionally are not strict about it, because data are imperfect). I don't have a good idea how to change the formulas, but I'm open to the idea in principle. The other thing we could do is checking for unique indexes in clauselist_selectivity, and if we find a match we can just skip the extended statistics altogether. Not sure how expensive this would be, but for typical cases (with modest number of indexes) perhaps OK. It wouldn't work without a unique index, but I don't have a better idea. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
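The degree=0.5 figure can be reproduced with a simplified re-implementation of the dependency-degree idea on the dataset from Tom's script. This is an approximation of dependency_degree() that just counts rows whose (x, y) group determines z exactly; the real algorithm differs in details:

```python
from collections import defaultdict

# Rebuild the dataset from the test script: a "random" half and a
# perfectly dependent half.
rows = [(gs % 10, (gs % 10 + (gs // 10) % 10) % 10, gs // 100)
        for gs in range(1, 1001)]                       # no x,y -> z dependency
rows += [(gs, gs, gs) for gs in range(1000, 2001)]      # perfect dependency

# For each (x, y) group, collect the distinct z values it maps to.
zvals = defaultdict(set)
sizes = defaultdict(int)
for x, y, z in rows:
    zvals[(x, y)].add(z)
    sizes[(x, y)] += 1

# Rows whose (x, y) group maps to exactly one z support the dependency.
supporting = sum(sizes[k] for k, zs in zvals.items() if len(zs) == 1)
degree = supporting / len(rows)
print(round(degree, 3))   # ~0.5: only the second half obeys x,y -> z
```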
Re: Postgres picks suboptimal index after building of an extended statistics
From
Andrei Lepikhov
Date:
Thanks for the detailed answer. On 3/11/2023 00:37, Tomas Vondra wrote: > On 9/25/23 06:30, Andrey Lepikhov wrote: >> ... >> I can't stop thinking about this issue. It is bizarre when Postgres >> chooses a non-unique index if a unique index gives us proof of minimum >> scan. > That's true, but no one implemented this heuristic. So the "proof of > minimum scan" is merely hypothetical - the optimizer is unaware of it. See the simple patch in the attachment. There, I have attempted to resolve situations of uncertainty to avoid making decisions based solely on the order of indexes in the list. >> I don't see a reason to teach the clauselist_selectivity() routine to >> estimate UNIQUE indexes. We add some cycles, but it will work with btree >> indexes only. > I'm not sure I understand what this is meant to say. Can you elaborate? > We only allow UNIQUE for btree indexes anyway, so what exactly is the > problem here? Partly, you already answered yourself below: we have unique index estimation in a few estimation calls, but go through the list of indexes each time. Also, for this sake, we would add some input parameters, usually NULL, because many estimations don't involve indexes at all. >> Maybe to change compare_path_costs_fuzzily() and add some heuristic, for >> example: >> "If selectivity of both paths gives us no more than 1 row, prefer to use >> a unique index or an index with least selectivity." > I don't understand how this would work. What do you mean by "selectivity > of a path"? AFAICS the problem here is that we estimate a path to return > more rows (while we know there can only be 1, thanks to UNIQUE index). Oops, I meant cardinality. See the patch in the attachment. > So how would you know the path does not give us more than 1 row? Surely > you're not proposing compare_path_costs_fuzzily() to do something > expensive like analyzing the paths, or something.
I solely propose to make the optimizer more consistent in its decisions: if we have one row for both path candidates, use the uniqueness of the index or the selectivity value as one more parameter. > Also, how would it work in upper levels? If you just change which path > we keep, but leave the inaccurate row estimate in place, that may fix > that level, but it's certainly going to confuse later planning steps. It is designed only for the scan level. > IMHO the problem here is that we produce wrong estimate, so we better > fix that, instead of adding band-aids to other places. Agree. I am looking for a solution that helps users somehow resolve such problems. As an alternative, I can propose a selectivity hook or (maybe even better) the pathlist approach: add indexes into the index list in some predefined order - at the first positions, place unique indexes with more columns, etc. > This happens because functional dependencies are very simple type of > statistics - it has some expectations about the input data and also the > queries executed on it. For example it assumes the data is reasonably > homogeneous, so that we can calculate a global "degree". > > But the data in the example directly violates that - it has 1000 rows > that are very random (so we'd find no dependencies), and 1000 rows with > perfect dependencies. Hence we end with degree=0.5, which approximates > the dependencies to all data. Not great, true, but that's the price for > simplicity of this statistics kind. > > So the simplest solution is to disable dependencies on such data sets. > It's a bit inconvenient/unfortunate we build dependencies by default, > and people may not be aware of their assumptions. > > Perhaps we could/should make dependency_degree() more pessimistic when > we find some "violations" of the rule (we intentionally are not strict > about it, because data are imperfect). I don't have a good idea how to > change the formulas, but I'm open to the idea in principle.
Thanks for the explanation! > The other thing we could do is checking for unique indexes in > clauselist_selectivity, and if we find a match we can just skip the > extended statistics altogether. Not sure how expensive this would be, > but for typical cases (with modest number of indexes) perhaps OK. It > wouldn't work without a unique index, but I don't have a better idea. It looks a bit expensive to me. But I am ready to try it, if the current solution doesn't look applicable. -- regards, Andrei Lepikhov Postgres Professional
Attachments
Re: Postgres picks suboptimal index after building of an extended statistics
From
Andrei Lepikhov
Date:
Second version of the patch: it resolves the non-symmetric decision, thanks to Teodor Sigaev's review. -- regards, Andrei Lepikhov Postgres Professional
Attachments
Re: Postgres picks suboptimal index after building of an extended statistics
From
Alexander Korotkov
Date:
Hi!
I'd like to get this subject off the ground. The problem originally described in [1] obviously comes from wrong selectivity estimation. "Dependencies" extended statistics lead to significant selectivity miss 24/1000 instead of 1/1000. When the estimation is correct, the PostgreSQL optimizer is capable of choosing the appropriate unique index for the query.
Tom pointed out in [2] that this might be a problem of "Dependencies" extended statistics. I've rechecked this. The dataset consists of two parts. The first part defined in the CREATE TABLE statement contains independent values. The second part defined in the INSERT statement contains dependent values. "Dependencies" extended statistics estimate dependency degree as a fraction of rows containing dependent values. According to this definition, it correctly estimates the dependency degree as about 0.5 for all the combinations. So, the error in the estimate comes from the synergy of two factors: MCV estimation of the frequency of individual values, and spreading of average dependency degree for those values, which is not relevant for them. I don't think there is a way to fix "dependencies" extended statistics because it works exactly as designed. I have to note that instead of fixing "dependencies" extended statistics you can just add multi-column MCV statistics, which would fix the problem.
CREATE STATISTICS aestat(dependencies,ndistinct,mcv) ON x,y,z FROM a;
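For reference, the estimation miss itself can be sketched numerically. With a per-column selectivity around 0.05 and a dependency degree of 0.5, applying the functional-dependency rule sel(b|a) ≈ degree + (1 - degree) * sel(b) to the extra clauses inflates the estimate from under one row to about 28. The numbers below are approximations for this example, and the sketch simplifies how dependencies.c actually chains the dependencies:

```python
p = 0.05        # approximate per-column selectivity of x=1, y=1 and z=1
d = 0.5         # dependency degree found by ANALYZE on this dataset
ntuples = 2001  # rows in table a

# Independent-columns estimate: well under one row, clamped to 1 by the planner.
independent = ntuples * p ** 3          # ~0.25

# With dependencies, each additional clause contributes d + (1 - d) * p
# instead of p, which is roughly 10x larger here.
with_deps = ntuples * p * (d + (1 - d) * p) ** 2

print(round(independent, 2), round(with_deps))   # 0.25 28
```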
Independently of how well our statistics work, it looks pitiful that we couldn't fix that using the knowledge of unique constraints on the table. Unlike statistics, which give us just more or less accurate estimates, the constraint is something we really enforce and thus can rely on. The patches provided by Andrei in [1], [3], and [4] fix this issue at the index scan path level. As Tomas pointed out in [5], that could lead to inconsistency between the number of rows used for unique index scan estimation and the value displayed in EXPLAIN (and used for other paths). Even though this was debated in [6], I can confirm this inconsistency. Thus, with the patch published in [4], I can see the 28-row estimation with a unique index scan.
                              QUERY PLAN
-----------------------------------------------------------------------
Index Only Scan using a_pkey on a (cost=0.28..8.30 rows=28 width=12)
Index Cond: ((x = 1) AND (y = 1) AND (z = 1))
(2 rows)
Also, there is a set of patches [7], [8], and [9], which makes the optimizer consider path selectivity as well as path costs during path selection. I've rechecked that none of these patches could resolve the original problem described in [1]. Also, I think they are quite tricky. The model of our optimizer assumes that the paths in the list should be different ways of getting the same result. If we choose the paths by their selectivity, that breaks this model. I don't say there is no way to do this. But if we do it, that would require significant rethinking of our optimizer model and possibly a revision of a significant part of it. Anyway, I think if there is still interest in this, it should be moved into a separate thread to keep this thread focused on the problem described in [1].
Finally, I'd like to note that the issue described in [1] is mostly a selectivity estimation problem. It could be solved by adding the multi-column MCV statistics. The patches published so far look more like hacks for particular use cases rather than appropriate solutions. It still looks promising to me to use the knowledge of unique constraints during selectivity estimation [10]. Even though it's hard to implement and possibly implies some overhead, it fits the current model. I also think unique constraints could probably be used in some way to improve estimates even when there is no full match.
Links.
1. https://www.postgresql.org/message-id/0ca4553c-1f34-12ba-9122-44199d1ced41%40postgrespro.ru
2. https://www.postgresql.org/message-id/3119052.1657231656%40sss.pgh.pa.us
3. https://www.postgresql.org/message-id/90a1d6ef-c777-b95d-9f77-0065ad4522df%40postgrespro.ru
4. https://www.postgresql.org/message-id/a5a18d86-c0e5-0ceb-9a18-be1beb2d2944%40postgrespro.ru
5. https://www.postgresql.org/message-id/f8044836-5d61-a4e0-af82-5821a2a1f0a7%40enterprisedb.com
6. https://www.postgresql.org/message-id/90a1d6ef-c777-b95d-9f77-0065ad4522df%40postgrespro.ru
7. https://www.postgresql.org/message-id/2df148b5-0bb8-f80b-ac03-251682fab585%40postgrespro.ru
8. https://www.postgresql.org/message-id/6fb43191-2df3-4791-b307-be754e648276%40postgrespro.ru
9. https://www.postgresql.org/message-id/154f786a-06a0-4fb1-b8a4-16c66149731b%40postgrespro.ru
10. https://www.postgresql.org/message-id/f8044836-5d61-a4e0-af82-5821a2a1f0a7%40enterprisedb.com
------
Regards,
Alexander Korotkov
Re: Postgres picks suboptimal index after building of an extended statistics
From
Andrei Lepikhov
Date:
On 18/12/2023 15:29, Alexander Korotkov wrote: > Also, there is a set of patches [7], [8], and [9], which makes the > optimizer consider path selectivity as well as path costs during the > path selection. I've rechecked that none of these patches could resolve > the original problem described in [1]. It is true. We accidentally mixed two different problems in one thread. > Also, I think they are quite > tricky. The model of our optimizer assumes that paths in the list > should be the different ways of getting the same result. If we choose > the paths by their selectivity, that breaks this model. I don't say > there is no way for this. But if we do this, that would require > significant rethinking of our optimizer model and possible revision of a > significant part of it. I can't understand that. In [9] we just elaborate the COSTS_EQUAL case and establish the final decision on a more stable basis than the casual order of indexes in the list. > Anyway, I think if there is still interest in > this, that should be moved into a separate thread to keep this thread > focused on the problem described in [1]. Agree. IMO, the problem of the optimizer's dependency on the order of indexes in the relation's index list is more urgent for now. > > Finally, I'd like to note that the issue described in [1] is mostly the > selectivity estimation problem. It could be solved by adding the > multi-column MCV statistics. The patches published so far look more > like hacks for particular use cases rather than appropriate solutions. > It still looks promising to me to use the knowledge of unique > constraints during selectivity estimation [10]. Even though it's hard > to implement and possibly implies some overhead, it fits the current > model. I also think unique constraints could probably be used in some way > to improve estimates even when there is no full match. I have tried to use the knowledge about unique indexes in the selectivity estimation routine. But it looks invasive and adds a lot of overhead.
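For reference, the shape of such a check, and where the per-estimation overhead comes from, can be sketched roughly like this. Everything here is hypothetical pseudocode built on toy structures, not the actual patch or the real clauselist_selectivity() signature:

```python
from collections import namedtuple

# Toy stand-ins for the planner structures involved.
Clause = namedtuple("Clause", "attr op")
Index = namedtuple("Index", "keys unique")
Rel = namedtuple("Rel", "indexes tuples")

def clauselist_selectivity(rel, clauses, fallback):
    # Hypothetical sketch: clamp the estimate using unique indexes.
    eq_attrs = {c.attr for c in clauses if c.op == "="}
    # The overhead: every estimation request now walks the relation's
    # index list and matches each index's key columns against the clauses.
    for idx in rel.indexes:
        if idx.unique and set(idx.keys) <= eq_attrs:
            # The quals fully cover a unique index: at most one row qualifies.
            return min(1.0, 1.0 / max(rel.tuples, 1.0))
    return fallback   # otherwise defer to the existing machinery

rel = Rel(indexes=[Index(keys=("x", "y", "z"), unique=True)], tuples=2001)
quals = [Clause("x", "="), Clause("y", "="), Clause("z", "=")]
print(clauselist_selectivity(rel, quals, fallback=0.0138))   # 1/2001
```

A real implementation would also need the index list to be available at every estimation call site, which is part of the invasiveness mentioned above.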
-- regards, Andrei Lepikhov Postgres Professional
Re: Postgres picks suboptimal index after building of an extended statistics
From
Alexander Korotkov
Date:
On Thu, Dec 21, 2023 at 10:41 AM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote: > > On 18/12/2023 15:29, Alexander Korotkov wrote: > > Also, there is a set of patches [7], [8], and [9], which makes the > > optimizer consider path selectivity as well as path costs during the > > path selection. I've rechecked that none of these patches could resolve > > the original problem described in [1]. > It is true. We accidentally mixed two different problems in one thread. > > Also, I think they are quite > > tricky. The model of our optimizer assumes that paths in the list > > should be the different ways of getting the same result. If we choose > > the paths by their selectivity, that breaks this model. I don't say > > there is no way for this. But if we do this, that would require > > significant rethinking of our optimizer model and possible revision of a > > significant part of it. > I can't understand that. In [9] we just elaborate the COSTS_EQUAL case > and establish the final decision on a more stable basis than the casual > order of indexes in the list. I took a closer look at the patch in [9]. I should drop my argument about breaking the model, because add_path() already considers other aspects than just costs. But I have two more notes about that patch: 1) It seems that you're determining the fact that the index path should return strictly one row by checking path->rows <= 1.0 and indexinfo->unique. Is it really guaranteed that in this case the quals match the unique constraint? path->rows <= 1.0 could be just an estimation error. Or one row could be correctly estimated, but it's going to be selected by some quals matching the unique constraint and other quals in recheck. So, it seems there is a risk of selecting a suboptimal index due to this condition. 2) Even for non-unique indexes this patch is putting new logic on top of the subsequent code. How can we prove it's going to be a win?
That could lead, for instance, to dropping parallel-safe paths in cases we didn't do so before. Anyway, please start a separate thread if you're willing to put more work into this. > > Anyway, I think if there is still interest in > > this, that should be moved into a separate thread to keep this thread > > focused on the problem described in [1]. > Agree. IMO, the problem of the optimizer's dependency on the order of > indexes in the relation's index list is more urgent for now. > > > > Finally, I'd like to note that the issue described in [1] is mostly the > > selectivity estimation problem. It could be solved by adding the > > multi-column MCV statistics. The patches published so far look more > > like hacks for particular use cases rather than appropriate solutions. > > It still looks promising to me to use the knowledge of unique > > constraints during selectivity estimation [10]. Even though it's hard > > to implement and possibly implies some overhead, it fits the current > > model. I also think unique constraints could probably be used in some way > > to improve estimates even when there is no full match. > I have tried to use the knowledge about unique indexes in the > selectivity estimation routine. But it looks invasive and adds a lot of > overhead. I got it. But it doesn't look like enough to conclude there is no way. Could you please share some of your results? It might happen that we just need to rework some of the data structures to make this information more easily accessible at the selectivity estimation stage. ------ Regards, Alexander Korotkov