Re: a wrong index choose when statistics is out of date

Поиск

Список

Период

Сортировка

От	Andrei Lepikhov
Тема	Re: a wrong index choose when statistics is out of date
Дата	7 марта 2024 г. 08:17:10
Msg-id	701d2097-2c5b-41e2-8629-734e3c8ba613@postgrespro.ru обсуждение исходный текст
Ответ на	Re: a wrong index choose when statistics is out of date (Andy Fan <zhihuifan1213@163.com>)
Ответы	Re: a wrong index choose when statistics is out of date Re: a wrong index choose when statistics is out of date
Список	pgsql-hackers

Дерево обсуждения

On 5/3/2024 19:56, Andy Fan wrote:
> I think it is OK for a design review, for the implementaion side, the
> known issue includes:
> 
> 1. Support grap such infromation from its parent for partitioned table
> if the child doesn't have such information.
> 2. builtin document and testing.
> 
> Any feedback is welcome.
Thanks for your efforts.
I was confused when you showed the problem connected to clauses like 
"Var op Const" and "Var op Param".
As far as I know, the estimation logic of such clauses uses MCV and 
number-distinct statistics. So, being out of MCV values, it becomes 
totally insensitive to any internal skew in data and any data outside 
the statistics boundaries.
Having studied the example you provided with the patch, I think it is 
not a correct example:
Difference between var_eq_const and var_eq_non_const quite obvious:
In the second routine, you don't have information about the const value 
and can't use MCV for estimation. Also, you can't exclude MCV values 
from the estimation. And it is just luck that you've got the right 
answer. I think if you increased the weight of the unknown part, you 
would get a bad result, too.
I would like to ask David why the var_eq_const estimator doesn't have an 
option for estimation with a histogram. Having that would relieve a 
problem with skewed data. Detecting the situation with incoming const 
that is out of the covered area would allow us to fall back to ndistinct 
estimation or something else. At least, histogram usage can be 
restricted by the reltuples value and ratio between the total number of 
MCV values and the total number of distinct values in the table.

Just for demo: demonstration of data skew issue:

CREATE EXTENSION tablefunc;
CREATE TABLE norm_test AS
   SELECT abs(r::integer) AS val
   FROM normal_rand(1E7::integer, 5.::float8, 300.::float8) AS r;
ANALYZE norm_test;

-- First query is estimated with MCV quite precisely:
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 100;
-- result: planned rows=25669, actual rows=25139

-- Here we have numdistinct estimation, mostly arbitrary:
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 200;
-- result: planned rows=8604, actual rows=21239
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 500;
-- result: planned rows=8604, actual rows=6748
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 600;
-- result: planned rows=8604, actual rows=3501
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 700;
-- result: planned rows=8604, actual rows=1705
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 1000;
-- result: planned rows=8604, actual rows=91

-- 
regards,
Andrei Lepikhov
Postgres Professional

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: a wrong index choose when statistics is out of date