Re: Selectivity estimation for intarray with @@

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: Selectivity estimation for intarray with @@
Дата
Msg-id CAPpHfdsxU-V9AB1-PrSUQc42n6WR44vciZFCTiVa2ZOci1vJ0w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Selectivity estimation for intarray with @@  (Heikki Linnakangas <hlinnaka@iki.fi>)
Ответы Re: Selectivity estimation for intarray with @@  (Heikki Linnakangas <hlinnaka@iki.fi>)
Список pgsql-hackers
On Tue, Jul 21, 2015 at 6:49 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 07/21/2015 03:44 PM, Alexander Korotkov wrote:
While Uriy is on vacation, I've revised this patch a bit.

I whacked this around quite a bit, and I think it's in a committable state now. But if you could run whatever tests you were using before on this, to make sure it still produces the same estimates, that would be great. I didn't change the estimates it should produce, only the code structure.

One thing that bothers me slightly with this patch is the way it peeks into the Most-Common-Elements arrays, which is produced by the built-in type analyze function. If we ever change what statistics are collected for arrays, or the way they are stored, this might break. In matchsel, why don't we just call the built-in estimator function for each element that we need to probe, and not look into the statistics ourselves at all? I actually experimented with that, and it did slash much of the code, and it would be more future-proof. However, it was also a lot slower for queries that contain multiple values. That's understandable: the built-in estimator will fetch the statistics tuple, parse the arrays, etc. separately for each value in the query_int, while this patch will do it only once for the whole query, and perform a simple binary search for each value. So overall, I think this is OK as it is. But if we find that we need to use the MCE list in this fashion in more places in the future, it might be worthwhile to add some support code for this in the backend to allow extracting the stats once, and doing multiple "lightweight estimations" using the extracted stats.

Yeah, I see. We could end up with something like this. But probably we would need something more general for extensions which wants to play with statistics.
For instance, pg_trgm could estimate selectivity for "text % text" operator. But in order to provide that it needs trigram statistics. Now it could be implemented by defining separate datatype, but it's a kluge. Probably, we would end up with custom additional statistics for datatypes.
 
Some things I fixed/changed:

* I didn't like that transformOperator() function, which looked up the function's name. I replaced it with separate wrapper functions for each operator, so that the built-in operator's OID can be hardcoded into each.

* I refactored the matchsel function heavily. I think it's more readable now.

* I got rid of the Int4Freq array. It didn't seem significantly easier to work with than the separate values/numbers arrays, so I just used those directly.

* Also use the matchsel estimator for ~~ (the commutator of @@)

In this version of patch it's not checked if variable is actually and int[] not query_int. See following test case.

# create table test2 as (select '1'::query_int val from generate_series(1,1000000));
# analyze test2;

# explain select * from test2 where '{1}'::int[] @@ val;
ERROR:  unrecognized int query item type: 0
 
I've added this check.

* Also use the estimators for the obsolete @ and ~ operators. Not that I care much about those as they are obsolete, but seems strange not to, as it's a trivial matter of setting the right estimator function.

* I added an ANALYZE in the regression test. It still won't systematically test all the cost estimation code, and there's nothing to check that the estimates make sense, but at least more of the code will now run.

You also forgot to include intarray--1.0--1.1.sql into patch. I've also added it.
 
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: ALTER TABLE .. ADD PRIMARY KEY .. USING INDEX has dump-restore hazard
Следующее
От: Fabien COELHO
Дата:
Сообщение: Re: pgbench stats per script & other stuff