Idea about estimating selectivity for single-column expressions
From | Tom Lane
---|---
Subject | Idea about estimating selectivity for single-column expressions
Date |
Msg-id | 10499.1250650414@sss.pgh.pa.us
Replies | Re: Idea about estimating selectivity for single-column expressions
 | Re: Idea about estimating selectivity for single-column expressions
List | pgsql-hackers
There were two different complaints today about the planner's inability to make good estimates for WHERE conditions involving bit AND tests, such as

	(field & 8) = 0
	(field & 8) <> 0
	(field & 127) = field

I'm not particularly eager to try to put in special-case knowledge for this sort of thing, but I had an idea that might cover a wide enough variety of cases to be worth the trouble. It's this: if we have a WHERE clause that is an immutable (or maybe even just stable) function of a single table column, and we don't have any special-purpose estimator that works for it, we could estimate the selectivity by actually evaluating the clause for each MCV and histogram entry for the column, then combining the results with appropriate weightings.

We do something comparable already for "field LIKE pattern" expressions, but this would generalize it to any expression. With the larger histogram sizes that are default as of 8.4, I think this could be expected to provide a pretty good estimate most of the time --- certainly far better than the hardwired default estimates are.

In the long run this might be workable for multi-column expressions too, but not before we have multi-column statistics so we know which combinations of values to try.

I can see a couple of possible downsides:

* If the expression is expensive to evaluate, we could waste a lot of time making the estimate. This would only be a serious issue if the query were designed to avoid evaluating the WHERE clause very often, but that's certainly possible. We could perhaps check the estimated cost of the expression and not do this if it's too high ... but expression cost estimates are generally pretty bogus.

* The expression might throw an error for some inputs, for instance

	(1 / field) < 0.5

which would fail on zero. We could recover by wrapping the whole estimation process in a subtransaction, but that seems really expensive. I thought about arguing that the failure would happen anyway at runtime, but that doesn't hold water --- for example, the user might have just deleted all the rows with field = 0, and would have grounds to complain if the query failed; but there would still be an entry for zero in the histogram.

Comments?

			regards, tom lane
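[Editor's note: to make the proposed combination of per-entry results concrete, here is a minimal standalone sketch in plain C (not planner code) of one plausible reading of "appropriate weightings": MCVs that satisfy the clause contribute their known frequencies directly, and the histogram entries are treated as an equal-weight sample of the remaining non-MCV, non-null rows. All names here (estimate_selectivity, bit8_is_zero, the sample statistics) are hypothetical illustrations, not PostgreSQL APIs, and the actual weighting scheme could be chosen differently.]

	/*
	 * Sketch of selectivity estimation for an arbitrary single-column
	 * clause, driven by the column's MCV list and histogram entries.
	 */
	#include <stdio.h>
	#include <stdbool.h>

	typedef bool (*clause_fn)(int value);	/* the single-column WHERE clause */

	static double
	estimate_selectivity(clause_fn clause,
						 const int *mcv_vals, const double *mcv_freqs, int n_mcv,
						 const int *hist_vals, int n_hist,
						 double nullfrac)
	{
		double	mcv_sel = 0.0;		/* selectivity from matching MCVs */
		double	mcv_totfreq = 0.0;	/* total frequency covered by MCV list */
		int		hist_matches = 0;
		int		i;

		for (i = 0; i < n_mcv; i++)
		{
			mcv_totfreq += mcv_freqs[i];
			if (clause(mcv_vals[i]))
				mcv_sel += mcv_freqs[i];
		}

		for (i = 0; i < n_hist; i++)
		{
			if (clause(hist_vals[i]))
				hist_matches++;
		}

		/* fraction of rows not accounted for by NULLs or the MCV list */
		double	otherfrac = 1.0 - nullfrac - mcv_totfreq;
		double	hist_sel = (n_hist > 0) ? (double) hist_matches / n_hist : 0.5;

		return mcv_sel + hist_sel * (otherfrac > 0.0 ? otherfrac : 0.0);
	}

	/* example clause from the message: (field & 8) = 0 */
	static bool
	bit8_is_zero(int value)
	{
		return (value & 8) == 0;
	}

	int
	main(void)
	{
		/* made-up statistics for illustration only */
		int		mcv_vals[] = {0, 1, 8};
		double	mcv_freqs[] = {0.30, 0.20, 0.10};
		int		hist_vals[] = {2, 3, 5, 9, 12, 17, 24, 40, 65, 127};
		double	sel;

		sel = estimate_selectivity(bit8_is_zero,
								   mcv_vals, mcv_freqs, 3,
								   hist_vals, 10,
								   0.05);	/* assume 5% NULLs */
		printf("estimated selectivity = %.3f\n", sel);
		return 0;
	}

[Treating each histogram entry with equal weight follows from the histogram boundaries delimiting equal-population bins; whether that, or something more refined, counts as "appropriate" is left open above.]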