Re: Problem (bug?) with like
От | Bruce Momjian |
---|---|
Тема | Re: Problem (bug?) with like |
Дата | |
Msg-id | 200112290508.fBT58Cn14715@candle.pha.pa.us обсуждение исходный текст |
Ответ на | Re: Problem (bug?) with like (Tom Lane <tgl@sss.pgh.pa.us>) |
Список | pgsql-general |
> The problem here is that the planner is being way too optimistic about > the selectivity of LIKE '%DAVID%' --- notice the estimate that only > one matching row will be found in cliente, rather than 54 as with '%DA%'. > So it chooses a plan that avoids the sort overhead needed for an > efficient merge join with the other tables. That would be a win if > there were only one matching row, but as soon as there are lots, it's > a big loss, because the subquery to join the other tables gets redone > for every matching row :-( > > >> Also, how many rows are there really that match '%DA%' and '%DAVID%'? > > > 1) 2672 rows -> 3.59 sec. > > 2) 257 rows -> 364.69 sec. > > I am thinking that the rules for selectivity of LIKE patterns probably > need to be modified. Presently the code assumes that a long constant > string has probability of occurrence proportional to the product of the > probabilities of the individual letters. That might be true in a random > world, but people don't search for random strings. I think we need to > back off the selectivity estimate by some large factor to account for > the fact that the pattern being searched for is probably not random. > Anyone have ideas how to do that? Let's use the above example with the new FIXED_CHAR_SEL values: With the new 0.20 value for FIXED_CHAR_SEL, we see for DA and DAVID above: DA 1) 0.20 ^ 2 .04 DAVID 2) 0.20 ^ 5 .00032 If we divide these two, we get: > 0.04 / 0.00032 125 while looking at the total counts reported above, we get: > 2672 / 257 ~10.39688715953307392996 The 0.04 value gives a value of: > 0.04 ^ 2 .0016 > 0.04 ^ 5 .0000001024 > .0016 / .0000001024 15625 Clearly the 0.20 value is 10x too large, while the 0.04 value is 1000x too large. Because this was a contrived example, and because some have more random text than DAVID in their field, I think 0.20 is the proper value. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
В списке pgsql-general по дате отправления: