Re: BUG #17207: Bad cost estimate of Merge Join despite correct row estimate
От | Tom Lane |
---|---|
Тема | Re: BUG #17207: Bad cost estimate of Merge Join despite correct row estimate |
Дата | |
Msg-id | 3433233.1633107188@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Re: BUG #17207: Bad cost estimate of Merge Join despite correct row estimate (Tom Lane <tgl@sss.pgh.pa.us>) |
Список | pgsql-bugs |
I wrote: > I've not yet figured out what the connection is, but a reasonable > bet is that there's an edge case associated with the ranges of the > two join keys being exactly the same. Hmm, no, that's not it. The problem stems from the weird statistics we have for the smaller table's join column: attname | most_common_vals | most_common_freqs | histogram_bounds ---------+------------------+-------------------+------------------ user_id | {1} | {0.6666667} | We have an MCV item for "1", because it's repeated, but there's no histogram (since, with only one other value, there'd be no way to make one). Despite this, get_variable_range() cheerfully decides that the range of values of "user_id" is from 1 to 1, and then mergejoinscansel() concludes that only an insignificant fraction of the bigger table need be scanned. The apparent dependency on commit 7d08ce286 is a bit artificial. We need to calculate the fraction of the bigger table that has id <= 1 (the supposed upper bound of the smaller table). Before that commit, that was approximated as "id < 1", and we got an estimate of exactly 0. With accurate handling of <=, we realize that the estimate should be a shade larger than zero. However, there's a bogus-estimate detector at the end of mergejoinscansel: if (*rightstart >= *rightend) { *rightstart = 0.0; *rightend = 1.0; } In v10, we have both rightstart and rightend exactly zero at this point, so the BS detector trips and restores a sane rightend. In later versions, that doesn't happen so we accept the not-too-sane rightend value. It seems clear that the appropriate fix is to make get_variable_range more cautious. We could decide that it should always require a histogram to be present in order to succeed. However, I'm worried that that would result in unnecessary loss of accuracy in cases where the MCV list covers all of the table (which is typically true for enum-like columns). So I'm thinking that if it sees an MCV list and no histogram, it should succeed only if the MCV frequencies sum to 1, with some small allowance for roundoff error. regards, tom lane
В списке pgsql-bugs по дате отправления: