Re: Performance improvement for joins where outer side is unique

Поиск

Список

Период

Сортировка

От	David G. Johnston
Тема	Re: Performance improvement for joins where outer side is unique
Дата	12 марта 2016 г. 20:01:21
Msg-id	CAKFQuwYVKsyhWNZaWpc=4YOs6qBnJE-pRcN9CWh7ExqZyvNNpg@mail.gmail.com обсуждение исходный текст
Ответ на	Re: Performance improvement for joins where outer side is unique ("David G. Johnston" <david.g.johnston@gmail.com>)
Список	pgsql-hackers

Дерево обсуждения

On Saturday, March 12, 2016, David G. Johnston <david.g.johnston@gmail.com> wrote:

On Saturday, March 12, 2016, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> Don't the semantics of a SEMI JOIN also state that the output columns only
> come from the outer relation? i.e., the inner relation doesn't contribute
> either rows or columns to the final result? Or is that simply
> an implementation artifact of the fact that the only current way to perform
> a semi-join explicitly is via exists/in?

I think it's an artifact. What nodes.h actually says about it is you get
the values of one randomly-selected matching inner row, which seems like
a fine definition for the purposes we plan to put it to.

But is it a definition that actually materializes anywhere presently?

I'm not sure what we consider an authoritative source but relational algebra does define the results of semi and anti joins as only containing rows from main relation.

https://en.m.wikipedia.org/wiki/Relational_algebra

Pondering it more calling these optimizations "semi" joins further distances us from the meaning of "semi" as used in relational algebra. The half that semi refers to IS that only one half of the tables are returned. That you only get a single row of output regardless of multiple potential matches is simply a consequence of this and general set theory.

In short "semi" communicates a semantic meaning as to the intended output of the query irrespective of the data upon which it is executed. We now are hijacking the and calling something "semi" if by some chance the data the query is operating against happens to be accommodating to some particular optimization.

This seems wrong on definitional and cleanliness grounds.

So while I'm still liking the idea of introducing specializations of outer and inner joins I think calling them "semi" joins adds a definitional inconsistency we are better off avoiding.

This came about because calling something "outer semi join" struck me as odd.

Something like "outer only join" and "inner only join" comes to mind. Consider the parallel between this and "index only scan". Learning that "only" means "join the outer row to the (at most for outer) one and only row in the inner relation" doesn't seem to much of a challenge.

David J.

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Performance improvement for joins where outer side is unique