Re: Performance improvement for joins where outer side is unique

Поиск
Список
Период
Сортировка
От David G. Johnston
Тема Re: Performance improvement for joins where outer side is unique
Дата
Msg-id CAKFQuwYVKsyhWNZaWpc=4YOs6qBnJE-pRcN9CWh7ExqZyvNNpg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Performance improvement for joins where outer side is unique  ("David G. Johnston" <david.g.johnston@gmail.com>)
Список pgsql-hackers
On Saturday, March 12, 2016, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Saturday, March 12, 2016, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> Don't the semantics of a SEMI JOIN also state that the output columns only
> come from the outer relation? i.e., the inner relation doesn't contribute
> either rows or columns to the final result?  Or is that simply
> an implementation artifact of the fact that the only current way to perform
> a semi-join explicitly is via exists/in?

I think it's an artifact.  What nodes.h actually says about it is you get
the values of one randomly-selected matching inner row, which seems like
a fine definition for the purposes we plan to put it to.


But is it a definition that actually materializes anywhere presently?

I'm not sure what we consider an authoritative source but relational algebra does define the results of semi and anti joins as only containing rows from main relation.


Pondering it more calling these optimizations "semi" joins further distances us from the meaning of "semi" as used in relational algebra.  The half that semi refers to IS that only one half of the tables are returned.  That you only get a single row of output regardless of multiple potential matches is simply a consequence of this and general set theory.

In short "semi" communicates a semantic meaning as to the intended output of the query irrespective of the data upon which it is executed.  We now are hijacking the and calling something "semi" if by some chance the data the query is operating against happens to be accommodating to some particular optimization.

This seems wrong on definitional and cleanliness grounds.

So while I'm still liking the idea of introducing specializations of outer and inner joins I think calling them "semi" joins adds a definitional inconsistency we are better off avoiding.

This came about because calling something "outer semi join" struck me as odd.

Something like "outer only join" and "inner only join" comes to mind.  Consider the parallel between this and "index only scan".  Learning that "only" means "join the outer row to the (at most for outer) one and only row in the inner relation" doesn't seem to much of a challenge.

David J.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: [COMMITTERS] pgsql: Only try to push down foreign joins if the user mapping OIDs mat
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: amcheck (B-Tree integrity checking tool)