Обсуждение: [HACKERS] ASOF join
I wonder if there were some discussion/attempts to add ASOF join to Postgres (sorry, may be there is better term for it, I am refereeing KDB definition: http://code.kx.com/wiki/Reference/aj ).
Such kind of join can be useful when we need to associate two timeseries. It is quite popular in trading:
//join the eq price and size for the op trade time
a::aj[`underlyingSym`time;select time, underlyingSym, sym, putorcall, ex, cond, price, seqnum, contracts, contractsTraded from t;eqtrades];
...and not only. Below is one example of how people now manually coding ASOF join:
select
count(*),
count(*)
filter (where timedelta_prev < -30),
count(*)
filter (where ride_prev = ride_next),
... -- many different aggregates
from
(
select
p.provider_id,
p.ts,
(
select extract(epoch from t.ts - p.ts)
from providers_positions t
where p.provider_id = t.provider_id and t.ts < p.ts and t.source = 'gps'
order by t.ts desc
limit 1
) as timedelta_prev,
(
select extract(epoch from t.ts - p.ts)
from providers_positions t
where p.provider_id = t.provider_id and t.ts > p.ts and t.source = 'gps'
order by t.ts
limit 1
) as timedelta,
(
select ride_id
from providers_positions t
where p.provider_id = t.provider_id and t.ts < p.ts and t.source = 'gps'
order by t.ts desc
limit 1
) as ride_prev,
(
select ride_id
from providers_positions t
where p.provider_id = t.provider_id and t.ts > p.ts and t.source = 'gps'
order by t.ts
limit 1
) as ride_next
from (
select
provider_id,
ts,
event_name
from
lvl2_681_parsed p
) p
where
p.event_name = 'GPS signal restored'
-- offset 0
) z;
Without OFFSET 0 this query generates awful execution plans with hundreds (!) of subplans corresponding to the subqueries.
Number of subplans (most of them are the same) is equal number of occurrences of timedelta, timedelta_prev, ... columns in target aggregates.
OFFSET 0 reduce number of subplans to 4. And I expect that using LATERAL join can reduce it to two and even without "OFFSET 0" trick.
But in any case - it is very complex and unnatural way of expressing this not so complex query.
With ASOF join is can be written much simpler.
Also Postgres is implementing this query using nested loop with index scan, which is definitely not the most efficient strategy.
The best way to implement ASOF join is to use something like mergejoin. Usually there are indexes for both timeseries, so what we need is to merge two ordered sets using ASOF join rules.
It will require minimal changes in SQL syntax, just adding ASOF keyword seems to be enough:
select * from Trades ASOF JOIN EqTrades USING (underlyingSym,time);
It seems to me that adding ASOF joins should not require huge amount of work and can be done with minimal number of changes in executor and optimizer.
But may be there are some problems/challenges which I do not realize now?
Such kind of join can be useful when we need to associate two timeseries. It is quite popular in trading:
//join the eq price and size for the op trade time
a::aj[`underlyingSym`time;select time, underlyingSym, sym, putorcall, ex, cond, price, seqnum, contracts, contractsTraded from t;eqtrades];
...and not only. Below is one example of how people now manually coding ASOF join:
select
count(*),
count(*)
filter (where timedelta_prev < -30),
count(*)
filter (where ride_prev = ride_next),
... -- many different aggregates
from
(
select
p.provider_id,
p.ts,
(
select extract(epoch from t.ts - p.ts)
from providers_positions t
where p.provider_id = t.provider_id and t.ts < p.ts and t.source = 'gps'
order by t.ts desc
limit 1
) as timedelta_prev,
(
select extract(epoch from t.ts - p.ts)
from providers_positions t
where p.provider_id = t.provider_id and t.ts > p.ts and t.source = 'gps'
order by t.ts
limit 1
) as timedelta,
(
select ride_id
from providers_positions t
where p.provider_id = t.provider_id and t.ts < p.ts and t.source = 'gps'
order by t.ts desc
limit 1
) as ride_prev,
(
select ride_id
from providers_positions t
where p.provider_id = t.provider_id and t.ts > p.ts and t.source = 'gps'
order by t.ts
limit 1
) as ride_next
from (
select
provider_id,
ts,
event_name
from
lvl2_681_parsed p
) p
where
p.event_name = 'GPS signal restored'
-- offset 0
) z;
Without OFFSET 0 this query generates awful execution plans with hundreds (!) of subplans corresponding to the subqueries.
Number of subplans (most of them are the same) is equal number of occurrences of timedelta, timedelta_prev, ... columns in target aggregates.
OFFSET 0 reduce number of subplans to 4. And I expect that using LATERAL join can reduce it to two and even without "OFFSET 0" trick.
But in any case - it is very complex and unnatural way of expressing this not so complex query.
With ASOF join is can be written much simpler.
Also Postgres is implementing this query using nested loop with index scan, which is definitely not the most efficient strategy.
The best way to implement ASOF join is to use something like mergejoin. Usually there are indexes for both timeseries, so what we need is to merge two ordered sets using ASOF join rules.
It will require minimal changes in SQL syntax, just adding ASOF keyword seems to be enough:
select * from Trades ASOF JOIN EqTrades USING (underlyingSym,time);
It seems to me that adding ASOF joins should not require huge amount of work and can be done with minimal number of changes in executor and optimizer.
But may be there are some problems/challenges which I do not realize now?
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Fri, Jun 16, 2017 at 4:20 AM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > I wonder if there were some discussion/attempts to add ASOF join to Postgres > (sorry, may be there is better term for it, I am refereeing KDB definition: > http://code.kx.com/wiki/Reference/aj ). Interesting idea. Also in Pandas: http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.merge_asof.html#pandas.merge_asof I suppose you could write a function that pulls tuples out of a bunch of cursors and zips them together like this, as a kind of hand-coded special merge join "except that we match on nearest key rather than equal keys" (as they put it). I've written code like this before in a trading context, where we called that 'previous tick interpolation', and in a scientific context where other kinds of interpolation were called for (so not really matching a tuple but synthesising one if no exact match). If you view the former case as a kind of degenerate case of interpolation then it doesn't feel like a "join" as we know it, but clearly it is. I had never considered before that such things might belong inside the database as a kind of join operator. -- Thomas Munro http://www.enterprisedb.com
On Fri, Jun 16, 2017 at 11:51:34AM +1200, Thomas Munro wrote: > On Fri, Jun 16, 2017 at 4:20 AM, Konstantin Knizhnik > <k.knizhnik@postgrespro.ru> wrote: > > I wonder if there were some discussion/attempts to add ASOF join to Postgres > > (sorry, may be there is better term for it, I am refereeing KDB definition: > > http://code.kx.com/wiki/Reference/aj ). > > Interesting idea. Also in Pandas: > > http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.merge_asof.html#pandas.merge_asof > > I suppose you could write a function that pulls tuples out of a bunch > of cursors and zips them together like this, as a kind of hand-coded > special merge join "except that we match on nearest key rather than > equal keys" (as they put it). > > I've written code like this before in a trading context, where we > called that 'previous tick interpolation', and in a scientific context > where other kinds of interpolation were called for (so not really > matching a tuple but synthesising one if no exact match). If you view > the former case as a kind of degenerate case of interpolation then it > doesn't feel like a "join" as we know it, but clearly it is. I had > never considered before that such things might belong inside the > database as a kind of join operator. If you turn your head sideways, it's very similar to the range merge join Jeff Davis proposed. https://commitfest.postgresql.org/14/1106/ Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On 16.06.2017 19:07, David Fetter wrote: > On Fri, Jun 16, 2017 at 11:51:34AM +1200, Thomas Munro wrote: >> On Fri, Jun 16, 2017 at 4:20 AM, Konstantin Knizhnik >> <k.knizhnik@postgrespro.ru> wrote: >>> I wonder if there were some discussion/attempts to add ASOF join to Postgres >>> (sorry, may be there is better term for it, I am refereeing KDB definition: >>> http://code.kx.com/wiki/Reference/aj ). >> Interesting idea. Also in Pandas: >> >> http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.merge_asof.html#pandas.merge_asof I attached simple patch adding ASOF join to Postgres. Right now it support only outer join and requires USING clause (consequently it is not possible to join two tables which joi keys has different names. May be it is also possible to support ON clause with condition written like o.k1 = i.k2 AND o.k2 = i.k2 AND ... AND o.kN >= i.kN But such notation can be confusing, because join result includes only one matching inner record with kN smaller or equal than kN of outer record and not all such records. As alternative we can add specia If people fin such construction really useful, I will continue work on it. >> >> I suppose you could write a function that pulls tuples out of a bunch >> of cursors and zips them together like this, as a kind of hand-coded >> special merge join "except that we match on nearest key rather than >> equal keys" (as they put it). >> >> I've written code like this before in a trading context, where we >> called that 'previous tick interpolation', and in a scientific context >> where other kinds of interpolation were called for (so not really >> matching a tuple but synthesising one if no exact match). If you view >> the former case as a kind of degenerate case of interpolation then it >> doesn't feel like a "join" as we know it, but clearly it is. I had >> never considered before that such things might belong inside the >> database as a kind of join operator. > If you turn your head sideways, it's very similar to the range merge > join Jeff Davis proposed. https://commitfest.postgresql.org/14/1106/ May be, but I do not understand how to limit result to contain exactly one (last) inner tuple for each outer tuple. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
On Mon, Jun 19, 2017 at 11:57 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > I attached simple patch adding ASOF join to Postgres. Right now it support > only outer join and requires USING clause (consequently it is not possible > to join two tables which joi keys has different names. May be it is also > possible to support ON clause with condition written like o.k1 = i.k2 AND > o.k2 = i.k2 AND ... AND o.kN >= i.kN > But such notation can be confusing, because join result includes only one > matching inner record with kN smaller or equal than kN of outer record and > not all such records. > As alternative we can add specia Hmm. Yeah, I see the notational problem. It's hard to come up with a new syntax that has SQL nature. What if... we didn't use a new syntax at all, but recognised existing queries that are executable with this strategy? Queries like this: WITH ticks(time, price) AS (VALUES ('2017-07-20 12:00:00'::timestamptz, 100.00), ('2017-07-21 11:00:00'::timestamptz,150.00)), times(time) AS (VALUES ('2017-07-19 12:00:00'::timestamptz), ('2017-07-2012:00:00'::timestamptz), ('2017-07-21 12:00:00'::timestamptz), ('2017-07-22 12:00:00'::timestamptz)) SELECT times.time, previous_tick.price FROM times LEFT JOIN LATERAL (SELECT * FROM ticks WHERE ticks.time<= times.time ORDER BY ticks.time DESC LIMIT 1) previous_tick ON trueORDER BY times.time; time | price ------------------------+--------2017-07-19 12:00:00+12 |2017-07-20 12:00:00+12 | 100.002017-07-21 12:00:00+12 | 150.002017-07-2212:00:00+12 | 150.00 (4 rows) I haven't used LATERAL much myself but I've noticed that it's often used to express this type of thing. "Get me the latest ... as of time ...". It'd a bit like the way we recognise EXISTS (...) as a semi-join and execute it with a join operator instead of having a SEMI JOIN syntax. On the other hand it's a bit more long winded, extreme and probably quite niche. -- Thomas Munro http://www.enterprisedb.com
On 21.06.2017 11:00, Thomas Munro wrote: > Hmm. Yeah, I see the notational problem. It's hard to come up with a > new syntax that has SQL nature. What if... we didn't use a new syntax > at all, but recognised existing queries that are executable with this > strategy? Queries like this: > > WITH ticks(time, price) AS > (VALUES ('2017-07-20 12:00:00'::timestamptz, 100.00), > ('2017-07-21 11:00:00'::timestamptz, 150.00)), > times(time) AS > (VALUES ('2017-07-19 12:00:00'::timestamptz), > ('2017-07-20 12:00:00'::timestamptz), > ('2017-07-21 12:00:00'::timestamptz), > ('2017-07-22 12:00:00'::timestamptz)) > > SELECT times.time, previous_tick.price > FROM times > LEFT JOIN LATERAL (SELECT * FROM ticks > WHERE ticks.time <= times.time > ORDER BY ticks.time DESC LIMIT 1) previous_tick ON true > ORDER BY times.time; > > time | price > ------------------------+-------- > 2017-07-19 12:00:00+12 | > 2017-07-20 12:00:00+12 | 100.00 > 2017-07-21 12:00:00+12 | 150.00 > 2017-07-22 12:00:00+12 | 150.00 > (4 rows) > > I haven't used LATERAL much myself but I've noticed that it's often > used to express this type of thing. "Get me the latest ... as of time > ...". > > It'd a bit like the way we recognise EXISTS (...) as a semi-join and > execute it with a join operator instead of having a SEMI JOIN syntax. > On the other hand it's a bit more long winded, extreme and probably > quite niche. Thank you for this idea. I agree that it is the best way of implementing ASOF join - just as optimization of standard SQL query. But do you think that still it will be good idea to extend SQL syntax with ASOF JOIN ... USING ... clause? It will significantly simplify writing queries like above and IMHO doesn't introduce some confusions with standard SQL syntax. My primary idea of suggesting ASOF join for Postgres was not just building more efficient plan (using merge join instead of nested loop) but also simplifying writing of such queries. Or do you think that nobody will be interested in non-standard SQL extensions? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Mon, Jun 19, 2017 at 11:57 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > On 16.06.2017 19:07, David Fetter wrote: >> If you turn your head sideways, it's very similar to the range merge >> join Jeff Davis proposed. https://commitfest.postgresql.org/14/1106/ > > May be, but I do not understand how to limit result to contain exactly one > (last) inner tuple for each outer tuple. Yeah, it's somehow related but it's not the same thing. I guess you can think of the keys in the ASOF case as modelling range starts, with range ends implied by the record with next higher/lower key. Concretely, if every 'tick' row from my example in a nearby message had a time but also an end time to be set when a new tick is inserted so that each tick row had the complete effective time range for that tick, then you could rewrite the query as "get me the tick whose effective time range overlaps with each value of times.time" and get a nice range merge join with Jeff's patch. That sort of thing might be very useful for SQL:2011 temporal query-style stuff, but it's not what Konstantin wants to do: he wants to merge time series where the effective range is not explicitly stored in every row. -- Thomas Munro http://www.enterprisedb.com
On Wed, Jun 21, 2017 at 9:46 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > Thank you for this idea. I agree that it is the best way of implementing > ASOF join - just as optimization of standard SQL query. Great. I think this part definitely has potential. > But do you think that still it will be good idea to extend SQL syntax with > ASOF JOIN ... USING ... clause? It will significantly simplify writing > queries like above > and IMHO doesn't introduce some confusions with standard SQL syntax. My > primary idea of suggesting ASOF join for Postgres was not just building > more efficient plan (using merge join instead of nested loop) but also > simplifying writing of such queries. Or do you think that nobody will be > interested in non-standard SQL extensions? I can see the appeal, but I expect it to be difficult to convince the project to accept a non-standard syntax for a niche use case that can be expressed already. Q is super terse and designed for time series data. SQL is neither of those things. Some first reactions to the syntaxes you mentioned: 1. times LEFT ASOF JOIN ticks ON ticks.time <= times.time 2. times LEFT ASOF JOIN ticks USING (time) 3. times LEFT ASOF JOIN ticks USING (ticks.time, times.time) The USING ideas don't seem to be general enough, because there is no place to say whether to use a lower or higher value if there is no match, or did I miss something? Relying on an ORDER BY clause in the query to control the meaning of the join seems too weird, and making it always (for example) <= would be an arbitrary limitation. The first syntax at least has enough information: when you say one of <, >, <=, >= you also imply the search order. I'm not sure if there are any problems with that, perhaps when combined with other quals. The equivalent nearly-standard syntax is definitely quite verbose, but it has the merit of being absolutely explicit about which row from 'ticks' will be selected: times LEFT JOIN LATERAL (SELECT * FROM ticks WHERE ticks.time <= times.time ORDER BY ticks.time DESC LIMIT 1) x ON true -- Thomas Munro http://www.enterprisedb.com