Обсуждение: Joining 1-minute data with 5-minute data

Поиск
Список
Период
Сортировка

Joining 1-minute data with 5-minute data

От
Stephen Froehlich
Дата:

I have a couple of relatively large tables, each with 100-500 million lines (at least in each monthly partition).

 

One has data every 1 minute, and the other has data every 5 minutes, and I’d like to be able to join them (i.e. with each minute in the 5-minute span rounded down to the beginning of that 5-minute interval).

 

I’m currently running PostgreSQL 11. An upgrade to 12 (for calculated fields) is possible but annoying at the moment. (i.e. I’ll do it if its worth it, but I’m otherwise planning on holding off until the Ubuntu 20.40LTS release for that upgrade process.)

 

What is the most efficient (i.e. performant) way to do that join?

  • Create an index for the 1-min table something like (trunc(time_stamp::int / 300) * 300)::timestamp with time zone
    • Is there a more efficient way to round to 5 minutes?
  • Encode the time stamp for the 5-min table as a tstzrange and create a gist index on that column?
  • Manually add a 5-minute rounded column to the 1-minute table and index that?
  • Something I have missed entirely?

 

Thanks,

Stephen

 


Stephen Froehlich
Sr. Strategist, CableLabs®


s.froehlich@cablelabs.com

Tel: +1 (303) 661-3708

Re: Joining 1-minute data with 5-minute data

От
Laurenz Albe
Дата:
On Mon, 2020-01-27 at 16:35 +0000, Stephen Froehlich wrote:
> I have a couple of relatively large tables, each with 100-500 million lines (at least in each monthly partition). 
> 
> One has data every 1 minute, and the other has data every 5 minutes, and I’d like to be able to
> join them (i.e. with each minute in the 5-minute span rounded down to the beginning of that 5-minute interval).
> 
> I’m currently running PostgreSQL 11. An upgrade to 12 (for calculated fields) is possible but annoying at the
moment.
> (i.e. I’ll do it if its worth it, but I’m otherwise planning on holding off until the Ubuntu 20.40LTS release for
thatupgrade process.)
 
> 
> What is the most efficient (i.e. performant) way to do that join?
> - Create an index for the 1-min table something like (trunc(time_stamp::int / 300) * 300)::timestamp with time zone
>   - Is there a more efficient way to round to 5 minutes?
> - Encode the time stamp for the 5-min table as a tstzrange and create a gist index on that column?
> - Manually add a 5-minute rounded column to the 1-minute table and index that?
> - Something I have missed entirely?

Depending on the number of rows required from each table, an index
may not be useful at all: with a hash join, indexes don't help.

You should make sure that the join condition looks like this:

 (expression with columns of the 1-minute table) =
 (expression with columns of the 5-minute table)

Otherwise, PostgreSQL can only use a nested loop join, which may
not be the best strategy.

Then experiment with indexes on the expressions in the join condition:
nested loop joins and merge joins can profit from them.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




RE: Joining 1-minute data with 5-minute data

От
Stephen Froehlich
Дата:
That's a solid plan ... thanks.

-----Original Message-----
From: Laurenz Albe <laurenz.albe@cybertec.at> 
Sent: Tuesday, January 28, 2020 1:44 AM
To: Stephen Froehlich <s.froehlich@cablelabs.com>; pgsql-novice@lists.postgresql.org
Subject: Re: Joining 1-minute data with 5-minute data

On Mon, 2020-01-27 at 16:35 +0000, Stephen Froehlich wrote:
> I have a couple of relatively large tables, each with 100-500 million lines (at least in each monthly partition). 
> 
> One has data every 1 minute, and the other has data every 5 minutes, 
> and I’d like to be able to join them (i.e. with each minute in the 5-minute span rounded down to the beginning of
that5-minute interval).
 
> 
> I’m currently running PostgreSQL 11. An upgrade to 12 (for calculated fields) is possible but annoying at the
moment.
> (i.e. I’ll do it if its worth it, but I’m otherwise planning on 
> holding off until the Ubuntu 20.40LTS release for that upgrade 
> process.)
> 
> What is the most efficient (i.e. performant) way to do that join?
> - Create an index for the 1-min table something like (trunc(time_stamp::int / 300) * 300)::timestamp with time zone
>   - Is there a more efficient way to round to 5 minutes?
> - Encode the time stamp for the 5-min table as a tstzrange and create a gist index on that column?
> - Manually add a 5-minute rounded column to the 1-minute table and index that?
> - Something I have missed entirely?

Depending on the number of rows required from each table, an index may not be useful at all: with a hash join, indexes
don'thelp.
 

You should make sure that the join condition looks like this:

 (expression with columns of the 1-minute table) =  (expression with columns of the 5-minute table)

Otherwise, PostgreSQL can only use a nested loop join, which may not be the best strategy.

Then experiment with indexes on the expressions in the join condition:
nested loop joins and merge joins can profit from them.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com