Обсуждение: select from grouped data

Поиск
Список
Период
Сортировка

select from grouped data

От
Roger Mason
Дата:
Hello,

I have written this function to process some multiline text data into a
table with both text & numeric columns:

CREATE OR REPLACE FUNCTION get_final_energy ()
  RETURNS TABLE (
    jid text,
    "timestamp" text,
    scf integer,
    energy double precision
  )
  AS $$
  WITH a AS (
    SELECT
      jid,
      regexp_split_to_table(totenergy_out, '\n') AS teo
    FROM
      results
),
b AS (
  SELECT
    results.jid,
    results. "timestamp",
    cast(
      CASE WHEN split_part(a.teo, ' ', 2) = '' THEN
        '0'
      ELSE
        split_part(a.teo, ' ', 2)
      END AS integer) AS scf,
    cast(
      CASE WHEN split_part(a.teo, ' ', 3) = '' THEN
        '0.0'
      ELSE
        split_part(a.teo, ' ', 3)
      END AS double precision) AS energy
  FROM
    results,
    a
  WHERE
    results.jid = a.jid
  GROUP BY
    results.jid,
    results. "timestamp",
    a.teo
  --HAVING
    -- scf = max(scf)
    ORDER BY
         timestamp ASC,
         scf DESC
)
SELECT
  *
FROM
  b;

$$
LANGUAGE sql;

The output looks like:

   jid     |    timestamp    | scf |     energy     
------------+-----------------+-----+----------------
 1250_1     | 20210805-114634 |  18 | -1316.43700819
 1250_1     | 20210805-114634 |  17 | -1316.43700825
 1250_1     | 20210805-114634 |  16 |  -1316.4370097
 1250_1     | 20210805-114634 |  15 | -1316.43700991
 1250_1     | 20210805-114634 |  14 | -1316.43699775
 1250_1     | 20210805-114634 |  13 | -1316.43699117
 1250_1     | 20210805-114634 |  12 | -1316.43750771
 1250_1     | 20210805-114634 |  11 | -1316.43805358
 1250_1     | 20210805-114634 |  10 | -1316.43857192
 1250_1     | 20210805-114634 |   9 | -1316.43070942
 1251_1     | 20210806-062539 |  18 | -1316.43700819
 1251_1     | 20210806-062539 |  17 | -1316.43700826
....

What I want is to get (for each group) the energy corresponding to the
maximum value of scf.

I appreciate any help,
Thanks,
Roger



Re: select from grouped data

От
"David G. Johnston"
Дата:

The output looks like:

   jid     |    timestamp    | scf |     energy     
------------+-----------------+-----+----------------
 1250_1     | 20210805-114634 |  18 | -1316.43700819
 1250_1     | 20210805-114634 |  17 | -1316.43700825
 1250_1     | 20210805-114634 |  16 |  -1316.4370097
 1250_1     | 20210805-114634 |  15 | -1316.43700991
 1250_1     | 20210805-114634 |  14 | -1316.43699775
 1250_1     | 20210805-114634 |  13 | -1316.43699117
 1250_1     | 20210805-114634 |  12 | -1316.43750771
 1250_1     | 20210805-114634 |  11 | -1316.43805358
 1250_1     | 20210805-114634 |  10 | -1316.43857192
 1250_1     | 20210805-114634 |   9 | -1316.43070942
 1251_1     | 20210806-062539 |  18 | -1316.43700819
 1251_1     | 20210806-062539 |  17 | -1316.43700826
....

What I want is to get (for each group) the energy corresponding to the
maximum value of scf.


SELECT DISTINCT ON (jid) jid, timestamp, scf, energy [...] ORDER BY jid, scf DESC

David J.

Re: select from grouped data

От
Roger Mason
Дата:
David G. Johnston writes:

> SELECT DISTINCT ON (jid) jid, timestamp, scf, energy [...] ORDER BY jid,
> scf DESC

Got it.

Many thanks,
Roger



Re: select from grouped data

От
Stephen Frost
Дата:
Greetings,

* Roger Mason (rmason@mun.ca) wrote:
> I have written this function to process some multiline text data into a
> table with both text & numeric columns:
>
> CREATE OR REPLACE FUNCTION get_final_energy ()
>   RETURNS TABLE (
>     jid text,
>     "timestamp" text,
>     scf integer,
>     energy double precision
>   )
>   AS $$
>   WITH a AS (
>     SELECT
>       jid,
>       regexp_split_to_table(totenergy_out, '\n') AS teo
>     FROM
>       results
> ),
> b AS (
>   SELECT
>     results.jid,
>     results. "timestamp",
>     cast(
>       CASE WHEN split_part(a.teo, ' ', 2) = '' THEN
>         '0'
>       ELSE
>         split_part(a.teo, ' ', 2)
>       END AS integer) AS scf,
>     cast(
>       CASE WHEN split_part(a.teo, ' ', 3) = '' THEN
>         '0.0'
>       ELSE
>         split_part(a.teo, ' ', 3)
>       END AS double precision) AS energy
>   FROM
>     results,
>     a
>   WHERE
>     results.jid = a.jid
>   GROUP BY
>     results.jid,
>     results. "timestamp",
>     a.teo
>   --HAVING
>     -- scf = max(scf)
>     ORDER BY
>          timestamp ASC,
>          scf DESC
> )
> SELECT
>   *
> FROM
>   b;
>
> $$
> LANGUAGE sql;
>
> The output looks like:
>
>    jid     |    timestamp    | scf |     energy
> ------------+-----------------+-----+----------------
>  1250_1     | 20210805-114634 |  18 | -1316.43700819
>  1250_1     | 20210805-114634 |  17 | -1316.43700825
>  1250_1     | 20210805-114634 |  16 |  -1316.4370097
>  1250_1     | 20210805-114634 |  15 | -1316.43700991
>  1250_1     | 20210805-114634 |  14 | -1316.43699775
>  1250_1     | 20210805-114634 |  13 | -1316.43699117
>  1250_1     | 20210805-114634 |  12 | -1316.43750771
>  1250_1     | 20210805-114634 |  11 | -1316.43805358
>  1250_1     | 20210805-114634 |  10 | -1316.43857192
>  1250_1     | 20210805-114634 |   9 | -1316.43070942
>  1251_1     | 20210806-062539 |  18 | -1316.43700819
>  1251_1     | 20210806-062539 |  17 | -1316.43700826
> ....
>
> What I want is to get (for each group) the energy corresponding to the
> maximum value of scf.

* David G. Johnston (david.g.johnston@gmail.com) wrote:
> >
> >
> > The output looks like:
> >
> >    jid     |    timestamp    | scf |     energy
> > ------------+-----------------+-----+----------------
> >  1250_1     | 20210805-114634 |  18 | -1316.43700819
> >  1250_1     | 20210805-114634 |  17 | -1316.43700825
> >  1250_1     | 20210805-114634 |  16 |  -1316.4370097
> >  1250_1     | 20210805-114634 |  15 | -1316.43700991
> >  1250_1     | 20210805-114634 |  14 | -1316.43699775
> >  1250_1     | 20210805-114634 |  13 | -1316.43699117
> >  1250_1     | 20210805-114634 |  12 | -1316.43750771
> >  1250_1     | 20210805-114634 |  11 | -1316.43805358
> >  1250_1     | 20210805-114634 |  10 | -1316.43857192
> >  1250_1     | 20210805-114634 |   9 | -1316.43070942
> >  1251_1     | 20210806-062539 |  18 | -1316.43700819
> >  1251_1     | 20210806-062539 |  17 | -1316.43700826
> > ....
> >
> > What I want is to get (for each group) the energy corresponding to the
> > maximum value of scf.
> >
> >
> SELECT DISTINCT ON (jid) jid, timestamp, scf, energy [...] ORDER BY jid,
> scf DESC

While this works, it's generally better to use a LATERAL join as that's
part of the SQL standard while DISTINCT ON isn't.  Using a LATERAL join
also would allow you to have multiple rows (top-N) if you wanted.  You'd
do that using:

WITH jids AS (SELECT jid FROM results GROUP BY jid)
SELECT jids.jid, t.ts, t.scf, t.energy
FROM jids CROSS JOIN LATERAL
 (SELECT ts, scf, energy FROM results
  WHERE results.jid = jids.jid
  ORDER BY scf DESC
  LIMIT 1) AS t
;

A couple of notes on this: don't name a column "timestamp" and when it's
a timestamp, use the 'timestamptz' data type, not text.  Your jids sure
look like they should just be numbers instead of text too.  If you
already have a distinct set of jids somewhere (like in another table),
you could use that as the source table instead of the CTE that I'm using
above.

Thanks,

Stephen

Вложения

Re: select from grouped data

От
James Keener
Дата:
I don't know that "generally better" is the correct description. There are a lot of things not covered by the SQL standard and it would be crazy not to make use of them, e.g. insert on conflict and distinct on, especially when they make the query more readable and more correct. Worrying about sql portability is almost always time not well spent and an invitation to do things not as well as they can be.

Jim

On August 25, 2021 4:42:56 PM EDT, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,

* Roger Mason (rmason@mun.ca) wrote:
I have written this function to process some multiline text data into a
table with both text & numeric columns:

CREATE OR REPLACE FUNCTION get_final_energy ()
RETURNS TABLE (
jid text,
"timestamp" text,
scf integer,
energy double precision
)
AS $$
WITH a AS (
SELECT
jid,
regexp_split_to_table(totenergy_out, '\n') AS teo
FROM
results
),
b AS (
SELECT
results.jid,
results. "timestamp",
cast(
CASE WHEN split_part(a.teo, ' ', 2) = '' THEN
'0'
ELSE
split_part(a.teo, ' ', 2)
END AS integer) AS scf,
cast(
CASE WHEN split_part(a.teo, ' ', 3) = '' THEN
'0.0'
ELSE
split_part(a.teo, ' ', 3)
END AS double precision) AS energy
FROM
results,
a
WHERE
results.jid = a.jid
GROUP BY
results.jid,
results. "timestamp",
a.teo
--HAVING
-- scf = max(scf)
ORDER BY
timestamp ASC,
scf DESC
)
SELECT
*
FROM
b;

$$
LANGUAGE sql;

The output looks like:

jid | timestamp | scf | energy
------------+-----------------+-----+----------------
1250_1 | 20210805-114634 | 18 | -1316.43700819
1250_1 | 20210805-114634 | 17 | -1316.43700825
1250_1 | 20210805-114634 | 16 | -1316.4370097
1250_1 | 20210805-114634 | 15 | -1316.43700991
1250_1 | 20210805-114634 | 14 | -1316.43699775
1250_1 | 20210805-114634 | 13 | -1316.43699117
1250_1 | 20210805-114634 | 12 | -1316.43750771
1250_1 | 20210805-114634 | 11 | -1316.43805358
1250_1 | 20210805-114634 | 10 | -1316.43857192
1250_1 | 20210805-114634 | 9 | -1316.43070942
1251_1 | 20210806-062539 | 18 | -1316.43700819
1251_1 | 20210806-062539 | 17 | -1316.43700826
....

What I want is to get (for each group) the energy corresponding to the
maximum value of scf.

* David G. Johnston (david.g.johnston@gmail.com) wrote:


The output looks like:

jid | timestamp | scf | energy
------------+-----------------+-----+----------------
1250_1 | 20210805-114634 | 18 | -1316.43700819
1250_1 | 20210805-114634 | 17 | -1316.43700825
1250_1 | 20210805-114634 | 16 | -1316.4370097
1250_1 | 20210805-114634 | 15 | -1316.43700991
1250_1 | 20210805-114634 | 14 | -1316.43699775
1250_1 | 20210805-114634 | 13 | -1316.43699117
1250_1 | 20210805-114634 | 12 | -1316.43750771
1250_1 | 20210805-114634 | 11 | -1316.43805358
1250_1 | 20210805-114634 | 10 | -1316.43857192
1250_1 | 20210805-114634 | 9 | -1316.43070942
1251_1 | 20210806-062539 | 18 | -1316.43700819
1251_1 | 20210806-062539 | 17 | -1316.43700826
....

What I want is to get (for each group) the energy corresponding to the
maximum value of scf.


SELECT DISTINCT ON (jid) jid, timestamp, scf, energy [...] ORDER BY jid,
scf DESC

While this works, it's generally better to use a LATERAL join as that's
part of the SQL standard while DISTINCT ON isn't. Using a LATERAL join
also would allow you to have multiple rows (top-N) if you wanted. You'd
do that using:

WITH jids AS (SELECT jid FROM results GROUP BY jid)
SELECT jids.jid, t.ts, t.scf, t.energy
FROM jids CROSS JOIN LATERAL
(SELECT ts, scf, energy FROM results
WHERE results.jid = jids.jid
ORDER BY scf DESC
LIMIT 1) AS t
;

A couple of notes on this: don't name a column "timestamp" and when it's
a timestamp, use the 'timestamptz' data type, not text. Your jids sure
look like they should just be numbers instead of text too. If you
already have a distinct set of jids somewhere (like in another table),
you could use that as the source table instead of the CTE that I'm using
above.

Thanks,

Stephen

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: select from grouped data

От
Stephen Frost
Дата:
Greetings,

We generally prefer to reply in-line on these threads, as I’ve done below. Thanks.

On Wed, Aug 25, 2021 at 17:37 James Keener <jim@jimkeener.com> wrote:
I don't know that "generally better" is the correct description. There are a lot of things not covered by the SQL standard and it would be crazy not to make use of them, e.g. insert on conflict and distinct on, especially when they make the query more readable and more correct. Worrying about sql portability is almost always time not well spent and an invitation to do things not as well as they can be.

It’s generally better as it follows the SQL standard and is more flexible as it allows you to get top-N, as I said. DISTINCT ON has always been a hack and its use should be discouraged at this point.

When an UPSERT capability exists in the actual standard then it’s use should be promoted over other hacks that various vendors have implemented also.

Thanks again,

Stephen

Re: select from grouped data

От
Roger Mason
Дата:
Stephen Frost writes:

>
> It’s generally better as it follows the SQL standard and is more flexible
> as it allows you to get top-N, as I said. DISTINCT ON has always been a
> hack and its use should be discouraged at this point.
>
> When an UPSERT capability exists in the actual standard then it’s use
> should be promoted over other hacks that various vendors have implemented
> also.

OK, thanks, I've filed this discussion away for future reference.
Meanwhile I'm having trouble getting this function to work as a
trigger.  I'll start another thread on that.

Thanks again.

Roger