Обсуждение: MIN/MAX functions for a record

Поиск
Список
Период
Сортировка

MIN/MAX functions for a record

От
Viliam Ďurina
Дата:
In my queries I often need to do MIN/MAX for tuples, for example:

  SELECT MAX(row(year, month)) 
  FROM (VALUES(2025, 1), (2024,2)) x(year, month);

This query throws:

    ERROR: function max(record) does not exist

In this case you can replace it with `MAX((year||'-'||month||'-1')::date)`. However in my case I have an event table with `event_time` and `text` columns, I'm grouping that table by some key and want to have the text for the newest event. I would do `MAX(ROW(event_time, text)).text`. Workarounds for this are clumsy, e.g. with a subquery with LIMIT 1.

The lack of this feature is kind of unexpected, because the `>` operator or `GREATEST` function are defined for records:

    SELECT 
        GREATEST((2025, 1), (2024, 2)), 
        (2025, 1) > (2024, 2)

Was this ever discussed or is there something preventing the implementation?

Viliam

Re: MIN/MAX functions for a record

От
Aleksander Alekseev
Дата:
Hi,

> In my queries I often need to do MIN/MAX for tuples, for example:
>
>   SELECT MAX(row(year, month))
>   FROM (VALUES(2025, 1), (2024,2)) x(year, month);
>
> This query throws:
>
>     ERROR: function max(record) does not exist
>
> In this case you can replace it with `MAX((year||'-'||month||'-1')::date)`. However in my case I have an event table
with`event_time` and `text` columns, I'm grouping that table by some key and want to have the text for the newest
event.I would do `MAX(ROW(event_time, text)).text`. Workarounds for this are clumsy, e.g. with a subquery with LIMIT 1. 
>
> The lack of this feature is kind of unexpected, because the `>` operator or `GREATEST` function are defined for
records:
>
>     SELECT
>         GREATEST((2025, 1), (2024, 2)),
>         (2025, 1) > (2024, 2)
>
> Was this ever discussed or is there something preventing the implementation?

I believe it would be challenging to implement max(record) that would
work reasonably well in a general case.

What if, for instance, one of the columns is JOSNB or XML? Not to
mention the fact that Postgres supports user-defined types which don't
necessarily have a reasonable order. Take a point in a 2D or 3D space
as an example. On top of that I doubt that the proposed query will
perform well since I don't see how it could benefit from using
indexes. I don't claim that this is necessarily true in your case but
generally one could argue that the wrong schema is used here and
instead of (year, month) pair a table should have a date/timestamp(tz)
column.

Personally I would choose format() function [1] in cases like this in
order to play it safe. Assuming of course that the table is small and
the query is not executed often.

[1]: https://www.postgresql.org/docs/current/functions-string.html

--
Best regards,
Aleksander Alekseev



Re: MIN/MAX functions for a record

От
Tom Lane
Дата:
Aleksander Alekseev <aleksander@timescale.com> writes:
>> In my queries I often need to do MIN/MAX for tuples, for example:
>> SELECT MAX(row(year, month))
>> FROM (VALUES(2025, 1), (2024,2)) x(year, month);
>> This query throws:
>> ERROR: function max(record) does not exist
>> Was this ever discussed or is there something preventing the implementation?

> I believe it would be challenging to implement max(record) that would
> work reasonably well in a general case.

As long as you define it as "works the same way record comparison
does", ie base it on record_cmp(), I don't think it would be much
more than a finger exercise [*].  And why would you want it to act
any differently from record_cmp()?  Those semantics have been
established for a long time.

            regards, tom lane

[*] Although conceivably there are some challenges in getting
record_cmp's caching logic to work in the context of an aggregate.



Re: MIN/MAX functions for a record

От
Viliam Ďurina
Дата:
Exactly Tom, I see no fundamental problem for it not to be implemented, since comparison operator is already implemented. In fact, MIN/MAX should work for all types for which comparison operator is defined.

Regarding index support, there should not be an issue if the index is defined for the record (e.g. `CREATE INDEX ON my_table(ROW(field_a, field_b))`). However such indexes seem not to be supported. Whether a composite index is compatible with a record created on the indexed fields in every edge case I'm not sure...

Alexander, rewriting the year-month example is easy, but how would you rewrite this query?

CREATE TABLE events(event_time TIMESTAMP, message VARCHAR, user_id VARCHAR);

You want a newest message for each user. It's easy with MAX(record):

SELECT user_id, MAX(ROW(event_time, message)).message
FROM events
GROUP BY user_id;

One option is to rewrite to a subquery with LIMIT 1

SELECT user_id, (SELECT message FROM events e2 WHERE e1.user_id=e2.user_id ORDER BY event_time DESC LIMIT 1)
FROM events e1
GROUP BY user_id;

If your query already has multiple levels of grouping, multiple joins, UNIONs etc., it gets much more complex. I also wonder if the optimizer would pick the same plan as it would be if the MAX(record) is supported.

Viliam

On Fri, Mar 22, 2024 at 4:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Aleksander Alekseev <aleksander@timescale.com> writes:
>> In my queries I often need to do MIN/MAX for tuples, for example:
>> SELECT MAX(row(year, month))
>> FROM (VALUES(2025, 1), (2024,2)) x(year, month);
>> This query throws:
>> ERROR: function max(record) does not exist
>> Was this ever discussed or is there something preventing the implementation?

> I believe it would be challenging to implement max(record) that would
> work reasonably well in a general case.

As long as you define it as "works the same way record comparison
does", ie base it on record_cmp(), I don't think it would be much
more than a finger exercise [*].  And why would you want it to act
any differently from record_cmp()?  Those semantics have been
established for a long time.

                        regards, tom lane

[*] Although conceivably there are some challenges in getting
record_cmp's caching logic to work in the context of an aggregate.

Re: MIN/MAX functions for a record

От
Aleksander Alekseev
Дата:
Hi,

> Exactly Tom, I see no fundamental problem for it not to be implemented, since comparison operator is already
implemented.In fact, MIN/MAX should work for all types for which comparison operator is defined.
 

On second thought, this should work reasonably well.

PFA a WIP patch. At this point it implements only MAX(record), no MIN, no tests:

```
=# SELECT MAX(row(year, month)) FROM (VALUES(2025, 1), (2024,2)) x(year, month);
   max
----------
 (2025,1)
```

One thing I'm not 100% sure of is whether record_larger() should make
a copy of its arguments or the current implementation is safe.

-- 
Best regards,
Aleksander Alekseev

Вложения

Re: MIN/MAX functions for a record

От
Tom Lane
Дата:
Aleksander Alekseev <aleksander@timescale.com> writes:
> One thing I'm not 100% sure of is whether record_larger() should make
> a copy of its arguments or the current implementation is safe.

I don't see any copying happening in, say, text_larger or
numeric_larger, so this shouldn't need to either.

Personally I'd write "record_cmp(fcinfo) > 0" rather than indirecting
through record_gt.  The way you have it is not strictly correct anyhow:
you're cheating by not using DirectFunctionCall.

Also, given that you're passing the fcinfo, there's no need
to extract the arguments from it before that call.  So it
seems to me that code like

    if (record_cmp(fcinfo) > 0)
        PG_RETURN_HEAPTUPLEHEADER(PG_GETARG_HEAPTUPLEHEADER(0));
    else
        PG_RETURN_HEAPTUPLEHEADER(PG_GETARG_HEAPTUPLEHEADER(1));

should do, and possibly save one useless detoast step.  Or you could
do

    if (record_cmp(fcinfo) > 0)
        PG_RETURN_DATUM(PG_GETARG_DATUM(0));
    else
        PG_RETURN_DATUM(PG_GETARG_DATUM(1));

because really there's no point in detoasting at all.

            regards, tom lane



Re: MIN/MAX functions for a record

От
Aleksander Alekseev
Дата:
Hi,

> I don't see any copying happening in, say, text_larger or
> numeric_larger, so this shouldn't need to either.
>
> Personally I'd write "record_cmp(fcinfo) > 0" rather than indirecting
> through record_gt.  The way you have it is not strictly correct anyhow:
> you're cheating by not using DirectFunctionCall.
>
> Also, given that you're passing the fcinfo, there's no need
> to extract the arguments from it before that call.  So it
> seems to me that code like
>
>         if (record_cmp(fcinfo) > 0)
>                 PG_RETURN_HEAPTUPLEHEADER(PG_GETARG_HEAPTUPLEHEADER(0));
>         else
>                 PG_RETURN_HEAPTUPLEHEADER(PG_GETARG_HEAPTUPLEHEADER(1));
>
> should do, and possibly save one useless detoast step.  Or you could
> do
>
>         if (record_cmp(fcinfo) > 0)
>                 PG_RETURN_DATUM(PG_GETARG_DATUM(0));
>         else
>                 PG_RETURN_DATUM(PG_GETARG_DATUM(1));
>
> because really there's no point in detoasting at all.

Many thanks. Here is the corrected patch. Now it also includes MIN()
support and tests.

-- 
Best regards,
Aleksander Alekseev

Вложения