Обсуждение: MAP syntax for arrays

Поиск
Список
Период
Сортировка

MAP syntax for arrays

От
Ildar Musin
Дата:
Hello hackers,

Recently I was working with sql arrays in postgres and it turned out
that postgres doesn't have such very convinient functional constructions
as map, reduce and filter. Currently to map function over array user has
to make a subquery like:

select u.* from
    my_table,
    lateral (
        select array_agg(lower(elem))
        from unnest(arr) as elem
    ) as u;

Which is not only inconvenient but not very efficient as well (see
'Demo' section below).

When I dug into the code I found that postgres already has the needed
infrastructure for implementing map for arrays; actually array coercing
already works that way (it basically maps cast function).

In the attached patch there is a simple map implementation which
introduces new expression type and syntax:

    MAP(<func_name> OVER <array_expression>)

For example:

    SELECT MAP(upper OVER array['one', 'two', 'three']::text[]);
        ?column?
    -----------------
     {ONE,TWO,THREE}
    (1 row)

This is probably not the most useful notation and it would be better to
have syntax for mapping arbitrary expressions over array, not just
function. I'm struggling to come up with a good idea of how it should
look like. It could look something like following:

    MAP(<expr> FOR <placeholder> IN <array_expressin>)

For instance:

    SELECT MAP(x*2 FOR x IN array[1, 2, 3]::int[]);

Looking forward for community's suggestions!

Demo
----

Here is a small comparison between map and unnest/aggregate ways for
per-element processing of arrays. Given a table with 1K rows which
contains single column of text[] type. Each array contains 5/10/100
elements.

create table my_table (arr text[]);
insert into my_table
    select array_agg(md5(random()::text))
    from generate_series(1, 1000) as rows,
         generate_series(1, 10) as elements
    group by rows;

There are two scripts for pgbench. One for 'map' syntax:

    select map(upper over arr) from my_table;

And one for unnest/aggregate:

    select u.* from my_table,
        lateral (
            select array_agg(upper(elem))
            from unnest(arr) as elem
        ) as u;

Results are:

  elements per array |  map (tps) | unnest/aggregate (tps)
--------------------+------------+------------------------
                   5 | 139.105359 |       74.434010
                  10 |  74.089743 |       43.622554
                 100 |   7.693000 |        5.325805

Apparently map is more efficient for small arrays. And as the size of
array increases the difference decreases.

I'll be glad to any input from the community. Thanks!

-- 
Ildar Musin
i.musin@postgrespro.ru

Вложения

Re: MAP syntax for arrays

От
Ashutosh Bapat
Дата:
On Fri, May 4, 2018 at 6:38 PM, Ildar Musin <i.musin@postgrespro.ru> wrote:
> Hello hackers,
>
> Recently I was working with sql arrays in postgres and it turned out
> that postgres doesn't have such very convinient functional constructions
> as map, reduce and filter. Currently to map function over array user has
> to make a subquery like:
>
> select u.* from
>         my_table,
>         lateral (
>                 select array_agg(lower(elem))
>                 from unnest(arr) as elem
>         ) as u;
>
> Which is not only inconvenient but not very efficient as well (see
> 'Demo' section below).

Is there a way we can improve unnest() and array_agg() to match the
performance you have specified by let's say optimizing the cases
specially when those two are used together. Identifying that may be
some work, but will not require introducing new syntax.

>
> When I dug into the code I found that postgres already has the needed
> infrastructure for implementing map for arrays; actually array coercing
> already works that way (it basically maps cast function).
>
> In the attached patch there is a simple map implementation which
> introduces new expression type and syntax:
>
>         MAP(<func_name> OVER <array_expression>)
>
> For example:
>
>         SELECT MAP(upper OVER array['one', 'two', 'three']::text[]);
>             ?column?
>         -----------------
>          {ONE,TWO,THREE}
>         (1 row)
>
> This is probably not the most useful notation and it would be better to
> have syntax for mapping arbitrary expressions over array, not just
> function. I'm struggling to come up with a good idea of how it should
> look like. It could look something like following:
>
>         MAP(<expr> FOR <placeholder> IN <array_expressin>)
>
> For instance:
>
>         SELECT MAP(x*2 FOR x IN array[1, 2, 3]::int[]);
>
> Looking forward for community's suggestions!

What if the expression has more than one variable, each mapping to a
different array? What if the arrays have different lengths or worse
different dimensions? This looks like the way SRFs used to work.

Instead of introducing a new syntax, is it possible to detect that
argument to a function is an array of the same type as the argument
and apply MAP automatically? In your example, upper(arr) would detect
that the input is an array of the same type as the scalar argument
type and do array_agg(upper(arr[id1], arr[id2], ...).

>
>  elements per array |  map (tps) | unnest/aggregate (tps)
> --------------------+------------+------------------------
>                   5 | 139.105359 |       74.434010
>                  10 |  74.089743 |       43.622554
>                 100 |   7.693000 |        5.325805
>
> Apparently map is more efficient for small arrays. And as the size of
> array increases the difference decreases.

I am afraid that the way difference is diminishing with increase in
the number of elements, unnest, array_agg combination might win for
large number of elements. Have you tried that?
If we try to improve unnest, array_agg combination for small array, we
will get consistent performance without any additional syntax.
Although, I admit that query involving unnest and array_agg is not
very readable.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: MAP syntax for arrays

От
Tom Lane
Дата:
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> Is there a way we can improve unnest() and array_agg() to match the
> performance you have specified by let's say optimizing the cases
> specially when those two are used together. Identifying that may be
> some work, but will not require introducing new syntax.

+1.  The first thing I thought on seeing this proposal was "I wonder
how long it will be before the SQL committee introduces some syntax
that uses the MAP keyword and breaks this".

ISTM the planner could be taught to notice the combination of unnest
and array_agg and produce a special output plan from that.

It is, however, fair to wonder whether this is worth our time to
optimize.  I've not noticed a lot of people complaining about
performance of this sort of thing, at least not since we fixed
array_agg to not be O(N^2).

            regards, tom lane


Re: MAP syntax for arrays

От
Ildar Musin
Дата:
Hello Tom, Ashutosh,

On 07.05.2018 18:16, Tom Lane wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
>> Is there a way we can improve unnest() and array_agg() to match
>> the performance you have specified by let's say optimizing the
>> cases specially when those two are used together. Identifying that
>> may be some work, but will not require introducing new syntax.
>
> +1.  The first thing I thought on seeing this proposal was "I wonder
> how long it will be before the SQL committee introduces some syntax
> that uses the MAP keyword and breaks this".
>
> ISTM the planner could be taught to notice the combination of unnest
> and array_agg and produce a special output plan from that.
>
> It is, however, fair to wonder whether this is worth our time to
> optimize.  I've not noticed a lot of people complaining about
> performance of this sort of thing, at least not since we fixed
> array_agg to not be O(N^2).

The main point of this patch was about convenience; the performance
thing came out later just as a side effect :) Many users are familiar
with "map/reduce/filter" concept that many languages (not only
functional ones) utilized. And my though was that it would be great to
have those in postgres too.

Apparently there is also a case when unnest/array_agg may not produce
the result we're looking for. I mean multidimensional arrays. E.g.

select array_agg(x * 2)
from unnest(array[[1,2],[3,4],[5,6]]) as x;
     array_agg
-----------------
  {2,4,6,8,10,12}
(1 row)

select map(x * 2 for x in array[[1,2],[3,4],[5,6]]);
        ?column?
-----------------------
  {{2,4},{6,8},{10,12}}
(1 row)

array_agg produces plain arrays no matter what the input was.

There is a new version of the patch in the attachment which introduces
arbitrary per-element expressions (instead of single function call). So
now user can specify a placeholder representing one element of the array
and use it in the expressions. Like following:

select map (pow(x, 2) - 1 for x in array[1,2,3]);
    ?column?
---------------
  {1,3,7,15,31}
(1 row)


-- 
Ildar Musin
i.musin@postgrespro.ru

Вложения

Re: MAP syntax for arrays

От
Ildar Musin
Дата:
On 08.05.2018 15:49, Ildar Musin wrote:
> select map (pow(x, 2) - 1 for x in array[1,2,3]);

Sorry, the example should be:

select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);
    ?column?
---------------
  {1,3,7,15,31}
(1 row)

-- 
Ildar Musin
i.musin@postgrespro.ru


Re: MAP syntax for arrays

От
Chapman Flack
Дата:
On 05/08/2018 08:57 AM, Ildar Musin wrote:
> 
> select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);

I wonder how efficient an implementation would be possible strictly
as a function, without grammar changes?

While PostgreSQL certainly has extensions to and divergences from
standard SQL syntax, some historical and some recent, it seems like
there ought to be a highish bar for adding new ones; or, looking at it
another way, has this feature been proposed to the SQL committee?
It doesn't sound like a bad idea, and supporting new syntax for it
would be an easier call it if it were known to be in an SQL draft
somewhere.

-Chap


Re: MAP syntax for arrays

От
Chapman Flack
Дата:
On 05/08/2018 09:19 AM, Chapman Flack wrote:
> 
> While PostgreSQL certainly has extensions to and divergences from
> standard SQL syntax, some historical and some recent, it seems like
> there ought to be a highish bar for adding new ones; or, looking at it
> another way, has this feature been proposed to the SQL committee?
> It doesn't sound like a bad idea, and supporting new syntax for it
> would be an easier call it if it were known to be in an SQL draft
> somewhere.

There seems to have been some work at Databricks adding higher-order
function syntax to their SQL; they've chosen the name 'transform'
for 'map', and also provided 'exists', 'filter', and 'aggregate'.

https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html

-Chap


Re: MAP syntax for arrays

От
Peter Eisentraut
Дата:
On 5/8/18 09:19, Chapman Flack wrote:
> On 05/08/2018 08:57 AM, Ildar Musin wrote:
>>
>> select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);
> 
> I wonder how efficient an implementation would be possible strictly
> as a function, without grammar changes?

Yeah, you can pass a function to another function (using regprocedure or
just oid), so this should be possible entirely in user space.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: MAP syntax for arrays

От
Alvaro Herrera
Дата:
Peter Eisentraut wrote:
> On 5/8/18 09:19, Chapman Flack wrote:
> > On 05/08/2018 08:57 AM, Ildar Musin wrote:
> >>
> >> select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);
> > 
> > I wonder how efficient an implementation would be possible strictly
> > as a function, without grammar changes?
> 
> Yeah, you can pass a function to another function (using regprocedure or
> just oid), so this should be possible entirely in user space.

How would you invoke it?  It seems you'd be forced to use EXECUTE in a
plpgsql function, or a C function.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: MAP syntax for arrays

От
Andreas Karlsson
Дата:
On 05/08/2018 02:49 PM, Ildar Musin wrote:
> The main point of this patch was about convenience; the performance
> thing came out later just as a side effect :) Many users are familiar
> with "map/reduce/filter" concept that many languages (not only
> functional ones) utilized. And my though was that it would be great to
> have those in postgres too.

Map is a feature I have quite often wished to have, but I am not sure it 
is quite useful enough to be worth adding our own proprietary syntax. It 
would be a pain if the SQL committee started using MAP for something.

Andreas


Re: MAP syntax for arrays

От
Andrew Gierth
Дата:
>>>>> "Andreas" == Andreas Karlsson <andreas@proxel.se> writes:

 Andreas> It would be a pain if the SQL committee started using MAP for
 Andreas> something.

They already did - MAP is a non-reserved keyword in sql2016, used at
least with <user-defined ordering definition>. Can't see any obvious
conflict with use in expressions, but I haven't checked all the
references.

-- 
Andrew (irc:RhodiumToad)


Re: MAP syntax for arrays

От
Peter Eisentraut
Дата:
On 5/8/18 10:18, Alvaro Herrera wrote:
> Peter Eisentraut wrote:
>> On 5/8/18 09:19, Chapman Flack wrote:
>>> On 05/08/2018 08:57 AM, Ildar Musin wrote:
>>>>
>>>> select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);
>>>
>>> I wonder how efficient an implementation would be possible strictly
>>> as a function, without grammar changes?
>>
>> Yeah, you can pass a function to another function (using regprocedure or
>> just oid), so this should be possible entirely in user space.
> 
> How would you invoke it?  It seems you'd be forced to use EXECUTE in a
> plpgsql function, or a C function.

Yes, I was thinking about a C function.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: MAP syntax for arrays

От
Tom Lane
Дата:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 5/8/18 10:18, Alvaro Herrera wrote:
>> How would you invoke it?  It seems you'd be forced to use EXECUTE in a
>> plpgsql function, or a C function.

> Yes, I was thinking about a C function.

The thing actually implementing MAP would presumably be in C,
so this doesn't seem like a problem technically.  But having to
create a function seems like a big usability stumbling block,
probably a big enough one to make the "select array_agg(expression)
from unnest(something)" approach more attractive.

I do see the usability benefit of a dedicated MAP syntax --- I'm
just afraid of getting out in front of the SQL committee on such
things.  I doubt that it's enough nicer than the sub-select way
to justify risking future standards-compliance issues.

Realistically, we're talking about this:

    select a, b, (select array_agg(x*2) from unnest(arraycol) x)
    from ...

versus something on the order of this:

    select a, b, map(x*2 over x from arraycol)
    from ...

Yeah, it's a bit shorter, but not that much ... and there's a lot
more you can do with the sub-select syntax, eg add a WHERE filter.

            regards, tom lane


Re: MAP syntax for arrays

От
Ildar Musin
Дата:

On 08.05.2018 17:15, Peter Eisentraut wrote:
> On 5/8/18 09:19, Chapman Flack wrote:
>> On 05/08/2018 08:57 AM, Ildar Musin wrote:
>>>
>>> select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);
>>
>> I wonder how efficient an implementation would be possible
>> strictly as a function, without grammar changes?
>
> Yeah, you can pass a function to another function (using regprocedure
> or just oid), so this should be possible entirely in user space.
>

The problem with this approach is that extension should either have
single map() function with input and output type of anyarray which
cannot be used when user needs to map int[] to text[] for example. Or
the other way there should be a set of map functions for different
intput/output types.

Another thing is that this approach is not as versatile since user need
to create a function before running map, while with the proposed patch
they could run arbitrary expression over the array directly.

-- 
Ildar Musin
i.musin@postgrespro.ru


Re: MAP syntax for arrays

От
Alvaro Herrera
Дата:
Andrew Gierth wrote:
> >>>>> "Andreas" == Andreas Karlsson <andreas@proxel.se> writes:
> 
>  Andreas> It would be a pain if the SQL committee started using MAP for
>  Andreas> something.
> 
> They already did - MAP is a non-reserved keyword in sql2016, used at
> least with <user-defined ordering definition>. Can't see any obvious
> conflict with use in expressions, but I haven't checked all the
> references.

Ah, so in SQL2011 (and 2016) there already is a designator for routines,
which was a thing we were lacking previously, as I recall.

<specific routine designator> ::=
    SPECIFIC <routine type> <specific name>
    | <routine type> <member name> [ FOR <schema-resolved user-defined type name> ]

<routine type> ::=
    ROUTINE
    | FUNCTION
    | PROCEDURE
    | [ INSTANCE | STATIC | CONSTRUCTOR ] METHOD

<member name> ::=
    <member name alternatives> [ <data type list> ]

<member name alternatives> ::=
    <schema qualified routine name>
    | <method name>

<data type list> ::= <left paren> [ <data type> [ { <comma> <data type> }... ] ] <right paren>


[elsewhere]

<specific name> ::=
    <schema qualified name>

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: MAP syntax for arrays

От
Heikki Linnakangas
Дата:
On 08/05/18 18:11, Ildar Musin wrote:
> On 08.05.2018 17:15, Peter Eisentraut wrote:
>> On 5/8/18 09:19, Chapman Flack wrote:
>>> On 05/08/2018 08:57 AM, Ildar Musin wrote:
>>>>
>>>> select map (pow(2, x) - 1 for x in array[1,2,3,4,5]);
>>>
>>> I wonder how efficient an implementation would be possible
>>> strictly as a function, without grammar changes?
>>
>> Yeah, you can pass a function to another function (using regprocedure
>> or just oid), so this should be possible entirely in user space.
> 
> The problem with this approach is that extension should either have
> single map() function with input and output type of anyarray which
> cannot be used when user needs to map int[] to text[] for example. Or
> the other way there should be a set of map functions for different
> intput/output types.
> 
> Another thing is that this approach is not as versatile since user need
> to create a function before running map, while with the proposed patch
> they could run arbitrary expression over the array directly.

The consensus on this seems to be that we don't want this. Yeah, it's a 
handy syntax, but it's not that much better than using a subselect or a 
writing a user-defined function. And there's risk of future conflicts 
with the SQL standard. I'll mark this as "Rejected" in the commitfest.

- Heikki