Thread: Do we want a hashset type?

Do we want a hashset type?

From:
"Joel Jacobson"
Date:
Hi,

I've been working with a social network start-up that uses PostgreSQL as their
only database. Recently, they became interested in graph databases, largely
because of an article [1] suggesting that a SQL database "just chokes" when it
encounters a depth-five friends-of-friends query (for a million users having
50 friends each).

The article didn't provide the complete SQL queries, so I had to buy the
referenced book to get the full picture. It turns out, the query was a simple
self-join, which, of course, isn't very efficient. When we rewrote the query
using a modern RECURSIVE CTE, it worked but still took quite some time.
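
(For reference, a depth-two friends-of-friends as a plain self-join looks
roughly like this - a sketch, with table and column names that are my
paraphrase rather than the book's:

SELECT DISTINCT f2.friend_id
FROM friendships f1
JOIN friendships f2 ON f2.user_id = f1.friend_id
WHERE f1.user_id = 1;

Each additional depth level adds yet another self-join, which is the pattern
that reportedly chokes at depth five.)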

Of course, there will always be a need for specific databases, and some queries
will run faster on them. But, if PostgreSQL could handle graph queries with a
Big-O runtime similar to graph databases, scalability wouldn't be such a big
worry.

Just like the addition of the JSON type to PostgreSQL helped reduce the hype
around NoSQL, maybe there's something simple that's missing in PostgreSQL that
could help us achieve the same Big-O class performance as graph databases for
some of these types of graph queries?

Looking into the key differences between PostgreSQL and graph databases,
it seems that one is how they store adjacent nodes. In SQL, a graph can be
represented as one table for the Nodes and another table for the Edges.
For a friends-of-friends query, we would need to query Edges to find adjacent
nodes, recursively.
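
In its simplest form, that representation is just two tables (a minimal
sketch):

CREATE TABLE nodes (
    id bigint PRIMARY KEY
);

CREATE TABLE edges (
    from_node bigint REFERENCES nodes (id),
    to_node   bigint REFERENCES nodes (id),
    PRIMARY KEY (from_node, to_node)
);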

Graph databases, on the other hand, keep adjacent nodes immediately accessible
by storing them with the node itself. This looks like a major difference in
terms of how the data is stored.

Could a hashset type help bridge this gap?

The idea would be to store adjacent nodes as a hashset column in a Nodes table.
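
For illustration, something like this - hypothetical, of course, since the
hashset type is exactly what's missing today:

CREATE TABLE nodes (
    id       bigint PRIMARY KEY,
    adjacent hashset  -- IDs of adjacent nodes (proposed type)
);

Expanding the friends-of-friends frontier would then just union the adjacent
sets, without hitting the Edges table at every step.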

Apache AGE is an option for users who really need a new graph query language.
But I believe there are more users who occasionally run into a graph problem and
would be glad if there was an efficient way to solve it in SQL without having
to bring in a new database.

/Joel


Re: Do we want a hashset type?

From:
Tomas Vondra
Date:

On 5/31/23 16:09, Joel Jacobson wrote:
> Hi,
> 
> I've been working with a social network start-up that uses PostgreSQL as
> their
> only database. Recently, they became interested in graph databases, largely
> because of an article [1] suggesting that a SQL database "just chokes"
> when it
> encounters a depth-five friends-of-friends query (for a million users having
> 50 friends each).
> 
> The article didn't provide the complete SQL queries, so I had to buy the
> referenced book to get the full picture. It turns out, the query was a
> simple
> self-join, which, of course, isn't very efficient. When we rewrote the query
> using a modern RECURSIVE CTE, it worked but still took quite some time.
> 
> Of course, there will always be a need for specific databases, and some
> queries
> will run faster on them. But, if PostgreSQL could handle graph queries
> with a
> Big-O runtime similar to graph databases, scalability wouldn't be such a big
> worry.
> 
> Just like the addition of the JSON type to PostgreSQL helped reduce the hype
> around NoSQL, maybe there's something simple that's missing in
> PostgreSQL that
> could help us achieve the same Big-O class performance as graph
> databases for
> some of these types of graph queries?
> 
> Looking into the key differences between PostgreSQL and graph databases,
> it seems that one is how they store adjacent nodes. In SQL, a graph can be
> represented as one table for the Nodes and another table for the Edges.
> For a friends-of-friends query, we would need to query Edges to find
> adjacent
> nodes, recursively.
> 
> Graph databases, on the other hand, keep adjacent nodes immediately
> accessible
> by storing them with the node itself. This looks like a major difference in
> terms of how the data is stored.
> 
> Could a hashset type help bridge this gap?
> 
> The idea would be to store adjacent nodes as a hashset column in a Nodes
> table.
> 

I think this needs a better explanation - what exactly is a hashset in
this context? Something like an array with a hash for faster lookup of
unique elements, or what?

Presumably it'd store whole adjacent nodes, not just some sort of node
ID. So what if a node is adjacent to many other nodes? What if a node is
added/deleted/modified?

AFAICS the main problem is the lookups of adjacent nodes, generating a
lot of random I/O etc. Presumably it's not that hard to keep the
"relational" schema with table for vertices/edges, and then an auxiliary
table with adjacent nodes grouped by node, possibly maintained by a
couple triggers. A bit like an "aggregated index" except the queries
would have to use it explicitly.
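
Something like this, say (just a sketch, all names invented, assuming an
edges(from_node, to_node) table):

CREATE TABLE node_adjacency (
    node_id   bigint PRIMARY KEY,
    neighbors bigint[]
);

CREATE FUNCTION node_adjacency_ins() RETURNS trigger AS $$
BEGIN
    INSERT INTO node_adjacency (node_id, neighbors)
    VALUES (NEW.from_node, ARRAY[NEW.to_node])
    ON CONFLICT (node_id)
    DO UPDATE SET neighbors = node_adjacency.neighbors || NEW.to_node;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER edges_adjacency AFTER INSERT ON edges
    FOR EACH ROW EXECUTE FUNCTION node_adjacency_ins();

(Deletes and updates would need similar triggers.)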


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From:
"Joel Jacobson"
Date:
On Wed, May 31, 2023, at 16:53, Tomas Vondra wrote:
> I think this needs a better explanation - what exactly is a hashset in
> this context? Something like an array with a hash for faster lookup of
> unique elements, or what?

In this context, by "hashset" I am indeed referring to a data structure similar
to an array, where each element would be unique, and lookups would be faster
than arrays for larger numbers of elements due to hash-based lookups.

This data structure would store identifiers (IDs) of the nodes, not the complete
nodes themselves.

> Presumably it'd store whole adjacent nodes, not just some sort of node
> ID. So what if a node is adjacent to many other nodes? What if a node is
> added/deleted/modified?

That would require updating the hashset, which should be close to O(1) in
practical applications.

> AFAICS the main problem is the lookups of adjacent nodes, generating a
> lot of random I/O etc. Presumably it's not that hard to keep the
> "relational" schema with table for vertices/edges, and then an auxiliary
> table with adjacent nodes grouped by node, possibly maintained by a
> couple triggers. A bit like an "aggregated index" except the queries
> would have to use it explicitly.

Yes, an auxiliary table would be good, since we don't want to duplicate all
node-related data, but only store the IDs in the adjacent nodes hashset.

/Joel



Re: Do we want a hashset type?

From:
Tomas Vondra
Date:

On 5/31/23 17:40, Joel Jacobson wrote:
> On Wed, May 31, 2023, at 16:53, Tomas Vondra wrote:
>> I think this needs a better explanation - what exactly is a hashset in
>> this context? Something like an array with a hash for faster lookup of
>> unique elements, or what?
> 
> In this context, by "hashset" I am indeed referring to a data structure similar
> to an array, where each element would be unique, and lookups would be faster
> than arrays for larger numbers of elements due to hash-based lookups.
> 

Why would you want hash-based lookups? It should be fairly trivial to
implement as a user-defined data type, but in what cases are you going
to ask "does the hashset contain X"?

> This data structure would store identifiers (IDs) of the nodes, not the complete
> nodes themselves.
> 

How does storing just the IDs solve anything? Isn't the main challenge
the random I/O when fetching the adjacent nodes? This does not really
improve that, no?

>> Presumably it'd store whole adjacent nodes, not just some sort of node
>> ID. So what if a node is adjacent to many other nodes? What if a node is
>> added/deleted/modified?
> 
> That would require updating the hashset, which should be close to O(1) in
> practical applications.
> 

But you need to modify hashsets for all the adjacent nodes. Also, O(1)
doesn't say it's cheap. I wonder how expensive it'd be in practice.

>> AFAICS the main problem is the lookups of adjacent nodes, generating a
>> lot of random I/O etc. Presumably it's not that hard to keep the
>> "relational" schema with table for vertices/edges, and then an auxiliary
>> table with adjacent nodes grouped by node, possibly maintained by a
>> couple triggers. A bit like an "aggregated index" except the queries
>> would have to use it explicitly.
> 
> Yes, an auxiliary table would be good, since we don't want to duplicate all
> node-related data, but only store the IDs in the adjacent nodes hashset.
> 

I may be missing something, but as mentioned, I don't quite see how this
would help. What exactly would this save us? If you create an index on
the edge node IDs, you should get the adjacent nodes pretty cheap from
an IOS. Granted, a pre-built hashset is going to be faster, but if the
next step is fetching the node info, that's going to do a lot of random
I/O, dwarfing all of this.

It's entirely possible I'm missing some important aspect. It'd be very
useful to have a couple example queries illustrating the issue, that we
could use to actually test different approaches.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From:
"Joel Jacobson"
Date:
On Wed, May 31, 2023, at 18:59, Tomas Vondra wrote:
> How does storing just the IDs solve anything? Isn't the main challenge
> the random I/O when fetching the adjacent nodes? This does not really
> improve that, no?

I'm thinking of a recursive query where a lot of time is just spent following
all friends-of-friends links; the graph traversal is the heavy part,
and the final set of nodes is only fetched at the end.

> It's entirely possible I'm missing some important aspect. It'd be very
> useful to have a couple example queries illustrating the issue, that we
> could use to actually test different approaches.

Here is an example using a real anonymised social network.

wget https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz
gunzip soc-pokec-relationships.txt.gz
CREATE TABLE edges (from_node INT, to_node INT);
\copy edges from soc-pokec-relationships.txt;
ALTER TABLE edges ADD PRIMARY KEY (from_node, to_node);
SET work_mem TO '1GB';
CREATE VIEW friends_of_friends AS
WITH RECURSIVE friends_of_friends AS (
    SELECT 
        ARRAY[5867::bigint] AS current, 
        ARRAY[5867::bigint] AS found,
        0 AS depth
    UNION ALL  
    SELECT 
        new_current, 
        found || new_current, 
        friends_of_friends.depth + 1
    FROM 
        friends_of_friends
    CROSS JOIN LATERAL (
        SELECT
            array_agg(DISTINCT edges.to_node) AS new_current
        FROM
            edges
        WHERE
            from_node = ANY(friends_of_friends.current)
    ) q
    WHERE
        friends_of_friends.depth < 3
)
SELECT
    depth,
    coalesce(array_length(current, 1), 0)
FROM
    friends_of_friends
WHERE
    depth = 3;

SELECT COUNT(*) FROM edges;
  count
----------
 30622564
(1 row)

SELECT COUNT(DISTINCT from_node) FROM edges;
  count
---------
 1432693
(1 row)

-- Most connected user (worst-case) is user 5867 with 8763 friends:
SELECT from_node, COUNT(*) FROM edges GROUP BY from_node ORDER BY COUNT(*) DESC LIMIT 1;
 from_node | count
-----------+-------
      5867 |  8763
(1 row)

-- Friend-of-friends query exactly at depth three:

EXPLAIN ANALYZE
SELECT * FROM friends_of_friends;

                                                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on friends_of_friends  (cost=6017516.90..6017517.60 rows=1 width=8) (actual time=2585.881..2586.334 rows=1 loops=1)
   Filter: (depth = 3)
   Rows Removed by Filter: 3
   CTE friends_of_friends
     ->  Recursive Union  (cost=0.00..6017516.90 rows=31 width=68) (actual time=0.005..2581.664 rows=4 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=68) (actual time=0.002..0.002 rows=1 loops=1)
           ->  Subquery Scan on "*SELECT* 2"  (cost=200583.71..601751.66 rows=3 width=68) (actual time=645.036..645.157 rows=1 loops=4)
                 ->  Nested Loop  (cost=200583.71..601751.54 rows=3 width=68) (actual time=641.880..641.972 rows=1 loops=4)
                       ->  WorkTable Scan on friends_of_friends friends_of_friends_1  (cost=0.00..0.22 rows=3 width=68) (actual time=0.001..0.002 rows=1 loops=4)
                             Filter: (depth < 3)
                             Rows Removed by Filter: 0
                       ->  Aggregate  (cost=200583.71..200583.72 rows=1 width=32) (actual time=850.997..850.998 rows=1 loops=3)
                             ->  Bitmap Heap Scan on edges  (cost=27656.38..196840.88 rows=1497133 width=4) (actual time=203.239..423.534 rows=3486910 loops=3)
                                   Recheck Cond: (from_node = ANY (friends_of_friends_1.current))
                                   Heap Blocks: exact=117876
                                   ->  Bitmap Index Scan on edges_pkey  (cost=0.00..27282.10 rows=1497133 width=0) (actual time=198.047..198.047 rows=3486910 loops=3)
                                         Index Cond: (from_node = ANY (friends_of_friends_1.current))
 Planning Time: 1.414 ms
 Execution Time: 2588.288 ms
(19 rows)

I tested on PostgreSQL 15.2. For some reason I got a different, slower plan on HEAD:

EXPLAIN ANALYZE SELECT * FROM friends_of_friends;
                                                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on friends_of_friends  (cost=6576.67..6577.37 rows=1 width=8) (actual time=6412.693..6413.335 rows=1 loops=1)
   Filter: (depth = 3)
   Rows Removed by Filter: 3
   CTE friends_of_friends
     ->  Recursive Union  (cost=0.00..6576.67 rows=31 width=68) (actual time=0.008..6407.134 rows=4 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=68) (actual time=0.005..0.005 rows=1 loops=1)
           ->  Subquery Scan on "*SELECT* 2"  (cost=219.05..657.64 rows=3 width=68) (actual time=1600.747..1600.934 rows=1 loops=4)
                 ->  Nested Loop  (cost=219.05..657.52 rows=3 width=68) (actual time=1594.906..1595.035 rows=1 loops=4)
                       ->  WorkTable Scan on friends_of_friends friends_of_friends_1  (cost=0.00..0.22 rows=3 width=68) (actual time=0.001..0.002 rows=1 loops=4)
                             Filter: (depth < 3)
                             Rows Removed by Filter: 0
                       ->  Aggregate  (cost=219.05..219.06 rows=1 width=32) (actual time=2118.105..2118.105 rows=1 loops=3)
                             ->  Sort  (cost=207.94..213.49 rows=2221 width=4) (actual time=1780.770..1925.853 rows=3486910 loops=3)
                                   Sort Key: edges.to_node
                                   Sort Method: quicksort  Memory: 393217kB
                                   ->  Index Only Scan using edges_pkey on edges  (cost=0.56..84.48 rows=2221 width=4) (actual time=0.077..762.408 rows=3486910 loops=3)
                                         Index Cond: (from_node = ANY (friends_of_friends_1.current))
                                         Heap Fetches: 0
 Planning Time: 8.229 ms
 Execution Time: 6446.421 ms
(20 rows)



Re: Do we want a hashset type?

From:
"Joel Jacobson"
Date:
On Thu, Jun 1, 2023, at 09:02, Joel Jacobson wrote:
> Here is an example using a real anonymised social network.

I realised the "found" column is not necessary in this particular example,
since we only care about the friends at the exact depth level. Simplified query:

CREATE OR REPLACE VIEW friends_of_friends AS
WITH RECURSIVE friends_of_friends AS (
    SELECT 
        ARRAY[5867::bigint] AS current,
        0 AS depth
    UNION ALL
    SELECT
        new_current,
        friends_of_friends.depth + 1
    FROM
        friends_of_friends
    CROSS JOIN LATERAL (
        SELECT
            array_agg(DISTINCT edges.to_node) AS new_current
        FROM
            edges
        WHERE
            from_node = ANY(friends_of_friends.current)
    ) q
    WHERE
        friends_of_friends.depth < 3
)
SELECT
    depth,
    coalesce(array_length(current, 1), 0)
FROM
    friends_of_friends
WHERE
    depth = 3;

-- PostgreSQL 15.2:

EXPLAIN ANALYZE SELECT * FROM friends_of_friends;
                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on friends_of_friends  (cost=2687.88..2688.58 rows=1 width=8) (actual time=2076.362..2076.454 rows=1 loops=1)
   Filter: (depth = 3)
   Rows Removed by Filter: 3
   CTE friends_of_friends
     ->  Recursive Union  (cost=0.00..2687.88 rows=31 width=36) (actual time=0.008..2075.073 rows=4 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=36) (actual time=0.002..0.002 rows=1 loops=1)
           ->  Subquery Scan on "*SELECT* 2"  (cost=89.44..268.75 rows=3 width=36) (actual time=518.613..518.622 rows=1 loops=4)
                 ->  Nested Loop  (cost=89.44..268.64 rows=3 width=36) (actual time=515.523..515.523 rows=1 loops=4)
                       ->  WorkTable Scan on friends_of_friends friends_of_friends_1  (cost=0.00..0.22 rows=3 width=36) (actual time=0.001..0.001 rows=1 loops=4)
                             Filter: (depth < 3)
                             Rows Removed by Filter: 0
                       ->  Aggregate  (cost=89.44..89.45 rows=1 width=32) (actual time=687.356..687.356 rows=1 loops=3)
                             ->  Index Only Scan using edges_pkey on edges  (cost=0.56..83.96 rows=2191 width=4) (actual time=0.139..290.996 rows=3486910 loops=3)
                                   Index Cond: (from_node = ANY (friends_of_friends_1.current))
                                   Heap Fetches: 0
 Planning Time: 0.557 ms
 Execution Time: 2076.990 ms
(17 rows)



Re: Do we want a hashset type?

From:
Andrew Dunstan
Date:


On 2023-05-31 We 11:40, Joel Jacobson wrote:
> On Wed, May 31, 2023, at 16:53, Tomas Vondra wrote:
>> I think this needs a better explanation - what exactly is a hashset in
>> this context? Something like an array with a hash for faster lookup of
>> unique elements, or what?
>
> In this context, by "hashset" I am indeed referring to a data structure similar
> to an array, where each element would be unique, and lookups would be faster
> than arrays for larger numbers of elements due to hash-based lookups.


Yeah, a fast lookup set type has long been on my "blue sky" wish list.  So +1 for pursuing the idea.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Do we want a hashset type?

From:
Ants Aasma
Date:
On Wed, 31 May 2023 at 18:40, Joel Jacobson <joel@compiler.org> wrote:
>
> On Wed, May 31, 2023, at 16:53, Tomas Vondra wrote:
> > I think this needs a better explanation - what exactly is a hashset in
> > this context? Something like an array with a hash for faster lookup of
> > unique elements, or what?
>
> In this context, by "hashset" I am indeed referring to a data structure similar
> to an array, where each element would be unique, and lookups would be faster
> than arrays for larger numbers of elements due to hash-based lookups.
>
> This data structure would store identifiers (IDs) of the nodes, not the complete
> nodes themselves.

Have you looked at roaring bitmaps? There is a pg_roaringbitmap
extension [1] already available that offers very fast unions,
intersections and membership tests over integer sets. I used it to get
some pretty impressive performance results for faceting search on
large document sets. [2]
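
For example, sticking to the functions from the extension that also show up
later in this thread (rb_build_agg / rb_cardinality), on the edges table
from upthread:

CREATE EXTENSION roaringbitmap;

-- size of user 5867's friend set, built as a bitmap
SELECT rb_cardinality(rb_build_agg(to_node))
FROM edges
WHERE from_node = 5867;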

Depending on the graph fan-outs and operations it might make sense in
the graph use case. For small sets it's probably not too different
from the intarray extension in contrib. But for finding intersections
over large sets (i.e. a join) it's very, very fast. If the workload is
traversal heavy it might make sense to even cache materialized
transitive closures up to some depth (a friend-of-a-friend list).
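
A depth-two closure could be materialized along these lines (a sketch,
assuming the friend sets have first been aggregated into a
users(id, friends roaringbitmap) table):

CREATE TABLE fof2 AS
SELECT u.id, rb_or_agg(u2.friends) AS friends_at_depth_2
FROM users u
JOIN users u2 ON u2.id = ANY (rb_to_array(u.friends))
GROUP BY u.id;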

Roaring bitmaps only support int4 right now, but that is easily
fixable. And they need a relatively dense ID space to get the
performance boost, which seems essential to the approach. The latter
issue means that it can't be easily dropped into GIN or B-tree indexes
for ctid storage.

[1] https://github.com/ChenHuajun/pg_roaringbitmap
[2] https://github.com/cybertec-postgresql/pgfaceting
-- 
Ants Aasma
www.cybertec-postgresql.com



Re: Do we want a hashset type?

From:
Tomas Vondra
Date:

On 6/1/23 09:14, Joel Jacobson wrote:
> On Thu, Jun 1, 2023, at 09:02, Joel Jacobson wrote:
>> Here is an example using a real anonymised social network.
> 
> I realised the "found" column is not necessary in this particular example,
> since we only care about the friends at the exact depth level. Simplified query:
> 
> CREATE OR REPLACE VIEW friends_of_friends AS
> WITH RECURSIVE friends_of_friends AS (
>     SELECT 
>         ARRAY[5867::bigint] AS current,
>         0 AS depth
>     UNION ALL
>     SELECT
>         new_current,
>         friends_of_friends.depth + 1
>     FROM
>         friends_of_friends
>     CROSS JOIN LATERAL (
>         SELECT
>             array_agg(DISTINCT edges.to_node) AS new_current
>         FROM
>             edges
>         WHERE
>             from_node = ANY(friends_of_friends.current)
>     ) q
>     WHERE
>         friends_of_friends.depth < 3
> )
> SELECT
>     depth,
>     coalesce(array_length(current, 1), 0)
> FROM
>     friends_of_friends
> WHERE
>     depth = 3;
> ;
> 
> -- PostgreSQL 15.2:
> 
> EXPLAIN ANALYZE SELECT * FROM friends_of_friends;
>                                                                             QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  CTE Scan on friends_of_friends  (cost=2687.88..2688.58 rows=1 width=8) (actual time=2076.362..2076.454 rows=1 loops=1)
>    Filter: (depth = 3)
>    Rows Removed by Filter: 3
>    CTE friends_of_friends
>      ->  Recursive Union  (cost=0.00..2687.88 rows=31 width=36) (actual time=0.008..2075.073 rows=4 loops=1)
>            ->  Result  (cost=0.00..0.01 rows=1 width=36) (actual time=0.002..0.002 rows=1 loops=1)
>            ->  Subquery Scan on "*SELECT* 2"  (cost=89.44..268.75 rows=3 width=36) (actual time=518.613..518.622 rows=1 loops=4)
>                  ->  Nested Loop  (cost=89.44..268.64 rows=3 width=36) (actual time=515.523..515.523 rows=1 loops=4)
>                        ->  WorkTable Scan on friends_of_friends friends_of_friends_1  (cost=0.00..0.22 rows=3 width=36) (actual time=0.001..0.001 rows=1 loops=4)
>                              Filter: (depth < 3)
>                              Rows Removed by Filter: 0
>                        ->  Aggregate  (cost=89.44..89.45 rows=1 width=32) (actual time=687.356..687.356 rows=1 loops=3)
>                              ->  Index Only Scan using edges_pkey on edges  (cost=0.56..83.96 rows=2191 width=4) (actual time=0.139..290.996 rows=3486910 loops=3)
>                                    Index Cond: (from_node = ANY (friends_of_friends_1.current))
>                                    Heap Fetches: 0
>  Planning Time: 0.557 ms
>  Execution Time: 2076.990 ms
> (17 rows)
> 

I've been thinking about this a bit on the way back from pgcon. Per the CPU
profile it seems most of the job is actually spent on qsort, calculating
the array_agg(distinct) bit.

We could replace this part by a hash-based set aggregate, which would be
faster, but I doubt that would yield a massive improvement that'd change
the fundamental cost.

I forgot I did something like that a couple years back, implementing a
count_distinct() aggregate that was meant as a faster alternative to
count(distinct). And then I mostly abandoned it because the built-in
sort-based approach improved significantly - it was still slower in some
cases, but the gap got small enough.

Anyway, I hacked together a trivial set backed by an open addressing
hash table:

  https://github.com/tvondra/hashset

It's super-basic, providing just some bare minimum of functions, but
hopefully good enough for experiments.

- hashset data type - hash table in varlena
- hashset_init - create new hashset
- hashset_add / hashset_contains - add value / check membership
- hashset_merge - merge two hashsets
- hashset_count - count elements
- hashset_to_array - return elements as an array
- hashset(int) aggregate

This allows rewriting the recursive query like this:

    WITH RECURSIVE friends_of_friends AS (
        SELECT
            ARRAY[5867::bigint] AS current,
            0 AS depth
        UNION ALL
        SELECT
            new_current,
            friends_of_friends.depth + 1
        FROM
            friends_of_friends
        CROSS JOIN LATERAL (
            SELECT
                hashset_to_array(hashset(edges.to_node)) AS new_current
            FROM
                edges
            WHERE
                from_node = ANY(friends_of_friends.current)
        ) q
        WHERE
            friends_of_friends.depth < 3
    )
    SELECT
        depth,
        coalesce(array_length(current, 1), 0)
    FROM
        friends_of_friends
    WHERE
        depth = 3;

On my laptop this cuts the timing roughly in half - which is nice, but as I
said I don't think it's a fundamental speedup. The aggregate can also be
parallelized, but I don't think that changes much.
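
For the parallel case the aggregate would presumably need a combine function;
a hypothetical definition, assuming hashset_add and hashset_merge have
suitable signatures for the transition/combine roles:

CREATE AGGREGATE hashset(int4) (
    SFUNC = hashset_add,
    STYPE = hashset,
    COMBINEFUNC = hashset_merge,
    PARALLEL = SAFE
);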

Furthermore, this has a number of drawbacks too - e.g. it can't spill
data to disk, which might be an issue on more complex queries / larger
data sets.

FWIW I wonder how representative this query is, considering it returns
~1M node IDs, i.e. about 10% of the whole set of node IDs. Seems a bit
on the high side.

I've also experimented with doing stuff from a plpgsql procedure (that's
what the non-aggregate functions are about). I saw this mostly as a way
to do stuff that'd be hard to do in the recursive CTE, but it has a lot
of additional execution overhead due to plpgsql. Maybe if we had some
smart trick to calculate adjacent nodes we could have a SRF written in C
to get rid of the overhead.

Anyway, this leads me to the question of what graph databases are doing for
these queries, if they're faster in answering such queries (by how
much?). I'm not very familiar with that stuff, but surely they do
something smart - precalculating the data, some special indexing,
duplicating the data in multiple places?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From:
"Joel Jacobson"
Date:
On Fri, Jun 2, 2023, at 10:01, Ants Aasma wrote:
> Have you looked at roaring bitmaps? There is a pg_roaringbitmap
> extension [1] already available that offers very fast unions,
> intersections and membership tests over integer sets. I used it to get
> some pretty impressive performance results for faceting search on
> large document sets. [2]

Many thanks for the tip!

New benchmark:

We already had since before:
> wget https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz
> gunzip soc-pokec-relationships.txt.gz
> CREATE TABLE edges (from_node INT, to_node INT);
> \copy edges from soc-pokec-relationships.txt;
> ALTER TABLE edges ADD PRIMARY KEY (from_node, to_node);

I've created a new `users` table from the `edges` table,
with a new `friends` roaringbitmap column:

CREATE TABLE users AS
SELECT from_node AS id, rb_build_agg(to_node) AS friends FROM edges GROUP BY 1;
ALTER TABLE users ADD PRIMARY KEY (id);

Old query from before:

CREATE OR REPLACE VIEW friends_of_friends AS
WITH RECURSIVE friends_of_friends AS (
    SELECT
        ARRAY[5867::bigint] AS current,
        0 AS depth
    UNION ALL
    SELECT
        new_current,
        friends_of_friends.depth + 1
    FROM
        friends_of_friends
    CROSS JOIN LATERAL (
        SELECT
            array_agg(DISTINCT edges.to_node) AS new_current
        FROM
            edges
        WHERE
            from_node = ANY(friends_of_friends.current)
    ) q
    WHERE
        friends_of_friends.depth < 3
)
SELECT
    coalesce(array_length(current, 1), 0) AS count_friends_at_depth_3
FROM
    friends_of_friends
WHERE
    depth = 3;

New roaringbitmap-based query using users instead:

CREATE OR REPLACE VIEW friends_of_friends_roaringbitmap AS
WITH RECURSIVE friends_of_friends AS
(
    SELECT
        friends,
        1 AS depth
    FROM users WHERE id = 5867
    UNION ALL
    SELECT
        new_friends,
        friends_of_friends.depth + 1
    FROM
        friends_of_friends
    CROSS JOIN LATERAL (
        SELECT
            rb_or_agg(users.friends) AS new_friends
        FROM
            users
        WHERE
            users.id = ANY(rb_to_array(friends_of_friends.friends))
    ) q
    WHERE
        friends_of_friends.depth < 3
)
SELECT
    rb_cardinality(friends) AS count_friends_at_depth_3
FROM
    friends_of_friends
WHERE
    depth = 3
;

Note, depth is 1 at the first level since we already have user 5867's friends in the users table.

Maybe there is a better way to make use of the btree index on users.id,
than to convert the roaringbitmap to an array in order to use = ANY(...),
that is, this part: `users.id = ANY(rb_to_array(friends_of_friends.friends))`?

Or maybe there is some entirely different but equivalent way of writing the query
in a more efficient way?


SELECT * FROM friends_of_friends;
 count_friends_at_depth_3
--------------------------
                  1035293
(1 row)

SELECT * FROM friends_of_friends_roaringbitmap;
 count_friends_at_depth_3
--------------------------
                  1035293
(1 row)

EXPLAIN ANALYZE SELECT * FROM friends_of_friends;
                                                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on friends_of_friends  (cost=5722.03..5722.73 rows=1 width=4) (actual time=2232.896..2233.289 rows=1 loops=1)
   Filter: (depth = 3)
   Rows Removed by Filter: 3
   CTE friends_of_friends
     ->  Recursive Union  (cost=0.00..5722.03 rows=31 width=36) (actual time=0.003..2228.707 rows=4 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=36) (actual time=0.001..0.001 rows=1 loops=1)
           ->  Subquery Scan on "*SELECT* 2"  (cost=190.59..572.17 rows=3 width=36) (actual time=556.806..556.837 rows=1 loops=4)
                 ->  Nested Loop  (cost=190.59..572.06 rows=3 width=36) (actual time=553.748..553.748 rows=1 loops=4)
                       ->  WorkTable Scan on friends_of_friends friends_of_friends_1  (cost=0.00..0.22 rows=3 width=36) (actual time=0.000..0.001 rows=1 loops=4)
                             Filter: (depth < 3)
                             Rows Removed by Filter: 0
                       ->  Aggregate  (cost=190.59..190.60 rows=1 width=32) (actual time=737.427..737.427 rows=1 loops=3)
                             ->  Sort  (cost=179.45..185.02 rows=2227 width=4) (actual time=577.192..649.812 rows=3486910 loops=3)
                                   Sort Key: edges.to_node
                                   Sort Method: quicksort  Memory: 393217kB
                                   ->  Index Only Scan using edges_pkey on edges  (cost=0.56..55.62 rows=2227 width=4) (actual time=0.027..225.609 rows=3486910 loops=3)
                                         Index Cond: (from_node = ANY (friends_of_friends_1.current))
                                         Heap Fetches: 0
 Planning Time: 0.294 ms
 Execution Time: 2240.446 ms

EXPLAIN ANALYZE SELECT * FROM friends_of_friends_roaringbitmap;
                                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on friends_of_friends  (cost=799.30..800.00 rows=1 width=8) (actual time=492.925..492.930 rows=1 loops=1)
   Filter: (depth = 3)
   Rows Removed by Filter: 2
   CTE friends_of_friends
     ->  Recursive Union  (cost=0.43..799.30 rows=31 width=118) (actual time=0.061..492.842 rows=3 loops=1)
           ->  Index Scan using users_pkey on users  (cost=0.43..2.65 rows=1 width=118) (actual time=0.060..0.062 rows=1 loops=1)
                 Index Cond: (id = 5867)
           ->  Nested Loop  (cost=26.45..79.63 rows=3 width=36) (actual time=164.244..164.244 rows=1 loops=3)
                 ->  WorkTable Scan on friends_of_friends friends_of_friends_1  (cost=0.00..0.22 rows=3 width=36) (actual time=0.000..0.001 rows=1 loops=3)
                       Filter: (depth < 3)
                       Rows Removed by Filter: 0
                 ->  Aggregate  (cost=26.45..26.46 rows=1 width=32) (actual time=246.359..246.359 rows=1 loops=2)
                       ->  Index Scan using users_pkey on users users_1  (cost=0.43..26.42 rows=10 width=114) (actual time=0.074..132.318 rows=116336 loops=2)
                             Index Cond: (id = ANY (rb_to_array(friends_of_friends_1.friends)))
 Planning Time: 0.257 ms
 Execution Time: 493.134 ms



Re: Do we want a hashset type?

From:
"Joel Jacobson"
Date:
On Mon, Jun 5, 2023, at 01:44, Tomas Vondra wrote:
> Anyway, I hacked together a trivial set backed by an open addressing
> hash table:
>
>   https://github.com/tvondra/hashset
>
> It's super-basic, providing just some bare minimum of functions, but
> hopefully good enough for experiments.
>
> - hashset data type - hash table in varlena
> - hashset_init - create new hashset
> - hashset_add / hashset_contains - add value / check membership
> - hashset_merge - merge two hashsets
> - hashset_count - count elements
> - hashset_to_array - return elements as an array
> - hashset(int) aggregate
>
> This allows rewriting the recursive query like this:
>
>     WITH RECURSIVE friends_of_friends AS (
...

Nice! I get similar results, 737 ms (hashset) vs 1809 ms (array_agg).

I think if you just add one more hashset function, it will be a win against roaringbitmap, which is 400 ms.

The missing function is an agg that takes hashset as input and returns hashset, similar to roaringbitmap's
rb_or_agg().

With such a function, we could add an adjacent nodes hashset column to the `nodes` table, which would eliminate the
need to scan the edges table for graph traversal.
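
Presumably it could be built on top of the existing hashset_merge function;
something like this untested sketch (aggregate name made up):

CREATE AGGREGATE hashset_union(hashset) (
    SFUNC = hashset_merge,
    STYPE = hashset
);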

We could then benchmark roaringbitmap against hashset querying the same table:

CREATE TABLE users AS
SELECT
    from_node AS id,
    rb_build_agg(to_node) AS friends_roaringbitmap,
    hashset(to_node) AS friends_hashset
FROM edges
GROUP BY 1;

/Joel



Re: Do we want a hashset type?

From:
Tomas Vondra
Date:
On 6/5/23 21:52, Joel Jacobson wrote:
> On Mon, Jun 5, 2023, at 01:44, Tomas Vondra wrote:
>> Anyway, I hacked together a trivial set backed by an open addressing
>> hash table:
>>
>>   https://github.com/tvondra/hashset
>>
>> It's super-basic, providing just some bare minimum of functions, but
>> hopefully good enough for experiments.
>>
>> - hashset data type - hash table in varlena
>> - hashset_init - create new hashset
>> - hashset_add / hashset_contains - add value / check membership
>> - hashset_merge - merge two hashsets
>> - hashset_count - count elements
>> - hashset_to_array - return elements as an array
>> - hashset(int) aggregate
>>
>> This allows rewriting the recursive query like this:
>>
>>     WITH RECURSIVE friends_of_friends AS (
> ...
> 
> Nice! I get similar results, 737 ms (hashset) vs 1809 ms (array_agg).
> 
> I think if you just add one more hashset function, it will be a win against roaringbitmap, which is 400 ms.
> 
> The missing function is an agg that takes hashset as input and returns hashset, similar to roaringbitmap's
> rb_or_agg().
> 
> With such a function, we could add an adjacent nodes hashset column to the `nodes` table, which would eliminate the
> need to scan the edges table for graph traversal.
> 

I added a trivial version of such an aggregate, hashset(hashset), and if I
rewrite the CTE like this:

    WITH RECURSIVE friends_of_friends AS (
        SELECT
            (select hashset(v) from (values (5867)) as s(v)) AS current,
            0 AS depth
        UNION ALL
        SELECT
            new_current,
            friends_of_friends.depth + 1
        FROM
            friends_of_friends
        CROSS JOIN LATERAL (
            SELECT
                hashset(edges.to_node) AS new_current
            FROM
                edges
            WHERE
                from_node = ANY(hashset_to_array(friends_of_friends.current))
        ) q
        WHERE
            friends_of_friends.depth < 3
    )
    SELECT
        depth,
        hashset_count(current)
    FROM
        friends_of_friends
    WHERE
        depth = 3;

it cuts the timing to about 50% on my laptop, so maybe it'll be ~300ms
on your system. There's a bunch of opportunities for more improvements,
as the hash table implementation is pretty naive/silly, the on-disk
format is wasteful and so on.

But before spending more time on that, it'd be interesting to know what
would be a competitive timing. I mean, what would be "good enough"? What
timings are achievable with graph databases?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From:
"Joel Jacobson"
Date:
On Tue, Jun 6, 2023, at 13:20, Tomas Vondra wrote:
> it cuts the timing to about 50% on my laptop, so maybe it'll be ~300ms
> on your system. There's a bunch of opportunities for more improvements,
> as the hash table implementation is pretty naive/silly, the on-disk
> format is wasteful and so on.
>
> But before spending more time on that, it'd be interesting to know what
> would be a competitive timing. I mean, what would be "good enough"? What
> timings are achievable with graph databases?

Your hashset is now almost exactly as fast as the corresponding roaringbitmap query, +/- 1 ms on my machine.

I tested Neo4j and the results are surprising; it appears to be significantly *slower*.
However, I've probably misunderstood something, maybe I need to add some index or something.
Even so, it's interesting it's apparently not fast "by default".

The query I tested:
MATCH (user:User {id: '5867'})-[:FRIENDS_WITH*3..3]->(fof)
RETURN COUNT(DISTINCT fof)

Here is how I loaded the data into it:

% pwd
/Users/joel/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/dbms-3837aa22-c830-4dcf-8668-ef8e302263c7

% head import/*
==> import/friendships.csv <==
1,13,FRIENDS_WITH
1,11,FRIENDS_WITH
1,6,FRIENDS_WITH
1,3,FRIENDS_WITH
1,4,FRIENDS_WITH
1,5,FRIENDS_WITH
1,15,FRIENDS_WITH
1,14,FRIENDS_WITH
1,7,FRIENDS_WITH
1,8,FRIENDS_WITH

==> import/friendships_header.csv <==
:START_ID(User),:END_ID(User),:TYPE

==> import/users.csv <==
1,User
2,User
3,User
4,User
5,User
6,User
7,User
8,User
9,User
10,User

==> import/users_header.csv <==
id:ID(User),:LABEL

% ./bin/neo4j-admin database import full --overwrite-destination --nodes=User=import/users_header.csv,import/users.csv --relationships=FRIENDS_WITH=import/friendships_header.csv,import/friendships.csv neo4j

/Joel



Re: Do we want a hashset type?

From:
Tomas Vondra
Date:

On 6/7/23 16:21, Joel Jacobson wrote:
> On Tue, Jun 6, 2023, at 13:20, Tomas Vondra wrote:
>> it cuts the timing to about 50% on my laptop, so maybe it'll be ~300ms
>> on your system. There's a bunch of opportunities for more improvements,
>> as the hash table implementation is pretty naive/silly, the on-disk
>> format is wasteful and so on.
>>
>> But before spending more time on that, it'd be interesting to know what
>> would be a competitive timing. I mean, what would be "good enough"? What
>> timings are achievable with graph databases?
> 
> Your hashset is now almost exactly as fast as the corresponding roaringbitmap query, +/- 1 ms on my machine.
> 

Interesting, considering how dumb the hash table implementation is.

> I tested Neo4j and the results are surprising; it appears to be significantly *slower*.
> However, I've probably misunderstood something, maybe I need to add some index or something.
> Even so, it's interesting it's apparently not fast "by default".
> 

No idea how to fix that, but it's rather suspicious.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From:
"Joel Jacobson"
Date:
On Wed, Jun 7, 2023, at 19:37, Tomas Vondra wrote:
> Interesting, considering how dumb the hash table implementation is.

That's promising.

>> I tested Neo4j and the results are surprising; it appears to be significantly *slower*.
>> However, I've probably misunderstood something, maybe I need to add some index or something.
>> Even so, it's interesting it's apparently not fast "by default".
>> 
>
> No idea how to fix that, but it's rather suspicious.

I've had a graph-db expert review my benchmark, and he suggested adding an index:

CREATE INDEX FOR (n:User) ON (n.id)

This did improve the execution time for Neo4j a bit, down from 819 ms to 528 ms, but PostgreSQL 299 ms is still a win.

Benchmark here: https://github.com/joelonsql/graph-query-benchmarks
Note, in this benchmark, I only test the naive RECURSIVE CTE approach using array_agg(DISTINCT ...).
And I couldn't even test the most connected user with Neo4j; the query never finished for some reason,
so I had to test with a less connected user.

The graph expert also said that other more realistic graph use-cases might be "multi-relational",
and pointed me to a link: https://github.com/totogo/awesome-knowledge-graph#knowledge-graph-dataset
No idea how such multi-relational datasets would affect the benchmarks.

I think we already have strong indicators that PostgreSQL with a hashset type will, from a relative
performance perspective, do just fine processing basic graph queries, even with large datasets.

Then there will always be users who primarily write very different graph queries all day long,
and they might prefer a graph query *language*, like SQL/PGQ in SQL:2023, Cypher or Gremlin.

/Joel



Re: Do we want a hashset type?

From:
Tomas Vondra
Date:
On 6/8/23 11:41, Joel Jacobson wrote:
> On Wed, Jun 7, 2023, at 19:37, Tomas Vondra wrote:
>> Interesting, considering how dumb the hash table implementation is.
> 
> That's promising.
> 

Yeah, not bad for sleep-deprived on-plane hacking ...

There's a bunch of stuff that needs to be improved to make this properly
usable, like:

1) better hash table implementation

2) input/output functions

3) support for other types (now it only works with int32)

4) I wonder if this might be done as an array-like polymorphic type.

5) more efficient storage format, with versioning etc.

6) regression tests

Would you be interested in helping with / working on some of that? I
don't have immediate need for this stuff, so it's not very high on my
TODO list.

>>> I tested Neo4j and the results are surprising; it appears to be significantly *slower*.
>>> However, I've probably misunderstood something, maybe I need to add some index or something.
>>> Even so, it's interesting it's apparently not fast "by default".
>>>
>>
>> No idea how to fix that, but it's rather suspicious.
> 
> I've had a graph-db expert review my benchmark, and he suggested adding an index:
> 
> CREATE INDEX FOR (n:User) ON (n.id)
> 
> This did improve the execution time for Neo4j a bit, down from 819 ms to 528 ms, but PostgreSQL 299 ms is still a
win.
> 
> Benchmark here: https://github.com/joelonsql/graph-query-benchmarks
> Note, in this benchmark, I only test the naive RECURSIVE CTE approach using array_agg(DISTINCT ...).
> And I couldn't even test the most connected user with Neo4j; the query never finished for some reason,
> so I had to test with a less connected user.
> 

Interesting. I'd have expected the graph db to be much faster.

> The graph expert also said that other more realistic graph use-cases might be "multi-relational",
> and pointed me to a link: https://github.com/totogo/awesome-knowledge-graph#knowledge-graph-dataset
> No idea how such multi-relational datasets would affect the benchmarks.
> 

Not sure either, but I don't have ambition to improve everything at
once. If the hashset improves one practical use case, fine with me.

> I think we already have strong indicators that PostgreSQL with a hashset type will, from a relative
> performance perspective, do just fine processing basic graph queries, even with large datasets.
> 
> Then there will always be users who primarily write very different graph queries all day long,
> and they might prefer a graph query *language*, like SQL/PGQ in SQL:2023, Cypher or Gremlin.
> 

Right. IMHO the query language is a separate thing, you still need to
evaluate the query somehow - which is where hashset applies.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From:
"Joel Jacobson"
Date:
On Thu, Jun 8, 2023, at 12:19, Tomas Vondra wrote:
> Would you be interested in helping with / working on some of that? I
> don't have immediate need for this stuff, so it's not very high on my
> TODO list.

Sure, I'm willing to help!

I've attached a patch that works on some of the items on your list,
including some additions to the README.md.

There were a bunch of places where `maxelements / 8` caused bugs
that had to be changed to do proper integer ceiling division:

-       values = (int32 *) (set->data + set->maxelements / 8);
+       values = (int32 *) (set->data + (set->maxelements + 7) / 8);

Side note: I wonder if it would be good to add CEIL_DIV and FLOOR_DIV macros
to the PostgreSQL source code in general, since it's easy to make this mistake,
and quite verbose/error-prone to write it out manually everywhere.
(The idiom relies on the identity ceil(a / b) = (a + b - 1) / b in integer
division; with b = 8 that gives the `+ 7` above.)
Such macros could simplify code in e.g. numeric.c.

> There's a bunch of stuff that needs to be improved to make this properly
> usable, like:
>
> 1) better hash table implementation
TODO

> 2) input/output functions
I've attempted to implement these.
I thought comma-separated values wrapped in curly braces felt like the most natural format,
for example:
SELECT '{1,2,3}'::hashset;

> 3) support for other types (now it only works with int32)
TODO

> 4) I wonder if this might be done as an array-like polymorphic type.
That would be nice!
I guess the work-around would be to store the actual value of a non-int type
in a lookup table, and then hash the int-based primary key of that table.

Do you think later implementing polymorphic type support would
mean a more or less complete rewrite, or can we carry on with int32 support
and add it later on?

> 5) more efficient storage format, with versioning etc.
TODO

> 6) regression tests
I've added some regression tests.

> Right. IMHO the query language is a separate thing, you still need to
> evaluate the query somehow - which is where hashset applies.

Good point, I fully agree.

/Joel
Attachments

Re: Do we want a hashset type?

From:
jian he
Date:


On Fri, Jun 9, 2023 at 6:58 PM Joel Jacobson <joel@compiler.org> wrote:
> On Thu, Jun 8, 2023, at 12:19, Tomas Vondra wrote:
>> Would you be interested in helping with / working on some of that? I
>> don't have immediate need for this stuff, so it's not very high on my
>> TODO list.
>
> Sure, I'm willing to help!
>
> I've attached a patch that works on some of the items on your list,
> including some additions to the README.md.
>
> There were a bunch of places where `maxelements / 8` caused bugs
> that had to be changed to do proper integer ceiling division:
>
> -       values = (int32 *) (set->data + set->maxelements / 8);
> +       values = (int32 *) (set->data + (set->maxelements + 7) / 8);
>
> Side note: I wonder if it would be good to add CEIL_DIV and FLOOR_DIV macros
> to the PostgreSQL source code in general, since it's easy to make this mistake,
> and quite verbose/error-prone to write it out manually everywhere.
> Such macros could simplify code in e.g. numeric.c.
>
>> There's a bunch of stuff that needs to be improved to make this properly
>> usable, like:
>>
>> 1) better hash table implementation
> TODO
>
>> 2) input/output functions
> I've attempted to implement these.
> I thought comma-separated values wrapped in curly braces felt like the most natural format,
> for example:
> SELECT '{1,2,3}'::hashset;
>
>> 3) support for other types (now it only works with int32)
> TODO
>
>> 4) I wonder if this might be done as an array-like polymorphic type.
> That would be nice!
> I guess the work-around would be to store the actual value of a non-int type
> in a lookup table, and then hash the int-based primary key of that table.
>
> Do you think later implementing polymorphic type support would
> mean a more or less complete rewrite, or can we carry on with int32 support
> and add it later on?
>
>> 5) more efficient storage format, with versioning etc.
> TODO
>
>> 6) regression tests
> I've added some regression tests.
>
>> Right. IMHO the query language is a separate thing, you still need to
>> evaluate the query somehow - which is where hashset applies.
>
> Good point, I fully agree.
>
> /Joel



Hi, I am quite new to C...
I have 3 questions about the following function:
1. 7691 and 4201 - I assume they are just random prime ints?
2. I don't get the last `return set;`, since the return type should be bool.
3. I don't understand the 13 in `hash = (hash + 13) % set->maxelements;`

static bool
hashset_contains_element(hashset_t *set, int32 value)
{
    int     byte;
    int     bit;
    uint32  hash;
    char   *bitmap;
    int32  *values;

    hash = ((uint32) value * 7691 + 4201) % set->maxelements;

    bitmap = set->data;
    values = (int32 *) (set->data + set->maxelements / 8);

    while (true)
    {
        byte = (hash / 8);
        bit = (hash % 8);

        /* found an empty slot, value is not there */
        if ((bitmap[byte] & (0x01 << bit)) == 0)
            return false;

        /* is it the same value? */
        if (values[hash] == value)
            return true;

        /* move to the next element */
        hash = (hash + 13) % set->maxelements;
    }

    return set;
}




Re: Do we want a hashset type?

From:
"Joel Jacobson"
Date:
On Fri, Jun 9, 2023, at 13:33, jian he wrote:
> Hi, I am quite new to C...
> I have 3 questions about the following function:
> 1. 7691 and 4201 - I assume they are just random prime ints?

Yes, 7691 and 4201 are likely chosen as random prime numbers.
In hash functions, prime numbers are often used to help in evenly distributing
the hash values across the range and reduce the chance of collisions.

> 2. I don't get the last `return set;`, since the return type should be bool.

Thanks, you found a mistake!

The line

    return set;

is actually unreachable and should be removed.
The function will always return either true or false within the while loop and
never reach the final return statement.

I've attached a new incremental patch with this fix.

> 3. I don't understand the 13 in `hash = (hash + 13) % set->maxelements;`

The value 13 is used for linear probing [1] in handling hash collisions.
Linear probing sequentially checks the next slot in the array when a collision
occurs. 13, being a small prime number not near a power of 2, helps in uniformly
distributing data and ensuring that all slots are probed, as it's relatively prime
to the hash table size.

Hm, I realise we actually don't ensure the hash table size and step size (13)
are coprime. For example, if maxelements were 26, gcd(13, 26) = 13 and the
probe sequence would only ever visit two slots. I've fixed that in the
attached patch as well.

[1] https://en.wikipedia.org/wiki/Linear_probing

/Joel
Attachments

Re: Do we want a hashset type?

From:
Andrew Dunstan
Date:


On 2023-06-09 Fr 07:56, Joel Jacobson wrote:
> On Fri, Jun 9, 2023, at 13:33, jian he wrote:
>> Hi, I am quite new to C...
>> I have 3 questions about the following function:
>> 1. 7691 and 4201 - I assume they are just random prime ints?
>
> Yes, 7691 and 4201 are likely chosen as random prime numbers.
> In hash functions, prime numbers are often used to help in evenly distributing
> the hash values across the range and reduce the chance of collisions.
>
>> 2. I don't get the last `return set;`, since the return type should be bool.
>
> Thanks, you found a mistake!
>
> The line
>
>     return set;
>
> is actually unreachable and should be removed.
> The function will always return either true or false within the while loop and
> never reach the final return statement.
>
> I've attached a new incremental patch with this fix.
>
>> 3. I don't understand the 13 in `hash = (hash + 13) % set->maxelements;`
>
> The value 13 is used for linear probing [1] in handling hash collisions.
> Linear probing sequentially checks the next slot in the array when a collision
> occurs. 13, being a small prime number not near a power of 2, helps in uniformly
> distributing data and ensuring that all slots are probed, as it's relatively prime
> to the hash table size.
>
> Hm, I realise we actually don't ensure the hash table size and step size (13)
> are coprime. I've fixed that in the attached patch as well.





Maybe you can post a full patch as well as incremental?

Stylistically I think you should reduce reliance on magic numbers (like 13). Probably need some #define's?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Do we want a hashset type?

From:
jian he
Date:

In function hashset_in:

    int32 value = strtol(str, &endptr, 10);

there is no int32 value range check?
Imitating src/backend/utils/adt/int.c, the following is what I came up with:

    int64 value = strtol(str, &endptr, 10);

    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        ereturn(fcinfo->context, (Datum) 0,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("value \"%s\" is out of range for type %s", str,
                        "integer")));

    set = hashset_add_element(set, (int32) value);

Also, hashset_in will loop forever if you supply a malformed value, for example:

    select '{1,2s}'::hashset;

I need kill -9 to kill the process.





Re: Do we want a hashset type?

From:
Tomas Vondra
Date:

On 6/9/23 12:58, Joel Jacobson wrote:
> On Thu, Jun 8, 2023, at 12:19, Tomas Vondra wrote:
>> Would you be interested in helping with / working on some of that? I
>> don't have immediate need for this stuff, so it's not very high on my
>> TODO list.
> 
> Sure, I'm willing to help!
> 
> I've attached a patch that works on some of the items on your list,
> including some additions to the README.md.
> 
> There were a bunch of places where `maxelements / 8` caused bugs
> that had to be changed to do proper integer ceiling division:
> 
> -       values = (int32 *) (set->data + set->maxelements / 8);
> +       values = (int32 *) (set->data + (set->maxelements + 7) / 8);
> 
> Side note: I wonder if it would be good to add CEIL_DIV and FLOOR_DIV macros
> to the PostgreSQL source code in general, since it's easy to make this mistake,
> and quite verbose/error-prone to write it out manually everywhere.
> Such macros could simplify code in e.g. numeric.c.
> 

Yeah, it'd be good to have macros to calculate the sizes. We'll need
that in many places.

>> There's a bunch of stuff that needs to be improved to make this properly
>> usable, like:
>>
>> 1) better hash table implementation
> TODO
> 
>> 2) input/output functions
> I've attempted to implement these.
> I thought comma-separated values wrapped in curly braces felt like the most natural format,
> for example:
> SELECT '{1,2,3}'::hashset;
> 

+1 to that. I'd mimic the array in/out functions as much as possible.

>> 3) support for other types (now it only works with int32)
> TODO
> 

I think we should decide what types we want/need to support, and add one
or two types early. Otherwise we'll have code / on-disk format making
various assumptions about the type length etc.

I have no idea what types people use as node IDs - is it likely we'll
need to support types passed by reference / varlena types? Or can we
just assume it's int/bigint?

>> 4) I wonder if this might be done as an array-like polymorphic type.
> That would be nice!
> I guess the work-around would be to store the actual value of non-int type
> in a lookup table, and then hash the int-based primary key in such table.
> 
> Do you think later implementing polymorphic type support would
> mean a more or less complete rewrite, or can we carry on with int32-support
> and add it later on?
> 

I don't know. From the storage perspective it doesn't matter much, I
think, we would not need to change that. But I think adding a
polymorphic type (array-like) would require changes to grammar, and
that's not possible for an extension. If there's a way, I'm not aware of
it and I don't recall an extension doing that.

>> 5) more efficient storage format, with versioning etc.
> TODO
> 

I think the main question is whether to serialize the hash table as is,
or compact it in some way. The current code just uses the same thing for
both cases - on-disk format and in-memory representation (aggstate).
That's simple, but it also means the on-disk value is likely not well
compressible (because it's ~50% random data).

I've thought about serializing just the values (as a simple array), but
that defeats the whole purpose of fast membership checks. I have two ideas:

a) sort the data, and use binary search for this compact variant (and
then expand it into "full" hash table if needed)

b) use a more compact hash table (with load factor much closer to 1.0)

Not sure which of these options is the right one; each has a cost for
converting between formats (but large on-disk value is not free either).

That's roughly what I did for the tdigest extension.
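
For (a), the membership test on the compact serialized form would be a plain binary search over the sorted values - a minimal sketch (layout hypothetical):

static bool
sorted_contains(const int32 *values, int nvalues, int32 value)
{
    int     lo = 0,
            hi = nvalues - 1;

    while (lo <= hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (values[mid] == value)
            return true;
        if (values[mid] < value)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}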


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

От
Tomas Vondra
Дата:

On 6/10/23 17:46, Andrew Dunstan wrote:
> 
> On 2023-06-09 Fr 07:56, Joel Jacobson wrote:
>> On Fri, Jun 9, 2023, at 13:33, jian he wrote:
>> > Hi, I am quite new about C.....
>> > The following function I have 3 questions.
>> > 1. 7691,4201, I assume they are just random prime ints?
>>
>> Yes, 7691 and 4201 are likely chosen as random prime numbers.
>> In hash functions, prime numbers are often used to help in evenly
>> distributing
>> the hash values across the range and reduce the chance of collisions.
>>
>> > 2. I don't get the last return set, even the return type should be bool.
>>
>> Thanks, you found a mistake!
>>
>> The line
>>
>>     return set;
>>
>> is actually unreachable and should be removed.
>> The function will always return either true or false within the while
>> loop and
>> never reach the final return statement.
>>
>> I've attached a new incremental patch with this fix.
>>
>> > 3. I don't understand 13 in hash = (hash + 13) % set->maxelements;
>>
>> The value 13 is used for linear probing [1] in handling hash collisions.
>> Linear probing sequentially checks the next slot in the array when a
>> collision
>> occurs. 13, being a small prime number not near a power of 2, helps in
>> uniformly
>> distributing data and ensuring that all slots are probed, as it's
>> relatively prime
>> to the hash table size.
>>
>> Hm, I realise we actually don't ensure the hash table size and step
>> size (13)
>> are coprime. I've fixed that in the attached patch as well.
>>
>> [1] https://en.wikipedia.org/wiki/Linear_probing
>>
>>
> 
> 
> Maybe you can post a full patch as well as incremental?
> 

I wonder if we should keep discussing this extension here, considering
it's going to be out of core (at least for now). Not sure how many
pgsql-hackers are interested in this, so maybe we should just move it to
github PRs or something ...


> Stylistically I think you should reduce reliance on magic numbers (like
> 13). Probably need some #define's?
> 

Yeah, absolutely. This was just pure laziness.
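
For illustration, the probing discussed upthread boils down to something like this (identifiers hypothetical, constants as in the quoted code; the step must stay coprime to the table size so every slot gets visited):

#define HASHSET_STEP 13

static bool
hashset_contains_element(const int32 *values, const char *bitmap,
                         int maxelements, int32 value)
{
    int     hash = (((unsigned int) value) * 7691 + 4201) % maxelements;

    for (int i = 0; i < maxelements; i++)
    {
        if ((bitmap[hash / 8] & (1 << (hash % 8))) == 0)
            return false;       /* empty slot, value not present */
        if (values[hash] == value)
            return true;
        hash = (hash + HASHSET_STEP) % maxelements;
    }
    return false;               /* scanned every slot */
}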


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Sat, Jun 10, 2023, at 22:26, Tomas Vondra wrote:
> On 6/10/23 17:46, Andrew Dunstan wrote:
>> 
>> Maybe you can post a full patch as well as incremental?
>> 
>
> I wonder if we should keep discussing this extension here, considering
> it's going to be out of core (at least for now). Not sure how many
> pgsql-hackers are interested in this, so maybe we should just move it to
> github PRs or something ...

I think there are some good arguments that speak in favour of including it in core:

1. It's a fundamental data structure. Perhaps "set" would have been a better name,
since the use of hash functions from an end-user perspective is an implementation
detail, but we cannot use that word since it's a reserved keyword, hence "hashset".

2. The addition of SQL/PGQ in SQL:2023 is evidence of a general perceived need
among SQL users to evaluate graph queries. Even if a future implementation of SQL/PGQ
would mean users wouldn't need to deal with the hashset type directly, the same
type could hopefully be used, in part or in whole, under the hood by the future
SQL/PGQ implementation. If low-level functionality is useful on its own, I think it's
a benefit to expose it to users, since it allows system testing of the component
in isolation, even if it's primarily going to be used as a smaller part of a larger,
more high-level component.

3. I think there is a general need for hashset, experienced by myself, Andrew and,
I would guess, lots of other users. The general pattern that will be improved is
when you currently would do array_agg(DISTINCT ...); probably there are other
situations too, since it's a fundamental data structure.
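
For example (schema hypothetical, aggregate as in the current patch):

-- today: deduplicate into an array
SELECT node_id, array_agg(DISTINCT friend_id) AS friends
FROM edges
GROUP BY node_id;

-- with the set type, deduplication is inherent to the aggregate:
SELECT node_id, hashset(friend_id) AS friends
FROM edges
GROUP BY node_id;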

On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>> 3) support for other types (now it only works with int32)
> I think we should decide what types we want/need to support, and add one
> or two types early. Otherwise we'll have code / on-disk format making
> various assumptions about the type length etc.
>
> I have no idea what types people use as node IDs - is it likely we'll
> need to support types passed by reference / varlena types? Or can we
> just assume it's int/bigint?

I think we should just support data types that would be sensible
to use as a PRIMARY KEY in a fully normalised data model,
which I believe would only include "int", "bigint" and "uuid".

/Joel



Re: Do we want a hashset type?

От
Andrew Dunstan
Дата:


On 2023-06-11 Su 06:26, Joel Jacobson wrote:
On Sat, Jun 10, 2023, at 22:26, Tomas Vondra wrote:
On 6/10/23 17:46, Andrew Dunstan wrote:
Maybe you can post a full patch as well as incremental?

I wonder if we should keep discussing this extension here, considering
it's going to be out of core (at least for now). Not sure how many
pgsql-hackers are interested in this, so maybe we should just move it to
github PRs or something ...
I think there are some good arguments that speak in favour of including it in core:

1. It's a fundamental data structure. 


That's reason enough IMNSHO.


Perhaps "set" would have been a better name,
since the use of hash functions from an end-user perspective is implementation
details, but we cannot use that word since it's a reserved keyword, hence "hashset".


Rather than use "hashset", which as you say is based on an implementation detail, I would prefer something like "integer_set" - what it's a set of.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Do we want a hashset type?

От
Tomas Vondra
Дата:

On 6/11/23 12:26, Joel Jacobson wrote:
> On Sat, Jun 10, 2023, at 22:26, Tomas Vondra wrote:
>> On 6/10/23 17:46, Andrew Dunstan wrote:
>>>
>>> Maybe you can post a full patch as well as incremental?
>>>
>>
>> I wonder if we should keep discussing this extension here, considering
>> it's going to be out of core (at least for now). Not sure how many
>> pgsql-hackers are interested in this, so maybe we should just move it to
>> github PRs or something ...
> 
> I think there are some good arguments that speak in favour of including it in core:
> 
> 1. It's a fundamental data structure. Perhaps "set" would have been a better name,
> since the use of hash functions from an end-user perspective is an implementation
> detail, but we cannot use that word since it's a reserved keyword, hence "hashset".
> 
> 2. The addition of SQL/PGQ in SQL:2023 is evidence of a general perceived need
> among SQL users to evaluate graph queries. Even if a future implementation of SQL/PGQ
> would mean users wouldn't need to deal with the hashset type directly, the same
> type could hopefully be used, in part or in whole, under the hood by the future
> SQL/PGQ implementation. If low-level functionality is useful on its own, I think it's
> a benefit to expose it to users, since it allows system testing of the component
> in isolation, even if it's primarily going to be used as a smaller part of a larger,
> more high-level component.
> 
> 3. I think there is a general need for hashset, experienced by myself, Andrew and,
> I would guess, lots of other users. The general pattern that will be improved is
> when you currently would do array_agg(DISTINCT ...); probably there are other
> situations too, since it's a fundamental data structure.
> 

I agree with all of that, but ...

It's just past feature freeze, so the earliest release this could appear
in is 17, about 15 months away.

Once stuff gets added to core, it's tied to the release cycle, so no new
features in between.

Presumably people would like to use the extension in the release they
already use, without backporting.

Postgres is extensible for a reason, exactly so that we don't need to
have everything in core.

> On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>>> 3) support for other types (now it only works with int32)
>> I think we should decide what types we want/need to support, and add one
>> or two types early. Otherwise we'll have code / on-disk format making
>> various assumptions about the type length etc.
>>
>> I have no idea what types people use as node IDs - is it likely we'll
>> need to support types passed by reference / varlena types? Or can we
>> just assume it's int/bigint?
> 
> I think we should just support data types that would be sensible
> to use as a PRIMARY KEY in a fully normalised data model,
> which I believe would only include "int", "bigint" and "uuid".
> 

OK, makes sense.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Sun, Jun 11, 2023, at 16:58, Andrew Dunstan wrote:
>>On 2023-06-11 Su 06:26, Joel Jacobson wrote:
>>Perhaps "set" would have been a better name, since the use of hash functions from an end-user perspective is >>implementation details, but we cannot use that word since it's a reserved keyword, hence "hashset".
>
>Rather than use "hashset", which as you say is based on an implementation detail, I would prefer something like
>"integer_set" - what it's a set of.

Apologies for the confusion previously.
Upon further reflection, I recognize that the term "hash" in "hashset"
extends beyond merely describing the implementation.
It imparts key information about performance characteristics as well as functionality inherent to the set.

In hindsight, "hashset" does emerge as the most suitable terminology.

/Joel

Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Sun, Jun 11, 2023, at 17:03, Tomas Vondra wrote:
> On 6/11/23 12:26, Joel Jacobson wrote:
>> I think there are some good arguments that speak in favour of including it in core:
...
>
> I agree with all of that, but ...
>
> It's just past feature freeze, so the earliest release this could appear
> in is 17, about 15 months away.
>
> Once stuff gets added to core, it's tied to the release cycle, so no new
> features in between.
>
> Presumably people would like to use the extension in the release they
> already use, without backporting.
>
> Postgres is extensible for a reason, exactly so that we don't need to
> have everything in core.

Interesting, I've never thought about this one before:
What if something is deemed to be fundamental and therefore qualifies for core inclusion,
and at the same time is suitable to be made an extension?
Would it be possible to ship such an extension as pre-installed?

What was the json/jsonb story, was it ever an extension before
being included in core?

/Joel



Re: Do we want a hashset type?

От
Andrew Dunstan
Дата:


On 2023-06-11 Su 16:15, Joel Jacobson wrote:
On Sun, Jun 11, 2023, at 17:03, Tomas Vondra wrote:
On 6/11/23 12:26, Joel Jacobson wrote:
I think there are some good arguments that speak in favour of including it in core:
...
I agree with all of that, but ...

It's just past feature freeze, so the earliest release this could appear
in is 17, about 15 months away.

Once stuff gets added to core, it's tied to the release cycle, so no new
features in between.

Presumably people would like to use the extension in the release they
already use, without backporting.

Postgres is extensible for a reason, exactly so that we don't need to
have everything in core.
Interesting, I've never thought about this one before:
What if something is deemed to be fundamental and therefore qualifies for core inclusion,
and at the same time is suitable to be made an extension?
Would it be possible to ship such an extension as pre-installed?

What was the json/jsonb story, was it ever an extension before
being included in core?


No, and the difficulty is that an in-core type and associated functions will have different oids, so migrating from one to the other would be at best painful.

So it's a kind of now or never decision. I think extensions are excellent for specialized types. But I don't regard a set type in that light.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Do we want a hashset type?

От
Tomas Vondra
Дата:

On 6/11/23 22:15, Joel Jacobson wrote:
> On Sun, Jun 11, 2023, at 17:03, Tomas Vondra wrote:
>> On 6/11/23 12:26, Joel Jacobson wrote:
>>> I think there are some good arguments that speak in favour of including it in core:
> ...
>>
>> I agree with all of that, but ...
>>
>> It's just past feature freeze, so the earliest release this could appear
>> in is 17, about 15 months away.
>>
>> Once stuff gets added to core, it's tied to the release cycle, so no new
>> features in between.
>>
>> Presumably people would like to use the extension in the release they
>> already use, without backporting.
>>
>> Postgres is extensible for a reason, exactly so that we don't need to
>> have everything in core.
> 
> Interesting, I've never thought about this one before:
> What if something is deemed to be fundamental and therefore qualifies for core inclusion,
> and at the same time is suitable to be made an extension?
> Would it be possible to ship such an extension as pre-installed?
> 

I think it's always a matter of judgment - I don't think there's some
clear rule set in stone. If something is "fundamental" and can be done in
an extension, there's always the option to have it in contrib (with all
the limitations I mentioned).

FWIW I'm not strictly against adding it to contrib, if there's agreement
it's worth it. But if we consider it to be a fundamental data structure
and widely useful, maybe we should consider making it a built-in data
type, with fixed OID etc. That'd allow support at the SQL grammar level,
and perhaps also from the proposed SQL/PGQ (and GQL?). AFAIK moving data
types from extension (even if from contrib) to core is not painless.

Either way it might be a nice / useful first patch, I guess.

> What was the json/jsonb story, was it ever an extension before
> being included in core?

I don't recall what the exact story was, but I guess the "json" type was
added to core very long ago (before we started to push back a bit), and
we added some SQL grammar stuff too, which can't be done from extension.
So when we added jsonb (much later than json), there wasn't much point
in not having it in core.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

От
Tomas Vondra
Дата:

On 6/12/23 14:46, Andrew Dunstan wrote:
> 
> On 2023-06-11 Su 16:15, Joel Jacobson wrote:
>> On Sun, Jun 11, 2023, at 17:03, Tomas Vondra wrote:
>>> On 6/11/23 12:26, Joel Jacobson wrote:
>>>> I think there are some good arguments that speak in favour of including it in core:
>> ...
>>> I agree with all of that, but ...
>>>
>>> It's just past feature freeze, so the earliest release this could appear
>>> in is 17, about 15 months away.
>>>
>>> Once stuff gets added to core, it's tied to the release cycle, so no new
>>> features in between.
>>>
>>> Presumably people would like to use the extension in the release they
>>> already use, without backporting.
>>>
>>> Postgres is extensible for a reason, exactly so that we don't need to
>>> have everything in core.
>> Interesting, I've never thought about this one before:
>> What if something is deemed to be fundamental and therefore qualifies for core inclusion,
>> and at the same time is suitable to be made an extension?
>> Would it be possible to ship such an extension as pre-installed?
>>
>> What was the json/jsonb story, was it ever an extension before
>> being included in core?
> 
> 
> No, and the difficulty is that an in-core type and associated functions
> will have different oids, so migrating from one to the other would be at
> best painful.
> 
> So it's a kind of now or never decision. I think extensions are
> excellent for specialized types. But I don't regard a set type in that
> light.
> 

Perhaps. So you're proposing to have this as a regular built-in type?
It's hard for me to judge how popular this feature would be, but I guess
people often use arrays while they actually want set semantics ...

If we do that, I wonder if we could do something similar to arrays, with
the polymorphism and SQL grammar support. Does SQL have any concept of
sets (or arrays, for that matter) as data types?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

От
Andrew Dunstan
Дата:


On 2023-06-12 Mo 09:00, Tomas Vondra wrote:

What was the json/jsonb story, was it ever an extension before
being included in core?
I don't recall what the exact story was, but I guess the "json" type was
added to core very long ago (before we started to push back a bit), and
we added some SQL grammar stuff too, which can't be done from extension.
So when we added jsonb (much later than json), there wasn't much point
in not having it in core.



Not quite.

The json type was added in 9.2 (Sept 2012) and jsonb in 9.4 (Dec 2014). I wouldn't call those far apart or very long ago. Neither included any grammar changes AFAIR.

But if they had been added as extensions we'd probably be in a whole lot more trouble now implementing SQL/JSON, so whether that was foresight or laziness, I think we landed on our feet there.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Sat, Jun 10, 2023, at 17:46, Andrew Dunstan wrote:
> Maybe you can post a full patch as well as incremental?

Attached patch is based on tvondra's last commit 375b072.

> Stylistically I think you should reduce reliance on magic numbers (like 13). Probably need some #define's?

Great idea, fixed, I've added a HASHSET_STEP definition (set to the value 13).

On Sat, Jun 10, 2023, at 17:51, jian he wrote:
> int32 value = strtol(str, &endptr, 10);
> there is no int32 value range check? 
> imitate src/backend/utils/adt/int.c. the following way is what I came up with.
>
>
> int64 value = strtol(str, &endptr, 10);
>
> if (errno == ERANGE || value < INT_MIN || value > INT_MAX)

Thanks, fixed like suggested, except I used PG_INT32_MIN and PG_INT32_MAX,
which explicitly represent the maximum value for a 32-bit integer,
regardless of the platform or C implementation.

> also it will infinity loop in hashset_in if supply the wrong value.... 
> example select  '{1,2s}'::hashset;
> I need kill -9 to kill the process. 

Thanks. I've added a new test, `sql/invalid.sql` with that example query.

Here is a summary of all other changes:
 
* README.md: Added sections Usage, Data types, Functions and Aggregate Functions

* Added test/ directory with some tests.

* Added "not production-ready" notice at top of README, warning for breaking
changes and no migration scripts, until our first release.

* Change version from 1.0.0 to 0.0.1 to indicate current status.

* Added CEIL_DIV macro

* Implemented hashset_in(), hashset_out()
  The syntax for the serialized format is comma separated integer values
  wrapped in curly braces, e.g. '{1,2,3}'

* Implemented hashset_recv() to match the existing hashset_send()

* Removed/rewrote some tdigest related comments

/Joel
Attachments

Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>> 1) better hash table implementation

I noticed src/include/common/hashfn.h provides an implementation
of the Jenkins/lookup3 hash function, and thought maybe
we could simply use it in hashset?

However, I noticed that according to SMHasher [1],
Jenkins/lookup3 has some quality problems:
"UB, 28% bias, collisions, 30% distr, BIC"

Not sure if that's true or maybe not a problem in the PostgreSQL implementation?

According to SMHasher, the two fastest 32/64-bit hash functions
for non-cryptographic purposes without any quality problems
that are also portable seem to be these two:

wyhash v4.1 (64-bit) [2]
MiB/sec: 22513.04
cycl./hash: 29.01
size: 474

xxh3low (xxHash v3, 64-bit, low 32-bits part) [3]
MiB/sec: 20722.94
cycl./hash: 30.26
size: 756

[1] https://github.com/rurban/smhasher
[2] https://github.com/wangyi-fudan/wyhash
[3] https://github.com/Cyan4973/xxHash

>>> 5) more efficient storage format, with versioning etc.
> I think the main question is whether to serialize the hash table as is,
> or compact it in some way. The current code just uses the same thing for
> both cases - on-disk format and in-memory representation (aggstate).
> That's simple, but it also means the on-disk value is likely not well
> compressible (because it's ~50% random data).
>
> I've thought about serializing just the values (as a simple array), but
> that defeats the whole purpose of fast membership checks. I have two ideas:
>
> a) sort the data, and use binary search for this compact variant (and
> then expand it into "full" hash table if needed)
>
> b) use a more compact hash table (with load factor much closer to 1.0)
>
> Not sure which of these options is the right one; each has a cost for
> converting between formats (but large on-disk value is not free either).
>
> That's roughly what I did for the tdigest extension.

Is the choice of hash function (and its in-memory representation)
independent of the on-disk format question, i.e. could we work
on experimenting and evaluating different hash functions first,
to optimise for speed and quality, and then when done, proceed
and optimise for space, or are the two intertwined somehow?

/Joel



Re: Do we want a hashset type?

От
Tomas Vondra
Дата:

On 6/12/23 19:34, Joel Jacobson wrote:
> On Sat, Jun 10, 2023, at 22:12, Tomas Vondra wrote:
>>>> 1) better hash table implementation
> 
> I noticed src/include/common/hashfn.h provides an implementation
> of the Jenkins/lookup3 hash function, and thought maybe
> we could simply use it in hashset?
> 
> However, I noticed that according to SMHasher [1],
> Jenkins/lookup3 has some quality problems:
> "UB, 28% bias, collisions, 30% distr, BIC"
> 
> Not sure if that's true or maybe not a problem in the PostgreSQL implementation?
> 
> According to SMHasher, the two fastest 32/64-bit hash functions
> for non-cryptographic purposes without any quality problems
> that are also portable seem to be these two:
> 
> wyhash v4.1 (64-bit) [2]
> MiB/sec: 22513.04
> cycl./hash: 29.01
> size: 474
> 
> xxh3low (xxHash v3, 64-bit, low 32-bits part) [3]
> MiB/sec: 20722.94
> cycl./hash: 30.26
> size: 756
> 
> [1] https://github.com/rurban/smhasher
> [2] https://github.com/wangyi-fudan/wyhash
> [3] https://github.com/Cyan4973/xxHash
> 

But those are numbers for large keys - if you restrict the input to
4B-16B (which is what we planned to do by focusing on int, bigint and
uuid), there's no significant difference:

lookup3:

Small key speed test -    4-byte keys -    30.17 cycles/hash
Small key speed test -    8-byte keys -    31.00 cycles/hash
Small key speed test -   16-byte keys -    49.00 cycles/hash

xxh3low:

Small key speed test -    4-byte keys -    29.00 cycles/hash
Small key speed test -    8-byte keys -    29.58 cycles/hash
Small key speed test -   16-byte keys -    37.00 cycles/hash

But you can try doing some measurements, of course. Or just do profiling
to see how much time we spend in the hash function - I'd bet it's pretty
tiny fraction of the total time.

As for the "quality" issues - it's the same algorithm in Postgres, so it
has the same issues. I don't if that has measurable impact, though. I'd
guess it does not, particularly for "reasonably small" sets.

>>>> 5) more efficient storage format, with versioning etc.
>> I think the main question is whether to serialize the hash table as is,
>> or compact it in some way. The current code just uses the same thing for
>> both cases - on-disk format and in-memory representation (aggstate).
>> That's simple, but it also means the on-disk value is likely not well
>> compressible (because it's ~50% random data).
>>
>> I've thought about serializing just the values (as a simple array), but
>> that defeats the whole purpose of fast membership checks. I have two ideas:
>>
>> a) sort the data, and use binary search for this compact variant (and
>> then expand it into "full" hash table if needed)
>>
>> b) use a more compact hash table (with load factor much closer to 1.0)
>>
>> Not sure which of these options is the right one; each has a cost for
>> converting between formats (but large on-disk value is not free either).
>>
>> That's roughly what I did for the tdigest extension.
> 
> Is the choice of hash function (and its in-memory representation)
> independent of the on-disk format question, i.e. could we work
> on experimenting and evaluating different hash functions first,
> to optimise for speed and quality, and then when done, proceed
> and optimise for space, or are the two intertwined somehow?
> 

Not sure what you mean by "optimizing for space" - I imagined the
on-disk format would just use the same hash table with a tiny amount of
free space (say 10% and not ~50%).


My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
(through hash_bytes or something), and at the same time make it possible
to switch to a different function in the future. I'd store an ID of the
hash function in the set, so that we can support a different algorithm
in the future, if we choose to.
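
Something along these lines, perhaps (field names are just a sketch, not the actual struct in the extension):

typedef struct int4hashset_t
{
    int32   vl_len_;        /* varlena header */
    int32   version;        /* on-disk format version */
    int32   hashfn_id;      /* ID of the hash function used */
    int32   maxelements;    /* number of slots */
    int32   nelements;      /* number of occupied slots */
    char    data[FLEXIBLE_ARRAY_MEMBER];    /* bitmap + values */
} int4hashset_t;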


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Mon, Jun 12, 2023, at 21:58, Tomas Vondra wrote:
> But those are numbers for large keys - if you restrict the input to
> 4B-16B (which is what we planned to do by focusing on int, bigint and
> uuid), there's no significant difference:

Oh, sorry, I completely failed to read the meaning of the columns.

> lookup3:
>
> Small key speed test -    4-byte keys -    30.17 cycles/hash
> Small key speed test -    8-byte keys -    31.00 cycles/hash
> Small key speed test -   16-byte keys -    49.00 cycles/hash
>
> xxh3low:
>
> Small key speed test -    4-byte keys -    29.00 cycles/hash
> Small key speed test -    8-byte keys -    29.58 cycles/hash
> Small key speed test -   16-byte keys -    37.00 cycles/hash

The winner of the "Small key speed test" competition seems to be:

ahash64 "ahash 64bit":
Small key speed test -    4-byte keys -    24.00 cycles/hash
Small key speed test -    8-byte keys -    24.00 cycles/hash
Small key speed test -   16-byte keys -    26.98 cycles/hash

Looks like it's a popular one, e.g. it's used by Rust in their std::collections::HashSet.

Another notable property of ahash64 is that it's "DOS resistant",
but that isn't crucial for our use case, since we exclusively target
auto-generated primary keys which are not influenced by end-users.

> Not sure what you mean by "optimizing for space" - I imagined the
> on-disk format would just use the same hash table with a tiny amount of
> free space (say 10% and not ~50%).

With "optimizing for space" I meant trying to find some alternative or
intermediate data structure that is more likely to be compressible,
like your idea of sorting the data.
What I wondered was if the on-disk format would be affected by
the choice of hash function. I guess it wouldn't, if the hashset
is created by adding the elements one-by-one by iterating
through the elements by reading the on-disk format.
But I thought maybe something more advanced could be
done, where conversion between the on-disk format
and the in-memory format could be done without naively
iterating through all elements, i.e. something less complex
than O(n).
No idea what that mechanism would be though.

> My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
> (through hash_bytes or something), and at the same time make it possible
> to switch to a different function in the future. I'd store an ID of the
> hash function in the set, so that we can support a different algorithm
> in the future, if we choose to.

Sounds good to me.

Smart idea to include the hash function algorithm ID in the set,
to allow implementing a different one in the future!

/Joel



Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Mon, Jun 12, 2023, at 22:36, Joel Jacobson wrote:
> On Mon, Jun 12, 2023, at 21:58, Tomas Vondra wrote:
>> My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
>> (through hash_bytes or something), and at the same time make it possible
>> to switch to a different function in the future. I'd store an ID of the
>> hash function in the set, so that we can support a different algorithm
>> in the future, if we choose to.

hashset is now using hash_bytes_uint32() from hashfn.h

Other changes in the same commit:

* Introduce hashfn_id field to specify hash function ID
* Implement hashset_send and hashset_recv and add C-test using libpq
* Add operators and operator classes for hashset comparison, sorting
  and distinct queries

Looks good? If so, I wonder what's best to focus on next?
Perhaps adding support for bigint? Other ideas?

/Joel
Attachments

Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Tue, Jun 13, 2023, at 20:50, Joel Jacobson wrote:
> hashset is now using hash_bytes_uint32() from hashfn.h

I spotted a problem in the ordering logic of the comparison functions.

The issue was with handling hashsets containing empty positions,
causing non-lexicographic ordering.

The updated implementation now correctly iterates over the hashsets,
skipping any empty positions, which results in proper comparison
and ordering of elements present in the hashset.

New patch attached.

Attachments

Re: Do we want a hashset type?

От
Tom Dunstan
Дата:
On Mon, 12 Jun 2023 at 22:37, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Perhaps. So you're proposing to have this as a regular built-in type?
It's hard for me to judge how popular this feature would be, but I guess
people often use arrays while they actually want set semantics ...

Perspective from a potential user: I'm currently working on something where an array-like structure with fast membership test performance would be very useful. The main type of query is doing an =ANY(the set) filter, where the set could contain anywhere from very few to thousands of entries (ints in our case). So we'd want the same index usage as =ANY(array) but would like faster row checking than we get with an array when other indexes are used.

Our app runs connecting to either an embedded postgres database that we control or an external one controlled by customers - this is typically RDS or some other cloud vendor's DB. Having such a type as a separate extension would make it unusable for us until all relevant cloud vendors decided that it was popular enough to include - something that may never happen, or even if it did, not any time soon.

Cheers

Tom

Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Wed, Jun 14, 2023, at 06:31, Tom Dunstan wrote:
> On Mon, 12 Jun 2023 at 22:37, Tomas Vondra 
> <tomas.vondra@enterprisedb.com> wrote:
>> Perhaps. So you're proposing to have this as a regular built-in type?
>> It's hard for me to judge how popular this feature would be, but I guess
>> people often use arrays while they actually want set semantics ...
>
> Perspective from a potential user: I'm currently working on something 
> where an array-like structure with fast membership test performance 
> would be very useful. The main type of query is doing an =ANY(the set) 
> filter, where the set could contain anywhere from very few to thousands 
> of entries (ints in our case). So we'd want the same index usage as 
> =ANY(array) but would like faster row checking than we get with an 
> array when other indexes are used.

Thanks for providing an interesting use-case.

If you would like to help, one thing that would be helpful
would be a complete runnable SQL script
that demonstrates exactly the various array-based queries
you currently use, with random data that resembles
reality as closely as possible, i.e. the same number of rows
in the tables, and similar distribution of values etc.

This would be helpful in terms of documentation,
as I think it would be good to provide Usage examples
that are based on real-life scenarios.

It would also be helpful to create realistic benchmarks when
evaluating and optimising the performance.

> Our app runs connecting to either an embedded postgres database that we 
> control or an external one controlled by customers - this is typically 
> RDS or some other cloud vendor's DB. Having such a type as a separate 
> extension would make it unusable for us until all relevant cloud 
> vendors decided that it was popular enough to include - something that 
> may never happen, or even if it did, not any time soon.

Good point.



Re: Do we want a hashset type?

От
Tomas Vondra
Дата:
On 6/14/23 06:31, Tom Dunstan wrote:
> On Mon, 12 Jun 2023 at 22:37, Tomas Vondra
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
> wrote:
> 
>     Perhaps. So you're proposing to have this as a regular built-in type?
>     It's hard for me to judge how popular this feature would be, but I guess
>     people often use arrays while they actually want set semantics ...
> 
> 
> Perspective from a potential user: I'm currently working on something
> where an array-like structure with fast membership test performance
> would be very useful. The main type of query is doing an =ANY(the set)
> filter, where the set could contain anywhere from very few to thousands
> of entries (ints in our case). So we'd want the same index usage as
> =ANY(array) but would like faster row checking than we get with an array
> when other indexes are used.
> 

We kinda already do this since PG14 (commit 50e17ad281), actually. If
the list is long enough (9 values or more), we'll build a hash table
during query execution. So pretty much exactly what you're asking for.
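
That is, on PG14+ a query like this already gets a hashed lookup once the list has 9 or more values (table/column names hypothetical):

SELECT *
FROM users
WHERE id = ANY ('{1,2,3,4,5,6,7,8,9}'::int4[]);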

> Our app runs connecting to either an embedded postgres database that we
> control or an external one controlled by customers - this is typically
> RDS or some other cloud vendor's DB. Having such a type as a separate
> extension would make it unusable for us until all relevant cloud vendors
> decided that it was popular enough to include - something that may never
> happen, or even if it did, not any time soon.
> 

Understood, but that's really a problem / choice of the cloud vendors.

The thing is, adding stuff to core is not free - it means the community
becomes responsible for maintenance, testing, fixing issues, etc. It's
not feasible (or desirable) to have all extensions in core, and cloud
vendors generally do have ways to support some pre-vetted extensions
that they deem useful enough. Granted, it means vetting/maintenance for
them, but that's kinda the point of managed services. And it'd not be
free for us either.

Anyway, that's mostly irrelevant, as PG14 already does the hash table
for this kind of query. And I'm not strictly against adding some of
this into core, if it ends up being useful enough.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Wed, Jun 14, 2023, at 11:44, Tomas Vondra wrote:
>> Perspective from a potential user: I'm currently working on something
>> where an array-like structure with fast membership test performance
>> would be very useful. The main type of query is doing an =ANY(the set)
>> filter, where the set could contain anywhere from very few to thousands
>> of entries (ints in our case). So we'd want the same index usage as
>> =ANY(array) but would like faster row checking than we get with an array
>> when other indexes are used.
>> 
>
> We kinda already do this since PG14 (commit 50e17ad281), actually. If
> the list is long enough (9 values or more), we'll build a hash table
> during query execution. So pretty much exactly what you're asking for.

Would it be feasible to teach the planner to utilize the internal hash table of
hashset directly? In the case of arrays, the hash table construction is an
ad hoc operation, whereas with hashset, the hash table already exists, which
could potentially lead to a faster execution.

Essentially, the aim would be to support:

=ANY(hashset)

Instead of the current:

=ANY(hashset_to_array(hashset))

Thoughts?

/Joel



Re: Do we want a hashset type?

От
Tomas Vondra
Дата:

On 6/14/23 14:57, Joel Jacobson wrote:
> On Wed, Jun 14, 2023, at 11:44, Tomas Vondra wrote:
>>> Perspective from a potential user: I'm currently working on something
>>> where an array-like structure with fast membership test performance
>>> would be very useful. The main type of query is doing an =ANY(the set)
>>> filter, where the set could contain anywhere from very few to thousands
>>> of entries (ints in our case). So we'd want the same index usage as
>>> =ANY(array) but would like faster row checking than we get with an array
>>> when other indexes are used.
>>>
>>
>> We kinda already do this since PG14 (commit 50e17ad281), actually. If
>> the list is long enough (9 values or more), we'll build a hash table
>> during query execution. So pretty much exactly what you're asking for.
> 
> Would it be feasible to teach the planner to utilize the internal hash table of
> hashset directly? In the case of arrays, the hash table construction is an
> ad hoc operation, whereas with hashset, the hash table already exists, which
> could potentially lead to a faster execution.
> 
> Essentially, the aim would be to support:
> 
> =ANY(hashset)
> 
> Instead of the current:
> 
> =ANY(hashset_to_array(hashset))
> 
> Thoughts?

That should be possible, but probably only when hashset is a built-in
data type (maybe polymorphic).

I don't know if it'd be worth it; the general idea is that building the
hash table is way cheaper than repeated lookups in an array. Yeah, it
might save something, but likely only a tiny fraction of the runtime.

It's definitely something I'd leave out of v0, personally.

=ANY(set) should probably work with an implicit ARRAY cast, I believe.
It'll do the ad hoc build, of course.
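
I.e. something along these lines should already work via the array representation (table/column names hypothetical, using the hashset_to_array() helper mentioned above):

SELECT *
FROM users
WHERE id = ANY (hashset_to_array('{1,2,3}'::int4hashset));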


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

От
Andrew Dunstan
Дата:


On 2023-06-14 We 05:44, Tomas Vondra wrote:

The thing is, adding stuff to core is not free - it means the community
becomes responsible for maintenance, testing, fixing issues, etc. It's
not feasible (or desirable) to have all extensions in core, and cloud
vendors generally do have ways to support some pre-vetted extensions
that they deem useful enough. Granted, it means vetting/maintenance for
them, but that's kinda the point of managed services. And it'd not be
free for us either.


I agree it's a judgement call. But the code burden here seems pretty small, far smaller than, say, the SQL/JSON patches. And I think the range of applications that could benefit is quite significant.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Wed, Jun 14, 2023, at 15:16, Tomas Vondra wrote:
> On 6/14/23 14:57, Joel Jacobson wrote:
>> Would it be feasible to teach the planner to utilize the internal hash table of
>> hashset directly? In the case of arrays, the hash table construction is an
...
> It's definitely something I'd leave out of v0, personally.

OK, thanks for guidance, I'll stay away from it.

I've been doing some preparatory work on this todo item:

> 3) support for other types (now it only works with int32)

I've renamed the type from "hashset" to "int4hashset",
and the SQL-functions are now prefixed with "int4"
when necessary. The overloaded functions with
int4hashset as input parameters don't need to be prefixed,
e.g. hashset_add(int4hashset, int).

Other changes since last update (4e60615):

* Support creation of empty hashset using '{}'::hashset
* Introduced a new function hashset_capacity() to return the current capacity
  of a hashset.
* Refactored hashset initialization:
  - Replaced hashset_init(int) with int4hashset() to initialize an empty hashset
    with zero capacity.
  - Added int4hashset_with_capacity(int) to initialize a hashset with
    a specified capacity.
* Improved README.md and testing
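
A quick usage sketch of the above (expected outputs assumed from the descriptions):

SELECT '{}'::int4hashset;                                -- empty set
SELECT hashset_capacity(int4hashset());                  -- 0
SELECT hashset_capacity(int4hashset_with_capacity(10));  -- 10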

As a next step, I'm planning on adding int8 support.

Looks and sounds good?

/Joel
Attachments

Re: Do we want a hashset type?

От
Tom Dunstan
Дата:
On Wed, 14 Jun 2023 at 19:14, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> ...So we'd want the same index usage as
> =ANY(array) but would like faster row checking than we get with an array
> when other indexes are used.

We kinda already do this since PG14 (commit 50e17ad281), actually. If
the list is long enough (9 values or more), we'll build a hash table
during query execution. So pretty much exactly what you're asking for.

Ha! That is great. Unfortunately we can't rely on it as we have customers using versions back to 12. But good to know that it's available when we bump the required versions.

Thanks

Tom

Re: Do we want a hashset type?

От
jian he
Дата:


On Thu, Jun 15, 2023 at 5:04 AM Joel Jacobson <joel@compiler.org> wrote:
On Wed, Jun 14, 2023, at 15:16, Tomas Vondra wrote:
> On 6/14/23 14:57, Joel Jacobson wrote:
>> Would it be feasible to teach the planner to utilize the internal hash table of
>> hashset directly? In the case of arrays, the hash table construction is an
...
> It's definitely something I'd leave out of v0, personally.

OK, thanks for guidance, I'll stay away from it.

I've been doing some preparatory work on this todo item:

> 3) support for other types (now it only works with int32)

I've renamed the type from "hashset" to "int4hashset",
and the SQL-functions are now prefixed with "int4"
when necessary. The overloaded functions with
int4hashset as input parameters don't need to be prefixed,
e.g. hashset_add(int4hashset, int).

Other changes since last update (4e60615):

* Support creation of empty hashset using '{}'::hashset
* Introduced a new function hashset_capacity() to return the current capacity
  of a hashset.
* Refactored hashset initialization:
  - Replaced hashset_init(int) with int4hashset() to initialize an empty hashset
    with zero capacity.
  - Added int4hashset_with_capacity(int) to initialize a hashset with
    a specified capacity.
* Improved README.md and testing

As a next step, I'm planning on adding int8 support.

Looks and sounds good?

/Joel

Still playing around with hashset-0.0.1-a8a282a.patch.

I think "postgres.h" should be on the top, (someone have said it on another email thread, I forgot who said that) 

In my local /home/jian/postgres/pg16/include/postgresql/server/libpq/pqformat.h:
/*
 * Append a binary integer to a StringInfo buffer
 *
 * This function is deprecated; prefer use of the functions above.
 */
static inline void
pq_sendint(StringInfo buf, uint32 i, int b)

So I changed to pq_sendint32.
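
i.e. calls like the following (buffer and field names hypothetical):

pq_sendint32(&buf, set->nelements);    /* instead of pq_sendint(&buf, set->nelements, 4) */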

Leading, trailing, and in-between whitespace should be stripped. The following C example seems OK for now, but I am not sure; I don't know how to glue it into hashset_in.

Forgive me the patch name...

/*
gcc /home/jian/Desktop/regress_pgsql/strip_white_space.c && ./a.out
*/

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdbool.h>
#include <ctype.h>
#include <stdlib.h>

/*
 * array_isspace() --- a non-locale-dependent isspace()
 *
 * We used to use isspace() for parsing array values, but that has
 * undesirable results: an array value might be silently interpreted
 * differently depending on the locale setting.  Now we just hard-wire
 * the traditional ASCII definition of isspace().
 */
static bool
array_isspace(char ch)
{
    if (ch == ' ' ||
        ch == '\t' ||
        ch == '\n' ||
        ch == '\r' ||
        ch == '\v' ||
        ch == '\f')
        return true;
    return false;
}

int main(void)
{
    long   *temp = malloc(10 * sizeof(long));
    char    source[5][50] = {{0}};
    char    string_save[50];
    char   *p;

    memset(temp, 0, 10 * sizeof(long));    /* zero the whole allocation, not just 10 bytes */
    snprintf(source[0], sizeof(source[0]), "%s", "  { 1   ,   20  }");
    snprintf(source[1], sizeof(source[1]), "%s", "   { 1      ,20 ,   30 ");
    snprintf(source[2], sizeof(source[2]), "%s", "   {1      ,20 ,   30 ");
    snprintf(source[3], sizeof(source[3]), "%s", "   {1      ,  20 ,   30  }");
    snprintf(source[4], sizeof(source[4]), "%s", "   {1      ,  20 ,   30  } ");

    for (int j = 0; j < 5; j++)
    {
        int     i = 0;

        /* Make a modifiable copy of the input */
        snprintf(string_save, sizeof(string_save), "%s", source[j]);
        p = string_save;

        while (array_isspace(*p))
            p++;
        if (*p != '{')
        {
            printf("line: %d should be {\n", __LINE__);
            exit(EXIT_FAILURE);
        }

        for (;;)
        {
            char   *q;

            if (*p == '{')
                p++;
            temp[i] = strtol(p, &q, 10);
            printf("temp[j=%d] [%d]=%ld\n", j, i, temp[i]);

            if (*q == '}' && *(q + 1) == '\0')
            {
                printf("all works ok now exit\n");
                break;
            }
            if (!array_isspace(*q) && *q != ',')
            {
                printf("wrong format. program will exit\n");
                exit(EXIT_FAILURE);
            }
            while (array_isspace(*q))
                q++;
            if (*q != ',')
                break;
            else
                p = q + 1;
            i++;
        }
    }
    free(temp);
    return 0;
}



Attachments

Re: Do we want a hashset type?

От
jian he
Дата:


On Thu, Jun 15, 2023 at 5:04 AM Joel Jacobson <joel@compiler.org> wrote:
On Wed, Jun 14, 2023, at 15:16, Tomas Vondra wrote:
> On 6/14/23 14:57, Joel Jacobson wrote:
>> Would it be feasible to teach the planner to utilize the internal hash table of
>> hashset directly? In the case of arrays, the hash table construction is an
...
> It's definitely something I'd leave out of v0, personally.

OK, thanks for guidance, I'll stay away from it.

I've been doing some preparatory work on this todo item:

> 3) support for other types (now it only works with int32)

I've renamed the type from "hashset" to "int4hashset",
and the SQL-functions are now prefixed with "int4"
when necessary. The overloaded functions with
int4hashset as input parameters don't need to be prefixed,
e.g. hashset_add(int4hashset, int).

Other changes since last update (4e60615):

* Support creation of empty hashset using '{}'::hashset
* Introduced a new function hashset_capacity() to return the current capacity
  of a hashset.
* Refactored hashset initialization:
  - Replaced hashset_init(int) with int4hashset() to initialize an empty hashset
    with zero capacity.
  - Added int4hashset_with_capacity(int) to initialize a hashset with
    a specified capacity.
* Improved README.md and testing

As a next step, I'm planning on adding int8 support.

Looks and sounds good?

/Joel
I am not sure the following results are correct.
with cte as (
    select hashset(x) as x
            ,hashset_capacity(hashset(x))
            ,hashset_count(hashset(x))
    from generate_series(1,10) g(x))
select *
        ,'|' as delim
        , hashset_add(x,11111::int)
        ,hashset_capacity(hashset_add(x,11111::int))
        ,hashset_count(hashset_add(x,11111::int))
from    cte \gx


results:  
-[ RECORD 1 ]----+-----------------------------
x                | {8,1,10,3,9,4,6,2,11111,5,7}
hashset_capacity | 64
hashset_count    | 10
delim            | |
hashset_add      | {8,1,10,3,9,4,6,2,11111,5,7}
hashset_capacity | 64
hashset_count    | 11


but:
with cte as(select '{1,2}'::int4hashset as x)   select x,hashset_add(x,3::int)  from cte;

returns
   x   | hashset_add
-------+-------------
 {1,2} | {3,1,2}
(1 row)
The last simple query seems more sensible to me.


Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Thu, Jun 15, 2023, at 04:22, jian he wrote:
> Attachments:
> * temp.patch

Thanks for good suggestions.
New patch attached:

Enhance parsing and reorder headers in hashset module

Allow whitespaces in hashset input and reorder the inclusion of
header files, placing PostgreSQL headers first. Additionally, update
deprecated pq_sendint calls to pq_sendint32. Add tests for improved
parsing functionality.

/Joel
Attachments

Re: Do we want a hashset type?

От
jian he
Дата:

In hashset/test/sql/order.sql, can we add the following to test whether the optimizer will use our index?

CREATE INDEX  ON test_int4hashset_order (int4hashset_col  int4hashset_btree_ops);

-- to make sure that this works with just two rows
SET enable_seqscan TO off;              

explain(costs off) SELECT * FROM test_int4hashset_order WHERE int4hashset_col = '{1,2}'::int4hashset;
reset enable_seqscan;


Since most contrib modules have only one test file per module, maybe we need to consolidate all the test SQL files into one SQL file (int4hashset.sql)?
--------------
I didn't install the extension directly. I copied hashset--0.0.1.sql to another place and used gcc to compile these functions:
gcc -I/home/jian/postgres/2023_05_25_beta5421/include/server -fPIC -c /home/jian/hashset/hashset.c
gcc -shared  -o /home/jian/hashset/hashset.so /home/jian/hashset/hashset.o
Then I modified hashset--0.0.1.sql and ran \i fullsqlfilename in psql to create these functions and types.

Because even make PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config still fails with an error:
 fatal error: libpq-fe.h: No such file or directory
    3 | #include <libpq-fe.h>


Is there any way to put test_send_recv.c into a SQL test file? 
Attached is a patch that slightly modifies README.md. Feel free to change it, since I am not a native English speaker... 

Attachments

Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Thu, Jun 15, 2023, at 11:44, jian he wrote:
> I didn't install the extension directly. I copied the 
> hashset--0.0.1.sql to another place, using gcc to compile these 
> functions. 
..
> Because even make 
> PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config still 
> has an error.
>  fatal error: libpq-fe.h: No such file or directory
>     3 | #include <libpq-fe.h>

What platform are you on?
You seem to be missing the postgresql dev package.
For instance, here is how to compile and install the extension on Ubuntu 22.04.1 LTS:

sudo apt install postgresql-15 postgresql-server-dev-15 postgresql-client-15
git clone https://github.com/tvondra/hashset.git
cd hashset
make
sudo make install
make installcheck

> Is there any way to put test_send_recv.c to sql test file? 

Unfortunately, there doesn't seem to be a way to test *_recv() functions from SQL,
since they take `internal` as input. The only way I could figure out to test them
was to write a C-program using libpq's binary mode.

I also note that the test_send_recv test was broken; I had forgotten to change
the type from "hashset" to "int4hashset". Fixed in attached commit.

On Ubuntu, you can now run the test by specifying to connect via the UNIX socket:

PGHOST=/var/run/postgresql make run_c_tests
cd test/c_tests && ./test_send_recv.sh
test test_send_recv               ... ok

> Attached is a patch slightly modified README.md. feel free to change, 
> since i am not native english speaker... 
> Attachments:
> * 0001-add-instruction-using-PG_CONFIG-to-install-extension.patch

Thanks, will have a look later.

/Joel
Attachments

Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Thu, Jun 15, 2023, at 06:29, jian he wrote:
> I am not sure the following results are correct.
> with cte as (
>     select hashset(x) as x
>             ,hashset_capacity(hashset(x))
>             ,hashset_count(hashset(x))
>     from generate_series(1,10) g(x))
> select *
>         ,'|' as delim
>         , hashset_add(x,11111::int)
>         ,hashset_capacity(hashset_add(x,11111::int))
>         ,hashset_count(hashset_add(x,11111::int))
> from    cte \gx
>
>
> results:  
> -[ RECORD 1 ]----+-----------------------------
> x                | {8,1,10,3,9,4,6,2,11111,5,7}
> hashset_capacity | 64
> hashset_count    | 10
> delim            | |
> hashset_add      | {8,1,10,3,9,4,6,2,11111,5,7}
> hashset_capacity | 64
> hashset_count    | 11

Nice catch, you found a bug!

Fixed in attached patch:

---
Ensure hashset_add and hashset_merge operate on copied data

Previously, the hashset_add() and hashset_merge() functions were
modifying the original hashset in-place. This was leading to unexpected
results because the original data in the hashset was being altered.

This commit introduces the macro PG_GETARG_INT4HASHSET_COPY(), ensuring
a copy of the hashset is created and modified, leaving the original
hashset untouched.

This adjustment ensures hashset_add() and hashset_merge() operate
correctly on the copied hashset and prevent modification of the
original data.

A new regression test file `reported_bugs.sql` has been added to
validate the proper functionality of these changes. Future reported
bugs and their corresponding tests will also be added to this file.
---

I wonder if this function:

static int4hashset_t *
int4hashset_copy(int4hashset_t *src)
{
    return src;
}

...that was previously named hashset_copy(),
should be implemented to actually copy the struct,
instead of just returning the input?
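
Perhaps something like this minimal sketch (assuming the usual varlena conventions):

static int4hashset_t *
int4hashset_copy(int4hashset_t *src)
{
    int4hashset_t *dst = (int4hashset_t *) palloc(VARSIZE(src));

    memcpy(dst, src, VARSIZE(src));
    return dst;
}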

It is being used by int4hashset_agg_combine() like this:

/* copy the hashset into the right long-lived memory context */
oldcontext = MemoryContextSwitchTo(aggcontext);
src = int4hashset_copy(src);
MemoryContextSwitchTo(oldcontext);

/Joel
Attachments

Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
On Thu, Jun 15, 2023, at 11:44, jian he wrote:
> In hashset/test/sql/order.sql, can we add the following to test whether 
> the optimizer will use our index.
>
> CREATE INDEX  ON test_int4hashset_order (int4hashset_col  
> int4hashset_btree_ops);
>
> -- to make sure that this work with just two rows
> SET enable_seqscan TO off;              
>
> explain(costs off) SELECT * FROM test_int4hashset_order WHERE 
> int4hashset_col = '{1,2}'::int4hashset;
> reset enable_seqscan;

Not sure I can see the value of that test,
since we've already tested the comparison functions,
which are used by the int4hashset_btree_ops operator class.

I think a test that verifies the btree index is actually used,
would more be a test of the query planner than hashset.

I might be missing something here, please tell me if so.

> Since most contrib modules, one module, only one test file, maybe we 
> need to consolidate all the test sql files to one sql file 
> (int4hashset.sql)?

I've also made the same observation; I wonder if it's by design
or by coincidence? I think multiple test files improve modularity,
isolation and the overall organisation of the testing.

As long as we are developing in the pre-release phase,
I think it's beneficial and affordable with rigorous testing.

However, if hashset would ever be considered
for core inclusion, then we should consolidate all tests into
one file and retain only essential tests, thereby minimizing
impact on PostgreSQL's overall test suite runtime
where every millisecond matters.

> Attached is a patch that slightly modifies README.md. Feel free to change it, 
> since I am not a native English speaker... 
>
> Attachments:
> * 0001-add-instruction-using-PG_CONFIG-to-install-extension.patch

Thanks, improvements incorporated with some minor changes.

/Joel
Attachments

Re: Do we want a hashset type?

От
"Joel Jacobson"
Дата:
New patch attached:

Add customizable params to int4hashset() and collision count function

This commit enhances int4hashset() by introducing adjustable capacity,
load, and growth factors, providing flexibility for performance optimization.

Also added is a new function, hashset_collisions(), to report collision
counts, aiding in performance tuning.

Aggregate functions are renamed to hashset_agg() for consistency with
array_agg() and range_agg().

A new test file, test/sql/benchmark.sql, is added for evaluating the
performance of hash functions. It's not run automatically by
make installcheck.

The adjustable parameters and the naive hash function are useful for testing
and performance comparison. However, to keep things simple and streamlined
for users, these features are likely to be removed in the final release,
emphasizing the use of well-optimized default settings.

SQL-function indentation is also adjusted to align with the PostgreSQL
source repo, improving readability.

In the benchmark results below, it was a bit surprising that the naive hash
function had no collisions, but that only held true when the input
elements were sequential integers. When tested with random integers,
all three hash functions caused collisions.

The timing results are not statistically significant; the purpose is just
to give an idea of the execution times.

*** Elements in sequence 1..100000
- Testing default hash function (Jenkins/lookup3)
psql:test/sql/benchmark.sql:23: NOTICE:  hashset_count: 100000
psql:test/sql/benchmark.sql:23: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:23: NOTICE:  hashset_collisions: 31195
DO
Time: 1342.564 ms (00:01.343)
- Testing Murmurhash32
psql:test/sql/benchmark.sql:40: NOTICE:  hashset_count: 100000
psql:test/sql/benchmark.sql:40: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:40: NOTICE:  hashset_collisions: 30879
DO
Time: 1297.823 ms (00:01.298)
- Testing naive hash function
psql:test/sql/benchmark.sql:57: NOTICE:  hashset_count: 100000
psql:test/sql/benchmark.sql:57: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:57: NOTICE:  hashset_collisions: 0
DO
Time: 1400.936 ms (00:01.401)
*** Testing 100000 random ints
    setseed
---------

(1 row)

Time: 3.591 ms
- Testing default hash function (Jenkins/lookup3)
psql:test/sql/benchmark.sql:77: NOTICE:  hashset_count: 100000
psql:test/sql/benchmark.sql:77: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:77: NOTICE:  hashset_collisions: 30919
DO
Time: 1415.497 ms (00:01.415)
    setseed
---------

(1 row)

Time: 1.282 ms
- Testing Murmurhash32
psql:test/sql/benchmark.sql:95: NOTICE:  hashset_count: 100000
psql:test/sql/benchmark.sql:95: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:95: NOTICE:  hashset_collisions: 30812
DO
Time: 2079.202 ms (00:02.079)
    setseed
---------

(1 row)

Time: 0.122 ms
- Testing naive hash function
psql:test/sql/benchmark.sql:113: NOTICE:  hashset_count: 100000
psql:test/sql/benchmark.sql:113: NOTICE:  hashset_capacity: 262144
psql:test/sql/benchmark.sql:113: NOTICE:  hashset_collisions: 30822
DO
Time: 1613.965 ms (00:01.614)
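
A note on the zero-collision result for the sequential input: the naive hash
function is presumably close to the identity, i.e. something like the sketch
below (the actual definition may differ):

static uint32
naive_hash(int32 value)
{
    /*
     * Each key maps to itself. With capacity 262144 (a power of two),
     * value % capacity maps 1..100000 to distinct slots, hence zero
     * collisions for sequential input, while random input collides at
     * roughly the same rate as with the real hash functions.
     */
    return (uint32) value;
}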

/Joel
Attachments

Re: Do we want a hashset type?

From
jian he
Date:


Similar to (int[] || int4) and (int4 || int[]),
should we expect
('{1,2}'::int4hashset || 3) == (3 || '{1,2}'::int4hashset) == (select hashset_add('{1,2}'::int4hashset,3))?

The following is the general idea of how to make it work, based on similar code:
CREATE OPERATOR || (
    leftarg = int4hashset,
    rightarg = int4,
    function = int4hashset_add,
    commutator = ||
);

CREATE OR REPLACE FUNCTION int4_add_int4hashset(int4, int4hashset)
RETURNS int4hashset
LANGUAGE sql
IMMUTABLE PARALLEL SAFE STRICT COST 1
RETURN $2 || $1;

CREATE OPERATOR || (
    leftarg = int4,
    rightarg = int4hashset,
    function = int4_add_int4hashset,
    commutator = ||
);
While creating an operator, I am not sure how to specify the NEGATOR, RESTRICT, and JOIN clauses.
-----------------------------------------------------------------------------------------------------------------------------
Also, I think the following query should return only one row, but currently it doesn't:
select hashset_cmp('{1,2}','{2,1}')
union
select hashset_cmp('{1,2}','{1,2,1}')
union
select hashset_cmp('{1,2}','{1,2}');
----------------------------------------------------------------------------------------------------------------------
Similar to elem_contained_by_range and range_contains_elem, should we already consider the operators <@ and @>?

CREATE OR REPLACE FUNCTION elem_contained_by_hashset(int4, int4hashset)
RETURNS bool
LANGUAGE sql
IMMUTABLE PARALLEL SAFE STRICT COST 1
RETURN hashset_contains ($2,$1);

integer <@ int4hashset → boolean
Is the integer contained in the int4hashset?
1 <@ int4hashset'{1,7}' → t

CREATE OPERATOR <@ (
    leftarg = integer,
    rightarg = int4hashset,
    function = elem_contained_by_hashset
);

int4hashset @> integer → boolean
Does the int4hashset contain the element?
int4hashset'{1,7}' @> 1     → t

CREATE OPERATOR @> (
    leftarg = int4hashset,
    rightarg = integer,
    function = hashset_contains
);
-------------------

Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Fri, Jun 16, 2023, at 13:57, jian he wrote:
> Similar to (int[] || int4) and (int4 || int[]),
> should we expect ('{1,2}'::int4hashset || 3) == (3 ||
> '{1,2}'::int4hashset) == (select hashset_add('{1,2}'::int4hashset,3))?

Good idea, makes sense to support it.
Implemented in the attached patch.

> CREATE OPERATOR || (
>     leftarg = int4,
>     rightarg = int4hashset,
>     function = int4_add_int4hashset,
>     commutator = ||
> );
> While creating an operator, I am not sure how to specify the
> NEGATOR, RESTRICT, and JOIN clauses.

I don't think we need those for this operator; I might be wrong though.

>
-----------------------------------------------------------------------------------------------------------------------------
> Also, I think the following query should return only one row, but
> currently it doesn't:
> select hashset_cmp('{1,2}','{2,1}')
> union 
> select hashset_cmp('{1,2}','{1,2,1}')
> union 
> select hashset_cmp('{1,2}','{1,2}');

Good point.

I realise int4hashset_hash() is broken,
since two int4hashsets that are considered equal
can by coincidence get different hashes:

SELECT '{1,2}'::int4hashset = '{2,1}'::int4hashset;
 ?column?
----------
 t
(1 row)

SELECT hashset_hash('{1,2}'::int4hashset);
 hashset_hash
--------------
    990882385
(1 row)

SELECT hashset_hash('{2,1}'::int4hashset);
 hashset_hash
--------------
    996377797
(1 row)

Do we have any ideas on how to fix this without sacrificing performance?

We of course want to avoid having to sort the hashsets,
which is the naive solution.

To understand why this is happening, consider this example:

SELECT '{1,2}'::int4hashset;
 int4hashset
-------------
 {1,2}
(1 row)

SELECT '{2,1}'::int4hashset;
 int4hashset
-------------
 {2,1}
(1 row)

If the hashes of `1` and `2` modulo the capacity result in the same value,
both will be aimed at the same position, and since the input text is parsed
left-to-right, in the first case `1` wins the first position while `2` gets
a collision and tries the next position.

In the second case, the opposite happens.

Since we take the hash modulo the capacity, the position depends on the
capacity, which is why the output string can differ for the same input set.

SELECT int4hashset() || 1 || 2 || 3;
 {3,1,2}

SELECT int4hashset(capacity:=1) || 1 || 2 || 3;
 {3,1,2}

SELECT int4hashset(capacity:=2) || 1 || 2 || 3;
 {3,1,2}

SELECT int4hashset(capacity:=3) || 1 || 2 || 3;
 {3,2,1}

SELECT int4hashset(capacity:=4) || 1 || 2 || 3;
 {3,1,2}

SELECT int4hashset(capacity:=5) || 1 || 2 || 3;
 {1,2,3}

SELECT int4hashset(capacity:=6) || 1 || 2 || 3;
 {1,3,2}
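
To make the mechanism concrete, here is a minimal sketch of an
open-addressing insert with linear probing; hash_element(), slot_is_used(),
element_at() and set_element() are hypothetical helpers, and the real code
differs in the details:

static void
int4hashset_insert_element(int4hashset_t *set, int32 value)
{
    uint32 pos = hash_element(value) % set->capacity;

    /* linear probing: advance until we find the value or a free slot */
    while (slot_is_used(set, pos))
    {
        if (element_at(set, pos) == value)
            return;                         /* already present */
        pos = (pos + 1) % set->capacity;    /* collision: try next slot */
    }

    set_element(set, pos, value);
    set->nelements++;
}

Different capacities give different initial positions, so which element wins
a contested slot varies, which is exactly what the examples above show.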


>
----------------------------------------------------------------------------------------------------------------------
> Similar to elem_contained_by_range and range_contains_elem, should we
> already consider the operators <@ and @>?

That could perhaps be nice.
Apart from possible syntax convenience,
are there any other benefits over just using the function hashset_contains(int4hashset, integer) directly?

/Joel
Attachments

Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Fri, Jun 16, 2023, at 17:42, Joel Jacobson wrote:
> I realise int4hashset_hash() is broken,
> since two int4hashsets that are considered equal
> can by coincidence get different hashes:
...
> Do we have any ideas on how to fix this without sacrificing performance?

The problem was due to the hashset_hash() function accumulating the hashes
of individual elements in a non-commutative manner. As a consequence, the
final hash value was sensitive to the order in which elements were inserted
into the hashset. This behavior led to inconsistencies, as logically
equivalent sets (i.e., sets with the same elements but in different orders)
produced different hash values.

Solved by modifying the hashset_hash() function to use a commutative operation
when combining the hashes of individual elements. This change ensures that the
final hash value is independent of the element insertion order, and logically
equivalent sets produce the same hash.
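
Concretely, the combination step can be as simple as XOR; a sketch, where
hash_element() is a hypothetical stand-in for the per-element hash function:

static uint32
int4hashset_hash_combine(uint32 hash, int32 element)
{
    /*
     * XOR is commutative and associative, so folding the elements in any
     * insertion order yields the same final value.
     */
    return hash ^ hash_element(element);
}

Addition modulo 2^32 would work just as well; XOR has the extra property that
adding and removing an element are the same operation.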

A somewhat unfortunate side-effect of this fix is that we can no longer
visually sort the hashset output format, since it's not lexicographically sorted.
I think this is an acceptable trade-off for a hashset type,
since the only alternative I see would be to sort the elements,
but then it wouldn't be a hashset but a treeset, which has different
Big-O complexity.

New patch is attached, which will henceforth always be a complete patch,
to avoid the hassle of having to assemble incremental patches.

/Joel
Attachments

Re: Do we want a hashset type?

From
Andrew Dunstan
Date:
On 2023-06-16 Fr 20:38, Joel Jacobson wrote:
>
> New patch is attached, which will henceforth always be a complete patch,
> to avoid the hassle of having to assemble incremental patches.


Cool, thanks.


A couple of random thoughts:


. It might be worth sending a version number with the send function 
(cf. jsonb_send / jsonb_recv). That way we would not be tied forever 
to some wire representation.

. I think there are some important set operations missing: most notably 
intersection, slightly less importantly asymmetric and symmetric 
difference. I have no idea how easy these would be to add, but even for 
your stated use I should have thought set intersection would be useful 
("Who is a member of both this set of friends and that set of friends?").

. While supporting int4 only is OK for now, I think we would at least 
want to support int8, and probably UUID since a number of systems I know 
of use that as an object identifier.


cheers


andrew


-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Sun, Jun 18, 2023, at 18:45, Andrew Dunstan wrote:
> . It might be worth sending a version number with the send function 
> (cf. jsonb_send / jsonb_recv). That way we would not be tied forever 
> to some wire representation.

Great idea; implemented.
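
For the archives, here's roughly what that could look like, as a sketch
assuming the usual pqformat API (the struct fields and the remaining payload
are illustrative):

PG_FUNCTION_INFO_V1(int4hashset_send);

Datum
int4hashset_send(PG_FUNCTION_ARGS)
{
    int4hashset_t  *set = (int4hashset_t *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));
    StringInfoData  buf;

    pq_begintypsend(&buf);
    pq_sendint8(&buf, 1);                   /* wire-format version */
    pq_sendint32(&buf, set->capacity);
    pq_sendint32(&buf, set->nelements);
    /* ... bitmap and elements would follow ... */
    PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}

The matching recv function can then reject, or adapt to, any version it does
not recognize.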

> . I think there are some important set operations missing: most notably 
> intersection, slightly less importantly asymmetric and symmetric 
> difference. I have no idea how easy these would be to add, but even for 
> your stated use I should have thought set intersection would be useful 
> ("Who is a member of both this set of friends and that set of friends?").

Another great idea; implemented.
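
The gist of intersection is to iterate over one set and probe the other; a
sketch, where int4hashset_allocate(), slot_is_used() and element_at() are
hypothetical helpers, and int4hashset_contains_element() /
int4hashset_add_element() just follow the existing naming (not necessarily
the actual signatures):

static int4hashset_t *
int4hashset_intersect(int4hashset_t *a, int4hashset_t *b)
{
    int4hashset_t *result = int4hashset_allocate(Min(a->nelements, b->nelements));
    uint32         pos;

    for (pos = 0; pos < a->capacity; pos++)
    {
        int32 value;

        if (!slot_is_used(a, pos))
            continue;

        value = element_at(a, pos);
        if (int4hashset_contains_element(b, value))
            result = int4hashset_add_element(result, value);
    }

    return result;
}

Symmetric and asymmetric difference follow the same pattern, just with the
membership test negated or applied to both sides.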

> . While supporting int4 only is OK for now, I think we would at least 
> want to support int8, and probably UUID since a number of systems I know 
> of use that as an object identifier.

I agree that's probably the most logical thing to focus on next. I'm on it.

New patch attached.

/Joel
Attachments

Re: Do we want a hashset type?

From
jian he
Date:
On Sat, Jun 17, 2023 at 8:38 AM Joel Jacobson <joel@compiler.org> wrote:
>
> On Fri, Jun 16, 2023, at 17:42, Joel Jacobson wrote:
> > I realise int4hashset_hash() is broken,
> > since two int4hashsets that are considered equal
> > can by coincidence get different hashes:
> ...
> > Do we have any ideas on how to fix this without sacrificing performance?
>
> The problem was due to the hashset_hash() function accumulating the hashes
> of individual elements in a non-commutative manner. As a consequence, the
> final hash value was sensitive to the order in which elements were inserted
> into the hashset. This behavior led to inconsistencies, as logically
> equivalent sets (i.e., sets with the same elements but in different orders)
> produced different hash values.
>
> Solved by modifying the hashset_hash() function to use a commutative operation
> when combining the hashes of individual elements. This change ensures that the
> final hash value is independent of the element insertion order, and logically
> equivalent sets produce the same hash.
>
> A somewhat unfortunate side-effect of this fix is that we can no longer
> visually sort the hashset output format, since it's not lexicographically sorted.
> I think this is an acceptable trade-off for a hashset type,
> since the only alternative I see would be to sort the elements,
> but then it wouldn't be a hashset but a treeset, which has different
> Big-O complexity.
>
> New patch is attached, which will henceforth always be a complete patch,
> to avoid the hassle of having to assemble incremental patches.
>
> /Joel


select hashset_contains('{1,2}'::int4hashset,NULL::int);
should return null?
---------------------------------------------------------------------------------
SELECT attname
        ,pc.relname
        ,CASE attstorage
        WHEN 'p' THEN 'plain'
        WHEN 'e' THEN 'external'
        WHEN 'm' THEN 'main'
        WHEN 'x' THEN 'extended'
        END AS storage
FROM    pg_attribute pa
join    pg_class    pc  on pc.oid = pa.attrelid
where   attnum > 0      and pa.attstorage   = 'e';

In my system catalog, it seems only the hashset type has storage =
'external'; most types use 'extended'.
I am not sure about the consequences of switching from external to extended.
------------------------------------------------------------------------------------------------------------
select hashset_hash('{-1,1}')   as a1
        ,hashset_hash('{1,-2}') as a2
        ,hashset_hash('{-3,1}') as a3
        ,hashset_hash('{4,1}')  as a4;
returns:
     a1      |    a2     |     a3     |     a4
-------------+-----------+------------+------------
 -1735582196 | 998516167 | 1337000903 | 1305426029
(1 row)

The values {a1,a2,a3,a4} should be monotonically increasing, based on the
function int4hashset_cmp, but currently they are not,
so the following queries fail.

--should return only one row.
select hashset_cmp('{2,1}','{3,1}')
union
select hashset_cmp('{3,1}','{4,1}')
union
select hashset_cmp('{1,3}','{4,1}');

select hashset_cmp('{9,10,11}','{10,9,-11}') =
hashset_cmp('{9,10,11}','{10,9,-1}'); --should be true
select '{2,1}'::int4hashset > '{7}'::int4hashset; --should be false,
based on int array comparison.
-----------------------------------------------------------------------------------------
I commented out the following lines in hashset-api.c, somewhere between lines 810 and 829:

// if (a->hash < b->hash)
// PG_RETURN_INT32(-1);
// else if (a->hash > b->hash)
// PG_RETURN_INT32(1);

// if (a->nelements < b->nelements)
// PG_RETURN_INT32(-1);
// else if (a->nelements > b->nelements)
// PG_RETURN_INT32(1);

// Assert(a->nelements == b->nelements);

With that, hashset_cmp directly compares the int arrays, and the above queries work.

int4hashset_equals and int4hashset_neq are two special cases of hashset_cmp;
maybe we can just wrap it, like int4hashset_le?

Right now, storing a 10-element int4hashset needs 99 bytes, while a similar
one-dimensional bigint array of length 10 occupies 101 bytes.

In int4hashset_send, do the newly added struct members (load_factor,
growth_factor, ncollisions, hash) also need to be sent to the buffer?



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Mon, Jun 19, 2023, at 02:00, jian he wrote:
> select hashset_contains('{1,2}'::int4hashset,NULL::int);
> should return null?

Hmm, that's a good philosophical question.

I notice Tomas Vondra in the initial commit opted for allowing NULL inputs,
treating them as empty sets, e.g. in int4hashset_add() we create a
new hashset if the first argument is NULL.

I guess the easiest, and perhaps most consistent, NULL-handling strategy
would be to just mark all relevant functions STRICT, except for the aggregate
ones, since we probably want to allow skipping over rows with NULL values
without the entire result becoming NULL.

But if we're not just going the STRICT route, then I think it's a bit more tricky,
since you could argue the hashset_contains() example should return FALSE
since the set doesn't contain the NULL value, but OTOH, since we don't
store NULL values, we don't know if has ever been added, hence a NULL
result would perhaps make more sense.

I lean towards thinking that if we want to be "NULL-friendly", like we
currently are in hashset_add(), it would probably be most user-friendly
to be consistent and let all functions return non-null values in
all cases where that is not unreasonable.

Since we're essentially designing a set-theoretic system, I think we should
aim for its logical "soundness" and think about how we can
verify that it holds.

Thoughts?

/Joel



Re: Do we want a hashset type?

From
Tomas Vondra
Date:

On 6/18/23 18:45, Andrew Dunstan wrote:
> 
> On 2023-06-16 Fr 20:38, Joel Jacobson wrote:
>>
>> New patch is attached, which will henceforth always be a complete patch,
>> to avoid the hassle of having to assemble incremental patches.
> 
> 
> Cool, thanks.
> 

It might still be convenient to keep it split into smaller, easier to
review, parts. A patch that introduces basic functionality and then
patches adding various "advanced" features.

> 
> A couple of random thoughts:
> 
> 
> . It might be worth sending a version number with the send function
> (cf. jsonb_send / jsonb_recv). That way we would not be tied forever
> to some wire representation.
> 
> . I think there are some important set operations missing: most notably
> intersection, slightly less importantly asymmetric and symmetric
> difference. I have no idea how easy these would be to add, but even for
> your stated use I should have thought set intersection would be useful
> ("Who is a member of both this set of friends and that set of friends?").
> 
> . While supporting int4 only is OK for now, I think we would at least
> want to support int8, and probably UUID since a number of systems I know
> of use that as an object identifier.
> 

I agree we should aim to support a wider range of data types. Could we
have a polymorphic type, similar to what we do for arrays and ranges? In
fact, CREATE TYPE allows specifying ELEMENT, so wouldn't it be possible
to implement this as a special variant of an array? Would be better than
having a set of functions for every supported data type.

(Note: It might still be possible to have a special implementation for
selected fixed-length data types, as it allows optimization at compile
time. But that could be done later.)


The other thing I've been thinking about is the SQL syntax and what the
SQL standard says about this.

AFAICS the standard only defines arrays and multisets. Arrays are pretty
much the thing we have, including the ARRAY[] constructor etc. Multisets
are similar to hashset discussed here, except that it tracks the number
of elements for each value (which would be trivial in hashset).

So if we want to make this a built-in feature, maybe we should aim to do
the multiset thing, with the standard SQL syntax? Extending the grammar
should not be hard, I think. I'm not sure of the underlying code
(ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
lot of separate code doing that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From
jian he
Date:


On Mon, Jun 19, 2023 at 2:51 PM Joel Jacobson <joel@compiler.org> wrote:
>
> On Mon, Jun 19, 2023, at 02:00, jian he wrote:
> > select hashset_contains('{1,2}'::int4hashset,NULL::int);
> > should return null?
>
> Hmm, that's a good philosophical question.
>
> I notice Tomas Vondra in the initial commit opted for allowing NULL inputs,
> treating them as empty sets, e.g. in int4hashset_add() we create a
> new hashset if the first argument is NULL.
>
> I guess the easiest, and perhaps most consistent, NULL-handling strategy
> would be to just mark all relevant functions STRICT, except for the aggregate
> ones, since we probably want to allow skipping over rows with NULL values
> without the entire result becoming NULL.
>
> But if we're not just going the STRICT route, then I think it's a bit more tricky,
> since you could argue the hashset_contains() example should return FALSE
> since the set doesn't contain the NULL value, but OTOH, since we don't
> store NULL values, we don't know if one has ever been added, hence a NULL
> result would perhaps make more sense.
>
> I lean towards thinking that if we want to be "NULL-friendly", like we
> currently are in hashset_add(), it would probably be most user-friendly
> to be consistent and let all functions return non-null values in
> all cases where that is not unreasonable.
>
> Since we're essentially designing a set-theoretic system, I think we should
> aim for its logical "soundness" and think about how we can
> verify that it holds.
>
> Thoughts?
>
> /Joel

Should the hashset_to_array function be strict?

I noticed hashset_symmetric_difference and hashset_difference handle null in different ways; shouldn't they handle null consistently?

select '{1,2,NULL}'::int[] operator (pg_catalog.@>) '{NULL}'::int[]; --false
select '{1,2,NULL}'::int[] operator (pg_catalog.&&) '{NULL}'::int[]; --false.
So similarly I guess hashset_contains should be false.
select hashset_contains('{1,2}'::int4hashset,NULL::int); 


Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Mon, Jun 19, 2023, at 11:21, Tomas Vondra wrote:
> AFAICS the standard only defines arrays and multisets. Arrays are pretty
> much the thing we have, including the ARRAY[] constructor etc. Multisets
> are similar to hashset discussed here, except that it tracks the number
> of elements for each value (which would be trivial in hashset).
>
> So if we want to make this a built-in feature, maybe we should aim to do
> the multiset thing, with the standard SQL syntax? Extending the grammar
> should not be hard, I think. I'm not sure of the underlying code
> (ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
> lot of separate code doing that.

Multisets handle duplicates in their own way, which may bring unexpected issues. Sets
and multisets have distinct utility in C++, Rust, Java, etc. However, sets are
more fundamental and prevalent in standard libraries than multisets.

Despite SQL's multiset possibility, a distinct hashset type is my preference,
helping appropriate data structure choice and reducing misuse.

The necessity of multisets is vague beyond standards compliance.

/Joel



Re: Do we want a hashset type?

From
Andrew Dunstan
Date:


On 2023-06-19 Mo 05:21, Tomas Vondra wrote:
> On 6/18/23 18:45, Andrew Dunstan wrote:
>> On 2023-06-16 Fr 20:38, Joel Jacobson wrote:
>>> New patch is attached, which will henceforth always be a complete patch,
>>> to avoid the hassle of having to assemble incremental patches.
>>
>> Cool, thanks.
>
> It might still be convenient to keep it split into smaller, easier to
> review, parts. A patch that introduces basic functionality and then
> patches adding various "advanced" features.
>
>> A couple of random thoughts:
>>
>> . It might be worth sending a version number with the send function
>> (cf. jsonb_send / jsonb_recv). That way we would not be tied forever
>> to some wire representation.
>>
>> . I think there are some important set operations missing: most notably
>> intersection, slightly less importantly asymmetric and symmetric
>> difference. I have no idea how easy these would be to add, but even for
>> your stated use I should have thought set intersection would be useful
>> ("Who is a member of both this set of friends and that set of friends?").
>>
>> . While supporting int4 only is OK for now, I think we would at least
>> want to support int8, and probably UUID since a number of systems I know
>> of use that as an object identifier.
>
> I agree we should aim to support a wider range of data types. Could we
> have a polymorphic type, similar to what we do for arrays and ranges? In
> fact, CREATE TYPE allows specifying ELEMENT, so wouldn't it be possible
> to implement this as a special variant of an array? Would be better than
> having a set of functions for every supported data type.
>
> (Note: It might still be possible to have a special implementation for
> selected fixed-length data types, as it allows optimization at compile
> time. But that could be done later.)

Interesting idea. There's also the keyword SETOF that we could possibly make use of.

> The other thing I've been thinking about is the SQL syntax and what the
> SQL standard says about this.
>
> AFAICS the standard only defines arrays and multisets. Arrays are pretty
> much the thing we have, including the ARRAY[] constructor etc. Multisets
> are similar to hashset discussed here, except that it tracks the number
> of elements for each value (which would be trivial in hashset).
>
> So if we want to make this a built-in feature, maybe we should aim to do
> the multiset thing, with the standard SQL syntax? Extending the grammar
> should not be hard, I think. I'm not sure of the underlying code
> (ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
> lot of separate code doing that.

Yes, Multisets (a.k.a. bags and a large number of other names) would be interesting. But I wouldn't like to abandon pure sets either. Maybe a typmod indicating the allowed multiplicity of the type?



cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Mon, Jun 19, 2023, at 11:49, jian he wrote:
> Should the hashset_to_array function be strict?
>
> I noticed hashset_symmetric_difference and hashset_difference handle
> null in different ways; shouldn't they handle null
> consistently?

Yes, I agree, they should be consistent.

I've thought a bit more about this, and have come to the conclusion that it
would be easiest, safest and least confusing to just mark all functions STRICT.

That way, it's the user's responsibility to ensure null operands are not passed
to the functions, which is simply a matter of a WHERE ... or FILTER (WHERE ...)
clause. And if a NULL is passed by mistake, it's better to make the entire
result blow up by letting it be NULL than to silently ignore the operand or
return some questionable true/false value.

SQL has a quite unique NULL handling compared to other languages, so I think
it's better to let the user use the full arsenal of SQL to deal with nulls,
rather than trying to shoehorn some null semantics into a set-theoretic system.

/Joel



Re: Do we want a hashset type?

From
Tomas Vondra
Date:
On 6/19/23 13:50, Andrew Dunstan wrote:
> 
> On 2023-06-19 Mo 05:21, Tomas Vondra wrote:
>> On 6/18/23 18:45, Andrew Dunstan wrote:
>>> On 2023-06-16 Fr 20:38, Joel Jacobson wrote:
>>>> New patch is attached, which will henceforth always be a complete patch,
>>>> to avoid the hassle of having to assemble incremental patches.
>>> Cool, thanks.
>>>
>> It might still be convenient to keep it split into smaller, easier to
>> review, parts. A patch that introduces basic functionality and then
>> patches adding various "advanced" features.
>>
>>> A couple of random thoughts:
>>>
>>>
>>> . It might be worth sending a version number with the send function
>>> (cf. jsonb_send / jsonb_recv). That way we would not be tied forever
>>> to some wire representation.
>>>
>>> . I think there are some important set operations missing: most notably
>>> intersection, slightly less importantly asymmetric and symmetric
>>> difference. I have no idea how easy these would be to add, but even for
>>> your stated use I should have thought set intersection would be useful
>>> ("Who is a member of both this set of friends and that set of friends?").
>>>
>>> . While supporting int4 only is OK for now, I think we would at least
>>> want to support int8, and probably UUID since a number of systems I know
>>> of use that as an object identifier.
>>>
>> I agree we should aim to support a wider range of data types. Could we
>> have a polymorphic type, similar to what we do for arrays and ranges? In
>> fact, CREATE TYPE allows specifying ELEMENT, so wouldn't it be possible
>> to implement this as a special variant of an array? Would be better than
>> having a set of functions for every supported data type.
>>
>> (Note: It might still be possible to have a special implementation for
>> selected fixed-length data types, as it allows optimization at compile
>> time. But that could be done later.)
> 
> 
> Interesting idea. There's also the keyword SETOF that we could possibly
> make use of.
> 
> 
>> The other thing I've been thinking about is the SQL syntax and what the
>> SQL standard says about this.
>>
>> AFAICS the standard only defines arrays and multisets. Arrays are pretty
>> much the thing we have, including the ARRAY[] constructor etc. Multisets
>> are similar to hashset discussed here, except that it tracks the number
>> of elements for each value (which would be trivial in hashset).
>>
>> So if we want to make this a built-in feature, maybe we should aim to do
>> the multiset thing, with the standard SQL syntax? Extending the grammar
>> should not be hard, I think. I'm not sure of the underlying code
>> (ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
>> lot of separate code doing that.
>>
>>
> 
> Yes, Multisets (a.k.a. bags and a large number of other names) would be
> interesting. But I wouldn't like to abandon pure sets either. Maybe a
> typmod indicating the allowed multiplicity of the type?
> 

Maybe, although I'm not sure if that can be specified with a multiset
constructor, i.e. when using MULTISET[...] in places where we now use
ARRAY[...] to specify arrays.

I was thinking more about having one set of operators, one considering
the duplicity (and thus doing what SQL standard says) and one ignoring
it (thus treating MULTISETS as plain sets).

Anyway, I'm just thinking aloud. I'm not sure if this is the way to do,
but it'd be silly to end up implementing stuff unnecessarily and/or
inventing something that contradicts the SQL standard (or is somehow
inconsistent with similar stuff).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From
Tomas Vondra
Date:

On 6/19/23 13:33, Joel Jacobson wrote:
> On Mon, Jun 19, 2023, at 11:21, Tomas Vondra wrote:
>> AFAICS the standard only defines arrays and multisets. Arrays are pretty
>> much the thing we have, including the ARRAY[] constructor etc. Multisets
>> are similar to hashset discussed here, except that it tracks the number
>> of elements for each value (which would be trivial in hashset).
>>
>> So if we want to make this a built-in feature, maybe we should aim to do
>> the multiset thing, with the standard SQL syntax? Extending the grammar
>> should not be hard, I think. I'm not sure of the underlying code
>> (ArrayType, ARRAY_SUBLINK stuff, etc.) we could reuse or if we'd need a
>> lot of separate code doing that.
> 
> Multisets handle duplicates in their own way, which may bring unexpected issues. Sets
> and multisets have distinct utility in C++, Rust, Java, etc. However, sets are
> more fundamental and prevalent in standard libraries than multisets.
> 

What unexpected issues do you mean? Sure, if someone uses multisets as if
they were sets (so ignoring the handling of duplicates), things will go
booom! quickly.

I imagined (if we ended up doing MULTISET) we'd provide an interface (e.g.
operators) that'd perhaps help with this.

> Despite SQL's multiset possibility, a distinct hashset type is my preference,
> helping appropriate data structure choice and reducing misuse.
> 
> The necessity of multisets is vague beyond standards compliance.
> 

True - we haven't had any requests/proposal to implement MULTISETs.

I've looked at the SQL standard primarily to check if maybe there's some
precedent that'd give us guidance on the SQL syntax etc. And I think
multisets are that - even if we end up not implementing them, it'd be
sad to have unnecessarily inconsistent syntax (in case someone decides
to add multisets in the future).

We could invent "SET" data type, so while standard has ARRAY / MULTISET,
we'd have ARRAY / MULTISET / SET, and the difference between the last
two would be just handling of duplicates.

The other way to look at sets is that they are pretty similar to arrays,
except that there are no duplicates and order does not matter. Sure, the
on-disk format and code is different, but from the SQL perspective it'd
be nice to allow using sets in most places where arrays are allowed
(which is what the standard does for MULTISETS, more or less).

That'd mean we could probably search through gram.y for places working
with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
make them work with sets too, say by having SET_SUBLINK instead of
ARRAY_SUBLINK, set_expression instead of array_expression, etc.

This might be also "consistent" with defining hashset type using CREATE
TYPE with ELEMENT, because we consider the type to be "array". So that
would be polymorphic type, but we don't have pre-defined array for every
type (and I'm not sure we want to).

Of course, maybe there's some fatal flaw in these idea, I don't know.
And I don't want to move the goalposts too far - but it seems like this
might make some stuff actually simpler to implement (by piggy-backing on
the existing array infrastructure).


A mostly unrelated thought - I wonder if this might be somehow related
to the foreign key array patch ([1] might be the most recent attempt in
this direction). Not to hashset itself, but I recalled these patches
because it'd mean we don't need the separate "edges" link table (so the
hashset column would be the thing backing the FK).

[1]
https://www.postgresql.org/message-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9%3DGq3g%40mail.gmail.com


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Yes, Multisets (a.k.a. bags and a large number of other names) would be 
> interesting. But I wouldn't like to abandon pure sets either. Maybe a 
> typmod indicating the allowed multiplicity of the type?

I don't think trying to use typmod to carry fundamental semantic
information will work, because we drop it in too many places
(e.g. there's no way to pass it through a function).  If you want
both sets and multisets, they'll need to be two different container
types, even if code is shared under the hood.

            regards, tom lane



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Mon, Jun 19, 2023, at 14:59, Tomas Vondra wrote:
> What unexpected issues do you mean? Sure, if someone uses multisets as if
> they were sets (so ignoring the handling of duplicates), things will go
> booom! quickly.

The unexpected issues I had in mind are subtle bugs due to treating multisets
as sets, which could go undetected due to having no duplicates initially.
Multisets might initially therefore seem equal, but later diverge due to
different element counts, leading to hard-to-detect issues.

> I imagined (if we ended up doing MULTISET) we'd provide an interface (e.g.
> operators) that'd perhaps help with this.

Might help. But still think providing both structures would be a more foolproof
solution, offering users the choice to select what's best for their use-case.

>> Despite SQL's multiset possibility, a distinct hashset type is my preference,
>> helping appropriate data structure choice and reducing misuse.
>> 
>> The necessity of multisets is vague beyond standards compliance.
>
> True - we haven't had any requests/proposal to implement MULTISETs.
>
> I've looked at the SQL standard primarily to check if maybe there's some
> precedent that'd give us guidance on the SQL syntax etc. And I think
> multisets are that - even if we end up not implementing them, it'd be
> sad to have unnecessarily inconsistent syntax (in case someone decides
> to add multisets in the future).
>
> We could invent "SET" data type, so while standard has ARRAY / MULTISET,
> we'd have ARRAY / MULTISET / SET, and the difference between the last
> two would be just handling of duplicates.

Is the idea to use the "SET" keyword for the syntax?
Isn't there a risk it will be confusing, since "SET" is currently
only used for configuration and update operations?

> The other way to look at sets is that they are pretty similar to arrays,
> except that there are no duplicates and order does not matter. Sure, the
> on-disk format and code is different, but from the SQL perspective it'd
> be nice to allow using sets in most places where arrays are allowed
> (which is what the standard does for MULTISETS, more or less).
>
> That'd mean we could probably search through gram.y for places working
> with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
> make them work with sets too, say by having SET_SUBLINK instead of
> ARRAY_SUBLINK, set_expression instead of array_expression, etc.
>
> This might be also "consistent" with defining hashset type using CREATE
> TYPE with ELEMENT, because we consider the type to be "array". So that
> would be polymorphic type, but we don't have pre-defined array for every
> type (and I'm not sure we want to).
>
> Of course, maybe there's some fatal flaw in these idea, I don't know.
> And I don't want to move the goalposts too far - but it seems like this
> might make some stuff actually simpler to implement (by piggy-backing on
> the existing array infrastructure).

I think these are very interesting thoughts and ambitions.

I wonder though, from a user perspective, if a new hashset type still
wouldn't just be considered simpler than introducing new SQL syntax?

However, it would be interesting to see how piggy-backing on the
existing array infrastructure would look in practice, code-wise.

I think it's still meaningful to continue hacking on the int4-type
hashset extension, to see if we can agree on the semantics,
especially around null handling and sorting.

> A mostly unrelated thought - I wonder if this might be somehow related
> to the foreign key array patch ([1] might be the most recent attempt in
> this direction). Not to hashset itself, but I recalled these patches
> because it'd mean we don't need the separate "edges" link table (so the
> hashset column would be the thing backing the FK).
>
> [1]
> https://www.postgresql.org/message-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9%3DGq3g%40mail.gmail.com

I remember that one! We tried to revive it, but didn't manage to keep it alive.
It's a really good idea though. Good idea to see if there might be synergies
between arrays and hashsets in this area, since if we envision the elements in
a hashset mostly will be PKs, then it would be nice to enforce reference
integrity.

/Joel




Re: Do we want a hashset type?

From
Tomas Vondra
Date:

On 6/20/23 00:50, Joel Jacobson wrote:
> On Mon, Jun 19, 2023, at 14:59, Tomas Vondra wrote:
>> What unexpected issues do you mean? Sure, if someone uses multisets as if
>> they were sets (so ignoring the handling of duplicates), things will go
>> booom! quickly.
> 
> The unexpected issues I had in mind are subtle bugs due to treating multisets
> as sets, which could go undetected due to having no duplicates initially.
> Multisets might initially therefore seem equal, but later diverge due to
> different element counts, leading to hard-to-detect issues.
> 

Understood.

>> I imagined (if we ended up doing MULTISET) we'd provide an interface (e.g.
>> operators) that'd perhaps help with this.
> 
> Might help. But still think providing both structures would be a more foolproof
> solution, offering users the choice to select what's best for their use-case.
> 

Yeah. Not confusing people is better.

>>> Despite SQL's multiset possibility, a distinct hashset type is my preference,
>>> helping appropriate data structure choice and reducing misuse.
>>>
>>> The necessity of multisets is vague beyond standards compliance.
>>
>> True - we haven't had any requests/proposal to implement MULTISETs.
>>
>> I've looked at the SQL standard primarily to check if maybe there's some
>> precedent that'd give us guidance on the SQL syntax etc. And I think
>> multisets are that - even if we end up not implementing them, it'd be
>> sad to have unnecessarily inconsistent syntax (in case someone decides
>> to add multisets in the future).
>>
>> We could invent "SET" data type, so while standard has ARRAY / MULTISET,
>> we'd have ARRAY / MULTISET / SET, and the difference between the last
>> two would be just handling of duplicates.
> 
> Is the idea to use the "SET" keyword for the syntax?
> Isn't there a risk it will be confusing, since "SET" is currently
> only used for configuration and update operations?
> 

I haven't tried doing that, so not sure if there would be any conflicts
in the grammar. But I can't think of a case that'd be confusing for
users - when setting internal GUC variables it's a completely different
context, there's no use for SQL-level collections (arrays, sets, ...).

For UPDATE, it'd be pretty clear too, I think. It's possible to do

   UPDATE table SET col = SET[1,2,3]

and it's clear the first is the command SET, while the second is a set
constructor. For SELECT there'd be conflict, and for ALTER TABLE it'd be
possible to do

   ALTER TABLE table ALTER COLUMN col SET DEFAULT SET[1,2,3];

Seems clear to me too, I think.


>> The other way to look at sets is that they are pretty similar to arrays,
>> except that there are no duplicates and order does not matter. Sure, the
>> on-disk format and code is different, but from the SQL perspective it'd
>> be nice to allow using sets in most places where arrays are allowed
>> (which is what the standard does for MULTISETS, more or less).
>>
>> That'd mean we could probably search through gram.y for places working
>> with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
>> make them work with sets too, say by having SET_SUBLINK instead of
>> ARRAY_SUBLINK, set_expression instead of array_expression, etc.
>>
>> This might be also "consistent" with defining hashset type using CREATE
>> TYPE with ELEMENT, because we consider the type to be "array". So that
>> would be polymorphic type, but we don't have pre-defined array for every
>> type (and I'm not sure we want to).
>>
>> Of course, maybe there's some fatal flaw in these idea, I don't know.
>> And I don't want to move the goalposts too far - but it seems like this
>> might make some stuff actually simpler to implement (by piggy-backing on
>> the existing array infrastructure).
> 
> I think these are very interesting thoughts and ambitions.
> 
> I wonder though, from a user perspective, if a new hashset type still
> wouldn't just be considered simpler than introducing new SQL syntax?
> 

It's a matter of personal taste, I guess. I'm fine with calling function
API and what not, but a sensible SQL syntax seems nicer.

> However, it would be interesting to see how piggy-backing on the
> existing array infrastructure would look in practice, code-wise.
> 
> I think it's still meaningful to continue hacking on the int4-type
> hashset extension, to see if we can agree on the semantics,
> especially around null handling and sorting.
> 

Definitely. It certainly was not my intention to derail the work by
proposing more and more stuff. So feel free to pursue what makes sense
to you / helps the use case.


TBH I don't particularly see why we'd want to sort sets.

I wonder if the SQL standard says something about these things (for
MULTISETs), especially for the NULL handling. If it does, I'd try to
stick with those rules.

>> A mostly unrelated thought - I wonder if this might be somehow related
>> to the foreign key array patch ([1] might be the most recent attempt in
>> this direction). Not to hashset itself, but I recalled these patches
>> because it'd mean we don't need the separate "edges" link table (so the
>> hashset column would be the thing backing the FK).
>>
>> [1]
>> https://www.postgresql.org/message-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9%3DGq3g%40mail.gmail.com
> 
> I remember that one! We tried to revive that one, but didn't manage to keep it alive.
> It's a really good idea though. Good idea to see if there might be synergies
> between arrays and hashsets in this area, since if we envision the elements in
> a hashset mostly will be PKs, then it would be nice to enforce reference
> integrity.

I haven't followed that at all, but I wonder how difficult it would be
to also support other collection types (like sets) and not just arrays.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Tue, Jun 20, 2023, at 02:04, Tomas Vondra wrote:
> For UPDATE, it'd be pretty clear too, I think. It's possible to do
>
>    UPDATE table SET col = SET[1,2,3]
>
> and it's clear the first is the command SET, while the second is a set
> constructor. For SELECT there'd be conflict, and for ALTER TABLE it'd be
> possible to do
>
>    ALTER TABLE table ALTER COLUMN col SET DEFAULT SET[1,2,3];
>
> Seems clear to me too, I think.
...
> It's a matter of personal taste, I guess. I'm fine with calling function
> API and what not, but a sensible SQL syntax seems nicer.

Now when I see it written out, I actually agree it looks nice.

>> I think it's still meaningful to continue hacking on the int4-type
>> hashset extension, to see if we can agree on the semantics,
>> especially around null handling and sorting.
>> 
>
> Definitely. It certainly was not my intention to derail the work by
> proposing more and more stuff. So feel free to pursue what makes sense
> to you / helps the use case.

OK, cool; I didn't mean at all to suggest that you did. I appreciate the
long-term perspective; otherwise our short-term work might go to waste.

> TBH I don't particularly see why we'd want to sort sets.

Me neither, sorting sets in the conventional, visually coherent sense
(i.e., lexicographically) doesn't seem necessary. However, for ORDER BY hashset
functionality, we need a stable and deterministic method.

This can be achieved efficiently by computing a commutative hash of
the hashset, XORing each new value's hash into set->hash:

        set->hash ^= hash;

...and then sorting primarily by set->hash.

Though resulting in an apparently random order, this approach, already employed
in int4hashset_add_element() and int4hashset_cmp(), ensures a deterministic and
stable sorting order.

I think this is an acceptable trade-off, better than not supporting ORDER BY.

Jian He had some comments on hashset_cmp() which I will look at.
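
For clarity, the comparison order sketched above (and visible in jian he's
excerpt from hashset-api.c earlier in the thread) looks roughly like this,
with compare_elements() as a hypothetical element-wise tie-break:

static int
int4hashset_cmp_internal(int4hashset_t *a, int4hashset_t *b)
{
    /* primary sort key: the commutative hash, stable across insertion order */
    if (a->hash < b->hash)
        return -1;
    if (a->hash > b->hash)
        return 1;

    /* tie-breakers: cardinality first, then the elements themselves */
    if (a->nelements < b->nelements)
        return -1;
    if (a->nelements > b->nelements)
        return 1;

    return compare_elements(a, b);
}

This yields an apparently random but total and deterministic ordering, which
is all that ORDER BY needs.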

/Joel



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Mon, Jun 19, 2023, at 02:00, jian he wrote:
> select hashset_contains('{1,2}'::int4hashset,NULL::int);
> should return null?

I agree, it should.

I've now changed all functions except int4hashset() (the init function)
and the aggregate functions to be STRICT.
I think this patch is OK to send as an incremental one, since it's an isolated change:

Apply STRICT to hashset functions; clean up null handling in hashset-api.c

Set hashset functions to be STRICT, thereby letting the system reject null
inputs automatically. This change reflects the nature of hashset as an
implementation of a set-theoretic system, where null values are conceptually
unusual.

Alongside, the hashset-api.c code has been refactored for clarity, consolidating
null checks and assignments into single lines.

A 'strict' test case has been added to account for these changes.

/Joel
Attachments

Re: Do we want a hashset type?

From
Tomas Vondra
Date:

On 6/20/23 12:59, Joel Jacobson wrote:
> On Mon, Jun 19, 2023, at 02:00, jian he wrote:
>> select hashset_contains('{1,2}'::int4hashset,NULL::int);
>> should return null?
> 
> I agree, it should.
> 
> I've now changed all functions except int4hashset() (the init function)
> and the aggregate functions to be STRICT.

I don't think this is correct / consistent with what we do elsewhere.
IMHO it's perfectly fine to have a hashset containing a NULL value,
because then it can affect results of membership checks.

Consider these IN / ANY queries:

    test=# select 4 in (1,2,3);
     ?column?
    ----------
     f
    (1 row)

    test=# select 4 = ANY(ARRAY[1,2,3]);
     ?column?
    ----------
     f
    (1 row)

now add a NULL:

    test=# select 4 in (1,2,3,null);
     ?column?
    ----------

    (1 row)

    test=# select 4 = ANY(ARRAY[1,2,3,NULL]);
     ?column?
    ----------

    (1 row)

I don't see why a (hash)set should behave any differently. It's true
arrays don't behave like this:

    test=# select array[1,2,3,4,NULL] @> ARRAY[5];
     ?column?
    ----------
     f
    (1 row)

but I'd say that's more an anomaly than something we should replicate.

This is also what the SQL standard does for multisets - there's SQL:20nn
draft at http://www.wiscorp.com/SQLStandards.html, and the <member
predicate> section (p. 475) explains how this should work with NULL.

So if we see a set as a special case of multiset (with no duplicates),
then we have to handle NULLs this way too. It'd be weird to have this
behavior inconsistent.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From
Tomas Vondra
Date:

On 6/20/23 14:10, Tomas Vondra wrote:
> ...
>
> This is also what the SQL standard does for multisets - there's SQL:20nn
> draft at http://www.wiscorp.com/SQLStandards.html, and the <member
> predicate> section (p. 475) explains how this should work with NULL.
> 

BTW I just noticed there's also a multiset proposal at the wiscorp page:

   http://www.wiscorp.com/sqlmultisets.zip

It's just the initial proposal and I'm not sure how much it changed over
time, but it likely provides way more context for the choices than the
(rather dry) SQL standard.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
> On 6/20/23 12:59, Joel Jacobson wrote:
>> On Mon, Jun 19, 2023, at 02:00, jian he wrote:
>>> select hashset_contains('{1,2}'::int4hashset,NULL::int);
>>> should return null?
>> 
>> I agree, it should.
>> 
>> I've now changed all functions except int4hashset() (the init function)
>> and the aggregate functions to be STRICT.
>
> I don't think this is correct / consistent with what we do elsewhere.
> IMHO it's perfectly fine to have a hashset containing a NULL value,

The reference to consistency with what we do elsewhere might not be entirely
applicable in this context, since the set feature we're designing is a new beast
in the SQL landscape.

I think adhering to the theoretical purity of sets by excluding NULLs aligns us
with set theory, simplifies our code, and parallels set implementations in other
languages.

I think we have an opportunity here to innovate and potentially influence a
future set concept in the SQL standard.

However, I see how one could argue against this reasoning, on the basis that
PostgreSQL users might be more familiar with and expect NULLs can exist
everywhere in all data structures.

A different perspective is to look at what use-cases we can foresee.

I've been trying hard, but I can't find compelling use-cases where a NULL element
in a set would offer a more natural SQL query than handling NULLs within SQL and
keeping the set NULL-free.

Does anyone else have a strong realistic example where including NULLs in the
set would simplify the SQL query?

/Joel



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Tue, Jun 20, 2023, at 16:56, Joel Jacobson wrote:
> I think we have an opportunity here to innovate and potentially influence a
> future set concept in the SQL standard.

Adding to my previous note - If there's a worry about future SQL standards
introducing SETs with NULLs, causing compatibility issues, we could address it
proactively. We could set up set functions to throw errors when passed NULL
inputs, rather than being STRICT. This keeps our theoretical alignment now, and
offers a smooth transition if standards evolve.

Considering we have a flag field in the struct, we could use it to indicate
whether a value stored on disk was written with NULL support or not.

/Joel



Re: Do we want a hashset type?

From
Tomas Vondra
Date:

On 6/20/23 16:56, Joel Jacobson wrote:
> On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
>> On 6/20/23 12:59, Joel Jacobson wrote:
>>> On Mon, Jun 19, 2023, at 02:00, jian he wrote:
>>>> select hashset_contains('{1,2}'::int4hashset,NULL::int);
>>>> should return null?
>>>
>>> I agree, it should.
>>>
>>> I've now changed all functions except int4hashset() (the init function)
>>> and the aggregate functions to be STRICT.
>>
>> I don't think this is correct / consistent with what we do elsewhere.
>> IMHO it's perfectly fine to have a hashset containing a NULL value,
> 
> The reference to consistency with what we do elsewhere might not be entirely
> applicable in this context, since the set feature we're designing is a new beast
> in the SQL landscape.
> 

I don't see how it's new, considering relational algebra is pretty much
based on (multi)sets, and the three-valued logic with NULL values is
a pretty well established part of that.

> I think adhering to the theoretical purity of sets by excluding NULLs aligns us
> with set theory, simplifies our code, and parallels set implementations in other
> languages.
> 

I don't see how that would be more theoretically pure, really. The
three-valued logic is a well established part of relational algebra, so
not respecting that is more a violation of the purity.

> I think we have an opportunity here to innovate and potentially influence a
> future set concept in the SQL standard.
> 

I doubt this is going to influence what the SQL standard says, especially
because it already defined the behavior for MULTISETS (of which the sets
are a special case, pretty much). So this has 0% chance of success.

> However, I see how one could argue against this reasoning, on the basis that
> PostgreSQL users might be more familiar with and expect NULLs can exist
> everywhere in all data structures.
> 

Right, it's what we already do for similar cases, and if you have NULLs
in the data, you better be aware of the behavior. Granted, some people
are surprised by three-valued logic, but using a different behavior for
some new features would just increase the confusion.

> A different perspective is to look at what use-cases we can foresee.
> 
> I've been trying hard, but I can't find compelling use-cases where a NULL element
> in a set would offer a more natural SQL query than handling NULLs within SQL and
> keeping the set NULL-free.
> 

IMO if you have NULL values in the data, you better be aware of it and
handle the case accordingly (e.g. by filtering them out when building
the set). If you don't have NULLs in the data, there's no issue.

And in the graph case, I don't see why you'd have any NULLs, considering
we're dealing with adjacent nodes, and if there's an adjacent node, its ID
is not NULL.

> Does anyone else have a strong realistic example where including NULLs in the
> set would simplify the SQL query?
> 

I'm sure there are cases where you have NULLs in the data and need to
filter them out, but that's just a natural consequence of having NULLs. If
you have them you better know what NULLs do ...

It's too early to make any strong statements, but it's going to be hard
to convince me we should handle NULLs differently from what we already
do elsewhere.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From
jian he
Date:
On Wed, Jun 21, 2023 at 12:25 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 6/20/23 16:56, Joel Jacobson wrote:
> > On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
> >> On 6/20/23 12:59, Joel Jacobson wrote:
> >>> On Mon, Jun 19, 2023, at 02:00, jian he wrote:
> >>>> select hashset_contains('{1,2}'::int4hashset,NULL::int);
> >>>> should return null?
> >>>
> >>> I agree, it should.
> >>>
> >>> I've now changed all functions except int4hashset() (the init function)
> >>> and the aggregate functions to be STRICT.
> >>
> >> I don't think this is correct / consistent with what we do elsewhere.
> >> IMHO it's perfectly fine to have a hashset containing a NULL value,
> >
> > The reference to consistency with what we do elsewhere might not be entirely
> > applicable in this context, since the set feature we're designing is a new beast
> > in the SQL landscape.
> >
>
> I don't see how it's new, considering relational algebra is pretty much
> based on (multi)sets, and the three-valued logic with NULL values is
> pretty well established part of that.
>
> > I think adhering to the theoretical purity of sets by excluding NULLs aligns us
> > with set theory, simplifies our code, and parallels set implementations in other
> > languages.
> >
>
> I don't see how that would be more theoretically pure, really. The
> three-valued logic is a well established part of relational algebra, so
> not respecting that is more a violation of the purity.
>
> > I think we have an opportunity here to innovate and potentially influence a
> > future set concept in the SQL standard.
> >
>
> I doubt this is going to influence what the SQL standard says, especially
> because it already defined the behavior for MULTISETS (of which the sets
> are a special case, pretty much). So this has 0% chance of success.
>
> > However, I see how one could argue against this reasoning, on the basis that
> > PostgreSQL users might be more familiar with and expect NULLs can exist
> > everywhere in all data structures.
> >
>
> Right, it's what we already do for similar cases, and if you have NULLS
> in the data, you better be aware of the behavior. Granted, some people
> are surprised by three-valued logic, but using a different behavior for
> some new features would just increase the confusion.
>
> > A different perspective is to look at what use-cases we can foresee.
> >
> > I've been trying hard, but I can't find compelling use-cases where a NULL element
> > in a set would offer a more natural SQL query than handling NULLs within SQL and
> > keeping the set NULL-free.
> >
>
> IMO if you have NULL values in the data, you better be aware of it and
> handle the case accordingly (e.g. by filtering them out when building
> the set). If you don't have NULLs in the data, there's no issue.
>
> And in the graph case, I don't see why you'd have any NULLs, considering
> > we're dealing with adjacent nodes, and if there's an adjacent node, its ID
> > is not NULL.
>
> > Does anyone else have a strong realistic example where including NULLs in the
> > set would simplify the SQL query?
> >
>
> I'm sure there are cases where you have NULLs in the data and need to
> filter them out, but that's just natural consequence of having NULLs. If
> you have them you better know what NULLs do ...
>
> It's too early to make any strong statements, but it's going to be hard
> to convince me we should handle NULLs differently from what we already
> do elsewhere.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


>  http://www.wiscorp.com/sqlmultisets.zip

> Conceptually, a multiset is an unordered collection of elements, all of the same type, with duplicates
> permitted. Unlike arrays, a multiset is an unbounded collection, with no declared maximum
> cardinality. This does not mean that the user can insert elements in a multiset without limit, just
> that the standard does not mandate that there should be a limit. This is analogous to tables, which
> have no declared maximum number of rows.

Postgres arrays don't have size limits.
unordered means no need to use subscripts?
So multiset is a more limited array type?

Null is fine. But personally, I feel like so far the hashset's main
feature is quickly aggregating unique values using the hashset.
I found that using hashset to count distinct (non-null) values is quite a bit faster.



Re: Do we want a hashset type?

From
Tomas Vondra
Date:
On 6/20/23 20:08, jian he wrote:
> On Wed, Jun 21, 2023 at 12:25 AM Tomas Vondra
> ...
>>  http://www.wiscorp.com/sqlmultisets.zip
> 
>> Conceptually, a multiset is an unordered collection of elements, all of the same type, with duplicates
>> permitted. Unlike arrays, a multiset is an unbounded collection, with no declared maximum
>> cardinality. This does not mean that the user can insert elements in a multiset without limit, just
>> that the standard does not mandate that there should be a limit. This is analogous to tables, which
>> have no declared maximum number of rows.
> 
> Postgres arrays don't have size limits.

Right. You can say int[5] but we don't enforce that limit (I haven't
checked why, but presumably because we had arrays before the standard
existed, and it was more like a list in LISP or something.)
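
For example, the declared bound is accepted by the parser but simply ignored:

CREATE TABLE t (a int[5]);
INSERT INTO t VALUES ('{1,2,3,4,5,6}');  -- succeeds despite the int[5]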

> unordered means no need to use subscripts?

Yeah - there's no obvious way to subscript the items when there's no
implicit ordering.

> So multiset is a more limited array type?
> 

Yes and no - both are collection types, so there are similarities and
differences. Multiset does not need to keep the ordering, so in this
sense it's a relaxed version of array.


> Null is fine. But personally, I feel like so far the hashset's main
> feature is quickly aggregating unique values using the hashset.
> I found that using hashset to count distinct (non-null) values is quite a bit faster.

True. That's related to fast membership checks.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Tue, Jun 20, 2023, at 18:25, Tomas Vondra wrote:
> On 6/20/23 16:56, Joel Jacobson wrote:
>> The reference to consistency with what we do elsewhere might not be entirely
>> applicable in this context, since the set feature we're designing is a new beast
>> in the SQL landscape.
>
> I don't see how it's new, considering relational algebra is pretty much
> based on (multi)sets, and the three-valued logic with NULL values is
> pretty well established part of that.

What I meant was that the SET feature is new, since it doesn't exist in either PostgreSQL or SQL.

>> I think adhering to the theoretical purity of sets by excluding NULLs aligns us
>> with set theory, simplifies our code, and parallels set implementations in other
>> languages.
>
> I don't see how that would be more theoretically pure, really. The
> three-valued logic is a well established part of relational algebra, so
> not respecting that is more a violation of the purity.

Hmm, I think it's pure in different ways;
Set Theory is well established and is based on two-valued logic,
but at the same time SQL's three-valued logic is also well established.

>> I think we have an opportunity here to innovate and potentially influence a
>> future set concept in the SQL standard.
>
> I doubt this is going to influence what the SQL standard says, especially
> because it already defined the behavior for MULTISETS (of which the sets
> are a special case, pretty much). So this has 0% chance of success.

OK. 0% is 1% too low for me to work with. :)

>> However, I see how one could argue against this reasoning, on the basis that
>> PostgreSQL users might be more familiar with and expect NULLs can exist
>> everywhere in all data structures.
>
> Right, it's what we already do for similar cases, and if you have NULLS
> in the data, you better be aware of the behavior. Granted, some people
> are surprised by three-valued logic, but using a different behavior for
> some new features would just increase the confusion.

Good point.

>> I've been trying hard, but I can't find compelling use-cases where a NULL element
>> in a set would offer a more natural SQL query than handling NULLs within SQL and
>> keeping the set NULL-free.
>
> IMO if you have NULL values in the data, you better be aware of it and
> handle the case accordingly (e.g. by filtering them out when building
> the set). If you don't have NULLs in the data, there's no issue.

As long as the data model and queries ensure there can never be
any NULLs, fine, then there's no issue.

> And in the graph case, I don't see why you'd have any NULLs, considering
> we're dealing with adjacent nodes, and if there's an adjacent node, its ID
> is not NULL.

Me neither, can't see the need for any NULLs there.

>> Does anyone else have a strong realistic example where including NULLs in the
>> set would simplify the SQL query?
>
> I'm sure there are cases where you have NULLs in the data and need to
> filter them out, but that's just natural consequence of having NULLs. If
> you have them you better know what NULLs do ...

What I tried to find was an example of when you wouldn't want to
filter out the NULLs, when you would want to include the NULL
in the set.

If we could just find one such realistic use-case, that would be very
helpful, since it would then completely kill my argument that we could
do without storing a NULL in the set.

> It's too early to make any strong statements, but it's going to be hard
> to convince me we should handle NULLs differently from what we already
> do elsewhere.

I think it's a trade-off, and I don't have any strong preference for the simplicity
of a classical two-valued set-theoretic system vs a three-valued
multiset-based one. I was 51/49 but given your feedback I'm now 49/51.

I think the next step is to think about how the hashset type should work
with three-valued logic, and then implement it to get a feeling for it.

For instance, how should hashset_count() work?

Given the query,

SELECT hashset_count('{1,2,3,null}'::int4hashset);

Should we,

a) treat NULL as a distinct value and return 4?

b) ignore NULL and return 3?

c) return NULL? (since the presence of NULL can be thought to render the entire count indeterminate)

I think my personal preference is (b) since it is then consistent with how COUNT() works.
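
For comparison:

SELECT count(x) FROM (VALUES (1), (2), (3), (NULL)) AS v(x);
 count
-------
     3
(1 row)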

/Joel



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
> This is also what the SQL standard does for multisets - there's SQL:20nn
> draft at http://www.wiscorp.com/SQLStandards.html, and the <member
> predicate> section (p. 475) explains how this should work with NULL.

I've looked again at the paper you mentioned and found something intriguing
in section 2.6 (b). I'm a bit puzzled about this: why would we want to return
null when we're certain it's not null but just doesn't have any elements?

In the same vein, it says, "If it has more than one element, an exception is
raised." Makes sense to me, but what about when there are no elements at all?
Why not raise an exception in that case too?

The ELEMENT function is designed to do one simple thing: return the element of
a multiset if the multiset has only 1 element. This seems very similar to how
our INTO STRICT operates, right?
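
For reference, a minimal sketch of the INTO STRICT behavior being compared
against:

DO $$
DECLARE
    v int;
BEGIN
    -- raises TOO_MANY_ROWS; with zero rows it would raise NO_DATA_FOUND
    SELECT x INTO STRICT v FROM (VALUES (1), (2)) AS t(x);
END $$;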

The SQL:20nn seems to still be in draft form, and I can't help but wonder if we
should propose a bit of an improvement here:

"If it doesn't have exactly one element, an exception is raised."

Meaning, it would raise an exception both if there are more elements,
or zero elements (no elements).

I think this would make the semantics more intuitive and less surprising.

/Joel



Re: Do we want a hashset type?

From
Tomas Vondra
Date:
On 6/22/23 19:52, Joel Jacobson wrote:
> On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
>> This is also what the SQL standard does for multisets - there's SQL:20nn
>> draft at http://www.wiscorp.com/SQLStandards.html, and the <member
>> predicate> section (p. 475) explains how this should work with NULL.
> 
> I've looked again at the paper you mentioned and found something intriguing
> in section 2.6 (b). I'm a bit puzzled about this: why would we want to return
> null when we're certain it's not null but just doesn't have any elements?
> 
> In the same vein, it says, "If it has more than one element, an exception is
> raised." Makes sense to me, but what about when there are no elements at all?
> Why not raise an exception in that case too?
> 
> The ELEMENT function is designed to do one simple thing: return the element of
> a multiset if the multiset has only 1 element. This seems very similar to how
> our INTO STRICT operates, right?
> 

I agree this looks a bit weird, but that's what I mentioned - this is an
initial proposal, outlining the idea. Inevitably some of the stuff
will get reworked or just left out of the final version. It's useful
mostly to explain the motivation / goal.

I believe that's the case here - I don't think the ELEMENT got into the
standard at all, and the NULL rules for the MEMBER OF clause seem not to
have these strange bits.

> The SQL:20nn seems to still be in draft form, and I can't help but wonder if we
> should propose a bit of an improvement here:
> 
> "If it doesn't have exactly one element, an exception is raised."
> 
> Meaning, it would raise an exception both if there are more elements,
> or zero elements (no elements).
> 
> I think this would make the semantics more intuitive and less surprising.
> 

Well, the simple truth is the draft is freely available, but you'd need
to buy the final version. It doesn't mean it's still being worked on or
that no SQL standard was released since then. In fact, SQL 2023 was
released a couple weeks ago [1].

It'd be interesting to know the version that actually got into the SQL
standard (if at all), but I don't have access to the standard yet.

regards


[1] https://www.iso.org/standard/76584.html

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From
jian he
Date:

I played around with array_func.c;
much of the code can be used for a multiset data type.
For now I imagine a multiset as something like a one-dimensional array (nested is somehow beyond the imagination...).

* A standard varlena array has the following internal structure:
 *   <vl_len_> - standard varlena header word
 *   <ndim> - number of dimensions of the array
 *   <dataoffset> - offset to stored data, or 0 if no nulls bitmap
 *   <elemtype> - element type OID
 *   <dimensions> - length of each array axis (C array of int)
 *   <lower bnds> - lower boundary of each dimension (C array of int)
 *   <null bitmap> - bitmap showing locations of nulls (OPTIONAL)
 *   <actual data> - whatever is the stored data

In a set/multiset we don't need {ndim, lower bnds}, since we only have one dimension, and we also don't need subscripts.
So for a set we can have the following:
 *   int32 vl_len_;    /* varlena header (do not touch directly!) */
 *   int32 capacity;   /* number of allocated slots */
 *   int32 dataoffset; /* offset to data, or 0 if no bitmap */
 *   int32 nelements;  /* number of items added to the hashset */
 *   Oid   elemtype;   /* element type OID */
 *   <null bitmap> - bitmap showing locations of nulls (OPTIONAL)
 *   <bitmap> - bitmap showing whether each slot is empty or used (I am not sure about this part)
 *   <actual data> - whatever is the stored data

Much of the code in array_func.c can be reused.
array_isspace     ==> set_isspace
ArrayMetaState  ==> SetMetastate
ArrayCount         ==> SetCount (similar to ArrayCount; returns the dimension of the set, which should be zero (empty set) or one)
ArrayParseState ==> SetParseState
ReadArrayStr     ==>  ReadSetStr

Attached is a demo that shows how to use array_func.c to parse a cstring, with a similar effect to array_in.
For the multiset_in set type input function: if no duplicates are required then multiset_in would be just like array_in, so more code can be copied from array_func.c;
but if uniqueness is required then we first need to palloc0(capacity * datum size per type) and then put each valid value into a specific slot?




On Fri, Jun 23, 2023 at 6:27 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 6/22/23 19:52, Joel Jacobson wrote:
> On Tue, Jun 20, 2023, at 14:10, Tomas Vondra wrote:
>> This is also what the SQL standard does for multisets - there's SQL:20nn
>> draft at http://www.wiscorp.com/SQLStandards.html, and the <member
>> predicate> section (p. 475) explains how this should work with NULL.
>
> I've looked again at the paper you mentioned and found something intriguing
> in section 2.6 (b). I'm a bit puzzled about this: why would we want to return
> null when we're certain it's not null but just doesn't have any elements?
>
> In the same vein, it says, "If it has more than one element, an exception is
> raised." Makes sense to me, but what about when there are no elements at all?
> Why not raise an exception in that case too?
>
> The ELEMENT function is designed to do one simple thing: return the element of
> a multiset if the multiset has only 1 element. This seems very similar to how
> our INTO STRICT operates, right?
>

I agree this looks a bit weird, but that's what I mentioned - this is an
initial proposal, outlining the idea. Inevitably some of the stuff
will get reworked or just left out of the final version. It's useful
mostly to explain the motivation / goal.

I believe that's the case here - I don't think the ELEMENT got into the
standard at all, and the NULL rules for the MEMBER OF clause seem not to
have these strange bits.

> The SQL:20nn seems to still be in draft form, and I can't help but wonder if we
> should propose a bit of an improvement here:
>
> "If it doesn't have exactly one element, an exception is raised."
>
> Meaning, it would raise an exception both if there are more elements,
> or zero elements (no elements).
>
> I think this would make the semantics more intuitive and less surprising.
>

Well, the simple truth is the draft is freely available, but you'd need
to buy the final version. It doesn't mean it's still being worked on or
that no SQL standard was released since then. In fact, SQL 2023 was
released a couple weeks ago [1].

It'd be interesting to know the version that actually got into the SQL
standard (if at all), but I don't have access to the standard yet.

regards


[1] https://www.iso.org/standard/76584.html

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


--
 I recommend David Deutsch's <<The Beginning of Infinity>>

  Jian


Attachments

Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Fri, Jun 23, 2023, at 08:40, jian he wrote:
> I played around with array_func.c;
> much of the code can be used for a multiset data type.
> For now I imagine a multiset as something like a one-dimensional array
> (nested is somehow beyond the imagination...).

Are you suggesting it might be a better idea to start over completely
and work on a new code base that is based on arrayfuncs.c,
and aim for MULTISET/SET or anyhashset from the start, that would not
only support int4/int8/uuid but any type?

/Joel



Re: Do we want a hashset type?

From
jian he
Date:


On Fri, Jun 23, 2023 at 4:23 PM Joel Jacobson <joel@compiler.org> wrote:
On Fri, Jun 23, 2023, at 08:40, jian he wrote:
> I played around with array_func.c;
> much of the code can be used for a multiset data type.
> For now I imagine a multiset as something like a one-dimensional array
> (nested is somehow beyond the imagination...).

Are you suggesting it might be a better idea to start over completely
and work on a new code base that is based on arrayfuncs.c,
and aim for MULTISET/SET or anyhashset from the start, that would not
only support int4/int8/uuid but any type?

/Joel

select prosrc from pg_proc where proname ~* '(hash.*extended)|(extended.*hash)';
returns around 30 rows.
So it's a bit generic?

I tend to think of a set/multiset as a one-dimensional array, so the textual input should look like a one-dimensional array.
Use array_func.c functions to parse and validate the input.
So: different types, one input validation function.
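
For example, the existing array parser already treats quoted and unquoted
elements the same for any element type:

SELECT '{1,2,3}'::int[] = '{"1","2","3"}'::int[];  -- true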

Does this make sense?

Re: Do we want a hashset type?

From
Andrew Dunstan
Date:


On 2023-06-23 Fr 04:23, Joel Jacobson wrote:
On Fri, Jun 23, 2023, at 08:40, jian he wrote:
I played around with array_func.c;
much of the code can be used for a multiset data type.
For now I imagine a multiset as something like a one-dimensional array
(nested is somehow beyond the imagination...).
Are you suggesting it might be a better idea to start over completely
and work on a new code base that is based on arrayfuncs.c,
and aim for MULTISET/SET or anyhashset from the start, that would not
only support int4/int8/uuid but any type?


Before we run too far down this rabbit hole, let's discuss the storage implications of using multisets. ISTM that for small base datums like integers it will be a substantial increase in size, since you'll need an additional int for the item count, unless some very clever tricks are played.

As for this older discussion referred to upthread, if the SQL Standards Committee hasn't acted on it by now it seems reasonable to think they are unlikely to.

Just for reference, here's some description of Oracle's support for Multisets from <https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/Oracle-Support-for-Optional-Features-of-SQLFoundation2011.html#GUID-3BA98AEC-FAAD-4F21-A6AD-F696B5D36D56>:

Multisets in the standard are supported as nested table types in Oracle. The Oracle nested table data type based on a scalar type ST is equivalent, in standard terminology, to a multiset of rows having a single field of type ST and named column_value. The Oracle nested table type based on an object type is equivalent to a multiset of structured type in the standard.

Oracle supports the following elements of this feature on nested tables using the same syntax as the standard has for multisets:

    The CARDINALITY function

    The SET function

    The MEMBER predicate

    The IS A SET predicate

    The COLLECT aggregate

All other aspects of this feature are supported with non-standard syntax, as follows:

    To create an empty multiset, denoted MULTISET[] in the standard, use an empty constructor of the nested table type.

    To obtain the sole element of a multiset with one element, denoted ELEMENT (<multiset value expression>) in the standard, use a scalar subquery to select the single element from the nested table.

    To construct a multiset by enumeration, use the constructor of the nested table type.

    To construct a multiset by query, use CAST with a multiset argument, casting to the nested table type.

    To unnest a multiset, use the TABLE operator in the FROM clause.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Do we want a hashset type?

From
Tomas Vondra
Date:

On 6/23/23 13:47, Andrew Dunstan wrote:
> 
> On 2023-06-23 Fr 04:23, Joel Jacobson wrote:
>> On Fri, Jun 23, 2023, at 08:40, jian he wrote:
>>> I played around with array_func.c;
>>> much of the code can be used for a multiset data type.
>>> For now I imagine a multiset as something like a one-dimensional array
>>> (nested is somehow beyond the imagination...).
>> Are you suggesting it might be a better idea to start over completely
>> and work on a new code base that is based on arrayfuncs.c,
>> and aim for MULTISET/SET or anyhashset from the start, that would not
>> only support int4/int8/uuid but any type?
>>
> 
> Before we run too far down this rabbit hole, let's discuss the storage
> implications of using multisets. ISTM that for small base datums like
> integers it will be a substantial increase in size, since you'll need an
> addition int for the item count, unless some very clever tricks are played.
> 

I honestly don't quite understand what exactly is meant by the proposal
to "reuse array_func.c for multisets". We're implementing sets, not
multisets (those were mentioned only to illustrate behavior). And the
whole point is that sets are not arrays - no duplicates, ordering does
not matter (so no index).

I mentioned that maybe we can model sets based on arrays (say, gram.y
would do similar stuff for SET[] and ARRAY[], polymorphism), not that we
should store sets as arrays. Would it be possible - maybe, if we extend
arrays to also maintain some hash table. But I'd bet that'll just
make arrays more complex, and will make sets slower.

Or maybe I just don't understand the proposal. Perhaps it'd be best if
jian wrote a patch illustrating the idea, and showing how it performs
compared to the current approach.

As for the storage size, I don't think an extra "count" field would make
any measurable difference. If we're storing a hash table, we're bound to
have a couple percent of wasted space due to load factor (likely between
0.75 and 0.9).
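
To put rough numbers on it: at load factor 0.9, one million int4 elements
need about 1.11 million slots, i.e. roughly 4.4 MB of slots for 4 MB of
data, while an extra int32 count field costs a fixed 4 bytes per set.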

> As for this older discussion referred to upthread, if the SQL Standards
> Committee hasn't acted on it by now it seem reasonable to think they are
> unlikely to.
> 

AFAIK multisets are included in SQL 2023, pretty much matching the draft
we discussed earlier. Yeah, it's unlikely to change in the future.

> Just for reference, Here's some description of Oracle's suport for
> Multisets from
>
<https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/Oracle-Support-for-Optional-Features-of-SQLFoundation2011.html#GUID-3BA98AEC-FAAD-4F21-A6AD-F696B5D36D56>:
> 

good to know


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Thu, Jun 22, 2023, at 07:51, Joel Jacobson wrote:
> For instance, how should hashset_count() work?
>
> Given the query,
>
> SELECT hashset_count('{1,2,3,null}'::int4hashset);
>
> Should we,
>
> a) treat NULL as a distinct value and return 4?
>
> b) ignore NULL and return 3?
>
> c) return NULL? (since the presence of NULL can be thought to render 
> the entire count indeterminate)
>
> I think my personal preference is (b) since it is then consistent with 
> how COUNT() works.

Having thought a bit more on this matter,
I think it's better to remove hashset_count() since the semantics are not obvious,
and instead provide a hashset_cardinality() function that would obviously
include a possible null value in the number of elements:

SELECT hashset_cardinality('{1,2,3,null}'::int4hashset);
4

SELECT hashset_cardinality('{null}'::int4hashset);
1

SELECT hashset_cardinality('{null,null}'::int4hashset);
1

SELECT hashset_cardinality('{}'::int4hashset);
0

SELECT hashset_cardinality(NULL::int4hashset);
NULL

Sounds good?

/Joel



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
New version of int4hashset_contains() that should follow the same
General Rules as MULTISET's MEMBER OF (8.16 <member predicate>).

The first rule is to return False if the cardinality is 0 (zero).
However, we must first check if the first argument is null,
in which case the cardinality cannot be 0 (zero),
so if the first argument is null then we return Unknown
(represented as null).

We then proceed and check if the set is empty,
which is defined as nelements being 0 (zero)
as well as the new null_element field being false.
If the set is empty, then we always return False,
regardless of the second argument, that is,
even if it would be null we would still return False,
since the set is empty and can therefore not contain
any element.

The second rule is to return Unknown (represented as null)
if any of the arguments are null. We've already checked that
the first argument is not null, so now we check the second
argument, and return Unknown (represented as null) if it is null.

The third rule is to check for the element, and return True if
the set contains the element. Otherwise, if the set contains
the null element, we don't know if the element we're checking
for is in the set, so we then return Unknown (represented as null).
Finally, if the set contains neither the null element nor the
element we're checking for, then we return False.

Datum
int4hashset_contains(PG_FUNCTION_ARGS)
{
    int4hashset_t  *set;
    int32            value;
    bool            result;

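    /* a NULL set makes the result Unknown (returned as SQL NULL) */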
    if (PG_ARGISNULL(0))
        PG_RETURN_NULL();

    set = PG_GETARG_INT4HASHSET(0);

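    /* first rule: an empty set contains nothing, so return False */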
    if (set->nelements == 0 && !set->null_element)
        PG_RETURN_BOOL(false);

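    /* second rule: a NULL element in a non-empty set yields Unknown */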
    if (PG_ARGISNULL(1))
        PG_RETURN_NULL();

    value = PG_GETARG_INT32(1);
    result = int4hashset_contains_element(set, value);

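    /* third rule: not found, but a null element is present: Unknown */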
    if (!result && set->null_element)
        PG_RETURN_NULL();

    PG_RETURN_BOOL(result);
}

Example queries and expected results:

SELECT hashset_contains(NULL::int4hashset, NULL::int); -- null
SELECT hashset_contains(NULL::int4hashset, 1::int); -- null
SELECT hashset_contains('{}'::int4hashset, NULL::int); -- false
SELECT hashset_contains('{}'::int4hashset, 1::int); -- false
SELECT hashset_contains('{null}'::int4hashset, NULL::int); -- null
SELECT hashset_contains('{null}'::int4hashset, 1::int); -- null
SELECT hashset_contains('{1}'::int4hashset, NULL::int); -- null
SELECT hashset_contains('{1}'::int4hashset, 1::int); -- true
SELECT hashset_contains('{1}'::int4hashset, 2::int); -- false

Looks good?

/Joel



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Sat, Jun 24, 2023, at 21:16, Joel Jacobson wrote:
> New version of int4hashset_contains() that should follow the same
> General Rules as MULTISET's MEMBER OF (8.16 <member predicate>).
...
> SELECT hashset_contains('{}'::int4hashset, NULL::int); -- false
...
> SELECT hashset_contains('{null}'::int4hashset, NULL::int); -- null

When it comes to SQL, the general rule of thumb is that expressions and functions 
handling null usually return the null value. This is why it might feel a bit out 
of the ordinary to return False when checking if an empty set contains NULL.

However, that's my understanding of the General Rules on page 553 of 
ISO/IEC 9075-2:2023(E). Rule 3 Case a) specifically states:

    "If N is 0 (zero), then the <member predicate> is False.",

where N is the cardinality, and for an empty set, that's 0 (zero).

Rule 3 Case b) goes on to say:

    "If at least one of XV and MV is the null value, then the 
    <member predicate> is Unknown."

But since b) follows a), and the condition for a) already matches, b) is out of 
the running. This leads me to believe that the result of:

    SELECT hashset_contains('{}'::int4hashset, NULL::int);

would be False, according to the General Rules.

Now, this is based on the assumption that the Case conditions are evaluated in 
sequence, stopping at the first match. Does that assumption hold water?

Applying the same rules, we'd have to return Unknown (which we represent as
null) for:

    SELECT hashset_contains('{null}'::int4hashset, NULL::int);

Here, since the cardinality N is 1, Case a) doesn't apply, but Case b) does 
since XV is null.

Looking ahead, we're entertaining the possibility of a future SET SQL-syntax
feature and wondering how our hashset type could be adapted to be compatible and
reusable for such a development. It's a common prediction that any future SET
syntax feature would probably operate on Three-Valued Logic. Therefore, it's key
for our hashset to handle null values, whether storing, identifying, or adding
them.

But here's my two cents, and remember it's just a personal viewpoint. I'm not so 
sure that the hashset type functions need to mirror the corresponding MULTISET 
language constructs exactly. In my book, our hashset catalog functions could
take a more clear-cut route with null handling, as long as our data structure is
prepared to handle null values.

Think about this possibility:

hashset_contains_null(int4hashset) -> boolean
hashset_add_null(int4hashset) -> int4hashset
hashset_contains(..., NULL) -> ERROR
hashset_add(..., NULL) -> ERROR

In my mind, this explicit null handling could simplify things, clear up any
potential confusion, and at the same time pave the way for compatibility with
any future SET SQL-syntax feature.
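
A sketch of how the proposed functions would behave (hypothetical, matching
the signatures above):

SELECT hashset_add_null('{1,2}'::int4hashset);          -- {1,2,NULL}
SELECT hashset_contains_null('{1,NULL}'::int4hashset);  -- true
SELECT hashset_contains_null('{1}'::int4hashset);       -- false
SELECT hashset_contains('{1,2}'::int4hashset, NULL);    -- ERROR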

Thoughts?

/Joel



Re: Do we want a hashset type?

From
jian he
Date:
> Or maybe I just don't understand the proposal. Perhaps it'd be best if
> jian wrote a patch illustrating the idea, and showing how it performs
> compared to the current approach.

Currently Joel's idea is an int4hashset, based on the code Tomas first wrote.
It looks like a non-nested collection of unique int4 values. The external text format looks like {int4, int4, int4};
the structure looks like (header + capacity slots * int4).
Within the capacity slots, some slots are empty, some have unique values.

The textual int4hashset looks like a one-dimensional array,
so I copied/imitated the src/backend/utils/adt/arrayfuncs.c code and rewrote a slightly generic hashset input and output function.

See the attached C file.
It works fine for non-null input/output for (int4hashset, int8hashset, timestamphashset, intervalhashset, uuidhashset).
Attachments

Re: Do we want a hashset type?

From
Tomas Vondra
Date:

On 6/25/23 15:32, jian he wrote:
>> Or maybe I just don't understand the proposal. Perhaps it'd be best if
>> jian wrote a patch illustrating the idea, and showing how it performs
>> compared to the current approach.
> 
> Currently Joel's idea is an int4hashset, based on the code Tomas first wrote.
> It looks like a non-nested collection of unique int4 values. The external
> text format looks like {int4, int4, int4};
> the structure looks like (header + capacity slots * int4).
> Within the capacity slots, some slots are empty, some have unique values.
> 
> The textual int4hashset looks like a one-dimensional array,
> so I copied/imitated the src/backend/utils/adt/arrayfuncs.c code and
> rewrote a slightly generic hashset input and output function.
> 
> See the attached C file.
> It works fine for non-null input/output for (int4hashset, int8hashset,
> timestamphashset, intervalhashset, uuidhashset).

So how do you define a table with a "set" column? I mean, with the
original patch we could have done

   CREATE TABLE t (a int4hashset);

and then store / query this. How do you do that with this approach?

I've looked at the patch only very briefly - it's really difficult to
grok such patches - large, with half the comments possibly obsolete etc.
So what does reusing the array code give us, really?

I'm not against reusing some of the array code, but arrays seem to be
much more elaborate (multiple dimensions, ...) so the code needs to do
significantly more stuff in various cases.

When I previously suggested that maybe we should get "inspiration" from
the array code, I was mostly talking about (a) type polymorphism, i.e.
doing sets for arbitrary types, and (b) integrating this into grammar
(instead of using functions).

I don't see how copying arrayfuncs.c like this achieves either of these
things. It still hardcodes just a handful of selected data types, and
the array polymorphism relies on automatic creation of array type for
every scalar type.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Sun, Jun 25, 2023, at 11:42, Joel Jacobson wrote:
>     SELECT hashset_contains('{}'::int4hashset, NULL::int);
>
> would be False, according to the General Rules.
>
...
> Applying the same rules, we'd have to return Unknown (which we represent as
> null) for:
>
>     SELECT hashset_contains('{null}'::int4hashset, NULL::int);
>

Aha! I just discovered to my surprise that the corresponding array
queries give the same results:

SELECT NULL = ANY(ARRAY[]::int[]);
 ?column?
----------
 f
(1 row)

SELECT NULL = ANY(ARRAY[NULL]::int[]);
 ?column?
----------

(1 row)

I have no more objections; let's stick to the same null semantics as arrays and multisets.

/Joel



Re: Do we want a hashset type?

From
jian he
Date:


On Mon, Jun 26, 2023 at 2:56 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 6/25/23 15:32, jian he wrote:
> >> Or maybe I just don't understand the proposal. Perhaps it'd be best if
> >> jian wrote a patch illustrating the idea, and showing how it performs
> >> compared to the current approach.
> >
> > Currently Joel's idea is an int4hashset, based on the code Tomas first wrote.
> > It looks like a non-nested collection of unique int4 values. The external
> > text format looks like {int4, int4, int4};
> > the structure looks like (header + capacity slots * int4).
> > Within the capacity slots, some slots are empty, some have unique values.
> >
> > The textual int4hashset looks like a one-dimensional array,
> > so I copied/imitated the src/backend/utils/adt/arrayfuncs.c code and
> > rewrote a slightly generic hashset input and output function.
> >
> > See the attached C file.
> > It works fine for non-null input/output for (int4hashset, int8hashset,
> > timestamphashset, intervalhashset, uuidhashset).
>
> So how do you define a table with a "set" column? I mean, with the
> original patch we could have done
>
>    CREATE TABLE t (a int4hashset);
>
> and then store / query this. How do you do that with this approach?
>
> I've looked at the patch only very briefly - it's really difficult to
> grok such patches - large, with half the comments possibly obsolete etc.
> So what does reusing the array code give us, really?
>
> I'm not against reusing some of the array code, but arrays seem to be
> much more elaborate (multiple dimensions, ...) so the code needs to do
> significantly more stuff in various cases.
>
> When I previously suggested that maybe we should get "inspiration" from
> the array code, I was mostly talking about (a) type polymorphism, i.e.
> doing sets for arbitrary types, and (b) integrating this into grammar
> (instead of using functions).
>
> I don't see how copying arrayfuncs.c like this achieves either of these
> things. It still hardcodes just a handful of selected data types, and
> the array polymorphism relies on automatic creation of array type for
> every scalar type.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

You are right.
I misread the part of sql-createtype.html about the type input_function being able to take 3 arguments (cstring, oid, integer).
I thought that while creating data types, I could pass different params to the input_function.

Re: Do we want a hashset type?

From
jian he
Date:


On Mon, Jun 26, 2023 at 4:36 AM Joel Jacobson <joel@compiler.org> wrote:
>
> On Sun, Jun 25, 2023, at 11:42, Joel Jacobson wrote:
> >     SELECT hashset_contains('{}'::int4hashset, NULL::int);
> >
> > would be False, according to the General Rules.
> >
> ...
> > Applying the same rules, we'd have to return Unknown (which we represent as
> > null) for:
> >
> >     SELECT hashset_contains('{null}'::int4hashset, NULL::int);
> >
>
> Aha! I just discovered to my surprise that the corresponding array
> queries give the same results:
>
> SELECT NULL = ANY(ARRAY[]::int[]);
>  ?column?
> ----------
>  f
> (1 row)
>
> SELECT NULL = ANY(ARRAY[NULL]::int[]);
>  ?column?
> ----------
>
> (1 row)
>
> I have no more objections; let's stick to the same null semantics as arrays and multisets.
>
> /Joel

Can you try to glue the attached to the hashset data type input function?
The attached will parse a cstring with or without double quotes, so '{1,2,3}' == '{"1","2","3"}'; obviously quoting will preserve the inner string as-is.
Currently int4hashset input is delimited by commas; if you want to deal with ranges then you need to escape the comma.


Attachments

Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Mon, Jun 26, 2023, at 13:06, jian he wrote:
> Can you try to glue the attached to the hashset data type input
> function?
> The attached will parse a cstring with or without double quotes, so '{1,2,3}'
> == '{"1","2","3"}'; obviously quoting will preserve the inner string
> as-is.
> Currently int4hashset input is delimited by commas; if you want to deal
> with ranges then you need to escape the comma.

Not sure what you're trying to do here; what's the problem with
the current int4hashset_in()?

I think it might be best to focus on null semantics / three-valued logic
before moving on and trying to implement support for more types,
otherwise we would need to rewrite more code if we find general
thinkos that are problems in all types.

Help wanted to reason about what the following queries should return:

SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);

SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);

SELECT hashset_difference(NULL::int4hashset, '{}'::int4hashset);

SELECT hashset_symmetric_difference(NULL::int4hashset, '{}'::int4hashset);

Should they return NULL, the empty set or something else?

I've renamed hashset_merge() -> hashset_union() to better match
SQL's MULTISET feature which has a MULTISET UNION.

/Joel



Re: Do we want a hashset type?

From
Kirk Wolak
Date:
On Mon, Jun 26, 2023 at 4:55 PM Joel Jacobson <joel@compiler.org> wrote:
On Mon, Jun 26, 2023, at 13:06, jian he wrote:
> Can you try to glue the attached to the hashset data type input
> function?
> The attached will parse a cstring with or without double quotes, so '{1,2,3}'
> == '{"1","2","3"}'; obviously quoting will preserve the inner string
> as-is.
> Currently int4hashset input is delimited by commas; if you want to deal
> with ranges then you need to escape the comma.

Not sure what you're trying to do here; what's the problem with
the current int4hashset_in()?

I think it might be best to focus on null semantics / three-valued logic
before moving on and trying to implement support for more types,
otherwise we would need to rewrite more code if we find general
thinkos that are problems in all types.

Help wanted to reason about what the following queries should return:

SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);

SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);

SELECT hashset_difference(NULL::int4hashset, '{}'::int4hashset);

SELECT hashset_symmetric_difference(NULL::int4hashset, '{}'::int4hashset);

Should they return NULL, the empty set or something else?

I've renamed hashset_merge() -> hashset_union() to better match
SQL's MULTISET feature which has a MULTISET UNION.

Shouldn't they return the same thing that left(NULL::text,1) returns? (NULL)...
Typically any operation on NULL is NULL.

Kirk...


 

Re: Do we want a hashset type?

From
jian he
Date:
On Tue, Jun 27, 2023 at 4:55 AM Joel Jacobson <joel@compiler.org> wrote:
>
> On Mon, Jun 26, 2023, at 13:06, jian he wrote:
> > Can you try to glue the attached to the hashset data type input
> > function?
> > The attached will parse a cstring with or without double quotes, so '{1,2,3}'
> > == '{"1","2","3"}'; obviously quoting will preserve the inner string
> > as-is.
> > Currently int4hashset input is delimited by commas; if you want to deal
> > with ranges then you need to escape the comma.
>
> Not sure what you're trying to do here; what's the problem with
> the current int4hashset_in()?
>
> I think it might be best to focus on null semantics / three-valued logic
> before moving on and trying to implement support for more types,
> otherwise we would need to rewrite more code if we find general
> thinkos that are problems in all types.
>
> Help wanted to reason about what the following queries should return:
>
> SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_difference(NULL::int4hashset, '{}'::int4hashset);
>
> SELECT hashset_symmetric_difference(NULL::int4hashset, '{}'::int4hashset);
>
> Should they return NULL, the empty set or something else?
>
> I've renamed hashset_merge() -> hashset_union() to better match
> SQL's MULTISET feature which has a MULTISET UNION.
>
> /Joel

In SQLMultiSets.pdf (posted earlier in the thread) I found a related explanation
on pages 45-46.

(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL
      ELSE MULTISET (SELECT T1.V FROM UNNEST (OP1) AS T1 (V)
                     INTERSECT SQ
                     SELECT T2.V FROM UNNEST (OP2) AS T2 (V)) END)

(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL
      ELSE MULTISET (SELECT T1.V FROM UNNEST (OP1) AS T1 (V)
                     UNION SQ
                     SELECT T2.V FROM UNNEST (OP2) AS T2 (V)) END)

(CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL
      ELSE MULTISET (SELECT T1.V FROM UNNEST (OP1) AS T1 (V)
                     EXCEPT SQ
                     SELECT T2.V FROM UNNEST (OP2) AS T2 (V)) END)
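
In other words, following that rule any NULL operand makes the whole result
NULL; for example:

SELECT hashset_union(NULL::int4hashset, '{}'::int4hashset);         -- NULL
SELECT hashset_intersection(NULL::int4hashset, '{}'::int4hashset);  -- NULL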

On page 11:

> Unlike the corresponding table operators UNION, INTERSECT and EXCEPT, we have chosen ALL as the default, since this
> is the most natural interpretation of MULTISET UNION, etc.

Also on page 11, the aggregate named FUSION. (I like the name.................)



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Tue, Jun 27, 2023, at 04:35, jian he wrote:
> In SQLMultiSets.pdf (posted earlier in the thread) I found a related explanation
> on pages 45-46.
>
> (CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
> T1.V FROM UNNEST (OP1) AS T1 (V) INTERSECT SQ SELECT T2.V FROM UNNEST
> (OP2) AS T2 (V) ) END)
>
> CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
> T1.V FROM UNNEST (OP1) AS T1 (V) UNION SQ SELECT T2.V FROM UNNEST
> (OP2) AS T2 (V) ) END
>
> (CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
> T1.V FROM UNNEST (OP1) AS T1 (V) EXCEPT SQ SELECT T2.V FROM UNNEST
> (OP2) AS T2 (V) ) END)

Thanks! This was exactly what I was looking for; I knew I'd seen it but failed to find it.

Attached is a new incremental patch as well as a full patch, since this is a substantial change:

    Align null semantics with SQL:2023 array and multiset standards

    * Introduced a new boolean field, null_element, in the int4hashset_t type.

    * Rename hashset_count() to hashset_cardinality().

    * Rename hashset_merge() to hashset_union().

    * Rename hashset_equals() to hashset_eq().

    * Rename hashset_neq() to hashset_ne().

    * Add hashset_to_sorted_array().

    * Handle null semantics to work as in arrays and multisets.

    * Update int4hashset_add() to allow creating a new set if none exists.

    * Use more portable int32 typedef instead of int32_t.

    This also adds a thorough test suite in array-and-multiset-semantics.sql,
    which aims to test all relevant combinations of operations and values.

 Makefile                                       |   2 +-
 README.md                                      |   6 ++--
 hashset--0.0.1.sql                             |  37 ++++++----
 hashset-api.c                                  | 208 +++++++++++++------
 hashset.c                                      |  12 +++++-
 hashset.h                                      |  11 ++++--
 test/expected/array-and-multiset-semantics.out | 365 +++++++++++++++++++
 test/expected/basic.out                        |  12 ++++--
 test/expected/reported_bugs.out                |   6 ++--
 test/expected/strict.out                       | 114 ---------
 test/expected/table.out                        |   8 ++--
 test/sql/array-and-multiset-semantics.sql      | 232 ++++++++++++++
 test/sql/basic.sql                             |   4 +--
 test/sql/benchmark.sql                         |  14 ++++--
 test/sql/reported_bugs.sql                     |   6 ++--
 test/sql/strict.sql                            |  32 ------
 test/sql/table.sql                             |   2 +-
 17 files changed, 823 insertions(+), 248 deletions(-)

/Joel
Attachments

Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Tue, Jun 27, 2023, at 10:26, Joel Jacobson wrote:
> Attachments:
> * hashset-0.0.1-b7e5614-full.patch
> * hashset-0.0.1-b7e5614-incremental.patch

To help verify the semantics, I thought it might be helpful to provide
a comprehensive set of examples that tries to cover all the different ways of varying
the arguments to the functions.

Please let me know if you find any possible errors or if you think it looks good.

SELECT NULL::int4hashset;
 int4hashset
-------------

(1 row)

SELECT '{}'::int4hashset;
 int4hashset
-------------
 {}
(1 row)

SELECT int4hashset();
 int4hashset
-------------
 {}
(1 row)

SELECT '{NULL}'::int4hashset;
 int4hashset
-------------
 {NULL}
(1 row)

SELECT '{NULL,NULL}'::int4hashset;
 int4hashset
-------------
 {NULL}
(1 row)

SELECT '{1,3,2,NULL,2,NULL,3,1}'::int4hashset;
 int4hashset
--------------
 {2,1,3,NULL}
(1 row)

SELECT hashset_add(NULL, NULL);
 hashset_add
-------------
 {NULL}
(1 row)

SELECT hashset_add(NULL, 1);
 hashset_add
-------------
 {1}
(1 row)

SELECT hashset_add('{}', 1);
 hashset_add
-------------
 {1}
(1 row)

SELECT hashset_add('{NULL}', 1);
 hashset_add
-------------
 {1,NULL}
(1 row)

SELECT hashset_add('{1}', 1);
 hashset_add
-------------
 {1}
(1 row)

SELECT hashset_add('{1}', 2);
 hashset_add
-------------
 {1,2}
(1 row)

SELECT hashset_add('{1}', NULL);
 hashset_add
-------------
 {1,NULL}
(1 row)

SELECT hashset_contains(NULL, NULL);
 hashset_contains
------------------

(1 row)

SELECT hashset_contains('{}', NULL);
 hashset_contains
------------------
 f
(1 row)

SELECT hashset_contains('{NULL}', NULL);
 hashset_contains
------------------

(1 row)

SELECT hashset_contains('{1}', 1);
 hashset_contains
------------------
 t
(1 row)

SELECT hashset_contains('{1,NULL}', 1);
 hashset_contains
------------------
 t
(1 row)

SELECT hashset_contains('{1}', 2);
 hashset_contains
------------------
 f
(1 row)

SELECT hashset_contains('{1,NULL}', 2);
 hashset_contains
------------------

(1 row)

SELECT hashset_to_array(NULL);
 hashset_to_array
------------------

(1 row)

SELECT hashset_to_array('{}');
 hashset_to_array
------------------
 {}
(1 row)

SELECT hashset_to_array('{NULL}');
 hashset_to_array
------------------
 {NULL}
(1 row)

SELECT hashset_to_array('{3,1,NULL,2}');
 hashset_to_array
------------------
 {1,3,2,NULL}
(1 row)

SELECT hashset_to_sorted_array(NULL);
 hashset_to_sorted_array
-------------------------

(1 row)

SELECT hashset_to_sorted_array('{}');
 hashset_to_sorted_array
-------------------------
 {}
(1 row)

SELECT hashset_to_sorted_array('{NULL}');
 hashset_to_sorted_array
-------------------------
 {NULL}
(1 row)

SELECT hashset_to_sorted_array('{3,1,NULL,2}');
 hashset_to_sorted_array
-------------------------
 {1,2,3,NULL}
(1 row)

SELECT hashset_cardinality(NULL);
 hashset_cardinality
---------------------

(1 row)

SELECT hashset_cardinality('{}');
 hashset_cardinality
---------------------
                   0
(1 row)

SELECT hashset_cardinality('{NULL}');
 hashset_cardinality
---------------------
                   1
(1 row)

SELECT hashset_cardinality('{NULL,NULL}');
 hashset_cardinality
---------------------
                   1
(1 row)

SELECT hashset_cardinality('{1}');
 hashset_cardinality
---------------------
                   1
(1 row)

SELECT hashset_cardinality('{1,1}');
 hashset_cardinality
---------------------
                   1
(1 row)

SELECT hashset_cardinality('{1,2}');
 hashset_cardinality
---------------------
                   2
(1 row)

SELECT hashset_cardinality('{1,2,NULL}');
 hashset_cardinality
---------------------
                   3
(1 row)

SELECT hashset_union(NULL, NULL);
 hashset_union
---------------

(1 row)

SELECT hashset_union(NULL, '{}');
 hashset_union
---------------

(1 row)

SELECT hashset_union('{}', NULL);
 hashset_union
---------------

(1 row)

SELECT hashset_union('{}', '{}');
 hashset_union
---------------
 {}
(1 row)

SELECT hashset_union('{}', '{NULL}');
 hashset_union
---------------
 {NULL}
(1 row)

SELECT hashset_union('{NULL}', '{}');
 hashset_union
---------------
 {NULL}
(1 row)

SELECT hashset_union('{NULL}', '{NULL}');
 hashset_union
---------------
 {NULL}
(1 row)

SELECT hashset_union('{}', '{1}');
 hashset_union
---------------
 {1}
(1 row)

SELECT hashset_union('{1}', '{}');
 hashset_union
---------------
 {1}
(1 row)

SELECT hashset_union('{1}', '{1}');
 hashset_union
---------------
 {1}
(1 row)

SELECT hashset_union('{1}', NULL);
 hashset_union
---------------

(1 row)

SELECT hashset_union(NULL, '{1}');
 hashset_union
---------------

(1 row)

SELECT hashset_union('{1}', '{NULL}');
 hashset_union
---------------
 {1,NULL}
(1 row)

SELECT hashset_union('{NULL}', '{1}');
 hashset_union
---------------
 {1,NULL}
(1 row)

SELECT hashset_union('{1}', '{2}');
 hashset_union
---------------
 {1,2}
(1 row)

SELECT hashset_union('{1,2}', '{2,3}');
 hashset_union
---------------
 {3,1,2}
(1 row)

SELECT hashset_intersection(NULL, NULL);
 hashset_intersection
----------------------

(1 row)

SELECT hashset_intersection(NULL, '{}');
 hashset_intersection
----------------------

(1 row)

SELECT hashset_intersection('{}', NULL);
 hashset_intersection
----------------------

(1 row)

SELECT hashset_intersection('{}', '{}');
 hashset_intersection
----------------------
 {}
(1 row)

SELECT hashset_intersection('{}', '{NULL}');
 hashset_intersection
----------------------
 {}
(1 row)

SELECT hashset_intersection('{NULL}', '{}');
 hashset_intersection
----------------------
 {}
(1 row)

SELECT hashset_intersection('{NULL}', '{NULL}');
 hashset_intersection
----------------------
 {NULL}
(1 row)

SELECT hashset_intersection('{}', '{1}');
 hashset_intersection
----------------------
 {}
(1 row)

SELECT hashset_intersection('{1}', '{}');
 hashset_intersection
----------------------
 {}
(1 row)

SELECT hashset_intersection('{1}', '{1}');
 hashset_intersection
----------------------
 {1}
(1 row)

SELECT hashset_intersection('{1}', NULL);
 hashset_intersection
----------------------

(1 row)

SELECT hashset_intersection(NULL, '{1}');
 hashset_intersection
----------------------

(1 row)

SELECT hashset_intersection('{1}', '{NULL}');
 hashset_intersection
----------------------
 {}
(1 row)

SELECT hashset_intersection('{NULL}', '{1}');
 hashset_intersection
----------------------
 {}
(1 row)

SELECT hashset_intersection('{1}', '{2}');
 hashset_intersection
----------------------
 {}
(1 row)

SELECT hashset_intersection('{1,2}', '{2,3}');
 hashset_intersection
----------------------
 {2}
(1 row)

SELECT hashset_difference(NULL, NULL);
 hashset_difference
--------------------

(1 row)

SELECT hashset_difference(NULL, '{}');
 hashset_difference
--------------------

(1 row)

SELECT hashset_difference('{}', NULL);
 hashset_difference
--------------------

(1 row)

SELECT hashset_difference('{}', '{}');
 hashset_difference
--------------------
 {}
(1 row)

SELECT hashset_difference('{}', '{NULL}');
 hashset_difference
--------------------
 {}
(1 row)

SELECT hashset_difference('{NULL}', '{}');
 hashset_difference
--------------------
 {NULL}
(1 row)

SELECT hashset_difference('{NULL}', '{NULL}');
 hashset_difference
--------------------
 {}
(1 row)

SELECT hashset_difference('{}', '{1}');
 hashset_difference
--------------------
 {}
(1 row)

SELECT hashset_difference('{1}', '{}');
 hashset_difference
--------------------
 {1}
(1 row)

SELECT hashset_difference('{1}', '{1}');
 hashset_difference
--------------------
 {}
(1 row)

SELECT hashset_difference('{1}', NULL);
 hashset_difference
--------------------

(1 row)

SELECT hashset_difference(NULL, '{1}');
 hashset_difference
--------------------

(1 row)

SELECT hashset_difference('{1}', '{NULL}');
 hashset_difference
--------------------
 {1}
(1 row)

SELECT hashset_difference('{NULL}', '{1}');
 hashset_difference
--------------------
 {NULL}
(1 row)

SELECT hashset_difference('{1}', '{2}');
 hashset_difference
--------------------
 {1}
(1 row)

SELECT hashset_difference('{1,2}', '{2,3}');
 hashset_difference
--------------------
 {1}
(1 row)

SELECT hashset_symmetric_difference(NULL, NULL);
 hashset_symmetric_difference
------------------------------

(1 row)

SELECT hashset_symmetric_difference(NULL, '{}');
 hashset_symmetric_difference
------------------------------

(1 row)

SELECT hashset_symmetric_difference('{}', NULL);
 hashset_symmetric_difference
------------------------------

(1 row)

SELECT hashset_symmetric_difference('{}', '{}');
 hashset_symmetric_difference
------------------------------
 {}
(1 row)

SELECT hashset_symmetric_difference('{}', '{NULL}');
 hashset_symmetric_difference
------------------------------
 {NULL}
(1 row)

SELECT hashset_symmetric_difference('{NULL}', '{}');
 hashset_symmetric_difference
------------------------------
 {NULL}
(1 row)

SELECT hashset_symmetric_difference('{NULL}', '{NULL}');
 hashset_symmetric_difference
------------------------------
 {}
(1 row)

SELECT hashset_symmetric_difference('{}', '{1}');
 hashset_symmetric_difference
------------------------------
 {1}
(1 row)

SELECT hashset_symmetric_difference('{1}', '{}');
 hashset_symmetric_difference
------------------------------
 {1}
(1 row)

SELECT hashset_symmetric_difference('{1}', '{1}');
 hashset_symmetric_difference
------------------------------
 {}
(1 row)

SELECT hashset_symmetric_difference('{1}', NULL);
 hashset_symmetric_difference
------------------------------

(1 row)

SELECT hashset_symmetric_difference(NULL, '{1}');
 hashset_symmetric_difference
------------------------------

(1 row)

SELECT hashset_symmetric_difference('{1}', '{NULL}');
 hashset_symmetric_difference
------------------------------
 {1,NULL}
(1 row)

SELECT hashset_symmetric_difference('{NULL}', '{1}');
 hashset_symmetric_difference
------------------------------
 {1,NULL}
(1 row)

SELECT hashset_symmetric_difference('{1}', '{2}');
 hashset_symmetric_difference
------------------------------
 {1,2}
(1 row)

SELECT hashset_symmetric_difference('{1,2}', '{2,3}');
 hashset_symmetric_difference
------------------------------
 {1,3}
(1 row)

SELECT hashset_eq(NULL, NULL);
 hashset_eq
------------

(1 row)

SELECT hashset_eq(NULL, '{}');
 hashset_eq
------------

(1 row)

SELECT hashset_eq('{}', NULL);
 hashset_eq
------------

(1 row)

SELECT hashset_eq('{}', '{}');
 hashset_eq
------------
 t
(1 row)

SELECT hashset_eq('{}', '{NULL}');
 hashset_eq
------------
 f
(1 row)

SELECT hashset_eq('{NULL}', '{}');
 hashset_eq
------------
 f
(1 row)

SELECT hashset_eq('{NULL}', '{NULL}');
 hashset_eq
------------
 t
(1 row)

SELECT hashset_eq('{}', '{1}');
 hashset_eq
------------
 f
(1 row)

SELECT hashset_eq('{1}', '{}');
 hashset_eq
------------
 f
(1 row)

SELECT hashset_eq('{1}', '{1}');
 hashset_eq
------------
 t
(1 row)

SELECT hashset_eq('{1}', NULL);
 hashset_eq
------------

(1 row)

SELECT hashset_eq(NULL, '{1}');
 hashset_eq
------------

(1 row)

SELECT hashset_eq('{1}', '{NULL}');
 hashset_eq
------------
 f
(1 row)

SELECT hashset_eq('{NULL}', '{1}');
 hashset_eq
------------
 f
(1 row)

SELECT hashset_eq('{1}', '{2}');
 hashset_eq
------------
 f
(1 row)

SELECT hashset_eq('{1,2}', '{2,3}');
 hashset_eq
------------
 f
(1 row)

SELECT hashset_ne(NULL, NULL);
 hashset_ne
------------

(1 row)

SELECT hashset_ne(NULL, '{}');
 hashset_ne
------------

(1 row)

SELECT hashset_ne('{}', NULL);
 hashset_ne
------------

(1 row)

SELECT hashset_ne('{}', '{}');
 hashset_ne
------------
 f
(1 row)

SELECT hashset_ne('{}', '{NULL}');
 hashset_ne
------------
 t
(1 row)

SELECT hashset_ne('{NULL}', '{}');
 hashset_ne
------------
 t
(1 row)

SELECT hashset_ne('{NULL}', '{NULL}');
 hashset_ne
------------
 f
(1 row)

SELECT hashset_ne('{}', '{1}');
 hashset_ne
------------
 t
(1 row)

SELECT hashset_ne('{1}', '{}');
 hashset_ne
------------
 t
(1 row)

SELECT hashset_ne('{1}', '{1}');
 hashset_ne
------------
 f
(1 row)

SELECT hashset_ne('{1}', NULL);
 hashset_ne
------------

(1 row)

SELECT hashset_ne(NULL, '{1}');
 hashset_ne
------------

(1 row)

SELECT hashset_ne('{1}', '{NULL}');
 hashset_ne
------------
 t
(1 row)

SELECT hashset_ne('{NULL}', '{1}');
 hashset_ne
------------
 t
(1 row)

SELECT hashset_ne('{1}', '{2}');
 hashset_ne
------------
 t
(1 row)

SELECT hashset_ne('{1,2}', '{2,3}');
 hashset_ne
------------
 t
(1 row)

/Joel



Re: Do we want a hashset type?

From
jian he
Date:
On Tue, Jun 27, 2023 at 4:27 PM Joel Jacobson <joel@compiler.org> wrote:
>
> On Tue, Jun 27, 2023, at 04:35, jian he wrote:
> > in SQLMultiSets.pdf(previously thread) I found a related explanation
> > on page 45, 46.
> > /home/jian/hashset/0001-make-int4hashset_contains-strict-and-header-file-change.patch
> > (CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
> > T1.V FROM UNNEST (OP1) AS T1 (V) INTERSECT SQ SELECT T2.V FROM UNNEST
> > (OP2) AS T2 (V) ) END)
> >
> > CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
> > T1.V FROM UNNEST (OP1) AS T1 (V) UNION SQ SELECT T2.V FROM UNNEST
> > (OP2) AS T2 (V) ) END
> >
> > (CASE WHEN OP1 IS NULL OR OP2 IS NULL THEN NULL ELSE MULTISET ( SELECT
> > T1.V FROM UNNEST (OP1) AS T1 (V) EXCEPT SQ SELECT T2.V FROM UNNEST
> > (OP2) AS T2 (V) ) END)
>
> Thanks! This was exactly what I was looking for; I knew I'd seen it but failed to find it.
>
> Attached is a new incremental patch as well as a full patch, since this is a substantial change:
>
>     Align null semantics with SQL:2023 array and multiset standards
>
>     * Introduced a new boolean field, null_element, in the int4hashset_t type.
>
>     * Rename hashset_count() to hashset_cardinality().
>
>     * Rename hashset_merge() to hashset_union().
>
>     * Rename hashset_equals() to hashset_eq().
>
>     * Rename hashset_neq() to hashset_ne().
>
>     * Add hashset_to_sorted_array().
>
>     * Handle null semantics to work as in arrays and multisets.
>
>     * Update int4hashset_add() to allow creating a new set if none exists.
>
>     * Use more portable int32 typedef instead of int32_t.
>
>     This also adds a thorough test suite in array-and-multiset-semantics.sql,
>     which aims to test all relevant combinations of operations and values.
>
>  Makefile                                       |   2 +-
>  README.md                                      |   6 ++--
>  hashset--0.0.1.sql                             |  37 +++++++++++---------
>  hashset-api.c                                  | 208 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------
>  hashset.c                                      |  12 ++++++-
>  hashset.h                                      |  11 +++---
>  test/expected/array-and-multiset-semantics.out | 365 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  test/expected/basic.out                        |  12 +++----
>  test/expected/reported_bugs.out                |   6 ++--
>  test/expected/strict.out                       | 114 ------------------------------------------------------------
>  test/expected/table.out                        |   8 ++---
>  test/sql/array-and-multiset-semantics.sql      | 232 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  test/sql/basic.sql                             |   4 +--
>  test/sql/benchmark.sql                         |  14 ++++----
>  test/sql/reported_bugs.sql                     |   6 ++--
>  test/sql/strict.sql                            |  32 -----------------
>  test/sql/table.sql                             |   2 +-
>  17 files changed, 823 insertions(+), 248 deletions(-)
>
> /Joel

Hi there.
I changed the function hashset_contains to strict.
I also changed the way an empty array is returned.

In benchmark.sql, would it be OK to use EXPLAIN to demonstrate that
int4hashset can speed up distinct aggregates and distinct counts?
Like the following:

explain(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3

explain(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3

explain(costs off,timing off, analyze,buffers)
select count(distinct rnd) from benchmark_input_100k \watch c=3

explain(costs off,timing off, analyze,buffers)
SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
benchmark_input_100k) sub(x) \watch c=3

Attachments

Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Wed, Jun 28, 2023, at 08:26, jian he wrote:

> Hi there.
> I changed the function hashset_contains to strict.

Changing hashset_contains to STRICT would cause it to return NULL
if any of the operands are NULL, which I don't believe is correct, since:

SELECT NULL = ANY('{}'::int4[]);
 ?column?
----------
 f
(1 row)

Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE,
to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in SQL:2023.
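
For comparison, with a non-empty array the result is instead unknown,
which is exactly the three-valued logic we want to mimic:

SELECT NULL = ANY('{1}'::int4[]);
 ?column?
----------

(1 row)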

Did you try running `make installcheck` after your change?
You would then have seen one of the tests failing:

test array-and-multiset-semantics ... FAILED       21 ms

Check the content of `regression.diffs` to see why:

% cat regression.diffs
diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out /Users/joel/src/hashset/results/array-and-multiset-semantics.out
--- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out    2023-06-27 10:07:38
+++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out    2023-06-28 10:13:27
@@ -158,7 +158,7 @@
         |      | {NULL}      | {NULL}       |                  |
         |    1 | {1}         | {1}          |                  |
         |    4 | {4}         | {4}          |                  |
- {}     |      | {NULL}      | {NULL}       | f                | f
+ {}     |      | {NULL}      | {NULL}       |                  | f
  {}     |    1 | {1}         | {1}          | f                | f
  {}     |    4 | {4}         | {4}          | f                | f
  {NULL} |      | {NULL}      | {NULL,NULL}  |                  |
@@ -284,7 +284,8 @@
     "= ANY(...)";
  arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
 ------+------+-------------+--------------+------------------+------------
-(0 rows)
+ {}   |      | {NULL}      | {NULL}       |                  | f
+(1 row)


> I also changed the way an empty array is returned.

Nice.
I agree the `Datum d` variable was unnecessary.
I also removed the unused includes.
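
For the archives, the idiomatic pattern here (a sketch only, not
necessarily the exact hunk from the patch) is to build the empty array
with construct_empty_array() and return it directly:

#include "utils/array.h"     /* construct_empty_array(), PG_RETURN_ARRAYTYPE_P */
#include "catalog/pg_type.h" /* INT4OID */

/* return an empty int4[] without going through a temporary Datum */
PG_RETURN_ARRAYTYPE_P(construct_empty_array(INT4OID));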

> In benchmark.sql, would it be OK to use EXPLAIN to demonstrate that
> int4hashset can speed up distinct aggregates and distinct counts?
> Like the following:
>
> explain(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3
>
> explain(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3

The 100k tables seem to be too small to give any meaningful results
when trying to measure individual queries:

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
 Execution Time: 26.790 ms
 Execution Time: 30.616 ms
 Execution Time: 33.253 ms

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_100k;
 Execution Time: 32.797 ms
 Execution Time: 27.605 ms
 Execution Time: 26.228 ms

If we instead try the 10M tables, it looks like array_agg(DISTINCT ...)
is actually faster for the `i` column where all input integers are unique:

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
 Execution Time: 799.017 ms
 Execution Time: 796.008 ms
 Execution Time: 799.121 ms

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_10M;
 Execution Time: 1204.873 ms
 Execution Time: 1221.822 ms
 Execution Time: 1216.340 ms

For random integers, hashset is a win though:

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
 Execution Time: 1874.722 ms
 Execution Time: 1878.760 ms
 Execution Time: 1861.640 ms

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(rnd) FROM benchmark_input_10M;
 Execution Time: 1253.709 ms
 Execution Time: 1222.651 ms
 Execution Time: 1237.849 ms

> explain(costs off,timing off, analyze,buffers)
> select count(distinct rnd) from benchmark_input_100k \watch c=3
>
> explain(costs off,timing off, analyze,buffers)
> SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
> benchmark_input_100k) sub(x) \watch c=3

I tried these with 10M:

EXPLAIN(costs off,timing off, analyze,buffers)
SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M;
 Execution Time: 1733.320 ms
 Execution Time: 1725.214 ms
 Execution Time: 1716.636 ms

EXPLAIN(costs off,timing off, analyze,buffers)
SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM benchmark_input_10M) sub(x);
 Execution Time: 1249.612 ms
 Execution Time: 1240.558 ms
 Execution Time: 1252.103 ms

Not sure what I think of the current benchmark suite.

I think it would be better to include only some realistic examples from
real life, such as the graph query that got me started working on this
in the first place. Otherwise there is a risk we optimise for some
hypothetical scenario that is not relevant in practice.

It would be good to have more examples of typical workloads for which
the hashset type would be useful.
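
For example, something along these lines (just a sketch; the
edges(src, dst) table is hypothetical, the functions are the ones from
the patch):

-- distinct nodes reachable in two hops from node 1
-- (not excluding node 1 itself or its direct friends)
WITH direct AS (
    SELECT hashset_agg(dst) AS s
    FROM edges
    WHERE src = 1
),
fof AS (
    SELECT hashset_agg(e.dst) AS s
    FROM direct d
    JOIN edges e ON hashset_contains(d.s, e.src)
)
SELECT hashset_cardinality(s) FROM fof;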

/Joel



Re: Do we want a hashset type?

From
jian he
Date:
On Wed, Jun 28, 2023 at 4:50 PM Joel Jacobson <joel@compiler.org> wrote:

> Did you try running `make installcheck` after your change?

First I ran:

make installcheck PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config

and found out it was using another active cluster, so I killed that
cluster. Later I realized it was still connecting to another database
port, so it took me some time to figure out that I needed to use:

make installcheck PG_CONFIG=/home/jian/postgres/2023_05_25_beta5421/bin/pg_config PGPORT=5421

Anyway, this time I added another macro, which seems to simplify the code:

#define SET_DATA_PTR(a) \
(((char *) (a->data)) + CEIL_DIV(a->capacity, 8))

It passed all the tests on my local machine.
I should have made just one patch, but when I committed I forgot to
include one file, so I needed two commits.


> Not sure what I think of the current benchmark suite.

Your results are quite different from mine; I used the default config
and saw a big difference. Yeah, I agree, the performance tests should
be designed more carefully.

Attachments

Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Thu, Jun 29, 2023, at 08:54, jian he wrote:
> Anyway, this time I added another macro, which seems to simplify the code:
>
> #define SET_DATA_PTR(a) \
> (((char *) (a->data)) + CEIL_DIV(a->capacity, 8))
>
> It passed all the tests on my local machine.

Hmm, this is interesting. There is a bug in your second patch that the
tests catch, so it's really surprising that they pass on your machine.

Can you try to run `make clean && make && make install && make installcheck`?

I would guess you forgot to recompile or reinstall.

This is the bug in 0002-marco-SET_DATA_PTR-to-quicly-access-hashset-data-reg.patch:

@@ -411,7 +411,7 @@ int4hashset_union(PG_FUNCTION_ARGS)
     int4hashset_t  *seta = PG_GETARG_INT4HASHSET_COPY(0);
     int4hashset_t  *setb = PG_GETARG_INT4HASHSET(1);
     char           *bitmap = setb->data;
-    int32           *values = (int32 *) (bitmap + CEIL_DIV(setb->capacity, 8));
+    int32           *values = (int32 *) SET_DATA_PTR(seta);

You accidentally replaced `setb` with `seta`.

I renamed the macro to HASHSET_GET_VALUES and changed it slightly,
also added a HASHSET_GET_BITMAP for completeness:

#define HASHSET_GET_BITMAP(set) ((set)->data)
#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data + CEIL_DIV((set)->capacity, 8)))

Instead of your version:

#define SET_DATA_PTR(a) \
        (((char *) (a->data)) + CEIL_DIV(a->capacity, 8))

Changes:
* Parenthesize macro parameters.
* Prefix the macro names with "HASHSET_" to avoid potential conflicts.
* "GET_VALUES" more clearly communicates that it's the values we're extracting.

New patch attached.

Other changes in same commit:

* Add original friends-of-friends graph query to new benchmark/ directory
* Add table of content to README
* Update docs: Explain null semantics and add function examples
* Simplify empty hashset handling, remove unused includes

/Joel
Attachments

Re: Do we want a hashset type?

From
jian he
Date:
On Thu, Jun 29, 2023 at 4:43 PM Joel Jacobson <joel@compiler.org> wrote:
> I renamed the macro to HASHSET_GET_VALUES and changed it slightly,
> also added a HASHSET_GET_BITMAP for completeness:
>
> #define HASHSET_GET_BITMAP(set) ((set)->data)
> #define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data + CEIL_DIV((set)->capacity, 8)))

This is more of a C question: in this context,

#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data + CEIL_DIV((set)->capacity, 8)))

is defined first, and struct int4hashset_t is defined after it. Is that
normally OK?

Also, in

#define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data + CEIL_DIV((set)->capacity, 8)))

would removing the (int32 *) cast make it generic, so that when you use
it you can cast to whatever type you like?



Re: Do we want a hashset type?

From
"Joel Jacobson"
Date:
On Fri, Jun 30, 2023, at 06:50, jian he wrote:
> This is more of a C question: in this context,
> #define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data + CEIL_DIV((set)->capacity, 8)))
> is defined first, and struct int4hashset_t is defined after it. Is that
> normally OK?

Yes, it's fine. Macros are just text substitutions done pre-compilation.
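
For instance, this standalone sketch compiles fine (demo_set_t and
demo_values() are made-up names, just for illustration), because the
macro only expands where it is used, and by then the struct is complete:

#include <stdint.h>

#define CEIL_DIV(a, b)  (((a) + (b) - 1) / (b))
#define GET_VALUES(set) ((int32_t *) ((set)->data + CEIL_DIV((set)->capacity, 8)))

typedef struct demo_set_t
{
    int32_t capacity;   /* number of slots in the set */
    char    data[];     /* bitmap, followed by the values array */
} demo_set_t;

static int32_t *
demo_values(demo_set_t *set)
{
    return GET_VALUES(set); /* expands here, after demo_set_t is known */
}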

> Also, in
> #define HASHSET_GET_VALUES(set) ((int32 *) ((set)->data + CEIL_DIV((set)->capacity, 8)))
>
> would removing the (int32 *) cast make it generic, so that when you use
> it you can cast to whatever type you like?

Maybe, but it might be less error-prone and more descriptive to have
different macros for each type, e.g. INT4HASHSET_GET_VALUES,
similar to the existing PG_GETARG_INT4HASHSET.

Curious to hear what everybody thinks about the interface, documentation,
semantics and implementation?

Is there anything missing or something that you think should be changed/improved?

/Joel



Re: Do we want a hashset type?

From
Florents Tselai
Date:
Has anyone put this in a git repo / extension package or similar?

I’d like to try it out outside the core pg tree.





Re: Do we want a hashset type?

From
jian he
Date:
https://github.com/tvondra/hashset
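
Roughly, to try it out (assuming the usual PGXS flow; the extension
name follows from hashset--0.0.1.sql):

git clone https://github.com/tvondra/hashset
cd hashset
make && make install    # point PG_CONFIG at your pg_config if needed
make installcheck
psql -c 'CREATE EXTENSION hashset;'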

On Mon, Aug 14, 2023 at 11:23 PM Florents Tselai
<florents.tselai@gmail.com> wrote:
>
> Has anyone put this in a git repo / extension package or similar?
>
> I’d like to try it out outside the core pg tree.


--
 I recommend David Deutsch's <<The Beginning of Infinity>>

  Jian