Re: Do we want a hashset type?

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: Do we want a hashset type?
Дата
Msg-id 136e97fe-a41c-7e32-f58f-312f1a0e4caa@enterprisedb.com
обсуждение исходный текст
Ответ на Re: Do we want a hashset type?  ("Joel Jacobson" <joel@compiler.org>)
Ответы Re: Do we want a hashset type?  ("Joel Jacobson" <joel@compiler.org>)
Список pgsql-hackers

On 6/20/23 00:50, Joel Jacobson wrote:
> On Mon, Jun 19, 2023, at 14:59, Tomas Vondra wrote:
>> What unexpected issues you mean? Sure, if someone uses multisets as if
>> they were sets (so ignoring the handling of duplicates), things will go
>> booom! quickly.
> 
> The unexpected issues I had in mind are subtle bugs due to treating multisets
> as sets, which could go undetected due to having no duplicates initially.
> Multisets might initially therefore seem equal, but later diverge due to
> different element counts, leading to hard-to-detect issues.
> 

Understood.

>> I imagined (if we ended up doing MULTISET) we'd provide interface (e.g.
>> operators) that'd allow perhaps help with this.
> 
> Might help. But still think providing both structures would be a more foolproof
> solution, offering users the choice to select what's best for their use-case.
> 

Yeah. Not confusing people is better.

>>> Despite SQL's multiset possibility, a distinct hashset type is my preference,
>>> helping appropriate data structure choice and reducing misuse.
>>>
>>> The necessity of multisets is vague beyond standards compliance.
>>
>> True - we haven't had any requests/proposal to implement MULTISETs.
>>
>> I've looked at the SQL standard primarily to check if maybe there's some
>> precedent that'd give us guidance on the SQL syntax etc. And I think
>> multisets are that - even if we end up not implementing them, it'd be
>> sad to have unnecessarily inconsistent syntax (in case someone decides
>> to add multisets in the future).
>>
>> We could invent "SET" data type, so while standard has ARRAY / MULTISET,
>> we'd have ARRAY / MULTISET / SET, and the difference between the last
>> two would be just handling of duplicates.
> 
> Is the idea to use the "SET" keyword for the syntax?
> Isn't it a risk that will be confusing, since "SET" is currently
> only used for configuration and update operations?
> 

I haven't tried doing that, so not sure if there would be any conflicts
in the grammar. But I can't think of a case that'd be confusing for
users - when setting internal GUC variables it's a completely different
context, there's no use for SQL-level collections (arrays, sets, ...).

For UPDATE, it'd be pretty clear too, I think. It's possible to do

   UPDATE table SET col = SET[1,2,3]

and it's clear the first is the command SET, while the second is a set
constructor. For SELECT there'd be conflict, and for ALTER TABLE it'd be
possible to do

   ALTER TABLE table ALTER COLUMN col SET DEFAULT SET[1,2,3];

Seems clear to me too, I think.


>> The other way to look at sets is that they are pretty similar to arrays,
>> except that there are no duplicates and order does not matter. Sure, the
>> on-disk format and code is different, but from the SQL perspective it'd
>> be nice to allow using sets in most places where arrays are allowed
>> (which is what the standard does for MULTISETS, more or less).
>>
>> That'd mean we could probably search through gram.y for places working
>> with arrays ("ARRAY array_expr", "ARRAY select_with_parens", ...) and
>> make them work with sets too, say by having SET_SUBLINK instead of
>> ARRAY_SUBLINK, set_expression instead of array_expression, etc.
>>
>> This might be also "consistent" with defining hashset type using CREATE
>> TYPE with ELEMENT, because we consider the type to be "array". So that
>> would be polymorphic type, but we don't have pre-defined array for every
>> type (and I'm not sure we want to).
>>
>> Of course, maybe there's some fatal flaw in these idea, I don't know.
>> And I don't want to move the goalposts too far - but it seems like this
>> might make some stuff actually simpler to implement (by piggy-backing on
>> the existing array infrastructure).
> 
> I think it's very interesting thoughts and ambitions.
> 
> I wonder though, from a user-perspective, if a new hashset type still
> wouldn't just be considered simpler, than introducing new SQL syntax?
> 

It's a matter of personal taste, I guess. I'm fine with calling function
API and what not, but a sensible SQL syntax seems nicer.

> However, it would be interesting to see how the piggy-backing on the
> existing array infrastructure would look in practise code-wise though.
> 
> I think it's still meaningful to continue hacking on the int4-type
> hashset extension, to see if we can agree on the semantics,
> especially around null handling and sorting.
> 

Definitely. It certainly was not my intention to derail the work by
proposing more and more stuff. So feel free to pursue what makes sense
to you / helps the use case.


TBH I don't particularly see why we'd want to sort sets.

I wonder if the SQL standard says something about these things (for
MULTISETs), especially for the NULL handling. If it does, I'd try to
stick with those rules.

>> A mostly unrelated thought - I wonder if this might be somehow related
>> to the foreign key array patch ([1] might be the most recent attempt in
>> this direction). Not to hashset itself, but I recalled these patches
>> because it'd mean we don't need the separate "edges" link table (so the
>> hashset column would be the think backing the FK).
>>
>> [1]
>> https://www.postgresql.org/message-id/CAJvoCut7zELHnBSC8HrM6p-R6q-NiBN1STKhqnK5fPE-9%3DGq3g%40mail.gmail.com
> 
> I remember that one! We tried to revive that one, but didn't manage to keep it alive.
> It's a really good idea though. Good idea to see if there might be synergies
> between arrays and hashsets in this area, since if we envision the elements in
> a hashset mostly will be PKs, then it would be nice to enforce reference
> integrity.

I haven't followed that at all, but I wonder how difficult would it be
to also support other collection types (like sets) and not just arrays.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Tristan Partin"
Дата:
Сообщение: Re: Make pgbench exit on SIGINT more reliably
Следующее
От: Bruce Momjian
Дата:
Сообщение: Re: Partial aggregates pushdown