Re: [PATCH] Compression dictionaries for JSONB

From: Andres Freund
Subject: Re: [PATCH] Compression dictionaries for JSONB
Msg-id: 20230210200145.sm3bt5xidlqx6ukt@awork3.anarazel.de
In reply to: Re: [PATCH] Compression dictionaries for JSONB  (Nikita Malakhov <hukutoc@gmail.com>)
Replies: Re: [PATCH] Compression dictionaries for JSONB  (Aleksander Alekseev <aleksander@timescale.com>)
List: pgsql-hackers
Hi,

On 2023-02-10 21:22:14 +0300, Nikita Malakhov wrote:
> If I understand Andres' message correctly - the proposition is to
> make use of compression dictionaries automatic, possibly just setting
> a parameter when the table is created, something like
> CREATE TABLE t ( ..., t JSONB USE DICTIONARY);

I definitely wouldn't make it automatic initially; let's first see how well
the manual approach works, and consider automating it later.

Whether automatic or not, it probably makes sense to integrate building
dictionaries with analyze. We can build dictionaries from sampled datums, we
can check how efficient a dictionary is at compressing sampled datums, and we
can compare the efficiency of a new dictionary against the existing one to
see whether it's worth switching.
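
To illustrate, here is a minimal sketch of that evaluation step, assuming
zstd as the dictionary backend; the helper names and the ~10% acceptance
threshold are invented for the example, not taken from the patch:

    /* Sketch: train a dictionary from datums sampled by ANALYZE and
     * check whether it beats the existing one.  Hypothetical helpers,
     * assuming zstd as the dictionary-compression backend. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <zstd.h>
    #include <zdict.h>

    static size_t
    total_compressed_size(ZSTD_CCtx *cctx, const char *samples,
                          const size_t *sizes, unsigned nsamples,
                          const void *dict, size_t dict_size)
    {
        size_t      off = 0;
        size_t      total = 0;

        for (unsigned i = 0; i < nsamples; i++)
        {
            char        dst[65536];
            /* dict == NULL / dict_size == 0 means "no dictionary" */
            size_t      csize = ZSTD_compress_usingDict(cctx,
                                                        dst, sizeof(dst),
                                                        samples + off, sizes[i],
                                                        dict, dict_size, 3);

            if (ZSTD_isError(csize))
                return SIZE_MAX;
            total += csize;
            off += sizes[i];
        }
        return total;
    }

    /* Returns true if a freshly trained dictionary compresses the samples
     * at least ~10% better than old_dict (pass NULL/0 for "none"). */
    static bool
    new_dict_is_worth_it(const char *samples, const size_t *sizes,
                         unsigned nsamples,
                         const void *old_dict, size_t old_dict_size)
    {
        char        new_dict[4096]; /* keep dictionaries small and bounded */
        size_t      new_dict_size;
        ZSTD_CCtx  *cctx;
        size_t      with_new;
        size_t      with_old;
        bool        worth_it = false;

        new_dict_size = ZDICT_trainFromBuffer(new_dict, sizeof(new_dict),
                                              samples, sizes, nsamples);
        if (ZDICT_isError(new_dict_size))
            return false;       /* e.g. too few samples to train from */

        cctx = ZSTD_createCCtx();
        with_new = total_compressed_size(cctx, samples, sizes, nsamples,
                                         new_dict, new_dict_size);
        with_old = total_compressed_size(cctx, samples, sizes, nsamples,
                                         old_dict, old_dict_size);
        ZSTD_freeCCtx(cctx);

        /* Demand a clear win before storing yet another dictionary. */
        if (with_new != SIZE_MAX && with_old != SIZE_MAX)
            worth_it = (with_new * 10 < with_old * 9);
        return worth_it;
    }

Note the bail-out when training fails: zstd's trainer errors out when given
too few samples, which ties into the point about small tables below.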


> The question is then how to create such dictionaries automatically
> and extend them while data is being added to the table. It is not
> unusual for circumstances to change over time, so that a table which
> started out small suddenly has huge amounts of data loaded into it.

It doesn't really make sense to create dictionaries for small tables anyway.
For a dictionary to be efficient, you need a reasonable amount of data to
build it from.


> I prefer extending a dictionary over re-creating it, because while a
> dictionary is being recreated we leave users two choices: wait until
> dictionary creation is over, or use the old version (say, kept as a
> snapshot while the new one is created). Keeping many versions
> simultaneously does not make sense and would grow the DB size.

I don't think you really can extend dictionaries. The references into the
dictionary are as small as possible, based on the contents of the
dictionary. And you normally want to keep the dictionary size bounded for
that reason alone, but also for [de]compression speed.

So you'd need to build a new dictionary and use that going forward. And yes,
you wouldn't be able to delete the old dictionary, because there will still
be references to it.

We could add a command that scans the data to see whether an old dictionary
is still used, or even removes references to it. But I don't think that's
particularly important: it only makes sense to create the initial dictionary
once some data has accumulated, and creating further dictionaries only makes
sense once the table is a good bit larger. At that point the size of another
stored dictionary is negligible in relation to the table (otherwise you'd
just give up after building the new dictionary and evaluating its
effectiveness).


> Also, compressing small data with a large dictionary (the case for a
> one-dictionary-for-many-tables setup) would, I think, add considerable
> overhead to INSERT/UPDATE commands, so the most reasonable choice is
> a per-table dictionary.

Likely even per-column, but I can see advantages in either approach.


> Any ideas on how to create and extend such dictionaries automatically?

As I said, I don't think we should extend dictionaries. For this to work
we'll likely need a new / extended compressed toast datum header of some
form, carrying a reference to the dictionary. That'd likely be needed even
with updatable dictionaries: IIRC we don't know which column a toasted datum
belongs to, but we'd need to know in order to identify its dictionary. And
since we need that field anyway, we can easily support multiple dictionaries.
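
As a concrete (purely hypothetical) sketch of such a header, modeled on the
existing varattrib compressed layout, with invented names:

    /* Assumes PostgreSQL's c.h typedefs (uint32, Oid,
     * FLEXIBLE_ARRAY_MEMBER).  All names here are illustrative,
     * not from any committed patch. */
    typedef struct varatt_dict_compressed
    {
        uint32      va_header;  /* varlena length word */
        uint32      va_tcinfo;  /* original size + compression method bits */
        Oid         va_dictid;  /* OID identifying the dictionary used */
        char        va_data[FLEXIBLE_ARRAY_MEMBER];  /* compressed payload */
    } varatt_dict_compressed;

Once each datum names its dictionary explicitly like this, supporting
multiple dictionaries per column (each new one used for future writes)
falls out naturally.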

Greetings,

Andres Freund


