Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From: Peter Geoghegan
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date:
Msg-id: CAH2-WzkjuaM7_aFrfrbPrmow4jakeQmQ=mrntKw_aA9OVvcsRg@mail.gmail.com
In reply to: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
List: pgsql-hackers
On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I think that the new WAL record has to be created once per posting
> list that is generated, not once per page that is deduplicated --
> that's the only way that I can see that avoids a huge increase in
> total WAL volume. Even if we assume that I am wrong about there being
> value in making deduplication incremental, it is still necessary to
> make the WAL-logging behave incrementally.

Attached is v13 of the patch, which shows what I mean. You could say that v13 makes _bt_dedup_one_page() do a few extra things that are kind of similar to the things that nbtsplitloc.c does for _bt_split().

More specifically, the v13-0001-* patch includes code that makes _bt_dedup_one_page() "goal oriented" -- it calculates how much space will be freed when _bt_dedup_one_page() goes on to deduplicate those items on the page that it has already "decided to deduplicate". The v13-0002-* patch makes _bt_dedup_one_page() actually use this ability -- it makes _bt_dedup_one_page() give up on deduplication once it is clear that the items already "pending deduplication" will free enough space for its caller to at least avoid a page split.

This revision of the patch doesn't truly make deduplication incremental. It is only a proof of concept that shows how _bt_dedup_one_page() can *decide* that it will free "enough" space, whatever that may mean, so that it can finish early. The task of making _bt_dedup_one_page() actually avoid lots of work when it finishes early remains.

As I said yesterday, I'm not asking you to accept that v13-0002-* is an improvement -- at least not yet. In fact, "finishing early" due to the v13-0002-* logic clearly makes everything a lot slower, since _bt_dedup_one_page() will "thrash" even more than in earlier versions of the patch. This is especially problematic with WAL-logged relations: the test case that I shared yesterday goes from about 6GB to 10GB with v13-0002-* applied.
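For readers following along, a minimal sketch of the "goal oriented" accounting described above might look like the following. This is a hypothetical, simplified model, not the actual nbtree code: the constants and helper names (ITEM_HEADER_SZ, HEAP_TID_SZ, space_freed, intervals_needed) are illustrative assumptions. The idea is just that merging n duplicates into one posting list frees the space of the (n - 1) redundant key copies, minus the heap TIDs that must be retained, so the caller can tell in advance how many pending intervals it must process before a page split becomes avoidable.

```c
#include <assert.h>
#include <stddef.h>

#define ITEM_HEADER_SZ  8   /* assumed per-tuple overhead (illustrative) */
#define HEAP_TID_SZ     6   /* size of one heap TID kept in a posting list */

/*
 * Space freed by merging 'ndups' duplicates whose key payload is 'keysz'
 * bytes: (ndups - 1) full tuples go away, but the single surviving posting
 * list tuple must still carry all 'ndups' heap TIDs.
 */
static size_t
space_freed(size_t keysz, int ndups)
{
    size_t before = (size_t) ndups * (ITEM_HEADER_SZ + keysz + HEAP_TID_SZ);
    size_t after = ITEM_HEADER_SZ + keysz + (size_t) ndups * HEAP_TID_SZ;

    return before - after;
}

/*
 * Given pending duplicate intervals, decide how many must actually be
 * merged before 'newitemsz' bytes fit in 'freespace' without a page
 * split.  Returns the number of intervals to process, or -1 if
 * deduplication alone cannot free enough space.
 */
static int
intervals_needed(const size_t keysz[], const int ndups[], int nintervals,
                 size_t freespace, size_t newitemsz)
{
    size_t freed = 0;

    for (int i = 0; i < nintervals; i++)
    {
        if (freespace + freed >= newitemsz)
            return i;           /* enough space -- finish early */
        freed += space_freed(keysz[i], ndups[i]);
    }
    return (freespace + freed >= newitemsz) ? nintervals : -1;
}
```

The point of structuring it this way is that the decision ("will we free enough?") is separated from the work ("actually rewrite the page"), which is what would let _bt_dedup_one_page() stop rewriting items once the early-finish condition is met.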
But we need to fundamentally rethink the approach to the rewriting + WAL-logging by _bt_dedup_one_page() anyway. (Note that total index space utilization is barely affected by the v13-0002-* patch, so clearly that much works well.)

Other changes:

* Small tweaks to amcheck (nothing interesting, really).

* Small tweaks to the _bt_killitems() stuff.

* Moved all of the deduplication helper functions to nbtinsert.c. This is where deduplication gets complicated, so I think that it should all live there. (i.e. nbtsort.c will call nbtinsert.c code, never the other way around.)

Note that I haven't merged any of the changes from v12 of the patch from yesterday. I didn't merge the posting list WAL logging changes because of the bug I reported, but I would have were it not for that. The WAL logging for _bt_dedup_one_page() added to v12 didn't appear to be more efficient than your original approach (i.e. calling log_newpage_buffer()), so I have stuck with your original approach.

It would be good to hear your thoughts on this _bt_dedup_one_page() WAL volume/"write amplification" issue.

--
Peter Geoghegan
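To make the WAL volume trade-off concrete, here is a back-of-envelope model. The numbers are illustrative assumptions, not measurements (BLCKSZ matches PostgreSQL's default page size, but WAL_RECORD_HDR and both helper functions are hypothetical): logging the whole page costs roughly one block image per deduplication pass, while per-posting-list records each pay a small header plus only the new posting list's payload.

```c
#include <assert.h>

#define BLCKSZ          8192    /* PostgreSQL's default block size */
#define WAL_RECORD_HDR  24      /* assumed fixed per-record overhead */

/* WAL volume if every deduplication pass logs a full page image */
static long
wal_fullpage(int dedup_passes)
{
    return (long) dedup_passes * BLCKSZ;
}

/*
 * WAL volume if each pass instead emits one record per posting list
 * generated, paying only for the lists that actually changed.
 */
static long
wal_per_posting(int dedup_passes, int lists_per_pass, int avg_list_sz)
{
    return (long) dedup_passes * lists_per_pass *
        (WAL_RECORD_HDR + avg_list_sz);
}
```

Under these assumptions, per-posting-list logging wins whenever each pass touches much less than a full page's worth of items, which is exactly the situation once deduplication "finishes early" -- but repeated thrashing over the same page multiplies whichever per-pass cost is paid, which is why the total can balloon either way.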
Attachments