Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From: Peter Geoghegan
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date:
Msg-id: CAH2-WzkjuaM7_aFrfrbPrmow4jakeQmQ=mrntKw_aA9OVvcsRg@mail.gmail.com
In reply to: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
List: pgsql-hackers
On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I think that the new WAL record has to be created once per posting
> list that is generated, not once per page that is deduplicated --
> that's the only way that I can see that avoids a huge increase in
> total WAL volume. Even if we assume that I am wrong about there being
> value in making deduplication incremental, it is still necessary to
> make the WAL-logging behave incrementally.

Attached is v13 of the patch, which shows what I mean. You could say that v13 makes _bt_dedup_one_page() do a few extra things that are kind of similar to the things that nbtsplitloc.c does for _bt_split().

More specifically, the v13-0001-* patch includes code that makes _bt_dedup_one_page() "goal oriented" -- it calculates how much space will be freed when _bt_dedup_one_page() goes on to deduplicate those items on the page that it has already "decided to deduplicate". The v13-0002-* patch makes _bt_dedup_one_page() actually use this ability -- it makes _bt_dedup_one_page() give up on deduplication once it is clear that the items already "pending deduplication" will free enough space for its caller to at least avoid a page split.

This revision of the patch doesn't truly make deduplication incremental. It is only a proof of concept that shows how _bt_dedup_one_page() can *decide* that it will free "enough" space, whatever that may mean, so that it can finish early. The task of making _bt_dedup_one_page() actually avoid lots of work when it finishes early remains.

As I said yesterday, I'm not asking you to accept that v13-0002-* is an improvement -- at least not yet. In fact, "finishing early" due to the v13-0002-* logic clearly makes everything a lot slower, since _bt_dedup_one_page() will "thrash" even more than in earlier versions of the patch. This is especially problematic with WAL-logged relations: the test case that I shared yesterday goes from about 6GB to 10GB with v13-0002-* applied.
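For readers following along, a minimal sketch of the "goal oriented" accounting described above might look like the following. This is a hypothetical, simplified model, not the actual nbtree code: the constants and helper names (ITEM_HEADER_SZ, HEAP_TID_SZ, space_freed, intervals_needed) are illustrative assumptions. The idea is just that merging n duplicates into one posting list frees the space of the (n - 1) redundant key copies, minus the heap TIDs that must be retained, so the caller can tell in advance how many pending intervals it must process before a page split becomes avoidable.

```c
#include <assert.h>
#include <stddef.h>

#define ITEM_HEADER_SZ  8   /* assumed per-tuple overhead (illustrative) */
#define HEAP_TID_SZ     6   /* size of one heap TID kept in a posting list */

/*
 * Space freed by merging 'ndups' duplicates whose key payload is 'keysz'
 * bytes: (ndups - 1) full tuples go away, but the single surviving posting
 * list tuple must still carry all 'ndups' heap TIDs.
 */
static size_t
space_freed(size_t keysz, int ndups)
{
    size_t before = (size_t) ndups * (ITEM_HEADER_SZ + keysz + HEAP_TID_SZ);
    size_t after = ITEM_HEADER_SZ + keysz + (size_t) ndups * HEAP_TID_SZ;

    return before - after;
}

/*
 * Given pending duplicate intervals, decide how many must actually be
 * merged before 'newitemsz' bytes fit in 'freespace' without a page
 * split.  Returns the number of intervals to process, or -1 if
 * deduplication alone cannot free enough space.
 */
static int
intervals_needed(const size_t keysz[], const int ndups[], int nintervals,
                 size_t freespace, size_t newitemsz)
{
    size_t freed = 0;

    for (int i = 0; i < nintervals; i++)
    {
        if (freespace + freed >= newitemsz)
            return i;           /* enough space -- finish early */
        freed += space_freed(keysz[i], ndups[i]);
    }
    return (freespace + freed >= newitemsz) ? nintervals : -1;
}
```

The point of structuring it this way is that the decision ("will we free enough?") is separated from the work ("actually rewrite the page"), which is what would let _bt_dedup_one_page() stop rewriting items once the early-finish condition is met.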
But we need to fundamentally rethink the approach to the rewriting + WAL-logging by _bt_dedup_one_page() anyway. (Note that total index space utilization is barely affected by the v13-0002-* patch, so clearly that much works well.)

Other changes:

* Small tweaks to amcheck (nothing interesting, really).

* Small tweaks to the _bt_killitems() stuff.

* Moved all of the deduplication helper functions to nbtinsert.c. This is where deduplication gets complicated, so I think that it should all live there. (i.e. nbtsort.c will call nbtinsert.c code, never the other way around.)

Note that I haven't merged any of the changes from v12 of the patch from yesterday. I didn't merge the posting list WAL logging changes because of the bug I reported, but I would have were it not for that. The WAL logging for _bt_dedup_one_page() added to v12 didn't appear to be more efficient than your original approach (i.e. calling log_newpage_buffer()), so I have stuck with your original approach.

It would be good to hear your thoughts on this _bt_dedup_one_page() WAL volume/"write amplification" issue.

--
Peter Geoghegan
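To make the WAL volume trade-off concrete, here is a back-of-envelope model. The numbers are illustrative assumptions, not measurements (BLCKSZ matches PostgreSQL's default page size, but WAL_RECORD_HDR and both helper functions are hypothetical): logging the whole page costs roughly one block image per deduplication pass, while per-posting-list records each pay a small header plus only the new posting list's payload.

```c
#include <assert.h>

#define BLCKSZ          8192    /* PostgreSQL's default block size */
#define WAL_RECORD_HDR  24      /* assumed fixed per-record overhead */

/* WAL volume if every deduplication pass logs a full page image */
static long
wal_fullpage(int dedup_passes)
{
    return (long) dedup_passes * BLCKSZ;
}

/*
 * WAL volume if each pass instead emits one record per posting list
 * generated, paying only for the lists that actually changed.
 */
static long
wal_per_posting(int dedup_passes, int lists_per_pass, int avg_list_sz)
{
    return (long) dedup_passes * lists_per_pass *
        (WAL_RECORD_HDR + avg_list_sz);
}
```

Under these assumptions, per-posting-list logging wins whenever each pass touches much less than a full page's worth of items, which is exactly the situation once deduplication "finishes early" -- but repeated thrashing over the same page multiplies whichever per-pass cost is paid, which is why the total can balloon either way.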
Attachments