Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
От | Peter Geoghegan |
---|---|
Тема | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. |
Дата | |
Msg-id | CAH2-Wz=SQ_vTESu5cE8cH6QVGDWp22hsOTA_sJ2rpmS9BT+ymQ@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Peter Geoghegan <pg@bowt.ie>) |
Ответы |
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
(Peter Geoghegan <pg@bowt.ie>)
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>) |
Список | pgsql-hackers |
On Wed, Sep 18, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > I attach version 16. This revision merges your recent work on WAL > logging with my recent work on simplifying _bt_dedup_one_page(). See > my e-mail from earlier today for details. I attach version 17. This version has changes that are focussed on further polishing certain things, including fixing some minor bugs. It seemed worth creating a new version for that. (I didn't get very far with the space utilization stuff I talked about, so no changes there.) Changes in v17: * nbtsort.c now has a loop structure that closely matches _bt_dedup_one_page() (I put this off in v16). We now reuse most of the nbtinsert.c deduplication routines. * Further simplification of btree_xlog_dedup() loop. Recovery no longer relies on local variables to track the progress of deduplication -- it uses dedup state (the state managed by nbtinsert.c's dedup routines) instead. This is easier to follow. * Reworked _bt_split() comments on posting list splits that coincide with page splits. * Fixed memory leaks in recovery code by creating a dedicated memory context that gets reset regularly. The context is create in a new rmgr "startup" callback I created for the B-Tree rmgr. We already do this for both GIN and GiST. More specifically, the REDO code calls MemoryContextReset() against its dedicated memory context after every record is processed by REDO, no matter what. The MemoryContextReset() call usually won't have to actually free anything, but that's okay because the no-free case does almost no work. I think that it makes sense to keep things as simple as possible for memory management during recovery -- it's too easy for a new memory leak to get introduced when a small change is made to the nbtinsert.c routines later on. * Optimize VACUUMing of posting lists: we now only allocate memory for an array of still-live posting list items when the array will actually be needed. It is only needed when there are tuples to remove from the posting list, because only then do we need to create a replacement posting list that lacks the heap TIDs that VACUUM needs to delete. It seemed like a really good idea to not allocate any memory in the common case where VACUUM doesn't need to change a posting list tuple at all. ginVacuumItemPointers() has exactly the same optimization. * Fixed an accounting bug in the output of VACCUM VERBOSE by changing some code in nbtree.c. The tuples_removed and num_index_tuples fields in IndexBulkDeleteResult are reported as "index row versions" by VACUUM VERBOSE. Everything but the index pages stat works at the level of "index row versions", which should not be affected by the deduplication patch. Of course, deduplication only changes the physical representation of items in the index -- never the logical contents of the index. This is what GIN does already. Another infrastructure thing that the patch needs to handle to be committable: We still haven't added an "off" switch to deduplication, which seems necessary. I suppose that this should look like GIN's "fastupdate" storage parameter. It's not obvious how to do this in a way that's easy to work with, though. Maybe we could do something like copy GIN's GinGetUseFastUpdate() macro, but the situation with nbtree is actually quite different. There are two questions for nbtree when it comes to deduplication within an inde: 1) Does the user want to use deduplication, because that will help performance?, and 2) Is it safe/possible to use deduplication at all? I think that we should probably stash this information (deduplication is both possible and safe) in the metapage. Maybe we can copy it over to our insertion scankey, just like the "heapkeyspace" field -- that information also comes from the metapage (it's based on the nbtree version). The "heapkeyspace" field is a bit ugly, so maybe we shouldn't go further by adding something similar, but I don't see any great alternative right now. -- Peter Geoghegan
Вложения
В списке pgsql-hackers по дате отправления:
Предыдущее
От: "Tsunakawa, Takayuki"Дата:
Сообщение: RE: [bug fix??] Fishy code in tts_cirtual_copyslot()