Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
От | Peter Geoghegan |
---|---|
Тема | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. |
Дата | |
Msg-id | CAH2-Wzk42XV2Dg4tYGsLV2X0hf8PSLZ=9gOTOSnp+d6QB_nTyA@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Peter Geoghegan <pg@bowt.ie>) |
Ответы |
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
|
Список | pgsql-hackers |
On Tue, Jan 28, 2020 at 5:29 PM Peter Geoghegan <pg@bowt.ie> wrote: > In my opinion, the patch is now pretty close to being committable. Attached is v32, which is even closer to being committable. > I do have two outstanding open items for the patch, though. These items > are: > > * We still need infrastructure that marks B-Tree opclasses as safe for > deduplication, to avoid things like the numeric display scale problem, > collations that are unsafe for deduplication because they're > nondeterministic, etc. No progress on this item for v32, though. It's now my only open item for this entire project. Getting very close. > * Make VACUUM's WAL record more space efficient when it contains one > or more "updates" to an existing posting list tuple. * I've focussed on this item in v32 -- it has been closed out. v32 doesn't explicitly WAL-log post-update index tuples during vacuuming of posting list tuples, making the WAL records a lot smaller in some cases. v32 represents the posting list TIDs that must be deleted instead. It does this in the most WAL-space-efficient manner possible: by storing an array of uint16 offsets for each "updated" posting list within xl_btree_vacuum records -- each entry in each array is an offset to remove (i.e. a TID that should not appear in the updated version of the tuple). We use a new nbtdedup.c utility function for this, _bt_update_posting(). The new function is similar to its neighbor function, _bt_swap_posting(), which is the nbtdedup.c utility function used during posting list splits. Just like _bt_swap_posting(), we call _bt_update_posting() both during the initial action, and from the REDO routine that replays that action. Performing vacuuming of posting list tuples this way seems to matter with larger databases that depend on deduplication to control bloat, though I haven't taken the time to figure out exactly how much it matters. I'm satisfied that this is worth having based on microbenchmarks that measure WAL volume using pg_waldump. One microbenchmark showed something like a 10x decrease in the size of all xl_btree_vacuum records taken together compared to v31. I'm pretty sure that v32 makes it all but impossible for deduplication to write out more WAL than an equivalent case with deduplication disabled (I'm excluding FPIs here, of course -- full_page_writes=on cases will see significant benefits from reduced FPIs, simply by having fewer index pages). The per-leaf-page WAL record header accounts for a lot of the space overhead of xl_btree_vacuum records, and we naturally reduce that overhead when deduplicating, so we can now noticeably come out ahead when it comes to overall WAL volume. I wouldn't say that reducing WAL volume (other than FPIs) is actually a goal of this project, but it might end up happening anyway. Apparently Microsoft Azure PostgreSQL uses full_page_writes=off, so not everyone cares about the number of FPIs (everyone cares about raw record size, though). * Removed the GUC that controls the use of deduplication in this new version, per discussion with Robert over on the "Enabling B-Tree deduplication by default" thread. Perhaps we can get by with only an index storage parameter. Let's defer this until after the Postgres 13 beta period is over, and we get feedback from testers. * Turned the documentation on deduplication in the B-Tree internals chapter into a more general discussion of the on-disk format that covers deduplication. Deduplication enhances this on-disk representation, and discussing it outside that wider context always felt awkward to me. Having this kind of discussion in the docs seems like a good idea anyway. -- Peter Geoghegan
Вложения
В списке pgsql-hackers по дате отправления: