Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
От | Peter Geoghegan |
---|---|
Тема | Re: Making all nbtree entries unique by having heap TIDs participatein comparisons |
Дата | |
Msg-id | CAH2-Wz=43-kYJ95B9RGztjNuj1zpupmh=PjsX+aMycRFcVYwJg@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Making all nbtree entries unique by having heap TIDs participatein comparisons (Robert Haas <robertmhaas@gmail.com>) |
Список | pgsql-hackers |
On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas@gmail.com> wrote: > I think it's pretty clear that we have to view that as acceptable. I > mean, we could reduce contention even further by finding a way to make > indexes 40% larger, but I think it's clear that nobody wants that. I found this analysis of bloat in the production database of Gitlab in January 2019 fascinating: https://about.gitlab.com/handbook/engineering/infrastructure/blueprint/201901-postgres-bloat/ They determined that their tables consisted of about 2% bloat, whereas indexes were 51% bloat (determined by running VACUUM FULL, and measuring how much smaller indexes and tables were afterwards). That in itself may not be that telling. What is telling is the index bloat disproportionately affects certain kinds of indexes. As they put it, "Indexes that do not serve a primary key constraint make up 95% of the overall index bloat". In other words, the vast majority of all bloat occurs within non-unique indexes, with most remaining bloat in unique indexes. One factor that could be relevant is that unique indexes get a lot more opportunistic LP_DEAD killing. Unique indexes don't rely on the similar-but-distinct kill_prior_tuple optimization -- a lot more tuples can be killed within _bt_check_unique() than with kill_prior_tuple in realistic cases. That's probably not really that big a factor, though. It seems almost certain that "getting tired" is the single biggest problem. The blog post drills down further, and cites the examples of several *extremely* bloated indexes on a single-column, which is obviously low cardinality. This includes an index on a boolean field, and an index on an enum-like text field. In my experience, having many indexes like that is very common in real world applications, though not at all common in popular benchmarks (with the exception of TPC-E). It also looks like they may benefit from the "split after new item" optimization, at least among the few unique indexes that were very bloated, such as merge_requests_pkey: https://gitlab.com/snippets/1812014 Gitlab is open source, so it should be possible to confirm my theory about the "split after new item" optimization (I am certain about "getting tired", though). Their schema is defined here: https://gitlab.com/gitlab-org/gitlab-ce/blob/master/db/schema.rb I don't have time to confirm all this right now, but I am pretty confident that they have both problems. And almost as confident that they'd notice substantial benefits from this patch series. -- Peter Geoghegan
В списке pgsql-hackers по дате отправления: