Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

Поиск

Список

Период

Сортировка

От	Peter Geoghegan
Тема	Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
Дата	13 марта 2019 г. 23:23:54
Msg-id	CAH2-Wz=43-kYJ95B9RGztjNuj1zpupmh=PjsX+aMycRFcVYwJg@mail.gmail.com обсуждение исходный текст
Ответ на	Re: Making all nbtree entries unique by having heap TIDs participatein comparisons (Robert Haas <robertmhaas@gmail.com>)
Список	pgsql-hackers

Дерево обсуждения

On Tue, Mar 12, 2019 at 11:40 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think it's pretty clear that we have to view that as acceptable. I
> mean, we could reduce contention even further by finding a way to make
> indexes 40% larger, but I think it's clear that nobody wants that.

I found this analysis of bloat in the production database of Gitlab in
January 2019 fascinating:

https://about.gitlab.com/handbook/engineering/infrastructure/blueprint/201901-postgres-bloat/

They determined that their tables consisted of about 2% bloat, whereas
indexes were 51% bloat (determined by running VACUUM FULL, and
measuring how much smaller indexes and tables were afterwards). That
in itself may not be that telling. What is telling is the index bloat
disproportionately affects certain kinds of indexes. As they put it,
"Indexes that do not serve a primary key constraint make up 95% of the
overall index bloat". In other words, the vast majority of all bloat
occurs within non-unique indexes, with most remaining bloat in unique
indexes.

One factor that could be relevant is that unique indexes get a lot
more opportunistic LP_DEAD killing. Unique indexes don't rely on the
similar-but-distinct kill_prior_tuple optimization -- a lot more
tuples can be killed within _bt_check_unique() than with
kill_prior_tuple in realistic cases. That's probably not really that
big a factor, though. It seems almost certain that "getting tired" is
the single biggest problem.

The blog post drills down further, and cites the examples of several
*extremely* bloated indexes on a single-column, which is obviously low
cardinality. This includes an index on a boolean field, and an index
on an enum-like text field. In my experience, having many indexes like
that is very common in real world applications, though not at all
common in popular benchmarks (with the exception of TPC-E).

It also looks like they may benefit from the "split after new item"
optimization, at least among the few unique indexes that were very
bloated, such as merge_requests_pkey:

https://gitlab.com/snippets/1812014

Gitlab is open source, so it should be possible to confirm my theory
about the "split after new item" optimization (I am certain about
"getting tired", though). Their schema is defined here:

https://gitlab.com/gitlab-org/gitlab-ce/blob/master/db/schema.rb

I don't have time to confirm all this right now, but I am pretty
confident that they have both problems. And almost as confident that
they'd notice substantial benefits from this patch series.
--
Peter Geoghegan

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons