Обсуждение: GiST index for pgtrgm bloats a lot

Поиск
Список
Период
Сортировка

GiST index for pgtrgm bloats a lot

От
hubert depesz lubaczewski
Дата:
Hi,
We have this situation on many servers - used versions are 9.1.14 and/or
9.3.6, on Linux servers.

There is a table, with ~ 1 million rows.
There are no deletes there, inserts happen at the rate of ~ 100-200
daily (but not spread, it's usually a single moment in day where there
happen the inserts).
There is also HUGE number of updates - up to 200,000 rows updated per
hour.

Schema of the table:
                                                      Table "schemaXX.tableYY"
                Column                |            Type             |                           Modifiers
            

--------------------------------------+-----------------------------+---------------------------------------------------------------
 id                                   | bigint                      | not null default
nextval('schemaXX.tableYY_id_seq'::regclass)
 col_a                                | character varying(255)      |
 ...............................      | bigint                      | not null
 ...............................      | character varying(255)      |
 ...............................      | bigint                      |
 ...............................      | character varying(255)      | not null
 ...............................      | character varying(255)      |
 ...............................      | timestamp without time zone |
 ...............................      | timestamp without time zone |
 ...............................      | bigint                      |
 ...............................      | boolean                     |
 ...............................      | boolean                     |
 ...............................      | timestamp without time zone |
 ...............................      | timestamp without time zone |
 ...............................      | boolean                     |
 ...............................      | text                        |
 ...............................      | boolean                     | default false
 ...............................      | character varying(255)      |
 ...............................      | bigint                      |
 ...............................      | boolean                     | default true
 col_c                                | character varying(255)      |
 ...............................      | character varying(255)      | default '----'::character varying
 ...............................      | bigint                      |
 ...............................      | bigint                      | not null
 ...............................      | bigint                      | not null
 col_b                                | character varying(255)      |
 ...............................      | bigint                      |
 ...............................      | boolean                     |
 ...............................      | boolean                     |
 ...............................      | bigint                      |
 ...............................      | text                        |
 ...............................      | boolean                     |
 ...............................      | text                        |
 ...............................      | boolean                     |
 ...............................      | character varying(255)      |
 ...............................      | boolean                     |
 ...............................      | boolean                     |
 ...............................      | bigint                      |
 ...............................      | character varying(255)      |
 ...............................      | text                        |
 ...............................      | bigint                      |
 ...............................      | text                        |
 ...............................      | text                        |
 ...............................      | character varying(255)      |
 ...............................      | integer                     |
 ...............................      | character varying(255)      |
 ...............................      | character varying(255)      |
 ...............................      | character varying(255)      |
 ...............................      | bigint                      |

indexes:
There are 17 indexes in total, and the problematic is:

"index_trgm_courses_composite_search" gist (((((COALESCE(lower(col_a::text), ''::text) || ' '::text) ||
COALESCE(lower(col_b::text),''::text)) || ' '::text) || COALESCE(lower(col_c::text), ''::text)) gist_trgm_ops) 

On 2015-04-27 this index was reindexed. It's size (relpages as shown in
pg_class) dropped from 152698 to 45865.  Since then the index is consistently
increasing in size. Today it was 343370 pages, and after reindex it dropped to
29091.

On these databases we have some (not many) 1-2 minute transactions, but usually
every transaction within 30 seconds

This table is vacuumed (autovacuum) quite a lot, usually at least 10 times a
day.

Is there anything we could do to help diagnose the problem, and fix it?
I don't think I can install custom pg version, and downtime would be
complicated to get approval, but anything about the data/database I can
check that would allow diagnosing the bug, I'd be happy to do.

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/

Re: GiST index for pgtrgm bloats a lot

От
Heikki Linnakangas
Дата:
On 05/18/2015 03:12 PM, hubert depesz lubaczewski wrote:
> Hi,
> We have this situation on many servers - used versions are 9.1.14 and/or
> 9.3.6, on Linux servers.
>
> There is a table, with ~ 1 million rows.
> There are no deletes there, inserts happen at the rate of ~ 100-200
> daily (but not spread, it's usually a single moment in day where there
> happen the inserts).
> There is also HUGE number of updates - up to 200,000 rows updated per
> hour.

Are the columns that are included in the bloated index also updated that
often? If not, I'd suggest moving those columns to a separate table with
a one-to-one relationship to the main table. Or perhaps just create a
helper table that contains copies of those columns, and keep it
up-to-date with triggers. That would reduce the churn in the index.

> On these databases we have some (not many) 1-2 minute transactions, but usually
> every transaction within 30 seconds
>
> This table is vacuumed (autovacuum) quite a lot, usually at least 10 times a
> day.
>
> Is there anything we could do to help diagnose the problem, and fix it?
> I don't think I can install custom pg version, and downtime would be
> complicated to get approval, but anything about the data/database I can
> check that would allow diagnosing the bug, I'd be happy to do.

GiST doesn't have the same features that e.g B-tree has to avoid bloat:

1. When GiST has multiple equally good choices where it could insert a
tuple, it favours branches that are "earlier" in the index. Always
descending down the same branch is good for cache efficiency when you
insert multiple items with similar keys, but the downside is that the
other branches can easily have a lot of free space that goes unused,
while the "hot" branch just gets split repeatedly. This is explained in
the comments in the gistchoose() function. That leads to bloat, because
the free space isn't being used.

2. When an index page becomes completely empty, GiST doesn't try to
remove and reuse it.

It would be nice to address those in a future version, but that won't
help you right now.

- Heikki

Re: GiST index for pgtrgm bloats a lot

От
hubert depesz lubaczewski
Дата:
On Mon, May 18, 2015 at 04:19:41PM +0300, Heikki Linnakangas wrote:
> Are the columns that are included in the bloated index also updated that
> often? If not, I'd suggest moving those columns to a separate table with a
> one-to-one relationship to the main table. Or perhaps just create a helper
> table that contains copies of those columns, and keep it up-to-date with
> triggers. That would reduce the churn in the index.

I'm not sure if this is possible, but will look into it.

> 1. When GiST has multiple equally good choices where it could insert a
> tuple, it favours branches that are "earlier" in the index. Always
> descending down the same branch is good for cache efficiency when you insert
> multiple items with similar keys, but the downside is that the other
> branches can easily have a lot of free space that goes unused, while the
> "hot" branch just gets split repeatedly. This is explained in the comments
> in the gistchoose() function. That leads to bloat, because the free space
> isn't being used.

This looks related. Too bad that we don't have any other way to handle
it, but at the very least I know it's not some "bug", it's just
a missing feature.

Thanks a lot,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/