Re: plan_create_index_workers doesn't account for TOAST
От | Jonathan S. Katz |
---|---|
Тема | Re: plan_create_index_workers doesn't account for TOAST |
Дата | |
Msg-id | 950a26ba-ed90-6b24-3335-05e401739873@postgresql.org обсуждение исходный текст |
Ответ на | plan_create_index_workers doesn't account for TOAST ("Jonathan S. Katz" <jkatz@postgresql.org>) |
Список | pgsql-hackers |
On 6/29/23 10:12 AM, Jonathan S. Katz wrote: > Hi, > > plan_create_index_workers[1] does not consider the amount of tuples > existing in TOAST pages when determining the number of parallel workers > to use for a build. The estimation comes from estimate_rel_size[2], > which in this case, will just take the value from rel->rd_rel->relpages. > > We probably don't notice this much with B-trees, given a B-tree is > typically used for data that does not require toasting. However, this > becomes more visible when working on custom index access methods that > implement their own parallel build strategy. > > For example, pgvector[3] provides its own data types and index access > method for indexing vector data. Vectors can get quite large fairly > quickly, e.g. a 768-dimensional vector takes up 8 + 4*768 = 3080 bytes > on disk, which quickly clears the default TOAST tuple threshold. > > In a recent patch proposal to allow for building indexes in parallel[4], > I performed a few experiments on how many parallel workers would be > spawned when indexing 1,000,000 (1MM) 768-dim vectors, both with > EXTEDNED (default) and PLAIN storage. In all cases, I allowed for leader > participation, but the leader is not considered in > plan_create_index_workers. > > With EXTENDED, plan_create_index_workers recommended 2 workers. The > build time was ~2x faster than the serial build. > > With PLAIN, plan_create_index_workers recommended 4 workers. The build > time was **~3X faster** than the serial build. > > (I've been doing more detailed, less hand-waivy performance testing, but > I wanted to provide directional numbers here) > > It seems like we're leaving some performance for columns with TOASTed > data that require indexing, so I wanted to propose allowing the pages in > TOASTed tables to be considered when we're trying to index a column with > TOASTed attributes. Just to add to this: there is a lever to get more parallel workers by setting "min_parallel_table_scan_size" to a lower value, which does help in this case. However, it does mask the fact that a large chunk of the data required to build the index exists in the TOAST table, which is not intuitive to a user who rarely has to use tuning parameters. Thanks, Jonathan
Вложения
В списке pgsql-hackers по дате отправления: