Thread: Slow recursive CTE query questions, with row estimate and n_distinct issues

Slow recursive CTE query questions, with row estimate and n_distinct issues

From: Christopher Baines
Hi,

I have a performance problem with a query. I've uploaded it, along with
the EXPLAIN ANALYZE output here [1].

1: https://explain.depesz.com/s/w5vP

I think the query is taking longer than I'd like because PostgreSQL is
not generating a great plan. In particular, the estimated row counts in
parts of the plan are way off from the actual number of rows produced,
and I suspect PostgreSQL would pick a better approach if it made more
reasonable estimates.

Looking at the row numbers on [1], I think things start to go wrong at
line 11. The first part of the recursive CTE is estimated to generate
22,122 rows, which isn't far off the 15,670 rows it actually generates.
The WorkTable scan expects to generate 10x that though, at 221,220.
Maybe that's reasonable; I'm unsure.

Things seem to continue getting out of hand from that point. The
nested loop on line 10 generates 232x fewer rows than estimated, and
this results in an overall overestimate of 4,317x.

My theory as to why the row estimates are so far off is that the
planner is using the n_distinct counts for the derivation_inputs table,
and those counts are way off [2]. I've already set the statistics
target to the maximum for this table, so I'm unsure what else I can do
to improve these estimates.
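
For reference, a per-column statistics target like the 10000 visible in
[4] is set with something along these lines, followed by an ANALYZE so
the statistics are regathered:

    ALTER TABLE derivation_inputs
      ALTER COLUMN derivation_id SET STATISTICS 10000,
      ALTER COLUMN derivation_output_id SET STATISTICS 10000;
    ANALYZE derivation_inputs;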

I've included relevant numbers for the two important tables below
[2][3], as well as descriptions of the tables [4][5].

Does anyone know whether my theory about the row estimates in the plan
is right, or have any ideas as to how I can make the query more
performant?

Thanks,

Chris


2:
derivation_inputs:
  COUNT(*):  285422539
  reltuples: 285422528

  derivation_id:
    COUNT(DISTINCT): 7508610
    n_distinct:      4336644 (~57% of the true value)

  derivation_output_id:
    COUNT(DISTINCT): 5539406
    n_distinct:      473762 (~8% of the true value)

derivation_outputs:
  COUNT(*):  8206019
  reltuples: 8205873

3:
guix_data_service=> SELECT attname, n_distinct::integer FROM pg_stats WHERE tablename = 'derivation_outputs';
           attname            | n_distinct
------------------------------+------------
 derivation_id                |         -1
 name                         |         54
 derivation_output_details_id |     372225
 id                           |         -1
(4 rows)


4:
guix_data_service@[local]:5432/patches_guix_data_service> \d+ derivation_inputs
                               Table "guix_data_service.derivation_inputs"
┌──────────────────────┬─────────┬───────────┬──────────┬─────────┬─────────┬──────────────┬─────────────┐
│        Column        │  Type   │ Collation │ Nullable │ Default │ Storage │ Stats target │ Description │
├──────────────────────┼─────────┼───────────┼──────────┼─────────┼─────────┼──────────────┼─────────────┤
│ derivation_id        │ integer │           │ not null │         │ plain   │ 10000        │             │
│ derivation_output_id │ integer │           │ not null │         │ plain   │ 10000        │             │
└──────────────────────┴─────────┴───────────┴──────────┴─────────┴─────────┴──────────────┴─────────────┘
Indexes:
    "derivation_inputs_pkey" PRIMARY KEY, btree (derivation_id, derivation_output_id) WITH (fillfactor='100')
    "derivation_inputs_derivation_output_id_idx" btree (derivation_output_id) WITH (fillfactor='100')
Foreign-key constraints:
    "derivation_id_fk" FOREIGN KEY (derivation_id) REFERENCES derivations(id)
    "derivation_output_id_fk" FOREIGN KEY (derivation_output_id) REFERENCES derivation_outputs(id)
Access method: heap
Options: autovacuum_vacuum_scale_factor=0.01, autovacuum_analyze_scale_factor=0.01


5:
guix_data_service@[local]:5432/patches_guix_data_service> \d+ derivation_outputs
                                                  Table "guix_data_service.derivation_outputs"
┌──────────────────────────────┬───────────────────┬───────────┬──────────┬──────────────────────────────┬──────────┬──────────────┬─────────────┐
│            Column            │       Type        │ Collation │ Nullable │           Default            │ Storage  │ Stats target │ Description │
├──────────────────────────────┼───────────────────┼───────────┼──────────┼──────────────────────────────┼──────────┼──────────────┼─────────────┤
│ derivation_id                │ integer           │           │ not null │                              │ plain    │              │             │
│ name                         │ character varying │           │ not null │                              │ extended │              │             │
│ derivation_output_details_id │ integer           │           │ not null │                              │ plain    │              │             │
│ id                           │ integer           │           │ not null │ generated always as identity │ plain    │              │             │
└──────────────────────────────┴───────────────────┴───────────┴──────────┴──────────────────────────────┴──────────┴──────────────┴─────────────┘
Indexes:
    "derivation_outputs_pkey" PRIMARY KEY, btree (derivation_id, name)
    "derivation_outputs_derivation_id_idx" btree (derivation_id)
    "derivation_outputs_derivation_output_details_id_idx" btree (derivation_output_details_id)
    "derivation_outputs_unique_id" UNIQUE CONSTRAINT, btree (id)
Foreign-key constraints:
    "derivation_outputs_derivation_id_fk" FOREIGN KEY (derivation_id) REFERENCES derivations(id)
    "derivation_outputs_derivation_output_details_id_fk" FOREIGN KEY (derivation_output_details_id) REFERENCES derivation_output_details(id)
Referenced by:
    TABLE "derivation_inputs" CONSTRAINT "derivation_output_id_fk" FOREIGN KEY (derivation_output_id) REFERENCES derivation_outputs(id)
Access method: heap
Options: autovacuum_vacuum_scale_factor=0.01, autovacuum_analyze_scale_factor=0.01


Re: Slow recursive CTE query questions, with row estimate and n_distinct issues

From: Michael Lewis
On Mon, Dec 28, 2020 at 7:51 AM Christopher Baines <mail@cbaines.net> wrote:
derivation_inputs:
  COUNT(*):  285422539
  reltuples: 285422528

  derivation_id:
    COUNT(DISTINCT): 7508610
    n_distinct:      4336644 (~57% of the true value)

  derivation_output_id:
    COUNT(DISTINCT): 5539406
    n_distinct:      473762 (~8% of the true value)

If you expect the ratio of distinct derivation_output_id values to total rows to stay roughly constant going forward, you can set a custom value for n_distinct on the column (currently that ratio is about -0.0194, i.e. the distinct count of derivation_output_id divided by reltuples for the table). You could also redo this analysis every month or six and set the custom value as needed.

https://www.postgresql.org/docs/current/sql-altertable.html
I am not sure if it will resolve your query problems though.
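
A minimal sketch of that override, using the ratio computed above (the
new value takes effect at the next ANALYZE):

    ALTER TABLE derivation_inputs
      ALTER COLUMN derivation_output_id SET (n_distinct = -0.0194);
    ANALYZE derivation_inputs;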

Re: Slow recursive CTE query questions, with row estimate and n_distinct issues

From: Greg Spiegelberg
On Mon, Dec 28, 2020 at 7:51 AM Christopher Baines <mail@cbaines.net> wrote:
Hi,

I have a performance problem with a query. I've uploaded it, along with
the EXPLAIN ANALYZE output here [1].

1: https://explain.depesz.com/s/w5vP

I think the query is taking longer than I'd like because PostgreSQL is
not generating a great plan. In particular, the estimated row counts in
parts of the plan are way off from the actual number of rows produced,
and I suspect PostgreSQL would pick a better approach if it made more
reasonable estimates.

Looking at the row numbers on [1], I think things start to go wrong at
line 11. The first part of the recursive CTE is estimated to generate
22,122 rows, which isn't far off the 15,670 rows it actually generates.
The WorkTable scan expects to generate 10x that though, at 221,220.
Maybe that's reasonable; I'm unsure.

Things seem to continue getting out of hand from that point. The
nested loop on line 10 generates 232x fewer rows than estimated, and
this results in an overall overestimate of 4,317x.

My theory as to why the row estimates are so far off is that the
planner is using the n_distinct counts for the derivation_inputs table,
and those counts are way off [2]. I've already set the statistics
target to the maximum for this table, so I'm unsure what else I can do
to improve these estimates.

I've included relevant numbers for the two important tables below
[2][3], as well as descriptions of the tables [4][5].

Does anyone know whether my theory about the row estimates in the plan
is right, or have any ideas as to how I can make the query more
performant?

Thanks,

Chris


Hi Chris

What is your realistic target maximum execution time for the query?

Ideas off the top of my head sans all the usual tuning of shared_buffers et al...

When was the last time autovacuum vacuumed and analyzed the source tables?
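
One way to check, using the cumulative statistics views:

    SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    WHERE relname IN ('derivation_inputs', 'derivation_outputs');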

Can you leverage a materialized view, refreshed when necessary? I'm thinking of the 3 joined tables, but it could be applied to the whole result.
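
A sketch of the narrower version, with a hypothetical view name (the actual query isn't shown in the thread, so the SELECT is illustrative):

    CREATE MATERIALIZED VIEW derivation_input_graph AS
      SELECT di.derivation_id,
             di.derivation_output_id,
             o.derivation_id AS input_derivation_id
      FROM derivation_inputs di
      JOIN derivation_outputs o ON o.id = di.derivation_output_id;

    -- A unique index permits REFRESH ... CONCURRENTLY, which avoids
    -- blocking readers during the refresh:
    CREATE UNIQUE INDEX ON derivation_input_graph (derivation_id, derivation_output_id);
    REFRESH MATERIALIZED VIEW CONCURRENTLY derivation_input_graph;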

If you can guarantee no duplicate rows before and after the current UNION, change it to UNION ALL.
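
In a recursive CTE shaped roughly like this (the real query isn't in the thread, so the structure and starting point are guesses):

    WITH RECURSIVE graph(derivation_id) AS (
        SELECT d.id
        FROM derivations d
        WHERE d.id = 1234        -- hypothetical starting derivation
      UNION ALL                  -- was UNION; ALL skips per-iteration deduplication
        SELECT o.derivation_id
        FROM graph g
        JOIN derivation_inputs di ON di.derivation_id = g.derivation_id
        JOIN derivation_outputs o ON o.id = di.derivation_output_id
    )
    SELECT count(*) FROM graph;

Note that with UNION ALL a node reachable via several paths is visited once per path, so this only pays off if duplicates are genuinely rare.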

If you have the luxury of faster disk, such as NVMe, create a tablespace and move data and/or indexes there.  Adjust storage planner options accordingly.
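
Roughly, with an illustrative path and cost values:

    CREATE TABLESPACE fast_nvme LOCATION '/mnt/nvme/pgdata';
    ALTER INDEX derivation_inputs_pkey SET TABLESPACE fast_nvme;
    -- Per-tablespace planner costs can reflect the faster storage:
    ALTER TABLESPACE fast_nvme
      SET (random_page_cost = 1.1, effective_io_concurrency = 200);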

Personally, I like to see guardrails on RECURSIVE CTEs, but you know your data and clients better than I do, so they may not be necessary. Suggestions include (a sketch of both follows the list):
 1. Limit recursion depth. It does not seem to be the intent here, but ask whether the whole result is really necessary. Not knowing the application, perhaps 10 levels deep is all that is needed initially; if more is required, run the query again with a different guix_revisions.commit starting point. Think of it as pagination.
 2. Cycle detection/prevention, though this depends on the hierarchy represented by the data. Cycles don't seem to be an issue here, since the query as written does return.
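
A sketch of both guardrails in pre-v14 style (the built-in CYCLE clause only arrived in PostgreSQL 14); the starting point and depth limit are illustrative:

    WITH RECURSIVE graph(derivation_id, depth, path) AS (
        SELECT d.id, 1, ARRAY[d.id]
        FROM derivations d
        WHERE d.id = 1234                      -- hypothetical starting point
      UNION ALL
        SELECT o.derivation_id, g.depth + 1, g.path || o.derivation_id
        FROM graph g
        JOIN derivation_inputs di ON di.derivation_id = g.derivation_id
        JOIN derivation_outputs o ON o.id = di.derivation_output_id
        WHERE g.depth < 10                     -- 1. limit recursion depth
          AND o.derivation_id <> ALL (g.path)  -- 2. prevent revisiting nodes
    )
    SELECT * FROM graph;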

HTH.

-Greg

Re: Slow recursive CTE query questions, with row estimate and n_distinct issues

From: Christopher Baines
Michael Lewis <mlewis@entrata.com> writes:

> On Mon, Dec 28, 2020 at 7:51 AM Christopher Baines <mail@cbaines.net> wrote:
>
>> derivation_inputs:
>>   COUNT(*):  285422539
>>   reltuples: 285422528
>>
>>   derivation_id:
>>     COUNT(DISTINCT): 7508610
>>     n_distinct:      4336644 (~57% of the true value)
>>
>>   derivation_output_id:
>>     COUNT(DISTINCT): 5539406
>>     n_distinct:      473762 (~8% of the true value)
>>
>
> If you expect the ratio of distinct derivation_output_id values to
> total rows to stay roughly constant going forward, you can set a
> custom value for n_distinct on the column (currently that ratio is
> about -0.0194, i.e. the distinct count of derivation_output_id divided
> by reltuples for the table). You could also redo this analysis every
> month or six and set the custom value as needed.
>
> https://www.postgresql.org/docs/current/sql-altertable.html
>
> I am not sure if it will resolve your query problems though.

Thanks, Michael. I didn't realise a custom value could be set, but I'll
look into this.

I actually managed to speed the query up enough by increasing
work_mem/shared_buffers. I didn't realise one of the sequential scans
was executing 14 times, but giving PostgreSQL more resources means it
just executes once, which really helps.
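
For the record, the knobs look like this; the values here are
illustrative, not the ones actually used:

    SET work_mem = '256MB';                   -- can also be set per session
    ALTER SYSTEM SET shared_buffers = '8GB';  -- server-wide; requires a restart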

Thanks again,

Chris


Re: Slow recursive CTE query questions, with row estimate and n_distinct issues

From: Christopher Baines
Greg Spiegelberg <gspiegelberg@gmail.com> writes:

> On Mon, Dec 28, 2020 at 7:51 AM Christopher Baines <mail@cbaines.net> wrote:
>
>> Hi,
>>
>> I have a performance problem with a query. I've uploaded it, along with
>> the EXPLAIN ANALYZE output here [1].
>>
>> 1: https://explain.depesz.com/s/w5vP
>>
>> I think the query is taking longer than I'd like because PostgreSQL is
>> not generating a great plan. In particular, the estimated row counts in
>> parts of the plan are way off from the actual number of rows produced,
>> and I suspect PostgreSQL would pick a better approach if it made more
>> reasonable estimates.
>>
>> Looking at the row numbers on [1], I think things start to go wrong at
>> line 11. The first part of the recursive CTE is estimated to generate
>> 22,122 rows, which isn't far off the 15,670 rows it actually generates.
>> The WorkTable scan expects to generate 10x that though, at 221,220.
>> Maybe that's reasonable; I'm unsure.
>>
>> Things seem to continue getting out of hand from that point. The
>> nested loop on line 10 generates 232x fewer rows than estimated, and
>> this results in an overall overestimate of 4,317x.
>>
>> My theory as to why the row estimates are so far off is that the
>> planner is using the n_distinct counts for the derivation_inputs table,
>> and those counts are way off [2]. I've already set the statistics
>> target to the maximum for this table, so I'm unsure what else I can do
>> to improve these estimates.
>>
>> I've included relevant numbers for the two important tables below
>> [2][3], as well as descriptions of the tables [4][5].
>>
>> Does anyone know whether my theory about the row estimates in the plan
>> is right, or have any ideas as to how I can make the query more
>> performant?

> Hi Chris
>
> What is your realistic target maximum execution time for the query?

It's for a web page. It can be slow, so 10 seconds is a reasonable
target.

> Ideas off the top of my head sans all the usual tuning of shared_buffers et
> al...
>
> When was the last time autovacuum vacuumed and analyzed the source tables?
>
> Can you leverage a materialized view, refreshed when necessary? I'm
> thinking of the 3 joined tables, but it could be applied to the whole
> result.
>
> If you can guarantee no duplicate rows before and after the current
> UNION, change it to UNION ALL.
>
> If you have the luxury of faster disk, such as NVMe, create a tablespace
> and move data and/or indexes there.  Adjust storage planner options
> accordingly.
>
> Personally, I like to see guardrails on RECURSIVE CTEs, but you know
> your data and clients better than I do, so they may not be necessary.
> Suggestions include:
>  1. Limit recursion depth. It does not seem to be the intent here, but
> ask whether the whole result is really necessary. Not knowing the
> application, perhaps 10 levels deep is all that is needed initially;
> if more is required, run the query again with a different
> guix_revisions.commit starting point. Think of it as pagination.
>  2. Cycle detection/prevention, though this depends on the hierarchy
> represented by the data. Cycles don't seem to be an issue here, since
> the query as written does return.

Thanks for your suggestions, Greg. It turns out I managed to speed the
query up enough by increasing work_mem and shared_buffers. I didn't
realise one of the sequential scans was executing 14 times, but giving
PostgreSQL more resources means it just executes once, which really
helps.

There's probably still room for improvement, but it's "fast enough" now.

Thanks again,

Chris
