Discussion: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join


Hello Team,

 

There is a change in the query plan between version 12.4 and version 13, resulting in a performance slowdown after the upgrade.

 

  • In version 12.4, a Sort plus GroupAggregate is selected, which enables a Merge Join. The query takes ~5 seconds.
  • In version 13.5, the optimizer misestimates and, due to the new disk-based Hash Aggregate feature, prefers a Hash Aggregate over the Sort. This blocks the Merge Join, and the planner chooses a Nested Loop Left Join instead. The query takes ~5 minutes.

 

When we increase work_mem to 23 MB, the disk usage disappears from the query plan, but the optimizer still chooses the Hash Aggregate / Nested Loop Left Join (rather than Sort / Merge Join), so the query remains slow at ~22 seconds.

 

The version 13 query plan has a lower estimated cost than the 12.4 plan, which implies the 13.5 planner thought it had found a better plan, yet it runs slower.

 

12.4 Version:

"Merge Right Join  (cost=202198.78..295729.10 rows=1 width=8) (actual time=1399.727..5224.574 rows=296 loops=1)"

 

13.5 version:

"Nested Loop Left Join  (cost=196360.90..287890.45 rows=1 width=8) (actual time=3209.577..371300.693 rows=296 loops=1)"

 

 

 

Thanks & Regards,

 

Prajna Shetty
Technical Specialist,

Data Platform Support & Delivery




http://www.mindtree.com/email/disclaimer.html
Attachments
Prajna Shetty <Prajna.Shetty@mindtree.com> writes:
> There is change in query plan in 12.4 version and Version 13 resulting in performance slowness post upgrade.

Standard upgrade methods don't transfer statistics from the old version,
so the first question to ask is have you ANALYZE'd the relevant tables
since upgrading?
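
As context for readers of the thread, refreshing statistics after a major-version upgrade is a one-liner; a minimal sketch (the table name is taken from the plans later in this thread):

```sql
-- Re-gather planner statistics for the whole database after the upgrade
ANALYZE VERBOSE;

-- Or target just the tables involved in the slow query
ANALYZE student_class_detail;
```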

If you have, then to offer useful help with this we'll need to see all
the details described in

https://wiki.postgresql.org/wiki/Slow_Query_Questions

In any case, this is unlikely to be a bug.  The pgsql-performance
list would be a more suitable place to discuss it.

            regards, tom lane



 
Hello Team,
 
There is a change in the query plan between version 12.4 and version 13, resulting in a performance slowdown after the upgrade.
 
  • In version 12.4, a Sort plus GroupAggregate is selected, which enables a Merge Join. The query takes ~5 seconds.
  • In version 13.5, the optimizer misestimates and, due to the new disk-based Hash Aggregate feature, prefers a Hash Aggregate over the Sort. This blocks the Merge Join, and the planner chooses a Nested Loop Left Join instead. The query takes ~5 minutes.
 
NOTE: Disabling Hash Aggregate at the instance level forces the optimizer to choose the merge operation, but such an instance-level change is not acceptable in terms of application functionality.
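
As an aside for readers, PostgreSQL also allows this planner toggle per session or per transaction, which may be less invasive than an instance-wide change; a hedged sketch:

```sql
-- Scope the toggle to one transaction instead of the whole instance.
BEGIN;
SET LOCAL enable_hashagg = off;  -- reverts automatically at COMMIT/ROLLBACK
-- ... run the affected query here ...
COMMIT;
```

Whether this is workable depends on whether the application can wrap the affected queries in a transaction.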
 
This performance issue affects most of our queries. Attached is one of the queries and its plan on both versions, in case that helps reproduce the issue.
 
The version 13 query plan has a lower estimated cost than the 12.4 plan, which implies the 13.5 planner thought it had found a better plan, yet it runs slower and the actual time is higher.
 
12.4 Version:
"Merge Right Join  (cost=202198.78..295729.10 rows=1 width=8) (actual time=1399.727..5224.574 rows=296 loops=1)"
 
13.5 version:
"Nested Loop Left Join  (cost=196360.90..287890.45 rows=1 width=8) (actual time=3209.577..371300.693 rows=296 loops=1)"
 
Details:
  1. It is an AWS Aurora-PostgreSQL RDS instance. We have raised a case with AWS, and since this issue is a regression coming from the community PostgreSQL code, we would like to raise it as a bug here.
  2. We were upgrading from version 12.4 to 13.4 and later.
  3. vCPU: 2, RAM: 8 GB.
  4. Attached are stats for all tables in this schema for your reference.
  5. Attached is metadata for one of the tables (person) for your reference.
We have tried many steps such as the below, but they did not help:

  1. We performed VACUUM/ANALYZE/REINDEX post upgrade.
  2. We tweaked work_mem so the aggregate does not spill to disk. The disk usage disappears, but the plan still uses a Hash Aggregate; runtime came down from ~5 minutes to ~20 seconds (expected ~5 seconds). Attached is the plan after modifying work_mem.
  3. We disabled seqscan/nestloop (enable_seqscan/enable_nestloop).
  4. We tweaked random_page_cost/seq_page_cost.
  5. We set default_statistics_target = 1000 and then ran VACUUM (ANALYZE, VERBOSE) on selected tables.
  6. We also tested performance after increasing resources to 4 vCPU and 32 GB RAM.
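
As a note for readers, the statistics-target change in step 5 can also be applied to a single column instead of globally; a hedged sketch (the table and column names are taken from the plans in this thread):

```sql
-- Raise the sample size only for the column driving the misestimate,
-- then refresh that table's statistics.
ALTER TABLE student_class_detail
    ALTER COLUMN class_level_cd SET STATISTICS 1000;
ANALYZE student_class_detail;
```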
 
Could you please check and confirm whether this incorrect cost estimation is a known issue in version 13, where in some cases the optimizer prefers a Hash Aggregate -> Nested Loop Left Join instead of a Merge Join?
 
 
 
Thanks & Regards,
 
Prajna Shetty
Technical Specialist,
Data Platform Support & Delivery
 
 
 
 
-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Monday, March 21, 2022 7:29 PM
To: Prajna Shetty <Prajna.Shetty@mindtree.com>
Cc: pgsql-bugs@lists.postgresql.org; Beenu Sharma <Beenu.Sharma@mindtree.com>
Subject: Re: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join
 
 
Prajna Shetty <Prajna.Shetty@mindtree.com> writes:
> There is change in query plan in 12.4 version and Version 13 resulting in performance slowness post upgrade.
 
Standard upgrade methods don't transfer statistics from the old version, so the first question to ask is have you ANALYZE'd the relevant tables since upgrading?
 
If you have, then to offer useful help with this we'll need to see all the details described in

https://wiki.postgresql.org/wiki/Slow_Query_Questions
In any case, this is unlikely to be a bug.  The pgsql-performance list would be a more suitable place to discuss it.
 
                        regards, tom lane
 

  ________________________________  
Attachments

On 3/22/22 13:57, Prajna Shetty wrote:
> ++ _pgsql-performance@postgresql.org_
> <https://www.postgresql.org/list/pgsql-performance/>
>  
> Hello Team,
>  
> There is change in query plan in 12.4 version and Version 13 resulting
> in performance slowness post upgrade.
>  
> 
>   * In 12.4 version, Sort Operation Group Aggregate is selected which
>     results to Merge Join. Query takes ~5 seconds.
>   * In 13.5 version, optimizer wrongly estimates and due to new Disk
>     Based Hash Aggregate feature, it prefers Hash Aggregate instead of
>     Sort Operation which finally blocks merge-join and chooses Nested
>     Loop Left Join. Query takes ~5 minutes.
> 
>  
> *_NOTE: _*Disabling Hash Aggregate on instance level forces optimizer to
> choose merge operation but such instance level modification is not
> possible in terms of Application Functionality.
>  
> This performance issue is on all over most of queries. Attached one of
> the query and its plan in both version for reference in case that helps
> for recreating the issue.
>  

It's impossible to comment on those other queries, but chances are the root
cause is the same.

> Version 13 query plan has lower estimated cost than that of 12.4 which
> implies 13.5 planner thought it found a better plan, but it is running
> slower and actual cost show more.
>  
> 12.4 Version:
> "Merge Right Join  (cost=*202198.78..295729.10* rows=1 width=8) (actual
> time=1399.727..*5224.574* rows=296 loops=1)"
>  
> 13.5 version:-
> "Nested Loop Left Join  (cost=*196360.90..287890.45* rows=1 width=8)
> (actual time=3209.577..*371300.693* rows=296 loops=1)"
>  

This is not a costing issue; the problem is that we expect 1 row and
calculate the cost for that, but then get 296. And unfortunately a
nested loop degrades much faster than a merge join.

I'm not sure why exactly 12.4 picked a merge join; chances are the
costing formula changed a bit somewhere. But as I said, the problem is
the bogus row cardinality estimates - 12.4 is simply lucky.

The problem most likely stems from this part:

  ->  GroupAggregate  (cost=0.43..85743.24 rows=1830 width=72) (actual
time=1.621..3452.034 rows=282179 loops=3)
  Group Key: student_class_detail.aamc_id
  Filter: (max((student_class_detail.class_level_cd)::text) = '4'::text)
  Rows Removed by Filter: 76060
  ->  Index Scan using uk_student_class_detail_aamcid_classlevelcd on
student_class_detail  (cost=0.43..74747.61 rows=1284079 width=6) (actual
time=1.570..2723.014 rows=1272390 loops=3)
    Filter: (class_level_start_dt IS NOT NULL)
    Rows Removed by Filter: 160402

The filter is bound to be misestimated, and the error then snowballs.
Try replacing this part with a temporary table (with pre-aggregated
results) - you can run analyze on it, etc. I'd bet that'll make the
issue go away.
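
A sketch of that workaround, built from the GroupAggregate shown in the plan above (table and column names come from that plan; the exact query shape around it is an assumption):

```sql
-- Pre-aggregate into a temp table so the planner can gather real
-- statistics on the result instead of guessing the cardinality.
CREATE TEMP TABLE student_class_agg AS
SELECT aamc_id
FROM   student_class_detail
WHERE  class_level_start_dt IS NOT NULL
GROUP  BY aamc_id
HAVING max(class_level_cd::text) = '4';

ANALYZE student_class_agg;

-- Then join the main query against student_class_agg instead of
-- the original aggregate subquery.
```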

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



