Discussion: inefficient/wrong plan cache mode selection for queries with partitioned tables (PostgreSQL 17)
Hi,
I found a case where the plan cache keeps switching to custom plans, forcing a query replan on every call (and thus slowing the whole query down by 10x or more).
What makes the situation intriguing is that the custom and generic plans are identical.
job_stats_master is a partitioned table with 24 partitions (one per month for the last 2 years).
Problem query:
prepare qqq(timestamp, timestamp) AS
SELECT *
FROM "job_stats_master"
WHERE
"job_stats_master"."created_at" BETWEEN $1 AND $2 AND "job_stats_master"."job_reference" = '******' AND "job_stats_master"."job_board_id" = 27068
ORDER BY "created_at" DESC LIMIT 1;
plan (after the 6th execution):
explain analyze execute qqq('2025-04-11 09:22:00.193'::timestamp without time zone, '2025-05-12 09:22:00.203'::timestamp without time zone);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1.14..1.29 rows=1 width=384) (actual time=0.026..0.026 rows=1 loops=1)
-> Append (cost=1.14..9.10 rows=50 width=384) (actual time=0.025..0.026 rows=1 loops=1)
-> Index Scan Backward using job_stats_new_2025_05_job_board_id_job_reference_created_at_idx on job_stats_new_2025_05 job_stats_master_2 (cost=0.56..3.28 rows=18 width=368) (actual time=0.025..0.025 rows=1 loops=1)
Index Cond: ((job_board_id = 27068) AND ((job_reference)::text = '******'::text) AND (created_at >= '2025-04-11 09:22:00.193'::timestamp without time zone) AND (created_at <= '2025-05-12 09:22:00.203'::timestamp without time zone))
-> Index Scan Backward using job_stats_new_2025_04_job_board_id_job_reference_created_at_idx on job_stats_new_2025_04 job_stats_master_1 (cost=0.57..5.32 rows=32 width=394) (never executed)
Index Cond: ((job_board_id = 27068) AND ((job_reference)::text = '******'::text) AND (created_at >= '2025-04-11 09:22:00.193'::timestamp without time zone) AND (created_at <= '2025-05-12 09:22:00.203'::timestamp without time zone))
Planning Time: 0.611 ms
Execution Time: 0.057 ms
(8 rows)
plan with set plan_cache_mode TO force_generic_plan;
explain analyze execute qqq('2025-04-11 09:22:00.193'::timestamp without time zone, '2025-05-12 09:22:00.203'::timestamp without time zone);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=19.06..19.32 rows=1 width=407) (actual time=0.030..0.030 rows=1 loops=1)
-> Append (cost=19.06..26.74 rows=29 width=407) (actual time=0.029..0.030 rows=1 loops=1)
Subplans Removed: 27
-> Index Scan Backward using job_stats_new_2025_05_job_board_id_job_reference_created_at_idx on job_stats_new_2025_05 job_stats_master_2 (cost=0.56..0.82 rows=1 width=368) (actual time=0.029..0.029 rows=1 loops=1)
Index Cond: ((job_board_id = 27068) AND ((job_reference)::text = '*******'::text) AND (created_at >= $1) AND (created_at <= $2))
-> Index Scan Backward using job_stats_new_2025_04_job_board_id_job_reference_created_at_idx on job_stats_new_2025_04 job_stats_master_1 (cost=0.57..0.83 rows=1 width=394) (never executed)
Index Cond: ((job_board_id = 27068) AND ((job_reference)::text = '*******'::text) AND (created_at >= $1) AND (created_at <= $2))
Planning Time: 0.033 ms
Execution Time: 0.086 ms
The plans are de facto the same and performance is almost the same, but with custom plans 20x more time is spent on planning.
At over 1M RPS this becomes quite an issue, even for the best available servers.
No playing with cost parameters changes the selection of the custom plan over the generic one.
As I understand it, there is an issue with the costing model: the generic plan's cost assumes all 24 partitions will be visited, while the custom plan prunes partitions during planning. So the custom plan always wins "by cost" - at a huge loss in real performance (even though the actual executed plans are the same in both cases).
I suspect this situation is quite common for queries over partitioned tables (where planning time is usually quite high).
Any suggestions on what could be done here, outside of using force_generic_plan for a particular db user (which will surely kill performance of other queries)?
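The plan-cache choice being described can be sketched roughly as follows. This is an illustrative model only - the function name, the flat planning-cost fudge, and all the numbers are assumptions made for the example, not PostgreSQL's actual plancache.c internals:

```python
# Rough, illustrative model of the cached-plan choice: after the first
# five executions, the generic plan is kept only if its estimated cost
# is below the average custom-plan cost (which includes a small
# planning-cost fudge). All numbers below are hypothetical, chosen to
# mirror the case in this thread.

def choose_generic(generic_cost, custom_plan_costs, planning_fudge):
    """Return True if the plan cache would settle on the generic plan."""
    avg_custom = sum(c + planning_fudge for c in custom_plan_costs) / len(custom_plan_costs)
    return generic_cost < avg_custom

# Custom plans prune down to 2 partitions at plan time -> tiny cost.
custom_costs = [9.10] * 5   # cost of the pruned custom plan, per execution
generic_cost = 320.0        # generic cost counts all 24 partitions
planning_fudge = 1.0        # small planner-cost estimate added per custom plan

# The generic plan looks far more expensive on paper, so the cache
# replans a custom plan on every call -- even though runtime pruning
# makes the generic plan just as fast in practice.
print(choose_generic(generic_cost, custom_costs, planning_fudge))  # False
```

Because the comparison is purely cost-based, no amount of extra executions changes the outcome: the generic plan never gets credit for runtime partition pruning.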
Maxim Boguk
Senior Postgresql DBA
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
Re: inefficient/wrong plan cache mode selection for queries with partitioned tables (postgresql 17)
From
Andrei Lepikhov
Date:
On 5/12/25 13:49, Maxim Boguk wrote:
> I suspect this situation should be quite common with queries over
> partitioned tables (where planning time is usually quite a high).
>
> Any suggestions what could be done there outside of using
> force_generic_plan for a particular db user (which will kill performance
> in other queries for sure)?

Thanks for this puzzle!
I suppose, in case generic planning is much faster than custom planning, there are two candidates:
1. Touching the index during planning causes too much overhead - see get_actual_variable_range
2. You have a massive default_statistics_target for a table involved.
So, to clarify the problem, could you provide EXPLAIN (without ANALYZE) with BUFFERS ON?
Also, could you provide extra information on the statistics involved? For each column (I think created_at is the most important one), show the size of the MCV and histogram arrays.

--
regards, Andrei Lepikhov
On Mon, May 12, 2025 at 3:08 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
On 5/12/25 13:49, Maxim Boguk wrote:
> I suspect this situation should be quite common with queries over
> partitioned tables (where planning time is usually quite a high).
>
> Any suggestions what could be done there outside of using
> force_generic_plan for a particular db user (which will kill performance
> in other queries for sure)?
Thanks for this puzzle!
I suppose, in case generic planning is much faster than custom planning,
there are two candidates:
1. Touching the index during planning causes too much overhead - see
get_actual_variable_range
2. You have a massive default_statistics_target for a table involved.
So, to clarify the problem, could you provide EXPLAIN (without ANALYZE)
with BUFFERS ON?
Also, could you provide extra information on the statistics involved?
For each column (I think created_at is the most important one), show the
size of MCV and histogram arrays.
--
regards, Andrei Lepikhov
clickcast=# explain (buffers) execute qqq('2025-04-11 09:22:00.193'::timestamp without time zone, '2025-05-12 09:22:00.203'::timestamp without time zone);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1.14..1.29 rows=1 width=385)
-> Append (cost=1.14..9.10 rows=50 width=385)
-> Index Scan Backward using job_stats_new_2025_05_job_board_id_job_reference_created_at_idx on job_stats_new_2025_05 job_stats_master_2 (cost=0.56..3.28 rows=18 width=371)
Index Cond: ((job_board_id = 27068) AND ((job_reference)::text = '*****'::text) AND (created_at >= '2025-04-11 09:22:00.193'::timestamp without time zone) AND (created_at <= '2025-05-12 09:22:00.203'::timestamp without time zone))
-> Index Scan Backward using job_stats_new_2025_04_job_board_id_job_reference_created_at_idx on job_stats_new_2025_04 job_stats_master_1 (cost=0.57..5.32 rows=32 width=394)
Index Cond: ((job_board_id = 27068) AND ((job_reference)::text = '*******'::text) AND (created_at >= '2025-04-11 09:22:00.193'::timestamp without time zone) AND (created_at <= '2025-05-12 09:22:00.203'::timestamp without time zone))
Planning:
Buffers: shared hit=16
16 buffers most of the time; sometimes (~5% of cases) it is about 12k (Buffers: shared hit=12511) - I have no idea why.
show default_statistics_target ;
default_statistics_target
---------------------------
100
No custom statistics targets on this table or its partitions.
select tablename,attname,inherited,null_frac,n_distinct,array_length(most_common_vals,1) mcv, array_length(histogram_bounds,1) hist from pg_stats where tablename IN ('job_stats_master', 'job_stats_new_2025_04', 'job_stats_new_2025_05') and attname in ('created_at', 'job_board_id', 'job_reference') order by tablename, attname;
tablename | attname | inherited | null_frac | n_distinct | mcv | hist
-----------------------+---------------+-----------+------------+--------------+-----+------
job_stats_master | created_at | t | 0 | 1.066586e+06 | 15 | 101
job_stats_master | job_board_id | t | 0.52743334 | 1716 | 100 | 101
job_stats_master | job_reference | t | 0 | -0.1 | 39 | 101
job_stats_new_2025_04 | created_at | f | 0 | 832508 | 39 | 101
job_stats_new_2025_04 | job_board_id | f | 0.47096667 | 1096 | 100 | 101
job_stats_new_2025_04 | job_reference | f | 0 | -0.1 | 93 | 101
job_stats_new_2025_05 | created_at | f | 0 | 709166 | 42 | 101
job_stats_new_2025_05 | job_board_id | f | 0.4703 | 1142 | 100 | 101
job_stats_new_2025_05 | job_reference | f | 0 | -0.1 | 100 | 101
PS: the problem is not the difference between custom and generic planning time as such, but prepared statements:
the generic plan is planned only once, while a custom plan is planned on every call (and planning time accounts for 95% of total query runtime).
Maxim Boguk
Senior Postgresql DBA
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
Re: inefficient/wrong plan cache mode selection for queries with partitioned tables (postgresql 17)
From
Andrei Lepikhov
Date:
On 5/12/25 15:08, Maxim Boguk wrote:
> PS: problem not with difference between custom and generic planning time
> but with prepared statements
> generic plan plans only once, but custom plan plan every call (and plan
> time cost 95% on total query runtime).

Ah, now I get it.
I'm aware of this problem from at least two sources of regular complaints.
What can you do here? Let's imagine a palliative solution:
Having pg_stat_statements data, the list of prepared statements (see pg_prepared_statements), and queryId enabled, there is a way to force a custom or generic plan in specific cases only: look at the min/max query execution time. If there is no big difference and the planning time is significant, setting force_generic_plan for this statement makes sense. Otherwise, if the planning time is negligible or the generic plan is unstable - switch to force_custom_plan.
It would not be hard to write such a tiny extension. As I see it, the only extra stored "C" procedure needed is one that sets a force-plan-type flag employing FetchPreparedStatement(). The rest of the code - querying pg_stat_statements and switching between plan types - could be written in plpgsql.
If I'm not mistaken, it would work with all PG versions that are currently in support. What do you think?

--
regards, Andrei Lepikhov
On Mon, May 12, 2025 at 4:48 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
On 5/12/25 15:08, Maxim Boguk wrote:
> PS: problem not with difference between custom and generic planning time
> but with prepared statements
> generic plan plans only once, but custom plan plan every call (and plan
> time cost 95% on total query runtime).
Ah, now I get it.
I'm aware of this problem from at least two sources of regular complaints.
What can you do here? Let's imagine a palliative solution:
Having pg_stat_statements data, the list of prepared statements (see
pg_prepared_statements), and queryId enabled, there is a way to force a
custom or generic plan in specific cases only: look at the min/max
query execution time. If there is no big difference and the planning
time is significant, setting force_generic_plan for this statement
makes sense. Otherwise, if the planning time is negligible or the
generic plan is unstable - switch to force_custom_plan.
It would not be hard to write such a tiny extension. As I see it, the
only extra stored "C" procedure needed is one that sets a
force-plan-type flag employing FetchPreparedStatement(). The rest of
the code - querying pg_stat_statements and switching between plan
types - could be written in plpgsql.
If I'm not mistaken, it would work with all PG versions that are
currently in support. What do you think?
Such an extension would be very useful (and in general, a solution based on actual execution data seems more stable/predictable than the plan-cost-based selection currently used by PostgreSQL).
Maxim Boguk
Senior Postgresql DBA
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
On Mon, May 12, 2025 at 4:48 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
On 5/12/25 15:08, Maxim Boguk wrote:
> PS: problem not with difference between custom and generic planning time
> but with prepared statements
> generic plan plans only once, but custom plan plan every call (and plan
> time cost 95% on total query runtime).
Ah, now I get it.
I'm aware of this problem from at least two sources of regular complaints.
What can you do here? Let's imagine a palliative solution:
Having pg_stat_statements data, the list of prepared statements (see
pg_prepared_statements), and queryId enabled, there is a way to force a
custom or generic plan in specific cases only: look at the min/max
query execution time. If there is no big difference and the planning
time is significant, setting force_generic_plan for this statement
makes sense. Otherwise, if the planning time is negligible or the
generic plan is unstable - switch to force_custom_plan.
It would not be hard to write such a tiny extension. As I see it, the
only extra stored "C" procedure needed is one that sets a
force-plan-type flag employing FetchPreparedStatement(). The rest of
the code - querying pg_stat_statements and switching between plan
types - could be written in plpgsql.
If I'm not mistaken, it would work with all PG versions that are
currently in support. What do you think?
But a more general question: this exact issue will affect every prepared query that selects only a subset of partitions.
In this case, the current logic will always select the custom plan over the generic plan (even when both plans are actually the same).
E.g. if a fast/cheap query over a partitioned table has conditions that allow pruning down to only a few partitions, the custom plan always wins, whatever the database settings (outside of the force_generic_plan hammer).
It seems something could be done about the cost calculation of the generic plan.
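One way to picture the adjustment being asked for: if the plan cache could discount the generic plan's cost by the fraction of partitions expected to survive runtime pruning (the "Subplans Removed: 27" seen earlier), the cost comparison would flip. This is purely a thought experiment with made-up numbers, not an existing or proposed PostgreSQL mechanism:

```python
# Hypothetical adjustment: scale a generic plan's total cost by the
# fraction of partitions expected to survive runtime pruning before
# comparing it with the average custom-plan cost. Numbers are loosely
# based on the EXPLAIN output in this thread (29 subplans, 27 removed).

def pruned_generic_cost(raw_generic_cost: float,
                        total_partitions: int,
                        expected_surviving: int) -> float:
    """Discount a generic plan's cost by the expected pruning fraction."""
    assert 0 < expected_surviving <= total_partitions
    return raw_generic_cost * expected_surviving / total_partitions

# With 2 of 29 subplans expected to survive pruning, the discounted
# generic cost drops below a typical custom-plan cost, so the cache
# would settle on the generic plan and stop replanning every call.
print(pruned_generic_cost(26.74 * 29, 29, 2) < 26.74 * 5)  # True
```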
Maxim Boguk
Senior Postgresql DBA
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
Re: inefficient/wrong plan cache mode selection for queries with partitioned tables (postgresql 17)
From
Andrei Lepikhov
Date:
On 5/12/25 16:04, Maxim Boguk wrote:
> On Mon, May 12, 2025 at 4:48 PM Andrei Lepikhov <lepihov@gmail.com>:
>> If I'm not mistaken, it will work with all PG versions that are
>> currently in support. What do you think?
>
> Such extension would be very useful (and in general - the solution based
> on the actual execution data - seems more stable/predictable than the
> plan cost based selection which is currently used by postgresql).

I've written a sketch of such an extension; see [1]. A trivial strategy is implemented: force a prepared statement to use a generic plan if its planning time exceeds its maximum execution time. It is just an example - it is written in plpgsql, and you can implement alternative strategies independently.
I would be happy if it covered your use case. Any feedback is welcome.

[1] https://github.com/danolivo/pg_mentor/tree/main

--
regards, Andrei Lepikhov