Discussion: simple patch for discussion
This is my first attempt at a patch for PostgreSQL, so please forgive me if I have forgotten a step. I recently got a new system with many more CPU cores than the systems I am used to: 128 cores (which may not seem like a large number to some). I was a bit unhappy that, even though I configured max_worker_processes and max_parallel_workers to values above 90, my queries were only being allocated 10 workers, and by fiddling with min_parallel_table_scan_size and min_parallel_index_scan_size I could only get that up to 14 or so. The algorithm in allpaths.c appears to base the number of workers on the log base 3 of the ratio of the relation size to the table/index scan-size threshold. I suggest replacing this with the square root of that ratio, which grows faster than log3. My thinking is that there are already ways to configure PostgreSQL so it won't allocate an excessive number of workers, but there isn't an easy way to get more workers. I enclose the patch for consideration. My understanding is that a lot of work is going into getting v19 ready; if people would prefer to wait for that before considering my patch, I understand.

V/r Greg
Attachments
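For readers skimming the thread, here is a minimal standalone sketch of the two scaling rules under discussion. It is not the attached patch: the threshold handling is simplified, and the ceil() in the sqrt rule is an assumption chosen so the numbers line up with the worker counts quoted later in the thread.

#include <math.h>
#include <stdio.h>

/*
 * Simplified comparison of the current and proposed worker-count rules.
 * "pages" is the relation size in 8 kB pages and "threshold" plays the
 * role of min_parallel_table_scan_size.  The real compute_parallel_worker()
 * in allpaths.c also handles index pages, per-table reloptions and the
 * max_parallel_workers_per_gather cap.
 */
static int
workers_log3(double pages, double threshold)
{
	int			workers = 1;

	/* one more worker each time the relation triples past the threshold */
	while (pages >= threshold * 3.0)
	{
		workers++;
		threshold *= 3.0;
	}
	return workers;
}

static int
workers_sqrt(double pages, double threshold)
{
	/* proposed rule: grow with the square root of the same ratio */
	return (int) ceil(sqrt(pages / threshold));
}

int
main(void)
{
	double		threshold = 1024.0;		/* 8 MB default, in 8 kB pages */
	double		sizes_mb[] = {1024.0, 1048576.0};

	for (int i = 0; i < 2; i++)
	{
		double		pages = sizes_mb[i] * 1024.0 / 8.0;

		printf("Table Size = %.0f MB  log3 workers = %d  sqrt workers = %d\n",
			   sizes_mb[i], workers_log3(pages, threshold),
			   workers_sqrt(pages, threshold));
	}
	return 0;
}

With the 8 MB default threshold this prints 5 vs. 12 workers for a 1 GB table and 11 vs. 363 for a 1 TB table, matching the comparison David Rowley posts below.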
On Thu, 17 Jul 2025 at 12:44, Greg Hennessy <greg.hennessy@gmail.com> wrote:
> workers, but there isn't an easy way to get more
> workers.

Is "alter table ... set (parallel_workers=N);" not easy enough?

David
> On Thu, 17 Jul 2025 at 12:44, Greg Hennessy <greg.hennessy@gmail.com> wrote:
>> workers, but there isn't an easy way to get more
>> workers.

On 7/16/25 11:01 PM, David Rowley wrote:
> Is "alter table ... set (parallel_workers=N);" not easy enough?

It may be easy enough for one table, but that won't work for joins as far as I can tell. I'd like to have more CPUs available in more cases.
On 7/16/25 11:01 PM, David Rowley wrote:
> Is "alter table ... set (parallel_workers=N);" not easy enough?

No opinions on the merit of the patch, but it's not as easy as better default behavior, right? That is, the important questions are whether the proposed behavior is better, and whether the change in default behavior is likely to cause any problems. If there's uncertainty about those, the options for a workaround are, of course, relevant.
Hi,

On 2025-07-17 15:01:55 +1200, David Rowley wrote:
> On Thu, 17 Jul 2025 at 12:44, Greg Hennessy <greg.hennessy@gmail.com> wrote:
> > workers, but there isn't an easy way to get more
> > workers.
>
> Is "alter table ... set (parallel_workers=N);" not easy enough?

I don't think that's a great approach, as it basically means the user has to do all the computation for how many workers are a good idea themselves. Manually setting it obviously doesn't deal with future growth etc.

Right now we basically assume that the benefit of parallelism reduces substantially with every additional parallel worker, but for things like seqscans that's really not true. I've seen reasonably-close-to-linear scalability for parallel seqscans up to 48 workers (the number of CPUs in the system I tested on). Given that, our degree-of-parallelism logic doesn't really make sense.

Greetings,

Andres Freund
On Fri, 18 Jul 2025 at 05:03, Andres Freund <andres@anarazel.de> wrote:
> Right now we basically assume that the benefit of parallelism reduces
> substantially with every additional parallel worker, but for things like
> seqscans that's really not true. I've seen reasonably-close-to-linear
> scalability for parallel seqscans up to 48 workers (the CPUs in the system I
> tested on). Given that our degree-of-parallelism logic doesn't really make
> sense.

What you're saying is true, but the problem with doing what's proposed is that giving so many more workers to 1 query just increases the chances that some other plan being executed gets no workers because they're all in use. The problem with that is that the disadvantage of giving a parallel plan zero workers is absolutely worse than the advantage you get from giving a parallel plan additional workers. The reason for this is that the planner doesn't parallelise the cheapest serial plan. It picks the cheapest plan based on the assumption that whatever number of workers compute_parallel_worker() calculates will be available for use during execution, and the larger the number that function returns, the greater the chance you have of getting a parallel plan due to how the planner divides the Path costs by the calculated number of parallel workers.

I could imagine that there might be room to add more configuration to how compute_parallel_worker() calculates the return value. I don't think there's any room to swap it out with something as aggressive as what's being proposed without any means for users to have something slightly more conservative like what's there today. There is already a complaint in [1] that states we should be trying to reduce the number of concurrent backends in order to reduce context switching. What's being proposed here just makes that problem worse.

I suggest to Greg that he might want to come up with a method to make this configurable and a means to get something close to what we get today and default the setting to that. It would also be easier to follow any proposed algorithms with some example output. I used the attached .c file to give me that. There's quite a jump in workers with the proposed algorithm, e.g.:

Table Size = 1024 MB    old workers = 5,  new workers = 12
Table Size = 1048576 MB old workers = 11, new workers = 363

So I kinda doubt we could get away with such a radical change without upsetting people that run more than 1 concurrent parallelisable query on their server.

David

[1] https://www.postgresql.org/message-id/a5916f83-de79-4a40-933a-fb0d9ba2f5a0@app.fastmail.com
Attachments
On Thursday, July 17, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
> I suggest to Greg that he might want to come up with a method to make
> this configurable and a means to get something close to what we get
> today and default the setting to that. It would also be easier to
> follow any proposed algorithms with some example output. I used the
> attached .c file to give me that. There's quite a jump in workers with
> the proposed algorithm, e.g.:
>
> Table Size = 1024 MB old workers = 5, new workers = 12
> Table Size = 1048576 MB old workers = 11, new workers = 363
>
> So I kinda doubt we could get away with such a radical change without
> upsetting people that run more than 1 concurrent parallelisable query
> on their server.
Have the planner produce two numbers.
1: This plan needs a minimum of N workers to make the parallelism worthwhile. Assume that is what we produce today.
2: A number representing how much this plan would benefit from the availability of additional workers beyond the minimum. This could be the new proposed computation.
The executor, seeing it has 60 workers to spare right now, is willing to allocate some percentage of them to an executing plan weighted by the value of the second number.
I’d much rather avoid configuration and instead improve the executor to minimize idle parallel workers.
Another approach would be a superuser (grantable) parameter that overrides planning-time heuristics and just says “use this many workers where possible”. This seems like a useful knob even if we do better with dynamic resource allocation heuristics.
David J.
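A rough sketch of how the executor-side half of that idea might look; the struct, function, and weighting rule below are all hypothetical and only illustrate the shape of the proposal, not any existing PostgreSQL API:

#include <stdio.h>

/* Hypothetical per-plan numbers the planner would produce (see above). */
typedef struct ParallelHint
{
	int			min_workers;	/* minimum for parallelism to be worthwhile */
	double		extra_benefit;	/* 0.0 - 1.0: how much extra workers help */
} ParallelHint;

/*
 * Hypothetical executor-side decision: cover the minimum, then hand out a
 * share of the currently idle workers in proportion to the plan's stated
 * benefit from extra workers.
 */
static int
workers_to_launch(const ParallelHint *hint, int idle_workers)
{
	int			extra;

	if (idle_workers <= hint->min_workers)
		return idle_workers;	/* can't even cover the minimum */

	extra = (int) ((idle_workers - hint->min_workers) * hint->extra_benefit);
	return hint->min_workers + extra;
}

int
main(void)
{
	ParallelHint seqscan_heavy = {4, 0.9};	/* scales almost linearly */
	ParallelHint small_join = {2, 0.2};		/* little benefit past the minimum */

	/* 60 idle workers to spare, as in the example above */
	printf("%d\n", workers_to_launch(&seqscan_heavy, 60));	/* prints 54 */
	printf("%d\n", workers_to_launch(&small_join, 60));		/* prints 13 */
	return 0;
}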
Greg Hennessy <greg.hennessy@gmail.com> writes:

Hi,

>> On Thu, 17 Jul 2025 at 12:44, Greg Hennessy <greg.hennessy@gmail.com> wrote:
>>> workers, but there isn't an easy way to get more
>>> workers.

> On 7/16/25 11:01 PM, David Rowley wrote:
>> Is "alter table ... set (parallel_workers=N);" not easy enough?
>
> It may be easy enough for one table, but that won't work for joins as
> far as I can tell. I'd like to have more CPUs available in more cases.

It is supposed to work in the join case. See:

max_parallel_workers=8
max_parallel_workers_per_gather=4

create table bigt(a int, b int, c int);
insert into bigt select i, i, i from generate_series(1, 1000000)i;
analyze bigt;

explain (costs off) select * from bigt t1 join bigt t2 using(b) where t1.a < 10;
                   QUERY PLAN
------------------------------------------------
 Gather
   Workers Planned: 2
   ->  Parallel Hash Join
         Hash Cond: (t2.b = t1.b)
         ->  Parallel Seq Scan on bigt t2
         ->  Parallel Hash
               ->  Parallel Seq Scan on bigt t1
                     Filter: (a < 10)
(8 rows)

alter table bigt set (parallel_workers=4);

explain (costs off) select * from bigt t1 join bigt t2 using(b) where t1.a < 10;
                   QUERY PLAN
------------------------------------------------
 Gather
   Workers Planned: 4
   ->  Parallel Hash Join
         Hash Cond: (t2.b = t1.b)
         ->  Parallel Seq Scan on bigt t2
         ->  Parallel Hash
               ->  Parallel Seq Scan on bigt t1
                     Filter: (a < 10)
(8 rows)

However, it is possible that when a query becomes complex, some characteristics can prevent parallelism from being used at all, e.g. SELECT parallel_unsafe_udf(a) FROM t; or a correlated subquery in your query, like [1].

[1] https://www.postgresql.org/message-id/871pqzm5wj.fsf%40163.com

--
Best Regards
Andy Fan
On Fri, 18 Jul 2025 at 13:04, David G. Johnston <david.g.johnston@gmail.com> wrote:
> Have the planner produce two numbers.
>
> 1: This plan needs a minimum of N workers to make the parallelism worthwhile. Assume that is what we produce today.
> 2: A number representing how much this plan would benefit from the availability of additional workers beyond the minimum. This could be the new proposed computation.
>
> The executor, seeing it has 60 workers to spare right now, is willing to allocate some percentage of them to an executing plan weighted by the value of the second number.

Having the ability to do something smarter at executor startup based on the current server load sounds like something to aspire towards. That might mean having to generate a serial version of each parallel plan, and that does mean running the join search at least twice for all plans that have a parallel portion. One problem with doing that is that plans can have multiple Gather/GatherMerge nodes, and if one of those needed some low number of workers and another one needed some large number of workers, then if you ended up in a scenario where during execution only a low number of workers were available, the best plan might be something in between the serial and parallel plan (i.e. just the Gather with the low worker count). It seems fairly prohibitive to perform the join search for every combination of with/without Gather. It is the most expensive part of planning for any moderately complex join problem.

> I’d much rather avoid configuration and instead improve the executor to minimize idle parallel workers.

That would be nice. Really, what we have today with Gather/GatherMerge spinning up workers is down to the fact that we have a volcano executor. If we did push-based execution and had threads rather than processes, or at least if we disconnected backends and sessions, then you could have the planner post to-be-executed plans to a notice board, then have a bunch of workers (one per CPU core) come along and execute those. For push-based execution, any plan node which is a scan or has >0 tuples in the input queue can be executed. Each worker could yield periodically to check for more important work. An implementation of that is far, far away from a weekend project.

> Another approach would be a superuser (grantable) parameter that overrides planning-time heuristics and just says “use this many workers where possible”. This seems like a useful knob even if we do better with dynamic resource allocation heuristics.

I don't imagine that would be useful to many people, as there would be nothing to exclude medium-sized tables from using a possibly silly number of workers. I think the "alter table ... set (parallel_workers=N);" that we have today is useful. In most databases, small tables tend to outnumber huge tables, so it doesn't really seem that prohibitive to do the ALTER TABLE on just the handful of huge tables. I find it hard to imagine that there'd be many people around that would benefit from a global "always_use_this_number_of_parallel_workers" GUC.

David
On Thursday, July 17, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
> I find it hard to imagine that there'd be many people
> around that would benefit from a global
> "always_use_this_number_of_parallel_workers" GUC.
Just to clarify, the global value would be -1 usually. It would be another crappy planner hint for important queries to be assigned saying “make this query go fast even if it doesn’t play nice with others”.
David J.
On Thursday, July 17, 2025, David G. Johnston <david.g.johnston@gmail.com> wrote:
> On Thursday, July 17, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
>> I find it hard to imagine that there'd be many people
>> around that would benefit from a global
>> "always_use_this_number_of_parallel_workers" GUC.
>
> Just to clarify, the global value would be -1 usually. It would be another crappy planner hint for important queries to be assigned saying “make this query go fast even if it doesn’t play nice with others”.
Framing this differently, how about a patch that lets extension authors implement alternative formulas, or even provide GUC-driven constants, at the existing spot in the planner instead of having to choose a single best algorithm? IOW, what would it take to make the proposed patch an extension that a DBA could choose to install to override the current log3 algorithm?
David J.
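No such hook exists in core today, so the following standalone sketch only illustrates the general shape of an extension-overridable calculation; the hook type, variable, and function names are all invented for illustration, and the real compute_parallel_worker() takes different arguments:

#include <math.h>
#include <stdio.h>

/*
 * Core would keep its current log3 logic as the default and allow an
 * extension to install a replacement through a function pointer.
 */
typedef int (*compute_parallel_worker_hook_type) (double pages, int max_workers);

static compute_parallel_worker_hook_type compute_parallel_worker_hook = NULL;

static int
builtin_log3_workers(double pages, int max_workers)
{
	double		threshold = 1024.0; /* min_parallel_table_scan_size, in pages */
	int			workers = 1;

	while (pages >= threshold * 3.0)
	{
		workers++;
		threshold *= 3.0;
	}
	return workers < max_workers ? workers : max_workers;
}

static int
compute_parallel_worker(double pages, int max_workers)
{
	if (compute_parallel_worker_hook)
		return compute_parallel_worker_hook(pages, max_workers);
	return builtin_log3_workers(pages, max_workers);
}

/* what a hypothetical extension's _PG_init() would install */
static int
sqrt_workers(double pages, int max_workers)
{
	int			workers = (int) ceil(sqrt(pages / 1024.0));

	return workers < max_workers ? workers : max_workers;
}

int
main(void)
{
	double		pages = 131072.0;	/* a 1 GB table */

	printf("default: %d\n", compute_parallel_worker(pages, 24));	/* 5 */
	compute_parallel_worker_hook = sqrt_workers;
	printf("hooked:  %d\n", compute_parallel_worker(pages, 24));	/* 12 */
	return 0;
}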
On Fri, Jul 18, 2025 at 12:23 AM David G. Johnston <david.g.johnston@gmail.com> wrote:
Framing this differently, how about a patch that lets extension authors choose to implement alternative formulas or even provide GUC-driven constants into the planner at the existing spot instead of having to choose a best algorithm. IOW, what would it take to make the proposed patch an extension that a DBA could choose to install and override the current log3 algorithm?
I've added code to create this new GUC, with the default behavior being the old behavior. I don't think I know enough about PostgreSQL internals to write an extension.
As an example showing that this works, here is a query where the default algorithm assigns 5 workers and the new one assigns 12:
CREATE TABLE Departments (code VARCHAR(5), UNIQUE (code));
CREATE TABLE Towns (
id SERIAL UNIQUE NOT NULL,
code VARCHAR(10) NOT NULL, -- not unique
article TEXT,
name TEXT NOT NULL, -- not unique
department VARCHAR(5) NOT NULL,
UNIQUE (code, department)
);
insert into towns (
code, article, name, department
)
select
left(md5(i::text), 10),
md5(random()::text),
md5(random()::text),
left(md5(random()::text), 5)
from generate_series(1, 10000000) s(i);
insert into departments (
code
)
select
left(md5(i::text), 5)
from generate_series(1, 1000) s(i);
analyze departments;
analyze towns;
postgres@fedora:~$ /usr/local/pgsql/bin/psql test
psql (19devel)
Type "help" for help.
test=# show parallel_worker_algorithm ;
parallel_worker_algorithm
---------------------------
log3
(1 row)
test=# show max_parallel_workers ;
max_parallel_workers
----------------------
24
(1 row)
test=# explain (costs off) select count(*) from departments, towns where towns.department = departments.code;
QUERY PLAN
--------------------------------------------------------------------------------------
Finalize Aggregate
-> Gather
Workers Planned: 5
-> Partial Aggregate
-> Hash Join
Hash Cond: ((towns.department)::text = (departments.code)::text)
-> Parallel Seq Scan on towns
-> Hash
-> Seq Scan on departments
(9 rows)
test=# set parallel_worker_algorithm = sqrt;
SET
test=# explain (costs off) select count(*) from departments, towns where towns.department = departments.code;
QUERY PLAN
--------------------------------------------------------------------------------------
Finalize Aggregate
-> Gather
Workers Planned: 12
-> Partial Aggregate
-> Hash Join
Hash Cond: ((towns.department)::text = (departments.code)::text)
-> Parallel Seq Scan on towns
-> Hash
-> Seq Scan on departments
(9 rows)
test=#
Attachments
On Mon, 21 Jul 2025 at 06:27, Greg Hennessy <greg.hennessy@gmail.com> wrote:
> test=# show parallel_worker_algorithm ;
> parallel_worker_algorithm
> ---------------------------
> log3
> (1 row)

I don't think having a GUC which allows exactly two settings is anywhere near as flexible as you could make this. You're complaining that the current calculation isn't ideal for you and are proposing a new one which seemingly is more to your liking, but who's to say that someone else would find either useful? With this method, if someone else comes along complaining, do we give them what they want by adding a new enum? That does not sound great :(

One thought I had was that you could adjust the "divide by 3" to be "divide by <some_new_GUC>" and make that calculation floating point. People who want really aggressive scaling could set it to something a little over 1.0. The problem with that is that numbers really close to 1.0 would cause the loop to iterate an excessive number of times, which would be bad for performance.

On rethink, maybe you can use the log() function to get something close to what's being generated today. The fact that the heap_parallel_threshold is taken into account makes that a little tricky, as it's not a simple log3. I experimented with the attached .c file. When I set the GUC to 3.0, it's pretty close to what we get today. If I drop it to 1.01, I get something closer to your version.

Do you want to play around with the code in the attached .c file and add a GUC like the parallel_worker_scaling_logarithm I have in the .c file? Feel free to tweak the algorithm. I won't pretend it's the best it can be; I didn't spend much time on it. There might be a better way to get numbers closer to what we have today that takes into account the (heap|index)_parallel_threshold and doesn't need a while loop.

David
Attachments
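The attached .c file is not reproduced here, but a guess at the style of calculation being described might look like the following; the GUC name parallel_worker_scaling_logarithm comes from David's message, while the exact formula is an assumption and may differ from the attachment:

#include <math.h>
#include <stdio.h>

/* The existing loop, but with the hard-coded 3 replaced by a float GUC. */
static double parallel_worker_scaling_logarithm = 3.0;

static int
workers_loop(double pages, double threshold)
{
	double		base = parallel_worker_scaling_logarithm;
	int			workers = 1;

	while (pages >= threshold * base)
	{
		workers++;
		threshold *= base;
	}
	return workers;
}

/*
 * The same result without a loop, using log(); this avoids excessive
 * iteration when the base is set very close to 1.0.
 */
static int
workers_log(double pages, double threshold)
{
	double		base = parallel_worker_scaling_logarithm;

	if (pages < threshold)
		return 1;
	return 1 + (int) floor(log(pages / threshold) / log(base));
}

int
main(void)
{
	double		threshold = 1024.0; /* 8 MB in 8 kB pages */
	double		pages = 131072.0;	/* 1 GB table */

	printf("base 3.0:  loop = %d, log = %d\n",
		   workers_loop(pages, threshold), workers_log(pages, threshold));

	parallel_worker_scaling_logarithm = 1.5;
	printf("base 1.5:  loop = %d, log = %d\n",
		   workers_loop(pages, threshold), workers_log(pages, threshold));
	return 0;
}

With the base at 3.0 both functions return 5 workers for a 1 GB table, matching today's behavior; lowering the base makes the scaling more aggressive.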
On Tue, Jul 22, 2025 at 8:49 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I don't think having a GUC which allows exactly two settings is
> anywhere near as flexible as you could make this.

I would phrase it as a GUC with two settings currently; I think it is easier to go from two possible algorithms to three than it is to go from one to two.

> You're complaining
> that the current calculation isn't ideal for you and are proposing a
> new one which seemingly is more to your liking, but who's to say that
> someone else would find either useful? With this method, if someone
> else comes along complaining, do we give them what they want by adding
> a new enum? That does not sound great :(

Adding a new enum to solve a perceived problem doesn't seem like a large ask to me. If you think this situation isn't great, may I ask what you would consider great?
> One thought I had was that you could adjust the "divide by 3" to be
> "divide by <some_new_GUC>" and make that calculation floating point.

I would think that changing the hard-coded log3 to a more general log(x), where x is controlled by a GUC, would be better handled as a separate patch, but I have no objection to submitting a new patch that includes that.

> On rethink, maybe you can use the log() function to get something
> close to what's being generated today.

I think the ceil function needs to be in there to exactly match the existing behavior, but I haven't verified that. I do agree that not changing the default behavior unless an admin sets something is a good idea.

Can other people advise on whether adding a new algorithm to calculate the parallel worker count, and changing the scaling from log3 of a ratio to a logarithm with a GUC-controlled base, is best handled in one patch or two?

It's clear that if we make the logarithm base an adjustable parameter we have to ensure it is not 1.0 or less, but how much larger than 1.0 can we allow? My memory says that the smallest floating-point number larger than unity is 1 + 2**(-23). I guess we can make the minimum allowed number another GUC. :)
On Thu, 24 Jul 2025 at 08:52, Greg Hennessy <greg.hennessy@gmail.com> wrote:
> Adding a new enum to solve a perceived problem doesn't seem like a large
> ask to me.

Seems overly slow to me: someone has to write the patch, a committer has to commit said patch, and then the user must wait for the next major version to be released before upgrading their database to it. That's 6 to 15 months at best.

> Can other people advise on whether adding a new algorithm to calculate
> the parallel worker count, and changing the scaling from log3 of a ratio
> to a logarithm with a GUC-controlled base, is best handled in one patch
> or two?

Others might be able to chime in if you gave a summary of what those patches would be. I assume you want an enum GUC for the algorithm and some scale GUC?

> It's clear that if we make the logarithm base an adjustable parameter we
> have to ensure it is not 1.0 or less, but how much larger than 1.0 can
> we allow? My memory says that the smallest floating-point number larger
> than unity is 1 + 2**(-23). I guess we can make the minimum allowed
> number another GUC. :)

The range could be 1.0 to maybe 100.0 or 1000.0. If the number is too close to 1.0 then just have compute_parallel_worker() return max_workers. You could document 1.0 to mean "limit to max_parallel_maintenance_workers / max_parallel_workers_per_gather". The function won't return a value higher than that anyway.

David
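In code, the guard being suggested might look something like this; a sketch only, with an arbitrary cutoff value, not anything from the thread's attachments:

#include <math.h>

/*
 * Sketch of the suggested guard: a base at (or extremely close to) 1.0
 * simply means "use as many workers as max_parallel_workers_per_gather /
 * max_parallel_maintenance_workers allow".  The 1e-6 cutoff is arbitrary.
 */
int
scaled_parallel_workers(double pages, double threshold, double base,
						int max_workers)
{
	int			workers;

	if (base <= 1.0 + 1e-6)
		return max_workers;
	if (pages < threshold)
		return 1;
	workers = 1 + (int) floor(log(pages / threshold) / log(base));
	return workers < max_workers ? workers : max_workers;
}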