Thread: simple patch for discussion

simple patch for discussion

From: Greg Hennessy
Date:
This is my first attempt at a patch to PostgreSQL, so please forgive me if
I have forgotten some step.

I recently got a new system with many more CPU cores than I am used to:
128 cores (which may not seem like a large number to some). I was a bit
unhappy that even though I configured max_worker_processes and
max_parallel_workers to values above 90, my queries were only allocating
10 workers, and by fiddling with min_parallel_table_scan_size and
min_parallel_index_scan_size I could only get that up to 14 or so. The
algorithm in allpaths.c seems to base the number of workers on the log
base 3 of the ratio of the relation size to the table/index scan
threshold. I suggest replacing this with the square root of that ratio,
which grows faster than log3. My thought is that there are ways to
configure PostgreSQL so it won't allocate an excessive number of
workers, but there isn't an easy way to get more
workers.
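
To make the comparison concrete, here is a rough sketch of the two
formulas (not the patch itself, just an illustration; it assumes the
default min_parallel_table_scan_size of 8MB, ignores the index side, and
ignores the clamp to max_parallel_workers_per_gather):

#include <math.h>
#include <stdio.h>

/* Roughly the current behavior: one extra worker for each factor of 3
 * by which the relation exceeds min_parallel_table_scan_size. */
static int
workers_log3(double rel_pages, double threshold_pages)
{
    int         workers = 1;

    while (rel_pages >= threshold_pages * 3)
    {
        workers++;
        threshold_pages *= 3;
    }
    return workers;
}

/* The proposed behavior: scale with the square root of the same ratio,
 * which grows much faster than log3. */
static int
workers_sqrt(double rel_pages, double threshold_pages)
{
    return (int) ceil(sqrt(rel_pages / threshold_pages));
}

int
main(void)
{
    double      threshold = 1024.0;     /* 8MB expressed in 8kB pages */
    double      sizes_mb[] = {64, 1024, 65536, 1048576};

    for (int i = 0; i < 4; i++)
    {
        double      pages = sizes_mb[i] * 128.0;    /* 8kB pages per MB */

        printf("Table Size = %.0f MB: log3 workers = %d, sqrt workers = %d\n",
               sizes_mb[i], workers_log3(pages, threshold),
               workers_sqrt(pages, threshold));
    }
    return 0;
}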

I enclose the patch for consideration. My understanding is that a lot of
work is going into getting V19 ready; if people would prefer to wait for
that before considering my patch, I understand.

V/r
Greg

Attachments

Re: simple patch for discussion

From: David Rowley
Date:
On Thu, 17 Jul 2025 at 12:44, Greg Hennessy <greg.hennessy@gmail.com> wrote:
> workers, but there isn't an easy way to get more
> workers.

Is "alter table ... set (parallel_workers=N);" not easy enough?

David



Re: simple patch for discussion

From: Greg Hennessy
Date:
On 7/16/25 11:01 PM, David Rowley wrote:
> On Thu, 17 Jul 2025 at 12:44, Greg Hennessy <greg.hennessy@gmail.com> wrote:
>> workers, but there isn't an easy way to get more
>> workers.
>
> Is "alter table ... set (parallel_workers=N);" not easy enough?

It may be easy enough for one table, but that won't work for joins as
far as I can tell. I'd like to have more CPUs available in more cases.






Re: simple patch for discussion

From: Maciek Sakrejda
Date:
 On 7/16/25 11:01 PM, David Rowley wrote:
> Is "alter table ... set (parallel_workers=N);" not easy enough?

No opinions on the merit of the patch, but it's not as easy as better
default behavior, right? That is, the important questions are whether
the proposed behavior is better, and whether the change in default
behavior is likely to cause any problems. If there's uncertainty about
those, the options for a workaround are, of course, relevant.



Re: simple patch for discussion

From: Andres Freund
Date:
Hi,

On 2025-07-17 15:01:55 +1200, David Rowley wrote:
> On Thu, 17 Jul 2025 at 12:44, Greg Hennessy <greg.hennessy@gmail.com> wrote:
> > workers, but there isn't an easy way to get more
> > workers.
> 
> Is "alter table ... set (parallel_workers=N);" not easy enough?

I don't think that's a great approach, as that basically means the user has to
do all the computation for how many workers are a good idea
themselves. Manually setting it obviously doesn't deal with future growth etc.

Right now we basically assume that the benefit of parallelism reduces
substantially with every additional parallel worker, but for things like
seqscans that's really not true.  I've seen reasonably-close-to-linear
scalability for parallel seqscans up to 48 workers (the CPUs in the system I
tested on).  Given that, our degree-of-parallelism logic doesn't really make
sense.

Greetings,

Andres Freund



Re: simple patch for discussion

From: David Rowley
Date:
On Fri, 18 Jul 2025 at 05:03, Andres Freund <andres@anarazel.de> wrote:
> Right now we basically assume that the benefit of parallelism reduces
> substantially with every additional parallel worker, but for things like
> seqscans that's really not true.  I've seen reasonably-close-to-linear
> scalability for parallel seqscans up to 48 workers (the CPUs in the system I
> tested on).  Given that, our degree-of-parallelism logic doesn't really make
> sense.

What you're saying is true, but the problem with doing what's proposed
is that giving so many more workers to one query just increases the
chances that some other plan being executed gets no workers because
they're all in use.  That matters because the disadvantage of giving a
parallel plan zero workers is far worse than the advantage you get from
giving it additional workers.  The reason is that the planner doesn't
parallelise the cheapest serial plan.  It picks the cheapest plan based
on the assumption that whatever number of workers
compute_parallel_worker() calculates will be available for use during
execution, and the larger the number that function returns, the greater
the chance of getting a parallel plan, due to how the planner divides
the Path costs by the calculated number of parallel workers.
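
To put that last point in very rough code (an oversimplification of what
costsize.c actually does, just to show the direction of the effect; the
setup_cost parameter stands in for the real parallel_setup_cost /
parallel_tuple_cost terms and the leader-participation handling):

/*
 * The planner charges a parallel path roughly the serial cost divided by
 * the number of workers compute_parallel_worker() says it will get, plus
 * some fixed overhead for launching workers.  A larger worker estimate
 * therefore makes the parallel path look cheaper, whether or not those
 * workers are actually free at execution time.
 */
double
rough_parallel_cost(double serial_cost, int planned_workers, double setup_cost)
{
    return serial_cost / planned_workers + setup_cost;
}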

I could imagine that there might be room to add more configuration to
how compute_parallel_worker() calculates the return value. I don't
think there's any room to swap it out with something as aggressive as
what's being proposed without any means for users to have something
slightly more conservative like what's there today. There is already a
complaint in [1] that states we should be trying to reduce the number
of concurrent backends in order to reduce context switching. What's
being proposed here just makes that problem worse.

I suggest to Greg that he might want to come up with a method to make
this configurable and a means to get something close to what we get
today and default the setting to that. It would also be easier to
follow any proposed algorithms with some example output. I used the
attached .c file to give me that. There's quite a jump in workers with
the proposed algorithm, e.g.:

Table Size = 1024 MB old workers = 5, new workers = 12
Table Size = 1048576 MB old workers = 11, new workers = 363

So I kinda doubt we could get away with such a radical change without
upsetting people that run more than 1 concurrent parallelisable query
on their server.

David

[1] https://www.postgresql.org/message-id/a5916f83-de79-4a40-933a-fb0d9ba2f5a0@app.fastmail.com

Attachments

simple patch for discussion

From: "David G. Johnston"
Date:
On Thursday, July 17, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
> I suggest to Greg that he might want to come up with a method to make
> this configurable and a means to get something close to what we get
> today and default the setting to that. It would also be easier to
> follow any proposed algorithms with some example output. I used the
> attached .c file to give me that. There's quite a jump in workers with
> the proposed algorithm, e.g.:
>
> Table Size = 1024 MB old workers = 5, new workers = 12
> Table Size = 1048576 MB old workers = 11, new workers = 363
>
> So I kinda doubt we could get away with such a radical change without
> upsetting people that run more than 1 concurrent parallelisable query
> on their server.

Have the planner produce two numbers.

1: This plan needs a minimum of N workers to make the parallelism worthwhile.  Assume that is what we produce today.
2:  A number representing how much this plan would benefit from the availability of additional workers beyond the minimum.  This could be the new proposed computation.

The executor, seeing it has 60 workers to spare right now, is willing to allocate some percentage of them to an executing plan weighted by the value of the second number.

I’d much rather avoid configuration and instead improve the executor to minimize idle parallel workers.

Another approach would be a superuser (grantable) parameter that overrides planning-time heuristics and just says “use this many workers where possible”.  This seems like a useful knob even if we do better with dynamic resource allocation heuristics.

David J.

Re: simple patch for discussion

From: Andy Fan
Date:
Greg Hennessy <greg.hennessy@gmail.com> writes:

Hi,

> On 7/16/25 11:01 PM, David Rowley wrote:
>> On Thu, 17 Jul 2025 at 12:44, Greg Hennessy <greg.hennessy@gmail.com> wrote:
>>> workers, but there isn't an easy way to get more
>>> workers.
>>
>> Is "alter table ... set (parallel_workers=N);" not easy enough?
>
> It may be easy enough for one table, but that won't work for joins as
> far as I can tell. I'd like to have more CPUs available in more cases.

It is supposed to work for join cases. See:

max_parallel_workers=8
max_parallel_workers_per_gather=4;

create table bigt(a int, b int, c int);
insert into bigt select i, i, i from generate_series(1, 1000000)i;
analyze bigt;

explain (costs off) select * from bigt t1 join bigt t2 using(b) where t1.a < 10;
                   QUERY PLAN                   
------------------------------------------------
 Gather
   Workers Planned: 2
   ->  Parallel Hash Join
         Hash Cond: (t2.b = t1.b)
         ->  Parallel Seq Scan on bigt t2
         ->  Parallel Hash
               ->  Parallel Seq Scan on bigt t1
                     Filter: (a < 10)
(8 rows)

alter table bigt set (parallel_workers=4);

explain (costs off) select * from bigt t1 join bigt t2 using(b) where t1.a < 10;
                   QUERY PLAN                   
------------------------------------------------
 Gather
   Workers Planned: 4
   ->  Parallel Hash Join
         Hash Cond: (t2.b = t1.b)
         ->  Parallel Seq Scan on bigt t2
         ->  Parallel Hash
               ->  Parallel Seq Scan on bigt t1
                     Filter: (a < 10)
(8 rows)

However, it is possible that when a query becomes complex, some of its
characteristics can prevent parallelism from working at all,

e.g.  SELECT parallel_unsafe_udf(a) FROM t;
or some correlated subquery in your queries, like [1].


[1] https://www.postgresql.org/message-id/871pqzm5wj.fsf%40163.com 

-- 
Best Regards
Andy Fan




Re: simple patch for discussion

From: David Rowley
Date:
On Fri, 18 Jul 2025 at 13:04, David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Have the planner produce two numbers.
>
> 1: This plan needs a minimum of N workers to make the parallelism worthwhile.  Assume that is what we produce today.
> 2:  A number representing how much this plan would benefit from the availability of additional workers beyond the minimum.  This could be the new proposed computation.
>
> The executor, seeing it has 60 workers to spare right now, is willing to allocate some percentage of them to an executing plan weighted by the value of the second number.

Having the ability to do something smarter at executor startup based
on the current server load sounds like something to aspire towards.
That might mean having to generate a serial version of each parallel
plan, and that does mean running the join search at least twice for
all plans that have a parallel portion. One problem with doing that is
that plans can have multiple Gather/GatherMerge nodes, and if one of
those needed a low number of workers and another needed a large
number, and only a low number of workers turned out to be available
during execution, then the best plan might be something in between the
serial and parallel plans (i.e. just the Gather with the low worker
count). It seems fairly prohibitive to perform the join search for
every combination of with/without Gather. It is the most expensive
part of planning for any moderately complex join problem.

> I’d much rather avoid configuration and instead improve the executor to minimize idle parallel workers.

That would be nice. Really what we have today with Gather/GatherMerge
spinning up workers is down to the fact that we have a volcano
executor. If we did push-based execution and had threads rather than
processes, or at least if we disconnected backends and sessions, then
you could have the planner post to-be-executed plans to a notice
board, then have a bunch of workers (one per CPU core) come along and
execute those.  For push-based execution, any plan node which is a
scan or has >0 tuples in the input queue can be executed. Each worker
could yield periodically to check for more important work. An
implementation of that is far far away from a weekend project.

> Another approach would be a superuser (grantable) parameter that overrides planning-time heuristics and just says “use this many workers where possible”.  This seems like a useful knob even if we do better with dynamic resource allocation heuristics.

I don't imagine that would be useful to many people as there would be
nothing to exclude medium-sized tables from using a possibly silly
number of workers. I think the "alter table ... set
(parallel_workers=N);" that we have today is useful. In most
databases, small tables tend to outnumber huge tables, so it doesn't
really seem that prohibitive to do the ALTER TABLE on just the handful
of huge tables. I find it hard to imagine that there'd be many people
around that would benefit from a global
"always_use_this_number_of_parallel_workers" GUC.

David



Re: simple patch for discussion

From: "David G. Johnston"
Date:
On Thursday, July 17, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
> I find it hard to imagine that there'd be many people
> around that would benefit from a global
> "always_use_this_number_of_parallel_workers" GUC.

Just to clarify, the global value would usually be -1.  It would be another crappy planner hint to assign to important queries, saying “make this query go fast even if it doesn’t play nice with others”.

David J.

Re: simple patch for discussion

From: "David G. Johnston"
Date:
On Thursday, July 17, 2025, David G. Johnston <david.g.johnston@gmail.com> wrote:
> On Thursday, July 17, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
>> I find it hard to imagine that there'd be many people
>> around that would benefit from a global
>> "always_use_this_number_of_parallel_workers" GUC.
>
> Just to clarify, the global value would usually be -1.  It would be another crappy planner hint to assign to important queries, saying “make this query go fast even if it doesn’t play nice with others”.

Framing this differently, how about a patch that lets extension authors choose to implement alternative formulas or even provide GUC-driven constants into the planner at the existing spot instead of having to choose a best algorithm.  IOW, what would it take to make the proposed patch an extension that a DBA could choose to install and override the current log3 algorithm?

David J.
 

Re: simple patch for discussion

From: Greg Hennessy
Date:
On Fri, Jul 18, 2025 at 12:23 AM David G. Johnston <david.g.johnston@gmail.com> wrote:
> Framing this differently, how about a patch that lets extension authors choose to implement alternative formulas or even provide GUC-driven constants into the planner at the existing spot instead of having to choose a best algorithm.  IOW, what would it take to make the proposed patch an extension that a DBA could choose to install and override the current log3 algorithm?

I've added code to create a new GUC, with the default behavior being the old behavior. I don't think I know enough PostgreSQL to code an extension.

As an example showing this works: where the default algorithm assigns 5 workers, the new one assigns 12:
CREATE TABLE Departments (code VARCHAR(5), UNIQUE (code));
CREATE TABLE Towns (
  id SERIAL UNIQUE NOT NULL,
  code VARCHAR(10) NOT NULL, -- not unique
  article TEXT,
  name TEXT NOT NULL, -- not unique
  department VARCHAR(5) NOT NULL,
  UNIQUE (code, department)
);

insert into towns (
    code, article, name, department
)
select
    left(md5(i::text), 10),
    md5(random()::text),
    md5(random()::text),
    left(md5(random()::text), 5)
from generate_series(1, 10000000) s(i);

insert into departments (
       code
)
select
left(md5(i::text), 5)
from generate_series(1, 1000) s(i);

analyze departments;
analyze towns;

postgres@fedora:~$ /usr/local/pgsql/bin/psql test
psql (19devel)
Type "help" for help.

test=# show parallel_worker_algorithm ;
 parallel_worker_algorithm
---------------------------
 log3
(1 row)

test=# show max_parallel_workers ;
 max_parallel_workers
----------------------
 24
(1 row)

test=# explain (costs off) select count(*) from departments, towns where towns.department = departments.code;
                                      QUERY PLAN                                      
--------------------------------------------------------------------------------------
 Finalize Aggregate
   ->  Gather
         Workers Planned: 5
         ->  Partial Aggregate
               ->  Hash Join
                     Hash Cond: ((towns.department)::text = (departments.code)::text)
                     ->  Parallel Seq Scan on towns
                     ->  Hash
                           ->  Seq Scan on departments
(9 rows)

test=# set parallel_worker_algorithm = sqrt;
SET
test=# explain (costs off) select count(*) from departments, towns where towns.department = departments.code;
                                      QUERY PLAN                                      
--------------------------------------------------------------------------------------
 Finalize Aggregate
   ->  Gather
         Workers Planned: 12
         ->  Partial Aggregate
               ->  Hash Join
                     Hash Cond: ((towns.department)::text = (departments.code)::text)
                     ->  Parallel Seq Scan on towns
                     ->  Hash
                           ->  Seq Scan on departments
(9 rows)
test=#


 
Attachments

Re: simple patch for discussion

From: David Rowley
Date:
On Mon, 21 Jul 2025 at 06:27, Greg Hennessy <greg.hennessy@gmail.com> wrote:
> test=# show parallel_worker_algorithm ;
>  parallel_worker_algorithm
> ---------------------------
>  log3
> (1 row)

I don't think having a GUC which allows exactly two settings is
anywhere near as flexible as you could make this. You're complaining
that the current calculation isn't ideal for you and are proposing a
new one which seemingly is more to your liking, but who's to say that
someone else would find either useful? With this method, if someone
else comes along complaining, do we give them what they want by adding
a new enum? That does not sound great :(

One thought I had was that you could adjust the "divide by 3" to be
"divide by <some_new_GUC>" and make that calculation floating point.
People who want really aggressive scaling could set it to
something a little over 1.0. The problem with that is that numbers really
close to 1.0 would cause that loop to iterate an excessive number of
times, and that would be bad for performance.

On rethink, maybe you can use the log() function to get something
close to what's being generated today.  The fact that the
heap_parallel_threshold is taken into account makes that a little
tricky as it's not a simple log3. I experimented with the attached .c
file. When I set the GUC to 3.0, it's pretty close to what we get
today. If I drop it to 1.01, I get something closer to your version.
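
A closed-form version of that could look roughly like this (my sketch
here, not the attached file; it ignores the index side and the clamp
against max_parallel_workers_per_gather):

#include <math.h>

/*
 * Rough sketch of a log-base-GUC replacement for the current while loop.
 * With a base of 3.0 this matches the current "one more worker per factor
 * of 3 over the threshold" behavior, modulo rounding at exact boundaries;
 * bases close to 1.0 scale up far more aggressively.
 */
int
workers_from_log(double rel_pages, double threshold_pages,
                 double scaling_logarithm)
{
    if (rel_pages < threshold_pages * scaling_logarithm)
        return 1;

    return (int) floor(log(rel_pages / threshold_pages) /
                       log(scaling_logarithm)) + 1;
}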

Do you want to play around with the code in the attached .c file and
add a GUC like the parallel_worker_scaling_logarithm I have in the .c
file? Feel free to tweak the algorithm. I won't pretend it's the best
it can be. I didn't spend much time on it. There might be a better way
to get numbers closer to what we have today that takes into account
the (heap|index)_parallel_threshold and doesn't need a while loop.

David

Attachments

Re: simple patch for discussion

From: Greg Hennessy
Date:


On Tue, Jul 22, 2025 at 8:49 PM David Rowley <dgrowleyml@gmail.com> wrote:
> I don't think having a GUC which allows exactly two settings is
> anywhere near as flexible as you could make this.

I would phrase it as being a GUC with two settings currently. I think it
is easier to go from 2 possible algorithms to 3 than it is to go from 1 to 2.

> You're complaining
> that the current calculation isn't ideal for you and are proposing a
> new one which seemingly is more to your liking, but who's to say that
> someone else would find either useful? With this method, if someone
> else comes along complaining, do we give them what they want by adding
> a new enum? That does not sound great :(

Adding a new enum to solve a perceived problem doesn't seem like a large
ask to me. If you think this situation isn't great, may I ask what you
would consider great?

> One thought I had was that you could adjust the "divide by 3" to be
> "divide by <some_new_GUC>" and make that calculation floating point.

I would think that changing the hard-coded log3 to a more general log(x),
where x is controlled by a GUC, would be better handled by a separate
patch, but I have no objection to submitting a new patch that includes that.

> On rethink, maybe you can use the log() function to get something
> close to what's being generated today.

I think the ceil function needs to be in there to exactly match the existing
behavior, but I haven't verified that. I do agree that having the default
behavior not change unless set by an admin is a good idea.

Can other people give advice on whether adding a new algorithm to
calculate the parallel worker count, and changing the scaling from log3
of a ratio to a log whose base is a GUC, is best handled by one patch or two?

It's clear that if we make the logarithm base an adjustable parameter we
have to ensure it is not 1.0 or less, but how much larger than 1.0 can we
allow? My memory says that the smallest floating point number larger than
unity is 1 + 2**(-23). I guess we can make the minimum allowed number
another GUC. :)


Re: simple patch for discussion

From: David Rowley
Date:
On Thu, 24 Jul 2025 at 08:52, Greg Hennessy <greg.hennessy@gmail.com> wrote:
> Adding a new enum to solve a perceived problem doesn't seem like a large
> ask to me.

Seems overly slow to me: someone has to write the patch, a committer
has to commit said patch, and the user must wait for the next major
version to be released before upgrading their database to that
version. That's 6 to 15 months at best.

> Can other people give advice on if adding a new algorithm to
> calculate parallel worker number and change the scaling
> from log3 of a ratio to log of a GUC is best taken care of
> by one patch or two?

Others might be able to chime in if you gave a summary of what those
patches were. I assume you want an enum GUC for the algorithm and some
scale GUC?

> Its clear that if we make the logarithm base an adjustable
> parameter we have to insure it is not 1.0 or less, but
> how much larger than 1.0 can we allow? My memory says
> that the smallest floating point number larger than unity is 1+2**(-23).
> I guess we can make the minimum allowed number another GUC. :)

The range could be 1.0 to maybe 100.0 or 1000.0. If the number is too
close to 1.0 then just have compute_parallel_worker() return
max_workers. You could document 1.0 to mean "limit to
max_parallel_maintenance_workers / max_parallel_workers_per_gather".
The function won't return a value higher than that anyway.
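
In code, that guard could be as simple as something like this
(hypothetical; it assumes the GUC is a double and reuses the function's
existing max_workers argument):

/* Hypothetical guard in front of a log-based calculation like the
 * workers_from_log() sketch earlier in the thread: a base at (or very
 * near) 1.0 just means "as many workers as the caller allows". */
if (scaling_logarithm <= 1.0 + 1e-6)
    return max_workers;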

David