Discussion: Batching in executor


Batching in executor

From: Amit Langote
Date:
At PGConf.dev this year we had an unconference session [1] on whether
the community can support an additional batch executor. The discussion
there led me to start hacking on $subject. I have also had off-list
discussions on this topic in recent months with Andres and David, who
have offered useful thoughts.

This patch series is an early attempt to make executor nodes pass
around batches of tuples instead of tuple-at-a-time slots. The main
motivation is to enable expression evaluation in batch form, which can
substantially reduce per-tuple overhead (mainly from function calls)
and open the door to further optimizations such as SIMD usage in
aggregate transition functions. We could even change algorithms of
some plan nodes to operate on batches when, for example, a child node
can return batches.

The expression evaluation changes are still exploratory, but before
they can be made ready for serious review, we first need a way for
scan nodes to produce tuples in batches and an executor API that
allows upper nodes to consume them. The series includes both the
foundational work to let scan nodes produce batches and an executor
API to pass them around, and a set of follow-on patches that
experiment with batch-aware expression evaluation.

The patch set is structured in two parts. The first three patches lay
the groundwork in the executor and table AM, and the later patches
prototype batch-aware expression evaluation.

Patches 0001-0003 introduce a new batch table AM API and an initial
heapam implementation that can return multiple tuples per call.
SeqScan is adapted to use this interface, with new ExecSeqScanBatch*
routines that fetch tuples in bulk but can still return one
TupleTableSlot at a time to preserve compatibility. On the executor
side, ExecProcNodeBatch() is added alongside ExecProcNode(), with
TupleBatch as the new container for passing groups of tuples. ExecScan
has batch-aware variants that use the AM API internally, but can fall
back to row-at-a-time behavior when required. Plan shapes and EXPLAIN
output remain unchanged; the differences here are executor-internal.
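
To make the new interfaces a bit more concrete, here is a rough sketch
of the batch container and a consumer loop. Apart from the names
mentioned above (TupleBatch, ExecProcNodeBatch, EXEC_BATCH_ROWS), the
fields and the helper are illustrative guesses, not the actual
definitions from the patches:

/* Illustrative only -- not the actual definitions in the patches. */
#define EXEC_BATCH_ROWS 64

typedef struct TupleBatch
{
    int             nrows;                  /* number of valid rows */
    TupleTableSlot *slots[EXEC_BATCH_ROWS]; /* per-row slots, or AM-specific storage */
} TupleBatch;

/*
 * A batch-aware upper node would pull from its child roughly like
 * this, falling back to ExecProcNode() when the child cannot produce
 * batches.
 */
static void
ConsumeChildBatches(PlanState *child)
{
    for (;;)
    {
        TupleBatch *batch = ExecProcNodeBatch(child);

        if (batch == NULL || batch->nrows == 0)
            break;              /* end of scan */

        for (int i = 0; i < batch->nrows; i++)
        {
            /* process batch->slots[i], one row at a time or vectorized */
        }
    }
}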

At present, heapam batches are restricted to tuples from a single
page, which means they may not always fill EXEC_BATCH_ROWS (currently
64). That limits how much upper executor nodes can leverage batching,
especially with selective quals where batches may end up sparsely
populated. A future improvement would be to allow batches to span
pages or to let the scan node request more tuples when its buffer is
not yet full, so that it avoids passing a mostly empty TupleBatch to
upper nodes.

It might also be worth adding some lightweight instrumentation to make
it easier to reason about batch behavior. For example, counters for
average rows per batch, reasons why a batch ended (capacity reached,
page boundary, end of scan), or batches per million rows could help
confirm whether limitations like the single-page restriction or
EXEC_BATCH_ROWS size are showing up in benchmarks. Suggestions from
others on which forms of instrumentation would be most useful are
welcome.
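
For instance, counters along these lines (names purely illustrative;
nothing like this exists in the posted patches) attached to the scan
state would be enough to derive average rows per batch and a breakdown
of why batches ended:

/* Purely illustrative; not part of the posted patches. */
typedef enum BatchEndReason
{
    BATCH_END_CAPACITY,         /* EXEC_BATCH_ROWS reached */
    BATCH_END_PAGE_BOUNDARY,    /* heapam stopped at the end of a page */
    BATCH_END_OF_SCAN,
    NUM_BATCH_END_REASONS
} BatchEndReason;

typedef struct BatchInstrumentation
{
    uint64      nbatches;       /* batches produced so far */
    uint64      ntuples;        /* ntuples / nbatches = avg rows per batch */
    uint64      end_reason[NUM_BATCH_END_REASONS];
} BatchInstrumentation;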

Patches 0004 onwards start experimenting with making expression
evaluation batch-aware, first in the aggregate node. These patches add
new EEOPs (ExprEvalOps and ExprEvalSteps) to fetch attributes into
TupleBatch vectors, evaluate quals across a batch, and run aggregate
transitions over multiple rows at once. Agg is extended to pull
TupleBatch from its child via ExecProcNodeBatch(), with two prototype
paths: one that loops inside the interpreter and another that calls
the transition function once per batch using AggBulkArgs. These are
still PoCs, but with scan nodes and the executor capable of moving
batches around, they provide a base from which the work can be refined
into something potentially committable after the usual polish,
testing, and review.
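
To make the two prototype paths easier to picture, here is a simplified
sketch. AggBulkArgs is the name used in the patches, but its layout
below, and the helper itself, are illustrative guesses rather than the
actual code:

/* Hypothetical layout; only the name AggBulkArgs comes from the patches. */
typedef struct AggBulkArgs
{
    int     nrows;
    Datum  *values;
    bool   *isnull;
} AggBulkArgs;

/* Simplified sketch of the two paths; strictness, memory contexts, and
 * by-ref transvalue handling are omitted. */
static void
AdvanceTransitionBatched(FunctionCallInfo fcinfo, AggStatePerGroup pergroup,
                         int nrows, Datum *values, bool *isnull,
                         bool per_batch_fmgr)
{
    if (!per_batch_fmgr)
    {
        /* Path 1: the interpreter loops over the batch, one fmgr call per row. */
        for (int i = 0; i < nrows; i++)
        {
            fcinfo->args[0].value = pergroup->transValue;
            fcinfo->args[0].isnull = pergroup->transValueIsNull;
            fcinfo->args[1].value = values[i];
            fcinfo->args[1].isnull = isnull[i];
            pergroup->transValue = FunctionCallInvoke(fcinfo);
            pergroup->transValueIsNull = fcinfo->isnull;
        }
    }
    else
    {
        /* Path 2: one fmgr call per batch; the transition function receives
         * the column vectors and loops internally (SIMD-friendly). */
        AggBulkArgs bulk = {.nrows = nrows, .values = values, .isnull = isnull};

        fcinfo->args[0].value = pergroup->transValue;
        fcinfo->args[0].isnull = pergroup->transValueIsNull;
        fcinfo->args[1].value = PointerGetDatum(&bulk);
        fcinfo->args[1].isnull = false;
        pergroup->transValue = FunctionCallInvoke(fcinfo);
        pergroup->transValueIsNull = fcinfo->isnull;
    }
}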

One area that needs more thought is how TupleBatch interacts with
ExprContext. At present the patches extend ExprContext with
scan_batch, inner_batch, and outer_batch fields, but per-batch
evaluation still spills into ecxt_per_tuple_memory, effectively
reusing the per-tuple context for per-batch work. That’s arguably an
abuse of the contract described in ExecEvalExprSwitchContext(), and it
will need a cleaner definition of how batch-scoped memory should be
managed. Feedback on how best to structure that would be particularly
helpful.
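
One direction, shown here only to make the question concrete, would be
a separate batch-scoped context that callers reset once per batch,
mirroring ExecEvalExprSwitchContext(); the ecxt_per_batch_memory field
below does not exist in the posted patches:

/* Hypothetical; ecxt_per_batch_memory is not a field in the patches. */
static inline Datum
ExecEvalExprSwitchBatchContext(ExprState *state, ExprContext *econtext,
                               bool *isNull)
{
    Datum           retDatum;
    MemoryContext   oldContext;

    oldContext = MemoryContextSwitchTo(econtext->ecxt_per_batch_memory);
    retDatum = state->evalfunc(state, econtext, isNull);
    MemoryContextSwitchTo(oldContext);

    return retDatum;
}

/* Callers would then do MemoryContextReset(econtext->ecxt_per_batch_memory)
 * once per batch instead of resetting per-tuple memory for every row. */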

To evaluate the overheads and benefits, I ran microbenchmarks with
single and multi-aggregate queries on a single table, with and without
WHERE clauses. Tables were fully VACUUMed so visibility maps are set
and IO costs are minimal. shared_buffers was large enough to fit the
whole table (up to 10M rows, ~43 on each page), and all pages were
prewarmed into cache before tests. Table schema/script is at [2].

Observations from benchmarking (Detailed benchmark tables are at [3];
below is just a high-level summary of the main patterns):

* Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
sum(a) FROM bar_N): batching scan output alone improved latency by
~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
especially once fmgr overhead was paid per batch instead of per row.

* Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
qual interpreter gave a big step up, with latencies dropping by
~30-40% compared to batching=off.

* Five aggregates, no WHERE: batching input from the child scan cut
~15% off runtime. Adding batched transition evaluation increased
improvements to ~30%.

* Five aggregates, with WHERE: modest gains from scan/input batching,
but per-batch transition evaluation and batched quals brought ~20-30%
improvement.

* Across all cases, executor overheads became visible only after IO
was minimized. Once executor cost dominated, batching consistently
reduced CPU time, with the largest benefits coming from avoiding
per-row fmgr calls and evaluating quals across batches.

I would appreciate if others could try these patches with their own
microbenchmarks or workloads and see if they can reproduce numbers
similar to mine. Feedback on both the general direction and the
details of the patches would be very helpful. In particular, patches
0001-0003, which add the basic batch APIs and integrate them into
SeqScan, are intended to be the first candidates for review and
eventual commit. Comments on the later, more experimental patches
(aggregate input batching and expression evaluation (qual, aggregate
transition) batching) are also welcome.

--
Thanks, Amit Langote

[1]
https://wiki.postgresql.org/wiki/PGConf.dev_2025_Developer_Unconference#Can_the_Community_Support_an_Additional_Batch_Executor

[2] Tables:
cat create_tables.sh
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
  psql -c "drop table if exists bar_$i; create table bar_$i (a int, b int, c int, d int, e int, f int, g int, h int, i text, j int, k int, l int, m int, n int, o int);" 2>&1 > /dev/null
  psql -c "insert into bar_$i select i, i, i, i, i, i, i, i, repeat('x', 100), i, i, i, i, i, i from generate_series(1, $i) i;" 2>&1 > /dev/null
  echo "bar_$i created."
done

[3] Benchmark result tables

All timings are in milliseconds. off = executor_batching off, on =
executor_batching on.  Negative %diff means on is better than off.

Single aggregate, no WHERE
(~20% faster with scan batching only; ~40%+ faster with batched transitions)

With only batched-seqscan (0001-0003):
Rows    off       on       %diff
1M      10.448    8.147    -22.0
2M      18.442    14.552   -21.1
3M      25.296    22.195   -12.3
4M      36.285    33.383   -8.0
5M      44.441    39.894   -10.2
10M     93.110    82.744   -11.1

With batched-agg on top (0001-0007):
Rows    off       on       %diff
1M      9.891     5.579    -43.6
2M      17.648    9.653    -45.3
3M      27.451    13.919   -49.3
4M      36.394    24.269   -33.3
5M      44.665    29.260   -34.5
10M     87.898    56.221   -36.0

Single aggregate, with WHERE
(~30–40% faster once quals + transitions are batched)

With only batched-seqscan (0001-0003):
Rows    off       on       %diff
1M      18.485    17.749   -4.0
2M      34.696    33.033   -4.8
3M      49.582    46.155   -6.9
4M      70.270    67.036   -4.6
5M      84.616    81.013   -4.3
10M     174.649   164.611  -5.7

With batched-agg and batched-qual on top (0001-0008):
Rows    off       on       %diff
1M      18.887    12.367   -34.5
2M      35.706    22.457   -37.1
3M      51.626    30.902   -40.1
4M      72.694    48.214   -33.7
5M      88.103    57.623   -34.6
10M     181.350   124.278  -31.5

Five aggregates, no WHERE
(~15% faster with scan/input batching; ~30% with batched transitions)

Agg input batching only (0001-0004):
Rows    off       on       %diff
1M      23.193    19.196   -17.2
2M      42.177    35.862   -15.0
3M      62.192    51.121   -17.8
4M      83.215    74.665   -10.3
5M      99.426    91.904   -7.6
10M     213.794   184.263  -13.8

Batched transition eval, per-row fmgr (0001-0006):
Rows    off       on       %diff
1M      23.501    19.672   -16.3
2M      44.128    36.603   -17.0
3M      64.466    53.079   -17.7
5M      103.442   97.623   -5.6
10M     219.120   190.354  -13.1

Batched transition eval, per-batch fmgr (0001-0007):
Rows    off       on       %diff
1M      24.238    16.806   -30.7
2M      43.056    30.939   -28.1
3M      62.938    43.295   -31.2
4M      83.346    63.357   -24.0
5M      100.772   78.351   -22.2
10M     213.755   162.203  -24.1

Five aggregates, with WHERE
(~10–15% faster with scan/input batching; ~30% with batched transitions + quals)

Agg input batching only (0001-0004):
Rows    off       on       %diff
1M      24.261    22.744   -6.3
2M      45.802    41.712   -8.9
3M      79.311    72.732   -8.3
4M      107.189   93.870   -12.4
5M      129.172   115.300  -10.7
10M     278.785   236.275  -15.2

Batched transition eval, per-batch fmgr (0001-0007):
Rows    off       on       %diff
1M      24.354    19.409   -20.3
2M      46.888    36.687   -21.8
3M      82.147    57.683   -29.8
4M      109.616   76.471   -30.2
5M      133.777   94.776   -29.2
10M     282.514   194.954  -31.0

Batched transition eval + batched qual (0001-0008):
Rows    off       on       %diff
1M      24.691    20.193   -18.2
2M      47.182    36.530   -22.6
3M      82.030    58.663   -28.5
4M      110.573   76.500   -30.8
5M      136.701   93.299   -31.7
10M     280.551   191.021  -31.9

Attachments

Re: Batching in executor

From: Bruce Momjian
Date:
On Fri, Sep 26, 2025 at 10:28:33PM +0900, Amit Langote wrote:
> At PGConf.dev this year we had an unconference session [1] on whether
> the community can support an additional batch executor. The discussion
> there led me to start hacking on $subject. I have also had off-list
> discussions on this topic in recent months with Andres and David, who
> have offered useful thoughts.
> 
> This patch series is an early attempt to make executor nodes pass
> around batches of tuples instead of tuple-at-a-time slots. The main
> motivation is to enable expression evaluation in batch form, which can
> substantially reduce per-tuple overhead (mainly from function calls)
> and open the door to further optimizations such as SIMD usage in
> aggregate transition functions. We could even change algorithms of
> some plan nodes to operate on batches when, for example, a child node
> can return batches.

For background, people might want to watch these two videos from POSETTE
2025.  The first video explains how data warehouse query needs are
different from OLTP needs:

    Building a PostgreSQL data warehouse
    https://www.youtube.com/watch?v=tpq4nfEoioE

and the second one explains the executor optimizations done in PG 18:

    Hacking Postgres Executor For Performance
    https://www.youtube.com/watch?v=D3Ye9UlcR5Y

I learned from these two videos that to handle new workloads, I need to
think about query demands differently; the question, of course, is
whether this can be accomplished without hampering OLTP workloads.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: Batching in executor

From: Tomas Vondra
Date:
Hi Amit,

Thanks for the patch. I took a look over the weekend and did a couple of
experiments / benchmarks, so let me share some initial feedback (or
rather a bunch of questions I came up with).

I'll start with some general thoughts, before going into some nitpicky
comments about patches / code and perf results.

I think the general goal of the patch - reducing the per-tuple overhead
and making the executor more efficient for OLAP workloads - is very
desirable. I believe the limitations of the per-row executor are one of
the reasons why attempts to implement a columnar TAM mostly failed. The
compression is nice, but it's hard to be competitive without an executor
that leverages it too. So starting with the executor, in a way that
helps even heap, seems like a good plan. So +1 to this.

While looking at the patch, I couldn't help but think about the index
prefetching stuff that I work on. It also introduces the concept of a
"batch", for passing data between an index AM and the executor. It's
interesting how different the designs are in some respects. I'm not
saying one of those designs is wrong; it's more due to different goals.

For example, the index prefetching patch establishes a "shared" batch
struct, and the index AM is expected to fill it with data. After that,
the batch is managed entirely by indexam.c, with no AM calls. The only
AM-specific bit in the batch is "position", but that's used only when
advancing to the next page, etc.

This patch does things differently. IIUC, each TAM may produce its own
"batch", which is then wrapped in a generic one. For example, heap
produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
is fine. In the prefetching we chose to move all this code (walking the
batch items) from the AMs into the layer above, and make it AM agnostic.

But for the batching, we want to retain the custom format as long as
possible. Presumably, the various advantages of the TAMs are tied to the
custom/columnar storage format. Memory efficiency thanks to compression,
execution on compressed data, etc. Keeping the custom format as long as
possible is the whole point of "late materialization" (and materializing
as late as possible is one of the important details in column stores).

How far ahead have you thought about these capabilities? I was wondering
about two things in particular. First, at which point do we have to
"materialize" the TupleBatch into some generic format (e.g. TupleSlots).
I get it that you want to enable passing batches between nodes, but
would those use the same "format" as the underlying scan node, or some
generic one? Second, will it be possible to execute expressions on the
custom batches (i.e. on "compressed data")? Or is it necessary to
"materialize" the batch into regular tuple slots? I realize those may
not be there "now" but maybe it'd be nice to plan for the future.

It might be worth exploring some columnar formats, and see if this
design would be a good fit. Let's say we want to process data read from
a parquet file. Would we be able to leverage the format, or would we
need to "materialize" into slots too early? Or maybe it'd be good to
look at the VCI extension [1], discussed in a nearby thread. AFAICS
that's still based on an index AM, but there were suggestions to use TAM
instead (and maybe that'd be a better choice).

The other option would be to "create batches" during execution, say by
having a new node that accumulates tuples, builds a batch and sends it
to the node above. This would help both in cases when either the lower
node does not produce batches at all, or the batches are too small (due
to filtering, aggregation, ...). Of course, it'd only win if this
increases efficiency of the upper part of the plan enough to pay for
building the batches. That can be a hard decision.

You also mentioned we could make batches larger by letting them span
multiple pages, etc. I'm not sure that's worth it - wouldn't that
substantially complicate the TAM code, which would need to pin+track
multiple buffers for each batch, etc.? Possible, but is it worth it?

I'm not sure allowing multi-page batches would actually solve the issue.
It'd help with batches at the "scan level", but presumably the batch
size in the upper nodes matters just as much. Large scan batches may
help, but hard to predict.

In the index prefetching patch we chose to keep batches 1:1 with leaf
pages, at least for now. Instead we allowed having multiple batches at
once. I'm not sure that'd be necessary for TAMs, though.

This also reminds me of LIMIT queries. The way I imagine a "batchified"
executor to work is that batches are essentially "units of work". For
example, a nested loop would grab a batch of tuples from the outer
relation, lookup inner tuples for the whole batch, and only then pass
the result batch. (I'm ignoring the cases when the batch explodes due to
duplicates.)

But what if there's a LIMIT 1 on top? Maybe it'd be enough to process
just the first tuple, and the rest of the batch is wasted work? Plenty
of (very expensive) OLAP queries have that, and many would likely benefit from
batching, so just disabling batching if there's LIMIT seems way too
heavy handed.

Perhaps it'd be good to gradually ramp up the batch size? Start with
small batches, and then make them larger. The index prefetching does
that too, indirectly - it reads the whole leaf page as a batch, but then
gradually ramps up the prefetch distance (well, read_stream does that).
Maybe the batching should have a similar thing ...

In fact, how shall the optimizer decide whether to use batching? It's
one thing to decide whether a node can produce/consume batches, but
another thing is "should it"? With a node that "builds" a batch, this
decision would apply to even more plans, I guess.

I don't have a great answer to this, it seems like an incredibly tricky
costing issue. I'm a bit worried we might end up with something too
coarse, like "jit=on" which we know is causing problems (admittedly,
mostly due to a lot of the LLVM work being unpredictable/external). But
having some "adaptive" heuristics (like the gradual ramp up) might make
it less risky.

FWIW the current batch size limit (64 tuples) seems rather low, but it's
hard to say. It'd be good to be able to experiment with different
values, so I suggest we make this a GUC and not a hard-coded constant.
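
Something roughly like this entry in guc_tables.c is the kind of thing
I have in mind (executor_batch_rows is a name I just made up, and the
bounds are placeholders):

/* Sketch only; GUC name and bounds are placeholders. */
{
    {"executor_batch_rows", PGC_USERSET, QUERY_TUNING_OTHER,
        gettext_noop("Sets the maximum number of rows per executor batch."),
        NULL
    },
    &executor_batch_rows,
    64, 1, 8192,
    NULL, NULL, NULL
},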

As for what to add to EXPLAIN, I'd start by adding info about which
nodes are "batched" (consuming/producing batches), and some info about
the batch sizes. An average size, maybe a histogram if you want to be a
bit fancy.

I have no thoughts about the expression patches, at least not beyond
what I already wrote above. I don't know enough about that part.

[1]
https://www.postgresql.org/message-id/OS7PR01MB119648CA4E8502FE89056E56EEA7D2%40OS7PR01MB11964.jpnprd01.prod.outlook.com


Now, numbers from some microbenchmarks:

On 9/26/25 15:28, Amit Langote wrote:
> 
> To evaluate the overheads and benefits, I ran microbenchmarks with
> single and multi-aggregate queries on a single table, with and without
> WHERE clauses. Tables were fully VACUUMed so visibility maps are set
> and IO costs are minimal. shared_buffers was large enough to fit the
> whole table (up to 10M rows, ~43 on each page), and all pages were
> prewarmed into cache before tests. Table schema/script is at [2].
> 
> Observations from benchmarking (Detailed benchmark tables are at [3];
> below is just a high-level summary of the main patterns):
> 
> * Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
> sum(a) FROM bar_N): batching scan output alone improved latency by
> ~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
> especially once fmgr overhead was paid per batch instead of per row.
> 
> * Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
> qual interpreter gave a big step up, with latencies dropping by
> ~30-40% compared to batching=off.
> 
> * Five aggregates, no WHERE: batching input from the child scan cut
> ~15% off runtime. Adding batched transition evaluation increased
> improvements to ~30%.
> 
> * Five aggregates, with WHERE: modest gains from scan/input batching,
> but per-batch transition evaluation and batched quals brought ~20-30%
> improvement.
> 
> * Across all cases, executor overheads became visible only after IO
> was minimized. Once executor cost dominated, batching consistently
> reduced CPU time, with the largest benefits coming from avoiding
> per-row fmgr calls and evaluating quals across batches.
> 
> I would appreciate if others could try these patches with their own
> microbenchmarks or workloads and see if they can reproduce numbers
> similar to mine. Feedback on both the general direction and the
> details of the patches would be very helpful. In particular, patches
> 0001-0003, which add the basic batch APIs and integrate them into
> SeqScan, are intended to be the first candidates for review and
> eventual commit. Comments on the later, more experimental patches
> (aggregate input batching and expression evaluation (qual, aggregate
> transition) batching) are also welcome.
> 

I tried to replicate the results, but the numbers I see are not this
good. In fact, I see a fair number of regressions (and some are not
negligible).

I'm attaching the scripts I used to build the tables / run the test. I
used the same table structure, and tried to follow the same query
pattern with 1 or 5 aggregates (I used "avg"), [0, 1, 5] where
conditions (with 100% selectivity).

I measured master vs. 0001-0003 vs. 0001-0007 (with batching on/off).
And I did that on my (relatively) new ryzen machine and an old xeon. The
behavior is quite different for the two machines, but neither of them shows
such improvements. I used clang 19.0, and --with-llvm.

See the attached PDFs with a summary of the results, comparing the
results for master and the two batching branches.

The ryzen is much "smoother" - it shows almost no difference with
batching "off" (as expected). The "scan" branch (with 0001-0003) shows
an improvement of 5-10% - it's consistent, but much less than the 10-20%
you report. For the "agg" branch the benefits are much larger, but
there's also a significant regression for the largest table with 100M
rows (which is ~18GB on disk).

For xeon, the results are a bit more variable, but it affects runs both
with batching "on" and "off". The machine is just more noisy. There
seems to be a small benefit of "scan" batching (in most cases much less
than the 10-20%). The "agg" is a clear win, with up to 30-40% speedup,
and no regression similar to the ryzen.

Perhaps I did something wrong. It does not surprise me this is somewhat
CPU dependent. It's a bit sad the improvements are smaller for the newer
CPU, though.

I also tried running TPC-H. I don't have useful numbers yet, but I ran
into a segfault - see the attached backtrace. It only happens with the
batching, and only on Q22 for some reason. I initially thought it's a
bug in clang, because I saw it with clang-22 built from git, and not
with clang-14 or gcc. But since then I reproduced it with clang-19 (on
debian 13). Still could be a clang bug, of course. I've seen ~20 of
those segfaults so far, and the backtraces look exactly the same.


regards

-- 
Tomas Vondra

Attachments

Re: Batching in executor

From: Amit Langote
Date:
Hi Tomas,

Thanks a lot for your comments and benchmarking.

I plan to reply to your detailed comments and benchmark results, but I
just realized I had forgotten to attach patch 0008 (oops!) in my last
email. That patch adds batched qual evaluation.

I also noticed that the batched path was unnecessarily doing early
“batch-materialization” in cases like SELECT count(*) FROM bar. I’ve
fixed that as well. It was originally designed to avoid such
materialization, but I must have broken it while refactoring.

Attachments

Re: Batching in executor

From: Amit Langote
Date:
Hi Bruce,

On Fri, Sep 26, 2025 at 10:49 PM Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Sep 26, 2025 at 10:28:33PM +0900, Amit Langote wrote:
> > At PGConf.dev this year we had an unconference session [1] on whether
> > the community can support an additional batch executor. The discussion
> > there led me to start hacking on $subject. I have also had off-list
> > discussions on this topic in recent months with Andres and David, who
> > have offered useful thoughts.
> >
> > This patch series is an early attempt to make executor nodes pass
> > around batches of tuples instead of tuple-at-a-time slots. The main
> > motivation is to enable expression evaluation in batch form, which can
> > substantially reduce per-tuple overhead (mainly from function calls)
> > and open the door to further optimizations such as SIMD usage in
> > aggregate transition functions. We could even change algorithms of
> > some plan nodes to operate on batches when, for example, a child node
> > can return batches.
>
> For background, people might want to watch these two videos from POSETTE
> 2025.  The first video explains how data warehouse query needs are
> different from OLTP needs:
>
>         Building a PostgreSQL data warehouse
>         https://www.youtube.com/watch?v=tpq4nfEoioE
>
> and the second one explains the executor optimizations done in PG 18:
>
>         Hacking Postgres Executor For Performance
>         https://www.youtube.com/watch?v=D3Ye9UlcR5Y
>
> I learned from these two videos that to handle new workloads, I need to
> think of the query demands differently, and of course can this be
> accomplished without hampering OLTP workloads?

Thanks for pointing to those talks -- I gave the second one. :-)

Yes, the idea here is to introduce batching without adding much
overhead or new code into the OLTP path.

--
Thanks, Amit Langote



Re: Batching in executor

From: Amit Langote
Date:
On Tue, Sep 30, 2025 at 11:11 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Hi Tomas,
>
> Thanks a lot for your comments and benchmarking.
>
> I plan to reply to your detailed comments and benchmark results

For now, I reran a few benchmarks with the master branch as an
explicit baseline, since Tomas reported possible regressions with
executor_batching=off. I can reproduce that on my side:

5 aggregates, no where:
select avg(a), avg(b), avg(c), avg(d), avg(e) from bar;

parallel_workers=0, jit=off
Rows    master    batching off    batching on    master vs off    master vs on
1M      47.118    48.545          39.531         +3.0%            -16.1%
2M      95.098    97.241          80.189         +2.3%            -15.7%
3M      141.821   148.540         122.005        +4.7%            -14.0%
4M      188.969   197.056         163.779        +4.3%            -13.3%
5M      240.113   245.902         213.645        +2.4%            -11.0%
10M     556.738   564.120         486.359        +1.3%            -12.6%

parallel_workers=2, jit=on
Rows    master    batching off    batching on    master vs off    master vs on
1M      21.147    22.278          20.737         +5.3%            -1.9%
2M      40.319    41.509          37.851         +3.0%            -6.1%
3M      61.582    63.026          55.927         +2.3%            -9.2%
4M      96.363    95.245          78.494         -1.2%            -18.5%
5M      117.226   117.649         97.968         +0.4%            -16.4%
10M     245.503   246.896         196.335        +0.6%            -20.0%

1 aggregate, no where:
select count(*) from bar;

parallel_workers=0, jit=off
Rows    master    batching off    batching on    master vs off    master vs on
1M      17.071    20.135          6.698          +17.9%           -60.8%
2M      36.905    41.522          15.188         +12.5%           -58.9%
3M      56.094    63.110          23.485         +12.5%           -58.1%
4M      74.299    83.912          32.950         +12.9%           -55.7%
5M      94.229    108.621         41.338         +15.2%           -56.1%
10M     234.425   261.490         117.833        +11.6%           -49.7%

parallel_workers=2, jit=on
Rows    master    batching off    batching on    master vs off    master vs on
1M      8.820     9.832           5.324          +11.5%           -39.6%
2M      16.368    18.001          9.526          +10.0%           -41.8%
3M      24.810    28.193          14.482         +13.6%           -41.6%
4M      34.369    35.741          23.212         +4.0%            -32.5%
5M      41.595    45.103          27.918         +8.4%            -32.9%
10M     99.494    112.226         94.081         +12.8%           -5.4%

The regression is more noticeable in the single aggregate case, where
more time is spent in scanning.

Looking into it.

--
Thanks, Amit Langote