Discussion: benchmark results comparing versions 15.2 and 16


benchmark results comparing versions 15.2 and 16

From
MARK CALLAGHAN
Date:
Let me know if results like this shouldn't be posted here.

This is mostly a hobby project for me - my other hobby is removing invasive weeds. I am happy to answer questions and run more tests, but turnaround for answers won't be instant. Getting results from Linux perf for these tests is on my TODO list. For now I am just re-running a subset of these to get more certainty that the regressions are real and not noise.

These are results for the insert benchmark on a small server comparing performance for versions 15.2 and 16. For version 16 I built from source at git sha 1ab763fc.
https://github.com/postgres/postgres/commit/1ab763fc22adc88e5d779817e7b42b25a9dd7c9e

Late in the version 15 beta release cycle this benchmark found a significant regression. I don't see anything significant this time, but there are potential small regressions.

More detail on how I run the benchmark and on the HW is here; the server is small: a Beelink with 8 cores, 16G of RAM and a 1T NVMe SSD.
http://smalldatum.blogspot.com/2023/05/the-insert-benchmark-postgres-versions.html

Performance reports are linked below. But first, disclaimers:
* the goal is to determine whether there are CPU improvements or regressions. To make that easier to spot I disable fsync on commit.
* my scripts compute CPU/operation where operation is a SQL statement. However, CPU in this case is measured by vmstat and includes CPU from both the benchmark client and the Postgres server
* the regressions here are small, usually less than 5%, which makes it hard to distinguish normal variance from a real regression, but the results are repeatable
* the links below are to the Summary section, which includes throughput (absolute and relative). The relative throughput is (throughput for PG 16 / throughput for PG 15.2), so values less than 1.0 indicate a regression in PG 16
* I used the same compiler options for the builds of 15.2 and 16
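As a concrete illustration of the CPU/operation disclaimer above, here is a minimal sketch of that computation, assuming once-per-second vmstat samples; the sample values, operation count, and the conversion via the 8-core count are all illustrative, not taken from the actual runs:

```shell
# Hypothetical CPU/operation computation: sum the us+sy columns from
# once-per-second vmstat samples, convert percent-busy to CPU-seconds
# (8 cores assumed, as on the Beelink), and divide by the number of
# SQL statements the benchmark step executed. All numbers are made up.
samples="3 2
4 2
3 3"
ops=50000
echo "$samples" | awk -v ops="$ops" -v ncpu=8 '
    { cpu += ($1 + $2) / 100 * ncpu }    # percent busy -> CPU-seconds
    END { printf "%.2f CPU-microseconds/op\n", cpu * 1e6 / ops }'
```

Because the client and server share the host, this metric attributes client CPU to the server too, which is why it is only useful for relative comparisons between builds.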

Summary of the results:
* from r1 - insert-heavy (l.i0, l.i1) and create indexes (l.x) steps are ~2% slower in PG 16
* from r2 - create index (l.x) step is ~4% slower in PG 16
* from r3 - regressions are similar to r1
* from r4, r5 and r6 - regressions are mostly worse than r1, r2, r3. Note r4, r5, r6 are the same workload as r1, r2, r3 except the database is cached by PG for r1, r2, r3 so the r4, r5 and r6 benchmarks will do much more copying from the OS page cache into the Postgres buffer pool.

I will repeat r1, r2, r4 and r5 but with the tests run in a different order to confirm this isn't just noise.

Database cached by Postgres:
r1) 1 table, 1 client - https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.1u.1tno.cached/all.html#summary
r2) 4 tables, 4 clients - https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tno.cached/all.html#summary
r3) 1 table, 4 clients - https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tyes.cached/all.html#summary

Database cached by OS but not by Postgres:
r4) 1 table, 1 client - https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.1u.1tno.1g/all.html#summary
r5) 4 tables, 4 clients - https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tno.1g/all.html#summary
r6) 1 table, 4 clients - https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tyes.1g/all.html#summary


--
Mark Callaghan
mdcallag@gmail.com

Re: benchmark results comparing versions 15.2 and 16

From
Andres Freund
Date:
Hi,

On 2023-05-05 10:45:12 -0700, MARK CALLAGHAN wrote:
> This is mostly a hobby project for me - my other hobby is removing
> invasive weeds.

Hah :)


> Summary of the results:
> * from r1 - insert-heavy (l.i0, l.i1) and create indexes (l.x) steps are
> ~2% slower in PG 16
> * from r2 - create index (l.x) step is ~4% slower in PG 16
> * from r3 - regressions are similar to r1
> * from r4, r5 and r6 - regressions are mostly worse than r1, r2, r3. Note
> r4, r5, r6 are the same workload as r1, r2, r3 except the database is
> cached by PG for r1, r2, r3 so the r4, r5 and r6 benchmarks will do much
> more copying from the OS page cache into the Postgres buffer pool.

One thing that's somewhat odd is that there's very marked changes in l.i0's
p99 latency for the four clients cases - but whether 15 or 16 are better
differs between the runs.

r2)
                                p99
20m.pg152_o3_native_lto.cx7     300
20m.pg16prebeta.cx7             23683

r3)
                                p99
20m.pg152_o3_native_lto.cx7     70245
20m.pg16prebeta.cx7             8191

r5)
                                p99
20m.pg152_o3_native_lto.cx7     11188
20m.pg16prebeta.cx7             72720

r6)
                                p99
20m.pg152_o3_native_lto.cx7     1898
20m.pg16prebeta.cx7             31666


I do wonder if there's something getting scheduled in some of these runs
increasing latency?

Or what we're seeing depends on the time between the start of the server and
the start of the benchmark? It is interesting that the per-second throughput
graph shows a lot of up/down at the end:
https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tyes.1g/tput.l.i0.html

Both 15 and 16 have one very high result at 70s, 15 then has one low, but 16
has two low results.

Greetings,

Andres Freund



Re: benchmark results comparing versions 15.2 and 16

From
MARK CALLAGHAN
Date:
I have two more runs of the benchmark in progress so we will have 3 results for each of the test cases to confirm that the small regressions are repeatable.

On Fri, May 5, 2023 at 1:34 PM Andres Freund <andres@anarazel.de> wrote:
> One thing that's somewhat odd is that there's very marked changes in l.i0's
> p99 latency for the four clients cases - but whether 15 or 16 are better
> differs between the runs.

From the response time sections for l.i0, the [1ms, 4ms) bucket has 99% or more for all 6 cases.
For example, https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.1u.1tno.cached/all.html#l.i0.rt
But yes, the p99 is as you state. I will wade through my test scripts tomorrow to see how the p99 is computed.

> I do wonder if there's something getting scheduled in some of these runs
> increasing latency?

Do you mean interference from other processes? Both the big and small servers run Ubuntu 22.04 server (no X) and there shouldn't be much else running, although Google Cloud adds a few services that run in the background.
 
> Or what we're seeing depends on the time between the start of the server and
> the start of the benchmark? It is interesting that the per-second throughput
> graph shows a lot of up/down at the end:
> https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tyes.1g/tput.l.i0.html
>
> Both 15 and 16 have one very high result at 70s, 15 then has one low, but 16
> has two low results.

Immediately prior to l.i0 the database directory is wiped and then Postgres is initialized and started.

The IPS vs time graphs are more interesting for benchmark steps that run longer. Alas, this can't run too long if the resulting database is to fit in <= 16G. But that is a problem for another day. The IPS vs time graphs are not a flat line, but I am not ready to pursue that as a problem unless it shows multi-second write stalls (fortunately it does not).

--
Mark Callaghan
mdcallag@gmail.com

Re: benchmark results comparing versions 15.2 and 16

From
Alexander Lakhin
Date:
Hello Mark,

05.05.2023 20:45, MARK CALLAGHAN wrote:
> This is mostly a hobby project for me - my other hobby is removing invasive weeds. I am happy to answer questions and
> run more tests, but turnaround for answers won't be instant. Getting results from Linux perf for these tests is on my
> TODO list. For now I am just re-running a subset of these to get more certainty that the regressions are real and not
> noise.
>

It's a very interesting topic to me, too. I have developed some scripts to
measure and compare Postgres's performance using miscellaneous public
benchmarks (ycsb, tpcds, benchmarksql_tpcc, htapbench, benchbase, gdprbench,
s64da-benchmark, ...). Having compared 15.3 (56e869a09) with master
(58f5edf84) I haven't seen significant regressions except a few minor ones.
The first regression was observed with a simple pgbench test:
pgbench -i benchdb
pgbench -c 10 -T 300 benchdb
(with default compilation options and fsync = off)

On master I get:
tps = 10349.826645 (without initial connection time)
On 15.3:
tps = 11296.064992 (without initial connection time)

This difference is confirmed by multiple test runs. `git bisect` for this
regression pointed at f193883fc.
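For scale, the gap between those two tps figures works out to roughly an 8% slowdown; a quick back-of-envelope check (plain arithmetic on the numbers quoted above, nothing more):

```shell
# (1 - master_tps / 15.3_tps) * 100, using the two tps values above.
awk 'BEGIN { printf "%.1f%% slower\n", (1 - 10349.826645 / 11296.064992) * 100 }'
```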

Best regards,
Alexander



Re: benchmark results comparing versions 15.2 and 16

From
Andres Freund
Date:
Hi,

On 2023-05-08 16:00:01 +0300, Alexander Lakhin wrote:
> This difference is confirmed by multiple test runs. `git bisect` for this
> regression pointed at f193883fc.

I can reproduce a significant regression due to f193883fc of a workload just
running
SELECT CURRENT_TIMESTAMP;

A single session running it on my workstation via pgbench -Mprepared gets
before:
tps = 89359.128359 (without initial connection time)
after:
tps = 83843.585152 (without initial connection time)

Obviously this is an extreme workload, but that nevertheless seems too large
to just accept...


Michael, the commit message notes that there was no measured performance
regression - yet I see one in a trivial test. What were you measuring?


I'm a bit surprised by the magnitude of the regression, but it's not
surprising that there is a performance effect. You're replacing something that
doesn't go through the whole generic function rigamarole with something
that does...

Looking at two perf profiles, the biggest noticeable difference is

Before:

-    5.51%     0.13%  postgres  postgres  [.] ExecInitResult
   - 5.38% ExecInitResult
      + 2.29% ExecInitResultTupleSlotTL
      - 2.22% ExecAssignProjectionInfo
         - 2.19% ExecBuildProjectionInfo
              0.47% ExecReadyInterpretedExpr
            - 0.43% ExecInitExprRec
               - 0.10% palloc
                    AllocSetAlloc.localalias (inlined)
            + 0.32% expression_tree_walker_impl.localalias (inlined)
            + 0.28% get_typlen
              0.09% ExecPushExprSlots
            + 0.06% MemoryContextAllocZeroAligned
            + 0.04% MemoryContextAllocZeroAligned
              0.02% exprType.localalias (inlined)
      + 0.41% ExecAssignExprContext
      + 0.35% MemoryContextAllocZeroAligned
        0.11% ExecInitQual.localalias (inlined)
   + 0.11% _start
   + 0.02% 0x55b89c764d7f

After:

-    6.57%     0.17%  postgres  postgres  [.] ExecInitResult
   - 6.40% ExecInitResult
      - 3.00% ExecAssignProjectionInfo
         - ExecBuildProjectionInfo
            - 0.91% ExecInitExprRec
               - 0.65% ExecInitFunc
                    0.23% fmgr_info_cxt_security
                  + 0.18% palloc0
                  + 0.07% object_aclcheck
                    0.04% fmgr_info
                 0.05% check_stack_depth
               + 0.05% palloc
            + 0.58% expression_tree_walker_impl.localalias (inlined)
            + 0.55% get_typlen
              0.37% ExecReadyInterpretedExpr
            + 0.11% MemoryContextAllocZeroAligned
              0.09% ExecPushExprSlots
              0.04% exprType.localalias (inlined)
      + 2.77% ExecInitResultTupleSlotTL
      + 0.50% ExecAssignExprContext
      + 0.09% MemoryContextAllocZeroAligned
        0.05% ExecInitQual.localalias (inlined)
   + 0.10% _start

I.e. we spend more time building the expression state for expression
evaluation, because we now go through the generic ExecInitFunc(), instead of
something dedicated. We also now need to do permission checking etc.


I don't think that's the entirety of the regression...

Greetings,

Andres Freund



Re: benchmark results comparing versions 15.2 and 16

From
Andres Freund
Date:
Hi,

On 2023-05-08 12:11:17 -0700, Andres Freund wrote:
> Hi,
> 
> On 2023-05-08 16:00:01 +0300, Alexander Lakhin wrote:
> > This difference is confirmed by multiple test runs. `git bisect` for this
> > regression pointed at f193883fc.
> 
> I can reproduce a significant regression due to f193883fc of a workload just
> running
> SELECT CURRENT_TIMESTAMP;
> 
> A single session running it on my workstation via pgbench -Mprepared gets
> before:
> tps = 89359.128359 (without initial connection time)
> after:
> tps = 83843.585152 (without initial connection time)
> 
> Obviously this is an extreme workload, but that nevertheless seems too large
> to just accept...

Added an open item for this.

Greetings,

Andres Freund



Re: benchmark results comparing versions 15.2 and 16

From
MARK CALLAGHAN
Date:


On Fri, May 5, 2023 at 10:01 PM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> I have two more runs of the benchmark in progress so we will have 3 results for each of the test cases to confirm that the small regressions are repeatable.

They got similar results. Then I tried Linux perf, but the hierarchical call stacks, to be used for flamegraphs, have too many "[unknown]" entries.
I was using: ./configure --prefix=$pfx --enable-debug CFLAGS="-O3 -march=native -mtune=native -flto" LDFLAGS="-flto" > o.cf.$x 2> e.cf.$x
Adding -fno-omit-frame-pointer fixes the problem, so I am repeating the benchmark with that to confirm there are still regressions, and then I will get flamegraphs.
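For reference, a sketch of that build-plus-flamegraph workflow: the flags match the configure line quoted above plus the frame-pointer fix, while the perf and FlameGraph invocations are a common recipe shown commented out because they need a built tree and a running server (paths, the sampling rate, and the output name are placeholders, not from the thread):

```shell
# Keep frame pointers so perf can unwind the stacks.
CFLAGS="-O3 -march=native -mtune=native -flto -fno-omit-frame-pointer"
echo "CFLAGS=$CFLAGS"
# ./configure --prefix="$pfx" --enable-debug CFLAGS="$CFLAGS" LDFLAGS="-flto"
# perf record -F 99 -g -p "$(head -1 "$PGDATA"/postmaster.pid)" -- sleep 60
# perf script | stackcollapse-perf.pl | flamegraph.pl > pg16.svg
```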

--
Mark Callaghan
mdcallag@gmail.com

Re: benchmark results comparing versions 15.2 and 16

From
Alexander Lakhin
Date:
08.05.2023 16:00, Alexander Lakhin wrote:
> ... Having compared 15.3 (56e869a09) with master
> (58f5edf84) I haven't seen significant regressions except a few minor ones.
> First regression observed with a simple pgbench test:

Another noticeable, but not critical performance degradation is revealed by
query 87 from TPC-DS (I use s64da-benchmark):
https://github.com/swarm64/s64da-benchmark-toolkit/blob/master/benchmarks/tpcds/queries/queries_10/87.sql

With `prepare_benchmark --scale-factor=2`, `run_benchmark --scale-factor=10`
I get on master:
2023-05-10 09:27:52,888 INFO    : finished 80/103: query 87 of stream  0: 2.26s OK
but on REL_15_STABLE:
2023-05-10 08:13:40,648 INFO    : finished 80/103: query 87 of stream  0: 1.94s OK

This time `git bisect` pointed at 3c6fc5820. Having compared execution plans
(both attached), I see the following differences (3c6fc5820~1 vs 3c6fc5820):
->  Subquery Scan on "*SELECT* 1"  (cost=149622.00..149958.68 rows=16834 width=21) (actual time=1018.606..1074.468 rows=93891 loops=1)
  ->  Unique  (cost=149622.00..149790.34 rows=16834 width=17) (actual time=1018.604..1064.790 rows=93891 loops=1)
   ->  Sort  (cost=149622.00..149664.09 rows=16834 width=17) (actual time=1018.603..1052.591 rows=94199 loops=1)
    ->  Gather  (cost=146588.59..148440.33 rows=16834 width=17) (actual time=880.899..913.978 rows=94199 loops=1)
vs
->  Subquery Scan on "*SELECT* 1"  (cost=147576.79..149829.53 rows=16091 width=21) (actual time=1126.489..1366.751 rows=93891 loops=1)
  ->  Unique  (cost=147576.79..149668.62 rows=16091 width=17) (actual time=1126.487..1356.938 rows=93891 loops=1)
   ->  Gather Merge  (cost=147576.79..149547.94 rows=16091 width=17) (actual time=1126.487..1345.253 rows=94204 loops=1)
    ->  Unique  (cost=146576.78..146737.69 rows=16091 width=17) (actual time=1124.426..1306.532 rows=47102 loops=2)
     ->  Sort  (cost=146576.78..146617.01 rows=16091 width=17) (actual time=1124.424..1245.110 rows=533434 loops=2)

->  Subquery Scan on "*SELECT* 2"  (cost=52259.82..52428.16 rows=8417 width=21) (actual time=653.640..676.879 rows=62744 loops=1)
  ->  Unique  (cost=52259.82..52343.99 rows=8417 width=17) (actual time=653.639..670.405 rows=62744 loops=1)
   ->  Sort  (cost=52259.82..52280.86 rows=8417 width=17) (actual time=653.637..662.428 rows=62744 loops=1)
    ->  Gather  (cost=50785.20..51711.07 rows=8417 width=17) (actual time=562.158..571.737 rows=62744 loops=1)
     ->  HashAggregate  (cost=49785.20..49869.37 rows=8417 width=17) (actual time=538.263..544.336 rows=31372 loops=2)
      ->  Nested Loop  (cost=0.85..49722.07 rows=8417 width=17) (actual time=2.049..469.747 rows=284349 loops=2)
vs
->  Subquery Scan on "*SELECT* 2"  (cost=48503.68..49630.12 rows=8046 width=21) (actual time=700.050..828.388 rows=62744 loops=1)
  ->  Unique  (cost=48503.68..49549.66 rows=8046 width=17) (actual time=700.047..821.836 rows=62744 loops=1)
   ->  Gather Merge  (cost=48503.68..49489.31 rows=8046 width=17) (actual time=700.047..814.136 rows=62744 loops=1)
    ->  Unique  (cost=47503.67..47584.13 rows=8046 width=17) (actual time=666.348..763.403 rows=31372 loops=2)
     ->  Sort  (cost=47503.67..47523.78 rows=8046 width=17) (actual time=666.347..730.336 rows=284349 loops=2)
      ->  Nested Loop  (cost=0.85..46981.72 rows=8046 width=17) (actual time=1.852..454.111 rows=284349 loops=2)

->  Subquery Scan on "*SELECT* 3"  (cost=50608.83..51568.70 rows=7165 width=21) (actual time=302.571..405.305 rows=23737 loops=1)
  ->  Unique  (cost=50608.83..51497.05 rows=7165 width=17) (actual time=302.568..402.818 rows=23737 loops=1)
   ->  Gather Merge  (cost=50608.83..51443.31 rows=7165 width=17) (actual time=302.567..372.246 rows=287761 loops=1)
    ->  Sort  (cost=49608.81..49616.27 rows=2985 width=17) (actual time=298.204..310.075 rows=95920 loops=3)
     ->  Nested Loop  (cost=2570.65..49436.52 rows=2985 width=17) (actual time=3.205..229.192 rows=95920 loops=3)
vs
->  Subquery Scan on "*SELECT* 3"  (cost=50541.84..51329.11 rows=5708 width=21) (actual time=302.042..336.820 rows=23737 loops=1)
  ->  Unique  (cost=50541.84..51272.03 rows=5708 width=17) (actual time=302.039..334.329 rows=23737 loops=1)
   ->  Gather Merge  (cost=50541.84..51229.22 rows=5708 width=17) (actual time=302.039..331.296 rows=24128 loops=1)
    ->  Unique  (cost=49541.81..49570.35 rows=2854 width=17) (actual time=298.771..320.560 rows=8043 loops=3)
     ->  Sort  (cost=49541.81..49548.95 rows=2854 width=17) (actual time=298.770..309.603 rows=95920 loops=3)
      ->  Nested Loop  (cost=2570.52..49378.01 rows=2854 width=17) (actual time=3.209..230.291 rows=95920 loops=3)

From the commit message and the discussion [1], it's not clear to me whether
this plan change is expected. Perhaps it's too minor an issue to bring attention
to, but maybe this information could be useful for v16 performance analysis.

[1] https://postgr.es/m/CAApHDvo8Lz2H=42urBbfP65LTcEUOh288MT7DsG2_EWtW1AXHQ@mail.gmail.com

Best regards,
Alexander
Attachments

Re: benchmark results comparing versions 15.2 and 16

From
David Rowley
Date:
On Thu, 11 May 2023 at 01:00, Alexander Lakhin <exclusion@gmail.com> wrote:
> This time `git bisect` pointed at 3c6fc5820. Having compared execution plans
> (both attached), I see the following differences (3c6fc5820~1 vs 3c6fc5820):

Based on what you've sent, I'm uninspired to want to try to do
anything about it.  The patched version finds a plan that's cheaper.
The row estimates are miles off with both plans. I'm not sure what
we're supposed to do here. It's pretty hard to make changes to the
planner's path generation without risking that a bad plan is chosen
when it wasn't beforehand with bad row estimates.

Is the new plan still slower if you increase work_mem so that the sort
no longer goes to disk?  Maybe the planner would have picked Hash
Aggregate if the row estimates had been such that cost_tuplesort()
knew that the sort would have gone to disk.

David



Re: benchmark results comparing versions 15.2 and 16

From
Michael Paquier
Date:
On Tue, May 09, 2023 at 09:48:24AM -0700, Andres Freund wrote:
> On 2023-05-08 12:11:17 -0700, Andres Freund wrote:
>> I can reproduce a significant regression due to f193883fc of a workload just
>> running
>> SELECT CURRENT_TIMESTAMP;
>>
>> A single session running it on my workstation via pgbench -Mprepared gets
>> before:
>> tps = 89359.128359 (without initial connection time)
>> after:
>> tps = 83843.585152 (without initial connection time)
>>
>> Obviously this is an extreme workload, but that nevertheless seems too large
>> to just accept...
>
> Added an open item for this.

Thanks for the report, I'll come back to it and look at it at the
beginning of next week.  In the worst case, that would mean a revert
of this refactoring, I assume.
--
Michael

Attachments

Re: benchmark results comparing versions 15.2 and 16

From
Alexander Lakhin
Date:
11.05.2023 01:27, David Rowley wrote:
> On Thu, 11 May 2023 at 01:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>> This time `git bisect` pointed at 3c6fc5820. Having compared execution plans
>> (both attached), I see the following differences (3c6fc5820~1 vs 3c6fc5820):
> Based on what you've sent, I'm uninspired to want to try to do
> anything about it.  The patched version finds a plan that's cheaper.
> The row estimates are miles off with both plans.

I've made sure that s64da-benchmark performs analyze before running the
queries (pg_class.reltuples fields for tables in question contain actual
counts), so it seems that nothing can be done on the benchmark side to
improve those estimates.

> ... It's pretty hard to make changes to the
> planner's path generation without risking that a bad plan is chosen
> when it wasn't beforehand with bad row estimates.

Yeah, I see. It's also interesting to me which tests perform better after
that commit. It takes several hours to run all tests, so I can't present
results quickly, but I'll try to collect this information next week.

> Is the new plan still slower if you increase work_mem so that the sort
> no longer goes to disk?  Maybe the planner would have picked Hash
> Aggregate if the row estimates had been such that cost_tuplesort()
> knew that the sort would have gone to disk.

Yes, increasing work_mem to 50MB doesn't affect the plans (new plans
attached), though the sort method changed to quicksort. The former plan is
still executed slightly faster.

Best regards,
Alexander
Attachments

Re: benchmark results comparing versions 15.2 and 16

From
Michael Paquier
Date:
On Thu, May 11, 2023 at 01:28:40PM +0900, Michael Paquier wrote:
> On Tue, May 09, 2023 at 09:48:24AM -0700, Andres Freund wrote:
>> On 2023-05-08 12:11:17 -0700, Andres Freund wrote:
>>> I can reproduce a significant regression due to f193883fc of a workload just
>>> running
>>> SELECT CURRENT_TIMESTAMP;
>>>
>>> A single session running it on my workstation via pgbench -Mprepared gets
>>> before:
>>> tps = 89359.128359 (without initial connection time)
>>> after:
>>> tps = 83843.585152 (without initial connection time)
>>>
>>> Obviously this is an extreme workload, but that nevertheless seems too large
>>> to just accept...

"Extreme" is fitting for a worst-case scenario.  Looking at my notes
from a few months back, that's kind of what I did on my laptop, which
was the only machine I had at hand back then:
- Compilation of code with -O2.
- Prepared statement of a simple SELECT combined with a DO block
that executed the query in a loop a few million times on a single
backend:
PREPARE test AS SELECT CURRENT_TIMESTAMP;
DO $$
BEGIN
  FOR i IN 1..10000000
  LOOP
    EXECUTE 'EXECUTE test';
  END LOOP;
END $$;
- The second test is mentioned at [1], with a generate_series() on a
keyword.
- And actually I recall some pgbench runs similar to that..  But I
don't have any traces of that in my notes.

This was not showing much difference, and it does not now, either.
Funny thing is that the pre-patch period was showing signs of being a
bit slower in this environment.  Anyway, I have just tested the DO
case in a second "bigbox" environment that I have set up for
benchmarking a few days ago, and the DO test is showing me a 1%~1.5%
regression in runtime.  That's repeatable so that's not noise.

I have re-run a bit more pgbench (1 client, prepared query with a
single SELECT on a SQL keyword, etc.).  And, TBH, I am not seeing as
much difference as you do (nothing with default pgbench setup, FWIW),
still that's able to show a bit more difference than the other two
cases.  HEAD shows me an average output close to 43900 TPS (3 runs of
60s each, for instance), while relying on SQLValueFunction shows an
average of 45000 TPS.  That amounts to a ~2.4% regression in output
here on bigbox based on these numbers.  Not a regression as high as
mentioned above, but still visible.

>> Added an open item for this.
>
> Thanks for the report, I'll come back to it and look at it at the
> beginning of next week.  In the worst case, that would mean a revert
> of this refactoring, I assume.

So, this involves commits 7aa81c6, f193883 and fb32748.  7aa81c6 has
added some tests, so I would keep the tests it added on HEAD, as
precision was not previously tested for the SQL keywords it added
cover for.

fb32748 has removed the dependency of SQLValueFunction on collations
by making the name functions use COERCE_SQL_SYNTAX.  One case where
these could be heavily used is RLS, for example, so that could be
visible.  f193883 has removed the rest and the timestamp keywords.

I am not going to let that hang in the air with beta1 getting released
next week, though, so attached are two patches to revert the change
(these would be combined, just posted this way for clarity).  The only
conflict I could see is caused by the query jumbling, where a
SQLValueFunction needs "typmod" and "op" in the computation, ignoring
the rest, so the node definition in primnodes.h gains the required
node_attr.  Making the execution path cheaper while avoiding the
collation tweaks for SQLValueFunction would be nice, but not at this
time of the year on this branch.

[1]: https://www.postgresql.org/message-id/Y0+dHDYA46UnnLs/@paquier.xyz
--
Michael

Attachments

Re: benchmark results comparing versions 15.2 and 16

From
Andres Freund
Date:
Hi,

On 2023-05-15 14:20:24 +0900, Michael Paquier wrote:
> On Thu, May 11, 2023 at 01:28:40PM +0900, Michael Paquier wrote:
> > On Tue, May 09, 2023 at 09:48:24AM -0700, Andres Freund wrote:
> >> On 2023-05-08 12:11:17 -0700, Andres Freund wrote:
> >>> I can reproduce a significant regression due to f193883fc of a workload just
> >>> running
> >>> SELECT CURRENT_TIMESTAMP;
> >>> 
> >>> A single session running it on my workstation via pgbench -Mprepared gets
> >>> before:
> >>> tps = 89359.128359 (without initial connection time)
> >>> after:
> >>> tps = 83843.585152 (without initial connection time)
> >>> 
> >>> Obviously this is an extreme workload, but that nevertheless seems too large
> >>> to just accept...
> 
> Extreme is adapted for a worst-case scenario.  Looking at my notes
> from a few months back, that's kind of what I did on my laptop, which
> was the only machine I had at hand back then:
> - Compilation of code with -O2.

I assume without assertions as well?


> I have re-run a bit more pgbench (1 client, prepared query with a
> single SELECT on a SQL keyword, etc.).  And, TBH, I am not seeing as
> much difference as you do (nothing with default pgbench setup, FWIW),
> still that's able to show a bit more difference than the other two
> cases.

> HEAD shows me an average output close to 43900 TPS (3 run of
> 60s each, for instance), while relying on SQLValueFunction shows an
> average of 45000TPS.  That counts for ~2.4% output regression here
> on bigbox based on these numbers.  Not a regression as high as
> mentioned above, still that's visible.

45k seems too low for a modern machine, given that I get > 80k in such a
workload, on a workstation with server CPUs (i.e. many cores, but not that
fast individually).  Hence wondering about assertions being enabled...

I get quite variable performance if I don't pin client / server to the same
core, but even the slow performance is faster than 45k.

Greetings,

Andres Freund



Re: benchmark results comparing versions 15.2 and 16

From
Michael Paquier
Date:
On Mon, May 15, 2023 at 05:14:47PM -0700, Andres Freund wrote:
> On 2023-05-15 14:20:24 +0900, Michael Paquier wrote:
>> On Thu, May 11, 2023 at 01:28:40PM +0900, Michael Paquier wrote:
>> Extreme is adapted for a worst-case scenario.  Looking at my notes
>> from a few months back, that's kind of what I did on my laptop, which
>> was the only machine I had at hand back then:
>> - Compilation of code with -O2.
>
> I assume without assertions as well?

Yup, no assertions.

> 45k seems too low for a modern machine, given that I get > 80k in such a
> workload, on a workstation with server CPUs (i.e. many cores, but not that
> fast individually).  Hence wondering about assertions being enabled...

Nope, disabled.

> I get quite variable performance if I don't pin client / server to the same
> core, but even the slow performance is faster than 45k.

Okay.  You mean with something like taskset or similar, I guess?
--
Michael

Attachments

Re: benchmark results comparing versions 15.2 and 16

From
Andres Freund
Date:
Hi,

On 2023-05-16 09:42:31 +0900, Michael Paquier wrote:
> > I get quite variable performance if I don't pin client / server to the same
> > core, but even the slow performance is faster than 45k.
> 
> Okay.  You mean with something like taskset or similar, I guess?

Yes. numactl --physcpubind ... in my case.  Linux has an optimization where it
does not need to send an IPI when the client and server are scheduled on the
same core. For single threaded ping-pong tasks like pgbench -c1, that can make
a huge difference, particularly on larger CPUs.  So you get a lot better
performance when forcing things to be colocated.
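A sketch of that pinning, assuming numactl is installed and CPU 0 is usable; the actual commands are illustrative and shown commented out because they need a live cluster (the data directory, pgbench options, and database name are placeholders):

```shell
# Pin both the server and the pgbench client to the same CPU so the
# kernel can skip cross-core IPIs on each request/response round trip.
CPU=0
# numactl --physcpubind="$CPU" pg_ctl -D "$PGDATA" -l logfile start
# numactl --physcpubind="$CPU" pgbench -n -Mprepared -c1 -T60 bench
# (taskset -c "$CPU" ... works the same way)
echo "pin client and server to CPU $CPU"
```

Without the pinning, run-to-run variance for a -c1 ping-pong workload can be of the same order as the few-percent regressions being chased here, which matches the variability described above.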

Greetings,

Andres Freund



Re: benchmark results comparing versions 15.2 and 16

From
Michael Paquier
Date:
On Mon, May 15, 2023 at 05:54:53PM -0700, Andres Freund wrote:
> Yes. numactl --physcpubind ... in my case.  Linux has an optimization where it
> does not need to send an IPI when the client and server are scheduled on the
> same core. For single threaded ping-pong tasks like pgbench -c1, that can make
> a huge difference, particularly on larger CPUs.  So you get a lot better
> performance when forcing things to be colocated.

Yes, that's not bringing the numbers higher with the simple cases I
reported previously, either.

Anyway, even if I cannot see such a high difference, I don't see how
to bring back the original numbers you are reporting without doing
more inlining and tying COERCE_SQL_SYNTAX more tightly within the
executor's portions for the FuncExprs, and there are the collation
assumptions as well.  Perhaps that's not the correct thing to do with
SQLValueFunction remaining around, but nothing can be done for v16, so
I am planning to just revert the change before beta1, and look at it
again later, from scratch.
--
Michael

Attachments

Re: benchmark results comparing versions 15.2 and 16

From
Alexander Lakhin
Date:
15.05.2023 08:20, Michael Paquier wrote:
> I am not going to let that hang in the air with beta1 getting released
> next week, though, so attached are two patches to revert the change
> (these would be combined, just posted this way for clarity).

I can confirm that the patches improve (restore) performance for my test:
pgbench -i benchdb
pgbench -c 10 -T 300 benchdb

tps (over three runs):
HEAD (08c45ae23):
10238.441580, 10697.202119, 10706.764703

HEAD with the patches:
11134.510118, 11176.554996, 11150.338488

f193883fc~1 (240e0dbac)
11082.561388, 11233.604446, 11087.071768

15.3 (8382864eb)
11328.699555, 11128.057883, 11057.934392

Best regards,
Alexander



Re: benchmark results comparing versions 15.2 and 16

From
Michael Paquier
Date:
On Tue, May 16, 2023 at 06:00:00PM +0300, Alexander Lakhin wrote:
> I can confirm that the patches improve (restore) performance for my test:
> pgbench -i benchdb
> pgbench -c 10 -T 300 benchdb

Thanks for running these!

> tps (over three runs):
> HEAD (08c45ae23):
> 10238.441580, 10697.202119, 10706.764703
>
> HEAD with the patches:
> 11134.510118, 11176.554996, 11150.338488
>
> f193883fc~1 (240e0dbac)
> 11082.561388, 11233.604446, 11087.071768
>
> 15.3 (8382864eb)
> 11328.699555, 11128.057883, 11057.934392

The numbers between f193883fc~1 and HEAD+patch are close to each
other.  It does not seem to make the whole difference with 15.3, but
most of it.  The difference can also be explained with some noise,
based on the number patterns of the third runs?

I have now applied the revert, ready for beta1.  Thanks for the
feedback!
--
Michael

Attachments

Re: benchmark results comparing versions 15.2 and 16

From
Alexander Lakhin
Date:
17.05.2023 04:25, Michael Paquier wrote:
> The numbers between f193883fc~1 and HEAD+patch are close to each
> other.  It does not seem to make the whole difference with 15.3, but
> most of it.  The difference can also be explained with some noise,
> based on the number patterns of the third runs?
>
> I have now applied the revert, ready for beta1.  Thanks for the
> feedback!


Thank you for paying attention to it!

Yes, I ran the benchmark on my workstation, so the numbers could vary due to parallel
activity. Now I've compared 15.3 (8382864eb) with d8c3106bb and 1d369c9e9,
this time with the CPU boost mode disabled:
1d369c9e9:
10007.130326, 10047.722122, 9920.612426, 10016.053846, 10060.606408
d8c3106bb:
10492.100485, 10505.326827, 10535.918137, 10625.904871, 10573.608859
15.3:
10458.752330, 10308.677192, 10366.496526, 10489.395275, 10319.458041

Best "1d369c9e9" is worse than best "15.3" by 4.1 percent (10060.61 < 10489.40)
Average "1d369c9e9" is worse than average "15.3" by 3.6 percent (10010.43 < 10388.56)

Best "d8c3106bb" is better than best "15.3" by 1.3 percent (10625.90 > 10489.40)
Average "d8c3106bb" is better than average "15.3" by 1.5 percent (10546.57 > 10388.56)
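The best/average comparisons can be recomputed from the raw tps runs with a small script (a sketch of mine, not part of the thread's benchmark tooling):

```python
# Recompute the best/average percent comparisons from the raw tps runs above.
def pct_diff(a, b):
    """Percent by which a falls short of b (negative means a exceeds b)."""
    return (b - a) / b * 100

runs = {
    "1d369c9e9": [10007.130326, 10047.722122, 9920.612426, 10016.053846, 10060.606408],
    "d8c3106bb": [10492.100485, 10505.326827, 10535.918137, 10625.904871, 10573.608859],
    "15.3":      [10458.752330, 10308.677192, 10366.496526, 10489.395275, 10319.458041],
}

base_best = max(runs["15.3"])
base_avg = sum(runs["15.3"]) / len(runs["15.3"])

# 1d369c9e9 is worse: ~4.1% on best-of-run, ~3.6% on average
print(round(pct_diff(max(runs["1d369c9e9"]), base_best), 1))
print(round(pct_diff(sum(runs["1d369c9e9"]) / 5, base_avg), 1))
# d8c3106bb is better: ~1.3% on best-of-run, ~1.5% on average
print(round(-pct_diff(max(runs["d8c3106bb"]), base_best), 1))
print(round(-pct_diff(sum(runs["d8c3106bb"]) / 5, base_avg), 1))
```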

So it seems that there is nothing left on this plate.

Best regards,
Alexander

Re: benchmark results comparing versions 15.2 and 16

From
MARK CALLAGHAN
Date:


On Tue, May 9, 2023 at 10:36 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:


On Fri, May 5, 2023 at 10:01 PM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
I have two more runs of the benchmark in progress so we will have 3 results for each of the test cases to confirm that the small regressions are repeatable.

I repeated the benchmark a few times using a more recent PG16 build (git sha 08c45ae2) and have yet to see any significant changes. So that is good news. My testing scripts have been improved so I should be able to finish the next round of tests in less time.

--
Mark Callaghan
mdcallag@gmail.com

Re: benchmark results comparing versions 15.2 and 16

From
Andres Freund
Date:
Hi,

On 2023-05-19 15:44:09 -0700, MARK CALLAGHAN wrote:
> On Tue, May 9, 2023 at 10:36 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> 
> >
> >
> > On Fri, May 5, 2023 at 10:01 PM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> >
> >> I have two more runs of the benchmark in progress so we will have 3
> >> results for each of the test cases to confirm that the small regressions
> >> are repeatable.
> >>
> >
> I repeated the benchmark a few times using a more recent PG16 build (git
> sha 08c45ae2) and have yet to see any significant changes. So that is good
> news. My testing scripts have been improved so I should be able to finish
> the next round of tests in less time.

With "yet to see any significant changes" do you mean that the runs are
comparable with earlier runs, showing the same regression? Or that the
regression vanished? Or ...?

Greetings,

Andres Freund



Re: benchmark results comparing versions 15.2 and 16

From
MARK CALLAGHAN
Date:
On Fri, May 19, 2023 at 4:04 PM Andres Freund <andres@anarazel.de> wrote:
> With "yet to see any significant changes" do you mean that the runs are
> comparable with earlier runs, showing the same regression? Or that the
> regression vanished? Or ...?

I mean that I might be chasing noise: so far the mean+stddev for throughput in version 16 pre-beta appear similar to 15.2. When I ran the insert benchmark a few times, I focused on the cases where 16 pre-beta was worse than 15.2 while ignoring the cases where it was better. Big regressions are easy to document; small ones, not so much.

Regardless, I am repeating tests from both the insert benchmark and sysbench for version 16 (pre-beta, and soon beta1).

Re: benchmark results comparing versions 15.2 and 16

From
MARK CALLAGHAN
Date:
I ran sysbench on Postgres 15.2, 15.3 and 16 prebeta at git sha 1c006c0 (built on May 19).
The workload was in-memory on a small server (8 cores, 16G RAM) and used 1 connection (no concurrency).
For some details on past benchmarks like this see: http://smalldatum.blogspot.com/2023/03/searching-for-performance-regressions.html

My focus is on changes >= 10%, that is a relative throughput <= 0.90 or >= 1.10.
I used 3 builds of Postgres that I call def, o2_nofp and o3_nofp, and ran the benchmark once per build. The results for each build are similar,
so I only share the o2_nofp results here.
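The >= 10% filter on relative throughput can be sketched as follows (a hypothetical helper of mine, not part of the sysbench scripts; the tps values are made-up placeholders, except that the ~1.54x entry mirrors the read-only.pre_range=10000 result in the table below):

```python
# Compute throughput relative to a base build and flag changes >= 10%,
# i.e. a relative value <= 0.90 or >= 1.10.
def flag_changes(base_tps, new_tps, threshold=0.10):
    flagged = {}
    for name, base in base_tps.items():
        rel = new_tps[name] / base
        if rel <= 1 - threshold or rel >= 1 + threshold:
            flagged[name] = round(rel, 2)
    return flagged

# Placeholder tps numbers for illustration only.
base = {"read-only.pre_range=10000": 1000.0, "scan_range=100": 500.0}
pg16 = {"read-only.pre_range=10000": 1540.0, "scan_range=100": 495.0}
print(flag_changes(base, pg16))  # only the ~1.54x microbenchmark is flagged
```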

Good news, which I have not fully explained ...

One of the microbenchmarks gets ~1.5X more transactions/second (TPS) in PG 16 prebeta vs 15.2 and 15.3 for a read-only transaction that does:
* 2 select statements that scan 10k rows from an index (these are Q1 and Q3 below and are slower in PG 16)
* 2 select statements that scan 10k rows from an index and do aggregation (these are Q2 and Q4 below and are a lot faster in PG 16)

The speedup for Q2 and Q4 is larger than the slowdown for Q1/Q3, so TPS is ~1.5X higher for PG 16.
Query plans don't appear to have changed. I assume some code got slower and some got faster for the same plan.

The microbenchmarks in question are read-only_range=10000 and read-only.pre_range=10000.
Each of these microbenchmarks runs a read-only transaction with 4 SQL statements (shown below as Q1 through Q4).

read-only.pre_range runs before a large number of writes, so the b-tree will be more read-friendly.
read-only_range runs after a large number of writes.

The =10000 means that each SQL statement processes 10000 rows. Note that the microbenchmarks are also run with =100 and =10,
and for those the performance with PG16 is similar to 15.x rather than ~1.5X faster.

---

This table shows throughput relative to the base case. The base case is PG 15.2 with the o2_nofp build.
A relative throughput < 1.0 means performance regressed; > 1.0 means it improved.

col-1 : PG 15.3 with the o2_nofp build
col-2 : PG 16 prebeta build on May 19 at git sha 1c006c0 with the o2_nofp build

col-1   col-2
0.99    1.03    hot-points_range=100
1.02    1.05    point-query.pre_range=100
1.06    1.10    point-query_range=100
0.97    1.01    points-covered-pk.pre_range=100
0.98    1.02    points-covered-pk_range=100
0.98    1.01    points-covered-si.pre_range=100
1.00    1.00    points-covered-si_range=100
1.00    1.01    points-notcovered-pk.pre_range=100
1.00    1.01    points-notcovered-pk_range=100
1.01    1.03    points-notcovered-si.pre_range=100
1.01    1.01    points-notcovered-si_range=100
1.00    0.99    random-points.pre_range=1000
1.00    1.02    random-points.pre_range=100
1.01    1.01    random-points.pre_range=10
1.01    1.00    random-points_range=1000
1.01    1.01    random-points_range=100
1.02    1.01    random-points_range=10
1.00    1.00    range-covered-pk.pre_range=100
1.00    1.00    range-covered-pk_range=100
1.00    0.99    range-covered-si.pre_range=100
1.00    0.99    range-covered-si_range=100
1.03    1.01    range-notcovered-pk.pre_range=100
1.02    1.00    range-notcovered-pk_range=100
1.01    1.01    range-notcovered-si.pre_range=100
1.01    1.01    range-notcovered-si_range=100
1.04    1.54    read-only.pre_range=10000                       <<<<<<<<<<
1.00    1.00    read-only.pre_range=100
1.01    1.01    read-only.pre_range=10
1.03    1.45    read-only_range=10000                            <<<<<<<<<<
1.01    1.01    read-only_range=100
1.04    1.00    read-only_range=10
1.00    0.99    scan_range=100
1.00    1.02    delete_range=100
1.01    1.02    insert_range=100
1.01    1.00    read-write_range=100
1.01    0.98    read-write_range=10
1.01    1.01    update-index_range=100
1.00    1.00    update-inlist_range=100
1.02    1.02    update-nonindex_range=100
1.03    1.03    update-one_range=100
1.02    1.02    update-zipf_range=100
1.03    1.03    write-only_range=10000

---

The read-only transaction has 4 SQL statements. I ran explain analyze for each of them, first with a range scan that fetches 10k rows and then with 100k rows.
The 10k result is similar to what was done above; I added the 100k result to see whether the perf difference changes with more rows.

In each case there are two "Execution Time" entries. The top one is from PG 15.2 and the bottom one from PG 16 prebeta.

Summary:
* Queries that do a sort show the largest improvement in PG 16 (Q2, Q4)
* Queries that don't do a sort are slower in PG 16 (Q1, Q3)

Q1.10k: explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000;

 Execution Time: 4.222 ms

 Execution Time: 6.243 ms


Q1.100k: explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10100000;

 Execution Time: 36.508 ms

 Execution Time: 49.344 ms


Q2.10k: explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 order by c;

 Execution Time: 38.224 ms

 Execution Time: 15.700 ms


Q2.100k: explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10100000 order by c;

 Execution Time: 392.380 ms

 Execution Time: 219.022 ms


Q3.10k: explain analyze SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000;

 Execution Time: 3.660 ms

 Execution Time: 3.994 ms


Q3.100k: explain analyze SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 10000000 AND 10100000;

Execution Time: 35.917 ms

 Execution Time: 39.055 ms


Q4.10k: explain analyze SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 ORDER BY c;

 Execution Time: 29.998 ms

 Execution Time: 18.877 ms


Q4.100k: explain analyze SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 ORDER BY c;

 Execution Time: 29.272 ms

 Execution Time: 18.265 ms


---

Finally, the queries with full explain analyze output. Each section has two results -- first for PG 15.3, second for PG 16

--- Q1.10k : explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000;
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..525.86 rows=8971 width=121) (actual time=0.061..3.676 rows=10001 loops=1)
   Index Cond: ((id >= 10000000) AND (id <= 10010000))
 Planning Time: 0.034 ms
 Execution Time: 4.222 ms
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..589.80 rows=10068 width=121) (actual time=0.094..5.456 rows=10001 loops=1)
   Index Cond: ((id >= 10000000) AND (id <= 10010000))
 Planning Time: 0.063 ms
 Execution Time: 6.243 ms

--- Q1.100k : explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10100000;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..5206.44 rows=89700 width=121) (actual time=0.017..31.166 rows=100001 loops=1)
   Index Cond: ((id >= 10000000) AND (id <= 10100000))
 Planning Time: 0.024 ms
 Execution Time: 36.508 ms
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..5845.86 rows=100671 width=121) (actual time=0.029..42.285 rows=100001 loops=1)
   Index Cond: ((id >= 10000000) AND (id <= 10100000))
 Planning Time: 0.061 ms
 Execution Time: 49.344 ms

--- Q2.10k : explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 order by c;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1114.85..1137.28 rows=8971 width=121) (actual time=36.561..37.429 rows=10001 loops=1)
   Sort Key: c
   Sort Method: quicksort  Memory: 2025kB
   ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..525.86 rows=8971 width=121) (actual time=0.022..3.776 rows=10001 loops=1)
         Index Cond: ((id >= 10000000) AND (id <= 10010000))
 Planning Time: 0.059 ms
 Execution Time: 38.224 ms
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1259.19..1284.36 rows=10068 width=121) (actual time=14.419..15.042 rows=10001 loops=1)
   Sort Key: c
   Sort Method: quicksort  Memory: 1713kB
   ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..589.80 rows=10068 width=121) (actual time=0.023..3.473 rows=10001 loops=1)
         Index Cond: ((id >= 10000000) AND (id <= 10010000))
 Planning Time: 0.049 ms
 Execution Time: 15.700 ms

--- Q2.100k : explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10100000 order by c;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=18413.03..18637.28 rows=89700 width=121) (actual time=300.717..385.193 rows=100001 loops=1)
   Sort Key: c
   Sort Method: external merge  Disk: 12848kB
   ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..5206.44 rows=89700 width=121) (actual time=0.028..29.590 rows=100001 loops=1)
         Index Cond: ((id >= 10000000) AND (id <= 10100000))
 Planning Time: 0.048 ms
 Execution Time: 392.380 ms
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=20749.26..21000.94 rows=100671 width=121) (actual time=154.969..211.572 rows=100001 loops=1)
   Sort Key: c
   Sort Method: external merge  Disk: 12240kB
   ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..5845.86 rows=100671 width=121) (actual time=0.026..34.278 rows=100001 loops=1)
         Index Cond: ((id >= 10000000) AND (id <= 10100000))
 Planning Time: 0.034 ms
 Execution Time: 219.022 ms

--- Q3.10k : explain analyze SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=548.29..548.30 rows=1 width=8) (actual time=3.645..3.646 rows=1 loops=1)
   ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..525.86 rows=8971 width=4) (actual time=0.024..2.587 rows=10001 loops=1)
         Index Cond: ((id >= 10000000) AND (id <= 10010000))
 Planning Time: 0.036 ms
 Execution Time: 3.660 ms
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=614.97..614.98 rows=1 width=8) (actual time=3.980..3.980 rows=1 loops=1)
   ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..589.80 rows=10068 width=4) (actual time=0.024..2.993 rows=10001 loops=1)
         Index Cond: ((id >= 10000000) AND (id <= 10010000))
 Planning Time: 0.038 ms
 Execution Time: 3.994 ms

--- Q3.100k : explain analyze SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 10000000 AND 10100000;
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=5430.69..5430.70 rows=1 width=8) (actual time=35.901..35.902 rows=1 loops=1)
   ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..5206.44 rows=89700 width=4) (actual time=0.017..25.256 rows=100001 loops=1)
         Index Cond: ((id >= 10000000) AND (id <= 10100000))
 Planning Time: 0.032 ms
 Execution Time: 35.917 ms
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6097.53..6097.55 rows=1 width=8) (actual time=39.034..39.035 rows=1 loops=1)
   ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..5845.86 rows=100671 width=4) (actual time=0.018..29.291 rows=100001 loops=1)
         Index Cond: ((id >= 10000000) AND (id <= 10100000))
 Planning Time: 0.051 ms
 Execution Time: 39.055 ms

--- Q4.10k : explain analyze SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 ORDER BY c;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1114.85..1159.71 rows=8971 width=121) (actual time=26.335..29.435 rows=10001 loops=1)
   ->  Sort  (cost=1114.85..1137.28 rows=8971 width=121) (actual time=26.333..27.085 rows=10001 loops=1)
         Sort Key: c
         Sort Method: quicksort  Memory: 2025kB
         ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..525.86 rows=8971 width=121) (actual time=0.021..2.968 rows=10001 loops=1)
               Index Cond: ((id >= 10000000) AND (id <= 10010000))
 Planning Time: 0.052 ms
 Execution Time: 29.998 ms
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1259.19..1309.53 rows=10068 width=121) (actual time=14.203..18.318 rows=10001 loops=1)
   ->  Sort  (cost=1259.19..1284.36 rows=10068 width=121) (actual time=14.200..14.978 rows=10001 loops=1)
         Sort Key: c
         Sort Method: quicksort  Memory: 1713kB
         ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..589.80 rows=10068 width=121) (actual time=0.030..3.475 rows=10001 loops=1)
               Index Cond: ((id >= 10000000) AND (id <= 10010000))
 Planning Time: 0.053 ms
 Execution Time: 18.877 ms

--- Q4.100k : explain analyze SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 ORDER BY c;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1114.85..1159.71 rows=8971 width=121) (actual time=25.567..28.709 rows=10001 loops=1)
   ->  Sort  (cost=1114.85..1137.28 rows=8971 width=121) (actual time=25.565..26.320 rows=10001 loops=1)
         Sort Key: c
         Sort Method: quicksort  Memory: 2025kB
         ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..525.86 rows=8971 width=121) (actual time=0.025..2.926 rows=10001 loops=1)
               Index Cond: ((id >= 10000000) AND (id <= 10010000))
 Planning Time: 0.052 ms
 Execution Time: 29.272 ms
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1259.19..1309.53 rows=10068 width=121) (actual time=13.620..17.714 rows=10001 loops=1)
   ->  Sort  (cost=1259.19..1284.36 rows=10068 width=121) (actual time=13.618..14.396 rows=10001 loops=1)
         Sort Key: c
         Sort Method: quicksort  Memory: 1713kB
         ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..589.80 rows=10068 width=121) (actual time=0.024..3.478 rows=10001 loops=1)
               Index Cond: ((id >= 10000000) AND (id <= 10010000))
 Planning Time: 0.039 ms
 Execution Time: 18.265 ms

--
Mark Callaghan
mdcallag@gmail.com

Re: benchmark results comparing versions 15.2 and 16

From
MARK CALLAGHAN
Date:
Results for 16 beta1 are similar to what I shared above:
* no regressions
* a few queries are >= 1.5 times faster, which makes the read-only transaction >= 1.5 times faster

http://smalldatum.blogspot.com/2023/05/postgres-16beta1-looks-good-vs-sysbench.html

--
Mark Callaghan
mdcallag@gmail.com

Re: benchmark results comparing versions 15.2 and 16

From
David Rowley
Date:
On Tue, 23 May 2023 at 07:40, MARK CALLAGHAN <mdcallag@gmail.com> wrote:

(pg15)

> --- Q2.10k : explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 order by c;
>                                                               QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=1114.85..1137.28 rows=8971 width=121) (actual time=36.561..37.429 rows=10001 loops=1)
>    Sort Key: c
>    Sort Method: quicksort  Memory: 2025kB
>    ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..525.86 rows=8971 width=121) (actual time=0.022..3.776 rows=10001 loops=1)
>          Index Cond: ((id >= 10000000) AND (id <= 10010000))
>  Planning Time: 0.059 ms
>  Execution Time: 38.224 ms

(pg16 b1)

>                                                                QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=1259.19..1284.36 rows=10068 width=121) (actual time=14.419..15.042 rows=10001 loops=1)
>    Sort Key: c
>    Sort Method: quicksort  Memory: 1713kB
>    ->  Index Scan using sbtest1_pkey on sbtest1  (cost=0.44..589.80 rows=10068 width=121) (actual time=0.023..3.473 rows=10001 loops=1)
>          Index Cond: ((id >= 10000000) AND (id <= 10010000))
>  Planning Time: 0.049 ms
>  Execution Time: 15.700 ms

It looks like the improvements here are due to qsort being faster on
v16.  To get an idea of the time taken to perform the actual qsort,
you can't use the first row time vs last row time in the Sort node, as
we must (obviously) have performed the sort before outputting the
first row.  I think you could get a decent idea of the time taken to
perform the qsort by subtracting the actual time for the final Index
scan row from the actual time for the Sort's first row.  That's 36.561
- 3.776 = 32.785 ms for pg15's plan, and 14.419 - 3.473 = 10.946 ms
for pg16 beta1's.

c6e0fe1f2 might have helped improve some of that performance, but I
suspect there must be something else as ~3x seems much more than I'd
expect from reducing the memory overheads.  Testing versions before
and after that commit might give a better indication.
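The subtraction David describes can be reproduced directly from the EXPLAIN ANALYZE numbers quoted above (a small sketch; the numbers are the actual times from the two plans):

```python
# Estimate time spent inside the qsort: the Sort node's first-row actual
# time minus the Index Scan's last-row actual time, since the sort must
# finish before the Sort node can emit its first row.
def qsort_estimate_ms(sort_first_row_ms, scan_last_row_ms):
    return sort_first_row_ms - scan_last_row_ms

pg15 = qsort_estimate_ms(36.561, 3.776)   # ~32.785 ms
pg16 = qsort_estimate_ms(14.419, 3.473)   # ~10.946 ms
print(round(pg15, 3), round(pg16, 3), round(pg15 / pg16, 1))  # ratio ~3x
```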

David



Re: benchmark results comparing versions 15.2 and 16

From
Peter Geoghegan
Date:
On Sun, May 28, 2023 at 2:42 PM David Rowley <dgrowleyml@gmail.com> wrote:
> c6e0fe1f2 might have helped improve some of that performance, but I
> suspect there must be something else as ~3x seems much more than I'd
> expect from reducing the memory overheads.  Testing versions before
> and after that commit might give a better indication.

I'm virtually certain that this is due to the change in default
collation provider, from libc to ICU. Mostly due to the fact that ICU
is capable of using abbreviated keys, and  the system libc isn't
(unless you go out of your way to define TRUST_STRXFRM when building
Postgres).

Many individual test cases involving larger non-C collation text sorts
showed similar improvements back when I worked on this. Offhand, I
believe that 3x - 3.5x improvements in execution times were common
with high entropy abbreviated keys on high cardinality input columns
at that time (this was with glibc). Low cardinality inputs were more
like 2.5x.

I believe that ICU is faster than glibc in general -- even with
TRUST_STRXFRM enabled. But the TRUST_STRXFRM thing is bound to be the
most important factor here, by far.
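For readers unfamiliar with abbreviated keys, the idea can be sketched in a few lines (a toy model, not Postgres's implementation: real abbreviated keys are taken from strxfrm()/ICU sort-key blobs, and the sort falls back to a full comparison only when the abbreviated keys tie):

```python
# Toy sketch of abbreviated keys: sort primarily by a small fixed-size
# prefix of the binary sort key, so most comparisons touch only 8 bytes;
# ties fall back to the full key. Byte-wise UTF-8 order stands in here
# for a real collation's sort-key transform.
def abbrev(s, n=8):
    return s.encode("utf-8")[:n]

rows = ["pear", "apple", "peach", "apricot", "banana"]
# (abbreviated key, full key): the cheap prefix decides almost every pair.
result = sorted(rows, key=lambda s: (abbrev(s), s.encode("utf-8")))
print(result)
```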

--
Peter Geoghegan



Re: benchmark results comparing versions 15.2 and 16

From
MARK CALLAGHAN
Date:
Do you want me to try PG 16 without ICU or PG 15 with ICU?
I can do that, but it will take a few days before the server is available.

On Mon, May 29, 2023 at 9:55 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, May 28, 2023 at 2:42 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > c6e0fe1f2 might have helped improve some of that performance, but I
> > suspect there must be something else as ~3x seems much more than I'd
> > expect from reducing the memory overheads.  Testing versions before
> > and after that commit might give a better indication.
>
> I'm virtually certain that this is due to the change in default
> collation provider, from libc to ICU. Mostly due to the fact that ICU
> is capable of using abbreviated keys, and the system libc isn't
> (unless you go out of your way to define TRUST_STRXFRM when building
> Postgres).
>
> Many individual test cases involving larger non-C collation text sorts
> showed similar improvements back when I worked on this. Offhand, I
> believe that 3x - 3.5x improvements in execution times were common
> with high entropy abbreviated keys on high cardinality input columns
> at that time (this was with glibc). Low cardinality inputs were more
> like 2.5x.
>
> I believe that ICU is faster than glibc in general -- even with
> TRUST_STRXFRM enabled. But the TRUST_STRXFRM thing is bound to be the
> most important factor here, by far.
>
> --
> Peter Geoghegan


--
Mark Callaghan
mdcallag@gmail.com

Re: benchmark results comparing versions 15.2 and 16

From
Peter Geoghegan
Date:
Hi Mark,

On Tue, May 30, 2023 at 1:03 PM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> Do you want me to try PG 16 without ICU or PG 15 with ICU?
> I can do that, but it will take a few days before the server is available.

Sorry for the late reply. Most of the Postgres developers (myself
included) are attending pgCon right now.

It would be nice to ascertain just how much of a boost we're getting
from our use of ICU as far as sysbench goes. I'd appreciate having
that information. We discussed the choice of ICU as default collation
provider at yesterday's developer meeting:


https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Meeting#High_level_thoughts_and_feedback_on_moving_toward_ICU_as_the_preferred_collation_provider
https://wiki.postgresql.org/wiki/StateOfICU

Just confirming my theory about abbreviated keys (without rerunning
the benchmark) should be simple - perhaps that would be a useful place
to start. You could just rerun the two individual queries of interest
from an interactive psql session. There are even low-level debug
messages available through the trace_sort GUC. From a psql session
you'd run something along the lines of "set client_min_messages=log;
set trace_sort=on; $QUERY". You'll see lots of LOG messages with
specific information about the use of abbreviated keys and the
progress of each sort.

Thanks
--
Peter Geoghegan



Re: benchmark results comparing versions 15.2 and 16

From
Alexander Lakhin
Date:
Hello David,

11.05.2023 16:00, Alexander Lakhin wrote:
> Yeah, I see. It's also interesting to me, which tests perform better after
> that commit. It takes several hours to run all tests, so I can't present
> results quickly, but I'll try to collect this information next week.

To my regret, I could not find such tests that week, so I decided to try
again later, after reworking my benchmark-running infrastructure.

But for now, since the Postgres Pro company has graciously shared its
benchmarking infrastructure project [1], anyone can confirm or refute my
results. (You can also see which tests were performed and how.)

Having done one more round of comprehensive testing, I still couldn't find
tests that win from commit 3c6fc5820 (but I was reassured that
s64da_tpcds.query87 loses a little, and s64da_tpcds.query38 as well).
So it seems that the tests I performed, or their parameters, are not
representative enough for that improvement, unfortunately.

[1] https://github.com/postgrespro/pg-mark

Best regards,
Alexander