Re: Avoid stack frame setup in performance critical routines using tail calls

From	Andres Freund
Subject	Re: Avoid stack frame setup in performance critical routines using tail calls
Date	20 July 2021, 09:16:57
Msg-id	20210720061657.bcueir3krgmkt6m5@alap3.anarazel.de
In reply to	Re: Avoid stack frame setup in performance critical routines using tail calls (David Rowley <dgrowleyml@gmail.com>)
Responses	Re: Avoid stack frame setup in performance critical routines using tail calls (David Rowley <dgrowleyml@gmail.com>)
List	pgsql-hackers

Hi,

On 2021-07-20 16:50:09 +1200, David Rowley wrote:
> I've not taken the time to study the patch but I was running some
> other benchmarks today on a small scale pgbench readonly test and I
> took this patch for a spin to see if I could see the same performance
> gains.

Thanks!

> This is an AMD 3990x machine that seems to get the most throughput
> from pgbench with 132 processes
> 
> I did: pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared
> --random-seed=12345 postgres
> 
> master = dd498998a
> 
> Master: 3816959.53 tps
> Patched: 3820723.252 tps
> 
> I didn't quite get the same 2-3% as you did, but it did come out
> faster than on master.

It would not be at all surprising to me if AMD's recent microarchitectures did
a better job at removing stack management overhead (e.g. by better register
renaming, or by resolving dependencies on %rsp in a smarter way) than Intel's
do. My numbers were from a Cascade Lake CPU (Xeon 5215), which, despite being
released in 2019, is effectively a moderately polished (or maybe shoehorned)
microarchitecture from 2015 due to all the Intel troubles. Whereas Zen2 is
from 2019.

It's also possible that my attempts at avoiding the stack management just
didn't work with your compiler, whether due to vendor (I know that gcc is
better at it than clang), version, or compiler flags (e.g.
-fno-omit-frame-pointer could make it harder, -fno-optimize-sibling-calls
would disable it).
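
For readers following along, here is a minimal sketch of the code shape this
optimization depends on. It is not taken from the patch: DemoContext,
demo_alloc_impl, demo_alloc and demo_alloc_checked are hypothetical names made
up for illustration, and whether a jump is actually emitted depends on the
compiler, version and flags.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL's definitions. */
typedef struct DemoContext DemoContext;

struct DemoContext
{
	void	   *(*alloc) (DemoContext *cxt, size_t size);
	size_t		nallocs;		/* number of allocations made so far */
};

/* A trivial allocator implementation behind the function pointer. */
static void *
demo_alloc_impl(DemoContext *cxt, size_t size)
{
	cxt->nallocs++;
	return malloc(size);
}

/*
 * Tail position: the indirect call is the last action and its result is
 * returned unchanged. An optimizing compiler may turn this into a jump,
 * so the wrapper needs no stack frame of its own.
 */
void *
demo_alloc(DemoContext *cxt, size_t size)
{
	return cxt->alloc(cxt, size);
}

/*
 * Not a tail call: the result is inspected after the call returns, so the
 * wrapper has to stay on the stack until the callee is done.
 */
void *
demo_alloc_checked(DemoContext *cxt, size_t size)
{
	void	   *ret = cxt->alloc(cxt, size);

	if (ret == NULL)
		abort();
	return ret;
}
```

One way to check what a given compiler does is to build this with
optimization and -S, then look at whether demo_alloc ends in a jmp or in a
call followed by ret. Rebuilding with -fno-optimize-sibling-calls should show
the call/ret form.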

A third plausible explanation for the difference is that at a client count of
132, the bottlenecks are sufficiently elsewhere to just not show a meaningful
gain from memory management efficiency improvements.

Any chance you could show a `perf annotate AllocSetAlloc` and `perf annotate
palloc` from a patched run? And perhaps how high their percentages of the
total work are. E.g. using something like
perf report -g none|grep -E 'AllocSetAlloc|palloc|MemoryContextAlloc|pfree'

It'd be interesting to know where the bottlenecks on a Zen2 machine are.

Greetings,

Andres Freund

