Re: [HACKERS] Fix performance of generic atomics
From | Sokolov Yura |
---|---|
Subject | Re: [HACKERS] Fix performance of generic atomics |
Date | |
Msg-id | 9fccff0670a2ec3c031d459564892f42@postgrespro.ru |
Replies | Re: [HACKERS] Fix performance of generic atomics |
List | pgsql-hackers |
A slightly cleaner version of the patch.

Sokolov Yura wrote on 2017-05-25 15:22:
> Good day, everyone.
>
> I've been playing with pgbench on a huge machine
> (72 cores, 56 for postgresql, enough memory to fit the database
> both into shared_buffers and the file cache)
> (pgbench scale 500, unlogged tables, fsync=off,
> synchronous_commit=off, wal_writer_flush_after=0).
>
> With 200 clients, performance is around 76000tps, and the main
> bottleneck in this dumb test is LWLockWaitListLock.
>
> I added a gcc-specific implementation of pg_atomic_fetch_or_u32_impl
> (i.e. using __sync_fetch_and_or), and performance became 83000tps.
>
> That was a bit strange at first glance, because __sync_fetch_and_or
> compiles to almost the same CAS loop.
>
> Looking closely, I noticed that the intrinsic doesn't do the
> read in the loop body, but at loop initialization. This is correct
> behavior, because the `lock cmpxchg` instruction stores the old value
> in the EAX register.
>
> It is expected behavior, and pg_atomic_compare_exchange_*_impl does
> the same in all implementations. So there is no need to re-read
> the value in the loop body.
>
> Example diff for pg_atomic_exchange_u32_impl:
>
> static inline uint32
> pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
> {
>     uint32 old;
> +   old = pg_atomic_read_u32_impl(ptr);
>     while (true)
>     {
> -       old = pg_atomic_read_u32_impl(ptr);
>         if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
>             break;
>     }
>     return old;
> }
>
> After applying this change to all generic atomic functions
> (including pg_atomic_fetch_or_u32_impl), performance became
> equal to that of the __sync_fetch_and_or intrinsic.
>
> The attached patch contains this change for all generic atomic
> functions, and also __sync_fetch_and_(or|and) for gcc, because
> I believe GCC optimizes code around an intrinsic better than
> around inline assembler.
> (Final performance is around 86000tps, but the difference between
> 83000tps and 86000tps is not so clear-cut on a NUMA system.)
>
> With regards,

--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
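
For readers who want to try the idea outside the PostgreSQL tree, here is a minimal sketch using C11 <stdatomic.h> instead of the pg_atomic_* API. The function names fetch_or_reread and fetch_or_read_once are hypothetical and exist only for this illustration. It relies on the documented behavior of atomic_compare_exchange_weak, which writes the current value into its "expected" argument when the comparison fails, so the second variant needs only one explicit load.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical names for illustration; this is not PostgreSQL's API. */

/* Variant that re-reads the value at the top of every iteration. */
static inline uint32_t
fetch_or_reread(_Atomic uint32_t *ptr, uint32_t or_)
{
    uint32_t old;

    for (;;)
    {
        old = atomic_load(ptr);     /* extra load on every retry */
        if (atomic_compare_exchange_weak(ptr, &old, old | or_))
            break;
    }
    return old;
}

/* Variant that reads once before the loop. */
static inline uint32_t
fetch_or_read_once(_Atomic uint32_t *ptr, uint32_t or_)
{
    uint32_t old = atomic_load(ptr);

    /* A failed CAS has already stored the current value in 'old'. */
    while (!atomic_compare_exchange_weak(ptr, &old, old | or_))
        ;
    return old;
}
```

Both variants return the value seen before the OR was applied; they differ only in how many loads are issued under contention. Whether that shows up in a benchmark depends on the workload and the hardware.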