Hi,
This started as a reply to
https://postgr.es/m/CAH2-WzkbCvgKrmw%2Bf%2B1hwgXhmiv%2BUNRihotALXftUiNr%3D3VUKA%40mail.gmail.com
but after typing for a while I decided that it's large and unrelated
enough to be better handled as a new thread.
On 2020-11-27 11:08:49 -0800, Peter Geoghegan wrote:
> We've made LWLocks much more scalable in the last 10 years. Why
> shouldn't we expect to do the same again in the next 10 years? I
> wouldn't bet against it. I might even do the opposite (predict further
> improvements to LWLocks).
I've done a bunch of benchmarking that shows clear areas of
improvements:
- For LWLockWaitForVar() / LWLockUpdateVar() we take the wait list lock
- largely unnecessarily in a lot of cases. On platforms that have
atomic 8-byte reads/writes we don't need the lock to read/write *valptr. Nor
do we need to look at the list of waiters in LWLockUpdateVar() if
LW_FLAG_HAS_WAITERS is not set. WAL insertion is a fairly hot path,
so it'd be nice to improve this.
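To illustrate the fast path (field names and flag values here are made
up for the sketch, not the actual lwlock.c layout):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LW_FLAG_HAS_WAITERS (1u << 30)  /* illustrative flag bit */

typedef struct LWLockSketch
{
    _Atomic uint32_t state;  /* lock state, including the waiter flag */
    _Atomic uint64_t var;    /* the protected 64-bit value (*valptr) */
} LWLockSketch;

/*
 * Update the variable without the wait-list lock: on platforms with
 * atomic 8-byte stores a release store suffices.  Only if the waiter
 * flag is set does the caller need to fall back to the slow path that
 * takes the wait-list lock and wakes waiters.
 */
static bool
lwlock_update_var_fast(LWLockSketch *lock, uint64_t newval)
{
    atomic_store_explicit(&lock->var, newval, memory_order_release);

    if (atomic_load_explicit(&lock->state, memory_order_acquire)
        & LW_FLAG_HAS_WAITERS)
        return false;  /* slow path: lock the list, wake waiters */
    return true;       /* fast path: nobody to wake */
}
```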
- For locks that are frequently share-locked (e.g. procarray lock,
buffer locks) making the share-lock acquisition an atomic xadd,
instead of a compare-exchange loop is a boon. It does make the code
a bit more complicated (to handle races where the optimistically added
shared locker happens after an exclusive acquisition), which led me
to remove that optimization before committing the current lwlock
protocol. But I think it's time to add it, because we end up with a
lot of entirely unnecessary retries in read-only / read-mostly workloads.
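Roughly what I have in mind (the state layout is made up; the real
lwlock.c encodes this differently):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* illustrative layout: low bits count shared holders, one bit is exclusive */
#define LW_VAL_EXCLUSIVE (1u << 24)
#define LW_VAL_SHARED    1u

typedef struct { _Atomic uint32_t state; } LWLockSketch;

/*
 * Optimistic shared acquisition: unconditionally xadd the shared
 * count, then check whether we raced with an exclusive locker.  On a
 * race, back the increment out and fail (the real code would then go
 * on to queue as a waiter).
 */
static bool
lw_try_shared(LWLockSketch *lock)
{
    uint32_t old = atomic_fetch_add_explicit(&lock->state, LW_VAL_SHARED,
                                             memory_order_acquire);

    if (old & LW_VAL_EXCLUSIVE)
    {
        /* raced with an exclusive acquisition: undo and report failure */
        atomic_fetch_sub_explicit(&lock->state, LW_VAL_SHARED,
                                  memory_order_release);
        return false;
    }
    return true;
}

static void
lw_release_shared(LWLockSketch *lock)
{
    atomic_fetch_sub_explicit(&lock->state, LW_VAL_SHARED,
                              memory_order_release);
}
```

The point is that concurrent share-lockers never fail and retry
against each other, the way compare-exchange loops do.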
- Currently waking up waiters is O(#waiters) syscalls, because we wake
up each waiting backend via a separate semaphore. On Linux that boils
down to a futex(semaphore, WAKE) syscall. The work the kernel needs to
do for each of the futexes is substantial - in workloads with a lot of
blocking I have seen 30% of the CPU time spent in related code.
I played around with using futex() directly inside lwlock.c - that
yields quite a few benefits, because a) we can wake up many waiters
at once, removing syscalls / futex lookups, b) far fewer distinct
futexes & cachelines are touched.
Even things like only waking up shared-lockers etc can be done in one
syscall, via FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET (with bits
indicating the different wait modes).
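The basic wrappers are tiny (Linux-only sketch; using the lock word
itself as the futex, so one FUTEX_WAKE can wake any number of waiters
instead of one semaphore syscall per backend):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* wake up to nwake waiters blocked on *addr; returns number woken */
static long
futex_wake(uint32_t *addr, int nwake)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, nwake, NULL, NULL, 0);
}

/* block until woken, unless *addr no longer holds 'expected' */
static long
futex_wait(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}
```

The bitset variants (FUTEX_WAIT_BITSET / FUTEX_WAKE_BITSET) then let a
single call target only, say, the shared-waiters by giving each wait
mode its own bit.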
- LWLockAcquireOrWait(), which we use for WALWriteLock, can waste a lot
of CPU time if some of the waiters actually need to wait for WAL to
be written out. We wake up *all* the waiters, none of them takes the
lock, then all of them try to acquire the lock again - that's a lot of
contention on a single poor cacheline, and a lot of pressure on the OS
scheduler.
I think what we instead need is a combo of LWLockAcquire/Release that
allows a callback to inspect what needs to be done with which waiters.
In the WAL case such a callback could utilize a 'flush request
position' published for each PGPROC to decide whether to wake a
process because its lock request is already fulfilled, and wake
exactly one of the remaining processes to actually acquire the lock
(rather than waking everyone, as LWLockAcquireOrWait currently does).
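The wake decision itself could look something like this (types and
names are hypothetical, just to show the shape of the callback):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;  /* stand-in for PostgreSQL's WAL position */

/* hypothetical per-waiter state, as published in its PGPROC */
typedef struct WalWaiter
{
    XLogRecPtr flush_upto;     /* position this waiter needs flushed */
    bool       wake_satisfied; /* out: done, no lock acquisition needed */
    bool       wake_to_acquire;/* out: should take the lock and flush */
} WalWaiter;

/*
 * Release callback sketch: waiters whose request is already covered
 * by the just-completed flush are woken as "done"; exactly one of the
 * rest is woken to acquire the lock and perform the next flush.  That
 * avoids the thundering herd on the lock's cacheline.
 */
static void
wal_release_callback(WalWaiter *waiters, size_t n, XLogRecPtr flushed)
{
    bool picked = false;

    for (size_t i = 0; i < n; i++)
    {
        waiters[i].wake_satisfied = (waiters[i].flush_upto <= flushed);
        waiters[i].wake_to_acquire = false;

        if (!waiters[i].wake_satisfied && !picked)
        {
            waiters[i].wake_to_acquire = true;
            picked = true;
        }
    }
}
```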
- There are a large number of lwlocks - some of them bottlenecks - that
are only ever taken in exclusive mode. For those locks we can use more
efficient locking protocols:
- Lock release does not need to be an atomic read-modify-write
operation (there's no count of concurrent readers that can change
under us). That can be a significant performance benefit.
- Lock acquisition can be an atomic-exchange, instead of a
compare-exchange.
Unfortunately that probably requires designing the lock representation
a bit differently. I don't know if it's feasible to share the data
structure and just use different lock/unlock functions.
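For the exclusive-only case the whole protocol collapses to something
like this (sketch, separate representation from the current lwlock
state word):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { _Atomic uint32_t locked; } ExclusiveLock;

/*
 * Acquisition is a single atomic exchange - no compare-exchange loop,
 * since there is no reader count to preserve in the old value.
 */
static bool
excl_try_acquire(ExclusiveLock *lock)
{
    return atomic_exchange_explicit(&lock->locked, 1,
                                    memory_order_acquire) == 0;
}

/*
 * Release is a plain store with release ordering - no atomic
 * read-modify-write needed at all.
 */
static void
excl_release(ExclusiveLock *lock)
{
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```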
There's more, but that's already a long list ;)
Regards,
Andres