Thread: lwlock optimization opportunities


lwlock optimization opportunities

From: Andres Freund
Date:
Hi,

This started as a reply to
https://postgr.es/m/CAH2-WzkbCvgKrmw%2Bf%2B1hwgXhmiv%2BUNRihotALXftUiNr%3D3VUKA%40mail.gmail.com
but after typing for a while I decided that it's large and unrelated
enough to be better handled as a new thread.


On 2020-11-27 11:08:49 -0800, Peter Geoghegan wrote:
> We've made LWLocks much more scalable in the last 10 years. Why
> shouldn't we expect to do the same again in the next 10 years? I
> wouldn't bet against it. I might even do the opposite (predict further
> improvements to LWLocks).

I've done a bunch of benchmarking that shows clear areas of
improvements:

- For LWLockWaitForVar() / LWLockUpdateVar() we take the wait list lock
  largely unnecessarily in many cases. On platforms that have atomic
  8-byte reads we don't need the lock to read/write *valptr. Nor do we
  need to look at the list of waiters in LWLockUpdateVar() if
  LW_FLAG_HAS_WAITERS is not set.  WAL insertion is a fairly hot path,
  so it'd be nice to improve this.
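
  A minimal sketch of the idea, using C11 atomics rather than
  PostgreSQL's pg_atomic_* API (the HAS_WAITERS bit and struct layout
  here are illustrative, not the real lwlock representation): the
  8-byte value is read/written with a plain atomic load/store, and the
  wait list only needs to be examined when the waiters flag is set.

  ```c
  #include <assert.h>
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical flag bit, standing in for LW_FLAG_HAS_WAITERS. */
  #define HAS_WAITERS (UINT32_C(1) << 30)

  typedef struct
  {
      _Atomic uint32_t state;   /* lock word, including waiter flag */
      _Atomic uint64_t value;   /* the protected variable (*valptr) */
  } var_lock;

  /* Update the variable; return whether the waiter list must be walked. */
  static bool
  update_var(var_lock *lock, uint64_t newval)
  {
      /* 8-byte atomic store: no wait-list lock needed for the value. */
      atomic_store(&lock->value, newval);

      /* Only look at the wait list if someone is actually waiting. */
      return (atomic_load(&lock->state) & HAS_WAITERS) != 0;
  }

  int
  main(void)
  {
      var_lock lock = { .state = 0, .value = 0 };

      /* No waiters: update without touching any wait list. */
      assert(!update_var(&lock, 42));
      assert(atomic_load(&lock.value) == 42);

      /* With the flag set, the caller would walk the wait list. */
      atomic_fetch_or(&lock.state, HAS_WAITERS);
      assert(update_var(&lock, 43));
      return 0;
  }
  ```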


- For locks that are frequently share-locked (e.g. the procarray lock,
  buffer locks), making the share-lock acquisition an atomic xadd
  instead of a compare-exchange loop is a boon. It does make the code
  a bit more complicated (to handle races where the optimistically added
  shared locker happens after an exclusive acquisition), which led me
  to remove that optimization before committing the current lwlock lock
  protocol. But I think it's time to add it, because we end up with a
  lot of entirely unnecessary retries in read-only/read-mostly
  workloads.
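
  A sketch of the optimistic-xadd protocol in C11 atomics (the bit
  layout is made up for illustration; the real lwlock state word is laid
  out differently): the shared count is bumped unconditionally with one
  fetch-add, and only rolled back on the rare race with an exclusive
  holder.

  ```c
  #include <assert.h>
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical layout: one exclusive bit above a shared-holder count. */
  #define EXCLUSIVE (UINT32_C(1) << 24)

  typedef struct { _Atomic uint32_t state; } lwlock;

  /*
   * Optimistic shared acquisition: a single xadd instead of a
   * compare-exchange loop.  If an exclusive holder snuck in, undo the
   * increment and report failure (the caller would then queue/retry).
   */
  static bool
  try_lock_shared(lwlock *lock)
  {
      uint32_t old = atomic_fetch_add(&lock->state, 1);

      if (old & EXCLUSIVE)
      {
          /* Raced with an exclusive acquisition: back out. */
          atomic_fetch_sub(&lock->state, 1);
          return false;
      }
      return true;
  }

  static void
  unlock_shared(lwlock *lock)
  {
      atomic_fetch_sub(&lock->state, 1);
  }

  int
  main(void)
  {
      lwlock lock = { .state = 0 };

      assert(try_lock_shared(&lock));        /* uncontended: one xadd */
      assert(try_lock_shared(&lock));        /* shared lockers stack up */
      unlock_shared(&lock);
      unlock_shared(&lock);

      atomic_store(&lock.state, EXCLUSIVE);  /* simulate exclusive holder */
      assert(!try_lock_shared(&lock));       /* optimistic add rolled back */
      assert(atomic_load(&lock.state) == EXCLUSIVE);
      return 0;
  }
  ```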


- Currently waking up waiters takes O(#waiters) syscalls, because we
  wake up each waiting backend via a separate semaphore. On Linux that
  boils down to a futex(semaphore, FUTEX_WAKE) syscall per backend. The
  work the kernel needs to do for each of these futexes is substantial -
  in workloads with a lot of blocking I have seen 30% of the CPU time
  spent in related code.

  I played around with using futex() directly inside lwlock.c - it
  yields quite a few benefits, because a) we can wake up many waiters
  at once, removing syscalls / futex lookups, and b) far fewer
  distinct futexes & cachelines are touched.

  Even things like only waking up shared-lockers etc can be done in one
  syscall, via FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET (with bits
  indicating the different wait modes).
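
  A Linux-specific sketch of the wake-many call (glibc exposes no futex
  wrapper, so the raw syscall is used; this is a demo of the interface,
  not a full wait/wake protocol): one FUTEX_WAKE with INT_MAX wakes
  every waiter on the word in a single syscall, and the return value is
  the number of waiters actually woken.

  ```c
  #include <assert.h>
  #include <limits.h>
  #include <linux/futex.h>
  #include <stdint.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Wake up to 'n' waiters blocked on 'addr' with a single syscall. */
  static long
  futex_wake(uint32_t *addr, int n)
  {
      return syscall(SYS_futex, addr, FUTEX_WAKE, n, NULL, NULL, 0);
  }

  int
  main(void)
  {
      uint32_t word = 0;

      /*
       * Passing INT_MAX wakes all waiters at once, instead of one
       * semaphore syscall per waiting backend.  Nobody is blocked on
       * 'word' in this single-threaded demo, so zero waiters are woken.
       */
      long woken = futex_wake(&word, INT_MAX);
      assert(woken == 0);
      return 0;
  }
  ```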


- LWLockAcquireOrWait(), which we use for WALWriteLock, can waste a lot
  of CPU time if some of the waiters actually need to wait for WAL to
  be written out. We wake up *all* the waiters, none of them takes the
  lock, and then all of them try to acquire the lock again - that's a
  lot of contention on a single poor cacheline, and a lot of pressure
  on the OS scheduler.

  I think what we instead need is a combo of LWLockAcquire/Release that
  allows a callback to inspect what needs to be done with which waiters.

  In the WAL case such a callback could use a 'flush request position'
  published for each PGPROC to decide whether to wake a process because
  its lock request is already fulfilled, and wake exactly one of the
  remaining processes to actually acquire the lock (rather than having
  everyone return from LWLockAcquireOrWait without it, as is currently
  the case).
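
  A toy sketch of such a release callback (the waiter struct and field
  names are invented stand-ins for PGPROC state): waiters whose
  published flush request is already covered are simply woken, and
  exactly one of the rest is woken to take the lock - nobody else
  stampedes the cacheline.

  ```c
  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical per-waiter state, standing in for a PGPROC entry. */
  typedef struct
  {
      uint64_t flush_request;  /* published 'flush request position' */
      bool     woken;
      bool     should_acquire; /* told to take the lock, not just return */
  } waiter;

  /*
   * Release-side callback: waiters whose request is already satisfied
   * are woken without the lock; exactly one remaining waiter is woken
   * to acquire the lock and perform the next flush.
   */
  static void
  release_and_wake(waiter *waiters, size_t n, uint64_t flushed_upto)
  {
      bool next_holder_chosen = false;

      for (size_t i = 0; i < n; i++)
      {
          if (waiters[i].flush_request <= flushed_upto)
          {
              /* Work already done on this waiter's behalf: just wake. */
              waiters[i].woken = true;
          }
          else if (!next_holder_chosen)
          {
              /* Wake exactly one waiter to actually take the lock. */
              waiters[i].woken = true;
              waiters[i].should_acquire = true;
              next_holder_chosen = true;
          }
          /* Everyone else keeps sleeping - no thundering herd. */
      }
  }

  int
  main(void)
  {
      waiter w[3] = {
          { .flush_request = 100 },
          { .flush_request = 300 },
          { .flush_request = 200 },
      };

      release_and_wake(w, 3, 150);  /* WAL flushed up to position 150 */

      assert(w[0].woken && !w[0].should_acquire);  /* already fulfilled */
      assert(w[1].woken && w[1].should_acquire);   /* next lock holder */
      assert(!w[2].woken);                         /* stays asleep */
      return 0;
  }
  ```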


- There are a large number of lwlocks - some of them bottlenecks - that
  are only ever taken in exclusive mode. For those locks we can use more
  efficient locking protocols:
  - Lock release does not need to be an atomic read-modify-write
    operation (there's no count of shared holders that can change
    concurrently). That can be a significant performance benefit.
  - Lock acquisition can be an atomic-exchange, instead of a
    compare-exchange.

  Unfortunately that probably requires designing the lock representation
  a bit differently. I don't know if it's feasible to share the data
  structure and just use different lock/unlock functions.
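
  The cheaper protocol for exclusive-only locks is easy to sketch in
  C11 atomics (again illustrative, not a proposed lwlock layout):
  acquisition is a single atomic exchange, and release is just a plain
  store with release ordering - no atomic read-modify-write at all.

  ```c
  #include <assert.h>
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>

  typedef struct { _Atomic uint32_t locked; } excl_lock;

  /* Acquisition: one atomic exchange, no compare-exchange loop. */
  static bool
  try_lock_exclusive(excl_lock *lock)
  {
      return atomic_exchange_explicit(&lock->locked, 1,
                                      memory_order_acquire) == 0;
  }

  /*
   * Release: with no shared holders to count, a plain release-ordered
   * store suffices - no atomic read-modify-write needed.
   */
  static void
  unlock_exclusive(excl_lock *lock)
  {
      atomic_store_explicit(&lock->locked, 0, memory_order_release);
  }

  int
  main(void)
  {
      excl_lock lock = { .locked = 0 };

      assert(try_lock_exclusive(&lock));
      assert(!try_lock_exclusive(&lock));  /* already held */
      unlock_exclusive(&lock);
      assert(try_lock_exclusive(&lock));   /* reacquirable after release */
      return 0;
  }
  ```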


There's more, but that's already a long list ;)

Regards,

Andres