Re: Spinlocks, yet again: analysis and proposed patches
From: Tom Lane
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date:
Msg-id: 16824.1126820943@sss.pgh.pa.us
In reply to: Re: Spinlocks, yet again: analysis and proposed patches (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Spinlocks, yet again: analysis and proposed patches (Gavin Sherry <swm@linuxworld.com.au>)
           Re: Spinlocks, yet again: analysis and proposed patches ("Simon Riggs" <simon@2ndquadrant.com>)
List: pgsql-hackers
I wrote:
> I guess what this means is that there's no real problem with losing the
> cache line while manipulating the LWLock, which is what the patch was
> intended to prevent.  Instead, we're paying for swapping two cache lines
> (the spinlock and the LWLock) across processors instead of just one line.
> But that should at worst be a 2x inflation of the time previously spent
> in LWLockAcquire/Release, which is surely not yet all of the application
> ;-).  Why the heck is this so bad?  Should we expect that apparently
> minor changes in shared data structures might be costing equivalently
> huge penalties in SMP performance elsewhere?

I did some oprofile work and found that the cost seems to be because (1) there's an increase in spinlock contention (more time spent in s_lock), and (2) there's more time spent in LWLockAcquire/Release. I'm not entirely clear about the reason for (1), but (2) apparently is because of the extra cache line swapping, as posited above. In the oprofile trace it's clear that the extra cost comes exactly in the statements that touch fields of the shared LWLock, which are the places where you might have to wait to acquire a cache line.

I thought for a bit that the problem might come from having chosen to put a pointer to the spinlock into each LWLock; fetching that pointer implies an additional access to the contended cache line. However, changing the data structure to a simple linear array of spinlocks didn't help. So that whole idea seems to have died a painful death.

One other interesting result is that the data structure change neither helped nor hurt on an EM64T machine with 2 physical (4 logical) processors. This is also x86_64, but evidently the caching behavior is totally different from Opterons.

One thing that did seem to help a little bit was padding the LWLocks to 32 bytes (by default they are 24 bytes each on x86_64) and ensuring the array starts on a 32-byte boundary.
This ensures that we won't have any LWLocks crossing cache lines --- contended access to such an LWLock would probably incur the sort of large penalty seen above, because you'd be trading two cache lines back and forth, not one. It seems that the important locks are not split that way in CVS tip, because the gain wasn't much, but I wonder whether some effect like this might explain some of the unexplainable performance changes we've noticed in the past (eg, in the dbt2 results). A seemingly unrelated small change in the size of other data structures in shared memory might move things around enough to make a performance-critical lock cross a cache line boundary.

On regular x86, the LWLock struct is by chance exactly 16 bytes, so there's no alignment issue. But 64-bit platforms, and platforms where slock_t is bigger than char, are exposed to this risk.

I'm going to go ahead and make that change, since it doesn't seem likely to have any downside. It might be interesting to look into forcing proper alignment of the shared buffer headers as well.

			regards, tom lane