Discussion: Possible performance regression in version 10.1 with pgbench read-write tests.
Hi all,

When I was trying to do read-write pgbench benchmarking of PostgreSQL 9.6.6 vs 10.1, I found that PostgreSQL 10.1 regresses against 9.6.6 in some cases.

Non-default settings and test
======================
Server:
./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Pgbench:
CASE 1: when data fits shared buffers.
./pgbench -i -s 1000 postgres

CASE 2: when data exceeds shared buffers.
./pgbench -i -s 1000 postgres

./pgbench -c $threads -j $threads -T 1800 -M prepared postgres

The script "perf_buff_mgmt_write-2.sh", attached below, can be used to run the same.

Machine: "cthulhu", an 8-node NUMA machine with 128 hyperthreads.
===================================================
>numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 65 66 67 68 69 70 71 96 97 98 99 100 101 102 103
node 0 size: 65498 MB
node 0 free: 37885 MB
node 1 cpus: 72 73 74 75 76 77 78 79 104 105 106 107 108 109 110 111
node 1 size: 65536 MB
node 1 free: 31215 MB
node 2 cpus: 80 81 82 83 84 85 86 87 112 113 114 115 116 117 118 119
node 2 size: 65536 MB
node 2 free: 15331 MB
node 3 cpus: 88 89 90 91 92 93 94 95 120 121 122 123 124 125 126 127
node 3 size: 65536 MB
node 3 free: 36774 MB
node 4 cpus: 1 2 3 4 5 6 7 8 33 34 35 36 37 38 39 40
node 4 size: 65536 MB
node 4 free: 62 MB
node 5 cpus: 9 10 11 12 13 14 15 16 41 42 43 44 45 46 47 48
node 5 size: 65536 MB
node 5 free: 9653 MB
node 6 cpus: 17 18 19 20 21 22 23 24 49 50 51 52 53 54 55 56
node 6 size: 65536 MB
node 6 free: 50209 MB
node 7 cpus: 25 26 27 28 29 30 31 32 57 58 59 60 61 62 63 64
node 7 size: 65536 MB
node 7 free: 43966 MB

CASE 1: In 9.6.6, peak performance is achieved at 72 concurrent clients (TPS: 35554.573858); in 10.1 at 72 clients, TPS dips to 26882.828133, a nearly 23% decrease in TPS.

CASE 2: In 9.6.6, peak performance is achieved at 72 concurrent clients (TPS: 24861.074079); in 10.1 at 72 clients, TPS dips to 18372.565663, a nearly 26% decrease in TPS.

The attached "Postgresql_benchmarking_9.6vs10.ods" gives more detailed TPS numbers; each TPS figure is the median of 3 runs. I have not yet run a bisect to find what caused the issue.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
Attachments
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Amit Kapila
Date:
On Wed, Jan 24, 2018 at 12:06 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Hi all,
>
> When I was trying to do read-write pgbench bench-marking of PostgreSQL
> 9.6.6 vs 10.1 I found PostgreSQL 10.1 regresses against 9.6.6 in some
> cases.
>
> Non Default settings and test
> ======================
> Server:
> ./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c
> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
>
> Pgbench:
> CASE 1: when data fits shared buffers.
> ./pgbench -i -s 1000 postgres
>
> CASE 2: when data exceeds shared buffers.
> ./pgbench -i -s 1000 postgres
>

Both the cases look identical, but from the document attached, it seems the case-1 is for scale factor 300.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 24, 2018 at 7:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Both the cases look identical, but from the document attached, it
> seems the case-1 is for scale factor 300.

Oops, sorry, that was a typo. CASE 1 is scale factor 300, which fits in shared_buffers = 8GB.

--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Robert Haas
Date:
On Wed, Feb 21, 2018 at 10:03 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> seeing futex in the call stack andres suggested that following commit could
> be the reason for regression
>
> commit ecb0d20a9d2e09b7112d3b192047f711f9ff7e59
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   2016-10-09 18:03:45 -0400
>
>     Use unnamed POSIX semaphores, if available, on Linux and FreeBSD.
>
> Commenting out same in src/template/linux I did run the benchmark tests
> again
> performance improved from 26871.567326 to 34286.620251 (both median of 3
> TPS).

Hmm.  So that commit might not have been the greatest idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Amit Kapila
Date:
On Thu, Feb 22, 2018 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 21, 2018 at 10:03 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>> seeing futex in the call stack andres suggested that following commit could
>> be the reason for regression
>>
>> commit ecb0d20a9d2e09b7112d3b192047f711f9ff7e59
>> Author: Tom Lane <tgl@sss.pgh.pa.us>
>> Date:   2016-10-09 18:03:45 -0400
>>
>>     Use unnamed POSIX semaphores, if available, on Linux and FreeBSD.
>>
>> Commenting out same in src/template/linux I did run the benchmark tests
>> again
>> performance improved from 26871.567326 to 34286.620251 (both median of 3
>> TPS).
>
> Hmm.  So that commit might not have been the greatest idea.
>

It appears so.  I think we should do something about it as the regression is quite noticeable.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Alvaro Herrera
Date:
On 2018-Jul-19, Amit Kapila wrote:

> On Thu, Feb 22, 2018 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Feb 21, 2018 at 10:03 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> >> seeing futex in the call stack andres suggested that following commit could
> >> be the reason for regression
> >>
> >> commit ecb0d20a9d2e09b7112d3b192047f711f9ff7e59
> >> Author: Tom Lane <tgl@sss.pgh.pa.us>
> >> Date:   2016-10-09 18:03:45 -0400
> >>
> >>     Use unnamed POSIX semaphores, if available, on Linux and FreeBSD.
>
> > Hmm.  So that commit might not have been the greatest idea.
>
> It appears so.  I think we should do something about it as the
> regression is quite noticeable.

So the fix is just to revert the change for the linux makefile?  Sounds easy enough, code-wise.  Do we need more evidence that it's harmful?

Since it was changed in pg10 not 11, I don't think this is an open-item per se.  (Maybe an "older bug", if we must really have it there.)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Andres Freund
Date:
Hi,

On 2018-01-24 00:06:44 +0530, Mithun Cy wrote:
> Server:
> ./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c
> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Which kernel & glibc version does this server have?

Greetings,

Andres Freund
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Andres Freund
Date:
Hi,

On 2018-07-19 15:39:44 -0400, Alvaro Herrera wrote:
> On 2018-Jul-19, Amit Kapila wrote:
>
> > On Thu, Feb 22, 2018 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Wed, Feb 21, 2018 at 10:03 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> > >> seeing futex in the call stack andres suggested that following commit could
> > >> be the reason for regression
> > >>
> > >> commit ecb0d20a9d2e09b7112d3b192047f711f9ff7e59
> > >> Author: Tom Lane <tgl@sss.pgh.pa.us>
> > >> Date:   2016-10-09 18:03:45 -0400
> > >>
> > >>     Use unnamed POSIX semaphores, if available, on Linux and FreeBSD.
> >
> > > Hmm.  So that commit might not have been the greatest idea.
> >
> > It appears so.  I think we should do something about it as the
> > regression is quite noticeable.
>
> So the fix is just to revert the change for the linux makefile?  Sounds
> easy enough, code-wise.  Do we need more evidence that it's harmful?
>
> Since it was changed in pg10 not 11, I don't think this is an open-item
> per se.  (Maybe an "older bug", if we must really have it there.)

I'm a bit hesitant to just revert without further evaluation - it's just about as likely we'll regress on other hardware / kernel versions.  Except it'd be in a minor release, whereas the current issue was in a major release.  It'd also suddenly make some installations not start, due to sysv semaphore # limitations.

There've been a few annoying, and a few embarrassing, issues with futexes, but they receive far more attention from a performance POV.

Greetings,

Andres Freund
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2018-Jul-19, Amit Kapila wrote:
>> It appears so.  I think we should do something about it as the
>> regression is quite noticeable.

It's not *that* noticeable, as I failed to demonstrate any performance difference before committing the patch.  I think some more investigation is warranted to find out why some other people are getting different results.

> So the fix is just to revert the change for the linux makefile?  Sounds
> easy enough, code-wise.  Do we need more evidence that it's harmful?

Some fraction of 3d21f08bc would also need to be undone.  It's also worth contemplating that we'd be re-introducing old problems with not-enough-SysV-semaphores, so even if there is a performance benefit to be had, it's very far from being free.

			regards, tom lane
Hi Andres,
On Fri, Jul 20, 2018 at 1:21 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-01-24 00:06:44 +0530, Mithun Cy wrote:
>> Server:
>> ./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c
>> max_wal_size=20GB -c checkpoint_timeout=900 -c
>> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
>
> Which kernel & glibc version does this server have?

[mithun.cy@cthulhu ~]$ cat /proc/version
Linux version 3.10.0-693.5.2.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Fri Oct 20 20:32:50 UTC 2017
[mithun.cy@cthulhu ~]$ ldd --version
ldd (GNU libc) 2.17
--
Andres Freund <andres@anarazel.de> writes:
> I'm a bit hesitant to just revert without further evaluation - it's just
> about as likely we'll regress on other hardware / kernel
> versions.

I looked into the archives for the discussion that led up to ecb0d20a9, and found it here:

https://www.postgresql.org/message-id/flat/8536.1475704230%40sss.pgh.pa.us

The test cases I tried in that thread said that POSIX semas were *faster* ... by single-digit percentages, but still faster.  So I think we really need to study this issue, rather than just take one contrary result as being gospel.

			regards, tom lane
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Thomas Munro
Date:
On Fri, Jul 20, 2018 at 7:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> On 2018-Jul-19, Amit Kapila wrote:
>>> It appears so.  I think we should do something about it as the
>>> regression is quite noticeable.
>
> It's not *that* noticeable, as I failed to demonstrate any performance
> difference before committing the patch.  I think some more investigation
> is warranted to find out why some other people are getting different
> results.

Maybe false sharing is a factor, since sizeof(sem_t) is 32 bytes on Linux/amd64 and we're probably hitting elements clustered at one end of the array?  Let's see... I tried sticking padding into PGSemaphoreData and I got ~8% more TPS (72 clients on a multi-socket box, pgbench scale 100, only running for a minute but otherwise the same settings that Mithun showed).

--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -45,6 +45,7 @@
 typedef struct PGSemaphoreData
 {
     sem_t       pgsem;
+    char        padding[PG_CACHE_LINE_SIZE - sizeof(sem_t)];
 } PGSemaphoreData;

That's probably not the right idiom and my tests probably weren't long enough, but there seems to be some effect here.

--
Thomas Munro
http://www.enterprisedb.com
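[As a quick sanity check of the false-sharing premise above, a small stand-alone C program (not part of the thread's patches, purely illustrative) can show how many sem_t structures pack into one cache line on a given build; the 64-byte line size is an assumption typical of x86_64:]

#include <semaphore.h>
#include <stdio.h>

int
main(void)
{
	/* 64 bytes is assumed here; it is the usual x86_64 cache line size. */
	const size_t cache_line = 64;

	printf("sizeof(sem_t) = %zu bytes\n", sizeof(sem_t));
	printf("sem_t per %zu-byte cache line = %zu\n",
		   cache_line, cache_line / sizeof(sem_t));
	return 0;
}

[On Linux/amd64 this typically reports two semaphores per line, which is where the false-sharing suspicion comes from.]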
On Fri, Jul 20, 2018 at 10:52 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Fri, Jul 20, 2018 at 7:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> It's not *that* noticeable, as I failed to demonstrate any performance
>> difference before committing the patch.  I think some more investigation
>> is warranted to find out why some other people are getting different
>> results.
>
> Maybe false sharing is a factor, since sizeof(sem_t) is 32 bytes on
> Linux/amd64 and we're probably hitting elements clustered at one end
> of the array?  Let's see... I tried sticking padding into
> PGSemaphoreData and I got ~8% more TPS (72 client on multi socket
> box, pgbench scale 100, only running for a minute but otherwise the
> same settings that Mithun showed).
>
> --- a/src/backend/port/posix_sema.c
> +++ b/src/backend/port/posix_sema.c
> @@ -45,6 +45,7 @@
>  typedef struct PGSemaphoreData
>  {
>      sem_t       pgsem;
> +    char        padding[PG_CACHE_LINE_SIZE - sizeof(sem_t)];
>  } PGSemaphoreData;
>
> That's probably not the right idiom and my tests probably weren't long
> enough, but there seems to be some effect here.
I did a quick test applying the patch with the same settings as in my initial mail (on the latest PostgreSQL 10 code).
72 clients

CASE 1:
Without Patch : TPS 29269.823540
With Patch    : TPS 36005.544960  --- a 23% jump
Just disabling unnamed POSIX semaphores: TPS 34481.207959

So it seems that is the issue, as the test is being run on an 8-node NUMA machine.

I also came across a presentation [1] (slide 20) which says one of those futex architectures is bad for NUMA machines. I am not sure whether the fix for that is included in Linux version 3.10.0-693.5.2.el7.x86_64, which is on my test machine.
--
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Andres Freund
Date:
Hi,

On 2018-07-21 00:53:28 +0530, Mithun Cy wrote:
> On Fri, Jul 20, 2018 at 10:52 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > On Fri, Jul 20, 2018 at 7:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >
> > > It's not *that* noticeable, as I failed to demonstrate any performance
> > > difference before committing the patch.  I think some more investigation
> > > is warranted to find out why some other people are getting different
> > > results
> > Maybe false sharing is a factor, since sizeof(sem_t) is 32 bytes on
> > Linux/amd64 and we're probably hitting elements clustered at one end
> > of the array?  Let's see... I tried sticking padding into
> > PGSemaphoreData and I got ~8% more TPS (72 client on multi socket
> > box, pgbench scale 100, only running for a minute but otherwise the
> > same settings that Mithun showed).
> >
> > --- a/src/backend/port/posix_sema.c
> > +++ b/src/backend/port/posix_sema.c
> > @@ -45,6 +45,7 @@
> >  typedef struct PGSemaphoreData
> >  {
> >      sem_t pgsem;
> > +    char padding[PG_CACHE_LINE_SIZE - sizeof(sem_t)];
> >  } PGSemaphoreData;
> >
> > That's probably not the right idiom and my tests probably weren't long
> > enough, but there seems to be some effect here.
> >
>
> I did a quick test applying the patch with same settings as initial mail I
> have reported (On postgresql 10 latest code)
> 72 clients
>
> CASE 1:
> Without Patch : TPS 29269.823540
>
> With Patch : TPS 36005.544960. --- 23% jump
>
> Just Disabling using unnamed POSIX semaphores: TPS 34481.207959
>
> So it seems that is the issue as the test is being run on 8 node numa
> machine.

Cool. I think we should just backpatch that then. Does anybody want to argue against?

> I also came across a presentation [1] : slide 20 which says one of those
> futex architecture is bad for NUMA machine. I am not sure the new fix for
> same is included as part of Linux version 3.10.0-693.5.2.el7.x86_64 which
> is on my test machine.

Similar issues are also present internally for sysv semas, so I don't think this really means that much.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2018-07-21 00:53:28 +0530, Mithun Cy wrote:
>> I did a quick test applying the patch with same settings as initial mail I
>> have reported (On postgresql 10 latest code)
>> 72 clients
>>
>> CASE 1:
>> Without Patch : TPS 29269.823540
>>
>> With Patch : TPS 36005.544960. --- 23% jump
>>
>> Just Disabling using unnamed POSIX semaphores: TPS 34481.207959
>>
>> So it seems that is the issue as the test is being run on 8 node numa
>> machine.

> Cool. I think we should just backpatch that then. Does anybody want to
> argue against?

Not entirely clear to me what change is being proposed here?

In any case, I strongly resist making performance-based changes on the basis of one test on one kernel and one hardware platform.  We should reproduce the results on a few different machines before we even think of committing anything.  I'm happy to test on what I have, although I'd be the first to agree that what I'm checking is relatively low-end cases.  (Too bad hydra's gone.)

			regards, tom lane
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Andres Freund
Date:
On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-07-21 00:53:28 +0530, Mithun Cy wrote:
> >> I did a quick test applying the patch with same settings as initial mail I
> >> have reported (On postgresql 10 latest code)
> >> 72 clients
> >>
> >> CASE 1:
> >> Without Patch : TPS 29269.823540
> >>
> >> With Patch : TPS 36005.544960. --- 23% jump
> >>
> >> Just Disabling using unnamed POSIX semaphores: TPS 34481.207959
> >>
> >> So it seems that is the issue as the test is being run on 8 node numa
> >> machine.
>
> > Cool. I think we should just backpatch that then. Does anybody want to
> > argue against?
>
> Not entirely clear to me what change is being proposed here?

Adding padding to struct PGSemaphoreData, so multiple semas don't share a cacheline.

> In any case, I strongly resist making performance-based changes on
> the basis of one test on one kernel and one hardware platform.
> We should reproduce the results on a few different machines before
> we even think of committing anything.  I'm happy to test on what
> I have, although I'd be the first to agree that what I'm checking
> is relatively low-end cases.  (Too bad hydra's gone.)

Sure, it'd be good to do more of that.  But from a theoretical POV it's quite logical that posix semas sharing cachelines is bad for performance, if there's any contention.  When backed by futexes - i.e. all non ancient linux machines - the hot path just does a cmpxchg of the *userspace* data (I've copied the relevant code below).

Given that we don't have a large number of semas these days, that there's reasons to make the change even without measuring it, that we have benchmark results, and that it's hard to see how it'd cause regressions, I don't think going for a fix quickly is unreasonable.

You could argue it'd be better to make semaphores be embeddable in bigger structures like PGPROC, rather than allocated in an array.  While I suspect you'd get a bit of a performance benefit from that, it seems like a far bigger change we'd want to do in a minor release.

int
__new_sem_wait (sem_t *sem)
{
  /* We need to check whether we need to act upon a cancellation request here
     because POSIX specifies that cancellation points "shall occur" in
     sem_wait and sem_timedwait, which also means that they need to check
     this regardless whether they block or not (unlike "may occur"
     functions).  See the POSIX Rationale for this requirement: Section
     "Thread Cancellation Overview" [1] and austin group issue #1076 [2]
     for thoughs on why this may be a suboptimal design.

     [1] http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html
     [2] http://austingroupbugs.net/view.php?id=1076 for thoughts on why this
   */
  __pthread_testcancel ();

  if (__new_sem_wait_fast ((struct new_sem *) sem, 0) == 0)
    return 0;
  else
    return __new_sem_wait_slow((struct new_sem *) sem, NULL);
}

/* Fast path: Try to grab a token without blocking.  */
static int
__new_sem_wait_fast (struct new_sem *sem, int definitive_result)
{
  /* We need acquire MO if we actually grab a token, so that this
     synchronizes with all token providers (i.e., the RMW operation we read
     from or all those before it in modification order; also see sem_post).
     We do not need to guarantee any ordering if we observed that there is
     no token (POSIX leaves it unspecified whether functions that fail
     synchronize memory); thus, relaxed MO is sufficient for the initial load
     and the failure path of the CAS.  If the weak CAS fails and we need a
     definitive result, retry.  */
#if __HAVE_64B_ATOMICS
  uint64_t d = atomic_load_relaxed (&sem->data);
  do
    {
      if ((d & SEM_VALUE_MASK) == 0)
        break;
      if (atomic_compare_exchange_weak_acquire (&sem->data, &d, d - 1))
        return 0;
    }
  while (definitive_result);
  return -1;
#else
  unsigned int v = atomic_load_relaxed (&sem->value);
  do
    {
      if ((v >> SEM_VALUE_SHIFT) == 0)
        break;
      if (atomic_compare_exchange_weak_acquire (&sem->value, &v,
                                                v - (1 << SEM_VALUE_SHIFT)))
        return 0;
    }
  while (definitive_result);
  return -1;
#endif
}

/* See sem_wait for an explanation of the algorithm.  */
int
__new_sem_post (sem_t *sem)
{
  struct new_sem *isem = (struct new_sem *) sem;
  int private = isem->private;

#if __HAVE_64B_ATOMICS
  /* Add a token to the semaphore.  We use release MO to make sure that a
     thread acquiring this token synchronizes with us and other threads that
     added tokens before (the release sequence includes atomic RMW operations
     by other threads).  */
  /* TODO Use atomic_fetch_add to make it scale better than a CAS loop?  */
  uint64_t d = atomic_load_relaxed (&isem->data);
  do
    {
      if ((d & SEM_VALUE_MASK) == SEM_VALUE_MAX)
        {
          __set_errno (EOVERFLOW);
          return -1;
        }
    }
  while (!atomic_compare_exchange_weak_release (&isem->data, &d, d + 1));

  /* If there is any potentially blocked waiter, wake one of them.  */
  if ((d >> SEM_NWAITERS_SHIFT) > 0)
    futex_wake (((unsigned int *) &isem->data) + SEM_VALUE_OFFSET, 1, private);
#else
  /* Add a token to the semaphore.  Similar to 64b version.  */
  unsigned int v = atomic_load_relaxed (&isem->value);
  do
    {
      if ((v >> SEM_VALUE_SHIFT) == SEM_VALUE_MAX)
        {
          __set_errno (EOVERFLOW);
          return -1;
        }
    }
  while (!atomic_compare_exchange_weak_release (&isem->value, &v,
                                                v + (1 << SEM_VALUE_SHIFT)));

  /* If there is any potentially blocked waiter, wake one of them.  */
  if ((v & SEM_NWAITERS_MASK) != 0)
    futex_wake (&isem->value, 1, private);
#endif

  return 0;
}

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2018-07-20 15:35:39 -0400, Tom Lane wrote: >> In any case, I strongly resist making performance-based changes on >> the basis of one test on one kernel and one hardware platform. > Sure, it'd be good to do more of that. But from a theoretical POV it's > quite logical that posix semas sharing cachelines is bad for > performance, if there's any contention. When backed by futexes - > i.e. all non ancient linux machines - the hot path just does a cmpxchg > of the *userspace* data (I've copied the relevant code below). Here's the thing: the hot path is of little or no interest, because if we are in the sema code at all, we are expecting to block. The only case where we wouldn't block is if the lock manager decided the current process needs to sleep, but some other process already released us by the time we reach the futex/kernel call. Certainly that will happen some of the time, but it's not likely to be the way to bet. So I'm very dubious of any arguments based on the speed of the "uncontended" path. It's possible that the bigger picture here is that the kernel boys optimized for the "uncontended" path to the point where they broke performance of the blocking path. It's hard to see how they could have broke it to the point of being slower than the SysV sema API, though. Anyway, I think we need to test first and patch second. I'm working on getting some numbers on my own machines now. On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've only got 8 cores) but other parameters matching Mithun's example, I just got transaction type: <builtin: TPC-B (sort of)> scaling factor: 300 query mode: prepared number of clients: 8 number of threads: 8 duration: 1800 s number of transactions actually processed: 29001016 latency average = 0.497 ms tps = 16111.575661 (including connections establishing) tps = 16111.623329 (excluding connections establishing) which is interesting because vmstat was pretty consistently reporting around 500000 context swaps/second during the run, or circa 30 cs/transaction. We'd have a minimum of 14 cs/transaction just between client and server (due to seven SQL commands per transaction in TPC-B) so that seems on the low side; not a lot of lock contention here it seems. I wonder what the corresponding ratio was in Mithun's runs. regards, tom lane
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Andres Freund
Date:
Hi,

On 2018-07-20 16:43:33 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
> >> In any case, I strongly resist making performance-based changes on
> >> the basis of one test on one kernel and one hardware platform.
>
> > Sure, it'd be good to do more of that.  But from a theoretical POV it's
> > quite logical that posix semas sharing cachelines is bad for
> > performance, if there's any contention.  When backed by futexes -
> > i.e. all non ancient linux machines - the hot path just does a cmpxchg
> > of the *userspace* data (I've copied the relevant code below).
>
> Here's the thing: the hot path is of little or no interest, because
> if we are in the sema code at all, we are expecting to block.

Note that we're also using semas for ProcArrayGroupClearXid(), which is pretty commonly hot for pgbench style workloads, and where the expected wait times are very short.

> It's possible that the bigger picture here is that the kernel boys
> optimized for the "uncontended" path to the point where they broke
> performance of the blocking path.  It's hard to see how they could
> have broke it to the point of being slower than the SysV sema API,
> though.

I don't see how this is a likely proposition, given that adding padding to the *userspace* portion of futexes increased the performance quite significantly.

> On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
> only got 8 cores) but other parameters matching Mithun's example,
> I just got

It's *really* common to have more actual clients than cpus for oltp workloads, so I don't think it's insane to test with more clients.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2018-07-20 16:43:33 -0400, Tom Lane wrote:
>> On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
>> only got 8 cores) but other parameters matching Mithun's example,
>> I just got

> It's *really* common to have more actual clients than cpus for oltp
> workloads, so I don't think it's insane to test with more clients.

I finished a set of runs using similar parameters to Mithun's test except for using 8 clients, and another set using 72 clients (but, being impatient, 5-minute runtime) just to verify that the results wouldn't be markedly different.  I got TPS numbers like this:

                        8 clients    72 clients

unmodified HEAD             16112         16284
with padding patch          16096         16283
with SysV semas             15926         16064
with padding+SysV           15949         16085

This is on RHEL6 (kernel 2.6.32-754.2.1.el6.x86_64), hardware is dual 4-core Intel E5-2609 (Sandy Bridge era).  This hardware does show NUMA effects, although no doubt less strongly than Mithun's machine.

I would like to see some other results with a newer kernel.  I tried to repeat this test on a laptop running Fedora 28, but soon concluded that anything beyond very short runs was mainly going to tell me about thermal throttling :-(.  I could possibly get repeatable numbers from, say, 1-minute SELECT-only runs, but that would be a different test scenario, likely one with a lot less lock contention.

Anyway, for me, the padding change is a don't-care.  Given that both Mithun and Thomas showed some positive effect from it, I'm not averse to applying it.  I'm still -1 on going back to SysV semas.

			regards, tom lane
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Thomas Munro
Date:
On Sun, Jul 22, 2018 at 8:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2018-07-20 16:43:33 -0400, Tom Lane wrote:
>>> On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
>>> only got 8 cores) but other parameters matching Mithun's example,
>>> I just got
>
>> It's *really* common to have more actual clients than cpus for oltp
>> workloads, so I don't think it's insane to test with more clients.
>
> I finished a set of runs using similar parameters to Mithun's test except
> for using 8 clients, and another set using 72 clients (but, being
> impatient, 5-minute runtime) just to verify that the results wouldn't
> be markedly different.  I got TPS numbers like this:
>
>                         8 clients    72 clients
>
> unmodified HEAD             16112         16284
> with padding patch          16096         16283
> with SysV semas             15926         16064
> with padding+SysV           15949         16085
>
> This is on RHEL6 (kernel 2.6.32-754.2.1.el6.x86_64), hardware is dual
> 4-core Intel E5-2609 (Sandy Bridge era).  This hardware does show NUMA
> effects, although no doubt less strongly than Mithun's machine.
>
> I would like to see some other results with a newer kernel.  I tried to
> repeat this test on a laptop running Fedora 28, but soon concluded that
> anything beyond very short runs was mainly going to tell me about thermal
> throttling :-(.  I could possibly get repeatable numbers from, say,
> 1-minute SELECT-only runs, but that would be a different test scenario,
> likely one with a lot less lock contention.

I did some testing on 2-node, 4-node and 8-node systems running Linux 3.10.something (slightly newer but still ancient).  Only the 8-node box (= same one Mithun used) shows the large effect (the 2-node box may be a tiny bit faster patched but I'm calling that noise for now... it's not slower, anyway).

On the problematic box, I also tried some different strides (char padding[N - sizeof(sem_t)]) and was surprised by the result:

Unpatched = ~35k TPS
64 byte stride = ~35k TPS
128 byte stride = ~42k TPS
4096 byte stride = ~47k TPS

Huh.  PG_CACHE_LINE_SIZE is 128, but the true cache line size on this system is 64 bytes.  That exaggeration turned out to do something useful, though I can't explain it.  While looking for discussion of 128 byte cache effects I came across the Intel "L2 adjacent cache line prefetcher"[1].  Maybe this, or some of the other prefetchers (enabled in the BIOS) or related stuff could be at work here.  It could be microarchitecture-dependent (this is an old Westmere box), though I found a fairly recent discussion about a similar effect[2] that mentions more recent hardware.  The spatial prefetcher reference can be found in the Optimization Manual[3].

[1] https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
[2] https://groups.google.com/forum/#!msg/mechanical-sympathy/i3-M2uCYTJE/P7vyoOTIAgAJ
[3] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

--
Thomas Munro
http://www.enterprisedb.com
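[A rough sketch of the kind of stride experiment described above, assuming a build-time macro; SEM_TEST_STRIDE and the struct name are hypothetical illustrations, not the code that was eventually committed:]

#include <semaphore.h>

/*
 * Hypothetical harness for the stride experiment above: rebuild with
 * -DSEM_TEST_STRIDE=64, 128 or 4096 and compare pgbench TPS.  This is
 * an illustration of the idea only, not the actual patch.
 */
#ifndef SEM_TEST_STRIDE
#define SEM_TEST_STRIDE 128		/* must exceed sizeof(sem_t) */
#endif

typedef struct PGSemaphoreDataStrided
{
	sem_t		pgsem;
	/* Pad each array element out to SEM_TEST_STRIDE bytes. */
	char		padding[SEM_TEST_STRIDE - sizeof(sem_t)];
} PGSemaphoreDataStrided;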
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Thomas Munro
Date:
On Mon, Jul 23, 2018 at 3:40 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I did some testing on 2-node, 4-node and 8-node systems running Linux
> 3.10.something (slightly newer but still ancient).  Only the 8-node
> box (= same one Mithun used) shows the large effect (the 2-node box
> may be a tiny bit faster patched but I'm calling that noise for now...
> it's not slower, anyway).

(I forgot to add that the 4-node system that showed no change is POWER architecture.)

--
Thomas Munro
http://www.enterprisedb.com
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Thomas Munro
Date:
On Mon, Jul 23, 2018 at 3:40 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Sun, Jul 22, 2018 at 8:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>                         8 clients    72 clients
>>
>> unmodified HEAD             16112         16284
>> with padding patch          16096         16283
>> with SysV semas             15926         16064
>> with padding+SysV           15949         16085
>>
>> This is on RHEL6 (kernel 2.6.32-754.2.1.el6.x86_64), hardware is dual
>> 4-core Intel E5-2609 (Sandy Bridge era).  This hardware does show NUMA
>> effects, although no doubt less strongly than Mithun's machine.
>>
>> I would like to see some other results with a newer kernel.  I tried to
>> repeat this test on a laptop running Fedora 28, but soon concluded that
>> anything beyond very short runs was mainly going to tell me about thermal
>> throttling :-(.  I could possibly get repeatable numbers from, say,
>> 1-minute SELECT-only runs, but that would be a different test scenario,
>> likely one with a lot less lock contention.
>
> I did some testing on 2-node, 4-node and 8-node systems running Linux
> 3.10.something (slightly newer but still ancient).  Only the 8-node
> box (= same one Mithun used) shows the large effect (the 2-node box
> may be a tiny bit faster patched but I'm calling that noise for now...
> it's not slower, anyway).

Here's an attempt to use existing style better: a union, like LWLockPadded and WALInsertLockPadded.  I think we should back-patch to 10.  Thoughts?

--
Thomas Munro
http://www.enterprisedb.com
Attachments
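[The proposed patch itself is in the attachment above; a minimal sketch of the LWLockPadded-style union idiom it refers to might look like the following. The type names and the self-contained PG_CACHE_LINE_SIZE definition are assumptions for illustration, not necessarily what was committed:]

#include <semaphore.h>

/*
 * Illustrative sketch only: pad each POSIX semaphore out to a full cache
 * line so neighboring array elements do not share one, following the
 * union style of LWLockPadded / WALInsertLockPadded.  In PostgreSQL,
 * PG_CACHE_LINE_SIZE comes from pg_config_manual.h; it is defined here
 * only to keep the sketch self-contained.
 */
#define PG_CACHE_LINE_SIZE 128

typedef union SemTPadded
{
	sem_t		pgsem;
	char		pad[PG_CACHE_LINE_SIZE];
} SemTPadded;

/* One padded semaphore per slot of the shared-memory array. */
typedef struct PGSemaphoreData
{
	SemTPadded	sem_padded;
} PGSemaphoreData;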
Re: Possible performance regression in version 10.1 with pgbench read-write tests.
From:
Thomas Munro
Date:
On Mon, Jul 23, 2018 at 10:06 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Here's an attempt to use existing style better: a union, like
> LWLockPadded and WALInsertLockPadded.  I think we should back-patch to
> 10.  Thoughts?

Pushed to 10, 11, master.

It's interesting that I could see a further ~12% speedup by using VM page-size stride on that 8 socket machine, but that's something to look at another day.  The PG_CACHE_LINE_SIZE padding change gets us back to approximately where we were in 9.6.

/me . o O ( Gee, it'd be really nice to see this change on a graph on a web page that tracks a suite of tests on a farm of interesting machines on each branch over time. )

--
Thomas Munro
http://www.enterprisedb.com