Thread: write scalability
I've long harbored a suspicion, based on some testing I did on my home machine, that WALInsertLock is a big performance bottleneck. But I just did some benchmarking that doesn't entirely support that contention. This is on Nate Boley's 32-core machine, with the following settings:

max_connections = 100
shared_buffers = 8GB
synchronous_commit = off
checkpoint_segments = 100
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9

I did 5-minute pgbench runs with unlogged tables and with permanent tables, restarting the database server and reinitializing the tables between each run. The number at the beginning of each line is the number of clients, while the p/u indicates which type of tables were used:

01p tps = 628.626815 (including connections establishing)
01p tps = 635.394288 (including connections establishing)
01p tps = 634.972789 (including connections establishing)
08p tps = 3342.787325 (including connections establishing)
08p tps = 3883.876813 (including connections establishing)
08p tps = 3941.253567 (including connections establishing)
32p tps = 5597.472192 (including connections establishing)
32p tps = 5738.139895 (including connections establishing)
32p tps = 5794.490934 (including connections establishing)
80p tps = 4499.685286 (including connections establishing)
80p tps = 4917.060441 (including connections establishing)
80p tps = 5050.931933 (including connections establishing)

01u tps = 672.469142 (including connections establishing)
01u tps = 671.256686 (including connections establishing)
01u tps = 670.421003 (including connections establishing)
08u tps = 4087.749529 (including connections establishing)
08u tps = 3797.750851 (including connections establishing)
08u tps = 4181.393560 (including connections establishing)
32u tps = 8956.346905 (including connections establishing)
32u tps = 8898.442517 (including connections establishing)
32u tps = 8971.591569 (including connections establishing)
80u tps = 7287.550952 (including connections establishing)
80u tps = 7266.816989 (including connections establishing)
80u tps = 7255.968109 (including connections establishing)

The speed-up from using unlogged tables was not as large as I expected. Turning off synchronous_commit here removes commit rate as the bottleneck, and I think perhaps the main benefit of unlogged tables in that case is the avoidance of I/O; apparently this machine has enough I/O bandwidth, and enough memory in general, that that's not an issue.

With either type of table, the 8-client results are about 6.1 times the single-client results - not great, but not terrible either. With 32 clients, there is some improvement: 13.4x vs. 9.1x, but even 13.4x is a long way from linear. vmstat reveals that CPU usage is substantially less than 100%. After some investigation, I found that using unlogged tables wasn't actually getting rid of all the write-ahead logging - the commit records were still being issued. So I hacked up RecordTransactionCommit() not to emit transaction commit records in that case. That doesn't completely eliminate the WAL activity, because records are still emitted for zeroing new SLRU pages for CLOG, but it takes a big chunk out of it. The upshot is that this change improved both raw performance and scalability, but not dramatically.
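The hack itself isn't shown in the thread, but a minimal sketch of that kind of change might look like the helper below. This is a hypothetical illustration, not the actual patch: the helper name CommitRecordNeeded() is invented, though the things it checks (XactLastRecEnd, the dropped-relation and invalidation-message counts) mirror what the real RecordTransactionCommit() in xact.c already examines.

/*
 * Hypothetical sketch, not the actual hack from this thread: a predicate
 * RecordTransactionCommit() could consult before emitting the
 * XLOG_XACT_COMMIT record.  The idea is that a transaction which assigned
 * an XID but wrote no other WAL (e.g. because it only touched unlogged
 * tables) skips the commit record and relies on CLOG alone to mark itself
 * committed.
 */
#include "postgres.h"

#include "access/transam.h"
#include "access/xlog.h"

static bool
CommitRecordNeeded(TransactionId xid, int nrels, int nmsgs,
                   bool forceSyncCommit)
{
    /* Did this transaction generate any WAL at all?  (as in xact.c) */
    bool        wrote_xlog = (XactLastRecEnd.xrecoff != 0);

    if (!TransactionIdIsValid(xid))
        return false;           /* no XID assigned: never a commit record */

    /*
     * Dropped relation files, pending invalidation messages, or a forced
     * synchronous commit still require the record to be written.
     */
    if (nrels != 0 || nmsgs != 0 || forceSyncCommit)
        return true;

    /* Otherwise, emit it only if other WAL was already written. */
    return wrote_xlog;
}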
Unlogged table results, with this change:

01h tps = 708.189337 (including connections establishing)
01h tps = 704.030175 (including connections establishing)
01h tps = 701.644199 (including connections establishing)
08h tps = 5196.615955 (including connections establishing)
08h tps = 5126.162200 (including connections establishing)
08h tps = 5067.568727 (including connections establishing)
32h tps = 10661.275636 (including connections establishing)
32h tps = 10621.085226 (including connections establishing)
32h tps = 10575.267197 (including connections establishing)
80h tps = 7557.965666 (including connections establishing)
80h tps = 7545.697547 (including connections establishing)
80h tps = 7556.379921 (including connections establishing)

Now the 32-client numbers have risen to 15.1x the single-client numbers, but that's still not great.

What does this mean? Well, XLogInsert strikes me as an awfully complex piece of code to be running with a hot LWLock held in exclusive mode. But even so, I think this certainly means that WALInsertLock, at least on this workload, is not the whole problem... in this test, I'm not only eliminating the overhead of inserting the WAL, but also the overhead of writing it, flushing it, and generating it. So there is something, other than WAL insertion, which is taking a big bite out of performance here.

As to what that something might be, I reran this last test with LWLOCK_STATS enabled and here are the top things that are blocking:

lwlock 310: shacq 96846 exacq 108433 blk 16275
lwlock 3: shacq 64 exacq 2628381 blk 36027
lwlock 7: shacq 0 exacq 2628615 blk 85220
lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
lwlock 4: shacq 28667307 exacq 2628524 blk 3356651

During this 5-minute test run, an LWLock acquisition blocked 6180335 times. As you can see from the above results, ProcArrayLock accounts for 54% of that blocking, and CLogControlLock accounts for another 34%. lwlock 7 is WALInsertLock, which manages to account for more than 1% of the blocking despite the fact that WAL has been largely eliminated in this test; lwlock 3 is XidGenLock. I dunno what LWLock 310 is, but it doesn't matter. After that there is a long tail of other stuff with gradually decreasing amounts of blocking.

Experience with the read scalability stuff has taught me also to look at which LWLocks have the most shared acquisitions, as that can cause spinlock and cache line contention. The top few culprits are:

lwlock 504: shacq 5315030 exacq 0 blk 0
lwlock 45: shacq 5967317 exacq 13284 blk 1722
lwlock 39: shacq 8219988 exacq 13342 blk 2291
lwlock 5: shacq 26306020 exacq 0 blk 0
lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
lwlock 11: shacq 84913908 exacq 4539551 blk 2119423

In all, there were 238777533 shared LWLock acquisitions during this test: 35% CLogControlLock, 12% ProcArrayLock, 11% SInvalReadLock (soon to be dealt with, as discussed elsewhere on-list), and then it gets down into the lock manager locks and a few others.

I'm mulling over what to do about all of this, but thought I'd share the raw numbers first, in case anyone's interested.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
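For readers mapping the numeric IDs in the LWLOCK_STATS output to lock names: the fixed LWLocks of this era are identified by their position in the LWLockId enum in src/include/storage/lwlock.h, roughly as excerpted below (comments added; the remaining named locks and the partitioned/dynamic ranges are omitted). The assignments for 3, 4, 5, 7 and 11 match the identifications Robert gives in the message above.

/* Approximate excerpt from src/include/storage/lwlock.h, 9.1 era */
typedef enum LWLockId
{
    BufFreelistLock,            /*  0 */
    ShmemIndexLock,             /*  1 */
    OidGenLock,                 /*  2 */
    XidGenLock,                 /*  3 */
    ProcArrayLock,              /*  4 -- 54% of the blocking above */
    SInvalReadLock,             /*  5 */
    SInvalWriteLock,            /*  6 */
    WALInsertLock,              /*  7 -- >1% even with WAL mostly gone */
    WALWriteLock,               /*  8 */
    ControlFileLock,            /*  9 */
    CheckpointLock,             /* 10 */
    CLogControlLock,            /* 11 -- 34% of the blocking above */
    SubtransControlLock         /* 12 */

    /*
     * ... more individually named locks follow, then the buffer-mapping
     * and lock-manager partition ranges and the dynamically allocated
     * locks (buffer content/IO locks, SLRU buffer locks, ...), which is
     * why higher IDs such as 39, 45, 310 and 504 don't correspond to a
     * single named lock.
     */
} LWLockId;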
On Mon, Jul 25, 2011 at 3:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Experience with the read scalability stuff has taught me also to look
> at which LWLocks have the most shared acquisitions, as that can cause
> spinlock and cache line contention. The top few culprits are:
>
> lwlock 504: shacq 5315030 exacq 0 blk 0
> lwlock 45: shacq 5967317 exacq 13284 blk 1722
> lwlock 39: shacq 8219988 exacq 13342 blk 2291
> lwlock 5: shacq 26306020 exacq 0 blk 0
> lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
> lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
>
> In all, there were 238777533 shared LWLock acquisitions during this
> test: 35% CLogControlLock, 12% ProcArrayLock, 11% SInvalReadLock (soon
> to be dealt with, as discussed elsewhere on-list), and then it gets
> down into the lock manager locks and a few others.

hm, all the CLogControlLock acquisitions in clog.c appear to be
exclusive... or did you mean shared in some other sense?

TransactionIdGetStatus is taking an exclusive lock, which is a red flag
and a good optimization target, I think.

merlin
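For context on where those acquisitions actually happen: the clog.c path Merlin is looking at delegates the locking to the SLRU layer, so no CLogControlLock acquisition is visible in clog.c itself. A simplified sketch of the 9.1-era function (details elided):

/*
 * Simplified sketch of TransactionIdGetStatus() from clog.c of this era.
 * Note that there is no LWLockAcquire() here at all: CLogControlLock is
 * taken inside SimpleLruReadPage_ReadOnly() -- in shared mode when the
 * page is already resident -- and only released here.
 */
XidStatus
TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
{
    int         pageno = TransactionIdToPage(xid);
    int         byteno = TransactionIdToByte(xid);
    int         bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
    int         slotno;
    char       *byteptr;
    XidStatus   status;

    /* lock is acquired by SimpleLruReadPage_ReadOnly */
    slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid);
    byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;

    status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
    *lsn = ClogCtl->shared->group_lsn[GetLSNIndex(slotno, xid)];

    LWLockRelease(CLogControlLock);

    return status;
}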
Excerpts from Merlin Moncure's message of lun jul 25 17:19:58 -0400 2011:
> On Mon, Jul 25, 2011 at 3:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Experience with the read scalability stuff has taught me also to look
> > at which LWLocks have the most shared acquisitions, as that can cause
> > spinlock and cache line contention. The top few culprits are:
> >
> > lwlock 504: shacq 5315030 exacq 0 blk 0
> > lwlock 45: shacq 5967317 exacq 13284 blk 1722
> > lwlock 39: shacq 8219988 exacq 13342 blk 2291
> > lwlock 5: shacq 26306020 exacq 0 blk 0
> > lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
> > lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
> >
> > In all, there were 238777533 shared LWLock acquisitions during this
> > test: 35% CLogControlLock, 12% ProcArrayLock, 11% SInvalReadLock (soon
> > to be dealt with, as discussed elsewhere on-list), and then it gets
> > down into the lock manager locks and a few others.
>
> hm, all the CLogControlLock acquisitions in clog.c appear to be
> exclusive... or did you mean shared in some other sense?

SLRU control locks are also acquired indirectly by slru.c; see SimpleLruReadPage_ReadOnly.

> TransactionIdGetStatus is taking an exclusive lock, which is a red flag
> and a good optimization target, I think.

In fact, if the page that TransactionIdGetStatus is looking for is in the buffers, it'll only take a shared lock.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
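To make that concrete, the slru.c function Alvaro mentions looks roughly like this (simplified sketch of the 9.1-era code; the recently-used bookkeeping and error handling are omitted). On a hit the caller ends up holding the control lock -- CLogControlLock, for the clog SLRU -- in shared mode; only on a miss does it escalate to exclusive and perform a real read.

/* Simplified sketch of SimpleLruReadPage_ReadOnly() from slru.c, 9.1 era */
int
SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
{
    SlruShared  shared = ctl->shared;
    int         slotno;

    /* Try to find the page while holding only shared lock */
    LWLockAcquire(shared->ControlLock, LW_SHARED);

    for (slotno = 0; slotno < shared->num_slots; slotno++)
    {
        if (shared->page_number[slotno] == pageno &&
            shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
            shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)
            return slotno;      /* hit: return with shared lock still held */
    }

    /* No luck, so switch to normal exclusive lock and do a regular read */
    LWLockRelease(shared->ControlLock);
    LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);

    return SimpleLruReadPage(ctl, pageno, true, xid);
}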
On Mon, Jul 25, 2011 at 4:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> As to what that something might be, I reran this last test with
> LWLOCK_STATS enabled and here are the top things that are blocking:
>
> lwlock 310: shacq 96846 exacq 108433 blk 16275
> lwlock 3: shacq 64 exacq 2628381 blk 36027
> lwlock 7: shacq 0 exacq 2628615 blk 85220
> lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
> lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
>
> During this 5-minute test run, an LWLock acquisition blocked 6180335
> times. As you can see from the above results, ProcArrayLock accounts
> for 54% of that blocking, and CLogControlLock accounts for another
> 34%. lwlock 7 is WALInsertLock, which manages to account for more
> than 1% of the blocking despite the fact that WAL has been largely
> eliminated in this test...

I reran this test on master with permanent (not unlogged) tables, and got the following stats for blocking:

lwlock 11: shacq 58376164 exacq 2122076 blk 1130974
lwlock 4: shacq 15168924 exacq 1367303 blk 1555799
lwlock 7: shacq 0 exacq 8440615 blk 1726896
grand total: shacq 144563929 exacq 41618873 blk 4886107

So that's 35% WALInsertLock, 32% ProcArrayLock, and 23% CLogControlLock. No other single lock accounted for more than 1% of the blocking. It's a good guess we're going to have to fix more than one thing to really make this zippy.

Also note that CLogControlLock accounts for better than 40% of the shared-lock acquisitions across all LWLocks. The next-most-frequently share-locked LWLock is ProcArrayLock, with 10% of the total shared-lock acquisitions, followed by SInvalReadLock, at 9%.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
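For reference, the percentages quoted in this message follow directly from the counts above, for example:

\frac{1726896}{4886107} \approx 0.35 \;\text{(WALInsertLock share of blocking)}, \qquad
\frac{58376164}{144563929} \approx 0.40 \;\text{(CLogControlLock share of shared acquisitions)}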
On 07/25/2011 04:07 PM, Robert Haas wrote:
> I did 5-minute pgbench runs with unlogged tables and with permanent
> tables, restarting the database server and reinitializing the tables
> between each run.

Database scale? One or multiple pgbench worker threads? A reminder on the amount of RAM in the server would be helpful for interpreting the results too.

The other thing I'd recommend if you're running more write-heavy tests is to turn off autovacuum. Whether or not it kicks in depends on the test duration and the TPS rate, which adds a source of variability better avoided here. It also means that faster tests end up getting penalized by having it run near their end, which makes them no longer look like fast results.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
On Mon, Jul 25, 2011 at 10:14 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 07/25/2011 04:07 PM, Robert Haas wrote:
>> I did 5-minute pgbench runs with unlogged tables and with permanent
>> tables, restarting the database server and reinitializing the tables
>> between each run.
>
> Database scale? One or multiple pgbench worker threads? A reminder on
> the amount of RAM in the server would be helpful for interpreting the
> results too.

Ah, sorry. scale = 100, so small. pgbench invocation is:

pgbench -T $time -c $clients -j $clients

128GB RAM.

> The other thing I'd recommend if you're running more write-heavy tests
> is to turn off autovacuum. Whether or not it kicks in depends on the
> test duration and the TPS rate, which adds a source of variability
> better avoided here. It also means that faster tests end up getting
> penalized by having it run near their end, which makes them no longer
> look like fast results.

Well, I kind of don't want to do that, because surely you wouldn't run that way in production. Anyway, it seems clear that nearly all the contention is in CLogControlLock, ProcArrayLock, and WALInsertLock, and I don't think shutting off autovacuum is going to change that.

The point of my benchmarking here may be a bit different from the benchmarking you normally do: I'm not trying to figure out whether I've got the best hardware, or how good it is, so I don't really care about squeezing out every last tps. I just need numbers that can be reproduced quickly with different patches to see whether a given code change helps and by how much. I wasn't sure that 5-minute pgbench runs would be good enough for that, but so far the results have been stable enough that I'm fairly confident I'm seeing real effects rather than random noise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
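To put "so small" in perspective: pgbench creates 100,000 pgbench_accounts rows per unit of scale factor, so the working set here is roughly (an estimate, not a figure from the thread):

100 \times 100000 = 10^7 \ \text{accounts rows} \;\approx\; 1\text{--}1.5\ \text{GB}

which fits comfortably within the 8GB of shared_buffers, let alone 128GB of RAM.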
On Tue, Jul 26, 2011 at 9:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 25, 2011 at 10:14 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> On 07/25/2011 04:07 PM, Robert Haas wrote:
>>> I did 5-minute pgbench runs with unlogged tables and with permanent
>>> tables, restarting the database server and reinitializing the tables
>>> between each run.
>>
>> Database scale? One or multiple pgbench worker threads? A reminder on
>> the amount of RAM in the server would be helpful for interpreting the
>> results too.
>
> Ah, sorry. scale = 100, so small. pgbench invocation is:

It might be worthwhile to test only with the accounts and history tables and also increasing the number of statements in a transaction. Otherwise the tiny tables can quickly become a bottleneck.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
On Tue, Jul 26, 2011 at 11:40 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> It might be worthwhile to test only with the accounts and history
> tables and also increasing the number of statements in a transaction.
> Otherwise the tiny tables can quickly become a bottleneck.

What kind of bottleneck?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 26, 2011 at 12:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 26, 2011 at 11:40 AM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>> It might be worthwhile to test only with the accounts and history
>> tables and also increasing the number of statements in a transaction.
>> Otherwise the tiny tables can quickly become a bottleneck.
>
> What kind of bottleneck?

So many transactions trying to update a small set of rows in a table. Is that what we really want to measure? My thinking is that we might see different results if they are updating different parts of the table and the transaction start/stop overhead is spread across a few statements.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
On Tue, Jul 26, 2011 at 12:29 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> So many transactions trying to update a small set of rows in a table.
> Is that what we really want to measure? My thinking is that we might
> see different results if they are updating different parts of the table
> and the transaction start/stop overhead is spread across a few
> statements.

I think what I am suggesting is that the default pgbench test setup would probably not give you good scalability as the number of clients is increased, and one reason could be the contention in the small tables. So it might be a good idea to get rid of that and see if we get much better scalability and understand the other bottlenecks.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
On Tue, Jul 26, 2011 at 4:40 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> It might be worthwhile to test only with the accounts and history
> tables and also increasing the number of statements in a transaction.
> Otherwise the tiny tables can quickly become a bottleneck.

+1

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 07/26/2011 12:33 PM, Pavan Deolasee wrote:
> I think what I am suggesting is that the default pgbench test setup
> would probably not give you good scalability as the number of clients
> is increased, and one reason could be the contention in the small
> tables. So it might be a good idea to get rid of that and see if we
> get much better scalability and understand the other bottlenecks.

You can easily see this form of contention pulling down results when the database itself is really small and the overall transaction rate is very high. With Robert using a scale=100, no more than 80 clients, and transaction rates peaking at <10K TPS on a 32-core box, I wouldn't expect this form of contention to be a large issue. It may be dropping results a few percent, but I'd be surprised if it was any more than that.

It's easy enough to use "-N" to skip the updates to the smaller tellers and branches tables to pull that out of the way. TPS will go up, because it's doing less per transaction. That's not necessarily a better test case, though; it's just a different one. The regular case includes a lot of overwriting of the same blocks in the hot branches and tellers tables. That effectively pushes a lot more I/O toward being likely to happen at checkpoint time. Those tables rarely see any significant I/O outside of the checkpoint in the standard "TPC-B like" scenario, because their usage counts stay high most of the time.

Contention for small tables that are being heavily updated is still important to optimize too, though. Which type of test makes more sense depends on what aspect of performance you're trying to focus on. I'll sometimes do a full pgbench-tools "sweep" (a range of scales, with multiple client counts at each scale) in both the regular write mode and the "-N" write mode, just to collect the slightly different data each provides. The form of checkpoint I/O spike you see at sync time is very different in the two cases, but both usage profiles are important to consider and optimize. That Robert has started with the regular case doesn't worry me too much now that I've seen the parameters he's using; he's not running it in a way where I'd expect branch/teller contention to be a major limiting factor on the results.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us