Discussion: pgbench vs. wait events


pgbench vs. wait events

From
Robert Haas
Date:
Hi,

I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench.  I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on.  During each run, I ran this psql script in another window and
captured the output:

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5

Then, I used a little shell-scripting to count up the number of times
each wait event occurred in the output.  First, I tried scale factor
3000 with 32 clients and got these results:
     1  LWLockTranche   | buffer_mapping
     9  LWLockNamed     | CLogControlLock
    14  LWLockNamed     | ProcArrayLock
    16  Lock            | tuple
    25  LWLockNamed     | CheckpointerCommLock
    49  LWLockNamed     | WALBufMappingLock
   122  LWLockTranche   | clog
   182  Lock            | transactionid
   287  LWLockNamed     | XidGenLock
  1300  Client          | ClientRead
  1375  LWLockTranche   | buffer_content
  3990  Lock            | extend
 21014  LWLockNamed     | WALWriteLock
 28497                  |
 58279  LWLockTranche   | wal_insert
 

tps = 1150.803133 (including connections establishing)

What jumps out here, at least to me, is that there is furious
contention on both the wal_insert locks and on WALWriteLock.
Apparently, the system simply can't get WAL on disk fast enough to
keep up with this workload.  Relation extension locks and
buffer_content locks are also pretty common, both ahead of
ClientRead, a relatively uncommon wait event on this test.  The load
average on the system was only about 3 during this test, indicating
that most processes are in fact spending most of their time off-CPU.
The first thing I tried was switching to unlogged tables, which
produces these results:
     1  BufferPin       | BufferPin
     1  LWLockTranche   | lock_manager
     2  LWLockTranche   | buffer_mapping
     8  LWLockNamed     | ProcArrayLock
     9  LWLockNamed     | CheckpointerCommLock
     9  LWLockNamed     | CLogControlLock
    11  LWLockTranche   | buffer_content
    37  LWLockTranche   | clog
   153  Lock            | tuple
   388  LWLockNamed     | XidGenLock
   827  Lock            | transactionid
  1267  Client          | ClientRead
 20631  Lock            | extend
 91767                  |
 

tps = 1223.239416 (including connections establishing)

If you don't look at the TPS number, these results look like a vast
improvement.  The overall amount of time spent not waiting for
anything is now much higher, and the problematic locks have largely
disappeared from the picture.  However, the load average now shoots up
to about 30, because most of the time that the backends are "not
waiting for anything" they are in fact in kernel wait state D; that
is, they're stuck doing I/O.  This suggests that we might want to
consider advertising a wait state when a backend is doing I/O, so we
can measure this sort of thing.

Next, I tried lowering the scale factor to something that fits in
shared buffers.  Here are the results at scale factor 300:
    14  Lock            | tuple
    22  LWLockTranche   | lock_manager
    39  LWLockNamed     | WALBufMappingLock
   331  LWLockNamed     | CLogControlLock
   461  LWLockNamed     | ProcArrayLock
   536  Lock            | transactionid
   552  Lock            | extend
   716  LWLockTranche   | buffer_content
   763  LWLockNamed     | XidGenLock
  2113  LWLockNamed     | WALWriteLock
  6190  LWLockTranche   | wal_insert
 25002  Client          | ClientRead
 78466                  |
 

tps = 27651.562835 (including connections establishing)

Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work.  ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare.  Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples.  It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.

However, I was curious about what's going on with CLogControlLock,
pursuant to the other thread on that topic, so I reran this with
unlogged tables.  At scale factor 300, 32 clients, unlogged tables, I
get these results:
     6  LWLockTranche   | wal_insert
    12  Lock            | tuple
    26  LWLockTranche   | lock_manager
    39  LWLockTranche   | buffer_content
   353  LWLockNamed     | CLogControlLock
   377  Lock            | extend
   491  Lock            | transactionid
   500  LWLockNamed     | ProcArrayLock
   696  LWLockNamed     | XidGenLock
 27685  Client          | ClientRead
 84930                  |
 

WAL contention is eliminated, and CLog contention doesn't increase.
Next, I tried ramping up the client count to see what effect that had.
These next three results are all at scale factor 300 with unlogged
tables.

64 clients:
     1  BufferPin       | BufferPin
     2  LWLockTranche   | buffer_mapping
    51  LWLockTranche   | wal_insert
    52  Lock            | relation
   132  LWLockTranche   | lock_manager
   154  LWLockTranche   | buffer_content
   209  Lock            | tuple
   940  Lock            | extend
  1412  LWLockNamed     | XidGenLock
  2337  LWLockNamed     | CLogControlLock
  2813  LWLockNamed     | ProcArrayLock
  3859  Lock            | transactionid
 83760  Client          | ClientRead
133840                  |
 

tps = 36672.484524 (including connections establishing)

96 clients:
     1  LWLockTranche   | proc
    53  LWLockTranche   | wal_insert
   305  LWLockTranche   | buffer_content
   391  LWLockTranche   | lock_manager
   797  Lock            | tuple
  1074  Lock            | extend
  2492  LWLockNamed     | XidGenLock
  4185  LWLockNamed     | ProcArrayLock
  6003  LWLockNamed     | CLogControlLock
 11348  Lock            | transactionid
142881  Client          | ClientRead
174011                  |
 

tps = 37214.257661 (including connections establishing)

192 clients:
     1  LWLockTranche   | buffer_mapping
     2  BufferPin       | BufferPin
     4  LWLockTranche   | clog
     8  LWLockTranche   | proc
    87  LWLockTranche   | wal_insert
   708  LWLockTranche   | buffer_content
  1892  LWLockTranche   | lock_manager
  2225  Lock            | extend
  4796  LWLockNamed     | ProcArrayLock
  5105  LWLockNamed     | XidGenLock
  7982  Lock            | tuple
 11882  LWLockNamed     | CLogControlLock
 64091  Lock            | transactionid
275481                  |
308627  Client          | ClientRead
 

tps = 35018.906863 (including connections establishing)

There are a couple of interesting things here.  First, CLogControlLock
contention becomes more significant as the number of clients goes
up. At 32 clients, it is the 6th-most frequent wait event; at 192
clients, it's moved up to 4th, and it's clearly growing at a faster
rate than some of the others.  As we go from 32 to 64 to 96 to 192
clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples,
but CLogControlLock is captured in 353, 2337, 6003, 11882 samples, so
it goes from being only 2/3 of ProcArrayLock to being more than twice
ProcArrayLock. However, in absolute terms, it's still not very
significant.  Even at 192 clients, where it's the fourth-most-common
wait event, it appears in only 1.7% of samples, which means backends
are only waiting for it about 1.7% of the time.  It appears that, on
this system, this workload is NEVER going to get meaningfully faster
by improving CLogControlLock.

Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources.  Third,
Lock/transactionid and Lock/tuple become more and more common as the
number of clients increases; these happen when two different pgbench
threads decide to hit the same row at the same time.  Due to the
birthday paradox this increases pretty quickly as the number of
clients ramps up, but it's not really a server issue: there's nothing
the server can do about the possibility that two or more clients pick
the same random number at the same time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgbench vs. wait events

From
Jeff Janes
Date:
On Thu, Oct 6, 2016 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Hi,

I decided to do some testing on hydra (IBM-provided community
resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
the newly-enhanced wait event stuff to try to get an idea of what
we're waiting for during pgbench.  I did 30-minute pgbench runs with
various configurations, but all had max_connections = 200,
shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
on.  During each run, I ran this psql script in another window and
captured the output:

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid()
\watch 0.5

Then, I used a little shell-scripting to count up the number of times
each wait event occurred in the output.  First, I tried scale factor
3000 with 32 clients and got these results:

Scale factor 3000 obviously doesn't fit in shared_buffers.  But does it fit in RAM?  That is, are the backends doing real IO, or are they just doing fake IO to the kernel's fs cache?

Cheers,

Jeff

Re: pgbench vs. wait events

From
Robert Haas
Date:
On Thu, Oct 6, 2016 at 4:40 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Scale factor 3000 obviously doesn't fit in shared_buffers.  But does it fit
> in RAM?  That is, are the backends doing real IO, or they just doing fake IO
> to the kernel's fs cache?

That's a good question.

[rhaas@hydra ~]$ free -g
             total       used       free     shared    buffers     cached
Mem:            61         26         34          0          0         24
-/+ buffers/cache:          2         58
Swap:           19          4         15

rhaas=# select pg_size_pretty(pg_relation_size('pgbench_accounts'));
 pg_size_pretty
----------------
 38 GB
(1 row)

rhaas=# select pg_size_pretty(pg_database_size(current_database()));
 pg_size_pretty
----------------
 44 GB
(1 row)

That's pretty tight, especially since I now notice Andres also left a
postmaster running on this machine back in April, with
shared_buffers=8GB.  44GB for this database plus 8GB for
shared_buffers plus 8GB for the other postmaster's shared_buffers
leaves basically no slack, so it was probably doing quite a bit of
real I/O, especially after the database got a bit of bloat.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgbench vs. wait events

From
Michael Paquier
Date:
On Fri, Oct 7, 2016 at 3:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I decided to do some testing on hydra (IBM-provided community
> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
> the newly-enhanced wait event stuff to try to get an idea of what
> we're waiting for during pgbench.  I did 30-minute pgbench runs with
> various configurations, but all had max_connections = 200,
> shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
> off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
> log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
> on.  During each run, I ran this psql script in another window and
> captured the output:

Nice. Thanks for sharing.

> \t
> select wait_event_type, wait_event from pg_stat_activity where pid !=
> pg_backend_pid()
> \watch 0.5
>
> Then, I used a little shell-scripting to count up the number of times
> each wait event occurred in the output.

Or an INSERT SELECT on an unlogged table? No need of extra maths then.
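That is, something along these lines, run from a second psql session
during the benchmark (table name purely illustrative):

create unlogged table wait_samples as
    select wait_event_type, wait_event
    from pg_stat_activity where pid != pg_backend_pid()
    with no data;

insert into wait_samples
    select wait_event_type, wait_event
    from pg_stat_activity where pid != pg_backend_pid()
\watch 0.5

-- afterwards, the tally comes straight from SQL:
select count(*), wait_event_type, wait_event
    from wait_samples
    group by 2, 3 order by 1;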

> First, I tried scale factor
> 3000 with 32 clients and got these results:
>
>       1  LWLockTranche   | buffer_mapping
> [...]
>   21014  LWLockNamed     | WALWriteLock
>   28497                  |
>   58279  LWLockTranche   | wal_insert
>
> tps = 1150.803133 (including connections establishing)
>
> What jumps out here is, at least to me, is that there is furious
> contention on both the wal_insert locks and on WALWriteLock.
> Apparently, the system simply can't get WAL on disk fast enough to
> keep up with this workload.   Relation extension locks and
> buffer_content locks also are also pretty common, both ahead of
> ClientRead, a relatively uncommon wait event on this test.  The load
> average on the system was only about 3 during this test, indicating
> that most processes are in fact spending most of their time off-CPU.

Would increasing the number of WAL insert slots help? With your tests
this is at 8 all the time.

> The first thing I tried was switching to unlogged tables, which
> produces these results:
>
>       1  BufferPin       | BufferPin
>       1  LWLockTranche   | lock_manager
>       2  LWLockTranche   | buffer_mapping
>       8  LWLockNamed     | ProcArrayLock
>       9  LWLockNamed     | CheckpointerCommLock
>       9  LWLockNamed     | CLogControlLock
>      11  LWLockTranche   | buffer_content
>      37  LWLockTranche   | clog
>     153  Lock            | tuple
>     388  LWLockNamed     | XidGenLock
>     827  Lock            | transactionid
>    1267  Client          | ClientRead
>   20631  Lock            | extend
>   91767                  |
>
> tps = 1223.239416 (including connections establishing)
>
> If you don't look at the TPS number, these results look like a vast
> improvement.  The overall amount of time spent not waiting for
> anything is now much higher, and the problematic locks have largely
> disappeared from the picture.  However, the load average now shoots up
> to about 30, because most of the time that the backends are "not
> waiting for anything" they are in fact in kernel wait state D; that
> is, they're stuck doing I/O.  This suggests that we might want to
> consider advertising a wait state when a backend is doing I/O, so we
> can measure this sort of thing.

Maybe something in fd.c.  It may be a good idea to actually have a
look at perf output on Linux to see where this contention is
happening, use that to decide where to put a wait point, and then see
whether other OSes share a similar pattern.
-- 
Michael



Re: pgbench vs. wait events

From
Alfred Perlstein
Date:
Robert,

This contention on WAL reminds me of another scenario I've heard about 
that was similar.

To fix things, what happened was that the first person to block would
be responsible for writing out all buffers for anyone blocked behind
"him".

For example, if you have many threads A, B, C, D:

While A is writing to WAL and holding the lock, B arrives; B of course
blocks, then C comes along and blocks as well, then D.

Finally A finishes its write and then ....

Now you have two options for resolving this, either

1) A drops its lock, B picks up the lock... B writes its buffer and then 
drops the lock.  Then C gets the lock, writes its buffer, drops the 
lock, then finally D gets the lock, writes its buffer and then drops the 
lock.

2) A then writes out B's, C's, and D's buffers, then A drops the lock, 
B, C and D wake up, note that their respective buffers are written and 
just return.  This greatly speeds up the system. (just be careful to 
make sure A doesn't do "too much work" otherwise you can get a sort of 
livelock if too many threads are blocked behind it, generally only issue 
one additional flush on behalf of other threads, do not "loop until the 
queue is empty")

I'm not sure if this is actually possible with the way WAL is
implemented (or perhaps this strategy is already implemented), but
it's definitely worth investigating if not done already, as it can
speed things up enormously.


On 10/6/16 11:38 AM, Robert Haas wrote:
> Hi,
>
> I decided to do some testing on hydra (IBM-provided community
> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
> the newly-enhanced wait event stuff to try to get an idea of what
> we're waiting for during pgbench.  I did 30-minute pgbench runs with
> various configurations, but all had max_connections = 200,
> shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
> off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
> log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
> on.  During each run, I ran this psql script in another window and
> captured the output:
>
> \t
> select wait_event_type, wait_event from pg_stat_activity where pid !=
> pg_backend_pid()
> \watch 0.5
>
> Then, I used a little shell-scripting to count up the number of times
> each wait event occurred in the output.  First, I tried scale factor
> 3000 with 32 clients and got these results:
>
>        1  LWLockTranche   | buffer_mapping
>        9  LWLockNamed     | CLogControlLock
>       14  LWLockNamed     | ProcArrayLock
>       16  Lock            | tuple
>       25  LWLockNamed     | CheckpointerCommLock
>       49  LWLockNamed     | WALBufMappingLock
>      122  LWLockTranche   | clog
>      182  Lock            | transactionid
>      287  LWLockNamed     | XidGenLock
>     1300  Client          | ClientRead
>     1375  LWLockTranche   | buffer_content
>     3990  Lock            | extend
>    21014  LWLockNamed     | WALWriteLock
>    28497                  |
>    58279  LWLockTranche   | wal_insert
>
> tps = 1150.803133 (including connections establishing)
>
> What jumps out here is, at least to me, is that there is furious
> contention on both the wal_insert locks and on WALWriteLock.
> Apparently, the system simply can't get WAL on disk fast enough to
> keep up with this workload.   Relation extension locks and
> buffer_content locks also are also pretty common, both ahead of
> ClientRead, a relatively uncommon wait event on this test.  The load
> average on the system was only about 3 during this test, indicating
> that most processes are in fact spending most of their time off-CPU.
> The first thing I tried was switching to unlogged tables, which
> produces these results:
>
>        1  BufferPin       | BufferPin
>        1  LWLockTranche   | lock_manager
>        2  LWLockTranche   | buffer_mapping
>        8  LWLockNamed     | ProcArrayLock
>        9  LWLockNamed     | CheckpointerCommLock
>        9  LWLockNamed     | CLogControlLock
>       11  LWLockTranche   | buffer_content
>       37  LWLockTranche   | clog
>      153  Lock            | tuple
>      388  LWLockNamed     | XidGenLock
>      827  Lock            | transactionid
>     1267  Client          | ClientRead
>    20631  Lock            | extend
>    91767                  |
>
> tps = 1223.239416 (including connections establishing)
>
> If you don't look at the TPS number, these results look like a vast
> improvement.  The overall amount of time spent not waiting for
> anything is now much higher, and the problematic locks have largely
> disappeared from the picture.  However, the load average now shoots up
> to about 30, because most of the time that the backends are "not
> waiting for anything" they are in fact in kernel wait state D; that
> is, they're stuck doing I/O.  This suggests that we might want to
> consider advertising a wait state when a backend is doing I/O, so we
> can measure this sort of thing.
>
> Next, I tried lowering the scale factor to something that fits in
> shared buffers.  Here are the results at scale factor 300:
>
>       14  Lock            | tuple
>       22  LWLockTranche   | lock_manager
>       39  LWLockNamed     | WALBufMappingLock
>      331  LWLockNamed     | CLogControlLock
>      461  LWLockNamed     | ProcArrayLock
>      536  Lock            | transactionid
>      552  Lock            | extend
>      716  LWLockTranche   | buffer_content
>      763  LWLockNamed     | XidGenLock
>     2113  LWLockNamed     | WALWriteLock
>     6190  LWLockTranche   | wal_insert
>    25002  Client          | ClientRead
>    78466                  |
>
> tps = 27651.562835 (including connections establishing)
>
> Obviously, there's a vast increase in TPS, and the backends seem to
> spend most of their time actually doing work.  ClientRead is now the
> overwhelmingly dominant wait event, although wal_insert and
> WALWriteLock contention is clearly still a significant problem.
> Contention on other locks is apparently quite rare.  Notice that
> client reads are really significant here - more than 20% of the time
> we sample what a backend is doing, it's waiting for the next query.
> It seems unlikely that performance on this workload can be improved
> very much by optimizing anything other than WAL writing, because no
> other wait event appears in more than 1% of the samples.  It's not
> clear how much of the WAL-related stuff is artificial lock contention
> and how much of it is the limited speed at which the disk can rotate.
>
> However, I was curious about what's going on with CLogControlLock,
> persuant to the other thread on that topic, so I reran this with
> unlogged tables.  At scale factor 300, 32 clients, unlogged tables, I
> get these results:
>
>        6  LWLockTranche   | wal_insert
>       12  Lock            | tuple
>       26  LWLockTranche   | lock_manager
>       39  LWLockTranche   | buffer_content
>      353  LWLockNamed     | CLogControlLock
>      377  Lock            | extend
>      491  Lock            | transactionid
>      500  LWLockNamed     | ProcArrayLock
>      696  LWLockNamed     | XidGenLock
>    27685  Client          | ClientRead
>    84930                  |
>
> WAL contention is eliminated, and CLog contention doesn't increase.
> Next, I tried ramping up the client count to see what effect that had.
> These next three results are all at scale factor 300 with unlogged
> tables.
>
> 64 clients:
>
>        1  BufferPin       | BufferPin
>        2  LWLockTranche   | buffer_mapping
>       51  LWLockTranche   | wal_insert
>       52  Lock            | relation
>      132  LWLockTranche   | lock_manager
>      154  LWLockTranche   | buffer_content
>      209  Lock            | tuple
>      940  Lock            | extend
>     1412  LWLockNamed     | XidGenLock
>     2337  LWLockNamed     | CLogControlLock
>     2813  LWLockNamed     | ProcArrayLock
>     3859  Lock            | transactionid
>    83760  Client          | ClientRead
>   133840                  |
>
> tps = 36672.484524 (including connections establishing)
>
> 96 clients:
>
>        1  LWLockTranche   | proc
>       53  LWLockTranche   | wal_insert
>      305  LWLockTranche   | buffer_content
>      391  LWLockTranche   | lock_manager
>      797  Lock            | tuple
>     1074  Lock            | extend
>     2492  LWLockNamed     | XidGenLock
>     4185  LWLockNamed     | ProcArrayLock
>     6003  LWLockNamed     | CLogControlLock
>    11348  Lock            | transactionid
>   142881  Client          | ClientRead
>   174011                  |
>
> tps = 37214.257661 (including connections establishing)
>
> 192 clients:
>
>        1  LWLockTranche   | buffer_mapping
>        2  BufferPin       | BufferPin
>        4  LWLockTranche   | clog
>        8  LWLockTranche   | proc
>       87  LWLockTranche   | wal_insert
>      708  LWLockTranche   | buffer_content
>     1892  LWLockTranche   | lock_manager
>     2225  Lock            | extend
>     4796  LWLockNamed     | ProcArrayLock
>     5105  LWLockNamed     | XidGenLock
>     7982  Lock            | tuple
>    11882  LWLockNamed     | CLogControlLock
>    64091  Lock            | transactionid
>   275481                  |
>   308627  Client          | ClientRead
>
> tps = 35018.906863 (including connections establishing)
>
> There are a couple of interesting things here.  First, CLogControlLock
> contention becomes more significant as the number of the clients goes
> up. At 32 clients, it is the 6th-most frequent wait event; at 192
> clients, it's moved up to 4th, and it's clearly growing at a faster
> rate than some of the others.  As we go from 32 to 64 to 96 to 192
> clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples,
> but CLogControLock is captured in 353, 2337, 6003, 11882 samples, so
> it goes from being only 2/3 of ProcArrayLock to being more than twice
> ProcArrayLock. However, in absolute terms, it's still not very
> significant.  Even at 192 clients, where it's the fourth-most-common
> wait event, it appears in only 1.7% of samples, which means backends
> are only waiting for it about 1.7% of the time.  It appears that, on
> this system, this workload is NEVER going to get meaningfully faster
> by improving CLogControlLock.
>
> Second, ClientRead becomes a bigger and bigger issue as the number of
> clients increases; by 192 clients, it appears in 45% of the samples.
> That basically means that pgbench is increasingly unable to keep up
> with the server; for whatever reason, it suffers more than the server
> does from the increasing lack of CPU resources.  Third,
> Lock/transactionid and Lock/tuple become more and more common as the
> number of clients increases; these happen when two different pgbench
> threads deciding to hit the same row at the same time.  Due to the
> birthday paradox this increases pretty quickly as the number of
> clients ramps up, but it's not really a server issue: there's nothing
> the server can do about the possibility that two or more clients pick
> the same random number at the same time.
>




Re: pgbench vs. wait events

From
Jeff Janes
Date:
On Thu, Oct 6, 2016 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Next, I tried lowering the scale factor to something that fits in
shared buffers.  Here are the results at scale factor 300:

     14  Lock            | tuple
     22  LWLockTranche   | lock_manager
     39  LWLockNamed     | WALBufMappingLock
    331  LWLockNamed     | CLogControlLock
    461  LWLockNamed     | ProcArrayLock
    536  Lock            | transactionid
    552  Lock            | extend
    716  LWLockTranche   | buffer_content
    763  LWLockNamed     | XidGenLock
   2113  LWLockNamed     | WALWriteLock
   6190  LWLockTranche   | wal_insert
  25002  Client          | ClientRead
  78466                  |

tps = 27651.562835 (including connections establishing)

Obviously, there's a vast increase in TPS, and the backends seem to
spend most of their time actually doing work.  ClientRead is now the
overwhelmingly dominant wait event, although wal_insert and
WALWriteLock contention is clearly still a significant problem.
Contention on other locks is apparently quite rare.  Notice that
client reads are really significant here - more than 20% of the time
we sample what a backend is doing, it's waiting for the next query.
It seems unlikely that performance on this workload can be improved
very much by optimizing anything other than WAL writing, because no
other wait event appears in more than 1% of the samples.  It's not
clear how much of the WAL-related stuff is artificial lock contention
and how much of it is the limited speed at which the disk can rotate.

What happens if you turn fsync off?  Once an xlog file is fully written, it is immediately fsynced, even if the backend is holding WALWriteLock or wal_insert (or both) at the time, and even if synchronous_commit is off.  Assuming this machine has a BBU so that it doesn't have to wait for disk rotation, fsyncs are still expensive because the kernel has to find all the data and get it sent over to the BBU, while holding locks.
 

....

Second, ClientRead becomes a bigger and bigger issue as the number of
clients increases; by 192 clients, it appears in 45% of the samples.
That basically means that pgbench is increasingly unable to keep up
with the server; for whatever reason, it suffers more than the server
does from the increasing lack of CPU resources. 

I would be careful about that interpretation.  If you asked pgbench, it would probably have the opposite opinion.

The backend tosses its response at the kernel (which will never block, because the pgbench responses are all small and the kernel will buffer them) and then goes into ClientRead.  After the backend goes into ClientRead, the kernel needs to find and wake up pgbench, deliver the response, and pgbench has to receive and process the response.  Only then does it create a new query (I've toyed before with having pgbench construct the next query while it is waiting for a response on the previous one, but that didn't seem promising, and much of pgbench has been rewritten since then) and pass the query back to the kernel.  Then the kernel has to find and wake up the backend and deliver the new query.  So for a reasonable chunk of the time that the server thinks it is waiting for the client, the client also thinks it is waiting for the server.

I think we need to come up with some benchmarking queries which get more work done per round-trip to the database, and build them into the binary, because otherwise people won't use them as much as they should if they have to pass "-f" files around mailing lists and blog postings.   For example, we could enclose the 5 statements of the TPC-B-like transaction into a single function which takes aid, bid, tid, and delta as arguments.  And presumably we could drop the other two statements (BEGIN and COMMIT) as well, and rely on autocommit to get that job done.  So we could go from 7 statements to 1.
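As a rough sketch of what such a function might look like (name and
exact shape purely illustrative, not a worked-out patch):

create function tpcb_like(p_aid int, p_bid int, p_tid int, p_delta int)
returns int as $$
begin
    update pgbench_accounts set abalance = abalance + p_delta where aid = p_aid;
    update pgbench_tellers  set tbalance = tbalance + p_delta where tid = p_tid;
    update pgbench_branches set bbalance = bbalance + p_delta where bid = p_bid;
    insert into pgbench_history (tid, bid, aid, delta, mtime)
        values (p_tid, p_bid, p_aid, p_delta, current_timestamp);
    return (select abalance from pgbench_accounts where aid = p_aid);
end;
$$ language plpgsql;

With something like that, each transaction is a single
"select tpcb_like(:aid, :bid, :tid, :delta);" from the client's point
of view.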

 
Third,
Lock/transactionid and Lock/tuple become more and more common as the
number of clients increases; these happen when two different pgbench
threads deciding to hit the same row at the same time.  Due to the
birthday paradox this increases pretty quickly as the number of
clients ramps up, but it's not really a server issue: there's nothing
the server can do about the possibility that two or more clients pick
the same random number at the same time.


What I have done in the past is chop a zero off from:

#define naccounts   100000

and recompile pgbench.  Then you can increase the scale factor so that you have less contention on pgbench_branches while still fitting the data in shared_buffers, or in RAM.

Cheers,

Jeff

Re: pgbench vs. wait events

From
Andres Freund
Date:
Hi,

On 2016-10-06 14:38:56 -0400, Robert Haas wrote:
> Obviously, there's a vast increase in TPS, and the backends seem to
> spend most of their time actually doing work.  ClientRead is now the
> overwhelmingly dominant wait event, although wal_insert and
> WALWriteLock contention is clearly still a significant problem.
> Contention on other locks is apparently quite rare.  Notice that
> client reads are really significant here - more than 20% of the time
> we sample what a backend is doing, it's waiting for the next query.
> It seems unlikely that performance on this workload can be improved
> very much by optimizing anything other than WAL writing, because no
> other wait event appears in more than 1% of the samples.

I don't think you meant that, but this sounds a bit like there's
nothing to do performance-wise - I don't think that's true.  There's no
significant lock contention, yes. But that obviously doesn't mean we
can't use resources more efficiently.


> Second, ClientRead becomes a bigger and bigger issue as the number of
> clients increases; by 192 clients, it appears in 45% of the samples.
> That basically means that pgbench is increasingly unable to keep up
> with the server; for whatever reason, it suffers more than the server
> does from the increasing lack of CPU resources.

Isn't that more a question of pgbench threads being preempted so server
processes can run? Since there aren't enough hardware threads for both
server and client processes to always run, you *have* to switch between
them.  Usually a read from the client after processing a query will
cause a directed wakeup of the other thread (which is why it's not a
very frequently observed state), but if the queue of to-be-run tasks is
very long that'll not happen.

Greetings,

Andres Freund



Re: pgbench vs. wait events

From
Andres Freund
Date:
Hi,

On 2016-10-06 18:15:58 -0400, Robert Haas wrote:
> That's pretty tight, especially since I now notice Andres also left a
> postmaster running on this machine back in April, with
> shared_buffers=8GB.

Oops, sorry for that.

- Andres



Re: pgbench vs. wait events

From
Andres Freund
Date:
Hi,

On 2016-10-06 20:52:22 -0700, Alfred Perlstein wrote:
> This contention on WAL reminds me of another scenario I've heard about that
> was similar.
> 
> To fix things what happened was that anyone that the first person to block
> would be responsible for writing out all buffers for anyone blocked behind
> "him".

We pretty much do that already. But while that's happening, the other
would-be writers show up as blocking on the lock.  We do use kind of
an odd locking model for the waiters (LWLockAcquireOrWait()), which
waits for the lock to be released but doesn't try to acquire it
afterwards. Instead the WAL position is rechecked, and in many cases
we'll be done afterwards, because enough has been written out.

Greetings,

Andres Freund



Re: pgbench vs. wait events

From
Alfred Perlstein
Date:

On 10/7/16 10:42 AM, Andres Freund wrote:
> Hi,
>
> On 2016-10-06 20:52:22 -0700, Alfred Perlstein wrote:
>> This contention on WAL reminds me of another scenario I've heard about that
>> was similar.
>>
>> To fix things what happened was that anyone that the first person to block
>> would be responsible for writing out all buffers for anyone blocked behind
>> "him".
> We pretty much do that already. But while that's happening, the other
> would-be writers show up as blocking on the lock.  We do use kind of
> an odd locking model for the waiters (LWLockAcquireOrWait()), which
> waits for the lock to be released but doesn't try to acquire it
> afterwards. Instead the WAL position is rechecked, and in many cases
> we'll be done afterwards, because enough has been written out.
>
> Greetings,
>
> Andres Freund
>
>
Are the batched writes all done before fsync is called?

Are you sure that A only calls fsync after flushing all the buffers from 
B, C, and D?  Or will it fsync twice?  Is there instrumentation to show 
that?

I know there's a tremendous level of skill involved in this code, but
I'm simply asking in case there are some tricks.

Another strategy that may work is intentionally waiting/buffering a few
ms between flushes/fsyncs: for example, make sure that the number of
flushes per second doesn't exceed some configurable amount, because
each flush likely eats at least one IOP from the disk and there is a
maximum IOPS per disk, so you might as well buffer more if you're
exceeding that IOPS budget.  You'll trade some latency, but gain
throughput for doing that.

Does this make sense?  Again apologies if this has been covered.  Is 
there a whitepaper or blog post or clear way I can examine the algorithm 
wrt locks/buffering for flushing WAL logs?

-Alfred







Re: pgbench vs. wait events

From
Robert Haas
Date:
On Fri, Oct 7, 2016 at 1:39 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-10-06 14:38:56 -0400, Robert Haas wrote:
>> Obviously, there's a vast increase in TPS, and the backends seem to
>> spend most of their time actually doing work.  ClientRead is now the
>> overwhelmingly dominant wait event, although wal_insert and
>> WALWriteLock contention is clearly still a significant problem.
>> Contention on other locks is apparently quite rare.  Notice that
>> client reads are really significant here - more than 20% of the time
>> we sample what a backend is doing, it's waiting for the next query.
>> It seems unlikely that performance on this workload can be improved
>> very much by optimizing anything other than WAL writing, because no
>> other wait event appears in more than 1% of the samples.
>
> I don't think you meant that, but this sounds a bit like that there's
> nothing to do performance-wise - I don't think that's true.  There's no
> significant lock-contention, yes. But that obviously doesn't mean we can
> use resources more efficiently.

Yeah, right.  Doing the same stuff faster will surely help.  It just
doesn't look like we can get anywhere by improving the locking, unless
it's something related to WAL writing.

>> Second, ClientRead becomes a bigger and bigger issue as the number of
>> clients increases; by 192 clients, it appears in 45% of the samples.
>> That basically means that pgbench is increasingly unable to keep up
>> with the server; for whatever reason, it suffers more than the server
>> does from the increasing lack of CPU resources.
>
> Isn't that more a question of pgbench threads beeing preemted, so server
> processes can run? Since there's not enough hardware threads for both
> servers and clients to always run, you *have* to switch between them.
> Usually a read from the client after processing a query will cause a
> directed wakeup of the other thread (which is why it's not a very
> frequently observed state), but if the queue of to-be-run tasks is very
> long that'll not happen.

Yeah, maybe I shouldn't have said it suffers more than the server, but
rather along with the server.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgbench vs. wait events

From
Robert Haas
Date:
On Fri, Oct 7, 2016 at 11:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> What happens if you turn fsync off?  Once a xlog file is fully written, it
> is immediately fsynced, even if the backend is holding WALWriteLock or
> wal_insert (or both) at the time, and even if synchrounous_commit is off.
> Assuming this machine has a BBU so that it doesn't have to wait for disk
> rotation, still fsyncs are expensive because the kernel has to find all the
> data and get it sent over to the BBU, while holding locks.

Scale factor 300, 32 clients, fsync=off:
     5  Lock            | tuple
    18  LWLockTranche   | lock_manager
    24  LWLockNamed     | WALWriteLock
    88  LWLockTranche   | buffer_content
   265  LWLockTranche   | wal_insert
   373  LWLockNamed     | CLogControlLock
   496  LWLockNamed     | ProcArrayLock
   532  Lock            | extend
   540  LWLockNamed     | XidGenLock
   545  Lock            | transactionid
 27067  Client          | ClientRead
 85364                  |
 

But I'm not sure you're right about the way the fsync=off code works.
I think pg_fsync()/pg_fsync_writethrough()/pg_fsync_no_writethrough()
look at enableFsync and just do nothing if it's false.

>> Second, ClientRead becomes a bigger and bigger issue as the number of
>> clients increases; by 192 clients, it appears in 45% of the samples.
>> That basically means that pgbench is increasingly unable to keep up
>> with the server; for whatever reason, it suffers more than the server
>> does from the increasing lack of CPU resources.
>
> I would be careful about that interpretation.  If you asked pgbench, it
> would probably have the opposite opinion.

Yeah, that's possible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgbench vs. wait events

From
Amit Kapila
Date:
On Fri, Oct 7, 2016 at 11:34 PM, Alfred Perlstein <alfred@freebsd.org> wrote:
>
> Are the batched writes all done before fsync is called?
>

In most cases, yes; but if the WAL segment (16MB in size) fills up
while writing, we fsync it, then do the remaining writes and perform
another fsync at the end.

> Are you sure that A only calls fsync after flushing all the buffers from B,
> C, and D?  Or will it fsync twice?
>

I don't think it is guaranteed; however, if B, C and D have finished
writing to the WAL buffers before A starts writing out buffers, then A
will combine all those writes and call fsync just once.

>  Is there instrumentation to show that?
>

Currently, I don't think we have instrumentation to show that, but you
can add it if you want.  In XLogFlush(), we do print the write request
values (if XLOG_DEBUG is set and the code was compiled with the
WAL_DEBUG option); you could put one print in XLogWrite() once the
write/flush is done and another print when the request is satisfied
without calling XLogWrite().

> I know there's a tremendous level of skill involved in this code, but simply
> asking in case there's some tricks.
>
> Another strategy that may work is actually intentionally waiting/buffering
> some few ms between flushes/fsync,

We do that before attempting to write, if the user has set the
"commit_delay" and "commit_siblings" GUC parameters.

Now here, we can't buffer the fsync requests, as currently we do both
writes and fsync under one lock.  However, if we can split the work
such that writes are done under one lock and fsync under a separate
lock, then probably we can try to buffer fsync requests and, after
fsyncing the current pending requests, recheck whether there are more
pending requests and try to flush them.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: pgbench vs. wait events

From
Jeff Janes
Date:
On Fri, Oct 7, 2016 at 1:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Oct 7, 2016 at 11:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> What happens if you turn fsync off?  Once a xlog file is fully written, it
> is immediately fsynced, even if the backend is holding WALWriteLock or
> wal_insert (or both) at the time, and even if synchrounous_commit is off.
> Assuming this machine has a BBU so that it doesn't have to wait for disk
> rotation, still fsyncs are expensive because the kernel has to find all the
> data and get it sent over to the BBU, while holding locks.

Scale factor 300, 32 clients, fsync=off:

      5  Lock            | tuple
     18  LWLockTranche   | lock_manager
     24  LWLockNamed     | WALWriteLock
     88  LWLockTranche   | buffer_content
    265  LWLockTranche   | wal_insert
    373  LWLockNamed     | CLogControlLock
    496  LWLockNamed     | ProcArrayLock
    532  Lock            | extend
    540  LWLockNamed     | XidGenLock
    545  Lock            | transactionid
  27067  Client          | ClientRead
  85364                  |


Did the TPS go up appreciably?
 

But I'm not sure you're right about the way the fsync=off code works.
I think pg_fsync()/pg_fsync_writethrough()/pg_fsync_no_writethrough()
look at enableFsync and just do nothing if it's false.

I think we are in agreement.  I don't know which part you think is wrong.  When I said immediately, I didn't mean unconditionally.

Anyway, based on the reduced wait events, I think this shows that if we need to do something in the xlog area, probably what it would be is to add a queue of fully written but un-synced xlog files, so that the syncing can be delegated to the background wal writer process.  And of course anyone needing to actually flush their xlog would have to start by flushing the queue.

(Or perhaps just make the xlog files bigger, and call it a day)

Cheers,

Jeff

Re: pgbench vs. wait events

From
Jeff Janes
Date:
On Fri, Oct 7, 2016 at 11:14 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

> Another strategy that may work is actually intentionally waiting/buffering
> some few ms between flushes/fsync,

We do that before attempting to write if user has set "commit_delay"
and "commit_siblings" guc parameters.

If you have a fast, high resolution timer, then one thing you can do is keep track of when the previous xlog sync finished. Then instead of having commit_delay be an absolute amount of time to sleep, it would mean "wait until that amount of time has passed since the previous sync finished."  So you would set it based on the RPM of your drive, so that the time it is sleeping to allow more work to happen from other processes is time it would have to spend waiting on rotational delay anyway.

But I dropped this, because it would be hard to tune, hard to implement in a cross-platform way, and because anyone with such high performance needs is just going to buy a nonvolatile write-cache and be done with it.
 

Now here, we can't buffer the fsync requests as current we are doing
both writes and fsync under one lock.  However, if we can split the
work such that writes are done under one lock and fsync under separate
lock, then probably we can try to buffer fsync requests and after
fsyncing the current pending requests, we can recheck if there are
more pending requests and try to flush them.

What I implemented at one point was:

(Already have the lock before getting here)
Write to the extent it is ready to be written.
Update the shared structure to reflect written upto.
Drop the lock
fsync
Take the lock again
update the shared structure to reflect flushed upto.
Drop the lock again.

This way, multiple processes could all be waiting on the kernel's fsync response, rather than on each other's locks.  What I was hoping would happen is that if one process wrote everything that was ready and called fsync, then while it was waiting for the platter to come around to the write head, more processes could make more data ready, write that data, and call an fsync of their own.  And the kernel would be smart enough to amalgamate them together.  But the kernel evidently was not that smart, and performance did not improve.

Cheers,

Jeff

Re: pgbench vs. wait events

From
Jeff Janes
Date:
On Fri, Oct 7, 2016 at 8:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

I think we need to come up with some benchmarking queries which get more work done per round-trip to the database. And build them into the binary, because otherwise people won't use them as much as they should if they have to pass "-f" files around mailing lists and blog postings.   For example, we could enclose 5 statements of the TPC-B-like into a single function which takes aid, bid, tid, and delta as arguments.  And presumably we could drop the other two statements (BEGIN and COMMIT) as well, and rely on autocommit to get that job done.  So we could go from 7 statements to 1.


Here is an implementation of that.  I've included the calling code as a patch to pgbench, because if I make it a separate -f file then it is a pain to get the correct scale and settings of naccounts, etc., into it.

The create script could be integrated into pgbench -i if this is something we might want to commit.

This gives me an almost 3 fold increase in performance on a system with fsync turned off:


pgbench -b tpcb-func  -T30 -c8 -j8
tps = 24193.197877 (excluding connections establishing)

pgbench -b tpcb-like  -T30 -c8 -j8
tps = 8434.746300 (excluding connections establishing)


Cheers,

Jeff


Attachments

Re: pgbench vs. wait events

From
Bruce Momjian
Date:
On Thu, Oct  6, 2016 at 02:38:56PM -0400, Robert Haas wrote:
> Hi,
> 
> I decided to do some testing on hydra (IBM-provided community
> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
> the newly-enhanced wait event stuff to try to get an idea of what
> we're waiting for during pgbench.  I did 30-minute pgbench runs with
> various configurations, but all had max_connections = 200,
> shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
> off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
> log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
> on.  During each run, I ran this psql script in another window and
> captured the output:

This is a great study that shows how the new instrumentation has given
us a new window into performance.  I am frankly surprised we got as far
as we did in finding performance bottlenecks before we had this
instrumentation.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: pgbench vs. wait events

From
Robert Haas
Date:
On Mon, Oct 10, 2016 at 9:44 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Thu, Oct  6, 2016 at 02:38:56PM -0400, Robert Haas wrote:
>> I decided to do some testing on hydra (IBM-provided community
>> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
>> the newly-enhanced wait event stuff to try to get an idea of what
>> we're waiting for during pgbench.  I did 30-minute pgbench runs with
>> various configurations, but all had max_connections = 200,
>> shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
>> off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
>> log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
>> on.  During each run, I ran this psql script in another window and
>> captured the output:
>
> This is a great study that shows how the new instrumentation has given
> us a new window into performance.  I am frankly surprised we got as far
> as we did in finding performance bottlenecks before we had this
> instrumentation.

Thanks, and +1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company