Re: Proposal of tunable fix for scalability of 8.4

From Jignesh K. Shah
Subject Re: Proposal of tunable fix for scalability of 8.4
Date
Msg-id 49B9566C.3010708@sun.com
In response to Re: Proposal of tunable fix for scalability of 8.4  (Scott Carey <scott@richrelevance.com>)
Responses Re: Proposal of tunable fix for scalability of 8.4  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Re: Proposal of tunable fix for scalability of 8.4  (Scott Carey <scott@richrelevance.com>)
Re: Proposal of tunable fix for scalability of 8.4  (Greg Smith <gsmith@gregsmith.com>)
List pgsql-performance


On 03/12/09 13:48, Scott Carey wrote:
On 3/11/09 7:47 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

All I’m adding is that it makes some sense to me, based on my experience in CPU/RAM-bound scalability tuning.  It was suggested that the test itself didn’t even make sense.

I was wrong in my understanding of what the change did.  If it wakes ALL waiters up, a lock request may wait for an indeterminate amount of time.
However, if instead of waking up all of them it only wakes up the shared readers and leaves all the exclusive ones at the front of the queue, there is no possibility of starvation, since those exclusives will be at the front of the line after the wake-up batch.
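
A toy model of the wake-up policy described above, as a sketch only. The queue representation, the wake_shared_only name, and the process ids are hypothetical and are not taken from lwlock.c.

```python
from collections import deque

def wake_shared_only(wait_queue):
    """Split a FIFO wait queue into (woken, remaining).

    Every shared waiter is woken, wherever it sits in the queue.
    Exclusive waiters stay queued in their original order, so they
    end up at the front of the line after the wake-up batch.
    """
    woken = [w for w in wait_queue if w[1] == "shared"]
    remaining = deque(w for w in wait_queue if w[1] == "exclusive")
    return woken, remaining

# Hypothetical queue of (pid, requested mode), oldest waiter first.
queue = deque([(1, "exclusive"), (2, "shared"), (3, "shared"),
               (4, "exclusive"), (5, "shared")])
woken, remaining = wake_shared_only(queue)
print("woken:", [pid for pid, _ in woken])
print("still queued:", [pid for pid, _ in remaining])
```

In this model the exclusive waiters keep their relative order, which is the property the no-starvation argument relies on.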

As for this being a use case that is important:

*  SSDs will drive the % of use cases that are not I/O bound up significantly over the next couple of years.  All Postgres installations with less than about 100GB of data TODAY could avoid being I/O bound with current SSD technology, and those with less than 2TB can do so as well, but at high expense or with less proven technology like the ZFS L2ARC flash cache.
*  Intel will have a mainstream CPU that handles 12 threads (6 cores, 2 threads each) at the end of this year.  Mainstream two-CPU systems will have access to 24 threads and be common in 2010.  Higher-end 4-CPU boxes will have access to 48 CPU threads.  Hardware thread count is only going up.  This is the future.


SSDs are precisely my motivation for doing RAM-based tests with PostgreSQL. While waiting for my SSDs to arrive, I started emulating them by putting the whole database in RAM, which is in a sense better than SSDs, so if we can tune with RAM disks then SSDs will be covered.

What we have is a pool of 2000 users, and we start making each user do a series of transactions on different rows to see how much load the database can handle linearly before some bottleneck (system or database) kicks in and there can be no more linear increase in active users. Many times there is a drop after reaching some number of active users. If all 2000 users scale linearly, then another test with, say, 2500 can be executed. The whole point is to find how far we can go, typically until there are no system resources left to exploit.

That said, the test kit I am using is a lightweight OLTP-like workload that a user runs against a predefined schema, and between the various transactions it emulates a wait time of 200ms. In some sense it emulates a real user who clicks, waits to see what he got, and then clicks again, which results in another transaction.  (Not exactly, but you get the point.)
Like all workloads, it is generally used to find bottlenecks in systems before putting production work on them.
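
The load such a pool of users can generate can be estimated with the interactive response time law, X = N / (R + Z). This is a back-of-the-envelope sketch: it assumes one 200ms wait per transaction, and the function name and the 50ms response time are illustrative, not measurements from this test.

```python
def throughput_tps(users, think_time_s, response_time_s):
    """Closed-loop throughput: each user completes one transaction
    every (response time + think time) seconds."""
    return users / (response_time_s + think_time_s)

# Upper bound for 2000 users with 200ms think time and instant responses.
ceiling = throughput_tps(2000, 0.200, 0.0)
# Same pool if each transaction took a hypothetical 50ms to complete.
slower = throughput_tps(2000, 0.200, 0.050)
print(f"ceiling: {ceiling:.0f} tps ({ceiling * 60:.0f} tpm)")
print(f"with 50ms responses: {slower:.0f} tps")
```

Under these assumptions the wait time alone caps 2000 users at roughly 10,000 transactions per second, and any time spent waiting on locks adds to the response time and lowers that figure.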


In my current environment I am running similar workloads and seeing how many users it takes to reach the point where the system has no more CPU resources available for linear growth in tpm. Generally, as many of you mentioned, you will see disk latency, network latency, CPU resource problems, etc., and that is the work I am doing right now. I am working around network latency by using a private network and tuning operating system tunables to improve efficiency there. I am improving disk latency by putting the database on /RAM (and soon on SSDs). However, if I still cannot consume all the CPU, it means I am probably being hit by locks. Using the PostgreSQL DTrace probes I can see what's happening.

At low user counts (100 users), my lock profile from a single user's point of view is as follows:


# dtrace -q -s 84_lwlock.d 1764

              Lock Id            Mode           State           Count
        ProcArrayLock          Shared         Waiting               1
      CLogControlLock          Shared        Acquired               2
        ProcArrayLock       Exclusive         Waiting               3
        ProcArrayLock       Exclusive        Acquired              24
           XidGenLock       Exclusive        Acquired              24
     FirstLockMgrLock          Shared        Acquired              25
      CLogControlLock       Exclusive        Acquired              26
  FirstBufMappingLock          Shared        Acquired              55
        WALInsertLock       Exclusive        Acquired              75
        ProcArrayLock          Shared        Acquired             178
       SInvalReadLock          Shared        Acquired             378

              Lock Id            Mode           State   Combined Time (ns)
       SInvalReadLock                        Acquired                29849
        ProcArrayLock          Shared         Waiting                92261
        ProcArrayLock                        Acquired               951470
     FirstLockMgrLock       Exclusive        Acquired              1069064
      CLogControlLock       Exclusive        Acquired              1295551
        ProcArrayLock       Exclusive         Waiting              1758033
  FirstBufMappingLock       Exclusive        Acquired              2078507
           XidGenLock       Exclusive        Acquired              3460800
        WALInsertLock       Exclusive        Acquired             12205466
       SInvalReadLock       Exclusive        Acquired             42684236
        ProcArrayLock       Exclusive        Acquired             57397139
  
As users grow beyond 1000, it changes to the following, again from the sample user's point of view:
# dtrace -q  -s 84_lwlock.d 1764

              Lock Id            Mode           State           Count
      CLogControlLock       Exclusive         Waiting               1
        WALInsertLock       Exclusive         Waiting               1
        ProcArrayLock       Exclusive        Acquired               7
           XidGenLock       Exclusive        Acquired               7
        ProcArrayLock       Exclusive         Waiting              10
      CLogControlLock          Shared        Acquired              13
        WALInsertLock       Exclusive        Acquired              23
      CLogControlLock       Exclusive        Acquired              30
        ProcArrayLock          Shared        Acquired              50
     FirstLockMgrLock          Shared        Acquired             104
       SInvalReadLock          Shared        Acquired             105
  FirstBufMappingLock          Shared        Acquired             106

              Lock Id            Mode           State   Combined Time (ns)
        WALInsertLock       Exclusive         Waiting                73990
      CLogControlLock       Exclusive         Waiting               383066
           XidGenLock       Exclusive        Acquired               408301
      CLogControlLock       Exclusive        Acquired              1871642
        ProcArrayLock                        Acquired              2825372
        WALInsertLock       Exclusive        Acquired              3144580
     FirstLockMgrLock       Exclusive        Acquired              3799818
  FirstBufMappingLock       Exclusive        Acquired              4083473
       SInvalReadLock       Exclusive        Acquired             20611120
        ProcArrayLock       Exclusive        Acquired             37920098
        ProcArrayLock       Exclusive         Waiting           3783942020


That's similar to what I had seen last year. That is the reason I am playing with lwlock.c, to see how modifying LWLockRelease() to do different types of wake-ups affects this top waiting time, which is basically wasted time from the perspective of the application, operating system, and CPU. All I am saying is that with tuning flexibility we can actually reduce the time wasted and probably spend that time in the acquired state, doing useful work.
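
The share of profiled lock time spent waiting can be computed from the Combined Time columns of the two profiles above. This is a rough sketch: it treats Combined Time as time spent in each state for the sampled backend, and the variable and function names are made up for illustration.

```python
# (lock, state, combined time in ns), copied from the two profiles above.
LOW_USERS = [
    ("SInvalReadLock", "Acquired", 29849),
    ("ProcArrayLock", "Waiting", 92261),
    ("ProcArrayLock", "Acquired", 951470),
    ("FirstLockMgrLock", "Acquired", 1069064),
    ("CLogControlLock", "Acquired", 1295551),
    ("ProcArrayLock", "Waiting", 1758033),
    ("FirstBufMappingLock", "Acquired", 2078507),
    ("XidGenLock", "Acquired", 3460800),
    ("WALInsertLock", "Acquired", 12205466),
    ("SInvalReadLock", "Acquired", 42684236),
    ("ProcArrayLock", "Acquired", 57397139),
]
HIGH_USERS = [
    ("WALInsertLock", "Waiting", 73990),
    ("CLogControlLock", "Waiting", 383066),
    ("XidGenLock", "Acquired", 408301),
    ("CLogControlLock", "Acquired", 1871642),
    ("ProcArrayLock", "Acquired", 2825372),
    ("WALInsertLock", "Acquired", 3144580),
    ("FirstLockMgrLock", "Acquired", 3799818),
    ("FirstBufMappingLock", "Acquired", 4083473),
    ("SInvalReadLock", "Acquired", 20611120),
    ("ProcArrayLock", "Acquired", 37920098),
    ("ProcArrayLock", "Waiting", 3783942020),
]

def waiting_share(rows):
    """Fraction of the profiled lock time spent in the Waiting state."""
    waiting = sum(ns for _, state, ns in rows if state == "Waiting")
    total = sum(ns for _, _, ns in rows)
    return waiting / total

print(f"100 users:   {waiting_share(LOW_USERS):.1%} of lock time waiting")
print(f"1000+ users: {waiting_share(HIGH_USERS):.1%} of lock time waiting")
```

On these numbers, waiting goes from about 1.5% of the profiled lock time at 100 users to about 98% beyond 1000 users, almost all of it on ProcArrayLock in exclusive mode.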

I don't think I have misconfigured the system. I am just showing that there are ways to cut down some inefficiencies here, and showing test points. I am also showing where it does seem to help performance. It may not help in all cases, but I just gave you a test where it improves performance over the current behavior.

And again, this is the third time I am saying it: the test users also have some latency built into them, which is what is generally exploited to support more users than the number of CPUs on the system, and that's the point we want to exploit. Otherwise, if all users did their work with no latency, we would need 6+ billion CPUs to handle all possible users. Typically, as an administrator (system and database), I can only tweak/control latencies within my domain, that is, network, disk, CPUs, etc. Those are what I am tweaking to arrive at a *Configured* environment, and now I am trying to reduce lock contention/waits in PostgreSQL so that we have an optimized setup.

I am trying another run where I limit the number of woken-up threads to a pre-configured number, to see how various values pan out in terms of throughput on this server.
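
A minimal sketch of what capping the wake-up batch could look like in a toy queue model. The wake_limited name and the max_wakeups parameter are hypothetical; this is not the patch being tested and does not reflect how lwlock.c stores its waiters.

```python
from collections import deque

def wake_limited(wait_queue, max_wakeups):
    """Pop at most max_wakeups waiters from the front of the queue."""
    woken = []
    while wait_queue and len(woken) < max_wakeups:
        woken.append(wait_queue.popleft())
    return woken

# Try a few hypothetical limits against the same five waiters.
for limit in (1, 2, 8):
    queue = deque(range(1, 6))  # five hypothetical waiter pids
    woken = wake_limited(queue, limit)
    print(f"limit={limit}: woken={woken}, still queued={list(queue)}")
```

A small limit keeps the queue close to strict FIFO, while a large one approaches waking everybody, so the limit is the knob such a run would be sweeping.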

Regards,
Jignesh
