Discussion: lru_multiplier and backend page write-outs
Hello,

I've had the feeling for a while that the pg_stat_bgwriter statistics don't work quite the way I have understood them (based on the excellent [1] and the pg docs).

I am now monitoring a database that has an lru_multiplier of 4.0, a delay of 200ms and a maxpages of 1000. Current stats:

postgres=# select * from pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------
               241 |              17 |              72803 |             0 |                0 |           81015 |         81708
(1 row)

This is while the database is undergoing continuous activity (almost exclusively writing), but at a rate that does not saturate the underlying storage (caching RAID controller; all write ops are fast, and the cache is never filled).

In addition, PostgreSQL is not even close to filling its buffer cache. The buffer cache is configured at 1 GB, and the resident size of the PostgreSQL process is only 80-90 MB so far. So even independently of any lru_multiplier setting, delays and whatever else, I don't see why any backend would ever have to do its own writeouts in order to allocate a page from the buffer cache.

One theory: is it the autovacuum process? Stracing those, I've seen that they very often do writes directly to disk.

In any case, the reason I am fixating on buffers_backend is that I am after a clear indication of whether any "normal" backend (non-autovacuum or anything like that) is ever having to block on disk writes, other than WAL fsync:s.

Is a non-zero buffers_backend consistent with expected behavior?

[1] http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
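[For orientation, the write shares implied by the sample output above can be computed directly. A throwaway Python sketch, using the numbers exactly as posted:]

```python
# pg_stat_bgwriter sample taken verbatim from the message above.
stats = {
    "checkpoints_timed": 241,
    "checkpoints_req": 17,
    "buffers_checkpoint": 72803,
    "buffers_clean": 0,       # background writer LRU cleaning
    "maxwritten_clean": 0,
    "buffers_backend": 81015, # writes done directly by backends
    "buffers_alloc": 81708,
}

# Every buffer write is attributed to exactly one of these three counters.
total_writes = (stats["buffers_checkpoint"]
                + stats["buffers_clean"]
                + stats["buffers_backend"])
backend_share = stats["buffers_backend"] / total_writes

print(f"total buffer writes: {total_writes}")      # 153818
print(f"backend write share: {backend_share:.1%}")  # 52.7%
```

So more than half of all buffer writes here are being done by backends (or processes counted as backends, such as vacuum), with the LRU background writer contributing nothing, which is what the question is about.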
On Wed, 5 Nov 2008, Peter Schuller wrote:

> In addition, PostgreSQL is not even close to filling its buffer
> cache. The buffer cache is configured at 1 GB, and the resident size
> of the PostgreSQL process is only 80-90 MB so far. So even
> independently of any lru_multiplier setting, delays and whatever else,
> I don't see why any backend would ever have to do its own writeouts in
> order to allocate a page from the buffer cache.

Any buffer that you've accessed recently gets its usage count incremented so that the background writer won't touch it--the current one only writes things where that count is 0. The only mechanism which drops that usage count back down again kicks in once you've used all the buffers in the cache. You need some pressure to evict buffers that can't fit anymore before the background writer has any useful role to play in PostgreSQL 8.3.

At one point I envisioned making it smart enough to try and handle the scenario you describe--on an idle system, you may very well want to write out dirty and recently accessed buffers if there's nothing else going on. But such behavior is counter-productive on a busy system, which is why a similar mechanism that existed before 8.3 was removed. Making that only happen when idle requires a metric for what "busy" means, which is tricky to do given the information available to this particular process.

Short version: if you never fill the buffer cache, buffers_clean will always be zero, and you'll only see writes by checkpoints and things not operating with the standard client buffer allocation mechanism. Which brings us to...

> One theory: is it the autovacuum process? Stracing those, I've seen
> that they very often do writes directly to disk.

In order to keep it from using up the whole cache with maintenance overhead, vacuum allocates a 256K ring of buffers and re-uses ones from there whenever possible. That will generate buffers_backend writes when that ring fills but it has more left to scan. Your theory that all the backend writes are coming from vacuum seems consistent with what you've described.

You might even want to drop the two background writer parameters you've tweaked upwards back down closer to their original values. I get the impression you might have increased those hoping for more background writer work because you weren't seeing any. If you ever do get to where your buffer cache is full and the background writer starts doing something, those could jump from ineffective to wastefully heavy at that point.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
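[The clock-sweep behavior described above can be illustrated with a toy model. This is a deliberately simplified sketch, not the actual bufmgr.c logic: buffer contents and dirtiness are ignored, and only the usage-count/eviction mechanic is modeled.]

```python
def clock_sweep_model(n_buffers, allocations):
    """Toy model of the 8.3-style clock sweep: while free (never-used)
    buffers remain, allocations take those and no eviction happens.
    Only once the pool is full does the sweep hand decrement usage
    counts and evict buffers that reach zero -- which is the pressure
    the LRU background writer piggybacks on."""
    usage = [0] * n_buffers
    used = 0          # buffers handed out so far
    evictions = 0
    hand = 0          # clock-sweep position
    for _ in range(allocations):
        if used < n_buffers:
            # Free buffer still available: no pressure, no eviction.
            usage[used] = 1
            used += 1
            continue
        # Pool full: sweep, decrementing usage counts until one hits 0.
        while usage[hand] > 0:
            usage[hand] -= 1
            hand = (hand + 1) % n_buffers
        usage[hand] = 1          # evict this buffer and reuse it
        evictions += 1
        hand = (hand + 1) % n_buffers
    return evictions

# While the cache never fills, there is nothing for the LRU writer to do:
assert clock_sweep_model(n_buffers=100, allocations=80) == 0
# Once it fills, every further allocation forces an eviction:
assert clock_sweep_model(n_buffers=100, allocations=150) == 50
```

This is why, in the situation in the original message (1 GB of shared_buffers, only 80-90 MB resident), buffers_clean stays at zero: no buffer's usage count is ever driven down to where the background writer is allowed to write it.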
Hello,

> At one point I envisioned making it smart enough to try and handle the
> scenario you describe--on an idle system, you may very well want to write
> out dirty and recently accessed buffers if there's nothing else going on.
> But such behavior is counter-productive on a busy system, which is why a
> similar mechanism that existed before 8.3 was removed. Making that only
> happen when idle requires a metric for what "busy" means, which is tricky
> to do given the information available to this particular process.
>
> Short version: if you never fill the buffer cache, buffers_clean will
> always be zero, and you'll only see writes by checkpoints and things not
> operating with the standard client buffer allocation mechanism. Which
> brings us to...

Sure. I am not really out to get the background writer to pre-emptively do "idle trickling". Though I can see cases where one might care about this (such as lessening the impact of OS buffer cache delays on checkpoints), it's not what I am after now.

> > One theory: is it the autovacuum process? Stracing those, I've seen
> > that they very often do writes directly to disk.
>
> In order to keep it from using up the whole cache with maintenance
> overhead, vacuum allocates a 256K ring of buffers and re-uses ones
> from there whenever possible. That will generate buffers_backend writes
> when that ring fills but it has more left to scan. Your theory that all
> the backend writes are coming from vacuum seems consistent with what
> you've described.

The bit that is inconsistent with this theory, given the above ring buffer description, is that I saw the backend write-out count increasing constantly during the write activity I was generating to the database. However (because in this particular case it was a small database used for some latency-related testing), no table was ever large enough that 256k of buffers would ever be filled by the process of vacuuming a single table. Most tables would likely have been a handful to a couple of hundred pages large.

In addition, when I say "constantly" above I mean that the count increases even between successive SELECT:s (of the stat table) with only a second or two in between. In the absence of long-running vacuums, that discounts vacuuming, because the naptime is 1 minute. In fact this already discounted vacuuming even without the added information you provided above, but I didn't realize it when originally posting. The reason I mentioned vacuuming was that the use case is such that we do have a lot of tables constantly getting writes and updates, but they are all small.

Anything else known that might be generating the writes, if it is not vacuuming?

> You might even want to drop the two background writer parameters you've
> tweaked upwards back down closer to their original values. I get the
> impression you might have increased those hoping for more background
> writer work because you weren't seeing any. If you ever do get to where
> your buffer cache is full and the background writer starts doing
> something, those could jump from ineffective to wastefully heavy at that
> point.

I tweaked it in order to eliminate backends having to do "synchronous" (with respect to the operating system, even if not with respect to the underlying device) writes. The idea is that writes to the operating system are less understood/controlled in terms of any latency they may cause. It would be very nice if the backend writes were always zero under normal circumstances (or at least grew very rarely, in edge cases where the JIT policy did not succeed), in order to make it a more relevant and rare observation that the backend write-outs are systematically increasing.

On this topic, btw: was it considered to allow the administrator to specify a fixed-size margin to use when applying the JIT policy? (The JIT policy and logic itself being exactly the same still.) Especially with larger buffer caches, that would perhaps allow the administrator to make a call to truly eliminate synchronous writes during normal operation, while not adversely affecting anything (if the buffer cache is 1 GB, having a margin of, say, 50 MB does not really matter much in terms of wasting memory, yet could have a significant impact on eliminating synchronous write-outs).

On a system where you really want to keep backend writes at exactly 0 under normal circumstances (discounting vacuuming), and which has a large buffer cache (say the one gig), it might be nice to be able to say "ok - I have 1 GB of buffer cache. For the purpose of the JIT algorithm, please pretend it's only 900 MB". The result is 100 MB of constantly sized "margin", with respect to ensuring writes are asynchronous.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
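[The margin proposed in the message above could be folded into the just-in-time target computation along these lines. A hypothetical sketch: margin_buffers is not a real PostgreSQL parameter, and the formula is a simplification of the actual bgwriter_lru_multiplier logic (which also smooths recent_alloc over time).]

```python
def jit_clean_target(recent_alloc, lru_multiplier, lru_maxpages,
                     margin_buffers=0):
    """Simplified JIT cleaning target: clean enough buffers to cover
    the next round of allocations, scaled up by lru_multiplier and
    capped at bgwriter_lru_maxpages per round. `margin_buffers` is the
    hypothetical fixed reserve from the proposal above -- e.g. 50 MB
    of 8 kB pages = 6400 buffers -- kept clean at all times."""
    target = int(recent_alloc * lru_multiplier) + margin_buffers
    return min(target, lru_maxpages)

# With the settings from the original message (multiplier 4.0,
# maxpages 1000) and, say, 100 allocations in the last round:
assert jit_clean_target(100, 4.0, 1000) == 400
# A 50 MB margin immediately runs into the maxpages cap, so a larger
# cap would be needed while the margin is being built up:
assert jit_clean_target(100, 4.0, 1000, margin_buffers=6400) == 1000
```

As the thread discusses, the hard part is not this arithmetic but knowing how far ahead the cleaning actually is without intrusively scanning buffer headers.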
On Thu, 6 Nov 2008, Peter Schuller wrote:

>> In order to keep it from using up the whole cache with maintenance
>> overhead, vacuum allocates a 256K ring of buffers and re-uses ones
>> from there whenever possible.
>
> no table was ever large enough that 256k of buffers would ever be filled by
> the process of vacuuming a single table.

Not 256K buffers--256K, i.e. 32 buffers.

> In addition, when I say "constantly" above I mean that the count
> increases even between successive SELECT:s (of the stat table) with
> only a second or two in between.

Writes to the database when only doing read operations are usually related to hint bits: http://wiki.postgresql.org/wiki/Hint_Bits

> On this topic, btw: was it considered to allow the administrator to
> specify a fixed-size margin to use when applying the JIT policy?

Right now, there's no way to know exactly what's in the buffer cache without scanning the individual buffers, which requires locking their headers so you can see them consistently. No one process can get the big picture without doing something intrusive like that, and on a busy system the overhead of collecting more data to know exactly how far ahead the cleaning is can drag down overall performance. A lot can happen while the background writer is sleeping.

One next-generation design which has been sketched out but not even prototyped would take cleaned buffers and add them to the internal list of buffers that are free, which right now is usually empty on the theory that cached data is always more useful than a reserved buffer. If you developed a reasonable model for how many buffers you needed and padded that appropriately, that's the easiest way (given the rest of the buffer manager code) to get close to ensuring there aren't any backend writes. Because you've got the OS buffering writes anyway in most cases, it's hard to pin down whether that actually improved worst-case latency, though. And moving in that direction always seems to reduce average throughput, even in write-heavy benchmarks.

The important thing to remember is that the underlying OS has its own read and write caching mechanisms here, and unless the PostgreSQL ones are measurably better than those, you might as well let the OS manage the problem instead. It's easy to demonstrate that's happening when you give a decent amount of memory to shared_buffers; it's much harder to prove that's the case for an improved write scheduling algorithm. Stepping back a bit, you might even consider that one reason PostgreSQL has grown as well as it has in scalability is exactly because it's been riding improvements in the underlying OS in many of these cases, rather than trying to do all the I/O scheduling itself.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
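[The never-prototyped free-list design described above might look roughly like this toy model. All names here are illustrative, not actual PostgreSQL code; the point is only the mechanic: the background writer keeps the free list padded to a target depth, and a backend only does its own write-out when the list runs dry.]

```python
from collections import deque

class PaddedFreeList:
    """Toy model: cleaned buffers are pushed onto a free list that the
    background writer tries to keep `pad` entries deep, so a backend
    allocating a buffer can usually pop a clean one instead of
    writing out a dirty victim itself (a buffers_backend write)."""

    def __init__(self, pad):
        self.pad = pad
        self.free = deque()
        self.backend_writes = 0

    def bgwriter_round(self, cleaned_buffer_ids):
        # Refill the free list up to the target depth with buffers the
        # background writer has just cleaned.
        for buf in cleaned_buffer_ids:
            if len(self.free) >= self.pad:
                break
            self.free.append(buf)

    def alloc(self):
        if self.free:
            # Clean buffer ready: the backend never touches the disk.
            return self.free.popleft()
        # Miss: the backend must write out a dirty victim itself.
        self.backend_writes += 1
        return None

fl = PaddedFreeList(pad=2)
fl.bgwriter_round([101, 102, 103])   # only 2 accepted: list is at depth
fl.alloc(); fl.alloc()               # both served from the free list
fl.alloc()                           # list dry -> one backend write
assert fl.backend_writes == 1
```

Sizing `pad` is exactly the "reasonable model for how many buffers you need" problem mentioned above, and the padding trades cached data for reserved buffers, which is the throughput cost the message warns about.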
> > no table was ever large enough that 256k of buffers would ever be filled by
> > the process of vacuuming a single table.
>
> Not 256K buffers--256K, i.e. 32 buffers.

Ok.

> > In addition, when I say "constantly" above I mean that the count
> > increases even between successive SELECT:s (of the stat table) with
> > only a second or two in between.
>
> Writes to the database when only doing read operations are usually related
> to hint bits: http://wiki.postgresql.org/wiki/Hint_Bits

Sorry, I didn't mean to imply read-only operations (I did read the hint bits information a while back, though). What I meant was that while I was constantly generating the insert/delete/update activity, I was selecting the bgwriter stats with only a second or two in between. The intent was to convey that the count of backend-written pages was systematically and constantly (as in a few hundred per handful of seconds) increasing, in spite of no long-running vacuum and the buffer cache not being close to full.

> > On this topic, btw: was it considered to allow the administrator to
> > specify a fixed-size margin to use when applying the JIT policy?
>
> Right now, there's no way to know exactly what's in the buffer cache
> without scanning the individual buffers, which requires locking their
> headers so you can see them consistently. No one process can get the big
> picture without doing something intrusive like that, and on a busy system
> the overhead of collecting more data to know exactly how far ahead the
> cleaning is can drag down overall performance. A lot can happen while the
> background writer is sleeping.

Understood.

> One next-generation design which has been sketched out but not even
> prototyped would take cleaned buffers and add them to the internal list of
> buffers that are free, which right now is usually empty on the theory that
> cached data is always more useful than a reserved buffer. If you
> developed a reasonable model for how many buffers you needed and padded
> that appropriately, that's the easiest way (given the rest of the buffer
> manager code) to get close to ensuring there aren't any backend writes.
> Because you've got the OS buffering writes anyway in most cases, it's hard
> to pin down whether that actually improved worst-case latency, though. And
> moving in that direction always seems to reduce average throughput, even in
> write-heavy benchmarks.

Ok.

> The important thing to remember is that the underlying OS has its own read
> and write caching mechanisms here, and unless the PostgreSQL ones are
> measurably better than those, you might as well let the OS manage the
> problem instead.

The problem, though, is that while the OS may be good in the common cases it is designed for, it can have specific features that are directly counter-productive if your goals do not line up with those of the commonly designed-for use case (in particular, if you care a lot about latency and not necessarily about absolute maximum throughput).

For example, in Linux up until recently, if not still, there is the 1024 per-inode buffer limit that capped the number of buffers written as a result of expiry. This means that when PostgreSQL does its fsync(), you may end up having a lot more to write out than would have been the case if dirty_expire_centisecs had been enforced, regardless of whether PostgreSQL was tuned to write dirty pages out sufficiently aggressively. If the amount built up exceeds the capacity of the RAID controller cache... I had a case where I suspect this was exacerbating the situation. Manually doing a 'sync' on the system every few seconds noticeably helped (the theory being, because it forced page write-outs to happen earlier and in smaller storms).

> It's easy to demonstrate that's happening when you give
> a decent amount of memory to shared_buffers; it's much harder to prove
> that's the case for an improved write scheduling algorithm. Stepping back
> a bit, you might even consider that one reason PostgreSQL has grown as
> well as it has in scalability is exactly because it's been riding
> improvements in the underlying OS in many of these cases, rather than trying
> to do all the I/O scheduling itself.

Sure. In this case with the backend writes, I am more interested in understanding better what is happening, and in having better indications of when backends block on I/O, than necessarily having a proven improvement in throughput or latency. It makes it easier to reason about what is happening when you *do* have a measured performance problem.

Thanks for all the insightful information.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
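[The periodic 'sync' workaround mentioned in the message above is easy to script. A minimal sketch, with caveats: it is Unix-only, os.sync() forces writeback of *all* dirty OS buffers system-wide, and whether smaller, more frequent writeback storms actually help depends on the storage stack, so treat this as an experiment rather than a recommendation.]

```python
import os
import time

def periodic_sync(interval_s=5.0, iterations=None):
    """Force OS page-cache writeback every interval_s seconds, so
    dirty pages drain in small frequent batches instead of one large
    storm at fsync/checkpoint time. iterations=None loops forever,
    which matches the ad-hoc workaround described in the mail above.
    Returns the number of syncs performed (for bounded runs)."""
    done = 0
    while iterations is None or done < iterations:
        os.sync()          # block until writeback has been initiated
        done += 1
        if iterations is None or done < iterations:
            time.sleep(interval_s)
    return done
```

On modern kernels the same effect is usually better achieved by lowering vm.dirty_expire_centisecs / vm.dirty_background_ratio than by an external sync loop.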