Discussion: pgsql-server/src/backend/storage/buffer/bufmgr ...
CVSROOT: /cvsroot
Module name: pgsql-server
Changes by: wieck@svr1.postgresql.org 04/01/24 16:00:46

Modified files:
    src/backend/storage/buffer: bufmgr.c
    src/backend/utils/misc: guc.c postgresql.conf.sample
    src/include/storage: bufmgr.h

Log message:
    Added GUC variable bgwriter_flush_method controlling the action
    done by the background writer between writing dirty blocks and
    napping.

        none (default)   no action
        sync             bgwriter calls smgrsync() causing a sync(2)

    A global sync() is only good on dedicated database servers, so more
    flush methods should be added in the future.

    Jan
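For context, the new setting as it might appear in a configuration file (the variable name and values come from the commit message above; the exact comment wording of the sample file is an assumption):

```
# What the background writer does between writing dirty blocks and napping:
#   none - no extra action (default)
#   sync - call smgrsync(), causing a sync(2)
#bgwriter_flush_method = none
```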
wieck@svr1.postgresql.org (Jan Wieck) writes: > Added GUC variable bgwriter_flush_method controlling the action > done by the background writer between writing dirty blocks and > napping. > none (default) no action > sync bgwriter calls smgrsync() causing a sync(2) Why would that be useful at all? I thought the purpose of the bgwriter was to avoid I/O storms, not provoke them. regards, tom lane
Tom Lane wrote: > wieck@svr1.postgresql.org (Jan Wieck) writes: >> Added GUC variable bgwriter_flush_method controlling the action >> done by the background writer between writing dirty blocks and >> napping. > >> none (default) no action >> sync bgwriter calls smgrsync() causing a sync(2) > > Why would that be useful at all? I thought the purpose of the bgwriter > was to avoid I/O storms, not provoke them. Calling sync(2) every time the background writer naps means calling it every couple hundred milliseconds. That can hardly be called an IO storm, it's more like a constant breeze. So far nobody has bothered to make any other proposal for how to cause the kernel to actually do some writing at all. A lot of people babble about fsync(), fdatasync() and fadvise and whatnot. A week ago I posted the proposal for this and got exactly zero response. Before this, the bgwriter only wrote the dirty blocks, so that the checkpoint (doing the sync() call) still caused all the physical IO to happen at once. Sure, with the bgwriter doing the major write IO, it would know what files have been written to and could do fsync() and fdatasync() on them. But even with that, a checkpoint doing sync() risks causing a lot of unexpected IO from who knows where, making the time the checkpoint takes totally unpredictable. The whole point of the bgwriter is to give response times a better variance; I never claimed that it will improve performance. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes: > So far nobody bothered to make any other proposal how to cause the > kernel to actually do some writing at all. A lot of people babble about > fsync(), fdatasync() and fadvise and whatnot. A week ago I posted the > proposal for this and got exactly zero response. As I've said before, I think we need to find a way to stop using sync() altogether --- we have to move to fsync or O_SYNC and variants. sync has simply got the wrong API. Let me give an example: you write a bunch of stuff and then call sync(). Suppose the kernel is unable to write some of those blocks --- it gets a hard I/O error, or doesn't realize it's out of disk space until the write is attempted, or whatever. (I think this is what happened to Chris K-L last night.) Is the sync call going to tell you about the problem? No, it is not. If you are lucky you will get an error return from the next operation you try on a file descriptor associated with the failed blocks. But by that time you've probably already written a checkpoint record to WAL claiming that those writes were all done successfully. Finding out about the failures after the checkpoint is completed is too late --- you're screwed, especially if a crash happens before you can do anything about it. > The whole point of the bgwriter is to give responsetimes a better > variance, I never claimed that it will improve performance. I want to use it to improve reliability, by getting rid of our dependence on sync(). The bgwriter can afford to wait for writes to occur, so it should be able to use fsync or even O_SYNC. regards, tom lane
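Tom's point about error reporting can be illustrated with a small C sketch (an illustration, not PostgreSQL code): fsync() blocks until the kernel has attempted the physical write and reports failure through its return value and errno, whereas sync() only schedules writes and reports nothing.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write a buffer and confirm it reached stable storage.  Unlike sync(),
 * which only schedules writes and reports nothing, fsync() blocks until
 * the kernel has attempted the physical write and returns -1 (with
 * errno set to EIO, ENOSPC, ...) if it failed. */
int write_durably(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        int save_errno = errno; /* the error surfaces here, before checkpoint */
        close(fd);
        errno = save_errno;
        return -1;
    }
    return close(fd);
}
```

A checkpoint built on calls like this can refuse to write its WAL record when any flush fails, which is exactly the guarantee sync() cannot provide.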
On Sat, 24 Jan 2004, Tom Lane wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: > > So far nobody bothered to make any other proposal how to cause the > > kernel to actually do some writing at all. A lot of people babble about > > fsync(), fdatasync() and fadvise and whatnot. A week ago I posted the > > proposal for this and got exactly zero response. > > As I've said before, I think we need to find a way to stop using sync() > altogether --- we have to move to fsync or O_SYNC and variants. sync > has simply got the wrong API. > > Let me give an example: you write a bunch of stuff and then call sync(). > Suppose the kernel is unable to write some of those blocks --- it gets > a hard I/O error, or doesn't realize it's out of disk space until the > write is attempted, or whatever. (I think this is what happened to > Chris K-L last night.) Is the sync call going to tell you about the > problem? No, it is not. If you are lucky you will get an error return > from the next operation you try on a file descriptor associated with the > failed blocks. But by that time you've probably already written a > checkpoint record to WAL claiming that those writes were all done > successfully. Finding out about the failures after the checkpoint is > completed is too late --- you're screwed, especially if a crash happens > before you can do anything about it. Stupid question here, and I just checked postgresql.conf to make sure it wasn't something I overlooked ... why don't we have a 'minfree' setting for disk space? It's not like running out of disk space is a rare occurrence ... Personally, what I'd expect would be that the postmaster process monitors this and, if below a certain threshold, sends out a 'close connections' to the postgres processes and refuses future connections with an 'out of space' warning ... ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
Tom Lane wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: >> The whole point of the bgwriter is to give response times a better >> variance; I never claimed that it will improve performance. > > I want to use it to improve reliability, by getting rid of our > dependence on sync(). The bgwriter can afford to wait for writes > to occur, so it should be able to use fsync or even O_SYNC. Agreed, that would be our long term strategy. And chances are that the 63 lines of code I added today for functionality that is turned off by default will not completely screw up that plan. But as I see it, there is not even half of a proposal for all that yet. And people have response time spike problems caused by the checkpointer today. At least that is what I heard from the folks who were at our BOF in New York. Those people will not mind if the option we give them in 7.5 is replaced with something better in 8.0. But they mind a lot if we give them nothing because what we can do now is not optimal. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck wrote: > Tom Lane wrote: > > > wieck@svr1.postgresql.org (Jan Wieck) writes: > >> Added GUC variable bgwriter_flush_method controlling the action > >> done by the background writer between writing dirty blocks and > >> napping. > > > >> none (default) no action > >> sync bgwriter calls smgrsync() causing a sync(2) > > > > Why would that be useful at all? I thought the purpose of the bgwriter > > was to avoid I/O storms, not provoke them. > > Calling sync(2) every time the background writer naps means calling it > every couple hundred milliseconds. That can hardly be called an IO > storm, it's more like a constant breeze. Have you tested this option? It seems like a sub-second sync would kill performance. > So far nobody bothered to make any other proposal how to cause the > kernel to actually do some writing at all. A lot of people babble about > fsync(), fdatasync() and fadvise and whatnot. A week ago I posted the > proposal for this and got exactly zero response. I assumed all Unixes flush dirty pages at least every 30 seconds, so if checkpoints are every 2-3 minutes, most of the dirty pages should already be flushed. Perhaps instead of tying sync to the background writer sleeps, we should have a sync_frequency that could be set to sync every 15 or 30 seconds. Is there any value in doing it more frequently than that? > Before this, the bgwriter only wrote the dirty blocks, so that the > checkpoint (doing the sync() call) still caused all the physical IO to > happen at once. Sure, with the bgwriter doing the major write IO, it > would know what files have been written to and could do fsync() and > fdatasync() on them. But even with that, a checkpoint doing sync() > risks causing a lot of unexpected IO from who knows where, making the > time the checkpoint takes totally unpredictable. > > The whole point of the bgwriter is to give response times a better > variance; I never claimed that it will improve performance.
Uh, our goal is better performance overall. If this new option causes dismal performance when enabled, who cares how fast the checkpoints are? :-) -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: > > So far nobody bothered to make any other proposal how to cause the > > kernel to actually do some writing at all. A lot of people babble about > > fsync(), fdatasync() and fadvise and whatnot. A week ago I posted the > > proposal for this and got exactly zero response. > > As I've said before, I think we need to find a way to stop using sync() > altogether --- we have to move to fsync or O_SYNC and variants. sync > has simply got the wrong API. > > Let me give an example: you write a bunch of stuff and then call sync(). > Suppose the kernel is unable to write some of those blocks --- it gets > a hard I/O error, or doesn't realize it's out of disk space until the > write is attempted, or whatever. (I think this is what happened to > Chris K-L last night.) Is the sync call going to tell you about the > problem? No, it is not. If you are lucky you will get an error return > from the next operation you try on a file descriptor associated with the > failed blocks. But by that time you've probably already written a > checkpoint record to WAL claiming that those writes were all done > successfully. Finding out about the failures after the checkpoint is > completed is too late --- you're screwed, especially if a crash happens > before you can do anything about it. If sync fails (kernel to disk write fails) we have a hardware failure, and we don't pretend to recover from that, though it would be nice to know sooner so we can exit. One idea I floated around was to open/write/fsync/close a temporary file after sync in the hope that it would happen after the sync completes because the fsync would be at the end of the disk flush queue. However, tagged queueing could reorder those, but hopefully it would catch a disk error before we recycle the WAL files. > > > The whole point of the bgwriter is to give response times a better > > variance; I never claimed that it will improve performance.
> > I want to use it to improve reliability, by getting rid of our > dependence on sync(). The bgwriter can afford to wait for writes > to occur, so it should be able to use fsync or even O_SYNC. But I always wonder how to do that while still allowing the kernel and disk drive to reorder writes, and keeping the background writer efficient at moving pages out of the buffer cache. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
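Bruce's sentinel idea above can be sketched in C (a hypothetical illustration, not committed code): after sync() schedules all dirty buffers, open/write/fsync/close a throwaway file in the hope that the sentinel's fsync queues behind the earlier blocks, so an error on the sentinel hints at trouble in the preceding flush. As he notes, tagged command queueing may reorder the writes, so this is only a heuristic.

```c
#include <fcntl.h>
#include <unistd.h>

/* Call sync(), then fsync() a sentinel file.  The hope is that the
 * sentinel's fsync lands at the end of the disk flush queue, behind
 * the blocks sync() just scheduled, so a failure here suggests the
 * earlier flush failed too.  Tagged command queueing can reorder the
 * writes, so this is a heuristic, never a guarantee. */
int sync_with_sentinel(const char *sentinel_path)
{
    sync();                     /* schedules all dirty buffers, reports nothing */
    int fd = open(sentinel_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, "x", 1) != 1 || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```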
Jan Wieck wrote: > Tom Lane wrote: > > > Jan Wieck <JanWieck@Yahoo.com> writes: > > >> The whole point of the bgwriter is to give response times a better > >> variance; I never claimed that it will improve performance. > > > > I want to use it to improve reliability, by getting rid of our > > dependence on sync(). The bgwriter can afford to wait for writes > > to occur, so it should be able to use fsync or even O_SYNC. > > Agreed, that would be our long term strategy. And chances are that the > 63 lines of code I added today for functionality that is turned off by > default will not completely screw up that plan. We don't give people options that are useless. Are you sure this option is useful? "Hey, it makes the system so slow, checkpoints are now 90% faster!" :-) > But as I see it, there is not even half of a proposal for all that yet. > And people have response time spike problems caused by the checkpointer > today. At least that is what I heard from the folks who were at our BOF > in New York. Those people will not mind if the option we give them in > 7.5 is replaced with something better in 8.0. But they mind a lot > if we give them nothing because what we can do now is not optimal. We need more discussion/proof before we add something like this, even if it is only for one release. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> As I've said before, I think we need to find a way to stop using sync() >> altogether --- we have to move to fsync or O_SYNC and variants. sync >> has simply got the wrong API. > If sync fails (kernel to disk write fails) we have a hardware failure, > and we don't pretend to recover from that, Not necessarily --- it could be out-of-disk-space, on at least some filesystems. More to the point, the important thing is not to commit a checkpoint record to WAL indicating that everything is good, when everything is not good. As long as we don't checkpoint we have some hope of recovering automatically via WAL replay. > One idea I floated around was to > open/write/fsync/close a temporary file after sync in the hope that it > would happen after the sync completes because the fsync would be at the > end of the disk flush queue. "In the hope"? We already have a guess-and-hope approach to this, and it will never be any better as long as we use sync(), because sync() is fundamentally the wrong operation. It doesn't tell you when the I/O is done, and it doesn't tell you whether the I/O was done successfully, and there is no possibility of working around that fundamental lack of information except to stop using it. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> As I've said before, I think we need to find a way to stop using sync() > >> altogether --- we have to move to fsync or O_SYNC and variants. sync > >> has simply got the wrong API. > > > If sync fails (kernel to disk write fails) we have a hardware failure, > > and we don't pretend to recover from that, > > Not necessarily --- it could be out-of-disk-space, on at least some > filesystems. More to the point, the important thing is not to commit a I assume the operating system is already allocating file system space during the write, and the sync is only forcing it to disk. If the operating system doesn't allocate file system space it couldn't properly work, no? In fact, it is my understanding that the file system is in RAM and the disk is just backing store, basically. > checkpoint record to WAL indicating that everything is good, when > everything is not good. As long as we don't checkpoint we have some > hope of recovering automatically via WAL replay. > > > One idea I floated around was to > > open/write/fsync/close a temporary file after sync in the hope that it > > would happen after the sync completes because the fsync would be at the > > end of the disk flush queue. > > "In the hope"? We already have a guess-and-hope approach to this, and > it will never be any better as long as we use sync(), because sync() is > fundamentally the wrong operation. It doesn't tell you when the I/O is > done, and it doesn't tell you whether the I/O was done successfully, and > there is no possibility of working around that fundamental lack of > information except to stop using it. I assumed this would be a closer guess-and-hope approach, and again, how could sync fail unless it is a hardware problem? NFS? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.
| Newtown Square, Pennsylvania 19073
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> As I've said before, I think we need to find a way to stop using sync() > >> altogether --- we have to move to fsync or O_SYNC and variants. sync > >> has simply got the wrong API. > > > If sync fails (kernel to disk write fails) we have a hardware failure, > > and we don't pretend to recover from that, > > Not necessarily --- it could be out-of-disk-space, on at least some > filesystems. More to the point, the important thing is not to commit a > checkpoint record to WAL indicating that everything is good, when > everything is not good. As long as we don't checkpoint we have some > hope of recovering automatically via WAL replay. > > > One idea I floated around was to > > open/write/fsync/close a temporary file after sync in the hope that it > > would happen after the sync completes because the fsync would be at the > > end of the disk flush queue. > > "In the hope"? We already have a guess-and-hope approach to this, and > it will never be any better as long as we use sync(), because sync() is > fundamentally the wrong operation. It doesn't tell you when the I/O is > done, and it doesn't tell you whether the I/O was done successfully, and > there is no possibility of working around that fundamental lack of > information except to stop using it. I guess my major problem with moving away from sync is similar to the reason we don't do raw devices --- sync is best done in the kernel and disk driver that knows more about how to do it efficiently. I haven't seen any non-sync solution with performance similar to sync(). However, we are going to have to write one for win32, so we can test things once we are done and then decide. I think the win32 solution will be to record modified files in a central location, and have the checkpoint open/fsync(_commit) them, perhaps all happening at the same time in different threads so it isn't serialized.
-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
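The scheme Bruce sketches for win32 could look roughly like this in C (the names and data structure are invented for illustration): writers record which files they dirtied in a central place, and the checkpointer fsyncs each recorded file instead of calling sync().

```c
#include <fcntl.h>
#include <unistd.h>

#define MAX_PENDING 1024

/* Central record of files dirtied since the last checkpoint. */
static const char *pending_files[MAX_PENDING];
static int n_pending = 0;

void remember_dirty_file(const char *path)
{
    if (n_pending < MAX_PENDING)
        pending_files[n_pending++] = path;
}

/* At checkpoint time, open and fsync (on win32: _commit) every recorded
 * file.  Returns the number of files flushed, or -1 on the first error,
 * in which case the checkpoint record must not be written. */
int checkpoint_fsync_pending(void)
{
    int flushed = 0;
    for (int i = 0; i < n_pending; i++)
    {
        int fd = open(pending_files[i], O_WRONLY);
        if (fd < 0 || fsync(fd) != 0)
        {
            if (fd >= 0)
                close(fd);
            return -1;          /* abort checkpoint: WAL replay remains possible */
        }
        close(fd);
        flushed++;
    }
    n_pending = 0;
    return flushed;
}
```

As Bruce suggests, each fsync could also run in its own thread so the per-file flushes are not serialized.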
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> Not necessarily --- it could be out-of-disk-space, on at least some >> filesystems. More to the point, the important thing is not to commit a > I assume the operating system is already allocating file system space > during the write, and the sync is only forcing it to disk. Not so --- as was pointed out later in the thread, neither NFS nor AFS work that way, and there could be other cases. In any case, I don't subscribe to the idea that we can just abdicate all responsibility in case of hardware problems. Yes, we do rely on a disk to keep storing information once it's accepted it, but that doesn't mean that it's okay to ignore write-failure reports. We are failing to hold up our end of the deal if we do that. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> Not necessarily --- it could be out-of-disk-space, on at least some > >> filesystems. More to the point, the important thing is not to commit a > > > I assume the operating system is already allocating file system space > > during the write, and the sync is only forcing it to disk. > > Not so --- as was pointed out later in the thread, neither NFS nor AFS > work that way, and there could be other cases. > > In any case, I don't subscribe to the idea that we can just abdicate all > responsibility in case of hardware problems. Yes, we do rely on a disk > to keep storing information once it's accepted it, but that doesn't mean > that it's okay to ignore write-failure reports. We are failing to hold > up our end of the deal if we do that. Well, in normal usage, applications do the write and expect the data to be pushed to disk later, so I don't see us ignoring write() failures, but rather push to disk. Isn't a separate fsync after sync closer to reliable? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote: > I guess my major problem with moving away from sync is similar to the > reason we don't do raw devices --- sync is best done in the kernel and > disk driver that knows more about how to do it efficiently. I haven't > seen any non-sync solution with performance similar to sync(). However, > we are going to have to write one for win32, so we can test things once > we are done and then decide. We are not doing raw devices because we don't do tablespaces. I mean in the method where a tablespace for the OS is basically a huge container. For every little table, PostgreSQL creates a separate file and scatters the data all over the place because it is too dumb to group allocations of multiple blocks together. As a consequence, it is short of file descriptors and needs the kernel at least to reorder its write requests so that they are not done in the clueless order they are issued. Now doing fsync() or fdatasync() of possibly dozens of files in a row, forcing the kernel to do one scattered file after another, letting the disk heads dance like step-chicken on a hot tin ... that will be an improvement, oh boy. However safe this will be, nobody will use it because MySQL is soooo much faster! Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes: > Now doing fsync() or fdatasync() of possibly dozens of files in a row, > forcing the kernel to do one scattered file after another, letting the > disk heads dance like step-chicken on a hot tin ... that will be an > improvement, oh boy. I'm not convinced it would be so bad. Normally you'd only be issuing those operations at checkpoint time, and if the bgwriter has been doing its job and pushing out dirty pages to the kernel, the kernel should have been busily writing pages all along since the last checkpoint. In theory the fsync would not force all that many new writes (certainly lots less than a once-per-checkpoint sync does). Also keep in mind that fsync is not defined as "write this page NOW". It is defined as "let me know when you've written it". The kernel still has flexibility in scheduling its writes, and may choose to write other pages along the way. Perhaps more to the point: all this is predicated on an assumption no longer particularly valid, which is that the kernel's ideas about disk write scheduling matter at all. A decent SCSI disk drive will pre-empt the kernel's ideas anyway by absorbing as many pending writes as it can and then doing its own write scheduling. fsync won't affect the drive's choices in the least, only allow us to find out when the drive is done. regards, tom lane
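The O_SYNC alternative Tom mentions can be sketched in C as well (an illustration under the stated assumption that the platform supports O_SYNC): opening with O_SYNC makes each write() itself synchronous, so an I/O error surfaces from the failing write() rather than from a later fsync().

```c
#include <fcntl.h>
#include <unistd.h>

/* With O_SYNC, write() does not return until the data has been handed
 * to stable storage, so an error is reported by the failing write()
 * itself.  As Tom notes, a drive with its own write cache may still
 * schedule the physical writes as it pleases; O_SYNC, like fsync(),
 * only tells us when the drive claims to be done. */
int write_osync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);   /* synchronous: blocks until durable */
    close(fd);
    return (n == (ssize_t) len) ? 0 : -1;
}
```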
On Jan 27, 2004, at 6:25 PM, Tom Lane wrote: > Perhaps more to the point: all this is predicated on an assumption no > longer particularly valid, which is that the kernel's ideas about disk > write scheduling matter at all. A decent SCSI disk drive will pre-empt > the kernel's ideas anyway by absorbing as many pending writes as it can > and then doing its own write scheduling. fsync won't affect the > drive's choices in the least, only allow us to find out when the drive > is done. > > regards, tom lane Perhaps totally unrelated, as I've only read the last couple of posts on this, but what does Postfix (the MTA) do? How does it handle this? I trust Wietse implicitly to DTRT. If it were me I would ask him how he handles the writes or at least check the Postfix src. *shrug* Just an idea. Chris Watson M.M. Bestor G. Brown #433 Wichita, KS AIM: BSDUNIX44
Tom Lane wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: > > Now doing fsync() or fdatasync() of possibly dozens of files in a row, > > forcing the kernel to do one scattered file after another, letting the > > disk heads dance like step-chicken on a hot tin ... that will be an > > improvement, oh boy. > > I'm not convinced it would be so bad. Normally you'd only be issuing > those operations at checkpoint time, and if the bgwriter has been doing Agreed, I don't have a problem with fsync() during checkpoint instead of sync. I had problems with fsync from the background writer and performance. Let me post ideas to hackers & win32 list. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Jan Wieck wrote: > Tom Lane wrote: > > > wieck@svr1.postgresql.org (Jan Wieck) writes: > >> Added GUC variable bgwriter_flush_method controlling the action > >> done by the background writer between writing dirty blocks and > >> napping. > > > >> none (default) no action > >> sync bgwriter calls smgrsync() causing a sync(2) > > > > Why would that be useful at all? I thought the purpose of the bgwriter > > was to avoid I/O storms, not provoke them. > > Calling sync(2) every time the background writer naps means calling it > every couple hundred milliseconds. That can hardly be called an IO > storm, it's more like a constant breeze. I talked to Jan about the idea of sync on every background writer sleep. He is going to study the issue and report back. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073