Обсуждение: FAQ Q
From the FAQ: --------------------------------------------------------------------- PostgreSQL runs in two modes. Normal fsync mode flushes every completed transaction to disk, guaranteeing that if the OS crashes or loses power in the next few seconds, all your data is safely stored on disk. In this mode, we are slower than most commercial databases, partly because few of them do such conservative flushing to disk in their default modes. In no-fsync mode, we are usually faster than commercial databases, though in this mode, an OS crash could cause data corruption. We are working to provide an intermediate mode that suffers less performance overhead than full fsync mode, and will allow data integrity within 30 seconds of an OS crash. --------------------------------------------------------------------- Has this changed since the FAQ? I understand that Oracle & Sybase (at least) have guaranteed data consistency in case of a crash. From this FAQ, there is a suggestion that at least some commercial RDBMS doesn't even have guaranteed consistency? In Sybase/Oracle this guaranteed consistency + performance is implemented by ensuring that the transaction is guaranteed to be flushed to disk before the data page is flushed, but that neither is guaranteed to be flushed at all unless your application does a specific COMMIT. Therefore, even though you might lose some transactions, you will never lose internal consistency when using transactions. Also, you will not receive a result from COMMIT until at least the transaction is flushed. It would seem the FAQ is saying we always flush every transaction all the time in fsync mode, and in no-fsync mode we don't really guarantee that the transaction will be flushed before the data page is flushed. Am I reading this correctly? (in summation...) So if data consistency is the most important thing to me, performance be damned, I still, as of 7.2.1, want to run in fsync mode? Thanks, -- Tim Ellis DBA, Gamet
Tim Ellis <Tim.Ellis@gamet.com> writes: > (in summation...) So if data consistency is the most important thing to > me, performance be damned, I still, as of 7.2.1, want to run in fsync > mode? Yup. no-fsync is only suitable if you trust your OS and power supply. You do not have to trust Postgres itself: in all cases we push the log entries out to the OS before declaring a transaction committed. The question at hand is whether we use fsync() or other methods to try to force the OS to write to disk before we report the transaction committed. Without that, a system-level crash immediately after a commit report might mean the "committed" transaction isn't reflected as committed on disk. The performance cost of fsync is not nearly what it used to be pre-7.1, btw, because we need flush only the WAL log file not data files. (In case of a crash, any missing data-file updates will be reconstructed from the recent WAL entries during restart.) So the needed disk head movement is a lot less than it used to be --- especially if you've arranged for pg_xlog to live on its own disk. I don't think there's any longer a good argument for considering no-fsync in production servers, even though you might well still choose it for development systems. regards, tom lane
On Mon, 10 Jun 2002 17:16:23 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote: > You do not have to trust Postgres itself: in all cases we push the log > entries out to the OS before declaring a transaction committed. In all cases while fsync mode is enabled, not in all cases (including no-fsync) right? > question at hand is whether we use fsync() or other methods to try to > force the OS to write to disk before we report the transaction > committed. Without that, a system-level crash immediately after a > commit report might mean the "committed" transaction isn't reflected as > committed on disk. So no-fsync simply means that "commit" != "sync'd to disk"? How is this different than the big boys do it? From Sybase internals classes and Oracle documentation, I know that when you say "COMMIT" you don't get any response back until the RDBMS has written that transaction TO DISK. Is the FAQ being too generous in speed and too unfair in recoverability to the commercial offerings here? From the FAQ: "Normal fsync mode flushes every completed transaction to disk, guaranteeing that if the OS crashes or loses power in the next few seconds, all your data is safely stored on disk. In this mode, we are slower than most commercial databases, partly because few of them do such conservative flushing to disk in their default modes." Seems we're claiming that most commercial databases don't actually flush anything to disk before returning a result to a COMMIT, and at the same time that therefore they're faster. My training says Sybase/DB2/Oracle/Informix won't be any less conservative and at the same time, won't be any faster. Sybase at least just tries to avoid writing tranlogs to disk except in batches, so when you say COMMIT, it pauses until X amount of time passes or Y number of committed transactions have piled up, then it writes those trans to disk. But your client is blocked on the COMMIT until the transaction actually makes it to disk. Perhaps we are including in "most commercial databases" several RDBMSs not Sybase/Oracle/DB2/Informix? > The performance cost of fsync is not nearly what it used to be pre-7.1, > btw, because we need flush only the WAL log file not data files. Oh, so here's where my confusion continues. The fsync pre-7.1 was flushing data pages, whereas 7.1/7.1 only flushes WAL? So basically the FAQ quotes above and previously apply only to pre-7.1, and not 7.1/7.2? > (In case of a crash, any missing data-file updates will be reconstructed > from the recent WAL entries during restart.) This is what I'd thought the FAQ was referring to when it said "all your data is safely stored on disk." So am I right in saying: In 7.1/7.2, when you COMMIT, you are guaranteed that WAL (and probably nothing but WAL) is written to disk before you get a return and that therefore if your instance immediately thereafter crashes, that after recovery (which is required), you will have a consistent (not-corrupted) database. ??? Cheers, Tim Ellis DBA, Gamet ps -- I'm trying to find a writeup that kind'f compares Postgres to Sybase/Oracle/DB2/Informix and/or MySQL to all the aforementioned databases, but can find nothing substantial. I'm interested in ADMINISTRATIVE viewpoints. Most of the info on the web concentrates on what DEVELOPERS think of the aforementioned offerings, but nothing about what ADMINISTRATORS think. For a good example of a bad discussion, see the Slashdot thread on Postgres vs. MySQL. Any hints for me? FAQ 1.14 (at http://postgresql.org/docs/faq-english.html) is close to what I'm talking about, but I'd love something far more in-depth and which compares to specific RDBMSs (ie: isn't politically correct).
Tim Ellis <Tim.Ellis@gamet.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> You do not have to trust Postgres itself: in all cases we push the log >> entries out to the OS before declaring a transaction committed. > In all cases while fsync mode is enabled, not in all cases (including > no-fsync) right? All cases is what I said, and all cases is what I meant. fsync mode only controls whether there's an fsync() after the write(). > So no-fsync simply means that "commit" != "sync'd to disk"? How is this > different than the big boys do it? From Sybase internals classes and > Oracle documentation, I know that when you say "COMMIT" you don't get any > response back until the RDBMS has written that transaction TO DISK. Or thinks it has. One of the points that people don't much like to talk about is the difficulty of knowing how far the data has really been pushed. In fsync mode we report commit when the kernel has told us it's written the data; we have no way of knowing whether the kernel lied, and even less way of knowing whether the disk drive has actually written the information or only cached it on the controller board. On modern disk drives the truth is likely to be that the bits have only gone as far as the drive controller; so you may lose data if you lose power and the power supply does not have enough capacity to hold up the drive while it's finishing its pending writes. However, the big boys running on the same hardware are going to have the same problem. If you can figure out how to configure your drives not to report write complete until it's really complete, then you can feel secure with either Postgres or the big boys. Or you can buy a UPS and make sure you are configured to shut down before the UPS runs out of steam. In no-fsync mode we report commit after pushing the data out to the kernel, but we don't try to force the kernel to push it out to disk. So you are safe against a Postgres crash, but not against kernel or hardware failure. Your transaction could be lost if the system dies before the kernel gets around to syncing it out to disk (typically 30 sec max, on most Unixen). > Is the FAQ being too generous in speed and too unfair in recoverability to > the commercial offerings here? I didn't write the FAQ and won't take responsibility for its claims about commercial databases. I really don't know what tradeoffs they offer in this area. > Sybase at least just tries to avoid writing tranlogs to disk except in > batches, so when you say COMMIT, it pauses until X amount of time passes > or Y number of committed transactions have piled up, then it writes those > trans to disk. But your client is blocked on the COMMIT until the > transaction actually makes it to disk. We have that too, although it's not on by default, and probably needs further tuning work. > In 7.1/7.2, when you COMMIT, you are guaranteed that WAL (and probably > nothing but WAL) is written to disk before you get a return and that > therefore if your instance immediately thereafter crashes, that after > recovery (which is required), you will have a consistent (not-corrupted) > database. Modulo the above issues, yes. Note also that WAL fsync protects you against data corruption in the case of a mid-transaction system-level failure: if all the WAL updates have made it to disk, then we will be able to fix any incomplete or missing writes in the data files. Without fsync there is a distinct risk of corrupted data. This has to do with forcing fsync on the WAL files before we start to modify data pages intra-transaction. regards, tom lane
Tom Lane wrote: > However, the big boys running on the same hardware are going to have the > same problem. > > If you can figure out how to configure your drives not to report write > complete until it's really complete, then you can feel secure with > either Postgres or the big boys. Or you can buy a UPS and make sure you > are configured to shut down before the UPS runs out of steam. > > In no-fsync mode we report commit after pushing the data out to the > kernel, but we don't try to force the kernel to push it out to disk. > So you are safe against a Postgres crash, but not against kernel or > hardware failure. Your transaction could be lost if the system dies > before the kernel gets around to syncing it out to disk (typically > 30 sec max, on most Unixen). > > > Is the FAQ being too generous in speed and too unfair in recoverability to > > the commercial offerings here? > > I didn't write the FAQ and won't take responsibility for its claims > about commercial databases. I really don't know what tradeoffs they > offer in this area. I have updated the FAQ to remove the reference to fsync, because with WAL it isn't as much of an issue: Performance PostgreSQL has performance similar to other commercial and open source databases. it is faster for some things, slower for others. In comparison to MySQL or leaner database systems, we are slower on inserts/updates because of transaction overhead. Of course, MySQL does not have any of the features mentioned in the Features section above. We are built for reliability and features, though we continue to improve performance in every release. There is an interesting Web page comparing PostgreSQL to MySQL at http://openacs.org/why-not-mysql.html > > > Sybase at least just tries to avoid writing tranlogs to disk except in > > batches, so when you say COMMIT, it pauses until X amount of time passes > > or Y number of committed transactions have piled up, then it writes those > > trans to disk. But your client is blocked on the COMMIT until the > > transaction actually makes it to disk. Informix has two modes, buffered logging, and unbuffered logging. The PostgreSQL setup is unbuffered logging (from http://nasis.nrcs.usda.gov/archive/logbuf.html#LoggingStatus): Unbuffered Logging An unbuffered logging status means that the database writes each transactions result to the logical log file as soon as that transaction completes. This mode increases the amount of data that might be recovered during the fast recovery phase. All the transactions that make it to disk can be recovered. It is slower because it requires more frequent writes to the database and it takes more space because all changes are logged. Buffered Logging Buffered logging does not actually reduce the amount of data that is written to the logical log it just reduces how often changes are written to disk. With buffered logging the Informix OnLine server waits until an internal memory buffer is nearly filled before writing the data to disk. This means that even though buffered logging will eventually write the same data as unbuffered logging it will take fewer disk access. That means that the logging should take less time. Buffered logging is not quite as safe as unbuffered logging. As we mentioned only the transactions on the disk will be recovered during fast recovery. Even if a transaction has been completed it may not be on the disk with buffered logging. > We have that too, although it's not on by default, and probably needs > further tuning work. Not sure we do. We have something that delays the transaction hoping for another one to come along, but we don't have something that writes out WAL after X seconds or X transactions. I think we need something like that. In fact, if we had it, I think we could remove the fsync=off option entirely. With fsync=off, an OS crash means you have to restore from disk. With WAL writes/fsync every 5 seconds, at least an OS crash brings the system back to a stable state, and the performance with fsync=off would be nearly identical. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026