Re: First set of OSDL Shared Mem scalability results, some wierdness ...
| From | Tom Lane |
|---|---|
| Subject | Re: First set of OSDL Shared Mem scalability results, some wierdness ... |
| Date | |
| Msg-id | 4859.1097363137@sss.pgh.pa.us |
| In reply to | Re: First set of OSDL Shared Mem scalability results, some wierdness ... (Kevin Brown <kevin@sysexperts.com>) |
| Responses | Re: First set of OSDL Shared Mem scalability results, some wierdness ... |
| | Re: First set of OSDL Shared Mem scalability results, some |
| List | pgsql-performance |
Kevin Brown <kevin@sysexperts.com> writes:
> Tom Lane wrote:
>> mmap() is Right Out because it does not afford us sufficient control
>> over when changes to the in-memory data will propagate to disk.

> ... that's especially true if we simply cannot
> have the page written to disk in a partially-modified state (something
> I can easily see being an issue for the WAL -- would the same hold
> true of the index/data files?).

You're almost there.  Remember the fundamental WAL rule: log entries
must hit disk before the data changes they describe.  That means we
need not only a way of forcing changes to disk (fsync) but a way of
being sure that changes have *not* gone to disk yet.  In the existing
implementation we get that by simply not issuing write() for a given
page until we know that the relevant WAL log entries are fsync'd down
to disk.  (BTW, this is what the LSN field on every page is for: it
tells the buffer manager the latest WAL offset that has to be flushed
before it can safely write the page.)

mmap provides msync, which is comparable to fsync, but AFAICS it
provides no way to prevent an in-memory change from reaching disk too
soon.  This would mean that WAL entries would have to be written *and
flushed* before we could make the data change at all, which would
convert multiple updates of a single page into a series of
write-and-wait-for-WAL-fsync steps.  Not good.  fsync'ing WAL once per
transaction is bad enough; once per atomic action is intolerable.

There is another reason for doing things this way.  Consider a backend
that goes haywire and scribbles all over shared memory before crashing.
When the postmaster sees the abnormal child termination, it forcibly
kills the other active backends and discards shared memory altogether.
This gives us fairly good odds that the crash did not affect any data
on disk.  It's not perfect, of course, since another backend might have
been in the process of issuing a write() when the disaster happened,
but it's pretty good; and I think that isolation has a lot to do with
PG's good reputation for not corrupting data in crashes.  If we had a
large fraction of the address space mmap'd, then this sort of crash
would be just about guaranteed to propagate corruption into the
on-disk files.

			regards, tom lane
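[Editor's note: the following is a minimal C sketch of the write-ordering rule Tom describes, not PostgreSQL's actual buffer-manager code. The types and function names (WalLog, Page, flush_wal_to, write_page) are invented for illustration; the point is only that a dirty page's LSN must be flushed in the WAL before the page itself may be written.]

```c
/*
 * Sketch of the WAL-before-data rule (hypothetical names, not PG source):
 * a page may only be written once the WAL is durable up to the page's LSN.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t Lsn;                /* WAL offset ("log sequence number") */

typedef struct {
    Lsn flushed_upto;                /* WAL known to be durably on disk */
} WalLog;

typedef struct {
    Lsn lsn;                         /* latest WAL record describing this page */
    int dirty;
} Page;

/* Stand-in for the expensive fsync() of the WAL up to 'target'. */
static void flush_wal_to(WalLog *wal, Lsn target)
{
    if (wal->flushed_upto < target) {
        printf("fsync WAL up to %llu\n", (unsigned long long) target);
        wal->flushed_upto = target;
    }
}

/* The rule: never write() a page whose LSN is ahead of the flushed WAL. */
static void write_page(WalLog *wal, Page *page)
{
    if (page->lsn > wal->flushed_upto)
        flush_wal_to(wal, page->lsn);        /* WAL first ...   */
    printf("write() page with LSN %llu\n",
           (unsigned long long) page->lsn);  /* ... page second */
    page->dirty = 0;
}

int main(void)
{
    WalLog wal = { .flushed_upto = 100 };
    Page   p   = { .lsn = 250, .dirty = 1 };

    /*
     * With explicit write(), the dirty page can sit in shared buffers
     * until eviction, so the WAL flush is deferred and batched.  Under
     * mmap(), the kernel may push the modified page to disk at any
     * moment, which would force a WAL flush before every in-memory
     * change -- the write-and-wait-for-WAL-fsync pattern described above.
     */
    write_page(&wal, &p);
    return 0;
}
```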