Discussion: V4 of PITR performance improvement for 8.4
This is the V4 patch to improve file reads during startup, for review.

The basic algorithm is the same as V3, but it works with Gregory's fadvise patch:
http://archives.postgresql.org/pgsql-hackers/2009-01/msg00026.php

This patch also includes an additional patch for posix_fadvise to skip prefetch of pages which do not exist.

We still need a patch to make this work with synchronous replication, which will be posted by the end of this week.

Regards;
--
------
Koichi Suzuki
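The "skip prefetch of pages which do not exist" idea mentioned above can be sketched roughly as follows. This is a minimal illustration, not code from the patch: the helper name `prefetch_block_if_exists` and the fixed 8 kB block size are assumptions for the example.

```c
#define _XOPEN_SOURCE 600

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* illustrative page size */

/*
 * Hypothetical helper: issue a readahead hint for one block, but skip
 * blocks that lie beyond the current end of the file, since a
 * posix_fadvise() on a nonexistent range is useless work.
 *
 * Returns 0 on success or skip, -1 on fstat failure.
 */
static int
prefetch_block_if_exists(int fd, unsigned blockno)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;

    /* Block is past EOF: nothing on disk to read ahead. */
    if ((off_t) blockno * BLCKSZ >= st.st_size)
        return 0;

    /* Hint the kernel to start reading this block asynchronously. */
    return posix_fadvise(fd, (off_t) blockno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}
```

Since the hint is purely advisory, skipping it silently is safe: the worst case is a synchronous read later during redo.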
Sorry, I didn't attach the patch file. This is the second try.

2009/1/12 Koichi Suzuki <koichi.szk@gmail.com>:
> This is the V4 patch to improve file reads during startup, for review.
>
> The basic algorithm is the same as V3, but it works with Gregory's
> fadvise patch:
> http://archives.postgresql.org/pgsql-hackers/2009-01/msg00026.php
>
> This patch also includes an additional patch for posix_fadvise to skip
> prefetch of pages which do not exist.
>
> We still need a patch to make this work with synchronous replication,
> which will be posted by the end of this week.
>
> Regards;
>
> --
> ------
> Koichi Suzuki

--
------
Koichi Suzuki
Attachments
"Koichi Suzuki" <koichi.szk@gmail.com> wrote:

> > This patch also includes an additional patch for posix_fadvise to skip
> > prefetch of pages which do not exist.

I reviewed your patch and found items to be fixed and improved.

- You should not claim your copyrights.
  There are copyright claims in the two newly added files, but they
  should be held by the PostgreSQL Global Development Group.
  | * Portions Copyright (c) 2008, Nippon Telegraph and Telephone Corporation
  | * Portions Copyright (c) 1996-2004, PostgreSQL Global Development Group

- Avoid pending WAL records on continuous REDO.
  In archive recovery, ReadRecord() waits until the next segment comes.
  Should we avoid pending WAL records during the wait?
  RedoRecords() might be needed before calling RestoreArchivedFile().
  It would be better to redo records and restore archived files in
  parallel, but we would need to replace system() with another
  mechanism, because system() blocks until the process finishes.

- Check the buffer pool before calling smgrprefetch.
  If the page is already in the shared buffer pool, we don't need to
  prefetch it from disk. I think it would be good to add
  PrefetchBufferWithoutRelcache(), like ReadBufferWithoutRelcache().

- Is it possible to remove the calls of ReadAheadHasRoom()?
  If we check the room in the readahead queue, the xxx_readahead
  functions don't need to call ReadAheadHasRoom(). The maximum number
  of readaheads is at most 3, so we can flush the queue when fewer
  than 3 slots remain. And if the queue is unexpectedly full, we can
  simply ignore the added items, because they are only hints.

- Avoid large static variables.
  | [src/backend/access/transam/xlog]
  | + static char RecordQueueBuf[XLogSegSize * 2];
  | [readahead.c]
  | + static XLogReadAhead ReadAheadQueue[ReadAheadQueueSize];
  I think it is a bad idea to place a large object in a static variable.
  It should be a pointer, allocated with palloc() at runtime.
  Also, those variables are only used by the startup process.

- Merge readahead.h into xlog.h or something similar.
  It exports just a few functions, and the functionality belongs to
  xlog and recovery.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
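The "ignore added items when the queue is unexpectedly full, because they are only hints" suggestion above could look roughly like this. The names and capacity below are illustrative, not the patch's actual API.

```c
#include <stddef.h>

#define READAHEAD_QUEUE_SIZE 256    /* illustrative capacity */

typedef struct
{
    unsigned    relnode;            /* which relation file */
    unsigned    blockno;            /* which block to prefetch */
} ReadAheadEntry;

static ReadAheadEntry queue[READAHEAD_QUEUE_SIZE];
static int  queue_len = 0;

/*
 * Add a prefetch hint.  No caller-side "has room" check is required:
 * if the queue is unexpectedly full we simply drop the entry, because
 * a lost hint only costs a synchronous read later, never correctness.
 */
static void
readahead_add(unsigned relnode, unsigned blockno)
{
    if (queue_len >= READAHEAD_QUEUE_SIZE)
        return;                     /* hint dropped, recovery still correct */
    queue[queue_len].relnode = relnode;
    queue[queue_len].blockno = blockno;
    queue_len++;
}
```

With this shape, callers can enqueue unconditionally, and the "flush when fewer than 3 slots remain" policy becomes an internal detail of the queue rather than a check scattered across every xxx_readahead function.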
Thanks a lot for the review.

Enclosed is a revised patch reflecting your comments. Some replies to the comments are also included inline.

2009/1/14 ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>:
>
> "Koichi Suzuki" <koichi.szk@gmail.com> wrote:
>
>> > This patch also includes an additional patch for posix_fadvise to
>> > skip prefetch of pages which do not exist.
>
> I reviewed your patch and found items to be fixed and improved.
>
> - You should not claim your copyrights.
>   There are copyright claims in the two newly added files, but they
>   should be held by the PostgreSQL Global Development Group.
>   | * Portions Copyright (c) 2008, Nippon Telegraph and Telephone Corporation
>   | * Portions Copyright (c) 1996-2004, PostgreSQL Global Development Group

Okay, fixed.

> - Avoid pending WAL records on continuous REDO.
>   In archive recovery, ReadRecord() waits until the next segment comes.
>   Should we avoid pending WAL records during the wait?
>   RedoRecords() might be needed before calling RestoreArchivedFile().
>   It would be better to redo records and restore archived files in
>   parallel, but we would need to replace system() with another
>   mechanism, because system() blocks until the process finishes.

Yes, I added this feature. Because system() blocks, it does not run in parallel.

> - Check the buffer pool before calling smgrprefetch.
>   If the page is already in the shared buffer pool, we don't need to
>   prefetch it from disk. I think it would be good to add
>   PrefetchBufferWithoutRelcache(), like ReadBufferWithoutRelcache().

Agreed. This needs an addition to the buffer manager API. I'd like to submit this change as a separate patch by the end of this week.

> - Is it possible to remove the calls of ReadAheadHasRoom()?
>   If we check the room in the readahead queue, the xxx_readahead
>   functions don't need to call ReadAheadHasRoom(). The maximum number
>   of readaheads is at most 3, so we can flush the queue when fewer
>   than 3 slots remain. And if the queue is unexpectedly full, we can
>   simply ignore the added items, because they are only hints.

More than three pages can be involved in a GiST WAL record, so I think the above check is needed.

> - Avoid large static variables.
>   | [src/backend/access/transam/xlog]
>   | + static char RecordQueueBuf[XLogSegSize * 2];
>   | [readahead.c]
>   | + static XLogReadAhead ReadAheadQueue[ReadAheadQueueSize];
>   I think it is a bad idea to place a large object in a static variable.
>   It should be a pointer, allocated with palloc() at runtime.
>   Also, those variables are only used by the startup process.

In the startup process, most memory allocation is done with malloc(), not palloc(). Replacing malloc() with palloc() may be done separately.

> - Merge readahead.h into xlog.h or something similar.
>   It exports just a few functions, and the functionality belongs to
>   xlog and recovery.

Agreed.

> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center

--
------
Koichi Suzuki
Attachments
Please find enclosed the 2nd patch of pg_readahead, which includes a patch to the buffer manager to skip prefetch of pages already in shared buffers.

This version is complete, works with current HEAD, and is ready for further review and for inclusion in 8.4.

Because sync. rep. is still under development, I'm planning to post an additional patch to make pg_readahead work with it. That will be a separate patch and is not necessary for usual operation.

Regards;

2009/1/20 Koichi Suzuki <koichi.szk@gmail.com>:
> Thanks a lot for the review.
>
> Enclosed is a revised patch reflecting your comments. Some replies to
> the comments are also included inline.
>
> 2009/1/14 ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>:
>>
>> "Koichi Suzuki" <koichi.szk@gmail.com> wrote:
>>
>>> > This patch also includes an additional patch for posix_fadvise to
>>> > skip prefetch of pages which do not exist.
>>
>> I reviewed your patch and found items to be fixed and improved.
>>
>> - You should not claim your copyrights.
>>   There are copyright claims in the two newly added files, but they
>>   should be held by the PostgreSQL Global Development Group.
>>   | * Portions Copyright (c) 2008, Nippon Telegraph and Telephone Corporation
>>   | * Portions Copyright (c) 1996-2004, PostgreSQL Global Development Group
>
> Okay, fixed.
>
>> - Avoid pending WAL records on continuous REDO.
>>   In archive recovery, ReadRecord() waits until the next segment comes.
>>   Should we avoid pending WAL records during the wait?
>>   RedoRecords() might be needed before calling RestoreArchivedFile().
>>   It would be better to redo records and restore archived files in
>>   parallel, but we would need to replace system() with another
>>   mechanism, because system() blocks until the process finishes.
>
> Yes, I added this feature. Because system() blocks, it does not run
> in parallel.
>
>> - Check the buffer pool before calling smgrprefetch.
>>   If the page is already in the shared buffer pool, we don't need to
>>   prefetch it from disk. I think it would be good to add
>>   PrefetchBufferWithoutRelcache(), like ReadBufferWithoutRelcache().
>
> Agreed. This needs an addition to the buffer manager API. I'd like to
> submit this change as a separate patch by the end of this week.
>
>> - Is it possible to remove the calls of ReadAheadHasRoom()?
>>   If we check the room in the readahead queue, the xxx_readahead
>>   functions don't need to call ReadAheadHasRoom(). The maximum number
>>   of readaheads is at most 3, so we can flush the queue when fewer
>>   than 3 slots remain. And if the queue is unexpectedly full, we can
>>   simply ignore the added items, because they are only hints.
>
> More than three pages can be involved in a GiST WAL record, so I think
> the above check is needed.
>
>> - Avoid large static variables.
>>   | [src/backend/access/transam/xlog]
>>   | + static char RecordQueueBuf[XLogSegSize * 2];
>>   | [readahead.c]
>>   | + static XLogReadAhead ReadAheadQueue[ReadAheadQueueSize];
>>   I think it is a bad idea to place a large object in a static variable.
>>   It should be a pointer, allocated with palloc() at runtime.
>>   Also, those variables are only used by the startup process.
>
> In the startup process, most memory allocation is done with malloc(),
> not palloc(). Replacing malloc() with palloc() may be done separately.
>
>> - Merge readahead.h into xlog.h or something similar.
>>   It exports just a few functions, and the functionality belongs to
>>   xlog and recovery.
>
> Agreed.
>
>> Regards,
>> ---
>> ITAGAKI Takahiro
>> NTT Open Source Software Center
>
> --
> ------
> Koichi Suzuki

--
------
Koichi Suzuki
Attachments
I found that the previous post didn't include the additional patch to check the shared buffers and avoid prefetch if the page is already there. Just in case, I'm posting the whole patch set in two files:

- readahead-20090121.patch is the main pg_readahead patch.
- addShBufCheck-20090120.patch patches the buffer manager to avoid
  prefetch when a page is already in the shared buffers.

This is the complete pg_readahead patch set, ready to be included in 8.4.

Regards;

2009/1/23 Koichi Suzuki <koichi.szk@gmail.com>:
> Please find enclosed the 2nd patch of pg_readahead, which includes a
> patch to the buffer manager to skip prefetch of pages already in
> shared buffers.
>
> This version is complete, works with current HEAD, and is ready for
> further review and for inclusion in 8.4.
>
> Because sync. rep. is still under development, I'm planning to post an
> additional patch to make pg_readahead work with it. That will be a
> separate patch and is not necessary for usual operation.
>
> Regards;
>
> 2009/1/20 Koichi Suzuki <koichi.szk@gmail.com>:
>> Thanks a lot for the review.
>>
>> Enclosed is a revised patch reflecting your comments. Some replies to
>> the comments are also included inline.
>>
>> 2009/1/14 ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>:
>>>
>>> "Koichi Suzuki" <koichi.szk@gmail.com> wrote:
>>>
>>>> > This patch also includes an additional patch for posix_fadvise to
>>>> > skip prefetch of pages which do not exist.
>>>
>>> I reviewed your patch and found items to be fixed and improved.
>>>
>>> - You should not claim your copyrights.
>>>   There are copyright claims in the two newly added files, but they
>>>   should be held by the PostgreSQL Global Development Group.
>>>   | * Portions Copyright (c) 2008, Nippon Telegraph and Telephone Corporation
>>>   | * Portions Copyright (c) 1996-2004, PostgreSQL Global Development Group
>>
>> Okay, fixed.
>>
>>> - Avoid pending WAL records on continuous REDO.
>>>   In archive recovery, ReadRecord() waits until the next segment
>>>   comes. Should we avoid pending WAL records during the wait?
>>>   RedoRecords() might be needed before calling RestoreArchivedFile().
>>>   It would be better to redo records and restore archived files in
>>>   parallel, but we would need to replace system() with another
>>>   mechanism, because system() blocks until the process finishes.
>>
>> Yes, I added this feature. Because system() blocks, it does not run
>> in parallel.
>>
>>> - Check the buffer pool before calling smgrprefetch.
>>>   If the page is already in the shared buffer pool, we don't need to
>>>   prefetch it from disk. I think it would be good to add
>>>   PrefetchBufferWithoutRelcache(), like ReadBufferWithoutRelcache().
>>
>> Agreed. This needs an addition to the buffer manager API. I'd like to
>> submit this change as a separate patch by the end of this week.
>>
>>> - Is it possible to remove the calls of ReadAheadHasRoom()?
>>>   If we check the room in the readahead queue, the xxx_readahead
>>>   functions don't need to call ReadAheadHasRoom(). The maximum number
>>>   of readaheads is at most 3, so we can flush the queue when fewer
>>>   than 3 slots remain. And if the queue is unexpectedly full, we can
>>>   simply ignore the added items, because they are only hints.
>>
>> More than three pages can be involved in a GiST WAL record, so I
>> think the above check is needed.
>>
>>> - Avoid large static variables.
>>>   | [src/backend/access/transam/xlog]
>>>   | + static char RecordQueueBuf[XLogSegSize * 2];
>>>   | [readahead.c]
>>>   | + static XLogReadAhead ReadAheadQueue[ReadAheadQueueSize];
>>>   I think it is a bad idea to place a large object in a static
>>>   variable. It should be a pointer, allocated with palloc() at
>>>   runtime. Also, those variables are only used by the startup process.
>>
>> In the startup process, most memory allocation is done with malloc(),
>> not palloc(). Replacing malloc() with palloc() may be done separately.
>>
>>> - Merge readahead.h into xlog.h or something similar.
>>>   It exports just a few functions, and the functionality belongs to
>>>   xlog and recovery.
>>
>> Agreed.
>>
>>> Regards,
>>> ---
>>> ITAGAKI Takahiro
>>> NTT Open Source Software Center
>>
>> --
>> ------
>> Koichi Suzuki
>
> --
> ------
> Koichi Suzuki

--
------
Koichi Suzuki
Attachments
Koichi Suzuki <koichi.szk@gmail.com> writes:

> Please find enclosed the 2nd patch of pg_readahead, which includes a
> patch to the buffer manager to skip prefetch of pages already in
> shared buffers.

I'm a bit confused by this comment. PrefetchBuffer already checks whether the page is in shared buffers.

What is tricky to avoid is prefetching the same page twice -- since the first prefetch doesn't actually put it in shared buffers, there's no way to avoid prefetching it again unless you keep some kind of hash of recently prefetched buffers.

For the index scan case, I'm debating whether to add such a cache directly to PrefetchBuffer -- in which case it would remember whether some other scan prefetched the same buffer -- or to keep it in the index scan code.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
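The "hash of recently prefetched buffers" Gregory mentions could be as simple as a fixed-size, collision-overwriting table. This is only a sketch of the idea, not anything from the patch; the function name, sizing, and hash constant are all assumptions.

```c
#include <stdbool.h>

#define PREFETCH_CACHE_SIZE 1024    /* illustrative, power of two */

typedef struct
{
    unsigned    relnode;
    unsigned    blockno;
    bool        valid;
} PrefetchedPage;

static PrefetchedPage prefetch_cache[PREFETCH_CACHE_SIZE];

/*
 * Record that a page is being prefetched; return true if it was already
 * recorded, so the caller can skip issuing a second posix_fadvise().
 *
 * Collisions simply overwrite the slot: a false "not seen" answer only
 * costs one redundant hint, so no chaining or eviction policy is needed.
 */
static bool
recently_prefetched(unsigned relnode, unsigned blockno)
{
    unsigned    h = (relnode * 2654435761u ^ blockno) % PREFETCH_CACHE_SIZE;
    PrefetchedPage *slot = &prefetch_cache[h];

    if (slot->valid && slot->relnode == relnode && slot->blockno == blockno)
        return true;

    slot->relnode = relnode;
    slot->blockno = blockno;
    slot->valid = true;
    return false;
}
```

Because duplicate suppression here is best-effort in both directions, it fits Gregory's framing: the cache can forget entries (extra hints) or answer from a stale slot layout (never, with the exact-match check), without ever affecting correctness.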
Hi,

It's simply because we should not refer to the system catalog during recovery.

2009/1/25 Gregory Stark <stark@enterprisedb.com>:
>
> Koichi Suzuki <koichi.szk@gmail.com> writes:
>
>> Please find enclosed the 2nd patch of pg_readahead, which includes a
>> patch to the buffer manager to skip prefetch of pages already in
>> shared buffers.
>
> I'm a bit confused by this comment. PrefetchBuffer already checks
> whether the page is in shared buffers.
>
> What is tricky to avoid is prefetching the same page twice -- since the
> first prefetch doesn't actually put it in shared buffers, there's no
> way to avoid prefetching it again unless you keep some kind of hash of
> recently prefetched buffers.
>
> For the index scan case, I'm debating whether to add such a cache
> directly to PrefetchBuffer -- in which case it would remember whether
> some other scan prefetched the same buffer -- or to keep it in the
> index scan code.
>
> --
> Gregory Stark
> EnterpriseDB          http://www.enterprisedb.com
> Ask me about EnterpriseDB's On-Demand Production Tuning

--
------
Koichi Suzuki
Koichi Suzuki <koichi.szk@gmail.com> writes:

> It's simply because we should not refer to the system catalog during
> recovery.

I don't understand how this is connected to anything to do with prefetching?

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
Hi,

This is also a reply to your latest post. I'm replying to the older one because I need the original text.

2009/1/26 Koichi Suzuki <koichi.szk@gmail.com>:
> Hi,
>
> It's simply because we should not refer to the system catalog during
> recovery.

This is the reason why I didn't use PrefetchBuffer(). PrefetchBuffer() goes to the system catalog, which we should not read during recovery. I added code to prefetch the needed pages without referring to the system catalog. As you pointed out, it has nothing to do with the prefetch itself.

> 2009/1/25 Gregory Stark <stark@enterprisedb.com>:
>>
>> Koichi Suzuki <koichi.szk@gmail.com> writes:
>>
>>> Please find enclosed the 2nd patch of pg_readahead, which includes a
>>> patch to the buffer manager to skip prefetch of pages already in
>>> shared buffers.
>>
>> I'm a bit confused by this comment. PrefetchBuffer already checks
>> whether the page is in shared buffers.
>>
>> What is tricky to avoid is prefetching the same page twice -- since
>> the first prefetch doesn't actually put it in shared buffers, there's
>> no way to avoid prefetching it again unless you keep some kind of
>> hash of recently prefetched buffers.
>>
>> For the index scan case, I'm debating whether to add such a cache
>> directly to PrefetchBuffer -- in which case it would remember whether
>> some other scan prefetched the same buffer -- or to keep it in the
>> index scan code.

I also think this could be an additional feature of prefetch. On the other hand, if a page is not in the shared buffers but is in the kernel's cache -- which is exactly the case where we want to avoid the prefetch -- posix_fadvise() will not read from the disk, and the call will take very little time. So for the time being, I think this is acceptable.

>> --
>> Gregory Stark
>> EnterpriseDB          http://www.enterprisedb.com
>> Ask me about EnterpriseDB's On-Demand Production Tuning

--
------
Koichi Suzuki
On Sun, Jan 25, 2009 at 7:15 AM, Gregory Stark <stark@enterprisedb.com> wrote:
> Koichi Suzuki <koichi.szk@gmail.com> writes:
>
>> Please find enclosed the 2nd patch of pg_readahead, which includes a
>> patch to the buffer manager to skip prefetch of pages already in
>> shared buffers.
>
> I'm a bit confused by this comment. PrefetchBuffer already checks
> whether the page is in shared buffers.
>
> What is tricky to avoid is prefetching the same page twice -- since the
> first prefetch doesn't actually put it in shared buffers, there's no
> way to avoid prefetching it again unless you keep some kind of hash of
> recently prefetched buffers.
>
> For the index scan case, I'm debating whether to add such a cache
> directly to PrefetchBuffer -- in which case it would remember whether
> some other scan prefetched the same buffer -- or to keep it in the
> index scan code.

Has this issue been resolved? Does this patch need more review? Because if so, I'm guessing it needs to happen RSN.

...Robert
Hi,

My reply to Gregory's comment didn't raise any objections. I believe, as I posted to the Wiki page, the latest posted patch is okay and is waiting for review.

2009/2/24 Robert Haas <robertmhaas@gmail.com>:
> On Sun, Jan 25, 2009 at 7:15 AM, Gregory Stark <stark@enterprisedb.com> wrote:
>> Koichi Suzuki <koichi.szk@gmail.com> writes:
>>
>>> Please find enclosed the 2nd patch of pg_readahead, which includes a
>>> patch to the buffer manager to skip prefetch of pages already in
>>> shared buffers.
>>
>> I'm a bit confused by this comment. PrefetchBuffer already checks
>> whether the page is in shared buffers.
>>
>> What is tricky to avoid is prefetching the same page twice -- since
>> the first prefetch doesn't actually put it in shared buffers, there's
>> no way to avoid prefetching it again unless you keep some kind of
>> hash of recently prefetched buffers.
>>
>> For the index scan case, I'm debating whether to add such a cache
>> directly to PrefetchBuffer -- in which case it would remember whether
>> some other scan prefetched the same buffer -- or to keep it in the
>> index scan code.
>
> Has this issue been resolved? Does this patch need more review?
> Because if so, I'm guessing it needs to happen RSN.
>
> ...Robert

--
------
Koichi Suzuki
Hi Suzuki-san,

On Thu, Feb 26, 2009 at 5:03 AM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
> My reply to Gregory's comment didn't raise any objections. I believe,
> as I posted to the Wiki page, the latest posted patch is okay and is
> waiting for review.

One of your latest patches no longer applies to HEAD, so I updated it.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Mar 3, 2009 at 1:47 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Hi Suzuki-san,
>
> On Thu, Feb 26, 2009 at 5:03 AM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
>> My reply to Gregory's comment didn't raise any objections. I believe,
>> as I posted to the Wiki page, the latest posted patch is okay and is
>> waiting for review.
>
> One of your latest patches no longer applies to HEAD, so I updated it.

Oops! I failed to attach the patch. This is the second try.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments
Thanks a lot for your kind help!

2009/3/3 Fujii Masao <masao.fujii@gmail.com>:
> On Tue, Mar 3, 2009 at 1:47 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Hi Suzuki-san,
>>
>> On Thu, Feb 26, 2009 at 5:03 AM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
>>> My reply to Gregory's comment didn't raise any objections. I believe,
>>> as I posted to the Wiki page, the latest posted patch is okay and is
>>> waiting for review.
>>
>> One of your latest patches no longer applies to HEAD, so I updated it.
>
> Oops! I failed to attach the patch. This is the second try.
>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center

--
------
Koichi Suzuki
Fujii Masao wrote:
> On Tue, Mar 3, 2009 at 1:47 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Hi Suzuki-san,
>>
>> On Thu, Feb 26, 2009 at 5:03 AM, Koichi Suzuki <koichi.szk@gmail.com> wrote:
>>> My reply to Gregory's comment didn't raise any objections. I believe,
>>> as I posted to the Wiki page, the latest posted patch is okay and is
>>> waiting for review.
>>
>> One of your latest patches no longer applies to HEAD, so I updated it.
>
> Oops! I failed to attach the patch. This is the second try.

Thanks. This patch seems to be missing the new readahead.c file; I grabbed that from the previous patch version.

It's annoying that we have to write *_readahead functions for each and every record type. For most record types, we already pass the information about the pages involved to XLogInsert, for full-page writes. If we could change the WAL format so that that information is stored in WAL even when a full-page image is taken, we could do the read-ahead for every WAL record type in a single function. Getting the API right needs some thinking, but that would be a much nicer approach in the long run.

I agree with Itagaki-san's earlier comment that we should find a way to avoid the ReadAheadHasRoom calls (http://archives.postgresql.org/message-id/20090114101659.89CD.52131E4D@oss.ntt.co.jp). Let's just leave enough slack in the queue so that it never fills up. And if the unthinkable happens and it does fill up, we can just throw away the readahead requests that don't fit; they're just hints anyway.

The API for ReadAheadAddEntry should include a ForkNumber. All the readahead calls included in the patch access the main fork, but that's really just an omission that should probably be fixed, even though the FSM and visibility map should become cached pretty quickly for any active table.

The effective_io_concurrency setting is ignored. If it's set to 1, we should disable prefetching altogether, for the sake of both robustness (letting you recover even if there's a bug in the readahead code) and performance (avoiding useless posix_fadvise calls, sorting, etc. if there's only one spindle).

The bursty queuing behavior feels pretty strange to me, though I guess it works pretty well in practice. I would've assumed a FIFO queue of WAL records, with some kind of cache of recently issued posix_fadvise() calls.

As the patch stands, it's not limited to archive recovery.

The code in readahead.c seems to assume that the readahead queue will always be flushed between xlog segment switches, but that is not enforced in xlog.c.

malloc() can return NULL on out of memory. That needs to be checked, or use palloc() instead.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
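Two of the review points above -- honoring effective_io_concurrency as an off switch and checking malloc() for NULL -- could combine into one guarded initialization. This is only a sketch under assumed names (`readahead_init` and its state variables are not the patch's actual API, and the real GUC plumbing differs):

```c
#include <stdbool.h>
#include <stdlib.h>

static void *readahead_queue = NULL;
static bool  readahead_enabled = false;

/*
 * Sketch: set up recovery readahead.  If the administrator set
 * effective_io_concurrency to 1 (a single spindle), or if the queue
 * allocation fails, recovery simply proceeds without prefetching --
 * readahead is an optimization, never a requirement, so both cases
 * degrade to plain synchronous redo instead of erroring out.
 */
static void
readahead_init(int effective_io_concurrency, size_t queue_bytes)
{
    if (effective_io_concurrency <= 1)
        return;                     /* prefetching disabled by configuration */

    readahead_queue = malloc(queue_bytes);
    if (readahead_queue == NULL)
        return;                     /* out of memory: fall back to plain redo */

    readahead_enabled = true;
}
```

Treating "disabled" and "allocation failed" identically also gives the robustness escape hatch Heikki asks for: setting effective_io_concurrency to 1 bypasses the readahead code path entirely.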
Hi,

On Mon, Mar 9, 2009 at 10:20 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Thanks. This patch seems to be missing the new readahead.c file; I
> grabbed that from the previous patch version.

Oh, sorry for the mistake. I have rebased one of Suzuki-san's patches to HEAD again (readahead-20090310.patch). The other (addShBufCheck-20090120.patch) is unchanged.

Comment: we might reach the consistent recovery state *before* redoing up to the safe starting point, because readahead slightly delays the actual redo. Is this safe? If not, should the readahead queue be flushed before reaching that state?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments
Fujii Masao wrote:
> Hi,
>
> On Mon, Mar 9, 2009 at 10:20 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> Thanks. This patch seems to be missing the new readahead.c file; I
>> grabbed that from the previous patch version.
>
> Oh, sorry for the mistake. I have rebased one of Suzuki-san's patches
> to HEAD again (readahead-20090310.patch). The other
> (addShBufCheck-20090120.patch) is unchanged.
>
> Comment:
> we might reach the consistent recovery state *before* redoing up to
> the safe starting point, because readahead slightly delays the actual
> redo. Is this safe?

No. If you haven't replayed all the WAL records up to the safe starting point, the database isn't consistent yet. The distinction doesn't matter in practice without Hot Standby, though.

> If not, should the readahead queue be flushed before reaching that
> state?

Yes. Or you could move the reporting that you've reached the consistent recovery state into RedoRecords, when you reach the minimum safe starting point.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
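The flush-before-reporting rule discussed here can be sketched in a few lines. The names below (`pending`, `redo_pending_records`, `report_consistent_state`) are illustrative stand-ins for the patch's queue and for xlog.c's consistency reporting, not actual PostgreSQL identifiers:

```c
#include <stdbool.h>

static int  pending = 0;                /* WAL records queued, not yet redone */
static bool consistent_reported = false;

/* Illustrative stand-in for the patch's RedoRecords(): drain the queue. */
static void
redo_pending_records(void)
{
    pending = 0;
}

/*
 * Report the consistent-recovery point only after draining the queue:
 * with readahead, records may be queued but not yet applied, so the
 * database is not actually consistent until every queued record up to
 * the safe starting point has been redone.
 */
static void
report_consistent_state(void)
{
    if (pending > 0)
        redo_pending_records();
    consistent_reported = true;
}
```

This matches Heikki's second option: rather than flushing at the call site, the draining can live inside the redo path itself, so the report can never race ahead of the replayed WAL.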
Hi,

2009/3/10 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
> Fujii Masao wrote:
>>
>> Comment:
>> we might reach the consistent recovery state *before* redoing up to
>> the safe starting point, because readahead slightly delays the actual
>> redo. Is this safe?
>
> No. If you haven't replayed all the WAL records up to the safe starting
> point, the database isn't consistent yet. The distinction doesn't
> matter in practice without Hot Standby, though.
>
>> If not, should the readahead queue be flushed before reaching that
>> state?
>
> Yes. Or you could move the reporting that you've reached the consistent
> recovery state into RedoRecords, when you reach the minimum safe
> starting point.

This arrangement can be done with Hot Standby and Sync Rep, right?

So far, it sounds as if we need to add code to handle malloc() failure (OOM). In that case, a possible approach could be to skip the whole readahead, although the rest of the recovery may fail anyway because of the memory shortage.

--
------
Koichi Suzuki
This is a patch with error handling for malloc(). I found that skipping readahead on malloc() failure leads to too many changes with very little gain. The memory area is kept only during recovery and is released at the end of recovery, so a shortage of this memory would easily lead to another memory problem anyway.

2009/3/11 Koichi Suzuki <koichi.szk@gmail.com>:
> Hi,
>
> 2009/3/10 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:
>> Fujii Masao wrote:
>>>
>>> Comment:
>>> we might reach the consistent recovery state *before* redoing up to
>>> the safe starting point, because readahead slightly delays the
>>> actual redo. Is this safe?
>>
>> No. If you haven't replayed all the WAL records up to the safe
>> starting point, the database isn't consistent yet. The distinction
>> doesn't matter in practice without Hot Standby, though.
>>
>>> If not, should the readahead queue be flushed before reaching that
>>> state?
>>
>> Yes. Or you could move the reporting that you've reached the
>> consistent recovery state into RedoRecords, when you reach the
>> minimum safe starting point.
>
> This arrangement can be done with Hot Standby and Sync Rep, right?
>
> So far, it sounds as if we need to add code to handle malloc() failure
> (OOM). In that case, a possible approach could be to skip the whole
> readahead, although the rest of the recovery may fail anyway because
> of the memory shortage.
>
>> --
>> Heikki Linnakangas
>> EnterpriseDB   http://www.enterprisedb.com
>
> --
> ------
> Koichi Suzuki

--
------
Koichi Suzuki