Re: Sorting writes during checkpoint
| From | Tom Lane |
|---|---|
| Subject | Re: Sorting writes during checkpoint |
| Date | |
| Msg-id | 4421.1209876019@sss.pgh.pa.us |
| In reply to | Re: Sorting writes during checkpoint (ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>) |
| Responses | Re: Sorting writes during checkpoint |
| List | pgsql-patches |
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Greg Smith <gsmith@gregsmith.com> wrote:
>> If shared_buffers(=NBuffers) is set to something big, this could give some
>> memory churn. And I think it's a bad idea to allocate something this
>> large at checkpoint time, because what happens if that fails? Really not
>> the time you want to discover there's no RAM left.
> Hmm, but I think we need to copy buffer tags into bgwriter's local memory
> in order to avoid locking the tags many times during the sort.
I updated this patch to permanently allocate the working array as Greg
suggests, and to fix a bunch of commenting issues (attached).
However, I am completely unable to measure any performance improvement
from it. Given the possible risk of out-of-memory failures, I think the
patch should not be applied without some direct proof of performance
benefits, and I don't see any.
regards, tom lane
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.228
diff -c -r1.228 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c 1 Jan 2008 19:45:51 -0000 1.228
--- src/backend/storage/buffer/bufmgr.c 4 May 2008 01:11:08 -0000
***************
*** 56,61 ****
--- 56,68 ----
#define BUF_WRITTEN 0x01
#define BUF_REUSABLE 0x02
+ /* Struct for BufferSync's internal to-do list */
+ typedef struct BufAndTag
+ {
+ int buf_id;
+ BufferTag tag;
+ } BufAndTag;
+
/* GUC variables */
bool zero_damaged_pages = false;
***************
*** 986,991 ****
--- 993,1025 ----
}
/*
+ * qsort comparator for BufferSync
+ */
+ static int
+ bufandtagcmp(const void *a, const void *b)
+ {
+ const BufAndTag *lhs = (const BufAndTag *) a;
+ const BufAndTag *rhs = (const BufAndTag *) b;
+ int r;
+
+ /*
+ * We don't much care about the order in which different relations get
+ * written, so memcmp is enough for comparing the relfilenodes,
+ * even though its behavior will be platform-dependent.
+ */
+ r = memcmp(&lhs->tag.rnode, &rhs->tag.rnode, sizeof(lhs->tag.rnode));
+ if (r != 0)
+ return r;
+
+ /* We do want blocks within a relation to be ordered accurately */
+ if (lhs->tag.blockNum < rhs->tag.blockNum)
+ return -1;
+ if (lhs->tag.blockNum > rhs->tag.blockNum)
+ return 1;
+ return 0;
+ }
+
+ /*
* BufferSync -- Write out all dirty buffers in the pool.
*
* This is called at checkpoint time to write out all dirty shared buffers.
***************
*** 995,1004 ****
static void
BufferSync(int flags)
{
int buf_id;
- int num_to_scan;
int num_to_write;
int num_written;
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 1029,1056 ----
static void
BufferSync(int flags)
{
+ static BufAndTag *bufs_to_write = NULL;
int buf_id;
int num_to_write;
int num_written;
+ int i;
+
+ /*
+ * We allocate the bufs_to_write[] array on first call and keep it
+ * around for the life of the process. This is okay because in normal
+ * operation this function is only called within the bgwriter, so
+ * we won't have lots of large arrays floating around. We prefer this
+ * way because we don't want checkpoints to suddenly start failing
+ * when the system gets under memory pressure.
+ */
+ if (bufs_to_write == NULL)
+ {
+ bufs_to_write = (BufAndTag *) malloc(NBuffers * sizeof(BufAndTag));
+ if (bufs_to_write == NULL)
+ ereport(FATAL,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+ }
/* Make sure we can handle the pin inside SyncOneBuffer */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
***************
*** 1033,1038 ****
--- 1085,1092 ----
if (bufHdr->flags & BM_DIRTY)
{
bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ bufs_to_write[num_to_write].buf_id = buf_id;
+ bufs_to_write[num_to_write].tag = bufHdr->tag;
num_to_write++;
}
***************
*** 1043,1061 ****
return; /* nothing to do */
/*
! * Loop over all buffers again, and write the ones (still) marked with
! * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point
! * since we might as well dump soon-to-be-recycled buffers first.
! *
! * Note that we don't read the buffer alloc count here --- that should be
! * left untouched till the next BgBufferSync() call.
*/
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
num_written = 0;
! while (num_to_scan-- > 0)
{
! volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
/*
* We don't need to acquire the lock here, because we're only looking
--- 1097,1120 ----
return; /* nothing to do */
/*
! * Sort the buffers-to-be-written into order by file and block number.
! * This improves sequentiality of access for the upcoming I/O.
! */
! qsort(bufs_to_write, num_to_write, sizeof(BufAndTag), bufandtagcmp);
!
! /*
! * Loop over all buffers to be written, and write the ones (still) marked
! * with BM_CHECKPOINT_NEEDED. Note that we don't need to recheck the
! * buffer tag, because if the buffer has been reassigned it cannot have
! * BM_CHECKPOINT_NEEDED still set.
*/
num_written = 0;
! for (i = 0; i < num_to_write; i++)
{
! volatile BufferDesc *bufHdr;
!
! buf_id = bufs_to_write[i].buf_id;
! bufHdr = &BufferDescriptors[buf_id];
/*
* We don't need to acquire the lock here, because we're only looking
***************
*** 1077,1096 ****
num_written++;
/*
- * We know there are at most num_to_write buffers with
- * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
- * num_written reaches num_to_write.
- *
- * Note that num_written doesn't include buffers written by
- * other backends, or by the bgwriter cleaning scan. That
- * means that the estimate of how much progress we've made is
- * conservative, and also that this test will often fail to
- * trigger. But it seems worth making anyway.
- */
- if (num_written >= num_to_write)
- break;
-
- /*
* Perform normal bgwriter duties and sleep to throttle our
* I/O rate.
*/
--- 1136,1141 ----
***************
*** 1098,1110 ****
(double) num_written / num_to_write);
}
}
-
- if (++buf_id >= NBuffers)
- buf_id = 0;
}
/*
! * Update checkpoint statistics. As noted above, this doesn't include
* buffers written by other backends or bgwriter scan.
*/
CheckpointStats.ckpt_bufs_written += num_written;
--- 1143,1152 ----
(double) num_written / num_to_write);
}
}
}
/*
! * Update checkpoint statistics. The num_written count doesn't include
* buffers written by other backends or bgwriter scan.
*/
CheckpointStats.ckpt_bufs_written += num_written;
Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.64
diff -c -r1.64 freelist.c
*** src/backend/storage/buffer/freelist.c 1 Jan 2008 19:45:51 -0000 1.64
--- src/backend/storage/buffer/freelist.c 4 May 2008 01:11:08 -0000
***************
*** 241,250 ****
}
/*
! * StrategySyncStart -- tell BufferSync where to start syncing
*
! * The result is the buffer index of the best buffer to sync first.
! * BufferSync() will proceed circularly around the buffer array from there.
*
* In addition, we return the completed-pass count (which is effectively
* the higher-order bits of nextVictimBuffer) and the count of recent buffer
--- 241,251 ----
}
/*
! * StrategySyncStart -- tell BgBufferSync where we are reclaiming buffers
*
! * The result is the buffer index of the next possible victim buffer.
! * BgBufferSync() tries to keep the buffers immediately in front of this
! * point clean.
*
* In addition, we return the completed-pass count (which is effectively
* the higher-order bits of nextVictimBuffer) and the count of recent buffer