Re: Scaling shared buffer eviction
From:        Amit Kapila
Subject:     Re: Scaling shared buffer eviction
Date:
Msg-id:      CAA4eK1JC-ZijSv-3km3C-zjK=GAxk97B6i0aQ-Fow0Tmw7K24A@mail.gmail.com
In reply to: Re: Scaling shared buffer eviction (Robert Haas <robertmhaas@gmail.com>)
Replies:     Re: Scaling shared buffer eviction (Amit Kapila <amit.kapila16@gmail.com>)
             Re: Scaling shared buffer eviction (Merlin Moncure <mmoncure@gmail.com>)
List:        pgsql-hackers
On Wed, Sep 3, 2014 at 1:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have updated the patch to address the feedback. Main changes are:
> >
> > 1. For populating freelist, have a separate process (bgreclaimer)
> > instead of doing it by bgwriter.
> > 2. Autotune the low and high threshold values for buffers
> > in freelist. I have used the formula as suggested by you upthread.
> > 3. Cleanup of locking regimen as discussed upthread (completely
> > eliminated BufFreelist Lock).
> > 4. Improved comments and general code cleanup.
>
> Overall this looks quite promising to me.
>
> I had thought to call the new process just "bgreclaim" rather than
> "bgreclaimer", but perhaps your name is better after all. At least,
> it matches what we do elsewhere. But I don't care for the use
> "Bgreclaimer"; let's do "BgReclaimer" if we really need mixed-case, or
> else "bgreclaimer".
Changed it to bgreclaimer.
> This is unclear:
>
> +buffers for replacement. Earlier to protect freelist, we use LWLOCK as that
> +is needed to perform clock sweep which is a longer operation, however now we
> +are using two spinklocks freelist_lck and victimbuf_lck to perform operations
> +on freelist and run clock sweep respectively.
>
> I would drop the discussion of what was done before and say something
> like this: The data structures relating to buffer eviction are
> protected by two spinlocks. freelist_lck protects the freelist and
> related data structures, while victimbuf_lck protects information
> related to the current clock sweep condition.
Changed, but I have not used the exact wording mentioned above. Let me know if the new wording is okay.
> +always in this list. We also throw buffers into this list if we consider
> +their pages unlikely to be needed soon; this is done by background process
> +reclaimer. The list is singly-linked using fields in the
>
> I suggest: Allocating pages from this list is much cheaper than
> running the "clock sweep" algorithm, which may encounter many buffers
> that are poor candidates for eviction before finding a good candidate.
> Therefore, we have a background process called bgreclaimer which works
> to keep this list populated.
Changed as per your suggestion.
> +Background Reclaimer's Processing
> +---------------------------------
>
> I suggest titling this section "Background Reclaim".
>
> +The background reclaimer is designed to move buffers to freelist that are
>
> I suggest replacing the first three words of this sentence with "bgreclaimer".
As per the discussion in this thread, I have kept it as it is.
> +and move the the unpinned and zero usage count buffers to freelist. It
> +keep on doing this until the number of buffers in freelist become equal
> +high threshold of freelist.
>
> s/keep/keeps/
> s/become equal/reaches the/
> s/high threshold/high water mark/
> s/of freelist//
Changed as per your suggestion.
> Please change the other places that say threshold to use the "water
> mark" terminology.
>
> + if (StrategyMoveBufferToFreeListEnd (bufHdr))
>
> Extra space.
>
> + * buffers in consecutive cycles.
>
> s/consecutive/later/
>
> + /* Execute the LRU scan */
>
> s/LRU scan/clock sweep/ ?
Changed as per your suggestion.
>
> + while (tmp_num_to_free > 0)
>
> I am not sure it's a good idea for this value to be fixed at loop
> start and then just decremented. Shouldn't we loop and do the whole
> thing over once we reach the high watermark, only stopping when
> StrategySyncStartAndEnd() says num_to_free is 0?
Okay, changed the loop as per your suggestion.
> + /* choose next victim buffer to clean. */
>
> This process doesn't clean buffers; it puts them on the freelist.
Right. Changed it to match what it does.
> + * high threshold of freelsit), we drasticaly reduce the odds for
>
> Two typos.
Fixed.
> + * of buffers in freelist fall below low threshold of freelist.
>
> s/fall/falls/
Changed as per your suggestion.
> In freelist.c, it seems like a poor idea to have two spinlocks as
> consecutive structure members; they'll be in the same cache line,
> leading to false sharing. If we merge them into a single spinlock,
> does that hurt performance? If we put them further apart, e.g. by
> moving the freelist_lck to the start of the structure, followed by the
> latches, and leaving victimbuf_lck where it is, does that help
> performance?
As per the discussion, I have kept them as they are and added a comment
indicating that we can consider placing the two locks in separate
cache lines.
> + /*
> + * If the buffer is pinned or has a nonzero usage_count,
> we cannot use
> + * it; discard it and retry. (This can only happen if VACUUM put a
> + * valid buffer in the freelist and then someone else
> used it before
> + * we got to it. It's probably impossible altogether as
> of 8.3, but
> + * we'd better check anyway.)
> + */
> +
>
> This comment is clearly obsolete.
Removed.
> > I have not yet added statistics (buffers_backend_clocksweep) as
> > for that we need to add one more variable in BufferStrategyControl
> > structure where I have already added few variables for this patch.
> > I think it is important to have such a stat available via
> > pg_stat_bgwriter, but not sure if it is worth to make the structure
> > bit more bulky.
>
> I think it's worth it.
Okay, added the new statistic.
> > Another minor point is about changes in lwlock.h
> > lwlock.h
> > * if you remove a lock, consider leaving a gap in the numbering
> > * sequence for the benefit of DTrace and other external debugging
> > * scripts.
> >
> > As I have removed the BufFreelist lock, I have adjusted the numbering
> > as well in lwlock.h. There is a message on top of the lock definitions
> > which suggests leaving a gap if we remove any lock; however, I was not
> > sure whether this case (removing the first element) can affect anything,
> > so for now I have adjusted the numbering.
>
> Let's leave slot 0 unused, instead.
Sure, that makes sense.
Apart from the above, I think this patch requires a catversion bump, as
I have modified a system catalog. However, I have not done it in the
patch, as that would make it a bit difficult to take performance data.
Performance Data with updated patch

Configuration and DB details:
IBM POWER-7, 16 cores, 64 hardware threads
RAM = 64GB
Database locale = C
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB
scale factor = 3000
Client count = number of concurrent sessions and threads (e.g. -c 8 -j 8)
Duration of each individual run = 5 mins
All data is in tps, taken using a pgbench read-only load.
Client Count/Patch_Ver (tps) |     8 |     16 |     32 |     64 |    128
HEAD                         | 58614 | 107370 | 140717 | 104357 |  65010
Patch                        | 60092 | 113564 | 165014 | 213848 | 216065
This data is the median of 3 runs; the detailed report is attached to
this mail. I have not repeated the test for all configurations, as there
is no major change in design/algorithm that could affect performance.
Mark has already taken tpc-b data, which shows there is no problem with
it; however, I will also run it once with the latest version.