Discussion: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock


[PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
"Greg Burd"
Date:
Hello,

In recent conversations [1] about how best to adapt the code to become NUMA-aware, Andres commented,
"FWIW, I've started to wonder if we shouldn't just get rid of the freelist entirely" and because I'm a glutton for
punishment (and I think this idea has some merit) I took him up on this task.

In freelist.c the function StrategyGetBuffer() currently tries first to use the BufferAccessStrategy, if present, via
GetBufferFromRing().  Failing that, the second step is to check the "freelist" as defined by
StrategyControl->firstFreeBuffer/lastFreeBuffer to determine if it contains any available buffers, and finally it will
"Use the clock-sweep algorithm to find a free buffer."  The freelist was intended to be "a list of buffers that are
prime candidates for replacement", but I question its value versus the overhead of managing it.  Without the list
some operations are likely faster (I plan to measure this) because, as Andres points out in [1], "just using clock sweep
actually makes things like DROP TABLE perform better because it doesn't need to maintain the freelist anymore."  It may
be the case that with very large NBuffers where most are in use this approach is slower (I plan to measure this
too), but in those cases I'd imagine the freelist is likely empty and so the code will use the clock-sweep algorithm
anyway, so I'm not sure there is a performance penalty at all.
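The three-step lookup order described above can be sketched as follows. This is an illustrative simplification, not the real bufmgr code: the types, the stub helpers (PopFreeList and friends), and the return values are all stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified types standing in for PostgreSQL's internals. */
typedef struct BufferDesc { int buf_id; } BufferDesc;
typedef struct BufferAccessStrategyData { int dummy; } *BufferAccessStrategy;

/* Stubs for the three lookup sources (assumptions, not the real API). */
static BufferDesc *GetBufferFromRing(BufferAccessStrategy s) { (void) s; return NULL; }
static BufferDesc *PopFreeList(void)                         { return NULL; }
static BufferDesc *ClockSweep(void)  { static BufferDesc b = {42}; return &b; }

/*
 * Order of operations in StrategyGetBuffer() before the patch:
 * ring, then freelist, then clock-sweep.  The patch removes step 2.
 */
static BufferDesc *
strategy_get_buffer(BufferAccessStrategy strategy)
{
    BufferDesc *buf;

    if (strategy != NULL && (buf = GetBufferFromRing(strategy)) != NULL)
        return buf;             /* 1. per-strategy ring buffer */
    if ((buf = PopFreeList()) != NULL)
        return buf;             /* 2. freelist (removed by this patch) */
    return ClockSweep();        /* 3. clock-sweep fallback */
}
```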

This change does remove the have_free_buffer() function used by the contrib/pg_prewarm module.  On the surface this
doesn't seem to cause any issues, but honestly I've not thought too deeply on this one.

v2-0001 Eliminate the freelist from the buffer manager and depend on clock-sweep.


Once removed [2] and tests passing [3] I took a long hard look at the buffer_strategy_lock that used to serialize
concurrent access to members of BufferStrategyControl and I couldn't find a good reason to keep it around.  Let's review
what it is guarding:

completePasses: a count of the number of times the clock-sweep hand wraps around.  StrategySyncStart() provides this to
the bgwriter which in turn uses it to compute a strategic location at which to start scanning for pages to evict.
There's an interesting comment that indicates both a "requirement" and an equivocal "but that's highly unlikely and
wouldn't be particularly harmful" statement conflicting with itself.  I tried to find a reason that nextVictimBuffers
could overflow or that the use of the completePasses value could somehow cause harm if off by one or more in the
bgwriter and either I missed it (please tell me) or there isn't one.  However, it does make sense to change
completePasses into an atomic value so that it is consistent across backends and in the bgwriter.

bgwprocno: when not -1, is the PID of the allocation notification latch (ProcGlobal->allProcs[bgwprocno].procLatch).
This is a "power savings" feature where the goal is to signal the bgwriter: "When a backend starts using buffers again,
it will wake us up by setting our latch."  Here the code reads, "Since we don't want to rely on a spinlock for this we
force a read from shared memory once, and then set the latch based on that value." and uses INT_ACCESS_ONCE() to read
the value and set the latch.  The function StrategyNotifyBgWriter() is where bgwprocno is set; I see no reason to use
atomic or other synchronization here.

And that's all there is to it now that the freelist is gone.  As a result, IMO it seems unnecessary to require a spin
lock for access to BufferStrategyControl.

v2-0002 Remove the buffer_strategy_lock.


The attached patch is also available as a branch on GitHub [2] if that is of interest or helpful; it passes check-world [3] (I use
the GitHub PR to trigger CirrusCI tests, not as the way to convey the change set).

I also made a few minor changes such that we're consistently referring to "clock-sweep" (not "clock sweep" or
"clocksweep").  I'm not wedded to those, but consistency isn't a bad thing, right?

As an aside, we're really implementing "generalized CLOCK" [4][5], which uses counters rather than a single bit as pointed
out [6] in the CLOCK-Pro [7] paper, but I digress.
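The GCLOCK scheme quoted in [6] can be sketched in a few lines. This is a toy illustration of the counter-based sweep, not PostgreSQL's actual usage_count handling; all names here are made up for the example.

```c
#include <stdint.h>

#define NPAGES 8                /* toy pool size for illustration */

/*
 * Per-page usage counts: a hit increments (capped), the sweeping hand
 * decrements until it finds a zero.  Mirrors the GCLOCK description
 * quoted in [6].
 */
static uint8_t usage_count[NPAGES];
static int     hand;

static void
page_hit(int page)
{
    if (usage_count[page] < 5)
        usage_count[page]++;
}

static int
gclock_evict(void)
{
    for (;;)
    {
        if (usage_count[hand] == 0)
        {
            int victim = hand;

            hand = (hand + 1) % NPAGES;
            return victim;      /* counter hit zero: evict this page */
        }
        usage_count[hand]--;    /* otherwise decrement and keep sweeping */
        hand = (hand + 1) % NPAGES;
    }
}
```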

I'd like to hear ideas for worst cases to test and/or benchmark.  I plan on attempting what I saw Michael do with
flamegraphs before/after/delta in a follow-up to this.  If people feel this has merit I'll add it to CF/2.

thank you for your time considering this idea,

-greg

[1] https://postgr.es/m/ndvygkpdx44pmi4xbkf52gfrl77cohpefr42tipvd5dgiaeuyd%40fe2og2kxyjnc
[2] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v2
[3] https://github.com/gburd/postgres/pull/4/checks
[4] V. F. Nicola, A. Dan, and D. M. Dias, "Analysis of the Generalized Clock Buffer Replacement Scheme for Database
Transaction Processing", Proceedings of the 1992 ACM SIGMETRICS Conference, June 1992, pp. 35-46.
[5] A. J. Smith, "Sequentiality and Prefetching in Database Systems", ACM Trans. on Database Systems, Vol. 3, No. 3,
1978, pp. 223-247.
[6] "In a generalized CLOCK version called GCLOCK [25,17], a counter is associated with each page rather than a single
bit. Its counter will be incremented if a page is hit. The cycling clock hand sweeps over the pages decrementing their
counters until a page whose counter is zero is found for replacement."
[7] CLOCK-Pro: An Effective Improvement of the CLOCK Replacement
https://www.usenix.org/legacy/event/usenix05/tech/general/full_papers/jiang/jiang.pdf

Attachments

Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Nathan Bossart
Date:
On Fri, Jul 11, 2025 at 01:26:53PM -0400, Greg Burd wrote:
> This change does remove the have_free_buffer() function used by the
> contrib/pg_prewarm module.  On the surface this doesn't seem to cause any
> issues, but honestly I've not thought too deeply on this one.

Hm.  ISTM we'll either need to invent another similarly inexpensive way to
test for this or to justify to ourselves that it's not necessary.  My guess
is that we do want to keep autoprewarm from evicting things, but FWIW the
docs already say "prewarming may also evict other data from cache" [0].

> Once removed [2] and tests passing [3] I took a long hard look at the
> buffer_strategy_lock that used to serialize concurrent access to members
> of BufferStrategyControl and I couldn't find a good reason to keep it
> around.  Let's review what it is guarding:
> 
> completePasses: a count of the number of times the clock-sweep hand wraps
> around.  StrategySyncStart() provides this to the bgwriter which in turn
> uses it to compute a strategic location at which to start scanning for
> pages to evict.  There's an interesting comment that indicates both a
> "requirement" and an equivocal "but that's highly unlikely and wouldn't
> be particularly harmful" statement conflicting with itself.  I tried to
> find a reason that nextVictimBuffers could overflow or that the use of
> the completePasses value could somehow cause harm if off by one or more
> in the bgwriter and either I missed it (please tell me) or there isn't
> one.  However, it does make sense to change completePasses into an atomic
> value so that it is consistent across backends and in the bgwriter.
> 
> bgwprocno: when not -1 is the PID of the allocation notification latch
> (ProcGlobal->allProcs[bgwprocno].procLatch).  This is a "power savings"
> feature where the goal is to signal the bgwriter "When a backend starts
> using buffers again, it will wake us up by setting our latch."  Here the
> code reads, "Since we don't want to rely on a spinlock for this we force
> a read from shared memory once, and then set the latch based on that
> value." and uses INT_ACCESS_ONCE() to read the value and set the latch.
> The function StrategyNotifyBgWriter() is where bgwprocno is set, I see no
> reason to use atomic or other synchronization here.
> 
> And that's all there is to it now that the freelist is gone.  As a
> result, IMO it seems unnecessary to require a spin lock for access to
> BufferStrategyControl.

I haven't followed your line of reasoning closely yet, but I typically
recommend that patches that replace locks with atomics use functions with
full barrier semantics (e.g., pg_atomic_read_membarrier_u32(),
pg_atomic_fetch_add_u32()) to make things easier to reason about.  But that
might not be as straightforward in cases like StrategySyncStart() where we
atomically retrieve two values that are used together.  Nevertheless,
minimizing cognitive load might be nice, and there's a chance it doesn't
impact performance very much.

[0] https://www.postgresql.org/docs/devel/pgprewarm.html

-- 
nathan



Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Andres Freund
Date:
Hi,

On 2025-07-11 13:26:53 -0400, Greg Burd wrote:
> In conversations [1] recently about considering how best to adapt the code
> to become NUMA-aware Andres commented, "FWIW, I've started to wonder if we
> shouldn't just get rid of the freelist entirely" and because I'm a glutton
> for punishment (and I think this idea has some merit) I took him up on this
> task.

> In freelist.c the function StrategyGetBuffer() currently tries first to use
> the BufferAccessStrategy, if present, via GetBufferFromRing().  Failing
> that, the second step is to check the "freelist" as defined by
> StrategyControl->firstFreeBuffer/lastFreeBuffer to determine if it contains
> any available buffers, and finally it will "Use the clock-sweep algorithm to
> find a free buffer."  The freelist was intended to be "a list of buffers
> that are prime candidates for replacement" but I question the value of that
> versus the overhead of managing it.  Without the list some operations are
> likely (I plan to measure this) faster due, as Anders points out in [1],
> "just using clock sweep actually makes things like DROP TABLE perform better
> because it doesn't need to maintain the freelist anymore."  It may be the
> case that with very large NBuffers where most are in use that this approach
> is slower (I plan to measure this too), but in those cases I'd imagine the
> freelist is likely empty and so the code will use the clock-sweep algorithm
> anyway, so I'm not sure there is a performance penalty at all.
>
> This change does remove the have_free_buffer() function used by the
> contrib/pg_prewarm module.  On the surface this doesn't seem to cause any
> issues, but honestly I've not thought too deeply on this one.

I think we'll likely need something to replace it.

TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
right. The goal of the use have_free_buffer() is obviously to stop prewarming
shared buffers if doing so would just evict buffers.  But it's not clear to me
that we should just stop when there aren't any free buffers - what if the
previous buffer contents aren't the right ones?  It'd make more sense to me to
stop autoprewarm once NBuffers have been prewarmed...
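The "stop once NBuffers have been prewarmed" alternative suggested here amounts to a simple counter check. A minimal hedged sketch, assuming a hypothetical helper name (this is not the actual autoprewarm code):

```c
#include <stdbool.h>
#include <stdint.h>

static int NBuffers = 16384;    /* stand-in for the real shared_buffers-derived value */

/*
 * One possible replacement for the have_free_buffer() check: simply stop
 * once we've loaded as many blocks as there are shared buffers, since any
 * further prewarming must evict something we just loaded.  Hypothetical
 * helper for illustration only.
 */
static bool
apw_should_continue(int64_t blocks_prewarmed)
{
    return blocks_prewarmed < NBuffers;
}
```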



> v2-0001 Eliminate the freelist from the buffer manager and depend on clock-sweep.
>
>
> Once removed [2] and tests passing [3] I took a long hard look at the
> buffer_strategy_lock that used to serialize concurrent access to members of
> BufferStrategyControl and I couldn't find a good reason to keep it around.
> Let's review what it is guarding:
>
> completePasses: a count of the number of times the clock-sweep hand wraps
> around.  StrategySyncStart() provides this to the bgwriter which in turn
> uses it to compute a strategic location at which to start scanning for pages
> to evict.  There's an interesting comment that indicates both a
> "requirement" and an equivocal "but that's highly unlikely and wouldn't be
> particularly harmful" statement conflicting with itself.  I tried to find a
> reason that nextVictimBuffers could overflow or that the use of the
> completePasses value could somehow cause harm if off by one or more in the
> bgwriter and either I missed it (please tell me) or there isn't one.
> However, it does make sense to change completePasses into an atomic value so
> that it is consistent across backends and in the bgwriter.

I don't think it's *quite* that easy. If you just maintain nextVictimBuffer
and completePasses as separate atomic counters, without a lock making the two
consistent, StrategySyncStart(), as coded right now / in your patch, won't
necessarily return reasonable values for the two.  Which I think would lead to
bgwriter suddenly becoming overly active for one cycle and then very inactive
for a while after.

With really large shared buffers that'd be rather rare to be hit, but with
small shared buffers I think it'd be common enough to worry.

The most obvious way around this would be to make the clock hand a 64bit
atomic, which would avoid the need to have a separate tracker for the number
of passes.  Unfortunately doing so would require doing a modulo operation each
clock tick, which I think would likely be too expensive on common platforms -
on small shared_buffers I actually see existing, relatively rarely reached,
modulo in ClockSweepTick() show up on a Skylake-X system.

It'd be easier if we could rely on NBuffers to be a power of two, but that
doesn't seem like a realistic requirement.


I think the easiest way here would be to make StrategySyncStart(), a
relatively rare operation, retry whenever it detects that it saw out-of-sync
nextVictimBuffer. E.g. something roughly like

while (true)
{
        completePasses = atomic_read_u32(&StrategyControl->completePasses);
        pg_memory_barrier();
        nextVictimBuffer = atomic_read_u32(&StrategyControl->nextVictimBuffer);
        pg_memory_barrier();

        if (completePasses == atomic_read_u32(&StrategyControl->completePasses) &&
            nextVictimBuffer <= atomic_read_u32(&StrategyControl->nextVictimBuffer))
           break;
}

which I think would detect the case that we read a nextVictimBuffer value that was
decreased while reading a completePasses value that was increased.
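As a self-contained illustration of that retry idea, here is a version using C11 atomics (in PostgreSQL this would use pg_atomic_read_u32() and pg_memory_barrier(); the variable and function names here are stand-ins, and the seq_cst loads take the place of the explicit barriers in the sketch above):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-ins for the two counters in BufferStrategyControl. */
static _Atomic uint32_t completePasses;
static _Atomic uint32_t nextVictimBuffer;

/*
 * Read a consistent (passes, victim) pair without a lock: retry whenever a
 * counter moved underneath us in a way that could mean we paired a wrapped
 * victim with a stale pass count.
 */
static void
strategy_sync_start(uint32_t *passes, uint32_t *victim)
{
    for (;;)
    {
        uint32_t p = atomic_load(&completePasses);
        uint32_t v = atomic_load(&nextVictimBuffer);

        /* re-read both; seq_cst loads provide the needed ordering */
        if (p == atomic_load(&completePasses) &&
            v <= atomic_load(&nextVictimBuffer))
        {
            *passes = p;
            *victim = v;
            return;
        }
    }
}
```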


I think while at it, we should make ClockSweepTick() decrement
nextVictimBuffer by atomically subtracting NBuffers, rather than using CAS. I
recently noticed that the CAS sometimes has to retry a fair number of times,
which in turn makes the `victim % NBuffers` show up in profiles.
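The subtract-instead-of-CAS idea could look roughly like the following standalone C11 sketch. Everything here (clock_sweep_tick, next_victim, nbuffers) is an illustrative stand-in, not the actual ClockSweepTick(); the rarely reached `%` remains for the overshoot case, but the retry loop of a CAS is gone.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t next_victim;    /* stand-in for nextVictimBuffer */
static uint32_t nbuffers = 8;           /* stand-in for NBuffers */

/*
 * One clock tick.  Instead of CAS-looping to wrap the hand back into
 * [0, NBuffers), subtract NBuffers when an overshoot is detected: since
 * subtracting NBuffers preserves the value modulo NBuffers, the counter
 * stays bounded without any retry loop.  Exactly one ticker per wrap
 * (the one whose overshot value lands on a multiple of NBuffers) pulls
 * the counter back.
 */
static uint32_t
clock_sweep_tick(void)
{
    uint32_t victim = atomic_fetch_add(&next_victim, 1);

    if (victim >= nbuffers)
    {
        victim %= nbuffers;     /* map our overshot value into range */
        if (victim == 0)
            atomic_fetch_sub(&next_victim, nbuffers);
    }
    return victim;
}
```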


> I'd like to hear ideas for worst cases to test and/or benchmark.  I plan on
> attempting what I saw Michael do with flamegraphs before/after/delta in a
> follow up to this.  If people feel this has merit I'll add it to CF/2.

What I've benchmarked is both single threaded and concurrent clock sweep, by
doing pg_prewarm() of per-pgbench-client relations. I used

c=40 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT numa_node, sum(size), count(*) FROM
pg_shmem_allocations_numa WHERE size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo "SELECT
pg_prewarm('copytest_:client_id');") -t1
 

(with c=1 for the single-threaded case obviously)

The reason for the pg_shmem_allocations_numa is to ensure that shared_buffers
is actually mapped, as otherwise the bottleneck will be the kernel zeroing out
buffers.

The reason for doing -t1 is that I wanted to compare freelist vs clock sweep,
rather than clock sweep in general.

Note that I patched EvictUnpinnedBufferInternal() to call
StrategyFreeBuffer(), otherwise running this a second time won't actually
measure the freelist.  And the first time run after postmaster start will
always be more noisy...

Greetings,

Andres Freund



Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
"Greg Burd"
Date:
On Fri, Jul 11, 2025, at 2:50 PM, Nathan Bossart wrote:
> On Fri, Jul 11, 2025 at 01:26:53PM -0400, Greg Burd wrote:
>> This change does remove the have_free_buffer() function used by the
>> contrib/pg_prewarm module.  On the surface this doesn't seem to cause any
>> issues, but honestly I've not thought too deeply on this one.
>
> Hm.  ISTM we'll either need to invent another similarly inexpensive way to
> test for this or to justify to ourselves that it's not necessary.  My guess
> is that we do want to keep autoprewarm from evicting things, but FWIW the
> docs already say "prewarming may also evict other data from cache" [0].

Thank you for spending time reviewing my proposal/patch!

I briefly considered how one might use what was left after surgery to produce some similar boolean signal, to no avail.
I think that autoprewarm was simply trying to warm at most NBuffers and then stop.  The freelist at startup was just a
convenient thing to drain to get that done.  Maybe I'll try adapting autoprewarm to consider that global instead.
 

>> Once removed [2] and tests passing [3] I took a long hard look at the
>> buffer_strategy_lock that used to serialize concurrent access to members
>> of BufferStrategyControl and I couldn't find a good reason to keep it
>> around.  Let's review what it is guarding:
>> 
>> completePasses: a count of the number of times the clock-sweep hand wraps
>> around.  StrategySyncStart() provides this to the bgwriter which in turn
>> uses it to compute a strategic location at which to start scanning for
>> pages to evict.  There's an interesting comment that indicates both a
>> "requirement" and an equivocal "but that's highly unlikely and wouldn't
>> be particularly harmful" statement conflicting with itself.  I tried to
>> find a reason that nextVictimBuffers could overflow or that the use of
>> the completePasses value could somehow cause harm if off by one or more
>> in the bgwriter and either I missed it (please tell me) or there isn't
>> one.  However, it does make sense to change completePasses into an atomic
>> value so that it is consistent across backends and in the bgwriter.
>> 
>> bgwprocno: when not -1 is the PID of the allocation notification latch
>> (ProcGlobal->allProcs[bgwprocno].procLatch).  This is a "power savings"
>> feature where the goal is to signal the bgwriter "When a backend starts
>> using buffers again, it will wake us up by setting our latch."  Here the
>> code reads, "Since we don't want to rely on a spinlock for this we force
>> a read from shared memory once, and then set the latch based on that
>> value." and uses INT_ACCESS_ONCE() to read the value and set the latch.
>> The function StrategyNotifyBgWriter() is where bgwprocno is set, I see no
>> reason to use atomic or other synchronization here.
>> 
>> And that's all there is to it now that the freelist is gone.  As a
>> result, IMO it seems unnecessary to require a spin lock for access to
>> BufferStrategyControl.
>
> I haven't followed your line of reasoning closely yet, but I typically
> recommend that patches that replace locks with atomics use functions with
> full barrier semantics (e.g., pg_atomic_read_membarrier_u32(),
> pg_atomic_fetch_add_u32()) to make things easier to reason about.  But that
> might not be as straightforward in cases like StrategySyncStart() where we
> atomically retrieve two values that are used together.  Nevertheless,
> minimizing cognitive load might be nice, and there's a chance it doesn't
> impact performance very much.

Good thought.  I'll review carefully and see if I can either explain here solid reasons I believe that they don't need
full barrier semantics or change the patch accordingly.
 

again, thank you for your time, best.

-greg

> [0] https://www.postgresql.org/docs/devel/pgprewarm.html
>
> -- 
> nathan



Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Greg Burd
Date:
On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
> Hi,
>
> On 2025-07-11 13:26:53 -0400, Greg Burd wrote:
>> In conversations [1] recently about considering how best to adapt the 
>> code
>> to become NUMA-aware Andres commented, "FWIW, I've started to wonder 
>> if we
>> shouldn't just get rid of the freelist entirely" and because I'm a 
>> glutton
>> for punishment (and I think this idea has some merit) I took him up 
>> on this
>> task.
>
>> In freelist.c the function StrategyGetBuffer() currently tries first 
>> to use
>> the BufferAccessStrategy, if present, via GetBufferFromRing().  Failing
>> that, the second step is to check the "freelist" as defined by
>> StrategyControl->firstFreeBuffer/lastFreeBuffer to determine if it 
>> contains
>> any available buffers, and finally it will "Use the clock-sweep 
>> algorithm to
>> find a free buffer."  The freelist was intended to be "a list of buffers
>> that are prime candidates for replacement" but I question the value 
>> of that
>> versus the overhead of managing it.  Without the list some operations are
>> likely (I plan to measure this) faster due, as Anders points out in [1],
>> "just using clock sweep actually makes things like DROP TABLE perform 
>> better
>> because it doesn't need to maintain the freelist anymore."  It may be the
>> case that with very large NBuffers where most are in use that this 
>> approach
>> is slower (I plan to measure this too), but in those cases I'd 
>> imagine the
>> freelist is likely empty and so the code will use the clock-sweep 
>> algorithm
>> anyway, so I'm not sure there is a performance penalty at all.
>>
>> This change does remove the have_free_buffer() function used by the
>> contrib/pg_prewarm module.  On the surface this doesn't seem to cause any
>> issues, but honestly I've not thought too deeply on this one.

Thank you for spending time reviewing my proposal and patch!  I value 
your time
and insights and in this case your inspiration. :)

> I think we'll likely need something to replace it.

Fair, this (v5) patch doesn't yet try to address this.

> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
> right. The goal of the use have_free_buffer() is obviously to stop 
> prewarming
> shared buffers if doing so would just evict buffers.  But it's not 
> clear to me
> that we should just stop when there aren't any free buffers - what if the
> previous buffer contents aren't the right ones?  It'd make more sense 
> to me to
> stop autoprewarm once NBuffers have been prewarmed...

I had the same high level reaction, that autoprewarm was leveraging 
something
convenient but not necessarily required or even correct.  I'd considered 
using
NBuffers as you describe due to similar intuitions, I'll dig into that 
idea for
the next revision after I get to know autoprewarm a bit better.

>> v2-0001 Eliminate the freelist from the buffer manager and depend on 
>> clock-sweep.
>>
>>
>> Once removed [2] and tests passing [3] I took a long hard look at the
>> buffer_strategy_lock that used to serialize concurrent access to 
>> members of
>> BufferStrategyControl and I couldn't find a good reason to keep it 
>> around.
>> Let's review what it is guarding:
>>
>> completePasses: a count of the number of times the clock-sweep hand wraps
>> around.  StrategySyncStart() provides this to the bgwriter which in turn
>> uses it to compute a strategic location at which to start scanning 
>> for pages
>> to evict.  There's an interesting comment that indicates both a
>> "requirement" and an equivocal "but that's highly unlikely and 
>> wouldn't be
>> particularly harmful" statement conflicting with itself.  I tried to 
>> find a
>> reason that nextVictimBuffers could overflow or that the use of the
>> completePasses value could somehow cause harm if off by one or more 
>> in the
>> bgwriter and either I missed it (please tell me) or there isn't one.
>> However, it does make sense to change completePasses into an atomic 
>> value so
>> that it is consistent across backends and in the bgwriter.
>
> I don't think it's *quite* that easy. If you just maintain 
> nextVictimBuffer
> and completePasses as separate atomic counters, without a lock making 
> the two
> consistent, StrategySyncStart(), as coded right now / in your patch, won't
> necessarily return reasonable values for the two.  Which I think would 
> lead to
> bgwriter suddenly becoming overly active for one cycle and then very 
> inactive
> for a while after.

Keeping them separate as atomics still requires coordination that I 
think is unnecessary.  I spent some time and came up with a working 
version [1] where
the two values (nextVictimBuffer and completePasses) were in a singular 
uint64
atomic, but as soon as I had it working I didn't like the idea at all.

> With really large shared buffers that'd be rather rare to be hit, but with
> small shared buffers I think it'd be common enough to worry.

Agreed, not coordinating these values isn't a viable solution.

> The most obvious way around this would be to make the clock hand a 64bit
> atomic, which would avoid the need to have a separate tracker for the 
> number
> of passes.  Unfortunately doing so would require doing a modulo 
> operation each
> clock tick, which I think would likely be too expensive on common 
> platforms -
> on small shared_buffers I actually see existing, relatively rarely 
> reached,
> modulo in ClockSweepTick() show up on a Skylake-X system.

So, this idea came back to me today as I tossed out the union branch and 
started
over.

a) can't require a power of 2 for NBuffers
b) would like a power of 2 for NBuffers to make a few things more efficient
c) a simple uint64 atomic counter would simplify things

The attached (v5) patch takes this approach *and* avoids the modulo you were
concerned with.  My approach is to have nextVictimBuffer as a uint64 
that only
increments (and at some point 200 years or so might wrap around, but I 
digress).
To get the actual "victim" you modulo that, but not with "%" you call
clock_modulo().  In that function I use a "next power of 2" value rather 
than
NBuffers to efficiently find the modulo and adjust for the actual 
value.  Same
for completePasses which is now a function clock_passes() that does similar
trickery and returns the number of times the counter (nextVictimBuffer) has
"wrapped" around modulo NBuffers.

Now that both values exist in the same uint64 it can be the atomic 
vessel that coordinates them, no synchronization problems at all and no 
requirement for
the buffer_strategy_lock.
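The mapping from a single monotonically increasing 64-bit hand to a (victim, passes) pair can be sketched as below. The naive bodies use `%` and `/` for clarity; the v5 patch reportedly replaces these with cheaper arithmetic based on the next power of two >= NBuffers. The function names clock_modulo() and clock_passes() come from the patch, but these bodies are the straightforward versions, not the patch's.

```c
#include <stdint.h>

static uint64_t nbuffers = 16384;   /* stand-in for NBuffers */

/*
 * With one monotonically increasing 64-bit counter, the victim slot and
 * the number of completed passes are derived from the same value, so a
 * reader can never observe them out of sync.
 */
static uint32_t
clock_modulo(uint64_t hand)
{
    return (uint32_t) (hand % nbuffers);    /* current victim slot */
}

static uint64_t
clock_passes(uint64_t hand)
{
    return hand / nbuffers;                 /* completed sweeps so far */
}
```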

> It'd be easier if we could rely on NBuffers to be a power of two, but that
> doesn't seem like a realistic requirement.

Yes, this was the good idea I ran with it in a slightly more creative 
way that
doesn't require users to set NBuffers to a power of 2.

> I think the easiest way here would be to make StrategySyncStart(), a
> relatively rare operation, retry whenever it detects that it saw 
> out-of-sync
> nextVictimBuffer. E.g. something roughly like
>
> while (true)
> {
>        completePasses = atomic_read_u32(&StrategyControl->completePasses);
>        pg_memory_barrier();
>        nextVictimBuffer = 
> atomic_read_u32(&StrategyControl->nextVictimBuffer);
>        pg_memory_barrier();
>
>        if (completePasses ==
> atomic_read_u32(&StrategyControl->completePasses) &&
>            nextVictimBuffer <=
> atomic_read_u32(&StrategyControl->nextVictimBuffer))
>           break;
> }
>
> which I think would detect the case that we read a nextVictimBuffer
> value that was
> decreased while reading a completePasses value that was increased.

Okay, interesting idea.  If you dislike this approach I'll circle back and
consider it.

> I think while at it, we should make ClockSweepTick() decrement
> nextVictimBuffer by atomically subtracting NBuffers, rather than using 
> CAS. I
> recently noticed that the CAS sometimes has to retry a fair number of
> times,
> which in turn makes the `victim % NBuffers` show up in profiles.

In my (v5) patch there is one CAS that increments NBuffers.  All other
operations on NBuffers are atomic reads.  The modulo you mention is gone
entirely, unnecessary AFAICT.

>> I'd like to hear ideas for worst cases to test and/or benchmark.  I 
>> plan on
>> attempting what I saw Michael do with flamegraphs before/after/delta in a
>> follow up to this.  If people feel this has merit I'll add it to CF/2.
>
> What I've benchmarked is both single threaded and concurrent clock 
> sweep, by
> doing pg_prewarm() of per-pgbench-client relations. I used
>
> c=40 && psql -Xq -c "select pg_buffercache_evict_all()" -c 'SELECT
> numa_node, sum(size), count(*) FROM pg_shmem_allocations_numa WHERE
> size != 0 GROUP BY numa_node;' && pgbench -n -P1 -c$c -j$c -f <(echo
> "SELECT pg_prewarm('copytest_:client_id');") -t1
>
> (with c=1 for the single-threaded case obviously)
>
> The reason for the pg_shmem_allocations_numa is to ensure that 
> shared_buffers
> is actually mapped, as otherwise the bottleneck will be the kernel 
> zeroing out
> buffers.
>
> The reason for doing -t1 is that I wanted to compare freelist vs clock 
> sweep,
> rather than clock sweep in general.
>
> Note that I patched EvictUnpinnedBufferInternal() to call
> StrategyFreeBuffer(), otherwise running this a second time won't actually
> measure the freelist.  And the first time run after postmaster start will
> always be more noisy...

This is very helpful, thanks! I've started doing some of this but I was
anxious to get this out before the weekend.  I'll work on the prewarm module
and get some benchmarks done next week.

Meanwhile, the tests except for Windows pass [2] for this new patch 
[3].  I'll
dig into the Windows issues next week as well.

> Greetings,
>
> Andres Freund

I'm very curious to hear back your thoughts (or anyone else) on this
approach.

best,

-greg


[1] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v4
[2] https://github.com/gburd/postgres/pull/6/checks
[3] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v5

Attachments

Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Andres Freund
Date:
Hi,

I'd be curious if anybody wants to argue for keeping the freelist. Except
for the have_free_buffer() use in autoprewarm, it's a rather trivial
patch. And I really couldn't measure regressions above the noise level, even
in absurdly extreme use cases.


On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
> > I think we'll likely need something to replace it.
>
> Fair, this (v5) patch doesn't yet try to address this.
>
> > TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
> > right. The goal of the use have_free_buffer() is obviously to stop
> > prewarming
> > shared buffers if doing so would just evict buffers.  But it's not clear
> > to me
> > that we should just stop when there aren't any free buffers - what if the
> > previous buffer contents aren't the right ones?  It'd make more sense to
> > me to
> > stop autoprewarm once NBuffers have been prewarmed...
>
> I had the same high level reaction, that autoprewarm was leveraging
> something
> convenient but not necessarily required or even correct.  I'd considered
> using
> NBuffers as you describe due to similar intuitions, I'll dig into that idea
> for
> the next revision after I get to know autoprewarm a bit better.

Cool. I do think that'll be good enough.



> > The most obvious way around this would be to make the clock hand a 64bit
> > atomic, which would avoid the need to have a separate tracker for the
> > number
> > of passes.  Unfortunately doing so would require doing a modulo
> > operation each
> > clock tick, which I think would likely be too expensive on common
> > platforms -
> > on small shared_buffers I actually see existing, relatively rarely
> > reached,
> > modulo in ClockSweepTick() show up on a Skylake-X system.
>
> So, this idea came back to me today as I tossed out the union branch and
> started
> over.
>
> a) can't require a power of 2 for NBuffers
> b) would like a power of 2 for NBuffers to make a few things more efficient
> c) a simple uint64 atomic counter would simplify things
>
> The attached (v5) patch takes this approach *and* avoids the modulo you were
> concerned with.  My approach is to have nextVictimBuffer as a uint64 that
> only
> increments (and at some point 200 years or so might wrap around, but I
> digress).
> To get the actual "victim" you modulo that, but not with "%" you call
> clock_modulo().  In that function I use a "next power of 2" value rather
> than
> NBuffers to efficiently find the modulo and adjust for the actual value.
> Same
> for completePasses which is now a function clock_passes() that does similar
> trickery and returns the number of times the counter (nextVictimBuffer) has
> "wrapped" around modulo NBuffers.

Yea, that could work!  It'd be interesting to see some performance numbers for
this...
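As a concrete, heavily simplified sketch of the mapping described above, assuming illustrative values for NBuffers and the mask; this is not the patch's code, just the shape of the idea:

```c
#include <stdint.h>

/* Illustrative values only; the real ones derive from shared_buffers. */
static const uint64_t NBuffers = 100;         /* deliberately not a power of 2 */
static const uint64_t NBuffersPow2Mask = 127; /* next_power_of_2(NBuffers) - 1 */

/*
 * Map the ever-increasing hand onto [0, NBuffers) without a '%' instruction:
 * mask by the next power of two, then fold the overhang back down.
 */
static uint64_t
clock_modulo(uint64_t counter)
{
    uint64_t r = counter & NBuffersPow2Mask;

    return (r < NBuffers) ? r : r - NBuffers;
}
```

The result always lands in [0, NBuffers), which is the property the sweep needs; clock_passes() would recover the pass count from the same counter.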


> Now that both values exist in the same uint64 it can be the atomic vessel
> that coordinates them, no synchronization problems at all and no requirement
> for the buffer_strategy_lock.

Nice!



> > I think while at it, we should make ClockSweepTick() decrement
> > nextVictimBuffer by atomically subtracting NBuffers, rather than using
> > CAS. I
> > recently noticed that the CAS sometimes has to retry a fair number of
> > times,
> > which in turn makes the `victim % NBuffers` show up in profiles.
>
> In my (v5) patch there is one CAS that increments NBuffers.  All other
> operations on NBuffers are atomic reads.  The modulo you mention is gone
> entirely, unnecessary AFAICT.

There shouldn't be any CASes needed now, right? Just a fetch-add? The latter
often scales *way* better under contention.

[Looks at the patch ...]

Which I think is true in your patch, I don't see any CAS.
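For contrast, a standalone sketch of the two approaches, using C11 atomics rather than PostgreSQL's pg_atomic API; names are illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t nextVictimBuffer; /* the ever-increasing clock hand */

/* CAS loop: under contention each thread may retry many times. */
static uint64_t
tick_cas(void)
{
    uint64_t old = atomic_load(&nextVictimBuffer);

    while (!atomic_compare_exchange_weak(&nextVictimBuffer, &old, old + 1))
        ;                       /* 'old' is refreshed on each failure */
    return old;
}

/* Fetch-add: a single hardware read-modify-write, no retry loop. */
static uint64_t
tick_fetch_add(void)
{
    return atomic_fetch_add(&nextVictimBuffer, 1);
}
```

Both return the pre-increment value; the fetch-add variant is the one that scales under contention.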



> Meanwhile, the tests except for Windows pass [2] for this new patch [3].
> I'll dig into the Windows issues next week as well.

FWIW, there are backtraces generated on Windows. E.g.


https://api.cirrus-ci.com/v1/artifact/task/6327899394932736/crashlog/crashlog-postgres.exe_00c0_2025-07-17_19-19-00-008.txt

000000cd`827fdea0 00007ff7`6ad82f88     ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
000000cd`827fdee0 00007ff7`6aae2b7c     postgres!ExceptionalCondition(
            char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
            char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
            int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
000000cd`827fdf20 00007ff7`6aae272c     postgres!clock_modulo(
            unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
000000cd`827fdf60 00007ff7`6aad8647     postgres!StrategySyncStart(
            unsigned int * complete_passes = 0x000000cd`827fdfc0,
            unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
000000cd`827fdfa0 00007ff7`6aa254a3     postgres!BgBufferSync(
            struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
000000cd`827fe030 00007ff7`6aa278a7     postgres!BackgroundWriterMain(
            void * startup_data = 0x00000000`00000000,
            unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
000000cd`827ff5a0 00007ff7`6a8daf19     postgres!SubPostmasterMain(
            int argc = 0n3,
            char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
000000cd`827ff620 00007ff7`6af0f5a9     postgres!main(
            int argc = 0n3,
            char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]

I.e. your new assertion failed for some reason that I can't *immediately* see.



> @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
>
>      /*
>       * Compute strategy_delta = how many buffers have been scanned by the
> -     * clock sweep since last time.  If first time through, assume none. Then
> -     * see if we are still ahead of the clock sweep, and if so, how many
> +     * clock-sweep since last time.  If first time through, assume none. Then
> +     * see if we are still ahead of the clock-sweep, and if so, how many
>       * buffers we could scan before we'd catch up with it and "lap" it. Note:
>       * weird-looking coding of xxx_passes comparisons are to avoid bogus
>       * behavior when the passes counts wrap around.
>       */
>      if (saved_info_valid)
>      {
> -        int32        passes_delta = strategy_passes - prev_strategy_passes;
> +        int32        passes_delta;
> +
> +        if (unlikely(prev_strategy_passes > strategy_passes))
> +        {
> +            /* wrap-around case */
> +            passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
> +        }
> +        else
> +        {
> +            passes_delta = (int32) (strategy_passes - prev_strategy_passes);
> +        }
>
>          strategy_delta = strategy_buf_id - prev_strategy_buf_id;
>          strategy_delta += (long) passes_delta * NBuffers;

That seems somewhat independent of the rest of the change, or am I missing something?


> +static uint32 NBuffersPow2;        /* NBuffers rounded up to the next power of 2 */
> +static uint32 NBuffersPow2Shift;    /* Amount to bitshift NBuffers for
> +                                     * division */

For performance in ClockSweepTick() it might make more sense to store the mask
(i.e. NBuffersPow2 - 1), rather than the actual power of two.
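Both values are cheap to precompute once at startup; a sketch with assumed names:

```c
#include <stdint.h>

static uint64_t NBuffersPow2Mask; /* next_power_of_2(NBuffers) - 1 */

/* Round n up to the next power of two (n >= 1). */
static uint64_t
next_power_of_2(uint64_t n)
{
    uint64_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Run once at startup; ClockSweepTick() then only ever needs the mask. */
static void
init_clock_constants(uint64_t nbuffers)
{
    NBuffersPow2Mask = next_power_of_2(nbuffers) - 1;
}
```

Storing the mask rather than the power of two saves a subtraction on every tick, as suggested.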

Greetings,

Andres Freund



Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Greg Burd
Date:
On 7/18/25 13:03, Andres Freund wrote:
> Hi,

Hello.  Thanks again for taking the time to review the email and patch,
I think we're onto something good here.

>
> I'd be curious if anybody wants to argue for keeping the clock sweep. Except
> for the have_free_buffer() use in autoprewarm, it's a rather trivial
> patch. And I really couldn't measure regressions above the noise level, even
> in absurdly extreme use cases.

Hmmm... was "argue for keeping the clock sweep" supposed to read "argue
for keeping the freelist"?

> On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
>> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
>>> I think we'll likely need something to replace it.
>> Fair, this (v5) patch doesn't yet try to address this.
>>
>>> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
>>> right. The goal of the use have_free_buffer() is obviously to stop
>>> prewarming
>>> shared buffers if doing so would just evict buffers.  But it's not clear
>>> to me
>>> that we should just stop when there aren't any free buffers - what if the
>>> previous buffer contents aren't the right ones?  It'd make more sense to
>>> me to
>>> stop autoprewarm once NBuffers have been prewarmed...
>> I had the same high level reaction, that autoprewarm was leveraging
>> something
>> convenient but not necessarily required or even correct.  I'd considered
>> using
>> NBuffers as you describe due to similar intuitions, I'll dig into that idea
>> for
>> the next revision after I get to know autoprewarm a bit better.
> Cool. I do think that'll be good enough.

I re-added the have_free_buffer() function, only now it returns false
once nextVictimBuffer > NBuffers, signaling to autoprewarm that the
clock has made its first complete pass.  With that I reverted my changes
in the autoprewarm module.  The net effect should be the same behavior
as before at startup when using that module.
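A minimal sketch of that reworked check, with assumed names and C11 atomics standing in for PostgreSQL's atomics API:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint64_t nextVictimBuffer; /* ever-increasing clock hand */
static const uint64_t NBuffers = 16384;   /* assumed value */

/*
 * Report "free" buffers only until the hand has completed its first
 * full pass over the buffer pool.
 */
static bool
have_free_buffer(void)
{
    return atomic_load(&nextVictimBuffer) < NBuffers;
}
```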

>>> The most obvious way around this would be to make the clock hand a 64bit
>>> atomic, which would avoid the need to have a separate tracker for the
>>> number
>>> of passes.  Unfortunately doing so would require doing a modulo
>>> operation each
>>> clock tick, which I think would likely be too expensive on common
>>> platforms -
>>> on small shared_buffers I actually see existing, relatively rarely
>>> reached,
>>> modulo in ClockSweepTick() show up on a Skylake-X system.
>> So, this idea came back to me today as I tossed out the union branch and
>> started
>> over.
>>
>> a) can't require a power of 2 for NBuffers
>> b) would like a power of 2 for NBuffers to make a few things more efficient
>> c) a simple uint64 atomic counter would simplify things
>>
>> The attached (v5) patch takes this approach *and* avoids the modulo you were
>> concerned with.  My approach is to have nextVictimBuffer as a uint64 that
>> only
>> increments (and at some point 200 years or so might wrap around, but I
>> digress).
>> To get the actual "victim" you modulo that, but not with "%" you call
>> clock_modulo().  In that function I use a "next power of 2" value rather
>> than
>> NBuffers to efficiently find the modulo and adjust for the actual value.
>> Same
>> for completePasses which is now a function clock_passes() that does similar
>> trickery and returns the number of times the counter (nextVictimBuffer) has
>> "wrapped" around modulo NBuffers.
> Yea, that could work!  It'd be interesting to see some performance numbers for
> this...

Still no performance comparisons yet, but my gut says this should reduce
contention across cores on a very hot path so I'd imagine some
performance improvement.

>> Now that both values exist in the same uint64 it can be the atomic vessel
>> that coordinates them, no synchronization problems at all and no requirement
>> for the buffer_strategy_lock.
> Nice!
>
>
>>> I think while at it, we should make ClockSweepTick() decrement
>>> nextVictimBuffer by atomically subtracting NBuffers, rather than using
>>> CAS. I
>>> recently noticed that the CAS sometimes has to retry a fair number of
>>> times,
>>> which in turn makes the `victim % NBuffers` show up in profiles.
>> In my (v5) patch there is one CAS that increments NBuffers.  All other
>> operations on NBuffers are atomic reads.  The modulo you mention is gone
>> entirely, unnecessary AFAICT.
> There shouldn't be any CASes needed now, right? Just a fetch-add? The latter
> often scales *way* better under contention.
>
> [Looks at the patch ...]
>
> Which I think is true in your patch, I don't see any CAS.

You are correct, no CAS at all anymore, just a mental mistake in my last
email.  Now there are only atomic reads and a single atomic fetch-add in
ClockSweepTick().

>> Meanwhile, the tests except for Windows pass [2] for this new patch [3].
>> I'll dig into the Windows issues next week as well.
> FWIW, there are backtraces generated on windows. E.g.
>
>
> https://api.cirrus-ci.com/v1/artifact/task/6327899394932736/crashlog/crashlog-postgres.exe_00c0_2025-07-17_19-19-00-008.txt
>
> 000000cd`827fdea0 00007ff7`6ad82f88     ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
> 000000cd`827fdee0 00007ff7`6aae2b7c     postgres!ExceptionalCondition(
>             char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
>             char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
>             int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
> 000000cd`827fdf20 00007ff7`6aae272c     postgres!clock_modulo(
>             unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
> 000000cd`827fdf60 00007ff7`6aad8647     postgres!StrategySyncStart(
>             unsigned int * complete_passes = 0x000000cd`827fdfc0,
>             unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
> 000000cd`827fdfa0 00007ff7`6aa254a3     postgres!BgBufferSync(
>             struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
> 000000cd`827fe030 00007ff7`6aa278a7     postgres!BackgroundWriterMain(
>             void * startup_data = 0x00000000`00000000,
>             unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
> 000000cd`827ff5a0 00007ff7`6a8daf19     postgres!SubPostmasterMain(
>             int argc = 0n3,
>             char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
> 000000cd`827ff620 00007ff7`6af0f5a9     postgres!main(
>             int argc = 0n3,
>             char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
>
> I.e. your new assertion failed for some reason that I can't *immediately* see.
I put that assertion in as a precaution and as a way to communicate the
intention of the other code above it.  I never imagined it would fire.
I've changed clock_read() to only assert when the modulo differs, and
removed the assert in the calling ClockSweepTick() function because it
was redundant; I'm curious to see if we see a similar assert when
testing the modulo.
>> @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
>>
>>      /*
>>       * Compute strategy_delta = how many buffers have been scanned by the
>> -     * clock sweep since last time.  If first time through, assume none. Then
>> -     * see if we are still ahead of the clock sweep, and if so, how many
>> +     * clock-sweep since last time.  If first time through, assume none. Then
>> +     * see if we are still ahead of the clock-sweep, and if so, how many
>>       * buffers we could scan before we'd catch up with it and "lap" it. Note:
>>       * weird-looking coding of xxx_passes comparisons are to avoid bogus
>>       * behavior when the passes counts wrap around.
>>       */
>>      if (saved_info_valid)
>>      {
>> -        int32        passes_delta = strategy_passes - prev_strategy_passes;
>> +        int32        passes_delta;
>> +
>> +        if (unlikely(prev_strategy_passes > strategy_passes))
>> +        {
>> +            /* wrap-around case */
>> +            passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
>> +        }
>> +        else
>> +        {
>> +            passes_delta = (int32) (strategy_passes - prev_strategy_passes);
>> +        }
>>
>>          strategy_delta = strategy_buf_id - prev_strategy_buf_id;
>>          strategy_delta += (long) passes_delta * NBuffers;
> That seems somewhat independent of the rest of the change, or am I missing something?

That change is there to cover the possibility of someone managing to
overflow and wrap a uint64 which is *highly* unlikely.  If this degree
of paranoia isn't required I'm happy to remove it.

>> +static uint32 NBuffersPow2;        /* NBuffers rounded up to the next power of 2 */
>> +static uint32 NBuffersPow2Shift;    /* Amount to bitshift NBuffers for
>> +                                     * division */
> For performance in ClockSweepTick() it might make more sense to store the mask
> (i.e. NBuffersPow2 - 1), rather than the actual power of two.
Agreed, I've done that and created one more calculated value that could
be pre-computed once and never again (unless NBuffers changes) at runtime.
> Greetings,
>
> Andres Freund


thanks again for the review, v6 attached and rebased onto afa5c365ec5,
also on GitHub at [1][2].


-greg


[1] https://github.com/gburd/postgres/pull/7/checks

[2] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v6

Attachments

Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Andres Freund
Date:
Hi,

On 2025-07-21 13:37:04 -0400, Greg Burd wrote:
> On 7/18/25 13:03, Andres Freund wrote:
> Hello.  Thanks again for taking the time to review the email and patch,
> I think we're onto something good here.
> 
> >
> > I'd be curious if anybody wants to argue for keeping the clock sweep. Except
> > for the have_free_buffer() use in autoprewarm, it's a rather trivial
> > patch. And I really couldn't measure regressions above the noise level, even
> > in absurdly extreme use cases.
> 
> Hmmm... was "argue for keeping the clock sweep" supposed to read "argue
> for keeping the freelist"?

Err, yes :(


> > On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
> >> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
> >>> I think we'll likely need something to replace it.
> >> Fair, this (v5) patch doesn't yet try to address this.
> >>
> >>> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
> >>> right. The goal of the use have_free_buffer() is obviously to stop
> >>> prewarming
> >>> shared buffers if doing so would just evict buffers.  But it's not clear
> >>> to me
> >>> that we should just stop when there aren't any free buffers - what if the
> >>> previous buffer contents aren't the right ones?  It'd make more sense to
> >>> me to
> >>> stop autoprewarm once NBuffers have been prewarmed...
> >> I had the same high level reaction, that autoprewarm was leveraging
> >> something
> >> convenient but not necessarily required or even correct.  I'd considered
> >> using
> >> NBuffers as you describe due to similar intuitions, I'll dig into that idea
> >> for
> >> the next revision after I get to know autoprewarm a bit better.
> > Cool. I do think that'll be good enough.
> 
> I re-added the have_free_buffer() function only now it returns false
> once nextVictimBuffer > NBuffers signaling to autoprewarm that the clock
> has made its first complete pass.  With that I reverted my changes in
> the autoprewarm module.  The net should be the same behavior as before
> at startup when using that module.

I don't think we should have a have_free_buffer() that doesn't actually test
whether we have a free buffer, that seems too likely to cause
misunderstandings down the line.  What if we instead just limit the amount of
buffers we load in apw_load_buffers()? apw_load_buffers() knows NBuffers and
the number of to-be-loaded buffers, so that shouldn't be hard.
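The cap itself would be small; a sketch with hypothetical names (the real change would simply live inside apw_load_buffers()):

```c
#include <stdint.h>

/*
 * Clamp the number of to-be-prewarmed buffers at NBuffers, per the
 * suggestion above. Hypothetical helper, not the actual autoprewarm code.
 */
static int64_t
cap_buffers_to_load(int64_t num_elements, int64_t nbuffers)
{
    return (num_elements < nbuffers) ? num_elements : nbuffers;
}
```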


> >> Meanwhile, the tests except for Windows pass [2] for this new patch [3].
> >> I'll dig into the Windows issues next week as well.
> > FWIW, there are backtraces generated on windows. E.g.
> >
> >
> > https://api.cirrus-ci.com/v1/artifact/task/6327899394932736/crashlog/crashlog-postgres.exe_00c0_2025-07-17_19-19-00-008.txt
> >
> > 000000cd`827fdea0 00007ff7`6ad82f88     ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
> > 000000cd`827fdee0 00007ff7`6aae2b7c     postgres!ExceptionalCondition(
> >             char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
> >             char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
> >             int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
> > 000000cd`827fdf20 00007ff7`6aae272c     postgres!clock_modulo(
> >             unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
> > 000000cd`827fdf60 00007ff7`6aad8647     postgres!StrategySyncStart(
> >             unsigned int * complete_passes = 0x000000cd`827fdfc0,
> >             unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
> > 000000cd`827fdfa0 00007ff7`6aa254a3     postgres!BgBufferSync(
> >             struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
> > 000000cd`827fe030 00007ff7`6aa278a7     postgres!BackgroundWriterMain(
> >             void * startup_data = 0x00000000`00000000,
> >             unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
> > 000000cd`827ff5a0 00007ff7`6a8daf19     postgres!SubPostmasterMain(
> >             int argc = 0n3,
> >             char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
> > 000000cd`827ff620 00007ff7`6af0f5a9     postgres!main(
> >             int argc = 0n3,
> >             char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
> >
> > I.e. your new assertion failed for some reason that I can't *immediately* see.
> I put that in as a precaution and as a way to communicate the intention
> of the other code above it.  I never imagined it would assert.  I've
> changed clock_read() to only assert when the modulo differs and left
> that assert in the calling ClockSweepTick() function because it was
> redundant and I'm curious to see if we see a similar assert when testing
> the modulo.

Do you understand why it triggered? Because I don't immediately. The fact that
it triggered only on windows, where the compiler is rather different, makes it
worth understanding imo.


> >> @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
> >>
> >>      /*
> >>       * Compute strategy_delta = how many buffers have been scanned by the
> >> -     * clock sweep since last time.  If first time through, assume none. Then
> >> -     * see if we are still ahead of the clock sweep, and if so, how many
> >> +     * clock-sweep since last time.  If first time through, assume none. Then
> >> +     * see if we are still ahead of the clock-sweep, and if so, how many
> >>       * buffers we could scan before we'd catch up with it and "lap" it. Note:
> >>       * weird-looking coding of xxx_passes comparisons are to avoid bogus
> >>       * behavior when the passes counts wrap around.
> >>       */
> >>      if (saved_info_valid)
> >>      {
> >> -        int32        passes_delta = strategy_passes - prev_strategy_passes;
> >> +        int32        passes_delta;
> >> +
> >> +        if (unlikely(prev_strategy_passes > strategy_passes))
> >> +        {
> >> +            /* wrap-around case */
> >> +            passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
> >> +        }
> >> +        else
> >> +        {
> >> +            passes_delta = (int32) (strategy_passes - prev_strategy_passes);
> >> +        }
> >>
> >>          strategy_delta = strategy_buf_id - prev_strategy_buf_id;
> >>          strategy_delta += (long) passes_delta * NBuffers;
> > That seems somewhat independent of the rest of the change, or am I missing something?
> 
> That change is there to cover the possibility of someone managing to
> overflow and wrap a uint64 which is *highly* unlikely.

That risk existed previously too - I'm not against shoring things up, I'd just
do it in a precursor commit, to make this easier to review.

> If this degree of paranoia isn't required I'm happy to remove it.

That does indeed seem really unlikely.  Assuming that postgres stays up for 10
years without a single restart, it'd be ~59 billion ticks a second.

I don't mind a defense, but I think we'd be better off putting it into
ClockSweepTick() or such, simply erroring out if we ever hit this. It's
unlikely that we'd get (and keep) all the relevant untested code correct ime.
Then we also can assert that prev_strategy_passes <= strategy_passes.

Greetings,

Andres Freund



Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Greg Burd
Date:
On 7/21/25 14:35, Andres Freund wrote:
> Hi,
>
> On 2025-07-21 13:37:04 -0400, Greg Burd wrote:
>> On 7/18/25 13:03, Andres Freund wrote:
>> Hello.  Thanks again for taking the time to review the email and patch,
>> I think we're onto something good here.
>>
>>> I'd be curious if anybody wants to argue for keeping the clock sweep. Except
>>> for the have_free_buffer() use in autoprewarm, it's a rather trivial
>>> patch. And I really couldn't measure regressions above the noise level, even
>>>> in absurdly extreme use cases.
>> Hmmm... was "argue for keeping the clock sweep" supposed to read "argue
>> for keeping the freelist"?
> Err, yes :(

Phew.  :)  No worries.

>>> On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
>>>> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
>>>>> I think we'll likely need something to replace it.
>>>> Fair, this (v5) patch doesn't yet try to address this.
>>>>
>>>>> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
>>>>> right. The goal of the use have_free_buffer() is obviously to stop
>>>>> prewarming
>>>>> shared buffers if doing so would just evict buffers.  But it's not clear
>>>>> to me
>>>>> that we should just stop when there aren't any free buffers - what if the
>>>>> previous buffer contents aren't the right ones?  It'd make more sense to
>>>>> me to
>>>>> stop autoprewarm once NBuffers have been prewarmed...
>>>> I had the same high level reaction, that autoprewarm was leveraging
>>>> something
>>>> convenient but not necessarily required or even correct.  I'd considered
>>>> using
>>>> NBuffers as you describe due to similar intuitions, I'll dig into that idea
>>>> for
>>>> the next revision after I get to know autoprewarm a bit better.
>>> Cool. I do think that'll be good enough.
>> I re-added the have_free_buffer() function only now it returns false
>> once nextVictimBuffer > NBuffers signaling to autoprewarm that the clock
>> has made its first complete pass.  With that I reverted my changes in
>> the autoprewarm module.  The net should be the same behavior as before
>> at startup when using that module.
> I don't think we should have a have_free_buffer() that doesn't actually test
> whether we have a free buffer, that seems too likely to cause
> misunderstandings down the line.  What if we instead just limit the amount of
> buffers we load in apw_load_buffers()? apw_load_buffers() knows NBuffers and
> the number of to-be-loaded buffers, so that shouldn't be hard.

I'm glad you said that, I wasn't thrilled with that either and I'm not
sure why I didn't just correct for that in the last patch set.  I'm now
capping num_elements to NBuffers at most.

>>>> Meanwhile, the tests except for Windows pass [2] for this new patch [3].
>>>> I'll dig into the Windows issues next week as well.
>>> FWIW, there are backtraces generated on windows. E.g.
>>>
>>>
>>> https://api.cirrus-ci.com/v1/artifact/task/6327899394932736/crashlog/crashlog-postgres.exe_00c0_2025-07-17_19-19-00-008.txt
>>>
>>> 000000cd`827fdea0 00007ff7`6ad82f88     ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
>>> 000000cd`827fdee0 00007ff7`6aae2b7c     postgres!ExceptionalCondition(
>>>             char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
>>>             char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
>>>             int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
>>> 000000cd`827fdf20 00007ff7`6aae272c     postgres!clock_modulo(
>>>             unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
>>> 000000cd`827fdf60 00007ff7`6aad8647     postgres!StrategySyncStart(
>>>             unsigned int * complete_passes = 0x000000cd`827fdfc0,
>>>             unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
>>> 000000cd`827fdfa0 00007ff7`6aa254a3     postgres!BgBufferSync(
>>>             struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
>>> 000000cd`827fe030 00007ff7`6aa278a7     postgres!BackgroundWriterMain(
>>>             void * startup_data = 0x00000000`00000000,
>>>             unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
>>> 000000cd`827ff5a0 00007ff7`6a8daf19     postgres!SubPostmasterMain(
>>>             int argc = 0n3,
>>>             char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
>>> 000000cd`827ff620 00007ff7`6af0f5a9     postgres!main(
>>>             int argc = 0n3,
>>>             char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
>>>
>>> I.e. your new assertion failed for some reason that i can't *immediately* see.
>> I put that in as a precaution and as a way to communicate the intention
>> of the other code above it.  I never imagined it would assert.  I've
>> changed clock_read() to only assert when the modulo differs and left
>> that assert in the calling ClockSweepTick() function because it was
>> redundant and I'm curious to see if we see a similar assert when testing
>> the modulo.
> Do you understand why it triggered? Because I don't immediately. The fact that
> it triggered only on windows, where the compiler is rather different, makes it
> worth understanding imo.

I dug into the ASM for both GCC 15.1 and MSVC 19.latest (thanks
godbolt.org!) for x86_64 and there was a critical difference.  It starts
with the fact that I'd used uint32 for my NBuffersPow2Mask rather than
uint64.  That then translates to two different compiled outputs for
clock_read() (was: clock_modulo()).

gcc-15.1 -O2

clock_read(unsigned long long):
        and     edi, DWORD PTR NBuffersPow2Mask[rip]
        mov     edx, DWORD PTR NBuffers[rip]
        mov     rax, rdi
        sub     rax, rdx
        cmp     rdi, rdx
        cmovb   rax, rdi
        ret

msvc-19.latest /O2

hand$ = 8
unsigned int clock_read(unsigned int64) PROC            ; clock_read, COMDAT
        mov     edx, ecx
        and     rdx, QWORD PTR unsigned int64 NBuffersPow2Mask ; NBuffersPow2Mask
        mov     ecx, DWORD PTR unsigned int NBuffers    ; NBuffers
        mov     eax, edx
        sub     eax, ecx
        cmp     rdx, rcx
        cmovb   eax, edx
        ret     0
unsigned int clock_read(unsigned __int64) ENDP          ; clock_read


Here's what I think was happening: the MSVC compiler produced assembly
for "hand & NBuffersPow2Mask" that uses "rdx QWORD" while GCC uses "edi
DWORD".  The 32-bit AND operation (edi) automatically zeros the upper 32
bits of rdi after performing the AND with the uint64 value of hand,
while "rdx QWORD" does not, potentially leaving some of the upper 32
bits set.  My guess is that on Windows, once the value of the clock hand
exceeded UINT32_MAX (which can happen in as little as 3 seconds in a
tight loop, but likely took longer in the test run), the bits not masked
out would inflate the resulting value, which would be > NBuffers and
also differ from the simple modulo calculation, causing the failed
assertion.

Changing the value of NBuffersPow2Mask to uint64 and using similarly
sized types in these functions more closely aligns the assembly code and
should fix this.

>>>> @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
>>>>
>>>>      /*
>>>>       * Compute strategy_delta = how many buffers have been scanned by the
>>>> -     * clock sweep since last time.  If first time through, assume none. Then
>>>> -     * see if we are still ahead of the clock sweep, and if so, how many
>>>> +     * clock-sweep since last time.  If first time through, assume none. Then
>>>> +     * see if we are still ahead of the clock-sweep, and if so, how many
>>>>       * buffers we could scan before we'd catch up with it and "lap" it. Note:
>>>>       * weird-looking coding of xxx_passes comparisons are to avoid bogus
>>>>       * behavior when the passes counts wrap around.
>>>>       */
>>>>      if (saved_info_valid)
>>>>      {
>>>> -        int32        passes_delta = strategy_passes - prev_strategy_passes;
>>>> +        int32        passes_delta;
>>>> +
>>>> +        if (unlikely(prev_strategy_passes > strategy_passes))
>>>> +        {
>>>> +            /* wrap-around case */
>>>> +            passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
>>>> +        }
>>>> +        else
>>>> +        {
>>>> +            passes_delta = (int32) (strategy_passes - prev_strategy_passes);
>>>> +        }
>>>>
>>>>          strategy_delta = strategy_buf_id - prev_strategy_buf_id;
>>>>          strategy_delta += (long) passes_delta * NBuffers;
>>> That seems somewhat independent of the rest of the change, or am I missing something?
>> That change is there to cover the possibility of someone managing to
>> overflow and wrap a uint64 which is *highly* unlikely.
> That risk existed previously too - I'm not against shoring things up, I'd just
> do it in a precursor commit, to make this easier to review.
>
>> If this degree of paranoia isn't required I'm happy to remove it.
> That does indeed seem really unlikely.  Assuming that postgres stays up for 10
> years without a single restart, it'd be ~59 billion ticks a second.
Agreed, it's overkill.
> I don't mind a defense, but I think we'd be better off putting it into
> ClockSweepTick() or such, simply erroring out if we ever hit this. It's
> unlikely that we'd get (and keep) all the relevant untested code correct ime.
> Then we also can assert that prev_strategy_passes <= strategy_passes.
Added assertions and comments to explain decision.
> Greetings,
>
> Andres Freund


Rebased onto ce6513e96a1; patch set v7 attached and available on GitHub
[1][2],


-greg


[1] https://github.com/gburd/postgres/pull/9

[2] https://github.com/gburd/postgres/tree/gregburd/rm-freelist/patch-v7

Attachments

Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From
Greg Burd
Date:
On 7/22/25 14:43, Greg Burd wrote:
> On 7/21/25 14:35, Andres Freund wrote:
>> Hi,
>>
>> On 2025-07-21 13:37:04 -0400, Greg Burd wrote:
>>> On 7/18/25 13:03, Andres Freund wrote:
>>> Hello.  Thanks again for taking the time to review the email and patch,
>>> I think we're onto something good here.
>>>
>>>> I'd be curious if anybody wants to argue for keeping the clock sweep. Except
>>>> for the have_free_buffer() use in autoprewarm, it's a rather trivial
>>>> patch. And I really couldn't measure regressions above the noise level, even
>>>> in absurdly extreme use cases.
>>> Hmmm... was "argue for keeping the clock sweep" supposed to read "argue
>>> for keeping the freelist"?
>> Err, yes :(
> Phew.  :)  No worries.
>
>>>> On 2025-07-17 14:35:13 -0400, Greg Burd wrote:
>>>>> On Fri, Jul 11, 2025, at 2:52 PM, Andres Freund wrote:
>>>>>> I think we'll likely need something to replace it.
>>>>> Fair, this (v5) patch doesn't yet try to address this.
>>>>>
>>>>>> TBH, I'm not convinced that autoprewarm using have_free_buffer() is quite
>>>>>> right. The goal of the use have_free_buffer() is obviously to stop
>>>>>> prewarming
>>>>>> shared buffers if doing so would just evict buffers.  But it's not clear
>>>>>> to me
>>>>>> that we should just stop when there aren't any free buffers - what if the
>>>>>> previous buffer contents aren't the right ones?  It'd make more sense to
>>>>>> me to
>>>>>> stop autoprewarm once NBuffers have been prewarmed...
>>>>> I had the same high level reaction, that autoprewarm was leveraging
>>>>> something
>>>>> convenient but not necessarily required or even correct.  I'd considered
>>>>> using
>>>>> NBuffers as you describe due to similar intuitions, I'll dig into that idea
>>>>> for
>>>>> the next revision after I get to know autoprewarm a bit better.
>>>> Cool. I do think that'll be good enough.
>>> I re-added the have_free_buffer() function only now it returns false
>>> once nextVictimBuffer > NBuffers signaling to autoprewarm that the clock
>>> has made its first complete pass.  With that I reverted my changes in
>>> the autoprewarm module.  The net should be the same behavior as before
>>> at startup when using that module.
>> I don't think we should have a have_free_buffer() that doesn't actually test
>> whether we have a free buffer, that seems too likely to cause
>> misunderstandings down the line.  What if we instead just limit the amount of
>> buffers we load in apw_load_buffers()? apw_load_buffers() knows NBuffers and
>> the number of to-be-loaded buffers, so that shouldn't be hard.
> I'm glad you said that, I wasn't thrilled with that either and I'm not
> sure why I didn't just correct for that in the last patch set.  I'm now
> capping num_elements to NBuffers at most.
>
>>>>> Meanwhile, the tests except for Windows pass [2] for this new patch [3].
>>>>> I'll dig into the Windows issues next week as well.
>>>> FWIW, there are backtraces generated on windows. E.g.
>>>>
>>>>
https://api.cirrus-ci.com/v1/artifact/task/6327899394932736/crashlog/crashlog-postgres.exe_00c0_2025-07-17_19-19-00-008.txt
>>>>
>>>> 000000cd`827fdea0 00007ff7`6ad82f88     ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77]
>>>> 000000cd`827fdee0 00007ff7`6aae2b7c     postgres!ExceptionalCondition(
>>>>             char * conditionName = 0x00007ff7`6b2a4cb8 "result < NBuffers",
>>>>             char * fileName = 0x00007ff7`6b2a4c88 "../src/backend/storage/buffer/freelist.c",
>>>>             int lineNumber = 0n139)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
>>>> 000000cd`827fdf20 00007ff7`6aae272c     postgres!clock_modulo(
>>>>             unsigned int64 counter = 0x101)+0x6c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 139]
>>>> 000000cd`827fdf60 00007ff7`6aad8647     postgres!StrategySyncStart(
>>>>             unsigned int * complete_passes = 0x000000cd`827fdfc0,
>>>>             unsigned int * num_buf_alloc = 0x000000cd`827fdfcc)+0x2c [c:\cirrus\src\backend\storage\buffer\freelist.c @ 300]
>>>> 000000cd`827fdfa0 00007ff7`6aa254a3     postgres!BgBufferSync(
>>>>             struct WritebackContext * wb_context = 0x000000cd`827fe180)+0x37 [c:\cirrus\src\backend\storage\buffer\bufmgr.c @ 3649]
>>>> 000000cd`827fe030 00007ff7`6aa278a7     postgres!BackgroundWriterMain(
>>>>             void * startup_data = 0x00000000`00000000,
>>>>             unsigned int64 startup_data_len = 0)+0x243 [c:\cirrus\src\backend\postmaster\bgwriter.c @ 236]
>>>> 000000cd`827ff5a0 00007ff7`6a8daf19     postgres!SubPostmasterMain(
>>>>             int argc = 0n3,
>>>>             char ** argv = 0x0000028f`e75d24d0)+0x2f7 [c:\cirrus\src\backend\postmaster\launch_backend.c @ 714]
>>>> 000000cd`827ff620 00007ff7`6af0f5a9     postgres!main(
>>>>             int argc = 0n3,
>>>>             char ** argv = 0x0000028f`e75d24d0)+0x329 [c:\cirrus\src\backend\main\main.c @ 222]
>>>>
>>>> I.e. your new assertion failed for some reason that i can't *immediately* see.
>>> I put that in as a precaution and as a way to communicate the intention
>>> of the other code above it.  I never imagined it would assert.  I've
>>> changed clock_read() to only assert when the modulo differs and left
>>> that assert in the calling ClockSweepTick() function because it was
>>> redundant and I'm curious to see if we see a similar assert when testing
>>> the modulo.
>> Do you understand why it triggered? Because I don't immediately. The fact that
>> it triggered only on windows, where the compiler is rather different, makes it
>> worth understanding imo.
> I dug into the ASM for both GCC 15.1 and MSVC 19.latest (thanks
> godbolt.org!) for x86_64 and there was a critical difference.  It starts
> with the fact that I'd used uint32 for my NBuffersPow2Mask rather than
> uint64.  That then translates to two different compiled outputs for
> clock_read() (was: clock_modulo()).
>
> gcc-15.1 -O2
>
> clock_read(unsigned long long):
>         and     edi, DWORD PTR NBuffersPow2Mask[rip]
>         mov     edx, DWORD PTR NBuffers[rip]
>         mov     rax, rdi
>         sub     rax, rdx
>         cmp     rdi, rdx
>         cmovb   rax, rdi
>         ret
>
> msvc-19.latest /O2
>
> hand$ = 8
> unsigned int clock_read(unsigned int64) PROC                 ; clock_read, COMDAT
>         mov     edx, ecx
>         and     rdx, QWORD PTR unsigned int64 NBuffersPow2Mask ; NBuffersPow2Mask
>         mov     ecx, DWORD PTR unsigned int NBuffers           ; NBuffers
>         mov     eax, edx
>         sub     eax, ecx
>         cmp     rdx, rcx
>         cmovb   eax, edx
>         ret     0
> unsigned int clock_read(unsigned __int64) ENDP               ; clock_read
>
>
> Here's what I think was happening, the MSVC compiler produced assembly
> for "hand & NBuffersPow2Mask" that uses "rdx QWORD" while GCC uses "edi
> DWORD".  The 32-bit AND operation (edi) automatically zeros the upper 32
> bits of rdi after performing the and with the uint64 value of hand while
> "rdx QWORD" does not potentially leaving some of the upper 32 bits set. 
> My guess is that on Windows when the value of the clock hand exceeded
> UINT32_MAX (as can happen in as little as 3 seconds in a tight loop but
> likely took longer in the test run) the bits not masked out would
> inflate the resulting value, which would be > NBuffers and also differ
> from the simple modulo calculation, causing the failed assertion.
>
> Changing the value of NBuffersPow2Mask to uint64 and using similarly
> sized types in these functions more closely aligns the assembly code and
> should fix this.

And it did make the code more correct, it just didn't fix that
particular bug.  The bug was in the logic for my optimized modulo.  Big
thank you to my PostgreSQL committer mentor Noah Misch who asked a
simple question yesterday in our monthly call, "if this algorithm is
correct, why isn't it how CPUs and/or compilers implement modulo?"  Then
went on to suggest that I write an exhaustive test for it (maybe even
coerce an LLM to do that for me) and test it.  So I did, and it was wrong.

I felt I'd give it one more try.  It turns out there is an algorithm [1]
based on some work that went into GMP a while ago [2], so I implemented
it and gave it a try (attached test.c).  Without optimization it's 10-42%
faster, but when compiled with GCC -O3 or -Ofast that advantage goes
away and it is slightly slower.  So, simplicity for the win and let the
compiler do the hard parts I guess.

>>>>> @@ -3664,15 +3664,25 @@ BgBufferSync(WritebackContext *wb_context)
>>>>>
>>>>>      /*
>>>>>       * Compute strategy_delta = how many buffers have been scanned by the
>>>>> -     * clock sweep since last time.  If first time through, assume none. Then
>>>>> -     * see if we are still ahead of the clock sweep, and if so, how many
>>>>> +     * clock-sweep since last time.  If first time through, assume none. Then
>>>>> +     * see if we are still ahead of the clock-sweep, and if so, how many
>>>>>       * buffers we could scan before we'd catch up with it and "lap" it. Note:
>>>>>       * weird-looking coding of xxx_passes comparisons are to avoid bogus
>>>>>       * behavior when the passes counts wrap around.
>>>>>       */
>>>>>      if (saved_info_valid)
>>>>>      {
>>>>> -        int32        passes_delta = strategy_passes - prev_strategy_passes;
>>>>> +        int32        passes_delta;
>>>>> +
>>>>> +        if (unlikely(prev_strategy_passes > strategy_passes))
>>>>> +        {
>>>>> +            /* wrap-around case */
>>>>> +            passes_delta = (int32) (UINT32_MAX - prev_strategy_passes + strategy_passes);
>>>>> +        }
>>>>> +        else
>>>>> +        {
>>>>> +            passes_delta = (int32) (strategy_passes - prev_strategy_passes);
>>>>> +        }
>>>>>
>>>>>          strategy_delta = strategy_buf_id - prev_strategy_buf_id;
>>>>>          strategy_delta += (long) passes_delta * NBuffers;
>>>> That seems somewhat independent of the rest of the change, or am I missing something?
>>> That change is there to cover the possibility of someone managing to
>>> overflow and wrap a uint64 which is *highly* unlikely.
>> That risk existed previously too - I'm not against shoring things up, I'd just
>> do it in a precursor commit, to make this easier to review.
>>
>>> If this degree of paranoia isn't required I'm happy to remove it.
>> That does indeed seem really unlikely.  Assuming that postgres stays up for 10
>> years without a single restart, it'd be ~59 billion ticks a second.
> Agreed, it's overkill.
>> I don't mind a defense, but I think we'd be better off putting it into
>> ClockSweepTick() or such, simply erroring out if we ever hit this. It's
>> unlikely that we'd get (and keep) all the relevant untested code correct ime.
>> Then we also can assert that prev_strategy_passes <= strategy_passes.
> Added assertions and comments to explain decision.
>> Greetings,
>>
>> Andres Freund

Patch set is now:

1) remove freelist

2) remove buffer_strategy_lock

3) abstract clock-sweep to type and API


Rebased v10 onto 5457ea46d18 at GitHub[3] and CommitFest[4],


-greg


[1] https://stackoverflow.com/a/26047426/366692

[2] https://gmplib.org/~tege/divcnst-pldi94.pdf

[3] https://github.com/gburd/postgres/pull/10

[4] https://commitfest.postgresql.org/patch/5928/

Attachments

Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From
Greg Burd
Date:
On 7/25/25 15:02, Greg Burd wrote:
> Patch set is now:
>
> 1) remove freelist
>
> 2) remove buffer_strategy_lock
>
> 3) abstract clock-sweep to type and API
>
>
>
> -greg

Somehow including the test.c file as an attachment on my last email
confused the CI and it didn't test the v10 patch set (which did pass in
GitHub CI on my fork [1]).  Here's v11 unchanged from v10 except rebased
onto 258bf0a2ea8 cf PG 19-2 5928 [2].


best,

-greg


[1] https://github.com/gburd/postgres/pull/10/checks

[2] https://commitfest.postgresql.org/patch/5928/


Attachments

Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From
Tomas Vondra
Date:
Hi,

I took a look at this thread / patches, mostly because it's somewhat
related to the NUMA patch series [1]. Freelist is one of the things the
NUMA patches tweak, to split it into per-node partitions. So removing
the freelist would affect the NUMA patches, mostly simplifying it by
making the freelist part irrelevant.

On the whole the idea seems reasonable - I'm not against removing the
freelist, assuming Andres is right and the freelist is not very useful
anyway. So I decided to validate this assumption by running a couple of
benchmarks ...


benchmarks
----------

I did two tests:


1) run-pgbench.sh

- read-only pgbench

- scale set to different fraction of shared buffers (50%, 90%, 110%)

- tracks tps


2) run-seqscan.sh

- concurrent evict+seqscans on different tables

- shared buffers set to different fractions of total dataset sizes (50%,
90%, 110%)

- This is similar to the workload Andres suggested in [2], except that
it uses a simple SELECT instead of pg_prewarm. The patches change how
pg_prewarm decides to stop, and I didn't want to be affected by that.

- tracks tps, and average latency for the evict + select


The attached test scripts are "complete" - they set up an instance,
disable checksums, set a couple of GUCs, etc.

I wanted to see the effect of each patch, so tests were done on master,
and then with patches applied one by one (after fixing the compile
failures in 0001).

I did this on two EPYC machines with 96 cores (lscpu attached), and the
results are virtually the same. Attached are PDFs with results from one
of the machines.

On the whole, the impact of the patches is negligible: it's within 1% in
either direction, and I don't see any particular trend / systemic change
in the behavior (e.g. regressions for some cases). Seems like noise,
which supports the assumption that the freelist is not very useful.

The one positive exception is "evict latency", which tracks the latency
of pg_buffercache_evict_relation(). That got consistently a little bit
better, which makes sense as it does not need to maintain the freelist.


freelist statistics?
--------------------

There's one caveat, though. I find it tricky to "know" if the workload
actually uses a freelist. Is there a good way to determine if a buffer
was acquired from a freelist, or through the clocksweep?

I'm worried the tests might actually have empty freelists, or maybe the
freelists won't be used very much. In which case removing the freelist
has understandably no impact. But it says nothing about cases getting
buffers from freelists often ...

I can probably think about each workload and deduce if there will be
freelists. For the benchmarks described earlier, I think the situation
is this:

pgbench, shared buffers < 100% -> no freelists, uses clock-sweep
pgbench, shared buffers > 100% -> freelists, but unused (no evictions)

seqscan, shared buffers < 100% -> freelists, but clocksweep too
seqscan, shared buffers > 100% -> freelists, no clocksweep

But maybe I got it wrong for some cases? It'd be good to have a way to
collect some stats for each test, to confirm this and also to quantify
the effects. A test that gets 10% of buffers from a freelist may behave
differently from a test that gets 99% buffers from a freelist.

I propose we add a system view showing interesting information about
freelists and buffer allocation, or perhaps extend an existing one (e.g.
pg_stat_bgwriter, which already has buffers_alloc). In the NUMA patches
I simply added a new view into pg_buffercache.

I'm not suggesting this would get committed, at this point I'm more
focused on the development. Whether the view would be useful outside the
development is an open question. Also, if it adds stats about freelists
(for current master), that'd be useless once we remove them.


Now, a couple review comments about the individual patches:


0001
----

1) nitpick: adds a blank line at the end of buffer/README

2) gcc complains about missing have_free_buffer prototype, but it's no
longer needed and can be removed

3) freelist fails to compile, because of:

freelist.c: In function ‘have_free_buffer’:
freelist.c:166:51: error: passing argument 1 of ‘pg_atomic_read_u64’
from incompatible pointer type [-Wincompatible-pointer-types]
  166 |         uint64          hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
      |                                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                   |
      |                                                   pg_atomic_uint32 *

That is, it should be accessed using pg_atomic_read_u32.

CI/cfbot does not report this, because it applies all the patches, and
0002 fixes this.


0002
----

1) I think the buffer/README needs a bit more work. It now suddenly
mentions completePasses, without explaining what it's for, and how it
works with nextVictimBuffer. And then it's not mentioned again, so why
even mention it? We don't mention various other fields ...

It's true completePasses is not a new thing, and probably should have
been already explained by the README. Still, it seems a bit confusing.

2) It'd be good if the README explained how we coordinate access to
multiple atomic values (with the spinlock removed), why the current
approach is safe, or how it's made safe. Or at least mention that we
thought about it, and where to look for details (e.g. in a comment for
particular function).

3) I have no opinion on how "clock-sweep" should be spelled, but if we
want to unify the spelling, it should be done in a separate preparatory
patch, so that it does not mix with the actual changes. It adds quite a
bit of changes.

3) I'm not sure about: Assert(prev_strategy_passes <= strategy_passes);

Either it's unlikely but possible, and then it should be elog(ERROR), or
we consider it impossible (and then the assert is fine). Maybe just
rephrase the comment to say we consider the overflow impossible.

Also, no need to split the passes_delta assignment, I think. It's fine
to subtract before an assert - it'll be bogus/negative, but so what?
It's not an undefined / invalid memory access or anything like that.

4) nitpick: freelist.c has incorrect formatting of "{" in a couple
places, e.g. it should be on a newline in a struct definition, etc.
pgindent would eventually fix this, but it bugs me.

5) I don't like the CLOCK_HAND_POSITION() naming very much, partially
because it's not really clear what "clock hand" is. Intuitively I know
what it's supposed to mean, also because I worked with this code before.
But not everyone has that benefit, and may assume something else.

IMO if nextVictimBuffer is no longer meant to be a "buffer", but a counter
that only ever increases, we should call it something different. Say,
"clockSweepCounter"?

Then CLOCK_HAND_POSITION() could be CLOCKSWEEP_NEXT_BUFFER()? And why not
have a separate CLOCKSWEEP_COMPLETE_PASSES() macro too?

6) I don't understand why the comment mentions the BufferDescriptor
array at all? Sure, we may look at the descriptor, but why is this
detail relevant in this place?

Also, shouldn't it say "divide by NBuffers" at the end?

7) CLOCK_HAND_POSITION() should probably have a comment explaining why
it does the masking, etc. And why it's better than simply doing the
modulo on the counter directly. I mean, why is this better than just
doing (counter % NBuffers)? It's going to work on uint64 anyway.

8) There was a discussion about doing the modulo in a better way, but I
don't see any message explaining it clearly enough for me. And then
there were some issues with it, and I guess the idea was abandoned.

I'm asking because my NUMA patches do a modulo in a couple places in the
clock-sweep part too, so if there's a better way, I'd like to know.

9) Why initialize "hand" to UINT64_MAX? Seems unnecessary.

10) nitpick: StrategySyncStart formatting is a bit wrong


0003
----

1) I don't quite understand the purpose of this part. How is this
abstracting anything, compared to what we already have (or what the two
earlier parts did)?

And if it "abstracts the algorithm", what would be the purpose? Do we
expect some alternative algorithms? And would 0003 really make it easier
compared to just changing the code directly? I don't get it.


2) I don't understand why we need to explicitly pass the "clock" to
every ClockSweepCycles/ClockSweepPosition call.

I mean, there's only ever one instance anyway - at least without "my"
NUMA patches. But even with the NUMA patches, the partition is
determined inside those functions, otherwise every caller would have to
repeat that. I think this is unnecessary/inconvenient.



regards


[1]
https://www.postgresql.org/message-id/099b9433-2855-4f1b-b421-d078a5d82017%40vondra.me

[2]
https://www.postgresql.org/message-id/2avffd4n6e5lu7kbuvpjclw3dzcqsw4qctj5ch4qin5gakk3r3%40euyewe6tf3z3


-- 
Tomas Vondra

Attachments

Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From
Greg Burd
Date:
On Aug 11 2025, at 9:09 am, Tomas Vondra <tomas@vondra.me> wrote:

> Hi,
>
> I took a look at this thread / patches, mostly because it's somewhat
> related to the NUMA patch series [1]. Freelist is one of the things the
> NUMA patches tweak, to split it into per-node partitions. So removing
> the freelist would affect the NUMA patches, mostly simplifying it by
> making the freelist part irrelevant.

Thank you Tomas for taking the time to review and performance test this
patch set.  I greatly appreciate your feedback.

> On the whole the idea seems reasonable - I'm not against removing the
> freelist, assuming Andres is right and the freelist is not very useful
> anyway. So I decided to validate this assumption by running a couple of
> benchmarks ...

Amazing, thank you.  I'll try to replicate your tests tomorrow to see if
my optimized division and modulo functions do in fact help or not.  I
realize that both you and Andres are (rightly) concerned that the
performance impact of IDIV on some CPUs can be excessive.

I'm going to post this new set of patches and then start a separate
email to analyze and respond to the performance tests.

> benchmarks
> ----------
>
> I did two tests:
>
>
> 1) run-pgbench.sh
>
> - read-only pgbench
>
> - scale set to different fraction of shared buffers (50%, 90%, 110%)
>
> - tracks tps
>
>
> 2) run-seqscan.sh
>
> - concurrent evict+seqscans on different tables
>
> - shared buffers set to different fractions of total dataset sizes (50%,
> 90%, 110%)
>
> - This is similar to the workload Andres suggested in [2], except that
> it uses a simple SELECT instead of pg_prewarm. The patches change how
> pg_prewarm decides to stop, and I didn't want to be affected by that.
>
> - tracks tps, and average latency for the evict + select
>
>
> The attached test scripts are "complete" - they set up an instance,
> disable checksums, set a couple of GUCs, etc.
>
> I wanted to see the effect of each patch, so tests were done on master,
> and then with patches applied one by one (after fixing the compile
> failures in 0001).
>
> I did this on two EPYC machines with 96 cores (lscpu attached), and the
> results are virtually the same. Attached are PDFs with results from one
> of the machines.
>
> On the whole, the impact of the patches is negligible: it's within 1% in
> either direction, and I don't see any particular trend / systemic change
> in the behavior (e.g. regressions for some cases). Seems like noise,
> which supports the assumption that the freelist is not very useful.
>
> The one positive exception is "evict latency", which tracks the latency
> of pg_buffercache_evict_relation(). That got consistently a little bit
> better, which makes sense as it does not need to maintain the freelist.
>
>
> freelist statistics?
> --------------------

I like this idea too, I'll try to work it into the next set of patches.

> There's one caveat, though. I find it tricky to "know" if the workload
> actually uses a freelist. Is there a good way to determine if a buffer
> was acquired from a freelist, or through the clocksweep?
>
> I'm worried the tests might actually have empty freelists, or maybe the
> freelists won't be used very much. In which case removing the freelist
> has understandably no impact. But it says nothing about cases getting
> buffers from freelists often ...
>
> I can probably think about each workload and deduce if there will be
> freelists. For the benchmarks described earlier, I think the situation
> is this:
>
> pgbench, shared buffers < 100% -> no freelists, uses clock-sweep
> pgbench, shared buffers > 100% -> freelists, but unused (no evictions)
>
> seqscan, shared buffers < 100% -> freelists, but clocksweep too
> seqscan, shared buffers > 100% -> freelists, no clocksweep
>
> But maybe I got it wrong for some cases? It'd be good to have a way to
> collect some stats for each test, to confirm this and also to quantify
> the effects. A test that gets 10% of buffers from a freelist may behave
> differently from a test that gets 99% buffers from a freelist.
>
> I propose we add a system view showing interesting information about
> freelists and buffer allocation, or perhaps extend an existing one (e.g.
> pg_stat_bgwriter, which already has buffers_alloc). In the NUMA patches
> I simply added a new view into pg_buffercache.
>
> I'm not suggesting this would get committed, at this point I'm more
> focused on the development. Whether the view would be useful outside the
> development is an open question. Also, if it adds stats about freelists
> (for current master), that'd be useless once we remove them.
>
>
> Now, a couple review comments about the individual patches:

Apologies for not testing each commit separately.

> 0001
> ----
>
> 1) nitpick: adds a blank line at the end of buffer/README

Removed.

> 2) gcc complains about missing have_free_buffer prototype, but it's no
> longer needed and can be removed

Removed.

> 3) freelist fails to compile, because of:
>
> freelist.c: In function ‘have_free_buffer’:
> freelist.c:166:51: error: passing argument 1 of ‘pg_atomic_read_u64’
> from incompatible pointer type [-Wincompatible-pointer-types]
>   166 |         uint64          hand = pg_atomic_read_u64(&StrategyControl->nextVictimBuffer);
>       |                                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>       |                                                   |
>       |                                                   pg_atomic_uint32 *
>
> That is, it should be accessed using pg_atomic_read_u32.

Removed.

> CI/cfbot does not report this, because it applies all the patches, and
> 0002 fixes this.

That makes sense, I'll be more careful with each patch in the future.

> 0002
> ----
>
> 1) I think the buffer/README needs a bit more work. It now suddenly
> mentions completePasses, without explaining what it's for, and how it
> works with nextVictimBuffer. And then it's not mentioned again, so why
> even mention it? We don't mention various other fields ...

I've taken a stab at expanding on the new approach, happy to continue to
refine the wording.

> It's true completePasses is not a new thing, and probably should have
> been already explained by the README. Still, it seems a bit confusing.

I explain the idea of "complete passes" and how it is now encoded in the
clockSweepCounter (spoiler, that's the new name for nextVictimBuffer).

> 2) It'd be good if the README explained how we coordinate access to
> multiple atomic values (with the spinlock removed), why the current
> approach is safe, or how it's made safe. Or at least mention that we
> though about it, and where to look for details (e.g. in a comment for
> particular function).

I think I've covered this base as well, let me know if it is still
lacking important content.

> 3) I have no opinion on how "clock-sweep" should be spelled, but if we
> want to unify the spelling, it should be done in a separate preparatory
> patch, so that it does not mix with the actual changes. It adds quite a
> bit of changes.

Absolutely, I should have made that the first patch in the set.  Now it is.

> 3) I'm not sure about: Assert(prev_strategy_passes <= strategy_passes);
>
> Either it's unlikely but possible, and then it should be elog(ERROR), or
> we consider it impossible (and then the assert is fine). Maybe just
> rephrase the comment to say we consider the overflow impossible.

I've rephrased it to state "overflow is impossible" but memorialized in
an Assert for good measure.

> Also, no need to split the passes_delta assignment, I think. It's fine
> to subtract before an assert - it'll be bogus/negative, but so what?
> It's not an undefined / invalid memory access or anything like that.

Okay, works for me.

> 4) nitpick: freelist.c has incorrect formatting of "{" in a couple
> places, e.g. it should be on a newline in a struct definition, etc.
> pgindent would eventually fix this, but it bugs me.

Yep, guilty as charged.  I forgot that last pgindent run before git
format-patch.  It bugs me too, fixed.

> 5) I don't like the CLOCK_HAND_POSITION() naming very much, partially
> because it's not really clear what "clock hand" is. Intuitively I know
> what it's supposed to mean, also because I worked with this code before.
> But not everyone has that benefit, and may assume something else.

Now that the nextVictimBuffer is renamed to clockSweepCounter the new
macro names in 0003 are CLOCKSWEEP_HAND() and CLOCKSWEEP_PASSES().  In
0004 those turn into functions ClockSweepHand() and ClockSweepPasses().

> IMO if nextVictimBuffer is no longer meant to be a "buffer", but a counter
> that only ever increases, we should call it something different. Say,
> "clockSweepCounter"?

Great idea and a good suggested name, renamed to clockSweepCounter.

> Then CLOCK_HAND_POSITION() could be CLOCKSWEEP_NEXT_BUFFER()? And why not
> have a separate CLOCKSWEEP_COMPLETE_PASSES() macro too?

Agreed, done.

> 6) I don't understand why the comment mentions the BufferDescriptor
> array at all? Sure, we may look at the descriptor, but why is this
> detail relevant in this place?

It's not, removed.

> Also, shouldn't it say "divide by NBuffers" at the end?

Also cleaned up.

> 7) CLOCK_HAND_POSITION() should probably have a comment explaining why
> it does the masking, etc. And why it's better than simply doing the
> modulo on the counter directly. I mean, why is this better than just
> doing (counter % NBuffers)? It's going to work on uint64 anyway.

Took a swing at comments, not thrilled but better than nothing.

> 8) There was a discussion about doing the modulo in a better way, but I
> don't see any message explaining it clearly enough for me. And then
> there were some issues with it, and I guess the idea was abandoned.

Well, I made an attempt at implementing the algorithm in the
Granlund-Montgomery paper "Division by Invariant Integers using
Multiplication" [1] and got it to work eventually (attached as
v13-0004b.txt to avoid it being recognized as part of this patch set;
rename it to
v13-0004-Optimize-modulo-and-division-used-in-clock-sweep.patch).  In my
approach I created logic with two paths for division and modulo: one for
when NBuffers is a power of two, and one for when it isn't.  I didn't
mind having a branch in the inlined functions because it's likely that
the branch predictor will get it right 100% of the time.

Then I found an implementation of the paper called "fastdiv" [2] that
doesn't branch but uses a few more instructions.  It is released under
the MIT License.  I'm not sure what the convention in PostgreSQL code is
for including code inspired by another project (although now is a good
time for me to search the wiki and find out).  Both my implementation
and the fastdiv implementation work and should be considerably faster
on modern CPUs.  I've attached the fastdiv version, but I can revert to
mine if that's easier or more in line with community standards.  I'd
estimate the fastdiv version runs in roughly 12-18 cycles (vs. 26-90 for
a hardware division or modulo) on most CPU architectures.

> I'm asking because my NUMA patches do a modulo in a couple places in the
> clock-sweep part too, so if there's a better way, I'd like to know.

Tell me your thoughts on that algorithm; I think it does the job.

> 9) Why initialize "hand" to UINT64_MAX? Seems unnecessary.

Yep, removed.

> 10) nitpick: StrategySyncStart formatting is a bit wrong

Yes, fixed.

>
> 0003
> ----
>
> 1) I don't quite understand the purpose of this part. How is this
> abstracting anything, compared to what we already have (or what the two
> earlier parts did)?
>
> And if it "abstracts the algorithm", what would be the purpose? Do we
> expect some alternative algorithms? And would 0003 really make it easier
> compared to just changing the code directly? I don't get it.
>
>
> 2) I don't understand why we need to explicitly pass the "clock" to
> every ClockSweepCycles/ClockSweepPosition call.
>
> I mean, there's only ever one instance anyway - at least without "my"
> NUMA patches. But even with the NUMA patches, the partition is
> determined inside those function, otherwise every caller would have to
> repeat that. I think this is unnecessary/inconvenient.

Okay, I understand that.  At one point I had implementations for
ClockPro and SIEVE but you're right.

> regards
>
>
> [1]
> https://www.postgresql.org/message-id/099b9433-2855-4f1b-b421-d078a5d82017%40vondra.me
>
> [2]
> https://www.postgresql.org/message-id/2avffd4n6e5lu7kbuvpjclw3dzcqsw4qctj5ch4qin5gakk3r3%40euyewe6tf3z3
>
>
> --
> Tomas Vondra

Thanks again for digging into the code.  I took on this project to help
your NUMA work, but once I got started it seemed like a good idea even
if that work doesn't land, as this should remove contention across cores
and hopefully be generally faster.

-greg

[1] https://gmplib.org/~tege/divcnst-pldi94.pdf
[2] https://github.com/jmtilli/fastdiv


Attachments

Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Thomas Munro
Date:
On Wed, Aug 13, 2025 at 9:42 AM Greg Burd <greg@burd.me> wrote:
> Amazing, thank you.  I'll try to replicate your tests tomorrow to see if
> my optimized division and modulo functions do in fact help or not.  I
> realize that both you and Andres are (rightly) concerned that the
> performance impact of IDIV on some CPUs can be excessive.

At the risk of posting untested crackpot theories on the internet, I
wonder if there is a way to use a simple boundary condition and
subtraction for this.  If you correct overshoot compared to an
advancing-in-strides base value, then I wonder how often you'd finish
up having to actually do that under concurrency.  Obviously in
general, implementing modulo with subtraction is a terrible idea, but
can you make it so that the actual cost works out as mostly 0, rarely
1 and exceedingly rarely more than 1 subtraction loops?  If that's
true, do the branches somehow kill you?

Assume for now that we're OK with keeping % and / for the infrequent
calls to StrategySyncStart(), or we can redefine the bgwriter's
logic so that it doesn't even need those (perhaps what it really wants
to know is its total distance behind the allocator, so perhaps we can
define that problem away?  haven't thought about that yet...).  What
I'm wondering out loud is whether the hot ClockSweepTick() code might
be able to use something nearly as dumb as this...

    /* untested pseudocode */
    ticks_base = pg_atomic_read_u64(&StrategyControl->ticks_base);
    ticks = pg_atomic_fetch_add_u64(&StrategyControl->ticks, 1);
    hand = ticks - ticks_base;

    /*
     * Compensate for overshoot.  Expected number of loops: none most of the
     * time, one when we overshoot, and maybe more if the system gets
     * around the whole clock before we see the base value advance.
     */
    while (hand >= NBuffers)
    {
        /* Base value advanced by backend that overshoots by one tick. */
        if (hand == NBuffers)
            pg_atomic_fetch_add_u64(&StrategyControl->ticks_base, NBuffers);
        hand -= NBuffers;
    }



Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Thomas Munro
Date:
On Sat, Aug 16, 2025 at 3:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>     while (hand >= NBuffers)
>     {
>         /* Base value advanced by backend that overshoots by one tick. */
>         if (hand == NBuffers)
>             pg_atomic_fetch_add_u64(&StrategyControl->ticks_base, NBuffers);
>         hand -= NBuffers;
>     }

Or if you don't like those odds, maybe it'd be OK to keep % but use it
rarely and without the CAS that can fail.  I assume it would still
happen occasionally in more than one backend due to the race against
the base advancing a few instructions later, but maybe that'd work out
OK?  I dunno.  The point would be to make it rare.  And with a
per-NUMA-node CLOCK, hopefully quite rare indeed.  I guess this way
you don't need to convince yourself that ticks_base is always <= ticks
for all cores, since it would self-correct (if it appears to one core
that ticks_base > ticks then hand will be a very large number and take
this branch).  IDK, again untested, just throwing ideas out there...

    if (hand >= NBuffers)
    {
        hand %= NBuffers;
        /* Base value advanced by backend that overshoots by one tick. */
        if (hand == 0)
            pg_atomic_fetch_add_u64(&StrategyControl->ticks_base, NBuffers);
    }



Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Thomas Munro
Date:
On Sun, Aug 17, 2025 at 4:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Or if you don't like those odds, maybe it'd be OK to keep % but use it
> rarely and without the CAS that can fail.

... or if we wanted to try harder to avoid %, could we relegate it to
the unlikely CLOCK-went-all-the-way-around-again-due-to-unlucky-scheduling
case, but use subtraction for the expected periodic overshoot?

    if (hand >= NBuffers)
    {
        hand = hand < NBuffers * 2 ? hand - NBuffers : hand % NBuffers;
        /* Base value advanced by backend that overshoots by one tick. */
        if (hand == 0)
            pg_atomic_fetch_add_u64(&StrategyControl->ticks_base, NBuffers);
    }



Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Greg Burd
Date:
On Aug 17 2025, at 12:57 am, Thomas Munro <thomas.munro@gmail.com> wrote:

> On Sun, Aug 17, 2025 at 4:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Or if you don't like those odds, maybe it'd be OK to keep % but use it
>> rarely and without the CAS that can fail.
>
> ... or if we wanted to try harder to avoid %, could we relegate it to
> the unlikely CLOCK-went-all-the-way-around-again-due-to-unlucky-scheduling
> case, but use subtraction for the expected periodic overshoot?
>
>    if (hand >= NBuffers)
>    {
>        hand = hand < NBuffers * 2 ? hand - NBuffers : hand % NBuffers;
>        /* Base value advanced by backend that overshoots by one tick. */
>        if (hand == 0)
>            pg_atomic_fetch_add_u64(&StrategyControl->ticks_base, NBuffers);
>    }
>

Hi Thomas,

Thanks for all the ideas; I tried out several of them along with a few
of my own.  I've done a lot of measurement and had a few off-channel
discussions about this, and I think the best way forward is to focus on
removing the freelist and not change the lock or the clock-sweep
algorithm too much right now.  So the attached patch set keeps the first
two patches from the last set but drops the rest.

But wait, there's more...

As a *bonus* I've added a new third patch with some proposed changes to
spark discussion.  As I researched experiences in the field at scale, a
few other buffer-management issues came to light.  The one I tried to
address in this new patch 0003 has to do with very large shared_buffers
(NBuffers) and very large active datasets causing most buffer usage
counts to sit at or near the maximum (5).  In those cases the
clock-sweep algorithm needs to perform up to NBuffers * 5 "ticks" before
identifying a buffer to evict.  This also pollutes the completePasses
value used to tell the bgwriter where to start working.

So, in this patch I add per-backend buffer usage tracking and proactive
pressure management.  Each tick of the hand can now decrement usage by a
calculated amount, not just 1, based on /hand-wavy-first-attempt at magic/.

The thing I'm sure this doesn't help with, and may in fact hurt, is
keeping frequently accessed buffers in the buffer pool.  I imagine a two
tier approach to this where some small subset of buffers that are reused
frequently enough are not even considered by the clock-sweep algorithm.

Regardless, I feel the first two patches on this set address the
intention of this thread. I added patch 0003 just to start a
conversation; please chime in if any of this interests you.  Maybe this
new patch should take on a life of its own in a new thread?  If anyone
thinks this approach has some merit, I'll do that.

I look forward to thoughts on these ideas, and hopefully to finding
someone willing to help me get the first two over the line.

best.

-greg

Attachments

Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Andres Freund
Date:
Hi,

On 2025-08-27 15:42:48 -0400, Greg Burd wrote:
> Regardless, I feel the first two patches on this set address the
> intention of this thread.

I'm planning to commit the first two patches after making a pass through
them. I have some work that I'm cleaning up to post that'd conflict and Tomas'
NUMA patches are affected too.

Greetings,

Andres Freund



Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
"Greg Burd"
Date:
On Thu, Sep 4, 2025, at 12:59 PM, Andres Freund wrote:
> Hi,
>
> On 2025-08-27 15:42:48 -0400, Greg Burd wrote:
>> Regardless, I feel the first two patches on this set address the
>> intention of this thread.
>
> I'm planning to commit the first two patches after making a pass through
> them. I have some work that I'm cleaning up to post that'd conflict and Tomas'
> NUMA patches are affected too.

Fantastic, and thank you.  If you have any concerns I'm here to work through them with you.  I'll carry over the idea
from the 3rd patch into a new thread.

> Greetings,
>
> Andres Freund

best,

-greg



Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Andres Freund
Date:
Hi,

On 2025-09-04 13:24:00 -0400, Greg Burd wrote:
> On Thu, Sep 4, 2025, at 12:59 PM, Andres Freund wrote:
> > Hi,
> >
> > On 2025-08-27 15:42:48 -0400, Greg Burd wrote:
> >> Regardless, I feel the first two patches on this set address the
> >> intention of this thread.
> >
> > I'm planning to commit the first two patches after making a pass through
> > them. I have some work that I'm cleaning up to post that'd conflict and Tomas'
> > NUMA patches are affected too.
> 
> Fantastic, and thank you.  If you have any concerns I'm here to work through
> them with you.  I'll carry over the idea from the 3rd patch into a new
> thread.

Committed with very minor changes:
- removal of StrategyFreeBuffer() prototype
- the new message in autoprewarm used "autoprewarm: " in the log message,
  which none of the other messages do - not the prettiest, but it's consistent
  with what's already there
- some editing of the commit message
- some whitespace changes

Greetings,

Andres Freund



Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
"Greg Burd"
Date:
On Fri, Sep 5, 2025, at 12:27 PM, Andres Freund wrote:
> Hi,
>
> On 2025-09-04 13:24:00 -0400, Greg Burd wrote:
>> On Thu, Sep 4, 2025, at 12:59 PM, Andres Freund wrote:
>> > Hi,
>> >
>> > On 2025-08-27 15:42:48 -0400, Greg Burd wrote:
>> >> Regardless, I feel the first two patches on this set address the
>> >> intention of this thread.
>> >
>> > I'm planning to commit the first two patches after making a pass through
>> > them. I have some work that I'm cleaning up to post that'd conflict and Tomas'
>> > NUMA patches are affected too.
>> 
>> Fantastic, and thank you.  If you have any concerns I'm here to work through
>> them with you.  I'll carry over the idea from the 3rd patch into a new
>> thread.
>
> Committed with very minor changes:
> - removal of StrategyFreeBuffer() prototype
> - the new message in autoprewarm used "autoprewarm: " in the log message,
>   which none of the other messages do - not the prettiest, but it's consistent
>   with what's already there
> - some editing of the commit message
> - some whitespace changes

All reasonable and thoughtful changes.

>
> Greetings,
>
> Andres Freund

thank you.

-greg



Re: [PATCH] Let's get rid of the freelist and the buffer_strategy_lock

From:
Robert Haas
Date:
On Fri, Jul 11, 2025 at 3:48 PM Greg Burd <greg@burd.me> wrote:
> I briefly considered how one might use what was left after surgery to produce some similar boolean signal to no
avail. I think that autoprewarm was simply trying to at most warm NBuffers then stop.  The freelist at startup was just
aconvenient thing to drain and get that done.  Maybe I'll try adapting autoprewarm to consider that global instead. 

My concern had been that while autoprewarm was running, other system
activity could already have started and been filling up
shared_buffers. By considering whether there were actually free
buffers remaining, it would prewarm less in that case.

I'm not saying that was the perfect idea, I'm just telling you what I
was thinking at the time.

--
Robert Haas
EDB: http://www.enterprisedb.com