Thread: [WiP] B-tree page merge during vacuum to reduce index bloat
Hi hackers,

Together with Kirk and Nik we spent several online hacking sessions tackling index bloat problems [0,1,2]. As a result we concluded that B-tree index page merging should help keep an index from pressuring shared_buffers.

We are proposing a feature to automatically merge nearly-empty B-tree leaf pages during VACUUM operations to reduce index bloat. This addresses a common issue where deleted tuples leave behind sparsely populated pages that traditional page deletion cannot handle because they're not completely empty.

*** Implementation Overview:

The patch adds a new `mergefactor` reloption (default 5%, range 0-50%) that defines when a page becomes a candidate for merging. During vacuum, when a page exceeds this threshold (e.g., 95% empty with default settings), we attempt to move the remaining tuples to the right sibling page and then delete the source page using the existing page deletion mechanisms.

Key changes:
- New `mergefactor` reloption for configurable merge thresholds
- Detection logic in `btvacuumpage()` to identify merge candidates
- Tuple movement implementation in `_bt_unlink_halfdead_page()`
- WAL logging enhancements to handle cross-page dependencies during replay

The last part needs further improvement (it's simply REGBUF_FORCE_IMAGE for now), but I want to start a discussion and ask about known problems with the approach.

*** Correctness:

The implementation reuses the existing locking, critical sections and WAL logging infrastructure. To handle cross-page dependencies during WAL replay, when tuples are merged the right sibling buffer is registered with `REGBUF_FORCE_IMAGE`; this is a temporary workaround.

*** Current Status & Questions:

The patch successfully reduces index bloat and handles basic scenarios, but we've identified some areas that need community input:

1. Backward Scan Correctness: The primary concern is ensuring backward scans work correctly when pages are being merged concurrently. Since we maintain the same locking protocol as existing page deletion, I believe this should be safe, but would appreciate expert review of the approach.

2. Performance Impact: The additional checks during vacuum have minimal overhead, but broader testing would be valuable. The worst case would be an index with a leaf fill pattern like (5%, 96%, 5%, 96%, 5%, 96%, ...): we would attempt to merge it every time, spending time on acquiring locks.

3. WAL Consistency: There are still some edge cases with WAL consistency checking that need refinement. I think I can handle them; I just need to spend enough time debugging real redo instead of imaging the right page.

*** Usage:

CREATE INDEX ON table (col) WITH (mergefactor=10); -- 10% threshold

I don't know if it would be a good idea to enable mergefactor for existing indexes.

The feature is particularly beneficial for workloads with high update/delete rates that create sparse index pages without triggering complete page deletion.

I'm attaching the patch for review and would welcome feedback on the approach, especially regarding backward scan safety and any other correctness concerns we may have missed.

Thank you!

Best regards, Andrey, Nik, Kirk.

[0] https://www.youtube.com/watch?v=3MleDtXZUlM
[1] https://www.youtube.com/watch?v=Ib3SXSFt8mE
[2] https://www.youtube.com/watch?v=D1PEdDcvZTw
On Tue, 26 Aug 2025 at 11:40, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers,
>
> Together with Kirk and Nik we spent several online hacking sessions to tackle index bloat problems [0,1,2]. As a result we concluded that B-tree index page merge should help to keep an index from pressuring shared_buffers.
>
> We are proposing a feature to automatically merge nearly-empty B-tree leaf pages during VACUUM operations to reduce index bloat. This addresses a common issue where deleted tuples leave behind sparsely populated pages that traditional page deletion cannot handle because they're not completely empty.
>
> ...
>
> 1. Backward Scan Correctness: The primary concern is ensuring backward scans work correctly when pages are being merged concurrently. Since we maintain the same locking protocol as existing page deletion, I believe this should be safe, but would appreciate expert review of the approach.

I'm fairly sure there is a correctness issue here; I don't think you correctly detect the two following cases:

1.) a page (P0) is scanned by a scan, finishes processing the results, and releases its pin. It prepares to scan the next page of the scan (P1).
2.) a page (A) is split, with new right sibling page B,
3.) and the newly created page B is merged into its right sibling C,
4.) the scan continues on to the next page

For backward scans, if page A is the same page as the one identified with P1, the scan won't notice that tuples from P1 (aka A) have been moved through B to P0 (C), causing the scan to skip processing for those tuples.

For forward scans, if page A is the same page as the one identified with P0, the scan won't notice that tuples from P0 (A) have been moved through B to P1 (C), causing the scan to process those tuples twice in the same scan, potentially duplicating results.

NB: Currently, the only way for "merge" to happen is when the index page is completely empty. This guarantees that there is no movement of scan-visible tuples to pages we've already visited/are about to visit. This invariant is used extensively to limit lock and pin coupling (and thus: improve performance) in index scans; see e.g. 1bd4bc85. This patch will invalidate that invariant, and therefore it will require (significantly) more work in the scan code (incl. nbtsearch.c) to guarantee exactly-once results + no false negatives.

Kind regards,

Matthias van de Meent
Databricks
On Tue, Aug 26, 2025 at 6:33 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> On Tue, 26 Aug 2025 at 11:40, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > Hi hackers,
> >
> > Together with Kirk and Nik we spent several online hacking sessions to tackle index bloat problems [0,1,2]. As a result we concluded that B-tree index page merge should help to keep an index from pressuring shared_buffers.
> >
> > We are proposing a feature to automatically merge nearly-empty B-tree leaf pages during VACUUM operations to reduce index bloat. This addresses a common issue where deleted tuples leave behind sparsely populated pages that traditional page deletion cannot handle because they're not completely empty.
> >
> > ...
>
> I'm fairly sure there is a correctness issue here; I don't think you
> correctly detect the two following cases:
> 1.) a page (P0) is scanned by a scan, finishes processing the results,
> and releases its pin. It prepares to scan the next page of the scan
> (P1).
> 2.) a page (A) is split, with new right sibling page B,
> 3.) and the newly created page B is merged into its right sibling C,
> 4.) the scan continues on to the next page
>
> For backward scans, if page A is the same page as the one identified
> with P1, the scan won't notice that tuples from P1 (aka A) have been
> moved through B to P0 (C), causing the scan to skip processing for
> those tuples.
>
> For forward scans, if page A is the same page as the one identified
> with P0, the scan won't notice that tuples from P0 (A) have been moved
> through B to P1 (C), causing the scan to process those tuples twice in
> the same scan, potentially duplicating results.
>
> NB: Currently, the only way for "merge" to happen is when the index
> page is completely empty. This guarantees that there is no movement of
> scan-visible tuples to pages we've already visited/are about to visit.
> This invariant is used extensively to limit lock and pin coupling (and
> thus: improve performance) in index scans; see e.g. 1bd4bc85. This
> patch will invalidate that invariant, and therefore it will require
> (significantly) more work in the scan code (incl. nbtsearch.c) to
> guarantee exactly-once results + no false negatives.
>
> Kind regards,
>
> Matthias van de Meent
> Databricks
This was one of our concerns. We will review the patch mentioned.
I do have a question. One of the ideas we discussed was to add a new page that combines the two pages, making the deletion "feel" like a page split.
This has the advantage of leaving the original two pages alone for anyone who is currently traversing them.
And, like a page split, it would update the surrounding links while marking the old pages off the new path.
The downside to this approach is that we are adding one page in order to then mark two pages as unused.
Could you comment on this secondary approach?
Thanks in advance!
Kirk
On Tue, Aug 26, 2025 at 5:40 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> *** Correctness:
>
> The implementation reuses existing locking, critical sections and WAL logging infrastructure. To handle cross-page dependencies during WAL replay, when tuples are merged, the right sibling buffer is registered with `REGBUF_FORCE_IMAGE`; this is a temporary workaround.

I don't think that you can reuse existing locking for this. It needs to be totally reevaluated in light of the way that you've changed page deletion.

In general, page deletion ends up being very complicated with the Lehman & Yao design -- even with the existing restrictions. It's a natural downside of the very permissive locking rules: there is no need for most searchers and inserters to ever hold more than a single buffer lock at a time. Searchers generally hold *no* locks when descending the index at points where they're "between levels". Almost anything can happen in the meantime -- including page deletion (it's just that the deleted page cannot be concurrently recycled right away, which is kind of a separate process). That original block number reference still has to work for these searchers, no matter what. It must at least not give them an irredeemably bad picture of what's going on; they might have to move right to deal with concurrent page deletion and page splits, but that's it.

If you merge together underfull pages like this, you change the key space boundaries between pages. I don't see a way to make that a safe/atomic operation, except by adding significantly more locking in the critical path of searchers.

> *** Current Status & Questions:
>
> The patch successfully reduces index bloat and handles basic scenarios, but we've identified some areas that need community input:

Have you benchmarked it? I wouldn't assume that free-at-empty actually is worse than merging underfull pages.
It's complicated, but see this paper for some counter-arguments:

https://www.sciencedirect.com/science/article/pii/002200009390020W

This old Dr. Dobbs article from the same authors gives a lighter summary of the same ideas:

https://web.archive.org/web/20200811014238/https://www.drdobbs.com/reexamining-b-trees/184408694?pgno=3

In any case, I believe that this optimization isn't widely implemented by other major RDBMSs, despite being a textbook technique that was known about and described early in the history of B-Trees.

--
Peter Geoghegan
Peter, Matthias, many thanks for your input!

> On 28 Aug 2025, at 01:06, Peter Geoghegan <pg@bowt.ie> wrote:
>
> Have you benchmarked it?

Kind of. Here are bloat charts from random production clusters:

The upper green line has its axis on the right - it's the total index bloat per cluster. The other lines are the individual indexes with the most bloat, with the axis on the left.

Notice the 7-day period on one of the lines. That's our Friday automatic repack of indexes that suffer from bloat. For most of the week these indexes are 97%+ bloated.

On a yellow line a small period is noticeable - that's what "free at empty" vacuum can help with now. It is just one index, and not even a top bloat contributor. But 2 out of 3 random clusters had such an index. (I must admit that the clusters were not truly random - I just picked browser tabs from recent incidents during my on-call duty shift.)

Both cases are queues with a secondary index, which gets bloated quickly after reindexing.

Of course, I'm not proposing to do "merge-at-half"; merge-at-95%-free would be good enough for this case. A 95% bloated index certainly has some 95% free pages.

I think that to establish a baseline for locking correctness we are going to start by writing index scan tests that fail with the proposed merge patch and pass on current HEAD. I want to observe the forward scan showing duplicates and the backward scan missing tuples.

From that we will try to design locking that does not affect performance significantly, but still allows merging pages. Perhaps we can design a way to switch new index scans to a "safe mode" during index vacuum, waiting for existing scans to complete.

Best regards, Andrey Borodin.
> On 29 Aug 2025, at 13:39, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I think to establish baseline for locking correctness we are going to start from writing index scan tests, that fail with proposed merge patch and pass on current HEAD. I want to observe that forward scan is showing duplicates and backward scan misses tuples.

Well, that was unexpectedly easy. See patch 0001. It brings a test where we create a sparse tree, plus an injection point that will wait on a scan stepping into some middle leaf page. Then the test invokes vacuum. There are ~35 leaf pages, and most of them will be merged into just a few pages. As expected, both scans produce incorrect results.

t/008_btree_merge_scan_correctness.pl .. 1/?
#   Failed test 'Forward scan returns correct count'
#   at t/008_btree_merge_scan_correctness.pl line 132.
#          got: '364'
#     expected: '250'
#   Failed test 'Backward scan returns correct count'
#   at t/008_btree_merge_scan_correctness.pl line 133.
#          got: '142'
#     expected: '250'
# Looks like you failed 2 tests of 2.

> From that we will try to design locking that does not affect performance significantly, but allows to merge pages. Perhaps, we can design a way to switch new index scans to "safe mode" during index vacuum and waiting for existing scans to complete.

What if we just abort any scan that steps onto a page whose tuples were moved out?

I've prototyped this approach; please see patch 0002. Maybe in the future we will improve the locking protocol if we observe high error rates. Unfortunately, this approach leads to a default mergefactor of 0 instead of 5%.

What do you think? Should we add this to the CF, or is the idea too wild for a review?

Best regards, Andrey Borodin.