Discussion: Optimizing "boundary cases" during backward scan B-Tree index descents


Optimizing "boundary cases" during backward scan B-Tree index descents

From:
Peter Geoghegan
Date:
An important goal of the work on nbtree that went into PostgreSQL 12
(and to a lesser extent the work that went into 13) was to make sure
that index scans deal with "boundary cases" optimally. The simplest
way of explaining what this means is through a practical worked
example.

Recap, worked example:

Consider a query such as "select count(*) from tenk1 where hundred =
$1", where tenk1 is one of the Wisconsin benchmark tables left behind
by the standard regression tests. In practice this query will count
either 0 or 100 rows for every possible $1 value. Query execution
only ever needs to scan exactly one leaf page when executing this
query -- for any possible $1 value. In particular, it doesn't matter
if $1 happens to be a "boundary value". By "boundary value" I mean a
value that matches the leftmost or the rightmost tuple on some leaf
page. In other words, a value that matches a key from some internal
page seen while descending the tree in _bt_search.

This effect is reliable for the tenk1_hundred index that my example
query uses (a single column index) because numerous optimizations are
in place, which work together. This starts with _bt_search, which will
reliably descend to exactly one leaf page -- the only one where we
could possibly find a match (thanks to the !pivotsearch optimization).
After that, _bt_readpage() will reliably notice that it doesn't have
to go to the page to the right at all (occasionally it'll need to
check the high key to do this, and so will use the optimization
introduced in Postgres 12 by commit 29b64d1d).

(It's easy to see this using "EXPLAIN (ANALYZE, BUFFERS)", which will
reliably break out index pages when we're using a bitmap index scan --
which happens to be the case here. You'll see one root page access and
one leaf page access for the tenk1_hundred index.)

This is just an example of a fairly general principle: we ought to
expect selective index scans to only need to access one leaf page. At
least for any index scan that only needs to access a single "natural
grouping" that is cleanly isolated onto a single leaf page.

Another example of such a "natural grouping" can also be seen in the
TPC-C orderline table's primary key. In practice we shouldn't ever
need to touch more than a single leaf page when we go to read any
individual order's entries from the order lines table/PK. There will
never be more than 15 order lines per order in practice (10 on
average), which guarantees that suffix truncation will avoid splitting
any individual order's order lines across two leaf pages (after a page
split). Plus we're careful to exploit the index structure (and basic
B-Tree index invariants) to maximum effect during index scans -- as
long as they're forward scans.

(This ends my recap of "boundary case" handling.)

Patch:

We still fall short when it comes to handling boundary cases optimally
during backwards scans. This is at least true for a subset of
backwards scans that request "goback=true" processing inside
_bt_first. Attached patch improves matters here. Again, the simplest
way of explaining what this does is through a practical worked
example.

Consider an example where we don't do as well as might be expected right now:

EXPLAIN (ANALYZE, BUFFERS) select * from tenk1 where hundred < 12
order by hundred desc limit 1;

This particular example shows "Buffers: shared hit=4". We see one more
buffer hit than is truly necessary though. If I tweak the query by
replacing 12 with the adjacent value 11 (i.e. same query but with
"hundred < 11"), then I'll see "Buffers: shared hit=3" instead.
Similarly, "hundred < 13" will also show "Buffers: shared hit=3". Why
should "hundred < 12" need to access an extra leaf page?

Sure enough, the problematic case shows "Buffers: shared hit=3" with
my patch applied, as expected (a buffer for the root page, a leaf
page, and a heap page). The patch makes every variant of my query
touch the same number of leaf pages/buffers, as expected.

The patch teaches _bt_search (actually, _bt_compare) about the
"goback=true" case. This allows it to exploit information about which
leaf page accesses are truly necessary. The code in question already
knows about "nextkey=true", which is a closely related concept. It
feels very natural to also teach it about "goback=true" now.

(Actually, I lied. All the patch really does is rename existing
"pivotsearch" logic, so that the symbol name "goback" is used instead
-- the insertion scankey code doesn't need to change at all. The real
behavioral change takes place in _bt_first, the higher level calling
code. It has been taught to set its insertion/initial positioning
scankey's pivotsearch/goback field to "true" in the patch. Before now,
this option was used exclusively during VACUUM, for page deletion. It turns
out that so-called "pivotsearch" behavior is far more general than
currently supposed.)
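To make the positioning semantics concrete, here's a toy model of a
leaf-page binary search with nextkey and backward/goback flags. This
is only a sketch loosely patterned on the _bt_binsrch contract
described above -- the names, plain int keys, and 0-based indexes are
all simplifications of mine, not the real nbtree code:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of a leaf-page binary search (sketch, not real nbtree
 * code).  Returns the index of the first item >= key when
 * nextkey=false, or the first item > key when nextkey=true.  With
 * backward=true we step back one item at the end, mirroring the
 * patch's OffsetNumberPrev(low) -- so the result can be -1 (one
 * before the first item).
 */
static int
leaf_binsrch(const int *items, int nitems, int key,
             bool nextkey, bool backward)
{
    int low = 0;
    int high = nitems;

    while (low < high)
    {
        int mid = low + (high - low) / 2;
        int cmp = items[mid] - key;     /* stand-in for _bt_compare */

        /* with nextkey=true, items equal to the key are "too small" */
        if (cmp < 0 || (nextkey && cmp == 0))
            low = mid + 1;
        else
            high = mid;
    }

    if (backward)
        return low - 1;                 /* OffsetNumberPrev analogue */
    return low;
}
```

In terms of this sketch, "hundred < 12 order by hundred desc" searches
with the key 12, nextkey=false, and backward=true, landing directly on
the last item with hundred < 12 rather than stepping back after the
fact.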

Thoughts?
-- 
Peter Geoghegan

Attachments

Re: Optimizing "boundary cases" during backward scan B-Tree index descents

From:
Peter Geoghegan
Date:
On Mon, Jun 19, 2023 at 4:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
> We still fall short when it comes to handling boundary cases optimally
> during backwards scans. This is at least true for a subset of
> backwards scans that request "goback=true" processing inside
> _bt_first. Attached patch improves matters here. Again, the simplest
> way of explaining what this does is through a practical worked
> example.

On further reflection, we should go even further in the direction of
teaching _bt_search (and related routines) about the initial leaf
level positioning requirements of backwards scans. In fact, we should
give _bt_search and friends exclusive responsibility for dealing with
initial positioning on behalf of _bt_first. Attached revision shows
how this can work.

This new direction is partly based on the observation that "goback" is
really just a synonym of "backwards scan initial positioning
behavior": all backwards scans already use "goback", while all forward
scans use "!goback". So why shouldn't we just change any "goback"
symbol names to "backward", and be done with it? Then we can move the
"step back one item on leaf level" logic from _bt_first over to
_bt_binsrch. Now _bt_search/_bt_binsrch/_bt_compare own everything to
do with initial positioning.

The main benefit of this approach is that it allows _bt_first to
describe how its various initial positioning strategies work using
high level language, while pretty much leaving the implementation
details up to _bt_search. I've always thought that it was confusing
that the "<= strategy" uses "nextkey=true" -- how can it be "next key"
while also returning keys that directly match those from the insertion
scankey? It only makes sense once you see that the "<= strategy" uses
both "nextkey=true" and "backwards/goback = true" -- something that
the structure in the patch makes clear and explicit.
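For illustration, the four inequality strategies' (nextkey, backward)
pairings can be sketched as below. The first_ge/first_gt and start_*
helper names are made up for this sketch (they are not nbtree
symbols), and linear scans stand in for the real binary search:

```c
#include <assert.h>

/* stand-in for a leaf search with nextkey=false */
static int first_ge(const int *a, int n, int x)
{
    int i = 0;
    while (i < n && a[i] < x)
        i++;
    return i;
}

/* stand-in for a leaf search with nextkey=true */
static int first_gt(const int *a, int n, int x)
{
    int i = 0;
    while (i < n && a[i] <= x)
        i++;
    return i;
}

/* ">= x" forward scan:  nextkey=false, backward=false */
static int start_ge(const int *a, int n, int x) { return first_ge(a, n, x); }
/* ">  x" forward scan:  nextkey=true,  backward=false */
static int start_gt(const int *a, int n, int x) { return first_gt(a, n, x); }
/* "<  x" backward scan: nextkey=false, backward=true */
static int start_lt(const int *a, int n, int x) { return first_ge(a, n, x) - 1; }
/* "<= x" backward scan: nextkey=true,  backward=true */
static int start_le(const int *a, int n, int x) { return first_gt(a, n, x) - 1; }
```

This is why "nextkey=true" on its own looks wrong for the "<=
strategy": it only makes sense combined with the one-item step back
that backward=true implies.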

This revision also adds test coverage for the aforementioned "<=
strategy" (not to be confused with the strategy that we're
optimizing), since it was missing before now. It also adds test
coverage for the "< strategy" (which *is* the strategy affected by the
new optimization). The "< strategy" already has decent enough coverage
-- it just doesn't have coverage that exercises the new optimization.
(Note that I use the term "new optimization" advisedly here -- the new
behavior is closer to "how it's really supposed to work".)

I'm happy with the way that v2 came out, since the new structure makes
a lot more sense to me. The refactoring is probably the most important
aspect of this patch. The new structure seems like it might become
important in a world with skip scan or other new MDAM techniques added
to B-Tree. The important principle here is "think in terms of logical
key space, not in terms of physical pages".

Adding this to the next CF.

--
Peter Geoghegan

Attachments

Re: Optimizing "boundary cases" during backward scan B-Tree index descents

From:
Peter Geoghegan
Date:
On Tue, Jun 20, 2023 at 5:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I'm happy with the way that v2 came out, since the new structure makes
> a lot more sense to me.

Attached is v3, which is a straightforward rebase of v2. v3 is needed
to get the patch to apply cleanly against HEAD - so no real changes
here.


--
Peter Geoghegan

Attachments

Re: Optimizing "boundary cases" during backward scan B-Tree index descents

From:
Peter Geoghegan
Date:
On Mon, Sep 18, 2023 at 4:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v3, which is a straightforward rebase of v2. v3 is needed
> to get the patch to apply cleanly against HEAD - so no real changes
> here.

Attached is v4. Just to keep CFTester happy.

--
Peter Geoghegan

Attachments

Re: Optimizing "boundary cases" during backward scan B-Tree index descents

From:
Matthias van de Meent
Date:
On Sun, 15 Oct 2023 at 22:56, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Sep 18, 2023 at 4:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Attached is v3, which is a straightforward rebase of v2. v3 is needed
> > to get the patch to apply cleanly against HEAD - so no real changes
> > here.
>
> Attached is v4. Just to keep CFTester happy.

> @@ -402,10 +405,27 @@ _bt_binsrch(Relation rel,
> +        if (unlikely(key->backward))
> +            return OffsetNumberPrev(low);
> +
>         return low;

I wonder if this is (or can be) optimized to the mostly equivalent
"return low - (OffsetNumber) key->backward", as that would remove an
"unlikely" branch that isn't very unlikely during page deletion, even
if page deletion by itself is quite rare.
I'm not sure it's worth the additional cognitive overhead, or if there
are any significant performance implications for the hot path.
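The suggested transformation relies on a C99 bool converting to
exactly 0 or 1 under integer conversion. A minimal sketch of the
equivalence, with OffsetNumber simplified to uint16_t and the helper
names mine, not Postgres's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* simplified stand-in for Postgres's OffsetNumber (really uint16) */
typedef uint16_t OffsetNumber;

/* branching form, as in the patch's _bt_binsrch change */
static OffsetNumber
binsrch_result_branch(OffsetNumber low, bool backward)
{
    if (backward)
        return (OffsetNumber) (low - 1);    /* OffsetNumberPrev */
    return low;
}

/* branchless form: bool converts to exactly 0 or 1 */
static OffsetNumber
binsrch_result_branchless(OffsetNumber low, bool backward)
{
    return (OffsetNumber) (low - (OffsetNumber) backward);
}
```

Both forms yield the same result for every (low, backward)
combination; the open question is only whether avoiding the branch is
worth the less obvious code.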

> @@ -318,9 +318,12 @@ _bt_moveright(Relation rel,
> [...]
>  * On a leaf page, _bt_binsrch() returns the OffsetNumber of the first
> [...]
> + * key >= given scankey, or > scankey if nextkey is true for forward scans.
> + * _bt_binsrch() also "steps back" by one item/tuple on the leaf level in the
> + * case of backward scans.  (NOTE: this means it is possible to return a value
> + * that's 1 greater than the number of keys on the leaf page.  It also means
> + * that we can return an item 1 less than the first non-pivot tuple on any
> + * leaf page.)

I think this can use a bit more wordsmithing: the use of "also" with
"steps back" implies we also step back in other cases, which aren't
mentioned. Could you update the wording to be more clear about this?

> @@ -767,7 +787,7 @@ _bt_compare(Relation rel,
> [...]
> -         * Most searches have a scankey that is considered greater than a
> +         * Forward scans have a scankey that is considered greater than a

Although it's not strictly an issue for this patch, the comment here
doesn't describe backward scans in as much detail as forward scans.
The concepts are mostly "do the same but in reverse", but the
difference is noticeable.

Apart from these, I have no further noteworthy comments. Looks good.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)