Discussion: index prefetching
Hi,

At the pgcon unconference I presented a PoC patch adding prefetching for indexes, along with some benchmark results demonstrating the (pretty significant) benefits etc. The feedback was quite positive, so let me share the current patch more widely.

Motivation
----------

Imagine we have a huge table (much larger than RAM), with an index, and that we're doing a regular index scan (e.g. using a btree index). We first walk the index to the leaf page, read the item pointers from the leaf page and then start issuing fetches from the heap.

The index access is usually pretty cheap, because non-leaf pages are very likely cached, so we may only do I/O for the leaf page. But the fetches from the heap are likely very expensive - unless the data is well clustered, we'll do a random I/O for each item pointer. Easily ~200 or more I/O requests per leaf page. The problem is that index scans do these requests synchronously at the moment - we get the next TID, fetch the heap page, process the tuple, continue to the next TID etc. That is slow and can't really leverage the bandwidth of modern storage, which requires longer queues. This patch aims to improve this by async prefetching.

We already do prefetching for bitmap index scans, where the bitmap heap scan prefetches future pages based on effective_io_concurrency. I'm not sure why exactly prefetching was implemented only for bitmap scans, but I suspect the reasoning was that it only helps when there are many matching tuples, and that's what bitmap index scans are for - so it was not worth the implementation effort.

But there are three shortcomings in that logic:

1) It's not clear that the threshold at which prefetching becomes beneficial and the threshold at which we switch to bitmap index scans are the same value. As I'll demonstrate later, the prefetching threshold is indeed much lower (perhaps a couple dozen matching tuples) on large tables.

2) Our estimates / planning are not perfect, so we may easily pick an index scan instead of a bitmap scan. It'd be nice to limit the damage a bit by still prefetching.

3) There are queries that can't do a bitmap scan (at all, or because it's hopelessly inefficient). Consider queries that require ordering, or distance queries with a GiST/SP-GiST index.

Implementation
--------------

When I started looking at this, I only really thought about btree. If you look at BTScanPosData, which is what the index scans use to represent the current leaf page, you'll notice it has "items", which is the array of item pointers (TIDs) that we'll fetch from the heap. Which is exactly the thing we need.

The easiest thing would be to just do prefetching from the btree code. But then I realized there's no particular reason why other index types (except for GIN, which only allows bitmap scans) couldn't do prefetching too. We could have a copy of the logic in each AM, of course, but that seems sloppy and also a violation of layering. After all, bitmap heap scans prefetch from the executor, so the AM seems way too low level.

So I ended up moving most of the prefetching logic up into indexam.c, see the index_prefetch() function. It can't be entirely separate, because each AM represents the current state in a different way (e.g. SpGistScanOpaque and BTScanOpaque are very different).

What I did is introduce an IndexPrefetch struct, which is part of IndexScanDesc, maintaining all the info about prefetching for that particular scan - current/maximum distance, progress, etc. It also contains two AM-specific callbacks (get_range and get_block), which return the valid range of indexes (into the AM's internal item array) and the block number for a given index.
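To give a rough idea of the shape (the field names and signatures below are illustrative only, not the actual definitions from the patch):

/* assumes access/relscan.h, access/sdir.h, storage/block.h */

typedef void (*prefetch_get_range_fn) (IndexScanDesc scan,
                                       ScanDirection direction,
                                       int *start, int *end);

typedef BlockNumber (*prefetch_get_block_fn) (IndexScanDesc scan,
                                              ScanDirection direction,
                                              int index);

typedef struct IndexPrefetch
{
    /* prefetch distance */
    int         prefetchTarget;     /* current prefetch distance */
    int         prefetchMaxTarget;  /* cap, from get_tablespace_io_concurrency() */
    int         prefetchIndex;      /* progress within the current leaf page */

    /* AM-specific callbacks */
    prefetch_get_range_fn get_range;    /* valid range in the AM's item array */
    prefetch_get_block_fn get_block;    /* heap block for a given array index */
} IndexPrefetch;

The idea is that indexam.c only ever talks to the AM through these two callbacks, so it doesn't need to know anything about BTScanOpaque / SpGistScanOpaque internals.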
This mostly does the trick, although index_prefetch() is still called from the amgettuple() functions. That seems wrong - we should call it from indexam.c, right after calling amgettuple.

Problems / Open questions
-------------------------

There are a couple of issues I ran into; I'll try to list them in order of importance (most serious ones first).

1) pairing-heap in GiST / SP-GiST

For most AMs, the index state is pretty trivial - matching items from a single leaf page. Prefetching that is pretty trivial, even if the current API is a bit cumbersome.

Distance queries on GiST and SP-GiST are a problem, though, because those do not just read the pointers into a simple array - the distance ordering requires passing stuff through a pairing-heap :-(

I don't know how to best deal with that, especially not in the simple API. I don't think we can "scan forward" stuff from the pairing heap, so the only idea I have is actually having two pairing-heaps. Or maybe using the pairing heap for prefetching, but stashing the prefetched pointers into an array and then returning stuff from it.

In the patch I simply prefetch items before we add them to the pairing heap, which is good enough for demonstrating the benefits.

2) prefetching from executor

Another question is whether the prefetching shouldn't actually happen even higher - in the executor. That's what Andres suggested during the unconference, and it kinda makes sense. That's where we do prefetching for bitmap heap scans, so why should this happen lower, right?

I'm also not entirely sure the way this interfaces with the AM (through the get_range / get_block callbacks) is very elegant. It did the trick, but it seems a bit cumbersome. I wonder if someone has a better/nicer idea how to do this ...

3) prefetch distance

I think we can do various smart things about the prefetch distance.

The current code does about the same thing bitmap scans do - it starts with distance 0 (no prefetching), and then simply ramps the distance up until it reaches the maximum value from get_tablespace_io_concurrency(), which is either effective_io_concurrency or the per-tablespace value.

I think we could be a bit smarter, and also consider e.g. the estimated number of matching rows (but we shouldn't be too strict, because it's just an estimate). We could also track some statistics for each scan and use that during rescans (think index scan in a nested loop). But the patch doesn't do any of that now.

4) per-leaf prefetching

The code only prefetches items from one leaf page. If the index scan needs to scan multiple (many) leaf pages, we have to process the first leaf page before reading / prefetching the next one.

I think this is an acceptable limitation, certainly for v0. Prefetching across multiple leaf pages seems way more complex (particularly for the cases using a pairing heap), so let's leave this for the future.

5) index-only scans

I'm not sure what to do about index-only scans. On the one hand, the point of IOS is not to read stuff from the heap at all, so why prefetch it. OTOH if there are many allvisible=false pages, we still have to access the heap anyway. And if that happens, this leads to the bizarre situation that an IOS is slower than a regular index scan. But to address this, we'd have to consider the visibility during prefetching.
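To make (3) and (5) a bit more concrete, here is a minimal sketch of what the per-call prefetch step could look like, with the gradual ramp-up and a visibility-map check for the index-only scan case. It is written against existing APIs (PrefetchBuffer, VM_ALL_VISIBLE), but it is only an illustration, not code from the patch - index_prefetch_get_block() is a stand-in for the get_block callback, and the bookkeeping is heavily simplified:

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Prefetch heap blocks up to the current prefetch distance, ramping the
 * distance up on each call until it reaches the tablespace's
 * effective_io_concurrency.  For index-only scans, skip blocks that the
 * visibility map says are all-visible, since we won't read those heap
 * pages at all.
 */
static void
index_prefetch_step(Relation heapRel, IndexPrefetch *prefetch,
                    bool index_only, Buffer *vmbuffer)
{
    /* ramp up gradually (a real heuristic might grow faster, e.g. double) */
    if (prefetch->prefetchTarget < prefetch->prefetchMaxTarget)
        prefetch->prefetchTarget++;

    while (prefetch->prefetchIndex < prefetch->prefetchTarget)
    {
        /* stand-in for the get_block callback described above */
        BlockNumber block = index_prefetch_get_block(prefetch,
                                                     prefetch->prefetchIndex);

        if (block == InvalidBlockNumber)
            break;              /* no more items on this leaf page */

        /* for IOS, only prefetch pages we will actually have to visit */
        if (!index_only || !VM_ALL_VISIBLE(heapRel, block, vmbuffer))
            PrefetchBuffer(heapRel, MAIN_FORKNUM, block);

        prefetch->prefetchIndex++;
    }
}

The VM check is essentially the same test index-only scans already do when deciding whether to fetch the heap tuple, just done at prefetch time.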
Benchmarks
----------

1) OLTP

For OLTP, the benchmark tested different queries with various index types, on data sets constructed to have a certain number of matching rows, forcing different types of query plans (bitmap, index, seqscan).

The data sets are ~34GB, which is much more than the available RAM (8GB).

For example for BTREE, we have a query like this:

  SELECT * FROM btree_test WHERE a = $v

with data matching 1, 10, 100, ..., 100000 rows for each $v. The results look like this:

    rows    bitmapscan     master    patched    seqscan
       1          19.8       20.4       18.8    31875.5
      10          24.4       23.8       23.2    30642.4
     100          27.7       40.0       26.3    31871.3
    1000          45.8      178.0       45.4    30754.1
   10000         171.8     1514.9      174.5    30743.3
  100000        1799.0    15993.3     1777.4    30937.3

This says that the query takes ~31s with a seqscan, 1.8s with a bitmap scan and 16s with an index scan (on master). With the prefetching patch, it takes about ~1.8s, i.e. about the same as the bitmap scan.

I don't know where exactly the plan would switch from index scan to bitmap scan, but the table has ~100M rows, so all of this is tiny. I'd bet most of the cases would do a plain index scan.

For a query with ordering:

  SELECT * FROM btree_test WHERE a >= $v ORDER BY a LIMIT $n

the results look a bit different:

    rows    bitmapscan     master    patched    seqscan
       1       52703.9       19.5       19.5    31145.6
      10       51208.1       22.7       24.7    30983.5
     100       49038.6       39.0       26.3    32085.3
    1000       53760.4      193.9       48.4    31479.4
   10000       56898.4     1600.7      187.5    32064.5
  100000       50975.2    15978.7     1848.9    31587.1

This is a good illustration of a query where bitmapscan is terrible (much worse than seqscan, in fact), and the patch is a massive improvement over master (about an order of magnitude).

Of course, if you only scan a couple of rows, the benefits are much more modest (say 40% for 100 rows, which is still significant).

The results for other index types (HASH, GiST, SP-GiST) follow roughly the same pattern. See the attached PDF for more charts, and [1] for complete results.

Benchmark / TPC-H
-----------------

I ran the 22 queries on a 100GB data set, with parallel query either disabled or enabled, and measured timing (and speedup) for each query. The speedup results look like this (see the attached PDF for details):

    query     serial    parallel
        1       101%         99%
        2       119%        100%
        3       100%         99%
        4       101%        100%
        5       101%        100%
        6        12%         99%
        7       100%        100%
        8        52%         67%
       10       102%        101%
       11       100%         72%
       12       101%        100%
       13       100%        101%
       14        13%        100%
       15       101%        100%
       16        99%         99%
       17        95%        101%
       18       101%        106%
       19        30%         40%
       20        99%        100%
       21       101%        100%
       22       101%        107%

The percentage is (timing patched / timing master), so <100% means faster and >100% means slower.

The different queries are affected depending on the query plan - many queries are close to 100%, which means "no difference". For the serial case, there are about 4 queries that improved a lot (6, 8, 14, 19), while for the parallel case the benefits are somewhat less significant.

My explanation is that either (a) the parallel case used a different plan with fewer index scans or (b) the parallel query does more concurrent I/O simply by using parallel workers. Or maybe both.

There are a couple of regressions too. I believe those are due to doing too much prefetching in some cases, and some of the heuristics mentioned earlier should eliminate most of this, I think.

regards

[1] https://github.com/tvondra/index-prefetch-tests
[2] https://github.com/tvondra/postgres/tree/dev/index-prefetch

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > We already do prefetching for bitmap index scans, where the bitmap heap > scan prefetches future pages based on effective_io_concurrency. I'm not > sure why exactly was prefetching implemented only for bitmap scans, but > I suspect the reasoning was that it only helps when there's many > matching tuples, and that's what bitmap index scans are for. So it was > not worth the implementation effort. I have an educated guess as to why prefetching was limited to bitmap index scans this whole time: it might have been due to issues with ScalarArrayOpExpr quals. Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals "natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions were supported by both index scans and index-only scans -- not just bitmap scans, which could handle ScalarArrayOpExpr quals even without nbtree directly understanding them. The commit was in late 2011, shortly after the introduction of index-only scans -- which seems to have been the real motivation. And so it seems to me that support for ScalarArrayOpExpr was built with bitmap scans and index-only scans in mind. Plain index scan ScalarArrayOpExpr quals do work, but support for them seems kinda perfunctory to me (maybe you can think of a specific counter-example where plain index scans really benefit from ScalarArrayOpExpr, but that doesn't seem particularly relevant to the original motivation). ScalarArrayOpExpr for plain index scans don't really make that much sense right now because there is no heap prefetching in the index scan case, which is almost certainly going to be the major bottleneck there. At the same time, adding useful prefetching for ScalarArrayOpExpr execution more or less requires that you first improve how nbtree executes ScalarArrayOpExpr quals in general. Bear in mind that ScalarArrayOpExpr execution (whether for bitmap index scans or index scans) is related to skip scan/MDAM techniques -- so there are tricky dependencies that need to be considered together. Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to descend the B-Tree for each array constant -- even though in principle we could avoid all that work in cases that happen to have locality. In other words we'll often descend the tree multiple times and land on exactly the same leaf page again and again, without ever noticing that we could have gotten away with only descending the tree once (it'd also be possible to start the next "descent" one level up, not at the root, intelligently reusing some of the work from an initial descent -- but you don't need anything so fancy to greatly improve matters here). This lack of smarts around how many times we call _bt_first() to descend the index is merely a silly annoyance when it happens in btgetbitmap(). We do at least sort and deduplicate the array up-front (inside _bt_sort_array_elements()), so there will be significant locality of access each time we needlessly descend the tree. Importantly, there is no prefetching "pipeline" to mess up in the bitmap index scan case -- since that all happens later on. Not so for the superficially similar (though actually rather different) plain index scan case -- at least not once you add prefetching. If you're uselessly processing the same leaf page multiple times, then there is no way that heap prefetching can notice that it should be batching things up. The context that would allow prefetching to work well isn't really available right now. 
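To illustrate the shape of the problem (conceptual pseudocode only -- this is not the real control flow, which is driven by btgettuple(), and advance_to_next_array_key() is a made-up stand-in for the array-key machinery):

/*
 * Conceptual sketch of how nbtree executes "indexedcol = ANY(ARRAY[...])"
 * today.  The array is sorted and deduplicated up front, but each constant
 * still gets its own full descent via _bt_first() -- even when consecutive
 * constants land on the same leaf page.
 */
do
{
    if (_bt_first(scan, ForwardScanDirection))
    {
        /* return matches for the current array constant, one by one */
        while (_bt_next(scan, ForwardScanDirection))
            ;
    }
} while (advance_to_next_array_key(scan));  /* made-up stand-in */

Each iteration descends from the root and can land on exactly the same leaf page as the previous one, without ever noticing.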
So the plain index scan case is kinda at a gratuitous disadvantage (with prefetching) relative to the bitmap index scan case. Queries with (say) quals with many constants appearing in an "IN()" are both common and particularly likely to benefit from prefetching. I'm not suggesting that you need to address this to get to a committable patch. But you should definitely think about it now. I'm strongly considering working on this problem for 17 anyway, so we may end up collaborating on these aspects of prefetching. Smarter ScalarArrayOpExpr execution for index scans is likely to be quite compelling if it enables heap prefetching. > But there's three shortcomings in logic: > > 1) It's not clear the thresholds for prefetching being beneficial and > switching to bitmap index scans are the same value. And as I'll > demonstrate later, the prefetching threshold is indeed much lower > (perhaps a couple dozen matching tuples) on large tables. As I mentioned during the pgCon unconference session, I really like your framing of the problem; it makes a lot of sense to directly compare an index scan's execution against a very similar bitmap index scan execution -- there is an imaginary continuum between index scan and bitmap index scan. If the details of when and how we scan the index are rather similar in each case, then there is really no reason why the performance shouldn't be fairly similar. I suspect that it will be useful to ask the same question for various specific cases, that you might not have thought about just yet. Things like ScalarArrayOpExpr queries, where bitmap index scans might look like they have a natural advantage due to an inherent need for random heap access in the plain index scan case. It's important to carefully distinguish between cases where plain index scans really are at an inherent disadvantage relative to bitmap index scans (because there really is no getting around the need to access the same heap page many times with an index scan) versus cases that merely *appear* that way. Implementation restrictions that only really affect the plain index scan case (e.g., the lack of a reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing) should be accounted for when assessing the viability of index scan + prefetch over bitmap index scan + prefetch. This is very subtle, but important. That's what I was mostly trying to get at when I talked about testing strategy at the unconference session (this may have been unclear at the time). It could be done in a way that helps you to think about the problem from first principles. It could be really useful as a way of avoiding confusing cases where plain index scan + prefetch does badly due to implementation restrictions, versus cases where it's *inherently* the wrong strategy. And a testing strategy that starts with very basic ideas about what I/O is truly necessary might help you to notice and fix regressions. The difference will never be perfectly crisp, of course (isn't bitmap index scan basically just index scan with a really huge prefetch buffer anyway?), but it still seems like a useful direction to go in. > Implementation > -------------- > > When I started looking at this, I only really thought about btree. If > you look at BTScanPosData, which is what the index scans use to > represent the current leaf page, you'll notice it has "items", which is > the array of item pointers (TIDs) that we'll fetch from the heap. Which > is exactly the thing we need. 
> So I ended up moving most of the prefetching logic up into indexam.c,
> see the index_prefetch() function. It can't be entirely separate,
> because each AM represents the current state in a different way (e.g.
> SpGistScanOpaque and BTScanOpaque are very different).

Maybe you were right to do that, but I'm not entirely sure. Bear in mind that the ScalarArrayOpExpr case already looks like a single index scan whose qual involves an array to the executor, even though nbtree more or less implements it as multiple index scans with plain constant quals (one per unique-ified array element). Index scans whose results can be "OR'd together". Is that a modularity violation? And if so, why? As I've pointed out earlier in this email, we don't do very much with that context right now -- but clearly we should.

In other words, maybe you're right to suspect that doing this in AMs like nbtree is a modularity violation. OTOH, maybe it'll turn out that that's exactly the right place to do it, because that's the only way to make the full context available in one place. I myself struggled with this when I reviewed the skip scan patch. I was sure that Tom wouldn't like the way that the skip-scan patch doubles-down on adding more intelligence/planning around how to execute queries with skippable leading columns. But, it turned out that he saw the merit in it, and basically accepted that general approach. Maybe this will turn out to be a little like that situation, where (counter to intuition) what you really need to do is add a new "layering violation". Sometimes that's the only thing that'll allow the information to flow to the right place. It's tricky.

> 4) per-leaf prefetching
>
> The code only prefetches items from one leaf page. If the index scan
> needs to scan multiple (many) leaf pages, we have to process the first
> leaf page before reading / prefetching the next one.
>
> I think this is an acceptable limitation, certainly for v0. Prefetching
> across multiple leaf pages seems way more complex (particularly for the
> cases using a pairing heap), so let's leave this for the future.

I tend to agree that this sort of thing doesn't need to happen in the first committed version. But FWIW nbtree could be taught to scan multiple index pages and act as if it had just processed them as one single index page -- up to a point. This is at least possible with plain index scans that use MVCC snapshots (though not index-only scans), since we already drop the pin on the leaf page there anyway. AFAICT nothing stops us from teaching nbtree to "lie" to the executor and tell it that we processed 1 leaf page, even though it was actually 5 leaf pages (maybe there would also have to be restrictions for the markpos stuff).

> the results look a bit different:
>
>     rows    bitmapscan     master    patched    seqscan
>        1       52703.9       19.5       19.5    31145.6
>       10       51208.1       22.7       24.7    30983.5
>      100       49038.6       39.0       26.3    32085.3
>     1000       53760.4      193.9       48.4    31479.4
>    10000       56898.4     1600.7      187.5    32064.5
>   100000       50975.2    15978.7     1848.9    31587.1
>
> This is a good illustration of a query where bitmapscan is terrible
> (much worse than seqscan, in fact), and the patch is a massive
> improvement over master (about an order of magnitude).
>
> Of course, if you only scan a couple of rows, the benefits are much more
> modest (say 40% for 100 rows, which is still significant).

Nice! And, it'll be nice to be able to use the kill_prior_tuple optimization in many more cases (possible by teaching the optimizer to favor index scans over bitmap index scans more often).
-- Peter Geoghegan
On 6/8/23 20:56, Peter Geoghegan wrote: > On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> We already do prefetching for bitmap index scans, where the bitmap heap >> scan prefetches future pages based on effective_io_concurrency. I'm not >> sure why exactly was prefetching implemented only for bitmap scans, but >> I suspect the reasoning was that it only helps when there's many >> matching tuples, and that's what bitmap index scans are for. So it was >> not worth the implementation effort. > > I have an educated guess as to why prefetching was limited to bitmap > index scans this whole time: it might have been due to issues with > ScalarArrayOpExpr quals. > > Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals > "natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions > were supported by both index scans and index-only scans -- not just > bitmap scans, which could handle ScalarArrayOpExpr quals even without > nbtree directly understanding them. The commit was in late 2011, > shortly after the introduction of index-only scans -- which seems to > have been the real motivation. And so it seems to me that support for > ScalarArrayOpExpr was built with bitmap scans and index-only scans in > mind. Plain index scan ScalarArrayOpExpr quals do work, but support > for them seems kinda perfunctory to me (maybe you can think of a > specific counter-example where plain index scans really benefit from > ScalarArrayOpExpr, but that doesn't seem particularly relevant to the > original motivation). > I don't think SAOP is the reason. I did a bit of digging in the list archives, and found thread [1], which says: Regardless of what mechanism is used and who is responsible for doing it someone is going to have to figure out which blocks are specifically interesting to prefetch. Bitmap index scans happen to be the easiest since we've already built up a list of blocks we plan to read. Somehow that information has to be pushed to the storage manager to be acted upon. Normal index scans are an even more interesting case but I'm not sure how hard it would be to get that information. It may only be convenient to get the blocks from the last leaf page we looked at, for example. So this suggests we simply started prefetching for the case where the information was readily available, and it'd be harder to do for index scans so that's it. There's a couple more ~2008 threads mentioning prefetching, bitmap scans and even regular index scans (like [2]). None of them even mentions SAOP stuff at all. [1] https://www.postgresql.org/message-id/871wa17vxb.fsf%40oxford.xeocode.com [2] https://www.postgresql.org/message-id/87wsnnz046.fsf%40oxford.xeocode.com > ScalarArrayOpExpr for plain index scans don't really make that much > sense right now because there is no heap prefetching in the index scan > case, which is almost certainly going to be the major bottleneck > there. At the same time, adding useful prefetching for > ScalarArrayOpExpr execution more or less requires that you first > improve how nbtree executes ScalarArrayOpExpr quals in general. Bear > in mind that ScalarArrayOpExpr execution (whether for bitmap index > scans or index scans) is related to skip scan/MDAM techniques -- so > there are tricky dependencies that need to be considered together. 
> > Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to > descend the B-Tree for each array constant -- even though in principle > we could avoid all that work in cases that happen to have locality. In > other words we'll often descend the tree multiple times and land on > exactly the same leaf page again and again, without ever noticing that > we could have gotten away with only descending the tree once (it'd > also be possible to start the next "descent" one level up, not at the > root, intelligently reusing some of the work from an initial descent > -- but you don't need anything so fancy to greatly improve matters > here). > > This lack of smarts around how many times we call _bt_first() to > descend the index is merely a silly annoyance when it happens in > btgetbitmap(). We do at least sort and deduplicate the array up-front > (inside _bt_sort_array_elements()), so there will be significant > locality of access each time we needlessly descend the tree. > Importantly, there is no prefetching "pipeline" to mess up in the > bitmap index scan case -- since that all happens later on. Not so for > the superficially similar (though actually rather different) plain > index scan case -- at least not once you add prefetching. If you're > uselessly processing the same leaf page multiple times, then there is > no way that heap prefetching can notice that it should be batching > things up. The context that would allow prefetching to work well isn't > really available right now. So the plain index scan case is kinda at a > gratuitous disadvantage (with prefetching) relative to the bitmap > index scan case. > > Queries with (say) quals with many constants appearing in an "IN()" > are both common and particularly likely to benefit from prefetching. > I'm not suggesting that you need to address this to get to a > committable patch. But you should definitely think about it now. I'm > strongly considering working on this problem for 17 anyway, so we may > end up collaborating on these aspects of prefetching. Smarter > ScalarArrayOpExpr execution for index scans is likely to be quite > compelling if it enables heap prefetching. > Even if SAOP (probably) wasn't the reason, I think you're right it may be an issue for prefetching, causing regressions. It didn't occur to me before, because I'm not that familiar with the btree code and/or how it deals with SAOP (and didn't really intend to study it too deeply). So if you're planning to work on this for PG17, collaborating on it would be great. For now I plan to just ignore SAOP, or maybe just disabling prefetching for SAOP index scans if it proves to be prone to regressions. That's not great, but at least it won't make matters worse. >> But there's three shortcomings in logic: >> >> 1) It's not clear the thresholds for prefetching being beneficial and >> switching to bitmap index scans are the same value. And as I'll >> demonstrate later, the prefetching threshold is indeed much lower >> (perhaps a couple dozen matching tuples) on large tables. > > As I mentioned during the pgCon unconference session, I really like > your framing of the problem; it makes a lot of sense to directly > compare an index scan's execution against a very similar bitmap index > scan execution -- there is an imaginary continuum between index scan > and bitmap index scan. If the details of when and how we scan the > index are rather similar in each case, then there is really no reason > why the performance shouldn't be fairly similar. 
> I suspect that it will be useful to ask the same question for various
> specific cases, that you might not have thought about just yet. Things
> like ScalarArrayOpExpr queries, where bitmap index scans might look
> like they have a natural advantage due to an inherent need for random
> heap access in the plain index scan case.

Yeah, although all the tests were done with a random table generated like this:

  insert into btree_test select $d * random(), md5(i::text)
    from generate_series(1, $ROWS) s(i)

So it's damn random anyway. Although maybe it's random even for the bitmap case, so maybe if the SAOP had some sort of locality, that'd be an advantage for the bitmap scan. But what would such a table look like? I guess something like this might be a "nice" bad case:

  insert into btree_test select mod(i,100000), md5(i::text)
    from generate_series(1, $ROWS) s(i)

  select * from btree_test where a in (999, 1000, 1001, 1002)

The values are likely colocated on the same heap page, so the bitmap scan is going to do a single prefetch. With an index scan we'll prefetch them repeatedly. I'll give it a try.

> It's important to carefully distinguish between cases where plain
> index scans really are at an inherent disadvantage relative to bitmap
> index scans (because there really is no getting around the need to
> access the same heap page many times with an index scan) versus cases
> that merely *appear* that way. Implementation restrictions that only
> really affect the plain index scan case (e.g., the lack of a
> reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
> should be accounted for when assessing the viability of index scan +
> prefetch over bitmap index scan + prefetch. This is very subtle, but
> important.

I do agree, but what do you mean by "assessing"? Wasn't the agreement at the unconference session that we'd not tweak costing? So ultimately, this does not really affect which scan type we pick. We'll keep making the same planning decisions as today, no?

If we pick an index scan and enable prefetching, causing a regression (e.g. for the SAOP with locality), that'd be bad. But how is that related to the viability of index scans over bitmap index scans?

> That's what I was mostly trying to get at when I talked about testing
> strategy at the unconference session (this may have been unclear at
> the time). It could be done in a way that helps you to think about the
> problem from first principles. It could be really useful as a way of
> avoiding confusing cases where plain index scan + prefetch does badly
> due to implementation restrictions, versus cases where it's
> *inherently* the wrong strategy. And a testing strategy that starts
> with very basic ideas about what I/O is truly necessary might help you
> to notice and fix regressions. The difference will never be perfectly
> crisp, of course (isn't bitmap index scan basically just index scan
> with a really huge prefetch buffer anyway?), but it still seems like a
> useful direction to go in.

I'm all for building a more comprehensive set of test cases - the stuff presented at pgcon was good for demonstration, but it certainly is not enough for testing. The SAOP queries are a great addition, and I also plan to run those queries on different (less random) data sets, etc. We'll probably discover more interesting cases as the patch improves.

>> Implementation
>> --------------
>>
>> When I started looking at this, I only really thought about btree.
If >> you look at BTScanPosData, which is what the index scans use to >> represent the current leaf page, you'll notice it has "items", which is >> the array of item pointers (TIDs) that we'll fetch from the heap. Which >> is exactly the thing we need. > >> So I ended up moving most of the prefetching logic up into indexam.c, >> see the index_prefetch() function. It can't be entirely separate, >> because each AM represents the current state in a different way (e.g. >> SpGistScanOpaque and BTScanOpaque are very different). > > Maybe you were right to do that, but I'm not entirely sure. > > Bear in mind that the ScalarArrayOpExpr case already looks like a > single index scan whose qual involves an array to the executor, even > though nbtree more or less implements it as multiple index scans with > plain constant quals (one per unique-ified array element). Index scans > whose results can be "OR'd together". Is that a modularity violation? > And if so, why? As I've pointed out earlier in this email, we don't do > very much with that context right now -- but clearly we should. > > In other words, maybe you're right to suspect that doing this in AMs > like nbtree is a modularity violation. OTOH, maybe it'll turn out that > that's exactly the right place to do it, because that's the only way > to make the full context available in one place. I myself struggled > with this when I reviewed the skip scan patch. I was sure that Tom > wouldn't like the way that the skip-scan patch doubles-down on adding > more intelligence/planning around how to execute queries with > skippable leading columns. But, it turned out that he saw the merit in > it, and basically accepted that general approach. Maybe this will turn > out to be a little like that situation, where (counter to intuition) > what you really need to do is add a new "layering violation". > Sometimes that's the only thing that'll allow the information to flow > to the right place. It's tricky. > There are two aspects why I think AM is not the right place: - accessing table from index code seems backwards - we already do prefetching from the executor (nodeBitmapHeapscan.c) It feels kinda wrong in hindsight. >> 4) per-leaf prefetching >> >> The code is restricted only prefetches items from one leaf page. If the >> index scan needs to scan multiple (many) leaf pages, we have to process >> the first leaf page first before reading / prefetching the next one. >> >> I think this is acceptable limitation, certainly for v0. Prefetching >> across multiple leaf pages seems way more complex (particularly for the >> cases using pairing heap), so let's leave this for the future. > > I tend to agree that this sort of thing doesn't need to happen in the > first committed version. But FWIW nbtree could be taught to scan > multiple index pages and act as if it had just processed them as one > single index page -- up to a point. This is at least possible with > plain index scans that use MVCC snapshots (though not index-only > scans), since we already drop the pin on the leaf page there anyway. > AFAICT stops us from teaching nbtree to "lie" to the executor and tell > it that we processed 1 leaf page, even though it was actually 5 leaf pages > (maybe there would also have to be restrictions for the markpos stuff). > Yeah, I'm not saying it's impossible, and imagined we might teach nbtree to do that. But it seems like work for future someone. 
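Coming back to the executor-level option for a moment, here is one possible shape, purely hypothetical - index_getnext_tid_batch() and MAX_TIDS_PER_BATCH do not exist, this is only meant to illustrate where the responsibility would sit, similar in spirit to what nodeBitmapHeapscan.c does today:

#include "postgres.h"
#include "access/relscan.h"
#include "nodes/execnodes.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"

#define MAX_TIDS_PER_BATCH  256     /* invented constant */

/*
 * Hypothetical executor-side prefetching: the AM hands back the TIDs it
 * has already read from the current leaf page, the executor prefetches
 * their heap blocks, and then consumes the batch one tuple at a time as
 * it does today.
 */
static void
index_scan_prefetch_batch(IndexScanState *node)
{
    IndexScanDesc scan = node->iss_ScanDesc;
    ItemPointerData tids[MAX_TIDS_PER_BATCH];
    int         ntids;
    int         distance;

    /* hypothetical AM call: remaining TIDs on the current leaf page */
    ntids = index_getnext_tid_batch(scan, ForwardScanDirection,
                                    tids, lengthof(tids));

    /* prefetch at most effective_io_concurrency blocks ahead */
    distance = Min(ntids, effective_io_concurrency);
    for (int i = 0; i < distance; i++)
        PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM,
                       ItemPointerGetBlockNumber(&tids[i]));
}

With an API like that, the AM only exposes the TIDs it already has, and all the prefetch policy (distance, ramp-up, statistics) lives in the executor.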
>> the results look a bit different: >> >> rows bitmapscan master patched seqscan >> 1 52703.9 19.5 19.5 31145.6 >> 10 51208.1 22.7 24.7 30983.5 >> 100 49038.6 39.0 26.3 32085.3 >> 1000 53760.4 193.9 48.4 31479.4 >> 10000 56898.4 1600.7 187.5 32064.5 >> 100000 50975.2 15978.7 1848.9 31587.1 >> >> This is a good illustration of a query where bitmapscan is terrible >> (much worse than seqscan, in fact), and the patch is a massive >> improvement over master (about an order of magnitude). >> >> Of course, if you only scan a couple rows, the benefits are much more >> modest (say 40% for 100 rows, which is still significant). > > Nice! And, it'll be nice to be able to use the kill_prior_tuple > optimization in many more cases (possible by teaching the optimizer to > favor index scans over bitmap index scans more often). > Right, I forgot to mention that benefit. Although, that'd only happen if we actually choose index scans in more places, which I guess would require tweaking the costing model ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Normal index scans are an even more interesting case but I'm not > sure how hard it would be to get that information. It may only be > convenient to get the blocks from the last leaf page we looked at, > for example. > > So this suggests we simply started prefetching for the case where the > information was readily available, and it'd be harder to do for index > scans so that's it. What the exact historical timeline is may not be that important. My emphasis on ScalarArrayOpExpr is partly due to it being a particularly compelling case for both parallel index scan and prefetching, in general. There are many queries that have huge in() lists that naturally benefit a great deal from prefetching. Plus they're common. > Even if SAOP (probably) wasn't the reason, I think you're right it may > be an issue for prefetching, causing regressions. It didn't occur to me > before, because I'm not that familiar with the btree code and/or how it > deals with SAOP (and didn't really intend to study it too deeply). I'm pretty sure that you understand this already, but just in case: ScalarArrayOpExpr doesn't even "get the blocks from the last leaf page" in many important cases. Not really -- not in the sense that you'd hope and expect. We're senselessly processing the same index leaf page multiple times and treating it as a different, independent leaf page. That makes heap prefetching of the kind you're working on utterly hopeless, since it effectively throws away lots of useful context. Obviously that's the fault of nbtree ScalarArrayOpExpr handling, not the fault of your patch. > So if you're planning to work on this for PG17, collaborating on it > would be great. > > For now I plan to just ignore SAOP, or maybe just disabling prefetching > for SAOP index scans if it proves to be prone to regressions. That's not > great, but at least it won't make matters worse. Makes sense, but I hope that it won't come to that. IMV it's actually quite reasonable that you didn't expect to have to think about ScalarArrayOpExpr at all -- it would make a lot of sense if that was already true. But the fact is that it works in a way that's pretty silly and naive right now, which will impact prefetching. I wasn't really thinking about regressions, though. I was actually more concerned about missing opportunities to get the most out of prefetching. ScalarArrayOpExpr really matters here. > I guess something like this might be a "nice" bad case: > > insert into btree_test mod(i,100000), md5(i::text) > from generate_series(1, $ROWS) s(i) > > select * from btree_test where a in (999, 1000, 1001, 1002) > > The values are likely colocated on the same heap page, the bitmap scan > is going to do a single prefetch. With index scan we'll prefetch them > repeatedly. I'll give it a try. This is the sort of thing that I was thinking of. What are the conditions under which bitmap index scan starts to make sense? Why is the break-even point whatever it is in each case, roughly? And, is it actually because of laws-of-physics level trade-off? Might it not be due to implementation-level issues that are much less fundamental? In other words, might it actually be that we're just doing something stoopid in the case of plain index scans? Something that is just papered-over by bitmap index scans right now? 
I see that your patch has logic that avoids repeated prefetching of the same block -- plus you have comments that wonder about going further by adding a "small lru array" in your new index_prefetch() function. I asked you about this during the unconference presentation. But I think that my understanding of the situation was slightly different to yours. That's relevant here. I wonder if you should go further than this, by actually sorting the items that you need to fetch as part of processing a given leaf page (I said this at the unconference, you may recall). Why should we *ever* pin/access the same heap page more than once per leaf page processed per index scan? Nothing stops us from returning the tuples to the executor in the original logical/index-wise order, despite having actually accessed each leaf page's pointed-to heap pages slightly out of order (with the aim of avoiding extra pin/unpin traffic that isn't truly necessary). We can sort the heap TIDs in scratch memory, then do our actual prefetching + heap access, and then restore the original order before returning anything. This is conceptually a "mini bitmap index scan", though one that takes place "inside" a plain index scan, as it processes one particular leaf page. That's the kind of design that "plain index scan vs bitmap index scan as a continuum" leads me to (a little like the continuum between nested loop joins, block nested loop joins, and merge joins). I bet it would be practical to do things this way, and help a lot with some kinds of queries. It might even be simpler than avoiding excessive prefetching using an LRU cache thing. I'm talking about problems that exist today, without your patch. I'll show a concrete example of the kind of index/index scan that might be affected. Attached is an extract of the server log when the regression tests ran against a server patched to show custom instrumentation. The log output shows exactly what's going on with one particular nbtree opportunistic deletion (my point has nothing to do with deletion, but it happens to be convenient to make my point in this fashion). This specific example involves deletion of tuples from the system catalog index "pg_type_typname_nsp_index". There is nothing very atypical about it; it just shows a certain kind of heap fragmentation that's probably very common. Imagine a plain index scan involving a query along the lines of "select * from pg_type where typname like 'part%' ", or similar. This query runs an instant before the example LD_DEAD-bit-driven opportunistic deletion (a "simple deletion" in nbtree parlance) took place. You'll be able to piece together from the log output that there would only be about 4 heap blocks involved with such a query. Ideally, our hypothetical index scan would pin each buffer/heap page exactly once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all, we're talking about a fairly selective query here, that only needs to scan precisely one leaf page (I verified this part too) -- so why wouldn't we expect "index scan parity"? While there is significant clustering on this example leaf page/key space, heap TID is not *perfectly* correlated with the logical/keyspace order of the index -- which can have outsized consequences. Notice that some heap blocks are non-contiguous relative to logical/keyspace/index scan/index page offset number order. We'll end up pinning each of the 4 or so heap pages more than once (sometimes several times each), when in principle we could have pinned each heap page exactly once. 
In other words, there is way too much of a difference between the case where the tuples we scan are *almost* perfectly clustered (which is what you see in my example) and the case where they're exactly perfectly clustered. In other other words, there is way too much of a difference between plain index scan, and bitmap index scan. (What I'm saying here is only true because this is a composite index and our query uses "like", returning rows matches a prefix -- if our index was on the column "typname" alone and we used a simple equality condition in our query then the Postgres 12 nbtree work would be enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect that there are still relatively many important cases where we perform extra PinBuffer()/UnpinBuffer() calls during plain index scans that only touch one leaf page anyway.) Obviously we should expect bitmap index scans to have a natural advantage over plain index scans whenever there is little or no correlation -- that's clear. But that's not what we see here -- we're way too sensitive to minor imperfections in clustering that are naturally present on some kinds of leaf pages. The potential difference in pin/unpin traffic (relative to the bitmap index scan case) seems pathological to me. Ideally, we wouldn't have these kinds of differences at all. It's going to disrupt usage_count on the buffers. > > It's important to carefully distinguish between cases where plain > > index scans really are at an inherent disadvantage relative to bitmap > > index scans (because there really is no getting around the need to > > access the same heap page many times with an index scan) versus cases > > that merely *appear* that way. Implementation restrictions that only > > really affect the plain index scan case (e.g., the lack of a > > reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing) > > should be accounted for when assessing the viability of index scan + > > prefetch over bitmap index scan + prefetch. This is very subtle, but > > important. > > > > I do agree, but what do you mean by "assessing"? I mean performance validation. There ought to be a theoretical model that describes the relationship between index scan and bitmap index scan, that has actual predictive power in the real world, across a variety of different cases. Something that isn't sensitive to the current phase of the moon (e.g., heap fragmentation along the lines of my pg_type_typname_nsp_index log output). I particularly want to avoid nasty discontinuities that really make no sense. > Wasn't the agreement at > the unconference session was we'd not tweak costing? So ultimately, this > does not really affect which scan type we pick. We'll keep doing the > same planning decisions as today, no? I'm not really talking about tweaking the costing. What I'm saying is that we really should expect index scans to behave similarly to bitmap index scans at runtime, for queries that really don't have much to gain from using a bitmap heap scan (queries that may or may not also benefit from prefetching). There are several reasons why this makes sense to me. One reason is that it makes tweaking the actual costing easier later on. Also, your point about plan robustness was a good one. If we make the wrong choice about index scan vs bitmap index scan, and the consequences aren't so bad, that's a very useful enhancement in itself. The most important reason of all may just be to build confidence in the design. I'm interested in understanding when and how prefetching stops helping. 
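To make the "mini bitmap index scan" idea from earlier in this email a bit more concrete, here is a minimal sketch (helper names invented; not from the patch or from nbtree): stash one leaf page's TIDs, sort a copy by heap block, prefetch each distinct block once, and still return tuples in the original index order.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

static int
tid_cmp_by_block(const void *a, const void *b)
{
    BlockNumber ba = ItemPointerGetBlockNumber((const ItemPointerData *) a);
    BlockNumber bb = ItemPointerGetBlockNumber((const ItemPointerData *) b);

    if (ba < bb)
        return -1;
    if (ba > bb)
        return 1;
    return 0;
}

/*
 * Illustrative sketch of a "mini bitmap index scan": process one leaf
 * page's TIDs so that each heap block is prefetched (and could be pinned)
 * only once, while tuples are still returned in index order.
 */
static void
prefetch_leaf_page_blocks(Relation heapRel,
                          const ItemPointerData *items, int nitems)
{
    ItemPointerData *sorted = palloc(nitems * sizeof(ItemPointerData));
    BlockNumber last = InvalidBlockNumber;

    memcpy(sorted, items, nitems * sizeof(ItemPointerData));
    qsort(sorted, nitems, sizeof(ItemPointerData), tid_cmp_by_block);

    /* one prefetch per distinct heap block, in block order */
    for (int i = 0; i < nitems; i++)
    {
        BlockNumber blk = ItemPointerGetBlockNumber(&sorted[i]);

        if (blk != last)
            PrefetchBuffer(heapRel, MAIN_FORKNUM, blk);
        last = blk;
    }

    pfree(sorted);

    /* the scan itself still walks "items" in the original index order */
}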
> I'm all for building a more comprehensive set of test cases - the stuff > presented at pgcon was good for demonstration, but it certainly is not > enough for testing. The SAOP queries are a great addition, I also plan > to run those queries on different (less random) data sets, etc. We'll > probably discover more interesting cases as the patch improves. Definitely. > There are two aspects why I think AM is not the right place: > > - accessing table from index code seems backwards > > - we already do prefetching from the executor (nodeBitmapHeapscan.c) > > It feels kinda wrong in hindsight. I'm willing to accept that we should do it the way you've done it in the patch provisionally. It's complicated enough that it feels like I should reserve the right to change my mind. > >> I think this is acceptable limitation, certainly for v0. Prefetching > >> across multiple leaf pages seems way more complex (particularly for the > >> cases using pairing heap), so let's leave this for the future. > Yeah, I'm not saying it's impossible, and imagined we might teach nbtree > to do that. But it seems like work for future someone. Right. You probably noticed that this is another case where we'd be making index scans behave more like bitmap index scans (perhaps even including the downsides for kill_prior_tuple that accompany not processing each leaf page inline). There is probably a point where that ceases to be sensible, but I don't know what that point is. They're way more similar than we seem to imagine. -- Peter Geoghegan
Attachments
Hi, On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote: > At pgcon unconference I presented a PoC patch adding prefetching for > indexes, along with some benchmark results demonstrating the (pretty > significant) benefits etc. The feedback was quite positive, so let me > share the current patch more widely. I'm really excited about this work. > 1) pairing-heap in GiST / SP-GiST > > For most AMs, the index state is pretty trivial - matching items from a > single leaf page. Prefetching that is pretty trivial, even if the > current API is a bit cumbersome. > > Distance queries on GiST and SP-GiST are a problem, though, because > those do not just read the pointers into a simple array, as the distance > ordering requires passing stuff through a pairing-heap :-( > > I don't know how to best deal with that, especially not in the simple > API. I don't think we can "scan forward" stuff from the pairing heap, so > the only idea I have is actually having two pairing-heaps. Or maybe > using the pairing heap for prefetching, but stashing the prefetched > pointers into an array and then returning stuff from it. > > In the patch I simply prefetch items before we add them to the pairing > heap, which is good enough for demonstrating the benefits. I think it'd be perfectly fair to just not tackle distance queries for now. > 2) prefetching from executor > > Another question is whether the prefetching shouldn't actually happen > even higher - in the executor. That's what Andres suggested during the > unconference, and it kinda makes sense. That's where we do prefetching > for bitmap heap scans, so why should this happen lower, right? Yea. I think it also provides potential for further optimizations in the future to do it at that layer. One thing I have been wondering around this is whether we should not have split the code for IOS and plain indexscans... > 4) per-leaf prefetching > > The code is restricted only prefetches items from one leaf page. If the > index scan needs to scan multiple (many) leaf pages, we have to process > the first leaf page first before reading / prefetching the next one. > > I think this is acceptable limitation, certainly for v0. Prefetching > across multiple leaf pages seems way more complex (particularly for the > cases using pairing heap), so let's leave this for the future. Hm. I think that really depends on the shape of the API we end up with. If we move the responsibility more twoards to the executor, I think it very well could end up being just as simple to prefetch across index pages. > 5) index-only scans > > I'm not sure what to do about index-only scans. On the one hand, the > point of IOS is not to read stuff from the heap at all, so why prefetch > it. OTOH if there are many allvisible=false pages, we still have to > access that. And if that happens, this leads to the bizarre situation > that IOS is slower than regular index scan. But to address this, we'd > have to consider the visibility during prefetching. That should be easy to do, right? > Benchmark / TPC-H > ----------------- > > I ran the 22 queries on 100GB data set, with parallel query either > disabled or enabled. And I measured timing (and speedup) for each query. 
> The speedup results look like this (see the attached PDF for details):
>
>     query     serial    parallel
>         1       101%         99%
>         2       119%        100%
>         3       100%         99%
>         4       101%        100%
>         5       101%        100%
>         6        12%         99%
>         7       100%        100%
>         8        52%         67%
>        10       102%        101%
>        11       100%         72%
>        12       101%        100%
>        13       100%        101%
>        14        13%        100%
>        15       101%        100%
>        16        99%         99%
>        17        95%        101%
>        18       101%        106%
>        19        30%         40%
>        20        99%        100%
>        21       101%        100%
>        22       101%        107%
>
> The percentage is (timing patched / timing master), so <100% means
> faster and >100% means slower.
>
> The different queries are affected depending on the query plan - many
> queries are close to 100%, which means "no difference". For the serial
> case, there are about 4 queries that improved a lot (6, 8, 14, 19),
> while for the parallel case the benefits are somewhat less significant.
>
> My explanation is that either (a) the parallel case used a different
> plan with fewer index scans or (b) the parallel query does more
> concurrent I/O simply by using parallel workers. Or maybe both.
>
> There are a couple of regressions too. I believe those are due to doing
> too much prefetching in some cases, and some of the heuristics
> mentioned earlier should eliminate most of this, I think.

I'm a bit confused by some of these numbers. How can OS-level prefetching lead to such massive speedups in the already cached case, e.g. in tpch q06 and q08? Unless I missed what "xeon / cached (speedup)" indicates?

I think it'd be good to run a performance comparison of the unpatched vs patched cases, with prefetching disabled for both. It's possible that something in the patch caused unintended changes (say spilling during a hashagg, due to larger struct sizes).

Greetings,

Andres Freund
On Thu, Jun 8, 2023 at 4:38 PM Peter Geoghegan <pg@bowt.ie> wrote: > This is conceptually a "mini bitmap index scan", though one that takes > place "inside" a plain index scan, as it processes one particular leaf > page. That's the kind of design that "plain index scan vs bitmap index > scan as a continuum" leads me to (a little like the continuum between > nested loop joins, block nested loop joins, and merge joins). I bet it > would be practical to do things this way, and help a lot with some > kinds of queries. It might even be simpler than avoiding excessive > prefetching using an LRU cache thing. I'll now give a simpler (though less realistic) example of a case where "mini bitmap index scan" would be expected to help index scans in general, and prefetching during index scans in particular. Something very simple: create table bitmap_parity_test(randkey int4, filler text); create index on bitmap_parity_test (randkey); insert into bitmap_parity_test select (random()*1000), repeat('filler',10) from generate_series(1,250) i; This gives me a table with 4 pages, and an index with 2 pages. The following query selects about half of the rows from the table: select * from bitmap_parity_test where randkey < 500; If I force the query to use a bitmap index scan, I see that the total number of buffers hit is exactly as expected (according to EXPLAIN(ANALYZE,BUFFERS), that is): there are 5 buffers/pages hit. We need to access every single heap page once, and we need to access the only leaf page in the index once. I'm sure that you know where I'm going with this already. I'll force the same query to use a plain index scan, and get a very different result. Now EXPLAIN(ANALYZE,BUFFERS) shows that there are a total of 89 buffers hit -- 88 of which must just be the same 5 heap pages, again and again. That's just silly. It's probably not all that much slower, but it's not helping things. And it's likely that this effect interferes with the prefetching in your patch. Obviously you can come up with a variant of this test case where bitmap index scan does way fewer buffer accesses in a way that really makes sense -- that's not in question. This is a fairly selective index scan, since it only touches one index page -- and yet we still see this difference. (Anybody pedantic enough to want to dispute whether or not this index scan counts as "selective" should run "insert into bitmap_parity_test select i, repeat('actshually',10) from generate_series(2000,1e5) i" before running the "randkey < 500" query, which will make the index much larger without changing any of the details of how the query pins pages -- non-pedants should just skip that step.) -- Peter Geoghegan
On 6/9/23 02:06, Andres Freund wrote: > Hi, > > On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote: >> At pgcon unconference I presented a PoC patch adding prefetching for >> indexes, along with some benchmark results demonstrating the (pretty >> significant) benefits etc. The feedback was quite positive, so let me >> share the current patch more widely. > > I'm really excited about this work. > > >> 1) pairing-heap in GiST / SP-GiST >> >> For most AMs, the index state is pretty trivial - matching items from a >> single leaf page. Prefetching that is pretty trivial, even if the >> current API is a bit cumbersome. >> >> Distance queries on GiST and SP-GiST are a problem, though, because >> those do not just read the pointers into a simple array, as the distance >> ordering requires passing stuff through a pairing-heap :-( >> >> I don't know how to best deal with that, especially not in the simple >> API. I don't think we can "scan forward" stuff from the pairing heap, so >> the only idea I have is actually having two pairing-heaps. Or maybe >> using the pairing heap for prefetching, but stashing the prefetched >> pointers into an array and then returning stuff from it. >> >> In the patch I simply prefetch items before we add them to the pairing >> heap, which is good enough for demonstrating the benefits. > > I think it'd be perfectly fair to just not tackle distance queries for now. > My concern is that if we cut this from v0 entirely, we'll end up with an API that'll not be suitable for adding distance queries later. > >> 2) prefetching from executor >> >> Another question is whether the prefetching shouldn't actually happen >> even higher - in the executor. That's what Andres suggested during the >> unconference, and it kinda makes sense. That's where we do prefetching >> for bitmap heap scans, so why should this happen lower, right? > > Yea. I think it also provides potential for further optimizations in the > future to do it at that layer. > > One thing I have been wondering around this is whether we should not have > split the code for IOS and plain indexscans... > Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or did you mean something else? > >> 4) per-leaf prefetching >> >> The code is restricted only prefetches items from one leaf page. If the >> index scan needs to scan multiple (many) leaf pages, we have to process >> the first leaf page first before reading / prefetching the next one. >> >> I think this is acceptable limitation, certainly for v0. Prefetching >> across multiple leaf pages seems way more complex (particularly for the >> cases using pairing heap), so let's leave this for the future. > > Hm. I think that really depends on the shape of the API we end up with. If we > move the responsibility more twoards to the executor, I think it very well > could end up being just as simple to prefetch across index pages. > Maybe. I'm open to that idea if you have idea how to shape the API to make this possible (although perhaps not in v0). > >> 5) index-only scans >> >> I'm not sure what to do about index-only scans. On the one hand, the >> point of IOS is not to read stuff from the heap at all, so why prefetch >> it. OTOH if there are many allvisible=false pages, we still have to >> access that. And if that happens, this leads to the bizarre situation >> that IOS is slower than regular index scan. But to address this, we'd >> have to consider the visibility during prefetching. > > That should be easy to do, right? 
> It doesn't seem particularly complicated (famous last words), and we need to do the VM checks anyway so it seems like it wouldn't add a lot of overhead either > > >> Benchmark / TPC-H >> ----------------- >> >> I ran the 22 queries on 100GB data set, with parallel query either >> disabled or enabled. And I measured timing (and speedup) for each query. >> The speedup results look like this (see the attached PDF for details): >> >> query serial parallel >> 1 101% 99% >> 2 119% 100% >> 3 100% 99% >> 4 101% 100% >> 5 101% 100% >> 6 12% 99% >> 7 100% 100% >> 8 52% 67% >> 10 102% 101% >> 11 100% 72% >> 12 101% 100% >> 13 100% 101% >> 14 13% 100% >> 15 101% 100% >> 16 99% 99% >> 17 95% 101% >> 18 101% 106% >> 19 30% 40% >> 20 99% 100% >> 21 101% 100% >> 22 101% 107% >> >> The percentage is (timing patched / master, so <100% means faster, >100% >> means slower). >> >> The different queries are affected depending on the query plan - many >> queries are close to 100%, which means "no difference". For the serial >> case, there are about 4 queries that improved a lot (6, 8, 14, 19), >> while for the parallel case the benefits are somewhat less significant. >> >> My explanation is that either (a) parallel case used a different plan >> with fewer index scans or (b) the parallel query does more concurrent >> I/O simply by using parallel workers. Or maybe both. >> >> There are a couple regressions too, I believe those are due to doing too >> much prefetching in some cases, and some of the heuristics mentioned >> earlier should eliminate most of this, I think. > > I'm a bit confused by some of these numbers. How can OS-level prefetching lead > to massive prefetching in the alread cached case, e.g. in tpch q06 and q08? > Unless I missed what "xeon / cached (speedup)" indicates? > I forgot to explain what "cached" means in the TPC-H case. It means second execution of the query, so you can imagine it like this: for q in `seq 1 22`; do 1. drop caches and restart postgres 2. run query $q -> uncached 3. run query $q -> cached done So the second execution has a chance of having data in memory - but maybe not all, because this is a 100GB data set (so ~200GB after loading), but the machine only has 64GB of RAM. I think a likely explanation is some of the data wasn't actually in memory, so prefetching still did something. > I think it'd be good to run a performance comparison of the unpatched vs > patched cases, with prefetching disabled for both. It's possible that > something in the patch caused unintended changes (say spilling during a > hashagg, due to larger struct sizes). > That's certainly a good idea. I'll do that in the next round of tests. I also plan to do a test on data set that fits into RAM, to test "properly cached" case. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6/9/23 01:38, Peter Geoghegan wrote: > On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> Normal index scans are an even more interesting case but I'm not >> sure how hard it would be to get that information. It may only be >> convenient to get the blocks from the last leaf page we looked at, >> for example. >> >> So this suggests we simply started prefetching for the case where the >> information was readily available, and it'd be harder to do for index >> scans so that's it. > > What the exact historical timeline is may not be that important. My > emphasis on ScalarArrayOpExpr is partly due to it being a particularly > compelling case for both parallel index scan and prefetching, in > general. There are many queries that have huge in() lists that > naturally benefit a great deal from prefetching. Plus they're common. > Did you mean parallel index scan or bitmap index scan? But yeah, I get the point that SAOP queries are an interesting example of queries to explore. I'll add some to the next round of tests. >> Even if SAOP (probably) wasn't the reason, I think you're right it may >> be an issue for prefetching, causing regressions. It didn't occur to me >> before, because I'm not that familiar with the btree code and/or how it >> deals with SAOP (and didn't really intend to study it too deeply). > > I'm pretty sure that you understand this already, but just in case: > ScalarArrayOpExpr doesn't even "get the blocks from the last leaf > page" in many important cases. Not really -- not in the sense that > you'd hope and expect. We're senselessly processing the same index > leaf page multiple times and treating it as a different, independent > leaf page. That makes heap prefetching of the kind you're working on > utterly hopeless, since it effectively throws away lots of useful > context. Obviously that's the fault of nbtree ScalarArrayOpExpr > handling, not the fault of your patch. > I think I understand, although maybe my mental model is wrong. I agree it seems inefficient, but I'm not sure why would it make prefetching hopeless. Sure, it puts index scans at a disadvantage (compared to bitmap scans), but it we pick index scan it should still be an improvement, right? I guess I need to do some testing on a range of data sets / queries, and see how it works in practice. >> So if you're planning to work on this for PG17, collaborating on it >> would be great. >> >> For now I plan to just ignore SAOP, or maybe just disabling prefetching >> for SAOP index scans if it proves to be prone to regressions. That's not >> great, but at least it won't make matters worse. > > Makes sense, but I hope that it won't come to that. > > IMV it's actually quite reasonable that you didn't expect to have to > think about ScalarArrayOpExpr at all -- it would make a lot of sense > if that was already true. But the fact is that it works in a way > that's pretty silly and naive right now, which will impact > prefetching. I wasn't really thinking about regressions, though. I was > actually more concerned about missing opportunities to get the most > out of prefetching. ScalarArrayOpExpr really matters here. > OK >> I guess something like this might be a "nice" bad case: >> >> insert into btree_test mod(i,100000), md5(i::text) >> from generate_series(1, $ROWS) s(i) >> >> select * from btree_test where a in (999, 1000, 1001, 1002) >> >> The values are likely colocated on the same heap page, the bitmap scan >> is going to do a single prefetch. 
With index scan we'll prefetch them >> repeatedly. I'll give it a try. > > This is the sort of thing that I was thinking of. What are the > conditions under which bitmap index scan starts to make sense? Why is > the break-even point whatever it is in each case, roughly? And, is it > actually because of laws-of-physics level trade-off? Might it not be > due to implementation-level issues that are much less fundamental? In > other words, might it actually be that we're just doing something > stoopid in the case of plain index scans? Something that is just > papered-over by bitmap index scans right now? > Yeah, that's partially why I do this kind of testing on a wide range of synthetic data sets - to find cases that behave in unexpected way (say, seem like they should improve but don't). > I see that your patch has logic that avoids repeated prefetching of > the same block -- plus you have comments that wonder about going > further by adding a "small lru array" in your new index_prefetch() > function. I asked you about this during the unconference presentation. > But I think that my understanding of the situation was slightly > different to yours. That's relevant here. > > I wonder if you should go further than this, by actually sorting the > items that you need to fetch as part of processing a given leaf page > (I said this at the unconference, you may recall). Why should we > *ever* pin/access the same heap page more than once per leaf page > processed per index scan? Nothing stops us from returning the tuples > to the executor in the original logical/index-wise order, despite > having actually accessed each leaf page's pointed-to heap pages > slightly out of order (with the aim of avoiding extra pin/unpin > traffic that isn't truly necessary). We can sort the heap TIDs in > scratch memory, then do our actual prefetching + heap access, and then > restore the original order before returning anything. > I think that's possible, and I thought about that a bit (not just for btree, but especially for the distance queries on GiST). But I don't have a good idea if this would be 1% or 50% improvement, and I was concerned it might easily lead to regressions if we don't actually need all the tuples. I mean, imagine we have TIDs [T1, T2, T3, T4, T5, T6] Maybe T1, T5, T6 are from the same page, so per your proposal we might reorder and prefetch them in this order: [T1, T5, T6, T2, T3, T4] But maybe we only need [T1, T2] because of a LIMIT, and the extra work we did on processing T5, T6 is wasted. > This is conceptually a "mini bitmap index scan", though one that takes > place "inside" a plain index scan, as it processes one particular leaf > page. That's the kind of design that "plain index scan vs bitmap index > scan as a continuum" leads me to (a little like the continuum between > nested loop joins, block nested loop joins, and merge joins). I bet it > would be practical to do things this way, and help a lot with some > kinds of queries. It might even be simpler than avoiding excessive > prefetching using an LRU cache thing. > > I'm talking about problems that exist today, without your patch. > > I'll show a concrete example of the kind of index/index scan that > might be affected. > > Attached is an extract of the server log when the regression tests ran > against a server patched to show custom instrumentation. 
The log > output shows exactly what's going on with one particular nbtree > opportunistic deletion (my point has nothing to do with deletion, but > it happens to be convenient to make my point in this fashion). This > specific example involves deletion of tuples from the system catalog > index "pg_type_typname_nsp_index". There is nothing very atypical > about it; it just shows a certain kind of heap fragmentation that's > probably very common. > > Imagine a plain index scan involving a query along the lines of > "select * from pg_type where typname like 'part%' ", or similar. This > query runs an instant before the example LD_DEAD-bit-driven > opportunistic deletion (a "simple deletion" in nbtree parlance) took > place. You'll be able to piece together from the log output that there > would only be about 4 heap blocks involved with such a query. Ideally, > our hypothetical index scan would pin each buffer/heap page exactly > once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all, > we're talking about a fairly selective query here, that only needs to > scan precisely one leaf page (I verified this part too) -- so why > wouldn't we expect "index scan parity"? > > While there is significant clustering on this example leaf page/key > space, heap TID is not *perfectly* correlated with the > logical/keyspace order of the index -- which can have outsized > consequences. Notice that some heap blocks are non-contiguous > relative to logical/keyspace/index scan/index page offset number order. > > We'll end up pinning each of the 4 or so heap pages more than once > (sometimes several times each), when in principle we could have pinned > each heap page exactly once. In other words, there is way too much of > a difference between the case where the tuples we scan are *almost* > perfectly clustered (which is what you see in my example) and the case > where they're exactly perfectly clustered. In other other words, there > is way too much of a difference between plain index scan, and bitmap > index scan. > > (What I'm saying here is only true because this is a composite index > and our query uses "like", returning rows matches a prefix -- if our > index was on the column "typname" alone and we used a simple equality > condition in our query then the Postgres 12 nbtree work would be > enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect > that there are still relatively many important cases where we perform > extra PinBuffer()/UnpinBuffer() calls during plain index scans that > only touch one leaf page anyway.) > > Obviously we should expect bitmap index scans to have a natural > advantage over plain index scans whenever there is little or no > correlation -- that's clear. But that's not what we see here -- we're > way too sensitive to minor imperfections in clustering that are > naturally present on some kinds of leaf pages. The potential > difference in pin/unpin traffic (relative to the bitmap index scan > case) seems pathological to me. Ideally, we wouldn't have these kinds > of differences at all. It's going to disrupt usage_count on the > buffers. > I'm not sure I understand all the nuance here, but the thing I take away is to add tests with different levels of correlation, and probably also some multi-column indexes. 
>>> It's important to carefully distinguish between cases where plain >>> index scans really are at an inherent disadvantage relative to bitmap >>> index scans (because there really is no getting around the need to >>> access the same heap page many times with an index scan) versus cases >>> that merely *appear* that way. Implementation restrictions that only >>> really affect the plain index scan case (e.g., the lack of a >>> reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing) >>> should be accounted for when assessing the viability of index scan + >>> prefetch over bitmap index scan + prefetch. This is very subtle, but >>> important. >>> >> >> I do agree, but what do you mean by "assessing"? > > I mean performance validation. There ought to be a theoretical model > that describes the relationship between index scan and bitmap index > scan, that has actual predictive power in the real world, across a > variety of different cases. Something that isn't sensitive to the > current phase of the moon (e.g., heap fragmentation along the lines of > my pg_type_typname_nsp_index log output). I particularly want to avoid > nasty discontinuities that really make no sense. > >> Wasn't the agreement at >> the unconference session was we'd not tweak costing? So ultimately, this >> does not really affect which scan type we pick. We'll keep doing the >> same planning decisions as today, no? > > I'm not really talking about tweaking the costing. What I'm saying is > that we really should expect index scans to behave similarly to bitmap > index scans at runtime, for queries that really don't have much to > gain from using a bitmap heap scan (queries that may or may not also > benefit from prefetching). There are several reasons why this makes > sense to me. > > One reason is that it makes tweaking the actual costing easier later > on. Also, your point about plan robustness was a good one. If we make > the wrong choice about index scan vs bitmap index scan, and the > consequences aren't so bad, that's a very useful enhancement in > itself. > > The most important reason of all may just be to build confidence in > the design. I'm interested in understanding when and how prefetching > stops helping. > Agreed. >> I'm all for building a more comprehensive set of test cases - the stuff >> presented at pgcon was good for demonstration, but it certainly is not >> enough for testing. The SAOP queries are a great addition, I also plan >> to run those queries on different (less random) data sets, etc. We'll >> probably discover more interesting cases as the patch improves. > > Definitely. > >> There are two aspects why I think AM is not the right place: >> >> - accessing table from index code seems backwards >> >> - we already do prefetching from the executor (nodeBitmapHeapscan.c) >> >> It feels kinda wrong in hindsight. > > I'm willing to accept that we should do it the way you've done it in > the patch provisionally. It's complicated enough that it feels like I > should reserve the right to change my mind. > >>>> I think this is acceptable limitation, certainly for v0. Prefetching >>>> across multiple leaf pages seems way more complex (particularly for the >>>> cases using pairing heap), so let's leave this for the future. > >> Yeah, I'm not saying it's impossible, and imagined we might teach nbtree >> to do that. But it seems like work for future someone. > > Right. 
You probably noticed that this is another case where we'd be > making index scans behave more like bitmap index scans (perhaps even > including the downsides for kill_prior_tuple that accompany not > processing each leaf page inline). There is probably a point where > that ceases to be sensible, but I don't know what that point is. > They're way more similar than we seem to imagine. > OK. Thanks for all the comments. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jun 9, 2023 at 3:45 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > What the exact historical timeline is may not be that important. My > > emphasis on ScalarArrayOpExpr is partly due to it being a particularly > > compelling case for both parallel index scan and prefetching, in > > general. There are many queries that have huge in() lists that > > naturally benefit a great deal from prefetching. Plus they're common. > > > > Did you mean parallel index scan or bitmap index scan? I meant parallel index scan (also parallel bitmap index scan). Note that nbtree parallel index scans have special ScalarArrayOpExpr handling code. ScalarArrayOpExpr is kind of special -- it is simultaneously one big index scan (to the executor), and lots of small index scans (to nbtree). Unlike the queries that you've looked at so far, which really only have one plausible behavior at execution time, there are many ways that ScalarArrayOpExpr index scans can be executed at runtime -- some much faster than others. The nbtree implementation can in principle reorder how it processes ranges from the key space (i.e. each range of array elements) with significant flexibility. > I think I understand, although maybe my mental model is wrong. I agree > it seems inefficient, but I'm not sure why would it make prefetching > hopeless. Sure, it puts index scans at a disadvantage (compared to > bitmap scans), but it we pick index scan it should still be an > improvement, right? Hopeless might have been too strong of a word. More like it'd fall far short of what is possible to do with a ScalarArrayOpExpr with a given high end server. The quality of the implementation (including prefetching) could make a huge difference to how well we make use of the available hardware resources. A really high quality implementation of ScalarArrayOpExpr + prefetching can keep the system busy with useful work, which is less true with other types of queries, which have inherently less predictable I/O (and often have less I/O overall). What could be more amenable to predicting I/O patterns than a query with a large IN() list, with many constants that can be processed in whatever order makes sense at runtime? What I'd like to do with ScalarArrayOpExpr is to teach nbtree to coalesce together those "small index scans" into "medium index scans" dynamically, where that makes sense. That's the main part that's missing right now. Dynamic behavior matters a lot with ScalarArrayOpExpr stuff -- that's where the challenge lies, but also where the opportunities are. Prefetching builds on all that. > I guess I need to do some testing on a range of data sets / queries, and > see how it works in practice. If I can figure out a way of getting ScalarArrayOpExpr to visit each leaf page exactly once, that might be enough to make things work really well most of the time. Maybe it won't even be necessary to coordinate very much, in the end. Unsure. I've already done a lot of work that tries to minimize the chances of regular (non-ScalarArrayOpExpr) queries accessing more than a single leaf page, which will help your strategy of just prefetching items from a single leaf page at a time -- that will get you pretty far already. Consider the example of the tenk2_hundred index from the bt_page_items documentation. You'll notice that the high key for the page shown in the docs (and every other page in the same index) nicely makes the leaf page boundaries "aligned" with natural keyspace boundaries, due to suffix truncation. 
That helps index scans to access no more than a single leaf page when accessing any one distinct "hundred" value. We are careful to do the right thing with the "boundary cases" when we descend the tree, too. This _bt_search behavior builds on the way that suffix truncation influences the on-disk structure of indexes. Queries such as "select * from tenk2 where hundred = ?" will each return 100 rows spread across almost as many heap pages. That's a fairly large number of rows/heap pages, but we still only need to access one leaf page for every possible constant value (every "hundred" value that might be specified as the ? in my point query example). It doesn't matter if it's the leftmost or rightmost item on a leaf page -- we always descend to exactly the correct leaf page directly, and we always terminate the scan without having to move to the right sibling page (we check the high key before going to the right page in some cases, per the optimization added by commit 29b64d1d). The same kind of behavior is also seen with the TPC-C line items primary key index, which is a composite index. We want to access the items from a whole order in one go, from one leaf page -- and we reliably do the right thing there too (though with some caveats about CREATE INDEX). We should never have to access more than one leaf page to read a single order's line items. This matters because it's quite natural to want to access whole orders with that particular table/workload (it's also unnatural to only access one single item from any given order). Obviously there are many queries that need to access two or more leaf pages, because that's just what needs to happen. My point is that we *should* only do that when it's truly necessary on modern Postgres versions, since the boundaries between pages are "aligned" with the "natural boundaries" from the keyspace/application. Maybe your testing should verify that this effect is actually present, though. It would be a shame if we sometimes messed up prefetching that could have worked well due to some issue with how page splits divide up items. CREATE INDEX is much less smart about suffix truncation -- it isn't capable of the same kind of tricks as nbtsplitloc.c, even though it could be taught to do roughly the same thing. Hopefully this won't be an issue for your work. The tenk2 case still works as expected with CREATE INDEX/REINDEX, due to help from deduplication. Indexes like the TPC-C line items PK will leave the index with some "orders" (or whatever the natural grouping of things is) that span more than a single leaf page, which is undesirable, and might hinder your prefetching work. I wouldn't mind fixing that if it turned out to hurt your leaf-page-at-a-time prefetching patch. Something to consider. We can fit at most 17 TPC-C orders on each order line PK leaf page. Could be as few as 15. If we do the wrong thing with prefetching for 2 out of every 15 orders then that's a real problem, but is still subtle enough to easily miss with conventional benchmarking. I've had a lot of success with paying close attention to all the little boundary cases, which is why I'm kind of zealous about it now. > > I wonder if you should go further than this, by actually sorting the > > items that you need to fetch as part of processing a given leaf page > > (I said this at the unconference, you may recall). Why should we > > *ever* pin/access the same heap page more than once per leaf page > > processed per index scan? 
Nothing stops us from returning the tuples > > to the executor in the original logical/index-wise order, despite > > having actually accessed each leaf page's pointed-to heap pages > > slightly out of order (with the aim of avoiding extra pin/unpin > > traffic that isn't truly necessary). We can sort the heap TIDs in > > scratch memory, then do our actual prefetching + heap access, and then > > restore the original order before returning anything. > > > > I think that's possible, and I thought about that a bit (not just for > btree, but especially for the distance queries on GiST). But I don't > have a good idea if this would be 1% or 50% improvement, and I was > concerned it might easily lead to regressions if we don't actually need > all the tuples. I get that it could be invasive. I have the sense that just pinning the same heap page more than once in very close succession is just the wrong thing to do, with or without prefetching. > I mean, imagine we have TIDs > > [T1, T2, T3, T4, T5, T6] > > Maybe T1, T5, T6 are from the same page, so per your proposal we might > reorder and prefetch them in this order: > > [T1, T5, T6, T2, T3, T4] > > But maybe we only need [T1, T2] because of a LIMIT, and the extra work > we did on processing T5, T6 is wasted. Yeah, that's possible. But isn't that par for the course? Any optimization that involves speculation (including all prefetching) comes with similar risks. They can be managed. I don't think that we'd literally order by TID...we wouldn't change the order that each heap page was *initially* pinned. We'd just reorder the tuples minimally using an approach that is sufficient to avoid repeated pinning of heap pages during processing of any one leaf page's heap TIDs. ISTM that the risk of wasting work is limited to wasting cycles on processing extra tuples from a heap page that we definitely had to process at least one tuple from already. That doesn't seem particularly risky, as speculative optimizations go. The downside is bounded and well understood, while the upside could be significant. I really don't have that much confidence in any of this just yet. I'm not trying to make this project more difficult. I just can't help but notice that the order that index scans end up pinning heap pages already has significant problems, and is sensitive to things like small amounts of heap fragmentation -- maybe that's not a great basis for prefetching. I *really* hate any kind of sharp discontinuity, where a minor change in an input (e.g., from minor amounts of heap fragmentation) has outsized impact on an output (e.g., buffers pinned). Interactions like that tend to be really pernicious -- they lead to bad performance that goes unnoticed and unfixed because the problem effectively camouflages itself. It may even be easier to make the conservative (perhaps paranoid) assumption that weird nasty interactions will cause harm somewhere down the line...why take a chance? I might end up prototyping this myself. I may have to put my money where my mouth is. :-) -- Peter Geoghegan
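To make the "reorder, access, then restore" idea concrete, here is a minimal standalone C sketch (this is not code from the patch; the TidRef/SortEntry types and all names are invented): it sorts one leaf page's TIDs by heap block so each block is accessed exactly once, and then emits the results in the original index order.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct
{
    uint32_t    block;          /* heap block number */
    uint16_t    offset;         /* line pointer within that block */
} TidRef;

typedef struct
{
    TidRef      tid;
    int         orig_pos;       /* position in index (logical) order */
} SortEntry;

static int
cmp_by_block(const void *a, const void *b)
{
    const SortEntry *ea = a;
    const SortEntry *eb = b;

    if (ea->tid.block != eb->tid.block)
        return (ea->tid.block < eb->tid.block) ? -1 : 1;
    return ea->orig_pos - eb->orig_pos;     /* keep index order within a block */
}

/*
 * Access (prefetch + fetch) the TIDs grouped by heap block, so each block is
 * touched once, then emit the tuples in the original index order.
 */
static void
process_leaf_tids(const TidRef *tids, int ntids)
{
    SortEntry  *entries = malloc(sizeof(SortEntry) * ntids);
    int        *sorted_pos = malloc(sizeof(int) * ntids);   /* orig -> sorted */

    for (int i = 0; i < ntids; i++)
    {
        entries[i].tid = tids[i];
        entries[i].orig_pos = i;
    }

    qsort(entries, ntids, sizeof(SortEntry), cmp_by_block);

    /* block-ordered pass: each heap block pinned/prefetched exactly once */
    for (int i = 0; i < ntids; i++)
    {
        if (i == 0 || entries[i].tid.block != entries[i - 1].tid.block)
            printf("access heap block %u\n", entries[i].tid.block);
        sorted_pos[entries[i].orig_pos] = i;
    }

    /* restore the original order before returning anything */
    for (int i = 0; i < ntids; i++)
    {
        const SortEntry *e = &entries[sorted_pos[i]];

        printf("return TID (%u,%u)\n", e->tid.block, (unsigned) e->tid.offset);
    }

    free(entries);
    free(sorted_pos);
}

int
main(void)
{
    /* T1..T6 from the example above: T1, T5, T6 share heap block 10 */
    TidRef      tids[] = {{10, 1}, {20, 3}, {30, 2}, {40, 1}, {10, 5}, {10, 7}};

    process_leaf_tids(tids, 6);
    return 0;
}

In a real implementation the two printf calls would be the prefetch/heap access and the tuple return, respectively; the sorted_pos mapping is what lets the scan keep returning tuples in index order, which is the part that distinguishes this from a bitmap heap scan.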
Hi, On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote: > > > >> 2) prefetching from executor > >> > >> Another question is whether the prefetching shouldn't actually happen > >> even higher - in the executor. That's what Andres suggested during the > >> unconference, and it kinda makes sense. That's where we do prefetching > >> for bitmap heap scans, so why should this happen lower, right? > > > > Yea. I think it also provides potential for further optimizations in the > > future to do it at that layer. > > > > One thing I have been wondering around this is whether we should not have > > split the code for IOS and plain indexscans... > > > > Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or > did you mean something else? Yes, I meant that. > >> 4) per-leaf prefetching > >> > >> The code is restricted only prefetches items from one leaf page. If the > >> index scan needs to scan multiple (many) leaf pages, we have to process > >> the first leaf page first before reading / prefetching the next one. > >> > >> I think this is acceptable limitation, certainly for v0. Prefetching > >> across multiple leaf pages seems way more complex (particularly for the > >> cases using pairing heap), so let's leave this for the future. > > > > Hm. I think that really depends on the shape of the API we end up with. If we > > move the responsibility more twoards to the executor, I think it very well > > could end up being just as simple to prefetch across index pages. > > > > Maybe. I'm open to that idea if you have idea how to shape the API to > make this possible (although perhaps not in v0). I'll try to have a look. > > I'm a bit confused by some of these numbers. How can OS-level prefetching lead > > to massive prefetching in the alread cached case, e.g. in tpch q06 and q08? > > Unless I missed what "xeon / cached (speedup)" indicates? > > > > I forgot to explain what "cached" means in the TPC-H case. It means > second execution of the query, so you can imagine it like this: > > for q in `seq 1 22`; do > > 1. drop caches and restart postgres Are you doing it in that order? If so, the pagecache can end up being seeded by postgres writing out dirty buffers. > 2. run query $q -> uncached > > 3. run query $q -> cached > > done > > So the second execution has a chance of having data in memory - but > maybe not all, because this is a 100GB data set (so ~200GB after > loading), but the machine only has 64GB of RAM. > > I think a likely explanation is some of the data wasn't actually in > memory, so prefetching still did something. Ah, ok. > > I think it'd be good to run a performance comparison of the unpatched vs > > patched cases, with prefetching disabled for both. It's possible that > > something in the patch caused unintended changes (say spilling during a > > hashagg, due to larger struct sizes). > > > > That's certainly a good idea. I'll do that in the next round of tests. I > also plan to do a test on data set that fits into RAM, to test "properly > cached" case. Cool. It'd be good to measure both the case of all data already being in s_b (to see the overhead of the buffer mapping lookups) and the case where the data is in the kernel pagecache (to see the overhead of pointless posix_fadvise calls). Greetings, Andres Freund
On 6/10/23 22:34, Andres Freund wrote: > Hi, > > On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote: >>> >>>> 2) prefetching from executor >>>> >>>> Another question is whether the prefetching shouldn't actually happen >>>> even higher - in the executor. That's what Andres suggested during the >>>> unconference, and it kinda makes sense. That's where we do prefetching >>>> for bitmap heap scans, so why should this happen lower, right? >>> >>> Yea. I think it also provides potential for further optimizations in the >>> future to do it at that layer. >>> >>> One thing I have been wondering around this is whether we should not have >>> split the code for IOS and plain indexscans... >>> >> >> Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or >> did you mean something else? > > Yes, I meant that. > Ah, you meant that maybe we shouldn't have done that. Sorry, I misunderstood. >>>> 4) per-leaf prefetching >>>> >>>> The code is restricted only prefetches items from one leaf page. If the >>>> index scan needs to scan multiple (many) leaf pages, we have to process >>>> the first leaf page first before reading / prefetching the next one. >>>> >>>> I think this is acceptable limitation, certainly for v0. Prefetching >>>> across multiple leaf pages seems way more complex (particularly for the >>>> cases using pairing heap), so let's leave this for the future. >>> >>> Hm. I think that really depends on the shape of the API we end up with. If we >>> move the responsibility more twoards to the executor, I think it very well >>> could end up being just as simple to prefetch across index pages. >>> >> >> Maybe. I'm open to that idea if you have idea how to shape the API to >> make this possible (although perhaps not in v0). > > I'll try to have a look. > > >>> I'm a bit confused by some of these numbers. How can OS-level prefetching lead >>> to massive prefetching in the alread cached case, e.g. in tpch q06 and q08? >>> Unless I missed what "xeon / cached (speedup)" indicates? >>> >> >> I forgot to explain what "cached" means in the TPC-H case. It means >> second execution of the query, so you can imagine it like this: >> >> for q in `seq 1 22`; do >> >> 1. drop caches and restart postgres > > Are you doing it in that order? If so, the pagecache can end up being seeded > by postgres writing out dirty buffers. > Actually no, I do it the other way around - first restart, then drop. It shouldn't matter much, though, because after building the data set (and vacuum + checkpoint), the data is not modified - all the queries run on the same data set. So there shouldn't be any dirty buffers. > >> 2. run query $q -> uncached >> >> 3. run query $q -> cached >> >> done >> >> So the second execution has a chance of having data in memory - but >> maybe not all, because this is a 100GB data set (so ~200GB after >> loading), but the machine only has 64GB of RAM. >> >> I think a likely explanation is some of the data wasn't actually in >> memory, so prefetching still did something. > > Ah, ok. > > >>> I think it'd be good to run a performance comparison of the unpatched vs >>> patched cases, with prefetching disabled for both. It's possible that >>> something in the patch caused unintended changes (say spilling during a >>> hashagg, due to larger struct sizes). >>> >> >> That's certainly a good idea. I'll do that in the next round of tests. I >> also plan to do a test on data set that fits into RAM, to test "properly >> cached" case. > > Cool. 
It'd be good to measure both the case of all data already being in s_b > (to see the overhead of the buffer mapping lookups) and the case where the > data is in the kernel pagecache (to see the overhead of pointless > posix_fadvise calls). > OK, I'll make sure the next round of tests includes a sufficiently small data set too. I should have some numbers sometime early next week. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, 2023-06-08 at 17:40 +0200, Tomas Vondra wrote: > Hi, > > At pgcon unconference I presented a PoC patch adding prefetching for > indexes, along with some benchmark results demonstrating the (pretty > significant) benefits etc. The feedback was quite positive, so let me > share the current patch more widely. > I added entry to https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference based on notes I took during that session. Hope it helps. -- Tomasz Rybak, Debian Developer <serpent@debian.org> GPG: A565 CE64 F866 A258 4DDC F9C7 ECB7 3E37 E887 AA8C
On Thu, Jun 8, 2023 at 9:10 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > We already do prefetching for bitmap index scans, where the bitmap heap > scan prefetches future pages based on effective_io_concurrency. I'm not > sure why exactly was prefetching implemented only for bitmap scans, but > I suspect the reasoning was that it only helps when there's many > matching tuples, and that's what bitmap index scans are for. So it was > not worth the implementation effort. One of the reasons, IMHO, is that in a bitmap scan the TIDs are already sorted in heap block order before the heap fetch starts. So it is quite obvious that once we prefetch a heap block, most of the subsequent TIDs will fall on that block, i.e. each prefetch satisfies many immediate requests. OTOH, in an index scan the I/O requests are essentially random, so we might have to prefetch many blocks even to satisfy the TIDs from a single index page. I agree that prefetching with an index scan will definitely help reduce the random I/O, but my guess is that prefetching looked more natural for a bitmap scan, and that would have been one of the reasons for implementing it only there. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
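Dilip's point can be seen with a toy standalone C program (purely illustrative, nothing here comes from PostgreSQL): with block-sorted TIDs a trivial "same block as last time" check suppresses almost every prefetch, while with index-ordered (effectively random) TIDs nearly every TID needs its own prefetch.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX

static int
count_prefetches(const uint32_t *blocks, int n)
{
    uint32_t    last = INVALID_BLOCK;
    int         prefetches = 0;

    for (int i = 0; i < n; i++)
    {
        if (blocks[i] != last)      /* only issue I/O for a new block */
        {
            prefetches++;
            last = blocks[i];
        }
    }
    return prefetches;
}

int
main(void)
{
    enum { NTIDS = 10000, NBLOCKS = 100 };
    uint32_t    sorted[NTIDS];
    uint32_t    random_order[NTIDS];

    srand(42);
    for (int i = 0; i < NTIDS; i++)
    {
        sorted[i] = i / (NTIDS / NBLOCKS);      /* block-ordered, like a bitmap scan */
        random_order[i] = rand() % NBLOCKS;     /* index-ordered, effectively random */
    }

    printf("sorted TIDs: %d prefetches\n", count_prefetches(sorted, NTIDS));
    printf("random TIDs: %d prefetches\n", count_prefetches(random_order, NTIDS));
    return 0;
}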
Hi, I have results from the new extended round of prefetch tests. I've pushed everything to https://github.com/tvondra/index-prefetch-tests-2 There are scripts I used to run this (run-*.sh), raw results and various kinds of processed summaries (pdf, ods, ...) that I'll mention later. As before, this tests a number of query types: - point queries with btree and hash (equality) - ORDER BY queries with btree (inequality + order by) - SAOP queries with btree (column IN (values)) It's probably futile to go through details of all the tests - it's easier to go through the (hopefully fairly readable) shell scripts. But in principle, runs some simple queries while varying both the data set and workload: - data set may be random, sequential or cyclic (with different length) - the number of matches per value differs (i.e. equality condition may match 1, 10, 100, ..., 100k rows) - forces a particular scan type (indexscan, bitmapscan, seqscan) - each query is executed twice - first run (right after restarting DB and dropping caches) is uncached, second run should have data cached - the query is executed 5x with different parameters (so 10x in total) This is tested with three basic data sizes - fits into shared buffers, fits into RAM and exceeds RAM. The sizes are roughly 350MB, 3.5GB and 20GB (i5) / 40GB (xeon). Note: xeon has 64GB RAM, so technically the largest scale fits into RAM. But should not matter, thanks to drop-caches and restart. I also attempted to pin the backend to a particular core, in effort to eliminate scheduling-related noise. It's mostly what taskset does, but I did that from extension (https://github.com/tvondra/taskset) which allows me to do that as part of the SQL script. For the results, I'll talk about the v1 patch (as submitted here) fist. I'll use the PDF results in the "pdf" directory which generally show a pivot table by different test parameters, comparing the results by different parameters (prefetching on/off, master/patched). Feel free to do your own analysis from the raw CSV data, ofc. For example, this: https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-point-queries-builds.pdf shows how the prefetching affects timing for point queries with different numbers of matches (1 to 100k). The numbers are timings for master and patched build. The last group is (patched/master), so the lower the number the better - 50% means patch makes the query 2x faster. There's also a heatmap, with green=good, red=bad, which makes it easier to cases that got slower/faster. The really interesting stuff starts on page 7 (in this PDF), because the first couple pages are "cached" (so it's more about measuring overhead when prefetching has no benefit). Right on page 7 you can see a couple cases with a mix of slower/faster cases, roughtly in the +/- 30% range. However, this is unrelated from the patch because those are results for bitmapheapscan. For indexscans (page 8), the results are invariably improved - the more matches the better (up to ~10x faster for 100k matches). Those were results for the "cyclic" data set. For random data set (pages 9-11) the results are pretty similar, but for "sequential" data (11-13) the prefetching is actually harmful - there are red clusters, with up to 500% slowdowns. 
I'm not going to explain the summary for SAOP queries (https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-saop-queries-builds.pdf), the story is roughly the same, except that there are more tested query combinations (because we also vary the pattern in the IN() list - number of values etc.). So, the conclusion from this is - generally very good results for random and cyclic data sets, but pretty bad results for sequential. But even for the random/cyclic cases there are combinations (especially with many matches) where prefetching doesn't help or even hurts. The only way to deal with this is (I think) a cheap way to identify and skip inefficient prefetches, essentially by doing two things: a) remembering more recently prefetched blocks (say, 1000+) and not prefetching them over and over b) ability to identify sequential pattern, when readahead seems to do pretty good job already (although I heard some disagreement) I've been thinking about how to do this - doing (a) seem pretty hard, because on the one hand we want to remember a fair number of blocks and we want the check "did we prefetch X" to be very cheap. So a hash table seems nice. OTOH we want to expire "old" blocks and only keep the most recent ones, and hash table doesn't really support that. Perhaps there is a great data structure for this, not sure. But after thinking about this I realized we don't need a perfect accuracy - it's fine to have false positives/negatives - it's fine to forget we already prefetched block X and prefetch it again, or prefetch it again. It's not a matter of correctness, just a matter of efficiency - after all, we can't know if it's still in memory, we only know if we prefetched it fairly recently. This led me to a "hash table of LRU caches" thing. Imagine a tiny LRU cache that's small enough to be searched linearly (say, 8 blocks). And we have many of them (e.g. 128), so that in total we can remember 1024 block numbers. Now, every block number is mapped to a single LRU by hashing, as if we had a hash table index = hash(blockno) % 128 and we only use tha one LRU to track this block. It's tiny so we can search it linearly. To expire prefetched blocks, there's a counter incremented every time we prefetch a block, and we store it in the LRU with the block number. When checking the LRU we ignore old entries (with counter more than 1000 values back), and we also evict/replace the oldest entry if needed. This seems to work pretty well for the first requirement, but it doesn't allow identifying the sequential pattern cheaply. To do that, I added a tiny queue with a couple entries that can checked it the last couple entries are sequential. And this is what the attached 0002+0003 patches do. There are PDF with results for this build prefixed with "patch-v3" and the results are pretty good - the regressions are largely gone. It's even cleared in the PDFs comparing the impact of the two patches: https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-point.pdf https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-saop.pdf Which simply shows the "speedup heatmap" for the two patches, and the "v3" heatmap has much less red regression clusters. Note: The comparison-point.pdf summary has another group of columns illustrating if this scan type would be actually used, with "green" meaning "yes". This provides additional context, because e.g. for the "noisy bitmapscans" it's all white, i.e. 
without setting the GUCs the optimizer would pick something else (hence it's a non-issue). Let me know if the results are not clear enough (I tried to cover the important stuff, but I'm sure there are a lot of details I didn't cover), or if you think some other summary would be better. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
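For reference, this is roughly the shape of the data structure described above, as a standalone sketch (the names, the trivial modulo "hash" and the exact constants are illustrative; the attached 0002+0003 patches may differ in the details):

#include <stdbool.h>
#include <stdint.h>

#define PREFETCH_LRU_COUNT   128   /* number of tiny LRUs */
#define PREFETCH_LRU_SIZE    8     /* entries per LRU, searched linearly */
#define PREFETCH_CACHE_SIZE  (PREFETCH_LRU_COUNT * PREFETCH_LRU_SIZE)
#define PREFETCH_SEQ_PATTERN 3     /* trailing blocks that must be sequential */
#define PREFETCH_QUEUE_SIZE  8

typedef struct
{
    uint32_t    block;      /* prefetched block number */
    uint64_t    request;    /* request counter value at prefetch time */
} PrefetchCacheEntry;

typedef struct
{
    uint64_t    counter;    /* incremented for every prefetch request */
    PrefetchCacheEntry cache[PREFETCH_LRU_COUNT][PREFETCH_LRU_SIZE];

    /* tiny queue of the most recently requested blocks */
    uint32_t    recent[PREFETCH_QUEUE_SIZE];
    int         recent_next;
} PrefetchState;

/*
 * Remember 'block' and return true if it was prefetched recently enough
 * (within the last PREFETCH_CACHE_SIZE requests) that we can skip it now.
 */
static bool
prefetched_recently(PrefetchState *ps, uint32_t block)
{
    int         lru = block % PREFETCH_LRU_COUNT;   /* stand-in for a hash */
    int         oldest = 0;
    bool        found = false;

    ps->counter++;

    for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
    {
        PrefetchCacheEntry *e = &ps->cache[lru][i];

        if (e->request != 0 && e->block == block &&
            e->request + PREFETCH_CACHE_SIZE >= ps->counter)
        {
            e->request = ps->counter;   /* refresh the entry */
            found = true;
        }

        if (e->request < ps->cache[lru][oldest].request)
            oldest = i;
    }

    if (!found)
    {
        /* evict/replace the oldest slot in this tiny LRU */
        ps->cache[lru][oldest].block = block;
        ps->cache[lru][oldest].request = ps->counter;
    }

    return found;
}

/*
 * Record 'block' in the small queue and return true if the last few requests
 * form a sequential run ending here (OS readahead handles those well).
 */
static bool
is_sequential_pattern(PrefetchState *ps, uint32_t block)
{
    bool        sequential = true;

    for (int i = 1; i <= PREFETCH_SEQ_PATTERN; i++)
    {
        int         pos = (ps->recent_next - i + PREFETCH_QUEUE_SIZE) % PREFETCH_QUEUE_SIZE;

        if (ps->recent[pos] + i != block)
            sequential = false;
    }

    ps->recent[ps->recent_next] = block;
    ps->recent_next = (ps->recent_next + 1) % PREFETCH_QUEUE_SIZE;

    return sequential;
}

/* issue a prefetch only for blocks that are neither sequential nor recent */
static bool
want_prefetch(PrefetchState *ps, uint32_t block)
{
    if (is_sequential_pattern(ps, block))
        return false;
    return !prefetched_recently(ps, block);
}

The per-LRU linear search keeps the check cheap, and the request counter gives approximate expiry - false positives/negatives only cost an extra (or a skipped) prefetch, never correctness.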
Hi, attached is a v4 of the patch, with a fairly major shift in the approach. Until now the patch very much relied on the AM to provide information which blocks to prefetch next (based on the current leaf index page). This seemed like a natural approach when I started working on the PoC, but over time I ran into various drawbacks: * a lot of the logic is at the AM level * can't prefetch across the index page boundary (have to wait until the next index leaf page is read by the indexscan) * doesn't work for distance searches (gist/spgist), After thinking about this, I decided to ditch this whole idea of exchanging prefetch information through an API, and make the prefetching almost entirely in the indexam code. The new patch maintains a queue of TIDs (read from index_getnext_tid), with up to effective_io_concurrency entries - calling getnext_slot() adds a TID at the queue tail, issues a prefetch for the block, and then returns TID from the queue head. Maintaining the queue is up to index_getnext_slot() - it can't be done in index_getnext_tid(), because then it'd affect IOS (and prefetching heap would mostly defeat the whole point of IOS). And we can't do that above index_getnext_slot() because that already fetched the heap page. I still think prefetching for IOS is doable (and desirable), in mostly the same way - except that we'd need to maintain the queue from some other place, as IOS doesn't do index_getnext_slot(). FWIW there's also the "index-only filters without IOS" patch [1] which switches even regular index scans to index_getnext_tid(), so maybe relying on index_getnext_slot() is a lost cause anyway. Anyway, this has the nice consequence that it makes AM code entirely oblivious of prefetching - there's no need to API, we just get TIDs as before, and the prefetching magic happens after that. Thus it also works for searches ordered by distance (gist/spgist). The patch got much smaller (about 40kB, down from 80kB), which is nice. I ran the benchmarks [2] with this v4 patch, and the results for the "point" queries are almost exactly the same as for v3. The SAOP part is still running - I'll add those results in a day or two, but I expect similar outcome as for point queries. regards [1] https://commitfest.postgresql.org/43/4352/ [2] https://github.com/tvondra/index-prefetch-tests-2/ -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
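A simplified sketch of that loop, written against the existing index_getnext_tid()/PrefetchBuffer() APIs (the PrefetchQueue struct and the function name are invented for illustration; this is not the actual patch code):

#include "postgres.h"

#include "access/genam.h"
#include "access/relscan.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"

#define MAX_PREFETCH_TARGET 64          /* illustrative cap */

typedef struct PrefetchQueue
{
    ItemPointerData tids[MAX_PREFETCH_TARGET];
    int         head;                   /* next TID to return */
    int         tail;                   /* next free slot */
    int         target;                 /* read-ahead distance, < MAX_PREFETCH_TARGET */
    bool        done;                   /* index exhausted? */
} PrefetchQueue;

/*
 * Return the next TID to process, keeping up to 'target' future TIDs queued
 * and their heap blocks prefetched.  Caller zero-initializes the queue and
 * sets 'target' (effective_io_concurrency, capped).
 */
static ItemPointer
prefetch_queue_next(IndexScanDesc scan, ScanDirection dir, PrefetchQueue *pq)
{
    /* top up the queue, issuing a prefetch for every TID we add */
    while (!pq->done && (pq->tail - pq->head) < pq->target)
    {
        ItemPointer tid = index_getnext_tid(scan, dir);

        if (tid == NULL)
        {
            pq->done = true;
            break;
        }

        pq->tids[pq->tail % MAX_PREFETCH_TARGET] = *tid;
        pq->tail++;

        PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM,
                       ItemPointerGetBlockNumber(tid));
    }

    if (pq->head == pq->tail)
        return NULL;                    /* nothing left */

    return &pq->tids[(pq->head++) % MAX_PREFETCH_TARGET];
}

The caller - index_getnext_slot() in the patch - would then fetch the heap tuple for the returned TID; keeping the queue at this level rather than inside index_getnext_tid() is what avoids prefetching heap pages an index-only scan may never read.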
Here's a v5 of the patch, rebased to current master and fixing a couple compiler warnings reported by cfbot (%lu vs. UINT64_FORMAT in some debug messages). No other changes compared to v4. cfbot also reported a failure on windows in pg_dump [1], but it seems pretty strange: [11:42:48.708] ------------------------------------- 8< ------------------------------------- [11:42:48.708] stderr: [11:42:48.708] # Failed test 'connecting to an invalid database: matches' The patch does nothing related to pg_dump, and the test works perfectly fine for me (I don't have a Windows machine, but it passes on both 32-bit and 64-bit Linux). regards [1] https://cirrus-ci.com/task/6398095366291456 -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Attached is a v6 of the patch, which rebases v5 (just some minor bitrot), and also does a couple changes which I kept in separate patches to make it obvious what changed. 0001-v5-20231016.patch ---------------------- Rebase to current master. 0002-comments-and-minor-cleanup-20231012.patch ---------------------------------------------- Various comment improvements (remove obsolete ones clarify a bunch of other comments, etc.). I tried to explain the reasoning why some places disable prefetching (e.g. in catalogs, replication, ...), explain how the caching / LRU works etc. 0003-remove-prefetch_reset-20231016.patch ----------------------------------------- I decided to remove the separate prefetch_reset parameter, so that all the index_beginscan() methods only take a parameter specifying the maximum prefetch target. The reset was added early when the prefetch happened much lower in the AM code, at the index page level, and the reset was when moving to the next index page. But now after the prefetch moved to the executor, this doesn't make much sense - the resets happen on rescans, and it seems right to just reset to 0 (just like for bitmap heap scans). 0004-PoC-prefetch-for-IOS-20231016.patch ---------------------------------------- This is a PoC adding the prefetch to index-only scans too. At first that may seem rather strange, considering eliminating the heap fetches is the whole point of IOS. But if the pages are not marked as all-visible (say, the most recent part of the table), we may still have to fetch them. In which case it'd be easy to see cases that IOS is slower than a regular index scan (with prefetching). The code is quite rough. It adds a separate index_getnext_tid_prefetch() function, adding prefetching on top of index_getnext_tid(). I'm not sure it's the right pattern, but it's pretty much what index_getnext_slot() does too, except that it also does the fetch + store to the slot. Note: There's a second patch adding index-only filters, which requires the regular index scans from index_getnext_slot() to _tid() too. The prefetching then happens only after checking the visibility map (if requested). This part definitely needs improvements - for example there's no attempt to reuse the VM buffer, which I guess might be expensive. index-prefetch.pdf ------------------ Attached is also a PDF with results of the same benchmark I did before, comparing master vs. patched with various data patterns and scan types. It's not 100% comparable to earlier results as I only ran it on a laptop, and it's a bit noisier too. The overall behavior and conclusions are however the same. I was specifically interested in the IOS behavior, so I added two more cases to test - indexonlyscan and indexonlyscan-clean. The first is the worst-case scenario, with no pages marked as all-visible in VM (the test simply deletes the VM), while indexonlyscan-clean is the good-case (no heap fetches needed). The results mostly match the expected behavior, particularly for the uncached runs (when the data is expected to not be in memory): * indexonlyscan (i.e. bad case) - About the same results as "indexscans", with the same speedups etc. Which is a good thing (i.e. IOS is not unexpectedly slower than regular indexscans). * indexonlyscan-clean (i.e. good case) - Seems to have mostly the same performance as without the prefetching, except for the low-cardinality runs with many rows per key. I haven't checked what's causing this, but I'd bet it's the extra buffer lookups/management I mentioned. 
I noticed there's another prefetching-related patch [1] from Thomas Munro. I haven't looked at it yet, so hard to say how much it interferes with this patch. But the idea looks interesting. [1] https://www.postgresql.org/message-id/flat/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
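To illustrate the visibility check that 0004 performs before prefetching in the IOS case, here is a simplified sketch using the existing VM_ALL_VISIBLE() and PrefetchBuffer() APIs (the helper name and its exact placement are invented; the real patch structures this differently and is, as noted, still rough about reusing the VM buffer):

#include "postgres.h"

#include "access/relscan.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"

/*
 * Prefetch the heap block for 'tid' only if the page is not all-visible,
 * since an index-only scan won't fetch all-visible pages anyway.  The caller
 * initializes *vmbuffer to InvalidBuffer and releases it at the end of the
 * scan; reusing it across calls avoids repeated VM page lookups, which is
 * the overhead the 0004 PoC currently pays.
 */
static void
prefetch_heap_block_for_ios(IndexScanDesc scan, ItemPointer tid,
                            Buffer *vmbuffer)
{
    BlockNumber block = ItemPointerGetBlockNumber(tid);

    if (!VM_ALL_VISIBLE(scan->heapRelation, block, vmbuffer))
        PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
}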
Hi, Here's a new WIP version of the patch set adding prefetching to indexes, exploring a couple alternative approaches. After the patch 2023/10/16 version, I happened to have an off-list discussion with Andres, and he suggested to try a couple things, and there's a couple more things I tried on my own too. Attached is the patch series starting with the 2023/10/16 patch, and then trying different things in separate patches (discussed later). As usual, there's also a bunch of benchmark results - due to size I'm unable to attach all of them here (the PDFs are pretty large), but you can find them at (with all the scripts etc.): https://github.com/tvondra/index-prefetch-tests/tree/master/2023-11-23 I'll attach only a couple small PNG with highlighted speedup/regression patterns, but it's unreadable and more of a pointer to the PDF. A quick overview of the patches ------------------------------- v20231124-0001-prefetch-2023-10-16.patch - same as the October 16 patch, with only minor comment tweaks v20231124-0002-rely-on-PrefetchBuffer-instead-of-custom-c.patch - removes custom cache of recently prefetched blocks, replaces it simply by calling PrefetchBuffer (which check shared buffers) v20231124-0003-check-page-cache-using-preadv2.patch - adds a check using preadv2(RWF_NOWAIT) to check if the whole page is in page cache v20231124-0004-reintroduce-the-LRU-cache-of-recent-blocks.patch - adds back a small LRU cache to identify sequential patterns (based on benchmarks of 0002/0003 patches) v20231124-0005-hold-the-vm-buffer-for-IOS-prefetching.patch v20231124-0006-poc-reuse-vm-information.patch - optimizes the visibilitymap handling when prefetching for IOS (to deal with overhead in the all-visible cases) by v20231124-0007-20231016-reworked.patch - returns back to the 20231016 patch, but this time with the VM optimizations in patches 0005/0006 (in retrospect I might have simply moved 0005+0006 right after 0001, but the patch evolved differently - shouldn't matter here) Now, let's talk about the patches one by one ... PrefetchBuffer + preadv2 (0002+0003) ------------------------------------ After I posted the patch in October, I happened to have an off-list discussion with Andres, and he suggested to try ditching the local cache of recently prefetched blocks, and instead: 1) call PrefetchBuffer (which checks if the page is in shared buffers, and skips the prefetch if it's already there) 2) if the page is not in shared buffers, use preadv2(RWF_NOWAIT) to check if it's in the kernel page cache Doing (1) is trivial - PrefetchBuffer() already does the shared buffer check, so 0002 simply removes the custom cache code. Doing (2) needs a bit more code to actually call preadv2() - 0003 adds FileCached() to fd.c, smgrcached() to smgr.c, and then calls it from PrefetchBuffer() right before smgrprefetch(). There's a couple loose ends (e.g. configure should check if preadv2 is supported), but in principle I think this is generally correct. Unfortunately, these changes led to a bunch of clear regressions :-( Take a look at the attached point-4-regressions-small.png, which is page 5 from the full results PDF [1][2]. As before, I plotted this as a huge pivot table with various parameters (test, dataset, prefetch, ...) on the left, and (build, nmatches) on the top. So each column shows timings for a particular patch and query returning nmatches rows. After the pivot table (on the right) is a heatmap, comparing timings for each build to master (the first couple of columns). 
As usual, the numbers are "timing compared to master" so e.g. 50% means the query completed in 1/2 the time compared to master. Color coding is simple too, green means "good" (speedup), red means "bad" (regression). The higher the saturation, the bigger the difference. I find this visualization handy as it quickly highlights differences between the various patches. Just look for changes in red/green areas. In the points-5-regressions-small.png image, you can see three areas of clear regressions, either compared to the master or the 20231016 patch. All of this is for "uncached" runs, i.e. after instance got restarted and the page cache was dropped too. The first regression is for bitmapscan. The first two builds show no difference compared to master - which makes sense, because the 20231016 patch does not touch any code used by bitmapscan, and the 0003 patch simply uses PrefetchBuffer as is. But then 0004 adds preadv2 to it, and the performance immediately sinks, with timings being ~5-6x higher for queries matching 1k-100k rows. The patches 0005/0006 can't possibly improve this, because visibilitymap are entirely unrelated to bitmapscans, and so is the small LRU to detect sequential patterns. The indexscan regression #1 shows a similar pattern, but in the opposite direction - indesxcan cases massively improved with the 20231016 patch (and even after just using PrefetchBuffer) revert back to master with 0003 (adding preadv2). Ditching the preadv2 restores the gains (the last build results are nicely green again). The indexscan regression #2 is interesting too, and it illustrates the importance of detecting sequential access patterns. It shows that as soon as we call PrefetBuffer() directly, the timings increase to maybe 2-5x compared to master. That's pretty terrible. Once the small LRU cache used to detect sequential patterns is added back, the performance recovers and regression disappears. Clearly, this detection matters. Unfortunately, the LRU can't do anything for the two other regresisons, because those are on random/cyclic patterns, so the LRU won't work (certainly not for the random case). preadv2 issues? --------------- I'm not entirely sure if I'm using preadv2 somehow wrong, but it doesn't seem to perform terribly well in this use case. I decided to do some microbenchmarks, measuring how long it takes to do preadv2 when the pages are [not] in cache etc. The C files are at [3]. preadv2-test simply reads file twice, first with NOWAIT and then without it. With clean page cache, the results look like this: file: ./tmp.img size: 1073741824 (131072) block 8192 check 8192 preadv2 NOWAIT time 78472 us calls 131072 hits 0 misses 131072 preadv2 WAIT time 9849082 us calls 131072 hits 131072 misses 0 and then, if you run it again with the file still being in page cache: file: ./tmp.img size: 1073741824 (131072) block 8192 check 8192 preadv2 NOWAIT time 258880 us calls 131072 hits 131072 misses 0 preadv2 WAIT time 213196 us calls 131072 hits 131072 misses 0 This is pretty terrible, IMO. It says that if the page is not in cache, the preadv2 calls take ~80ms. Which is very cheap, compared to the total read time (so if we can speed that up by prefetching, it's worth it). But if the file is already in cache, it takes ~260ms, and actually exceeds the time needed to just do preadv2() without the NOWAIT flag. AFAICS the problem is preadv2() doesn't just check if the data is available, it also copies the data and all that. 
But even if we only ask for the first byte, it's still way more expensive than with an empty cache:

file: ./tmp.img size: 1073741824 (131072) block 8192 check 1
preadv2 NOWAIT  time 119751 us  calls 131072  hits 131072  misses 0
preadv2 WAIT    time 208136 us  calls 131072  hits 131072  misses 0

There's also a fadvise-test microbenchmark that just does fadvise all the time, and even that is way cheaper than using preadv2(NOWAIT), in both cases:

no cache:
file: ./tmp.img size: 1073741824 (131072) block 8192
fadvise  time 631686 us  calls 131072  hits 0  misses 0
preadv2  time 207483 us  calls 131072  hits 131072  misses 0

cache:
file: ./tmp.img size: 1073741824 (131072) block 8192
fadvise  time 79874 us  calls 131072  hits 0  misses 0
preadv2  time 239141 us  calls 131072  hits 131072  misses 0

So that's roughly 300ms vs. 500ms in the cached case (the difference in the no-cache case is even more significant). It's entirely possible I'm doing something wrong, or maybe I just think about this the wrong way, but I can't quite imagine this approach working well for this use case - at least not for reasonably good local storage. Maybe it could help for slow/remote storage, or something?

For now, I think the right approach is to go back to the cache of recently prefetched blocks. What I liked about the preadv2 approach is that it knows exactly what is currently in the page cache, while the local cache is just an approximation (a cache of recently prefetched blocks). And it also knows about stuff prefetched by other backends, while the local cache is private to the particular backend (or even to the particular scan node). But the local cache seems to perform much better, so there's that.

LRU cache of recent blocks (0004)
---------------------------------

The importance of this optimization is clearly visible in the regression image mentioned earlier - the "indexscan regression #2" shows that the sequential pattern regresses with the 0002+0003 patches, but once the small LRU cache is introduced back and used to skip prefetching for sequential patterns, the regression disappears. Ofc, this is part of the original 20231016 patch, so going back to that version naturally includes this.

visibility map optimizations (0005/0006)
----------------------------------------

Earlier benchmark results showed a somewhat annoying regression for index-only scans that don't need prefetching (i.e. with all pages all-visible). There was quite a bit of inefficiency because both the prefetcher and the IOS code accessed the visibilitymap independently, and the prefetcher did that in a rather inefficient way. These patches make the prefetcher more efficient by reusing the VM buffer, and also share the visibility info between the prefetcher and the IOS code. I'm sure this needs more work / cleanup, but the regression is mostly gone, as illustrated by the attached point-0-ios-improvement-small.png.

layering questions
------------------

Aside from the preadv2() question, the main open question remains the "layering", i.e. which code should be responsible for prefetching. At the moment all the magic happens in indexam.c, in the index_getnext_* functions, so that all callers benefit from prefetching. But as mentioned earlier in this thread, indexam.c seems to be the wrong layer, and I think I agree. The problem is that the prefetching needs to happen in index_getnext_* so that all index_getnext_* callers benefit from it. We could do that in the executor for index_getnext_tid(), but that's a bit weird - it'd work for index-only scans, but the primary target is regular index scans, which call index_getnext_slot().
However, it seems it'd be good if the prefetcher and the executor code could exchange/share information more easily. Take for example the visibilitymap stuff for IOS (patches 0005/0006). I made it work, but it sure looks inconvenient, partially due to the split between executor and indexam code. The only idea I have is to have the prefetcher code somewhere in the executor, but then pass it to the index_getnext_* functions, either as a new parameter (with NULL => no prefetching), or maybe as a field of the scandesc (but that seems wrong - pointing from the desc to something that's essentially a part of the executor state). There's also the issue that the prefetcher is currently part of IndexScanDesc, but it really should be in the IndexScanState. That's weird, but mostly down to my general laziness.

regards

[1] https://github.com/tvondra/index-prefetch-tests/blob/master/2023-11-23/pdf/point.pdf
[2] https://github.com/tvondra/index-prefetch-tests/blob/master/2023-11-23/png/point-4.png
[3] https://github.com/tvondra/index-prefetch-tests/tree/master/2023-11-23/preadv-tests

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments:
- v20231124-0006-poc-reuse-vm-information.patch
- v20231124-0001-prefetch-2023-10-16.patch
- v20231124-0002-rely-on-PrefetchBuffer-instead-of-custom-c.patch
- v20231124-0003-check-page-cache-using-preadv2.patch
- v20231124-0004-reintroduce-the-LRU-cache-of-recent-blocks.patch
- v20231124-0005-hold-the-vm-buffer-for-IOS-prefetching.patch
- v20231124-0007-20231016-reworked.patch
- point-0-ios-improvement-small.png
- point-4-regressions-small.png
Hi,

Here's a simplified version of the patch series, with two important changes from the last version shared on 2023/11/24.

Firstly, it abandons the idea of using preadv2() to check the page cache. This initially seemed like a great way to check if prefetching is needed, but in practice it seems so expensive that it's not really beneficial (especially in the "cached" case, which is where it matters most). Note: There's one more reason not to want to rely on preadv2() that I forgot to mention - it's a Linux-specific thing. I wouldn't mind using it to improve already acceptable behavior, but it doesn't seem like a great idea if performance without it would be poor.

Secondly, this reworks multiple aspects of the "layering". Until now, the prefetching info was stored in IndexScanDesc and initialized in indexam.c in the various "beginscan" functions. That was obviously wrong - IndexScanDesc is just a description of what the scan should do, not a place where execution state (which the prefetch queue is) should be stored. IndexScanState (and IndexOnlyScanState) is a more appropriate place, so I moved it there. This also means the various "beginscan" functions don't need any changes (they don't even get prefetch_max), which is nice, because the prefetch state is created/initialized elsewhere.

But there's a layering problem that I don't know how to solve - I don't see how we could make indexam.c entirely oblivious to the prefetching, and move it entirely to the executor. Because how else would you know what to prefetch? With index_getnext_tid() I can imagine fetching TIDs ahead, stashing them into a queue, and prefetching based on that. That's kinda what the patch does, except that it does it from inside index_getnext_tid(). But that does not work for index_getnext_slot(), because that already reads the heap tuples.

We could say prefetching only works for index_getnext_tid(), but that seems a bit weird, because index_getnext_slot() is what regular index scans use. (There's a patch to evaluate filters on the index, which switches index scans to index_getnext_tid(), so that'd make prefetching work for them too, but I'd ignore that here.) There are other index_getnext_slot() callers, and I don't think we should accept that prefetching simply does not work for those places (e.g. execIndexing/execReplication would benefit from prefetching, I think).

The patch just adds a "prefetcher" argument to index_getnext_*(), and the prefetching still happens there. I guess we could move most of the prefetcher typedefs/code somewhere else, but I don't quite see how it could be done entirely in the executor.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
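PS: To make the interface change concrete, the modified prototypes look roughly like this (just a sketch - the IndexPrefetch struct is the new prefetch state owned by the scan node, and the exact shape is still in flux):

/* rough sketch of the modified prototypes (details still in flux) */
typedef struct IndexPrefetchData IndexPrefetchData;
typedef IndexPrefetchData *IndexPrefetch;

/* passing NULL for the prefetcher means "no prefetching" */
ItemPointer index_getnext_tid(IndexScanDesc scan, ScanDirection direction,
                              IndexPrefetch prefetch);

bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
                        TupleTableSlot *slot, IndexPrefetch prefetch);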
On Sat, Dec 9, 2023 at 1:08 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > But there's a layering problem that I don't know how to solve - I don't > see how we could make indexam.c entirely oblivious to the prefetching, > and move it entirely to the executor. Because how else would you know > what to prefetch? Yeah, that seems impossible. Some thoughts: * I think perhaps the subject line of this thread is misleading. It doesn't seem like there is any index prefetching going on here at all, and there couldn't be, unless you extended the index AM API with new methods. What you're actually doing is prefetching heap pages that will be needed by a scan of the index. I think this confusing naming has propagated itself into some parts of the patch, e.g. index_prefetch() reads *from the heap* which is not at all clear from the comment saying "Prefetch the TID, unless it's sequential or recently prefetched." You're not prefetching the TID: you're prefetching the heap tuple to which the TID points. That's not an academic distinction IMHO -- the TID would be stored in the index, so if we were prefetching the TID, we'd have to be reading index pages, not heap pages. * Regarding layering, my first thought was that the changes to index_getnext_tid() and index_getnext_slot() are sensible: read ahead by some number of TIDs, keep the TIDs you've fetched in an array someplace, use that to drive prefetching of blocks on disk, and return the previously-read TIDs from the queue without letting the caller know that the queue exists. I think that's the obvious design for a feature of this type, to the point where I don't really see that there's a viable alternative design. Driving something down into the individual index AMs would make sense if you wanted to prefetch *from the indexes*, but it's unnecessary otherwise, and best avoided. * But that said, the skip_all_visible flag passed down to index_prefetch() looks like a VERY strong sign that the layering here is not what it should be. Right now, when some code calls index_getnext_tid(), that function does not need to know or care whether the caller is going to fetch the heap tuple or not. But with this patch, the code does need to care. So knowledge of the executor concept of an index-only scan trickles down into indexam.c, which now has to be able to make decisions that are consistent with the ones that the executor will make. That doesn't seem good at all. * I think it might make sense to have two different prefetching schemes. Ideally they could share some structure. If a caller is using index_getnext_slot(), then it's easy for prefetching to be fully transparent. The caller can just ask for TIDs and the prefetching distance and TID queue can be fully under the control of something that is hidden from the caller. But when using index_getnext_tid(), the caller needs to have an opportunity to evaluate each TID and decide whether we even want the heap tuple. If yes, then we feed that TID to the prefetcher; if no, we don't. That way, we're not replicating executor logic in lower-level code. However, that also means that the IOS logic needs to be aware that this TID queue exists and interact with whatever controls the prefetch distance. Perhaps after calling index_getnext_tid() you call index_prefetcher_put_tid(prefetcher, tid, bool fetch_heap_tuple) and then you call index_prefetcher_get_tid() to drain the queue. Perhaps also the prefetcher has a "fill" callback that gets invoked when the TID queue isn't as full as the prefetcher wants it to be. 
Then index_getnext_slot() can just install a trivial fill callback that says index_prefetecher_put_tid(prefetcher, index_getnext_tid(...), true), but IOS can use a more sophisticated callback that checks the VM to determine what to pass for the third argument. * I realize that I'm being a little inconsistent in what I just said, because in the first bullet point I said that this wasn't really index prefetching, and now I'm proposing function names that still start with index_prefetch. It's not entirely clear to me what the best thing to do about the terminology is here -- could it be a heap prefetcher, or a TID prefetcher, or an index scan prefetcher? I don't really know, but whatever we can do to make the naming more clear seems like a really good idea. Maybe there should be a clearer separation between the queue of TIDs that we're going to return from the index and the queue of blocks that we want to prefetch to get the corresponding heap tuples -- making that separation crisper might ease some of the naming issues. * Not that I want to be critical because I think this is a great start on an important project, but it does look like there's an awful lot of stuff here that still needs to be sorted out before it would be reasonable to think of committing this, both in terms of design decisions and just general polish. There's a lot of stuff marked with XXX and I think that's great because most of those seem to be good questions but that does leave the, err, small problem of figuring out the answers. index_prefetch_is_sequential() makes me really nervous because it seems to depend an awful lot on whether the OS is doing prefetching, and how the OS is doing prefetching, and I think those might not be consistent across all systems and kernel versions. Similarly with index_prefetch(). There's a lot of "magical" assumptions here. Even index_prefetch_add_cache() has this problem -- the function assumes that it's OK if we sometimes fail to detect a duplicate prefetch request, which makes sense, but under what circumstances is it necessary to detect duplicates and in what cases is it optional? The function comments are silent about that, which makes it hard to assess whether the algorithm is good enough. * In terms of polish, one thing I noticed is that index_getnext_slot() calls index_prefetch_tids() even when scan->xs_heap_continue is set, which seems like it must be a waste, since we can't really need to kick off more prefetch requests halfway through a HOT chain referenced by a single index tuple, can we? Also, blks_prefetch_rounds doesn't seem to be used anywhere, and neither that nor blks_prefetches are documented. In fact there's no new documentation at all, which seems probably not right. That's partly because there are no new GUCs, which I feel like typically for a feature like this would be the place where the feature behavior would be mentioned in the documentation. I don't think it's a good idea to tie the behavior of this feature to effective_io_concurrency partly because it's usually a bad idea to make one setting control multiple different things, but perhaps even more because effective_io_concurrency doesn't actually work in a useful way AFAICT and people typically have to set it to some very artificially large value compared to how much real I/O parallelism they have. So probably there should be new GUCs with hopefully-better semantics, but at least the documentation for any existing ones would need updating, I would think. -- Robert Haas EDB: http://www.enterprisedb.com
On 12/18/23 22:00, Robert Haas wrote: > On Sat, Dec 9, 2023 at 1:08 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> But there's a layering problem that I don't know how to solve - I don't >> see how we could make indexam.c entirely oblivious to the prefetching, >> and move it entirely to the executor. Because how else would you know >> what to prefetch? > > Yeah, that seems impossible. > > Some thoughts: > > * I think perhaps the subject line of this thread is misleading. It > doesn't seem like there is any index prefetching going on here at all, > and there couldn't be, unless you extended the index AM API with new > methods. What you're actually doing is prefetching heap pages that > will be needed by a scan of the index. I think this confusing naming > has propagated itself into some parts of the patch, e.g. > index_prefetch() reads *from the heap* which is not at all clear from > the comment saying "Prefetch the TID, unless it's sequential or > recently prefetched." You're not prefetching the TID: you're > prefetching the heap tuple to which the TID points. That's not an > academic distinction IMHO -- the TID would be stored in the index, so > if we were prefetching the TID, we'd have to be reading index pages, > not heap pages. Yes, that's a fair complaint. I think the naming is mostly obsolete - the prefetching initially happened way way lower - in the index AMs. It was prefetching the heap pages, ofc, but it kinda seemed reasonable to call it "index prefetching". And even now it's called from indexam.c where most functions start with "index_". But I'll think about some better / clearer name. > > * Regarding layering, my first thought was that the changes to > index_getnext_tid() and index_getnext_slot() are sensible: read ahead > by some number of TIDs, keep the TIDs you've fetched in an array > someplace, use that to drive prefetching of blocks on disk, and return > the previously-read TIDs from the queue without letting the caller > know that the queue exists. I think that's the obvious design for a > feature of this type, to the point where I don't really see that > there's a viable alternative design. I agree. > Driving something down into the individual index AMs would make sense > if you wanted to prefetch *from the indexes*, but it's unnecessary > otherwise, and best avoided. > Right. In fact, the patch moved exactly in the opposite direction - it was originally done at the AM level, and moved up. First to indexam.c, then even more to the executor. > * But that said, the skip_all_visible flag passed down to > index_prefetch() looks like a VERY strong sign that the layering here > is not what it should be. Right now, when some code calls > index_getnext_tid(), that function does not need to know or care > whether the caller is going to fetch the heap tuple or not. But with > this patch, the code does need to care. So knowledge of the executor > concept of an index-only scan trickles down into indexam.c, which now > has to be able to make decisions that are consistent with the ones > that the executor will make. That doesn't seem good at all. > I agree the all_visible flag is a sign the abstraction is not quite right. I did that mostly to quickly verify whether the duplicate VM checks are causing the perf regression (and they are). Whatever the right abstraction is, it probably needs to do these VM checks only once. > * I think it might make sense to have two different prefetching > schemes. Ideally they could share some structure. 
If a caller is using > index_getnext_slot(), then it's easy for prefetching to be fully > transparent. The caller can just ask for TIDs and the prefetching > distance and TID queue can be fully under the control of something > that is hidden from the caller. But when using index_getnext_tid(), > the caller needs to have an opportunity to evaluate each TID and > decide whether we even want the heap tuple. If yes, then we feed that > TID to the prefetcher; if no, we don't. That way, we're not > replicating executor logic in lower-level code. However, that also > means that the IOS logic needs to be aware that this TID queue exists > and interact with whatever controls the prefetch distance. Perhaps > after calling index_getnext_tid() you call > index_prefetcher_put_tid(prefetcher, tid, bool fetch_heap_tuple) and > then you call index_prefetcher_get_tid() to drain the queue. Perhaps > also the prefetcher has a "fill" callback that gets invoked when the > TID queue isn't as full as the prefetcher wants it to be. Then > index_getnext_slot() can just install a trivial fill callback that > says index_prefetecher_put_tid(prefetcher, index_getnext_tid(...), > true), but IOS can use a more sophisticated callback that checks the > VM to determine what to pass for the third argument. > Yeah, after you pointed out the "leaky" abstraction, I also started to think about customizing the behavior using a callback. Not sure what exactly you mean by "fully transparent", but as I explained above I think we need to allow passing some information between the prefetcher and the executor - for example results of the visibility map checks in IOS.

I have imagined something like this:

nodeIndexscan / index_getnext_slot()
-> no callback, all TIDs are prefetched

nodeIndexonlyscan / index_getnext_tid()
-> callback checks VM for the TID, prefetches if not all-visible
-> the VM check result is stored in the queue along with the TID (but in an extensible way, so that other callbacks can store other stuff)
-> index_getnext_tid() also returns this extra information

So not that different from the WIP patch, but in a "generic" and extensible way. Instead of hard-coding the all-visible flag, there'd be some custom information. A bit like qsort_r() has a void* arg to pass custom context.

Or if you envisioned something different, could you elaborate a bit?

> * I realize that I'm being a little inconsistent in what I just said, > because in the first bullet point I said that this wasn't really index > prefetching, and now I'm proposing function names that still start > with index_prefetch. It's not entirely clear to me what the best thing > to do about the terminology is here -- could it be a heap prefetcher, > or a TID prefetcher, or an index scan prefetcher? I don't really know, > but whatever we can do to make the naming more clear seems like a > really good idea. Maybe there should be a clearer separation between > the queue of TIDs that we're going to return from the index and the > queue of blocks that we want to prefetch to get the corresponding heap > tuples -- making that separation crisper might ease some of the naming > issues. > I think if the code stays in indexam.c, it's sensible to keep the index_ prefix, but then also have a more appropriate rest of the name. For example it might be index_prefetch_heap_pages() or something like that. 
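To make that a bit more concrete, the kind of callback I have in mind might look roughly like this (purely an illustrative sketch - none of these names exist in the patch, and the details would surely change):

#include "postgres.h"
#include "access/relscan.h"
#include "access/visibilitymap.h"
#include "nodes/execnodes.h"
#include "storage/itemptr.h"

/*
 * Callback deciding whether the heap block for a TID should be prefetched.
 * It may stash custom per-TID information into "private" (e.g. the result
 * of the VM check for index-only scans), which the scan node then gets
 * back together with the TID.
 */
typedef bool (*IndexPrefetchCallback) (IndexScanDesc scan,
                                       ItemPointer tid,
                                       void *callback_arg,
                                       void *private);

/*
 * Plain index scans would pass no callback (prefetch everything), while
 * index-only scans would pass something like this:
 */
static bool
ios_prefetch_callback(IndexScanDesc scan, ItemPointer tid,
                      void *callback_arg, void *private)
{
    IndexOnlyScanState *node = (IndexOnlyScanState *) callback_arg;
    bool        all_visible;

    /* check (and remember) all-visibility of the heap page for this TID */
    all_visible = VM_ALL_VISIBLE(scan->heapRelation,
                                 ItemPointerGetBlockNumber(tid),
                                 &node->ioss_VMBuffer);

    /* stash the result, so the scan node does not have to repeat the check */
    *((bool *) private) = all_visible;

    /* prefetch the heap page only if it's not all-visible */
    return !all_visible;
}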
> * Not that I want to be critical because I think this is a great start > on an important project, but it does look like there's an awful lot of > stuff here that still needs to be sorted out before it would be > reasonable to think of committing this, both in terms of design > decisions and just general polish. There's a lot of stuff marked with > XXX and I think that's great because most of those seem to be good > questions but that does leave the, err, small problem of figuring out > the answers. Absolutely. I certainly don't claim this is close to commit ... > index_prefetch_is_sequential() makes me really nervous > because it seems to depend an awful lot on whether the OS is doing > prefetching, and how the OS is doing prefetching, and I think those > might not be consistent across all systems and kernel versions. If the OS does not have read-ahead, or it's not configured properly, then the patch does not perform worse than what we have now. I'm far more concerned about the opposite issue, i.e. causing regressions with OS-level read-ahead. And the check handles that well, I think. > Similarly with index_prefetch(). There's a lot of "magical" > assumptions here. Even index_prefetch_add_cache() has this problem -- > the function assumes that it's OK if we sometimes fail to detect a > duplicate prefetch request, which makes sense, but under what > circumstances is it necessary to detect duplicates and in what cases > is it optional? The function comments are silent about that, which > makes it hard to assess whether the algorithm is good enough. > I don't quite understand what problem with duplicates you envision here. Strictly speaking, we don't need to detect/prevent duplicates - it's just that if you do posix_fadvise() for a block that's already in memory, it's overhead / wasted time. The whole point is to not do that very often. In this sense it's entirely optional, but desirable. I'm in no way claiming the comments are perfect, ofc. > * In terms of polish, one thing I noticed is that index_getnext_slot() > calls index_prefetch_tids() even when scan->xs_heap_continue is set, > which seems like it must be a waste, since we can't really need to > kick off more prefetch requests halfway through a HOT chain referenced > by a single index tuple, can we? Yeah, I think that's true. > Also, blks_prefetch_rounds doesn't > seem to be used anywhere, and neither that nor blks_prefetches are > documented. In fact there's no new documentation at all, which seems > probably not right. That's partly because there are no new GUCs, which > I feel like typically for a feature like this would be the place where > the feature behavior would be mentioned in the documentation. That's mostly because the explain fields were added to help during development. I'm not sure we actually want to make them part of EXPLAIN. > I don't > think it's a good idea to tie the behavior of this feature to > effective_io_concurrency partly because it's usually a bad idea to > make one setting control multiple different things, but perhaps even > more because effective_io_concurrency doesn't actually work in a > useful way AFAICT and people typically have to set it to some very > artificially large value compared to how much real I/O parallelism > they have. So probably there should be new GUCs with hopefully-better > semantics, but at least the documentation for any existing ones would > need updating, I would think. > I really don't want to have multiple knobs. 
At this point we have three GUCs, each tuning prefetching for a fairly large part of the system:

  effective_io_concurrency   = regular queries
  maintenance_io_concurrency = utility commands
  recovery_prefetch          = recovery / PITR

This seems sensible, but I really don't want many more GUCs tuning prefetching for different executor nodes or something like that. If we have issues with how effective_io_concurrency works (and I'm not sure that's actually true), then perhaps we should fix that rather than inventing new GUCs.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Dec 19, 2023 at 8:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Whatever the right abstraction is, it probably needs to do these VM > checks only once. Makes sense. > Yeah, after you pointed out the "leaky" abstraction, I also started to > think about customizing the behavior using a callback. Not sure what > exactly you mean by "fully transparent" but as I explained above I think > we need to allow passing some information between the prefetcher and the > executor - for example results of the visibility map checks in IOS. Agreed. > I have imagined something like this: > > nodeIndexscan / index_getnext_slot() > -> no callback, all TIDs are prefetched > > nodeIndexonlyscan / index_getnext_tid() > -> callback checks VM for the TID, prefetches if not all-visible > -> the VM check result is stored in the queue with the VM (but in an > extensible way, so that other callback can store other stuff) > -> index_getnext_tid() also returns this extra information > > So not that different from the WIP patch, but in a "generic" and > extensible way. Instead of hard-coding the all-visible flag, there'd be > a something custom information. A bit like qsort_r() has a void* arg to > pass custom context. > > Or if envisioned something different, could you elaborate a bit? I can't totally follow the sketch you give above, but I think we're thinking along similar lines, at least. > I think if the code stays in indexam.c, it's sensible to keep the index_ > prefix, but then also have a more appropriate rest of the name. For > example it might be index_prefetch_heap_pages() or something like that. Yeah, that's not a bad idea. > > index_prefetch_is_sequential() makes me really nervous > > because it seems to depend an awful lot on whether the OS is doing > > prefetching, and how the OS is doing prefetching, and I think those > > might not be consistent across all systems and kernel versions. > > If the OS does not have read-ahead, or it's not configured properly, > then the patch does not perform worse than what we have now. I'm far > more concerned about the opposite issue, i.e. causing regressions with > OS-level read-ahead. And the check handles that well, I think. I'm just not sure how much I believe that it's going to work well everywhere. I mean, I have no evidence that it doesn't, it just kind of looks like guesswork to me. For instance, the behavior of the algorithm depends heavily on PREFETCH_QUEUE_HISTORY and PREFETCH_SEQ_PATTERN_BLOCKS, but those are just magic numbers. Who is to say that on some system or workload you didn't test the required values aren't entirely different, or that the whole algorithm doesn't need rethinking? Maybe we can't really answer that question perfectly, but the patch doesn't really explain the reasoning behind this choice of algorithm. > > Similarly with index_prefetch(). There's a lot of "magical" > > assumptions here. Even index_prefetch_add_cache() has this problem -- > > the function assumes that it's OK if we sometimes fail to detect a > > duplicate prefetch request, which makes sense, but under what > > circumstances is it necessary to detect duplicates and in what cases > > is it optional? The function comments are silent about that, which > > makes it hard to assess whether the algorithm is good enough. > > I don't quite understand what problem with duplicates you envision here. 
> Strictly speaking, we don't need to detect/prevent duplicates - it's > just that if you do posix_fadvise() for a block that's already in > memory, it's overhead / wasted time. The whole point is to not do that > very often. In this sense it's entirely optional, but desirable. Right ... but the patch sets up some data structure that will eliminate duplicates in some circumstances and fail to eliminate them in others. So it's making a judgement that the things it catches are the cases that are important enough that we need to catch them, and the things that it doesn't catch are cases that aren't particularly important to catch. Here again, PREFETCH_LRU_SIZE and PREFETCH_LRU_COUNT seem like they will have a big impact, but why these values? The comments suggest that it's because we want to cover ~8MB of data, but it's not clear why that should be the right amount of data to cover. My naive thought is that we'd want to avoid prefetching a block during the time between when we prefetched it and when we later read it, but then the value that is magically 8MB here should really be replaced by the operative prefetch distance. > I really don't want to have multiple knobs. At this point we have three > GUCs, each tuning prefetching for a fairly large part of the system: > > effective_io_concurrency = regular queries > maintenance_io_concurrency = utility commands > recovery_prefetch = recovery / PITR > > This seems sensible, but I really don't want many more GUCs tuning > prefetching for different executor nodes or something like that. > > If we have issues with how effective_io_concurrency works (and I'm not > sure that's actually true), then perhaps we should fix that rather than > inventing new GUCs. Well, that would very possibly be a good idea, but I still think using the same GUC for two different purposes is likely to cause trouble. I think what effective_io_concurrency currently controls is basically the heap prefetch distance for bitmap scans, and what you want to control here is the heap prefetch distance for index scans. If those are necessarily related in some understandable way (e.g. always the same, one twice the other, one the square of the other) then it's fine to use the same parameter for both, but it's not clear to me that this is the case. I fear someone will find that if they crank up effective_io_concurrency high enough to get the amount of prefetching they want for bitmap scans, it will be too much for index scans, or the other way around. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Dec 20, 2023 at 7:11 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>

I was going through the patch to understand the idea; a couple of observations:

--
+ for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+ {
+     entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+     /* Is this the oldest prefetch request in this LRU? */
+     if (entry->request < oldestRequest)
+     {
+         oldestRequest = entry->request;
+         oldestIndex = i;
+     }
+
+     /*
+      * If the entry is unused (identified by request being set to 0),
+      * we're done. Notice the field is uint64, so empty entry is
+      * guaranteed to be the oldest one.
+      */
+     if (entry->request == 0)
+         continue;

If 'entry->request == 0', then we should break instead of continue, right?

---

/*
 * Used to detect sequential patterns (and disable prefetching).
 */
#define PREFETCH_QUEUE_HISTORY 8
#define PREFETCH_SEQ_PATTERN_BLOCKS 4

If for sequential patterns we search only 4 blocks, then why are we maintaining history for 8 blocks?

---

+ *
+ * XXX Perhaps this should be tied to effective_io_concurrency somehow?
+ *
+ * XXX Could it be harmful that we read the queue backwards? Maybe memory
+ * prefetching works better for the forward direction?
+ */
+ for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)

Correct, I think if we fetch this forward it will have an advantage with memory prefetching.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 12/20/23 20:09, Robert Haas wrote: > On Tue, Dec 19, 2023 at 8:41 PM Tomas Vondra > ... >> I have imagined something like this: >> >> nodeIndexscan / index_getnext_slot() >> -> no callback, all TIDs are prefetched >> >> nodeIndexonlyscan / index_getnext_tid() >> -> callback checks VM for the TID, prefetches if not all-visible >> -> the VM check result is stored in the queue with the VM (but in an >> extensible way, so that other callback can store other stuff) >> -> index_getnext_tid() also returns this extra information >> >> So not that different from the WIP patch, but in a "generic" and >> extensible way. Instead of hard-coding the all-visible flag, there'd be >> a something custom information. A bit like qsort_r() has a void* arg to >> pass custom context. >> >> Or if envisioned something different, could you elaborate a bit? > > I can't totally follow the sketch you give above, but I think we're > thinking along similar lines, at least. > Yeah, it's hard to discuss vague descriptions of code that does not exist yet. I'll try to do the actual patch, then we can discuss. >>> index_prefetch_is_sequential() makes me really nervous >>> because it seems to depend an awful lot on whether the OS is doing >>> prefetching, and how the OS is doing prefetching, and I think those >>> might not be consistent across all systems and kernel versions. >> >> If the OS does not have read-ahead, or it's not configured properly, >> then the patch does not perform worse than what we have now. I'm far >> more concerned about the opposite issue, i.e. causing regressions with >> OS-level read-ahead. And the check handles that well, I think. > > I'm just not sure how much I believe that it's going to work well > everywhere. I mean, I have no evidence that it doesn't, it just kind > of looks like guesswork to me. For instance, the behavior of the > algorithm depends heavily on PREFETCH_QUEUE_HISTORY and > PREFETCH_SEQ_PATTERN_BLOCKS, but those are just magic numbers. Who is > to say that on some system or workload you didn't test the required > values aren't entirely different, or that the whole algorithm doesn't > need rethinking? Maybe we can't really answer that question perfectly, > but the patch doesn't really explain the reasoning behind this choice > of algorithm. > You're right a lot of this is a guesswork. I don't think we can do much better, because it depends on stuff that's out of our control - each OS may do things differently, or perhaps it's just configured differently. But I don't think this is really a serious issue - all the read-ahead implementations need to work about the same, because they are meant to work in a transparent way. So it's about deciding at which point we think this is a sequential pattern. Yes, the OS may use a slightly different threshold, but the exact value does not really matter - in the worst case we prefetch a couple more/fewer blocks. The OS read-ahead can't really prefetch anything except sequential cases, so the whole question is "When does the access pattern get sequential enough?". I don't think there's a perfect answer, and I don't think we need a perfect one - we just need to be reasonably close. Also, while I don't want to lazily dismiss valid cases that might be affected by this, I think that sequential access for index paths is not that common (with the exception of clustered indexes). FWIW bitmap index scans have exactly the same "problem" except that no one cares about it because that's how it worked from the start, so it's not considered a regression. 
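For a bit more context, the detection is essentially just comparing the block number against the last few blocks we've queued for prefetching - roughly like this (a simplified sketch, not the exact patch code; the field names are illustrative):

#include "postgres.h"
#include "storage/block.h"

#define PREFETCH_QUEUE_HISTORY      8
#define PREFETCH_SEQ_PATTERN_BLOCKS 4

typedef struct IndexPrefetchData
{
    /* small circular history of recently requested block numbers */
    BlockNumber blockItems[PREFETCH_QUEUE_HISTORY];
    uint64      blockIndex;     /* number of blocks added so far */
    /* ... prefetch distance, LRU of recently prefetched blocks, ... */
} IndexPrefetchData;

/*
 * Add the block to the history and determine whether the last few requests
 * form a sequential run - if they do, we skip the explicit prefetch and
 * rely on the OS read-ahead instead.
 */
static bool
prefetch_is_sequential(IndexPrefetchData *prefetch, BlockNumber block)
{
    prefetch->blockItems[prefetch->blockIndex % PREFETCH_QUEUE_HISTORY] = block;
    prefetch->blockIndex++;

    /* not enough history yet, assume it's not sequential */
    if (prefetch->blockIndex < PREFETCH_SEQ_PATTERN_BLOCKS)
        return false;

    /* were the preceding requests for exactly block-1, block-2, ...? */
    for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
    {
        uint64      idx = (prefetch->blockIndex - 1 - i) % PREFETCH_QUEUE_HISTORY;

        if (prefetch->blockItems[idx] + i != block)
            return false;
    }

    return true;
}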
>>> Similarly with index_prefetch(). There's a lot of "magical" >>> assumptions here. Even index_prefetch_add_cache() has this problem -- >>> the function assumes that it's OK if we sometimes fail to detect a >>> duplicate prefetch request, which makes sense, but under what >>> circumstances is it necessary to detect duplicates and in what cases >>> is it optional? The function comments are silent about that, which >>> makes it hard to assess whether the algorithm is good enough. >> >> I don't quite understand what problem with duplicates you envision here. >> Strictly speaking, we don't need to detect/prevent duplicates - it's >> just that if you do posix_fadvise() for a block that's already in >> memory, it's overhead / wasted time. The whole point is to not do that >> very often. In this sense it's entirely optional, but desirable. > > Right ... but the patch sets up some data structure that will > eliminate duplicates in some circumstances and fail to eliminate them > in others. So it's making a judgement that the things it catches are > the cases that are important enough that we need to catch them, and > the things that it doesn't catch are cases that aren't particularly > important to catch. Here again, PREFETCH_LRU_SIZE and > PREFETCH_LRU_COUNT seem like they will have a big impact, but why > these values? The comments suggest that it's because we want to cover > ~8MB of data, but it's not clear why that should be the right amount > of data to cover. My naive thought is that we'd want to avoid > prefetching a block during the time between we had prefetched it and > when we later read it, but then the value that is here magically 8MB > should really be replaced by the operative prefetch distance. > True. Ideally we'd not issue prefetch request for data that's already in memory - either in shared buffers or page cache (or whatever). And we already do that for shared buffers, but not for page cache. The preadv2 experiment was an attempt to do that, but it's too expensive to help. So we have to approximate, and the only way I can think of is checking if we recently prefetched that block. Which is the whole point of this simple cache - remembering which blocks we prefetched, so that we don't prefetch them over and over again. I don't understand what you mean by "cases that are important enough". In a way, all the blocks are equally important, with exactly the same impact of making the wrong decision. You're certainly right the 8MB is a pretty arbitrary value, though. It seemed reasonable, so I used that, but I might just as well use 32MB or some other sensible value. Ultimately, any hard-coded value is going to be wrong, but the negative consequences are a bit asymmetrical. If the cache is too small, we may end up doing prefetches for data that's already in cache. If it's too large, we may not prefetch data that's not in memory at that point. Obviously, the latter case has much more severe impact, but it depends on the exact workload / access pattern etc. The only "perfect" solution would be to actually check the page cache, but well - that seems to be fairly expensive. What I was envisioning was something self-tuning, based on the I/O we may do later. If the prefetcher decides to prefetch something, but finds it's already in cache, we'd increase the distance, to remember more blocks. Likewise, if a block is not prefetched but then requires I/O later, decrease the distance. That'd make it adaptive, but I don't think we actually have the info about I/O. 
A bigger "flaw" is that these caches are per-backend, so there's no way to check if a block was recently prefetched by some other backend. I actually wonder if maybe this cache should be in shared memory, but I haven't tried. Alternatively, I was thinking about moving the prefetches into a separate worker process (or multiple workers), so we'd just queue the request and all the overhead would be done by the worker. The main problem is the overhead of calling posix_fadvise() for blocks that are already in memory, and this would just move it to a separate backend. I wonder if that might even make the custom cache unnecessary / optional. AFAICS this seems similar to some of the AIO patch, I wonder what that plans to do. I need to check. >> I really don't want to have multiple knobs. At this point we have three >> GUCs, each tuning prefetching for a fairly large part of the system: >> >> effective_io_concurrency = regular queries >> maintenance_io_concurrency = utility commands >> recovery_prefetch = recovery / PITR >> >> This seems sensible, but I really don't want many more GUCs tuning >> prefetching for different executor nodes or something like that. >> >> If we have issues with how effective_io_concurrency works (and I'm not >> sure that's actually true), then perhaps we should fix that rather than >> inventing new GUCs. > > Well, that would very possibly be a good idea, but I still think using > the same GUC for two different purposes is likely to cause trouble. I > think what effective_io_concurrency currently controls is basically > the heap prefetch distance for bitmap scans, and what you want to > control here is the heap prefetch distance for index scans. If those > are necessarily related in some understandable way (e.g. always the > same, one twice the other, one the square of the other) then it's fine > to use the same parameter for both, but it's not clear to me that this > is the case. I fear someone will find that if they crank up > effective_io_concurrency high enough to get the amount of prefetching > they want for bitmap scans, it will be too much for index scans, or > the other way around. > I understand, but I think we should really try to keep the number of knobs as low as possible, unless we actually have very good arguments for having separate GUCs. And I don't think we have that. This is very much about how many concurrent requests the storage can handle (or rather requires to benefit from the capabilities), and that's pretty orthogonal to which operation is generating the requests. I think this is pretty similar to what we do with work_mem - there's one value for all possible parts of the query plan, no matter if it's sort, group by, or something else. We do have separate limits for maintenance commands, because that's a different matter, and we have the same for the two I/O GUCs. If we come to the realization that really need two GUCs, fine with me. But at this point I don't see a reason to do that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/21/23 07:49, Dilip Kumar wrote: > On Wed, Dec 20, 2023 at 7:11 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> > I was going through to understand the idea, couple of observations > > -- > + for (int i = 0; i < PREFETCH_LRU_SIZE; i++) > + { > + entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i]; > + > + /* Is this the oldest prefetch request in this LRU? */ > + if (entry->request < oldestRequest) > + { > + oldestRequest = entry->request; > + oldestIndex = i; > + } > + > + /* > + * If the entry is unused (identified by request being set to 0), > + * we're done. Notice the field is uint64, so empty entry is > + * guaranteed to be the oldest one. > + */ > + if (entry->request == 0) > + continue; > > If the 'entry->request == 0' then we should break instead of continue, right? > Yes, I think that's true. The small LRU caches are accessed/filled linearly, so once we find an empty entry, all following entries are going to be empty too. I thought this shouldn't make any difference, because the LRUs are very small (only 8 entries, and I don't think we should make them larger). And it's going to go away once the cache gets full. But now that I think about it, maybe this could matter for small queries that only ever hit a couple rows. Hmmm, I'll have to check. Thanks for noticing this! > --- > /* > * Used to detect sequential patterns (and disable prefetching). > */ > #define PREFETCH_QUEUE_HISTORY 8 > #define PREFETCH_SEQ_PATTERN_BLOCKS 4 > > If for sequential patterns we search only 4 blocks then why we are > maintaining history for 8 blocks > > --- Right, I think there's no reason to keep these two separate constants. I believe this is a remnant from an earlier patch version which tried to do something smarter, but I ended up abandoning that. > > + * > + * XXX Perhaps this should be tied to effective_io_concurrency somehow? > + * > + * XXX Could it be harmful that we read the queue backwards? Maybe memory > + * prefetching works better for the forward direction? > + */ > + for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++) > > Correct, I think if we fetch this forward it will have an advantage > with memory prefetching. > OK, although we only really have a couple uint32 values, so it should be the same cacheline I guess. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
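PS: For clarity, with that fix the lookup loop from the fragment you quoted would become something like this (just a sketch of the corrected excerpt, not a standalone piece of code):

for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
{
    entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];

    /* Is this the oldest prefetch request in this LRU? */
    if (entry->request < oldestRequest)
    {
        oldestRequest = entry->request;
        oldestIndex = i;
    }

    /*
     * If the entry is unused (request == 0), all the following entries must
     * be unused too, because the LRU is filled left to right - so stop
     * looking. The empty entry is already recorded as the oldest one above.
     */
    if (entry->request == 0)
        break;
}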
Hi, On 2023-12-09 19:08:20 +0100, Tomas Vondra wrote: > But there's a layering problem that I don't know how to solve - I don't > see how we could make indexam.c entirely oblivious to the prefetching, > and move it entirely to the executor. Because how else would you know > what to prefetch? > With index_getnext_tid() I can imagine fetching XIDs ahead, stashing > them into a queue, and prefetching based on that. That's kinda what the > patch does, except that it does it from inside index_getnext_tid(). But > that does not work for index_getnext_slot(), because that already reads > the heap tuples. > We could say prefetching only works for index_getnext_tid(), but that > seems a bit weird because that's what regular index scans do. (There's a > patch to evaluate filters on index, which switches index scans to > index_getnext_tid(), so that'd make prefetching work too, but I'd ignore > that here. I think we should just switch plain index scans to index_getnext_tid(). It's one of the primary places triggering index scans, so a few additional lines don't seem problematic. I continue to think that we should not have split plain and index only scans into separate files... > There are other index_getnext_slot() callers, and I don't > think we should accept does not work for those places seems wrong (e.g. > execIndexing/execReplication would benefit from prefetching, I think). I don't think it'd be a problem to have to opt into supporting prefetching. There are plenty of places where it doesn't really seem likely to be useful, e.g. doing prefetching during syscache lookups is very likely just a waste of time. I don't think e.g. execReplication is likely to benefit from prefetching - you're just fetching a single row after all. You'd need a lot of dead rows to make it beneficial. I think it's similar in execIndexing.c. I suspect we should work on providing executor nodes with some estimates about the number of rows that are likely to be consumed. If an index scan is under a LIMIT 1, we shouldn't prefetch. Similarly for sequential scans, with the infrastructure in https://postgr.es/m/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com Greetings, Andres Freund
Hi, On 2023-12-21 13:30:42 +0100, Tomas Vondra wrote: > You're right a lot of this is a guesswork. I don't think we can do much > better, because it depends on stuff that's out of our control - each OS > may do things differently, or perhaps it's just configured differently. > > But I don't think this is really a serious issue - all the read-ahead > implementations need to work about the same, because they are meant to > work in a transparent way. > > So it's about deciding at which point we think this is a sequential > pattern. Yes, the OS may use a slightly different threshold, but the > exact value does not really matter - in the worst case we prefetch a > couple more/fewer blocks. > > The OS read-ahead can't really prefetch anything except sequential > cases, so the whole question is "When does the access pattern get > sequential enough?". I don't think there's a perfect answer, and I don't > think we need a perfect one - we just need to be reasonably close. For the streaming read interface (initially backed by fadvise, to then be replaced by AIO) we found that it's clearly necessary to avoid fadvises in cases of actual sequential IO - the overhead otherwise leads to easily reproducible regressions. So I don't think we have much choice. > Also, while I don't want to lazily dismiss valid cases that might be > affected by this, I think that sequential access for index paths is not > that common (with the exception of clustered indexes). I think sequential access is common in other cases as well. There are lots of indexes where heap tids are almost perfectly correlated with index entries - consider insert-only tables with serial PKs or inserted_at timestamp columns. Even leaving those aside, for indexes with many entries for the same key, we sort by tid these days, which will also result in "runs" of sequential access. > Obviously, the latter case has much more severe impact, but it depends > on the exact workload / access pattern etc. The only "perfect" solution > would be to actually check the page cache, but well - that seems to be > fairly expensive. > What I was envisioning was something self-tuning, based on the I/O we > may do later. If the prefetcher decides to prefetch something, but finds > it's already in cache, we'd increase the distance, to remember more > blocks. Likewise, if a block is not prefetched but then requires I/O > later, decrease the distance. That'd make it adaptive, but I don't think > we actually have the info about I/O. How would the prefetcher know that the data wasn't in cache? > Alternatively, I was thinking about moving the prefetches into a > separate worker process (or multiple workers), so we'd just queue the > request and all the overhead would be done by the worker. The AIO patchset provides this. > AFAICS this seems similar to some of the AIO patch, I wonder what that > plans to do. I need to check. Yes, most of this exists there. The difference is that with AIO you don't need to prefetch, as you can just initiate the IO for real, and wait for it to complete. Greetings, Andres Freund
On 12/21/23 14:43, Andres Freund wrote: > Hi, > > On 2023-12-21 13:30:42 +0100, Tomas Vondra wrote: >> You're right a lot of this is a guesswork. I don't think we can do much >> better, because it depends on stuff that's out of our control - each OS >> may do things differently, or perhaps it's just configured differently. >> >> But I don't think this is really a serious issue - all the read-ahead >> implementations need to work about the same, because they are meant to >> work in a transparent way. >> >> So it's about deciding at which point we think this is a sequential >> pattern. Yes, the OS may use a slightly different threshold, but the >> exact value does not really matter - in the worst case we prefetch a >> couple more/fewer blocks. >> >> The OS read-ahead can't really prefetch anything except sequential >> cases, so the whole question is "When does the access pattern get >> sequential enough?". I don't think there's a perfect answer, and I don't >> think we need a perfect one - we just need to be reasonably close. > > For the streaming read interface (initially backed by fadvise, to then be > replaced by AIO) we found that it's clearly necessary to avoid fadvises in > cases of actual sequential IO - the overhead otherwise leads to easily > reproducible regressions. So I don't think we have much choice. > Yeah, the regressions are pretty easy to demonstrate. In fact, I didn't have such detection in the first patch, but after the first round of benchmarks it became obvious it's needed. > >> Also, while I don't want to lazily dismiss valid cases that might be >> affected by this, I think that sequential access for index paths is not >> that common (with the exception of clustered indexes). > > I think sequential access is common in other cases as well. There's lots of > indexes where heap tids are almost perfectly correlated with index entries, > consider insert only insert-only tables and serial PKs or inserted_at > timestamp columns. Even leaving those aside, for indexes with many entries > for the same key, we sort by tid these days, which will also result in > "runs" of sequential access. > True. I should have thought about those cases. > >> Obviously, the latter case has much more severe impact, but it depends >> on the exact workload / access pattern etc. The only "perfect" solution >> would be to actually check the page cache, but well - that seems to be >> fairly expensive. > >> What I was envisioning was something self-tuning, based on the I/O we >> may do later. If the prefetcher decides to prefetch something, but finds >> it's already in cache, we'd increase the distance, to remember more >> blocks. Likewise, if a block is not prefetched but then requires I/O >> later, decrease the distance. That'd make it adaptive, but I don't think >> we actually have the info about I/O. > > How would the prefetcher know that hte data wasn't in cache? > I don't think there's a good way to do that, unfortunately, or at least I'm not aware of it. That's what I meant by "we don't have the info" at the end. Which is why I haven't tried implementing it. The only "solution" I could come up with was some sort of "timing" for the I/O requests and deducing what was cached. Not great, of course. > >> Alternatively, I was thinking about moving the prefetches into a >> separate worker process (or multiple workers), so we'd just queue the >> request and all the overhead would be done by the worker. 
The main >> problem is the overhead of calling posix_fadvise() for blocks that are >> already in memory, and this would just move it to a separate backend. I >> wonder if that might even make the custom cache unnecessary / optional. > > The AIO patchset provides this. > OK, I guess it's time for me to take a look at the patch again. > >> AFAICS this seems similar to some of the AIO patch, I wonder what that >> plans to do. I need to check. > > Yes, most of this exists there. The difference that with the AIO you don't > need to prefetch, as you can just initiate the IO for real, and wait for it to > complete. > Right, although the line where things stop being "prefetch" and become "async" seems a bit unclear to me - perhaps it's more a matter of point of view. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/21/23 14:27, Andres Freund wrote: > Hi, > > On 2023-12-09 19:08:20 +0100, Tomas Vondra wrote: >> But there's a layering problem that I don't know how to solve - I don't >> see how we could make indexam.c entirely oblivious to the prefetching, >> and move it entirely to the executor. Because how else would you know >> what to prefetch? > >> With index_getnext_tid() I can imagine fetching XIDs ahead, stashing >> them into a queue, and prefetching based on that. That's kinda what the >> patch does, except that it does it from inside index_getnext_tid(). But >> that does not work for index_getnext_slot(), because that already reads >> the heap tuples. > >> We could say prefetching only works for index_getnext_tid(), but that >> seems a bit weird because that's what regular index scans do. (There's a >> patch to evaluate filters on index, which switches index scans to >> index_getnext_tid(), so that'd make prefetching work too, but I'd ignore >> that here. > > I think we should just switch plain index scans to index_getnext_tid(). It's > one of the primary places triggering index scans, so a few additional lines > don't seem problematic. > > I continue to think that we should not have split plain and index only scans > into separate files... > I do agree with that opinion. Not just because of this prefetching thread, but also because of the discussions about index-only filters in a nearby thread. > >> There are other index_getnext_slot() callers, and I don't >> think we should accept does not work for those places seems wrong (e.g. >> execIndexing/execReplication would benefit from prefetching, I think). > > I don't think it'd be a problem to have to opt into supporting > prefetching. There's plenty places where it doesn't really seem likely to be > useful, e.g. doing prefetching during syscache lookups is very likely just a > waste of time. > > I don't think e.g. execReplication is likely to benefit from prefetching - > you're just fetching a single row after all. You'd need a lot of dead rows to > make it beneficial. I think it's similar in execIndexing.c. > Yeah, systable scans are unlikely to benefit from prefetching of this type. I'm not sure about execIndexing/execReplication - it wasn't clear to me, but maybe you're right. > > I suspect we should work on providing executor nodes with some estimates about > the number of rows that are likely to be consumed. If an index scan is under a > LIMIT 1, we shoulnd't prefetch. Similar for sequential scan with the > infrastructure in > https://postgr.es/m/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com > Isn't this mostly addressed by the incremental ramp-up at the beginning? Even with target set to 1000, we only start prefetching 1, 2, 3, ... blocks ahead; it's not like we'll prefetch 1000 blocks right away. I did initially plan to also consider the number of rows we're expected to need, but I think it's actually harder than it might seem. With LIMIT, for example, we often don't know how selective the qual is - it's not like we can just stop prefetching after reading the first N TIDs. With other nodes it's good to remember those are just estimates - it'd be silly to be bitten both by a wrong estimate and by prefetching doing the wrong thing based on that estimate. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2023-12-21 16:20:45 +0100, Tomas Vondra wrote: > On 12/21/23 14:43, Andres Freund wrote: > >> AFAICS this seems similar to some of the AIO patch, I wonder what that > >> plans to do. I need to check. > > > > Yes, most of this exists there. The difference is that with AIO you don't > > need to prefetch, as you can just initiate the IO for real, and wait for it to > > complete. > > > > Right, although the line where things stop being "prefetch" and become > "async" seems a bit unclear to me / perhaps it's more a point of view. Agreed. What I meant with not needing prefetching was that you'd not use fadvise(), because it's better to instead just asynchronously read the data into shared buffers. That way you don't have the doubling of syscalls, and you need to care less about the buffering rate in the kernel. Greetings, Andres Freund
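To illustrate the "doubling of syscalls" point: with fadvise-based prefetching every block costs an advisory syscall plus the eventual read, whereas an AIO design issues the real read up front and later only waits for it to complete. A minimal sketch using plain POSIX calls (not PostgreSQL code):

    #include <fcntl.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    /* fadvise-style prefetching: two syscalls per block */
    static void
    prefetch_then_read(int fd, off_t offset, char *buffer)
    {
        /* syscall #1: hint the kernel to start reading this range into its cache */
        (void) posix_fadvise(fd, offset, BLOCK_SIZE, POSIX_FADV_WILLNEED);

        /* ... do other work while the kernel (hopefully) reads ahead ... */

        /* syscall #2: copy the (ideally now cached) data into our buffer */
        (void) pread(fd, buffer, BLOCK_SIZE, offset);
    }

With real asynchronous I/O the read itself is started in the first step and the data can land directly in shared buffers, so the second step becomes just a completion wait.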
On Thu, Dec 21, 2023 at 10:33 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I continue to think that we should not have split plain and index only scans > > into separate files... > > I do agree with that opinion. Not just because of this prefetching > thread, but also because of the discussions about index-only filters in > a nearby thread. For the record, in the original patch I submitted for this feature, it wasn't in separate files. If memory serves, Tom changed it. So don't blame me. :-) -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2023-12-21 11:00:34 -0500, Robert Haas wrote: > On Thu, Dec 21, 2023 at 10:33 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > I continue to think that we should not have split plain and index only scans > > > into separate files... > > > > I do agree with that opinion. Not just because of this prefetching > > thread, but also because of the discussions about index-only filters in > > a nearby thread. > > For the record, in the original patch I submitted for this feature, it > wasn't in separate files. If memory serves, Tom changed it. > > So don't blame me. :-) But I'd like you to feel guilty (no, not really) and fix it (yes, really) :) Greetings, Andres Freund
On Thu, Dec 21, 2023 at 11:08 AM Andres Freund <andres@anarazel.de> wrote: > But I'd like you to feel guilty (no, not really) and fix it (yes, really) :) Sadly, you're more likely to get the first one than you are to get the second one. I can't really see going back to revisit that decision as a basis for somebody else's new work -- it'd be better if the person doing the new work figured out what makes sense here. -- Robert Haas EDB: http://www.enterprisedb.com
On 12/21/23 18:14, Robert Haas wrote: > On Thu, Dec 21, 2023 at 11:08 AM Andres Freund <andres@anarazel.de> wrote: >> But I'd like you to feel guilty (no, not really) and fix it (yes, really) :) > > Sadly, you're more likely to get the first one than you are to get the > second one. I can't really see going back to revisit that decision as > a basis for somebody else's new work -- it'd be better if the person > doing the new work figured out what makes sense here. > I think it's a great example of "hindsight is 20/20". There were perfectly valid reasons to have two separate nodes, and it's not like these reasons somehow disappeared. It still is a perfectly reasonable decision. It's just that allowing index-only filters for regular index scans seems to eliminate pretty much all executor differences between the two nodes. But that's hard to predict - I certainly would not even have thought about that back when index-only scans were added. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Here's a somewhat reworked version of the patch. My initial goal was to see if it could adopt the StreamingRead API proposed in [1], but that turned out to be less straight-forward than I hoped, for two reasons: (1) The StreamingRead API seems to be designed for pages, but the index code naturally works with TIDs/tuples. Yes, the callbacks can associate the blocks with custom data (in this case that'd be the TID), but it seemed a bit strange ... (2) The place adding requests to the StreamingRead queue is pretty far from the place actually reading the pages - for prefetching, the requests would be generated in nodeIndexscan, but the page reading happens somewhere deep in index_fetch_heap/heapam_index_fetch_tuple. Sure, the TIDs would come from a callback, so it's a bit as if the requests were generated in heapam_index_fetch_tuple - but it has no idea StreamingRead exists, so where would it get it? We might teach it about it, but what if there are multiple places calling index_fetch_heap()? Not all of them may be using StreamingRead (only indexscans would do that). Or if there are multiple index scans, there'd need to be separate StreamingRead queues, right? In any case, I felt a bit out of my depth here, and I chose not to do all this work without discussing the direction here. (Also, see the point about cursors and xs_heap_continue a bit later in this post.) I did however like the general StreamingRead API - how it splits the work between the API and the callback. The patch used to do everything, which meant it hardcoded a lot of the IOS-specific logic etc. I did plan to have some sort of "callback" for reading from the queue, but that didn't quite solve this issue - a lot of the stuff remained hard-coded. But the StreamingRead API made me realize that having a callback for the first phase (that adds requests to the queue) would fix that. So I did that - there's now one simple callback for index scans, and a bit more complex callback for index-only scans. Thanks to this the hard-coded stuff mostly disappears, which is good. Perhaps a bigger change is that I decided to move this into a separate API on top of indexam.c. The original idea was to integrate this into index_getnext_tid/index_getnext_slot, so that all callers benefit from the prefetching automatically. Which would be nice, but it also meant it'd need to happen in the indexam.c code, which seemed dirty. This patch introduces an API similar to StreamingRead. It calls the indexam.c stuff, but does all the prefetching on top of it, not in it. If a place calling index_getnext_tid() wants to allow prefetching, it needs to switch to IndexPrefetchNext(). (There's no function that would replace index_getnext_slot, at the moment. Maybe there should be.) Note 1: The IndexPrefetch name is a bit misleading, because it's used even with prefetching disabled - all index reads from the index scan happen through it. Maybe it should be called IndexReader or something like that. Note 2: I left the code in indexam.c for now, but in principle it could (should) be moved to a different place. I think this layering makes sense, and it's probably much closer to what Andres meant when he said the prefetching should happen in the executor. Even if the patch ends up using StreamingRead in the future, I guess we'll want something like IndexPrefetch - it might use the StreamingRead internally, but it would still need to do some custom stuff to detect I/O patterns or something that does not quite fit into the StreamingRead.
Now, let's talk about two (mostly unrelated) problems I ran into. Firstly, I realized there's a bit of a problem with cursors. The prefetching works like this: 1) reading TIDs from the index 2) stashing them into a queue in IndexPrefetch 3) doing prefetches for the new TIDs added to the queue 4) returning the TIDs to the caller, one by one And all of this works ... unless the direction of the scan changes. Which for cursors can happen if someone does FETCH BACKWARD or stuff like that. I'm not sure how difficult it'd be to make this work. I suppose we could simply discard the prefetched entries and do the right number of steps back for the index scan. But I haven't tried, and maybe it's more complex than I'm imagining. Also, if the cursor changes the direction a lot, it'd make the prefetching harmful. The patch simply disables prefetching for such queries, using the same logic that we do for parallelism. This may be over-zealous. FWIW this is one of the things that probably should remain outside of StreamingRead API - it seems pretty index-specific, and I'm not sure we'd even want to support these "backward" movements in the API. The other issue I'm aware of is handling xs_heap_continue. I believe it works fine for "false" but I need to take a look at non-MVCC snapshots (i.e. when xs_heap_continue=true). I haven't done any benchmarks with this reworked API - there's a couple more allocations etc. but it did not change in a fundamental way. I don't expect any major difference. regards [1] https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
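To make the four-step flow above concrete, here is a rough sketch built around a simple fixed-size queue. All names are hypothetical; the actual patch does considerably more (block deduplication, distance ramp-up, the IOS visibility check, and disabling itself when the scan direction may change).

    #define QUEUE_SIZE 64           /* assumed cap on the prefetch distance */

    typedef struct TidPrefetchQueue
    {
        ItemPointerData items[QUEUE_SIZE];
        uint64      head;           /* next TID to hand back to the caller */
        uint64      tail;           /* next free slot */
        int         distance;       /* current prefetch distance */
    } TidPrefetchQueue;

    /* steps 1-4: read TIDs, stash them, prefetch their heap blocks, return them */
    static bool
    tid_prefetch_next(IndexScanDesc scan, ScanDirection dir,
                      TidPrefetchQueue *q, ItemPointerData *result)
    {
        /* keep the queue filled up to the current prefetch distance */
        while ((q->tail - q->head) < (uint64) q->distance)
        {
            ItemPointer tid = index_getnext_tid(scan, dir);

            if (tid == NULL)
                break;              /* no more matches in the index */

            q->items[q->tail++ % QUEUE_SIZE] = *tid;

            /* issue the prefetch for the heap page this TID points to */
            PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM,
                           ItemPointerGetBlockNumber(tid));
        }

        if (q->head == q->tail)
            return false;           /* queue drained and index exhausted */

        *result = q->items[q->head++ % QUEUE_SIZE];
        return true;
    }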
On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Here's a somewhat reworked version of the patch. My initial goal was to > see if it could adopt the StreamingRead API proposed in [1], but that > turned out to be less straight-forward than I hoped, for two reasons: I guess we need Thomas or Andres or maybe Melanie to comment on this. > Perhaps a bigger change is that I decided to move this into a separate > API on top of indexam.c. The original idea was to integrate this into > index_getnext_tid/index_getnext_slot, so that all callers benefit from > the prefetching automatically. Which would be nice, but it also meant > it's need to happen in the indexam.c code, which seemed dirty. This patch is hard to review right now because there's a bunch of comment updating that doesn't seem to have been done for the new design. For instance: + * XXX This does not support prefetching of heap pages. When such prefetching is + * desirable, use index_getnext_tid(). But not any more. + * XXX The prefetching may interfere with the patch allowing us to evaluate + * conditions on the index tuple, in which case we may not need the heap + * tuple. Maybe if there's such filter, we should prefetch only pages that + * are not all-visible (and the same idea would also work for IOS), but + * it also makes the indexing a bit "aware" of the visibility stuff (which + * seems a somewhat wrong). Also, maybe we should consider the filter selectivity I'm not sure whether all the problems in this area are solved, but I think you've solved enough of them that this at least needs rewording, if not removing. + * XXX Comment/check seems obsolete. This occurs in two places. I'm not sure if it's accurate or not. + * XXX Could this be an issue for the prefetching? What if we prefetch something + * but the direction changes before we get to the read? If that could happen, + * maybe we should discard the prefetched data and go back? But can we even + * do that, if we already fetched some TIDs from the index? I don't think + * indexorderdir can't change, but es_direction maybe can? But your email claims that "The patch simply disables prefetching for such queries, using the same logic that we do for parallelism." FWIW, I think that's a fine way to handle that case. + * XXX Maybe we should enable prefetching, but prefetch only pages that + * are not all-visible (but checking that from the index code seems like + * a violation of layering etc). Isn't this fixed now? Note this comment occurs twice. + * XXX We need to disable this in some cases (e.g. when using index-only + * scans, we don't want to prefetch pages). Or maybe we should prefetch + * only pages that are not all-visible, that'd be even better. Here again. And now for some comments on other parts of the patch, mostly other XXX comments: + * XXX This does not support prefetching of heap pages. When such prefetching is + * desirable, use index_getnext_tid(). There's probably no reason to write XXX here. The comment is fine. + * XXX Notice we haven't added the block to the block queue yet, and there + * is a preceding block (i.e. blockIndex-1 is valid). Same here, possibly? If this XXX indicates a defect in the code, I don't know what the defect is, so I guess it needs to be more clear. If it is just explaining the code, then there's no reason for the comment to say XXX. + * XXX Could it be harmful that we read the queue backwards? Maybe memory + * prefetching works better for the forward direction? It does. 
But I don't know whether that matters here or not. + * XXX We do add the cache size to the request in order not to + * have issues with uint64 underflows. I don't know what this means. + * XXX not sure this correctly handles xs_heap_continue - see index_getnext_slot, + * maybe nodeIndexscan needs to do something more to handle this? Although, that + * should be in the indexscan next_cb callback, probably. + * + * XXX If xs_heap_continue=true, we need to return the last TID. You've got a bunch of comments about xs_heap_continue here -- and I don't fully understand what the issues are here with respect to this particular patch, but I think that the general purpose of xs_heap_continue is to handle the case where we need to return more than one tuple from the same HOT chain. With an MVCC snapshot that doesn't happen, but with say SnapshotAny or SnapshotDirty, it could. As far as possible, the prefetcher shouldn't be involved at all when xs_heap_continue is set, I believe, because in that case we're just returning a bunch of tuples from the same page, and the extra fetches from that heap page shouldn't trigger or require any further prefetching. + * XXX Should this also look at plan.plan_rows and maybe cap the target + * to that? Pointless to prefetch more than we expect to use. Or maybe + * just reset to that value during prefetching, after reading the next + * index page (or rather after rescan)? It seems questionable to use plan_rows here because (1) I don't think we have existing cases where we use the estimated row count in the executor for anything, we just carry it through so EXPLAIN can print it and (2) row count estimates can be really far off, especially if we're on the inner side of a nested loop, we might like to figure that out eventually instead of just DTWT forever. But on the other hand this does feel like an important case where we have a clue that prefetching might need to be done less aggressively or not at all, and it doesn't seem right to ignore that signal either. I wonder if we want this shaped in some other way, like a Boolean that says are-we-under-a-potentially-row-limiting-construct e.g. limit or inner side of a semi-join or anti-join. + * We reach here if the index only scan is not parallel, or if we're + * serially executing an index only scan that was planned to be + * parallel. Well, this seems sad. + * XXX This might lead to IOS being slower than plain index scan, if the + * table has a lot of pages that need recheck. How? + /* + * XXX Only allow index prefetching when parallelModeOK=true. This is a bit + * of a misuse of the flag, but we need to disable prefetching for cursors + * (which might change direction), and parallelModeOK does that. But maybe + * we might (or should) have a separate flag. + */ I think the correct flag to be using here is execute_once, which captures whether the executor could potentially be invoked a second time for the same portal. Changes in the fetch direction are possible if and only if !execute_once. > Note 1: The IndexPrefetch name is a bit misleading, because it's used > even with prefetching disabled - all index reads from the index scan > happen through it. Maybe it should be called IndexReader or something > like that. My biggest gripe here is the capitalization. This version adds, inter alia, IndexPrefetchAlloc, PREFETCH_QUEUE_INDEX, and index_heap_prefetch_target, which seems like one or two too many conventions. But maybe the PREFETCH_* macros don't even belong in a public header. I do like the index_heap_prefetch_* naming. 
Possibly that's too verbose to use for everything, but calling this index-heap-prefetch rather than index-prefetch seems clearer. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, Here's an improved version of this patch, finishing a lot of the stuff that I alluded to earlier - moving the code from indexam.c, renaming a bunch of stuff, etc. I've also squashed it into a single patch, to make it easier to review. I'll briefly go through the main changes in the patch, and then will respond in-line to Robert's points. 1) I moved the code from indexam.c to (new) execPrefetch.c. All the prototypes / typedefs now live in executor.h, with only minimal changes in execnodes.h (adding it to scan descriptors). I believe this finally moves the code to the right place - it feels much nicer and cleaner than in indexam.c. And it allowed me to hide a bunch of internal structs and improve the general API, I think. I'm sure there's stuff that could be named differently, but the layering feels about right, I think. 2) A bunch of stuff got renamed to start with IndexPrefetch... to make the naming consistent / clearer. I'm not entirely sure IndexPrefetch is the right name, though - it's still a bit misleading, as it might seem it's about prefetching index stuff, but really it's about heap pages from indexes. Maybe IndexScanPrefetch() or something like that? 3) If there's a way to make this work with the streaming I/O API, I'm not aware of it. But the overall design seems somewhat similar (based on "next" callback etc.) so hopefully that'd make it easier to adopt it. 4) I initially relied on parallelModeOK to disable prefetching, which kinda worked, but not really. Robert suggested using the execute_once flag directly, and I think that's much better - not only is it cleaner, it also seems more appropriate (the parallel flag considers other stuff that is not quite relevant to prefetching). Thinking about this, I think it should be possible to make prefetching work even for plans with execute_once=false. In particular, when the plan changes direction it should be possible to simply "walk back" the prefetch queue, to get to the "correct" place in the scan. But I'm not sure it's worth it, because plans that change direction often can't really benefit from prefetches anyway - they'll often revisit stuff they accessed shortly before. For plans that don't change direction but may pause, we don't know if the plan pauses long enough for the prefetched pages to get evicted or something. So I think it's OK that execute_once=false means no prefetching. 5) I haven't done anything about the xs_heap_continue=true case yet. 6) I went through all the comments and reworked them considerably. There's the main comment at the start of execPrefetch.c, with some overall design notes, and then there are comments for each function, explaining that bit in more detail. Or at least that's the goal - there's still work to do. There's two trivial FIXMEs, but you can ignore those - it's not that there's a bug, but that I'd like to rework something and just don't know how yet. There's also a couple of XXX comments. Some are a bit wild ideas for the future, others are somewhat "open questions" to be discussed during a review. Anyway, there should be no outright obsolete comments - if there's something I missed, let me know. Now to Robert's message ... On 1/9/24 21:31, Robert Haas wrote: > On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> Here's a somewhat reworked version of the patch.
My initial goal was to >> see if it could adopt the StreamingRead API proposed in [1], but that >> turned out to be less straight-forward than I hoped, for two reasons: > > I guess we need Thomas or Andres or maybe Melanie to comment on this. > Yeah. Or maybe Thomas if he has thoughts on how to combine this with the streaming I/O stuff. >> Perhaps a bigger change is that I decided to move this into a separate >> API on top of indexam.c. The original idea was to integrate this into >> index_getnext_tid/index_getnext_slot, so that all callers benefit from >> the prefetching automatically. Which would be nice, but it also meant >> it's need to happen in the indexam.c code, which seemed dirty. > > This patch is hard to review right now because there's a bunch of > comment updating that doesn't seem to have been done for the new > design. For instance: > > + * XXX This does not support prefetching of heap pages. When such > prefetching is > + * desirable, use index_getnext_tid(). > > But not any more. > True. And this is now even more obsolete, as the prefetching was moved from indexam.c layer to the executor. > + * XXX The prefetching may interfere with the patch allowing us to evaluate > + * conditions on the index tuple, in which case we may not need the heap > + * tuple. Maybe if there's such filter, we should prefetch only pages that > + * are not all-visible (and the same idea would also work for IOS), but > + * it also makes the indexing a bit "aware" of the visibility stuff (which > + * seems a somewhat wrong). Also, maybe we should consider the filter > selectivity > > I'm not sure whether all the problems in this area are solved, but I > think you've solved enough of them that this at least needs rewording, > if not removing. > > + * XXX Comment/check seems obsolete. > > This occurs in two places. I'm not sure if it's accurate or not. > > + * XXX Could this be an issue for the prefetching? What if we > prefetch something > + * but the direction changes before we get to the read? If that > could happen, > + * maybe we should discard the prefetched data and go back? But can we even > + * do that, if we already fetched some TIDs from the index? I don't think > + * indexorderdir can't change, but es_direction maybe can? > > But your email claims that "The patch simply disables prefetching for > such queries, using the same logic that we do for parallelism." FWIW, > I think that's a fine way to handle that case. > True. I left behind this comment partly intentionally, to point out why we disable the prefetching in these cases, but you're right the comment now explains something that can't happen. > + * XXX Maybe we should enable prefetching, but prefetch only pages that > + * are not all-visible (but checking that from the index code seems like > + * a violation of layering etc). > > Isn't this fixed now? Note this comment occurs twice. > > + * XXX We need to disable this in some cases (e.g. when using index-only > + * scans, we don't want to prefetch pages). Or maybe we should prefetch > + * only pages that are not all-visible, that'd be even better. > > Here again. > Sorry, you're right those comments (and a couple more nearby) were stale. Removed / clarified. > And now for some comments on other parts of the patch, mostly other > XXX comments: > > + * XXX This does not support prefetching of heap pages. When such > prefetching is > + * desirable, use index_getnext_tid(). > > There's probably no reason to write XXX here. The comment is fine. 
> > + * XXX Notice we haven't added the block to the block queue yet, and there > + * is a preceding block (i.e. blockIndex-1 is valid). > > Same here, possibly? If this XXX indicates a defect in the code, I > don't know what the defect is, so I guess it needs to be more clear. > If it is just explaining the code, then there's no reason for the > comment to say XXX. > Yeah, removed the XXX / reworded a bit. > + * XXX Could it be harmful that we read the queue backwards? Maybe memory > + * prefetching works better for the forward direction? > > It does. But I don't know whether that matters here or not. > > + * XXX We do add the cache size to the request in order not to > + * have issues with uint64 underflows. > > I don't know what this means. > There's a check that does this: (x + PREFETCH_CACHE_SIZE) >= y. It might also be written in the "mathematically equivalent" form x >= (y - PREFETCH_CACHE_SIZE), but if "y" is a uint64 and its value is smaller than the constant, the subtraction would underflow. It'd eventually disappear, once the "y" gets large enough, ofc. > + * XXX not sure this correctly handles xs_heap_continue - see > index_getnext_slot, > + * maybe nodeIndexscan needs to do something more to handle this? > Although, that > + * should be in the indexscan next_cb callback, probably. > + * > + * XXX If xs_heap_continue=true, we need to return the last TID. > > You've got a bunch of comments about xs_heap_continue here -- and I > don't fully understand what the issues are here with respect to this > particular patch, but I think that the general purpose of > xs_heap_continue is to handle the case where we need to return more > than one tuple from the same HOT chain. With an MVCC snapshot that > doesn't happen, but with say SnapshotAny or SnapshotDirty, it could. > As far as possible, the prefetcher shouldn't be involved at all when > xs_heap_continue is set, I believe, because in that case we're just > returning a bunch of tuples from the same page, and the extra fetches > from that heap page shouldn't trigger or require any further > prefetching. > Yes, that's correct. The current code simply ignores that flag and just proceeds to the next TID. Which is correct for xs_heap_continue=false, and thus all MVCC snapshots work fine. But for the Any/Dirty case it needs to work a bit differently. > + * XXX Should this also look at plan.plan_rows and maybe cap the target > + * to that? Pointless to prefetch more than we expect to use. Or maybe > + * just reset to that value during prefetching, after reading the next > + * index page (or rather after rescan)? > > It seems questionable to use plan_rows here because (1) I don't think > we have existing cases where we use the estimated row count in the > executor for anything, we just carry it through so EXPLAIN can print > it and (2) row count estimates can be really far off, especially if > we're on the inner side of a nested loop, we might like to figure that > out eventually instead of just DTWT forever. But on the other hand > this does feel like an important case where we have a clue that > prefetching might need to be done less aggressively or not at all, and > it doesn't seem right to ignore that signal either. I wonder if we > want this shaped in some other way, like a Boolean that says > are-we-under-a-potentially-row-limiting-construct e.g. limit or inner > side of a semi-join or anti-join.
> The current code actually does look at plan_rows when calculating the prefetch target: prefetch_max = IndexPrefetchComputeTarget(node->ss.ss_currentRelation, node->ss.ps.plan->plan_rows, estate->es_use_prefetching); but I agree maybe it should not, for the reasons you explain. I'm not attached to this part. > + * We reach here if the index only scan is not parallel, or if we're > + * serially executing an index only scan that was planned to be > + * parallel. > > Well, this seems sad. > Stale comment, I believe. However, I didn't see much benefits with parallel index scan during testing. Having I/O from multiple workers generally had the same effect, I think. > + * XXX This might lead to IOS being slower than plain index scan, if the > + * table has a lot of pages that need recheck. > > How? > The comment is not particularly clear what "this" means, but I believe this was about index-only scan with many not-all-visible pages. If it didn't do prefetching, a regular index scan with prefetching may be way faster. But the code actually allows doing prefetching even for IOS, by checking the vm in the "next" callback. > + /* > + * XXX Only allow index prefetching when parallelModeOK=true. This is a bit > + * of a misuse of the flag, but we need to disable prefetching for cursors > + * (which might change direction), and parallelModeOK does that. But maybe > + * we might (or should) have a separate flag. > + */ > > I think the correct flag to be using here is execute_once, which > captures whether the executor could potentially be invoked a second > time for the same portal. Changes in the fetch direction are possible > if and only if !execute_once. > Right. The new patch version does that. >> Note 1: The IndexPrefetch name is a bit misleading, because it's used >> even with prefetching disabled - all index reads from the index scan >> happen through it. Maybe it should be called IndexReader or something >> like that. > > My biggest gripe here is the capitalization. This version adds, inter > alia, IndexPrefetchAlloc, PREFETCH_QUEUE_INDEX, and > index_heap_prefetch_target, which seems like one or two too many > conventions. But maybe the PREFETCH_* macros don't even belong in a > public header. > > I do like the index_heap_prefetch_* naming. Possibly that's too > verbose to use for everything, but calling this index-heap-prefetch > rather than index-prefetch seems clearer. > Yeah. I renamed all the structs and functions to IndexPrefetchSomething, to keep it consistent. And then the constants are all capital, ofc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
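To spell out the underflow point from the exchange above - PREFETCH_CACHE_SIZE and the function name here are stand-ins for the sketch, not the patch's actual identifiers:

    #include <stdbool.h>
    #include <stdint.h>

    #define PREFETCH_CACHE_SIZE 1024

    /*
     * "Is position x within the last PREFETCH_CACHE_SIZE requests before
     * position y?"  Written with the constant added on the left-hand side,
     * so no unsigned subtraction can wrap around.
     */
    static bool
    within_cache_window(uint64_t x, uint64_t y)
    {
        return (x + PREFETCH_CACHE_SIZE) >= y;

        /*
         * The "mathematically equivalent" form
         *
         *     x >= (y - PREFETCH_CACHE_SIZE)
         *
         * misbehaves for small y: if y < PREFETCH_CACHE_SIZE, the unsigned
         * subtraction wraps around to a huge value and the test comes out
         * false even though it should be true.
         */
    }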
Not a full response, but just to address a few points: On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Thinking about this, I think it should be possible to make prefetching > work even for plans with execute_once=false. In particular, when the > plan changes direction it should be possible to simply "walk back" the > prefetch queue, to get to the "correct" place in in the scan. But I'm > not sure it's worth it, because plans that change direction often can't > really benefit from prefetches anyway - they'll often visit stuff they > accessed shortly before anyway. For plans that don't change direction > but may pause, we don't know if the plan pauses long enough for the > prefetched pages to get evicted or something. So I think it's OK that > execute_once=false means no prefetching. +1. > > + * XXX We do add the cache size to the request in order not to > > + * have issues with uint64 underflows. > > > > I don't know what this means. > > > > There's a check that does this: > > (x + PREFETCH_CACHE_SIZE) >= y > > it might also be done as "mathematically equivalent" > > x >= (y - PREFETCH_CACHE_SIZE) > > but if the "y" is an uint64, and the value is smaller than the constant, > this would underflow. It'd eventually disappear, once the "y" gets large > enough, ofc. The problem is, I think, that there's no particular reason that someone reading the existing code should imagine that it might have been done in that "mathematically equivalent" fashion. I imagined that you were trying to make a point about adding the cache size to the request vs. adding nothing, whereas in reality you were trying to make a point about adding from one side vs. subtracting from the other. > > + * We reach here if the index only scan is not parallel, or if we're > > + * serially executing an index only scan that was planned to be > > + * parallel. > > > > Well, this seems sad. > > Stale comment, I believe. However, I didn't see much benefits with > parallel index scan during testing. Having I/O from multiple workers > generally had the same effect, I think. Fair point, likely worth mentioning explicitly in the comment. > Yeah. I renamed all the structs and functions to IndexPrefetchSomething, > to keep it consistent. And then the constants are all capital, ofc. It'd still be nice to get table or heap in there, IMHO, but maybe we can't, and consistency is certainly a good thing regardless of the details, so thanks for that. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,
On 12/01/2024 6:42 pm, Tomas Vondra wrote: > Hi, > Here's an improved version of this patch, finishing a lot of the stuff that I alluded to earlier - moving the code from indexam.c, renaming a bunch of stuff, etc. I've also squashed it into a single patch, to make it easier to review.
I am thinking about testing your patch with Neon (cloud Postgres). Since Neon separates compute and storage, prefetch is much more critical for the Neon architecture than for vanilla Postgres.
I have a few complaints:
1. It disables prefetch for a sequential access pattern (i.e. INDEX MERGE), the motivation being that in this case OS read-ahead will be more efficient than prefetch. That may be true for normal storage devices, but not for Neon storage, and maybe also not for Postgres on top of a DFS (i.e. Amazon RDS). I wonder if we can delegate the decision whether to perform prefetch in this case or not to some other level. I do not know precisely where it should be handled. The best candidate IMHO is the storage manager, but that most likely requires an extension of the SMGR API. Not sure if you want to do it... A straightforward solution is to move this logic to some callback, which can be overridden by the user.
2. It disables prefetch for direct_io. This seems even more obvious than 1), because prefetching using `posix_fadvise` is definitely not possible with direct_io. But in theory, if the SMGR provides some alternative prefetch implementation (as in the case of Neon), this may also not be true. It's still unclear why we would want to use direct_io in Neon... But I would still prefer to move this decision outside the executor.
3. It doesn't prefetch leaf pages for IOS, only the referenced heap pages which are not marked as all-visible. It seems to me that if the optimizer has chosen IOS (and not a bitmap heap scan, for example), then there should be a large enough fraction of all-visible pages. Also, index prefetch is most efficient for OLAP queries, and those are usually run on historical data which is all-visible. But IOS can really be handled separately in some other PR. Frankly speaking, combining prefetch of leaf B-Tree pages and referenced heap pages seems to be a very challenging task.
4. I think that performing prefetch at the executor level is a really great idea, because then prefetch can be used by all indexes, including custom indexes. But prefetch will be efficient only if the index can provide fast access to the next TID (located on the same page). I am not sure that this is true for all builtin indexes (GIN, GIST, BRIN, ...) and especially for custom AMs. I wonder if we should extend the AM API to let the index decide whether to perform prefetch of TIDs or not.
5. Minor notice: there are a few places where index_getnext_slot is called with the last parameter NULL (disabled prefetch) with the following comment:
"XXX Would be nice to also benefit from prefetching here." But all these places correspond to "point lookups", i.e. unique constraint checks, finding a replication tuple by index... Prefetch seems unlikely to be useful here, unless there is index bloat and we have to skip a lot of tuples before locating the right one. But should we try to optimize the case of bloated indexes?
On 1/16/24 09:13, Konstantin Knizhnik wrote: > Hi, > > On 12/01/2024 6:42 pm, Tomas Vondra wrote: >> Hi, >> >> Here's an improved version of this patch, finishing a lot of the stuff >> that I alluded to earlier - moving the code from indexam.c, renaming a >> bunch of stuff, etc. I've also squashed it into a single patch, to make >> it easier to review. > > I am thinking about testing you patch with Neon (cloud Postgres). As far > as Neon seaprates compute and storage, prefetch is much more critical > for Neon > architecture than for vanilla Postgres. > > I have few complaints: > > 1. It disables prefetch for sequential access pattern (i.e. INDEX > MERGE), motivating it that in this case OS read-ahead will be more > efficient than prefetch. It may be true for normal storage devices, bit > not for Neon storage and may be also for Postgres on top of DFS (i.e. > Amazon RDS). I wonder if we can delegate decision whether to perform > prefetch in this case or not to some other level. I do not know > precisely where is should be handled. The best candidate IMHO is > storager manager. But it most likely requires extension of SMGR API. Not > sure if you want to do it... Straightforward solution is to move this > logic to some callback, which can be overwritten by user. > Interesting point. You're right these decisions (whether to prefetch particular patterns) are closely tied to the capabilities of the storage system. So it might make sense to maybe define it at that level. Not sure what exactly RDS does with the storage - my understanding is that it's mostly regular Postgres code, but managed by Amazon. So how would that modify the prefetching logic? However, I'm not against making this modular / wrapping this in some sort of callbacks, for example. > 2. It disables prefetch for direct_io. It seems to be even more obvious > than 1), because prefetching using `posix_fadvise` definitely not > possible in case of using direct_io. But in theory if SMGR provides some > alternative prefetch implementation (as in case of Neon), this also may > be not true. Still unclear why we can want to use direct_io in Neon... > But still I prefer to mo.ve this decision outside executor. > True. I think this would / should be customizable by the callback. > 3. It doesn't perform prefetch of leave pages for IOS, only referenced > heap pages which are not marked as all-visible. It seems to me that if > optimized has chosen IOS (and not bitmap heap scan for example), then > there should be large enough fraction for all-visible pages. Also index > prefetch is most efficient for OLAp queries and them are used to be > performance for historical data which is all-visible. But IOS can be > really handled separately in some other PR. Frankly speaking combining > prefetch of leave B-Tree pages and referenced heap pages seems to be > very challenged task. > I see prefetching of leaf pages as interesting / worthwhile improvement, but out of scope for this patch. I don't think it can be done at the executor level - the prefetch requests need to be submitted from the index AM code (by calling PrefetchBuffer, etc.) > 4. I think that performing prefetch at executor level is really great > idea and so prefetch can be used by all indexes, including custom > indexes. But prefetch will be efficient only if index can provide fast > access to next TID (located at the same page). I am not sure that it is > true for all builtin indexes (GIN, GIST, BRIN,...) and especially for > custom AM. 
I wonder if we should extend AM API to make index make a > decision weather to perform prefetch of TIDs or not. I'm not against having a flag to enable/disable prefetching, but the question is whether doing prefetching for such indexes can be harmful. I'm not sure about that. > > 5. Minor notice: there are few places where index_getnext_slot is called > with last NULL parameter (disabled prefetch) with the following comment > "XXX Would be nice to also benefit from prefetching here." But all this > places corresponds to "point loopkup", i.e. unique constraint check, > find replication tuple by index... Prefetch seems to be unlikely useful > here, unlkess there is index bloating and and we have to skip a lot of > tuples before locating right one. But should we try to optimize case of > bloated indexes? > Are you sure you're looking at the last patch version? Because the current patch does not have any new parameters in index_getnext_* and the comments were removed too (I suppose you're talking about execIndexing, execReplication and those places). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
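As an illustration of why leaf-page prefetching would have to be initiated from AM code rather than the executor - it needs the sibling link stored in the page it just read - here is a hypothetical btree-side sketch. This is not part of the patch being discussed, just a sketch of the idea on top of the existing PrefetchBuffer machinery.

    #include "access/nbtree.h"
    #include "storage/bufmgr.h"

    /*
     * Hypothetical helper: while the caller is still consuming items from the
     * current leaf page, ask for its right sibling to be read ahead.  A
     * backward scan would use btpo_prev instead.
     */
    static void
    _bt_prefetch_right_sibling(Relation rel, Page page)
    {
        BTPageOpaque opaque = BTPageGetOpaque(page);

        if (!P_RIGHTMOST(opaque))
            PrefetchBuffer(rel, MAIN_FORKNUM, opaque->btpo_next);
    }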
On Tue, Jan 16, 2024 at 11:25 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > 3. It doesn't perform prefetch of leave pages for IOS, only referenced > > heap pages which are not marked as all-visible. It seems to me that if > > optimized has chosen IOS (and not bitmap heap scan for example), then > > there should be large enough fraction for all-visible pages. Also index > > prefetch is most efficient for OLAp queries and them are used to be > > performance for historical data which is all-visible. But IOS can be > > really handled separately in some other PR. Frankly speaking combining > > prefetch of leave B-Tree pages and referenced heap pages seems to be > > very challenged task. > > I see prefetching of leaf pages as interesting / worthwhile improvement, > but out of scope for this patch. I don't think it can be done at the > executor level - the prefetch requests need to be submitted from the > index AM code (by calling PrefetchBuffer, etc.) +1. This is a good feature, and so is that, but they're not the same feature, despite the naming problems. -- Robert Haas EDB: http://www.enterprisedb.com
> On 1/16/24 09:13, Konstantin Knizhnik wrote: >> Hi, >> On 12/01/2024 6:42 pm, Tomas Vondra wrote: >>> Hi, >>> Here's an improved version of this patch, finishing a lot of the stuff that I alluded to earlier - moving the code from indexam.c, renaming a bunch of stuff, etc. I've also squashed it into a single patch, to make it easier to review. >> I am thinking about testing you patch with Neon (cloud Postgres). As far as Neon seaprates compute and storage, prefetch is much more critical for Neon architecture than for vanilla Postgres. I have few complaints: 1. It disables prefetch for sequential access pattern (i.e. INDEX MERGE), motivating it that in this case OS read-ahead will be more efficient than prefetch. It may be true for normal storage devices, bit not for Neon storage and may be also for Postgres on top of DFS (i.e. Amazon RDS). I wonder if we can delegate decision whether to perform prefetch in this case or not to some other level. I do not know precisely where is should be handled. The best candidate IMHO is storager manager. But it most likely requires extension of SMGR API. Not sure if you want to do it... Straightforward solution is to move this logic to some callback, which can be overwritten by user. > Interesting point. You're right these decisions (whether to prefetch particular patterns) are closely tied to the capabilities of the storage system. So it might make sense to maybe define it at that level. Not sure what exactly RDS does with the storage - my understanding is that it's mostly regular Postgres code, but managed by Amazon. So how would that modify the prefetching logic?
Amazon RDS is just vanilla Postgres with a file system mounted on EBS (Amazon distributed file system).
EBS provides good throughput but larger latencies compared with local SSDs.
I am not sure if read-ahead works for EBS.
>> 4. I think that performing prefetch at executor level is really great idea and so prefetch can be used by all indexes, including custom indexes. But prefetch will be efficient only if index can provide fast access to next TID (located at the same page). I am not sure that it is true for all builtin indexes (GIN, GIST, BRIN,...) and especially for custom AM. I wonder if we should extend AM API to make index make a decision weather to perform prefetch of TIDs or not. > I'm not against having a flag to enable/disable prefetching, but the question is whether doing prefetching for such indexes can be harmful. I'm not sure about that.
I tend to agree with you - it is hard to imagine an index implementation which doesn't win from prefetching heap pages.
Maybe only the filtering case you have mentioned. But it seems to me that the current B-Tree index scan (not IOS) implementation in Postgres
doesn't try to use the index tuple to check extra conditions - it will fetch the heap tuple in any case.
>> 5. Minor notice: there are few places where index_getnext_slot is called with last NULL parameter (disabled prefetch) with the following comment "XXX Would be nice to also benefit from prefetching here." But all this places corresponds to "point loopkup", i.e. unique constraint check, find replication tuple by index... Prefetch seems to be unlikely useful here, unlkess there is index bloating and and we have to skip a lot of tuples before locating right one. But should we try to optimize case of bloated indexes? > Are you sure you're looking at the last patch version? Because the current patch does not have any new parameters in index_getnext_* and the comments were removed too (I suppose you're talking about execIndexing, execReplication and those places).
Sorry, I looked at v20240103-0001-prefetch-2023-12-09.patch, I didn't notice v20240112-0001-Prefetch-heap-pages-during-index-scans.patch
regards
On 1/16/24 2:10 PM, Konstantin Knizhnik wrote: > Amazon RDS is just vanilla Postgres with file system mounted on EBS > (Amazon distributed file system). > EBS provides good throughput but larger latencies comparing with local SSDs. > I am not sure if read-ahead works for EBS. Actually, EBS only provides a block device - it's definitely not a filesystem itself (*EFS* is a filesystem - but it's also significantly different than EBS). So as long as readahead is happening somewhere above the block device I would expect it to JustWork on EBS. Of course, Aurora Postgres (like Neon) is completely different. If you look at page 53 of [1] you'll note that there are two different terms used: prefetch and batch. I'm not sure how much practical difference there is, but batched IO (one IO request to Aurora Storage for many blocks) predates index prefetch; VACUUM in APG has used batched IO for a very long time (it also *only* reads blocks that aren't marked all visible/frozen; none of the "only skip if skipping at least 32 blocks" logic is used). 1: https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Deep_dive_on_Amazon_Aurora_with_PostgreSQL_compatibility_DAT328-R1.pdf -- Jim Nasby, Data Architect, Austin TX
On 16/01/2024 11:58 pm, Jim Nasby wrote: > On 1/16/24 2:10 PM, Konstantin Knizhnik wrote: >> Amazon RDS is just vanilla Postgres with file system mounted on EBS >> (Amazon distributed file system). >> EBS provides good throughput but larger latencies comparing with >> local SSDs. >> I am not sure if read-ahead works for EBS. > > Actually, EBS only provides a block device - it's definitely not a > filesystem itself (*EFS* is a filesystem - but it's also significantly > different than EBS). So as long as readahead is happening somewhere > above the block device I would expect it to JustWork on EBS. Thank you for the clarification. Yes, EBS is just a block device and read-ahead can be used for it as for any other local device. There is actually a recommendation to increase read-ahead for EBS devices to reach better performance on some workloads: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html So it looks like manual prefetching is not needed on EBS for a sequential access pattern. But with Neon the situation is quite different. Maybe Aurora Postgres is using some other mechanism to speed up vacuum and seqscan, but Neon is using the Postgres prefetch mechanism for them.
I have integrated your prefetch patch in Neon and it actually works! Moreover, I combined it with prefetch of leaf pages for IOS and it also seems to work. Just a small notice: you are reporting `blks_prefetch_rounds` in explain, but it is not incremented anywhere. Moreover, I do not precisely understand what it means and wonder if such information is useful for analyzing the query execution plan. Also, your patch always reports the number of prefetched blocks (and rounds) if they are not zero. I think that adding new information to explain may cause some problems, because there are a lot of different tools which parse the explain report to visualize it, make recommendations to improve performance, ... Certainly good practice for such tools is to ignore all unknown tags, but I am not sure that everybody follows this practice. It seems safer and at the same time more convenient for users to add an extra tag to explain to enable/disable the prefetch info (as it was done in Neon). Here we come back to my custom explain patch ;) Actually using it is not necessary. You can manually add a "prefetch" option to Postgres core (as it is currently done in Neon). Best regards, Konstantin
On 1/16/24 21:10, Konstantin Knizhnik wrote: > > ... > >> 4. I think that performing prefetch at executor level is really great >>> idea and so prefetch can be used by all indexes, including custom >>> indexes. But prefetch will be efficient only if index can provide fast >>> access to next TID (located at the same page). I am not sure that it is >>> true for all builtin indexes (GIN, GIST, BRIN,...) and especially for >>> custom AM. I wonder if we should extend AM API to make index make a >>> decision weather to perform prefetch of TIDs or not. >> I'm not against having a flag to enable/disable prefetching, but the >> question is whether doing prefetching for such indexes can be harmful. >> I'm not sure about that. > > I tend to agree with you - it is hard to imagine index implementation > which doesn't win from prefetching heap pages. > May be only the filtering case you have mentioned. But it seems to me > that current B-Tree index scan (not IOS) implementation in Postgres > doesn't try to use index tuple to check extra condition - it will fetch > heap tuple in any case. > That's true, but that's why I started working on this: https://commitfest.postgresql.org/46/4352/ I need to think about how to combine that with the prefetching. The good thing is that both changes require fetching TIDs, not slots. I think the condition can be simply added to the prefetch callback. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/17/24 09:45, Konstantin Knizhnik wrote: > I have integrated your prefetch patch in Neon and it actually works! > Moreover, I combined it with prefetch of leaf pages for IOS and it also > seems to work. > Cool! And do you think this is the right design/way to do this? > Just small notice: you are reporting `blks_prefetch_rounds` in explain, > but it is not incremented anywhere. > Moreover, I do not precisely understand what it mean and wonder if such > information is useful for analyzing query executing plan. > Also your patch always report number of prefetched blocks (and rounds) > if them are not zero. > Right, this needs fixing. > I think that adding new information to explain it may cause some > problems because there are a lot of different tools which parse explain > report to visualize it, > make some recommendations top improve performance, ... Certainly good > practice for such tools is to ignore all unknown tags. But I am not sure > that everybody follow this practice. > It seems to be more safe and at the same time convenient for users to > add extra tag to explain to enable/disable prefetch info (as it was done > in Neon). > I think we want to add this info to explain, but maybe it should be behind a new flag and disabled by default. > Here we come back to my custom explain patch;) Actually using it is not > necessary. You can manually add "prefetch" option to Postgres core (as > it is currently done in Neon). > Yeah, I think that's the right solution. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 18/01/2024 6:00 pm, Tomas Vondra wrote: > On 1/17/24 09:45, Konstantin Knizhnik wrote: >> I have integrated your prefetch patch in Neon and it actually works! >> Moreover, I combined it with prefetch of leaf pages for IOS and it also >> seems to work. >> > Cool! And do you think this is the right design/way to do this? I like the idea of prefetching TIDs in the executor. But looking through your patch I have some questions: 1. Why is it necessary to allocate and store the all_visible flag in the data buffer? Why can the caller of IndexPrefetchNext not look at the prefetch field? + /* store the all_visible flag in the private part of the entry */ + entry->data = palloc(sizeof(bool)); + *(bool *) entry->data = all_visible; 2. The names of the functions `IndexPrefetchNext` and `IndexOnlyPrefetchNext` are IMHO confusing because they look similar, and one can assume that the former is used for a normal index scan and the latter for an index-only scan. But actually `IndexOnlyPrefetchNext` is a callback and `IndexPrefetchNext` is used in both nodeIndexscan.c and nodeIndexonlyscan.c
On 1/19/24 09:34, Konstantin Knizhnik wrote: > > On 18/01/2024 6:00 pm, Tomas Vondra wrote: >> On 1/17/24 09:45, Konstantin Knizhnik wrote: >>> I have integrated your prefetch patch in Neon and it actually works! >>> Moreover, I combined it with prefetch of leaf pages for IOS and it also >>> seems to work. >>> >> Cool! And do you think this is the right design/way to do this? > > I like the idea of prefetching TIDs in executor. > > But looking though your patch I have some questions: > > > 1. Why it is necessary to allocate and store all_visible flag in data > buffer. Why caller of IndexPrefetchNext can not look at prefetch field? > > + /* store the all_visible flag in the private part of the entry */ > + entry->data = palloc(sizeof(bool)); > + *(bool *) entry->data = all_visible; > What do you mean by "prefetch field"? The reason why it's done like this is to only do the VM check once - without keeping the value, we'd have to do it in the "next" callback, to determine if we need to prefetch the heap tuple, and then later in the index-only scan itself. That's a significant overhead, especially in the case when everything is visible. > 2. Names of the functions `IndexPrefetchNext` and > `IndexOnlyPrefetchNext` are IMHO confusing because they look similar and > one can assume that for one is used for normal index scan and last one - > for index only scan. But actually `IndexOnlyPrefetchNext` is callback > and `IndexPrefetchNext` is used in both nodeIndexscan.c and > nodeIndexonlyscan.c > Yeah, that's a good point. The naming probably needs rethinking. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
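A simplified sketch of the shape being discussed: the index-only-scan "next" callback does the visibility-map check once, prefetches only pages that are not all-visible, and stashes the flag in the entry's private data for later reuse. The function name and signature are simplified assumptions, not the patch's exact code.

    static void *
    ios_prefetch_next_cb(IndexScanDesc scan, ItemPointer tid, Buffer *vmbuffer)
    {
        BlockNumber block = ItemPointerGetBlockNumber(tid);
        bool       *all_visible = palloc(sizeof(bool));

        /* single VM lookup; the result is reused when the tuple is returned */
        *all_visible = VM_ALL_VISIBLE(scan->heapRelation, block, vmbuffer);

        /* only pages we will actually have to read are worth prefetching */
        if (!*all_visible)
            PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);

        /* store the all_visible flag in the private part of the entry */
        return all_visible;
    }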
On 18/01/2024 5:57 pm, Tomas Vondra wrote: > On 1/16/24 21:10, Konstantin Knizhnik wrote: >> ... >>>> 4. I think that performing prefetch at executor level is really great idea and so prefetch can be used by all indexes, including custom indexes. But prefetch will be efficient only if index can provide fast access to next TID (located at the same page). I am not sure that it is true for all builtin indexes (GIN, GIST, BRIN,...) and especially for custom AM. I wonder if we should extend AM API to make index make a decision weather to perform prefetch of TIDs or not. >>> I'm not against having a flag to enable/disable prefetching, but the question is whether doing prefetching for such indexes can be harmful. I'm not sure about that. >> I tend to agree with you - it is hard to imagine index implementation which doesn't win from prefetching heap pages. May be only the filtering case you have mentioned. But it seems to me that current B-Tree index scan (not IOS) implementation in Postgres doesn't try to use index tuple to check extra condition - it will fetch heap tuple in any case. > That's true, but that's why I started working on this: https://commitfest.postgresql.org/46/4352/ I need to think about how to combine that with the prefetching. The good thing is that both changes require fetching TIDs, not slots. I think the condition can be simply added to the prefetch callback. regards
Looks like I was wrong: even if it is not an index-only scan but the index condition involves only index attributes, the heap is not accessed until we find a tuple satisfying the search condition.
The inclusive index case described above (https://commitfest.postgresql.org/46/4352/) is interesting, but IMHO an exotic case. If keys are actually used in the search, then why not create a normal compound index instead?
On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 1/9/24 21:31, Robert Haas wrote: > > On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> Here's a somewhat reworked version of the patch. My initial goal was to > >> see if it could adopt the StreamingRead API proposed in [1], but that > >> turned out to be less straight-forward than I hoped, for two reasons: > > > > I guess we need Thomas or Andres or maybe Melanie to comment on this. > > > > Yeah. Or maybe Thomas if he has thoughts on how to combine this with the > streaming I/O stuff. I've been studying your patch with the intent of finding a way to change it and or the streaming read API to work together. I've attached a very rough sketch of how I think it could work. We fill a queue with blocks from TIDs that we fetched from the index. The queue is saved in a scan descriptor that is made available to the streaming read callback. Once the queue is full, we invoke the table AM specific index_fetch_tuple() function which calls pg_streaming_read_buffer_get_next(). When the streaming read API invokes the callback we registered, it simply dequeues a block number for prefetching. The only change to the streaming read API is that now, even if the callback returns InvalidBlockNumber, we may not be finished, so make it resumable. Structurally, this changes the timing of when the heap blocks are prefetched. Your code would get a tid from the index and then prefetch the heap block -- doing this until it filled a queue that had the actual tids saved in it. With my approach and the streaming read API, you fetch tids from the index until you've filled up a queue of block numbers. Then the streaming read API will prefetch those heap blocks. I didn't actually implement the block queue -- I just saved a single block number and pretended it was a block queue. I was imagining we replace this with something like your IndexPrefetch->blockItems -- which has light deduplication. We'd probably have to flesh it out more than that. There are also table AM layering violations in my sketch which would have to be worked out (not to mention some resource leakage I didn't bother investigating [which causes it to fail tests]). 0001 is all of Thomas' streaming read API code that isn't yet in master and 0002 is my rough sketch of index prefetching using the streaming read API There are also numerous optimizations that your index prefetching patch set does that would need to be added in some way. I haven't thought much about it yet. I wanted to see what you thought of this approach first. Basically, is it workable? - Melanie
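A very rough sketch of the callback side of this approach, with made-up type and function names (the streaming read API referenced in [1] may differ in its exact signatures): the executor fills a per-scan queue of heap block numbers from index TIDs, and the callback registered with the streaming read simply drains it.

    typedef struct IndexScanBlockQueue
    {
        BlockNumber blocks[64];     /* filled from TIDs fetched off the index */
        int         head;
        int         tail;
    } IndexScanBlockQueue;

    /*
     * Streaming read callback; callback_private points at the queue stored in
     * the scan descriptor.  Returning InvalidBlockNumber does not necessarily
     * mean the scan is done - the queue may be refilled later, which is the
     * "make it resumable" change mentioned above.
     */
    static BlockNumber
    index_scan_next_block_cb(void *callback_private)
    {
        IndexScanBlockQueue *queue = (IndexScanBlockQueue *) callback_private;

        if (queue->head == queue->tail)
            return InvalidBlockNumber;

        return queue->blocks[queue->head++];
    }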
Attachments
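To make the shape of that sketch more concrete: the idea is a small, fixed-size queue of TIDs kept in the scan state and filled from the index, with the streaming-read callback handing out one heap block number per call and returning InvalidBlockNumber once the queue runs dry (hence the need to make the callback resumable). The following is only an illustrative sketch of that structure - the type and function names are hypothetical, not the ones in the attached patches:

```
#include "postgres.h"
#include "storage/block.h"
#include "storage/itemptr.h"

#define TID_QUEUE_SIZE 64		/* illustrative size */

/* hypothetical queue of TIDs fetched from the index, kept in the scan state */
typedef struct TidQueue
{
	ItemPointerData tids[TID_QUEUE_SIZE];
	int			head;			/* index of the oldest entry */
	int			count;			/* number of queued entries */
} TidQueue;

static inline bool
tid_queue_full(TidQueue *q)
{
	return q->count == TID_QUEUE_SIZE;
}

static inline void
tid_queue_push(TidQueue *q, ItemPointer tid)
{
	Assert(!tid_queue_full(q));
	q->tids[(q->head + q->count) % TID_QUEUE_SIZE] = *tid;
	q->count++;
}

/*
 * Streaming-read style callback: dequeue the next TID, remember it in the
 * per-buffer data so the caller can match the returned buffer back to the
 * index entry, and return its heap block for prefetching.  An empty queue
 * returns InvalidBlockNumber, which here means "nothing to read right now"
 * rather than "end of scan" - that is the resumability change described
 * above.
 */
static BlockNumber
index_prefetch_next_block(TidQueue *q, void *per_buffer_data)
{
	ItemPointerData tid;

	if (q->count == 0)
		return InvalidBlockNumber;

	tid = q->tids[q->head];
	q->head = (q->head + 1) % TID_QUEUE_SIZE;
	q->count--;

	*(ItemPointerData *) per_buffer_data = tid;

	return ItemPointerGetBlockNumber(&tid);
}
```

In the real patch the callback would of course reach the queue through the streaming read's private state rather than an explicit argument; the point here is only the queue-plus-resumable-callback shape.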
On 1/19/24 16:19, Konstantin Knizhnik wrote: > > On 18/01/2024 5:57 pm, Tomas Vondra wrote: >> On 1/16/24 21:10, Konstantin Knizhnik wrote: >>> ... >>> >>>> 4. I think that performing prefetch at executor level is really great >>>>> idea and so prefetch can be used by all indexes, including custom >>>>> indexes. But prefetch will be efficient only if index can provide fast >>>>> access to next TID (located at the same page). I am not sure that >>>>> it is >>>>> true for all builtin indexes (GIN, GIST, BRIN,...) and especially for >>>>> custom AM. I wonder if we should extend AM API to make index make a >>>>> decision weather to perform prefetch of TIDs or not. >>>> I'm not against having a flag to enable/disable prefetching, but the >>>> question is whether doing prefetching for such indexes can be harmful. >>>> I'm not sure about that. >>> I tend to agree with you - it is hard to imagine index implementation >>> which doesn't win from prefetching heap pages. >>> May be only the filtering case you have mentioned. But it seems to me >>> that current B-Tree index scan (not IOS) implementation in Postgres >>> doesn't try to use index tuple to check extra condition - it will fetch >>> heap tuple in any case. >>> >> That's true, but that's why I started working on this: >> >> https://commitfest.postgresql.org/46/4352/ >> >> I need to think about how to combine that with the prefetching. The good >> thing is that both changes require fetching TIDs, not slots. I think the >> condition can be simply added to the prefetch callback. >> >> >> regards >> > Looks like I was not true, even if it is not index-only scan but index > condition involves only index attributes, then heap is not accessed > until we find tuple satisfying search condition. > Inclusive index case described above > (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO > exotic case. If keys are actually used in search, then why not to create > normal compound index instead? > Not sure I follow ... Firstly, I'm not convinced the example addressed by that other patch is that exotic. IMHO it's quite possible it's actually quite common, but the users do no realize the possible gains. Also, there are reasons to not want very wide indexes - it has overhead associated with maintenance, disk space, etc. I think it's perfectly rational to design indexes in a way eliminates most heap fetches necessary to evaluate conditions, but does not guarantee IOS (so the last heap fetch is still needed). What do you mean by "create normal compound index"? The patch addresses a limitation that not every condition can be translated into a proper scan key. Even if we improve this, there will always be such conditions. The the IOS can evaluate them on index tuple, the regular index scan can't do that (currently). Can you share an example demonstrating the alternative approach? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>> Looks like I was not true, even if it is not index-only scan but index
>> condition involves only index attributes, then heap is not accessed
>> until we find tuple satisfying search condition.
>> Inclusive index case described above
>> (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO
>> exotic case. If keys are actually used in search, then why not to create
>> normal compound index instead?
>
> Not sure I follow ...
>
> Firstly, I'm not convinced the example addressed by that other patch is
> that exotic. IMHO it's quite possible it's actually quite common, but
> the users do no realize the possible gains.
>
> Also, there are reasons to not want very wide indexes - it has overhead
> associated with maintenance, disk space, etc. I think it's perfectly
> rational to design indexes in a way eliminates most heap fetches
> necessary to evaluate conditions, but does not guarantee IOS (so the
> last heap fetch is still needed).
We are comparing a compound index (a,b) with a covering (inclusive) index (a) include (b).
These indexes have exactly the same width and size, and almost the same maintenance overhead.
The first index has a more expensive comparison function (involving two columns), but I do not think that can significantly affect
performance or maintenance cost. Also, if the selectivity of "a" is good enough, then there is no need to compare "b".
Why would we prefer a covering index to a compound index? I see only two good reasons:
1. The extra columns' type does not have the comparison function needed by the AM.
2. The extra columns are never used in a query predicate.
If you are going to use these columns in query predicates, I do not see much sense in creating an inclusive index rather than a compound index.
Do you?
> What do you mean by "create normal compound index"? The patch addresses
> a limitation that not every condition can be translated into a proper
> scan key. Even if we improve this, there will always be such conditions.
> The the IOS can evaluate them on index tuple, the regular index scan
> can't do that (currently).
>
> Can you share an example demonstrating the alternative approach?
Maybe I missed something.
This is the example from https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU=@yamlcoder.me :
```
And here is the plan with index on (a,b).
 Limit  (cost=0.42..4447.90 rows=1 width=12) (actual time=6.883..6.884 rows=0 loops=1)
   Output: a, b, d
   Buffers: shared hit=613
   ->  Index Scan using t_a_b_idx on public.t  (cost=0.42..4447.90 rows=1 width=12) (actual time=6.880..6.881 rows=0 loops=1)
         Output: a, b, d
         Index Cond: ((t.a > 1000000) AND (t.b = 4))
         Buffers: shared hit=613
 Planning:
   Buffers: shared hit=41
 Planning Time: 0.314 ms
 Execution Time: 6.910 ms
```
Isn't it an optimal plan for this query?
And a quote from the self-contained reproducible example https://dbfiddle.uk/iehtq44L :
```
create unique index t_a_include_b on t(a) include (b);
-- I'd expect index above to behave the same as index below for this query
--create unique index on t(a,b);
```
I agree that it is natural to expect the same result for both indexes. So this PR definitely makes sense.
My point is only that a compound index (a,b) is more natural and preferable in this case.
On 19/01/2024 2:35 pm, Tomas Vondra wrote: > > On 1/19/24 09:34, Konstantin Knizhnik wrote: >> On 18/01/2024 6:00 pm, Tomas Vondra wrote: >>> On 1/17/24 09:45, Konstantin Knizhnik wrote: >>>> I have integrated your prefetch patch in Neon and it actually works! >>>> Moreover, I combined it with prefetch of leaf pages for IOS and it also >>>> seems to work. >>>> >>> Cool! And do you think this is the right design/way to do this? >> I like the idea of prefetching TIDs in executor. >> >> But looking though your patch I have some questions: >> >> >> 1. Why it is necessary to allocate and store all_visible flag in data >> buffer. Why caller of IndexPrefetchNext can not look at prefetch field? >> >> + /* store the all_visible flag in the private part of the entry */ >> + entry->data = palloc(sizeof(bool)); >> + *(bool *) entry->data = all_visible; >> > What you mean by "prefetch field"? I mean "prefetch" field of IndexPrefetchEntry: + +typedef struct IndexPrefetchEntry +{ + ItemPointerData tid; + + /* should we prefetch heap page for this TID? */ + bool prefetch; + You store the same flag twice: + /* prefetch only if not all visible */ + entry->prefetch = !all_visible; + + /* store the all_visible flag in the private part of the entry */ + entry->data = palloc(sizeof(bool)); + *(bool *) entry->data = all_visible; My question was: why do we need to allocate something in entry->data and store all_visible in it, while we already stored !all-visible in entry->prefetch.
On 1/21/24 20:50, Konstantin Knizhnik wrote: > > On 20/01/2024 12:14 am, Tomas Vondra wrote: >> Looks like I was not true, even if it is not index-only scan but index >>> condition involves only index attributes, then heap is not accessed >>> until we find tuple satisfying search condition. >>> Inclusive index case described above >>> (https://commitfest.postgresql.org/46/4352/) is interesting but IMHO >>> exotic case. If keys are actually used in search, then why not to create >>> normal compound index instead? >>> >> Not sure I follow ... >> >> Firstly, I'm not convinced the example addressed by that other patch is >> that exotic. IMHO it's quite possible it's actually quite common, but >> the users do no realize the possible gains. >> >> Also, there are reasons to not want very wide indexes - it has overhead >> associated with maintenance, disk space, etc. I think it's perfectly >> rational to design indexes in a way eliminates most heap fetches >> necessary to evaluate conditions, but does not guarantee IOS (so the >> last heap fetch is still needed). > > We are comparing compound index (a,b) and covering (inclusive) index (a) > include (b) > This indexes have exactly the same width and size and almost the same > maintenance overhead. > > First index has more expensive comparison function (involving two > columns) but I do not think that it can significantly affect > performance and maintenance cost. Also if selectivity of "a" is good > enough, then there is no need to compare "b" > > Why we can prefer covering index to compound index? I see only two good > reasons: > 1. Extra columns type do not have comparison function need for AM. > 2. The extra columns are never used in query predicate. > Or maybe you don't want to include the columns in a UNIQUE constraint? > If you are going to use this columns in query predicates I do not see > much sense in creating inclusive index rather than compound index. > Do you? > But this is also about conditions that can't be translated into index scan keys. Consider this: create table t (a int, b int, c int); insert into t select 1000 * random(), 1000 * random(), 1000 * random() from generate_series(1,1000000) s(i); create index on t (a,b); vacuum analyze t; explain (analyze, buffers) select * from t where a = 10 and mod(b,10) = 1111111; QUERY PLAN ----------------------------------------------------------------------------------------------------------------- Index Scan using t_a_b_idx on t (cost=0.42..3670.74 rows=5 width=12) (actual time=4.562..4.564 rows=0 loops=1) Index Cond: (a = 10) Filter: (mod(b, 10) = 1111111) Rows Removed by Filter: 974 Buffers: shared hit=980 Prefetches: blocks=901 Planning Time: 0.304 ms Execution Time: 5.146 ms (8 rows) Notice that this still fetched ~1000 buffers in order to evaluate the filter on "b", because it's complex and can't be transformed into a nice scan key. Or this: explain (analyze, buffers) select a from t where a = 10 and (b+1) < 100 and c < 0; QUERY PLAN ---------------------------------------------------------------------------------------------------------------- Index Scan using t_a_b_idx on t (cost=0.42..3673.22 rows=1 width=4) (actual time=4.446..4.448 rows=0 loops=1) Index Cond: (a = 10) Filter: ((c < 0) AND ((b + 1) < 100)) Rows Removed by Filter: 974 Buffers: shared hit=980 Prefetches: blocks=901 Planning Time: 0.313 ms Execution Time: 4.878 ms (8 rows) where it's "broken" by the extra unindexed column. FWIW there are the primary cases I had in mind for this patch. 
> >> What do you mean by "create normal compound index"? The patch addresses >> a limitation that not every condition can be translated into a proper >> scan key. Even if we improve this, there will always be such conditions. >> The the IOS can evaluate them on index tuple, the regular index scan >> can't do that (currently). >> >> Can you share an example demonstrating the alternative approach? > > May be I missed something. > > This is the example from > https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU=@yamlcoder.me : > > ``` > > And here is the plan with index on (a,b). > > Limit (cost=0.42..4447.90 rows=1 width=12) (actual time=6.883..6.884 > rows=0 loops=1) Output: a, b, d Buffers: shared hit=613 -> > Index Scan using t_a_b_idx on public.t (cost=0.42..4447.90 rows=1 > width=12) (actual time=6.880..6.881 rows=0 loops=1) Output: a, > b, d Index Cond: ((t.a > 1000000) AND (t.b = 4)) > Buffers: shared hit=613 Planning: Buffers: shared hit=41 Planning > Time: 0.314 ms Execution Time: 6.910 ms ``` > > > Isn't it an optimal plan for this query? > > And cite from self reproducible example https://dbfiddle.uk/iehtq44L : > ``` > create unique index t_a_include_b on t(a) include (b); > -- I'd expecd index above to behave the same as index below for this query > --create unique index on t(a,b); > ``` > > I agree that it is natural to expect the same result for both indexes. > So this PR definitely makes sense. > My point is only that compound index (a,b) in this case is more natural > and preferable. > Yes, perhaps. But you may also see it from the other direction - if you already have an index with included columns (for whatever reason), it would be nice to leverage that if possible. And as I mentioned above, it's not always the case that move a column from "included" to a proper key, or stuff like that. Anyway, it seems entirely unrelated to this prefetching thread. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/21/24 20:56, Konstantin Knizhnik wrote: > > On 19/01/2024 2:35 pm, Tomas Vondra wrote: >> >> On 1/19/24 09:34, Konstantin Knizhnik wrote: >>> On 18/01/2024 6:00 pm, Tomas Vondra wrote: >>>> On 1/17/24 09:45, Konstantin Knizhnik wrote: >>>>> I have integrated your prefetch patch in Neon and it actually works! >>>>> Moreover, I combined it with prefetch of leaf pages for IOS and it >>>>> also >>>>> seems to work. >>>>> >>>> Cool! And do you think this is the right design/way to do this? >>> I like the idea of prefetching TIDs in executor. >>> >>> But looking though your patch I have some questions: >>> >>> >>> 1. Why it is necessary to allocate and store all_visible flag in data >>> buffer. Why caller of IndexPrefetchNext can not look at prefetch field? >>> >>> + /* store the all_visible flag in the private part of the >>> entry */ >>> + entry->data = palloc(sizeof(bool)); >>> + *(bool *) entry->data = all_visible; >>> >> What you mean by "prefetch field"? > > > I mean "prefetch" field of IndexPrefetchEntry: > > + > +typedef struct IndexPrefetchEntry > +{ > + ItemPointerData tid; > + > + /* should we prefetch heap page for this TID? */ > + bool prefetch; > + > > You store the same flag twice: > > + /* prefetch only if not all visible */ > + entry->prefetch = !all_visible; > + > + /* store the all_visible flag in the private part of the entry */ > + entry->data = palloc(sizeof(bool)); > + *(bool *) entry->data = all_visible; > > My question was: why do we need to allocate something in entry->data and > store all_visible in it, while we already stored !all-visible in > entry->prefetch. > Ah, right. Well, you're right in this case we perhaps could set just one of those flags, but the "purpose" of the two places is quite different. The "prefetch" flag is fully controlled by the prefetcher, and it's up to it to change it (e.g. I can easily imagine some new logic touching setting it to "false" for some reason). The "data" flag is fully controlled by the custom callbacks, so whatever the callback stores, will be there. I don't think it's worth simplifying this. In particular, I don't think the callback can assume it can rely on the "prefetch" flag. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
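For readers following along, here is a small annotated sketch of the split being described. The struct fields are the ones quoted from the patch above; the callback name and body are only illustrative:

```
#include "postgres.h"
#include "storage/itemptr.h"

typedef struct IndexPrefetchEntry
{
	ItemPointerData tid;

	/* should we prefetch heap page for this TID? (owned by the prefetcher) */
	bool		prefetch;

	/* opaque per-entry data (owned by the custom callback) */
	void	   *data;
} IndexPrefetchEntry;

/* illustrative IOS-style callback filling one entry (hypothetical name) */
static void
ios_prefetch_fill_entry(IndexPrefetchEntry *entry, bool all_visible)
{
	/*
	 * A hint for the prefetcher: no point prefetching the heap page of an
	 * all-visible block.  The prefetcher may later override this for its
	 * own reasons, so nobody should read it back as "was all-visible".
	 */
	entry->prefetch = !all_visible;

	/*
	 * The scan node's own copy of the visibility result, so it does not
	 * have to check the VM again.  This is why the value is stored
	 * separately instead of being derived from entry->prefetch.
	 */
	entry->data = palloc(sizeof(bool));
	*(bool *) entry->data = all_visible;
}
```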
2024-01 Commitfest. Hi, This patch has a CF status of "Needs Review" [1], but it seems like there were CFbot test failures last time it was run [2]. Please have a look and post an updated version if necessary. ====== [1] https://commitfest.postgresql.org/46/4351/ [2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4351 Kind Regards, Peter Smith.
> Ah, right. Well, you're right in this case we perhaps could set just one
> of those flags, but the "purpose" of the two places is quite different.
>
> The "prefetch" flag is fully controlled by the prefetcher, and it's up
> to it to change it (e.g. I can easily imagine some new logic touching
> setting it to "false" for some reason).
>
> The "data" flag is fully controlled by the custom callbacks, so whatever
> the callback stores, will be there.
>
> I don't think it's worth simplifying this. In particular, I don't think
> the callback can assume it can rely on the "prefetch" flag.
Why not add an "all_visible" flag to IndexPrefetchEntry? It will not cause any extra space overhead (because of alignment), but it allows avoiding a dynamic memory allocation (not sure if it is critical, but nice to avoid if possible).
>> Why we can prefer covering index to compound index? I see only two good
>> reasons:
>> 1. Extra columns type do not have comparison function need for AM.
>> 2. The extra columns are never used in query predicate.
>
> Or maybe you don't want to include the columns in a UNIQUE constraint?
Do you mean that a compound index (a,b) cannot be used to enforce uniqueness of "a"?
If so, I agree.
>> If you are going to use this columns in query predicates I do not see
>> much sense in creating inclusive index rather than compound index.
>> Do you?
>
> But this is also about conditions that can't be translated into index
> scan keys. Consider this:
>
>   create table t (a int, b int, c int);
>   insert into t select 1000 * random(), 1000 * random(), 1000 * random()
>     from generate_series(1,1000000) s(i);
>   create index on t (a,b);
>   vacuum analyze t;
>
>   explain (analyze, buffers)
>   select * from t where a = 10 and mod(b,10) = 1111111;
>
>                                                      QUERY PLAN
>   -----------------------------------------------------------------------------------------------------------------
>    Index Scan using t_a_b_idx on t  (cost=0.42..3670.74 rows=5 width=12) (actual time=4.562..4.564 rows=0 loops=1)
>      Index Cond: (a = 10)
>      Filter: (mod(b, 10) = 1111111)
>      Rows Removed by Filter: 974
>      Buffers: shared hit=980
>      Prefetches: blocks=901
>    Planning Time: 0.304 ms
>    Execution Time: 5.146 ms
>   (8 rows)
>
> Notice that this still fetched ~1000 buffers in order to evaluate the
> filter on "b", because it's complex and can't be transformed into a nice
> scan key.
Oh yes.
Looks like I didn't understand the logic of when a predicate is included in the index condition and when not.
It seemed natural to me that only a predicate which specifies some range can be included in the index condition.
But it is not the case:
postgres=# explain select * from t where a = 10 and b in (10,20,30);
                             QUERY PLAN
---------------------------------------------------------------------
 Index Scan using t_a_b_idx on t  (cost=0.42..25.33 rows=3 width=12)
   Index Cond: ((a = 10) AND (b = ANY ('{10,20,30}'::integer[])))
(2 rows)

So I thought any predicate using index keys is included in the index condition.
But it is not true (as your example shows).
But IMHO mod(b,10)=111111 or (b+1) < 100 are both quite rare predicates, which is why I called these use cases "exotic".
In any case, if we have some columns in the index tuple, it is desirable to use them for filtering before extracting the heap tuple.
But I am afraid it will not be so easy to implement...
On 1/19/24 22:43, Melanie Plageman wrote: > On Fri, Jan 12, 2024 at 11:42 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 1/9/24 21:31, Robert Haas wrote: >>> On Thu, Jan 4, 2024 at 9:55 AM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> Here's a somewhat reworked version of the patch. My initial goal was to >>>> see if it could adopt the StreamingRead API proposed in [1], but that >>>> turned out to be less straight-forward than I hoped, for two reasons: >>> >>> I guess we need Thomas or Andres or maybe Melanie to comment on this. >>> >> >> Yeah. Or maybe Thomas if he has thoughts on how to combine this with the >> streaming I/O stuff. > > I've been studying your patch with the intent of finding a way to > change it and or the streaming read API to work together. I've > attached a very rough sketch of how I think it could work. > Thanks. > We fill a queue with blocks from TIDs that we fetched from the index. > The queue is saved in a scan descriptor that is made available to the > streaming read callback. Once the queue is full, we invoke the table > AM specific index_fetch_tuple() function which calls > pg_streaming_read_buffer_get_next(). When the streaming read API > invokes the callback we registered, it simply dequeues a block number > for prefetching. So in a way there are two queues in IndexFetchTableData. One (blk_queue) is being filled from IndexNext, and then the queue in StreamingRead. > The only change to the streaming read API is that now, even if the > callback returns InvalidBlockNumber, we may not be finished, so make > it resumable. > Hmm, not sure when can the callback return InvalidBlockNumber before reaching the end. Perhaps for the first index_fetch_heap call? Any reason not to fill the blk_queue before calling index_fetch_heap? > Structurally, this changes the timing of when the heap blocks are > prefetched. Your code would get a tid from the index and then prefetch > the heap block -- doing this until it filled a queue that had the > actual tids saved in it. With my approach and the streaming read API, > you fetch tids from the index until you've filled up a queue of block > numbers. Then the streaming read API will prefetch those heap blocks. > And is that a good/desirable change? I'm not saying it's not, but maybe we should not be filling either queue in one go - we don't want to overload the prefetching. > I didn't actually implement the block queue -- I just saved a single > block number and pretended it was a block queue. I was imagining we > replace this with something like your IndexPrefetch->blockItems -- > which has light deduplication. We'd probably have to flesh it out more > than that. > I don't understand how this passes the TID to the index_fetch_heap. Isn't it working only by accident, due to blk_queue only having a single entry? Shouldn't the first queue (blk_queue) store TIDs instead? > There are also table AM layering violations in my sketch which would > have to be worked out (not to mention some resource leakage I didn't > bother investigating [which causes it to fail tests]). > > 0001 is all of Thomas' streaming read API code that isn't yet in > master and 0002 is my rough sketch of index prefetching using the > streaming read API > > There are also numerous optimizations that your index prefetching > patch set does that would need to be added in some way. I haven't > thought much about it yet. I wanted to see what you thought of this > approach first. Basically, is it workable? > It seems workable, yes. 
I'm not sure it's much simpler than my patch (considering a lot of the code is in the optimizations, which are missing from this patch). I think the question is where should the optimizations happen. I suppose some of them might/should happen in the StreamingRead API itself - like the detection of sequential patterns, recently prefetched blocks, ... But I'm not sure what to do about optimizations that are more specific to the access path. Consider for example the index-only scans. We don't want to prefetch all the pages, we need to inspect the VM and prefetch just the not-all-visible ones. And then pass the info to the index scan, so that it does not need to check the VM again. It's not clear to me how to do this with this approach. The main -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 1/19/24 22:43, Melanie Plageman wrote: > > > We fill a queue with blocks from TIDs that we fetched from the index. > > The queue is saved in a scan descriptor that is made available to the > > streaming read callback. Once the queue is full, we invoke the table > > AM specific index_fetch_tuple() function which calls > > pg_streaming_read_buffer_get_next(). When the streaming read API > > invokes the callback we registered, it simply dequeues a block number > > for prefetching. > > So in a way there are two queues in IndexFetchTableData. One (blk_queue) > is being filled from IndexNext, and then the queue in StreamingRead. I've changed the name from blk_queue to tid_queue to fix the issue you mention in your later remarks. I suppose there are two queues. The tid_queue is just to pass the block requests to the streaming read API. The prefetch distance will be the smaller of the two sizes. > > The only change to the streaming read API is that now, even if the > > callback returns InvalidBlockNumber, we may not be finished, so make > > it resumable. > > Hmm, not sure when can the callback return InvalidBlockNumber before > reaching the end. Perhaps for the first index_fetch_heap call? Any > reason not to fill the blk_queue before calling index_fetch_heap? The callback will return InvalidBlockNumber whenever the queue is empty. Let's say your queue size is 5 and your effective prefetch distance is 10 (some combination of the PgStreamingReadRange sizes and PgStreamingRead->max_ios). The first time you call index_fetch_heap(), the callback returns InvalidBlockNumber. Then the tid_queue is filled with 5 tids. Then index_fetch_heap() is called. pg_streaming_read_look_ahead() will prefetch all 5 of these TID's blocks, emptying the queue. Once all 5 have been dequeued, the callback will return InvalidBlockNumber. pg_streaming_read_buffer_get_next() will return one of the 5 blocks in a buffer and save the associated TID in the per_buffer_data. Before index_fetch_heap() is called again, we will see that the queue is not full and fill it up again with 5 TIDs. So, the callback will return InvalidBlockNumber 3 times in this scenario. > > Structurally, this changes the timing of when the heap blocks are > > prefetched. Your code would get a tid from the index and then prefetch > > the heap block -- doing this until it filled a queue that had the > > actual tids saved in it. With my approach and the streaming read API, > > you fetch tids from the index until you've filled up a queue of block > > numbers. Then the streaming read API will prefetch those heap blocks. > > And is that a good/desirable change? I'm not saying it's not, but maybe > we should not be filling either queue in one go - we don't want to > overload the prefetching. We can focus on the prefetch distance algorithm maintained in the streaming read API and then make sure that the tid_queue is larger than the desired prefetch distance maintained by the streaming read API. > > I didn't actually implement the block queue -- I just saved a single > > block number and pretended it was a block queue. I was imagining we > > replace this with something like your IndexPrefetch->blockItems -- > > which has light deduplication. We'd probably have to flesh it out more > > than that. > > I don't understand how this passes the TID to the index_fetch_heap. > Isn't it working only by accident, due to blk_queue only having a single > entry? 
Shouldn't the first queue (blk_queue) store TIDs instead? Oh dear! Fixed in the attached v2. I've replaced the single BlockNumber with a single ItemPointerData. I will work on implementing an actual queue next week. > > There are also table AM layering violations in my sketch which would > > have to be worked out (not to mention some resource leakage I didn't > > bother investigating [which causes it to fail tests]). > > > > 0001 is all of Thomas' streaming read API code that isn't yet in > > master and 0002 is my rough sketch of index prefetching using the > > streaming read API > > > > There are also numerous optimizations that your index prefetching > > patch set does that would need to be added in some way. I haven't > > thought much about it yet. I wanted to see what you thought of this > > approach first. Basically, is it workable? > > It seems workable, yes. I'm not sure it's much simpler than my patch > (considering a lot of the code is in the optimizations, which are > missing from this patch). > > I think the question is where should the optimizations happen. I suppose > some of them might/should happen in the StreamingRead API itself - like > the detection of sequential patterns, recently prefetched blocks, ... So, the streaming read API does detection of sequential patterns and not prefetching things that are in shared buffers. It doesn't handle avoiding prefetching recently prefetched blocks yet AFAIK. But I daresay this would be relevant for other streaming read users and could certainly be implemented there. > But I'm not sure what to do about optimizations that are more specific > to the access path. Consider for example the index-only scans. We don't > want to prefetch all the pages, we need to inspect the VM and prefetch > just the not-all-visible ones. And then pass the info to the index scan, > so that it does not need to check the VM again. It's not clear to me how > to do this with this approach. Yea, this is an issue I'll need to think about. To really spell out the problem: the callback dequeues a TID from the tid_queue and looks up its block in the VM. It's all visible. So, it shouldn't return that block to the streaming read API to fetch from the heap because it doesn't need to be read. But, where does the callback put the TID so that the caller can get it? I'm going to think more about this. As for passing around the all visible status so as to not reread the VM block -- that feels solvable but I haven't looked into it. - Melanie
Attachments
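To illustrate the fill-then-drain cycle described above (a queue of TIDs topped up before each heap fetch, with the callback returning InvalidBlockNumber whenever the queue empties), here is a rough sketch of what the executor-side loop might look like. It reuses the hypothetical TidQueue helpers sketched earlier in the thread, and index_fetch_heap_from_queue is likewise a made-up placeholder for the table-AM call that drives the streaming read:

```
#include "postgres.h"
#include "access/genam.h"
#include "executor/tuptable.h"
#include "storage/itemptr.h"

/* hypothetical helpers from the earlier sketch */
typedef struct TidQueue TidQueue;
extern bool tid_queue_full(TidQueue *q);
extern void tid_queue_push(TidQueue *q, ItemPointer tid);
extern bool index_fetch_heap_from_queue(IndexScanDesc scan, TidQueue *q,
										TupleTableSlot *slot);

static bool
index_next_with_prefetch(IndexScanDesc scan, ScanDirection dir,
						 TidQueue *q, TupleTableSlot *slot)
{
	/* top up the TID queue before asking for the next heap tuple */
	while (!tid_queue_full(q))
	{
		ItemPointer tid = index_getnext_tid(scan, dir);

		if (tid == NULL)
			break;				/* no more index entries */

		tid_queue_push(q, tid);
	}

	/*
	 * The table AM asks the streaming read API for the next buffer; the
	 * API in turn calls back into the queue to learn which heap blocks to
	 * read and prefetch, up to its own prefetch distance.  The effective
	 * distance is therefore bounded by both the queue size and the
	 * streaming read settings, as discussed above.
	 */
	return index_fetch_heap_from_queue(scan, q, slot);
}
```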
On 1/24/24 01:51, Melanie Plageman wrote: > On Tue, Jan 23, 2024 at 12:43 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 1/19/24 22:43, Melanie Plageman wrote: >> >>> We fill a queue with blocks from TIDs that we fetched from the index. >>> The queue is saved in a scan descriptor that is made available to the >>> streaming read callback. Once the queue is full, we invoke the table >>> AM specific index_fetch_tuple() function which calls >>> pg_streaming_read_buffer_get_next(). When the streaming read API >>> invokes the callback we registered, it simply dequeues a block number >>> for prefetching. >> >> So in a way there are two queues in IndexFetchTableData. One (blk_queue) >> is being filled from IndexNext, and then the queue in StreamingRead. > > I've changed the name from blk_queue to tid_queue to fix the issue you > mention in your later remarks. > I suppose there are two queues. The tid_queue is just to pass the > block requests to the streaming read API. The prefetch distance will > be the smaller of the two sizes. > FWIW I think the two queues are a nice / elegant approach. In hindsight my problems with trying to utilize the StreamingRead were due to trying to use the block-oriented API directly from places that work with TIDs, and this just makes that go away. I wonder what the overhead of shuffling stuff between queues will be, but hopefully not too high (that's my assumption). >>> The only change to the streaming read API is that now, even if the >>> callback returns InvalidBlockNumber, we may not be finished, so make >>> it resumable. >> >> Hmm, not sure when can the callback return InvalidBlockNumber before >> reaching the end. Perhaps for the first index_fetch_heap call? Any >> reason not to fill the blk_queue before calling index_fetch_heap? > > The callback will return InvalidBlockNumber whenever the queue is > empty. Let's say your queue size is 5 and your effective prefetch > distance is 10 (some combination of the PgStreamingReadRange sizes and > PgStreamingRead->max_ios). The first time you call index_fetch_heap(), > the callback returns InvalidBlockNumber. Then the tid_queue is filled > with 5 tids. Then index_fetch_heap() is called. > pg_streaming_read_look_ahead() will prefetch all 5 of these TID's > blocks, emptying the queue. Once all 5 have been dequeued, the > callback will return InvalidBlockNumber. > pg_streaming_read_buffer_get_next() will return one of the 5 blocks in > a buffer and save the associated TID in the per_buffer_data. Before > index_fetch_heap() is called again, we will see that the queue is not > full and fill it up again with 5 TIDs. So, the callback will return > InvalidBlockNumber 3 times in this scenario. > Thanks for the explanation. Yes, I didn't realize that the queues may be of different length, at which point it makes sense to return invalid block to signal the TID queue is empty. >>> Structurally, this changes the timing of when the heap blocks are >>> prefetched. Your code would get a tid from the index and then prefetch >>> the heap block -- doing this until it filled a queue that had the >>> actual tids saved in it. With my approach and the streaming read API, >>> you fetch tids from the index until you've filled up a queue of block >>> numbers. Then the streaming read API will prefetch those heap blocks. >> >> And is that a good/desirable change? I'm not saying it's not, but maybe >> we should not be filling either queue in one go - we don't want to >> overload the prefetching. 
> > We can focus on the prefetch distance algorithm maintained in the > streaming read API and then make sure that the tid_queue is larger > than the desired prefetch distance maintained by the streaming read > API. > Agreed. I think I wasn't quite right when concerned about "overloading" the prefetch, because that depends entirely on the StreamingRead API queue. A lage TID queue can't cause overload of anything. What could happen is a TID queue being too small, so the prefetch can't hit the target distance. But that can happen already, e.g. indexes that are correlated and/or index-only scans with all-visible pages. >>> There are also table AM layering violations in my sketch which would >>> have to be worked out (not to mention some resource leakage I didn't >>> bother investigating [which causes it to fail tests]). >>> >>> 0001 is all of Thomas' streaming read API code that isn't yet in >>> master and 0002 is my rough sketch of index prefetching using the >>> streaming read API >>> >>> There are also numerous optimizations that your index prefetching >>> patch set does that would need to be added in some way. I haven't >>> thought much about it yet. I wanted to see what you thought of this >>> approach first. Basically, is it workable? >> >> It seems workable, yes. I'm not sure it's much simpler than my patch >> (considering a lot of the code is in the optimizations, which are >> missing from this patch). >> >> I think the question is where should the optimizations happen. I suppose >> some of them might/should happen in the StreamingRead API itself - like >> the detection of sequential patterns, recently prefetched blocks, ... > > So, the streaming read API does detection of sequential patterns and > not prefetching things that are in shared buffers. It doesn't handle > avoiding prefetching recently prefetched blocks yet AFAIK. But I > daresay this would be relevant for other streaming read users and > could certainly be implemented there. > Yes, the "recently prefetched stuff" cache seems like a fairly natural complement to the pattern detection and shared-buffers check. FWIW I wonder if we should make some of this customizable, so that systems with customized storage (e.g. neon or with direct I/O) can e.g. disable some of these checks. Or replace them with their version. >> But I'm not sure what to do about optimizations that are more specific >> to the access path. Consider for example the index-only scans. We don't >> want to prefetch all the pages, we need to inspect the VM and prefetch >> just the not-all-visible ones. And then pass the info to the index scan, >> so that it does not need to check the VM again. It's not clear to me how >> to do this with this approach. > > Yea, this is an issue I'll need to think about. To really spell out > the problem: the callback dequeues a TID from the tid_queue and looks > up its block in the VM. It's all visible. So, it shouldn't return that > block to the streaming read API to fetch from the heap because it > doesn't need to be read. But, where does the callback put the TID so > that the caller can get it? I'm going to think more about this. > Yes, that's the problem for index-only scans. I'd generalize it so that it's about the callback being able to (a) decide if it needs to read the heap page, and (b) store some custom info for the TID. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/22/24 08:21, Konstantin Knizhnik wrote: > > On 22/01/2024 1:39 am, Tomas Vondra wrote: >>> Why we can prefer covering index to compound index? I see only two good >>> reasons: >>> 1. Extra columns type do not have comparison function need for AM. >>> 2. The extra columns are never used in query predicate. >>> >> Or maybe you don't want to include the columns in a UNIQUE constraint? >> > Do you mean that compound index (a,b) can not be used to enforce > uniqueness of "a"? > If so, I agree. > Yes. >>> If you are going to use this columns in query predicates I do not see >>> much sense in creating inclusive index rather than compound index. >>> Do you? >>> >> But this is also about conditions that can't be translated into index >> scan keys. Consider this: >> >> create table t (a int, b int, c int); >> insert into t select 1000 * random(), 1000 * random(), 1000 * random() >> from generate_series(1,1000000) s(i); >> create index on t (a,b); >> vacuum analyze t; >> >> explain (analyze, buffers) select * from t where a = 10 and mod(b,10) = >> 1111111; >> QUERY PLAN >> >> ----------------------------------------------------------------------------------------------------------------- >> Index Scan using t_a_b_idx on t (cost=0.42..3670.74 rows=5 width=12) >> (actual time=4.562..4.564 rows=0 loops=1) >> Index Cond: (a = 10) >> Filter: (mod(b, 10) = 1111111) >> Rows Removed by Filter: 974 >> Buffers: shared hit=980 >> Prefetches: blocks=901 >> Planning Time: 0.304 ms >> Execution Time: 5.146 ms >> (8 rows) >> >> Notice that this still fetched ~1000 buffers in order to evaluate the >> filter on "b", because it's complex and can't be transformed into a nice >> scan key. > > O yes. > Looks like I didn't understand the logic when predicate is included in > index condition and when not. > It seems to be natural that only such predicate which specifies some > range can be included in index condition. > But it is not the case: > > postgres=# explain select * from t where a = 10 and b in (10,20,30); > QUERY PLAN > --------------------------------------------------------------------- > Index Scan using t_a_b_idx on t (cost=0.42..25.33 rows=3 width=12) > Index Cond: ((a = 10) AND (b = ANY ('{10,20,30}'::integer[]))) > (2 rows) > > So I though ANY predicate using index keys is included in index condition. > But it is not true (as your example shows). > > But IMHO mod(b,10)=111111 or (b+1) < 100 are both quite rare predicates > this is why I named this use cases "exotic". Not sure I agree with describing this as "exotic". The same thing applies to an arbitrary function call. And those are pretty common in conditions - date_part/date_trunc. Arithmetic expressions are not that uncommon either. Also, users sometimes have conditions comparing multiple keys (a<b) etc. But even if it was "uncommon", the whole point of this patch is to eliminate these corner cases where a user does something minor (like adding an output column), and the executor disables an optimization unnecessarily, causing unexpected regressions. > > In any case, if we have some columns in index tuple it is desired to use > them for filtering before extracting heap tuple. > But I afraid it will be not so easy to implement... > I'm not sure what you mean. The patch does that, more or less. There's issues that need to be solved (e.g. to decide when not to do this), and how to integrate that into the scan interface (where the quals are evaluated at the end). What do you mean when you say "will not be easy to implement"? 
What problems do you foresee? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/22/24 07:35, Konstantin Knizhnik wrote: > > On 22/01/2024 1:47 am, Tomas Vondra wrote: >> h, right. Well, you're right in this case we perhaps could set just one >> of those flags, but the "purpose" of the two places is quite different. >> >> The "prefetch" flag is fully controlled by the prefetcher, and it's up >> to it to change it (e.g. I can easily imagine some new logic touching >> setting it to "false" for some reason). >> >> The "data" flag is fully controlled by the custom callbacks, so whatever >> the callback stores, will be there. >> >> I don't think it's worth simplifying this. In particular, I don't think >> the callback can assume it can rely on the "prefetch" flag. >> > Why not to add "all_visible" flag to IndexPrefetchEntry ? If will not > cause any extra space overhead (because of alignment), but allows to > avoid dynamic memory allocation (not sure if it is critical, but nice to > avoid if possible). > Because it's specific to index-only scans, while IndexPrefetchEntry is a generic thing, for all places. However: (1) Melanie actually presented a very different way to implement this, relying on the StreamingRead API. So chances are this struct won't actually be used. (2) After going through Melanie's patch, I realized this is actually broken. The IOS case needs to keep more stuff, not just the all-visible flag, but also the index tuple. Otherwise it'll just operate on the last tuple read from the index, which happens to be in xs_ituple. Attached is a patch with a trivial fix. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Wed, Jan 24, 2024 at 4:19 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 1/24/24 01:51, Melanie Plageman wrote: > > >>> There are also table AM layering violations in my sketch which would > >>> have to be worked out (not to mention some resource leakage I didn't > >>> bother investigating [which causes it to fail tests]). > >>> > >>> 0001 is all of Thomas' streaming read API code that isn't yet in > >>> master and 0002 is my rough sketch of index prefetching using the > >>> streaming read API > >>> > >>> There are also numerous optimizations that your index prefetching > >>> patch set does that would need to be added in some way. I haven't > >>> thought much about it yet. I wanted to see what you thought of this > >>> approach first. Basically, is it workable? > >> > >> It seems workable, yes. I'm not sure it's much simpler than my patch > >> (considering a lot of the code is in the optimizations, which are > >> missing from this patch). > >> > >> I think the question is where should the optimizations happen. I suppose > >> some of them might/should happen in the StreamingRead API itself - like > >> the detection of sequential patterns, recently prefetched blocks, ... > > > > So, the streaming read API does detection of sequential patterns and > > not prefetching things that are in shared buffers. It doesn't handle > > avoiding prefetching recently prefetched blocks yet AFAIK. But I > > daresay this would be relevant for other streaming read users and > > could certainly be implemented there. > > > > Yes, the "recently prefetched stuff" cache seems like a fairly natural > complement to the pattern detection and shared-buffers check. > > FWIW I wonder if we should make some of this customizable, so that > systems with customized storage (e.g. neon or with direct I/O) can e.g. > disable some of these checks. Or replace them with their version. That's a promising idea. > >> But I'm not sure what to do about optimizations that are more specific > >> to the access path. Consider for example the index-only scans. We don't > >> want to prefetch all the pages, we need to inspect the VM and prefetch > >> just the not-all-visible ones. And then pass the info to the index scan, > >> so that it does not need to check the VM again. It's not clear to me how > >> to do this with this approach. > > > > Yea, this is an issue I'll need to think about. To really spell out > > the problem: the callback dequeues a TID from the tid_queue and looks > > up its block in the VM. It's all visible. So, it shouldn't return that > > block to the streaming read API to fetch from the heap because it > > doesn't need to be read. But, where does the callback put the TID so > > that the caller can get it? I'm going to think more about this. > > > > Yes, that's the problem for index-only scans. I'd generalize it so that > it's about the callback being able to (a) decide if it needs to read the > heap page, and (b) store some custom info for the TID. Actually, I think this is no big deal. See attached. I just don't enqueue tids whose blocks are all visible. I had to switch the order from fetch heap then fill queue to fill queue then fetch heap. While doing this I noticed some wrong results in the regression tests (like in the alter table test), so I suspect I have some kind of control flow issue. Perhaps I should fix the resource leak so I can actually see the failing tests :) As for your a) and b) above. 
Regarding a): We discussed allowing speculative prefetching and separating the logic for prefetching from actually reading blocks (so you can prefetch blocks you ultimately don't read). We decided this may not belong in a streaming read API. What do you think? Regarding b): We can store per buffer data for anything that actually goes down through the streaming read API, but, in the index only case, we don't want the streaming read API to know about blocks that it doesn't actually need to read. - Melanie
Attachments
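A sketch of the visibility-map shortcut described above (illustrative only, not the attached patch, and again reusing the hypothetical TidQueue helpers from the earlier sketch): while topping up the queue, TIDs whose heap block is all-visible are simply never enqueued, so the streaming read API never learns about blocks the index-only scan does not need to read. The open question from above is visible in the sketch as well - the skipped TIDs still have to be handed back to the scan somehow.

```
#include "postgres.h"
#include "access/genam.h"
#include "access/visibilitymap.h"
#include "storage/itemptr.h"

/* hypothetical helpers from the earlier sketch */
typedef struct TidQueue TidQueue;
extern bool tid_queue_full(TidQueue *q);
extern void tid_queue_push(TidQueue *q, ItemPointer tid);

static void
ios_fill_tid_queue(IndexScanDesc scan, ScanDirection dir,
				   Relation heapRel, TidQueue *q, Buffer *vmbuffer)
{
	while (!tid_queue_full(q))
	{
		ItemPointer tid = index_getnext_tid(scan, dir);

		if (tid == NULL)
			break;				/* no more index entries */

		/*
		 * All-visible heap block: the index tuple alone can answer the
		 * query, so don't ask the streaming read API to read (or prefetch)
		 * the heap page at all.  Where to stash this TID so the caller can
		 * still return it is the unresolved part discussed above.
		 */
		if (VM_ALL_VISIBLE(heapRel, ItemPointerGetBlockNumber(tid), vmbuffer))
			continue;

		tid_queue_push(q, tid);
	}
}
```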
On Wed, Jan 24, 2024 at 11:43 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 1/22/24 07:35, Konstantin Knizhnik wrote: > > > > On 22/01/2024 1:47 am, Tomas Vondra wrote: > >> h, right. Well, you're right in this case we perhaps could set just one > >> of those flags, but the "purpose" of the two places is quite different. > >> > >> The "prefetch" flag is fully controlled by the prefetcher, and it's up > >> to it to change it (e.g. I can easily imagine some new logic touching > >> setting it to "false" for some reason). > >> > >> The "data" flag is fully controlled by the custom callbacks, so whatever > >> the callback stores, will be there. > >> > >> I don't think it's worth simplifying this. In particular, I don't think > >> the callback can assume it can rely on the "prefetch" flag. > >> > > Why not to add "all_visible" flag to IndexPrefetchEntry ? If will not > > cause any extra space overhead (because of alignment), but allows to > > avoid dynamic memory allocation (not sure if it is critical, but nice to > > avoid if possible). > > > While reading through the first patch I got some questions, I haven't read it complete yet but this is what I got so far. 1. +static bool +IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block) +{ + int idx; ... + if (prefetch->blockItems[idx] != (block - i)) + return false; + + /* Don't prefetch if the block happens to be the same. */ + if (prefetch->blockItems[idx] == block) + return false; + } + + /* not sequential, not recently prefetched */ + return true; +} The above function name is BlockIsSequential but at the end, it returns true if it is not sequential, seem like a problem? Also other 2 checks right above the end of the function are returning false if the block is the same or the pattern is sequential I think those are wrong too. 2. I have noticed that the prefetch history is maintained at the backend level, but what if multiple backends are trying to fetch the same heap blocks maybe scanning the same index, so should that be in some shared structure? I haven't thought much deeper about this from the implementation POV, but should we think about it, or it doesn't matter? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 1/25/24 11:45, Dilip Kumar wrote: > On Wed, Jan 24, 2024 at 11:43 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> On 1/22/24 07:35, Konstantin Knizhnik wrote: >>> >>> On 22/01/2024 1:47 am, Tomas Vondra wrote: >>>> h, right. Well, you're right in this case we perhaps could set just one >>>> of those flags, but the "purpose" of the two places is quite different. >>>> >>>> The "prefetch" flag is fully controlled by the prefetcher, and it's up >>>> to it to change it (e.g. I can easily imagine some new logic touching >>>> setting it to "false" for some reason). >>>> >>>> The "data" flag is fully controlled by the custom callbacks, so whatever >>>> the callback stores, will be there. >>>> >>>> I don't think it's worth simplifying this. In particular, I don't think >>>> the callback can assume it can rely on the "prefetch" flag. >>>> >>> Why not to add "all_visible" flag to IndexPrefetchEntry ? If will not >>> cause any extra space overhead (because of alignment), but allows to >>> avoid dynamic memory allocation (not sure if it is critical, but nice to >>> avoid if possible). >>> >> > While reading through the first patch I got some questions, I haven't > read it complete yet but this is what I got so far. > > 1. > +static bool > +IndexPrefetchBlockIsSequential(IndexPrefetch *prefetch, BlockNumber block) > +{ > + int idx; > ... > + if (prefetch->blockItems[idx] != (block - i)) > + return false; > + > + /* Don't prefetch if the block happens to be the same. */ > + if (prefetch->blockItems[idx] == block) > + return false; > + } > + > + /* not sequential, not recently prefetched */ > + return true; > +} > > The above function name is BlockIsSequential but at the end, it > returns true if it is not sequential, seem like a problem? Actually, I think it's the comment that's wrong - the last return is reached only for a sequential pattern (and when the block was not accessed recently). > Also other 2 checks right above the end of the function are returning > false if the block is the same or the pattern is sequential I think > those are wrong too. > Hmmm. You're right this is partially wrong. There are two checks: /* * For a sequential pattern, blocks "k" step ago needs to have block * number by "k" smaller compared to the current block. */ if (prefetch->blockItems[idx] != (block - i)) return false; /* Don't prefetch if the block happens to be the same. */ if (prefetch->blockItems[idx] == block) return false; The first condition is correct - we want to return "false" when the pattern is not sequential. But the second condition is wrong - we want to skip prefetching when the block was already prefetched recently, so this should return true (which is a bit misleading, as it seems to imply the pattern is sequential, when it's not). However, this is harmless, because we then identify this block as recently prefetched in the "full" cache check, so we won't prefetch it anyway. So it's harmless, although a bit more expensive. There's another inefficiency - we stop looking for the same block once we find the first block breaking the non-sequential pattern. Imagine a sequence of blocks 1, 2, 3, 1, 2, 3, ... in which case we never notice the block was recently prefetched, because we always find the break of the sequential pattern. But again, it's harmless, thanks to the full cache of recently prefetched blocks. > 2. 
> I have noticed that the prefetch history is maintained at the backend > level, but what if multiple backends are trying to fetch the same heap > blocks maybe scanning the same index, so should that be in some shared > structure? I haven't thought much deeper about this from the > implementation POV, but should we think about it, or it doesn't > matter? Yes, the cache is at the backend level - it's a known limitation, but I see it more as a conscious tradeoff. Firstly, while the LRU cache is at backend level, PrefetchBuffer also checks shared buffers for each prefetch request. So with sufficiently large shared buffers we're likely to find it there (and for direct I/O there won't be any other place to check). Secondly, the only other place to check is page cache, but there's no good (sufficiently cheap) way to check that. See the preadv2/nowait experiment earlier in this thread. I suppose we could implement a similar LRU cache for shared memory (and I don't think it'd be very complicated), but I did not plan to do that in this patch unless absolutely necessary. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
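To make the two checks above easier to follow, here is a simplified per-backend sketch of the idea (not the patch's actual IndexPrefetchBlockIsSequential code). The return value is spelled as "should we skip this prefetch", which sidesteps the naming confusion, and the loop scans the whole history window instead of bailing out at the first non-sequential entry, so the duplicate check is not short-circuited:

```
#include "postgres.h"
#include "storage/block.h"

#define PREFETCH_HISTORY 8		/* illustrative window size */

/* per-backend history of recently requested prefetch blocks */
typedef struct PrefetchHistory
{
	BlockNumber blocks[PREFETCH_HISTORY];
	uint64		counter;		/* total number of requests recorded */
} PrefetchHistory;

/* returns true if prefetching this block should be skipped */
static bool
prefetch_should_skip(PrefetchHistory *h, BlockNumber block)
{
	int			checked = (int) Min(h->counter, (uint64) PREFETCH_HISTORY);
	bool		sequential = (checked == PREFETCH_HISTORY);
	bool		duplicate = false;

	for (int i = 1; i <= checked; i++)
	{
		int			idx = (int) ((h->counter - i) % PREFETCH_HISTORY);

		/* requested again very recently - prefetching it again is pointless */
		if (h->blocks[idx] == block)
			duplicate = true;

		/* in a sequential run, the block i steps back is exactly i smaller */
		if (h->blocks[idx] != block - i)
			sequential = false;
	}

	/* remember this request, whether or not we end up prefetching it */
	h->blocks[h->counter % PREFETCH_HISTORY] = block;
	h->counter++;

	/* skip duplicates, and leave fully sequential runs to OS readahead */
	return duplicate || sequential;
}
```

This small window is only the cheap first line of defense; as noted above, the patch also keeps a larger cache of recently prefetched blocks, and PrefetchBuffer itself checks shared buffers for each request.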
On Wed, Jan 24, 2024 at 3:20 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Jan 24, 2024 at 4:19 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 1/24/24 01:51, Melanie Plageman wrote: > > >> But I'm not sure what to do about optimizations that are more specific > > >> to the access path. Consider for example the index-only scans. We don't > > >> want to prefetch all the pages, we need to inspect the VM and prefetch > > >> just the not-all-visible ones. And then pass the info to the index scan, > > >> so that it does not need to check the VM again. It's not clear to me how > > >> to do this with this approach. > > > > > > Yea, this is an issue I'll need to think about. To really spell out > > > the problem: the callback dequeues a TID from the tid_queue and looks > > > up its block in the VM. It's all visible. So, it shouldn't return that > > > block to the streaming read API to fetch from the heap because it > > > doesn't need to be read. But, where does the callback put the TID so > > > that the caller can get it? I'm going to think more about this. > > > > > > > Yes, that's the problem for index-only scans. I'd generalize it so that > > it's about the callback being able to (a) decide if it needs to read the > > heap page, and (b) store some custom info for the TID. > > Actually, I think this is no big deal. See attached. I just don't > enqueue tids whose blocks are all visible. I had to switch the order > from fetch heap then fill queue to fill queue then fetch heap. > > While doing this I noticed some wrong results in the regression tests > (like in the alter table test), so I suspect I have some kind of > control flow issue. Perhaps I should fix the resource leak so I can > actually see the failing tests :) Attached is a patch which implements a real queue and fixes some of the issues with the previous version. It doesn't pass tests yet and has issues. Some are bugs in my implementation I need to fix. Some are issues we would need to solve in the streaming read API. Some are issues with index prefetching generally. Note that these two patches have to be applied before 21d9c3ee4e because Thomas hasn't released a rebased version of the streaming read API patches yet. Issues --- - kill prior tuple This optimization doesn't work with index prefetching with the current design. Kill prior tuple relies on alternating between fetching a single index tuple and visiting the heap. After visiting the heap we can potentially kill the immediately preceding index tuple. Once we fetch multiple index tuples, enqueue their TIDs, and later visit the heap, the next index page we visit may not contain all of the index tuples deemed killable by our visit to the heap. In our case, we could try and fix this by prefetching only heap blocks referred to by index tuples on the same index page. Or we could try and keep a pool of index pages pinned and go back and kill index tuples on those pages. Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps there is an easier way to fix this, as I don't think the mvcc test failed on Tomas' version. - switching scan directions If the index scan switches directions on a given invocation of IndexNext(), heap blocks may have already been prefetched and read for blocks containing tuples beyond the point at which we want to switch directions. We could fix this by having some kind of streaming read "reset" callback to drop all of the buffers which have been prefetched which are now no longer needed. 
We'd have to go backwards from the last TID which was yielded to the caller and figure out which buffers in the pgsr buffer ranges are associated with all of the TIDs which were prefetched after that TID. The TIDs are in the per_buffer_data associated with each buffer in pgsr. The issue would be searching through those efficiently. The other issue is that the streaming read API does not currently support backwards scans. So, if we switch to a backwards scan from a forwards scan, we would need to fallback to the non streaming read method. We could do this by just setting the TID queue size to 1 (which is what I have currently implemented). Or we could add backwards scan support to the streaming read API. - mark and restore Similar to the issue with switching the scan direction, mark and restore requires us to reset the TID queue and streaming read queue. For now, I've hacked in something to the PlannerInfo and Plan to set the TID queue size to 1 for plans containing a merge join (yikes). - multiple executions For reasons I don't entirely understand yet, multiple executions (not rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas' patch, I have disabled prefetching (and made the TID queue size 1) when execute_once is false. - Index Only Scans need to return IndexTuples Because index only scans return either the IndexTuple pointed to by IndexScanDesc->xs_itup or the HeapTuple pointed to by IndexScanDesc->xs_hitup -- both of which are populated by the index AM, we have to save copies of those IndexTupleData and HeapTupleDatas for every TID whose block we prefetch. This might be okay, but it is a bit sad to have to make copies of those tuples. In this patch, I still haven't figured out the memory management part. I copy over the tuples when enqueuing a TID queue item and then copy them back again when the streaming read API returns the per_buffer_data to us. Something is still not quite right here. I suspect this is part of the reason why some of the other tests are failing. Other issues/gaps in my implementation: Determining where to allocate the memory for the streaming read object and the TID queue is an outstanding TODO. To implement a fallback method for cases in which streaming read doesn't work, I set the queue size to 1. This is obviously not good. Right now, I allocate the TID queue and streaming read objects in IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in index_beginscan() (and index_beginscan_parallel()) is tricky though because we don't know the scan direction at that point (and the scan direction can change). There are also callers of index_beginscan() who do not call Index[Only]Next() (like systable_getnext() which calls index_getnext_slot() directly). Also, my implementation does not yet have the optimization Tomas does to skip prefetching recently prefetched blocks. As he has said, it probably makes sense to add something to do this in a lower layer -- such as in the streaming read API or even in bufmgr.c (maybe in PrefetchSharedBuffer()). - Melanie
Attachments
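To make the queue shape described above a bit more concrete for readers who have not opened the patch, here is a minimal, self-contained sketch of one of the shapes discussed: a fixed-size ring of TIDs whose enqueue step records the visibility-map lookup, and a block-number callback that skips all-visible blocks. All names and types (TidQueue, tid_queue_push, and so on) are invented for illustration; this is not the patch's actual code.

/* Illustrative sketch only -- simplified stand-in types, not PostgreSQL's. */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNum;

typedef struct TidItem
{
    BlockNum    block;          /* heap block the TID points into */
    uint16_t    offset;         /* line pointer number within that block */
    bool        all_visible;    /* VM bit was set when we enqueued it */
} TidItem;

#define TID_QUEUE_SIZE 64       /* power of two, so the modulo is cheap */
#define NO_BLOCK ((BlockNum) 0xFFFFFFFF)

typedef struct TidQueue
{
    TidItem     items[TID_QUEUE_SIZE];
    uint32_t    head;           /* next slot the index scan fills */
    uint32_t    tail;           /* next slot the executor consumes */
} TidQueue;

/* Enqueue one TID; the caller has already looked up the VM bit. */
static bool
tid_queue_push(TidQueue *q, BlockNum block, uint16_t offset, bool all_visible)
{
    if (q->head - q->tail == TID_QUEUE_SIZE)
        return false;           /* full: fetch some heap pages first */
    q->items[q->head % TID_QUEUE_SIZE] = (TidItem) {block, offset, all_visible};
    q->head++;
    return true;
}

/*
 * Block-number callback in the style discussed above: advance a cursor over
 * the queued TIDs and return the next block that actually needs a heap read,
 * skipping all-visible blocks (the index-only-scan case).
 */
static BlockNum
tid_queue_next_block(TidQueue *q, uint32_t *cursor)
{
    while (*cursor != q->head)
    {
        TidItem    *it = &q->items[*cursor % TID_QUEUE_SIZE];

        (*cursor)++;
        if (!it->all_visible)
            return it->block;
    }
    return NO_BLOCK;            /* nothing left to hand to the prefetcher */
}

In this shape the TID never leaves the queue, so the "where does the callback put the TID" question answers itself; the part the sketch glosses over is where the saved index tuples for the skipped, all-visible entries would live.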
On 2/7/24 22:48, Melanie Plageman wrote: > ... > > Attached is a patch which implements a real queue and fixes some of > the issues with the previous version. It doesn't pass tests yet and > has issues. Some are bugs in my implementation I need to fix. Some are > issues we would need to solve in the streaming read API. Some are > issues with index prefetching generally. > > Note that these two patches have to be applied before 21d9c3ee4e > because Thomas hasn't released a rebased version of the streaming read > API patches yet. > Thanks for working on this, and for investigating the various issues. > Issues > --- > - kill prior tuple > > This optimization doesn't work with index prefetching with the current > design. Kill prior tuple relies on alternating between fetching a > single index tuple and visiting the heap. After visiting the heap we > can potentially kill the immediately preceding index tuple. Once we > fetch multiple index tuples, enqueue their TIDs, and later visit the > heap, the next index page we visit may not contain all of the index > tuples deemed killable by our visit to the heap. > I admit I haven't thought about kill_prior_tuple until you pointed out. Yeah, prefetching separates (de-synchronizes) the two scans (index and heap) in a way that prevents this optimization. Or at least makes it much more complex :-( > In our case, we could try and fix this by prefetching only heap blocks > referred to by index tuples on the same index page. Or we could try > and keep a pool of index pages pinned and go back and kill index > tuples on those pages. > I think restricting the prefetching to a single index page would not be a huge issue performance-wise - that's what the initial patch version (implemented at the index AM level) did, pretty much. The prefetch queue would get drained as we approach the end of the index page, but luckily index pages tend to have a lot of entries. But it'd put an upper bound on the prefetch distance (much lower than the e_i_c maximum 1000, but I'd say common values are 10-100 anyway). But how would we know we're on the same index page? That knowledge is not available outside the index AM - the executor or indexam.c does not know this, right? Presumably we could expose this, somehow, but it seems like a violation of the abstraction ... The same thing affects keeping multiple index pages pinned, for TIDs that are yet to be used by the index scan. We'd need to know when to release a pinned page, once we're done with processing all items. FWIW I haven't tried to implementing any of this, so maybe I'm missing something and it can be made to work in a nice way. > Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps > there is an easier way to fix this, as I don't think the mvcc test > failed on Tomas' version. > I kinda doubt it worked correctly, considering I simply ignored the optimization. It's far more likely it just worked by luck. > - switching scan directions > > If the index scan switches directions on a given invocation of > IndexNext(), heap blocks may have already been prefetched and read for > blocks containing tuples beyond the point at which we want to switch > directions. > > We could fix this by having some kind of streaming read "reset" > callback to drop all of the buffers which have been prefetched which > are now no longer needed. 
We'd have to go backwards from the last TID > which was yielded to the caller and figure out which buffers in the > pgsr buffer ranges are associated with all of the TIDs which were > prefetched after that TID. The TIDs are in the per_buffer_data > associated with each buffer in pgsr. The issue would be searching > through those efficiently. > Yeah, that's roughly what I envisioned in one of my previous messages about this issue - walking back the TIDs read from the index and added to the prefetch queue. > The other issue is that the streaming read API does not currently > support backwards scans. So, if we switch to a backwards scan from a > forwards scan, we would need to fallback to the non streaming read > method. We could do this by just setting the TID queue size to 1 > (which is what I have currently implemented). Or we could add > backwards scan support to the streaming read API. > What do you mean by "support for backwards scans" in the streaming read API? I imagined it naively as 1) drop all requests in the streaming read API queue 2) walk back all "future" requests in the TID queue 3) start prefetching as if from scratch Maybe there's a way to optimize this and reuse some of the work more efficiently, but my assumption is that the scan direction does not change very often, and that we process many items in between. > - mark and restore > > Similar to the issue with switching the scan direction, mark and > restore requires us to reset the TID queue and streaming read queue. > For now, I've hacked in something to the PlannerInfo and Plan to set > the TID queue size to 1 for plans containing a merge join (yikes). > Haven't thought about this very much, will take a closer look. > - multiple executions > > For reasons I don't entirely understand yet, multiple executions (not > rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas' > patch, I have disabled prefetching (and made the TID queue size 1) > when execute_once is false. > Don't work in what sense? What is (not) happening? > - Index Only Scans need to return IndexTuples > > Because index only scans return either the IndexTuple pointed to by > IndexScanDesc->xs_itup or the HeapTuple pointed to by > IndexScanDesc->xs_hitup -- both of which are populated by the index > AM, we have to save copies of those IndexTupleData and HeapTupleDatas > for every TID whose block we prefetch. > > This might be okay, but it is a bit sad to have to make copies of those tuples. > > In this patch, I still haven't figured out the memory management part. > I copy over the tuples when enqueuing a TID queue item and then copy > them back again when the streaming read API returns the > per_buffer_data to us. Something is still not quite right here. I > suspect this is part of the reason why some of the other tests are > failing. > It's not clear to me what you need to copy the tuples back - shouldn't it be enough to copy the tuple just once? FWIW if we decide to pin multiple index pages (to make kill_prior_tuple work), that would also mean we don't need to copy any tuples, right? We could point into the buffers for all of them, right? > Other issues/gaps in my implementation: > > Determining where to allocate the memory for the streaming read object > and the TID queue is an outstanding TODO. To implement a fallback > method for cases in which streaming read doesn't work, I set the queue > size to 1. This is obviously not good. 
> I think IndexFetchTableData seems like a not entirely terrible place for allocating the pgsr, but I wonder what Andres thinks about this. IIRC he advocated for doing the prefetching in executor, and I'm not sure heapam_handled.c + relscan.h is what he imagined ... Also, when you say "obviously not good" - why? Are you concerned about the extra overhead of shuffling stuff between queues, or something else? > Right now, I allocate the TID queue and streaming read objects in > IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in > index_beginscan() (and index_beginscan_parallel()) is tricky though > because we don't know the scan direction at that point (and the scan > direction can change). There are also callers of index_beginscan() who > do not call Index[Only]Next() (like systable_getnext() which calls > index_getnext_slot() directly). > Yeah, not sure this is the right layering ... the initial patch did everything in individual index AMs, then it moved to indexam.c, then to executor. And this seems to move it to lower layers again ... > Also, my implementation does not yet have the optimization Tomas does > to skip prefetching recently prefetched blocks. As he has said, it > probably makes sense to add something to do this in a lower layer -- > such as in the streaming read API or even in bufmgr.c (maybe in > PrefetchSharedBuffer()). > I agree this should happen in lower layers. I'd probably do this in the streaming read API, because that would define "scope" of the cache (pages prefetched for that read). Doing it in PrefetchSharedBuffer seems like it would do a single cache (for that particular backend). But that's just an initial thought ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
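On the last point above -- skipping blocks that were prefetched a moment ago -- the data structure itself can be tiny regardless of which layer ends up owning it. A sketch, with invented names and an arbitrarily chosen size (not what the patch actually does):

/* Sketch of a small "recently prefetched" filter; names and size are made up. */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNum;

#define RECENT_SLOTS 16         /* only needs to catch short runs of TIDs
                                 * that point into the same heap pages */

typedef struct RecentBlocks
{
    BlockNum    slots[RECENT_SLOTS];
    int         next;           /* round-robin eviction pointer */
    int         used;
} RecentBlocks;

/*
 * Returns true if the block was issued recently (caller skips the prefetch),
 * false otherwise (and the block is remembered for next time).
 */
static bool
recently_prefetched(RecentBlocks *rb, BlockNum block)
{
    for (int i = 0; i < rb->used; i++)
    {
        if (rb->slots[i] == block)
            return true;
    }

    rb->slots[rb->next] = block;
    rb->next = (rb->next + 1) % RECENT_SLOTS;
    if (rb->used < RECENT_SLOTS)
        rb->used++;
    return false;
}

Where this lives decides its scope, as noted above: inside the streaming read object it forgets everything when that read ends; in bufmgr.c it would effectively be one per-backend cache shared by all scans.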
On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 2/7/24 22:48, Melanie Plageman wrote: > I admit I haven't thought about kill_prior_tuple until you pointed out. > Yeah, prefetching separates (de-synchronizes) the two scans (index and > heap) in a way that prevents this optimization. Or at least makes it > much more complex :-( Another thing that argues against doing this is that we might not need to visit any more B-Tree leaf pages when there is a LIMIT n involved. We could end up scanning a whole extra leaf page (including all of its tuples) for want of the ability to "push down" a LIMIT to the index AM (that's not what happens right now, but it isn't really needed at all right now). This property of index scans is fundamental to how index scans work. Pinning an index page as an interlock against concurrently TID recycling by VACUUM is directly described by the index API docs [1], even (the docs actually use terms like "buffer pin" rather than something more abstract sounding). I don't think that anything affecting that behavior should be considered an implementation detail of the nbtree index AM as such (nor any particular index AM). I think that it makes sense to put the index AM in control here -- that almost follows from what I said about the index AM API. The index AM already needs to be in control, in about the same way, to deal with kill_prior_tuple (plus it helps with the LIMIT issue I described). There doesn't necessarily need to be much code duplication to make that work. Offhand I suspect it would be kind of similar to how deletion of LP_DEAD-marked index tuples by non-nbtree index AMs gets by with generic logic implemented by index_compute_xid_horizon_for_tuples -- that's all that we need to determine a snapshotConflictHorizon value for recovery conflict purposes. Note that index_compute_xid_horizon_for_tuples() reads *index* pages, despite not being aware of the caller's index AM and index tuple format. (The only reason why nbtree needs a custom solution is because it has posting list tuples to worry about, unlike GiST and unlike Hash, which consistently use unadorned generic IndexTuple structs with heap TID represented in the standard/generic way only. While these concepts probably all originated in nbtree, they're still not nbtree implementation details.) > > Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps > > there is an easier way to fix this, as I don't think the mvcc test > > failed on Tomas' version. > > > > I kinda doubt it worked correctly, considering I simply ignored the > optimization. It's far more likely it just worked by luck. The test that did fail will have only revealed that the kill_prior_tuple wasn't operating as expected -- which isn't the same thing as giving wrong answers. Note that there are various ways that concurrent TID recycling might prevent _bt_killitems() from setting LP_DEAD bits. It's totally unsurprising that breaking kill_prior_tuple in some way could be missed. Andres wrote the MVCC test in question precisely because certain aspects of kill_prior_tuple were broken for months without anybody noticing. [1] https://www.postgresql.org/docs/devel/index-locking.html -- Peter Geoghegan
On Thu, Feb 8, 2024 at 3:18 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > - kill prior tuple > > This optimization doesn't work with index prefetching with the current > design. Kill prior tuple relies on alternating between fetching a > single index tuple and visiting the heap. After visiting the heap we > can potentially kill the immediately preceding index tuple. Once we > fetch multiple index tuples, enqueue their TIDs, and later visit the > heap, the next index page we visit may not contain all of the index > tuples deemed killable by our visit to the heap. Is this maybe just a bookkeeping problem? A Boolean that says "you can kill the prior tuple" is well-suited if and only if the prior tuple is well-defined. But perhaps it could be replaced with something more sophisticated that tells you which tuples are eligible to be killed. -- Robert Haas EDB: http://www.enterprisedb.com
On 2/13/24 20:54, Peter Geoghegan wrote: > On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> On 2/7/24 22:48, Melanie Plageman wrote: >> I admit I haven't thought about kill_prior_tuple until you pointed out. >> Yeah, prefetching separates (de-synchronizes) the two scans (index and >> heap) in a way that prevents this optimization. Or at least makes it >> much more complex :-( > > Another thing that argues against doing this is that we might not need > to visit any more B-Tree leaf pages when there is a LIMIT n involved. > We could end up scanning a whole extra leaf page (including all of its > tuples) for want of the ability to "push down" a LIMIT to the index AM > (that's not what happens right now, but it isn't really needed at all > right now). > I'm not quite sure I understand what is "this" that you argue against. Are you saying we should not separate the two scans? If yes, is there a better way to do this? The LIMIT problem is not very clear to me either. Yes, if we get close to the end of the leaf page, we may need to visit the next leaf page. But that's kinda the whole point of prefetching - reading stuff ahead, and reading too far ahead is an inherent risk. Isn't that a problem we have even without LIMIT? The prefetch distance ramp up is meant to limit the impact. > This property of index scans is fundamental to how index scans work. > Pinning an index page as an interlock against concurrently TID > recycling by VACUUM is directly described by the index API docs [1], > even (the docs actually use terms like "buffer pin" rather than > something more abstract sounding). I don't think that anything > affecting that behavior should be considered an implementation detail > of the nbtree index AM as such (nor any particular index AM). > Good point. > I think that it makes sense to put the index AM in control here -- > that almost follows from what I said about the index AM API. The index > AM already needs to be in control, in about the same way, to deal with > kill_prior_tuple (plus it helps with the LIMIT issue I described). > In control how? What would be the control flow - what part would be managed by the index AM? I initially did the prefetching entirely in each index AM, but it was suggested doing this in the executor would be better. So I gradually moved it to executor. But the idea to combine this with the streaming read API seems as a move from executor back to the lower levels ... and now you're suggesting to make the index AM responsible for this again. I'm not saying any of those layering options is wrong, but it's not clear to me which is the right one. > There doesn't necessarily need to be much code duplication to make > that work. Offhand I suspect it would be kind of similar to how > deletion of LP_DEAD-marked index tuples by non-nbtree index AMs gets > by with generic logic implemented by > index_compute_xid_horizon_for_tuples -- that's all that we need to > determine a snapshotConflictHorizon value for recovery conflict > purposes. Note that index_compute_xid_horizon_for_tuples() reads > *index* pages, despite not being aware of the caller's index AM and > index tuple format. > > (The only reason why nbtree needs a custom solution is because it has > posting list tuples to worry about, unlike GiST and unlike Hash, which > consistently use unadorned generic IndexTuple structs with heap TID > represented in the standard/generic way only. 
While these concepts > probably all originated in nbtree, they're still not nbtree > implementation details.) > I haven't looked at the details, but I agree the LP_DEAD deletion seems like a sensible inspiration. >>> Having disabled kill_prior_tuple is why the mvcc test fails. Perhaps >>> there is an easier way to fix this, as I don't think the mvcc test >>> failed on Tomas' version. >>> >> >> I kinda doubt it worked correctly, considering I simply ignored the >> optimization. It's far more likely it just worked by luck. > > The test that did fail will have only revealed that the > kill_prior_tuple wasn't operating as expected -- which isn't the same > thing as giving wrong answers. > Possible. But AFAIK it did fail for Melanie, and I don't have a very good explanation for the difference in behavior. > Note that there are various ways that concurrent TID recycling might > prevent _bt_killitems() from setting LP_DEAD bits. It's totally > unsurprising that breaking kill_prior_tuple in some way could be > missed. Andres wrote the MVCC test in question precisely because > certain aspects of kill_prior_tuple were broken for months without > anybody noticing. > > [1] https://www.postgresql.org/docs/devel/index-locking.html Yeah. There's clearly plenty of space for subtle issues. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2/14/24 08:10, Robert Haas wrote: > On Thu, Feb 8, 2024 at 3:18 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: >> - kill prior tuple >> >> This optimization doesn't work with index prefetching with the current >> design. Kill prior tuple relies on alternating between fetching a >> single index tuple and visiting the heap. After visiting the heap we >> can potentially kill the immediately preceding index tuple. Once we >> fetch multiple index tuples, enqueue their TIDs, and later visit the >> heap, the next index page we visit may not contain all of the index >> tuples deemed killable by our visit to the heap. > > Is this maybe just a bookkeeping problem? A Boolean that says "you can > kill the prior tuple" is well-suited if and only if the prior tuple is > well-defined. But perhaps it could be replaced with something more > sophisticated that tells you which tuples are eligible to be killed. > I don't think it's just a bookkeeping problem. In a way, nbtree already does keep an array of tuples to kill (see btgettuple), but it's always for the current index page. So it's not that we immediately go and kill the prior tuple - nbtree already stashes it in an array, and kills all those tuples when moving to the next index page. The way I understand the problem is that with prefetching we're bound to determine the kill_prior_tuple flag with a delay, in which case we might have already moved to the next index page ... So to make this work, we'd need to: 1) keep index pages pinned for all "in flight" TIDs (read from the index, not yet consumed by the index scan) 2) keep a separate array of "to be killed" index tuples for each page 3) have a more sophisticated way to decide when to kill tuples and unpin the index page (instead of just doing it when moving to the next index page) Maybe that's what you meant by "more sophisticated bookkeeping", ofc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
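As a rough illustration of what the three points above could translate into, here is a sketch of the bookkeeping, with entirely invented names, fixed-size arrays for brevity, and no real buffer pins -- an actual implementation would hold pins and size these structures dynamically.

/* Illustrative bookkeeping only; limits, names and types are invented. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_PINNED_LEAVES   8    /* cap on index leaf pages kept pinned */
#define MAX_ITEMS_PER_LEAF  256  /* enough for a typical btree leaf page */

typedef struct PinnedLeaf
{
    uint32_t    leaf_block;      /* which index leaf page this tracks */
    bool        pinned;          /* stand-in for "we still hold a pin" */
    int         in_flight;       /* TIDs from this leaf read from the index
                                  * but not yet processed against the heap */
    int         nkilled;
    uint16_t    killed_offsets[MAX_ITEMS_PER_LEAF]; /* to mark LP_DEAD later */
} PinnedLeaf;

typedef struct LeafKillState
{
    PinnedLeaf  leaves[MAX_PINNED_LEAVES];
    int         nleaves;
} LeafKillState;

/* Point 2: remember a killable item when the heap visit finds it dead. */
static void
remember_killable(LeafKillState *ks, uint32_t leaf_block, uint16_t offset)
{
    for (int i = 0; i < ks->nleaves; i++)
    {
        PinnedLeaf *pl = &ks->leaves[i];

        if (pl->leaf_block == leaf_block && pl->pinned &&
            pl->nkilled < MAX_ITEMS_PER_LEAF)
        {
            pl->killed_offsets[pl->nkilled++] = offset;
            return;
        }
    }
    /* leaf already unpinned: this kill opportunity is simply lost */
}

/*
 * Point 3: called after each heap visit for a TID from leaf_block; once no
 * TIDs from that leaf remain in flight, the caller can go back, mark the
 * remembered offsets LP_DEAD, and release the pin.
 */
static bool
leaf_heap_visit_done(LeafKillState *ks, uint32_t leaf_block)
{
    for (int i = 0; i < ks->nleaves; i++)
    {
        PinnedLeaf *pl = &ks->leaves[i];

        if (pl->leaf_block == leaf_block)
        {
            if (pl->in_flight > 0)
                pl->in_flight--;
            return pl->in_flight == 0 && pl->pinned;
        }
    }
    return false;
}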
On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 2/7/24 22:48, Melanie Plageman wrote: > > ... Issues > > --- > > - kill prior tuple > > > > This optimization doesn't work with index prefetching with the current > > design. Kill prior tuple relies on alternating between fetching a > > single index tuple and visiting the heap. After visiting the heap we > > can potentially kill the immediately preceding index tuple. Once we > > fetch multiple index tuples, enqueue their TIDs, and later visit the > > heap, the next index page we visit may not contain all of the index > > tuples deemed killable by our visit to the heap. > > > > I admit I haven't thought about kill_prior_tuple until you pointed out. > Yeah, prefetching separates (de-synchronizes) the two scans (index and > heap) in a way that prevents this optimization. Or at least makes it > much more complex :-( > > > In our case, we could try and fix this by prefetching only heap blocks > > referred to by index tuples on the same index page. Or we could try > > and keep a pool of index pages pinned and go back and kill index > > tuples on those pages. > > > > I think restricting the prefetching to a single index page would not be > a huge issue performance-wise - that's what the initial patch version > (implemented at the index AM level) did, pretty much. The prefetch queue > would get drained as we approach the end of the index page, but luckily > index pages tend to have a lot of entries. But it'd put an upper bound > on the prefetch distance (much lower than the e_i_c maximum 1000, but > I'd say common values are 10-100 anyway). > > But how would we know we're on the same index page? That knowledge is > not available outside the index AM - the executor or indexam.c does not > know this, right? Presumably we could expose this, somehow, but it seems > like a violation of the abstraction ... The easiest way to do this would be to have the index AM amgettuple() functions set a new member in the IndexScanDescData which is either the index page identifier or a boolean that indicates we have moved on to the next page. Then, when filling the queue, we would stop doing so when the page switches. Now, this wouldn't really work for the first index tuple on each new page, so, perhaps we would need the index AMs to implement some kind of "peek" functionality. Or, we could provide the index AM with a max queue size and allow it to fill up the queue with the TIDs it wants (which it could keep to the same index page). And, for the index-only scan case, could have some kind of flag which indicates if the caller is putting TIDs+HeapTuples or TIDS+IndexTuples on the queue, which might reduce the amount of space we need. I'm not sure who manages the memory here. I wasn't quite sure how we could use index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's suggestion. But, I'd like to understand. > > - switching scan directions > > > > If the index scan switches directions on a given invocation of > > IndexNext(), heap blocks may have already been prefetched and read for > > blocks containing tuples beyond the point at which we want to switch > > directions. > > > > We could fix this by having some kind of streaming read "reset" > > callback to drop all of the buffers which have been prefetched which > > are now no longer needed. 
We'd have to go backwards from the last TID > > which was yielded to the caller and figure out which buffers in the > > pgsr buffer ranges are associated with all of the TIDs which were > > prefetched after that TID. The TIDs are in the per_buffer_data > > associated with each buffer in pgsr. The issue would be searching > > through those efficiently. > > > > Yeah, that's roughly what I envisioned in one of my previous messages > about this issue - walking back the TIDs read from the index and added > to the prefetch queue. > > > The other issue is that the streaming read API does not currently > > support backwards scans. So, if we switch to a backwards scan from a > > forwards scan, we would need to fallback to the non streaming read > > method. We could do this by just setting the TID queue size to 1 > > (which is what I have currently implemented). Or we could add > > backwards scan support to the streaming read API. > > > > What do you mean by "support for backwards scans" in the streaming read > API? I imagined it naively as > > 1) drop all requests in the streaming read API queue > > 2) walk back all "future" requests in the TID queue > > 3) start prefetching as if from scratch > > Maybe there's a way to optimize this and reuse some of the work more > efficiently, but my assumption is that the scan direction does not > change very often, and that we process many items in between. Yes, the steps you mention for resetting the queues make sense. What I meant by "backwards scan is not supported by the streaming read API" is that Thomas/Andres had mentioned that the streaming read API does not support backwards scans right now. Though, since the callback just returns a block number, I don't know how it would break. When switching between a forwards and backwards scan, does it go backwards from the current position or start at the end (or beginning) of the relation? If it is the former, then the blocks would most likely be in shared buffers -- which the streaming read API handles. It is not obvious to me from looking at the code what the gap is, so perhaps Thomas could weigh in. As for handling this in index prefetching, if you think a TID queue size of 1 is a sufficient fallback method, then resetting the pgsr queue and resizing the TID queue to 1 would work with no issues. If the fallback method requires the streaming read code path not be used at all, then that is more work. > > - multiple executions > > > > For reasons I don't entirely understand yet, multiple executions (not > > rescan -- see ExecutorRun(...execute_once)) do not work. As in Tomas' > > patch, I have disabled prefetching (and made the TID queue size 1) > > when execute_once is false. > > > > Don't work in what sense? What is (not) happening? I got wrong results for this. I'll have to do more investigation, but I assumed that not resetting the TID queue and pgsr queue was also the source of this issue. What I imagined we would do is figure out if there is a viable solution for the larger design issues and then investigate what seemed like smaller issues. But, perhaps I should dig into this first to ensure there isn't a larger issue. > > - Index Only Scans need to return IndexTuples > > > > Because index only scans return either the IndexTuple pointed to by > > IndexScanDesc->xs_itup or the HeapTuple pointed to by > > IndexScanDesc->xs_hitup -- both of which are populated by the index > > AM, we have to save copies of those IndexTupleData and HeapTupleDatas > > for every TID whose block we prefetch. 
> > > > This might be okay, but it is a bit sad to have to make copies of those tuples. > > > > In this patch, I still haven't figured out the memory management part. > > I copy over the tuples when enqueuing a TID queue item and then copy > > them back again when the streaming read API returns the > > per_buffer_data to us. Something is still not quite right here. I > > suspect this is part of the reason why some of the other tests are > > failing. > > > > It's not clear to me what you need to copy the tuples back - shouldn't > it be enough to copy the tuple just once? When enqueueing it, IndexTuple has to be copied from the scan descriptor to somewhere in memory with a TIDQueueItem pointing to it. Once we do this, the IndexTuple memory should stick around until we free it, so yes, I'm not sure why I was seeing the IndexTuple no longer be valid when I tried to put it in a slot. I'll have to do more investigation. > FWIW if we decide to pin multiple index pages (to make kill_prior_tuple > work), that would also mean we don't need to copy any tuples, right? We > could point into the buffers for all of them, right? Yes, this would be a nice benefit. > > Other issues/gaps in my implementation: > > > > Determining where to allocate the memory for the streaming read object > > and the TID queue is an outstanding TODO. To implement a fallback > > method for cases in which streaming read doesn't work, I set the queue > > size to 1. This is obviously not good. > > > > I think IndexFetchTableData seems like a not entirely terrible place for > allocating the pgsr, but I wonder what Andres thinks about this. IIRC he > advocated for doing the prefetching in executor, and I'm not sure > heapam_handled.c + relscan.h is what he imagined ... > > Also, when you say "obviously not good" - why? Are you concerned about > the extra overhead of shuffling stuff between queues, or something else? Well, I didn't resize the queue, I just limited how much of it we can use to a single member (thus wasting the other memory). But resizing a queue isn't free either. Also, I wondered if a queue size of 1 for index AMs using the fallback method is too confusing (like it is a fake queue?). But, I'd really, really rather not maintain both a queue and non-queue control flow for Index[Only]Next(). The maintenance overhead seems like it would outweigh the potential downsides. > > Right now, I allocate the TID queue and streaming read objects in > > IndexNext() and IndexOnlyNext(). This doesn't seem ideal. Doing it in > > index_beginscan() (and index_beginscan_parallel()) is tricky though > > because we don't know the scan direction at that point (and the scan > > direction can change). There are also callers of index_beginscan() who > > do not call Index[Only]Next() (like systable_getnext() which calls > > index_getnext_slot() directly). > > > > Yeah, not sure this is the right layering ... the initial patch did > everything in individual index AMs, then it moved to indexam.c, then to > executor. And this seems to move it to lower layers again ... If we do something like make the index AM responsible for the TID queue (as mentioned above as a potential solution to the kill prior tuple issue), then we might be able to allocate the TID queue in the index AMs? 
As for the streaming read object, if we were able to solve the issue where callers of index_beginscan() don't call Index[Only]Next() (and thus shouldn't allocate a streaming read object), then it seems easy enough to move the streaming read object allocation into the table AM-specific begin scan method. > > Also, my implementation does not yet have the optimization Tomas does > > to skip prefetching recently prefetched blocks. As he has said, it > > probably makes sense to add something to do this in a lower layer -- > > such as in the streaming read API or even in bufmgr.c (maybe in > > PrefetchSharedBuffer()). > > > > I agree this should happen in lower layers. I'd probably do this in the > streaming read API, because that would define "scope" of the cache > (pages prefetched for that read). Doing it in PrefetchSharedBuffer seems > like it would do a single cache (for that particular backend). Hmm. I wonder if there are any upsides to having the cache be per-backend. Though, that does sound like a whole other project... - Melanie
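To show how small the "index AM fills a caller-provided queue" contract floated in this message could be, here is one hypothetical shape for it: the fill function appends TIDs from its current leaf page until the queue is full or the leaf runs out, and reports whether it stopped at a page boundary. Every name and field here is made up for illustration; this is not a proposal for the actual IndexScanDesc layout.

/* Hypothetical AM-fills-the-queue contract; not an actual PostgreSQL API. */
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef struct QueuedTid
{
    uint32_t    heap_block;
    uint16_t    heap_offset;
    void       *index_tuple;     /* only used for index-only scans */
} QueuedTid;

typedef struct TidFillRequest
{
    QueuedTid  *slots;           /* caller-provided array */
    int         max_items;       /* caller-chosen queue capacity */
    bool        want_index_tuples; /* index-only scan wants tuples too */
} TidFillRequest;

typedef struct FillResult
{
    int         nfilled;
    bool        hit_page_boundary; /* AM stopped at the end of its leaf page */
} FillResult;

/* Toy stand-in for "the AM's current leaf page", to keep this runnable. */
typedef struct ToyLeaf
{
    int         nitems;
    int         next;            /* scan position within the leaf */
    QueuedTid   items[64];
} ToyLeaf;

static FillResult
toy_am_fill_queue(ToyLeaf *leaf, TidFillRequest *req)
{
    FillResult  res = {0, false};

    while (res.nfilled < req->max_items && leaf->next < leaf->nitems)
    {
        req->slots[res.nfilled] = leaf->items[leaf->next];
        if (!req->want_index_tuples)
            req->slots[res.nfilled].index_tuple = NULL;
        res.nfilled++;
        leaf->next++;
    }
    res.hit_page_boundary = (leaf->next == leaf->nitems);
    return res;
}

Because the AM never fills past its current leaf, the caller knows each batch maps onto a single index page, which is what would keep the kill_prior_tuple window manageable.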
On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Another thing that argues against doing this is that we might not need > > to visit any more B-Tree leaf pages when there is a LIMIT n involved. > > We could end up scanning a whole extra leaf page (including all of its > > tuples) for want of the ability to "push down" a LIMIT to the index AM > > (that's not what happens right now, but it isn't really needed at all > > right now). > > > > I'm not quite sure I understand what is "this" that you argue against. > Are you saying we should not separate the two scans? If yes, is there a > better way to do this? What I'm concerned about is the difficulty and complexity of any design that requires revising "63.4. Index Locking Considerations", since that's pretty subtle stuff. In particular, if prefetching "de-synchronizes" (to use your term) the index leaf page level scan and the heap page scan, then we'll probably have to totally revise the basic API. Maybe that'll actually turn out to be the right thing to do -- it could just be the only thing that can unleash the full potential of prefetching. But I'm not aware of any evidence that points in that direction. Are you? (I might have just missed it.) > The LIMIT problem is not very clear to me either. Yes, if we get close > to the end of the leaf page, we may need to visit the next leaf page. > But that's kinda the whole point of prefetching - reading stuff ahead, > and reading too far ahead is an inherent risk. Isn't that a problem we > have even without LIMIT? The prefetch distance ramp up is meant to limit > the impact. Right now, the index AM doesn't know anything about LIMIT at all. That doesn't matter, since the index AM can only read/scan one full leaf page before returning control back to the executor proper. The executor proper can just shut down the whole index scan upon finding that we've already returned N tuples for a LIMIT N. We don't do prefetching right now, but we also don't risk reading a leaf page that'll just never be needed. Those two things are in tension, but I don't think that that's quite the same thing as the usual standard prefetching tension/problem. Here there is uncertainty about whether what we're prefetching will *ever* be required -- not uncertainty about when exactly it'll be required. (Perhaps this distinction doesn't mean much to you. I'm just telling you how I think about it, in case it helps move the discussion forward.) > > This property of index scans is fundamental to how index scans work. > > Pinning an index page as an interlock against concurrently TID > > recycling by VACUUM is directly described by the index API docs [1], > > even (the docs actually use terms like "buffer pin" rather than > > something more abstract sounding). I don't think that anything > > affecting that behavior should be considered an implementation detail > > of the nbtree index AM as such (nor any particular index AM). > > > > Good point. The main reason why the index AM docs require this interlock is because we need such an interlock to make non-MVCC snapshot scans safe. If you remove the interlock (the buffer pin interlock that protects against TID recycling by VACUUM), you can still avoid the same race condition by using an MVCC snapshot. This is why using an MVCC snapshot is a requirement for bitmap index scans. I believe that it's also a requirement for index-only scans, but the index AM docs don't spell that out. Another factor that complicates things here is mark/restore processing. 
The design for that has the idea of processing one page at a time baked-in. Kinda like with the kill_prior_tuple issue. It's certainly possible that you could figure out various workarounds for each of these issues (plus the kill_prior_tuple issue) with a prefetching design that "de-synchronizes" the index access and the heap access. But it might well be better to extend the existing design in a way that just avoids all these problems in the first place. Maybe "de-synchronization" really can pay for itself (because the benefits will outweigh these costs), but if you go that way then I'd really prefer it that way. > > I think that it makes sense to put the index AM in control here -- > > that almost follows from what I said about the index AM API. The index > > AM already needs to be in control, in about the same way, to deal with > > kill_prior_tuple (plus it helps with the LIMIT issue I described). > > > > In control how? What would be the control flow - what part would be > managed by the index AM? ISTM that prefetching for an index scan is about the index scan itself, first and foremost. The heap accesses are usually the dominant cost, of course, but sometimes the index leaf page accesses really do make up a significant fraction of the overall cost of the index scan. Especially with an expensive index qual. So if you just assume that the TIDs returned by the index scan are the only thing that matters, you might have a model that's basically correct on average, but is occasionally very wrong. That's one reason for "putting the index AM in control". As I said back in June, we should probably be marrying information from the index scan with information from the heap. This is something that is arguably a modularity violation. But it might just be that you really do need to take information from both places to consistently make the right trade-off. Perhaps the best arguments for "putting the index AM in control" only work when you go to fix the problems that "naive de-synchronization" creates. Thinking about that side of things some more might make "putting the index AM in control" seem more natural. Suppose, for example, you try to make a prefetching design based on "de-synchronization" work with kill_prior_tuple -- suppose you try to fix that problem. You're likely going to need to make some kind of trade-off that gets you most of the advantages that that approach offers (assuming that there really are significant advantages), while still retaining most of the advantages that we already get from kill_prior_tuple (basically we want to LP_DEAD-mark index tuples with almost or exactly the same consistency as we manage today). Maybe your approach involves tracking multiple LSNs for each prefetch-pending leaf page, or perhaps you hold on to a pin on some number of leaf pages instead (right now nbtree does both [1], which I go into more below). Either way, you're pushing stuff down into the index AM. Note that we already hang onto more than one pin at a time in rare cases involving mark/restore processing. For example, it can happen for a merge join that happens to involve an unlogged index, if the markpos and curpos are a certain way relative to the current leaf page (yeah, really). So putting stuff like that under the control of the index AM (while also applying basic information that comes from the heap) in order to fix the kill_prior_tuple issue is arguably something that has a kind of a precedent for us to follow. 
Even if you disagree with me here ("precedent" might be overstating it), perhaps you still get some general sense of why I have an inkling that putting prefetching in the index AM is the way to go. It's very hard to provide one really strong justification for all this, and I'm certainly not expecting you to just agree with me right away. I'm also not trying to impose any conditions on committing this patch. Thinking about this some more, "making kill_prior_tuple work with de-synchronization" is a bit of a misleading way of putting it. The way that you'd actually work around this is (at a very high level) *dynamically* making some kind of *trade-off* between synchronization and desynchronization. Up until now, we've been talking in terms of a strict dichotomy between the old index AM API design (index-page-at-a-time synchronization), and a "de-synchronizing" prefetching design that embraces the opposite extreme -- a design where we only think in terms of heap TIDs, and completely ignore anything that happens in the index structure (and consequently makes kill_prior_tuple ineffective). That now seems like a false dichotomy. > I initially did the prefetching entirely in each index AM, but it was > suggested doing this in the executor would be better. So I gradually > moved it to executor. But the idea to combine this with the streaming > read API seems as a move from executor back to the lower levels ... and > now you're suggesting to make the index AM responsible for this again. I did predict that there'd be lots of difficulties around the layering back in June. :-) > I'm not saying any of those layering options is wrong, but it's not > clear to me which is the right one. I don't claim to know what the right trade-off is myself. The fact that all of these things are in tension doesn't surprise me. It's just a hard problem. > Possible. But AFAIK it did fail for Melanie, and I don't have a very > good explanation for the difference in behavior. If you take a look at _bt_killitems(), you'll see that it actually has two fairly different strategies for avoiding TID recycling race condition issues, applied in each of two different cases: 1. Cases where we really have held onto a buffer pin, per the index AM API -- the "inde AM orthodox" approach. (The aforementioned issue with unlogged indexes exists because with an unlogged index we must use approach 1, per the nbtree README section [1]). 2. Cases where we drop the pin as an optimization (also per [1]), and now have to detect the possibility of concurrent modifications by VACUUM (that could have led to concurrent TID recycling). We conservatively do nothing (don't mark any index tuples LP_DEAD), unless the LSN is exactly the same as it was back when the page was scanned/read by _bt_readpage(). So some accidental detail with LSNs (like using or not using an unlogged index) could cause bugs in this area to "accidentally fail to fail". Since the nbtree index AM has its own optimizations here, which probably has a tendency to mask problems/bugs. (I sometimes use unlogged indexes for some of my nbtree related test cases, just to reduce certain kinds of variability, including variability in this area.) [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=52e646c7f759a5d9cfdc32b86f6aff8460891e12;hb=3e8235ba4f9cc3375b061fb5d3f3575434539b5f#l443 -- Peter Geoghegan
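For readers who have not read _bt_killitems(), the second strategy reduces to a "canary" comparison. A stripped-down sketch of that pattern, with toy types rather than nbtree's real ones:

/* Toy illustration of the LSN-canary pattern; not nbtree's actual code. */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;            /* stand-in for XLogRecPtr */

typedef struct ToyLeafPage
{
    Lsn         lsn;             /* LSN of the last WAL record touching it */
    bool        lp_dead[256];    /* line pointer "dead" flags */
} ToyLeafPage;

typedef struct ScanBatch
{
    Lsn         lsn_when_read;   /* page LSN remembered when it was scanned */
    int         nkilled;
    uint16_t    killed[256];     /* offsets the heap visits found dead */
} ScanBatch;

/*
 * Strategy 2 above: the pin was dropped after reading the page, so only
 * apply the kills if nothing has modified the page since -- detected by the
 * LSN being exactly what it was at read time.  Any concurrent change
 * (VACUUM or otherwise) makes us conservatively do nothing.
 */
static void
apply_kills_if_unchanged(ToyLeafPage *page, const ScanBatch *batch)
{
    if (page->lsn != batch->lsn_when_read)
        return;                  /* page changed; TIDs may have been recycled */

    for (int i = 0; i < batch->nkilled; i++)
        page->lp_dead[batch->killed[i]] = true;
}

Prefetching widens the window between reading the index page and applying the kills, which is why the question of how often the LSN actually survives unchanged comes up later in the thread.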
On Wed, Feb 14, 2024 at 11:40 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > I wasn't quite sure how we could use > index_compute_xid_horizon_for_tuples() for inspiration -- per Peter's > suggestion. But, I'd like to understand. The point I was trying to make with that example was: a highly generic mechanism can sometimes work across disparate index AMs (that all at least support plain index scans) when it just so happens that these AMs don't actually differ in a way that could possibly matter to that mechanism. While it's true that (say) nbtree and hash are very different at a high level, it's nevertheless also true that the way things work at the level of individual index pages is much more similar than different. With index deletion, we know that the differences between each supported index AM either don't matter at all (which is what obviates the need for index_compute_xid_horizon_for_tuples() to be directly aware of which index AM the page it is passed comes from), or matter only in small, incidental ways (e.g., nbtree stores posting lists in its tuples, despite using IndexTuple structs). With prefetching, it seems reasonable to suppose that an index-AM specific approach would end up needing very little truly custom code. This is pretty strongly suggested by the fact that the rules around buffer pins (as an interlock against concurrent TID recycling by VACUUM) are standardized by the index AM API itself. Those rules might be slightly more natural with nbtree, but that's kinda beside the point. While the basic organizing principle for where each index tuple goes can vary enormously, it doesn't necessarily matter at all -- in the end, you're really just reading each index page (that has TIDs to read) exactly once per scan, in some fixed order, with interlaced inline heap accesses (that go fetch heap tuples for each individual TID read from each index page). In general I don't accept that we need to do things outside the index AM, because software architecture encapsulation something something. I suspect that we'll need to share some limited information across different layers of abstraction, because that's just fundamentally what's required by the constraints we're operating under. Can't really prove it, though. -- Peter Geoghegan
On Wed, Feb 14, 2024 at 11:40 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Tue, Feb 13, 2024 at 2:01 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > > On 2/7/24 22:48, Melanie Plageman wrote: > > > ... > > > - switching scan directions > > > > > > If the index scan switches directions on a given invocation of > > > IndexNext(), heap blocks may have already been prefetched and read for > > > blocks containing tuples beyond the point at which we want to switch > > > directions. > > > > > > We could fix this by having some kind of streaming read "reset" > > > callback to drop all of the buffers which have been prefetched which > > > are now no longer needed. We'd have to go backwards from the last TID > > > which was yielded to the caller and figure out which buffers in the > > > pgsr buffer ranges are associated with all of the TIDs which were > > > prefetched after that TID. The TIDs are in the per_buffer_data > > > associated with each buffer in pgsr. The issue would be searching > > > through those efficiently. > > > > > > > Yeah, that's roughly what I envisioned in one of my previous messages > > about this issue - walking back the TIDs read from the index and added > > to the prefetch queue. > > > > > The other issue is that the streaming read API does not currently > > > support backwards scans. So, if we switch to a backwards scan from a > > > forwards scan, we would need to fallback to the non streaming read > > > method. We could do this by just setting the TID queue size to 1 > > > (which is what I have currently implemented). Or we could add > > > backwards scan support to the streaming read API. > > > > > > > What do you mean by "support for backwards scans" in the streaming read > > API? I imagined it naively as > > > > 1) drop all requests in the streaming read API queue > > > > 2) walk back all "future" requests in the TID queue > > > > 3) start prefetching as if from scratch > > > > Maybe there's a way to optimize this and reuse some of the work more > > efficiently, but my assumption is that the scan direction does not > > change very often, and that we process many items in between. > > Yes, the steps you mention for resetting the queues make sense. What I > meant by "backwards scan is not supported by the streaming read API" > is that Thomas/Andres had mentioned that the streaming read API does > not support backwards scans right now. Though, since the callback just > returns a block number, I don't know how it would break. > > When switching between a forwards and backwards scan, does it go > backwards from the current position or start at the end (or beginning) > of the relation? Okay, well I answered this question for myself, by, um, trying it :). FETCH backward will go backwards from the current cursor position. So, I don't see exactly why this would be an issue. > If it is the former, then the blocks would most > likely be in shared buffers -- which the streaming read API handles. > It is not obvious to me from looking at the code what the gap is, so > perhaps Thomas could weigh in. I have the same problem with the sequential scan streaming read user, so I am going to try and figure this backwards scan and switching scan direction thing there (where we don't have other issues). - Melanie
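Since the reset steps have now come up a couple of times in the thread, here is a sketch of what the TID-queue half of them amounts to, reusing the invented head/tail ring shape from the earlier sketch; the streaming-read half is the part with no API today. Nothing here is real PostgreSQL code.

/* Illustrative only; same invented ring-buffer shape as the earlier sketch. */
#include <stdint.h>

typedef struct TidQueue
{
    uint32_t    head;            /* next slot the index scan will fill */
    uint32_t    tail;            /* next slot the executor will consume */
    /* ring buffer of TIDs elided */
} TidQueue;

typedef enum ScanDir { BACKWARD = -1, FORWARD = 1 } ScanDir;

static void
reset_for_direction_change(TidQueue *q, ScanDir *dir, ScanDir new_dir)
{
    if (*dir == new_dir)
        return;                  /* nothing to do; the common case */

    /*
     * "Walk back all future requests": entries in [tail, head) were read
     * from the index ahead of the caller in the old direction and may never
     * be needed now, so forget them.  The index scan is repositioned and
     * the queue refills in the new direction.
     */
    q->head = q->tail;

    /*
     * A "reset" hook on the streaming read object would go here, dropping
     * its queued and in-flight requests.  That hook does not exist yet; the
     * fallback discussed above is to shrink the queue to one entry so there
     * is never anything to drop.
     */
    *dir = new_dir;
}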
On Wed, Feb 14, 2024 at 1:21 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > > Another thing that argues against doing this is that we might not need > > > to visit any more B-Tree leaf pages when there is a LIMIT n involved. > > > We could end up scanning a whole extra leaf page (including all of its > > > tuples) for want of the ability to "push down" a LIMIT to the index AM > > > (that's not what happens right now, but it isn't really needed at all > > > right now). > > > > > > > I'm not quite sure I understand what is "this" that you argue against. > > Are you saying we should not separate the two scans? If yes, is there a > > better way to do this? > > What I'm concerned about is the difficulty and complexity of any > design that requires revising "63.4. Index Locking Considerations", > since that's pretty subtle stuff. In particular, if prefetching > "de-synchronizes" (to use your term) the index leaf page level scan > and the heap page scan, then we'll probably have to totally revise the > basic API. So, a pin on the index leaf page is sufficient to keep line pointers from being reused? If we stick to prefetching heap blocks referred to by index tuples in a single index leaf page, and we keep that page pinned, will we still have a problem? > > The LIMIT problem is not very clear to me either. Yes, if we get close > > to the end of the leaf page, we may need to visit the next leaf page. > > But that's kinda the whole point of prefetching - reading stuff ahead, > > and reading too far ahead is an inherent risk. Isn't that a problem we > > have even without LIMIT? The prefetch distance ramp up is meant to limit > > the impact. > > Right now, the index AM doesn't know anything about LIMIT at all. That > doesn't matter, since the index AM can only read/scan one full leaf > page before returning control back to the executor proper. The > executor proper can just shut down the whole index scan upon finding > that we've already returned N tuples for a LIMIT N. > > We don't do prefetching right now, but we also don't risk reading a > leaf page that'll just never be needed. Those two things are in > tension, but I don't think that that's quite the same thing as the > usual standard prefetching tension/problem. Here there is uncertainty > about whether what we're prefetching will *ever* be required -- not > uncertainty about when exactly it'll be required. (Perhaps this > distinction doesn't mean much to you. I'm just telling you how I think > about it, in case it helps move the discussion forward.) I don't think that the LIMIT problem is too different for index scans than heap scans. We will need some advice from planner to come down to prevent over-eager prefetching in all cases. > Another factor that complicates things here is mark/restore > processing. The design for that has the idea of processing one page at > a time baked-in. Kinda like with the kill_prior_tuple issue. Yes, I mentioned this in my earlier email. I think we can resolve mark/restore by resetting the prefetch and TID queues and restoring the last used heap TID in the index scan descriptor. > It's certainly possible that you could figure out various workarounds > for each of these issues (plus the kill_prior_tuple issue) with a > prefetching design that "de-synchronizes" the index access and the > heap access. But it might well be better to extend the existing design > in a way that just avoids all these problems in the first place. 
Maybe > "de-synchronization" really can pay for itself (because the benefits > will outweigh these costs), but if you go that way then I'd really > prefer it that way. Forcing each index access to be synchronous and interleaved with each table access seems like an unprincipled design constraint. While it is true that we rely on that in our current implementation (when using non-MVCC snapshots), it doesn't seem like a principle inherent to accessing indexes and tables. > > > I think that it makes sense to put the index AM in control here -- > > > that almost follows from what I said about the index AM API. The index > > > AM already needs to be in control, in about the same way, to deal with > > > kill_prior_tuple (plus it helps with the LIMIT issue I described). > > > > > > > In control how? What would be the control flow - what part would be > > managed by the index AM? > > ISTM that prefetching for an index scan is about the index scan > itself, first and foremost. The heap accesses are usually the dominant > cost, of course, but sometimes the index leaf page accesses really do > make up a significant fraction of the overall cost of the index scan. > Especially with an expensive index qual. So if you just assume that > the TIDs returned by the index scan are the only thing that matters, > you might have a model that's basically correct on average, but is > occasionally very wrong. That's one reason for "putting the index AM > in control". I don't think the fact that it would also be valuable to do index prefetching is a reason not to do prefetching of heap pages. And, while it is true that were you to add index interior or leaf page prefetching, it would impact the heap prefetching, at the end of the day, the table AM needs some TID or TID-equivalents that whose blocks it can go fetch. The index AM has to produce something that the table AM will consume. So, if we add prefetching of heap pages and get the table AM input right, it shouldn't require a full redesign to add index page prefetching later. You could argue that my suggestion to have the index AM manage and populate a queue of TIDs for use by the table AM puts the index AM in control. I do think having so many members of the IndexScanDescriptor which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous interplay between fetching an index tuple and fetching a heap tuple is confusing and error prone. > As I said back in June, we should probably be marrying information > from the index scan with information from the heap. This is something > that is arguably a modularity violation. But it might just be that you > really do need to take information from both places to consistently > make the right trade-off. Agreed that we are going to need to mix information from both places. > If you take a look at _bt_killitems(), you'll see that it actually has > two fairly different strategies for avoiding TID recycling race > condition issues, applied in each of two different cases: > > 1. Cases where we really have held onto a buffer pin, per the index AM > API -- the "inde AM orthodox" approach. (The aforementioned issue > with unlogged indexes exists because with an unlogged index we must > use approach 1, per the nbtree README section [1]). > > 2. Cases where we drop the pin as an optimization (also per [1]), and > now have to detect the possibility of concurrent modifications by > VACUUM (that could have led to concurrent TID recycling). 
We > conservatively do nothing (don't mark any index tuples LP_DEAD), > unless the LSN is exactly the same as it was back when the page was > scanned/read by _bt_readpage(). Re 2: so the LSN could have been changed by some other process (i.e. not vacuum), so how often in practice is the LSN actually the same as when the page was scanned/read? Do you think we would catch a meaningful number of kill prior tuple opportunities if we used an LSN tracking method like this? Something that let us drop the pin on the page would obviously be better. - Melanie
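One cheap form that "advice from the planner" could take is a cap on the prefetch distance derived from how many tuples the plan can still want, applied on top of the usual ramp-up. A sketch, with made-up names and an arbitrarily chosen ramp-up rule; since row estimates can be wrong, this is only a clamp, not a guarantee of never reading something useless.

/* Sketch of a LIMIT-aware prefetch distance; the heuristics are illustrative. */
#include <stdint.h>

typedef struct PrefetchState
{
    int         distance;        /* current prefetch distance */
    int         max_distance;    /* e.g. the tablespace's io_concurrency */
    int64_t     tuples_left;     /* planner hint: rows still wanted, -1 = none */
} PrefetchState;

static int
next_prefetch_distance(PrefetchState *ps)
{
    /* the usual ramp-up: start at 1 and grow toward the configured maximum */
    if (ps->distance < ps->max_distance)
        ps->distance = (ps->distance == 0) ? 1 : ps->distance * 2;
    if (ps->distance > ps->max_distance)
        ps->distance = ps->max_distance;

    /*
     * LIMIT-style clamp: never run further ahead than the number of tuples
     * the plan can still consume.  With no hint (tuples_left == -1) the
     * clamp is a no-op, and a wrong estimate only costs some useless I/O.
     */
    if (ps->tuples_left >= 0 && ps->distance > ps->tuples_left)
        ps->distance = (int) ps->tuples_left;

    return ps->distance;
}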
On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > So, a pin on the index leaf page is sufficient to keep line pointers > from being reused? If we stick to prefetching heap blocks referred to > by index tuples in a single index leaf page, and we keep that page > pinned, will we still have a problem? That's certainly one way of dealing with it. Obviously, there are questions about how you do that in a way that consistently avoids creating new problems. > I don't think that the LIMIT problem is too different for index scans > than heap scans. We will need some advice from planner to come down to > prevent over-eager prefetching in all cases. I think that I'd rather use information at execution time instead, if at all possible (perhaps in addition to a hint given by the planner). But it seems a bit premature to discuss this problem now, except to say that it might indeed be a problem. > > It's certainly possible that you could figure out various workarounds > > for each of these issues (plus the kill_prior_tuple issue) with a > > prefetching design that "de-synchronizes" the index access and the > > heap access. But it might well be better to extend the existing design > > in a way that just avoids all these problems in the first place. Maybe > > "de-synchronization" really can pay for itself (because the benefits > > will outweigh these costs), but if you go that way then I'd really > > prefer it that way. > > Forcing each index access to be synchronous and interleaved with each > table access seems like an unprincipled design constraint. While it is > true that we rely on that in our current implementation (when using > non-MVCC snapshots), it doesn't seem like a principle inherent to > accessing indexes and tables. There is nothing sacred about the way plain index scans work right now -- especially the part about buffer pins as an interlock. If the pin thing really was sacred, then we could never have allowed nbtree to selectively opt-out in cases where it's possible to provide an equivalent correctness guarantee without holding onto buffer pins, which, as I went into, is how it actually works in nbtree's _bt_killitems() today (see commit 2ed5b87f96 for full details). And so in principle I have no problem with the idea of revising the basic definition of plain index scans -- especially if it's to make the definition more abstract, without fundamentally changing it (e.g., to make it no longer reference buffer pins, making life easier for prefetching, while at the same time still implying the same underlying guarantees sufficient to allow nbtree to mostly work the same way as today). All I'm really saying is: 1. The sort of tricks that we can do in nbtree's _bt_killitems() are quite useful, and ought to be preserved in something like their current form, even when prefetching is in use. This seems to push things in the direction of centralizing control of the process in index scan code. For example, it has to understand that _bt_killitems() will be called at some regular cadence that is well defined and sensible from an index point of view. 2. Are you sure that the leaf-page-at-a-time thing is such a huge hindrance to effective prefetching? I suppose that it might be much more important than I imagine it is right now, but it'd be nice to have something a bit more concrete to go on. 3. Even if it is somewhat important, do you really need to get that part working in v1? 
Tomas' original prototype worked with the leaf-page-at-a-time thing, and that still seemed like a big improvement to me. While being less invasive, in effect. If we can agree that something like that represents a useful step in the right direction (not an evolutionary dead end), then we can make good incremental progress within a single release. > I don't think the fact that it would also be valuable to do index > prefetching is a reason not to do prefetching of heap pages. And, > while it is true that were you to add index interior or leaf page > prefetching, it would impact the heap prefetching, at the end of the > day, the table AM needs some TID or TID-equivalents that whose blocks > it can go fetch. I wasn't really thinking of index page prefetching at all. Just the cost of applying index quals to read leaf pages that might never actually need to be read, due to the presence of a LIMIT. That is kind of a new problem created by eagerly reading (without actually prefetching) leaf pages. > You could argue that my suggestion to have the index AM manage and > populate a queue of TIDs for use by the table AM puts the index AM in > control. I do think having so many members of the IndexScanDescriptor > which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous > interplay between fetching an index tuple and fetching a heap tuple is > confusing and error prone. But that's kinda how amgettuple is supposed to work -- cursors need it to work that way. Having some kind of general notion of scan order is also important to avoid returning duplicate TIDs to the scan. In contrast, GIN heavily relies on the fact that it only supports bitmap scans -- that allows it to not have to reason about returning duplicate TIDs (when dealing with a concurrently merged pending list, and other stuff like that). And so nbtree (and basically every other index AM that supports plain index scans) kinda pretends to process a single tuple at a time, in some fixed order that's convenient for the scan to work with (that's how the executor thinks of things). In reality these index AMs actually process batches consisting of a single leaf page worth of tuples. I don't see how the IndexScanDescData side of things makes life any harder for this patch -- ISTM that you'll always need to pretend to return one tuple at a time from the index scan, regardless of what happens under the hood, with pins and whatnot. The page-at-a-time thing is more or less an implementation detail that's private to index AMs (albeit in a way that follows certain standard conventions across index AMs) -- it's a leaky abstraction only due to the interactions with VACUUM/TID recycle safety. > Re 2: so the LSN could have been changed by some other process (i.e. > not vacuum), so how often in practice is the LSN actually the same as > when the page was scanned/read? It seems very hard to make generalizations about that sort of thing. It doesn't help that we now have batching logic inside _bt_simpledel_pass() that will make up for the problem of not setting as many LP_DEAD bits as we could in many important cases. (I recall that that was one factor that allowed the bug that Andres fixed in commit 90c885cd to go undetected for months. I recall discussing the issue with Andres around that time.) > Do you think we would catch a > meaningful number of kill prior tuple opportunities if we used an LSN > tracking method like this? Something that let us drop the pin on the > page would obviously be better. Quite possibly, yes. 
But it's hard to say for sure without far more detailed analysis. Plus you have problems with things like unlogged indexes not having an LSN to use as a canary condition, which makes it a bit messy (it's already kind of weird that we treat unlogged indexes differently here IMV). -- Peter Geoghegan
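For reference, the "LSN as canary" behavior being discussed works roughly like the sketch below (a simplified illustration, not the actual nbtree source): when the scan did not keep the leaf page pinned, _bt_killitems() re-reads the page and only sets LP_DEAD hints if the page LSN still matches the value stashed when _bt_readpage() originally read the page.

    #include "access/nbtree.h"
    #include "storage/bufmgr.h"

    /* Simplified sketch of the pin/no-pin paths in _bt_killitems(). */
    static void
    kill_items_sketch(IndexScanDesc scan)
    {
        BTScanOpaque so = (BTScanOpaque) scan->opaque;
        Buffer      buf;

        if (BTScanPosIsPinned(so->currPos))
        {
            /* Pin held since _bt_readpage(): VACUUM cannot have recycled TIDs. */
            buf = so->currPos.buf;
            LockBuffer(buf, BT_READ);
        }
        else
        {
            /* Pin was dropped: re-read the page and check the LSN canary. */
            buf = ReadBuffer(scan->indexRelation, so->currPos.currPage);
            LockBuffer(buf, BT_READ);
            if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
            {
                /* Page changed since it was read; hinting is no longer safe. */
                UnlockReleaseBuffer(buf);
                return;
            }
        }

        /* ... walk so->killedItems[] and mark the matching items LP_DEAD ... */

        if (BTScanPosIsPinned(so->currPos))
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* scan still owns the pin */
        else
            UnlockReleaseBuffer(buf);               /* drop the extra pin we took */
    }

As noted above, an unlogged index has no page LSN to use as the canary, which is why the no-pin path is treated differently there and why this scheme is an awkward fit.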
Hi, On 2024-02-14 16:45:57 -0500, Melanie Plageman wrote: > > > The LIMIT problem is not very clear to me either. Yes, if we get close > > > to the end of the leaf page, we may need to visit the next leaf page. > > > But that's kinda the whole point of prefetching - reading stuff ahead, > > > and reading too far ahead is an inherent risk. Isn't that a problem we > > > have even without LIMIT? The prefetch distance ramp up is meant to limit > > > the impact. > > > > Right now, the index AM doesn't know anything about LIMIT at all. That > > doesn't matter, since the index AM can only read/scan one full leaf > > page before returning control back to the executor proper. The > > executor proper can just shut down the whole index scan upon finding > > that we've already returned N tuples for a LIMIT N. > > > > We don't do prefetching right now, but we also don't risk reading a > > leaf page that'll just never be needed. Those two things are in > > tension, but I don't think that that's quite the same thing as the > > usual standard prefetching tension/problem. Here there is uncertainty > > about whether what we're prefetching will *ever* be required -- not > > uncertainty about when exactly it'll be required. (Perhaps this > > distinction doesn't mean much to you. I'm just telling you how I think > > about it, in case it helps move the discussion forward.) > > I don't think that the LIMIT problem is too different for index scans > than heap scans. We will need some advice from planner to come down to > prevent over-eager prefetching in all cases. I'm not sure that that's really true. I think the more common and more problematic case for partially executing a sub-tree of a query are nested loops (worse because that happens many times within a query). Particularly for anti-joins prefetching too aggressively could lead to a significant IO amplification. At the same time it's IMO more important to ramp up prefetching distance fairly aggressively for index scans than it is for sequential scans. For sequential scans it's quite likely that either the whole scan takes quite a while (thus slowly ramping doesn't affect overall time that much) or that the data is cached anyway because the tables are small and frequently used (in which case we don't need to ramp). And even if smaller tables aren't cached, because it's sequential IO, the IOs are cheaper as they're sequential. Contrast that to index scans, where it's much more likely that you have cache misses in queries that do an overall fairly small number of IOs and where that IO is largely random. I think we'll need some awareness at ExecInitNode() time about how the results of the nodes are used. I see a few "classes": 1) All rows are needed, because the node is below an Agg, Hash, Materialize, Sort, .... Can be determined purely by the plan shape. 2) All rows are needed, because the node is completely consumed by the top-level (i.e. no limit, anti-joins or such inbetween) and the top-level wants to run the whole query. Unfortunately I don't think we know this at plan time at the moment (it's just determined by what's passed to ExecutorRun()). 3) Some rows are needed, but it's hard to know the precise number. E.g. because of a LIMIT further up. 4) Only a single row is going to be needed, albeit possibly after filtering on the node level. E.g. the anti-join case. There are different times at which we could determine how each node is consumed: a) Determine node consumption "class" purely within ExecInit*, via different eflags. 
Today that couldn't deal with 2), but I think it wouldn't be too hard to modify callers that consume query results completely to tell ExecutorStart() that, not just ExecutorRun(). A disadvantage would be that this prevents us from taking IO depth into account during costing. There very well might be plans that are cheaper than others because the plan shape allows more concurrent IO. b) Determine node consumption class at plan time. This also couldn't deal with 2), but fixing that probably would be harder, because we'll often not know at plan time how the query will be executed. And in fact the same plan might be executed multiple ways, in case of prepared statements. The obvious advantage is of course that we can influence the choice of paths. I suspect we'd eventually want a mix of both: plan time to be able to influence the plan shape, ExecInit* to deal with not knowing at plan time how the query will be consumed. Which suggests that we could start with whichever is easier and extend later. Greetings, Andres Freund
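To make the four consumption "classes" above a bit more concrete, here is one purely hypothetical way the information could be handed to executor nodes via eflags at ExecInitNode() time; none of these names exist in PostgreSQL today, this is only a sketch of the idea.

    /* Hypothetical classification of how a node's output will be consumed. */
    typedef enum NodeConsumption
    {
        CONSUME_ALL_PLAN_SHAPE,     /* (1) all rows: node sits below Agg/Hash/Sort/... */
        CONSUME_ALL_TOP_LEVEL,      /* (2) all rows: whole query will be run to completion */
        CONSUME_SOME_UNKNOWN,       /* (3) some rows, unknown count (e.g. LIMIT above) */
        CONSUME_SINGLE_ROW          /* (4) at most one row (e.g. anti-join probe) */
    } NodeConsumption;

    /* Hypothetical eflags bits an ExecInit* function could inspect. */
    #define EXEC_FLAG_CONSUME_SOME      0x0100
    #define EXEC_FLAG_CONSUME_ONE       0x0200

    /* Sketch: derive an initial prefetch target from the consumption class. */
    static int
    initial_prefetch_target(NodeConsumption consumption, int io_concurrency)
    {
        switch (consumption)
        {
            case CONSUME_ALL_PLAN_SHAPE:
            case CONSUME_ALL_TOP_LEVEL:
                return io_concurrency;          /* ramp up aggressively */
            case CONSUME_SOME_UNKNOWN:
                return Min(io_concurrency, 4);  /* start conservatively */
            case CONSUME_SINGLE_ROW:
            default:
                return 0;                       /* don't prefetch at all */
        }
    }

Class 2 is the one that needs extra plumbing either way: as noted above, today only ExecutorRun() learns whether the caller intends to consume the complete result.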
Hi, On 2024-02-13 14:54:14 -0500, Peter Geoghegan wrote: > This property of index scans is fundamental to how index scans work. > Pinning an index page as an interlock against concurrently TID > recycling by VACUUM is directly described by the index API docs [1], > even (the docs actually use terms like "buffer pin" rather than > something more abstract sounding). I don't think that anything > affecting that behavior should be considered an implementation detail > of the nbtree index AM as such (nor any particular index AM). Given that the interlock is only needed for non-mvcc scans, that non-mvcc scans are rare due to catalog accesses using snapshots these days and that most non-mvcc scans do single-tuple lookups, it might be viable to be more restrictive about prefetching iff non-mvcc snapshots are in use and to use method of cleanup that allows multiple pages to be cleaned up otherwise. However, I don't think we would necessarily have to relax the IAM pinning rules, just to be able to do prefetching of more than one index leaf page. Restricting prefetching to entries within a single leaf page obviously has the disadvantage of not being able to benefit from concurrent IO whenever crossing a leaf page boundary, but at the same time processing entries from just two leaf pages would often allow for a sufficiently aggressive prefetching. Pinning a small number of leaf pages instead of a single leaf page shouldn't be a problem. One argument for loosening the tight coupling between kill_prior_tuples and index scan progress is that the lack of kill_prior_tuples for bitmap scans is quite problematic. I've seen numerous production issues with bitmap scans caused by subsequent scans processing a growing set of dead tuples, where plain index scans were substantially slower initially but didn't get much slower over time. We might be able to design a system where the bitmap contains a certain number of back-references to the index, allowing later cleanup if there weren't any page splits or such. > I think that it makes sense to put the index AM in control here -- > that almost follows from what I said about the index AM API. The index > AM already needs to be in control, in about the same way, to deal with > kill_prior_tuple (plus it helps with the LIMIT issue I described). Depending on what "control" means I'm doubtful: Imo there are decisions influencing prefetching that an index AM shouldn't need to know about directly, e.g. how the plan shape influences how many tuples are actually going to be consumed. Of course that determination could be made in planner/executor and handed to IAMs, for the IAM to then "control" the prefetching. Another aspect is that *long* term I think we want to be able to execute different parts of the plan tree when one part is blocked for IO. Of course that's not always possible. But particularly with partitioned queries it often is. Depending on the form of "control" that's harder if IAMs are in control, because control flow needs to return to the executor to be able to switch to a different node, so we can't wait for IO inside the AM. There probably are ways IAMs could be in "control" that would be compatible with such constraints however. Greetings, Andres Freund
On Wed, Feb 14, 2024 at 7:28 PM Andres Freund <andres@anarazel.de> wrote: > On 2024-02-13 14:54:14 -0500, Peter Geoghegan wrote: > > This property of index scans is fundamental to how index scans work. > > Pinning an index page as an interlock against concurrently TID > > recycling by VACUUM is directly described by the index API docs [1], > > even (the docs actually use terms like "buffer pin" rather than > > something more abstract sounding). I don't think that anything > > affecting that behavior should be considered an implementation detail > > of the nbtree index AM as such (nor any particular index AM). > > Given that the interlock is only needed for non-mvcc scans, that non-mvcc > scans are rare due to catalog accesses using snapshots these days and that > most non-mvcc scans do single-tuple lookups, it might be viable to be more > restrictive about prefetching iff non-mvcc snapshots are in use and to use > method of cleanup that allows multiple pages to be cleaned up otherwise. I agree, but don't think that it matters all that much. If you have an MVCC snapshot, that doesn't mean that TID recycle safety problems automatically go away. It only means that you have one known and supported alternative approach to dealing with such problems. It's not like you just get that for free, just by using an MVCC snapshot, though -- it has downsides. Downsides such as the current _bt_killitems() behavior with a concurrently-modified leaf page (modified when we didn't hold a leaf page pin). It'll just give up on setting any LP_DEAD bits due to noticing that the leaf page's LSN changed. (Plus there are implementation restrictions that I won't repeat again now.) When I refer to the buffer pin interlock, I'm mostly referring to the general need for something like that in the context of index scans. Principally in order to make kill_prior_tuple continue to work in something more or less like its current form. > However, I don't think we would necessarily have to relax the IAM pinning > rules, just to be able to do prefetching of more than one index leaf > page. To be clear, we already do relax the IAM pinning rules. Or at least nbtree selectively opts out, as I've gone into already. > Restricting prefetching to entries within a single leaf page obviously > has the disadvantage of not being able to benefit from concurrent IO whenever > crossing a leaf page boundary, but at the same time processing entries from > just two leaf pages would often allow for a sufficiently aggressive > prefetching. Pinning a small number of leaf pages instead of a single leaf > page shouldn't be a problem. You're probably right. I just don't see any need to solve that problem in v1. > One argument for loosening the tight coupling between kill_prior_tuples and > index scan progress is that the lack of kill_prior_tuples for bitmap scans is > quite problematic. I've seen numerous production issues with bitmap scans > caused by subsequent scans processing a growing set of dead tuples, where > plain index scans were substantially slower initially but didn't get much > slower over time. I've seen production issues like that too. No doubt it's a problem. > We might be able to design a system where the bitmap > contains a certain number of back-references to the index, allowing later > cleanup if there weren't any page splits or such. 
That does seem possible, but do you really want a design for index prefetching that relies on that massive enhancement (a total redesign of kill_prior_tuple) happening at some point in the not-too-distant future? Seems risky, from a project management point of view. This back-references idea seems rather complicated, especially if it needs to work with very large bitmap index scans. Since you'll still have the basic problem of TID recycle safety to deal with (even with an MVCC snapshot), you don't just have to revisit the leaf pages. You also have to revisit the corresponding heap pages (generally they'll be a lot more numerous than leaf pages). You'll have traded one problem for another (which is not to say that it's not a good trade-off). Right now the executor uses a amgettuple interface, and knows nothing about index related costs (e.g., pages accessed in any index, index qual costs). While the index AM has some limited understanding of heap access costs. So the index AM kinda knows a small bit about both types of costs (possibly not enough, but something). That informs the language I'm using to describe all this. To do something like your "back-references to the index" thing well, I think that you need more dynamic behavior around when you visit the heap to get heap tuples pointed to by TIDs from index pages (i.e. dynamic behavior that determines how many leaf pages to go before going to the heap to get pointed-to TIDs). That is basically what I meant by "put the index AM in control" -- it doesn't *strictly* require that the index AM actually do that. Just that a single piece of code has to have access to the full context, in order to make the right trade-offs around how both index and heap accesses are scheduled. > > I think that it makes sense to put the index AM in control here -- > > that almost follows from what I said about the index AM API. The index > > AM already needs to be in control, in about the same way, to deal with > > kill_prior_tuple (plus it helps with the LIMIT issue I described). > > Depending on what "control" means I'm doubtful: > > Imo there are decisions influencing prefetching that an index AM shouldn't > need to know about directly, e.g. how the plan shape influences how many > tuples are actually going to be consumed. Of course that determination could > be made in planner/executor and handed to IAMs, for the IAM to then "control" > the prefetching. I agree with all this. -- Peter Geoghegan
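For context, the amgettuple contract referred to above looks roughly like this from the executor's side (a much-simplified sketch of the loop driven from nodeIndexscan.c): the scan hands back one tuple at a time, and the AM's page-at-a-time batching, pins and kill_prior_tuple bookkeeping all stay hidden behind index_getnext_slot().

    /* Simplified sketch of how the executor consumes a plain index scan. */
    static TupleTableSlot *
    index_scan_next_sketch(IndexScanDesc scan, ScanDirection dir,
                           TupleTableSlot *slot)
    {
        /*
         * index_getnext_slot() asks the index AM for the next TID (amgettuple)
         * and fetches the matching heap tuple into the slot.  If the heap
         * fetch finds that all versions of the previous tuple are dead,
         * indexam.c sets scan->kill_prior_tuple, and the AM remembers the
         * item so it can set LP_DEAD hints later (e.g. in _bt_killitems()).
         */
        if (index_getnext_slot(scan, dir, slot))
            return slot;            /* one visible tuple, in index order */

        return NULL;                /* no more matches */
    }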
On Wed, Feb 14, 2024 at 7:43 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > I don't think it's just a bookkeeping problem. In a way, nbtree already > does keep an array of tuples to kill (see btgettuple), but it's always > for the current index page. So it's not that we immediately go and kill > the prior tuple - nbtree already stashes it in an array, and kills all > those tuples when moving to the next index page. > > The way I understand the problem is that with prefetching we're bound to > determine the kill_prior_tuple flag with a delay, in which case we might > have already moved to the next index page ... Well... I'm not clear on all of the details of how this works, but this sounds broken to me, for the reasons that Peter G. mentions in his comments about desynchronization. If we currently have a rule that you hold a pin on the index page while processing the heap tuples it references, you can't just throw that out the window and expect things to keep working. Saying that kill_prior_tuple doesn't work when you throw that rule out the window is probably understating the extent of the problem very considerably. I would have thought that the way this prefetching would work is that we would bring pages into shared_buffers sooner than we currently do, but not actually pin them until we're ready to use them, so that it's possible they might be evicted again before we get around to them, if we prefetch too far and the system is too busy. Alternately, it also seems OK to read those later pages and pin them right away, as long as (1) we don't also give up pins that we would have held in the absence of prefetching and (2) we have some mechanism for limiting the number of extra pins that we're holding to a reasonable number given the size of shared_buffers. However, it doesn't seem OK at all to give up pins that the current code holds sooner than the current code would do. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2024-02-15 09:59:27 +0530, Robert Haas wrote: > I would have thought that the way this prefetching would work is that > we would bring pages into shared_buffers sooner than we currently do, > but not actually pin them until we're ready to use them, so that it's > possible they might be evicted again before we get around to them, if > we prefetch too far and the system is too busy. The issue here is that we need to read index leaf pages (synchronously for now!) to get the tids to do readahead of table data. What you describe is done for the table data (IMO not a good idea medium term [1]), but the problem at hand is that once we've done readahead for all the tids on one index page, we can't do more readahead without looking at the next index leaf page. Obviously that would lead to a sawtooth like IO pattern, where you'd regularly have to wait for IO for the first tuples referenced by an index leaf page. However, if we want to issue table readahead for tids on the neighboring index leaf page, we'll - as the patch stands - not hold a pin on the "current" index leaf page. Which makes index prefetching as currently implemented incompatible with kill_prior_tuple, as that requires the index leaf page pin being held. > Alternately, it also seems OK to read those later pages and pin them right > away, as long as (1) we don't also give up pins that we would have held in > the absence of prefetching and (2) we have some mechanism for limiting the > number of extra pins that we're holding to a reasonable number given the > size of shared_buffers. FWIW, there's already some logic for (2) in LimitAdditionalPins(). Currently used to limit how many buffers a backend may pin for bulk relation extension. Greetings, Andres Freund [1] The main reasons that I think that just doing readahead without keeping a pin is a bad idea, at least medium term, are: a) To do AIO you need to hold a pin on the page while the IO is in progress, as the target buffer contents will be modified at some moment you don't control, so that buffer should better not be replaced while IO is in progress. So at the very least you need to hold a pin until the IO is over. b) If you do not keep a pin until you actually use the page, you need to either do another buffer lookup (expensive!) or you need to remember the buffer id and revalidate that it's still pointing to the same block (cheaper, but still not cheap). That's not just bad because it's slow in an absolute sense, more importantly it increases the potential performance downside of doing readahead for fully cached workloads, because you don't gain anything, but pay the price of two lookups/revalidation. Note that these reasons really just apply to cases where we read ahead because we are quite certain we'll need exactly those blocks (leaving errors or queries ending early aside), not for "heuristic" prefetching. If we e.g. were to issue prefetch requests for neighboring index pages while descending during an ordered index scan, without checking that we'll need those, it'd make sense to just do a "throway" prefetch request.
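To illustrate point (b) of the footnote: if no pin is kept between the prefetch and the eventual use of the page, the consumer either pays for a second buffer-mapping lookup (ReadBuffer) or remembers which buffer the block was in and revalidates it. A rough sketch, assuming the prefetch was issued earlier with PrefetchBuffer(heapRel, MAIN_FORKNUM, blkno) and its result was kept around (exact field and argument spellings vary across server versions):

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    static Buffer
    fetch_prefetched_block(Relation heapRel, BlockNumber blkno,
                           PrefetchBufferResult pf)
    {
        /*
         * Cheaper path: the prefetch call reported which buffer already held
         * the block; try to pin that same buffer again and revalidate that it
         * still contains blkno.  Cheaper than a fresh lookup, but not free --
         * which is the downside (b) describes for fully cached workloads.
         */
        if (pf.recent_buffer != InvalidBuffer &&
            ReadRecentBuffer(heapRel->rd_locator, MAIN_FORKNUM, blkno,
                             pf.recent_buffer))
            return pf.recent_buffer;    /* pinned again by ReadRecentBuffer() */

        /* Expensive path: a full buffer-mapping lookup (and possibly real IO). */
        return ReadBuffer(heapRel, blkno);
    }

Keeping the pin from the start avoids both costs, but each such pin has to be budgeted, which is where something like LimitAdditionalPins() comes in.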
On Thu, Feb 15, 2024 at 10:33 AM Andres Freund <andres@anarazel.de> wrote: > The issue here is that we need to read index leaf pages (synchronously for > now!) to get the tids to do readahead of table data. What you describe is done > for the table data (IMO not a good idea medium term [1]), but the problem at > hand is that once we've done readahead for all the tids on one index page, we > can't do more readahead without looking at the next index leaf page. Oh, right. > However, if we want to issue table readahead for tids on the neighboring index > leaf page, we'll - as the patch stands - not hold a pin on the "current" index > leaf page. Which makes index prefetching as currently implemented incompatible > with kill_prior_tuple, as that requires the index leaf page pin being held. But I think it probably also breaks MVCC, as Peter was saying. -- Robert Haas EDB: http://www.enterprisedb.com
On 2/15/24 00:06, Peter Geoghegan wrote: > On Wed, Feb 14, 2024 at 4:46 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > >> ... > > 2. Are you sure that the leaf-page-at-a-time thing is such a huge > hindrance to effective prefetching? > > I suppose that it might be much more important than I imagine it is > right now, but it'd be nice to have something a bit more concrete to > go on. > This probably depends on which corner cases are considered important. The page-at-a-time approach essentially means index items at the beginning of the page won't get prefetched (or vice versa, prefetch distance drops to 0 when we get to the end of the index page). That may be acceptable, considering we can usually fit 200+ index items on a single page. Even then it limits what effective_io_concurrency values are sensible, but in my experience the benefits quickly diminish past ~32. > 3. Even if it is somewhat important, do you really need to get that > part working in v1? > > Tomas' original prototype worked with the leaf-page-at-a-time thing, > and that still seemed like a big improvement to me. While being less > invasive, in effect. If we can agree that something like that > represents a useful step in the right direction (not an evolutionary > dead end), then we can make good incremental progress within a single > release. > It certainly was a great improvement, no doubt about that. I dislike the restriction, but that's partially for aesthetic reasons - it just seems it'd be nice to not have this. That being said, I'd be OK with having this restriction if it makes v1 feasible. For me, the big question is whether it'd mean we're stuck with this restriction forever, or whether there's a viable way to improve this in v2. And I don't have an answer to that :-( I got completely lost in the ongoing discussion about the locking implications (which I happily ignored while working on the PoC patch), layering tensions and questions about which part should be "in control". regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
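In pseudo-code, the limitation described above looks roughly like this (hypothetical names; progress and duplicate tracking omitted): the lookahead is clamped to the item array of the current leaf page, so the effective prefetch distance collapses to zero as the scan approaches the page boundary and only recovers after the next leaf page has been read.

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/itemptr.h"

    /* Hypothetical sketch of page-at-a-time prefetching for one leaf page. */
    static void
    prefetch_from_current_leaf(Relation heapRel, ItemPointerData *items,
                               int nitems, int current, int distance)
    {
        /* Lookahead cannot cross the leaf page boundary, so clamp it. */
        int         last = Min(current + distance, nitems - 1);

        for (int i = current + 1; i <= last; i++)
        {
            BlockNumber blkno = ItemPointerGetBlockNumber(&items[i]);

            PrefetchBuffer(heapRel, MAIN_FORKNUM, blkno);
        }

        /*
         * For the final `distance` items of the page, last - current shrinks
         * below the requested distance -- the gap this subthread is about.
         */
    }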
On Thu, Feb 15, 2024 at 9:36 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 2/15/24 00:06, Peter Geoghegan wrote: > > I suppose that it might be much more important than I imagine it is > > right now, but it'd be nice to have something a bit more concrete to > > go on. > > > > This probably depends on which corner cases are considered important. > > The page-at-a-time approach essentially means index items at the > beginning of the page won't get prefetched (or vice versa, prefetch > distance drops to 0 when we get to end of index page). I don't think that's true. At least not for nbtree scans. As I went into last year, you'd get the benefit of the work I've done on "boundary cases" (most recently in commit c9c0589f from just a couple of months back), which helps us get the most out of suffix truncation. This maximizes the chances of only having to scan a single index leaf page in many important cases. So I can see no reason why index items at the beginning of the page are at any particular disadvantage (compared to those from the middle or the end of the page). Where you might have a problem is cases where it's just inherently necessary to visit more than a single leaf page, despite the best efforts of the nbtsplitloc.c logic -- cases where the scan just inherently needs to return tuples that "straddle the boundary between two neighboring pages". That isn't a particularly natural restriction, but it's also not obvious that it's all that much of a disadvantage in practice. > It certainly was a great improvement, no doubt about that. I dislike the > restriction, but that's partially for aesthetic reasons - it just seems > it'd be nice to not have this. > > That being said, I'd be OK with having this restriction if it makes v1 > feasible. For me, the big question is whether it'd mean we're stuck with > this restriction forever, or whether there's a viable way to improve > this in v2. I think that there is no question that this will need to not completely disable kill_prior_tuple -- I'd be surprised if one single person disagreed with me on this point. There is also a more nuanced way of describing this same restriction, but we don't necessarily need to agree on what exactly that is right now. > And I don't have answer to that :-( I got completely lost in the ongoing > discussion about the locking implications (which I happily ignored while > working on the PoC patch), layering tensions and questions which part > should be "in control". Honestly, I always thought that it made sense to do things on the index AM side. When you went the other way I was surprised. Perhaps I should have said more about that, sooner, but I'd already said quite a bit at that point, so... Anyway, I think that it's pretty clear that "naive desynchronization" is just not acceptable, because that'll disable kill_prior_tuple altogether. So you're going to have to do this in a way that more or less preserves something like the current kill_prior_tuple behavior. It's going to have some downsides, but those can be managed. They can be managed from within the index AM itself, a bit like the _bt_killitems() no-pin stuff does things already. Obviously this interpretation suggests that doing things at the index AM level is indeed the right way to go, layering-wise. Does it make sense to you, though? -- Peter Geoghegan
On 2/15/24 17:42, Peter Geoghegan wrote: > On Thu, Feb 15, 2024 at 9:36 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> On 2/15/24 00:06, Peter Geoghegan wrote: >>> I suppose that it might be much more important than I imagine it is >>> right now, but it'd be nice to have something a bit more concrete to >>> go on. >>> >> >> This probably depends on which corner cases are considered important. >> >> The page-at-a-time approach essentially means index items at the >> beginning of the page won't get prefetched (or vice versa, prefetch >> distance drops to 0 when we get to end of index page). > > I don't think that's true. At least not for nbtree scans. > > As I went into last year, you'd get the benefit of the work I've done > on "boundary cases" (most recently in commit c9c0589f from just a > couple of months back), which helps us get the most out of suffix > truncation. This maximizes the chances of only having to scan a single > index leaf page in many important cases. So I can see no reason why > index items at the beginning of the page are at any particular > disadvantage (compared to those from the middle or the end of the > page). > I may be missing something, but it seems fairly self-evident to me an entry at the beginning of an index page won't get prefetched (assuming the page-at-a-time thing). If I understand your point about boundary cases / suffix truncation, that helps us by (a) picking the split in a way to minimize a single key spanning multiple pages, if possible and (b) increasing the number of entries that fit onto a single index page. That's certainly true / helpful, and it makes the "first entry" issue much less common. But the issue is still there. Of course, this says nothing about the importance of the issue - the impact may easily be so small it's not worth worrying about. > Where you might have a problem is cases where it's just inherently > necessary to visit more than a single leaf page, despite the best > efforts of the nbtsplitloc.c logic -- cases where the scan just > inherently needs to return tuples that "straddle the boundary between > two neighboring pages". That isn't a particularly natural restriction, > but it's also not obvious that it's all that much of a disadvantage in > practice. > One case I've been thinking about is sorting using index, where we often read large part of the index. >> It certainly was a great improvement, no doubt about that. I dislike the >> restriction, but that's partially for aesthetic reasons - it just seems >> it'd be nice to not have this. >> >> That being said, I'd be OK with having this restriction if it makes v1 >> feasible. For me, the big question is whether it'd mean we're stuck with >> this restriction forever, or whether there's a viable way to improve >> this in v2. > > I think that there is no question that this will need to not > completely disable kill_prior_tuple -- I'd be surprised if one single > person disagreed with me on this point. There is also a more nuanced > way of describing this same restriction, but we don't necessarily need > to agree on what exactly that is right now. > Even for the page-at-a-time approach? Or are you talking about the v2? >> And I don't have answer to that :-( I got completely lost in the ongoing >> discussion about the locking implications (which I happily ignored while >> working on the PoC patch), layering tensions and questions which part >> should be "in control". > > Honestly, I always thought that it made sense to do things on the > index AM side. 
> When you went the other way I was surprised. Perhaps I > should have said more about that, sooner, but I'd already said quite a > bit at that point, so... > > Anyway, I think that it's pretty clear that "naive desynchronization" > is just not acceptable, because that'll disable kill_prior_tuple > altogether. So you're going to have to do this in a way that more or > less preserves something like the current kill_prior_tuple behavior. > It's going to have some downsides, but those can be managed. They can > be managed from within the index AM itself, a bit like the > _bt_killitems() no-pin stuff does things already. > > Obviously this interpretation suggests that doing things at the index > AM level is indeed the right way to go, layering-wise. Does it make > sense to you, though? > Yeah. The basic idea was that by moving this above index AM it will work for all indexes automatically - but given the current discussion about kill_prior_tuple, locking etc. I'm not sure that's really feasible. The index AM clearly needs to have more control over this. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 15, 2024 at 12:26 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > I may be missing something, but it seems fairly self-evident to me an > entry at the beginning of an index page won't get prefetched (assuming > the page-at-a-time thing). Sure, if the first item on the page is also the first item that we need the scan to return (having just descended the tree), then it won't get prefetched under a scheme that sticks with the current page-at-a-time behavior (at least in v1). Just like when the first item that we need the scan to return is from the middle of the page, or more towards the end of the page. It is of course also true that we can't prefetch the next page's first item until we actually visit the next page -- clearly that's suboptimal. Just like we can't prefetch any other, later tuples from the next page (until such time as we have determined for sure that there really will be a next page, and have called _bt_readpage for that next page.) This is why I don't think that the tuples with lower page offset numbers are in any way significant here. The significant part is whether or not you'll actually need to visit more than one leaf page in the first place (plus the penalty from not being able to reorder the work across page boundaries in your initial v1 of prefetching). > If I understand your point about boundary cases / suffix truncation, > that helps us by (a) picking the split in a way to minimize a single key > spanning multiple pages, if possible and (b) increasing the number of > entries that fit onto a single index page. More like it makes the boundaries between leaf pages (i.e. high keys) align with the "natural boundaries of the key space". Simple point queries should practically never require more than a single leaf page access as a result. Even somewhat complicated index scans that are reasonably selective (think tens to low hundreds of matches) don't tend to need to read more than a single leaf page match, at least with equality type scan keys for the index qual. > That's certainly true / helpful, and it makes the "first entry" issue > much less common. But the issue is still there. Of course, this says > nothing about the importance of the issue - the impact may easily be so > small it's not worth worrying about. Right. And I want to be clear: I'm really *not* sure how much it matters. I just doubt that it's worth worrying about in v1 -- time grows short. Although I agree that we should commit a v1 that leaves the door open to improving matters in this area in v2. > One case I've been thinking about is sorting using index, where we often > read large part of the index. That definitely seems like a case where reordering work/desynchronization of the heap and index scans might be relatively important. > > I think that there is no question that this will need to not > > completely disable kill_prior_tuple -- I'd be surprised if one single > > person disagreed with me on this point. There is also a more nuanced > > way of describing this same restriction, but we don't necessarily need > > to agree on what exactly that is right now. > > > > Even for the page-at-a-time approach? Or are you talking about the v2? I meant that the current kill_prior_tuple behavior isn't sacred, and can be revised in v2, for the benefit of lifting the restriction on prefetching. But that's going to involve a trade-off of some kind. And not a particularly simple one. > Yeah. 
> The basic idea was that by moving this above index AM it will work > for all indexes automatically - but given the current discussion about > kill_prior_tuple, locking etc. I'm not sure that's really feasible. > > The index AM clearly needs to have more control over this. Cool. I think that that makes the layering question a lot clearer, then. -- Peter Geoghegan
Hi, On 2024-02-15 12:53:10 -0500, Peter Geoghegan wrote: > On Thu, Feb 15, 2024 at 12:26 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > I may be missing something, but it seems fairly self-evident to me an > > entry at the beginning of an index page won't get prefetched (assuming > > the page-at-a-time thing). > > Sure, if the first item on the page is also the first item that we > need the scan to return (having just descended the tree), then it > won't get prefetched under a scheme that sticks with the current > page-at-a-time behavior (at least in v1). Just like when the first > item that we need the scan to return is from the middle of the page, > or more towards the end of the page. > > It is of course also true that we can't prefetch the next page's > first item until we actually visit the next page -- clearly that's > suboptimal. Just like we can't prefetch any other, later tuples from > the next page (until such time as we have determined for sure that > there really will be a next page, and have called _bt_readpage for > that next page.) > > This is why I don't think that the tuples with lower page offset > numbers are in any way significant here. The significant part is > whether or not you'll actually need to visit more than one leaf page > in the first place (plus the penalty from not being able to reorder > the work across page boundaries in your initial v1 of prefetching). To me this your phrasing just seems to reformulate the issue. In practical terms you'll have to wait for the full IO latency when fetching the table tuple corresponding to the first tid on a leaf page. Of course that's also the moment you had to visit another leaf page. Whether the stall is due to visit another leaf page or due to processing the first entry on such a leaf page is a distinction without a difference. > > That's certainly true / helpful, and it makes the "first entry" issue > > much less common. But the issue is still there. Of course, this says > > nothing about the importance of the issue - the impact may easily be so > > small it's not worth worrying about. > > Right. And I want to be clear: I'm really *not* sure how much it > matters. I just doubt that it's worth worrying about in v1 -- time > grows short. Although I agree that we should commit a v1 that leaves > the door open to improving matters in this area in v2. I somewhat doubt that it's realistic to aim for 17 at this point. We seem to still be doing fairly fundamental architectual work. I think it might be the right thing even for 18 to go for the simpler only-a-single-leaf-page approach though. I wonder if there are prerequisites that can be tackled for 17. One idea is to work on infrastructure to provide executor nodes with information about the number of tuples likely to be fetched - I suspect we'll trigger regressions without that in place. One way to *sometimes* process more than a single leaf page, without having to redesign kill_prior_tuple, would be to use the visibilitymap to check if the target pages are all-visible. If all the table pages on a leaf page are all-visible, we know that we don't need to kill index entries, and thus can move on to the next leaf page Greetings, Andres Freund
On Thu, Feb 15, 2024 at 3:13 PM Andres Freund <andres@anarazel.de> wrote: > > This is why I don't think that the tuples with lower page offset > > numbers are in any way significant here. The significant part is > > whether or not you'll actually need to visit more than one leaf page > > in the first place (plus the penalty from not being able to reorder > > the work across page boundaries in your initial v1 of prefetching). > > To me this your phrasing just seems to reformulate the issue. What I said to Tomas seems very obvious to me. I think that there might have been some kind of miscommunication (not a real disagreement). I was just trying to work through that. > In practical terms you'll have to wait for the full IO latency when fetching > the table tuple corresponding to the first tid on a leaf page. Of course > that's also the moment you had to visit another leaf page. Whether the stall > is due to visit another leaf page or due to processing the first entry on such > a leaf page is a distinction without a difference. I don't think anybody said otherwise? > > > That's certainly true / helpful, and it makes the "first entry" issue > > > much less common. But the issue is still there. Of course, this says > > > nothing about the importance of the issue - the impact may easily be so > > > small it's not worth worrying about. > > > > Right. And I want to be clear: I'm really *not* sure how much it > > matters. I just doubt that it's worth worrying about in v1 -- time > > grows short. Although I agree that we should commit a v1 that leaves > > the door open to improving matters in this area in v2. > > I somewhat doubt that it's realistic to aim for 17 at this point. That's a fair point. Tomas? > We seem to > still be doing fairly fundamental architectual work. I think it might be the > right thing even for 18 to go for the simpler only-a-single-leaf-page > approach though. I definitely think it's a good idea to have that as a fall back option. And to not commit ourselves to having something better than that for v1 (though we probably should commit to making that possible in v2). > I wonder if there are prerequisites that can be tackled for 17. One idea is to > work on infrastructure to provide executor nodes with information about the > number of tuples likely to be fetched - I suspect we'll trigger regressions > without that in place. I don't think that there'll be regressions if we just take the simpler only-a-single-leaf-page approach. At least it seems much less likely. > One way to *sometimes* process more than a single leaf page, without having to > redesign kill_prior_tuple, would be to use the visibilitymap to check if the > target pages are all-visible. If all the table pages on a leaf page are > all-visible, we know that we don't need to kill index entries, and thus can > move on to the next leaf page It's possible that we'll need a variety of different strategies. nbtree already has two such strategies in _bt_killitems(), in a way. Though its "Modified while not pinned means hinting is not safe" path (LSN doesn't match canary value path) seems pretty naive. The prefetching stuff might present us with a good opportunity to replace that with something fundamentally better. -- Peter Geoghegan
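The visibility-map idea mentioned above could look something like the sketch below (an illustration only, not code from any posted patch): before moving past a leaf page without keeping its pin, check whether every heap block referenced from that page is all-visible; if so, none of its heap tuples can be dead, so there are no LP_DEAD hints to lose by moving on.

    #include "postgres.h"
    #include "access/visibilitymap.h"
    #include "storage/bufmgr.h"
    #include "storage/itemptr.h"

    /*
     * Returns true if every heap block referenced by the given TIDs is
     * currently all-visible according to the visibility map, meaning
     * kill_prior_tuple could not have anything to mark for this leaf page.
     */
    static bool
    leaf_page_needs_no_hints(Relation heapRel, ItemPointerData *items, int nitems)
    {
        Buffer      vmbuffer = InvalidBuffer;
        bool        all_visible = true;

        for (int i = 0; i < nitems; i++)
        {
            BlockNumber blkno = ItemPointerGetBlockNumber(&items[i]);

            if (!VM_ALL_VISIBLE(heapRel, blkno, &vmbuffer))
            {
                all_visible = false;
                break;
            }
        }

        if (BufferIsValid(vmbuffer))
            ReleaseBuffer(vmbuffer);

        return all_visible;
    }

A concurrently cleared VM bit only means a hinting opportunity is missed, not that anything incorrect happens, so a somewhat stale answer is acceptable for this purpose.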
On Wed, Jan 24, 2024 at 7:13 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: [ > > (1) Melanie actually presented a very different way to implement this, > relying on the StreamingRead API. So chances are this struct won't > actually be used. Given lots of effort already spent on this and the fact that is thread is actually two: a. index/table prefetching since Jun 2023 till ~Jan 2024 b. afterwards index/table prefetching with Streaming API, but there are some doubts of whether it could happen for v17 [1] ... it would be pitty to not take benefits of such work (even if Streaming API wouldn't be ready for this; although there's lots of movement in the area), so I've played a little with with the earlier implementation from [2] without streaming API as it already received feedback, it demonstrated big benefits, and earlier it got attention on pgcon unconference. Perhaps, some of those comment might be passed later to the "b"-patch (once that's feasible): 1. v20240124-0001-Prefetch-heap-pages-during-index-scans.patch does not apply cleanly anymore, due show_buffer_usage() being quite recently refactored in 5de890e3610d5a12cdaea36413d967cf5c544e20 : patching file src/backend/commands/explain.c Hunk #1 FAILED at 3568. Hunk #2 FAILED at 3679. 2 out of 2 hunks FAILED -- saving rejects to file src/backend/commands/explain.c.rej 2. v2 applies (fixup), but it would nice to see that integrated into main patch (it adds IndexOnlyPrefetchInfo) into one patch 3. execMain.c : + * XXX It might be possible to improve the prefetching code to handle this + * by "walking back" the TID queue, but it's not clear if it's worth it. Shouldn't we just remove the XXX? The walking-back seems to be niche so are fetches using cursors when looking at real world users queries ? (support cases bias here when looking at peopel's pg_stat_activity) 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8, but base PREFETCH_LRU_COUNT on effective_io_concurrency instead? (allowing it to follow dynamically; the more prefetches the user wants to perform, the more you spread them across shared LRUs and the more memory for history is required?) + * XXX Maybe we could consider effective_cache_size when sizing the cache? + * Not to size the cache for that, ofc, but maybe as a guidance of how many + * heap pages it might keep. Maybe just a fraction fraction of the value, + * say Max(8MB, effective_cache_size / max_connections) or something. + */ +#define PREFETCH_LRU_SIZE 8 /* slots in one LRU */ +#define PREFETCH_LRU_COUNT 128 /* number of LRUs */ +#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT) BTW: + * heap pages it might keep. Maybe just a fraction fraction of the value, that's a duplicated "fraction" word over there. 5. + * XXX Could it be harmful that we read the queue backwards? Maybe memory + * prefetching works better for the forward direction? I wouldn't care, we are optimizing I/O (and context-switching) which weighs much more than memory access direction impact and Dilipi earlier also expressed no concern, so maybe it could be also removed (one less "XXX" to care about) 6. in IndexPrefetchFillQueue() + while (!PREFETCH_QUEUE_FULL(prefetch)) + { + IndexPrefetchEntry *entry + = prefetch->next_cb(scan, direction, prefetch->data); If we are at it... that's a strange split and assignment not indented :^) 7. in IndexPrefetchComputeTarget() + * XXX We cap the target to plan_rows, becausse it's pointless to prefetch + * more than we expect to use. 
That's a nice fact that's already in patch, so XXX isn't needed? 8. + * XXX Maybe we should reduce the value with parallel workers? I was assuming it could be a good idea, but the same doesn't seem (eic/actual_parallel_works_per_gather) to be performed for bitmap heap scan prefetches, so no? 9. + /* + * No prefetching for direct I/O. + * + * XXX Shouldn't we do prefetching even for direct I/O? We would only + * pretend doing it now, ofc, because we'd not do posix_fadvise(), but + * once the code starts loading into shared buffers, that'd work. + */ + if ((io_direct_flags & IO_DIRECT_DATA) != 0) + return 0; It's redundant (?) and could be removed as PrefetchBuffer()->PrefetchSharedBuffer() already has this at line 571: 5 #ifdef USE_PREFETCH 4 │ │ /* 3 │ │ │* Try to initiate an asynchronous read. This returns false in 2 │ │ │* recovery if the relation file doesn't exist. 1 │ │ │*/ 571 │ │ if ((io_direct_flags & IO_DIRECT_DATA) == 0 && 1 │ │ │ smgrprefetch(smgr_reln, forkNum, blockNum, 1)) 2 │ │ { 3 │ │ │ result.initiated_io = true; 4 │ │ } 5 #endif> > > > > > > /* USE_PREFETCH */ 11. in IndexPrefetchStats() and ExecReScanIndexScan() + * FIXME Should be only in debug builds, or something like that. + /* XXX Print some debug stats. Should be removed. */ + IndexPrefetchStats(indexScanDesc, node->iss_prefetch); Hmm, but it could be useful in tuning the real world systems, no? E.g. recovery prefetcher gives some info through pg_stat_recovery_prefetch view, but e.g. bitmap heap scans do not provide us with anything at all. I don't have a strong opinion. Exposing such stuff would take away your main doubt (XXX) from execPrefetch.c ``auto-tuning/self-adjustment". And if we are at it, we could think in far future about adding new session GUC track_cachestat or EXPLAIN (cachestat/prefetch, analyze) (this new syscall for Linux >= 6.5) where we could present both index stats (as what IndexPrefetchStats() does) *and* cachestat() results there for interested users. Of course it would have to be generic enough for the bitmap heap scan case too. Such insight would also allow fine tuning eic, PREFETCH_LRU_COUNT, PREFETCH_QUEUE_HISTORY. Just an idea. 12. + * XXX Maybe we should reduce the target in case this is a parallel index + * scan. We don't want to issue a multiple of effective_io_concurrency. in IndexOnlyPrefetchCleanup() and IndexNext() + * XXX Maybe we should reduce the value with parallel workers? It's redundant XXX-comment (there are two for the same), as you it was already there just before IndexPrefetchComputeTarget() 13. The previous bitmap prefetch code uses #ifdef USE_PREFETCH, maybe it would make some sense to follow the consistency pattern , to avoid adding implementation on platforms without prefetching ? 14. The patch is missing documentation, so how about just this? --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2527,7 +2527,8 @@ include_dir 'conf.d' operations that any individual <productname>PostgreSQL</productname> session attempts to initiate in parallel. The allowed range is 1 to 1000, or zero to disable issuance of asynchronous I/O requests. Currently, - this setting only affects bitmap heap scans. + this setting only enables prefetching for HEAP data blocks when performing + bitmap heap scans and index (only) scans. 
</para> Some further tests, given data: CREATE TABLE test (id bigint, val bigint, str text); ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL; INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int), 3000) FROM generate_series(1, 10000) g; -- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 + (10*random())::int), 3000) from (select 10000 * random() as r from generate_series(1, 10000)) x; VACUUM ANALYZE test; CREATE INDEX on test (id) ; 1. the patch correctly detects sequential access (e.g. we issue up to 6 fadvise() syscalls (8kB each) out and 17 preads() to heap fd for query like `SELECT sum(val) FROM test WHERE id BETWEEN 10 AND 2000;` -- offset of fadvise calls and pread match), so that's good. 2. Prefetching for TOASTed heap seems to be not implemented at all, correct? (Is my assumption that we should go like this: t_index->t->toast_idx->toast_heap)?, but I'm too newbie to actually see the code path where it could be added - certainly it's not blocker -- but maybe in commit message a list of improvements for future could be listed?): 2024-02-29 11:45:14.259 CET [11098] LOG: index prefetch stats: requests 1990 prefetches 17 (0.854271) skip cached 0 sequential 1973 2024-02-29 11:45:14.259 CET [11098] STATEMENT: SELECT md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000; fadvise64(37, 40960, 8192, POSIX_FADV_WILLNEED) = 0 pread64(50, "\0\0\0\0\350Jv\1\0\0\4\0(\0\0\10\0 \4 \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2998272) = 8192 pread64(49, "\0\0\0\0@Hw\1\0\0\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 40960) = 8192 pread64(50, "\0\0\0\0\2200v\1\0\0\4\0(\0\0\10\0 \4 \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2990080) = 8192 pread64(50, "\0\0\0\08\26v\1\0\0\4\0(\0\0\10\0 \4 \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2981888) = 8192 pread64(50, "\0\0\0\0\340\373u\1\0\0\4\0(\0\0\10\0 \4 \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2973696) = 8192 [..no fadvises for fd=50 which was pg_toast_rel..] 3. I'm not sure if I got good-enough results for DESCending index `create index on test (id DESC);`- with eic=16 it doesnt seem to be be able prefetch 16 blocks in advance? (e.g. 
highlight offset 557056 below in some text editor and it's distance is far lower between that fadvise<->pread): pread64(45, "\0\0\0\0x\305b\3\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 0) = 8192 fadvise64(45, 417792, 8192, POSIX_FADV_WILLNEED) = 0 pread64(45, "\0\0\0\0\370\330\235\4\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 417792) = 8192 fadvise64(45, 671744, 8192, POSIX_FADV_WILLNEED) = 0 fadvise64(45, 237568, 8192, POSIX_FADV_WILLNEED) = 0 pread64(45, "\0\0\0\08`]\5\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 671744) = 8192 fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0 fadvise64(45, 360448, 8192, POSIX_FADV_WILLNEED) = 0 pread64(45, "\0\0\0\0\200\357\25\4\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 237568) = 8192 fadvise64(45, 557056, 8192, POSIX_FADV_WILLNEED) = 0 fadvise64(45, 106496, 8192, POSIX_FADV_WILLNEED) = 0 pread64(45, "\0\0\0\0\240s\325\4\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 491520) = 8192 fadvise64(45, 401408, 8192, POSIX_FADV_WILLNEED) = 0 fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0 pread64(45, "\0\0\0\0\250\233r\4\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 360448) = 8192 fadvise64(45, 524288, 8192, POSIX_FADV_WILLNEED) = 0 fadvise64(45, 352256, 8192, POSIX_FADV_WILLNEED) = 0 pread64(45, "\0\0\0\0\240\342\6\5\0\0\4\0\370\1\0\2\0 \4 \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 557056) = 8192 -Jakub Wartak. [1] - https://www.postgresql.org/message-id/20240215201337.7amzw3hpvng7wphb%40awork3.anarazel.de [2] - https://www.postgresql.org/message-id/777e981c-bf0c-4eb9-a9e0-42d677e94327%40enterprisedb.com
Hi, Thanks for looking at the patch! On 3/1/24 09:20, Jakub Wartak wrote: > On Wed, Jan 24, 2024 at 7:13 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > [ >> >> (1) Melanie actually presented a very different way to implement this, >> relying on the StreamingRead API. So chances are this struct won't >> actually be used. > > Given lots of effort already spent on this and the fact that is thread > is actually two: > > a. index/table prefetching since Jun 2023 till ~Jan 2024 > b. afterwards index/table prefetching with Streaming API, but there > are some doubts of whether it could happen for v17 [1] > > ... it would be pitty to not take benefits of such work (even if > Streaming API wouldn't be ready for this; although there's lots of > movement in the area), so I've played a little with with the earlier > implementation from [2] without streaming API as it already received > feedback, it demonstrated big benefits, and earlier it got attention > on pgcon unconference. Perhaps, some of those comment might be passed > later to the "b"-patch (once that's feasible): > TBH I don't have a clear idea what to do. It'd be cool to have at least some benefits in v17, but I don't know how to do that in a way that would be useful in the future. For example, the v20240124 patch implements this in the executor, but based on the recent discussions it seems that's not the right layer - the index AM needs to have some control, and I'm not convinced it's possible to improve it in that direction (even ignoring the various issues we identified in the executor-based approach). I think it might be more practical to do this from the index AM, even if it has various limitations. Ironically, that's what I proposed at pgcon, but mostly because it was the quick&dirty way to do this. > 1. v20240124-0001-Prefetch-heap-pages-during-index-scans.patch does > not apply cleanly anymore, due show_buffer_usage() being quite > recently refactored in 5de890e3610d5a12cdaea36413d967cf5c544e20 : > > patching file src/backend/commands/explain.c > Hunk #1 FAILED at 3568. > Hunk #2 FAILED at 3679. > 2 out of 2 hunks FAILED -- saving rejects to file > src/backend/commands/explain.c.rej > > 2. v2 applies (fixup), but it would nice to see that integrated into > main patch (it adds IndexOnlyPrefetchInfo) into one patch > Yeah, but I think it was an old patch version, no point in rebasing that forever. Also, I'm not really convinced the executor-level approach is the right path forward. > 3. execMain.c : > > + * XXX It might be possible to improve the prefetching code > to handle this > + * by "walking back" the TID queue, but it's not clear if > it's worth it. > > Shouldn't we just remove the XXX? The walking-back seems to be niche > so are fetches using cursors when looking at real world users queries > ? (support cases bias here when looking at peopel's pg_stat_activity) > > 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8, > but base PREFETCH_LRU_COUNT on effective_io_concurrency instead? > (allowing it to follow dynamically; the more prefetches the user wants > to perform, the more you spread them across shared LRUs and the more > memory for history is required?) > > + * XXX Maybe we could consider effective_cache_size when sizing the cache? > + * Not to size the cache for that, ofc, but maybe as a guidance of how many > + * heap pages it might keep. Maybe just a fraction fraction of the value, > + * say Max(8MB, effective_cache_size / max_connections) or something. 
> + */ > +#define PREFETCH_LRU_SIZE 8 /* slots in one LRU */ > +#define PREFETCH_LRU_COUNT 128 /* number of LRUs */ > +#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * > PREFETCH_LRU_COUNT) > I don't see why would this be related to effective_io_concurrency? It's merely about how many recently accessed pages we expect to find in the page cache. It's entirely separate from the prefetch distance. > BTW: > + * heap pages it might keep. Maybe just a fraction fraction of the value, > that's a duplicated "fraction" word over there. > > 5. > + * XXX Could it be harmful that we read the queue backwards? > Maybe memory > + * prefetching works better for the forward direction? > > I wouldn't care, we are optimizing I/O (and context-switching) which > weighs much more than memory access direction impact and Dilipi > earlier also expressed no concern, so maybe it could be also removed > (one less "XXX" to care about) > Yeah, I think it's negligible. Probably a microoptimization we can investigate later, I don't want to complicate the code unnecessarily. > 6. in IndexPrefetchFillQueue() > > + while (!PREFETCH_QUEUE_FULL(prefetch)) > + { > + IndexPrefetchEntry *entry > + = prefetch->next_cb(scan, direction, prefetch->data); > > If we are at it... that's a strange split and assignment not indented :^) > > 7. in IndexPrefetchComputeTarget() > > + * XXX We cap the target to plan_rows, becausse it's pointless to prefetch > + * more than we expect to use. > > That's a nice fact that's already in patch, so XXX isn't needed? > Right, which is why it's not a TODO/FIXME. But I think it's good to point this out - I'm not 100% convinced we should be using plan_rows like this (because what happens if the estimate happens to be wrong?). > 8. > + * XXX Maybe we should reduce the value with parallel workers? > > I was assuming it could be a good idea, but the same doesn't seem > (eic/actual_parallel_works_per_gather) to be performed for bitmap heap > scan prefetches, so no? > Yeah, if we don't do that now, I'm not sure this patch should change that behavior. > 9. > + /* > + * No prefetching for direct I/O. > + * > + * XXX Shouldn't we do prefetching even for direct I/O? We would only > + * pretend doing it now, ofc, because we'd not do posix_fadvise(), but > + * once the code starts loading into shared buffers, that'd work. > + */ > + if ((io_direct_flags & IO_DIRECT_DATA) != 0) > + return 0; > > It's redundant (?) and could be removed as > PrefetchBuffer()->PrefetchSharedBuffer() already has this at line 571: > > 5 #ifdef USE_PREFETCH > 4 │ │ /* > 3 │ │ │* Try to initiate an asynchronous read. This > returns false in > 2 │ │ │* recovery if the relation file doesn't exist. > 1 │ │ │*/ > 571 │ │ if ((io_direct_flags & IO_DIRECT_DATA) == 0 && > 1 │ │ │ smgrprefetch(smgr_reln, forkNum, blockNum, 1)) > 2 │ │ { > 3 │ │ │ result.initiated_io = true; > 4 │ │ } > 5 #endif> > > > > > > /* USE_PREFETCH */ > Yeah, I think it might be redundant. I think it allowed skipping a bunch things without prefetching (like initialization of the prefetcher), but after the reworks that's no longer true. > 11. in IndexPrefetchStats() and ExecReScanIndexScan() > > + * FIXME Should be only in debug builds, or something like that. > > + /* XXX Print some debug stats. Should be removed. */ > + IndexPrefetchStats(indexScanDesc, node->iss_prefetch); > > Hmm, but it could be useful in tuning the real world systems, no? E.g. > recovery prefetcher gives some info through pg_stat_recovery_prefetch > view, but e.g. 
bitmap heap scans do not provide us with anything at > all. I don't have a strong opinion. Exposing such stuff would take > away your main doubt (XXX) from execPrefetch.c You're right it'd be good to collect/expose such statistics, to help with monitoring/tuning, etc. But I think there are better / more convenient ways to do this - exposing that in EXPLAIN, and adding a counter to pgstat_all_tables / pgstat_all_indexes. > ``auto-tuning/self-adjustment". And if we are at it, we could think in > far future about adding new session GUC track_cachestat or EXPLAIN > (cachestat/prefetch, analyze) (this new syscall for Linux >= 6.5) > where we could present both index stats (as what IndexPrefetchStats() > does) *and* cachestat() results there for interested users. Of course > it would have to be generic enough for the bitmap heap scan case too. > Such insight would also allow fine tuning eic, PREFETCH_LRU_COUNT, > PREFETCH_QUEUE_HISTORY. Just an idea. > I haven't really thought about this, but I agree some auto-tuning would be very helpful (assuming it's sufficiently reliable). > 12. > > + * XXX Maybe we should reduce the target in case this is > a parallel index > + * scan. We don't want to issue a multiple of > effective_io_concurrency. > > in IndexOnlyPrefetchCleanup() and IndexNext() > > + * XXX Maybe we should reduce the value with parallel workers? > > It's redundant XXX-comment (there are two for the same), as you it was > already there just before IndexPrefetchComputeTarget() > > 13. The previous bitmap prefetch code uses #ifdef USE_PREFETCH, maybe > it would make some sense to follow the consistency pattern , to avoid > adding implementation on platforms without prefetching ? > Perhaps, but I'm not sure how to do that with the executor-based approach, where essentially everything goes through the prefetch queue (except that the prefetch distance is 0). So the amount of code that would be disabled by the ifdef would be tiny. > 14. The patch is missing documentation, so how about just this? > > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -2527,7 +2527,8 @@ include_dir 'conf.d' > operations that any individual > <productname>PostgreSQL</productname> session > attempts to initiate in parallel. The allowed range is 1 to 1000, > or zero to disable issuance of asynchronous I/O requests. Currently, > - this setting only affects bitmap heap scans. > + this setting only enables prefetching for HEAP data blocks > when performing > + bitmap heap scans and index (only) scans. > </para> > > Some further tests, given data: > > CREATE TABLE test (id bigint, val bigint, str text); > ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL; > INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int), > 3000) FROM generate_series(1, 10000) g; > -- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 + > (10*random())::int), 3000) from (select 10000 * random() as r from > generate_series(1, 10000)) x; > VACUUM ANALYZE test; > CREATE INDEX on test (id) ; > It's not clear to me what's the purpose of this test? Can you explain? > 1. the patch correctly detects sequential access (e.g. we issue up to > 6 fadvise() syscalls (8kB each) out and 17 preads() to heap fd for > query like `SELECT sum(val) FROM test WHERE id BETWEEN 10 AND 2000;` > -- offset of fadvise calls and pread match), so that's good. > > 2. Prefetching for TOASTed heap seems to be not implemented at all, > correct? 
(Is my assumption that we should go like this: > t_index->t->toast_idx->toast_heap)?, but I'm too newbie to actually > see the code path where it could be added - certainly it's not blocker > -- but maybe in commit message a list of improvements for future could > be listed?): > Yes, that's true. I haven't thought about TOAST very much, but with prefetching happening in executor, that does not work. There'd need to be some extra code for TOAST prefetching. I'm not sure how beneficial that would be, considering most TOAST values tend to be stored on consecutive heap pages. > 2024-02-29 11:45:14.259 CET [11098] LOG: index prefetch stats: > requests 1990 prefetches 17 (0.854271) skip cached 0 sequential 1973 > 2024-02-29 11:45:14.259 CET [11098] STATEMENT: SELECT > md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000; > > fadvise64(37, 40960, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(50, "\0\0\0\0\350Jv\1\0\0\4\0(\0\0\10\0 \4 > \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2998272) = 8192 > pread64(49, "\0\0\0\0@Hw\1\0\0\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 > \0\320\237 \0"..., 8192, 40960) = 8192 > pread64(50, "\0\0\0\0\2200v\1\0\0\4\0(\0\0\10\0 \4 > \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2990080) = 8192 > pread64(50, "\0\0\0\08\26v\1\0\0\4\0(\0\0\10\0 \4 > \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2981888) = 8192 > pread64(50, "\0\0\0\0\340\373u\1\0\0\4\0(\0\0\10\0 \4 > \0\0\0\0\20\230\340\17\0\224 \10"..., 8192, 2973696) = 8192 > [..no fadvises for fd=50 which was pg_toast_rel..] > > 3. I'm not sure if I got good-enough results for DESCending index > `create index on test (id DESC);`- with eic=16 it doesnt seem to be > be able prefetch 16 blocks in advance? (e.g. highlight offset 557056 > below in some text editor and it's distance is far lower between that > fadvise<->pread): > > pread64(45, "\0\0\0\0x\305b\3\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 0) = 8192 > fadvise64(45, 417792, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(45, "\0\0\0\0\370\330\235\4\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 417792) = 8192 > fadvise64(45, 671744, 8192, POSIX_FADV_WILLNEED) = 0 > fadvise64(45, 237568, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(45, "\0\0\0\08`]\5\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 671744) = 8192 > fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0 > fadvise64(45, 360448, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(45, "\0\0\0\0\200\357\25\4\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 237568) = 8192 > fadvise64(45, 557056, 8192, POSIX_FADV_WILLNEED) = 0 > fadvise64(45, 106496, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(45, "\0\0\0\0\240s\325\4\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 491520) = 8192 > fadvise64(45, 401408, 8192, POSIX_FADV_WILLNEED) = 0 > fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(45, "\0\0\0\0\250\233r\4\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 360448) = 8192 > fadvise64(45, 524288, 8192, POSIX_FADV_WILLNEED) = 0 > fadvise64(45, 352256, 8192, POSIX_FADV_WILLNEED) = 0 > pread64(45, "\0\0\0\0\240\342\6\5\0\0\4\0\370\1\0\2\0 \4 > \0\0\0\0\300\237t\0\200\237t\0"..., 8192, 557056) = 8192 > I'm not sure I understand these strace snippets. Can you elaborate a bit, explain what the strace log says? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
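To make the PREFETCH_LRU_SIZE / PREFETCH_LRU_COUNT discussion from earlier in this message more concrete, here is a hedged sketch of the kind of "recently prefetched blocks" filter those constants size; the struct and function names are invented and the constants are only illustrative:

#include "postgres.h"
#include "storage/block.h"

#define SKETCH_LRU_SIZE     8       /* slots in one LRU (illustrative) */
#define SKETCH_LRU_COUNT    128     /* number of LRUs (illustrative) */

typedef struct PrefetchCacheEntrySketch
{
    BlockNumber block;
    uint64      request;        /* prefetch request counter when last seen */
} PrefetchCacheEntrySketch;

/*
 * Invented-name sketch: return true if blkno was prefetched recently enough
 * that another fadvise would be pointless; otherwise remember it, evicting
 * the least recently used slot of its LRU.
 */
static bool
check_and_remember(PrefetchCacheEntrySketch *cache, uint64 request_no,
                   BlockNumber blkno)
{
    int         lru = blkno % SKETCH_LRU_COUNT;
    PrefetchCacheEntrySketch *slots = &cache[lru * SKETCH_LRU_SIZE];
    int         oldest = 0;

    for (int i = 0; i < SKETCH_LRU_SIZE; i++)
    {
        if (slots[i].block == blkno)
        {
            slots[i].request = request_no;
            return true;
        }
        if (slots[i].request < slots[oldest].request)
            oldest = i;
    }

    slots[oldest].block = blkno;
    slots[oldest].request = request_no;
    return false;
}

The point of such a filter is to avoid issuing fadvise for a block that was prefetched (or read) a moment ago and is therefore very likely still cached, which is why its sizing is a separate question from the prefetch distance.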
On 2/15/24 21:30, Peter Geoghegan wrote: > On Thu, Feb 15, 2024 at 3:13 PM Andres Freund <andres@anarazel.de> wrote: >>> This is why I don't think that the tuples with lower page offset >>> numbers are in any way significant here. The significant part is >>> whether or not you'll actually need to visit more than one leaf page >>> in the first place (plus the penalty from not being able to reorder >>> the work across page boundaries in your initial v1 of prefetching). >> >> To me this your phrasing just seems to reformulate the issue. > > What I said to Tomas seems very obvious to me. I think that there > might have been some kind of miscommunication (not a real > disagreement). I was just trying to work through that. > >> In practical terms you'll have to wait for the full IO latency when fetching >> the table tuple corresponding to the first tid on a leaf page. Of course >> that's also the moment you had to visit another leaf page. Whether the stall >> is due to visit another leaf page or due to processing the first entry on such >> a leaf page is a distinction without a difference. > > I don't think anybody said otherwise? > >>>> That's certainly true / helpful, and it makes the "first entry" issue >>>> much less common. But the issue is still there. Of course, this says >>>> nothing about the importance of the issue - the impact may easily be so >>>> small it's not worth worrying about. >>> >>> Right. And I want to be clear: I'm really *not* sure how much it >>> matters. I just doubt that it's worth worrying about in v1 -- time >>> grows short. Although I agree that we should commit a v1 that leaves >>> the door open to improving matters in this area in v2. >> >> I somewhat doubt that it's realistic to aim for 17 at this point. > > That's a fair point. Tomas? > I think that's a fair assessment. To me it seems doing the prefetching solely at the executor level is not really workable. And if it can be made to work, there's far too many open questions to do that in the last commitfest. I think the consensus is at least some of the logic/control needs to move back to the index AM. Maybe there's some minimal part that we could do for v17, even if it has various limitations, and then improve that in v18. Say, doing the leaf-page-at-a-time and passing a little bit of information from the index scan to drive this. But I have very hard time figuring out what the MVP version should be, because I have very limited understanding on how much control the index AM ought to have :-( And it'd be a bit silly to do something in v17, only to have to rip it out in v18 because it turned out to not get the split right. >> We seem to >> still be doing fairly fundamental architectual work. I think it might be the >> right thing even for 18 to go for the simpler only-a-single-leaf-page >> approach though. > > I definitely think it's a good idea to have that as a fall back > option. And to not commit ourselves to having something better than > that for v1 (though we probably should commit to making that possible > in v2). > Yeah, I agree with that. >> I wonder if there are prerequisites that can be tackled for 17. One idea is to >> work on infrastructure to provide executor nodes with information about the >> number of tuples likely to be fetched - I suspect we'll trigger regressions >> without that in place. > > I don't think that there'll be regressions if we just take the simpler > only-a-single-leaf-page approach. At least it seems much less likely. 
> I'm sure we could pass additional information from the index scans to improve that further. But I think the gradual ramp-up would deal with most regressions. At least that's my experience from benchmarking the early version. The hard thing is what to do about cases where neither of this helps. The example I keep thinking about is IOS - if we don't do prefetching, it's not hard to construct cases where regular index scan gets much faster than IOS (with many not-all-visible pages). But we can't just prefetch all pages, because that'd hurt IOS cases with most pages fully visible (when we don't need to actually access the heap). I managed to deal with this in the executor-level version, but I'm not sure how to do this if the control moves closer to the index AM. >> One way to *sometimes* process more than a single leaf page, without having to >> redesign kill_prior_tuple, would be to use the visibilitymap to check if the >> target pages are all-visible. If all the table pages on a leaf page are >> all-visible, we know that we don't need to kill index entries, and thus can >> move on to the next leaf page > > It's possible that we'll need a variety of different strategies. > nbtree already has two such strategies in _bt_killitems(), in a way. > Though its "Modified while not pinned means hinting is not safe" path > (LSN doesn't match canary value path) seems pretty naive. The > prefetching stuff might present us with a good opportunity to replace > that with something fundamentally better. > No opinion. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
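One way to picture the IOS trade-off described above: the executor-level version consulted the visibility map while prefetching, so only heap pages that will actually be visited get an fadvise. A minimal sketch, assuming a helper along these lines (ios_maybe_prefetch is an invented name; VM_ALL_VISIBLE and PrefetchBuffer are existing APIs):

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Invented-name sketch: during an index-only scan, only prefetch heap
 * blocks whose pages are not all-visible, because all-visible pages are
 * never fetched from the heap at all.  The caller owns *vmbuffer and must
 * release it when the scan ends.
 */
static void
ios_maybe_prefetch(Relation heapRel, BlockNumber blkno, Buffer *vmbuffer)
{
    if (!VM_ALL_VISIBLE(heapRel, blkno, vmbuffer))
        PrefetchBuffer(heapRel, MAIN_FORKNUM, blkno);
}

The open question in the thread is where such a check would live once more of the control moves into the index AM.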
On Fri, Mar 1, 2024 at 10:18 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > But I have very hard time figuring out what the MVP version should be, > because I have very limited understanding on how much control the index > AM ought to have :-( And it'd be a bit silly to do something in v17, > only to have to rip it out in v18 because it turned out to not get the > split right. I suspect that you're overestimating the difficulty of getting the layering right (at least relative to the difficulty of everything else). The executor proper doesn't know anything about pins on leaf pages (and in reality nbtree usually doesn't hold any pins these days). All the executor knows is that it had better not be possible for an in-flight index scan to get confused by concurrent TID recycling by VACUUM. When amgettuple/btgettuple is called, nbtree usually just returns TIDs it collected from a just-scanned leaf page. This sort of stuff already lives in the index AM. It seems to me that everything at the API and executor level can continue to work in essentially the same way as it always has, with only minimal revision to the wording around buffer pins (in fact that really should have happened back in 2015, as part of commit 2ed5b87f). The hard part will be figuring out how to make the physical index scan prefetch optimally, in a way that balances various considerations. These include: * Managing heap prefetch distance. * Avoiding making kill_prior_tuple significantly less effective (perhaps the new design could even make it more effective, in some scenarios, by holding onto multiple buffer pins based on a dynamic model). * Figuring out how many leaf pages it makes sense to read ahead of accessing the heap, since there is no fixed relationship between the number of leaf pages we need to scan to collect a given number of distinct heap blocks that we need for prefetching. (This is made more complicated by things like LIMIT, but is actually an independent problem.) So I think that you need to teach index AMs to behave roughly as if multiple leaf pages were read as one single leaf page, at least in terms of things like how the BTScanOpaqueData.currPos state is managed. I imagine that currPos will need to be filled with TIDs from multiple index pages, instead of just one, with entries that are organized in a way that preserves the illusion of one continuous scan from the point of view of the executor proper. By the time we actually start really returning TIDs via btgettuple, it looks like we scanned one giant leaf page instead of several (the exact number of leaf pages scanned will probably have to be indeterminate, because it'll depend on things like heap prefetch distance). The good news (assuming that I'm right here) is that you don't need to have specific answers to most of these questions in order to commit a v1 of index prefeteching. ISTM that all you really need is to have confidence that the general approach that I've outlined is the right approach, long term (certainly not nothing, but I'm at least reasonably confident here). > The hard thing is what to do about cases where neither of this helps. > The example I keep thinking about is IOS - if we don't do prefetching, > it's not hard to construct cases where regular index scan gets much > faster than IOS (with many not-all-visible pages). But we can't just > prefetch all pages, because that'd hurt IOS cases with most pages fully > visible (when we don't need to actually access the heap). 
> > I managed to deal with this in the executor-level version, but I'm not > sure how to do this if the control moves closer to the index AM. The reality is that nbtree already knows about index-only scans. It has to, because it wouldn't be safe to drop the pin on a leaf page's buffer when the scan is "between pages" in the specific case of index-only scans (so the _bt_killitems code path used when kill_prior_tuple has index tuples to kill knows about index-only scans). I actually added commentary to the nbtree README that goes into TID recycling by VACUUM not too long ago. This includes stuff about how LP_UNUSED items in the heap are considered dead to all index scans (which can actually try to look at a TID that just became LP_UNUSED in the heap!), even though LP_UNUSED items don't prevent VACUUM from setting heap pages all-visible. This seemed like the only way of explaining the _bt_killitems IOS issue, that actually seemed to make sense. What you really want to do here is to balance costs and benefits. That's just what's required. The fact that those costs and benefits span multiple levels of abstractions makes it a bit awkward, but doesn't (and can't) change the basic shape of the problem. -- Peter Geoghegan
On Fri, Mar 1, 2024 at 3:58 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: [..] > TBH I don't have a clear idea what to do. It'd be cool to have at least > some benefits in v17, but I don't know how to do that in a way that > would be useful in the future. > > For example, the v20240124 patch implements this in the executor, but > based on the recent discussions it seems that's not the right layer - > the index AM needs to have some control, and I'm not convinced it's > possible to improve it in that direction (even ignoring the various > issues we identified in the executor-based approach). > > I think it might be more practical to do this from the index AM, even if > it has various limitations. Ironically, that's what I proposed at pgcon, > but mostly because it was the quick&dirty way to do this. ... that's a pity! :( Well, then let's just finish that subthread, I gave some explanations, but I'll try to take a look in future revisions. > > 4. Wouldn't it be better to leave PREFETCH_LRU_SIZE at static of 8, > > but base PREFETCH_LRU_COUNT on effective_io_concurrency instead? > > (allowing it to follow dynamically; the more prefetches the user wants > > to perform, the more you spread them across shared LRUs and the more > > memory for history is required?) > > > > + * XXX Maybe we could consider effective_cache_size when sizing the cache? > > + * Not to size the cache for that, ofc, but maybe as a guidance of how many > > + * heap pages it might keep. Maybe just a fraction fraction of the value, > > + * say Max(8MB, effective_cache_size / max_connections) or something. > > + */ > > +#define PREFETCH_LRU_SIZE 8 /* slots in one LRU */ > > +#define PREFETCH_LRU_COUNT 128 /* number of LRUs */ > > +#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * > > PREFETCH_LRU_COUNT) > > > > I don't see why would this be related to effective_io_concurrency? It's > merely about how many recently accessed pages we expect to find in the > page cache. It's entirely separate from the prefetch distance. Well, my thought was the higher eic is - the more I/O parallelism we are introducing - in such a case, the more requests we need to remember from the past to avoid prefetching the same (N * eic, where N would be some multiplier) > > 7. in IndexPrefetchComputeTarget() > > > > + * XXX We cap the target to plan_rows, becausse it's pointless to prefetch > > + * more than we expect to use. > > > > That's a nice fact that's already in patch, so XXX isn't needed? > > > > Right, which is why it's not a TODO/FIXME. OH! That explains it to me. I've taken all of the XXXs as literally FIXME that you wanted to go away (things to be removed before the patch is considered mature). > But I think it's good to > point this out - I'm not 100% convinced we should be using plan_rows > like this (because what happens if the estimate happens to be wrong?). Well, somewhat similiar problematic pattern was present in different codepath - get_actual_variable_endpoint() - see [1], 9c6ad5eaa95. So the final fix was to get away without adding new GUC (which always an option...), but just introduce a sensible hard-limit (fence) and stick to the 100 heap visited pages limit. Here we could have similiar heuristics same from start: if (plan_rows < we_have_already_visited_pages * avgRowsPerBlock) --> ignore plan_rows and rampup prefetches back to the full eic value. 
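A rough sketch of the fence heuristic proposed above, analogous to the hard limit used in get_actual_variable_endpoint(); every name here is invented and the exact condition is only illustrative:

#include "postgres.h"

/*
 * Invented-name sketch of the fence: once the scan has visited enough heap
 * pages to prove plan_rows was an underestimate, stop capping the prefetch
 * target by it and ramp back up to the effective_io_concurrency-derived
 * maximum.
 */
static int
adjust_prefetch_target(double plan_rows, uint64 pages_visited,
                       double avg_rows_per_block,
                       int current_target, int max_target)
{
    if (plan_rows < pages_visited * avg_rows_per_block)
        return max_target;

    return current_target;
}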
> > Some further tests, given data: > > > > CREATE TABLE test (id bigint, val bigint, str text); > > ALTER TABLE test ALTER COLUMN str SET STORAGE EXTERNAL; > > INSERT INTO test SELECT g, g, repeat(chr(65 + (10*random())::int), > > 3000) FROM generate_series(1, 10000) g; > > -- or INSERT INTO test SELECT x.r, x.r, repeat(chr(65 + > > (10*random())::int), 3000) from (select 10000 * random() as r from > > generate_series(1, 10000)) x; > > VACUUM ANALYZE test; > > CREATE INDEX on test (id) ; > > > > It's not clear to me what's the purpose of this test? Can you explain? It's just schema&data preparation for the tests below: > > > > 2. Prefetching for TOASTed heap seems to be not implemented at all, > > correct? (Is my assumption that we should go like this: > > t_index->t->toast_idx->toast_heap)?, but I'm too newbie to actually > > see the code path where it could be added - certainly it's not blocker > > -- but maybe in commit message a list of improvements for future could > > be listed?): > > > > Yes, that's true. I haven't thought about TOAST very much, but with > prefetching happening in executor, that does not work. There'd need to > be some extra code for TOAST prefetching. I'm not sure how beneficial > that would be, considering most TOAST values tend to be stored on > consecutive heap pages. Assuming that in the above I've generated data using cyclic / random version and I run: SELECT md5(string_agg(md5(str),',')) FROM test WHERE id BETWEEN 10 AND 2000; (btw: I wanted to use octet_length() at first instead of string_agg() but that's not enough) where fd 45,54,55 correspond to : lrwx------ 1 postgres postgres 64 Mar 5 12:56 /proc/8221/fd/45 -> /tmp/blah/base/5/16384 // "test" lrwx------ 1 postgres postgres 64 Mar 5 12:56 /proc/8221/fd/54 -> /tmp/blah/base/5/16388 // "pg_toast_16384_index" lrwx------ 1 postgres postgres 64 Mar 5 12:56 /proc/8221/fd/55 -> /tmp/blah/base/5/16387 // "pg_toast_16384" I've got for the following data: - 83 pread64 and 83x fadvise() for random offsets for fd=45 - the main intent of this patch (main relation heap prefetching), works good - 54 pread64 calls for fd=54 (no favdises()) - 1789 (!) calls to pread64 for fd=55 for RANDOM offsets (TOAST heap, no prefetch) so at least in theory it makes a lot of sense to prefetch TOAST too, pattern looks like cyclic random: // pread(fd, "", blocksz, offset) fadvise64(45, 40960, 8192, POSIX_FADV_WILLNEED) = 0 pread64(55, ""..., 8192, 38002688) = 8192 pread64(55, ""..., 8192, 12034048) = 8192 pread64(55, ""..., 8192, 36560896) = 8192 pread64(55, ""..., 8192, 8871936) = 8192 pread64(55, ""..., 8192, 17965056) = 8192 pread64(55, ""..., 8192, 18710528) = 8192 pread64(55, ""..., 8192, 35635200) = 8192 pread64(55, ""..., 8192, 23379968) = 8192 pread64(55, ""..., 8192, 25141248) = 8192 pread64(55, ""..., 8192, 3457024) = 8192 pread64(55, ""..., 8192, 24633344) = 8192 pread64(55, ""..., 8192, 36462592) = 8192 pread64(55, ""..., 8192, 18120704) = 8192 pread64(55, ""..., 8192, 27066368) = 8192 pread64(45, ""..., 8192, 40960) = 8192 pread64(55, ""..., 8192, 2768896) = 8192 pread64(55, ""..., 8192, 10846208) = 8192 pread64(55, ""..., 8192, 30179328) = 8192 pread64(55, ""..., 8192, 7700480) = 8192 pread64(55, ""..., 8192, 38846464) = 8192 pread64(55, ""..., 8192, 1040384) = 8192 pread64(55, ""..., 8192, 10985472) = 8192 It's probably a separate feature (prefetching blocks from TOAST), but it could be mentioned that this patch is not doing that (I was assuming it could). > > 3. 
I'm not sure if I got good-enough results for DESCending index > > `create index on test (id DESC);`- with eic=16 it doesnt seem to be > > be able prefetch 16 blocks in advance? (e.g. highlight offset 557056 > > below in some text editor and it's distance is far lower between that > > fadvise<->pread): > > [..] > > > > I'm not sure I understand these strace snippets. Can you elaborate a > bit, explain what the strace log says? set enable_seqscan to off; set enable_bitmapscan to off; drop index test_id_idx; create index on test (id DESC); -- DESC one SELECT sum(val) FROM test WHERE id BETWEEN 10 AND 2000; Ok, so cleaner output of strace -s 0 for PID doing that SELECT with eic=16, annotated with [*]: lseek(45, 0, SEEK_END) = 688128 lseek(47, 0, SEEK_END) = 212992 pread64(47, ""..., 8192, 172032) = 8192 pread64(45, ""..., 8192, 90112) = 8192 fadvise64(45, 172032, 8192, POSIX_FADV_WILLNEED) = 0 pread64(45, ""..., 8192, 172032) = 8192 fadvise64(45, 319488, 8192, POSIX_FADV_WILLNEED) = 0 [*off 319488 start] fadvise64(45, 335872, 8192, POSIX_FADV_WILLNEED) = 0 pread64(45, ""..., 8192, 319488) = 8192 [*off 319488, read, distance=1 fadvises] fadvise64(45, 466944, 8192, POSIX_FADV_WILLNEED) = 0 fadvise64(45, 393216, 8192, POSIX_FADV_WILLNEED) = 0 pread64(45, ""..., 8192, 335872) = 8192 fadvise64(45, 540672, 8192, POSIX_FADV_WILLNEED) = 0 [*off 540672 start] fadvise64(45, 262144, 8192, POSIX_FADV_WILLNEED) = 0 pread64(45, ""..., 8192, 466944) = 8192 fadvise64(45, 491520, 8192, POSIX_FADV_WILLNEED) = 0 pread64(45, ""..., 8192, 393216) = 8192 fadvise64(45, 163840, 8192, POSIX_FADV_WILLNEED) = 0 fadvise64(45, 385024, 8192, POSIX_FADV_WILLNEED) = 0 pread64(45, ""..., 8192, 540672) = 8192 [*off 540672, read, distance=4 fadvises] fadvise64(45, 417792, 8192, POSIX_FADV_WILLNEED) = 0 [..] I was wondering why the distance never got >4 in such case for eic=16, it should spawn more fadvises calls, shouldn't it? (it was happening only for DESC, in normal ASC index the prefetching distance easily achieves ~~ eic values) and I think today i've got the answer -- after dropping/creating DESC index I did NOT execute ANALYZE so probably the Min(..., plan_rows) was kicking in and preventing the full prefetching. Hitting above, makes me think that the XXX for plan_rows , should really be real-FIXME. -J. [1] - https://www.postgresql.org/message-id/CAKZiRmznOwi0oaV%3D4PHOCM4ygcH4MgSvt8%3D5cu_vNCfc8FSUug%40mail.gmail.com
On Wed, Nov 6, 2024 at 12:25 PM Tomas Vondra <tomas@vondra.me> wrote: > Attached is an updated version of this patch series. The first couple > parts (adding batching + updating built-in index AMs) remain the same, > the new part is 0007 which switches index scans to read stream API. The first thing that I notice about this patch series is that it doesn't fully remove amgettuple as a concept. That seems a bit odd to me. After all, you've invented a single page batching mechanism, which is duplicative of the single page batching mechanism that each affected index AM has to use already, just to be able to allow the amgettuple interface to iterate backwards and forwards with a scrollable cursor (and to make mark/restore work). ISTM that you have one too many batching interfaces here. I can think of nothing that makes the task of completely replacing amgettuple particularly difficult. I don't think that the need to do the _bt_killitems stuff actually makes this task all that much harder. It will need to be generalized, too, by keeping track of multiple BTScanOpaqueData.killedItems[] style states, each of which is associated with its own page-level currPos state. But that's not rocket science. (Also don't think that mark/restore support is all that hard.) The current way in which _bt_kill_batch() is called from _bt_steppage() by the patch seems weird to me. You're copying what you actually know to be the current page's kill items such that _bt_steppage() will magically do what it does already when the amgetttuple/btgettuple interface is in use, just as we're stepping off the page. It seems to be working at the wrong level. Notice that the current way of doing things in your patch means that your new batching interface tacitly knows about the nbtree batching interface, and that it too works along page boundaries -- that's the only reason why it can hook into _bt_steppage like this in the first place. Things are way too tightly coupled, and the old and new way of doing things are hopelessly intertwined. What's being abstracted away here, really? I suspect that _bt_steppage() shouldn't be calling _bt_kill_batch() at all -- nor should it even call _bt_killitems(). Things need to be broken down into smaller units of work that can be reordered, instead. The first half of the current _bt_steppage() function deals with finishing off the current leaf page should be moved to some other function -- let's call it _bt_finishpage. A new callback should be called as part of the new API when the time comes to tell nbtree that we're now done with a given leaf page -- that's what this new _bt_finishpage function is for. All that remains of _bt_steppage() are the parts that deal with figuring out which page should be visited next -- the second half of _bt_steppage stays put. That way stepping to the next page and reading multiple pages can be executed as eagerly as makes sense -- we don't need to "coordinate" the heap accesses in lockstep with the leaf page accesses. Maybe you won't take advantage of this flexibility right away, but ISTM that you need nominal support for this kind of reordering to make the new API really make sense. There are some problems with this scheme, but they seem reasonably tractable to me. We already have strategies for dealing with the risk of concurrent TID recycling when _bt_killitems is called with some maybe-recycled TIDs -- we're already dropping the pin on the leaf page early in many cases. I've pointed this out many times already (again, see _bt_drop_lock_and_maybe_pin). 
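To make the proposed split a bit more concrete, here is a rough sketch of the "finish off the current leaf page" half; _bt_finishpage is the name proposed above, not existing code, and details such as locking and exactly when currPos may be invalidated are glossed over:

#include "postgres.h"
#include "access/nbtree.h"

/*
 * Sketch only: the part of _bt_steppage() that finishes off the current
 * leaf page, split out so it can run independently of deciding which page
 * to visit next.
 */
static void
_bt_finishpage(IndexScanDesc scan)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;

    /* set LP_DEAD hints for entries the scan reported as dead */
    if (so->numKilled > 0)
        _bt_killitems(scan);

    /* release the pin on the finished leaf page, if one is still held */
    BTScanPosUnpinIfPinned(so->currPos);
    BTScanPosInvalidate(so->currPos);
}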
It's true that we're still going to have to hold onto a buffer pin on leaf pages whose TIDs haven't all been read from the table AM side yet, unless we know that it's a case where that's safe for other reasons -- otherwise index-only scans might give wrong answers. But that other problem shouldn't be confused with the _bt_killitems problem, just because of the superficial similarity around holding onto a leaf page pin. To repeat: it is important that you not conflate the problems on the table AM side (TID recycle safety for index scans) with the problems on the index AM side (safely setting LP_DEAD bits in _bt_killitems). They're two separate problems that are currently dealt with as one problem on the nbtree side -- but that isn't fundamental. Teasing them apart seems likely to be helpful here. > I speculated that with the batching concept it might work better, and I > think that turned out to be the case. The batching is still the core > idea, giving the index AM enough control to make kill tuples work (by > not generating batches spanning multiple leaf pages, or doing something > smarter). And the read stream leverages that too - the next_block > callback returns items from the current batch, and the stream is reset > between batches. This is the same prefetch restriction as with the > explicit prefetching (done using posix_fadvise), except that the > prefetching is done by the read stream. ISTM that the central feature of the new API should be the ability to reorder certain kinds of work. There will have to be certain constraints, of course. Sometimes these will principally be problems for the table AM (e.g., we musn't allow concurrent TID recycling unless it's for a plain index scan using an MVCC snapshot), other times they're principally problems for the index AM (e.g., the _bt_killitems safety issues). I get that you're not that excited about multi-page batches; it's not the priority. Fair enough. I just think that the API needs to work in terms of batches that are sized as one or more pages, in order for it to make sense. BTW, the README changes you made are slightly wrong about pins and locks. We don't actually keep around C pointers to IndexTuples for index-only scans that point into shared memory -- that won't work. We simply copy whatever IndexTuples the scan returns into local state, associated with so->currPos. So that isn't a complicating factor, at all. That's all I have right now. Hope it helps. -- Peter Geoghegan
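For reference, the read-stream arrangement quoted above boils down to a block-number callback that only ever hands out heap blocks from the current batch; IndexScanBatchSketch and batch_next_block are invented names, while ReadStream and the callback signature come from the existing read stream API:

#include "postgres.h"
#include "storage/itemptr.h"
#include "storage/read_stream.h"

/* invented stand-in for the patch's per-batch state */
typedef struct IndexScanBatchSketch
{
    int         nextItem;
    int         nitems;
    ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
} IndexScanBatchSketch;

/*
 * Invented-name sketch of a read stream block-number callback: it only
 * hands out heap blocks from the current batch; InvalidBlockNumber ends
 * the stream, which is then reset for the next batch.
 */
static BlockNumber
batch_next_block(ReadStream *stream, void *callback_private_data,
                 void *per_buffer_data)
{
    IndexScanBatchSketch *batch = (IndexScanBatchSketch *) callback_private_data;

    if (batch->nextItem >= batch->nitems)
        return InvalidBlockNumber;

    return ItemPointerGetBlockNumber(&batch->items[batch->nextItem++]);
}

Ending the stream at the batch boundary and resetting it is what keeps the prefetch restriction equivalent to the earlier explicit posix_fadvise version.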
On 11/7/24 01:38, Peter Geoghegan wrote: > On Wed, Nov 6, 2024 at 12:25 PM Tomas Vondra <tomas@vondra.me> wrote: >> Attached is an updated version of this patch series. The first couple >> parts (adding batching + updating built-in index AMs) remain the same, >> the new part is 0007 which switches index scans to read stream API. > > The first thing that I notice about this patch series is that it > doesn't fully remove amgettuple as a concept. That seems a bit odd to > me. After all, you've invented a single page batching mechanism, which > is duplicative of the single page batching mechanism that each > affected index AM has to use already, just to be able to allow the > amgettuple interface to iterate backwards and forwards with a > scrollable cursor (and to make mark/restore work). ISTM that you have > one too many batching interfaces here. > > I can think of nothing that makes the task of completely replacing > amgettuple particularly difficult. I don't think that the need to do > the _bt_killitems stuff actually makes this task all that much harder. > It will need to be generalized, too, by keeping track of multiple > BTScanOpaqueData.killedItems[] style states, each of which is > associated with its own page-level currPos state. But that's not > rocket science. (Also don't think that mark/restore support is all > that hard.) > The primary reason why I kept amgettuple() as is, and added a new AM callback for the "batch" mode is backwards compatibility. I did not want to force all AMs to do this, I think it should be optional. Not only to limit the disruption for out-of-core AMs, but also because I'm not 100% sure every AM will be able to do batching in a reasonable way. I do agree having an AM-level batching, and then another batching in the indexam.c is a bit ... weird. To some extent this is a remainder of an earlier patch version, but it's also based on some suggestions by Andres about batching these calls into AM for efficiency reasons. To be fair, I was jetlagged and I'm not 100% sure this is what he meant, or that it makes a difference in practice. Yes, we could ditch the batching in indexam.c, and just rely on the AM batching, just like now. There are a couple details why the separate batching seemed convenient: 1) We may need to stash some custom data for each TID (e.g. so that IOS does not need to check VM repeatedly). But perhaps that could be delegated to the index AM too ... 2) We need to maintain two "positions" in the index. One for the item the executor is currently processing (and which might end up getting marked as "killed" etc). And another one for "read" position, i.e. items passed to the read stream API / prefetching, etc. 3) It makes it clear when the items are no longer needed, and the AM can do cleanup. process kill tuples, etc. > The current way in which _bt_kill_batch() is called from > _bt_steppage() by the patch seems weird to me. You're copying what you > actually know to be the current page's kill items such that > _bt_steppage() will magically do what it does already when the > amgetttuple/btgettuple interface is in use, just as we're stepping off > the page. It seems to be working at the wrong level. > True, but that's how it was working before, it wasn't my ambition to rework that. 
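An illustrative sketch of the two scan positions described in point (2) above, kept as indexam.c-level state; both struct names are invented:

#include "postgres.h"

typedef struct IndexScanBatchPosSketch
{
    int         batch;          /* which batch the position points into */
    int         item;           /* offset of the item within that batch */
} IndexScanBatchPosSketch;

typedef struct IndexScanBatchStateSketch
{
    IndexScanBatchPosSketch readPos;    /* item the executor is processing */
    IndexScanBatchPosSketch streamPos;  /* item last handed to the read
                                         * stream / prefetcher */
} IndexScanBatchStateSketch;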
> Notice that the current way of doing things in your patch means that > your new batching interface tacitly knows about the nbtree batching > interface, and that it too works along page boundaries -- that's the > only reason why it can hook into _bt_steppage like this in the first > place. Things are way too tightly coupled, and the old and new way of > doing things are hopelessly intertwined. What's being abstracted away > here, really? > I'm not sure sure if by "new batching interface" you mean the indexam.c code, or the code in btgetbatch() etc. I don't think indexam.c knows all that much about the nbtree internal batching. It "just" relies on amgetbatch() producing items the AM can handle later (during killtuples/cleanup etc.). It does not even need to be a single-leaf-page batch, if the AM knows how to track/deal with that internally. It just was easier to do by restricting to a single leaf page for now. But that's internal to AM. Yes, it's true inside the AM it's more intertwined, and some of it sets things up so that the existing code does the right thing ... > I suspect that _bt_steppage() shouldn't be calling _bt_kill_batch() at > all -- nor should it even call _bt_killitems(). Things need to be > broken down into smaller units of work that can be reordered, instead. > > The first half of the current _bt_steppage() function deals with > finishing off the current leaf page should be moved to some other > function -- let's call it _bt_finishpage. A new callback should be > called as part of the new API when the time comes to tell nbtree that > we're now done with a given leaf page -- that's what this new > _bt_finishpage function is for. All that remains of _bt_steppage() are > the parts that deal with figuring out which page should be visited > next -- the second half of _bt_steppage stays put. > > That way stepping to the next page and reading multiple pages can be > executed as eagerly as makes sense -- we don't need to "coordinate" > the heap accesses in lockstep with the leaf page accesses. Maybe you > won't take advantage of this flexibility right away, but ISTM that you > need nominal support for this kind of reordering to make the new API > really make sense. > Yes, splitting _bt_steppage() like this makes sense to me, and I agree being able to proceed to the next page before we're done with the current page seems perfectly reasonable for batches spanning multiple leaf pages. > There are some problems with this scheme, but they seem reasonably > tractable to me. We already have strategies for dealing with the risk > of concurrent TID recycling when _bt_killitems is called with some > maybe-recycled TIDs -- we're already dropping the pin on the leaf page > early in many cases. I've pointed this out many times already (again, > see _bt_drop_lock_and_maybe_pin). > > It's true that we're still going to have to hold onto a buffer pin on > leaf pages whose TIDs haven't all been read from the table AM side > yet, unless we know that it's a case where that's safe for other > reasons -- otherwise index-only scans might give wrong answers. But > that other problem shouldn't be confused with the _bt_killitems > problem, just because of the superficial similarity around holding > onto a leaf page pin. > > To repeat: it is important that you not conflate the problems on the > table AM side (TID recycle safety for index scans) with the problems > on the index AM side (safely setting LP_DEAD bits in _bt_killitems). 
> They're two separate problems that are currently dealt with as one > problem on the nbtree side -- but that isn't fundamental. Teasing them > apart seems likely to be helpful here. > Hmm. I've intentionally tried to ignore these issues, or rather to limit the scope of the patch so that v1 does not require dealing with it. Hence the restriction to single-leaf batches, for example. But I guess I may have to look at this after all ... not great. >> I speculated that with the batching concept it might work better, and I >> think that turned out to be the case. The batching is still the core >> idea, giving the index AM enough control to make kill tuples work (by >> not generating batches spanning multiple leaf pages, or doing something >> smarter). And the read stream leverages that too - the next_block >> callback returns items from the current batch, and the stream is reset >> between batches. This is the same prefetch restriction as with the >> explicit prefetching (done using posix_fadvise), except that the >> prefetching is done by the read stream. > > ISTM that the central feature of the new API should be the ability to > reorder certain kinds of work. There will have to be certain > constraints, of course. Sometimes these will principally be problems > for the table AM (e.g., we musn't allow concurrent TID recycling > unless it's for a plain index scan using an MVCC snapshot), other > times they're principally problems for the index AM (e.g., the > _bt_killitems safety issues). > Not sure. By "new API" you mean the read stream API, or the index AM API to allow batching? > I get that you're not that excited about multi-page batches; it's not > the priority. Fair enough. I just think that the API needs to work in > terms of batches that are sized as one or more pages, in order for it > to make sense. > True, but isn't that already the case? I mean, what exactly prevents an index AM to "build" a batch for multiple leaf pages? The current patch does not implement that for any of the AMs, true, but isn't that already possible if the AM chooses to? If you were to design the index AM API to support this (instead of adding the amgetbatch callback etc.), how would it look? In one of the previous patch versions I tried to rely on amgettuple(). It got a bunch of TIDs ahead from that, depending on prefetch distance. Then those TIDs were prefetched/passed to the read stream, and stashed in a queue (in IndexScanDesc). And then indexam would get the TIDs from the queue, and pass them to index scans etc. Unfortunately that didn't work because of killtuples etc. because the index AM had no idea about the indexam queue and has it's own concept of "current item", so it was confused about which item to mark as killed. And that old item might even be from an earlier leaf page (not the "current" currPos). I was thinking maybe the AM could keep the leaf pages, and then free them once they're no longer needed. But it wasn't clear to me how to exchange this information between indexam.c and the index AM, because right now the AM only knows about a single (current) position. But imagine we have this: a) A way to switch the scan into "batch" mode, where the AM keeps the leaf page (and a way for the AM to indicate it supports this). b) Some way to track two "positions" in the scan - one for read, one for prefetch. I'm not sure if this would be internal in each index AM, or at the indexam.c level. c) A way to get the index tuple for either of the two positions (and advance the position). 
It might be a flag for amgettuple(), or maybe even a callback for the "prefetch" position.

d) A way to inform the AM that items up to some position are no longer needed, and thus the leaf pages can be cleaned up and freed. AFAICS it could always be "up to the current read position".

Does that sound reasonable / better than the current approach, or have I finally reached the "raving lunatic" stage?

> BTW, the README changes you made are slightly wrong about pins and
> locks. We don't actually keep around C pointers to IndexTuples for
> index-only scans that point into shared memory -- that won't work. We
> simply copy whatever IndexTuples the scan returns into local state,
> associated with so->currPos. So that isn't a complicating factor, at
> all.
>

Ah, OK. Thanks for the correction.

> That's all I have right now. Hope it helps.
>

Yes, very interesting insights. Thanks!

regards

--
Tomas Vondra
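Purely as an illustration of the shape of (a)-(d) above, expressed as optional AM callbacks; every name is invented, and (b) -- tracking the two positions -- is assumed to live in indexam.c rather than in the AM:

#include "postgres.h"
#include "access/genam.h"
#include "access/sdir.h"

typedef struct IndexBatchSupportSketch
{
    /* (a) opt the scan into "batch" mode, if the AM supports it */
    bool        (*ambatchmode) (IndexScanDesc scan);

    /* (c) fetch the tuple at either the read or the prefetch position */
    bool        (*amgetposition) (IndexScanDesc scan, ScanDirection direction,
                                  bool prefetch_position);

    /* (d) items up to the read position are done; pages can be freed */
    void        (*amreleaseupto) (IndexScanDesc scan);
} IndexBatchSupportSketch;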
On Thu, Nov 7, 2024 at 10:03 AM Tomas Vondra <tomas@vondra.me> wrote: > The primary reason why I kept amgettuple() as is, and added a new AM > callback for the "batch" mode is backwards compatibility. I did not want > to force all AMs to do this, I think it should be optional. Not only to > limit the disruption for out-of-core AMs, but also because I'm not 100% > sure every AM will be able to do batching in a reasonable way. All index AMs that implement amgettuple are fairly similar to nbtree. They are: * nbtree itself * GiST * Hash * SP-GiST They all have the same general notion of page-at-a-time processing, with buffering of items for the amgettuple callback to return. There are perhaps enough differences to be annoying in SP-GiST, and with GiST's ordered scans (which use a pairing heap rather than true page-at-a-time processing). I guess you're right that you'll need to maintain amgettuple support for the foreseeable future, to support these special cases. I still think that you shouldn't need to use amgettuple in either nbtree or hash, since neither AM does anything non-generic in this area. It should be normal to never need to use amgettuple. > Yes, we could ditch the batching in indexam.c, and just rely on the AM > batching, just like now. To be clear, I had imagined completely extracting the batching from the index AM, since it isn't really at all coupled to individual index AM implementation details anyway. I don't hate the idea of doing more in the index AM, but whether or not it happens there vs. somewhere else isn't my main concern at this point. My main concern right now is that one single place be made to see every relevant piece of information about costs and benefits. Probably something inside indexam.c. > There are a couple details why the separate > batching seemed convenient: > > 1) We may need to stash some custom data for each TID (e.g. so that IOS > does not need to check VM repeatedly). But perhaps that could be > delegated to the index AM too ... > > 2) We need to maintain two "positions" in the index. One for the item > the executor is currently processing (and which might end up getting > marked as "killed" etc). And another one for "read" position, i.e. items > passed to the read stream API / prefetching, etc. That all makes sense. > 3) It makes it clear when the items are no longer needed, and the AM can > do cleanup. process kill tuples, etc. But it doesn't, really. The index AM is still subject to exactly the same constraints in terms of page-at-a-time processing. These existing constraints always came from the table AM side, so it's not as if your patch can remain totally neutral on these questions. Basically, it looks like you've invented a shadow batching interface that is technically not known to the index AM, but nevertheless coordinates with the existing so->currPos batching interface. > I don't think indexam.c knows all that much about the nbtree internal > batching. It "just" relies on amgetbatch() producing items the AM can > handle later (during killtuples/cleanup etc.). It does not even need to > be a single-leaf-page batch, if the AM knows how to track/deal with that > internally. I'm concerned that no single place will know about everything under this scheme. Having one single place that has visibility into all relevant costs, whether they're index AM or table AM related, is what I think you should be aiming for. I think that you should be removing the parts of the nbtree (and other index AM) code that deal with the progress of the scan explicitly. 
What remains is code that simply reads the next page, and saves its details in the relevant data structures. Or code that "finishes off" a leaf page by dropping its pin, and maybe doing the _bt_killitems stuff. The index AM itself should no longer know about the current next tuple to return, nor about mark/restore. It is no longer directly in control of the scan's progress. It loses all context that survives across API calls. > Yes, splitting _bt_steppage() like this makes sense to me, and I agree > being able to proceed to the next page before we're done with the > current page seems perfectly reasonable for batches spanning multiple > leaf pages. I think that it's entirely possible that it'll just be easier to do things this way from the start. I understand that that may be far from obvious right now, but, again, I just don't see what's so special about the way that each index AM batches results. What about that it is so hard to generalize across index AMs that must support amgettuple right now? (At least in the case of nbtree and hash, which have no special requirements for things like KNN-GiST.) Most individual calls to btgettuple just return the next batched-up so->currPos tuple/TID via another call to _bt_next. Things like the _bt_first-new-primitive-scan case don't really add any complexity -- the core concept of processing a page at a time still applies. It really is just a simple batching scheme, with a couple of extra fiddly details attached to it -- but nothing too hairy. The hardest part will probably be rigorously describing the rules for not breaking index-only scans due to concurrent TID recycling by VACUUM, and the rules for doing _bt_killitems. But that's also not a huge problem, in the grand scheme of things. > Hmm. I've intentionally tried to ignore these issues, or rather to limit > the scope of the patch so that v1 does not require dealing with it. > Hence the restriction to single-leaf batches, for example. > > But I guess I may have to look at this after all ... not great. To be clear, I don't think that you necessarily have to apply these capabilities in v1 of this project. I would be satisfied if the patch could just break things out in the right way, so that some later patch could improve things later on. I only really want to see the capabilities within the index AM decomposed, such that one central place can see a global view of the costs and benefits of the index scan. You should be able to validate the new API by stress-testing the code. You can make the index AM read several leaf pages at a time when a certain debug mode is enabled. Once you prove that the index AM correctly performs the same processing as today correctly, without any needless restrictions on the ordering that these decomposed operators perform (only required restrictions that are well explained and formalized), then things should be on the right path. > > ISTM that the central feature of the new API should be the ability to > > reorder certain kinds of work. There will have to be certain > > constraints, of course. Sometimes these will principally be problems > > for the table AM (e.g., we musn't allow concurrent TID recycling > > unless it's for a plain index scan using an MVCC snapshot), other > > times they're principally problems for the index AM (e.g., the > > _bt_killitems safety issues). > > > > Not sure. By "new API" you mean the read stream API, or the index AM API > to allow batching? Right now those two concepts seem incredibly blurred to me. 
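A sketch of the "the AM simply reads the next page and saves its details in the relevant data structures" idea; the structure and callback are invented names, not an actual API proposal:

#include "postgres.h"
#include "access/genam.h"
#include "access/sdir.h"
#include "storage/itemptr.h"

/* invented structure the caller hands to the AM to fill */
typedef struct IndexLeafBatchSketch
{
    int         maxitems;       /* capacity provided by the caller */
    int         nitems;         /* filled in by the AM */
    ItemPointerData *items;     /* heap TIDs in index order */
} IndexLeafBatchSketch;

/*
 * Invented-name sketch of "read one more index page and save its details
 * in this structure"; returns false once the scan has no further pages in
 * the given direction.  The AM keeps no notion of a current item to return.
 */
typedef bool (*amreadnextpage_sketch) (IndexScanDesc scan,
                                       ScanDirection direction,
                                       IndexLeafBatchSketch *batch);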
> > I get that you're not that excited about multi-page batches; it's not > > the priority. Fair enough. I just think that the API needs to work in > > terms of batches that are sized as one or more pages, in order for it > > to make sense. > > > > True, but isn't that already the case? I mean, what exactly prevents an > index AM to "build" a batch for multiple leaf pages? The current patch > does not implement that for any of the AMs, true, but isn't that already > possible if the AM chooses to? That's unclear, but overall I'd say no. The index AM API says that they need to hold on to a buffer pin to avoid confusing scans due to concurrent TID recycling by VACUUM. The index AM API fails to adequately describe what is expected here. And it provides no useful context for larger batching of index pages. nbtree already does its own thing by dropping leaf page pins selectively. Whether or not it's technically possible is a matter of interpretation (I came down on the "no" side, but it's still ambiguous). I would prefer it if the index AM API was much simpler for ordered scans. As I said already, something along the lines of "when you're told to scan the next index page, here's how we'll call you, here's the data structure that you need to fill up". Or "when we tell you that we're done fetching tuples from a recently read index page, here's how we'll call you". These discussions about where the exact boundaries lie don't seem very helpful. The simple fact is that nobody is ever going to invent an index AM side interface that batches up more than a single leaf page. Why would they? It just doesn't make sense to, since the index AM has no idea about certain clearly-relevant context. For example, it has no idea whether or not there's a LIMIT involved. The value that comes from using larger batches on the index AM side comes from making life easier for heap prefetching, which index AMs know nothing about whatsoever. Again, the goal should be to marry information from the index AM and the table AM in one central place. > Unfortunately that didn't work because of killtuples etc. because the > index AM had no idea about the indexam queue and has it's own concept of > "current item", so it was confused about which item to mark as killed. > And that old item might even be from an earlier leaf page (not the > "current" currPos). Currently, during a call to btgettuple, so->currPos.itemIndex is updated within _bt_next. But before _bt_next is called, so->currPos.itemIndex indicates the item returned by the most recent prior call to btgettuple -- which is also the tuple that the scan->kill_prior_tuple reports on. In short, btgettuple does some trivial things to remember which entries from so->currPos ought to be marked dead later on due to the scan->kill_prior_tuple flag having been set for those entries. This can be moved outside of each index AM. The index AM shouldn't need to use a scan->kill_prior_tuple style flag under the new batching API at all, though. It should work at a higher level than that. The index AM should be called through a callback that tells it to drop the pin on a page that the table AM has been reading from, and maybe perform _bt_killitems on these relevant known-dead TIDs first. In short, all of the bookkeeping for so->killedItems[] should be happening at a completely different layer. And the so->killedItems[] structure should be directly associated with a single index page subset of a batch (a subset similar to the current so->currPos batches). 
The first time the index AM sees anything about dead TIDs, it should see a whole leaf page worth of them. > I was thinking maybe the AM could keep the leaf pages, and then free > them once they're no longer needed. But it wasn't clear to me how to > exchange this information between indexam.c and the index AM, because > right now the AM only knows about a single (current) position. I'm imagining a world in which the index AM doesn't even know about the current position. Basically, it has no real context about the progress of the scan to maintain at all. It merely does what it is told by some higher level, that is sensitive to the requirements of both the index AM and the table AM. > But imagine we have this: > > a) A way to switch the scan into "batch" mode, where the AM keeps the > leaf page (and a way for the AM to indicate it supports this). I don't think that there needs to be a batch mode. There could simply be the total absence of batching, which is one point along a continuum, rather than a discrete mode. > b) Some way to track two "positions" in the scan - one for read, one for > prefetch. I'm not sure if this would be internal in each index AM, or at > the indexam.c level. I think that it would be at the indexam.c level. > c) A way to get the index tuple for either of the two positions (and > advance the position). It might be a flag for amgettuple(), or maybe > even a callaback for the "prefetch" position. Why does the index AM need to know anything about the fact that the next tuple has been requested? Why can't it just be 100% ignorant of all that? (Perhaps barring a few special cases, such as KNN-GiST scans, which continue to use the legacy amgettuple interface.) > d) A way to inform the AM items up to some position are no longer > needed, and thus the leaf pages can be cleaned up and freed. AFAICS it > could always be "up to the current read position". Yeah, I like this idea. But the index AM doesn't need to know about positions and whatnot. It just needs to do what it's told: to drop the pin, and maybe to perform _bt_killitems first. Or maybe just to drop the pin, with instruction to do _bt_killitems coming some time later (the index AM will need to be a bit more careful within its _bt_killitems step when this happens). The index AM doesn't need to drop the current pin for the current position -- not as such. The index AM doesn't directly know about what pins are held, since that'll all be tracked elsewhere. Again, the index AM should need to hold onto zero context, beyond the immediate request to perform one additional unit of work, which will usually/always happen at the index page level (all of which is tracked by data structures that are under the control of the new indexam.c level). I don't think that it'll ultimately be all that hard to schedule when and how index pages are read from outside of the index AM in question. In general all relevant index AMs already work in much the same way here. Maybe we can ultimately invent a way for the index AM to influence that scheduling, but that might never be required. > Does that sound reasonable / better than the current approach, or have I > finally reached the "raving lunatic" stage? The stage after "raving lunatic" is enlightenment. :-) -- Peter Geoghegan
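As a rough illustration of moving the kill_prior_tuple bookkeeping out of the AM, the layer that hands tuples to the executor could record dead entries per leaf-page batch and only tell the AM about them once, when that page is released. This is a sketch under that assumption; BatchKillState and the helper names are invented for illustration:

    /*
     * Sketch (hypothetical) of batch-level dead-item bookkeeping kept above
     * the index AM, instead of a per-tuple kill_prior_tuple flag.
     */
    #include <stdbool.h>

    #define MAX_ITEMS_PER_PAGE 1024

    typedef struct BatchKillState
    {
        int   nitems;                   /* items returned from this leaf page */
        bool  dead[MAX_ITEMS_PER_PAGE]; /* marked by the table AM / executor */
        int   ndead;
    } BatchKillState;

    /* Called instead of setting scan->kill_prior_tuple for item 'i'. */
    static void
    batch_mark_item_dead(BatchKillState *ks, int i)
    {
        if (!ks->dead[i])
        {
            ks->dead[i] = true;
            ks->ndead++;
        }
    }

    /*
     * Called when the page is being "finished off": the AM's
     * _bt_killitems-like routine sees the whole page worth of dead items
     * in one call, exactly as described above.
     */
    static void
    batch_flush_dead_items(BatchKillState *ks,
                           void (*am_kill_items) (const bool *dead, int nitems))
    {
        if (ks->ndead > 0)
            am_kill_items(ks->dead, ks->nitems);
    }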
On 11/7/24 18:55, Peter Geoghegan wrote: > On Thu, Nov 7, 2024 at 10:03 AM Tomas Vondra <tomas@vondra.me> wrote: >> The primary reason why I kept amgettuple() as is, and added a new AM >> callback for the "batch" mode is backwards compatibility. I did not want >> to force all AMs to do this, I think it should be optional. Not only to >> limit the disruption for out-of-core AMs, but also because I'm not 100% >> sure every AM will be able to do batching in a reasonable way. > > All index AMs that implement amgettuple are fairly similar to nbtree. They are: > > * nbtree itself > * GiST > * Hash > * SP-GiST > > They all have the same general notion of page-at-a-time processing, > with buffering of items for the amgettuple callback to return. There > are perhaps enough differences to be annoying in SP-GiST, and with > GiST's ordered scans (which use a pairing heap rather than true > page-at-a-time processing). I guess you're right that you'll need to > maintain amgettuple support for the foreseeable future, to support > these special cases. > > I still think that you shouldn't need to use amgettuple in either > nbtree or hash, since neither AM does anything non-generic in this > area. It should be normal to never need to use amgettuple. > Right, I can imagine not using amgettuple() in nbtree/hash. I guess we could even remove it altogether, although I'm not sure that'd work right now (haven't tried). >> Yes, we could ditch the batching in indexam.c, and just rely on the AM >> batching, just like now. > > To be clear, I had imagined completely extracting the batching from > the index AM, since it isn't really at all coupled to individual index > AM implementation details anyway. I don't hate the idea of doing more > in the index AM, but whether or not it happens there vs. somewhere > else isn't my main concern at this point. > > My main concern right now is that one single place be made to see > every relevant piece of information about costs and benefits. Probably > something inside indexam.c. > Not sure I understand, but I think I'm somewhat confused by "index AM" vs. indexam. Are you suggesting the individual index AMs should know as little about the batching as possible, and instead it should be up to indexam.c to orchestrate most of the stuff? If yes, then I agree in principle, and I think indexam.c is the right place to do that (or at least I can't think of a better one). That's what the current patch aimed to do, more or less. I'm not saying it got it perfectly right, and I'm sure there is stuff that can be improved (like reworking _steppage to not deal with killed tuples). But surely the index AMs need to have some knowledge about batching, because how else would it know which leaf pages to still keep, etc? >> There are a couple details why the separate >> batching seemed convenient: >> >> 1) We may need to stash some custom data for each TID (e.g. so that IOS >> does not need to check VM repeatedly). But perhaps that could be >> delegated to the index AM too ... >> >> 2) We need to maintain two "positions" in the index. One for the item >> the executor is currently processing (and which might end up getting >> marked as "killed" etc). And another one for "read" position, i.e. items >> passed to the read stream API / prefetching, etc. > > That all makes sense. > OK >> 3) It makes it clear when the items are no longer needed, and the AM can >> do cleanup. process kill tuples, etc. > > But it doesn't, really. 
The index AM is still subject to exactly the > same constraints in terms of page-at-a-time processing. These existing > constraints always came from the table AM side, so it's not as if your > patch can remain totally neutral on these questions. > Not sure I understand. Which part of my sentence you disagree with? Or what constraints you mean? The interface does not require page-at-a-time processing - the index AM is perfectly within it's rights to produce a batch spanning 10 leaf pages, as long as it keeps track of them, and perhaps keeps some mapping of items (returned in the batch) to leaf pages. So that when the next batch is requested, it can do the cleanup, and move to the next batch. Yes, the current implementation does not do that, to keep the patches simple. But it should be possible, I believe. > Basically, it looks like you've invented a shadow batching interface > that is technically not known to the index AM, but nevertheless > coordinates with the existing so->currPos batching interface. > Perhaps, but which part of that you consider a problem? Are you saying this shouldn't use the currPos stuff at all, and instead do stuff in some other way? >> I don't think indexam.c knows all that much about the nbtree internal >> batching. It "just" relies on amgetbatch() producing items the AM can >> handle later (during killtuples/cleanup etc.). It does not even need to >> be a single-leaf-page batch, if the AM knows how to track/deal with that >> internally. > > I'm concerned that no single place will know about everything under > this scheme. Having one single place that has visibility into all > relevant costs, whether they're index AM or table AM related, is what > I think you should be aiming for. > > I think that you should be removing the parts of the nbtree (and other > index AM) code that deal with the progress of the scan explicitly. > What remains is code that simply reads the next page, and saves its > details in the relevant data structures. Or code that "finishes off" a > leaf page by dropping its pin, and maybe doing the _bt_killitems > stuff. Does that mean not having a simple amgetbatch() callback, but some finer grained interface? Or maybe one callback that returns the next "AM page" (essentially the currPos), and then another callback to release it? (This is what I mean by "two-callback API" later.) Or what would it look like? > The index AM itself should no longer know about the current next tuple > to return, nor about mark/restore. It is no longer directly in control > of the scan's progress. It loses all context that survives across API > calls. > I'm lost. How could the index AM not know about mark/restore? >> Yes, splitting _bt_steppage() like this makes sense to me, and I agree >> being able to proceed to the next page before we're done with the >> current page seems perfectly reasonable for batches spanning multiple >> leaf pages. > > I think that it's entirely possible that it'll just be easier to do > things this way from the start. I understand that that may be far from > obvious right now, but, again, I just don't see what's so special > about the way that each index AM batches results. What about that it > is so hard to generalize across index AMs that must support amgettuple > right now? (At least in the case of nbtree and hash, which have no > special requirements for things like KNN-GiST.) > I don't think the batching in various AMs is particularly unique, that's true. 
But my goal was to wrap that in a single amgetbatch callback, because that seemed natural, and that moves some of the responsibilities to the AM. I still don't quite understand what API you imagine, but if we want to make more of this the responsibility of indexam.c, I guess it will require multiple smaller callbacks (I'm not opposed to that, but I also don't know if that's what you imagine). > Most individual calls to btgettuple just return the next batched-up > so->currPos tuple/TID via another call to _bt_next. Things like the > _bt_first-new-primitive-scan case don't really add any complexity -- > the core concept of processing a page at a time still applies. It > really is just a simple batching scheme, with a couple of extra fiddly > details attached to it -- but nothing too hairy. > True, although the details (how the batches are represented etc.) are often quite different, so did you imagine some shared structure to represent that, or wrapping that in a new callback? Or how would indexam.c work with that? > The hardest part will probably be rigorously describing the rules for > not breaking index-only scans due to concurrent TID recycling by > VACUUM, and the rules for doing _bt_killitems. But that's also not a > huge problem, in the grand scheme of things. > It probably is not a huge problem ... for someone who's already familiar with the rules, at least intuitively. But TBH this part really scares me a little bit. >> Hmm. I've intentionally tried to ignore these issues, or rather to limit >> the scope of the patch so that v1 does not require dealing with it. >> Hence the restriction to single-leaf batches, for example. >> >> But I guess I may have to look at this after all ... not great. > > To be clear, I don't think that you necessarily have to apply these > capabilities in v1 of this project. I would be satisfied if the patch > could just break things out in the right way, so that some later patch > could improve things later on. I only really want to see the > capabilities within the index AM decomposed, such that one central > place can see a global view of the costs and benefits of the index > scan. > Yes, I understand that. Getting the overall design right is my main concern, even if some of the advanced stuff is not implemented until later. But with the wrong design, that may turn out to be difficult. That's the feedback I was hoping for when I kept bugging you, and this discussion was already very useful in this regard. Thank you for that. > You should be able to validate the new API by stress-testing the code. > You can make the index AM read several leaf pages at a time when a > certain debug mode is enabled. Once you prove that the index AM > correctly performs the same processing as today correctly, without any > needless restrictions on the ordering that these decomposed operators > perform (only required restrictions that are well explained and > formalized), then things should be on the right path. > Yeah, stress testing is my primary tool ... >>> ISTM that the central feature of the new API should be the ability to >>> reorder certain kinds of work. There will have to be certain >>> constraints, of course. Sometimes these will principally be problems >>> for the table AM (e.g., we musn't allow concurrent TID recycling >>> unless it's for a plain index scan using an MVCC snapshot), other >>> times they're principally problems for the index AM (e.g., the >>> _bt_killitems safety issues). >>> >> >> Not sure. 
By "new API" you mean the read stream API, or the index AM API >> to allow batching? > > Right now those two concepts seem incredibly blurred to me. > Same here. >>> I get that you're not that excited about multi-page batches; it's not >>> the priority. Fair enough. I just think that the API needs to work in >>> terms of batches that are sized as one or more pages, in order for it >>> to make sense. >>> >> >> True, but isn't that already the case? I mean, what exactly prevents an >> index AM to "build" a batch for multiple leaf pages? The current patch >> does not implement that for any of the AMs, true, but isn't that already >> possible if the AM chooses to? > > That's unclear, but overall I'd say no. > > The index AM API says that they need to hold on to a buffer pin to > avoid confusing scans due to concurrent TID recycling by VACUUM. The > index AM API fails to adequately describe what is expected here. And > it provides no useful context for larger batching of index pages. > nbtree already does its own thing by dropping leaf page pins > selectively. > Not sure I understand. I imagined the index AM would just read a sequence of leaf pages, keeping all the same pins etc. just like it does for the one leaf it reads right now (pins, etc.). I'm probably too dumb for that, but I still don't quite understand how that's different from just reading and processing that sequence of leaf pages by amgettuple without batching. > Whether or not it's technically possible is a matter of interpretation > (I came down on the "no" side, but it's still ambiguous). I would > prefer it if the index AM API was much simpler for ordered scans. As I > said already, something along the lines of "when you're told to scan > the next index page, here's how we'll call you, here's the data > structure that you need to fill up". Or "when we tell you that we're > done fetching tuples from a recently read index page, here's how we'll > call you". > I think this is pretty much "two-callback API" I mentioned earlier. > These discussions about where the exact boundaries lie don't seem very > helpful. The simple fact is that nobody is ever going to invent an > index AM side interface that batches up more than a single leaf page. > Why would they? It just doesn't make sense to, since the index AM has > no idea about certain clearly-relevant context. For example, it has no > idea whether or not there's a LIMIT involved. > > The value that comes from using larger batches on the index AM side > comes from making life easier for heap prefetching, which index AMs > know nothing about whatsoever. Again, the goal should be to marry > information from the index AM and the table AM in one central place. > True, although the necessary context could be passed to the index AM in some way. That's what happens in the current patch, where indexam.c could size the batch just right for a LIMIT clause, before asking the index AM to fill it with items. >> Unfortunately that didn't work because of killtuples etc. because the >> index AM had no idea about the indexam queue and has it's own concept of >> "current item", so it was confused about which item to mark as killed. >> And that old item might even be from an earlier leaf page (not the >> "current" currPos). > > Currently, during a call to btgettuple, so->currPos.itemIndex is > updated within _bt_next. 
But before _bt_next is called, > so->currPos.itemIndex indicates the item returned by the most recent > prior call to btgettuple -- which is also the tuple that the > scan->kill_prior_tuple reports on. In short, btgettuple does some > trivial things to remember which entries from so->currPos ought to be > marked dead later on due to the scan->kill_prior_tuple flag having > been set for those entries. This can be moved outside of each index > AM. > > The index AM shouldn't need to use a scan->kill_prior_tuple style flag > under the new batching API at all, though. It should work at a higher > level than that. The index AM should be called through a callback that > tells it to drop the pin on a page that the table AM has been reading > from, and maybe perform _bt_killitems on these relevant known-dead > TIDs first. In short, all of the bookkeeping for so->killedItems[] > should be happening at a completely different layer. And the > so->killedItems[] structure should be directly associated with a > single index page subset of a batch (a subset similar to the current > so->currPos batches). > > The first time the index AM sees anything about dead TIDs, it should > see a whole leaf page worth of them. > I need to think about this a bit, but I agree passing this information to an index AM through the kill_prior_tuple seems weird. >> I was thinking maybe the AM could keep the leaf pages, and then free >> them once they're no longer needed. But it wasn't clear to me how to >> exchange this information between indexam.c and the index AM, because >> right now the AM only knows about a single (current) position. > > I'm imagining a world in which the index AM doesn't even know about > the current position. Basically, it has no real context about the > progress of the scan to maintain at all. It merely does what it is > told by some higher level, that is sensitive to the requirements of > both the index AM and the table AM. > Hmmm, OK. If the idea is to just return a leaf page as an array of items (in some fancy way) to indexam.c, then it'd be indexam.c responsible for tracking what the current position (or multiple positions are), I guess. >> But imagine we have this: >> >> a) A way to switch the scan into "batch" mode, where the AM keeps the >> leaf page (and a way for the AM to indicate it supports this). > > I don't think that there needs to be a batch mode. There could simply > be the total absence of batching, which is one point along a > continuum, rather than a discrete mode. > >> b) Some way to track two "positions" in the scan - one for read, one for >> prefetch. I'm not sure if this would be internal in each index AM, or at >> the indexam.c level. > > I think that it would be at the indexam.c level. > Yes, if the index AM returns page as a set of items, then it'd be up to indexam.c to maintain all this information. >> c) A way to get the index tuple for either of the two positions (and >> advance the position). It might be a flag for amgettuple(), or maybe >> even a callaback for the "prefetch" position. > > Why does the index AM need to know anything about the fact that the > next tuple has been requested? Why can't it just be 100% ignorant of > all that? (Perhaps barring a few special cases, such as KNN-GiST > scans, which continue to use the legacy amgettuple interface.) > Well, I was thinking about how it works now, for the "current" position. And I was thinking about how would it need to change to handle the prefetch position too, in the same way ... 
But if you're suggesting to move this logic and context to the upper layer indexam.c, that changes things ofc. >> d) A way to inform the AM items up to some position are no longer >> needed, and thus the leaf pages can be cleaned up and freed. AFAICS it >> could always be "up to the current read position". > > Yeah, I like this idea. But the index AM doesn't need to know about > positions and whatnot. It just needs to do what it's told: to drop the > pin, and maybe to perform _bt_killitems first. Or maybe just to drop > the pin, with instruction to do _bt_killitems coming some time later > (the index AM will need to be a bit more careful within its > _bt_killitems step when this happens). > Well, if the AM works with "batches of tuples for a leaf page" (through the two callbacks to read / release a page), then positions to exact items are no longer needed. It just needs to know which pages are still needed, etc. Correct? > The index AM doesn't need to drop the current pin for the current > position -- not as such. The index AM doesn't directly know about what > pins are held, since that'll all be tracked elsewhere. Again, the > index AM should need to hold onto zero context, beyond the immediate > request to perform one additional unit of work, which will > usually/always happen at the index page level (all of which is tracked > by data structures that are under the control of the new indexam.c > level). > No idea. > I don't think that it'll ultimately be all that hard to schedule when > and how index pages are read from outside of the index AM in question. > In general all relevant index AMs already work in much the same way > here. Maybe we can ultimately invent a way for the index AM to > influence that scheduling, but that might never be required. > I haven't thought about scheduling at all. Maybe there's something we could improve in the future, but I don't see what would it look like, and it seems unrelated to this patch. >> Does that sound reasonable / better than the current approach, or have I >> finally reached the "raving lunatic" stage? > > The stage after "raving lunatic" is enlightenment. :-) > That's my hope. regards -- Tomas Vondra
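The "two positions" idea from this exchange could be kept entirely at the indexam.c level, as a small queue of per-page batches with one cursor for what the executor is consuming and another for how far prefetching has run ahead. A minimal sketch follows, assuming such a queue exists; the struct and field names are made up for illustration:

    /*
     * Hypothetical queue of leaf-page batches owned by indexam.c: the read
     * position trails the prefetch/stream position, and batches wholly
     * behind the read position can be handed back to the AM for release.
     */
    #include <stdbool.h>

    #define MAX_BATCHES 8       /* arbitrary cap on leaf pages kept loaded */

    typedef struct Pos { int batch; int item; } Pos;

    typedef struct BatchQueue
    {
        int firstBatch;             /* oldest batch still loaded */
        int nextBatch;              /* one past the newest loaded batch */
        int nitems[MAX_BATCHES];    /* item count per batch (slot = batch % MAX_BATCHES) */
        Pos readPos;                /* next item handed to the executor */
        Pos streamPos;              /* next item passed to prefetching */
    } BatchQueue;

    /*
     * Advance a position by one item; returns false if that would leave the
     * loaded batches, meaning the AM must be asked for another leaf page.
     * The caller keeps streamPos at or ahead of readPos.
     */
    static bool
    pos_advance(const BatchQueue *q, Pos *pos)
    {
        if (pos->item + 1 < q->nitems[pos->batch % MAX_BATCHES])
        {
            pos->item++;
            return true;
        }
        if (pos->batch + 1 < q->nextBatch)
        {
            pos->batch++;
            pos->item = 0;
            return true;
        }
        return false;
    }

    /* Everything before this batch is no longer needed by the executor. */
    static int
    queue_oldest_needed(const BatchQueue *q)
    {
        return q->readPos.batch;
    }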
On Thu, Nov 7, 2024 at 4:34 PM Tomas Vondra <tomas@vondra.me> wrote: > Not sure I understand, but I think I'm somewhat confused by "index AM" > vs. indexam. Are you suggesting the individual index AMs should know as > little about the batching as possible, and instead it should be up to > indexam.c to orchestrate most of the stuff? Yes, that's what I'm saying. Knowing "as little as possible" turns out to be pretty close to knowing nothing at all. There might be some minor exceptions, such as the way that nbtree needs to remember the scan's array keys. But that already works in a way that's very insensitive to the exact position in the scan. For example, right now if you restore a mark that doesn't just come from the existing so->currPos batch then we cheat and reset the array keys. > If yes, then I agree in principle, and I think indexam.c is the right > place to do that (or at least I can't think of a better one). Good. > That's what the current patch aimed to do, more or less. I'm not saying > it got it perfectly right, and I'm sure there is stuff that can be > improved (like reworking _steppage to not deal with killed tuples). But > surely the index AMs need to have some knowledge about batching, because > how else would it know which leaf pages to still keep, etc? I think that your new thing can directly track which leaf pages have pins. As well as tracking the order that it has to return tuples from among those leaf page batch subsets. Your new thing can think about this in very general terms, that really aren't tied to any index AM specifics. It'll have some general notion of an ordered sequence of pages (in scan/key space order), each of which contains one or more tuples to return. It needs to track which pages have tuples that we've already done all the required visibility checks for, in order to be able to instruct the index AM to drop the pin. Suppose, for example, that we're doing an SAOP index scan, where the leaf pages that our multi-page batch consists of aren't direct siblings. That literally doesn't matter at all. The pages still have to be in the same familiar key space/scan order, regardless. And that factor shouldn't really need to influence how many pins we're willing to hold on to (no more than it would when there are large numbers of index leaf pages with no interesting tuples to return that we must still scan over). > >> 3) It makes it clear when the items are no longer needed, and the AM can > >> do cleanup. process kill tuples, etc. > > > > But it doesn't, really. The index AM is still subject to exactly the > > same constraints in terms of page-at-a-time processing. These existing > > constraints always came from the table AM side, so it's not as if your > > patch can remain totally neutral on these questions. > > > > Not sure I understand. Which part of my sentence you disagree with? Or > what constraints you mean? What I was saying here was something I said more clearly a bit further down: it's technically possible to do multi-page batches within the confines of the current index AM API, but that's not true in any practical sense. And it'll never be true with an API that looks very much like the current amgettuple API. > The interface does not require page-at-a-time processing - the index AM > is perfectly within it's rights to produce a batch spanning 10 leaf > pages, as long as it keeps track of them, and perhaps keeps some mapping > of items (returned in the batch) to leaf pages. 
So that when the next > batch is requested, it can do the cleanup, and move to the next batch. How does an index AM actually do that in a way that's useful? It only sees a small part of the picture. That's why it's the wrong place for it. > > Basically, it looks like you've invented a shadow batching interface > > that is technically not known to the index AM, but nevertheless > > coordinates with the existing so->currPos batching interface. > > > > Perhaps, but which part of that you consider a problem? Are you saying > this shouldn't use the currPos stuff at all, and instead do stuff in > some other way? I think that you should generalize the currPos stuff, and move it to some other, higher level module. > Does that mean not having a simple amgetbatch() callback, but some finer > grained interface? Or maybe one callback that returns the next "AM page" > (essentially the currPos), and then another callback to release it? > > (This is what I mean by "two-callback API" later.) I'm not sure. Why does the index AM need to care about the batch size at all? It merely needs to read the next leaf page. The high level understanding of batches and the leaf pages that constitute batches lives elsewhere. The nbtree code will know about buffer pins held, in the sense that it'll be the one setting the Buffer variables in the new scan descriptor thing. But it's not going to remember to drop those buffer pins on its own. It'll need to be told. So it's not ever really in control. > > The index AM itself should no longer know about the current next tuple > > to return, nor about mark/restore. It is no longer directly in control > > of the scan's progress. It loses all context that survives across API > > calls. > > > > I'm lost. How could the index AM not know about mark/restore? Restoring a mark already works by restoring an earlier so->currPos batch. Actually, more often it works by storing an offset into the current so->currPos, without actually copying anything into so->markPos, and without restoring so->markPos into so->currPos. In short, there is virtually nothing about how mark/restore works that really needs to live inside nbtree. It's all just restoring an earlier batch and/or offset into a batch. The only minor caveat is the stuff about array keys that I went into already -- that isn't quite a piece of state that lives in so->currPos, but it's a little bit like that. You can probably poke one or two more minor holes in some of this -- it's not 100% trivial. But it's doable. > I don't think the batching in various AMs is particularly unique, that's > true. But my goal was to wrap that in a single amgetbatch callback, > because that seemed natural, and that moves some of the responsibilities > to the AM. Why is it natural? I mean all of the index AMs that support amgettuple copied everything from ntree already. Including all of the kill_prior_tuple stuff. It's already quite generic. > I still don't quite understand what API you imagine, but if > we want to make more of this the responsibility of indexam.c, I guess it > will require multiple smaller callbacks (I'm not opposed to that, but I > also don't know if that's what you imagine). I think that you understood me correctly here. > > Most individual calls to btgettuple just return the next batched-up > > so->currPos tuple/TID via another call to _bt_next. Things like the > > _bt_first-new-primitive-scan case don't really add any complexity -- > > the core concept of processing a page at a time still applies. 
It > > really is just a simple batching scheme, with a couple of extra fiddly > > details attached to it -- but nothing too hairy. > > > > True, although the details (how the batches are represented etc.) are > often quite different, so did you imagine some shared structure to > represent that, or wrapping that in a new callback? In what sense are they sometimes different? In general batches will consist of one or more groups of tuples, each of which is associated with a particular leaf page (if the scan returns no tuples for a given scanned leaf page then it won't form a part of the final batch). You can do amgettuple style scrolling back and forth with this structure, across page boundaries. Seems pretty general to me. > Yes, I understand that. Getting the overall design right is my main > concern, even if some of the advanced stuff is not implemented until > later. But with the wrong design, that may turn out to be difficult. > > That's the feedback I was hoping for when I kept bugging you, and this > discussion was already very useful in this regard. Thank you for that. I don't want to insist on doing all this. But it just seems really weird to have this shadow batching system for the so->currPos batches. > > The index AM API says that they need to hold on to a buffer pin to > > avoid confusing scans due to concurrent TID recycling by VACUUM. The > > index AM API fails to adequately describe what is expected here. And > > it provides no useful context for larger batching of index pages. > > nbtree already does its own thing by dropping leaf page pins > > selectively. > > > > Not sure I understand. I imagined the index AM would just read a > sequence of leaf pages, keeping all the same pins etc. just like it does > for the one leaf it reads right now (pins, etc.). Right. But it wouldn't necessarily drop the leaf pages right away. It might try to coalesce together multiple heap page accesses, for index tuples that happen to span page boundaries (but are part of the same higher level batch). > I'm probably too dumb for that, but I still don't quite understand how > that's different from just reading and processing that sequence of leaf > pages by amgettuple without batching. It's not so much different, as just more flexible. It's possible that v1 would effectively do exactly the same thing in practice. It'd only be able to do fancier things with holding onto leaf pages in a debug build, that validated the general approach. > True, although the necessary context could be passed to the index AM in > some way. That's what happens in the current patch, where indexam.c > could size the batch just right for a LIMIT clause, before asking the > index AM to fill it with items. What difference does it make where it happens? It might make some difference, but as I keep saying, the important point is that *somebody* has to know all of these things at the same time. > I need to think about this a bit, but I agree passing this information > to an index AM through the kill_prior_tuple seems weird. Right. Because it's a tuple-at-a-time interface, which isn't suitable for the direction you want to take things in. > Hmmm, OK. If the idea is to just return a leaf page as an array of items > (in some fancy way) to indexam.c, then it'd be indexam.c responsible for > tracking what the current position (or multiple positions are), I guess. Right. It would have to have some basic idea of the laws-of-physics underlying the index scan. 
It would have to sensibly limit the number of index page buffer pins held at any given time. > > Why does the index AM need to know anything about the fact that the > > next tuple has been requested? Why can't it just be 100% ignorant of > > all that? (Perhaps barring a few special cases, such as KNN-GiST > > scans, which continue to use the legacy amgettuple interface.) > > > > Well, I was thinking about how it works now, for the "current" position. > And I was thinking about how would it need to change to handle the > prefetch position too, in the same way ... > > But if you're suggesting to move this logic and context to the upper > layer indexam.c, that changes things ofc. Yes, I am suggesting that. > Well, if the AM works with "batches of tuples for a leaf page" (through > the two callbacks to read / release a page), then positions to exact > items are no longer needed. It just needs to know which pages are still > needed, etc. Correct? Right, correct. > > I don't think that it'll ultimately be all that hard to schedule when > > and how index pages are read from outside of the index AM in question. > > In general all relevant index AMs already work in much the same way > > here. Maybe we can ultimately invent a way for the index AM to > > influence that scheduling, but that might never be required. > > > > I haven't thought about scheduling at all. Maybe there's something we > could improve in the future, but I don't see what would it look like, > and it seems unrelated to this patch. It's only related to this patch in the sense that we have to imagine that it'll be worth having in some form in the future. It might also be a good exercise architecturally. We don't need to do the same thing in several slightly different ways in each index AM. -- Peter Geoghegan
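One way to picture the "laws of physics" role of that central layer is a simple decision about whether to ask the AM for another leaf page, weighing a pin budget against how much work is already queued. This is only a sketch of the idea; the fields and limits are invented, not from the patch:

    /*
     * Hypothetical policy check in the central scan-management layer:
     * read another leaf page only if we have pin budget left and the
     * queued items don't already cover what the caller is likely to need.
     */
    #include <stdbool.h>

    typedef struct ScanControl
    {
        int pinsHeld;       /* leaf pages currently pinned for this scan */
        int maxPins;        /* e.g. derived from effective_io_concurrency */
        int itemsQueued;    /* TIDs loaded but not yet returned */
        int itemsWanted;    /* e.g. remaining LIMIT, or a ramp-up target */
    } ScanControl;

    static bool
    should_read_next_leaf_page(const ScanControl *sc)
    {
        if (sc->pinsHeld >= sc->maxPins)
            return false;       /* release an old batch before reading more */
        if (sc->itemsQueued >= sc->itemsWanted)
            return false;       /* enough queued to satisfy the caller for now */
        return true;
    }

The point is that this check can see both index-side costs (pins held) and plan-level context (LIMIT, prefetch target) at once, which is exactly what no individual index AM can do.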
On 11/8/24 02:35, Peter Geoghegan wrote: > On Thu, Nov 7, 2024 at 4:34 PM Tomas Vondra <tomas@vondra.me> wrote: >> Not sure I understand, but I think I'm somewhat confused by "index AM" >> vs. indexam. Are you suggesting the individual index AMs should know as >> little about the batching as possible, and instead it should be up to >> indexam.c to orchestrate most of the stuff? > > Yes, that's what I'm saying. Knowing "as little as possible" turns out > to be pretty close to knowing nothing at all. > > There might be some minor exceptions, such as the way that nbtree > needs to remember the scan's array keys. But that already works in a > way that's very insensitive to the exact position in the scan. For > example, right now if you restore a mark that doesn't just come from > the existing so->currPos batch then we cheat and reset the array keys. > >> If yes, then I agree in principle, and I think indexam.c is the right >> place to do that (or at least I can't think of a better one). > > Good. > >> That's what the current patch aimed to do, more or less. I'm not saying >> it got it perfectly right, and I'm sure there is stuff that can be >> improved (like reworking _steppage to not deal with killed tuples). But >> surely the index AMs need to have some knowledge about batching, because >> how else would it know which leaf pages to still keep, etc? > > I think that your new thing can directly track which leaf pages have > pins. As well as tracking the order that it has to return tuples from > among those leaf page batch subsets. > > Your new thing can think about this in very general terms, that really > aren't tied to any index AM specifics. It'll have some general notion > of an ordered sequence of pages (in scan/key space order), each of > which contains one or more tuples to return. It needs to track which > pages have tuples that we've already done all the required visibility > checks for, in order to be able to instruct the index AM to drop the > pin. > Is it a good idea to make this part (in indexam.c) aware of / responsible for managing stuff like pins? Perhaps it'd work fine for index AMs that always return an array of items for a single leaf-page (like btree or hash). But I'm still thinking about cases like gist with ORDER BY clauses, or maybe something even weirder in custom AMs. It seems to me knowing which pages may be pinned is very AM-specific knowledge, and my intention was to let the AM to manage that. That is, the new indexam code would be responsible for deciding when the "AM batches" are loaded and released, using the two new callbacks. But it'd be the AM responsible for making sure everything is released. > Suppose, for example, that we're doing an SAOP index scan, where the > leaf pages that our multi-page batch consists of aren't direct > siblings. That literally doesn't matter at all. The pages still have > to be in the same familiar key space/scan order, regardless. And that > factor shouldn't really need to influence how many pins we're willing > to hold on to (no more than it would when there are large numbers of > index leaf pages with no interesting tuples to return that we must > still scan over). > I agree that in the simple cases it's not difficult to determine what pins we need for the sequence of tuples/pages. But is it guaranteed to be that easy, and is it easy to communicate this information to the indexam.c layer? I'm not sure about that. In an extreme case it may be that each tuple comes from entirely different leaf page, and stuff like that. 
And while most out-of-core AMs that I'm aware of are rather close to nbtree/gist/gin, I wonder what weird things can be out there. >>>> 3) It makes it clear when the items are no longer needed, and the AM can >>>> do cleanup. process kill tuples, etc. >>> >>> But it doesn't, really. The index AM is still subject to exactly the >>> same constraints in terms of page-at-a-time processing. These existing >>> constraints always came from the table AM side, so it's not as if your >>> patch can remain totally neutral on these questions. >>> >> >> Not sure I understand. Which part of my sentence you disagree with? Or >> what constraints you mean? > > What I was saying here was something I said more clearly a bit further > down: it's technically possible to do multi-page batches within the > confines of the current index AM API, but that's not true in any > practical sense. And it'll never be true with an API that looks very > much like the current amgettuple API. > OK >> The interface does not require page-at-a-time processing - the index AM >> is perfectly within it's rights to produce a batch spanning 10 leaf >> pages, as long as it keeps track of them, and perhaps keeps some mapping >> of items (returned in the batch) to leaf pages. So that when the next >> batch is requested, it can do the cleanup, and move to the next batch. > > How does an index AM actually do that in a way that's useful? It only > sees a small part of the picture. That's why it's the wrong place for > it. > Sure, maybe it'd need some more information - say, how many items we expect to read, but if indexam knows that bit, surely it can pass it down to the AM. But yeah, I agree doing it in amgettuple() would be inconvenient and maybe even awkward. I can imagine the AM maintaining an array of currPos, but then it'd also need to be made aware of multiple positions, and stuff like that. Which it shouldn't need to know about. >>> Basically, it looks like you've invented a shadow batching interface >>> that is technically not known to the index AM, but nevertheless >>> coordinates with the existing so->currPos batching interface. >>> >> >> Perhaps, but which part of that you consider a problem? Are you saying >> this shouldn't use the currPos stuff at all, and instead do stuff in >> some other way? > > I think that you should generalize the currPos stuff, and move it to > some other, higher level module. > By generalizing you mean defining a common struct serving the same purpose, but for all the index AMs? And the new AM callbacks would produce/consume this new struct, right? >> Does that mean not having a simple amgetbatch() callback, but some finer >> grained interface? Or maybe one callback that returns the next "AM page" >> (essentially the currPos), and then another callback to release it? >> >> (This is what I mean by "two-callback API" later.) > > I'm not sure. Why does the index AM need to care about the batch size > at all? It merely needs to read the next leaf page. The high level > understanding of batches and the leaf pages that constitute batches > lives elsewhere. > I don't think I suggested the index AM would need to know about the batch size. Only indexam.c would be aware of that, and would read enough stuff from the index to satisfy that. > The nbtree code will know about buffer pins held, in the sense that > it'll be the one setting the Buffer variables in the new scan > descriptor thing. But it's not going to remember to drop those buffer > pins on its own. It'll need to be told. 
So it's not ever really in > control. > Right. So those pins would be released after indexam invokes the second new callback, instructing the index AM to release everything associated with a chunk of items returned sometime earlier. >>> The index AM itself should no longer know about the current next tuple >>> to return, nor about mark/restore. It is no longer directly in control >>> of the scan's progress. It loses all context that survives across API >>> calls. >>> >> >> I'm lost. How could the index AM not know about mark/restore? > > Restoring a mark already works by restoring an earlier so->currPos > batch. Actually, more often it works by storing an offset into the > current so->currPos, without actually copying anything into > so->markPos, and without restoring so->markPos into so->currPos. > > In short, there is virtually nothing about how mark/restore works that > really needs to live inside nbtree. It's all just restoring an earlier > batch and/or offset into a batch. The only minor caveat is the stuff > about array keys that I went into already -- that isn't quite a piece > of state that lives in so->currPos, but it's a little bit like that. > > You can probably poke one or two more minor holes in some of this -- > it's not 100% trivial. But it's doable. > OK. The thing that worries me is whether it's going to be this simple for other AMs. Maybe it is, I don't know. >> I don't think the batching in various AMs is particularly unique, that's >> true. But my goal was to wrap that in a single amgetbatch callback, >> because that seemed natural, and that moves some of the responsibilities >> to the AM. > > Why is it natural? I mean all of the index AMs that support amgettuple > copied everything from ntree already. Including all of the > kill_prior_tuple stuff. It's already quite generic. > I don't recall my reasoning, and I'm not saying it was the right instinct. But if we have one callback to read tuples, it seemed like maybe we should have one callback to read a bunch of tuples in a similar way. >> I still don't quite understand what API you imagine, but if >> we want to make more of this the responsibility of indexam.c, I guess it >> will require multiple smaller callbacks (I'm not opposed to that, but I >> also don't know if that's what you imagine). > > I think that you understood me correctly here. > >>> Most individual calls to btgettuple just return the next batched-up >>> so->currPos tuple/TID via another call to _bt_next. Things like the >>> _bt_first-new-primitive-scan case don't really add any complexity -- >>> the core concept of processing a page at a time still applies. It >>> really is just a simple batching scheme, with a couple of extra fiddly >>> details attached to it -- but nothing too hairy. >>> >> >> True, although the details (how the batches are represented etc.) are >> often quite different, so did you imagine some shared structure to >> represent that, or wrapping that in a new callback? > > In what sense are they sometimes different? > > In general batches will consist of one or more groups of tuples, each > of which is associated with a particular leaf page (if the scan > returns no tuples for a given scanned leaf page then it won't form a > part of the final batch). You can do amgettuple style scrolling back > and forth with this structure, across page boundaries. Seems pretty > general to me. > I meant that each of the AMs uses a separate typedef, with different fields, etc. 
I'm sure there are similarities (it's always an array of elements, either TIDs, index or heap tuples, or some combination of that). But maybe there is stuff unique to some AMs - chances are that can be either "generalized" or extended using some private member. >> Yes, I understand that. Getting the overall design right is my main >> concern, even if some of the advanced stuff is not implemented until >> later. But with the wrong design, that may turn out to be difficult. >> >> That's the feedback I was hoping for when I kept bugging you, and this >> discussion was already very useful in this regard. Thank you for that. > > I don't want to insist on doing all this. But it just seems really > weird to have this shadow batching system for the so->currPos batches. > >>> The index AM API says that they need to hold on to a buffer pin to >>> avoid confusing scans due to concurrent TID recycling by VACUUM. The >>> index AM API fails to adequately describe what is expected here. And >>> it provides no useful context for larger batching of index pages. >>> nbtree already does its own thing by dropping leaf page pins >>> selectively. >>> >> >> Not sure I understand. I imagined the index AM would just read a >> sequence of leaf pages, keeping all the same pins etc. just like it does >> for the one leaf it reads right now (pins, etc.). > > Right. But it wouldn't necessarily drop the leaf pages right away. It > might try to coalesce together multiple heap page accesses, for index > tuples that happen to span page boundaries (but are part of the same > higher level batch). > No opinion, but it's not clear to me how exactly would this work. I've imagined we'd just acquire (and release) multiple pins as we go. >> I'm probably too dumb for that, but I still don't quite understand how >> that's different from just reading and processing that sequence of leaf >> pages by amgettuple without batching. > > It's not so much different, as just more flexible. It's possible that > v1 would effectively do exactly the same thing in practice. It'd only > be able to do fancier things with holding onto leaf pages in a debug > build, that validated the general approach. > >> True, although the necessary context could be passed to the index AM in >> some way. That's what happens in the current patch, where indexam.c >> could size the batch just right for a LIMIT clause, before asking the >> index AM to fill it with items. > > What difference does it make where it happens? It might make some > difference, but as I keep saying, the important point is that > *somebody* has to know all of these things at the same time. > Agreed. >>> I don't think that it'll ultimately be all that hard to schedule when >>> and how index pages are read from outside of the index AM in question. >>> In general all relevant index AMs already work in much the same way >>> here. Maybe we can ultimately invent a way for the index AM to >>> influence that scheduling, but that might never be required. >>> >> >> I haven't thought about scheduling at all. Maybe there's something we >> could improve in the future, but I don't see what would it look like, >> and it seems unrelated to this patch. > > It's only related to this patch in the sense that we have to imagine > that it'll be worth having in some form in the future. > > It might also be a good exercise architecturally. We don't need to do > the same thing in several slightly different ways in each index AM. 
> Could you briefly outline how you think this might interact with the scheduling of index page reads? I can imagine telling someone about which future index pages we might need to read (say, the next leaf page), or something like that. But this patch is about prefetching the heap pages, so it seems like an entirely independent thing. And ISTM there are concurrency challenges with prefetching index pages (at least when leveraging the read stream API to do async reads). regards -- Tomas Vondra
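For context on what feeding heap prefetching from queued TIDs could look like, here is a rough sketch of a "give me the next block number" style callback. This is not PostgreSQL's actual read stream API, just the general shape of a per-scan block source, with names invented for illustration:

    /*
     * Illustrative block-number source for heap prefetching, driven by the
     * heap blocks of TIDs already queued from the index (in scan order).
     */
    #include <stdint.h>

    typedef uint32_t BlockNumber;
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

    typedef struct HeapPrefetchSource
    {
        BlockNumber *blocks;        /* heap blocks for queued index TIDs */
        int          nblocks;
        int          next;          /* position of the prefetch cursor */
        BlockNumber  lastReturned;  /* used to skip immediate duplicates */
    } HeapPrefetchSource;

    static BlockNumber
    next_block_to_prefetch(HeapPrefetchSource *src)
    {
        while (src->next < src->nblocks)
        {
            BlockNumber blk = src->blocks[src->next++];

            if (blk != src->lastReturned)   /* back-to-back repeats add nothing */
            {
                src->lastReturned = blk;
                return blk;
            }
        }
        return InvalidBlockNumber;          /* no more lookahead available yet */
    }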
On Sun, Nov 10, 2024 at 4:41 PM Tomas Vondra <tomas@vondra.me> wrote: > Is it a good idea to make this part (in indexam.c) aware of / > responsible for managing stuff like pins? My sense is that that's the right long term architectural direction. I can't really prove it. > Perhaps it'd work fine for > index AMs that always return an array of items for a single leaf-page > (like btree or hash). But I'm still thinking about cases like gist with > ORDER BY clauses, or maybe something even weirder in custom AMs. Nothing is perfect. What you really have to worry about not supporting is index AMs that implement amgettuple -- AMs that aren't quite a natural fit for this. At least for in-core index AMs that's really just GiST (iff KNN GiST is in use, which it usually isn't) plus SP-GiST. AFAIK most out-of-core index AMs only support lossy index scans in practice. Just limiting yourself to that makes an awful lot of things easier. For example I think that GIN gets away with a lot by only supporting lossy scans -- there's a comment above ginInsertCleanup() that says "On first glance it looks completely not crash-safe", but stuff like that is automatically okay with lossy scans. So many index AMs automatically don't need to be considered here at all. > It seems to me knowing which pages may be pinned is very AM-specific > knowledge, and my intention was to let the AM to manage that. This is useful information, because it helps me to understand how you're viewing this. I totally disagree with this characterization. This is an important difference in perspective. IMV index AMs hardly care at all about holding onto buffer pins, very much unlike heapam. I think that holding onto pins and whatnot has almost nothing to do with the index AM as such -- it's about protecting against unsafe concurrent TID recycling, which is a table AM/heap issue. You can make a rather weak argument that the index AM needs it for _bt_killitems, but that seems very secondary to me (if you go back long enough there are no _bt_killitems, but the pin thing itself still existed). As I pointed out before, the index AM API docs (at https://www.postgresql.org/docs/devel/index-locking.html) talk about holding onto buffer pins on leaf pages during amgettuple. So the need to mess around with pins just doesn't come from the index AM side, at all. The cleanup lock interlock against TID recycling protects the scan from seeing transient wrong answers -- it doesn't protect the index structure itself. The only thing that's a bit novel about what I'm proposing now is that I'm imagining that it'll be possible to eventually usefully schedule multi-leaf-page batches using code that has no more than a very general notion of how an ordered index scan works. That might turn out to be more complicated than I suppose it will now. If it is then it should still be fixable. > That is, > the new indexam code would be responsible for deciding when the "AM > batches" are loaded and released, using the two new callbacks. But it'd > be the AM responsible for making sure everything is released. What does it really mean for the index AM to be responsible for a thing? I think that the ReleaseBuffer() calls would be happening in index AM code, for sure. But that would probably always be called through your new index scan management code in practice. I don't have any fixed ideas about the resource management aspects of this. That doesn't seem particularly fundamental to the design. 
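Since the pin question keeps coming back to TID recycling rather than index structure, a conservative predicate for when a leaf pin could be dropped early might look roughly like the sketch below. The fields are simplified stand-ins, and the real rules would need to be spelled out far more rigorously than this:

    /*
     * Hedged sketch of the safety reasoning in this thread: a leaf page pin
     * is held to guard against concurrent TID recycling, so the central
     * layer (not the index AM) could decide when it is safe to drop early.
     * Deliberately conservative; not the actual rules.
     */
    #include <stdbool.h>

    typedef struct ScanSafetyInfo
    {
        bool isMVCCSnapshot;    /* scan uses an MVCC snapshot */
        bool isIndexOnly;       /* index-only scans trust the visibility map */
        bool wantKillItems;     /* dead index entries still to be marked */
    } ScanSafetyInfo;

    static bool
    can_drop_leaf_pin_early(const ScanSafetyInfo *info)
    {
        if (!info->isMVCCSnapshot)
            return false;       /* non-MVCC scans rely on the interlock */
        if (info->isIndexOnly)
            return false;       /* VM-based answers must not see recycled TIDs */
        if (info->wantKillItems)
            return false;       /* killing items after dropping the pin needs extra care */
        return true;
    }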
> I agree that in the simple cases it's not difficult to determine what > pins we need for the sequence of tuples/pages. But is it guaranteed to > be that easy, and is it easy to communicate this information to the > indexam.c layer? I think that it's fairly generic. The amount of work required to read an index page is (in very round numbers) more or less uniform across index AMs. Maybe you'd need to have some kind of way of measuring how many pages you had to read without returning any tuples, for scheduling purposes -- that cost is a relevant cost, and so would probably have to be tracked. But that still seems fairly general -- any kind of ordered index scan is liable to sometimes scan multiple pages without having any index tuples to return. > Sure, maybe it'd need some more information - say, how many items we > expect to read, but if indexam knows that bit, surely it can pass it > down to the AM. What are you arguing for here? Practically speaking, I think that the best way to do it is to have one layer that manages all this stuff. It would also be possible to split it up any way you can think of, but why would you want to? I'm not asking you to solve these problems. I'm only suggesting that you move things in a direction that is amenable to adding these things later on. > By generalizing you mean defining a common struct serving the same > purpose, but for all the index AMs? And the new AM callbacks would > produce/consume this new struct, right? Yes. > I don't think I suggested the index AM would need to know about the > batch size. Only indexam.c would be aware of that, and would read enough > stuff from the index to satisfy that. I don't think that you'd ultimately want to make the batch sizes fixed (though they'd probably always consist of tuples taken from 1 or more index pages). Ultimately the size would vary over time, based on competing considerations. > > The nbtree code will know about buffer pins held, in the sense that > > it'll be the one setting the Buffer variables in the new scan > > descriptor thing. But it's not going to remember to drop those buffer > > pins on its own. It'll need to be told. So it's not ever really in > > control. > > > Right. So those pins would be released after indexam invokes the second > new callback, instructing the index AM to release everything associated > with a chunk of items returned sometime earlier. Yes. It might all look very similar to today, at least for your initial committed version. You might also want to combine reading the next page with dropping the pin on the previous page. But also maybe not. > OK. The thing that worries me is whether it's going to be this simple > for other AMs. Maybe it is, I don't know. Really? I mean if we're just talking about the subset of GiST scans that use KNN-GiST as well as SP-GiST scans not using your new facility, that seems quite acceptable to me. > I don't recall my reasoning, and I'm not saying it was the right > instinct. But if we have one callback to read tuples, it seemed like > maybe we should have one callback to read a bunch of tuples in a similar > way. The tuple-level interface will still need to exist, of course. It just won't be directly owned by affected index AMs. > I meant that each of the AMs uses a separate typedef, with different > fields, etc. I'm sure there are similarities (it's always an array of > elements, either TIDs, index or heap tuples, or some combination of > that).
But maybe there is stuff unique to some AMs - chances are that > can be either "generalized" or extended using some private member. Right. Maybe it won't even be that hard to do SP-GiST and KNN-GiST index scans with this too. > No opinion, but it's not clear to me how exactly would this work. I've > imagined we'd just acquire (and release) multiple pins as we go. More experimentation is required to get good intuitions about how useful it is to reorder stuff, to make heap prefetching work best. > Could you briefly outline how you think this might interact with the > scheduling of index page reads? I can imagine telling someone about > which future index pages we might need to read (say, the next leaf > page), or something like that. But this patch is about prefetching the > heap pages, so it seems like an entirely independent thing. I agree that prefetching of index pages themselves would be entirely independent (and probably much less useful). I wasn't talking about that at all, though. I was talking about the potential value in reading multiple leaf pages at a time as an enabler of heap prefetching -- to avoid "pipeline stalls" for heap prefetching, with certain workloads. The simplest example of how these two things (heap prefetching and eager leaf page reading) could be complementary is the idea of coalescing together accesses to the same heap page from TIDs that don't quite appear in order (when read from the index), but are clustered together. Not just clustered together on one leaf page -- clustered together on a few sibling leaf pages. (The exact degree to which you'd vary how many leaf pages you read at a time might need to be fully dynamic/adaptive.) We've talked about this already. Reading multiple index pages at a time could in general result in pinning/reading the same heap pages far less often. Imagine if our scan will inherently need to read a total of no more than 3 or 4 index leaf pages. Reading all of those leaf pages in one go probably doesn't add any real latency, but literally guarantees that no heap page will need to be accessed twice. So it's almost a hybrid of an index scan and bitmap index scan, offering the best of both worlds. -- Peter Geoghegan
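The coalescing argument is easy to see with a toy example: if the TIDs from a few sibling leaf pages are gathered before any heap I/O is issued, duplicate heap block numbers can be merged so each heap page is read or prefetched once. Tuples still have to be returned in index order, so this only applies to scheduling the heap accesses, not to the scan's output. The code below is purely illustrative and not from the patch:

    /*
     * Toy illustration of coalescing heap block numbers gathered from a
     * multi-leaf-page batch, so that no heap page is fetched twice.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef uint32_t BlockNumber;

    static int
    cmp_block(const void *a, const void *b)
    {
        BlockNumber ba = *(const BlockNumber *) a;
        BlockNumber bb = *(const BlockNumber *) b;

        return (ba > bb) - (ba < bb);
    }

    /* Sorts and de-duplicates in place; returns the number of distinct blocks. */
    static int
    coalesce_heap_blocks(BlockNumber *blocks, int nblocks)
    {
        int ndistinct = 0;

        qsort(blocks, nblocks, sizeof(BlockNumber), cmp_block);
        for (int i = 0; i < nblocks; i++)
            if (i == 0 || blocks[i] != blocks[i - 1])
                blocks[ndistinct++] = blocks[i];
        return ndistinct;
    }

    int
    main(void)
    {
        /* heap blocks for TIDs spanning (say) three sibling leaf pages */
        BlockNumber blocks[] = {17, 42, 17, 99, 42, 17, 100, 99};
        int n = coalesce_heap_blocks(blocks, 8);

        printf("%d distinct heap pages instead of 8 accesses\n", n);
        return 0;
    }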
On Sun, Nov 10, 2024 at 5:41 PM Peter Geoghegan <pg@bowt.ie> wrote: > > It seems to me knowing which pages may be pinned is very AM-specific > > knowledge, and my intention was to let the AM to manage that. > > This is useful information, because it helps me to understand how > you're viewing this. > > I totally disagree with this characterization. This is an important > difference in perspective. IMV index AMs hardly care at all about > holding onto buffer pins, very much unlike heapam. > > I think that holding onto pins and whatnot has almost nothing to do > with the index AM as such -- it's about protecting against unsafe > concurrent TID recycling, which is a table AM/heap issue. You can make > a rather weak argument that the index AM needs it for _bt_killitems, > but that seems very secondary to me (if you go back long enough there > are no _bt_killitems, but the pin thing itself still existed). Much of this discussion is going over my head, but I have a comment on this part. I suppose that when any code in the system takes a pin on a buffer page, the initial concern is almost always to keep the page from disappearing out from under it. There might be a few exceptions, but hopefully not many. So I suppose what is happening here is that index AM pins an index page so that it can read that page -- and then it defers releasing the pin because of some interlocking concern. So at any given moment, there's some set of pins (possibly empty) that the index AM is holding for its own purposes, and some other set of pins (also possibly empty) that the index AM no longer requires for its own purposes but which are still required for heap/index interlocking. The second set of pins could possibly be managed in some AM-agnostic way. The AM could communicate that after the heap is done with X set of TIDs, it can unpin Y set of pages. But the first set of pins are of direct and immediate concern to the AM. Or at least, so it seems to me. Am I confused? -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Nov 11, 2024 at 12:23 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I think that holding onto pins and whatnot has almost nothing to do > > with the index AM as such -- it's about protecting against unsafe > > concurrent TID recycling, which is a table AM/heap issue. You can make > > a rather weak argument that the index AM needs it for _bt_killitems, > > but that seems very secondary to me (if you go back long enough there > > are no _bt_killitems, but the pin thing itself still existed). > > Much of this discussion is going over my head, but I have a comment on > this part. I suppose that when any code in the system takes a pin on a > buffer page, the initial concern is almost always to keep the page > from disappearing out from under it. That almost never comes up in index AM code, though -- cases where you simply want to avoid having an index page evicted do exist, but are naturally very rare. I think that nbtree only does this during page deletion by VACUUM, since it works out to be slightly more convenient to hold onto just the pin at one point where we quickly drop and reacquire the lock. Index AMs find very little use for pins that don't naturally coexist with buffer locks. And even the supposed exception that happens for page deletion could easily be replaced by just dropping the pin and the lock (there'd just be no point in it). I almost think of "pin held" and "buffer lock held" as synonymous when working on the nbtree code, even though you have this one obscure page deletion case where that isn't quite true (plus the TID recycle safety business imposed by heapam). As far as protecting the structure of the index itself is concerned, holding on to buffer pins alone does not matter at all. I have a vague recollection of hash doing something novel with cleanup locks, but I also seem to recall that that had problems -- I think that we got rid of it not too long back. In any case my mental model is that cleanup locks are for the benefit of heapam, never for the benefit of index AMs themselves. This is why we require cleanup locks for nbtree VACUUM but not nbtree page deletion, even though both operations perform precisely the same kinds of page-level modifications to the index leaf page. > There might be a few exceptions, > but hopefully not many. So I suppose what is happening here is that > index AM pins an index page so that it can read that page -- and then > it defers releasing the pin because of some interlocking concern. So > at any given moment, there's some set of pins (possibly empty) that > the index AM is holding for its own purposes, and some other set of > pins (also possibly empty) that the index AM no longer requires for > its own purposes but which are still required for heap/index > interlocking. That summary is correct, but FWIW I find the emphasis on index pins slightly odd from an index AM point of view. The nbtree code virtually always calls _bt_getbuf and _bt_relbuf, as opposed to independently acquiring pins and locks -- that's why "lock" and "pin" seem almost synonymous to me in nbtree contexts. Clearly no index AM should hold onto a buffer lock for more than an instant, so my natural instinct is to wonder why you're even talking about buffer pins or buffer locks that the index AM cares about directly. As I said to Tomas, yeah, the index AM kinda sometimes needs to hold onto a leaf page pin to be able to correctly perform _bt_killitems. But this is only because it needs to reason about concurrent TID recycling. 
So this is also not really any kind of exception. (_bt_killitems is even prepared to reason about cases where no pin was held at all, and has been since commit 2ed5b87f96.) > The second set of pins could possibly be managed in some > AM-agnostic way. The AM could communicate that after the heap is done > with X set of TIDs, it can unpin Y set of pages. But the first set of > pins are of direct and immediate concern to the AM. > > Or at least, so it seems to me. Am I confused? I think that this is exactly what I propose to do, said in a different way. (Again, I wouldn't have expressed it in this way because it seems obvious to me that buffer pins don't have nearly the same significance to an index AM as they do to heapam -- they have no value in protecting the index structure, or helping an index scan to reason about concurrency that isn't due to a heapam issue.) Does that make sense? -- Peter Geoghegan
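For readers following along, the distinction being drawn here between the two "sets" of pins can be illustrated with the generic buffer manager calls. This is a sketch, not actual nbtree code (nbtree wraps the same pattern in _bt_getbuf()/_bt_relbuf()):

    static void
    read_leaf_page_sketch(Relation index, BlockNumber blkno)
    {
        Buffer      buf = ReadBuffer(index, blkno); /* acquires the pin */

        LockBuffer(buf, BUFFER_LOCK_SHARE);         /* lock only while reading the page */
        /* ... copy the matching heap TIDs into local scan state ... */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);

        /*
         * The pin may be retained past this point, but not to protect the
         * index structure -- it only blocks concurrent heap TID recycling
         * until the heap fetches (and possibly _bt_killitems) are done.
         * Once that no longer matters:
         */
        ReleaseBuffer(buf);
    }

The pin+lock pair at the top is what the index AM needs for its own purposes; the pin retained after the lock is dropped is the heap/index interlock, which is the part that could plausibly be managed by an AM-agnostic layer.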
On Mon, Nov 11, 2024 at 1:03 PM Peter Geoghegan <pg@bowt.ie> wrote: > I almost think of "pin held" and "buffer lock held" as synonymous when > working on the nbtree code, even though you have this one obscure page > deletion case where that isn't quite true (plus the TID recycle safety > business imposed by heapam). As far as protecting the structure of the > index itself is concerned, holding on to buffer pins alone does not > matter at all. That makes sense from the point of view of working with the btree code itself, but from a system-wide perspective, it's weird to pretend like the pins don't exist or don't matter just because a buffer lock is also held. I had actually forgotten that the btree code tends to pin+lock together; now that you mention it, I remember that I knew it at one point, but it fell out of my head a long time ago... > I think that this is exactly what I propose to do, said in a different > way. (Again, I wouldn't have expressed it in this way because it seems > obvious to me that buffer pins don't have nearly the same significance > to an index AM as they do to heapam -- they have no value in > protecting the index structure, or helping an index scan to reason > about concurrency that isn't due to a heapam issue.) > > Does that make sense? Yeah, it just really throws me for a loop that you're using "pin" to mean "pin at a time when we don't also hold a lock." The fundamental purpose of a pin is to prevent a buffer from being evicted while someone is in the middle of looking at it, and nothing that uses buffers can possibly work correctly without that guarantee. Everything you've written in parentheses there is, AFAICT, 100% wrong if you mean "any pin" and 100% correct if you mean "a pin held without a corresponding lock." -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Nov 11, 2024 at 1:33 PM Robert Haas <robertmhaas@gmail.com> wrote: > That makes sense from the point of view of working with the btree code > itself, but from a system-wide perspective, it's weird to pretend like > the pins don't exist or don't matter just because a buffer lock is > also held. I can see how that could cause confusion. If you're working on nbtree all day long, it becomes natural, though. Both points are true, and relevant to the discussion. I prefer to over-communicate when discussing these points -- it's too easy to talk past each other here. I think that the precise reasons why the index AM does things with buffer pins will need to be put on a more rigorous and formalized footing with Tomas' patch. The different requirements/safety considerations will have to be carefully teased apart. > I had actually forgotten that the btree code tends to > pin+lock together; now that you mention it, I remember that I knew it > at one point, but it fell out of my head a long time ago... The same thing appears to mostly be true of hash, which mostly uses _hash_getbuf + _hash_relbuf (hash's idiosyncratic use of cleanup locks notwithstanding). To be fair it does look like GiST's gistdoinsert function holds onto multiple buffer pins at a time, for its own reasons -- index AM reasons. But this looks to be more or less an optimization to deal with navigating the tree with a loose index order, where multiple descents and ascents are absolutely expected. (This makes it a bit like the nbtree "drop lock but not pin" case that I mentioned in my last email.) It's not as if these gistdoinsert buffer pins persist across calls to amgettuple, though, so for the purposes of this discussion about the new batch API to replace amgettuple they are not relevant -- they don't actually undermine my point. (Though to be fair their existence does help to explain why you found my characterization of buffer pins as irrelevant to index AMs confusing.) The real sign that what I said is generally true of index AMs is that you'll see so few calls to LockBufferForCleanup/ConditionalLockBufferForCleanup. Only hash calls ConditionalLockBufferForCleanup at all (which I find a bit weird). Both GiST and SP-GiST call neither functions -- even during VACUUM. So GiST and SP-GiST make clear that index AMs (that support only MVCC snapshot scans) can easily get by without any use of cleanup locks (and with no externally significant use of buffer pins). > > I think that this is exactly what I propose to do, said in a different > > way. (Again, I wouldn't have expressed it in this way because it seems > > obvious to me that buffer pins don't have nearly the same significance > > to an index AM as they do to heapam -- they have no value in > > protecting the index structure, or helping an index scan to reason > > about concurrency that isn't due to a heapam issue.) > > > > Does that make sense? > > Yeah, it just really throws me for a loop that you're using "pin" to > mean "pin at a time when we don't also hold a lock." I'll try to be more careful about that in the future, then. > The fundamental > purpose of a pin is to prevent a buffer from being evicted while > someone is in the middle of looking at it, and nothing that uses > buffers can possibly work correctly without that guarantee. Everything > you've written in parentheses there is, AFAICT, 100% wrong if you mean > "any pin" and 100% correct if you mean "a pin held without a > corresponding lock." I agree. -- Peter Geoghegan
On Mon, Nov 11, 2024 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > The real sign that what I said is generally true of index AMs is that > you'll see so few calls to > LockBufferForCleanup/ConditionalLockBufferForCleanup. Only hash calls > ConditionalLockBufferForCleanup at all (which I find a bit weird). > Both GiST and SP-GiST call neither functions -- even during VACUUM. So > GiST and SP-GiST make clear that index AMs (that support only MVCC > snapshot scans) can easily get by without any use of cleanup locks > (and with no externally significant use of buffer pins). Actually, I'm pretty sure that it's wrong for GiST VACUUM to not acquire a full cleanup lock (which used to be called a super-exclusive lock in index AM contexts), as I went into some years ago: https://www.postgresql.org/message-id/flat/CAH2-Wz%3DPqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt%3D5qxWQ%40mail.gmail.com I plan on playing around with injection points soon. I might try my hand at proving that GiST VACUUM needs to do more here to avoid breaking concurrent GiST index-only scans. Issues such as this are why I place so much emphasis on formalizing all the rules around TID recycling and dropping pins with index scans. I think that we're still a bit sloppy about things in this area. -- Peter Geoghegan
Hi, Since the patch has needed a rebase since mid February and is in Waiting on Author since mid March, I think it'd be appropriate to mark this as Returned with Feedback for now? Or at least moved to the next CF? Greetings, Andres Freund
On 4/2/25 18:05, Andres Freund wrote: > Hi, > > Since the patch has needed a rebase since mid February and is in Waiting on > Author since mid March, I think it'd be appropriate to mark this as Returned > with Feedback for now? Or at least moved to the next CF? > Yes, I agree. regards -- Tomas Vondra
Hi, here's an improved (rebased + updated) version of the patch series, with some significant fixes and changes. The patch adds infrastructure and modifies btree indexes to do prefetching - and AFAIK it passes all tests (no results, correct results). There's still a fair amount of work to be done, of course - the btree changes are not very polished, more time needs to be spent on profiling and optimization, etc. And I'm sure that while the patch passes tests, there certainly are bugs. Compared to the last patch version [1] shared on list (in November), there's a number of significant design changes - a lot of this is based on a number of off-list discussions I had with Peter Geoghegan, which was very helpful. Let me try to sum the main conclusions and changes: 1) patch now relies on read_stream The November patch still relied on sync I/O and PrefetchBuffer(). At some point I added a commit switching it to read_stream - which turned out non-trivial, especially for index-only scans. But it works, and for a while I kept it separate - with PrefetchBuffer first, and a switch to read_stream later. But then I realized it does not make much sense to keep the first part - why would we introduce a custom fadvise-based prefetch, only to immediately rip it out and replace it with with read_stream code with a comparable amount of complexity, right? So I squashed these two parts, and the patch now does read_stream (for the table reads) from the beginning. 2) two new index AM callbacks - amgetbatch + amfreebatch The [1] patch introduced a new callback for reading a "batch" (essentially a leaf page) from the index. But there was a limitation of only allowing a single batch at a time, which was causing trouble with prefetch distance and read_stream stalls at the end of the batch, etc. Based on the discussions with Peter I decided to make this a bit more ambitious, moving the whole batch management from the index AM to the indexam.c level. So now there are two callbacks - amgetbatch and amfreebatch, and it's up to indexam.c to manage the batches - decide how many batches to allow, etc. The index AM is responsible merely for loading the next batch, but does not decide when to load or free a batch, how many to keep in memory, etc. There's a section in indexam.c with a more detailed description of the design, I'm not going to explain all the design details here. In a way, this design is a compromise between the initial AM-level approach I presented as a PoC at pgconf.dev 2023, and the executor level approach I shared a couple months back. Each of those "extreme" cases had it's issues with either happening "too deep" or "too high" - being too integrated in the AM, or not having enough info about the AM. I think the indexam.c is a sensible layer for this. I was hoping doing this at the "executor level" would mean no need for AM code changes, but that turned out not possible - the AM clearly needs to know about the batch boundaries, so that it can e.g. do killtuples, etc. That's why we need the two callbacks (not just the "amgetbatch" one). At least this way it's "hidden" by the indexam.c API, like index_getnext_slot(). (You could argue indexam.c is "executor" and maybe it is - I don't know where exactly to draw the line. I don't think it matters, really. The "hidden in indexam API" is the important bit.) 3) btree prefetch The patch implements the new callbacks only for btree indexes, and it's not very pretty / clean - it's mostly a massaged version of the old code backing amgettuple(). 
This needs cleanup/improvements, and maybe refactoring to allow reusing more of the code, etc. Or maybe we should even rip out amgettuple() entirely, and only support one of those for each AM? That's what Peter suggested, but I'm not convinced we should do that. For now it was very useful to be able to flip between the APIs by setting a GUC, and I left prefetching disabled in some places (e.g. when accessing catalogs, ...) that are unlikely to benefit. But more importantly, I'm not 100% sure we want to require the index AMs to support prefetching for all cases - if we do, a single "can't prefetch" case would mean we can't prefetch anything for that AM. In particular, I'm thinking about GiST / SP-GiST and indexes ordered by distance, which don't return items in leaf pages but sort them through a binary heap. Maybe we can do prefetch for that, but if we can't it would be silly if it meant we can't do prefetch for any other SP-GiST queries. Anyway, the current patch only implements prefetch for btree. I expect it won't be difficult to do this for other index AMs, considering how similar the design usually is to btree. This is one of the next things on my TODO. I want to be able to validate the design works for multiple AMs, not just btree. 4) duplicate blocks While working on the patch, I realized the old index_fetch_heap code skips reads for duplicate blocks - if the TID matches the immediately preceding block, ReleaseAndReadBuffer() skips most of the work. But read_stream() doesn't do that - if the callback returns the same block, it starts a new read for it, pins it, etc. That can be quite expensive, and I've seen a couple of cases where the impact was not negligible (correlated index, fits in memory, ...). I've speculated that maybe read_stream_next_buffer() should detect and handle these cases better - not unlike it detects sequential reads. It might even keep a small cache of already requested reads, etc. so that it can handle a wider range of workloads, not just perfect duplicates. But it does not do that, and I'm not sure if/when that will happen. So for now I simply reproduced the "skip duplicate blocks" behavior. It's not as simple with read_stream, because this logic needs to happen in two places - in the callback (when generating reads), and then also when reading the blocks from the stream - if these places get "out of sync" the stream won't return the blocks expected by the reader. But it does work, and it's not that complex. But there's an issue with prefetch distance ... 5) prefetch distance Traditionally, we measure distance in "tuples" - e.g. in bitmap heap scan, we make sure we prefetched pages for X tuples ahead. But that's not what read_stream does for prefetching - it works with pages. That can cause various issues. Consider for example the "skip duplicate blocks" optimization described in (4). And imagine a perfectly correlated index, with ~200 items per leaf page. The heap tuples are likely wider, let's say we have 50 of them per page. That means that for each leaf page, we have only ~4 heap blocks. With effective_io_concurrency=16 the read_stream will try to prefetch 16 heap pages, that's 3200 index entries. Is that what we want? I'm not quite sure, maybe it's OK? It sure is not quite what I expected. But now imagine an index-only scan on a nearly all-visible table.
If the fraction of index entries that don't pass the visibility check is very low, we can quickly get into a situation where the read_stream has to read a lot of leaf pages to get the next block number. Sure, we'd need to read that block number eventually, but doing it this early means we may need to keep the batches (leaf pages) around - a lot of them, actually. Essentially, pick a number and I can construct an IOS that needs to keep more batches. I think this is a consequence of read_stream having an internal idea how far ahead to prefetch, based on the number of requests it got so far, measured in heap blocks. It has no idea about the context (how that maps to index entries, batches we need to keep in memory, ...). Ideally, we'd be able to give this feedback to read_stream in some way, say by "pausing" it when we get too far ahead in the index. But we don't have that - the only thing we can do is to return InvalidBlockNumber to the stream, so that it stops. And then we need to "reset" the stream, and let it continue - but only after we consumed all scheduled reads. In principle it's very similar to the "pause/resume" I mentioned, except that it requires completely draining the queue - a pipeline stall. That's not great, but hopefully it's not very common, and more importantly - it only happens when only a tiny fraction of the index items requires a heap block. So that's what the patch does. I think it's acceptable, but some optimizations may be necessary (see next section). 6) performance and optimization It's not difficult to construct cases where the prefetching is a huge improvement - 5-10x speedup for a query is common, depending on the hardware, dataset, etc. But there are also cases where it doesn't (and can't) help very much. For example fully-cached data, or index-only scans of all-visible tables. I've done basic benchmarking based on that (I'll share some results in the coming days), and in various cases I see a consistent regression in the 10-20% range. The queries are very short (~1ms) and there's a fair amount of noise, but it seems fairly consistent. I haven't figured out the root cause(s) yet, but I believe there are a couple of contributing factors: (a) read_stream adds a bit of complexity/overhead, but these cases worked great with just the sync API and can't benefit from the prefetching. (b) There are inefficiencies in how I integrated read_stream into the btree AM. For example every batch allocates the same buffer that btbeginscan allocates, which turned out to be an issue before [2] - and now we do that for every batch, not just once per scan - that's not great. (c) Possibly the prefetch distance issue from (5) might matter too. regards [1] https://www.postgresql.org/message-id/accd03eb-0379-416d-9936-41a4de3c47ef%40vondra.me [2] https://www.postgresql.org/message-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com -- Tomas Vondra
Attachments
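As a rough illustration of the "skip duplicate blocks" logic and the InvalidBlockNumber/reset workaround described in (4) and (5) above, a read_stream block-number callback could look something like the sketch below. The ReadStream callback signature, InvalidBlockNumber and read_stream_reset() are the real APIs; the PrefetchState struct and the helper that produces the next TID's block are invented for the sketch:

    typedef struct PrefetchState
    {
        IndexScanDesc   scan;       /* scan we are prefetching for */
        BlockNumber     lastBlock;  /* last block number queued into the stream */
    } PrefetchState;

    static BlockNumber
    index_prefetch_next_block(ReadStream *stream,
                              void *callback_private_data,
                              void *per_buffer_data)
    {
        PrefetchState *ps = (PrefetchState *) callback_private_data;
        BlockNumber blkno;

        for (;;)
        {
            /* hypothetical helper: consume the next TID's block number */
            if (!index_prefetch_pop_block(ps->scan, &blkno))
                return InvalidBlockNumber;  /* ends the stream; the caller must
                                             * drain it and call read_stream_reset()
                                             * before continuing */

            if (blkno == ps->lastBlock)
                continue;       /* duplicate - the reader will reuse the buffer
                                 * it already holds, so don't queue it again */

            ps->lastBlock = blkno;
            return blkno;
        }
    }

The easy-to-get-wrong part is the one mentioned in the email: the consumer of the stream has to make exactly the same skip-or-read decision for each TID, otherwise the buffers coming out of the stream no longer line up with the TIDs the scan is processing.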
On Tue, Apr 22, 2025 at 6:46 AM Tomas Vondra <tomas@vondra.me> wrote: > here's an improved (rebased + updated) version of the patch series, with > some significant fixes and changes. The patch adds infrastructure and > modifies btree indexes to do prefetching - and AFAIK it passes all tests > (no results, correct results). Cool! > Compared to the last patch version [1] shared on list (in November), > there's a number of significant design changes - a lot of this is based > on a number of off-list discussions I had with Peter Geoghegan, which > was very helpful. Thanks for being so receptive to my feedback. I know that I wasn't particularly clear. I mostly only gave you my hand-wavy, caveat-laden ideas about how best to layer things. But you were willing to give them full and fair consideration. > 1) patch now relies on read_stream > So I squashed these two parts, and the patch now does read_stream (for > the table reads) from the beginning. Make sense. > Based on the discussions with Peter I decided to make this a bit more > ambitious, moving the whole batch management from the index AM to the > indexam.c level. So now there are two callbacks - amgetbatch and > amfreebatch, and it's up to indexam.c to manage the batches - decide how > many batches to allow, etc. The index AM is responsible merely for > loading the next batch, but does not decide when to load or free a > batch, how many to keep in memory, etc. > > There's a section in indexam.c with a more detailed description of the > design, I'm not going to explain all the design details here. To me, the really important point about this high-level design is that it provides a great deal of flexibility around reordering work, while still preserving the appearance of an index scan that performs work in the same old fixed order. All relevant kinds of work (whether table AM and index AM related work) are under the direct control of one single module. There's one central place for a mechanism that weighs both costs and benefits, keeping things in balance. (I realize that there's still some sense in which that isn't true, partly due to the read stream interface, but for now the important thing is that we're agreed on this high level direction.) > I think the indexam.c is a sensible layer for this. I was hoping doing > this at the "executor level" would mean no need for AM code changes, but > that turned out not possible - the AM clearly needs to know about the > batch boundaries, so that it can e.g. do killtuples, etc. That's why we > need the two callbacks (not just the "amgetbatch" one). At least this > way it's "hidden" by the indexam.c API, like index_getnext_slot(). Right. But (if I'm not mistaken) the index AM doesn't actually need to know *when* to do killtuples. It still needs to have some handling for this, since we're actually modifying index pages, and we need to have handling for certain special cases (e.g., posting list tuples) on the scan side. But it can be made to work in a way that isn't rigidly tied to the progress of the scan -- it's perfectly fine to do this work somewhat out of order, if that happens to make sense. It doesn't have to happen in perfect lockstep with the scan, right after the items from the relevant leaf page have all been returned. It should also eventually be possible to do things like perform killtuples in a different process (perhaps even thread?) to the one that originally read the corresponding leaf page items. That's the kind of long term goal to keep in mind, I feel. 
> (You could argue indexam.c is "executor" and maybe it is - I don't know > where exactly to draw the line. I don't think it matters, really. The > "hidden in indexam API" is the important bit.) The term that I've used is "index scan manager", since it subsumes some of the responsibilities related to scheduling work that has traditionally been under the control of index AMs. I'm not attached to that name, but we should agree upon some name for this new concept. It is a new layer, above the index AM but below the executor proper, and so it feels like it needs to be clearly distinguished from the two adjoining layers. > Or maybe we should > even rip out the amgettuple() entirely, and only support one of those > for each AM? That's what Peter suggested, but I'm not convinced we > should do that. Just to be clear, for other people reading along: I never said that we should fully remove amgettuple as an interface. What I said was that I think that we should remove btgettuple(), and any other amgettuple routine within index AMs that switch over to using the new interface. I'm not religious about removing amgettuple() from index AMs that also support the new batch interface. It's probably useful to keep around for now, for debugging purposes. My point was only this: I know of no good reason to keep around btgettuple in the first committed version of the patch. So if you're going to keep it around, you should surely have at least one explicit reason for doing so. I don't remember hearing such a reason? Even if there is such a reason, maybe there doesn't have to be. Maybe this reason can be eliminated by improving the batch design such that we no longer need btgettuple at all (not even for catalogs). Or maybe it won't be so easy -- maybe we'll have to keep around btgettuple after all. Either way, I'd like to know the details. > For now it was very useful to be able to flip between the APIs by > setting a GUC, and I left prefetching disabled in some places (e.g. when > accessing catalogs, ...) that are unlikely to benefit. But more > importantly, I'm not 100% we want to require the index AMs to support > prefetching for all cases - if we do, a single "can't prefetch" case > would mean we can't prefetch anything for that AM. I don't see why prefetching should be mandatory with this new interface. Surely it has to have adaptive "ramp-up" behavior already, even when we're pretty sure that prefetching is a good idea from the start? > In particular, I'm thinking about GiST / SP-GiST and indexes ordered by > distance, which don't return items in leaf pages but sort them through a > binary heap. Maybe we can do prefetch for that, but if we can't it would > be silly if it meant we can't do prefetch for any other SP-GiST queries. Again, I would be absolutely fine with continuing to support the amgettuple interface indefinitely. Again, my only concern is with index AMs that support both the old and new interfaces at the same time. > Anyway, the current patch only implements prefetch for btree. I expect > it won't be difficult to do this for other index AMs, considering how > similar the design usually is to btree. > > This is one of the next things on my TODO. I want to be able to validate > the design works for multiple AMs, not just btree. What's the most logical second index AM to support, after nbtree, then? Probably hash/hashgettuple? 
> I think this is a consequence of read_stream having an internal idea how > far ahead to prefetch, based on the number of requests it got so far, > measured in heap blocks. It has not idea about the context (how that > maps to index entries, batches we need to keep in memory, ...). I think that that just makes read_stream an awkward fit for index prefetching. You legitimately need to see all of the resources that are in flight. That context will really matter, at least at times. I'm much less sure what to do about it. Maybe using read_stream is still the right medium-term design. Further testing/perf validation is required to be able to say anything sensible about it. > But there are also cases where it doesn't (and can't) help very much. > For example fully-cached data, or index-only scans of all-visible > tables. I've done basic benchmarking based on that (I'll share some > results in the coming days), and in various cases I see a consistent > regression in the 10-20% range. The queries are very short (~1ms) and > there's a fair amount of noise, but it seems fairly consistent. I'd like to know more about these cases. I'll wait for your benchmark results, which presumably have examples of this. -- Peter Geoghegan
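To make the division of labour concrete, the two-callback batch interface discussed upthread might have roughly the following shape. The names, argument lists and the IndexScanBatch type are illustrative assumptions only, not the patch's actual definitions:

    /*
     * Load the next batch (roughly: the next leaf page) in the given
     * direction, or return NULL when the scan is exhausted.  The AM decides
     * what goes into a batch, but not how many batches stay in memory --
     * that is up to indexam.c.
     */
    typedef IndexScanBatch *(*amgetbatch_function) (IndexScanDesc scan,
                                                    ScanDirection direction);

    /*
     * Hand a batch back to the AM once indexam.c no longer needs it.  This
     * is where the AM can do deferred work such as killtuples for items
     * marked dead, and drop the leaf page pin.
     */
    typedef void (*amfreebatch_function) (IndexScanDesc scan,
                                          IndexScanBatch *batch);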
On 4/22/25 18:26, Peter Geoghegan wrote: > On Tue, Apr 22, 2025 at 6:46 AM Tomas Vondra <tomas@vondra.me> wrote: >> here's an improved (rebased + updated) version of the patch series, with >> some significant fixes and changes. The patch adds infrastructure and >> modifies btree indexes to do prefetching - and AFAIK it passes all tests >> (no results, correct results). > > Cool! > >> Compared to the last patch version [1] shared on list (in November), >> there's a number of significant design changes - a lot of this is based >> on a number of off-list discussions I had with Peter Geoghegan, which >> was very helpful. > > Thanks for being so receptive to my feedback. I know that I wasn't > particularly clear. I mostly only gave you my hand-wavy, caveat-laden > ideas about how best to layer things. But you were willing to give > them full and fair consideration. > >> 1) patch now relies on read_stream > >> So I squashed these two parts, and the patch now does read_stream (for >> the table reads) from the beginning. > > Make sense. > >> Based on the discussions with Peter I decided to make this a bit more >> ambitious, moving the whole batch management from the index AM to the >> indexam.c level. So now there are two callbacks - amgetbatch and >> amfreebatch, and it's up to indexam.c to manage the batches - decide how >> many batches to allow, etc. The index AM is responsible merely for >> loading the next batch, but does not decide when to load or free a >> batch, how many to keep in memory, etc. >> >> There's a section in indexam.c with a more detailed description of the >> design, I'm not going to explain all the design details here. > > To me, the really important point about this high-level design is that > it provides a great deal of flexibility around reordering work, while > still preserving the appearance of an index scan that performs work in > the same old fixed order. All relevant kinds of work (whether table AM > and index AM related work) are under the direct control of one single > module. There's one central place for a mechanism that weighs both > costs and benefits, keeping things in balance. > > (I realize that there's still some sense in which that isn't true, > partly due to the read stream interface, but for now the important > thing is that we're agreed on this high level direction.) > Yeah, that makes sense, although I've been thinking about this a bit differently. I haven't been trying to establish a new "component" to manage prefetching. For me the question was what's the right layer, so that unnecessary details don't leak into AM and/or executor. The AM could issue fadvise prefetches, or perhaps even feed blocks into a read_stream, but it doesn't seem like the right place to ever do more decisions. OTOH we don't want every place in the executor to reimplement the prefetching, and indexam.c seems like a good place in between. It requires exchanging some additional details with the AM, provided by the new callbacks. It seems the indexam.c achieves both your and mine goals, more or less. >> I think the indexam.c is a sensible layer for this. I was hoping doing >> this at the "executor level" would mean no need for AM code changes, but >> that turned out not possible - the AM clearly needs to know about the >> batch boundaries, so that it can e.g. do killtuples, etc. That's why we >> need the two callbacks (not just the "amgetbatch" one). At least this >> way it's "hidden" by the indexam.c API, like index_getnext_slot(). > > Right. 
But (if I'm not mistaken) the index AM doesn't actually need to > know *when* to do killtuples. It still needs to have some handling for > this, since we're actually modifying index pages, and we need to have > handling for certain special cases (e.g., posting list tuples) on the > scan side. But it can be made to work in a way that isn't rigidly tied > to the progress of the scan -- it's perfectly fine to do this work > somewhat out of order, if that happens to make sense. It doesn't have > to happen in perfect lockstep with the scan, right after the items > from the relevant leaf page have all been returned. > > It should also eventually be possible to do things like perform > killtuples in a different process (perhaps even thread?) to the one > that originally read the corresponding leaf page items. That's the > kind of long term goal to keep in mind, I feel. > Right. The amfreebatch() does not mean the batch needs to be freed immediately, it's just handed over back to the AM, and it's up to the AM to do the necessary cleanup at some point. It might queue it for later, or perhaps even do that in a separate thread ... >> (You could argue indexam.c is "executor" and maybe it is - I don't know >> where exactly to draw the line. I don't think it matters, really. The >> "hidden in indexam API" is the important bit.) > > The term that I've used is "index scan manager", since it subsumes > some of the responsibilities related to scheduling work that has > traditionally been under the control of index AMs. I'm not attached to > that name, but we should agree upon some name for this new concept. It > is a new layer, above the index AM but below the executor proper, and > so it feels like it needs to be clearly distinguished from the two > adjoining layers. > Yes. I wonder if we should introduce a separate abstraction for this, as a subset of indexam.c. >> Or maybe we should >> even rip out the amgettuple() entirely, and only support one of those >> for each AM? That's what Peter suggested, but I'm not convinced we >> should do that. > > Just to be clear, for other people reading along: I never said that we > should fully remove amgettuple as an interface. What I said was that I > think that we should remove btgettuple(), and any other amgettuple > routine within index AMs that switch over to using the new interface. > > I'm not religious about removing amgettuple() from index AMs that also > support the new batch interface. It's probably useful to keep around > for now, for debugging purposes. My point was only this: I know of no > good reason to keep around btgettuple in the first committed version > of the patch. So if you're going to keep it around, you should surely > have at least one explicit reason for doing so. I don't remember > hearing such a reason? > > Even if there is such a reason, maybe there doesn't have to be. Maybe > this reason can be eliminated by improving the batch design such that > we no longer need btgettuple at all (not even for catalogs). Or maybe > it won't be so easy -- maybe we'll have to keep around btgettuple > after all. Either way, I'd like to know the details. > My argument was (a) ability to disable prefetching, and fall back to the old code if needed, and (b) handling use cases where prefetching does not work / is not implemented, even if only temporarily (e.g. ordered scan in SP-GiST). Maybe (a) is unnecessarily defensive, and (b) may not be needed. Not sure. 
>> For now it was very useful to be able to flip between the APIs by >> setting a GUC, and I left prefetching disabled in some places (e.g. when >> accessing catalogs, ...) that are unlikely to benefit. But more >> importantly, I'm not 100% we want to require the index AMs to support >> prefetching for all cases - if we do, a single "can't prefetch" case >> would mean we can't prefetch anything for that AM. > > I don't see why prefetching should be mandatory with this new > interface. Surely it has to have adaptive "ramp-up" behavior already, > even when we're pretty sure that prefetching is a good idea from the > start? > Possibly, I may be too defensive. And perhaps in cases where we know the prefetching can't help we could disable that for the read_stream. >> In particular, I'm thinking about GiST / SP-GiST and indexes ordered by >> distance, which don't return items in leaf pages but sort them through a >> binary heap. Maybe we can do prefetch for that, but if we can't it would >> be silly if it meant we can't do prefetch for any other SP-GiST queries. > > Again, I would be absolutely fine with continuing to support the > amgettuple interface indefinitely. Again, my only concern is with > index AMs that support both the old and new interfaces at the same > time. > Understood. >> Anyway, the current patch only implements prefetch for btree. I expect >> it won't be difficult to do this for other index AMs, considering how >> similar the design usually is to btree. >> >> This is one of the next things on my TODO. I want to be able to validate >> the design works for multiple AMs, not just btree. > > What's the most logical second index AM to support, after nbtree, > then? Probably hash/hashgettuple? > I think hash should be fairly easy to support. But I was really thinking about doing SP-GiST, exactly because it's very different in some aspects, and I wanted to validate the design on that (for hash I think it's almost certain it's OK). >> I think this is a consequence of read_stream having an internal idea how >> far ahead to prefetch, based on the number of requests it got so far, >> measured in heap blocks. It has not idea about the context (how that >> maps to index entries, batches we need to keep in memory, ...). > > I think that that just makes read_stream an awkward fit for index > prefetching. You legitimately need to see all of the resources that > are in flight. That context will really matter, at least at times. > > I'm much less sure what to do about it. Maybe using read_stream is > still the right medium-term design. Further testing/perf validation is > required to be able to say anything sensible about it. > Agreed. That's why I've suggested it might help if the read_stream had ability to pause/resume in some way, without having to stall for a while (which the read_stream_reset workaround does). Based on what the read_next callback decides. >> But there are also cases where it doesn't (and can't) help very much. >> For example fully-cached data, or index-only scans of all-visible >> tables. I've done basic benchmarking based on that (I'll share some >> results in the coming days), and in various cases I see a consistent >> regression in the 10-20% range. The queries are very short (~1ms) and >> there's a fair amount of noise, but it seems fairly consistent. > > I'd like to know more about these cases. I'll wait for your benchmark > results, which presumably have examples of this. > I expect to have better data sometime next week. 
I think the cases affected by this the most are index-only scans on all-visible tables that fit into shared buffers, with a correlated/sequential access pattern. Or even regular index scans with all data in shared buffers. It also seems quite hardware / CPU dependent - I see a much worse impact on an older Xeon than on a new Ryzen. regards -- Tomas Vondra
On Tue, Apr 22, 2025 at 2:34 PM Tomas Vondra <tomas@vondra.me> wrote: > Yeah, that makes sense, although I've been thinking about this a bit > differently. I haven't been trying to establish a new "component" to > manage prefetching. For me the question was what's the right layer, so > that unnecessary details don't leak into AM and/or executor. FWIW that basically seems equivalent to what I said. If there's any difference at all between what each of us has said, then it's only a difference in emphasis. The "index scan manager" doesn't just manage prefetching -- it manages the whole index scan, including details that were previously only supposed to be known inside index AMs. It can do so while weighing all relevant factors -- regardless of whether they're related to the index structure or the heap structure. It would be possible to (say) do everything at the index AM level instead. But then we'd be teaching index AMs about heap/table AM related costs, which would be a bad design, primarily because it would have to duplicate the same logic in every supported index AM. Better to have one dedicated layer that has an abstract-ish understanding of both index AM scan costs, and table AM scan costs. It needs to be abstract, but not too abstract -- costs like "read one index leaf page" generalize well across all index AMs. And costs like "read one table AM page" should also generalize quite well, at least across block-based table AMs. You primarily care about "doing the layering right", while I primarily care about "making sure that one layer can see all relevant costs". ISTM that these are two sides of the same coin. > It requires exchanging some additional details with the AM, provided by > the new callbacks. I think of it as primarily externalizing decisions about index page accesses. The index AM reads the next leaf page to be read because the index scan manager tells it to. The index AM performs killitems exactly as instructed by the index scan manager. And the index AM doesn't really own as much context about the progress of the scan -- that all lives inside the scan manager instead. The scan manager has a fairly fuzzy idea about how the index AM organizes data, but that shouldn't matter. > It seems the indexam.c achieves both your and mine goals, more or less. Agreed. > Yes. I wonder if we should introduce a separate abstraction for this, as > a subset of indexam.c. I like that idea. > My argument was (a) ability to disable prefetching, and fall back to the > old code if needed, and (b) handling use cases where prefetching does > not work / is not implemented, even if only temporarily (e.g. ordered > scan in SP-GiST). Maybe (a) is unnecessarily defensive, and (b) may not > be needed. Not sure. We don't need to make a decision on this for some time, but I still lean towards forcing index AMs to make a choice between this new interface, and the old amgettuple interface. > > I don't see why prefetching should be mandatory with this new > > interface. Surely it has to have adaptive "ramp-up" behavior already, > > even when we're pretty sure that prefetching is a good idea from the > > start? > > > > Possibly, I may be too defensive. And perhaps in cases where we know the > prefetching can't help we could disable that for the read_stream. Shouldn't the index scan manager be figuring all this out for us, automatically? Maybe that works in a very trivial way, at first. 
The important point is that the design be able to support these requirements in some later iteration of the feature -- though it's unlikely to happen in the first Postgres version that the scan manager thing appears in. > I think hash should be fairly easy to support. But I was really thinking > about doing SP-GiST, exactly because it's very different in some > aspects, and I wanted to validate the design on that (for hash I think > it's almost certain it's OK). WFM. There are still bugs in SP-GiST (and GiST) index-only scans: https://www.postgresql.org/message-id/CAH2-Wz%3DPqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt%3D5qxWQ@mail.gmail.com It would be nice if the new index scan manager interface could fix that bug, at least in the case of SP-GiST. By generalizing the approach that nbtree takes, where we hang onto a leaf buffer pin. Admittedly this would necessitate changes to SP-GiST VACUUM, which doesn't cleanup lock any pages, but really has to in order to fix the underlying bug. There are draft patches that try to fix the bug, which might be a useful starting point. > I think the cases affected by this the most are index-only scans on > all-visible tables that fit into shared buffers, with > correlated/sequential pattern. Or even regular index scans with all data > in shred buffers. My hope is that the index scan manager can be taught to back off when this is happening, to avoid the regressions. Or that it can avoid them by only gradually ramping up the prefetching. Does that sound plausible to you? -- Peter Geoghegan
Hi, Here's a rebased version of the patch, addressing a couple bugs with scrollable cursors that Peter reported to me off-list. The patch did not handle that quite right, resulting either in incorrect results (when the position happened to be off by one), or crashes (when it got out of sync with the read stream). But then there are some issues with array keys and mark/restore, triggered by Peter's "dynamic SAOP advancement" tests in extra tests (some of the tests use data files too large to post on hackers, it's available in the github branch). The patch used to handle mark/restore entirely in indexam.c, and for simple scans that works. But with array keys the btree code needs to update the moreLeft/moreRight/needPrimScan flags, so that after restoring it knows where to continue. There's two "fix" patches trying to make this work - it does not crash, and almost all the "incorrect" query results are actually stats about buffer hits etc. And that is expected to change with prefetching, not a bug. But then there are a bunch of explains where the number of index scans changed, e.g. like - Index Searches: 5 + Index Searches: 4 And that is almost certainly a bug. I haven't figured this out yet, and I feel a bit lost again :-( It made me think again whether it makes sense to make this fundamental redesign of the index AM interface a prerequisite for prefetching. I don't dispute the advantages of this new design, with indexam.c responsible for more stuff (e.g. when a batch gets freed). It seems more flexible and might make some stuff easier, and if we were designing it now, we'd do it that way ... Even if I eventually to fix this issue, will I ever be sufficiently confident about correctness of the new code, enough to commit that? Perhaps I'm too skeptical, but I'm not really sure about that anymore. After thinking about this for a while, I decided to revisit the approach used in the experimental patch I spoke about at pgconf.dev unconference in 2023, and see if maybe it could be made to work. That patch was pretty dumb - it simply initiated prefetches from the AM, by calling PrefetchBuffer(). And the arguments against that doing this from the AM seems like a layering violation, that every AM would need to do a copy of this, because each AM has a different representation of the internal scan state. But after looking at it with fresh eyes, this seems fixable. It might have been "more true" with the fadvise-based prefetching, but with the ReadStream the amount of new AM code is *much* smaller. It doesn't need to track the distance, or anything like that - that's handled by the ReadStream. It just needs to respond to read_next callback. It also doesn't feel like a layering violation, for the same reason. I gave this a try last week, and I was surprised how easy it was to make this work, and how small and simple the patches are - see the attached simple-prefetch.tgz archive: infrastructure - 22kB btree - 10kB hash - 7kB gist - 10kB spgist - 16kB That's a grand total of ~64kB (there might be some more improvements necessary, esp. in the gist/spgist part). Now compare that with the more complex patch, where we have infrastructure - 100kB nbtree - 100kB And that's just one index type. The other index types would probably need a comparable amount of new code eventually ... Sure, it can probably be made somewhat smaller (e.g. the nbtree code copies a lot of stuff to support both the old and new approach, and that might be reduced if we ditch the old one), and some of the diff are comments. 
But even considering all that the size/complexity difference will remain significant. The one real limitation of the simpler approach is that prefetching is limited to a single leaf page - we can't prefetch from the next one, until the scan advances to it. But based on experiments comparing this simpler and the "complex" approach, I don't think that really matters that much. I haven't seen any difference for regular queries. The one case where I think it might matter is queries with array keys, where each array key matches a single tuple on a different leaf page. The complex patch might prefetch tuples for later array values, while the simpler patch won't be able to do that. If an array key matches multiple tuples, the simple patch can prefetch those just fine, of course. I don't know which case is more likely. One argument for moving more stuff (including prefetching) to indexam.c was it seems desirable to have one "component" aware of all the relevant information, so that it can adjust prefetching in some way. I believe that's still possible even with the simpler patch - nothing prevents adding a "struct" to the scan descriptor, and using it from the read_next callback or something like that. regards [1] https://github.com/tvondra/postgres/tree/index-prefetch-2025 -- Tomas Vondra
Attachments
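A sketch of what the per-AM read_next callback in this simpler scheme might look like for btree, feeding block numbers only from the current leaf page (currPos). The prefetchItem counter is an invented field marking how far ahead of itemIndex the stream has been fed; duplicate-block skipping and visibility checks are omitted to keep it short:

    static BlockNumber
    btree_stream_next_block(ReadStream *stream,
                            void *callback_private_data,
                            void *per_buffer_data)
    {
        BTScanOpaque so = (BTScanOpaque) callback_private_data;

        /* nothing left to prefetch on the current leaf page */
        if (so->prefetchItem > so->currPos.lastItem)    /* prefetchItem is hypothetical */
            return InvalidBlockNumber;  /* stream stalls until the scan steps
                                         * to the next leaf page and resets it */

        return ItemPointerGetBlockNumber(
            &so->currPos.items[so->prefetchItem++].heapTid);
    }

This also makes the stated limitation visible: once currPos is exhausted, the callback has nothing left to return, so the stream has to be reset after the scan advances to the next leaf page.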
Hi, I got pinged about issues (compiler warnings, and some test failures) in the simple patch version shared in May. So here's a rebased and cleaned up version addressing that, and a couple additional issues I ran into. FWIW if you run check-world on this, you may get failures in io_workers TAP test. That's a pre-existing issue [1], the patch just makes it easier to hit as it (probably) added AIO in some part of the test. Otherwise it should pass all tests (and it does for me on CI). The main changes in the patches and remaining questions: (1) fixed compiler warnings These were mostly due to contrib/test AMs with not-updated ambeginscan() implementations. (2) GiST fixes I fixed a bug in how the prefetching handled distances, leading to "tuples returned out of order" errors. It did not copy the Datums when batching the reordered values, not realizing it may be FLOAT8, and on 32-bit systems the Datum is just a pointer. Fixed by datumCopy(). I'm not aware of any actual bug in the GiST code, but I'm sure the memory management there is sketchy and likely leaks memory. Needs some more thought and testing. The SP-GiST may have similar issues. (3) ambeginscan(heap, index, ....) I originally undid the changes to ambeginscan(), i.e. the callback was restored back to what master has. To to create the ReadStream the AM needs the heap, but it could build Relation using index->rd_index->indrelid. That worked, but I did not like it for two reasons. The AM then needs to manage the relation (close it etc.). And there was no way to know when ambeginscan() gets called for a bitmap scan, in which case the read_stream is unnecessary/useless. So it got created, but never used. Not very expensive, but messy. So I ended up restoring the ambeginscan() change, i.e. it now gets the heap relation. I ended up passing it as the first argument, mostly for consistency with index_beginscan(), which also does (heap, index, ...). I renamed the index argument from 'rel' to 'index' in a couple of the indexes, it was confusing to have 'heap' and 'rel'. (4) lastBlock I added the optimization to not queue duplicate block numbers, i.e. if the index returns a sequence of TIDs from the same block, we skip queueing that and simply use the buffer we already have. This is quite a bit more efficient. This is something the read_next callback in each AM needs to do, but it's pretty simple. (5) xs_visible The current patch expects the AM to set the xs_visible even if it's not using ReadStream (which is required to do that in the callback). If the AM does not do that, index-only scans are broken. But it occurs to me we could handle this in index_getnext_tid(). If the AM does not use a ReadStream (xs_rs==NULL), we can check the VM and store the value in xs_visible. It'd need moving the vmBuffer to the scan descriptor (it's now in IndexOnlyScanState), but that seems OK. And the AMs now add the buffer anyway. (6) SGML I added a couple paragraphs to indexam.sgml, documenting the new heap argument, and also requirements from the read_next callback (e.g. the lastBlock and xs_visible setting). (7) remaining annoyances There's a couple things that still annoy me - the "read_next" callbacks are very similar, and duplicate a fair amount of code to stuff they're required to. There's a little bit AM-specific code to get the next item from the ScanOpaque structs, and then code to skip duplicate block numbers and check the visibility map (if needed). I believe both of these things could be refactored into some shared place. 
The AMs would just call a function from indexam.c (which seems OK from layering POV, and there's plenty of such calls). I believe the same place could also act as the "scan manager" component managing the prefetching (and related stuff?), as suggested by Peter Geoghegan some time ago. I ran out of time to work on this today, but I'll look into this soon. FWIW I'm still planning to work on the "complex" patch version and see if it can be moved forward. I've been having some very helpful chats about this with Peter Geoghegan, and I'm still open to the possibility of making it work. This simpler version is partially a hedge to have at least something in case the complex patch does not make it. regards [1] https://www.postgresql.org/message-id/t5aqjhkj6xdkido535pds7fk5z4finoxra4zypefjqnlieevbg%40357aaf6u525j -- Tomas Vondra
Attachments
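For the xs_visible fallback described in (5), the check in index_getnext_tid() could be as simple as the sketch below. VM_ALL_VISIBLE(), xs_heaptid and heapRelation are existing core APIs/fields; xs_visible, xs_rs and a vm buffer kept in the scan descriptor (here xs_vmbuf) are assumptions about the patch, following the description above:

    /* only needed when the AM is not using a ReadStream */
    if (scan->xs_rs == NULL)
    {
        BlockNumber blkno = ItemPointerGetBlockNumber(&scan->xs_heaptid);

        /* xs_vmbuf would live in the scan descriptor in this scheme */
        scan->xs_visible = VM_ALL_VISIBLE(scan->heapRelation, blkno,
                                          &scan->xs_vmbuf);
    }

That would relieve AMs that don't implement the ReadStream path from having to know anything about index-only scan visibility at all.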
On Thu, May 1, 2025 at 7:02 PM Tomas Vondra <tomas@vondra.me> wrote: > There's two "fix" patches trying to make this work - it does not crash, > and almost all the "incorrect" query results are actually stats about > buffer hits etc. And that is expected to change with prefetching, not a > bug. But then there are a bunch of explains where the number of index > scans changed, e.g. like > > - Index Searches: 5 > + Index Searches: 4 > > And that is almost certainly a bug. > > I haven't figured this out yet, and I feel a bit lost again :-( For the benefit of other people reading this thread: I sent Tomas a revised version of this "complex" patch this week, fixing all these bugs. It only took me a few hours, and I regret not doing that work sooner. I also cleaned up nbtree aspects of the "complex" patch considerably. The nbtree footprint was massively reduced: 17 files changed, 422 insertions(+), 685 deletions(-) So there's a net negative nbtree code footprint. We're effectively just moving things out of nbtree that are already completely index-AM-generic. I think that the amount of code that can be removed from nbtree (and other AMs that currently use amgettuple) will be even higher if we go this way. > The one real limitation of the simpler approach is that prefetching is > limited to a single leaf page - we can't prefetch from the next one, > until the scan advances to it. But based on experiments comparing this > simpler and the "complex" approach, I don't think that really matters > that much. I haven't seen any difference for regular queries. Did you model/benchmark it? > The one case where I think it might matter is queries with array keys, > where each array key matches a single tuple on a different leaf page. > The complex patch might prefetch tuples for later array values, while > the simpler patch won't be able to do that. If an array key matches > multiple tuples, the simple patch can prefetch those just fine, of > course. I don't know which case is more likely. We discussed this in Montreal, but I'd like to respond to this point again on list: I don't think that array keys are in any way relevant to the design of this patch. Nothing I've said about this project has anything to do with array keys, except when I was concerned about specific bugs in the patch. (Bugs that I've now fixed in a way that is wholly confined to nbtree.) The overarching goal of my work on nbtree array scans was to make them work just like other scans to the maximum extent possible. Array scans "where each array key matches a single tuple on a different leaf page" are virtually identical to any other scan that'll return only one or two tuples from each neighboring page. You could see a similar pattern with literally any kind of key. Again, what I'm concerned about is coming up with a design that gives scans maximum freedom to reorder work (not necessarily in the first committed version), so that we can keep the read stream busy by giving it sufficiently many heap pages to read: a truly adaptive design, that weighs all relevant costs. Sometimes that'll necessitate eagerly reading leaf pages. There is nothing fundamentally complicated about that idea. Nothing in index AMs cares about how or when heap accesses take place. Again, it just *makes sense* to centralize the code that controls the progress of ordered/amgettuple scans. Every affected index AM is already doing virtually the same thing as each other. They're all following the rules around index locking/pinning for amgettuple [1]. 
Individual index AMs are *already* required to read leaf pages a certain way, in a certain order *relative to the heap accesses*. All for the benefit of scan correctness (to avoid breaking things in a way that relates to heapam implementation details). Why wouldn't we want to relieve all AMs of that responsibility? Leaving it up to index AMs has resulted in subtle bugs [2][3], and AFAICT has no redeeming quality. If affected index AMs were *forced* to do *exactly* the same thing as each other (not just *obliged* to do *almost* the same thing), it would make life easier for everybody. [1] https://www.postgresql.org/docs/current/index-locking.html [2] https://commitfest.postgresql.org/patch/5721/ [3] https://commitfest.postgresql.org/patch/5542/ -- Peter Geoghegan
On 7/13/25 01:50, Peter Geoghegan wrote: > On Thu, May 1, 2025 at 7:02 PM Tomas Vondra <tomas@vondra.me> wrote: >> There's two "fix" patches trying to make this work - it does not crash, >> and almost all the "incorrect" query results are actually stats about >> buffer hits etc. And that is expected to change with prefetching, not a >> bug. But then there are a bunch of explains where the number of index >> scans changed, e.g. like >> >> - Index Searches: 5 >> + Index Searches: 4 >> >> And that is almost certainly a bug. >> >> I haven't figured this out yet, and I feel a bit lost again :-( > > For the benefit of other people reading this thread: I sent Tomas a > revised version of this "complex" patch this week, fixing all these > bugs. It only took me a few hours, and I regret not doing that work > sooner. > > I also cleaned up nbtree aspects of the "complex" patch considerably. > The nbtree footprint was massively reduced: > > 17 files changed, 422 insertions(+), 685 deletions(-) > > So there's a net negative nbtree code footprint. We're effectively > just moving things out of nbtree that are already completely > index-AM-generic. I think that the amount of code that can be removed > from nbtree (and other AMs that currently use amgettuple) will be even > higher if we go this way. > Thank you! I'll take a look next week, but these numbers suggest you simplified it a lot.. >> The one real limitation of the simpler approach is that prefetching is >> limited to a single leaf page - we can't prefetch from the next one, >> until the scan advances to it. But based on experiments comparing this >> simpler and the "complex" approach, I don't think that really matters >> that much. I haven't seen any difference for regular queries. > > Did you model/benchmark it? > Yes. I did benchmark the simple and complex versions I had at the time. But you know how it's with benchmarking - I'm sure it's possible to pick queries where it'd make a (significant) difference. For example if you make the index tuples "fat" that would make the prefetching less efficient. Another thing is hardware. I've been testing on local NVMe drives, and those don't seem to need very long queues (it's diminishing returns). Maybe the results would be different on systems with more I/O latency (e.g. because the storage is not local). >> The one case where I think it might matter is queries with array keys, >> where each array key matches a single tuple on a different leaf page. >> The complex patch might prefetch tuples for later array values, while >> the simpler patch won't be able to do that. If an array key matches >> multiple tuples, the simple patch can prefetch those just fine, of >> course. I don't know which case is more likely. > > We discussed this in Montreal, but I'd like to respond to this point > again on list: > > I don't think that array keys are in any way relevant to the design of > this patch. Nothing I've said about this project has anything to do > with array keys, except when I was concerned about specific bugs in > the patch. (Bugs that I've now fixed in a way that is wholly confined > to nbtree.) > > The overarching goal of my work on nbtree array scans was to make them > work just like other scans to the maximum extent possible. Array scans > "where each array key matches a single tuple on a different leaf page" > are virtually identical to any other scan that'll return only one or > two tuples from each neighboring page. You could see a similar pattern > with literally any kind of key. 
> > Again, what I'm concerned about is coming up with a design that gives > scans maximum freedom to reorder work (not necessarily in the first > committed version), so that we can keep the read stream busy by giving > it sufficiently many heap pages to read: a truly adaptive design, that > weighs all relevant costs. Sometimes that'll necessitate eagerly > reading leaf pages. There is nothing fundamentally complicated about > that idea. Nothing in index AMs cares about how or when heap accesses > take place. > > Again, it just *makes sense* to centralize the code that controls the > progress of ordered/amgettuple scans. Every affected index AM is > already doing virtually the same thing as each other. They're all > following the rules around index locking/pinning for amgettuple [1]. > Individual index AMs are *already* required to read leaf pages a > certain way, in a certain order *relative to the heap accesses*. All > for the benefit of scan correctness (to avoid breaking things in a way > that relates to heapam implementation details). > > Why wouldn't we want to relieve all AMs of that responsibility? > Leaving it up to index AMs has resulted in subtle bugs [2][3], and > AFAICT has no redeeming quality. If affected index AMs were *forced* > to do *exactly* the same thing as each other (not just *oblidged* to > do *almost* the same thing), it would make life easier for everybody. > > [1] https://www.postgresql.org/docs/current/index-locking.html > [2] https://commitfest.postgresql.org/patch/5721/ > [3] https://commitfest.postgresql.org/patch/5542/ Thanks. I don't remember the array key details, I'll need to swap the context back in. But I think the thing I've been concerned about the most is the coordination of advancing to the next leaf page vs. the next array key (and then perhaps having to go back when the scan direction changes). regards -- Tomas Vondra
On Sun, Jul 13, 2025 at 5:57 PM Tomas Vondra <tomas@vondra.me> wrote: > Thank you! I'll take a look next week, but these numbers suggest you > simplified it a lot.. Right. I'm still not done removing code from nbtree here. I still haven't done things like generalize _bt_killitems across all index AMs. That can largely (though not entirely) work the same way across all index AMs. Including the stuff about checking LSN/not dropping pins to avoid blocking VACUUM. It's already totally index-AM-agnostic, even though the avoid-blocking-vacuum thing happens to be nbtree-only right now. > Another thing is hardware. I've been testing on local NVMe drives, and > those don't seem to need very long queues (it's diminishing returns). > Maybe the results would be different on systems with more I/O latency > (e.g. because the storage is not local). That seems likely. Cloud storage with 1ms latency is going to have very different performance characteristics. The benefit of reading multiple leaf pages will also only be seen with certain workloads. Other thing is that leaf pages are typically much denser and more likely to be cached than heap pages. And, the potential to combine heap I/Os for TIDs that appear on adjacent index leaf pages seems like an interesting avenue. > I don't remember the array key details, I'll need to swap the context > back in. But I think the thing I've been concerned about the most is the > coordination of advancing to the next leaf page vs. the next array key > (and then perhaps having to go back when the scan direction changes). But we don't require anything like that. That's just not how it works. The scan can change direction, and the array keys will automatically be maintained correctly; _bt_advance_array_keys will be called as needed, taking care of everything. This all happens in a way that code in nbtree.c and nbtsearch.c knows nothing about (obviously that means that your patch won't need to, either). We do need to be careful about the scan direction changing when the so->needPrimscan flag is set, but that won't affect your patch/indexam.c, either. It also isn't very complicated; we only have to be sure to *unset* the flag when we detect a *change* in direction at the point where we're stepping off a page/pos. We don't need to modify the array keys themselves at this point -- the next call to _bt_advance_array_keys will just take care of that for us automatically (we lean on _bt_advance_array_keys like this in a number of places). The only thing in my revised version of your "complex" patch set does in indexam.c that is in any way related to nbtree arrays is the call to amrestrpos. But you'd never be able to tell -- since the amrestrpos call is nothing new. It just so happens that the only reason we still need the amrestrpos call/the whole entire concept of amrestrpos (having completely moved mark/restore out of nbtree and into indexam.c) is so that the index AM (nbtree) gets a signal that we (indexam.c) are going to restore *some* mark. Because nbtree *will* need to reset its array keys (if any) at that point. But that's it. We don't need to tell the index AM any specific details about the mark, and indexam.c is blissfully unaware of why it is that an index AM might need this. So it's a total non-issue, from a layering cleanliness point of view. There is no mutable state involved at *any* layer. (FWIW, even when we restore a mark like this, nbtree is still mostly leaning on _bt_advance_array_keys to advance the array keys properly later on. 
If you're interested in why we need the remaining hard reset of the arrays within amrestrpos/btrestrpos, let me know and I'll explain.) -- Peter Geoghegan
On Sat, Jul 12, 2025 at 7:50 PM Peter Geoghegan <pg@bowt.ie> wrote: > Why wouldn't we want to relieve all AMs of that responsibility? > Leaving it up to index AMs has resulted in subtle bugs [2][3], and > AFAICT has no redeeming quality. If affected index AMs were *forced* > to do *exactly* the same thing as each other (not just *oblidged* to > do *almost* the same thing), it would make life easier for everybody. > > [1] https://www.postgresql.org/docs/current/index-locking.html > [2] https://commitfest.postgresql.org/patch/5721/ > [3] https://commitfest.postgresql.org/patch/5542/ The kill_prior_tuple code that GiST uses to set LP_DEAD bits is also buggy, as is the equivalent code used by hash indexes: https://www.postgresql.org/message-id/CAH2-Wz%3D3eeujcHi3P_r%2BL8n-vDjdue9yGa%2Bytb95zh--S9kWfA%40mail.gmail.com This seems like another case where a non-nbtree index AM copied something from nbtree but didn't quite get the details right. Most likely because the underlying principles weren't really understood (even though they are in fact totally independent of index AM/amgettuple implementation details). BTW, neither gistkillitems() nor _hash_kill_items() have any test coverage. -- Peter Geoghegan
On 7/13/25 23:56, Tomas Vondra wrote: > > ... > >>> The one real limitation of the simpler approach is that prefetching is >>> limited to a single leaf page - we can't prefetch from the next one, >>> until the scan advances to it. But based on experiments comparing this >>> simpler and the "complex" approach, I don't think that really matters >>> that much. I haven't seen any difference for regular queries. >> >> Did you model/benchmark it? >> > > Yes. I did benchmark the simple and complex versions I had at the time. > But you know how it's with benchmarking - I'm sure it's possible to pick > queries where it'd make a (significant) difference. > > For example if you make the index tuples "fat" that would make the > prefetching less efficient. > > Another thing is hardware. I've been testing on local NVMe drives, and > those don't seem to need very long queues (it's diminishing returns). > Maybe the results would be different on systems with more I/O latency > (e.g. because the storage is not local). > I decided to do some fresh benchmarks, to confirm that my claims about the simple vs. complex patches are still true even for the recent versions. And there's a lot of strange stuff / stuff I don't quite understand. The results are in git (still running, so only some data sets): https://github.com/tvondra/indexscan-prefetch-tests/ There's a run.sh script; it expects three builds - master, prefetch-simple and prefetch-complex (for the two patches). It then runs queries with index scans (and bitmap scans, for comparison), forcing different io_methods, eic, ... Tests run on the same data directory, in random order. Consider for example this (attached): https://github.com/tvondra/indexscan-prefetch-tests/blob/master/d16-rows-cold-32GB-16-scaled.pdf There's one column for each io_method ("worker" has two different counts), and different data sets in rows. There's not much difference between io_methods, so I'll focus on "sync" (it's the simplest one). For the "uniform" data set, both prefetch patches do much better than master (for low selectivities it's clearer in the log-scale chart). The "complex" prefetch patch appears to have a bit of an edge for >1% selectivities. I find this a bit surprising - the leaf pages have ~360 index items, so I wouldn't expect such an impact from not being able to prefetch beyond the end of the current leaf page. But it could be due to storage with higher latency (this is a cloud SSD on Azure). But the thing I don't really understand is the "cyclic" dataset (for example), where the "simple" patch performs really badly. This data set is designed to not work for prefetching; it's pretty much an adversarial case. There are ~100 TIDs from 100 pages for each key value, and once you read those 100 pages you'll hit them many times for the following values. Prefetching is pointless, and skipping duplicate blocks can't help either, because the repeated blocks don't show up close enough together for that to kick in. But how come the "complex" patch does so much better? It can't really benefit from prefetching TIDs from the next leaf - not this much. Yet it does a bit better than master. I've been looking at this since yesterday, and it makes no sense to me. Per "perf trace" it actually does 2x as many fadvise calls compared to the "simple" patch (which is strange on its own, I think), yet it's apparently so much faster? regards -- Tomas Vondra
Attachments
On Wed, Jul 16, 2025 at 4:40 AM Tomas Vondra <tomas@vondra.me> wrote: > But the thing I don't really understand it the "cyclic" dataset (for > example). And the "simple" patch performs really badly here. This data > set is designed to not work for prefetching, it's pretty much an > adversary case. There's ~100 TIDs from 100 pages for each key value, and > once you read the 100 pages you'll hit them many times for following > values. Prefetching is pointless, and skipping duplicate blocks can't > help, because the blocks are not effective. > > But how come the "complex" patch does so much better? It can't really > benefit from prefetching TID from the next leaf - not this much. Yet it > does a bit better than master. I'm looking at this since yesterday, and > it makes no sense to me. Per "perf trace" it actually does 2x many > fadvise calls compared to the "simple" patch (which is strange on it's > own, I think), yet it's apparently so much faster? The "simple" patch has _bt_readpage reset the read stream. That doesn't make any sense to me. Though it does explain why the "complex" patch does so many more fadvise calls. Another issue with the "simple" patch: it adds 2 bool fields to "BTScanPosItem". That increases its size considerably. We're very sensitive to the size of this struct (I think that you know about this already). Bloating it like this will blow up our memory usage, since right now we allocate MaxTIDsPerBTreePage/1358 such structs for so->currPos (and so->markPos). Wasting all that memory on alignment padding is probably going to have consequences beyond memory bloat. -- Peter Geoghegan
On Wed, Jul 16, 2025 at 9:36 AM Peter Geoghegan <pg@bowt.ie> wrote: > Another issue with the "simple" patch: it adds 2 bool fields to > "BTScanPosItem". That increases its size considerably. We're very > sensitive to the size of this struct (I think that you know about this > already). Bloating it like this will blow up our memory usage, since > right now we allocate MaxTIDsPerBTreePage/1358 such structs for > so->currPos (and so->markPos). Wasting all that memory on alignment > padding is probably going to have consequences beyond memory bloat. Actually, there is no alignment padding involved. Even still, increasing that from 10 bytes to 12 bytes will hurt us. Remember the issue with support function #6/skip support putting us over that critical glibc threshold? (I've been meaning to get back to that thread...) -- Peter Geoghegan
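For reference, this is the struct in question as it appears in src/include/access/nbtree.h; the size annotations and the arithmetic below are mine, using the MaxTIDsPerBTreePage = 1358 figure mentioned above:

typedef struct BTScanPosItem    /* what we remember about each match */
{
    ItemPointerData heapTid;    /* TID of referenced heap item (6 bytes) */
    OffsetNumber indexOffset;   /* index item's location within page (2 bytes) */
    LocationIndex tupleOffset;  /* IndexTuple's offset in workspace, if any (2 bytes) */
} BTScanPosItem;                /* 10 bytes, 2-byte alignment, no padding */

/*
 * so->currPos and so->markPos each hold MaxTIDsPerBTreePage (1358) of
 * these, so growing each item from 10 to 12 bytes adds roughly
 * 1358 * 2 = 2716 bytes per position, before any allocator rounding.
 */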
On 7/16/25 15:36, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 4:40 AM Tomas Vondra <tomas@vondra.me> wrote: >> But the thing I don't really understand it the "cyclic" dataset (for >> example). And the "simple" patch performs really badly here. This data >> set is designed to not work for prefetching, it's pretty much an >> adversary case. There's ~100 TIDs from 100 pages for each key value, and >> once you read the 100 pages you'll hit them many times for following >> values. Prefetching is pointless, and skipping duplicate blocks can't >> help, because the blocks are not effective. >> >> But how come the "complex" patch does so much better? It can't really >> benefit from prefetching TID from the next leaf - not this much. Yet it >> does a bit better than master. I'm looking at this since yesterday, and >> it makes no sense to me. Per "perf trace" it actually does 2x many >> fadvise calls compared to the "simple" patch (which is strange on it's >> own, I think), yet it's apparently so much faster? > > The "simple" patch has _bt_readpage reset the read stream. That > doesn't make any sense to me. Though it does explain why the "complex" > patch does so many more fadvise calls. >

Why doesn't it make sense? The read_stream_reset() restarts the stream after it got "terminated" on the preceding leaf page (by returning InvalidBlockNumber). It'd be better to "pause" the stream somehow, but there's nothing like that yet. We have to terminate it and start again. But why would it explain the increase in fadvise calls? FWIW the pattern of fadvise calls is quite different. For the simple patch we end up doing just this:

fadvise block 1
read block 1
fadvise block 2
read block 2
fadvise block 3
read block 3
...

while for the complex patch we do a small batch (~10) of fadvise calls, followed by the fadvise/read calls for the same set of blocks:

fadvise block 1
fadvise block 2
...
fadvise block 10
read block 1
fadvise block 2
read block 2
...
fadvise block 10
read block 10

This might explain the advantage of the "complex" patch, because it can actually do some prefetching every now and then (if my calculation is right, about 5% of blocks need prefetching). The pattern of fadvise+pread for the same block seems a bit silly. And this is not just about the "sync" method, the other methods will have a similar issue with not starting the I/O earlier. The fadvise is just easier to trace/inspect. I suspect this might be an unintended consequence of the stream reset. AFAIK it wasn't quite meant to be used this way, so maybe it confuses the built-in heuristics deciding what to prefetch? If that's the case, I'm afraid the "complex" patch will have the issue too, because it will need to "pause" the prefetching in some cases too (e.g. for index-only scans, or when the leaf pages contain very few index tuples). It'll be less common, of course.

> Another issue with the "simple" patch: it adds 2 bool fields to > "BTScanPosItem". That increases its size considerably. We're very > sensitive to the size of this struct (I think that you know about this > already). Bloating it like this will blow up our memory usage, since > right now we allocate MaxTIDsPerBTreePage/1358 such structs for > so->currPos (and so->markPos). Wasting all that memory on alignment > padding is probably going to have consequences beyond memory bloat. > True, no argument here. regards -- Tomas Vondra
On Wed, Jul 16, 2025 at 9:58 AM Tomas Vondra <tomas@vondra.me> wrote: > > The "simple" patch has _bt_readpage reset the read stream. That > > doesn't make any sense to me. Though it does explain why the "complex" > > patch does so many more fadvise calls. > > > > Why it doesn't make sense? The reset_stream_reset() restarts the stream > after it got "terminated" on the preceding leaf page (by returning > InvalidBlockNumber). Resetting the prefetch distance at the end of _bt_readpage doesn't make any sense to me. Why there? It makes about as much sense as doing so every 7th index tuple. Reaching the end of _bt_readpage isn't meaningful -- since it in no way signifies that the scan has been terminated (it might have been, but you're not checking that at all). > It'd be better to "pause" the stream somehow, but > there's nothing like that yet. We have to terminate it and start again. I don't follow. > Te pattern of fadvise+pread for the same block seems a bit silly. And > this is not just about "sync" method, the other methods will have a > similar issue with no starting the I/O earlier. The fadvise is just > easier to trace/inspect. It's not at all surprising that you're seeing duplicate prefetch requests. I have no reason to believe that it's important to suppress those ourselves, rather than leaving it up to the OS (though I also have no reason to believe that the opposite is true). -- Peter Geoghegan
On 7/16/25 16:07, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 9:58 AM Tomas Vondra <tomas@vondra.me> wrote: >>> The "simple" patch has _bt_readpage reset the read stream. That >>> doesn't make any sense to me. Though it does explain why the "complex" >>> patch does so many more fadvise calls. >>> >> >> Why it doesn't make sense? The reset_stream_reset() restarts the stream >> after it got "terminated" on the preceding leaf page (by returning >> InvalidBlockNumber). > > Resetting the prefetch distance at the end of _bt_readpage doesn't > make any sense to me. Why there? It makes about as much sense as doing > so every 7th index tuple. Reaching the end of _bt_readpage isn't > meaningful -- since it in no way signifies that the scan has been > terminated (it might have been, but you're not checking that at all). > Again, resetting the prefetch distance is merely a side-effect (and I agree it's not desirable). The "reset" merely says the stream is able to produce blocks again - call the "next" callback etc. >> It'd be better to "pause" the stream somehow, but >> there's nothing like that yet. We have to terminate it and start again. > > I don't follow. > The read stream can only return blocks generated by the "next" callback. When we return the block for the last item on a leaf page, we can only return "InvalidBlockNumber" which means "no more blocks in the stream". And once we advance to the next leaf, we say "hey, there's more blocks". Which is what read_stream_reset() does. It's a bit like what rescan does. In an ideal world we'd have a function that'd "pause" the stream, without resetting the distance etc. But we don't have that, and the reset thing was suggested to me as a workaround. >> Te pattern of fadvise+pread for the same block seems a bit silly. And >> this is not just about "sync" method, the other methods will have a >> similar issue with no starting the I/O earlier. The fadvise is just >> easier to trace/inspect. > > It's not at all surprising that you're seeing duplicate prefetch > requests. I have no reason to believe that it's important to suppress > those ourselves, rather than leaving it up to the OS (though I also > have no reason to believe that the opposite is true). > True, but in practice those duplicate calls are fairly expensive. Even just calling fadvise() on data you already have in page cache costs something (not much, but it's clearly visible for cached queries). regards -- Tomas Vondra
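To make the terminate-and-restart workaround concrete, here is a minimal sketch of the interaction being described, using the real read stream callback signature and read_stream_reset(); the PrefetchState struct and its fields are hypothetical stand-ins for the per-scan state in the patch:

/* block-number callback: one heap block per TID on the current leaf page */
static BlockNumber
leaf_prefetch_next_block(ReadStream *stream, void *callback_private_data,
                         void *per_buffer_data)
{
    PrefetchState *ps = (PrefetchState *) callback_private_data;

    if (ps->nextItem > ps->lastItem)
        return InvalidBlockNumber;    /* end of leaf: stream is "terminated" */

    /* ps->items[] is an array of ItemPointerData for the current leaf */
    return ItemPointerGetBlockNumber(&ps->items[ps->nextItem++]);
}

/* ... later, after the scan steps to the next leaf and refills ps->items: */
ps->nextItem = 0;
read_stream_reset(stream);            /* blocks flow again, distance resets too */

The read_stream_reset() call is what makes the callback get consulted again; the unwanted side-effect is that the ramped-up prefetch distance is discarded along with it, which is what a proper pause/resume API would avoid.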
Hi, On 2025-07-16 16:20:25 +0200, Tomas Vondra wrote: > On 7/16/25 16:07, Peter Geoghegan wrote: > >> Te pattern of fadvise+pread for the same block seems a bit silly. And > >> this is not just about "sync" method, the other methods will have a > >> similar issue with no starting the I/O earlier. The fadvise is just > >> easier to trace/inspect. > > > > It's not at all surprising that you're seeing duplicate prefetch > > requests. I have no reason to believe that it's important to suppress > > those ourselves, rather than leaving it up to the OS (though I also > > have no reason to believe that the opposite is true). > > > > True, but in practice those duplicate calls are fairly expensive. Even > just calling fadvise() on data you already have in page cache costs > something (not much, but it's clearly visible for cached queries). This imo isn't something worth optimizing for - if you use an io_method that actually can execute IO asynchronously this issue does not exist, as the start of the IO will already have populated the buffer entry (without BM_VALID set, of course). Thus we won't start another IO for that block. Greetings, Andres Freund
On Wed, Jul 16, 2025 at 10:25 AM Andres Freund <andres@anarazel.de> wrote: > This imo isn't something worth optimizing for - if you use an io_method that > actually can execute IO asynchronously this issue does not exist, as the start > of the IO will already have populated the buffer entry (without BM_VALID set, > of course). Thus we won't start another IO for that block. Even if it was worth optimizing for, it'd probably still be too far down the list of problems to be worth discussing right now. -- Peter Geoghegan
On Wed, Jul 16, 2025 at 10:20 AM Tomas Vondra <tomas@vondra.me> wrote: > The read stream can only return blocks generated by the "next" callback. > When we return the block for the last item on a leaf page, we can only > return "InvalidBlockNumber" which means "no more blocks in the stream". > And once we advance to the next leaf, we say "hey, there's more blocks". > Which is what read_stream_reset() does. > > It's a bit like what rescan does. That sounds weird. > In an ideal world we'd have a function that'd "pause" the stream, > without resetting the distance etc. But we don't have that, and the > reset thing was suggested to me as a workaround. Does the "complex" patch require a similar workaround? Why or why not? -- Peter Geoghegan
On 7/16/25 16:29, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 10:20 AM Tomas Vondra <tomas@vondra.me> wrote: >> The read stream can only return blocks generated by the "next" callback. >> When we return the block for the last item on a leaf page, we can only >> return "InvalidBlockNumber" which means "no more blocks in the stream". >> And once we advance to the next leaf, we say "hey, there's more blocks". >> Which is what read_stream_reset() does. >> >> It's a bit like what rescan does. > > That sounds weird. > What sounds weird? That the read_stream works like a stream of blocks, or that it can't do "pause" and we use "reset" as a workaround? >> In an ideal world we'd have a function that'd "pause" the stream, >> without resetting the distance etc. But we don't have that, and the >> reset thing was suggested to me as a workaround. > > Does the "complex" patch require a similar workaround? Why or why not? > I think it'll need to do something like that in some cases, when we need to limit the number of leaf pages kept in memory to something sane. (a) index-only scans, with most of the tuples all-visible (we don't prefetch all-visible pages, so finding the next "prefetchable" block may force reading a lot of leaf pages) (b) scans on correlated indexes - we skip duplicate block numbers, so again, we may need to read a lot of leafs to find enough prefetchable blocks to reach the "distance" (measured in queued blocks) (c) indexes with "fat" index tuples (but it's less of an issue, because with one tuple per leaf we still have a clear idea how many leafs we'll need to read) regards -- Tomas Vondra
On Wed, Jul 16, 2025 at 10:37 AM Tomas Vondra <tomas@vondra.me> wrote: > What sounds weird? That the read_stream works like a stream of blocks, > or that it can't do "pause" and we use "reset" as a workaround? The fact that prefetch distance is in any way affected by a temporary inability to return more blocks. Just starting from scratch seems particularly bad. Doesn't that mean that it's simply impossible for us to remember ramping up the distance on an earlier leaf page? There is nothing about leaf page boundaries that should be meaningful to the read stream/our heap accesses. I get that index characteristics could be the limiting factor, especially in a world where we're not yet eagerly reading leaf pages. But that in no way justifies just forgetting about prefetch distance like this. > >> In an ideal world we'd have a function that'd "pause" the stream, > >> without resetting the distance etc. But we don't have that, and the > >> reset thing was suggested to me as a workaround. > > > > Does the "complex" patch require a similar workaround? Why or why not? > > > > I think it'll need to do something like that in some cases, when we need > to limit the number of leaf pages kept in memory to something sane. That's the only reason? The memory usage for batches? That doesn't seem like a big deal. It's something to keep an eye on, but I see no reason why it'd be particularly difficult. Doesn't this argue for the "complex" patch's approach? -- Peter Geoghegan
On Wed, Jul 16, 2025 at 4:40 AM Tomas Vondra <tomas@vondra.me> wrote: > For "uniform" data set, both prefetch patches do much better than master > (for low selectivities it's clearer in the log-scale chart). The > "complex" prefetch patch appears to have a bit of an edge for >1% > selectivities. I find this a bit surprising, the leaf pages have ~360 > index items, so I wouldn't expect such impact due to not being able to > prefetch beyond the end of the current leaf page. But could be on > storage with higher latencies (this is the cloud SSD on azure). How can you say that the "complex" patch has "a bit of an edge for >1% selectivities"? It looks like a *massive* advantage on all "linear" test results. Those are only about 1/3 of all tests -- but if I'm not mistaken they're the *only* tests where prefetching could be expected to help a lot. The "cyclic" tests are adversarial/designed to make the patch look bad. The "uniform" tests have uniformly random heap accesses (I think), which can only be helped so much by prefetching. For example, with "linear_10 / eic=16 / sync", it looks like "complex" has about half the latency of "simple" in tests where selectivity is 10. The advantage for "complex" is even greater at higher "selectivity" values. All of the other "linear" test results look about the same. Have I missed something? -- Peter Geoghegan
On Wed, Jul 16, 2025 at 11:29 AM Peter Geoghegan <pg@bowt.ie> wrote: > For example, with "linear_10 / eic=16 / sync", it looks like "complex" > has about half the latency of "simple" in tests where selectivity is > 10. The advantage for "complex" is even greater at higher > "selectivity" values. All of the other "linear" test results look > about the same. It's hard to interpret the raw data that you've provided. For example, I cannot figure out where "selectivity" appears in the raw CSV file from your results repro. Can you post a single spreadsheet or CSV file, with descriptive column names, and a row for every test case you ran? And with the rows ordered such that directly comparable results/rows appear close together? Thanks -- Peter Geoghegan
On 7/16/25 16:45, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 10:37 AM Tomas Vondra <tomas@vondra.me> wrote: >> What sounds weird? That the read_stream works like a stream of blocks, >> or that it can't do "pause" and we use "reset" as a workaround? > > The fact that prefetch distance is in any way affected by a temporary > inability to return more blocks. Just starting from scratch seems > particularly bad. > > Doesn't that mean that it's simply impossible for us to remember > ramping up the distance on an earlier leaf page? There is nothing > about leaf page boundaries that should be meaningful to the read > stream/our heap accesses. > > I get that index characteristics could be the limiting factor, > especially in a world where we're not yet eagerly reading leaf pages. > But that in no way justifies just forgetting about prefetch distance > like this. > True. I think it's simply a matter of "no one really needed that yet", so the read stream does not have a way to do that. I suspect Thomas might have a WIP patch for that somewhere ... >>>> In an ideal world we'd have a function that'd "pause" the stream, >>>> without resetting the distance etc. But we don't have that, and the >>>> reset thing was suggested to me as a workaround. >>> >>> Does the "complex" patch require a similar workaround? Why or why not? >>> >> >> I think it'll need to do something like that in some cases, when we need >> to limit the number of leaf pages kept in memory to something sane. > > That's the only reason? The memory usage for batches? > > That doesn't seem like a big deal. It's something to keep an eye on, > but I see no reason why it'd be particularly difficult. > > Doesn't this argue for the "complex" patch's approach? > Memory pressure is the "implementation" reason, because the indexam.c layer has a fixed-length array of batches, so it can't load more than INDEX_SCAN_MAX_BATCHES of them. That could be reworked to allow loading arbitrary number of batches, of course. But I think we don't really want to do that, because what would be the benefit? If you need to load many leaf pages to find the next thing to prefetch, is the prefetching really improving anything? How would we even know there actually is a prefetchable item? We could load the whole index only to find everything is all-visible. And then what if the query has LIMIT 10? So that's the other thing this probably needs to consider - some concept of how much effort to invest into finding the next prefetchable block. regards -- Tomas Vondra
On 7/16/25 17:29, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 4:40 AM Tomas Vondra <tomas@vondra.me> wrote: >> For "uniform" data set, both prefetch patches do much better than master >> (for low selectivities it's clearer in the log-scale chart). The >> "complex" prefetch patch appears to have a bit of an edge for >1% >> selectivities. I find this a bit surprising, the leaf pages have ~360 >> index items, so I wouldn't expect such impact due to not being able to >> prefetch beyond the end of the current leaf page. But could be on >> storage with higher latencies (this is the cloud SSD on azure). > > How can you say that the "complex" patch has "a bit of an edge for >1% > selectivities"? > > It looks like a *massive* advantage on all "linear" test results. > Those are only about 1/3 of all tests -- but if I'm not mistaken > they're the *only* tests where prefetching could be expected to help a > lot. The "cyclic" tests are adversarial/designed to make the patch > look bad. The "uniform" tests have uniformly random heap accesses (I > think), which can only be helped so much by prefetching. > > For example, with "linear_10 / eic=16 / sync", it looks like "complex" > has about half the latency of "simple" in tests where selectivity is > 10. The advantage for "complex" is even greater at higher > "selectivity" values. All of the other "linear" test results look > about the same. > > Have I missed something? > That paragraph starts with "for uniform data set", and the statement about 1% selectivities was only about that particular data set. You're right there's a massive difference on all the "correlated" data sets. I believe (assume) that's caused by the same issue, discussed in this thread (where the simple patch seems to do fewer fadvise calls). I only picked the "cyclic" data set as an example, representing this. FWIW I suspect the difference on "uniform" data set might be caused by this too, because at ~5% selectivity the queries start to hit pages multiple times (there are ~20 rows/page, hence ~5% means ~1 row). But it's much weaker than on the correlated data sets, of course. regards -- Tomas Vondra
On 7/16/25 18:39, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 11:29 AM Peter Geoghegan <pg@bowt.ie> wrote: >> For example, with "linear_10 / eic=16 / sync", it looks like "complex" >> has about half the latency of "simple" in tests where selectivity is >> 10. The advantage for "complex" is even greater at higher >> "selectivity" values. All of the other "linear" test results look >> about the same. > > It's hard to interpret the raw data that you've provided. For example, > I cannot figure out where "selectivity" appears in the raw CSV file > from your results repro. > > Can you post a single spreadsheet or CSV file, with descriptive column > names, and a row for every test case you ran? And with the rows > ordered such that directly comparable results/rows appear close > together? > That's a good point, sorry about that. I forgot the CSV files don't have proper headers, I'll fix that and document the structure better. The process.sh script starts by loading the CSV(s) into sqlite, in order to do the processing / aggregations. If you copy the first couple lines, you'll get scans.db, with nice column names and all that.. The selectivity is calculated as (rows / total_rows) where rows is the rowcount returned by the query, and total_rows is reltuples. I also had charts with "page selectivity", but that often got a bunch of 100% points squashed on the right edge, so I stopped generating those. regards -- Tomas Vondra
On Wed, Jul 16, 2025 at 1:42 PM Tomas Vondra <tomas@vondra.me> wrote: > On 7/16/25 16:45, Peter Geoghegan wrote: > > I get that index characteristics could be the limiting factor, > > especially in a world where we're not yet eagerly reading leaf pages. > > But that in no way justifies just forgetting about prefetch distance > > like this. > > > > True. I think it's simply a matter of "no one really needed that yet", > so the read stream does not have a way to do that. I suspect Thomas > might have a WIP patch for that somewhere ... This seems really important. I don't fully understand why this appears to be less of a problem with the complex patch. Can you help me to confirm my understanding? I think that this "complex" patch code is relevant: static bool index_batch_getnext(IndexScanDesc scan) { ... /* * If we already used the maximum number of batch slots available, it's * pointless to try loading another one. This can happen for various * reasons, e.g. for index-only scans on all-visible table, or skipping * duplicate blocks on perfectly correlated indexes, etc. * * We could enlarge the array to allow more batches, but that's futile, we * can always construct a case using more memory. Not only it would risk * OOM, it'd also be inefficient because this happens early in the scan * (so it'd interfere with LIMIT queries). * * XXX For now we just error out, but the correct solution is to pause the * stream by returning InvalidBlockNumber and then unpause it by doing * read_stream_reset. */ if (INDEX_SCAN_BATCH_FULL(scan)) { DEBUG_LOG("index_batch_getnext: ran out of space for batches"); scan->xs_batches->reset = true; } It looks like we're able to fill up quite a few batches/pages before having to give anything to the read stream. Is that all this is? We do still need to reset the read stream with the "complex" patch -- I see that. But it's just much less of a frequent thing, presumably contributing to the performance advantages that we see for the "complex" patch over the "simple" patch from your testing. Does that seem like a fair summary? BTW, don't think that we actually error-out here? Is that XXX comment block obsolete? > So that's the other thing this probably needs to consider - some concept > of how much effort to invest into finding the next prefetchable block. I agree, of course. That's the main argument in favor of the "complex" design. Every possible cost/benefit is relevant (or may be), so one centralized decision that weighs all those factors seems like the way to go. We don't need to start with a very sophisticated approach, but I do think that we need a design that is orientated around this view of things from the start. The "simple" patch basically has all the same problems, but doesn't even try to address them. The INDEX_SCAN_BATCH_FULL thing is probably still pretty far from optimal, but at least all the pieces are there in one place. At least we're not leaving it up to chance index AM implementation details (i.e. leaf page boundaries) that have very little to do with heapam related costs/what really matters. -- Peter Geoghegan
Hi, On 2025-07-16 14:18:54 -0400, Peter Geoghegan wrote: > I don't fully understand why this appears to be less of a problem with > the complex patch. Can you help me to confirm my understanding? Could you share the current version of the complex patch (happy with a git tree)? Afaict it hasn't been posted, which makes this pretty hard to follow along with / provide feedback on, for others. Greetings, Andres Freund
On Wed, Jul 16, 2025 at 2:27 PM Andres Freund <andres@anarazel.de> wrote: > Could you share the current version of the complex patch (happy with a git > tree)? Afaict it hasn't been posted, which makes this pretty hard follow along > / provide feedback on, for others. Sure: https://github.com/petergeoghegan/postgres/tree/index-prefetch-2025-pg-revisions-v0.11 I think that the version that Tomas must have used is a few days old, and might be a tiny bit different. But I don't think that that's likely to matter, especially not if you just want to get the general idea. -- Peter Geoghegan
On 7/16/25 20:18, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 1:42 PM Tomas Vondra <tomas@vondra.me> wrote: >> On 7/16/25 16:45, Peter Geoghegan wrote: >>> I get that index characteristics could be the limiting factor, >>> especially in a world where we're not yet eagerly reading leaf pages. >>> But that in no way justifies just forgetting about prefetch distance >>> like this. >>> >> >> True. I think it's simply a matter of "no one really needed that yet", >> so the read stream does not have a way to do that. I suspect Thomas >> might have a WIP patch for that somewhere ... > > This seems really important. > > I don't fully understand why this appears to be less of a problem with > the complex patch. Can you help me to confirm my understanding? > > I think that this "complex" patch code is relevant: > > static bool > index_batch_getnext(IndexScanDesc scan) > { > ... > /* > * If we already used the maximum number of batch slots available, it's > * pointless to try loading another one. This can happen for various > * reasons, e.g. for index-only scans on all-visible table, or skipping > * duplicate blocks on perfectly correlated indexes, etc. > * > * We could enlarge the array to allow more batches, but that's futile, we > * can always construct a case using more memory. Not only it would risk > * OOM, it'd also be inefficient because this happens early in the scan > * (so it'd interfere with LIMIT queries). > * > * XXX For now we just error out, but the correct solution is to pause the > * stream by returning InvalidBlockNumber and then unpause it by doing > * read_stream_reset. > */ > if (INDEX_SCAN_BATCH_FULL(scan)) > { > DEBUG_LOG("index_batch_getnext: ran out of space for batches"); > scan->xs_batches->reset = true; > } > > It looks like we're able to fill up quite a few batches/pages before > having to give anything to the read stream. Is that all this is? > > We do still need to reset the read stream with the "complex" patch -- > I see that. But it's just much less of a frequent thing, presumably > contributing to the performance advantages that we see for the > "complex" patch over the "simple" patch from your testing. Does that > seem like a fair summary? > Yes, sounds like a fair summary. > BTW, don't think that we actually error-out here? Is that XXX comment > block obsolete? > Right, obsolete comment. >> So that's the other thing this probably needs to consider - some concept >> of how much effort to invest into finding the next prefetchable block. > > I agree, of course. That's the main argument in favor of the "complex" > design. Every possible cost/benefit is relevant (or may be), so one > centralized decision that weighs all those factors seems like the way > to go. We don't need to start with a very sophisticated approach, but > I do think that we need a design that is orientated around this view > of things from the start. > > The "simple" patch basically has all the same problems, but doesn't > even try to address them. The INDEX_SCAN_BATCH_FULL thing is probably > still pretty far from optimal, but at least all the pieces are there > in one place. At least we're not leaving it up to chance index AM > implementation details (i.e. leaf page boundaries) that have very > little to do with heapam related costs/what really matters. > Perhaps, although I don't quite see why the simpler patch couldn't address some of those problems (within the limit of a single leaf page, of course). I don't think there's anything that's prevent collecting the "details" somewhere (e.g. 
in the IndexScanDesc), and querying it from the callbacks. Or something like that. I understand you may see the "one leaf page" as a limitation of various optimizations, and that's perfectly correct, ofc. I also saw it as a crude limitation of how "bad" the things can go. regards -- Tomas Vondra
On Wed, Jul 16, 2025 at 3:00 PM Tomas Vondra <tomas@vondra.me> wrote: > Yes, sounds like a fair summary. Cool. > Perhaps, although I don't quite see why the simpler patch couldn't > address some of those problems (within the limit of a single leaf page, > of course). I don't think there's anything that's prevent collecting the > "details" somewhere (e.g. in the IndexScanDesc), and querying it from > the callbacks. Or something like that. That is technically possible. But ISTM that that's just an inferior version of the "complex" patch, that duplicates lots of things across index AMs. > I understand you may see the "one leaf page" as a limitation of various > optimizations, and that's perfectly correct, ofc. I also saw it as a > crude limitation of how "bad" the things can go. I'm not opposed to some fairly crude mechanism that stops the prefetching from ever being too aggressive based on index characteristics. But the idea of exclusively relying on leaf page boundaries to do that for us doesn't even seem like a good stopgap solution. On average, the cost of accessing leaf pages is relatively insignificant. But occasionally, very occasionally, it's the dominant cost. I don't think that you can get away with making a static assumption about how much leaf page access costs matter -- it doesn't average out like that. I think that you need at least a simple dynamic approach, that mostly doesn't care too much about how many leaf pages we've read, but occasionally makes heap prefetching much less aggressive in response to the number of leaf pages the scan needs to read being much higher than is typical. I get the impression that you're still of the opinion that the "simple" approach might well have the best chance of success. If that's still how you view things, then I genuinely don't understand why you still see things that way. That perspective definitely made sense to me 6 months ago, but no longer. Do you imagine that (say) Thomas will be able to add pause-and-resume to the read stream interface some time soon, at which point the regressions we see with the "simple" patch (but not the "complex" patch) go away? -- Peter Geoghegan
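As a purely illustrative sketch of the kind of occasional back-off being argued for here (this is not code from either patch; every name below is made up):

/*
 * If the scan has had to read far more leaf pages than it has managed to
 * queue heap blocks for, assume index traversal costs dominate and halve
 * the prefetch distance cap.  The 8x threshold is arbitrary.
 */
if (ps->leaf_pages_read > 8 * Max(ps->heap_blocks_queued, 1))
    ps->max_distance = Max(1, ps->max_distance / 2);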
Hi, On 2025-07-16 14:30:05 -0400, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 2:27 PM Andres Freund <andres@anarazel.de> wrote: > > Could you share the current version of the complex patch (happy with a git > > tree)? Afaict it hasn't been posted, which makes this pretty hard follow along > > / provide feedback on, for others. > > Sure: > > https://github.com/petergeoghegan/postgres/tree/index-prefetch-2025-pg-revisions-v0.11 > > I think that the version that Tomas must have used is a few days old, > and might be a tiny bit different. But I don't think that that's > likely to matter, especially not if you just want to get the general > idea.

As a first thing I just wanted to get a feel for the improvements we can get. I had a scale 5 tpch already loaded, so I ran a bogus query on that to see. The improvement with either of the patchsets with a quick trial query is rather impressive when using direct IO (presumably also with an empty cache, but DIO is more predictable). As Peter's branch doesn't seem to have an enable_* GUC, I used SET effective_io_concurrency=0 to test the non-prefetching results (and verified with master that the results are similar).

Test:

Peter's:

Without prefetching:

SET effective_io_concurrency=0;
SELECT pg_buffercache_evict_relation('lineitem');
EXPLAIN ANALYZE SELECT * FROM lineitem ORDER BY l_shipdate LIMIT 10000;

 Limit (cost=0.44..2332.06 rows=10000 width=106) (actual time=0.611..957.874 rows=10000.00 loops=1)
   Buffers: shared hit=1213 read=8626
   I/O Timings: shared read=943.344
   -> Index Scan using i_l_shipdate on lineitem (cost=0.44..6994824.33 rows=29999796 width=106) (actual time=0.611..956.593 rows=10000.00 loops=1)
        Index Searches: 1
        Buffers: shared hit=1213 read=8626
        I/O Timings: shared read=943.344
 Planning Time: 0.083 ms
 Execution Time: 958.508 ms

With prefetching:

SET effective_io_concurrency=64;
SELECT pg_buffercache_evict_relation('lineitem');
EXPLAIN ANALYZE SELECT * FROM lineitem ORDER BY l_shipdate LIMIT 10000;

 Limit (cost=0.44..2332.06 rows=10000 width=106) (actual time=0.497..67.737 rows=10000.00 loops=1)
   Buffers: shared hit=1227 read=8667
   I/O Timings: shared read=48.473
   -> Index Scan using i_l_shipdate on lineitem (cost=0.44..6994824.33 rows=29999796 width=106) (actual time=0.496..66.471 rows=10000.00 loops=1)
        Index Searches: 1
        Buffers: shared hit=1227 read=8667
        I/O Timings: shared read=48.473
 Planning Time: 0.090 ms
 Execution Time: 68.965 ms

Tomas':

With prefetching:

SET effective_io_concurrency=64;
SELECT pg_buffercache_evict_relation('lineitem');
EXPLAIN ANALYZE SELECT * FROM lineitem ORDER BY l_shipdate LIMIT 10000;

 Limit (cost=0.44..2332.06 rows=10000 width=106) (actual time=0.278..70.609 rows=10000.00 loops=1)
   Buffers: shared hit=1227 read=8668
   I/O Timings: shared read=52.578
   -> Index Scan using i_l_shipdate on lineitem (cost=0.44..6994824.33 rows=29999796 width=106) (actual time=0.277..69.304 rows=10000.00 loops=1)
        Index Searches: 1
        Buffers: shared hit=1227 read=8668
        I/O Timings: shared read=52.578
 Planning Time: 0.072 ms
 Execution Time: 71.549 ms

The wins are similar without DIO and with a cold OS cache, but I don't like emptying out the entire OS cache all the time... I call that a hell of an impressive improvement with either patch - it's really really hard to find order of magnitude improvements in anything close to realistic cases. And that's on a local reasonably fast NVMe - with networked storage we'll see much bigger wins. This also doesn't just repro with toy queries, e.g. TPCH Q02 shows a 2X improvement too (with either patch) - the only reason it's not bigger is that all the remaining IO time is on the inner side of a nestloop that isn't currently prefetchable. Peter, it'd be rather useful if your patch also had an enable/disable GUC, otherwise it's more work to study the performance effects. The effective_io_concurrency approach isn't great, because it also affects bitmap scans, seqscans etc. Just playing around, there are many cases where there is effectively no difference between the two approaches, from a runtime perspective. There are, unsurprisingly, some cases where the complex approach clearly wins, mostly around IN(list-of-constants) so far. Looking at the actual patches now. Greetings, Andres Freund
Hi, On 2025-07-16 15:39:58 -0400, Andres Freund wrote: > Looking at the actual patches now. I just did an initial, not particularly in depth look. A few comments and questions below. For either patch, I think it's high time we split the index/table buffer stats in index scans. It's really annoying to not be able to see if IO time was inside the index itself or in the table. What we're discussing here obviously can never avoid stalls due to fetching index pages, but so far neither patch is able to fully utilize hardware when bound on heap fetches, but that's harder to know without those stats. The BufferMatches() both patches add seems to check more than needed? It's not like the old buffer could have changed what relation it is for while pinned. Seems like it'd be better to just keep track what the prior block was and not go into bufmgr.c at all. WRT the complex patch: Maybe I'm missing something, but the current interface doesn't seem to work for AMs that don't have a 1:1 mapping between the block number portion of the tid and the actual block number? Currently the API wouldn't easily allow the table AM to do batched TID lookups - if you have a query that looks at a lot of table tuples in the same buffer consecutively, we spend a lot of time locking/unlocking said buffer. We also spend a lot of time dispatching from nodeIndexscan.c to tableam in such queries. I'm not suggesting to increase the scope to handle that, but it might be worth keeping in mind. I think the potential gains here are really substantial. Even just not having to lock/unlock the heap block for every tuple in the page would be a huge win, a quick and incorrect hack suggests it's like 25% faster A batched heap_hot_search_buffer() could be a larger improvement, it's often bound by memory latency and per-call overhead. I see some slowdown for well-cached queries with the patch, I've not dug into why. WRT the simple patch: Seems to have the same issue that it assumes TID block numbers correspond to actual disk location? Greetings, Andres Freund
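To illustrate the "just keep track of what the prior block was" suggestion, a minimal sketch; IndexFetchHeapData and ReleaseAndReadBuffer are the real structures/functions, but the prev_blkno tracking and the helper itself are hypothetical:

#include "postgres.h"
#include "access/heapam.h"
#include "storage/bufmgr.h"

static void
switch_heap_page_if_needed(IndexFetchHeapData *hscan, ItemPointer tid,
                           BlockNumber *prev_blkno)
{
    BlockNumber blkno = ItemPointerGetBlockNumber(tid);

    /* same heap page as last time: keep the existing pin, skip bufmgr */
    if (BufferIsValid(hscan->xs_cbuf) && blkno == *prev_blkno)
        return;

    /* different page: swap the pin; no need to re-check the buffer tag */
    hscan->xs_cbuf = ReleaseAndReadBuffer(hscan->xs_cbuf,
                                          hscan->xs_base.rel,
                                          blkno);
    *prev_blkno = blkno;
}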
On Wed, Jul 16, 2025 at 3:40 PM Andres Freund <andres@anarazel.de> wrote: > As a first thing I just wanted to get a feel for the improvements we can get. > I had a scale 5 tpch already loaded, so I ran a bogus query on that to see. Cool. > Test: > > Peter's: To be clear, the "complex" patch is still almost all Tomas' work -- at least right now. I'd like to do a lot more work on this project, though. So far, my main contribution has been debugging advice, and removing code/simplifying things on the nbtree side. > I call that a hell of an impressive improvement with either patch - it's > really really hard to find order of magnitude improvements in anything close > to realistic cases. Nice. > Peter, it'd be rather useful if your patch also had an enable/disable GUC, > otherwise it's more work to study the performance effects. The > effective_io_concurrency approach isn't great, because it also affects > bitmap scans, seqscans etc. FWIW I took out the GUC because it works by making indexam.c use the amgettuple interface. The "complex" patch completely gets rid of btgettuple, whereas the simple patch keeps btgettuple in largely its current form. I agree that having such a GUC is important during development, and will try to add it back soon. It'll have to work in some completely different way, but that still shouldn't be difficult. -- Peter Geoghegan
On 7/16/25 19:56, Tomas Vondra wrote: > On 7/16/25 18:39, Peter Geoghegan wrote: >> On Wed, Jul 16, 2025 at 11:29 AM Peter Geoghegan <pg@bowt.ie> wrote: >>> For example, with "linear_10 / eic=16 / sync", it looks like "complex" >>> has about half the latency of "simple" in tests where selectivity is >>> 10. The advantage for "complex" is even greater at higher >>> "selectivity" values. All of the other "linear" test results look >>> about the same. >> >> It's hard to interpret the raw data that you've provided. For example, >> I cannot figure out where "selectivity" appears in the raw CSV file >> from your results repro. >> >> Can you post a single spreadsheet or CSV file, with descriptive column >> names, and a row for every test case you ran? And with the rows >> ordered such that directly comparable results/rows appear close >> together? >> > > That's a good point, sorry about that. I forgot the CSV files don't have > proper headers, I'll fix that and document the structure better. > > The process.sh script starts by loading the CSV(s) into sqlite, in order > to do the processing / aggregations. If you copy the first couple lines, > you'll get scans.db, with nice column names and all that.. > > The selectivity is calculated as > > (rows / total_rows) > > where rows is the rowcount returned by the query, and total_rows is > reltuples. I also had charts with "page selectivity", but that often got > a bunch of 100% points squashed on the right edge, so I stopped > generating those. > I've pushed results from a couple more runs (the cyclic_25 is still running), and I added "export.csv" which has a subset of columns, and calculated row/page selectivities. Does this work for you? regards -- Tomas Vondra
On Wed, Jul 16, 2025 at 4:46 PM Andres Freund <andres@anarazel.de> wrote: > Currently the API wouldn't easily allow the table AM to do batched TID lookups > - if you have a query that looks at a lot of table tuples in the same buffer > consecutively, we spend a lot of time locking/unlocking said buffer. We also > spend a lot of time dispatching from nodeIndexscan.c to tableam in such > queries. > > I'm not suggesting to increase the scope to handle that, but it might be worth > keeping in mind. > > I think the potential gains here are really substantial. I agree. I've actually discussed this possibility with Tomas a few times, though not recently. It's really common for TIDs that appear on a leaf page to be slightly out of order due to minor heap fragmentation. Even minor fragmentation can significantly increase pin/buffer lock traffic right now. I think that it makes a lot of sense for the general design to open up possibilities such as this. > I see some slowdown for well-cached queries with the patch, I've not dug into > why. I saw less than a 5% regression in pgbench SELECT with the "complex" patch with 32 clients. My guess is that it's due to the less efficient memory allocation with batching. Obviously this isn't acceptable, but I'm not particularly concerned about it right now. I was actually pleased to see that there wasn't a much larger regression there. -- Peter Geoghegan
Hi, On 2025-07-16 16:54:06 -0400, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 3:40 PM Andres Freund <andres@anarazel.de> wrote: > > As a first thing I just wanted to get a feel for the improvements we can get. > > I had a scale 5 tpch already loaded, so I ran a bogus query on that to see. > > Cool. > > > Test: > > > > Peter's: > > To be clear, the "complex" patch is still almost all Tomas' work -- at > least right now. I'd like to do a lot more work on this project, > though. Indeed. Sorry - what I intended but failed to write was "the approach that Peter is arguing for"... Greetings, Andres Freund
On Wed, Jul 16, 2025 at 4:46 PM Andres Freund <andres@anarazel.de> wrote: > Maybe I'm missing something, but the current interface doesn't seem to work > for AMs that don't have a 1:1 mapping between the block number portion of the > tid and the actual block number? I'm not completely sure what you mean here. Even within nbtree, posting list tuples work by setting the INDEX_ALT_TID_MASK index tuple header bit. That makes nbtree interpret IndexTupleData.t_tid as metadata (in this case describing a posting list). Obviously, that isn't "a standard IndexTuple", but that won't break either patch/approach. The index AM is obligated to pass back heap TIDs, without any external code needing to understand these sorts of implementation details. The on-disk representation of TIDs remains an implementation detail known only to index AMs. -- Peter Geoghegan
Hi, On 2025-07-16 17:27:23 -0400, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 4:46 PM Andres Freund <andres@anarazel.de> wrote: > > Maybe I'm missing something, but the current interface doesn't seem to work > > for AMs that don't have a 1:1 mapping between the block number portion of the > > tid and the actual block number? > > I'm not completely sure what you mean here. > > Even within nbtree, posting list tuples work by setting the > INDEX_ALT_TID_MASK index tuple header bit. That makes nbtree interpret > IndexTupleData.t_tid as metadata (in this case describing a posting > list). Obviously, that isn't "a standard IndexTuple", but that won't > break either patch/approach. > > The index AM is obligated to pass back heap TIDs, without any external > code needing to understand these sorts of implementation details. The > on-disk representation of TIDs remains an implementation detail known > only to index AMs. I don't mean the index tids, but how the read stream is fed block numbers. In the "complex" patch that's done by index_scan_stream_read_next(). And the block number it returns is simply return ItemPointerGetBlockNumber(tid); without the table AM having any way of influencing that. Which means that if your table AM does not use the block number of the tid 1:1 as the real block number, the fetched block will be completely bogus. It's similar in the simple patch, bt_stream_read_next() etc also just use ItemPointerGetBlockNumber(). Greetings, Andres Freund
On Wed, Jul 16, 2025 at 5:41 PM Andres Freund <andres@anarazel.de> wrote: > I don't mean the index tids, but how the read stream is fed block numbers. In > the "complex" patch that's done by index_scan_stream_read_next(). And the > block number it returns is simply > > return ItemPointerGetBlockNumber(tid); > > without the table AM having any way of influencing that. Which means that if > your table AM does not use the block number of the tid 1:1 as the real block > number, the fetched block will be completely bogus. How is that handled when such a table AM uses the existing amgettuple interface? I think that it shouldn't be hard to implement an opt-out of prefetching for such table AMs, so at least you won't fetch random garbage. Right now, the amgetbatch interface is oriented around returning TIDs. Obviously it works that way because that's what heapam expects, and what amgettuple (which I'd like to replace with amgetbatch) does. -- Peter Geoghegan
Hi, On 2025-07-16 17:47:53 -0400, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 5:41 PM Andres Freund <andres@anarazel.de> wrote: > > I don't mean the index tids, but how the read stream is fed block numbers. In > > the "complex" patch that's done by index_scan_stream_read_next(). And the > > block number it returns is simply > > > > return ItemPointerGetBlockNumber(tid); > > > > without the table AM having any way of influencing that. Which means that if > > your table AM does not use the block number of the tid 1:1 as the real block > > number, the fetched block will be completely bogus. > > How is that handled when such a table AM uses the existing amgettuple > interface? There's no problem today - the indexams never use the tids to look up blocks themselves. They're always passed to the tableam to do so (via table_index_fetch_tuple() etc). I.e. the translation from TIDs to specific blocks & buffers happens entirely inside the tableam, therefore the tableam can choose to not use a 1:1 mapping or even to not use any buffers at all. > I think that it shouldn't be hard to implement an opt-out > of prefetching for such table AMs, so at least you won't fetch random > garbage. I don't think that's the right answer here. ISTM the layering in both patches just isn't quite correct right now. The read stream shouldn't be "filled" with table buffers by index code, it needs to be filled by tableam specific code. > Right now, the amgetbatch interface is oriented around returning TIDs. > Obviously it works that way because that's what heapam expects, and > what amgettuple (which I'd like to replace with amgetbatch) does. ISTM the right answer would be to allow the tableam to get the batches, without indexam feeding the read stream. That, perhaps not so coincidentally, is also what's needed for batching heap page locking and HOT search. I think this means that it has to be the tableam that creates the read stream and that does the work that's currently done in index_scan_stream_read_next(), i.e. the translation from TID to whatever resources are required by the tableam. Which presumably would include the tableam calling index_batch_getnext(). Greetings, Andres Freund
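A hedged sketch of the layering described above, with the table AM owning the read stream and doing the TID-to-block translation itself. The callback shape matches the existing ReadStreamBlockNumberCB API; index_batch_getnext_tid() is only a stand-in for whatever the "complex" patch ends up exposing for pulling the next TID out of the current index batch, so treat this as an illustration of the idea rather than as actual patch code.

    #include "postgres.h"

    #include "access/relscan.h"
    #include "storage/block.h"
    #include "storage/itemptr.h"
    #include "storage/read_stream.h"

    /* stand-in for the indexam batch API discussed in this thread */
    extern bool index_batch_getnext_tid(IndexScanDesc scan, ItemPointer tid);

    typedef struct HeapIndexFetchStreamState
    {
        IndexScanDesc scan;     /* gives the table AM access to the batches */
    } HeapIndexFetchStreamState;

    /*
     * Read stream callback owned by heapam: consume TIDs from the index
     * batch layer and translate them into block numbers.  A different table
     * AM could map TIDs to physical blocks however it likes at this point.
     */
    static BlockNumber
    heapam_index_stream_next_block(ReadStream *stream,
                                   void *callback_private_data,
                                   void *per_buffer_data)
    {
        HeapIndexFetchStreamState *state = callback_private_data;
        ItemPointerData tid;

        if (!index_batch_getnext_tid(state->scan, &tid))
            return InvalidBlockNumber;  /* no more TIDs, pause the stream */

        return ItemPointerGetBlockNumber(&tid);
    }

The table AM would create the stream itself with read_stream_begin_relation() and pass this callback, which is also where per-heap-page batching of TIDs could later be slotted in.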
On Wed, Jul 16, 2025 at 6:18 PM Andres Freund <andres@anarazel.de> wrote: > There's no problem today - the indexams never use the tids to look up blocks > themselves. They're always passed to the tableam to do so (via > table_index_fetch_tuple() etc). I.e. the translation from TIDs to specific > blocks & buffers happens entirely inside the tableam, therefore the tableam > can choose to not use a 1:1 mapping or even to not use any buffers at all. Of course. Somehow, I missed that obvious point. That is the bare minimum for a new interface such as this. > ISTM the right answer would be to allow the tableam to get the batches, > without indexam feeding the read stream. That, perhaps not so coincidentally, > is also what's needed for batching heap page locking and and HOT search. I agree. > I think this means that it has to be the tableam that creates the read stream > and that does the work that's currently done in index_scan_stream_read_next(), > i.e. the translation from TID to whatever resources are required by the > tableam. Which presumably would include the tableam calling > index_batch_getnext(). It probably makes sense to put that off for (let's say) a couple more months. Just so we can get what we have now in better shape. The "complex" patch only very recently started to pass all my tests (my custom nbtree test suite used for my work in 17 and 18). I still need buy-in from Tomas on the "complex" approach. We chatted briefly on IM, and he seems more optimistic about it than I thought (in my on-list remarks from earlier). It is definitely his patch, and I don't want to speak for him. -- Peter Geoghegan
On 7/17/25 00:33, Peter Geoghegan wrote: > On Wed, Jul 16, 2025 at 6:18 PM Andres Freund <andres@anarazel.de> wrote: >> There's no problem today - the indexams never use the tids to look up blocks >> themselves. They're always passed to the tableam to do so (via >> table_index_fetch_tuple() etc). I.e. the translation from TIDs to specific >> blocks & buffers happens entirely inside the tableam, therefore the tableam >> can choose to not use a 1:1 mapping or even to not use any buffers at all. > > Of course. Somehow, I missed that obvious point. That is the bare > minimum for a new interface such as this. > >> ISTM the right answer would be to allow the tableam to get the batches, >> without indexam feeding the read stream. That, perhaps not so coincidentally, >> is also what's needed for batching heap page locking and and HOT search. > > I agree. > >> I think this means that it has to be the tableam that creates the read stream >> and that does the work that's currently done in index_scan_stream_read_next(), >> i.e. the translation from TID to whatever resources are required by the >> tableam. Which presumably would include the tableam calling >> index_batch_getnext(). > > It probably makes sense to put that off for (let's say) a couple more > months. Just so we can get what we have now in better shape. The > "complex" patch only very recently started to pass all my tests (my > custom nbtree test suite used for my work in 17 and 18). > I agree tableam needs to have a say in this, so that it can interpret the TIDs in a way that fits how it actually stores data. But I'm not sure it should be responsible for calling index_batch_getnext(). Isn't the batching mostly an "implementation" detail of the index AM? That's how I was thinking about it, at least. Some of these arguments could be used against the current patch, where the next_block callback is defined by executor nodes. So in a way those are also "aware" of the batching. > I still need buy-in from Tomas on the "complex" approach. We chatted > briefly on IM, and he seems more optimistic about it than I thought > (in my on-list remarks from earlier). It is definitely his patch, > and I don't want to speak for him. I think I feel much better about the "complex" approach, mostly because you got involved and fixed some of the issues I've been struggling with. That is a huge help, thank you for that. The reasons why I started to look at the "simple" patch again [1] were not entirely technical, at least not in the sense "Which of the two designs is better?" It was mostly about my (in)ability to get it into a shape I'd be confident enough to commit. I kept running into weird and subtle issues in parts of the code I knew nothing about. Great way to learn stuff, but also a great way to burn out ... So the way I was thinking about it is more "perfect approach that I'll never be able to commit" vs. "good (and much simpler) approach". It's a bit like in the saying about a tree falling in a forest. If a perfect patch never gets committed, does it make a sound? From the technical point of view, the "complex" approach is clearly more flexible. Because how could it not be? It can do everything the simple approach can, but also some additional stuff thanks to having multiple leaf pages at once. The question I'm still trying to figure out is how significant those benefits are, and whether it's worth the extra complexity.
I realize there's a difference between "complexity of a patch" and "complexity of the final code", and it may very well be that the complex approach would result in a much cleaner final code - I don't know. I don't have any clear "vision" of how the index AMs should work. My ambition was (and still is) limited to "add prefetching to index scans", and I don't feel qualified to make judgments about the overall design of index AMs (interfaces, layering). I have opinions, of course, but I also realize my insights are not very deep in this area. Which is why I've been trying to measure the "practical" differences between the two approaches, e.g. trying to compare how it performs on different data sets, etc. There are some pretty massive differences in favor of the "complex" approach, mostly due to the single-leaf-page limitation of the simple patch. I'm still trying to understand if this is "inherent" or if it could be mitigated in read_stream_reset(). (Will share results from a couple experiments in a separate message later.) This is the context of the benchmarks I've been sharing - me trying to understand the practical implications/limits of the simple approach. Not an attempt to somehow prove it's better, or anything like that. I'm not opposed to continuing work on the "complex" approach, but as I said, I'm sure I can't pull that off on my own. With your help, I think the chance of success would be considerably higher. Does this clarify how I think about the complex patch? regards [1] https://www.postgresql.org/message-id/32c15a30-6e25-4f6d-9191-76a19482c556%40vondra.me -- Tomas Vondra
Hi, I was wondering why the "simple" approach performs so much worse than the "complex" one on some of the data sets. The theory was that it's due to using read_stream_reset(), which resets the prefetch distance, and so we need to "ramp up" from scratch (distance=1) for every batch. Which for the correlated data sets is very often. So I decided to do some experiments, to see if this is really the case, and maybe see if read_stream_reset() could fix this in some way.

First, I added an elog(LOG, "distance %d", stream->distance); at the beginning of read_stream_next_block() to see how the distance changes during the scan. Consider a query returning 2M rows from the "cyclic" table (the attached .sql creates/populates it):

-- selects 20% rows
SELECT * FROM cyclic WHERE a BETWEEN 0 AND 20000;

With the "complex" patch, the CDF of the distance looks like this:

+----------+-----+
| distance | pct |
+----------+-----+
|        0 |   0 |
|       25 |   0 |
|       50 |   0 |
|       75 |   0 |
|      100 |   0 |
|      125 |   0 |
|      150 |   0 |
|      175 |   0 |
|      200 |   0 |
|      225 |   0 |
|      250 |   0 |
|      275 |  99 |
|      300 |  99 |
+----------+-----+

That is, 99% of the distances are in the range [275, 300].

Note: This is much higher than the effective_io_concurrency value (16), which may be surprising. But the ReadStream uses that to limit the number of I/O requests, not as a limit of how far to look ahead. A lot of the blocks are in the cache, so it looks far ahead.

But with the "simple" patch it looks like this:

+----------+-----+
| distance | pct |
+----------+-----+
|        0 |   0 |
|       25 |  99 |
|       50 |  99 |
|       75 |  99 |
|      100 |  99 |
|      125 |  99 |
|      150 |  99 |
|      175 |  99 |
|      200 |  99 |
|      225 |  99 |
|      250 |  99 |
|      275 | 100 |
|      300 | 100 |
+----------+-----+

So 99% of the distances are in [0, 25]. A more detailed view on the first couple distances:

+----------+-----+
| distance | pct |
+----------+-----+
|        0 |   0 |
|        1 |  99 |
|        2 |  99 |
|        3 |  99 |
|        4 |  99 |
...

So 99% of the distances are 1. Well, that's not very far, it effectively means no prefetching (We still issue the fadvise, though, although a comment in read_stream.c suggests we won't. Possible bug?).

This means *there's no ramp-up at all*. On the first leaf the distance grows to ~270, but after the stream gets reset it stays at 1 and never increases. That's ... not great? I'm not entirely sure.

I decided to hack the ReadStream a bit, so that it restores the last non-zero distance seen (i.e. right before reaching end of the stream). And with that I got this:

+----------+-----+
| distance | pct |
+----------+-----+
|        0 |   0 |
|       25 |  38 |
|       50 |  38 |
|       75 |  38 |
|      100 |  39 |
|      125 |  42 |
|      150 |  47 |
|      175 |  47 |
|      200 |  48 |
|      225 |  49 |
|      250 |  50 |
|      275 | 100 |
|      300 | 100 |
+----------+-----+

Not as good as the "complex" patch, but much better than the original. And the performance got almost the same (for this one query).

Perhaps the ReadStream should do something like this? Of course, the simple patch resets the stream very often, likely much more often than anything else in the code. But wouldn't it be beneficial for streams reset because of a rescan? Possibly needs to be optional.

regards -- Tomas Vondra
On Fri, Jul 18, 2025 at 1:44 PM Tomas Vondra <tomas@vondra.me> wrote: > I agree tableam needs to have a say in this, so that it can interpret > the TIDs in a way that fits how it actually stores data. But I'm not > sure it should be responsible for calling index_batch_getnext(). Isn't > the batching mostly an "implementation" detail of the index AM? That's > how I was thinking about it, at least. I think of it in roughly the opposite way: to me, the table AM should mostly be in control of the whole process. The index AM (or really some generalized layer that is used for every index AM) should have some influence over the scheduling of index scans, but in typical cases where prefetching might be helpful the index AM should have little or no impact on the scheduling. All of this business with holding on to buffer pins is 100% due to heap AM implementation details. Index vacuuming doesn't acquire cleanup locks because the index AM itself requires them; cleanup locks are only required because otherwise there are races that affect index scans, where we get confused about which TID relates to which logical row. That's why bitmap index scans don't need to hold onto pins at all. It's true that the current index AM API makes this the direct responsibility of index AMs, by requiring them to hold on to buffer pins across heap accesses. But that's just a historical accident. > The reasons why I started to look at the "simple" patch again [1] were > not entirely technical, at least not in the sense "Which of the two > designs is better?" It was mostly about my (in)ability to get it into a > shape I'd be confident enough to commit. I kept running into weird and > subtle issues in parts of the code I knew nothing about. Great way to > learn stuff, but also a great way to burnout ... I was almost 100% sure that those nbtree implementation details were quite fixable from a very early stage. I didn't really get involved too much at first, because I didn't want to encroach. I probably could have done a lot better with that myself. > So the way I was thinking about it is more "perfect approach that I'll > never be able to commit" vs. "good (and much simpler) approach". It's a > bit like in the saying about a tree falling in forest. If a perfect > patch never gets committed, does it make a sound? Give yourself some credit. The complex patch is roughly 98% your work, and already works quite well. It's far from committable, of course, but it feels like it's already in roughly the right shape. > From the technical point of view, the "complex" approach is clearly more > flexible. Because how could it not be? It can do everything the simple > approach can, but also some additional stuff thanks to having multiple > leaf pages at once. Right. More than anything else, I don't like the "simple" approach because limiting the number of leaf pages that it can read to exactly one feels so unnatural to me. It works in terms of the existing behavior with reading one leaf page at a time to do heap prefetching. But that existing behavior is itself a behavior that only exists for the benefit of heapam. It just seems circular to me: "simple" heap prefetching does things in a way that's convenient for index AMs, specifically around the leaf-at-a-time implementation details -- details which only exist for the benefit of heapam. My sense is that just cutting out the index AM entirely is a much more principled approach.
It's also because of the ability to reorder work, and to centralize scheduling of index scans, of course -- there are practical benefits, too. But, honestly, my primary concern is this issue with "circularity". The "simple" patch is simpler only as one incremental step. But it doesn't actually leave the codebase as a whole in a simpler state than I believe to be possible with the "complex" patch. It won't really be simpler in the first committed version, and it definitely won't be if we ever want to improve on that. If anybody else has an opinion on this, please speak up. I'm pretty sure that only Tomas and I have commented on this important aspect directly. I don't want to win the argument; I just want the best design. > I don't have any clear "vision" of how the index AMs should work. My > ambition was (and still is) limited to "add prefetching to index scans", > and I don't feel qualified to make judgments about the overall design of > index AMs (interfaces, layering). I have opinions, of course, but I also > realize my insights are not very deep in this area. Thanks for being so open. Your position is completely reasonable. > Which is why I've been trying to measure the "practical" differences > between the two approaches, e.g. trying to compare how it performs on > different data sets, etc. There are some pretty massive differences in > favor of the "complex" approach, mostly due to the single-leaf-page > limitation of the simple patch. I'm still trying to understand if this > is "inherent" or if it could be mitigated in read_stream_reset(). (Will > share results from a couple experiments in a separate message later.) At a minimum, you should definitely teach the "simple" patchset to not reset the prefetch distance when there's no real need for it. That puts the "simple" patch at an artificial and unfair disadvantage. > This is the context of the benchmarks I've been sharing - me trying to > understand the practical implications/limits of the simple approach. Not > an attempt to somehow prove it's better, or anything like that. Makes sense. > I'm not opposed to continuing work on the "complex" approach, but as I > said, I'm sure I can't pull that off on my own. With your help, I think > the chance of success would be considerably higher. I can commit to making this project my #1 focus for Postgres 19 (#1 focus by far), provided the "complex" approach is used - just say the word. I cannot promise that we will be successful. But I can say for sure that I'll have skin in the game. If the project fails, then I'll have failed too. > Does this clarify how I think about the complex patch? Yes, it does. BTW, I don't think that there's all that much left to be said about nbtree in particular here. I don't think that there's very much work left there. -- Peter Geoghegan
Hi, On 2025-07-18 19:44:51 +0200, Tomas Vondra wrote: > I agree tableam needs to have a say in this, so that it can interpret > the TIDs in a way that fits how it actually stores data. But I'm not > sure it should be responsible for calling index_batch_getnext(). Isn't > the batching mostly an "implementation" detail of the index AM? That's > how I was thinking about it, at least. I don't agree with that. For efficiency reasons alone table AMs should get a whole batch of TIDs at once. If you have an ordered indexscan that returns TIDs that are correlated with the table, we waste a *tremendous* amount of cycles right now. Instead of locking the page, doing a HOT search for every tuple, and then unlocking the page, we lock and unlock the page for every single TID. The locking alone is a significant overhead (it's like 25% of the cycles or so), but what's worse, it reduces what out-of-order execution can do to hide cache-misses. Even leaving locking overhead and out-of-order execution aside, there's a good bit of constant overhead work in heap_hot_search_buffer() that can be avoided by doing the work all at once.

Just to show how big that effect is, I hacked up a patch that holds the buffer lock from when the buffer is first pinned in heapam_index_fetch_tuple() until another buffer is pinned, or until the scan ends. That's totally not a valid change due to holding the lock for far too long, but it's a decent approximation of the gain of reducing the locking. This query

    SELECT * FROM lineitem ORDER BY l_orderkey OFFSET 10000000 LIMIT 1;

speeds up by 28%. Of course that's an extreme case, but still. That likely undersells the gain, because the out-of-order benefits aren't really there due to all the other code that runs in between two heap_hot_search_buffer() calls. It obviously also doesn't show any of the amortization benefits.

IMO the flow really should be something like this:

    IndexScan executor node
      -> table "index" scan using the passed in IndexScanDesc
        -> read stream doing readahead for all the required heap blocks
          -> table AM next page callback
            -> index scans returning batches

I think the way that IndexOnlyScan works today (independent of this patch) really is a layering violation. It "knows" about the visibility map, which it really has no business accessing - that's a heap specific thing. It also knows too much about different formats that can be stored by indexes, but that's kind of a separate issue. Greetings, Andres Freund
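As a rough illustration of the batched lookup being described here (not the actual hack above, and not valid production code either), a table AM could resolve all the TIDs that fall on one heap page under a single buffer lock. The grouping of TIDs by block is assumed to have happened elsewhere; heap_hot_search_buffer() and LockBuffer() are the real functions, while the helper itself is hypothetical.

    #include "postgres.h"

    #include "access/heapam.h"
    #include "storage/bufmgr.h"
    #include "utils/snapshot.h"

    /*
     * Sketch: look up a group of TIDs known to point into the same heap
     * block, taking the content lock once instead of once per TID.  Returns
     * the number of matching visible tuples; the HeapTupleData entries point
     * into the still-pinned buffer.
     */
    static int
    hot_search_page_batch(Relation heaprel, Snapshot snapshot, Buffer buffer,
                          ItemPointerData *tids, int ntids,
                          HeapTupleData *tuples_out)
    {
        int         nfound = 0;

        LockBuffer(buffer, BUFFER_LOCK_SHARE);

        for (int i = 0; i < ntids; i++)
        {
            bool        all_dead = false;

            if (heap_hot_search_buffer(&tids[i], heaprel, buffer, snapshot,
                                       &tuples_out[nfound], &all_dead, true))
                nfound++;
        }

        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

        return nfound;
    }

A real version would still have to do something with the all_dead information and with the fact that the tuples reference buffer memory, but it shows where the per-TID lock/unlock traffic and part of the per-call overhead would go away.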
On Fri, Jul 18, 2025 at 4:52 PM Andres Freund <andres@anarazel.de> wrote: > I don't agree with that. For efficiency reasons alone table AMs should get a > whole batch of TIDs at once. If you have an ordered indexscan that returns > TIDs that are correlated with the table, we waste *tremendous* amount of > cycles right now. I agree, I think. But the terminology in this area can be confusing, so let's make sure that we all understand each other: I think that the table AM probably needs to have its own definition of a batch (or some other distinct phrase/concept) -- it's not necessarily the same group of TIDs that are associated with a batch on the index AM side. (Within an index AM, there is a 1:1 correspondence between batches and leaf pages, and batches need to hold on to a leaf page buffer pin for a time. None of this should really matter to the table AM.) At a high level, the table AM (and/or its read stream) asks for so many heap blocks/TIDs. Occasionally, index AM implementation details (i.e. the fact that many index leaf pages have to be read to get very few TIDs) will result in that request not being honored. The interface that the table AM uses must therefore occasionally answer "I'm sorry, I can only reasonably give you so many TIDs at this time". When that happens, the table AM has to make do. That can be very temporary, or it can happen again and again, depending on implementation details known only to the index AM side (though typically it'll never happen even once). Does that sound roughly right to you? Obviously these details are still somewhat hand-wavy -- I'm not fully sure of what the interface should look like, by any means. But the important points are: * The table AM drives the whole process. * The table AM knows essentially nothing about leaf pages/index AM batches -- it just has some general idea that sometimes it cannot have its request honored, in which case it must make do. * Some other layer represents the index AM -- though that layer actually lives outside of index AMs (this is the code that the "complex" patch currently puts in indexam.c). This other layer manages resources (primarily leaf page buffer pins) on behalf of each index AM. It also determines whether or not index AM implementation details make it impractical to give the table AM exactly what it asked for (this might actually require a small amount of cooperation from index AM code, based on simple generic measures like leaf pages read). * This other index AM layer does still know that it isn't cool to drop leaf page buffer pins before we're done reading the corresponding heap TIDs, due to heapam implementation details around making concurrent heap TID recycling safe. I'm not really sure how the table AM lets the new index AM layer know "okay, done with all those TIDs now" in a way that is both correct (in terms of avoiding unsafe concurrent TID recycling) and also gives the table AM the freedom to do its own kind of batch access at the level of heap pages. We don't necessarily have to figure all that out in the first committed version, though. -- Peter Geoghegan
Hi, On 2025-07-18 17:44:26 -0400, Peter Geoghegan wrote: > On Fri, Jul 18, 2025 at 4:52 PM Andres Freund <andres@anarazel.de> wrote: > > I don't agree with that. For efficiency reasons alone table AMs should get a > > whole batch of TIDs at once. If you have an ordered indexscan that returns > > TIDs that are correlated with the table, we waste *tremendous* amount of > > cycles right now. > > I agree, I think. But the terminology in this area can be confusing, > so let's make sure that we all understand each other: > > I think that the table AM probably needs to have its own definition of > a batch (or some other distinct phrase/concept) -- it's not > necessarily the same group of TIDs that are associated with a batch on > the index AM side. I assume, for heap, it'll always be a narrower definition than for the indexam, basically dealing with all the TIDs that fit within one page at once? > (Within an index AM, there is a 1:1 correspondence between batches and leaf > pages, and batches need to hold on to a leaf page buffer pin for a > time. None of this should really matter to the table AM.) To some degree the table AM will need to care about the index level batching - we have to be careful about how many pages we keep pinned overall. Which is something that both the table and the index AM have some influence over. > At a high level, the table AM (and/or its read stream) asks for so > many heap blocks/TIDs. Occasionally, index AM implementation details > (i.e. the fact that many index leaf pages have to be read to get very > few TIDs) will result in that request not being honored. The interface > that the table AM uses must therefore occasionally answer "I'm sorry, > I can only reasonably give you so many TIDs at this time". When that > happens, the table AM has to make do. That can be very temporary, or > it can happen again and again, depending on implementation details > known only to the index AM side (though typically it'll never happen > even once). I think that requirement will make things more complicated. Why do we need to have it? > Does that sound roughly right to you? Obviously these details are > still somewhat hand-wavy -- I'm not fully sure of what the interface > should look like, by any means. But the important points are: > > * The table AM drives the whole process. Check. > * The table AM knows essentially nothing about leaf pages/index AM > batches -- it just has some general idea that sometimes it cannot have > its request honored, in which case it must make do. Not entirely convinced by this one. > * Some other layer represents the index AM -- though that layer > actually lives outside of index AMs (this is the code that the > "complex" patch currently puts in indexam.c). This other layer manages > resources (primarily leaf page buffer pins) on behalf of each index > AM. It also determines whether or not index AM implementation details > make it impractical to give the table AM exactly what it asked for > (this might actually require a small amount of cooperation from index > AM code, based on simple generic measures like leaf pages read). I don't really have an opinion about this one. > * This other index AM layer does still know that it isn't cool to drop > leaf page buffer pins before we're done reading the corresponding heap > TIDs, due to heapam implementation details around making concurrent > heap TID recycling safe. I'm not sure why this needs to live in the generic code, rather than the specific index AM? 
> I'm not really sure how the table AM lets the new index AM layer know "okay, > done with all those TIDs now" in a way that is both correct (in terms of > avoiding unsafe concurrent TID recycling) and also gives the table AM the > freedom to do its own kind of batch access at the level of heap pages. I'd assume that the table AM has to call some indexam function to release index-batches, whenever it doesn't need the reference anymore? And the index-batch release can then unpin? Greetings, Andres Freund
On Fri, Jul 18, 2025 at 10:47 PM Andres Freund <andres@anarazel.de> wrote: > > I think that the table AM probably needs to have its own definition of > > a batch (or some other distinct phrase/concept) -- it's not > > necessarily the same group of TIDs that are associated with a batch on > > the index AM side. > > I assume, for heap, it'll always be a narrower definition than for the > indexam, basically dealing with all the TIDs that fit within one page at once? Yes, I think so. > > (Within an index AM, there is a 1:1 correspondence between batches and leaf > > pages, and batches need to hold on to a leaf page buffer pin for a > > time. None of this should really matter to the table AM.) > > To some degree the table AM will need to care about the index level batching - > we have to be careful about how many pages we keep pinned overall. Which is > something that both the table and the index AM have some influence over. Can't they operate independently? If not (if there must be a per-executor-node hard limit on pins held or whatever), then I still see no need for close coordination. > > At a high level, the table AM (and/or its read stream) asks for so > > many heap blocks/TIDs. Occasionally, index AM implementation details > > (i.e. the fact that many index leaf pages have to be read to get very > > few TIDs) will result in that request not being honored. The interface > > that the table AM uses must therefore occasionally answer "I'm sorry, > > I can only reasonably give you so many TIDs at this time". When that > > happens, the table AM has to make do. That can be very temporary, or > > it can happen again and again, depending on implementation details > > known only to the index AM side (though typically it'll never happen > > even once). > > I think that requirement will make things more complicated. Why do we need to > have it? What if it turns out that there is a large run of contiguous leaf pages that contain no more than 2 or 3 matching index tuples? What if there's no matches across many leaf pages? Surely we have to back off with prefetching when that happens. > > * The table AM knows essentially nothing about leaf pages/index AM > > batches -- it just has some general idea that sometimes it cannot have > > its request honored, in which case it must make do. > > Not entirely convinced by this one. We can probably get away with modelling all costs on the index AM side as the number of pages read. This isn't all that accurate; some pages are more expensive to read than others, it's more expensive to start a new primitive index scan/index search than it is to just step to the next page. But it's probably close enough for our purposes. And, I think that it'll generalize reasonably well across all index AMs. > > * This other index AM layer does still know that it isn't cool to drop > > leaf page buffer pins before we're done reading the corresponding heap > > TIDs, due to heapam implementation details around making concurrent > > heap TID recycling safe. > > I'm not sure why this needs to live in the generic code, rather than the > specific index AM? Currently, the "complex" patch calls into nbtree to release its buffer pin -- it does this by calling btfreebatch(). btfreebatch is not completely trivial (it also calls _bt_killitems as needed). But nbtree doesn't know when or how that'll happen. We're not obligated to do it in precisely the same order as the order the pages were read in, for example. In principle, the new indexam.c layer could do this in almost any order. 
> > I'm not really sure how the table AM lets the new index AM layer know "okay, > > done with all those TIDs now" in a way that is both correct (in terms of > > avoiding unsafe concurrent TID recycling) and also gives the table AM the > > freedom to do its own kind of batch access at the level of heap pages. > > I'd assume that the table AM has to call some indexam function to release > index-batches, whenever it doesn't need the reference anymore? And the > index-batch release can then unpin? It does. But that can be fairly generic -- btfreebatch will probably end up looking very similar to (say) hashfreebatch and gistfreebatch. Again, the indexam.c layer actually gets to decide when it happens -- that's what I meant about it being under its control (I didn't mean that it literally did everything without involving the index AM). -- Peter Geoghegan
On Sat, Jul 19, 2025 at 6:31 AM Tomas Vondra <tomas@vondra.me> wrote: > Perhaps the ReadStream should do something like this? Of course, the > simple patch resets the stream very often, likely mcuh more often than > anything else in the code. But wouldn't it be beneficial for streams > reset because of a rescan? Possibly needs to be optional. Right, that's also discussed, with a similar patch, here: https://www.postgresql.org/message-id/CA%2BhUKG%2Bx2BcqWzBC77cN0ewhzMF0kYhC6c4G_T2gJLPbqYQ6Ow%40mail.gmail.com Resetting the distance was a short-sighted mistake: I was thinking about rescans, the original use case for the reset operation, and guessing that the data would remain cached. But all the new users of _reset() have a completely different motivation, namely temporary exhaustion in their source data, so that guess was simply wrong. There was also some discussion at the time about whether "reset so I can rescan", and "reset so I can continue after a temporary stop" should be different operations requiring different APIs. It now seems like one operation is sufficient, but it should preserve the distance as you showed and then let the algorithm learn about already-cached data in the rescan case (if it is even true then, which is also debatable since it depends on the size of the scan). So, I think we should just go ahead and commit a patch like that.
On 7/19/25 06:03, Thomas Munro wrote: > On Sat, Jul 19, 2025 at 6:31 AM Tomas Vondra <tomas@vondra.me> wrote: >> Perhaps the ReadStream should do something like this? Of course, the >> simple patch resets the stream very often, likely mcuh more often than >> anything else in the code. But wouldn't it be beneficial for streams >> reset because of a rescan? Possibly needs to be optional. > > Right, that's also discussed, with a similar patch, here: > > https://www.postgresql.org/message-id/CA%2BhUKG%2Bx2BcqWzBC77cN0ewhzMF0kYhC6c4G_T2gJLPbqYQ6Ow%40mail.gmail.com > > Resetting the distance was a short-sighted mistake: I was thinking > about rescans, the original use case for the reset operation, and > guessing that the data would remain cached. But all the new users of > _reset() have a completely different motivation, namely temporary > exhaustion in their source data, so that guess was simply wrong.

Thanks for the link. It seems I came up with almost the same patch, with three minor differences:

1) There's another place that sets "distance = 0" in read_stream_next_buffer, so maybe this should preserve the distance too?

2) I suspect we need to preserve the distance at the beginning of read_stream_reset, like

    stream->reset_distance = Max(stream->reset_distance, stream->distance);

because what if you call _reset before reaching the end of the stream?

3) Shouldn't it reset the reset_distance to 0 after restoring it?

> There was also some discussion at the time about whether "reset so I > can rescan", and "reset so I can continue after a temporary stop" > should be different operations requiring different APIs. It now seems > like one operation is sufficient, but it should preserve the distance > as you showed and then let the algorithm learn about already-cached > data in the rescan case (if it is even true then, which is also > debatable since it depends on the size of the scan). So, I think we > should just go ahead and commit a patch like that.

Not sure. To me it seems more like two distinct cases, but I'm not sure if it requires two distinct "operations" with distinct API. Perhaps a simple flag for the _reset() would be enough? It'd need to track the distance anyway, just in case. Consider for example a nested loop, which does a rescan every time the outer row changes. Is there a reason to believe the outer rows will need the same number of inner rows? Aren't those "distinct streams"? Maybe I'm thinking about this wrong, of course.

What concerns me, however, is that what I observed was not the distance getting reset to 1, and then ramping up. Which should happen pretty quickly, thanks to the doubling. In my experiments it *never* ramped up again, it stayed at 1. I still don't quite understand why. If this is happening for the nestloop case too, that'd be quite bad.

regards -- Tomas Vondra
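For readers without the attachments, a toy model of the save-and-restore dance being discussed might look like the sketch below. The real ReadStream struct is private to read_stream.c and reset_distance is a hypothetical field, so this only illustrates points 2) and 3) above; point 1), the other "distance = 0" site in read_stream_next_buffer, would need the same treatment.

    typedef struct ToyStream
    {
        int         distance;       /* current look-ahead distance, 0 at end-of-stream */
        int         reset_distance; /* hypothetical: best distance seen before a reset */
    } ToyStream;

    static void
    toy_stream_reset(ToyStream *stream)
    {
        /* remember the distance even if we reset before reaching end-of-stream */
        if (stream->distance > stream->reset_distance)
            stream->reset_distance = stream->distance;

        /* (the real function would also drain and unpin pending buffers here) */

        /* restart from the remembered distance rather than from 1 ... */
        if (stream->reset_distance > 0)
        {
            stream->distance = stream->reset_distance;
            /* ... and forget it, so the next reset works from fresh data */
            stream->reset_distance = 0;
        }
        else
            stream->distance = 1;
    }

The end_of_stream idea in the next message would remove the need for this bookkeeping entirely, since reaching the end of the stream would no longer clobber the distance to 0.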
On Sat, Jul 19, 2025 at 11:23 PM Tomas Vondra <tomas@vondra.me> wrote: > Thanks for the link. It seems I came up with an almost the same patch, > with three minor differences: > > 1) There's another place that sets "distance = 0" in > read_stream_next_buffer, so maybe this should preserve the distance too? > > 2) I suspect we need to preserve the distance at the beginning of > read_stream_reset, like > > stream->reset_distance = Max(stream->reset_distance, > stream->distance); > > because what if you call _reset before reaching the end of the stream? > > 3) Shouldn't it reset the reset_distance to 0 after restoring it? Probably. Hmm... an earlier version of this code didn't use distance == 0 to indicate end-of-stream, but instead had a separate internal end_of_stream flag. If we brought that back and didn't clobber distance, we wouldn't need this save-and-restore dance. It seemed shorter and sweeter without it back then, before _reset() existed in its present form, but I wonder if end_of_stream would be nicer than having to add this kind of stuff, without measurable downsides. > > There was also some discussion at the time about whether "reset so I > > can rescan", and "reset so I can continue after a temporary stop" > > should be different operations requiring different APIs. It now seems > > like one operation is sufficient, but it should preserve the distance > > as you showed and then let the algorithm learn about already-cached > > data in the rescan case (if it is even true then, which is also > > debatable since it depends on the size of the scan). So, I think we > > should just go ahead and commit a patch like that. > > Not sure. To me it seems more like two distinct cases, but I'm not sure > if it requires two distinct "operations" with distinct API. Perhaps a > simple flag for the _reset() would be enough? It'd need to track the > distance anyway, just in case. > > Consider for example a nested loop, which does a rescan every time the > outer row changes. Is there a reason to believe the outer rows will need > the same number of inner rows? Aren't those "distinct streams"? Maybe > I'm thinking about this wrong, of course. Good question. Yeah, your flag idea seems like a good way to avoid baking opinion into this level. I wonder if it should be a bitmask rather than a boolean, in case we think of more things that need to be included or not when resetting. > The thing that however concerns me is that what I observed was not the > distance getting reset to 1, and then ramping up. Which should happen > pretty quickly, thanks to the doubling. In my experiments it *never* > ramped up again, it stayed at 1. I still don't quite understand why. Huh. Will look into that on Monday.
On Sun, Jul 20, 2025 at 1:07 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, Jul 19, 2025 at 11:23 PM Tomas Vondra <tomas@vondra.me> wrote: > > The thing that however concerns me is that what I observed was not the > > distance getting reset to 1, and then ramping up. Which should happen > > pretty quickly, thanks to the doubling. In my experiments it *never* > > ramped up again, it stayed at 1. I still don't quite understand why. > > Huh. Will look into that on Monday. I suspect that it might be working as designed, but suffering from a bit of a weakness in the distance control algorithm, which I described in another thread[1]. In short, the simple minded algorithm that doubles on miss and subtracts one on hit can get stuck alternating between 1 and 2 if you hit certain patterns. Bilal pinged me off-list to say that he'd repro'd something like your test case and that's what seemed to be happening, anyway? I will dig out my experimental patches that tried different adjustments to escape from that state.... [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLPakwZiFUa5fQXpYDpCXvZXQ%3DP3cWOGACCoobh7U2r3A%40mail.gmail.com
Hi, On Mon, 21 Jul 2025 at 03:59, Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sun, Jul 20, 2025 at 1:07 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sat, Jul 19, 2025 at 11:23 PM Tomas Vondra <tomas@vondra.me> wrote: > > > The thing that however concerns me is that what I observed was not the > > > distance getting reset to 1, and then ramping up. Which should happen > > > pretty quickly, thanks to the doubling. In my experiments it *never* > > > ramped up again, it stayed at 1. I still don't quite understand why. > > > > Huh. Will look into that on Monday. > > I suspect that it might be working as designed, but suffering from a > bit of a weakness in the distance control algorithm, which I described > in another thread[1]. In short, the simple minded algorithm that > doubles on miss and subtracts one on hit can get stuck alternating > between 1 and 2 if you hit certain patterns. Bilal pinged me off-list > to say that he'd repro'd something like your test case and that's what > seemed to be happening, anyway? I will dig out my experimental > patches that tried different adjustments to escape from that state.... I used Tomas Vondra's test [1]. I tracked how many times the StartReadBuffersImpl() function returns true (IO is needed) and false (IO is not needed, cache hit). It returns true ~6% of the time on both the simple and complex patches (~116000 times true, ~1900000 times false on both patches). The complex patch ramps up to a distance of ~250 at the start of the stream, and 6% is enough to stay at that distance. Actually, it is enough to ramp up more, but it seems the max distance is about ~270 so it stays there. On the other hand, the simple patch doesn't ramp up at the start of the stream, and 6% is not enough to ramp up. It is always: the distance is 1 and IO is needed, so the distance is multiplied by 2 -> distance = 2, but then the next block is cached, so the distance is decreased by 1 and it is 1 again. [1] https://www.postgresql.org/message-id/aa46af80-5219-47e6-a7d0-7628106965a6%40vondra.me -- Regards, Nazir Bilal Yavuz Microsoft
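The stuck-at-1 behaviour is easy to reproduce outside PostgreSQL. The standalone toy program below applies the double-on-miss / subtract-one-on-hit rule to a deterministic pattern with one cache miss every 16 blocks (roughly the ~6% miss rate measured above), capped at 270; the starting distances and the cap are taken from the numbers in this thread, everything else is made up for illustration and is not read_stream.c code.

    #include <stdio.h>

    /* apply the current distance heuristic to a fixed hit/miss pattern */
    static int
    simulate(int distance, int max_distance, long nblocks)
    {
        for (long i = 0; i < nblocks; i++)
        {
            if (i % 16 == 0)
            {
                /* miss: double the distance, clamped to the maximum */
                distance *= 2;
                if (distance > max_distance)
                    distance = max_distance;
            }
            else if (distance > 1)
                distance--;     /* hit: back off by one */
        }
        return distance;
    }

    int
    main(void)
    {
        /* fresh stream after a reset: never escapes from 1-2 */
        printf("start=1:   final distance %d\n", simulate(1, 270, 1000000));

        /* stream that had already ramped up: stays near the maximum */
        printf("start=270: final distance %d\n", simulate(270, 270, 1000000));

        return 0;
    }

The first run ends with a distance of 1 and the second stays in the 255-270 range, which is in the same ballpark as the two CDFs Tomas posted: the heuristic can hold on to a large distance it already has, but a stream restarted at distance 1 never climbs back up under this kind of pattern.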
On Sun, Jul 20, 2025 at 1:07 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, Jul 19, 2025 at 11:23 PM Tomas Vondra <tomas@vondra.me> wrote: > > Thanks for the link. It seems I came up with an almost the same patch, > > with three minor differences: > > > > 1) There's another place that sets "distance = 0" in > > read_stream_next_buffer, so maybe this should preserve the distance too? > > > > 2) I suspect we need to preserve the distance at the beginning of > > read_stream_reset, like > > > > stream->reset_distance = Max(stream->reset_distance, > > stream->distance); > > > > because what if you call _reset before reaching the end of the stream? > > > > 3) Shouldn't it reset the reset_distance to 0 after restoring it? > > Probably. Hmm... an earlier version of this code didn't use distance > == 0 to indicate end-of-stream, but instead had a separate internal > end_of_stream flag. If we brought that back and didn't clobber > distance, we wouldn't need this save-and-restore dance. It seemed > shorter and sweeter without it back then, before _reset() existed in > its present form, but I wonder if end_of_stream would be nicer than > having to add this kind of stuff, without measurable downsides. ... > Good question. Yeah, your flag idea seems like a good way to avoid > baking opinion into this level. I wonder if it should be a bitmask > rather than a boolean, in case we think of more things that need to be > included or not when resetting. Here's a sketch of the above two ideas for discussion (.txt to stay off cfbot's radar for this thread). Better than save/restore? Here also are some alternative experimental patches for preserving accumulated look-ahead distance better in cases like that. Needs more exploration... thoughts/ideas welcome...
On 7/21/25 08:53, Nazir Bilal Yavuz wrote: > Hi, > > On Mon, 21 Jul 2025 at 03:59, Thomas Munro <thomas.munro@gmail.com> wrote: >> >> On Sun, Jul 20, 2025 at 1:07 AM Thomas Munro <thomas.munro@gmail.com> wrote: >>> On Sat, Jul 19, 2025 at 11:23 PM Tomas Vondra <tomas@vondra.me> wrote: >>>> The thing that however concerns me is that what I observed was not the >>>> distance getting reset to 1, and then ramping up. Which should happen >>>> pretty quickly, thanks to the doubling. In my experiments it *never* >>>> ramped up again, it stayed at 1. I still don't quite understand why. >>> >>> Huh. Will look into that on Monday. >> >> I suspect that it might be working as designed, but suffering from a >> bit of a weakness in the distance control algorithm, which I described >> in another thread[1]. In short, the simple minded algorithm that >> doubles on miss and subtracts one on hit can get stuck alternating >> between 1 and 2 if you hit certain patterns. Bilal pinged me off-list >> to say that he'd repro'd something like your test case and that's what >> seemed to be happening, anyway? I will dig out my experimental >> patches that tried different adjustments to escape from that state.... > > I used Tomas Vondra's test [1]. I tracked how many times > StartReadBuffersImpl() functions return true (IO is needed) and false > (IO is not needed, cache hit). It returns true ~%6 times on both > simple and complex patches (~116000 times true, ~1900000 times false > on both patches). > > A complex patch ramps up to ~250 distance at the start of the stream > and %6 is enough to stay at distance. Actually, it is enough to ramp > up more but it seems the max distance is about ~270 so it stays there. > On the other hand, a simple patch doesn't ramp up at the start of the > stream and %6 is not enough to ramp up. It is always like distance is > 1 and IO needed, so multiplying the distance by 2 -> distance = 2 but > then the next block is cached, so decreasing the distance by 1 and > distance is 1 again. > > [1] https://www.postgresql.org/message-id/aa46af80-5219-47e6-a7d0-7628106965a6%40vondra.me > Yes, this is the behavior I observed too. I was wondering if the 5% miss ratio hit some special "threshold" in the distance heuristics, and maybe it'd work fine with a couple more misses. But I don't think so, I think pretty much any workload with up to 50% misses may hit this problem. We reset the distance to 1, and then with 50% misses we'll do about 1 hit + 1 miss, which doubles the distance to 2 and then reduces the distance to 1, infinitely. Of course, that's only for an even distribution of hits/misses (and the synthetic workloads are fairly even). Real workloads are likely to have multiple misses in a row, which indeed ramps up the distance quickly. So maybe it's not that bad. Could we track a longer history of hits/misses, and consider that when adjusting the distance? Not just the most recent hit/miss? FWIW I re-ran the index-prefetch-test benchmarks with restoring the distance for the "simple" patch. The results are in the same github repository, in a separate branch: https://github.com/tvondra/indexscan-prefetch-tests/tree/with-distance-restore-after-reset I'm attaching two PDFs with results for eic=16 (linear and log-scale, to compare timings for quick queries). This shows that with restoring distance after reset, the simple patch is pretty much the same as the complex patch. The only data set where that's not the case is the "linear" data set, when everything is perfectly sequential.
In this case the simple patch performs like "master" (i.e. no prefetching). I'm not sure why that is. Anyway, it seems to confirm that most of the differences between the two patches are due to the "distance collapse". The impact of the resets in the first benchmarks surprised me quite a bit, but if we don't ramp up the distance that makes perfect sense. The issue probably affects other queries that do a lot of resets. Index scan prefetching just makes it very obvious. regards -- Tomas Vondra
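One possible shape for the "longer history" idea mentioned above, purely as a sketch (none of this is in either patch, and the 32-lookup window and the all-hits condition are arbitrary choices): keep a small bitmap of recent results and only let the distance decay once the recent window contains no misses at all.

    #include <stdbool.h>

    typedef struct DistanceState
    {
        int         distance;
        int         max_distance;
        unsigned int history;   /* one bit per recent lookup, 1 = miss */
    } DistanceState;

    static void
    distance_update(DistanceState *st, bool miss)
    {
        /* shift the newest result into a 32-lookup history window */
        st->history = (st->history << 1) | (miss ? 1u : 0u);

        if (miss)
        {
            st->distance *= 2;
            if (st->distance > st->max_distance)
                st->distance = st->max_distance;
        }
        else if (st->history == 0 && st->distance > 1)
        {
            /* back off only once the last 32 lookups were all cache hits */
            st->distance--;
        }
    }

For the cyclic data set (roughly one miss every 16 blocks) this would keep the distance pinned near the maximum, similar to what the "complex" patch achieves there, though as written it is probably too sticky for scans that genuinely become fully cached.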
On 7/21/25 14:39, Thomas Munro wrote: > ... > > Here's a sketch of the above two ideas for discussion (.txt to stay > off cfbot's radar for this thread). Better than save/restore? > > Here also are some alternative experimental patches for preserving > accumulated look-ahead distance better in cases like that. Needs more > exploration... thoughts/ideas welcome... Thanks! I'll rerun the tests with these patches once the current round of tests (with the simple distance restore after a reset) completes. -- Tomas Vondra
On Tue, Jul 22, 2025 at 9:06 AM Tomas Vondra <tomas@vondra.me> wrote: > Real workloads are likely to have multiple misses in a row, which indeed > ramps up the distance quickly. So maybe it's not that bad. Could we > track a longer history of hits/misses, and consider that when adjusting > the distance? Not just the most recent hit/miss? +1 > FWIW I re-ran the index-prefetch-test benchmarks with restoring the > distance for the "simple" patch. The results are in the same github > repository, in a separate branch: > > https://github.com/tvondra/indexscan-prefetch-tests/tree/with-distance-restore-after-reset These results make way more sense. There was absolutely no reason why the "simple" patch should have done so much worse than the "complex" one for most of the tests you've been running. Obviously, whatever advantage that the "complex" patch has is bound to be limited to cases where index characteristics are naturally the limiting factor. For example, with the pgbench_accounts_pkey table there are only ever 6 distinct heap blocks on each leaf page. I bet that your "linear" test more or less looks like that, too. I attach pgbench_accounts_pkey_nhblks.txt, which shows a query that (among other things) outputs "nhblks" for each leaf page from a given index (while showing the details of each leaf page in index key space order). It also shows results for pgbench_accounts_pkey with pgbench scale 1. This is how I determined that every pgbench_accounts_pkey leaf page points to exactly 6 distinct heap blocks -- "nhblks" is always 6. Note that this is what I see regardless of the pgbench scale, indicating that things always perfectly line up (even more than I would expect for very synthetic data such as this). This query is unwieldy when run against larger indexes, but that shouldn't be necessary. As with pgbench_accounts_pkey, it's typical for synthetically generated data to have a very consistent "nhblks", regardless of the total amount of data. With your "uniform" test cases, I'd expect this query to show "nhtids == nhblks" (or very close to it), which of course makes our ability to eagerly read further leaf pages almost irrelevant. If there are hundreds of distinct heap blocks on each leaf page, but effective_io_concurrency is 16 (or even 64), there's little we can do about it. > I'm attaching two PDFs with results for eic=16 (linear and log-scale, to > compare timings for quick queries). This shows that with restoring > distance after reset, the simple patch is pretty much the same as the > complex patch. > > The only data set where that's not the case is the "linear" data set, > when everything is perfectly sequential. In this case the simple patch > performs like "master" (i.e. no prefetching). I'm not sure why is that. Did you restore the distance for the "complex" patch, too? I think that it might well matter there too. Isn't the obvious explanation that the complex patch benefits from being able to prefetch without being limited by index characteristics/leaf page boundaries, while the simple patch doesn't? > Anyway, it seems to confirm most of the differences between the two > patches is due to the "distance collapse". The impact of the resets in > the first benchmarks surprised me quite a bit, but if we don't ramp up > the distance that makes perfect sense. > > The issue probably affects other queries that do a lot of resets. Index > scan prefetching just makes it very obvious. What is the difference between cases like "linear / eic=16 / sync" and "linear_1 / eic=16 / sync"?
One would imagine that these tests are very similar, based on the fact that they have very similar names. But we see very different results for each: with the former ("linear") test results, the "complex" patch is 2x-4x faster than the "simple" patch. But, with the latter test results ("linear_1", and other similar pairs of "linear_N" tests) the advantage for the "complex" patch *completely* evaporates. I find that very suspicious, and wonder if it might be due to a bug/inefficiency that could easily be fixed (possibly an issue on the read stream side, like the one you mentioned to Nazir just now). -- Peter Geoghegan
Attachments
On Tue, Jul 22, 2025 at 1:35 PM Peter Geoghegan <pg@bowt.ie> wrote: > I attach pgbench_accounts_pkey_nhblks.txt, which shows a query that > (among other things) ouputs "nhblks" for each leaf page from a given > index (while showing the details of each leaf page in index key space > order). I just realized that my terminal corrupted the SQL query (but not the results). Attached is a valid and complete version of the same query. -- Peter Geoghegan
Attachments
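For reference, here is a sketch of a query in the same spirit as the attachment (not the actual attached query; the index name and the tid-to-block extraction via the point trick are illustrative assumptions, and it assumes a freshly built index with no deleted pages):

-- per-leaf "nhtids" and "nhblks" via pageinspect; block 0 is the metapage
CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT s.blkno,
       count(i.htid) AS nhtids,
       count(DISTINCT ((i.htid::text::point)[0])::bigint) AS nhblks
FROM generate_series(1, pg_relation_size('pgbench_accounts_pkey') / 8192 - 1) AS s(blkno),
     bt_page_stats('pgbench_accounts_pkey', s.blkno::int) AS st,
     bt_page_items('pgbench_accounts_pkey', s.blkno::int) AS i
WHERE st.type = 'l'        -- leaf pages only
  AND i.htid IS NOT NULL   -- skip the high key / pivot tuples
GROUP BY s.blkno
ORDER BY s.blkno;

Per the message above, for pgbench_accounts_pkey a query of this kind reports nhblks = 6 on every leaf page, regardless of pgbench scale.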
On Tue, Jul 22, 2025 at 1:35 PM Peter Geoghegan <pg@bowt.ie> wrote: > What is the difference between cases like "linear / eic=16 / sync" and > "linear_1 / eic=16 / sync"? I figured this out for myself. > One would imagine that these tests are very similar, based on the fact > that they have very similar names. But we see very different results > for each: with the former ("linear") test results, the "complex" patch > is 2x-4x faster than the "simple" patch. But, with the latter test > results ("linear_1", and other similar pairs of "linear_N" tests) the > advantage for the "complex" patch *completely* evaporates. I find that > very suspicious Turns out that the "linear" test's table is actually very different to the "linear_1" test's table (same applies to all of the other "linear_N" test tables). The query that I posted earlier clearly shows this when run against the test data [1]. The "linear" test's linear_a_idx index consists of leaf pages that each point to exactly 21 heap blocks. That is a lot more than the pgbench_accounts_pkey's 6 blocks. But it's still low enough to see a huge advantage on Tomas' test -- an index scan like that can be 2x - 4x faster with the "complex" patch, relative to the "simple" patch. I would expect an even larger advantage with a similar range query that ran against pgbench_accounts. OTOH, the "linear_1" tests's linear_1_a_idx index shows leaf pages that each have about 300 distinct heap blocks. Since the total number of heap TIDs is always 366, it's absolutely not surprising that we can derive little value from the "complex" patch's ability to eagerly read more than one leaf page at a time -- a scan like that simply isn't going to benefit from eagerly reading pages (or it'll only see a very small benefit). In summary, the only test that has any significant ability to differentiate the "complex" patch from the "simple" patch is the "linear" test, which is 2x - 4x faster. Everything else seems to be about equal, which is what I'd expect, given the particulars of the tests. This even includes the confusingly named "linear_1" and other "linear_N" tests. [1] https://github.com/tvondra/iomethod-tests/blob/master/create2.sql -- Peter Geoghegan
On 7/22/25 19:35, Peter Geoghegan wrote: > On Tue, Jul 22, 2025 at 9:06 AM Tomas Vondra <tomas@vondra.me> wrote: >> Real workloads are likely to have multiple misses in a row, which indeed >> ramps up the distance quickly. So maybe it's not that bad. Could we >> track a longer history of hits/misses, and consider that when adjusting >> the distance? Not just the most recent hit/miss? > > +1 > >> FWIW I re-ran the index-prefetch-test benchmarks with restoring the >> distance for the "simple" patch. The results are in the same github >> repository, in a separate branch: >> >> https://github.com/tvondra/indexscan-prefetch-tests/tree/with-distance-restore-after-reset > > These results make way more sense. There was absolutely no reason why > the "simple" patch should have done so much worse than the "complex" > one for most of the tests you've been running. > > Obviously, whatever advantage that the "complex" patch has is bound to > be limited to cases where index characteristics are naturally the > limiting factor. For example, with the pgbench_accounts_pkey table > there are only ever 6 distinct heap blocks on each leaf page. I bet > that your "linear" test more or less looks like that, too. > Yes. It's definitely true we could construct examples where the complex patch beats the simple one for this reason. And I believe some of those examples could be quite realistic, even if not very common (like when very few index tuples fit on a leaf page). However, I'm not sure the pgbench example with only 6 heap blocks per leaf is very significant. Sure, the simple patch can't prefetch TIDs from the following leaf, but AFAICS the complex patch won't do that either. Not because it couldn't, but because with that many hits the distance will drop to ~1 (or close to it). (It'll probably prefetch a couple TIDs from the next leaf at the very end of the page, but I don't think that matters overall.) I'm not sure what prefetch distances will be sensible in queries that do other stuff. The queries in the benchmark do just the index scan, but if the query does something with the tuple (in the nodes on top), that shortens the required prefetch distance. Of course, simple queries will benefit from prefetching far ahead. > I attach pgbench_accounts_pkey_nhblks.txt, which shows a query that > (among other things) ouputs "nhblks" for each leaf page from a given > index (while showing the details of each leaf page in index key space > order). It also shows results for pgbench_accounts_pkey with pgbench > scale 1. This is how I determined that every pgbench_accounts_pkey > leaf page points to exactly 6 distinct heap blocks -- "nhblks" is > always 6. Note that this is what I see regardless of the pgbench > scale, indicating that things always perfectly line up (even more than > I would expect for very synthetic data such as this). > Thanks. I wonder how difficult would it be to add something like this to pgstattuple. I mean, it shouldn't be difficult to look at leaf pages and count distinct blocks, right? Seems quite useful. Explain would also greatly benefit from tracking something like this. The buffer "hits" and "reads" can be very difficult to interpret. > This query is unwieldy when run against larger indexes, but that > shouldn't be necessary. As with pgbench_accounts_pkey, it's typical > for synthetically generated data to have a very consistent "nhblks", > regardless of the total amount of data. 
> > With your "uniform" test cases, I'd expect this query to show "nhtids > == nhblks" (or very close to it), which of course makes our ability to > eagerly read further leaf pages almost irrelevant. If there are > hundreds of distinct heap blocks on each leaf page, but > effective_io_concurrency is 16 (or even 64), there's little we can do > about it. > Right. >> I'm attaching two PDFs with results for eic=16 (linear and log-scale, to >> compare timings for quick queries). This shows that with restoring >> distance after reset, the simple patch is pretty much the same as the >> complex patch. >> >> The only data set where that's not the case is the "linear" data set, >> when everything is perfectly sequential. In this case the simple patch >> performs like "master" (i.e. no prefetching). I'm not sure why is that. > > Did you restore the distance for the "complex" patch, too? I think > that it might well matter there too. > No, I did not. I did consider it, but it seemed to me it can't really make a difference (for these data sets), because each leaf has ~300 items, and the patch limits the prefetch to 64 leafs. That means it can prefetch ~20k TIDs ahead, and each heap page has ~20 items. So this should be good enough for eic=1000. It should never hit stream reset. It'd be useful to show some prefetch info in explain, I guess. It should not be difficult to track how many times was the stream reset, the average prefetch distance, and perhaps even a histogram of distances. The simple patch tracks the average distance, at least. > Isn't the obvious explanation that the complex patch benefits from > being able to prefetch without being limited by index > characteristics/leaf page boundaries, while the simple patch doesn't? > That's a valid interpretation, yes. Although the benefit comes mostly >> Anyway, it seems to confirm most of the differences between the two >> patches is due to the "distance collapse". The impact of the resets in >> the first benchmarks surprised me quite a bit, but if we don't ramp up >> the distance that makes perfect sense. >> >> The issue probably affects other queries that do a lot of resets. Index >> scan prefetching just makes it very obvious. > > What is the difference between cases like "linear / eic=16 / sync" and > "linear_1 / eic=16 / sync"? > > One would imagine that these tests are very similar, based on the fact > that they have very similar names. But we see very different results > for each: with the former ("linear") test results, the "complex" patch > is 2x-4x faster than the "simple" patch. But, with the latter test > results ("linear_1", and other similar pairs of "linear_N" tests) the > advantage for the "complex" patch *completely* evaporates. I find that > very suspicious, and wonder if it might be due to a bug/inefficiency > that could easily be fixed (possibly an issue on the read stream side, > like the one you mentioned to Nazir just now). > Yes, there's some similarity. Attached is the script I use to create the tables and load the data. The "linear" is a table with a simple sequence of values (0 to 100k). More or less - the value is a floating point, and there are 10M rows. But you get the idea. The "linear_X" variants mean the value has a noise of X% of the range. So with "linear_1" you get the "linear" value, and then random(0,1000), with normal distribution. The "cyclic" data sets are similar, except that the "sequence" also wraps around 100x. There's nothing "special" about the particular values. 
I simply wanted different "levels" of noise, and 1, 10 and 25 seemed good. I initially had a couple higher values, but that was pretty close to "uniform". regards -- Tomas Vondra
Attachments
Hi, On 2025-07-18 23:25:38 -0400, Peter Geoghegan wrote: > On Fri, Jul 18, 2025 at 10:47 PM Andres Freund <andres@anarazel.de> wrote: > > > (Within an index AM, there is a 1:1 correspondence between batches and leaf > > > pages, and batches need to hold on to a leaf page buffer pin for a > > > time. None of this should really matter to the table AM.) > > > > To some degree the table AM will need to care about the index level batching - > > we have to be careful about how many pages we keep pinned overall. Which is > > something that both the table and the index AM have some influence over. > > Can't they operate independently? I'm somewhat doubtful. Read stream is careful to limit how many things it pins, lest we get errors about having too many buffers pinned. Somehow the number of pins held within the index needs to be limited too, and how much that needs to be limited depends on how many buffers are pinned in the read stream :/ > > > At a high level, the table AM (and/or its read stream) asks for so > > > many heap blocks/TIDs. Occasionally, index AM implementation details > > > (i.e. the fact that many index leaf pages have to be read to get very > > > few TIDs) will result in that request not being honored. The interface > > > that the table AM uses must therefore occasionally answer "I'm sorry, > > > I can only reasonably give you so many TIDs at this time". When that > > > happens, the table AM has to make do. That can be very temporary, or > > > it can happen again and again, depending on implementation details > > > known only to the index AM side (though typically it'll never happen > > > even once). > > > > I think that requirement will make things more complicated. Why do we need to > > have it? > > What if it turns out that there is a large run of contiguous leaf > pages that contain no more than 2 or 3 matching index tuples? I think that's actually likely a case where you want *deeper* prefetching, as it makes it more likely that the table tuples are on different pages, i.e. you need a lot more in-flight IOs to avoid stalling on IO. > What if there's no matches across many leaf pages? We don't need to keep leaf nodes without matches pinned in that case, so I don't think there's really an issue? Greetings, Andres Freund
On Tue, Jul 22, 2025 at 4:50 PM Tomas Vondra <tomas@vondra.me> wrote: > > Obviously, whatever advantage that the "complex" patch has is bound to > > be limited to cases where index characteristics are naturally the > > limiting factor. For example, with the pgbench_accounts_pkey table > > there are only ever 6 distinct heap blocks on each leaf page. I bet > > that your "linear" test more or less looks like that, too. > > > > Yes. It's definitely true we could construct examples where the complex > patch beats the simple one for this reason. It's literally the only possible valid reason why the complex patch could win! The sole performance justification for the complex patch is that it can prevent the heap prefetching from getting bottlenecked on factors tied to physical index characteristics (when it's possible in principle to avoid getting bottlenecked in that way). Unsurprisingly, if you assume that that'll never happen, then yeah, the complex patch has no performance advantage over the simple one. I happen to think that that's a very unrealistic assumption. Most standard benchmarks have indexes that almost all look fairly similar to pgbench_accounts_pkey, from the point of view of "heap page blocks per leaf page". There are exceptions, of course (e.g., the TPC-C order table's primary key suffers from fragmentation). > And I believe some of those > examples could be quite realistic, even if not very common (like when > very few index tuples fit on a leaf page). I don't think cases like that matter very much at all. The only thing that *really* matters on the index AM side is the logical/physical correlation. Which your testing seems largely unconcerned with. > However, I'm not sure the pgbench example with only 6 heap blocks per > leaf is very significant. Sure, the simple patch can't prefetch TIDs > from the following leaf, but AFAICS the complex patch won't do that > either. Why not? > Not because it couldn't, but because with that many hits the > distance will drop to ~1 (or close to it). (It'll probably prefetch a > couple TIDs from the next leaf at the very end of the page, but I don't > think that matters overall.) Then why do your own test results continue to show such a big advantage for the complex patch, over the simple patch? > I'm not sure what prefetch distances will be sensible in queries that do > other stuff. The queries in the benchmark do just the index scan, but if > the query does something with the tuple (in the nodes on top), that > shortens the required prefetch distance. Of course, simple queries will > benefit from prefetching far ahead. Doing *no* prefetching will usually be the right thing to do. Does that make index prefetching pointless in general? > Thanks. I wonder how difficult would it be to add something like this to > pgstattuple. I mean, it shouldn't be difficult to look at leaf pages and > count distinct blocks, right? Seems quite useful. I agree that that would be quite useful. > > Did you restore the distance for the "complex" patch, too? I think > > that it might well matter there too. > > > > No, I did not. I did consider it, but it seemed to me it can't really > make a difference (for these data sets), because each leaf has ~300 > items, and the patch limits the prefetch to 64 leafs. That means it can > prefetch ~20k TIDs ahead, and each heap page has ~20 items. So this > should be good enough for eic=1000. It should never hit stream reset. 
It looks like the complex patch can reset the read stream for a couple of reasons, which I don't fully understand right now. I'm mostly thinking of this stuff: /* * If we advanced to the next batch, release the batch we no * longer need. The positions is the "read" position, and we can * compare it to firstBatch. */ if (pos->batch != scan->batchState->firstBatch) { batch = INDEX_SCAN_BATCH(scan, scan->batchState->firstBatch); Assert(batch != NULL); /* * XXX When advancing readPos, the streamPos may get behind as * we're only advancing it when actually requesting heap * blocks. But we may not do that often enough - e.g. IOS may * not need to access all-visible heap blocks, so the * read_next callback does not get invoked for a long time. * It's possible the stream gets so mucu behind the position * gets invalid, as we already removed the batch. But that * means we don't need any heap blocks until the current read * position - if we did, we would not be in this situation (or * it's a sign of a bug, as those two places are expected to * be in sync). So if the streamPos still points at the batch * we're about to free, just reset the position - we'll set it * to readPos in the read_next callback later. * * XXX This can happen after the queue gets full, we "pause" * the stream, and then reset it to continue. But I think that * just increases the probability of hitting the issue, it's * just more chance to to not advance the streamPos, which * depends on when we try to fetch the first heap block after * calling read_stream_reset(). */ if (scan->batchState->streamPos.batch == scan->batchState->firstBatch) index_batch_pos_reset(scan, &scan->batchState->streamPos); > > Isn't the obvious explanation that the complex patch benefits from > > being able to prefetch without being limited by index > > characteristics/leaf page boundaries, while the simple patch doesn't? > > > > That's a valid interpretation, yes. Although the benefit comes mostly The benefit comes mostly from....? > Yes, there's some similarity. Attached is the script I use to create the > tables and load the data. Another issue with the testing that biases it against the complex patch: heap fill factor is set to only 25 (but you use the default index fill-factor). > The "linear" is a table with a simple sequence of values (0 to 100k). > More or less - the value is a floating point, and there are 10M rows. > But you get the idea. > > The "linear_X" variants mean the value has a noise of X% of the range. > So with "linear_1" you get the "linear" value, and then random(0,1000), > with normal distribution. I don't get why this is helpful to test, except perhaps as a general smoke test. If I zoom into any given "linear_1" leaf page, I see TIDs that appear in an order that isn't technically uniformly random order, but is fairly close to it. At least in a practical sense. At least for the purposes of prefetching. 
For example: pg@regression:5432 [104789]=# select itemoffset, htid from bt_page_items('linear_1_a_idx', 4); ┌────────────┬───────────┐ │ itemoffset │ htid │ ├────────────┼───────────┤ │ 1 │ ∅ │ │ 2 │ (10,18) │ │ 3 │ (463,9) │ │ 4 │ (66,8) │ │ 5 │ (79,9) │ │ 6 │ (594,7) │ │ 7 │ (289,13) │ │ 8 │ (568,2) │ │ 9 │ (237,2) │ │ 10 │ (156,10) │ │ 11 │ (432,9) │ │ 12 │ (372,17) │ │ 13 │ (554,6) │ │ 14 │ (1698,11) │ │ 15 │ (389,6) │ *** SNIP *** │ 288 │ (1264,5) │ │ 289 │ (738,16) │ │ 290 │ (1143,3) │ │ 291 │ (400,1) │ │ 292 │ (1157,10) │ │ 293 │ (266,2) │ │ 294 │ (502,9) │ │ 295 │ (85,15) │ │ 296 │ (282,2) │ │ 297 │ (453,5) │ │ 298 │ (396,6) │ │ 299 │ (267,18) │ │ 300 │ (733,15) │ │ 301 │ (108,8) │ │ 302 │ (356,16) │ │ 303 │ (235,10) │ │ 304 │ (812,18) │ │ 305 │ (675,1) │ │ 306 │ (258,13) │ │ 307 │ (1187,9) │ │ 308 │ (185,2) │ │ 309 │ (179,2) │ │ 310 │ (951,2) │ └────────────┴───────────┘ (310 rows) There's actually 55,556 heap blocks in total in the underlying table. So clearly there is some correlation here. Just not enough to ever matter very much to prefetching. Again, the sole test case that has that quality to it is the "linear" test case. -- Peter Geoghegan
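To put a number on the same observation, a hedged companion to the page dump above, using the same pageinspect trick (the WHERE clause just skips the high key shown at itemoffset 1):

SELECT count(htid) AS nhtids,
       count(DISTINCT ((htid::text::point)[0])::bigint) AS nhblks
FROM bt_page_items('linear_1_a_idx', 4)
WHERE htid IS NOT NULL;

If nhblks comes out close to nhtids (roughly 300 distinct heap blocks among the ~310 items, as described earlier for linear_1), there is little left for cross-leaf prefetching to win.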
Hi, On 2025-07-22 22:50:00 +0200, Tomas Vondra wrote: > Yes. It's definitely true we could construct examples where the complex > patch beats the simple one for this reason. And I believe some of those > examples could be quite realistic, even if not very common (like when > very few index tuples fit on a leaf page). > > However, I'm not sure the pgbench example with only 6 heap blocks per > leaf is very significant. Sure, the simple patch can't prefetch TIDs > from the following leaf, but AFAICS the complex patch won't do that > either. Not because it couldn't, but because with that many hits the > distance will drop to ~1 (or close to it). (It'll probably prefetch a > couple TIDs from the next leaf at the very end of the page, but I don't > think that matters overall.) > > I'm not sure what prefetch distances will be sensible in queries that do > other stuff. The queries in the benchmark do just the index scan, but if > the query does something with the tuple (in the nodes on top), that > shortens the required prefetch distance. Of course, simple queries will > benefit from prefetching far ahead. That may be true with local fast NVMe disks, but won't be true for networked storage like in common clouds. Latencies of 0.3 - 4ms leave a lot of CPU cycles for actual processing of the data. The high latencies for such storage also mean that you need fairly deep queues and that missing prefetches can introduce substantial slowdowns. A hypothetical disk that can do 20k iops at 3ms latency needs an average IO depth of 60. If you have a bubble after every few dozen IOs, you're not going to reach that effective IO depth. And even for local NVMes, the IO-depth required to fully utilize the capacity for small random IO can be fairly high. I have a raid-10 of four SSDs that peaks at a depth around ~350. Also, plenty of indexes are on multiple columns and/or wider datatypes, making bubbles triggered due to "crossing-the-leaf-page" more common. > Thanks. I wonder how difficult would it be to add something like this to > pgstattuple. I mean, it shouldn't be difficult to look at leaf pages and > count distinct blocks, right? Seems quite useful. +1 > Explain would also greatly benefit from tracking something like this. > The buffer "hits" and "reads" can be very difficult to interpret. Indeed. I actually observed that sometimes the reason that the real iodepth (i.e. measured at the OS level) ends up lower than one would hope is that, while prefetching, we again need a heap buffer that is already being prefetched. Currently the behaviour in that case is to synchronously wait for IO on that buffer to complete. That obviously causes a "pipeline bubble"... Greetings, Andres Freund
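As a quick sanity check of the "20k IOPS at 3ms" figure, Little's law (in-flight I/Os = throughput x latency) can be evaluated in a throwaway query; the numbers are the ones from the message:

-- 20,000 IOPS at 3 ms latency => ~60 I/Os must be in flight on average
SELECT 20000 * 0.003 AS required_avg_io_depth;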
On Tue, Jul 22, 2025 at 6:53 PM Andres Freund <andres@anarazel.de> wrote: > That may be true with local fast NVMe disks, but won't be true for networked > storage like in common clouds. Latencies of 0.3 - 4ms leave a lot of CPU > cycles for actual processing of the data. I don't understand why it wouldn't be a problem for NVMe disks, too. Take a range scan on pgbench_accounts_pkey, for example -- something like your ORDER BY ... LIMIT N test case, but with pgbench data instead of TPC-H data. There are 6 heap blocks per leaf page. As I understand it, the simple patch will only be able to see up to 6 heap blocks "into the future", at any given time. Why isn't that quite a significant drawback, regardless of the underlying storage? > Also, plenty indexes are on multiple columns and/or wider datatypes, making > bubbles triggered due to "crossing-the-leaf-page" more common. I actually don't think that that's a significant factor. Even with fairly wide tuples, we'll still tend to be able to fit about 200 on each leaf page. For a variety of reasons that doesn't compare too badly to simple indexes (like pgbench_accounts_pkey), which will store about 370 when the index is in a pristine state. It does matter, but in the grand scheme of things it's unlikely to be decisive. -- Peter Geoghegan
Hi, On 2025-07-22 19:13:23 -0400, Peter Geoghegan wrote: > On Tue, Jul 22, 2025 at 6:53 PM Andres Freund <andres@anarazel.de> wrote: > > That may be true with local fast NVMe disks, but won't be true for networked > > storage like in common clouds. Latencies of 0.3 - 4ms leave a lot of CPU > > cycles for actual processing of the data. > > I don't understand why it wouldn't be a problem for NVMe disks, too. > Take a range scan on pgbench_accounts_pkey, for example -- something > like your ORDER BY ... LIMIT N test case, but with pgbench data > instead of TPC-H data. There are 6 heap blocks per leaf page. As I > understand it, the simple patch will only be able to see up to 6 heap > blocks "into the future", at any given time. Why isn't that quite a > significant drawback, regardless of the underlying storage? My response was specific to Tomas' comment that for many queries, which tend to be more complicated than the toys we are using here, there will be CPU costs in the query. E.g. on my local NVMe SSD I get about 55k IOPS with an iodepth of 6 (that's without stalls between leaf pages, so not really correct, but it's too much math for me to compute). If you have 6 heap blocks referenced per index block, with 60 tuples on those heap pages and you can get 55k iops with that, you can fetch 20 million tuples / second. If per-tuple CPU processing takes longer than 10**9/20_000_000 = 50 nanoseconds, you'll not be bottlenecked on storage. E.g. for this silly query: SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000); while also using io_combine_limit=1 (to actually see the achieved IO depth), I see an achieved IO depth of ~6.3 (complex). Whereas this even sillier query: SELECT max(abalance), min(abalance), sum(abalance::numeric), avg(abalance::numeric), avg(aid::numeric), avg(bid::numeric) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000); only achieves an IO depth of ~4.1 (complex).

                    cheaper query    expensive query
simple readahead    8723.209 ms      10615.232 ms
complex readahead   5069.438 ms      8018.347 ms

Obviously the CPU overhead in this example didn't completely eliminate the IO bottleneck, but sure reduced the difference. If your assumption is that real queries are more CPU intensive than the toy stuff above, e.g. due to joins etc, you can see why the really attained IO depth is lower. Btw, something with the batching is off with the complex patch. I was wondering why I was not seeing 100% CPU usage while also not seeing very deep queues - and I get deeper queues and better times with a lowered INDEX_SCAN_MAX_BATCHES and worse with a higher one. Greetings, Andres Freund
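A rough sketch for reproducing the measurement above (io_combine_limit = 1 and the query itself are from the message; the effective_io_concurrency value is an assumption):

-- keep read_stream from merging neighbouring blocks, so the I/O depth seen by
-- the OS (e.g. in iostat) reflects the prefetch distance
SET io_combine_limit = 1;
SET effective_io_concurrency = 64;   -- assumed, not stated in the message

EXPLAIN (ANALYZE, BUFFERS)
SELECT max(abalance)
FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000) s;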
On Tue, Jul 22, 2025 at 5:11 PM Andres Freund <andres@anarazel.de> wrote: > On 2025-07-18 23:25:38 -0400, Peter Geoghegan wrote: > > > To some degree the table AM will need to care about the index level batching - > > > we have to be careful about how many pages we keep pinned overall. Which is > > > something that both the table and the index AM have some influence over. > > > > Can't they operate independently? > > I'm somewhat doubtful. Read stream is careful to limit how many things it > pins, lest we get errors about having too many buffers pinned. Somehow the > number of pins held within the index needs to be limited too, and how much > that needs to be limited depends on how many buffers are pinned in the read > stream :/ That makes sense. Currently, the complex patch holds on to leaf page buffer pins until btfreebatch is called for the relevant batch -- no matter what. This is actually a short term workaround. I removed _bt_drop_lock_and_maybe_pin from nbtree (the thing added by commit 2ed5b87f), without adding back an equivalent function that can work across all index AMs. That shouldn't be hard. Once I do that, then plain index scans with MVCC snapshots should never actually have to hold on to buffer pins. I'm not sure if that makes the underlying resource management problem any easier to address -- but at least we won't *actually* hold on to any extra leaf page buffer pins most of the time (once I make this fix). > > What if there's no matches across many leaf pages? > > We don't need to keep leaf nodes without matches pinned in that case, so I > don't think there's really an issue? That might be true, but if we're reading leaf pages then we're not returning tuples to the scan -- even when, in principle, we could return at least a few more right away. That's the kind of trade-off I'm concerned about here. -- Peter Geoghegan
On 7/22/25 23:35, Peter Geoghegan wrote: > On Tue, Jul 22, 2025 at 4:50 PM Tomas Vondra <tomas@vondra.me> wrote: >>> Obviously, whatever advantage that the "complex" patch has is bound to >>> be limited to cases where index characteristics are naturally the >>> limiting factor. For example, with the pgbench_accounts_pkey table >>> there are only ever 6 distinct heap blocks on each leaf page. I bet >>> that your "linear" test more or less looks like that, too. >>> >> >> Yes. It's definitely true we could construct examples where the complex >> patch beats the simple one for this reason. > > It's literally the only possible valid reason why the complex patch could win! > > The sole performance justification for the complex patch is that it > can prevent the heap prefetching from getting bottlenecked on factors > tied to physical index characteristics (when it's possible in > principle to avoid getting bottlenecked in that way). Unsurprisingly, > if you assume that that'll never happen, then yeah, the complex patch > has no performance advantage over the simple one. > > I happen to think that that's a very unrealistic assumption. Most > standard benchmarks have indexes that almost all look fairly similar > to pgbench_accounts_pkey, from the point of view of "heap page blocks > per leaf page". There are exceptions, of course (e.g., the TPC-C order > table's primary key suffers from fragmentation). > I agree with all of this. >> And I believe some of those >> examples could be quite realistic, even if not very common (like when >> very few index tuples fit on a leaf page). > > I don't think cases like that matter very much at all. The only thing > that *really* matters on the index AM side is the logical/physical > correlation. Which your testing seems largely unconcerned with. > >> However, I'm not sure the pgbench example with only 6 heap blocks per >> leaf is very significant. Sure, the simple patch can't prefetch TIDs >> from the following leaf, but AFAICS the complex patch won't do that >> either. > > Why not? > >> Not because it couldn't, but because with that many hits the >> distance will drop to ~1 (or close to it). (It'll probably prefetch a >> couple TIDs from the next leaf at the very end of the page, but I don't >> think that matters overall.) > > Then why do your own test results continue to show such a big > advantage for the complex patch, over the simple patch? > I assume you mean results for the "linear" data set, because for every other data set the patches perform almost exactly the same (when restoring the distance after stream reset): https://github.com/tvondra/indexscan-prefetch-tests/blob/with-distance-restore-after-reset/d16-rows-cold-32GB-16-unscaled.pdf And it's a very good point. I was puzzled by this too for a while, and it took me a while to understand how/why this happens. It pretty much boils down to the "duplicate block" detection and how it interacts with the stream resets (again!). Both patches detect duplicate blocks the same way - using a lastBlock field, checked in the next_block callback, and skip reading the same block multiple times. Which for the "linear" data set happens a lot, because the index is correlated and so every block repeats ~20x. This seems to trigger entirely different behaviors in the two patches. For the complex patch, this results in very high prefetch distance, about ~270. Which seems like less than one leaf page (which has ~360 items). 
But if I log the read/stream positions seen in index_batch_getnext_tid, I often see this: LOG: index_batch_getnext_tid match 0 read (9,271) stream (22,264) That is, the stream is ~13 batches ahead. AFAICS this happens because the read_next callback (which "produces" block numbers to the stream) skips the duplicate blocks, so that the stream never even knows about them. So the stream thinks the distance is 270, but it's really 20x that (when measured in index items). I realize this is another way to trigger the stream resets with the complex patch, even though that didn't happen here (the limit is 64 leafs, we used 13). So you're right the complex patch prefetches far ahead. I thought the distance will quickly decrease because of the duplicate blocks, but I missed the fact the read stream will not see them at all. I'm not sure it's desirable to "hide" blocks from the read stream like this - it'll never see the misses. How could it make good decisions, when we skew the data used by the heuristics like this? For the simple patch, the effect seems exactly the opposite. It detects duplicate blocks the same way, but there's a caveat - resetting the stream invalidates the lastBlock field, so it can't detect duplicate blocks from the previous leaf. And so the distance drops. But this should not matter I think (it's just a single miss for the first item), so the rest really has to be about the single-leaf limit. (This is my working theory, I still need to investigate it a bit more.)
IOS may > * not need to access all-visible heap blocks, so the > * read_next callback does not get invoked for a long time. > * It's possible the stream gets so mucu behind the position > * gets invalid, as we already removed the batch. But that > * means we don't need any heap blocks until the current read > * position - if we did, we would not be in this situation (or > * it's a sign of a bug, as those two places are expected to > * be in sync). So if the streamPos still points at the batch > * we're about to free, just reset the position - we'll set it > * to readPos in the read_next callback later. > * > * XXX This can happen after the queue gets full, we "pause" > * the stream, and then reset it to continue. But I think that > * just increases the probability of hitting the issue, it's > * just more chance to to not advance the streamPos, which > * depends on when we try to fetch the first heap block after > * calling read_stream_reset(). > */ > if (scan->batchState->streamPos.batch == > scan->batchState->firstBatch) > index_batch_pos_reset(scan, &scan->batchState->streamPos); > This is not resetting the stream, though. This is resetting the position tracking how far the stream got. This happens because the stream moves forward only in response to reading buffers from it. So without calling read_stream_next_buffer() it won't call the read_next callback generating the blocks. And it's the callback that advances the streamPos field, so it may get stale. This happens e.g. for index only scans, when we read a couple blocks that are not all-visible (so that goes through the stream). And then we get a bunch of all-visible blocks, so we only return the TIDs and index tuples. The stream gets "behind" the readPos, and may even point at a batch that was already freed. >>> Isn't the obvious explanation that the complex patch benefits from >>> being able to prefetch without being limited by index >>> characteristics/leaf page boundaries, while the simple patch doesn't? >>> >> >> That's a valid interpretation, yes. Although the benefit comes mostly > > The benefit comes mostly from....? > Sorry, got distracted and forgot to complete the sentence. I think I wanted to write "mostly from not resetting the distance to 1". Which is true, but the earlier "linear" example also shows there are cases where the page boundaries are significant. >> Yes, there's some similarity. Attached is the script I use to create the >> tables and load the data. > > Another issue with the testing that biases it against the complex > patch: heap fill factor is set to only 25 (but you use the default > index fill-factor). > That's actually intentional. I wanted to model tables with wider tuples, without having to generate all the data etc. Maybe 25% is too much, and real table have more than 20 tuples. It's true 400B is fairly large. I'm not against testing with other parameters, of course. The test was not originally written for comparing different prefetching patches, so it may not be quite fair (and I'm not sure how to define "fair"). >> The "linear" is a table with a simple sequence of values (0 to 100k). >> More or less - the value is a floating point, and there are 10M rows. >> But you get the idea. >> >> The "linear_X" variants mean the value has a noise of X% of the range. >> So with "linear_1" you get the "linear" value, and then random(0,1000), >> with normal distribution. > > I don't get why this is helpful to test, except perhaps as a general smoke test. 
> > If I zoom into any given "linear_1" leaf page, I see TIDs that appear > in an order that isn't technically uniformly random order, but is > fairly close to it. At least in a practical sense. At least for the > purposes of prefetching. > It's not uniformly random, I wrote it uses normal distribution. The query in the SQL script does this: select x + random_normal(0, 1000) from ... It is a synthetic test data set, of course. It's meant to be simple to generate, reason about, and somewhere in between the "linear" and "uniform" data sets. But it also has realistic motivation - real tables are usually not as clean as "linear", nor as random as the "uniform" data sets (not for all columns, at least). If you're looking at data sets like "orders" or whatever, there's usually a bit of noise even for columns like "date" etc. People modify the orders, or fill-in data from a couple days ago, etc. Perfect correlation for one column implies slightly worse correlation for another column (order date vs. delivery date). > For example: > > pg@regression:5432 [104789]=# select > itemoffset, > htid > from > bt_page_items('linear_1_a_idx', 4); > ┌────────────┬───────────┐ > │ itemoffset │ htid │ > ├────────────┼───────────┤ > │ 1 │ ∅ │ > │ 2 │ (10,18) │ > │ 3 │ (463,9) │ > │ 4 │ (66,8) │ > │ 5 │ (79,9) │ > │ 6 │ (594,7) │ > │ 7 │ (289,13) │ > │ 8 │ (568,2) │ > │ 9 │ (237,2) │ > │ 10 │ (156,10) │ > │ 11 │ (432,9) │ > │ 12 │ (372,17) │ > │ 13 │ (554,6) │ > │ 14 │ (1698,11) │ > │ 15 │ (389,6) │ > *** SNIP *** > │ 288 │ (1264,5) │ > │ 289 │ (738,16) │ > │ 290 │ (1143,3) │ > │ 291 │ (400,1) │ > │ 292 │ (1157,10) │ > │ 293 │ (266,2) │ > │ 294 │ (502,9) │ > │ 295 │ (85,15) │ > │ 296 │ (282,2) │ > │ 297 │ (453,5) │ > │ 298 │ (396,6) │ > │ 299 │ (267,18) │ > │ 300 │ (733,15) │ > │ 301 │ (108,8) │ > │ 302 │ (356,16) │ > │ 303 │ (235,10) │ > │ 304 │ (812,18) │ > │ 305 │ (675,1) │ > │ 306 │ (258,13) │ > │ 307 │ (1187,9) │ > │ 308 │ (185,2) │ > │ 309 │ (179,2) │ > │ 310 │ (951,2) │ > └────────────┴───────────┘ > (310 rows) > > There's actually 55,556 heap blocks in total in the underlying table. > So clearly there is some correlation here. Just not enough to ever > matter very much to prefetching. Again, the sole test case that has > that quality to it is the "linear" test case. > Right. I don't see a problem with this. I'm not saying parameters for this particular data set are "perfect", but the intent is to have a range of data sets from "perfectly clean" to "random" and see how the patch(es) behave on all of them. If you have a suggestion for different data sets, or how to tweak the parameters to make it more realistic, I'm happy to try those. regards -- Tomas Vondra
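To make that concrete, here is a minimal sketch of how a "linear_1"-style table could be built (the real script is create2.sql in the repository linked earlier; the column names and the filler width are illustrative, while the 0-100k range, the 10M rows, fillfactor 25 and the random_normal(0, 1000) noise are from the thread; random_normal() needs PostgreSQL 16+):

CREATE TABLE linear_1 (a float8, filler text) WITH (fillfactor = 25);

INSERT INTO linear_1
SELECT i * 100000.0 / 10000000 + random_normal(0, 1000),  -- linear value plus 1% noise
       repeat('x', 350)                                   -- filler to widen the rows
FROM generate_series(1, 10000000) AS i;

CREATE INDEX linear_1_a_idx ON linear_1 (a);
ANALYZE linear_1;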
On Tue, Jul 22, 2025 at 8:08 PM Andres Freund <andres@anarazel.de> wrote: > My response was specific to Tomas' comment that for many queries, which tend > to be more complicated than the toys we are using here, there will be CPU > costs in the query. Got it. That makes sense. > cheaper query expensive query > simple readahead 8723.209 ms 10615.232 ms > complex readahead 5069.438 ms 8018.347 ms > > Obviously the CPU overhead in this example didn't completely eliminate the IO > bottleneck, but sure reduced the difference. That's a reasonable distinction, of course. > If your assumption is that real queries are more CPU intensive that the toy > stuff above, e.g. due to joins etc, you can see why the really attained IO > depth is lower. Right. Perhaps I was just repeating myself. Tomas seemed to be suggesting that cases where we'll actually get a decent and completely worthwhile improvement with the complex patch would be naturally rare, due in part to these effects with CPU overhead. I don't think that that's true at all. > Btw, something with the batching is off with the complex patch. I was > wondering why I was not seing 100% CPU usage while also not seeing very deep > queues - and I get deeper queues and better times with a lowered > INDEX_SCAN_MAX_BATCHES and worse with a higher one. I'm not at all surprised that there'd be bugs like that. I don't know about Tomas, but I've given almost no thought to INDEX_SCAN_MAX_BATCHES specifically just yet. -- Peter Geoghegan
On 7/23/25 02:39, Peter Geoghegan wrote: > On Tue, Jul 22, 2025 at 8:08 PM Andres Freund <andres@anarazel.de> wrote: >> My response was specific to Tomas' comment that for many queries, which tend >> to be more complicated than the toys we are using here, there will be CPU >> costs in the query. > > Got it. That makes sense. > >> cheaper query expensive query >> simple readahead 8723.209 ms 10615.232 ms >> complex readahead 5069.438 ms 8018.347 ms >> >> Obviously the CPU overhead in this example didn't completely eliminate the IO >> bottleneck, but sure reduced the difference. > > That's a reasonable distinction, of course. > >> If your assumption is that real queries are more CPU intensive that the toy >> stuff above, e.g. due to joins etc, you can see why the really attained IO >> depth is lower. > > Right. > > Perhaps I was just repeating myself. Tomas seemed to be suggesting > that cases where we'll actually get a decent and completely worthwhile > improvement with the complex patch would be naturally rare, due in > part to these effects with CPU overhead. I don't think that that's > true at all. > >> Btw, something with the batching is off with the complex patch. I was >> wondering why I was not seing 100% CPU usage while also not seeing very deep >> queues - and I get deeper queues and better times with a lowered >> INDEX_SCAN_MAX_BATCHES and worse with a higher one. > > I'm not at all surprised that there'd be bugs like that. I don't know > about Tomas, but I've given almost no thought to > INDEX_SCAN_MAX_BATCHES specifically just yet. > I think I mostly picked a value high enough to make it unlikely to hit it in realistic cases, while also not using too much memory, and 64 seemed like a good value. But I don't see why would this have any effect on the prefetch distance, queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve that. I'd have expected exactly the opposite behavior. Could be bug, of course. But it'd be helpful to see the dataset/query. regards -- Tomas Vondra
Hi, On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote: > But I don't see why would this have any effect on the prefetch distance, > queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve > that. I'd have expected exactly the opposite behavior. > > Could be bug, of course. But it'd be helpful to see the dataset/query. Pgbench scale 500, with the simpler query from my message. Greetings, Andres Freund
On 7/23/25 02:39, Peter Geoghegan wrote: > On Tue, Jul 22, 2025 at 8:08 PM Andres Freund <andres@anarazel.de> wrote: >> My response was specific to Tomas' comment that for many queries, which tend >> to be more complicated than the toys we are using here, there will be CPU >> costs in the query. > > Got it. That makes sense. > >> cheaper query expensive query >> simple readahead 8723.209 ms 10615.232 ms >> complex readahead 5069.438 ms 8018.347 ms >> >> Obviously the CPU overhead in this example didn't completely eliminate the IO >> bottleneck, but sure reduced the difference. > > That's a reasonable distinction, of course. > >> If your assumption is that real queries are more CPU intensive that the toy >> stuff above, e.g. due to joins etc, you can see why the really attained IO >> depth is lower. > > Right. > > Perhaps I was just repeating myself. Tomas seemed to be suggesting > that cases where we'll actually get a decent and completely worthwhile > improvement with the complex patch would be naturally rare, due in > part to these effects with CPU overhead. I don't think that that's > true at all. It's entirely possible my mental model is too naive, or my intuition about the queries is wrong ... My mental model of how this works is that if I know the amount of time T1 to process a page, and the amount of time T2 to handle an I/O, then I can estimate when I should have submitted a read for a page. For example if T1=1ms and T2=10ms, then I know I should submit an I/O ~10 pages ahead in order to not have to wait. That's the "minimal" queue depth. Of course, on high latency "cloud storage" the queue depth needs to grow, because the time T1 to process a page is likely about the same (if determined by CPU), but the T2 time for I/O is much higher. So we need to issue the I/O much sooner. When I mentioned "complex" queries, I meant queries where processing a page takes much more time. Because it reads the page, and passes it to other operators in the query plan, some of which may do CPU stuff, some will trigger some synchronous I/O, etc. Which means T1 grows, and the "minimal" queue depth decreases. Which part of this is not quite right? -- Tomas Vondra
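Spelling out the T1/T2 arithmetic above for a few combinations (only the 1 ms / 10 ms pair is from the message; the other rows are hypothetical):

-- minimal prefetch distance ~ T2 / T1: how far ahead an I/O must be issued so
-- it completes before the scan reaches that page
SELECT per_page_cpu_ms, io_latency_ms,
       ceil(io_latency_ms / per_page_cpu_ms) AS min_prefetch_distance
FROM (VALUES (1.0, 10.0),     -- the example above
             (1.0, 3.0),      -- same CPU cost, "cloud storage" latency
             (0.05, 3.0))     -- cheap per-page processing, cloud latency
     AS t(per_page_cpu_ms, io_latency_ms);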
On 7/23/25 02:59, Andres Freund wrote: > Hi, > > On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote: >> But I don't see why would this have any effect on the prefetch distance, >> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve >> that. I'd have expected exactly the opposite behavior. >> >> Could be bug, of course. But it'd be helpful to see the dataset/query. > > Pgbench scale 500, with the simpler query from my message. > With direct I/O, I guess? I'll take a look tomorrow. regards -- Tomas Vondra
On Wed, Jul 23, 2025 at 1:55 AM Tomas Vondra <tomas@vondra.me> wrote: > On 7/21/25 14:39, Thomas Munro wrote: > > Here also are some alternative experimental patches for preserving > > accumulated look-ahead distance better in cases like that. Needs more > > exploration... thoughts/ideas welcome... > > Thanks! I'll rerun the tests with these patches once the current round > of tests (with the simple distance restore after a reset) completes. Here's C, a tidier expression of the policy from the B patch. Also, I realised that the quickly-drafted A patch didn't actually implement what Andres suggested in the other thread as I had intended; what he actually speculated about was distance * 2 + nblocks. But it doesn't seem to matter much: anything you come up with along those lines seems to suffer from the problem that you can easily produce a test that defeats it by inserting just one more hit in between the misses, where the numbers involved can be quite small. The only policy I've come up with so far that doesn't give up until we definitely can't do better is the one that tracks a hypothetical window of the largest distance we possibly could have, and refuses to shrink the actual window until even the maximum wouldn't be enough, as expressed in the B and C patches. On the flip side, that degree of pessimism has a cost: of course it takes much longer to come back to distance = 1 and perhaps the fast path. Does it matter? I don't know. (It's only a hunch at this point but I think I can see a potentially better way to derive that sustain value from information available with another in-development patch that adds a new io_currency_target value, using IO subsystem feedback to compute the IO concurrency level that avoids I/O stalls but not more, instead of going all the way to the GUC limits and making it the user's problem to set them sensibly. I'll have to look into that properly, but I think it might be able to produce an ideal sustain value...)
Attachments
On Tue, Jul 22, 2025 at 8:37 PM Tomas Vondra <tomas@vondra.me> wrote: > > I happen to think that that's a very unrealistic assumption. Most > > standard benchmarks have indexes that almost all look fairly similar > > to pgbench_accounts_pkey, from the point of view of "heap page blocks > > per leaf page". There are exceptions, of course (e.g., the TPC-C order > > table's primary key suffers from fragmentation). > > > > I agree with all of this. Cool. > I assume you mean results for the "linear" data set, because for every > other data set the patches perform almost exactly the same (when > restoring the distance after stream reset): > > https://github.com/tvondra/indexscan-prefetch-tests/blob/with-distance-restore-after-reset/d16-rows-cold-32GB-16-unscaled.pdf Right. > And it's a very good point. I was puzzled by this too for a while, and > it took me a while to understand how/why this happens. It pretty much > boils down to the "duplicate block" detection and how it interacts with > the stream resets (again!). I think that you slightly misunderstand where I'm coming from here: it *doesn't* puzzle me. What puzzled me was that it puzzled you. Andres' test query is very simple, and not entirely sympathetic towards the complex patch (by design). And yet it *also* gets quite a decent improvement from the complex patch. It doesn't speed things up by another order of magnitude or anything, but it's a very decent improvement -- one well worth having. I'm also unsurprised at the fact that all the other tests that you ran were more or less a draw between simple and complex. At least not now that I've drilled down and understood what the indexes from those other test cases actually look like, in practice. > So you're right the complex patch prefetches far ahead. I thought the > distance will quickly decrease because of the duplicate blocks, but I > missed the fact the read stream will not seem them at all. FWIW I wasn't thinking about it at anything like that level of sophistication. Everything I've said about it was based on intuitions about how the prefetching was bound to work, for each different kind of index. I just looked at individual leaf pages (or small groups of them) from each index/test, and considered their TIDs, and imagined how that was likely to affect the scan. It just seems obvious to me that all the tests (except for "linear") couldn't possibly be helped by eagerly reading multiple leaf pages. It seemed equally obvious that it's quite possible to come up with a suite of tests that have several tests that could benefit in the same way (not just 1). Although your "linear_1"/"linear_N" tests aren't actually like that, many cases will be -- and not just those that are perfectly correlated ala pgbench. > I'm not sure it's desirable to "hide" blocks from the read stream like > this - it'll never see the misses. How could it make good decisions, > when we skew the data used by the heuristics like this? I don't think that I fully understand what's desirable here myself. > > Doing *no* prefetching will usually be the right thing to do. Does > > that make index prefetching pointless in general? > > > > I don't think so. Why would it? There's plenty of queries that can > benefit from it a lot, and as long as it doesn't cause harm to other > queries it's a win. I was being sarcastic. That wasn't a useful thing for me to do. Apologies. > This is not resetting the stream, though. This is resetting the position > tracking how far the stream got. 
My main point is that there's stuff going on here that nobody quite understands just yet. And so it probably makes sense to defensively assume that the prefetch distance resetting stuff might matter with either the complex or simple patch. > Sorry, got distracted and forgot to complete the sentence. I think I > wanted to write "mostly from not resetting the distance to 1". Which is > true, but the earlier "linear" example also shows there are cases where > the page boundaries are significant. Of course that's true. But that was just a temporary defect of the "simple" patch (and perhaps even for the "complex" patch, albeit to a much lesser degree). It isn't really relevant to the important question of whether the simple or complex design should be pursued -- we know that now. As I said, I don't think that the test suite is particularly well suited to evaluating simple vs complex. Because there's only one test ("linear") that has any hope of being better with the complex patch. And because having only 1 such test isn't representative. > That's actually intentional. I wanted to model tables with wider tuples, > without having to generate all the data etc. Maybe 25% is too much, and > real table have more than 20 tuples. It's true 400B is fairly large. My point about fill factor isn't particularly important. > I'm not against testing with other parameters, of course. The test was > not originally written for comparing different prefetching patches, so > it may not be quite fair (and I'm not sure how to define "fair"). I'd like to see more than 1 test where eagerly reading leaf pages has any hope of helping. That's my only important concern. > It's not uniformly random, I wrote it uses normal distribution. The > query in the SQL script does this: > > select x + random_normal(0, 1000) from ... > > It is a synthetic test data set, of course. It's meant to be simple to > generate, reason about, and somewhere in between the "linear" and > "uniform" data sets. I always start by looking at the index leaf pages, and imagining how an index scan can/will deal with that. Just because it's not truly uniformly random doesn't mean that that's apparent when you just look at one leaf page -- heap blocks might very well *appear* to be uniformly random (or close to it) when you drill down like that. Or even when you look at (say) 50 neighboring leaf pages. > But it also has realistic motivation - real tables are usually not as > clean as "linear", nor as random as the "uniform" data sets (not for all > columns, at least). If you're looking at data sets like "orders" or > whatever, there's usually a bit of noise even for columns like "date" > etc. People modify the orders, or fill-in data from a couple days ago, > etc. Perfect correlation for one column implies slightly worse > correlation for another column (order date vs. delivery date). I agree. > Right. I don't see a problem with this. I'm not saying parameters for > this particular data set are "perfect", but the intent is to have a > range of data sets from "perfectly clean" to "random" and see how the > patch(es) behave on all of them. Obviously none of your test cases are invalid -- they're all basically reasonable, when considered in isolation. But the "linear_1" test is *far* closer to the "uniform" test than it is to the "linear" test. At least as far as the simple vs complex question is concerned. > If you have a suggestion for different data sets, or how to tweak the > parameters to make it more realistic, I'm happy to try those. 
I'll get back to you on this soon. There are plenty of indexes that are not perfectly correlated (like pgbench_accounts_pkey is) that'll nevertheless benefit significantly from the approach taken by the complex patch. I'm sure of this because I've been using the query I posted earlier for many years now -- I've thought about and directly instrumented the "nhtids:nhblks" of an index of interest many times in the past. Thanks -- Peter Geoghegan
On 7/23/25 03:31, Peter Geoghegan wrote: > On Tue, Jul 22, 2025 at 8:37 PM Tomas Vondra <tomas@vondra.me> wrote: >>> I happen to think that that's a very unrealistic assumption. Most >>> standard benchmarks have indexes that almost all look fairly similar >>> to pgbench_accounts_pkey, from the point of view of "heap page blocks >>> per leaf page". There are exceptions, of course (e.g., the TPC-C order >>> table's primary key suffers from fragmentation). >>> >> >> I agree with all of this. > > Cool. > >> I assume you mean results for the "linear" data set, because for every >> other data set the patches perform almost exactly the same (when >> restoring the distance after stream reset): >> >> https://github.com/tvondra/indexscan-prefetch-tests/blob/with-distance-restore-after-reset/d16-rows-cold-32GB-16-unscaled.pdf > > Right. > >> And it's a very good point. I was puzzled by this too for a while, and >> it took me a while to understand how/why this happens. It pretty much >> boils down to the "duplicate block" detection and how it interacts with >> the stream resets (again!). > > I think that you slightly misunderstand where I'm coming from here: it > *doesn't* puzzle me. What puzzled me was that it puzzled you. > > Andres' test query is very simple, and not entirely sympathetic > towards the complex patch (by design). And yet it *also* gets quite a > decent improvement from the complex patch. It doesn't speed things up > by another order of magnitude or anything, but it's a very decent > improvement -- one well worth having. > > I'm also unsurprised at the fact that all the other tests that you ran > were more or less a draw between simple and complex. At least not now > that I've drilled down and understood what the indexes from those > other test cases actually look like, in practice. > >> So you're right the complex patch prefetches far ahead. I thought the >> distance will quickly decrease because of the duplicate blocks, but I >> missed the fact the read stream will not seem them at all. > > FWIW I wasn't thinking about it at anything like that level of > sophistication. Everything I've said about it was based on intuitions > about how the prefetching was bound to work, for each different kind > of index. I just looked at individual leaf pages (or small groups of > them) from each index/test, and considered their TIDs, and imagined > how that was likely to affect the scan. > > It just seems obvious to me that all the tests (except for "linear") > couldn't possibly be helped by eagerly reading multiple leaf pages. It > seemed equally obvious that it's quite possible to come up with a > suite of tests that have several tests that could benefit in the same > way (not just 1). Although your "linear_1"/"linear_N" tests aren't > actually like that, many cases will be -- and not just those that are > perfectly correlated ala pgbench. > >> I'm not sure it's desirable to "hide" blocks from the read stream like >> this - it'll never see the misses. How could it make good decisions, >> when we skew the data used by the heuristics like this? > > I don't think that I fully understand what's desirable here myself. > >>> Doing *no* prefetching will usually be the right thing to do. Does >>> that make index prefetching pointless in general? >>> >> >> I don't think so. Why would it? There's plenty of queries that can >> benefit from it a lot, and as long as it doesn't cause harm to other >> queries it's a win. > > I was being sarcastic. That wasn't a useful thing for me to do. Apologies. 
> >> This is not resetting the stream, though. This is resetting the position >> tracking how far the stream got. > > My main point is that there's stuff going on here that nobody quite > understands just yet. And so it probably makes sense to defensively > assume that the prefetch distance resetting stuff might matter with > either the complex or simple patch. > >> Sorry, got distracted and forgot to complete the sentence. I think I >> wanted to write "mostly from not resetting the distance to 1". Which is >> true, but the earlier "linear" example also shows there are cases where >> the page boundaries are significant. > > Of course that's true. But that was just a temporary defect of the > "simple" patch (and perhaps even for the "complex" patch, albeit to a > much lesser degree). It isn't really relevant to the important > question of whether the simple or complex design should be pursued -- > we know that now. > > As I said, I don't think that the test suite is particularly well > suited to evaluating simple vs complex. Because there's only one test > ("linear") that has any hope of being better with the complex patch. > And because having only 1 such test isn't representative. > >> That's actually intentional. I wanted to model tables with wider tuples, >> without having to generate all the data etc. Maybe 25% is too much, and >> real table have more than 20 tuples. It's true 400B is fairly large. > > My point about fill factor isn't particularly important. > Yeah, the randomness of the TIDs matters too much. >> I'm not against testing with other parameters, of course. The test was >> not originally written for comparing different prefetching patches, so >> it may not be quite fair (and I'm not sure how to define "fair"). > > I'd like to see more than 1 test where eagerly reading leaf pages has > any hope of helping. That's my only important concern. > Agreed. >> It's not uniformly random, I wrote it uses normal distribution. The >> query in the SQL script does this: >> >> select x + random_normal(0, 1000) from ... >> >> It is a synthetic test data set, of course. It's meant to be simple to >> generate, reason about, and somewhere in between the "linear" and >> "uniform" data sets. > > I always start by looking at the index leaf pages, and imagining how > an index scan can/will deal with that. > > Just because it's not truly uniformly random doesn't mean that that's > apparent when you just look at one leaf page -- heap blocks might very > well *appear* to be uniformly random (or close to it) when you drill > down like that. Or even when you look at (say) 50 neighboring leaf > pages. > Yeah, the number of heap blocks per leaf page is a useful measure. I should have thought about that. The other thing worth tracking is probably how the number of heap blocks increases with multiple leaf pages, to measure the "hit ratio". I should have thought about this more when creating the data sets ... >> But it also has realistic motivation - real tables are usually not as >> clean as "linear", nor as random as the "uniform" data sets (not for all >> columns, at least). If you're looking at data sets like "orders" or >> whatever, there's usually a bit of noise even for columns like "date" >> etc. People modify the orders, or fill-in data from a couple days ago, >> etc. Perfect correlation for one column implies slightly worse >> correlation for another column (order date vs. delivery date). > > I agree. > >> Right. I don't see a problem with this. 
I'm not saying parameters for >> this particular data set are "perfect", but the intent is to have a >> range of data sets from "perfectly clean" to "random" and see how the >> patch(es) behave on all of them. > > Obviously none of your test cases are invalid -- they're all basically > reasonable, when considered in isolation. But the "linear_1" test is > *far* closer to the "uniform" test than it is to the "linear" test. At > least as far as the simple vs complex question is concerned. > Perhaps not invalid, but it also does not cover the space of possible data sets the way I intended. It seems all the data sets are much more random than I expected. >> If you have a suggestion for different data sets, or how to tweak the >> parameters to make it more realistic, I'm happy to try those. > > I'll get back to you on this soon. There are plenty of indexes that > are not perfectly correlated (like pgbench_accounts_pkey is) that'll > nevertheless benefit significantly from the approach taken by the > complex patch. I'm sure of this because I've been using the query I > posted early for many years now -- I've thought about and directly > instrumented the "nhtids:nhblks" of an index of interest many times in > the past. > Thanks! regards -- Tomas Vondra
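For reference, the "heap blocks per leaf page" measure can be computed directly with pageinspect, without any new tooling. The sketch below makes a few assumptions: the index name and the block range 10-19 are placeholders, the blocks are assumed to be leaf pages (bt_page_stats(...).type = 'l' can confirm that), and posting-list tuples are only counted once (only their lowest heap TID appears in the htid column), so the numbers are approximate for deduplicated indexes:

  CREATE EXTENSION IF NOT EXISTS pageinspect;

  -- nhtids / nhblks for each of the index blocks 10-19 (placeholders)
  SELECT blkno,
         count(htid) AS nhtids,
         count(DISTINCT (htid::text::point)[0]::bigint) AS nhblks
    FROM generate_series(10, 19) AS blkno,
         LATERAL bt_page_items('pgbench_accounts_pkey', blkno)
   WHERE htid IS NOT NULL
   GROUP BY blkno
   ORDER BY blkno;

Dropping blkno and the GROUP BY gives the cumulative version: the union of heap blocks across all ten leaf pages, which is the "how do the heap blocks grow across multiple leaf pages" number mentioned above.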
On 7/23/25 02:59, Andres Freund wrote: > Hi, > > On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote: >> But I don't see why would this have any effect on the prefetch distance, >> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve >> that. I'd have expected exactly the opposite behavior. >> >> Could be bug, of course. But it'd be helpful to see the dataset/query. > > Pgbench scale 500, with the simpler query from my message. > I tried to reproduce this, but I'm not seeing that behavior. I'm not sure how you monitor the queue depth (presumably iostat?), but I added basic prefetch info to explain (see the attached WIP patch), reporting the average prefetch distance, number of stalls (with distance=0) and stream resets (after filling INDEX_SCAN_MAX_BATCHES). And I see this (there's a complete explain output attached) for the two queries from your message [1].

The simple query:

  SELECT max(abalance)
    FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000);

The complex query:

  SELECT max(abalance), min(abalance), sum(abalance::numeric),
         avg(abalance::numeric), avg(aid::numeric), avg(bid::numeric)
    FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000);

The stats actually look *exactly* the same, which makes sense because it's reading the same index.

  max_batches    distance      stalls    resets    stalls/reset
  -------------------------------------------------------------
           64         272           3         3               1
           32          59      122939       653             188
           16          36      108101      1190              90
            8          21       98775      2104              46
            4          11       95627      4556              20

I think this behavior mostly matches my expectations, although it's interesting the stalls jump so much between 64 and 32 batches. I did test both with buffered I/O (io_method=sync) and direct I/O (io_method=worker), and the results are exactly the same for me. Not the timings, of course, but the prefetch stats. Of course, maybe there's something wrong in how the stats are collected. I wonder if maybe we should update the distance in get_block() and not in next_buffer(). Or maybe there's some interference from having to read the leaf pages sooner. But I don't see why that would affect the queue depth, fewer resets should keep the queues fuller I think. I'll think about adding some sort of distance histogram to the stats. Maybe something like tinyhist [2] would work here. [1] https://www.postgresql.org/message-id/h2n7d7zb2lbkdcemopvrgmteo35zzi5ljl2jmk32vz5f4pziql%407ppr6r6yfv4z [2] https://github.com/tvondra/tinyhist regards -- Tomas Vondra
Attachments
Hi, On 2025-07-23 14:50:15 +0200, Tomas Vondra wrote: > On 7/23/25 02:59, Andres Freund wrote: > > Hi, > > > > On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote: > >> But I don't see why would this have any effect on the prefetch distance, > >> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve > >> that. I'd have expected exactly the opposite behavior. > >> > >> Could be bug, of course. But it'd be helpful to see the dataset/query. > > > > Pgbench scale 500, with the simpler query from my message. > > > > I tried to reproduce this, but I'm not seeing behavior. I'm not sure how > you monitor the queue depth (presumably iostat?) Yes, iostat, since I was looking at what the "actually required" lookahead distance is. Do you actually get the query to be entirely CPU bound? What amount of IO waiting do you see EXPLAIN (ANALYZE, TIMING OFF) with track_io_timing=on report? Ah - I was using a very high effective_io_concurrency. With a high effective_io_concurrency value I see a lot of stalls, even at INDEX_SCAN_MAX_BATCHES = 64. And a lower prefetch distance, which seems somewhat odd. FWIW, in my tests I was just evicting lineitem from shared buffers, since I wanted to test the heap prefetching, without stalls induced by blocking on index reads. But what I described happens with either. ;SET effective_io_concurrency = 256;SELECT pg_buffercache_evict_relation('pgbench_accounts'); explain (analyze, costs off,timing off) SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000); ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Aggregate (actual rows=1.00 loops=1) │ │ Buffers: shared hit=27369 read=164191 │ │ I/O Timings: shared read=358.795 │ │ -> Limit (actual rows=10000000.00 loops=1) │ │ Buffers: shared hit=27369 read=164191 │ │ I/O Timings: shared read=358.795 │ │ -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) │ │ Index Searches: 1 │ │ Prefetch Distance: 256.989 │ │ Prefetch Stalls: 3 │ │ Prefetch Resets: 3 │ │ Buffers: shared hit=27369 read=164191 │ │ I/O Timings: shared read=358.795 │ │ Planning Time: 0.086 ms │ │ Execution Time: 4194.845 ms │ └──────────────────────────────────────────────────────────────────────────────────────────────────────────┘ ;SET effective_io_concurrency = 512;SELECT pg_buffercache_evict_relation('pgbench_accounts'); explain (analyze, costs off,timing off) SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000); ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Aggregate (actual rows=1.00 loops=1) │ │ Buffers: shared hit=27368 read=164190 │ │ I/O Timings: shared read=832.515 │ │ -> Limit (actual rows=10000000.00 loops=1) │ │ Buffers: shared hit=27368 read=164190 │ │ I/O Timings: shared read=832.515 │ │ -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) │ │ Index Searches: 1 │ │ Prefetch Distance: 56.778 │ │ Prefetch Stalls: 160569 │ │ Prefetch Resets: 423 │ │ Buffers: shared hit=27368 read=164190 │ │ I/O Timings: shared read=832.515 │ │ Planning Time: 0.084 ms │ │ Execution Time: 4413.058 ms │ 
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘ Greetings, Andres Freund
On Tue, Jul 22, 2025 at 9:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > I'll get back to you on this soon. There are plenty of indexes that > are not perfectly correlated (like pgbench_accounts_pkey is) that'll > nevertheless benefit significantly from the approach taken by the > complex patch. I'll give you a few examples that look like this. I'm not necessarily suggesting that you adopt these example indexes into your test suite -- these are just to stimulate discussion. * The TPC-C order line table primary key. This is the single largest index used by TPC-C, by quite some margin. This is the index that my Postgres 12 "split after new tuple" optimization made about 40% smaller with retail inserts -- I've already studied it in detail. It's a composite index on 4 integer columns, where each leaf page contains about 260 index tuples. Note that this is true regardless of whether retail inserts or CREATE INDEX were used (thanks to the "split after new tuple" thing). And yet I see that nhblks is even *lower* than pgbench_accounts: the average per-leaf-page nhblks is about 4 or 5. While the odd leaf page has an nhblks of 7 or 8, some individual leaf pages that are full, but have an nhblks of only 4. I would expect such an index to benefit to the maximum possible extent from the complex patch/eager leaf page reading. This is true in spite of the fact that technically the overall correlation is weak. Here's what a random leaf page looks like, in terms of TIDs: ┌────────────┬───────────┐ │ itemoffset │ htid │ ├────────────┼───────────┤ │ 1 │ ∅ │ │ 2 │ (1510,55) │ │ 3 │ (1510,56) │ │ 4 │ (1510,57) │ │ 5 │ (1510,58) │ │ 6 │ (1510,59) │ │ 7 │ (1510,60) │ │ 8 │ (1510,61) │ │ 9 │ (1510,62) │ │ 10 │ (1510,63) │ │ 11 │ (1510,64) │ │ 12 │ (1510,65) │ │ 13 │ (1510,66) │ │ 14 │ (1510,67) │ │ 15 │ (1510,68) │ │ 16 │ (1510,69) │ │ 17 │ (1510,70) │ │ 18 │ (1510,71) │ │ 19 │ (1510,72) │ │ 20 │ (1510,73) │ │ 21 │ (1510,74) │ │ 22 │ (1510,75) │ │ 23 │ (1510,76) │ │ 24 │ (1510,77) │ │ 25 │ (1510,78) │ │ 26 │ (1510,79) │ │ 27 │ (1510,80) │ │ 28 │ (1510,81) │ │ 29 │ (1517,1) │ │ 30 │ (1517,2) │ │ 31 │ (1517,3) │ │ 32 │ (1517,4) │ │ 33 │ (1517,5) │ │ 34 │ (1517,6) │ │ 35 │ (1517,7) │ │ 36 │ (1517,8) │ │ 37 │ (1517,9) │ │ 38 │ (1517,10) │ │ 39 │ (1517,11) │ │ 40 │ (1517,12) │ │ 41 │ (1517,13) │ │ 42 │ (1517,14) │ │ 43 │ (1517,15) │ │ 44 │ (1517,16) │ │ 45 │ (1517,17) │ │ 46 │ (1517,18) │ │ 47 │ (1517,19) │ │ 48 │ (1517,20) │ │ 49 │ (1517,21) │ │ 50 │ (1517,22) │ │ 51 │ (1517,23) │ │ 52 │ (1517,24) │ │ 53 │ (1517,25) │ │ 54 │ (1517,26) │ │ 55 │ (1517,27) │ │ 56 │ (1517,28) │ │ 57 │ (1517,29) │ │ 58 │ (1517,30) │ │ 59 │ (1517,31) │ │ 60 │ (1517,32) │ │ 61 │ (1517,33) │ │ 62 │ (1517,34) │ │ 63 │ (1517,35) │ │ 64 │ (1517,36) │ │ 65 │ (1517,37) │ │ 66 │ (1517,38) │ │ 67 │ (1517,39) │ │ 68 │ (1517,40) │ │ 69 │ (1517,41) │ │ 70 │ (1517,42) │ │ 71 │ (1517,43) │ │ 72 │ (1517,44) │ │ 73 │ (1517,45) │ │ 74 │ (1517,46) │ │ 75 │ (1517,47) │ │ 76 │ (1517,48) │ │ 77 │ (1517,49) │ │ 78 │ (1517,50) │ │ 79 │ (1517,51) │ │ 80 │ (1517,52) │ │ 81 │ (1517,53) │ │ 82 │ (1517,54) │ │ 83 │ (1517,55) │ │ 84 │ (1517,56) │ │ 85 │ (1517,57) │ │ 86 │ (1517,58) │ │ 87 │ (1517,59) │ │ 88 │ (1517,60) │ │ 89 │ (1517,62) │ │ 90 │ (1523,1) │ │ 91 │ (1523,2) │ │ 92 │ (1523,3) │ │ 93 │ (1523,4) │ │ 94 │ (1523,5) │ │ 95 │ (1523,6) │ │ 96 │ (1523,7) │ │ 97 │ (1523,8) │ │ 98 │ (1523,9) │ │ 99 │ (1523,10) │ │ 100 │ (1523,11) │ │ 101 │ (1523,12) │ │ 102 │ (1523,13) │ │ 103 │ (1523,14) │ │ 104 │ (1523,15) │ │ 105 │ (1523,16) │ │ 106 │ (1523,17) │ │ 107 │ 
(1523,18) │ │ 108 │ (1523,19) │ │ 109 │ (1523,20) │ │ 110 │ (1523,21) │ │ 111 │ (1523,22) │ │ 112 │ (1523,23) │ │ 113 │ (1523,24) │ │ 114 │ (1523,25) │ │ 115 │ (1523,26) │ │ 116 │ (1523,27) │ │ 117 │ (1523,28) │ │ 118 │ (1523,29) │ │ 119 │ (1523,30) │ │ 120 │ (1523,31) │ │ 121 │ (1523,32) │ │ 122 │ (1523,33) │ │ 123 │ (1523,34) │ │ 124 │ (1523,35) │ │ 125 │ (1523,36) │ │ 126 │ (1523,37) │ │ 127 │ (1523,38) │ │ 128 │ (1523,39) │ │ 129 │ (1523,40) │ │ 130 │ (1523,41) │ │ 131 │ (1523,42) │ │ 132 │ (1523,43) │ │ 133 │ (1523,44) │ │ 134 │ (1523,45) │ │ 135 │ (1523,46) │ │ 136 │ (1523,47) │ │ 137 │ (1523,48) │ │ 138 │ (1523,49) │ │ 139 │ (1523,50) │ │ 140 │ (1523,51) │ │ 141 │ (1523,52) │ │ 142 │ (1523,53) │ │ 143 │ (1523,54) │ │ 144 │ (1523,55) │ │ 145 │ (1523,56) │ │ 146 │ (1523,57) │ │ 147 │ (1523,58) │ │ 148 │ (1523,59) │ │ 149 │ (1523,60) │ │ 150 │ (1523,61) │ │ 151 │ (1523,62) │ │ 152 │ (1523,63) │ │ 153 │ (1523,64) │ │ 154 │ (1523,65) │ │ 155 │ (1523,66) │ │ 156 │ (1523,67) │ │ 157 │ (1523,68) │ │ 158 │ (1523,69) │ │ 159 │ (1523,70) │ │ 160 │ (1523,71) │ │ 161 │ (1523,72) │ │ 162 │ (1523,73) │ │ 163 │ (1523,74) │ │ 164 │ (1523,75) │ │ 165 │ (1523,76) │ │ 166 │ (1523,77) │ │ 167 │ (1523,78) │ │ 168 │ (1523,79) │ │ 169 │ (1523,80) │ │ 170 │ (1523,81) │ │ 171 │ (1531,1) │ │ 172 │ (1531,2) │ │ 173 │ (1531,3) │ │ 174 │ (1531,4) │ │ 175 │ (1531,5) │ │ 176 │ (1531,6) │ │ 177 │ (1531,7) │ │ 178 │ (1531,8) │ │ 179 │ (1531,9) │ │ 180 │ (1531,10) │ │ 181 │ (1531,11) │ │ 182 │ (1531,12) │ │ 183 │ (1531,13) │ │ 184 │ (1531,14) │ │ 185 │ (1531,15) │ │ 186 │ (1531,16) │ │ 187 │ (1531,17) │ │ 188 │ (1531,18) │ │ 189 │ (1531,19) │ │ 190 │ (1531,20) │ │ 191 │ (1531,21) │ │ 192 │ (1531,22) │ │ 193 │ (1531,23) │ │ 194 │ (1531,24) │ │ 195 │ (1531,25) │ │ 196 │ (1531,26) │ │ 197 │ (1531,27) │ │ 198 │ (1531,28) │ │ 199 │ (1531,29) │ │ 200 │ (1531,30) │ │ 201 │ (1531,31) │ │ 202 │ (1531,32) │ │ 203 │ (1531,33) │ │ 204 │ (1531,34) │ │ 205 │ (1531,35) │ │ 206 │ (1531,36) │ │ 207 │ (1531,37) │ │ 208 │ (1531,38) │ │ 209 │ (1531,39) │ │ 210 │ (1531,40) │ │ 211 │ (1531,41) │ │ 212 │ (1531,42) │ │ 213 │ (1531,43) │ │ 214 │ (1531,44) │ │ 215 │ (1531,45) │ │ 216 │ (1531,46) │ │ 217 │ (1531,47) │ │ 218 │ (1531,48) │ │ 219 │ (1531,49) │ │ 220 │ (1531,50) │ │ 221 │ (1531,51) │ │ 222 │ (1531,52) │ │ 223 │ (1531,53) │ │ 224 │ (1531,54) │ │ 225 │ (1531,55) │ │ 226 │ (1531,56) │ │ 227 │ (1531,57) │ │ 228 │ (1531,58) │ │ 229 │ (1531,59) │ │ 230 │ (1531,60) │ │ 231 │ (1531,61) │ │ 232 │ (1531,62) │ │ 233 │ (1531,63) │ │ 234 │ (1531,64) │ │ 235 │ (1531,65) │ │ 236 │ (1531,66) │ │ 237 │ (1531,67) │ │ 238 │ (1531,68) │ │ 239 │ (1531,69) │ │ 240 │ (1531,70) │ │ 241 │ (1531,71) │ │ 242 │ (1531,72) │ │ 243 │ (1531,73) │ │ 244 │ (1531,74) │ │ 245 │ (1531,75) │ │ 246 │ (1531,76) │ │ 247 │ (1531,77) │ │ 248 │ (1531,78) │ │ 249 │ (1531,79) │ │ 250 │ (1531,80) │ │ 251 │ (1531,81) │ │ 252 │ (1539,1) │ │ 253 │ (1539,2) │ │ 254 │ (1539,3) │ │ 255 │ (1539,4) │ │ 256 │ (1539,5) │ │ 257 │ (1539,6) │ │ 258 │ (1539,7) │ │ 259 │ (1539,8) │ │ 260 │ (1539,9) │ │ 261 │ (1539,10) │ └────────────┴───────────┘ (261 rows) Notice that there are contiguous groups of tuples that all point to the same heap block. These groups are really groups of items (on average 10 items) from a given order. Individual warehouses seem to have a tendency to insert multiple orders together, which further lowers nhtids. You can tell that tuples aren't inserted in strict ascending order because there are "heap TID discontinuities". 
For example, item 165 (which is the last item from a given order) points to (15377,81), while item 166 (which is the first item from the next order made to the same warehouse) points to (15385,1). There is a "heap block gap" between index tuple item 165 and 166 -- these "missing" heap blocks don't appear anywhere on the same leaf page. Note also that many of the other TPC-C indexes have this same quality to them. They also consist of groups of related tuples, that get inserted together in ascending order -- and yet the *overall* pattern for the index is pretty far from inserts happening in ascending key space order. * A low cardinality index. In one way, this works against the complex patch: if there are ~1350 TIDs on every leaf page (thanks to nbtree deduplication), we're presumably less likely to ever need to read very many leaf pages eagerly. But in another way it favors the complex patch: each individual distinct value will have its TIDs stored/read in TID order, which can be enough of a factor to get us a low nhtids value for each leaf page. I see a nhtids of 5 - 7 for leaf pages from the following index: create table low_cardinality(foo int4); CREATE TABLE create index on low_cardinality (foo); CREATE INDEX insert into low_cardinality select hashint4(j) from generate_series(1,10_000) i, generate_series(1,100) j; INSERT 0 1000000 This is actually kinda like the TPC-C index, in a way: "foo" column values all look random. But within a given value, the TIDs are in ascending order, which (at least here) is enough to get us a very low nhtids -- even in spite of each leaf page storing more than 4x as many TIDs than could be stored within each of the TPC-C index's pages. Note that the number of CPU cycles needed within nbtree to read a leaf page from a low cardinality index is probably *lower* than the typical case for a unique index. This is due to a variety of factors -- the main factor is that there aren't very many index tuples to evaluate on the page. So the scan isn't bottlenecked at that (certainly not to an extent that is commensurate with the overall number of TIDs). The terminology in this area is tricky. We say "correlation", when perhaps we should say something like "heap clustering factor" -- a concept that seems hard to define precisely. It doesn't help that the planner models all this using a correlation stat -- that encourages us to reduce everything to a single scalar correlation number, which can be quite misleading. I could give more examples, if you want. But they'd all just be variations of the same thing. -- Peter Geoghegan
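The per-leaf-page figures described for an index like this can be checked directly with pageinspect. The query below is a sketch under a few assumptions: the default index name low_cardinality_foo_idx, block 5 happening to be a leaf page (bt_page_stats(...).type = 'l' confirms that), and a server new enough to have the htid/tids columns. The unnest over the tids column is what accounts for the posting-list tuples created by deduplication:

  -- heap TIDs and distinct heap blocks referenced by one leaf page
  SELECT count(*) AS nhtids,
         count(DISTINCT (t::text::point)[0]::bigint) AS nhblks
    FROM bt_page_items('low_cardinality_foo_idx', 5),
         LATERAL unnest(coalesce(tids, ARRAY[htid])) AS t
   WHERE htid IS NOT NULL;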
On 7/23/25 17:09, Andres Freund wrote: > Hi, > > On 2025-07-23 14:50:15 +0200, Tomas Vondra wrote: >> On 7/23/25 02:59, Andres Freund wrote: >>> Hi, >>> >>> On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote: >>>> But I don't see why would this have any effect on the prefetch distance, >>>> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve >>>> that. I'd have expected exactly the opposite behavior. >>>> >>>> Could be bug, of course. But it'd be helpful to see the dataset/query. >>> >>> Pgbench scale 500, with the simpler query from my message. >>> >> >> I tried to reproduce this, but I'm not seeing behavior. I'm not sure how >> you monitor the queue depth (presumably iostat?) > > Yes, iostat, since I was looking at what the "actually required" lookahead > distance is. > > Do you actually get the query to be entirely CPU bound? What amount of IO > waiting do you see EXPLAIN (ANALYZE, TIMING OFF) with track_io_timing=on > report? > No, it definitely needs to wait for I/O (FWIW it's on the xeon, with a single NVMe SSD). > Ah - I was using a very high effective_io_concurrency. With a high > effective_io_concurrency value I see a lot of stalls, even at > INDEX_SCAN_MAX_BATCHES = 64. And a lower prefetch distance, which seems > somewhat odd. > I think that's a bug in the explain patch. The counters were updated at the beginning of _next_buffer(), but that's wrong - a single call to _next_buffer() can prefetch multiple blocks. This skewed the stats, as the prefetches are not counted with "distance=0". With higher eic this happens sooner, so the average distance seemed to decrease. The attached patch does the updates in _get_block(), which I think is better. And "stall" now means (distance == 1), which I think detects requests without prefetching. I also added a separate "Count" for the actual number of prefetched blocks, and "Skipped" for duplicate blocks skipped (which the read stream never even sees, because it's skipped in the callback). > > FWIW, in my tests I was just evicting lineitem from shared buffers, since I > wanted to test the heap prefetching, without stalls induced by blocking on > index reads. But what I described happens with either. 
> > ;SET effective_io_concurrency = 256;SELECT pg_buffercache_evict_relation('pgbench_accounts'); explain (analyze, costs off,timing off) SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000); > ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐ > │ QUERY PLAN │ > ├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤ > │ Aggregate (actual rows=1.00 loops=1) │ > │ Buffers: shared hit=27369 read=164191 │ > │ I/O Timings: shared read=358.795 │ > │ -> Limit (actual rows=10000000.00 loops=1) │ > │ Buffers: shared hit=27369 read=164191 │ > │ I/O Timings: shared read=358.795 │ > │ -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) │ > │ Index Searches: 1 │ > │ Prefetch Distance: 256.989 │ > │ Prefetch Stalls: 3 │ > │ Prefetch Resets: 3 │ > │ Buffers: shared hit=27369 read=164191 │ > │ I/O Timings: shared read=358.795 │ > │ Planning Time: 0.086 ms │ > │ Execution Time: 4194.845 ms │ > └──────────────────────────────────────────────────────────────────────────────────────────────────────────┘ > > ;SET effective_io_concurrency = 512;SELECT pg_buffercache_evict_relation('pgbench_accounts'); explain (analyze, costs off,timing off) SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000); > ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐ > │ QUERY PLAN │ > ├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤ > │ Aggregate (actual rows=1.00 loops=1) │ > │ Buffers: shared hit=27368 read=164190 │ > │ I/O Timings: shared read=832.515 │ > │ -> Limit (actual rows=10000000.00 loops=1) │ > │ Buffers: shared hit=27368 read=164190 │ > │ I/O Timings: shared read=832.515 │ > │ -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) │ > │ Index Searches: 1 │ > │ Prefetch Distance: 56.778 │ > │ Prefetch Stalls: 160569 │ > │ Prefetch Resets: 423 │ > │ Buffers: shared hit=27368 read=164190 │ > │ I/O Timings: shared read=832.515 │ > │ Planning Time: 0.084 ms │ > │ Execution Time: 4413.058 ms │ > └──────────────────────────────────────────────────────────────────────────────────────────────────────────┘ > > Greetings, > The attached v2 explain patch should fix that. I'm also attaching logs from my explain, for 64 and 16 batches. I think the output makes much more sense now. cheers -- Tomas Vondra
Attachments
On Wed, Jul 23, 2025 at 12:36 PM Peter Geoghegan <pg@bowt.ie> wrote: > * The TPC-C order line table primary key. I tested this for myself. Tomas' index-prefetch-simple-master branch: set max_parallel_workers_per_gather =0; SELECT pg_buffercache_evict_relation('order_line'); select pg_prewarm('order_line_pkey'); :ea select sum(ol_amount) from order_line where ol_w_id < 10; ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Aggregate (cost=264259.55..264259.56 rows=1 width=32) (actual time=2015.711..2015.712 rows=1.00 loops=1) │ │ Output: sum(ol_amount) │ │ Buffers: shared hit=17815 read=33855 │ │ I/O Timings: shared read=1490.918 │ │ -> Index Scan using order_line_pkey on public.order_line (cost=0.56..257361.93 rows=2759049 width=4) (actual time=7.936..1768.236 rows=2700116.00 loops=1) │ │ Output: ol_w_id, ol_d_id, ol_o_id, ol_number, ol_i_id, ol_delivery_d, ol_amount, ol_supply_w_id, ol_quantity, ol_dist_info │ │ Index Cond: (order_line.ol_w_id < 10) │ │ Index Searches: 1 │ │ Index Prefetch: true │ │ Index Distance: 110.7 │ │ Buffers: shared hit=17815 read=33855 │ │ I/O Timings: shared read=1490.918 │ │ Planning Time: 0.049 ms │ │ Serialization: time=0.003 ms output=1kB format=text │ │ Execution Time: 2015.731 ms │ └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ (15 rows) Complex patch (same prewarming/eviction are omitted this time): :ea select sum(ol_amount) from order_line where ol_w_id < 10; ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ Aggregate (cost=264259.55..264259.56 rows=1 width=32) (actual time=768.387..768.388 rows=1.00 loops=1) │ │ Output: sum(ol_amount) │ │ Buffers: shared hit=17815 read=33855 │ │ I/O Timings: shared read=138.856 │ │ -> Index Scan using order_line_pkey on public.order_line (cost=0.56..257361.93 rows=2759049 width=4) (actual time=7.956..493.694 rows=2700116.00 loops=1) │ │ Output: ol_w_id, ol_d_id, ol_o_id, ol_number, ol_i_id, ol_delivery_d, ol_amount, ol_supply_w_id, ol_quantity, ol_dist_info │ │ Index Cond: (order_line.ol_w_id < 10) │ │ Index Searches: 1 │ │ Buffers: shared hit=17815 read=33855 │ │ I/O Timings: shared read=138.856 │ │ Planning Time: 0.043 ms │ │ Serialization: time=0.003 ms output=1kB format=text │ │ Execution Time: 768.454 ms │ └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ (13 rows) I'm using direct IO in both cases. This can easily be repeated, and is stable. To be fair, the planner wants to use a parallel index scan for this. If I allow the scan to be parallel, 5 parallel workers are used. The simple patch now takes 295.722 ms, while the complex patch takes 301.875 ms. 
I imagine that that's because the use of parallelism eliminates the natural advantage that the complex patch has with this workload/index -- the scan as a whole is presumably no longer bottlenecked on physical index characteristics. The parallel workers can almost behave as 5 independent scans, all kept sufficiently busy, even without our having to read ahead to later leaf pages. It's possible that something weird is going on with the prefetch distance, in the context of parallel scans specifically -- it's not like we've really tested parallel scans just yet (with either patch). Even if there is an addressable problem in either patch here, I'd be surprised if it was the main factor behind the simple patch doing relatively well when scanning in parallel like this. -- Peter Geoghegan
On Wed, Jul 23, 2025 at 9:59 PM Peter Geoghegan <pg@bowt.ie> wrote: > Tomas' index-prefetch-simple-master branch: > │ I/O Timings: shared read=1490.918 > │ Execution Time: 2015.731 ms > Complex patch (same prewarming/eviction are omitted this time): > │ I/O Timings: shared read=138.856 > │ Execution Time: 768.454 ms > I'm using direct IO in both cases. This can easily be repeated, and is stable. Forgot to add context about the master branch: Master can do this in 2386.850 ms, with "I/O Timings: shared read=1825.161". That's with buffered I/O (not direct I/O), and with the same pg_prewarm + pg_buffercache_evict_relation function calls as before. I'm running "echo 3 > /proc/sys/vm/drop_caches" to drop the filesystem cache here, too (unlike when testing the patches, where my use of direct i/o makes that step unnecessary). In summary, the simple patch + direct I/O clearly beats the master branch + buffered I/O here -- though not by much. While the complex patch gets a far greater benefit. -- Peter Geoghegan
On 7/23/25 02:37, Tomas Vondra wrote: > ... > >>> Thanks. I wonder how difficult would it be to add something like this to >>> pgstattuple. I mean, it shouldn't be difficult to look at leaf pages and >>> count distinct blocks, right? Seems quite useful. >> >> I agree that that would be quite useful. >> > > Good first patch for someone ;-) > I got a bit bored yesterday, so I gave this a try and whipped up a patch that adds two pgstattuple functions that I think could be useful for analyzing index metrics that matter for prefetching. The two functions are meant to provide data for additional analysis rather than computing something final. Each function splits the index into a sequence of block ranges (of a given length), and calculates some metrics for each range.

pgstatindex_nheap
  - number of leaf pages in the range
  - number of block numbers
  - number of distinct block numbers
  - number of runs (of the same block)

pgstatindex_runs
  - number of leaf pages in the range
  - run length
  - number of runs with that length

It's trivial to summarize this into a per-index statistic (of course, there may be some inaccuracies when a run spans multiple ranges), but it also seems useful to be able to look at parts of the index. This is meant as a quick experimental patch, to help with generating better datasets for the evaluation. And I think it works for that, and I don't have immediate plans to work on this outside that context. There are a couple things we'd need to address before actually merging this, I think. Two that I can think of right now: First, the "range length" determines memory usage. Right now it's a bit naive, and just extracts all blocks (for the range) into an array. That might be an issue for larger ranges; I'm sure there are strategies to mitigate that - doing some of the processing while reading block numbers, using hyperloglog to estimate distincts, etc. Second, the index is walked sequentially in physical order, from block 0 to the last block. But that's not really what the index prefetch sees. To make it "more accurate" it'd be better to just scan the leaf pages as if during a "full index scan". Also, I haven't updated the docs. That'd also need to be done. regards -- Tomas Vondra
Attachments
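As a usage sketch, the per-range output could be rolled up into a per-index summary roughly like this; the signature (index name, range length in blocks) and the output column names are guesses based on the description above, so the actual WIP patch may well differ:

  -- roll up pgstatindex_nheap output (range length 4096 blocks) per index;
  -- column names and signature assumed, not taken from the patch
  SELECT sum(nleafs)    AS leaf_pages,
         sum(nblocks)   AS heap_block_numbers,
         sum(ndistinct) AS distinct_heap_blocks,  -- blocks spanning ranges get counted more than once
         sum(nruns)     AS block_runs             -- runs crossing range boundaries get split
    FROM pgstatindex_nheap('pgbench_accounts_pkey', 4096);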
On Thu, Jul 24, 2025 at 7:19 AM Tomas Vondra <tomas@vondra.me> wrote: > I got a bit bored yesterday, so I gave this a try and whipped up a patch > that adds two pgstattuple functins that I think could be useful for > analyzing index metrics that matter for prefetching. This seems quite useful. I notice that you're not accounting for posting lists. That'll lead to miscounts of the number of heap blocks in many cases. I think that that's worth fixing, even given that this patch is experimental. > It's trivial to summarize this into a per-index statistic (of course, > there may be some inaccuracies when the run spans multiple ranges), but > it also seems useful to be able to look at parts of the index. FWIW in my experience, the per-leaf-page "nhtids:nhblks" tends to be fairly consistent across all leaf pages from a given index. There are no doubt some exceptions, but they're probably pretty rare. > Second, the index is walked sequentially in physical order, from block 0 > to the last block. But that's not really what the index prefetch sees. > To make it "more accurate" it'd be better to just scan the leaf pages as > if during a "full index scan". Why not just do it that way to begin with? It wouldn't be complicated to make the function follow a chain of right sibling links. I suggest an interface that takes a block number, and an nblocks int8 argument that must be >= 1. The function would start from the block number arg leaf page. If it's not a non-ignorable leaf page, throw an error. Otherwise, count the number of distinct heap blocks on the leaf page, and count the number of heap blocks on each additional leaf page to the right -- until we've counted the heap blocks from nblocks-many leaf pages (or until we reach the rightmost leaf page). I suggest that a P_IGNORE() page shouldn't have its heap blocks counted, and shouldn't count towards our nblocks tally of leaf pages whose heap blocks are to be counted. Upon encountering a P_IGNORE() page, just move to the right without doing anything. Note that the rightmost page cannot be P_IGNORE(). This scheme will always succeed, no matter the nblocks argument, provided the initial leaf page is a valid leaf page (and provided the nblocks arg is >= 1). I get that this is just a prototype that might not go anywhere, but the scheme I've described requires few changes. -- Peter Geoghegan
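In SQL terms, the proposed interface would be used along these lines (the function name here is purely illustrative, nothing by this name exists in any posted patch): pick a starting leaf block, then let the function walk nblocks-many leaf pages to the right, skipping P_IGNORE() pages and stopping early at the rightmost leaf:

  -- confirm the chosen starting block is a leaf page ('l')
  SELECT type FROM bt_page_stats('order_line_pkey', 1000);

  -- hypothetical: (index name, starting leaf block, number of leaf pages to count)
  SELECT * FROM pgstatindex_leaf_nhblks('order_line_pkey', 1000, 32);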
On 7/24/25 16:40, Peter Geoghegan wrote: > On Thu, Jul 24, 2025 at 7:19 AM Tomas Vondra <tomas@vondra.me> wrote: >> I got a bit bored yesterday, so I gave this a try and whipped up a patch >> that adds two pgstattuple functins that I think could be useful for >> analyzing index metrics that matter for prefetching. > > This seems quite useful. > > I notice that you're not accounting for posting lists. That'll lead to > miscounts of the number of heap blocks in many cases. I think that > that's worth fixing, even given that this patch is experimental. > Yeah, I forgot about that. Should be fixed in the v2. Admittedly I don't know that much about nbtree internals, so this is mostly copy pasting from verify_nbtree. >> It's trivial to summarize this into a per-index statistic (of course, >> there may be some inaccuracies when the run spans multiple ranges), but >> it also seems useful to be able to look at parts of the index. > > FWIW in my experience, the per-leaf-page "nhtids:nhblks" tends to be > fairly consistent across all leaf pages from a given index. There are > no doubt some exceptions, but they're probably pretty rare. > Yeah, probably. And we'll probably test on such uniform data sets, or at least we we'll start with those. But at some point I'd like to test with some of these "weird" indexes too, if only to test how well the prefetch heuristics adjusts the distance. >> Second, the index is walked sequentially in physical order, from block 0 >> to the last block. But that's not really what the index prefetch sees. >> To make it "more accurate" it'd be better to just scan the leaf pages as >> if during a "full index scan". > > Why not just do it that way to begin with? It wouldn't be complicated > to make the function follow a chain of right sibling links. > I have a very good reason why I didn't do it that way. I was lazy. But v2 should be doing that, I think. > I suggest an interface that takes a block number, and an nblocks int8 > argument that must be >= 1. The function would start from the block > number arg leaf page. If it's not a non-ignorable leaf page, throw an > error. Otherwise, count the number of distinct heap blocks on the leaf > page, and count the number of heap blocks on each additional leaf page > to the right -- until we've counted the heap blocks from nblocks-many > leaf pages (or until we reach the rightmost leaf page). > Yeah, this interface seems useful. I suppose it'll be handy when looking at an index scan, to get stats from the currently loaded batches. In principle you get that from v3 by filtering, but it might be slow on large indexes. I'll try doing that in v3. > I suggest that a P_IGNORE() page shouldn't have its heap blocks > counted, and shouldn't count towards our nblocks tally of leaf pages > whose heap blocks are to be counted. Upon encountering a P_IGNORE() > page, just move to the right without doing anything. Note that the > rightmost page cannot be P_IGNORE(). > I think v2 does all of this. > This scheme will always succeed, no matter the nblocks argument, > provided the initial leaf page is a valid leaf page (and provided the > nblocks arg is >= 1). > > I get that this is just a prototype that might not go anywhere, but > the scheme I've described requires few changes. > Yep, thanks. -- Tomas Vondra
Attachments
On Thu, Jul 24, 2025 at 7:52 PM Tomas Vondra <tomas@vondra.me> wrote: > Yeah, I forgot about that. Should be fixed in the v2. Admittedly I don't > know that much about nbtree internals, so this is mostly copy pasting > from verify_nbtree. As long as the scan only moves to the right (never the left), and as long as you don't forget about P_IGNORE() pages, everything should be fairly straightforward. You don't really need to understand things like page deletion, and you'll never need to hold more than a single buffer lock at a time, provided you stick to the happy path. I've taken a quick look at v2, and it looks fine to me. It's acceptable for the purpose that you have in mind, at least. > Yeah, probably. And we'll probably test on such uniform data sets, or at > least we we'll start with those. But at some point I'd like to test with > some of these "weird" indexes too, if only to test how well the prefetch > heuristics adjusts the distance. That makes perfect sense. I was just providing context. > I have a very good reason why I didn't do it that way. I was lazy. But > v2 should be doing that, I think. I respect that. That's why I framed my feedback as "it'll be less effort to just do it than to explain why you haven't done so". :-) > Yeah, this interface seems useful. I suppose it'll be handy when looking > at an index scan, to get stats from the currently loaded batches. In > principle you get that from v3 by filtering, but it might be slow on > large indexes. I'll try doing that in v3. Cool. -- Peter Geoghegan
Hi, I ran some more tests, comparing the two patches, using data sets generated in a way to have a more gradual transition between correlated and random cases. I'll explain how the new benchmark generates data sets (the goal, and limitations). Then I'll discuss some of the results. And then there's a brief conclusion / next steps for the index prefetching ...

data sets
---------

I experimented with several ways to generate such data sets, and what I ended up doing is this:

  INSERT INTO t
  SELECT i, md5(i::text)
    FROM generate_series(1, $rows) s(i)
   ORDER BY i + $fuzz * (random() - 0.5)

See the "generate-*.sh" scripts for the exact details. The basic idea is that we generate a sequence of $rows values, but we also allow the values to jump a random distance determined by $fuzz. With fuzz=0 we get perfect correlation, with fuzz=1 the value can move by one position, with fuzz=1000 it can move by up to 1000 positions, and so on. For very high fuzz (~rows) this will be close to random. So this fuzz is the primary knob. Obviously, incrementing fuzz by "1" would mean way too many tests, with very little change. Instead, I used the doubling strategy - 0, 1, 2, 4, 8, 16, 32, 64, ... $rows. This way it takes only about 25 steps for the $fuzz to exceed $rows=10M. I also used "fillfactor" as another knob, determining how many items fit on a heap page. I used 20-40-60-80-100, but this turned out to not have much impact. From now on I'll use fillfactor=20; results and charts for other fillfactors are in the github repo [1]. I generated some charts visualizing the data sets - see [2] and [3] (there are also PDFs, but those are pretty huge). Those charts show percentiles of blocks vs. values, in either dimension. [2] shows percentiles of "value" (from the sequence) for 1MB chunks. It seems very correlated (simple "diagonal" line), because the ranges are so narrow. But at fuzz ~256k the randomness starts to show. [3] shows the other direction, i.e. percentiles of heap blocks for ranges of values. But the patterns are almost exactly the same, it's very symmetrical. Fuzz -1 means "random with uniform distribution". It's clear the "most random" data set (fuzz ~8M) is still quite different, there's still some correlation. But the behavior seems fairly close to random. I don't claim those data sets are perfect, or a great representation of particular (real-world) data sets. But it seems like a much nicer transition between random and correlated data sets. I have some ideas how to evolve this, for example to introduce some duplicate (and not unique) values, and also longer runs. The other thing that annoys me a bit is the weird behavior close to the beginning/end of the table, where the percentiles get closer and closer. I suspect this might affect runs that happen to hit those parts, adding some "noise" into the results.

results
-------

I'm going to talk about results from the Ryzen machine with NVMe RAID, with 10M rows (which is about 3.7GB with fillfactor=20) [4]. There are also results from "ryzen / SATA RAID" and "Xeon / NVMe", and 1M data sets. But the conclusions are almost exactly the same, as with earlier benchmarks.

- ryzen-nvme-cold-10000000-20-16-scaled.pdf [5]

This compares master, simple and complex prefetch with different io_method values (in columns), and fuzz values (in rows, starting from fuzz=0). In most cases the two patches perform fairly close - the green and red data series mostly overlap.
But there are cases where the complex patch performs much better - especially for low fuzz values. Which is not surprising, because those cases require higher prefetch distance, and the complex patch can do that. It surprised me a bit the complex patch can actually help even cases where I'd not expect prefetching to help very much - e.g. fuzz=0 is perfectly correlated, I'd expect read-ahead to work just fine. Yet the complex patch can help ~2x (at least when scanning larger fraction of the data). - ryzen-nvme-cold-10000000-20-16-scaled-relative.pdf Some of the differences are more visible on this chart, which shows patches relative to master (so 1.0 means "as fast as master", while 0.5 means 2x faster, etc). I think there are a couple "fuzz ranges" with distinct behaviors: * 0-1: simple does mostly on par with "master", complex is actually quite a bit faster * 2-4: both mostly on par with master * 8-256: zone of regressions (compared to master) * 512-64K: mixed results (good for low selectivity, then regression) * 128K+: clear benefits The results from the other systems follow this pattern too, although the ranges may be shifted a bit. There are some interesting differences between the io_method values. In a number of cases the "sync" method performs much worse than "worker" and "io_uring" - which is not entirely surprising, but it just supports my argument we should stick with "worker" as default for PG18. But that's not the topic of this thread. There are also a couple cases where "simple" performs better than "complex". But most of the time this is only for the "sync" iomethod, and when scanning significant fraction of the data (10%+). So that doesn't seem like a great argument in favor of the simple patch, considering "sync" is not a proper AIO method, I've been arguing against using it as a default, and with methods like "worker" the "complex" patch often performs better ... conclusion ---------- Let's say the complex patch is the way to go. What are the open problems / missing parts we need to address to make it committable? I can think of these issues. I'm sure the list is incomplete and there are many "smaller issues" and things I haven't even thought about: 1) Making sure the interface can work for other index AMs (both in core and out-of-core), including cases like GiST etc. 2) Proper layering between index AM and table AM (like the TID issue pointed out by Andres some time ago). 3) Allowing more flexible management of prefetch distance (this might involve something like the "scan manager" idea suggested by Peter), various improvements to ReadStream heuristics, etc. 4) More testing to minimize the risk of regressions. 5) Figuring out how to make this work for IOS (the simple patch has some special logic in the callback, which may not be great, not sure what's the right solution in the complex patch). 6) ???? 
regards [1] https://github.com/tvondra/index-prefetch-tests-2 [2] https://github.com/tvondra/index-prefetch-tests-2/blob/master/visualize/datasets.png [3] https://github.com/tvondra/index-prefetch-tests-2/blob/master/visualize/datasets-2.png [4] https://github.com/tvondra/index-prefetch-tests-2/tree/master/ryzen-nvme/10 [5] https://github.com/tvondra/index-prefetch-tests-2/blob/master/ryzen-nvme/10/ryzen-nvme-cold-10000000-20-16-scaled.pdf [6] https://github.com/tvondra/index-prefetch-tests-2/blob/master/ryzen-nvme/10/ryzen-nvme-cold-10000000-20-16-scaled-relative.pdf [7] https://github.com/tvondra/index-prefetch-tests-2/blob/master/xeon-nvme/10/xeon-nvme-cold-10000000-20-16-scaled-relative.pdf [8] https://github.com/tvondra/index-prefetch-tests-2/blob/master/ryzen-sata/10/ryzen-sata-cold-10000000-20-16-scaled-relative.pdf -- Tomas Vondra
Attachments
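For reference, a minimal standalone version of the generator described in this message (outside the generate-*.sh scripts) might look like the following; the table definition, column names, and the fuzz value of 1024 are assumptions made for illustration, and the scripts in the repo [1] are authoritative:

  -- 10M rows, fillfactor 20, fuzz = 1024
  CREATE TABLE t (a bigint, b text) WITH (fillfactor = 20);

  INSERT INTO t
  SELECT i, md5(i::text)
    FROM generate_series(1, 10000000) s(i)
   ORDER BY i + 1024 * (random() - 0.5);

  CREATE INDEX ON t (a);
  VACUUM ANALYZE t;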
On Tue, Aug 5, 2025 at 10:52 AM Tomas Vondra <tomas@vondra.me> wrote: > I ran some more tests, comparing the two patches, using data sets > generated in a way to have a more gradual transition between correlated > and random cases. Cool. > So this fuzz is the primary knob. Obviously, incrementing fuzz by "1" it > would be way too many tests, with very little change. Instead, I used > the doubling strategy - 0, 1, 2, 4, 8, 16, 32, 64, ... $rows. This way > it takes only about ~25 steps for the $fuzz to exceed $rows=10M. I think that it probably makes sense to standardize on using fewer distinct "fuzz" settings than this going forward. It's useful to test more things at first, but I expect that the performance impact of changes from a given new patch revision will become important soon. > I don't claim those data sets are perfect, or a great representation of > particular (real-world) data sets. It seems like a much nicer transition > between random and correlated data sets. That makes sense to me. A test suite that is representative of real-world usage patterns isn't so important. But it is important that we have at least one test for each interesting variation of an index scan. What exactly that means is subject to interpretation, and will likely evolve over time. But the general idea is that we should choose tests that experience has shown to be particularly good at highlighting the advantages or disadvantages of one approach over another (e.g., simple vs complex). It's just as important that we cut tests that don't seem to tell us anything we can't get from some other tests. I suspect that many $fuzz values aren't at all interesting. We double to get each increment, but that probably isn't all that informative, outside of the extremes. It'd also be good to just not test "sync" anymore, at some point. And maybe to standardize on testing either "worker" or "io_uring" for most individual tests. There's just too many tests right now. > In most cases the two patches perform fairly close - the green and red > data series mostly overlap. But there are cases where the complex patch > performs much better - especially for low fuzz values. Which is not > surprising, because those cases require higher prefetch distance, and > the complex patch can do that. Right. > It surprised me a bit the complex patch can actually help even cases > where I'd not expect prefetching to help very much - e.g. fuzz=0 is > perfectly correlated, I'd expect read-ahead to work just fine. Yet the > complex patch can help ~2x (at least when scanning larger fraction of > the data). Maybe it has something to do with reading multiple leaf pages together leading to fewer icache misses. Andres recently told me that he isn't expecting to be able to simulate read-ahead with direct I/O. It seems possible that read-ahead eventually won't be used at all, which argues for the complex patch. BTW, I experimented with using READ_STREAM_USE_BATCHING (not READ_STREAM_DEFAULT) in the complex patch. That's probably deadlock-prone, but I suspect that it works well enough to get a good sense of what is possible. What I saw (with that same TPC-C test query) was that "I/O Timings" was about 10x lower, even though the query runtime didn't change at all. This suggests to me that "I/O Timings" is an independently interesting measure: getting it lower might not visibly help when only one query runs, but it'll likely still lead to more efficient use of available I/O bandwidth in the aggregate (when many queries run at the same time). 
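For anyone reproducing the "I/O Timings" comparison, that figure comes from EXPLAIN with BUFFERS while track_io_timing is enabled; reusing the TPC-C query from earlier in the thread (any of the other test queries works the same way):

  SET track_io_timing = on;

  EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
  SELECT sum(ol_amount) FROM order_line WHERE ol_w_id < 10;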
> There are also a couple cases where "simple" performs better than > "complex". But most of the time this is only for the "sync" iomethod, > and when scanning significant fraction of the data (10%+). So that > doesn't seem like a great argument in favor of the simple patch, > considering "sync" is not a proper AIO method, I've been arguing against > using it as a default, and with methods like "worker" the "complex" > patch often performs better ... I suspect that this is just a case of "sync" making aggressive prefetching a bad idea in general. > Let's say the complex patch is the way to go. What are the open problems > / missing parts we need to address to make it committable? I think that what you're interested in here is mostly project risk -- things that come with a notable risk of blocking commit/significantly undermining our general approach. > I can think of these issues. I'm sure the list is incomplete and there > are many "smaller issues" and things I haven't even thought about: I have a list of issues to solve in my personal notes. Most of them aren't particularly important. > 1) Making sure the interface can work for other index AMs (both in core > and out-of-core), including cases like GiST etc. What would put your mind at ease here? Maybe you'd feel better about this if we also implemented prefetching for at least one other index AM. Maybe GiST, since it's likely both the next-hardest and next most important index AM (after nbtree). Right now, I'm not motivated to work on the patch at all, since it's still not clear that any of it has buy-in from you. I'm willing to do more work to try to convince you, but it's not clear what it would take/where your doubts are. I'm starting to be concerned about that just never happening, quite honestly. Getting a feature of this complexity into committable shape requires laser focus. > 2) Proper layering between index AM and table AM (like the TID issue > pointed out by Andres some time ago). > > 3) Allowing more flexible management of prefetch distance (this might > involve something like the "scan manager" idea suggested by Peter), > various improvements to ReadStream heuristics, etc. The definition of "scan manager" is quite fuzzy right now. I think that the "complex" patch already implements a very basic version of that idea. To me, the important point was always that the general design/API of index prefetching be structured in a way that would allow us to accomodate more sophisticated strategies. As I've said many times, somebody needs to see all of the costs and all of the benefits -- that's what's needed to make optimal choices. > 4) More testing to minimize the risk of regressions. > > 5) Figuring out how to make this work for IOS (the simple patch has some > special logic in the callback, which may not be great, not sure what's > the right solution in the complex patch). I agree that all these items are probably the biggest risks to the project. I'm not sure that I can attribute this to the use of the "complex" approach over the "simple" approach. > 6) ???? I guess that this means "unknown unknowns", which are another significant risk. -- Peter Geoghegan
On 8/5/25 19:19, Peter Geoghegan wrote: > On Tue, Aug 5, 2025 at 10:52 AM Tomas Vondra <tomas@vondra.me> wrote: >> I ran some more tests, comparing the two patches, using data sets >> generated in a way to have a more gradual transition between correlated >> and random cases. > > Cool. > >> So this fuzz is the primary knob. Obviously, incrementing fuzz by "1" it >> would be way too many tests, with very little change. Instead, I used >> the doubling strategy - 0, 1, 2, 4, 8, 16, 32, 64, ... $rows. This way >> it takes only about ~25 steps for the $fuzz to exceed $rows=10M. > > I think that it probably makes sense to standardize on using fewer > distinct "fuzz" settings than this going forward. It's useful to test > more things at first, but I expect that the performance impact of > changes from a given new patch revision will become important soon. > Probably. It was hard to predict which values will be interesting, maybe we can pick some subset now. I'll start by just doing larger steps, I think. Maybe increase by 4x rather than 2x, that'll reduce the number of combinations a lot. Also, I plan to stick to fillfactor=20, it doesn't seem to have a lot of impact anyway. >> I don't claim those data sets are perfect, or a great representation of >> particular (real-world) data sets. It seems like a much nicer transition >> between random and correlated data sets. > > That makes sense to me. > > A test suite that is representative of real-world usage patterns isn't > so important. But it is important that we have at least one test for > each interesting variation of an index scan. What exactly that means > is subject to interpretation, and will likely evolve over time. But > the general idea is that we should choose tests that experience has > shown to be particularly good at highlighting the advantages or > disadvantages of one approach over another (e.g., simple vs complex). > True. These tests use very simple queries, with a single range clause. So what other index scan variations would you suggest to test? I can imagine e.g. IN () conditions with variable list length, maybe multi-column indexes, and/or skip scan cases. Any other ideas? FWIW I'm not planning to keep testing simple vs complex patches. We've seen the complex patch can do much better in certain workloads cases, the fact that we can discover more such cases does not change much. I'm much more interested in benchmarking master vs. complex patch. > It's just as important that we cut tests that don't seem to tell us > anything we can't get from some other tests. I suspect that many $fuzz > values aren't at all interesting. We double to get each increment, but > that probably isn't all that informative, outside of the extremes. > > It'd also be good to just not test "sync" anymore, at some point. And > maybe to standardize on testing either "worker" or "io_uring" for most > individual tests. There's just too many tests right now. > Agreed. >> In most cases the two patches perform fairly close - the green and red >> data series mostly overlap. But there are cases where the complex patch >> performs much better - especially for low fuzz values. Which is not >> surprising, because those cases require higher prefetch distance, and >> the complex patch can do that. > > Right. > >> It surprised me a bit the complex patch can actually help even cases >> where I'd not expect prefetching to help very much - e.g. fuzz=0 is >> perfectly correlated, I'd expect read-ahead to work just fine. 
Yet the >> complex patch can help ~2x (at least when scanning larger fraction of >> the data). > > Maybe it has something to do with reading multiple leaf pages together > leading to fewer icache misses. > Maybe, not sure. > Andres recently told me that he isn't expecting to be able to simulate > read-ahead with direct I/O. It seems possible that read-ahead > eventually won't be used at all, which argues for the complex patch. > True, the complex patch could prefetch the leaf pages. > BTW, I experimented with using READ_STREAM_USE_BATCHING (not > READ_STREAM_DEFAULT) in the complex patch. That's probably > deadlock-prone, but I suspect that it works well enough to get a good > sense of what is possible. What I saw (with that same TPC-C test > query) was that "I/O Timings" was about 10x lower, even though the > query runtime didn't change at all. This suggests to me that "I/O > Timings" is an independently interesting measure: getting it lower > might not visibly help when only one query runs, but it'll likely > still lead to more efficient use of available I/O bandwidth in the > aggregate (when many queries run at the same time). > Interesting. Does that mean we should try enabling batching in some cases? Or just that there's room for improvement? Could we do the next_block callbacks in a way that make deadlocks impossible? I'm not that familiar with the batch mode - how would the deadlock even happen in index scans? I suppose there might be two index scans in opposite directions, requesting the pages in different order. Or even just index scans with different keys, that happen to touch the heap pages in different order. Also, if we "streamify" the access to leaf pages, there could be a deadlock between the two streams. Not sure how to prevent these cases. >> There are also a couple cases where "simple" performs better than >> "complex". But most of the time this is only for the "sync" iomethod, >> and when scanning significant fraction of the data (10%+). So that >> doesn't seem like a great argument in favor of the simple patch, >> considering "sync" is not a proper AIO method, I've been arguing against >> using it as a default, and with methods like "worker" the "complex" >> patch often performs better ... > > I suspect that this is just a case of "sync" making aggressive > prefetching a bad idea in general. > >> Let's say the complex patch is the way to go. What are the open problems >> / missing parts we need to address to make it committable? > > I think that what you're interested in here is mostly project risk -- > things that come with a notable risk of blocking commit/significantly > undermining our general approach. > In a way, yes. I'm interested in anything I have not thought about. >> I can think of these issues. I'm sure the list is incomplete and there >> are many "smaller issues" and things I haven't even thought about: > > I have a list of issues to solve in my personal notes. Most of them > aren't particularly important. > Good to hear. >> 1) Making sure the interface can work for other index AMs (both in core >> and out-of-core), including cases like GiST etc. > > What would put your mind at ease here? Maybe you'd feel better about > this if we also implemented prefetching for at least one other index > AM. Maybe GiST, since it's likely both the next-hardest and next most > important index AM (after nbtree). > > Right now, I'm not motivated to work on the patch at all, since it's > still not clear that any of it has buy-in from you. 
I'm willing to do > more work to try to convince you, but it's not clear what it would > take/where your doubts are. I'm starting to be concerned about that > just never happening, quite honestly. Getting a feature of this > complexity into committable shape requires laser focus. > I think the only way is to try reworking some of the index AMs to use the new interface. For some AMs (e.g. hash) it's going to be very similar to what you did with btree, because it basically works like a btree. For others (GiST/SP-GiST) it may be more work. Not sure about out-of-core AMs, like pgvector etc. That may be a step too far / too much work. It doesn't need to be committable, just good enough to be reasonably certain it's possible. >> 2) Proper layering between index AM and table AM (like the TID issue >> pointed out by Andres some time ago). >> >> 3) Allowing more flexible management of prefetch distance (this might >> involve something like the "scan manager" idea suggested by Peter), >> various improvements to ReadStream heuristics, etc. > > The definition of "scan manager" is quite fuzzy right now. I think > that the "complex" patch already implements a very basic version of > that idea. > > To me, the important point was always that the general design/API of > index prefetching be structured in a way that would allow us to > accomodate more sophisticated strategies. As I've said many times, > somebody needs to see all of the costs and all of the benefits -- > that's what's needed to make optimal choices. > Understood, and I agree in principle. It's just that given the fuzziness I find it hard how it should look like. >> 4) More testing to minimize the risk of regressions. >> >> 5) Figuring out how to make this work for IOS (the simple patch has some >> special logic in the callback, which may not be great, not sure what's >> the right solution in the complex patch). > > I agree that all these items are probably the biggest risks to the > project. I'm not sure that I can attribute this to the use of the > "complex" approach over the "simple" approach. > True, most of these points applies to both patches - including the IOS handling in callback. And the complex patch could do it the same way, except there would be just one callback, not a callback per index AM. >> 6) ???? > > I guess that this means "unknown unknowns", which are another significant risk. > Yeah, that's what I meant. Sorry, I should have been more explicit. regards -- Tomas Vondra
On Tue, Aug 5, 2025 at 4:56 PM Tomas Vondra <tomas@vondra.me> wrote: > Probably. It was hard to predict which values will be interesting, maybe > we can pick some subset now. I'll start by just doing larger steps, I > think. Maybe increase by 4x rather than 2x, that'll reduce the number of > combinations a lot. Also, I plan to stick to fillfactor=20, it doesn't > seem to have a lot of impact anyway. I don't think that fillfactor matters all that much, either way. A low setting provides a simple way of simulating "wide" heap tuples, but that probably isn't going to make the crucial difference. It's not like the TPC-C index I used in my own recent testing (which showed that the complex patch was almost 3x faster than the simple patch) has all that strong of a pg_stats.correlation. You can probably come up with indexes/test cases where groups of related TIDs that each point to the same heap block appear together, even though in general the index tuple heap TIDs appear completely out of order. It probably isn't even that different to a simple pgbench_accounts_pkey from a prefetching POV, though, in spite of these rather conspicuous differences. In time we might find that just using pgbench_accounts_pkey directly works just as well for our purposes (unsure of that, but seems possible). > So what other index scan variations would you suggest to test? I can > imagine e.g. IN () conditions with variable list length, maybe > multi-column indexes, and/or skip scan cases. Any other ideas? The only thing that's really interesting about IN() conditions is that they provide an easy way to write a query that only returns a subset of all index tuples from every leaf page read. You can get a similar access pattern from other types of quals, but that's not quite as intuitive. I really don't think that IN() conditions are all that special. They're perfectly fine as a way of getting this general access pattern. I like to look for and debug "behavioral inconsistencies". For example, I have an open item in my notes (which I sent to you over IM a short while ago) about a backwards scan that is significantly slower than an "equivalent" forwards scan. This involves pgbench_accounts_pkey. It's quite likely that the underlying problem has nothing much to do with backwards scans. I suspect that the underlying problem is a more general one, that could also be seen with the right forwards scan test case. In general, it might make the most sense to look for pairs of similar-ish queries that are inconsistent in a way that doesn't make sense intuitively, in order to understand and fix the inconsistency. Since chances are that it's actually just some kind of performance bug that accidentally doesn't happen in only one variant of the query. I bet that there's at least a couple of not-that-noticeable performance bugs, for example due to some hard to pin down issue with prefetch distance getting out of hand. Possibly because the read stream doesn't get to see contiguous requests for TIDs that point to the same heap page, but does see it when things are slightly out of order. Two different queries that have approximately the same accesses should have approximately the same performance -- minor variations in leaf page layout or heap page layout or scan direction shouldn't be confounding. > FWIW I'm not planning to keep testing simple vs complex patches. We've > seen the complex patch can do much better in certain workloads cases, > the fact that we can discover more such cases does not change much. 
> > I'm much more interested in benchmarking master vs. complex patch. Great! > > It'd also be good to just not test "sync" anymore, at some point. And > > maybe to standardize on testing either "worker" or "io_uring" for most > > individual tests. There's just too many tests right now. > > > > Agreed. Might also make sense to standardize on direct I/O when testing the patch (but probably not when testing master). The fact that we can't get any OS readahead is likely to be useful. > > Andres recently told me that he isn't expecting to be able to simulate > > read-ahead with direct I/O. It seems possible that read-ahead > > eventually won't be used at all, which argues for the complex patch. > > > > True, the complex patch could prefetch the leaf pages. What I meant was that the complex patch can make up for the fact that direct I/O presumably won't ever have an equivalent to simple read-ahead. Just by having a very flexible prefetching implementation (and without any special sequential access heuristics ever being required). > > BTW, I experimented with using READ_STREAM_USE_BATCHING (not > > READ_STREAM_DEFAULT) in the complex patch. That's probably > > deadlock-prone, but I suspect that it works well enough to get a good > > sense of what is possible. What I saw (with that same TPC-C test > > query) was that "I/O Timings" was about 10x lower, even though the > > query runtime didn't change at all. This suggests to me that "I/O > > Timings" is an independently interesting measure: getting it lower > > might not visibly help when only one query runs, but it'll likely > > still lead to more efficient use of available I/O bandwidth in the > > aggregate (when many queries run at the same time). > > > > Interesting. Does that mean we should try enabling batching in some > cases? Or just that there's room for improvement? I don't know what it means myself. I never got as far as even starting to understand what it would take to make READ_STREAM_USE_BATCHING work. AFAIK it wouldn't be hard to make that work here at all, in which case we should definitely use it. OTOH, maybe it's really hard. I just don't know right now. > Could we do the next_block callbacks in a way that make deadlocks > impossible? > > I'm not that familiar with the batch mode - how would the deadlock even > happen in index scans? I have no idea. Maybe it's already safe. I didn't notice any problems (but didn't look for them, beyond running my tests plus the regression tests). > I think the only way is to try reworking some of the index AMs to use > the new interface. For some AMs (e.g. hash) it's going to be very > similar to what you did with btree, because it basically works like a > btree. For others (GiST/SP-GiST) it may be more work. The main difficulty with GiST may be that we may be obligated to fix existing (unfixed!) bugs that affect index-only scans. The master branch is subtly broken, but we can't in good conscience ignore those problems while making these kinds of changes. > It doesn't need to be committable, just good enough to be reasonably > certain it's possible. That's what I have in mind, too. If we have support for a second index AM, then we're much less likely to over-optimize for nbtree in a way that doesn't really make sense. > Understood, and I agree in principle. It's just that given the fuzziness > I find it hard how it should look like. I suspect that index AMs are much more similar for the purposes of prefetching than they are in other ways. -- Peter Geoghegan
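To make the standardization suggested above concrete (worker or io_uring as the io_method, plus direct I/O for the patched runs), the setup would look roughly like this. A sketch only - the values are illustrative, not recommendations, and both io_method and debug_io_direct take effect only after a restart:

ALTER SYSTEM SET io_method = 'worker';            -- or 'io_uring'; drop 'sync' from the matrix
ALTER SYSTEM SET debug_io_direct = 'data';        -- no OS readahead; for patched runs only
ALTER SYSTEM SET effective_io_concurrency = 64;   -- illustrative cap on prefetch distance
-- restart the server before running the test queries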
On Wed, Aug 6, 2025 at 9:35 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Aug 5, 2025 at 4:56 PM Tomas Vondra <tomas@vondra.me> wrote: > > True, the complex patch could prefetch the leaf pages. There must be a similar opportunity for parallel index scans. It has that "seize the scan" concept where parallel workers do one-at-a-time locked linked list leapfrog. > What I meant was that the complex patch can make up for the fact that > direct I/O presumably won't ever have an equivalent to simple > read-ahead. Just by having a very flexible prefetching implementation > (and without any special sequential access heuristics ever being > required). I'm not so sure, there are certainly opportunities in different layers of the system. I'm going to dust off a couple of experimental patches (stuff I talked to Peter about back in Athens), and try to describe some other vague ideas Andres and I have bounced around over the past few years when chatting about what you lose when you turn on direct I/O. Basically, the stuff that we can't fix with "precise" I/O streaming as I like to call it, where it might still be interesting to think about opportunities to do fuzzier speculative lookahead. I'll start a new thread.
On 8/5/25 23:35, Peter Geoghegan wrote: > On Tue, Aug 5, 2025 at 4:56 PM Tomas Vondra <tomas@vondra.me> wrote: >> Probably. It was hard to predict which values will be interesting, maybe >> we can pick some subset now. I'll start by just doing larger steps, I >> think. Maybe increase by 4x rather than 2x, that'll reduce the number of >> combinations a lot. Also, I plan to stick to fillfactor=20, it doesn't >> seem to have a lot of impact anyway. > > I don't think that fillfactor matters all that much, either way. A low > setting provides a simple way of simulating "wide" heap tuples, but > that probably isn't going to make the crucial difference. > Agreed. > It's not like the TPC-C index I used in my own recent testing (which > showed that the complex patch was almost 3x faster than the simple > patch) has all that strong of a pg_stats.correlation. You can probably > come up with indexes/test cases where groups of related TIDs that each > point to the same heap block appear together, even though in general > the index tuple heap TIDs appear completely out of order. It probably > isn't even that different to a simple pgbench_accounts_pkey from a > prefetching POV, though, in spite of these rather conspicuous > differences. In time we might find that just using > pgbench_accounts_pkey directly works just as well for our purposes > (unsure of that, but seems possible). > That's quite possible. What concerns me about using tables like pgbench accounts table is reproducibility - initially it's correlated, and then it gets "randomized" by the workload. But maybe the exact pattern depends on the workload - how many clients, how long, how it correlates with vacuum, etc. Reproducing the dataset might be quite tricky. That's why I prefer using "reproducible" data sets. I think the data sets with "fuzz" seem like a pretty good model. I plan to experiment with adding some duplicate values / runs, possibly with two "levels" of randomness (global for all runs, and smaller local perturbations). >> So what other index scan variations would you suggest to test? I can >> imagine e.g. IN () conditions with variable list length, maybe >> multi-column indexes, and/or skip scan cases. Any other ideas? > > The only thing that's really interesting about IN() conditions is that > they provide an easy way to write a query that only returns a subset > of all index tuples from every leaf page read. You can get a similar > access pattern from other types of quals, but that's not quite as > intuitive. > > I really don't think that IN() conditions are all that special. > They're perfectly fine as a way of getting this general access > pattern. > OK > I like to look for and debug "behavioral inconsistencies". For > example, I have an open item in my notes (which I sent to you over IM > a short while ago) about a backwards scan that is significantly slower > than an "equivalent" forwards scan. This involves > pgbench_accounts_pkey. It's quite likely that the underlying problem > has nothing much to do with backwards scans. I suspect that the > underlying problem is a more general one, that could also be seen with > the right forwards scan test case. > > In general, it might make the most sense to look for pairs of > similar-ish queries that are inconsistent in a way that doesn't make > sense intuitively, in order to understand and fix the inconsistency. > Since chances are that it's actually just some kind of performance bug > that accidentally doesn't happen in only one variant of the query. 
> Yeah, cases like that are interesting. I plan to do some randomized testing, exploring "strange" combinations of parameters, looking for weird behaviors like that. The question is what parameters to consider - the data distributions is one such parameter. Different "types" of scans are another. > I bet that there's at least a couple of not-that-noticeable > performance bugs, for example due to some hard to pin down issue with > prefetch distance getting out of hand. Possibly because the read > stream doesn't get to see contiguous requests for TIDs that point to > the same heap page, but does see it when things are slightly out of > order. Two different queries that have approximately the same accesses > should have approximately the same performance -- minor variations in > leaf page layout or heap page layout or scan direction shouldn't be > confounding. > I think in a way cases like that are somewhat inherent, I wouldn't even call that "bug" probably. Any heuristics (driving the distance) will have such issues. Give me a heuristics and I'll construct an adversary case breaking it. I think the question will be how likely (and how serious) such cases are. If it's rare / limited to cases where we're unlikely to pick an index scan etc. then maybe it's OK. >> FWIW I'm not planning to keep testing simple vs complex patches. We've >> seen the complex patch can do much better in certain workloads cases, >> the fact that we can discover more such cases does not change much. >> >> I'm much more interested in benchmarking master vs. complex patch. > > Great! > >>> It'd also be good to just not test "sync" anymore, at some point. And >>> maybe to standardize on testing either "worker" or "io_uring" for most >>> individual tests. There's just too many tests right now. >>> >> >> Agreed. > > Might also make sense to standardize on direct I/O when testing the > patch (but probably not when testing master). The fact that we can't > get any OS readahead is likely to be useful. > I plan to keep testing with buffered I/O (with "io_method=worker"), simply because that's what most systems will keep using for a while. But it's a good idea to test with direct I/O too. >>> Andres recently told me that he isn't expecting to be able to simulate >>> read-ahead with direct I/O. It seems possible that read-ahead >>> eventually won't be used at all, which argues for the complex patch. >>> >> >> True, the complex patch could prefetch the leaf pages. > > What I meant was that the complex patch can make up for the fact that > direct I/O presumably won't ever have an equivalent to simple > read-ahead. Just by having a very flexible prefetching implementation > (and without any special sequential access heuristics ever being > required). > OK >>> BTW, I experimented with using READ_STREAM_USE_BATCHING (not >>> READ_STREAM_DEFAULT) in the complex patch. That's probably >>> deadlock-prone, but I suspect that it works well enough to get a good >>> sense of what is possible. What I saw (with that same TPC-C test >>> query) was that "I/O Timings" was about 10x lower, even though the >>> query runtime didn't change at all. This suggests to me that "I/O >>> Timings" is an independently interesting measure: getting it lower >>> might not visibly help when only one query runs, but it'll likely >>> still lead to more efficient use of available I/O bandwidth in the >>> aggregate (when many queries run at the same time). >>> >> >> Interesting. Does that mean we should try enabling batching in some >> cases? 
Or just that there's room for improvement? > > I don't know what it means myself. I never got as far as even starting > to understand what it would take to make READ_STREAM_USE_BATCHING > work. > > AFAIK it wouldn't be hard to make that work here at all, in which case > we should definitely use it. OTOH, maybe it's really hard. I just > don't know right now. > Same here. I read the comments about batch mode and deadlocks multiple times, and it's still not clear to me what exactly would be needed to make it safe. >> Could we do the next_block callbacks in a way that make deadlocks >> impossible? >> >> I'm not that familiar with the batch mode - how would the deadlock even >> happen in index scans? > > I have no idea. Maybe it's already safe. I didn't notice any problems > (but didn't look for them, beyond running my tests plus the regression > tests). > OK >> I think the only way is to try reworking some of the index AMs to use >> the new interface. For some AMs (e.g. hash) it's going to be very >> similar to what you did with btree, because it basically works like a >> btree. For others (GiST/SP-GiST) it may be more work. > > The main difficulty with GiST may be that we may be obligated to fix > existing (unfixed!) bugs that affect index-only scans. The master > branch is subtly broken, but we can't in good conscience ignore those > problems while making these kinds of changes. > Right, that's a valid point. The thing that worries me a bit is that the ordered scans (e.g. with reordering by distance) detach the scan from the leaf pages, i.e. the batches are no longer "tied" to a leaf page. Perhaps "worries" is not the right word - I don't think it should be a problem, but it's a difference. >> It doesn't need to be committable, just good enough to be reasonably >> certain it's possible. > > That's what I have in mind, too. If we have support for a second index > AM, then we're much less likely to over-optimize for nbtree in a way > that doesn't really make sense. > Yep. >> Understood, and I agree in principle. It's just that given the fuzziness >> I find it hard how it should look like. > > I suspect that index AMs are much more similar for the purposes of > prefetching than they are in other ways. > Probably. regards -- Tomas Vondra
On Wed, Aug 6, 2025 at 10:12 AM Tomas Vondra <tomas@vondra.me> wrote: > That's quite possible. What concerns me about using tables like pgbench > accounts table is reproducibility - initially it's correlated, and then > it gets "randomized" by the workload. But maybe the exact pattern > depends on the workload - how many clients, how long, how it correlates > with vacuum, etc. Reproducing the dataset might be quite tricky. I meant a pristine/newly created pgbench_accounts_pkey index. > That's why I prefer using "reproducible" data sets. I think the data > sets with "fuzz" seem like a pretty good model. I plan to experiment > with adding some duplicate values / runs, possibly with two "levels" of > randomness (global for all runs, and smaller local perturbations). Agreed that reproducibility is really important. > > I bet that there's at least a couple of not-that-noticeable > > performance bugs, for example due to some hard to pin down issue with > > prefetch distance getting out of hand. Possibly because the read > > stream doesn't get to see contiguous requests for TIDs that point to > > the same heap page, but does see it when things are slightly out of > > order. Two different queries that have approximately the same accesses > > should have approximately the same performance -- minor variations in > > leaf page layout or heap page layout or scan direction shouldn't be > > confounding. > > > > I think in a way cases like that are somewhat inherent, I wouldn't even > call that "bug" probably. Any heuristics (driving the distance) will > have such issues. Give me a heuristics and I'll construct an adversary > case breaking it. > > I think the question will be how likely (and how serious) such cases > are. If it's rare / limited to cases where we're unlikely to pick an > index scan etc. then maybe it's OK. It's something that needs to be considered on a case-by-case basis. But in general when I see an inconsistency like that, I'm suspicious. The difference that I see right now feels quite random and unprincipled. It's not a small difference (375.752 ms vs 465.370 ms for the backwards scan). Maybe if I go down the road of fixing this particular issue, I'll find myself playing performance whack-a-mole, where every change that benefits one query comes at some cost to some other query. But I doubt it. > I plan to keep testing with buffered I/O (with "io_method=worker"), > simply because that's what most systems will keep using for a while. But > it's a good idea to test with direct I/O too. OK. > Same here. I read the comments about batch mode and deadlocks multiple > times, and it's still not clear to me what exactly would be needed to > make it safe. It feels like the comments about READ_STREAM_USE_BATCHING could use some work. > > The main difficulty with GiST may be that we may be obligated to fix > > existing (unfixed!) bugs that affect index-only scans. The master > > branch is subtly broken, but we can't in good conscience ignore those > > problems while making these kinds of changes. > > > > Right, that's a valid point. > > The thing that worries me a bit is that the ordered scans (e.g. with > reordering by distance) detach the scan from the leaf pages, i.e. the > batches are no longer "tied" to a leaf page. > > Perhaps "worries" is not the right word - I don't think it should be a > problem, but it's a difference. Obviously, the problem that GiST ordered scans create for us isn't a new one. The new API isn't that different to the old amgettuple one in all the ways that matter here. 
amgettuple has exactly the same stipulations about holding on to buffer pins to prevent unsafe concurrent TID recycling -- stipulations that GiST currently just ignores (at least in the case of index-only scans, which cannot rely on a _bt_drop_lock_and_maybe_pin-like mechanism to avoid unsafe concurrent TID recycling hazards). If, in the end, the only solution that really works for GiST is a more aggressive/invasive one than we'd prefer, then making those changes must have been inevitable all along -- even with the old amgettuple interface. That's why I'm not too worried about GiST ordered scans; we're not making that problem any harder to solve. It's even possible that it'll be a bit *easier* to fix the problem with the new batch interface, since it somewhat normalizes the idea of hanging on to buffer pins for longer. -- Peter Geoghegan
On Tue, Aug 5, 2025 at 7:31 PM Thomas Munro <thomas.munro@gmail.com> wrote: > There must be a similar opportunity for parallel index scans. It has > that "seize the scan" concept where parallel workers do one-at-a-time > locked linked list leapfrog. True. More generally, flexibility to reorder work would be useful there. The structure of parallel B-tree scans is one where each worker performs its own "independent" index scan. The workers each only return tuples from those leaf pages that they themselves manage to read. That isn't particularly efficient, since we'll usually have to merge the "independent" index scan tuples together once again using a GatherMerge. In principle, we could avoid a GatherMerge by keeping track of the logical order of leaf pages at some higher level, and outputting tuples in that same order -- which isn't a million miles from what the batch interface that Tomas wrote already does. Imagine an enhanced version of that design where the current read_stream callback wholly farms out the work of reading leaf pages to parallel workers. Once we decouple the index page reading from the heap access, we might be able to invent the idea of "task specialization", where some workers more or less exclusively read leaf pages, and other workers more or less exclusively perform related heap accesses. > Basically, the stuff that we can't fix with "precise" I/O > streaming as I like to call it, where it might still be interesting to > think about opportunities to do fuzzier speculative lookahead. I'll > start a new thread. That sounds interesting. I worry that we won't ever be able to get away without some fallback that behaves roughly like OS readahead. -- Peter Geoghegan
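The parallel plan shape described above looks roughly like this. A sketch against pgbench_accounts - the settings merely make a parallel plan more likely, and the plan shown in the comment is approximate:

SET max_parallel_workers_per_gather = 2;
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_index_scan_size = 0;

EXPLAIN (COSTS OFF)
SELECT aid, abalance
  FROM pgbench_accounts
 WHERE aid < 1000000
 ORDER BY aid;

--  Gather Merge
--    Workers Planned: 2
--    ->  Parallel Index Scan using pgbench_accounts_pkey on pgbench_accounts
--          Index Cond: (aid < 1000000)
--
-- each worker returns tuples only from the leaf pages it happens to read,
-- and the Gather Merge re-establishes the overall index order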
On Thu, Aug 7, 2025 at 8:41 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Aug 5, 2025 at 7:31 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > There must be a similar opportunity for parallel index scans. It has > > that "seize the scan" concept where parallel workers do one-at-a-time > > locked linked list leapfrog. > > True. More generally, flexibility to reorder work would be useful there. > > The structure of parallel B-tree scans is one where each worker > performs its own "independent" index scan. The workers each only > return tuples from those leaf pages that they themselves manage to > read. That isn't particularly efficient, since we'll usually have to > merge the "independent" index scan tuples together once again using a > GatherMerge. Yeah. This all sails close to the stuff I wrote about in the post-mortem of my failed attempt to teach parallel bitmap heapscan not to throw I/O combining opportunities out the window for v18, after Melanie streamified BHS. Basically you have competing goals: * preserve natural ranges of blocks up to io_combine_limit * make workers run out of work at the same time * avoiding I/O stalls using lookahead and concurrency You can't have all three right now: I/O streams are elastic so allocation decisions made at the producer end don't control work *finishing* time, so we need something new. I wrote about an idea based on work stealing when data runs out. Read streams would work independently, but cooperate at end of stream, avoiding interlocking almost all the time. That was basically a refinement of an earlier "shared" read stream that seems too locky. (Obviously the seize-the-scan block producer is a total lockfest navitaging a link-at-a-time data structure, but let's call that a use-case specific problem.) Other architectures are surely possible too, including admitting that precise streams are not right for that problem, and using something like the PredictBlock() approach mentioned below for prefetching and then sticking to block-at-a-time work distribution. Or we could go the other way and admit that block-at-a-time is also not ideal -- what if some blocks are 10,000 times more expensive to process than others? -- and do work stealing that degrades ultimately to tuple granularity, a logical extreme position. > > Basically, the stuff that we can't fix with "precise" I/O > > streaming as I like to call it, where it might still be interesting to > > think about opportunities to do fuzzier speculative lookahead. I'll > > start a new thread. > > That sounds interesting. I worry that we won't ever be able to get > away without some fallback that behaves roughly like OS readahead. Yeah. I might write about some of these things properly but here is an unfiltered brain dump of assorted theories of varying crackpottedness and some illustrative patches that I'm *not* proposing: * you can make a dumb speculative sequential readahead stream pretty easily, but it's not entirely satisfying: here's one of the toy patches I mentioned in Athens, that shows btree leaf scans (but not parallel ones) doing that, producing nice I/O combining and concurrency that ramps up in the usual way if it happens to be sequential (well I just rebased this and didn't test it, but it should still work); I will describe some other approaches to try to place this in the space of possibilities I'm aware of... 
* you could make a stream that pulls leaf pages from higher level internal pages on demand (if you want to avoid the flow control problems that come from trying to choose a batch size up front before you know you'll even need it by using a pull approach), or just notice that it looks sequential and install a block range producer, and if that doesn't match the next page pointers by the time you get there then you destroy it and switch strategies, or something * you could just pretend it's always sequential and reset the stream every time you're wrong or some only slightly smarter scheme than that, but it's still hard to know what's going on in cooperating processes... * you could put sequential extent information in meta blocks or somehow scatter hints around... * you could instead give up on explicit streams for fuzzy problems, and teach the buffer pool to do the same tricks as the kernel, with a scheme that lets go of the pins and reacquires them later (hopefully cheaply with ReadRecentBuffer(), by leaving a trail of breadcrumbs in SMgrRelation or shared memory, similar to what I already proposed for btree root pages and another related patch that speeds up seq scans, which I plan to repost soon): SMgrRelation could hold the state necessary for the buffer pool to notice when you keep calling ReadBuffer() for sequential blocks and begin to prefetch them speculatively with growing distance heuristics so it doesn't overdo it, but somehow not hold pins on your behalf (this was one of the driving concerns that made me originally think that I needed an explicit stream as an explicit opt-in and scoping for extra pins, which an AM might not want at certain times, truncation or cleanup or something, who knows) * you could steal Linux's BM_READAHEAD concept, where speculatively loaded pages carry a special marker so they can be recognized by later ReadBuffer() calls to encourage more magic readahead, because it's measurably fruitful; this will be seen also by other backends, eg parallel workers working on the same problem, though there is a bit of an interleaving edge problem (you probably want to know if adjacent pages have the flag in some window, and I have an idea for that that doesn't involve the buffer mapping table); in other words the state tracked in SMgrRelation is only used to ignite readahead, but shared flags in or parallel with the buffer pool apply fuel * from 30,000 feet, the question is what scope you do the detection at; you can find examples of OSes that only look at one fd for sequential detection and only consider strict-next-block (old Unixen, I suspect maybe Windows but IDK), systems that have a tolerance windows (Linux), systems that search a small table of active streams no matter which fds they're coming through (ZFS does this I think, sort of like our own synchronized_scan detector), and systems that use per-page accounting to measure and amplify success, and we have analogies in our architecture as candidate scopes: explicit stream objects, the per-backend SMgrRelation, the proposed system-wide SharedSMgrRelation, the buffer pool itself, and perhaps a per-buffer hint array with relaxed access (this last is something I've experimented with, both as a way to store relaxed navigation information for sequential scans skipping the buffer mapping table and as a place to accumulate prefetch-driving statistics) * so far that's just talking about sequential heuristics, but we have many clues the kernel doesn't, it's just that they're not always reliable enough for a "precise" 
read streams and we might not want to use the approach to speculation I mentioned above where you have a read stream but as soon as it gives you something you didn't expect you have to give up completely or reset it and start feeding it again; presumably you could code around that with a fuzzy speculation buffer that tolerates a bit of disorder * you could make a better version of PrefetchBuffer() for guided prefetching, let's call it PredictBuffer(), that is initially lazy but if the predictions turn out to be correct it starts looking further ahead in the stream of predictions you made and eventually becomes quite eager, like PrefetchBuffer(), but just lazy enough to perform I/O combining; note that I'm now talking about "pushing" rather than "pulling" predictions, another central question with implications, and one of the nice things about read streams * for a totally different line of attack that goes back to precise pull-based streams, you could imagine a read stream that lets you 'peek' at data coming down the pipe as soon as it is ready (ie it's already in cache, or it isn't but the IO finishes before the stream consumer gets to it), so you can get a head start on a jump requiring I/O in a self-referential data structure like a linked list (with some obvious limitations); here is a toy patch that allows you to install such a callback, which could feed next-block information to the main block number callback, so now we have three times of interest in the I/O stream: block numbers are pulled into the producer end, valid pages are pushed out to you as soon as possible somewhere in the middle or probably often just a step ahead the producer and can feed block numbers back to it, and pages are eventually pulled out of the consumer end for processing; BUT NOTE: this idea is not entirely compatible with the lazy I/O completion draining of io_method=io_uring (or the posix_aio patch I dumped on the list the other day, and the Windows equivalent could plausibly go either way), and works much better with io_method=worker whose completions are advertised eagerly, so this implementation of the idea is a dead end, if even the goal itself is interesting, not sure * the same effect could be achieved with chained streams where the consumer-facing stream is a simple elastic queue of buffers that is fed by the real I/O stream, with the peeking in between; that would suit those I/O methods much better; it might need a new read_stream_next_buffer_conditional() that calls that WaitReadBuffersWouldStall() function, unless the consumer queue is empty and it has to call read_stream_next_buffer() which might block; the point being to periodically pump the peeking mechanism * the peek concept is pretty weak on its own because it's hard to reach a state where you have enough lookahead window that it can follow a navigational jump in time to save you from a stall but ... maybe there are streams that contain a lot of either sequential or well cached blocks with occasional jumps to random I/O; if you could somehow combine the advanced vapourware of several of these magic bullet points, then perhaps you can avoid some stalls Please take all of that with an absolutely massive grain of salt, it's just very raw ideas...

Hi, On 2025-08-06 16:12:53 +0200, Tomas Vondra wrote: > That's quite possible. What concerns me about using tables like pgbench > accounts table is reproducibility - initially it's correlated, and then > it gets "randomized" by the workload. But maybe the exact pattern > depends on the workload - how many clients, how long, how it correlates > with vacuum, etc. Reproducing the dataset might be quite tricky. > > That's why I prefer using "reproducible" data sets. I think the data > sets with "fuzz" seem like a pretty good model. I plan to experiment > with adding some duplicate values / runs, possibly with two "levels" of > randomness (global for all runs, and smaller local perturbations). > [...] > Yeah, cases like that are interesting. I plan to do some randomized > testing, exploring "strange" combinations of parameters, looking for > weird behaviors like that. I'm just catching up: Isn't it a bit early to focus this much on testing? ISTM that the patchsets for both approaches currently have some known architectural issues and that addressing them seems likely to change their performance characteristics. Greetings, Andres Freund
On 8/9/25 01:47, Andres Freund wrote: > Hi, > > On 2025-08-06 16:12:53 +0200, Tomas Vondra wrote: >> That's quite possible. What concerns me about using tables like pgbench >> accounts table is reproducibility - initially it's correlated, and then >> it gets "randomized" by the workload. But maybe the exact pattern >> depends on the workload - how many clients, how long, how it correlates >> with vacuum, etc. Reproducing the dataset might be quite tricky. >> >> That's why I prefer using "reproducible" data sets. I think the data >> sets with "fuzz" seem like a pretty good model. I plan to experiment >> with adding some duplicate values / runs, possibly with two "levels" of >> randomness (global for all runs, and smaller local perturbations). >> [...] >> Yeah, cases like that are interesting. I plan to do some randomized >> testing, exploring "strange" combinations of parameters, looking for >> weird behaviors like that. > > I'm just catching up: Isn't it a bit early to focus this much on testing? ISMT > that the patchsets for both approaches currently have some known architectural > issues and that addressing them seems likely to change their performance > characteristics. > Perhaps. For me benchmarks are a way to learn about stuff and better understand the pros/cons of approaches. It's possible some of the changes will impact the characteristics, but I doubt it can change the fundamental differences due to the simple approach being limited to a single leaf page, etc. regards -- Tomas Vondra
On Mon, Aug 11, 2025 at 10:16 AM Tomas Vondra <tomas@vondra.me> wrote: > Perhaps. For me benchmarks are a way to learn about stuff and better > understand the pros/cons of approaches. It's possible some of the > changes will impact the characteristics, but I doubt it can change the > fundamental differences due to the simple approach being limited to a > single leaf page, etc. I think that we're all now agreed that we want to take the complex patch's approach. ISTM that that development makes comparative benchmarking much less interesting, at least for the time being. IMV we should focus on cleaning up the complex patch, and on closing out at least a few open items. The main thing that I'm personally interested in right now, benchmark-wise, is cases where the complex patch doesn't perform as well as expected when we compare (say) backwards scans to forwards scans with the complex patch. In other words, I'm mostly interested in getting an overall sense of the performance profile of the complex patch -- which has nothing to do with how it performs against the master branch. I'd like to find and debug any weird performance bugs/strange discontinuities in performance. I have a feeling that there are at least a couple of those lurking in the complex patch right now. Once we have some confidence that the overall performance profile of the complex patch "makes sense", we can do more invasive refactoring (while systematically avoiding new regressions for the cases that were fixed). In summary, I think that we should focus on fixing smaller open items for now -- with an emphasis on fixing strange inconsistencies in performance for distinct-though-similar queries (pairs of queries that intuitively seem like they should perform very similarly, but somehow have very different performance). I can't really justify that, but my gut feeling is that that's the best place to focus our efforts for the time being. -- Peter Geoghegan
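A concrete example of such a pair, as a sketch - the table and range are just placeholders, the point being that the two queries do the same amount of work and differ only in scan direction:

EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT * FROM pgbench_accounts
 WHERE aid BETWEEN 1000000 AND 2000000
 ORDER BY aid ASC;    -- forward scan on pgbench_accounts_pkey

EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
SELECT * FROM pgbench_accounts
 WHERE aid BETWEEN 1000000 AND 2000000
 ORDER BY aid DESC;   -- backward scan of the same range

Intuitively the two should report similar runtimes and similar "I/O Timings"; a large gap between them is the kind of inconsistency worth debugging.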
On Thu, Aug 7, 2025 at 1:25 AM Thomas Munro <thomas.munro@gmail.com> wrote: > * you could make a stream that pulls leaf pages from higher level > internal pages on demand (if you want to avoid the flow control > problems that come from trying to choose a batch size up front before > you know you'll even need it by using a pull approach), or just notice > that it looks sequential and install a block range producer, and if > that doesn't match the next page pointers by the time you get there > then you destroy it and switch strategies, or something I was hoping that we wouldn't ever have to teach index scans to prefetch leaf pages like this. It is pretty complicated, primarily because it completely breaks with the idea of the scan having to access pages in some fixed order. (Whereas if we're just prefetching heap pages, then there is a fixed order, which makes maintaining prefetch distance relatively straightforward and index AM neutral.) It's also awkward to make such a scheme work, especially when there's any uncertainty about how many leaf pages will ultimately be read/how much work to do speculatively. There might not be that many relevant leaf pages (level 0 pages) whose block numbers are conveniently available as prefetchable downlinks/block numbers to the right of the downlink we use to descend to the first leaf page to be read (our initial downlink might be positioned towards the end of the relevant internal page at level 1). I guess we could re-read the internal page only when prefetching later leaf pages starts to look like a good idea, but that's another complicated code path to maintain. -- Peter Geoghegan
On 8/11/25 22:14, Peter Geoghegan wrote: > On Mon, Aug 11, 2025 at 10:16 AM Tomas Vondra <tomas@vondra.me> wrote: >> Perhaps. For me benchmarks are a way to learn about stuff and better >> understand the pros/cons of approaches. It's possible some of the >> changes will impact the characteristics, but I doubt it can change the >> fundamental differences due to the simple approach being limited to a >> single leaf page, etc. > > I think that we're all now agreed that we want to take the complex > patch's approach. ISTM that that development makes comparative > benchmarking much less interesting, at least for the time being. IMV > we should focus on cleaning up the complex patch, and on closing out > at least a few open items. > I agree comparing "simple" and "complex" patches is less interesting. I still plan to keep comparing "master" and "complex", mostly to look for unexpected regressions etc. > The main thing that I'm personally interested in right now, > benchmark-wise, is cases where the complex patch doesn't perform as > well as expected when we compare (say) backwards scans to forwards > scans with the complex patch. In other words, I'm mostly interested in > getting an overall sense of the performance profile of the complex > patch -- which has nothing to do with how it performs against the > master branch. I'd like to find and debug any weird performance > bugs/strange discontinuities in performance. I have a feeling that > there are at least a couple of those lurking in the complex patch > right now. Once we have some confidence that the overall performance > profile of the complex patch "makes sense", we can do more invasive > refactoring (while systematically avoiding new regressions for the > cases that were fixed). > I can do some tests with forward vs. backwards scans. Of course, the trouble with finding these weird cases is that they may be fairly rare. So hitting them is a matter or luck or just happening to generate the right data / query. But I'll give it a try and we'll see. > In summary, I think that we should focus on fixing smaller open items > for now -- with an emphasis on fixing strange inconsistencies in > performance for distinct-though-similar queries (pairs of queries that > intuitively seem like they should perform very similarly, but somehow > have very different performance). I can't really justify that, but my > gut feeling is that that's the best place to focus our efforts for the > time being. > OK -- Tomas Vondra
On Mon, Aug 11, 2025 at 5:07 PM Tomas Vondra <tomas@vondra.me> wrote: > I can do some tests with forward vs. backwards scans. Of course, the > trouble with finding these weird cases is that they may be fairly rare. > So hitting them is a matter or luck or just happening to generate the > right data / query. But I'll give it a try and we'll see. I was talking more about finding "performance bugs" through a semi-directed process of trying random things while looking out for discrepancies. Something like that shouldn't require the usual "benchmarking rigor", since suspicious inconsistencies should be fairly obvious once encountered. I expect similar queries to have similar performance, regardless of superficial differences such as scan direction, DESC vs ASC column order, etc. I tested this issue again (using my original pgbench_account query), having rebased on top of HEAD as of today. I found that the inconsistency seems to be much smaller now -- so much so that I don't think that the remaining inconsistency is particularly suspicious. I also think that performance might have improved across the board. I see that the same TPC-C query that took 768.454 ms a few weeks back now takes only 617.408 ms. Also, while I originally saw "I/O Timings: shared read=138.856" with this query, I now see "I/O Timings: shared read=46.745". That feels like a performance bug fix to me. I wonder if today's commit b4212231 from Thomas ("Fix rare bug in read_stream.c's split IO handling") fixed the issue, without anyone realizing that the bug in question could manifest like this. -- Peter Geoghegan
On Tue, Aug 12, 2025 at 11:42 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Aug 11, 2025 at 5:07 PM Tomas Vondra <tomas@vondra.me> wrote: > > I can do some tests with forward vs. backwards scans. Of course, the > > trouble with finding these weird cases is that they may be fairly rare. > > So hitting them is a matter or luck or just happening to generate the > > right data / query. But I'll give it a try and we'll see. > > I was talking more about finding "performance bugs" through a > semi-directed process of trying random things while looking out for > discrepancies. Something like that shouldn't require the usual > "benchmarking rigor", since suspicious inconsistencies should be > fairly obvious once encountered. I expect similar queries to have > similar performance, regardless of superficial differences such as > scan direction, DESC vs ASC column order, etc. I'd be interested to hear more about reverse scans. Bilal was speculating about backwards I/O combining in read_stream.c a while back, but we didn't have anything interesting to use it yet. You'll probably see a flood of uncombined 8KB IOs in the pg_aios view while travelling up the heap with cache misses today. I suspect Linux does reverse sequential prefetching with buffered I/O (less sure about other OSes) which should help but we'd still have more overheads than we could if we combined them, not to mention direct I/O. Not tested, but something like this might do it: /* Can we merge it with the pending read? */ - if (stream->pending_read_nblocks > 0 && - stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum) + if (stream->pending_read_nblocks > 0) { - stream->pending_read_nblocks++; - continue; + if (stream->pending_read_blocknum + stream->pending_read_nblocks == + blocknum) + { + stream->pending_read_nblocks++; + continue; + } + else if (stream->pending_read_blocknum == blocknum + 1 && + stream->forwarded_buffers == 0) + { + stream->pending_read_blocknum--; + stream->pending_read_nblocks++; + continue; + } } > I tested this issue again (using my original pgbench_account query), > having rebased on top of HEAD as of today. I found that the > inconsistency seems to be much smaller now -- so much so that I don't > think that the remaining inconsistency is particularly suspicious. > > I also think that performance might have improved across the board. I > see that the same TPC-C query that took 768.454 ms a few weeks back > now takes only 617.408 ms. Also, while I originally saw "I/O Timings: > shared read=138.856" with this query, I now see "I/O Timings: shared > read=46.745". That feels like a performance bug fix to me. > > I wonder if today's commit b4212231 from Thomas ("Fix rare bug in > read_stream.c's split IO handling") fixed the issue, without anyone > realizing that the bug in question could manifest like this. I can't explain that. If you can consistently reproduce the change at the two base commits, maybe bisect? If it's a real phenomenon I'm definitely curious to know what you're seeing.
Hi, On Tue, 12 Aug 2025 at 08:07, Thomas Munro <thomas.munro@gmail.com> wrote: > > On Tue, Aug 12, 2025 at 11:42 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Aug 11, 2025 at 5:07 PM Tomas Vondra <tomas@vondra.me> wrote: > > > I can do some tests with forward vs. backwards scans. Of course, the > > > trouble with finding these weird cases is that they may be fairly rare. > > > So hitting them is a matter or luck or just happening to generate the > > > right data / query. But I'll give it a try and we'll see. > > > > I was talking more about finding "performance bugs" through a > > semi-directed process of trying random things while looking out for > > discrepancies. Something like that shouldn't require the usual > > "benchmarking rigor", since suspicious inconsistencies should be > > fairly obvious once encountered. I expect similar queries to have > > similar performance, regardless of superficial differences such as > > scan direction, DESC vs ASC column order, etc. > > I'd be interested to hear more about reverse scans. Bilal was > speculating about backwards I/O combining in read_stream.c a while > back, but we didn't have anything interesting to use it yet. You'll > probably see a flood of uncombined 8KB IOs in the pg_aios view while > travelling up the heap with cache misses today. I suspect Linux does > reverse sequential prefetching with buffered I/O (less sure about > other OSes) which should help but we'd still have more overheads than > we could if we combined them, not to mention direct I/O. If I remember correctly, I didn't continue working on this as I didn't see performance improvement. Right now, my changes don't apply cleanly to the current HEAD but I can give it another try if you see value in this. > Not tested, but something like this might do it: > > /* Can we merge it with the pending read? */ > - if (stream->pending_read_nblocks > 0 && > - stream->pending_read_blocknum + > stream->pending_read_nblocks == blocknum) > + if (stream->pending_read_nblocks > 0) > { > - stream->pending_read_nblocks++; > - continue; > + if (stream->pending_read_blocknum + > stream->pending_read_nblocks == > + blocknum) > + { > + stream->pending_read_nblocks++; > + continue; > + } > + else if (stream->pending_read_blocknum == > blocknum + 1 && > + stream->forwarded_buffers == 0) > + { > + stream->pending_read_blocknum--; > + stream->pending_read_nblocks++; > + continue; > + } > } Unfortunately this doesn't work. We need to handle backwards I/O combining in the StartReadBuffersImpl() function too as buffer indexes won't have correct blocknums. Also, I think buffer forwarding of split backwards I/O should be handled in a couple of places. -- Regards, Nazir Bilal Yavuz Microsoft
On 8/12/25 13:22, Nazir Bilal Yavuz wrote: > Hi, > > On Tue, 12 Aug 2025 at 08:07, Thomas Munro <thomas.munro@gmail.com> wrote: >> >> On Tue, Aug 12, 2025 at 11:42 AM Peter Geoghegan <pg@bowt.ie> wrote: >>> On Mon, Aug 11, 2025 at 5:07 PM Tomas Vondra <tomas@vondra.me> wrote: >>>> I can do some tests with forward vs. backwards scans. Of course, the >>>> trouble with finding these weird cases is that they may be fairly rare. >>>> So hitting them is a matter or luck or just happening to generate the >>>> right data / query. But I'll give it a try and we'll see. >>> >>> I was talking more about finding "performance bugs" through a >>> semi-directed process of trying random things while looking out for >>> discrepancies. Something like that shouldn't require the usual >>> "benchmarking rigor", since suspicious inconsistencies should be >>> fairly obvious once encountered. I expect similar queries to have >>> similar performance, regardless of superficial differences such as >>> scan direction, DESC vs ASC column order, etc. >> >> I'd be interested to hear more about reverse scans. Bilal was >> speculating about backwards I/O combining in read_stream.c a while >> back, but we didn't have anything interesting to use it yet. You'll >> probably see a flood of uncombined 8KB IOs in the pg_aios view while >> travelling up the heap with cache misses today. I suspect Linux does >> reverse sequential prefetching with buffered I/O (less sure about >> other OSes) which should help but we'd still have more overheads than >> we could if we combined them, not to mention direct I/O. > > If I remember correctly, I didn't continue working on this as I didn't > see performance improvement. Right now, my changes don't apply cleanly > to the current HEAD but I can give it another try if you see value in > this. > >> Not tested, but something like this might do it: >> >> /* Can we merge it with the pending read? */ >> - if (stream->pending_read_nblocks > 0 && >> - stream->pending_read_blocknum + >> stream->pending_read_nblocks == blocknum) >> + if (stream->pending_read_nblocks > 0) >> { >> - stream->pending_read_nblocks++; >> - continue; >> + if (stream->pending_read_blocknum + >> stream->pending_read_nblocks == >> + blocknum) >> + { >> + stream->pending_read_nblocks++; >> + continue; >> + } >> + else if (stream->pending_read_blocknum == >> blocknum + 1 && >> + stream->forwarded_buffers == 0) >> + { >> + stream->pending_read_blocknum--; >> + stream->pending_read_nblocks++; >> + continue; >> + } >> } > > Unfortunately this doesn't work. We need to handle backwards I/O > combining in the StartReadBuffersImpl() function too as buffer indexes > won't have correct blocknums. Also, I think buffer forwarding of split > backwards I/O should be handled in a couple of places. > I'm running some tests looking for these weird changes, not just with the patches, but on master too. And I don't think b4212231 changed the situation very much. FWIW this issue is not caused by the index prefetching patches, I can reproduce it with master (on b227b0bb4e032e19b3679bedac820eba3ac0d1cf from yesterday). So maybe we should split this into a separate thread. Consider for example the dataset built by create.sql - it's randomly generated, but the idea is that it's correlated, but not perfectly. The table is ~3.7GB, and it's a cold run - caches dropped + restart). 
Anyway, a simple range query looks like this:

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a ASC;

                               QUERY PLAN
------------------------------------------------------------------------
 Index Scan using idx on t
     (actual time=0.584..433.208 rows=1048576.00 loops=1)
   Index Cond: ((a >= 16336) AND (a <= 49103))
   Index Searches: 1
   Buffers: shared hit=7435 read=50872
   I/O Timings: shared read=332.270
 Planning:
   Buffers: shared hit=78 read=23
   I/O Timings: shared read=2.254
 Planning Time: 3.364 ms
 Execution Time: 463.516 ms
(10 rows)

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a DESC;

                               QUERY PLAN
------------------------------------------------------------------------
 Index Scan Backward using idx on t
     (actual time=0.566..22002.780 rows=1048576.00 loops=1)
   Index Cond: ((a >= 16336) AND (a <= 49103))
   Index Searches: 1
   Buffers: shared hit=36131 read=50872
   I/O Timings: shared read=21217.995
 Planning:
   Buffers: shared hit=82 read=23
   I/O Timings: shared read=2.375
 Planning Time: 3.478 ms
 Execution Time: 22231.755 ms
(10 rows)

That's a pretty massive difference ... this is on my laptop, and the timing changes quite a bit, but it's always a multiple of the first query with forward scan.

I did look into pg_aios, but there are only 8kB requests in both cases. I didn't have time to look closer yet.

regards

--
Tomas Vondra
On 8/12/25 18:53, Tomas Vondra wrote: > ... > > EXPLAIN (ANALYZE, COSTS OFF) > SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a ASC; > > QUERY PLAN > ------------------------------------------------------------------------ > Index Scan using idx on t > (actual time=0.584..433.208 rows=1048576.00 loops=1) > Index Cond: ((a >= 16336) AND (a <= 49103)) > Index Searches: 1 > Buffers: shared hit=7435 read=50872 > I/O Timings: shared read=332.270 > Planning: > Buffers: shared hit=78 read=23 > I/O Timings: shared read=2.254 > Planning Time: 3.364 ms > Execution Time: 463.516 ms > (10 rows) > > EXPLAIN (ANALYZE, COSTS OFF) > SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a DESC; > > QUERY PLAN > ------------------------------------------------------------------------ > Index Scan Backward using idx on t > (actual time=0.566..22002.780 rows=1048576.00 loops=1) > Index Cond: ((a >= 16336) AND (a <= 49103)) > Index Searches: 1 > Buffers: shared hit=36131 read=50872 > I/O Timings: shared read=21217.995 > Planning: > Buffers: shared hit=82 read=23 > I/O Timings: shared read=2.375 > Planning Time: 3.478 ms > Execution Time: 22231.755 ms > (10 rows) > > That's a pretty massive difference ... this is on my laptop, and the > timing changes quite a bit, but it's always a multiple of the first > query with forward scan. > > I did look into pg_aios, but there's only 8kB requests in both cases. I > didn't have time to look closer yet. > One more detail I just noticed - the DESC scan apparently needs more buffers (~87k vs. 57k). That probably shouldn't cause such massive regression, though. regards -- Tomas Vondra
On Tue, Aug 12, 2025 at 11:22 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> Unfortunately this doesn't work. We need to handle backwards I/O
> combining in the StartReadBuffersImpl() function too as buffer indexes
> won't have correct blocknums. Also, I think buffer forwarding of split
> backwards I/O should be handled in a couple of places.

Perhaps there could be a flag pending_read_backwards that can only become set when pending_read_nblocks goes from 1 to 2, and then a new flag stream->ios[x].backwards (in struct InProgressIO) that is set in read_stream_start_pending_read(). Then immediately after WaitReadBuffers(), we reverse the buffers it returned in place if that flag was set. Oh, I see, you were imagining a flag READ_BUFFERS_REVERSE that tells WaitReadBuffers() to do that internally. Hmm. Either way I don't think you need to consider the forwarded buffers because they will be reversed during a later call that includes them in *nblocks (output value), no?
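A minimal sketch of the reversal idea described above, under the assumption that a combined backwards read for blocks 25..20 is issued as one ascending I/O and the resulting buffer array is then flipped in place, so the stream consumer still sees the buffers in descending block order. Buffers are stood in for by plain ints here; the real code would operate on the Buffer array filled in by the completed read, and only when the hypothetical backwards flag was set.

#include <stdio.h>

/* Flip the first nblocks entries in place, as proposed for backwards reads. */
static void
reverse_buffers(int *buffers, int nblocks)
{
    for (int i = 0, j = nblocks - 1; i < j; i++, j--)
    {
        int     tmp = buffers[i];

        buffers[i] = buffers[j];
        buffers[j] = tmp;
    }
}

int
main(void)
{
    /* pretend these buffers came back for a combined read of blocks 20..25 */
    int     buffers[] = {120, 121, 122, 123, 124, 125};

    /* backwards flag was set, so present them in descending block order */
    reverse_buffers(buffers, 6);

    for (int i = 0; i < 6; i++)
        printf("buffer %d\n", buffers[i]);
    return 0;
}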
On Tue, Aug 12, 2025 at 1:51 PM Tomas Vondra <tomas@vondra.me> wrote: > One more detail I just noticed - the DESC scan apparently needs more > buffers (~87k vs. 57k). That probably shouldn't cause such massive > regression, though. I can reproduce this. I wondered if the difference might be attributable to the issue with posting lists and backwards scans (this index has fairly large posting lists), which is addressed by this patch of mine: https://commitfest.postgresql.org/patch/5824/ This makes the difference in buffers read identical between the forwards and backwards scan case. However, it makes exactly no difference to the execution time of the backwards scan case -- it's still way higher. I imagine that this is down to some linux readahead implementation detail. Maybe it is more willing to speculatively read ahead when the scan is mostly in ascending order, compared to when the scan is mostly in descending order. The performance gap that I see is surprisingly large, but I agree that it has nothing to do with this prefetching work/the issue that I saw with backwards scans. I had imagined that we'd be much less sensitive to these kinds of differences once we don't need to depend on heuristic-driven OS readahead. Maybe that was wrong. -- Peter Geoghegan
Hi, On 2025-08-12 18:53:13 +0200, Tomas Vondra wrote: > I'm running some tests looking for these weird changes, not just with > the patches, but on master too. And I don't think b4212231 changed the > situation very much. > > FWIW this issue is not caused by the index prefetching patches, I can > reproduce it with master (on b227b0bb4e032e19b3679bedac820eba3ac0d1cf > from yesterday). So maybe we should split this into a separate thread. > > Consider for example the dataset built by create.sql - it's randomly > generated, but the idea is that it's correlated, but not perfectly. The > table is ~3.7GB, and it's a cold run - caches dropped + restart). > > Anyway, a simple range query look like this: > > EXPLAIN (ANALYZE, COSTS OFF) > SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a ASC; > > QUERY PLAN > ------------------------------------------------------------------------ > Index Scan using idx on t > (actual time=0.584..433.208 rows=1048576.00 loops=1) > Index Cond: ((a >= 16336) AND (a <= 49103)) > Index Searches: 1 > Buffers: shared hit=7435 read=50872 > I/O Timings: shared read=332.270 > Planning: > Buffers: shared hit=78 read=23 > I/O Timings: shared read=2.254 > Planning Time: 3.364 ms > Execution Time: 463.516 ms > (10 rows) > > EXPLAIN (ANALYZE, COSTS OFF) > SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a DESC; > > QUERY PLAN > ------------------------------------------------------------------------ > Index Scan Backward using idx on t > (actual time=0.566..22002.780 rows=1048576.00 loops=1) > Index Cond: ((a >= 16336) AND (a <= 49103)) > Index Searches: 1 > Buffers: shared hit=36131 read=50872 > I/O Timings: shared read=21217.995 > Planning: > Buffers: shared hit=82 read=23 > I/O Timings: shared read=2.375 > Planning Time: 3.478 ms > Execution Time: 22231.755 ms > (10 rows) > > That's a pretty massive difference ... this is on my laptop, and the > timing changes quite a bit, but it's always a multiple of the first > query with forward scan. I suspect what you're mainly seeing here is that the OS can do readahead for us for forward scans, but not for backward scans. Indeed, if I look at iostat, the forward scan shows: Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util nvme6n1 3352.00 400.89 0.00 0.00 0.18 122.47 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.62 47.90 whereas the backward scan shows: Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util nvme6n1 10958.00 85.57 0.00 0.00 0.06 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.69 63.80 Note the different read sizes... > I did look into pg_aios, but there's only 8kB requests in both cases. I > didn't have time to look closer yet. That's what we'd expect, right? There's nothing on master that'd perform read combining for index scans... Greetings, Andres Freund
On Tue Aug 12, 2025 at 1:06 AM EDT, Thomas Munro wrote: > I'd be interested to hear more about reverse scans. Bilal was > speculating about backwards I/O combining in read_stream.c a while > back, but we didn't have anything interesting to use it yet. You'll > probably see a flood of uncombined 8KB IOs in the pg_aios view while > travelling up the heap with cache misses today. I suspect Linux does > reverse sequential prefetching with buffered I/O (less sure about > other OSes) which should help but we'd still have more overheads than > we could if we combined them, not to mention direct I/O. Doesn't look like Linux will do this, if what my local testing shows is anything to go on. I'm a bit surprised by this (I also thought that OS readahead on linux was quite sophisticated). There does seem to be something fishy going on with the patch here. I can see strange inconsistencies in EXPLAIN ANALYZE output when the server is started with --debug_io_direct=data with the master, compared to what I see with the patch. Test case ========= My test case is a minor refinement of Tomas' backwards scan test case from earlier today, though with one important difference: I ran "alter index idx set (deduplicate_items = off); reindex index idx;" to get a pristine index without any posting lists (since the unrelated issue with posting list TIDs otherwise risks obscuring something relevant). master ------ pg@regression:5432 [2390630]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx'); ***SNIP*** pg@regression:5432 [2390630]=# EXPLAIN (ANALYZE, COSTS OFF) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a; ┌────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├────────────────────────────────────────────────────────────────────────────────┤ │ Index Scan using idx on t (actual time=0.117..982.469 rows=1048576.00 loops=1) │ │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ │ Index Searches: 1 │ │ Buffers: shared hit=10353 read=49933 │ │ I/O Timings: shared read=861.953 │ │ Planning: │ │ Buffers: shared hit=63 read=20 │ │ I/O Timings: shared read=1.898 │ │ Planning Time: 2.131 ms │ │ Execution Time: 1015.679 ms │ └────────────────────────────────────────────────────────────────────────────────┘ (10 rows) pg@regression:5432 [2390630]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx'); ***SNIP*** pg@regression:5432 [2390630]=# EXPLAIN (ANALYZE, COSTS OFF) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a desc; ┌──────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├──────────────────────────────────────────────────────────────────────────────────────────┤ │ Index Scan Backward using idx on t (actual time=7.919..6340.579 rows=1048576.00 loops=1) │ │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ │ Index Searches: 1 │ │ Buffers: shared hit=10350 read=49933 │ │ I/O Timings: shared read=6219.776 │ │ Planning: │ │ Buffers: shared hit=5 │ │ Planning Time: 0.076 ms │ │ Execution Time: 6374.008 ms │ └──────────────────────────────────────────────────────────────────────────────────────────┘ (9 rows) Notice that readahead seems to be effective with the forwards scan only (even though I'm using debug_io_direct=data for this). Also notice that each query shows identical "Buffers:" output -- that detail is exactly as expected. 
Prefetch patch -------------- Same pair of queries/prewarming/eviction steps with my working copy of the prefetching patch: pg@regression:5432 [2400564]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx'); ***SNIP*** pg@regression:5432 [2400564]=# EXPLAIN (ANALYZE, COSTS OFF) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a; ┌────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├────────────────────────────────────────────────────────────────────────────────┤ │ Index Scan using idx on t (actual time=0.136..298.301 rows=1048576.00 loops=1) │ │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ │ Index Searches: 1 │ │ Buffers: shared hit=6619 read=49933 │ │ I/O Timings: shared read=45.313 │ │ Planning: │ │ Buffers: shared hit=63 read=20 │ │ I/O Timings: shared read=2.232 │ │ Planning Time: 2.634 ms │ │ Execution Time: 330.379 ms │ └────────────────────────────────────────────────────────────────────────────────┘ (10 rows) pg@regression:5432 [2400564]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx'); ***SNIP*** pg@regression:5432 [2400564]=# EXPLAIN (ANALYZE, COSTS OFF) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a desc; ┌──────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├──────────────────────────────────────────────────────────────────────────────────────────┤ │ Index Scan Backward using idx on t (actual time=7.926..1201.988 rows=1048576.00 loops=1) │ │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ │ Index Searches: 1 │ │ Buffers: shared hit=10350 read=49933 │ │ I/O Timings: shared read=194.774 │ │ Planning: │ │ Buffers: shared hit=5 │ │ Planning Time: 0.097 ms │ │ Execution Time: 1236.655 ms │ └──────────────────────────────────────────────────────────────────────────────────────────┘ (9 rows) It looks like the patch does significantly better with the forwards scan, compared to the backwards scan (though both are improved by a lot). But that's not the main thing about these results that I find interesting. The really odd thing is that we get "shared hit=6619 read=49933" for the forwards scan, and "shared hit=10350 read=49933" for the backwards scan. The latter matches master (regardless of the scan direction used on master), while the former just looks wrong. What explains the "missing buffer hits" seen with the forwards scan? Discrepancies ------------- All 4 query executions agree that "rows=1048576.00", so the patch doesn't appear to simply be broken/giving wrong answers. Might it be that the "Buffers" instrumentation is broken? The premise of my original complaint was that big inconsistencies in performance shouldn't happen between similar forwards and backwards scans (at least not with direct I/O). I now have serious doubts about that premise, since it looks like OS readahead remains a big factor with direct I/O. Did I just miss something obvious? >> I wonder if today's commit b4212231 from Thomas ("Fix rare bug in >> read_stream.c's split IO handling") fixed the issue, without anyone >> realizing that the bug in question could manifest like this. > > I can't explain that. If you can consistently reproduce the change at > the two base commits, maybe bisect? Commit b4212231 was a wild guess on my part. Probably should have refrained from that. -- Peter Geoghegan
On 8/12/25 23:22, Peter Geoghegan wrote: > ... > > It looks like the patch does significantly better with the forwards scan, > compared to the backwards scan (though both are improved by a lot). But that's > not the main thing about these results that I find interesting. > > The really odd thing is that we get "shared hit=6619 read=49933" for the > forwards scan, and "shared hit=10350 read=49933" for the backwards scan. The > latter matches master (regardless of the scan direction used on master), while > the former just looks wrong. What explains the "missing buffer hits" seen with > the forwards scan? > > Discrepancies > ------------- > > All 4 query executions agree that "rows=1048576.00", so the patch doesn't appear > to simply be broken/giving wrong answers. Might it be that the "Buffers" > instrumentation is broken? > I think a bug in the prefetch patch is more likely. I tried with a patch that adds various prefetch-related counters to explain, and I see this: test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a; QUERY PLAN ------------------------------------------------------------------------ Index Scan using idx on public.t (actual time=0.682..527.055 rows=1048576.00 loops=1) Output: a, b Index Cond: ((t.a >= 16336) AND (t.a <= 49103)) Index Searches: 1 Prefetch Distance: 271.263 Prefetch Count: 60888 Prefetch Stalls: 1 Prefetch Skips: 991211 Prefetch Resets: 3 Prefetch Histogram: [2,4) => 2, [4,8) => 8, [8,16) => 17, [16,32) => 24, [32,64) => 34, [64,128) => 52, [128,256) => 82, [256,512) => 60669 Buffers: shared hit=5027 read=50872 I/O Timings: shared read=33.528 Planning: Buffers: shared hit=78 read=23 I/O Timings: shared read=2.349 Planning Time: 3.686 ms Execution Time: 559.659 ms (17 rows) test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a DESC; QUERY PLAN ------------------------------------------------------------------------ Index Scan Backward using idx on public.t (actual time=1.110..4116.201 rows=1048576.00 loops=1) Output: a, b Index Cond: ((t.a >= 16336) AND (t.a <= 49103)) Index Searches: 1 Prefetch Distance: 271.061 Prefetch Count: 118806 Prefetch Stalls: 1 Prefetch Skips: 962515 Prefetch Resets: 3 Prefetch Histogram: [2,4) => 2, [4,8) => 7, [8,16) => 12, [16,32) => 17, [32,64) => 24, [64,128) => 3, [128,256) => 4, [256,512) => 118737 Buffers: shared hit=30024 read=50872 I/O Timings: shared read=581.353 Planning: Buffers: shared hit=82 read=23 I/O Timings: shared read=3.168 Planning Time: 4.289 ms Execution Time: 4185.407 ms (17 rows) These two parts are interesting: Prefetch Count: 60888 Prefetch Skips: 991211 Prefetch Count: 118806 Prefetch Skips: 962515 It looks like the backwards scan skips fewer blocks. This is based on the lastBlock optimization, i.e. looking for runs of the same block number. I don't quite see why would it affect just the backwards scan, though. Seems weird. > The premise of my original complaint was that big inconsistencies in performance > shouldn't happen between similar forwards and backwards scans (at least not with > direct I/O). I now have serious doubts about that premise, since it looks like > OS readahead remains a big factor with direct I/O. Did I just miss something > obvious? > I don't think you missed anything. It does seem the assumption relies on the OS handling the underlying I/O patterns equally, and unfortunately that does not seem to be the case. Maybe we could "invert" the data set, i.e. 
make it "descending" instead of "ascending"? That would make the heap access direction "forward" again ... regards -- Tomas Vondra
Hi, On 2025-08-12 17:22:20 -0400, Peter Geoghegan wrote: > Doesn't look like Linux will do this, if what my local testing shows is anything > to go on. Yes, matches my experiments outside of postgres too. > I'm a bit surprised by this (I also thought that OS readahead on linux > was quite sophisticated). It's mildly sophisticated in detecting various *forward scan* patterns. There just isn't anything for backward scans - presumably because there's not actually much that generates backward reads of files... > The premise of my original complaint was that big inconsistencies in performance > shouldn't happen between similar forwards and backwards scans (at least not with > direct I/O). I now have serious doubts about that premise, since it looks like > OS readahead remains a big factor with direct I/O. Did I just miss something > obvious? There is absolutely no OS level readahead with direct IO (there can be *merging* of neighboring IOs though, if they're submitted close enough together). However that doesn't mean that your storage hardware can't have its own set of heuristics for faster access - afaict several NVMes I have access to have shorter IO times for forward scans than for backward scans. Besides actual IO times, there also is the issue that the page level access might be differently efficient, the order in which tuples are accessed also plays a role in how efficient memory level prefetching is. OS level readahead is visible in some form in iostat - you get bigger reads or multiple in-flight IOs. Greetings, Andres Freund
On Tue, Aug 12, 2025 at 5:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > There does seem to be something fishy going on with the patch here. I can see > strange inconsistencies in EXPLAIN ANALYZE output when the server is started > with --debug_io_direct=data with the master, compared to what I see with the > patch. Attached is my working version of the patch, in case that helps anyone with reproducing the problem. Note that the nbtree changes are now included in this one patch/commit. Definitely might make sense to revert to one patch per index AM again later, but for now it's convenient to have one commit that both adds the concept of amgetbatch, and removes nbtree's btgettuple (since it bleeds into things like how indexam.c wants to do mark and restore). There are only fairly minor changes here. Most notably: * Generalizes nbtree's _bt_drop_lock_and_maybe_pin, making it an index-AM-generic thing I call index_batch_unlock. Previous versions of this complex patch avoided the issue by always holding on to a leaf page buffer pin, even when it wasn't truly necessary (i.e. with plain index scans that use an MVCC snapshot). It shouldn't be too hard to teach GiST to use index_batch_unlock to continue dropping buffer pins on leaf pages, as before (with gistgettuple). The hard part will be ordered GiST scans, and perhaps every kind of GiST index-only scan (since in general index-only scans cannot drop pins eagerly within index_batch_unlock, due to race conditions with VACUUM concurrently setting VM bits all-visible). * Replaces BufferMatches() with something a bit less invasive, which works based on block numbers (not buffers). * Various refinements to the way that nbtree deals with setting things up using an existing batch. In particular, the interface of _bt_readnextpage has been revised. It now makes much more sense in a world where nbtree doesn't "own" existing batches -- we no longer directly pass an existing batch to _bt_readnextpage, and it no longer thinks it can clobber what is actually an old batch. -- Peter Geoghegan
On 8/12/25 23:52, Tomas Vondra wrote: > > On 8/12/25 23:22, Peter Geoghegan wrote: >> ... >> >> It looks like the patch does significantly better with the forwards scan, >> compared to the backwards scan (though both are improved by a lot). But that's >> not the main thing about these results that I find interesting. >> >> The really odd thing is that we get "shared hit=6619 read=49933" for the >> forwards scan, and "shared hit=10350 read=49933" for the backwards scan. The >> latter matches master (regardless of the scan direction used on master), while >> the former just looks wrong. What explains the "missing buffer hits" seen with >> the forwards scan? >> >> Discrepancies >> ------------- >> >> All 4 query executions agree that "rows=1048576.00", so the patch doesn't appear >> to simply be broken/giving wrong answers. Might it be that the "Buffers" >> instrumentation is broken? >> > > I think a bug in the prefetch patch is more likely. I tried with a patch > that adds various prefetch-related counters to explain, and I see this: > > > test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a > BETWEEN 16336 AND 49103 ORDER BY a; > > QUERY PLAN > ------------------------------------------------------------------------ > Index Scan using idx on public.t (actual time=0.682..527.055 > rows=1048576.00 loops=1) > Output: a, b > Index Cond: ((t.a >= 16336) AND (t.a <= 49103)) > Index Searches: 1 > Prefetch Distance: 271.263 > Prefetch Count: 60888 > Prefetch Stalls: 1 > Prefetch Skips: 991211 > Prefetch Resets: 3 > Prefetch Histogram: [2,4) => 2, [4,8) => 8, [8,16) => 17, [16,32) => > 24, [32,64) => 34, [64,128) => 52, [128,256) => 82, [256,512) => 60669 > Buffers: shared hit=5027 read=50872 > I/O Timings: shared read=33.528 > Planning: > Buffers: shared hit=78 read=23 > I/O Timings: shared read=2.349 > Planning Time: 3.686 ms > Execution Time: 559.659 ms > (17 rows) > > > test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a > BETWEEN 16336 AND 49103 ORDER BY a DESC; > QUERY PLAN > ------------------------------------------------------------------------ > Index Scan Backward using idx on public.t (actual time=1.110..4116.201 > rows=1048576.00 loops=1) > Output: a, b > Index Cond: ((t.a >= 16336) AND (t.a <= 49103)) > Index Searches: 1 > Prefetch Distance: 271.061 > Prefetch Count: 118806 > Prefetch Stalls: 1 > Prefetch Skips: 962515 > Prefetch Resets: 3 > Prefetch Histogram: [2,4) => 2, [4,8) => 7, [8,16) => 12, [16,32) => > 17, [32,64) => 24, [64,128) => 3, [128,256) => 4, [256,512) => 118737 > Buffers: shared hit=30024 read=50872 > I/O Timings: shared read=581.353 > Planning: > Buffers: shared hit=82 read=23 > I/O Timings: shared read=3.168 > Planning Time: 4.289 ms > Execution Time: 4185.407 ms > (17 rows) > > These two parts are interesting: > > Prefetch Count: 60888 > Prefetch Skips: 991211 > > Prefetch Count: 118806 > Prefetch Skips: 962515 > > It looks like the backwards scan skips fewer blocks. This is based on > the lastBlock optimization, i.e. looking for runs of the same block > number. I don't quite see why would it affect just the backwards scan, > though. Seems weird. > Actually, this might be a consequence of how backwards scans work (at least in btree). 
I logged the block in index_scan_stream_read_next, and this is what I see in the forward scan (at the beginning):

index_scan_stream_read_next: block 24891
index_scan_stream_read_next: block 24892
index_scan_stream_read_next: block 24893
index_scan_stream_read_next: block 24892
index_scan_stream_read_next: block 24893
index_scan_stream_read_next: block 24894
index_scan_stream_read_next: block 24895
index_scan_stream_read_next: block 24896
index_scan_stream_read_next: block 24895
index_scan_stream_read_next: block 24896
index_scan_stream_read_next: block 24897
index_scan_stream_read_next: block 24898
index_scan_stream_read_next: block 24899
index_scan_stream_read_next: block 24900
index_scan_stream_read_next: block 24901
index_scan_stream_read_next: block 24902
index_scan_stream_read_next: block 24903
index_scan_stream_read_next: block 24904
index_scan_stream_read_next: block 24905
index_scan_stream_read_next: block 24906
index_scan_stream_read_next: block 24907
index_scan_stream_read_next: block 24908
index_scan_stream_read_next: block 24909
index_scan_stream_read_next: block 24910

while in the backwards scan (at the end) I see this:

index_scan_stream_read_next: block 24910
index_scan_stream_read_next: block 24911
index_scan_stream_read_next: block 24908
index_scan_stream_read_next: block 24909
index_scan_stream_read_next: block 24906
index_scan_stream_read_next: block 24907
index_scan_stream_read_next: block 24908
index_scan_stream_read_next: block 24905
index_scan_stream_read_next: block 24906
index_scan_stream_read_next: block 24903
index_scan_stream_read_next: block 24904
index_scan_stream_read_next: block 24905
index_scan_stream_read_next: block 24902
index_scan_stream_read_next: block 24903
index_scan_stream_read_next: block 24900
index_scan_stream_read_next: block 24901
index_scan_stream_read_next: block 24902
index_scan_stream_read_next: block 24899
index_scan_stream_read_next: block 24900
index_scan_stream_read_next: block 24897
index_scan_stream_read_next: block 24898
index_scan_stream_read_next: block 24899
index_scan_stream_read_next: block 24895
index_scan_stream_read_next: block 24896
index_scan_stream_read_next: block 24897
index_scan_stream_read_next: block 24894
index_scan_stream_read_next: block 24895
index_scan_stream_read_next: block 24896
index_scan_stream_read_next: block 24892
index_scan_stream_read_next: block 24893
index_scan_stream_read_next: block 24894
index_scan_stream_read_next: block 24891
index_scan_stream_read_next: block 24892
index_scan_stream_read_next: block 24893

These are only the blocks that ended up passed to the read stream, not the skipped ones. And you can immediately see the backward scan requests more blocks for (roughly) the same part of the scan - the min/max block roughly match.

The reason is pretty simple - the table is very correlated, and the forward scan requests blocks mostly in the right order. Only rarely does it have to jump "back" when progressing to the next value, and so the lastBlock optimization works nicely.

But with the backwards scan we apparently scan the values backwards, but then the blocks for each value are accessed in forward direction. So we do a couple blocks "forward" and then jump to the preceding value - but that's a couple blocks *back*. And that breaks the lastBlock check.

I believe this applies both to master and the prefetching, except that master doesn't have a read stream - so it only does sync I/O. Could that hide the extra buffer accesses, somehow?
Anyway, this access pattern in backwards scans seems a bit unfortunate. regards -- Tomas Vondra
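For illustration, here is a standalone sketch of the lastBlock-style de-duplication described above (not the patch's actual code): a prefetch request is skipped only when the heap block matches the immediately preceding one. Feeding it a forward-correlated pattern (long runs of the same block) versus a backwards pattern like the one logged above (short ascending runs that keep stepping back) shows why the backwards scan ends up issuing many more requests.

#include <stdio.h>

/* Count requests surviving the "same block as the previous TID" skip. */
static int
count_issued(const unsigned *blocks, int n)
{
    unsigned    lastBlock = (unsigned) -1;  /* InvalidBlockNumber stand-in */
    int         issued = 0;

    for (int i = 0; i < n; i++)
    {
        if (blocks[i] == lastBlock)
            continue;           /* duplicate of the previous block: skip */
        lastBlock = blocks[i];
        issued++;
    }
    return issued;
}

int
main(void)
{
    /* forward scan: long runs of the same heap block, ascending */
    unsigned    fwd[60];
    /* backward scan: short ascending runs that keep stepping back */
    unsigned    bwd[] = {24910, 24911, 24908, 24909, 24906, 24907,
                         24908, 24905, 24906, 24903, 24904, 24905};
    int         nbwd = (int) (sizeof(bwd) / sizeof(bwd[0]));

    for (int i = 0; i < 60; i++)
        fwd[i] = 24891 + i / 20;

    printf("forward:  %d requests for 60 TIDs\n", count_issued(fwd, 60));
    printf("backward: %d requests for %d TIDs\n", count_issued(bwd, nbwd), nbwd);
    return 0;
}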
On Tue, Aug 12, 2025 at 7:10 PM Tomas Vondra <tomas@vondra.me> wrote: > Actually, this might be a consequence of how backwards scans work (at > least in btree). I logged the block in index_scan_stream_read_next, and > this is what I see in the forward scan (at the beginning): Just to be clear: you did disable deduplication and then reindex, right? You're accounting for the known issue with posting list TIDs returning TIDs in the wrong order, relative to the scan direction (when the scan direction is backwards)? It won't be necessary to do this once I commit my patch that fixes the issue directly, on the nbtree side, but for now deduplication messes things up here. And so for now you have to work around it. > But with the backwards scan we apparently scan the values backwards, but > then the blocks for each value are accessed in forward direction. So we > do a couple blocks "forward" and then jump to the preceding value - but > that's a couple blocks *back*. And that breaks the lastBlock check. I don't think that this should be happening. The read stream ought to be seeing blocks in exactly the same order as everything else. > I believe this applies both to master and the prefetching, except that > master doesn't have read stream - so it only does sync I/O. In what sense is it an issue on master? On master, we simply access the TIDs in whatever order amgettuple returns TIDs in. That should always be scan order/index key space order, where heap TID counts as a tie-breaker/affects the key space in the presence of duplicates (at least once that issue with posting lists is fixed, or once deduplication has been disabled in a way that leaves no posting list TIDs around via a reindex). It is certainly not surprising that master does poorly on backwards scans. And it isn't all that surprising that master does worse on backwards scans when direct I/O is in use (per the explanation Andres offered just now). But master should nevertheless always read the TIDs in whatever order it gets them from amgettuple in. It sounds like amgetbatch doesn't really behave analogously to master here, at least with backwards scans. It sounds like you're saying that we *won't* feed TIDs heap block numbers to the read stream in exactly scan order (when we happen to be scanning backwards) -- which seems wrong to me. As you pointed out, a forwards scan of a DESC column index should feed heap blocks to the read stream in a way that is very similar to an equivalent backwards scan of a similar ASC column on the same table. There might be some very minor differences, due to differences in the precise leaf page boundaries among each of the indexes. But that should hardly be noticeable at all. > Could that hide the extra buffer accesses, somehow? I think that you meant to ask about *missing* buffer hits with the patch, for the forwards scan. That doesn't agree with the backwards scan with the patch, nor does it agree with master (with either the forwards or backwards scan). Note that the heap accesses themselves appear to have sane/consistent numbers, since we always see "read=49933" as expected for those, for all 4 query executions that I showed. The "missing buffer hits" issue seems like an issue with the instrumentation itself. Possibly one that is totally unrelated to everything else we're discussing. -- Peter Geoghegan
Hi, On Tue, 12 Aug 2025 at 22:30, Thomas Munro <thomas.munro@gmail.com> wrote: > > On Tue, Aug 12, 2025 at 11:22 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > Unfortunately this doesn't work. We need to handle backwards I/O > > combining in the StartReadBuffersImpl() function too as buffer indexes > > won't have correct blocknums. Also, I think buffer forwarding of split > > backwards I/O should be handled in a couple of places. > > Perhaps there could be a flag pending_read_backwards that can only > become set with pending_read_nblocks goes from 1 to 2, and then a new > flag stream->ios[x].backwards (in struct InProgressIO) that is set in > read_stream_start_pending_read(). Then immediately after > WaitReadBuffers(), we reverse the buffers it returned in place if that > flag was set. Oh, I see, you were imagining a flag > READ_BUFFERS_REVERSE that tells WaitReadBuffers() to do that > internally. Hmm. Either way I don't think you need to consider the > forwarded buffers because they will be reversed during a later call > that includes them in *nblocks (output value), no? I think the problem is that we are not sure whether we will run WaitReadBuffers() or not. Let's say that we will process blocknums 25, 24, 23, 22, 21 and 20 so we combined these IOs. We set the pending_read_backwards flag and sent this IO operation to the StartReadBuffers(). Let's consider that 22 and 20 are cache hits and the rest are cache misses. In that case, starting processing buffers (inside StartReadBuffers()) from 20 will fail because we will try to return that immediately since this is a first buffer and it is cache hit. I think something like this, we will pass the pending_read_backwards to the StartReadBuffers() and it will start to process blocknums from backwards because of the pending_read_backwards being true. So, buffer[0] -> 25 ... buffer[2] -> 23 and we will stop there because 22 is a cache hit. Now, we will reverse these buffers so that buffer[0] -> 23 ... buffer[2] -> 25, and then send this IO operation to the WaitReadBuffers() and reverse these buffers again after WaitReadBuffers(). The problem with that approach is that we need to forward 22, 21 and 20 and pending_read_blocknum shouldn't change because we are still at 20, processed buffers don't affect pending_read_blocknum. And we need to preserve pending_read_backwards until we process all forwarded buffers, otherwise we may try to combine forward (pending_read_blocknum is 20 and the let's say next blocknum from read_stream_get_block() is 21, we shouldn't do IO combining in that case). -- Regards, Nazir Bilal Yavuz Microsoft
On 8/13/25 01:33, Peter Geoghegan wrote: > On Tue, Aug 12, 2025 at 7:10 PM Tomas Vondra <tomas@vondra.me> wrote: >> Actually, this might be a consequence of how backwards scans work (at >> least in btree). I logged the block in index_scan_stream_read_next, and >> this is what I see in the forward scan (at the beginning): > > Just to be clear: you did disable deduplication and then reindex, > right? You're accounting for the known issue with posting list TIDs > returning TIDs in the wrong order, relative to the scan direction > (when the scan direction is backwards)? > > It won't be necessary to do this once I commit my patch that fixes the > issue directly, on the nbtree side, but for now deduplication messes > things up here. And so for now you have to work around it. > No, I forgot about that (and the the patch only applies to master). >> But with the backwards scan we apparently scan the values backwards, but >> then the blocks for each value are accessed in forward direction. So we >> do a couple blocks "forward" and then jump to the preceding value - but >> that's a couple blocks *back*. And that breaks the lastBlock check. > > I don't think that this should be happening. The read stream ought to > be seeing blocks in exactly the same order as everything else. > >> I believe this applies both to master and the prefetching, except that >> master doesn't have read stream - so it only does sync I/O. > > In what sense is it an issue on master? > > On master, we simply access the TIDs in whatever order amgettuple > returns TIDs in. That should always be scan order/index key space > order, where heap TID counts as a tie-breaker/affects the key space in > the presence of duplicates (at least once that issue with posting > lists is fixed, or once deduplication has been disabled in a way that > leaves no posting list TIDs around via a reindex). > > It is certainly not surprising that master does poorly on backwards > scans. And it isn't all that surprising that master does worse on > backwards scans when direct I/O is in use (per the explanation > Andres offered just now). But master should nevertheless always read > the TIDs in whatever order it gets them from amgettuple in. > > It sounds like amgetbatch doesn't really behave analogously to master > here, at least with backwards scans. It sounds like you're saying that > we *won't* feed TIDs heap block numbers to the read stream in exactly > scan order (when we happen to be scanning backwards) -- which seems > wrong to me. > > As you pointed out, a forwards scan of a DESC column index should feed > heap blocks to the read stream in a way that is very similar to an > equivalent backwards scan of a similar ASC column on the same table. > There might be some very minor differences, due to differences in the > precise leaf page boundaries among each of the indexes. But that > should hardly be noticeable at all. > I gave this another try, this time with disabled deduplication, and on master I also applied the patch (but now I realize that's probably unnecessary, right?). 
I did a couple more things for this experiment: 1) created a second table with an "inverse pattern" that's decreasing: create table t2 (like t) with (fillfactor = 20); insert into t2 select -a, b from t; create index idx2 on t2 (a); alter index idx2 set (deduplicate_items = false); reindex index idx2; The idea is that SELECT * FROM t WHERE (a BETWEEN x AND y) ORDER BY a ASC is the same "block pattern" as SELECT * FROM t2 WHERE (a BETWEEN -y AND -x) ORDER BY a DESC 2) added logging to heapam_index_fetch_tuple elog(LOG, "heapam_index_fetch_tuple block %u", ItemPointerGetBlockNumber(tid)); 3) disabled autovacuum (so that it doesn't trigger any logs) 4) python script that processes the block numbers and counts number of blocks, runs, forward/backward advances 5) bash script that runs 4 "equivalent" queries on t/t2, with ASC/DESC. And the results look like this (FWIW this is with io_method=sync): Q1: SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 Q2: SELECT * FROM t2 WHERE a BETWEEN -49103 AND -16336 master / buffered query order time blocks runs forward backward --------------------------------------------------------------- Q1 ASC 575 1048576 57365 53648 3716 Q1 DESC 10245 1048576 57365 3716 53648 Q2 ASC 14819 1048576 86061 53293 32767 Q2 DESC 1063 1048576 86061 32767 53293 prefetch / buffered query order time blocks runs forward backward --------------------------------------------------------------- Q1 ASC 701 1048576 57365 53648 3716 Q1 DESC 1805 1048576 57365 3716 53648 Q2 ASC 1221 1048576 86061 53293 32767 Q2 DESC 2101 1048576 86061 32767 53293 master / direct query order time blocks runs forward backward --------------------------------------------------------------- Q1 ASC 6101 1048576 57365 53648 3716 Q1 DESC 12041 1048576 57365 3716 53648 Q2 ASC 14837 1048576 86061 53293 32767 Q2 DESC 14690 1048576 86061 32767 53293 prefetch / direct query order time blocks runs forward backward --------------------------------------------------------------- Q1 ASC 1504 1048576 57365 53648 3716 Q1 DESC 9034 1048576 57365 3716 53648 Q2 ASC 6988 1048576 86061 53293 32767 Q2 DESC 8959 1048576 86061 32767 53293 The timings are from runs without the extra logging, but there's still quite a bit of run to run variation. But the differences are somewhat stable. Some observations: * The block stats are perfectly stable (for each query), both for each build and between builds. And also perfectly symmetrical between the ASC/DESC version of each query. The ASC does the same number of "forward" steps like DESC does "backward" steps. * There's a clear difference between Q1 and Q2, with Q2 having many more runs (and not as "nice" forward/backward steps). When I created the t2 data set, I expected Q1 ASC to behave the same as Q2 DESC, but it doesn't seem to work that way. Clearly, the "descending" pattern in t2 breaks the sequence of block numbers into many more runs. >> Could that hide the extra buffer accesses, somehow? > > I think that you meant to ask about *missing* buffer hits with the > patch, for the forwards scan. That doesn't agree with the backwards > scan with the patch, nor does it agree with master (with either the > forwards or backwards scan). Note that the heap accesses themselves > appear to have sane/consistent numbers, since we always see > "read=49933" as expected for those, for all 4 query executions that I > showed. > > The "missing buffer hits" issue seems like an issue with the > instrumentation itself. Possibly one that is totally unrelated to > everything else we're discussing. 
>

Yes, I came to this conclusion too. The fact that the stats presented above are exactly the same for all the different cases (for each query) is a sign it's about the tracking.

In fact, I believe this is about io_method. I initially didn't see the difference you described, and then I realized I had set io_method=sync to make it easier to track the block access. And if I change io_method to worker, I get different stats that also change between runs.

With "sync" I always get this (after a restart):

  Buffers: shared hit=7435 read=52801

while with "worker" I get this:

  Buffers: shared hit=4879 read=52801
  Buffers: shared hit=5151 read=52801
  Buffers: shared hit=4978 read=52801

So not only does it change from run to run, it also does not add up to 60236.

I vaguely recall I ran into this some time ago during AIO benchmarking, and IIRC it's due to how StartReadBuffersImpl() may behave differently depending on I/O started earlier. It only calls PinBufferForBlock() in some cases, and PinBufferForBlock() is what updates the hits.

In any case, it seems to depend on io_method, and it's confusing.

regards

--
Tomas Vondra
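For reference, a small sketch of the kind of post-processing behind the "blocks / runs / forward / backward" numbers in the tables above - an assumed reimplementation of the attached script's counting, written in C purely for illustration: it walks the logged heap block numbers in access order, collapses consecutive duplicates into runs, and counts how often the scan steps to a higher or lower block.

#include <stdio.h>

int
main(void)
{
    /* stand-in for the logged heap block numbers, in access order */
    unsigned    blocks[] = {24891, 24891, 24892, 24892, 24893,
                            24892, 24893, 24894};
    int         n = (int) (sizeof(blocks) / sizeof(blocks[0]));
    int         runs = 1;
    int         forward = 0;
    int         backward = 0;

    for (int i = 1; i < n; i++)
    {
        if (blocks[i] == blocks[i - 1])
            continue;           /* still in the same run */
        runs++;
        if (blocks[i] > blocks[i - 1])
            forward++;          /* stepped to a higher block */
        else
            backward++;         /* stepped to a lower block */
    }

    printf("blocks=%d runs=%d forward=%d backward=%d\n",
           n, runs, forward, backward);
    return 0;
}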
Hi, On 2025-08-13 14:15:37 +0200, Tomas Vondra wrote: > In fact, I believe this is about io_method. I initially didn't see the > difference you described, and then I realized I set io_method=sync to > make it easier to track the block access. And if I change io_method to > worker, I get different stats, that also change between runs. > > With "sync" I always get this (after a restart): > > Buffers: shared hit=7435 read=52801 > > while with "worker" I get this: > > Buffers: shared hit=4879 read=52801 > Buffers: shared hit=5151 read=52801 > Buffers: shared hit=4978 read=52801 > > So not only it changes run to tun, it also does not add up to 60236. This is reproducible on master? If so, how? > I vaguely recall I ran into this some time ago during AIO benchmarking, > and IIRC it's due to how StartReadBuffersImpl() may behave differently > depending on I/O started earlier. It only calls PinBufferForBlock() in > some cases, and PinBufferForBlock() is what updates the hits. Hm, I don't immediately see an issue there. The only case we don't call PinBufferForBlock() is if we already have pinned the relevant buffer in a prior call to StartReadBuffersImpl(). If this happens only with the prefetching patch applied, is is possible that what happens here is that we occasionally re-request buffers that already in the process of being read in? That would only happen with a read stream and io_method != sync (since with sync we won't read ahead). If we have to start reading in a buffer that's already undergoing IO we wait for the IO to complete and count that access as a hit: /* * Check if we can start IO on the first to-be-read buffer. * * If an I/O is already in progress in another backend, we want to wait * for the outcome: either done, or something went wrong and we will * retry. */ if (!ReadBuffersCanStartIO(buffers[nblocks_done], false)) { ... /* * Report and track this as a 'hit' for this backend, even though it * must have started out as a miss in PinBufferForBlock(). The other * backend will track this as a 'read'. */ ... if (persistence == RELPERSISTENCE_TEMP) pgBufferUsage.local_blks_hit += 1; else pgBufferUsage.shared_blks_hit += 1; ... Greetings, Andres Freund
On Wed, Aug 13, 2025 at 11:28 AM Andres Freund <andres@anarazel.de> wrote: > > With "sync" I always get this (after a restart): > > > > Buffers: shared hit=7435 read=52801 > > > > while with "worker" I get this: > > > > Buffers: shared hit=4879 read=52801 > > Buffers: shared hit=5151 read=52801 > > Buffers: shared hit=4978 read=52801 > > > > So not only it changes run to tun, it also does not add up to 60236. > > This is reproducible on master? If so, how? AFAIK it is *not* reproducible on master. > If this happens only with the prefetching patch applied, is is possible that > what happens here is that we occasionally re-request buffers that already in > the process of being read in? That would only happen with a read stream and > io_method != sync (since with sync we won't read ahead). If we have to start > reading in a buffer that's already undergoing IO we wait for the IO to > complete and count that access as a hit: This theory seems quite plausible to me. Though it is a bit surprising that I see incorrect buffer hit counts on the "good" forwards scan case, rather than on the "bad" backwards scan case. Here's what I mean by things being broken on the read stream side (at least with certain backwards scan cases): When I add instrumentation to the read stream side, by adding elog debug calls that show the blocknum seen by read_stream_get_block, I see out-of-order and repeated blocknums with the "bad" backwards scan case ("SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a desc"): ... NOTICE: index_scan_stream_read_next: index 1163 TID (25052,21) WARNING: prior lastBlock is 25053 for batchno 2856, new one: 25052 WARNING: blocknum: 25052, 0x55614810efb0 WARNING: blocknum: 25052, 0x55614810efb0 NOTICE: index_scan_stream_read_next: index 1161 TID (25053,3) WARNING: prior lastBlock is 25052 for batchno 2856, new one: 25053 WARNING: blocknum: 25053, 0x55614810efb0 NOTICE: index_scan_stream_read_next: index 1160 TID (25052,19) WARNING: prior lastBlock is 25053 for batchno 2856, new one: 25052 WARNING: blocknum: 25052, 0x55614810efb0 WARNING: blocknum: 25052, 0x55614810efb0 NOTICE: index_scan_stream_read_next: index 1141 TID (25051,21) WARNING: prior lastBlock is 25052 for batchno 2856, new one: 25051 WARNING: blocknum: 25051, 0x55614810efb0 ... Notice that we see the same blocknum twice in close succession. Also notice that we're passed 25052 and then subsequently passed 25053, only to be passed 25053 once more. OTOH, when I run the equivalent "good" backwards scan ("SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a"), the output looks just about perfect. I have to look around quite a bit longer before I can find repeated blocknum within successive read_stream_get_block calls: ... NOTICE: index_scan_stream_read_next: index 303 TID (74783,1) WARNING: prior lastBlock is 74782 for batchno 2862, new one: 74783 WARNING: blocknum: 74783, 0x55614810efb0 NOTICE: index_scan_stream_read_next: index 323 TID (74784,1) WARNING: prior lastBlock is 74783 for batchno 2862, new one: 74784 WARNING: blocknum: 74784, 0x55614810efb0 NOTICE: index_scan_stream_read_next: index 324 TID (74783,21) WARNING: prior lastBlock is 74784 for batchno 2862, new one: 74783 WARNING: blocknum: 74783, 0x55614810efb0 NOTICE: index_scan_stream_read_next: index 325 TID (74784,2) WARNING: prior lastBlock is 74783 for batchno 2862, new one: 74784 WARNING: blocknum: 74784, 0x55614810efb0 ... These out-of-order repeat requests are much rarer. 
And I *never* see identical requests in *immediate* succession, whereas those are common with the backwards scan case. I believe that the out-of-order repeat requests shown here are a legitimate consequence of the TIDs being slightly out of order in relatively few places (so the forwards scan case may well already be behaving exactly as I expect): pg@regression:5432 [2470184]=# select ctid, a from t where ctid between '(74783,1)' and '(74784,1)'; ┌────────────┬────────┐ │ ctid │ a │ ├────────────┼────────┤ │ (74783,1) │ 49,077 │ │ (74783,2) │ 49,077 │ │ (74783,3) │ 49,077 │ │ (74783,4) │ 49,077 │ │ (74783,5) │ 49,077 │ │ (74783,6) │ 49,077 │ │ (74783,7) │ 49,077 │ │ (74783,8) │ 49,077 │ │ (74783,9) │ 49,077 │ │ (74783,10) │ 49,077 │ │ (74783,11) │ 49,077 │ │ (74783,12) │ 49,077 │ │ (74783,13) │ 49,077 │ │ (74783,14) │ 49,077 │ │ (74783,15) │ 49,077 │ │ (74783,16) │ 49,077 │ │ (74783,17) │ 49,077 │ │ (74783,18) │ 49,077 │ │ (74783,19) │ 49,077 │ │ (74783,20) │ 49,077 │ │ (74783,21) │ 49,078 │ │ (74784,1) │ 49,077 │ └────────────┴────────┘ (22 rows) Bear in mind that EXPLAIN ANALYZE shows *identical* "Buffers:" details for each query on master. So I believe that I am completely justified in expecting the calls to read_stream_get_block for the backwards scan to use identical blocknums to the ones for the equivalent/good forwards scan (except that they should be in the exact opposite order). And yet that's not what I see. Maybe this is something to do with the read position and the stream position becoming mixed up? I find it odd that the relevant readstream callback, index_scan_stream_read_next, says "If the stream position is undefined, just use the read position". That's just a guess, though. This issue is tricky to debug. I'm not yet used to debugging problems such as these (though I'll probably become an expert on it in the months ahead). -- Peter Geoghegan
On Wed, Aug 13, 2025 at 8:15 AM Tomas Vondra <tomas@vondra.me> wrote: > 1) created a second table with an "inverse pattern" that's decreasing: > > create table t2 (like t) with (fillfactor = 20); > insert into t2 select -a, b from t; > create index idx2 on t2 (a); > alter index idx2 set (deduplicate_items = false); > reindex index idx2; > > The idea is that > > SELECT * FROM t WHERE (a BETWEEN x AND y) ORDER BY a ASC > > is the same "block pattern" as > > SELECT * FROM t2 WHERE (a BETWEEN -y AND -x) ORDER BY a DESC A quick look at "idx2" using pageinspect seems to show heap block numbers that are significantly less in-order than those from the original "idx" index, though. While the original "idx" has block numbers that are *almost* in perfect order (I do see the odd index tuple that has a non-consecutive TID, possibly just due to the influence of the heap FSM), "idx2" seems to have leaf pages that each have heap blocks that are somewhat "shuffled" within each page. While the average total number of heap blocks seen with "idx2" might not be very much higher than "idx", it is nevertheless true that the heap TIDs appear in a less consistent order. So AFAICT we have no principled reason to expect the "runs" seen on "idx2" to be anything like "idx" (maybe the performance gap is a real problem, since the physical attributes of each index aren't hugely different, but even then the "runs" stats don't seem all that uninformative). I'll show what I mean by "shuffled" via a comparison of 2 random leaf pages from each index. Here's what block 5555 from "idx2" looks like according to bt_page_items (it shows a certain amount of "localized shuffling"): ┌────────────┬───────────────┬───────────────┬─────────────┐ │ itemoffset │ ctid │ data │ htid │ ├────────────┼───────────────┼───────────────┼─────────────┤ │ 1 │ (379861,4097) │ (a)=(-249285) │ (379861,7) │ │ 2 │ (379880,13) │ (a)=(-249297) │ (379880,13) │ │ 3 │ (379880,14) │ (a)=(-249297) │ (379880,14) │ │ 4 │ (379880,15) │ (a)=(-249297) │ (379880,15) │ │ 5 │ (379880,16) │ (a)=(-249297) │ (379880,16) │ │ 6 │ (379880,17) │ (a)=(-249297) │ (379880,17) │ │ 7 │ (379880,18) │ (a)=(-249297) │ (379880,18) │ │ 8 │ (379880,19) │ (a)=(-249297) │ (379880,19) │ │ 9 │ (379880,20) │ (a)=(-249297) │ (379880,20) │ │ 10 │ (379880,21) │ (a)=(-249297) │ (379880,21) │ │ 11 │ (379881,2) │ (a)=(-249297) │ (379881,2) │ │ 12 │ (379881,3) │ (a)=(-249297) │ (379881,3) │ │ 13 │ (379881,4) │ (a)=(-249297) │ (379881,4) │ │ 14 │ (379878,2) │ (a)=(-249296) │ (379878,2) │ │ 15 │ (379878,3) │ (a)=(-249296) │ (379878,3) │ │ 16 │ (379878,5) │ (a)=(-249296) │ (379878,5) │ │ 17 │ (379878,6) │ (a)=(-249296) │ (379878,6) │ │ 18 │ (379878,7) │ (a)=(-249296) │ (379878,7) │ │ 19 │ (379878,8) │ (a)=(-249296) │ (379878,8) │ │ 20 │ (379878,9) │ (a)=(-249296) │ (379878,9) │ │ 21 │ (379878,10) │ (a)=(-249296) │ (379878,10) │ │ 22 │ (379878,11) │ (a)=(-249296) │ (379878,11) │ │ 23 │ (379878,12) │ (a)=(-249296) │ (379878,12) │ │ 24 │ (379878,13) │ (a)=(-249296) │ (379878,13) │ │ 25 │ (379878,14) │ (a)=(-249296) │ (379878,14) │ │ 26 │ (379878,15) │ (a)=(-249296) │ (379878,15) │ │ 27 │ (379878,16) │ (a)=(-249296) │ (379878,16) │ │ 28 │ (379878,17) │ (a)=(-249296) │ (379878,17) │ │ 29 │ (379878,18) │ (a)=(-249296) │ (379878,18) │ │ 30 │ (379878,19) │ (a)=(-249296) │ (379878,19) │ │ 31 │ (379878,20) │ (a)=(-249296) │ (379878,20) │ │ 32 │ (379878,21) │ (a)=(-249296) │ (379878,21) │ │ 33 │ (379879,1) │ (a)=(-249296) │ (379879,1) │ │ 34 │ (379879,2) │ (a)=(-249296) │ (379879,2) │ │ 35 │ (379879,3) │ 
(a)=(-249296) │ (379879,3) │ │ 36 │ (379879,4) │ (a)=(-249296) │ (379879,4) │ │ 37 │ (379879,5) │ (a)=(-249296) │ (379879,5) │ │ 38 │ (379879,6) │ (a)=(-249296) │ (379879,6) │ │ 39 │ (379879,7) │ (a)=(-249296) │ (379879,7) │ │ 40 │ (379879,8) │ (a)=(-249296) │ (379879,8) │ │ 41 │ (379879,9) │ (a)=(-249296) │ (379879,9) │ │ 42 │ (379879,10) │ (a)=(-249296) │ (379879,10) │ │ 43 │ (379879,12) │ (a)=(-249296) │ (379879,12) │ │ 44 │ (379879,13) │ (a)=(-249296) │ (379879,13) │ │ 45 │ (379879,14) │ (a)=(-249296) │ (379879,14) │ │ 46 │ (379876,10) │ (a)=(-249295) │ (379876,10) │ │ 47 │ (379876,12) │ (a)=(-249295) │ (379876,12) │ │ 48 │ (379876,14) │ (a)=(-249295) │ (379876,14) │ │ 49 │ (379876,16) │ (a)=(-249295) │ (379876,16) │ │ 50 │ (379876,17) │ (a)=(-249295) │ (379876,17) │ │ 51 │ (379876,18) │ (a)=(-249295) │ (379876,18) │ │ 52 │ (379876,19) │ (a)=(-249295) │ (379876,19) │ │ 53 │ (379876,20) │ (a)=(-249295) │ (379876,20) │ │ 54 │ (379876,21) │ (a)=(-249295) │ (379876,21) │ │ 55 │ (379877,1) │ (a)=(-249295) │ (379877,1) │ │ 56 │ (379877,2) │ (a)=(-249295) │ (379877,2) │ │ 57 │ (379877,3) │ (a)=(-249295) │ (379877,3) │ │ 58 │ (379877,4) │ (a)=(-249295) │ (379877,4) │ │ 59 │ (379877,5) │ (a)=(-249295) │ (379877,5) │ │ 60 │ (379877,6) │ (a)=(-249295) │ (379877,6) │ │ 61 │ (379877,7) │ (a)=(-249295) │ (379877,7) │ │ 62 │ (379877,8) │ (a)=(-249295) │ (379877,8) │ │ 63 │ (379877,9) │ (a)=(-249295) │ (379877,9) │ │ 64 │ (379877,10) │ (a)=(-249295) │ (379877,10) │ │ 65 │ (379877,11) │ (a)=(-249295) │ (379877,11) │ │ 66 │ (379877,12) │ (a)=(-249295) │ (379877,12) │ │ 67 │ (379877,13) │ (a)=(-249295) │ (379877,13) │ │ 68 │ (379877,14) │ (a)=(-249295) │ (379877,14) │ │ 69 │ (379877,15) │ (a)=(-249295) │ (379877,15) │ │ 70 │ (379877,16) │ (a)=(-249295) │ (379877,16) │ │ 71 │ (379877,17) │ (a)=(-249295) │ (379877,17) │ │ 72 │ (379877,18) │ (a)=(-249295) │ (379877,18) │ │ 73 │ (379877,19) │ (a)=(-249295) │ (379877,19) │ │ 74 │ (379877,20) │ (a)=(-249295) │ (379877,20) │ │ 75 │ (379877,21) │ (a)=(-249295) │ (379877,21) │ │ 76 │ (379878,1) │ (a)=(-249295) │ (379878,1) │ │ 77 │ (379878,4) │ (a)=(-249295) │ (379878,4) │ │ 78 │ (379874,20) │ (a)=(-249294) │ (379874,20) │ │ 79 │ (379875,2) │ (a)=(-249294) │ (379875,2) │ │ 80 │ (379875,3) │ (a)=(-249294) │ (379875,3) │ │ 81 │ (379875,5) │ (a)=(-249294) │ (379875,5) │ │ 82 │ (379875,6) │ (a)=(-249294) │ (379875,6) │ │ 83 │ (379875,7) │ (a)=(-249294) │ (379875,7) │ │ 84 │ (379875,8) │ (a)=(-249294) │ (379875,8) │ │ 85 │ (379875,9) │ (a)=(-249294) │ (379875,9) │ │ 86 │ (379875,10) │ (a)=(-249294) │ (379875,10) │ │ 87 │ (379875,11) │ (a)=(-249294) │ (379875,11) │ │ 88 │ (379875,12) │ (a)=(-249294) │ (379875,12) │ │ 89 │ (379875,13) │ (a)=(-249294) │ (379875,13) │ │ 90 │ (379875,14) │ (a)=(-249294) │ (379875,14) │ │ 91 │ (379875,15) │ (a)=(-249294) │ (379875,15) │ │ 92 │ (379875,16) │ (a)=(-249294) │ (379875,16) │ │ 93 │ (379875,17) │ (a)=(-249294) │ (379875,17) │ │ 94 │ (379875,18) │ (a)=(-249294) │ (379875,18) │ │ 95 │ (379875,19) │ (a)=(-249294) │ (379875,19) │ │ 96 │ (379875,20) │ (a)=(-249294) │ (379875,20) │ │ 97 │ (379875,21) │ (a)=(-249294) │ (379875,21) │ │ 98 │ (379876,1) │ (a)=(-249294) │ (379876,1) │ │ 99 │ (379876,2) │ (a)=(-249294) │ (379876,2) │ │ 100 │ (379876,3) │ (a)=(-249294) │ (379876,3) │ │ 101 │ (379876,4) │ (a)=(-249294) │ (379876,4) │ │ 102 │ (379876,5) │ (a)=(-249294) │ (379876,5) │ │ 103 │ (379876,6) │ (a)=(-249294) │ (379876,6) │ │ 104 │ (379876,7) │ (a)=(-249294) │ (379876,7) │ │ 105 │ (379876,8) │ (a)=(-249294) │ (379876,8) │ │ 106 │ 
(379876,9) │ (a)=(-249294) │ (379876,9) │ │ 107 │ (379876,11) │ (a)=(-249294) │ (379876,11) │ │ 108 │ (379876,13) │ (a)=(-249294) │ (379876,13) │ │ 109 │ (379876,15) │ (a)=(-249294) │ (379876,15) │ │ 110 │ (379873,11) │ (a)=(-249293) │ (379873,11) │ │ 111 │ (379873,13) │ (a)=(-249293) │ (379873,13) │ │ 112 │ (379873,14) │ (a)=(-249293) │ (379873,14) │ │ 113 │ (379873,15) │ (a)=(-249293) │ (379873,15) │ │ 114 │ (379873,16) │ (a)=(-249293) │ (379873,16) │ │ 115 │ (379873,17) │ (a)=(-249293) │ (379873,17) │ │ 116 │ (379873,18) │ (a)=(-249293) │ (379873,18) │ │ 117 │ (379873,19) │ (a)=(-249293) │ (379873,19) │ │ 118 │ (379873,20) │ (a)=(-249293) │ (379873,20) │ │ 119 │ (379873,21) │ (a)=(-249293) │ (379873,21) │ │ 120 │ (379874,1) │ (a)=(-249293) │ (379874,1) │ │ 121 │ (379874,2) │ (a)=(-249293) │ (379874,2) │ │ 122 │ (379874,3) │ (a)=(-249293) │ (379874,3) │ │ 123 │ (379874,4) │ (a)=(-249293) │ (379874,4) │ │ 124 │ (379874,5) │ (a)=(-249293) │ (379874,5) │ │ 125 │ (379874,6) │ (a)=(-249293) │ (379874,6) │ │ 126 │ (379874,7) │ (a)=(-249293) │ (379874,7) │ │ 127 │ (379874,8) │ (a)=(-249293) │ (379874,8) │ │ 128 │ (379874,9) │ (a)=(-249293) │ (379874,9) │ │ 129 │ (379874,10) │ (a)=(-249293) │ (379874,10) │ │ 130 │ (379874,11) │ (a)=(-249293) │ (379874,11) │ │ 131 │ (379874,12) │ (a)=(-249293) │ (379874,12) │ │ 132 │ (379874,13) │ (a)=(-249293) │ (379874,13) │ │ 133 │ (379874,14) │ (a)=(-249293) │ (379874,14) │ │ 134 │ (379874,15) │ (a)=(-249293) │ (379874,15) │ │ 135 │ (379874,16) │ (a)=(-249293) │ (379874,16) │ │ 136 │ (379874,17) │ (a)=(-249293) │ (379874,17) │ │ 137 │ (379874,18) │ (a)=(-249293) │ (379874,18) │ │ 138 │ (379874,19) │ (a)=(-249293) │ (379874,19) │ │ 139 │ (379874,21) │ (a)=(-249293) │ (379874,21) │ │ 140 │ (379875,1) │ (a)=(-249293) │ (379875,1) │ │ 141 │ (379875,4) │ (a)=(-249293) │ (379875,4) │ │ 142 │ (379871,21) │ (a)=(-249292) │ (379871,21) │ │ 143 │ (379872,2) │ (a)=(-249292) │ (379872,2) │ │ 144 │ (379872,3) │ (a)=(-249292) │ (379872,3) │ │ 145 │ (379872,4) │ (a)=(-249292) │ (379872,4) │ │ 146 │ (379872,5) │ (a)=(-249292) │ (379872,5) │ │ 147 │ (379872,6) │ (a)=(-249292) │ (379872,6) │ │ 148 │ (379872,7) │ (a)=(-249292) │ (379872,7) │ │ 149 │ (379872,8) │ (a)=(-249292) │ (379872,8) │ │ 150 │ (379872,9) │ (a)=(-249292) │ (379872,9) │ │ 151 │ (379872,10) │ (a)=(-249292) │ (379872,10) │ │ 152 │ (379872,11) │ (a)=(-249292) │ (379872,11) │ │ 153 │ (379872,12) │ (a)=(-249292) │ (379872,12) │ │ 154 │ (379872,13) │ (a)=(-249292) │ (379872,13) │ │ 155 │ (379872,14) │ (a)=(-249292) │ (379872,14) │ │ 156 │ (379872,15) │ (a)=(-249292) │ (379872,15) │ │ 157 │ (379872,16) │ (a)=(-249292) │ (379872,16) │ │ 158 │ (379872,17) │ (a)=(-249292) │ (379872,17) │ │ 159 │ (379872,18) │ (a)=(-249292) │ (379872,18) │ │ 160 │ (379872,19) │ (a)=(-249292) │ (379872,19) │ │ 161 │ (379872,20) │ (a)=(-249292) │ (379872,20) │ │ 162 │ (379872,21) │ (a)=(-249292) │ (379872,21) │ │ 163 │ (379873,1) │ (a)=(-249292) │ (379873,1) │ │ 164 │ (379873,2) │ (a)=(-249292) │ (379873,2) │ │ 165 │ (379873,3) │ (a)=(-249292) │ (379873,3) │ │ 166 │ (379873,4) │ (a)=(-249292) │ (379873,4) │ │ 167 │ (379873,5) │ (a)=(-249292) │ (379873,5) │ │ 168 │ (379873,6) │ (a)=(-249292) │ (379873,6) │ │ 169 │ (379873,7) │ (a)=(-249292) │ (379873,7) │ │ 170 │ (379873,8) │ (a)=(-249292) │ (379873,8) │ │ 171 │ (379873,9) │ (a)=(-249292) │ (379873,9) │ │ 172 │ (379873,10) │ (a)=(-249292) │ (379873,10) │ │ 173 │ (379873,12) │ (a)=(-249292) │ (379873,12) │ │ 174 │ (379870,9) │ (a)=(-249291) │ (379870,9) │ │ 175 │ (379870,11) │ 
(a)=(-249291) │ (379870,11) │ │ 176 │ (379870,12) │ (a)=(-249291) │ (379870,12) │ │ 177 │ (379870,14) │ (a)=(-249291) │ (379870,14) │ │ 178 │ (379870,15) │ (a)=(-249291) │ (379870,15) │ │ 179 │ (379870,16) │ (a)=(-249291) │ (379870,16) │ │ 180 │ (379870,17) │ (a)=(-249291) │ (379870,17) │ │ 181 │ (379870,18) │ (a)=(-249291) │ (379870,18) │ │ 182 │ (379870,19) │ (a)=(-249291) │ (379870,19) │ │ 183 │ (379870,20) │ (a)=(-249291) │ (379870,20) │ │ 184 │ (379870,21) │ (a)=(-249291) │ (379870,21) │ │ 185 │ (379871,1) │ (a)=(-249291) │ (379871,1) │ │ 186 │ (379871,2) │ (a)=(-249291) │ (379871,2) │ │ 187 │ (379871,3) │ (a)=(-249291) │ (379871,3) │ │ 188 │ (379871,4) │ (a)=(-249291) │ (379871,4) │ │ 189 │ (379871,5) │ (a)=(-249291) │ (379871,5) │ │ 190 │ (379871,6) │ (a)=(-249291) │ (379871,6) │ │ 191 │ (379871,7) │ (a)=(-249291) │ (379871,7) │ │ 192 │ (379871,8) │ (a)=(-249291) │ (379871,8) │ │ 193 │ (379871,9) │ (a)=(-249291) │ (379871,9) │ │ 194 │ (379871,10) │ (a)=(-249291) │ (379871,10) │ │ 195 │ (379871,11) │ (a)=(-249291) │ (379871,11) │ │ 196 │ (379871,12) │ (a)=(-249291) │ (379871,12) │ │ 197 │ (379871,13) │ (a)=(-249291) │ (379871,13) │ │ 198 │ (379871,14) │ (a)=(-249291) │ (379871,14) │ │ 199 │ (379871,15) │ (a)=(-249291) │ (379871,15) │ │ 200 │ (379871,16) │ (a)=(-249291) │ (379871,16) │ │ 201 │ (379871,17) │ (a)=(-249291) │ (379871,17) │ │ 202 │ (379871,18) │ (a)=(-249291) │ (379871,18) │ │ 203 │ (379871,19) │ (a)=(-249291) │ (379871,19) │ │ 204 │ (379871,20) │ (a)=(-249291) │ (379871,20) │ │ 205 │ (379872,1) │ (a)=(-249291) │ (379872,1) │ │ 206 │ (379868,20) │ (a)=(-249290) │ (379868,20) │ │ 207 │ (379868,21) │ (a)=(-249290) │ (379868,21) │ │ 208 │ (379869,1) │ (a)=(-249290) │ (379869,1) │ │ 209 │ (379869,3) │ (a)=(-249290) │ (379869,3) │ │ 210 │ (379869,4) │ (a)=(-249290) │ (379869,4) │ │ 211 │ (379869,5) │ (a)=(-249290) │ (379869,5) │ │ 212 │ (379869,6) │ (a)=(-249290) │ (379869,6) │ │ 213 │ (379869,7) │ (a)=(-249290) │ (379869,7) │ │ 214 │ (379869,8) │ (a)=(-249290) │ (379869,8) │ │ 215 │ (379869,9) │ (a)=(-249290) │ (379869,9) │ │ 216 │ (379869,10) │ (a)=(-249290) │ (379869,10) │ │ 217 │ (379869,11) │ (a)=(-249290) │ (379869,11) │ │ 218 │ (379869,12) │ (a)=(-249290) │ (379869,12) │ │ 219 │ (379869,13) │ (a)=(-249290) │ (379869,13) │ │ 220 │ (379869,14) │ (a)=(-249290) │ (379869,14) │ │ 221 │ (379869,15) │ (a)=(-249290) │ (379869,15) │ │ 222 │ (379869,16) │ (a)=(-249290) │ (379869,16) │ │ 223 │ (379869,17) │ (a)=(-249290) │ (379869,17) │ │ 224 │ (379869,18) │ (a)=(-249290) │ (379869,18) │ │ 225 │ (379869,19) │ (a)=(-249290) │ (379869,19) │ │ 226 │ (379869,20) │ (a)=(-249290) │ (379869,20) │ │ 227 │ (379869,21) │ (a)=(-249290) │ (379869,21) │ │ 228 │ (379870,1) │ (a)=(-249290) │ (379870,1) │ │ 229 │ (379870,2) │ (a)=(-249290) │ (379870,2) │ │ 230 │ (379870,3) │ (a)=(-249290) │ (379870,3) │ │ 231 │ (379870,4) │ (a)=(-249290) │ (379870,4) │ │ 232 │ (379870,5) │ (a)=(-249290) │ (379870,5) │ │ 233 │ (379870,6) │ (a)=(-249290) │ (379870,6) │ │ 234 │ (379870,7) │ (a)=(-249290) │ (379870,7) │ │ 235 │ (379870,8) │ (a)=(-249290) │ (379870,8) │ │ 236 │ (379870,10) │ (a)=(-249290) │ (379870,10) │ │ 237 │ (379870,13) │ (a)=(-249290) │ (379870,13) │ │ 238 │ (379867,10) │ (a)=(-249289) │ (379867,10) │ │ 239 │ (379867,11) │ (a)=(-249289) │ (379867,11) │ │ 240 │ (379867,12) │ (a)=(-249289) │ (379867,12) │ │ 241 │ (379867,13) │ (a)=(-249289) │ (379867,13) │ │ 242 │ (379867,14) │ (a)=(-249289) │ (379867,14) │ │ 243 │ (379867,15) │ (a)=(-249289) │ (379867,15) │ │ 244 │ (379867,16) │ (a)=(-249289) │ 
(379867,16) │ │ 245 │ (379867,17) │ (a)=(-249289) │ (379867,17) │ │ 246 │ (379867,18) │ (a)=(-249289) │ (379867,18) │ │ 247 │ (379867,19) │ (a)=(-249289) │ (379867,19) │ │ 248 │ (379867,20) │ (a)=(-249289) │ (379867,20) │ │ 249 │ (379867,21) │ (a)=(-249289) │ (379867,21) │ │ 250 │ (379868,1) │ (a)=(-249289) │ (379868,1) │ │ 251 │ (379868,2) │ (a)=(-249289) │ (379868,2) │ │ 252 │ (379868,3) │ (a)=(-249289) │ (379868,3) │ │ 253 │ (379868,4) │ (a)=(-249289) │ (379868,4) │ │ 254 │ (379868,5) │ (a)=(-249289) │ (379868,5) │ │ 255 │ (379868,6) │ (a)=(-249289) │ (379868,6) │ │ 256 │ (379868,7) │ (a)=(-249289) │ (379868,7) │ │ 257 │ (379868,8) │ (a)=(-249289) │ (379868,8) │ │ 258 │ (379868,9) │ (a)=(-249289) │ (379868,9) │ │ 259 │ (379868,10) │ (a)=(-249289) │ (379868,10) │ │ 260 │ (379868,11) │ (a)=(-249289) │ (379868,11) │ │ 261 │ (379868,12) │ (a)=(-249289) │ (379868,12) │ │ 262 │ (379868,13) │ (a)=(-249289) │ (379868,13) │ │ 263 │ (379868,14) │ (a)=(-249289) │ (379868,14) │ │ 264 │ (379868,15) │ (a)=(-249289) │ (379868,15) │ │ 265 │ (379868,16) │ (a)=(-249289) │ (379868,16) │ │ 266 │ (379868,17) │ (a)=(-249289) │ (379868,17) │ │ 267 │ (379868,18) │ (a)=(-249289) │ (379868,18) │ │ 268 │ (379868,19) │ (a)=(-249289) │ (379868,19) │ │ 269 │ (379869,2) │ (a)=(-249289) │ (379869,2) │ │ 270 │ (379865,19) │ (a)=(-249288) │ (379865,19) │ │ 271 │ (379865,20) │ (a)=(-249288) │ (379865,20) │ │ 272 │ (379865,21) │ (a)=(-249288) │ (379865,21) │ │ 273 │ (379866,2) │ (a)=(-249288) │ (379866,2) │ │ 274 │ (379866,3) │ (a)=(-249288) │ (379866,3) │ │ 275 │ (379866,4) │ (a)=(-249288) │ (379866,4) │ │ 276 │ (379866,5) │ (a)=(-249288) │ (379866,5) │ │ 277 │ (379866,6) │ (a)=(-249288) │ (379866,6) │ │ 278 │ (379866,7) │ (a)=(-249288) │ (379866,7) │ │ 279 │ (379866,8) │ (a)=(-249288) │ (379866,8) │ │ 280 │ (379866,9) │ (a)=(-249288) │ (379866,9) │ │ 281 │ (379866,10) │ (a)=(-249288) │ (379866,10) │ │ 282 │ (379866,11) │ (a)=(-249288) │ (379866,11) │ │ 283 │ (379866,12) │ (a)=(-249288) │ (379866,12) │ │ 284 │ (379866,13) │ (a)=(-249288) │ (379866,13) │ │ 285 │ (379866,14) │ (a)=(-249288) │ (379866,14) │ │ 286 │ (379866,15) │ (a)=(-249288) │ (379866,15) │ │ 287 │ (379866,16) │ (a)=(-249288) │ (379866,16) │ │ 288 │ (379866,17) │ (a)=(-249288) │ (379866,17) │ │ 289 │ (379866,18) │ (a)=(-249288) │ (379866,18) │ │ 290 │ (379866,19) │ (a)=(-249288) │ (379866,19) │ │ 291 │ (379866,20) │ (a)=(-249288) │ (379866,20) │ │ 292 │ (379866,21) │ (a)=(-249288) │ (379866,21) │ │ 293 │ (379867,1) │ (a)=(-249288) │ (379867,1) │ │ 294 │ (379867,2) │ (a)=(-249288) │ (379867,2) │ │ 295 │ (379867,3) │ (a)=(-249288) │ (379867,3) │ │ 296 │ (379867,4) │ (a)=(-249288) │ (379867,4) │ │ 297 │ (379867,5) │ (a)=(-249288) │ (379867,5) │ │ 298 │ (379867,6) │ (a)=(-249288) │ (379867,6) │ │ 299 │ (379867,7) │ (a)=(-249288) │ (379867,7) │ │ 300 │ (379867,8) │ (a)=(-249288) │ (379867,8) │ │ 301 │ (379867,9) │ (a)=(-249288) │ (379867,9) │ │ 302 │ (379864,9) │ (a)=(-249287) │ (379864,9) │ │ 303 │ (379864,10) │ (a)=(-249287) │ (379864,10) │ │ 304 │ (379864,11) │ (a)=(-249287) │ (379864,11) │ │ 305 │ (379864,12) │ (a)=(-249287) │ (379864,12) │ │ 306 │ (379864,13) │ (a)=(-249287) │ (379864,13) │ │ 307 │ (379864,14) │ (a)=(-249287) │ (379864,14) │ │ 308 │ (379864,15) │ (a)=(-249287) │ (379864,15) │ │ 309 │ (379864,16) │ (a)=(-249287) │ (379864,16) │ │ 310 │ (379864,17) │ (a)=(-249287) │ (379864,17) │ │ 311 │ (379864,18) │ (a)=(-249287) │ (379864,18) │ │ 312 │ (379864,19) │ (a)=(-249287) │ (379864,19) │ │ 313 │ (379864,20) │ (a)=(-249287) │ (379864,20) │ │ 314 │ 
(379864,21) │ (a)=(-249287) │ (379864,21) │ │ 315 │ (379865,1) │ (a)=(-249287) │ (379865,1) │ │ 316 │ (379865,2) │ (a)=(-249287) │ (379865,2) │ │ 317 │ (379865,3) │ (a)=(-249287) │ (379865,3) │ │ 318 │ (379865,4) │ (a)=(-249287) │ (379865,4) │ │ 319 │ (379865,5) │ (a)=(-249287) │ (379865,5) │ │ 320 │ (379865,6) │ (a)=(-249287) │ (379865,6) │ │ 321 │ (379865,7) │ (a)=(-249287) │ (379865,7) │ │ 322 │ (379865,8) │ (a)=(-249287) │ (379865,8) │ │ 323 │ (379865,9) │ (a)=(-249287) │ (379865,9) │ │ 324 │ (379865,10) │ (a)=(-249287) │ (379865,10) │ │ 325 │ (379865,11) │ (a)=(-249287) │ (379865,11) │ │ 326 │ (379865,12) │ (a)=(-249287) │ (379865,12) │ │ 327 │ (379865,13) │ (a)=(-249287) │ (379865,13) │ │ 328 │ (379865,14) │ (a)=(-249287) │ (379865,14) │ │ 329 │ (379865,15) │ (a)=(-249287) │ (379865,15) │ │ 330 │ (379865,16) │ (a)=(-249287) │ (379865,16) │ │ 331 │ (379865,17) │ (a)=(-249287) │ (379865,17) │ │ 332 │ (379865,18) │ (a)=(-249287) │ (379865,18) │ │ 333 │ (379866,1) │ (a)=(-249287) │ (379866,1) │ │ 334 │ (379862,16) │ (a)=(-249286) │ (379862,16) │ │ 335 │ (379862,17) │ (a)=(-249286) │ (379862,17) │ │ 336 │ (379862,20) │ (a)=(-249286) │ (379862,20) │ │ 337 │ (379863,1) │ (a)=(-249286) │ (379863,1) │ │ 338 │ (379863,2) │ (a)=(-249286) │ (379863,2) │ │ 339 │ (379863,3) │ (a)=(-249286) │ (379863,3) │ │ 340 │ (379863,4) │ (a)=(-249286) │ (379863,4) │ │ 341 │ (379863,5) │ (a)=(-249286) │ (379863,5) │ │ 342 │ (379863,6) │ (a)=(-249286) │ (379863,6) │ │ 343 │ (379863,7) │ (a)=(-249286) │ (379863,7) │ │ 344 │ (379863,8) │ (a)=(-249286) │ (379863,8) │ │ 345 │ (379863,9) │ (a)=(-249286) │ (379863,9) │ │ 346 │ (379863,10) │ (a)=(-249286) │ (379863,10) │ │ 347 │ (379863,11) │ (a)=(-249286) │ (379863,11) │ │ 348 │ (379863,12) │ (a)=(-249286) │ (379863,12) │ │ 349 │ (379863,13) │ (a)=(-249286) │ (379863,13) │ │ 350 │ (379863,14) │ (a)=(-249286) │ (379863,14) │ │ 351 │ (379863,15) │ (a)=(-249286) │ (379863,15) │ │ 352 │ (379863,16) │ (a)=(-249286) │ (379863,16) │ │ 353 │ (379863,17) │ (a)=(-249286) │ (379863,17) │ │ 354 │ (379863,18) │ (a)=(-249286) │ (379863,18) │ │ 355 │ (379863,19) │ (a)=(-249286) │ (379863,19) │ │ 356 │ (379863,20) │ (a)=(-249286) │ (379863,20) │ │ 357 │ (379863,21) │ (a)=(-249286) │ (379863,21) │ │ 358 │ (379864,1) │ (a)=(-249286) │ (379864,1) │ │ 359 │ (379864,2) │ (a)=(-249286) │ (379864,2) │ │ 360 │ (379864,3) │ (a)=(-249286) │ (379864,3) │ │ 361 │ (379864,4) │ (a)=(-249286) │ (379864,4) │ │ 362 │ (379864,5) │ (a)=(-249286) │ (379864,5) │ │ 363 │ (379864,6) │ (a)=(-249286) │ (379864,6) │ │ 364 │ (379864,7) │ (a)=(-249286) │ (379864,7) │ │ 365 │ (379864,8) │ (a)=(-249286) │ (379864,8) │ │ 366 │ (379861,6) │ (a)=(-249285) │ (379861,6) │ │ 367 │ (379861,7) │ (a)=(-249285) │ (379861,7) │ └────────────┴───────────────┴───────────────┴─────────────┘ (367 rows) And here's what block 5555 from "idx" looks like (note that the fact that I'm using the same index block number as before has no particular significance): ────────────┬──────────────┬─────────────┬────────────┐ │ itemoffset │ ctid │ data │ htid │ ├────────────┼──────────────┼─────────────┼────────────┤ │ 1 │ (96327,4097) │ (a)=(63216) │ (96327,15) │ │ 2 │ (96310,7) │ (a)=(63204) │ (96310,7) │ │ 3 │ (96310,8) │ (a)=(63204) │ (96310,8) │ │ 4 │ (96310,9) │ (a)=(63204) │ (96310,9) │ │ 5 │ (96310,10) │ (a)=(63204) │ (96310,10) │ │ 6 │ (96310,11) │ (a)=(63204) │ (96310,11) │ │ 7 │ (96310,12) │ (a)=(63204) │ (96310,12) │ │ 8 │ (96310,13) │ (a)=(63204) │ (96310,13) │ │ 9 │ (96310,14) │ (a)=(63204) │ (96310,14) │ │ 10 │ (96310,15) │ 
(a)=(63204) │ (96310,15) │ │ 11 │ (96310,16) │ (a)=(63204) │ (96310,16) │ │ 12 │ (96310,17) │ (a)=(63204) │ (96310,17) │ │ 13 │ (96310,18) │ (a)=(63204) │ (96310,18) │ │ 14 │ (96310,19) │ (a)=(63205) │ (96310,19) │ │ 15 │ (96310,20) │ (a)=(63205) │ (96310,20) │ │ 16 │ (96310,21) │ (a)=(63205) │ (96310,21) │ │ 17 │ (96311,1) │ (a)=(63205) │ (96311,1) │ │ 18 │ (96311,2) │ (a)=(63205) │ (96311,2) │ │ 19 │ (96311,3) │ (a)=(63205) │ (96311,3) │ │ 20 │ (96311,4) │ (a)=(63205) │ (96311,4) │ │ 21 │ (96311,5) │ (a)=(63205) │ (96311,5) │ │ 22 │ (96311,6) │ (a)=(63205) │ (96311,6) │ │ 23 │ (96311,7) │ (a)=(63205) │ (96311,7) │ │ 24 │ (96311,8) │ (a)=(63205) │ (96311,8) │ │ 25 │ (96311,9) │ (a)=(63205) │ (96311,9) │ │ 26 │ (96311,10) │ (a)=(63205) │ (96311,10) │ │ 27 │ (96311,11) │ (a)=(63205) │ (96311,11) │ │ 28 │ (96311,12) │ (a)=(63205) │ (96311,12) │ │ 29 │ (96311,13) │ (a)=(63205) │ (96311,13) │ │ 30 │ (96311,14) │ (a)=(63205) │ (96311,14) │ │ 31 │ (96311,15) │ (a)=(63205) │ (96311,15) │ │ 32 │ (96311,16) │ (a)=(63205) │ (96311,16) │ │ 33 │ (96311,17) │ (a)=(63205) │ (96311,17) │ │ 34 │ (96311,18) │ (a)=(63205) │ (96311,18) │ │ 35 │ (96311,19) │ (a)=(63205) │ (96311,19) │ │ 36 │ (96311,20) │ (a)=(63205) │ (96311,20) │ │ 37 │ (96311,21) │ (a)=(63205) │ (96311,21) │ │ 38 │ (96312,1) │ (a)=(63205) │ (96312,1) │ │ 39 │ (96312,2) │ (a)=(63205) │ (96312,2) │ │ 40 │ (96312,3) │ (a)=(63205) │ (96312,3) │ │ 41 │ (96312,4) │ (a)=(63205) │ (96312,4) │ │ 42 │ (96312,5) │ (a)=(63205) │ (96312,5) │ │ 43 │ (96312,6) │ (a)=(63205) │ (96312,6) │ │ 44 │ (96312,7) │ (a)=(63205) │ (96312,7) │ │ 45 │ (96312,9) │ (a)=(63205) │ (96312,9) │ │ 46 │ (96312,8) │ (a)=(63206) │ (96312,8) │ │ 47 │ (96312,10) │ (a)=(63206) │ (96312,10) │ │ 48 │ (96312,11) │ (a)=(63206) │ (96312,11) │ │ 49 │ (96312,12) │ (a)=(63206) │ (96312,12) │ │ 50 │ (96312,13) │ (a)=(63206) │ (96312,13) │ │ 51 │ (96312,14) │ (a)=(63206) │ (96312,14) │ │ 52 │ (96312,15) │ (a)=(63206) │ (96312,15) │ │ 53 │ (96312,16) │ (a)=(63206) │ (96312,16) │ │ 54 │ (96312,17) │ (a)=(63206) │ (96312,17) │ │ 55 │ (96312,18) │ (a)=(63206) │ (96312,18) │ │ 56 │ (96312,19) │ (a)=(63206) │ (96312,19) │ │ 57 │ (96312,20) │ (a)=(63206) │ (96312,20) │ │ 58 │ (96312,21) │ (a)=(63206) │ (96312,21) │ │ 59 │ (96313,1) │ (a)=(63206) │ (96313,1) │ │ 60 │ (96313,2) │ (a)=(63206) │ (96313,2) │ │ 61 │ (96313,3) │ (a)=(63206) │ (96313,3) │ │ 62 │ (96313,4) │ (a)=(63206) │ (96313,4) │ │ 63 │ (96313,5) │ (a)=(63206) │ (96313,5) │ │ 64 │ (96313,6) │ (a)=(63206) │ (96313,6) │ │ 65 │ (96313,7) │ (a)=(63206) │ (96313,7) │ │ 66 │ (96313,8) │ (a)=(63206) │ (96313,8) │ │ 67 │ (96313,9) │ (a)=(63206) │ (96313,9) │ │ 68 │ (96313,10) │ (a)=(63206) │ (96313,10) │ │ 69 │ (96313,11) │ (a)=(63206) │ (96313,11) │ │ 70 │ (96313,12) │ (a)=(63206) │ (96313,12) │ │ 71 │ (96313,13) │ (a)=(63206) │ (96313,13) │ │ 72 │ (96313,14) │ (a)=(63206) │ (96313,14) │ │ 73 │ (96313,15) │ (a)=(63206) │ (96313,15) │ │ 74 │ (96313,16) │ (a)=(63206) │ (96313,16) │ │ 75 │ (96313,17) │ (a)=(63206) │ (96313,17) │ │ 76 │ (96313,18) │ (a)=(63206) │ (96313,18) │ │ 77 │ (96313,20) │ (a)=(63206) │ (96313,20) │ │ 78 │ (96313,19) │ (a)=(63207) │ (96313,19) │ │ 79 │ (96313,21) │ (a)=(63207) │ (96313,21) │ │ 80 │ (96314,1) │ (a)=(63207) │ (96314,1) │ │ 81 │ (96314,2) │ (a)=(63207) │ (96314,2) │ │ 82 │ (96314,3) │ (a)=(63207) │ (96314,3) │ │ 83 │ (96314,4) │ (a)=(63207) │ (96314,4) │ │ 84 │ (96314,5) │ (a)=(63207) │ (96314,5) │ │ 85 │ (96314,6) │ (a)=(63207) │ (96314,6) │ │ 86 │ (96314,7) │ (a)=(63207) │ (96314,7) │ │ 87 │ (96314,8) │ 
(a)=(63207) │ (96314,8) │ │ 88 │ (96314,9) │ (a)=(63207) │ (96314,9) │ │ 89 │ (96314,10) │ (a)=(63207) │ (96314,10) │ │ 90 │ (96314,11) │ (a)=(63207) │ (96314,11) │ │ 91 │ (96314,12) │ (a)=(63207) │ (96314,12) │ │ 92 │ (96314,13) │ (a)=(63207) │ (96314,13) │ │ 93 │ (96314,14) │ (a)=(63207) │ (96314,14) │ │ 94 │ (96314,15) │ (a)=(63207) │ (96314,15) │ │ 95 │ (96314,16) │ (a)=(63207) │ (96314,16) │ │ 96 │ (96314,17) │ (a)=(63207) │ (96314,17) │ │ 97 │ (96314,18) │ (a)=(63207) │ (96314,18) │ │ 98 │ (96314,19) │ (a)=(63207) │ (96314,19) │ │ 99 │ (96314,20) │ (a)=(63207) │ (96314,20) │ │ 100 │ (96314,21) │ (a)=(63207) │ (96314,21) │ │ 101 │ (96315,1) │ (a)=(63207) │ (96315,1) │ │ 102 │ (96315,2) │ (a)=(63207) │ (96315,2) │ │ 103 │ (96315,3) │ (a)=(63207) │ (96315,3) │ │ 104 │ (96315,4) │ (a)=(63207) │ (96315,4) │ │ 105 │ (96315,5) │ (a)=(63207) │ (96315,5) │ │ 106 │ (96315,6) │ (a)=(63207) │ (96315,6) │ │ 107 │ (96315,7) │ (a)=(63207) │ (96315,7) │ │ 108 │ (96315,8) │ (a)=(63207) │ (96315,8) │ │ 109 │ (96315,12) │ (a)=(63207) │ (96315,12) │ │ 110 │ (96315,9) │ (a)=(63208) │ (96315,9) │ │ 111 │ (96315,10) │ (a)=(63208) │ (96315,10) │ │ 112 │ (96315,11) │ (a)=(63208) │ (96315,11) │ │ 113 │ (96315,13) │ (a)=(63208) │ (96315,13) │ │ 114 │ (96315,14) │ (a)=(63208) │ (96315,14) │ │ 115 │ (96315,15) │ (a)=(63208) │ (96315,15) │ │ 116 │ (96315,16) │ (a)=(63208) │ (96315,16) │ │ 117 │ (96315,17) │ (a)=(63208) │ (96315,17) │ │ 118 │ (96315,18) │ (a)=(63208) │ (96315,18) │ │ 119 │ (96315,19) │ (a)=(63208) │ (96315,19) │ │ 120 │ (96315,20) │ (a)=(63208) │ (96315,20) │ │ 121 │ (96315,21) │ (a)=(63208) │ (96315,21) │ │ 122 │ (96316,1) │ (a)=(63208) │ (96316,1) │ │ 123 │ (96316,2) │ (a)=(63208) │ (96316,2) │ │ 124 │ (96316,3) │ (a)=(63208) │ (96316,3) │ │ 125 │ (96316,4) │ (a)=(63208) │ (96316,4) │ │ 126 │ (96316,5) │ (a)=(63208) │ (96316,5) │ │ 127 │ (96316,6) │ (a)=(63208) │ (96316,6) │ │ 128 │ (96316,7) │ (a)=(63208) │ (96316,7) │ │ 129 │ (96316,8) │ (a)=(63208) │ (96316,8) │ │ 130 │ (96316,9) │ (a)=(63208) │ (96316,9) │ │ 131 │ (96316,10) │ (a)=(63208) │ (96316,10) │ │ 132 │ (96316,11) │ (a)=(63208) │ (96316,11) │ │ 133 │ (96316,12) │ (a)=(63208) │ (96316,12) │ │ 134 │ (96316,13) │ (a)=(63208) │ (96316,13) │ │ 135 │ (96316,14) │ (a)=(63208) │ (96316,14) │ │ 136 │ (96316,15) │ (a)=(63208) │ (96316,15) │ │ 137 │ (96316,16) │ (a)=(63208) │ (96316,16) │ │ 138 │ (96316,17) │ (a)=(63208) │ (96316,17) │ │ 139 │ (96316,18) │ (a)=(63208) │ (96316,18) │ │ 140 │ (96316,19) │ (a)=(63208) │ (96316,19) │ │ 141 │ (96316,20) │ (a)=(63208) │ (96316,20) │ │ 142 │ (96316,21) │ (a)=(63209) │ (96316,21) │ │ 143 │ (96317,1) │ (a)=(63209) │ (96317,1) │ │ 144 │ (96317,2) │ (a)=(63209) │ (96317,2) │ │ 145 │ (96317,3) │ (a)=(63209) │ (96317,3) │ │ 146 │ (96317,4) │ (a)=(63209) │ (96317,4) │ │ 147 │ (96317,5) │ (a)=(63209) │ (96317,5) │ │ 148 │ (96317,6) │ (a)=(63209) │ (96317,6) │ │ 149 │ (96317,7) │ (a)=(63209) │ (96317,7) │ │ 150 │ (96317,8) │ (a)=(63209) │ (96317,8) │ │ 151 │ (96317,9) │ (a)=(63209) │ (96317,9) │ │ 152 │ (96317,10) │ (a)=(63209) │ (96317,10) │ │ 153 │ (96317,11) │ (a)=(63209) │ (96317,11) │ │ 154 │ (96317,12) │ (a)=(63209) │ (96317,12) │ │ 155 │ (96317,13) │ (a)=(63209) │ (96317,13) │ │ 156 │ (96317,14) │ (a)=(63209) │ (96317,14) │ │ 157 │ (96317,15) │ (a)=(63209) │ (96317,15) │ │ 158 │ (96317,16) │ (a)=(63209) │ (96317,16) │ │ 159 │ (96317,17) │ (a)=(63209) │ (96317,17) │ │ 160 │ (96317,18) │ (a)=(63209) │ (96317,18) │ │ 161 │ (96317,19) │ (a)=(63209) │ (96317,19) │ │ 162 │ (96317,20) │ (a)=(63209) │ 
(96317,20) │ │ 163 │ (96317,21) │ (a)=(63209) │ (96317,21) │ │ 164 │ (96318,1) │ (a)=(63209) │ (96318,1) │ │ 165 │ (96318,2) │ (a)=(63209) │ (96318,2) │ │ 166 │ (96318,3) │ (a)=(63209) │ (96318,3) │ │ 167 │ (96318,4) │ (a)=(63209) │ (96318,4) │ │ 168 │ (96318,5) │ (a)=(63209) │ (96318,5) │ │ 169 │ (96318,6) │ (a)=(63209) │ (96318,6) │ │ 170 │ (96318,7) │ (a)=(63209) │ (96318,7) │ │ 171 │ (96318,8) │ (a)=(63209) │ (96318,8) │ │ 172 │ (96318,9) │ (a)=(63209) │ (96318,9) │ │ 173 │ (96318,10) │ (a)=(63209) │ (96318,10) │ │ 174 │ (96318,11) │ (a)=(63210) │ (96318,11) │ │ 175 │ (96318,12) │ (a)=(63210) │ (96318,12) │ │ 176 │ (96318,13) │ (a)=(63210) │ (96318,13) │ │ 177 │ (96318,14) │ (a)=(63210) │ (96318,14) │ │ 178 │ (96318,15) │ (a)=(63210) │ (96318,15) │ │ 179 │ (96318,16) │ (a)=(63210) │ (96318,16) │ │ 180 │ (96318,17) │ (a)=(63210) │ (96318,17) │ │ 181 │ (96318,18) │ (a)=(63210) │ (96318,18) │ │ 182 │ (96318,19) │ (a)=(63210) │ (96318,19) │ │ 183 │ (96318,20) │ (a)=(63210) │ (96318,20) │ │ 184 │ (96318,21) │ (a)=(63210) │ (96318,21) │ │ 185 │ (96319,1) │ (a)=(63210) │ (96319,1) │ │ 186 │ (96319,2) │ (a)=(63210) │ (96319,2) │ │ 187 │ (96319,3) │ (a)=(63210) │ (96319,3) │ │ 188 │ (96319,4) │ (a)=(63210) │ (96319,4) │ │ 189 │ (96319,5) │ (a)=(63210) │ (96319,5) │ │ 190 │ (96319,6) │ (a)=(63210) │ (96319,6) │ │ 191 │ (96319,7) │ (a)=(63210) │ (96319,7) │ │ 192 │ (96319,8) │ (a)=(63210) │ (96319,8) │ │ 193 │ (96319,9) │ (a)=(63210) │ (96319,9) │ │ 194 │ (96319,10) │ (a)=(63210) │ (96319,10) │ │ 195 │ (96319,11) │ (a)=(63210) │ (96319,11) │ │ 196 │ (96319,12) │ (a)=(63210) │ (96319,12) │ │ 197 │ (96319,13) │ (a)=(63210) │ (96319,13) │ │ 198 │ (96319,14) │ (a)=(63210) │ (96319,14) │ │ 199 │ (96319,15) │ (a)=(63210) │ (96319,15) │ │ 200 │ (96319,16) │ (a)=(63210) │ (96319,16) │ │ 201 │ (96319,17) │ (a)=(63210) │ (96319,17) │ │ 202 │ (96319,18) │ (a)=(63210) │ (96319,18) │ │ 203 │ (96319,19) │ (a)=(63210) │ (96319,19) │ │ 204 │ (96319,20) │ (a)=(63210) │ (96319,20) │ │ 205 │ (96320,1) │ (a)=(63210) │ (96320,1) │ │ 206 │ (96319,21) │ (a)=(63211) │ (96319,21) │ │ 207 │ (96320,2) │ (a)=(63211) │ (96320,2) │ │ 208 │ (96320,3) │ (a)=(63211) │ (96320,3) │ │ 209 │ (96320,4) │ (a)=(63211) │ (96320,4) │ │ 210 │ (96320,5) │ (a)=(63211) │ (96320,5) │ │ 211 │ (96320,6) │ (a)=(63211) │ (96320,6) │ │ 212 │ (96320,7) │ (a)=(63211) │ (96320,7) │ │ 213 │ (96320,8) │ (a)=(63211) │ (96320,8) │ │ 214 │ (96320,9) │ (a)=(63211) │ (96320,9) │ │ 215 │ (96320,10) │ (a)=(63211) │ (96320,10) │ │ 216 │ (96320,11) │ (a)=(63211) │ (96320,11) │ │ 217 │ (96320,12) │ (a)=(63211) │ (96320,12) │ │ 218 │ (96320,13) │ (a)=(63211) │ (96320,13) │ │ 219 │ (96320,14) │ (a)=(63211) │ (96320,14) │ │ 220 │ (96320,15) │ (a)=(63211) │ (96320,15) │ │ 221 │ (96320,16) │ (a)=(63211) │ (96320,16) │ │ 222 │ (96320,17) │ (a)=(63211) │ (96320,17) │ │ 223 │ (96320,18) │ (a)=(63211) │ (96320,18) │ │ 224 │ (96320,19) │ (a)=(63211) │ (96320,19) │ │ 225 │ (96320,20) │ (a)=(63211) │ (96320,20) │ │ 226 │ (96320,21) │ (a)=(63211) │ (96320,21) │ │ 227 │ (96321,1) │ (a)=(63211) │ (96321,1) │ │ 228 │ (96321,2) │ (a)=(63211) │ (96321,2) │ │ 229 │ (96321,3) │ (a)=(63211) │ (96321,3) │ │ 230 │ (96321,4) │ (a)=(63211) │ (96321,4) │ │ 231 │ (96321,5) │ (a)=(63211) │ (96321,5) │ │ 232 │ (96321,6) │ (a)=(63211) │ (96321,6) │ │ 233 │ (96321,7) │ (a)=(63211) │ (96321,7) │ │ 234 │ (96321,8) │ (a)=(63211) │ (96321,8) │ │ 235 │ (96321,9) │ (a)=(63211) │ (96321,9) │ │ 236 │ (96321,10) │ (a)=(63211) │ (96321,10) │ │ 237 │ (96321,11) │ (a)=(63211) │ (96321,11) │ │ 238 │ 
(96321,12) │ (a)=(63212) │ (96321,12) │ │ 239 │ (96321,13) │ (a)=(63212) │ (96321,13) │ │ 240 │ (96321,14) │ (a)=(63212) │ (96321,14) │ │ 241 │ (96321,15) │ (a)=(63212) │ (96321,15) │ │ 242 │ (96321,16) │ (a)=(63212) │ (96321,16) │ │ 243 │ (96321,17) │ (a)=(63212) │ (96321,17) │ │ 244 │ (96321,18) │ (a)=(63212) │ (96321,18) │ │ 245 │ (96321,19) │ (a)=(63212) │ (96321,19) │ │ 246 │ (96321,20) │ (a)=(63212) │ (96321,20) │ │ 247 │ (96321,21) │ (a)=(63212) │ (96321,21) │ │ 248 │ (96322,1) │ (a)=(63212) │ (96322,1) │ │ 249 │ (96322,2) │ (a)=(63212) │ (96322,2) │ │ 250 │ (96322,3) │ (a)=(63212) │ (96322,3) │ │ 251 │ (96322,4) │ (a)=(63212) │ (96322,4) │ │ 252 │ (96322,5) │ (a)=(63212) │ (96322,5) │ │ 253 │ (96322,6) │ (a)=(63212) │ (96322,6) │ │ 254 │ (96322,7) │ (a)=(63212) │ (96322,7) │ │ 255 │ (96322,8) │ (a)=(63212) │ (96322,8) │ │ 256 │ (96322,9) │ (a)=(63212) │ (96322,9) │ │ 257 │ (96322,10) │ (a)=(63212) │ (96322,10) │ │ 258 │ (96322,11) │ (a)=(63212) │ (96322,11) │ │ 259 │ (96322,12) │ (a)=(63212) │ (96322,12) │ │ 260 │ (96322,13) │ (a)=(63212) │ (96322,13) │ │ 261 │ (96322,14) │ (a)=(63212) │ (96322,14) │ │ 262 │ (96322,15) │ (a)=(63212) │ (96322,15) │ │ 263 │ (96322,16) │ (a)=(63212) │ (96322,16) │ │ 264 │ (96322,17) │ (a)=(63212) │ (96322,17) │ │ 265 │ (96322,18) │ (a)=(63212) │ (96322,18) │ │ 266 │ (96322,19) │ (a)=(63212) │ (96322,19) │ │ 267 │ (96322,20) │ (a)=(63212) │ (96322,20) │ │ 268 │ (96322,21) │ (a)=(63212) │ (96322,21) │ │ 269 │ (96323,3) │ (a)=(63212) │ (96323,3) │ │ 270 │ (96323,1) │ (a)=(63213) │ (96323,1) │ │ 271 │ (96323,2) │ (a)=(63213) │ (96323,2) │ │ 272 │ (96323,4) │ (a)=(63213) │ (96323,4) │ │ 273 │ (96323,5) │ (a)=(63213) │ (96323,5) │ │ 274 │ (96323,6) │ (a)=(63213) │ (96323,6) │ │ 275 │ (96323,7) │ (a)=(63213) │ (96323,7) │ │ 276 │ (96323,8) │ (a)=(63213) │ (96323,8) │ │ 277 │ (96323,9) │ (a)=(63213) │ (96323,9) │ │ 278 │ (96323,10) │ (a)=(63213) │ (96323,10) │ │ 279 │ (96323,11) │ (a)=(63213) │ (96323,11) │ │ 280 │ (96323,12) │ (a)=(63213) │ (96323,12) │ │ 281 │ (96323,13) │ (a)=(63213) │ (96323,13) │ │ 282 │ (96323,14) │ (a)=(63213) │ (96323,14) │ │ 283 │ (96323,15) │ (a)=(63213) │ (96323,15) │ │ 284 │ (96323,16) │ (a)=(63213) │ (96323,16) │ │ 285 │ (96323,17) │ (a)=(63213) │ (96323,17) │ │ 286 │ (96323,18) │ (a)=(63213) │ (96323,18) │ │ 287 │ (96323,19) │ (a)=(63213) │ (96323,19) │ │ 288 │ (96323,20) │ (a)=(63213) │ (96323,20) │ │ 289 │ (96323,21) │ (a)=(63213) │ (96323,21) │ │ 290 │ (96324,1) │ (a)=(63213) │ (96324,1) │ │ 291 │ (96324,2) │ (a)=(63213) │ (96324,2) │ │ 292 │ (96324,3) │ (a)=(63213) │ (96324,3) │ │ 293 │ (96324,4) │ (a)=(63213) │ (96324,4) │ │ 294 │ (96324,5) │ (a)=(63213) │ (96324,5) │ │ 295 │ (96324,6) │ (a)=(63213) │ (96324,6) │ │ 296 │ (96324,7) │ (a)=(63213) │ (96324,7) │ │ 297 │ (96324,8) │ (a)=(63213) │ (96324,8) │ │ 298 │ (96324,9) │ (a)=(63213) │ (96324,9) │ │ 299 │ (96324,11) │ (a)=(63213) │ (96324,11) │ │ 300 │ (96324,12) │ (a)=(63213) │ (96324,12) │ │ 301 │ (96324,13) │ (a)=(63213) │ (96324,13) │ │ 302 │ (96324,10) │ (a)=(63214) │ (96324,10) │ │ 303 │ (96324,14) │ (a)=(63214) │ (96324,14) │ │ 304 │ (96324,15) │ (a)=(63214) │ (96324,15) │ │ 305 │ (96324,16) │ (a)=(63214) │ (96324,16) │ │ 306 │ (96324,17) │ (a)=(63214) │ (96324,17) │ │ 307 │ (96324,18) │ (a)=(63214) │ (96324,18) │ │ 308 │ (96324,19) │ (a)=(63214) │ (96324,19) │ │ 309 │ (96324,20) │ (a)=(63214) │ (96324,20) │ │ 310 │ (96324,21) │ (a)=(63214) │ (96324,21) │ │ 311 │ (96325,1) │ (a)=(63214) │ (96325,1) │ │ 312 │ (96325,2) │ (a)=(63214) │ (96325,2) │ │ 313 │ (96325,3) │ 
(a)=(63214) │ (96325,3) │ │ 314 │ (96325,4) │ (a)=(63214) │ (96325,4) │ │ 315 │ (96325,5) │ (a)=(63214) │ (96325,5) │ │ 316 │ (96325,6) │ (a)=(63214) │ (96325,6) │ │ 317 │ (96325,7) │ (a)=(63214) │ (96325,7) │ │ 318 │ (96325,8) │ (a)=(63214) │ (96325,8) │ │ 319 │ (96325,9) │ (a)=(63214) │ (96325,9) │ │ 320 │ (96325,10) │ (a)=(63214) │ (96325,10) │ │ 321 │ (96325,11) │ (a)=(63214) │ (96325,11) │ │ 322 │ (96325,12) │ (a)=(63214) │ (96325,12) │ │ 323 │ (96325,13) │ (a)=(63214) │ (96325,13) │ │ 324 │ (96325,14) │ (a)=(63214) │ (96325,14) │ │ 325 │ (96325,15) │ (a)=(63214) │ (96325,15) │ │ 326 │ (96325,16) │ (a)=(63214) │ (96325,16) │ │ 327 │ (96325,17) │ (a)=(63214) │ (96325,17) │ │ 328 │ (96325,18) │ (a)=(63214) │ (96325,18) │ │ 329 │ (96325,19) │ (a)=(63214) │ (96325,19) │ │ 330 │ (96325,20) │ (a)=(63214) │ (96325,20) │ │ 331 │ (96325,21) │ (a)=(63214) │ (96325,21) │ │ 332 │ (96326,1) │ (a)=(63214) │ (96326,1) │ │ 333 │ (96326,3) │ (a)=(63214) │ (96326,3) │ │ 334 │ (96326,2) │ (a)=(63215) │ (96326,2) │ │ 335 │ (96326,4) │ (a)=(63215) │ (96326,4) │ │ 336 │ (96326,5) │ (a)=(63215) │ (96326,5) │ │ 337 │ (96326,6) │ (a)=(63215) │ (96326,6) │ │ 338 │ (96326,7) │ (a)=(63215) │ (96326,7) │ │ 339 │ (96326,8) │ (a)=(63215) │ (96326,8) │ │ 340 │ (96326,9) │ (a)=(63215) │ (96326,9) │ │ 341 │ (96326,10) │ (a)=(63215) │ (96326,10) │ │ 342 │ (96326,11) │ (a)=(63215) │ (96326,11) │ │ 343 │ (96326,12) │ (a)=(63215) │ (96326,12) │ │ 344 │ (96326,13) │ (a)=(63215) │ (96326,13) │ │ 345 │ (96326,14) │ (a)=(63215) │ (96326,14) │ │ 346 │ (96326,15) │ (a)=(63215) │ (96326,15) │ │ 347 │ (96326,16) │ (a)=(63215) │ (96326,16) │ │ 348 │ (96326,17) │ (a)=(63215) │ (96326,17) │ │ 349 │ (96326,18) │ (a)=(63215) │ (96326,18) │ │ 350 │ (96326,19) │ (a)=(63215) │ (96326,19) │ │ 351 │ (96326,20) │ (a)=(63215) │ (96326,20) │ │ 352 │ (96326,21) │ (a)=(63215) │ (96326,21) │ │ 353 │ (96327,1) │ (a)=(63215) │ (96327,1) │ │ 354 │ (96327,2) │ (a)=(63215) │ (96327,2) │ │ 355 │ (96327,3) │ (a)=(63215) │ (96327,3) │ │ 356 │ (96327,4) │ (a)=(63215) │ (96327,4) │ │ 357 │ (96327,5) │ (a)=(63215) │ (96327,5) │ │ 358 │ (96327,6) │ (a)=(63215) │ (96327,6) │ │ 359 │ (96327,7) │ (a)=(63215) │ (96327,7) │ │ 360 │ (96327,8) │ (a)=(63215) │ (96327,8) │ │ 361 │ (96327,9) │ (a)=(63215) │ (96327,9) │ │ 362 │ (96327,10) │ (a)=(63215) │ (96327,10) │ │ 363 │ (96327,11) │ (a)=(63215) │ (96327,11) │ │ 364 │ (96327,12) │ (a)=(63215) │ (96327,12) │ │ 365 │ (96327,14) │ (a)=(63215) │ (96327,14) │ │ 366 │ (96327,13) │ (a)=(63216) │ (96327,13) │ │ 367 │ (96327,15) │ (a)=(63216) │ (96327,15) │ └────────────┴──────────────┴─────────────┴────────────┘ (367 rows) I only notice one tiny discontinuity in this "unshuffled" "idx" page: the index tuple at offset 205 uses the heap TID (96320,1), whereas the index tuple right after that (at offset 206) uses the heap TID (96319,21) (before we get to a large run of heap TIDs that use heap block number 96320 once more). As I touched on already, this effect can be seen even with perfectly correlated inserts. The effect is caused by the FSM having a tiny bit of space left on one heap page -- not enough space to fit an incoming heap tuple, but still enough to fit a slightly smaller heap tuple that is inserted shortly thereafter. You end up with exactly one index tuple whose heap TID is slightly out-of-order, though only every once in a long while. -- Peter Geoghegan
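[Rather than eyeballing 367-row dumps, the amount of "localized shuffling" on a leaf page can be summarized with a short pageinspect query. This is only a sketch: it assumes the pageinspect extension is installed, reuses the "idx2" / block 5555 example from above, extracts the heap block number from htid via a text-to-point cast, and skips itemoffset 1 because that is the page high key in these dumps.]

-- Sketch: for one leaf page, count adjacent index tuples whose heap block
-- number is lower than that of the preceding tuple ("backward steps").
-- An "unshuffled" page like block 5555 of "idx" should report a very small
-- number; a "shuffled" page like block 5555 of "idx2" reports many more.
SELECT count(*) FILTER (WHERE heap_blk < prev_heap_blk) AS backward_block_steps,
       count(*) AS ntuples
FROM (SELECT itemoffset,
             (htid::text::point)[0]::bigint AS heap_blk,      -- block part of the heap TID
             lag((htid::text::point)[0]::bigint)
                 OVER (ORDER BY itemoffset) AS prev_heap_blk
      FROM bt_page_items('idx2', 5555)
      WHERE itemoffset > 1) AS s;                             -- skip the high key

[Looping the same query over all leaf pages (bt_page_stats reports type = 'l') would give a per-index "shuffling" score, which may be easier to compare between "idx" and "idx2" than the runs statistics discussed earlier.]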
On 8/13/25 18:36, Peter Geoghegan wrote: > On Wed, Aug 13, 2025 at 8:15 AM Tomas Vondra <tomas@vondra.me> wrote: >> 1) created a second table with an "inverse pattern" that's decreasing: >> >> create table t2 (like t) with (fillfactor = 20); >> insert into t2 select -a, b from t; >> create index idx2 on t2 (a); >> alter index idx2 set (deduplicate_items = false); >> reindex index idx2; >> >> The idea is that >> >> SELECT * FROM t WHERE (a BETWEEN x AND y) ORDER BY a ASC >> >> is the same "block pattern" as >> >> SELECT * FROM t2 WHERE (a BETWEEN -y AND -x) ORDER BY a DESC > > A quick look at "idx2" using pageinspect seems to show heap block numbers that > are significantly less in-order than those from the original "idx" index, > though. While the original "idx" has block numbers that are *almost* in perfect > order (I do see the odd index tuple that has a non-consecutive TID, possibly > just due to the influence of the heap FSM), "idx2" seems to have leaf pages that > each have heap blocks that are somewhat "shuffled" within each page. > > While the average total number of heap blocks seen with "idx2" might not be very > much higher than "idx", it is nevertheless true that the heap TIDs appear in a > less consistent order. So AFAICT we have no principled reason to expect the > "runs" seen on "idx2" to be anything like "idx" (maybe the performance gap is a > real problem, since the physical attributes of each index aren't hugely > different, but even then the "runs" stats don't seem all that uninformative). > > I'll show what I mean by "shuffled" via a comparison of 2 random leaf pages from > each index. Here's what block 5555 from "idx2" looks like according to > bt_page_items (it shows a certain amount of "localized shuffling"): > > ┌────────────┬───────────────┬───────────────┬─────────────┐ > │ itemoffset │ ctid │ data │ htid │ > ├────────────┼───────────────┼───────────────┼─────────────┤ > │ 1 │ (379861,4097) │ (a)=(-249285) │ (379861,7) │ > │ 2 │ (379880,13) │ (a)=(-249297) │ (379880,13) │ > │ 3 │ (379880,14) │ (a)=(-249297) │ (379880,14) │ > │ 4 │ (379880,15) │ (a)=(-249297) │ (379880,15) │ > │ 5 │ (379880,16) │ (a)=(-249297) │ (379880,16) │ > │ 6 │ (379880,17) │ (a)=(-249297) │ (379880,17) │ > │ 7 │ (379880,18) │ (a)=(-249297) │ (379880,18) │ > │ 8 │ (379880,19) │ (a)=(-249297) │ (379880,19) │ > │ 9 │ (379880,20) │ (a)=(-249297) │ (379880,20) │ > │ 10 │ (379880,21) │ (a)=(-249297) │ (379880,21) │ > │ 11 │ (379881,2) │ (a)=(-249297) │ (379881,2) │ > │ 12 │ (379881,3) │ (a)=(-249297) │ (379881,3) │ > │ 13 │ (379881,4) │ (a)=(-249297) │ (379881,4) │ > │ 14 │ (379878,2) │ (a)=(-249296) │ (379878,2) │ > │ 15 │ (379878,3) │ (a)=(-249296) │ (379878,3) │ > │ 16 │ (379878,5) │ (a)=(-249296) │ (379878,5) │ > │ 17 │ (379878,6) │ (a)=(-249296) │ (379878,6) │ > │ 18 │ (379878,7) │ (a)=(-249296) │ (379878,7) │ > │ 19 │ (379878,8) │ (a)=(-249296) │ (379878,8) │ > │ 20 │ (379878,9) │ (a)=(-249296) │ (379878,9) │ > │ 21 │ (379878,10) │ (a)=(-249296) │ (379878,10) │ > │ 22 │ (379878,11) │ (a)=(-249296) │ (379878,11) │ > │ 23 │ (379878,12) │ (a)=(-249296) │ (379878,12) │ > │ 24 │ (379878,13) │ (a)=(-249296) │ (379878,13) │ > │ 25 │ (379878,14) │ (a)=(-249296) │ (379878,14) │ > │ 26 │ (379878,15) │ (a)=(-249296) │ (379878,15) │ > │ 27 │ (379878,16) │ (a)=(-249296) │ (379878,16) │ > │ 28 │ (379878,17) │ (a)=(-249296) │ (379878,17) │ > │ 29 │ (379878,18) │ (a)=(-249296) │ (379878,18) │ > │ 30 │ (379878,19) │ (a)=(-249296) │ (379878,19) │ > │ 31 │ (379878,20) │ (a)=(-249296) │ (379878,20) │ > │ 32 │ 
(379878,21) │ (a)=(-249296) │ (379878,21) │ > │ 33 │ (379879,1) │ (a)=(-249296) │ (379879,1) │ > │ 34 │ (379879,2) │ (a)=(-249296) │ (379879,2) │ > │ 35 │ (379879,3) │ (a)=(-249296) │ (379879,3) │ > │ 36 │ (379879,4) │ (a)=(-249296) │ (379879,4) │ > │ 37 │ (379879,5) │ (a)=(-249296) │ (379879,5) │ > │ 38 │ (379879,6) │ (a)=(-249296) │ (379879,6) │ > │ 39 │ (379879,7) │ (a)=(-249296) │ (379879,7) │ > │ 40 │ (379879,8) │ (a)=(-249296) │ (379879,8) │ > │ 41 │ (379879,9) │ (a)=(-249296) │ (379879,9) │ > │ 42 │ (379879,10) │ (a)=(-249296) │ (379879,10) │ > │ 43 │ (379879,12) │ (a)=(-249296) │ (379879,12) │ > │ 44 │ (379879,13) │ (a)=(-249296) │ (379879,13) │ > │ 45 │ (379879,14) │ (a)=(-249296) │ (379879,14) │ > │ 46 │ (379876,10) │ (a)=(-249295) │ (379876,10) │ > │ 47 │ (379876,12) │ (a)=(-249295) │ (379876,12) │ > │ 48 │ (379876,14) │ (a)=(-249295) │ (379876,14) │ > │ 49 │ (379876,16) │ (a)=(-249295) │ (379876,16) │ > │ 50 │ (379876,17) │ (a)=(-249295) │ (379876,17) │ > │ 51 │ (379876,18) │ (a)=(-249295) │ (379876,18) │ > │ 52 │ (379876,19) │ (a)=(-249295) │ (379876,19) │ > │ 53 │ (379876,20) │ (a)=(-249295) │ (379876,20) │ > │ 54 │ (379876,21) │ (a)=(-249295) │ (379876,21) │ > │ 55 │ (379877,1) │ (a)=(-249295) │ (379877,1) │ > │ 56 │ (379877,2) │ (a)=(-249295) │ (379877,2) │ > │ 57 │ (379877,3) │ (a)=(-249295) │ (379877,3) │ > │ 58 │ (379877,4) │ (a)=(-249295) │ (379877,4) │ > │ 59 │ (379877,5) │ (a)=(-249295) │ (379877,5) │ > │ 60 │ (379877,6) │ (a)=(-249295) │ (379877,6) │ > │ 61 │ (379877,7) │ (a)=(-249295) │ (379877,7) │ > │ 62 │ (379877,8) │ (a)=(-249295) │ (379877,8) │ > │ 63 │ (379877,9) │ (a)=(-249295) │ (379877,9) │ > │ 64 │ (379877,10) │ (a)=(-249295) │ (379877,10) │ > │ 65 │ (379877,11) │ (a)=(-249295) │ (379877,11) │ > │ 66 │ (379877,12) │ (a)=(-249295) │ (379877,12) │ > │ 67 │ (379877,13) │ (a)=(-249295) │ (379877,13) │ > │ 68 │ (379877,14) │ (a)=(-249295) │ (379877,14) │ > │ 69 │ (379877,15) │ (a)=(-249295) │ (379877,15) │ > │ 70 │ (379877,16) │ (a)=(-249295) │ (379877,16) │ > │ 71 │ (379877,17) │ (a)=(-249295) │ (379877,17) │ > │ 72 │ (379877,18) │ (a)=(-249295) │ (379877,18) │ > │ 73 │ (379877,19) │ (a)=(-249295) │ (379877,19) │ > │ 74 │ (379877,20) │ (a)=(-249295) │ (379877,20) │ > │ 75 │ (379877,21) │ (a)=(-249295) │ (379877,21) │ > │ 76 │ (379878,1) │ (a)=(-249295) │ (379878,1) │ > │ 77 │ (379878,4) │ (a)=(-249295) │ (379878,4) │ > │ 78 │ (379874,20) │ (a)=(-249294) │ (379874,20) │ > │ 79 │ (379875,2) │ (a)=(-249294) │ (379875,2) │ > │ 80 │ (379875,3) │ (a)=(-249294) │ (379875,3) │ > │ 81 │ (379875,5) │ (a)=(-249294) │ (379875,5) │ > │ 82 │ (379875,6) │ (a)=(-249294) │ (379875,6) │ > │ 83 │ (379875,7) │ (a)=(-249294) │ (379875,7) │ > │ 84 │ (379875,8) │ (a)=(-249294) │ (379875,8) │ > │ 85 │ (379875,9) │ (a)=(-249294) │ (379875,9) │ > │ 86 │ (379875,10) │ (a)=(-249294) │ (379875,10) │ > │ 87 │ (379875,11) │ (a)=(-249294) │ (379875,11) │ > │ 88 │ (379875,12) │ (a)=(-249294) │ (379875,12) │ > │ 89 │ (379875,13) │ (a)=(-249294) │ (379875,13) │ > │ 90 │ (379875,14) │ (a)=(-249294) │ (379875,14) │ > │ 91 │ (379875,15) │ (a)=(-249294) │ (379875,15) │ > │ 92 │ (379875,16) │ (a)=(-249294) │ (379875,16) │ > │ 93 │ (379875,17) │ (a)=(-249294) │ (379875,17) │ > │ 94 │ (379875,18) │ (a)=(-249294) │ (379875,18) │ > │ 95 │ (379875,19) │ (a)=(-249294) │ (379875,19) │ > │ 96 │ (379875,20) │ (a)=(-249294) │ (379875,20) │ > │ 97 │ (379875,21) │ (a)=(-249294) │ (379875,21) │ > │ 98 │ (379876,1) │ (a)=(-249294) │ (379876,1) │ > │ 99 │ (379876,2) │ (a)=(-249294) │ (379876,2) │ > │ 100 │ 
(379876,3) │ (a)=(-249294) │ (379876,3) │ > │ 101 │ (379876,4) │ (a)=(-249294) │ (379876,4) │ > │ 102 │ (379876,5) │ (a)=(-249294) │ (379876,5) │ > │ 103 │ (379876,6) │ (a)=(-249294) │ (379876,6) │ > │ 104 │ (379876,7) │ (a)=(-249294) │ (379876,7) │ > │ 105 │ (379876,8) │ (a)=(-249294) │ (379876,8) │ > │ 106 │ (379876,9) │ (a)=(-249294) │ (379876,9) │ > │ 107 │ (379876,11) │ (a)=(-249294) │ (379876,11) │ > │ 108 │ (379876,13) │ (a)=(-249294) │ (379876,13) │ > │ 109 │ (379876,15) │ (a)=(-249294) │ (379876,15) │ > │ 110 │ (379873,11) │ (a)=(-249293) │ (379873,11) │ > │ 111 │ (379873,13) │ (a)=(-249293) │ (379873,13) │ > │ 112 │ (379873,14) │ (a)=(-249293) │ (379873,14) │ > │ 113 │ (379873,15) │ (a)=(-249293) │ (379873,15) │ > │ 114 │ (379873,16) │ (a)=(-249293) │ (379873,16) │ > │ 115 │ (379873,17) │ (a)=(-249293) │ (379873,17) │ > │ 116 │ (379873,18) │ (a)=(-249293) │ (379873,18) │ > │ 117 │ (379873,19) │ (a)=(-249293) │ (379873,19) │ > │ 118 │ (379873,20) │ (a)=(-249293) │ (379873,20) │ > │ 119 │ (379873,21) │ (a)=(-249293) │ (379873,21) │ > │ 120 │ (379874,1) │ (a)=(-249293) │ (379874,1) │ > │ 121 │ (379874,2) │ (a)=(-249293) │ (379874,2) │ > │ 122 │ (379874,3) │ (a)=(-249293) │ (379874,3) │ > │ 123 │ (379874,4) │ (a)=(-249293) │ (379874,4) │ > │ 124 │ (379874,5) │ (a)=(-249293) │ (379874,5) │ > │ 125 │ (379874,6) │ (a)=(-249293) │ (379874,6) │ > │ 126 │ (379874,7) │ (a)=(-249293) │ (379874,7) │ > │ 127 │ (379874,8) │ (a)=(-249293) │ (379874,8) │ > │ 128 │ (379874,9) │ (a)=(-249293) │ (379874,9) │ > │ 129 │ (379874,10) │ (a)=(-249293) │ (379874,10) │ > │ 130 │ (379874,11) │ (a)=(-249293) │ (379874,11) │ > │ 131 │ (379874,12) │ (a)=(-249293) │ (379874,12) │ > │ 132 │ (379874,13) │ (a)=(-249293) │ (379874,13) │ > │ 133 │ (379874,14) │ (a)=(-249293) │ (379874,14) │ > │ 134 │ (379874,15) │ (a)=(-249293) │ (379874,15) │ > │ 135 │ (379874,16) │ (a)=(-249293) │ (379874,16) │ > │ 136 │ (379874,17) │ (a)=(-249293) │ (379874,17) │ > │ 137 │ (379874,18) │ (a)=(-249293) │ (379874,18) │ > │ 138 │ (379874,19) │ (a)=(-249293) │ (379874,19) │ > │ 139 │ (379874,21) │ (a)=(-249293) │ (379874,21) │ > │ 140 │ (379875,1) │ (a)=(-249293) │ (379875,1) │ > │ 141 │ (379875,4) │ (a)=(-249293) │ (379875,4) │ > │ 142 │ (379871,21) │ (a)=(-249292) │ (379871,21) │ > │ 143 │ (379872,2) │ (a)=(-249292) │ (379872,2) │ > │ 144 │ (379872,3) │ (a)=(-249292) │ (379872,3) │ > │ 145 │ (379872,4) │ (a)=(-249292) │ (379872,4) │ > │ 146 │ (379872,5) │ (a)=(-249292) │ (379872,5) │ > │ 147 │ (379872,6) │ (a)=(-249292) │ (379872,6) │ > │ 148 │ (379872,7) │ (a)=(-249292) │ (379872,7) │ > │ 149 │ (379872,8) │ (a)=(-249292) │ (379872,8) │ > │ 150 │ (379872,9) │ (a)=(-249292) │ (379872,9) │ > │ 151 │ (379872,10) │ (a)=(-249292) │ (379872,10) │ > │ 152 │ (379872,11) │ (a)=(-249292) │ (379872,11) │ > │ 153 │ (379872,12) │ (a)=(-249292) │ (379872,12) │ > │ 154 │ (379872,13) │ (a)=(-249292) │ (379872,13) │ > │ 155 │ (379872,14) │ (a)=(-249292) │ (379872,14) │ > │ 156 │ (379872,15) │ (a)=(-249292) │ (379872,15) │ > │ 157 │ (379872,16) │ (a)=(-249292) │ (379872,16) │ > │ 158 │ (379872,17) │ (a)=(-249292) │ (379872,17) │ > │ 159 │ (379872,18) │ (a)=(-249292) │ (379872,18) │ > │ 160 │ (379872,19) │ (a)=(-249292) │ (379872,19) │ > │ 161 │ (379872,20) │ (a)=(-249292) │ (379872,20) │ > │ 162 │ (379872,21) │ (a)=(-249292) │ (379872,21) │ > │ 163 │ (379873,1) │ (a)=(-249292) │ (379873,1) │ > │ 164 │ (379873,2) │ (a)=(-249292) │ (379873,2) │ > │ 165 │ (379873,3) │ (a)=(-249292) │ (379873,3) │ > │ 166 │ (379873,4) │ (a)=(-249292) │ (379873,4) │ > │ 
167 │ (379873,5) │ (a)=(-249292) │ (379873,5) │ > │ 168 │ (379873,6) │ (a)=(-249292) │ (379873,6) │ > │ 169 │ (379873,7) │ (a)=(-249292) │ (379873,7) │ > │ 170 │ (379873,8) │ (a)=(-249292) │ (379873,8) │ > │ 171 │ (379873,9) │ (a)=(-249292) │ (379873,9) │ > │ 172 │ (379873,10) │ (a)=(-249292) │ (379873,10) │ > │ 173 │ (379873,12) │ (a)=(-249292) │ (379873,12) │ > │ 174 │ (379870,9) │ (a)=(-249291) │ (379870,9) │ > │ 175 │ (379870,11) │ (a)=(-249291) │ (379870,11) │ > │ 176 │ (379870,12) │ (a)=(-249291) │ (379870,12) │ > │ 177 │ (379870,14) │ (a)=(-249291) │ (379870,14) │ > │ 178 │ (379870,15) │ (a)=(-249291) │ (379870,15) │ > │ 179 │ (379870,16) │ (a)=(-249291) │ (379870,16) │ > │ 180 │ (379870,17) │ (a)=(-249291) │ (379870,17) │ > │ 181 │ (379870,18) │ (a)=(-249291) │ (379870,18) │ > │ 182 │ (379870,19) │ (a)=(-249291) │ (379870,19) │ > │ 183 │ (379870,20) │ (a)=(-249291) │ (379870,20) │ > │ 184 │ (379870,21) │ (a)=(-249291) │ (379870,21) │ > │ 185 │ (379871,1) │ (a)=(-249291) │ (379871,1) │ > │ 186 │ (379871,2) │ (a)=(-249291) │ (379871,2) │ > │ 187 │ (379871,3) │ (a)=(-249291) │ (379871,3) │ > │ 188 │ (379871,4) │ (a)=(-249291) │ (379871,4) │ > │ 189 │ (379871,5) │ (a)=(-249291) │ (379871,5) │ > │ 190 │ (379871,6) │ (a)=(-249291) │ (379871,6) │ > │ 191 │ (379871,7) │ (a)=(-249291) │ (379871,7) │ > │ 192 │ (379871,8) │ (a)=(-249291) │ (379871,8) │ > │ 193 │ (379871,9) │ (a)=(-249291) │ (379871,9) │ > │ 194 │ (379871,10) │ (a)=(-249291) │ (379871,10) │ > │ 195 │ (379871,11) │ (a)=(-249291) │ (379871,11) │ > │ 196 │ (379871,12) │ (a)=(-249291) │ (379871,12) │ > │ 197 │ (379871,13) │ (a)=(-249291) │ (379871,13) │ > │ 198 │ (379871,14) │ (a)=(-249291) │ (379871,14) │ > │ 199 │ (379871,15) │ (a)=(-249291) │ (379871,15) │ > │ 200 │ (379871,16) │ (a)=(-249291) │ (379871,16) │ > │ 201 │ (379871,17) │ (a)=(-249291) │ (379871,17) │ > │ 202 │ (379871,18) │ (a)=(-249291) │ (379871,18) │ > │ 203 │ (379871,19) │ (a)=(-249291) │ (379871,19) │ > │ 204 │ (379871,20) │ (a)=(-249291) │ (379871,20) │ > │ 205 │ (379872,1) │ (a)=(-249291) │ (379872,1) │ > │ 206 │ (379868,20) │ (a)=(-249290) │ (379868,20) │ > │ 207 │ (379868,21) │ (a)=(-249290) │ (379868,21) │ > │ 208 │ (379869,1) │ (a)=(-249290) │ (379869,1) │ > │ 209 │ (379869,3) │ (a)=(-249290) │ (379869,3) │ > │ 210 │ (379869,4) │ (a)=(-249290) │ (379869,4) │ > │ 211 │ (379869,5) │ (a)=(-249290) │ (379869,5) │ > │ 212 │ (379869,6) │ (a)=(-249290) │ (379869,6) │ > │ 213 │ (379869,7) │ (a)=(-249290) │ (379869,7) │ > │ 214 │ (379869,8) │ (a)=(-249290) │ (379869,8) │ > │ 215 │ (379869,9) │ (a)=(-249290) │ (379869,9) │ > │ 216 │ (379869,10) │ (a)=(-249290) │ (379869,10) │ > │ 217 │ (379869,11) │ (a)=(-249290) │ (379869,11) │ > │ 218 │ (379869,12) │ (a)=(-249290) │ (379869,12) │ > │ 219 │ (379869,13) │ (a)=(-249290) │ (379869,13) │ > │ 220 │ (379869,14) │ (a)=(-249290) │ (379869,14) │ > │ 221 │ (379869,15) │ (a)=(-249290) │ (379869,15) │ > │ 222 │ (379869,16) │ (a)=(-249290) │ (379869,16) │ > │ 223 │ (379869,17) │ (a)=(-249290) │ (379869,17) │ > │ 224 │ (379869,18) │ (a)=(-249290) │ (379869,18) │ > │ 225 │ (379869,19) │ (a)=(-249290) │ (379869,19) │ > │ 226 │ (379869,20) │ (a)=(-249290) │ (379869,20) │ > │ 227 │ (379869,21) │ (a)=(-249290) │ (379869,21) │ > │ 228 │ (379870,1) │ (a)=(-249290) │ (379870,1) │ > │ 229 │ (379870,2) │ (a)=(-249290) │ (379870,2) │ > │ 230 │ (379870,3) │ (a)=(-249290) │ (379870,3) │ > │ 231 │ (379870,4) │ (a)=(-249290) │ (379870,4) │ > │ 232 │ (379870,5) │ (a)=(-249290) │ (379870,5) │ > │ 233 │ (379870,6) │ (a)=(-249290) │ (379870,6) │ 
> │ 234 │ (379870,7) │ (a)=(-249290) │ (379870,7) │ > │ 235 │ (379870,8) │ (a)=(-249290) │ (379870,8) │ > │ 236 │ (379870,10) │ (a)=(-249290) │ (379870,10) │ > │ 237 │ (379870,13) │ (a)=(-249290) │ (379870,13) │ > │ 238 │ (379867,10) │ (a)=(-249289) │ (379867,10) │ > │ 239 │ (379867,11) │ (a)=(-249289) │ (379867,11) │ > │ 240 │ (379867,12) │ (a)=(-249289) │ (379867,12) │ > │ 241 │ (379867,13) │ (a)=(-249289) │ (379867,13) │ > │ 242 │ (379867,14) │ (a)=(-249289) │ (379867,14) │ > │ 243 │ (379867,15) │ (a)=(-249289) │ (379867,15) │ > │ 244 │ (379867,16) │ (a)=(-249289) │ (379867,16) │ > │ 245 │ (379867,17) │ (a)=(-249289) │ (379867,17) │ > │ 246 │ (379867,18) │ (a)=(-249289) │ (379867,18) │ > │ 247 │ (379867,19) │ (a)=(-249289) │ (379867,19) │ > │ 248 │ (379867,20) │ (a)=(-249289) │ (379867,20) │ > │ 249 │ (379867,21) │ (a)=(-249289) │ (379867,21) │ > │ 250 │ (379868,1) │ (a)=(-249289) │ (379868,1) │ > │ 251 │ (379868,2) │ (a)=(-249289) │ (379868,2) │ > │ 252 │ (379868,3) │ (a)=(-249289) │ (379868,3) │ > │ 253 │ (379868,4) │ (a)=(-249289) │ (379868,4) │ > │ 254 │ (379868,5) │ (a)=(-249289) │ (379868,5) │ > │ 255 │ (379868,6) │ (a)=(-249289) │ (379868,6) │ > │ 256 │ (379868,7) │ (a)=(-249289) │ (379868,7) │ > │ 257 │ (379868,8) │ (a)=(-249289) │ (379868,8) │ > │ 258 │ (379868,9) │ (a)=(-249289) │ (379868,9) │ > │ 259 │ (379868,10) │ (a)=(-249289) │ (379868,10) │ > │ 260 │ (379868,11) │ (a)=(-249289) │ (379868,11) │ > │ 261 │ (379868,12) │ (a)=(-249289) │ (379868,12) │ > │ 262 │ (379868,13) │ (a)=(-249289) │ (379868,13) │ > │ 263 │ (379868,14) │ (a)=(-249289) │ (379868,14) │ > │ 264 │ (379868,15) │ (a)=(-249289) │ (379868,15) │ > │ 265 │ (379868,16) │ (a)=(-249289) │ (379868,16) │ > │ 266 │ (379868,17) │ (a)=(-249289) │ (379868,17) │ > │ 267 │ (379868,18) │ (a)=(-249289) │ (379868,18) │ > │ 268 │ (379868,19) │ (a)=(-249289) │ (379868,19) │ > │ 269 │ (379869,2) │ (a)=(-249289) │ (379869,2) │ > │ 270 │ (379865,19) │ (a)=(-249288) │ (379865,19) │ > │ 271 │ (379865,20) │ (a)=(-249288) │ (379865,20) │ > │ 272 │ (379865,21) │ (a)=(-249288) │ (379865,21) │ > │ 273 │ (379866,2) │ (a)=(-249288) │ (379866,2) │ > │ 274 │ (379866,3) │ (a)=(-249288) │ (379866,3) │ > │ 275 │ (379866,4) │ (a)=(-249288) │ (379866,4) │ > │ 276 │ (379866,5) │ (a)=(-249288) │ (379866,5) │ > │ 277 │ (379866,6) │ (a)=(-249288) │ (379866,6) │ > │ 278 │ (379866,7) │ (a)=(-249288) │ (379866,7) │ > │ 279 │ (379866,8) │ (a)=(-249288) │ (379866,8) │ > │ 280 │ (379866,9) │ (a)=(-249288) │ (379866,9) │ > │ 281 │ (379866,10) │ (a)=(-249288) │ (379866,10) │ > │ 282 │ (379866,11) │ (a)=(-249288) │ (379866,11) │ > │ 283 │ (379866,12) │ (a)=(-249288) │ (379866,12) │ > │ 284 │ (379866,13) │ (a)=(-249288) │ (379866,13) │ > │ 285 │ (379866,14) │ (a)=(-249288) │ (379866,14) │ > │ 286 │ (379866,15) │ (a)=(-249288) │ (379866,15) │ > │ 287 │ (379866,16) │ (a)=(-249288) │ (379866,16) │ > │ 288 │ (379866,17) │ (a)=(-249288) │ (379866,17) │ > │ 289 │ (379866,18) │ (a)=(-249288) │ (379866,18) │ > │ 290 │ (379866,19) │ (a)=(-249288) │ (379866,19) │ > │ 291 │ (379866,20) │ (a)=(-249288) │ (379866,20) │ > │ 292 │ (379866,21) │ (a)=(-249288) │ (379866,21) │ > │ 293 │ (379867,1) │ (a)=(-249288) │ (379867,1) │ > │ 294 │ (379867,2) │ (a)=(-249288) │ (379867,2) │ > │ 295 │ (379867,3) │ (a)=(-249288) │ (379867,3) │ > │ 296 │ (379867,4) │ (a)=(-249288) │ (379867,4) │ > │ 297 │ (379867,5) │ (a)=(-249288) │ (379867,5) │ > │ 298 │ (379867,6) │ (a)=(-249288) │ (379867,6) │ > │ 299 │ (379867,7) │ (a)=(-249288) │ (379867,7) │ > │ 300 │ (379867,8) │ (a)=(-249288) │ 
(379867,8) │ > │ 301 │ (379867,9) │ (a)=(-249288) │ (379867,9) │ > │ 302 │ (379864,9) │ (a)=(-249287) │ (379864,9) │ > │ 303 │ (379864,10) │ (a)=(-249287) │ (379864,10) │ > │ 304 │ (379864,11) │ (a)=(-249287) │ (379864,11) │ > │ 305 │ (379864,12) │ (a)=(-249287) │ (379864,12) │ > │ 306 │ (379864,13) │ (a)=(-249287) │ (379864,13) │ > │ 307 │ (379864,14) │ (a)=(-249287) │ (379864,14) │ > │ 308 │ (379864,15) │ (a)=(-249287) │ (379864,15) │ > │ 309 │ (379864,16) │ (a)=(-249287) │ (379864,16) │ > │ 310 │ (379864,17) │ (a)=(-249287) │ (379864,17) │ > │ 311 │ (379864,18) │ (a)=(-249287) │ (379864,18) │ > │ 312 │ (379864,19) │ (a)=(-249287) │ (379864,19) │ > │ 313 │ (379864,20) │ (a)=(-249287) │ (379864,20) │ > │ 314 │ (379864,21) │ (a)=(-249287) │ (379864,21) │ > │ 315 │ (379865,1) │ (a)=(-249287) │ (379865,1) │ > │ 316 │ (379865,2) │ (a)=(-249287) │ (379865,2) │ > │ 317 │ (379865,3) │ (a)=(-249287) │ (379865,3) │ > │ 318 │ (379865,4) │ (a)=(-249287) │ (379865,4) │ > │ 319 │ (379865,5) │ (a)=(-249287) │ (379865,5) │ > │ 320 │ (379865,6) │ (a)=(-249287) │ (379865,6) │ > │ 321 │ (379865,7) │ (a)=(-249287) │ (379865,7) │ > │ 322 │ (379865,8) │ (a)=(-249287) │ (379865,8) │ > │ 323 │ (379865,9) │ (a)=(-249287) │ (379865,9) │ > │ 324 │ (379865,10) │ (a)=(-249287) │ (379865,10) │ > │ 325 │ (379865,11) │ (a)=(-249287) │ (379865,11) │ > │ 326 │ (379865,12) │ (a)=(-249287) │ (379865,12) │ > │ 327 │ (379865,13) │ (a)=(-249287) │ (379865,13) │ > │ 328 │ (379865,14) │ (a)=(-249287) │ (379865,14) │ > │ 329 │ (379865,15) │ (a)=(-249287) │ (379865,15) │ > │ 330 │ (379865,16) │ (a)=(-249287) │ (379865,16) │ > │ 331 │ (379865,17) │ (a)=(-249287) │ (379865,17) │ > │ 332 │ (379865,18) │ (a)=(-249287) │ (379865,18) │ > │ 333 │ (379866,1) │ (a)=(-249287) │ (379866,1) │ > │ 334 │ (379862,16) │ (a)=(-249286) │ (379862,16) │ > │ 335 │ (379862,17) │ (a)=(-249286) │ (379862,17) │ > │ 336 │ (379862,20) │ (a)=(-249286) │ (379862,20) │ > │ 337 │ (379863,1) │ (a)=(-249286) │ (379863,1) │ > │ 338 │ (379863,2) │ (a)=(-249286) │ (379863,2) │ > │ 339 │ (379863,3) │ (a)=(-249286) │ (379863,3) │ > │ 340 │ (379863,4) │ (a)=(-249286) │ (379863,4) │ > │ 341 │ (379863,5) │ (a)=(-249286) │ (379863,5) │ > │ 342 │ (379863,6) │ (a)=(-249286) │ (379863,6) │ > │ 343 │ (379863,7) │ (a)=(-249286) │ (379863,7) │ > │ 344 │ (379863,8) │ (a)=(-249286) │ (379863,8) │ > │ 345 │ (379863,9) │ (a)=(-249286) │ (379863,9) │ > │ 346 │ (379863,10) │ (a)=(-249286) │ (379863,10) │ > │ 347 │ (379863,11) │ (a)=(-249286) │ (379863,11) │ > │ 348 │ (379863,12) │ (a)=(-249286) │ (379863,12) │ > │ 349 │ (379863,13) │ (a)=(-249286) │ (379863,13) │ > │ 350 │ (379863,14) │ (a)=(-249286) │ (379863,14) │ > │ 351 │ (379863,15) │ (a)=(-249286) │ (379863,15) │ > │ 352 │ (379863,16) │ (a)=(-249286) │ (379863,16) │ > │ 353 │ (379863,17) │ (a)=(-249286) │ (379863,17) │ > │ 354 │ (379863,18) │ (a)=(-249286) │ (379863,18) │ > │ 355 │ (379863,19) │ (a)=(-249286) │ (379863,19) │ > │ 356 │ (379863,20) │ (a)=(-249286) │ (379863,20) │ > │ 357 │ (379863,21) │ (a)=(-249286) │ (379863,21) │ > │ 358 │ (379864,1) │ (a)=(-249286) │ (379864,1) │ > │ 359 │ (379864,2) │ (a)=(-249286) │ (379864,2) │ > │ 360 │ (379864,3) │ (a)=(-249286) │ (379864,3) │ > │ 361 │ (379864,4) │ (a)=(-249286) │ (379864,4) │ > │ 362 │ (379864,5) │ (a)=(-249286) │ (379864,5) │ > │ 363 │ (379864,6) │ (a)=(-249286) │ (379864,6) │ > │ 364 │ (379864,7) │ (a)=(-249286) │ (379864,7) │ > │ 365 │ (379864,8) │ (a)=(-249286) │ (379864,8) │ > │ 366 │ (379861,6) │ (a)=(-249285) │ (379861,6) │ > │ 367 │ (379861,7) │ (a)=(-249285) 
│ (379861,7) │ > └────────────┴───────────────┴───────────────┴─────────────┘ > (367 rows) > > And here's what block 5555 from "idx" looks like (note that the fact that I'm > using the same index block number as before has no particular significance): > > ────────────┬──────────────┬─────────────┬────────────┐ > │ itemoffset │ ctid │ data │ htid │ > ├────────────┼──────────────┼─────────────┼────────────┤ > │ 1 │ (96327,4097) │ (a)=(63216) │ (96327,15) │ > │ 2 │ (96310,7) │ (a)=(63204) │ (96310,7) │ > │ 3 │ (96310,8) │ (a)=(63204) │ (96310,8) │ > │ 4 │ (96310,9) │ (a)=(63204) │ (96310,9) │ > │ 5 │ (96310,10) │ (a)=(63204) │ (96310,10) │ > │ 6 │ (96310,11) │ (a)=(63204) │ (96310,11) │ > │ 7 │ (96310,12) │ (a)=(63204) │ (96310,12) │ > │ 8 │ (96310,13) │ (a)=(63204) │ (96310,13) │ > │ 9 │ (96310,14) │ (a)=(63204) │ (96310,14) │ > │ 10 │ (96310,15) │ (a)=(63204) │ (96310,15) │ > │ 11 │ (96310,16) │ (a)=(63204) │ (96310,16) │ > │ 12 │ (96310,17) │ (a)=(63204) │ (96310,17) │ > │ 13 │ (96310,18) │ (a)=(63204) │ (96310,18) │ > │ 14 │ (96310,19) │ (a)=(63205) │ (96310,19) │ > │ 15 │ (96310,20) │ (a)=(63205) │ (96310,20) │ > │ 16 │ (96310,21) │ (a)=(63205) │ (96310,21) │ > │ 17 │ (96311,1) │ (a)=(63205) │ (96311,1) │ > │ 18 │ (96311,2) │ (a)=(63205) │ (96311,2) │ > │ 19 │ (96311,3) │ (a)=(63205) │ (96311,3) │ > │ 20 │ (96311,4) │ (a)=(63205) │ (96311,4) │ > │ 21 │ (96311,5) │ (a)=(63205) │ (96311,5) │ > │ 22 │ (96311,6) │ (a)=(63205) │ (96311,6) │ > │ 23 │ (96311,7) │ (a)=(63205) │ (96311,7) │ > │ 24 │ (96311,8) │ (a)=(63205) │ (96311,8) │ > │ 25 │ (96311,9) │ (a)=(63205) │ (96311,9) │ > │ 26 │ (96311,10) │ (a)=(63205) │ (96311,10) │ > │ 27 │ (96311,11) │ (a)=(63205) │ (96311,11) │ > │ 28 │ (96311,12) │ (a)=(63205) │ (96311,12) │ > │ 29 │ (96311,13) │ (a)=(63205) │ (96311,13) │ > │ 30 │ (96311,14) │ (a)=(63205) │ (96311,14) │ > │ 31 │ (96311,15) │ (a)=(63205) │ (96311,15) │ > │ 32 │ (96311,16) │ (a)=(63205) │ (96311,16) │ > │ 33 │ (96311,17) │ (a)=(63205) │ (96311,17) │ > │ 34 │ (96311,18) │ (a)=(63205) │ (96311,18) │ > │ 35 │ (96311,19) │ (a)=(63205) │ (96311,19) │ > │ 36 │ (96311,20) │ (a)=(63205) │ (96311,20) │ > │ 37 │ (96311,21) │ (a)=(63205) │ (96311,21) │ > │ 38 │ (96312,1) │ (a)=(63205) │ (96312,1) │ > │ 39 │ (96312,2) │ (a)=(63205) │ (96312,2) │ > │ 40 │ (96312,3) │ (a)=(63205) │ (96312,3) │ > │ 41 │ (96312,4) │ (a)=(63205) │ (96312,4) │ > │ 42 │ (96312,5) │ (a)=(63205) │ (96312,5) │ > │ 43 │ (96312,6) │ (a)=(63205) │ (96312,6) │ > │ 44 │ (96312,7) │ (a)=(63205) │ (96312,7) │ > │ 45 │ (96312,9) │ (a)=(63205) │ (96312,9) │ > │ 46 │ (96312,8) │ (a)=(63206) │ (96312,8) │ > │ 47 │ (96312,10) │ (a)=(63206) │ (96312,10) │ > │ 48 │ (96312,11) │ (a)=(63206) │ (96312,11) │ > │ 49 │ (96312,12) │ (a)=(63206) │ (96312,12) │ > │ 50 │ (96312,13) │ (a)=(63206) │ (96312,13) │ > │ 51 │ (96312,14) │ (a)=(63206) │ (96312,14) │ > │ 52 │ (96312,15) │ (a)=(63206) │ (96312,15) │ > │ 53 │ (96312,16) │ (a)=(63206) │ (96312,16) │ > │ 54 │ (96312,17) │ (a)=(63206) │ (96312,17) │ > │ 55 │ (96312,18) │ (a)=(63206) │ (96312,18) │ > │ 56 │ (96312,19) │ (a)=(63206) │ (96312,19) │ > │ 57 │ (96312,20) │ (a)=(63206) │ (96312,20) │ > │ 58 │ (96312,21) │ (a)=(63206) │ (96312,21) │ > │ 59 │ (96313,1) │ (a)=(63206) │ (96313,1) │ > │ 60 │ (96313,2) │ (a)=(63206) │ (96313,2) │ > │ 61 │ (96313,3) │ (a)=(63206) │ (96313,3) │ > │ 62 │ (96313,4) │ (a)=(63206) │ (96313,4) │ > │ 63 │ (96313,5) │ (a)=(63206) │ (96313,5) │ > │ 64 │ (96313,6) │ (a)=(63206) │ (96313,6) │ > │ 65 │ (96313,7) │ (a)=(63206) │ (96313,7) │ > │ 66 │ (96313,8) │ 
(a)=(63206) │ (96313,8) │ > │ 67 │ (96313,9) │ (a)=(63206) │ (96313,9) │ > │ 68 │ (96313,10) │ (a)=(63206) │ (96313,10) │ > │ 69 │ (96313,11) │ (a)=(63206) │ (96313,11) │ > │ 70 │ (96313,12) │ (a)=(63206) │ (96313,12) │ > │ 71 │ (96313,13) │ (a)=(63206) │ (96313,13) │ > │ 72 │ (96313,14) │ (a)=(63206) │ (96313,14) │ > │ 73 │ (96313,15) │ (a)=(63206) │ (96313,15) │ > │ 74 │ (96313,16) │ (a)=(63206) │ (96313,16) │ > │ 75 │ (96313,17) │ (a)=(63206) │ (96313,17) │ > │ 76 │ (96313,18) │ (a)=(63206) │ (96313,18) │ > │ 77 │ (96313,20) │ (a)=(63206) │ (96313,20) │ > │ 78 │ (96313,19) │ (a)=(63207) │ (96313,19) │ > │ 79 │ (96313,21) │ (a)=(63207) │ (96313,21) │ > │ 80 │ (96314,1) │ (a)=(63207) │ (96314,1) │ > │ 81 │ (96314,2) │ (a)=(63207) │ (96314,2) │ > │ 82 │ (96314,3) │ (a)=(63207) │ (96314,3) │ > │ 83 │ (96314,4) │ (a)=(63207) │ (96314,4) │ > │ 84 │ (96314,5) │ (a)=(63207) │ (96314,5) │ > │ 85 │ (96314,6) │ (a)=(63207) │ (96314,6) │ > │ 86 │ (96314,7) │ (a)=(63207) │ (96314,7) │ > │ 87 │ (96314,8) │ (a)=(63207) │ (96314,8) │ > │ 88 │ (96314,9) │ (a)=(63207) │ (96314,9) │ > │ 89 │ (96314,10) │ (a)=(63207) │ (96314,10) │ > │ 90 │ (96314,11) │ (a)=(63207) │ (96314,11) │ > │ 91 │ (96314,12) │ (a)=(63207) │ (96314,12) │ > │ 92 │ (96314,13) │ (a)=(63207) │ (96314,13) │ > │ 93 │ (96314,14) │ (a)=(63207) │ (96314,14) │ > │ 94 │ (96314,15) │ (a)=(63207) │ (96314,15) │ > │ 95 │ (96314,16) │ (a)=(63207) │ (96314,16) │ > │ 96 │ (96314,17) │ (a)=(63207) │ (96314,17) │ > │ 97 │ (96314,18) │ (a)=(63207) │ (96314,18) │ > │ 98 │ (96314,19) │ (a)=(63207) │ (96314,19) │ > │ 99 │ (96314,20) │ (a)=(63207) │ (96314,20) │ > │ 100 │ (96314,21) │ (a)=(63207) │ (96314,21) │ > │ 101 │ (96315,1) │ (a)=(63207) │ (96315,1) │ > │ 102 │ (96315,2) │ (a)=(63207) │ (96315,2) │ > │ 103 │ (96315,3) │ (a)=(63207) │ (96315,3) │ > │ 104 │ (96315,4) │ (a)=(63207) │ (96315,4) │ > │ 105 │ (96315,5) │ (a)=(63207) │ (96315,5) │ > │ 106 │ (96315,6) │ (a)=(63207) │ (96315,6) │ > │ 107 │ (96315,7) │ (a)=(63207) │ (96315,7) │ > │ 108 │ (96315,8) │ (a)=(63207) │ (96315,8) │ > │ 109 │ (96315,12) │ (a)=(63207) │ (96315,12) │ > │ 110 │ (96315,9) │ (a)=(63208) │ (96315,9) │ > │ 111 │ (96315,10) │ (a)=(63208) │ (96315,10) │ > │ 112 │ (96315,11) │ (a)=(63208) │ (96315,11) │ > │ 113 │ (96315,13) │ (a)=(63208) │ (96315,13) │ > │ 114 │ (96315,14) │ (a)=(63208) │ (96315,14) │ > │ 115 │ (96315,15) │ (a)=(63208) │ (96315,15) │ > │ 116 │ (96315,16) │ (a)=(63208) │ (96315,16) │ > │ 117 │ (96315,17) │ (a)=(63208) │ (96315,17) │ > │ 118 │ (96315,18) │ (a)=(63208) │ (96315,18) │ > │ 119 │ (96315,19) │ (a)=(63208) │ (96315,19) │ > │ 120 │ (96315,20) │ (a)=(63208) │ (96315,20) │ > │ 121 │ (96315,21) │ (a)=(63208) │ (96315,21) │ > │ 122 │ (96316,1) │ (a)=(63208) │ (96316,1) │ > │ 123 │ (96316,2) │ (a)=(63208) │ (96316,2) │ > │ 124 │ (96316,3) │ (a)=(63208) │ (96316,3) │ > │ 125 │ (96316,4) │ (a)=(63208) │ (96316,4) │ > │ 126 │ (96316,5) │ (a)=(63208) │ (96316,5) │ > │ 127 │ (96316,6) │ (a)=(63208) │ (96316,6) │ > │ 128 │ (96316,7) │ (a)=(63208) │ (96316,7) │ > │ 129 │ (96316,8) │ (a)=(63208) │ (96316,8) │ > │ 130 │ (96316,9) │ (a)=(63208) │ (96316,9) │ > │ 131 │ (96316,10) │ (a)=(63208) │ (96316,10) │ > │ 132 │ (96316,11) │ (a)=(63208) │ (96316,11) │ > │ 133 │ (96316,12) │ (a)=(63208) │ (96316,12) │ > │ 134 │ (96316,13) │ (a)=(63208) │ (96316,13) │ > │ 135 │ (96316,14) │ (a)=(63208) │ (96316,14) │ > │ 136 │ (96316,15) │ (a)=(63208) │ (96316,15) │ > │ 137 │ (96316,16) │ (a)=(63208) │ (96316,16) │ > │ 138 │ (96316,17) │ (a)=(63208) │ (96316,17) │ > │ 139 │ 
(96316,18) │ (a)=(63208) │ (96316,18) │ > │ 140 │ (96316,19) │ (a)=(63208) │ (96316,19) │ > │ 141 │ (96316,20) │ (a)=(63208) │ (96316,20) │ > │ 142 │ (96316,21) │ (a)=(63209) │ (96316,21) │ > │ 143 │ (96317,1) │ (a)=(63209) │ (96317,1) │ > │ 144 │ (96317,2) │ (a)=(63209) │ (96317,2) │ > │ 145 │ (96317,3) │ (a)=(63209) │ (96317,3) │ > │ 146 │ (96317,4) │ (a)=(63209) │ (96317,4) │ > │ 147 │ (96317,5) │ (a)=(63209) │ (96317,5) │ > │ 148 │ (96317,6) │ (a)=(63209) │ (96317,6) │ > │ 149 │ (96317,7) │ (a)=(63209) │ (96317,7) │ > │ 150 │ (96317,8) │ (a)=(63209) │ (96317,8) │ > │ 151 │ (96317,9) │ (a)=(63209) │ (96317,9) │ > │ 152 │ (96317,10) │ (a)=(63209) │ (96317,10) │ > │ 153 │ (96317,11) │ (a)=(63209) │ (96317,11) │ > │ 154 │ (96317,12) │ (a)=(63209) │ (96317,12) │ > │ 155 │ (96317,13) │ (a)=(63209) │ (96317,13) │ > │ 156 │ (96317,14) │ (a)=(63209) │ (96317,14) │ > │ 157 │ (96317,15) │ (a)=(63209) │ (96317,15) │ > │ 158 │ (96317,16) │ (a)=(63209) │ (96317,16) │ > │ 159 │ (96317,17) │ (a)=(63209) │ (96317,17) │ > │ 160 │ (96317,18) │ (a)=(63209) │ (96317,18) │ > │ 161 │ (96317,19) │ (a)=(63209) │ (96317,19) │ > │ 162 │ (96317,20) │ (a)=(63209) │ (96317,20) │ > │ 163 │ (96317,21) │ (a)=(63209) │ (96317,21) │ > │ 164 │ (96318,1) │ (a)=(63209) │ (96318,1) │ > │ 165 │ (96318,2) │ (a)=(63209) │ (96318,2) │ > │ 166 │ (96318,3) │ (a)=(63209) │ (96318,3) │ > │ 167 │ (96318,4) │ (a)=(63209) │ (96318,4) │ > │ 168 │ (96318,5) │ (a)=(63209) │ (96318,5) │ > │ 169 │ (96318,6) │ (a)=(63209) │ (96318,6) │ > │ 170 │ (96318,7) │ (a)=(63209) │ (96318,7) │ > │ 171 │ (96318,8) │ (a)=(63209) │ (96318,8) │ > │ 172 │ (96318,9) │ (a)=(63209) │ (96318,9) │ > │ 173 │ (96318,10) │ (a)=(63209) │ (96318,10) │ > │ 174 │ (96318,11) │ (a)=(63210) │ (96318,11) │ > │ 175 │ (96318,12) │ (a)=(63210) │ (96318,12) │ > │ 176 │ (96318,13) │ (a)=(63210) │ (96318,13) │ > │ 177 │ (96318,14) │ (a)=(63210) │ (96318,14) │ > │ 178 │ (96318,15) │ (a)=(63210) │ (96318,15) │ > │ 179 │ (96318,16) │ (a)=(63210) │ (96318,16) │ > │ 180 │ (96318,17) │ (a)=(63210) │ (96318,17) │ > │ 181 │ (96318,18) │ (a)=(63210) │ (96318,18) │ > │ 182 │ (96318,19) │ (a)=(63210) │ (96318,19) │ > │ 183 │ (96318,20) │ (a)=(63210) │ (96318,20) │ > │ 184 │ (96318,21) │ (a)=(63210) │ (96318,21) │ > │ 185 │ (96319,1) │ (a)=(63210) │ (96319,1) │ > │ 186 │ (96319,2) │ (a)=(63210) │ (96319,2) │ > │ 187 │ (96319,3) │ (a)=(63210) │ (96319,3) │ > │ 188 │ (96319,4) │ (a)=(63210) │ (96319,4) │ > │ 189 │ (96319,5) │ (a)=(63210) │ (96319,5) │ > │ 190 │ (96319,6) │ (a)=(63210) │ (96319,6) │ > │ 191 │ (96319,7) │ (a)=(63210) │ (96319,7) │ > │ 192 │ (96319,8) │ (a)=(63210) │ (96319,8) │ > │ 193 │ (96319,9) │ (a)=(63210) │ (96319,9) │ > │ 194 │ (96319,10) │ (a)=(63210) │ (96319,10) │ > │ 195 │ (96319,11) │ (a)=(63210) │ (96319,11) │ > │ 196 │ (96319,12) │ (a)=(63210) │ (96319,12) │ > │ 197 │ (96319,13) │ (a)=(63210) │ (96319,13) │ > │ 198 │ (96319,14) │ (a)=(63210) │ (96319,14) │ > │ 199 │ (96319,15) │ (a)=(63210) │ (96319,15) │ > │ 200 │ (96319,16) │ (a)=(63210) │ (96319,16) │ > │ 201 │ (96319,17) │ (a)=(63210) │ (96319,17) │ > │ 202 │ (96319,18) │ (a)=(63210) │ (96319,18) │ > │ 203 │ (96319,19) │ (a)=(63210) │ (96319,19) │ > │ 204 │ (96319,20) │ (a)=(63210) │ (96319,20) │ > │ 205 │ (96320,1) │ (a)=(63210) │ (96320,1) │ > │ 206 │ (96319,21) │ (a)=(63211) │ (96319,21) │ > │ 207 │ (96320,2) │ (a)=(63211) │ (96320,2) │ > │ 208 │ (96320,3) │ (a)=(63211) │ (96320,3) │ > │ 209 │ (96320,4) │ (a)=(63211) │ (96320,4) │ > │ 210 │ (96320,5) │ (a)=(63211) │ (96320,5) │ > │ 211 │ (96320,6) │ 
(a)=(63211) │ (96320,6) │ > │ 212 │ (96320,7) │ (a)=(63211) │ (96320,7) │ > │ 213 │ (96320,8) │ (a)=(63211) │ (96320,8) │ > │ 214 │ (96320,9) │ (a)=(63211) │ (96320,9) │ > │ 215 │ (96320,10) │ (a)=(63211) │ (96320,10) │ > │ 216 │ (96320,11) │ (a)=(63211) │ (96320,11) │ > │ 217 │ (96320,12) │ (a)=(63211) │ (96320,12) │ > │ 218 │ (96320,13) │ (a)=(63211) │ (96320,13) │ > │ 219 │ (96320,14) │ (a)=(63211) │ (96320,14) │ > │ 220 │ (96320,15) │ (a)=(63211) │ (96320,15) │ > │ 221 │ (96320,16) │ (a)=(63211) │ (96320,16) │ > │ 222 │ (96320,17) │ (a)=(63211) │ (96320,17) │ > │ 223 │ (96320,18) │ (a)=(63211) │ (96320,18) │ > │ 224 │ (96320,19) │ (a)=(63211) │ (96320,19) │ > │ 225 │ (96320,20) │ (a)=(63211) │ (96320,20) │ > │ 226 │ (96320,21) │ (a)=(63211) │ (96320,21) │ > │ 227 │ (96321,1) │ (a)=(63211) │ (96321,1) │ > │ 228 │ (96321,2) │ (a)=(63211) │ (96321,2) │ > │ 229 │ (96321,3) │ (a)=(63211) │ (96321,3) │ > │ 230 │ (96321,4) │ (a)=(63211) │ (96321,4) │ > │ 231 │ (96321,5) │ (a)=(63211) │ (96321,5) │ > │ 232 │ (96321,6) │ (a)=(63211) │ (96321,6) │ > │ 233 │ (96321,7) │ (a)=(63211) │ (96321,7) │ > │ 234 │ (96321,8) │ (a)=(63211) │ (96321,8) │ > │ 235 │ (96321,9) │ (a)=(63211) │ (96321,9) │ > │ 236 │ (96321,10) │ (a)=(63211) │ (96321,10) │ > │ 237 │ (96321,11) │ (a)=(63211) │ (96321,11) │ > │ 238 │ (96321,12) │ (a)=(63212) │ (96321,12) │ > │ 239 │ (96321,13) │ (a)=(63212) │ (96321,13) │ > │ 240 │ (96321,14) │ (a)=(63212) │ (96321,14) │ > │ 241 │ (96321,15) │ (a)=(63212) │ (96321,15) │ > │ 242 │ (96321,16) │ (a)=(63212) │ (96321,16) │ > │ 243 │ (96321,17) │ (a)=(63212) │ (96321,17) │ > │ 244 │ (96321,18) │ (a)=(63212) │ (96321,18) │ > │ 245 │ (96321,19) │ (a)=(63212) │ (96321,19) │ > │ 246 │ (96321,20) │ (a)=(63212) │ (96321,20) │ > │ 247 │ (96321,21) │ (a)=(63212) │ (96321,21) │ > │ 248 │ (96322,1) │ (a)=(63212) │ (96322,1) │ > │ 249 │ (96322,2) │ (a)=(63212) │ (96322,2) │ > │ 250 │ (96322,3) │ (a)=(63212) │ (96322,3) │ > │ 251 │ (96322,4) │ (a)=(63212) │ (96322,4) │ > │ 252 │ (96322,5) │ (a)=(63212) │ (96322,5) │ > │ 253 │ (96322,6) │ (a)=(63212) │ (96322,6) │ > │ 254 │ (96322,7) │ (a)=(63212) │ (96322,7) │ > │ 255 │ (96322,8) │ (a)=(63212) │ (96322,8) │ > │ 256 │ (96322,9) │ (a)=(63212) │ (96322,9) │ > │ 257 │ (96322,10) │ (a)=(63212) │ (96322,10) │ > │ 258 │ (96322,11) │ (a)=(63212) │ (96322,11) │ > │ 259 │ (96322,12) │ (a)=(63212) │ (96322,12) │ > │ 260 │ (96322,13) │ (a)=(63212) │ (96322,13) │ > │ 261 │ (96322,14) │ (a)=(63212) │ (96322,14) │ > │ 262 │ (96322,15) │ (a)=(63212) │ (96322,15) │ > │ 263 │ (96322,16) │ (a)=(63212) │ (96322,16) │ > │ 264 │ (96322,17) │ (a)=(63212) │ (96322,17) │ > │ 265 │ (96322,18) │ (a)=(63212) │ (96322,18) │ > │ 266 │ (96322,19) │ (a)=(63212) │ (96322,19) │ > │ 267 │ (96322,20) │ (a)=(63212) │ (96322,20) │ > │ 268 │ (96322,21) │ (a)=(63212) │ (96322,21) │ > │ 269 │ (96323,3) │ (a)=(63212) │ (96323,3) │ > │ 270 │ (96323,1) │ (a)=(63213) │ (96323,1) │ > │ 271 │ (96323,2) │ (a)=(63213) │ (96323,2) │ > │ 272 │ (96323,4) │ (a)=(63213) │ (96323,4) │ > │ 273 │ (96323,5) │ (a)=(63213) │ (96323,5) │ > │ 274 │ (96323,6) │ (a)=(63213) │ (96323,6) │ > │ 275 │ (96323,7) │ (a)=(63213) │ (96323,7) │ > │ 276 │ (96323,8) │ (a)=(63213) │ (96323,8) │ > │ 277 │ (96323,9) │ (a)=(63213) │ (96323,9) │ > │ 278 │ (96323,10) │ (a)=(63213) │ (96323,10) │ > │ 279 │ (96323,11) │ (a)=(63213) │ (96323,11) │ > │ 280 │ (96323,12) │ (a)=(63213) │ (96323,12) │ > │ 281 │ (96323,13) │ (a)=(63213) │ (96323,13) │ > │ 282 │ (96323,14) │ (a)=(63213) │ (96323,14) │ > │ 283 │ (96323,15) │ (a)=(63213) │ 
(96323,15) │ > │ 284 │ (96323,16) │ (a)=(63213) │ (96323,16) │ > │ 285 │ (96323,17) │ (a)=(63213) │ (96323,17) │ > │ 286 │ (96323,18) │ (a)=(63213) │ (96323,18) │ > │ 287 │ (96323,19) │ (a)=(63213) │ (96323,19) │ > │ 288 │ (96323,20) │ (a)=(63213) │ (96323,20) │ > │ 289 │ (96323,21) │ (a)=(63213) │ (96323,21) │ > │ 290 │ (96324,1) │ (a)=(63213) │ (96324,1) │ > │ 291 │ (96324,2) │ (a)=(63213) │ (96324,2) │ > │ 292 │ (96324,3) │ (a)=(63213) │ (96324,3) │ > │ 293 │ (96324,4) │ (a)=(63213) │ (96324,4) │ > │ 294 │ (96324,5) │ (a)=(63213) │ (96324,5) │ > │ 295 │ (96324,6) │ (a)=(63213) │ (96324,6) │ > │ 296 │ (96324,7) │ (a)=(63213) │ (96324,7) │ > │ 297 │ (96324,8) │ (a)=(63213) │ (96324,8) │ > │ 298 │ (96324,9) │ (a)=(63213) │ (96324,9) │ > │ 299 │ (96324,11) │ (a)=(63213) │ (96324,11) │ > │ 300 │ (96324,12) │ (a)=(63213) │ (96324,12) │ > │ 301 │ (96324,13) │ (a)=(63213) │ (96324,13) │ > │ 302 │ (96324,10) │ (a)=(63214) │ (96324,10) │ > │ 303 │ (96324,14) │ (a)=(63214) │ (96324,14) │ > │ 304 │ (96324,15) │ (a)=(63214) │ (96324,15) │ > │ 305 │ (96324,16) │ (a)=(63214) │ (96324,16) │ > │ 306 │ (96324,17) │ (a)=(63214) │ (96324,17) │ > │ 307 │ (96324,18) │ (a)=(63214) │ (96324,18) │ > │ 308 │ (96324,19) │ (a)=(63214) │ (96324,19) │ > │ 309 │ (96324,20) │ (a)=(63214) │ (96324,20) │ > │ 310 │ (96324,21) │ (a)=(63214) │ (96324,21) │ > │ 311 │ (96325,1) │ (a)=(63214) │ (96325,1) │ > │ 312 │ (96325,2) │ (a)=(63214) │ (96325,2) │ > │ 313 │ (96325,3) │ (a)=(63214) │ (96325,3) │ > │ 314 │ (96325,4) │ (a)=(63214) │ (96325,4) │ > │ 315 │ (96325,5) │ (a)=(63214) │ (96325,5) │ > │ 316 │ (96325,6) │ (a)=(63214) │ (96325,6) │ > │ 317 │ (96325,7) │ (a)=(63214) │ (96325,7) │ > │ 318 │ (96325,8) │ (a)=(63214) │ (96325,8) │ > │ 319 │ (96325,9) │ (a)=(63214) │ (96325,9) │ > │ 320 │ (96325,10) │ (a)=(63214) │ (96325,10) │ > │ 321 │ (96325,11) │ (a)=(63214) │ (96325,11) │ > │ 322 │ (96325,12) │ (a)=(63214) │ (96325,12) │ > │ 323 │ (96325,13) │ (a)=(63214) │ (96325,13) │ > │ 324 │ (96325,14) │ (a)=(63214) │ (96325,14) │ > │ 325 │ (96325,15) │ (a)=(63214) │ (96325,15) │ > │ 326 │ (96325,16) │ (a)=(63214) │ (96325,16) │ > │ 327 │ (96325,17) │ (a)=(63214) │ (96325,17) │ > │ 328 │ (96325,18) │ (a)=(63214) │ (96325,18) │ > │ 329 │ (96325,19) │ (a)=(63214) │ (96325,19) │ > │ 330 │ (96325,20) │ (a)=(63214) │ (96325,20) │ > │ 331 │ (96325,21) │ (a)=(63214) │ (96325,21) │ > │ 332 │ (96326,1) │ (a)=(63214) │ (96326,1) │ > │ 333 │ (96326,3) │ (a)=(63214) │ (96326,3) │ > │ 334 │ (96326,2) │ (a)=(63215) │ (96326,2) │ > │ 335 │ (96326,4) │ (a)=(63215) │ (96326,4) │ > │ 336 │ (96326,5) │ (a)=(63215) │ (96326,5) │ > │ 337 │ (96326,6) │ (a)=(63215) │ (96326,6) │ > │ 338 │ (96326,7) │ (a)=(63215) │ (96326,7) │ > │ 339 │ (96326,8) │ (a)=(63215) │ (96326,8) │ > │ 340 │ (96326,9) │ (a)=(63215) │ (96326,9) │ > │ 341 │ (96326,10) │ (a)=(63215) │ (96326,10) │ > │ 342 │ (96326,11) │ (a)=(63215) │ (96326,11) │ > │ 343 │ (96326,12) │ (a)=(63215) │ (96326,12) │ > │ 344 │ (96326,13) │ (a)=(63215) │ (96326,13) │ > │ 345 │ (96326,14) │ (a)=(63215) │ (96326,14) │ > │ 346 │ (96326,15) │ (a)=(63215) │ (96326,15) │ > │ 347 │ (96326,16) │ (a)=(63215) │ (96326,16) │ > │ 348 │ (96326,17) │ (a)=(63215) │ (96326,17) │ > │ 349 │ (96326,18) │ (a)=(63215) │ (96326,18) │ > │ 350 │ (96326,19) │ (a)=(63215) │ (96326,19) │ > │ 351 │ (96326,20) │ (a)=(63215) │ (96326,20) │ > │ 352 │ (96326,21) │ (a)=(63215) │ (96326,21) │ > │ 353 │ (96327,1) │ (a)=(63215) │ (96327,1) │ > │ 354 │ (96327,2) │ (a)=(63215) │ (96327,2) │ > │ 355 │ (96327,3) │ (a)=(63215) │ (96327,3) │ > 
│ 356 │ (96327,4) │ (a)=(63215) │ (96327,4) │ > │ 357 │ (96327,5) │ (a)=(63215) │ (96327,5) │ > │ 358 │ (96327,6) │ (a)=(63215) │ (96327,6) │ > │ 359 │ (96327,7) │ (a)=(63215) │ (96327,7) │ > │ 360 │ (96327,8) │ (a)=(63215) │ (96327,8) │ > │ 361 │ (96327,9) │ (a)=(63215) │ (96327,9) │ > │ 362 │ (96327,10) │ (a)=(63215) │ (96327,10) │ > │ 363 │ (96327,11) │ (a)=(63215) │ (96327,11) │ > │ 364 │ (96327,12) │ (a)=(63215) │ (96327,12) │ > │ 365 │ (96327,14) │ (a)=(63215) │ (96327,14) │ > │ 366 │ (96327,13) │ (a)=(63216) │ (96327,13) │ > │ 367 │ (96327,15) │ (a)=(63216) │ (96327,15) │ > └────────────┴──────────────┴─────────────┴────────────┘ > (367 rows) > > I only notice one tiny discontinuity in this "unshuffled" "idx" page: the index > tuple at offset 205 uses the heap TID (96320,1), whereas the index tuple right > after that (at offset 206) uses the heap TID (96319,21) (before we get to a > large run of heap TIDs that use heap block number 96320 once more). > > As I touched on already, this effect can be seen even with perfectly correlated > inserts. The effect is caused by the FSM having a tiny bit of space left on one > heap page -- not enough space to fit an incoming heap tuple, but still enough to > fit a slightly smaller heap tuple that is inserted shortly thereafter. You end > up with exactly one index tuple whose heap TID is slightly out-of-order, though > only every once in a long while. > This seems rather bizarre, considering the two tables are exactly the same, except that in t2 the first column is negative, and the rows are fixed-length. Even heap_page_items says the tables are exactly the same. So why would the index get so different like this? regards -- Tomas Vondra
On 8/13/25 16:44, Andres Freund wrote: > Hi, > > On 2025-08-13 14:15:37 +0200, Tomas Vondra wrote: >> In fact, I believe this is about io_method. I initially didn't see the >> difference you described, and then I realized I set io_method=sync to >> make it easier to track the block access. And if I change io_method to >> worker, I get different stats, that also change between runs. >> >> With "sync" I always get this (after a restart): >> >> Buffers: shared hit=7435 read=52801 >> >> while with "worker" I get this: >> >> Buffers: shared hit=4879 read=52801 >> Buffers: shared hit=5151 read=52801 >> Buffers: shared hit=4978 read=52801 >> >> So not only it changes run to tun, it also does not add up to 60236. > > This is reproducible on master? If so, how? > > >> I vaguely recall I ran into this some time ago during AIO benchmarking, >> and IIRC it's due to how StartReadBuffersImpl() may behave differently >> depending on I/O started earlier. It only calls PinBufferForBlock() in >> some cases, and PinBufferForBlock() is what updates the hits. > > Hm, I don't immediately see an issue there. The only case we don't call > PinBufferForBlock() is if we already have pinned the relevant buffer in a > prior call to StartReadBuffersImpl(). > > > If this happens only with the prefetching patch applied, is is possible that > what happens here is that we occasionally re-request buffers that already in > the process of being read in? That would only happen with a read stream and > io_method != sync (since with sync we won't read ahead). If we have to start > reading in a buffer that's already undergoing IO we wait for the IO to > complete and count that access as a hit: > > /* > * Check if we can start IO on the first to-be-read buffer. > * > * If an I/O is already in progress in another backend, we want to wait > * for the outcome: either done, or something went wrong and we will > * retry. > */ > if (!ReadBuffersCanStartIO(buffers[nblocks_done], false)) > { > ... > /* > * Report and track this as a 'hit' for this backend, even though it > * must have started out as a miss in PinBufferForBlock(). The other > * backend will track this as a 'read'. > */ > ... > if (persistence == RELPERSISTENCE_TEMP) > pgBufferUsage.local_blks_hit += 1; > else > pgBufferUsage.shared_blks_hit += 1; > ... > > I think it has to be this. It only happens with io_method != sync, and only with effective_io_concurrency > 1. At first I was wondering why I can't reproduce this for seqscan/bitmapscan, but then I realized those plans never visit the same block repeatedly - indexscans do that. It's also not surprising it's timing-sensitive, as it likely depends on how fast the worker happens to start/complete requests. What would be a good way to "prove" it really is this? regards -- Tomas Vondra
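For reference, a minimal recipe for reproducing the unstable hit counts, assuming the "t" table and the BETWEEN query used elsewhere in this thread (io_method can only be changed with a server restart, while effective_io_concurrency can be set per session):

ALTER SYSTEM SET io_method = 'worker';   -- needs a restart to take effect
-- after restarting:
SET effective_io_concurrency = 16;       -- anything > 1 lets the read stream look ahead
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a;
-- run it several times; with io_method != sync the "shared hit" count varies from run
-- to run, with io_method = sync it stays constant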
On 8/13/25 18:01, Peter Geoghegan wrote: > On Wed, Aug 13, 2025 at 11:28 AM Andres Freund <andres@anarazel.de> wrote: >>> With "sync" I always get this (after a restart): >>> >>> Buffers: shared hit=7435 read=52801 >>> >>> while with "worker" I get this: >>> >>> Buffers: shared hit=4879 read=52801 >>> Buffers: shared hit=5151 read=52801 >>> Buffers: shared hit=4978 read=52801 >>> >>> So not only it changes run to tun, it also does not add up to 60236. >> >> This is reproducible on master? If so, how? > > AFAIK it is *not* reproducible on master. > >> If this happens only with the prefetching patch applied, is is possible that >> what happens here is that we occasionally re-request buffers that already in >> the process of being read in? That would only happen with a read stream and >> io_method != sync (since with sync we won't read ahead). If we have to start >> reading in a buffer that's already undergoing IO we wait for the IO to >> complete and count that access as a hit: > > This theory seems quite plausible to me. Though it is a bit surprising > that I see incorrect buffer hit counts on the "good" forwards scan > case, rather than on the "bad" backwards scan case. > > Here's what I mean by things being broken on the read stream side (at > least with certain backwards scan cases): > > When I add instrumentation to the read stream side, by adding elog > debug calls that show the blocknum seen by read_stream_get_block, I > see out-of-order and repeated blocknums with the "bad" backwards scan > case ("SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a > desc"): > > ... > NOTICE: index_scan_stream_read_next: index 1163 TID (25052,21) > WARNING: prior lastBlock is 25053 for batchno 2856, new one: 25052 > WARNING: blocknum: 25052, 0x55614810efb0 > WARNING: blocknum: 25052, 0x55614810efb0 > NOTICE: index_scan_stream_read_next: index 1161 TID (25053,3) > WARNING: prior lastBlock is 25052 for batchno 2856, new one: 25053 > WARNING: blocknum: 25053, 0x55614810efb0 > NOTICE: index_scan_stream_read_next: index 1160 TID (25052,19) > WARNING: prior lastBlock is 25053 for batchno 2856, new one: 25052 > WARNING: blocknum: 25052, 0x55614810efb0 > WARNING: blocknum: 25052, 0x55614810efb0 > NOTICE: index_scan_stream_read_next: index 1141 TID (25051,21) > WARNING: prior lastBlock is 25052 for batchno 2856, new one: 25051 > WARNING: blocknum: 25051, 0x55614810efb0 > ... > > Notice that we see the same blocknum twice in close succession. Also > notice that we're passed 25052 and then subsequently passed 25053, > only to be passed 25053 once more. > I did investigate this, and I don't think there's anything broken in read_stream. It happens because ReadStream has a concept of "ungetting" a block, which can happen after hitting some I/O limits. In that case we "remember" the last block (in read_stream_look_ahead calls read_stream_unget_block), and we return it again. It may seem as if read_stream_get_block() produced the same block twice, but it's really just the block from the last round. All duplicates produced by read_stream_look_ahead were caused by this. I suspected it's a bug in lastBlock optimization, but that's not the case, it happens entirely within read_stream. And it's expected. It's also not very surprising this happens with backwards scans more. The I/O is apparently much slower (due to missing OS prefetch), so we're much more likely to hit the I/O limits (max_ios and various other limits in read_stream_start_pending_read). regards -- Tomas Vondra
On Wed, Aug 13, 2025 at 1:01 PM Tomas Vondra <tomas@vondra.me> wrote: > This seems rather bizarre, considering the two tables are exactly the > same, except that in t2 the first column is negative, and the rows are > fixed-length. Even heap_page_items says the tables are exactly the same. > > So why would the index get so different like this? In the past, when I required *perfectly* deterministic results for INSERT INTO test_table ... SELECT * FROM source_table bulk inserts (which was important during the Postgres 12 and 13 nbtree work), I found it necessary to "set synchronize_seqscans=off". If I was writing a test such as this, I'd probably do that defensively, even if it wasn't clear that it mattered. (I'm also in the habit of using unlogged tables, because VACUUM tends to set their pages all-visible more reliably than equivalent logged tables, which I notice that you're also doing here.) That said, I *think* that the "locally shuffled" heap TID pattern that we see with "t2"/"idx2" is mostly (perhaps entirely) caused by the way that you're inverting the indexed column's value when initially generating "t2". A given range of values such as "1 through to 4" becomes "-4 through to -1" as their tuples are inserted into t2. You're effectively inverting the order of the bigint indexed column "a" -- but you're *not* inverting the order of the imaginary tie-breaker heap column (it *remains* in ASC heap TID order in "t2"). In general, when doing this sort of analysis, I find it useful to manually verify that the data that I generated matches my expectations. Usually a quick check with pageinspect is enough. I'll just randomly select 2 - 3 leaf pages, and make sure that they all more or less match my expectations. -- Peter Geoghegan
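For what it's worth, the kind of spot-check described above can be as simple as the following (just an illustrative sketch, reusing leaf block 5555 and the "idx" / "t" / "t2" names from earlier in the thread):

CREATE EXTENSION IF NOT EXISTS pageinspect;
-- eyeball a randomly chosen leaf page: htid should move in the direction you expect
SELECT itemoffset, ctid, data, htid
  FROM bt_page_items('idx', 5555)
 LIMIT 20;
-- and the planner's notion of heap ordering for the indexed column
SELECT tablename, attname, correlation
  FROM pg_stats
 WHERE tablename IN ('t', 't2') AND attname = 'a';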
Hi, On 2025-08-13 23:07:07 +0200, Tomas Vondra wrote: > On 8/13/25 16:44, Andres Freund wrote: > > On 2025-08-13 14:15:37 +0200, Tomas Vondra wrote: > >> In fact, I believe this is about io_method. I initially didn't see the > >> difference you described, and then I realized I set io_method=sync to > >> make it easier to track the block access. And if I change io_method to > >> worker, I get different stats, that also change between runs. > >> > >> With "sync" I always get this (after a restart): > >> > >> Buffers: shared hit=7435 read=52801 > >> > >> while with "worker" I get this: > >> > >> Buffers: shared hit=4879 read=52801 > >> Buffers: shared hit=5151 read=52801 > >> Buffers: shared hit=4978 read=52801 > >> > >> So not only it changes run to tun, it also does not add up to 60236. > > > > This is reproducible on master? If so, how? > > > > > >> I vaguely recall I ran into this some time ago during AIO benchmarking, > >> and IIRC it's due to how StartReadBuffersImpl() may behave differently > >> depending on I/O started earlier. It only calls PinBufferForBlock() in > >> some cases, and PinBufferForBlock() is what updates the hits. > > > > Hm, I don't immediately see an issue there. The only case we don't call > > PinBufferForBlock() is if we already have pinned the relevant buffer in a > > prior call to StartReadBuffersImpl(). > > > > > > If this happens only with the prefetching patch applied, is is possible that > > what happens here is that we occasionally re-request buffers that already in > > the process of being read in? That would only happen with a read stream and > > io_method != sync (since with sync we won't read ahead). If we have to start > > reading in a buffer that's already undergoing IO we wait for the IO to > > complete and count that access as a hit: > > > > /* > > * Check if we can start IO on the first to-be-read buffer. > > * > > * If an I/O is already in progress in another backend, we want to wait > > * for the outcome: either done, or something went wrong and we will > > * retry. > > */ > > if (!ReadBuffersCanStartIO(buffers[nblocks_done], false)) > > { > > ... > > /* > > * Report and track this as a 'hit' for this backend, even though it > > * must have started out as a miss in PinBufferForBlock(). The other > > * backend will track this as a 'read'. > > */ > > ... > > if (persistence == RELPERSISTENCE_TEMP) > > pgBufferUsage.local_blks_hit += 1; > > else > > pgBufferUsage.shared_blks_hit += 1; > > ... > > > > > > I think it has to be this. It only happens with io_method != sync, and > only with effective_io_concurrency > 1. At first I was wondering why I > can't reproduce this for seqscan/bitmapscan, but then I realized those > plans never visit the same block repeatedly - indexscans do that. It's > also not surprising it's timing-sensitive, as it likely depends on how > fast the worker happens to start/complete requests. > > What would be a good way to "prove" it really is this? I'd just comment out those stats increments and then check if the stats are stable afterwards. Greetings, Andres Freund
On Wed, Aug 13, 2025 at 5:19 PM Tomas Vondra <tomas@vondra.me> wrote: > It's also not very surprising this happens with backwards scans more. > The I/O is apparently much slower (due to missing OS prefetch), so we're > much more likely to hit the I/O limits (max_ios and various other limits > in read_stream_start_pending_read). But there's no OS prefetch with direct I/O. At most, there might be some kind of readahead implemented in the SSD's firmware. Even assuming that the SSD issue is relevant, I can't help but suspect that something is off here. To recap from yesterday, the forwards scan showed "I/O Timings: shared read=45.313" and "Execution Time: 330.379 ms" on my system, while the equivalent backwards scan showed "I/O Timings: shared read=194.774" and "Execution Time: 1236.655 ms". Does that kind of disparity *really* make sense with a modern NVME SSD such as this (I use a Samsung 980 pro), in the context of a scan that can use aggressive prefetching? Are we really, truly operating at the limits of what is possible with this hardware, for this backwards scan? What if I use a ramdisk for this? That'll be much faster, no matter the scan order. Should I expect this step to make the effect with duplicates being produced by read_stream_look_ahead to just go away, regardless of the scan direction in use? -- Peter Geoghegan
On 8/13/25 23:37, Andres Freund wrote: > Hi, > > On 2025-08-13 23:07:07 +0200, Tomas Vondra wrote: >> On 8/13/25 16:44, Andres Freund wrote: >>> On 2025-08-13 14:15:37 +0200, Tomas Vondra wrote: >>>> In fact, I believe this is about io_method. I initially didn't see the >>>> difference you described, and then I realized I set io_method=sync to >>>> make it easier to track the block access. And if I change io_method to >>>> worker, I get different stats, that also change between runs. >>>> >>>> With "sync" I always get this (after a restart): >>>> >>>> Buffers: shared hit=7435 read=52801 >>>> >>>> while with "worker" I get this: >>>> >>>> Buffers: shared hit=4879 read=52801 >>>> Buffers: shared hit=5151 read=52801 >>>> Buffers: shared hit=4978 read=52801 >>>> >>>> So not only it changes run to tun, it also does not add up to 60236. >>> >>> This is reproducible on master? If so, how? >>> >>> >>>> I vaguely recall I ran into this some time ago during AIO benchmarking, >>>> and IIRC it's due to how StartReadBuffersImpl() may behave differently >>>> depending on I/O started earlier. It only calls PinBufferForBlock() in >>>> some cases, and PinBufferForBlock() is what updates the hits. >>> >>> Hm, I don't immediately see an issue there. The only case we don't call >>> PinBufferForBlock() is if we already have pinned the relevant buffer in a >>> prior call to StartReadBuffersImpl(). >>> >>> >>> If this happens only with the prefetching patch applied, is is possible that >>> what happens here is that we occasionally re-request buffers that already in >>> the process of being read in? That would only happen with a read stream and >>> io_method != sync (since with sync we won't read ahead). If we have to start >>> reading in a buffer that's already undergoing IO we wait for the IO to >>> complete and count that access as a hit: >>> >>> /* >>> * Check if we can start IO on the first to-be-read buffer. >>> * >>> * If an I/O is already in progress in another backend, we want to wait >>> * for the outcome: either done, or something went wrong and we will >>> * retry. >>> */ >>> if (!ReadBuffersCanStartIO(buffers[nblocks_done], false)) >>> { >>> ... >>> /* >>> * Report and track this as a 'hit' for this backend, even though it >>> * must have started out as a miss in PinBufferForBlock(). The other >>> * backend will track this as a 'read'. >>> */ >>> ... >>> if (persistence == RELPERSISTENCE_TEMP) >>> pgBufferUsage.local_blks_hit += 1; >>> else >>> pgBufferUsage.shared_blks_hit += 1; >>> ... >>> >>> >> >> I think it has to be this. It only happens with io_method != sync, and >> only with effective_io_concurrency > 1. At first I was wondering why I >> can't reproduce this for seqscan/bitmapscan, but then I realized those >> plans never visit the same block repeatedly - indexscans do that. It's >> also not surprising it's timing-sensitive, as it likely depends on how >> fast the worker happens to start/complete requests. >> >> What would be a good way to "prove" it really is this? > > I'd just comment out those stats increments and then check if the stats are > stable afterwards. > I tried that, but it's not enough - the buffer hits gets lower, but remains variable. It stabilizes only if I comment out the increment in PinBufferForBlock() too. At which point it gets to 0, of course ... regards -- Tomas Vondra
Hi, On 2025-08-14 00:23:49 +0200, Tomas Vondra wrote: > On 8/13/25 23:37, Andres Freund wrote: > > On 2025-08-13 23:07:07 +0200, Tomas Vondra wrote: > >> On 8/13/25 16:44, Andres Freund wrote: > >>> On 2025-08-13 14:15:37 +0200, Tomas Vondra wrote: > >>>> In fact, I believe this is about io_method. I initially didn't see the > >>>> difference you described, and then I realized I set io_method=sync to > >>>> make it easier to track the block access. And if I change io_method to > >>>> worker, I get different stats, that also change between runs. > >>>> > >>>> With "sync" I always get this (after a restart): > >>>> > >>>> Buffers: shared hit=7435 read=52801 > >>>> > >>>> while with "worker" I get this: > >>>> > >>>> Buffers: shared hit=4879 read=52801 > >>>> Buffers: shared hit=5151 read=52801 > >>>> Buffers: shared hit=4978 read=52801 > >>>> > >>>> So not only it changes run to tun, it also does not add up to 60236. > >>> > >>> This is reproducible on master? If so, how? > >>> > >>> > >>>> I vaguely recall I ran into this some time ago during AIO benchmarking, > >>>> and IIRC it's due to how StartReadBuffersImpl() may behave differently > >>>> depending on I/O started earlier. It only calls PinBufferForBlock() in > >>>> some cases, and PinBufferForBlock() is what updates the hits. > >>> > >>> Hm, I don't immediately see an issue there. The only case we don't call > >>> PinBufferForBlock() is if we already have pinned the relevant buffer in a > >>> prior call to StartReadBuffersImpl(). > >>> > >>> > >>> If this happens only with the prefetching patch applied, is is possible that > >>> what happens here is that we occasionally re-request buffers that already in > >>> the process of being read in? That would only happen with a read stream and > >>> io_method != sync (since with sync we won't read ahead). If we have to start > >>> reading in a buffer that's already undergoing IO we wait for the IO to > >>> complete and count that access as a hit: > >>> > >>> /* > >>> * Check if we can start IO on the first to-be-read buffer. > >>> * > >>> * If an I/O is already in progress in another backend, we want to wait > >>> * for the outcome: either done, or something went wrong and we will > >>> * retry. > >>> */ > >>> if (!ReadBuffersCanStartIO(buffers[nblocks_done], false)) > >>> { > >>> ... > >>> /* > >>> * Report and track this as a 'hit' for this backend, even though it > >>> * must have started out as a miss in PinBufferForBlock(). The other > >>> * backend will track this as a 'read'. > >>> */ > >>> ... > >>> if (persistence == RELPERSISTENCE_TEMP) > >>> pgBufferUsage.local_blks_hit += 1; > >>> else > >>> pgBufferUsage.shared_blks_hit += 1; > >>> ... > >>> > >>> > >> > >> I think it has to be this. It only happens with io_method != sync, and > >> only with effective_io_concurrency > 1. At first I was wondering why I > >> can't reproduce this for seqscan/bitmapscan, but then I realized those > >> plans never visit the same block repeatedly - indexscans do that. It's > >> also not surprising it's timing-sensitive, as it likely depends on how > >> fast the worker happens to start/complete requests. > >> > >> What would be a good way to "prove" it really is this? > > > > I'd just comment out those stats increments and then check if the stats are > > stable afterwards. > > > > I tried that, but it's not enough - the buffer hits gets lower, but > remains variable. It stabilizes only if I comment out the increment in > PinBufferForBlock() too. At which point it gets to 0, of course ... 
Ah, right - that'll be the cases where the IO completed before we access the buffer a second time. There's no good way that I can see to make that deterministic - I mean, we could search all in-progress IOs for a matching block number before starting a new IO, and wait for that IO to complete if we find one. But that seems like an obviously bad idea. I think there's just some fundamental indeterminism here. I don't think we gain anything by hiding it... Greetings, Andres Freund
On 8/13/25 23:57, Peter Geoghegan wrote: > On Wed, Aug 13, 2025 at 5:19 PM Tomas Vondra <tomas@vondra.me> wrote: >> It's also not very surprising this happens with backwards scans more. >> The I/O is apparently much slower (due to missing OS prefetch), so we're >> much more likely to hit the I/O limits (max_ios and various other limits >> in read_stream_start_pending_read). > > But there's no OS prefetch with direct I/O. At most, there might be > some kind of readahead implemented in the SSD's firmware. > Good point, I keep forgetting direct I/O means no OS read-ahead. Not sure if there's a good way to determine if the SSD can do something like that (and how well). I wonder if there's a way to do backward sequential scans in fio .. > Even assuming that the SSD issue is relevant, I can't help but suspect > that something is off here. To recap from yesterday, the forwards scan > showed "I/O Timings: shared read=45.313" and "Execution Time: 330.379 > ms" on my system, while the equivalent backwards scan showed "I/O > Timings: shared read=194.774" and "Execution Time: 1236.655 ms". Does > that kind of disparity *really* make sense with a modern NVME SSD such > as this (I use a Samsung 980 pro), in the context of a scan that can > use aggressive prefetching? Are we really, truly operating at the > limits of what is possible with this hardware, for this backwards > scan? > Hard to say. Would be interesting to get some numbers using fio. I'll try to do that for my devices. The timings I see on my ryzen (which has a RAID0 with 4 samsung 990 pro), I see these stats: 1) Q1 ASC Buffers: shared hit=4545 read=52801 I/O Timings: shared read=127.700 Execution Time: 432.266 ms 2) Q1 DESC Buffers: shared hit=7406 read=52801 I/O Timings: shared read=306.676 Execution Time: 769.246 ms 3) Q2 ASC Buffers: shared hit=32605 read=52801 I/O Timings: shared read=127.610 Execution Time: 1047.333 ms 4) Q2 DESC Buffers: shared hit=36105 read=52801 I/O Timings: shared read=157.667 Execution Time: 1140.286 ms Those timings are much better (more stable) that the numbers I shared yesterday (that was from my laptop). All of this is with direct I/O and 12 workers. > What if I use a ramdisk for this? That'll be much faster, no matter > the scan order. Should I expect this step to make the effect with > duplicates being produced by read_stream_look_ahead to just go away, > regardless of the scan direction in use? > How's that different from just running with buffered I/O and not dropping the page cache? regards -- Tomas Vondra
Hi, On 2025-08-14 01:11:07 +0200, Tomas Vondra wrote: > On 8/13/25 23:57, Peter Geoghegan wrote: > > On Wed, Aug 13, 2025 at 5:19 PM Tomas Vondra <tomas@vondra.me> wrote: > >> It's also not very surprising this happens with backwards scans more. > >> The I/O is apparently much slower (due to missing OS prefetch), so we're > >> much more likely to hit the I/O limits (max_ios and various other limits > >> in read_stream_start_pending_read). > > > > But there's no OS prefetch with direct I/O. At most, there might be > > some kind of readahead implemented in the SSD's firmware. > > > > Good point, I keep forgetting direct I/O means no OS read-ahead. Not > sure if there's a good way to determine if the SSD can do something like > that (and how well). I wonder if there's a way to do backward sequential > scans in fio .. In theory, yes, in practice, not quite: https://github.com/axboe/fio/issues/1963 So right now it only works if you skip over some blocks. For that there rather significant performance differences on my SSDs. E.g. andres@awork3:~/src/fio$ fio --directory /srv/fio --size=$((1024*1024*1024)) --name test --bs=4k --rw read:8k --buffered0 2>&1|grep READ READ: bw=179MiB/s (188MB/s), 179MiB/s-179MiB/s (188MB/s-188MB/s), io=341MiB (358MB), run=1907-1907msec andres@awork3:~/src/fio$ fio --directory /srv/fio --size=$((1024*1024*1024)) --name test --bs=4k --rw read:-8k --buffered0 2>&1|grep READ READ: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s), io=1024MiB (1074MB), run=14513-14513msec So on this WD Red SN700 there's a rather substantial performance difference. On a Samsung 970 PRO I don't see much of a difference. Nor on a ADATA SX8200PNP. Greetings, Andres Freund
On 8/13/25 23:36, Peter Geoghegan wrote: > On Wed, Aug 13, 2025 at 1:01 PM Tomas Vondra <tomas@vondra.me> wrote: >> This seems rather bizarre, considering the two tables are exactly the >> same, except that in t2 the first column is negative, and the rows are >> fixed-length. Even heap_page_items says the tables are exactly the same. >> >> So why would the index get so different like this? > > In the past, when I required *perfectly* deterministic results for > INSERT INTO test_table ... SELECT * FROM source_table bulk inserts > (which was important during the Postgres 12 and 13 nbtree work), I > found it necessary to "set synchronize_seqscans=off". If I was writing > a test such as this, I'd probably do that defensively, even if it > wasn't clear that it mattered. (I'm also in the habit of using > unlogged tables, because VACUUM tends to set their pages all-visible > more reliably than equivalent logged tables, which I notice that > you're also doing here.) > The tables are *exactly* the same, block by block. I double checked that by looking at a couple pages, and the only difference is the inverted value of the "a" column. > That said, I *think* that the "locally shuffled" heap TID pattern that > we see with "t2"/"idx2" is mostly (perhaps entirely) caused by the way > that you're inverting the indexed column's value when initially > generating "t2". A given range of values such as "1 through to 4" > becomes "-4 through to -1" as their tuples are inserted into t2. Right. > You're effectively inverting the order of the bigint indexed column > "a" -- but you're *not* inverting the order of the imaginary > tie-breaker heap column (it *remains* in ASC heap TID order in "t2"). > I have no idea what I'm supposed to do about that. As you say the tie-breaker is imaginary, selected by the system on my behalf. If it works like this, doesn't that mean it'll have this unfortunate effect on all data sets with negative correlation? > In general, when doing this sort of analysis, I find it useful to > manually verify that the data that I generated matches my > expectations. Usually a quick check with pageinspect is enough. I'll > just randomly select 2 - 3 leaf pages, and make sure that they all > more or less match my expectations. > I did that for the heap, and that's just as I expected. But the effect on the index surprised me. regards -- Tomas Vondra
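A quick way to see that tie-breaker effect directly on the heap side (a hypothetical check; the value range is arbitrary, mirroring the ctid listing shown for "t" elsewhere in the thread):

SELECT ctid, a
  FROM t2
 WHERE a BETWEEN -20002 AND -20000
 ORDER BY a;
-- within each distinct value of "a" the ctids still ascend (the imaginary tie-breaker
-- stays in ASC heap TID order), but each successive value of "a" jumps back to
-- lower-numbered heap blocks, which is what shows up as the "locally shuffled"
-- heap TIDs in idx2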
On Wed Aug 13, 2025 at 5:19 PM EDT, Tomas Vondra wrote: > I did investigate this, and I don't think there's anything broken in > read_stream. It happens because ReadStream has a concept of "ungetting" > a block, which can happen after hitting some I/O limits. > > In that case we "remember" the last block (in read_stream_look_ahead > calls read_stream_unget_block), and we return it again. It may seem as > if read_stream_get_block() produced the same block twice, but it's > really just the block from the last round. I instrumented this for myself, and I agree: backwards and forwards scan cases are being fed the same block numbers, as expected (it's just that the order is precisely backwards, as expected). The only real difference is that the forwards scan case seems to be passed InvalidBlockNumber quite a bit more often. You were right: I was confused about the read_stream_unget_block thing. However, the magnitude of the difference that I see between the forwards and backwards scan cases just doesn't pass the smell test -- I stand by that part. I was able to confirm this intuition by performing a simple experiment. I asked myself a fairly obvious question: if the backwards scan in question takes about 2.5x as long, just because each group of TIDs for each index value appears in descending order, then what happens if the order is made random? Where does that leave the forwards scan case, and where does it leave the backwards scan case? I first made the order of the table random, except among groups of index tuples that have exactly the same value. Those will still point to the same 1 or 2 heap blocks in virtually all cases, so we have "heap clustering without any heap correlation" in the newly rewritten table. To set things up this way, I first made another index, and then clustered the table using that new index: pg@regression:5432 [2476413]=# create index on t (hashint8(a)); CREATE INDEX pg@regression:5432 [2476413]=# cluster t using t_hashint8_idx ; CLUSTER Next, I reran the queries in the obvious way (same procedure as yesterday, though with a very different result): pg@regression:5432 [2476413]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx'); ***SNIP*** pg@regression:5432 [2476413]=# EXPLAIN (ANALYZE ,costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDERBY a; ┌────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├────────────────────────────────────────────────────────────┤ │ Index Scan using idx on t (actual rows=1048576.00 loops=1) │ │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ │ Index Searches: 1 │ │ Buffers: shared hit=6082 read=77813 │ │ I/O Timings: shared read=153.672 │ │ Planning Time: 0.057 ms │ │ Execution Time: 402.735 ms │ └────────────────────────────────────────────────────────────┘ (7 rows) pg@regression:5432 [2476413]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx'); ***SNIP*** pg@regression:5432 [2476413]=# EXPLAIN (ANALYZE ,costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDERBY a desc; ┌─────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├─────────────────────────────────────────────────────────────────────┤ │ Index Scan Backward using idx on t (actual rows=1048576.00 loops=1) │ │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ │ Index Searches: 1 │ │ Buffers: shared hit=6082 read=77813 │ │ I/O Timings: shared read=324.305 │ │ Planning Time: 0.071 ms │ │ Execution Time: 616.268 ms │ 
└─────────────────────────────────────────────────────────────────────┘ (7 rows) Apparently random I/O is twice as fast as sequential I/O in descending order! In fact, this test case creates the appearance of random I/O being at least slightly faster than sequential I/O for pages read in _ascending_ order! Obviously something doesn't add up here. I'm no closer to explaining what the underlying problem is than I was yesterday, but I find it _very_ hard to believe that the inconsistency in performance has anything to do with SSD firmware/OS implementation details. It just looks wonky to me. Also possibly worth noting: I'm pretty sure that "shared hit=6082" is wrong. Though now it's wrong in the same way with both variants. Just for context, I'll show what the TIDs for 3 randomly chosen adjacent-in-index values look like after CLUSTER runs (in case it was unclear what I meant about "heap clustering without any heap correlation" earlier): pg@regression:5432 [2476413]=# SELECT ctid, a FROM t WHERE a BETWEEN 20_000 AND 20_002 ORDER BY a; ┌─────────────┬────────┐ │ ctid │ a │ ├─────────────┼────────┤ │ (142534,3) │ 20,000 │ │ (142534,4) │ 20,000 │ │ (142534,5) │ 20,000 │ │ (142534,6) │ 20,000 │ │ (142534,7) │ 20,000 │ │ (142534,8) │ 20,000 │ │ (142534,9) │ 20,000 │ │ (142534,10) │ 20,000 │ │ (142534,11) │ 20,000 │ │ (142534,12) │ 20,000 │ │ (142534,13) │ 20,000 │ │ (142534,14) │ 20,000 │ │ (142534,15) │ 20,000 │ │ (142534,16) │ 20,000 │ │ (142534,17) │ 20,000 │ │ (142534,18) │ 20,000 │ │ (142534,19) │ 20,000 │ │ (142534,20) │ 20,000 │ │ (142534,21) │ 20,000 │ │ (142535,1) │ 20,000 │ │ (142535,2) │ 20,000 │ │ (142535,3) │ 20,000 │ │ (142535,4) │ 20,000 │ │ (142535,5) │ 20,000 │ │ (142535,6) │ 20,000 │ │ (142535,7) │ 20,000 │ │ (142535,8) │ 20,000 │ │ (142535,9) │ 20,000 │ │ (142535,10) │ 20,000 │ │ (142535,11) │ 20,000 │ │ (142535,12) │ 20,000 │ │ (142535,13) │ 20,000 │ │ (216406,19) │ 20,001 │ │ (216406,20) │ 20,001 │ │ (216406,21) │ 20,001 │ │ (216407,1) │ 20,001 │ │ (216407,2) │ 20,001 │ │ (216407,3) │ 20,001 │ │ (216407,4) │ 20,001 │ │ (216407,5) │ 20,001 │ │ (216407,6) │ 20,001 │ │ (216407,7) │ 20,001 │ │ (216407,8) │ 20,001 │ │ (216407,9) │ 20,001 │ │ (216407,10) │ 20,001 │ │ (216407,11) │ 20,001 │ │ (216407,12) │ 20,001 │ │ (216407,13) │ 20,001 │ │ (216407,14) │ 20,001 │ │ (216407,15) │ 20,001 │ │ (216407,16) │ 20,001 │ │ (216407,17) │ 20,001 │ │ (216407,18) │ 20,001 │ │ (216407,19) │ 20,001 │ │ (216407,20) │ 20,001 │ │ (216407,21) │ 20,001 │ │ (216408,1) │ 20,001 │ │ (216408,2) │ 20,001 │ │ (216408,3) │ 20,001 │ │ (216408,4) │ 20,001 │ │ (216408,5) │ 20,001 │ │ (216408,6) │ 20,001 │ │ (216408,7) │ 20,001 │ │ (216408,8) │ 20,001 │ │ (260993,12) │ 20,002 │ │ (260993,13) │ 20,002 │ │ (260993,14) │ 20,002 │ │ (260993,15) │ 20,002 │ │ (260993,16) │ 20,002 │ │ (260993,17) │ 20,002 │ │ (260993,18) │ 20,002 │ │ (260993,19) │ 20,002 │ │ (260993,20) │ 20,002 │ │ (260993,21) │ 20,002 │ │ (260994,1) │ 20,002 │ │ (260994,2) │ 20,002 │ │ (260994,3) │ 20,002 │ │ (260994,4) │ 20,002 │ │ (260994,5) │ 20,002 │ │ (260994,6) │ 20,002 │ │ (260994,7) │ 20,002 │ │ (260994,8) │ 20,002 │ │ (260994,9) │ 20,002 │ │ (260994,10) │ 20,002 │ │ (260994,11) │ 20,002 │ │ (260994,12) │ 20,002 │ │ (260994,13) │ 20,002 │ │ (260994,14) │ 20,002 │ │ (260994,15) │ 20,002 │ │ (260994,16) │ 20,002 │ │ (260994,17) │ 20,002 │ │ (260994,18) │ 20,002 │ │ (260994,19) │ 20,002 │ │ (260994,20) │ 20,002 │ │ (260994,21) │ 20,002 │ │ (260995,1) │ 20,002 │ └─────────────┴────────┘ (96 rows) -- Peter Geoghegan
On Wed, Aug 13, 2025 at 7:51 PM Peter Geoghegan <pg@bowt.ie> wrote: > Apparently random I/O is twice as fast as sequential I/O in descending order! In > fact, this test case creates the appearance of random I/O being at least > slightly faster than sequential I/O for pages read in _ascending_ order! > > Obviously something doesn't add up here. Minor clarification: If EXPLAIN ANALYZE is to be believed, "I/O Timings" is in fact higher with the randomized "t" table variant of the test case, compared to what I showed yesterday with the original sequential "t" version of the table, exactly as expected. (When I said "Apparently random I/O is twice as fast as sequential I/O in descending order!", I was just joking, of course.) It seems reasonable to suppose that the actual problem has something to do with synchronization overhead of some kind or other. Or, perhaps it's due to some kind of major inefficiency in the patch -- perhaps the patch can sometimes waste many CPU cycles on who-knows-what, at least in cases like the original/slow backwards scan case. -- Peter Geoghegan
On Thu, Aug 14, 2025 at 9:19 AM Tomas Vondra <tomas@vondra.me> wrote: > I did investigate this, and I don't think there's anything broken in > read_stream. It happens because ReadStream has a concept of "ungetting" > a block, which can happen after hitting some I/O limits. > > In that case we "remember" the last block (in read_stream_look_ahead > calls read_stream_unget_block), and we return it again. It may seem as > if read_stream_get_block() produced the same block twice, but it's > really just the block from the last round. Yeah, it's a bit of a tight corner in the algorithm, and I haven't found any better solution. It arises from this circularity:
* we need a block number from the callback before we can decide whether it can be combined with the pending read
* if we can't combine it, we need to start the pending read to get it out of the way, so we can start a new one
* we entered this path knowing that we are allowed to start one more IO, but if doing so reports a split then we've only made the pending read smaller, i.e. the tail portion remains, so we still can't combine with it, so the only way to make progress is to loop and start another IO, and so on
* while doing that we might hit the limits on pinned buffers (only for tiny buffer pools) or (more likely) running IOs, and then what are you going to do with that block number?
On 8/14/25 01:50, Peter Geoghegan wrote: > On Wed Aug 13, 2025 at 5:19 PM EDT, Tomas Vondra wrote: >> I did investigate this, and I don't think there's anything broken in >> read_stream. It happens because ReadStream has a concept of "ungetting" >> a block, which can happen after hitting some I/O limits. >> >> In that case we "remember" the last block (in read_stream_look_ahead >> calls read_stream_unget_block), and we return it again. It may seem as >> if read_stream_get_block() produced the same block twice, but it's >> really just the block from the last round. > > I instrumented this for myself, and I agree: backwards and forwards scan cases > are being fed the same block numbers, as expected (it's just that the order is > precisely backwards, as expected). The only real difference is that the forwards > scan case seems to be passed InvalidBlockNumber quite a bit more often. You were > right: I was confused about the read_stream_unget_block thing. > > However, the magnitude of the difference that I see between the forwards and > backwards scan cases just doesn't pass the smell test -- I stand by that part. > I was able to confirm this intuition by performing a simple experiment. > > I asked myself a fairly obvious question: if the backwards scan in question > takes about 2.5x as long, just because each group of TIDs for each index value > appears in descending order, then what happens if the order is made random? > Where does that leave the forwards scan case, and where does it leave the > backwards scan case? > > I first made the order of the table random, except among groups of index tuples > that have exactly the same value. Those will still point to the same 1 or 2 heap > blocks in virtually all cases, so we have "heap clustering without any heap > correlation" in the newly rewritten table. 
To set things up this way, I first > made another index, and then clustered the table using that new index: > > pg@regression:5432 [2476413]=# create index on t (hashint8(a)); > CREATE INDEX > pg@regression:5432 [2476413]=# cluster t using t_hashint8_idx ; > CLUSTER > > Next, I reran the queries in the obvious way (same procedure as yesterday, > though with a very different result): > > pg@regression:5432 [2476413]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx'); > ***SNIP*** > pg@regression:5432 [2476413]=# EXPLAIN (ANALYZE ,costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103ORDER BY a; > ┌────────────────────────────────────────────────────────────┐ > │ QUERY PLAN │ > ├────────────────────────────────────────────────────────────┤ > │ Index Scan using idx on t (actual rows=1048576.00 loops=1) │ > │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ > │ Index Searches: 1 │ > │ Buffers: shared hit=6082 read=77813 │ > │ I/O Timings: shared read=153.672 │ > │ Planning Time: 0.057 ms │ > │ Execution Time: 402.735 ms │ > └────────────────────────────────────────────────────────────┘ > (7 rows) > > pg@regression:5432 [2476413]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx'); > ***SNIP*** > pg@regression:5432 [2476413]=# EXPLAIN (ANALYZE ,costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103ORDER BY a desc; > ┌─────────────────────────────────────────────────────────────────────┐ > │ QUERY PLAN │ > ├─────────────────────────────────────────────────────────────────────┤ > │ Index Scan Backward using idx on t (actual rows=1048576.00 loops=1) │ > │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ > │ Index Searches: 1 │ > │ Buffers: shared hit=6082 read=77813 │ > │ I/O Timings: shared read=324.305 │ > │ Planning Time: 0.071 ms │ > │ Execution Time: 616.268 ms │ > └─────────────────────────────────────────────────────────────────────┘ > (7 rows) > > Apparently random I/O is twice as fast as sequential I/O in descending order! In > fact, this test case creates the appearance of random I/O being at least > slightly faster than sequential I/O for pages read in _ascending_ order! > > Obviously something doesn't add up here. I'm no closer to explaining what the > underlying problem is than I was yesterday, but I find it _very_ hard to believe > that the inconsistency in performance has anything to do with SSD firmware/OS > implementation details. It just looks wonky to me. > > Also possibly worth noting: I'm pretty sure that "shared hit=6082" is wrong. > Though now it's wrong in the same way with both variants. > > Just for context, I'll show what the TIDs for 3 randomly chosen > adjacent-in-index values look like after CLUSTER runs (in case it was unclear > what I meant about "heap clustering without any heap correlation" earlier): > Interesting. It's really surprising random I/O beats the sequential. I investigated this from a different angle, by tracing the I/O request generated. using perf-trace. And the patterns are massively different. What I did is roughly this: 1) restart the instance (with direct I/O) 2) perf trace record -m 128M -a -o $(date +%s).trace 3) run the query, pgrep 'io worker' 4) stop the trace 5) extract pread64 events for the I/O workers from the trace I get these event counts: Q1 ASC - 5395 Q1 DESC - 49969 Q2 ASC - 32804 Q2 DESC - 49958 It's interesting the DESC queries get to do almost exactly the same number of pread calls. 
Anyway, small samples of the trace look like this: Q1 ASC pread64(fd: 7, buf: 0x7f6011b7f000, count: 81920, pos: 475193344) pread64(fd: 24, buf: 0x7f6011b95000, count: 131072, pos: 475275264) pread64(fd: 7, buf: 0x7f6011bb7000, count: 131072, pos: 475406336) pread64(fd: 24, buf: 0x7f6011bd9000, count: 131072, pos: 475537408) pread64(fd: 7, buf: 0x7f6011bfb000, count: 81920, pos: 475668480) pread64(fd: 24, buf: 0x7f6011c0f000, count: 24576, pos: 475750400) pread64(fd: 24, buf: 0x7f6011c15000, count: 24576, pos: 475774976) pread64(fd: 24, buf: 0x7f6011c1d000, count: 131072, pos: 475799552) pread64(fd: 7, buf: 0x7f6011c3f000, count: 106496, pos: 475930624) pread64(fd: 24, buf: 0x7f6011c59000, count: 24576, pos: 476037120) pread64(fd: 24, buf: 0x7f6011c61000, count: 131072, pos: 476061696) pread64(fd: 7, buf: 0x7f6011c83000, count: 131072, pos: 476192768) pread64(fd: 24, buf: 0x7f6011ca3000, count: 24576, pos: 476323840) pread64(fd: 24, buf: 0x7f6011ca9000, count: 24576, pos: 476348416) pread64(fd: 24, buf: 0x7f6011cb1000, count: 131072, pos: 476372992) pread64(fd: 7, buf: 0x7f6011cd1000, count: 57344, pos: 476504064) Q1 DESC pread64(fd: 24, buf: 0x7fa8c1735000, count: 8192, pos: 230883328) pread64(fd: 7, buf: 0x7fa8c1737000, count: 8192, pos: 230875136) pread64(fd: 6, buf: 0x7fa8c173b000, count: 8192, pos: 230866944) pread64(fd: 24, buf: 0x7fa8c173d000, count: 8192, pos: 230858752) pread64(fd: 7, buf: 0x7fa8c173f000, count: 8192, pos: 230850560) pread64(fd: 6, buf: 0x7fa8c1741000, count: 8192, pos: 230842368) pread64(fd: 24, buf: 0x7fa8c1743000, count: 8192, pos: 230834176) pread64(fd: 7, buf: 0x7fa8c1745000, count: 8192, pos: 230825984) pread64(fd: 24, buf: 0x7fa8c1747000, count: 8192, pos: 230817792) pread64(fd: 6, buf: 0x7fa8c1749000, count: 8192, pos: 230809600) pread64(fd: 7, buf: 0x7fa8c174b000, count: 8192, pos: 230801408) pread64(fd: 24, buf: 0x7fa8c174d000, count: 8192, pos: 230793216) pread64(fd: 6, buf: 0x7fa8c174f000, count: 8192, pos: 230785024) pread64(fd: 7, buf: 0x7fa8c1751000, count: 8192, pos: 230776832) pread64(fd: 24, buf: 0x7fa8c1753000, count: 8192, pos: 230768640) pread64(fd: 7, buf: 0x7fa8c1755000, count: 8192, pos: 230760448) pread64(fd: 6, buf: 0x7fa8c1757000, count: 8192, pos: 230752256) Q2 ASC pread64(fd: 7, buf: 0x7fb8bbf27000, count: 8192, pos: 258695168) pread64(fd: 24, buf: 0x7fb8bbf29000, count: 16384, pos: 258678784) pread64(fd: 7, buf: 0x7fb8bbf2d000, count: 8192, pos: 258670592) pread64(fd: 24, buf: 0x7fb8bbf2f000, count: 16384, pos: 258654208) pread64(fd: 7, buf: 0x7fb8bbf33000, count: 8192, pos: 258646016) pread64(fd: 24, buf: 0x7fb8bbf35000, count: 16384, pos: 258629632) pread64(fd: 7, buf: 0x7fb8bbf39000, count: 8192, pos: 258621440) pread64(fd: 24, buf: 0x7fb8bbf3d000, count: 16384, pos: 258605056) pread64(fd: 7, buf: 0x7fb8bbf41000, count: 8192, pos: 258596864) pread64(fd: 24, buf: 0x7fb8bbf43000, count: 16384, pos: 258580480) pread64(fd: 7, buf: 0x7fb8bbf47000, count: 8192, pos: 258572288) pread64(fd: 24, buf: 0x7fb8bbf49000, count: 16384, pos: 258555904) pread64(fd: 7, buf: 0x7fb8bbf4d000, count: 8192, pos: 258547712) pread64(fd: 24, buf: 0x7fb8bbf4f000, count: 16384, pos: 258531328) pread64(fd: 7, buf: 0x7fb8bbf53000, count: 16384, pos: 258514944) pread64(fd: 24, buf: 0x7fb8bbf57000, count: 8192, pos: 258506752) pread64(fd: 7, buf: 0x7fb8bbf59000, count: 8192, pos: 258498560) pread64(fd: 24, buf: 0x7fb8bbf5b000, count: 16384, pos: 258482176) Q2 DESC pread64(fd: 24, buf: 0x7fdcf0451000, count: 8192, pos: 598974464) pread64(fd: 7, buf: 
0x7fdcf0453000, count: 8192, pos: 598999040) pread64(fd: 6, buf: 0x7fdcf0455000, count: 8192, pos: 598990848) pread64(fd: 24, buf: 0x7fdcf0459000, count: 8192, pos: 599007232) pread64(fd: 7, buf: 0x7fdcf045b000, count: 8192, pos: 599023616) pread64(fd: 6, buf: 0x7fdcf045d000, count: 8192, pos: 599015424) pread64(fd: 24, buf: 0x7fdcf045f000, count: 8192, pos: 599031808) pread64(fd: 7, buf: 0x7fdcf0461000, count: 8192, pos: 599048192) pread64(fd: 6, buf: 0x7fdcf0463000, count: 8192, pos: 599040000) pread64(fd: 24, buf: 0x7fdcf0465000, count: 8192, pos: 599056384) pread64(fd: 7, buf: 0x7fdcf0467000, count: 8192, pos: 599072768) pread64(fd: 6, buf: 0x7fdcf0469000, count: 8192, pos: 599064576) pread64(fd: 24, buf: 0x7fdcf046b000, count: 8192, pos: 599080960) pread64(fd: 7, buf: 0x7fdcf046d000, count: 8192, pos: 599097344) pread64(fd: 6, buf: 0x7fdcf046f000, count: 8192, pos: 599089152) pread64(fd: 24, buf: 0x7fdcf0471000, count: 8192, pos: 599105536) pread64(fd: 7, buf: 0x7fdcf0473000, count: 8192, pos: 599121920) pread64(fd: 6, buf: 0x7fdcf0475000, count: 8192, pos: 599113728) So, Q1 ASC gets to combine the I/O into nice large chunks. But the DESC queries end up doing a stream of 8K requests. The Q2 ASC gets to do 16KB reads in about half the cases, but the rest is still 8KB. FWIW I believe this is what Thomas Munro meant by [1]: You'll probably see a flood of uncombined 8KB IOs in the pg_aios view while travelling up the heap with cache misses today. It wasn't quite this obvious in pg_aios, though. I've usually seen only a single event there, so hard to make conclusion. The trace makes it pretty obvious, though. We don't combine the I/O, and we also know Linux in fact does not do any readahead for backwards scans. regards [1] https://www.postgresql.org/message-id/CA%2BhUKGKMaZLmNQHaa_DZMw9MJJKGegjrqnTY3KOZB-_nvFa3wQ%40mail.gmail.com -- Tomas Vondra
On Wed, Aug 13, 2025 at 8:59 PM Tomas Vondra <tomas@vondra.me> wrote: > I investigated this from a different angle, by tracing the I/O request > generated. using perf-trace. And the patterns are massively different. I tried a similar approach myself, using a variety of tools. That didn't get me very far. > So, Q1 ASC gets to combine the I/O into nice large chunks. But the DESC > queries end up doing a stream of 8K requests. The Q2 ASC gets to do 16KB > reads in about half the cases, but the rest is still 8KB. My randomized version of the forwards scan is about as fast (maybe even slightly faster) than your original version on my workstation, in spite of the fact that EXPLAIN ANALYZE reports that the randomized version does indeed have about a 3x higher "I/O Timings: shared read". So I tend to doubt that low-level instrumentation will be all that helpful with debugging the issue. I suppose that it *might* be helpful if you can use it to spot some kind of pattern -- a pattern that hints at the real underlying issue. To me the issue feels like a priority inversion problem. Maybe slow-ish I/O can lead to very very slow query execution time, due to some kind of second order effect (possibly an issue on the read stream side). If that's what this is then the problem still won't be that there was slow-ish I/O, or that we couldn't successfully combine I/Os in whatever way. After all, we surely won't be able to combine I/Os with the randomized version of the queries that I described to the list this evening -- and yet those are still very fast in terms of overall execution time (somehow, they are about as fast as the original variant, that will manage to combine I/Os, in spite of the obvious disadvantage of requiring random I/O for the heap accesses). -- Peter Geoghegan
On Wed Aug 13, 2025 at 7:50 PM EDT, Peter Geoghegan wrote: > pg@regression:5432 [2476413]=# EXPLAIN (ANALYZE ,costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103ORDER BY a desc; > ┌─────────────────────────────────────────────────────────────────────┐ > │ QUERY PLAN │ > ├─────────────────────────────────────────────────────────────────────┤ > │ Index Scan Backward using idx on t (actual rows=1048576.00 loops=1) │ > │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ > │ Index Searches: 1 │ > │ Buffers: shared hit=6082 read=77813 │ > │ I/O Timings: shared read=324.305 │ > │ Planning Time: 0.071 ms │ > │ Execution Time: 616.268 ms │ > └─────────────────────────────────────────────────────────────────────┘ > (7 rows) > Also possibly worth noting: I'm pretty sure that "shared hit=6082" is wrong. > Though now it's wrong in the same way with both variants. Actually, "Buffers:" output _didn't_ have the same problem with the randomized test case variants. With master + buffered I/O, with the FS cache dropped, and with the index relation prewarmed, the same query shows the same "Buffers" details that the patch showed earlier: ┌─────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├─────────────────────────────────────────────────────────────────────┤ │ Index Scan Backward using idx on t (actual rows=1048576.00 loops=1) │ │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ │ Index Searches: 1 │ │ Buffers: shared hit=6085 read=77813 │ │ I/O Timings: shared read=10572.441 │ │ Planning: │ │ Buffers: shared hit=90 read=23 │ │ I/O Timings: shared read=1.212 │ │ Planning Time: 1.505 ms │ │ Execution Time: 10711.853 ms │ └─────────────────────────────────────────────────────────────────────┘ (10 rows) Though it's not particular relevant to the problem at hand, I'll also point out that with a scan of an index such as this (an index that exhibits "heap clustering without heap correlation"), prefetching is particularly important. Here we see a ~17.3x speedup (relative to master + buffered I/O). Nice! -- Peter Geoghegan
On Wed Aug 13, 2025 at 8:59 PM EDT, Tomas Vondra wrote: > On 8/14/25 01:50, Peter Geoghegan wrote: >> I first made the order of the table random, except among groups of index tuples >> that have exactly the same value. Those will still point to the same 1 or 2 heap >> blocks in virtually all cases, so we have "heap clustering without any heap >> correlation" in the newly rewritten table. To set things up this way, I first >> made another index, and then clustered the table using that new index: > Interesting. It's really surprising random I/O beats the sequential. It should be noted that the effect seems to be limited to io_method=io_uring. I find that with io_method=worker, the execution time of the original "sequential heap access" backwards scan is very similar to the execution time of the variant with the index that exhibits "heap clustering without any heap correlation" (the variant where individual heap blocks appear in random order). Benchmark that includes both io_uring and worker ================================================ I performed the usual procedure of prewarming the index and evicting the heap relation, and then actually running the relevant query through EXPLAIN ANALYZE. Direct I/O was used throughout. io_method=worker ---------------- Original backwards scan: 1498.024 ms (shared read=48.080) "No heap correlation" backwards scan: 1483.348 ms (shared read=22.036) Original forwards scan: 656.884 ms (shared read=19.904) "No heap correlation" forwards scan: 578.076 ms (shared read=10.159) io_method=io_uring ------------------ Original backwards scan: 1052.807 ms (shared read=187.876) "No heap correlation" backwards scan: 649.473 ms (shared read=365.802) Original forwards scan: 593.126 ms (shared read=55.837) "No heap correlation" forwards scan: 429.888 ms (shared read=188.619) Summary ------- As of this morning, io_method=io_uring also shows that the forwards scan is faster with random heap accesses than without (not just the backwards scan). I double-checked, to make sure that the effect was real; it seems to be. I'm aware that some of these numbers (those for the original/sequential forward scan case) don't match what I reported on Tuesday. I believe that this is due to changes I made to my SSD's readahead using blockdev, though it's possible that there's some other explanation. (In case it matters, I'm running Debian unstable with liburing2 "2.9-1".) The important point remains: at least with io_uring, the backwards scan query is much faster with random I/O than it is with descending sequential I/O. It might make sense if they were at least at parity, but clearly they're not. -- Peter Geoghegan
On Thu, Aug 14, 2025 at 12:56 PM Peter Geoghegan <pg@bowt.ie> wrote: > I performed the usual procedure of prewarming the index and evicting the heap > relation, and then actually running the relevant query through EXPLAIN > ANALYZE. Direct I/O was used throughout. > io_method=io_uring > ------------------ > > Original backwards scan: 1052.807 ms (shared read=187.876) > "No heap correlation" backwards scan: 649.473 ms (shared read=365.802) Attached is a differential flame graph that compares the execution of these 2 queries in terms of the default perf event (which is "cycles", per the generic recipe for making one of these put out by Brendan Gregg). The actual query runtime for each query was very similar to what I report here -- the backwards scan is a little under twice as fast. The only interesting thing about the flame graph is just how little difference there seems to be (at least for this particular perf event type). The only thing that stands out even a little bit is the 8.33% extra time spent in pg_checksum_page for the "No heap correlation"/random query. But that's entirely to be expected: we're reading 49933 pages with the sequential backwards scan query, whereas the random one must read 77813 pages. -- Peter Geoghegan
On Thu Aug 14, 2025 at 1:57 PM EDT, Peter Geoghegan wrote: > The only interesting thing about the flame graph is just how little > difference there seems to be (at least for this particular perf event > type). I captured method_io_uring.c DEBUG output from running each query in the server log, in the hope that it would shed some light on what's really going on here. I think that it just might. I count a total of 12,401 distinct sleeps for the sequential/slow backwards scan test case: $ grep -E "wait_one with [1-9][0-9]* sleeps" sequential.txt | head 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps 2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps 2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps 2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps 2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps $ grep -E "wait_one with [1-9][0-9]* sleeps" sequential.txt | awk '{ total += $11 } END { print total }' 12401 But there are only 3 such sleeps seen when the random backwards scan query is run -- which might begin to explain the mystery of why it runs so much faster: $ grep -E "wait_one with [1-9][0-9]* sleeps" random.txt | awk '{ total += $11 } END { print total }' 104 -- Peter Geoghegan
Hi, On 2025-08-14 14:44:44 -0400, Peter Geoghegan wrote: > On Thu Aug 14, 2025 at 1:57 PM EDT, Peter Geoghegan wrote: > > The only interesting thing about the flame graph is just how little > > difference there seems to be (at least for this particular perf event > > type). > > I captured method_io_uring.c DEBUG output from running each query in the > server log, in the hope that it would shed some light on what's really going > on here. I think that it just might. > > I count a total of 12,401 distinct sleeps for the sequential/slow backwards > scan test case: > > $ grep -E "wait_one with [1-9][0-9]* sleeps" sequential.txt | head > 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps > 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps > 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps > 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps > 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps > 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps > 2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps > 2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps > 2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps > 2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: wait_one with 1 sleeps > $ grep -E "wait_one with [1-9][0-9]* sleeps" sequential.txt | awk '{ total += $11 } END { print total }' > 12401 > > But there are only 3 such sleeps seen when the random backwards scan query is > run -- which might begin to explain the mystery of why it runs so much faster: > > $ grep -E "wait_one with [1-9][0-9]* sleeps" random.txt | awk '{ total += $11 } END { print total }' > 104 I think this is just an indicator of being IO bound. That message is output whenever we have to wait for IO to finish. So if one workload prints that a 12k times and another 104 times, that's because the latter didn't have to wait for IO to complete, because it already had completed by the time we needed the IO to have finished to continue. Factors potentially leading to slower IO: - sometimes random IO *can* be faster for SSDs, because it allows different flash chips to work concurrently, rather than being bound by the speed of one one flash chip - it's possible that with your SSD the sequential IO leads to more IO combining. Larger IOs always have a higher latency than smaller IOs - but obviously fewer IOs are needed. The increased latency may be bad enough for your access pattern to trigger more waits. It's *not* necessarily enough to just lower io_combine_limit, the OS also can do combining. I'd see what changes if you temporarily reduce /sys/block/nvme6n1/queue/max_sectors_kb to a smaller size. Could you show iostat for both cases? Greetings, Andres Freund
On Thu, Aug 14, 2025 at 2:53 PM Andres Freund <andres@anarazel.de> wrote: > I think this is just an indicator of being IO bound. Then why does the exact same pair of runs show "I/O Timings: shared read=194.629" for the sequential table backwards scan (with total execution time 1132.360 ms), versus "I/O Timings: shared read=352.88" (with total execution time 697.681 ms) for the random table backwards scan? Obviously it is hard to believe that the query with shared read=194.629 is one that is naturally much more I/O bound than another similar query that shows shared read=352.88. What "I/O Timings" shows more or less makes sense to me already -- it just doesn't begin to explain why *overall query execution* is much slower when scanning backwards sequentially. > I'd see what changes if you temporarily reduce > /sys/block/nvme6n1/queue/max_sectors_kb to a smaller size. I reduced max_sectors_kb from 128 to 8. That had no significant effect. > Could you show iostat for both cases? iostat has lots of options. Can you be more specific? -- Peter Geoghegan
On Thu Aug 14, 2025 at 3:15 PM EDT, Peter Geoghegan wrote: > On Thu, Aug 14, 2025 at 2:53 PM Andres Freund <andres@anarazel.de> wrote: >> I think this is just an indicator of being IO bound. > > Then why does the exact same pair of runs show "I/O Timings: shared > read=194.629" for the sequential table backwards scan (with total > execution time 1132.360 ms), versus "I/O Timings: shared read=352.88" > (with total execution time 697.681 ms) for the random table backwards > scan? Is there any particular significance to the invalid op reports I also see in the same log files? $ cat sequential.txt | grep invalid | head 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: io 0 |op invalid|targetinvalid|state IDLE : wait_one io_gen: 2, ref_gen: 1, cycle 1 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: io 0 |op invalid|targetinvalid|state IDLE : wait_one io_gen: 3, ref_gen: 2, cycle 1 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: io 0 |op invalid|targetinvalid|state IDLE : wait_one io_gen: 4, ref_gen: 3, cycle 1 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: io 0 |op invalid|targetinvalid|state IDLE : wait_one io_gen: 5, ref_gen: 4, cycle 1 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: io 0 |op invalid|targetinvalid|state IDLE : wait_one io_gen: 6, ref_gen: 5, cycle 1 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: io 0 |op invalid|targetinvalid|state IDLE : wait_one io_gen: 7, ref_gen: 6, cycle 1 2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: io 0 |op invalid|targetinvalid|state IDLE : wait_one io_gen: 8, ref_gen: 7, cycle 1 2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: io 0 |op invalid|targetinvalid|state IDLE : wait_one io_gen: 9, ref_gen: 8, cycle 1 2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: io 0 |op invalid|targetinvalid|state IDLE : wait_one io_gen: 10, ref_gen: 9, cycle 1 2025-08-14 14:35:03.279 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: io 0 |op invalid|targetinvalid|state IDLE : wait_one io_gen: 11, ref_gen: 10, cycle 1 $ cat sequential.txt | grep invalid | wc -l 5733 $ cat random.txt | grep invalid | wc -l 2206 -- Peter Geoghegan
Hi, On 2025-08-14 15:30:16 -0400, Peter Geoghegan wrote: > On Thu Aug 14, 2025 at 3:15 PM EDT, Peter Geoghegan wrote: > > On Thu, Aug 14, 2025 at 2:53 PM Andres Freund <andres@anarazel.de> wrote: > >> I think this is just an indicator of being IO bound. > > > > Then why does the exact same pair of runs show "I/O Timings: shared > > read=194.629" for the sequential table backwards scan (with total > > execution time 1132.360 ms), versus "I/O Timings: shared read=352.88" > > (with total execution time 697.681 ms) for the random table backwards > > scan? > > Is there any particular significance to the invalid op reports I also see in > the same log files? > $ cat sequential.txt | grep invalid | head > 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: io 0 |op invalid|targetinvalid|state IDLE : wait_one io_gen: 2, ref_gen: 1, cycle 1 > 2025-08-14 14:35:03.278 EDT [2516983][client backend] [[unknown]][0/1:0] DEBUG: 00000: io 0 |op invalid|targetinvalid|state IDLE : wait_one io_gen: 3, ref_gen: 2, cycle 1 No - that's likely just that the IO completed and thus the handle was made reusable (i.e. state IDLE). Note that the generation of IO we're waiting for (ref_gen) is lower than the IO handle's (io_gen). Greetings, Andres Freund
Hi, On 2025-08-14 15:15:02 -0400, Peter Geoghegan wrote: > On Thu, Aug 14, 2025 at 2:53 PM Andres Freund <andres@anarazel.de> wrote: > > I think this is just an indicator of being IO bound. > > Then why does the exact same pair of runs show "I/O Timings: shared > read=194.629" for the sequential table backwards scan (with total > execution time 1132.360 ms), versus "I/O Timings: shared read=352.88" > (with total execution time 697.681 ms) for the random table backwards > scan? > > Obviously it is hard to believe that the query with shared > read=194.629 is one that is naturally much more I/O bound than another > similar query that shows shared read=352.88. What "I/O Timings" shows > more or less makes sense to me already -- it just doesn't begin to > explain why *overall query execution* is much slower when scanning > backwards sequentially. Hm, that is somewhat curious. I wonder if there's some wait time that's not being captured by "I/O Timings". A first thing to do would be to just run strace --summary-only while running the query, and see if there are syscall wait times that seem too long. What effective_io_concurrency and io_max_concurrency setting are you using? If there are no free IO handles that's currently not nicely reported (because it's unclear how exactly to do so, see comment above pgaio_io_acquire_nb()). > > Could you show iostat for both cases? > > iostat has lots of options. Can you be more specific? iostat -xmy /path/to/block/device I'd like to see the difference in average IO size (rareq-sz), queue depth (aqu-sz) and completion time (r_await) between the fast and slow cases. Greetings, Andres Freund
On Thu, Aug 14, 2025 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > Then why does the exact same pair of runs show "I/O Timings: shared > read=194.629" for the sequential table backwards scan (with total > execution time 1132.360 ms), versus "I/O Timings: shared read=352.88" > (with total execution time 697.681 ms) for the random table backwards > scan? If you're interested in trying this out for yourself, I've pushed my working branch here: https://github.com/petergeoghegan/postgres/tree/index-prefetch-batch-v1.2 Note that the test case you'll run is added by the most recent commit: https://github.com/petergeoghegan/postgres/commit/c9ceb765f3b138f53b7f1fdf494ba7c816082aa1 Run microbenchmarks/random_backwards_weird.sql to do an initial load of both of the tables. Then run microbenchmarks/queries_random_backwards_weird.sql to actually run the relevant queries. There are 4 such queries, but only the 2 backwards scan queries really seem relevant. -- Peter Geoghegan
On Thu Aug 14, 2025 at 3:41 PM EDT, Andres Freund wrote: > Hm, that is somewhat curious. > > I wonder if there's some wait time that's not being captured by "I/O > Timings". A first thing to do would be to just run strace --summary-only while > running the query, and see if there are syscall wait times that seem too long. For the slow, sequential backwards scan query: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 100.00 0.271216 4 66808 io_uring_enter 0.00 0.000004 4 1 sendto 0.00 0.000001 0 2 1 recvfrom 0.00 0.000000 0 5 lseek 0.00 0.000000 0 1 epoll_wait 0.00 0.000000 0 4 openat ------ ----------- ----------- --------- --------- ---------------- 100.00 0.271221 4 66821 1 total For the fast, random backwards scan query: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 99.99 0.351518 4 77819 io_uring_enter 0.00 0.000007 2 3 1 epoll_wait 0.00 0.000006 6 1 sendto 0.00 0.000003 1 3 2 recvfrom 0.00 0.000002 2 1 read 0.00 0.000002 2 1 1 rt_sigreturn 0.00 0.000002 2 1 getpid 0.00 0.000002 1 2 kill 0.00 0.000000 0 3 lseek ------ ----------- ----------- --------- --------- ---------------- 100.00 0.351542 4 77834 4 total > What effective_io_concurrency and io_max_concurrency setting are you using? If > there are no free IO handles that's currently not nicely reported (because > it's unclear how exactly to do so, see comment above pgaio_io_acquire_nb()). effective_io_concurrency is 100. io_max_concurrency is 64. Nothing out of the ordinary there. > iostat -xmy /path/to/block/device > > I'd like to see the difference in average IO size (rareq-sz), queue depth > (aqu-sz) and completion time (r_await) between the fast and slow cases. I'll show one second interval output. Slow, sequential backwards scan query ------------------------------------- Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util nvme0n1 24613.00 192.29 0.00 0.00 0.20 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.92 53.20 avg-cpu: %user %nice %system %iowait %steal %idle 0.22 0.00 0.44 0.85 0.00 98.50 Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util nvme0n1 25320.00 197.81 0.00 0.00 0.20 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.18 51.20 Fast, random backwards scan query --------------------------------- Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util nvme0n1 27140.59 212.04 0.00 0.00 0.20 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.50 23.37 avg-cpu: %user %nice %system %iowait %steal %idle 0.50 0.00 0.84 0.00 0.00 98.66 Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util nvme0n1 50401.00 393.76 0.00 0.00 0.20 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 10.06 41.60 -- Peter Geoghegan
Hi, On 2025-08-14 15:45:26 -0400, Peter Geoghegan wrote: > On Thu, Aug 14, 2025 at 3:15 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Then why does the exact same pair of runs show "I/O Timings: shared > > read=194.629" for the sequential table backwards scan (with total > > execution time 1132.360 ms), versus "I/O Timings: shared read=352.88" > > (with total execution time 697.681 ms) for the random table backwards > > scan? > > If you're interested in trying this out for yourself, I've pushed my > working branch here: > > https://github.com/petergeoghegan/postgres/tree/index-prefetch-batch-v1.2 > > Note that the test case you'll run is added by the most recent commit: > > https://github.com/petergeoghegan/postgres/commit/c9ceb765f3b138f53b7f1fdf494ba7c816082aa1 > > Run microbenchmarks/random_backwards_weird.sql to do an initial load > of both of the tables. Then run > microbenchmarks/queries_random_backwards_weird.sql to actually run the > relevant queries. There are 4 such queries, but only the 2 backwards > scan queries really seem relevant. Interesting. In the sequential case I see some waits that are not attributed in explain, due to the waits happening within WaitIO(), not WaitReadBuffers(). Which indicates that the read stream is trying to re-read a buffer that previously started being read. read_stream_start_pending_read() -> StartReadBuffers() -> AsyncReadBuffers() -> ReadBuffersCanStartIO() -> StartBufferIO() -> WaitIO() There are far fewer cases of this in the random case. From what I can tell the sequential case so often will re-read a buffer that it is already in the process of reading - and thus wait for that IO before continuing - that we don't actually keep enough IO in flight. In your email with iostat output you can see that the slow case has aqu-sz=5.18, while the fast case has aqu-sz=10.06, i.e. the fast case has twice as much IO in flight. While both have IOs take the same amount of time (r_await=0.20). Which certainly explains the performance difference... We can optimize that by deferring the StartBufferIO() if we're encountering a buffer that is undergoing IO, at the cost of some complexity. I'm not sure real-world queries will often encounter the pattern of the same block being read in by a read stream multiple times in close proximity sufficiently often to make that worth it. Greetings, Andres Freund
On Thu, Aug 14, 2025 at 4:44 PM Andres Freund <andres@anarazel.de> wrote: > Interesting. In the sequential case I see some waits that are not attributed > in explain, due to the waits happening within WaitIO(), not WaitReadBuffers(). > Which indicates that the read stream is trying to re-read a buffer that > previously started being read. I *knew* that something had to be up here. Thanks for your help with debugging! > read_stream_start_pending_read() > -> StartReadBuffers() > -> AsyncReadBuffers() > -> ReadBuffersCanStartIO() > -> StartBufferIO() > -> WaitIO() > > There are far fewer cases of this in the random case. Index tuples with TIDs that are slightly out of order are very normal. Even for *perfectly* sequential inserts, the FSM tends to use the last piece of free space on a heap page some time after the heap page initially becomes "almost full". I recently described this to Tomas on this thread [1]. > From what I can tell the sequential case so often will re-read a buffer that > it is already in the process of reading - and thus wait for that IO before > continuing - that we don't actually keep enough IO in flight. Oops. There is an existing stop-gap mechanism in the patch that is supposed to deal with this problem. index_scan_stream_read_next, which is the read stream callback, has logic that is supposed to suppress duplicate block requests. But that's obviously not totally effective, since it only remembers the very last heap block request. If this same mechanism remembered (say) the last 2 heap blocks it requested, that might be enough to totally fix this particular problem. This isn't a serious proposal, but it'll be simple enough to implement. Hopefully when I do that (which I plan to soon) it'll fully validate your theory. > We can optimize that by deferring the StartBufferIO() if we're encountering a > buffer that is undergoing IO, at the cost of some complexity. I'm not sure > real-world queries will often encounter the pattern of the same block being > read in by a read stream multiple times in close proximity sufficiently often > to make that worth it. We definitely need to be prepared for duplicate prefetch requests in the context of index scans. I'm far from sure how sophisticated that actually needs to be. Obviously the design choices in this area are far from settled right now. [1] DC1G2PKUO9CI.3MK1L3YBZ2V3T@bowt.ie -- Peter Geoghegan
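For illustration, the suppression mechanism being discussed amounts to the read stream callback remembering which heap block number(s) it handed out most recently and skipping exact repeats. A minimal sketch in C, with made-up type and function names (the actual patch keeps only the single IndexScanBatchState.lastBlock value):

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: remember the last few heap block numbers handed to the
 * read stream, so the callback could skip immediate repeats.  The real patch
 * keeps just one entry (IndexScanBatchState.lastBlock); all names here are
 * hypothetical.
 */
#define BLOCK_HISTORY	2
#define INVALID_BLOCK	UINT32_MAX

typedef struct BlockDedup
{
	uint32_t	recent[BLOCK_HISTORY];	/* recently requested block numbers */
	int			next;					/* next slot to overwrite (FIFO) */
} BlockDedup;

static void
blockdedup_init(BlockDedup *dedup)
{
	for (int i = 0; i < BLOCK_HISTORY; i++)
		dedup->recent[i] = INVALID_BLOCK;
	dedup->next = 0;
}

/*
 * Returns true if the block was requested very recently, i.e. the callback
 * could avoid feeding it to the stream a second time.  Otherwise records it
 * and returns false.
 */
static bool
blockdedup_seen(BlockDedup *dedup, uint32_t block)
{
	for (int i = 0; i < BLOCK_HISTORY; i++)
		if (dedup->recent[i] == block)
			return true;

	dedup->recent[dedup->next] = block;
	dedup->next = (dedup->next + 1) % BLOCK_HISTORY;
	return false;
}

As the follow-up message explains, simply widening this history is not workable on its own: heapam_index_fetch_tuple wants to consume the buffers as a simple stream, so silently skipping blocks in the callback would break the whole concept of a stream of buffers.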
On Thu, Aug 14, 2025 at 5:06 PM Peter Geoghegan <pg@bowt.ie> wrote: > If this same mechanism remembered (say) the last 2 heap blocks it > requested, that might be enough to totally fix this particular > problem. This isn't a serious proposal, but it'll be simple enough to > implement. Hopefully when I do that (which I plan to soon) it'll fully > validate your theory. I spoke too soon. It isn't going to be so easy, since heapam_index_fetch_tuple wants to consume buffers as a simple stream. There's no way that index_scan_stream_read_next can just suppress duplicate block number requests (in a way that's more sophisticated than the current trivial approach that stores the very last block number in IndexScanBatchState.lastBlock) without it breaking the whole concept of a stream of buffers. > > We can optimize that by deferring the StartBufferIO() if we're encountering a > > buffer that is undergoing IO, at the cost of some complexity. I'm not sure > > real-world queries will often encounter the pattern of the same block being > > read in by a read stream multiple times in close proximity sufficiently often > > to make that worth it. > > We definitely need to be prepared for duplicate prefetch requests in > the context of index scans. Can you (or anybody else) think of a quick and dirty way of working around the problem on the read stream side? I would like to prioritize getting the patch into a state where its overall performance profile "feels right". From there we can iterate on fixing the underlying issues in more principled ways. FWIW it wouldn't be that hard to require the callback (in our case index_scan_stream_read_next) to explicitly point out that it knows that the block number it's requesting has to be a duplicate. It might make sense to at least place that much of the burden on the callback/client side. -- Peter Geoghegan
On 8/14/25 01:19, Andres Freund wrote: > Hi, > > On 2025-08-14 01:11:07 +0200, Tomas Vondra wrote: >> On 8/13/25 23:57, Peter Geoghegan wrote: >>> On Wed, Aug 13, 2025 at 5:19 PM Tomas Vondra <tomas@vondra.me> wrote: >>>> It's also not very surprising this happens with backwards scans more. >>>> The I/O is apparently much slower (due to missing OS prefetch), so we're >>>> much more likely to hit the I/O limits (max_ios and various other limits >>>> in read_stream_start_pending_read). >>> >>> But there's no OS prefetch with direct I/O. At most, there might be >>> some kind of readahead implemented in the SSD's firmware. >>> >> >> Good point, I keep forgetting direct I/O means no OS read-ahead. Not >> sure if there's a good way to determine if the SSD can do something like >> that (and how well). I wonder if there's a way to do backward sequential >> scans in fio .. > > In theory, yes, in practice, not quite: > https://github.com/axboe/fio/issues/1963 > > So right now it only works if you skip over some blocks. For that there are rather > significant performance differences on my SSDs. E.g. > > andres@awork3:~/src/fio$ fio --directory /srv/fio --size=$((1024*1024*1024)) --name test --bs=4k --rw read:8k --buffered 0 2>&1|grep READ > READ: bw=179MiB/s (188MB/s), 179MiB/s-179MiB/s (188MB/s-188MB/s), io=341MiB (358MB), run=1907-1907msec > andres@awork3:~/src/fio$ fio --directory /srv/fio --size=$((1024*1024*1024)) --name test --bs=4k --rw read:-8k --buffered 0 2>&1|grep READ > READ: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s), io=1024MiB (1074MB), run=14513-14513msec > > So on this WD Red SN700 there's a rather substantial performance difference. > > On a Samsung 970 PRO I don't see much of a difference. Nor on an ADATA > SX8200PNP. > I experimented with this a little bit today. Given the fio issues, I ended up writing a simple tool in C, doing pread() forward/backward with different block sizes and direct I/O. AFAICS this is roughly equivalent to fio with iodepth=1 (based on a couple tests). Too bad fio has issues with backward sequential tests ... I'll see if I can get fio to produce at least some results to validate mine. On all my SSDs there's a massive difference between forward and backward sequential scans. It depends on the block size, but for the smaller block sizes (1-16KB) it's roughly 4x slower. It gets better for larger blocks, but while that's interesting, we're stuck with 8K blocks. FWIW I'm not claiming this explains all odd things we're investigating in this thread, it's more a confirmation that the scan direction may matter if it translates to direction at the device level. I don't think it can explain the strange stuff with the "random" data sets constructed by Peter. regards -- Tomas Vondra
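To give an idea of what such a tool looks like: below is a minimal illustrative sketch, not the attached tool. It times a single synchronous pread() loop (roughly iodepth=1) walking a file forward or backward with O_DIRECT, and it assumes the block size passed in is a multiple of the device's logical block size, as direct I/O requires.

#define _GNU_SOURCE				/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/stat.h>

/*
 * Minimal forward/backward pread() benchmark (illustrative only).
 * Build: gcc -O2 -o preaddir preaddir.c
 * Usage: ./preaddir FILE BLOCK_SIZE fwd|bwd
 */
int
main(int argc, char **argv)
{
	if (argc != 4)
	{
		fprintf(stderr, "usage: %s FILE BLOCK_SIZE fwd|bwd\n", argv[0]);
		return 1;
	}

	size_t		bs = strtoul(argv[2], NULL, 10);
	int			backward = (strcmp(argv[3], "bwd") == 0);
	int			fd = open(argv[1], O_RDONLY | O_DIRECT);
	struct stat st;
	void	   *buf;

	if (fd < 0 || fstat(fd, &st) != 0 || posix_memalign(&buf, 4096, bs) != 0)
	{
		perror("setup");
		return 1;
	}

	off_t		nblocks = st.st_size / (off_t) bs;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (off_t i = 0; i < nblocks; i++)
	{
		/* walk the file front-to-back, or back-to-front */
		off_t		blk = backward ? (nblocks - 1 - i) : i;

		if (pread(fd, buf, bs, blk * (off_t) bs) != (ssize_t) bs)
		{
			perror("pread");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double		secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

	printf("%s, %zu byte blocks: %.1f MB/s\n",
		   backward ? "backward" : "forward", bs,
		   (double) nblocks * bs / secs / (1024 * 1024));
	return 0;
}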
On Thu, Aug 14, 2025 at 6:24 PM Tomas Vondra <tomas@vondra.me> wrote: > FWIW I'm not claiming this explains all odd things we're investigating > in this thread, it's more a confirmation that the scan direction may > matter if it translates to direction at the device level. I don't think > it can explain the strange stuff with the "random" data sets constructed > Peter. The weird performance characteristics of that one backwards scan are now believed to be due to the WaitIO issue that Andres described about an hour ago. That issue seems unlikely to only affect backwards scans/reverse-sequential heap I/O. I accept that backwards scans are likely to be significantly slower than forwards scans on most/all SSDs. But that in itself doesn't explain why the same issue didn't cause the equivalent sequential forward scan to also be a lot slower. Actually, it probably *did* cause that forwards scan to be *somewhat* slower -- just not by enough to immediately jump out at me (not enough to make the forwards scan much slower than a scan that does wholly random I/O, which is obviously absurd). My guess is that once we fix the underlying problem, we'll see improved performance for many different types of queries. Not as big of a benefit as the one that the broken query will get, but still enough to matter. -- Peter Geoghegan
On 8/14/25 23:55, Peter Geoghegan wrote: > On Thu, Aug 14, 2025 at 5:06 PM Peter Geoghegan <pg@bowt.ie> wrote: >> If this same mechanism remembered (say) the last 2 heap blocks it >> requested, that might be enough to totally fix this particular >> problem. This isn't a serious proposal, but it'll be simple enough to >> implement. Hopefully when I do that (which I plan to soon) it'll fully >> validate your theory. > > I spoke too soon. It isn't going to be so easy, since > heapam_index_fetch_tuple wants to consume buffers as a simple stream. > There's no way that index_scan_stream_read_next can just suppress > duplicate block number requests (in a way that's more sophisticated > than the current trivial approach that stores the very last block > number in IndexScanBatchState.lastBlock) without it breaking the whole > concept of a stream of buffers. > I believe this idea (checking not just the very last block, but keeping a bit longer history) was briefly discussed a couple months ago, after you pointed out the need for the "last block" optimization (which the patch didn't have). At that point we were focused on addressing a regression with correlated indexes, so the single block was enough. But as you point out, it's harder than it seems. If I recall correctly, the challenge is that heapam_index_fetch_tuple() is expected to release the block when it changes, but then how would it know there's no future read of the same buffer in the stream? >>> We can optimize that by deferring the StartBufferIO() if we're encountering a >>> buffer that is undergoing IO, at the cost of some complexity. I'm not sure >>> real-world queries will often encounter the pattern of the same block being >>> read in by a read stream multiple times in close proximity sufficiently often >>> to make that worth it. >> >> We definitely need to be prepared for duplicate prefetch requests in >> the context of index scans. > > Can you (or anybody else) think of a quick and dirty way of working > around the problem on the read stream side? I would like to prioritize > getting the patch into a state where its overall performance profile > "feels right". From there we can iterate on fixing the underlying > issues in more principled ways. > > FWIW it wouldn't be that hard to require the callback (in our case > index_scan_stream_read_next) to explicitly point out that it knows > that the block number it's requesting has to be a duplicate. It might > make sense to at least place that much of the burden on the > callback/client side. > I don't recall all the details, but IIRC my impression was it'd be best to do this "caching" entirely in read_stream.c (so the next_block callbacks would probably not need to worry about lastBlock at all), enabled when creating the stream. And then there would be something like read_stream_release_buffer() that'd do the right thing to release the buffer when it's not needed. regards -- Tomas Vondra
On 8/15/25 01:05, Peter Geoghegan wrote: > On Thu, Aug 14, 2025 at 6:24 PM Tomas Vondra <tomas@vondra.me> wrote: >> FWIW I'm not claiming this explains all odd things we're investigating >> in this thread, it's more a confirmation that the scan direction may >> matter if it translates to direction at the device level. I don't think >> it can explain the strange stuff with the "random" data sets constructed >> Peter. > > The weird performance characteristics of that one backwards scan are > now believed to be due to the WaitIO issue that Andres described about > an hour ago. That issue seems unlikely to only affect backwards > scans/reverse-sequential heap I/O. > Good. I admit I lost track of which the various regressions may affect existing plans, and which are specific to the prefetch patch. > I accept that backwards scans are likely to be significantly slower > than forwards scans on most/all SSDs. But that in itself doesn't > explain why the same issue didn't cause the equivalent sequential > forward scan to also be a lot slower. Actually, it probably *did* > cause that forwards scan to be *somewhat* slower -- just not by enough > to immediately jump out at me (not enough to make the forwards scan > much slower than a scan that does wholly random I/O, which is > obviously absurd). > True. That's weird. > My guess is that once we fix the underlying problem, we'll see > improved performance for many different types of queries. Not as big > of a benefit as the one that the broken query will get, but still > enough to matter. > Hopefully. Let's see. regards -- Tomas Vondra
Hi, On 2025-08-14 17:55:53 -0400, Peter Geoghegan wrote: > On Thu, Aug 14, 2025 at 5:06 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > We can optimize that by deferring the StartBufferIO() if we're encountering a > > > buffer that is undergoing IO, at the cost of some complexity. I'm not sure > > > real-world queries will often encounter the pattern of the same block being > > > read in by a read stream multiple times in close proximity sufficiently often > > > to make that worth it. > > > > We definitely need to be prepared for duplicate prefetch requests in > > the context of index scans. > > Can you (or anybody else) think of a quick and dirty way of working > around the problem on the read stream side? I would like to prioritize > getting the patch into a state where its overall performance profile > "feels right". From there we can iterate on fixing the underlying > issues in more principled ways. I think I can see a way to fix the issue, below read stream. Basically, whenever AsyncReadBuffers() finds a buffer that has ongoing IO, instead of waiting, as we do today, copy the wref to the ReadBuffersOperation() and set a new flag indicating that we are waiting for an IO that was not started by the wref. Then, in WaitReadBuffers(), we wait for such foreign started IOs. That has to be somewhat different code from today, because we have to deal with the fact of the "foreign" IO potentially having failed. I'll try writing a prototype for that tomorrow. I think to actually get that into a committable shape we need a test harness (probably a read stream controlled by an SQL function that gets an array of buffers). > FWIW it wouldn't be that hard to require the callback (in our case > index_scan_stream_read_next) to explicitly point out that it knows > that the block number it's requesting has to be a duplicate. It might > make sense to at least place that much of the burden on the > callback/client side. The problem actually exists outside of your case. E.g. if you have multiple backends doing a synchronized seqscan on the same relation, performance regresses, because we often end up synchronously waiting for IOs started by another backend. I don't think it has quite as large an effect for that as it has here, because the different scans basically desynchronize whenever it happens due to the synchronous waits slowing down the waiting backend a lot), limiting the impact somewhat. Greetings, Andres Freund
On Fri, Aug 15, 2025 at 11:21 AM Tomas Vondra <tomas@vondra.me> wrote: > I don't recall all the details, but IIRC my impression was it'd be best > to do this "caching" entirely in read_stream.c (so the next_block > callbacks would probably not need to worry about lastBlock at all), > enabled when creating the stream. And then there would be something like > read_stream_release_buffer() that'd do the right thing to release the buffer > when it's not needed. I've thought about this problem quite a bit. xlogprefetcher.c was designed to use read_stream.c, as the comment above LsnReadQueue vaguely promises, and I have mostly working patches to finish that job (more soon). The WAL is naturally full of repetition with interleaving patterns, so there are many opportunities to avoid buffer mapping table traffic, pinning, content locking and more. I'm not sure that read_stream.c is necessarily the right place, though. I have experimented with that a bit, using a small window of recently accessed blocks, with various designs. One of my experiments did it further down. I shoved a cache line of blocknum->buffernum mappings into SMgrRelation so you can skip the buffer mapping table and find repeat accesses. I tried FIFO replacement, vectorised CLOCK (!) and some harebrained things for this nano-buffer map. At various times I had goals including remembering where to find the internal pages in a high frequency repeated btree search (eg inserting with monotonically increasing keys or nested loop with increasing or repeated keys), and, well, lots of other stuff. That was somewhat promising (you can see a variant of that in one of the patches in the ReadRecentBuffer() thread that I will shortly be rehydrating), but I wasn't entirely satisfied because it still had to look up the local pin count, if there is one, so I had plans to investigate a tighter integration with that stuff too. Coming back to the WAL, I want something that can cheaply find the buffer and bump the local pin count (rather than introducing a secondary reference counting scheme in the WAL that I think you might be describing?), and I want it to work even if it's not in the read ahead window because the distance is very low, ie fully cached replay. Anyway, that was all about microscopic stuff that I want to do to speed up CPU bound replay with little or no I/O. This stall on repeated access to a block with IO already in progress is a different beast, and I look forward to checking out the patch that Andres just described. By funny coincidence I was just studying that phenomenon and code path last week in the context of my io_method=posix_aio patch. There, completing other processes' IOs is a bit more expensive and I was thinking about ways to give the submitting backend more time to handle it if this backend is only looking ahead and doesn't strictly need the IO to be completed right now to make progress. I was studying competing synchronized_scans, ie other backends' IOs, not repeat access in this backend, but the solution he just described sounds like a way to hit both birds with one stone, and makes a pretty good trade-off: the other guy's IO almost certainly won't fail, and we almost certainly aren't deadlocked, and if that bet is wrong we can deal with it later.
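To make the "nano-buffer map" idea a little more concrete, here is a toy sketch of a cache-line-sized blocknum -> buffernum map with FIFO replacement. All of the names are made up, and a real version (like the SMgrRelation experiment described above) would still have to verify that the buffer really does hold that block, and deal with the local pin count, before trusting a hit:

#include <stdint.h>

/*
 * Toy sketch of a tiny blocknum -> buffernum map with FIFO replacement,
 * sized to fit in roughly one 64-byte cache line.  Names and layout are
 * hypothetical; a hit must still be validated against the buffer header,
 * since the buffer may have been evicted or reused in the meantime.
 */
#define NANO_MAP_SLOTS	7
#define INVALID_BLOCK	UINT32_MAX

typedef struct NanoBufferMap
{
	uint32_t	blocknum[NANO_MAP_SLOTS];	/* recently accessed blocks */
	int32_t		buffernum[NANO_MAP_SLOTS];	/* matching buffer IDs */
	uint32_t	next_victim;				/* FIFO replacement pointer */
} NanoBufferMap;

static void
nano_map_init(NanoBufferMap *map)
{
	for (int i = 0; i < NANO_MAP_SLOTS; i++)
		map->blocknum[i] = INVALID_BLOCK;
	map->next_victim = 0;
}

/*
 * Return a candidate buffer ID for the block, or -1 to fall back to the
 * shared buffer mapping table.
 */
static int32_t
nano_map_lookup(const NanoBufferMap *map, uint32_t blocknum)
{
	for (int i = 0; i < NANO_MAP_SLOTS; i++)
		if (map->blocknum[i] == blocknum)
			return map->buffernum[i];
	return -1;
}

/* Remember a mapping, evicting the oldest entry. */
static void
nano_map_remember(NanoBufferMap *map, uint32_t blocknum, int32_t buffernum)
{
	map->blocknum[map->next_victim] = blocknum;
	map->buffernum[map->next_victim] = buffernum;
	map->next_victim = (map->next_victim + 1) % NANO_MAP_SLOTS;
}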
On Fri, Aug 15, 2025 at 1:47 PM Thomas Munro <thomas.munro@gmail.com> wrote: > (rather than introducing a secondary reference > counting scheme in the WAL that I think you might be describing?), and s/WAL/read stream/
On Thu, Aug 14, 2025 at 7:26 PM Tomas Vondra <tomas@vondra.me> wrote: > Good. I admit I lost track of which the various regressions may affect > existing plans, and which are specific to the prefetch patch. As far as I know, we only have the following unambiguous performance regressions (that clearly need to be fixed): 1. This issue. 2. There's about a 3% loss of throughput on pgbench SELECT. This isn't surprising at all; it would be a near-miracle if this kind of prototype quality code didn't at least have a small regression here (it's not like we've even started to worry about small fixed costs for simple selective queries just yet). This will need to be fixed, but it's fairly far down the priority list right now. I feel that we're still very much at the stage where it makes sense to just fix the most prominent performance issue, and then reevaluate. Repeating that process iteratively. It's quite likely that there are more performance issues/bugs that we don't yet know about. IMV it doesn't make sense to closely track individual queries that have only been moderately regressed. -- Peter Geoghegan
Hi, On 2025-08-14 19:36:49 -0400, Andres Freund wrote: > On 2025-08-14 17:55:53 -0400, Peter Geoghegan wrote: > > On Thu, Aug 14, 2025 at 5:06 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > > We can optimize that by deferring the StartBufferIO() if we're encountering a > > > > buffer that is undergoing IO, at the cost of some complexity. I'm not sure > > > > real-world queries will often encounter the pattern of the same block being > > > > read in by a read stream multiple times in close proximity sufficiently often > > > > to make that worth it. > > > > > > We definitely need to be prepared for duplicate prefetch requests in > > > the context of index scans. > > > > Can you (or anybody else) think of a quick and dirty way of working > > around the problem on the read stream side? I would like to prioritize > > getting the patch into a state where its overall performance profile > > "feels right". From there we can iterate on fixing the underlying > > issues in more principled ways. > > I think I can see a way to fix the issue, below read stream. Basically, > whenever AsyncReadBuffers() finds a buffer that has ongoing IO, instead of > waiting, as we do today, copy the wref to the ReadBuffersOperation() and set a > new flag indicating that we are waiting for an IO that was not started by the > wref. Then, in WaitReadBuffers(), we wait for such foreign started IOs. That > has to be somewhat different code from today, because we have to deal with the > fact of the "foreign" IO potentially having failed. > > I'll try writing a prototype for that tomorrow. I think to actually get that > into a committable shape we need a test harness (probably a read stream > controlled by an SQL function that gets an array of buffers). Attached is a prototype of this approach. It does seem to fix this issue. 
New code disabled: #### backwards sequential table #### ┌──────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├──────────────────────────────────────────────────────────────────────┤ │ Index Scan Backward using t_pk on t (actual rows=1048576.00 loops=1) │ │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ │ Index Searches: 1 │ │ Buffers: shared hit=10291 read=49933 │ │ I/O Timings: shared read=213.277 │ │ Planning: │ │ Buffers: shared hit=91 read=19 │ │ I/O Timings: shared read=2.124 │ │ Planning Time: 3.269 ms │ │ Execution Time: 1023.279 ms │ └──────────────────────────────────────────────────────────────────────┘ (10 rows) New code enabled: #### backwards sequential table #### ┌──────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├──────────────────────────────────────────────────────────────────────┤ │ Index Scan Backward using t_pk on t (actual rows=1048576.00 loops=1) │ │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ │ Index Searches: 1 │ │ Buffers: shared hit=10291 read=49933 │ │ I/O Timings: shared read=217.225 │ │ Planning: │ │ Buffers: shared hit=91 read=19 │ │ I/O Timings: shared read=2.009 │ │ Planning Time: 2.685 ms │ │ Execution Time: 602.987 ms │ └──────────────────────────────────────────────────────────────────────┘ (10 rows) With the change enabled, the sequential query is faster than the random query: #### backwards random table #### ┌────────────────────────────────────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├────────────────────────────────────────────────────────────────────────────────────────────┤ │ Index Scan Backward using t_randomized_pk on t_randomized (actual rows=1048576.00 loops=1) │ │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ │ Index Searches: 1 │ │ Buffers: shared hit=6085 read=77813 │ │ I/O Timings: shared read=347.285 │ │ Planning: │ │ Buffers: shared hit=127 read=5 │ │ I/O Timings: shared read=1.001 │ │ Planning Time: 1.751 ms │ │ Execution Time: 820.544 ms │ └────────────────────────────────────────────────────────────────────────────────────────────┘ (10 rows) Greetings, Andres Freund
On Thu Aug 14, 2025 at 7:26 PM EDT, Tomas Vondra wrote: >> My guess is that once we fix the underlying problem, we'll see >> improved performance for many different types of queries. Not as big >> of a benefit as the one that the broken query will get, but still >> enough to matter. >> > > Hopefully. Let's see. Good news here: with Andres' bufmgr patch applied, the similar forwards scan query does indeed get more than 2x faster. And I don't mean that it gets faster on the randomized table -- it actually gets 2x faster with your original (almost but not quite entirely sequential) table, and your original query. This is especially good news because that query seems particularly likely to be representative of real world user queries. And so the "backwards scan" aspect of this investigation was always a bit of a red herring. The only reason why "backwards-ness" ever even seemed relevant was that with the backwards scan variant, performance was made so much slower by the issue that Andres' patch addresses than even my randomized version of the same query ran quite a bit faster. More concretely: Without bufmgr patch -------------------- ┌─────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├─────────────────────────────────────────────────────────────┤ │ Index Scan using t_pk on t (actual rows=1048576.00 loops=1) │ │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ │ Index Searches: 1 │ │ Buffers: shared hit=6572 read=49933 │ │ I/O Timings: shared read=77.038 │ │ Planning: │ │ Buffers: shared hit=50 read=6 │ │ I/O Timings: shared read=0.570 │ │ Planning Time: 0.774 ms │ │ Execution Time: 618.585 ms │ └─────────────────────────────────────────────────────────────┘ (10 rows) With bufmgr patch ----------------- ┌─────────────────────────────────────────────────────────────┐ │ QUERY PLAN │ ├─────────────────────────────────────────────────────────────┤ │ Index Scan using t_pk on t (actual rows=1048576.00 loops=1) │ │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ │ Index Searches: 1 │ │ Buffers: shared hit=10257 read=49933 │ │ I/O Timings: shared read=135.825 │ │ Planning: │ │ Buffers: shared hit=50 read=6 │ │ I/O Timings: shared read=0.570 │ │ Planning Time: 0.767 ms │ │ Execution Time: 279.643 ms │ └─────────────────────────────────────────────────────────────┘ (10 rows) I _think_ that Andres' patch also fixes the EXPLAIN ANALYZE accounting, so that "I/O Timings" is actually correct. That's why EXPLAIN ANALYZE with the bufmgr patch has much higher "shared read" time, despite overall execution time being cut in half. -- Peter Geoghegan
On Fri, Aug 15, 2025 at 12:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > Good news here: with Andres' bufmgr patch applied, the similar forwards scan > query does indeed get more than 2x faster. And I don't mean that it gets > faster on the randomized table -- it actually gets 2x faster with your > original (almost but not quite entirely sequential) table, and your original > query. This is especially good news because that query seems particularly > likely to be representative of real world user queries. BTW, I also think that Andres' patch makes performance a lot more stable. I'm pretty sure that I've noticed that the exact query that I just showed updated results for has at various times run faster (without Andres' patch), due to who-knows-what. FWIW, this development probably completely changes the results of many (all?) of your benchmark queries. My guess is that with Andres' patch, things will be better across the board. But in any case the numbers that you posted before now must now be considered obsolete/nonrepresentative. Since this is such a huge change. -- Peter Geoghegan
Hi, Glad to see that the prototype does fix the issue for you. On 2025-08-15 12:29:25 -0400, Peter Geoghegan wrote: > FWIW, this development probably completely changes the results of many > (all?) of your benchmark queries. My guess is that with Andres' patch, > things will be better across the board. But in any case the numbers > that you posted before now must now be considered > obsolete/nonrepresentative. Since this is such a huge change. I'd hope it doesn't improve all benchmark queries - if so the set of benchmarks would IMO be too skewed towards cases that access the same heap blocks multiple times within the readahead distance. That's definitely an important thing to measure, but it's surely not the only thing to care about. For the index workloads the patch doesn't do anything about cases where we don't end up re-encountering a buffer that we already started IO for. Greetings, Andres Freund
Hi, On 2025-08-15 12:24:40 -0400, Peter Geoghegan wrote: > With bufmgr patch > ----------------- > > ┌─────────────────────────────────────────────────────────────┐ > │ QUERY PLAN │ > ├─────────────────────────────────────────────────────────────┤ > │ Index Scan using t_pk on t (actual rows=1048576.00 loops=1) │ > │ Index Cond: ((a >= 16336) AND (a <= 49103)) │ > │ Index Searches: 1 │ > │ Buffers: shared hit=10257 read=49933 │ > │ I/O Timings: shared read=135.825 │ > │ Planning: │ > │ Buffers: shared hit=50 read=6 │ > │ I/O Timings: shared read=0.570 │ > │ Planning Time: 0.767 ms │ > │ Execution Time: 279.643 ms │ > └─────────────────────────────────────────────────────────────┘ > (10 rows) > > I _think_ that Andres' patch also fixes the EXPLAIN ANALYZE accounting, so > that "I/O Timings" is actually correct. That's why EXPLAIN ANALYZE with the > bufmgr patch has much higher "shared read" time, despite overall execution > time being cut in half. Somewhat random note about I/O waits: Unfortunately the I/O wait time we measure often massively *over* estimate the actual I/O time. If I execute the above query with the patch applied, we actually barely ever wait for I/O to complete, it's all completed by the time we have to wait for the I/O. What we are measuring is the CPU cost of *initiating* the I/O. That's why we are seeing "I/O Timings" > 0 even if we do perfect readahead. Most of the cost is in the kernel, primarily looking up block locations and setting up the actual I/O. Greetings, Andres Freund
On Fri, Aug 15, 2025 at 1:09 PM Andres Freund <andres@anarazel.de> wrote: > On 2025-08-15 12:29:25 -0400, Peter Geoghegan wrote: > > FWIW, this development probably completely changes the results of many > > (all?) of your benchmark queries. My guess is that with Andres' patch, > > things will be better across the board. But in any case the numbers > > that you posted before now must now be considered > > obsolete/nonrepresentative. Since this is such a huge change. > > I'd hope it doesn't improve all benchmark queries - if so the set of > benchmarks would IMO be too skewed towards cases that access the same heap > blocks multiple times within the readahead distance. I don't think that that will be a problem. Up until recently, I had exactly the opposite complaint about the benchmark queries. > That's definitely an > important thing to measure, but it's surely not the only thing to care > about. For the index workloads the patch doesn't do anything about cases where > we don't up re-encountering a buffer that we already started IO for. IMV we need to make a conservative assumption that it might matter for any query. There have already been numerous examples where we thought we fully understood a test case, but didn't. BTW, I just rebooted my workstation, losing various procfs changes that I'd made when debugging this issue. It now looks like the forward scan query is actually made about 3x faster by the addition of your patch (not 2x faster, as reported earlier). It goes from 592.618 ms to 204.966 ms. -- Peter Geoghegan
On Fri, Aug 15, 2025 at 1:23 PM Andres Freund <andres@anarazel.de> wrote: > Somewhat random note about I/O waits: > > Unfortunately the I/O wait time we measure often massively *over* estimate the > actual I/O time. If I execute the above query with the patch applied, we > actually barely ever wait for I/O to complete, it's all completed by the time > we have to wait for the I/O. What we are measuring is the CPU cost of > *initiating* the I/O. I do get that. This was really obvious when I temporarily switched the prefetch patch over from using READ_STREAM_DEFAULT to using READ_STREAM_USE_BATCHING (this is probably buggy, but still seems likely to be representative of what's possible with some care). I noticed that that change reduced the reported "shared read" time by 10x -- which had exactly zero impact on query execution time (at least for the queries I looked at). Since, as you say, the backend didn't have to wait for I/O to complete either way. -- Peter Geoghegan
On Thu, Aug 14, 2025 at 10:12 PM Peter Geoghegan <pg@bowt.ie> wrote: > As far as I know, we only have the following unambiguous performance > regressions (that clearly need to be fixed): > > 1. This issue. > > 2. There's about a 3% loss of throughput on pgbench SELECT. I did a quick pgbench SELECT benchmark again with Andres' patch, just to see if that has been impacted. Now the regression there is much larger; it goes from a ~3% regression to a ~14% regression. I'm not worried about it. Andres' "not waiting for already-in-progress IO" patch was clearly just a prototype. Just thought it was worth noting here. -- Peter Geoghegan
Hi, On August 15, 2025 3:25:50 PM EDT, Peter Geoghegan <pg@bowt.ie> wrote: >On Thu, Aug 14, 2025 at 10:12 PM Peter Geoghegan <pg@bowt.ie> wrote: >> As far as I know, we only have the following unambiguous performance >> regressions (that clearly need to be fixed): >> >> 1. This issue. >> >> 2. There's about a 3% loss of throughput on pgbench SELECT. > >I did a quick pgbench SELECT benchmark again with Andres' patch, just >to see if that has been impacted. Now the regression there is much >larger; it goes from a ~3% regression to a ~14% regression. > >I'm not worried about it. Andres' "not waiting for already-in-progress >IO" patch was clearly just a prototype. Just thought it was worth >noting here. Are you confident in that? Because the patch should be extremely cheap in that case. What precisely were you testing? Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Aug 15, 2025 at 3:28 PM Andres Freund <andres@anarazel.de> wrote: > >I'm not worried about it. Andres' "not waiting for already-in-progress > >IO" patch was clearly just a prototype. Just thought it was worth > >noting here. > > Are you confident in that? Because the patch should be extremely cheap in that case. I'm pretty confident. > What precisely were you testing? I'm just running my usual generic pgbench SELECT script, with my usual settings (so no direct I/O, but with iouring). -- Peter Geoghegan
Hi, On 2025-08-15 15:31:47 -0400, Peter Geoghegan wrote: > On Fri, Aug 15, 2025 at 3:28 PM Andres Freund <andres@anarazel.de> wrote: > > >I'm not worried about it. Andres' "not waiting for already-in-progress > > >IO" patch was clearly just a prototype. Just thought it was worth > > >noting here. > > > > Are you confident in that? Because the patch should be extremely cheap in that case. > > I'm pretty confident. > > > What precisely were you testing? > > I'm just running my usual generic pgbench SELECT script, with my usual > settings (so no direct I/O, but with iouring). I see absolutely no effect of the patch with shared_buffers=1GB and a read-only scale 200 pgbench at 40 clients. What data sizes, shared buffers etc. were you testing? Greetings, Andres Freund
On Fri, Aug 15, 2025 at 3:38 PM Andres Freund <andres@anarazel.de> wrote: > I see absolutely no effect of the patch with shared_buffers=1GB and a > read-only scale 200 pgbench at 40 clients. What data sizes, shared buffers > etc. were you testing? Just to be clear: you are testing with both the index prefetching patch and your patch together, right? Not just your own patch? My shared_buffers is 16GB, with pgbench scale 300. -- Peter Geoghegan
Hi, On 2025-08-15 15:42:10 -0400, Peter Geoghegan wrote: > On Fri, Aug 15, 2025 at 3:38 PM Andres Freund <andres@anarazel.de> wrote: > > I see absolutely no effect of the patch with shared_buffers=1GB and a > > read-only scale 200 pgbench at 40 clients. What data sizes, shared buffers > > etc. were you testing? > > Just to be clear: you are testing with both the index prefetching > patch and your patch together, right? Not just your own patch? Correct. > My shared_buffers is 16GB, with pgbench scale 300. So there's actually no IO, given that a scale 300 is something like 4.7GB? In that case my patch could really not make a difference, neither of the changed branches would ever be reached? Or were you testing the warmup phase, rather than the steady state? Greetings, Andres Freund
On Fri, Aug 15, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote: > > My shared_buffers is 16GB, with pgbench scale 300. > > So there's actually no IO, given that a scale 300 is something like 4.7GB? In > that case my patch could really not make a difference, neither of the changed > branches would ever be reached? This was an error on my part -- sorry. I think that the problem was that I forgot that I temporarily increased effective_io_concurrency from 100 to 1,000 while debugging this issue. Apparently that disproportionately affected the patched server. Could also have been an issue with a recent change of mine. -- Peter Geoghegan
On Thu, Aug 14, 2025 at 10:12 PM Peter Geoghegan <pg@bowt.ie> wrote: > As far as I know, we only have the following unambiguous performance > regressions (that clearly need to be fixed): > > 1. This issue. > > 2. There's about a 3% loss of throughput on pgbench SELECT. Update: I managed to fix the performance regression with pgbench SELECT (regression 2). Since Andres' patch fixes the other regression (regression 1), we no longer have any known performance regression (though I don't doubt that they still exist somewhere). I've also added back the enable_indexscan_prefetch testing GUC (Andres asked me to do that a few weeks back). If you set enable_indexscan_prefetch=false, btgetbatch performance is virtually identical to master/btgettuple. A working copy of the patchset with these revisions is available from: https://github.com/petergeoghegan/postgres/tree/index-prefetch-batch-v1.6 The solution to the pgbench issue was surprisingly straightforward. Profiling showed that the regression was caused by the added overhead of using the read stream, for queries where prefetching cannot possibly help -- such small startup costs are relatively noticeable with pgbench's highly selective scans. It turns out that it's possible to initially avoid using a read stream, while still retaining the option of switching over to using a read stream later on. The trick to fixing the pgbench issue was delaying creating a read stream for long enough for the pgbench queries to never need to create one, without that impacting queries that at least have some chance of benefiting from prefetching. The actual heuristic I'm using to decide when to start the read stream is simple: only start a read stream right after the scan's second batch is returned by amgetbatch, but before we've fetched any heap blocks related to that second batch (start using a read stream when fetching new heap blocks from that second batch). It's possible that that heuristic isn't sophisticated enough for other types of queries. But either way the basic structure within indexam.c places no restrictions on when we start a read stream. It doesn't have to be aligned with amgetbatch-wise batch boundaries, for example (I just found that structure convenient). I haven't spent much time testing this change, but it appears to work perfectly (no pgbench regressions, but also no regressions in queries that were already seeing significant benefits from prefetching). I'd feel better about all this if we had better testing of the read stream invariants by (say) adding assertions to index_scan_stream_read_next, the read stream callback. And just having comments that explain those invariants. -- Peter Geoghegan
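A minimal C sketch of the control flow described above, assuming invented names (IndexPrefetchState, xs_prefetch, batches_returned, heap_fetch_next_block are made up for this illustration); only read_stream_begin_relation(), read_stream_next_buffer(), ReadBuffer() and the index_scan_stream_read_next callback named in the mail correspond to real code, and the actual patch is structured differently:

#include "access/heapam.h"
#include "access/relscan.h"
#include "storage/bufmgr.h"
#include "storage/read_stream.h"

/*
 * Sketch only: fetch the next heap block for an index scan, creating the
 * read stream lazily.  The scan keeps doing plain synchronous reads until
 * it has returned its second batch, so tiny OLTP lookups never pay the
 * stream's startup cost.
 */
static Buffer
heap_fetch_next_block(IndexScanDesc scan, IndexFetchHeapData *hscan,
                      BlockNumber blkno)
{
    IndexPrefetchState *pfetch = scan->xs_prefetch;     /* hypothetical */

    if (pfetch->stream == NULL && pfetch->batches_returned >= 2)
    {
        /* second batch reached: from here on prefetching might pay off */
        pfetch->stream = read_stream_begin_relation(READ_STREAM_DEFAULT,
                                                    NULL,
                                                    hscan->xs_base.rel,
                                                    MAIN_FORKNUM,
                                                    index_scan_stream_read_next,
                                                    scan,
                                                    0);
    }

    if (pfetch->stream != NULL)
    {
        /* the stream's callback supplies block numbers, so blkno is unused */
        return read_stream_next_buffer(pfetch->stream, NULL);
    }

    /* still in the "no stream yet" phase: ordinary synchronous read */
    return ReadBuffer(hscan->xs_base.rel, blkno);
}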
On 8/17/25 19:30, Peter Geoghegan wrote: > On Thu, Aug 14, 2025 at 10:12 PM Peter Geoghegan <pg@bowt.ie> wrote: >> As far as I know, we only have the following unambiguous performance >> regressions (that clearly need to be fixed): >> >> 1. This issue. >> >> 2. There's about a 3% loss of throughput on pgbench SELECT. > > Update: I managed to fix the performance regression with pgbench > SELECT (regression 2). Since Andres' patch fixes the other regression > (regression 1), we no longer have any known performance regression > (though I don't doubt that they still exist somewhere). I've also > added back the enable_indexscan_prefetch testing GUC (Andres asked me > to do that a few weeks back). If you set > enable_indexscan_prefetch=false, btgetbatch performance is virtually > identical to master/btgettuple. > > A working copy of the patchset with these revisions is available from: > https://github.com/petergeoghegan/postgres/tree/index-prefetch-batch-v1.6 > > The solution to the pgbench issue was surprisingly straightforward. > Profiling showed that the regression was caused by the added overhead > of using the read stream, for queries where prefetching cannot > possibly help -- such small startup costs are relatively noticeable > with pgbench's highly selective scans. It turns out that it's possible > to initially avoid using a read stream, while still retaining the > option of switching over to using a read stream later on. The trick to > fixing the pgbench issue was delaying creating a read stream for long > enough for the pgbench queries to never need to create one, without > that impacting queries that at least have some chance of benefiting > from prefetching. > > The actual heuristic I'm using to decide when to start the read stream > is simple: only start a read stream right after the scan's second > batch is returned by amgetbatch, but before we've fetched any heap > blocks related to that second batch (start using a read stream when > fetching new heap blocks from that second batch). It's possible that > that heuristic isn't sophisticated enough for other types of queries. > But either way the basic structure within indexam.c places no > restrictions on when we start a read stream. It doesn't have to be > aligned with amgetbatch-wise batch boundaries, for example (I just > found that structure convenient). > > I haven't spent much time testing this change, but it appears to work > perfectly (no pgbench regressions, but also no regressions in queries > that were already seeing significant benefits from prefetching). I'd > feel better about all this if we had better testing of the read stream > invariants by (say) adding assertions to index_scan_stream_read_next, > the read stream callback. And just having comments that explain those > invariants. > Thanks for investigating this. I think it's the right direction - simple OLTP queries should not be paying for building read_stream when there's little chance of benefit. Unfortunately, this seems to be causing regressions, both compared to master (or disabled prefetching), and to the earlier prefetch patches. I kept running the query generator [1] that builds data sets with randomized parameters, and then runs index scan queries on that, looking for differences between branches. 
Consider this data set: ------------------------------------------------------------------------ create unlogged table t (a bigint, b text) with (fillfactor = 20); insert into t select 1 * a, b from (select r, a, b, generate_series(0,1-1) AS p from (select row_number() over () AS r, a, b from (select i AS a, md5(i::text) AS b from generate_series(1, 10000000) s(i) ORDER BY (i + 256 * (random() - 0.5))) foo) bar) baz ORDER BY ((r * 1 + p) + 128 * (random() - 0.5)); create index idx on t(a ASC); vacuum freeze t; analyze t; ------------------------------------------------------------------------ Let's run this query (all runs are with cold caches): EXPLAIN (ANALYZE, COSTS OFF) SELECT * FROM t WHERE a BETWEEN 5085 AND 3053660 ORDER BY a ASC; 1) current patch ================ QUERY PLAN ----------------------------------------------------------------------- Index Scan using idx on t (actual time=0.517..6593.821 rows=3048576.00 loops=1) Index Cond: ((a >= 5085) AND (a <= 3053660)) Index Searches: 1 Prefetch Distance: 2.066 Prefetch Count: 296179 Prefetch Stalls: 2553745 Prefetch Skips: 198613 Prefetch Resets: 0 Stream Ungets: 0 Stream Forwarded: 74 Prefetch Histogram: [2,4) => 289560, [4,8) => 6604, [8,16) => 15 Buffers: shared hit=2704779 read=153516 Planning: Buffers: shared hit=78 read=27 Planning Time: 5.525 ms Execution Time: 6721.599 ms (16 rows) 2) removed priorbatch (always uses read stream) =============================================== QUERY PLAN ----------------------------------------------------------------------- Index Scan using idx on t (actual time=1.008..1932.379 rows=3048576.00 loops=1) Index Cond: ((a >= 5085) AND (a <= 3053660)) Index Searches: 1 Prefetch Distance: 87.970 Prefetch Count: 2877141 Prefetch Stalls: 1 Prefetch Skips: 198617 Prefetch Resets: 0 Stream Ungets: 27182 Stream Forwarded: 7640 Prefetch Histogram: [2,4) => 2, [4,8) => 6, [8,16) => 7, [16,32) => 10, [32,64) => 8183, [64,128) => 2868933 Buffers: shared hit=2704571 read=153516 Planning: Buffers: shared hit=78 read=27 Planning Time: 14.302 ms Execution Time: 2036.654 ms (16 rows) 3) no prefetch (same as master) =============================== set enable_indexscan_prefetch = off; QUERY PLAN ----------------------------------------------------------------------- Index Scan using idx on t (actual time=0.850..1336.723 rows=3048576.00 loops=1) Index Cond: ((a >= 5085) AND (a <= 3053660)) Index Searches: 1 Buffers: shared hit=2704779 read=153516 Planning: Buffers: shared hit=82 read=22 Planning Time: 10.696 ms Execution Time: 1433.530 ms (8 rows) The main difference in the explains is this: Prefetch Distance: 2.066 (new patch) Prefetch Distance: 87.970 (old patch, without priorbatch) The histogram just confirms this, with most prefetches either in [2,4) or [64,128) bins. The new patch has much lower prefetch distance. I believe this is the same issue with "collapsed" distance after resetting the read_stream. In that case the trouble was the reset also set distance to 1, and there were so many "hits" due to buffers read earlier it never ramped up again (we doubled it every now and then, but the decay was faster). The same thing happens here, I think. We process the first batch without using a read stream. Then after reading the second batch we create the read_stream, but it starts with distance=1 - it's just like after reset. And it never ramps up the distance, because of the hits from reading the preceding batch. 
For the resets, the solution (at least for now) was to remember the distance and restore it after reset. But here we don't have any distance to restore - there's no prefetch or read stream. Maybe it'd be possible to track some stats, during the initial phase, and then use that to initialize the distance for the first batch processed by read stream? Seems rather inconvenient, though. What exactly is the overhead of creating the read_stream? Is that about allocating memory, or something else? Would it be possible to reduce the overhead enough to not matter even for OLTP queries? Maybe it would be possible to initialize the read_stream only "partially", enough to do sync I/O and track the distance, and delay only the expensive stuff? I'm also not sure it's optimal to only initialize read_stream after reading the next batch. For some indexes a batch can have hundreds of items, and that certainly could benefit from prefetching. I suppose it should be possible to initialize the read_stream half-way through a batch, right? Or is there a reason why that can't work? regards [1] https://github.com/tvondra/postgres/tree/index-prefetch-master/query-stress-test -- Tomas Vondra
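One possible shape for the "track stats during the initial phase" idea, purely as a sketch: count hits and misses while the scan still reads synchronously, and derive a starting distance from the observed miss rate when the stream is eventually created. Note that read_stream_set_initial_distance() does not exist in read_stream.c - it and the counters below are invented stand-ins for whatever mechanism would actually carry such a hint into the stream:

#include <math.h>
#include "storage/read_stream.h"

/* hypothetical counters, maintained while reads are still synchronous */
typedef struct PrestreamStats
{
    uint64      nhits;          /* blocks found already in shared buffers */
    uint64      nmisses;        /* blocks that required a physical read */
} PrestreamStats;

/*
 * Seed the stream's look-ahead distance so it does not start at 1 and get
 * stuck there because of hits left over from the pre-stream phase.  The
 * read_stream_set_initial_distance() call is invented for this sketch.
 */
static void
seed_stream_distance(ReadStream *stream, const PrestreamStats *stats,
                     int effective_io_concurrency)
{
    uint64      nreq = stats->nhits + stats->nmisses;
    double      miss_rate;
    int         distance;

    if (nreq == 0)
        return;                 /* nothing observed yet, keep the default */

    miss_rate = (double) stats->nmisses / nreq;
    if (miss_rate <= 0.0)
        return;                 /* everything was cached so far */

    /* distance needed to keep ~effective_io_concurrency I/Os in flight */
    distance = (int) ceil(effective_io_concurrency / miss_rate);

    read_stream_set_initial_distance(stream, distance);    /* hypothetical */
}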
On Tue, Aug 19, 2025 at 1:23 PM Tomas Vondra <tomas@vondra.me> wrote: > Thanks for investigating this. I think it's the right direction - simple > OLTP queries should not be paying for building read_stream when there's > little chance of benefit. > > Unfortunately, this seems to be causing regressions, both compared to > master (or disabled prefetching), and to the earlier prefetch patches. > The main difference in the explains is this: > > Prefetch Distance: 2.066 (new patch) > > Prefetch Distance: 87.970 (old patch, without priorbatch) > > The histogram just confirms this, with most prefetches either in [2,4) > or [64,128) bins. The new patch has much lower prefetch distance. That definitely seems like a problem. I think that you're saying that this problem happens because we have extra buffer hits earlier on, which is enough to completely change the ramp-up behavior. This seems to be all it takes to dramatically decrease the effectiveness of prefetching. Does that summary sound correct? > I believe this is the same issue with "collapsed" distance after > resetting the read_stream. In that case the trouble was the reset also > set distance to 1, and there were so many "hits" due to buffers read > earlier it never ramped up again (we doubled it every now and then, but > the decay was faster). If my summary of what you said is accurate, then to me the obvious question is: isn't this also going to be a problem *without* the new "delay starting read stream" behavior? Couldn't you break the "removed priorbatch" case in about the same way using a slightly different test case? Say a test case involving concurrent query execution? More concretely: what about similar cases where some *other* much more selective query runs around the same time as the nonselective regressed query? What if this other selective query reads the same group of heap pages into shared_buffers that our nonselective query will also need to visit (before visiting all the other heap pages not yet in shared_buffers, that we want to prefetch)? Won't this other scenario also confuse the read stream ramp-up heuristics, in a similar way? It seems bad that the initial conditions that the read stream sees can have such lasting consequences. It feels as if the read stream is chasing its own tail. I wonder if this is related to the fact that we're using the read stream in a way that it wasn't initially optimized for. After all, we're the first caller that doesn't just do sequential access all the time -- we're bound to have novel problems with the read stream for that reason alone. > The same thing happens here, I think. We process the first batch without > using a read stream. Then after reading the second batch we create the > read_stream, but it starts with distance=1 - it's just like after reset. > And it never ramps up the distance, because of the hits from reading the > preceding batch. > Maybe it'd be possible to track some stats, during the initial phase, > and then use that to initialize the distance for the first batch > processed by read stream? Seems rather inconvenient, though. But why should the stats from the first leaf page read be particularly important? It's just one page out of the thousands that are ultimately read. Unless I've misunderstood you, the real problem seems to be that the read stream effectively gets fixated on a few early buffer hits. It sounds like it is getting stuck in a local minima, or something like that. > What exactly is the overhead of creating the read_stream? 
Is that about > allocating memory, or something else? It's hard to be precise here, because we're only talking about a 3% regression with pgbench. A lot of that regression probably related to memory allocation overhead. I also remember get_tablespace() being visible in profiles (it is called from get_tablespace_maintenance_io_concurrency, which is itself called from read_stream_begin_impl). It's probably a lot of tiny things, that all add up to a small (though still unacceptable) regression. > Would it be possible to reduce the > overhead enough to not matter even for OLTP queries? > Maybe it would be > possible to initialize the read_stream only "partially", enough to do do > sync I/O and track the distance, and delay only the expensive stuff? Maybe, but I think that this is something to consider only after other approaches to fixing the problem fail. > I'm also not sure it's optimal to only initialize read_stream after > reading the next batch. For some indexes a batch can have hundreds of > items, and that certainly could benefit from prefetching. That does seem quite possible, and should also be investigated. But it doesn't sound like the issue you're seeing with your adversarial random query. > I suppose it > should be possible to initialize the read_stream half-way though a > batch, right? Or is there a reason why that can't work? Yes, that's right -- the general structure should be able to support switching over to a read stream when we're only mid-way through reading the TIDs associated with a given batch (likely the first batch). The only downside is that that'd require adding logic/more branches to heapam_index_fetch_tuple to detect when to do this. I think that that approach is workable, if we really need it to work -- it's definitely an option. For now I would like to focus on debugging your problematic query (which doesn't sound like the kind of query that could benefit from initializing the read_stream when we're still only half-way through a batch). Does that make sense, do you think? -- Peter Geoghegan
On Tue, Aug 19, 2025 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > That definitely seems like a problem. I think that you're saying that > this problem happens because we have extra buffer hits earlier on, > which is enough to completely change the ramp-up behavior. This seems > to be all it takes to dramatically decrease the effectiveness of > prefetching. Does that summary sound correct? Update: Tomas and I discussed this over IM. We ultimately concluded that it made the most sense to treat this issue as a regression against set enable_indexscan_prefetch = off/master. It was probably made a bit worse by the recent addition of delaying creating a read stream (to avoid regressing pgbench SELECT) with io_method=worker, though for me (with io_method=io_uring) it makes things faster instead. None of this business with io_method seems important, since either way there's a clear regression against set enable_indexscan_prefetch = off/master. And we don't want those. So ultimately we need to understand why no prefetching wins by a not-insignificant margin with this query. Also, I just noticed that with a DESC/backwards scan version of Tomas' query, things are vastly slower. But even then, fully synchronous buffered I/O is still slightly faster. -- Peter Geoghegan
On 8/15/25 17:09, Andres Freund wrote: > Hi, > > On 2025-08-14 19:36:49 -0400, Andres Freund wrote: >> On 2025-08-14 17:55:53 -0400, Peter Geoghegan wrote: >>> On Thu, Aug 14, 2025 at 5:06 PM Peter Geoghegan <pg@bowt.ie> wrote: >>>>> We can optimize that by deferring the StartBufferIO() if we're encountering a >>>>> buffer that is undergoing IO, at the cost of some complexity. I'm not sure >>>>> real-world queries will often encounter the pattern of the same block being >>>>> read in by a read stream multiple times in close proximity sufficiently often >>>>> to make that worth it. >>>> >>>> We definitely need to be prepared for duplicate prefetch requests in >>>> the context of index scans. >>> >>> Can you (or anybody else) think of a quick and dirty way of working >>> around the problem on the read stream side? I would like to prioritize >>> getting the patch into a state where its overall performance profile >>> "feels right". From there we can iterate on fixing the underlying >>> issues in more principled ways. >> >> I think I can see a way to fix the issue, below read stream. Basically, >> whenever AsyncReadBuffers() finds a buffer that has ongoing IO, instead of >> waiting, as we do today, copy the wref to the ReadBuffersOperation() and set a >> new flag indicating that we are waiting for an IO that was not started by the >> wref. Then, in WaitReadBuffers(), we wait for such foreign started IOs. That >> has to be somewhat different code from today, because we have to deal with the >> fact of the "foreign" IO potentially having failed. >> >> I'll try writing a prototype for that tomorrow. I think to actually get that >> into a committable shape we need a test harness (probably a read stream >> controlled by an SQL function that gets an array of buffers). > > Attached is a prototype of this approach. It does seem to fix this issue. > Thanks. Based on the testing so far, the patch seems to be a substantial improvement. What's needed to make this prototype committable? I assume this is PG19+ improvement, right? It probably affects PG18 too, but it's harder to hit / the impact is not as bad as on PG19. On a related note, my test that generates random datasets / queries, and compares index prefetching with different io_method values found a pretty massive difference between worker and io_uring. I wonder if this might be some issue in io_method=worker. 
Consider this synthetic dataset: ---------------------------------------------------------------------- create unlogged table t (a bigint, b text) with (fillfactor = 20); insert into t select 1 * a, b from ( select r, a, b, generate_series(0,2-1) AS p from (select row_number() over () AS r, a, b from ( select i AS a, md5(i::text) AS b from generate_series(1, 5000000) s(i) order by (i + 16 * (random() - 0.5)) ) foo ) bar ) baz ORDER BY ((r * 2 + p) + 8 * (random() - 0.5)); create index idx on t(a ASC) with (deduplicate_items=false); vacuum freeze t; analyze t; SELECT * FROM t WHERE a BETWEEN 16150 AND 4540437 ORDER BY a ASC; ---------------------------------------------------------------------- On master (or with index prefetching disabled), this gets executed like this (cold caches): QUERY PLAN ---------------------------------------------------------------------- Index Scan using idx on t (actual rows=9048576.00 loops=1) Index Cond: ((a >= 16150) AND (a <= 4540437)) Index Searches: 1 Buffers: shared hit=2577599 read=455610 Planning: Buffers: shared hit=82 read=21 Planning Time: 5.982 ms Execution Time: 1691.708 ms (8 rows) while with index prefetching (with the aio prototype patch), it looks like this: QUERY PLAN ---------------------------------------------------------------------- Index Scan using idx on t (actual rows=9048576.00 loops=1) Index Cond: ((a >= 16150) AND (a <= 4540437)) Index Searches: 1 Prefetch Distance: 2.032 Prefetch Count: 868165 Prefetch Stalls: 2140228 Prefetch Skips: 6039906 Prefetch Resets: 0 Stream Ungets: 0 Stream Forwarded: 4 Prefetch Histogram: [2,4) => 855753, [4,8) => 12412 Buffers: shared hit=2577599 read=455610 Planning: Buffers: shared hit=78 read=26 dirtied=1 Planning Time: 1.032 ms Execution Time: 3150.578 ms (16 rows) So it's about 2x slower. The prefetch distance collapses, because there's a lot of cache hits (about 50% of requests seem to be hits of already visited blocks). I think that's a problem with how we adjust the distance, but I'll post about that separately. Let's try to simply set io_method=io_uring: QUERY PLAN ---------------------------------------------------------------------- Index Scan using idx on t (actual rows=9048576.00 loops=1) Index Cond: ((a >= 16150) AND (a <= 4540437)) Index Searches: 1 Prefetch Distance: 2.032 Prefetch Count: 868165 Prefetch Stalls: 2140228 Prefetch Skips: 6039906 Prefetch Resets: 0 Stream Ungets: 0 Stream Forwarded: 4 Prefetch Histogram: [2,4) => 855753, [4,8) => 12412 Buffers: shared hit=2577599 read=455610 Planning: Buffers: shared hit=78 read=26 Planning Time: 2.212 ms Execution Time: 1837.615 ms (16 rows) That's much closer to master (and the difference could be mostly noise). I'm not sure what's causing this, but almost all regressions my script is finding look like this - always io_method=worker, with distance close to 2.0. Is this some inherent io_method=worker overhead? regards -- Tomas Vondra
On 8/20/25 00:27, Peter Geoghegan wrote: > On Tue, Aug 19, 2025 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote: >> That definitely seems like a problem. I think that you're saying that >> this problem happens because we have extra buffer hits earlier on, >> which is enough to completely change the ramp-up behavior. This seems >> to be all it takes to dramatically decrease the effectiveness of >> prefetching. Does that summary sound correct? > That summary is correct, yes. I kept thinking about this, while looking at more regressions found by my script (that generates data sets with different data distributions, etc.). Almost all regressions (at least the top ones) now look like this, i.e. distance collapses to ~2.0, which essentially disables prefetching. But I no longer think it's caused by the "priorbatch" optimization, which delays read stream creation until after the first batch. I still think we may need to rethink that (e.g. if the first batch is huge), but the distance can "collapse" even without it. The optimization just makes it easier to happen. AFAICS the distance collapse is "inherent" to how the distance gets increased/decreased after hits/misses. Let's start with distance=1, and let's assume 50% of buffers are hits, in a regular pattern - hit-miss-hit-miss-hit-miss-... In this case, the distance will never increase beyond 2, because we'll double-decrement-double-decrement-... so it'll flip between 1 and 2, no matter how you set effective_io_concurrency. Of course, this can happen even with other hit ratios, there's nothing special about 50%. With fewer hits, it's fine - there's asymmetry, because the distance grows by doubling and decreases by decrementing 1. So once we have a bit more misses, it keeps growing. But with more hits, the hit/miss ratio simply determines the "stable" distance. Let's say there's 80% hits, so 4 hits to 1 miss. Then the stable distance is ~4, because we get a miss, double to 8, and then 4 hits, so the distance drops back to 4. And again. Similarly for other hit/miss ratios (it's easier to think about if you keep the number of hits 2^n). It's worth noticing the effective_io_concurrency has almost no impact on what distance we end up with, it merely limits the maximum distance. I find this distance heuristics a bit strange, for a couple reasons: * It doesn't seem right to get stuck at distance=2 with 50% misses. Surely that would benefit from prefetching a bit more? * It mostly ignores effective_io_concurrency, which I think about as "Keep this number of I/Os in the queue." But we don't try doing that. I understand the current heuristics is trying to not prefetch for cached data sets, but does that actually make sense? With fadvise it made sense, because the prefetched data could get evicted if we prefetched too far ahead. But with worker/io_uring the buffers get pinned, so this shouldn't happen. Of course, that doesn't mean we should prefetch too far ahead - there's LIMIT queries and limit of buffer pins, etc. What about if the distance heuristics asks this question: How far do we need to look to generate effective_io_concurrency IOs? The attached patch is a PoC implementing this. The core idea is that if we measure "miss probability" for a chunk of requests, we can use that to estimate the distance needed to generate e_i_c IOs. So while the current heuristics looks at individual hits/misses, the patch looks at groups of requests. The other idea is that the patch maintains a "distance range", with min/max of allowed distances.
The min/max values gradually grow after a miss, the "min" value "stops" at max_ios, while "max" grows further. This ensures gradual ramp up, helping LIMIT queries etc. And even if there are a lot of hits, the distance is not allowed to drop below the current "min". Because what would be the benefit of that? - If the read is a hit, we might read it later - but the cost is about the same, we're not really saving much by delaying the read. - If the read is a miss, it's clearly better to issue the I/O sooner. This may not be true if it's a LIMIT query, and it terminates early. But if the distance_min is not too high, this should be negligible. Attached is an example table/query, found by my script. Without the read_stream patch (i.e. just with the current index prefetching), it looks like this: QUERY PLAN ---------------------------------------------------------------------- Index Scan using idx on t (actual rows=9048576.00 loops=1) Index Cond: ((a >= 16150) AND (a <= 4540437)) Index Searches: 1 Prefetch Distance: 2.032 Prefetch Count: 868165 Prefetch Stalls: 2140228 Prefetch Skips: 6039906 Prefetch Resets: 0 Stream Ungets: 0 Stream Forwarded: 4 Prefetch Histogram: [2,4) => 855753, [4,8) => 12412 Buffers: shared hit=2577599 read=455610 Planning: Buffers: shared hit=78 read=26 dirtied=1 Planning Time: 1.032 ms Execution Time: 3150.578 ms (16 rows) and with the attached patch: QUERY PLAN ---------------------------------------------------------------------- Index Scan using idx on t (actual rows=9048576.00 loops=1) Index Cond: ((a >= 16150) AND (a <= 4540437)) Index Searches: 1 Prefetch Distance: 36.321 Prefetch Count: 3730750 Prefetch Stalls: 3 Prefetch Skips: 6039906 Prefetch Resets: 0 Stream Ungets: 722353 Stream Forwarded: 305265 Prefetch Histogram: [2,4) => 10, [4,8) => 11, [8,16) => 6, [16,32) => 316890, [32,64) => 3413833 Buffers: shared hit=2574776 read=455610 Planning: Buffers: shared hit=78 read=26 Planning Time: 2.249 ms Execution Time: 1651.826 ms (16 rows) The example is not entirely perfect, because the index prefetching does not actually beat master: QUERY PLAN ---------------------------------------------------------------------- Index Scan using idx on t (actual rows=9048576.00 loops=1) Index Cond: ((a >= 16150) AND (a <= 4540437)) Index Searches: 1 Buffers: shared hit=2577599 read=455610 Planning: Buffers: shared hit=78 read=26 Planning Time: 3.688 ms Execution Time: 1656.790 ms (8 rows) So it's more a case of "mitigating a regression" (finding regressions like this is the purpose of my script). Still, I believe the questions about the distance heuristics are valid. (Another interesting detail is that the regression happens only with io_method=worker, not with io_uring. I'm not sure why.) regards -- Tomas Vondra
Attachments
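A rough sketch of that heuristic, with all names invented here (the real PoC is in the attachment and differs in detail): measure the miss probability over a chunk of recent requests, derive the distance expected to yield about effective_io_concurrency I/Os, and clamp it to a [dist_min, dist_max] range that only ratchets up after misses, with dist_min stopping at max_ios:

#include <math.h>

/* invented bookkeeping for the sketch */
typedef struct DistanceRange
{
    int         dist_min;       /* distance is never allowed below this */
    int         dist_max;       /* ... nor above this */
    int         chunk_hits;     /* hits seen in the current chunk */
    int         chunk_misses;   /* misses seen in the current chunk */
} DistanceRange;

#define CHUNK_SIZE      64      /* requests per measurement chunk */

static int
distance_for_next_chunk(DistanceRange *r, int max_ios, int cur_distance)
{
    int         nreq = r->chunk_hits + r->chunk_misses;
    double      miss_prob;
    int         distance;

    if (nreq < CHUNK_SIZE)
        return cur_distance;            /* keep measuring this chunk */

    miss_prob = (double) r->chunk_misses / nreq;

    if (miss_prob > 0.0)
    {
        /* how far ahead must we look to find ~max_ios misses to issue? */
        distance = (int) ceil(max_ios / miss_prob);
    }
    else
    {
        /* fully cached chunk: stay at the floor rather than dropping to 1 */
        distance = r->dist_min;
    }

    /*
     * Clamp to the allowed range.  The range itself grows after misses,
     * with dist_min stopping at max_ios while dist_max keeps growing.
     */
    if (distance < r->dist_min)
        distance = r->dist_min;
    if (distance > r->dist_max)
        distance = r->dist_max;

    /* start a fresh chunk */
    r->chunk_hits = r->chunk_misses = 0;

    return distance;
}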
On Tue, Aug 26, 2025 at 2:18 AM Tomas Vondra <tomas@vondra.me> wrote: > Of course, this can happen even with other hit ratios, there's nothing > special about 50%. Right, that's what this patch was attacking directly, basically only giving up when misses are so sparse we can't do anything about it for an ordered stream: https://www.postgresql.org/message-id/CA%2BhUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw%40mail.gmail.com aio: Improve read_stream.c look-ahead heuristics C Previously we would reduce the look-ahead distance by one every time we got a cache hit, which sometimes performed poorly with mixed hit/miss patterns, especially if it was trapped at one. Instead, sustain the current distance until we've seen evidence that there is no window big enough to span the gap between rare IOs. In other words, we now use information from a much larger window to estimate the utility of looking far ahead.
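As an illustration only (this is not the logic from that patch, just one way to read its description): rather than decrementing the distance on every hit, one could sustain the current distance through runs of hits and only back off once a whole window of distance-many consecutive hits has produced no I/O at all:

#include <stdbool.h>

/*
 * Toy version of "sustain the distance until no window spans the gap
 * between rare I/Os".  Doubling on a miss matches the current behaviour;
 * the decay rule below is invented for illustration.
 */
static int
adjust_distance(int distance, int max_distance, bool was_hit,
                int *consecutive_hits)
{
    if (!was_hit)
    {
        /* an I/O was found within the window: looking ahead is paying off */
        *consecutive_hits = 0;
        return (distance * 2 > max_distance) ? max_distance : distance * 2;
    }

    (*consecutive_hits)++;

    /*
     * Only when an entire window of `distance` consecutive requests was
     * served from cache do we conclude that looking this far ahead finds
     * no I/O, and shrink the window.
     */
    if (*consecutive_hits >= distance)
    {
        *consecutive_hits = 0;
        return (distance > 1) ? distance / 2 : 1;
    }

    return distance;
}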
On 8/25/25 16:18, Tomas Vondra wrote: > ... > > But with more hits, the hit/miss ratio simply determines the "stable" > distance. Let's say there's 80% hits, so 4 hits to 1 miss. Then the > stable distance is ~4, because we get a miss, double to 8, and then 4 > hits, so the distance drops back to 4. And again. > I forgot to mention the distance is "stable" only if you already start at it - then we keep it. But start at a higher value, and the distance keeps growing. Or start at a lower value, and it collapses to 1. Plus it's rather sensitive, a minor variation can easily push the distance in either direction. So it's more like an "unstable equilibrium" in physics. regards -- Tomas Vondra
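The unstable equilibrium is easy to reproduce with a toy simulation of the double-on-miss / decrement-on-hit rule (standalone C, not PostgreSQL code). With 80% hits and a cap of 64, starting at 4 stays at 4, starting at 8 climbs to near the cap, and starting at 2 collapses to 1; the alternating 50% case stays pinned at 1-2:

#include <stdio.h>

/*
 * Toy model of the current look-ahead heuristic: double the distance on a
 * cache miss, decrement it on each hit, clamp to [1, max].  Each iteration
 * below is one miss followed by hits_per_miss hits.
 */
static int
simulate(int start, int hits_per_miss, int max, int niter)
{
    int         distance = start;

    for (int i = 0; i < niter; i++)
    {
        /* one miss: double, clamped to the maximum */
        distance = (distance * 2 > max) ? max : distance * 2;

        /* followed by hits_per_miss hits: decrement, never below 1 */
        for (int h = 0; h < hits_per_miss; h++)
            if (distance > 1)
                distance--;
    }
    return distance;
}

int
main(void)
{
    /* 80% hits (4 hits per miss), cap 64 */
    printf("start=4: %d\n", simulate(4, 4, 64, 1000));     /* stays at 4 */
    printf("start=8: %d\n", simulate(8, 4, 64, 1000));     /* ends near 64 */
    printf("start=2: %d\n", simulate(2, 4, 64, 1000));     /* collapses to 1 */

    /* 50% hits, strictly alternating: flips between 1 and 2 forever */
    printf("50%% hits: %d\n", simulate(1, 1, 64, 1000));
    return 0;
}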
On 8/25/25 17:43, Thomas Munro wrote: > On Tue, Aug 26, 2025 at 2:18 AM Tomas Vondra <tomas@vondra.me> wrote: >> Of course, this can happen even with other hit ratios, there's nothing >> special about 50%. > > Right, that's what this patch was attacking directly, basically only > giving up when misses are so sparse we can't do anything about it for > an ordered stream: > > https://www.postgresql.org/message-id/CA%2BhUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw%40mail.gmail.com > > aio: Improve read_stream.c look-ahead heuristics C > > Previously we would reduce the look-ahead distance by one every time we > got a cache hit, which sometimes performed poorly with mixed hit/miss > patterns, especially if it was trapped at one. > > Instead, sustain the current distance until we've seen evidence that > there is no window big enough to span the gap between rare IOs. In > other words, we now use information from a much larger window to > estimate the utility of looking far ahead. Ah, I forgot about this patch. There's been too many PoC / experimental patches with read_stream improvements, I'm loosing track of them. I'm ready to do some evaluation, but it's not clear which ones to evaluate, etc. Could you maybe consolidate them into a patch series that I could benchmark? I did give this patch a try with the dataset/query shared in [1], and the explain looks like this: QUERY PLAN --------------------------------------------------------------------- Index Scan using idx on t (actual rows=9048576.00 loops=1) Index Cond: ((a >= 16150) AND (a <= 4540437)) Index Searches: 1 Prefetch Distance: 271.999 Prefetch Count: 4339129 Prefetch Stalls: 386 Prefetch Skips: 6039906 Prefetch Resets: 0 Stream Ungets: 1331122 Stream Forwarded: 306719 Prefetch Histogram: [2,4) => 10, [4,8) => 2, [8,16) => 2, [16,32) => 2, [32,64) => 2, [64,128) => 3, [256,512) => 4339108 Buffers: shared hit=2573920 read=455610 Planning: Buffers: shared hit=83 read=26 Planning Time: 4.142 ms Execution Time: 1694.368 ms (16 rows) which is pretty good, and pretty much on-par with master (so no regression, which is good). It's a bit strange the distance ends up being that high, though. The explain says: Prefetch Distance: 271.999 There's ~70% misses on average, so isn't 217 a bit too high? Wouldn't that cause too many concurrent IOs? Maybe I'm interpreting this wrong, or maybe the explain stats are not quite right. For comparison, the patch from [1] ends up with this: Prefetch Distance: 36.321 In any case, the patch seems to help, and maybe it's a better approach, I need to take a closer look. regards [1] https://www.postgresql.org/message-id/8f5d66cf-44e9-40e0-8349-d5590ba8efb4%40vondra.me -- Tomas Vondra
On Mon, Aug 25, 2025 at 10:18 AM Tomas Vondra <tomas@vondra.me> wrote: > Almost all regressions (at least the top ones) now look like this, i.e. > distance collapses to ~2.0, which essentially disables prefetching. Good to know. > But I no longer think it's caused by the "priorbatch" optimization, > which delays read stream creation until after the first batch. I still > think we may need to rethink that (e.g. if the first batch is huge), but > he distance can "collapse" even without it. The optimization just makes > it easier to happen. That shouldn't count against the "priorbatch" optimization. I still think that this issue should be treated as 100% unrelated to the "priorbatch" optimization. You might very well be right that the "priorbatch" optimization is too naive about index scans whose first/possibly only leaf page has TIDs that point to many distinct heap blocks (hundreds, say). But there's no reason to think that that's truly relevant to the problem at hand. If there was such a problem, then it wouldn't look like a regression against enable_indexscan_prefetch = off/master. We'd likely require a targeted approach to even notice such a problem; so far, most/all of our index scan test cases have read hundreds/thousands of index pages -- so any problem that's limited to the first leaf page read is likely to go unnoticed. I think that the "priorbatch" optimization at least takes *approximately* the right approach, which is good enough for now. It at least shouldn't ever do completely the wrong thing. It even seems possible that sufficiently testing will actually show that its naive approach to be the best one, on balance, once the cost of adding mitigations (costs for all queries, not just ones like the one you looked at recently) is taken into account. I suggest that we not even think about "priorbatch" until the problem on the read stream side is fixed. IMV we should at least have a prototype patch for the read stream that we're reasonably happy with before looking at "priorbatch" in further detail. I don't think we have that right now. > AFAICS the distance collapse is "inherent" to how the distance gets > increased/decreased after hits/misses. Right. (I think that you'll probably agree with me about addressing this problem before even thinking about limitations in the "priorbatch" optimization, but I thought it best to be clear about that.) > I find this distance heuristics a bit strange, for a couple reasons: > > * It doesn't seem right to get stuck at distance=2 with 50% misses. > Surely that would benefit from prefetching a bit more? Maybe, but at what cost? It doesn't necessarily make sense to continue to read additional leaf pages, regardless of the number of heap buffer hits in the recent past. At some point it likely makes more sense to just give up and do actual query processing/return rows to the scan. Even without a LIMIT. I have low confidence here, though. > * It mostly ignores effective_io_concurrency, which I think about as > "Keep this number of I/Os in the queue." But we don't try doing that. As I said, I might just be wrong about "just giving up at some point" making sense. I just don't necessarily think it makes sense to go from ignoring effective_io_concurrency to *only* caring about effective_io_concurrency. It's likely true that keeping effective_io_concurrency-many I/Os in flight is the single most important thing -- but I doubt it's the only thing that ever matters (again, even assuming that there's no LIMIT involved). 
> Attached is an example table/query, found by my script. Without the > read_stream patch (i.e. just with the current index prefetching), it > looks like this: > So it's more a case of "mitigating a regression" (finding regressions > like this is the purpose of my script). Still, I believe the questions > about the distance heuristics are valid. > > (Another interesting detail is that the regression happens only with > io_method=worker, not with io_uring. I'm not sure why.) I find that the regression happens with io_uring. I also find that your patch doesn't fix it. I have no idea why. -- Peter Geoghegan
On 8/25/25 19:57, Peter Geoghegan wrote: > On Mon, Aug 25, 2025 at 10:18 AM Tomas Vondra <tomas@vondra.me> wrote: >> Almost all regressions (at least the top ones) now look like this, i.e. >> distance collapses to ~2.0, which essentially disables prefetching. > > Good to know. > >> But I no longer think it's caused by the "priorbatch" optimization, >> which delays read stream creation until after the first batch. I still >> think we may need to rethink that (e.g. if the first batch is huge), but >> he distance can "collapse" even without it. The optimization just makes >> it easier to happen. > > That shouldn't count against the "priorbatch" optimization. I still > think that this issue should be treated as 100% unrelated to the > "priorbatch" optimization. > > You might very well be right that the "priorbatch" optimization is too > naive about index scans whose first/possibly only leaf page has TIDs > that point to many distinct heap blocks (hundreds, say). But there's > no reason to think that that's truly relevant to the problem at hand. > If there was such a problem, then it wouldn't look like a regression > against enable_indexscan_prefetch = off/master. We'd likely require a > targeted approach to even notice such a problem; so far, most/all of > our index scan test cases have read hundreds/thousands of index pages > -- so any problem that's limited to the first leaf page read is likely > to go unnoticed. > > I think that the "priorbatch" optimization at least takes > *approximately* the right approach, which is good enough for now. It > at least shouldn't ever do completely the wrong thing. It even seems > possible that sufficiently testing will actually show that its naive > approach to be the best one, on balance, once the cost of adding > mitigations (costs for all queries, not just ones like the one you > looked at recently) is taken into account. > > I suggest that we not even think about "priorbatch" until the problem > on the read stream side is fixed. IMV we should at least have a > prototype patch for the read stream that we're reasonably happy with > before looking at "priorbatch" in further detail. I don't think we > have that right now. > Right. I might have expressed it more clearly, but this is what I meant when I said priorbatch is not causing this. As for priorbatch, I'd still like to know where does the overhead come from. I mean, what's the expensive part of creating a read stream? Maybe that can be fixed, instead of delaying the creation, etc. Maybe the delay could happen within read_stream? >> AFAICS the distance collapse is "inherent" to how the distance gets >> increased/decreased after hits/misses. > > Right. (I think that you'll probably agree with me about addressing > this problem before even thinking about limitations in the > "priorbatch" optimization, but I thought it best to be clear about > that.) > Agreed. >> I find this distance heuristics a bit strange, for a couple reasons: >> >> * It doesn't seem right to get stuck at distance=2 with 50% misses. >> Surely that would benefit from prefetching a bit more? > > Maybe, but at what cost? It doesn't necessarily make sense to continue > to read additional leaf pages, regardless of the number of heap buffer > hits in the recent past. At some point it likely makes more sense to > just give up and do actual query processing/return rows to the scan. > Even without a LIMIT. I have low confidence here, though. > Yes, it doesn't make sense to continue forever. 
That was the point of distance_max in my patch - if we don't get enough I/Os by that distance, we give up. I'm not saying we should do whatever to meet effective_io_concurrency. It just seems a bit strange to ignore it like this, because right now it has absolutely no impact on the read stream. If the query gets into the "collapsed distance", it'll happen with any effective_io_concurrency. >> * It mostly ignores effective_io_concurrency, which I think about as >> "Keep this number of I/Os in the queue." But we don't try doing that. > > As I said, I might just be wrong about "just giving up at some point" > making sense. I just don't necessarily think it makes sense to go from > ignoring effective_io_concurrency to *only* caring about > effective_io_concurrency. It's likely true that keeping > effective_io_concurrency-many I/Os in flight is the single most > important thing -- but I doubt it's the only thing that ever matters > (again, even assuming that there's no LIMIT involved). > I'm not saying we should only care about effective_io_concurrency. But it seems like a reasonable goal to issue the I/Os early, if we're going to issue them at some point. >> Attached is an example table/query, found by my script. Without the >> read_stream patch (i.e. just with the current index prefetching), it >> looks like this: > >> So it's more a case of "mitigating a regression" (finding regressions >> like this is the purpose of my script). Still, I believe the questions >> about the distance heuristics are valid. >> >> (Another interesting detail is that the regression happens only with >> io_method=worker, not with io_uring. I'm not sure why.) > > I find that the regression happens with io_uring. I also find that > your patch doesn't fix it. I have no idea why. > That's weird. Did you see an increase of the prefetch distance? What does the EXPLAIN ANALYZE say about that? regard -- Tomas Vondra
On Mon, Aug 25, 2025 at 2:33 PM Tomas Vondra <tomas@vondra.me> wrote: > Right. I might have expressed it more clearly, but this is what I meant > when I said priorbatch is not causing this. Cool. > As for priorbatch, I'd still like to know where does the overhead come > from. I mean, what's the expensive part of creating a read stream? Maybe > that can be fixed, instead of delaying the creation, etc. Maybe the > delay could happen within read_stream? Creating a read stream is probably really cheap. It's nevertheless expensive enough to make pgbench select about 3.5% slower. I don't think that there's really an "expensive part" for us to directly target here. Separately, it's probably also true that using a read stream to prefetch 2 or 3 pages ahead when on the first leaf page read isn't going to pay for itself. There just isn't enough time to spend on useful foreground work such that we can hide the latency of an I/O wait, I imagine. But there'll still be added costs to pay from using a read stream. Anyway, whether or not this happens in the read stream itself (versus keeping the current approach of simply deferring its creation) doesn't seem all that important to me. If we do it that way then we still have the problem of (eventually) figuring out when and how to tell the read stream that it's time to really start up now. That'll be the hard part, most likely -- and it doesn't have much to do with the general design of the read stream (unlike the problem with your query). > I'm not saying we should do whatever to meet effective_io_concurrency. > It just seems a bit strange to ignore it like this, because right now it > has absolutely no impact on the read stream. If the query gets into the > "collapsed distance", it'll happen with any effective_io_concurrency. That makes sense. > That's weird. Did you see an increase of the prefetch distance? What > does the EXPLAIN ANALYZE say about that? Yes, I did. In general I find that your patch from today is very good at keeping prefetch distance at approximately effective_io_concurrency -- perhaps even a bit too good. Overall, the details that I now see seem to match with my (possibly faulty) expectations about what'll work best: the distance certainly doesn't get stuck at ~2 anymore (it gets close to effective_io_concurrency for most possible effective_io_concurrency settings, I find). The "only" problem is that the new patch doesn't actually fix the regression itself. In fact, it seems to make it worse. With enable_indexscan_prefetch = off, the query takes 2794.551 ms on my system. With enable_indexscan_prefetch = on, and with your patch from today also applied, it takes 3488.997 ms. This is the case in spite of the fact that your patch does successfully lower "shared read=" time by a small amount (in addition to making the distance look much more sane, at least to me). For context, without your patch from today (but with the base index prefetching patch still applied), the same query takes 3162.195 ms. In spite of "shared read=" time being higher than any other case, and in spite of the fact that distance gets stuck at ~2/just looks wrong. (Like I said, the patch seems to actually make the problem worse on my system.) -- Peter Geoghegan
Hi, On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote: > Thanks. Based on the testing so far, the patch seems to be a substantial > improvement. What's needed to make this prototype committable? Mainly some testing infrastructure that can trigger this kind of stream. The logic is too finnicky for me to commit it without that. > I assume this is PG19+ improvement, right? It probably affects PG18 too, > but it's harder to hit / the impact is not as bad as on PG19. Yea. It does apply to 18 too, but I can't come up with realistic scenarios where it's a real issue. I can repro a slowdown when using many parallel seqscans with debug_io_direct=data - but that's even slower in 17... > On a related note, my test that generates random datasets / queries, and > compares index prefetching with different io_method values found a > pretty massive difference between worker and io_uring. I wonder if this > might be some issue in io_method=worker. > while with index prefetching (with the aio prototype patch), it looks > like this: > > QUERY PLAN > ---------------------------------------------------------------------- > Index Scan using idx on t (actual rows=9048576.00 loops=1) > Index Cond: ((a >= 16150) AND (a <= 4540437)) > Index Searches: 1 > Prefetch Distance: 2.032 > Prefetch Count: 868165 > Prefetch Stalls: 2140228 > Prefetch Skips: 6039906 > Prefetch Resets: 0 > Stream Ungets: 0 > Stream Forwarded: 4 > Prefetch Histogram: [2,4) => 855753, [4,8) => 12412 > Buffers: shared hit=2577599 read=455610 > Planning: > Buffers: shared hit=78 read=26 dirtied=1 > Planning Time: 1.032 ms > Execution Time: 3150.578 ms > (16 rows) > > So it's about 2x slower. The prefetch distance collapses, because > there's a lot of cache hits (about 50% of requests seem to be hits of > already visited blocks). I think that's a problem with how we adjust the > distance, but I'll post about that separately. > > Let's try to simply set io_method=io_uring: > > QUERY PLAN > ---------------------------------------------------------------------- > Index Scan using idx on t (actual rows=9048576.00 loops=1) > Index Cond: ((a >= 16150) AND (a <= 4540437)) > Index Searches: 1 > Prefetch Distance: 2.032 > Prefetch Count: 868165 > Prefetch Stalls: 2140228 > Prefetch Skips: 6039906 > Prefetch Resets: 0 > Stream Ungets: 0 > Stream Forwarded: 4 > Prefetch Histogram: [2,4) => 855753, [4,8) => 12412 > Buffers: shared hit=2577599 read=455610 > Planning: > Buffers: shared hit=78 read=26 > Planning Time: 2.212 ms > Execution Time: 1837.615 ms > (16 rows) > > That's much closer to master (and the difference could be mostly noise). > > I'm not sure what's causing this, but almost all regressions my script > is finding look like this - always io_method=worker, with distance close > to 2.0. Is this some inherent io_method=worker overhead? I think what you might be observing might be the inherent IPC / latency overhead of the worker based approach. This is particularly pronounced if the workers are idle (and the CPU they get scheduled on is clocked down). The latency impact of that is small, but if you never actually get to do much readahead it can be visible. Greetings, Andres Freund
On Mon Aug 25, 2025 at 10:18 AM EDT, Tomas Vondra wrote: > The attached patch is a PoC implementing this. The core idea is that if > we measure "miss probability" for a chunk of requests, we can use that > to estimate the distance needed to generate e_i_c IOs. I noticed an assertion failure when the tests run. Looks like something about the patch breaks the read stream from the point of view of VACUUM: TRAP: failed Assert("stream->pinned_buffers + stream->pending_read_nblocks <= stream->max_pinned_buffers"), File: "../source/src/backend/storage/aio/read_stream.c",Line: 402, PID: 1238204 [0x55e71f653d29] read_stream_start_pending_read: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/storage/aio/read_stream.c:401 [0x55e71f6533ad] read_stream_look_ahead: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/storage/aio/read_stream.c:670 [0x55e71f652e9a] read_stream_next_buffer: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/storage/aio/read_stream.c:1173 [0x55e71f34cd2b] lazy_scan_heap: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/access/heap/vacuumlazy.c:1310 [0x55e71f34cd2b] heap_vacuum_rel: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/access/heap/vacuumlazy.c:839 [0x55e71f49a3f4] table_relation_vacuum: ../source/src/include/access/tableam.h:1670 [0x55e71f49a3f4] vacuum_rel: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/commands/vacuum.c:2296 [0x55e71f499e8f] vacuum: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/commands/vacuum.c:636 [0x55e71f49931d] ExecVacuum: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/commands/vacuum.c:468 [0x55e71f6a69f7] standard_ProcessUtility: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/tcop/utility.c:862 [0x55e71f6a67d7] ProcessUtility: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/tcop/utility.c:523 [0x55e71f6a630b] PortalRunUtility: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/tcop/pquery.c:1153 [0x55e71f6a59b3] PortalRunMulti: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/tcop/pquery.c:0 [0x55e71f6a52c5] PortalRun: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/tcop/pquery.c:788 [0x55e71f6a4119] exec_simple_query: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/tcop/postgres.c:1274 [0x55e71f6a1b84] PostgresMain: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/tcop/postgres.c:0 [0x55e71f69c078] BackendMain: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/tcop/backend_startup.c:124 [0x55e71f5e5eda] postmaster_child_launch: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/postmaster/launch_backend.c:290 [0x55e71f5ea847] BackendStartup: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/postmaster/postmaster.c:3587 [0x55e71f5ea847] ServerLoop: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/postmaster/postmaster.c:1702 [0x55e71f5e86d9] PostmasterMain: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/postmaster/postmaster.c:1400 [0x55e71f51acd9] main: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/main/main.c:231 [0x7ff312633ca7] __libc_start_call_main: ../sysdeps/nptl/libc_start_call_main.h:58 [0x7ff312633d64] __libc_start_main_impl: ../csu/libc-start.c:360 [0x55e71f2e09a0] [unknown]: [unknown]:0 2025-08-25 21:05:28.915 EDT postmaster[1236725] LOG: client backend (PID 1238204) was terminated by signal 6: Aborted 
2025-08-25 21:05:28.915 EDT postmaster[1236725] DETAIL: Failed process was running: VACUUM (PARALLEL 0, BUFFER_USAGE_LIMIT 128) test_io_vac_strategy; 2025-08-25 21:05:28.915 EDT postmaster[1236725] LOG: terminating any other active server processes 2025-08-25 21:05:28.915 EDT postmaster[1236725] LOG: all server processes terminated; reinitializing -- Peter Geoghegan
On 8/26/25 03:08, Peter Geoghegan wrote: > On Mon Aug 25, 2025 at 10:18 AM EDT, Tomas Vondra wrote: >> The attached patch is a PoC implementing this. The core idea is that if >> we measure "miss probability" for a chunk of requests, we can use that >> to estimate the distance needed to generate e_i_c IOs. > > I noticed an assertion failure when the tests run. Looks like something about > the patch breaks the read stream from the point of view of VACUUM: > > TRAP: failed Assert("stream->pinned_buffers + stream->pending_read_nblocks <= stream->max_pinned_buffers"), File: "../source/src/backend/storage/aio/read_stream.c",Line: 402, PID: 1238204 > [0x55e71f653d29] read_stream_start_pending_read: /mnt/nvme/postgresql/patch/build_meson_dc/../source/src/backend/storage/aio/read_stream.c:401 Seems the distance adjustment was not quite right, didn't enforce the limit on pinned buffers, and the distance could get too high. The attached version should fix that ... But there's still something wrong. I tried running check-world, and I see 027_stream_regress.pl is getting stuck in join.sql, for the query on line 417. I haven't figured this out yet, but there's a mergejoin. It does reset the stream a lot, so maybe there's something wrong there ... It's strange, though. Why would a different distance make the query stuck? Anyway, Thomas' patch from [1] doesn't seem to have this issue. And maybe it's a better / more elegant approach in general? [1] https://www.postgresql.org/message-id/CA%2BhUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw%40mail.gmail.com -- Tomas Vondra
Attachments
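The miss-probability heuristic quoted above ("if we measure 'miss probability' for a chunk of requests, we can use that to estimate the distance needed to generate e_i_c IOs") boils down to simple arithmetic. A minimal sketch, assuming the estimate is just effective_io_concurrency divided by the observed miss rate -- the function name and formula below are illustrative, not code from the PoC patch:

#include <math.h>

/*
 * Sketch only: estimate the lookahead distance needed to keep roughly
 * "eic" I/Os in flight, given the fraction of recent requests that
 * actually missed shared buffers.
 */
static int
estimate_prefetch_distance(int eic, double miss_probability)
{
    if (miss_probability <= 0.0)
        return 1;               /* everything was cached; no reason to look far ahead */

    return (int) ceil(eic / miss_probability);
}

For example, with effective_io_concurrency = 100 and a ~30% miss rate this suggests a distance of roughly 330, which is the same order of magnitude as the distances reported later in the thread.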
On 8/26/25 01:48, Andres Freund wrote: > Hi, > > On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote: >> Thanks. Based on the testing so far, the patch seems to be a substantial >> improvement. What's needed to make this prototype committable? > > Mainly some testing infrastructure that can trigger this kind of stream. The > logic is too finnicky for me to commit it without that. > So, what would that look like? The "naive" approach to testing is to simply generate a table/index, producing the right sequence of blocks. That shouldn't be too hard, it'd be enough to have an index that - has ~2-3 rows per value, on different heap pages - the values "overlap", e.g. like this (value,page) (A,1), (A,2), (A,3), (B,2), (B,3), (B,4), ... Another approach would be to test this at C level, sidestepping the query execution entirely. We'd have a "stream generator" that just generates a sequence of blocks of our own choosing (could be hard-coded, some pattern, read from a file ...), and feed it into a read stream. But how would we measure success for these tests? I don't think we want to look at query duration, that's very volatile. > >> I assume this is PG19+ improvement, right? It probably affects PG18 too, >> but it's harder to hit / the impact is not as bad as on PG19. > > Yea. It does apply to 18 too, but I can't come up with realistic scenarios > where it's a real issue. I can repro a slowdown when using many parallel > seqscans with debug_io_direct=data - but that's even slower in 17... > Makes sense. > >> On a related note, my test that generates random datasets / queries, and >> compares index prefetching with different io_method values found a >> pretty massive difference between worker and io_uring. I wonder if this >> might be some issue in io_method=worker. > >> while with index prefetching (with the aio prototype patch), it looks >> like this: >> >> QUERY PLAN >> ---------------------------------------------------------------------- >> Index Scan using idx on t (actual rows=9048576.00 loops=1) >> Index Cond: ((a >= 16150) AND (a <= 4540437)) >> Index Searches: 1 >> Prefetch Distance: 2.032 >> Prefetch Count: 868165 >> Prefetch Stalls: 2140228 >> Prefetch Skips: 6039906 >> Prefetch Resets: 0 >> Stream Ungets: 0 >> Stream Forwarded: 4 >> Prefetch Histogram: [2,4) => 855753, [4,8) => 12412 >> Buffers: shared hit=2577599 read=455610 >> Planning: >> Buffers: shared hit=78 read=26 dirtied=1 >> Planning Time: 1.032 ms >> Execution Time: 3150.578 ms >> (16 rows) >> >> So it's about 2x slower. The prefetch distance collapses, because >> there's a lot of cache hits (about 50% of requests seem to be hits of >> already visited blocks). I think that's a problem with how we adjust the >> distance, but I'll post about that separately. >> >> Let's try to simply set io_method=io_uring: >> >> QUERY PLAN >> ---------------------------------------------------------------------- >> Index Scan using idx on t (actual rows=9048576.00 loops=1) >> Index Cond: ((a >= 16150) AND (a <= 4540437)) >> Index Searches: 1 >> Prefetch Distance: 2.032 >> Prefetch Count: 868165 >> Prefetch Stalls: 2140228 >> Prefetch Skips: 6039906 >> Prefetch Resets: 0 >> Stream Ungets: 0 >> Stream Forwarded: 4 >> Prefetch Histogram: [2,4) => 855753, [4,8) => 12412 >> Buffers: shared hit=2577599 read=455610 >> Planning: >> Buffers: shared hit=78 read=26 >> Planning Time: 2.212 ms >> Execution Time: 1837.615 ms >> (16 rows) >> >> That's much closer to master (and the difference could be mostly noise). 
>> >> I'm not sure what's causing this, but almost all regressions my script >> is finding look like this - always io_method=worker, with distance close >> to 2.0. Is this some inherent io_method=worker overhead? > > I think what you might be observing might be the inherent IPC / latency > overhead of the worker based approach. This is particularly pronounced if the > workers are idle (and the CPU they get scheduled on is clocked down). The > latency impact of that is small, but if you never actually get to do much > readahead it can be visible. > Yeah, that's quite possible. If I understand the mechanics of this, this can behave in a rather unexpected way - lowering the load (i.e. issuing fewer I/O requests) can make the workers "more idle" and therefore more likely to get suspended ... Is there a good way to measure if this is what's happening, and the impact? For example, it'd be interesting to know how long it took for a submitted process to get picked up by a worker. And % of time a worker spent handling I/O. regards -- Tomas Vondra
On 8/26/25 17:06, Tomas Vondra wrote: > > > On 8/26/25 01:48, Andres Freund wrote: >> Hi, >> >> On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote: >>> Thanks. Based on the testing so far, the patch seems to be a substantial >>> improvement. What's needed to make this prototype committable? >> >> Mainly some testing infrastructure that can trigger this kind of stream. The >> logic is too finnicky for me to commit it without that. >> > > So, what would that look like? The "naive" approach to testing is to > simply generate a table/index, producing the right sequence of blocks. > That shouldn't be too hard, it'd be enough to have an index that > > - has ~2-3 rows per value, on different heap pages > - the values "overlap", e.g. like this (value,page) > > (A,1), (A,2), (A,3), (B,2), (B,3), (B,4), ... > > Another approach would be to test this at C level, sidestepping the > query execution entirely. We'd have a "stream generator" that just > generates a sequence of blocks of our own choosing (could be hard-coded, > some pattern, read from a file ...), and feed it into a read stream. > > But how would we measure success for these tests? I don't think we want > to look at query duration, that's very volatile. > >> >>> I assume this is PG19+ improvement, right? It probably affects PG18 too, >>> but it's harder to hit / the impact is not as bad as on PG19. >> >> Yea. It does apply to 18 too, but I can't come up with realistic scenarios >> where it's a real issue. I can repro a slowdown when using many parallel >> seqscans with debug_io_direct=data - but that's even slower in 17... >> > > Makes sense. > >> >>> On a related note, my test that generates random datasets / queries, and >>> compares index prefetching with different io_method values found a >>> pretty massive difference between worker and io_uring. I wonder if this >>> might be some issue in io_method=worker. >> >>> while with index prefetching (with the aio prototype patch), it looks >>> like this: >>> >>> QUERY PLAN >>> ---------------------------------------------------------------------- >>> Index Scan using idx on t (actual rows=9048576.00 loops=1) >>> Index Cond: ((a >= 16150) AND (a <= 4540437)) >>> Index Searches: 1 >>> Prefetch Distance: 2.032 >>> Prefetch Count: 868165 >>> Prefetch Stalls: 2140228 >>> Prefetch Skips: 6039906 >>> Prefetch Resets: 0 >>> Stream Ungets: 0 >>> Stream Forwarded: 4 >>> Prefetch Histogram: [2,4) => 855753, [4,8) => 12412 >>> Buffers: shared hit=2577599 read=455610 >>> Planning: >>> Buffers: shared hit=78 read=26 dirtied=1 >>> Planning Time: 1.032 ms >>> Execution Time: 3150.578 ms >>> (16 rows) >>> >>> So it's about 2x slower. The prefetch distance collapses, because >>> there's a lot of cache hits (about 50% of requests seem to be hits of >>> already visited blocks). I think that's a problem with how we adjust the >>> distance, but I'll post about that separately. 
>>> >>> Let's try to simply set io_method=io_uring: >>> >>> QUERY PLAN >>> ---------------------------------------------------------------------- >>> Index Scan using idx on t (actual rows=9048576.00 loops=1) >>> Index Cond: ((a >= 16150) AND (a <= 4540437)) >>> Index Searches: 1 >>> Prefetch Distance: 2.032 >>> Prefetch Count: 868165 >>> Prefetch Stalls: 2140228 >>> Prefetch Skips: 6039906 >>> Prefetch Resets: 0 >>> Stream Ungets: 0 >>> Stream Forwarded: 4 >>> Prefetch Histogram: [2,4) => 855753, [4,8) => 12412 >>> Buffers: shared hit=2577599 read=455610 >>> Planning: >>> Buffers: shared hit=78 read=26 >>> Planning Time: 2.212 ms >>> Execution Time: 1837.615 ms >>> (16 rows) >>> >>> That's much closer to master (and the difference could be mostly noise). >>> >>> I'm not sure what's causing this, but almost all regressions my script >>> is finding look like this - always io_method=worker, with distance close >>> to 2.0. Is this some inherent io_method=worker overhead? >> >> I think what you might be observing might be the inherent IPC / latency >> overhead of the worker based approach. This is particularly pronounced if the >> workers are idle (and the CPU they get scheduled on is clocked down). The >> latency impact of that is small, but if you never actually get to do much >> readahead it can be visible. >> > > Yeah, that's quite possible. If I understand the mechanics of this, this > can behave in a rather unexpected way - lowering the load (i.e. issuing > fewer I/O requests) can make the workers "more idle" and therefore more > likely to get suspended ... > > Is there a good way to measure if this is what's happening, and the > impact? For example, it'd be interesting to know how long it took for a > submitted process to get picked up by a worker. And % of time a worker > spent handling I/O. > After investigating this a bit more, I'm not sure it's due to workers getting idle / CPU clocked down, etc. I did an experiment with booting with idle=poll, which AFAICS should prevent cores from idling, etc. And it made pretty much no difference - timings didn't change. It can still be about IPC, but it does not seem to be about clocked-down cores, or stuff like that. Maybe. I ran a more extensive set of tests, varying additional parameters: - iomethod: io_uring / worker (3 or 12 workers) - shared buffers: 512MB / 16GB (table is ~3GB) - checksums on / off - eic: 16 / 100 - difference SSD devices and comparing master vs. builds with different variants of the patches: - master - patched (index prefetching) - no-explain (EXPLAIN ANALYZE reverted) - munro / vondra (WIP patches preventing distance collapse) - munro-no-explain / vondra-no-explain (should be obvious) We've been speculating (me and Peter) maybe the extra read_stream stats add a lot of overhead, hence the "no-explain" builds to test that. All of this is with the recent "aio" patch eliminating I/O waits. Attached are results from my "ryzen" machine (xeon is very similar), sliced/colored to show patterns. It's for query: SELECT * FROM ( SELECT * FROM t WHERE a BETWEEN 16150 AND 4540437 ORDER BY a ASC ) OFFSET 1000000000; Which is the same query as before, except that it's not EXPLAIN ANALYZE, and it has OFFSET so that it does not send any data back. It's a bit of an adversarial query, it doesn't seem to benefit from prefetching. There are some very clear patterns in the results. In the "cold" (uncached) runs: * io_uring does much better, with limited regressions (not negligible, but limited compared to io_method=worker). 
A hint this may really be about IPC? * With worker, there's a massive regression with the basic prefetching patch (when the distance collapses to 2.0). But then it mostly recovers with the increased distance, and even does a bit better than master (or on par with io_uring). In the "warm" runs (with everything cached in page cache, possibly even in shared buffers): * With 16GB shared buffers, the regressions are about the same as for cold runs, both for io_uring and worker. Roughly ~5%, give or take. The extra read_stream stats seem to add ~3%. * With 512MB it's much more complicated. io_uring regresses much more (relative to master), for some reason. For cold runs it was ~30%, now it's ~50%. Seems weird, but I guess there's fixed overhead and it's more visible with data in cache. * For worker (with buffers=512MB), the basic patch clearly causes a massive regression, it's about 2x slower. I don't really understand why - the assumption was this is because of idling, but is it, if it happens with idle=poll? In top, I see the backend takes ~60%, and the io worker ~40% (so they clearly ping-pong the work). 40% utilization does not seem particularly low (and with idle=poll it should not idle anyway). I realize there's IPC with worker, and it's going to be more visible for cases that end up doing no prefetching. But isn't a 2x regression a bit too high? I wouldn't have expected that. Any good way to measure how expensive the IPC is? * With the increased prefetch distance, the regression drops to ~25% (for worker). And in top I see the backend takes ~100%, and the single worker uses ~60%. But the 25% is without checksums. With checksums, the regression is roughly 5%. I'm not sure what to think about this. -- Tomas Vondra
Attachments
On 8/26/25 17:06, Tomas Vondra wrote: > > > On 8/26/25 01:48, Andres Freund wrote: >> Hi, >> >> On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote: >>> >>> ... >>> >>> I'm not sure what's causing this, but almost all regressions my script >>> is finding look like this - always io_method=worker, with distance close >>> to 2.0. Is this some inherent io_method=worker overhead? >> >> I think what you might be observing might be the inherent IPC / latency >> overhead of the worker based approach. This is particularly pronounced if the >> workers are idle (and the CPU they get scheduled on is clocked down). The >> latency impact of that is small, but if you never actually get to do much >> readahead it can be visible. >> > > Yeah, that's quite possible. If I understand the mechanics of this, this > can behave in a rather unexpected way - lowering the load (i.e. issuing > fewer I/O requests) can make the workers "more idle" and therefore more > likely to get suspended ... > > Is there a good way to measure if this is what's happening, and the > impact? For example, it'd be interesting to know how long it took for a > submitted process to get picked up by a worker. And % of time a worker > spent handling I/O. > I kept thinking about this, and in the end I decided to try to measure this IPC overhead. The backend/ioworker communicate by sending signals, so I wrote a simple C program that does "signal echo" with two processes (one fork). It works like this: 1) fork a child process 2) send a signal to the child 3) child notices the signal, sends a response signal back 4) after receiving response, go back to (2) This happens until the requested number of signals is sent, and then it prints stats like signals/second etc. The C file is attached, I'm sure it's imperfect but it does the trick. And the results mostly agree with the benchmark results from yesterday. Which makes sense, because if the distance collapses to ~1, the AIO with io_method=worker starts doing about the same thing for every block. If I run the signal test on the ryzen machine, I get this: ----------------------------------------------------------------------- root@ryzen:~# ./signal-echo 1000000 nmm_signals = 1000000 parent: sent 100000 signals in 196909 us (1.97) ... parent: sent 1000000 signals in 1924263 us (1.92 us) signals / sec = 519679.48 ----------------------------------------------------------------------- So it can do about 500k signals / second. This means that requesting blocks one by one (with distance=1), a single worker can do about 4GB/s, assuming there's no other work (no actual I/O, no checksum checks, ...). Consider the warm runs with 512MB shared buffers, which means there's no I/O but the data needs to be copied from page cache (by the worker). An explain analyze for the query says this: Buffers: shared hit=2573018 read=455610 That's 455610 blocks to read, mostly one by one. So a bit less than 1 second just for the IPC, but there's also the memcpy etc. An example result from the benchmark looks like this: master: 967ms patched: 2353ms So that's ~1400ms difference. So a bit more, but in the right ballpark, and the extra overhead could be the due to AIO being more complex than sync I/O, etc. Not sure. The xeon can do ~190k signals/second, i.e. about 1/3 of ryzen, so the index scan would spend ~3 seconds on the IPC. Timings for the same test look like this: master: 3049ms patched: 9636ms So, that's about 2x the expected difference. 
Not sure where the extra overhead comes from, might be due to NUMA (which the ryzen does not have). So I think the IPC overhead with "worker" can be quite significant, especially for cases with distance=1. I don't think it's a major issue for PG18, because seq/bitmap scans are unlikely to collapse the distance like this. And with larger distances the cost amortizes. It's much bigger issue for the index prefetching, it seems. This is for the "warm" runs with 512MB, with the basic prefetch patch. I'm not sure it explains the overhead with the patches that increase the prefetch distance (be it mine or Thomas' patch), or cold runs. The regressions seem to be smaller in those cases, though. regards -- Tomas Vondra
Attachments
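The attached C file isn't reproduced here, but the described ping-pong is simple enough that a minimal stand-in looks something like the following (a rough re-creation of the idea, not the actual attachment):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Signal echo: parent and child bounce SIGUSR1 back and forth N times,
 * then the parent reports round trips per second.
 */
int
main(int argc, char **argv)
{
    long        nsignals = (argc > 1) ? atol(argv[1]) : 100000;
    sigset_t    set;
    int         sig;
    pid_t       child;
    struct timeval start, end;

    /* block SIGUSR1 in both processes before forking, so none are lost */
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigprocmask(SIG_BLOCK, &set, NULL);

    child = fork();
    if (child == 0)
    {
        /* child: wait for a signal, echo it back to the parent, repeat */
        pid_t parent = getppid();

        for (long i = 0; i < nsignals; i++)
        {
            sigwait(&set, &sig);
            kill(parent, SIGUSR1);
        }
        _exit(0);
    }

    gettimeofday(&start, NULL);
    for (long i = 0; i < nsignals; i++)
    {
        kill(child, SIGUSR1);   /* "submit the I/O" */
        sigwait(&set, &sig);    /* wait for the "completion" */
    }
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;

    printf("round trips / sec = %.2f\n", nsignals / secs);
    waitpid(child, NULL, 0);
    return 0;
}

Each round trip costs two signal deliveries plus two scheduler wakeups, which is roughly what a distance-1 read stream pays per block with io_method=worker.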
Hi, On 2025-08-26 17:06:11 +0200, Tomas Vondra wrote: > On 8/26/25 01:48, Andres Freund wrote: > > Hi, > > > > On 2025-08-25 15:00:39 +0200, Tomas Vondra wrote: > >> Thanks. Based on the testing so far, the patch seems to be a substantial > >> improvement. What's needed to make this prototype committable? > > > > Mainly some testing infrastructure that can trigger this kind of stream. The > > logic is too finnicky for me to commit it without that. > > > > So, what would that look like? I'm thinking of something like an SQL function that accepts a relation and a series of block numbers, which creates a read stream reading the passed in block numbers. Combined with the injection points that are already used in test_aio, that should allow to test things that I don't know how to test without that. E.g. encountering an already-in-progress multi-block IO that only completes partially. > Another approach would be to test this at C level, sidestepping the > query execution entirely. We'd have a "stream generator" that just > generates a sequence of blocks of our own choosing (could be hard-coded, > some pattern, read from a file ...), and feed it into a read stream. > > But how would we measure success for these tests? I don't think we want > to look at query duration, that's very volatile. Yea, the performance effects would be harder to test, what I care more about is the error paths. Those are really hard to test interactively. Greetings, Andres Freund
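To make the shape of that concrete: such a function mostly needs a block-number callback that replays a caller-supplied list, plus the usual consume loop. A rough sketch assuming the current read stream API (the SQL-function plumbing and injection-point hooks are left out, and the names here are made up):

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/read_stream.h"
#include "utils/rel.h"

/* hypothetical state for replaying a fixed list of block numbers */
typedef struct BlockListState
{
    BlockNumber *blocks;
    int         nblocks;
    int         next;
} BlockListState;

static BlockNumber
block_list_next(ReadStream *stream, void *callback_private_data,
                void *per_buffer_data)
{
    BlockListState *state = (BlockListState *) callback_private_data;

    if (state->next >= state->nblocks)
        return InvalidBlockNumber;  /* stream exhausted */

    return state->blocks[state->next++];
}

/* read (and immediately release) the given blocks through a read stream */
static void
read_block_list(Relation rel, BlockNumber *blocks, int nblocks)
{
    BlockListState state = {blocks, nblocks, 0};
    ReadStream *stream;
    Buffer      buf;

    stream = read_stream_begin_relation(READ_STREAM_DEFAULT,
                                        NULL,   /* no buffer access strategy */
                                        rel,
                                        MAIN_FORKNUM,
                                        block_list_next,
                                        &state,
                                        0);     /* no per-buffer data */

    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
        ReleaseBuffer(buf);

    read_stream_end(stream);
}

Wrapped in a SQL-callable function and combined with the injection points already used by test_aio, this would allow scripting arbitrary block sequences, including the partially completed multi-block IO case mentioned above.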
Hi, On 2025-08-28 14:45:24 +0200, Tomas Vondra wrote: > On 8/26/25 17:06, Tomas Vondra wrote: > I kept thinking about this, and in the end I decided to try to measure > this IPC overhead. The backend/ioworker communicate by sending signals, > so I wrote a simple C program that does "signal echo" with two processes > (one fork). It works like this: > > 1) fork a child process > 2) send a signal to the child > 3) child notices the signal, sends a response signal back > 4) after receiving response, go back to (2) Nice! I think this might under-estimate the IPC cost a bit, because typically the parent and child process do not want to run at the same time, probably leading to them often being scheduled on the same core. Whereas a shallow IO queue will lead to some concurrent activity, just not enough to hide the IPC latency... But I don't think this matters in the grand scheme of things. > So I think the IPC overhead with "worker" can be quite significant, > especially for cases with distance=1. I don't think it's a major issue > for PG18, because seq/bitmap scans are unlikely to collapse the distance > like this. And with larger distances the cost amortizes. It's much > bigger issue for the index prefetching, it seems. I couldn't keep up with all the discussion, but are there actually valid I/O bound cases (i.e. not ones where we erroneously keep the distance short) where index scans can't have a higher distance? Obviously you can construct cases with a low distance by having indexes point to a lot of tiny tuples pointing to perfectly correlated pages, but in that case IO can't be a significant factor. Greetings, Andres Freund
On 8/28/25 18:16, Andres Freund wrote: > Hi, > > On 2025-08-28 14:45:24 +0200, Tomas Vondra wrote: >> On 8/26/25 17:06, Tomas Vondra wrote: >> I kept thinking about this, and in the end I decided to try to measure >> this IPC overhead. The backend/ioworker communicate by sending signals, >> so I wrote a simple C program that does "signal echo" with two processes >> (one fork). It works like this: >> >> 1) fork a child process >> 2) send a signal to the child >> 3) child notices the signal, sends a response signal back >> 4) after receiving response, go back to (2) > > Nice! > > I think this might under-estimate the IPC cost a bit, because typically the > parent and child process do not want to run at the same time, probably leading > to them often being scheduled on the same core. Whereas a shallow IO queue > will lead to some concurrent activity, just not enough to hide the IPC > latency... But I don't think this matters in the grand scheme of things. > Right. I thought about measuring this stuff (different cores, different NUMA nodes, maybe adding some sleeps to simulate "idle"), but I chose to keep it simple for now. > >> So I think the IPC overhead with "worker" can be quite significant, >> especially for cases with distance=1. I don't think it's a major issue >> for PG18, because seq/bitmap scans are unlikely to collapse the distance >> like this. And with larger distances the cost amortizes. It's much >> bigger issue for the index prefetching, it seems. > > I couldn't keep up with all the discussion, but are there actually valid I/O > bound cases (i.e. not ones where we erroneously keep the distance short) where > index scans can't have a higher distance? > I don't know, really. Is the presented example really a case of an "erroneously short distance"? From the 2x regression (compared to master) it might seem like that, but even with the increased distance it's still slower than master (by 25%). So maybe the "error" is to use AIO in these cases, instead of just switching to I/O done by the backend. It may be a bit worse for non-btree indexes, e.g. for ordered scans on gist indexes (getting the next tuple may require reading many leaf pages, so maybe we can't look too far ahead?). Or for indexes with naturally "fat" tuples, which limits how many tuples we see ahead. > Obviously you can construct cases with a low distance by having indexes point > to a lot of tiny tuples pointing to perfectly correlated pages, but in that > case IO can't be a significant factor. > It's definitely true the examples the script finds are "adversarial", but also not entirely unrealistic. I suppose there will be such cases for any heuristics we come up with. There's probably more cases like this, where we end up with many hits. Say, a merge join may visit index tuples repeatedly, and so on. But then it's likely in shared buffers, so there won't be any IPC. regards -- Tomas Vondra
Hi, On 2025-08-28 19:08:40 +0200, Tomas Vondra wrote: > On 8/28/25 18:16, Andres Freund wrote: > >> So I think the IPC overhead with "worker" can be quite significant, > >> especially for cases with distance=1. I don't think it's a major issue > >> for PG18, because seq/bitmap scans are unlikely to collapse the distance > >> like this. And with larger distances the cost amortizes. It's much > >> bigger issue for the index prefetching, it seems. > > > > I couldn't keep up with all the discussion, but is there actually valid I/O > > bound cases (i.e. not ones were we erroneously keep the distance short) where > > index scans end can't have a higher distance? > > > > I don't know, really. > > Is the presented exaple really a case of an "erroneously short > distance"? I think the query isn't actually measuring something particularly useful in the general case. You're benchmarking something were the results are never looked at - which means the time between two index fetches is unrealistically short. That means any tiny latency increase matters a lot more than with realistic queries. And this is, IIUC, on a local SSD. I'd bet that on cloud latencies AIO would still be a huge win. > From the 2x regression (compared to master) it might seem like that, but > even with the increased distance it's still slower than master (by 25%). So > maybe the "error" is to use AIO in these cases, instead of just switching to > I/O done by the backend. If it's slower at a higher distance, we're missing something. > It may be a bit worse for non-btree indexes, e.g. for for ordered scans > on gist indexes (getting the next tuple may require reading many leaf > pages, so maybe we can't look too far ahead?). Or for indexes with > naturally "fat" tuples, which limits how many tuples we see ahead. I am not worried at all about those cases. If you have to read a lot of index leaf pages to get a heap fetch, a distance of even just 2 will be fine, because the IPC overhead is a neglegible cost compared to the index processing. Similarly, if you have to do very deep index traversals due to wide index tuples, there's going to be more time between two table fetches. > > Obviously you can construct cases with a low distance by having indexes point > > to a lot of tiny tuples pointing to perfectly correlated pages, but in that > > case IO can't be a significant factor. > > > > It's definitely true the examples the script finds are "adversary", but > also not entirely unrealistic. I think doing index scans where the results are just thrown out are entirely unrealistic... > I suppose there will be such cases for any heuristics we come up with. Agreed. > There's probably more cases like this, where we end up with many hits. > Say, a merge join may visit index tuples repeatedly, and so on. But then > it's likely in shared buffers, so there won't be any IPC. Yea, I'd not expect a meaningful impact of any of this in a workload like that. Greetings, Andres Freund
On Fri, Aug 29, 2025 at 7:52 AM Andres Freund <andres@anarazel.de> wrote: > On 2025-08-28 19:08:40 +0200, Tomas Vondra wrote: > > From the 2x regression (compared to master) it might seem like that, but > > even with the increased distance it's still slower than master (by 25%). So > > maybe the "error" is to use AIO in these cases, instead of just switching to > > I/O done by the backend. > > If it's slower at a higher distance, we're missing something. Enough io_workers? What kind of I/O concurrency does it want? Does wait_event show any backends doing synchronous IO? How many does [1] want to run for that test workload and does it help? FWIW there's a very simple canned latency test in a SQL function in the first message in that thread (0005-XXX-read_buffer_loop.patch), just on the off-chance that it's useful as a starting point for other ideas. There I was interested in IPC overheads, latch collapsing and other effects, so I was deliberately stalling on/evicting a single block repeatedly without any readahead distance, so I wasn't letting the stream "hide" IPC overheads. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com
On 8/28/25 23:50, Thomas Munro wrote: > On Fri, Aug 29, 2025 at 7:52 AM Andres Freund <andres@anarazel.de> wrote: >> On 2025-08-28 19:08:40 +0200, Tomas Vondra wrote: >>> From the 2x regression (compared to master) it might seem like that, but >>> even with the increased distance it's still slower than master (by 25%). So >>> maybe the "error" is to use AIO in these cases, instead of just switching to >>> I/O done by the backend. >> >> If it's slower at a higher distance, we're missing something. > > Enough io_workers? What kind of I/O concurrency does it want? Does > wait_event show any backends doing synchronous IO? How many does [1] > want to run for that test workload and does it help? > I'm not sure how to determine what concurrency it "wants". All I know is that for "warm" runs [1], the basic index prefetch patch uses distance ~2.0 on average, and is ~2x slower than master. And with the patches the distance is ~270, and it's 30% slower than master. (IIRC there's about 30% misses, so 270 is fairly high. Can't check now, the machine is running other tests.) Not sure about wait events, but I don't think any backends are doing synchronous I/O. There's only that one query running, and it's using AIO (except for the index, which is still read synchronously). Likewise, I don't think there's an insufficient number of workers. I've tried with 3 and 12 workers, and there's virtually no difference between those. IIRC when watching "top", I've never seen more than 1 or maybe 2 workers active (using CPU). [1] https://www.postgresql.org/message-id/attachment/180630/ryzen-warm.pdf [2] https://www.postgresql.org/message-id/293a4735-79a4-499c-9a36-870ee9286281%40vondra.me > FWIW there's a very simple canned latency test in a SQL function in > the first message in that thread (0005-XXX-read_buffer_loop.patch), > just on the off-chance that it's useful as a starting point for other > ideas. There I was interested in IPC overheads, latch collapsing and > other effects, so I was deliberately stalling on/evicting a single > block repeatedly without any readahead distance, so I wasn't letting > the stream "hide" IPC overheads. > > [1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2Bm4xV0LMoH2c%3DoRAdEXuCnh%2BtGBTWa7uFeFMGgTLAw%2BQ%40mail.gmail.com Interesting, I'll give it a try tomorrow. Do you recall if the results were roughly in line with results of my signal IPC test? regards -- Tomas Vondra
On Thu, Aug 28, 2025 at 7:01 PM Tomas Vondra <tomas@vondra.me> wrote: > I'm not sure how to determine what concurrency it "wants". All I know is > that for "warm" runs [1], the basic index prefetch patch uses distance > ~2.0 on average, and is ~2x slower than master. And with the patches the > distance is ~270, and it's 30% slower than master. (IIRC there's about > 30% misses, so 270 is fairly high. Can't check now, the machine is > running other tests.) Is it possible that the increased distance only accidentally ameliorates the IPC issues that you're seeing with method=worker? I mentioned already that it makes things a bit slower with io_uring, for the same test case. I mean, if you use io_uring then things work out strictly worse with that extra patch...so something doesn't seem right. I notice that the test case in question manages to merge plenty of reads together with other pending reads, within read_stream_look_ahead (I added something to our working branch that'll show that information in EXPLAIN ANALYZE). My wild guess is that an increased distance could interact with that, somewhat masking the IPC problems with method=worker. Could that explain it? It seems possible that the distance is already roughly optimal, without your patch (or Thomas' similar read stream patch). It may be that we just aren't converging on "no prefetch" behavior when we ought to, given such a low distance. If this theory of mine was correct, it would reconcile the big differences we see between "worker vs io_uring" with your patch + test case. -- Peter Geoghegan
Hi, On 2025-08-29 01:00:58 +0200, Tomas Vondra wrote: > I'm not sure how to determine what concurrency it "wants". All I know is > that for "warm" runs [1], the basic index prefetch patch uses distance > ~2.0 on average, and is ~2x slower than master. And with the patches the > distance is ~270, and it's 30% slower than master. (IIRC there's about > 30% misses, so 270 is fairly high. Can't check now, the machine is > running other tests.) There's got to be something wrong here, I don't see a reason why it'd be slower at any meaningful distance. What set of patches do I need to repro the issue? And what is the complete set of pieces to load the data? https://postgr.es/m/293a4735-79a4-499c-9a36-870ee9286281%40vondra.me has the query, but afaict not enough information to infer init.sql > Not sure about wait events, but I don't think any backends are doing > synchronous I/O. There's only that one query running, and it's using AIO > (except for the index, which is still read synchronously). > > Likewise, I don't think there's an insufficient number of workers. I've > tried with 3 and 12 workers, and there's virtually no difference between > those. IIRC when watching "top", I've never seen more than 1 or maybe 2 > workers active (using CPU). That doesn't say much - if they are doing IO, they're not on CPU... Greetings, Andres Freund
On 8/28/25 21:52, Andres Freund wrote: > Hi, > > On 2025-08-28 19:08:40 +0200, Tomas Vondra wrote: >> On 8/28/25 18:16, Andres Freund wrote: >>>> So I think the IPC overhead with "worker" can be quite significant, >>>> especially for cases with distance=1. I don't think it's a major issue >>>> for PG18, because seq/bitmap scans are unlikely to collapse the distance >>>> like this. And with larger distances the cost amortizes. It's much >>>> bigger issue for the index prefetching, it seems. >>> >>> I couldn't keep up with all the discussion, but is there actually valid I/O >>> bound cases (i.e. not ones were we erroneously keep the distance short) where >>> index scans end can't have a higher distance? >>> >> >> I don't know, really. >> >> Is the presented exaple really a case of an "erroneously short >> distance"? > > I think the query isn't actually measuring something particularly useful in > the general case. You're benchmarking something were the results are never > looked at - which means the time between two index fetches is unrealistically > short. That means any tiny latency increase matters a lot more than with > realistic queries. > Sure, is a "microbenchmark" focusing on index scans. The point of not looking at the result is to isolate the index scan, and it's definitely true that if the query did some processing (e.g. feeding it into an aggregate or something), the relative difference would be smaller. But the absolute difference would likely remain about the same. I don't think the submitting the I/O and then not waiting long enough before actually reading the block is a significant factor here. It does affect even the "warm" runs (that do no actual I/O), and most of the difference seems to match the IPC cost. And AFAICS that cost dost not change if the delay increases, we still need to send two signals. > And this is, IIUC, on a local SSD. I'd bet that on cloud latencies AIO would > still be a huge win. > True, but only for cold runs that actually do I/O. The results for the warm runs show regressions too, although smaller ones. And that would affect any kind of storage (with buffered I/O). Also, I'm not sure "On slow storage it does not regress," is a very strong argument ;-) > >> From the 2x regression (compared to master) it might seem like that, but >> even with the increased distance it's still slower than master (by 25%). So >> maybe the "error" is to use AIO in these cases, instead of just switching to >> I/O done by the backend. > > If it's slower at a higher distance, we're missing something. > There's one weird thing I just realized - I don't think I ever saw more than a single I/O worker consuming CPU (in top), even with the higher distance. I'm not 100% sure about it, need to check tomorrow. IIRC the CPU utilization with "collapsed " distance ~2.0 was about backend: 60% ioworker: 40% and with the patches increasing the distance it was more like backend: 100% ioworker: 50% But I think it was still just one ioworker. I wonder if that's OK, intentional, or if it might be an issue ... > >> It may be a bit worse for non-btree indexes, e.g. for for ordered scans >> on gist indexes (getting the next tuple may require reading many leaf >> pages, so maybe we can't look too far ahead?). Or for indexes with >> naturally "fat" tuples, which limits how many tuples we see ahead. > > I am not worried at all about those cases. 
If you have to read a lot of index > leaf pages to get a heap fetch, a distance of even just 2 will be fine, > because the IPC overhead is a neglegible cost compared to the index > processing. Similarly, if you have to do very deep index traversals due to > wide index tuples, there's going to be more time between two table fetches. > Most likely, yes. > >>> Obviously you can construct cases with a low distance by having indexes point >>> to a lot of tiny tuples pointing to perfectly correlated pages, but in that >>> case IO can't be a significant factor. >>> >> >> It's definitely true the examples the script finds are "adversary", but >> also not entirely unrealistic. > > I think doing index scans where the results are just thrown out are entirely > unrealistic... > True, it's a microbenchmark focused on a specific operation. But I don't think it makes it unrealistic, even though the impact on real-world queries will be smaller. But I know what you mean. > >> I suppose there will be such cases for any heuristics we come up with. > > Agreed. > > >> There's probably more cases like this, where we end up with many hits. >> Say, a merge join may visit index tuples repeatedly, and so on. But then >> it's likely in shared buffers, so there won't be any IPC. > > Yea, I'd not expect a meaningful impact of any of this in a workload like > that. > regards -- Tomas Vondra
On 8/29/25 01:27, Andres Freund wrote: > Hi, > > On 2025-08-29 01:00:58 +0200, Tomas Vondra wrote: >> I'm not sure how to determine what concurrency it "wants". All I know is >> that for "warm" runs [1], the basic index prefetch patch uses distance >> ~2.0 on average, and is ~2x slower than master. And with the patches the >> distance is ~270, and it's 30% slower than master. (IIRC there's about >> 30% misses, so 270 is fairly high. Can't check now, the machine is >> running other tests.) > > There got to be something wrong here, I don't see a reason why at any > meaningful distance it'd be slower. > > What set of patches do I need to repro the issue? > Use this branch: https://github.com/tvondra/postgres/commits/index-prefetch-master/ and then Thomas' patch that increases the prefetch distance: https://www.postgresql.org/message-id/CA%2BhUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw%40mail.gmail.com (IIRC there's a trivial conflict in read_stream_reset.). > And what are the complete set of pieces to load the data? > https://postgr.es/m/293a4735-79a4-499c-9a36-870ee9286281%40vondra.me > has the query, but afaict not enough information to infer init.sql > Yeah, I forgot to include that piece, sorry. Here's an init.sql, that loads the table, it also has the query. > >> Not sure about wait events, but I don't think any backends are doing >> sychnronous I/O. There's only that one query running, and it's using AIO >> (except for the index, which is still read synchronously). >> >> Likewise, I don't think there's insufficient number of workers. I've >> tried with 3 and 12 workers, and there's virtually no difference between >> those. IIRC when watching "top", I've never seen more than 1 or maybe 2 >> workers active (using CPU). > > That doesn't say much - if the they are doing IO, they're not on CPU... > True. But one worker did show up in top, using a fair amount of CPU, so why wouldn't the others (if they process the same stream)? regards -- Tomas Vondra
Attachments
On Thu, Aug 28, 2025 at 7:52 PM Tomas Vondra <tomas@vondra.me> wrote: > Use this branch: > > https://github.com/tvondra/postgres/commits/index-prefetch-master/ > > and then Thomas' patch that increases the prefetch distance: > > > https://www.postgresql.org/message-id/CA%2BhUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw%40mail.gmail.com > > (IIRC there's a trivial conflict in read_stream_reset.). I found it quite hard to apply Thomas' patch. There are actually 3 patches, with 2 earlier patches needed from earlier in the thread. And, there were significant merge conflicts to work around. I'm not sure that Thomas'/your patch to ameliorate the problem on the read stream side is essential here. Perhaps Andres can just take a look at the test case + feature branch, without the extra patches. That way he'll be able to see whatever the immediate problem is, which might be all we need. -- Peter Geoghegan
On 8/29/25 01:57, Peter Geoghegan wrote: > On Thu, Aug 28, 2025 at 7:52 PM Tomas Vondra <tomas@vondra.me> wrote: >> Use this branch: >> >> https://github.com/tvondra/postgres/commits/index-prefetch-master/ >> >> and then Thomas' patch that increases the prefetch distance: >> >> >> https://www.postgresql.org/message-id/CA%2BhUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw%40mail.gmail.com >> >> (IIRC there's a trivial conflict in read_stream_reset.). > > I found it quite hard to apply Thomas' patch. There's actually 3 > patches, with 2 earlier patches needed for earlier in the thread. And, > there were significant merge conflicts to work around. > I don't think the 2 earlier patches are needed, I only ever applied the one in the linked message. But you're right there were more merge conflicts, I forgot about that. Here's a patch that should apply on top of the prefetch branch. > I'm not sure that Thomas'/your patch to ameliorate the problem on the > read stream side is essential here. Perhaps Andres can just take a > look at the test case + feature branch, without the extra patches. > That way he'll be able to see whatever the immediate problem is, which > might be all we need. > AFAICS Andres was interested in reproducing the regression with an increased distance. Or maybe I got it wrong. regards -- Tomas Vondra
Attachments
Hi, On 2025-08-28 19:57:17 -0400, Peter Geoghegan wrote: > On Thu, Aug 28, 2025 at 7:52 PM Tomas Vondra <tomas@vondra.me> wrote: > > Use this branch: > > > > https://github.com/tvondra/postgres/commits/index-prefetch-master/ > > > > and then Thomas' patch that increases the prefetch distance: > > > > > > https://www.postgresql.org/message-id/CA%2BhUKGL2PhFyDoqrHefqasOnaXhSg48t1phs3VM8BAdrZqKZkw%40mail.gmail.com > > > > (IIRC there's a trivial conflict in read_stream_reset.). > > I found it quite hard to apply Thomas' patch. There's actually 3 > patches, with 2 earlier patches needed for earlier in the thread. And, > there were significant merge conflicts to work around. Same. Tomas, could you share what you applied? > I'm not sure that Thomas'/your patch to ameliorate the problem on the > read stream side is essential here. Perhaps Andres can just take a > look at the test case + feature branch, without the extra patches. > That way he'll be able to see whatever the immediate problem is, which > might be all we need. It seems caused to a significant degree by waiting at low queue depths. If I comment out the stream->distance-- in read_stream_start_pending_read() the regression is reduced greatly. As far as I can tell, after that the process is CPU bound, i.e. IO waits don't play a role. I see a variety for increased CPU usage: 1) The private ref count infrastructure in bufmgr.c gets a bit slower once more buffers are pinned 2) signalling overhead to the worker - I think we are resetting the latch too eagerly, leading to unnecessarily many signals being sent to the IO worker. 3) same issue with the resowner tracking But there's some additional difference in performance I don't yet understand... Greetings, Andres Freund
On Thu, Aug 28, 2025 at 9:10 PM Andres Freund <andres@anarazel.de> wrote: > Same. Tomas, could you share what you applied? Tomas posted a self-contained patch to the list about an hour ago? > > I'm not sure that Thomas'/your patch to ameliorate the problem on the > > read stream side is essential here. Perhaps Andres can just take a > > look at the test case + feature branch, without the extra patches. > > That way he'll be able to see whatever the immediate problem is, which > > might be all we need. > > It seems caused to a significant degree by waiting at low queue depths. If I > comment out the stream->distance-- in read_stream_start_pending_read() the > regression is reduced greatly. IIUC, that is very roughly equivalent to what the patch actually does. The fastest configuration of all, independent of io_method, is "enable_indexscan_prefetch=off". So it's hard to believe that the true underlying problem is low queue depth. Though I certainly don't doubt that higher queue depths will help *when io_method=worker*. -- Peter Geoghegan
On Fri, Aug 29, 2025 at 11:52 AM Tomas Vondra <tomas@vondra.me> wrote: > True. But one worker did show up in top, using a fair amount of CPU, so > why wouldn't the others (if they process the same stream)? It deliberately concentrates wakeups into the lowest numbered workers that are marked idle in a bitmap. * higher numbered workers snooze and eventually time out (with the patches for 19 that make the pool size dynamic) * busy workers have a better chance of staying on CPU between one job and the next * minimised duplication of various caches and descriptors Every other wakeup routing strategy I've tried so far performed worse in both avg(latency) and stddev(latency). I have wondered if we might want to consider per-NUMA-node IO worker pools with their own submission queues. Not investigated, but I suppose it might possibly help with the submission queue lock, cache line ping pong for buffer headers that the worker touches on completion, and inter-process interrupts. I don't know where to draw the line with potential optimisations to IO worker mode that would realistically only help on Linux today, when the main performance plan for Linux is io_uring.
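As a toy illustration of that routing rule (not the actual method_worker.c code), picking the wakeup target is essentially a find-lowest-set-bit over an idle-worker bitmap:

#include <strings.h>            /* ffs() */

/*
 * Illustrative only: return the lowest-numbered idle worker, or -1 if
 * no worker is currently marked idle in the bitmap.
 */
static int
choose_idle_worker(unsigned int idle_worker_bitmap)
{
    int bit = ffs((int) idle_worker_bitmap);    /* 1-based; 0 if no bit set */

    return bit - 1;
}

Always routing to the lowest set bit is what lets the higher-numbered workers stay idle long enough to time out, per the list above.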
Hi, I spent a fair bit more time analyzing this issue. On 2025-08-28 21:10:48 -0400, Andres Freund wrote: > On 2025-08-28 19:57:17 -0400, Peter Geoghegan wrote: > > On Thu, Aug 28, 2025 at 7:52 PM Tomas Vondra <tomas@vondra.me> wrote: > > I'm not sure that Thomas'/your patch to ameliorate the problem on the > > read stream side is essential here. Perhaps Andres can just take a > > look at the test case + feature branch, without the extra patches. > > That way he'll be able to see whatever the immediate problem is, which > > might be all we need. > > It seems caused to a significant degree by waiting at low queue depths. If I > comment out the stream->distance-- in read_stream_start_pending_read() the > regression is reduced greatly. > > As far as I can tell, after that the process is CPU bound, i.e. IO waits don't > play a role. Indeed the actual AIO subsystem is unrelated, from what I can tell: I hacked up read_stream.c/bufmgr.c to do readahead even if the buffer is in shared_buffers. With that, the negative performance impact of doing enable_indexscan_prefetch=1 is of a similar magnitude even if the table is already entirely in shared buffers. I.e. actual IO is unrelated. I compared perf stat -ddd output for enable_indexscan_prefetch=0 with enable_indexscan_prefetch=1. The only real difference is a substantial (~3x) increase in branch misses. I then took a perf profile to see where all those misses are from. The first source is: > I see a variety for increased CPU usage: > > 1) The private ref count infrastructure in bufmgr.c gets a bit slower once > more buffers are pinned The problem mainly seems to be that the branches in the loop at the start of GetPrivateRefCountEntry() are entirely unpredictable in this workload. I had an old patch that tried to make it possible to use SIMD for the search, by using a separate array for the Buffer ids - with that gcc generates fairly crappy code, but does make the code branchless. Here that substantially reduces the overhead of doing prefetching. Afterwards it's not a meaningful source of misses anymore. > 3) same issue with the resowner tracking This one is much harder to address: a) The "key" we are searching for is much wider (16 bytes), making vectorization of the search less helpful b) because we search up to owner->narr instead of a fixed-length, the compiler wouldn't be able to auto-vectorize anyway c) the branch-misses are partially caused by ResourceOwnerForget() "scrambling" the order in the array when forgetting an element I don't know how to fix this right now. I nevertheless wanted to see how big the impact of this is, so I just neutered ResourceOwner{Remember,Forget}{Buffer,BufferIO} - that's obviously not correct, but suffices to see that the performance difference reduces substantially. But not completely, unfortunately. > But there's some additional difference in performance I don't yet > understand... I still don't think I fully understand why the impact of this is so large. The branch misses appear to be the only thing differentiating the two cases, but with resowners neutralized, the remaining difference in branch misses seems too large - it's not like the sequence of block numbers is more predictable without prefetching... The main increase in branch misses is in index_scan_stream_read_next... Greetings, Andres Freund
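For illustration, the branchless-search idea for the private refcount array looks roughly like the following. This is a sketch of the concept only, not the actual bufmgr.c change; Buffer and the array size are stand-ins:

#define REFCOUNT_ARRAY_SIZE 8   /* bufmgr.c uses a small fixed-size array */

typedef int Buffer;             /* stand-in for PostgreSQL's Buffer typedef */

/*
 * Search a separate, densely packed array of buffer ids with a fixed trip
 * count and no early exit, so the compiler can use conditional moves or
 * SIMD compares instead of data-dependent branches.
 */
static int
find_refcount_slot(const Buffer ids[REFCOUNT_ARRAY_SIZE], Buffer buffer)
{
    int found = -1;

    for (int i = 0; i < REFCOUNT_ARRAY_SIZE; i++)
        found = (ids[i] == buffer) ? i : found;

    return found;               /* -1: fall back to the hash table lookup */
}

The resowner array is harder to treat the same way, for the reasons listed above (wider keys, a variable search length, and ResourceOwnerForget() reshuffling the array).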
On Wed, Sep 3, 2025 at 2:47 PM Andres Freund <andres@anarazel.de> wrote: > I still don't think I fully understand why the impact of this is so large. The > branch misses appear to be the only thing differentiating the two cases, but > with resowners neutralized, the remaining difference in branch misses seems > too large - it's not like the sequence of block numbers is more predictable > without prefetching... > > The main increase in branch misses is in index_scan_stream_read_next... I've been working on fixing the same regressed query, but using a completely different (though likely complementary) approach: by adding a test to index_scan_stream_read_next that detects when prefetching isn't favorable. If it isn't favorable, then we stop prefetching entirely (we fall back on regular sync I/O). Although this experimental approach is still very rough, it seems promising. It ~100% fixes the problem at hand, without really creating any new problems (at least as far as our testing has been able to determine, so far). The key idea is to wait until a few batches have already been read, and then test whether the index-tuple-wise "distance" between readPos (the read position) and streamPos (the stream position used by index_scan_stream_read_next) has remained excessively low within index_scan_stream_read_next. If, after processing 20 batches/leaf pages, readPos and streamPos still read from the same batch *and* have a low index-tuple-wise distance within that batch (they're within 10 or 20 items of each other), we expect "thrashing", which makes prefetching unfavorable -- and so we just stop using our read stream. It's worth noting that (given the current structure of the patch) it is inherently impossible to do something like this from within the read stream. We're suppressing duplicate heap block requests iff the blocks are contiguous within the index. So the read stream just doesn't see anything like what I'm calling the "index-tuple-wise distance" between readPos and streamPos. Note that the baseline behavior for the test case (the behavior with master, or with prefetching disabled) appears to be very I/O bound, due to readahead. I've confirmed this using iostat. So "synchronous" I/O isn't very synchronous here. (Prefetching actually does make sense when this query is run with direct I/O, but that's far slower with or without the use of explicit prefetching, so that likely doesn't tell us much.) -- Peter Geoghegan
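[Editor's note: a rough sketch of the shape of the test being described, for readers following along; the names and thresholds below are guesses, not the patch's actual code. Once a handful of batches have been processed, the batch number and within-batch offset of readPos and streamPos are compared, and prefetching is abandoned if the stream never managed to pull meaningfully ahead.]

#include <stdbool.h>

typedef struct ScanPosition
{
	int			batch;			/* which batch / leaf page */
	int			item;			/* offset within that batch */
} ScanPosition;

#define MIN_BATCHES_BEFORE_CHECK	20	/* let the scan warm up first */
#define MIN_ITEM_DISTANCE			16	/* stream must lead by this many items */

/*
 * Decide whether prefetching looks unfavorable: after enough batches, if
 * the stream position is still stuck in the same batch as the read
 * position and only a few items ahead of it, the IOs can never complete
 * in time and the prefetching overhead cannot pay for itself.
 */
bool
prefetch_looks_unfavorable(ScanPosition readPos, ScanPosition streamPos,
						   int batches_processed)
{
	if (batches_processed < MIN_BATCHES_BEFORE_CHECK)
		return false;			/* too early to judge */

	if (streamPos.batch == readPos.batch &&
		streamPos.item - readPos.item < MIN_ITEM_DISTANCE)
		return true;			/* expect "thrashing": stop prefetching */

	return false;
}

In the patch as described, falling back presumably just means the read stream stops being consulted and heap pages are fetched synchronously, exactly as they would be on master.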
Hi, On 2025-09-03 15:33:30 -0400, Peter Geoghegan wrote: > On Wed, Sep 3, 2025 at 2:47 PM Andres Freund <andres@anarazel.de> wrote: > > I still don't think I fully understand why the impact of this is so large. The > > branch misses appear to be the only thing differentiating the two cases, but > > with resowners neutralized, the remaining difference in branch misses seems > > too large - it's not like the sequence of block numbers is more predictable > > without prefetching... > > > > The main increase in branch misses is in index_scan_stream_read_next... > > I've been working on fixing the same regressed query, but using a > completely different (though likely complementary) approach: by adding > a test to index_scan_stream_read_next that detects when prefetching > isn't favorable. If it isn't favorable, then we stop prefetching > entirely (we fall back on regular sync I/O). The issue to me is that this kind of query actually *can* substantially benefit from prefetching, no? Afaict the performance without prefetching is rather atrocious as soon as a) storage has a tad higher latency or b) DIO is used. Indeed: With DIO, readahead provides a ~2.6x improvement for the query at hand. I continue to be worried that we're optimizing for queries that have no real-world relevance. The regression afaict is contingent on 1) An access pattern that is unpredictable to the CPU (due to the use of random() as part of ORDER BY during the data generation) 2) Index and heap are somewhat correlated, but fuzzily, i.e. there are backward jumps in the heap block numbers being fetched 3) There are 1 - small_number tuples on each heap page 4) The query scans a huge number of tuples, without actually doing any meaningful analysis on the tuples. As soon as one does meaningful work for returned tuples, the small difference in per-tuple CPU costs vanishes 5) The query visits all heap pages within a range, just not quite in order. Without that the kernel readahead would not work and the query's performance without readahead would be terrible even on low-latency storage This just doesn't strike me as a particularly realistic combination of factors? I suspect we could more than eat back the loss in performance by doing batched heap_hot_search_buffer()... Greetings, Andres Freund
On Wed, Sep 3, 2025 at 4:06 PM Andres Freund <andres@anarazel.de> wrote: > The issue to me is that this kind of query actually *can* substantially > benefit from prefetching, no? As far as I can tell, not really, no. > Afaict the performance without prefetching is > rather atrocious as soon as a) storage has a tad higher latency or b) DIO is > used. I don't know that storage latency matters, when (without DIO) we're doing so well from readahead. > Indeed: With DIO, readahead provides a ~2.6x improvement for the query at hand. I don't see that level of improvement with DIO. For me it's 6054.921 ms with prefetching, 8766.287 ms without it. I can kind of accept the idea that in some sense readahead shouldn't count too much, since the future is DIO. But it's not like aggressive prefetching matches the performance of buffered I/O + readahead. Not for me, at any rate. I don't know why. > I continue to be worried that we're optimizing for queries that have no > real-world relevance. I'm not at all surprised that we're spending so much time on weird queries. For one thing, the real world queries are already much improved. For another, in order to accept a trade-off like this, we have to actually know what it is we're accepting. And how easy/hard it is to do better (we may very well be able to fix this problem at no great cost in complexity). > This just doesn't strike me as a particularly realistic combination of > factors? I agree. I just don't think that we've done enough work on this to justify accepting it as a cost of doing business. We might well do that at some point in the near future. > I suspect we could more than eat back the loss in performance by doing batched > heap_hot_search_buffer()... Maybe, but I don't think that we're all that likely to get that done for 19. -- Peter Geoghegan
Hi, On 2025-09-03 16:25:56 -0400, Peter Geoghegan wrote: > On Wed, Sep 3, 2025 at 4:06 PM Andres Freund <andres@anarazel.de> wrote: > > The issue to me is that this kind of query actually *can* substantially > > benefit from prefetching, no? > > As far as I can tell, not really, no. It seems to here - I see small wins even with kernel readahead, fwiw. > > Afaict the performance without prefetching is > > rather atrocious as soon as a) storage has a tad higher latency or b) DIO is > > used. > > I don't know that storage latency matters, when (without DIO) we're > doing so well from readahead. The readahead Linux does is actually not aggressive enough once you have higher IO latency - you can tune it up, but then it often does too much IO. > > Indeed: With DIO, readahead provides a ~2.6x improvement for the query at hand. > > I don't see that level of improvement with DIO. For me it's 6054.921 > ms with prefetching, 8766.287 ms without it. I guess your SSD has lower latency than mine... > I can kind of accept the idea that in some sense readahead shouldn't > count too much, since the future is DIO. But it's not like aggressive > prefetching matches the performance of buffered I/O + readahead. Not > for me, at any rate. I don't know why. It does here, just about. The reason for not matching is fairly simple: The kernel readahead issues large reads, but with DIO we don't for this query. The adversarial pattern here rarely has two consecutive neighboring blocks, so nearly all reads are 8kB reads. This actually might be the thing to tackle to avoid this and other similar regressions: If we were able to issue combined IOs for interspersed patterns like we have in this query, we'd easily win back the overhead. And it'd make DIO much much better. We don't want to try to find more complicated merges for things like seqscans and bitmap heap scans - there can never be anything other than merges of consecutive blocks, and the CPU overhead of the more complicated search would likely be noticeable. But for something like index scans that's different. I don't quite know if this is best done as an optional feature for read streams, a layer atop read streams, or something dedicated. For now I'll go back to working on read stream test infrastructure. That's the prerequisite for testing the "don't synchronously wait for in-progress IO" improvement. And if we want to have more complicated merging, that also seems like something much easier to develop with some testing infra. Greetings, Andres Freund
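[Editor's note: a toy standalone illustration of the combined-IO idea, my own sketch only; it ignores how the stream would preserve tuple ordering and deduplicate blocks. The idea is to buffer a small window of upcoming block numbers, sort it, and collapse runs of consecutive blocks into one larger read.]

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;

static int
cmp_block(const void *a, const void *b)
{
	BlockNumber ba = *(const BlockNumber *) a;
	BlockNumber bb = *(const BlockNumber *) b;

	return (ba > bb) - (ba < bb);
}

int
main(void)
{
	/* blocks in the order an index scan might request them */
	BlockNumber window[] = {10, 12, 11, 14, 13, 40, 41};
	int			nblocks = sizeof(window) / sizeof(window[0]);

	qsort(window, nblocks, sizeof(BlockNumber), cmp_block);

	/* emit each run of consecutive blocks as one combined read */
	for (int i = 0; i < nblocks;)
	{
		int			j = i + 1;

		while (j < nblocks && window[j] == window[j - 1] + 1)
			j++;
		printf("one IO for blocks %u..%u (%d x 8kB)\n",
			   window[i], window[j - 1], j - i);
		i = j;
	}
	return 0;
}

With the kind of adversarial pattern discussed above, this would turn five single-block reads (10, 12, 11, 14, 13) into one five-block read; the hard part, as noted, is doing this only where the search for merges is cheap enough to pay off (index scans, not seqscans or bitmap heap scans).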
On Wed, Sep 3, 2025 at 8:16 PM Andres Freund <andres@anarazel.de> wrote: > > I don't see that level of improvement with DIO. For me it's 6054.921 > > ms with prefetching, 8766.287 ms without it. > > I guess your SSD has lower latency than mine... It's nothing special: a 4 year old Samsung 980 pro. > This actually might be the thing to tackle to avoid this and other similar > regressions: If we were able to isssue combined IOs for interspersed patterns > like we have in this query, we'd easily win back the overhead. And it'd make > DIO much much better. That sounds very plausible to me. I don't think it's at all unusual for index scans to do this (that particular aspect of the test case query wasn't unrealistic). In general this seems important to me. > I don't quite know if this is best done as an optional feature for read > streams, a layer atop read stream or something dedicated. My guess is that it would work best as an optional feature for read streams. A flag like READ_STREAM_REPEAT_READS that's passed to read_stream_begin_relation might work best. > For now I'll go back to working on read stream test infrastructure. That's the > prerequisite for testing the "don't synchronously wait for in-progress IO" > improvement. "don't synchronously wait for in-progress IO" is also very important to this project. Thanks for your help with that. > And if we want to have more complicated merging, that also seems > like something much easier to develop with some testing infra. Great. -- Peter Geoghegan
On 9/3/25 22:06, Andres Freund wrote: > ... > > I continue to be worried that we're optimizing for queries that have no > real-world relevance. The regression afaict is contingent on > > 1) An access pattern that is unpredictable to the CPU (due to the use of > random() as part of ORDER BY during the data generation) > > 2) Index and heap are somewhat correlated, but fuzzily, i.e. there are > backward jumps in the heap block numbers being fetched > Aren't those two points rather contradictory? Why would it matter that the data generator uses random() in the ORDER BY? Seems entirely irrelevant, if the generated table is "somewhat correlated". Which seems pretty normal in real-world data sets ... > 3) There are 1 - small_number tuples on each heap page > What would you consider a reasonable number of tuples on one heap page? The current tests generate data with 20-100 tuples per page, which seems pretty reasonable to me. I mean, that's 80-400B per tuple. Sure, I could generate data with narrower tuples, but would that be more realistic? I doubt that. FWIW it's not like the regressions only happen on fillfactor=20, with 20 tuples/page. They happen on fillfactor=100 too (sure, the impact is smaller). > 4) The query scans a huge number of tuples, without actually doing any > meaningful analysis on the tuples. As soon as one does meaningful work for > returned tuples, the small difference in per-tuple CPU costs vanishes > I believe I already responded to this before. Sure, the relative regression will get smaller. But I don't see why the absolute difference would get smaller. > 5) The query visits all heap pages within a range, just not quite in > order. Without that the kernel readahead would not work and the query's > performance without readahead would be terrible even on low-latency storage > I'm sorry, I don't quite understand what this says :-( Or why that would mean the issues triggered by the generated data sets are not valid even for real-world queries. > This just doesn't strike me as a particularly realistic combination of > factors? > Aren't plenty of real-world data sets correlated, but not perfectly? In any case, I'm the first one to admit these data sets are synthetic. It's meant to generate data sets that gradually shift from perfectly ordered to random, with an increasing number of duplicates, etc. The point was to cover a wider range of data sets, not just a couple "usual" ones. It's possible some of these data sets are not realistic, in which case we can choose to ignore them and the regressions. The approach tends to find "adversary" cases, hit corner cases (not necessarily as rare as assumed), etc. But the issues we ran into so far seem perfectly valid (or at least useful to think about). regards -- Tomas Vondra
On Thu, Sep 4, 2025 at 2:55 PM Tomas Vondra <tomas@vondra.me> wrote: > Aren't plenty of real-world data sets correlated, but not perfectly? Attached is the latest revision of the prefetching patch, taken from the shared branch that Tomas and I have been working on for some weeks. This revision is the first "official revision" that uses the complex approach, which we agreed was the best approach right before we started collaborating through this shared branch. While Tomas and I have posted versions of this "complex" approach at various times, those were "unofficial" previews of different approaches. Whereas this is the latest official patch revision of record, that should be tested by CFTester for the prefetch patch's CF entry, etc. We haven't done a good job of maintaining an unambiguous, easy to test "official" CF entry patch before now. That's why I'm being explicit about what this patch revision represents. It's the shared work of Tomas and me; it isn't some short-term experimental fork. Future revisions will be incremental improvements on what I'm posting now. Our focus has been on fixing a variety of regressions that came to light following testing by Tomas. There are a few bigger changes that are intended to fix these regressions, plus lots of small changes. There are too many small changes to list. But the bigger changes are: * We're now carrying Andres' patch [1] that deals with inefficiencies on the read stream side [2]. We need this to get decent performance with certain kinds of index scans where the same heap page buffer needs to be read multiple times in close succession. * We now delay prefetching/creating a new read stream until after we've already read one index batch, with the goal of avoiding regressions on cheap, selective queries (e.g., pgbench SELECT). This optimization has been referred to as the "priorbatch" optimization earlier in this thread. * The third patch is a new one, authored by Tomas. It aims to ameliorate nestloop join regressions by caching memory used to store batches across rescans. This is still experimental. * The regression that we were concerned about most recently [3][4] is fixed by a new mechanism that sometimes disables prefetching/the read stream some time after prefetching begins, having already read a small number of batches with prefetching -- the INDEX_SCAN_MIN_TUPLE_DISTANCE optimization. This is also experimental. But it does fully fix the problem at hand, without any read stream changes. (This is part of the main prefetching patch.) This works like the "priorbatch" optimization, but in reverse. We *unset* the scan's read stream when our INDEX_SCAN_MIN_TUPLE_DISTANCE test shows that prefetching hasn't worked out (as opposed to delaying starting it up until it starts to look like prefetching might help). Like the "priorbatch" optimization, this optimization is concerned with fixed prefetching costs that cannot possibly pay for themselves. Note that we originally believed that the regression in question [3][4] necessitated more work on the read stream side, to directly account for the way that we saw prefetch distance collapse to 2.0 for the entire scan. But our current thinking is that the regression in question occurs with scans where wholly avoiding prefetching is the right goal. Which is why, tentatively, we're addressing the problem within indexam.c itself (not in the read stream), by adding this new INDEX_SCAN_MIN_TUPLE_DISTANCE test to the read stream callback. 
This means that various experimental read stream distance patches [3][5] that initially seemed relevant no longer appear necessary (and so aren't included in this new revision at all). Much cleanup work remains to get the changes I just described in proper shape (to say nothing about open items that we haven't made a start on yet, like moving the read stream out of indexam.c and into heapam). But it has been too long since the last revision. I'd like to establish a regular cadence for posting new revisions of the patch set. [1] https://postgr.es/m/6butbqln6ewi5kuxz3kfv2mwomnlgtate4mb4lpa7gb2l63j4t@stlwbi2dvvev [2] https://postgr.es/m/kvyser45imw3xmisfvpeoshisswazlzw35el3fq5zg73zblpql@f56enfj45nf7 [3] https://postgr.es/m/8f5d66cf-44e9-40e0-8349-d5590ba8efb4@vondra.me [4] https://github.com/tvondra/postgres/blob/index-prefetch-master/microbenchmarks/tomas-weird-issue-readstream.sql [5] https://postgr.es/m/CA+hUKG+9Qp=E5XWE+_1UPCxULLXz6JrAY=83pmnJ5ifupH-NSA@mail.gmail.com -- Peter Geoghegan
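[Editor's note: to make the relationship between the two gates described in the message above a bit more concrete, here is a rough structural sketch. All names and the stub helpers are invented and it glosses over everything the real patch does; it is only meant to show that "priorbatch" delays creating the read stream, while the INDEX_SCAN_MIN_TUPLE_DISTANCE test later tears it down for good.]

#include <stdbool.h>
#include <stddef.h>

/* stand-ins for the real read stream and heuristic code (assumptions) */
typedef struct ReadStreamStub ReadStreamStub;

static ReadStreamStub *stream_create(void *scan) { (void) scan; return (ReadStreamStub *) 1; }
static void stream_destroy(ReadStreamStub *stream) { (void) stream; }
static bool tuple_distance_too_low(void *scan) { (void) scan; return false; }

typedef struct ScanPrefetchState
{
	ReadStreamStub *stream;		/* NULL: plain synchronous heap fetches */
	int			batches_read;
	bool		gave_up;		/* once true, prefetching stays off */
} ScanPrefetchState;

void
maybe_toggle_prefetching(ScanPrefetchState *state, void *scan)
{
	/* "priorbatch": don't pay stream setup costs for a one-batch scan */
	if (state->stream == NULL && !state->gave_up && state->batches_read >= 1)
		state->stream = stream_create(scan);

	/* the reverse gate: unset the stream if prefetching isn't paying off */
	if (state->stream != NULL && tuple_distance_too_low(scan))
	{
		stream_destroy(state->stream);
		state->stream = NULL;
		state->gave_up = true;
	}
}

Both gates address the same thing from opposite ends: fixed prefetching costs that a cheap selective scan (or a scan where the stream never gets ahead) could never recoup.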
Attachments
On 9/11/25 00:24, Peter Geoghegan wrote: > On Thu, Sep 4, 2025 at 2:55 PM Tomas Vondra <tomas@vondra.me> wrote: >> Aren't plenty of real-world data sets correlated, but not perfectly? > > Attached is the latest revision of the prefetching patch, taken from > the shared branch that Tomas and I have been working on for some > weeks. > > This revision is the first "official revision" that uses the complex > approach, which we agreed was the best approach right before we > started collaborating through this shared branch. While Tomas and I > have posted versions of this "complex" approach at various times, > those were "unofficial" previews of different approaches. Whereas this > is the latest official patch revision of record, that should be tested > by CFTester for the prefetch patch's CF entry, etc. > > We haven't done a good job of maintaining an unambiguous, easy to test > "official" CF entry patch before now. That's why I'm being explicit > about what this patch revision represents. It's the shared work of > Tomas and me; it isn't some short-term experimental fork. Future > revisions will be incremental improvements on what I'm posting now. > Indeed, the thread is very confusing as it mixes up different approaches, various experimental patches etc. Thank you for cleaning this up, and doing various other fixes. > Our focus has been on fixing a variety of regressions that came to > light following testing by Tomas. There are a few bigger changes that > are intended to fix these regressions, plus lots of small changes. > > There are too many small changes to list. But the bigger changes are: > > * We're now carrying Andres' patch [1] that deals with inefficiencies > on the read stream side [2]. We need this to get decent performance > with certain kinds of index scans where the same heap page buffer > needs to be read multiple times in close succession. > > * We now delay prefetching/creating a new read stream until after > we've already read one index batch, with the goal of avoiding > regressions on cheap, selective queries (e.g., pgbench SELECT). This > optimization has been referred to as the "priorbatch" optimization > earlier in this thread. > > * The third patch is a new one, authored by Tomas. It aims to > ameliorate nestloop join regressions by caching memory used to store > batches across rescans. > > This is still experimental. > Yeah. I realize the commit message does not explain the motivation, so let me fix that - the batches are pretty much the same thing as ~BTScanPosData, which means it's a ~30KB struct. That means it's not cached in memory contexts, but each palloc/pfree is malloc/free. That's already a known problem (e.g. for scans on partitioned tables), but batches make it worse - we now need more instances of the struct. So it's even more important not to do far more malloc/free calls. It's not perfect, but it was good enough to eliminate the overhead. > * The regression that we were concerned about most recently [3][4] is > fixed by a new mechanism that sometimes disables prefetching/the read > stream some time after prefetching begins, having already read a small > number of batches with prefetching -- the > INDEX_SCAN_MIN_TUPLE_DISTANCE optimization. > > This is also experimental. But it does fully fix the problem at hand, > without any read stream changes. (This is part of the main prefetching > patch.) > > This works like the "priorbatch" optimization, but in reverse. 
We > *unset* the scan's read stream when our INDEX_SCAN_MIN_TUPLE_DISTANCE > test shows that prefetching hasn't worked out (as opposed to delaying > starting it up until it starts to look like prefetching might help). > Like the "priorbatch" optimization, this optimization is concerned > with fixed prefetching costs that cannot possibly pay for themselves. > > Note that we originally believed that the regression in question > [3][4] necessitated more work on the read stream side, to directly > account for the way that we saw prefetch distance collapse to 2.0 for > the entire scan. But our current thinking is that the regression in > question occurs with scans where wholly avoiding prefetching is the > right goal. Which is why, tentatively, we're addressing the problem > within indexam.c itself (not in the read stream), by adding this new > INDEX_SCAN_MIN_TUPLE_DISTANCE test to the read stream callback. This > means that various experimental read stream distance patches [3][5] > that initially seemed relevant no longer appear necessary (and so > aren't included in this new revision at all). > Yeah, this heuristic seems very effective in eliminating the regression (at least judging by the test results I've seen so far). Two or three questions bother me about it, though: 1) I'm not sure I fully understand how the heuristic works, i.e. how tracking "tuple distance" in the index AM identifies queries where prefetching can't pay for itself. It's hard to say if the tuple distance is a good predictor of that. It seems to be in the case of the regressed query, I don't dispute that. AFAICS the reasoning is: We're prefetching too close ahead, so close the I/O can't possibly complete, and the overhead of submitting the I/O using AIO is higher than what async "saves". That's great, but is the distance a good measure of that? It has no concept of what happens between prefetching and reading a block, during the "distance". In the test queries it's virtually nothing, because the query doesn't do anything with the rows. For more complex queries there could be plenty of time for the I/O to complete. Of course, if the query is complex and the I/Os complete in time even for short distances, it's likely not a huge relative difference ... 2) It's a one-time decision, not adaptive. We start prefetching, and then at some point (not too long after the scan starts) we make a decision whether to continue with prefetching or not. And if we disable it, it's disabled forever. That's fine for the synthetic data sets we use for testing, because those are synthetic. I'm not sure it'll work this well for real-world data sets where different parts of the file may be very different. This is perfectly fine for a WIP patch, but I believe we should try to make this adaptive. Which probably means we need to invent a "light" version of read_stream that initially does sync I/O, and only switches to async (with all the expensive initialization) later. And then can switch back to sync, but is ready to maybe start prefetching again if the data pattern changes. 3) Now that I look at the code in index_scan_stream_read_next, it feels a bit weird that we make the decision based on the "immediate" distance only. I suspect this may make it quite fragile, in the sense that even a small local irregularity in the data may result in different "step" changes. Wouldn't it be better to base this on some "average" distance? (A rough sketch of what I have in mind follows below.) 
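[Editor's note: a sketch of that "average distance" idea with hypothetical names and smoothing factor, not code from the patch. An exponential moving average of the observed distance feeds the decision, so a single irregular leaf page cannot flip the outcome on its own.]

#include <stdbool.h>

typedef struct DistanceState
{
	double		avg_distance;	/* smoothed index-tuple-wise distance */
	bool		initialized;
} DistanceState;

#define DISTANCE_SMOOTHING	0.125	/* weight given to the newest observation */

/*
 * Fold one new distance observation into the moving average and return the
 * smoothed value; the caller would compare this (rather than the raw
 * per-call distance) against its "give up on prefetching" threshold.
 */
double
update_avg_distance(DistanceState *state, double current_distance)
{
	if (!state->initialized)
	{
		state->avg_distance = current_distance;
		state->initialized = true;
	}
	else
		state->avg_distance += DISTANCE_SMOOTHING *
			(current_distance - state->avg_distance);

	return state->avg_distance;
}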
In other words, I'm afraid (2) and (3) are pretty much a "performance cliff", where a tiny difference in the input can result in wildly different behavior. > Much cleanup work remains to get the changes I just described in > proper shape (to say nothing about open items that we haven't made a > start on yet, like moving the read stream out of indexam.c and into > heapam). But it has been too long since the last revision. I'd like to > establish a regular cadence for posting new revisions of the patch > set. > Thank you! I appreciate the collaboration, it's a huge help. I kept running the stress test, trying to find cases that regress, and also to better understand the behavior. The script/charts are available here: https://github.com/tvondra/prefetch-tests So far I haven't found any massive regressions (relative to master). There are data sets where we regress by a couple percent (and it's not noise). I haven't looked into the details, but I believe most of this can be attributed to the "AIO costs" we discussed recently (with signals), and similar things. I'm attaching three charts, comparing master to a "patched" build with the 20250910 patch applied. I don't think I posted these charts before, so let me explain a bit. Each chart is a simple XY chart, comparing timings from master (x-axis) to the patched build (y-axis). Data points on the diagonal mean "same performance", below the diagonal is "patched is faster", above the diagonal is "master is faster". So the closer a data point is to the x-axis the better, and we want few points above the diagonal, because those are regressions. The colors identify different data sets. The script (available in the git repo) generates data sets with different parameters (number of distinct values, randomness, ...), and the prefetch behavior depends on that. The charts are from three different setups, with different types of SSD storage (SATA RAID, NVMe RAID, single NVMe drive). There are some differences, but the overall behavior is quite similar. Note: The charts show different numbers of data sets, and the data sets are not comparable. Each run generates new random parameters, so the same color does not mean the same parameters. The git repo also has charts with the patch adjusting the prefetch distance [1] applied. It does improve behavior with some data sets, but it does not change the overall behavior (and it does not eliminate the small regressions). regards [1] https://www.postgresql.org/message-id/9b2106a4-4901-4b03-a0b2-db2dbaee4c1f%40vondra.me -- Tomas Vondra
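[Editor's note: as a side note on the batch caching Tomas explains earlier in this message (the ~30KB batch structs being too large for memory context freelists), the idea can be pictured as a small per-scan freelist that recycles batch allocations across rescans instead of handing them straight back to malloc/free. The sketch below uses made-up names and is not the third patch's actual code.]

#include <stdlib.h>

#define BATCH_CACHE_SIZE 8		/* keep at most this many freed batches around */

typedef struct IndexScanBatchData
{
	/* ~30KB of TIDs, flags, etc. in the real thing; elided here */
	char		payload[30 * 1024];
} IndexScanBatchData;

typedef struct BatchCache
{
	IndexScanBatchData *slots[BATCH_CACHE_SIZE];
	int			nfree;
} BatchCache;

/* Reuse a cached batch if we have one, otherwise fall back to malloc. */
IndexScanBatchData *
batch_alloc(BatchCache *cache)
{
	if (cache->nfree > 0)
		return cache->slots[--cache->nfree];
	return malloc(sizeof(IndexScanBatchData));
}

/* On rescan, stash the batch for reuse instead of freeing it right away. */
void
batch_release(BatchCache *cache, IndexScanBatchData *batch)
{
	if (cache->nfree < BATCH_CACHE_SIZE)
		cache->slots[cache->nfree++] = batch;
	else
		free(batch);
}

A nested-loop inner index scan that is rescanned thousands of times would then pay for the large allocations only once per cached slot, which is the overhead the third patch is trying to eliminate.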
Attachments
On Mon, Sep 15, 2025 at 9:00 AM Tomas Vondra <tomas@vondra.me> wrote: > Yeah, this heuristics seems very effective in eliminating the regression > (at least judging by the test results I've seen so far). Two or three > question bother me about it, though: I more or less agree with all of your concerns about the INDEX_SCAN_MIN_TUPLE_DISTANCE optimization. > We're prefetching too close ahead, so close the I/O can't possibly > complete, and the overhead of submitting the I/O using AIO is higher > than what what async "saves". > > That's great, but is the distance a good measure of that? The big underlying problem is that the INDEX_SCAN_MIN_TUPLE_DISTANCE heuristic was developed in a way that overrelied on brute-force testing. It probably is still flawed in some specific way that some query will actually run into, that cannot be justified. So at this point INDEX_SCAN_MIN_TUPLE_DISTANCE should still be considered nothing more than a promising experiment. It's promising because it does appear to work with a large variety of queries (we know of no query where it doesn't basically work right now). My hope is that we'll be able to come up with a simpler and more robust approach. One where we fully understand the downsides. > 2) It's a one-time decision, not adaptive. We start prefetching, and > then at some point (not too long after the scan starts) we make a > decision whether to continue with prefetching or not. And if we disable > it, it's disabled forever. That's fine for the synthetic data sets we > use for testing, because those are synthetic. I'm not sure it'll work > this well for real-world data sets where different parts of the file may > be very different. We'll probably need to accept some hard trade-off in this area. In general (not just with this patch), prefetching works through trial and error -- the errors are useful information, and useful information isn't free. The regressions that the INDEX_SCAN_MIN_TUPLE_DISTANCE heuristic addresses are cases where the errors seem unlikely to pay for themselves. Let's not forget that these are not huge regressions -- it's not as if the patch ever does completely the wrong thing without INDEX_SCAN_MIN_TUPLE_DISTANCE. It's more like it hurts us to be constantly on the verge of doing the right thing, but never quite doing the right thing. Fundamentally, we need to be willing to pay for the cost of information through which we might be able to do better. We might be able to get the cost down, through some kind of targeted optimization, but it's unlikely to ever be close to free. > This is perfectly fine for a WIP patch, but I believe we should try to > make this adaptive. Which probably means we need to invent a "light" > version of read_stream that initially does sync I/O, and only switches > to async (with all the expensive initialization) later. And then can > switch back to sync, but is ready to maybe start prefetching again if > the data pattern changes. That does seem like it'd be ideal. But how are we supposed to decide to switch back? Right now, disabling prefetching disables the only way that we have to notice that prefetching might be useful (which is to notice that we're failing to keep up with our prefetch distance). Without INDEX_SCAN_MIN_TUPLE_DISTANCE, for those queries where prefetch distance collapses to ~2.0, we really can "decide to switch back to prefetching". But maintaining the option of switching back costs us too much (that's what we need INDEX_SCAN_MIN_TUPLE_DISTANCE to manage). 
Without INDEX_SCAN_MIN_TUPLE_DISTANCE, for those queries where prefetch distance collapses to ~2.0, we really can "decide to switch back to prefetching". But maintaining the option of switching back costs us too much (that's what we need INDEX_SCAN_MIN_TUPLE_DISTANCE to manage). > 3) Now that I look at the code in index_scan_stream_read_next, it feels > a bit weird that we make the decision based on the "immediate" distance only. I > suspect this may make it quite fragile, in the sense that even a small > local irregularity in the data may result in different "step" changes. > Wouldn't it be better to base this on some "average" distance? > > In other words, I'm afraid (2) and (3) are pretty much a "performance > cliff", where a tiny difference in the input can result in wildly > different behavior. You can say the same thing about hash join spilling. It might not be practical to make a strong guarantee that this will never ever happen. It might be more useful to focus on finding a way that makes it as rare as possible. If problems like this are possible, but require a "perfect storm" of buffer hits and misses that occur in precisely the same order, then maybe it can't be too much of a problem in practice. Since it shouldn't occur again and again. > > Much cleanup work remains to get the changes I just described in > > proper shape (to say nothing about open items that we haven't made a > > start on yet, like moving the read stream out of indexam.c and into > > heapam). But it has been too long since the last revision. I'd like to > > establish a regular cadence for posting new revisions of the patch > > set. > > > > Thank you! I appreciate the collaboration, it's a huge help. I've enjoyed our collaboration. Feels like things are definitely moving in the right direction. This is definitely a challenging project. > I kept running the stress test, trying to find cases that regress, and > also to better understand the behavior. The script/charts are available > here: https://github.com/tvondra/prefetch-tests > > So far I haven't found any massive regressions (relative to master). > There are data sets where we regress by a couple percent (and it's not > noise). I haven't looked into the details, but I believe most of this > can be attributed to the "AIO costs" we discussed recently (with > signals), and similar things. The overall picture that these tests show is a positive one. I think that this might actually be an acceptable performance profile, across the board. What's not acceptable is the code itself, and the current uncertainty about how fragile our current approach is. I hope that we can make it less fragile in the coming weeks. -- Peter Geoghegan
On 9/15/25 17:12, Peter Geoghegan wrote: > On Mon, Sep 15, 2025 at 9:00 AM Tomas Vondra <tomas@vondra.me> wrote: >> Yeah, this heuristics seems very effective in eliminating the regression >> (at least judging by the test results I've seen so far). Two or three >> question bother me about it, though: > > I more or less agree with all of your concerns about the > INDEX_SCAN_MIN_TUPLE_DISTANCE optimization. > >> We're prefetching too close ahead, so close the I/O can't possibly >> complete, and the overhead of submitting the I/O using AIO is higher >> than what what async "saves". >> >> That's great, but is the distance a good measure of that? > > The big underlying problem is that the INDEX_SCAN_MIN_TUPLE_DISTANCE > heuristic was developed in a way that overrelied on brute-force > testing. It probably is still flawed in some specific way that some > query will actually run into, that cannot be justified. > > So at this point INDEX_SCAN_MIN_TUPLE_DISTANCE should still be > considered nothing more than a promising experiment. It's promising > because it does appear to work with a large variety of queries (we > know of no query where it doesn't basically work right now). My hope > is that we'll be able to come up with a simpler and more robust > approach. One where we fully understand the downsides. > Agreed. >> 2) It's a one-time decision, not adaptive. We start prefetching, and >> then at some point (not too long after the scan starts) we make a >> decision whether to continue with prefetching or not. And if we disable >> it, it's disabled forever. That's fine for the synthetic data sets we >> use for testing, because those are synthetic. I'm not sure it'll work >> this well for real-world data sets where different parts of the file may >> be very different. > > We'll probably need to accept some hard trade-off in this area. > > In general (not just with this patch), prefetching works through trial > and error -- the errors are useful information, and useful information > isn't free. The regressions that the INDEX_SCAN_MIN_TUPLE_DISTANCE > heuristic addresses are cases where the errors seem unlikely to pay > for themselves. Let's not forget that these are not huge regressions > -- it's not as if the patch ever does completely the wrong thing > without INDEX_SCAN_MIN_TUPLE_DISTANCE. It's more like it hurts us to > be constantly on the verge of doing the right thing, but never quite > doing the right thing. > > Fundamentally, we need to be willing to pay for the cost of > information through which we might be able to do better. We might be > able to get the cost down, through some kind of targeted optimization, > but it's unlikely to ever be close to free. > True. Useful information is not free, and we can construct "adversary" cases for any heuristics. But I'd like to be sure the hard trade off really is inevitable. >> This is perfectly fine for a WIP patch, but I believe we should try to >> make this adaptive. Which probably means we need to invent a "light" >> version of read_stream that initially does sync I/O, and only switches >> to async (with all the expensive initialization) later. And then can >> switch back to sync, but is ready to maybe start prefetching again if >> the data pattern changes. > > That does seem like it'd be ideal. But how are we supposed to decide > to switch back? > > Right now, disabling prefetching disables the only way that we have to > notice that prefetching might be useful (which is to notice that we're > failing to keep up with our prefetch distance). 
Without > INDEX_SCAN_MIN_TUPLE_DISTANCE, for those queries where prefetch > distance collapses to ~2.0, we really can "decide to switch back to > prefetching". But maintaining the option of switching back costs us > too much (that's what we need INDEX_SCAN_MIN_TUPLE_DISTANCE to > manage). > I imagined (with no code to support it) we'd do the sync I/O through the read_stream. That way it'd know about the buffer hits and misses, and could calculate the "distance" (even if it's not used by the sync I/O). Sure, it's not perfect, because "stream distance" is not the same as "tuple distance". But we could calculate the "tuple distance", no? In the "sync" mode the stream could also switch to non-AIO reads, eliminating the signal bottleneck. >> 3) Now that I look at the code in index_scan_stream_read_next, it feels >> a bit weird that we make the decision based on the "immediate" distance only. I >> suspect this may make it quite fragile, in the sense that even a small >> local irregularity in the data may result in different "step" changes. >> Wouldn't it be better to base this on some "average" distance? >> >> In other words, I'm afraid (2) and (3) are pretty much a "performance >> cliff", where a tiny difference in the input can result in wildly >> different behavior. > > You can say the same thing about hash join spilling. It might not be > practical to make a strong guarantee that this will never ever happen. > It might be more useful to focus on finding a way that makes it as > rare as possible. > Sure, it applies to various places where we "flip" to a different execution mode. All I'm saying is maybe we should try not to add more such cases. > If problems like this are possible, but require a "perfect storm" of > buffer hits and misses that occur in precisely the same order, then > maybe it can't be too much of a problem in practice. Since it > shouldn't occur again and again. > I'm not sure it's such a "perfect storm", really. Imagine an index where half the leafs are "nice" and get very high indexdiff values, while the other half are "not nice" and get very low indexdiff. It's a matter of random chance which leaf you get at INDEX_SCAN_MIN_DISTANCE_NBATCHES. >>> Much cleanup work remains to get the changes I just described in >>> proper shape (to say nothing about open items that we haven't made a >>> start on yet, like moving the read stream out of indexam.c and into >>> heapam). But it has been too long since the last revision. I'd like to >>> establish a regular cadence for posting new revisions of the patch >>> set. >>> >> >> Thank you! I appreciate the collaboration, it's a huge help. > > I've enjoyed our collaboration. Feels like things are definitely > moving in the right direction. This is definitely a challenging > project. > >> I kept running the stress test, trying to find cases that regress, and >> also to better understand the behavior. The script/charts are available >> here: https://github.com/tvondra/prefetch-tests >> >> So far I haven't found any massive regressions (relative to master). >> There are data sets where we regress by a couple percent (and it's not >> noise). I haven't looked into the details, but I believe most of this >> can be attributed to the "AIO costs" we discussed recently (with >> signals), and similar things. > > The overall picture that these tests show is a positive one. I think > that this might actually be an acceptable performance profile, across > the board. > Perhaps. 
> What's not acceptable is the code itself, and the current uncertainty > about how fragile our current approach is. I hope that we can make it > less fragile in the coming weeks. > Agreed. regards -- Tomas Vondra