Re: index prefetching

From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 152ea782-5bd4-4435-b021-0ab2da61e63d@vondra.me
In reply to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: index prefetching
List: pgsql-hackers
Hi,

I ran some more tests comparing the two patches, using data sets
generated in a way that gives a more gradual transition between the
correlated and random cases.

I'll explain how the new benchmark generates the data sets (the goal,
and the limitations). Then I'll discuss some of the results. And then
there's a brief conclusion / next steps for index prefetching ...


data sets
---------

I experimented with several ways to generate such data sets, and what I
ended up doing is this:

  INSERT INTO t SELECT i, md5(i::text)
    FROM generate_series(1, $rows) s(i)
   ORDER BY i + $fuzz * (random() - 0.5)

See the "generate-*.sh" scripts for the exact details.

The basic idea is that we generate a sequence of $rows values, but we
also allow the values to jump a random distance determined by $fuzz.
With fuzz=0 we get perfect correlation, with fuzz=1 the value can move
by one position, with fuzz=1000 it can move by up to 1000 positions, and
so on. For very high fuzz (~rows) this will be close to random.
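A simple way to sanity-check the resulting data set is the planner's
correlation statistic (this assumes the indexed column is named "a",
which may not match the scripts exactly):

  -- correlation is ~1.0 for fuzz=0 and approaches ~0 for random data
  ANALYZE t;
  SELECT attname, correlation
    FROM pg_stats
   WHERE tablename = 't' AND attname = 'a';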

So this fuzz is the primary knob. Obviously, incrementing fuzz by 1
would mean way too many tests, with very little change between them.
Instead, I used a doubling strategy - 0, 1, 2, 4, 8, 16, 32, 64, ...
$rows. This way it takes only ~25 steps for the $fuzz to exceed
$rows=10M.
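Expressed in SQL, the series looks roughly like this (just a sketch,
the actual values come from the driver script):

  -- 0 followed by powers of two up to 2^24 (~16M, exceeding rows=10M)
  SELECT 0 AS fuzz
  UNION ALL
  SELECT power(2, i)::bigint FROM generate_series(0, 24) s(i);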

I also used "fillfactor" as another knob, determining how many items
fit on a heap page. I used 20-40-60-80-100, but this turned out to not
have much impact. From now on I'll use fillfactor=20; results and
charts for the other fillfactors are in the github repo [1].
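Put together, one benchmark combination looks roughly like this (the
column names here are illustrative, see the scripts for the real DDL):

  -- 10M rows, fillfactor=20, fuzz=1024
  CREATE TABLE t (a bigint, b text) WITH (fillfactor = 20);
  INSERT INTO t SELECT i, md5(i::text)
    FROM generate_series(1, 10000000) s(i)
   ORDER BY i + 1024 * (random() - 0.5);
  CREATE INDEX ON t (a);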

I generated some charts visualizing the data sets - see [2] and [3]
(there are also PDFs, but those are pretty huge). The charts show
percentiles of blocks vs. values, in either dimension. [2] shows
percentiles of "value" (from the sequence) for 1MB chunks. It seems
very correlated (a simple "diagonal" line), because the ranges are so
narrow. But at fuzz ~256k the randomness starts to show.

[3] shows the other direction, i.e. percentiles of heap blocks for
ranges of values. But the patterns are almost exactly the same; it's
very symmetrical.
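The per-chunk percentiles behind [2] can be computed by a query along
these lines (a sketch - 1MB is 128 pages of 8kB, and the column name
"a" is again an assumption):

  -- extract the block number from ctid, group blocks into 1MB chunks
  SELECT (ctid::text::point)[0]::bigint / 128 AS chunk,
         percentile_cont(ARRAY[0.05, 0.25, 0.5, 0.75, 0.95]::float8[])
           WITHIN GROUP (ORDER BY a) AS value_percentiles
    FROM t GROUP BY 1 ORDER BY 1;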

Fuzz -1 means "random with uniform distribution". It's clear the "most
random" data set (fuzz ~8M) is still quite different - there's still
some correlation. But the behavior seems fairly close to random.
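(For fuzz=-1 the placement is simply random - roughly equivalent to
the following, though see the generate-*.sh scripts for what actually
runs:)

  INSERT INTO t SELECT i, md5(i::text)
    FROM generate_series(1, 10000000) s(i)
   ORDER BY random();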

I don't claim those data sets are perfect, or a great representation
of particular (real-world) data sets. But it's a much nicer transition
between random and correlated data sets. I have some ideas how to
evolve this, for example introducing some duplicate (not unique)
values, and also longer runs - see the sketch below.
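For the duplicates, one hypothetical approach would be to divide the
sequence (this is just an idea, not something the scripts do today):

  -- each value repeats ~10x, fuzz still perturbs the physical order
  INSERT INTO t SELECT i / 10, md5((i / 10)::text)
    FROM generate_series(1, 10000000) s(i)
   ORDER BY i + 1024 * (random() - 0.5);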

The other thing that annoys me a bit is the weird behavior close to
the beginning/end of the table, where the percentiles get closer and
closer (presumably because the fuzz can't move rows past the ends of
the sequence). I suspect this might affect runs that happen to hit
those parts, adding some "noise" to the results.


results
-------

I'm going to talk about the results from the Ryzen machine with NVMe
RAID, with 10M rows (which is about 3.7GB with fillfactor=20) [4].
There are also results from "ryzen / SATA RAID" and "Xeon / NVMe", and
from the 1M data sets, but the conclusions are almost exactly the same
as with the earlier benchmarks.

- ryzen-nvme-cold-10000000-20-16-scaled.pdf [5]

This compares master and the simple and complex prefetch patches with
different io_method values (in columns) and fuzz values (in rows,
starting from fuzz=0).

In most cases the two patches perform fairly close - the green and red
data series mostly overlap. But there are cases where the complex
patch performs much better - especially for low fuzz values. Which is
not surprising, because those cases require a higher prefetch
distance, and the complex patch can do that.

It surprised me a bit that the complex patch can actually help even in
cases where I'd not expect prefetching to help very much - e.g. fuzz=0
is perfectly correlated, so I'd expect read-ahead to work just fine.
Yet the complex patch can help ~2x (at least when scanning a larger
fraction of the data).
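For reference, the kind of query being measured is a range scan whose
width determines the selectivity - something along these lines (a
sketch; the exact queries are generated by the scripts in [1], and the
column name is again an assumption):

  SET enable_seqscan = off;     -- force an index scan
  SET enable_bitmapscan = off;  -- (likely needed as well; a guess)
  SELECT * FROM t WHERE a BETWEEN 1000000 AND 2000000;  -- ~10% of 10M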


- ryzen-nvme-cold-10000000-20-16-scaled-relative.pdf [6]

Some of the differences are more visible in this chart, which shows
the patches relative to master (so 1.0 means "as fast as master",
while 0.5 means 2x faster, etc.).

I think there are a couple "fuzz ranges" with distinct behaviors:

* 0-1: simple is mostly on par with "master", complex is actually
quite a bit faster

* 2-4: both mostly on par with master

* 8-256: zone of regressions (compared to master)

* 512-64K: mixed results (good for low selectivity, then regression)

* 128K+: clear benefits

The results from the other systems [7] [8] follow this pattern too,
although the ranges may be shifted a bit.

There are some interesting differences between the io_method values.
In a number of cases the "sync" method performs much worse than
"worker" and "io_uring" - which is not entirely surprising, but it
supports my argument that we should stick with "worker" as the default
for PG18. But that's not the topic of this thread.
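(For completeness, io_method is switched like this - it's a
postmaster-level GUC, so it requires a restart:)

  ALTER SYSTEM SET io_method = 'worker';  -- or 'sync', 'io_uring'
  -- then restart the server for the change to take effect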

There are also a couple of cases where "simple" performs better than
"complex". But most of the time this happens only with the "sync"
io_method, and when scanning a significant fraction of the data
(10%+). So that doesn't seem like a great argument in favor of the
simple patch, considering "sync" is not a proper AIO method, I've been
arguing against using it as the default, and with methods like
"worker" the "complex" patch often performs better ...


conclusion
----------

Let's say the complex patch is the way to go. What are the open problems
/ missing parts we need to address to make it committable?

I can think of these issues. I'm sure the list is incomplete and there
are many "smaller issues" and things I haven't even thought about:

1) Making sure the interface can work for other index AMs (both in core
and out-of-core), including cases like GiST etc.

2) Proper layering between index AM and table AM (like the TID issue
pointed out by Andres some time ago).

3) Allowing more flexible management of prefetch distance (this might
involve something like the "scan manager" idea suggested by Peter),
various improvements to ReadStream heuristics, etc.

4) More testing to minimize the risk of regressions.

5) Figuring out how to make this work for index-only scans (the simple
patch has some special logic in the callback, which may not be great;
I'm not sure what the right solution is in the complex patch).

6) ????


regards



[1] https://github.com/tvondra/index-prefetch-tests-2

[2]
https://github.com/tvondra/index-prefetch-tests-2/blob/master/visualize/datasets.png

[3]
https://github.com/tvondra/index-prefetch-tests-2/blob/master/visualize/datasets-2.png

[4]
https://github.com/tvondra/index-prefetch-tests-2/tree/master/ryzen-nvme/10

[5]
https://github.com/tvondra/index-prefetch-tests-2/blob/master/ryzen-nvme/10/ryzen-nvme-cold-10000000-20-16-scaled.pdf

[6]
https://github.com/tvondra/index-prefetch-tests-2/blob/master/ryzen-nvme/10/ryzen-nvme-cold-10000000-20-16-scaled-relative.pdf

[7]
https://github.com/tvondra/index-prefetch-tests-2/blob/master/xeon-nvme/10/xeon-nvme-cold-10000000-20-16-scaled-relative.pdf

[8]
https://github.com/tvondra/index-prefetch-tests-2/blob/master/ryzen-sata/10/ryzen-sata-cold-10000000-20-16-scaled-relative.pdf

-- 
Tomas Vondra