From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 152ea782-5bd4-4435-b021-0ab2da61e63d@vondra.me
In reply to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: index prefetching
List: pgsql-hackers
Hi,

I ran some more tests comparing the two patches, using data sets generated in a way that gives a more gradual transition between the correlated and random cases. I'll explain how the new benchmark generates the data sets (the goal, and limitations), then discuss some of the results, and close with a brief conclusion / next steps for index prefetching ...

data sets
---------

I experimented with several ways to generate such data sets, and what I ended up doing is this:

  INSERT INTO t
  SELECT i, md5(i::text)
    FROM generate_series(1, $rows) s(i)
   ORDER BY i + $fuzz * (random() - 0.5)

See the "generate-*.sh" scripts for the exact details. The basic idea is that we generate a sequence of $rows values, but we also allow each value to jump a random distance determined by $fuzz. With fuzz=0 we get perfect correlation, with fuzz=1 a value can move by one position, with fuzz=1000 by up to 1000 positions, and so on. For very high fuzz (~rows) the result is close to random. So fuzz is the primary knob.

Obviously, incrementing fuzz by 1 would mean way too many tests, with very little change between them. Instead I used a doubling strategy - 0, 1, 2, 4, 8, 16, 32, 64, ..., $rows - so it takes only ~25 steps for $fuzz to exceed $rows=10M.

I also used "fillfactor" as another knob, determining how many items fit on a heap page. I tried 20-40-60-80-100, but this turned out to not have much impact. From now on I'll use fillfactor=20; results and charts for the other fillfactors are in the github repo [1].

I generated some charts visualizing the data sets - see [2] and [3] (there are also PDFs, but those are pretty huge). The charts show percentiles of blocks vs. values, in either dimension. [2] shows percentiles of the "value" (from the sequence) for 1MB chunks of the table. It looks very correlated (a simple "diagonal" line) because the ranges are so narrow, but at fuzz ~256k the randomness starts to show. [3] shows the other direction, i.e. percentiles of heap blocks for ranges of values. The patterns are almost exactly the same, it's very symmetrical. Fuzz -1 means "random with uniform distribution". It's clear the "most random" data set (fuzz ~8M) is still quite different - there's still some correlation - but the behavior seems fairly close to random.

I don't claim those data sets are perfect, or a great representation of particular (real-world) data sets, but they give a much nicer transition between the random and correlated cases. I have some ideas how to evolve this, for example introducing duplicate (non-unique) values, or longer runs. The other thing that annoys me a bit is the weird behavior close to the beginning/end of the table, where the percentiles get closer and closer. I suspect this might affect runs that happen to hit those parts, adding some "noise" into the results.
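To make this concrete, here is roughly what building one data set looks like end to end. The authoritative version is in the "generate-*.sh" scripts [1]; the column names and the index are illustrative, with rows=10M, fuzz=1000, fillfactor=20 as example values:

  -- illustrative sketch, not the actual generate-*.sh script
  DROP TABLE IF EXISTS t;
  CREATE TABLE t (a bigint, b text) WITH (fillfactor = 20);

  -- rows = 10M, fuzz = 1000: each value may shift by up to ~1000 positions
  INSERT INTO t
  SELECT i, md5(i::text)
    FROM generate_series(1, 10000000) s(i)
   ORDER BY i + 1000 * (random() - 0.5);

  -- an index on the sequence column (assumed, for the index scans below)
  CREATE INDEX ON t (a);
  VACUUM ANALYZE t;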
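A quick way to see what fuzz does to the physical ordering is the planner's own correlation statistic (this assumes the column name from the sketch above): fuzz=0 should report ~1.0, dropping towards 0 as fuzz approaches $rows.

  -- correlation between physical row order and value order (from ANALYZE)
  SELECT tablename, attname, correlation
    FROM pg_stats
   WHERE tablename = 't' AND attname = 'a';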
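And for reference, charts like [2] boil down to a query of roughly this shape - my reconstruction, not the exact script; it assumes the default 8kB block size (i.e. 128 heap blocks per 1MB chunk) and the column name from the sketch above:

  -- per-1MB-chunk percentiles of the sequence value;
  -- (ctid::text::point)[0] extracts the heap block number of a row
  SELECT (ctid::text::point)[0]::bigint / 128 AS chunk_1mb,
         percentile_cont(ARRAY[0.05, 0.25, 0.5, 0.75, 0.95])
           WITHIN GROUP (ORDER BY a::float8) AS value_percentiles
    FROM t
   GROUP BY 1
   ORDER BY 1;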
results
-------

I'm going to talk about the results from the Ryzen machine with NVMe RAID, with 10M rows (about 3.7GB with fillfactor=20) [4]. There are also results from "ryzen / SATA RAID" [8] and "Xeon / NVMe" [7], and from the 1M data sets, but the conclusions are almost exactly the same as with the earlier benchmarks.

- ryzen-nvme-cold-10000000-20-16-scaled.pdf [5]

This compares master, simple and complex prefetch with different io_method values (in columns) and fuzz values (in rows, starting from fuzz=0). In most cases the two patches perform fairly close to each other - the green and red data series mostly overlap.

But there are cases where the complex patch performs much better - especially for low fuzz values. That's not surprising, because those cases require a higher prefetch distance, and the complex patch can do that. It surprised me a bit that the complex patch can help even in cases where I'd not expect prefetching to help very much - e.g. fuzz=0 is perfectly correlated, so I'd expect read-ahead to work just fine. Yet the complex patch can help ~2x (at least when scanning a larger fraction of the data).

- ryzen-nvme-cold-10000000-20-16-scaled-relative.pdf [6]

Some of the differences are more visible on this chart, which shows the patches relative to master (1.0 means "as fast as master", 0.5 means 2x faster, etc.). I think there are a couple "fuzz ranges" with distinct behaviors:

* 0-1: simple is mostly on par with master, complex is actually quite a bit faster
* 2-4: both mostly on par with master
* 8-256: zone of regressions (compared to master)
* 512-64K: mixed results (good for low selectivity, then regressions)
* 128K+: clear benefits

The results from the other systems follow this pattern too, although the ranges may be shifted a bit.

There are some interesting differences between the io_method values. In a number of cases the "sync" method performs much worse than "worker" and "io_uring" - which is not entirely surprising, but it supports my argument that we should stick with "worker" as the default for PG18. But that's not the topic of this thread.

There are also a couple cases where "simple" performs better than "complex". But most of the time this happens only with the "sync" io_method, and only when scanning a significant fraction of the data (10%+). That doesn't seem like a great argument in favor of the simple patch, considering that "sync" is not a proper AIO method, that I've been arguing against using it as the default, and that with methods like "worker" the complex patch often performs better ...
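For anyone rerunning these benchmarks, switching between the io_method variants compared above looks like this. (io_method is the PG18 AIO GUC and only takes effect after a restart; the effective_io_concurrency value is just an example of raising the read stream distance cap, not a setting taken from my runs.)

  ALTER SYSTEM SET io_method = 'worker';  -- or 'sync' / 'io_uring'
  -- restart the server for io_method to take effect, then e.g.
  SET effective_io_concurrency = 64;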
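As for the queries, think of index range scans of varying selectivity - this is just my shorthand for their shape, see [1] for the exact scripts:

  -- scan ~1% of the 10M rows through the index (illustrative range)
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM t WHERE a BETWEEN 1000000 AND 1100000;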
conclusion
----------

Let's say the complex patch is the way to go. What are the open problems / missing parts we need to address to make it committable? I can think of the following issues, though I'm sure the list is incomplete and there are many "smaller issues" and things I haven't even thought about:

1) Making sure the interface can work for other index AMs (both in core and out-of-core), including cases like GiST etc.

2) Proper layering between the index AM and the table AM (like the TID issue pointed out by Andres some time ago).

3) Allowing more flexible management of the prefetch distance (this might involve something like the "scan manager" idea suggested by Peter), various improvements to the ReadStream heuristics, etc.

4) More testing to minimize the risk of regressions.

5) Figuring out how to make this work for IOS (the simple patch has some special logic in the callback, which may not be great; I'm not sure what the right solution is in the complex patch).

6) ????

regards

[1] https://github.com/tvondra/index-prefetch-tests-2
[2] https://github.com/tvondra/index-prefetch-tests-2/blob/master/visualize/datasets.png
[3] https://github.com/tvondra/index-prefetch-tests-2/blob/master/visualize/datasets-2.png
[4] https://github.com/tvondra/index-prefetch-tests-2/tree/master/ryzen-nvme/10
[5] https://github.com/tvondra/index-prefetch-tests-2/blob/master/ryzen-nvme/10/ryzen-nvme-cold-10000000-20-16-scaled.pdf
[6] https://github.com/tvondra/index-prefetch-tests-2/blob/master/ryzen-nvme/10/ryzen-nvme-cold-10000000-20-16-scaled-relative.pdf
[7] https://github.com/tvondra/index-prefetch-tests-2/blob/master/xeon-nvme/10/xeon-nvme-cold-10000000-20-16-scaled-relative.pdf
[8] https://github.com/tvondra/index-prefetch-tests-2/blob/master/ryzen-sata/10/ryzen-sata-cold-10000000-20-16-scaled-relative.pdf

--
Tomas Vondra