Re: Streaming read-ready sequential scan code
From | Thomas Munro |
---|---|
Subject | Re: Streaming read-ready sequential scan code |
Date | |
Msg-id | CA+hUKGKXZALJ=6aArUsXRJzBm=qvc4AWp7=iJNXJQqpbRLnD_w@mail.gmail.com |
In reply to | Re: Streaming read-ready sequential scan code (Thomas Munro <thomas.munro@gmail.com>) |
Responses | Re: Streaming read-ready sequential scan code |
List | pgsql-hackers |
Yeah, I plead benchmarking myopia, sorry. The fastpath as committed is only reached when distance goes 2->1, as pg_prewarm does. Oops. With the attached minor rearrangement, it works fine. I also poked some more at that memory prefetcher. Here are the numbers I got on a desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled, cpufreq governor=performance, 2MB huge pages, shared_buffers=8GB, consumer NVMe, GCC -O3).

create table t (i int, filler text) with (fillfactor=10);
insert into t select g, repeat('x', 900) from generate_series(1, 560000) g;
vacuum freeze t;
set max_parallel_workers_per_gather = 0;

select count(*) from t;

cold = must be read from actual disk (Linux drop_caches)
warm = read from the Linux page cache
hot  = already in PostgreSQL's buffer cache via pg_prewarm

                                   cold    warm   hot
master                             2479ms  886ms  200ms
seqscan                            2498ms  716ms  211ms  <-- regression
seqscan + fastpath                 2493ms  711ms  200ms  <-- fixed, I think?
seqscan + memprefetch              2499ms  716ms  182ms
seqscan + fastpath + memprefetch   2505ms  710ms  170ms  <-- \O/

Cold shows no difference. That's just my disk demonstrating Linux readahead at its default 128kB; random I/O is obviously a more interesting story. It's consistently a smidgen faster with Linux readahead set to 2MB (as in blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect probably also increases on fancier, faster storage than what I have on hand:

                                   cold
master                             1775ms
seqscan + fastpath + memprefetch   1700ms

Warm is faster as expected (fewer system calls schlepping data from kernel to userspace).

The interesting column is hot. The 200ms->211ms regression is due to the extra bookkeeping in the slow path. The rejiggered fastpath code fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can you reproduce that?

The memory prefetching trick, on top of that, seems to be a good optimisation so far. Note that it's not an entirely independent trick: it's something we can only do now that we can see into the future. It's the next level up of prefetching, worth doing around 60ns before you need the data, I guess (a rough sketch of what I mean follows at the end of this message). Who knows how thrashed the cache might be before the caller gets around to accessing that page, but there doesn't seem to be much of a cost or downside to this bet. We know there are many more opportunities like that[1], but I don't want to second-guess the AM here; I'm just betting that the caller is going to look at the header.

Unfortunately there seems to be a subtle bug hiding somewhere in here, visible on macOS on CI. Looking into that, going to find my Mac...

[1] https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com
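To make that bet concrete, here is a minimal sketch of the memory prefetch idea, not the actual read_stream.c change: the names DemoStream and demo_stream_next are made up for illustration, and only GCC/Clang's __builtin_prefetch is real. The point is just that, once the stream can see one step ahead, it can pull the first cache line of the page it is about to hand back (where the page header lives) toward the CPU before the caller touches it.

/*
 * Illustration only: prefetch the first cache line of the page just
 * before handing the page to the caller.  DemoStream/demo_stream_next
 * are hypothetical stand-ins, not PostgreSQL code.
 */
#include <stddef.h>

typedef struct DemoStream
{
    char  **pages;      /* pinned pages in stream order (hypothetical) */
    size_t  next;       /* index of the next page to hand out */
    size_t  npages;     /* total pages currently in the stream */
} DemoStream;

static char *
demo_stream_next(DemoStream *stream)
{
    char   *page;

    if (stream->next >= stream->npages)
        return NULL;            /* stream exhausted */

    page = stream->pages[stream->next++];

    /*
     * Bet that the caller will look at the page header next: pull its
     * first cache line toward the CPU now, a few tens of nanoseconds
     * before it is needed (0 = read access, 3 = high temporal locality).
     */
    __builtin_prefetch(page, 0, 3);

    return page;
}

As you'd expect, a trick like this can only show up where there is no I/O wait to hide the cache miss, which lines up with the hot column above.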
Attachments