Re: Streaming read-ready sequential scan code

Поиск
Список
Период
Сортировка
От Thomas Munro
Тема Re: Streaming read-ready sequential scan code
Дата
Msg-id CA+hUKGKXZALJ=6aArUsXRJzBm=qvc4AWp7=iJNXJQqpbRLnD_w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Streaming read-ready sequential scan code  (Thomas Munro <thomas.munro@gmail.com>)
Ответы Re: Streaming read-ready sequential scan code  (Melanie Plageman <melanieplageman@gmail.com>)
Список pgsql-hackers
Yeah, I plead benchmarking myopia, sorry.  The fastpath as committed
is only reached when distance goes 2->1, as pg_prewarm does.  Oops.
With the attached minor rearrangement, it works fine.  I also poked
some more at that memory prefetcher.  Here are the numbers I got on a
desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled,
cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NMVe,
GCC -O3).

create table t (i int, filler text) with (fillfactor=10);
insert into t
select g, repeat('x', 900) from generate_series(1, 560000) g;
vacuum freeze t;
set max_parallel_workers_per_gather = 0;

select count(*) from t;

cold = must be read from actual disk (Linux drop_caches)
warm = read from linux page cache
hot = already in pg cache via pg_prewarm

                                    cold   warm    hot
master                            2479ms  886ms  200ms
seqscan                           2498ms  716ms  211ms <-- regression
seqscan + fastpath                2493ms  711ms  200ms <-- fixed, I think?
seqscan + memprefetch             2499ms  716ms  182ms
seqscan + fastpath + memprefetch  2505ms  710ms  170ms <-- \O/

Cold has no difference.  That's just my disk demonstrating Linux RA at
128kB (default); random I/O is obviously a more interesting story.
It's consistently a smidgen faster with Linux RA set to 2MB (as in
blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect
probably also increases on fancier faster storage than what I have on
hand:

                                    cold
master                            1775ms
seqscan + fastpath + memprefetch  1700ms

Warm is faster as expected (fewer system calls schlepping data
kernel->userspace).

The interesting column is hot.  The 200ms->211ms regression is due to
the extra bookkeeping in the slow path.  The rejiggered fastpath code
fixes it for me, or maybe sometimes shows an extra 1ms.  Phew.  Can
you reproduce that?

The memory prefetching trick, on top of that, seems to be a good
optimisation so far.  Note that that's not an entirely independent
trick, it's something we can only do now that we can see into the
future; it's the next level up of prefetching, worth doing around 60ns
before you need the data I guess.  Who knows how thrashed the cache
might be before the caller gets around to accessing that page, but
there doesn't seem to be much of a cost or downside to this bet.  We
know there are many more opportunities like that[1] but I don't want
to second-guess the AM here, I'm just betting that the caller is going
to look at the header.

Unfortunately there seems to be a subtle bug hiding somewhere in here,
visible on macOS on CI.  Looking into that, going to find my Mac...

[1] https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: shveta malik
Дата:
Сообщение: Re: Synchronizing slots from primary to standby
Следующее
От: Tom Lane
Дата:
Сообщение: Re: IPC::Run::time[r|out] vs our TAP tests