Re: BitmapHeapScan streaming read user and prelim refactoring

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: BitmapHeapScan streaming read user and prelim refactoring
Дата
Msg-id 20240317163809.2y6h7btfuo76r4z7@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: BitmapHeapScan streaming read user and prelim refactoring  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Ответы Re: BitmapHeapScan streaming read user and prelim refactoring  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Список pgsql-hackers
Hi,

On 2024-03-16 21:25:18 +0100, Tomas Vondra wrote:
> On 3/16/24 20:12, Andres Freund wrote:
> > That would address some of the worst behaviour, but it doesn't really seem to
> > address the underlying problem of the two iterators being modified
> > independently. ISTM the proper fix would be to protect the state of the
> > iterators with a single lock, rather than pushing down the locking into the
> > bitmap code. OTOH, we'll only need one lock going forward, so being economic
> > in the effort of fixing this is also important.
> >
>
> Can you share some details about how you identified the problem, counted
> the prefetches that happen too late, etc? I'd like to try to reproduce
> this to understand the issue better.

There's two aspects. Originally I couldn't reliably reproduce the regression
with Melanie's repro on my laptop. I finally was able to do so after I
a) changed the block device's read_ahead_kb to 0
b) used effective_io_concurrency=1

That made the difference between the BitmapAdjustPrefetchIterator() locations
very significant, something like 2.3s vs 12s.

Besides a lot of other things, I finally added debugging fprintfs printing the
pid, (prefetch, read), block number. Even looking at tiny excerpts of the
large amount of output that generates shows that two iterators were out of
sync.


> If I understand correctly, what may happen is that a worker reads blocks
> from the "prefetch" iterator, but before it manages to issue the
> posix_fadvise, some other worker already did pread. Or can the iterators
> get "out of sync" in a more fundamental way?

I agree that the current scheme of two shared iterators being used has some
fairly fundamental raciness. But I suspect there's more than that going on
right now.

Moving BitmapAdjustPrefetchIterator() to later drastically increases the
raciness because it means table_scan_bitmap_next_block() happens between
increasing the "real" and the "prefetch" iterators.

An example scenario that, I think, leads to the iterators being out of sync,
without there being races between iterator advancement and completing
prefetching:

start:
  real -> block 0
  prefetch -> block 0
  prefetch_pages = 0
  prefetch_target = 1

W1: tbm_shared_iterate(real) -> block 0
W2: tbm_shared_iterate(real) -> block 1
W1: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 0
W2: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 1
W1: read block 0
W2: read block 1
W1: BitmapPrefetch() -> prefetch_pages++ -> 1, tbm_shared_iterate(prefetch) -> 2, prefetch block 2
W2: BitmapPrefetch() -> nothing, as prefetch_pages == prefetch_target

W1: tbm_shared_iterate(real) -> block 2
W2: tbm_shared_iterate(real) -> block 3

W2: BitmapAdjustPrefetchIterator() -> prefetch_pages--
W2: read block 3
W2: BitmapPrefetch() -> prefetch_pages++, tbm_shared_iterate(prefetch) -> 3, prefetch block 3

So afaict here we end up prefetching a block that the *same process* just had
read.

ISTM that the idea of somehow "catching up" in BitmapAdjustPrefetchIterator(),
separately from advancing the "real" iterator, is pretty ugly for non-parallel
BHS and just straight up broken in the parallel case.

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Improving EXPLAIN's display of SubPlan nodes
Следующее
От: Laurenz Albe
Дата:
Сообщение: Re: pg_upgrade failing for 200+ million Large Objects