Re: BUG #18334: Segfault when running a query with parallel workers

From	Thomas Munro
Subject	Re: BUG #18334: Segfault when running a query with parallel workers
Date	May 24, 2024, 00:45:14
Msg-id	CA+hUKG+7KA6wQGx4yFBNj5KaTooErV2Ov1+m_ers4DVZWJ_mKg@mail.gmail.com
In reply to	Re: BUG #18334: Segfault when running a query with parallel workers (Marcin Barczyński <mba.ogolny@gmail.com>)
Responses	Re: BUG #18334: Segfault when running a query with parallel workers
List	pgsql-bugs

On Thu, May 23, 2024 at 11:59 PM Marcin Barczyński <mba.ogolny@gmail.com> wrote:
> (gdb) print *segment_map
> $4 = {segment = 0x56134dfa2dd8, mapped_address = 0x7f309faf4000 "",
> header = 0x7f309faf4000, fpm = 0x7f309faf4038, pagemap =
> 0x7f309faf4480}
>
> (gdb) print pageno
> $5 = 196979

Hmm.  Page 196979 is an offset of around 769MB within the segment
(pages here are 4k).  What does segment_map->segment->mapped_size
show?  It's OK for the pagemap to contain zeroes, but it should
contain non-zero values for pages that contain the start of an
allocated object.  The actual dsa_pointer has been optimised out but
should be visible from frame #1 as batch->chunks.  I think its upper
24 bits should contain 13 (the index in area->segment_maps that
seems to correspond to the above), and its lower 40 bits should
contain the offset of ~769MB.
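
To make that arithmetic concrete, here is a minimal sketch of the decoding,
assuming only what is stated above: a 64-bit dsa_pointer whose lower 40 bits
are the offset and whose remaining upper bits are the segment index, with 4k
pages.  The constant and function names are illustrative, not taken from the
PostgreSQL source, and the pointer value in the example is constructed by
hand rather than read from the core file.

```python
# Assumed layout, taken from the description above: a 64-bit dsa_pointer
# with a 40-bit offset in the low bits and the segment index above it.
OFFSET_WIDTH = 40  # illustrative name, not copied from the PostgreSQL source
OFFSET_MASK = (1 << OFFSET_WIDTH) - 1
PAGE_SIZE = 4096  # "pages here are 4k"


def decode_dsa_pointer(dp):
    """Split a 64-bit pointer value into (segment_index, offset, pageno)."""
    segment_index = dp >> OFFSET_WIDTH
    offset = dp & OFFSET_MASK
    return segment_index, offset, offset // PAGE_SIZE


def page_offset_mib(pageno):
    """Byte offset of the start of a page, expressed in MiB."""
    return pageno * PAGE_SIZE / (1024 * 1024)


if __name__ == "__main__":
    # Page number from the gdb session quoted above.
    print(f"page 196979 starts at ~{page_offset_mib(196979):.0f} MiB")
    # Hypothetical pointer value built for illustration only: segment 13,
    # pointing at the first byte of that page.
    dp = (13 << OFFSET_WIDTH) | (196979 * PAGE_SIZE)
    print(hex(dp), decode_dsa_pointer(dp))
```

With the real value from "print/x batch->chunks" in frame #1, passing it to
decode_dsa_pointer() would show whether the segment index and page number
agree with the segment_map and pageno printed above.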

The things that are unusually high so far in your emails are worker
count and work_mem, so that it can make quite large hash tables, in
your case up to 13GB.  Perhaps there is a silly arithmetic/type
problem around large numbers somewhere (perhaps somewhere near 4GB+
segments, but I don't expect segment #13 to be very large IIRC).  But
then that would fail more often I think...  It seems to be
rare/intermittent, and yet you don't have any batching or re-bucketing
in your problem (nbatch and nbuckets have their original values), so a
lot of the more complex parts of the PHJ code are not in play here.
Hmm.

I wondered if the tricky edge case where a segment gets unmapped and
then remapped in the same slot could be leading to segment
confusion.  That does involve a bit of memory order footwork.  What
CPU architecture is this?  But alas I can't come up with any case
where that could go wrong even if there is an unknown bug in that
area, because the no-rebatching, no-rebucketing case doesn't free
anything until the end when it frees everything (i.e. it never frees
something and then allocates, a requirement for slot re-use).
