Re: BUG #18334: Segfault when running a query with parallel workers
From | Thomas Munro |
---|---|
Subject | Re: BUG #18334: Segfault when running a query with parallel workers |
Date | |
Msg-id | CA+hUKG+7KA6wQGx4yFBNj5KaTooErV2Ov1+m_ers4DVZWJ_mKg@mail.gmail.com |
In response to | Re: BUG #18334: Segfault when running a query with parallel workers (Marcin Barczyński <mba.ogolny@gmail.com>) |
Responses | Re: BUG #18334: Segfault when running a query with parallel workers |
List | pgsql-bugs |
On Thu, May 23, 2024 at 11:59 PM Marcin Barczyński <mba.ogolny@gmail.com> wrote:
> (gdb) print *segment_map
> $4 = {segment = 0x56134dfa2dd8, mapped_address = 0x7f309faf4000 "",
> header = 0x7f309faf4000, fpm = 0x7f309faf4038, pagemap =
> 0x7f309faf4480}
>
> (gdb) print pageno
> $5 = 196979

Hmm. Page 196979 is an offset of around 769MB within the segment (pages here are 4k). What does segment_map->segment->mapped_size show? It's OK for the pagemap to contain zeroes, but it should contain non-zero values for pages that contain the start of an allocated object. The actual dsa_pointer has been optimised out but should be visible from frame #1 as batch->chunks. I think its higher 24 bits should contain 13 (the element of area->segment_maps that seems to correspond to the above), and its lower 40 bits should contain that offset of ~769MB.

The things that are unusually high so far in your emails are worker count and work_mem, which allow it to build quite large hash tables, in your case up to 13GB. Perhaps there is a silly arithmetic/type problem around large numbers somewhere (perhaps somewhere near 4GB+ segments, but I don't expect segment #13 to be very large IIRC). But then that would fail more often, I think... It seems to be rare/intermittent, and yet you don't have any batching or re-bucketing in your problem (nbatch and nbuckets have their original values), so a lot of the more complex parts of the PHJ code are not in play here.

Hmm. I wondered if the tricky edge case where a segment gets unmapped and then remapped in the same slot could be leading to segment confusion. That does involve a bit of memory order footwork. What CPU architecture is this? But alas I can't come up with any case where that could go wrong even if there is an unknown bug in that area, because the no-rebatching, no-rebucketing case doesn't free anything until the end, when it frees everything (i.e. it never frees something and then allocates, which is a requirement for slot re-use).
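
For anyone checking the numbers by hand, the pointer arithmetic described above can be sketched in a few lines of C. This is only an illustration of the layout as stated in this message (segment index in the higher 24 bits, byte offset in the lower 40 bits, 4k pages). The macro and function names are made up for this sketch and are not the ones used in the PostgreSQL source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout, taken from the description above: a 64-bit pointer whose
 * higher 24 bits hold the segment index and whose lower 40 bits hold the
 * byte offset within that segment. Pages are assumed to be 4k.
 * All names below are hypothetical.
 */
#define SKETCH_OFFSET_WIDTH 40
#define SKETCH_OFFSET_MASK  ((((uint64_t) 1) << SKETCH_OFFSET_WIDTH) - 1)
#define SKETCH_PAGE_SIZE    4096

/* Segment index: the higher 24 bits. */
static uint64_t
sketch_pointer_segment(uint64_t dp)
{
    return dp >> SKETCH_OFFSET_WIDTH;
}

/* Byte offset within the segment: the lower 40 bits. */
static uint64_t
sketch_pointer_offset(uint64_t dp)
{
    return dp & SKETCH_OFFSET_MASK;
}

/* Page number that a byte offset falls on. */
static uint64_t
sketch_offset_to_page(uint64_t offset)
{
    return offset / SKETCH_PAGE_SIZE;
}

/* Build a pointer from its parts, to cross-check a value seen in gdb. */
static uint64_t
sketch_make_pointer(uint64_t segment, uint64_t offset)
{
    assert(segment < (((uint64_t) 1) << 24));
    assert(offset <= SKETCH_OFFSET_MASK);
    return (segment << SKETCH_OFFSET_WIDTH) | offset;
}
```

With these helpers, page 196979 starts at byte offset 196979 * 4096 = 806825984, which is a little over 769MB. A batch->chunks value whose segment part is 13 and whose offset part lands on that page would be consistent with the segment_map printed above.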