Re: pgsql: Add parallel-aware hash joins.
From    | Thomas Munro
Subject | Re: pgsql: Add parallel-aware hash joins.
Date    |
Msg-id  | CAEepm=2WFQrwDD4rEVKjdPBQ_di_0n=xYST1MDRsucZ=JgDPxg@mail.gmail.com
In response to | Re: pgsql: Add parallel-aware hash joins. (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: pgsql: Add parallel-aware hash joins. (Tom Lane <tgl@sss.pgh.pa.us>)
List    | pgsql-committers
On Sun, Dec 31, 2017 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Right.  That's apparently unrelated and is the last build-farm issue
>> on my list (so far).  I had noticed that certain BF animals are prone
>> to that particular failure, and they mostly have architectures that I
>> don't have, so a few things are probably just differently sized.  At
>> first I thought I'd tweak the tests so that the parameters were always
>> stable, and I got as far as installing Debian on qemu-system-ppc (it
>> took a looong time to compile PostgreSQL), but that seems a bit cheap
>> and flimsy... better to fix the size estimation error.
>
> "Size estimation error"?  Why do you think it's that?  We have exactly
> the same plan in both cases.

I mean that ExecChooseHashTableSize() estimates the hash table size like this:

    inner_rel_bytes = ntuples * tupsize;

... but then at execution time, in the Parallel Hash case, we do memory accounting not in tuples but in chunks.  The various participants pack tuples into 32KB chunks, and they trigger an increase in the number of batches when the total size of all chunks exceeds the memory budget.  In this case they do so unexpectedly, because of extra overhead at execution time that the planner didn't account for.  When we happen to be close to a threshold, in this case between choosing 8 batches and 16 batches, we can get it wrong and have to increase nbatch at execution time.

Non-parallel Hash also has such fragmentation: there are headers plus extra space at the end of each chunk, especially the end of the final chunk.  But it doesn't matter there, because the executor doesn't count the overhead either.  For Parallel Hash I do count the overhead, because it reduces IPC if all the accounting is done in 32KB chunks.
I'm torn between (1) posting a patch that teaches ExecChooseHashTableSize() to estimate the worst-case extra fragmentation, assuming every participant contributes an almost entirely empty chunk at the end, and (2) just finding some parameters (ie tweaking work_mem or the number of tuples) that will make this work on all computers in the build farm.  I think the former is the correct solution.  Another solution would be to teach the executor to discount the overhead, but that seems hard and seems like it's travelling in the wrong direction.

> My guess is that what's happening is that one worker or the other ends
> up processing the whole scan, or the vast majority of it, so that that
> worker's hash table has to hold substantially more than half of the
> tuples and thereby is forced to up the number of batches.  I don't see
> how you can expect to estimate that situation exactly; or if you do,
> you'll be pessimizing the plan for cases where the split is more nearly
> equal.

That sort of thing does indeed affect the size at execution time.  You can see that run-to-run variation easily with a small join forced to use Parallel Hash, so that there is a race to load tuples: you get a larger size if more workers manage to load at least one tuple, due to their final partially filled chunks.

There is also the question of this being underestimated on systems without real atomics:

    bucket_bytes = sizeof(HashJoinTuple) * nbuckets;

The real size at execution time is sizeof(dsa_pointer_atomic) * nbuckets.  I don't think that's responsible for this particular underestimation problem, because the bucket array is currently not considered at execution time when deciding to increase batches -- it should be, and I'll come back to those two problems separately.

--
Thomas Munro
http://www.enterprisedb.com