Re: PATCH: postpone building buckets to the end of Hash (in HashJoin)
From | Tomas Vondra
---|---
Subject | Re: PATCH: postpone building buckets to the end of Hash (in HashJoin)
Date |
Msg-id | 56858BF6.2000002@2ndquadrant.com
In reply to | PATCH: postpone building buckets to the end of Hash (in HashJoin) (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List | pgsql-hackers
Hi,

attached is v2 of the patch, with a bugfix and two significant
improvements:

1) bugfix: forgotten memset() in ExecHashIncreaseNumBatches()

This caused segfaults whenever we started with a single batch and then
had to increase the number of batches. (A minimal sketch of the failure
mode is at the end of this mail.)

2) 0002: postpone the batching (not just the build of buckets)

When ExecChooseHashTableSize finds out we'll probably need batching, we
now start with nbatch=1 anyway, and only switch to batching (with the
estimated number of batches) after filling work_mem. (See the second
sketch at the end of this mail.)

This helps in two basic cases. Firstly, when we over-estimate, we may
end up batching even when a single batch would do (i.e. without writing
to temp files at all). Batching is really expensive, and this helps
with that - not entirely, because we can only distinguish the "no
batching vs. batching" case and not the number of batches, but that's
the more interesting case anyway (the difference between batching with
16 or 32 batches is not large, at least in my experience).

The other use case is the patch adding bloom filters, because it allows
properly sizing the bloom filter, which needs the number of distinct
values. While the filter might be sized using the estimated number of
tuples passed to the Hash, it's likely to be larger than needed and
thus more expensive. So sizing the bloom filter is crucial, and this
change allows accumulating enough data to estimate its size first.
More discussion in the other patch. (The third sketch at the end of
this mail shows why the number of distinct values drives the filter
size.)

3) 0003: eliminate the MaxAlloc limit

Currently the size of the bucket array is limited by MaxAlloc, which
means it can't exceed 512MB (assuming I've done my math correctly);
with 8-byte bucket pointers and a power-of-two array size, that's
2^26, i.e. ~67M buckets. This removes the MaxAlloc limit and just
keeps the (INT_MAX/2) limit, so ~2B rows. I don't think it's worth
extending further at this point.

There was a discussion about this; quoting the f2fc98fb commit
message:

    Limit the size of the hashtable pointer array to not more than
    MaxAllocSize, per reports from Kouhei Kaigai and others of
    "invalid memory alloc request size" failures. There was discussion
    of allowing the array to get larger than that by using the "huge"
    palloc API, but so far no proof that that is actually a good idea,
    and at this point in the 9.5 cycle major changes from old behavior
    don't seem like the way to go.

I believe the objections were correct, because it'd really waste a lot
of memory in case of over-estimates. I do however think that doing (1)
and (2) fixes this issue, because with those changes we never allocate
the buckets based on the initial estimate - we pretty much get the
exact number of buckets necessary.

I haven't done any performance testing yet (well, I did, but nothing
reliable enough for sharing here). I'll post more data early January,
once the machine completes the other tests it's running right now.

I've also been thinking about what other optimizations might be
possible, and the one thing I'm wondering about is adding a HyperLogLog
counter. The thing is that we size the buckets based on the number of
rows, but that may be nonsense - to size the hash table, we actually
need the number of distinct values. When the Hash contains duplicate
rows (say 10 rows for each value), we end up with only ~10% of the
buckets containing any data (about 10 tuples in a list). If we knew
the number of distinct values, we could make the bucket array smaller
and thus reduce the memory requirements significantly. (A sketch of
distinct-based sizing closes this mail.) I mention this because the
patch with bloom filters already uses HLL to size the bloom filters,
so this would not really introduce any new code.
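On (1), here is a minimal standalone sketch (illustrative names, not
the actual executor code) of why the forgotten memset() segfaults: a
freshly allocated bucket array must be zeroed before tuples are
re-linked into it, otherwise the bucket chains start at garbage
pointers:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct HashTuple
    {
        struct HashTuple *next;     /* next tuple in the same bucket */
        uint32_t          hashvalue;
    } HashTuple;

    static HashTuple **
    rebuild_buckets(int nbuckets)
    {
        HashTuple **buckets = malloc(nbuckets * sizeof(HashTuple *));

        /* the forgotten step - without this the array holds garbage,
         * and walking a bucket chain dereferences a wild pointer */
        memset(buckets, 0, nbuckets * sizeof(HashTuple *));

        return buckets;
    }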
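On (2), a hedged sketch of the deferred-batching idea (field and
function names are illustrative, not the actual patch): ignore the
planner's nbatch estimate at first, and only switch to batching once
the accumulated tuples actually exceed the memory budget:

    #include <stddef.h>

    typedef struct HashState
    {
        int    nbatch;           /* current number of batches, starts at 1 */
        int    nbatch_estimated; /* what ExecChooseHashTableSize predicted */
        size_t space_used;       /* bytes of tuples accumulated so far */
        size_t space_allowed;    /* the work_mem budget in bytes */
    } HashState;

    static void
    hash_insert(HashState *hs, size_t tuple_size)
    {
        hs->space_used += tuple_size;

        if (hs->nbatch == 1 && hs->space_used > hs->space_allowed)
        {
            /* only now do we pay for batching - if the estimate was
             * an over-estimate and work_mem is never exceeded, we
             * never get here and never touch temp files */
            hs->nbatch = (hs->nbatch_estimated > 1)
                ? hs->nbatch_estimated : 2;
            /* ... repartition accumulated tuples into batch files ... */
        }
    }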
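On the bloom filter sizing, the standard formulas show why the number
of distinct values (not the tuple count) is what matters: both the bit
array size and the number of hash functions scale with the number of
distinct elements inserted, so sizing by tuples with 10 duplicates per
value makes the filter ~10x larger than needed. (Illustrative helpers,
not code from the bloom filter patch.)

    #include <math.h>
    #include <stddef.h>

    /* m = -n * ln(p) / (ln 2)^2 bits, for n distinct values and
     * target false positive rate p */
    static size_t
    bloom_nbits(double ndistinct, double fpr)
    {
        double ln2 = log(2.0);

        return (size_t) ceil(-ndistinct * log(fpr) / (ln2 * ln2));
    }

    /* k = (m / n) * ln 2 hash functions */
    static int
    bloom_nhashes(double ndistinct, size_t nbits)
    {
        return (int) round((double) nbits / ndistinct * log(2.0));
    }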
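And finally a sketch of the distinct-based sizing mentioned above
(again illustrative, under the assumption that ndistinct comes from an
HLL counter): choose the bucket count from the estimated number of
distinct values, rounded up to a power of two as the hash table
requires, instead of from the raw tuple count:

    static int
    choose_nbuckets(double ndistinct)
    {
        int nbuckets = 1;

        while (nbuckets < ndistinct)
            nbuckets <<= 1;

        return nbuckets;
    }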
regards

--
Tomas Vondra                    http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services