Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
От | Tom Lane |
---|---|
Тема | Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex |
Дата | |
Msg-id | 22170.1248108347@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex (Jeremy Kerr <jk@ozlabs.org>) |
Ответы |
Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
|
Список | pgsql-hackers |
Jeremy Kerr <jk@ozlabs.org> writes: > Rather than testing single bits in a loop, change AllocSetFreeIndex to > use the __builtin_clz() function to calculate the chunk index. > This requires a new check for __builtin_clz in the configure script. > Results in a ~2% performance increase on sysbench on PowerPC. I did some performance testing on this by extracting the AllocSetFreeIndex function into a standalone test program that just executed it a lot of times in a loop. And there's a problem: on x86_64 it is not much of a win. The code sequence that gcc generates for __builtin_clz is basically bsrl %eax, %eaxxorl $31, %eax and it turns out that Intel hasn't seen fit to put a lot of effort into the BSR instruction. It's constant time, all right, but on most of their CPUs that constant time is like 8 or 16 times slower than an ADD; cf http://www.intel.com/Assets/PDF/manual/248966.pdf What I found on both a couple-years-old Xeon and a pretty new Macbook is that the __builtin_clz code is actually *slower* than the naive loop for the fewest-iterations case (request size up to 16 bytes) and about a breakeven at the next size category (up to 32). It doesn't start to look like a useful win until you get to sizes up to 1KB or so. Unfortunately PG spends a lot more time allocating little structs than big ones. I suppose that it's more of a win on PPC (where probably __builtin_clz is actually fast), but on the basis of these results it's going to be no gain and maybe a loss on Intel. The other problem is that this approach is gcc-specific, which seems particularly bad for something that's only going to be a win on non-Intel platforms; they're much more likely to be using non-gcc compilers. I wonder whether it would work better to do manual unrolling of the shift loop. That is, forget about the builtin and do something like while (size >= 8) { idx += 4; size >>= 4; } while (size) { idx++; size >>= 1; } Obviously there's room for experimental optimization of the particular shift strides we use here, but remember that the total shift count is never going to exceed about 10, and will usually be quite a bit less. I did some quick experimentation along these lines and couldn't really persuade myself that unrolling was going to win much, but that's an Intel-specific result. Anyway, based on what I'm seeing here I can't convince myself that this is worth introducing a compiler dependency for. regards, tom lane
В списке pgsql-hackers по дате отправления: