Re: SP-GiST micro-optimizations
From | Heikki Linnakangas |
---|---|
Subject | Re: SP-GiST micro-optimizations |
Date | |
Msg-id | 503D0D86.6080105@enterprisedb.com |
In response to | Re: SP-GiST micro-optimizations (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: SP-GiST micro-optimizations
|
List | pgsql-hackers |
On 28.08.2012 20:30, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Drilling into the profile, I came up with three little optimizations:
>
>> 1. Within spgdoinsert, a significant portion of the CPU time is spent on
>> line 2033 in spgdoinsert.c:
>
>> memset(&out, 0, sizeof(out));
>
>> That zeroes out a small struct allocated in the stack. Replacing that
>> with MemSet() makes it faster, reducing the time spent on zeroing that
>> struct from 10% to 1.5% of the time spent in spgdoinsert(). That's not
>> very much in the big scheme of things, but it's a trivial change so
>> seems worth it.
>
> Fascinating. I'd been of the opinion that modern compilers would inline
> memset() for themselves and MemSet was probably not better than what the
> compiler could do these days. What platform are you testing on?

x64, gcc 4.7.1, running Debian.

The assembly generated for the MemSet is:

        .loc 1 2033 0 discriminator 3
        movq    $0, -432(%rbp)
    .LVL166:
        movq    $0, -424(%rbp)
    .LVL167:
        movq    $0, -416(%rbp)
    .LVL168:
        movq    $0, -408(%rbp)
    .LVL169:
        movq    $0, -400(%rbp)
    .LVL170:
        movq    $0, -392(%rbp)

while the corresponding memset code is:

        .loc 1 2040 0 discriminator 6
        xorl    %eax, %eax
        .loc 1 2042 0 discriminator 6
        cmpb    $0, -669(%rbp)
        .loc 1 2040 0 discriminator 6
        movq    -584(%rbp), %rdi
        movl    $6, %ecx
        rep stosq

In fact, with -mstringop-strategy=unrolled_loop, I can coerce gcc to
produce code similar to the MemSet version:

        movq    %rax, -440(%rbp)
        .loc 1 2040 0 discriminator 6
        xorl    %eax, %eax
    .L254:
        movl    %eax, %edx
        addl    $32, %eax
        cmpl    $32, %eax
        movq    $0, -432(%rbp,%rdx)
        movq    $0, -424(%rbp,%rdx)
        movq    $0, -416(%rbp,%rdx)
        movq    $0, -408(%rbp,%rdx)
        jb      .L254
        leaq    -432(%rbp), %r9
        addq    %r9, %rax
        .loc 1 2042 0 discriminator 6
        cmpb    $0, -665(%rbp)
        .loc 1 2040 0 discriminator 6
        movq    $0, (%rax)
        movq    $0, 8(%rax)

I'm not sure why gcc doesn't choose that by default. Perhaps it's CPU
specific which variant is faster - I was quite surprised that MemSet was
such a clear win on my laptop.
Or maybe it's a speed-space tradeoff, and gcc chooses the more compact
version, although using -O3 instead of -O2 made no difference.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
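
For readers who want to look at the two code shapes themselves, here is a minimal sketch of a word-wise zeroing routine in the spirit of MemSet. It is a simplified stand-in, not PostgreSQL's actual macro: the names Demo48, zero_wordwise and all_zero are hypothetical, and the 48-byte struct is only chosen to match the six 8-byte stores in the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 48-byte struct: six 8-byte words, as in the listing above. */
typedef struct
{
	uint64_t	words[6];
} Demo48;

/*
 * Simplified MemSet-style zeroing (an assumption-laden sketch, not the
 * PostgreSQL macro): if the buffer is 8-byte aligned and its length is a
 * multiple of 8, store zero words one at a time; otherwise fall back to
 * the library memset().
 */
static inline void
zero_wordwise(void *start, size_t len)
{
	assert(start != NULL);

	if (((uintptr_t) start % sizeof(uint64_t)) == 0 &&
		(len % sizeof(uint64_t)) == 0)
	{
		uint64_t   *p = (uint64_t *) start;
		uint64_t   *stop = p + len / sizeof(uint64_t);

		while (p < stop)
			*p++ = 0;
	}
	else
		memset(start, 0, len);
}

/* Returns 1 if all len bytes at buf are zero, 0 otherwise. */
static int
all_zero(const void *buf, size_t len)
{
	const unsigned char *b = (const unsigned char *) buf;
	size_t		i;

	for (i = 0; i < len; i++)
	{
		if (b[i] != 0)
			return 0;
	}
	return 1;
}
```

Compiling a caller of zero_wordwise(&d, sizeof(d)) and a caller of memset(&d, 0, sizeof(d)) with gcc -O2 -S should let you compare the generated stores against the listings above; what the compiler actually emits will depend on its version, target CPU and flags.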