Re: SP-GiST micro-optimizations
From:        Heikki Linnakangas
Subject:     Re: SP-GiST micro-optimizations
Date:
Msg-id:      503DD2AB.5050902@enterprisedb.com
In reply to: Re: SP-GiST micro-optimizations (Ants Aasma <ants@cybertec.at>)
List:        pgsql-hackers
On 28.08.2012 22:50, Ants Aasma wrote:
> On Tue, Aug 28, 2012 at 9:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Seems like that's down to the CPU not doing "rep stosq" particularly
>> quickly, which might well be chip-specific.
>
> AMD optimization manual [1] states the following:
>
>     For repeat counts of less than 4k, expand REP string instructions
>     into equivalent sequences of simple AMD64 instructions.
>
> Intel optimization manual [2] doesn't provide equivalent guidelines,
> but the graph associated with string instructions states about 30
> cycles of startup latency. The mov based code on the other hand
> executes in 6 cycles and can easily overlap with other non-store
> instructions.
>
> [1] http://support.amd.com/us/Processor_TechDocs/25112.PDF
> [2] http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf

Hmm, sounds like gcc just isn't doing a very good job then.

I also tried replacing the memset with variable initialization:
"spgChooseOut out = { 0 }" (and I moved that to where the memset was).
In that case, gcc produced the same (fast) sequence of movq's I got
with -mstringop-strategy=unrolled_loop.

Out of curiosity, I also tried this on clang. It produced this,
regardless of whether I used MemSet, memset, or a variable initializer:

	pxor	%xmm0, %xmm0
	.loc	1 2040 4                # spgdoinsert.c:2040:4
	movaps	%xmm0, -1280(%rbp)
	movaps	%xmm0, -1296(%rbp)
	movaps	%xmm0, -1312(%rbp)

So, it's using movaps to clear the struct in 16-byte chunks. perf annotate
shows that that's comparable in speed to the code gcc produced for MemSet.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
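[Editor's note: for readers following along, here is a minimal standalone sketch of the two zeroing styles being compared above (an explicit memset of the struct versus C brace initialization, which the compiler may expand into an unrolled sequence of stores). The struct and function names are hypothetical placeholders, not the actual spgChooseOut definition or PostgreSQL code.]

    #include <string.h>

    /* Hypothetical output struct standing in for spgChooseOut. */
    typedef struct DemoOut
    {
        int     resultType;
        double  scratch[16];
        char    buf[64];
    } DemoOut;

    void
    zero_with_memset(DemoOut *out)
    {
        /* Explicit fill of the whole struct; may compile to rep stosq. */
        memset(out, 0, sizeof(DemoOut));
    }

    void
    zero_with_initializer(void)
    {
        /*
         * Brace initialization: the compiler is free to emit an unrolled
         * sequence of plain stores (e.g. movq or movaps) instead of a
         * rep-string instruction.
         */
        DemoOut out = { 0 };

        (void) out;             /* silence unused-variable warning */
    }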