Re: Re: CRC
От | Tom Lane |
---|---|
Тема | Re: Re: CRC |
Дата | |
Msg-id | 17602.976405583@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Re: Re: CRC (Bruce Guenter <bruceg@em.ca>) |
Ответы |
Re: Re: CRC
Re: Re: CRC |
Список | pgsql-hackers |
Bruce Guenter <bruceg@em.ca> writes: >> This is a lot closer than I'd have expected, but it sure ain't >> "MD5 40% faster" as you reported. I wonder why the difference >> in results between your platform and mine? > The difference is likely because PA-RISC (like most other RISC > architectures) lack a "roll" opcode that is very prevalent in the MD5 > algorithm. A good theory, but unfortunately not a correct theory. PA-RISC can do a circular shift in one cycle using the "shift right double" instruction, with the same register specified as both high and low halves of the 64-bit operand. And gcc does know about that. After some groveling through assembly code, it seems that the CRC-32 implementation is about as tight as it could get: two loads, two XORs, and two EXTRU's per byte (one used to implement the right shift, the other to implement masking with 0xFF). And the wall clock timing does indeed come out to just about six cycles per byte. The MD5 code also looks pretty tight. Each basic OP requires either two or three logical operations (and/or/xor/not) depending on which round you're looking at, plus four additions and a circular shift. PA-RISC needs two cycles to load an arbitrary 32-bit constant, but other than that I see no wasted cycles here: ldil L'-1444681467,%r20xor %r3,%r14,%r19ldo R'-1444681467(%r20),%r20and %r1,%r19,%r19addl %r15,%r20,%r20xor %r14,%r19,%r19addl%r19,%r26,%r19addl %r20,%r19,%r15shd %r15,%r15,27,%r15addl %r15,%r3,%r15 Note gcc has been smart enough to assign all the correct_words[] array elements to registers, else we'd lose another cycle to a load operation --- fortunately PA-RISC has lots of registers. There are 64 of these basic OPs needed in each round, and each round processes 64 input bytes, so basically you can figure one OP per byte. Ignoring loop overhead and so forth, it's nine or ten cycles per byte for MD5 versus six for CRC. I'm at a loss to see how a Pentium would arrive at a better result for MD5 than for CRC. For one thing, it's going to be at a disadvantage because it hasn't got enough registers. I'd be interested to see the assembly code... regards, tom lane
В списке pgsql-hackers по дате отправления: