More random musing ... have you considered making the jump-target fields
in expressions be relative rather than absolute indexes? That is,
EEO_JUMP would look like
		op += (stepno); \
		EEO_DISPATCH(); \
instead of
		op = &state->steps[stepno]; \
		EEO_DISPATCH(); \
I have not carried out a full patch to make this work, but just making
that one change and examining the generated assembly code looks promising.
Instead of this
	movslq	40(%r14), %r8
	salq	$6, %r8
	addq	24(%rbx), %r8
	movq	%r8, %r14
	jmp	*(%r8)
we get this
	movslq	40(%r14), %rax
	salq	$6, %rax
	addq	%rax, %r14
	jmp	*(%r14)
which certainly looks like it ought to be faster. Also, the real reason
I got interested in this at all is that with relative jumps, groups of
steps would be position-independent within the steps array, which would
enable some compile-time tricks that seem impractical with the current
definition.
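
To illustrate the position-independence point, here is a minimal sketch. It is not the executor code: ToyStep, toy_run, toy_run_at and the TOY_* opcodes are all made-up names, and a plain switch stands in for the real dispatch. The jump field holds a relative offset, so the same group of steps should behave identically wherever it is copied into the steps array.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy opcodes; these are not the real ExprEvalOp values. */
typedef enum { TOY_ADD, TOY_JUMP_IF_ODD, TOY_NOP, TOY_DONE } ToyOp;

typedef struct ToyStep
{
	ToyOp		opcode;
	int			arg;	/* TOY_ADD: addend; TOY_JUMP_IF_ODD: relative offset */
} ToyStep;

/* Run from steps[start]; jump targets are relative to the current step. */
static int
toy_run(const ToyStep *steps, int nsteps, int start, int acc)
{
	const ToyStep *op = &steps[start];

	for (;;)
	{
		assert(op >= steps && op < steps + nsteps);
		switch (op->opcode)
		{
			case TOY_ADD:
				acc += op->arg;
				op++;
				break;
			case TOY_JUMP_IF_ODD:
				if (acc & 1)
					op += op->arg;	/* relative: array base is not needed */
				else
					op++;
				break;
			case TOY_NOP:
				op++;
				break;
			case TOY_DONE:
				return acc;
		}
	}
}

/* A group of steps: add 1; if the result is odd, skip the "add 100"; done. */
static const ToyStep toy_group[] = {
	{TOY_ADD, 1},
	{TOY_JUMP_IF_ODD, 2},
	{TOY_ADD, 100},
	{TOY_DONE, 0}
};

/* Copy the group, unmodified, to position "offset" and run it there. */
static int
toy_run_at(int offset, int acc)
{
	ToyStep		steps[16];
	int			i;

	assert(offset >= 0 && offset + 4 <= 16);
	for (i = 0; i < 16; i++)
	{
		steps[i].opcode = TOY_NOP;
		steps[i].arg = 0;
	}
	memcpy(&steps[offset], toy_group, sizeof(toy_group));
	return toy_run(steps, 16, offset, acc);
}
```

With absolute indexes, the "2" in the jump step would have to be rewritten every time the group moved; here toy_run_at(0, acc) and toy_run_at(5, acc) run exactly the same bytes.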

BTW, now that I've spent a bit of time looking at the generated assembly
code, I'm kind of disinclined to believe any arguments about how we have
better control over branch prediction with the jump-threading
implementation. At least with current gcc (6.3.1 on Fedora 25) at -O2,
what I see is multiple places jumping to the same indirect jump
instruction :-(. It's not a total disaster: as best I can tell, all the
uses of EEO_JUMP remain distinct. But gcc has chosen to implement about
40 of the 71 uses of EEO_NEXT by jumping to the same couple of
instructions that increment the "op" register and then do an indirect
jump :-(.

So it seems that we're at the mercy of gcc's whims as to which instruction
dispatches will be distinguishable to the hardware; which casts a very
dark shadow over any benchmarking-based arguments that X is better than Y
for branch prediction purposes. Compiler version differences are likely
to matter a lot more than anything we do.
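
For anyone who wants to repeat the experiment on a different compiler version, here is a cut-down sketch of the jump-threading style. It relies on the GCC/Clang labels-as-values extension, and the names (MiniStep, mini_eval, mini_demo, MINI_NEXT, MINI_JUMP) are invented for illustration; this is not the real interpreter. At the source level every MINI_NEXT expands to its own indirect goto. Whether those stay distinct in the object code is the thing to check.

```c
#include <assert.h>

/* Hypothetical opcodes, used as indexes into the dispatch table below. */
typedef enum { MINI_INC, MINI_DOUBLE, MINI_SKIP_IF_BIG, MINI_DONE } MiniOp;

typedef struct MiniStep
{
	MiniOp		opcode;
	int			jump;	/* relative offset, used only by MINI_SKIP_IF_BIG */
} MiniStep;

static int
mini_eval(const MiniStep *op, int acc)
{
	/* Label addresses indexed by opcode (GCC/Clang extension). */
	static void *const dispatch_table[] = {
		&&CASE_MINI_INC,
		&&CASE_MINI_DOUBLE,
		&&CASE_MINI_SKIP_IF_BIG,
		&&CASE_MINI_DONE
	};

#define MINI_DISPATCH()	goto *dispatch_table[op->opcode]
#define MINI_NEXT()		do { op++; MINI_DISPATCH(); } while (0)
#define MINI_JUMP(off)	do { op += (off); MINI_DISPATCH(); } while (0)

	MINI_DISPATCH();

CASE_MINI_INC:
	acc += 1;
	MINI_NEXT();

CASE_MINI_DOUBLE:
	acc *= 2;
	MINI_NEXT();

CASE_MINI_SKIP_IF_BIG:
	if (acc > 10)
		MINI_JUMP(op->jump);
	MINI_NEXT();

CASE_MINI_DONE:
	return acc;

#undef MINI_DISPATCH
#undef MINI_NEXT
#undef MINI_JUMP
}

/* Tiny program: increment; if now > 10 jump straight to done; else double. */
static int
mini_demo(int acc)
{
	static const MiniStep program[] = {
		{MINI_INC, 0},
		{MINI_SKIP_IF_BIG, 2},
		{MINI_DOUBLE, 0},
		{MINI_DONE, 0}
	};

	return mini_eval(program, acc);
}
```

Compiling this with "gcc -O2 -S" and counting the indirect jmp instructions in the output shows how many of the four source-level dispatch points survived as separate jumps. The count may well differ between compiler versions and optimization levels, so a result from one setup should not be assumed to hold on another.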

			regards, tom lane