I was wondering if this code below is truly optimized.
I suspect there are some opcode tricks that could reduce its size or improve its speed.
For indirect calls and jumps, here is the code I'm using, based on this patch: https://patchwork.kernel.org/patch/10143779/
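NOSPEC_JMP MACRO target:REQ
    PUSH target                 ;; the RET in the thunk jumps to this address
    JMP x86_indirect_thunk
ENDM
NOSPEC_CALL MACRO target:REQ
    LOCAL nospec_call_start
    LOCAL nospec_call_end
    JMP nospec_call_end
nospec_call_start:
    PUSH target                 ;; the RET in the thunk jumps to this address
    JMP x86_indirect_thunk
nospec_call_end:
    CALL nospec_call_start      ;; pushes the real return address (the instruction after the macro)
ENDM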
.CODE
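;; This special sequence prevents the CPU from speculating on indirect calls.
x86_indirect_thunk:
    CALL retpoline_call_target  ; pushes the address of capture_speculation
capture_speculation:
    PAUSE                       ; reached only speculatively, via the return predictor
    JMP capture_speculation
retpoline_call_target:
IFDEF WIN64
    LEA RSP,[RSP+8]             ; discard that return address without touching the flags
ELSE
    LEA ESP,[ESP+4]
ENDIF
    RET                         ; returns to the address pushed by the macro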
It is interesting, particularly this part:
capture_speculation:
PAUSE
JMP capture_speculation
which is never actually executed architecturally; only speculative execution can reach it.
In particular, it could (possibly?) be carried over into speed tests with tight loops, to clear out the branch predictions.
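On the size question, one variant that may be worth comparing is the per-register thunk used by the Linux kernel and by GCC's retpoline output. This is only a sketch for the x64 case, assuming the branch target is already in RAX. The names x86_indirect_thunk_rax, capture_spec and set_target are made up for illustration and are not part of the code above. Instead of pushing the target and then skipping the thunk's own return address with LEA, the thunk overwrites that return address in place:

```asm
; Sketch only: x64 per-register thunk, target address assumed to be in RAX.
; All names here are illustrative, not taken from the original listing.
.CODE
x86_indirect_thunk_rax PROC
    CALL set_target         ; pushes the address of capture_spec
capture_spec:
    PAUSE                   ; reached only speculatively
    LFENCE                  ; the Linux thunks add this next to PAUSE
    JMP capture_spec
set_target:
    MOV [RSP], RAX          ; overwrite that return address with the real target
    RET                     ; "returns" to the address held in RAX
x86_indirect_thunk_rax ENDP

; Call sites, with the target already loaded into RAX:
;     CALL x86_indirect_thunk_rax     ; replaces CALL RAX
;     JMP  x86_indirect_thunk_rax     ; replaces JMP RAX
```

With this shape each call site is a single direct CALL or JMP, so the PUSH and the jump-over-and-call-back sequence in NOSPEC_CALL go away. The trade-off is that the target has to sit in a known register, and one thunk is needed per register used. Whether it is actually faster would have to be measured.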