Here is a quick tweak using JJ's test bed that replaces one of the slow XCHG versions with a simple memory-based byte swap. The idea is to swap the bytes directly in memory rather than converting the value back and forth in registers.
This is the code change:
invoke Sleep, 100
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT 100
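; original register version, commented out
; ror eax, 24
; xchg al, ah
; ror eax, 16
; xchg al, ah
; =========================
mov cl, [ebp]       ; load byte 0
mov dl, [ebp+2]     ; load byte 2
mov [ebp+2], cl     ; old byte 0 goes to byte 2
mov [ebp], dl       ; old byte 2 goes to byte 0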
; =========================
ENDM
counter_end
print str$(eax), 9, "cycles for 100*Hutch 3 --- mem op version", 13, 10
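As a sanity check that the memory version does the same job as the commented-out register version, here is a minimal C sketch of both. It is not part of JJ's test bed; the names swap_reg, swap_mem and check_swaps_agree are made up for illustration, and the dword is assembled from the bytes by hand so the comparison matches x86's little-endian layout regardless of the host. Both variants swap byte 0 with byte 2 and leave bytes 1 and 3 alone.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (n must be 1..31). */
static uint32_t ror32(uint32_t v, unsigned n)
{
    return (v >> n) | (v << (32u - n));
}

/* Model of "xchg al, ah": swap the two lowest bytes, keep the upper 16 bits. */
static uint32_t xchg_al_ah(uint32_t v)
{
    return (v & 0xFFFF0000u) | ((v & 0xFFu) << 8) | ((v >> 8) & 0xFFu);
}

/* Register version: ror 24 / xchg / ror 16 / xchg, as in the commented-out code. */
static uint32_t swap_reg(uint32_t v)
{
    v = ror32(v, 24);
    v = xchg_al_ah(v);
    v = ror32(v, 16);
    v = xchg_al_ah(v);
    return v;
}

/* Memory version: same order of loads and stores as the four mov instructions. */
static void swap_mem(unsigned char *p)
{
    unsigned char c = p[0];   /* mov cl, [ebp]   */
    unsigned char d = p[2];   /* mov dl, [ebp+2] */
    p[2] = c;                 /* mov [ebp+2], cl */
    p[0] = d;                 /* mov [ebp], dl   */
}

/* Hypothetical helper: asserts that both versions agree on one 4-byte input. */
static void check_swaps_agree(const unsigned char in[4])
{
    /* Build the dword the way x86 would load it from memory (little-endian). */
    uint32_t v = (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
                 ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
    unsigned char m[4] = { in[0], in[1], in[2], in[3] };
    uint32_t r = swap_reg(v);

    swap_mem(m);
    assert(m[0] == (unsigned char)(r & 0xFFu));
    assert(m[1] == (unsigned char)((r >> 8) & 0xFFu));
    assert(m[2] == (unsigned char)((r >> 16) & 0xFFu));
    assert(m[3] == (unsigned char)((r >> 24) & 0xFFu));
}
```

Calling check_swaps_agree on a buffer such as { 0x11, 0x22, 0x33, 0x44 } from a small main is enough to confirm the two sequences agree. This only checks that the results match; it says nothing about timing, which is what the test bed measures.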
These are the results on my Core2 quad.
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
1437 cycles for 100*push etc Dave
221 cycles for 100*mov etc JJ
1047 cycles for 100*Hutch 1
1045 cycles for 100*Hutch 2
605 cycles for 100*Hutch 3 --- mem op version
695 cycles for 100*Hutch 4
1436 cycles for 100*push etc Dave
221 cycles for 100*mov etc JJ
1047 cycles for 100*Hutch 1
1045 cycles for 100*Hutch 2
605 cycles for 100*Hutch 3 --- mem op version
696 cycles for 100*Hutch 4
--- ok ---