Jochen, did you time the version of the macro I posted a couple of pages back?
Here it is:
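Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
movups xmm0,[ow0]            ; load both 128-bit operands (unaligned)
movups xmm1,[ow1]
pcmpeqb xmm0,xmm1            ; 0FFh in every byte that is equal
pmovmskb ecx,xmm0            ; one bit per byte: 1 = equal
xor ecx,0FFFFh               ; invert: 1 = different, ZF set if all 16 bytes match
jz @l2                       ; operands are equal: leave with ZF=1
and ecx,7FFFh                ; ignore byte 15, it is always loaded into ah/dh below
bsr ecx,ecx                  ; highest differing byte below 15 (source is 0 if only byte 15 differs)
mov ah,byte ptr [ow0+15]     ; top byte (holds the sign) becomes the high half
mov dh,byte ptr [ow1+15]
mov al,byte ptr [ow0+ecx]    ; highest differing lower byte becomes the low half
mov dl,byte ptr [ow1+ecx]
cmp ax,dx                    ; 16-bit cmp stands in for the 128-bit cmp
@l2:
ENDM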
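In case it helps anyone reading along, here is a minimal usage sketch for the macro. It is only an illustration under assumptions: I assume the MASM32 SDK (masm32rt.inc with its print and exit macros), and valA, valB and the labels are made-up names. Pointers go in esi/edi because the macro overwrites eax, ecx, edx, xmm0 and xmm1.

```asm
include \masm32\include\masm32rt.inc   ; assumption: MASM32 SDK is installed
.686
.xmm

; paste the Cmp128JJAlexSSE_1 macro definition here

.data
valA dd 1, 0, 0, 0        ; hypothetical 128-bit value, least significant dword first
valB dd 2, 0, 0, 0

.code
start:
    mov esi, offset valA  ; not eax/ecx/edx: the macro overwrites them
    mov edi, offset valB
    Cmp128JJAlexSSE_1 esi, edi
    jl  a_less            ; signed: valA is below valB
    jg  a_greater         ; signed: valA is above valB
    print "equal", 13, 10
    jmp done
a_less:
    print "less", 13, 10
    jmp done
a_greater:
    print "greater", 13, 10
done:
    exit
end start
```

With these two values only byte 0 differs, so the macro should end up comparing ax=0001h with dx=0002h and the jl branch should be taken. For an unsigned compare, jb/ja would be used in place of jl/jg.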
For me this version is faster than the original "_1" macro. You can also try changing it like so:
movzx eax,word ptr [ow0+14]
movzx edx,word ptr [ow1+14]
but for me it is slower than the version above it.
Timings for it (this run still includes your old macro, my testbed is a bit outdated):
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2189320 cycles [x][x][x] - Cmp128Nidud
2295837 cycles [x][x][x] - Cmp128NidudSSE
2773387 cycles [x][x][x] - Cmp128Dave
4033478 cycles [x][x][x] - Cmp128Dave2
1597228 cycles [x][x][x] - Cmp128JJAlexSSE_1
1622741 cycles [x][x][x] - Cmp128JJAlexSSE_2
1905774 cycles [x][x][x] - Cmp128JJAlexSSE_3
993931 cycles [x][x][ ] - Cmp128Alex
1859714 cycles [x][x][x] - Cmp128Alex_2
1901902 cycles [x][x][x] - Cmp128Alex_3
1994856 cycles [x][x][ ] - Cmp128JJSSE
1346269 cycles [x][x][ ] - AxCMP128bitProc3
1311894 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
741050 cycles [x][ ][ ] - Cmp128DaveU
770599 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Timings for Cmp128_timingsOQ
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2696 kCycles [x][x][x] - Cmp128Dave
2713 kCycles [x][x][x] - Cmp128Nidud
3125 kCycles [x][x][x] - Cmp128NidudSSE
945 kCycles [x][x][ ] - Cmp128Alex
1932 kCycles [x][x][x] - MasmBasic Ocmp
1485 kCycles [x][x][x] - MasmBasic Qcmp
1639 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1604 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1595 kCycles [x][x][x] - Cmp128JJAlexSSE_3
1360 kCycles [x][x][ ] - AxCMP128bitProc3
1274 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
Timings for Cmp128_timingsO
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2856 kCycles [x][x][x] - Cmp128Dave
2752 kCycles [x][x][x] - Cmp128Nidud
3128 kCycles [x][x][x] - Cmp128NidudSSE
956 kCycles [x][x][ ] - Cmp128Alex
1928 kCycles [x][x][x] - MasmBasic Ocmp
1641 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1601 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1592 kCycles [x][x][x] - Cmp128JJAlexSSE_3
1361 kCycles [x][x][ ] - AxCMP128bitProc3
1272 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---