Following the "What is the fastest way (performance wise) to compare two 128 bit integers" thread (http://masm32.com/board/index.php?topic=2213.0) in the Campus, here is a first attempt to time comparisons of 128-bit unsigned integers.
Fifteen cycles is quite a lot, so if you have a better algo, please post it... you can test it first at line 90 of the attached source.
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
15580 cycles for 1000 * cmp128 (2 globals)
88587 cycles for 1000 * cmp128b (loop)
27107 cycles for 1000 * cmp128p (calls proc)
28611 cycles for 1000 * two pointers
P.S.: Googling yields almost nothing, apparently there are not many applications for this :(
Que?
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1824/1000 cycles
3385 cycles for 1000 * cmp128
?? cycles for 1000 * cmp128 xx
3489 cycles for 1000 * cmp128
?? cycles for 1000 * cmp128 xx
3335 cycles for 1000 * cmp128
?? cycles for 1000 * cmp128 xx
John, your CPU doesn't respect the speed limits, as usual :eusa_naughty:
OK, version B attached on top. It features a loop based macro:
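cmp128b MACRO ow0, ow1 ; both operands must be memory variables
  push esi
  push edi
  mov esi, offset ow0
  mov edi, offset ow1
  mov ecx, 16 ; byte index, starts one above the most significant byte
  .Repeat
    dec ecx
    .if Sign? ; ran past byte 0: all 16 bytes were equal
      inc ecx ; ecx = 0 again, which sets ZF (INC leaves CF untouched)
      .Break
    .endif
    movzx eax, byte ptr [esi+ecx]
    movzx edx, byte ptr [edi+ecx]
    cmp eax, edx ; zero-extended bytes, so the flags give an unsigned result
  .Until !Zero? ; stop at the first byte that differs
  pop edi
  pop esi
ENDM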
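For readers who want to cross-check results outside the testbed, the loop above can be modelled in a few lines of C. This is only a sketch, assuming both operands are plain 16-byte little-endian buffers; the names `cmp128_bytes` and `cmp128_bytes_demo` and the -1/0/+1 return convention are illustrative and not part of the attached source.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned compare of two 128-bit values stored as 16 little-endian bytes
   (byte 15 is the most significant one, as in x86 memory).
   Returns -1, 0 or +1 for a < b, a == b, a > b.
   Assumption: this mirrors only the order of the scan in cmp128b,
   not its register usage or its timing. */
int cmp128_bytes(const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 15; i >= 0; --i) {      /* start at the most significant byte */
        if (a[i] != b[i])                /* the first mismatch decides */
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;                            /* all 16 bytes matched */
}

/* Small usage example with values shaped like the testbed's oSmall/oBig. */
void cmp128_bytes_demo(void)
{
    uint8_t small[16] = {0}, big[16] = {0};
    small[15] = 0x80; small[0] = 0x01;
    big[15]   = 0x80; big[0]   = 0x03;
    assert(cmp128_bytes(small, big) == -1);
    assert(cmp128_bytes(big, small) == 1);
    assert(cmp128_bytes(big, big) == 0);
}
```

Because the bytes are compared as unsigned values, this model (like the macro) only answers the unsigned question.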
>John, your CPU doesn't respect the speed limits, as usual :eusa_naughty:
Hah! It's you being disrespectful to my CPU.
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1694/1000 cycles
3158 cycles for 1000 * cmp128
58915 cycles for 1000 * cmp128b
3162 cycles for 1000 * cmp128
58929 cycles for 1000 * cmp128b
3124 cycles for 1000 * cmp128
59191 cycles for 1000 * cmp128b
One more, adding a generic one which expects two pointers:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 2998/1000 cycles
15567 cycles for 1000 * cmp128 (2 globals)
88584 cycles for 1000 * cmp128b (loop)
27141 cycles for 1000 * cmp128p (calls proc)
29387 cycles for 1000 * two pointers
15562 cycles for 1000 * cmp128 (2 globals)
88587 cycles for 1000 * cmp128b (loop)
27100 cycles for 1000 * cmp128p (calls proc)
29431 cycles for 1000 * two pointers
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 1765/1000 cycles
9116 cycles for 1000 * cmp128 (2 globals)
73639 cycles for 1000 * cmp128b (loop)
17461 cycles for 1000 * cmp128p (calls proc)
24831 cycles for 1000 * two pointers
Jochen,
your timings:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3047/1000 cycles
1636 cycles for 1000 * cmp128 (2 globals)
56016 cycles for 1000 * cmp128b (loop)
4410 cycles for 1000 * cmp128p (calls proc)
2688 cycles for 1000 * two pointers
1696 cycles for 1000 * cmp128 (2 globals)
55860 cycles for 1000 * cmp128b (loop)
10677 cycles for 1000 * cmp128p (calls proc)
2658 cycles for 1000 * two pointers
1612 cycles for 1000 * cmp128 (2 globals)
55951 cycles for 1000 * cmp128b (loop)
10790 cycles for 1000 * cmp128p (calls proc)
2770 cycles for 1000 * two pointers
--- ok ---
Gunther
Hmmmm,
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
loop overhead is approx. 1028/1000 cycles
6340 cycles for 1000 * cmp128 (2 globals)
66042 cycles for 1000 * cmp128b (loop)
15203 cycles for 1000 * cmp128p (calls proc)
28236 cycles for 1000 * two pointers
6394 cycles for 1000 * cmp128 (2 globals)
66052 cycles for 1000 * cmp128b (loop)
15315 cycles for 1000 * cmp128p (calls proc)
28234 cycles for 1000 * two pointers
6340 cycles for 1000 * cmp128 (2 globals)
66007 cycles for 1000 * cmp128b (loop)
14605 cycles for 1000 * cmp128p (calls proc)
28232 cycles for 1000 * two pointers
--- ok ---
Hi
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
loop overhead is approx. 2450/1000 cycles
18005 cycles for 1000 * cmp128 (2 globals)
101446 cycles for 1000 * cmp128b (loop)
34947 cycles for 1000 * cmp128p (calls proc)
27351 cycles for 1000 * two pointers
17983 cycles for 1000 * cmp128 (2 globals)
103720 cycles for 1000 * cmp128b (loop)
35346 cycles for 1000 * cmp128p (calls proc)
27835 cycles for 1000 * two pointers
18020 cycles for 1000 * cmp128 (2 globals)
97862 cycles for 1000 * cmp128b (loop)
35671 cycles for 1000 * cmp128p (calls proc)
27585 cycles for 1000 * two pointers
--- ok ---
Thanks to all of you :t
If I find the time, I'll have to look at the validity of results. Perhaps an extra check of the msb is necessary.
ifidni @Environ(oAssembler), <mlv615>
oSmall qWORD 00000000000000001h, 0800000000000000h ; 7f00000000000000???
db 03h
oMedium qWORD 00000000000000002h, 0800000000000000h
db 02h
oBig qWORD 00000000000000003h, 0800000000000000h
db 01h
oSmallF qWORD 00000000000007f00h, 0800000000000000h
oMedF qWORD 00000000000008000h, 0800000000000000h
oBigF qWORD 0000000000000ff00h, 0800000000000000h
else
oSmall OWORD 080000000000000000000000000000001h
db 03h
oMedium OWORD 080000000000000000000000000000002h
db 02h
oBig OWORD 080000000000000000000000000000003h
db 01h
oSmallF OWORD 080000000000000000000000000007f00h
oMedF OWORD 080000000000000000000000000008000h
oBigF OWORD 08000000000000000000000000000ff00h
endif
Hi Jochen :t
Here is my code added, with timings.
This is a tricky replacement for PCMPEQD that makes the code SSE1-compatible. The archive contains both versions; the "default" is the SSE1-capable one.
ifdef USE_SSE1
xorps xmm1,xmm0
movaps xmm2,xmm1
cmpps xmm1,oword ptr zero128b,0
andps xmm2,xmm1
xorps xmm1,xmm2
else
pcmpeqd xmm0,xmm1
endif
The tricky thing with packed comparison is the question: is the mismatched DWORD the highest DWORD or not? Signed and unsigned comparisons look at different flags, but those flags agree with each other when both numbers are in the positive signed range.
The code corrects the flags if required.
movmskps eax,xmm1
xor al,0fH
bsr eax,eax
jz @l1
xor ecx,ecx
mov edx,dword ptr [ow0+eax*4]
cmp eax,3 ; if this was not highest-order dword
mov eax,dword ptr [ow1+eax*4]
setne cl
cmp edx,eax ; OF = SF if EDX (signed) > EAX, CF = 0 if EDX (unsigned) > EAX
jecxz @l1
; we need to preserve ZF and CF flags
; but make OF != SF if CF = 1
jns @F ; (if sign flag is set, then we do want to clear OF flag)
mov ecx,80000000h ; then mark: highest bit will become #30 bit,
@@: ; CF will become #31 bit...
setc cl ; ...lowest bit will become CF
rcr ecx,1 ; restore CF, if two highest bits set then OF = 0 else OF = 1
@l1:
I've replaced the cmp128b macro with my code to simplify adding it to the testbed.
It would probably be better to avoid JECXZ by using a jump table, but in this testbed it was simpler to use JECXZ than to create a table for every macro expansion.
The most robust usage is to test for equality first and only then for signed/unsigned greater-or-less: on the shortcut path after BSR, only ZF is guaranteed to reflect the result.
SSE2:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2231/1000 cycles
18239 cycles for 1000 * cmp128
20661 cycles for 1000 * cmp128b
18399 cycles for 1000 * cmp128
20907 cycles for 1000 * cmp128b
18365 cycles for 1000 * cmp128
20838 cycles for 1000 * cmp128b
--- ok ---
SSE1:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
++18 of 20 tests valid, loop overhead is approx. 2189/1000 cycles
18274 cycles for 1000 * cmp128
22574 cycles for 1000 * cmp128b
18327 cycles for 1000 * cmp128
22582 cycles for 1000 * cmp128b
18273 cycles for 1000 * cmp128
22632 cycles for 1000 * cmp128b
--- ok ---
Alex,
you've made good points. Thank you. :t
Here are the timings:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3341/1000 cycles
1416 cycles for 1000 * cmp128
3673 cycles for 1000 * cmp128b
1352 cycles for 1000 * cmp128
3597 cycles for 1000 * cmp128b
1405 cycles for 1000 * cmp128
3670 cycles for 1000 * cmp128b
--- ok ---
Gunther
Quote from: Antariy on August 14, 2013, 02:11:59 AM
Hi Jochen :t
Here is my code added, with timings.
Thanks a lot, Alex :t
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 1767/1000 cycles
9121 cycles for 1000 * cmp128
93407 cycles for 1000 * cmp128b
I am still busy with the macro, trying to make it return correct values. If you want to test yours, here is the core of the testbed, in qword version.
.data
qSmall qWORD 7f00000000000001h
qMedium qWORD 7f00000000000002h
qBig qWORD 7f00000000000003h
qSmallN qWORD 8000000000000001h
qMedN qWORD 8000000000000002h
qBigN qWORD 8000000000000003h
qSmallF qWORD 0800000000007f00h
qMedF qWORD 0800000000008000h
qBigF qWORD 080000000000ff00h
...
print chr$(13, 10, "Qcmp", 13, 10)
Print Str$("qSp=%i\n", qSmall)
Print Str$("qBp=%i\n", qBig)
Print Str$("qSn=%i\n", qSmallN)
Print Str$("qBn=%i\n\n", qBigN)
print chr$(13, 10, "positive (lesser, greater)", 13, 10)
Qcmp qSmall, qBig
Qcmp qBig, qSmall
print chr$(13, 10, "negative (lesser, greater)", 13, 10)
Qcmp qSmallN, qBigN
Qcmp qBigN, qSmallN
print chr$(13, 10, "pos, neg (greater, greater)", 13, 10)
jt=1
Qcmp qSmall, qBigN
Qcmp qBig, qSmallN
print chr$(13, 10, "neg, pos (lesser, lesser)", 13, 10)
Qcmp qSmallN, qBig
Qcmp qBigN, qSmall
jj, please a little more effectiveness for the macros...
useA=1
;...
CodeSize MACRO algo, overhead:=<15> ; default overhead is mov ecx, 99+mov eax, ecx+loop
pushad
mov eax, offset &algo&_endp ; OPT_Errline 0
sub eax, offset &algo&_s
sub eax, overhead
if @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) GE 'A' AND @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) LE 'Z'
% print str$(eax), 9, @CatStr(<!"bytes for &Name>,%@SubStr(<algo>,5),<!">), 13, 10
else
% print str$(eax), 9, "bytes for other", 13, 10
endif
popad
ENDM
AlgoName$ MACRO algo
if @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) GE 'A' AND @CatStr(<!'>,%@SubStr(<algo>,5),<!'>) LE 'Z'
EXITM @CatStr(<!"bytes for &Name>,%@SubStr(<algo>,5),<!">)
else
EXITM <"other">
endif
ENDM
;...
start:
push 1
call ShowCpu ; print brand string and SSE level
invoke SetProcessAffinityMask, -1, 1 ; restrict to one core
Calibrate
??cntr = 1
REPEAT ShowLoops+2
IF ??cntr LT ShowLoops+1
SpinUp
ENDIF
FORC char,<ABCDEFGHIJKLMNOPQRSTUVWXYZ>
IFDEF use&char&
IF use&char&
IF ??cntr LT ShowLoops+1
invoke Sleep, SleepMs
counter_begin TimerLoops, HIGH_PRIORITY_CLASS
call Test&char&
counter_end
ShowCycles Test&char&
ELSEIF (??cntr EQ ShowLoops+1) AND ShowSize NE 0
CodeSize Test&char&
ELSEIF (??cntr EQ ShowLoops+2) AND ShowResult NE 0
CodeResult Test&char&
ENDIF
ENDIF
ENDIF
ENDM
IF (??cntr LT ShowLoops+1) OR ((??cntr EQ ShowLoops+1) AND ShowSize NE 0) OR ((??cntr EQ ShowLoops+2) AND ShowResult NE 0)
print chr$(13, 10)
ENDIF
??cntr = ??cntr + 1
ENDM
inkey chr$(13, 10, "--- ok ---", 13)
exit
:icon_cool:
Quote from: qWord on August 14, 2013, 06:07:51 AM
jj, please a little more effectiveness for the macros...
qWord,
Thanks for this demo, guru of the macro universe :biggrin:
Right now I am too busy solving QWORD (note the uppercase) mysteries, will test it asap ;-)
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
9118 cycles for 1000 * cmp128 (2 globals, buggy)
73747 cycles for 1000 * cmp128b (loop)
35023 cycles for 1000 * OcmpJJ
94027 cycles for 1000 * OcmpAlex
@Alex: Your macro doesn't pass the full test yet... see Cmp128.asc (open with \Masm32\RichMasm\RichMasm.exe, hit F6 to build; to navigate, click on bookmarks on the right, or select a word and hit F3)
Jochen, can you make an example of a number which fails with it?
Hi Alex,
Search cmp128.asc for "if Alex" to see the source of this output (N means negative OWORD):
---------- ALEX ----------
positive (lesser, greater)
oSmall smaller oBig
oBig bigger oSmall
negative (lesser, greater)
oSmallN smaller oBigN
oBigN bigger oSmallN
pos, neg (greater, greater)
oSmall smaller oBigN
oBig smaller oSmallN
neg, pos (lesser, lesser)
oSmallN bigger oBig
oBigN bigger oSmall
pos, neg (equal, equal)
oMedium EQUALS oMedium
oSmallN EQUALS oSmallN
Or did you do unsigned comparisons??
P.S.: Ocmp works currently only for JWasm and recent ML.exe, not for ML 6.15 :(
Jochen, I'm sure it is correct. Maybe the problem is that you're using BSF to find the non-matching element, but you should use BSR, because you want to find the highest-order non-matching element. So if you're comparing the results of my code with those of yours and treating yours as right, that may be why my macro looks incorrect.
Try your macro with these numbers, for instance: 80000000800000000000000000000100 and 80000000000000000000000000000300.
Quote from: jj2007 on August 14, 2013, 09:20:21 AM
Or did you do unsigned comparisons??
No, I just made the usual CMP "emulator" for 128 bits :biggrin: It behaves just the same as CMP does, i.e. it works for signed and unsigned numbers alike; the "signedness" of the comparison is selected by the usual Jcc instructions (JG/JA/JL/JB).
Alex, thanks, will check tomorrow (too tired now, it's 1:30 AM).
But oSmallN bigger oBig looks wrong for signed comparisons - a negative number is always lower than a positive one, right?
Quote from: jj2007 on August 14, 2013, 09:20:21 AM
P.S.: Ocmp works currently only for JWasm and recent ML.exe, not for ML 6.15 :(
Yes, I used ML10 to build, but the previous, smaller testbed could be built with ML10 only, too - strange, ML6.15 gives an internal assembler error.
Quote from: jj2007 on August 14, 2013, 09:27:56 AM
Alex, thanks, will check tomorrow (too tired now, it's 1:30 AM).
But oSmallN bigger oBig looks wrong for signed comparisons - a negative number is always lower than a positive one, right?
oSmallN? I see oSmallF in the source only? Can you drop the numbers?
Quote from: jj2007 on August 14, 2013, 09:27:56 AM
Alex, thanks, will check tomorrow (too tired now, it's 1:30 AM).
But oSmallN bigger oBig looks wrong for signed comparisons - a negative number is always lower than a positive one, right?
Jochen, check the oBig definition for non-ML6.15: you've defined it as a negative number (too many zeroes, probably) :t
Again, I checked the code very carefully, it should not be wrong.
Quote from: Antariy on August 14, 2013, 09:32:10 AM
oSmallN? I see oSmallF in the source only? Can you drop the numbers?
They are in the testbed, cmp128.asc (not the *timings.asm).
oSmallN OWORD 88000000000000000000000000000001h
oBigN OWORD 88000000000000000000000000000003h
And you were absolutely right about bsr instead of bsf :t
> check the number oBig definition
Will do so, thanxalot, Alex :icon14:
New version of the testbed attached. There is still the problem with your negative numbers.
Also check how you evaluate the results; my code is used just like CMP:
CMP First128BitNumber,Second128BitNumber
JZ -> jump if first is equal to second
JA/JG -> unsigned/signed jump if first is above/greater than second
JB/JL -> unsigned/signed jump if first is below/less than second
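The list above can be made concrete with a small C model of what a 32-bit CMP leaves in the flags and which flag combinations the four jumps test. This is a sketch for illustration only; the type `Flags` and the helper names (`cmp32`, `cond_ja`, `cond_jb`, `cond_jg`, `cond_jl`, `cmp32_demo`) are made up here and appear nowhere in the testbed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The four arithmetic flags that CMP a, b (computed as a - b) leaves behind. */
typedef struct { bool zf, cf, sf, of; } Flags;

Flags cmp32(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;                          /* wraps modulo 2^32 */
    Flags f;
    f.zf = (r == 0);
    f.cf = (a < b);                              /* unsigned borrow */
    f.sf = (r >> 31) != 0;                       /* sign bit of the result */
    f.of = (((a ^ b) & (a ^ r)) >> 31) != 0;     /* signed overflow */
    return f;
}

/* Conditions tested by the conditional jumps named above. */
bool cond_ja(Flags f) { return !f.cf && !f.zf; }        /* unsigned above */
bool cond_jb(Flags f) { return f.cf; }                  /* unsigned below */
bool cond_jg(Flags f) { return !f.zf && f.sf == f.of; } /* signed greater */
bool cond_jl(Flags f) { return f.sf != f.of; }          /* signed less    */

/* The same bit patterns are "above" and "less" at the same time. */
void cmp32_demo(void)
{
    Flags f = cmp32(0x80000001u, 1u);   /* -2147483647 vs 1 when read as signed */
    assert(cond_ja(f));                 /* unsigned: first is above */
    assert(cond_jl(f));                 /* signed:   first is less  */
    assert(!cond_jb(f) && !cond_jg(f));
}
```

A single CMP therefore answers both questions at once, and the choice of Jcc decides which answer is read. A 128-bit replacement has to reproduce all four flags to offer the same choice.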
See if showresults - for technical reasons (cmc), I use Carry? and Zero?, should be sufficient for signed comparisons.
Quote from: jj2007 on August 14, 2013, 09:52:32 AM
Quote from: Antariy on August 14, 2013, 09:32:10 AM
oSmallN? I see oSmallF in the source only? Can you drop the numbers?
They are in the testbed, cmp128.asc (not the *timings.asm).
Thanks, found it after some time :redface:
Quote from: jj2007 on August 14, 2013, 09:52:32 AM
> check the number oBig definition
Will do so, thanxalot, Alex :icon14:
I've got completely entangled in the source already :greensml:
oSmallN OWORD 88000000000000000000000000000001h
oMedN OWORD 88000000000000000000000000000002h
oBigN OWORD 88000000000000000000000000000003h
"Big negative number" - what did you mean? "Big" - how? This definition oBigN is greater than oSmallN, like -1 greater than -2.
Quote from: jj2007 on August 14, 2013, 10:03:02 AM
See if showresults - for technical reasons (cmc), I use Carry? and Zero?, should be sufficient for signed comparisons.
But my code sets all flags just like CMP does, so, to check its correctness you should use a standard technique with JA/JB for unsigned numbers, and JG/JL for signed.
Quote from: Antariy on August 14, 2013, 10:13:29 AM
"Big negative number" - what did you mean? "Big" - how? This definition oBigN is greater than oSmallN, like -1 greater than -2.
Here are the corresponding qwords, which Str$() can handle:
qSmallPos= 8574853690513424385
qBigPos= 8574853690513424387
positive (lesser, greater)
qSmall lesser qBig
qBig greater qSmall
qSmallNeg= -8646911284551352319
qBigNeg= -8646911284551352317
negative (lesser, greater)
qSmallN lesser qBigN
qBigN greater qSmallN
OWORDs should behave identically. As you may have seen already, I use MbcmpO to decide which size to handle.
Yes, I meant this - oBigN is greater ("bigger" :biggrin:) than oSmallN.
But still, to check my code you need to use the appropriate Jcc instructions, because it follows the standard CPU convention for setting the flags. I checked it before posting with different kinds of numbers - it works.
Jochen, if you limit the check to the ZF and CF flags only, how would you check whether a negative number is greater than zero, for instance? That's why I preferred the standard flag setting in my version, and it seems more logical to use the comparison macro (or call) "instruction" the way one usually does with CMP.
Here (http://masm32.com/board/index.php?topic=2232.msg23009#msg23009) is the test showing that all the value-comparison conditional jumps work with my code. Request for a test :biggrin:
yes - signed branches use the OF flag, as well
sign, carry, overflow - the one they give you that noone cares about is parity :P
aux carry is rarely used, too - mainly for BCD math, i think
i was thinking of implementing a function that could compare integers of any size
INVOKE ArbCmp,nNumberOfBytes,lpFirst,lpSecond
it could compare the high-order bytes as bytes until alignment on the second operand is found
after that (if still equal), you could use an aligned method until inequality is found
don't want to use JECXZ because it's slow
Quote from: Antariy on August 14, 2013, 02:34:34 PM
Here (http://masm32.com/board/index.php?topic=2232.msg23009#msg23009) is the test showing that all the value-comparison conditional jumps work with my code. Request for a test :biggrin:
Hi
Alex,
Here it is ;-)
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
18742 cycles for 1000 * Ocmp (JJ)
88733 cycles for 1000 * cmp128b (loop)
62612 cycles for 1000 * AxCMP128bit
18750 cycles for 1000 * Ocmp (JJ)
88605 cycles for 1000 * cmp128b (loop)
63719 cycles for 1000 * AxCMP128bit
And Ocmp passes all your tests now, i.e. it behaves exactly like a cmp eax, edx.
P.S.: The extra speed comes from the bswap instruction.
Hi Jochen, here are results.
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2255/1000 cycles
25281 cycles for 1000 * Ocmp (JJ)
104235 cycles for 1000 * cmp128b (loop)
34615 cycles for 1000 * AxCMP128bit
25063 cycles for 1000 * Ocmp (JJ)
103227 cycles for 1000 * cmp128b (loop)
34196 cycles for 1000 * AxCMP128bit
28798 cycles for 1000 * Ocmp (JJ)
101244 cycles for 1000 * cmp128b (loop)
34112 cycles for 1000 * AxCMP128bit
--- ok ---
:t
Thanks to everybody, especially Alex :t
I have posted a "CMP defeats intuition (http://masm32.com/board/index.php?topic=2235.0)" thread in the Campus because I was really surprised that a negative number can be "greater" than a positive one (and yes I know this is a stupid noob error :biggrin:).
Note the FPU behaves differently.
Jochen,
the new timings for you:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 2141/1000 cycles
1959 cycles for 1000 * Ocmp (JJ)
60752 cycles for 1000 * cmp128b (loop)
6356 cycles for 1000 * AxCMP128bit
2096 cycles for 1000 * Ocmp (JJ)
60573 cycles for 1000 * cmp128b (loop)
6196 cycles for 1000 * AxCMP128bit
1918 cycles for 1000 * Ocmp (JJ)
59383 cycles for 1000 * cmp128b (loop)
6058 cycles for 1000 * AxCMP128bit
--- ok ---
Gunther
Quote from: jj2007 on August 15, 2013, 04:45:37 AM
Thanks to everybody, especially Alex :t
:icon_redface: You're welcome,
Jochen :t
Quote from: jj2007 on August 15, 2013, 04:45:37 AM
I have posted a "CMP defeats intuition (http://masm32.com/board/index.php?topic=2235.0)" thread in the Campus because I was really surprised that a negative number can be "greater" than a positive one (and yes I know this is a stupid noob error :biggrin:).
Note the FPU behaves differently.
The most annoying thing about this is that there is no full set of instructions for working with the flags selectively, like CLC/STC/CMC/CLD/STD/CLI/STI. For example, how can one simply set or clear the OF flag? Is there any other, simpler way than the one I used, i.e. via RCR? It is the only instruction I know of that can change OF, preserves ZF, and allows CF to be restored to its previous state (before the RCR).
if you try to manipulate the flags directly, you are likely to be disappointed by the speed
STC/CLC/CMC aren't too bad
POPF and SAHF are slower than you think they ought to be
SAHF doesn't allow you to manipulate the OF - stupid mistake by intel
but, you could come up with a set of operands/operations to generate the flag conditions, as desired
mov al,7Fh
mov ah,88h
sub al,ah
so, at the end of your code, you could have
SetFlags:
sub al,88h
and branch to that location with different values in AL
something like that :P
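To see which flags the three instructions above would produce, here is a minimal C model of an 8-bit SUB. It is a sketch only: `Flags8`, `sub8` and `sub8_demo` are illustrative names, and the model covers just ZF, CF, SF and OF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool zf, cf, sf, of; } Flags8;

/* Flags after SUB a, b on 8-bit operands (model of sub al, ah). */
Flags8 sub8(uint8_t a, uint8_t b)
{
    uint8_t r = (uint8_t)(a - b);                /* result wraps modulo 256 */
    Flags8 f;
    f.zf = (r == 0);
    f.cf = (a < b);                              /* unsigned borrow */
    f.sf = (r & 0x80) != 0;                      /* sign bit of the result */
    f.of = (((a ^ b) & (a ^ r)) & 0x80) != 0;    /* signed overflow */
    return f;
}

void sub8_demo(void)
{
    /* mov al,7Fh / mov ah,88h / sub al,ah */
    Flags8 f = sub8(0x7F, 0x88);
    assert(!f.zf && f.cf && f.sf && f.of);       /* below (unsigned), greater (signed) */

    /* entering at "sub al,88h" with AL = 88h gives the "equal" pattern */
    Flags8 e = sub8(0x88, 0x88);
    assert(e.zf && !e.cf && !e.sf && !e.of);
}
```

With AL = 7Fh the result is "below" for unsigned jumps and "greater" for signed ones, so one subtraction with a chosen AL is enough to produce that flag pattern.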
The solution for setting the flags in the Ocmp & Qcmp macros was actually inspired by Alex' insistence that the macro should behave exactly like a cmp reg32, reg32. So the trick is to take two OWORDs, for example:
oSmall OWORD 88000000000000000000000000000100h ; 88=NEGATIVE
oBig OWORD 88000000000000000000000000000300h
.. to scan them with pcmpeqb & bsr for the first different byte, and then "construct" two reg32:
eax = 88000001
edx = 88000003
Then, a simple cmp eax, edx sets the flags. Simple and fast...
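The idea as described can be checked with a short C model. It is a sketch of the principle only, not of the actual Ocmp macro: it assumes 16-byte little-endian operands, finds the highest differing byte with a plain loop instead of pcmpeqb/bsr, and all names (`Flags`, `cmp32`, `cmp128_construct`, `cmp128_construct_demo`) are made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool zf, cf, sf, of; } Flags;

/* Flags after a 32-bit CMP a, b. */
Flags cmp32(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    Flags f;
    f.zf = (r == 0);
    f.cf = (a < b);
    f.sf = (r >> 31) != 0;
    f.of = (((a ^ b) & (a ^ r)) >> 31) != 0;
    return f;
}

/* Build two 32-bit values from the top byte (which carries the sign) and the
   highest differing byte, then let one ordinary compare set the flags. */
Flags cmp128_construct(const uint8_t a[16], const uint8_t b[16])
{
    int i = 15;
    while (i > 0 && a[i] == b[i])        /* highest differing byte, 0 if none */
        --i;
    uint32_t x = ((uint32_t)a[15] << 24) | a[i];
    uint32_t y = ((uint32_t)b[15] << 24) | b[i];
    return cmp32(x, y);
}

void cmp128_construct_demo(void)
{
    /* oSmall = 88...0100h and oBig = 88...0300h from the post above */
    uint8_t small[16] = {0}, big[16] = {0};
    small[15] = 0x88; small[1] = 0x01;   /* constructed value 88000001h */
    big[15]   = 0x88; big[1]   = 0x03;   /* constructed value 88000003h */
    Flags f = cmp128_construct(small, big);
    assert(!f.zf && f.cf);               /* unsigned: below */
    assert(f.sf != f.of);                /* signed:   less  */
}
```

This works because the top byte alone decides the order whenever it differs, and when the top bytes match, the signed and unsigned orders coincide and the highest differing byte decides.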
that is a great solution :t
Quote from: dedndave on August 18, 2013, 02:15:39 AM
that is a great solution :t
Thanks ;-)
By the way, if you put the pcmpeqb part into a loop, it should be easy to implement the arbitrary length algo you proposed. If the length is below OWORD, you will have to clear the last bits after bsr.
i like the flag-setting solution
i am not as fond of using BSR :P
Then cmps is your candidate, Dave ;-)
just as addition, one might try this x86 solution. If I'm not wrong, it also behaves like the CMP instruction:
cmp128 macro ow0,ow1
LOCAL @NE1,@NE2,@NE3,@end
lea esi,ow0
lea edi,ow1
mov eax,[esi+0*DWORD]
mov ecx,[esi+1*DWORD]
mov edx,[esi+2*DWORD]
mov ebx,[esi+3*DWORD]
sub eax,[edi+0*DWORD]
jnz @NE1
sbb ecx,[edi+1*DWORD]
jnz @NE2
sbb edx,[edi+2*DWORD]
jnz @NE3
sbb ebx,[edi+3*DWORD]
;/* equal */
jmp @end
@NE1: sbb ecx,[edi+1*DWORD]
@NE2: sbb edx,[edi+2*DWORD]
@NE3: sbb ebx,[edi+3*DWORD]
;/* LT or GT */
jnz @end
or ebx,2 ; MOV may be better, because it breaks the dependency chain...
cmp ebx,1 ; new flags: A GT B
@end:
endm
Hi qWord,
I think you're right. Very elegant solution. :t
Gunther
Quote from: qWord on August 18, 2013, 04:14:31 AM
just as addition, one might try this x86 solution. If I'm not wrong, it also behaves like the CMP instruction
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 1765/1000 cycles
14167 cycles for 1000 * Ocmp (JJ)
73673 cycles for 1000 * cmp128b (loop)
9501 cycles for 1000 * cmp128 qWord
112854 cycles for 1000 * AxCMP128bit :t
The Timings:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 1648/1000 cycles
2375 cycles for 1000 * Ocmp (JJ)
60100 cycles for 1000 * cmp128b (loop)
4581 cycles for 1000 * cmp128 qWord
6580 cycles for 1000 * AxCMP128bit
2378 cycles for 1000 * Ocmp (JJ)
59763 cycles for 1000 * cmp128b (loop)
10781 cycles for 1000 * cmp128 qWord
6894 cycles for 1000 * AxCMP128bit
2345 cycles for 1000 * Ocmp (JJ)
59383 cycles for 1000 * cmp128b (loop)
4545 cycles for 1000 * cmp128 qWord
6595 cycles for 1000 * AxCMP128bit
--- ok ---
Well done, qWord. :t
Gunther
i don't see how qWord's code is any faster than...
OwordA OWORD ?
OwordB OWORD ?
mov eax,dword ptr OwordA[12]
mov edx,dword ptr OwordA[8]
cmp eax,dword ptr OwordB[12]
jnz FlagsSet
cmp edx,dword ptr OwordB[8]
mov eax,dword ptr OwordA[4]
jnz FlagsSet
cmp eax,dword ptr OwordB[4]
mov edx,dword ptr OwordA[0]
jnz FlagsSet
cmp edx,dword ptr OwordB[0]
FlagsSet:
;flags are set as though you had executed CMP OwordA,OwordB
in particular, register indirect addressing is a little slower than direct addressing
and, my code only pre-loads 1 dword, not all 4
in addition to all that, i do not rely on the carry flag being forwarded for SBB
dave,
nidud,
AFAICS your code does not do the same as mine. For the following test numbers,
A OWORD 0ffffffffffffffffffffffff00000001h
B OWORD 0ffffffffffffffffffffffffffffffffh
your code returns that A is greater than B (signed), which is wrong.
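The failure with these two numbers can be reproduced with a small C model of the dword-by-dword approach. This is a sketch under the assumption that the flags come from the first differing dword alone, scanning from the top; `Flags`, `cmp32`, `cmp128_dwordwise` and `dwordwise_demo` are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool zf, cf, sf, of; } Flags;

/* Flags after a 32-bit CMP a, b. */
Flags cmp32(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    Flags f;
    f.zf = (r == 0);
    f.cf = (a < b);
    f.sf = (r >> 31) != 0;
    f.of = (((a ^ b) & (a ^ r)) >> 31) != 0;
    return f;
}

/* Operands are four little-endian dwords, index 3 is the most significant.
   The flags are taken from the first differing dword only. */
Flags cmp128_dwordwise(const uint32_t a[4], const uint32_t b[4])
{
    int i = 3;
    while (i > 0 && a[i] == b[i])
        --i;
    return cmp32(a[i], b[i]);
}

void dwordwise_demo(void)
{
    /* A = 0FFFFFFFFFFFFFFFFFFFFFFFF00000001h, B = all FFh (the numbers above) */
    const uint32_t A[4] = { 0x00000001u, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu };
    const uint32_t B[4] = { 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu };
    Flags f = cmp128_dwordwise(A, B);
    assert(f.cf);                       /* unsigned "below": correct */
    assert(!f.zf && f.sf == f.of);      /* signed "greater": wrong, A is less than B */
}
```

The lowest dwords are compared as if 00000001h were a positive and 0FFFFFFFFh a negative 32-bit number, although both belong to negative 128-bit values.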
Interesting :t
decimal, QWORD size (can't display more than that with deb, sorry ;))
qqA -4294967295
qqB -1
qA LESSER qB ; Ocmp MasmBasic
qA lesser qB ; cmp128 qWord
qB GREATER qA
qB greater qA
good catch, qWord - lol
i have been using something like that for years
always for unsigned comparisons, though :P
still, if you combine my method and your method, i think you get faster CORRECT code :biggrin:
Quote from: dedndave on August 18, 2013, 01:53:45 AM
if you try to manipulate the flags directly, you are likely to be disappointed by the speed
STC/CLC/CMC aren't too bad
POPF and SAHF are slower than you think they ought to be
SAHF doesn't allow you to manipulate the OF - stupid mistake by intel
You're right, and that's why I used RCR and not pushf/popf :biggrin:
Results for latest zip in the thread:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2188/1000 cycles
24104 cycles for 1000 * Ocmp (JJ)
100352 cycles for 1000 * cmp128b (loop)
37171 cycles for 1000 * cmp128 qWord
33162 cycles for 1000 * AxCMP128bit
23649 cycles for 1000 * Ocmp (JJ)
100419 cycles for 1000 * cmp128b (loop)
37734 cycles for 1000 * cmp128 qWord
33313 cycles for 1000 * AxCMP128bit
24096 cycles for 1000 * Ocmp (JJ)
99257 cycles for 1000 * cmp128b (loop)
37482 cycles for 1000 * cmp128 qWord
33053 cycles for 1000 * AxCMP128bit
--- ok ---
Having shamelessly stolen Jochen's implementation :biggrin: and simplified it a bit, there is a "new" algo:
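JJAxCMP128bit MACRO ow0:REQ, ow1:REQ
LOCAL @l1
  movups xmm0,[ow0]
  movzx eax,word ptr [ow0+14] ; two most significant bytes of ow0
  pcmpeqb xmm0,[ow1] ; 0FFh in every byte position that matches
  movzx edx,word ptr [ow1+14] ; two most significant bytes of ow1
  pmovmskb ecx,xmm0 ; one mask bit per byte, 1 = equal
  xor ecx,0FFFFh ; invert the mask: 1 = byte differs
  bsr ecx,ecx ; index of the highest differing byte
  jz @l1 ; no difference at all
  cmp ecx,15
  jz @l1 ; top byte differs: the words at +14 decide
  mov al,byte ptr [ow0+ecx] ; otherwise put the differing byte into the LSB
  mov dl,byte ptr [ow1+ecx]
@l1:
  cmp ax,dx ; the final CMP sets all flags
ENDM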
The idea is the same: build the compared element from the MSB and put the highest mismatched byte into the LSB, but in a word-sized register. If only the highest bytes differ, it takes the shorter path and doesn't update the LSB.
And now this is a 128-bit emulation that is fully equivalent to the CMP instruction: even when the operands are equal it still executes the final CMP, so all flags are set as they should be, and right after the comparison we may use any Jcc instruction. We are no longer forced to use JZ first as in the earlier algos, which took a shortcut after PMOVMSKB/BSR where only ZF is set and the other flags are undefined; CMP, in contrast, sets all of them properly for equal operands. And even SF is set the same as CMP does :biggrin: Jochen, :t
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
++18 of 20 tests valid, loop overhead is approx. 2287/1000 cycles
23052 cycles for 1000 * Ocmp (JJ)
100429 cycles for 1000 * cmp128b (loop)
36612 cycles for 1000 * cmp128 qWord
17043 cycles for 1000 * JJAxCMP128bit
23306 cycles for 1000 * Ocmp (JJ)
102801 cycles for 1000 * cmp128b (loop)
37323 cycles for 1000 * cmp128 qWord
16847 cycles for 1000 * JJAxCMP128bit
24068 cycles for 1000 * Ocmp (JJ)
96693 cycles for 1000 * cmp128b (loop)
36452 cycles for 1000 * cmp128 qWord
17062 cycles for 1000 * JJAxCMP128bit
--- ok ---
Hi,
nidud :t
Quote from: nidud on August 18, 2013, 09:11:57 AM
maybe I'm missing something here, but will not this also work:
cmp128f macro ow0,ow1
LOCAL @end
mov eax,dword ptr ow0[12]
cmp eax,dword ptr ow1[12]
jne @end
mov eax,dword ptr ow0[8]
cmp eax,dword ptr ow1[8]
jne @end
mov eax,dword ptr ow0[4]
cmp eax,dword ptr ow1[4]
jne @end
mov eax,dword ptr ow0
cmp eax,dword ptr ow1
@end:
endm
The problem is that we need to fully "emulate" CMP behaviour, so the construct should set all flags as CMP does. Such straightforward code returns the proper result only for unsigned numbers (JA/JB/JZ). For signed numbers the result is unpredictable: if DWORDs other than the highest-order one differ, their own top bits may be set or clear, regardless of the sign of the entire OWORD they belong to. So we are forced to include the highest-order element (DWORD or BYTE) in the comparison to get the right result.
Hi Alex,
the timings:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 3662/1000 cycles
362 cycles for 1000 * Ocmp (JJ)
57227 cycles for 1000 * cmp128b (loop)
2590 cycles for 1000 * cmp128 qWord
5042 cycles for 1000 * JJAxCMP128bit
412 cycles for 1000 * Ocmp (JJ)
57178 cycles for 1000 * cmp128b (loop)
2540 cycles for 1000 * cmp128 qWord
5050 cycles for 1000 * JJAxCMP128bit
353 cycles for 1000 * Ocmp (JJ)
57250 cycles for 1000 * cmp128b (loop)
2587 cycles for 1000 * cmp128 qWord
5045 cycles for 1000 * JJAxCMP128bit
--- ok ---
Good job. :t
Gunther
very nice Alex :t
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 2075/1000 cycles
23383 cycles for 1000 * Ocmp (JJ)
97294 cycles for 1000 * cmp128b (loop)
36753 cycles for 1000 * cmp128 qWord
17587 cycles for 1000 * JJAxCMP128bit
23394 cycles for 1000 * Ocmp (JJ)
97825 cycles for 1000 * cmp128b (loop)
36873 cycles for 1000 * cmp128 qWord
17150 cycles for 1000 * JJAxCMP128bit
23387 cycles for 1000 * Ocmp (JJ)
98043 cycles for 1000 * cmp128b (loop)
36745 cycles for 1000 * cmp128 qWord
17146 cycles for 1000 * JJAxCMP128bit
Quote
Having shamelessly stolen Jochen's implementation...
the algo has even improved your English :lol:
Thank you,
Gunther and
Dave! :biggrin:
Quote from: dedndave on August 18, 2013, 08:40:17 PM
very nice Alex :t
But Gunther's CPU likes Jochen's algo much better! :biggrin: :biggrin: :biggrin: 0.3 cycles for one comparison! Anyway, it's faster there, much faster. Very interesting difference.
Quote from: dedndave on August 18, 2013, 08:40:17 PM
Quote
Having shamelessly stolen Jochen's implementation...
the algo has even improved your English :lol:
Is this a proper sentence? To be honest, very often when I'm writing something in real time, I'm very, very unsure whether I'm writing properly :redface:
Quote from: Antariy on August 18, 2013, 10:01:52 PM
But Gunther's CPU likes Jochen's algo much better! :biggrin: :biggrin: :biggrin: 0.3 cycles for one comparison! Anyway, it's faster there, much faster. Very interesting difference.
Something is wrong there - 0.3 cycles is impossible...
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 1778/1000 cycles
14132 cycles for 1000 * Ocmp (JJ)
13074 cycles for 1000 * Ocmp2 (JJ)
73846 cycles for 1000 * cmp128b (loop)
9513 cycles for 1000 * cmp128 qWord
9141 cycles for 1000 * JJAxCMP128bit
Ocmp2 uses your xor dx, 0ffffh trick.
Quote from: jj2007 on August 18, 2013, 10:16:21 PM
Something is wrong there - 0.3 cycles is impossible...
It may be some inconsistency in the timings, as often happens, but probably it shows that Gunther's CPU model runs your version faster.
Timings for the new archive:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2261/1000 cycles
26992 cycles for 1000 * Ocmp (JJ)
23136 cycles for 1000 * Ocmp2 (JJ)
108573 cycles for 1000 * cmp128b (loop)
40769 cycles for 1000 * cmp128 qWord
18240 cycles for 1000 * JJAxCMP128bit
24862 cycles for 1000 * Ocmp (JJ)
22679 cycles for 1000 * Ocmp2 (JJ)
107366 cycles for 1000 * cmp128b (loop)
38917 cycles for 1000 * cmp128 qWord
18459 cycles for 1000 * JJAxCMP128bit
24960 cycles for 1000 * Ocmp (JJ)
22641 cycles for 1000 * Ocmp2 (JJ)
107176 cycles for 1000 * cmp128b (loop)
38921 cycles for 1000 * cmp128 qWord
18207 cycles for 1000 * JJAxCMP128bit
--- ok ---
BTW: in my variation of code here:
xor ecx,0FFFFh
bsr ecx,ecx
jz @l1
we may save a couple of cycles when the numbers are equal if jz @l1 is moved before BSR, since the XOR already sets ZF (just a note, it will not improve timings in the current testbed).
AMD, as usual, prefers GPR code :biggrin:
BTW, in qWord's code, are the commented-out instructions really required? They seem to be superfluous.
push esi
push edi
push ebx
lea esi,ow0
lea edi,ow1
mov eax,[esi+0*DWORD]
mov ecx,[esi+1*DWORD]
mov edx,[esi+2*DWORD]
mov ebx,[esi+3*DWORD]
sub eax,[edi+0*DWORD]
;jnz @NE1
sbb ecx,[edi+1*DWORD]
;jnz @NE2
sbb edx,[edi+2*DWORD]
;jnz @NE3
sbb ebx,[edi+3*DWORD]
;/* equal */
;jmp @end
;@NE1: sbb ecx,[edi+1*DWORD]
;@NE2: sbb edx,[edi+2*DWORD]
;@NE3: sbb ebx,[edi+3*DWORD]
;/* LT or GT */
; jnz @end
; or ebx,2 ; MOV may be better, because it breaks the dependency chain...
; cmp ebx,1 ; new flags: A GT B
@end:
pop ebx
pop edi
pop esi
Every jump goes to the continuation of the same sequence of instructions.
CMP behaves just like SUB, so it's probably simpler and faster to just do a straight 128-bit integer subtraction with SUB/SBB/SBB/SBB.
The code:
mov eax,dword ptr [ow0]
sub eax,dword ptr [ow1]
mov eax,dword ptr [ow0+4]
sbb eax,dword ptr [ow1+4]
mov eax,dword ptr [ow0+8]
sbb eax,dword ptr [ow1+8]
mov eax,dword ptr [ow0+12]
sbb eax,dword ptr [ow1+12]
seems to be pretty fast on my machine, though still a lot slower than the SSE-powered version.
i think qWord wrote it that way to handle the special case of ZERO
if you SUB, SBB, SBB, SBB, the ZF only reflects the result of the last SBB
notice that it executes about the same number of instructions, either way
my thinking is that if the high-order dwords do not match, you shouldn't have to do any more :P
depending on the application (of course), that could be a majority of the time
so, i still think there is room for improvement
at the moment, i have one more graphing routine to write, so i haven't spent any time on it
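The ZF point can be traced with a small model. This is a Python sketch, not anybody's posted code: it walks a SUB/SBB/SBB/SBB chain over four dwords and reports the carry, the ZF left by the last SBB, and the ZF a true 128-bit compare would give. The helper names are invented for the example.

```python
MASK32 = 0xFFFFFFFF

def limbs(x):
    """Split a 128-bit value into four dwords, lowest first."""
    return [(x >> (32 * i)) & MASK32 for i in range(4)]

def sub_sbb_chain(a, b):
    """Model SUB/SBB/SBB/SBB; return (CF, ZF of last SBB, ZF of a real 128-bit CMP)."""
    borrow = 0
    any_nonzero = 0
    result = 0
    for la, lb in zip(limbs(a), limbs(b)):
        diff = la - lb - borrow
        borrow = 1 if diff < 0 else 0
        result = diff & MASK32
        any_nonzero |= result
    zf_last = 1 if result == 0 else 0        # what the CPU leaves in ZF
    zf_true = 1 if any_nonzero == 0 else 0   # what a 128-bit compare should report
    return borrow, zf_last, zf_true

a = 0x00000005_00000000_00000000_00000002
b = 0x00000005_00000000_00000000_00000001
print(sub_sbb_chain(a, b))   # (0, 1, 0): no carry, but ZF claims "equal"
print(sub_sbb_chain(b, a))   # (1, 0, 0): the borrow ripples up, ZF is fine here
```

So the carry is trustworthy after the chain, while ZF is only right when the high dwords differ or the numbers really are equal.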
Hi nidud,
If your algo yields correct results, you should sell it ;-)
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 2999/1000 cycles
18751 cycles for 1000 * Ocmp (JJ)
18730 cycles for 1000 * Ocmp2 (JJ)
3035 cycles for 1000 * cmp128n (nidud)
8218 cycles for 1000 * cmp128 qWord
22797 cycles for 1000 * JJAxCMP128bit
EDIT: See reply #74 for corrected version
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
+19 of 20 tests valid, loop overhead is approx. 2113/1000 cycles
23370 cycles for 1000 * Ocmp (JJ)
21821 cycles for 1000 * Ocmp2 (JJ)
13010 cycles for 1000 * cmp128n (nidud)
37300 cycles for 1000 * cmp128 qWord
17150 cycles for 1000 * JJAxCMP128bit
23393 cycles for 1000 * Ocmp (JJ)
21550 cycles for 1000 * Ocmp2 (JJ)
12906 cycles for 1000 * cmp128n (nidud)
36926 cycles for 1000 * cmp128 qWord
17266 cycles for 1000 * JJAxCMP128bit
23680 cycles for 1000 * Ocmp (JJ)
21558 cycles for 1000 * Ocmp2 (JJ)
13570 cycles for 1000 * cmp128n (nidud)
36904 cycles for 1000 * cmp128 qWord
17157 cycles for 1000 * JJAxCMP128bit
nice idea, nidud :t
still plenty of room for improvement, i think
using EAX all the way through has to be slowing it down
although, as a macro, it does make it more flexible
one little item....
mov eax,2
cmp eax,1
8 bytes - clears OF, ZF, CF, and SF
can be replaced with
or al,1
2 bytes - clears OF, ZF, CF, and SF
Quote from: dedndave on August 19, 2013, 01:29:03 PM
i think qWord wrote it that way to handle the special case of ZERO
if you SUB, SBB, SBB, SBB, the ZF only reflects the result of the last SBB
Ah, yes, you're right, Dave :redface:
Jochen, to get correct test running, you should change this macro:
align 16
TestC_s:
; useC=0 ; uncomment to exclude TestC
NameC equ cmp128n (nidud) ; assign a descriptive name here
TestC proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
showresults=0
cmp128n offset oSmall, offset oBig
cmp128n offset oBig, offset oSmall
cmp128n offset oMedium, offset oMedium
cmp128n offset oSmallF, offset oBigF
cmp128n offset oBigF, offset oSmallF
cmp128n offset oMedF, offset oMedF
sub ebx, 6
; dec ebx - we test 6x above
.Until Sign?
ret
TestC endp
TestC_endp:
Remove every "offset" operator before the numbers.
If you have a look at the disassembly, you'll find that the expanded macro actually compares not the numbers but their offsets (BTW, that's a very nasty "feature" of macros).
What about making the test a bit more realistic by adding some dependencies? e.g. if( Condition ) then do (operation A) else do (operation B)
:t
Quote from: Antariy on August 19, 2013, 07:52:35 PM
Jochen, to get correct test running, you should change this macro:
That's right :t
By the way: Why sbb all over the place?
jnz @NE1
mov eax,dword ptr ow0[4]
.if Carry?
INT 3
.endif
sbb eax,dword ptr ow1[4]
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 3008/1000 cycles
18753 cycles for 1000 * Ocmp (JJ)
18749 cycles for 1000 * Ocmp2 (JJ)
2688 1) cycles for 1000 * cmp128n (nidud)
8210 cycles for 1000 * cmp128 qWord
22749 cycles for 1000 * JJAxCMP128bit
1) You will be fined for violating speed limits 8)
your idea is better anyways, nidud
it just occurred to me that OR AL,1 does not necessarily clear the SF if the pre-existing value has bit 7 set
sub eax,eax
inc eax
i like it :P
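For anyone who wants to check the flag claims without a debugger, here is a small Python model of the two short sequences. It is only a sketch of the documented flag rules (OR clears CF and OF and sets ZF/SF from the result; INC leaves CF alone), and the function names are made up.

```python
def flags_or_al_1(al):
    """Flags after OR AL,1 for a given previous AL value."""
    result = (al | 1) & 0xFF
    return {"OF": 0, "SF": result >> 7, "ZF": int(result == 0), "CF": 0}

def flags_sub_inc():
    """Flags after SUB EAX,EAX followed by INC EAX."""
    cf = 0                      # SUB EAX,EAX clears CF; INC does not touch it
    result = (0 + 1) & 0xFFFFFFFF
    return {"OF": 0, "SF": result >> 31, "ZF": int(result == 0), "CF": cf}

print(flags_or_al_1(0x00))   # {'OF': 0, 'SF': 0, 'ZF': 0, 'CF': 0}
print(flags_or_al_1(0x80))   # {'OF': 0, 'SF': 1, 'ZF': 0, 'CF': 0}
print(flags_sub_inc())       # {'OF': 0, 'SF': 0, 'ZF': 0, 'CF': 0}
```

The 3-byte SUB/INC pair gives the same "greater" flag state as MOV EAX,2 / CMP EAX,1 regardless of what was in EAX before.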
something in there doesn't like my P4 :P
Cmp128TimingsNidudB
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
+19 of 20 tests valid, loop overhead is approx. 2078/1000 cycles
23549 cycles for 1000 * Ocmp (JJ)
21190 cycles for 1000 * Ocmp2 (JJ)
31788 cycles for 1000 * cmp128n (nidud)
36806 cycles for 1000 * cmp128 qWord
17161 cycles for 1000 * JJAxCMP128bit
23934 cycles for 1000 * Ocmp (JJ)
21858 cycles for 1000 * Ocmp2 (JJ)
31505 cycles for 1000 * cmp128n (nidud)
37015 cycles for 1000 * cmp128 qWord
17209 cycles for 1000 * JJAxCMP128bit
24145 cycles for 1000 * Ocmp (JJ)
21179 cycles for 1000 * Ocmp2 (JJ)
31363 cycles for 1000 * cmp128n (nidud)
36887 cycles for 1000 * cmp128 qWord
17515 cycles for 1000 * JJAxCMP128bit
i still think you can stop comparing when you find a high-order mismatch
and - no need to ripple the CF all the way from low-order to high-order if they are all equal
something like this - i need to do some testing
Cmp128Dave MACRO OwA:REQ,OwB:REQ
;OwA and OwB are pointers to memory operands
mov eax,dword ptr OwA[12]
mov edx,dword ptr OwA[8]
sub eax,dword ptr OwB[12]
.if ZERO?
cmp edx,dword ptr OwB[8]
mov ecx,dword ptr OwA[4]
.if ZERO?
cmp ecx,dword ptr OwB[4]
mov edx,dword ptr OwA[0]
.if ZERO?
cmp edx,dword ptr OwB[0]
.if !ZERO?
sbb eax,0
.endif
.endif
.endif
.endif
ENDM
jnz @NE1
; so here we are inside a ZERO branch, and there is definitely NO CARRY
mov eax,dword ptr ow0[4]
sbb eax,dword ptr ow1[4] ; therefore sbb behaves exactly like sub
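The argument in the comment can be spelled out as a tiny Python check. This is a sketch with invented helper names: after a compare that came out ZERO the carry is clear, and an SBB with a clear carry gives the same result and carry as a SUB.

```python
MASK32 = 0xFFFFFFFF

def cmp32(a, b):
    """Return (CF, ZF) as CMP a,b would set them for 32-bit operands."""
    return int(a < b), int(a == b)

def sbb32(a, b, cf_in):
    """Return (result, CF) of SBB a,b with the incoming carry cf_in."""
    diff = a - b - cf_in
    return diff & MASK32, int(diff < 0)

def sub32(a, b):
    """Return (result, CF) of SUB a,b."""
    return sbb32(a, b, 0)

samples = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]
for x in samples:
    cf, zf = cmp32(x, x)           # the ZERO branch: the operands were equal
    assert (cf, zf) == (0, 1)      # so CF is guaranteed to be clear here
    for a in samples:
        for b in samples:
            assert sbb32(a, b, cf) == sub32(a, b)
print("sbb == sub whenever the previous compare was ZERO")
```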
New attempt:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2198/1000 cycles
24957 cycles for 1000 * Ocmp (JJ)
22338 cycles for 1000 * Ocmp2 (JJ)
32907 cycles for 1000 * cmp128n (nidud)
39508 cycles for 1000 * cmp128 qWord
6635 cycles for 1000 * AxCMP128bit
24604 cycles for 1000 * Ocmp (JJ)
22414 cycles for 1000 * Ocmp2 (JJ)
32845 cycles for 1000 * cmp128n (nidud)
38513 cycles for 1000 * cmp128 qWord
6367 cycles for 1000 * AxCMP128bit
24608 cycles for 1000 * Ocmp (JJ)
22294 cycles for 1000 * Ocmp2 (JJ)
34059 cycles for 1000 * cmp128n (nidud)
38513 cycles for 1000 * cmp128 qWord
6388 cycles for 1000 * AxCMP128bit
--- ok ---
Did not look over your latest archive yet, Jochen, got it just now.
Can you please add my code to newest testbed?
Ooops, entangled the order :greensml:
Here is my code:
AxCMP128bit MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2, @l3, @l1_1, @l2_1, @l3_1, @l0
mov eax,dword ptr [ow0+12]
cmp eax,dword ptr [ow1+12]
jnz @l0
mov eax,dword ptr [ow0+8]
cmp eax,dword ptr [ow1+8]
jnz @l3
mov eax,dword ptr [ow0+4]
cmp eax,dword ptr [ow1+4]
jnz @l2
mov eax,dword ptr [ow0]
cmp eax,dword ptr [ow1]
jz @l0
@l1:
mov edx,dword ptr [ow0+12]
mov ecx,dword ptr [ow1+12]
mov dx,word ptr [ow0+2]
mov cx,word ptr [ow1+2]
cmp dx,word ptr [ow1+2]
jnz @l1_1
mov cx,word ptr [ow1]
mov dx,ax
@l1_1:
cmp edx,ecx
jmp @l0
@l2:
mov edx,dword ptr [ow0+12]
mov ecx,dword ptr [ow1+12]
mov dx,word ptr [ow0+2+4]
mov cx,word ptr [ow1+2+4]
cmp dx,word ptr [ow1+2+4]
jnz @l2_1
mov cx,word ptr [ow1+4]
mov dx,ax
@l2_1:
cmp edx,ecx
jmp @l0
@l3:
mov edx,dword ptr [ow0+12]
mov ecx,dword ptr [ow1+12]
mov dx,word ptr [ow0+2+8]
mov cx,word ptr [ow1+2+8]
cmp dx,word ptr [ow1+2+8]
jnz @l3_1
mov cx,word ptr [ow1+8]
mov dx,ax
@l3_1:
cmp edx,ecx
@l0:
ENDM
Timings:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2280/1000 cycles
25091 cycles for 1000 * Ocmp (JJ)
23674 cycles for 1000 * Ocmp2 (JJ)
33530 cycles for 1000 * cmp128n (nidud)
39507 cycles for 1000 * cmp128 qWord
12115 cycles for 1000 * AxCMP128bit
25075 cycles for 1000 * Ocmp (JJ)
22759 cycles for 1000 * Ocmp2 (JJ)
34451 cycles for 1000 * cmp128n (nidud)
39244 cycles for 1000 * cmp128 qWord
10880 cycles for 1000 * AxCMP128bit
25092 cycles for 1000 * Ocmp (JJ)
22990 cycles for 1000 * Ocmp2 (JJ)
32880 cycles for 1000 * cmp128n (nidud)
38631 cycles for 1000 * cmp128 qWord
10777 cycles for 1000 * AxCMP128bit
--- ok ---
Archive with the right code.
nidud - SUB would be faster because it does not have to wait for the carry condition to be set
Alex - odds are that the high order compare would find a mismatch
of all the possible 128-bit integers, only 1 in every 4294967296 have a specific high dword value
of all combinations of two 128-bit values,
only 1 combination in every 4294967296 (roughly) has matching high-order dwords
as you said, this is very application dependant, but i like playing those odds :P
the timing tests probably aren't set up to test a wide range of values
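The odds can be sanity-checked by brute force on a scaled-down number format. The Python sketch below is an illustration with made-up parameters (3 limbs of 2 bits instead of 4 limbs of 32 bits) and assumes uniformly random operands, which a real application may not have.

```python
from fractions import Fraction
from itertools import product

def match_probability(limb_bits, limb_count):
    """Exhaustively count the pairs whose highest limb matches."""
    total_bits = limb_bits * limb_count
    values = range(1 << total_bits)
    shift = limb_bits * (limb_count - 1)
    hits = sum(1 for a, b in product(values, repeat=2)
               if a >> shift == b >> shift)
    return Fraction(hits, (1 << total_bits) ** 2)

print(match_probability(2, 3))   # 1/4, i.e. 1 in 2**limb_bits
print(Fraction(1, 2 ** 32))      # 1/4294967296 for 32-bit limbs
```

With random data the early exit on the high dword is taken almost always; with data that shares a common prefix (counters, timestamps, sorted keys) it is not, which is the application-dependent part.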
Quote from: dedndave on August 20, 2013, 02:20:59 AM
Alex - odds are that the high order compare would find a mismatch
...
as you said, this is very application dependant, but i like playing those odds :P
No-no, Dave, I agree with you, I just did not get it the first time (in the first pages of the thread I too "voted" for checking the highest-order elements). It's late here, sorry :greensml:
lol nidud
you'd have to select specific values to compare to validate that theory
the values used in the timing test don't cover a very comprehensive range of comparisons
results from Alex's latest code
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 2134/1000 cycles
23563 cycles for 1000 * Ocmp (JJ)
21133 cycles for 1000 * Ocmp2 (JJ)
31419 cycles for 1000 * cmp128n (nidud)
37262 cycles for 1000 * cmp128 qWord
10163 cycles for 1000 * AxCMP128bit
23335 cycles for 1000 * Ocmp (JJ)
21130 cycles for 1000 * Ocmp2 (JJ)
32047 cycles for 1000 * cmp128n (nidud)
37288 cycles for 1000 * cmp128 qWord
10208 cycles for 1000 * AxCMP128bit
23770 cycles for 1000 * Ocmp (JJ)
21134 cycles for 1000 * Ocmp2 (JJ)
31739 cycles for 1000 * cmp128n (nidud)
37136 cycles for 1000 * cmp128 qWord
10172 cycles for 1000 * AxCMP128bit
My results from Alex's latest test:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++++++++++++3 of 20 tests valid, loop overhead is approx. 1678/1000 cycles
2364 cycles for 1000 * Ocmp (JJ)
2351 cycles for 1000 * Ocmp2 (JJ)
2177 cycles for 1000 * cmp128n (nidud)
10741 cycles for 1000 * cmp128 qWord
3756 cycles for 1000 * AxCMP128bit
2340 cycles for 1000 * Ocmp (JJ)
2346 cycles for 1000 * Ocmp2 (JJ)
2188 cycles for 1000 * cmp128n (nidud)
4684 cycles for 1000 * cmp128 qWord
3749 cycles for 1000 * AxCMP128bit
2420 cycles for 1000 * Ocmp (JJ)
2431 cycles for 1000 * Ocmp2 (JJ)
7600 cycles for 1000 * cmp128n (nidud)
4545 cycles for 1000 * cmp128 qWord
9999 cycles for 1000 * AxCMP128bit
--- ok ---
Gunther
:biggrin: now, you're just picking on me - lol
Quote from: Antariy on August 20, 2013, 02:01:20 AM
Ooops, entangled the order :greensml:
Attached :t
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 1766/1000 cycles
12963 cycles for 1000 * Ocmp2 (JJ)
6266 cycles for 1000 * Cmp128Dave
6270 cycles for 1000 * cmp128n (nidud)
9508 cycles for 1000 * cmp128 qWord
10295 cycles for 1000 * AxCMP128bit
Alex, there's something wrong, we have the slowest algos!! :dazzled:
Quote from: jj2007 on August 20, 2013, 04:56:39 AM
Quote from: Antariy on August 20, 2013, 02:01:20 AM
Ooops, entangled the order :greensml:
Attached :t
Alex, there's something wrong, we have the slowest algos!! :dazzled:
Thank you, Jochen :t
A strange and interesting thing: these not-too-complicated algos seem to be very CPU-dependent.
Timings for your latest archive:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2217/1000 cycles
22772 cycles for 1000 * Ocmp2 (JJ)
7344 cycles for 1000 * Cmp128Dave
33425 cycles for 1000 * cmp128n (nidud)
39887 cycles for 1000 * cmp128 qWord
10968 cycles for 1000 * AxCMP128bit
22775 cycles for 1000 * Ocmp2 (JJ)
7318 cycles for 1000 * Cmp128Dave
34449 cycles for 1000 * cmp128n (nidud)
39424 cycles for 1000 * cmp128 qWord
11218 cycles for 1000 * AxCMP128bit
22814 cycles for 1000 * Ocmp2 (JJ)
7333 cycles for 1000 * Cmp128Dave
33557 cycles for 1000 * cmp128n (nidud)
40793 cycles for 1000 * cmp128 qWord
10702 cycles for 1000 * AxCMP128bit
--- ok ---
Quote from: nidud on August 20, 2013, 08:17:15 AM
cmp128 1, -1
jle error
jnb error
fail: cmp128 qWord
fail: cmp128n (nidud)
fail: cmp128 Dave
Alex version works
Are you sure?
oPlusOne GREATER oMinusOne (jj)
oPlusOne greater oMinusOne (nidud)
Quote from: nidud on August 20, 2013, 05:44:59 AM
So, sorry Dave, but I'm picking on you again :lol:
Your code fails (I think), and is equal to this (I think):
Yes, nidud is right here.
Dave's code is more or less equal to the failing one I suggested a couple of pages ago (it consisted of just SUB/SBB/SBB/SBB): if number 1 is bigger than number 2, it returns with ZF set too, because of the last SUB (in Dave's version it's the last SBB).
(I.e. FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF is GREATER than FFFFFFFFFFFFFFFFFFFFFFFF00000001, but is also EQUAL :biggrin: - JGE/JAE/JE/JZ will have a false jump.)
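To make the failing case reproducible without a debugger, here is a Python trace of the Cmp128Dave macro as it was quoted earlier in the thread (the variant ending in sbb eax,0). It is only a model: it tracks CF and ZF, not SF/OF, and the helper names are invented.

```python
MASK32 = 0xFFFFFFFF

def limbs(x):
    """Split a 128-bit value into four dwords, lowest first."""
    return [(x >> (32 * i)) & MASK32 for i in range(4)]

def sub32(a, b, borrow=0):
    """Return (result, CF, ZF) of a 32-bit SUB/CMP/SBB."""
    diff = a - b - borrow
    return diff & MASK32, int(diff < 0), int(diff & MASK32 == 0)

def cmp128_dave(a, b):
    """Trace the quoted macro; return the final (CF, ZF)."""
    A, B = limbs(a), limbs(b)
    eax, cf, zf = sub32(A[3], B[3])                  # sub eax,dword ptr OwB[12]
    if zf:
        _, cf, zf = sub32(A[2], B[2])                # cmp edx,dword ptr OwB[8]
        if zf:
            _, cf, zf = sub32(A[1], B[1])            # cmp ecx,dword ptr OwB[4]
            if zf:
                _, cf, zf = sub32(A[0], B[0])        # cmp edx,dword ptr OwB[0]
                if not zf:
                    eax, cf, zf = sub32(eax, 0, cf)  # sbb eax,0 (eax is 0 here)
    return cf, zf

big   = 0xFFFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF
small = 0xFFFFFFFF_FFFFFFFF_FFFFFFFF_00000001
print(cmp128_dave(big, small))   # (0, 1): reads as "equal" although big > small
print(cmp128_dave(small, big))   # (1, 0): below, as expected
```

The final SBB only fixes the "below" direction: with no borrow it computes 0 - 0 - 0 and sets ZF again.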
Cmp128Dave MACRO OwA:REQ,OwB:REQ
;OwA and OwB are pointers to memory operands
still - my code may fail - and so may some others
we need a good validation routine before we worry about timing :P
Quote from: dedndave on August 20, 2013, 08:46:44 AM
still - my code may fail - and so may some others
I think you might just "trace" the code "in mind" with different conditions to get a general idea of whether it works (like you did when I suggested the brute SUB/SBB/SBB/SBB). Not many different numbers are needed, actually; the ones chosen by Jochen are already enough. You can see that, for instance, for your code, when the numbers oBigF and oSmallF are compared (it returns GREATER and ZERO set).
But you're right here:
Quote from: dedndave on August 20, 2013, 08:46:44 AM
we need a good validation routine before we worry about timing :P
We have such a validation here (http://masm32.com/board/index.php?topic=2232.msg23009#msg23009) :P
Though it's boring to parse the results by looking at a bunch of reported passed jump conditions :biggrin:
My code is a bit bloated, but I think I checked it pretty thoroughly. And you can see that in general the idea is the same as in Jochen's SSE-powered algo: it just constructs the final number to compare from the highest-order element (a WORD which becomes the high-order WORD of the constructed DWORD) and the highest-order differing element (a WORD which becomes the low-order WORD of the constructed DWORD). So, even if it's a bit bloated, it should work properly without doubt, if it's implemented properly (no typos etc.).
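The key construction described above can be sketched in Python. This is a simplified model, not the macro itself: it scans 16-bit words from the top instead of going dword by dword first, and the names words, compare_keys and to_signed are invented. It only checks that the ordering of the two 32-bit keys matches the ordering of the 128-bit numbers, both unsigned and signed.

```python
MASK16 = 0xFFFF

def words(x):
    """Split a 128-bit value into eight 16-bit words, highest first."""
    return [(x >> (16 * i)) & MASK16 for i in range(7, -1, -1)]

def compare_keys(a, b):
    """Build one 32-bit key per operand: the top word goes into the high
    half, the first differing word (scanning from the top) into the low half."""
    wa, wb = words(a), words(b)
    for x, y in zip(wa, wb):
        if x != y:
            return (wa[0] << 16) | x, (wb[0] << 16) | y
    return (wa[0] << 16) | wa[-1], (wb[0] << 16) | wb[-1]   # equal operands

def to_signed(x, bits):
    """Reinterpret an unsigned value as two's complement."""
    return x - (1 << bits) if x >> (bits - 1) else x

samples = [0, 1, 1 << 32, 1 << 64, 1 << 96, 0x7FFF << 112, 0x8000 << 112,
           (1 << 128) - 1, (1 << 128) - 2, (1 << 127) - 1]
mismatches = 0
for a in samples:
    for b in samples:
        ka, kb = compare_keys(a, b)
        if (ka < kb) != (a < b) or (ka == kb) != (a == b):
            mismatches += 1
        if (to_signed(ka, 32) < to_signed(kb, 32)) != (to_signed(a, 128) < to_signed(b, 128)):
            mismatches += 1
print(mismatches)   # 0
```

Because the top word carries the sign and the low half carries the deciding word, a single 32-bit CMP on the keys can stand in for the 128-bit compare.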
test i will use to fix my code
by the way, 3200 tests are made (40 are repeats) - my current code only fails 2440 of them :lol:
for the moment - back to work on my graph code
i will play with this some more later this week
;###############################################################################################
.XCREF
.NoList
INCLUDE \Masm32\Include\Masm32rt.inc
.686p
.MMX
.XMM
.List
;###############################################################################################
C128Dave PROTO :LPVOID,:LPVOID
TestCmp PROTO :LPVOID,:LPVOID
;###############################################################################################
Cmp128Dave MACRO OwA:REQ,OwB:REQ
;OwA and OwB are pointers to memory operands
mov eax,dword ptr OwA[12]
mov edx,dword ptr OwA[8]
sub eax,dword ptr OwB[12]
.if ZERO?
cmp edx,dword ptr OwB[8]
mov ecx,dword ptr OwA[4]
.if ZERO?
cmp ecx,dword ptr OwB[4]
mov edx,dword ptr OwA[0]
.if ZERO?
cmp edx,dword ptr OwB[0]
.if !ZERO?
sbb eax,eax
.endif
.endif
.endif
.endif
ENDM
;###############################################################################################
.DATA
;each line is one set of comparison values: a DWORD and an OWORD
;the flag result after comparing the DWORD's from any 2 lines should
;be the same as after comparing the OWORD's on the same 2 lines
;each pair of lines is compared both ways (CMP a,b and CMP b,a) for a total of 3200 tests
TestVal dd 0,0,0,0,0
dd 1,1,0,0,0
dd 100h,0,1,0,0
dd 10000h,0,0,1,0
dd 1000000h,0,0,0,1
dd 40000000h,0,0,0,40000000h
dd 40000001h,1,0,0,40000000h
dd 40000100h,0,1,0,40000000h
dd 40010000h,0,0,1,40000000h
dd 41000000h,0,0,0,40000001h
dd 80000000h,0,0,0,80000000h
dd 80000001h,1,0,0,80000000h
dd 80000100h,0,1,0,80000000h
dd 80010000h,0,0,1,80000000h
dd 81000000h,0,0,0,80000001h
dd 0C0000000h,0,0,0,0C0000000h
dd 0C0000001h,1,0,0,0C0000000h
dd 0C0000100h,0,1,0,0C0000000h
dd 0C0010000h,0,0,1,0C0000000h
dd 0C1000000h,0,0,0,0C0000001h
dd 3FFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
dd 3FFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
dd 3FFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,3FFFFFFFh
dd 3FFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,3FFFFFFFh
dd 3EFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFEh
dd 7FFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
dd 7FFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
dd 7FFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,7FFFFFFFh
dd 7FFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,7FFFFFFFh
dd 7EFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFEh
dd 0BFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
dd 0BFFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
dd 0BFFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,0BFFFFFFFh
dd 0BFFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,0BFFFFFFFh
dd 0BEFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFEh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
dd 0FFFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
dd 0FFFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh
dd 0FFFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh
dd 0FEFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh
;***********************************************************************************************
; .DATA?
;###############################################################################################
.CODE
;***********************************************************************************************
_main PROC
mov esi,offset TestVal
mov ebx,40
mov edi,esi
loop00: push ebx
push edi
mov ebx,40
loop01: INVOKE TestCmp,esi,edi
INVOKE TestCmp,edi,esi
dec ebx
lea edi,[edi+20]
jnz loop01
pop edi
pop ebx
add esi,20
dec ebx
jnz loop00
print chr$(13,10)
inkey
INVOKE ExitProcess,0
_main ENDP
;***********************************************************************************************
C128Dave PROC USES ESI EDI lpOp1:LPVOID,lpOp2:LPVOID
mov esi,lpOp1
mov edi,lpOp2
mov eax,[esi+12]
mov edx,[esi+8]
sub eax,[edi+12]
.if ZERO?
cmp edx,[edi+8]
mov ecx,[esi+4]
.if ZERO?
cmp ecx,[edi+4]
mov edx,[esi]
.if ZERO?
cmp edx,[edi]
.if !ZERO?
sbb eax,eax
.endif
.endif
.endif
.endif
ret
C128Dave ENDP
;***********************************************************************************************
TestCmp PROC USES EBX ESI EDI lpOp1:LPVOID,lpOp2:LPVOID
;OF = bit 11
;SF = bit 7
;ZF = bit 6
;CF = bit 0
mov esi,lpOp1
mov edi,lpOp2
mov eax,[esi]
cmp eax,[edi]
push ebp
pushfd
add esi,4
add edi,4
; INVOKE C128Dave,esi,edi
pushfd
pop ebx ;EBX = OWORD compare result flags
pop ebp ;EBP = DWORD compare result flags
and ebx,8C1h
and ebp,8C1h ;OF SF ZF CF only
.if ebx!=ebp
print chr$('cmp ')
mov eax,[esi+12]
print uhex$(eax),'_'
mov eax,[esi+8]
print uhex$(eax),'_'
mov eax,[esi+4]
print uhex$(eax),'_'
mov eax,[esi]
print uhex$(eax),' , '
mov eax,[edi+12]
print uhex$(eax),'_'
mov eax,[edi+8]
print uhex$(eax),'_'
mov eax,[edi+4]
print uhex$(eax),'_'
mov eax,[edi]
print uhex$(eax),13,10,'was: '
.if ebx&800h
print chr$('OV ')
.else
print chr$('NV ')
.endif
.if ebx&80h
print chr$('NG ')
.else
print chr$('PL ')
.endif
.if ebx&40h
print chr$('ZR ')
.else
print chr$('NZ ')
.endif
.if ebx&1
print chr$('CY')
.else
print chr$('NC')
.endif
print chr$(' should be: ')
.if ebp&800h
print chr$('OV ')
.else
print chr$('NV ')
.endif
.if ebp&80h
print chr$('NG ')
.else
print chr$('PL ')
.endif
.if ebp&40h
print chr$('ZR ')
.else
print chr$('NZ ')
.endif
.if ebp&1
print chr$('CY')
.else
print chr$('NC')
.endif
print chr$(13,10)
.endif
pop ebp
ret
TestCmp ENDP
;###############################################################################################
END _main
the test program only displays on failure
here is an example of one fail:
cmp 00000000_00000000_00000000_00000000 , 40000000_00000001_00000000_00000000
was: NV PL NZ NC should be: NV NG NZ CY
Quote from: dedndave on August 20, 2013, 10:03:15 AM
for the moment - back to work on my graph code
i will play with this some more later this week
:t
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1756/1000 cycles
2362 cycles for 1000 * Ocmp2 (JJ)
2627 cycles for 1000 * Cmp128Dave
2313 cycles for 1000 * cmp128n (nidud)
4672 cycles for 1000 * cmp128 qWord
3885 cycles for 1000 * AxCMP128bit
2418 cycles for 1000 * Ocmp2 (JJ)
2635 cycles for 1000 * Cmp128Dave
2310 cycles for 1000 * cmp128n (nidud)
4628 cycles for 1000 * cmp128 qWord
3833 cycles for 1000 * AxCMP128bit
2412 cycles for 1000 * Ocmp2 (JJ)
2637 cycles for 1000 * Cmp128Dave
2394 cycles for 1000 * cmp128n (nidud)
4647 cycles for 1000 * cmp128 qWord
3866 cycles for 1000 * AxCMP128bit
Thank you, John! :t
Can I ask you to make one more test, of the program in this (http://masm32.com/board/index.php?topic=2222.msg23288#msg23288) post? We had strange results; I think they do correctly show that one algo is faster than another, but the numbers were just too wonderful. It would be interesting to see whether this behaviour shows up on your CPU, too.
:t
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1677/1000 cycles
2537 cycles for 1000 * Ocmp (JJ)
61306 cycles for 1000 * cmp128b (loop)
4740 cycles for 1000 * cmp128 qWord
2132 cycles for 1000 * JJAxCMP128bit
2441 cycles for 1000 * Ocmp (JJ)
61320 cycles for 1000 * cmp128b (loop)
4759 cycles for 1000 * cmp128 qWord
2009 cycles for 1000 * JJAxCMP128bit
2419 cycles for 1000 * Ocmp (JJ)
61279 cycles for 1000 * cmp128b (loop)
4796 cycles for 1000 * cmp128 qWord
2052 cycles for 1000 * JJAxCMP128bit
Incredible! :biggrin: The results are absolutely different! In this thread we have very varying timings for every kind of algo.
Thank you very much, John! :t
I thought that my JJAxCMP128bit tweak was a lot slower than Jochen's original version, so I replaced it with GPR code, but it seems that on any Intel CPU model younger than the desktop PIV, SSE code will be faster than GPR (not sure about this for AMD). Probably the tweak needs to be returned to the testbed.
Hi nidud, can you please post entire testbed source + binary? This will help a lot to check what is going on.
We are still testing unsigned? And the end result is ja/je/jb?
Quote from: sinsi on August 20, 2013, 06:34:40 PM
We are still testing unsigned? And the end result is ja/je/jb?
No, theoretically every algo is designed to support any number, i.e. to fully emulate the behaviour of the usual CMP instruction in flag setting, so how to treat the number is the programmer's decision: JA or JG, JB or JL etc.
Something is strange here with the GPR code. If nidud posts the testbed it will help, but right now I cannot say what is wrong (and I'm not sure that something is wrong; we probably need a comprehensive checking testbed, otherwise it's very error-prone to check all things manually).
I can't see how comparing dwords can propagate every flag for all four.
Quote from: Intel manual
temp ← SRC1 − SignExtend(SRC2);
ModifyStatusFlags; (* Modify status flags in the same manner as the SUB instruction *)
The CF, OF, SF, ZF, AF, and PF flags are set according to the result.
nidud, if the source is still in an untidy working state, then please post a binary, or the numbers which make the algos fail.
Quote from: sinsi on August 20, 2013, 07:47:01 PM
I can't see how comparing dwords can propagate every flag for all four.
Quote from: Intel manual
temp ← SRC1 − SignExtend(SRC2);
ModifyStatusFlags; (* Modify status flags in the same manner as the SUB instruction *)
The CF, OF, SF, ZF, AF, and PF flags are set according to the result.
We don't set or combine flags for every DWORD. The trick is that we build a CMP instruction for a 128-bit integer, so the flags should be set according to the comparison of the two 128-bit numbers as a whole, not as "arrays" of DWORDs (just like CMP 128_bit_number_1, 128_bit_number_2 followed by Jcc).
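What "flags of the 128-bit comparison as a whole" means can be written down as a reference model. The Python sketch below is not taken from any of the posted algos; it simply applies the usual SUB flag rules to 128-bit operands, and cmp_flags is a made-up name.

```python
def cmp_flags(a, b, bits=128):
    """Flags that a hypothetical CMP a,b on 'bits'-wide operands would set."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    result = (a - b) & mask
    return {
        "OF": int(bool((a ^ b) & (a ^ result) & sign)),  # signed overflow
        "SF": int(bool(result & sign)),                  # top bit of the difference
        "ZF": int(result == 0),
        "CF": int(a < b),                                # unsigned borrow
    }

# The failing example printed by Dave's test program:
print(cmp_flags(0, 0x40000000_00000001_00000000_00000000))
# {'OF': 0, 'SF': 1, 'ZF': 0, 'CF': 1}   i.e. NV NG NZ CY

# cmp128 1, -1: unsigned below (CF=1), signed greater (SF == OF, ZF = 0)
print(cmp_flags(1, (1 << 128) - 1))
# {'OF': 0, 'SF': 0, 'ZF': 0, 'CF': 1}
```

Any tested algo can then be judged by whether JA/JB (CF, ZF) and JG/JL (SF versus OF, ZF) would branch the same way as with these reference flags.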
i cleaned up the validation test code - and removed repeat tests
earlier, i stated that my algo failed 2440 of 3200 tests
i see i had the test commented out, though - lol
with the changes, my current algo fails 112 of 3160 tests :biggrin:
OK, here is the correctness test. It is not in the form of a macro for testing every algo, but it shows the idea.
The OWORDs are compared with the selected algo, and the flags are converted and saved in an easily comparable format with FlagsToEAX. Then the same OWORDs are compared with the "Etalone" macro, which is meant to produce the proper result by emulating the flag setting and returns it in the same format as FlagsToEAX. Finally, what "Etalone" returned is compared with what was saved from the tested algo's flags. If there is a difference, it is shown.
Here is the code isolated from the source:
Added to the source:
; EAX = BITS: ... CF ZF SF OF
FlagsToEAX MACRO
pushfd
xor eax,eax
pop edx
bt edx,0
rcl eax,1
bt edx,6
rcl eax,1
bt edx,7
rcl eax,1
bt edx,11
rcl eax,1
ENDM
Etalone MACRO ow0, ow1
LOCAL @l1, @l2, @l3, @l0
push ebx
mov eax,dword ptr [ow0+12]
mov edx,dword ptr [ow1+12]
cmp eax,edx
jnz @l1 ; just save flags
mov ecx,dword ptr [ow0+8]
mov ebx,dword ptr [ow1+8]
cmp ecx,ebx
jnz @l2
mov ecx,dword ptr [ow0+4]
mov ebx,dword ptr [ow1+4]
cmp ecx,ebx
jnz @l2
mov ecx,dword ptr [ow0]
mov ebx,dword ptr [ow1]
cmp ecx,ebx
jz @l1
@l2:
push 0
ja @l3 ; if it's above - the number is bigger because this isn't MSD
mov byte ptr [esp+3],1 ; CF set, below than (unsigned)
test eax,eax ; if numbers are signed then set required flags
jns @l3
mov word ptr [esp],0001h ; SF and OF are not equal, so it means less than (signed)
@l3:
pop edx
shr edx,1
rcl eax,1
push 3
pop ecx
@@:
shr edx,8
rcl eax,1
loop @B
jmp @l0
@l1:
FlagsToEAX
@l0:
if 0
push eax
mov ebx,eax
test ebx,1
jz @F
print "OF "
@@:
test ebx,2
jz @F
print "SF "
@@:
test ebx,4
jz @F
print "ZF "
@@:
test ebx,8
jz @F
print "CF "
@@:
print chr$(13,10)
pop eax
endif
pop ebx
ENDM
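For reference, the packing done by FlagsToEAX can be modelled in a few lines of Python. This is a sketch of the bt/rcl sequence only (it does not model Etalone), and flags_to_eax is an invented name.

```python
def flags_to_eax(eflags):
    """Pack CF, ZF, SF, OF from an EFLAGS image into bits 3..0."""
    eax = 0
    for bit in (0, 6, 7, 11):                      # CF, ZF, SF, OF, in the macro's order
        eax = (eax << 1) | ((eflags >> bit) & 1)   # bt edx,bit / rcl eax,1
    return eax

print(flags_to_eax(0x001))   # 8: CF only
print(flags_to_eax(0x040))   # 4: ZF only
print(flags_to_eax(0x080))   # 2: SF only
print(flags_to_eax(0x800))   # 1: OF only
```

This matches the bit tests in the disabled print section of Etalone (1 = OF, 2 = SF, 4 = ZF, 8 = CF).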
The piece below is the check itself; it may be added at the start of the program, here:
start: push 1
call ShowCpu ; print brand string and SSE level
invoke SetProcessAffinityMask, -1, 1 ; restrict to one core
Calibrate
AxCMP128bit numberOne,numberTwo
FlagsToEAX
push eax
Etalone numberOne, numberTwo
pop ecx
xor eax,ecx
jz @l1 ; test OK
and eax,3 ; layout of SF and OF flag may differ, but if they are both not equal
jz @l1 ; in first comparison and in second comparison, then it's proper result
cmp eax,3 ; since signed less than is OF != SF with no difference which flags are (un)set
jz @l1
mov edx,[esp]
print str$(edx)," - Test failed: "
print uhex$(dword ptr [esi+12])
print "_"
print uhex$(dword ptr [esi+8])
print "_"
print uhex$(dword ptr [esi+4])
print "_"
print uhex$(dword ptr [esi])
print " "
print uhex$(dword ptr [edi+12])
print "_"
print uhex$(dword ptr [edi+8])
print "_"
print uhex$(dword ptr [edi+4])
print "_"
print uhex$(dword ptr [edi]),13,10
@l1:
One may ask why "Etalone" is stated to be truly proper code; well, this is a question like the one about the chicken and the egg :biggrin: It is stated so because it follows the CMP/SUB behaviour in flag setting: it just emulates it, not with the real flags but in a variable, which is then compared with the converted flags state of the tested code. It takes longer to describe why it is stated to work properly than to check its algo with the help of the Intel docs.
Quote from: dedndave on August 20, 2013, 09:39:49 PM
i cleaned up the validation test code - and removed repeat tests
earlier, i stated that my algo failed 2440 of 3200 tests
i see i had the test commented out, though - lol
with the changes, my current algo fails 112 of 3160 tests :biggrin:
Hi Dave, I have not seen your code yet, I will check it :t
Can I use your data set with my checking method above? I have not prepared the data patterns yet, but checking with just random data shows that the algo works (crafted data like yours is better than just random, though).
of course you may, Alex
the data is organized as sets of: (1) dword and (1) oword
comparing the owords from any two lines should yield the
same flags as comparing the dwords from the same 2 lines
so - the dwords are "control" values and the owords are "test" values
3160 tests are required to make all compares
(40 x 40 x 2, less 40 repeats)
in my test code, i examine OF, SF, ZF, and CF - they should match
no messing around with JL, JGE, etc
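The control/test idea can be tried outside the assembler as well. The Python sketch below copies four of the dd lines from the test source (control dword first, then the oword as four dwords, lowest first) and checks that a 32-bit compare of the controls gives the same OF/SF/ZF/CF as a 128-bit compare of the owords. The helper names are made up, and only these four rows are checked here, not the whole table.

```python
def cmp_flags(a, b, bits):
    """(OF, SF, ZF, CF) that a CMP a,b on 'bits'-wide operands would set."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    result = (a - b) & mask
    return (int(bool((a ^ b) & (a ^ result) & sign)),
            int(bool(result & sign)), int(result == 0), int(a < b))

def oword(d0, d1, d2, d3):
    """Assemble a 128-bit value from four dwords, lowest first."""
    return d0 | (d1 << 32) | (d2 << 64) | (d3 << 96)

rows = [  # (control dword, d0, d1, d2, d3), copied from the TestVal table
    (0x00000100, 0, 1, 0, 0),
    (0x40000001, 1, 0, 0, 0x40000000),
    (0x80000000, 0, 0, 0, 0x80000000),
    (0x7FFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x7FFFFFFF),
]

failures = [(hex(ra[0]), hex(rb[0])) for ra in rows for rb in rows
            if cmp_flags(ra[0], rb[0], 32) != cmp_flags(oword(*ra[1:]), oword(*rb[1:]), 128)]
print(failures)              # []

lines = 40
calls = lines * lines * 2    # TestCmp is invoked twice per (i, j)
print(calls - lines)         # 3160: for i == j the second call repeats the first
```

Comparing the raw flag bits this way avoids having to reason about JL, JGE and friends one by one.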
With the help of Dave's test numbers patterns, here is the checking testbed :t
Currently only my and Jochen's algos pass the check.
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 2111/1000 cycles
#######################################################
Testing algo: Cmp128Dave [esi],[edi]
1970169159 - Test failed: 00000000_00000000_00000000_00000000 00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000000_00000000_00000000 00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_00000000_00000000_00000000 00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000000_00000000_00000000 00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_00000001_00000001_00000000 00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000001_00000001_00000000 00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000001_00000001_00000000 00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_00000100_00000000_00000000 00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000100_00000000_00000000 00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000100_00000000_00000000 00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_00000001_00000000_00000000 00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000001_00000000_00000000 00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000001_00000000_00000000 00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_00000000_00000000_01000000 00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000000_00000000_01000000 00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_00000000_00000000_01000000 00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000000_00000000_01000000 00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_00000000_40000000_00000001 00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000000_40000000_00000001 00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_00000000_40000000_00000001 00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000000_40000000_00000001 00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000001_00000000_00000000_40010000 00000001_00000000
_C0000100_C0000000
1970169159 - Test failed: 00000000_00000000_41000000_40000000 00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000000_41000000_40000000 00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_00000000_41000000_40000000 00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000000_41000000_40000000 00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000 00000000_00000000
_00000000_00000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000 00000000_00000001
_00000001_00000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000 00000000_00000100
_00000000_00000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000 00000000_00000001
_00000000_00000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000 00000000_00000000
_00000000_01000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000 00000000_00000000
_40000000_00000001
1970169159 - Test failed: 00000000_80000000_40000001_00000000 00000000_00000000
_41000000_40000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000 00000000_00000001
_00000000_80000100
1970169159 - Test failed: 00000000_80000000_40000001_00000000 00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_80000000_40000001_00000000 00000000_00000000
_00000001_C0000001
1970169159 - Test failed: 00000000_00000001_00000000_80000100 00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000001_00000000_80000100 00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000001_00000000_80000100 00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_00000000_80010000_80000000 00000000_00000000
_00000000_00000000
1970169159 - Test failed: 00000000_00000000_80010000_80000000 00000000_00000000
_00000000_01000000
1970169159 - Test failed: 00000000_00000000_80010000_80000000 00000000_00000000
_40000000_00000001
1970169159 - Test failed: 00000000_00000000_80010000_80000000 00000000_00000000
_41000000_40000000
1970169159 - Test failed: 00000000_00000000_80010000_80000000 00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000000_80010000_80000000 00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000000_80010000_80000000 00000000_00000000
_00000001_C0000001
1970169159 - Test failed: 00000000_00000000_80010000_80000000 00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001 00000000_00000000
_00000000_00000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001 00000000_00000001
_00000001_00000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001 00000000_00000100
_00000000_00000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001 00000000_00000001
_00000000_00000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001 00000000_00000000
_00000000_01000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001 00000000_00000000
_40000000_00000001
1970169159 - Test failed: 00000000_81000000_80000000_00000001 00000000_00000000
_41000000_40000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001 00000000_00000001
_00000000_80000100
1970169159 - Test failed: 00000000_81000000_80000000_00000001 00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_81000000_80000000_00000001 00000000_00000000
_00000001_C0000001
1970169159 - Test failed: C0000000_80000001_00000000_00000000 C0000000_00000000
_00000000_00000000
1970169159 - Test failed: C0000000_00000000_00000000_00000000 C0000000_80000001
_00000000_00000000
1970169159 - Test failed: 00000000_00000000_00000001_C0000001 00000000_80000000
_40000001_00000000
1970169159 - Test failed: 00000000_00000000_00000001_C0000001 00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_00000000_00000001_C0000001 00000000_81000000
_80000000_00000001
1970169159 - Test failed: 00000000_00000000_00000001_C0000001 00000000_C0010000
_C0000000_00000000
1970169159 - Test failed: 00000001_00000000_C0000100_C0000000 00000001_00000000
_00000000_40010000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000 00000000_00000000
_00000000_00000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000 00000000_00000001
_00000001_00000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000 00000000_00000100
_00000000_00000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000 00000000_00000001
_00000000_00000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000 00000000_00000000
_00000000_01000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000 00000000_00000000
_40000000_00000001
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000 00000000_00000000
_41000000_40000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000 00000000_00000001
_00000000_80000100
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000 00000000_00000000
_80010000_80000000
1970169159 - Test failed: 00000000_C0010000_C0000000_00000000 00000000_00000000
_00000001_C0000001
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_3FFFFFFF FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_3FFFFFFF FFFFFFFF_FFFFFFFF
_7FFFFFFF_3FFFFFFE
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_3FFFFFFF FFFFFFFF_FFFFFFFF
_7EFFFFFF_7FFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFE_3FFFFFFE_3FFFFFFF FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFE_3FFFFFFE_3FFFFFFF FFFFFFFF_FFFFFFFE
_FFFFFFFF_BFFFFEFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_FFFFFFFF
_FFFFFFFF_3FFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_FFFFFFFE
_3FFFFFFE_3FFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_FFFFFFFF
_FFFFFFFF_3EFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_FFFFFFFF
_7FFFFFFF_3FFFFFFE
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_FFFFFFFF
_7EFFFFFF_7FFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_BFFFFFFF
_7FFFFFFE_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_FFFFFFFE
_FFFFFFFF_BFFFFEFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_FFFFFFFF
_BFFEFFFF_BFFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_BEFFFFFF
_BFFFFFFF_FFFFFFFE
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_BFFFFFFE
_FFFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_FFFFFFFF
_FFFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_FFFFFFFF
_FFFFFFFE_FFFFFFFE
1970169159 - Test failed: FFFFFFFF_3FFFFEFF_3FFFFFFF_FFFFFFFF FFFFFFFF_FFFEFFFF
_FFFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_3EFFFFFF FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_3EFFFFFF FFFFFFFF_FFFFFFFF
_7FFFFFFF_3FFFFFFE
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_3EFFFFFF FFFFFFFF_FFFFFFFF
_7EFFFFFF_7FFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7FFFFFFF_3FFFFFFE FFFFFFFF_FFFFFFFF
_FFFFFFFF_3FFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7FFFFFFF_3FFFFFFE FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7FFFFFFF_3FFFFFFE FFFFFFFF_FFFFFFFF
_FFFFFFFF_3EFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7FFFFFFF_3FFFFFFE FFFFFFFF_FFFFFFFF
_BFFEFFFF_BFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7FFFFFFF_3FFFFFFE FFFFFFFF_FFFFFFFF
_FFFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7FFFFFFF_3FFFFFFE FFFFFFFF_FFFFFFFF
_FFFFFFFE_FFFFFFFE
1970169159 - Test failed: FFFFFFFE_7FFFFFFE_7FFFFFFF_FFFFFFFF FFFFFFFE_FFFFFFFF
_FFFFFFFF_7FFEFFFF
1970169159 - Test failed: FFFFFFFE_7FFFFFFE_7FFFFFFF_FFFFFFFF FFFFFFFE_FFFFFFFF
_FFFFFEFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFE_7FFFFFFE_7FFFFFFF_FFFFFFFF FFFFFFFE_FFFFFFFF
_FFFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFE_FFFFFFFF_FFFFFFFF_7FFEFFFF FFFFFFFE_7FFFFFFE
_7FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7EFFFFFF_7FFFFFFF FFFFFFFF_FFFFFFFF
_FFFFFFFF_3FFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7EFFFFFF_7FFFFFFF FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7EFFFFFF_7FFFFFFF FFFFFFFF_FFFFFFFF
_FFFFFFFF_3EFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7EFFFFFF_7FFFFFFF FFFFFFFF_FFFFFFFF
_BFFEFFFF_BFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7EFFFFFF_7FFFFFFF FFFFFFFF_FFFFFFFF
_FFFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_7EFFFFFF_7FFFFFFF FFFFFFFF_FFFFFFFF
_FFFFFFFE_FFFFFFFE
1970169159 - Test failed: FFFFFFFF_BFFFFFFF_7FFFFFFE_FFFFFFFF FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFE_FFFFFFFF_BFFFFEFF FFFFFFFF_FFFFFFFE
_3FFFFFFE_3FFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFE_FFFFFFFF_BFFFFEFF FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_BFFEFFFF_BFFFFFFF FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_BFFEFFFF_BFFFFFFF FFFFFFFF_FFFFFFFF
_7FFFFFFF_3FFFFFFE
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_BFFEFFFF_BFFFFFFF FFFFFFFF_FFFFFFFF
_7EFFFFFF_7FFFFFFF
1970169159 - Test failed: FFFFFFFF_BEFFFFFF_BFFFFFFF_FFFFFFFE FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_BFFFFFFE_FFFFFFFF_FFFFFFFF FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF FFFFFFFF_FFFFFFFF
_7FFFFFFF_3FFFFFFE
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF FFFFFFFF_FFFFFFFF
_7EFFFFFF_7FFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFE_FFFFFFFE FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFE_FFFFFFFE FFFFFFFF_FFFFFFFF
_7FFFFFFF_3FFFFFFE
1970169159 - Test failed: FFFFFFFF_FFFFFFFF_FFFFFFFE_FFFFFFFE FFFFFFFF_FFFFFFFF
_7EFFFFFF_7FFFFFFF
1970169159 - Test failed: FFFFFFFE_FFFFFFFF_FFFFFEFF_FFFFFFFF FFFFFFFE_7FFFFFFE
_7FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFF_FFFEFFFF_FFFFFFFF_FFFFFFFF FFFFFFFF_3FFFFEFF
_3FFFFFFF_FFFFFFFF
1970169159 - Test failed: FFFFFFFE_FFFFFFFF_FFFFFFFF_FFFFFFFF FFFFFFFE_7FFFFFFE
_7FFFFFFF_FFFFFFFF
Test done
#######################################################
Testing algo: cmp128n [esi],[edi]
1970169159 - Test failed: 80000000_00000000_00000000_00000001 7FFFFFFF_FFFFFFFF
_FFFFFFFE_FFFFFFFF
Test done
#######################################################
Testing algo: cmp128q [esi],[edi]
1970169159 - Test failed: 80000000_00000000_00000000_00000001 7FFFFFFF_FFFFFFFF
_FFFFFFFE_FFFFFFFF
Test done
#######################################################
Testing algo: Ocmp2 [esi],[edi]
Test done
#######################################################
Testing algo: AxCMP128bit [esi],[edi]
Test done
Dave, thank you very much for the test data :t
Quote from: nidud on August 20, 2013, 10:32:55 PM
Alex,
I just used the test I posted here (http://masm32.com/board/index.php?topic=2222.msg23338#msg23338), adding qWord's number
I then added Small=1 for the cmp(1,-1) case, which fails
Then I added this to Dave's test
C128nidud PROC USES ESI EDI lpOp1:LPVOID,lpOp2:LPVOID
mov esi,lpOp1
mov edi,lpOp2
mov eax,[esi]
sub eax,[edi]
jnz @NE1
mov eax,[esi+4]
sub eax,[edi+4]
jnz @NE2
mov eax,[esi+8]
sub eax,[edi+8]
jnz @NE3
mov eax,[esi+12]
sbb eax,[edi+12]
jmp @end
@NE1: mov eax,[esi+4]
sbb eax,[edi+4]
@NE2: mov eax,[esi+8]
sbb eax,[edi+8]
@NE3: mov eax,[esi+12]
sbb eax,[edi+12]
jo @OV
jnz @end
inc eax
@end: ret
@OV: jc @end
mov eax,80000000h
sub eax,7FFFFFFFh
jmp @end
C128nidud endp
Hmm... I checked it with qWord's number, too - it worked. Maybe I did not get something in your method?
nidud's latest code passes my test
are you saying it doesn't pass yours Alex ?
on my algo, this fails....
cmp 00000000_00000000_00000000_00000000 , 80000000_00000000_00000000_00000001
was: OV NG NZ CY should be: NV PL NZ CY
tells me that my theory about early exit on high-order compare is not a good theory - lol
it appears that you DO have to ripple from low to high
so, we are looking at something like nidud's code
i added a few blank lines for readability....
;***********************************************************************************************
C128nidud PROC USES ESI EDI lpOp1:LPVOID,lpOp2:LPVOID
mov esi,lpOp1
mov edi,lpOp2
mov eax,[esi]
sub eax,[edi]
jnz @NE1
mov eax,[esi+4]
sub eax,[edi+4]
jnz @NE2
mov eax,[esi+8]
sub eax,[edi+8]
jnz @NE3
mov eax,[esi+12]
sbb eax,[edi+12]
jmp @end
@NE1: mov eax,[esi+4]
sbb eax,[edi+4]
@NE2: mov eax,[esi+8]
sbb eax,[edi+8]
@NE3: mov eax,[esi+12]
sbb eax,[edi+12]
jo @OV
jnz @end
inc eax
@end: ret
@OV: jc @end
mov eax,80000000h
sub eax,7FFFFFFFh
jmp @end
C128nidud ENDP
;***********************************************************************************************
i like the algo, except it seems a little messy at the end :P
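For reference, the low-to-high ripple that C128nidud does with SUB/SBB can be modelled in C. This is only a minimal sketch, not anyone's posted code: the names flags128 and cmp128_flags are made up for illustration, and the operands are assumed to be arrays of four DWORDs with index 0 as the low-order one. It also shows why the tail of the algo needs a fixup: ZF has to reflect all four limbs, not just the result of the last SBB.

```c
#include <stdint.h>

/* Hypothetical helper type: the four flags a 128-bit CMP would produce. */
typedef struct { int cf, zf, sf, of; } flags128;

/* a[0] and b[0] are the low-order DWORDs (little-endian limb order). */
flags128 cmp128_flags(const uint32_t a[4], const uint32_t b[4])
{
    uint32_t borrow = 0;   /* plays the role of CF between the SBBs */
    uint32_t any = 0;      /* OR of all limb results, used for ZF */
    uint32_t r = 0;        /* result of the current limb */

    for (int i = 0; i < 4; i++) {
        uint64_t d = (uint64_t)a[i] - b[i] - borrow;
        r = (uint32_t)d;
        borrow = (uint32_t)(d >> 32) & 1u;  /* 1 if this limb wrapped */
        any |= r;
    }

    flags128 f;
    f.cf = (int)borrow;                               /* unsigned a < b */
    f.zf = (any == 0);                                /* every limb equal */
    f.sf = (int)(r >> 31);                            /* sign of high limb */
    f.of = (int)(((a[3] ^ b[3]) & (a[3] ^ r)) >> 31); /* signed overflow */
    return f;
}
```

Reading the printed numbers with the high DWORD first, the pair 00000000_00000000_00000000_00000000 against 80000000_00000000_00000000_00000001 gives NV PL NZ CY in this model, which is the expected result quoted above.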
Here is the new one. My testing method works differently from Dave's and does not need the additional DWORD to check results (it relies on an "Etalone" reference proc, which is now complete). So I used that extra DWORD in the tests, too: by varying the offset into the table and the increment size, this DWORD becomes part of an OWORD in a second pass.
You can play with this, too; see the comments in these lines:
add edi,16;+offst
dec ebx
jnz @l2
add esi,16;+offst
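The idea of checking every algo against a reference proc, rather than against hand-made control values, can be sketched in C roughly like this. It is an illustration under assumptions, not the testbed's code: cmp128_fn, count_mismatches, ref_bytewise and low_only_cmp are hypothetical names, results are reduced to a -1/0/1 ordering instead of CPU flags, and the reference here is unsigned only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature: returns <0, 0 or >0 like a three-way compare. */
typedef int (*cmp128_fn)(const uint32_t a[4], const uint32_t b[4]);

/* Reduce any result to -1, 0 or 1 so that only the ordering is compared. */
static int sign_of(int v) { return (v > 0) - (v < 0); }

/* Stand-in for the "Etalone": slow byte-by-byte unsigned compare,
   starting at the most significant byte. */
int ref_bytewise(const uint32_t a[4], const uint32_t b[4])
{
    for (int i = 15; i >= 0; i--) {
        unsigned x = (a[i / 4] >> (8 * (i % 4))) & 0xFFu;
        unsigned y = (b[i / 4] >> (8 * (i % 4))) & 0xFFu;
        if (x != y)
            return x > y ? 1 : -1;
    }
    return 0;
}

/* Deliberately broken candidate: looks at the low DWORD only. */
int low_only_cmp(const uint32_t a[4], const uint32_t b[4])
{
    return (a[0] > b[0]) - (a[0] < b[0]);
}

/* Run candidate and reference over every ordered pair in the table
   and count the pairs on which they disagree. */
size_t count_mismatches(cmp128_fn candidate, cmp128_fn reference,
                        const uint32_t table[][4], size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (sign_of(candidate(table[i], table[j])) !=
                sign_of(reference(table[i], table[j])))
                bad++;
    return bad;
}
```

Because the reference supplies the expected answer, the table can hold crafted values or plain random data; no per-pair control value is stored.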
Quote from: dedndave on August 20, 2013, 11:42:54 PM
nidud's latest code passes my test
are you saying it doesn't pass yours Alex ?
on my algo, this fails....
cmp 00000000_00000000_00000000_00000000 , 80000000_00000000_00000000_00000001
was: OV NG NZ CY should be: NV PL NZ CY
tells me that my theory about early exit on high-order compare is not a good theory - lol
I probably used his old code. At the moment I'm working on the testing method and had not noticed that it was updated :biggrin: The goal was to get a working testing method. That is OK now, so I'll add the new code.
BTW,
Dave, is this your latest code?
Cmp128Dave MACRO OwA:REQ,OwB:REQ
;OwA and OwB are pointers to memory operands
mov eax,dword ptr OwA[12]
mov edx,dword ptr OwA[8]
sub eax,dword ptr OwB[12]
.if ZERO?
cmp edx,dword ptr OwB[8]
mov ecx,dword ptr OwA[4]
.if ZERO?
cmp ecx,dword ptr OwB[4]
mov edx,dword ptr OwA[0]
.if ZERO?
cmp edx,dword ptr OwB[0]
.if !ZERO?
sbb eax,eax
.endif
.endif
.endif
.endif
ENDM
well - that is the latest that doesn't work :lol:
C128nidud doesn't fail.
One more update: just for completeness, it now makes more passes over the stream of numbers. Some numbers repeat.
I think no one has looked at my testbed, so no one knows how neat and flexible it is :P Test results + timings in one testbed.
Note again: this testing method does not require manual data construction. You can even run it over random data, because it uses the "milestone" proc to compare the results with. But Dave's data is much, much better than random data, because it is hand-crafted.
Timings are at the bottom of the listing:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
++++16 of 20 tests valid, loop overhead is approx. 2433/1000 cycles
22910 cycles for 1000 * Ocmp (JJ)
21054 cycles for 1000 * Ocmp2 (JJ)
38189 cycles for 1000 * cmp128n (nidud)
36171 cycles for 1000 * cmp128 qWord
9852 cycles for 1000 * AxCMP128bit
22886 cycles for 1000 * Ocmp (JJ)
20740 cycles for 1000 * Ocmp2 (JJ)
38777 cycles for 1000 * cmp128n (nidud)
36293 cycles for 1000 * cmp128 qWord
9835 cycles for 1000 * AxCMP128bit
23170 cycles for 1000 * Ocmp (JJ)
20988 cycles for 1000 * Ocmp2 (JJ)
37485 cycles for 1000 * cmp128n (nidud)
36515 cycles for 1000 * cmp128 qWord
9851 cycles for 1000 * AxCMP128bit
--- ok ---
The message exceeded 20000 chars, so I removed all the data except the timings. But when you run it, it first displays the correctness test results for every algo.
Can I ask for timings, please?
this code works but as i recall, SAHF is a slow instruction
C128Dave PROC USES ESI EDI lpOp1:LPVOID,lpOp2:LPVOID
mov esi,lpOp1
mov edi,lpOp2
mov eax,[esi]
cmp eax,[edi]
jnz c1
mov eax,[esi+4]
cmp eax,[edi+4]
jnz c2
mov eax,[esi+8]
cmp eax,[edi+8]
jnz c3
mov eax,[esi+12]
cmp eax,[edi+12]
jmp short cz
c1: mov eax,[esi+4]
sbb eax,[edi+4]
c2: mov eax,[esi+8]
sbb eax,[edi+8]
c3: mov eax,[esi+12]
sbb eax,[edi+12]
jnz cz
lahf
lea eax,[eax-4000h]
sahf
cz: ret
C128Dave ENDP
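What the LAHF / LEA / SAHF tail does to the flag image can be shown with a small C model. This is a sketch with made-up names (the AH_* constants and clear_zf_in_eax are not from the thread). LAHF copies SF, ZF, AF, PF and CF into AH, with ZF at bit 6 of AH, which is bit 14 of EAX. On the path that reaches LAHF the JNZ was not taken, so ZF is set and subtracting 4000h clears exactly that bit.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions of the flags inside AH after LAHF. */
enum { AH_CF = 0x01, AH_PF = 0x04, AH_AF = 0x10, AH_ZF = 0x40, AH_SF = 0x80 };

/* Models "lea eax,[eax-4000h]" applied to the value LAHF left in EAX.
   Only valid when ZF is set in the image; otherwise the subtraction
   would borrow from the higher bits and corrupt SF. */
uint32_t clear_zf_in_eax(uint32_t eax)
{
    assert(((eax >> 8) & AH_ZF) != 0);
    return eax - 0x4000u;   /* ZF is bit 6 of AH = bit 14 of EAX */
}
```

SAHF then writes SF, ZF, AF, PF and CF back from AH. It does not touch OF, so OF stays as the last SBB left it.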
Quote from: dedndave on August 21, 2013, 01:16:47 AM
this code works, but i think SAHF is slow
Well, just add it to the testbed I posted above :P :biggrin: I got bored with this stuff :P :biggrin:
this is the kind of stuff i enjoy
i should be working on something else - lol
To check your code you can just insert it into the testbed and add this at the start:
CheckIt <invoke C128Dave,esi,edi>
And, yes, it passes the check :t Though adding it to the timings part is up to you :P
here is a nice algo
i think this one is a winner :t
it can be modified to make a macro that addresses the operands directly
and preloading registers may also provide some improvement
but this gives you the basic concept.....
C128Dave PROC USES ESI EDI lpOp1:LPVOID,lpOp2:LPVOID
mov esi,lpOp1
mov edi,lpOp2
xor edx,edx
mov eax,[esi]
cmp eax,[edi]
.if !ZERO?
inc edx
.endif
mov eax,[esi+4]
sbb eax,[edi+4]
.if !ZERO?
inc edx
.endif
mov eax,[esi+8]
sbb eax,[edi+8]
.if !ZERO?
inc edx
.endif
mov eax,[esi+12]
mov ecx,[edi+12]
sbb al,cl
.if !ZERO?
inc edx
.endif
.if CARRY?
mov cl,dl
mov al,dh
.else
mov al,dl
mov cl,dh
.endif
cmp eax,ecx
ret
C128Dave ENDP
the idea is to gather the info for the 3 low-order DWORD compares
then, when we get to the last one, compare the low bytes (with borrow)
then, replace the low bytes with the cumulated results for a single DWORD CMP instruction
notice that INC does not affect the CF
test_correct.zip:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
18963 cycles for 1000 * Ocmp (JJ)
18407 cycles for 1000 * Ocmp2 (JJ)
15392 cycles for 1000 * cmp128n (nidud)
8043 cycles for 1000 * cmp128 qWord
5544 cycles for 1000 * AxCMP128bit
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
14256 cycles for 1000 * Ocmp (JJ)
13015 cycles for 1000 * Ocmp2 (JJ)
15534 cycles for 1000 * cmp128n (nidud)
9508 cycles for 1000 * cmp128 qWord
10279 cycles for 1000 * AxCMP128bit
Ocmp2 passed all tests.
here it is in macro form - should be a bit faster :P
Cmp128Dave MACRO OwA:REQ,OwB:REQ
;OwA and OwB are pointers to memory operands
;Example: Cmp128Dave offset Oword1,offset Oword2
mov eax,dword ptr OwA[0]
xor ecx,ecx
cmp eax,dword ptr OwB[0]
mov edx,dword ptr OwA[4]
.if !ZERO?
inc ecx
.endif
sbb edx,dword ptr OwB[4]
mov eax,dword ptr OwA[8]
.if !ZERO?
inc ecx
.endif
sbb eax,dword ptr OwB[8]
mov edx,dword ptr OwB[12]
mov eax,dword ptr OwA[12]
.if !ZERO?
inc ecx
.endif
sbb al,dl
.if !ZERO?
inc ecx
.endif
.if CARRY?
mov dl,cl
mov al,ch
.else
mov al,cl
mov dl,ch
.endif
cmp eax,edx
ENDM
i don't know if this is even vaguely useful as I have not been following this topic in any real detail but this may be useful to someone, a .486 compatible unsigned QWORD comparison algo.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
cmpqword PROTO :DWORD,:DWORD
.data
value1 QWORD 0000000000000000h
value2 QWORD 0000000000000001h
value3 QWORD 0000000000000002h
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
invoke cmpqword,ADDR value1,ADDR value2 ; value1 < value2, returns -1
print str$(eax),13,10
invoke cmpqword,ADDR value3,ADDR value3 ; value3 = value3, returns 0
print str$(eax),13,10
invoke cmpqword,ADDR value2,ADDR value1 ; value2 > value1, returns 1
print str$(eax),13,10
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
cmpqword proc pquad1:DWORD,pquad2:DWORD
; ----------------------
; unsigned QWORD compare
; ----------------------
mov eax, [esp+4]
mov edx, [esp+8]
mov ecx, [eax+4]
cmp ecx, [edx+4] ; high DWORD 1st
ja greater
jb lessthan
mov ecx, [eax]
cmp ecx, [edx] ; low DWORD next
ja greater
jb lessthan
xor eax, eax ; return 0 on equal
ret 8
lessthan:
mov eax, -1 ; return -1 on less than
ret 8
greater:
mov eax, 1 ; return 1 on greater
ret 8
cmpqword endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
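The same high-DWORD-first scheme extends to 128 bits if only the unsigned ordering is wanted. A minimal C sketch follows; cmpoword is a hypothetical name, and the operands are assumed to be four DWORDs with index 3 as the high one. Note that this returns -1/0/1 like cmpqword and does not try to reproduce the CPU flags, which is the harder part discussed earlier in the thread.

```c
#include <stdint.h>

/* Returns -1, 0 or 1, like cmpqword, for two unsigned 128-bit values
   stored as four little-endian DWORDs (index 3 is the high DWORD). */
int cmpoword(const uint32_t a[4], const uint32_t b[4])
{
    for (int i = 3; i >= 0; i--) {   /* high DWORD first, exit early */
        if (a[i] > b[i]) return 1;
        if (a[i] < b[i]) return -1;
    }
    return 0;                        /* all four DWORDs equal */
}
```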
here is that one in macro form
i measure 11 cycles on my P4, which is pretty good
the values that are tested may not be an honest reflection of SAHF usage
i won't throw that other macro away, just yet :P
Cmp128Dave MACRO OwA:REQ,OwB:REQ
;OwA and OwB are pointers to memory operands
;Example: Cmp128Dave offset Oword1,offset Oword2
LOCAL c1,c2,c3,c4
mov eax,dword ptr OwA[0]
cmp eax,dword ptr OwB[0]
jnz c1
mov eax,dword ptr OwA[4]
cmp eax,dword ptr OwB[4]
jnz c2
mov eax,dword ptr OwA[8]
cmp eax,dword ptr OwB[8]
jnz c3
mov eax,dword ptr OwA[12]
cmp eax,dword ptr OwB[12]
jmp short c4
c1: mov eax,dword ptr OwA[4]
sbb eax,dword ptr OwB[4]
c2: mov eax,dword ptr OwA[8]
sbb eax,dword ptr OwB[8]
c3: mov eax,dword ptr OwA[12]
sbb eax,dword ptr OwB[12]
jnz c4
lahf
lea eax,[eax-4000h]
sahf
c4:
ENDM
not sure what the scaling factor is for cycles, but here's your last test :P
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
---------------------------------------------------------
5663779 cycles for Cmp128Dave
8939593 cycles for Cmp128Dave2
8382453 cycles for Cmp128Nidud
---------------------------------------------------------
6032710 cycles for Cmp128Dave
9247293 cycles for Cmp128Dave2
7744632 cycles for Cmp128Nidud
---------------------------------------------------------
wow - looking at the values again, it would seem that they all take the SAHF path
i am surprised that code does so well :redface:
ahhh - good point on the validation test data alignment
we can pad that with "empty" dwords to make it align
Alex's code doesn't use a control value, so that's another way to go
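As a rough picture of the padding idea, here is a C sketch of one test record. The layout is an assumption made for illustration (two OWORD operands followed by one control DWORD; test_record and its field names are hypothetical), not the actual table from the attachment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical test record: two OWORD operands plus one control DWORD.
   Three "empty" DWORDs pad the record to 48 bytes, a multiple of 16,
   so in a table that starts on a 16-byte boundary every operand of
   every record stays 16-byte aligned. */
typedef struct {
    uint32_t op1[4];    /* offset 0  */
    uint32_t op2[4];    /* offset 16 */
    uint32_t control;   /* offset 32 */
    uint32_t pad[3];    /* offsets 36..47, unused */
} test_record;
```

Without the three pad DWORDs the record would be 36 bytes, and the operands of every second and later record would drift off the 16-byte boundary.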
as for that timing chart.....
yes - it was a fast instruction in days of old
however, many instructions that explicitly manipulate flags seem to run slow under NT
CMC, STC, CLC are exceptions to that rule - they are ok
but CLD, STD, POPFD seem to be really slow under NT
i figured SAHF would be also
i think it's related to the protected OS thing
it has to verify that the flag change is allowed with current privileges before continuing
I've tinkered with another one, it's fast but it doesn't pass all tests :(
@Alex: Could you modify CheckIt so that it produces less output? Such as: #tests failed?
ocjj=0
oqDeb=0
OcmpJJ MACRO ow0, ow1
LOCAL z0, z1
ocjj=ocjj+1
z0 CATSTR <ocJJ0>, %ocjj
z1 CATSTR <ocJJ1>, %ocjj
mov eax, dword ptr ow0[12]
cmp eax, dword ptr ow1[12]
jne z0 ; no test byte ptr
mov eax, dword ptr ow0[8]
mov edx, dword ptr ow1[8]
cmp eax, edx
jne z1
mov eax, dword ptr ow0[4]
mov edx, dword ptr ow1[4]
cmp eax, edx
jne z1
mov eax, dword ptr ow0[0]
mov edx, dword ptr ow1[0]
z1:
test byte ptr ow1[15], 80h
usedeb=01
.if Sign?
cmp eax, edx
.if ! (Carry? && Sign?)
xchg eax, edx ; qSmallN
.endif
.endif
cmp eax, edx
z0:
if oqDeb
.if Zero?
print "&ow0", " equals &ow1", 13, 10
.elseif !Sign?
print "&ow0", " greater &ow1", 13, 10
.else
print "&ow0", " lesser &ow1", 13, 10
.endif
print chr$(13, 10)
endif
ENDM
Good night from Europe ;-)
Quote from: jj2007 on August 21, 2013, 02:32:04 AM
test_correct.zip:
Ocmp2 passed all tests.
AxCMP128bit too! :biggrin:
Quote from: jj2007 on August 21, 2013, 07:48:44 AM
@Alex: Could you modify CheckIt so that it produces less output? Such as: #tests failed?
Here it is. Now it prints the offsets of the numbers, not the numbers themselves, so with the binary at hand you know where to look, and the output is a lot smaller. Also, in the same place an int 3 is executed if the prog is running under a debugger, followed by a jump that repeats the failed test, so you can trace things, or jump over this jump (pun).
Is this suitable?
Also simplified Etalone a bit - more straightforward now.
Dave, in your checking method you check that the flags returned from the compare of the control DWORD correspond exactly to the flags returned from the compare of the OWORD. But this is not quite the right way, because the layout of the SF and OF flags may differ and still be proper: according to the documentation, signed jumps check only the (non-)equality of the OF and SF flags, and there are no notes on exactly which layout of these flags a compare leaves behind (and I think this may be hardware-dependent). I.e., if one compare returns OF=1 and SF=0, and the other returns OF=0 and SF=1, these results are equivalent for a signed comparison, because JL/JLE (and their aliases JNGE/JNG) will jump in both cases.
My checking code is aware of this, but yours is not; that's why my comparing code doesn't pass your check. But it works, and works properly, because the exact SF/OF layout is not fixed in the standards.
In your case you can do the same. After this:
and ebx,8C1h
and ebp,8C1h ;OF SF ZF CF only
Do the check this way:
xor ebp,ebx
.if ebp!=0 && ebp!=100010000000Y ; if OF and SF were "swapped" then XOR will make both bits set
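Written out in C, the relaxed check looks roughly like this. It is only a sketch of the rule described in this post, with made-up names (FLAG_* and flags_match_relaxed); the bit values are the EFLAGS positions behind the 8C1h mask.

```c
#include <stdint.h>

#define FLAG_CF 0x001u
#define FLAG_ZF 0x040u
#define FLAG_SF 0x080u
#define FLAG_OF 0x800u
#define FLAG_MASK (FLAG_OF | FLAG_SF | FLAG_ZF | FLAG_CF)   /* 8C1h */

/* Returns 1 if two EFLAGS images pass the relaxed check: identical in
   the four flags, or differing in exactly OF and SF together. */
int flags_match_relaxed(uint32_t f1, uint32_t f2)
{
    uint32_t diff = (f1 ^ f2) & FLAG_MASK;
    return diff == 0 || diff == (FLAG_OF | FLAG_SF);   /* 880h = 100010000000Y */
}
```

Two images accepted through the second case behave the same for JG/JGE/JL/JLE, which look only at whether SF equals OF (plus ZF). JO/JNO and JS/JNS read OF or SF alone, so those four branches would still tell the two images apart.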
An update. Checking is stricter now, and I added Jochen's new experimental algo from the post above.
Jochen, although I was busy with the testbed, the idea of that algo looks interesting.
Alex
if you find a mismatch in any of those 4 flags, i can find a corresponding branch that will not work properly
conversely, if the 4 flags match, all the conditional branches will work correctly (except parity)
there is also the parity flag, but we haven't tested for it because it's so rarely used
the parity flag only applies to the low-order byte ::)
that actually makes it somewhat useless except for serial comm
i could add that to my test very easily, but all our OWORD algos would probably fail - lol
in addition to all those flags, there is an auxiliary carry flag
however, it is not used for any branches
it's really more or less a "CPU internal" flag
****************************************************************
Equality Branches (Used for Signed or Unsigned Comparisons)
----------------------------------------------------------------
Instruction  Description               Condition        Aliases
----------------------------------------------------------------
JZ           Jump if equal             ZF=1             JE
JNZ          Jump if not equal         ZF=0             JNE
****************************************************************
****************************************************************
Unsigned Branches
----------------------------------------------------------------
Instruction  Description               Condition        Aliases
----------------------------------------------------------------
JA           Jump if above             CF=0 and ZF=0    JNBE
JAE          Jump if above or equal    CF=0             JNC JNB
JB           Jump if below             CF=1             JC JNAE
JBE          Jump if below or equal    CF=1 or ZF=1     JNA
****************************************************************
****************************************************************
Signed Branches
----------------------------------------------------------------
Instruction  Description               Condition        Aliases
----------------------------------------------------------------
JG           Jump if greater           SF=OF and ZF=0   JNLE
JGE          Jump if greater or equal  SF=OF            JNL
JL           Jump if less              SF<>OF           JNGE
JLE          Jump if less or equal     SF<>OF or ZF=1   JNG
JO           Jump if overflow          OF=1
JNO          Jump if no overflow       OF=0
JS           Jump if sign              SF=1
JNS          Jump if no sign           SF=0
****************************************************************
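A few rows of these tables, written as C predicates, make it easy to see which branches can tell two flag sets apart. This is just an illustrative sketch: cpu_flags and the takes_* names are made up. Note that JG needs both ZF=0 and SF=OF.

```c
/* Hypothetical model of the four flags the branches below look at. */
typedef struct { int cf, zf, sf, of; } cpu_flags;

int takes_ja(cpu_flags f) { return !f.cf && !f.zf; }         /* unsigned > */
int takes_jb(cpu_flags f) { return f.cf; }                   /* unsigned < */
int takes_jg(cpu_flags f) { return !f.zf && f.sf == f.of; }  /* signed >   */
int takes_jl(cpu_flags f) { return f.sf != f.of; }           /* signed <   */
int takes_jo(cpu_flags f) { return f.of; }                   /* overflow   */
int takes_js(cpu_flags f) { return f.sf; }                   /* sign       */
```

For example, OF=1/SF=0 and OF=0/SF=1 give the same answer for takes_jg and takes_jl, but different answers for takes_jo and takes_js.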
you are right, though - the code could use XOR
mov eax,ebx
xor eax,ebp
test eax,8C1h
jnz fail
i don't really see the advantage, though
also, by keeping the flags in EBX and EBP intact, i can use them for the failure report :P
xor ebx,ebp
that would destroy one set of flags for the report
cmp128tm
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
---------------------------------------------------------
5534615 cycles for Cmp128Dave
9100335 cycles for Cmp128Dave2
5937448 cycles for Cmp128Nidud
---------------------------------------------------------
5646211 cycles for Cmp128Dave
9094478 cycles for Cmp128Dave2
5587906 cycles for Cmp128Nidud
---------------------------------------------------------
HI Dave,
A nit to pick.
JS Jump if sign SF=1
JNS Jump if no sign SF=1
Should be;
JS Jump if sign SF=1
JNS Jump if no sign SF=0
As an aside, this thread has me looking at my fixed point arithmetic
program again. Such a comparison might be useful.
Cheers,
Steve N.
thanks Steve :t
corrected :P
Quote from: dedndave on August 21, 2013, 09:19:08 PM
Dave, no need for huge tables. Just look at the logic.
Signed jumps check SF and OF to decide whether a number is greater or less than another. (I did not mention ZF only because I was talking about SF and OF; it was implied, and I did not want to add a million reservations.) If you look at the tables you posted, you'll find that, leaving ZF aside, only one condition matters for the signed jumps: the mutual state of the SF and OF flags. There is no "agreement" on HOW exactly these flags should be set.
If number 1 is greater than number 2, then OF and SF should be equal to each other (and ZF clear), i.e.:
OF = 1, SF = 1 : JG will jump, JL will NOT jump, JGE will jump, JLE will NOT jump
OF = 0, SF = 0 : the same as above
If number 1 is less than number 2, then OF and SF should NOT be equal to each other, i.e.:
OF = 0, SF = 1 : JG will NOT jump, JL will jump, JGE will NOT jump, JLE will jump
OF = 1, SF = 0 : the same as above
You don't get what I'm trying to suggest: not to ignore ZF and CF, but to make the code aware that SF and OF may have more than one valid set of states, and to follow the docs and the CPU design at the same time.
The way I suggested works like this: if the XOR of the two flag sets is zero, they are equal and the test passed. If it is not zero, you should check whether the mutual state of the OF and SF flags is the same in both flag sets. I.e., if in one flag set OF was 1 and SF was 0, then in the second flag set it may be OF=0 and SF=1, and this is still the RIGHT RESULT, because that is the CPU spec.
OF=1, SF=0 == OF=0, SF=1
OF=1, SF=1 == OF=0, SF=0
This is the spec; you may read it again and check it any way you want. The CPU will, for instance, take JG for SF=1 + OF=1 the same as for SF=0 + OF=0.
So, if two flag sets have the same mutual state of OF and SF, and the flags themselves are equal, for example SF=1 and OF=0 in both flag sets, then after XOR you'll get zero (if all other flags are equal); if the state of OF and SF is mutually equal, but swapped, for example SF=0 and OF=1, then after XOR you'll get both bits set (and the other flag bits clear, if they were equal). So checking, after the XOR, for zero or for both SF and OF set to 1 is a proper way to emulate the CPU's behaviour.
xor ebp,ebx
.if ebp!=0 && ebp!=100010000000Y
This code meets that requirement.
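The same check can be modelled outside of MASM. A minimal Python sketch, assuming the usual EFLAGS bit positions (CF=001h, ZF=040h, SF=080h, OF=800h); the function names are hypothetical, and unlike the two-line snippet above it masks off the unrelated flag bits first:

```python
CF, ZF, SF, OF = 0x001, 0x040, 0x080, 0x800
MASK = CF | ZF | SF | OF            # 8C1h, the mask used earlier in the thread

def same_for_compare_jumps(flags1, flags2):
    """True if both flag sets drive JA/JB/JG/JL/JE and friends identically."""
    diff = (flags1 ^ flags2) & MASK
    # 0   -> all four flags identical
    # 880h -> SF and OF both flipped, so SF==OF / SF!=OF is unchanged
    return diff == 0 or diff == (SF | OF)

def signed_less(flags):
    """The JL condition: SF != OF."""
    return bool(flags & SF) != bool(flags & OF)

if __name__ == "__main__":
    a = CF | SF | OF                # "OV NG NZ CY"
    b = CF                          # "NV PL NZ CY"
    print(same_for_compare_jumps(a, b), signed_less(a), signed_less(b))
```

Note that this equivalence deliberately ignores JS/JNS/JO/JNO, which look at SF or OF alone.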
If you still do not agree, then please choose any two numbers you want from the dump of "failed" comparisons for my code, and we will make a test: compare them, and then take every possible conditional jump for number comparison. You will not find such numbers, because the CPU's behaviour does not depend on how your checking method assumes it behaves.
Besides of this, the Jochen's idea, which is used in his algo and in my algo - just perfect. It's undisputable.
Quote
i could add that to my test very easily, but all our OWORD algos would probably fail - lol
We are writing comparison code, not a full CPU emulation :lol:
Quote
OF = 1, SF = 1 : JG will jump, JB will NOT jump, JGE will jump, JBE will NOT jump
I mistyped here: I meant JLE instead of JBE and JL instead of JB ::) (well, this is not the usual "typo" but rather hurry plus tiredness from this dispute)
Quote
if the state of OF and SF is mutually equal, but swapped, for example SF=0 and OF=1
In full form it should be "... for example SF=0 and OF=1 in one flags set, and SF=1 and OF=0 in the other flags set ..."
Well, as an example, I inserted my algo into your testing code and grabbed the very first "failed" test numbers:
cmp 00000000_00000000_00000000_00000000 , 80000000_00000000_00000000_00000001
was: OV NG NZ CY should be: NV PL NZ CY
After your code compares the "checking DWORD", it has the flags CF=1, ZF=0, SF=0, OF=0.
After my comparison code runs on those numbers, it returns the flags CF=1, ZF=0, SF=1, OF=1.
This is a proper result: CF and ZF are the same, and SF=OF.
Moreover, if you trace the code, you'll find that your code first handles the "checking DWORD" by loading the first DWORD into EAX (it is zero) and then comparing it with 80000100. There the SF and OF flags are 0.
My code loads the high-order DWORD of the first OWORD (it's zero), then compares it with the high-order DWORD of the second OWORD, which is 80000000, and right after this it exits the algo. There the SF and OF flags are 1.
The CPU's internal logic set SF and OF to 0 when the second number was 80000100, and to 1 when the second number was 80000000. We cannot say why it does that, as we don't know the CPU's exact circuit, but it has no meaning anyway, because what is defined as important is only the mutual state of the SF and OF flags, not HOW exactly they are set. In this case they may both be 1 or both be 0, and both cases are right.
Quote
Besides of this, the Jochen's idea, which is used in his algo and in my algo - just perfect. It's undisputable.
Probably one may say that my checking algo for the correctness of comparing two numbers is perfect :P
JP/JPO and JNP/JPE are practically never used
and in the rare cases when they are used, they are operating on byte data, not OWORD's
JO and JNO are used reasonably often in signed math, so are JS and JNS
if those flags have to be correct, i don't see what the argument is about - lol
you have to emulate the exact behaviour in OF, SF, ZF, and CF
Alex is missing an important point
it's not enough that SF=OF or SF<>OF
because YOU DON'T KNOW WHICH Jxx INSTRUCTION WILL BE USED
you have to set the flags so that any of the ones listed above will work
i updated my version of the validation test program :P
compare data is 16 aligned
no more control dword's - instead, i use a known-good routine to set the flags
at the end, it reports the fail count
EDIT: i also simplified usage:
_main PROC
INVOKE AllTests,C128nidud
inkey
exit
_main ENDP
it doesn't seem to mean much that it passed the "xor test" :badgrin:
mov edx,[esi+12]
and edx,80000000h
or eax,edx
mov edx,[edi+12]
ESI and EDI are never preserved or initialized - oops
i set it up in my validation code and got 800 failures out of 3160 tests
Cmp128tm2
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
---------------------------------------------------------
6121006 cycles for Cmp128Dave
9888967 cycles for Cmp128Dave2
6120274 cycles for Cmp128Nidud
3453129 cycles for Cmp128NidudSEE (xor)
1127987 cycles for Cmp128Axel (xor)
1121311 cycles for Cmp128DaveU (unsigned)
841821 cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
6297867 cycles for Cmp128Dave
9968284 cycles for Cmp128Dave2
6103287 cycles for Cmp128Nidud
3369442 cycles for Cmp128NidudSEE (xor)
1309918 cycles for Cmp128Axel (xor)
1181227 cycles for Cmp128DaveU (unsigned)
959974 cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
eliminated most of the repeat tests (doh)
now, 1600 tests are performed
Quote from: dedndave on August 22, 2013, 05:03:58 AM
Alex is missing an important point
it's not enough that SF=OF or SF<>OF
because YOU DON'T KNOW WHICH Jxx INSTRUCTION WILL BE USED
you have to set the flags so that any of the ones listed above will work
No, this is you missing the point.
My comparison code will work for every kind of signed/unsigned number-comparison jump. My code sets CF and ZF properly, but it may set SF and OF differently from what YOUR CHECKING METHOD EXPECTS. But with the CPU IT WILL WORK PROPERLY, because the CPU IS NOT YOUR CHECKING CODE.
Your checking code is incomplete, and thus it is the one that fails. You may argue about this more, but it is a senseless dispute, since you totally don't get what I'm trying to say.
You may try to find any numbers with which my code followed by JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE will fail, but you will not find such numbers.
My code works, and works properly. You may even bet $100000000000 on the contrary, but you will never win.
My checking code is the only code in the thread that properly checks the flags returned for a number comparison followed by JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE.
Specially for those who still don't get my point from the previous posts, where I omitted the CF and ZF flags (whose state was assumed to be equal in both flag sets).
These flags sets:
CF=1, ZF=0, OF=1, SF=1 ARE THE SAME AS CF=1, ZF=0, OF=0, SF=0
CF=1, ZF=0, OF=0, SF=1 ARE THE SAME AS CF=1, ZF=0, OF=1, SF=0
And these:
CF=0, ZF=1, OF=1, SF=1 ARE THE SAME AS CF=0, ZF=1, OF=0, SF=0
CF=1, ZF=0, OF=0, SF=1 ARE THE SAME AS CF=1, ZF=0, OF=1, SF=0
And so on. Pay special attention to the MUTUAL STATE (I have said these words a lot in this thread :greensml:) of the OF and SF flags in these tables, then read the docs, then look at the table again, and then read the docs.
This is the CPU's behaviour. But your code isn't aware of it, though it takes only two instructions to make the code aware, and I showed those instructions.
And then, if that's still not enough, well, try to write code which uses my code to compare any numbers you choose, then executes each of these instructions: JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE, and fails. This is a challenge, yeah. One may decline it with words like "it's a waste of time, I'm not obliged to do that", of course. But that still will not prove one's point.
I proved my point in this thread both theoretically and practically; if the descriptions "are not enough" for opponents, that's their own problem. When opponents do not want to understand or check what was said to them, it is not an adequate dispute. It is not much work to grab any "failed" numbers, compare them with my code, and then try each of the only 9 conditional jumps we are all writing working code for.
No one who reads the documentation will be able to prove the contrary theoretically. Anyone may try in practice to find numbers with which my code followed by JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE will fail, but no one will find such numbers.
my friend Alex :biggrin:
rather than sorting all that out, let's look at some cases that fail one test, but pass the other
here is one that passes your test, but fails mine
cmp 00000000_00000000_00000000_00000000 , 80000000_00000000_00000000_00000001
was: OV NG NZ CY should be: NV PL NZ CY
if we subtracted 00000000_00000000_00000000_00000000 - 80000000_00000000_00000000_00000001
the result would be 7FFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF (sign = 0)
you can make a simple test:
mov eax,0
cmp eax,80000001h
js is_neg
if i execute a JG, JGE, JL, or JLE instruction after comparing these, your code will work as it should
however, if i execute a JS, JNS, JO, or JNO instruction, it will not
i ran your code through my test and got 48 failures
attached is the text file.....
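To make this case concrete, here is a small Python model (not MASM) of the disputed pair. `cmp_flags` is a hypothetical helper that mimics `cmp` on an n-bit pattern; comparing only the top dwords stands in, as an assumption, for an algo that exits after the first differing dword:

```python
def cmp_flags(a, b, bits):
    """Return (CF, ZF, SF, OF) as `cmp a, b` would set them for bits-wide values."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a, b = a & mask, b & mask
    diff = (a - b) & mask
    cf = int(a < b)
    zf = int(diff == 0)
    sf = int((diff & sign) != 0)
    of = int(((a ^ b) & (a ^ diff) & sign) != 0)
    return cf, zf, sf, of

A = 0x00000000_00000000_00000000_00000000
B = 0x80000000_00000000_00000000_00000001

full = cmp_flags(A, B, 128)             # what a true 128-bit cmp would give
top = cmp_flags(A >> 96, B >> 96, 32)   # cmp on the high dwords only

def jg(f): return f[2] == f[3] and f[1] == 0    # SF=OF and ZF=0
def js(f): return f[2] == 1
def jo(f): return f[3] == 1

if __name__ == "__main__":
    print("full:", full, "JG", jg(full), "JS", js(full), "JO", jo(full))
    print("top: ", top, "JG", jg(top), "JS", js(top), "JO", jo(top))
```

In this model both flag sets take JG, but only the high-dword result takes JS and JO, which is exactly the difference between the two checks.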
If you guys are working on 4 dwords, why not work on 4 bytes?
The logic is the same and we can easily test for errors. Forget speed, get the basics going.
Once you figure it out it should be easy to do n-bit comparisons.
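A scaled-down test like this is easy to run exhaustively. The following Python sketch (all names hypothetical) compares a limb-by-limb routine against a full-width reference for every pair of values; it assumes the usual scheme where the top limb decides the sign and the lower limbs are compared unsigned:

```python
def full_compare(a, b, bits):
    """Reference: (unsigned, signed) ordering of a vs. b as -1/0/1."""
    sign = 1 << (bits - 1)
    sa = a - (1 << bits) if a & sign else a
    sb = b - (1 << bits) if b & sign else b
    return (a > b) - (a < b), (sa > sb) - (sa < sb)

def limb_compare(a, b, limbs, limb_bits):
    """Compare from the most significant limb down; stop at the first difference."""
    mask = (1 << limb_bits) - 1
    sign = 1 << (limb_bits - 1)
    for i in reversed(range(limbs)):
        x = (a >> (i * limb_bits)) & mask
        y = (b >> (i * limb_bits)) & mask
        if x == y:
            continue
        unsigned = 1 if x > y else -1
        if i == limbs - 1:
            # the top limb carries the sign, so order it as signed
            sx = x - (1 << limb_bits) if x & sign else x
            sy = y - (1 << limb_bits) if y & sign else y
            signed = 1 if sx > sy else -1
        else:
            # equal top limbs mean equal signs, so unsigned order decides
            signed = unsigned
        return unsigned, signed
    return 0, 0

def count_mismatches(limbs, limb_bits):
    bits = limbs * limb_bits
    return sum(
        limb_compare(a, b, limbs, limb_bits) != full_compare(a, b, bits)
        for a in range(1 << bits)
        for b in range(1 << bits)
    )

if __name__ == "__main__":
    print(count_mismatches(4, 2))   # four 2-bit limbs: 65536 pairs
```

Four 2-bit limbs keep the run short; the logic does not depend on the limb width, so the same routine describes 4 bytes or 4 dwords.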
Quote from: dedndave on August 22, 2013, 07:33:19 PM
if i execute a JG, JGE, JL, or JLE instruction after comparing these, your code will work as it should
however, if i execute a JS, JNS, JO, or JNO instruction, it will not
Dave, but the point was to make comparison code, i.e. to just compare numbers: which one is greater, lesser, above, below, or whether they are equal. There was no point about J(N)S and J(N)O jumps, only compare jumps. Thus, all my posts are based on this "value compare" point only.
The target was to make code which works properly with JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE, but not J(N)O, and this was probably stated in the first pages of the thread; in my "128bit CMP" thread in the Lab I also made this point, and said that JS will not work properly either, but that one may use test byte ptr [oword+15],80h and then JS (just about ~15 bytes of code and 2 instructions) instead of using such bloated algos as we write. It's a question of usability, I get that.
So all my posts were based on the point that we are all working on algos which satisfy these jumps: JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE; my checking algo is based on this, and my comparison algo too.
After all, it's very unlikely that anyone would want to compare two 128-bit numbers and then execute a J(N)O jump.
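For the record, the `test byte ptr [oword+15],80h` idea only looks at the top bit of the most significant byte of a little-endian OWORD. A tiny Python sketch of that single check (the function name is hypothetical):

```python
def is_negative_oword(raw16):
    """Sign of a 128-bit little-endian value: bit 7 of the byte at offset 15."""
    if len(raw16) != 16:
        raise ValueError("expected exactly 16 bytes")
    return bool(raw16[15] & 0x80)

if __name__ == "__main__":
    value = 0x80000000_00000000_00000000_00000001
    print(is_negative_oword(value.to_bytes(16, "little")))
```

Note that this tests the sign of one operand, not the sign of the difference that a real `cmp` would report in SF.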
Here is one more variation - partially based on Jochen's experimental algo posted on page 10.
It passes my check and doesn't pass Dave's check, but anyone may see that the logic is so simple that it just cannot fail for JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE. So we were probably working on checking algos with different specifications of the target :lol:
option prologue:none
option epilogue:none
AxCMP128bitProc2 proc n1,n2
mov eax,[esp+4]
mov edx,[esp+8]
xor ecx,ecx
mov [esp+4],esi
mov [esp+8],edi
mov esi,[eax+12]
test dword ptr [eax+12],80000000h
mov edi,[eax+8]
sets cl
cmp esi,[edx+12]
jnz @l0
cmp edi,[edx+8]
mov esi,[eax+4]
jnz @l1
cmp esi,[edx+4]
mov edi,[eax]
jnz @l1
cmp edi,[edx]
jnz @l1
mov esi,[esp+4]
mov edi,[esp+8]
ret 8
@l1:
ja @F
mov esi,[predata1+ecx*8]
cmp esi,[predata1+4+ecx*8]
mov esi,[esp+4]
mov edi,[esp+8]
ret 8
@@:
mov esi,[predata2+ecx*8]
cmp esi,[predata2+4+ecx*8]
@l0:
mov esi,[esp+4]
mov edi,[esp+8]
ret 8
align 4
predata1 label dword
dd 1,2,-2,-1
predata2 label dword
dd 2,1,-1,-2
AxCMP128bitProc2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Though this code is slower on my machine than my macro code based on Jochen's first idea.
And, yes, I'm glad that you understand me and agree that it will work for JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE, and that we sorted out why we were unable to understand each other's points.
I think qWord's code was probably based on the comparison-only idea, too.
So, in short, for your checking code: if, after testing my code, you see that the state of OF and SF is just reversed compared to the "should be", then it's OK, because it's "by design" :biggrin:
I.e., here in your test if you see something like
was: OV NG NZ CY should be: NV PL NZ CY
OF=1, SF=1, should be OF=0, SF=0; ZF and CF are the same, so that's OK for JA/JB/JAE/JBE/JG/JL/JGE/JLE/JE. But not for JS or JO; I did not write the comparison code to support JS and JO, though.
Quote
783467 cycles for Cmp128Axel (xor)
*Suspeciously looking to the proc name and then to the source*
LOL
The algo may not only reverse the state of the flags but also reverse the order of the letters in its name (if it was named after the author) :biggrin:
Very funny :t
results:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
---------------------------------------------------------
5881815 cycles for Cmp128Dave
9914640 cycles for Cmp128Dave2
5894460 cycles for Cmp128Nidud
2936083 cycles for Cmp128NidudSEE (xor)
2444879 cycles for Cmp128NidudSEEU (unsigned)
994634 cycles for Cmp128Axel (xor)
941599 cycles for Cmp128DaveU (unsigned)
962220 cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
5905129 cycles for Cmp128Dave
9810117 cycles for Cmp128Dave2
5910750 cycles for Cmp128Nidud
2766487 cycles for Cmp128NidudSEE (xor)
2443439 cycles for Cmp128NidudSEEU (unsigned)
969855 cycles for Cmp128Axel (xor)
933817 cycles for Cmp128DaveU (unsigned)
858812 cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
--- ok ---
Results for attached archive - added Jochen's code and my earlier tweak of his code.
Also renamed some labels in the source, Axel was not contrary :biggrin:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
---------------------------------------------------------
5843868 cycles for Cmp128Dave
9540057 cycles for Cmp128Dave2
5861747 cycles for Cmp128Nidud
2684675 cycles for Cmp128NidudSEE (xor)
2440403 cycles for Cmp128NidudSEEU (unsigned)
946568 cycles for Cmp128Alex (xor)
950729 cycles for Cmp128DaveU (unsigned)
988245 cycles for Cmp128NidudU (unsigned)
3537702 cycles for JJAxCMP128bit (SSE)
5887886 cycles for Ocmp2 - JJ's (SSE)
---------------------------------------------------------
5847877 cycles for Cmp128Dave
9582836 cycles for Cmp128Dave2
5915661 cycles for Cmp128Nidud
2731934 cycles for Cmp128NidudSEE (xor)
2439047 cycles for Cmp128NidudSEEU (unsigned)
1040627 cycles for Cmp128Alex (xor)
956229 cycles for Cmp128DaveU (unsigned)
950693 cycles for Cmp128NidudU (unsigned)
3533400 cycles for JJAxCMP128bit (SSE)
5874002 cycles for Ocmp2 - JJ's (SSE)
---------------------------------------------------------
--- ok ---
nidud, I inserted your Cmp128NidudSEEU algo into my testbed; it doesn't pass the check for many numbers, but here is the first one:
00000000 00000000 00000000 00000000 and 00000000 80000000 40000001 00000000
It returns that first number is above and greater than second (CF=0, SF=0, OF=0, ZF=0) - JA/JG/JAE/JGE will jump.
SSE comparison instructions are signed-only, which is probably the reason (i.e., they treat the OWORD as just a set of signed DWORDs).
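The effect is easy to reproduce in a model. The Python sketch below is not the posted macro (how it combines the lanes is not shown here); it only assumes that each dword is ordered as a signed value, most significant dword first, and shows the well-known remedy of flipping the sign bit of every dword (XOR with 80000000h) before a signed compare. All names are hypothetical:

```python
def lanes(n):
    """Split a 128-bit value into four dwords, most significant first."""
    return [(n >> shift) & 0xFFFFFFFF for shift in (96, 64, 32, 0)]

def as_signed(dword):
    return dword - (1 << 32) if dword & 0x80000000 else dword

def above_signed_lanes(a, b):
    """'a above b' if each dword is (wrongly) ordered as a signed value."""
    return [as_signed(x) for x in lanes(a)] > [as_signed(x) for x in lanes(b)]

def above_biased_lanes(a, b):
    """Flip the sign bit of each dword first, then compare as signed."""
    def flip(d):
        return as_signed(d ^ 0x80000000)
    return [flip(x) for x in lanes(a)] > [flip(x) for x in lanes(b)]

A = 0
B = 0x00000000_80000000_40000001_00000000   # the failing pair, read as printed

if __name__ == "__main__":
    print(above_signed_lanes(A, B), above_biased_lanes(A, B), A > B)
```

With the signed lanes the zero operand looks "above" B because 80000000h is treated as a negative dword; after the bias the lane order matches the unsigned order.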
Here is the checking code of my testbed. It has become too entangled and contains too much old/non-working code, so it's probably better to post it as "external".
; EAX = BITS: ... CF ZF SF OF
FlagsToEAX MACRO
pushfd
xor eax,eax
pop edx
bt edx,0
rcl eax,1
bt edx,6
rcl eax,1
bt edx,7
rcl eax,1
bt edx,11
rcl eax,1
ENDM
Etalone MACRO ow0, ow1
LOCAL @l1, @l2, @l3, @l0, @l4
push ebx
mov eax,dword ptr [ow0+12]
mov edx,dword ptr [ow1+12]
cmp eax,edx
jnz @l1 ; just save flags
mov ecx,dword ptr [ow0+8]
mov ebx,dword ptr [ow1+8]
cmp ecx,ebx
jnz @l2
mov ecx,dword ptr [ow0+4]
mov ebx,dword ptr [ow1+4]
cmp ecx,ebx
jnz @l2
mov ecx,dword ptr [ow0]
mov ebx,dword ptr [ow1]
cmp ecx,ebx
jz @l1
@l2:
mov eax,0
ja @l0 ; if it's above - the number is bigger because this isn't MSD
;mov byte ptr [esp+3],1 ; CF set, below than (unsigned)
mov eax,1001Y ; CF and OF set, below than and less than
; setting the sign flag has no meaning in this case, it's superfluous
jmp @l0
@l1:
FlagsToEAX
@l0:
if 0
push eax
mov ebx,eax
test ebx,1
jz @F
print "OF "
@@:
test ebx,2
jz @F
print "SF "
@@:
test ebx,4
jz @F
print "ZF "
@@:
test ebx,8
jz @F
print "CF "
@@:
print chr$(13,10)
pop eax
endif
pop ebx
ENDM
TestVal dd 0,0,0,0,0
dd 1,1,0,0,0
dd 100h,0,1,0,0
dd 10000h,0,0,1,0
dd 1000000h,0,0,0,1
dd 40000000h,0,0,0,40000000h
dd 40000001h,1,0,0,40000000h
dd 40000100h,0,1,0,40000000h
dd 40010000h,0,0,1,40000000h
dd 41000000h,0,0,0,40000001h
dd 80000000h,0,0,0,80000000h
dd 80000001h,1,0,0,80000000h
dd 80000100h,0,1,0,80000000h
dd 80010000h,0,0,1,80000000h
dd 81000000h,0,0,0,80000001h
dd 0C0000000h,0,0,0,0C0000000h
dd 0C0000001h,1,0,0,0C0000000h
dd 0C0000100h,0,1,0,0C0000000h
dd 0C0010000h,0,0,1,0C0000000h
dd 0C1000000h,0,0,0,0C0000001h
dd 3FFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
dd 3FFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
dd 3FFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,3FFFFFFFh
dd 3FFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,3FFFFFFFh
dd 3EFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFEh
dd 7FFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
dd 7FFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
dd 7FFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,7FFFFFFFh
dd 7FFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,7FFFFFFFh
dd 7EFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFEh
dd 0BFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
dd 0BFFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
dd 0BFFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,0BFFFFFFFh
dd 0BFFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,0BFFFFFFFh
dd 0BEFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFEh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
dd 0FFFFFFFEh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
dd 0FFFFFEFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh
dd 0FFFEFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh
dd 0FEFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh
dd 0FFFF0001H,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
ArraySize EQU $-TestVal
;dd 123456h
CheckIt MACRO howToInvoke:REQ
CheckIt2 <howToInvoke>, 0, 0
CheckIt2 <howToInvoke>, 4, 0
CheckIt2 <howToInvoke>, 0, 4
CheckIt2 <howToInvoke>, 4, 4
ENDM
CheckIt2 MACRO howToInvoke, offst1:=<0>, offst2:=<0>
LOCAL @l3, @l2, @l1
;########################## CHECK
push esi
push edi
push ebx
print "#######################################################",13,10
print "Testing algo: ",@CatStr(<!">,<howToInvoke>,<!">)," offst1: ",@CatStr(<!">,<offst1>,<!">)," offst2: ",@CatStr(<!">,<offst2>,<!">),13,10
mov esi,offset TestVal+offst1
@l3:
mov ebx,ArraySize/16
mov edi,offset TestVal+offst1
@l2:
howToInvoke
FlagsToEAX
push eax
Etalone esi, edi
pop ecx
xor eax,ecx
jz @l1 ; test OK
cmp eax,3 ; the layout of the SF and OF flags may differ, but if both of them differ
jz @l1 ; between the first and the second comparison, the result is still correct,
@@: ; since signed less-than is OF != SF no matter which of the flags is (un)set
;int 3
if 0
print str$(ebx)," - Test failed: "
print uhex$(dword ptr [esi+12])
print "_"
print uhex$(dword ptr [esi+8])
print "_"
print uhex$(dword ptr [esi+4])
print "_"
print uhex$(dword ptr [esi])
print " "
print uhex$(dword ptr [edi+12])
print "_"
print uhex$(dword ptr [edi+8])
print "_"
print uhex$(dword ptr [edi+4])
print "_"
print uhex$(dword ptr [edi]),13,10
else
print uhex$(esi)," "
print uhex$(edi)," "
invoke IsDebuggerPresent
test eax,eax
jz @F
int 3
jmp @l2 ; repeat test to see it under the debugger, or skip this instruction
@@:
endif
@l1:
add edi,16+offst2
dec ebx
jnz @l2
add esi,16+offst2
cmp esi,offset TestVal+ArraySize
jb @l3
pop ebx
pop edi
pop esi
print "Test done",13,10,13,10,13,10
;##########################
ENDM
(this is the code from the prog posted at the bottom of page 10)
To test your algos, just insert them into the source, insert the code above, and then use this:
CheckIt <Cmp128NidudSEEU [esi],[edi]>
CheckIt <Cmp128NidudSEE [esi],[edi]>
To check algos that are procs and not macros, use this construct:
CheckIt <invoke AlgoName,esi,edi>
(esi and edi must be specified as the params)
It will print the offsets of the failed numbers and, if you are running the prog under a debugger, will break and then re-run the failed comparison, so you can trace things.
Did it pass Dave's check? :eek:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
-----------------------------------------------
1996511 cycles for Cmp128Dave
3028413 cycles for Cmp128Dave2
1932345 cycles for Cmp128Nidud
1350691 cycles for Cmp128NidudSEE (xor)
1028161 cycles for Cmp128Alex (xor)
848824 cycles for Cmp128JJ (xor)
838612 cycles for Cmp128DaveU (unsigned)
838743 cycles for Cmp128NidudU (unsigned)
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
------------------------------------------------------
1897321 cycles for Cmp128Dave
6395118 cycles for Cmp128Dave2
1833315 cycles for Cmp128Nidud
1987302 cycles for Cmp128NidudSEE (xor)
921104 cycles for Cmp128Alex (xor)
737550 cycles for Cmp128JJ (xor)
873656 cycles for Cmp128DaveU (unsigned)
714518 cycles for Cmp128NidudU (unsigned)
my latest version has 16-aligned data, as well as a few other improvements
http://masm32.com/board/index.php?topic=2222.msg23534#msg23534
interesting that qWord's "magic" number (that's a pun, really) is needed
that means that my set of data values is insufficient to perform a comprehensive test
it also means that some of our "thought-to-be-good" algos may not be
i may have to look at it a little closer to see what other values are needed
i thought i had it covered with the walking 1's and 0's
Quote from: nidud on August 23, 2013, 01:21:06 AM
Quote from: Antariy on August 23, 2013, 12:37:25 AM
*Suspeciously looking to the proc name and then to the source*
sorry about that :lol:
Do not worry about that :biggrin:
Quote from: nidud on August 23, 2013, 05:19:52 AM
The xor test passes unsigned macros:
Cmp128NidudU
Cmp128DaveU
Quote
CheckIt <Cmp128NidudSEE [esi],[edi]>
I looked at the code, and the TestVal (http://masm32.com/board/index.php?topic=2222.msg23472#msg23472) data is not align 16
I aligned the table by removing the test DWORD in front of the OWORD
I then changed the CheckIt macro:
CheckIt MACRO howToInvoke:REQ
CheckIt2 <howToInvoke>, 0, 0
CheckIt2 <howToInvoke>, 16, 0
CheckIt2 <howToInvoke>, 0, 16
CheckIt2 <howToInvoke>, 16, 16
ENDM
Now the Cmp128NidudSEE macro passes, and:
Cmp128NidudSEEU
Cmp128NidudU
Cmp128DaveU
Yes, it is not aligned, but there is a reason for that, too: our algos should work with unaligned data as well, even if they are SSE (that's why we use MOVUPS/MOVUPD) (though JJAxCMP128bit is not currently aware of unaligned data, but that's a question of one more instruction). Also, I change the offset by 4 not only to change the numbers, but also to make the "check DWORD" a part of the numbers: it will have a different position in them in different passes, so we actually have, roughly speaking, 4 times more test numbers than Dave's original set of OWORDs.
Quote from: dedndave on August 23, 2013, 01:10:21 PM
interesting that qWord's "magic" number (that's a pun, really) is needed
that means that my set of data values is insufficient to perform a comprehensive test
it also means that some of our "thought-to-be-good" algos may not be
i may have to look at it a little closer to see what other values are needed
i thought i had it covered with the walking 1's and 0's
In my testbed I get very many errors for the algos, though some numbers are cross-repeated; others did not appear when I first used your testing data with only one pass, using the OWORDs just as they were prepared to be used (i.e. skipping the DWORD and checking the OWORDs). After I added playing with the offsets and using the additional DWORD as a part of the numbers, many new numbers were revealed.
So I think it's probably an idea worth following: we may craft the data as OWORDs, but then walk through it with a step of a DWORD, or even a byte; this will increase the chance of detection.
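The "walk with a smaller step" idea can be sketched like this in Python (a model only; `owords_with_step` is a hypothetical name, and the sample data are just the first two rows of the table posted earlier):

```python
import struct

def owords_with_step(dwords, step_bytes=4):
    """Read overlapping 16-byte little-endian windows from a dword table."""
    raw = struct.pack("<%dI" % len(dwords), *dwords)
    return [
        int.from_bytes(raw[off:off + 16], "little")
        for off in range(0, len(raw) - 16 + 1, step_bytes)
    ]

if __name__ == "__main__":
    table = [0, 0, 0, 0, 0,
             1, 1, 0, 0, 0]
    for step in (16, 4, 1):
        values = owords_with_step(table, step)
        print(step, len(values), len(set(values)))
```

A DWORD step turns every table into several times as many operands, and a byte step multiplies that again, at the price of many repeated values.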
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
---------------------------------------------------------
1986385 cycles for Cmp128Dave
6307221 cycles for Cmp128Dave2
1832520 cycles for Cmp128Nidud
3162995 cycles for Cmp128NidudSEE (xor)
970403 cycles for Cmp128Alex (xor)
806938 cycles for Cmp128JJ (xor)
866598 cycles for Cmp128DaveU (unsigned)
754468 cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
1980527 cycles for Cmp128Dave
6309359 cycles for Cmp128Dave2
1853104 cycles for Cmp128Nidud
3159145 cycles for Cmp128NidudSEE (xor)
964008 cycles for Cmp128Alex (xor)
807636 cycles for Cmp128JJ (xor)
867367 cycles for Cmp128DaveU (unsigned)
749661 cycles for Cmp128NidudU (unsigned)
---------------------------------------------------------
--- ok ---
i like that method, nidud :t
i was thinking of doing something like that, and adding parity - lol
but i don't have time to mess with it, right now
i did create a new set of values
but, i haven't had time to validate the standard flags proc
TestVal dd 0,0,0,0
dd 1,0,0,0
dd 0,1,0,0
dd 0,0,1,0
dd 0,0,0,1
dd 7FFFFFFFh,0,0,0
dd 0FFFFFFFFh,0,0,0
dd 0FFFFFFFFh,1,0,0
dd 0FFFFFFFFh,7FFFFFFFh,0,0
dd 0FFFFFFFFh,0FFFFFFFFh,0,0
dd 0FFFFFFFFh,0FFFFFFFFh,1,0
dd 0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh,0
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,1
dd 0,0,0,40000000h
dd 1,0,0,40000000h
dd 0,1,0,40000000h
dd 0,0,1,40000000h
dd 0,0,0,40000001h
dd 7FFFFFFFh,0,0,40000000h
dd 0FFFFFFFFh,0,0,40000000h
dd 0FFFFFFFFh,1,0,40000000h
dd 0FFFFFFFFh,7FFFFFFFh,0,40000000h
dd 0FFFFFFFFh,0FFFFFFFFh,0,40000000h
dd 0FFFFFFFFh,0FFFFFFFFh,1,40000000h
dd 0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh,40000000h
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,40000000h
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,40000001h
dd 0,0,0,80000000h
dd 1,0,0,80000000h
dd 0,1,0,80000000h
dd 0,0,1,80000000h
dd 0,0,0,80000001h
dd 7FFFFFFFh,0,0,80000000h
dd 0FFFFFFFFh,0,0,80000000h
dd 0FFFFFFFFh,1,0,80000000h
dd 0FFFFFFFFh,7FFFFFFFh,0,80000000h
dd 0FFFFFFFFh,0FFFFFFFFh,0,80000000h
dd 0FFFFFFFFh,0FFFFFFFFh,1,80000000h
dd 0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh,80000000h
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,80000000h
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,80000001h
dd 0,0,0,0C0000000h
dd 1,0,0,0C0000000h
dd 0,1,0,0C0000000h
dd 0,0,1,0C0000000h
dd 0,0,0,0C0000001h
dd 7FFFFFFFh,0,0,0C0000000h
dd 0FFFFFFFFh,0,0,0C0000000h
dd 0FFFFFFFFh,1,0,0C0000000h
dd 0FFFFFFFFh,7FFFFFFFh,0,0C0000000h
dd 0FFFFFFFFh,0FFFFFFFFh,0,0C0000000h
dd 0FFFFFFFFh,0FFFFFFFFh,1,0C0000000h
dd 0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh,0C0000000h
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0C0000000h
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0C0000001h
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
dd 0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,3FFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,3FFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFEh
dd 80000000h,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
dd 0,0FFFFFFFFh,0FFFFFFFFh,3FFFFFFFh
dd 0,0FFFFFFFEh,0FFFFFFFFh,3FFFFFFFh
dd 0,80000000h,0FFFFFFFFh,3FFFFFFFh
dd 0,0,0FFFFFFFFh,3FFFFFFFh
dd 0,0,0FFFFFFFEh,3FFFFFFFh
dd 0,0,80000000h,3FFFFFFFh
dd 0,0,0,3FFFFFFFh
dd 0,0,0,3FFFFFFEh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
dd 0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,7FFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,7FFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFEh
dd 80000000h,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
dd 0,0FFFFFFFFh,0FFFFFFFFh,7FFFFFFFh
dd 0,0FFFFFFFEh,0FFFFFFFFh,7FFFFFFFh
dd 0,80000000h,0FFFFFFFFh,7FFFFFFFh
dd 0,0,0FFFFFFFFh,7FFFFFFFh
dd 0,0,0FFFFFFFEh,7FFFFFFFh
dd 0,0,80000000h,7FFFFFFFh
dd 0,0,0,7FFFFFFFh
dd 0,0,0,7FFFFFFEh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
dd 0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,0BFFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,0BFFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFEh
dd 80000000h,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
dd 0,0FFFFFFFFh,0FFFFFFFFh,0BFFFFFFFh
dd 0,0FFFFFFFEh,0FFFFFFFFh,0BFFFFFFFh
dd 0,80000000h,0FFFFFFFFh,0BFFFFFFFh
dd 0,0,0FFFFFFFFh,0BFFFFFFFh
dd 0,0,0FFFFFFFEh,0BFFFFFFFh
dd 0,0,80000000h,0BFFFFFFFh
dd 0,0,0,0BFFFFFFFh
dd 0,0,0,0BFFFFFFEh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
dd 0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh,0FFFFFFFFh
dd 0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFEh
dd 80000000h,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
dd 0,0FFFFFFFFh,0FFFFFFFFh,0FFFFFFFFh
dd 0,0FFFFFFFEh,0FFFFFFFFh,0FFFFFFFFh
dd 0,80000000h,0FFFFFFFFh,0FFFFFFFFh
dd 0,0,0FFFFFFFFh,0FFFFFFFFh
dd 0,0,0FFFFFFFEh,0FFFFFFFFh
dd 0,0,80000000h,0FFFFFFFFh
dd 0,0,0,0FFFFFFFFh
dd 0,0,0,0FFFFFFFEh
TestVal_end LABEL BYTE
I haven't followed this as intensely as I should, sorry. Now I ran your test, and picked arbitrarily one of the "failed" values, and I don't quite understand why you consider that a failure:
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
ox0 oword 0FFFFFFFFFFFFFFFF00000001FFFFFFFFh
ox1 oword 0FFFFFFFF00000001FFFFFFFF00000001h
qx0 qword 0FFFFFFFF0001FFFFh
qx1 qword 0FFFF0001FFFF0001h
dx0 dd 0FFFF01FFh
dx1 dd 0FF01FF01h
Init
Ocmp ox0, ox1
movups xmm0, ox0
movups xmm1, ox1
deb 4, "OWORD size", x:xmm0, x:xmm1, flags
Qcmp qx0, qx1
deb 4, "QWORD size", qx0, qx1, x:qx0, x:qx1, flags
mov eax, dx0
cmp eax, dx1
deb 4, "DWORD size", dx0, dx1, x:dx0, x:dx1, flags
Inkey CrLf$, "was: NO NS NZ CY should be: NO NS NZ NC"
Exit
end start
Output:
OWORD size
x:xmm0 FFFFFFFF FFFFFFFF 00000001 FFFFFFFF
x:xmm1 FFFFFFFF 00000001 FFFFFFFF 00000001
flags: czso
QWORD size
qx0 -4294836225
qx1 -281466386841599
x:qx0 FFFFFFFF 0001FFFF
x:qx1 FFFF0001 FFFF0001
flags: czso
DWORD size
dx0 -65025
dx1 -16646399
x:dx0 FFFF01FF
x:dx1 FF01FF01
flags: czso <<<<<<<<< lowercase means "not set"
was: NO NS NZ CY should be: NO NS NZ NC
Or do I misunderstand something? Apologies if that is the case...
DWORD size
dx0 -65025
dx1 -16646399
x:dx0 FFFF01FF
x:dx1 FF01FF01
flags: czso <<<<<<<<< lowercase means "not set"
was: NO NS NZ CY should be: NO NS NZ NC
the carry flag was set by the algorithm and should not have been
Quote from: dedndave on August 24, 2013, 03:28:05 AM
flags: czso <<<<<<<<< lowercase means "not set"
was: NO NS NZ CY should be: NO NS NZ NC
the carry flag was set by the algorithm and should not have been
Well, not by my algo... ::)
Quote from: nidud on August 24, 2013, 04:05:01 AM
The result of a full test for Cmp128JJ
cmp FFFFFFFF_FFFFFFFF_00000001_FFFFFFFF , FFFFFFFF_00000001_FFFFFFFF_00000001
AX:DX 0001FFFF was: NO NS NZ CY should be: NO NS NZ NC
216 Failures
That one was marked as "tinkering with", you can take it out. I was talking about the MasmBasic algo (Cmp128JJSEE - what is SEE? SSE?) which, AFAIK, sets zero and carry exactly like a cmp eax, edx; it does produce different results for SF/OF but in a manner that does not alter the jl/jg jumps (which require SF!=OF and SF==OF, respectively). Which means 0 failures, right?
Besides, as shown above, your test occasionally produces wrong results. The deb macro's czso means "none of the four are set", your algo says the carry was set. Olly and deb say carry is clear.
Quote from: nidud on August 24, 2013, 05:15:19 AM
The result of a full test
150 Failures
Yes, and all 150 produce correct jumps because SF & OF are swapped. So Ocmp and Qcmp (http://masm32.com/board/index.php?topic=94.msg23071#msg23071) work correctly - no failures. I guess the same holds true for Alex' version, although I haven't had time to test it.
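The argument about swapped flags can be checked exhaustively. This is a small sketch in Python with made-up helper names, using the documented jump conditions (jl: SF != OF; jg: ZF = 0 and SF == OF); it says nothing about the macros themselves, only about what a swap of SF and OF can and cannot change.

```python
from itertools import product

def takes_jl(zf, sf, of):
    """JL (signed less) is taken when SF != OF."""
    return sf != of

def takes_jg(zf, sf, of):
    """JG (signed greater) is taken when ZF is clear and SF == OF."""
    return (not zf) and sf == of

def swap_changes_signed_jumps():
    """True if exchanging SF and OF alters JL or JG for any flag combination."""
    for zf, sf, of in product([False, True], repeat=3):
        if takes_jl(zf, sf, of) != takes_jl(zf, of, sf):
            return True
        if takes_jg(zf, sf, of) != takes_jg(zf, of, sf):
            return True
    return False

def swap_changes_js_jo():
    """JS looks at SF alone and JO at OF alone, so a swap shows whenever SF != OF."""
    return any(sf != of for sf, of in product([False, True], repeat=2))

print(swap_changes_signed_jumps())  # False
print(swap_changes_js_jo())         # True
```

So a test that checks jl/jg sees no difference, while a test that checks js/jo directly does.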
Can I please have timings for the attached prog?
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
---------------------------------------------------------
5934340 cycles for Cmp128Dave
9986238 cycles for Cmp128Dave2
6086064 cycles for Cmp128Nidud
2989932 cycles for Cmp128NidudSEE (xor)
2520135 cycles for Cmp128NidudSEEU (unsigned)
933977 cycles for Cmp128Alex (xor)
991343 cycles for Cmp128DaveU (unsigned)
926699 cycles for Cmp128NidudU (unsigned)
3613971 cycles for JJAxCMP128bit (SSE)
5933804 cycles for Ocmp2 - JJ's (SSE)
2965817 cycles for AxCMP128bitProc3
3446805 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
6019773 cycles for Cmp128Dave
9803217 cycles for Cmp128Dave2
6065554 cycles for Cmp128Nidud
2770437 cycles for Cmp128NidudSEE (xor)
2689950 cycles for Cmp128NidudSEEU (unsigned)
945193 cycles for Cmp128Alex (xor)
950871 cycles for Cmp128DaveU (unsigned)
923201 cycles for Cmp128NidudU (unsigned)
3674979 cycles for JJAxCMP128bit (SSE)
5935439 cycles for Ocmp2 - JJ's (SSE)
2902697 cycles for AxCMP128bitProc3
3456920 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
--- ok ---
Alex CMP128bit
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
---------------------------------------------------------
5817682 cycles for Cmp128Dave
9943469 cycles for Cmp128Dave2
6048901 cycles for Cmp128Nidud
2573234 cycles for Cmp128NidudSEE (xor)
2341353 cycles for Cmp128NidudSEEU (unsigned)
1022981 cycles for Cmp128Alex (xor)
919308 cycles for Cmp128DaveU (unsigned)
844475 cycles for Cmp128NidudU (unsigned)
3445214 cycles for JJAxCMP128bit (SSE)
5638061 cycles for Ocmp2 - JJ's (SSE)
2800548 cycles for AxCMP128bitProc3
3475108 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
5776971 cycles for Cmp128Dave
9150470 cycles for Cmp128Dave2
5760939 cycles for Cmp128Nidud
2597086 cycles for Cmp128NidudSEE (xor)
2326359 cycles for Cmp128NidudSEEU (unsigned)
1154350 cycles for Cmp128Alex (xor)
880936 cycles for Cmp128DaveU (unsigned)
896082 cycles for Cmp128NidudU (unsigned)
3442676 cycles for JJAxCMP128bit (SSE)
5767909 cycles for Ocmp2 - JJ's (SSE)
2825598 cycles for AxCMP128bitProc3
3242168 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
nidud cmp128
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
---------------------------------------------------------
5657695 cycles for Cmp128Dave
8870013 cycles for Cmp128Dave2
5700804 cycles for Cmp128Nidud
1115699 cycles for Cmp128Alex (xor)
895891 cycles for Cmp128DaveU (unsigned)
1045671 cycles for Cmp128NidudU (unsigned)
5584412 cycles for Cmp128JJSSE (xor)
3425749 cycles for Cmp128JJAlexSSE (xor)
6307931 cycles for Cmp128NidudSSE
2788370 cycles for AxCMP128bitProc3
3278983 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
5626412 cycles for Cmp128Dave
8946119 cycles for Cmp128Dave2
5902719 cycles for Cmp128Nidud
1007638 cycles for Cmp128Alex (xor)
911179 cycles for Cmp128DaveU (unsigned)
977962 cycles for Cmp128NidudU (unsigned)
5649649 cycles for Cmp128JJSSE (xor)
3396764 cycles for Cmp128JJAlexSSE (xor)
6278291 cycles for Cmp128NidudSSE
2784183 cycles for AxCMP128bitProc3
3333723 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
The curious thing is that there are some "rumours", like "CMOV is preferable to jump + mov" or "string instructions with a REP(E) prefix are the fastest possible", but in tests these rumours often turn out not to be confirmed :icon_eek:
nidud's cmp128.zip
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
---------------------------------------------------------
6909610 cycles for Cmp128Dave
9987059 cycles for Cmp128Dave2
5967230 cycles for Cmp128Nidud
948941 cycles for Cmp128Alex (xor)
969717 cycles for Cmp128DaveU (unsigned)
1015841 cycles for Cmp128NidudU (unsigned)
5981220 cycles for Cmp128JJSSE (xor)
3506260 cycles for Cmp128JJAlexSSE (xor)
6612580 cycles for Cmp128NidudSSE
2884301 cycles for AxCMP128bitProc3
3538700 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
6021956 cycles for Cmp128Dave
9902589 cycles for Cmp128Dave2
5931336 cycles for Cmp128Nidud
982830 cycles for Cmp128Alex (xor)
928048 cycles for Cmp128DaveU (unsigned)
933436 cycles for Cmp128NidudU (unsigned)
5800983 cycles for Cmp128JJSSE (xor)
3515863 cycles for Cmp128JJAlexSSE (xor)
6532618 cycles for Cmp128NidudSSE
2878085 cycles for AxCMP128bitProc3
3399463 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
--- ok ---
Quote from: nidud on August 24, 2013, 10:45:31 AM
Alex,
I converted the functions to macros, it's a bit faster :biggrin:
Thanks :biggrin:
It looks like the prologue and epilogue take up much of the time.
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
---------------------------------------------------------
6111257 cycles for Cmp128Dave
9430653 cycles for Cmp128Dave2
6087648 cycles for Cmp128Nidud
977496 cycles for Cmp128Alex (xor)
912537 cycles for Cmp128DaveU (unsigned)
897094 cycles for Cmp128NidudU (unsigned)
5813163 cycles for Cmp128JJSSE (xor)
3502859 cycles for Cmp128JJAlexSSE (xor)
6542713 cycles for Cmp128NidudSSE
2027954 cycles for AxCMP128bitProc3
2026933 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
5771320 cycles for Cmp128Dave
9534868 cycles for Cmp128Dave2
5882258 cycles for Cmp128Nidud
999464 cycles for Cmp128Alex (xor)
911686 cycles for Cmp128DaveU (unsigned)
970983 cycles for Cmp128NidudU (unsigned)
5801006 cycles for Cmp128JJSSE (xor)
3547188 cycles for Cmp128JJAlexSSE (xor)
6496130 cycles for Cmp128NidudSSE
2012868 cycles for AxCMP128bitProc3
2010059 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
--- ok ---
Hi nidud,
timings:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 2360/1000 cycles
1771 cycles for 1000 * Ocmp (JJ)
1757 cycles for 1000 * Ocmp2 (JJ)
1445 cycles for 1000 * cmp128n (nidud)
3829 cycles for 1000 * cmp128 qWord
3146 cycles for 1000 * AxCMP128bit
1843 cycles for 1000 * Ocmp (JJ)
1846 cycles for 1000 * Ocmp2 (JJ)
1612 cycles for 1000 * cmp128n (nidud)
3992 cycles for 1000 * cmp128 qWord
3144 cycles for 1000 * AxCMP128bit
1709 cycles for 1000 * Ocmp (JJ)
1746 cycles for 1000 * Ocmp2 (JJ)
1506 cycles for 1000 * cmp128n (nidud)
3848 cycles for 1000 * cmp128 qWord
3270 cycles for 1000 * AxCMP128bit
--- ok ---
Gunther
Intel(R) Core(TM) i3 CPU M 370 @ 2.40GHz (SSE4)
---------------------------------------------------------
2639323 cycles for Cmp128Dave
8595376 cycles for Cmp128Dave2
3084573 cycles for Cmp128Nidud
1718384 cycles for Cmp128Alex (xor)
742712 cycles for Cmp128DaveU (unsigned)
621263 cycles for Cmp128NidudU (unsigned)
2229787 cycles for Cmp128JJSSE (xor)
988796 cycles for Cmp128JJAlexSSE (xor)
1853176 cycles for Cmp128NidudSSE
1302968 cycles for AxCMP128bitProc3
1286231 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
1768975 cycles for Cmp128Dave
7299673 cycles for Cmp128Dave2
2650490 cycles for Cmp128Nidud
1824726 cycles for Cmp128Alex (xor)
1732493 cycles for Cmp128DaveU (unsigned)
1461457 cycles for Cmp128NidudU (unsigned)
3480289 cycles for Cmp128JJSSE (xor)
1887690 cycles for Cmp128JJAlexSSE (xor)
2749835 cycles for Cmp128NidudSSE
2177752 cycles for AxCMP128bitProc3
2126370 cycles for AxCMP128bitProc3c (cmov)
---------------------------------------------------------
--- ok ---
Hi nidud,
the new timings:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
698804 cycles [x][x][x] - Cmp128Dave
1161997 cycles [x][x][x] - Cmp128Dave2
614967 cycles [x][x][x] - Cmp128Nidud
715645 cycles [x][x][x] - Cmp128NidudSSE
461614 cycles [x][x][ ] - Cmp128Alex
378413 cycles [x][x][ ] - Cmp128JJSSE
331370 cycles [x][x][ ] - Cmp128JJAlexSSE
467938 cycles [x][x][ ] - AxCMP128bitProc3
436255 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
437147 cycles [x][ ][ ] - Cmp128DaveU
463277 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Gunther
Quote from: nidud on August 25, 2013, 12:01:08 AM
As a result of this Alex and my SEE macro failed :lol:
Changes made to Cmp128JJAlexSSE:
movups xmm0,[ow0]
movups xmm1,[ow1] ; ++
movzx eax,word ptr [ow0+14]
;pcmpeqb xmm0,[ow1] ; this failed on unaligned data
pcmpeqb xmm0,xmm1
Yes, I noted that it's unaware of unaligned data. Your solution is right :t
Here are the timings:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2652796 cycles [x][x][x] - Cmp128Dave
3952276 cycles [x][x][x] - Cmp128Dave2
2639764 cycles [x][x][x] - Cmp128Nidud
3069710 cycles [x][x][x] - Cmp128NidudSSE
944781 cycles [x][x][ ] - Cmp128Alex
1913987 cycles [x][x][ ] - Cmp128JJSSE
2623148 cycles [x][x][ ] - Cmp128JJAlexSSE
1324491 cycles [x][x][ ] - AxCMP128bitProc3
1279045 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
726832 cycles [x][ ][ ] - Cmp128DaveU
738206 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
It's interesting how differently algos perform on different processors.
Hi Alex,
Quote from: Antariy on August 25, 2013, 05:56:47 PM
It's interesting how differently algos perform on different processors.
yes, it seems that things become more and more hardware dependent. The only way to overcome that is to use different code paths.
Gunther
A brand new Cmp128JJAlexSSE!
Don't miss it on your displays right now!
Now fully compliant with Dave's Testing Method™ (JO/JS works as expected).
Now even in 3 new flavours!
:greensml:
Timings welcome :t
Hi Gunther :biggrin:
Quote from: Gunther on August 25, 2013, 06:57:30 PM
Quote from: Antariy on August 25, 2013, 05:56:47 PM
It's interesting how differently algos perform on different processors.
yes, it seems that things become more and more hardware dependent. The only way to overcome that is to use different code paths.
With the current number of different CPU models, that would be a lot of code :biggrin:
You're perfectly right, to get every clock from every machine it's the only way.
Ah, I forgot to post the timings in the previous post:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2958901 cycles [x][x][x] - Cmp128Dave
4334913 cycles [x][x][x] - Cmp128Dave2
2957840 cycles [x][x][x] - Cmp128Nidud
3402571 cycles [x][x][x] - Cmp128NidudSSE
1034917 cycles [x][x][ ] - Cmp128Alex
2118138 cycles [x][x][ ] - Cmp128JJSSE
1762373 cycles [x][x][x] - Cmp128JJAlexSSE_1
1726287 cycles [x][x][x] - Cmp128JJAlexSSE_2
1739010 cycles [x][x][x] - Cmp128JJAlexSSE_3
1464577 cycles [x][x][ ] - AxCMP128bitProc3
1372694 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
779180 cycles [x][ ][ ] - Cmp128DaveU
798269 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Quote from: Antariy on August 25, 2013, 08:07:28 PM
A brand new Cmp128JJAlexSSE!
It seems to like my Celeron - best among the "good" algos :t
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
968781 cycles {x}{x}{x} - Cmp128Dave
2629540 cycles {x}{x}{x} - Cmp128Dave2
938714 cycles {x}{x}{x} - Cmp128Nidud
1039057 cycles {x}{x}{x} - Cmp128NidudSSE
706010 cycles {x}{x}{ } - Cmp128Alex
1131248 cycles {x}{x}{ } - Cmp128JJSSE
834193 cycles {x}{x}{x} - Cmp128JJAlexSSE_1
947852 cycles {x}{x}{x} - Cmp128JJAlexSSE_2
948452 cycles {x}{x}{x} - Cmp128JJAlexSSE_3
881549 cycles {x}{x}{ } - AxCMP128bitProc3
890835 cycles {x}{x}{ } - AxCMP128bitProc3c (cmov)
610504 cycles {x}{ }{ } - Cmp128DaveU
599043 cycles {x}{ }{ } - Cmp128NidudU
Oops, too many digits in the numbers, so I was judging them "by width" :greensml: "By width" the selected timings were wider, so I thought they were much slower... :greensml: :biggrin:
Quote from: nidud on August 25, 2013, 08:35:25 PM
1069602 cycles - - Cmp128JJAlexSSE_1
1070532 cycles - - Cmp128JJAlexSSE_2
1071273 cycles - - Cmp128JJAlexSSE_3
But here they are all close.
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2598846 cycles [x][x][x] - Cmp128Dave
3786288 cycles [x][x][x] - Cmp128Dave2
2616598 cycles [x][x][x] - Cmp128Nidud
3025310 cycles [x][x][x] - Cmp128NidudSSE
914405 cycles [x][x][ ] - Cmp128Alex
1906276 cycles [x][x][ ] - Cmp128JJSSE
1588020 cycles [x][x][x] - Cmp128JJAlexSSE_1
1562841 cycles [x][x][x] - Cmp128JJAlexSSE_2
1558993 cycles [x][x][x] - Cmp128JJAlexSSE_3
1326437 cycles [x][x][ ] - AxCMP128bitProc3
1254462 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
692441 cycles [x][ ][ ] - Cmp128DaveU
713309 cycles [x][ ][ ] - Cmp128NidudU
------------------------------------------------------
2615758 cycles [x][x][x] - Cmp128Dave
3829660 cycles [x][x][x] - Cmp128Dave2
2621750 cycles [x][x][x] - Cmp128Nidud
3031078 cycles [x][x][x] - Cmp128NidudSSE
908794 cycles [x][x][ ] - Cmp128Alex
1892463 cycles [x][x][ ] - Cmp128JJSSE
1591916 cycles [x][x][x] - Cmp128JJAlexSSE_1
1557071 cycles [x][x][x] - Cmp128JJAlexSSE_2
1559415 cycles [x][x][x] - Cmp128JJAlexSSE_3
1313596 cycles [x][x][ ] - AxCMP128bitProc3
1267780 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
711284 cycles [x][ ][ ] - Cmp128DaveU
741151 cycles [x][ ][ ] - Cmp128NidudU
tests that use a little more time return more repeatable results
if i am timing code, i try to make each pass last about 0.5 seconds
that seems to give repeatable numbers
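The same rule of thumb can be sketched outside the testbed. The following Python is only an illustration of the idea and makes several assumptions: it uses wall-clock time from `time.perf_counter` instead of cycle counts, and `calibrate` is a made-up helper, not something from the attached sources. It doubles the repeat count until one pass lasts at least the target time.

```python
import time

def calibrate(func, target_seconds=0.5, start_count=1000):
    """Double the repeat count until one pass of func lasts at least target_seconds.

    Returns (count, elapsed) for the first pass that was long enough.
    """
    count = start_count
    while True:
        t0 = time.perf_counter()
        for _ in range(count):
            func()
        elapsed = time.perf_counter() - t0
        if elapsed >= target_seconds:
            return count, elapsed
        count *= 2

if __name__ == "__main__":
    count, elapsed = calibrate(lambda: sum(range(100)))
    print(f"{count} repeats took {elapsed:.3f} s")
```

Because the count doubles, the accepted pass usually ends up somewhere between the target and about twice the target.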
Hi,
From Reply #216.
pre-P4 (SSE1)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1067625 cycles [x][x][x] - Cmp128Dave
2571737 cycles [x][x][x] - Cmp128Dave2
998428 cycles [x][x][x] - Cmp128Nidud
1083846 cycles [x][x][x] - Cmp128NidudSSE
847793 cycles [x][x][ ] - Cmp128Alex
1788551 cycles [x][x][ ] - Cmp128JJSSE
1215146 cycles [x][x][x] - Cmp128JJAlexSSE_1
1623996 cycles [x][x][x] - Cmp128JJAlexSSE_2
1570182 cycles [x][x][x] - Cmp128JJAlexSSE_3
1114476 cycles [x][x][ ] - AxCMP128bitProc3
1174133 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
608508 cycles [x][ ][ ] - Cmp128DaveU
612287 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1040860 cycles [x][x][x] - Cmp128Dave
2541208 cycles [x][x][x] - Cmp128Dave2
940427 cycles [x][x][x] - Cmp128Nidud
1046690 cycles [x][x][x] - Cmp128NidudSSE
834253 cycles [x][x][ ] - Cmp128Alex
1849858 cycles [x][x][ ] - Cmp128JJSSE
1453007 cycles [x][x][x] - Cmp128JJAlexSSE_1
1703155 cycles [x][x][x] - Cmp128JJAlexSSE_2
1713931 cycles [x][x][x] - Cmp128JJAlexSSE_3
963145 cycles [x][x][ ] - AxCMP128bitProc3
1004886 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
652720 cycles [x][ ][ ] - Cmp128DaveU
646938 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
979702 cycles [x][x][x] - Cmp128Dave
2660548 cycles [x][x][x] - Cmp128Dave2
948481 cycles [x][x][x] - Cmp128Nidud
1056326 cycles [x][x][x] - Cmp128NidudSSE
754229 cycles [x][x][ ] - Cmp128Alex
1145531 cycles [x][x][ ] - Cmp128JJSSE
852507 cycles [x][x][x] - Cmp128JJAlexSSE_1
960256 cycles [x][x][x] - Cmp128JJAlexSSE_2
959330 cycles [x][x][x] - Cmp128JJAlexSSE_3
891707 cycles [x][x][ ] - AxCMP128bitProc3
899999 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
615768 cycles [x][ ][ ] - Cmp128DaveU
606497 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Cheers,
Steve N.
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
973837 cycles [x][x][x] - Cmp128Dave
3064246 cycles [x][x][x] - Cmp128Dave2
924278 cycles [x][x][x] - Cmp128Nidud
1063306 cycles [x][x][x] - Cmp128NidudSSE
688245 cycles [x][x][ ] - Cmp128Alex
1082474 cycles [x][x][ ] - Cmp128JJSSE
801400 cycles [x][x][x] - Cmp128JJAlexSSE_1
898730 cycles [x][x][x] - Cmp128JJAlexSSE_2
902646 cycles [x][x][x] - Cmp128JJAlexSSE_3
896815 cycles [x][x][ ] - AxCMP128bitProc3
929492 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
632298 cycles [x][ ][ ] - Cmp128DaveU
627533 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Hi,
Using Dave's original 40 DWORD AND OWORD pairs of numbers,
and some of his logic, I wrote some code for my fixed point
program. Claims it passes the tests. Yippee! Had me going in
circles for a while.
Cheers,
Steve N.
Quote from: nidud on August 26, 2013, 07:04:16 AM
the first test is strange:
pre-P4 (SSE1)
most of the code used in the macros are SSE2
Hi nidud,
Yeah, I would think it should not run. But?
Cheers,
Steve N.
I have seen SSE2 code that would run on my P3 without triggering an exception, but which would produce incorrect results.
Yes, the instructions in question are PCMPEQB and MOVAPS/MOVAPD. The SSE2 form of PCMPEQB uses the same opcode as the MMX PCMPEQB but with a 66h prefix, which the PIII doesn't recognize, so it treats it as an MMX instruction (and the SSE results are incorrect). MOVAPS works on the PIII, and MOVAPD has the opcode of MOVAPS with a 66h prefix, so it works, too.
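To make the prefix point concrete, here are the register-to-register encodings side by side, written as a small Python sketch. The byte values are the documented encodings for the operands named in the comments; the helper `strip_66h` is made up for illustration, and whether a given PIII really ignores the prefix this way is taken from the post above, not verified here.

```python
# ModRM byte C1h = register form, destination register 0, source register 1.
MMX_PCMPEQB = bytes([0x0F, 0x74, 0xC1])         # pcmpeqb mm0, mm1
SSE2_PCMPEQB = bytes([0x66, 0x0F, 0x74, 0xC1])  # pcmpeqb xmm0, xmm1
MOVAPS = bytes([0x0F, 0x28, 0xC1])              # movaps xmm0, xmm1
MOVAPD = bytes([0x66, 0x0F, 0x28, 0xC1])        # movapd xmm0, xmm1

def strip_66h(code):
    """Return the instruction bytes as seen by a decoder that ignores a leading 66h."""
    return code[1:] if code[:1] == b"\x66" else code

print(strip_66h(SSE2_PCMPEQB) == MMX_PCMPEQB)  # True
print(strip_66h(MOVAPD) == MOVAPS)             # True
```

In the first case the remaining bytes name a different register file (mm instead of xmm), which would explain wrong results; in the second case the remaining instruction still moves the same 16 bytes.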
Quote from: nidud on August 26, 2013, 11:35:51 AM
this is starting to get a bit obsessive :lol:
Yes :biggrin:
Quote from: nidud on August 26, 2013, 11:35:51 AM
mov eax,1
bsf eax,eax
Does this work? :icon_eek:
Here you can simplify a bit:
Quote from: nidud on August 26, 2013, 11:35:51 AM
movmskps eax,xmm0
sub al,1111B
jnz @0
Instead of jnz @0, use jz to the exit from the macro - at that point the zero (equal) result is already handled correctly.
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2302893 cycles [x][x][x] - Cmp128Nidud
2441613 cycles [x][x][x] - Cmp128NidudSSE
2871717 cycles [x][x][x] - Cmp128Dave
4208738 cycles [x][x][x] - Cmp128Dave2
1724226 cycles [x][x][x] - Cmp128JJAlexSSE_1
1695861 cycles [x][x][x] - Cmp128JJAlexSSE_2
1946274 cycles [x][x][x] - Cmp128JJAlexSSE_3
985137 cycles [x][x][ ] - Cmp128Alex
2063049 cycles [x][x][ ] - Cmp128JJSSE
1411323 cycles [x][x][ ] - AxCMP128bitProc3
1324774 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
756882 cycles [x][ ][ ] - Cmp128DaveU
784458 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Feed the obsession...
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
---------------------------------------------------
305387 cycles [x][x][x] - Cmp128Nidud
308974 cycles [x][x][x] - Cmp128NidudSSE
617569 cycles [x][x][x] - Cmp128Dave
1184205 cycles [x][x][x] - Cmp128Dave2
273918 cycles [x][x][x] - Cmp128JJAlexSSE_1
319743 cycles [x][x][x] - Cmp128JJAlexSSE_2
319190 cycles [x][x][x] - Cmp128JJAlexSSE_3
452218 cycles [x][x][ ] - Cmp128Alex
323382 cycles [x][x][ ] - Cmp128JJSSE
417314 cycles [x][x][ ] - AxCMP128bitProc3
395354 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
341747 cycles [x][ ][ ] - Cmp128DaveU
348616 cycles [x][ ][ ] - Cmp128NidudU
Quote from: sinsi on August 26, 2013, 01:57:34 PM
Feed the obsession...
Me too :biggrin:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
844 kCycles [x][x][x] - Cmp128Dave
1299 kCycles [x][x][x] - Cmp128Dave2
846 kCycles [x][x][x] - Cmp128Nidud
922 kCycles [x][x][x] - Cmp128NidudSSE
644 kCycles [x][x][ ] - Cmp128Alex
1557 kCycles [x][x][ ] - Cmp128JJSSE
1471 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1465 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1465 kCycles [x][x][x] - Cmp128JJAlexSSE_3
802 kCycles [x][x][ ] - AxCMP128bitProc3
772 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
543 kCycles [x][ ][ ] - Cmp128DaveU
543 kCycles [x][ ][ ] - Cmp128NidudU
P.S.: Added a sar eax, 10, and changed test_end " kCycles (x)(x)(x) - Cmp128Dave2"
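A side note on the arithmetic, assuming the sar eax, 10 is what turns the cycle count into the displayed kCycles: a shift by 10 divides by 1024, not by 1000, so the displayed values are slightly smaller than true thousands. A tiny Python sketch (`to_kcycles` is a made-up name):

```python
def to_kcycles(cycles):
    """Model sar eax, 10 on a non-negative count: integer division by 1024."""
    return cycles >> 10

print(to_kcycles(1_000_000))  # 976
print(to_kcycles(1_024_000))  # 1000
```

For comparing algos on the same machine this does not matter, since every line is scaled the same way.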
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
700812 cycles [x][x][x] - Cmp128Nidud
729986 cycles [x][x][x] - Cmp128NidudSSE
971877 cycles [x][x][x] - Cmp128Dave
3026250 cycles [x][x][x] - Cmp128Dave2
782064 cycles [x][x][x] - Cmp128JJAlexSSE_1
890599 cycles [x][x][x] - Cmp128JJAlexSSE_2
926681 cycles [x][x][x] - Cmp128JJAlexSSE_3
682186 cycles [x][x][ ] - Cmp128Alex
1067566 cycles [x][x][ ] - Cmp128JJSSE
882899 cycles [x][x][ ] - AxCMP128bitProc3
888908 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
592661 cycles [x][ ][ ] - Cmp128DaveU
570588 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Yum-yum!
New macro added - a brute-force rework of the original GPR macro to make it work just like CMP (it passes Dave's check).
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2184021 cycles [x][x][x] - Cmp128Nidud
2313648 cycles [x][x][x] - Cmp128NidudSSE
2767063 cycles [x][x][x] - Cmp128Dave
4086277 cycles [x][x][x] - Cmp128Dave2
1672157 cycles [x][x][x] - Cmp128JJAlexSSE_1
1644385 cycles [x][x][x] - Cmp128JJAlexSSE_2
1889066 cycles [x][x][x] - Cmp128JJAlexSSE_3
980736 cycles [x][x][ ] - Cmp128Alex
1851407 cycles [x][x][x] - Cmp128Alex_2
1899452 cycles [x][x][x] - Cmp128Alex_3
2048700 cycles [x][x][ ] - Cmp128JJSSE
1388635 cycles [x][x][ ] - AxCMP128bitProc3
1311284 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
756260 cycles [x][ ][ ] - Cmp128DaveU
775831 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
I think AMD will probably like it better than Intel.
Quote from: nidud on August 26, 2013, 07:04:16 AM
the first test is strange:
pre-P4 (SSE1)
most of the code used in the macros are SSE2
But half of the current code is GPR, too - there is code of Dave's, yours and mine that doesn't use SSE at all :biggrin:
No timings?
Quote from: dedndave on August 23, 2013, 10:20:37 PM
i did create a new set of values
but, i haven't had time to validate the standard flags proc
Hi,
In Reply #187 Dave had an array of test values. I just tested
my algorithm against those, and passed. I created the check
values as he had in his earlier validation program as that was what
I based mine on. Would that still be of use to anyone else? I know
you want fast algorithms, and mine is most probably slow. (And it
is 16-bit to run it on an 80186.) Anyone interested in it?
Regards,
Steve N.
Quote from: nidud on August 27, 2013, 01:12:12 AM
Quote from: Antariy on August 26, 2013, 12:38:23 PM
Quote from: nidud on August 26, 2013, 11:35:51 AM
mov eax,1
bsf eax,eax
Does this work? :icon_eek:
The trick is to clear the zero flag without changing any of the other flags. In Dave's code this is done like this:
jnz c4
lahf
lea eax,[eax-4000h]
sahf
BSF is one of the few opcodes that only modifies ZF, but it is a bit slow.
Clocks
BSF 6-42
BSR 6-103
The problem is that BSF defines only the ZF flag; the other flags are "undefined" after the instruction. For those flags this means that their state is absolutely unpredictable, and on my CPU (and maybe (!) on every Intel) they are all zeroed (except ZF). In short, BSF cannot be used for this purpose with any robustness: if the flags aren't touched on one CPU, they may be messed up on another. Check it on your CPU: does BSF trash the other flags? Did it pass Dave's check? If so, your CPU doesn't change the other flags with BSF; otherwise it should not pass the check.
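For comparison, the LAHF / LEA / SAHF sequence quoted above relies only on documented behaviour. Here is a minimal model of it in Python; the function names are made up, and it assumes (my reading of the jnz in front of it) that the sequence is only reached with ZF set. LAHF puts SF:ZF:0:AF:0:PF:1:CF into AH, so ZF sits at bit 14 of EAX, and subtracting 4000h clears exactly that bit without a borrow.

```python
# Bit positions of the status flags inside AH after LAHF.
SF, ZF, AF, PF, CF = 0x80, 0x40, 0x10, 0x04, 0x01
STATUS = SF | ZF | AF | PF | CF

def lahf(eax, flags):
    """LAHF: AH = SF:ZF:0:AF:0:PF:1:CF, the rest of EAX is untouched."""
    ah = (flags & STATUS) | 0x02
    return (eax & 0xFFFF00FF) | (ah << 8)

def lea_minus_4000h(eax):
    """lea eax,[eax-4000h]: plain 32-bit arithmetic, LEA writes no flags."""
    return (eax - 0x4000) & 0xFFFFFFFF

def sahf(eax):
    """SAHF: load SF, ZF, AF, PF and CF back from AH."""
    return (eax >> 8) & STATUS

def clear_zf(flags, eax=0):
    """Run the three-instruction sequence on a flag byte that has ZF set."""
    assert flags & ZF, "the sequence is only reached when ZF is set"
    return sahf(lea_minus_4000h(lahf(eax, flags)))

print(hex(clear_zf(SF | ZF | CF)))  # 0x81: SF and CF survive, ZF is gone
```

OF is not part of the byte that LAHF/SAHF move, so it is left alone as well; the price is that AH is overwritten.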
Can I ask everyone for more timings for this archive? http://masm32.com/board/index.php?topic=2222.msg23743#msg23743
It would be interesting to see how worthwhile the rework of the old code is.
Quote from: FORTRANS on August 27, 2013, 08:48:04 AM
Anyone interested in it?
Yes, Steve, of course! I'm interested :t
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2044430 cycles [x][x][x] - Cmp128Nidud
2232933 cycles [x][x][x] - Cmp128NidudSSE
2631658 cycles [x][x][x] - Cmp128Dave
3862003 cycles [x][x][x] - Cmp128Dave2
1601513 cycles [x][x][x] - Cmp128JJAlexSSE_1
1559401 cycles [x][x][x] - Cmp128JJAlexSSE_2
1791892 cycles [x][x][x] - Cmp128JJAlexSSE_3
935826 cycles [x][x][ ] - Cmp128Alex
1729147 cycles [x][x][x] - Cmp128Alex_2
1779960 cycles [x][x][x] - Cmp128Alex_3
1913773 cycles [x][x][ ] - Cmp128JJSSE
1302324 cycles [x][x][ ] - AxCMP128bitProc3
1253729 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
701808 cycles [x][ ][ ] - Cmp128DaveU
752020 cycles [x][ ][ ] - Cmp128NidudU
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2033683 cycles [x][x][x] - Cmp128Nidud
2247710 cycles [x][x][x] - Cmp128NidudSSE
2628652 cycles [x][x][x] - Cmp128Dave
3813015 cycles [x][x][x] - Cmp128Dave2
1629220 cycles [x][x][x] - Cmp128JJAlexSSE_1
1591177 cycles [x][x][x] - Cmp128JJAlexSSE_2
1794286 cycles [x][x][x] - Cmp128JJAlexSSE_3
936215 cycles [x][x][ ] - Cmp128Alex
1725124 cycles [x][x][x] - Cmp128Alex_2
1782223 cycles [x][x][x] - Cmp128Alex_3
1900926 cycles [x][x][ ] - Cmp128JJSSE
1331104 cycles [x][x][ ] - AxCMP128bitProc3
1260544 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
696199 cycles [x][ ][ ] - Cmp128DaveU
734917 cycles [x][ ][ ] - Cmp128NidudU
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
700120 cycles [x][x][x] - Cmp128Nidud
730067 cycles [x][x][x] - Cmp128NidudSSE
972859 cycles [x][x][x] - Cmp128Dave
3028178 cycles [x][x][x] - Cmp128Dave2
784587 cycles [x][x][x] - Cmp128JJAlexSSE_1
890498 cycles [x][x][x] - Cmp128JJAlexSSE_2
928216 cycles [x][x][x] - Cmp128JJAlexSSE_3
680786 cycles [x][x][ ] - Cmp128Alex
1108150 cycles [x][x][x] - Cmp128Alex_2
1114646 cycles [x][x][x] - Cmp128Alex_3
1069239 cycles [x][x][ ] - Cmp128JJSSE
871461 cycles [x][x][ ] - AxCMP128bitProc3
889968 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
592212 cycles [x][ ][ ] - Cmp128DaveU
570113 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Thank you very much, Dave and Marinus! :biggrin:
Here you go Alex
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
---------------------------------------------------
332168 cycles [x][x][x] - Cmp128Nidud
339908 cycles [x][x][x] - Cmp128NidudSSE
671045 cycles [x][x][x] - Cmp128Dave
1346792 cycles [x][x][x] - Cmp128Dave2
299201 cycles [x][x][x] - Cmp128JJAlexSSE_1
386241 cycles [x][x][x] - Cmp128JJAlexSSE_2
398699 cycles [x][x][x] - Cmp128JJAlexSSE_3
508040 cycles [x][x][ ] - Cmp128Alex
610825 cycles [x][x][x] - Cmp128Alex_2
608946 cycles [x][x][x] - Cmp128Alex_3
378717 cycles [x][x][ ] - Cmp128JJSSE
467459 cycles [x][x][ ] - AxCMP128bitProc3
417007 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
360011 cycles [x][ ][ ] - Cmp128DaveU
383098 cycles [x][ ][ ] - Cmp128NidudU
Quote from: sinsi on August 27, 2013, 05:07:12 PM
Here you go Alex
Thank you very much, John! :biggrin:
Quote from: nidud on August 27, 2013, 05:22:34 PM
Quote from: Antariy on August 27, 2013, 12:32:27 PM
The problem is that BSF defines only the ZF flag; the other flags are "undefined" after the instruction. For those flags this means that their state is absolutely unpredictable, and on my CPU (and maybe (!) on every Intel) they are all zeroed (except ZF). In short, BSF cannot be used for this purpose with any robustness: if the flags aren't touched on one CPU, they may be messed up on another. Check it on your CPU: does BSF trash the other flags? Did it pass Dave's check? If so, your CPU doesn't change the other flags with BSF; otherwise it should not pass the check.
Unless this is specifically stated in the Intel manual, that can't be the case. It would be the same as saying that "on some CPUs the opcode INC sometimes clears the CF flag", which is not the case.
I may be wrong in this claim, and if so, the attached test will fail on some CPUs (yours?).
It is stated in Intel's manual. Maybe you're using some portable (shortened) text version of it, but in the full version the other flags are stated to be "undefined".
Interestingly enough, it seems that your AMD doesn't trash the other flags.
(results truncated since too long)
cmp 00000000_00000000_00000000_00000000 , 00000000_00000000_00000000_00000001
AX:DX 0000EB94 was: NO NS NZ NC should be: NO SF NZ CY
cmp 00000000_00000000_00000000_00000000 , 00000000_00000000_00000001_FFFFFFFF
AX:DX 0000EB94 was: NO NS NZ NC should be: NO SF NZ CY
AX:DX 0000EB94 was: NO NS NZ NC should be: NO SF NZ NC
cmp C0000001_00000000_00000000_00000000 , 40000001_00000000_00000000_00000000
AX:DX 0000EB94 was: NO NS NZ NC should be: NO SF NZ NC
1365 Failures
Press any key to continue ...
Can anyone here with an AMD CPU or an Intel CPU run the test attached in the post above?
(Maybe we found the fastest "CPUID" functionality for the IsIntelOrAMD routine :biggrin:)
with nidud's Cmp128Eval program, i get 1365 failures
here is a little test program....
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
All Flags Set: OV NG ZR AC PE CY
BSF 1: NV PL NZ NA PE NC
BSR 1: NV PL NZ NA PE NC
All Flags Cleared: NV PL NZ NA PO NC
BSF 1: NV PL NZ NA PE NC
BSR 1: NV PL NZ NA PE NC
judging from the parity flag, it looks like it explicitly sets some of the flags, other than the ZF
Yes, I have the same results
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
All Flags Set: OV NG ZR AC PE CY
BSF 1: NV PL NZ NA PE NC
BSR 1: NV PL NZ NA PE NC
All Flags Cleared: NV PL NZ NA PO NC
BSF 1: NV PL NZ NA PE NC
BSR 1: NV PL NZ NA PE NC
Press any key to continue ...
Quote from: dedndave on August 27, 2013, 09:00:40 PM
judging from the parity flag, it looks like it explicitly sets some of the flags, other than the ZF
Quote from: Antariy on August 27, 2013, 12:32:27 PM
for my CPU (and maybe (!) for every Intel), they are all (except ZF) zeroed
They are all zeroed, and parity seems to be set properly for the result.
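The parity observation fits the usual rule that PF reflects the low byte of the result. A small Python sketch with made-up helpers, assuming the CPU computes PF from the BSF/BSR result the normal way (the manual leaves it undefined, so this only describes the two CPUs shown above):

```python
def bsf(x):
    """Index of the lowest set bit of a non-zero value."""
    return (x & -x).bit_length() - 1

def bsr(x):
    """Index of the highest set bit of a non-zero value."""
    return x.bit_length() - 1

def parity_even(value):
    """PF rule: set (PE) when the low byte has an even number of 1 bits."""
    return bin(value & 0xFF).count("1") % 2 == 0

for name, result in (("BSF 1", bsf(1)), ("BSR 1", bsr(1))):
    print(name, result, "PE" if parity_even(result) else "PO")  # result 0, PE
```

Both results are 0, which has zero bits set, hence PE in all four BSF/BSR lines regardless of the flags before.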
Hi,
Quote from: Antariy on August 27, 2013, 01:20:42 PM
Quote from: FORTRANS on August 27, 2013, 08:48:04 AM
Anyone interested in it?
Yes, Steve, of course! I'm interested :t
Okay, here it is. 16-bit, but would be easy to convert to 32-bits.
; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Compare two large numbers. Bigger than can fit into a register to be
; compared with the CMP instruction. This Algorithm is based (loosely)
; on a discussion between deadndave and jj2007 of the MASM Forum. With
; commentary from others. See Comparing 128-bit numbers aka OWORDs, in
; The Laboratory. Note, that the source and destination are subtracted
; differently between CMPS and CMP. And that does not matter here as I
; only test for equality, where order doesn't matter. The final result
; is from a CMP.
; SRN, 22/25 August 2013.
; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; INPUT: (E)SI pointing to an OWORD number.
; (E)DI pointing to an OWORD number.
; OUTPUT: Flags set from comparison.
; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CMPSVal PROC
PUSH SI ; Dave is using these as counters, so preserve.
PUSH DI
ADD SI,15 ; Point to last (high) byte of OWORD.
ADD DI,15
MOV AH,[DI] ; Put OWORD high bytes into AH and DH.
MOV DH,[SI]
MOV CX,16
STD ; Go from high to low order bytes.
REPE CMPSB ; Do the comparison.
CMP CX,15 ; Fixed it. Almost.
JNE CV_1
REPE CMPSB
CV_1:
CLD
MOV AL,[DI+1] ; Put lower order byte into AL and DL.
MOV DL,[SI+1]
CMP AX,DX ; Return flags.
POP DI
POP SI
RET
CMPSVal ENDP
Enjoy,
Steve N.
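For readers who want to follow the routine step by step, here is a register-level model of it in Python. It is only a sketch of my reading of the listing: the helper names are made up, memory is modelled as two 16-byte little-endian arrays, and segment details are ignored. Per the register loads, AX is built from the (E)DI operand and DX from the (E)SI operand, so the returned flags are those of comparing the DI number with the SI number.

```python
def cmp_flags(a, b, bits):
    """(CF, ZF, SF, OF) of an x86 CMP a, b on unsigned `bits`-wide values."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    diff = (a - b) & mask
    return (a < b, diff == 0, bool(diff & sign),
            bool((a ^ b) & (a ^ diff) & sign))

def repe_cmpsb_down(m_si, m_di, si, di, cx):
    """REPE CMPSB with DF=1: compare bytes downward while they match and CX != 0."""
    while cx != 0:
        equal = m_si[si] == m_di[di]
        si -= 1
        di -= 1
        cx -= 1
        if not equal:
            break
    return si, di, cx

def cmpsval(src, dst):
    """src is the OWORD at (E)SI, dst the OWORD at (E)DI, as unsigned integers."""
    m_si = src.to_bytes(16, "little")
    m_di = dst.to_bytes(16, "little")
    si = di = 15                         # ADD SI,15 / ADD DI,15
    ah, dh = m_di[di], m_si[si]          # high bytes
    si, di, cx = repe_cmpsb_down(m_si, m_di, si, di, 16)
    if cx == 15:                         # mismatch in the top byte: look for the next one
        si, di, cx = repe_cmpsb_down(m_si, m_di, si, di, cx)
    al, dl = m_di[di + 1], m_si[si + 1]  # last bytes compared
    return cmp_flags((ah << 8) | al, (dh << 8) | dl, 16)

print(cmpsval(1, 0))  # (True, False, True, False), same as a 128-bit CMP 0, 1
```

The second scan matters because the borrow into the top byte, and with it SF and OF, depends on the first differing byte below the top one.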
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
All Flags Set: OV NG ZR AC PE CY
BSF 1: OV NG NZ AC PE CY
BSR 1: OV NG NZ AC PE CY
All Flags Cleared: NV PL NZ NA PO NC
BSF 1: NV PL NZ NA PO NC
BSR 1: NV PL NZ NA PO NC
Press any key to continue ...
No, it will not work as an IsIntelOrAMD check :biggrin:
Thank you, Marinus :t
Hi Steve! Thank you :t
Quote from: FORTRANS on August 27, 2013, 09:47:36 PM
Okay, here it is. 16-bit, but would be easy to convert to 32-bits.
Here is the algo that uses this idea, too :t
It uses the SSE2 instruction PCMPEQB to find non-matching bytes, instead of CMPSB, but the other logic is the same.
Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
movups xmm0,[ow0]
movups xmm1,[ow1]
pcmpeqb xmm0,xmm1
pmovmskb ecx,xmm0
xor ecx,0FFFFh
jz @l2
and ecx,7FFFh
bsr ecx,ecx
mov ah,byte ptr [ow0+15]
mov dh,byte ptr [ow1+15]
mov al,byte ptr [ow0+ecx]
mov dl,byte ptr [ow1+ecx]
cmp ax,dx
@l2:
ENDM
(this version is faster than earlier included in the testbed)
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
720551 cycles [x][x][x] - Cmp128Nidud
760062 cycles [x][x][x] - Cmp128NidudSSE
1047587 cycles [x][x][x] - Cmp128Dave
2544535 cycles [x][x][x] - Cmp128Dave2
1452129 cycles [x][x][x] - Cmp128JJAlexSSE_1
1704401 cycles [x][x][x] - Cmp128JJAlexSSE_2
1711250 cycles [x][x][x] - Cmp128JJAlexSSE_3
832572 cycles [x][x][ ] - Cmp128Alex
1228790 cycles [x][x][x] - Cmp128Alex_2
1249508 cycles [x][x][x] - Cmp128Alex_3
1849514 cycles [x][x][ ] - Cmp128JJSSE
959424 cycles [x][x][ ] - AxCMP128bitProc3
1008975 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
652808 cycles [x][ ][ ] - Cmp128DaveU
649697 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
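The mask logic of the Cmp128JJAlexSSE_1 macro can be modelled in portable C. This is only a sketch: `equal_mask` stands in for PCMPEQB plus PMOVMSKB and `highest_bit` for BSR, both written as plain loops, and all function names are invented for this note. One assumption is marked in the code: when only byte 15 differs, the masked value is zero, and the sketch uses index 0 in that case.

```c
#include <assert.h>
#include <stdint.h>

/* Model of PCMPEQB + PMOVMSKB: bit i is set when byte i of a and b match. */
static unsigned equal_mask(const uint8_t a[16], const uint8_t b[16])
{
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        if (a[i] == b[i])
            mask |= 1u << i;
    return mask;
}

/* Model of BSR: index of the highest set bit; v must be non-zero. */
static int highest_bit(unsigned v)
{
    int idx = 0;
    assert(v != 0);
    while (v >>= 1)
        idx++;
    return idx;
}

/* Returns -1, 0 or 1; is_signed selects JL/JG-style ordering. */
static int cmp128_mask(const uint8_t a[16], const uint8_t b[16], int is_signed)
{
    unsigned diff = equal_mask(a, b) ^ 0xFFFFu;   /* xor ecx,0FFFFh */
    if (diff == 0)
        return 0;                                 /* jz @l2: all bytes equal */
    diff &= 0x7FFFu;                              /* and ecx,7FFFh */

    /* Assumption: index 0 when only byte 15 differs. The macro executes BSR
       on a zero source here and uses whatever ecx holds afterwards. */
    int idx = diff ? highest_bit(diff) : 0;

    int ka = (a[15] << 8) | a[idx];               /* mov ah / mov al */
    int kb = (b[15] << 8) | b[idx];               /* mov dh / mov dl */
    if (is_signed) {
        if (ka & 0x8000) ka -= 0x10000;
        if (kb & 0x8000) kb -= 0x10000;
    }
    return (ka > kb) - (ka < kb);                 /* cmp ax,dx */
}
```

The explicit zero check before `highest_bit` is the C counterpart of the BSR-with-zero-source case, which is exactly where the flag and destination behaviour discussed in this thread differs between CPUs.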
Quote from: Antariy on August 27, 2013, 10:16:33 PM
Here is the algo that uses this idea, too :t
It uses SSE2 instruction PCMPEQB to find non-matching bytes, instead of CMPSB, but other logic is the same.
Hi Alex,
Interesting. Nice to see the algorithm reworked.
Thanks,
Steve N.
Quote from: FORTRANS on August 27, 2013, 10:25:40 PM
Interesting. Nice to see the algorithm reworked.
Yes, it is always interesting to see different implementations of an idea :biggrin:
Thank you for the test, Steve! :t
Hi,
First some more results.
pre-P4 (SSE1)
All Flags Set: OV NG ZR AC PE CY
BSF 1: OV NG NZ AC PE CY
BSR 1: OV NG NZ AC PE CY
All Flags Cleared: NV PL NZ NA PO NC
BSF 1: NV PL NZ NA PO NC
BSR 1: NV PL NZ NA PO NC
Press any key to continue ...
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
All Flags Set: OV NG ZR AC PE CY
BSF 1: OV NG NZ AC PE CY
BSR 1: OV NG NZ AC PE CY
All Flags Cleared: NV PL NZ NA PO NC
BSF 1: NV PL NZ NA PO NC
BSR 1: NV PL NZ NA PO NC
Press any key to continue ...
pre-P4
All Flags Set: OV NG ZR AC PE CY
BSF 1: NV NG NZ AC PE NC
BSR 1: OV NG NZ AC PE NC
All Flags Cleared: NV PL NZ NA PO NC
BSF 1: NV NG NZ AC PE NC
BSR 1: OV NG NZ AC PE NC
Press any key to continue ...
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
All Flags Set: OV NG ZR AC PE CY
BSF 1: OV NG NZ AC PE CY
BSR 1: OV NG NZ AC PE CY
All Flags Cleared: NV PL NZ NA PO NC
BSF 1: NV PL NZ NA PO NC
BSR 1: NV PL NZ NA PO NC
Press any key to continue ...
Second, I tried to visualize the problems with setting the flags
correctly. So I wrote a program to plot the flags for a normal
byte comparison and then for a truncated byte representing a
partial data set. It did not help me in any particular way. But
here it is anyway. Mode 12H graphics.
Regards,
Steve N.
Quote from: nidud on August 28, 2013, 01:36:24 AM
I like the "Intel 8086 Family Architecture" document because it includes timings for the opcodes, and I assumed backward compatibility across all instruction set architectures based on the Intel 8086 CPU.
Quote
Many additions and extensions have been added to the x86 instruction set over the years, almost consistently with full backward compatibility
It's not your error :t (BTW, the information in Intel's 80386 reference is the same as in your reference.)
As for clocks: after the PMMX, and especially after the P6 family was released, the old clock counts may be very outdated, all the more so for more or less modern CPUs (there the timings are totally unpredictable, not only between manufacturers but even between different models of one microarchitecture).
Quote from: FORTRANS on August 28, 2013, 04:13:24 AM
Hi,
First some more results.
Steve, is the third result for your PMMX?
The program looks very representative (especially the screen with all the graphs combined and colored), but I'm not sure I understand how to read the graphs :icon_redface: Can you help?
Quote from: Antariy on August 28, 2013, 09:19:12 AM
Quote from: FORTRANS on August 28, 2013, 04:13:24 AM
Hi,
First some more results.
Steve, is the third result for your PMMX?
Yes, the P-MMX with Windows 98. A pity that it
does not follow Intel's rules.
Quote
The program looks very representative (especially the screen with all the graphs combined and colored), but I'm not sure I understand how to read the graphs :icon_redface: Can you help?
Maybe. I was trying to visualize what information from
the high nybble / byte / double, in a byte / word / OWORD
would tell me about the complete result. So I took the row
and column counters, both bytes, and plotted the results of
a compare for each of the four flags we were looking at.
MOV AH,[RowCount]
MOV AL,[ColCount]
CMP AL,AH
MOV [ucPixel],0
MOV BX,[SaveBX]
MOV SI,[TestTable+BX]
PUSHF
POP DI
TEST SI,DI
JZ Start5
MOV [ucPixel],15
Start5:
CALL SetPixel10
If the flag is set, the pixel is white, otherwise black for the first
four plots or graphs. I then do the same for a truncated case.
MOV AH,[RowCount]
MOV AL,[ColCount]
AND AH,0F0H
AND AL,0F0H
CMP AL,AH
MOV [ucPixel],0
MOV BX,[SaveBX]
MOV SI,[TestTable+BX]
PUSHF
POP DI
TEST SI,DI
JZ Start8
MOV [ucPixel],15
Start8:
CALL SetPixel10
I was hoping that it would show a short-cut or some such.
All it showed was that for the majority of cases there is no such
short-cut that I could possibly see. You have to do a bit better.
(I think, at least that is what I took away from this.)
For the color plot, I am using the Mode 12H planar graphics
mode. Sixteen colors with four planes. So I assigned a flag to
its own plane. So the colors show what flags are set by the
comparison. I probably should change the palette to make the
results be clearer. But I saw what I needed to, and so did not
bother. I could update that if you think it would help. (And you
see a good and proper mapping of the four flags to three primary
colors.)
The only notable fact that I saw was if the Zero Flag is set, no
others being considered are set as well. So you can take an early
exit from the algorithm if the CMPS result is zero. I did not bother
with mine.* (Though I, or someone, should time both versions.)
Given the other colors, any combination of the other three flags
is possible.** And again, no obvious simplification was apparent to
me.
I hope that explains most if not all of your question.
Regards,
Steve N.
Edit:
* Zero is set, on average, once out of 256 times for the byte
comparison. And it just gets worse as the size increases.
SRN
Edit:
** Said that wrong. The colors show which flags can be set
together, as shown by the labels. Not all the possibilities occur.
SRN
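The observations in this post can be checked without the Mode 12H plots. The C sketch below recomputes the four flags of an 8-bit CMP for every row/column pair and tests three things: that truncating both operands to their high nibble changes the flags for at least some pairs, that a set Zero flag excludes the other three, and that Zero is set once in 256 pairs. The flag bit values and function names are chosen for this sketch and are not the EFLAGS bit positions or anything from PlotFlag.

```c
#include <stdint.h>

enum { CF = 1, ZF = 2, SF = 4, OF = 8 };   /* sketch-local bit values */

/* Flags that CMP AL,AH (AL - AH) would produce for two bytes. */
static unsigned cmp8_flags(uint8_t al, uint8_t ah)
{
    uint8_t r = (uint8_t)(al - ah);
    unsigned f = 0;
    if (al < ah)                        f |= CF;  /* unsigned borrow */
    if (r == 0)                         f |= ZF;
    if (r & 0x80)                       f |= SF;
    if (((al ^ ah) & (al ^ r)) & 0x80)  f |= OF;  /* signed overflow */
    return f;
}

/* Pairs whose flags change when only the high nibbles are compared. */
static unsigned count_truncation_mismatches(void)
{
    unsigned n = 0;
    for (unsigned row = 0; row < 256; row++)
        for (unsigned col = 0; col < 256; col++)
            if (cmp8_flags((uint8_t)col, (uint8_t)row) !=
                cmp8_flags((uint8_t)(col & 0xF0), (uint8_t)(row & 0xF0)))
                n++;
    return n;
}

/* Number of pairs that set the Zero flag. */
static unsigned count_zero_pairs(void)
{
    unsigned n = 0;
    for (unsigned row = 0; row < 256; row++)
        for (unsigned col = 0; col < 256; col++)
            if (cmp8_flags((uint8_t)col, (uint8_t)row) & ZF)
                n++;
    return n;
}

/* Returns 1 if ZF never appears together with CF, SF or OF. */
static int zero_excludes_others(void)
{
    for (unsigned row = 0; row < 256; row++)
        for (unsigned col = 0; col < 256; col++) {
            unsigned f = cmp8_flags((uint8_t)col, (uint8_t)row);
            if ((f & ZF) && f != ZF)
                return 0;
        }
    return 1;
}
```

`count_zero_pairs()` comes out at 256 of the 65536 pairs, which is the "once out of 256 times" figure from the first edit above.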
that is really strange
if it isn't going to execute the instruction correctly, at least it could BSOD or something :biggrin:
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2610027 cycles [x][x][x] - Cmp128Nidud
2886072 cycles [x][x][x] - Cmp128NidudSSE
2616611 cycles [x][x][x] - Cmp128Dave
3812015 cycles [x][x][x] - Cmp128Dave2
1698629 cycles [x][x][x] - Cmp128JJAlexSSE_1
1574406 cycles [x][x][x] - Cmp128JJAlexSSE_2
1786771 cycles [x][x][x] - Cmp128JJAlexSSE_3
953366 cycles [x][x][ ] - Cmp128Alex
1705162 cycles [x][x][x] - Cmp128Alex_2
1775305 cycles [x][x][x] - Cmp128Alex_3
1896256 cycles [x][x][ ] - Cmp128JJSSE
1310303 cycles [x][x][ ] - AxCMP128bitProc3
1255525 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
697520 cycles [x][ ][ ] - Cmp128DaveU
717197 cycles [x][ ][ ] - Cmp128NidudU
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
987 kCycles [x][x][x] - Cmp128Nidud
1066 kCycles [x][x][x] - Cmp128NidudSSE
956 kCycles [x][x][x] - Cmp128Dave
2577 kCycles [x][x][x] - Cmp128Dave2
820 kCycles [x][x][x] - Cmp128JJAlexSSE_1 <<<<<<<<<<
929 kCycles [x][x][x] - Cmp128JJAlexSSE_2
950 kCycles [x][x][x] - Cmp128JJAlexSSE_3
688 kCycles [x][x][ ] - Cmp128Alex
1066 kCycles [x][x][x] - Cmp128Alex_2
1112 kCycles [x][x][x] - Cmp128Alex_3
1109 kCycles [x][x][ ] - Cmp128JJSSE
862 kCycles [x][x][ ] - AxCMP128bitProc3
870 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
598 kCycles [x][ ][ ] - Cmp128DaveU
586 kCycles [x][ ][ ] - Cmp128NidudU
Hi,
I made the PlotFlag program with an improved color scheme.
And I fixed an error in the labeling of the combined flags. Oops.
I updated Reply #260 with the fixed program. The Overflow,
Sign, and Carry flags are assigned to blue, green, and red so
that their combinations should make sense.
Regards,
Steve N.
Hi Steve! Thanks for your explanation :biggrin: Yes, now it explains things. Earlier I did not actually get how the graphs correspond to the flags; after your more detailed explanation it became clear.
Quote from: nidud on August 29, 2013, 12:33:05 AM
ok, the BSF thing didn't work, so it's one step back :P
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2799747 cycles [x][x][x] - Cmp128Nidud
3028917 cycles [x][x][x] - Cmp128NidudSSE
2740311 cycles [x][x][x] - Cmp128Dave
4045304 cycles [x][x][x] - Cmp128Dave2
1664740 cycles [x][x][x] - Cmp128JJAlexSSE_1
1633226 cycles [x][x][x] - Cmp128JJAlexSSE_2
1878213 cycles [x][x][x] - Cmp128JJAlexSSE_3
973406 cycles [x][x][ ] - Cmp128Alex
1835600 cycles [x][x][x] - Cmp128Alex_2
1878425 cycles [x][x][x] - Cmp128Alex_3
1968746 cycles [x][x][ ] - Cmp128JJSSE
1374576 cycles [x][x][ ] - AxCMP128bitProc3
1290053 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
739767 cycles [x][ ][ ] - Cmp128DaveU
744369 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
BTW, in my testbed I replaced a couple of macros but did not post them (the results above are not for the updated procs, though):
Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
movups xmm0,[ow0]
movups xmm1,[ow1]
pcmpeqb xmm0,xmm1
pmovmskb ecx,xmm0
xor ecx,0FFFFh
jz @l2
and ecx,7FFFh
bsr ecx,ecx
mov ah,byte ptr [ow0+15]
mov dh,byte ptr [ow1+15]
mov al,byte ptr [ow0+ecx]
mov dl,byte ptr [ow1+ecx]
cmp ax,dx
@l2:
ENDM
and
Cmp128JJAlexSSE_2 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
movups xmm0,[ow0]
movups xmm1,[ow1]
pcmpeqb xmm0,xmm1
pmovmskb ecx,xmm0
xor ecx,0FFFFh
jz @l2
and ecx,7FFFh
bsr ecx,ecx
movzx eax,word ptr [ow0+14] ; bytes 14..15: byte 15 lands in ah
movzx edx,word ptr [ow1+14] ; bytes 14..15: byte 15 lands in dh
mov al,byte ptr [ow0+ecx]
mov dl,byte ptr [ow1+ecx]
cmp ax,dx
@l2:
ENDM
Quote from: nidud on August 29, 2013, 04:44:55 AM
Hmm, maybe the CPU's which "cheat" with the BSF/BSR flags are faster
Alex/JJ use BSR
My test: "no cheat"
701224 cycles [x][x][x] - Cmp128NidudSSE
1068347 cycles [x][x][x] - Cmp128JJAlexSSE_1
Siekmanski: "no cheat"
730067 cycles [x][x][x] - Cmp128NidudSSE
784587 cycles [x][x][x] - Cmp128JJAlexSSE_1
Alex: "cheat"
2441613 cycles [x][x][x] - Cmp128NidudSSE
1724226 cycles [x][x][x] - Cmp128JJAlexSSE_1
Dave: "cheat"
2886072 cycles [x][x][x] - Cmp128NidudSSE
1698629 cycles [x][x][x] - Cmp128JJAlexSSE_1
Well, actually it should not be so, because, as we see, the desktop PIV models (my and Dave's Prescotts) do not just trash all the flags but set them logically (zero all the flags that are "unused after the instruction", set the parity flag according to the result, though it need not even bother with that, and set the zero flag as defined) instead of leaving them unchanged. So it should be even slower on our CPUs :biggrin:
But the same PIV models are also much slower than more modern CPUs on some other instructions, and those are the real reason: SSE instructions, SBB, and LAHF/SAHF.
Cmp128JJAlexSSE_1 uses code in its GPR part, without SUBs/SBBs, that is far faster on the PIV.
The fully-GPR Cmp128Alex_2 is faster than Cmp128Nidud for the same reason.
On modern CPUs your GPR code "should be" faster than my GPR code, for instance. For the SSE code the timings in this thread vary a lot; probably my SSE code "should be" faster on very modern Intel CPUs.
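For reference, this is what a SUB/SBB chain over four 32-bit limbs computes, written as a C sketch. It is not nidud's or Dave's actual code, which is not shown here; the struct and function names are hypothetical. Note one deliberate difference from the hardware: after a real SBB chain ZF reflects only the last limb, while the sketch reports the Zero flag of the whole 128-bit difference.

```c
#include <stdint.h>

typedef struct { int cf, zf, sf, of; } flags128;   /* hypothetical helper type */

/* Subtract b from a limb by limb, least significant limb first, the way
   SUB followed by three SBBs would, and derive the flags of the result. */
static flags128 cmp128_sbb(const uint32_t a[4], const uint32_t b[4])
{
    flags128 f = { 0, 1, 0, 0 };
    uint32_t borrow = 0, r = 0;

    for (int i = 0; i < 4; i++) {
        uint64_t d = (uint64_t)a[i] - b[i] - borrow;
        r = (uint32_t)d;
        borrow = (uint32_t)(d >> 63);   /* 1 when this limb wrapped around */
        if (r != 0)
            f.zf = 0;                   /* whole-value ZF, unlike real SBB */
    }
    f.cf = (int)borrow;                                   /* JB/JA use CF      */
    f.sf = (int)(r >> 31);                                /* sign of top limb  */
    f.of = (int)((((a[3] ^ b[3]) & (a[3] ^ r)) >> 31) & 1);
    return f;                                             /* JL/JG: sf != of   */
}
```

With these flags, "below" is `cf == 1` and "less" is `sf != of`, which are the conditions the [JB/JA] and [JL/JG] columns of the test tables check.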
Jochen's latest attachment...
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2557 kCycles [x][x][x] - Cmp128Nidud
2815 kCycles [x][x][x] - Cmp128NidudSSE
2517 kCycles [x][x][x] - Cmp128Dave
3715 kCycles [x][x][x] - Cmp128Dave2
1557 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1521 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1753 kCycles [x][x][x] - Cmp128JJAlexSSE_3
914 kCycles [x][x][ ] - Cmp128Alex
1698 kCycles [x][x][x] - Cmp128Alex_2
1726 kCycles [x][x][x] - Cmp128Alex_3
1850 kCycles [x][x][ ] - Cmp128JJSSE
1298 kCycles [x][x][ ] - AxCMP128bitProc3
1228 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
696 kCycles [x][ ][ ] - Cmp128DaveU
690 kCycles [x][ ][ ] - Cmp128NidudU
------------------------------------------------------
2576 kCycles [x][x][x] - Cmp128Nidud
2804 kCycles [x][x][x] - Cmp128NidudSSE
2543 kCycles [x][x][x] - Cmp128Dave
3723 kCycles [x][x][x] - Cmp128Dave2
1557 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1518 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1853 kCycles [x][x][x] - Cmp128JJAlexSSE_3
918 kCycles [x][x][ ] - Cmp128Alex
1697 kCycles [x][x][x] - Cmp128Alex_2
1733 kCycles [x][x][x] - Cmp128Alex_3
1848 kCycles [x][x][ ] - Cmp128JJSSE
1280 kCycles [x][x][ ] - AxCMP128bitProc3
1211 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
677 kCycles [x][ ][ ] - Cmp128DaveU
707 kCycles [x][ ][ ] - Cmp128NidudU
Hi,
From Reply # 266.
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1012 kCycles [x][x][x] - Cmp128Nidud
1153 kCycles [x][x][x] - Cmp128NidudSSE
1049 kCycles [x][x][x] - Cmp128Dave
2520 kCycles [x][x][x] - Cmp128Dave2
1441 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1682 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1686 kCycles [x][x][x] - Cmp128JJAlexSSE_3
824 kCycles [x][x][ ] - Cmp128Alex
1200 kCycles [x][x][x] - Cmp128Alex_2
1261 kCycles [x][x][x] - Cmp128Alex_3
1823 kCycles [x][x][ ] - Cmp128JJSSE
951 kCycles [x][x][ ] - AxCMP128bitProc3
987 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
640 kCycles [x][ ][ ] - Cmp128DaveU
633 kCycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
986 kCycles [x][x][x] - Cmp128Nidud
1134 kCycles [x][x][x] - Cmp128NidudSSE
1024 kCycles [x][x][x] - Cmp128Dave
2491 kCycles [x][x][x] - Cmp128Dave2
1423 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1668 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1674 kCycles [x][x][x] - Cmp128JJAlexSSE_3
817 kCycles [x][x][ ] - Cmp128Alex
1190 kCycles [x][x][x] - Cmp128Alex_2
1249 kCycles [x][x][x] - Cmp128Alex_3
1813 kCycles [x][x][ ] - Cmp128JJSSE
941 kCycles [x][x][ ] - AxCMP128bitProc3
986 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
639 kCycles [x][ ][ ] - Cmp128DaveU
630 kCycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Regards,
Steve
Quote from: dedndave on August 30, 2013, 10:11:42 PM
Jochen's latest attachment...
The only change was a shr eax, 10 to make the timings more readable...
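A note on the units, as a small C sketch: shifting right by 10 divides by 1024, so a "kCycle" in these tables is 1024 cycles, not 1000. The function name is made up for illustration.

```c
#include <stdint.h>

/* Equivalent of "shr eax, 10": cycles in units of 1024, rounded down. */
static uint32_t to_kcycles(uint32_t cycles)
{
    return cycles >> 10;
}
```

For example, `to_kcycles(2610027)` gives 2548, where dividing by 1000 would give 2610.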
BTW, it would be nice if the test for the (x)(x)(x) could be integrated with the timings. At present, there is a hand-made static string only... where are the authors of the magic test?
CmpFlag.zip:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
------------------------------------------------------
672764 cycles for CmpLEA
465288 cycles for CmpADD
464995 cycles for CmpINC
212725 cycles for CmpBSF
2981445 cycles for CmpCLX
Quote
LEA will be faster than ADD on Dave's CPU
INC preserves CF, so this will be slower on Dave's CPU
BSF will be much faster on yours and Dave's CPU
today must be opposite day :P
there are certain things that P4's are just not good at
i like developing on a P4, though - if it's fast on my machine, it'll be fast on every one else's :lol:
not sure what the CmpCLX test is, but my CPU hates it
a good chance it is not doing what you want it to
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
------------------------------------------------------
574802 cycles for CmpLEA
601257 cycles for CmpADD
912774 cycles for CmpINC
474878 cycles for CmpBSF
27755044 cycles for CmpCLX
------------------------------------------------------
624710 cycles for CmpLEA
609602 cycles for CmpADD
915751 cycles for CmpINC
494000 cycles for CmpBSF
27825419 cycles for CmpCLX
------------------------------------------------------
589298 cycles for CmpLEA
591765 cycles for CmpADD
909765 cycles for CmpINC
468357 cycles for CmpBSF
27813103 cycles for CmpCLX
------------------------------------------------------
pre-P4 (SSE1)
------------------------------------------------------
690329 cycles for CmpLEA
703742 cycles for CmpADD
709925 cycles for CmpINC
215679 cycles for CmpBSF
2708852 cycles for CmpCLX
------------------------------------------------------
688496 cycles for CmpLEA
704680 cycles for CmpADD
707613 cycles for CmpINC
215581 cycles for CmpBSF
2709056 cycles for CmpCLX
------------------------------------------------------
688429 cycles for CmpLEA
704797 cycles for CmpADD
707767 cycles for CmpINC
215511 cycles for CmpBSF
2707745 cycles for CmpCLX
------------------------------------------------------
--- ok ---
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
------------------------------------------------------
712592 cycles for CmpLEA
712084 cycles for CmpADD
711109 cycles for CmpINC
207906 cycles for CmpBSF
2633382 cycles for CmpCLX
------------------------------------------------------
710075 cycles for CmpLEA
711340 cycles for CmpADD
712251 cycles for CmpINC
207880 cycles for CmpBSF
2633572 cycles for CmpCLX
------------------------------------------------------
711945 cycles for CmpLEA
712603 cycles for CmpADD
710777 cycles for CmpINC
209128 cycles for CmpBSF
2631655 cycles for CmpCLX
------------------------------------------------------
--- ok ---
pre-P4
------------------------------------------------------
1386583 cycles for CmpLEA
734379 cycles for CmpADD
733891 cycles for CmpINC
1341089 cycles for CmpBSF
1865749 cycles for CmpCLX
------------------------------------------------------
1386134 cycles for CmpLEA
735003 cycles for CmpADD
734641 cycles for CmpINC
1341132 cycles for CmpBSF
1867097 cycles for CmpCLX
------------------------------------------------------
1382758 cycles for CmpLEA
736389 cycles for CmpADD
734488 cycles for CmpINC
1341860 cycles for CmpBSF
1867206 cycles for CmpCLX
------------------------------------------------------
--- ok ---
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
------------------------------------------------------
713980 cycles for CmpLEA
716320 cycles for CmpADD
716456 cycles for CmpINC
214883 cycles for CmpBSF
3013235 cycles for CmpCLX
------------------------------------------------------
714376 cycles for CmpLEA
715769 cycles for CmpADD
716068 cycles for CmpINC
214918 cycles for CmpBSF
3013878 cycles for CmpCLX
------------------------------------------------------
714034 cycles for CmpLEA
716052 cycles for CmpADD
716264 cycles for CmpINC
214754 cycles for CmpBSF
3012735 cycles for CmpCLX
------------------------------------------------------
--- ok ---
Quote from: nidud on August 31, 2013, 06:07:49 AM
Quote
not sure what the CmpCLX test is
it manipulates the flags using STC/CLC/STD/CLD/CMC
ahhh.....
CLD and STD are slow as hell on P4's
and not all that fast on many other processors
they seem to be reasonable on your AMD
Quote from: nidud on August 31, 2013, 03:24:36 AM
Quote
Well, actually it should not be so, because, as we see, the desktop PIV models (my and Dave's Prescotts) do not just trash all the flags but set them logically (zero all the flags that are "unused after the instruction", set the parity flag according to the result, though it need not even bother with that, and set the zero flag as defined) instead of leaving them unchanged. So it should be even slower on our CPUs :biggrin:
Well, if that is correct the following test will prove your point
AMD Athlon(tm) II X2 245 Processor (SSE3)
------------------------------------------------------
383145 cycles for CmpLEA
382951 cycles for CmpADD
384098 cycles for CmpINC
384502 cycles for CmpBSF
378250 cycles for CmpCLX
------------------------------------------------------
383944 cycles for CmpLEA
387003 cycles for CmpADD
383393 cycles for CmpINC
383522 cycles for CmpBSF
378291 cycles for CmpCLX
------------------------------------------------------
385948 cycles for CmpLEA
384310 cycles for CmpADD
383979 cycles for CmpINC
384283 cycles for CmpBSF
378046 cycles for CmpCLX
The BSF test should then be faster on my CPU :P
Well, it proved it (on other machines, too)
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
------------------------------------------------------
490345 cycles for CmpLEA
524195 cycles for CmpADD
768829 cycles for CmpINC
397793 cycles for CmpBSF
28201301 cycles for CmpCLX
------------------------------------------------------
501654 cycles for CmpLEA
512675 cycles for CmpADD
758029 cycles for CmpINC
410060 cycles for CmpBSF
28179301 cycles for CmpCLX
------------------------------------------------------
489078 cycles for CmpLEA
521448 cycles for CmpADD
773169 cycles for CmpINC
404085 cycles for CmpBSF
28120852 cycles for CmpCLX
------------------------------------------------------
--- ok ---
Hi,
I have shortened the MasmBasic Qcmp (and Ocmp) macro a little bit - and get zero failures now :biggrin:
Please include in Cmp128Eval and the timings.
include oqCmp.asm
align 16
test_start
lea esi,ow_table
.repeat
lea edi,ow_table
.repeat
Qcmp [esi], [edi]
add edi,4
.until edi >= offset eo_table
add esi,4
.until esi >= offset eo_table
test_end "cycles (x)(x)(x) - MasmBasic Qcmp"
P.S.: Timings attached. I excluded one very slow algo and those which fail in the first two categories.
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
945 kCycles [x][x][x] - Cmp128Dave
911 kCycles [x][x][x] - Cmp128Nidud
1013 kCycles [x][x][x] - Cmp128NidudSSE
684 kCycles [x][x][ ] - Cmp128Alex
1127 kCycles [x][x][x] - MasmBasic Ocmp <<<<< OWORD
989 kCycles [x][x][x] - MasmBasic Qcmp
814 kCycles [x][x][x] - Cmp128JJAlexSSE_1
925 kCycles [x][x][x] - Cmp128JJAlexSSE_2
926 kCycles [x][x][x] - Cmp128JJAlexSSE_3
859 kCycles [x][x][ ] - AxCMP128bitProc3
868 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
And timings for an i5 - with the Qcmp and Alex1 three times each (the timings are not very stable on the i5, and the two algos are most interesting for me):
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
746 kCycles [x][x][x] - Cmp128Dave
600 kCycles [x][x][x] - Cmp128Nidud
714 kCycles [x][x][x] - Cmp128NidudSSE
494 kCycles [x][x][ ] - Cmp128Alex
429 kCycles [x][x][x] - MasmBasic Qcmp
388 kCycles [x][x][x] - MasmBasic Qcmp
429 kCycles [x][x][x] - MasmBasic Qcmp
428 kCycles [x][x][x] - Cmp128JJAlexSSE_1
401 kCycles [x][x][x] - Cmp128JJAlexSSE_1
428 kCycles [x][x][x] - Cmp128JJAlexSSE_1
437 kCycles [x][x][x] - Cmp128JJAlexSSE_2
427 kCycles [x][x][x] - Cmp128JJAlexSSE_3
530 kCycles [x][x][ ] - AxCMP128bitProc3
501 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Hi,
I posted a routine in Reply # 252. This quote is from Reply #262.
Quote from: FORTRANS on August 28, 2013, 10:26:34 PM
The only notable fact that I saw was if the Zero Flag is set, no
others being considered are set as well. So you can take an early
exit from the algorithm if the CMPS result is zero. I did not bother
with mine.* (Though I, or someone, should time both versions.)
Well, I timed it with and without the test for zero and an early
exit. The one with the extra test was slower. Tested with Dave's
112 test values.
Regards,
Steve N.
from Reply #280,
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
942 kCycles [x][x][x] - Cmp128Dave
891 kCycles [x][x][x] - Cmp128Nidud
1023 kCycles [x][x][x] - Cmp128NidudSSE
673 kCycles [x][x][ ] - Cmp128Alex
828 kCycles [x][x][x] - MasmBasic Qcmp
766 kCycles [x][x][x] - Cmp128JJAlexSSE_1
869 kCycles [x][x][x] - Cmp128JJAlexSSE_2
876 kCycles [x][x][x] - Cmp128JJAlexSSE_3
867 kCycles [x][x][ ] - AxCMP128bitProc3
895 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
You can substantially reduce the number of failures (from 1318 to zero) if you use Ocmp ("O" like "O sole mio") instead of Qcmp ;)
And yes, it's my fault because I erroneously used Qcmp in the timings. I was qonfused ::)
New version attached, with minor changes to Ocmp:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
945 kCycles [x][x][x] - Cmp128Dave
911 kCycles [x][x][x] - Cmp128Nidud
1013 kCycles [x][x][x] - Cmp128NidudSSE
684 kCycles [x][x][ ] - Cmp128Alex
1010 kCycles [x][x][x] - MasmBasic Ocmp
815 kCycles [x][x][x] - Cmp128JJAlexSSE_1
925 kCycles [x][x][x] - Cmp128JJAlexSSE_2
925 kCycles [x][x][x] - Cmp128JJAlexSSE_3
870 kCycles [x][x][ ] - AxCMP128bitProc3
867 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Here is my contribution (from reply 284)
AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1358 kCycles [x][x][x] - Cmp128Dave
1206 kCycles [x][x][x] - Cmp128Nidud
975 kCycles [x][x][x] - Cmp128NidudSSE
1171 kCycles [x][x][ ] - Cmp128Alex
1766 kCycles [x][x][x] - MasmBasic Ocmp
1424 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1040 kCycles [x][x][x] - Cmp128JJAlexSSE_2
721 kCycles [x][x][x] - Cmp128JJAlexSSE_3
535 kCycles [x][x][ ] - AxCMP128bitProc3
519 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
Dave AKA KRB
Jochen, did you time the version of a macro I posted couple pages above?
Here it is:
Cmp128JJAlexSSE_1 MACRO ow0:REQ, ow1:REQ
LOCAL @l1, @l2
movups xmm0,[ow0]
movups xmm1,[ow1]
pcmpeqb xmm0,xmm1
pmovmskb ecx,xmm0
xor ecx,0FFFFh
jz @l2
and ecx,7FFFh
bsr ecx,ecx
mov ah,byte ptr [ow0+15]
mov dh,byte ptr [ow1+15]
mov al,byte ptr [ow0+ecx]
mov dl,byte ptr [ow1+ecx]
cmp ax,dx
@l2:
ENDM
For me it is faster than the original "_1" macro. You can also try changing the loads to
movzx eax,word ptr [ow0+14]
movzx edx,word ptr [ow1+14]
but for me it is slower than the version above it.
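The two ways of filling the 16-bit key are equivalent in result and differ only in how the high byte gets there. A C sketch of both, with hypothetical function names: `key_bytes` models the two byte loads of the macro, `key_word` models a single little-endian 16-bit load of bytes 14 and 15 followed by overwriting the low byte.

```c
#include <stdint.h>

/* Two byte loads: mov ah,[ow+15] then mov al,[ow+idx]. */
static uint16_t key_bytes(const uint8_t ow[16], int idx)
{
    return (uint16_t)((ow[15] << 8) | ow[idx]);
}

/* One 16-bit load of bytes 14..15, then mov al,[ow+idx] replaces byte 14. */
static uint16_t key_word(const uint8_t ow[16], int idx)
{
    uint16_t w = (uint16_t)(ow[14] | (ow[15] << 8));   /* little-endian word */
    return (uint16_t)((w & 0xFF00u) | ow[idx]);
}
```

Both functions return the same key for every index, so any speed difference between the variants comes from how the CPU handles the word load and the partial-register writes, not from the logic.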
Timings for it (this still has your old macro; my testbed is a bit outdated)
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2189320 cycles [x][x][x] - Cmp128Nidud
2295837 cycles [x][x][x] - Cmp128NidudSSE
2773387 cycles [x][x][x] - Cmp128Dave
4033478 cycles [x][x][x] - Cmp128Dave2
1597228 cycles [x][x][x] - Cmp128JJAlexSSE_1
1622741 cycles [x][x][x] - Cmp128JJAlexSSE_2
1905774 cycles [x][x][x] - Cmp128JJAlexSSE_3
993931 cycles [x][x][ ] - Cmp128Alex
1859714 cycles [x][x][x] - Cmp128Alex_2
1901902 cycles [x][x][x] - Cmp128Alex_3
1994856 cycles [x][x][ ] - Cmp128JJSSE
1346269 cycles [x][x][ ] - AxCMP128bitProc3
1311894 cycles [x][x][ ] - AxCMP128bitProc3c (cmov)
741050 cycles [x][ ][ ] - Cmp128DaveU
770599 cycles [x][ ][ ] - Cmp128NidudU
--- ok ---
Timings for Cmp128_timingsOQ
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2696 kCycles [x][x][x] - Cmp128Dave
2713 kCycles [x][x][x] - Cmp128Nidud
3125 kCycles [x][x][x] - Cmp128NidudSSE
945 kCycles [x][x][ ] - Cmp128Alex
1932 kCycles [x][x][x] - MasmBasic Ocmp
1485 kCycles [x][x][x] - MasmBasic Qcmp
1639 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1604 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1595 kCycles [x][x][x] - Cmp128JJAlexSSE_3
1360 kCycles [x][x][ ] - AxCMP128bitProc3
1274 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
Timings for Cmp128_timingsO
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2856 kCycles [x][x][x] - Cmp128Dave
2752 kCycles [x][x][x] - Cmp128Nidud
3128 kCycles [x][x][x] - Cmp128NidudSSE
956 kCycles [x][x][ ] - Cmp128Alex
1928 kCycles [x][x][x] - MasmBasic Ocmp
1641 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1601 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1592 kCycles [x][x][x] - Cmp128JJAlexSSE_3
1361 kCycles [x][x][ ] - AxCMP128bitProc3
1272 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
Hi Dave :t
Quote from: KeepingRealBusy on September 02, 2013, 08:52:39 AM
Here is my contribution (from reply 284)
Incredible difference between the algos that use full-sized and half-sized regs. Your AMD seems to work very well with "partial" regs, unlike the Intels, which are bad with them.
Cmp128JJAlexSSE_3 differs from Cmp128JJAlexSSE_1 only in this:
xor cx,0FFFFh
jz @l2
and cx,7FFFh
bsr cx,cx
Quote from: Antariy on September 02, 2013, 12:35:42 PM
Jochen, did you time the version of a macro I posted couple pages above?
Here it comes:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
945 kCycles [x][x][x] - Cmp128Dave
916 kCycles [x][x][x] - Cmp128Nidud
1017 kCycles [x][x][x] - Cmp128NidudSSE
689 kCycles [x][x][ ] - Cmp128Alex
1013 kCycles [x][x][x] - MasmBasic Ocmp
815 kCycles [x][x][x] - Cmp128JJAlexSSE_1
854 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
925 kCycles [x][x][x] - Cmp128JJAlexSSE_2
926 kCycles [x][x][x] - Cmp128JJAlexSSE_3
858 kCycles [x][x][ ] - AxCMP128bitProc3
870 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
653 kCycles [x][x][x] - Cmp128Dave
608 kCycles [x][x][x] - Cmp128Nidud
806 kCycles [x][x][x] - Cmp128NidudSSE
434 kCycles [x][x][ ] - Cmp128Alex
386 kCycles [x][x][x] - MasmBasic Ocmp
315 kCycles [x][x][x] - Cmp128JJAlexSSE_1
366 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
355 kCycles [x][x][x] - Cmp128JJAlexSSE_2
316 kCycles [x][x][x] - Cmp128JJAlexSSE_3
455 kCycles [x][x][ ] - AxCMP128bitProc3
439 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Quote from: nidud on September 02, 2013, 08:34:08 AM
well, it's difficult to read your "code", but I think...
You should learn Masm, it's a fascinating language :t
(and I'm afraid your interpretation is not correct - you might launch Olly to see what it really does).
jj's latest
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
695 kCycles [x][x][x] - Cmp128Dave
564 kCycles [x][x][x] - Cmp128Nidud
652 kCycles [x][x][x] - Cmp128NidudSSE
396 kCycles [x][x][ ] - Cmp128Alex
316 kCycles [x][x][x] - MasmBasic Ocmp
268 kCycles [x][x][x] - Cmp128JJAlexSSE_1
321 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
312 kCycles [x][x][x] - Cmp128JJAlexSSE_2
271 kCycles [x][x][x] - Cmp128JJAlexSSE_3
403 kCycles [x][x][ ] - AxCMP128bitProc3
378 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
748 kCycles [x][x][x] - Cmp128Dave
615 kCycles [x][x][x] - Cmp128Nidud
714 kCycles [x][x][x] - Cmp128NidudSSE
433 kCycles [x][x][ ] - Cmp128Alex
348 kCycles [x][x][x] - MasmBasic Ocmp
296 kCycles [x][x][x] - Cmp128JJAlexSSE_1
353 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
344 kCycles [x][x][x] - Cmp128JJAlexSSE_2
298 kCycles [x][x][x] - Cmp128JJAlexSSE_3
442 kCycles [x][x][ ] - AxCMP128bitProc3
416 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
your code is hard to read, Jochen - lol
i dread if i have to add a routine :P
Quote from: dedndave on September 02, 2013, 07:01:01 PM
your code is hard to read, Jochen - lol
Come on, it's ultra simple...
pmovmskb edx, xt2 ; show in dx where xt0 differs to xt1
if MbcmpO eq QWORD
not dl
and edx, 07fh
else ; don't duplicate MSB
if 1
xor edx, -1
and edx, 07fffh
else
not dx
and dh, 07fh
endif
endif
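For readers who don't read MASM fluently: the two branches of the inner `if 1 ... else ... endif` should produce the same 15-bit mask of differing bytes. The first works on the full 32-bit register, the second on the 16-bit and 8-bit partial registers. Below is a rough C sketch of that equivalence, not the macro itself. It assumes the input is a `pmovmskb` result, so only bits 0..15 can be set; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Branch 1: xor edx, -1 / and edx, 07fffh  (full 32-bit register). */
uint32_t invert_mask_full(uint32_t eq_mask)
{
    assert(eq_mask <= 0xFFFFu);              /* assumption: pmovmskb output, bits 0..15 only */
    uint32_t edx = eq_mask ^ 0xFFFFFFFFu;    /* 1 = bytes differ */
    return edx & 0x7FFFu;                    /* drop byte 15 and the upper half */
}

/* Branch 2: not dx / and dh, 07fh  (partial registers dx, dh). */
uint32_t invert_mask_partial(uint32_t eq_mask)
{
    assert(eq_mask <= 0xFFFFu);
    uint32_t upper = eq_mask & 0xFFFF0000u;  /* "not dx" leaves bits 16..31 untouched */
    uint16_t dx = (uint16_t)~eq_mask;        /* not dx */
    uint8_t dl = (uint8_t)(dx & 0xFFu);
    uint8_t dh = (uint8_t)((dx >> 8) & 0x7Fu); /* and dh, 07fh */
    return upper | ((uint32_t)dh << 8) | dl;
}
```

The result is the same either way; which form is faster depends on how the CPU copes with partial registers.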
Quote from: nidud on September 02, 2013, 08:34:08 AM
well, it's difficult to read your "code"
Quote from: nidud on September 02, 2013, 10:57:58 PM
I guess there is different views about writing code
Yes, certainly. But I would never call your code "code", or refer to you as a "coder" instead of a coder. It requires a certain level of arrogance to dismiss somebody else's code as "code".
Quote: Quote: I'm afraid your interpretation is not correct
How do you know?
Quote: you might launch Olly to see what it really does
Don't you think that this is a bit too much to ask, or at least a bit complicated, to use a debugger to see what it actually does?
Normally, I would not ask, but since you had difficulties de-coding my macro, I thought Olly would be a reliable way to check. What you show above, by the way, is old code - the version of oqCmp.asm that I posted 15 hours ago already contained:
if 1
xor edx, -1
and edx, 07fffh
else
not dx
and dh, 07fh
endif
The if 1 is conditional assembly and means "use this branch, not the other one".
Congrats, by the way - on the AMD your code is faster than mine:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
----------------------------------------------------
843 kCycles [x][x][x] - Cmp128Dave
847 kCycles [x][x][x] - Cmp128Nidud
917 kCycles [x][x][x] - Cmp128NidudSSE
643 kCycles [x][x][ ] - Cmp128Alex
1578 kCycles [x][x][x] - MasmBasic Ocmp
1469 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1531 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1466 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1466 kCycles [x][x][x] - Cmp128JJAlexSSE_3
803 kCycles [x][x][ ] - AxCMP128bitProc3
771 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
From Reply #289.
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1022 kCycles [x][x][x] - Cmp128Dave
917 kCycles [x][x][x] - Cmp128Nidud
1022 kCycles [x][x][x] - Cmp128NidudSSE
817 kCycles [x][x][ ] - Cmp128Alex
1561 kCycles [x][x][x] - MasmBasic Ocmp
1422 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1471 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1668 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1677 kCycles [x][x][x] - Cmp128JJAlexSSE_3
937 kCycles [x][x][ ] - AxCMP128bitProc3
985 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
Jochen,
it's just the text format
we each have our own style and it can be hard to get used to someone else's :P
The timings for Jochen's latest version:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
736 kCycles [x][x][x] - Cmp128Dave
629 kCycles [x][x][x] - Cmp128Nidud
696 kCycles [x][x][x] - Cmp128NidudSSE
442 kCycles [x][x][ ] - Cmp128Alex
367 kCycles [x][x][x] - MasmBasic Ocmp
321 kCycles [x][x][x] - Cmp128JJAlexSSE_1
371 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
364 kCycles [x][x][x] - Cmp128JJAlexSSE_2
352 kCycles [x][x][x] - Cmp128JJAlexSSE_3
447 kCycles [x][x][ ] - AxCMP128bitProc3
427 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
Gunther
Thanxalot :icon14:
Attached one more, inter alia with a modification of the test_start macro:
test_start macro useit:=<1>
usethismacro=useit
if usethismacro
push 50000000
.Repeat
dec dword ptr [esp] ; heat up the CPU
.Until Sign?
add esp, 4
invoke Sleep, 0
counter_begin 1000, HIGH_PRIORITY_CLASS
endif
endm
On some machines, timings were very volatile; the small mod above seems to help.
Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
986 kCycles [x][x][x] - Cmp128Dave
946 kCycles [x][x][x] - Cmp128Nidud
818 kCycles [x][x][x] - Cmp128NidudSSE
575 kCycles [x][x][ ] - Cmp128Alex
564 kCycles [x][x][x] - MasmBasic Ocmp.1
517 kCycles [x][x][x] - MasmBasic Ocmp.0
549 kCycles [x][x][x] - MasmBasic Ocmp.1
513 kCycles [x][x][x] - MasmBasic Ocmp.0
472 kCycles [x][x][x] - Cmp128JJAlexSSE_1
476 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
618 kCycles [x][x][x] - Cmp128JJAlexSSE_2
614 kCycles [x][x][x] - Cmp128JJAlexSSE_3
747 kCycles [x][x][ ] - AxCMP128bitProc3
772 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
-----------------------------------------------------
843 kCycles [x][x][x] - Cmp128Dave
844 kCycles [x][x][x] - Cmp128Nidud
919 kCycles [x][x][x] - Cmp128NidudSSE
641 kCycles [x][x][ ] - Cmp128Alex
1588 kCycles [x][x][x] - MasmBasic Ocmp.1
1584 kCycles [x][x][x] - MasmBasic Ocmp.0
1586 kCycles [x][x][x] - MasmBasic Ocmp.1
1578 kCycles [x][x][x] - MasmBasic Ocmp.0
1467 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1532 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1471 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1468 kCycles [x][x][x] - Cmp128JJAlexSSE_3
801 kCycles [x][x][ ] - AxCMP128bitProc3
771 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Jochen,
the new timings. I hope that helps:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
660 kCycles [x][x][x] - Cmp128Dave
538 kCycles [x][x][x] - Cmp128Nidud
603 kCycles [x][x][x] - Cmp128NidudSSE
371 kCycles [x][x][ ] - Cmp128Alex
316 kCycles [x][x][x] - MasmBasic Ocmp.1
307 kCycles [x][x][x] - MasmBasic Ocmp.0
314 kCycles [x][x][x] - MasmBasic Ocmp.1
306 kCycles [x][x][x] - MasmBasic Ocmp.0
259 kCycles [x][x][x] - Cmp128JJAlexSSE_1
308 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
302 kCycles [x][x][x] - Cmp128JJAlexSSE_2
263 kCycles [x][x][x] - Cmp128JJAlexSSE_3
391 kCycles [x][x][ ] - AxCMP128bitProc3
363 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
Gunther
Thanks, Gunther. And here is the Celeron M:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
955 kCycles [x][x][x] - Cmp128Dave
923 kCycles [x][x][x] - Cmp128Nidud
1012 kCycles [x][x][x] - Cmp128NidudSSE
676 kCycles [x][x][ ] - Cmp128Alex
1081 kCycles [x][x][x] - MasmBasic Ocmp.1
1010 kCycles [x][x][x] - MasmBasic Ocmp.0
1080 kCycles [x][x][x] - MasmBasic Ocmp.1
1011 kCycles [x][x][x] - MasmBasic Ocmp.0
814 kCycles [x][x][x] - Cmp128JJAlexSSE_1
853 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
926 kCycles [x][x][x] - Cmp128JJAlexSSE_2
927 kCycles [x][x][x] - Cmp128JJAlexSSE_3
869 kCycles [x][x][ ] - AxCMP128bitProc3
867 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Jochen's latest archive:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2713 kCycles [x][x][x] - Cmp128Dave
2716 kCycles [x][x][x] - Cmp128Nidud
3065 kCycles [x][x][x] - Cmp128NidudSSE
918 kCycles [x][x][ ] - Cmp128Alex
2044 kCycles [x][x][x] - MasmBasic Ocmp.1
1900 kCycles [x][x][x] - MasmBasic Ocmp.0
2046 kCycles [x][x][x] - MasmBasic Ocmp.1
1890 kCycles [x][x][x] - MasmBasic Ocmp.0
1605 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1574 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1614 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1571 kCycles [x][x][x] - Cmp128JJAlexSSE_3
1363 kCycles [x][x][ ] - AxCMP128bitProc3
1256 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
nidud's latest archive:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2708 kCycles [x][x][x] - Cmp128Dave
2749 kCycles [x][x][x] - Cmp128Nidud
3076 kCycles [x][x][x] - Cmp128NidudSSE
909 kCycles [x][x][ ] - Cmp128Alex
2044 kCycles [x][x][x] - MasmBasic Ocmp.1
1895 kCycles [x][x][x] - MasmBasic Ocmp.0
2046 kCycles [x][x][x] - MasmBasic Ocmp.1
1906 kCycles [x][x][x] - MasmBasic Ocmp.0
1604 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1543 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1351 kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
1576 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1578 kCycles [x][x][x] - Cmp128JJAlexSSE_3
1340 kCycles [x][x][ ] - AxCMP128bitProc3
1267 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
:biggrin:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
957 kCycles [x][x][x] - Cmp128Dave
926 kCycles [x][x][x] - Cmp128Nidud
1016 kCycles [x][x][x] - Cmp128NidudSSE
680 kCycles [x][x][ ] - Cmp128Alex
1039 kCycles [x][x][x] - MasmBasic Ocmp.1
1039 kCycles [x][x][x] - MasmBasic Ocmp.0
818 kCycles [x][x][x] - Cmp128JJAlexSSE_1
857 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
785 kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
832 kCycles [x][x][x] - Cmp128AxelNidudJJ_A
853 kCycles [x][x][x] - Cmp128AxelNidudJJ_B
930 kCycles [x][x][x] - Cmp128JJAlexSSE_2
930 kCycles [x][x][x] - Cmp128JJAlexSSE_3
866 kCycles [x][x][ ] - AxCMP128bitProc3
871 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
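Cmp128AxelNidudJJ MACRO A:REQ, B:REQ
movups xmm0,A[0]
movups xmm1,B[0]
push ecx ; do not trash ecx
; mov eax,A[12]
; mov edx,B[12]
pcmpeqb xmm0,xmm1 ; equal bytes become 0FFh
pmovmskb ecx,xmm0 ; one bit per byte: 1 = equal
if ANJ_A
xor ecx, -1 ; invert: 1 = bytes differ
and ecx, 07FFFh ; ignore byte 15, the dword compare below covers it
or ecx, 1 ; make sure there is no zero input (http://masm32.com/board/index.php?topic=2312.0)
bsr ecx, ecx ; index of the highest differing byte in 0..14
mov eax,A[12] ; top dwords of both operands
mov edx,B[12]
mov dl,B[ecx] ; put the highest differing byte into the low byte
mov al,A[ecx]
cmp eax,edx ; one cmp sets the flags for the 128-bit compare
else
xor ecx, 0FFFFh ; invert: 1 = bytes differ
.if !Zero? ; make sure there is no zero input
and ecx, 07FFFh
bsr ecx, ecx
mov eax,A[12]
mov edx,B[12]
mov dl,B[ecx]
mov al,A[ecx]
cmp eax,edx
.endif
endif
pop ecx
ENDM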
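For those following along without an assembler, here is a rough C sketch of the idea behind the ANJ_A branch, as far as it can be read from the macro: find the highest differing byte below the most significant one, put it under the top three bytes, and let a single 32-bit compare decide. It is a plain-C illustration, not a translation of the SSE code; the function names are invented, the operands are assumed to be little-endian 16-byte arrays, and the loop stands in for `pcmpeqb`/`pmovmskb`/`bsr`.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit key: bytes 15, 14, 13 of the operand on top, the selected byte below. */
static uint32_t make_key(const uint8_t v[16], unsigned idx)
{
    return ((uint32_t)v[15] << 24) | ((uint32_t)v[14] << 16) |
           ((uint32_t)v[13] << 8)  | (uint32_t)v[idx];
}

/* Highest index in 0..14 where a and b differ; 0 if none
   (same fallback as "or ecx, 1" before the bsr). */
static unsigned highest_diff_below_msb(const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 14; i > 0; --i)
        if (a[i] != b[i])
            return (unsigned)i;
    return 0;
}

/* Unsigned 128-bit compare: returns -1, 0 or 1 (what JB/JA would test). */
int cmp128_unsigned(const uint8_t a[16], const uint8_t b[16])
{
    assert(a && b);
    unsigned idx = highest_diff_below_msb(a, b);
    uint32_t ka = make_key(a, idx), kb = make_key(b, idx);
    return (ka > kb) - (ka < kb);
}

/* Signed 128-bit compare on the same keys (what JL/JG would test).
   Flipping the top bit turns a signed comparison into an unsigned one. */
int cmp128_signed(const uint8_t a[16], const uint8_t b[16])
{
    assert(a && b);
    unsigned idx = highest_diff_below_msb(a, b);
    uint32_t ka = make_key(a, idx) ^ 0x80000000u;
    uint32_t kb = make_key(b, idx) ^ 0x80000000u;
    return (ka > kb) - (ka < kb);
}
```

Both functions use the same pair of keys, which is why one `cmp eax,edx` in the macro can serve the unsigned and the signed jumps alike.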
Hi Jochen,
Reply #298:
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
963 kCycles [x][x][x] - Cmp128Dave
901 kCycles [x][x][x] - Cmp128Nidud
1000 kCycles [x][x][x] - Cmp128NidudSSE
662 kCycles [x][x][ ] - Cmp128Alex
991 kCycles [x][x][x] - MasmBasic Ocmp.1
941 kCycles [x][x][x] - MasmBasic Ocmp.0
991 kCycles [x][x][x] - MasmBasic Ocmp.1
941 kCycles [x][x][x] - MasmBasic Ocmp.0
765 kCycles [x][x][x] - Cmp128JJAlexSSE_1
826 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
870 kCycles [x][x][x] - Cmp128JJAlexSSE_2
873 kCycles [x][x][x] - Cmp128JJAlexSSE_3
862 kCycles [x][x][ ] - AxCMP128bitProc3
888 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Hi nidud,
Reply #301:
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
961 kCycles [x][x][x] - Cmp128Dave
903 kCycles [x][x][x] - Cmp128Nidud
999 kCycles [x][x][x] - Cmp128NidudSSE
663 kCycles [x][x][ ] - Cmp128Alex
990 kCycles [x][x][x] - MasmBasic Ocmp.1
942 kCycles [x][x][x] - MasmBasic Ocmp.0
990 kCycles [x][x][x] - MasmBasic Ocmp.1
942 kCycles [x][x][x] - MasmBasic Ocmp.0
763 kCycles [x][x][x] - Cmp128JJAlexSSE_1
827 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
718 kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
871 kCycles [x][x][x] - Cmp128JJAlexSSE_2
874 kCycles [x][x][x] - Cmp128JJAlexSSE_3
855 kCycles [x][x][ ] - AxCMP128bitProc3
887 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Jochen, Reply #303:
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz (SSE4)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
964 kCycles [x][x][x] - Cmp128Dave
901 kCycles [x][x][x] - Cmp128Nidud
1000 kCycles [x][x][x] - Cmp128NidudSSE
662 kCycles [x][x][ ] - Cmp128Alex
970 kCycles [x][x][x] - MasmBasic Ocmp.1
968 kCycles [x][x][x] - MasmBasic Ocmp.0
764 kCycles [x][x][x] - Cmp128JJAlexSSE_1
825 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
720 kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
717 kCycles [x][x][x] - Cmp128AxelNidudJJ
870 kCycles [x][x][x] - Cmp128JJAlexSSE_2
872 kCycles [x][x][x] - Cmp128JJAlexSSE_3
862 kCycles [x][x][ ] - AxCMP128bitProc3
886 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
Thanks, Marinus - I like it :biggrin:
JJ's latest
AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1448 kCycles [x][x][x] - Cmp128Dave
1146 kCycles [x][x][x] - Cmp128Nidud
846 kCycles [x][x][x] - Cmp128NidudSSE
496 kCycles [x][x][ ] - Cmp128Alex
2043 kCycles [x][x][x] - MasmBasic Ocmp.1
2153 kCycles [x][x][x] - MasmBasic Ocmp.0
1849 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1996 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1799 kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
1884 kCycles [x][x][x] - Cmp128AxelNidudJJ
1956 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1994 kCycles [x][x][x] - Cmp128JJAlexSSE_3
1344 kCycles [x][x][ ] - AxCMP128bitProc3
1263 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
Dave.
Dave & Marinus,
Thanks but I am afraid the difference between Cmp128JJAlexSSE_1new1 and Cmp128AlexNidudJJ reflected just the volatility of timings - I changed the description but forgot the macro call itself :redface:
Scroll back three posts to get the good version... sorry ;-)
From 303 - I thought something had happened - timings were way high with the newest version.
AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
1391 kCycles [x][x][x] - Cmp128Dave
1065 kCycles [x][x][x] - Cmp128Nidud
884 kCycles [x][x][x] - Cmp128NidudSSE
685 kCycles [x][x][ ] - Cmp128Alex
949 kCycles [x][x][x] - MasmBasic Ocmp.1
953 kCycles [x][x][x] - MasmBasic Ocmp.0
714 kCycles [x][x][x] - Cmp128JJAlexSSE_1
703 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
700 kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
740 kCycles [x][x][x] - Cmp128AxelNidudJJ_A
739 kCycles [x][x][x] - Cmp128AxelNidudJJ_B
710 kCycles [x][x][x] - Cmp128JJAlexSSE_2
697 kCycles [x][x][x] - Cmp128JJAlexSSE_3
516 kCycles [x][x][ ] - AxCMP128bitProc3
1331 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
Dave.
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
CMP emulation: [JB/JA] [JL/JG] [JO/JS]
------------------------------------------------------
2713 kCycles [x][x][x] - Cmp128Dave
2729 kCycles [x][x][x] - Cmp128Nidud
3105 kCycles [x][x][x] - Cmp128NidudSSE
921 kCycles [x][x][ ] - Cmp128Alex
1979 kCycles [x][x][x] - MasmBasic Ocmp.1
1979 kCycles [x][x][x] - MasmBasic Ocmp.0
1607 kCycles [x][x][x] - Cmp128JJAlexSSE_1
1540 kCycles [x][x][x] - Cmp128JJAlexSSE_1new
1354 kCycles [x][x][x] - Cmp128JJAlexSSE_1new1
1549 kCycles [x][x][x] - Cmp128AxelNidudJJ_A
1589 kCycles [x][x][x] - Cmp128AxelNidudJJ_B
1574 kCycles [x][x][x] - Cmp128JJAlexSSE_2
1577 kCycles [x][x][x] - Cmp128JJAlexSSE_3
1333 kCycles [x][x][ ] - AxCMP128bitProc3
1251 kCycles [x][x][ ] - AxCMP128bitProc3c (cmov)
--- ok ---
Quote from: nidud on September 03, 2013, 03:49:40 AM
You are a bit sensitive me think :P
Looks like it ... probably a standard feature of hyperactive forum members ... I could contribute an experience or two of my own there. :icon_mrgreen:
Quote from: jj2007 on September 03, 2013, 08:18:12 AM
Cmp128AxelNidudJJ MACRO A:REQ, B:REQ
Oops - my apologies, Alex :redface:
Hi,
Well, I put versions of a compare routine using CMPSB and
CMPSW into the timing suite. If I did it correctly, someone at
Intel really hates string instructions. And going from bytes to
words only helped ~5 - 15%. Which means I probably need to
check for gross errors. Oh well, maybe small code size counts
for something.
Cheers,
Steve N.
i am not sure that the string method would pass all the tests, Steve
at least, not without some extra support code :P
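A rough C sketch of why a pure string compare might need support code. It assumes the string routine scans from the most significant byte down and treats every byte as unsigned. Steve's actual code is in the replies he cites and is not reproduced here; the function names and the sign check are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-wise scan from byte 15 down to byte 0 (little-endian operands).
   This gives the unsigned ordering only: returns -1, 0 or 1. */
int cmp128_bytes_unsigned(const uint8_t a[16], const uint8_t b[16])
{
    assert(a && b);
    for (int i = 15; i >= 0; --i)
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    return 0;
}

/* Hypothetical "support code" for the signed case: if the sign bits differ,
   the negative operand is the smaller one; if they match, the unsigned
   ordering is also the two's-complement ordering. */
int cmp128_bytes_signed(const uint8_t a[16], const uint8_t b[16])
{
    assert(a && b);
    int a_neg = (a[15] & 0x80) != 0;
    int b_neg = (b[15] & 0x80) != 0;
    if (a_neg != b_neg)
        return a_neg ? -1 : 1;
    return cmp128_bytes_unsigned(a, b);
}
```

Whether the CMPSB/CMPSW versions need something like the second function depends on which of the [JB/JA] [JL/JG] [JO/JS] checks they are expected to pass.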
RE: "Axel"
Axel is a good name - let's call him that, from now on :lol:
Hi Dave,
Okay. You may be right. The idea for the algorithm came from
Reply #39. I used your code in Reply #165 to get the algorithm
to verify the compare algorithm. See Replies #241, #252, and
#255 for test results, code, and a comment. I skimmed some
of this thread again, but may have missed something. Did you
change your validation program?
Regards,
Steve N.
Quote from: jj2007 on September 03, 2013, 11:03:31 PM
Quote from: jj2007 on September 03, 2013, 08:18:12 AM
Cmp128AxelNidudJJ MACRO A:REQ, B:REQ
Oops - my apologies, Alex :redface:
Don't worry :biggrin:
Quote from: dedndave on September 04, 2013, 12:04:31 AM
RE: "Axel"
Axel is a good name - let's call him that, from now on :lol:
MOV EAX, "Dave"
BSWAP EAX
Your name is 32 bit, too, Dave :biggrin:
i have used the value 0DABEDABEh before to XOR some files - lol
Quote from: dedndave on September 05, 2013, 05:27:59 PM
i have used the value 0DABEDABEh before to XOR some files - lol
:biggrin:
BTW, the Russian letter "V" is written like "В".
Дэйв - that's how your name is spelled in Russian :biggrin:
But what does the "dedn" prefix before the name mean? Is it some idiom?
i used to live in a very small town, out in the desert
i lived at the end of an old, historic, mining road called Plomosa Road
i was at the base of a small mountain, so the road ended at my place :P
up the hill a little was an old abandoned mine
(http://bousechamber.org/images/Bousehill2.jpg)
when you live in a small town, everyone makes up nicknames for you
we had "One-Door Fred" and "Bartender John", etc
one of the guys was an old guy we called "Dirty Ernie"
he used to grab at girls when he'd been drinking - lol
they called me "Dead-End Dave" :biggrin:
that's where dedndave came from
Thank you, Dave, now it's clear :biggrin:
Hi Dave,
Quote from: dedndave on September 05, 2013, 10:41:01 PM
they called me "Dead-End Dave" :biggrin:
that's where dedndave came from
This was an overdue clarification. :t
Gunther