News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Comparing 128-bit numbers aka OWORDs

Started by jj2007, August 12, 2013, 08:25:24 PM

Previous topic - Next topic

sinsi


Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1756/1000 cycles

2362    cycles for 1000 * Ocmp2 (JJ)
2627    cycles for 1000 * Cmp128Dave
2313    cycles for 1000 * cmp128n (nidud)
4672    cycles for 1000 * cmp128 qWord
3885    cycles for 1000 * AxCMP128bit

2418    cycles for 1000 * Ocmp2 (JJ)
2635    cycles for 1000 * Cmp128Dave
2310    cycles for 1000 * cmp128n (nidud)
4628    cycles for 1000 * cmp128 qWord
3833    cycles for 1000 * AxCMP128bit

2412    cycles for 1000 * Ocmp2 (JJ)
2637    cycles for 1000 * Cmp128Dave
2394    cycles for 1000 * cmp128n (nidud)
4647    cycles for 1000 * cmp128 qWord
3866    cycles for 1000 * AxCMP128bit

Antariy

Thank you, John! :t

Can I ask you to make one more test of a program in this post? We had strange results, though, I think they are representative image that one algo is faster than another, but the numbers were just wonderful. Interesting, if this behaviour will be shown on your CPU, too.

sinsi

 :t

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1677/1000 cycles

2537    cycles for 1000 * Ocmp (JJ)
61306   cycles for 1000 * cmp128b (loop)
4740    cycles for 1000 * cmp128 qWord
2132    cycles for 1000 * JJAxCMP128bit

2441    cycles for 1000 * Ocmp (JJ)
61320   cycles for 1000 * cmp128b (loop)
4759    cycles for 1000 * cmp128 qWord
2009    cycles for 1000 * JJAxCMP128bit

2419    cycles for 1000 * Ocmp (JJ)
61279   cycles for 1000 * cmp128b (loop)
4796    cycles for 1000 * cmp128 qWord
2052    cycles for 1000 * JJAxCMP128bit


Antariy

Incredible! :biggrin: Results are absolutely different! In this thread we have very varying timings for every kind of algo.
Thank you very much, John! :t

I thought that my JJAxCMP128bit tweak is a lot slower than original Jochen's version so replaced it with GPR code, but it seems that on anything younger than desktop PIV Intel CPU models SSE code will be faster than GPR (not sure about this for AMD). Probably there is need to return the tweak into testbed.

nidud

#109
deleted

Antariy

Hi nidud, can you please post entire testbed source + binary? This will help a lot to check what is going on.

sinsi

We are still testing unsigned? And the end result is ja/je/jb?

Antariy

Quote from: sinsi on August 20, 2013, 06:34:40 PM
We are still testing unsigned? And the end result is ja/je/jb?

No, theoretically every algo is designed to support any number - fully emulate behaviour of usual CMP instruction in flags setting, so how to treat the number is for programmers decision - JA or JG, JB or JL etc.
Something strange here with GPR code - if nidud will post the testbed it will help, but right now I cannot say what is wrong (and I'm not sure that something is wrong - we probably need a comprehensive checking testbed otherwise it's very error-prone to chech all things manually).

sinsi

I can't see how comparing dwords can propagate every flag for all four.

Quote from: Intel manualtemp ←SRC1 −SignExtend(SRC2);
ModifyStatusFlags; (* Modify status flags inthe same manner as the SUB instruction*)
The CF, OF, SF, ZF, AF, and PF flags are set according to the result.

Antariy

nidud, if the source is a bit in a working untidy then post a binary, please, or the number which makes the algos to fail.

Quote from: sinsi on August 20, 2013, 07:47:01 PM
I can't see how comparing dwords can propagate every flag for all four.

Quote from: Intel manualtemp ←SRC1 −SignExtend(SRC2);
ModifyStatusFlags; (* Modify status flags inthe same manner as the SUB instruction*)
The CF, OF, SF, ZF, AF, and PF flags are set according to the result.

We don't set flags or combine flags for every DWORD - the trick is that we make a CMP unstruction for a 128 bit integer number, so the flags should be set according to the state of comparsion a two 128 bit numbers as entire, not as "arrays" of DWORDs (just like CMP 128_bit_number_1, 128_bit_number_2 and then Jcc).

dedndave

#115
i cleaned up the validation test code - and removed repeat tests

earlier, i stated that my algo failed 2440 of 3200 tests
i see i had the test commented out, though - lol

with the changes, my current algo fails 112 of 3160 tests   :biggrin:

Antariy

OK, here are the correctness test, not in form of a macro for testing of every algo, but it shows the idea.

Need to compare OWORDs with selected algo, then to convert and save the flags in a easily comparable format with FlagsToEAX, then compare the same OWORDs with "Etalone" macro, which produces proper results by emulation flags setting and returns the result similar to FlagsToEAX, then compare what "Etalone" has returned and what was saved from the tested algo flags. If there's difference - it shows it.

Here is the code isolated from the source:

Added to the source:

; EAX = BITS: ... CF ZF SF OF
FlagsToEAX MACRO
pushfd
xor eax,eax
pop edx
bt edx,0
rcl eax,1
bt edx,6
rcl eax,1
bt edx,7
rcl eax,1
bt edx,11
rcl eax,1
ENDM

Etalone MACRO ow0, ow1
LOCAL @l1, @l2, @l3, l0

push ebx

mov eax,dword ptr [ow0+12]
mov edx,dword ptr [ow1+12]
cmp eax,edx
jnz @l1 ; just save flags

mov ecx,dword ptr [ow0+8]
mov ebx,dword ptr [ow1+8]
cmp ecx,ebx
jnz @l2

mov ecx,dword ptr [ow0+4]
mov ebx,dword ptr [ow1+4]
cmp ecx,ebx
jnz @l2

mov ecx,dword ptr [ow0]
mov ebx,dword ptr [ow1]
cmp ecx,ebx
jz @l1



@l2:
push 0
ja @l3 ; if it's above - the number is bigger because this isn't MSD

mov byte ptr [esp+3],1 ; CF set, below than (unsigned)

test eax,eax ; if numbers are signed then set required flags
jns @l3
mov word ptr [esp],0001h ; SF and OF are not equal, so it means less than (signed)
@l3:
pop edx
shr edx,1
rcl eax,1
push 3
pop ecx
@@:
shr edx,8
rcl eax,1
loop @B
jmp @l0


@l1:
FlagsToEAX

@l0:

if 0
push eax
mov ebx,eax
test ebx,1
jz @F
print "OF "
@@:
test ebx,2
jz @F
print "SF "
@@:
test ebx,4
jz @F
print "ZF "
@@:
test ebx,8
jz @F
print "CF "
@@:

print chr$(13,10)
pop eax
endif

pop ebx

ENDM
   

The piece below is a checking itself, it may be added in a start of a prog here:

start:   push 1
   call ShowCpu   ; print brand string and SSE level
   invoke SetProcessAffinityMask, -1, 1   ; restrict to one core
   Calibrate



AxCMP128bit numberOne,numberTwo
FlagsToEAX
push eax
Etalone numberOne, numberTwo
pop ecx
xor eax,ecx
jz @l1 ; test OK
and eax,3 ; layout of SF and OF flag may differ, but if they are both not equal
jz @l1 ; in first comparsion and in second comparsion, then it's proper result
cmp eax,3 ; since signed less than is OF != SF with no difference which flags are (un)set
jz @l1
mov edx,[esp]
print str$(edx)," - Test failed: "
print uhex$(dword ptr [esi+12])
print "_"
print uhex$(dword ptr [esi+8])
print "_"
print uhex$(dword ptr [esi+4])
print "_"
print uhex$(dword ptr [esi])
print "  "
print uhex$(dword ptr [edi+12])
print "_"
print uhex$(dword ptr [edi+8])
print "_"
print uhex$(dword ptr [edi+4])
print "_"
print uhex$(dword ptr [edi]),13,10
@l1:


One may say why the "Etalone" stated as truely proper code, well, this is question like about chiken and egg :biggrin: It stated so because it follows the CMP/SUB behaviour in a flags setting - it just emulates it, but not with flags and in a variable, which will then be compared with the converted flags state result of a tested code. It's longer to describe "why" it stated to work properly than to check its algo with help of Intel docs.

Antariy

Quote from: dedndave on August 20, 2013, 09:39:49 PM
i cleaned up the validation test code - and removed repeat tests

earlier, i stated that my algo failed 2440 of 3200 tests
i see i had the test commented out, though - lol

with the changes, my current algo fails 112 of 3160 tests   :biggrin:

Hi Dave, I did not see your code yet, will check it :t

Antariy

Can I use your data set with my checking method above? I did not prepare the data patterns yet, but checking on just a random data shows that algo works (but crafted data like yours is better than just random).

dedndave

of course you may, Alex

the data is organized as sets of: (1) dword and (1) oword

comparing the owords from any two lines should yield the
same flags as comparing the dwords from the same 2 lines

so - the dwords are "control" values and the owords are "test" values
3160 tests are required to make all compares
(40 x 40 x 2, less 40 repeats)

in my test code, i examine OF, SF, ZF, and CF - they should match
no messing around with JL, JGE, etc