Comparing 128-bit numbers aka OWORDs

sinsi · August 20, 2013, 11:44:41 AM

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1756/1000 cycles

2362 cycles for 1000 * Ocmp2 (JJ)
2627 cycles for 1000 * Cmp128Dave
2313 cycles for 1000 * cmp128n (nidud)
4672 cycles for 1000 * cmp128 qWord
3885 cycles for 1000 * AxCMP128bit

2418 cycles for 1000 * Ocmp2 (JJ)
2635 cycles for 1000 * Cmp128Dave
2310 cycles for 1000 * cmp128n (nidud)
4628 cycles for 1000 * cmp128 qWord
3833 cycles for 1000 * AxCMP128bit

2412 cycles for 1000 * Ocmp2 (JJ)
2637 cycles for 1000 * Cmp128Dave
2394 cycles for 1000 * cmp128n (nidud)
4647 cycles for 1000 * cmp128 qWord
3866 cycles for 1000 * AxCMP128bit

Antariy · August 20, 2013, 12:08:01 PM

Thank you, John! :t

Can I ask you to make one more test of a program in this post? We had strange results, though, I think they are representative image that one algo is faster than another, but the numbers were just wonderful. Interesting, if this behaviour will be shown on your CPU, too.

sinsi · August 20, 2013, 12:39:40 PM

:t

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 1677/1000 cycles

2537 cycles for 1000 * Ocmp (JJ)
61306 cycles for 1000 * cmp128b (loop)
4740 cycles for 1000 * cmp128 qWord
2132 cycles for 1000 * JJAxCMP128bit

2441 cycles for 1000 * Ocmp (JJ)
61320 cycles for 1000 * cmp128b (loop)
4759 cycles for 1000 * cmp128 qWord
2009 cycles for 1000 * JJAxCMP128bit

2419 cycles for 1000 * Ocmp (JJ)
61279 cycles for 1000 * cmp128b (loop)
4796 cycles for 1000 * cmp128 qWord
2052 cycles for 1000 * JJAxCMP128bit

Antariy · August 20, 2013, 12:57:59 PM

Incredible!

Results are absolutely different! In this thread we have very varying timings for every kind of algo.
Thank you very much, John! :t

I thought that my JJAxCMP128bit tweak is a lot slower than original Jochen's version so replaced it with GPR code, but it seems that on anything younger than desktop PIV Intel CPU models SSE code will be faster than GPR (not sure about this for AMD). Probably there is need to return the tweak into testbed.

nidud · August 20, 2013, 02:25:20 PM

deleted

Antariy · August 20, 2013, 06:16:56 PM

Hi nidud, can you please post entire testbed source + binary? This will help a lot to check what is going on.

sinsi · August 20, 2013, 06:34:40 PM

We are still testing unsigned? And the end result is ja/je/jb?

Antariy · August 20, 2013, 07:19:58 PM

Quote from: sinsi on August 20, 2013, 06:34:40 PM
We are still testing unsigned? And the end result is ja/je/jb?

No, theoretically every algo is designed to support any number - fully emulate behaviour of usual CMP instruction in flags setting, so how to treat the number is for programmers decision - JA or JG, JB or JL etc.
Something strange here with GPR code - if nidud will post the testbed it will help, but right now I cannot say what is wrong (and I'm not sure that something is wrong - we probably need a comprehensive checking testbed otherwise it's very error-prone to chech all things manually).

sinsi · August 20, 2013, 07:47:01 PM

I can't see how comparing dwords can propagate every flag for all four.

Quote from: Intel manualtemp ←SRC1 −SignExtend(SRC2);
ModifyStatusFlags; (* Modify status flags inthe same manner as the SUB instruction*)
The CF, OF, SF, ZF, AF, and PF flags are set according to the result.

Antariy · August 20, 2013, 08:24:01 PM

nidud, if the source is a bit in a working untidy then post a binary, please, or the number which makes the algos to fail.

Quote from: sinsi on August 20, 2013, 07:47:01 PM
I can't see how comparing dwords can propagate every flag for all four.

Quote from: Intel manualtemp ←SRC1 −SignExtend(SRC2);
ModifyStatusFlags; (* Modify status flags inthe same manner as the SUB instruction*)
The CF, OF, SF, ZF, AF, and PF flags are set according to the result.

We don't set flags or combine flags for every DWORD - the trick is that we make a CMP unstruction for a 128 bit integer number, so the flags should be set according to the state of comparsion a two 128 bit numbers as entire, not as "arrays" of DWORDs (just like CMP 128_bit_number_1, 128_bit_number_2 and then Jcc).

dedndave · August 20, 2013, 09:39:49 PM

i cleaned up the validation test code - and removed repeat tests

earlier, i stated that my algo failed 2440 of 3200 tests
i see i had the test commented out, though - lol

with the changes, my current algo fails 112 of 3160 tests

Antariy · August 20, 2013, 10:03:13 PM

OK, here are the correctness test, not in form of a macro for testing of every algo, but it shows the idea.

Need to compare OWORDs with selected algo, then to convert and save the flags in a easily comparable format with FlagsToEAX, then compare the same OWORDs with "Etalone" macro, which produces proper results by emulation flags setting and returns the result similar to FlagsToEAX, then compare what "Etalone" has returned and what was saved from the tested algo flags. If there's difference - it shows it.

Here is the code isolated from the source:

Added to the source:

Code Select


	; EAX = BITS: ... CF ZF SF OF
FlagsToEAX	MACRO
	pushfd
	xor eax,eax
	pop edx
	bt edx,0
	rcl eax,1
	bt edx,6
	rcl eax,1
	bt edx,7
	rcl eax,1
	bt edx,11
	rcl eax,1
ENDM
	
Etalone	MACRO ow0, ow1
	LOCAL @l1, @l2, @l3, l0
	
	push ebx
	
	mov eax,dword ptr [ow0+12]
	mov edx,dword ptr [ow1+12]
	cmp eax,edx
	jnz @l1	; just save flags
	
	mov ecx,dword ptr [ow0+8]
	mov ebx,dword ptr [ow1+8]
	cmp ecx,ebx
	jnz @l2
	
	mov ecx,dword ptr [ow0+4]
	mov ebx,dword ptr [ow1+4]
	cmp ecx,ebx
	jnz @l2
	
	mov ecx,dword ptr [ow0]
	mov ebx,dword ptr [ow1]
	cmp ecx,ebx
	jz @l1
	
	
	
	@l2:
	push 0
	ja @l3	; if it's above - the number is bigger because this isn't MSD
	
	mov byte ptr [esp+3],1	; CF set, below than (unsigned)
	
	test eax,eax	; if numbers are signed then set required flags
	jns @l3
	mov word ptr [esp],0001h	; SF and OF are not equal, so it means less than (signed)
	@l3:
	pop edx
	shr edx,1
	rcl eax,1
	push 3
	pop ecx
	@@:
	shr edx,8
	rcl eax,1
	loop @B
	jmp @l0
		
	
	@l1:
	FlagsToEAX
	
	@l0:
	
	if 0
		push eax
		mov ebx,eax
		test ebx,1
		jz @F
		print "OF "
		@@:
		test ebx,2
		jz @F
		print "SF "
		@@:
		test ebx,4
		jz @F
		print "ZF "
		@@:
		test ebx,8
		jz @F
		print "CF "
		@@:
		
		print chr$(13,10)
		pop eax
	endif
	
	pop ebx
	
	ENDM

The piece below is a checking itself, it may be added in a start of a prog here:

start:   push 1
   call ShowCpu   ; print brand string and SSE level
   invoke SetProcessAffinityMask, -1, 1   ; restrict to one core
   Calibrate

Code Select


	AxCMP128bit numberOne,numberTwo
	FlagsToEAX
	push eax
	Etalone numberOne, numberTwo
	pop ecx	
	xor eax,ecx
	jz @l1	; test OK
	and eax,3	; layout of SF and OF flag may differ, but if they are both not equal
	jz @l1		; in first comparsion and in second comparsion, then it's proper result
	cmp eax,3	; since signed less than is OF != SF with no difference which flags are (un)set
	jz @l1
	mov edx,[esp]
	print str$(edx)," - Test failed: "
	print uhex$(dword ptr [esi+12])
	print "_"
	print uhex$(dword ptr [esi+8])
	print "_"
	print uhex$(dword ptr [esi+4])
	print "_"
	print uhex$(dword ptr [esi])
	print "  "
	print uhex$(dword ptr [edi+12])
	print "_"
	print uhex$(dword ptr [edi+8])
	print "_"
	print uhex$(dword ptr [edi+4])
	print "_"
	print uhex$(dword ptr [edi]),13,10
	@l1:

One may say why the "Etalone" stated as truely proper code, well, this is question like about chiken and egg

It stated so because it follows the CMP/SUB behaviour in a flags setting - it just emulates it, but not with flags and in a variable, which will then be compared with the converted flags state result of a tested code. It's longer to describe "why" it stated to work properly than to check its algo with help of Intel docs.

Antariy · August 20, 2013, 10:07:37 PM

Quote from: dedndave on August 20, 2013, 09:39:49 PM
i cleaned up the validation test code - and removed repeat tests

earlier, i stated that my algo failed 2440 of 3200 tests
i see i had the test commented out, though - lol

with the changes, my current algo fails 112 of 3160 tests

Hi Dave, I did not see your code yet, will check it :t

Antariy · August 20, 2013, 10:10:01 PM

Can I use your data set with my checking method above? I did not prepare the data patterns yet, but checking on just a random data shows that algo works (but crafted data like yours is better than just random).

dedndave · August 20, 2013, 10:23:27 PM

of course you may, Alex

the data is organized as sets of: (1) dword and (1) oword

comparing the owords from any two lines should yield the
same flags as comparing the dwords from the same 2 lines

so - the dwords are "control" values and the owords are "test" values
3160 tests are required to make all compares
(40 x 40 x 2, less 40 repeats)

in my test code, i examine OF, SF, ZF, and CF - they should match
no messing around with JL, JGE, etc

The MASM Forum

News:

Comparing 128-bit numbers aka OWORDs

sinsi

Antariy

sinsi

Antariy

nidud

Antariy

sinsi

Antariy

sinsi

Antariy

dedndave

Antariy

Antariy

Antariy

dedndave