CMOVx faster?

Started by jj2007, February 28, 2014, 11:13:34 AM

jj2007

Just found a new book, Assembly Language Succinctly by Chris Rose. It looks OK for C/C++ coders. What raised my curiosity is a snippet called FindSmallest using cmovx - so I put up a little testbed comparing cmov timings against jump timings. Results:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)

7048    cycles for 100 * max cmov
9905    cycles for 100 * max jmp
6179    cycles for 100 * max cmov lods
10048   cycles for 100 * max jmp lods

7019    cycles for 100 * max cmov
9781    cycles for 100 * max jmp
6179    cycles for 100 * max cmov lods
10026   cycles for 100 * max jmp lods

6990    cycles for 100 * max cmov
9809    cycles for 100 * max jmp
6224    cycles for 100 * max cmov lods
10060   cycles for 100 * max jmp lods

24      bytes for max cmov
25      bytes for max jmp
23      bytes for max cmov lods
23      bytes for max jmp lods

12345678        = eax max cmov
12345678        = eax max jmp
12345678        = eax max cmov lods
12345678        = eax max jmp lods
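For readers without the testbed source, the "max cmov" and "max jmp" variants presumably compare loops of roughly the following shape. This is only a sketch under assumptions, not the actual testbed code: the proc names MaxCmov and MaxJmp, the argument layout, and the choice of a signed comparison are all made up for illustration.

```asm
; Hypothetical sketch, NOT the testbed source: two ways to find the
; largest signed DWORD in an array. Assembles as a module with ml.exe.
.686                        ; cmovcc needs at least .686
.model flat, stdcall
.code

; Branchless version: compare, then conditionally replace the maximum
MaxCmov proc uses esi pArr:DWORD, count:DWORD
    mov esi, pArr
    mov ecx, count          ; assumed to be >= 1
    mov eax, [esi]          ; first element is the initial maximum
LoopC:
    mov edx, [esi]          ; load the current element
    add esi, 4              ; manual pointer update
    cmp edx, eax
    cmovg eax, edx          ; eax = edx only if edx > eax (signed)
    dec ecx
    jnz LoopC
    ret
MaxCmov endp

; Same loop with a conditional jump instead of cmov
MaxJmp proc uses esi pArr:DWORD, count:DWORD
    mov esi, pArr
    mov ecx, count          ; assumed to be >= 1
    mov eax, [esi]
LoopJ:
    mov edx, [esi]
    add esi, 4
    cmp edx, eax
    jle NoNewMax            ; skip the update unless edx > eax
    mov eax, edx
NoNewMax:
    dec ecx
    jnz LoopJ
    ret
MaxJmp endp
end
```

How the jump version performs depends on how predictable the jle is for the given data, which may be one reason the results differ from CPU to CPU.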

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

4847    cycles for 100 * max cmov
5345    cycles for 100 * max jmp
4843    cycles for 100 * max cmov lods
5747    cycles for 100 * max jmp lods

4843    cycles for 100 * max cmov
5350    cycles for 100 * max jmp
4845    cycles for 100 * max cmov lods
5726    cycles for 100 * max jmp lods

4845    cycles for 100 * max cmov
5344    cycles for 100 * max jmp
4841    cycles for 100 * max cmov lods
5725    cycles for 100 * max jmp lods

24      bytes for max cmov
25      bytes for max jmp
23      bytes for max cmov lods
23      bytes for max jmp lods

12345678        = eax max cmov
12345678        = eax max jmp
12345678        = eax max cmov lods
12345678        = eax max jmp lods

--- ok ---
Creative coders use backward thinking techniques as a strategy.

KeepingRealBusy

AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)

10522   cycles for 100 * max cmov
8365    cycles for 100 * max jmp
11082   cycles for 100 * max cmov lods
9343    cycles for 100 * max jmp lods

6343    cycles for 100 * max cmov
4870    cycles for 100 * max jmp
6716    cycles for 100 * max cmov lods
10113   cycles for 100 * max jmp lods

4429    cycles for 100 * max cmov
4043    cycles for 100 * max jmp
5848    cycles for 100 * max cmov lods
5475    cycles for 100 * max jmp lods

24      bytes for max cmov
25      bytes for max jmp
23      bytes for max cmov lods
23      bytes for max jmp lods

12345678        = eax max cmov
12345678        = eax max jmp
12345678        = eax max cmov lods
12345678        = eax max jmp lods

--- ok ---

Appears to depend on the CPU.  JJ, your cmov vs. lods time difference seems to come from the cmov version updating ESI manually, while lods increments ESI as part of its own execution. My AMD appears to like short jumps. Go figure.

Dave.
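To illustrate the point about ESI: a lods-based loop might look like the sketch below. It is an assumption about the general shape, not the testbed code, and the name MaxCmovLods is hypothetical. Since lodsd always loads into EAX, the running maximum has to live in another register.

```asm
; Hypothetical sketch of a "max cmov lods" style loop (signed maximum)
.686
.model flat, stdcall
.code

MaxCmovLods proc uses esi pArr:DWORD, count:DWORD
    mov esi, pArr
    mov ecx, count          ; assumed to be >= 1
    mov edx, [esi]          ; running maximum kept in edx
    cld                     ; make sure lodsd moves forward
NextItem:
    lodsd                   ; eax = [esi], then esi += 4 in one instruction
    cmp eax, edx
    cmovg edx, eax          ; edx = eax only if eax > edx (signed)
    dec ecx
    jnz NextItem
    mov eax, edx            ; return the maximum in eax
    ret
MaxCmovLods endp
end
```

Whether the implicit ESI update is cheaper than a separate add is exactly the kind of thing that varies between CPUs, as the timings in this thread suggest.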

hutch--

I have seen various comparisons done over time and none of them has ever been conclusive. The result varies with whether the jump is taken or not, and with the hardware. I rarely use CMOVx because it offers no advantage over a conventional cmp/test plus jump. For the little that it's worth, the main speed differences are related to memory access; if you can reduce that to a minimum, you can usually forget about jumps.

sinsi

Here's a gotcha for using CMOV

    sub eax,eax
    cmovnz eax,[eax]

EAX is zero, so ZF is set and the move should not happen, but CMOVNZ still reads the memory operand [eax] regardless of the condition: instant access violation.
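One way around this, sketched under assumptions: apply the cmov to the address (register to register, so nothing is dereferenced) and then do an ordinary load from an address that is known to be readable. The names SafeLoad and dummy are hypothetical, and a plain test/jz around the load would work just as well.

```asm
; Hypothetical sketch: return [pVal] if pVal is non-null, otherwise 0,
; without ever dereferencing a null pointer.
.686
.model flat, stdcall
.data
dummy dd 0                  ; always-readable fallback location

.code
SafeLoad proc pVal:DWORD
    mov eax, pVal
    mov edx, offset dummy
    test eax, eax           ; ZF = 1 if the pointer is null
    cmovz eax, edx          ; register operands only: no memory access here
    mov eax, [eax]          ; unconditional load from a valid address
    ret
SafeLoad endp
end
```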

satpro

About that book (which is a good read)....

They will email you, call you, email again, call again, call, call, call...
They want you to buy some high end ($$thousands) development software.

jj2007

Quote from: KeepingRealBusy on February 28, 2014, 02:42:08 PM
JJ, your cmov vs lods time seems to be that cmov manually updates ESI while lods increments as part of the lods execution. My AMD appears to like short jumps. Go figure.

Mine behaves differently:

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

6238    cycles for 100 * max cmov
8718    cycles for 100 * max jmp
8123    cycles for 100 * max cmov lods
9936    cycles for 100 * max jmp lods

6246    cycles for 100 * max cmov
8715    cycles for 100 * max jmp
8135    cycles for 100 * max cmov lods
9922    cycles for 100 * max jmp lods


@sinsi: Nice find ;-)
@satpro: no registration required here...

FORTRANS

Hi,

   Looks like it might have been more important with older
processors.

pre-P4 (SSE1)

7597    cycles for 100 * max cmov
12030   cycles for 100 * max jmp
7907    cycles for 100 * max cmov lods
11726   cycles for 100 * max jmp lods

7603    cycles for 100 * max cmov
12064   cycles for 100 * max jmp
7896    cycles for 100 * max cmov lods
11706   cycles for 100 * max jmp lods

7592    cycles for 100 * max cmov
12007   cycles for 100 * max jmp
7906    cycles for 100 * max cmov lods
11732   cycles for 100 * max jmp lods

24      bytes for max cmov
25      bytes for max jmp
23      bytes for max cmov lods
23      bytes for max jmp lods

12345678        = eax max cmov
12345678        = eax max jmp
12345678        = eax max cmov lods
12345678        = eax max jmp lods

--- ok ---


Cheers,

Steve N.

dedndave

i'm with Hutch on this one, but for slightly different reasoning
CMOV isn't supported on all pentiums
so, you have to test for it - then provide fall-back code or exit if it's not supported
all that's a pain in the ass for too little advantage, if any - lol

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

14264   cycles for 100 * max cmov
22800   cycles for 100 * max jmp
25437   cycles for 100 * max cmov lods
23358   cycles for 100 * max jmp lods

14243   cycles for 100 * max cmov
23859   cycles for 100 * max jmp
25468   cycles for 100 * max cmov lods
23253   cycles for 100 * max jmp lods

14217   cycles for 100 * max cmov
23216   cycles for 100 * max jmp
25374   cycles for 100 * max cmov lods
23278   cycles for 100 * max jmp lods


EDIT: my mistake - i was thinking of the CMPXCHG instruction not being supported   :P

Gunther

Jochen,


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)

4215    cycles for 100 * max cmov
4650    cycles for 100 * max jmp
4226    cycles for 100 * max cmov lods
5537    cycles for 100 * max jmp lods

4215    cycles for 100 * max cmov
4671    cycles for 100 * max jmp
4824    cycles for 100 * max cmov lods
5000    cycles for 100 * max jmp lods

4514    cycles for 100 * max cmov
4642    cycles for 100 * max jmp
4216    cycles for 100 * max cmov lods
5647    cycles for 100 * max jmp lods

24      bytes for max cmov
25      bytes for max jmp
23      bytes for max cmov lods
23      bytes for max jmp lods

12345678        = eax max cmov
12345678        = eax max jmp
12345678        = eax max cmov lods
12345678        = eax max jmp lods

--- ok ---


Gunther
You have to know the facts before you can distort them.

FORTRANS

Quote from: dedndave on March 01, 2014, 12:44:46 AM
i'm with Hutch on this one, but for slightly different reasoning
CMOV isn't supported on all pentiums
so, you have to test for it - then provide fall-back code or exit if it's not supported
all that's a pain in the ass for too little advantage, if any - lol

...

EDIT: my mistake - i was thinking of the CMPXCHG instruction not being supported   :P

Hi Dave,

   No.  You were correct.  The program bombs on my P-MMX.
CMOV was not supported until the Pentium Pro.  (IIRC, could be
the P-II.)  Given that, I probably wouldn't worry about it too much.
<g>

Cheers,

Steve N.
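For code that has to run on pre-P6 machines, the test Dave mentions could look roughly like this. It is a sketch, assuming the CPUID instruction itself is available (true for the Pentium and later; older CPUs would first need the EFLAGS ID-bit check, which is omitted here). The proc name HasCmov is hypothetical. The CMOV feature flag is bit 15 of EDX after CPUID function 1.

```asm
; Hypothetical sketch: returns eax = 1 if CMOVcc is supported, else 0.
; Assumes the CPUID instruction exists on the machine.
.586                        ; cpuid needs at least .586
.model flat, stdcall
.code

HasCmov proc uses ebx       ; cpuid clobbers ebx, so preserve it
    mov eax, 1              ; function 1: feature information
    cpuid
    xor eax, eax
    bt edx, 15              ; CF = CMOV feature flag (EDX bit 15)
    setc al                 ; eax = 1 if supported, 0 otherwise
    ret
HasCmov endp
end
```

A program would call this once at startup and then either pick a jump-based fallback or exit, as described above.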

Farabi

It's an ARM instruction, isn't it? I think ARM used it this way.
http://farabidatacenter.url.ph/MySoftware/
My 3D Game Engine Demo.

Contact me at Whatsapp: 6283818314165

Gunther

Hi Farabi,

Quote from: Farabi on March 02, 2014, 03:36:44 PM
It's an ARM instruction, isn't it? I think ARM used it this way.

No, CMOV is an x86 instruction; it was introduced with the P6 (Pentium Pro) for compiler optimization.

Gunther
You have to know the facts before you can distort them.

Ficko

I do use "CMOVE", but only through a macro. It is slightly more difficult to wrap my mind around than conditional jumps. :P


; *********************************************************
; CMOV SDWORD/DWORD,Arg1,"Operator",Arg2,opt Arg3,opt Arg4
; Syntax:
; CMOV DWORD,eax,"!=",ebx 'cmp eax, ebx:cmovne eax, ebx
; CMOV DWORD,eax,"<>",ebx,edx 'cmp eax, ebx:cmovne eax, edx
; CMOV DWORD,eax,"!=",ebx,edx,ecx 'cmp eax, ebx:cmovne edx, ecx
; *********************************************************
CMOV MACRO Sign:REQ,Arg1:REQ,Operator:REQ,Arg2:REQ,Arg3,Arg4
    LOCAL L_Operator,m1,m2
    cmp Arg1, Arg2
    L_Operator TEXTEQU @CatStr(Operator)
    ;; pick destination and source from the optional arguments
    IF @SizeStr(Arg4)
        m1 TEXTEQU <Arg3>
        m2 TEXTEQU <Arg4>
    ELSEIF @SizeStr(Arg3)
        m1 TEXTEQU <Arg1>
        m2 TEXTEQU <Arg3>
    ELSE
        m1 TEXTEQU <Arg1>
        m2 TEXTEQU <Arg2>
    ENDIF
    IF Sign EQ dword
        ;; unsigned conditions
        IFIDN L_Operator,<"!>">
            cmova m1, m2
        ELSEIFIDN L_Operator,<"!<">
            cmovb m1, m2
        ELSEIFIDN L_Operator,<"=">
            cmove m1, m2
        ELSEIFIDN L_Operator,<"==">
            cmove m1, m2
        ELSEIFIDN L_Operator,<"!<!>">
            cmovne m1, m2
        ELSEIFIDN L_Operator,<"!!=">
            cmovne m1, m2
        ELSEIFIDN L_Operator,<"!>=">
            cmovae m1, m2
        ELSEIFIDN L_Operator,<"=!>">
            cmovae m1, m2
        ELSEIFIDN L_Operator,<"!<=">
            cmovbe m1, m2
        ELSEIFIDN L_Operator,<"=!<">
            cmovbe m1, m2
        ELSE
            echo The Operator operator is not valid
        ENDIF
    ELSEIF Sign EQ sdword
        ;; signed conditions
        IFIDN L_Operator,<"!>">
            cmovg m1, m2
        ELSEIFIDN L_Operator,<"!<">
            cmovl m1, m2
        ELSEIFIDN L_Operator,<"=">
            cmove m1, m2
        ELSEIFIDN L_Operator,<"==">
            cmove m1, m2
        ELSEIFIDN L_Operator,<"!<!>">
            cmovne m1, m2
        ELSEIFIDN L_Operator,<"!!=">
            cmovne m1, m2
        ELSEIFIDN L_Operator,<"!>=">
            cmovge m1, m2
        ELSEIFIDN L_Operator,<"=!>">
            cmovge m1, m2
        ELSEIFIDN L_Operator,<"!<=">
            cmovle m1, m2
        ELSEIFIDN L_Operator,<"=!<">
            cmovle m1, m2
        ELSE
            echo The Operator operator is not valid
        ENDIF
    ELSE
        echo The first parameter has to be "DWORD" or "SDWORD" !
    ENDIF
ENDM

alloy

https://dmytrish.net/lib/asm-x86/Assembly_Language_Succinctly.pdf