LEA is a handy instruction

Greenhorn · October 18, 2012, 09:28:55 AM

FX-8150 @3.90GHz ...

AMD FX(tm)-8150 Eight-Core Processor (SSE4)
loop overhead is approx. 201/100 cycles

91 cycles for 100 * mov/add
0 cycles for 100 * lea eax, [2*esi+6]
2 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop

91 cycles for 100 * mov/add
0 cycles for 100 * lea eax, [2*esi+6]
1 cycles for 100 * lea eax, [esi+esi+6]
3 cycles for 100 * empty loop

92 cycles for 100 * mov/add
0 cycles for 100 * lea eax, [2*esi+6]
0 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop

7 bytes for mov/add
7 bytes for lea eax, [2*esi+6]
4 bytes for lea eax, [esi+esi+6]
0 bytes for empty loop

--- ok ---

EDIT:
My AV complains about LeaVariants.exe, so the results may be wrong ...

Greenhorn

jj2007 · October 18, 2012, 09:45:31 AM

Well, LeaVariants.exe runs, so the AV didn't stop it. But the results are astonishing: It seems that the loop (200/100=2 cycles) doesn't slow down at all for a little lea on your CPU...
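
To put the figures above on a per-iteration footing, here is a small Python sketch. It assumes (as the ShowCycles macro further down the thread suggests) that each reported figure is the raw cycle count minus the calibrated loop overhead. The names ALGO_LOOPS, OVERHEAD_CYCLES and per_iteration are made up for this illustration; the values are copied from the first pass of Greenhorn's FX-8150 run.

```python
# Values copied from the first pass of the FX-8150 output above.
ALGO_LOOPS = 100        # iterations per measurement
OVERHEAD_CYCLES = 201   # "loop overhead is approx. 201/100 cycles"

# Net cycles reported per 100 iterations (overhead already subtracted).
net_cycles = {
    "mov/add": 91,
    "lea eax, [2*esi+6]": 0,
    "lea eax, [esi+esi+6]": 2,
}

def per_iteration(total_cycles, loops=ALGO_LOOPS):
    """Convert a cycle count for `loops` iterations into cycles per iteration."""
    return total_cycles / loops

print(f"empty loop: {per_iteration(OVERHEAD_CYCLES):.2f} cycles/iteration")
for name, net in net_cycles.items():
    raw = OVERHEAD_CYCLES + net  # assumed raw count before the subtraction
    print(f"{name:22s} adds {per_iteration(net):.2f}, "
          f"loop total {per_iteration(raw):.2f} cycles/iteration")
```

Read this way, the loop costs about 2 cycles per iteration with or without the lea, while the mov/add variant adds a little under one cycle.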

hutch-- · October 18, 2012, 11:31:18 AM

Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
loop overhead is approx. 124/100 cycles

140 cycles for 100 * mov/add
86 cycles for 100 * lea eax, [2*esi+6]
86 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop

140 cycles for 100 * mov/add
86 cycles for 100 * lea eax, [2*esi+6]
90 cycles for 100 * lea eax, [esi+esi+6]
2 cycles for 100 * empty loop

140 cycles for 100 * mov/add
88 cycles for 100 * lea eax, [2*esi+6]
87 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop

7 bytes for mov/add
7 bytes for lea eax, [2*esi+6]
4 bytes for lea eax, [esi+esi+6]
0 bytes for empty loop

--- ok ---
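
The byte counts above (7 vs. 4) are consistent with the standard IA-32 encodings of the two LEA forms: a scaled index without a base register has to carry a 32-bit displacement, while base+index can get by with an 8-bit one. Below is a minimal Python sketch that puts both encodings together by hand. The helper names are hypothetical, and it assumes the assembler encodes each form as written, using the shortest displacement available.

```python
# Register numbers as used in the ModRM and SIB bytes.
EAX, ESI = 0, 6
RM_SIB_FOLLOWS = 4   # ModRM r/m = 100b: a SIB byte follows
SIB_NO_BASE = 5      # SIB base = 101b with mod = 00b: no base register, disp32 follows
LEA_OPCODE = 0x8D

def modrm(mod, reg, rm):
    """Pack the ModRM byte: mod (2 bits), reg (3 bits), r/m (3 bits)."""
    return (mod << 6) | (reg << 3) | rm

def sib(scale_bits, index, base):
    """Pack the SIB byte: scale (2 bits, 1 means x2), index (3 bits), base (3 bits)."""
    return (scale_bits << 6) | (index << 3) | base

# lea eax, [2*esi+6]: index only, so the displacement must be 32 bits wide.
lea_scaled = bytes([LEA_OPCODE,
                    modrm(0b00, EAX, RM_SIB_FOLLOWS),
                    sib(1, ESI, SIB_NO_BASE)]) + (6).to_bytes(4, "little")

# lea eax, [esi+esi+6]: base + index, mod = 01b selects an 8-bit displacement.
lea_base_index = bytes([LEA_OPCODE,
                        modrm(0b01, EAX, RM_SIB_FOLLOWS),
                        sib(0, ESI, ESI),
                        6])

print(lea_scaled.hex(" "), "->", len(lea_scaled), "bytes")
print(lea_base_index.hex(" "), "->", len(lea_base_index), "bytes")
```

Both instructions compute the same value (2*esi+6); only the encoding length differs.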

sinsi · October 18, 2012, 11:58:20 AM

AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 215/100 cycles

?? cycles for 100 * mov/add
?? cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop

?? cycles for 100 * mov/add
?? cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop

?? cycles for 100 * mov/add
?? cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop

7 bytes for mov/add
7 bytes for lea eax, [2*esi+6]
4 bytes for lea eax, [esi+esi+6]
0 bytes for empty loop

Must be quantum or something

jj2007 · October 18, 2012, 04:27:31 PM

Quote from: sinsi on October 18, 2012, 11:58:20 AM
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 215/100 cycles

?? cycles for 100 * mov/add
?? cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop

Must be quantum or something

More a problem with the loop overhead being higher after the calibration... try this:

ShowCycles MACRO algo
LOCAL tmp$
tmp$ CATSTR <"cycles for >, %AlgoLoops, < * ">
sub eax, overheadCycles   ; subtract the calibrated loop overhead
.if 0 ; Sign?             ; sign check disabled, so negative results are printed instead of "??"
% print "??", 9, tmp$, AlgoName$(algo), 13, 10
.else
% print str$(eax), 9, tmp$, AlgoName$(algo), 13, 10
.endif
ENDM

sinsi · October 18, 2012, 05:00:41 PM

AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 222/100 cycles

-23 cycles for 100 * mov/add
-23 cycles for 100 * lea eax, [2*esi+6]
0 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop

-23 cycles for 100 * mov/add
-25 cycles for 100 * lea eax, [2*esi+6]
3 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop

-24 cycles for 100 * mov/add
-23 cycles for 100 * lea eax, [2*esi+6]
0 cycles for 100 * lea eax, [esi+esi+6]
3 cycles for 100 * empty loop
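
The negative values above follow directly from the `sub eax, overheadCycles` step in ShowCycles: whenever the timed loop comes in below the calibrated overhead, the difference goes negative. Here is a small Python sketch of that arithmetic. The function names are invented for this illustration, and the raw count of 199 is simply back-calculated from the -23 shown above with an overhead of 222.

```python
def shown_cycles(raw_cycles, overhead_cycles):
    """Mirror of `sub eax, overheadCycles`: net cycles after removing loop overhead."""
    return raw_cycles - overhead_cycles

def format_result(raw_cycles, overhead_cycles, hide_negative):
    """Model of the two ShowCycles variants.

    hide_negative=True  corresponds to `.if Sign?` (negative results print as "??")
    hide_negative=False corresponds to `.if 0`     (the signed value is printed)
    """
    net = shown_cycles(raw_cycles, overhead_cycles)
    if hide_negative and net < 0:
        return "??"
    return str(net)

OVERHEAD = 222           # "loop overhead is approx. 222/100 cycles"
raw = -23 + OVERHEAD     # raw count implied by the -23 shown for mov/add

print(raw)                                               # 199
print(format_result(raw, OVERHEAD, hide_negative=True))   # ??
print(format_result(raw, OVERHEAD, hide_negative=False))  # -23
```

In other words, a result of -23 means the 100 iterations with the payload took about 199 cycles, fewer than the 222 measured for the calibration loop.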

jj2007 · October 18, 2012, 06:59:46 PM

Very odd, Sinsi. As if the mov/add sequences speed up the loop :(
Here is the bit used to calibrate the loop, i.e. to measure the loop overhead (usually 150-250 cycles/100):
TestOH:
mov ebx, AlgoLoops-1   ; loop e.g. 100x
align 4
.Repeat
   lea eax, [ebx+Eax2EbxOffset]
;    --- no ops here ---
   dec ebx
.Until Sign?
ret
CaliEnd:
ENDM

One more result:

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles

77 cycles for 100 * mov/add
72 cycles for 100 * lea eax, [2*esi+6]
99 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop

75 cycles for 100 * mov/add
72 cycles for 100 * lea eax, [2*esi+6]
96 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop

75 cycles for 100 * mov/add
74 cycles for 100 * lea eax, [2*esi+6]
98 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop

dedndave · October 19, 2012, 04:21:47 AM

this brings up an interesting point...

we have developed a set of "biases" about using certain instructions in certain ways
for example, STOSB without REP is slow, the LOOPxx and JECXZ instructions are slow, and so on

with the newer cores, these may have been improved upon and we may need to re-visit the "black list" :P

x64Core · October 19, 2012, 01:01:17 PM

hey guys, a question: how can you test the number of cycles? some tool I guess :P

dedndave · October 19, 2012, 03:54:36 PM

most of us use MichaelW's code timing macros, or some derivative
the macros are available here...

http://masm32.com/board/index.php?topic=49.0

attached is an example of how to use them...

EDIT: attachment removed - see below

dedndave · October 19, 2012, 05:30:49 PM

sorry RHL...
that was an experiment

change the following lines...

LOOP_COUNT = 10000000

mov ecx,10

or see the attachment...

x64Core · October 19, 2012, 06:19:15 PM

thank you Dave, but I don't think I know how it works :P
I tried to do this:

;code to be timed goes here

; this code is trash, testing only
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0
mov eax,0

rdtsc

but it shows nothing, and I thought it would show something like your results lol :P

dedndave · October 19, 2012, 06:29:42 PM

well - that RDTSC instruction is my "code to be timed" for the example
you can remove that

see the attached...

be sure you place a copy of Michael's timers.asm in the \masm32\macros folder

x64Core · October 19, 2012, 06:35:13 PM

Quote from: dedndave on October 19, 2012, 06:29:42 PM
well - that RDTSC instruction is my "code to be timed" for the example
you can remove that

see the attached...

be sure you place a copy of Michael's timers.asm in the \masm32\macros folder

Thank you very much brother :)

raleep · October 23, 2012, 04:52:24 AM

Quote from: dedndave on October 19, 2012, 04:21:47 AM
we have developed a set of "biases" about using certain instructions in certain ways
for example, STOSB without REP is slow, the LOOPxx and JECXZ instructions are slow, and so on

Is there a summary of these timing results somewhere? Could you post a reference please?

Thanks, ral
