LEA is a handy instruction

Started by jj2007, October 17, 2012, 10:54:56 AM


Greenhorn

FX-8150 @3.90GHz ...

AMD FX(tm)-8150 Eight-Core Processor            (SSE4)
loop overhead is approx. 201/100 cycles

91      cycles for 100 * mov/add
0       cycles for 100 * lea eax, [2*esi+6]
2       cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

91      cycles for 100 * mov/add
0       cycles for 100 * lea eax, [2*esi+6]
1       cycles for 100 * lea eax, [esi+esi+6]
3       cycles for 100 * empty loop

92      cycles for 100 * mov/add
0       cycles for 100 * lea eax, [2*esi+6]
0       cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

7       bytes for mov/add
7       bytes for lea eax, [2*esi+6]
4       bytes for lea eax, [esi+esi+6]
0       bytes for empty loop


--- ok ---


EDIT:
My antivirus complains about LeaVariants.exe, so the results may be wrong ...


Greenhorn
Kole Feut un Nordenwind gift en krusen Büdel un en lütten Pint.

jj2007

Well, LeaVariants.exe runs, so the AV didn't stop it. But the results are astonishing: it seems that on your CPU the loop (200/100 = 2 cycles per iteration) doesn't slow down at all for a little lea...

hutch--


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
loop overhead is approx. 124/100 cycles

140     cycles for 100 * mov/add
86      cycles for 100 * lea eax, [2*esi+6]
86      cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

140     cycles for 100 * mov/add
86      cycles for 100 * lea eax, [2*esi+6]
90      cycles for 100 * lea eax, [esi+esi+6]
2       cycles for 100 * empty loop

140     cycles for 100 * mov/add
88      cycles for 100 * lea eax, [2*esi+6]
87      cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

7       bytes for mov/add
7       bytes for lea eax, [2*esi+6]
4       bytes for lea eax, [esi+esi+6]
0       bytes for empty loop


--- ok ---

sinsi

AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 215/100 cycles

??      cycles for 100 * mov/add
??      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

??      cycles for 100 * mov/add
??      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

??      cycles for 100 * mov/add
??      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

7       bytes for mov/add
7       bytes for lea eax, [2*esi+6]
4       bytes for lea eax, [esi+esi+6]
0       bytes for empty loop


Must be quantum or something  :badgrin:
🍺🍺🍺

jj2007

Quote from: sinsi on October 18, 2012, 11:58:20 AM
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 215/100 cycles

??      cycles for 100 * mov/add
??      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop


Must be quantum or something  :badgrin:

More a problem with the loop overhead being higher after the calibration... try this:

ShowCycles MACRO algo
LOCAL tmp$
  tmp$ CATSTR <"cycles for >, %AlgoLoops, < * ">
  sub eax, overheadCycles   ; subtract the calibrated loop overhead
  .if 0 ; Sign?             ; sign check disabled, so the "??" branch is never taken
   % print "??", 9, tmp$, AlgoName$(algo), 13, 10
  .else
   % print str$(eax), 9, tmp$, AlgoName$(algo), 13, 10
  .endif
ENDM
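
What the disabled sign check changes can be sketched outside MASM. The C fragment below is only an illustration under assumptions: `net_cycles` and `format_cycles` are hypothetical names, and the raw reading of 192 in the note underneath is invented to go with the 215-cycle overhead quoted above. It shows why the subtraction has to be treated as signed, and how the two display modes ("??" versus the plain number) differ.

```c
#include <assert.h>
#include <stdio.h>

/* Net cost of the tested code: raw loop time minus the calibrated overhead.
   Signed on purpose, so a calibration that came out too high shows up as a
   negative value instead of wrapping around. */
long net_cycles(long raw_cycles, long overhead_cycles)
{
    return raw_cycles - overhead_cycles;
}

/* Writes either the number or "??" into buf and returns the text length.
   hide_negative != 0 corresponds to testing the sign flag;
   hide_negative == 0 corresponds to the ".if 0" variant that always prints. */
int format_cycles(char *buf, size_t size, long net, int hide_negative)
{
    assert(buf != NULL && size > 0);
    if (hide_negative && net < 0)
        return snprintf(buf, size, "??");
    return snprintf(buf, size, "%ld", net);
}
```

With a raw reading of 192 and an overhead of 215, `net_cycles` gives -23; `format_cycles` turns that into "??" when `hide_negative` is set and into "-23" otherwise.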

sinsi


AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 222/100 cycles

-23     cycles for 100 * mov/add
-23     cycles for 100 * lea eax, [2*esi+6]
0       cycles for 100 * lea eax, [esi+esi+6]
0       cycles for 100 * empty loop

-23     cycles for 100 * mov/add
-25     cycles for 100 * lea eax, [2*esi+6]
3       cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

-24     cycles for 100 * mov/add
-23     cycles for 100 * lea eax, [2*esi+6]
0       cycles for 100 * lea eax, [esi+esi+6]
3       cycles for 100 * empty loop
🍺🍺🍺

jj2007

Very odd, Sinsi. As if the mov/add sequences speed up the loop :(
Here is the bit used to calibrate the loop, i.e. to get the loop overhead (usually 150-250 cycles/100):
TestOH:
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  align 4
  .Repeat
   lea eax, [ebx+Eax2EbxOffset]
;    --- no ops here ---
   dec ebx
  .Until Sign?
  ret
CaliEnd:
ENDM
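
How a set of empty-loop timings might be turned into one overhead figure can be sketched as follows. This is a generic C illustration, not the code behind LeaVariants.exe: taking the minimum of several runs is an assumption of the sketch, and `calibrate_overhead` and `overhead_per_iteration` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Overhead estimate: the smallest of several timings of the empty loop.
   Using the minimum is an assumption of this sketch; it discards runs
   that were disturbed by interrupts or cache misses. */
long calibrate_overhead(const long *empty_loop_cycles, size_t n)
{
    assert(empty_loop_cycles != NULL && n > 0);
    long best = empty_loop_cycles[0];
    for (size_t i = 1; i < n; i++)
        if (empty_loop_cycles[i] < best)
            best = empty_loop_cycles[i];
    return best;
}

/* Overhead per pass through the loop, e.g. 238 cycles / 100 loops = 2.38. */
double overhead_per_iteration(long overhead_cycles, int loops)
{
    assert(loops > 0);
    return (double)overhead_cycles / loops;
}
```

For invented samples of 245, 238 and 251 cycles, the estimate is 238 cycles per 100 loops, or about 2.4 cycles per iteration. If the real test loop later runs faster than this estimate, the subtracted result goes negative.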


One more result:

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles

77      cycles for 100 * mov/add
72      cycles for 100 * lea eax, [2*esi+6]
99      cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

75      cycles for 100 * mov/add
72      cycles for 100 * lea eax, [2*esi+6]
96      cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

75      cycles for 100 * mov/add
74      cycles for 100 * lea eax, [2*esi+6]
98      cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

dedndave

this brings up an interesting point...

we have developed a set of "biases" about using certain instructions in certain ways
for example, STOSB without REP is slow, the LOOPxx and JECXZ instructions are slow, and so on

with the newer cores, these may have been improved upon and we may need to re-visit the "black list"   :P

x64Core

hey guys, a question: how can you test the number of cycles? some tool I guess :P

dedndave

most of us use MichaelW's code timing macros, or some derivative
the macros are available here...

http://masm32.com/board/index.php?topic=49.0

attached is an example of how to use them...


EDIT: attachment removed - see below
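
For readers who only want the general idea, cycle timing boils down to reading a counter before and after the code and keeping the best of several runs. The C sketch below illustrates that idea only; it is not MichaelW's macro set and does not reproduce its interface. `ticks_now`, `best_elapsed`, `code_under_test` and `sink` are hypothetical names, and on non-x86 builds the fallback counts clock ticks, not CPU cycles.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
/* x86 with GCC/Clang: read the time-stamp counter (the RDTSC instruction). */
uint64_t ticks_now(void) { return __rdtsc(); }
#else
/* Fallback for other targets: coarse clock ticks, not CPU cycles. */
uint64_t ticks_now(void) { return (uint64_t)clock(); }
#endif

/* Run fn `runs` times and return the smallest elapsed tick count seen. */
uint64_t best_elapsed(void (*fn)(void), int runs)
{
    assert(fn != NULL && runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = ticks_now();
        fn();
        uint64_t elapsed = ticks_now() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Hypothetical code under test: 100 dependent updates of the form 2*x+6.
   The volatile sink keeps the compiler from discarding the result. */
volatile uint32_t sink;

void code_under_test(void)
{
    uint32_t x = sink;
    for (int i = 0; i < 100; i++)
        x = 2 * x + 6;
    sink = x;
}
```

Calling `best_elapsed(code_under_test, 10)` returns the lowest reading of ten runs. The figure still includes the cost of the call and the loop itself, which is why the programs in this thread calibrate an empty loop first and subtract it.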

dedndave

sorry RHL...
that was an experiment

change the following lines...
LOOP_COUNT = 10000000
        mov     ecx,10

or see the attachment...

x64Core

thank you Dave, but I don't think I know how it works :P
I tried to do this:


;code to be timed goes here

; this code is trash, testing only
REPEAT 120          ; the same instruction, 120 times in a row
mov  eax,0
ENDM

        rdtsc

but it shows nothing, and I thought it would show something like your results lol :P

dedndave

well - that RDTSC instruction is my "code to be timed" for the example
you can remove that

see the attached...

be sure you place a copy of Michael's timers.asm in the \masm32\macros folder

x64Core

Quote from: dedndave on October 19, 2012, 06:29:42 PM
well - that RDTSC instruction is my "code to be timed" for the example
you can remove that

see the attached...

be sure you place a copy of Michael's timers.asm in the \masm32\macros folder
Thank you very much brother  :icon_mrgreen:

raleep

Quote from: dedndave on October 19, 2012, 04:21:47 AM
we have developed a set of "bias" about using certain instructions in certain ways
for example, STOSB without REP is slow, the LOOPxx and JECXZ instructions are slow, and so on

Is there a summary of these timing results somewhere?  Could you post a reference please?

Thanks, ral