The MASM Forum

General => The Campus => Topic started by: jj2007 on October 17, 2012, 10:54:56 AM

Title: LEA is a handy instruction
Post by: jj2007 on October 17, 2012, 10:54:56 AM
Suppose you need eax=2*esi+6:
   mov eax, esi  ; option 1
   add eax, eax
   add eax, 6
   nop
   lea eax, [2*esi+6]  ; option 2
   nop
   lea eax, [esi+esi+6]  ; option 3
   nop

The three options produce exactly the same result in eax, but Olly (http://www.ollydbg.de/version2.html) shows that the byte counts are quite different:

0040206E         8BC6              mov eax, esi
00402070         03C0              add eax, eax
00402072         83C0 06           add eax, 6
00402075      .  90                nop
00402076      ?  8D0475 06000000   lea eax, [esi*2+6]
0040207D      .  90                nop
0040207E      ?  8D4436 06         lea eax, [esi+esi+6]
:biggrin:
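For anyone wondering where the three extra bytes come from, here is a sketch of how the two LEA encodings in the Olly listing break down. The field-by-field comments follow the standard ModRM/SIB layout and are an added annotation, not part of the original listing; the skeleton around the two instructions is only there so the fragment assembles with ml /c.

```asm
; ModRM = mod(2 bits) reg(3) rm(3)      SIB = scale(2) index(3) base(3)
;
; lea eax, [esi*2+6]    ->  8D 04 75 06 00 00 00      (7 bytes)
;   8D           opcode: LEA r32, m
;   04           ModRM:  mod=00, reg=000 (eax), rm=100 (SIB byte follows)
;   75           SIB:    scale=01 (*2), index=110 (esi), base=101 (= no base when mod=00)
;   06 00 00 00  disp32: a SIB byte without a base register always carries a full disp32
;
; lea eax, [esi+esi+6]  ->  8D 44 36 06               (4 bytes)
;   8D           opcode: LEA r32, m
;   44           ModRM:  mod=01 (disp8 follows), reg=000 (eax), rm=100 (SIB byte follows)
;   36           SIB:    scale=00 (*1), index=110 (esi), base=110 (esi)
;   06           disp8

.386
.model flat, stdcall
.code
start:
    lea eax, [2*esi+6]      ; index only, no base  -> 7 bytes
    lea eax, [esi+esi+6]    ; base + index         -> 4 bytes
    ret
end start
```

So the saving comes entirely from having a base register, which allows the one-byte displacement. Unlike option 1, neither LEA form modifies the flags.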
Title: Re: LEA is a handy instruction
Post by: dedndave on October 17, 2012, 02:24:35 PM
cool little trick, Jochen   :t
Title: Re: LEA is a handy instruction
Post by: sinsi on October 17, 2012, 03:27:38 PM
Same code in FASM via IDA

89 F0                                   mov     eax, esi
01 C0                                   add     eax, eax
83 C0 06                                add     eax, 6
90                                      nop
8D 44 36 06                             lea     eax, [esi+esi+6]
90                                      nop
8D 44 36 06                             lea     eax, [esi+esi+6]
Title: Re: LEA is a handy instruction
Post by: jj2007 on October 17, 2012, 04:15:18 PM
cool little trick, FASM :t
Title: Re: LEA is a handy instruction
Post by: hutch-- on October 17, 2012, 04:23:52 PM
Now the fun part is that you need to clock it on different hardware. LEA was very good on the 486 up to the PIII but a poor performer on the PIV. Core2 and the i3/5/7 series seem to be OK with LEA. FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.
Title: Re: LEA is a handy instruction
Post by: sinsi on October 17, 2012, 05:33:52 PM
>FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.
Same opcode, different encoding - not an optimisation trick. Like these two:
8BC6    mov eax, esi
89F0    mov eax, esi


But yeah, I see what you mean.
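The two byte pairs can be decoded as follows. This is an added sketch based on the standard opcode map: 8Bh and 89h are the two directions of the register/memory MOV, and the ModRM byte simply swaps the roles of its reg and rm fields. The db line emits the second form by hand, since MASM chose 8B C6 in the listing above.

```asm
; 8B C6   opcode 8B = MOV r32, r/m32    ModRM C6: mod=11, reg=000 (eax), rm=110 (esi)
; 89 F0   opcode 89 = MOV r/m32, r32    ModRM F0: mod=11, reg=110 (esi), rm=000 (eax)

.386
.model flat, stdcall
.code
start:
    mov eax, esi            ; assembled by MASM as 8B C6
    db 89h, 0F0h            ; hand-emitted alternative, also executes as mov eax, esi
    ret
end start
```

Both forms are the same length and do the same thing, so which one an assembler emits for a register-to-register move is purely its own choice.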
Title: Re: LEA is a handy instruction
Post by: japheth on October 17, 2012, 07:34:30 PM
It's probably not just naughty, but might generate wrong code - if this "trick" is also applied to the ebp register. Because:

[ebp*2] and [ebp+ebp] are NOT synonymous - only the latter expression uses SS as the default segment register, while the first will use DS.

It's ok to do so in flat memory model, where SS and DS are assumed FLAT, but AFAIK FASM doesn't know "models" and does not "assume".
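A short sketch of the rule: the default segment is decided by the base register, never by the index. The fragment is meant to be assembled (ml /c) and inspected in a disassembler, not run, and the explicit ds: override on the last line is an added illustration rather than something from the posts.

```asm
.386
.model flat, stdcall
.code
demo:
    mov eax, [ebp*2]        ; ebp is only a scaled index, no base -> default segment DS
    mov eax, [ebp+ebp]      ; ebp is the base register            -> default segment SS
    mov eax, ds:[ebp+ebp]   ; explicit DS override (prefix byte 3Eh) requests DS again
    ret
end demo
```

In a flat model the first two lines read the same memory, which is why the substitution goes unnoticed there. With separate data and stack segments they do not.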
Title: Re: LEA is a handy instruction
Post by: jj2007 on October 17, 2012, 08:03:40 PM
Quote from: sinsi on October 17, 2012, 05:33:52 PM
>FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.
Same opcode, different encoding - not an optimisation trick. Like these two:
8BC6    mov eax, esi
89F0    mov eax, esi


But yeah, I see what you mean.

xchg eax, edx   ; 92h
xchg edx, eax   ; 92h
xchg edx, eax   ; 87h, 0D0h

I guess the FASM crew has discussed the risks :lol:
Title: Re: LEA is a handy instruction
Post by: sinsi on October 17, 2012, 08:42:26 PM
Quote from: japheth on October 17, 2012, 07:34:30 PM
It's probably not just naughty, but might generate wrong code - if this "trick" is also applied to the ebp register. Because:

[ebp*2] and [ebp+ebp] are NOT synonymous - only the latter expression uses SS as the default segment register, while the first will use DS.

It's ok to do so in flat memory model, where SS and DS are assumed FLAT, but AFAIK FASM doesn't know "models" and does not "assume".
OK, in the case where you want to calculate an actual address, in fasm it doesn't matter because it's your code
- you know it's an address (whether DS or SS) in EAX
- you would use [SS:EAX] to access it
Otherwise using it as simple integer maths shouldn't matter, should it?
Title: Re: LEA is a handy instruction
Post by: japheth on October 17, 2012, 09:15:19 PM
Quote from: sinsi on October 17, 2012, 08:42:26 PM
OK, in the case where you want to calculate an actual address, in fasm it doesn't matter because it's your code
- you know it's an address (whether DS or SS) in EAX
- you would use [SS:EAX] to access it
Otherwise using it as simple integer maths shouldn't matter, should it?

I don't understand what you want to say. A brief test confirms that fasm generates wrong code:


    format mz

    entry text:start
    stack 400h

segment mydata

table dw 'A','B','C','D'

dstr db "value="
dbyt db " ",13,10,'$'

segment text

start:
    mov ax,mydata
    mov ds,ax
    xor ebp,ebp
    mov ax,[table+ebp*2]
    mov [dbyt],al
    mov dx,dstr
    mov ah,9
    int 21h
    mov ah,4ch
    int 21h


It does not display "value=A". If you replace ebp with ebx, then it does.

Here's the Masm equivalent


    .286
    .model small

    .stack 400h

    .data

table label word
    dw 'A','B','C','D'

dstr db "value="
dbyt db " ",13,10,'$'

    .code

    .386

start:
    mov ax,@data
    mov ds,ax
    xor ebp,ebp
    mov ax,[table+ebp*2]
    mov dbyt,al
    mov dx,offset dstr
    mov ah,9
    int 21h
    mov ah,4ch
    int 21h

    end start


which works as expected.
Title: Re: LEA is a handy instruction
Post by: sinsi on October 17, 2012, 09:46:11 PM
I see, I was thinking more like 'lea eax,[ebp...]' instead of 'mov eax,[ebp...]'.
Might want to let Tomasz know then.
Title: Re: LEA is a handy instruction
Post by: dedndave on October 18, 2012, 06:32:09 AM
i was thinking that LEA was ok on a P4, unless you use a multiplier
        lea     eax,[2*edi]

if i am right, Jochen's little trick might be a good replacement
as it not only saves a few bytes, but also offers some speed improvement
as Hutch says, we'll have to try it out   :P
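For the displacement-free case the size difference is even larger. The byte counts in the comments are derived by hand from the ModRM/SIB rules (a scaled index without a base forces a 32-bit displacement, here all zeros), so treat them as a sketch to check in a debugger rather than a measured listing.

```asm
.386
.model flat, stdcall
.code
start:
    lea eax, [2*edi]        ; 8D 04 7D 00 00 00 00   7 bytes: no base, zero disp32 encoded
    lea eax, [edi+edi]      ; 8D 04 3F               3 bytes: base=edi, index=edi, no disp
    ret
end start
```

Whether the shorter form is also faster on a P4 is exactly what needs timing.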
Title: Re: LEA is a handy instruction
Post by: jj2007 on October 18, 2012, 08:20:46 AM
Here is the speed test - no big surprises, but note these are cycle counts per 100 loops...

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
148     cycles for 100 * mov/add
32      cycles for 100 * lea eax, [2*esi+6]
33      cycles for 100 * lea eax, [esi+esi+6]
0       cycles for 100 * empty loop

148     cycles for 100 * mov/add
32      cycles for 100 * lea eax, [2*esi+6]
34      cycles for 100 * lea eax, [esi+esi+6]
2       cycles for 100 * empty loop

150     cycles for 100 * mov/add
32      cycles for 100 * lea eax, [2*esi+6]
33      cycles for 100 * lea eax, [esi+esi+6]
0       cycles for 100 * empty loop

7       bytes for mov/add
7       bytes for lea eax, [2*esi+6]
4       bytes for lea eax, [esi+esi+6]
0       bytes for empty loop
Title: Re: LEA is a handy instruction
Post by: MichaelW on October 18, 2012, 08:47:02 AM
P3:

pre-P4 (SSE1)
loop overhead is approx. 209/100 cycles

137     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
100     cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

130     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
100     cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

130     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
100     cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop


P4 Northwood:

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
+19 of 20 tests valid, loop overhead is approx. 263/100 cycles

62      cycles for 100 * mov/add
56      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

55      cycles for 100 * mov/add
66      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

57      cycles for 100 * mov/add
57      cycles for 100 * lea eax, [2*esi+6]
41      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

Title: Re: LEA is a handy instruction
Post by: dedndave on October 18, 2012, 09:09:38 AM
P4 prescott w/ht
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 239/100 cycles

155     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
52      cycles for 100 * lea eax, [esi+esi+6]
0       cycles for 100 * empty loop

128     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
52      cycles for 100 * lea eax, [esi+esi+6]
0       cycles for 100 * empty loop

191     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
55      cycles for 100 * lea eax, [esi+esi+6]
191     cycles for 100 * empty loop


:t
Title: Re: LEA is a handy instruction
Post by: Greenhorn on October 18, 2012, 09:28:55 AM
FX-8150 @3.90GHz ...

AMD FX(tm)-8150 Eight-Core Processor            (SSE4)
loop overhead is approx. 201/100 cycles

91      cycles for 100 * mov/add
0       cycles for 100 * lea eax, [2*esi+6]
2       cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

91      cycles for 100 * mov/add
0       cycles for 100 * lea eax, [2*esi+6]
1       cycles for 100 * lea eax, [esi+esi+6]
3       cycles for 100 * empty loop

92      cycles for 100 * mov/add
0       cycles for 100 * lea eax, [2*esi+6]
0       cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

7       bytes for mov/add
7       bytes for lea eax, [2*esi+6]
4       bytes for lea eax, [esi+esi+6]
0       bytes for empty loop


--- ok ---


EDIT:
My AV complains about LeaVariants.exe, so the results may be wrong ...


Greenhorn
Title: Re: LEA is a handy instruction
Post by: jj2007 on October 18, 2012, 09:45:31 AM
Well, LeaVariants.exe runs, so the AV didn't stop it. But the results are astonishing: It seems that the loop (200/100=2 cycles) doesn't slow down at all for a little lea on your CPU...
Title: Re: LEA is a handy instruction
Post by: hutch-- on October 18, 2012, 11:31:18 AM

Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
loop overhead is approx. 124/100 cycles

140     cycles for 100 * mov/add
86      cycles for 100 * lea eax, [2*esi+6]
86      cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

140     cycles for 100 * mov/add
86      cycles for 100 * lea eax, [2*esi+6]
90      cycles for 100 * lea eax, [esi+esi+6]
2       cycles for 100 * empty loop

140     cycles for 100 * mov/add
88      cycles for 100 * lea eax, [2*esi+6]
87      cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

7       bytes for mov/add
7       bytes for lea eax, [2*esi+6]
4       bytes for lea eax, [esi+esi+6]
0       bytes for empty loop


--- ok ---
Title: Re: LEA is a handy instruction
Post by: sinsi on October 18, 2012, 11:58:20 AM
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 215/100 cycles

??      cycles for 100 * mov/add
??      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

??      cycles for 100 * mov/add
??      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

??      cycles for 100 * mov/add
??      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

7       bytes for mov/add
7       bytes for lea eax, [2*esi+6]
4       bytes for lea eax, [esi+esi+6]
0       bytes for empty loop


Must be quantum or something  :badgrin:
Title: Re: LEA is a handy instruction
Post by: jj2007 on October 18, 2012, 04:27:31 PM
Quote from: sinsi on October 18, 2012, 11:58:20 AM
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 215/100 cycles

??      cycles for 100 * mov/add
??      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop


Must be quantum or something  :badgrin:

More a problem with the loop overhead being higher after the calibration... try this:

ShowCycles MACRO algo
LOCAL tmp$
  tmp$ CATSTR <"cycles for >, %AlgoLoops, < * ">
  sub eax, overheadCycles
  .if 0 ; Sign?
   % print "??", 9, tmp$, AlgoName$(algo), 13, 10
  .else
   % print str$(eax), 9, tmp$, AlgoName$(algo), 13, 10
  .endif
ENDM
Title: Re: LEA is a handy instruction
Post by: sinsi on October 18, 2012, 05:00:41 PM

AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 222/100 cycles

-23     cycles for 100 * mov/add
-23     cycles for 100 * lea eax, [2*esi+6]
0       cycles for 100 * lea eax, [esi+esi+6]
0       cycles for 100 * empty loop

-23     cycles for 100 * mov/add
-25     cycles for 100 * lea eax, [2*esi+6]
3       cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

-24     cycles for 100 * mov/add
-23     cycles for 100 * lea eax, [2*esi+6]
0       cycles for 100 * lea eax, [esi+esi+6]
3       cycles for 100 * empty loop
Title: Re: LEA is a handy instruction
Post by: jj2007 on October 18, 2012, 06:59:46 PM
Very odd, Sinsi. As if the mov/add sequences speed up the loop :(
Here is the bit used to calibrate the loop, i.e. getting the loop overhead (usually 150-250 cycles/100):
TestOH:
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  align 4
  .Repeat
   lea eax, [ebx+Eax2EbxOffset]
;    --- no ops here ---
   dec ebx
  .Until Sign?
  ret
CaliEnd:
ENDM


One more result:

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles

77      cycles for 100 * mov/add
72      cycles for 100 * lea eax, [2*esi+6]
99      cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

75      cycles for 100 * mov/add
72      cycles for 100 * lea eax, [2*esi+6]
96      cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

75      cycles for 100 * mov/add
74      cycles for 100 * lea eax, [2*esi+6]
98      cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop
Title: Re: LEA is a handy instruction
Post by: dedndave on October 19, 2012, 04:21:47 AM
this brings up an interesting point...

we have developed a set of "biases" about using certain instructions in certain ways
for example, STOSB without REP is slow, the LOOPxx and JECXZ instructions are slow, and so on

with the newer cores, these may have been improved upon and we may need to re-visit the "black list"   :P
Title: Re: LEA is a handy instruction
Post by: x64Core on October 19, 2012, 01:01:17 PM
hey guys, a question: how can you measure the number of cycles? some tool I guess :P
Title: Re: LEA is a handy instruction
Post by: dedndave on October 19, 2012, 03:54:36 PM
most of us use MichaelW's code timing macros, or some derivative
the macros are available here...

http://masm32.com/board/index.php?topic=49.0

attached is an example of how to use them...


EDIT: attachment removed - see below
Title: Re: LEA is a handy instruction
Post by: dedndave on October 19, 2012, 05:30:49 PM
sorry RHL...
that was an experiment

change the following lines...
LOOP_COUNT = 10000000
        mov     ecx,10

or see the attachment...
Title: Re: LEA is a handy instruction
Post by: x64Core on October 19, 2012, 06:19:15 PM
thank you Dave, but I don't think I understand how it works :P
I tried to do this:


;code to be timed goes here

; this code is trash, testing only
REPEAT 120
    mov  eax,0
ENDM

        rdtsc

but it shows nothing, and I thought it would show something like your results lol :P
Title: Re: LEA is a handy instruction
Post by: dedndave on October 19, 2012, 06:29:42 PM
well - that RDTSC instruction is my "code to be timed" for the example
you can remove that

see the attached...

be sure you place a copy of Michael's timers.asm in the \masm32\macros folder
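Since the attachments are not part of this thread, here is a minimal sketch of what a test program built on the timing macros usually looks like. It assumes MASM32 is installed in \masm32, that timers.asm sits in \masm32\macros as described above, and that it provides the counter_begin / counter_end pair with the cycle count returned in eax. The loop count of 1000 and the timed instruction are arbitrary choices for illustration.

```asm
    include \masm32\include\masm32rt.inc
    .686                                ; the timing macros use rdtsc and cpuid
    include \masm32\macros\timers.asm   ; assumed location, see above

    .code
start:
    invoke Sleep, 3000                  ; give the system a moment to settle

    counter_begin 1000, HIGH_PRIORITY_CLASS
        ; code to be timed goes here
        lea eax, [esi+esi+6]
    counter_end                         ; cycle count for one pass, overhead removed, in eax

    print str$(eax), " cycles", 13, 10

    inkey "Press any key to exit..."
    exit
end start
```

The printing step is the part missing from the attempt above: the macros only leave the count in eax, and nothing appears on screen until it is converted and printed.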
Title: Re: LEA is a handy instruction
Post by: x64Core on October 19, 2012, 06:35:13 PM
Quote from: dedndave on October 19, 2012, 06:29:42 PM
well - that RDTSC instruction is my "code to be timed" for the example
you can remove that

see the attached...

be sure you place a copy of Michael's timers.asm in the \masm32\macros folder
Thank you very much brother  :icon_mrgreen:
Title: Re: LEA is a handy instruction
Post by: raleep on October 23, 2012, 04:52:24 AM
Quote from: dedndave on October 19, 2012, 04:21:47 AM
we have developed a set of "biases" about using certain instructions in certain ways
for example, STOSB without REP is slow, the LOOPxx and JECXZ instructions are slow, and so on

Is there a summary of these timing results somewhere?  Could you post a reference please?

Thanks, ral
Title: Re: LEA is a handy instruction
Post by: dedndave on October 23, 2012, 05:15:16 AM
well - they are all over the old forum laboratory
http://www.masmforum.com/board/index.php?board=4.0