Suppose you need eax=2*esi+6:
mov eax, esi ; option 1
add eax, eax
add eax, 6
nop
lea eax, [2*esi+6] ; option 2
nop
lea eax, [esi+esi+6] ; option 3
nop
The three options produce exactly the same result, but Olly (http://www.ollydbg.de/version2.html) shows that the byte counts are quite different:
0040206E 8BC6 mov eax, esi
00402070 03C0 add eax, eax
00402072 83C0 06 add eax, 6
00402075 . 90 nop
00402076 ? 8D0475 06000000 lea eax, [esi*2+6]
0040207D . 90 nop
0040207E ? 8D4436 06 lea eax, [esi+esi+6]
:biggrin:
cool little trick, Jochen :t
Same code in FASM via IDA
89 F0 mov eax, esi
01 C0 add eax, eax
83 C0 06 add eax, 6
90 nop
8D 44 36 06 lea eax, [esi+esi+6]
90 nop
8D 44 36 06 lea eax, [esi+esi+6]
cool little trick, FASM :t
Now the fun part is that you need to clock it on different hardware: LEA was very good from the 486 up to the PIII, but a poor performer on a P4. Core2 and the i3/5/7 series seem to be OK with LEA. The FASM trick is cute but naughty; you should get the opcode you write, not an optimisation of it.
>FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.
Same opcode, different encoding - not an optimisation trick. Like these two:
8BC6 mov eax, esi
89F0 mov eax, esi
But yeah, I see what you mean.
It's probably not just naughty - it might generate wrong code if this "trick" is also applied to the ebp register. Because:
[ebp*2] and [ebp+ebp] are NOT synonymous - only the latter expression uses SS as the default segment register, while the first will use DS.
It's ok to do so in a flat memory model, where SS and DS are assumed FLAT, but AFAIK FASM doesn't know "models" and does not "assume".
Quote from: sinsi on October 17, 2012, 05:33:52 PM
>FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.
Same opcode, different encoding - not an optimisation trick. Like these two:
8BC6 mov eax, esi
89F0 mov eax, esi
But yeah, I see what you mean.
xchg eax, edx ; 92h
xchg edx, eax ; 92h
xchg edx, eax ; 87h, 0D0h
I guess the FASM crew has discussed the risks :lol:
Quote from: japheth on October 17, 2012, 07:34:30 PM
It's probably not just naughty - it might generate wrong code if this "trick" is also applied to the ebp register. Because:
[ebp*2] and [ebp+ebp] are NOT synonymous - only the latter expression uses SS as the default segment register, while the first will use DS.
It's ok to do so in a flat memory model, where SS and DS are assumed FLAT, but AFAIK FASM doesn't know "models" and does not "assume".
OK, in the case where you want to calculate an actual address, in fasm it doesn't matter because it's your code
- you know it's an address (whether DS or SS) in EAX
- you would use [SS:EAX] to access it
Otherwise using it as simple integer maths shouldn't matter, should it?
Quote from: sinsi on October 17, 2012, 08:42:26 PM
OK, in the case where you want to calculate an actual address, in fasm it doesn't matter because it's your code
- you know it's an address (whether DS or SS) in EAX
- you would use [SS:EAX] to access it
Otherwise using it as simple integer maths shouldn't matter, should it?
I don't understand what you want to say. A brief test confirms that fasm generates wrong code:
format mz
entry text:start
stack 400h
segment mydata
table dw 'A','B','C','D'
dstr db "value="
dbyt db " ",13,10,'$'
segment text
start:
mov ax,mydata
mov ds,ax
xor ebp,ebp
mov ax,[table+ebp*2]
mov [dbyt],al
mov dx,dstr
mov ah,9
int 21h
mov ah,4ch
int 21h
It does not display "value=A": fasm encodes [table+ebp*2] as [table+ebp+ebp], and with ebp as base the default segment is SS, not DS. If you replace ebp with ebx, it works.
Here's the Masm equivalent
.286
.model small
.stack 400h
.data
table label word
dw 'A','B','C','D'
dstr db "value="
dbyt db " ",13,10,'$'
.code
.386
start:
mov ax,@data
mov ds,ax
xor ebp,ebp
mov ax,[table+ebp*2]
mov dbyt,al
mov dx,offset dstr
mov ah,9
int 21h
mov ah,4ch
int 21h
end start
which works as expected.
I see, I was thinking more like 'lea eax,[ebp...]' instead of 'mov eax,[ebp...]'.
Might want to let Tomasz know then.
I was thinking that LEA was OK on a P4, unless you use a multiplier:
lea eax,[2*edi]
If I am right, Jochen's little trick might be a good replacement,
as it not only saves a few bytes, but also offers some speed improvement.
As Hutch says, we'll have to try it out :P
Here is the speed test - no big surprises, but note these are cycle counts per 100 loops...
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
148 cycles for 100 * mov/add
32 cycles for 100 * lea eax, [2*esi+6]
33 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop
148 cycles for 100 * mov/add
32 cycles for 100 * lea eax, [2*esi+6]
34 cycles for 100 * lea eax, [esi+esi+6]
2 cycles for 100 * empty loop
150 cycles for 100 * mov/add
32 cycles for 100 * lea eax, [2*esi+6]
33 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop
7 bytes for mov/add
7 bytes for lea eax, [2*esi+6]
4 bytes for lea eax, [esi+esi+6]
0 bytes for empty loop
P3:
pre-P4 (SSE1)
loop overhead is approx. 209/100 cycles
137 cycles for 100 * mov/add
101 cycles for 100 * lea eax, [2*esi+6]
100 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
130 cycles for 100 * mov/add
101 cycles for 100 * lea eax, [2*esi+6]
100 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
130 cycles for 100 * mov/add
101 cycles for 100 * lea eax, [2*esi+6]
100 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
P4 Northwood:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
+19 of 20 tests valid, loop overhead is approx. 263/100 cycles
62 cycles for 100 * mov/add
56 cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
55 cycles for 100 * mov/add
66 cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
57 cycles for 100 * mov/add
57 cycles for 100 * lea eax, [2*esi+6]
41 cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
P4 Prescott w/HT:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 239/100 cycles
155 cycles for 100 * mov/add
101 cycles for 100 * lea eax, [2*esi+6]
52 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop
128 cycles for 100 * mov/add
101 cycles for 100 * lea eax, [2*esi+6]
52 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop
191 cycles for 100 * mov/add
101 cycles for 100 * lea eax, [2*esi+6]
55 cycles for 100 * lea eax, [esi+esi+6]
191 cycles for 100 * empty loop
:t
FX-8150 @3.90GHz ...
AMD FX(tm)-8150 Eight-Core Processor (SSE4)
loop overhead is approx. 201/100 cycles
91 cycles for 100 * mov/add
0 cycles for 100 * lea eax, [2*esi+6]
2 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
91 cycles for 100 * mov/add
0 cycles for 100 * lea eax, [2*esi+6]
1 cycles for 100 * lea eax, [esi+esi+6]
3 cycles for 100 * empty loop
92 cycles for 100 * mov/add
0 cycles for 100 * lea eax, [2*esi+6]
0 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
7 bytes for mov/add
7 bytes for lea eax, [2*esi+6]
4 bytes for lea eax, [esi+esi+6]
0 bytes for empty loop
--- ok ---
EDIT:
My AV complains about LeaVariants.exe, so the result may be wrong ...
Greenhorn
Well, LeaVariants.exe runs, so the AV didn't stop it. But the results are astonishing: It seems that the loop (200/100=2 cycles) doesn't slow down at all for a little lea on your CPU...
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
loop overhead is approx. 124/100 cycles
140 cycles for 100 * mov/add
86 cycles for 100 * lea eax, [2*esi+6]
86 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
140 cycles for 100 * mov/add
86 cycles for 100 * lea eax, [2*esi+6]
90 cycles for 100 * lea eax, [esi+esi+6]
2 cycles for 100 * empty loop
140 cycles for 100 * mov/add
88 cycles for 100 * lea eax, [2*esi+6]
87 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
7 bytes for mov/add
7 bytes for lea eax, [2*esi+6]
4 bytes for lea eax, [esi+esi+6]
0 bytes for empty loop
--- ok ---
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 215/100 cycles
?? cycles for 100 * mov/add
?? cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
?? cycles for 100 * mov/add
?? cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
?? cycles for 100 * mov/add
?? cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
7 bytes for mov/add
7 bytes for lea eax, [2*esi+6]
4 bytes for lea eax, [esi+esi+6]
0 bytes for empty loop
Must be quantum or something :badgrin:
Quote from: sinsi on October 18, 2012, 11:58:20 AM
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 215/100 cycles
?? cycles for 100 * mov/add
?? cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
Must be quantum or something :badgrin:
More a problem with the loop overhead being higher after the calibration... try this:
ShowCycles MACRO algo
LOCAL tmp$
tmp$ CATSTR <"cycles for >, %AlgoLoops, < * ">
sub eax, overheadCycles
.if 0 ; Sign?
% print "??", 9, tmp$, AlgoName$(algo), 13, 10
.else
% print str$(eax), 9, tmp$, AlgoName$(algo), 13, 10
.endif
ENDM
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 222/100 cycles
-23 cycles for 100 * mov/add
-23 cycles for 100 * lea eax, [2*esi+6]
0 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop
-23 cycles for 100 * mov/add
-25 cycles for 100 * lea eax, [2*esi+6]
3 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
-24 cycles for 100 * mov/add
-23 cycles for 100 * lea eax, [2*esi+6]
0 cycles for 100 * lea eax, [esi+esi+6]
3 cycles for 100 * empty loop
Very odd, Sinsi. As if the mov/add sequences speed up the loop :(
Here is the bit used to calibrate the loop, i.e. getting the loop overhead (usually 150-250 cycles/100):
TestOH:
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
lea eax, [ebx+Eax2EbxOffset]
; --- no ops here ---
dec ebx
.Until Sign?
ret
CaliEnd:
ENDM
One more result:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles
77 cycles for 100 * mov/add
72 cycles for 100 * lea eax, [2*esi+6]
99 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
75 cycles for 100 * mov/add
72 cycles for 100 * lea eax, [2*esi+6]
96 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
75 cycles for 100 * mov/add
74 cycles for 100 * lea eax, [2*esi+6]
98 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
this brings up an interesting point...
we have developed a set of "biases" about using certain instructions in certain ways
for example, STOSB without REP is slow, the LOOPxx and JECXZ instructions are slow, and so on
with the newer cores, these may have been improved upon and we may need to re-visit the "black list" :P
hey guys, a question: how can you test the number of cycles? some tool I guess :P
most of us use MichaelW's code timing macros, or some derivative
the macros are available here...
http://masm32.com/board/index.php?topic=49.0
attached is an example of how to use them...
EDIT: attachment removed - see below
sorry RHL...
that was an experiment
change the following lines...
LOOP_COUNT = 10000000
mov ecx,10
or see the attachment...
thank you Dave, but I think I don't know how it works :P
I tried do this:
;code to be timed goes here
; this code is trash, testing only
mov eax,0
; ... (the same "mov eax,0" line repeated 120 times in the original post)
mov eax,0
rdtsc
but it shows nothing, and I thought it would show something like your results lol :P
well - that RDTSC instruction is my "code to be timed" for the example
you can remove that
see the attached...
be sure you place a copy of Michael's timers.asm in the \masm32\macros folder
Quote from: dedndave on October 19, 2012, 06:29:42 PM
well - that RDTSC instruction is my "code to be timed" for the example
you can remove that
see the attached...
be sure you place a copy of Michael's timers.asm in the \masm32\macros folder
Thank you very much brother :icon_mrgreen:
Quote from: dedndave on October 19, 2012, 04:21:47 AM
we have developed a set of "bias" about using certain instructions in certain ways
for example, STOSB without REP is slow, the LOOPxx and JECXZ instructions are slow, and so on
Is there a summary of these timing results somewhere? Could you post a reference please?
Thanks, ral
well - they are all over the old forum laboratory
http://www.masmforum.com/board/index.php?board=4.0