LEA is a handy instruction

jj2007 · October 17, 2012, 10:54:56 AM

Suppose you need eax=2*esi+6:
   mov eax, esi ; option 1
   add eax, eax
   add eax, 6
   nop
   lea eax, [2*esi+6] ; option 2
   nop
   lea eax, [esi+esi+6] ; option 3
   nop

The three options produce exactly the same result, but Olly shows that the byte counts are quite different:

0040206E  8BC6              mov eax, esi
00402070  03C0              add eax, eax
00402072  83C0 06           add eax, 6
00402075  90                nop
00402076  8D0475 06000000   lea eax, [esi*2+6]
0040207D  90                nop
0040207E  8D4436 06         lea eax, [esi+esi+6]
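
The size difference comes from how the two addressing forms are encoded: a scaled index with no base register has to carry a full 32-bit displacement, while base plus index can use an 8-bit one. Below is a minimal Python sketch of that encoding rule. `encode_lea`, `REG32` and `SCALE_BITS` are names made up for this illustration (they belong to no assembler), and only the SIB-byte forms from the listing are handled.

```python
# Register numbers as used in the ModRM and SIB bytes (32-bit mode).
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}
SCALE_BITS = {1: 0, 2: 1, 4: 2, 8: 3}


def encode_lea(dest, index, scale, disp, base=None):
    """Encode `lea dest, [base + index*scale + disp]` with a SIB byte.

    Hypothetical helper for illustration only: 32-bit mode, and the
    index register must not be esp (the SIB byte cannot express that).
    """
    if index == "esp":
        raise ValueError("esp cannot be used as an index register")
    reg = REG32[dest]
    sib_high = (SCALE_BITS[scale] << 6) | (REG32[index] << 3)

    if base is None:
        # mod=00 with SIB base=101 means "no base, disp32 follows".
        # This form has no disp8 variant, hence the 7 bytes.
        modrm = (0b00 << 6) | (reg << 3) | 0b100
        sib = sib_high | 0b101
        tail = (disp & 0xFFFFFFFF).to_bytes(4, "little")
    else:
        sib = sib_high | REG32[base]
        if disp == 0 and base != "ebp":
            mod, tail = 0b00, b""
        elif -128 <= disp <= 127:
            mod, tail = 0b01, (disp & 0xFF).to_bytes(1, "little")
        else:
            mod, tail = 0b10, (disp & 0xFFFFFFFF).to_bytes(4, "little")
        modrm = (mod << 6) | (reg << 3) | 0b100

    return bytes([0x8D, modrm, sib]) + tail


print(encode_lea("eax", "esi", 2, 6).hex())              # 8d047506000000
print(encode_lea("eax", "esi", 1, 6, base="esi").hex())  # 8d443606
```

The two printed byte strings match the Olly listing: 7 bytes for the scaled form, 4 bytes for esi+esi.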

dedndave · October 17, 2012, 02:24:35 PM

cool little trick, Jochen :t

sinsi · October 17, 2012, 03:27:38 PM

Same code assembled with FASM, as shown by IDA:

89 F0                                   mov     eax, esi
01 C0                                   add     eax, eax
83 C0 06                                add     eax, 6
90                                      nop
8D 44 36 06                             lea     eax, [esi+esi+6]
90                                      nop
8D 44 36 06                             lea     eax, [esi+esi+6]

jj2007 · October 17, 2012, 04:15:18 PM

cool little trick, FASM :t

hutch-- · October 17, 2012, 04:23:52 PM

Now the fun part is that you need to clock it on different hardware. LEA was very good on the 486 up to the PIII but a poor performer on the PIV. Core2 and the i3/5/7 series seem to be OK with LEA. FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.

sinsi · October 17, 2012, 05:33:52 PM

>FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.
Same opcode, different encoding - not an optimisation trick. Like these two:

8BC6    mov eax, esi
89F0    mov eax, esi

But yeah, I see what you mean.
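
For reference, the two byte pairs differ because a register-to-register mov can go through either of two opcodes: 8B puts the destination in the ModRM reg field, 89 puts the source there. This is a small Python sketch of that, with made-up helper names (`modrm_reg_direct`, `mov_via_8b`, `mov_via_89`) and 32-bit registers assumed.

```python
# Register numbers as used in the ModRM byte (32-bit mode).
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def modrm_reg_direct(reg, rm):
    """ModRM byte with mod=11, i.e. both operands are registers."""
    return 0xC0 | (REG32[reg] << 3) | REG32[rm]


def mov_via_8b(dst, src):
    # 8B /r is "mov r32, r/m32": the destination sits in the reg field.
    return bytes([0x8B, modrm_reg_direct(dst, src)])


def mov_via_89(dst, src):
    # 89 /r is "mov r/m32, r32": the source sits in the reg field.
    return bytes([0x89, modrm_reg_direct(src, dst)])


print(mov_via_8b("eax", "esi").hex())  # 8bc6
print(mov_via_89("eax", "esi").hex())  # 89f0
```

Both results are `mov eax, esi`; which one an assembler emits is its own choice.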

japheth · October 17, 2012, 07:34:30 PM

It's probably not just naughty, but might generate wrong code - if this "trick" is also applied to the ebp register. Because:

[ebp*2] and [ebp+ebp] are NOT synonymous - only the latter expression uses SS as the default segment register, while the former uses DS.

It's ok to do so in a flat memory model, where SS and DS are assumed FLAT, but AFAIK FASM doesn't know "models" and does not "assume".
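
The rule behind this fits in a few lines. The Python sketch below only illustrates the default-segment rule for 32-bit addressing (the base register decides, the index register never does). The helper names `default_segment` and `doubling_rewrite_is_safe` are made up for this note, and explicit segment overrides and string instructions are ignored.

```python
# Base registers that select SS as the default segment in 32-bit addressing.
STACK_BASES = ("ebp", "esp")


def default_segment(base=None, index=None):
    """Default segment of a memory operand that has no explicit override.

    `index` is accepted only to make the point that it is ignored:
    the choice depends on the base register alone.
    """
    return "ss" if base in STACK_BASES else "ds"


def doubling_rewrite_is_safe(reg):
    """True if [reg*2] and [reg+reg] share the same default segment."""
    scaled = default_segment(index=reg)            # [reg*2]: no base
    summed = default_segment(base=reg, index=reg)  # [reg+reg]: reg is base
    return scaled == summed


for reg in ("ebx", "esi", "ebp"):
    print(reg,
          default_segment(index=reg),
          default_segment(base=reg, index=reg),
          doubling_rewrite_is_safe(reg))
```

Only ebp fails the check, since esp cannot be used as an index register in the first place.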

jj2007 · October 17, 2012, 08:03:40 PM

Quote from: sinsi on October 17, 2012, 05:33:52 PM
>FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.
Same opcode, different encoding - not an optimisation trick. Like these two:
8BC6    mov eax, esi
89F0    mov eax, esi

But yeah, I see what you mean.

xchg eax, edx   ; 92h
xchg edx, eax   ; 92h
xchg edx, eax   ; 87h, 0D0h
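
A short Python sketch of where those bytes come from, assuming 32-bit registers: exchanges with eax have a one-byte form (90h plus the register number), while the general form is opcode 87h followed by a ModRM byte. `xchg_short` and `xchg_modrm` are hypothetical helper names used only here.

```python
# Register numbers as used in opcodes and the ModRM byte (32-bit mode).
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def xchg_short(reg):
    """One-byte form 90h+r: exchanges eax with `reg`."""
    return bytes([0x90 + REG32[reg]])


def xchg_modrm(rm, reg):
    """Two-byte form 87 /r with a register-direct (mod=11) ModRM byte."""
    return bytes([0x87, 0xC0 | (REG32[reg] << 3) | REG32[rm]])


print(xchg_short("edx").hex())         # 92
print(xchg_modrm("eax", "edx").hex())  # 87d0
```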

I guess the FASM crew has discussed the risks :lol:

sinsi · October 17, 2012, 08:42:26 PM

Quote from: japheth on October 17, 2012, 07:34:30 PM
It's probably not just naughty, but might generate wrong code - if this "trick" is also applied to the ebp register. Because:

[ebp*2] and [ebp+ebp] are NOT synonymous - only the latter expression uses SS as the default segment register, while the former uses DS.

It's ok to do so in a flat memory model, where SS and DS are assumed FLAT, but AFAIK FASM doesn't know "models" and does not "assume".

OK, in the case where you want to calculate an actual address, in fasm it doesn't matter because it's your code
- you know it's an address (whether DS or SS) in EAX
- you would use [SS:EAX] to access it
Otherwise using it as simple integer maths shouldn't matter, should it?

japheth · October 17, 2012, 09:15:19 PM

Quote from: sinsi on October 17, 2012, 08:42:26 PM
OK, in the case where you want to calculate an actual address, in fasm it doesn't matter because it's your code
- you know it's an address (whether DS or SS) in EAX
- you would use [SS:EAX] to access it
Otherwise using it as simple integer maths shouldn't matter, should it?

I don't understand what you want to say. A brief test confirms that fasm generates wrong code:

    format mz

    entry text:start
    stack 400h

segment mydata

table dw 'A','B','C','D'

dstr db "value="
dbyt db " ",13,10,'$'

segment text

start:
    mov ax,mydata
    mov ds,ax
    xor ebp,ebp
    mov ax,[table+ebp*2]
    mov [dbyt],al
    mov dx,dstr
    mov ah,9
    int 21h
    mov ah,4ch
    int 21h

It does not display "value=A". If you replace ebp with ebx, then it does.

Here's the MASM equivalent:

    .286
    .model small

    .stack 400h

    .data

table label word
    dw 'A','B','C','D'

dstr db "value="
dbyt db " ",13,10,'$'

    .code

    .386

start:
    mov ax,@data
    mov ds,ax
    xor ebp,ebp
    mov ax,[table+ebp*2]
    mov dbyt,al
    mov dx,offset dstr
    mov ah,9
    int 21h
    mov ah,4ch
    int 21h

    end start

which works as expected.

sinsi · October 17, 2012, 09:46:11 PM

I see, I was thinking more like 'lea eax,[ebp...]' instead of 'mov eax,[ebp...]'.
Might want to let Tomasz know then.

dedndave · October 18, 2012, 06:32:09 AM

i was thinking that LEA was ok on a P4, unless you use a multiplier

lea eax,[2*edi]

if i am right, Jochen's little trick might be a good replacement
as it not only saves a few bytes, but also offers some speed improvement
as Hutch says, we'll have to try it out :P

jj2007 · October 18, 2012, 08:20:46 AM

Here is the speed test - no big surprises, but note these are cycle counts per 100 loops...

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
148 cycles for 100 * mov/add
32 cycles for 100 * lea eax, [2*esi+6]
33 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop

148 cycles for 100 * mov/add
32 cycles for 100 * lea eax, [2*esi+6]
34 cycles for 100 * lea eax, [esi+esi+6]
2 cycles for 100 * empty loop

150 cycles for 100 * mov/add
32 cycles for 100 * lea eax, [2*esi+6]
33 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop

7 bytes for mov/add
7 bytes for lea eax, [2*esi+6]
4 bytes for lea eax, [esi+esi+6]
0 bytes for empty loop
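
To read the figures per pass instead of per 100, divide by 100. This is a small Python sketch using the first Celeron M run and the byte counts above. It assumes each figure is the total for 100 repetitions of the sequence, and `cycles_per_pass` is just an illustrative name.

```python
# First Celeron M run from the results above: cycles per 100 repetitions.
CYCLES_PER_100 = {
    "mov/add": 148,
    "lea eax, [2*esi+6]": 32,
    "lea eax, [esi+esi+6]": 33,
}
# Code size in bytes, from the same post.
SIZE_BYTES = {
    "mov/add": 7,
    "lea eax, [2*esi+6]": 7,
    "lea eax, [esi+esi+6]": 4,
}


def cycles_per_pass(total_cycles, passes=100):
    """Average cycles for one repetition of the timed sequence."""
    return total_cycles / passes


for name, total in CYCLES_PER_100.items():
    print(f"{name:22s} {cycles_per_pass(total):.2f} cycles/pass, "
          f"{SIZE_BYTES[name]} bytes")
```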

MichaelW · October 18, 2012, 08:47:02 AM

P3:

pre-P4 (SSE1)
loop overhead is approx. 209/100 cycles

137     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
100     cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

130     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
100     cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

130     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
100     cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

P4 Northwood:

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
+19 of 20 tests valid, loop overhead is approx. 263/100 cycles

62      cycles for 100 * mov/add
56      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

55      cycles for 100 * mov/add
66      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

57      cycles for 100 * mov/add
57      cycles for 100 * lea eax, [2*esi+6]
41      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

dedndave · October 18, 2012, 09:09:38 AM

P4 prescott w/ht

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 239/100 cycles

155     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
52      cycles for 100 * lea eax, [esi+esi+6]
0       cycles for 100 * empty loop

128     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
52      cycles for 100 * lea eax, [esi+esi+6]
0       cycles for 100 * empty loop

191     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
55      cycles for 100 * lea eax, [esi+esi+6]
191     cycles for 100 * empty loop

:t
