LEA is a handy instruction

Started by jj2007, October 17, 2012, 10:54:56 AM

jj2007

Suppose you need eax=2*esi+6:
   mov eax, esi  ; option 1
   add eax, eax
   add eax, 6
   nop
   lea eax, [2*esi+6]  ; option 2
   nop
   lea eax, [esi+esi+6]  ; option 3
   nop

All three options leave the same value in eax, but Olly shows that the byte counts are quite different:

0040206E         8BC6              mov eax, esi
00402070         03C0              add eax, eax
00402072         83C0 06           add eax, 6
00402075      .  90                nop
00402076      ?  8D0475 06000000   lea eax, [esi*2+6]
0040207D      .  90                nop
0040207E      ?  8D4436 06         lea eax, [esi+esi+6]
:biggrin:
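The size difference follows from the ModRM/SIB rules: a scaled index with no base register has to carry a full 32-bit displacement, while base+index can get away with an 8-bit one. Here is a rough Python sketch that rebuilds the two LEA encodings by hand. The function names are made up for illustration, and it only covers these two addressing forms, so treat it as a check on the bytes above rather than a real assembler.

```python
# Register numbers as used in the ModRM and SIB bytes.
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
       "esp": 4, "ebp": 5, "esi": 6, "edi": 7}
SCALE_BITS = {1: 0, 2: 1, 4: 2, 8: 3}


def modrm(mod, reg, rm):
    return (mod << 6) | (reg << 3) | rm


def sib(scale, index, base_bits):
    if index == "esp":
        raise ValueError("esp cannot be used as an index register")
    return (SCALE_BITS[scale] << 6) | (REG[index] << 3) | base_bits


def lea_index_only(dst, index, scale, disp):
    """lea dst, [index*scale + disp] (hypothetical helper).

    mod=00 with SIB base=101 means "no base register", and that form
    always carries a 32-bit displacement, even for a small constant.
    """
    head = bytes([0x8D, modrm(0b00, REG[dst], 0b100), sib(scale, index, 0b101)])
    return head + disp.to_bytes(4, "little", signed=True)


def lea_base_index_disp8(dst, base, index, scale, disp):
    """lea dst, [base + index*scale + disp8] (hypothetical helper).

    mod=01 selects an 8-bit displacement, so disp must fit in a signed byte.
    """
    head = bytes([0x8D, modrm(0b01, REG[dst], 0b100), sib(scale, index, REG[base])])
    return head + disp.to_bytes(1, "little", signed=True)


if __name__ == "__main__":
    for code in (lea_index_only("eax", "esi", 2, 6),
                 lea_base_index_disp8("eax", "esi", "esi", 1, 6)):
        print(len(code), "bytes:", code.hex())
```

The first call gives the 7-byte form (8D 04 75 06 00 00 00) and the second the 4-byte form (8D 44 36 06), matching the Olly listing.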

dedndave


sinsi

Same code in FASM via IDA

89 F0                                   mov     eax, esi
01 C0                                   add     eax, eax
83 C0 06                                add     eax, 6
90                                      nop
8D 44 36 06                             lea     eax, [esi+esi+6]
90                                      nop
8D 44 36 06                             lea     eax, [esi+esi+6]

jj2007


hutch--

Now the fun part is that you need to clock it on different hardware. LEA was very good from the 486 up to the PIII but a poor performer on the PIV. Core2 and the i3/5/7 series seem to be OK with LEA. FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.

sinsi

>FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.
Same opcode, different encoding - not an optimisation trick. Like these two:
8BC6    mov eax, esi
89F0    mov eax, esi


But yeah, I see what you mean.
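To see why both byte pairs mean the same thing: opcode 8B is "mov r32, r/m32" and opcode 89 is "mov r/m32, r32", so for a register-to-register move the assembler is free to pick either and just swaps the two ModRM fields. A small Python sketch, with helper names invented for this example:

```python
# Register numbers as used in the ModRM byte.
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
       "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def mov_8b(dst, src):
    """8B /r = mov r32, r/m32: destination in ModRM.reg, source in ModRM.rm."""
    return bytes([0x8B, 0xC0 | (REG[dst] << 3) | REG[src]])


def mov_89(dst, src):
    """89 /r = mov r/m32, r32: destination in ModRM.rm, source in ModRM.reg."""
    return bytes([0x89, 0xC0 | (REG[src] << 3) | REG[dst]])


if __name__ == "__main__":
    print(mov_8b("eax", "esi").hex())  # the form Olly showed for the MASM build
    print(mov_89("eax", "esi").hex())  # the form IDA showed for the FASM build
```

Both calls describe mov eax, esi; only the direction bit of the opcode and the ModRM field order differ.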

japheth

It's probably not just naughty, but might generate wrong code - if this "trick" is also applied to the ebp register. Because:

[ebp*2] and [ebp+ebp] are NOT synonymous - only the latter expression uses SS as default register, while the first will use DS.

It's ok to do so in flat memory model, where SS and DS are assumed FLAT, but AFAIK FASM doesn't know "models" and does not "assume".
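The rule behind this can be written down in a few lines. Below is a hedged Python sketch of the default-segment rule for 32-bit addressing without an override prefix: SS when the base register is ebp or esp, DS otherwise, and a scaled index on its own never selects SS. The function names are hypothetical, and the sketch models only this one rule. It matters only for instructions that actually access memory, since lea merely computes the offset.

```python
def default_segment(base=None, index=None):
    """Default segment for a 32-bit memory operand with no segment override.

    Only the base register decides: ebp or esp as base selects SS.
    An index register, scaled or not, has no influence.
    """
    if index == "esp":
        raise ValueError("esp cannot be used as an index register")
    return "ss" if base in ("ebp", "esp") else "ds"


def rewrite_keeps_segment(reg):
    """True if turning [reg*2] into [reg+reg] keeps the same default segment."""
    before = default_segment(base=None, index=reg)  # [reg*2]: index only
    after = default_segment(base=reg, index=reg)    # [reg+reg]: reg is now the base
    return before == after


if __name__ == "__main__":
    for reg in ("ebx", "esi", "ebp"):
        print(reg, rewrite_keeps_segment(reg))
```

Under this model the rewrite is harmless for every register except ebp, which is exactly the case where DS and SS can differ outside a flat model.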

jj2007

Quote from: sinsi on October 17, 2012, 05:33:52 PM
>FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.
Same opcode, different encoding - not an optimisation trick. Like these two:
8BC6    mov eax, esi
89F0    mov eax, esi


But yeah, I see what you mean.

xchg eax, edx   ; 92h
xchg edx, eax   ; 92h
xchg edx, eax   ; 87h, 0D0h

I guess the FASM crew has discussed the risks :lol:

sinsi

Quote from: japheth on October 17, 2012, 07:34:30 PM
It's probably not just naughty, but might generate wrong code - if this "trick" is also applied to the ebp register. Because:

[ebp*2] and [ebp+ebp] are NOT synonymous - only the latter expression uses SS as default register, while the first will use DS.

It's ok to do so in flat memory model, where SS and DS are assumed FLAT, but AFAIK FASM doesn't know "models" and does not "assume".
OK, in the case where you want to calculate an actual address, in fasm it doesn't matter because it's your code
- you know it's an address (whether DS or SS) in EAX
- you would use [SS:EAX] to access it
Otherwise using it as simple integer maths shouldn't matter, should it?

japheth

Quote from: sinsi on October 17, 2012, 08:42:26 PM
OK, in the case where you want to calculate an actual address, in fasm it doesn't matter because it's your code
- you know it's an address (whether DS or SS) in EAX
- you would use [SS:EAX] to access it
Otherwise using it as simple integer maths shouldn't matter, should it?

I don't understand what you want to say. A brief test confirms that fasm generates wrong code:


    format mz

    entry text:start
    stack 400h

segment mydata

table dw 'A','B','C','D'

dstr db "value="
dbyt db " ",13,10,'$'

segment text

start:
    mov ax,mydata
    mov ds,ax
    xor ebp,ebp
    mov ax,[table+ebp*2]
    mov [dbyt],al
    mov dx,dstr
    mov ah,9
    int 21h
    mov ah,4ch
    int 21h


It does not display "value=A". If you replace ebp with ebx, then it does.

Here's the Masm equivalent:


    .286
    .model small

    .stack 400h

    .data

table label word
    dw 'A','B','C','D'

dstr db "value="
dbyt db " ",13,10,'$'

    .code

    .386

start:
    mov ax,@data
    mov ds,ax
    xor ebp,ebp
    mov ax,[table+ebp*2]
    mov dbyt,al
    mov dx,offset dstr
    mov ah,9
    int 21h
    mov ah,4ch
    int 21h

    end start


which works as expected.

sinsi

I see, I was thinking more like 'lea eax,[ebp...]' instead of 'mov eax,[ebp...]'.
Might want to let Tomasz know then.

dedndave

i was thinking that LEA was ok on a P4, unless you use a multiplier
        lea     eax,[2*edi]

if i am right, Jochen's little trick might be a good replacement
as it not only saves a few bytes, but also offers some speed improvement
as Hutch says, we'll have to try it out   :P

jj2007

Here is the speed test - no big surprises, but note these are cycle counts per 100 loops...

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
148     cycles for 100 * mov/add
32      cycles for 100 * lea eax, [2*esi+6]
33      cycles for 100 * lea eax, [esi+esi+6]
0       cycles for 100 * empty loop

148     cycles for 100 * mov/add
32      cycles for 100 * lea eax, [2*esi+6]
34      cycles for 100 * lea eax, [esi+esi+6]
2       cycles for 100 * empty loop

150     cycles for 100 * mov/add
32      cycles for 100 * lea eax, [2*esi+6]
33      cycles for 100 * lea eax, [esi+esi+6]
0       cycles for 100 * empty loop

7       bytes for mov/add
7       bytes for lea eax, [2*esi+6]
4       bytes for lea eax, [esi+esi+6]
0       bytes for empty loop
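Because the figures are totals for 100 iterations, dividing by the loop count gives the cost per iteration. A quick Python sketch that does this for output in the format above and averages the runs; the function name and the CELERON_M sample string are just for illustration, with the numbers copied from the Celeron M results:

```python
import re
from collections import defaultdict

# Matches lines such as "148     cycles for 100 * mov/add".
# Lines with "??" instead of a number are skipped.
LINE = re.compile(r"^\s*(\d+)\s+cycles for (\d+) \* (.+?)\s*$")

CELERON_M = """\
148     cycles for 100 * mov/add
32      cycles for 100 * lea eax, [2*esi+6]
33      cycles for 100 * lea eax, [esi+esi+6]
148     cycles for 100 * mov/add
32      cycles for 100 * lea eax, [2*esi+6]
34      cycles for 100 * lea eax, [esi+esi+6]
150     cycles for 100 * mov/add
32      cycles for 100 * lea eax, [2*esi+6]
33      cycles for 100 * lea eax, [esi+esi+6]
"""


def cycles_per_iteration(report):
    """Average cycles per single iteration for each label in the report."""
    samples = defaultdict(list)
    for line in report.splitlines():
        m = LINE.match(line)
        if m:
            cycles, loops, label = int(m.group(1)), int(m.group(2)), m.group(3)
            samples[label].append(cycles / loops)
    return {label: sum(vals) / len(vals) for label, vals in samples.items()}


if __name__ == "__main__":
    for label, value in cycles_per_iteration(CELERON_M).items():
        print(f"{value:.2f} cycles per iteration: {label}")
```

On these numbers the mov/add sequence comes out at roughly 1.5 cycles per iteration and either LEA form at about a third of a cycle.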

MichaelW

P3:

pre-P4 (SSE1)
loop overhead is approx. 209/100 cycles

137     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
100     cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

130     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
100     cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop

130     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
100     cycles for 100 * lea eax, [esi+esi+6]
1       cycles for 100 * empty loop


P4 Northwood:

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
+19 of 20 tests valid, loop overhead is approx. 263/100 cycles

62      cycles for 100 * mov/add
56      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

55      cycles for 100 * mov/add
66      cycles for 100 * lea eax, [2*esi+6]
??      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

57      cycles for 100 * mov/add
57      cycles for 100 * lea eax, [2*esi+6]
41      cycles for 100 * lea eax, [esi+esi+6]
??      cycles for 100 * empty loop

dedndave

P4 prescott w/ht
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 239/100 cycles

155     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
52      cycles for 100 * lea eax, [esi+esi+6]
0       cycles for 100 * empty loop

128     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
52      cycles for 100 * lea eax, [esi+esi+6]
0       cycles for 100 * empty loop

191     cycles for 100 * mov/add
101     cycles for 100 * lea eax, [2*esi+6]
55      cycles for 100 * lea eax, [esi+esi+6]
191     cycles for 100 * empty loop


:t