Suppose you need eax=2*esi+6:
mov eax, esi ; option 1
add eax, eax
add eax, 6
nop
lea eax, [2*esi+6] ; option 2
nop
lea eax, [esi+esi+6] ; option 3
nop
The three options produce exactly the same result, but Olly (http://www.ollydbg.de/version2.html) shows that the byte counts are quite different:
0040206E 8BC6 mov eax, esi
00402070 03C0 add eax, eax
00402072 83C0 06 add eax, 6
00402075 . 90 nop
00402076 ? 8D0475 06000000 lea eax, [esi*2+6]
0040207D . 90 nop
0040207E ? 8D4436 06 lea eax, [esi+esi+6]
:biggrin:
cool little trick, Jochen :t
Same code in FASM via IDA
89 F0 mov eax, esi
01 C0 add eax, eax
83 C0 06 add eax, 6
90 nop
8D 44 36 06 lea eax, [esi+esi+6]
90 nop
8D 44 36 06 lea eax, [esi+esi+6]
cool little trick, FASM :t
Now the fun part is that you need to clock it on different hardware: LEA was very good from the 486 up to the PIII, but a poor performer on a P4. Core2 and the i3/5/7 series seem to be OK with LEA. The FASM trick is cute but naughty; you should get the opcode you write, not an optimisation of it.
>FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.
Same opcode, different encoding - not an optimisation trick. Like these two:
8BC6 mov eax, esi
89F0 mov eax, esi
But yeah, I see what you mean.
It's probably not just naughty - it might generate wrong code if this "trick" is also applied to the ebp register. Because:
[ebp*2] and [ebp+ebp] are NOT synonymous - only the latter expression uses SS as the default segment register, while the first will use DS.
It's ok to do so in a flat memory model, where SS and DS are assumed FLAT, but AFAIK FASM doesn't know "models" and does not "assume".
Quote from: sinsi on October 17, 2012, 05:33:52 PM
>FASM trick is cute but naughty, you should get the opcode you write, not an optimisation of it.
Same opcode, different encoding - not an optimisation trick. Like these two:
8BC6 mov eax, esi
89F0 mov eax, esi
But yeah, I see what you mean.
xchg eax, edx ; 92h
xchg edx, eax ; 92h
xchg edx, eax ; 87h, 0D0h
I guess the FASM crew has discussed the risks :lol:
Quote from: japheth on October 17, 2012, 07:34:30 PM
It's probably not just naughty - it might generate wrong code if this "trick" is also applied to the ebp register. Because:
[ebp*2] and [ebp+ebp] are NOT synonymous - only the latter expression uses SS as the default segment register, while the first will use DS.
It's ok to do so in a flat memory model, where SS and DS are assumed FLAT, but AFAIK FASM doesn't know "models" and does not "assume".
OK, in the case where you want to calculate an actual address, in fasm it doesn't matter because it's your code
- you know it's an address (whether DS or SS) in EAX
- you would use [SS:EAX] to access it
Otherwise using it as simple integer maths shouldn't matter, should it?
Quote from: sinsi on October 17, 2012, 08:42:26 PM
OK, in the case where you want to calculate an actual address, in fasm it doesn't matter because it's your code
- you know it's an address (whether DS or SS) in EAX
- you would use [SS:EAX] to access it
Otherwise using it as simple integer maths shouldn't matter, should it?
I don't understand what you want to say. A brief test confirms that fasm generates wrong code:
format mz
entry text:start
stack 400h
segment mydata
table dw 'A','B','C','D'
dstr db "value="
dbyt db " ",13,10,'$'
segment text
start:
mov ax,mydata
mov ds,ax
xor ebp,ebp
mov ax,[table+ebp*2]
mov [dbyt],al
mov dx,dstr
mov ah,9
int 21h
mov ah,4ch
int 21h
It does not display "value=A": fasm encodes [table+ebp*2] as [table+ebp+ebp], and with ebp as base the default segment is SS, not DS. If you replace ebp with ebx, it works.
Here's the Masm equivalent
.286
.model small
.stack 400h
.data
table label word
dw 'A','B','C','D'
dstr db "value="
dbyt db " ",13,10,'$'
.code
.386
start:
mov ax,@data
mov ds,ax
xor ebp,ebp
mov ax,[table+ebp*2]
mov dbyt,al
mov dx,offset dstr
mov ah,9
int 21h
mov ah,4ch
int 21h
end start
which works as expected.
I see, I was thinking more like 'lea eax,[ebp...]' instead of 'mov eax,[ebp...]'.
Might want to let Tomasz know then.
I was thinking that LEA was OK on a P4, unless you use a multiplier:
lea eax,[2*edi]
If I am right, Jochen's little trick might be a good replacement,
as it not only saves a few bytes, but also offers some speed improvement.
As Hutch says, we'll have to try it out :P
Here is the speed test - no big surprises, but note these are cycle counts per 100 loops...
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
148 cycles for 100 * mov/add
32 cycles for 100 * lea eax, [2*esi+6]
33 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop
148 cycles for 100 * mov/add
32 cycles for 100 * lea eax, [2*esi+6]
34 cycles for 100 * lea eax, [esi+esi+6]
2 cycles for 100 * empty loop
150 cycles for 100 * mov/add
32 cycles for 100 * lea eax, [2*esi+6]
33 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop
7 bytes for mov/add
7 bytes for lea eax, [2*esi+6]
4 bytes for lea eax, [esi+esi+6]
0 bytes for empty loop
P3:
pre-P4 (SSE1)
loop overhead is approx. 209/100 cycles
137 cycles for 100 * mov/add
101 cycles for 100 * lea eax, [2*esi+6]
100 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
130 cycles for 100 * mov/add
101 cycles for 100 * lea eax, [2*esi+6]
100 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
130 cycles for 100 * mov/add
101 cycles for 100 * lea eax, [2*esi+6]
100 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
P4 Northwood:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
+19 of 20 tests valid, loop overhead is approx. 263/100 cycles
62 cycles for 100 * mov/add
56 cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
55 cycles for 100 * mov/add
66 cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
57 cycles for 100 * mov/add
57 cycles for 100 * lea eax, [2*esi+6]
41 cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
P4 Prescott w/HT:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 239/100 cycles
155 cycles for 100 * mov/add
101 cycles for 100 * lea eax, [2*esi+6]
52 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop
128 cycles for 100 * mov/add
101 cycles for 100 * lea eax, [2*esi+6]
52 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop
191 cycles for 100 * mov/add
101 cycles for 100 * lea eax, [2*esi+6]
55 cycles for 100 * lea eax, [esi+esi+6]
191 cycles for 100 * empty loop
:t
FX-8150 @3.90GHz ...
AMD FX(tm)-8150 Eight-Core Processor (SSE4)
loop overhead is approx. 201/100 cycles
91 cycles for 100 * mov/add
0 cycles for 100 * lea eax, [2*esi+6]
2 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
91 cycles for 100 * mov/add
0 cycles for 100 * lea eax, [2*esi+6]
1 cycles for 100 * lea eax, [esi+esi+6]
3 cycles for 100 * empty loop
92 cycles for 100 * mov/add
0 cycles for 100 * lea eax, [2*esi+6]
0 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
7 bytes for mov/add
7 bytes for lea eax, [2*esi+6]
4 bytes for lea eax, [esi+esi+6]
0 bytes for empty loop
--- ok ---
EDIT:
My AV complains about LeaVariants.exe, so the result may be wrong ...
Greenhorn
Well, LeaVariants.exe runs, so the AV didn't stop it. But the results are astonishing: It seems that the loop (200/100=2 cycles) doesn't slow down at all for a little lea on your CPU...
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
loop overhead is approx. 124/100 cycles
140 cycles for 100 * mov/add
86 cycles for 100 * lea eax, [2*esi+6]
86 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
140 cycles for 100 * mov/add
86 cycles for 100 * lea eax, [2*esi+6]
90 cycles for 100 * lea eax, [esi+esi+6]
2 cycles for 100 * empty loop
140 cycles for 100 * mov/add
88 cycles for 100 * lea eax, [2*esi+6]
87 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
7 bytes for mov/add
7 bytes for lea eax, [2*esi+6]
4 bytes for lea eax, [esi+esi+6]
0 bytes for empty loop
--- ok ---
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 215/100 cycles
?? cycles for 100 * mov/add
?? cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
?? cycles for 100 * mov/add
?? cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
?? cycles for 100 * mov/add
?? cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
7 bytes for mov/add
7 bytes for lea eax, [2*esi+6]
4 bytes for lea eax, [esi+esi+6]
0 bytes for empty loop
Must be quantum or something :badgrin:
Quote from: sinsi on October 18, 2012, 11:58:20 AM
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 215/100 cycles
?? cycles for 100 * mov/add
?? cycles for 100 * lea eax, [2*esi+6]
?? cycles for 100 * lea eax, [esi+esi+6]
?? cycles for 100 * empty loop
Must be quantum or something :badgrin:
More a problem with the loop overhead being higher after the calibration... try this:
ShowCycles MACRO algo
LOCAL tmp$
tmp$ CATSTR <"cycles for >, %AlgoLoops, < * ">
sub eax, overheadCycles
.if 0 ; Sign?
% print "??", 9, tmp$, AlgoName$(algo), 13, 10
.else
% print str$(eax), 9, tmp$, AlgoName$(algo), 13, 10
.endif
ENDM
AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 222/100 cycles
-23 cycles for 100 * mov/add
-23 cycles for 100 * lea eax, [2*esi+6]
0 cycles for 100 * lea eax, [esi+esi+6]
0 cycles for 100 * empty loop
-23 cycles for 100 * mov/add
-25 cycles for 100 * lea eax, [2*esi+6]
3 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
-24 cycles for 100 * mov/add
-23 cycles for 100 * lea eax, [2*esi+6]
0 cycles for 100 * lea eax, [esi+esi+6]
3 cycles for 100 * empty loop
Very odd, Sinsi. As if the mov/add sequences speed up the loop :(
Here is the bit used to calibrate the loop, i.e. getting the loop overhead (usually 150-250 cycles/100):
TestOH:
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
lea eax, [ebx+Eax2EbxOffset]
; --- no ops here ---
dec ebx
.Until Sign?
ret
CaliEnd:
ENDM
One more result:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles
77 cycles for 100 * mov/add
72 cycles for 100 * lea eax, [2*esi+6]
99 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
75 cycles for 100 * mov/add
72 cycles for 100 * lea eax, [2*esi+6]
96 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
75 cycles for 100 * mov/add
74 cycles for 100 * lea eax, [2*esi+6]
98 cycles for 100 * lea eax, [esi+esi+6]
1 cycles for 100 * empty loop
this brings up an interesting point...
we have developed a set of "biases" about using certain instructions in certain ways
for example, STOSB without REP is slow, the LOOPxx and JECXZ instructions are slow, and so on
with the newer cores, these may have been improved upon and we may need to re-visit the "black list" :P
hey guys, a question: how can you test the number of cycles? some tool I guess :P
most of us use MichaelW's code timing macros, or some derivative
the macros are available here...
http://masm32.com/board/index.php?topic=49.0
attached is an example of how to use them...
EDIT: attachment removed - see below
sorry RHL...
that was an experiment
change the following lines...
LOOP_COUNT = 10000000
mov ecx,10
or see the attachment...
thank you Dave, but I think I don't know how it works :P
I tried do this:
;code to be timed goes here
; this code is trash, testing only
mov eax,0
; ... (the same "mov eax,0" line repeated 120 times in the original post)
mov eax,0
rdtsc
but it shows nothing, and I thought it would show something like your results lol :P
well - that RDTSC instruction is my "code to be timed" for the example
you can remove that
see the attached...
be sure you place a copy of Michael's timers.asm in the \masm32\macros folder
Quote from: dedndave on October 19, 2012, 06:29:42 PM
well - that RDTSC instruction is my "code to be timed" for the example
you can remove that
see the attached...
be sure you place a copy of Michael's timers.asm in the \masm32\macros folder
Thank you very much brother :icon_mrgreen:
Quote from: dedndave on October 19, 2012, 04:21:47 AM
we have developed a set of "bias" about using certain instructions in certain ways
for example, STOSB without REP is slow, the LOOPxx and JECXZ instructions are slow, and so on
Is there a summary of these timing results somewhere? Could you post a reference please?
Thanks, ral
well - they are all over the old forum laboratory
http://www.masmforum.com/board/index.php?board=4.0