Too simple to be true

Started by Grincheux, January 11, 2016, 12:39:15 AM

Grincheux

I have the following code, with the value of ESP after each step:


ESP before      -----------------> 19F89C
after POP EBX   -----------------> 19F8A0
after PUSH EAX  -----------------> 19F89C


So instead of POP EBX / PUSH EAX, replace them with:


MOV EBX,[ESP]
MOV [ESP],EAX


Is it quicker than POP EBX / PUSH EAX?

gelatine1

I think it should be a little quicker as it doesn't have to increase/decrease ESP. Another possibility:


mov ebx, eax
xchg ebx, dword ptr [esp]


Maybe someone could test the speed with some program?

Grincheux

I thought of that too. Perhaps it is better.

jj2007

push & pop are highly optimised instructions.
The same goes for mov, but the timings say that in this case it's worse.
Finally, xchg reg, mem is known to have awful performance, see below.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
69      cycles for 100 * pop+push
293     cycles for 100 * mov+mov
2167    cycles for 100 * xchg eax, [esp]

68      cycles for 100 * pop+push
294     cycles for 100 * mov+mov
2169    cycles for 100 * xchg eax, [esp]

68      cycles for 100 * pop+push
293     cycles for 100 * mov+mov
2167    cycles for 100 * xchg eax, [esp]

69      cycles for 100 * pop+push
293     cycles for 100 * mov+mov
2168    cycles for 100 * xchg eax, [esp]

69      cycles for 100 * pop+push
293     cycles for 100 * mov+mov
2177    cycles for 100 * xchg eax, [esp]

3       bytes for pop+push
7       bytes for mov+mov
6       bytes for xchg eax, [esp]


Note that pop+push is a factor of 4 slower (i.e. like mov+mov) if you use the same register. Since that is rarely useful, you may try to insert some lines of other useful code between the pop and the push instructions, as in the sketch below.
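
A minimal sketch of that idea - the instructions in between are just placeholders for whatever independent work the routine has to do anyway:


pop ebx                 ; ebx = value that was at [esp]
add ecx, 4              ; unrelated work to fill the latency
lea edx, [ecx+ecx*2]    ; more independent work
push eax                ; eax goes into the slot ebx came from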

Grincheux

I am very surprised, but this explains why you wrote the m2m macro.
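
For reference, m2m in the MASM32 macros is essentially just such a push/pop pair - a sketch from memory, see macros.asm for the exact definition:


m2m MACRO M1, M2        ; copy memory to memory without needing a register
    push M2
    pop  M1
ENDM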

FORTRANS

Hi,

   Here are three systems that respond somewhat differently.  I had
expected the second and third to be similar, so the change in the
"mov+mov" test was interesting.  The P-MMX really does not like the
"xchg eax, [esp]" test - it is very slow.  The P-MMX rather likes the
"mov+mov" test though.

P-MMX

pre-P4
203 cycles for 100 * pop+push
199 cycles for 100 * mov+mov
7092 cycles for 100 * xchg eax, [esp]

204 cycles for 100 * pop+push
198 cycles for 100 * mov+mov
7119 cycles for 100 * xchg eax, [esp]

201 cycles for 100 * pop+push
201 cycles for 100 * mov+mov
7095 cycles for 100 * xchg eax, [esp]

203 cycles for 100 * pop+push
201 cycles for 100 * mov+mov
7098 cycles for 100 * xchg eax, [esp]

204 cycles for 100 * pop+push
200 cycles for 100 * mov+mov
7085 cycles for 100 * xchg eax, [esp]

3 bytes for pop+push
7 bytes for mov+mov
6 bytes for xchg eax, [esp]


--- ok ---

P-III
pre-P4 (SSE1)

103 cycles for 100 * pop+push
203 cycles for 100 * mov+mov
1718 cycles for 100 * xchg eax, [esp]

102 cycles for 100 * pop+push
202 cycles for 100 * mov+mov
1719 cycles for 100 * xchg eax, [esp]

102 cycles for 100 * pop+push
201 cycles for 100 * mov+mov
1721 cycles for 100 * xchg eax, [esp]

103 cycles for 100 * pop+push
201 cycles for 100 * mov+mov
1734 cycles for 100 * xchg eax, [esp]

102 cycles for 100 * pop+push
203 cycles for 100 * mov+mov
1723 cycles for 100 * xchg eax, [esp]

3 bytes for pop+push
7 bytes for mov+mov
6 bytes for xchg eax, [esp]


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

120 cycles for 100 * pop+push
408 cycles for 100 * mov+mov
1811 cycles for 100 * xchg eax, [esp]

113 cycles for 100 * pop+push
407 cycles for 100 * mov+mov
1804 cycles for 100 * xchg eax, [esp]

121 cycles for 100 * pop+push
406 cycles for 100 * mov+mov
1815 cycles for 100 * xchg eax, [esp]

119 cycles for 100 * pop+push
405 cycles for 100 * mov+mov
1809 cycles for 100 * xchg eax, [esp]

121 cycles for 100 * pop+push
406 cycles for 100 * mov+mov
1810 cycles for 100 * xchg eax, [esp]

3 bytes for pop+push
7 bytes for mov+mov
6 bytes for xchg eax, [esp]


--- ok ---


Cheers,

Steve N.

Grincheux

Jochen, what tool do you use? I would like to make my own tests.

For example: PUSH 384 / POP EDX
POP EBP and SUB ESP,12 ; the function had 2 arguments, so I correct the stack
JMP [ESP - 4] and I return home

I need something to profile these with.

Grincheux

FORTRANS: We must write a version for each processor!

jj2007

Quote from: Grincheux on January 11, 2016, 09:49:15 AM
Jochen, what tool do you use, I would like to make my own tests ?

In RichMasm: File/New Masm source -> Timer example (in the lower half of the green window)
Then click on Test A, Test B etc in the right bookmarks bar, and insert your code.

When you are ready, hit Ctrl S or menu File/Save - you will be prompted to choose a name.
F6 builds and runs the code.
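
If you are not using RichMasm, a rough equivalent can be put together with MichaelW's timing macros posted on this forum - a minimal sketch, assuming timers.asm is in your include path; adjust paths and the loop count to taste:


include \masm32\include\masm32rt.inc
include timers.asm                      ; MichaelW's counter_begin/counter_end macros

.code
start:
    counter_begin 1000, HIGH_PRIORITY_CLASS
      REPEAT 100
        push eax                        ; push first, then pop, so the test
        pop ebx                         ; never touches data below the caller's ESP
      ENDM
    counter_end                         ; macro leaves the cycle count in eax
    print str$(eax), " cycles for 100 * push+pop", 13, 10
    inkey
    exit
end start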

Grincheux

Brilliant!

I made some tests; this is what I wanted.

hutch--

Philippe,

The problem is that almost every processor is different, but there are at least some similarities across some families of processors. Most PIV versions, from 1.5 gig to the last 3.8 gig versions, had similar family traits: a very long pipeline and high penalties for stalls. The Core2 series were a bit easier as they had further optimisations done in hardware, and instructions like LEA were reasonably fast again, but something all of the later Intel hardware has in common is that many of the old instructions were dumped into microcode, which is a lot slower than prime-time silicon. Also, from the later Core2 series upwards, SSE instructions got a lot faster again than they were on a PIV, but funnily enough some of the common integer instructions tended to be a bit slower.

It comes from a fundamental difference between early and much later x86 hardware: very early 8088 processors chomping along at 2 meg had the common instructions we use hard coded directly into silicon, but as the much later versions were developed they shifted from CISC (complex instruction set computer) to RISC (reduced instruction set computer) designs that provide a CISC interface for compatibility reasons. The preferred instructions were written directly into silicon, while the old timers were dumped into a form of slow bulk storage that is commonly called microcode. Old code still works, but it's so slow that it is not worth using.

Now when you use AMD hardware the rules change again, as AMD were often faster than Intel with some instructions and slower with others.

Grincheux

Yes, you are right - faster on one CPU, slower on another...

I would use the SSE instruction set, but I read that sometimes the Intel instructions do one thing and the same opcodes on other CPUs (Cyrix) do another thing. We can't be sure of getting the same result on every computer.

http://www.nasm.us/xdoc/2.11.08/nasmdoc.pdf page 264
Finally, the last question I asked myself is "Why optimize? Is it useful?"
I think not, except for the fun of it... if the program you are writing will always run on your own computer, why not, but otherwise it is a lot of time lost for nothing. I am not speaking about Guga's tools, but about an ordinary program.

I think it's better to use the different parameters and possibilities of the assembler and linker rather than looking for subtleties in assembly language.

Thank You Hutch

hutch--

Philippe,

You can take the "statistics" approach: write algorithms that are faster on most hardware most of the time. Optimisation is in fact a virtuous activity in that you use more than instruction selection; you also have algo design, basic logic and benchmarking to produce better average algorithms, and that is where much of the action is these days. You will tend to set the lowest processor that can use the algo by the instruction choice, but that is generally determined by the OS version, so for example if you write an algo that has to run on modern hardware, you may only have to go back to Vista or perhaps XP. You certainly can write code that will run on Win2k or even the earlier Win9x versions, but they are now very old OS versions.

Most of the code I have to write in MASM32 has to be almost 486-compatible, which limits using SSE, AVX and later stuff, but it does keep you on your toes.  :biggrin:

guga

About optimization, the "better" technique will always depend on what the app you are creating is used for and on the way you use the instruction set to optimize it.

Although SSE instructions may be faster on one processor than on others, consider using other approaches in your own code, such as better code organization, avoiding heavy usage of loops and of nested If/Else/EndIf chains (spaghetti-style code), and reordering your code to avoid repeating the same instruction sequences back to back (to avoid stalling, for example). Code optimization (organization, in fact) is, in general, more effective than instruction optimization and has the advantage of being easier to maintain during development.

For example, if you build a function containing SSE instructions thinking that, because of that alone, your app will run faster, this is not particularly true if the rest of your code is bloated. Not that SSE instructions are not useful for that purpose, of course. What I mean is that you need to pay attention to the rest of your app's code and not just to one particular function that happens to use "faster" instruction sets.

It is what Steve said: optimization is more than instruction selection. RosAsm's help file (B_U_Asm.exe - I'll have to rename that file eventually :icon_mrgreen:) has a section about optimization techniques.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Grincheux

These last days I have read many things about optimization. Optimization means instruction speed, alignment...
But what Hutch and Guga said makes more sense; it speaks to me better.
I will change the way I see programming.