Too simple to be true

Grincheux · January 11, 2016, 12:39:15 AM

I have the following code, shown with the value of ESP after each step:

ESP before  -> 19F89C
POP EBX     -> 19F8A0
PUSH EAX    -> 19F89C

ESP ends up where it started, so instead of POP EBX / PUSH EAX I could write:

Code Select


MOV EBX,[ESP]    ; EBX = value on top of the stack
MOV [ESP],EAX    ; overwrite it with EAX; ESP is unchanged

Is it quicker than POP EBX / PUSH EAX?

gelatine1 · January 11, 2016, 12:53:41 AM

I think it should be a little quicker, as it doesn't have to increase and then decrease ESP. Another possibility:

Code Select


mov  ebx, eax                 ; ebx = value to store
xchg ebx, dword ptr [esp]     ; swap: [esp] = eax, ebx = old top of stack

Maybe someone could test the speed with some program?
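
Before timing anything, it may help to confirm that the three sequences proposed so far leave the machine in the same state. The following is a minimal sketch in C rather than MASM, so that it can be run anywhere. The `Cpu` struct and the function names are hypothetical, and the model covers only EAX, EBX, ESP and a few stack words. Flags are left out because none of these instructions change them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: only the state this thread talks about. */
enum { STACK_WORDS = 8 };

typedef struct {
    uint32_t eax, ebx;
    uint32_t esp;                 /* index into stack[], in 32-bit words */
    uint32_t stack[STACK_WORDS];
} Cpu;

/* pop ebx / push eax */
Cpu pop_push(Cpu c) {
    assert(c.esp < STACK_WORDS);
    c.ebx = c.stack[c.esp];       /* pop ebx: read top of stack ...      */
    c.esp += 1;                   /* ... then ESP += 4 (one word here)   */
    c.esp -= 1;                   /* push eax: ESP -= 4 ...              */
    c.stack[c.esp] = c.eax;       /* ... then store EAX                  */
    return c;
}

/* mov ebx, [esp] / mov [esp], eax */
Cpu mov_mov(Cpu c) {
    assert(c.esp < STACK_WORDS);
    c.ebx = c.stack[c.esp];       /* ESP is never touched */
    c.stack[c.esp] = c.eax;
    return c;
}

/* mov ebx, eax / xchg ebx, [esp] */
Cpu mov_xchg(Cpu c) {
    uint32_t old_top;
    assert(c.esp < STACK_WORDS);
    c.ebx = c.eax;
    old_top = c.stack[c.esp];     /* xchg: swap EBX with the top of stack */
    c.stack[c.esp] = c.ebx;
    c.ebx = old_top;
    return c;
}

/* Returns 1 when two states agree in registers, ESP and stack contents. */
int same_state(const Cpu *a, const Cpu *b) {
    int i;
    if (a->eax != b->eax || a->ebx != b->ebx || a->esp != b->esp) return 0;
    for (i = 0; i < STACK_WORDS; i++)
        if (a->stack[i] != b->stack[i]) return 0;
    return 1;
}
```

Running all three functions on the same starting state and comparing the results with `same_state` shows that they agree. The sequences differ only in speed and code size, which is what the timings in the following posts measure.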

Grincheux · January 11, 2016, 02:47:04 AM

I thought of that too. Perhaps it is better.

jj2007 · January 11, 2016, 04:18:18 AM

push & pop are highly optimised instructions.
The same goes for mov, but the timings show that in this case it is slower.
Finally, xchg reg, mem is known to perform very badly (with a memory operand it is implicitly locked); see below.

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
69      cycles for 100 * pop+push
293     cycles for 100 * mov+mov
2167    cycles for 100 * xchg eax, [esp]

68      cycles for 100 * pop+push
294     cycles for 100 * mov+mov
2169    cycles for 100 * xchg eax, [esp]

68      cycles for 100 * pop+push
293     cycles for 100 * mov+mov
2167    cycles for 100 * xchg eax, [esp]

69      cycles for 100 * pop+push
293     cycles for 100 * mov+mov
2168    cycles for 100 * xchg eax, [esp]

69      cycles for 100 * pop+push
293     cycles for 100 * mov+mov
2177    cycles for 100 * xchg eax, [esp]

3       bytes for pop+push
7       bytes for mov+mov
6       bytes for xchg eax, [esp]

Note that pop+push is about a factor of 4 slower (i.e. on a par with mov+mov) if both instructions use the same register. Since that is rarely useful, you may try inserting a few lines of other useful code between the pop and the push.
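
To read the figures above as per-pair costs and relative slowdowns, the arithmetic can be sketched as follows. This is a small illustration in C, and the function names are made up for this example. It works only on the cycle counts posted above and does not measure anything itself.

```c
#include <assert.h>

/* Average cost of one repetition, given the cycles reported for
   `reps` repetitions (the tests above use reps = 100). */
double cycles_per_rep(unsigned total_cycles, unsigned reps) {
    assert(reps > 0);
    return (double)total_cycles / (double)reps;
}

/* How many times slower `candidate_cycles` is than `baseline_cycles`. */
double slowdown(unsigned candidate_cycles, unsigned baseline_cycles) {
    assert(baseline_cycles > 0);
    return (double)candidate_cycles / (double)baseline_cycles;
}
```

With the first run on the i5, `cycles_per_rep(69, 100)` gives 0.69 cycles per pop+push pair, `slowdown(293, 69)` gives about 4.2 for mov+mov, and `slowdown(2167, 69)` gives about 31 for the xchg test.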

Grincheux · January 11, 2016, 08:27:31 AM

I am very surprised, but this explains why you wrote the m2m macro.

FORTRANS · January 11, 2016, 09:48:17 AM

Hi,

Here are three systems that respond somewhat differently. I had expected
the second and third to be similar, so the change in the "mov+mov"
test was interesting. The P-MMX really does not like the "xchg eax, [esp]"
test: it is very slow. It rather likes the "mov+mov" test, though.

Code Select

P-MMX

pre-P4
203	cycles for 100 * pop+push
199	cycles for 100 * mov+mov
7092	cycles for 100 * xchg eax, [esp]

204	cycles for 100 * pop+push
198	cycles for 100 * mov+mov
7119	cycles for 100 * xchg eax, [esp]

201	cycles for 100 * pop+push
201	cycles for 100 * mov+mov
7095	cycles for 100 * xchg eax, [esp]

203	cycles for 100 * pop+push
201	cycles for 100 * mov+mov
7098	cycles for 100 * xchg eax, [esp]

204	cycles for 100 * pop+push
200	cycles for 100 * mov+mov
7085	cycles for 100 * xchg eax, [esp]

3	bytes for pop+push
7	bytes for mov+mov
6	bytes for xchg eax, [esp]


--- ok ---

P-III
pre-P4 (SSE1)

103	cycles for 100 * pop+push
203	cycles for 100 * mov+mov
1718	cycles for 100 * xchg eax, [esp]

102	cycles for 100 * pop+push
202	cycles for 100 * mov+mov
1719	cycles for 100 * xchg eax, [esp]

102	cycles for 100 * pop+push
201	cycles for 100 * mov+mov
1721	cycles for 100 * xchg eax, [esp]

103	cycles for 100 * pop+push
201	cycles for 100 * mov+mov
1734	cycles for 100 * xchg eax, [esp]

102	cycles for 100 * pop+push
203	cycles for 100 * mov+mov
1723	cycles for 100 * xchg eax, [esp]

3	bytes for pop+push
7	bytes for mov+mov
6	bytes for xchg eax, [esp]


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

120	cycles for 100 * pop+push
408	cycles for 100 * mov+mov
1811	cycles for 100 * xchg eax, [esp]

113	cycles for 100 * pop+push
407	cycles for 100 * mov+mov
1804	cycles for 100 * xchg eax, [esp]

121	cycles for 100 * pop+push
406	cycles for 100 * mov+mov
1815	cycles for 100 * xchg eax, [esp]

119	cycles for 100 * pop+push
405	cycles for 100 * mov+mov
1809	cycles for 100 * xchg eax, [esp]

121	cycles for 100 * pop+push
406	cycles for 100 * mov+mov
1810	cycles for 100 * xchg eax, [esp]

3	bytes for pop+push
7	bytes for mov+mov
6	bytes for xchg eax, [esp]


--- ok ---

Cheers,

Steve N.

Grincheux · January 11, 2016, 09:49:15 AM

Jochen, what tool do you use? I would like to run my own tests.

Example : PUSH 384 / POP EDX
POP EBP / AND SUB ESP,12 ; Function had 2 arguments, I correct the stack
JMP [ESP - 4] and I return home

I need something to profile with.

Grincheux · January 11, 2016, 09:51:44 AM

FORTRANS, we must write a version for each processor!

jj2007 · January 11, 2016, 11:59:41 AM

Quote from: Grincheux on January 11, 2016, 09:49:15 AM
Jochen, what tool do you use? I would like to run my own tests.

In RichMasm: File/New Masm source -> Timer example (in the lower half of the green window)
Then click on Test A, Test B etc in the right bookmarks bar, and insert your code.

When you are ready, hit Ctrl S or menu File/Save - you will be prompted to choose a name.
F6 builds and runs the code.

Grincheux · January 11, 2016, 04:15:32 PM

Great!

I ran some tests; this is what I wanted.

hutch-- · January 11, 2016, 04:18:17 PM

Philippe,

The problem is that almost every processor is different, although there is at least some similarity within families of processors. Most PIV versions, from 1.5 GHz to the last 3.8 GHz models, shared family traits: a very long pipeline and high penalties for stalls. The Core2 series was a bit easier, as further optimisations were done in hardware and instructions like LEA were reasonably fast again. One thing all of the later Intel hardware has in common is that many of the old instructions were dumped into microcode, which is a lot slower than prime-time silicon. Also, from the later Core2 series upwards, SSE instructions got a lot faster than they were on a PIV, but funnily enough some of the common integer instructions tended to be a bit slower.

It comes from a fundamental difference between early and much later x86 hardware. Very early 8088 processors chomping along at 2 meg had the common instructions we use hard-coded directly into silicon, but the much later versions shifted from CISC (complex instruction set computer) to RISC (reduced instruction set computer) designs that provide a CISC interface for compatibility reasons. The preferred instructions were written directly into silicon, while the old-timers were dumped into a form of slow bulk storage that is commonly called microcode. Old code still works, but it's so slow that it is not worth using.

Now when you use AMD hardware the rules change, as AMD was often faster than Intel with some instructions and slower with others.

Grincheux · January 11, 2016, 04:51:43 PM

Yes, you are right: faster on one CPU, slower on another...

I would use the SSE instruction set, but I read that sometimes an Intel instruction does one thing and the same opcode does something else on other CPUs (Cyrix). We can't be sure of getting the same result on every computer.

http://www.nasm.us/xdoc/2.11.08/nasmdoc.pdf page 264
Finally, the last question I asked myself is "Why optimize? Is it useful?"
I think not, except for fun. If the program you are writing will always run on your own computer, why not, but otherwise it is a lot of time lost for nothing. I am not talking about Guga's tools but about an ordinary program.

I think it's better to use the various parameters and options of the assembler and linker than to look for subtleties in assembly language.

Thank you, Hutch.

hutch-- · January 11, 2016, 06:01:47 PM

Philippe,

You can take the "statistics" approach: write algorithms that are faster on most hardware most of the time. Optimisation is in fact a virtuous activity, in that you use more than instruction selection: you also have algo design, basic logic and benchmarking to produce better average algorithms, and that is much of where the action is these days. Your choice of instructions will tend to set the lowest processor that can run the algo, but that is generally determined by the OS version, so for example if you write an algo that has to run on modern hardware, you may only have to go back to Vista or perhaps XP. You certainly can write code that will run on Win2k or even the earlier Win9x versions, but they are now very old OS versions.

Most of the code I have to write in MASM32 has to be almost 486 compatible, which limits the use of SSE, AVX and later stuff, but it does keep you on your toes.
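
As a rough illustration of the "statistics" approach, the timings posted earlier in this thread can be used to pick the variant that wins on the most machines. This is only a sketch in C under stated assumptions: the table holds the first run reported for each CPU, the function names are invented for the example, and "fastest on the most CPUs" is just one possible way to score "most hardware most of the time".

```c
#include <assert.h>
#include <stddef.h>

enum { N_VARIANTS = 3 };  /* 0 = pop+push, 1 = mov+mov, 2 = xchg eax, [esp] */

/* First run posted for each CPU in this thread (cycles per 100 reps). */
const unsigned thread_timings[4][N_VARIANTS] = {
    {  69, 293, 2167 },   /* Core i5-2450M */
    { 203, 199, 7092 },   /* P-MMX         */
    { 103, 203, 1718 },   /* P-III         */
    { 120, 408, 1811 },   /* Pentium M     */
};

/* Index of the variant with the lowest cycle count on one CPU. */
size_t fastest_variant(const unsigned row[N_VARIANTS]) {
    size_t best = 0;
    size_t v;
    for (v = 1; v < N_VARIANTS; v++)
        if (row[v] < row[best]) best = v;
    return best;
}

/* Index of the variant that is fastest on the largest number of CPUs
   (ties go to the lower index). */
size_t most_frequent_winner(const unsigned t[][N_VARIANTS], size_t n_cpus) {
    size_t wins[N_VARIANTS] = {0};
    size_t best = 0;
    size_t c, v;
    assert(n_cpus > 0);
    for (c = 0; c < n_cpus; c++)
        wins[fastest_variant(t[c])]++;
    for (v = 1; v < N_VARIANTS; v++)
        if (wins[v] > wins[best]) best = v;
    return best;
}
```

On these four machines pop+push is fastest on three and mov+mov only on the P-MMX, so `most_frequent_winner(thread_timings, 4)` selects pop+push. A different scoring rule, such as the smallest worst-case slowdown, could be swapped in without changing the table.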

guga · January 12, 2016, 01:30:32 AM

About optimization: the "better" technique will always depend on how the app you are creating is used and on the way you use the set of instructions you are optimizing.

Although SSE instructions may be faster on one processor than on others, consider other approaches in your own code, such as better code organization, avoiding heavy use of loops and of nested If/Else/EndIf chains (spaghetti code style), reordering your code to avoid runs of the same kind of instruction (to avoid stalling, for example), etc. Code optimization (organization, in fact) is, in general, more effective than instruction optimization and has the advantage of being easier to maintain during development.

For example, if you build a function containing SSE instructions thinking that this alone will make your app run faster, that is not necessarily true if the rest of your code is bloated. Not that SSE instructions are not useful for that purpose, of course. What I mean is that you need to pay attention to the rest of your app's code and not only to one particular function that happens to use a "faster" instruction set.

It is as Steve said: optimization is more than instruction selection. RosAsm's help file (B_U_Asm.exe; I'll have to rename that file eventually) has a section about optimization techniques.

Grincheux · January 12, 2016, 01:52:46 AM

These last few days I have read many things about optimization. Optimization means instruction speed, alignment...
But what Hutch and Guga said makes more sense and speaks to me more.
I will change the way I see programming.

The MASM Forum
