The MASM Forum

General => The Campus => Topic started by: Grincheux on January 11, 2016, 12:39:15 AM

Title: Too simple to be true
Post by: Grincheux on January 11, 2016, 12:39:15 AM
I have the following code:


ESP BEFORE------------------------------->19F89C
POP EBX---------------------------> 19F8A0
PUSH EAX----------------->19F89C


So why not replace the POP EBX / PUSH EAX pair with


MOV EBX,[ESP]
MOV [ESP],EAX


Is it quicker than POP EBX / PUSH EAX?
Title: Re: Too simple to be true
Post by: gelatine1 on January 11, 2016, 12:53:41 AM
I think it should be a little quicker as it doesn't have to increase/decrease esp. Another possibility:


Mov ebx,eax
Xchg ebx, dword ptr [esp]


Maybe someone could test the speed with some program?
Title: Re: Too simple to be true
Post by: Grincheux on January 11, 2016, 02:47:04 AM
I thought of that too. Perhaps it is better.
Title: Re: Too simple to be true
Post by: jj2007 on January 11, 2016, 04:18:18 AM
push & pop are highly optimised instructions.
The same for mov, but the timings say in this case it's worse.
Finally, xchg reg, mem is known to have awful performance, see below.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
69      cycles for 100 * pop+push
293     cycles for 100 * mov+mov
2167    cycles for 100 * xchg eax, [esp]

68      cycles for 100 * pop+push
294     cycles for 100 * mov+mov
2169    cycles for 100 * xchg eax, [esp]

68      cycles for 100 * pop+push
293     cycles for 100 * mov+mov
2167    cycles for 100 * xchg eax, [esp]

69      cycles for 100 * pop+push
293     cycles for 100 * mov+mov
2168    cycles for 100 * xchg eax, [esp]

69      cycles for 100 * pop+push
293     cycles for 100 * mov+mov
2177    cycles for 100 * xchg eax, [esp]

3       bytes for pop+push
7       bytes for mov+mov
6       bytes for xchg eax, [esp]


Note that pop+push is a factor of 4 slower (i.e. like mov+mov) if you use the same reg, since that creates a dependency chain. As using the same reg is rarely useful anyway, you may try to insert some lines of other useful code between the pop and push instructions.
Title: Re: Too simple to be true
Post by: Grincheux on January 11, 2016, 08:27:31 AM
I am very surprised, but this explains why you wrote the m2m macro.
Title: Re: Too simple to be true
Post by: FORTRANS on January 11, 2016, 09:48:17 AM
Hi,

   Three systems that respond somewhat differently.  I had expected
the second and third to be similar, so the change in the "mov+mov"
test was interesting.  The P-MMX really does not like the "xchg eax, [esp]"
test; it is very slow.  The P-MMX rather likes the "mov+mov" test though.

P-MMX

pre-P4
203 cycles for 100 * pop+push
199 cycles for 100 * mov+mov
7092 cycles for 100 * xchg eax, [esp]

204 cycles for 100 * pop+push
198 cycles for 100 * mov+mov
7119 cycles for 100 * xchg eax, [esp]

201 cycles for 100 * pop+push
201 cycles for 100 * mov+mov
7095 cycles for 100 * xchg eax, [esp]

203 cycles for 100 * pop+push
201 cycles for 100 * mov+mov
7098 cycles for 100 * xchg eax, [esp]

204 cycles for 100 * pop+push
200 cycles for 100 * mov+mov
7085 cycles for 100 * xchg eax, [esp]

3 bytes for pop+push
7 bytes for mov+mov
6 bytes for xchg eax, [esp]


--- ok ---

P-III
pre-P4 (SSE1)

103 cycles for 100 * pop+push
203 cycles for 100 * mov+mov
1718 cycles for 100 * xchg eax, [esp]

102 cycles for 100 * pop+push
202 cycles for 100 * mov+mov
1719 cycles for 100 * xchg eax, [esp]

102 cycles for 100 * pop+push
201 cycles for 100 * mov+mov
1721 cycles for 100 * xchg eax, [esp]

103 cycles for 100 * pop+push
201 cycles for 100 * mov+mov
1734 cycles for 100 * xchg eax, [esp]

102 cycles for 100 * pop+push
203 cycles for 100 * mov+mov
1723 cycles for 100 * xchg eax, [esp]

3 bytes for pop+push
7 bytes for mov+mov
6 bytes for xchg eax, [esp]


--- ok ---

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

120 cycles for 100 * pop+push
408 cycles for 100 * mov+mov
1811 cycles for 100 * xchg eax, [esp]

113 cycles for 100 * pop+push
407 cycles for 100 * mov+mov
1804 cycles for 100 * xchg eax, [esp]

121 cycles for 100 * pop+push
406 cycles for 100 * mov+mov
1815 cycles for 100 * xchg eax, [esp]

119 cycles for 100 * pop+push
405 cycles for 100 * mov+mov
1809 cycles for 100 * xchg eax, [esp]

121 cycles for 100 * pop+push
406 cycles for 100 * mov+mov
1810 cycles for 100 * xchg eax, [esp]

3 bytes for pop+push
7 bytes for mov+mov
6 bytes for xchg eax, [esp]


--- ok ---


Cheers,

Steve N.
Title: Re: Too simple to be true
Post by: Grincheux on January 11, 2016, 09:49:15 AM
Jochen, what tool do you use? I would like to make my own tests.

Example: PUSH 384 / POP EDX
POP EBP and SUB ESP,12 ; the function had 2 arguments, so I correct the stack
JMP [ESP - 4] and I return home

I need something to profile
Title: Re: Too simple to be true
Post by: Grincheux on January 11, 2016, 09:51:44 AM
FORTRANS, we must write a version for each processor!
Title: Re: Too simple to be true
Post by: jj2007 on January 11, 2016, 11:59:41 AM
Quote from: Grincheux on January 11, 2016, 09:49:15 AM
Jochen, what tool do you use, I would like to make my own tests ?

In RichMasm: File/New Masm source -> Timer example (in the lower half of the green window)
Then click on Test A, Test B etc in the right bookmarks bar, and insert your code.

When you are ready, hit Ctrl S or menu File/Save - you will be prompted to choose a name.
F6 builds and runs the code.
Title: Re: Too simple to be true
Post by: Grincheux on January 11, 2016, 04:15:32 PM
Brilliant!

I made some tests; this is what I wanted.
Title: Re: Too simple to be true
Post by: hutch-- on January 11, 2016, 04:18:17 PM
Philippe,

The problem is that almost every processor is different, but there is at least some similarity across some families of processors. Most PIV versions, from the 1.5 gig up to the last 3.8 gig versions, had similar family traits: a very long pipeline and high penalties for stalls. The Core2 series were a bit easier as they had further optimisations done in hardware; instructions like LEA were reasonably fast again, but, in common with all of the later Intel hardware, many of the old instructions were dumped into microcode, which is a lot slower than prime-time silicon. Also, from the later Core2 series upwards, SSE instructions got a lot faster than they were on a PIV, but funnily enough some of the common integer instructions tended to be a bit slower.

It comes from a fundamental difference between early and much later x86 hardware. Very early 8088 processors chomping along at 2 meg had the common instructions we use hard coded directly into silicon, but as the much later versions were developed they shifted from CISC (complex instruction set computers) to RISC (reduced instruction set computers) designs that provided a CISC interface for compatibility reasons. The preferred instructions were written directly into silicon, while the old-timers were dumped into a form of slow bulk storage that is commonly called microcode. Old code still works, but it is so slow that it is not worth using.

Now when you use AMD hardware, the rules change as often AMD were faster than Intel with some instructions and slower with others.
Title: Re: Too simple to be true
Post by: Grincheux on January 11, 2016, 04:51:43 PM
Yes, you are right: faster on one CPU, slower on another...

I would use the SSE instruction set, but I read that sometimes an Intel instruction does one thing and the same opcode on other CPUs (Cyrix) does another thing. We can't be sure of getting the same result on every computer.

http://www.nasm.us/xdoc/2.11.08/nasmdoc.pdf page 264
Finally, the last question I asked myself is "Why optimize? Is it useful?"
I think not, except for fun... If the program you are writing will always run on your own computer, why not, but otherwise it's losing a lot of time for nothing. I am not speaking about Guga's tools, but about an ordinary program.

I think it's better to use the different parameters and possibilities of the assembler and linker rather than looking for subtleties in assembly language.

Thank You Hutch
Title: Re: Too simple to be true
Post by: hutch-- on January 11, 2016, 06:01:47 PM
Philippe,

You can take the "statistics" approach: write algorithms that are faster on most hardware most of the time. Optimisation is in fact a virtuous activity in that you use more than instruction selection: you also have algo design, basic logic and benchmarking to produce better average algorithms, and that is much of where the action is these days. You will tend to set the lowest processor that can use the algo by the instruction choice, but that is generally determined by the OS version, so for example if you wrote an algo that has to run on modern hardware, you may only have to go back to Vista or perhaps XP. You certainly can write code that will run on Win2k or even the earlier Win9x versions, but they are now very old OS versions.

Most of the code I have to write in MASM32 has to be almost 486 compatible which limits using SSE, AVX and later stuff but it does keep you on your toes.  :biggrin:
Title: Re: Too simple to be true
Post by: guga on January 12, 2016, 01:30:32 AM
About optimization, the "better" technique will always depend on the usage of the app you are creating and on the way you use the instruction set to optimize.

Although SSE instructions may be faster on one processor than on others, consider using other approaches in your own code, such as: better code organization, avoiding heavy usage of loops and nested If/Else/EndIf chains (spaghetti-code style), reordering your code to avoid sequences of the same instructions (to avoid stalling, for example), etc. Code optimization (organization, in fact) is in general more effective than instruction optimization, and has the advantage of being easier to maintain in terms of development.

For example, if you build a function containing SSE instructions thinking that, because of that alone, your app will run faster, this is not particularly true if the rest of your code is bloated. Not that SSE instructions are not useful for that purpose, of course. What I meant is that you need to pay attention to the rest of the code of your app, and not only to a particular function that eventually uses "faster" instruction sets.

It is what Steve said: optimization is more than instruction selection. RosAsm's help file (B_U_Asm.exe - I'll have to rename that file eventually :icon_mrgreen:) has a section about optimization techniques.
Title: Re: Too simple to be true
Post by: Grincheux on January 12, 2016, 01:52:46 AM
These last days I have read many things about optimization. Optimization means instruction speed, alignment...
But what Hutch and Guga said makes more sense; it speaks better to me.
I will change the way I see programming.
Title: Re: Too simple to be true
Post by: Gunther on January 12, 2016, 04:22:35 AM
Hi Philippe,

Quote from: Grincheux on January 12, 2016, 01:52:46 AM
These last days I have read many things about optimization. Optimization means instruction speed, alignment...
But what Hutch and Guga said makes more sense; it speaks better to me.
I will change the way I see programming.

Have you checked Agner Fog's (http://agner.org/optimize/) manuals?

Gunther
Title: Re: Too simple to be true
Post by: jj2007 on January 12, 2016, 04:49:19 AM
Quote from: Grincheux on January 12, 2016, 01:52:46 AM
These last days I have read many things about optimization. Optimization means instruction speed, alignment...

It means a lot of things. As Hutch wrote earlier, a redesign of a procedure may dramatically speed up your code. But that is not always possible, and therefore sometimes raw power is important. Especially libraries like the CRT should be fast for frequently used functions, such as strlen or instr. That is one area where SSE can be very useful.

The same applies to sort functions, for example. No problem if you want to sort a thousand lines, but if it approaches the millions, you had better use the fastest available algo. The Masm32 Laboratory is a fascinating place in this respect :P
Title: Re: Too simple to be true
Post by: Grincheux on January 12, 2016, 06:05:39 AM
Gunther, I have read and re-read Agner Fog's manuals, AMD, Intel, Sandpile, NASM... and more.
I have downloaded many files (asm, pdf, html...)
But I agree with Hutch, Guga and JJ2007: it is not always a question of speed; sometimes the way we code the algorithm is more important than selecting the best opcode.
Title: Re: Too simple to be true
Post by: Gunther on January 14, 2016, 04:18:11 AM
Quote from: Grincheux on January 12, 2016, 06:05:39 AM
But I agree with Hutch, Guga and JJ2007: it is not always a question of speed; sometimes the way we code the algorithm is more important than selecting the best opcode.

No doubt about that, but Agner's manuals are a must. You should have a look into those.

Gunther
Title: Re: Too simple to be true
Post by: Grincheux on January 14, 2016, 04:57:14 AM
I have read them
Title: Re: Too simple to be true
Post by: dedndave on January 14, 2016, 06:26:47 AM
algorithm design is first and foremost
selecting the right approach to a problem far outweighs specific instructions used

many of the algorithms in the masm32 package are faster than they really need to be   :biggrin:
i mean, how many times are you going to convert a value to a hex string ?
i might write a program to convert 4 or 5 values and display them
i have never needed a program to perform thousands upon thousands of such conversions
for 4 or 5 values, you would hardly notice a speed difference of 10 ns or 10 ms

but, the algo design is fast
they use a look-up table for maximum performance
that's because it is a "general purpose" routine - to be used in many different ways
Title: Re: Too simple to be true
Post by: FORTRANS on January 16, 2016, 09:53:45 AM
Hi Steve,

Quote from: hutch-- on January 11, 2016, 04:18:17 PM
It comes from a fundamental difference between early and much later x86 hardware. Very early 8088 processors chomping along at 2 meg had the common instructions we use hard coded directly into silicon, but as the much later versions were developed they shifted from CISC (complex instruction set computers) to RISC (reduced instruction set computers) designs that provided a CISC interface for compatibility reasons. The preferred instructions were written directly into silicon, while the old-timers were dumped into a form of slow bulk storage that is commonly called microcode. Old code still works, but it is so slow that it is not worth using.

   Just for laughs I coded up the above three tests in a 16-bit version.
The code is obviously a bit different, but hopefully similar in intent.
I ran it on an 80186 for timings.

PUSH CX, POP AX

Timed count: 02795 microseconds

MOV

Timed count: 04477 microseconds

XCHG

Timed count: 03728 microseconds


Cheers,

Steve N.