Interesting speed observations between ASM and C/C++

Started by Jason, November 06, 2024, 05:27:50 PM

Jason

Hi guys,

After getting a bit of help today building library files from some of my older assembly projects, I've noticed some interesting speed differences between ASM and C/C++.

My tests involve some heavy math calculations and I'm measuring how many calculations I can do per second in Asm against the equivalent in C++.

In Debug mode my Asm routines are up to 600% faster than their C++ counterparts, but when I switch to Release mode the C++ becomes anywhere between 200% and 2000% faster than my Asm code (depending on the function).

Would this be because the compiler is seeing the same loop over and over, can see the result isn't changing, and is therefore optimising out a lot of code?

Would the results change if the data being fed into the functions was different on each loop? (I'll try something along those lines and report back shortly).

Love to hear your thoughts. I find it to be an interesting subject.

NoCforMe

You should be able to see if the compiler is doing loop optimization by looking at your actual code in the debugger: code a simple loop in C and see if the code corresponds to what you think it should be in assembler.

It's also possible that your assembly code isn't all that efficient. Are you just doing that simple addition test, or is your function more complex?

Lots of stuff here on speed optimization if you look around; there are a lot of folks here who think that's important. (I don't, but I'm in the minority around here.)
Assembly language programming should be fun. That's why I do it.

Jason

An example is this one. Just a simple function to set the Identity Matrix (for 3D stuff).

MatrixIdentity PROC mat:DWORD
    mov eax, mat
    tempMat equ [eax.MATRIX]

    mov ecx, MATRIX_REAL_ONE
    mov tempMat._11, ecx
    mov tempMat._22, ecx
    mov tempMat._33, ecx
    mov tempMat._44, ecx

    mov ecx, MATRIX_REAL_ZERO
    mov tempMat._12, ecx
    mov tempMat._13, ecx
    mov tempMat._14, ecx
    mov tempMat._21, ecx
    mov tempMat._23, ecx
    mov tempMat._24, ecx
    mov tempMat._31, ecx
    mov tempMat._32, ecx
    mov tempMat._34, ecx
    mov tempMat._41, ecx
    mov tempMat._42, ecx
    mov tempMat._43, ecx

    ret
MatrixIdentity ENDP

I can't see how it would be possible to shorten this any further.

Then I have other more complex matrix functions, but again, can't think of any way to possibly make them leaner.

NoCforMe

Just off the top of my head here: since you're just setting 16 DWORD elements here, if they're in a contiguous block, you could do something like this:

; First set 4 elements to MATRIX_REAL_ONE:
      MOV  EAX, MATRIX_REAL_ONE
      MOV  ECX, 4
      MOV  EDI, mat
      REP  STOSD

; Now set the remainder to MATRIX_REAL_ZERO:
      MOV  EAX, MATRIX_REAL_ZERO
      MOV  ECX, 12
      REP  STOSD

Of course, if they're not in 2 contiguous blocks like I'm assuming here, you'll have to do something different. You could zero the whole thing out first, then just set those 4 elements to MATRIX_REAL_ONE.
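For comparison with the C++ side of the test, here is a minimal sketch of the "zero the whole thing, then set the four ones" idea. The Matrix struct and the function name are hypothetical stand-ins, and the sketch assumes the 16 elements are contiguous 32-bit IEEE-754 floats, so that all-zero bytes mean 0.0f.

```cpp
#include <cstring>

// Hypothetical stand-in for the thread's MATRIX structure:
// 16 contiguous 32-bit floats in row-major order.
struct Matrix {
    float m[4][4];
};

// Zero the whole block in one pass, then write only the four diagonal ones.
void matrix_identity_zero_then_diag(Matrix* mat) {
    // Assumption: IEEE-754 floats, where all-zero bits represent 0.0f.
    std::memset(mat, 0, sizeof(Matrix));
    for (int i = 0; i < 4; ++i) {
        mat->m[i][i] = 1.0f;
    }
}
```

Whether this beats sixteen plain stores depends on the compiler and CPU, so it is worth timing rather than assuming.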

Jason

Nice one! I wasn't aware of REP.

The locations getting set to 1 aren't contiguous, but I see what you're getting at there.

NoCforMe

Or you could construct the matrix in memory, then just copy the whole block over in one fell swoop:

MatrixData    DD <matrix data here>

    PUSH   ESI
    PUSH   EDI

    MOV    ESI, OFFSET MatrixData
    MOV    EDI, mat
    MOV    ECX, SIZEOF MatrixData / SIZEOF DWORD
    REP    MOVSD

    POP    EDI
    POP    ESI

More data, but mo' faster.
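The same block-copy approach can be sketched in C++ for the comparison build. The names Matrix, kIdentity and matrix_identity_copy are hypothetical; the only assumption is that the matrix is 16 contiguous floats.

```cpp
#include <cstring>

// Hypothetical stand-in for the thread's MATRIX structure.
struct Matrix {
    float m[4][4];
};

// Pre-built identity matrix kept in read-only data.
static const Matrix kIdentity = {{
    {1.0f, 0.0f, 0.0f, 0.0f},
    {0.0f, 1.0f, 0.0f, 0.0f},
    {0.0f, 0.0f, 1.0f, 0.0f},
    {0.0f, 0.0f, 0.0f, 1.0f},
}};

// Copy the whole 64-byte template over the destination in one call.
void matrix_identity_copy(Matrix* mat) {
    std::memcpy(mat, &kIdentity, sizeof(Matrix));
}
```

An optimising compiler will often expand a fixed-size memcpy like this into a handful of wide moves, which is one reason the Release build can look so different from the Debug build.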

Jason

That's an awesome idea also! Just when you think you have things optimal, someone comes along and smashes it, twice! 😂

zedd151

Quote from: Jason on November 06, 2024, 09:53:41 PM
"That's an awesome idea also! Just when you think you have things optimal, someone comes along and smashes it, twice! 😂"
That's when you ought to post your assembly code in the Laboratory, so other members can help you to optimize it. (Tip for 'next time')
We have a few members (used to be many more) that are keen on optimizing for speed, and know a few tricks.
:cool:

daydreamer

SSE instructions were designed for this kind of matrix math; they use the 128-bit registers xmm0-xmm7.
C++ speed depends on the compiler settings, and on how good the compiler is.
Even without optimization settings, Visual C++ already uses MOVAPS with the 128-bit XMM registers to initialize local arrays from data elsewhere in memory.
So in the code example you show, one MOVUPS from xmm0 can initialize 4 x 32-bit variables instead of 4 x MOV.
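A rough C++ sketch of that idea using the SSE intrinsics from xmmintrin.h, with one unaligned 128-bit store per matrix row. The Matrix struct and function name are hypothetical, and this only builds for x86/x64 targets with SSE enabled.

```cpp
#include <xmmintrin.h>  // SSE intrinsics, x86/x64 only

// Hypothetical stand-in for the thread's MATRIX structure.
struct Matrix {
    float m[4][4];
};

// Write each row with a single 128-bit store instead of four 32-bit moves.
void matrix_identity_sse(Matrix* mat) {
    // _mm_setr_ps takes its arguments in memory order (lowest address first).
    // _mm_storeu_ps is the unaligned store, so mat needs no 16-byte alignment.
    _mm_storeu_ps(mat->m[0], _mm_setr_ps(1.0f, 0.0f, 0.0f, 0.0f));
    _mm_storeu_ps(mat->m[1], _mm_setr_ps(0.0f, 1.0f, 0.0f, 0.0f));
    _mm_storeu_ps(mat->m[2], _mm_setr_ps(0.0f, 0.0f, 1.0f, 0.0f));
    _mm_storeu_ps(mat->m[3], _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f));
}
```

That is four stores instead of sixteen. If the matrix is known to be 16-byte aligned, the aligned store _mm_store_ps could be used instead.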
my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

raymond

In my opinion, your MatrixIdentity Proc is definitely not a good example to compare asm with C++.

A) Such a proc may take only a few nanoseconds as compared to the remainder of your computation, and accessing memory may not vary in any significant manner between the two languages.

B) Such a proc might be optimizable in asm but the result would be so small that its overall effect in a final program may not even be worth the effort.

The best area to concentrate on with complex computations is to restrict memory accesses whenever possible; i.e. keep intermediate results in registers whenever possible. Remember that you have 8 potential registers if you need to use the fpu.
Whenever you assume something, you risk being wrong half the time.
https://masm32.com/masmcode/rayfil/index.html

jj2007

Under the hood, C is machine code just like Assembly. It cannot be faster than the same Assembly code. So your code may be slow. Compare it to the disassembled C++ code.

zedd151

Quote from: jj2007 on November 07, 2024, 05:41:47 AM
"Under the hood, C is machine code just like Assembly. It cannot be faster than the same Assembly code. So your code may be slow. Compare it to the disassembled C++ code."
Optimized C/C++ can be faster than poorly written assembly code. Sometimes way faster.
You would indeed have to look at the disassembled C/C++ code to see if you are comparing apples to oranges or not.
:cool:

Jason

Thanks guys, I really appreciate the feedback and advice given here.

I have more matrix functions than the simple one provided, but having said that, it is obvious that I have a long way to go with code I thought was already optimal.  :badgrin:

Jason

So, this is what I don't get. I am playing around with my MatrixIdentity function right now, calling it MatrixIdentity2 for testing purposes.

All I have in here is a ret:

MatrixIdentity2 PROC mat:DWORD
ret
MatrixIdentity2 ENDP

Just running this, the DirectX MatrixIdentity call is still way faster. How is this possible?

MatrixIdentity (DX):  407901787
MatrixIdentity (ASM): 149828047
MatrixIdentity2 (ASM): 313354786
   DX faster by 130%

Press any key to continue . . .

The higher the number the faster. It's reporting how many times the function has executed per second.

I have a five second delay before the tests are conducted, to ensure that the window has settled down and is stable. And it doesn't matter if I move the DX call to be the first or last test. It always wins by a long margin.

Is there an overhead in calling a function from a static lib?


Jason

I think I may know what is going on. I took my MatrixIdentity2 function out of the test loop altogether and the result was on par with the DirectX version.

I suspect the compiler is seeing that the result of the DX MatrixIdentity call is never actually used, so the call is being optimised out altogether.

[edit]
Yep, that looks like what is happening. I pointed the DX MatrixIdentity call at a global variable. That alone wasn't enough to affect the speed. But then I added a print call to print out an element of the variable right at the end of the program (way outside of the test loop), and suddenly the DX call's speed dropped significantly.
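One way to guard a test loop against this is sketched below in C++. It assumes a fixed iteration count rather than a timed window, and Matrix, matrix_identity and calls_per_second are hypothetical names, not the DirectX API. Calling through a volatile function pointer forces a real call on every iteration, and the volatile sink makes the result count as used.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the thread's MATRIX structure.
struct Matrix {
    float m[4][4];
};

// Plain C++ identity routine used as the function under test.
void matrix_identity(Matrix* mat) {
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            mat->m[r][c] = (r == c) ? 1.0f : 0.0f;
        }
    }
}

// Times `iterations` calls and returns the number of calls per second.
double calls_per_second(std::uint64_t iterations) {
    using clock = std::chrono::steady_clock;

    // The pointer is volatile, so it must be reloaded and called each time;
    // the optimiser cannot inline the body and fold the work away.
    void (*volatile fn)(Matrix*) = matrix_identity;

    Matrix mat{};
    volatile float sink = 0.0f;  // volatile writes are observable behaviour

    const auto start = clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        fn(&mat);
        sink = mat.m[0][0];  // consume the result inside the loop
    }
    const auto stop = clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    return seconds > 0.0 ? static_cast<double>(iterations) / seconds : 0.0;
}
```

A call such as calls_per_second(100000000) then reports a figure comparable to the numbers above. The indirect call also puts the C++ routine on a more equal footing with a routine linked from a static library, which normally cannot be inlined. It is still worth checking the disassembly to confirm the loop does what you expect.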