Interesting speed observations between ASM and C/C++

Started by Jason, November 06, 2024, 05:27:50 PM


Jason

Hi guys,

After getting a bit of help today making library files from older assembly projects of mine, I'm noticing some interesting speed differences between ASM and C/C++.

My tests involve some heavy math calculations and I'm measuring how many calculations I can do per second in Asm against the equivalent in C++.

In Debug mode my ASM routines are anywhere up to 600% faster than their C++ counterparts, but when I switch to Release mode the C++ becomes anywhere between 200% and 2000% faster than my ASM code (depending on the function).

Would this be because the compiler sees the same loop over and over, can tell the result isn't changing, and is therefore optimising out a lot of code?

Would the results change if the data being fed in to the functions was different on each loop? (I'll try something along those lines and report back shortly).

Love to hear your thoughts. I find it to be an interesting subject.
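The effect I'm suspecting can be sketched in C (a made-up illustration, not my real test code -- `heavy_calc` and `bench_varied` are stand-in names): a call whose input never changes and whose result is never read can be hoisted out of the loop or deleted entirely in Release mode, while feeding a different input each pass and accumulating into a `volatile` keeps the work real.

```c
/* Hypothetical stand-in for one of the heavy math routines -- not code
   from this thread, just an illustration of the benchmarking trap. */
static double heavy_calc(double x) {
    double acc = 0.0;
    for (int i = 1; i <= 1000; i++)
        acc += x / (double)i;
    return acc;
}

/* Calling with the SAME input and ignoring the result lets an optimizer
   hoist or delete the work entirely. Feeding a DIFFERENT input each pass
   and accumulating into a volatile keeps the loop honest, because every
   store to a volatile object is observable and cannot be removed. */
double bench_varied(int iterations) {
    volatile double sink = 0.0;     /* volatile: each result must be stored */
    for (int i = 0; i < iterations; i++)
        sink += heavy_calc((double)i);
    return sink;
}
```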

NoCforMe

You should be able to tell whether the compiler is doing loop optimization by looking at the actual generated code in the debugger: write a simple loop in C and check whether the output corresponds to what you think it should be in assembler.

It's also possible that your assembly code isn't all that efficient. Are you just doing that simple addition test, or is your function more complex?

Lots of stuff here on speed optimization if you look around; there are a lot of folks here who think that's important. (I don't, but I'm in the minority around here.)
Assembly language programming should be fun. That's why I do it.

Jason

An example is this one. Just a simple function to set the Identity Matrix (for 3D stuff).

MatrixIdentity PROC mat:DWORD
    mov     eax, mat
    tempMat EQU [eax.MATRIX]

    mov     ecx, MATRIX_REAL_ONE
    mov     tempMat._11, ecx
    mov     tempMat._22, ecx
    mov     tempMat._33, ecx
    mov     tempMat._44, ecx

    mov     ecx, MATRIX_REAL_ZERO
    mov     tempMat._12, ecx
    mov     tempMat._13, ecx
    mov     tempMat._14, ecx
    mov     tempMat._21, ecx
    mov     tempMat._23, ecx
    mov     tempMat._24, ecx
    mov     tempMat._31, ecx
    mov     tempMat._32, ecx
    mov     tempMat._34, ecx
    mov     tempMat._41, ecx
    mov     tempMat._42, ecx
    mov     tempMat._43, ecx

    ret
MatrixIdentity ENDP

I can't see how it would be possible to shorten this any further.

Then I have other more complex matrix functions, but again, can't think of any way to possibly make them leaner.

NoCforMe

Just off the top of my head here: since you're just setting 16 DWORD elements here, if they're in a contiguous block, you could do something like this:

; First set 4 elements to MATRIX_REAL_ONE:
      PUSH EDI                    ; EDI is callee-saved; preserve it
      CLD                         ; make sure STOSD moves forward
      MOV  EAX, MATRIX_REAL_ONE
      MOV  ECX, 4
      MOV  EDI, mat
      REP  STOSD

; Now set the remainder to MATRIX_REAL_ZERO:
      MOV  EAX, MATRIX_REAL_ZERO
      MOV  ECX, 12
      REP  STOSD
      POP  EDI

Of course, if they're not in 2 contiguous blocks like I'm assuming here, you'll have to do something different. You could zero the whole thing out first, then just set those 4 elements to MATRIX_REAL_ONE.
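In C terms, that alternative looks roughly like this (a sketch only -- `Matrix` here is just a plain 4x4 float array standing in for the thread's MATRIX struct): clear the whole block in one contiguous pass, then write only the four diagonal elements.

```c
#include <string.h>

#define N 4

/* Plain 4x4 float matrix; assumed equivalent to the MATRIX struct and
   the MATRIX_REAL_ONE/MATRIX_REAL_ZERO constants in the thread. */
typedef struct { float m[N][N]; } Matrix;

/* Zero everything in one contiguous clear (the C analogue of REP STOSD
   over the whole block), then set just the diagonal ones. */
void matrix_identity(Matrix *mat) {
    memset(mat, 0, sizeof *mat);
    for (int i = 0; i < N; i++)
        mat->m[i][i] = 1.0f;
}
```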
Assembly language programming should be fun. That's why I do it.

Jason

Nice one! I wasn't aware of REP.

The locations getting set to 1 aren't contiguous, but I see what you're getting at there.

NoCforMe

Or you could construct the matrix in memory, then just copy the whole block over in one fell swoop:

MatrixData    DD <matrix data here>

    PUSH   ESI
    PUSH   EDI

    MOV    ESI, OFFSET MatrixData
    MOV    EDI, mat
    MOV    ECX, SIZEOF MatrixData / SIZEOF DWORD
    REP    MOVSD

    POP    EDI
    POP    ESI

More data, but mo' faster.
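In C this is just a block copy from a prebuilt template (sketch only; `identity_template` is my stand-in for MatrixData):

```c
#include <string.h>

/* Static prebuilt identity matrix, copied in one block move -- the C
   analogue of MOV ESI,OFFSET MatrixData / REP MOVSD above. */
static const float identity_template[16] = {
    1.0f, 0.0f, 0.0f, 0.0f,
    0.0f, 1.0f, 0.0f, 0.0f,
    0.0f, 0.0f, 1.0f, 0.0f,
    0.0f, 0.0f, 0.0f, 1.0f
};

/* mat must point to at least 16 floats. */
void matrix_identity_copy(float *mat) {
    memcpy(mat, identity_template, sizeof identity_template);
}
```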
Assembly language programming should be fun. That's why I do it.

Jason

That's an awesome idea also! Just when you think you have things optimal, someone comes along and smashes it, twice! 😂

zedd151

Quote from: Jason on November 06, 2024, 09:53:41 PM
That's an awesome idea also! Just when you think you have things optimal, someone comes along and smashes it, twice! 😂
That's when you ought to post your assembly code in the Laboratory, so other members can help you to optimize it. (Tip for 'next time')
We have a few members (used to be many more) that are keen on optimizing for speed, and know a few tricks.

daydreamer

SSE instructions were designed for matrix math like this, using the 128-bit xmm0-xmm7 registers.
C++ speed depends on the compiler settings, combined with a better or worse compiler.
Visual C++, even without optimization settings enabled, already uses MOVAPS with the 128-bit xmm registers to initialise local arrays from elsewhere in memory.
So in the code example you show, a single movups xmm0 store can initialise 4 x 32-bit variables instead of 4 x mov.
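A rough C-intrinsics sketch of the same idea (my illustration; the function name is made up): one 128-bit store per row instead of four scalar moves. Note that `_mm_set_ps` takes its arguments highest-element-first, so the last argument lands at the lowest address.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* mat points to 16 floats (4x4, row-major); storeu means it need not
   be 16-byte aligned. Each store writes a whole row at once. */
void matrix_identity_sse(float *mat) {
    _mm_storeu_ps(mat +  0, _mm_set_ps(0.0f, 0.0f, 0.0f, 1.0f)); /* row 0: 1 0 0 0 */
    _mm_storeu_ps(mat +  4, _mm_set_ps(0.0f, 0.0f, 1.0f, 0.0f)); /* row 1: 0 1 0 0 */
    _mm_storeu_ps(mat +  8, _mm_set_ps(0.0f, 1.0f, 0.0f, 0.0f)); /* row 2: 0 0 1 0 */
    _mm_storeu_ps(mat + 12, _mm_set_ps(1.0f, 0.0f, 0.0f, 0.0f)); /* row 3: 0 0 0 1 */
}
```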
my non-asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

raymond

In my opinion, your MatrixIdentity Proc is definitely not a good example to compare asm with C++.

A) Such a proc may take only a few nanoseconds as compared to the remainder of your computation, and accessing memory may not vary in any significant manner between the two languages.

B) Such a proc might be optimizable in asm, but the gain would be so small that its overall effect in a final program may not even be worth the effort.

The best area to concentrate on with complex computations is to restrict memory accesses whenever possible; i.e. keep intermediate results in registers whenever possible. Remember that you have 8 potential registers if you need to use the fpu.
Whenever you assume something, you risk being wrong half the time.
https://masm32.com/masmcode/rayfil/index.html

jj2007

Under the hood, C is machine code just like Assembly. It cannot be faster than the same Assembly code. So your code may be slow. Compare it to the disassembled C++ code.

zedd151

Quote from: jj2007 on November 07, 2024, 05:41:47 AM
Under the hood, C is machine code just like Assembly. It cannot be faster than the same Assembly code. So your code may be slow. Compare it to the disassembled C++ code.
Optimized C/C++ can be faster than poorly written assembly code. Sometimes way faster.
You would indeed have to look at the disassembled C/C++ code to see if you are comparing apples to oranges or not.

Jason

Thanks guys, I really appreciate the feedback and advice given here.

I have more matrix functions than the simple one provided, but having said that, it is obvious that I have a long way to go with what I thought was seemingly optimal.  :badgrin:

Jason

So, this is what I don't get. I am playing around with my MatrixIdentity function right now, calling it MatrixIdentity2 for testing purposes.

All I have in it is a return:

MatrixIdentity2 PROC mat:DWORD
    ret
MatrixIdentity2 ENDP

Just running this, the DirectX MatrixIdentity call is still way faster. How is this possible?

MatrixIdentity (DX):  407901787
MatrixIdentity (ASM): 149828047
MatrixIdentity2 (ASM): 313354786
   DX faster by 130%

Press any key to continue . . .

The higher the number the faster. It's reporting how many times the function has executed per second.

I have a five second delay before the tests are conducted, to ensure that the window has settled down and is stable. And it doesn't matter if I move the DX call to be the first or last test. It always wins by a long margin.

Is there an overhead in calling a function from a static lib?


Jason

I think I may know what is going on. I took my MatrixIdentity2 function out of the test loop altogether and the result was on par with the DirectX version.

I am suspecting that the compiler is seeing that the result of the DX MatrixIdentity call is never actually used and is being optimised out altogether.

[edit]
Yep, that looks like what is happening. I made the DX MatrixIdentity function write to a global variable. That alone wasn't enough to affect the speed. But then I added a print call to output an element of that variable right at the end of the program (way outside the test loop), and suddenly the speed of the DX call dropped significantly.
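For reference, the pattern that fixed it can be sketched in C (illustrative only; `make_value` is a stand-in for the DX call, not real DirectX code): store the result somewhere the compiler must treat as observable, and the calls can no longer be eliminated.

```c
/* Stand-in for the benchmarked call: a trivial function whose result a
   Release-mode compiler could discard if nothing ever reads it. */
static int make_value(int x) { return x * 2 + 1; }

/* Writing into a volatile global forces the compiler to keep every call,
   much as printing one element of a global at program end did above. */
volatile int g_result;

int bench(int iterations) {
    for (int i = 0; i < iterations; i++)
        g_result = make_value(i);
    return g_result;   /* consumed, so the loop body must really run */
}
```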