The MASM Forum

General => The Workshop => Topic started by: Jason on November 06, 2024, 05:27:50 PM

Title: Interesting speed observations between ASM and C/C++
Post by: Jason on November 06, 2024, 05:27:50 PM
Hi guys,

After a bit of help today making library files from older assembly projects I created in the past I am noticing some interesting speed observations between ASM and C/C++.

My tests involve some heavy math calculations and I'm measuring how many calculations I can do per second in Asm against the equivalent in C++.

In Debug mode my Asm routines are up to 600% faster than their C++ counterparts, but when you switch to Release mode the C++ becomes anywhere between 200% and 2000% faster than my Asm code (depending on the function).

Would this be because the compiler is seeing the same loop over and over, and can see the result isn't changing and therefore optimising out a lot of code?

Would the results change if the data being fed in to the functions was different on each loop? (I'll try something along those lines and report back shortly).

Love to hear your thoughts. I find it to be an interesting subject.
Title: Re: Interesting speed observations between ASM and C/C++
Post by: NoCforMe on November 06, 2024, 06:24:05 PM
You should be able to see if the compiler is doing loop optimization by looking at your actual code in the debugger: code a simple loop in C and see if the code corresponds to what you think it should be in assembler.

It's also possible that your assembly code isn't all that efficient. Are you just doing that simple addition test, or is your function more complex?

Lots of stuff here on speed optimization if you look around; there are a lot of folks here who think that's important. (I don't, but I'm in the minority around here.)
Title: Re: Interesting speed observations between ASM and C/C++
Post by: Jason on November 06, 2024, 06:49:40 PM
An example is this one. Just a simple function to set the Identity Matrix (for 3D stuff).

MatrixIdentity PROC mat:DWORD
mov eax, mat
tempMat equ [eax.MATRIX]

mov ecx, MATRIX_REAL_ONE
mov tempMat._11, ecx
mov tempMat._22, ecx
mov tempMat._33, ecx
mov tempMat._44, ecx

mov ecx, MATRIX_REAL_ZERO
mov tempMat._12, ecx
mov tempMat._13, ecx
mov tempMat._14, ecx
mov tempMat._21, ecx
mov tempMat._23, ecx
mov tempMat._24, ecx
mov tempMat._31, ecx
mov tempMat._32, ecx
mov tempMat._34, ecx
mov tempMat._41, ecx
mov tempMat._42, ecx
mov tempMat._43, ecx

ret
MatrixIdentity ENDP

I can't see how it would be possible to shorten this any further.

Then I have other more complex matrix functions, but again, can't think of any way to possibly make them leaner.
Title: Re: Interesting speed observations between ASM and C/C++
Post by: NoCforMe on November 06, 2024, 07:50:50 PM
Just off the top of my head here: since you're just setting 16 DWORD elements here, if they're in a contiguous block, you could do something like this:

; First set 4 elements to MATRIX_REAL_ONE:
      MOV  EAX, MATRIX_REAL_ONE
      MOV  ECX, 4
      MOV  EDI, mat
      REP  STOSD

; Now set the remainder to MATRIX_REAL_ZERO:
      MOV  EAX, MATRIX_REAL_ZERO
      MOV  ECX, 12
      REP  STOSD

Of course, if they're not in 2 contiguous blocks like I'm assuming here, you'll have to do something different. You could zero the whole thing out first, then just set those 4 elements to MATRIX_REAL_ONE.
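For comparison, the "zero the whole thing out first, then set those 4 elements" variant could look roughly like this in C. This is only a sketch: `Matrix4` and `matrix_identity_zero_then_diag` are hypothetical names, and the layout (16 contiguous 32-bit floats, row-major) is an assumption about the MATRIX structure used in the thread.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the thread's MATRIX structure (assumption):
   16 contiguous 32-bit floats, row-major (_11, _12, ... _44). */
typedef struct Matrix4 {
    float m[4][4];
} Matrix4;

/* Zero the whole block first, then write only the four diagonal elements. */
void matrix_identity_zero_then_diag(Matrix4 *mat)
{
    assert(mat != NULL);
    memset(mat, 0, sizeof *mat);   /* all-bits-zero is 0.0f for IEEE-754 floats */
    for (int i = 0; i < 4; ++i)
        mat->m[i][i] = 1.0f;       /* _11, _22, _33, _44 */
}
```

Whether this beats sixteen plain stores depends on the compiler and CPU, so it is worth timing rather than assuming.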
Title: Re: Interesting speed observations between ASM and C/C++
Post by: Jason on November 06, 2024, 08:34:36 PM
Nice one! I wasn't aware of REP.

The locations getting set to 1 aren't contiguous, but I see what you're getting at there.
Title: Re: Interesting speed observations between ASM and C/C++
Post by: NoCforMe on November 06, 2024, 09:51:31 PM
Or you could construct the matrix in memory, then just copy the whole block over in one fell swoop:

MatrixData    DD <matrix data here>

    PUSH   ESI
    PUSH   EDI

    MOV    ESI, OFFSET MatrixData
    MOV    EDI, mat
    MOV    ECX, SIZEOF MatrixData / SIZEOF DWORD
    REP    MOVSD

    POP    EDI
    POP    ESI

More data, but mo' faster.
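The same copy-a-prebuilt-block idea, sketched in C. `Matrix4`, `kIdentity` and `matrix_identity_copy` are hypothetical names, and the row-major layout of 16 floats is an assumption about the MATRIX structure, not something confirmed in the thread.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the thread's MATRIX structure (assumption). */
typedef struct Matrix4 {
    float m[4][4];
} Matrix4;

/* The identity matrix built once, in read-only data. */
static const Matrix4 kIdentity = {{
    {1.0f, 0.0f, 0.0f, 0.0f},
    {0.0f, 1.0f, 0.0f, 0.0f},
    {0.0f, 0.0f, 1.0f, 0.0f},
    {0.0f, 0.0f, 0.0f, 1.0f}
}};

/* Copy the whole 64-byte block over the destination in one go. */
void matrix_identity_copy(Matrix4 *mat)
{
    assert(mat != NULL);
    memcpy(mat, &kIdentity, sizeof *mat);
}
```

An optimizing compiler is free to turn the fixed-size `memcpy` into a handful of wide moves, so the generated code may not contain a copy loop at all.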
Title: Re: Interesting speed observations between ASM and C/C++
Post by: Jason on November 06, 2024, 09:53:41 PM
That's an awesome idea also! Just when you think you have things optimal, someone comes along and smashes it, twice! 😂
Title: Re: Interesting speed observations between ASM and C/C++
Post by: zedd151 on November 07, 2024, 01:06:08 AM
Quote from: Jason on November 06, 2024, 09:53:41 PM
That's an awesome idea also! Just when you think you have things optimal, someone comes along and smashes it, twice! 😂
That's when you ought to post your assembly code in the Laboratory, so other members can help you to optimize it. (Tip for 'next time')
We have a few members (used to be many more) that are keen on optimizing for speed, and know a few tricks.
Title: Re: Interesting speed observations between ASM and C/C++
Post by: daydreamer on November 07, 2024, 02:52:53 AM
SSE instructions were designed for matrix math like this, using the 128-bit xmm0-7 registers.
C++ speed depends on the compiler settings, and on how good the compiler is.
Even without optimization settings, Visual C++ already uses MOVAPS with the 128-bit XMM registers to initialize local arrays from somewhere else in memory.
So in the code example you show, one MOVUPS from xmm0 can initialize 4 x 32-bit variables in place of 4 x MOV.
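To illustrate the one-store-per-row idea, here is a minimal C sketch using SSE intrinsics (x86/x64 only). `matrix_identity_sse` is a hypothetical name, and the matrix is assumed to be 16 contiguous row-major floats. `_mm_storeu_ps` is the unaligned store, which corresponds to MOVUPS.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics; x86/x64 only */

/* Writes a 4x4 identity into 16 contiguous floats (row-major, assumed layout).
   Each _mm_storeu_ps writes four floats at once, so four stores replace
   sixteen scalar MOVs. _mm_setr_ps takes its arguments in memory order. */
void matrix_identity_sse(float *m)
{
    assert(m != NULL);
    _mm_storeu_ps(m +  0, _mm_setr_ps(1.0f, 0.0f, 0.0f, 0.0f));
    _mm_storeu_ps(m +  4, _mm_setr_ps(0.0f, 1.0f, 0.0f, 0.0f));
    _mm_storeu_ps(m +  8, _mm_setr_ps(0.0f, 0.0f, 1.0f, 0.0f));
    _mm_storeu_ps(m + 12, _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f));
}
```

If the matrix is known to be 16-byte aligned, `_mm_store_ps` (MOVAPS) could be used instead.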
Title: Re: Interesting speed observations between ASM and C/C++
Post by: raymond on November 07, 2024, 04:28:36 AM
In my opinion, your MatrixIdentity Proc is definitely not a good example to compare asm with C++.

A) Such a proc may take only a few nanoseconds as compared to the remainder of your computation, and accessing memory may not vary in any significant manner between the two languages.

B) Such a proc might be optimizable in asm but the result would be so small that its overall effect in a final program may not even be worth the effort.

The best area to concentrate on with complex computations is to restrict memory accesses whenever possible; i.e. keep intermediate results in registers whenever possible. Remember that you have 8 potential registers if you need to use the fpu.
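A small C sketch of the "keep intermediate results in registers" advice, using a dot product as a stand-in for a more complex computation. Both function names are hypothetical. The first version accumulates through the output pointer; the second accumulates in a local variable and stores once at the end.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulates through the output pointer. Because out could alias a or b,
   the compiler may have to load and store *out on every iteration. */
void dot_through_memory(const float *a, const float *b, size_t n, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    *out = 0.0f;
    for (size_t i = 0; i < n; ++i)
        *out += a[i] * b[i];
}

/* Accumulates in a local that can live in a register; one store at the end. */
void dot_in_register(const float *a, const float *b, size_t n, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    *out = acc;
}
```

Both give the same result when the buffers do not overlap. The difference only shows in the generated code, which is where a disassembly comparison helps.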
Title: Re: Interesting speed observations between ASM and C/C++
Post by: jj2007 on November 07, 2024, 05:41:47 AM
Under the hood, C is machine code just like Assembly. It cannot be faster than the same Assembly code. So your code may be slow. Compare it to the disassembled C++ code.
Title: Re: Interesting speed observations between ASM and C/C++
Post by: zedd151 on November 07, 2024, 06:19:55 AM
Quote from: jj2007 on November 07, 2024, 05:41:47 AM
Under the hood, C is machine code just like Assembly. It cannot be faster than the same Assembly code. So your code may be slow. Compare it to the disassembled C++ code.
Optimized C/C++ can be faster than poorly written assembly code. Sometimes way faster.
You would indeed have to look at the disassembled C/C++ code to see if you are comparing apples to oranges or not.
Title: Re: Interesting speed observations between ASM and C/C++
Post by: Jason on November 07, 2024, 07:30:42 AM
Thanks guys, I really appreciate the feedback and advice given here.

I have more matrix functions than the simple one provided, but having said that, it is obvious that I have a long way to go, with what I thought was seemingly optimal.  :badgrin:
Title: Re: Interesting speed observations between ASM and C/C++
Post by: Jason on November 07, 2024, 08:12:25 AM
So, this is what I don't get. I am playing around with my MatrixIdentity function right now, calling it MatrixIdentity2 for testing purposes.

All I have in here is a ret instruction:

MatrixIdentity2 PROC mat:DWORD
ret
MatrixIdentity2 ENDP

Just running this, the DirectX MatrixIdentity call is still way faster. How is this possible?

MatrixIdentity (DX):  407901787
MatrixIdentity (ASM): 149828047
MatrixIdentity2 (ASM): 313354786
   DX faster by 130%

Press any key to continue . . .

The higher the number the faster. It's reporting how many times the function has executed per second.

I have a five second delay before the tests are conducted, to ensure that the window has settled down and is stable. And it doesn't matter if I move the DX call to be the first or last test. It always wins by a long margin.

Is there an overhead in calling a function from a static lib?

Title: Re: Interesting speed observations between ASM and C/C++
Post by: Jason on November 07, 2024, 08:31:14 AM
I think I may know what is going on. I took my MatrixIdentity2 function out of the test loop altogether and the result was on par with the DirectX version.

I am suspecting that the compiler is seeing that the result of the DX MatrixIdentity call is never actually used and is being optimised out altogether.

[edit]
Yep, that looks like what is happening. I made the DX MatrixIdentity function point to a global variable. That alone wasn't enough to affect the speed. But then I added a print call, to print out an element of the variable right at the end of the program (way outside of the test loop), and suddenly the DX call's speed dropped significantly.
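One common way to stop an optimizer from dropping the call under test is to call it through a volatile function pointer, so the compiler cannot inline it or prove the call is dead. The sketch below is written under assumptions: `Matrix4`, `matrix_identity` and `calls_per_second` are hypothetical names, and timing uses the standard `clock()`, which measures CPU time at coarse resolution. It is still worth confirming in the disassembly that the loop survived.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the matrix type under test (assumption). */
typedef struct Matrix4 {
    float m[4][4];
} Matrix4;

/* Plain C identity routine used as the function being measured. */
void matrix_identity(Matrix4 *mat)
{
    assert(mat != NULL);
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            mat->m[r][c] = (r == c) ? 1.0f : 0.0f;
}

/* Calls fn(mat) `iterations` times and returns calls per second.
   The volatile pointer is re-read on every pass, so each call must
   really be made. Returns 0.0 if the run was too short to measure. */
double calls_per_second(void (*fn)(Matrix4 *), Matrix4 *mat,
                        unsigned long iterations)
{
    void (*volatile call)(Matrix4 *) = fn;
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; ++i)
        call(mat);
    clock_t end = clock();
    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return seconds > 0.0 ? (double)iterations / seconds : 0.0;
}
```

The indirect call adds a small fixed overhead to every iteration, so the numbers are only comparable between functions measured the same way.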
Title: Re: Interesting speed observations between ASM and C/C++
Post by: Jason on November 07, 2024, 09:10:47 AM
Quote from: NoCforMe on November 06, 2024, 07:50:50 PM
Just off the top of my head here: since you're just setting 16 DWORD elements here, if they're in a contiguous block, you could do something like this:

; First set 4 elements to MATRIX_REAL_ONE:
      MOV  EAX, MATRIX_REAL_ONE
      MOV  ECX, 4
      MOV  EDI, mat
      REP  STOSD

; Now set the remainder to MATRIX_REAL_ZERO:
      MOV  EAX, MATRIX_REAL_ZERO
      MOV  ECX, 12
      REP  STOSD

Of course, if they're not in 2 contiguous blocks like I'm assuming here, you'll have to do something different. You could zero the whole thing out first, then just set those 4 elements to MATRIX_REAL_ONE.

Interestingly enough, this method turns out to be slower than my initial implementation, at about 60% of the speed when compared to filling the memory locations one by one.
Title: Re: Interesting speed observations between ASM and C/C++
Post by: NoCforMe on November 07, 2024, 11:04:27 AM
Before we go further, let me ask you: do you really have a need for speed here?
I ask because there's kind of an obsession with speed for its own sake here among some programmers. Sometimes this takes the form of assembly-language pissing contests: "my code's faster than your code!" (Well, not really: everyone's quite polite and gracious about it. But still.)

My argument, take it for what it's worth, is that unless you're writing an app that's so computation-bound that a speedup is critical, the obsession with speed is kinda misplaced. I mean, if you're writing a routine that gets called while waiting for user input, what's the rush?

Of course, if you have a legitimate need for speed, then simply disregard what I wrote here.
Title: Re: Interesting speed observations between ASM and C/C++
Post by: Jason on November 07, 2024, 11:54:00 AM
Hey there! Very valid argument and I agree.

In my case I am looking at every avenue I can take to make my DirectX 11 renderer as efficient as possible and I am making great progress in doing so, being able to render a ridiculous number of sprites per frame.

It's more a challenge against myself than anything. I'm not out here to say "my code is faster than yours", I'm more interested in the quirks of how things can be quicker (or more compact, depending on my mood on the day  :biggrin:,  I know compact and speed often don't go hand in hand).

Just a hobby for me.  :thumbsup:
Title: Re: Interesting speed observations between ASM and C/C++
Post by: NoCforMe on November 07, 2024, 12:06:58 PM
Quote from: Jason on November 07, 2024, 11:54:00 AM
In my case I am looking at every avenue I can take to make my DirectX 11 renderer as efficient as possible and I am making great progress in doing so, being able to render a ridiculous number of sprites per frame.

That's a perfectly valid reason to aim for speed.

Quote
Just a hobby for me.  :thumbsup:

Same here.
Title: Re: Interesting speed observations between ASM and C/C++
Post by: Jason on November 07, 2024, 12:14:55 PM
Good fun and brain frying at the same time.   :biggrin:
Title: Re: Interesting speed observations between ASM and C/C++
Post by: zedd151 on November 07, 2024, 03:22:01 PM
Quote from: NoCforMe on November 07, 2024, 12:06:58 PM
Quote from: Jason on November 07, 2024, 11:54:00 AM
In my case I am looking at every avenue I can take to make my DirectX 11 renderer as efficient as possible and I am making great progress in doing so, being able to render a ridiculous number of sprites per frame.

That's a perfectly valid reason to aim for speed.


Quote from: Jason on November 07, 2024, 12:14:55 PM
Good fun and brain frying at the same time.  :biggrin:
:biggrin:

Sounds like you are indeed having some fun. (or phun?)
DirectX has always been beyond my skill set.
Title: Re: Interesting speed observations between ASM and C/C++
Post by: Jason on November 07, 2024, 04:25:40 PM
Quote from: zedd151 on November 07, 2024, 03:22:01 PM
Sounds like you are indeed having some fun. (or phun?)
DirectX has always been beyond my skill set.

 :biggrin:

Maybe this is my calling? Aim for the smallest DirectX 11 application that I can achieve.

I'm certainly up against it beating the compiler for speed, with all of the shenanigans it pulls.

You got me thinking now.  :thumbsup:
Title: Re: Interesting speed observations between ASM and C/C++
Post by: NoCforMe on November 07, 2024, 05:14:47 PM
Are you using Direct2D or Direct3D?
Title: Re: Interesting speed observations between ASM and C/C++
Post by: Jason on November 07, 2024, 06:13:59 PM
Direct3D. I must admit I've never even touched Direct2D, as Direct3D can do anything 2D anyway.  :biggrin:
Title: Re: Interesting speed observations between ASM and C/C++
Post by: TimoVJL on November 07, 2024, 06:21:37 PM
Which DX function was used for the tests?

XMMatrixIdentity() or D3DXMatrixIdentity(D3DXMATRIX *pout)

Perhaps static data is simpler to use.

This is the deprecated function:
static inline D3DXMATRIX* D3DXMatrixIdentity(D3DXMATRIX *pout)
{
    if ( !pout ) return NULL;
    D3DX_U(*pout).m[0][1] = 0.0f;
    D3DX_U(*pout).m[0][2] = 0.0f;
    D3DX_U(*pout).m[0][3] = 0.0f;
    D3DX_U(*pout).m[1][0] = 0.0f;
    D3DX_U(*pout).m[1][2] = 0.0f;
    D3DX_U(*pout).m[1][3] = 0.0f;
    D3DX_U(*pout).m[2][0] = 0.0f;
    D3DX_U(*pout).m[2][1] = 0.0f;
    D3DX_U(*pout).m[2][3] = 0.0f;
    D3DX_U(*pout).m[3][0] = 0.0f;
    D3DX_U(*pout).m[3][1] = 0.0f;
    D3DX_U(*pout).m[3][2] = 0.0f;
    D3DX_U(*pout).m[0][0] = 1.0f;
    D3DX_U(*pout).m[1][1] = 1.0f;
    D3DX_U(*pout).m[2][2] = 1.0f;
    D3DX_U(*pout).m[3][3] = 1.0f;
    return pout;
}
Title: Re: Interesting speed observations between ASM and C/C++
Post by: Jason on November 07, 2024, 06:31:32 PM
I used a few of them against various XMMatrix* functions.

In debug mode my Asm code demolished the DirectX functions, but in release mode, the roles were reversed.