Hi guys,
After getting a bit of help today making library files from some older assembly projects of mine, I'm noticing some interesting speed differences between ASM and C/C++.
My tests involve some heavy math calculations and I'm measuring how many calculations I can do per second in Asm against the equivalent in C++.
In Debug mode my Asm routines are anywhere up to 600% faster than the C++ counterpart, but when you switch to Release mode the C++ becomes anywhere between 200% and 2000% faster than my Asm code (depending on the function).
Would this be because the compiler sees the same loop over and over, can tell the result isn't changing, and therefore optimises out a lot of code?
Would the results change if the data being fed in to the functions was different on each loop? (I'll try something along those lines and report back shortly).
Love to hear your thoughts. I find it to be an interesting subject.
You should be able to see if the compiler is doing loop optimization by looking at your actual code in the debugger: code a simple loop in C and see if the code corresponds to what you think it should be in assembler.
It's also possible that your assembly code isn't all that efficient. Are you just doing that simple addition test, or is your function more complex?
Lots of stuff here on speed optimization if you look around; there are a lot of folks here who think that's important. (I don't, but I'm in the minority around here.)
An example is this one. Just a simple function to set the Identity Matrix (for 3D stuff).
MatrixIdentity PROC mat:DWORD
mov eax, mat
tempMat equ [eax.MATRIX]
mov ecx, MATRIX_REAL_ONE
mov tempMat._11, ecx
mov tempMat._22, ecx
mov tempMat._33, ecx
mov tempMat._44, ecx
mov ecx, MATRIX_REAL_ZERO
mov tempMat._12, ecx
mov tempMat._13, ecx
mov tempMat._14, ecx
mov tempMat._21, ecx
mov tempMat._23, ecx
mov tempMat._24, ecx
mov tempMat._31, ecx
mov tempMat._32, ecx
mov tempMat._34, ecx
mov tempMat._41, ecx
mov tempMat._42, ecx
mov tempMat._43, ecx
ret
MatrixIdentity ENDP
I can't see how it would be possible to shorten this any further.
Then I have other more complex matrix functions, but again, can't think of any way to possibly make them leaner.
Just off the top of my head here: since you're just setting 16 DWORD elements here, if they're in a contiguous block, you could do something like this:
; First set 4 elements to MATRIX_REAL_ONE:
MOV EAX, MATRIX_REAL_ONE
MOV ECX, 4
MOV EDI, mat
REP STOSD
; Now set the remainder to MATRIX_REAL_ZERO:
MOV EAX, MATRIX_REAL_ZERO
MOV ECX, 12
REP STOSD
Of course, if they're not in 2 contiguous blocks like I'm assuming here, you'll have to do something different. You could zero the whole thing out first, then just set those 4 elements to MATRIX_REAL_ONE.
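For what it's worth, a rough sketch of that zero-first idea (assuming the MATRIX structure really is 16 contiguous DWORD-sized fields and that _11/_22/_33/_44 are the diagonal; the proc name here is made up):
; Sketch only: clear the whole 4x4 block, then set just the diagonal
MatrixIdentityB PROC USES edi mat:DWORD
    mov edi, mat
    mov eax, MATRIX_REAL_ZERO
    mov ecx, 16
    rep stosd                   ; 16 DWORDs = the full matrix
    mov edi, mat
    identMat equ [edi.MATRIX]
    mov eax, MATRIX_REAL_ONE
    mov identMat._11, eax       ; now just the four diagonal elements
    mov identMat._22, eax
    mov identMat._33, eax
    mov identMat._44, eax
    ret
MatrixIdentityB ENDP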
Nice one! I wasn't aware of REP.
The locations getting set to 1 aren't contiguous, but I see what you're getting at there.
Or you could construct the matrix in memory, then just copy the whole block over in one fell swoop:
MatrixData DD <matrix data here>
PUSH ESI
PUSH EDI
MOV ESI, OFFSET MatrixData
MOV EDI, mat
MOV ECX, SIZEOF MatrixData / SIZEOF DWORD
REP MOVSD
POP EDI
POP ESI
More data, but mo' faster.
That's an awesome idea also! Just when you think you have things optimal, someone comes along and smashes it, twice! 😂
Quote from: Jason on November 06, 2024, 09:53:41 PM
That's an awesome idea also! Just when you think you have things optimal, someone comes along and smashes it, twice! 😂
That's when you ought to post your assembly code in the Laboratory, so other members can help you to optimize it. (Tip for 'next time')
We have a few members (used to be many more) that are keen on optimizing for speed, and know a few tricks.
For matrix math they designed the SSE instructions, which use the 128-bit xmm0-xmm7 registers.
C++ speed depends on the compiler settings, combined with how good or bad the compiler itself is.
Using Visual C++ even without optimization settings, it already makes use of movaps and the 128-bit xmm registers to initialize local arrays from somewhere else in memory.
So for the code example you show, one movups through xmm0 can initialize 4 x 32-bit variables instead of 4 x mov.
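A minimal sketch of that, assuming SSE is available, that the matrix is 16 consecutive REAL4 values, and with made-up names for the constant data and the proc:
.data
align 16
MatrixIdentityData REAL4 1.0, 0.0, 0.0, 0.0
                   REAL4 0.0, 1.0, 0.0, 0.0
                   REAL4 0.0, 0.0, 1.0, 0.0
                   REAL4 0.0, 0.0, 0.0, 1.0
.code
MatrixIdentitySSE PROC mat:DWORD
    mov eax, mat
    mov edx, OFFSET MatrixIdentityData
    movaps xmm0, [edx]          ; each load moves 4 REAL4 values at once
    movaps xmm1, [edx+16]
    movaps xmm2, [edx+32]
    movaps xmm3, [edx+48]
    movups [eax], xmm0          ; movups on the stores in case mat isn't 16-byte aligned
    movups [eax+16], xmm1
    movups [eax+32], xmm2
    movups [eax+48], xmm3
    ret
MatrixIdentitySSE ENDP
Four loads and four stores replace the sixteen single mov instructions of the original.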
In my opinion, your MatrixIdentity Proc is definitely not a good example to compare asm with C++.
A) Such a proc may take only a few nanoseconds as compared to the remainder of your computation, and accessing memory may not vary in any significant manner between the two languages.
B) Such a proc might be optimizable in asm but the result would be so small that its overall effect in a final program may not even be worth the effort.
The best area to concentrate on with complex computations is restricting memory accesses whenever possible, i.e. keeping intermediate results in registers. Remember that you have 8 additional registers available if you need to use the FPU.
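As a small illustration of that point (a hypothetical 4-element dot product, not code from this thread), the running sum stays on the FPU stack the whole time and is handed back in st(0), with no intermediate stores to memory:
Dot4 PROC vecA:DWORD, vecB:DWORD
    mov eax, vecA
    mov edx, vecB
    fld  REAL4 PTR [eax]        ; a0
    fmul REAL4 PTR [edx]        ; a0*b0, held in st(0)
    fld  REAL4 PTR [eax+4]
    fmul REAL4 PTR [edx+4]      ; a1*b1
    faddp st(1), st             ; accumulate on the FPU stack, no memory store
    fld  REAL4 PTR [eax+8]
    fmul REAL4 PTR [edx+8]      ; a2*b2
    faddp st(1), st
    fld  REAL4 PTR [eax+12]
    fmul REAL4 PTR [edx+12]     ; a3*b3
    faddp st(1), st             ; result left in st(0) for the caller
    ret
Dot4 ENDP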
Under the hood, C is machine code just like Assembly. It cannot be faster than the same Assembly code. So your code may be slow. Compare it to the disassembled C++ code.
Quote from: jj2007 on November 07, 2024, 05:41:47 AM
Under the hood, C is machine code just like Assembly. It cannot be faster than the same Assembly code. So your code may be slow. Compare it to the disassembled C++ code.
Optimized C/C++ can be faster than poorly written assembly code. Sometimes way faster.
You would indeed have to look at the disassembled C/C++ code to see if you are comparing apples to oranges or not.
Thanks guys, I really appreciate the feedback and advice given here.
I have more matrix functions than the simple one provided, but having said that, it is obvious that I have a long way to go, with what I thought was seemingly optimal. :badgrin:
So, this is what I don't get. I am playing around with my MatrixIdentity function right now, calling it MatrixIdentity2 for testing purposes.
All I have in here is a ret instruction:
MatrixIdentity2 PROC mat:DWORD
ret
MatrixIdentity2 ENDP
Just running this, the DirectX MatrixIdentity call is still way faster. How is this possible?
MatrixIdentity (DX): 407901787
MatrixIdentity (ASM): 149828047
MatrixIdentity2 (ASM): 313354786
DX faster by 130%
Press any key to continue . . .
The higher the number, the faster: it's reporting how many times the function executed per second.
I have a five second delay before the tests are conducted, to ensure that the window has settled down and is stable. And it doesn't matter if I move the DX call to be the first or last test. It always wins by a long margin.
Is there an overhead in calling a function from a static lib?
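For reference, the counting loop is roughly along these lines; this is only an illustrative sketch, not the exact harness (testMatrix is a MATRIX variable, Sleep/GetTickCount come from kernel32, and the iteration count is arbitrary):
    invoke Sleep, 5000                  ; let the window settle first
    invoke GetTickCount
    mov ebx, eax                        ; start time in milliseconds
    mov esi, 100000000                  ; fixed number of calls
testLoop:
    invoke MatrixIdentity, OFFSET testMatrix
    dec esi
    jnz testLoop
    invoke GetTickCount
    sub eax, ebx                        ; elapsed milliseconds
    ; calls per second = 100000000 * 1000 / elapsed milliseconds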
I think I may know what is going on. I took my MatrixIdentity2 function out of the test loop altogether and the result was on par with the DirectX version.
I am suspecting that the compiler is seeing that the result of the DX MatrixIdentity call is never actually used and is being optimised out altogether.
[edit]
Yep, that looks like what is happening. I made the DX MatrixIdentity call write to a global variable. That alone wasn't enough to affect the speed. But then I added a print call, to print out an element of that variable right at the end of the program (way outside of the test loop), and suddenly the DX call's speed dropped significantly.
Quote from: NoCforMe on November 06, 2024, 07:50:50 PM
Just off the top of my head here: since you're just setting 16 DWORD elements here, if they're in a contiguous block, you could do something like this:
; First set 4 elements to MATRIX_REAL_ONE:
MOV EAX, MATRIX_REAL_ONE
MOV ECX, 4
MOV EDI, mat
REP STOSD
; Now set the remainder to MATRIX_REAL_ZERO:
MOV EAX, MATRIX_REAL_ZERO
MOV ECX, 12
REP STOSD
Of course, if they're not in 2 contiguous blocks like I'm assuming here, you'll have to do something different. You could zero the whole thing out first, then just set those 4 elements to MATRIX_REAL_ONE.
Interestingly enough, this method turns out to be slower than my initial implementation, running at about 60% of the speed of filling the memory locations one by one.
Before we go further, let me ask you: do you really have a need for speed here?
I ask because there's kind of an obsession with speed for its own sake here among some programmers. Sometimes this takes the form of assembly-language pissing contests: "my code's faster than your code!" (Well, not really: everyone's quite polite and gracious about it. But still.)
My argument, take it for what it's worth, is that unless you're writing an app that's so computation-bound that a speedup is critical, the obsession with speed is kinda misplaced. I mean, if you're writing a routine that gets called while waiting for user input, what's the rush?
Of course, if you have a legitimate need for speed, then simply disregard what I wrote here.
Hey there! Very valid argument and I agree.
In my case I am looking at every avenue I can take to make my DirectX 11 renderer as efficient as possible, and I am making great progress in doing so, being able to render a ridiculous number of sprites per frame.
It's more a challenge against myself than anything. I'm not out here to say "my code is faster than yours"; I'm more interested in the quirks of how things can be made quicker (or more compact, depending on my mood on the day :biggrin:; I know compact and fast often don't go hand in hand).
Just a hobby for me. :thumbsup:
Quote from: Jason on November 07, 2024, 11:54:00 AM
In my case I am looking at every avenue I can take to make my DirectX 11 renderer as efficient as possible, and I am making great progress in doing so, being able to render a ridiculous number of sprites per frame.
That's a perfectly valid reason to aim for speed.
Quote
Just a hobby for me. :thumbsup:
Same here.
Good fun and brain frying at the same time. :biggrin:
Quote from: NoCforMe on November 07, 2024, 12:06:58 PM
Quote from: Jason on November 07, 2024, 11:54:00 AM
In my case I am looking at every avenue I can take to make my DirectX 11 renderer as efficient as possible, and I am making great progress in doing so, being able to render a ridiculous number of sprites per frame.
That's a perfectly valid reason to aim for speed.
Quote from: Jason on November 07, 2024, 12:14:55 PM
Good fun and brain frying at the same time. :biggrin:
:biggrin:
Sounds like you are indeed having some fun. (or phun?)
DirectX has always been beyond my skill set.
Quote from: zedd151 on November 07, 2024, 03:22:01 PM
Sounds like you are indeed having some fun. (or phun?)
DirectX has always been beyond my skill set.
:biggrin:
Maybe this is my calling? Aim for the smallest DirectX 11 application that I can achieve.
I'm certainly up against it beating the compiler for speed, with all of the shenanigans it pulls.
You got me thinking now. :thumbsup:
Are you using Direct2D or Direct3D?
Direct3D. I must admit I've never even touched Direct2D, as Direct3D can do anything 2D anyway. :biggrin:
Which DX function was used for the tests?
XMMatrixIdentity() or D3DXMatrixIdentity(D3DXMATRIX *pout)
Perhaps static data is simpler to use.
This is the deprecated function:
static inline D3DXMATRIX* D3DXMatrixIdentity(D3DXMATRIX *pout)
{
    if ( !pout ) return NULL;
    D3DX_U(*pout).m[0][1] = 0.0f;
    D3DX_U(*pout).m[0][2] = 0.0f;
    D3DX_U(*pout).m[0][3] = 0.0f;
    D3DX_U(*pout).m[1][0] = 0.0f;
    D3DX_U(*pout).m[1][2] = 0.0f;
    D3DX_U(*pout).m[1][3] = 0.0f;
    D3DX_U(*pout).m[2][0] = 0.0f;
    D3DX_U(*pout).m[2][1] = 0.0f;
    D3DX_U(*pout).m[2][3] = 0.0f;
    D3DX_U(*pout).m[3][0] = 0.0f;
    D3DX_U(*pout).m[3][1] = 0.0f;
    D3DX_U(*pout).m[3][2] = 0.0f;
    D3DX_U(*pout).m[0][0] = 1.0f;
    D3DX_U(*pout).m[1][1] = 1.0f;
    D3DX_U(*pout).m[2][2] = 1.0f;
    D3DX_U(*pout).m[3][3] = 1.0f;
    return pout;
}
I used a few of them against various XMMatrix* functions.
In debug mode my Asm code demolished the DirectX functions, but in release mode, the roles were reversed.