Thanks man, might be worth a look.
Just finished optimising the hell out of my matrix routines. I am quite pleased with the results.
Timings for 20,000,000 iterations, in ms:

Function                    DirectXMath     C++     ASM
XMMatrixIdentity                    578     125     109
XMMatrixPerspectiveFovLH           3000    2172    1594
XMMatrixLookAtLH                   9343    2781    1641
XMMatrixTranspose                   719     125     125
So I am comfortably beating Microsoft's own implementation (the ASM versions come out roughly 2x to 6x faster), and this is using nothing but ordinary FPU instructions.
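For anyone who wants to run a similar comparison, here is a rough sketch of how a timing loop like this could be set up in plain C++. It is not the code behind the numbers above: `Mat4`, `mat4_identity`, `mat4_transpose` and `time_ms` are made-up names for illustration, the matrix layout is an assumption, and the routines are simple scalar versions rather than anything tuned.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>

// Hypothetical 4x4 row-major matrix. This is an assumed layout for the
// sketch, not DirectXMath's XMMATRIX and not the type used in the post.
struct Mat4 {
    float m[4][4];
};

// Plain scalar identity: write all 16 floats directly.
inline void mat4_identity(Mat4& out) {
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out.m[r][c] = (r == c) ? 1.0f : 0.0f;
}

// Plain scalar in-place transpose: swap the six off-diagonal pairs.
inline void mat4_transpose(Mat4& a) {
    for (int r = 0; r < 4; ++r)
        for (int c = r + 1; c < 4; ++c)
            std::swap(a.m[r][c], a.m[c][r]);
}

// Calls fn(mat) `iterations` times and returns the elapsed wall time in ms.
template <typename Fn>
long long time_ms(std::size_t iterations, Mat4& mat, Fn fn) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    volatile float sink = 0.0f;  // volatile store so the loop is not optimised away
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        fn(mat);
        sink = mat.m[0][0];
    }
    const auto stop = clock::now();
    (void)sink;
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

Calling `time_ms(20000000, mat, mat4_identity)` would give a figure comparable in spirit to the ones above, though results will vary a lot with compiler, optimisation flags and whether the routine gets inlined, so it is worth timing every variant in the same build.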