3. ?!?!?!? Any proc/function that is going to be called repeatedly, and is presumably designed to be optimised and to handle potentially millions of calls, is going to suffer badly from N*48 bytes of unnecessary memory traffic, especially in multi-threaded applications. Peak memory bandwidth is roughly 25 GB/s on an average desktop, which means that before any other factors (call overhead etc.) you've already capped yourself at about 520 million calls per second (25 GB/s / 48 bytes per call), and it gets worse with more arguments or more complex types like matrices! Given 4c/8t on most desktops I routinely hit 1 to 1.5 billion calls per second on these sorts of vector/matrix functions (but you have to keep stuff in registers). With Ryzen and the 8c/16t HEDT CPUs the gap only widens: memory would limit you to about 500 million calls per second where the cores could otherwise reach roughly 3 billion.
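A quick back-of-envelope sketch of that cap (a hypothetical illustration; the 25 GB/s and 48 bytes per call are the assumed figures from above, not measurements):

#include <cstdio>

int main()
{
    const double bytes_per_second = 25.0e9; // assumed sustained memory bandwidth (~25 GB/s)
    const double bytes_per_call   = 48.0;   // assumed argument traffic per call (N*48 with N = 1)
    // Dividing bandwidth by per-call traffic gives the hard ceiling on call rate.
    std::printf("bandwidth-limited cap: %.0f million calls/s\n",
                bytes_per_second / bytes_per_call / 1.0e6); // prints about 521
    return 0;
}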
This is bullsh%&t: you are not taking into account the work involved in loading the XMM registers before the call. Take as an example the function
XMMatrixMultiply from the DirectXMath library (XMMATRIX XMMatrixMultiply(FXMMATRIX M1, CXMMATRIX M2)).
Look at the vector call:
000000013F6AB12E movaps xmm0,xmmword ptr [__xmm@412e66664131999a412666663e4ccccd (13F6D08F0h)]
000000013F6AB135 lea rdx,[rbp+40h]
000000013F6AB139 movaps xmm2,xmmword ptr [__xmm@3fc000004129999a4083333340c9999a (13F6D06D0h)]
000000013F6AB140 movaps xmm1,xmm14
000000013F6AB144 movaps xmm3,xmmword ptr [__xmm@408ccccd411b33333fd9999a40400000 (13F6D0830h)]
000000013F6AB14B movups xmm6,xmmword ptr [rax]
000000013F6AB14E movups xmm7,xmmword ptr [rax+10h]
000000013F6AB152 movups xmm8,xmmword ptr [rax+20h]
000000013F6AB157 movups xmm9,xmmword ptr [rax+30h]
000000013F6AB15C call XMMatrixMultiply (13F6A3330h)
Too bad, we do have to load 8 XMM registers before the call; they are not loaded by a miracle :(.
We could just as well do those same loads inside the ASM routine!
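For context, here is a minimal sketch of the kind of C++ caller that produces a call site like the listing above (my own hypothetical example, not the code the disassembly came from): with _XM_VECTORCALL_ in effect the first XMMATRIX parameter (FXMMATRIX) is passed by value in XMM0-XMM3, while the second (CXMMATRIX) is passed by const reference, which in the listing appears to travel in rdx, so the caller has to fill those XMM registers from memory before the call instruction.

#include <DirectXMath.h>
using namespace DirectX;

// Hypothetical caller: the compiler must load the four rows of 'world' into
// XMM0-XMM3 here, before the call to XMMatrixMultiply, while 'view' is
// handed over by reference. The register loads are the caller's work.
XMMATRIX ConcatenateHypothetical(const XMMATRIX& world, const XMMATRIX& view)
{
    return XMMatrixMultiply(world, view);
}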