Wrote some matrix transpose routines of different sizes.
The idea is to use them as building blocks to create an algorithm for transposing very large matrices of any size.
Included the sources and a timing mechanism that measures the clock cycles and the time each matrix routine takes over 10000000 calls.
You can use "SaveResults.bat" to save the timing results as "MatrixTimerResults.txt".
It would be nice if you post your results here and give some feedback for improving the routines. :t
Thanks.
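To make the building-block idea concrete, here is a rough C sketch (my own illustration, not part of the posted MASM sources) of how a small fixed-size kernel can tile a large transpose. `transpose4x4_block` and `transpose_tiled` are hypothetical names, and `transpose_tiled` assumes n is a multiple of 4 for simplicity:

```c
#include <assert.h>

/* 4x4 building-block kernel: transposes one tile from src into dst.
   The strides are the row lengths of the full matrices. */
static void transpose4x4_block(const float *src, float *dst,
                               int src_stride, int dst_stride)
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[c * dst_stride + r] = src[r * src_stride + c];
}

/* Large transpose assembled from 4x4 blocks: source tile (i,j)
   lands at destination tile (j,i). Assumes n % 4 == 0. */
void transpose_tiled(const float *src, float *dst, int n)
{
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < n; j += 4)
            transpose4x4_block(&src[i * n + j], &dst[j * n + i], n, n);
}
```

The kernels in the download do the 4x4 tile in SSE registers instead of scalar loops; the tiling loop stays the same.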
Here are my results:
10000000 calls per Matrix for the Cycle counter and the Routine timer.
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
4x4 Cycles: 4 CodeSize: 66 RoutineTime: 0.032362423 seconds
4x3 Cycles: 3 CodeSize: 56 RoutineTime: 0.029423712 seconds
4x2 Cycles: 4 CodeSize: 42 RoutineTime: 0.032384074 seconds
3x4 Cycles: 4 CodeSize: 62 RoutineTime: 0.032416620 seconds
3x3 Cycles: 2 CodeSize: 51 RoutineTime: 0.026471800 seconds
3x2 Cycles: 3 CodeSize: 35 RoutineTime: 0.029419801 seconds
2x4 Cycles: 0 CodeSize: 31 RoutineTime: 0.020588231 seconds
2x3 Cycles: 0 CodeSize: 27 RoutineTime: 0.020594936 seconds
2x2 Cycles: -1 CodeSize: 18 RoutineTime: 0.017652313 seconds
Code Alignment 64 byte check: 000h
EDIT: forgot to mention that the routines are meant to be used backwards in memory, because when a row holds 3 values, 4 are written!
10000000 calls per Matrix for the Cycle counter and the Routine timer.
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G
4x4 Cycles: 3 CodeSize: 66 RoutineTime: 0.055281872 seconds
4x3 Cycles: 3 CodeSize: 56 RoutineTime: 0.048029962 seconds
4x2 Cycles: 4 CodeSize: 42 RoutineTime: 0.047268223 seconds
3x4 Cycles: 5 CodeSize: 62 RoutineTime: 0.056652746 seconds
3x3 Cycles: 4 CodeSize: 51 RoutineTime: 0.047815803 seconds
3x2 Cycles: 2 CodeSize: 35 RoutineTime: 0.034986782 seconds
2x4 Cycles: 1 CodeSize: 31 RoutineTime: 0.038576830 seconds
2x3 Cycles: -1 CodeSize: 27 RoutineTime: 0.028790277 seconds
2x2 Cycles: 0 CodeSize: 18 RoutineTime: 0.024736004 seconds
Code Alignment 64 byte check: 000h
Press any key to continue...
1.60 GHz CPU speed...
The cycle counts seem off: "0 cycles", "-1 cycle".
10000000 calls per Matrix for the Cycle counter and the Routine timer.
Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz
4x4 Cycles: 9 CodeSize: 66 RoutineTime: 0.049261743 seconds
4x3 Cycles: 7 CodeSize: 56 RoutineTime: 0.043520454 seconds
4x2 Cycles: 8 CodeSize: 42 RoutineTime: 0.042658139 seconds
3x4 Cycles: 10 CodeSize: 62 RoutineTime: 0.049160314 seconds
3x3 Cycles: 5 CodeSize: 51 RoutineTime: 0.036015704 seconds
3x2 Cycles: 5 CodeSize: 35 RoutineTime: 0.036012691 seconds
2x4 Cycles: 1 CodeSize: 31 RoutineTime: 0.024245575 seconds
2x3 Cycles: 1 CodeSize: 27 RoutineTime: 0.023353133 seconds
2x2 Cycles: 0 CodeSize: 18 RoutineTime: 0.020137194 seconds
Code Alignment 64 byte check: 000h
Hi Siekmanski
Been bench pressing your 4x4 against this one:
XMMatrixTranspose PROC
movaps xmm0, xmmword ptr[edx+0h]
movaps xmm1, xmmword ptr[edx+10h]
movaps xmm2, xmmword ptr[edx+20h]
movaps xmm3, xmmword ptr[edx+30h]
movaps xmm4, xmm0
movaps xmm5, xmm2
shufps xmm0, xmm1, 44h
shufps xmm4, xmm1, 0EEh
shufps xmm2, xmm3, 44h
shufps xmm5, xmm3, 0EEh
movaps xmm1, xmm0
movaps xmm3, xmm2
shufps xmm0, xmm4, 88h
shufps xmm1, xmm4, 0DDh
shufps xmm2, xmm5, 88h
shufps xmm3, xmm5, 0DDh
movaps xmmword ptr[eax+0h], xmm0
movaps xmmword ptr[eax+10h], xmm1
movaps xmmword ptr[eax+20h], xmm2
movaps xmmword ptr[eax+30h], xmm3
XMMatrixTranspose ENDP
;===========================================>
mov edx, offset g_mSpin
mov eax, offset g_mWorld
invoke XMMatrixTranspose
Which is from BOTH DirectXMath and XNAMath (it has not changed), with your 4x4 0.815% faster over 100 million iterations. Nice.
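For reference, the same 4-load / shuffle / 4-store pattern can be written with SSE intrinsics in C. This is a sketch, not the exact code either library generates; `_MM_TRANSPOSE4_PS` is a real macro from `<xmmintrin.h>` that expands to a shufps/unpack sequence much like the assembly above:

```c
#include <xmmintrin.h>

/* 4x4 single-precision transpose: four loads, in-register shuffles,
   four stores. Unaligned loads/stores for generality; the assembly
   above uses aligned movaps. */
void transpose4x4_sse(const float *in, float *out)
{
    __m128 r0 = _mm_loadu_ps(in + 0);
    __m128 r1 = _mm_loadu_ps(in + 4);
    __m128 r2 = _mm_loadu_ps(in + 8);
    __m128 r3 = _mm_loadu_ps(in + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  /* transposes the four rows in place */
    _mm_storeu_ps(out + 0,  r0);
    _mm_storeu_ps(out + 4,  r1);
    _mm_storeu_ps(out + 8,  r2);
    _mm_storeu_ps(out + 12, r3);
}
```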
Here we go for YAMTT :t (Yet Another Matrix Transpose Test)
10000000 calls per Matrix for the Cycle counter and the Routine timer.
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
4x4 Cycles: 2 CodeSize: 66 RoutineTime: 0.018620367 seconds
4x3 Cycles: 2 CodeSize: 56 RoutineTime: 0.016572648 seconds
4x2 Cycles: 2 CodeSize: 42 RoutineTime: 0.018659709 seconds
3x4 Cycles: 3 CodeSize: 62 RoutineTime: 0.018680211 seconds
3x3 Cycles: 1 CodeSize: 51 RoutineTime: 0.014052825 seconds
3x2 Cycles: 2 CodeSize: 35 RoutineTime: 0.016425531 seconds
2x4 Cycles: 1 CodeSize: 31 RoutineTime: 0.011702284 seconds
2x3 Cycles: 2 CodeSize: 27 RoutineTime: 0.009617162 seconds
2x2 Cycles: 0 CodeSize: 18 RoutineTime: 0.009352850 seconds
Code Alignment 64 byte check: 000h
Quote from: AW on June 25, 2018, 05:12:58 PM
Here we go for YAMTT :t (Yet Another Matrix Transpose Test)
Where's Rui?? :P
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
4x4 Cycles: 7 CodeSize: 66 RoutineTime: 0.051956401 seconds
4x3 Cycles: 5 CodeSize: 56 RoutineTime: 0.045790781 seconds
4x2 Cycles: 5 CodeSize: 42 RoutineTime: 0.045233293 seconds
3x4 Cycles: 6 CodeSize: 62 RoutineTime: 0.054255731 seconds
3x3 Cycles: 4 CodeSize: 51 RoutineTime: 0.044690584 seconds
3x2 Cycles: 3 CodeSize: 35 RoutineTime: 0.036121362 seconds
2x4 Cycles: 0 CodeSize: 31 RoutineTime: 0.023498652 seconds
2x3 Cycles: -1 CodeSize: 27 RoutineTime: 0.024507303 seconds
2x2 Cycles: -1 CodeSize: 18 RoutineTime: 0.019760855 seconds
Another run, same machine:
4x4 Cycles: 7 CodeSize: 66 RoutineTime: 0.055625230 seconds
4x3 Cycles: 5 CodeSize: 56 RoutineTime: 0.045135589 seconds
4x2 Cycles: 5 CodeSize: 42 RoutineTime: 0.045048148 seconds
3x4 Cycles: 7 CodeSize: 62 RoutineTime: 0.051966664 seconds
3x3 Cycles: 3 CodeSize: 51 RoutineTime: 0.038664541 seconds
3x2 Cycles: 3 CodeSize: 35 RoutineTime: 0.037659996 seconds
2x4 Cycles: -1 CodeSize: 31 RoutineTime: 0.024224864 seconds
2x3 Cycles: 0 CodeSize: 27 RoutineTime: 0.024652217 seconds
2x2 Cycles: -1 CodeSize: 18 RoutineTime: 0.019747718 seconds
Quote from: zedd151 on June 25, 2018, 05:16:35 PM
Quote from: AW on June 25, 2018, 05:12:58 PM
Here we go for YAMTT :t (Yet Another Matrix Transpose Test)
Where's Rui?? :P
Will be here as soon as possible, zedd151 :biggrin:
10000000 calls per Matrix for the Cycle counter and the Routine timer.
Intel(R) Atom(TM) CPU N455 @ 1.66GHz
4x4 Cycles: 58 CodeSize: 66 RoutineTime: 0.377001463 seconds
4x3 Cycles: 51 CodeSize: 56 RoutineTime: 0.339285241 seconds
4x2 Cycles: 19 CodeSize: 42 RoutineTime: 0.165002387 seconds
3x4 Cycles: 53 CodeSize: 62 RoutineTime: 0.339326507 seconds
3x3 Cycles: 38 CodeSize: 51 RoutineTime: 0.286387559 seconds
3x2 Cycles: 22 CodeSize: 35 RoutineTime: 0.171114020 seconds
2x4 Cycles: 22 CodeSize: 31 RoutineTime: 0.139145431 seconds
2x3 Cycles: 19 CodeSize: 27 RoutineTime: 0.144533990 seconds
2x2 Cycles: 11 CodeSize: 18 RoutineTime: 0.078623451 seconds
Code Alignment 64 byte check: 000h
Press any key to continue...
Thanks guys,
@Caché GB
The matrices I use here are unaligned (to process uneven matrix row numbers).
The aligned 4x4 version is even a bit faster:
10 million iterations (i7-4930K)
4x4 aligned DirectXMath Cycles: 4 CodeSize: 74 RoutineTime: 0.042951199 seconds
4x4 aligned Siekmanski Cycles: 4 CodeSize: 66 RoutineTime: 0.032361864 seconds
My aligned version takes 75.3456591% of the time of the aligned DirectXMath version.
That's 1.327216473 times faster and 8 bytes less code size. :biggrin:
@Caché GB
Are you sure you posted the correct version of the DirectXMath 4x4 matrix transpose?
The results are wrong:
00 04 02 06
01 05 03 07
08 12 10 14
09 13 11 15
Quote from: AW on June 25, 2018, 05:12:58 PM
Here we go for YAMTT :t (Yet Another Matrix Transpose Test)
Yeah, now we can play and build with it just as we played as kids with LEGO blocks. :t
It is cool, although the LEGO idea goes back to here: http://masm32.com/board/index.php?topic=6140.msg65148#msg65148
However, I would go for large LEGO pieces, like 8x8, instead of small ones which would only be used once. :idea:
10000000 calls per Matrix for the Cycle counter and the Routine timer.
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
4x4 Cycles: 2 CodeSize: 66 RoutineTime: 0.024838457 seconds
4x3 Cycles: 1 CodeSize: 56 RoutineTime: 0.023653736 seconds
4x2 Cycles: 2 CodeSize: 42 RoutineTime: 0.023732180 seconds
3x4 Cycles: 2 CodeSize: 62 RoutineTime: 0.023453961 seconds
3x3 Cycles: 2 CodeSize: 51 RoutineTime: 0.034194162 seconds
3x2 Cycles: 1 CodeSize: 35 RoutineTime: 0.024073813 seconds
2x4 Cycles: 1 CodeSize: 31 RoutineTime: 0.015524173 seconds
2x3 Cycles: 0 CodeSize: 27 RoutineTime: 0.016023061 seconds
2x2 Cycles: 1 CodeSize: 18 RoutineTime: 0.020969785 seconds
Code Alignment 64 byte check: 000h
10000000 calls per Matrix for the Cycle counter and the Routine timer.
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
4x4 Cycles: 3 CodeSize: 66 RoutineTime: 0.021707700 seconds
4x3 Cycles: 2 CodeSize: 56 RoutineTime: 0.018974900 seconds
4x2 Cycles: 3 CodeSize: 42 RoutineTime: 0.021691700 seconds
3x4 Cycles: 3 CodeSize: 62 RoutineTime: 0.021689300 seconds
3x3 Cycles: 1 CodeSize: 51 RoutineTime: 0.016264000 seconds
3x2 Cycles: 2 CodeSize: 35 RoutineTime: 0.018972900 seconds
2x4 Cycles: 1 CodeSize: 31 RoutineTime: 0.013553900 seconds
2x3 Cycles: 0 CodeSize: 27 RoutineTime: 0.011021800 seconds
2x2 Cycles: 0 CodeSize: 18 RoutineTime: 0.010840100 seconds
Code Alignment 64 byte check: 000h
Press any key to continue...
Quote from: AW on June 25, 2018, 09:49:29 PM
It is cool, although the LEGO idea goes back to here: http://masm32.com/board/index.php?topic=6140.msg65148#msg65148
However, I would go for large LEGO pieces, like 8x8, instead of small ones which would only be used once. :idea:
Of course we can use AVX for 8x8 matrices.
This is what I had in mind: all possible sizes without moving single values.
As an example, take the 9x9 matrix:
It can be done with 4 blocks of 4x4 matrices plus exchanging the left-over single values.
Or use matrices of sizes 4, 3 and 2 as shown below.
9 = 3 square blocks along the Hypotenuse (the RED line): 4(X), 3(Y), 2(Q).
Now you have to transpose the rest: 4x3(Y), 4x2(Z), 3x2(P).
If you look closely at the picture, you see the repetition of 4 3 2 on the X-axis.
The same goes for the Y-axis and the Hypotenuse: 4 3 2.
It looks like building with LEGO blocks...... :lol:
With this in mind you can build every size (even or uneven) by transposing along the Hypotenuse.
Start at the bottom right with the 3-row matrices (remember, rows of 3 values are written with a 4th value to increase speed) and work your way backwards in memory.
(http://members.home.nl/siekmanski/9x9_Matrix.png)
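A small C sketch of the diagonal split described above. The 4/3/2 selection rule here is my own guess at one workable scheme (it simply avoids a 1-wide leftover) and not necessarily Marinus' exact one; `diagonal_split` is a hypothetical name:

```c
/* Split n into square diagonal block sizes of 4, 3 and 2, never
   leaving a 1x1 remainder. For n = 9 this yields 4, 3, 2, matching
   the picture. Returns the number of diagonal blocks. */
int diagonal_split(int n, int sizes[])
{
    int count = 0;
    while (n > 0) {
        int s;
        if (n == 5)      s = 3;  /* 5 = 3 + 2: taking 4 would leave 1 */
        else if (n >= 6) s = 4;  /* plenty left, take the biggest kernel */
        else             s = n;  /* n is 2, 3 or 4: one final block */
        sizes[count++] = s;
        n -= s;
    }
    return count;
}
```

The off-diagonal rectangles then come in mirrored pairs: the tile spanning diagonal blocks i and j is exchanged with the one spanning j and i, so only the rectangular kernels (4x3, 4x2, 3x2 and their mirrors) are ever needed.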
deleted
@Marinus,
:t
@Nidud
Interesting. When I produced the DxMath library I investigated the way DirectXMath was doing it. At the time, the memory reserved for the transposed matrix was pointed to by RCX according to the Windows ABI and, on return, by RAX as usual; but now they are returning in RDX for some mysterious reason that only @nidud knows about. :badgrin:
We also concluded that Marinus' algo was better, so I later changed DxMath to @Marinus' algo.
Hi Siekmanski.
Yes, posted the wrong one. My apologies.
Been benching several. Here is the correct version.
XMMatrixTranspose PROC
movaps xmm5, xmmword ptr[edx+0h]
movaps xmm3, xmmword ptr[edx+20h]
movaps xmm4, xmm5
shufps xmm4, xmmword ptr[edx+10h], 44h
movaps xmm1, xmm3
shufps xmm5, xmmword ptr[edx+10h], 0EEh
movaps xmm2, xmm4
shufps xmm1, xmmword ptr[edx+20h], 44h
movaps xmm0, xmm5
shufps xmm3, xmmword ptr[edx+20h], 0EEh
shufps xmm2, xmm1, 88h
shufps xmm4, xmm1, 0DDh
shufps xmm0, xmm3, 88h
shufps xmm5, xmm3, 0DDh
movaps xmmword ptr[eax+0h] , xmm2
movaps xmmword ptr[eax+10h], xmm4
movaps xmmword ptr[eax+20h], xmm0
movaps xmmword ptr[eax+30h], xmm5
ret
XMMatrixTranspose ENDP
Just doing a brute-force test like this:
Counter equ 1000000000 ; 1 Billion
invoke Sleep, 1000
invoke GetTickCount
mov TestTime, eax
mov ecx, Counter
L01:
mov edx, offset g_mSpin
mov eax, offset g_mWorld
invoke Siekmanski_MatrixTranspose
dec ecx
jnz L01
invoke GetTickCount
sub eax, TestTime
invoke wsprintfA, addr szTestBuffer01, CSTR("%d milli secs", 13, 10), eax
invoke Sleep, 1000
invoke GetTickCount
mov TestTime, eax
mov ecx, Counter
L02:
mov edx, offset g_mSpin
mov eax, offset g_mWorld
invoke XMMatrixTranspose
dec ecx
jnz L02
invoke GetTickCount
sub eax, TestTime
invoke wsprintfA, addr szTestBuffer02, CSTR("%d milli secs", 13, 10), eax
invoke SendMessageA, hWndList01, LB_ADDSTRING, null, addr szTestBuffer01
invoke SendMessageA, hWndList01, LB_ADDSTRING, null, addr szTestBuffer02
invoke SendMessageA, hWndList01, LB_ADDSTRING, null, CSTR(" ", 13, 10)
ret
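The same brute-force idea in portable C, using clock() instead of GetTickCount. This is a sketch: the names and iteration count are illustrative, and a scalar transpose stands in for the routines under test:

```c
#include <time.h>

enum { ITERATIONS = 1000000 };  /* scaled down from the 1 billion above */

/* Scalar stand-in for the transpose routine being timed. */
static void transpose4x4(const float *src, float *dst)
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[c * 4 + r] = src[r * 4 + c];
}

/* Time ITERATIONS calls of fn and return the elapsed seconds.
   Calling through a function pointer keeps the compiler from
   optimizing the loop away entirely. */
double time_routine(void (*fn)(const float *, float *))
{
    static float in[16], out[16];
    clock_t t0 = clock();
    for (long i = 0; i < ITERATIONS; i++)
        fn(in, out);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```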
deleted
Only Rui beats me, with higher times :biggrin: :biggrin: :biggrin:
10000000 calls per Matrix for the Cycle counter and the Routine timer.
AMD A6-3500 APU with Radeon(tm) HD Graphics
4x4 Cycles: 8 CodeSize: 66 RoutineTime: 0.063076945 seconds
4x3 Cycles: 4 CodeSize: 56 RoutineTime: 0.064050233 seconds
4x2 Cycles: 7 CodeSize: 42 RoutineTime: 0.067956549 seconds
3x4 Cycles: 7 CodeSize: 62 RoutineTime: 0.069592511 seconds
3x3 Cycles: 3 CodeSize: 51 RoutineTime: 0.052619953 seconds
3x2 Cycles: 1 CodeSize: 35 RoutineTime: 0.045371007 seconds
2x4 Cycles: 0 CodeSize: 31 RoutineTime: 0.037927502 seconds
2x3 Cycles: 0 CodeSize: 27 RoutineTime: 0.036778671 seconds
2x2 Cycles: 1 CodeSize: 18 RoutineTime: 0.037995768 seconds
Code Alignment 64 byte check: 000h
Press any key to continue...
Hi Caché GB,
The last one is wrong too.
00 04 08 08
01 05 09 09
02 06 10 10
03 07 11 11
Thanks Nidud,
Did some runs, and Marinus' is a tiny bit faster on my system.
Yours has 6 memory reads and 4 memory writes.
Mine has 4 memory reads and 4 memory writes.
I don't know if the extra memory reads make any difference on some older computers.
Interesting way of loading data with shufps (never thought of this).
I will study your routine, thanks for it.
4x4m Cycles: 4 CodeSize: 66 RoutineTime: 0.032367591 seconds ; Marinus
4x4n Cycles: 4 CodeSize: 68 RoutineTime: 0.032533045 seconds ; Nidud
4x4m Cycles: 4 CodeSize: 66 RoutineTime: 0.032362563 seconds
4x4n Cycles: 4 CodeSize: 68 RoutineTime: 0.032495331 seconds
4x4m Cycles: 4 CodeSize: 66 RoutineTime: 0.032362074 seconds
4x4n Cycles: 4 CodeSize: 68 RoutineTime: 0.032362283 seconds
4x4m Cycles: 4 CodeSize: 66 RoutineTime: 0.032358303 seconds
4x4n Cycles: 4 CodeSize: 68 RoutineTime: 0.032364030 seconds
4x4m Cycles: 4 CodeSize: 66 RoutineTime: 0.032371503 seconds
4x4n Cycles: 4 CodeSize: 68 RoutineTime: 0.032403490 seconds
Nidud's version in 32-bit:
XMMatrixTranspose PROC
mov eax,offset MatrixIn
mov ecx,offset MatrixOut
M4_4n:
movaps xmm0,[eax+0]
movaps xmm2,[eax+020h]
movaps xmm4,xmm0
movhps xmm4,qword ptr[eax+010h]
shufps xmm0,[eax+010h],Shuffle(3,2,3,2)
movaps xmm1,xmm2
shufps xmm1,[eax+030h],Shuffle(3,2,3,2)
movhps xmm2,qword ptr[eax+030h]
movaps xmm3,xmm4
shufps xmm4,xmm2,Shuffle(3,1,3,1)
movaps [ecx+010h],xmm4
movaps xmm4,xmm0
shufps xmm3,xmm2,Shuffle(2,0,2,0)
shufps xmm4,xmm1,Shuffle(2,0,2,0)
shufps xmm0,xmm1,Shuffle(3,1,3,1)
movaps [ecx+0],xmm3
movaps [ecx+020h],xmm4
movaps [ecx+030h],xmm0
CodeSize4_4n = $-M4_4n
ret
XMMatrixTranspose ENDP
deleted
OK, I think I should stay out of the dojo.
Maybe third time lucky:
XMMatrixTranspose_003 PROC
movaps xmm5, xmmword ptr[edx+0h]
movaps xmm3, xmmword ptr[edx+20h]
movaps xmm4, xmm5
shufps xmm4, xmmword ptr[edx+10h], 44h
movaps xmm1, xmm3
shufps xmm5, xmmword ptr[edx+10h], 0EEh
movaps xmm2, xmm4
shufps xmm1, xmmword ptr[edx+30h], 44h ; <--- Not 20h
movaps xmm0, xmm5
shufps xmm3, xmmword ptr[edx+30h], 0EEh ; <--- Not 20h
shufps xmm2, xmm1, 88h
shufps xmm4, xmm1, 0DDh
shufps xmm0, xmm3, 88h
shufps xmm5, xmm3, 0DDh
movaps xmmword ptr[eax+0h] , xmm2
movaps xmmword ptr[eax+10h], xmm4
movaps xmmword ptr[eax+20h], xmm0
movaps xmmword ptr[eax+30h], xmm5
ret
XMMatrixTranspose_003 ENDP
Hi Caché GB,
Welcome to the dojo. :t
4x4 Cycles: 4 CodeSize: 66 RoutineTime: 0.032354671 seconds ; Siekmanski
4x4 Cycles: 4 CodeSize: 68 RoutineTime: 0.032370874 seconds ; Nidud
4x4 Cycles: 4 CodeSize: 70 RoutineTime: 0.032421439 seconds ; Caché GB
Your matrix results are correct.
The routines are all very close to each other in speed.
Results may vary on different architectures.
@nidud
As usual, you mix oranges with pears.
You showed the Windows ABI being used and a return structure filled through a pointer in RDX, which is wrong and never done in DirectXMath; then you justify that with vectorcall and the inlining of functions. What a bloody confusion! :shock:
Thank you sensei Siekmanski.
deleted
Hi nidud,
You inserted 4 extra instructions into my code,
the last 4 instructions:
movaps xmm0,xmm4 ; [0 4 8 C]
movaps xmm1,xmm5 ; [1 5 9 D]
movaps xmm3,xmm2 ; [3 7 B F]
movaps xmm2,xmm6 ; [2 6 A E]
Doesn't this slow down my code in comparison with yours?
Or is there a reason to insert them?
Removed xmm6 (to make it simpler :biggrin:)
XMMatrixTransposeM PROC
mov eax,offset MatrixIn
mov ecx,offset MatrixOut
movaps xmm0,[eax+0] ; [0 1 2 3]
movaps xmm1,[eax+16] ; [4 5 6 7]
movaps xmm2,[eax+32] ; [8 9 A B]
movaps xmm3,[eax+48] ; [C D E F]
movaps xmm4,xmm0 ; [0 1 2 3]
movaps xmm5,xmm2 ; [8 9 A B]
unpcklps xmm4,xmm1 ; [0 4 1 5]
unpcklps xmm5,xmm3 ; [8 C 9 D]
unpckhps xmm0,xmm1 ; [2 6 3 7]
unpckhps xmm2,xmm3 ; [A E B F]
movaps xmm1,xmm4 ; [0 4 1 5]
movaps xmm3,xmm0 ; [2 6 3 7]
movlhps xmm4,xmm5 ; [0 4 8 C]
movlhps xmm3,xmm2 ; [2 6 A E]
movhlps xmm5,xmm1 ; [1 5 9 D]
movhlps xmm2,xmm0 ; [3 7 B F]
movaps [ecx+0],xmm4 ; [0 4 8 C]
movaps [ecx+16],xmm5 ; [1 5 9 D]
movaps [ecx+32],xmm3 ; [2 6 A E]
movaps [ecx+48],xmm2 ; [3 7 B F]
ret
XMMatrixTransposeM ENDP
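For anyone following along in C, here is the same unpcklps/unpckhps/movlhps/movhlps dance written with SSE intrinsics. A sketch only: unaligned loads and stores are used for safety, while the assembly above uses aligned movaps, and `transpose4x4_unpack` is a hypothetical name:

```c
#include <xmmintrin.h>

/* Unpack-based 4x4 transpose, mirroring the register comments above. */
void transpose4x4_unpack(const float *in, float *out)
{
    __m128 r0 = _mm_loadu_ps(in + 0);   /* [0 1 2 3] */
    __m128 r1 = _mm_loadu_ps(in + 4);   /* [4 5 6 7] */
    __m128 r2 = _mm_loadu_ps(in + 8);   /* [8 9 A B] */
    __m128 r3 = _mm_loadu_ps(in + 12);  /* [C D E F] */

    __m128 t0 = _mm_unpacklo_ps(r0, r1);   /* [0 4 1 5] */
    __m128 t1 = _mm_unpackhi_ps(r0, r1);   /* [2 6 3 7] */
    __m128 t2 = _mm_unpacklo_ps(r2, r3);   /* [8 C 9 D] */
    __m128 t3 = _mm_unpackhi_ps(r2, r3);   /* [A E B F] */

    _mm_storeu_ps(out + 0,  _mm_movelh_ps(t0, t2)); /* [0 4 8 C] */
    _mm_storeu_ps(out + 4,  _mm_movehl_ps(t2, t0)); /* [1 5 9 D] */
    _mm_storeu_ps(out + 8,  _mm_movelh_ps(t1, t3)); /* [2 6 A E] */
    _mm_storeu_ps(out + 12, _mm_movehl_ps(t3, t1)); /* [3 7 B F] */
}
```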
deleted
No need for rearranging the regs if you include the memory reads and writes in the speed test.
deleted
deleted
:biggrin:
Quote from: nidud on July 14, 2018, 09:35:26 AM
Well, I'm writing vector call tests at the moment where values are kept in registers over multiple calls, so the thinking and implementation is a bit different.
OK, that makes sense.
Quote
As for AVX in this case there don't seem to be (as you pointed out) any speed improvement except from saving regs. There is also VMOVHLPS that can be used in the same way.
You must have confused me with someone else; I've never done an AVX version yet.
But will try it out some day.
I remember that we can replace VSHUFPS or VPERM2F128 with VBLENDPS instructions.
AVX shuffles are executed only on port 5, while blends are also executed on port 0.
VPERM2F128 instructions are not that fast.
Maybe we can get some gain out of it.
I will look this up.
EDIT: Found it, it's in chapter 12 section 11.1
http://members.home.nl/siekmanski/Intel_Optimization_Reference_Manual_248966-037.pdf
Quote from: nidud on July 14, 2018, 10:37:03 AM
Simpler and faster..
vunpckhps xmm4,xmm2,xmm3
vunpcklps xmm2,xmm2,xmm3
vunpckhps xmm3,xmm0,xmm1
vunpcklps xmm1,xmm0,xmm1
vmovlhps xmm0,xmm1,xmm2
vmovhlps xmm1,xmm2,xmm1
vmovlhps xmm2,xmm3,xmm4
vmovhlps xmm3,xmm4,xmm3
Timed this AVX piece on my computer; it's a little slower than the SSE version.
deleted
Great work!
So this works great for feeding d3d9 with loads of different matrices?