Author Topic: Fast SIMD transpose routines  (Read 598 times)

mineiro

  • Member
  • ***
  • Posts: 406
Re: Fast SIMD transpose routines
« Reply #15 on: June 25, 2018, 11:22:09 PM »
Code: [Select]
10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

4x4 Cycles: 3  CodeSize: 66 RoutineTime: 0.021707700 seconds
4x3 Cycles: 2  CodeSize: 56 RoutineTime: 0.018974900 seconds
4x2 Cycles: 3  CodeSize: 42 RoutineTime: 0.021691700 seconds
3x4 Cycles: 3  CodeSize: 62 RoutineTime: 0.021689300 seconds
3x3 Cycles: 1  CodeSize: 51 RoutineTime: 0.016264000 seconds
3x2 Cycles: 2  CodeSize: 35 RoutineTime: 0.018972900 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.013553900 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.011021800 seconds
2x2 Cycles: 0  CodeSize: 18 RoutineTime: 0.010840100 seconds

Code Alignment 64 byte check: 000h

Press any key to continue...
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

Siekmanski

  • Member
  • *****
  • Posts: 1553
Re: Fast SIMD transpose routines
« Reply #16 on: June 26, 2018, 12:13:08 AM »
It is cool, although the LEGO idea goes back to this post: http://masm32.com/board/index.php?topic=6140.msg65148#msg65148

However, I would go for large LEGO pieces, like 8x8, instead of small ones which will only be used once.  :idea:

Of course we can use AVX for 8x8 matrices.
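To make the 8x8 AVX idea concrete, here is a minimal C intrinsics sketch of an 8x8 single-precision transpose kept entirely in YMM registers (untested here; the function name and the 32-byte-aligned pointer arguments are only illustrative, not code from this thread):
Code: [Select]
#include <immintrin.h>

/* Sketch: transpose a row-major 8x8 float matrix, 32-byte aligned in/out. */
static void Transpose8x8_AVX(const float *in, float *out)
{
    __m256 r0 = _mm256_load_ps(in +  0), r1 = _mm256_load_ps(in +  8);
    __m256 r2 = _mm256_load_ps(in + 16), r3 = _mm256_load_ps(in + 24);
    __m256 r4 = _mm256_load_ps(in + 32), r5 = _mm256_load_ps(in + 40);
    __m256 r6 = _mm256_load_ps(in + 48), r7 = _mm256_load_ps(in + 56);

    /* interleave pairs of rows within each 128-bit lane */
    __m256 t0 = _mm256_unpacklo_ps(r0, r1), t1 = _mm256_unpackhi_ps(r0, r1);
    __m256 t2 = _mm256_unpacklo_ps(r2, r3), t3 = _mm256_unpackhi_ps(r2, r3);
    __m256 t4 = _mm256_unpacklo_ps(r4, r5), t5 = _mm256_unpackhi_ps(r4, r5);
    __m256 t6 = _mm256_unpacklo_ps(r6, r7), t7 = _mm256_unpackhi_ps(r6, r7);

    /* gather 4-element sub-columns within each lane */
    __m256 s0 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(1,0,1,0));
    __m256 s1 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(3,2,3,2));
    __m256 s2 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(1,0,1,0));
    __m256 s3 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(3,2,3,2));
    __m256 s4 = _mm256_shuffle_ps(t4, t6, _MM_SHUFFLE(1,0,1,0));
    __m256 s5 = _mm256_shuffle_ps(t4, t6, _MM_SHUFFLE(3,2,3,2));
    __m256 s6 = _mm256_shuffle_ps(t5, t7, _MM_SHUFFLE(1,0,1,0));
    __m256 s7 = _mm256_shuffle_ps(t5, t7, _MM_SHUFFLE(3,2,3,2));

    /* swap the 128-bit lanes to finish the transpose */
    _mm256_store_ps(out +  0, _mm256_permute2f128_ps(s0, s4, 0x20));
    _mm256_store_ps(out +  8, _mm256_permute2f128_ps(s1, s5, 0x20));
    _mm256_store_ps(out + 16, _mm256_permute2f128_ps(s2, s6, 0x20));
    _mm256_store_ps(out + 24, _mm256_permute2f128_ps(s3, s7, 0x20));
    _mm256_store_ps(out + 32, _mm256_permute2f128_ps(s0, s4, 0x31));
    _mm256_store_ps(out + 40, _mm256_permute2f128_ps(s1, s5, 0x31));
    _mm256_store_ps(out + 48, _mm256_permute2f128_ps(s2, s6, 0x31));
    _mm256_store_ps(out + 56, _mm256_permute2f128_ps(s3, s7, 0x31));
}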

This is what I had in mind: all possible sizes without moving single values.

As an example, the 9x9 matrix:
It can be done with 4 blocks of 4x4 matrices and exchanging the left-over single values,
or with 4/3/2 sized matrices as shown below.

9 = 3 square blocks along the Hypotenuse ( the RED line ): 4(X), 3(Y), 2(Q).
    Now you have to transpose the rest: 4x3(Y), 4x2(Z), 3x2(P).

If you look closely at the picture you see on the X-axis the repetition of 4, 3, 2.
The same goes for the Y-axis and the Hypotenuse: 4, 3, 2.

It looks like building with LEGO blocks......  :lol:

With this in mind you can build every size ( even or uneven ) by transposing along the Hypotenuse.
Start at the bottom right with the 3-row sized matrices ( remember, rows of 3 values will overwrite with the 4th value to increase the speed ) and work your way backwards in memory.
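A scalar C sketch of the bookkeeping, just to illustrate the idea; the inner swap loops stand in for the 4x4/3x3/2x2 SIMD kernels in this thread, and the function names and the {4,3,2} schedule are only an example:
Code: [Select]
#include <stddef.h>

/* Transpose a b x b block sitting on the diagonal of an n x n matrix. */
static void TransposeBlockInPlace(float *m, size_t n, size_t i, size_t b)
{
    for (size_t r = 0; r < b; r++)
        for (size_t c = r + 1; c < b; c++) {
            float t = m[(i + r) * n + (i + c)];
            m[(i + r) * n + (i + c)] = m[(i + c) * n + (i + r)];
            m[(i + c) * n + (i + r)] = t;
        }
}

/* Exchange a rows x cols block above the diagonal with its mirror below it. */
static void SwapTransposeBlocks(float *m, size_t n, size_t row, size_t col,
                                size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++) {
            float t = m[(row + r) * n + (col + c)];
            m[(row + r) * n + (col + c)] = m[(col + c) * n + (row + r)];
            m[(col + c) * n + (row + r)] = t;
        }
}

/* In-place n x n transpose built from square blocks along the diagonal.
   blocks[] is one possible schedule whose sizes sum to n (at most 16 blocks). */
static void TransposeBlocked(float *m, size_t n, const size_t *blocks, size_t nblocks)
{
    size_t offs[16];
    size_t pos = 0;
    for (size_t k = 0; k < nblocks; k++) { offs[k] = pos; pos += blocks[k]; }

    for (size_t k = 0; k < nblocks; k++) {
        TransposeBlockInPlace(m, n, offs[k], blocks[k]);    /* square block on the Hypotenuse */
        for (size_t j = k + 1; j < nblocks; j++)            /* the rectangular left-overs */
            SwapTransposeBlocks(m, n, offs[k], offs[j], blocks[k], blocks[j]);
    }
}
For the 9x9 example: size_t blocks[] = { 4, 3, 2 }; TransposeBlocked(m, 9, blocks, 3);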


Creative coders use backward thinking techniques as their strategy.

nidud

  • Member
  • *****
  • Posts: 1531
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #17 on: June 26, 2018, 12:57:02 AM »
Here's a 64-bit test version of MatrixTranspose from DirectXMath

The code is something like this:
Code: [Select]
    movaps xmm0,[rcx+0x00]
    movaps xmm2,[rcx+0x20]
    movaps xmm4,xmm0
    movhps xmm4,[rcx+0x10]
    shufps xmm0,[rcx+0x10],_MM_SHUFFLE(3,2,3,2)
    movaps xmm1,xmm2
    shufps xmm1,[rcx+0x30],_MM_SHUFFLE(3,2,3,2)
    movhps xmm2,[rcx+0x30]
    movaps xmm3,xmm4
    shufps xmm4,xmm2,_MM_SHUFFLE(3,1,3,1)
    movaps [rdx+0x10],xmm4
    movaps xmm4,xmm0
    shufps xmm3,xmm2,_MM_SHUFFLE(2,0,2,0)
    shufps xmm4,xmm1,_MM_SHUFFLE(2,0,2,0)
    shufps xmm0,xmm1,_MM_SHUFFLE(3,1,3,1)
    movaps [rdx+0x00],xmm3
    movaps [rdx+0x20],xmm4
    movaps [rdx+0x30],xmm0
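(Side note for anyone matching these immediates against the hex constants used further down in the thread: _MM_SHUFFLE simply packs four 2-bit element selectors into one byte, as defined in xmmintrin.h.)
Code: [Select]
/* From xmmintrin.h: each argument selects one 32-bit source element (0..3). */
#define _MM_SHUFFLE(z, y, x, w)  (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

/* The values used in this thread:
   _MM_SHUFFLE(1,0,1,0) = 0x44 (44h)    _MM_SHUFFLE(3,2,3,2) = 0xEE (0EEh)
   _MM_SHUFFLE(2,0,2,0) = 0x88 (88h)    _MM_SHUFFLE(3,1,3,1) = 0xDD (0DDh)  */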

They seem to be more or less equal:

Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz (AVX2 AVX SSE4.2 SSE4.1 SSSE3 SSE3 SSE2)
----------------------------------------------
-- proc(1)
    59557 cycles, rep(10000), code( 67) 1.asm: Marinus
    56599 cycles, rep(10000), code( 69) 2.asm: DirectXMath
-- proc(2)
    57924 cycles, rep(10000), code( 67) 1.asm: Marinus
    57708 cycles, rep(10000), code( 69) 2.asm: DirectXMath
-- proc(3)
    58194 cycles, rep(10000), code( 67) 1.asm: Marinus
    58009 cycles, rep(10000), code( 69) 2.asm: DirectXMath
-- proc(4)
    60007 cycles, rep(10000), code( 67) 1.asm: Marinus
    57674 cycles, rep(10000), code( 69) 2.asm: DirectXMath

total [1 .. 4], 1++
   229990 cycles 2.asm: DirectXMath
   235682 cycles 1.asm: Marinus
hit any key to continue...

AW

  • Member
  • *****
  • Posts: 1360
  • Let's Make ASM Great Again!
Re: Fast SIMD transpose routines
« Reply #18 on: June 26, 2018, 01:44:03 AM »
@Marinus,
 :t

@Nidud
Interesting. When I produced the DxMath library I investigated the way DirectXMath was doing it, and at the time the memory reserved for the transposed matrix was pointed to by RCX according to the Windows ABI and on return was pointed to by RAX as usual, but now they are returning in RDX for some mysterious reason that only @nidud knows about..  :badgrin:
We also concluded that Marinus' algo was better, so I later changed DxMath to @Marinus' algo.


« Last Edit: June 26, 2018, 03:09:57 AM by AW »

Caché GB

  • Regular Member
  • *
  • Posts: 44
  • MASM IS HOT
Re: Fast SIMD transpose routines
« Reply #19 on: June 26, 2018, 02:07:06 AM »
Hi Siekmanski.

Yes, posted the wrong one. My apologies.
Been benching several. Here is the correct version.

Code: [Select]

XMMatrixTranspose PROC

       movaps  xmm5, xmmword ptr[edx+0h] 
       movaps  xmm3, xmmword ptr[edx+20h]
       movaps  xmm4, xmm5 
       shufps  xmm4, xmmword ptr[edx+10h], 44h 
       movaps  xmm1, xmm3 
       shufps  xmm5, xmmword ptr[edx+10h], 0EEh 
       movaps  xmm2, xmm4 
       shufps  xmm1, xmmword ptr[edx+20h], 44h 
       movaps  xmm0, xmm5 
       shufps  xmm3, xmmword ptr[edx+20h], 0EEh 
       shufps  xmm2, xmm1, 88h 
       shufps  xmm4, xmm1, 0DDh 
       shufps  xmm0, xmm3, 88h 
       shufps  xmm5, xmm3, 0DDh 
       movaps  xmmword ptr[eax+0h] , xmm2
       movaps  xmmword ptr[eax+10h], xmm4
       movaps  xmmword ptr[eax+20h], xmm0
       movaps  xmmword ptr[eax+30h], xmm5
          ret

XMMatrixTranspose ENDP


Just doing a brute-force test like this.

Code: [Select]

       Counter  equ  1000000000  ; 1 Billion

       invoke  Sleep, 1000

       invoke  GetTickCount
          mov  TestTime, eax
          mov  ecx, Counter 
     L01:
                mov  edx, offset g_mSpin
                mov  eax, offset g_mWorld
             invoke  Siekmanski_MatrixTranspose

          dec  ecx
          jnz  L01

       invoke  GetTickCount
          sub  eax, TestTime
       invoke  wsprintfA, addr szTestBuffer01, CSTR("%d milli secs", 13, 10), eax

       invoke  Sleep, 1000       

       invoke  GetTickCount
          mov  TestTime, eax
          mov  ecx, Counter
     L02:
                mov  edx, offset g_mSpin
                mov  eax, offset g_mWorld
             invoke  XMMatrixTranspose

          dec  ecx
          jnz  L02

       invoke  GetTickCount
          sub  eax, TestTime
       invoke  wsprintfA, addr szTestBuffer02, CSTR("%d milli secs", 13, 10), eax

       invoke  SendMessageA, hWndList01, LB_ADDSTRING, null, addr szTestBuffer01
       invoke  SendMessageA, hWndList01, LB_ADDSTRING, null, addr szTestBuffer02
       invoke  SendMessageA, hWndList01, LB_ADDSTRING, null, CSTR(" ", 13, 10)

          ret

Caché GB's 1 and 0-nly language:MASM

nidud

  • Member
  • *****
  • Posts: 1531
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #20 on: June 26, 2018, 03:06:51 AM »
Quote
Interesting, when I produced the DxMath library I have investigated the way DirectXMath was doing it and at the time the results were returned in RCX according to Windows ABI and pointed to by RAX as usual, but now they are returning in RDX for some mysterious reason that only nidud knows about..

 :biggrin:

Here we go again.

The DirectXMath functions use vectorcall, but yes, there are compatible methods using memory included as well. If you are interested to see how the vectorcall implementation works, you may use the tool I recommended earlier.

If you paste this code into https://gcc.godbolt.org/, select x86-64 MSVC Pre 2018 and use the command line switches -O2 -Gv, you will see that no code is generated at all.

#include <emmintrin.h>

typedef struct {
    __m128 r[4];
} FXMMATRIX;

inline FXMMATRIX XMMatrixTranspose(FXMMATRIX M)
{
    __m128 vTemp1 = _mm_shuffle_ps(M.r[0],M.r[1],_MM_SHUFFLE(1,0,1,0));
    __m128 vTemp3 = _mm_shuffle_ps(M.r[0],M.r[1],_MM_SHUFFLE(3,2,3,2));
    __m128 vTemp2 = _mm_shuffle_ps(M.r[2],M.r[3],_MM_SHUFFLE(1,0,1,0));
    __m128 vTemp4 = _mm_shuffle_ps(M.r[2],M.r[3],_MM_SHUFFLE(3,2,3,2));
    FXMMATRIX mResult;

    mResult.r[0] = _mm_shuffle_ps(vTemp1, vTemp2,_MM_SHUFFLE(2,0,2,0));
    mResult.r[1] = _mm_shuffle_ps(vTemp1, vTemp2,_MM_SHUFFLE(3,1,3,1));
    mResult.r[2] = _mm_shuffle_ps(vTemp3, vTemp4,_MM_SHUFFLE(2,0,2,0));
    mResult.r[3] = _mm_shuffle_ps(vTemp3, vTemp4,_MM_SHUFFLE(3,1,3,1));
    return mResult;
}


The reason for this is that the function is inline, so it's basically (in assembler terms) a macro that will either expand inline when you use it or, depending on the size/situation, be emitted as a real function. If you add the following code (which doesn't make much sense to the compiler) the function will be produced.

void foo()
{
    FXMMATRIX M;

    XMMatrixTranspose(M);
}


The generated code for XMMatrixTranspose() will then be as follows:
Code: [Select]
        sub      rsp, 40              ; 00000028H
        movaps   XMMWORD PTR [rsp+16], xmm6
        movaps   xmm5, xmm2
        movaps   XMMWORD PTR [rsp], xmm7
        movaps   xmm6, xmm0
        shufps   xmm6, xmm1, 238            ; 000000eeH
        movaps   xmm7, xmm0
        shufps   xmm7, xmm1, 68       ; 00000044H
        movaps   xmm4, xmm2
        shufps   xmm4, xmm3, 68       ; 00000044H
        movaps   xmm0, xmm7
        shufps   xmm5, xmm3, 238            ; 000000eeH
        movaps   xmm2, xmm6
        shufps   xmm7, xmm4, 221            ; 000000ddH
        shufps   xmm6, xmm5, 221            ; 000000ddH
        movaps   xmm1, xmm7
        movaps   xmm7, XMMWORD PTR [rsp]
        movaps   xmm3, xmm6
        movaps   xmm6, XMMWORD PTR [rsp+16]
        shufps   xmm0, xmm4, 136            ; 00000088H
        shufps   xmm2, xmm5, 136            ; 00000088H
        add      rsp, 40              ; 00000028H
        ret      0

So this is the basics of using vectorcall. Arguments are passed in vector registers and the result is returned in vector registers. In this case the matrix is passed in xmm0..3 and returned in xmm0..3.
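A minimal sketch of getting the standalone version without the foo() trick, assuming MSVC with -O2: drop the inline and mark the function __vectorcall explicitly (the name XMMatrixTransposeV is just for illustration). The whole matrix then stays in xmm0..3 at the call boundary, since a four-element __m128 aggregate is a homogeneous vector aggregate under vectorcall.
Code: [Select]
#include <emmintrin.h>

typedef struct {
    __m128 r[4];
} FXMMATRIX;

/* Non-inline, explicit vectorcall: the four rows arrive in xmm0..xmm3 and the
   transposed rows are returned in xmm0..xmm3, no memory traffic at the call. */
FXMMATRIX __vectorcall XMMatrixTransposeV(FXMMATRIX M)
{
    __m128 vTemp1 = _mm_shuffle_ps(M.r[0], M.r[1], _MM_SHUFFLE(1,0,1,0));
    __m128 vTemp3 = _mm_shuffle_ps(M.r[0], M.r[1], _MM_SHUFFLE(3,2,3,2));
    __m128 vTemp2 = _mm_shuffle_ps(M.r[2], M.r[3], _MM_SHUFFLE(1,0,1,0));
    __m128 vTemp4 = _mm_shuffle_ps(M.r[2], M.r[3], _MM_SHUFFLE(3,2,3,2));
    FXMMATRIX mResult;

    mResult.r[0] = _mm_shuffle_ps(vTemp1, vTemp2, _MM_SHUFFLE(2,0,2,0));
    mResult.r[1] = _mm_shuffle_ps(vTemp1, vTemp2, _MM_SHUFFLE(3,1,3,1));
    mResult.r[2] = _mm_shuffle_ps(vTemp3, vTemp4, _MM_SHUFFLE(2,0,2,0));
    mResult.r[3] = _mm_shuffle_ps(vTemp3, vTemp4, _MM_SHUFFLE(3,1,3,1));
    return mResult;
}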

HSE

  • Member
  • ****
  • Posts: 717
  • <AMD>< 7-32>
Re: Fast SIMD transpose routines
« Reply #21 on: June 26, 2018, 03:19:18 AM »
Only Rui is beating me with higher times  :biggrin: :biggrin: :biggrin:

Code: [Select]
10000000 calls per Matrix for the Cycle counter and the Routine timer.

AMD A6-3500 APU with Radeon(tm) HD Graphics

4x4 Cycles: 8  CodeSize: 66 RoutineTime: 0.063076945 seconds
4x3 Cycles: 4  CodeSize: 56 RoutineTime: 0.064050233 seconds
4x2 Cycles: 7  CodeSize: 42 RoutineTime: 0.067956549 seconds
3x4 Cycles: 7  CodeSize: 62 RoutineTime: 0.069592511 seconds
3x3 Cycles: 3  CodeSize: 51 RoutineTime: 0.052619953 seconds
3x2 Cycles: 1  CodeSize: 35 RoutineTime: 0.045371007 seconds
2x4 Cycles: 0  CodeSize: 31 RoutineTime: 0.037927502 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.036778671 seconds
2x2 Cycles: 1  CodeSize: 18 RoutineTime: 0.037995768 seconds

Code Alignment 64 byte check: 000h

Press any key to continue...

Siekmanski

  • Member
  • *****
  • Posts: 1553
Re: Fast SIMD transpose routines
« Reply #22 on: June 26, 2018, 03:28:46 AM »
Hi Caché GB,
The last one is wrong too.

00 04 08 08
01 05 09 09
02 06 10 10
03 07 11 11

Thanks Nidud,

Did some runs and Marinus is a tiny bit faster on my system.
Yours has 6 memory reads and 4 memory writes.
Mine has 4 memory reads and 4 memory writes.
Don't know if the extra memory reads make any difference when run on some older computers?

Interesting way of loading data with shufps ( never thought of this ).
I will study your routine, thanks for it.

Code: [Select]
4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032367591 seconds ; Marinus
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032533045 seconds ; Nidud

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032362563 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032495331 seconds

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032362074 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032362283 seconds

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032358303 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032364030 seconds

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032371503 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032403490 seconds

Nidud's version in 32-bit:

Code: [Select]
XMMatrixTranspose PROC
    mov         eax,offset MatrixIn
    mov         ecx,offset MatrixOut
M4_4n:
    movaps      xmm0,[eax+0]
    movaps      xmm2,[eax+020h]
    movaps      xmm4,xmm0
    movhps      xmm4,qword ptr[eax+010h]
    shufps      xmm0,[eax+010h],Shuffle(3,2,3,2)
    movaps      xmm1,xmm2
    shufps      xmm1,[eax+030h],Shuffle(3,2,3,2)
    movhps      xmm2,qword ptr[eax+030h]
    movaps      xmm3,xmm4
    shufps      xmm4,xmm2,Shuffle(3,1,3,1)
    movaps      [ecx+010h],xmm4
    movaps      xmm4,xmm0
    shufps      xmm3,xmm2,Shuffle(2,0,2,0)
    shufps      xmm4,xmm1,Shuffle(2,0,2,0)
    shufps      xmm0,xmm1,Shuffle(3,1,3,1)
    movaps      [ecx+0],xmm3
    movaps      [ecx+020h],xmm4
    movaps      [ecx+030h],xmm0
    CodeSize4_4n = $-M4_4n
    ret
XMMatrixTranspose ENDP

Creative coders use backward thinking techniques as their strategy.

nidud

  • Member
  • *****
  • Posts: 1531
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #23 on: June 26, 2018, 03:57:44 AM »
Quote
Did some runs and Marinus is a tiny bit faster on my system.
Yours has 6 memory reads and 4 memory writes.
Mine has 4 memory reads and 4 memory writes.
Don't know if the extra memory reads make any difference when run on some older computers?

It's possible to use regs on some of these but it appears to be slower (don't ask me why). The original code uses more memory accesses and is even faster than this one, but that may be a local thing on my system though.
Code: [Select]
    movaps xmm2,[rcx+0x00]
    movaps xmm3,[rcx+0x20]
    movaps xmm0,[rcx+0x00]
    shufps xmm0,[rcx+0x10],_MM_SHUFFLE(3,2,3,2)
    movaps xmm1,[rcx+0x20]
    shufps xmm1,[rcx+0x30],_MM_SHUFFLE(3,2,3,2)
    movhps xmm2,[rcx+0x10]
    movhps xmm3,[rcx+0x30]
    movaps xmm4,xmm2
    shufps xmm2,xmm3,_MM_SHUFFLE(3,1,3,1)
    movaps [rdx+0x10],xmm2
    movaps xmm2,xmm0
    shufps xmm4,xmm3,_MM_SHUFFLE(2,0,2,0)
    shufps xmm2,xmm1,_MM_SHUFFLE(2,0,2,0)
    shufps xmm0,xmm1,_MM_SHUFFLE(3,1,3,1)
    movaps [rdx+0x00],xmm4
    movaps [rdx+0x20],xmm2
    movaps [rdx+0x30],xmm0

Caché GB

  • Regular Member
  • *
  • Posts: 44
  • MASM IS HOT
Re: Fast SIMD transpose routines
« Reply #24 on: June 26, 2018, 04:26:50 AM »
Ok, I think I should stay out of the dojo.

Maybe third time lucky.

Code: [Select]

XMMatrixTranspose_003 PROC

       movaps  xmm5, xmmword ptr[edx+0h] 
       movaps  xmm3, xmmword ptr[edx+20h]
       movaps  xmm4, xmm5 
       shufps  xmm4, xmmword ptr[edx+10h], 44h 
       movaps  xmm1, xmm3 
       shufps  xmm5, xmmword ptr[edx+10h], 0EEh 
       movaps  xmm2, xmm4 
       shufps  xmm1, xmmword ptr[edx+30h], 44h  ; <--- Not 20h
       movaps  xmm0, xmm5 
       shufps  xmm3, xmmword ptr[edx+30h], 0EEh  ; <--- Not 20h 
       shufps  xmm2, xmm1, 88h 
       shufps  xmm4, xmm1, 0DDh 
       shufps  xmm0, xmm3, 88h 
       shufps  xmm5, xmm3, 0DDh 
       movaps  xmmword ptr[eax+0h] , xmm2
       movaps  xmmword ptr[eax+10h], xmm4
       movaps  xmmword ptr[eax+20h], xmm0
       movaps  xmmword ptr[eax+30h], xmm5
          ret

XMMatrixTranspose_003 ENDP

Caché GB's 1 and 0-nly language:MASM

Siekmanski

  • Member
  • *****
  • Posts: 1553
Re: Fast SIMD transpose routines
« Reply #25 on: June 26, 2018, 02:31:41 PM »
Hi Caché GB,

Welcome to the dojo.  :t

4x4 Cycles: 4  CodeSize: 66 RoutineTime: 0.032354671 seconds ; Siekmanski
4x4 Cycles: 4  CodeSize: 68 RoutineTime: 0.032370874 seconds ; Nidud
4x4 Cycles: 4  CodeSize: 70 RoutineTime: 0.032421439 seconds ; Caché GB

Your matrix results are correct.

The speeds of the routines are all very close to each other.
Results may vary on different architectures.
Creative coders use backward thinking techniques as their strategy.

AW

  • Member
  • *****
  • Posts: 1360
  • Let's Make ASM Great Again!
Re: Fast SIMD transpose routines
« Reply #26 on: June 26, 2018, 04:00:17 PM »
@nidud

As usual, you mix oranges with pears.
You showed the Windows ABI being used and a return structure pointed to by RDX being filled, which is wrong and never done in DirectXMath, and then you justify that with vectorcall and inlining of functions. What a bloody confusion!   :shock:

Caché GB

  • Regular Member
  • *
  • Posts: 44
  • MASM IS HOT
Re: Fast SIMD transpose routines
« Reply #27 on: June 27, 2018, 07:07:04 AM »
Thank you, sensei Siekmanski.
Caché GB's 1 and 0-nly language:MASM

nidud

  • Member
  • *****
  • Posts: 1531
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #28 on: July 14, 2018, 06:53:38 AM »
Here's an AVX version. The timing is more or less equal on all three of them, but the AVX version is simpler since you get away with using only one extra register (xmm4).

Code: [Select]
    vshufps xmm4,xmm2,xmm3,_MM_SHUFFLE(3,2,3,2)
    vshufps xmm2,xmm2,xmm3,_MM_SHUFFLE(1,0,1,0)
    vshufps xmm3,xmm0,xmm1,_MM_SHUFFLE(3,2,3,2)
    vshufps xmm1,xmm0,xmm1,_MM_SHUFFLE(1,0,1,0)
    vshufps xmm0,xmm1,xmm2,_MM_SHUFFLE(2,0,2,0)
    vshufps xmm1,xmm1,xmm2,_MM_SHUFFLE(3,1,3,1)
    vshufps xmm2,xmm3,xmm4,_MM_SHUFFLE(2,0,2,0)
    vshufps xmm3,xmm3,xmm4,_MM_SHUFFLE(3,1,3,1)

Quote
Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz (AVX2)
----------------------------------------------
-- test(1)
    59933 cycles, rep(10000), code( 49) 1.asm: Marinus
    62218 cycles, rep(10000), code( 46) 2.asm: DirectXMath SSE
    60596 cycles, rep(10000), code( 41) 3.asm: DirectXMath AVX
-- test(2)
    60355 cycles, rep(10000), code( 49) 1.asm: Marinus
    60618 cycles, rep(10000), code( 46) 2.asm: DirectXMath SSE
    60869 cycles, rep(10000), code( 41) 3.asm: DirectXMath AVX
-- test(3)
    61963 cycles, rep(10000), code( 49) 1.asm: Marinus
    60878 cycles, rep(10000), code( 46) 2.asm: DirectXMath SSE
    61006 cycles, rep(10000), code( 41) 3.asm: DirectXMath AVX
-- test(4)
    60767 cycles, rep(10000), code( 49) 1.asm: Marinus
    62612 cycles, rep(10000), code( 46) 2.asm: DirectXMath SSE
    60552 cycles, rep(10000), code( 41) 3.asm: DirectXMath AVX

total [1 .. 4], 1++
   243018 cycles 1.asm: Marinus
   243023 cycles 3.asm: DirectXMath AVX
   246326 cycles 2.asm: DirectXMath SSE
hit any key to continue...
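As a side note (not part of the timings above): the compilers ship essentially this register-only pattern as the _MM_TRANSPOSE4_PS macro in xmmintrin.h, which rewrites its four row arguments in place. A small wrapper sketch (hypothetical name, 16-byte aligned pointers assumed):
Code: [Select]
#include <xmmintrin.h>

/* Sketch: 4x4 float transpose via the stock _MM_TRANSPOSE4_PS macro, which
   expands to a shuffle/unpack sequence much like the ones in this thread. */
static void Transpose4x4_Macro(const float *in, float *out)
{
    __m128 row0 = _mm_load_ps(in +  0);
    __m128 row1 = _mm_load_ps(in +  4);
    __m128 row2 = _mm_load_ps(in +  8);
    __m128 row3 = _mm_load_ps(in + 12);

    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);   /* rows become columns, in registers */

    _mm_store_ps(out +  0, row0);
    _mm_store_ps(out +  4, row1);
    _mm_store_ps(out +  8, row2);
    _mm_store_ps(out + 12, row3);
}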

Siekmanski

  • Member
  • *****
  • Posts: 1553
Re: Fast SIMD transpose routines
« Reply #29 on: July 14, 2018, 08:29:58 AM »
Hi nidud,

You inserted 4 extra instructions in my code.
The last 4 instructions:

    movaps      xmm0,xmm4       ; [0 4 8 C]
    movaps      xmm1,xmm5       ; [1 5 9 D]
    movaps      xmm3,xmm2       ; [3 7 B F]
    movaps      xmm2,xmm6       ; [2 6 A E]

Doesn't this slow down my code in comparison with yours?
Or is there a reason to insert them?

Removed xmm6 (to make it simpler  :biggrin: )
Code: [Select]
XMMatrixTransposeM PROC
    mov         eax,offset MatrixIn
    mov         ecx,offset MatrixOut

    movaps      xmm0,[eax+0]            ; [0 1 2 3]
    movaps      xmm1,[eax+16]           ; [4 5 6 7]
    movaps      xmm2,[eax+32]           ; [8 9 A B]
    movaps      xmm3,[eax+48]           ; [C D E F]

    movaps      xmm4,xmm0               ; [0 1 2 3]
    movaps      xmm5,xmm2               ; [8 9 A B]
    unpcklps    xmm4,xmm1               ; [0 4 1 5]
    unpcklps    xmm5,xmm3               ; [8 C 9 D]
    unpckhps    xmm0,xmm1               ; [2 6 3 7]
    unpckhps    xmm2,xmm3               ; [A E B F]
    movaps      xmm1,xmm4               ; [0 4 1 5]
    movaps      xmm3,xmm0               ; [2 6 3 7]
    movlhps     xmm4,xmm5               ; [0 4 8 C]
    movlhps     xmm3,xmm2               ; [2 6 A E]
    movhlps     xmm5,xmm1               ; [1 5 9 D]
    movhlps     xmm2,xmm0               ; [3 7 B F]

    movaps      [ecx+0],xmm4            ; [0 4 8 C]
    movaps      [ecx+16],xmm5           ; [1 5 9 D]
    movaps      [ecx+32],xmm3           ; [2 6 A E]
    movaps      [ecx+48],xmm2           ; [3 7 B F]
    ret
XMMatrixTransposeM ENDP
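For reference, the same unpack/movlhps/movhlps pattern written with C intrinsics (a sketch only; the function name and the pointer parameters are mine, not the MASM globals above):
Code: [Select]
#include <xmmintrin.h>

/* Sketch: unpack-based 4x4 float transpose, 16-byte aligned pointers assumed. */
static void TransposeM_Unpack(const float *in, float *out)
{
    __m128 x0 = _mm_load_ps(in +  0);              /* [0 1 2 3] */
    __m128 x1 = _mm_load_ps(in +  4);              /* [4 5 6 7] */
    __m128 x2 = _mm_load_ps(in +  8);              /* [8 9 A B] */
    __m128 x3 = _mm_load_ps(in + 12);              /* [C D E F] */

    __m128 t01lo = _mm_unpacklo_ps(x0, x1);        /* [0 4 1 5] */
    __m128 t23lo = _mm_unpacklo_ps(x2, x3);        /* [8 C 9 D] */
    __m128 t01hi = _mm_unpackhi_ps(x0, x1);        /* [2 6 3 7] */
    __m128 t23hi = _mm_unpackhi_ps(x2, x3);        /* [A E B F] */

    _mm_store_ps(out +  0, _mm_movelh_ps(t01lo, t23lo));   /* [0 4 8 C] */
    _mm_store_ps(out +  4, _mm_movehl_ps(t23lo, t01lo));   /* [1 5 9 D] */
    _mm_store_ps(out +  8, _mm_movelh_ps(t01hi, t23hi));   /* [2 6 A E] */
    _mm_store_ps(out + 12, _mm_movehl_ps(t23hi, t01hi));   /* [3 7 B F] */
}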
Creative coders use backward thinking techniques as their strategy.