Author Topic: Fast SIMD transpose routines  (Read 598 times)

mineiro

  • Member
  • ***
  • Posts: 406
Re: Fast SIMD transpose routines
« Reply #15 on: June 25, 2018, 11:22:09 PM »
Code: [Select]
10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

4x4 Cycles: 3  CodeSize: 66 RoutineTime: 0.021707700 seconds
4x3 Cycles: 2  CodeSize: 56 RoutineTime: 0.018974900 seconds
4x2 Cycles: 3  CodeSize: 42 RoutineTime: 0.021691700 seconds
3x4 Cycles: 3  CodeSize: 62 RoutineTime: 0.021689300 seconds
3x3 Cycles: 1  CodeSize: 51 RoutineTime: 0.016264000 seconds
3x2 Cycles: 2  CodeSize: 35 RoutineTime: 0.018972900 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.013553900 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.011021800 seconds
2x2 Cycles: 0  CodeSize: 18 RoutineTime: 0.010840100 seconds

Code Alignment 64 byte check: 000h

Press any key to continue...
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

Siekmanski

  • Member
  • *****
  • Posts: 1553
Re: Fast SIMD transpose routines
« Reply #16 on: June 26, 2018, 12:13:08 AM »
It is cool, although the LEGO idea goes back to this post: http://masm32.com/board/index.php?topic=6140.msg65148#msg65148

However, I would go for large LEGO pieces, like 8x8, instead of small ones which will only be used once.  :idea:

Of course we can use AVX for 8x8 matrices.
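To make the 8x8 AVX idea concrete, here is a minimal C intrinsics sketch of an 8x8 single-precision transpose kept entirely in YMM registers (untested here; the function name and the 32-byte-aligned pointer arguments are only illustrative, not code from this thread):
Code: [Select]
#include <immintrin.h>

/* Sketch: transpose a row-major 8x8 float matrix, 32-byte aligned in/out. */
static void Transpose8x8_AVX(const float *in, float *out)
{
    __m256 r0 = _mm256_load_ps(in +  0), r1 = _mm256_load_ps(in +  8);
    __m256 r2 = _mm256_load_ps(in + 16), r3 = _mm256_load_ps(in + 24);
    __m256 r4 = _mm256_load_ps(in + 32), r5 = _mm256_load_ps(in + 40);
    __m256 r6 = _mm256_load_ps(in + 48), r7 = _mm256_load_ps(in + 56);

    /* interleave pairs of rows within each 128-bit lane */
    __m256 t0 = _mm256_unpacklo_ps(r0, r1), t1 = _mm256_unpackhi_ps(r0, r1);
    __m256 t2 = _mm256_unpacklo_ps(r2, r3), t3 = _mm256_unpackhi_ps(r2, r3);
    __m256 t4 = _mm256_unpacklo_ps(r4, r5), t5 = _mm256_unpackhi_ps(r4, r5);
    __m256 t6 = _mm256_unpacklo_ps(r6, r7), t7 = _mm256_unpackhi_ps(r6, r7);

    /* gather 4-element sub-columns within each lane */
    __m256 s0 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(1,0,1,0));
    __m256 s1 = _mm256_shuffle_ps(t0, t2, _MM_SHUFFLE(3,2,3,2));
    __m256 s2 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(1,0,1,0));
    __m256 s3 = _mm256_shuffle_ps(t1, t3, _MM_SHUFFLE(3,2,3,2));
    __m256 s4 = _mm256_shuffle_ps(t4, t6, _MM_SHUFFLE(1,0,1,0));
    __m256 s5 = _mm256_shuffle_ps(t4, t6, _MM_SHUFFLE(3,2,3,2));
    __m256 s6 = _mm256_shuffle_ps(t5, t7, _MM_SHUFFLE(1,0,1,0));
    __m256 s7 = _mm256_shuffle_ps(t5, t7, _MM_SHUFFLE(3,2,3,2));

    /* swap the 128-bit lanes to finish the transpose */
    _mm256_store_ps(out +  0, _mm256_permute2f128_ps(s0, s4, 0x20));
    _mm256_store_ps(out +  8, _mm256_permute2f128_ps(s1, s5, 0x20));
    _mm256_store_ps(out + 16, _mm256_permute2f128_ps(s2, s6, 0x20));
    _mm256_store_ps(out + 24, _mm256_permute2f128_ps(s3, s7, 0x20));
    _mm256_store_ps(out + 32, _mm256_permute2f128_ps(s0, s4, 0x31));
    _mm256_store_ps(out + 40, _mm256_permute2f128_ps(s1, s5, 0x31));
    _mm256_store_ps(out + 48, _mm256_permute2f128_ps(s2, s6, 0x31));
    _mm256_store_ps(out + 56, _mm256_permute2f128_ps(s3, s7, 0x31));
}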

This is what I had in mind: all possible sizes without moving single values.

As an example, the 9x9 matrix:
It can be done with 4 blocks of 4x4 matrices and exchanging the left-over single values,
or with 4/3/2 sized matrices as shown below.

9 = 3 square blocks along the Hypotenuse ( the RED line ): 4(X), 3(Y), 2(Q).
    Now you have to transpose the rest: 4x3(Y), 4x2(Z), 3x2(P).

If you look closely at the picture you see on the X-axis the repetition of 4, 3, 2.
The same goes for the Y-axis and the Hypotenuse: 4, 3, 2.

It looks like building with LEGO blocks......  :lol:

With this in mind you can build every size ( even or uneven ) by transposing along the Hypotenuse.
Start at the bottom right with the 3-row sized matrices ( remember, rows of 3 values will overwrite with the 4th value to increase the speed ) and work your way backwards in memory.
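A scalar C sketch of the bookkeeping, just to illustrate the idea; the inner swap loops stand in for the 4x4/3x3/2x2 SIMD kernels in this thread, and the function names and the {4,3,2} schedule are only an example:
Code: [Select]
#include <stddef.h>

/* Transpose a b x b block sitting on the diagonal of an n x n matrix. */
static void TransposeBlockInPlace(float *m, size_t n, size_t i, size_t b)
{
    for (size_t r = 0; r < b; r++)
        for (size_t c = r + 1; c < b; c++) {
            float t = m[(i + r) * n + (i + c)];
            m[(i + r) * n + (i + c)] = m[(i + c) * n + (i + r)];
            m[(i + c) * n + (i + r)] = t;
        }
}

/* Exchange a rows x cols block above the diagonal with its mirror below it. */
static void SwapTransposeBlocks(float *m, size_t n, size_t row, size_t col,
                                size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++) {
            float t = m[(row + r) * n + (col + c)];
            m[(row + r) * n + (col + c)] = m[(col + c) * n + (row + r)];
            m[(col + c) * n + (row + r)] = t;
        }
}

/* In-place n x n transpose built from square blocks along the diagonal.
   blocks[] is one possible schedule whose sizes sum to n (at most 16 blocks). */
static void TransposeBlocked(float *m, size_t n, const size_t *blocks, size_t nblocks)
{
    size_t offs[16];
    size_t pos = 0;
    for (size_t k = 0; k < nblocks; k++) { offs[k] = pos; pos += blocks[k]; }

    for (size_t k = 0; k < nblocks; k++) {
        TransposeBlockInPlace(m, n, offs[k], blocks[k]);    /* square block on the Hypotenuse */
        for (size_t j = k + 1; j < nblocks; j++)            /* the rectangular left-overs */
            SwapTransposeBlocks(m, n, offs[k], offs[j], blocks[k], blocks[j]);
    }
}
For the 9x9 example: size_t blocks[] = { 4, 3, 2 }; TransposeBlocked(m, 9, blocks, 3);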


Creative coders use backward thinking techniques as their strategy.

nidud

  • Member
  • *****
  • Posts: 1531
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #17 on: June 26, 2018, 12:57:02 AM »
Here's a 64-bit test version of MatrixTranspose from DirectXMath

The code is something like this:
Code: [Select]
    movaps xmm0,[rcx+0x00]
    movaps xmm2,[rcx+0x20]
    movaps xmm4,xmm0
    movhps xmm4,[rcx+0x10]
    shufps xmm0,[rcx+0x10],_MM_SHUFFLE(3,2,3,2)
    movaps xmm1,xmm2
    shufps xmm1,[rcx+0x30],_MM_SHUFFLE(3,2,3,2)
    movhps xmm2,[rcx+0x30]
    movaps xmm3,xmm4
    shufps xmm4,xmm2,_MM_SHUFFLE(3,1,3,1)
    movaps [rdx+0x10],xmm4
    movaps xmm4,xmm0
    shufps xmm3,xmm2,_MM_SHUFFLE(2,0,2,0)
    shufps xmm4,xmm1,_MM_SHUFFLE(2,0,2,0)
    shufps xmm0,xmm1,_MM_SHUFFLE(3,1,3,1)
    movaps [rdx+0x00],xmm3
    movaps [rdx+0x20],xmm4
    movaps [rdx+0x30],xmm0
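(Side note for anyone matching these immediates against the hex constants used further down in the thread: _MM_SHUFFLE simply packs four 2-bit element selectors into one byte, as defined in xmmintrin.h.)
Code: [Select]
/* From xmmintrin.h: each argument selects one 32-bit source element (0..3). */
#define _MM_SHUFFLE(z, y, x, w)  (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

/* The values used in this thread:
   _MM_SHUFFLE(1,0,1,0) = 0x44 (44h)    _MM_SHUFFLE(3,2,3,2) = 0xEE (0EEh)
   _MM_SHUFFLE(2,0,2,0) = 0x88 (88h)    _MM_SHUFFLE(3,1,3,1) = 0xDD (0DDh)  */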

They seem to be more or less equal:

Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz (AVX2 AVX SSE4.2 SSE4.1 SSSE3 SSE3 SSE2)
----------------------------------------------
-- proc(1)
    59557 cycles, rep(10000), code( 67) 1.asm: Marinus
    56599 cycles, rep(10000), code( 69) 2.asm: DirectXMath
-- proc(2)
    57924 cycles, rep(10000), code( 67) 1.asm: Marinus
    57708 cycles, rep(10000), code( 69) 2.asm: DirectXMath
-- proc(3)
    58194 cycles, rep(10000), code( 67) 1.asm: Marinus
    58009 cycles, rep(10000), code( 69) 2.asm: DirectXMath
-- proc(4)
    60007 cycles, rep(10000), code( 67) 1.asm: Marinus
    57674 cycles, rep(10000), code( 69) 2.asm: DirectXMath

total [1 .. 4], 1++
   229990 cycles 2.asm: DirectXMath
   235682 cycles 1.asm: Marinus
hit any key to continue...

AW

  • Member
  • *****
  • Posts: 1360
  • Let's Make ASM Great Again!
Re: Fast SIMD transpose routines
« Reply #18 on: June 26, 2018, 01:44:03 AM »
@Marinus,
 :t

@Nidud
Interesting. When I produced the DxMath library I investigated the way DirectXMath was doing it, and at the time the memory reserved for the transposed matrix was pointed to by RCX according to the Windows ABI and on return was pointed to by RAX as usual, but now they are returning in RDX for some mysterious reason that only @nidud knows about..  :badgrin:
We also concluded that Marinus' algo was better, so I later changed DxMath to @Marinus' algo.


« Last Edit: June 26, 2018, 03:09:57 AM by AW »

Caché GB

  • Regular Member
  • *
  • Posts: 44
  • MASM IS HOT
Re: Fast SIMD transpose routines
« Reply #19 on: June 26, 2018, 02:07:06 AM »
Hi Siekmanski.

Yes, posted the wrong one. My apologies.
Been benching several. Here is the correct version.

Code: [Select]

XMMatrixTranspose PROC

       movaps  xmm5, xmmword ptr[edx+0h] 
       movaps  xmm3, xmmword ptr[edx+20h]
       movaps  xmm4, xmm5 
       shufps  xmm4, xmmword ptr[edx+10h], 44h 
       movaps  xmm1, xmm3 
       shufps  xmm5, xmmword ptr[edx+10h], 0EEh 
       movaps  xmm2, xmm4 
       shufps  xmm1, xmmword ptr[edx+20h], 44h 
       movaps  xmm0, xmm5 
       shufps  xmm3, xmmword ptr[edx+20h], 0EEh 
       shufps  xmm2, xmm1, 88h 
       shufps  xmm4, xmm1, 0DDh 
       shufps  xmm0, xmm3, 88h 
       shufps  xmm5, xmm3, 0DDh 
       movaps  xmmword ptr[eax+0h] , xmm2
       movaps  xmmword ptr[eax+10h], xmm4
       movaps  xmmword ptr[eax+20h], xmm0
       movaps  xmmword ptr[eax+30h], xmm5
          ret

XMMatrixTranspose ENDP


Just doing a brute-force test like this.

Code: [Select]

       Counter  equ  1000000000  ; 1 Billion

       invoke  Sleep, 1000

       invoke  GetTickCount
          mov  TestTime, eax
          mov  ecx, Counter 
     L01:
                mov  edx, offset g_mSpin
                mov  eax, offset g_mWorld
             invoke  Siekmanski_MatrixTranspose

          dec  ecx
          jnz  L01

       invoke  GetTickCount
          sub  eax, TestTime
       invoke  wsprintfA, addr szTestBuffer01, CSTR("%d milli secs", 13, 10), eax

       invoke  Sleep, 1000       

       invoke  GetTickCount
          mov  TestTime, eax
          mov  ecx, Counter
     L02:
                mov  edx, offset g_mSpin
                mov  eax, offset g_mWorld
             invoke  XMMatrixTranspose

          dec  ecx
          jnz  L02

       invoke  GetTickCount
          sub  eax, TestTime
       invoke  wsprintfA, addr szTestBuffer02, CSTR("%d milli secs", 13, 10), eax

       invoke  SendMessageA, hWndList01, LB_ADDSTRING, null, addr szTestBuffer01
       invoke  SendMessageA, hWndList01, LB_ADDSTRING, null, addr szTestBuffer02
       invoke  SendMessageA, hWndList01, LB_ADDSTRING, null, CSTR(" ", 13, 10)

          ret

Caché GB's 1 and 0-nly language:MASM

nidud

  • Member
  • *****
  • Posts: 1531
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #20 on: June 26, 2018, 03:06:51 AM »
Quote
Interesting, when I produced the DxMath library I have investigated the way DirectXMath was doing it and at the time the results were returned in RCX according to Windows ABI and pointed to by RAX as usual, but now they are returning in RDX for some mysterious reason that only nidud knows about..

 :biggrin:

Here we go again.

The DirectXMath functions use vectorcall, but yes, there are compatible methods using memory included as well. If you are interested to see how the vectorcall implementation works, you may use the tool I recommended earlier.

If you paste this code into https://gcc.godbolt.org/, select x86-64 MSVC Pre 2018 and use the command line switches -O2 -Gv, you will see that no code is generated at all.

#include <emmintrin.h>

typedef struct {
    __m128 r[4];
} FXMMATRIX;

inline FXMMATRIX XMMatrixTranspose(FXMMATRIX M)
{
    __m128 vTemp1 = _mm_shuffle_ps(M.r[0],M.r[1],_MM_SHUFFLE(1,0,1,0));
    __m128 vTemp3 = _mm_shuffle_ps(M.r[0],M.r[1],_MM_SHUFFLE(3,2,3,2));
    __m128 vTemp2 = _mm_shuffle_ps(M.r[2],M.r[3],_MM_SHUFFLE(1,0,1,0));
    __m128 vTemp4 = _mm_shuffle_ps(M.r[2],M.r[3],_MM_SHUFFLE(3,2,3,2));
    FXMMATRIX mResult;

    mResult.r[0] = _mm_shuffle_ps(vTemp1, vTemp2,_MM_SHUFFLE(2,0,2,0));
    mResult.r[1] = _mm_shuffle_ps(vTemp1, vTemp2,_MM_SHUFFLE(3,1,3,1));
    mResult.r[2] = _mm_shuffle_ps(vTemp3, vTemp4,_MM_SHUFFLE(2,0,2,0));
    mResult.r[3] = _mm_shuffle_ps(vTemp3, vTemp4,_MM_SHUFFLE(3,1,3,1));
    return mResult;
}


The reason for this is that the function is inline, so it's basically (in assembler terms) a macro that will either expand inline when you use it or, depending on the size/situation, be emitted as a real function. If you add the following code (which doesn't make much sense to the compiler) the function will be produced.

void foo()
{
    FXMMATRIX M;

    XMMatrixTranspose(M);
}


The generated code for XMMatrixTranspose() will then be as follows:
Code: [Select]
        sub      rsp, 40              ; 00000028H
        movaps   XMMWORD PTR [rsp+16], xmm6
        movaps   xmm5, xmm2
        movaps   XMMWORD PTR [rsp], xmm7
        movaps   xmm6, xmm0
        shufps   xmm6, xmm1, 238            ; 000000eeH
        movaps   xmm7, xmm0
        shufps   xmm7, xmm1, 68       ; 00000044H
        movaps   xmm4, xmm2
        shufps   xmm4, xmm3, 68       ; 00000044H
        movaps   xmm0, xmm7
        shufps   xmm5, xmm3, 238            ; 000000eeH
        movaps   xmm2, xmm6
        shufps   xmm7, xmm4, 221            ; 000000ddH
        shufps   xmm6, xmm5, 221            ; 000000ddH
        movaps   xmm1, xmm7
        movaps   xmm7, XMMWORD PTR [rsp]
        movaps   xmm3, xmm6
        movaps   xmm6, XMMWORD PTR [rsp+16]
        shufps   xmm0, xmm4, 136            ; 00000088H
        shufps   xmm2, xmm5, 136            ; 00000088H
        add      rsp, 40              ; 00000028H
        ret      0

So this is the basics of using vectorcall. Arguments are passed in vector registers and the result is returned in vector registers. In this case the matrix is passed in xmm0..3 and returned in xmm0..3.
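A minimal sketch of getting the standalone version without the foo() trick, assuming MSVC with -O2: drop the inline and mark the function __vectorcall explicitly (the name XMMatrixTransposeV is just for illustration). The whole matrix then stays in xmm0..3 at the call boundary, since a four-element __m128 aggregate is a homogeneous vector aggregate under vectorcall.
Code: [Select]
#include <emmintrin.h>

typedef struct {
    __m128 r[4];
} FXMMATRIX;

/* Non-inline, explicit vectorcall: the four rows arrive in xmm0..xmm3 and the
   transposed rows are returned in xmm0..xmm3, no memory traffic at the call. */
FXMMATRIX __vectorcall XMMatrixTransposeV(FXMMATRIX M)
{
    __m128 vTemp1 = _mm_shuffle_ps(M.r[0], M.r[1], _MM_SHUFFLE(1,0,1,0));
    __m128 vTemp3 = _mm_shuffle_ps(M.r[0], M.r[1], _MM_SHUFFLE(3,2,3,2));
    __m128 vTemp2 = _mm_shuffle_ps(M.r[2], M.r[3], _MM_SHUFFLE(1,0,1,0));
    __m128 vTemp4 = _mm_shuffle_ps(M.r[2], M.r[3], _MM_SHUFFLE(3,2,3,2));
    FXMMATRIX mResult;

    mResult.r[0] = _mm_shuffle_ps(vTemp1, vTemp2, _MM_SHUFFLE(2,0,2,0));
    mResult.r[1] = _mm_shuffle_ps(vTemp1, vTemp2, _MM_SHUFFLE(3,1,3,1));
    mResult.r[2] = _mm_shuffle_ps(vTemp3, vTemp4, _MM_SHUFFLE(2,0,2,0));
    mResult.r[3] = _mm_shuffle_ps(vTemp3, vTemp4, _MM_SHUFFLE(3,1,3,1));
    return mResult;
}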

HSE

  • Member
  • ****
  • Posts: 717
  • <AMD>< 7-32>
Re: Fast SIMD transpose routines
« Reply #21 on: June 26, 2018, 03:19:18 AM »
Only Rui is beating me with higher times  :biggrin: :biggrin: :biggrin:

Code: [Select]
10000000 calls per Matrix for the Cycle counter and the Routine timer.

AMD A6-3500 APU with Radeon(tm) HD Graphics

4x4 Cycles: 8  CodeSize: 66 RoutineTime: 0.063076945 seconds
4x3 Cycles: 4  CodeSize: 56 RoutineTime: 0.064050233 seconds
4x2 Cycles: 7  CodeSize: 42 RoutineTime: 0.067956549 seconds
3x4 Cycles: 7  CodeSize: 62 RoutineTime: 0.069592511 seconds
3x3 Cycles: 3  CodeSize: 51 RoutineTime: 0.052619953 seconds
3x2 Cycles: 1  CodeSize: 35 RoutineTime: 0.045371007 seconds
2x4 Cycles: 0  CodeSize: 31 RoutineTime: 0.037927502 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.036778671 seconds
2x2 Cycles: 1  CodeSize: 18 RoutineTime: 0.037995768 seconds

Code Alignment 64 byte check: 000h

Press any key to continue...

Siekmanski

  • Member
  • *****
  • Posts: 1553
Re: Fast SIMD transpose routines
« Reply #22 on: June 26, 2018, 03:28:46 AM »
Hi Caché GB,
The last one is wrong too.

00 04 08 08
01 05 09 09
02 06 10 10
03 07 11 11

Thanks Nidud,

Did some runs and Marinus is a tiny bit faster on my system.
Yours has 6 memory reads and 4 memory writes.
Mine has 4 memory reads and 4 memory writes.
Don't know if the extra memory reads make any difference when run on some older computers?

Interesting way of loading data with shufps ( never thought of this ).
I will study your routine, thanks for it.

Code: [Select]
4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032367591 seconds ; Marinus
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032533045 seconds ; Nidud

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032362563 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032495331 seconds

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032362074 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032362283 seconds

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032358303 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032364030 seconds

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032371503 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032403490 seconds

Nidud's version in 32-bit:

Code: [Select]
XMMatrixTranspose PROC
    mov         eax,offset MatrixIn
    mov         ecx,offset MatrixOut
M4_4n:
    movaps      xmm0,[eax+0]
    movaps      xmm2,[eax+020h]
    movaps      xmm4,xmm0
    movhps      xmm4,qword ptr[eax+010h]
    shufps      xmm0,[eax+010h],Shuffle(3,2,3,2)
    movaps      xmm1,xmm2
    shufps      xmm1,[eax+030h],Shuffle(3,2,3,2)
    movhps      xmm2,qword ptr[eax+030h]
    movaps      xmm3,xmm4
    shufps      xmm4,xmm2,Shuffle(3,1,3,1)
    movaps      [ecx+010h],xmm4
    movaps      xmm4,xmm0
    shufps      xmm3,xmm2,Shuffle(2,0,2,0)
    shufps      xmm4,xmm1,Shuffle(2,0,2,0)
    shufps      xmm0,xmm1,Shuffle(3,1,3,1)
    movaps      [ecx+0],xmm3
    movaps      [ecx+020h],xmm4
    movaps      [ecx+030h],xmm0
    CodeSize4_4n = $-M4_4n
    ret
XMMatrixTranspose ENDP

Creative coders use backward thinking techniques as their strategy.

nidud

  • Member
  • *****
  • Posts: 1531
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #23 on: June 26, 2018, 03:57:44 AM »
Quote
Did some runs and Marinus is a tiny bit faster on my system.
Yours has 6 memory reads and 4 memory writes.
Mine has 4 memory reads and 4 memory writes.
Don't know if the extra memory reads make any difference when run on some older computers?

It's possible to use regs on some of these but it appears to be slower (don't ask me why). The original code uses more memory accesses and is even faster than this one, but that may be a local thing on my system though.
Code: [Select]
    movaps xmm2,[rcx+0x00]
    movaps xmm3,[rcx+0x20]
    movaps xmm0,[rcx+0x00]
    shufps xmm0,[rcx+0x10],_MM_SHUFFLE(3,2,3,2)
    movaps xmm1,[rcx+0x20]
    shufps xmm1,[rcx+0x30],_MM_SHUFFLE(3,2,3,2)
    movhps xmm2,[rcx+0x10]
    movhps xmm3,[rcx+0x30]
    movaps xmm4,xmm2
    shufps xmm2,xmm3,_MM_SHUFFLE(3,1,3,1)
    movaps [rdx+0x10],xmm2
    movaps xmm2,xmm0
    shufps xmm4,xmm3,_MM_SHUFFLE(2,0,2,0)
    shufps xmm2,xmm1,_MM_SHUFFLE(2,0,2,0)
    shufps xmm0,xmm1,_MM_SHUFFLE(3,1,3,1)
    movaps [rdx+0x00],xmm4
    movaps [rdx+0x20],xmm2
    movaps [rdx+0x30],xmm0

Caché GB

  • Regular Member
  • *
  • Posts: 44
  • MASM IS HOT
Re: Fast SIMD transpose routines
« Reply #24 on: June 26, 2018, 04:26:50 AM »
Ok, I think I should stay out of the dojo.

Maybe third time lucky.

Code: [Select]

XMMatrixTranspose_003 PROC

       movaps  xmm5, xmmword ptr[edx+0h] 
       movaps  xmm3, xmmword ptr[edx+20h]
       movaps  xmm4, xmm5 
       shufps  xmm4, xmmword ptr[edx+10h], 44h 
       movaps  xmm1, xmm3 
       shufps  xmm5, xmmword ptr[edx+10h], 0EEh 
       movaps  xmm2, xmm4 
       shufps  xmm1, xmmword ptr[edx+30h], 44h  ; <--- Not 20h
       movaps  xmm0, xmm5 
       shufps  xmm3, xmmword ptr[edx+30h], 0EEh  ; <--- Not 20h 
       shufps  xmm2, xmm1, 88h 
       shufps  xmm4, xmm1, 0DDh 
       shufps  xmm0, xmm3, 88h 
       shufps  xmm5, xmm3, 0DDh 
       movaps  xmmword ptr[eax+0h] , xmm2
       movaps  xmmword ptr[eax+10h], xmm4
       movaps  xmmword ptr[eax+20h], xmm0
       movaps  xmmword ptr[eax+30h], xmm5
          ret

XMMatrixTranspose_003 ENDP

Caché GB's 1 and 0-nly language:MASM

Siekmanski

  • Member
  • *****
  • Posts: 1553
Re: Fast SIMD transpose routines
« Reply #25 on: June 26, 2018, 02:31:41 PM »
Hi Caché GB,

Welcome to the dojo.  :t

4x4 Cycles: 4  CodeSize: 66 RoutineTime: 0.032354671 seconds ; Siekmanski
4x4 Cycles: 4  CodeSize: 68 RoutineTime: 0.032370874 seconds ; Nidud
4x4 Cycles: 4  CodeSize: 70 RoutineTime: 0.032421439 seconds ; Caché GB

Your matrix results are correct.

The speeds of the routines are all very close to each other.
Results may vary on different architectures.
Creative coders use backward thinking techniques as their strategy.

AW

  • Member
  • *****
  • Posts: 1360
  • Let's Make ASM Great Again!
Re: Fast SIMD transpose routines
« Reply #26 on: June 26, 2018, 04:00:17 PM »
@nidud

As usual, you mix oranges with pears.
You showed the Windows ABI being used and a return structure pointed to by RDX being filled, which is wrong and never done in DirectXMath, and then you justify that with vectorcall and inlining of functions. What a bloody confusion!   :shock:

Caché GB

  • Regular Member
  • *
  • Posts: 44
  • MASM IS HOT
Re: Fast SIMD transpose routines
« Reply #27 on: June 27, 2018, 07:07:04 AM »
Thank you, sensei Siekmanski.
Caché GB's 1 and 0-nly language:MASM

nidud

  • Member
  • *****
  • Posts: 1531
    • https://github.com/nidud/asmc
Re: Fast SIMD transpose routines
« Reply #28 on: July 14, 2018, 06:53:38 AM »
Here's an AVX version. The timing is more or less equal on all three of them, but the AVX version is simpler since you get away with using only one extra register (xmm4).

Code: [Select]
    vshufps xmm4,xmm2,xmm3,_MM_SHUFFLE(3,2,3,2)
    vshufps xmm2,xmm2,xmm3,_MM_SHUFFLE(1,0,1,0)
    vshufps xmm3,xmm0,xmm1,_MM_SHUFFLE(3,2,3,2)
    vshufps xmm1,xmm0,xmm1,_MM_SHUFFLE(1,0,1,0)
    vshufps xmm0,xmm1,xmm2,_MM_SHUFFLE(2,0,2,0)
    vshufps xmm1,xmm1,xmm2,_MM_SHUFFLE(3,1,3,1)
    vshufps xmm2,xmm3,xmm4,_MM_SHUFFLE(2,0,2,0)
    vshufps xmm3,xmm3,xmm4,_MM_SHUFFLE(3,1,3,1)

Quote
Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz (AVX2)
----------------------------------------------
-- test(1)
    59933 cycles, rep(10000), code( 49) 1.asm: Marinus
    62218 cycles, rep(10000), code( 46) 2.asm: DirectXMath SSE
    60596 cycles, rep(10000), code( 41) 3.asm: DirectXMath AVX
-- test(2)
    60355 cycles, rep(10000), code( 49) 1.asm: Marinus
    60618 cycles, rep(10000), code( 46) 2.asm: DirectXMath SSE
    60869 cycles, rep(10000), code( 41) 3.asm: DirectXMath AVX
-- test(3)
    61963 cycles, rep(10000), code( 49) 1.asm: Marinus
    60878 cycles, rep(10000), code( 46) 2.asm: DirectXMath SSE
    61006 cycles, rep(10000), code( 41) 3.asm: DirectXMath AVX
-- test(4)
    60767 cycles, rep(10000), code( 49) 1.asm: Marinus
    62612 cycles, rep(10000), code( 46) 2.asm: DirectXMath SSE
    60552 cycles, rep(10000), code( 41) 3.asm: DirectXMath AVX

total [1 .. 4], 1++
   243018 cycles 1.asm: Marinus
   243023 cycles 3.asm: DirectXMath AVX
   246326 cycles 2.asm: DirectXMath SSE
hit any key to continue...
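As a side note (not part of the timings above): the compilers ship essentially this register-only pattern as the _MM_TRANSPOSE4_PS macro in xmmintrin.h, which rewrites its four row arguments in place. A small wrapper sketch (hypothetical name, 16-byte aligned pointers assumed):
Code: [Select]
#include <xmmintrin.h>

/* Sketch: 4x4 float transpose via the stock _MM_TRANSPOSE4_PS macro, which
   expands to a shuffle/unpack sequence much like the ones in this thread. */
static void Transpose4x4_Macro(const float *in, float *out)
{
    __m128 row0 = _mm_load_ps(in +  0);
    __m128 row1 = _mm_load_ps(in +  4);
    __m128 row2 = _mm_load_ps(in +  8);
    __m128 row3 = _mm_load_ps(in + 12);

    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);   /* rows become columns, in registers */

    _mm_store_ps(out +  0, row0);
    _mm_store_ps(out +  4, row1);
    _mm_store_ps(out +  8, row2);
    _mm_store_ps(out + 12, row3);
}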

Siekmanski

  • Member
  • *****
  • Posts: 1553
Re: Fast SIMD transpose routines
« Reply #29 on: July 14, 2018, 08:29:58 AM »
Hi nidud,

You inserted 4 extra instructions in my code.
The last 4 instructions:

    movaps      xmm0,xmm4       ; [0 4 8 C]
    movaps      xmm1,xmm5       ; [1 5 9 D]
    movaps      xmm3,xmm2       ; [3 7 B F]
    movaps      xmm2,xmm6       ; [2 6 A E]

Doesn't this slow down my code in comparison with yours?
Or is there a reason to insert them?

Removed xmm6 (to make it simpler  :biggrin: )
Code: [Select]
XMMatrixTransposeM PROC
    mov         eax,offset MatrixIn
    mov         ecx,offset MatrixOut

    movaps      xmm0,[eax+0]            ; [0 1 2 3]
    movaps      xmm1,[eax+16]           ; [4 5 6 7]
    movaps      xmm2,[eax+32]           ; [8 9 A B]
    movaps      xmm3,[eax+48]           ; [C D E F]

    movaps      xmm4,xmm0               ; [0 1 2 3]
    movaps      xmm5,xmm2               ; [8 9 A B]
    unpcklps    xmm4,xmm1               ; [0 4 1 5]
    unpcklps    xmm5,xmm3               ; [8 C 9 D]
    unpckhps    xmm0,xmm1               ; [2 6 3 7]
    unpckhps    xmm2,xmm3               ; [A E B F]
    movaps      xmm1,xmm4               ; [0 4 1 5]
    movaps      xmm3,xmm0               ; [2 6 3 7]
    movlhps     xmm4,xmm5               ; [0 4 8 C]
    movlhps     xmm3,xmm2               ; [2 6 A E]
    movhlps     xmm5,xmm1               ; [1 5 9 D]
    movhlps     xmm2,xmm0               ; [3 7 B F]

    movaps      [ecx+0],xmm4            ; [0 4 8 C]
    movaps      [ecx+16],xmm5           ; [1 5 9 D]
    movaps      [ecx+32],xmm3           ; [2 6 A E]
    movaps      [ecx+48],xmm2           ; [3 7 B F]
    ret
XMMatrixTransposeM ENDP
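For reference, the same unpack/movlhps/movhlps pattern written with C intrinsics (a sketch only; the function name and the pointer parameters are mine, not the MASM globals above):
Code: [Select]
#include <xmmintrin.h>

/* Sketch: unpack-based 4x4 float transpose, 16-byte aligned pointers assumed. */
static void TransposeM_Unpack(const float *in, float *out)
{
    __m128 x0 = _mm_load_ps(in +  0);              /* [0 1 2 3] */
    __m128 x1 = _mm_load_ps(in +  4);              /* [4 5 6 7] */
    __m128 x2 = _mm_load_ps(in +  8);              /* [8 9 A B] */
    __m128 x3 = _mm_load_ps(in + 12);              /* [C D E F] */

    __m128 t01lo = _mm_unpacklo_ps(x0, x1);        /* [0 4 1 5] */
    __m128 t23lo = _mm_unpacklo_ps(x2, x3);        /* [8 C 9 D] */
    __m128 t01hi = _mm_unpackhi_ps(x0, x1);        /* [2 6 3 7] */
    __m128 t23hi = _mm_unpackhi_ps(x2, x3);        /* [A E B F] */

    _mm_store_ps(out +  0, _mm_movelh_ps(t01lo, t23lo));   /* [0 4 8 C] */
    _mm_store_ps(out +  4, _mm_movehl_ps(t23lo, t01lo));   /* [1 5 9 D] */
    _mm_store_ps(out +  8, _mm_movelh_ps(t01hi, t23hi));   /* [2 6 A E] */
    _mm_store_ps(out + 12, _mm_movehl_ps(t23hi, t01hi));   /* [3 7 B F] */
}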
Creative coders use backward thinking techniques as their strategy.