The MASM Forum

General => The Laboratory => Topic started by: Siekmanski on June 25, 2018, 11:35:37 AM

Title: Fast SIMD transpose routines
Post by: Siekmanski on June 25, 2018, 11:35:37 AM
Wrote some Matrix Transpose routines of different sizes.
The idea is to use them as building blocks to create an algorithm for transposing very large matrices of any size.
Included are the sources and a timing mechanism that measures the clock cycles and the total time each matrix routine takes over 10000000 calls.

You can use the "SaveResults.bat" to save the timing results as "MatrixTimerResults.txt"
It would be nice to post the results here and give some feedback for improvements of the routines.  :t
Thanks.

Here are my results:

10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz

4x4 Cycles: 4  CodeSize: 66 RoutineTime: 0.032362423 seconds
4x3 Cycles: 3  CodeSize: 56 RoutineTime: 0.029423712 seconds
4x2 Cycles: 4  CodeSize: 42 RoutineTime: 0.032384074 seconds
3x4 Cycles: 4  CodeSize: 62 RoutineTime: 0.032416620 seconds
3x3 Cycles: 2  CodeSize: 51 RoutineTime: 0.026471800 seconds
3x2 Cycles: 3  CodeSize: 35 RoutineTime: 0.029419801 seconds
2x4 Cycles: 0  CodeSize: 31 RoutineTime: 0.020588231 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.020594936 seconds
2x2 Cycles: -1  CodeSize: 18 RoutineTime: 0.017652313 seconds

Code Alignment 64 byte check: 000h


EDIT: Forgot to mention that the routines are meant to be used backwards in memory, because when a row is 3 values, 4 are written!
Title: Re: Fast SIMD transpose routines
Post by: zedd151 on June 25, 2018, 11:47:08 AM

10000000 calls per Matrix for the Cycle counter and the Routine timer.
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G
4x4 Cycles: 3  CodeSize: 66 RoutineTime: 0.055281872 seconds
4x3 Cycles: 3  CodeSize: 56 RoutineTime: 0.048029962 seconds
4x2 Cycles: 4  CodeSize: 42 RoutineTime: 0.047268223 seconds
3x4 Cycles: 5  CodeSize: 62 RoutineTime: 0.056652746 seconds
3x3 Cycles: 4  CodeSize: 51 RoutineTime: 0.047815803 seconds
3x2 Cycles: 2  CodeSize: 35 RoutineTime: 0.034986782 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.038576830 seconds
2x3 Cycles: -1  CodeSize: 27 RoutineTime: 0.028790277 seconds
2x2 Cycles: 0  CodeSize: 18 RoutineTime: 0.024736004 seconds
Code Alignment 64 byte check: 000h
Press any key to continue...


1.60 GHz CPU speed...

The cycle counts seem off: "0 cycles", "-1 cycle".
Title: Re: Fast SIMD transpose routines
Post by: Yuri on June 25, 2018, 01:44:21 PM

10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz

4x4 Cycles: 9  CodeSize: 66 RoutineTime: 0.049261743 seconds
4x3 Cycles: 7  CodeSize: 56 RoutineTime: 0.043520454 seconds
4x2 Cycles: 8  CodeSize: 42 RoutineTime: 0.042658139 seconds
3x4 Cycles: 10  CodeSize: 62 RoutineTime: 0.049160314 seconds
3x3 Cycles: 5  CodeSize: 51 RoutineTime: 0.036015704 seconds
3x2 Cycles: 5  CodeSize: 35 RoutineTime: 0.036012691 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.024245575 seconds
2x3 Cycles: 1  CodeSize: 27 RoutineTime: 0.023353133 seconds
2x2 Cycles: 0  CodeSize: 18 RoutineTime: 0.020137194 seconds

Code Alignment 64 byte check: 000h
Title: Re: Fast SIMD transpose routines
Post by: Caché GB on June 25, 2018, 02:58:41 PM
Hi Siekmanski

Been bench pressing your 4x4 against this one:


XMMatrixTranspose PROC

       movaps  xmm0, xmmword ptr[edx+0h]
       movaps  xmm1, xmmword ptr[edx+10h]
       movaps  xmm2, xmmword ptr[edx+20h]
       movaps  xmm3, xmmword ptr[edx+30h]

       movaps  xmm4, xmm0 
       movaps  xmm5, xmm2 
       shufps  xmm0, xmm1, 44h 
       shufps  xmm4, xmm1, 0EEh 
       shufps  xmm2, xmm3, 44h 
       shufps  xmm5, xmm3, 0EEh 
       movaps  xmm1, xmm0 
       movaps  xmm3, xmm2 
       shufps  xmm0, xmm4, 88h 
       shufps  xmm1, xmm4, 0DDh 
       shufps  xmm2, xmm5, 88h 
       shufps  xmm3, xmm5, 0DDh 

       movaps  xmmword ptr[eax+0h],  xmm0
       movaps  xmmword ptr[eax+10h], xmm1
       movaps  xmmword ptr[eax+20h], xmm2
       movaps  xmmword ptr[eax+30h], xmm3

          ret

XMMatrixTranspose ENDP

;===========================================>

          mov  edx, offset g_mSpin
          mov  eax, offset g_mWorld
       invoke  XMMatrixTranspose


This one is from BOTH DirectXMath and Xnamath (it has not changed). Your 4x4 is 0.815 % faster over
100 million iterations. Nice.
Title: Re: Fast SIMD transpose routines
Post by: aw27 on June 25, 2018, 05:12:58 PM
Here we go for YAMTT  :t (Yet Another Matrix Transpose Test)

10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

4x4 Cycles: 2  CodeSize: 66 RoutineTime: 0.018620367 seconds
4x3 Cycles: 2  CodeSize: 56 RoutineTime: 0.016572648 seconds
4x2 Cycles: 2  CodeSize: 42 RoutineTime: 0.018659709 seconds
3x4 Cycles: 3  CodeSize: 62 RoutineTime: 0.018680211 seconds
3x3 Cycles: 1  CodeSize: 51 RoutineTime: 0.014052825 seconds
3x2 Cycles: 2  CodeSize: 35 RoutineTime: 0.016425531 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.011702284 seconds
2x3 Cycles: 2  CodeSize: 27 RoutineTime: 0.009617162 seconds
2x2 Cycles: 0  CodeSize: 18 RoutineTime: 0.009352850 seconds

Code Alignment 64 byte check: 000h
Title: Re: Fast SIMD transpose routines
Post by: zedd151 on June 25, 2018, 05:16:35 PM
Quote from: AW on June 25, 2018, 05:12:58 PM
Here we go for YAMTT  :t (Yet Another Matrix Transpose Test)

Where's Rui??   :P
Title: Re: Fast SIMD transpose routines
Post by: aw27 on June 25, 2018, 06:01:25 PM
Quote from: zedd151 on June 25, 2018, 05:16:35 PM
Where's Rui??   :P

shhh  :icon_mrgreen:
Title: Re: Fast SIMD transpose routines
Post by: jj2007 on June 25, 2018, 06:17:36 PM
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz

4x4 Cycles: 7  CodeSize: 66 RoutineTime: 0.051956401 seconds
4x3 Cycles: 5  CodeSize: 56 RoutineTime: 0.045790781 seconds
4x2 Cycles: 5  CodeSize: 42 RoutineTime: 0.045233293 seconds
3x4 Cycles: 6  CodeSize: 62 RoutineTime: 0.054255731 seconds
3x3 Cycles: 4  CodeSize: 51 RoutineTime: 0.044690584 seconds
3x2 Cycles: 3  CodeSize: 35 RoutineTime: 0.036121362 seconds
2x4 Cycles: 0  CodeSize: 31 RoutineTime: 0.023498652 seconds
2x3 Cycles: -1  CodeSize: 27 RoutineTime: 0.024507303 seconds
2x2 Cycles: -1  CodeSize: 18 RoutineTime: 0.019760855 seconds


Another run, same machine:
4x4 Cycles: 7  CodeSize: 66 RoutineTime: 0.055625230 seconds
4x3 Cycles: 5  CodeSize: 56 RoutineTime: 0.045135589 seconds
4x2 Cycles: 5  CodeSize: 42 RoutineTime: 0.045048148 seconds
3x4 Cycles: 7  CodeSize: 62 RoutineTime: 0.051966664 seconds
3x3 Cycles: 3  CodeSize: 51 RoutineTime: 0.038664541 seconds
3x2 Cycles: 3  CodeSize: 35 RoutineTime: 0.037659996 seconds
2x4 Cycles: -1  CodeSize: 31 RoutineTime: 0.024224864 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.024652217 seconds
2x2 Cycles: -1  CodeSize: 18 RoutineTime: 0.019747718 seconds
Title: Re: Fast SIMD transpose routines
Post by: RuiLoureiro on June 25, 2018, 06:48:48 PM
Quote from: zedd151 on June 25, 2018, 05:16:35 PM
Quote from: AW on June 25, 2018, 05:12:58 PM
Here we go for YAMTT  :t (Yet Another Matrix Transpose Test)

Where's Rui??   :P
Will be here soon as possible, zedd151  :biggrin:
Title: Re: Fast SIMD transpose routines
Post by: RuiLoureiro on June 25, 2018, 06:51:57 PM

10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Atom(TM) CPU N455 @ 1.66GHz

4x4 Cycles: 58  CodeSize: 66 RoutineTime: 0.377001463 seconds
4x3 Cycles: 51  CodeSize: 56 RoutineTime: 0.339285241 seconds
4x2 Cycles: 19  CodeSize: 42 RoutineTime: 0.165002387 seconds
3x4 Cycles: 53  CodeSize: 62 RoutineTime: 0.339326507 seconds
3x3 Cycles: 38  CodeSize: 51 RoutineTime: 0.286387559 seconds
3x2 Cycles: 22  CodeSize: 35 RoutineTime: 0.171114020 seconds
2x4 Cycles: 22  CodeSize: 31 RoutineTime: 0.139145431 seconds
2x3 Cycles: 19  CodeSize: 27 RoutineTime: 0.144533990 seconds
2x2 Cycles: 11  CodeSize: 18 RoutineTime: 0.078623451 seconds

Code Alignment 64 byte check: 000h

Press any key to continue...
Title: Re: Fast SIMD transpose routines
Post by: Siekmanski on June 25, 2018, 06:59:08 PM
Thanks guys,

@Caché GB

The matrices I use here are unaligned ( to process uneven matrix row numbers )

The aligned 4x4 version is even a bit faster:
10 million iterations (i7-4930K)
4x4 aligned DirectXMath Cycles: 4  CodeSize: 74 RoutineTime: 0.042951199 seconds
4x4 aligned Siekmanski  Cycles: 4  CodeSize: 66 RoutineTime: 0.032361864 seconds

My aligned version takes 75.3456591 % of the time of the aligned DirectXMath version.
That's 1.327216473 times faster and 8 bytes less code.  :biggrin:
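
The two ratios above follow directly from the two RoutineTime values. A minimal C sketch of that arithmetic, not taken from the posted sources; the function names are made up for illustration:

```c
/* Time of the new routine as a percentage of the reference routine's time. */
double percent_of_reference(double t_new, double t_ref)
{
    return 100.0 * t_new / t_ref;
}

/* How many times faster the new routine is than the reference routine. */
double speedup(double t_new, double t_ref)
{
    return t_ref / t_new;
}
```

Called with t_new = 0.032361864 and t_ref = 0.042951199 (the i7-4930K timings above), these give roughly 75.35 % and roughly 1.327. The code size difference is simply 74 - 66 = 8 bytes.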
Title: Re: Fast SIMD transpose routines
Post by: Siekmanski on June 25, 2018, 07:16:31 PM
@Caché GB

Are you sure you posted the correct version of the 4x4 DirectXMath matrix transpose?
The results are wrong:

00 04 02 06
01 05 03 07
08 12 10 14
09 13 11 15
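
For anyone who wants to see where that output comes from: a minimal C sketch, not from the thread, that models SHUFPS lane by lane and replays the instruction order and immediates of the first posted routine. The names xmm_t, shufps, load4, store4 and first_posted_transpose are made up for this illustration, and integers stand in for the float lanes.

```c
/* One XMM register as four 32-bit lanes; lane 0 is the lowest address. */
typedef struct { int lane[4]; } xmm_t;

/* Scalar model of SHUFPS dst, src, imm8.
   Result lanes 0 and 1 are picked from dst, lanes 2 and 3 from src.
   Each 2-bit field of imm8 selects a source lane, lowest field first. */
static xmm_t shufps(xmm_t dst, xmm_t src, unsigned imm8)
{
    xmm_t r;
    r.lane[0] = dst.lane[imm8 & 3];
    r.lane[1] = dst.lane[(imm8 >> 2) & 3];
    r.lane[2] = src.lane[(imm8 >> 4) & 3];
    r.lane[3] = src.lane[(imm8 >> 6) & 3];
    return r;
}

static xmm_t load4(const int *p)
{
    xmm_t r = {{p[0], p[1], p[2], p[3]}};
    return r;
}

static void store4(int *p, xmm_t x)
{
    for (int i = 0; i < 4; i++) p[i] = x.lane[i];
}

/* Same instruction order as the first posted XMMatrixTranspose.
   Comments show register contents for the input 0..15 (hex digits). */
void first_posted_transpose(const int in[16], int out[16])
{
    xmm_t x0 = load4(in + 0), x1 = load4(in + 4);
    xmm_t x2 = load4(in + 8), x3 = load4(in + 12);
    xmm_t x4 = x0, x5 = x2;

    x0 = shufps(x0, x1, 0x44);  /* [0 1 4 5] */
    x4 = shufps(x4, x1, 0xEE);  /* [2 3 6 7] */
    x2 = shufps(x2, x3, 0x44);  /* [8 9 C D] */
    x5 = shufps(x5, x3, 0xEE);  /* [A B E F] */
    x1 = x0;
    x3 = x2;
    x0 = shufps(x0, x4, 0x88);  /* [0 4 2 6] */
    x1 = shufps(x1, x4, 0xDD);  /* [1 5 3 7] */
    x2 = shufps(x2, x5, 0x88);  /* [8 C A E] */
    x3 = shufps(x3, x5, 0xDD);  /* [9 D B F] */

    store4(out + 0, x0);
    store4(out + 4, x1);
    store4(out + 8, x2);
    store4(out + 12, x3);
}
```

In this model the second shuffle pass pairs [0 1 4 5] with [2 3 6 7], both built from rows 0 and 1. Pairing [0 1 4 5] with [8 9 C D] under 88h would give [0 4 8 C] instead, which is the first row of a correct transpose.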
Title: Re: Fast SIMD transpose routines
Post by: Siekmanski on June 25, 2018, 07:23:45 PM
Quote from: AW on June 25, 2018, 05:12:58 PM
Here we go for YAMTT  :t (Yet Another Matrix Transpose Test)

Yeah, now we can play and build with it just as we played as kids with LEGO blocks.  :t
Title: Re: Fast SIMD transpose routines
Post by: aw27 on June 25, 2018, 09:49:29 PM
It is cool, although the LEGO idea comes back from here: http://masm32.com/board/index.php?topic=6140.msg65148#msg65148

However, I would go for large LEGO pieces, like 8x8, instead of small ones which will only be used once.  :idea:
Title: Re: Fast SIMD transpose routines
Post by: LiaoMi on June 25, 2018, 10:52:10 PM
10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz

4x4 Cycles: 2  CodeSize: 66 RoutineTime: 0.024838457 seconds
4x3 Cycles: 1  CodeSize: 56 RoutineTime: 0.023653736 seconds
4x2 Cycles: 2  CodeSize: 42 RoutineTime: 0.023732180 seconds
3x4 Cycles: 2  CodeSize: 62 RoutineTime: 0.023453961 seconds
3x3 Cycles: 2  CodeSize: 51 RoutineTime: 0.034194162 seconds
3x2 Cycles: 1  CodeSize: 35 RoutineTime: 0.024073813 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.015524173 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.016023061 seconds
2x2 Cycles: 1  CodeSize: 18 RoutineTime: 0.020969785 seconds

Code Alignment 64 byte check: 000h
Title: Re: Fast SIMD transpose routines
Post by: mineiro on June 25, 2018, 11:22:09 PM
10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

4x4 Cycles: 3  CodeSize: 66 RoutineTime: 0.021707700 seconds
4x3 Cycles: 2  CodeSize: 56 RoutineTime: 0.018974900 seconds
4x2 Cycles: 3  CodeSize: 42 RoutineTime: 0.021691700 seconds
3x4 Cycles: 3  CodeSize: 62 RoutineTime: 0.021689300 seconds
3x3 Cycles: 1  CodeSize: 51 RoutineTime: 0.016264000 seconds
3x2 Cycles: 2  CodeSize: 35 RoutineTime: 0.018972900 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.013553900 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.011021800 seconds
2x2 Cycles: 0  CodeSize: 18 RoutineTime: 0.010840100 seconds

Code Alignment 64 byte check: 000h

Press any key to continue...
Title: Re: Fast SIMD transpose routines
Post by: Siekmanski on June 26, 2018, 12:13:08 AM
Quote from: AW on June 25, 2018, 09:49:29 PM
It is cool, although the LEGO idea comes back from here: http://masm32.com/board/index.php?topic=6140.msg65148#msg65148

However, I would go for large LEGO pieces, like 8x8, instead of small ones which will only be used once.  :idea:

Of course we can use AVX for 8x8 matrices.

This is what I had in mind: all possible sizes without moving single values.

As an example, take the 9x9 matrix:
It can be done with 4 blocks of 4x4 matrices, exchanging the leftover single values.
Or use matrices of size 4, 3 and 2 as shown below.

9 = 3 square blocks along the Hypotenuse ( the main diagonal, the RED line ): 4(X) 3(Y) 2(Q).
    Now you have to transpose the rest: 4x3(Y) 4x2(Z) 3x2(P)

If you look closely at the picture you see on the X-axis the repetition of 4 3 2.
The same for the Y-axis and the Hypotenuse,  4 3 2

It looks like building with LEGO blocks......  :lol:

With this in mind you can build every size ( even or odd ) by transposing along the Hypotenuse.
Start at the bottom right with the 3 row sized matrices ( remember: for rows of 3 values a 4th value is also written, to increase the speed ) and work your way backwards in memory.

(http://members.home.nl/siekmanski/9x9_Matrix.png)
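
The 4 3 2 tiling can be sketched in plain C as follows. This is only an illustration of the block layout under some assumptions, not the posted SIMD code: it works out of place, uses a scalar loop as a stand-in for the per-size routines, and does not model the "write 4 values for a row of 3" trick or the backwards processing order. The names transpose_tile and transpose_tiled are made up.

```c
#include <stddef.h>

/* Transpose one rows x cols tile of an n x n row-major matrix.
   Scalar stand-in for the per-size routines (4x4, 4x3, 4x2, ...). */
static void transpose_tile(const float *src, float *dst, size_t n,
                           size_t r0, size_t c0, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[(c0 + c) * n + (r0 + r)] = src[(r0 + r) * n + (c0 + c)];
}

/* sizes[] lists the tile edge lengths along the diagonal,
   for example {4, 3, 2} for n = 9.
   Returns 0 on success, -1 if the sizes do not add up to n. */
int transpose_tiled(const float *src, float *dst, size_t n,
                    const size_t *sizes, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++) total += sizes[i];
    if (total != n) return -1;

    size_t r0 = 0;
    for (size_t i = 0; i < count; i++) {
        size_t c0 = 0;
        for (size_t j = 0; j < count; j++) {
            /* i == j: square tile on the diagonal; otherwise a
               rectangular tile that lands in the mirrored position. */
            transpose_tile(src, dst, n, r0, c0, sizes[i], sizes[j]);
            c0 += sizes[j];
        }
        r0 += sizes[i];
    }
    return 0;
}
```

With sizes {4, 3, 2} the nine tiles visited are 4x4, 4x3, 4x2, 3x4, 3x3, 3x2, 2x4, 2x3 and 2x2, so each of the nine routine sizes from the first post is used exactly once for a 9x9 matrix.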
Title: Re: Fast SIMD transpose routines
Post by: nidud on June 26, 2018, 12:57:02 AM
deleted
Title: Re: Fast SIMD transpose routines
Post by: aw27 on June 26, 2018, 01:44:03 AM
@Marinus,
:t

@Nidud
Interesting. When I produced the DxMath library I investigated the way DirectXMath was doing it. At the time, the memory reserved for the transposed matrix was pointed to by RCX according to the Windows ABI and, on return, pointed to by RAX as usual. Now they are returning it in RDX for some mysterious reason that only @nidud knows about..  :badgrin:
We also concluded that Marinus' algo was better, so I later changed DxMath to use @Marinus' algo.


Title: Re: Fast SIMD transpose routines
Post by: Caché GB on June 26, 2018, 02:07:06 AM
Hi Siekmanski.

Yes, posted the wrong one. My apologies.
Been benching several. Here is the correct version.



XMMatrixTranspose PROC

       movaps  xmm5, xmmword ptr[edx+0h] 
       movaps  xmm3, xmmword ptr[edx+20h]
       movaps  xmm4, xmm5 
       shufps  xmm4, xmmword ptr[edx+10h], 44h 
       movaps  xmm1, xmm3 
       shufps  xmm5, xmmword ptr[edx+10h], 0EEh 
       movaps  xmm2, xmm4 
       shufps  xmm1, xmmword ptr[edx+20h], 44h 
       movaps  xmm0, xmm5 
       shufps  xmm3, xmmword ptr[edx+20h], 0EEh 
       shufps  xmm2, xmm1, 88h 
       shufps  xmm4, xmm1, 0DDh 
       shufps  xmm0, xmm3, 88h 
       shufps  xmm5, xmm3, 0DDh 
       movaps  xmmword ptr[eax+0h] , xmm2
       movaps  xmmword ptr[eax+10h], xmm4
       movaps  xmmword ptr[eax+20h], xmm0
       movaps  xmmword ptr[eax+30h], xmm5
          ret

XMMatrixTranspose ENDP



Just doing a brute force like this.



       Counter  equ  1000000000  ; 1 Billion

       invoke  Sleep, 1000

       invoke  GetTickCount
          mov  TestTime, eax
          mov  ecx, Counter 
     L01:
                mov  edx, offset g_mSpin
                mov  eax, offset g_mWorld
             invoke  Siekmanski_MatrixTranspose

          dec  ecx
          jnz  L01

       invoke  GetTickCount
          sub  eax, TestTime
       invoke  wsprintfA, addr szTestBuffer01, CSTR("%d milli secs", 13, 10), eax

       invoke  Sleep, 1000       

       invoke  GetTickCount
          mov  TestTime, eax
          mov  ecx, Counter
     L02:
                mov  edx, offset g_mSpin
                mov  eax, offset g_mWorld
             invoke  XMMatrixTranspose

          dec  ecx
          jnz  L02

       invoke  GetTickCount
          sub  eax, TestTime
       invoke  wsprintfA, addr szTestBuffer02, CSTR("%d milli secs", 13, 10), eax

       invoke  SendMessageA, hWndList01, LB_ADDSTRING, null, addr szTestBuffer01
       invoke  SendMessageA, hWndList01, LB_ADDSTRING, null, addr szTestBuffer02
       invoke  SendMessageA, hWndList01, LB_ADDSTRING, null, CSTR(" ", 13, 10)

          ret


Title: Re: Fast SIMD transpose routines
Post by: nidud on June 26, 2018, 03:06:51 AM
deleted
Title: Re: Fast SIMD transpose routines
Post by: HSE on June 26, 2018, 03:19:18 AM
Only Rui is beating me with higher times  :biggrin: :biggrin: :biggrin:

10000000 calls per Matrix for the Cycle counter and the Routine timer.

AMD A6-3500 APU with Radeon(tm) HD Graphics

4x4 Cycles: 8  CodeSize: 66 RoutineTime: 0.063076945 seconds
4x3 Cycles: 4  CodeSize: 56 RoutineTime: 0.064050233 seconds
4x2 Cycles: 7  CodeSize: 42 RoutineTime: 0.067956549 seconds
3x4 Cycles: 7  CodeSize: 62 RoutineTime: 0.069592511 seconds
3x3 Cycles: 3  CodeSize: 51 RoutineTime: 0.052619953 seconds
3x2 Cycles: 1  CodeSize: 35 RoutineTime: 0.045371007 seconds
2x4 Cycles: 0  CodeSize: 31 RoutineTime: 0.037927502 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.036778671 seconds
2x2 Cycles: 1  CodeSize: 18 RoutineTime: 0.037995768 seconds

Code Alignment 64 byte check: 000h

Press any key to continue...
Title: Re: Fast SIMD transpose routines
Post by: Siekmanski on June 26, 2018, 03:28:46 AM
Hi Caché GB,
The last one is wrong too.

00 04 08 08
01 05 09 09
02 06 10 10
03 07 11 11

Thanks Nidud,

Did some runs and Marinus is a tiny bit faster on my system.
Yours has 6 memory reads and 4 memory writes.
Mine has 4 memory reads and 4 memory writes.
Don't know if the extra memory reads make any difference when run on some older computers?

Interesting way of loading data with shufps ( never thought of this ).
I will study your routine, thanks for it.

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032367591 seconds ; Marinus
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032533045 seconds ; Nidud

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032362563 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032495331 seconds

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032362074 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032362283 seconds

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032358303 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032364030 seconds

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032371503 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032403490 seconds


Nidud's version in 32-bit


XMMatrixTranspose PROC
    mov         eax,offset MatrixIn
    mov         ecx,offset MatrixOut
M4_4n:
    movaps      xmm0,[eax+0]
    movaps      xmm2,[eax+020h]
    movaps      xmm4,xmm0
    movhps      xmm4,qword ptr[eax+010h]
    shufps      xmm0,[eax+010h],Shuffle(3,2,3,2)
    movaps      xmm1,xmm2
    shufps      xmm1,[eax+030h],Shuffle(3,2,3,2)
    movhps      xmm2,qword ptr[eax+030h]
    movaps      xmm3,xmm4
    shufps      xmm4,xmm2,Shuffle(3,1,3,1)
    movaps      [ecx+010h],xmm4
    movaps      xmm4,xmm0
    shufps      xmm3,xmm2,Shuffle(2,0,2,0)
    shufps      xmm4,xmm1,Shuffle(2,0,2,0)
    shufps      xmm0,xmm1,Shuffle(3,1,3,1)
    movaps      [ecx+0],xmm3
    movaps      [ecx+020h],xmm4
    movaps      [ecx+030h],xmm0
    CodeSize4_4n = $-M4_4n
    ret
XMMatrixTranspose ENDP


Title: Re: Fast SIMD transpose routines
Post by: nidud on June 26, 2018, 03:57:44 AM
deleted
Title: Re: Fast SIMD transpose routines
Post by: Caché GB on June 26, 2018, 04:26:50 AM
Ok I think I should stay out of the dojo.

Maybe third time lucky   



XMMatrixTranspose_003 PROC

       movaps  xmm5, xmmword ptr[edx+0h] 
       movaps  xmm3, xmmword ptr[edx+20h]
       movaps  xmm4, xmm5 
       shufps  xmm4, xmmword ptr[edx+10h], 44h 
       movaps  xmm1, xmm3 
       shufps  xmm5, xmmword ptr[edx+10h], 0EEh 
       movaps  xmm2, xmm4 
       shufps  xmm1, xmmword ptr[edx+30h], 44h  ; <--- Not 20h
       movaps  xmm0, xmm5 
       shufps  xmm3, xmmword ptr[edx+30h], 0EEh  ; <--- Not 20h 
       shufps  xmm2, xmm1, 88h 
       shufps  xmm4, xmm1, 0DDh 
       shufps  xmm0, xmm3, 88h 
       shufps  xmm5, xmm3, 0DDh 
       movaps  xmmword ptr[eax+0h] , xmm2
       movaps  xmmword ptr[eax+10h], xmm4
       movaps  xmmword ptr[eax+20h], xmm0
       movaps  xmmword ptr[eax+30h], xmm5
          ret

XMMatrixTranspose_003 ENDP
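
To show what the 20h to 30h fix changes, here is a small C sketch that models SHUFPS in scalar code and replays the instruction order shared by the second and third posted routines. It is an illustration only; xmm_t, shufps, load4, store4, posted_transpose and the parameter second_pair_off are hypothetical names, and integers stand in for the float lanes.

```c
/* One XMM register as four 32-bit lanes; lane 0 is the lowest address. */
typedef struct { int lane[4]; } xmm_t;

/* Scalar model of SHUFPS dst, src, imm8: result lanes 0 and 1 come
   from dst, lanes 2 and 3 from src, two selector bits per lane. */
static xmm_t shufps(xmm_t dst, xmm_t src, unsigned imm8)
{
    xmm_t r;
    r.lane[0] = dst.lane[imm8 & 3];
    r.lane[1] = dst.lane[(imm8 >> 2) & 3];
    r.lane[2] = src.lane[(imm8 >> 4) & 3];
    r.lane[3] = src.lane[(imm8 >> 6) & 3];
    return r;
}

static xmm_t load4(const int *p)
{
    xmm_t r = {{p[0], p[1], p[2], p[3]}};
    return r;
}

static void store4(int *p, xmm_t x)
{
    for (int i = 0; i < 4; i++) p[i] = x.lane[i];
}

/* second_pair_off is the element offset of the memory operand used by
   the two shuffles that build the row 2/3 pairs:
   8 models [edx+20h] (second post), 12 models [edx+30h] (third post). */
void posted_transpose(const int in[16], int out[16], int second_pair_off)
{
    xmm_t x5 = load4(in + 0);                            /* row 0 */
    xmm_t x3 = load4(in + 8);                            /* row 2 */
    xmm_t x4 = x5;
    x4 = shufps(x4, load4(in + 4), 0x44);                /* [0 1 4 5] */
    xmm_t x1 = x3;
    x5 = shufps(x5, load4(in + 4), 0xEE);                /* [2 3 6 7] */
    xmm_t x2 = x4;
    x1 = shufps(x1, load4(in + second_pair_off), 0x44);  /* low halves  */
    xmm_t x0 = x5;
    x3 = shufps(x3, load4(in + second_pair_off), 0xEE);  /* high halves */
    x2 = shufps(x2, x1, 0x88);
    x4 = shufps(x4, x1, 0xDD);
    x0 = shufps(x0, x3, 0x88);
    x5 = shufps(x5, x3, 0xDD);

    store4(out + 0, x2);
    store4(out + 4, x4);
    store4(out + 8, x0);
    store4(out + 12, x5);
}
```

With second_pair_off = 8, row 2 is shuffled with itself, so the last column of the output repeats the third one (0 4 8 8, 1 5 9 9, ...). With second_pair_off = 12, rows 2 and 3 are combined and the model produces the full transpose.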

Title: Re: Fast SIMD transpose routines
Post by: Siekmanski on June 26, 2018, 02:31:41 PM
Hi Caché GB,

Welcome to the dojo.  :t

4x4 Cycles: 4  CodeSize: 66 RoutineTime: 0.032354671 seconds ; Siekmanski
4x4 Cycles: 4  CodeSize: 68 RoutineTime: 0.032370874 seconds ; Nidud
4x4 Cycles: 4  CodeSize: 70 RoutineTime: 0.032421439 seconds ; Caché GB

Your matrix results are correct.

The speeds of the routines are all very close to each other.
Results may vary on different architectures.
Title: Re: Fast SIMD transpose routines
Post by: aw27 on June 26, 2018, 04:00:17 PM
@nidud

As usual, you mix oranges with pears.
You showed the Windows ABI being used and filling a return structure pointed to by RDX, which is wrong and never done in DirectXMath, and then you justify that with Vectorcall and inlining of functions. What a bloody confusion!   :shock:
Title: Re: Fast SIMD transpose routines
Post by: Caché GB on June 27, 2018, 07:07:04 AM
Thank you sensei  Siekmanski.
Title: Re: Fast SIMD transpose routines
Post by: nidud on July 14, 2018, 06:53:38 AM
deleted
Title: Re: Fast SIMD transpose routines
Post by: Siekmanski on July 14, 2018, 08:29:58 AM
Hi nidud,

You inserted 4 extra instructions in my code.
The last 4 instructions,

    movaps      xmm0,xmm4       ; [0 4 8 C]
    movaps      xmm1,xmm5       ; [1 5 9 D]
    movaps      xmm3,xmm2       ; [3 7 B F]
    movaps      xmm2,xmm6       ; [2 6 A E]

Doesn't this slow down my code in comparison with yours?
Or is there a reason to insert them?

removed xmm6 ( to make it simpler  :biggrin: )
XMMatrixTransposeM PROC
    mov         eax,offset MatrixIn
    mov         ecx,offset MatrixOut

    movaps      xmm0,[eax+0]            ; [0 1 2 3]
    movaps      xmm1,[eax+16]           ; [4 5 6 7]
    movaps      xmm2,[eax+32]           ; [8 9 A B]
    movaps      xmm3,[eax+48]           ; [C D E F]

    movaps      xmm4,xmm0               ; [0 1 2 3]
    movaps      xmm5,xmm2               ; [8 9 A B]
    unpcklps    xmm4,xmm1               ; [0 4 1 5]
    unpcklps    xmm5,xmm3               ; [8 C 9 D]
    unpckhps    xmm0,xmm1               ; [2 6 3 7]
    unpckhps    xmm2,xmm3               ; [A E B F]
    movaps      xmm1,xmm4               ; [0 4 1 5]
    movaps      xmm3,xmm0               ; [2 6 3 7]
    movlhps     xmm4,xmm5               ; [0 4 8 C]
    movlhps     xmm3,xmm2               ; [2 6 A E]
    movhlps     xmm5,xmm1               ; [1 5 9 D]
    movhlps     xmm2,xmm0               ; [3 7 B F]

    movaps      [ecx+0],xmm4            ; [0 4 8 C]
    movaps      [ecx+16],xmm5           ; [1 5 9 D]
    movaps      [ecx+32],xmm3           ; [2 6 A E]
    movaps      [ecx+48],xmm2           ; [3 7 B F]
    ret
XMMatrixTransposeM ENDP
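
The register comments in the listing can be checked with a scalar model. A minimal C sketch, assuming nothing beyond the documented lane behaviour of UNPCKLPS, UNPCKHPS, MOVLHPS and MOVHLPS; the type and function names are made up for illustration and integers stand in for the float lanes.

```c
/* One XMM register as four 32-bit lanes; lane 0 is the lowest address. */
typedef struct { int lane[4]; } xmm_t;

/* UNPCKLPS dst, src: interleave the low lanes  -> [d0 s0 d1 s1] */
static xmm_t unpcklps(xmm_t d, xmm_t s)
{
    xmm_t r = {{d.lane[0], s.lane[0], d.lane[1], s.lane[1]}};
    return r;
}

/* UNPCKHPS dst, src: interleave the high lanes -> [d2 s2 d3 s3] */
static xmm_t unpckhps(xmm_t d, xmm_t s)
{
    xmm_t r = {{d.lane[2], s.lane[2], d.lane[3], s.lane[3]}};
    return r;
}

/* MOVLHPS dst, src: low half of src replaces high half of dst */
static xmm_t movlhps(xmm_t d, xmm_t s)
{
    xmm_t r = {{d.lane[0], d.lane[1], s.lane[0], s.lane[1]}};
    return r;
}

/* MOVHLPS dst, src: high half of src replaces low half of dst */
static xmm_t movhlps(xmm_t d, xmm_t s)
{
    xmm_t r = {{s.lane[2], s.lane[3], d.lane[2], d.lane[3]}};
    return r;
}

static xmm_t load4(const int *p)
{
    xmm_t r = {{p[0], p[1], p[2], p[3]}};
    return r;
}

static void store4(int *p, xmm_t x)
{
    for (int i = 0; i < 4; i++) p[i] = x.lane[i];
}

/* Same instruction order as XMMatrixTransposeM above. */
void unpack_transpose(const int in[16], int out[16])
{
    xmm_t x0 = load4(in + 0), x1 = load4(in + 4);
    xmm_t x2 = load4(in + 8), x3 = load4(in + 12);
    xmm_t x4 = x0, x5 = x2;

    x4 = unpcklps(x4, x1);  /* [0 4 1 5] */
    x5 = unpcklps(x5, x3);  /* [8 C 9 D] */
    x0 = unpckhps(x0, x1);  /* [2 6 3 7] */
    x2 = unpckhps(x2, x3);  /* [A E B F] */
    x1 = x4;
    x3 = x0;
    x4 = movlhps(x4, x5);   /* [0 4 8 C] */
    x3 = movlhps(x3, x2);   /* [2 6 A E] */
    x5 = movhlps(x5, x1);   /* [1 5 9 D] */
    x2 = movhlps(x2, x0);   /* [3 7 B F] */

    store4(out + 0, x4);
    store4(out + 4, x5);
    store4(out + 8, x3);
    store4(out + 12, x2);
}
```

Feeding the values 0..15 through unpack_transpose reproduces the bracketed comments in the assembly and yields the transposed matrix. The model says nothing about speed; it only checks the data movement.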
Title: Re: Fast SIMD transpose routines
Post by: nidud on July 14, 2018, 08:45:29 AM
deleted
Title: Re: Fast SIMD transpose routines
Post by: Siekmanski on July 14, 2018, 08:54:30 AM
No need for rearranging the regs if you include the memory reads and writes in the speed test.
Title: Re: Fast SIMD transpose routines
Post by: nidud on July 14, 2018, 09:35:26 AM
deleted
Title: Re: Fast SIMD transpose routines
Post by: nidud on July 14, 2018, 10:37:03 AM
deleted
Title: Re: Fast SIMD transpose routines
Post by: Siekmanski on July 14, 2018, 10:45:38 AM
 :biggrin:
Quote from: nidud on July 14, 2018, 09:35:26 AM
Well, I'm writing vector call tests at the moment where values are kept in registers over multiple calls, so the thinking and implementation is a bit different.

OK, that makes sense.

Quote
As for AVX in this case there don't seem to be (as you pointed out) any speed improvement except from saving regs. There is also VMOVHLPS that can be used in the same way.

You must have confused me with someone else; I have never done an AVX version yet.
But I will try it out some day.

I remember that we can replace VSHUFPS or VPERM2F128 with VBLENDPS instructions.
AVX shuffles are executed only on port 5, while blends are also executed on port 0.
VPERM2F128 instructions are not that fast.

Maybe we can get some gain out of it.

I will look this up.

EDIT: Found it, it's in chapter 12 section 11.1
http://members.home.nl/siekmanski/Intel_Optimization_Reference_Manual_248966-037.pdf
Title: Re: Fast SIMD transpose routines
Post by: Siekmanski on July 14, 2018, 12:01:38 PM
Quote from: nidud on July 14, 2018, 10:37:03 AM
Simpler and faster..

    vunpckhps xmm4,xmm2,xmm3
    vunpcklps xmm2,xmm2,xmm3
    vunpckhps xmm3,xmm0,xmm1
    vunpcklps xmm1,xmm0,xmm1
    vmovlhps  xmm0,xmm1,xmm2
    vmovhlps  xmm1,xmm2,xmm1
    vmovlhps  xmm2,xmm3,xmm4
    vmovhlps  xmm3,xmm4,xmm3


Timed this AVX piece on my computer, it's a little slower than the SSE version.
Title: Re: Fast SIMD transpose routines
Post by: nidud on July 14, 2018, 01:38:28 PM
deleted
Title: Re: Fast SIMD transpose routines
Post by: daydreamer on July 14, 2018, 05:00:13 PM
Great work.
So does this work well for feeding d3d9 with loads of different matrices?