Fast SIMD transpose routines

Siekmanski · June 25, 2018, 11:35:37 AM

I wrote some matrix transpose routines for different sizes.
The idea is to use them as building blocks for an algorithm that transposes very large matrices of any size.
I included the sources and a timing mechanism that measures the clock cycles and the time each matrix routine takes over 10000000 calls.

You can use "SaveResults.bat" to save the timing results as "MatrixTimerResults.txt".
It would be nice if you posted the results here and gave some feedback on how to improve the routines. :t
Thanks.
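
A minimal sketch of the building-block idea, assuming a row-major float matrix and an out-of-place transpose. The names `transpose_tile` and `transpose_blocked` are hypothetical, and the scalar tile routine only stands in for the SIMD routines in the attached sources, which are not reproduced here:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a fixed-size SIMD routine: transposes one
   h x w tile (h, w <= 4) from src into dst. Strides are in floats. */
static void transpose_tile(const float *src, float *dst,
                           size_t src_stride, size_t dst_stride,
                           size_t h, size_t w)
{
    for (size_t r = 0; r < h; ++r)
        for (size_t c = 0; c < w; ++c)
            dst[c * dst_stride + r] = src[r * src_stride + c];
}

/* Transposes a rows x cols row-major matrix into a cols x rows matrix
   by walking it in 4x4 tiles; edge tiles are simply smaller. */
void transpose_blocked(const float *src, float *dst,
                       size_t rows, size_t cols)
{
    static const size_t TILE = 4;
    assert(src != dst);                 /* out-of-place only */
    for (size_t r = 0; r < rows; r += TILE) {
        size_t h = (rows - r < TILE) ? rows - r : TILE;
        for (size_t c = 0; c < cols; c += TILE) {
            size_t w = (cols - c < TILE) ? cols - c : TILE;
            /* tile at (r, c) in src lands at (c, r) in dst */
            transpose_tile(src + r * cols + c, dst + c * rows + r,
                           cols, rows, h, w);
        }
    }
}
```

In a real version the full tiles would call the 4x4 routine and the edge tiles would dispatch to the smaller sizes (4x3, 3x4, 2x2 and so on) according to `h` and `w`.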

Here are my results:

 10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz

4x4 Cycles: 4  CodeSize: 66 RoutineTime: 0.032362423 seconds
4x3 Cycles: 3  CodeSize: 56 RoutineTime: 0.029423712 seconds
4x2 Cycles: 4  CodeSize: 42 RoutineTime: 0.032384074 seconds
3x4 Cycles: 4  CodeSize: 62 RoutineTime: 0.032416620 seconds
3x3 Cycles: 2  CodeSize: 51 RoutineTime: 0.026471800 seconds
3x2 Cycles: 3  CodeSize: 35 RoutineTime: 0.029419801 seconds
2x4 Cycles: 0  CodeSize: 31 RoutineTime: 0.020588231 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.020594936 seconds
2x2 Cycles: -1  CodeSize: 18 RoutineTime: 0.017652313 seconds

Code Alignment 64 byte check: 000h

EDIT: I forgot to mention that the routines are meant to be used backwards in memory, because when a row has 3 values, 4 are written!

zedd151 · June 25, 2018, 11:47:08 AM

 10000000 calls per Matrix for the Cycle counter and the Routine timer.
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G
4x4 Cycles: 3  CodeSize: 66 RoutineTime: 0.055281872 seconds
4x3 Cycles: 3  CodeSize: 56 RoutineTime: 0.048029962 seconds
4x2 Cycles: 4  CodeSize: 42 RoutineTime: 0.047268223 seconds
3x4 Cycles: 5  CodeSize: 62 RoutineTime: 0.056652746 seconds
3x3 Cycles: 4  CodeSize: 51 RoutineTime: 0.047815803 seconds
3x2 Cycles: 2  CodeSize: 35 RoutineTime: 0.034986782 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.038576830 seconds
2x3 Cycles: -1  CodeSize: 27 RoutineTime: 0.028790277 seconds
2x2 Cycles: 0  CodeSize: 18 RoutineTime: 0.024736004 seconds
Code Alignment 64 byte check: 000h
Press any key to continue...

1.60 GHz CPU speed...

The cycle counts seem off: "0 cycles", "-1 cycle".
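
For comparison, the RoutineTime column can be converted into an average cycle count per call. This is a small sketch that assumes the core ran at its nominal clock for the whole run (turbo and power-saving states make this only approximate); `cycles_per_call` is a hypothetical helper, not part of the posted sources:

```c
#include <assert.h>

/* Average cycles per call implied by a wall-clock total.
   Assumption: the core ran at clock_hz for the entire measurement. */
double cycles_per_call(double seconds, double clock_hz, double calls)
{
    assert(calls > 0.0);
    return seconds * clock_hz / calls;
}
```

With the 4x4 figures from the first post (0.032362423 seconds, 3.40 GHz, 10000000 calls) this gives about 11 cycles per call, and about 8.8 for the A6-9220e at 1.60 GHz. Those figures include call and loop overhead, so they are not directly comparable with the Cycles column, whose exact computation is not described in the thread.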

Yuri · June 25, 2018, 01:44:21 PM

10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz

4x4 Cycles: 9  CodeSize: 66 RoutineTime: 0.049261743 seconds
4x3 Cycles: 7  CodeSize: 56 RoutineTime: 0.043520454 seconds
4x2 Cycles: 8  CodeSize: 42 RoutineTime: 0.042658139 seconds
3x4 Cycles: 10  CodeSize: 62 RoutineTime: 0.049160314 seconds
3x3 Cycles: 5  CodeSize: 51 RoutineTime: 0.036015704 seconds
3x2 Cycles: 5  CodeSize: 35 RoutineTime: 0.036012691 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.024245575 seconds
2x3 Cycles: 1  CodeSize: 27 RoutineTime: 0.023353133 seconds
2x2 Cycles: 0  CodeSize: 18 RoutineTime: 0.020137194 seconds

Code Alignment 64 byte check: 000h

Caché GB · June 25, 2018, 02:58:41 PM

Hi Siekmanski

Been bench pressing your 4x4 against this one:


XMMatrixTranspose PROC

       ; edx -> source matrix, eax -> destination matrix
       ; (movaps requires both to be 16-byte aligned)

       ; load the four rows
       movaps  xmm0, xmmword ptr[edx+0h]
       movaps  xmm1, xmmword ptr[edx+10h]
       movaps  xmm2, xmmword ptr[edx+20h]
       movaps  xmm3, xmmword ptr[edx+30h]

       ; first shuffle stage: combine row pairs
       movaps  xmm4, xmm0
       movaps  xmm5, xmm2
       shufps  xmm0, xmm1, 44h
       shufps  xmm4, xmm1, 0EEh
       shufps  xmm2, xmm3, 44h
       shufps  xmm5, xmm3, 0EEh

       ; second shuffle stage
       movaps  xmm1, xmm0
       movaps  xmm3, xmm2
       shufps  xmm0, xmm4, 88h
       shufps  xmm1, xmm4, 0DDh
       shufps  xmm2, xmm5, 88h
       shufps  xmm3, xmm5, 0DDh

       ; store the four result rows
       movaps  xmmword ptr[eax+0h],  xmm0
       movaps  xmmword ptr[eax+10h], xmm1
       movaps  xmmword ptr[eax+20h], xmm2
       movaps  xmmword ptr[eax+30h], xmm3

       ret

XMMatrixTranspose ENDP

;===========================================>

          mov  edx, offset g_mSpin
          mov  eax, offset g_mWorld
       invoke  XMMatrixTranspose

This is from both DirectXMath and XNAMath (it has not changed between them). Your 4x4 is 0.815 % faster over
100 million iterations. Nice.

aw27 · June 25, 2018, 05:12:58 PM

Here we go for YAMTT :t (Yet Another Matrix Transpose Test)

10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

4x4 Cycles: 2 CodeSize: 66 RoutineTime: 0.018620367 seconds
4x3 Cycles: 2 CodeSize: 56 RoutineTime: 0.016572648 seconds
4x2 Cycles: 2 CodeSize: 42 RoutineTime: 0.018659709 seconds
3x4 Cycles: 3 CodeSize: 62 RoutineTime: 0.018680211 seconds
3x3 Cycles: 1 CodeSize: 51 RoutineTime: 0.014052825 seconds
3x2 Cycles: 2 CodeSize: 35 RoutineTime: 0.016425531 seconds
2x4 Cycles: 1 CodeSize: 31 RoutineTime: 0.011702284 seconds
2x3 Cycles: 2 CodeSize: 27 RoutineTime: 0.009617162 seconds
2x2 Cycles: 0 CodeSize: 18 RoutineTime: 0.009352850 seconds

Code Alignment 64 byte check: 000h

zedd151 · June 25, 2018, 05:16:35 PM

Quote from: AW on June 25, 2018, 05:12:58 PM
Here we go for YAMTT :t (Yet Another Matrix Transpose Test)

Where's Rui?? :P

aw27 · June 25, 2018, 06:01:25 PM

Quote from: zedd151 on June 25, 2018, 05:16:35 PM
Where's Rui?? :P

shhh :icon_mrgreen:

jj2007 · June 25, 2018, 06:17:36 PM

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz

4x4 Cycles: 7  CodeSize: 66 RoutineTime: 0.051956401 seconds
4x3 Cycles: 5  CodeSize: 56 RoutineTime: 0.045790781 seconds
4x2 Cycles: 5  CodeSize: 42 RoutineTime: 0.045233293 seconds
3x4 Cycles: 6  CodeSize: 62 RoutineTime: 0.054255731 seconds
3x3 Cycles: 4  CodeSize: 51 RoutineTime: 0.044690584 seconds
3x2 Cycles: 3  CodeSize: 35 RoutineTime: 0.036121362 seconds
2x4 Cycles: 0  CodeSize: 31 RoutineTime: 0.023498652 seconds
2x3 Cycles: -1  CodeSize: 27 RoutineTime: 0.024507303 seconds
2x2 Cycles: -1  CodeSize: 18 RoutineTime: 0.019760855 seconds

Another run, same machine:

4x4 Cycles: 7  CodeSize: 66 RoutineTime: 0.055625230 seconds
4x3 Cycles: 5  CodeSize: 56 RoutineTime: 0.045135589 seconds
4x2 Cycles: 5  CodeSize: 42 RoutineTime: 0.045048148 seconds
3x4 Cycles: 7  CodeSize: 62 RoutineTime: 0.051966664 seconds
3x3 Cycles: 3  CodeSize: 51 RoutineTime: 0.038664541 seconds
3x2 Cycles: 3  CodeSize: 35 RoutineTime: 0.037659996 seconds
2x4 Cycles: -1  CodeSize: 31 RoutineTime: 0.024224864 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.024652217 seconds
2x2 Cycles: -1  CodeSize: 18 RoutineTime: 0.019747718 seconds

RuiLoureiro · June 25, 2018, 06:48:48 PM

Quote from: zedd151 on June 25, 2018, 05:16:35 PM
Quote from: AW on June 25, 2018, 05:12:58 PM
Here we go for YAMTT :t (Yet Another Matrix Transpose Test)

Where's Rui?? :P

I will be here as soon as possible, zedd151

RuiLoureiro · June 25, 2018, 06:51:57 PM

10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Atom(TM) CPU N455 @ 1.66GHz

4x4 Cycles: 58  CodeSize: 66 RoutineTime: 0.377001463 seconds
4x3 Cycles: 51  CodeSize: 56 RoutineTime: 0.339285241 seconds
4x2 Cycles: 19  CodeSize: 42 RoutineTime: 0.165002387 seconds
3x4 Cycles: 53  CodeSize: 62 RoutineTime: 0.339326507 seconds
3x3 Cycles: 38  CodeSize: 51 RoutineTime: 0.286387559 seconds
3x2 Cycles: 22  CodeSize: 35 RoutineTime: 0.171114020 seconds
2x4 Cycles: 22  CodeSize: 31 RoutineTime: 0.139145431 seconds
2x3 Cycles: 19  CodeSize: 27 RoutineTime: 0.144533990 seconds
2x2 Cycles: 11  CodeSize: 18 RoutineTime: 0.078623451 seconds

Code Alignment 64 byte check: 000h

Press any key to continue...

Siekmanski · June 25, 2018, 06:59:08 PM

Thanks guys,

@Caché GB

The matrices I use here are unaligned (to process uneven matrix row numbers).

The aligned 4x4 version is even a bit faster:
10 million iterations (i7-4930K)
4x4 aligned DirectXMath Cycles: 4 CodeSize: 74 RoutineTime: 0.042951199 seconds
4x4 aligned Siekmanski Cycles: 4 CodeSize: 66 RoutineTime: 0.032361864 seconds

My aligned version takes 75.3456591 % of the time of the aligned DirectXMath version.
That's 1.327216473 times faster and 8 bytes less code.

Siekmanski · June 25, 2018, 07:16:31 PM

@Caché GB

Are you sure you posted a correct version of the 4x4 DirectXMath matrix transpose?
The results are wrong:

00 04 02 06
01 05 03 07
08 12 10 14
09 13 11 15
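
The output above can be reproduced with a scalar model of SHUFPS. This is only a sketch: `vec4`, `shufps`, `transpose_as_posted` and `transpose_repaired` are hypothetical names, and the repaired pairing is an assumption about what was intended, not a copy of the DirectXMath source:

```c
#include <assert.h>

typedef struct { float f[4]; } vec4;

/* Scalar model of SHUFPS dst, src, imm8: the two low result lanes are
   selected from dst, the two high lanes from src, 2 selector bits each. */
static vec4 shufps(vec4 dst, vec4 src, unsigned imm)
{
    vec4 r;
    assert(imm <= 0xFFu);
    r.f[0] = dst.f[imm & 3];
    r.f[1] = dst.f[(imm >> 2) & 3];
    r.f[2] = src.f[(imm >> 4) & 3];
    r.f[3] = src.f[(imm >> 6) & 3];
    return r;
}

/* Register pairing as in the posted listing: the second stage combines
   t1 with t3 and t2 with t4 (xmm0/xmm1 with xmm4, xmm2/xmm3 with xmm5). */
void transpose_as_posted(const vec4 in[4], vec4 out[4])
{
    vec4 t1 = shufps(in[0], in[1], 0x44);   /* 00 01 04 05 */
    vec4 t3 = shufps(in[0], in[1], 0xEE);   /* 02 03 06 07 */
    vec4 t2 = shufps(in[2], in[3], 0x44);   /* 08 09 12 13 */
    vec4 t4 = shufps(in[2], in[3], 0xEE);   /* 10 11 14 15 */
    out[0] = shufps(t1, t3, 0x88);          /* 00 04 02 06 */
    out[1] = shufps(t1, t3, 0xDD);          /* 01 05 03 07 */
    out[2] = shufps(t2, t4, 0x88);          /* 08 12 10 14 */
    out[3] = shufps(t2, t4, 0xDD);          /* 09 13 11 15 */
}

/* Same shuffles, but the second stage combines t1 with t2 and t3 with t4. */
void transpose_repaired(const vec4 in[4], vec4 out[4])
{
    vec4 t1 = shufps(in[0], in[1], 0x44);
    vec4 t3 = shufps(in[0], in[1], 0xEE);
    vec4 t2 = shufps(in[2], in[3], 0x44);
    vec4 t4 = shufps(in[2], in[3], 0xEE);
    out[0] = shufps(t1, t2, 0x88);          /* 00 04 08 12 */
    out[1] = shufps(t1, t2, 0xDD);          /* 01 05 09 13 */
    out[2] = shufps(t3, t4, 0x88);          /* 02 06 10 14 */
    out[3] = shufps(t3, t4, 0xDD);          /* 03 07 11 15 */
}
```

In terms of the posted listing, the repaired pairing would mean that the second stage shuffles xmm0 with xmm2 and xmm4 with xmm5, rather than xmm0 with xmm4 and xmm2 with xmm5.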

Siekmanski · June 25, 2018, 07:23:45 PM

Quote from: AW on June 25, 2018, 05:12:58 PM
Here we go for YAMTT :t (Yet Another Matrix Transpose Test)

Yeah, now we can play and build with it just as we played as kids with LEGO blocks. :t

aw27 · June 25, 2018, 09:49:29 PM

It is cool, although the LEGO idea goes back to this post: http://masm32.com/board/index.php?topic=6140.msg65148#msg65148

However, I would go for large LEGO pieces, like 8x8, instead of small ones which will only be used once. :idea:

LiaoMi · June 25, 2018, 10:52:10 PM

 10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz

4x4 Cycles: 2  CodeSize: 66 RoutineTime: 0.024838457 seconds
4x3 Cycles: 1  CodeSize: 56 RoutineTime: 0.023653736 seconds
4x2 Cycles: 2  CodeSize: 42 RoutineTime: 0.023732180 seconds
3x4 Cycles: 2  CodeSize: 62 RoutineTime: 0.023453961 seconds
3x3 Cycles: 2  CodeSize: 51 RoutineTime: 0.034194162 seconds
3x2 Cycles: 1  CodeSize: 35 RoutineTime: 0.024073813 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.015524173 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.016023061 seconds
2x2 Cycles: 1  CodeSize: 18 RoutineTime: 0.020969785 seconds

Code Alignment 64 byte check: 000h

The MASM Forum
