Fast SIMD transpose routines

Started by Siekmanski, June 25, 2018, 11:35:37 AM


Siekmanski

I wrote some matrix transpose routines for different sizes.
The idea is to use them as building blocks for an algorithm that transposes very large matrices of any size.
I included the sources and a timing mechanism that measures the clock cycles per call and the total time each matrix routine takes over 10000000 calls.
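
The building-block idea can be sketched in C as follows. This is only an illustration under stated assumptions: the matrix is row-major, the transpose is out of place, and `transpose_tile` is a hypothetical scalar stand-in for the small SIMD routines (4x4, 4x3, 3x4, ...), not the code from the attached sources. `transpose_blocked` is likewise a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Transpose one tile of at most 4x4 floats.
   Hypothetical scalar stand-in for a small SIMD kernel.
   src_stride / dst_stride are the row lengths (in floats) of the full matrices. */
static void transpose_tile(const float *src, size_t src_stride,
                           float *dst, size_t dst_stride,
                           size_t tile_rows, size_t tile_cols)
{
    for (size_t r = 0; r < tile_rows; ++r)
        for (size_t c = 0; c < tile_cols; ++c)
            dst[c * dst_stride + r] = src[r * src_stride + c];
}

/* Transpose a rows x cols row-major matrix into a cols x rows matrix,
   walking it in 4x4 tiles; edge tiles shrink to 3, 2 or 1 rows/columns. */
static void transpose_blocked(const float *src, float *dst,
                              size_t rows, size_t cols)
{
    const size_t tile = 4;
    assert(src != dst);                      /* out-of-place only */
    for (size_t r = 0; r < rows; r += tile) {
        size_t tr = (rows - r < tile) ? rows - r : tile;
        for (size_t c = 0; c < cols; c += tile) {
            size_t tc = (cols - c < tile) ? cols - c : tile;
            /* source tile starts at (r, c); it lands at (c, r) in dst */
            transpose_tile(src + r * cols + c, cols,
                           dst + c * rows + r, rows, tr, tc);
        }
    }
}
```

In a real version each `(tr, tc)` combination would dispatch to the matching fixed-size routine instead of the generic loop.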

You can use "SaveResults.bat" to save the timing results as "MatrixTimerResults.txt".
It would be nice if you posted your results here and gave some feedback on how to improve the routines.  :t
Thanks.

Here are my results:

10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz

4x4 Cycles: 4  CodeSize: 66 RoutineTime: 0.032362423 seconds
4x3 Cycles: 3  CodeSize: 56 RoutineTime: 0.029423712 seconds
4x2 Cycles: 4  CodeSize: 42 RoutineTime: 0.032384074 seconds
3x4 Cycles: 4  CodeSize: 62 RoutineTime: 0.032416620 seconds
3x3 Cycles: 2  CodeSize: 51 RoutineTime: 0.026471800 seconds
3x2 Cycles: 3  CodeSize: 35 RoutineTime: 0.029419801 seconds
2x4 Cycles: 0  CodeSize: 31 RoutineTime: 0.020588231 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.020594936 seconds
2x2 Cycles: -1  CodeSize: 18 RoutineTime: 0.017652313 seconds

Code Alignment 64 byte check: 000h


EDIT: I forgot to mention that the routines must be used backwards in memory, because a row of 3 values is still written as 4 values!
Creative coders use backward thinking techniques as a strategy.

zedd151


10000000 calls per Matrix for the Cycle counter and the Routine timer.
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G
4x4 Cycles: 3  CodeSize: 66 RoutineTime: 0.055281872 seconds
4x3 Cycles: 3  CodeSize: 56 RoutineTime: 0.048029962 seconds
4x2 Cycles: 4  CodeSize: 42 RoutineTime: 0.047268223 seconds
3x4 Cycles: 5  CodeSize: 62 RoutineTime: 0.056652746 seconds
3x3 Cycles: 4  CodeSize: 51 RoutineTime: 0.047815803 seconds
3x2 Cycles: 2  CodeSize: 35 RoutineTime: 0.034986782 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.038576830 seconds
2x3 Cycles: -1  CodeSize: 27 RoutineTime: 0.028790277 seconds
2x2 Cycles: 0  CodeSize: 18 RoutineTime: 0.024736004 seconds
Code Alignment 64 byte check: 000h
Press any key to continue...


1.60 GHz CPU speed...

The cycle counts seem off: "0 cycles", "-1 cycle".
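
One way to sanity-check the Cycles column is to convert RoutineTime into average cycles per call. The C sketch below does that; it is not part of the posted package, `cycles_per_call` is a hypothetical helper, and using the nominal 1.60 GHz clock is an assumption, since the core may run at a different speed during the test.

```c
#include <assert.h>

/* Average clock cycles per call implied by a wall-clock measurement.
   clock_hz is the ASSUMED (nominal) core clock, not a measured one. */
static double cycles_per_call(double seconds, double calls, double clock_hz)
{
    assert(calls > 0.0);
    return seconds * clock_hz / calls;
}
```

With the numbers above, `cycles_per_call(0.024736004, 1e7, 1.6e9)` gives roughly 4 cycles for the 2x2 run and `cycles_per_call(0.055281872, 1e7, 1.6e9)` roughly 9 for the 4x4 run, so the timed loop costs more per call than the 0 and 3 cycles reported. A plausible reading, not confirmed by the posted output, is that the Cycles column has the measured loop and call overhead subtracted, which can leave 0 or -1 for the shortest routines.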

Yuri


10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz

4x4 Cycles: 9  CodeSize: 66 RoutineTime: 0.049261743 seconds
4x3 Cycles: 7  CodeSize: 56 RoutineTime: 0.043520454 seconds
4x2 Cycles: 8  CodeSize: 42 RoutineTime: 0.042658139 seconds
3x4 Cycles: 10  CodeSize: 62 RoutineTime: 0.049160314 seconds
3x3 Cycles: 5  CodeSize: 51 RoutineTime: 0.036015704 seconds
3x2 Cycles: 5  CodeSize: 35 RoutineTime: 0.036012691 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.024245575 seconds
2x3 Cycles: 1  CodeSize: 27 RoutineTime: 0.023353133 seconds
2x2 Cycles: 0  CodeSize: 18 RoutineTime: 0.020137194 seconds

Code Alignment 64 byte check: 000h

Caché GB

Hi Siekmanski

Been bench pressing your 4x4 against this one:


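XMMatrixTranspose PROC

       ; rows of the source matrix: a = row 0, b = row 1, c = row 2, d = row 3
       movaps  xmm0, xmmword ptr[edx+0h]      ; xmm0 = a0 a1 a2 a3
       movaps  xmm1, xmmword ptr[edx+10h]     ; xmm1 = b0 b1 b2 b3
       movaps  xmm2, xmmword ptr[edx+20h]     ; xmm2 = c0 c1 c2 c3
       movaps  xmm3, xmmword ptr[edx+30h]     ; xmm3 = d0 d1 d2 d3

       movaps  xmm4, xmm0                     ; copy of row a
       movaps  xmm5, xmm2                     ; copy of row c
       shufps  xmm0, xmm1, 44h                ; xmm0 = a0 a1 b0 b1
       shufps  xmm4, xmm1, 0EEh               ; xmm4 = a2 a3 b2 b3
       shufps  xmm2, xmm3, 44h                ; xmm2 = c0 c1 d0 d1
       shufps  xmm5, xmm3, 0EEh               ; xmm5 = c2 c3 d2 d3
       movaps  xmm1, xmm0
       movaps  xmm3, xmm2
       shufps  xmm0, xmm4, 88h                ; xmm0 = a0 b0 a2 b2
       shufps  xmm1, xmm4, 0DDh               ; xmm1 = a1 b1 a3 b3
       shufps  xmm2, xmm5, 88h                ; xmm2 = c0 d0 c2 d2
       shufps  xmm3, xmm5, 0DDh               ; xmm3 = c1 d1 c3 d3

       movaps  xmmword ptr[eax+0h],  xmm0
       movaps  xmmword ptr[eax+10h], xmm1
       movaps  xmmword ptr[eax+20h], xmm2
       movaps  xmmword ptr[eax+30h], xmm3
       ret                                    ; return to the caller (missing in the original listing)

XMMatrixTranspose ENDP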

;===========================================>

          mov  edx, offset g_mSpin
          mov  eax, offset g_mWorld
       invoke  XMMatrixTranspose


It is from both DirectXMath and Xnamath (it has not changed between them). Your 4x4 is 0.815 % faster over
100 million iterations. Nice.
Caché GB's 1 and 0-nly language:MASM

aw27

Here we go for YAMTT  :t (Yet Another Matrix Transpose Test)

10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

4x4 Cycles: 2  CodeSize: 66 RoutineTime: 0.018620367 seconds
4x3 Cycles: 2  CodeSize: 56 RoutineTime: 0.016572648 seconds
4x2 Cycles: 2  CodeSize: 42 RoutineTime: 0.018659709 seconds
3x4 Cycles: 3  CodeSize: 62 RoutineTime: 0.018680211 seconds
3x3 Cycles: 1  CodeSize: 51 RoutineTime: 0.014052825 seconds
3x2 Cycles: 2  CodeSize: 35 RoutineTime: 0.016425531 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.011702284 seconds
2x3 Cycles: 2  CodeSize: 27 RoutineTime: 0.009617162 seconds
2x2 Cycles: 0  CodeSize: 18 RoutineTime: 0.009352850 seconds

Code Alignment 64 byte check: 000h

zedd151

Quote from: AW on June 25, 2018, 05:12:58 PM
Here we go for YAMTT  :t (Yet Another Matrix Transpose Test)

Where's Rui??   :P


jj2007

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz

4x4 Cycles: 7  CodeSize: 66 RoutineTime: 0.051956401 seconds
4x3 Cycles: 5  CodeSize: 56 RoutineTime: 0.045790781 seconds
4x2 Cycles: 5  CodeSize: 42 RoutineTime: 0.045233293 seconds
3x4 Cycles: 6  CodeSize: 62 RoutineTime: 0.054255731 seconds
3x3 Cycles: 4  CodeSize: 51 RoutineTime: 0.044690584 seconds
3x2 Cycles: 3  CodeSize: 35 RoutineTime: 0.036121362 seconds
2x4 Cycles: 0  CodeSize: 31 RoutineTime: 0.023498652 seconds
2x3 Cycles: -1  CodeSize: 27 RoutineTime: 0.024507303 seconds
2x2 Cycles: -1  CodeSize: 18 RoutineTime: 0.019760855 seconds


Another run, same machine:
4x4 Cycles: 7  CodeSize: 66 RoutineTime: 0.055625230 seconds
4x3 Cycles: 5  CodeSize: 56 RoutineTime: 0.045135589 seconds
4x2 Cycles: 5  CodeSize: 42 RoutineTime: 0.045048148 seconds
3x4 Cycles: 7  CodeSize: 62 RoutineTime: 0.051966664 seconds
3x3 Cycles: 3  CodeSize: 51 RoutineTime: 0.038664541 seconds
3x2 Cycles: 3  CodeSize: 35 RoutineTime: 0.037659996 seconds
2x4 Cycles: -1  CodeSize: 31 RoutineTime: 0.024224864 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.024652217 seconds
2x2 Cycles: -1  CodeSize: 18 RoutineTime: 0.019747718 seconds

RuiLoureiro

Quote from: zedd151 on June 25, 2018, 05:16:35 PM
Quote from: AW on June 25, 2018, 05:12:58 PM
Here we go for YAMTT  :t (Yet Another Matrix Transpose Test)

Where's Rui??   :P

Will be here as soon as possible, zedd151  :biggrin:

RuiLoureiro


10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Atom(TM) CPU N455 @ 1.66GHz

4x4 Cycles: 58  CodeSize: 66 RoutineTime: 0.377001463 seconds
4x3 Cycles: 51  CodeSize: 56 RoutineTime: 0.339285241 seconds
4x2 Cycles: 19  CodeSize: 42 RoutineTime: 0.165002387 seconds
3x4 Cycles: 53  CodeSize: 62 RoutineTime: 0.339326507 seconds
3x3 Cycles: 38  CodeSize: 51 RoutineTime: 0.286387559 seconds
3x2 Cycles: 22  CodeSize: 35 RoutineTime: 0.171114020 seconds
2x4 Cycles: 22  CodeSize: 31 RoutineTime: 0.139145431 seconds
2x3 Cycles: 19  CodeSize: 27 RoutineTime: 0.144533990 seconds
2x2 Cycles: 11  CodeSize: 18 RoutineTime: 0.078623451 seconds

Code Alignment 64 byte check: 000h

Press any key to continue...

Siekmanski

Thanks guys,

@Caché GB

The matrices I use here are unaligned (to process uneven matrix row sizes).

The aligned 4x4 version is even a bit faster:
10 million iterations (i7-4930K)
4x4 aligned DirectXMath Cycles: 4  CodeSize: 74 RoutineTime: 0.042951199 seconds
4x4 aligned Siekmanski  Cycles: 4  CodeSize: 66 RoutineTime: 0.032361864 seconds

My aligned version takes 75.3456591 % of the time of the aligned DirectXMath version.
That's 1.327216473 times faster, with 8 bytes less code.  :biggrin:
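
Both figures follow directly from the two RoutineTime values. A small C check of the arithmetic, where `time_ratio` is a hypothetical helper name used only for this illustration:

```c
#include <assert.h>

/* Fraction of the reference time that the new routine needs. */
static double time_ratio(double t_new, double t_ref)
{
    assert(t_ref > 0.0);
    return t_new / t_ref;
}
```

`time_ratio(0.032361864, 0.042951199)` is about 0.7535 (75.35 %), its reciprocal is about 1.3272, and the code size difference is 74 - 66 = 8 bytes.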

Siekmanski

@Caché GB

Are you sure you posted a correct version of the 4x4 DirectXMath transpose matrix?
The results are wrong:

00 04 02 06
01 05 03 07
08 12 10 14
09 13 11 15
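
The output above can be reproduced without running any assembly by modelling `shufps` in plain C. The sketch below is an illustration, not code from the thread: `vec4`, `shufps`, `transpose_as_posted` and `transpose_repaired` are hypothetical names, and the input is assumed to hold the values 00 to 15 in row-major order. The repaired variant is an assumption about what the listing was meant to do: its second stage pairs the row 0/1 halves with the row 2/3 halves.

```c
#include <assert.h>

typedef struct { float f[4]; } vec4;

/* Scalar model of SSE "shufps dst, src, imm8":
   the two low lanes are picked from dst, the two high lanes from src. */
static vec4 shufps(vec4 dst, vec4 src, unsigned imm8)
{
    vec4 r;
    assert(imm8 <= 0xFFu);
    r.f[0] = dst.f[imm8 & 3];
    r.f[1] = dst.f[(imm8 >> 2) & 3];
    r.f[2] = src.f[(imm8 >> 4) & 3];
    r.f[3] = src.f[(imm8 >> 6) & 3];
    return r;
}

/* Same instruction order as the posted listing:
   the second stage pairs xmm0 with xmm4 and xmm2 with xmm5. */
static void transpose_as_posted(const vec4 in[4], vec4 out[4])
{
    vec4 x0 = in[0], x1 = in[1], x2 = in[2], x3 = in[3];
    vec4 x4 = x0, x5 = x2;
    x0 = shufps(x0, x1, 0x44);
    x4 = shufps(x4, x1, 0xEE);
    x2 = shufps(x2, x3, 0x44);
    x5 = shufps(x5, x3, 0xEE);
    x1 = x0;
    x3 = x2;
    x0 = shufps(x0, x4, 0x88);
    x1 = shufps(x1, x4, 0xDD);
    x2 = shufps(x2, x5, 0x88);
    x3 = shufps(x3, x5, 0xDD);
    out[0] = x0; out[1] = x1; out[2] = x2; out[3] = x3;
}

/* Assumed repair: second stage combines rows 0/1 with rows 2/3. */
static void transpose_repaired(const vec4 in[4], vec4 out[4])
{
    vec4 lo01 = shufps(in[0], in[1], 0x44);   /* a0 a1 b0 b1 */
    vec4 hi01 = shufps(in[0], in[1], 0xEE);   /* a2 a3 b2 b3 */
    vec4 lo23 = shufps(in[2], in[3], 0x44);   /* c0 c1 d0 d1 */
    vec4 hi23 = shufps(in[2], in[3], 0xEE);   /* c2 c3 d2 d3 */
    out[0] = shufps(lo01, lo23, 0x88);        /* a0 b0 c0 d0 */
    out[1] = shufps(lo01, lo23, 0xDD);        /* a1 b1 c1 d1 */
    out[2] = shufps(hi01, hi23, 0x88);        /* a2 b2 c2 d2 */
    out[3] = shufps(hi01, hi23, 0xDD);        /* a3 b3 c3 d3 */
}
```

Feeding rows 00..03, 04..07, 08..11, 12..15 into `transpose_as_posted` yields the four rows shown above, while `transpose_repaired` yields the true transpose.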

Siekmanski

Quote from: AW on June 25, 2018, 05:12:58 PM
Here we go for YAMTT  :t (Yet Another Matrix Transpose Test)

Yeah, now we can play and build with it just as we played as kids with LEGO blocks.  :t

aw27

It is cool, although the LEGO idea goes back to this post: http://masm32.com/board/index.php?topic=6140.msg65148#msg65148

However, I would go for large LEGO pieces, like 8x8, instead of small ones which will only be used once.  :idea:

LiaoMi

10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz

4x4 Cycles: 2  CodeSize: 66 RoutineTime: 0.024838457 seconds
4x3 Cycles: 1  CodeSize: 56 RoutineTime: 0.023653736 seconds
4x2 Cycles: 2  CodeSize: 42 RoutineTime: 0.023732180 seconds
3x4 Cycles: 2  CodeSize: 62 RoutineTime: 0.023453961 seconds
3x3 Cycles: 2  CodeSize: 51 RoutineTime: 0.034194162 seconds
3x2 Cycles: 1  CodeSize: 35 RoutineTime: 0.024073813 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.015524173 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.016023061 seconds
2x2 Cycles: 1  CodeSize: 18 RoutineTime: 0.020969785 seconds

Code Alignment 64 byte check: 000h