News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Fast SIMD transpose routines

Started by Siekmanski, June 25, 2018, 11:35:37 AM

Previous topic - Next topic

mineiro

10000000 calls per Matrix for the Cycle counter and the Routine timer.

Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

4x4 Cycles: 3  CodeSize: 66 RoutineTime: 0.021707700 seconds
4x3 Cycles: 2  CodeSize: 56 RoutineTime: 0.018974900 seconds
4x2 Cycles: 3  CodeSize: 42 RoutineTime: 0.021691700 seconds
3x4 Cycles: 3  CodeSize: 62 RoutineTime: 0.021689300 seconds
3x3 Cycles: 1  CodeSize: 51 RoutineTime: 0.016264000 seconds
3x2 Cycles: 2  CodeSize: 35 RoutineTime: 0.018972900 seconds
2x4 Cycles: 1  CodeSize: 31 RoutineTime: 0.013553900 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.011021800 seconds
2x2 Cycles: 0  CodeSize: 18 RoutineTime: 0.010840100 seconds

Code Alignment 64 byte check: 000h

Press any key to continue...
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

Siekmanski

Quote from: AW on June 25, 2018, 09:49:29 PM
It is cool, although the LEGO idea comes back from here: http://masm32.com/board/index.php?topic=6140.msg65148#msg65148

However, I would go for large LEGO pieces, like 8x8, instead of small ones which will only be used once.  :idea:

Of course we can use AVX for 8x8 matrices.

This is what I had in mind, all posible sizes without moving single values.

As an example the 9x9 matrix:
Can be done with 4 blocks of 4x4 matrices and exchanging the left over single values.
Or use 4 3 2 size matrices as shown below.

9 = 3 square blocks along the Hypotenuse ( the RED line ) 4(X) 3(Y) 2(Q).
    now you have to transpose the rest. 4x3(Y) 4x2(Z) 3X2(P)

If you look closely at the picture you see on the X-axis the repetition of 4 3 2.
The same for the Y-axis and the Hypotenuse,  4 3 2

It looks like building with LEGO blocks......  :lol:

With this in mind you can build every size ( even or uneven ) by transposing along the Hypotenuse
Start at the right bottom with the 3 row sized matrices ( remember rows of 3 values will overwrite with the 4th value to increase the speed ) and work your way backwards in memory.


Creative coders use backward thinking techniques as a strategy.

nidud

#17
deleted

aw27

#18
@Marinus,
:t

@Nidud
Interesting, when I produced the DxMath library I have investigated the way DirectXMath was doing it and at the time the memory reserved for the transposed matrix was pointed to by RCX according to Windows ABI and on return pointed to by RAX as usual, but now they are returning in RDX for some mysterious reason that only @nidud knows about..  :badgrin:
We have also concluded that Marinus algo was better so I later changed in DxMath to @Marinus algo.



Caché GB

Hi Siekmanski.

Yes, posted the wrong one. My apologies.
Been benching several. Here is the correct version.



XMMatrixTranspose PROC

       movaps  xmm5, xmmword ptr[edx+0h] 
       movaps  xmm3, xmmword ptr[edx+20h]
       movaps  xmm4, xmm5 
       shufps  xmm4, xmmword ptr[edx+10h], 44h 
       movaps  xmm1, xmm3 
       shufps  xmm5, xmmword ptr[edx+10h], 0EEh 
       movaps  xmm2, xmm4 
       shufps  xmm1, xmmword ptr[edx+20h], 44h 
       movaps  xmm0, xmm5 
       shufps  xmm3, xmmword ptr[edx+20h], 0EEh 
       shufps  xmm2, xmm1, 88h 
       shufps  xmm4, xmm1, 0DDh 
       shufps  xmm0, xmm3, 88h 
       shufps  xmm5, xmm3, 0DDh 
       movaps  xmmword ptr[eax+0h] , xmm2
       movaps  xmmword ptr[eax+10h], xmm4
       movaps  xmmword ptr[eax+20h], xmm0
       movaps  xmmword ptr[eax+30h], xmm5
          ret

XMMatrixTranspose ENDP



Just doing a brute force like this.



       Counter  equ  1000000000  ; 1 Billion

       invoke  Sleep, 1000

       invoke  GetTickCount
          mov  TestTime, eax
          mov  ecx, Counter 
     L01:
                mov  edx, offset g_mSpin
                mov  eax, offset g_mWorld
             invoke  Siekmanski_MatrixTranspose

          dec  ecx
          jnz  L01

       invoke  GetTickCount
          sub  eax, TestTime
       invoke  wsprintfA, addr szTestBuffer01, CSTR("%d milli secs", 13, 10), eax

       invoke  Sleep, 1000       

       invoke  GetTickCount
          mov  TestTime, eax
          mov  ecx, Counter
     L02:
                mov  edx, offset g_mSpin
                mov  eax, offset g_mWorld
             invoke  XMMatrixTranspose

          dec  ecx
          jnz  L02

       invoke  GetTickCount
          sub  eax, TestTime
       invoke  wsprintfA, addr szTestBuffer02, CSTR("%d milli secs", 13, 10), eax

       invoke  SendMessageA, hWndList01, LB_ADDSTRING, null, addr szTestBuffer01
       invoke  SendMessageA, hWndList01, LB_ADDSTRING, null, addr szTestBuffer02
       invoke  SendMessageA, hWndList01, LB_ADDSTRING, null, CSTR(" ", 13, 10)

          ret


Caché GB's 1 and 0-nly language:MASM

nidud

#20
deleted

HSE

Only Rui is beating me with higher times  :biggrin: :biggrin: :biggrin:

10000000 calls per Matrix for the Cycle counter and the Routine timer.

AMD A6-3500 APU with Radeon(tm) HD Graphics

4x4 Cycles: 8  CodeSize: 66 RoutineTime: 0.063076945 seconds
4x3 Cycles: 4  CodeSize: 56 RoutineTime: 0.064050233 seconds
4x2 Cycles: 7  CodeSize: 42 RoutineTime: 0.067956549 seconds
3x4 Cycles: 7  CodeSize: 62 RoutineTime: 0.069592511 seconds
3x3 Cycles: 3  CodeSize: 51 RoutineTime: 0.052619953 seconds
3x2 Cycles: 1  CodeSize: 35 RoutineTime: 0.045371007 seconds
2x4 Cycles: 0  CodeSize: 31 RoutineTime: 0.037927502 seconds
2x3 Cycles: 0  CodeSize: 27 RoutineTime: 0.036778671 seconds
2x2 Cycles: 1  CodeSize: 18 RoutineTime: 0.037995768 seconds

Code Alignment 64 byte check: 000h

Press any key to continue...
Equations in Assembly: SmplMath

Siekmanski

Hi Caché GB,
The last one is wrong too.

00 04 08 08
01 05 09 09
02 06 10 10
03 07 11 11

Thanks Nidud,

Did some runs and Marinus is a tiny bit faster on my sytem.
Yours has 6 memory reads and 4 memory writes.
Mine has 4 memory reads and 4 memory writes.
Don't know if the memory reads makes any difference when run on some older computers?

Interesting way loading data with shufps ( never thought of this )
I will study your routine, thanks for it.

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032367591 seconds ; Marinus
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032533045 seconds ; Nidud

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032362563 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032495331 seconds

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032362074 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032362283 seconds

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032358303 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032364030 seconds

4x4m Cycles: 4  CodeSize: 66 RoutineTime: 0.032371503 seconds
4x4n Cycles: 4  CodeSize: 68 RoutineTime: 0.032403490 seconds


Niduds version in 32-bit


XMMatrixTranspose PROC
    mov         eax,offset MatrixIn
    mov         ecx,offset MatrixOut
M4_4n:
    movaps      xmm0,[eax+0]
    movaps      xmm2,[eax+020h]
    movaps      xmm4,xmm0
    movhps      xmm4,qword ptr[eax+010h]
    shufps      xmm0,[eax+010h],Shuffle(3,2,3,2)
    movaps      xmm1,xmm2
    shufps      xmm1,[eax+030h],Shuffle(3,2,3,2)
    movhps      xmm2,qword ptr[eax+030h]
    movaps      xmm3,xmm4
    shufps      xmm4,xmm2,Shuffle(3,1,3,1)
    movaps      [ecx+010h],xmm4
    movaps      xmm4,xmm0
    shufps      xmm3,xmm2,Shuffle(2,0,2,0)
    shufps      xmm4,xmm1,Shuffle(2,0,2,0)
    shufps      xmm0,xmm1,Shuffle(3,1,3,1)
    movaps      [ecx+0],xmm3
    movaps      [ecx+020h],xmm4
    movaps      [ecx+030h],xmm0
    CodeSize4_4n = $-M4_4n
    ret
XMMatrixTranspose ENDP


Creative coders use backward thinking techniques as a strategy.

nidud

#23
deleted

Caché GB

Ok I think I should stay out of the dojo.

Maybe third time lucky   



XMMatrixTranspose_003 PROC

       movaps  xmm5, xmmword ptr[edx+0h] 
       movaps  xmm3, xmmword ptr[edx+20h]
       movaps  xmm4, xmm5 
       shufps  xmm4, xmmword ptr[edx+10h], 44h 
       movaps  xmm1, xmm3 
       shufps  xmm5, xmmword ptr[edx+10h], 0EEh 
       movaps  xmm2, xmm4 
       shufps  xmm1, xmmword ptr[edx+30h], 44h  ; <--- Not 20h
       movaps  xmm0, xmm5 
       shufps  xmm3, xmmword ptr[edx+30h], 0EEh  ; <--- Not 20h 
       shufps  xmm2, xmm1, 88h 
       shufps  xmm4, xmm1, 0DDh 
       shufps  xmm0, xmm3, 88h 
       shufps  xmm5, xmm3, 0DDh 
       movaps  xmmword ptr[eax+0h] , xmm2
       movaps  xmmword ptr[eax+10h], xmm4
       movaps  xmmword ptr[eax+20h], xmm0
       movaps  xmmword ptr[eax+30h], xmm5
          ret

XMMatrixTranspose_003 ENDP

Caché GB's 1 and 0-nly language:MASM

Siekmanski

Hi Caché GB,

Welcome to the dojo.  :t

4x4 Cycles: 4  CodeSize: 66 RoutineTime: 0.032354671 seconds ; Siekmanski
4x4 Cycles: 4  CodeSize: 68 RoutineTime: 0.032370874 seconds ; Nidud
4x4 Cycles: 4  CodeSize: 70 RoutineTime: 0.032421439 seconds ; Caché GB

Your matrix results are correct.

The speed of the routines are all very close to each other.
Results may vary on different architectures.
Creative coders use backward thinking techniques as a strategy.

aw27

@nidud

As usual, you mix oranges with pears.
You showed the Windows ABI being used and filling a return structure pointed to by RDX, which is wrong, and never done in DirectXMath then you justify that with the Vectorcall and inlining of functions. What a bloody confusion!   :shock:

Caché GB

Caché GB's 1 and 0-nly language:MASM

nidud

#28
deleted

Siekmanski

Hi nidud,

You inserted 4 extra instructions in my code.
The last 4 instructions,

    movaps      xmm0,xmm4       ; [0 4 8 C]
    movaps      xmm1,xmm5       ; [1 5 9 D]
    movaps      xmm3,xmm2       ; [3 7 B F]
    movaps      xmm2,xmm6       ; [2 6 A E]

Doesn't this slow down my code in comparison with yours?
Or is there a reason to insert them?

removed xmm6 ( to make it simpler  :biggrin: )
XMMatrixTransposeM PROC
    mov         eax,offset MatrixIn
    mov         ecx,offset MatrixOut

    movaps      xmm0,[eax+0]            ; [0 1 2 3]
    movaps      xmm1,[eax+16]           ; [4 5 6 7]
    movaps      xmm2,[eax+32]           ; [8 9 A B]
    movaps      xmm3,[eax+48]           ; [C D E F]

    movaps      xmm4,xmm0               ; [0 1 2 3]
    movaps      xmm5,xmm2               ; [8 9 A B]
    unpcklps    xmm4,xmm1               ; [0 4 1 5]
    unpcklps    xmm5,xmm3               ; [8 C 9 D]
    unpckhps    xmm0,xmm1               ; [2 6 3 7]
    unpckhps    xmm2,xmm3               ; [A E B F]
    movaps      xmm1,xmm4               ; [0 4 1 5]
    movaps      xmm3,xmm0               ; [2 6 3 7]
    movlhps     xmm4,xmm5               ; [0 4 8 C]
    movlhps     xmm3,xmm2               ; [2 6 A E]
    movhlps     xmm5,xmm1               ; [1 5 9 D]
    movhlps     xmm2,xmm0               ; [3 7 B F]

    movaps      [ecx+0],xmm4            ; [0 4 8 C]
    movaps      [ecx+16],xmm5           ; [1 5 9 D]
    movaps      [ecx+32],xmm3           ; [2 6 A E]
    movaps      [ecx+48],xmm2           ; [3 7 B F]
    ret
XMMatrixTransposeM ENDP
Creative coders use backward thinking techniques as a strategy.