Fast transposing matrix procedure for any size

Started by RuiLoureiro, June 26, 2018, 07:41:14 PM


RuiLoureiro

#30
Hi all,
         Here are the basic procedures (.asm) used to write and test
         some basic code needed to transpose any matrix.

Good luck

Particular note: my thanks to Siekmanski  :t

Example
Quote
--------------------------------------------
HOW TO WRITE THE CODE TO TRANSPOSE 4x4
--------------------------------------------
u  is unpack   -> ul is unpack low and uh is unpack high
l  is low
h  is high

lh is move  low to high
hl is move high to low

Start with «what is the input» and «what we want to get»

                the input                                    we want to get this
          ----------------------                           -----------------------
line 0 ->  0  1  2  3 = x0 = x4                                  0  4  8  C
line 1 ->  4  5  6  7 = x1   <- after unpack we may use x1       1  5  9  D
line 2 ->  8  9  A  B = x2 = x5                                  2  6  A  E
line 3 ->  C  D  E  F = x3   <- after unpack we may use x3       3  7  B  F


note: To get [0  4  8  C] we need to unpack x0 with x1 and x2 with x3.
      Because unpack overwrites x0 and x2, we first copy them to x4 and x5.
----------------------------------------------------------------------------------
      x4=x0         <--- copy x0 to x4
      x5=x2

ul    x0,x1  = 0  4  1  5  = x6 ( x1 )
ul    x2,x3  = 8  C  9  D

lh    x0,x2  = 0  4  8  C
hl    x2,x6  = 1  5  9  D
----above (low) and below (high) are similar------
uh    x4,x1  = 2  6  3  7  = x7 ( x3 )
uh    x5,x3  = A  E  B  F

lh    x4,x5  = 2  6  A  E
hl    x5,x7  = 3  7  B  F
+++++++++++++++++++++++++++++++++++++++++++++++
      x4=x0
      x5=x2

ul    x0,x1  = 0  4  1  5  = x6 ( x1 )
ul    x2,x3  = 8  C  9  D
uh    x4,x1  = 2  6  3  7  = x7 ( x3 )
uh    x5,x3  = A  E  B  F

      x6=x0  or x1=x0
      x7=x4  or x3=x4
     
lh    x0,x2  = 0  4  8  C
hl    x2,x6  = 1  5  9  D   <<- replace x6 by x1
lh    x4,x5  = 2  6  A  E
hl    x5,x7  = 3  7  B  F   <<- replace x7 by x3
     
+++++++++++++++++++++++++++++++++++++++++++++++
      x4=x0
      x5=x2

ul    x0,x1  = 0  4  1  5  = x1
ul    x2,x3  = 8  C  9  D
uh    x4,x1  = 2  6  3  7  = x3
uh    x5,x3  = A  E  B  F

      x1=x0
      x3=x4
     
lh    x0,x2  = 0  4  8  C   <-- first output
hl    x2,x1  = 1  5  9  D
lh    x4,x5  = 2  6  A  E
hl    x5,x3  = 3  7  B  F   <-- fourth output

+++++++++++++++++++++++++++++++++++++++++++++++
                                   instructions
+++++++++++++++++++++++++++++++++++++++++++++++
                ; write the input xmm0,xmm1,xmm2,xmm3
                ;
                movaps      xmm4,xmm0           ; [0  1  2  3] copy of xmm0
                movaps      xmm5,xmm2           ; [8  9  A  B]

                unpcklps    xmm0, xmm1          ; [0  4  1  5]
                unpcklps    xmm2, xmm3          ; [8  C  9  D]
                unpckhps    xmm4, xmm1          ; [2  6  3  7]
                unpckhps    xmm5, xmm3          ; [A  E  B  F]

                movaps      xmm1,xmm0           ; [0  4  1  5]
                movaps      xmm3,xmm4           ; [2  6  3  7]

                movlhps     xmm0,xmm2           ; [0  4  8  C] first  line
                movhlps     xmm2,xmm1           ; [1  5  9  D] second line
                movlhps     xmm4,xmm5           ; [2  6  A  E] third  line
                movhlps     xmm5,xmm3           ; [3  7  B  F] fourth line
                ;
                ; write the output xmm0,xmm2,xmm4,xmm5
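
The register sequence above can be checked from C with SSE intrinsics. This is only a sketch, not the code from the attachment: the name `transpose4x4` and the row-by-row float layout are assumptions made for illustration. Each intrinsic stands for one instruction in the listing (`_mm_unpacklo_ps` for unpcklps, `_mm_unpackhi_ps` for unpckhps, `_mm_movelh_ps` for movlhps, `_mm_movehl_ps` for movhlps).

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: transposes a 4x4 float matrix stored row by row.
   It mirrors the listing above, one intrinsic per instruction. */
static void transpose4x4(const float *in, float *out)
{
    __m128 x0 = _mm_loadu_ps(in + 0);    /* [0 1 2 3] */
    __m128 x1 = _mm_loadu_ps(in + 4);    /* [4 5 6 7] */
    __m128 x2 = _mm_loadu_ps(in + 8);    /* [8 9 A B] */
    __m128 x3 = _mm_loadu_ps(in + 12);   /* [C D E F] */

    __m128 x4 = x0;                      /* movaps xmm4,xmm0   [0 1 2 3] */
    __m128 x5 = x2;                      /* movaps xmm5,xmm2   [8 9 A B] */

    x0 = _mm_unpacklo_ps(x0, x1);        /* unpcklps           [0 4 1 5] */
    x2 = _mm_unpacklo_ps(x2, x3);        /* unpcklps           [8 C 9 D] */
    x4 = _mm_unpackhi_ps(x4, x1);        /* unpckhps           [2 6 3 7] */
    x5 = _mm_unpackhi_ps(x5, x3);        /* unpckhps           [A E B F] */

    x1 = x0;                             /* movaps xmm1,xmm0   [0 4 1 5] */
    x3 = x4;                             /* movaps xmm3,xmm4   [2 6 3 7] */

    x0 = _mm_movelh_ps(x0, x2);          /* movlhps            [0 4 8 C] */
    x2 = _mm_movehl_ps(x2, x1);          /* movhlps            [1 5 9 D] */
    x4 = _mm_movelh_ps(x4, x5);          /* movlhps            [2 6 A E] */
    x5 = _mm_movehl_ps(x5, x3);          /* movhlps            [3 7 B F] */

    /* The output lines are in x0, x2, x4, x5, as in the listing. */
    _mm_storeu_ps(out + 0,  x0);
    _mm_storeu_ps(out + 4,  x2);
    _mm_storeu_ps(out + 8,  x4);
    _mm_storeu_ps(out + 12, x5);
}
```

Filling `in` with the values 0 to 15 and calling `transpose4x4(in, out)` is a quick way to compare each output line with the bracketed comments.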

For input we may use the loads below (the output is handled in a similar way).
If there are no additional columns we may use movaps for input;
if there are no additional lines we may use movaps for output
(movaps also requires 16-byte aligned addresses).

There are additional columns or lines when the size is a multiple of 4 plus 1, 2 or 3.
Quote
            mov     edx, LinLengthX

For 1 input:
            movups  xmm0, [esi]

For 2 inputs:
            movups  xmm0, [esi]
            movups  xmm1, [esi+edx]
           
For 3 inputs:
            movups  xmm0, [esi]
            movups  xmm1, [esi+edx]
            movups  xmm2, [esi+2*edx]

For 4 inputs:
            movups  xmm0, [esi]             ; line 0
            add     esi, edx                ; esi -> line 1 (there is no [esi+3*edx] addressing form)
            movups  xmm1, [esi]             ; line 1
            movups  xmm2, [esi+edx]         ; line 2
            movups  xmm3, [esi+2*edx]       ; line 3
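
The four load cases differ only in how many lines are left. A possible C sketch of the same idea, under some assumptions: `load_lines` is a hypothetical name, `stride` is the line length counted in floats (in the listing, LinLengthX is added directly to the address, so there it appears to be a byte count), and registers for missing lines are zero-filled, which the original does not specify.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: loads up to four lines of four floats each.
   src    - first float of the first line
   stride - distance between two lines, in floats (assumption)
   nlines - number of valid lines, 1 to 4
   Registers for missing lines are set to zero (assumption). */
static void load_lines(const float *src, size_t stride, int nlines, __m128 x[4])
{
    for (int i = 0; i < 4; i++) {
        if (i < nlines)
            x[i] = _mm_loadu_ps(src + (size_t)i * stride);  /* movups */
        else
            x[i] = _mm_setzero_ps();                        /* no line to read */
    }
}
```

Only lines that exist are read, so the last block of a matrix whose line count is not a multiple of 4 does not touch memory below the matrix.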

Siekmanski

Creative coders use backward thinking techniques as a strategy.

zedd151

Hi Rui!


Just the kind of building block a beginner needs. Never mind the naysayers.


Also, I'm glad it's 32-bit code, as I have given up on 64-bit coding for the time being.

guga

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

HSE

Equations in Assembly: SmplMath

RuiLoureiro

Hi all
       Here is SSE49_FASCODE_v8, which contains MatrixTransposeSSE49 in the file sse49.
       MatrixTransposeSSE49 is a procedure that uses 4x8, 4x16, 4x32, 4x64, 8x16, 8x32, 16x16,
       16x32, 32x32, 16x64, 32x64 and 64x64 transpose code to build one transpose procedure for
       each case. So we can build the best transpose procedure for each matrix size by inserting
       or removing one of those cases. For example, to transpose a 512x512 matrix we may
       use 4x32 in each loop instead of 4x4. Each block of 4 lines then needs only 16
       loops (512/32) instead of 128 loops (512/4) with 4x4.
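
To show how a small tile extends to any size, here is a rough C sketch. It is not MatrixTransposeSSE49: the names `tile4x4` and `transpose_any` are hypothetical, the tile uses the `_MM_TRANSPOSE4_PS` macro from `xmmintrin.h` instead of the hand-written sequence, and leftover lines and columns are copied one float at a time. It uses the plain 4x4 tile, so a 512-column matrix takes 128 inner iterations per block of 4 lines; handling eight tiles per iteration (a 4x32 block) would reduce that to 16.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics and _MM_TRANSPOSE4_PS */

/* Hypothetical 4x4 tile: reads 4 lines from `in`, writes 4 lines to `out`.
   Strides are line lengths in floats. */
static void tile4x4(const float *in, size_t in_stride,
                    float *out, size_t out_stride)
{
    __m128 x0 = _mm_loadu_ps(in);
    __m128 x1 = _mm_loadu_ps(in + in_stride);
    __m128 x2 = _mm_loadu_ps(in + 2 * in_stride);
    __m128 x3 = _mm_loadu_ps(in + 3 * in_stride);

    _MM_TRANSPOSE4_PS(x0, x1, x2, x3);   /* transposes the four registers in place */

    _mm_storeu_ps(out, x0);
    _mm_storeu_ps(out + out_stride, x1);
    _mm_storeu_ps(out + 2 * out_stride, x2);
    _mm_storeu_ps(out + 3 * out_stride, x3);
}

/* Transposes a rows x cols matrix (stored line by line) into `out`,
   which holds cols x rows floats. */
static void transpose_any(const float *in, float *out, size_t rows, size_t cols)
{
    size_t r4 = rows & ~(size_t)3;   /* lines covered by full blocks of 4 */
    size_t c4 = cols & ~(size_t)3;   /* columns covered by full blocks of 4 */

    /* Full 4x4 tiles: cols/4 iterations for each block of 4 lines. */
    for (size_t r = 0; r < r4; r += 4)
        for (size_t c = 0; c < c4; c += 4)
            tile4x4(in + r * cols + c, cols, out + c * rows + r, rows);

    /* Additional columns (cols is a multiple of 4 plus 1, 2 or 3). */
    for (size_t r = 0; r < r4; r++)
        for (size_t c = c4; c < cols; c++)
            out[c * rows + r] = in[r * cols + c];

    /* Additional lines (rows is a multiple of 4 plus 1, 2 or 3). */
    for (size_t r = r4; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            out[c * rows + r] = in[r * cols + c];
}
```

The scalar tail keeps the sketch short; the thread's approach of loading 1 to 3 lines with movups would replace it in a tuned version.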

Good luck
Rui Loureiro