Fast transposing matrix procedure for any size

RuiLoureiro · July 20, 2018, 09:22:27 PM

Hi all,
Here are the basic procedures (.asm) that I used to write and test
the basic code needed to transpose a matrix of any size.

Good luck

Particular note: my thanks to Siekmanski :t

Example

Quote
--------------------------------------------
HOW TO WRITE THE CODE TO TRANSPOSE 4x4
--------------------------------------------
u is unpack -> ul is unpack low (unpcklps) and uh is unpack high (unpckhps)
l is low
h is high

lh is move low to high (movlhps)
hl is move high to low (movhlps)

Start with «what is the input» and «what we want to get»

      the input                                            we want to get this
----------------------                                   -----------------------
line 0 -> 0 1 2 3 = x0 = x4                                     0 4 8 C
          4 5 6 7 = x1 <- after unpack we may use x1            1 5 9 D

line 2 -> 8 9 A B = x2 = x5                                     2 6 A E
          C D E F = x3 <- after unpack we may use x3            3 7 B F

note: To get [0 4 8 C] we need to unpack x0 with x1 and x2 with x3.
Because the unpack overwrites x0 and x2, we first need to copy them to x4 and x5.
----------------------------------------------------------------------------------
x4=x0 <--- copy x0 to x4
x5=x2

ul x0,x1 = 0 4 1 5 = x6 ( x1 )
ul x2,x3 = 8 C 9 D

lh x0,x2 = 0 4 8 C
hl x2,x6 = 1 5 9 D
----above (low) and below (high) are similar------
uh x4,x1 = 2 6 3 7 = x7 ( x3 )
uh x5,x3 = A E B F

lh x4,x5 = 2 6 A E
hl x5,x7 = 3 7 B F
+++++++++++++++++++++++++++++++++++++++++++++++
x4=x0
x5=x2

ul x0,x1 = 0 4 1 5 = x6 ( x1 )
ul x2,x3 = 8 C 9 D
uh x4,x1 = 2 6 3 7 = x7 ( x3 )
uh x5,x3 = A E B F

x6=x0 or x1=x0
x7=x4 or x3=x4

lh x0,x2 = 0 4 8 C
hl x2,x6 = 1 5 9 D <<- replace x6 by x1
lh x4,x5 = 2 6 A E
hl x5,x7 = 3 7 B F <<- replace x7 by x3

+++++++++++++++++++++++++++++++++++++++++++++++
x4=x0
x5=x2

ul x0,x1 = 0 4 1 5 = x1
ul x2,x3 = 8 C 9 D
uh x4,x1 = 2 6 3 7 = x3
uh x5,x3 = A E B F

x1=x0
x3=x4

lh x0,x2 = 0 4 8 C <-- first output
hl x2,x1 = 1 5 9 D
lh x4,x5 = 2 6 A E
hl x5,x3 = 3 7 B F <-- fourth output

+++++++++++++++++++++++++++++++++++++++++++++++
instructions
+++++++++++++++++++++++++++++++++++++++++++++++
; here: load the input lines into xmm0,xmm1,xmm2,xmm3
;
movaps xmm4,xmm0 ; [0 1 2 3] copy of line 0
movaps xmm5,xmm2 ; [8 9 A B]

unpcklps xmm0, xmm1 ; [0 4 1 5]
unpcklps xmm2, xmm3 ; [8 C 9 D]
unpckhps xmm4, xmm1 ; [2 6 3 7]
unpckhps xmm5, xmm3 ; [A E B F]

movaps xmm1,xmm0 ; [0 4 1 5]
movaps xmm3,xmm4 ; [2 6 3 7]

movlhps xmm0,xmm2 ; [0 4 8 C] first line
movhlps xmm2,xmm1 ; [1 5 9 D] second line
movlhps xmm4,xmm5 ; [2 6 A E] third line
movhlps xmm5,xmm3 ; [3 7 B F] fourth line
;
; here: store the output lines from xmm0,xmm2,xmm4,xmm5
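
The register sequence above can be checked with a small, portable C model. This is only a sketch under stated assumptions: the `vec4` type and the helpers `unpcklps`, `unpckhps`, `movlhps` and `movhlps` are hypothetical stand-ins that imitate on four floats what the SSE instructions of the same name do; they are not intrinsics and not part of the posted .asm files.

```c
/* Hypothetical model of one XMM register holding four floats. */
typedef struct { float f[4]; } vec4;

/* unpcklps d,s : result = [d0 s0 d1 s1] */
static vec4 unpcklps(vec4 d, vec4 s) {
    vec4 r = {{ d.f[0], s.f[0], d.f[1], s.f[1] }};
    return r;
}

/* unpckhps d,s : result = [d2 s2 d3 s3] */
static vec4 unpckhps(vec4 d, vec4 s) {
    vec4 r = {{ d.f[2], s.f[2], d.f[3], s.f[3] }};
    return r;
}

/* movlhps d,s : low half of s goes to high half of d = [d0 d1 s0 s1] */
static vec4 movlhps(vec4 d, vec4 s) {
    vec4 r = {{ d.f[0], d.f[1], s.f[0], s.f[1] }};
    return r;
}

/* movhlps d,s : high half of s goes to low half of d = [s2 s3 d2 d3] */
static vec4 movhlps(vec4 d, vec4 s) {
    vec4 r = {{ s.f[2], s.f[3], d.f[2], d.f[3] }};
    return r;
}

/* Transpose the four lines in[0..3] into out[0..3],
   following the same order of steps as the listing above. */
static void transpose4x4(const vec4 in[4], vec4 out[4]) {
    vec4 x0 = in[0], x1 = in[1], x2 = in[2], x3 = in[3];
    vec4 x4 = x0;               /* movaps xmm4,xmm0 */
    vec4 x5 = x2;               /* movaps xmm5,xmm2 */

    x0 = unpcklps(x0, x1);      /* [0 4 1 5] */
    x2 = unpcklps(x2, x3);      /* [8 C 9 D] */
    x4 = unpckhps(x4, x1);      /* [2 6 3 7] */
    x5 = unpckhps(x5, x3);      /* [A E B F] */

    x1 = x0;                    /* movaps xmm1,xmm0 */
    x3 = x4;                    /* movaps xmm3,xmm4 */

    out[0] = movlhps(x0, x2);   /* [0 4 8 C] first line  */
    out[1] = movhlps(x2, x1);   /* [1 5 9 D] second line */
    out[2] = movlhps(x4, x5);   /* [2 6 A E] third line  */
    out[3] = movhlps(x5, x3);   /* [3 7 B F] fourth line */
}
```

Calling `transpose4x4` with the lines [0 1 2 3], [4 5 6 7], [8 9 A B], [C D E F] should leave the four output lines shown in the comments, which is a quick way to convince yourself that the two copies (x4, x5) and the two later copies (x1, x3) are all that is needed.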

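For input, we may use the following (the output is handled in a similar way).
If there are no additional columns, we may use movaps for input;
if there are no additional lines, we may use movaps for output
(movaps requires 16-byte aligned addresses, movups does not).
There are additional columns or lines when the dimension is a multiple of 4 plus 1, 2 or 3.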

Quote
mov edx, LinLengthX ; edx = step from one line to the next (used below as a byte offset)

For 1 input:
movups xmm0, [esi]

For 2 inputs:
movups xmm0, [esi]
movups xmm1, [esi+edx]

For 3 inputs:
movups xmm0, [esi]
movups xmm1, [esi+edx]
movups xmm2, [esi+2*edx]

For 4 inputs:
movups xmm0, [esi]
add esi, edx ; [esi+3*edx] cannot be encoded (scale is 1, 2, 4 or 8), so move esi to the next line
movups xmm1, [esi]
movups xmm2, [esi+edx]
movups xmm3, [esi+2*edx]
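
The loading cases above can be written as one loop in C. This is a minimal sketch, not the posted code: `vec4` and `load_rows` are hypothetical names, the stride is assumed to be the line length in bytes (the role `edx` plays above), and zeroing the unused registers is an extra assumption made here so the result is well defined.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one XMM register holding four floats. */
typedef struct { float f[4]; } vec4;

/* Load `rows` lines (1 to 4) of four floats each into x[0..3].
   src          : address of the first float of the first line (esi above)
   stride_bytes : distance in bytes from one line to the next (edx above)
   Registers that receive no line are set to zero (assumption of this sketch). */
static void load_rows(const float *src, size_t stride_bytes, int rows, vec4 x[4]) {
    const unsigned char *p = (const unsigned char *)src;

    memset(x, 0, 4 * sizeof(vec4));
    for (int i = 0; i < rows && i < 4; i++) {
        /* memcpy has no alignment requirement, like movups */
        memcpy(x[i].f, p + (size_t)i * stride_bytes, sizeof x[i].f);
    }
}
```

For a matrix with 5 columns of floats, for example, the stride would be `5 * sizeof(float)`, and a call with `rows = 3` corresponds to the "For 3 inputs" case.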

Siekmanski · July 20, 2018, 09:32:00 PM

:t

zedd151 · July 21, 2018, 01:43:01 AM

Hi Rui!

Just the kind of building block a beginner needs. Never mind the naysayers.

Also, I'm glad it's 32 bit code, as I have given up on 64 bit coding for the time being.

guga · July 21, 2018, 03:08:29 AM

Excellent work, Rui, as usual :t :t :t :t

HSE · July 21, 2018, 03:32:12 AM

Fantastic Rui :t

RuiLoureiro · July 24, 2018, 08:33:12 PM

Hi all
Here is SSE49_FASCODE_v8, which contains MatrixTransposeSSE49 inside the file sse49.
MatrixTransposeSSE49 is a procedure that uses 4x8, 4x16, 4x32, 4x64, 8x16, 8x32, 16x16,
16x32, 32x32, 16x64, 32x64 and 64x64 transpose code to build one transpose procedure for
each case. So we can build the best transpose procedure for each matrix size by inserting
or removing one of those cases. For example, to transpose a 512x512 matrix we may
use 4x32 in each loop instead of 4x4. Then each block of 4 lines needs only 16
loops (512/32) instead of the 128 loops (512/4) needed with 4x4.
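
The loop counts in this example can be reproduced with a scalar C sketch. It is not MatrixTransposeSSE49 and uses no SSE: `loops_per_band` and `transpose_blocked` are hypothetical names, and the sketch only shows how a band of 4 lines is walked in blocks of a chosen width, with the edge blocks clipped when the size is not a multiple of the block.

```c
#include <stddef.h>

/* Number of passes needed to cover one band of 4 lines when each
   pass handles a block that is block_w columns wide (rounded up). */
static size_t loops_per_band(size_t cols, size_t block_w) {
    return (cols + block_w - 1) / block_w;
}

/* Transpose the rows x cols matrix src into the cols x rows matrix dst,
   one 4 x block_w block per pass. Returns the total number of passes. */
static size_t transpose_blocked(const float *src, float *dst,
                                size_t rows, size_t cols, size_t block_w) {
    size_t passes = 0;

    for (size_t r0 = 0; r0 < rows; r0 += 4) {             /* band of 4 lines */
        size_t r1 = (r0 + 4 < rows) ? r0 + 4 : rows;      /* clip last band  */
        for (size_t c0 = 0; c0 < cols; c0 += block_w) {   /* one block       */
            size_t c1 = (c0 + block_w < cols) ? c0 + block_w : cols;
            for (size_t r = r0; r < r1; r++) {
                for (size_t c = c0; c < c1; c++) {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
            passes++;
        }
    }
    return passes;
}
```

With `cols = 512`, `loops_per_band` gives 16 for a block width of 32 and 128 for a block width of 4, matching the numbers in the post. A wider block mainly reduces loop overhead per band; whether it is faster in practice depends on the matrix size and has to be measured.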

Good luck
Rui Loureiro

The MASM Forum
