News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Fast Transposing Matrix Algorithm for FPU Data

Started by guga, April 01, 2017, 06:01:08 PM

Previous topic - Next topic

guga

Hi Guys

Someone knows how to make a transposing matrix of any size for Float values ??

This is for usage in image processing exchanging the width by the height.

Example:

I have  a sequence of values like this. On these cases, they are not pixels in RGB, they are simple float values representing Luma, for example:
[ValueA1:
F$ 1,2,3
F$ 4,5, 6]


Which means a row (width) of 3 Values and col (height) of 2 Values.

I need them to be transposed as:
[ValueA1:
F$ 1, 4
F$ 2, 5
F$ 3, 6]


It need to be fast and work on any size (squared or not) and work either if height is bigger then width and vice-versa. As:
[ValueA1:
F$ 1, 2
F$ 3, 4
F$ 5, 6]

will turn onto:
[ValueA1:
F$ 1,3, 5
F$ 2, 4, 6]


Note: "F$" is Float notation for RosAsm. But...if someone also have this for Real (Double) it would be appreciated as well.

On the above examples they are only to display the functionality, but in fact it will be used also on large matrixes with size: 507 x 208 , 64x64, 480x640, 1024x768 etc etc


I tried to port the code from here: http://stackoverflow.com/questions/16737298/what-is-the-fastest-way-to-transpose-a-matrix-in-c
But...it seems to not working.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Quote from: guga on April 01, 2017, 06:01:08 PMIt need to be fast

This is the biggest problem :biggrin:

mov esi, offset source
mov edi, offset dest
xor edx, edx
mov ecx, rows
.repeat
  mov eax, [esi+edx]
  stosd
  add edx, columns*4
  dec ecx
.until zero?

HSE

#2
I use this little macro for everything:matrix macro x, y , xlarge
    mov eax , x
    mov ecx , xlarge
    mul ecx
    add eax, y
endm


The idea is:ForLp a, 0, 2
  ForLp b, 0, 1
      matrix a, b, 3
      fld matriz1[eax*8]
      matrix b, a, 2
      fstp matriz2[eax*8]
   Next b
Next b


LATER: I forget to say matrix are 0 based
Equations in Assembly: SmplMath

TWell

I think FPU is irrelevant.
It is about moving array of sizeof float to new table to a new order.
Like this improved C versionvoid transpose(float *src, float *dst, const int N, const int M)
{
//    #pragma omp parallel for
register long *p1 = (long *)src;
register long *p2 = (long *)dst;

    for(int n = 0; n<N*M; n++) {
        int i = n/N;
        int j = n%N;
//        dst[n] = src[M*j + i];
        p2[n] = p1[M*j + i];
    }
}

mabdelouahab

REAL4:
LEA ESI,source
LEA EDI,dest
MOV EDX,line
doCol:
   MOV      ECX,ESI
   MOV      EAX,col
doLine:
   FLD    REAL4 PTR [ECX]
   FSTP    REAL4 PTR [EDI]
   ADD    EDI,4
   ADD      ECX,4*col
   DEC      EAX
   JNZ      doLine
   ADD      ESI,4
   DEC      EDX
   JNZ      doCol


REAL8:
LEA ESI,source
LEA EDI,dest
MOV EDX,line
doCol:
   MOV      ECX,ESI
   MOV      EAX,col
doLine:
   FLD    REAL8 PTR [ECX]
   FSTP    REAL8 PTR [EDI]
   ADD    EDI,8
   ADD      ECX,8*col
   DEC      EAX
   JNZ      doLine
   ADD      ESI,8
   DEC      EDX
   JNZ      doCol


REAL10:
LEA ESI,source
LEA EDI,dest
MOV EDX,line
doCol:
   MOV      ECX,ESI
   MOV      EAX,col
doLine:
   FLD    REAL10 PTR [ECX]
   FSTP    REAL10 PTR [EDI]
   ADD    EDI,10
   ADD      ECX,10*col
   DEC      EAX
   JNZ      doLine
   ADD      ESI,10
   DEC      EDX
   JNZ      doCol


TWell

interesting code but col isn't constant
   ADD      ECX,4*coli leave this issue to professionals ;)

guga

Thanks guys...I´ll try the examples. to see if a faster version can be made. I have one that transpose a 4x4 matrix using SSE2, but didn´t tested on this version yet...

For these initial testings, i built a slow version biased on JJ´s code :).

Proc transpose_block:
    Arguments @Input, @Output, @Width, @Height
    Local @NextRowPos, @MaxWidth
    Uses edi, esi, ecx, ebx

    mov edi D@Output
    mov esi D@Input
    mov ecx D@Width | mov D@MaxWidth ecx
    mov ebx D@Width | shl ebx 2; ebx = end of width that is the nextYPos/Line

L2:
    mov ecx D@Height
    xor edx edx
L1:
        ; edx starts at 0. Cur row pos
        mov eax D$esi+edx
        stosd
        add edx ebx ; add to the next line
        dec ecx
    jnz L1<

    add esi 4 ; go to the next col line
    dec D@MaxWidth
    jnz L2<

EndP


This seems to work regardless the size of the matrix. So, it works either if Width > Height, or Height>Width or Width=height.

Tested on:
[ValueA:   D$ 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
            D$ 17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
            D$ 33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48
            D$ 49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64
            D$ 65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80
            D$ 81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96
            D$ 97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112
            D$ 113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128]

[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA

call transpose_block ValueA1, ValueB, 16, 8



[ValueA:   D$ 1, 2
            D$ 3, 4]

[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA

call transpose_block ValueA1, ValueB, 2, 2




[ValueA:   D$ 1, 2
            D$ 3, 4
            D$ 5, 6]

[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA

call transpose_block ValueA1, ValueB, 2, 3



; Matrix 20 X 30
[ValueA:    F$ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
            F$ 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
            F$ 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
            F$ 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80
            F$ 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100
            F$ 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120
            F$ 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140
            F$ 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160
            F$ 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180
            F$ 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200
            F$ 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220
            F$ 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240
            F$ 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260
            F$ 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280
            F$ 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300
            F$ 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320
            F$ 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340
            F$ 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360
            F$ 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380
            F$ 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400
            F$ 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420
            F$ 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440
            F$ 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460
            F$ 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480
            F$ 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500
            F$ 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520
            F$ 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540
            F$ 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560
            F$ 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580
            F$ 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600]

[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA

call transpose_block ValueA1, ValueB, 20, 30
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Quote from: guga on April 02, 2017, 04:23:39 AMI have one that transpose a 4x4 matrix using SSE2

Loading is fine if your columns can be divided in 4*4 pieces, but the storing is not very effective: MOVD and pshufd. Doesn't sound very fast...

If you need REAL8: movlps is relatively fast. 

K_F

A simple Qword/Dword X/Y pointer swop, with a tmp variable would do, without any FPU usage.
:biggrin:
'Sire, Sire!... the peasants are Revolting !!!'
'Yes, they are.. aren't they....'

jj2007

A works with general regs, B with xmm regs:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2306    cycles for 100 * TransposeA
775     cycles for 100 * TransposeB

1917    cycles for 100 * TransposeA
778     cycles for 100 * TransposeB

2294    cycles for 100 * TransposeA
778     cycles for 100 * TransposeB

2289    cycles for 100 * TransposeA
776     cycles for 100 * TransposeB

1919    cycles for 100 * TransposeA
774     cycles for 100 * TransposeB

61      bytes for TransposeA
114     bytes for TransposeB

Siekmanski

4*4 tranpose from my d3dmath.lib


align 16
D3DMatrixTranspose proc pOut:DWORD,pM:DWORD
; [0 1 2 3]    [0 4 8 C]
; [4 5 6 7]    [1 5 9 D]
; [8 9 A B]    [2 6 A E]
; [C D E F]    [3 7 B F]

mov     eax,pM

movaps  xmm0,[eax+0]            ; [0 1 2 3]
movaps  xmm1,[eax+16]           ; [4 5 6 7]
movaps  xmm4,[eax+32]           ; [8 9 A B]
movaps  xmm3,[eax+48]           ; [C D E F]

mov     eax,pOut

movaps  xmm2,xmm0               ; [0 1 2 3]
movaps  xmm5,xmm1               ; [4 5 6 7]
shufps  xmm0,xmm3,01000100b     ; [0 1 C D]
shufps  xmm1,xmm4,01000100b     ; [4 5 8 9]
shufps  xmm2,xmm4,11101110b     ; [2 3 A B]
shufps  xmm5,xmm3,11101110b     ; [6 7 E F]
movaps  xmm4,xmm0               ; [0 1 C D]
movaps  xmm3,xmm2               ; [2 3 A B]
shufps  xmm0,xmm1,10001000b     ; [0 C 4 8]
shufps  xmm1,xmm4,11011101b     ; [5 9 1 D]
shufps  xmm2,xmm5,10001000b     ; [2 A 6 E]
shufps  xmm3,xmm5,11011101b     ; [3 B 7 F]
shufps  xmm0,xmm0,01111000b     ; [0 4 8 C]
shufps  xmm1,xmm1,11010010b     ; [1 5 9 D]
shufps  xmm2,xmm2,11011000b     ; [2 6 A E]
shufps  xmm3,xmm3,11011000b     ; [3 7 B F]

movaps  [eax+0],xmm0            ; [0 4 8 C]
movaps  [eax+16],xmm1           ; [1 5 9 D]
movaps  [eax+32],xmm2           ; [2 6 A E]
movaps  [eax+48],xmm3           ; [3 7 B F]
ret
D3DMatrixTranspose endp
Creative coders use backward thinking techniques as a strategy.

guga

Thank you guys

Ok..we are close :)

I gave a test on JJ and Siekmanski code, and Siekmanski seems a bit faster.

the results i've got for 20x32 matrix are:

Siekmanski : 231.37439769473289 ns
JJ´s: 281.62775417344523 ns

But...the problem remains the same concerning the size of the matrix. Both works as expected if Width and Height are bigger then 4...But...how to make it work for any matrix size (or sizes not multiple of 4) ? I mean, width = 11, height = 23...or width = 3 height = 2 etc ?


The tested codes ported to RosAsm are:

JJ´s



Proc Fast_transpose_blockJJ:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output
    mov ebx D@Width |  mov D@CurXPos ebx | shl ebx 2

L2:
    xor ecx ecx
    xor edx edx

    Do
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM0 D$esi+edx | add edx ebx
        movd XMM1 D$esi+edx | add edx ebx
        movd XMM2 D$esi+edx | add edx ebx
        movd XMM3 D$esi+edx | add edx ebx

        pslldq XMM1 4
        pslldq XMM2 8
        pslldq XMM3 12

        xorps XMM0 XMM1
        xorps XMM0 XMM2
        xorps XMM0 XMM3

        movaps X$edi XMM0
        add edi 16
        add ecx 4
    Loop_Until ecx >= D@Height
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@Output

EndP

calling it as:
call Fast_transpose_blockJJ ValueA, Output, 20, 32




Siekmanski´s



Proc D3DMatrixTranspose:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @NextHeightPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output
    mov ebx D@Width |  mov ecx ebx | shl ebx 2
    mov eax D@Height | shl eax 2 | mov D@NextHeightPos eax
    shr ecx 2 | mov D@CurXPos ecx ; divide by 4

L2:
    xor ecx ecx
    xor edx edx

    Do
       
        movaps xmm0 X$esi+edx            ; [0 1 2 3]
        add edx ebx
        movaps xmm1 X$esi+edx           ; [4 5 6 7]
        add edx ebx
        movaps xmm4 X$esi+edx           ; [8 9 A B]
        add edx ebx
        movaps xmm3 X$esi+edx           ; [C D E F]

        movaps xmm2 xmm0               ; [0 1 2 3]
        movaps xmm5 xmm1               ; [4 5 6 7]
        shufps xmm0 xmm3 00_01000100     ; [0 1 C D]
        shufps xmm1 xmm4 00_01000100     ; [4 5 8 9]
        shufps xmm2 xmm4 00_11101110     ; [2 3 A B]
        shufps xmm5 xmm3 00_11101110     ; [6 7 E F]
        movaps xmm4 xmm0               ; [0 1 C D]
        movaps xmm3 xmm2               ; [2 3 A B]
        shufps xmm0 xmm1 00_10001000     ; [0 C 4 8]
        shufps xmm1 xmm4 00_11011101     ; [5 9 1 D]
        shufps xmm2 xmm5 00_10001000     ; [2 A 6 E]
        shufps xmm3 xmm5 00_11011101     ; [3 B 7 F]
        shufps xmm0 xmm0 00_01111000     ; [0 4 8 C]
        shufps xmm1 xmm1 00_11010010     ; [1 5 9 D]
        shufps xmm2 xmm2 00_11011000     ; [2 6 A E]
        shufps xmm3 xmm3 00_11011000     ; [3 7 B F]

        movaps X$edi xmm0            ; [0 4 8 C]
        mov eax D@NextHeightPos
        movaps X$edi+eax xmm1           ; [1 5 9 D]
        add eax D@NextHeightPos
        movaps X$edi+eax xmm2           ; [2 6 A E]
        add eax D@NextHeightPos
        movaps X$edi+eax xmm3           ; [3 7 B F]

        add edx ebx
        add edi 16
        add ecx 4
    Loop_Until ecx >= D@Height
    add esi 16;4
    add edi eax
    dec D@CurXPos | jnz L2<<

    mov eax D@Output

EndP

calling it as:
call D3DMatrixTranspose ValueA, Output, 20, 32

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

JJ´s code is very close, but...it fails if we have  matrix like:


[ValueD2:   F$ 1, 2, 3, 4, 5, 6,
                F$ 7, 8, 9, 10, 11, 12,
                F$ 13, 14, 15, 16, 17, 18,
                F$ 19, 20, 21, 22, 23, 24
                F$ 25, 26, 27, 28, 29, 30]

call Fast_transpose_blockJJ ValueD2, Output, 6,5


Siekmanski´s will crash, because the width and height are smaller then 16. So i couldn´t test his version yet on this example.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

Wrote another faster 4 by 4 transpose routine...

Creative coders use backward thinking techniques as a strategy.

jj2007

Quote from: guga on April 02, 2017, 01:56:12 PMBut...how to make it work for any matrix size (or sizes not multiple of 4) ? I mean, width = 11, height = 23...or width = 3 height = 2 etc ?

Simple solution: Check the size and take my fast version if possible, or my version A if too small. After all, why should small matrices need a fast solution?