Fast Transposing Matrix Algorithm for FPU Data

Welcome to The MASM Forum.
Log in
Sign up

June 08, 2025, 05:44:07 PM

News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Home

The MASM Forum
► General
► The Campus
► Fast Transposing Matrix Algorithm for FPU Data

Fast Transposing Matrix Algorithm for FPU Data

Started by guga, April 01, 2017, 06:01:08 PM

Previous topic - Next topic

Print

Go Down Pages1 2 3

guga

Member
Posts: 1,760
Assembly is a state of art.
Location: Brazil
Logged

Fast Transposing Matrix Algorithm for FPU Data

April 01, 2017, 06:01:08 PM

Hi Guys

Someone knows how to make a transposing matrix of any size for Float values ??

This is for usage in image processing exchanging the width by the height.

Example:

I have a sequence of values like this. On these cases, they are not pixels in RGB, they are simple float values representing Luma, for example:

Code Select

[ValueA1:
F$ 1,2,3
F$ 4,5, 6]

Which means a row (width) of 3 Values and col (height) of 2 Values.

I need them to be transposed as:

Code Select

[ValueA1:
F$ 1, 4
F$ 2, 5
F$ 3, 6]

It need to be fast and work on any size (squared or not) and work either if height is bigger then width and vice-versa. As:

Code Select

[ValueA1:
F$ 1, 2
F$ 3, 4
F$ 5, 6]

will turn onto:
[ValueA1:
F$ 1,3, 5
F$ 2, 4, 6]

Note: "F$" is Float notation for RosAsm. But...if someone also have this for Real (Double) it would be appreciated as well.

On the above examples they are only to display the functionality, but in fact it will be used also on large matrixes with size: 507 x 208 , 64x64, 480x640, 1024x768 etc etc

I tried to port the code from here: http://stackoverflow.com/questions/16737298/what-is-the-fastest-way-to-transpose-a-matrix-in-c
But...it seems to not working.

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Member
Posts: 14,883
Assembly is fun ;-)
Location: Italy
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#1

April 01, 2017, 06:40:23 PM

Quote from: guga on April 01, 2017, 06:01:08 PMIt need to be fast

This is the biggest problem

Code Select

mov esi, offset source
mov edi, offset dest
xor edx, edx
mov ecx, rows
.repeat
  mov eax, [esi+edx]
  stosd
  add edx, columns*4
  dec ecx
.until zero?

Masm32 Tips, Tricks and Traps * enable the deb macro

HSE

Member
Posts: 2,791
AMD 7-32 / i3 10-64
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#2

April 02, 2017, 02:04:13 AM Last Edit: April 02, 2017, 02:08:09 AM by HSE

I use this little macro for everything:

Code Select

matrix macro x, y , xlarge
    mov	eax , x
    mov	ecx , xlarge
    mul	ecx
    add	eax, y
endm

The idea is:

Code Select

ForLp a, 0, 2
  ForLp b, 0, 1
      matrix a, b, 3
      fld matriz1[eax*8]
      matrix b, a, 2
      fstp matriz2[eax*8]
   Next b
Next b

LATER: I forget to say matrix are 0 based

Equations in Assembly: SmplMath

TWell

Member
Posts: 686
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#3

April 02, 2017, 02:40:27 AM

I think FPU is irrelevant.
It is about moving array of sizeof float to new table to a new order.
Like this improved C version

Code Select

void transpose(float *src, float *dst, const int N, const int M)
{
//    #pragma omp parallel for
	register long *p1 = (long *)src;
	register long *p2 = (long *)dst;

    for(int n = 0; n<N*M; n++) {
        int i = n/N;
        int j = n%N;
//        dst[n] = src[M*j + i];
        p2[n] = p1[M*j + i];
    }
}

mabdelouahab

Member
Posts: 536
Location: Algeria
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#4

April 02, 2017, 02:46:23 AM

REAL4:
LEA ESI,source
LEA EDI,dest
MOV EDX,line
doCol:
   MOV      ECX,ESI
   MOV      EAX,col
doLine:
   FLD    REAL4 PTR [ECX]
   FSTP    REAL4 PTR [EDI]
   ADD    EDI,4
   ADD      ECX,4*col
   DEC      EAX
   JNZ      doLine
   ADD      ESI,4
   DEC      EDX
   JNZ      doCol

REAL8:
LEA ESI,source
LEA EDI,dest
MOV EDX,line
doCol:
   MOV      ECX,ESI
   MOV      EAX,col
doLine:
   FLD    REAL8 PTR [ECX]
   FSTP    REAL8 PTR [EDI]
   ADD    EDI,8
   ADD      ECX,8*col
   DEC      EAX
   JNZ      doLine
   ADD      ESI,8
   DEC      EDX
   JNZ      doCol

REAL10:
LEA ESI,source
LEA EDI,dest
MOV EDX,line
doCol:
   MOV      ECX,ESI
   MOV      EAX,col
doLine:
   FLD    REAL10 PTR [ECX]
   FSTP    REAL10 PTR [EDI]
   ADD    EDI,10
   ADD      ECX,10*col
   DEC      EAX
   JNZ      doLine
   ADD      ESI,10
   DEC      EDX
   JNZ      doCol

TWell

Member
Posts: 686
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#5

April 02, 2017, 02:01:00 AM

interesting code but col isn't constant

Code Select

ADD ECX,4*coli leave this issue to professionals ;)

guga

Member
Posts: 1,760
Assembly is a state of art.
Location: Brazil
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#6

April 02, 2017, 04:23:39 AM

Thanks guys...I´ll try the examples. to see if a faster version can be made. I have one that transpose a 4x4 matrix using SSE2, but didn´t tested on this version yet...

For these initial testings, i built a slow version biased on JJ´s code :).

Code Select

Proc transpose_block:
    Arguments @Input, @Output, @Width, @Height
    Local @NextRowPos, @MaxWidth
    Uses edi, esi, ecx, ebx

    mov edi D@Output
    mov esi D@Input
    mov ecx D@Width | mov D@MaxWidth ecx
    mov ebx D@Width | shl ebx 2; ebx = end of width that is the nextYPos/Line

L2:
    mov ecx D@Height
    xor edx edx
L1:
        ; edx starts at 0. Cur row pos
        mov eax D$esi+edx
        stosd
        add edx ebx ; add to the next line
        dec ecx
    jnz L1<

    add esi 4 ; go to the next col line
    dec D@MaxWidth
    jnz L2<

EndP

This seems to work regardless the size of the matrix. So, it works either if Width > Height, or Height>Width or Width=height.

Tested on:

Code Select


[ValueA:   D$ 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
            D$ 17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
            D$ 33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48
            D$ 49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64
            D$ 65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80
            D$ 81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96
            D$ 97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112
            D$ 113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128]

[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA

call transpose_block ValueA1, ValueB, 16, 8

Code Select


[ValueA:   D$ 1, 2
            D$ 3, 4]

[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA

call transpose_block ValueA1, ValueB, 2, 2

Code Select


[ValueA:   D$ 1, 2
            D$ 3, 4
            D$ 5, 6]

[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA

call transpose_block ValueA1, ValueB, 2, 3

Code Select


; Matrix 20 X 30
[ValueA:    F$ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
            F$ 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
            F$ 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
            F$ 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80
            F$ 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100
            F$ 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120
            F$ 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140
            F$ 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160
            F$ 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180
            F$ 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200
            F$ 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220
            F$ 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240
            F$ 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260
            F$ 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280
            F$ 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300
            F$ 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320
            F$ 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340
            F$ 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360
            F$ 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380
            F$ 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400
            F$ 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420
            F$ 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440
            F$ 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460
            F$ 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480
            F$ 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500
            F$ 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520
            F$ 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540
            F$ 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560
            F$ 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580
            F$ 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600]

[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA

call transpose_block ValueA1, ValueB, 20, 30

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Member
Posts: 14,883
Assembly is fun ;-)
Location: Italy
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#7

April 02, 2017, 05:05:50 AM

Quote from: guga on April 02, 2017, 04:23:39 AMI have one that transpose a 4x4 matrix using SSE2

Loading is fine if your columns can be divided in 4*4 pieces, but the storing is not very effective: MOVD and pshufd. Doesn't sound very fast...

If you need REAL8: movlps is relatively fast.

Masm32 Tips, Tricks and Traps * enable the deb macro

K_F

Member
Posts: 1,701
Anybody out there?
Location: Over Here ...
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#8

April 02, 2017, 05:49:46 AM

A simple Qword/Dword X/Y pointer swop, with a tmp variable would do, without any FPU usage.

'Sire, Sire!... the peasants are Revolting !!!'
'Yes, they are.. aren't they....'

jj2007

Member
Posts: 14,883
Assembly is fun ;-)
Location: Italy
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#9

April 02, 2017, 09:39:49 AM

A works with general regs, B with xmm regs:

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2306    cycles for 100 * TransposeA
775     cycles for 100 * TransposeB

1917    cycles for 100 * TransposeA
778     cycles for 100 * TransposeB

2294    cycles for 100 * TransposeA
778     cycles for 100 * TransposeB

2289    cycles for 100 * TransposeA
776     cycles for 100 * TransposeB

1919    cycles for 100 * TransposeA
774     cycles for 100 * TransposeB

61      bytes for TransposeA
114     bytes for TransposeB

Masm32 Tips, Tricks and Traps * enable the deb macro

Siekmanski

Member
Posts: 2,411
Location: The Netherlands
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#10

April 02, 2017, 12:39:44 PM

4*4 tranpose from my d3dmath.lib

Code Select


align 16
D3DMatrixTranspose proc pOut:DWORD,pM:DWORD
; [0 1 2 3]    [0 4 8 C]
; [4 5 6 7]    [1 5 9 D]
; [8 9 A B]    [2 6 A E]
; [C D E F]    [3 7 B F]

	mov     eax,pM

	movaps  xmm0,[eax+0]            ; [0 1 2 3]
	movaps  xmm1,[eax+16]           ; [4 5 6 7]
	movaps  xmm4,[eax+32]           ; [8 9 A B]
	movaps  xmm3,[eax+48]           ; [C D E F]

	mov     eax,pOut
	
	movaps  xmm2,xmm0               ; [0 1 2 3]
	movaps  xmm5,xmm1               ; [4 5 6 7]
	shufps  xmm0,xmm3,01000100b     ; [0 1 C D]
	shufps  xmm1,xmm4,01000100b     ; [4 5 8 9]
	shufps  xmm2,xmm4,11101110b     ; [2 3 A B]
	shufps  xmm5,xmm3,11101110b     ; [6 7 E F]
	movaps  xmm4,xmm0               ; [0 1 C D]
	movaps  xmm3,xmm2               ; [2 3 A B]
	shufps  xmm0,xmm1,10001000b     ; [0 C 4 8]
	shufps  xmm1,xmm4,11011101b     ; [5 9 1 D]
	shufps  xmm2,xmm5,10001000b     ; [2 A 6 E]
	shufps  xmm3,xmm5,11011101b     ; [3 B 7 F]
	shufps  xmm0,xmm0,01111000b     ; [0 4 8 C]
	shufps  xmm1,xmm1,11010010b     ; [1 5 9 D]
	shufps  xmm2,xmm2,11011000b     ; [2 6 A E]
	shufps  xmm3,xmm3,11011000b     ; [3 7 B F]

	movaps  [eax+0],xmm0            ; [0 4 8 C]
	movaps  [eax+16],xmm1           ; [1 5 9 D]
	movaps  [eax+32],xmm2           ; [2 6 A E]
	movaps  [eax+48],xmm3           ; [3 7 B F]
	ret
D3DMatrixTranspose endp

Creative coders use backward thinking techniques as a strategy.

guga

Member
Posts: 1,760
Assembly is a state of art.
Location: Brazil
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#11

April 02, 2017, 01:56:12 PM

Thank you guys

Ok..we are close :)

I gave a test on JJ and Siekmanski code, and Siekmanski seems a bit faster.

the results i've got for 20x32 matrix are:

Siekmanski : 231.37439769473289 ns
JJ´s: 281.62775417344523 ns

But...the problem remains the same concerning the size of the matrix. Both works as expected if Width and Height are bigger then 4...But...how to make it work for any matrix size (or sizes not multiple of 4) ? I mean, width = 11, height = 23...or width = 3 height = 2 etc ?

The tested codes ported to RosAsm are:

JJ´s

Code Select



Proc Fast_transpose_blockJJ:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output
    mov ebx D@Width |  mov D@CurXPos ebx | shl ebx 2

L2:
    xor ecx ecx
    xor edx edx

    Do
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM0 D$esi+edx | add edx ebx
        movd XMM1 D$esi+edx | add edx ebx
        movd XMM2 D$esi+edx | add edx ebx
        movd XMM3 D$esi+edx | add edx ebx

        pslldq XMM1 4
        pslldq XMM2 8
        pslldq XMM3 12

        xorps XMM0 XMM1
        xorps XMM0 XMM2
        xorps XMM0 XMM3

        movaps X$edi XMM0
        add edi 16
        add ecx 4
    Loop_Until ecx >= D@Height
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@Output

EndP

calling it as:
call Fast_transpose_blockJJ ValueA, Output, 20, 32

Siekmanski´s

Code Select



Proc D3DMatrixTranspose:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @NextHeightPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output
    mov ebx D@Width |  mov ecx ebx | shl ebx 2
    mov eax D@Height | shl eax 2 | mov D@NextHeightPos eax
    shr ecx 2 | mov D@CurXPos ecx ; divide by 4

L2:
    xor ecx ecx
    xor edx edx

    Do
        
        movaps xmm0 X$esi+edx            ; [0 1 2 3]
        add edx ebx
        movaps xmm1 X$esi+edx           ; [4 5 6 7]
        add edx ebx
        movaps xmm4 X$esi+edx           ; [8 9 A B]
        add edx ebx
        movaps xmm3 X$esi+edx           ; [C D E F]

        movaps xmm2 xmm0               ; [0 1 2 3]
        movaps xmm5 xmm1               ; [4 5 6 7]
        shufps xmm0 xmm3 00_01000100     ; [0 1 C D]
        shufps xmm1 xmm4 00_01000100     ; [4 5 8 9]
        shufps xmm2 xmm4 00_11101110     ; [2 3 A B]
        shufps xmm5 xmm3 00_11101110     ; [6 7 E F]
        movaps xmm4 xmm0               ; [0 1 C D]
        movaps xmm3 xmm2               ; [2 3 A B]
        shufps xmm0 xmm1 00_10001000     ; [0 C 4 8]
        shufps xmm1 xmm4 00_11011101     ; [5 9 1 D]
        shufps xmm2 xmm5 00_10001000     ; [2 A 6 E]
        shufps xmm3 xmm5 00_11011101     ; [3 B 7 F]
        shufps xmm0 xmm0 00_01111000     ; [0 4 8 C]
        shufps xmm1 xmm1 00_11010010     ; [1 5 9 D]
        shufps xmm2 xmm2 00_11011000     ; [2 6 A E]
        shufps xmm3 xmm3 00_11011000     ; [3 7 B F]

        movaps X$edi xmm0            ; [0 4 8 C]
        mov eax D@NextHeightPos
        movaps X$edi+eax xmm1           ; [1 5 9 D]
        add eax D@NextHeightPos
        movaps X$edi+eax xmm2           ; [2 6 A E]
        add eax D@NextHeightPos
        movaps X$edi+eax xmm3           ; [3 7 B F]

        add edx ebx
        add edi 16
        add ecx 4
    Loop_Until ecx >= D@Height
    add esi 16;4
    add edi eax
    dec D@CurXPos | jnz L2<<

    mov eax D@Output

EndP

calling it as:
call D3DMatrixTranspose ValueA, Output, 20, 32

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Member
Posts: 1,760
Assembly is a state of art.
Location: Brazil
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#12

April 02, 2017, 02:37:09 PM

JJ´s code is very close, but...it fails if we have matrix like:

Code Select


[ValueD2:   F$ 1, 2, 3, 4, 5, 6,
                F$ 7, 8, 9, 10, 11, 12,
                F$ 13, 14, 15, 16, 17, 18,
                F$ 19, 20, 21, 22, 23, 24
                F$ 25, 26, 27, 28, 29, 30]

call Fast_transpose_blockJJ ValueD2, Output, 6,5

Siekmanski´s will crash, because the width and height are smaller then 16. So i couldn´t test his version yet on this example.

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

Member
Posts: 2,411
Location: The Netherlands
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#13

April 02, 2017, 04:09:56 PM

Wrote another faster 4 by 4 transpose routine...

Creative coders use backward thinking techniques as a strategy.

jj2007

Member
Posts: 14,883
Assembly is fun ;-)
Location: Italy
Logged

Re: Fast Transposing Matrix Algorithm for FPU Data

#14

April 02, 2017, 05:09:03 PM

Quote from: guga on April 02, 2017, 01:56:12 PMBut...how to make it work for any matrix size (or sizes not multiple of 4) ? I mean, width = 11, height = 23...or width = 3 height = 2 etc ?

Simple solution: Check the size and take my fast version if possible, or my version A if too small. After all, why should small matrices need a fast solution?

Masm32 Tips, Tricks and Traps * enable the deb macro

Print

Go Up Pages1 2 3

The MASM Forum
► General
► The Campus
► Fast Transposing Matrix Algorithm for FPU Data

User actions

Print