The MASM Forum

General => The Campus => Topic started by: guga on April 01, 2017, 06:01:08 PM

Title: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 01, 2017, 06:01:08 PM
Hi Guys

Someone knows how to make a transposing matrix of any size for Float values ??

This is for usage in image processing exchanging the width by the height.

Example:

I have  a sequence of values like this. On these cases, they are not pixels in RGB, they are simple float values representing Luma, for example:
[ValueA1:
F$ 1,2,3
F$ 4,5, 6]


Which means a row (width) of 3 Values and col (height) of 2 Values.

I need them to be transposed as:
[ValueA1:
F$ 1, 4
F$ 2, 5
F$ 3, 6]


It need to be fast and work on any size (squared or not) and work either if height is bigger then width and vice-versa. As:
[ValueA1:
F$ 1, 2
F$ 3, 4
F$ 5, 6]

will turn onto:
[ValueA1:
F$ 1,3, 5
F$ 2, 4, 6]


Note: "F$" is Float notation for RosAsm. But...if someone also have this for Real (Double) it would be appreciated as well.

On the above examples they are only to display the functionality, but in fact it will be used also on large matrixes with size: 507 x 208 , 64x64, 480x640, 1024x768 etc etc


I tried to port the code from here: http://stackoverflow.com/questions/16737298/what-is-the-fastest-way-to-transpose-a-matrix-in-c
But...it seems to not working.
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: jj2007 on April 01, 2017, 06:40:23 PM
Quote from: guga on April 01, 2017, 06:01:08 PMIt need to be fast

This is the biggest problem :biggrin:

mov esi, offset source
mov edi, offset dest
xor edx, edx
mov ecx, rows
.repeat
  mov eax, [esi+edx]
  stosd
  add edx, columns*4
  dec ecx
.until zero?
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: HSE on April 02, 2017, 02:04:13 AM
I use this little macro for everything:matrix macro x, y , xlarge
    mov eax , x
    mov ecx , xlarge
    mul ecx
    add eax, y
endm


The idea is:ForLp a, 0, 2
  ForLp b, 0, 1
      matrix a, b, 3
      fld matriz1[eax*8]
      matrix b, a, 2
      fstp matriz2[eax*8]
   Next b
Next b


LATER: I forget to say matrix are 0 based
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: TWell on April 02, 2017, 02:40:27 AM
I think FPU is irrelevant.
It is about moving array of sizeof float to new table to a new order.
Like this improved C versionvoid transpose(float *src, float *dst, const int N, const int M)
{
//    #pragma omp parallel for
register long *p1 = (long *)src;
register long *p2 = (long *)dst;

    for(int n = 0; n<N*M; n++) {
        int i = n/N;
        int j = n%N;
//        dst[n] = src[M*j + i];
        p2[n] = p1[M*j + i];
    }
}
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: mabdelouahab on April 02, 2017, 02:46:23 AM
REAL4:
LEA ESI,source
LEA EDI,dest
MOV EDX,line
doCol:
   MOV      ECX,ESI
   MOV      EAX,col
doLine:
   FLD    REAL4 PTR [ECX]
   FSTP    REAL4 PTR [EDI]
   ADD    EDI,4
   ADD      ECX,4*col
   DEC      EAX
   JNZ      doLine
   ADD      ESI,4
   DEC      EDX
   JNZ      doCol


REAL8:
LEA ESI,source
LEA EDI,dest
MOV EDX,line
doCol:
   MOV      ECX,ESI
   MOV      EAX,col
doLine:
   FLD    REAL8 PTR [ECX]
   FSTP    REAL8 PTR [EDI]
   ADD    EDI,8
   ADD      ECX,8*col
   DEC      EAX
   JNZ      doLine
   ADD      ESI,8
   DEC      EDX
   JNZ      doCol


REAL10:
LEA ESI,source
LEA EDI,dest
MOV EDX,line
doCol:
   MOV      ECX,ESI
   MOV      EAX,col
doLine:
   FLD    REAL10 PTR [ECX]
   FSTP    REAL10 PTR [EDI]
   ADD    EDI,10
   ADD      ECX,10*col
   DEC      EAX
   JNZ      doLine
   ADD      ESI,10
   DEC      EDX
   JNZ      doCol

Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: TWell on April 02, 2017, 02:01:00 AM
interesting code but col isn't constant
   ADD      ECX,4*coli leave this issue to professionals ;)
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 02, 2017, 04:23:39 AM
Thanks guys...I´ll try the examples. to see if a faster version can be made. I have one that transpose a 4x4 matrix using SSE2, but didn´t tested on this version yet...

For these initial testings, i built a slow version biased on JJ´s code :).

Proc transpose_block:
    Arguments @Input, @Output, @Width, @Height
    Local @NextRowPos, @MaxWidth
    Uses edi, esi, ecx, ebx

    mov edi D@Output
    mov esi D@Input
    mov ecx D@Width | mov D@MaxWidth ecx
    mov ebx D@Width | shl ebx 2; ebx = end of width that is the nextYPos/Line

L2:
    mov ecx D@Height
    xor edx edx
L1:
        ; edx starts at 0. Cur row pos
        mov eax D$esi+edx
        stosd
        add edx ebx ; add to the next line
        dec ecx
    jnz L1<

    add esi 4 ; go to the next col line
    dec D@MaxWidth
    jnz L2<

EndP


This seems to work regardless the size of the matrix. So, it works either if Width > Height, or Height>Width or Width=height.

Tested on:
[ValueA:   D$ 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
            D$ 17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
            D$ 33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48
            D$ 49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64
            D$ 65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80
            D$ 81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96
            D$ 97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112
            D$ 113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128]

[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA

call transpose_block ValueA1, ValueB, 16, 8



[ValueA:   D$ 1, 2
            D$ 3, 4]

[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA

call transpose_block ValueA1, ValueB, 2, 2




[ValueA:   D$ 1, 2
            D$ 3, 4
            D$ 5, 6]

[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA

call transpose_block ValueA1, ValueB, 2, 3



; Matrix 20 X 30
[ValueA:    F$ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
            F$ 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
            F$ 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
            F$ 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80
            F$ 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100
            F$ 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120
            F$ 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140
            F$ 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160
            F$ 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180
            F$ 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200
            F$ 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220
            F$ 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240
            F$ 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260
            F$ 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280
            F$ 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300
            F$ 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320
            F$ 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340
            F$ 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360
            F$ 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380
            F$ 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400
            F$ 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420
            F$ 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440
            F$ 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460
            F$ 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480
            F$ 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500
            F$ 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520
            F$ 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540
            F$ 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560
            F$ 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580
            F$ 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600]

[ValueB: D$ 0#1000] ; just a dummy buffer to hold the values. It needs to be calculated. But the size is the same as ValueA

call transpose_block ValueA1, ValueB, 20, 30
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: jj2007 on April 02, 2017, 05:05:50 AM
Quote from: guga on April 02, 2017, 04:23:39 AMI have one that transpose a 4x4 matrix using SSE2

Loading is fine if your columns can be divided in 4*4 pieces, but the storing is not very effective: MOVD and pshufd. Doesn't sound very fast...

If you need REAL8: movlps is relatively fast. 
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: K_F on April 02, 2017, 05:49:46 AM
A simple Qword/Dword X/Y pointer swop, with a tmp variable would do, without any FPU usage.
:biggrin:
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: jj2007 on April 02, 2017, 09:39:49 AM
A works with general regs, B with xmm regs:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

2306    cycles for 100 * TransposeA
775     cycles for 100 * TransposeB

1917    cycles for 100 * TransposeA
778     cycles for 100 * TransposeB

2294    cycles for 100 * TransposeA
778     cycles for 100 * TransposeB

2289    cycles for 100 * TransposeA
776     cycles for 100 * TransposeB

1919    cycles for 100 * TransposeA
774     cycles for 100 * TransposeB

61      bytes for TransposeA
114     bytes for TransposeB
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: Siekmanski on April 02, 2017, 12:39:44 PM
4*4 tranpose from my d3dmath.lib


align 16
D3DMatrixTranspose proc pOut:DWORD,pM:DWORD
; [0 1 2 3]    [0 4 8 C]
; [4 5 6 7]    [1 5 9 D]
; [8 9 A B]    [2 6 A E]
; [C D E F]    [3 7 B F]

mov     eax,pM

movaps  xmm0,[eax+0]            ; [0 1 2 3]
movaps  xmm1,[eax+16]           ; [4 5 6 7]
movaps  xmm4,[eax+32]           ; [8 9 A B]
movaps  xmm3,[eax+48]           ; [C D E F]

mov     eax,pOut

movaps  xmm2,xmm0               ; [0 1 2 3]
movaps  xmm5,xmm1               ; [4 5 6 7]
shufps  xmm0,xmm3,01000100b     ; [0 1 C D]
shufps  xmm1,xmm4,01000100b     ; [4 5 8 9]
shufps  xmm2,xmm4,11101110b     ; [2 3 A B]
shufps  xmm5,xmm3,11101110b     ; [6 7 E F]
movaps  xmm4,xmm0               ; [0 1 C D]
movaps  xmm3,xmm2               ; [2 3 A B]
shufps  xmm0,xmm1,10001000b     ; [0 C 4 8]
shufps  xmm1,xmm4,11011101b     ; [5 9 1 D]
shufps  xmm2,xmm5,10001000b     ; [2 A 6 E]
shufps  xmm3,xmm5,11011101b     ; [3 B 7 F]
shufps  xmm0,xmm0,01111000b     ; [0 4 8 C]
shufps  xmm1,xmm1,11010010b     ; [1 5 9 D]
shufps  xmm2,xmm2,11011000b     ; [2 6 A E]
shufps  xmm3,xmm3,11011000b     ; [3 7 B F]

movaps  [eax+0],xmm0            ; [0 4 8 C]
movaps  [eax+16],xmm1           ; [1 5 9 D]
movaps  [eax+32],xmm2           ; [2 6 A E]
movaps  [eax+48],xmm3           ; [3 7 B F]
ret
D3DMatrixTranspose endp
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 02, 2017, 01:56:12 PM
Thank you guys

Ok..we are close :)

I gave a test on JJ and Siekmanski code, and Siekmanski seems a bit faster.

the results i've got for 20x32 matrix are:

Siekmanski : 231.37439769473289 ns
JJ´s: 281.62775417344523 ns

But...the problem remains the same concerning the size of the matrix. Both works as expected if Width and Height are bigger then 4...But...how to make it work for any matrix size (or sizes not multiple of 4) ? I mean, width = 11, height = 23...or width = 3 height = 2 etc ?


The tested codes ported to RosAsm are:

JJ´s



Proc Fast_transpose_blockJJ:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output
    mov ebx D@Width |  mov D@CurXPos ebx | shl ebx 2

L2:
    xor ecx ecx
    xor edx edx

    Do
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM0 D$esi+edx | add edx ebx
        movd XMM1 D$esi+edx | add edx ebx
        movd XMM2 D$esi+edx | add edx ebx
        movd XMM3 D$esi+edx | add edx ebx

        pslldq XMM1 4
        pslldq XMM2 8
        pslldq XMM3 12

        xorps XMM0 XMM1
        xorps XMM0 XMM2
        xorps XMM0 XMM3

        movaps X$edi XMM0
        add edi 16
        add ecx 4
    Loop_Until ecx >= D@Height
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@Output

EndP

calling it as:
call Fast_transpose_blockJJ ValueA, Output, 20, 32




Siekmanski´s



Proc D3DMatrixTranspose:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @NextHeightPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output
    mov ebx D@Width |  mov ecx ebx | shl ebx 2
    mov eax D@Height | shl eax 2 | mov D@NextHeightPos eax
    shr ecx 2 | mov D@CurXPos ecx ; divide by 4

L2:
    xor ecx ecx
    xor edx edx

    Do
       
        movaps xmm0 X$esi+edx            ; [0 1 2 3]
        add edx ebx
        movaps xmm1 X$esi+edx           ; [4 5 6 7]
        add edx ebx
        movaps xmm4 X$esi+edx           ; [8 9 A B]
        add edx ebx
        movaps xmm3 X$esi+edx           ; [C D E F]

        movaps xmm2 xmm0               ; [0 1 2 3]
        movaps xmm5 xmm1               ; [4 5 6 7]
        shufps xmm0 xmm3 00_01000100     ; [0 1 C D]
        shufps xmm1 xmm4 00_01000100     ; [4 5 8 9]
        shufps xmm2 xmm4 00_11101110     ; [2 3 A B]
        shufps xmm5 xmm3 00_11101110     ; [6 7 E F]
        movaps xmm4 xmm0               ; [0 1 C D]
        movaps xmm3 xmm2               ; [2 3 A B]
        shufps xmm0 xmm1 00_10001000     ; [0 C 4 8]
        shufps xmm1 xmm4 00_11011101     ; [5 9 1 D]
        shufps xmm2 xmm5 00_10001000     ; [2 A 6 E]
        shufps xmm3 xmm5 00_11011101     ; [3 B 7 F]
        shufps xmm0 xmm0 00_01111000     ; [0 4 8 C]
        shufps xmm1 xmm1 00_11010010     ; [1 5 9 D]
        shufps xmm2 xmm2 00_11011000     ; [2 6 A E]
        shufps xmm3 xmm3 00_11011000     ; [3 7 B F]

        movaps X$edi xmm0            ; [0 4 8 C]
        mov eax D@NextHeightPos
        movaps X$edi+eax xmm1           ; [1 5 9 D]
        add eax D@NextHeightPos
        movaps X$edi+eax xmm2           ; [2 6 A E]
        add eax D@NextHeightPos
        movaps X$edi+eax xmm3           ; [3 7 B F]

        add edx ebx
        add edi 16
        add ecx 4
    Loop_Until ecx >= D@Height
    add esi 16;4
    add edi eax
    dec D@CurXPos | jnz L2<<

    mov eax D@Output

EndP

calling it as:
call D3DMatrixTranspose ValueA, Output, 20, 32

Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 02, 2017, 02:37:09 PM
JJ´s code is very close, but...it fails if we have  matrix like:


[ValueD2:   F$ 1, 2, 3, 4, 5, 6,
                F$ 7, 8, 9, 10, 11, 12,
                F$ 13, 14, 15, 16, 17, 18,
                F$ 19, 20, 21, 22, 23, 24
                F$ 25, 26, 27, 28, 29, 30]

call Fast_transpose_blockJJ ValueD2, Output, 6,5


Siekmanski´s will crash, because the width and height are smaller then 16. So i couldn´t test his version yet on this example.
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: Siekmanski on April 02, 2017, 04:09:56 PM
Wrote another faster 4 by 4 transpose routine...

Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: jj2007 on April 02, 2017, 05:09:03 PM
Quote from: guga on April 02, 2017, 01:56:12 PMBut...how to make it work for any matrix size (or sizes not multiple of 4) ? I mean, width = 11, height = 23...or width = 3 height = 2 etc ?

Simple solution: Check the size and take my fast version if possible, or my version A if too small. After all, why should small matrices need a fast solution?
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 02, 2017, 05:29:16 PM
Thanks SiekManski. I´l take a look. It seems pretty fast.

Hi JJ

I made another test and a check for the size.  So far it is working for any Height longer then 4 bytes and not multiple of 4. Tomorrow i´ll test to see if it works also on height smaller then 4


Proc Fast_transpose_blockJJguga:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output
    mov ebx D@Width |  mov D@CurXPos ebx | shl ebx 2

    ; get remainders for edi
    xor eax eax
    mov edx D@Height | mov ecx edx | shr edx 2 | shl edx 2 | sub ecx edx ; integer count divide by 16 bytes (4 dwords)
    jz L1>
         mov eax 4 | sub eax ecx | shl eax 2
   
L1:   

L2:
    xor ecx ecx
    xor edx edx

    Do
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM0 D$esi+edx | add edx ebx
        movd XMM1 D$esi+edx | add edx ebx
        movd XMM2 D$esi+edx | add edx ebx
        movd XMM3 D$esi+edx | add edx ebx

        pslldq XMM1 4
        pslldq XMM2 8
        pslldq XMM3 12

        xorps XMM0 XMM1
        xorps XMM0 XMM2
        xorps XMM0 XMM3

        movups X$edi XMM0 ; <---- must be movups (unaligned) due to the remainders in height. otherwise it will crash. it work regardless the output in aligned or not
        add edi 16 ; 4*4
        add ecx 4
    Loop_Until ecx >= D@Height
    sub edi eax; adjust edi from the remainder
    add esi 4
    dec D@CurXPos | jnz L2<<

    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0
L1:

    mov eax D@Output

EndP



I´ll check tomorrow if it works when Height or width are smaller then 4. So far, it works for 6x5; 6x6; 6x7, 30x32, 32x30 ....

i´ll test tomorrow for 4x30, 3x10, 10x3, 1x2, 2x20 etc :)

If it works on those situations, i´ll see if i can be able to adapt Siekmanski onto your variation as well. On this way we may have something around 20-30% of speed gain, i suppose ?.

I realise that small matrix don´t actually needs speed, but we may have a matrix containing a small height (smaller then 4) and a higher width. Ex: width = 100, height = 3 (Or vice-versa)
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: jj2007 on April 02, 2017, 08:36:51 PM
Time for some timings?Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3442    cycles for 100 * TransposeA
570     cycles for 100 * TransposeB
786     cycles for 100 * TransposeC (old)

2285    cycles for 100 * TransposeA
571     cycles for 100 * TransposeB
772     cycles for 100 * TransposeC (old)

2285    cycles for 100 * TransposeA
570     cycles for 100 * TransposeB
786     cycles for 100 * TransposeC (old)

2284    cycles for 100 * TransposeA
570     cycles for 100 * TransposeB
770     cycles for 100 * TransposeC (old)

1913    cycles for 100 * TransposeA
570     cycles for 100 * TransposeB
784     cycles for 100 * TransposeC (old)

61      bytes for TransposeA
111     bytes for TransposeB
109     bytes for TransposeC (old)


P.S.: Don't take the slow version I posted earlier. Take V3 instead :greensml:
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 03, 2017, 02:24:03 AM
Great, but not working for non multiple of 4 matrices

Try with this:
Width = 6
Height = 7
[ValueD4:   D$ 1, 2, 3, 4, 5, 6,
                D$ 7, 8, 9, 10, 11, 12,
                D$ 13, 14, 15, 16, 17, 18,
                D$ 19, 20, 21, 22, 23, 24,
                D$ 25, 26, 27, 28, 29, 30,
                D$ 31, 32, 33, 34, 35, 36,
                D$ 37, 38, 39, 40, 41, 42]


The other version worked on non multiple of 4.
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 03, 2017, 03:30:13 AM
Hi JJ

Ok...i guess i fix it to make work for 6x7 and not multiple of 4

It seems to work on 6x7, 20x30 etc


Proc Fast_transpose_blockJJgugaNew:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @Reminder
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output
    ;mov ebx D@Width |  mov D@CurXPos ebx | shl ebx 2

    ; get remainders for edi
    xor eax eax
    mov edx D@Height | mov ecx edx | shr edx 2 | shl edx 2 | sub ecx edx ; integer count divide by 16 bytes (4 dwords)
    jz L1>
         mov eax 4 | sub eax ecx | shl eax 2
L1:   
    mov D@Reminder eax

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@Height
    mov edx esi
    sar ecx 2 | jnc L3> | inc ecx | clc | L3: ; if height have a remainder (i mean, not multiple of 4, increment it). And clear Carry flag just for safety
    ;Align 4 Alignment not really needed ?

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@Reminder; | mov D$edi 0; adjust edi from the remainder
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@Reminder
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP


Speed tests:

With fixing the height (jnc) 249.51956409010458 ns
Without fixing the height (not work for non 4 multiples): 243.87414443039290 ns (A bit faster, because it misses the data for non multiple of 4)

Old version: 289.57585729750531 ns

Gain of speed from new to old: something around 14%


I´ll start making the tests for height and width smaller then 4 and also exchanging the width and height (To see if it works when height is smaller then width or vice-versa)
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 03, 2017, 03:57:28 AM
Hmm...it seems that align4 make no effect, but align 16 gained more speed 8)


; fastest
Proc Fast_transpose_blockJJgugaNew:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @Reminder
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; get remainders for edi
    xor eax eax
    mov edx D@Height | mov ecx edx | shr edx 2 | shl edx 2 | sub ecx edx ; integer count divide by 16 bytes (4 dwords)
    jz L1>
         mov eax 4 | sub eax ecx | shl eax 2
L1:
    mov D@Reminder eax

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@Height
    mov edx esi
    sar ecx 2 | jnc L3> | inc ecx | clc | L3: ; if height have a remainder (i mean, not multiple of 4, increment it)
    Align 16 ; <--------- Must be aligned with 16 to gain more speed

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@Reminder; | mov D$edi 0; adjust edi from the remainder
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@Reminder
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP


New result: 217.25499826251323 ns
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 03, 2017, 05:13:32 AM
Another faster version that seems to work with any size (width and height) both bigger then 4 bytes

The previous "sar" operation is no longer needed. Since we can compute the max value of Y (Col) only once and not inside each loop



; fastest
Proc Fast_transpose_blockJJ:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @Reminder, @MAxYPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; get remainders for edi
    xor eax eax
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | shl edx 2 | sub ecx edx ; integer count divide by 4 dwords
    jz L1>
        inc D@MaxYPos  ; if height have a remainder (i mean, not multiple of 4, increment it)
        mov eax 4 | sub eax ecx | shl eax 2
L1:
    mov D@Reminder eax

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@MaxYPos ; < No longer needed "sar" operation since we alreadu calculated the max value of Y (Col/Height)
    mov edx esi
    Align 16 ; <---- Must be aligned to 16 to gain more speed and stability

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@Reminder; | mov D$edi 0; adjust edi from the remainder to we reach the correct position
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@Reminder
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP


New timming: 189.46967097307137 ns


JJ, It is something around 50% faster then the old version (the 1st one) and something around 30% faster then your new one. It seems that using align 16 and putting a variable to hold the Maximum Y Position (col) was the necessary to gain more speed. Can you test on yours ?
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: jj2007 on April 03, 2017, 06:52:49 AM
Quote from: guga on April 03, 2017, 05:13:32 AMJJ, It is something around 50% faster then the old version (the 1st one) and something around 30% faster then your new one. It seems that using align 16 and putting a variable to hold the Maximum Y Position (col) was the necessary to gain more speed. Can you test on yours ?

Sorry, can't test that one, I know Masm syntax only. From what I see, it is unlikely that your version is any faster than mine, though.
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: mineiro on April 03, 2017, 07:20:41 AM
hello sir guga, I hope you're fine irmão;
I don't have tested because don't have rosasm in hands, so I can only give suggestion.
On line:
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | shl edx 2 | sub ecx edx ; integer count divide by 4 dwords
You can get remainder of a 4 division by "and" number with (X-1) mask. This only works to 2,4,8,16,... . To get remainder of 8 the mask is 8-1==7, to 16 the mask is 15, ... .
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | and ecx,3

If conditional or inconditional jump address is aligned we have a gain. So, if you analyse opcode generated, from my tests, sometimes is better use add instead of inc per example, because you're align next intruction or a label to a even address.
Not remember where I have read, but, text said that if you like to read just a byte on an odd adress, so machine internally read a word, discard one byte and move a byte. So, even address speed up code. Instructions on even address perfoms better than on odd address.
Some codes that I have see done to larger matrix just get remainder and get address to store too and do calculus. The idea is align address to 16 multiple. They do something like a head code (to align remainder and address) on byte/word/dword/qword way until address its a 16 multiple where the gain appear, and a tail to deal with bytes too.

I also forgot to say, on line below:
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<
add instruction set flags and dec instruction too, this create a bit dependency. You can test with lea that don't change flags to add a displacement, so flag dependency will be on dec instruction.
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 03, 2017, 12:03:20 PM
Oi Mineiro

Obrigado pela dica.  usando o and ecx 3 fica realmente mais pratico para encontrar a posição correta do output em edi. Estou testando agora para ver se funciona com varios tamanhos diferentes de comprimento e altura. Estou tentando fazer algo relacionado com tranposição de matrizes com tamanhos diferentes pois estou testando um algoritmo para identificar imagens a partir de um hash. A biblioteca se chama "pHash" e, em princípio, utiliza a transposição de matrizes após os pixels terem sido transformados para Luma. Ele dispõe o resultado em matrizes no formato de 32x32 que representam o padrão de iluminação da imagem (na verdade, ele cria uma nova imagem em 32x32) para que o hash possa ser gerado. No entanto, pelo que vi do código, se for utilizado nova matriz do mesmo tamanho da imagem original, o algoritmo funcionaria de forma mais eficiente.  O maior problema é que a biblioteca que usam é muito muito lenta. Usam uma versão alterada de cimg que tem dentro dela a transposição de matrizes, mas, como disse é bastante lenta por isso precisaria de uma função melhorada para a transposição de matrizes de qualquer tamanho sejam quadradas ou retangulares (até para poder utilizar em qualquer outro aplicativo e não apenas neste caso específico).

A função que JJ fez é ótima do ponto de vista de rapidez, no entanto preciso apenas tentar fazer funcionar em matrizes não quadradas e, ainda assim, tentar manter a mesma rapidez quanto o código original que ele criou.

Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 03, 2017, 12:39:24 PM
Ok...I guess i got it working for retangular matrices (Quadratic or not). However, Width and Height on this version must be bigger then 4 (both)


; fastest
Proc Fast_transpose_blockJJ:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @RemainderY, @MaxYPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; get remainders for edi
    mov D@RemainderY 0
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | and ecx 3; Check if value (Height) is a multiple of 4
    jz L1>
        ; found not multiple of 4
        inc D@MaxYPos ; if height have a remainder (i mean, not multiple of 4, increment it)
        mov eax 4 | sub eax ecx | shl eax 2
        mov D@RemainderY eax
L1:

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@MaxYPos
    mov edx esi
    Align 16 ; <---- Must be aligned to 16 to gain more speed and stability

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@RemainderY; adjust edi from the remainder in YPos only
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@RemainderY
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP


; Speed results using "and ecx" 213.78110071619122 ns
The tests were done on these data sets



; 4x4 OK Fast_transpose_blockJJ4544. Start testing the newer versions for retangular matrixes
[Teste4x4:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22]
; 5x4
[Teste5x4:  F$ 1, 2, 3, 4, 5,
                F$ 6, 7, 8, 9, 10,
                F$ 11, 12, 13, 14, 15,
                F$ 16, 17, 18, 19, 20]

[Teste6x4:  F$ 1, 2, 3, 4, 5, 6,
                F$ 26, 7, 8, 9, 10,27
                F$ 11, 12, 13, 14, 15,28
                F$ 16, 17, 18, 19, 20,29]

[Teste7x4:  F$ 1, 2, 3, 4, 5, 6, 77,
                F$ 26, 7, 8, 9, 10,27, 102
                F$ 11, 12, 13, 14, 15,28, 95
                F$ 16, 17, 18, 19, 20,29, 35]

[Teste8x4:  F$ 1, 2, 3, 4, 5, 6, 77, 222,
                F$ 26, 7, 8, 9, 10,27, 102, 67,
                F$ 11, 12, 13, 14, 15,28, 95, 77,
                F$ 16, 17, 18, 19, 20,29, 35, 91]

; for Height = 4. Width works ok on any size Fast_transpose_blockJJ_WidthOk. Now let´ see Heght with odd values
; remainder =12
[Teste4x5:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32]
; remainder = 8
[Teste4x6:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67]
; remainder = 4
[Teste4x7:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67,
                F$ 59, 69, 98, 107]

; remainder = 0
[Teste4x8:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67,
                F$ 59, 69, 98, 107,
                F$ 259, 169, 908, 77]

; 5x5
[Teste5x5:  F$ 1, 2, 3, 4, 6,
                F$ 7, 8, 9, 10, 11
                F$ 13, 14, 15, 16, 12
                F$ 19, 20, 21, 22, 23
                F$ 79, 80, 81, 32, 37]



I´ll start the tests on matrix whose width or height are smaller then 4.  Ex: 3x100 ; 2x10, 1x90, 150x2, 100x3 etc
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 03, 2017, 01:57:28 PM
Ok, guys...it is working on all sizes : )

Thank you a lot !  :t :t :t :t

The only condition is that the inputed matrix must be stored on a buffer bigger then 64 bytes (4x4 dwords) and also the output buffer must be at least 64 bytes long too. But..it is working as expected :)



Proc Fast_Matrix_Transpose:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @RemainderY, @MaxYPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; get remainders for edi
    mov D@RemainderY 0
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | and ecx 3; Check if value (Height) is a multiple of 4
    jz L1>
        ; found not multiple of 4
        inc D@MaxYPos ; if height have a remainder (i mean, not multiple of 4, increment it)
        mov eax 4 | sub eax ecx | shl eax 2
        mov D@RemainderY eax
L1:

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@MaxYPos
    mov edx esi
    Align 16 ; <---- Must be aligned to 16 to gain more speed and stability

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@RemainderY; adjust edi from the remainder in YPos only
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@RemainderY
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP




Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: jj2007 on April 03, 2017, 06:45:12 PM
Quote from: guga on April 03, 2017, 12:03:20 PMA função que JJ fez é ótima do ponto de vista de rapidez, no entanto preciso apenas tentar fazer funcionar em matrizes não quadradas

My function works fine on non-quadratic matrices, as long as the rows are a multiple of 4.

sub edi D@RemainderY; adjust edi from the remainder in YPos only

:t

Here is the original function, in Masm format (full code with speed test in reply #16 (http://masm32.com/board/index.php?topic=6105.msg64799#msg64799)):

TransposeB proc uses esi edi ebx pSrc, pDest, SrcRows, SrcCols      ; fast
  mov esi, pSrc
  mov edi, pDest
  mov eax, SrcCols
  push eax
  shl eax, 2            ; *4
  lea ebx, [eax+2*eax]      ; 3*eax
  .Repeat
      mov ecx, SrcRows
      mov edx, esi
      sar ecx, 2
      align 4
@@:      movd xmm1, dword ptr [edx+1*eax]      ; +1*eax
      movd xmm0, dword ptr [edx]            ; +0*eax
      pshufd xmm1, xmm1, 01010001b
      xorps xmm0, xmm1
      movd xmm2, dword ptr [edx+1*ebx]      ; 3*eax
      movd xmm1, dword ptr [edx+2*eax]      ; 2*eax
      pshufd xmm2, xmm2, 00010101b
      pshufd xmm1, xmm1, 01000101b
      xorps xmm0, xmm1
      xorps xmm0, xmm2
      lea edx, [edx+4*eax]
      movups [edi], xmm0
      add edi, 4*4      ; advance one xmm reg
      dec ecx
      jg @B
      add esi, 4
      dec stack
  .Until Zero?
  pop edx
  if ShowB
      mov esi, offset ValueC
      For_ ct=0 To 19
            For_ ecx=0 To 31
                  lodsd
                  Print Str$("%i ", eax)
            Next
            Print
      Next
      Exit
  endif
  ret
TransposeB endp
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: mineiro on April 03, 2017, 10:31:02 PM
opa, beleza guga;
O projeto soa interessante, você deseja encontrar padrões pelo que supus e isto oferece uma gama de possibilidades onde possa ser usado. Na hora de testar o código com matrizes tente quadrado, retângulo e linha, quadrados e retângulos são formas geométricas que linhas sobrepostas podem formar, e linha nesse contexto significa um número primo, por exemplo uma matriz de 7x1. No código fonte, não cheguei a analisá-lo por completo, daí suponha que apenas tenho uma vaga ideia, mas sendo uma multiplicação então significa que é mais fácil fazer cinco vezes dez do que dez vezes o número cinco, em número de operações matemáticas.
Concordo com o senhor, neste fórum existe exímios programadores, aprendi muito e continuo aprendendo com os membros. Um eterno aprendiz.
Boa sorte no projeto.
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 04, 2017, 02:01:35 AM
Hi mineiro

Obrigado :) Sim, o phash usa padrões e um algoritmo chamado DCT para reduzir as frequências encontradas na imagem. Só é um pouco complicado entender a funcionalidade seguindo o código-fonte original mas, creio que dê para portar para assembly. Minha idéia inicial é fazer uma função que possa calcular o pHash de imagens mas, ao invés de fazer funções que criam e reduzem o tamanho das imagens etc, fazer apenas a que diretamente realizará o trabalho, ou seja, usar como input width, height (e talvez o pitch)  e o ponteiro para os pixels (convertidos já para algum formato como Luma, -por exemplo). A biblioteca menciona que convertem a imagem para greyscale mas, pelos testes que estou fazendo em um plugin para o virtualdub (para detectar mudanças de cenas em um filme), é melhor converter já para Luma usando o valor encontrado após os pixels terem sido convertidos para YCbcr ou CieLab. Como se usa apenas o Luma, não é necessário usar toda a função de conversão de YCrcb ou CieLab apenas o início que converte RGB para Luma e o valor de Luma (de 0 pra 1 ou seja, o Luma já está normalizado: dividido por 255) é o que será passado para as matrizes e para o DCT.

Se quiser, de uma lida aqui que explicam melhor todo o funcionamento do phash de modo mais claro:
http://cloudinary.com/blog/how_to_automatically_identify_similar_images_using_phash
http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html

Sobre a função de JJ fiz os testes de maneira extensiva e, em principio está funcionando para todo e qualquer formato da matrix, incluindo lineares, quadradas e retangulares, independente de serem ou não multiplos de 4 e o melhor de tudo, com uma rapidez impecável :)
Os dados dos testes que fiz foram estes:

; Matrix 20 X 30
[ValueA:    F$ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
            F$ 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
            F$ 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
            F$ 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80
            F$ 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100
            F$ 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120
            F$ 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140
            F$ 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160
            F$ 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180
            F$ 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200
            F$ 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220
            F$ 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240
            F$ 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260
            F$ 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280
            F$ 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300
            F$ 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320
            F$ 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340
            F$ 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360
            F$ 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380
            F$ 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400
            F$ 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420
            F$ 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440
            F$ 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460
            F$ 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480
            F$ 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500
            F$ 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520
            F$ 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540
            F$ 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560
            F$ 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580
            F$ 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600]

; 5x4
[ ValueD:    F$ 1, 2, 3, 4, 5
                F$ 6, 7, 8, 9, 10
                F$ 11, 12, 13, 14, 15
                F$ 16, 17, 18, 19, 20]
; 6x5
[ValueD2:   F$ 1, 2, 3, 4, 5, 6,
                F$ 7, 8, 9, 10, 11, 12,
                F$ 13, 14, 15, 16, 17, 18,
                F$ 19, 20, 21, 22, 23, 24,
                F$ 25, 26, 27, 28, 29, 30]

; 6 x 6 8 btes ahead
[ValueD3:   F$ 1, 2, 3, 4, 5, 6,
                F$ 7, 8, 9, 10, 11, 12,
                F$ 13, 14, 15, 16, 17, 18,
                F$ 19, 20, 21, 22, 23, 24,
                F$ 25, 26, 27, 28, 29, 30,
                F$ 31, 32, 33, 34, 35, 36]
;6x7 4 bytes ahead
[ ValueD4:   F$ 1, 2, 3, 4, 5, 6,
                F$ 7, 8, 9, 10, 11, 12,
                F$ 13, 14, 15, 16, 17, 18,
                F$ 19, 20, 21, 22, 23, 24,
                F$ 25, 26, 27, 28, 29, 30,
                F$ 31, 32, 33, 34, 35, 36,
                F$ 37, 38, 39, 40, 41, 42]
; 4 x 7
[ValueD5:   F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 25, 26, 27, 28,
                F$ 31, 32, 33, 34,
                F$ 37, 38, 39, 40,
                F$ 41, 42, 43, 44]

; [ValueD6:   F$ 1, 2, 3, 4, 77,
                F$ 7, 8, 9, 10, 64,
                F$ 13, 14, 15, 16, 79
                F$ 19, 20, 21, 22, 98
                F$ 25, 26, 27, 28, 101
                F$ 31, 32, 33, 34, 222
                F$ 37, 38, 39, 40, 57
                F$ 41, 42, 43, 44, 58]


; 4x4 OK Fast_transpose_blockJJ4544
[Teste4x4:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22]
; 5x4
[Teste5x4:  F$ 1, 2, 3, 4, 5,
                F$ 6, 7, 8, 9, 10,
                F$ 11, 12, 13, 14, 15,
                F$ 16, 17, 18, 19, 20]

[Teste6x4:  F$ 1, 2, 3, 4, 5, 6,
                F$ 26, 7, 8, 9, 10,27
                F$ 11, 12, 13, 14, 15,28
                F$ 16, 17, 18, 19, 20,29]

[Teste7x4:  F$ 1, 2, 3, 4, 5, 6, 77,
                F$ 26, 7, 8, 9, 10,27, 102
                F$ 11, 12, 13, 14, 15,28, 95
                F$ 16, 17, 18, 19, 20,29, 35]

[Teste8x4:  F$ 1, 2, 3, 4, 5, 6, 77, 222,
                F$ 26, 7, 8, 9, 10,27, 102, 67,
                F$ 11, 12, 13, 14, 15,28, 95, 77,
                F$ 16, 17, 18, 19, 20,29, 35, 91]

; for Height = 4. Width works ok on any size Fast_transpose_blockJJ_WidthOk. Now let´s see Height with odd values
; remainder =12
[Teste4x5:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32]
; remainder = 8
[Teste4x6:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67]
; remainder = 4
[Teste4x7:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67,
                F$ 59, 69, 98, 107]

; remainder = 0
[Teste4x8:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67,
                F$ 59, 69, 98, 107,
                F$ 259, 169, 908, 77]

; 5x5
[Teste5x5:  F$ 1, 2, 3, 4, 6,
                F$ 7, 8, 9, 10, 11
                F$ 13, 14, 15, 16, 12
                F$ 19, 20, 21, 22, 23
                F$ 79, 80, 81, 32, 37]

; Fast_transpose_blockJJ_YESSSS WORKS FOR EVERYTHING BIGGER THEN 4

; Now let´ start with height or width smaller then 4

; 1st we test for width be smaller then 4

; remainder = 0; works. no adjustment
[Teste3x4:  F$ 1, 2, 3,
                F$ 7, 8, 9,
                F$ 13, 14, 15,
                F$ 19, 20, 21]

; remainder = 0; works. no adjustment
[Teste2x4:  F$ 1, 2,
                F$ 7, 8,
                F$ 13, 14,
                F$ 19, 20]

; remainder = 0; works. no adjustment
[Teste1x4:  F$ 1,
                F$ 7,
                F$ 13,
                F$ 19,]

; remainder = 0; works. no adjustment
[Teste3x5:  F$ 1, 2, 3,
                F$ 7, 8, 9,
                F$ 13, 14, 15,
                F$ 19, 20, 21,
                F$ 89, 70, 51]

; remainder = 0; works. no adjustment
[Teste2x5:  F$ 1, 2,
                F$ 7, 8,
                F$ 13, 14,
                F$ 19, 20,
                F$ 11, 122]

; remainder = 0; works. no adjustment
[Teste1x5:  F$ 1,
                F$ 7,
                F$ 13,
                F$ 19
                F$ 77]

; Now we test for height be smaller then 4

; remainder = 0; works. no adjustment
[Teste4x3:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16]

[Teste5x3:  F$ 1, 2, 3, 4, 77,
                F$ 7, 8, 9, 10, 91,
                F$ 13, 14, 15, 16,125]

[Teste5x2:  F$ 1, 2, 3, 4, 77,
                F$ 7, 8, 9, 10, 91]

[Teste5x1:  F$ 1, 2, 3, 4, 77]

[Teste1x1:  F$ 1]
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: mineiro on April 04, 2017, 03:11:06 AM
opa guga;
Agora o projeto ficou ainda mais audaz após ler os links postados, muito bacana mesmo. Não conhecia sobre perceputal hash, me veio na memória perceptron no ramo da inteligência artificial mas agora consigo pensar em outros algoritmos que no futuro podem ser agregados para uma automação.
http://courses.cs.washington.edu/courses/cse599/01wi/admin/Assignments/nn.html
Lembro que existia um programa chamado Altamira Composer, e este oferecia um zoom infinito, e não sei como funciona este algoritmo.
Tenho uma pergunta guga; o algoritmo é eficiente entre objetos similares? Pensei em uma laranja, mexerica e limão doce (tangerina). Tem como encontrar apenas laranjas neste pomar de fotos?
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 04, 2017, 04:53:23 AM
Para objetos semelhantes ele aponta qual o percentual mas...entre laranjas e mexericas...kkk.. não sei..bem pensado. É algo à se testar.

O link da demonstração online do phash é:
http://www.phash.org

tenta selecionar foto de laranjas e mexericas etc e ve qual o resultado que o phash encontrou. Se ele conseguir discernir mesmo entre uma laranja e uma mexerica etc, nossa...então o algoritmo é melhor do que pensei :)
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 04, 2017, 04:57:57 AM
Aliás...o que já consegui portar foi a função hamming Distance do phash. Apesar de eu achar um pouco lenta, pelo menos já consegui portar. Assim vou fazendo aos poucos para facilitar a otimização quando tudo ficar pronto.

A biblioteca tb tem uma função para calcular o Fast Fourier Transformation. Eu converti ela e posto mais tarde para tentarmos otimizar :)


[XValue: Q$ 0]

Proc ph_hamming_distance:
    Arguments @Hash1, @Hash2
    Uses ecx, edx, esi, edi

    mov edi D@Hash1
    mov esi D@Hash2
    mov eax D$edi | xor eax D$esi | mov D$XValue eax
    mov ecx D$edi+4 | xor ecx D$esi+4 | mov D$XValue+4 ecx

    call aullshr 1, D$XValue, D$XValue+4

    and eax 055555555 | and edx 055555555
    mov ecx D$XValue | sub ecx eax | mov D$XValue ecx
    mov eax D$XValue+4 | sbb eax edx | mov D$XValue+4 eax

    mov esi D$XValue | and esi 033333333
    mov edi D$XValue+4 | and edi 033333333

    call aullshr 2, D$XValue, D$XValue+4
    and eax 033333333 | add esi eax | mov D$XValue esi
    and edx 033333333 | adc edi edx | mov D$XValue+4 edi

    call aullshr 4, D$XValue, D$XValue+4
    add eax D$XValue
    mov ecx D$XValue+4 | adc ecx edx
    and eax 0F0F0F0F | mov D$XValue eax
    and ecx 0F0F0F0F | mov D$XValue+4 ecx

    call allmul D$XValue, D$XValue+4, 01010101, 01010101
    call aullshr 56, eax, edx

EndP



; results in eax/edx

Proc aullshr:
    Arguments @Flag, @Val1, @Val2
    Uses ecx

    mov ecx D@Flag | movzx ecx cl
    mov eax D@Val1
    mov edx D@Val2

    .If cl < 040
        If cl < 020
            shrd eax edx cl
            shr edx cl
        Else
            mov eax edx
            xor edx edx
            and cl 01F
            shr eax cl
        End_If
    .Else
        xor eax eax
        xor edx edx
    .End_If

EndP


___________________________________________________________________________________

Proc allmul:
    Arguments @Val1, @Val1a, @Val2, @Val2a
    Uses ebx

    mov eax D@Val1a
    If D@Val2a = eax
        mov eax D@Val1
        mul D@Val2
    Else
        mul D@Val2
        mov ebx eax
        mov eax D@Val1
        mul D@Val2a
        add ebx eax
        mov eax D@Val1
        mul ecx
        add edx ebx
    End_If

EndP
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: mineiro on April 04, 2017, 09:18:09 AM
Irei tentar o site amanhã e me aprofundar no assunto, gosto de alimento cerebral.
Creio que já tenha visto neste fórum uma função fft rápida, esqueci agora qual usuário a fez, talvez no velho fórum.
abraços dom.
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 04, 2017, 02:12:04 PM
Não vi ainda mas, se foi criada, creio que possa ter sido ou pelo JJ ou por Siekmanski . Ambos são ótimos quando o assunto é otimização de código :)
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: LordAdef on April 04, 2017, 06:02:35 PM
doideira, não conhecia isso. Interessante
Title: Re: Fast Transposing Matrix Algorithm for FPU Data
Post by: guga on April 07, 2017, 02:27:42 AM
Oi Lord

Sim, é muto interessante a técnica. No entanto, phash utiliza uma biblioteca tosca de manipulação de imagem chamada CImg que é incrivelmente lenta e confusa de portar.

Estou tentando portar ela para assembly e tentando entender direito a funcionalidade. basicamente, eles utilizam uma máscara por cima de uma imagem. A máscara é invertida (transposta mas, não do modo 'normal". A transposição se dá fazendo um flip horizontal e outro vertical da máscara) e depois passam essa máscara por sobre a imagem aplicando uma convolação de matrix. Antes da convolação e do flip, a imagem e a máscara precisam ser convertidas para greyscale (Eu vou usar Luma que é mais preciso).

Eu havia criado uma dll do pHash que tem as funções de exportação necessárias "ph_dct_imagehash" e "ph_hamming_distance" mas, após testar elas quanto à velocidade, é impraticável para usá-las em vídeo, por exemplo.

Exatamente por isso que estou otimizando e tentando rever toda a funcionalidade da biblioteca para fazê-la simplificada sem a necessidade de se utilizar biblioteca externa de manipulação de imagens. A idéia é apenas recriar as funções do pHash tomando como base os pixels já convertidos para Luma (normalizado) e apenas aplicar as alterações necessárias das matrizes e a convolação.

Isso garantirá uma maior rapidez do algoritmo. Se eu conseguir criar uma função que consiga criar o phash de uma imagem (e identificar com o de outro) utilizando apenas alguns nanosegundos, será melhor. Acredito que a função como um todo, talvez se consiga fazer em torno de uns 800 nanosegundos (para uma imagem de 32x32, por exemplo), já incluindo a convolação.  Ou seja, uma imagem de 640x480 pode ser feita em algo que leve alguns poucos microsegundos. (Creio que uns 10 no máximo 20 mas, quanto menos, melhor).

Eu conseguindo isolar apenas o algoritmo de phash, já é um avanço pois se alguém quiser criar um programa de identificação de cenas (como eu estou fazendo) ou de busca por imagens etc, a velocidade final dependerá apenas da forma como o programador  usou para ter acesso aos pixels antes de passar pelo algoritmo e nunca em função do algoritmo em si (Que é o que acontece com o phash  atualmente)