Fast Transposing Matrix Algorithm for FPU Data

guga · April 02, 2017, 05:29:16 PM

Thanks SiekManski. I´l take a look. It seems pretty fast.

Hi JJ

I made another test and a check for the size. So far it is working for any Height longer then 4 bytes and not multiple of 4. Tomorrow i´ll test to see if it works also on height smaller then 4

Code Select


Proc Fast_transpose_blockJJguga:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output
    mov ebx D@Width |  mov D@CurXPos ebx | shl ebx 2

    ; get remainders for edi
    xor eax eax
    mov edx D@Height | mov ecx edx | shr edx 2 | shl edx 2 | sub ecx edx ; integer count divide by 16 bytes (4 dwords)
    jz L1>
         mov eax 4 | sub eax ecx | shl eax 2
    
L1:    

L2:
    xor ecx ecx
    xor edx edx

    Do
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM0 D$esi+edx | add edx ebx
        movd XMM1 D$esi+edx | add edx ebx
        movd XMM2 D$esi+edx | add edx ebx
        movd XMM3 D$esi+edx | add edx ebx

        pslldq XMM1 4
        pslldq XMM2 8
        pslldq XMM3 12

        xorps XMM0 XMM1
        xorps XMM0 XMM2
        xorps XMM0 XMM3

        movups X$edi XMM0 ; <---- must be movups (unaligned) due to the remainders in height. otherwise it will crash. it work regardless the output in aligned or not
        add edi 16 ; 4*4
        add ecx 4
    Loop_Until ecx >= D@Height
    sub edi eax; adjust edi from the remainder
    add esi 4
    dec D@CurXPos | jnz L2<<

    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0
L1:

    mov eax D@Output

EndP

I´ll check tomorrow if it works when Height or width are smaller then 4. So far, it works for 6x5; 6x6; 6x7, 30x32, 32x30 ....

i´ll test tomorrow for 4x30, 3x10, 10x3, 1x2, 2x20 etc :)

If it works on those situations, i´ll see if i can be able to adapt Siekmanski onto your variation as well. On this way we may have something around 20-30% of speed gain, i suppose ?.

I realise that small matrix don´t actually needs speed, but we may have a matrix containing a small height (smaller then 4) and a higher width. Ex: width = 100, height = 3 (Or vice-versa)

jj2007 · April 02, 2017, 08:36:51 PM

Time for some timings?

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3442    cycles for 100 * TransposeA
570     cycles for 100 * TransposeB
786     cycles for 100 * TransposeC (old)

2285    cycles for 100 * TransposeA
571     cycles for 100 * TransposeB
772     cycles for 100 * TransposeC (old)

2285    cycles for 100 * TransposeA
570     cycles for 100 * TransposeB
786     cycles for 100 * TransposeC (old)

2284    cycles for 100 * TransposeA
570     cycles for 100 * TransposeB
770     cycles for 100 * TransposeC (old)

1913    cycles for 100 * TransposeA
570     cycles for 100 * TransposeB
784     cycles for 100 * TransposeC (old)

61      bytes for TransposeA
111     bytes for TransposeB
109     bytes for TransposeC (old)

P.S.: Don't take the slow version I posted earlier. Take V3 instead

guga · April 03, 2017, 02:24:03 AM

Great, but not working for non multiple of 4 matrices

Try with this:

Code Select


Width = 6
Height = 7
[ValueD4:   D$ 1, 2, 3, 4, 5, 6,
                D$ 7, 8, 9, 10, 11, 12,
                D$ 13, 14, 15, 16, 17, 18,
                D$ 19, 20, 21, 22, 23, 24,
                D$ 25, 26, 27, 28, 29, 30,
                D$ 31, 32, 33, 34, 35, 36,
                D$ 37, 38, 39, 40, 41, 42]

The other version worked on non multiple of 4.

guga · April 03, 2017, 03:30:13 AM

Hi JJ

Ok...i guess i fix it to make work for 6x7 and not multiple of 4

It seems to work on 6x7, 20x30 etc

Code Select


Proc Fast_transpose_blockJJgugaNew:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @Reminder
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output
    ;mov ebx D@Width |  mov D@CurXPos ebx | shl ebx 2

    ; get remainders for edi
    xor eax eax
    mov edx D@Height | mov ecx edx | shr edx 2 | shl edx 2 | sub ecx edx ; integer count divide by 16 bytes (4 dwords)
    jz L1>
         mov eax 4 | sub eax ecx | shl eax 2
L1:    
    mov D@Reminder eax

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@Height
    mov edx esi
    sar ecx 2 | jnc L3> | inc ecx | clc | L3: ; if height have a remainder (i mean, not multiple of 4, increment it). And clear Carry flag just for safety
    ;Align 4 Alignment not really needed ?

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@Reminder; | mov D$edi 0; adjust edi from the remainder
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@Reminder
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP

Speed tests:

With fixing the height (jnc) 249.51956409010458 ns
Without fixing the height (not work for non 4 multiples): 243.87414443039290 ns (A bit faster, because it misses the data for non multiple of 4)

Old version: 289.57585729750531 ns

Gain of speed from new to old: something around 14%

I´ll start making the tests for height and width smaller then 4 and also exchanging the width and height (To see if it works when height is smaller then width or vice-versa)

guga · April 03, 2017, 03:57:28 AM

Hmm...it seems that align4 make no effect, but align 16 gained more speed 8)

Code Select


; fastest
Proc Fast_transpose_blockJJgugaNew:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @Reminder
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; get remainders for edi
    xor eax eax
    mov edx D@Height | mov ecx edx | shr edx 2 | shl edx 2 | sub ecx edx ; integer count divide by 16 bytes (4 dwords)
    jz L1>
         mov eax 4 | sub eax ecx | shl eax 2
L1:
    mov D@Reminder eax

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@Height
    mov edx esi
    sar ecx 2 | jnc L3> | inc ecx | clc | L3: ; if height have a remainder (i mean, not multiple of 4, increment it)
    Align 16 ; <--------- Must be aligned with 16 to gain more speed

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@Reminder; | mov D$edi 0; adjust edi from the remainder
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@Reminder
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP

New result: 217.25499826251323 ns

guga · April 03, 2017, 05:13:32 AM

Another faster version that seems to work with any size (width and height) both bigger then 4 bytes

The previous "sar" operation is no longer needed. Since we can compute the max value of Y (Col) only once and not inside each loop

Code Select



; fastest
Proc Fast_transpose_blockJJ:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @Reminder, @MAxYPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; get remainders for edi
    xor eax eax
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | shl edx 2 | sub ecx edx ; integer count divide by 4 dwords
    jz L1>
        inc D@MaxYPos  ; if height have a remainder (i mean, not multiple of 4, increment it)
        mov eax 4 | sub eax ecx | shl eax 2
L1:
    mov D@Reminder eax

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@MaxYPos ; < No longer needed "sar" operation since we alreadu calculated the max value of Y (Col/Height)
    mov edx esi
    Align 16 ; <---- Must be aligned to 16 to gain more speed and stability

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@Reminder; | mov D$edi 0; adjust edi from the remainder to we reach the correct position
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@Reminder
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP

New timming: 189.46967097307137 ns

JJ, It is something around 50% faster then the old version (the 1st one) and something around 30% faster then your new one. It seems that using align 16 and putting a variable to hold the Maximum Y Position (col) was the necessary to gain more speed. Can you test on yours ?

jj2007 · April 03, 2017, 06:52:49 AM

Quote from: guga on April 03, 2017, 05:13:32 AMJJ, It is something around 50% faster then the old version (the 1st one) and something around 30% faster then your new one. It seems that using align 16 and putting a variable to hold the Maximum Y Position (col) was the necessary to gain more speed. Can you test on yours ?

Sorry, can't test that one, I know Masm syntax only. From what I see, it is unlikely that your version is any faster than mine, though.

mineiro · April 03, 2017, 07:20:41 AM

hello sir guga, I hope you're fine irmão;
I don't have tested because don't have rosasm in hands, so I can only give suggestion.
On line:
mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | shl edx 2 | sub ecx edx ; integer count divide by 4 dwords
You can get remainder of a 4 division by "and" number with (X-1) mask. This only works to 2,4,8,16,... . To get remainder of 8 the mask is 8-1==7, to 16 the mask is 15, ... .
mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | and ecx,3

If conditional or inconditional jump address is aligned we have a gain. So, if you analyse opcode generated, from my tests, sometimes is better use add instead of inc per example, because you're align next intruction or a label to a even address.
Not remember where I have read, but, text said that if you like to read just a byte on an odd adress, so machine internally read a word, discard one byte and move a byte. So, even address speed up code. Instructions on even address perfoms better than on odd address.
Some codes that I have see done to larger matrix just get remainder and get address to store too and do calculus. The idea is align address to 16 multiple. They do something like a head code (to align remainder and address) on byte/word/dword/qword way until address its a 16 multiple where the gain appear, and a tail to deal with bytes too.

I also forgot to say, on line below:
add edi (4*4) ; advance one xmm reg
dec ecx | jg L8<
add instruction set flags and dec instruction too, this create a bit dependency. You can test with lea that don't change flags to add a displacement, so flag dependency will be on dec instruction.

guga · April 03, 2017, 12:03:20 PM

Oi Mineiro

Obrigado pela dica. usando o and ecx 3 fica realmente mais pratico para encontrar a posição correta do output em edi. Estou testando agora para ver se funciona com varios tamanhos diferentes de comprimento e altura. Estou tentando fazer algo relacionado com tranposição de matrizes com tamanhos diferentes pois estou testando um algoritmo para identificar imagens a partir de um hash. A biblioteca se chama "pHash" e, em princípio, utiliza a transposição de matrizes após os pixels terem sido transformados para Luma. Ele dispõe o resultado em matrizes no formato de 32x32 que representam o padrão de iluminação da imagem (na verdade, ele cria uma nova imagem em 32x32) para que o hash possa ser gerado. No entanto, pelo que vi do código, se for utilizado nova matriz do mesmo tamanho da imagem original, o algoritmo funcionaria de forma mais eficiente. O maior problema é que a biblioteca que usam é muito muito lenta. Usam uma versão alterada de cimg que tem dentro dela a transposição de matrizes, mas, como disse é bastante lenta por isso precisaria de uma função melhorada para a transposição de matrizes de qualquer tamanho sejam quadradas ou retangulares (até para poder utilizar em qualquer outro aplicativo e não apenas neste caso específico).

A função que JJ fez é ótima do ponto de vista de rapidez, no entanto preciso apenas tentar fazer funcionar em matrizes não quadradas e, ainda assim, tentar manter a mesma rapidez quanto o código original que ele criou.

guga · April 03, 2017, 12:39:24 PM

Ok...I guess i got it working for retangular matrices (Quadratic or not). However, Width and Height on this version must be bigger then 4 (both)

Code Select


; fastest
Proc Fast_transpose_blockJJ:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @RemainderY, @MaxYPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; get remainders for edi
    mov D@RemainderY 0
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | and ecx 3; Check if value (Height) is a multiple of 4
    jz L1>
        ; found not multiple of 4
        inc D@MaxYPos ; if height have a remainder (i mean, not multiple of 4, increment it)
        mov eax 4 | sub eax ecx | shl eax 2
        mov D@RemainderY eax
L1:
 
    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@MaxYPos
    mov edx esi
    Align 16 ; <---- Must be aligned to 16 to gain more speed and stability

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@RemainderY; adjust edi from the remainder in YPos only
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@RemainderY
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP

; Speed results using "and ecx" 213.78110071619122 ns
The tests were done on these data sets

Code Select



; 4x4 OK Fast_transpose_blockJJ4544. Start testing the newer versions for retangular matrixes
[Teste4x4:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22]
; 5x4
[Teste5x4:  F$ 1, 2, 3, 4, 5,
                F$ 6, 7, 8, 9, 10,
                F$ 11, 12, 13, 14, 15,
                F$ 16, 17, 18, 19, 20]

[Teste6x4:  F$ 1, 2, 3, 4, 5, 6,
                F$ 26, 7, 8, 9, 10,27
                F$ 11, 12, 13, 14, 15,28
                F$ 16, 17, 18, 19, 20,29]

[Teste7x4:  F$ 1, 2, 3, 4, 5, 6, 77,
                F$ 26, 7, 8, 9, 10,27, 102
                F$ 11, 12, 13, 14, 15,28, 95
                F$ 16, 17, 18, 19, 20,29, 35]

[Teste8x4:  F$ 1, 2, 3, 4, 5, 6, 77, 222,
                F$ 26, 7, 8, 9, 10,27, 102, 67,
                F$ 11, 12, 13, 14, 15,28, 95, 77,
                F$ 16, 17, 18, 19, 20,29, 35, 91]

; for Height = 4. Width works ok on any size Fast_transpose_blockJJ_WidthOk. Now let´ see Heght with odd values
; remainder =12
[Teste4x5:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32]
; remainder = 8
[Teste4x6:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67]
; remainder = 4
[Teste4x7:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67,
                F$ 59, 69, 98, 107]

; remainder = 0
[Teste4x8:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67,
                F$ 59, 69, 98, 107,
                F$ 259, 169, 908, 77]

; 5x5
[Teste5x5:  F$ 1, 2, 3, 4, 6,
                F$ 7, 8, 9, 10, 11
                F$ 13, 14, 15, 16, 12
                F$ 19, 20, 21, 22, 23
                F$ 79, 80, 81, 32, 37]

I´ll start the tests on matrix whose width or height are smaller then 4. Ex: 3x100 ; 2x10, 1x90, 150x2, 100x3 etc

guga · April 03, 2017, 01:57:28 PM

Ok, guys...it is working on all sizes : )

Thank you a lot ! :t :t :t :t

The only condition is that the inputed matrix must be stored on a buffer bigger then 64 bytes (4x4 dwords) and also the output buffer must be at least 64 bytes long too. But..it is working as expected :)

Code Select



Proc Fast_Matrix_Transpose:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @RemainderY, @MaxYPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; get remainders for edi
    mov D@RemainderY 0
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | and ecx 3; Check if value (Height) is a multiple of 4
    jz L1>
        ; found not multiple of 4
        inc D@MaxYPos ; if height have a remainder (i mean, not multiple of 4, increment it)
        mov eax 4 | sub eax ecx | shl eax 2
        mov D@RemainderY eax
L1:
 
    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@MaxYPos
    mov edx esi
    Align 16 ; <---- Must be aligned to 16 to gain more speed and stability

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@RemainderY; adjust edi from the remainder in YPos only
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@RemainderY
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP

jj2007 · April 03, 2017, 06:45:12 PM

Quote from: guga on April 03, 2017, 12:03:20 PMA função que JJ fez é ótima do ponto de vista de rapidez, no entanto preciso apenas tentar fazer funcionar em matrizes não quadradas

My function works fine on non-quadratic matrices, as long as the rows are a multiple of 4.

Code Select

sub edi D@RemainderY; adjust edi from the remainder in YPos only

:t

Here is the original function, in Masm format (full code with speed test in reply #16):

TransposeB proc uses esi edi ebx pSrc, pDest, SrcRows, SrcCols ; fast
mov esi, pSrc
mov edi, pDest
mov eax, SrcCols
push eax
shl eax, 2 ; *4
lea ebx, [eax+2*eax] ; 3*eax
.Repeat
mov ecx, SrcRows
mov edx, esi
sar ecx, 2
align 4
@@: movd xmm1, dword ptr [edx+1*eax] ; +1*eax
movd xmm0, dword ptr [edx] ; +0*eax
pshufd xmm1, xmm1, 01010001b
xorps xmm0, xmm1
movd xmm2, dword ptr [edx+1*ebx] ; 3*eax
movd xmm1, dword ptr [edx+2*eax] ; 2*eax
pshufd xmm2, xmm2, 00010101b
pshufd xmm1, xmm1, 01000101b
xorps xmm0, xmm1
xorps xmm0, xmm2
lea edx, [edx+4*eax]
movups [edi], xmm0
add edi, 4*4 ; advance one xmm reg
dec ecx
jg @B
add esi, 4
dec stack
.Until Zero?
pop edx
if ShowB
mov esi, offset ValueC
For_ ct=0 To 19
For_ ecx=0 To 31
lodsd
Print Str$("%i ", eax)
Next
Print
Next
Exit
endif
ret
TransposeB endp

mineiro · April 03, 2017, 10:31:02 PM

opa, beleza guga;
O projeto soa interessante, você deseja encontrar padrões pelo que supus e isto oferece uma gama de possibilidades onde possa ser usado. Na hora de testar o código com matrizes tente quadrado, retângulo e linha, quadrados e retângulos são formas geométricas que linhas sobrepostas podem formar, e linha nesse contexto significa um número primo, por exemplo uma matriz de 7x1. No código fonte, não cheguei a analisá-lo por completo, daí suponha que apenas tenho uma vaga ideia, mas sendo uma multiplicação então significa que é mais fácil fazer cinco vezes dez do que dez vezes o número cinco, em número de operações matemáticas.
Concordo com o senhor, neste fórum existe exímios programadores, aprendi muito e continuo aprendendo com os membros. Um eterno aprendiz.
Boa sorte no projeto.

guga · April 04, 2017, 02:01:35 AM

Hi mineiro

Obrigado :) Sim, o phash usa padrões e um algoritmo chamado DCT para reduzir as frequências encontradas na imagem. Só é um pouco complicado entender a funcionalidade seguindo o código-fonte original mas, creio que dê para portar para assembly. Minha idéia inicial é fazer uma função que possa calcular o pHash de imagens mas, ao invés de fazer funções que criam e reduzem o tamanho das imagens etc, fazer apenas a que diretamente realizará o trabalho, ou seja, usar como input width, height (e talvez o pitch) e o ponteiro para os pixels (convertidos já para algum formato como Luma, -por exemplo). A biblioteca menciona que convertem a imagem para greyscale mas, pelos testes que estou fazendo em um plugin para o virtualdub (para detectar mudanças de cenas em um filme), é melhor converter já para Luma usando o valor encontrado após os pixels terem sido convertidos para YCbcr ou CieLab. Como se usa apenas o Luma, não é necessário usar toda a função de conversão de YCrcb ou CieLab apenas o início que converte RGB para Luma e o valor de Luma (de 0 pra 1 ou seja, o Luma já está normalizado: dividido por 255) é o que será passado para as matrizes e para o DCT.

Se quiser, de uma lida aqui que explicam melhor todo o funcionamento do phash de modo mais claro:
http://cloudinary.com/blog/how_to_automatically_identify_similar_images_using_phash
http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html

Sobre a função de JJ fiz os testes de maneira extensiva e, em principio está funcionando para todo e qualquer formato da matrix, incluindo lineares, quadradas e retangulares, independente de serem ou não multiplos de 4 e o melhor de tudo, com uma rapidez impecável :)
Os dados dos testes que fiz foram estes:

Code Select


; Matrix 20 X 30
[ValueA:    F$ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
            F$ 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
            F$ 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
            F$ 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80
            F$ 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100
            F$ 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120
            F$ 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140
            F$ 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160
            F$ 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180
            F$ 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200
            F$ 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220
            F$ 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240
            F$ 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260
            F$ 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280
            F$ 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300
            F$ 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320
            F$ 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340
            F$ 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360
            F$ 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380
            F$ 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400
            F$ 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420
            F$ 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440
            F$ 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460
            F$ 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480
            F$ 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500
            F$ 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520
            F$ 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540
            F$ 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560
            F$ 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580
            F$ 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600]

; 5x4
[ ValueD:    F$ 1, 2, 3, 4, 5
                F$ 6, 7, 8, 9, 10
                F$ 11, 12, 13, 14, 15
                F$ 16, 17, 18, 19, 20]
; 6x5
[ValueD2:   F$ 1, 2, 3, 4, 5, 6,
                F$ 7, 8, 9, 10, 11, 12,
                F$ 13, 14, 15, 16, 17, 18,
                F$ 19, 20, 21, 22, 23, 24,
                F$ 25, 26, 27, 28, 29, 30]

; 6 x 6 8 btes ahead
[ValueD3:   F$ 1, 2, 3, 4, 5, 6,
                F$ 7, 8, 9, 10, 11, 12,
                F$ 13, 14, 15, 16, 17, 18,
                F$ 19, 20, 21, 22, 23, 24,
                F$ 25, 26, 27, 28, 29, 30,
                F$ 31, 32, 33, 34, 35, 36]
;6x7 4 bytes ahead
[ ValueD4:   F$ 1, 2, 3, 4, 5, 6,
                F$ 7, 8, 9, 10, 11, 12,
                F$ 13, 14, 15, 16, 17, 18,
                F$ 19, 20, 21, 22, 23, 24,
                F$ 25, 26, 27, 28, 29, 30,
                F$ 31, 32, 33, 34, 35, 36,
                F$ 37, 38, 39, 40, 41, 42]
; 4 x 7
[ValueD5:   F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 25, 26, 27, 28,
                F$ 31, 32, 33, 34,
                F$ 37, 38, 39, 40,
                F$ 41, 42, 43, 44]

; [ValueD6:   F$ 1, 2, 3, 4, 77,
                F$ 7, 8, 9, 10, 64,
                F$ 13, 14, 15, 16, 79
                F$ 19, 20, 21, 22, 98
                F$ 25, 26, 27, 28, 101
                F$ 31, 32, 33, 34, 222
                F$ 37, 38, 39, 40, 57
                F$ 41, 42, 43, 44, 58]


; 4x4 OK Fast_transpose_blockJJ4544
[Teste4x4:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22]
; 5x4
[Teste5x4:  F$ 1, 2, 3, 4, 5,
                F$ 6, 7, 8, 9, 10,
                F$ 11, 12, 13, 14, 15,
                F$ 16, 17, 18, 19, 20]

[Teste6x4:  F$ 1, 2, 3, 4, 5, 6,
                F$ 26, 7, 8, 9, 10,27
                F$ 11, 12, 13, 14, 15,28
                F$ 16, 17, 18, 19, 20,29]

[Teste7x4:  F$ 1, 2, 3, 4, 5, 6, 77,
                F$ 26, 7, 8, 9, 10,27, 102
                F$ 11, 12, 13, 14, 15,28, 95
                F$ 16, 17, 18, 19, 20,29, 35]

[Teste8x4:  F$ 1, 2, 3, 4, 5, 6, 77, 222,
                F$ 26, 7, 8, 9, 10,27, 102, 67,
                F$ 11, 12, 13, 14, 15,28, 95, 77,
                F$ 16, 17, 18, 19, 20,29, 35, 91]

; for Height = 4. Width works ok on any size Fast_transpose_blockJJ_WidthOk. Now let´s see Height with odd values
; remainder =12
[Teste4x5:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32]
; remainder = 8
[Teste4x6:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67]
; remainder = 4
[Teste4x7:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67,
                F$ 59, 69, 98, 107]

; remainder = 0
[Teste4x8:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67,
                F$ 59, 69, 98, 107,
                F$ 259, 169, 908, 77]

; 5x5
[Teste5x5:  F$ 1, 2, 3, 4, 6,
                F$ 7, 8, 9, 10, 11
                F$ 13, 14, 15, 16, 12
                F$ 19, 20, 21, 22, 23
                F$ 79, 80, 81, 32, 37]

; Fast_transpose_blockJJ_YESSSS WORKS FOR EVERYTHING BIGGER THEN 4

; Now let´ start with height or width smaller then 4

; 1st we test for width be smaller then 4

; remainder = 0; works. no adjustment
[Teste3x4:  F$ 1, 2, 3,
                F$ 7, 8, 9,
                F$ 13, 14, 15,
                F$ 19, 20, 21]

; remainder = 0; works. no adjustment
[Teste2x4:  F$ 1, 2,
                F$ 7, 8,
                F$ 13, 14,
                F$ 19, 20]

; remainder = 0; works. no adjustment
[Teste1x4:  F$ 1,
                F$ 7,
                F$ 13,
                F$ 19,]

; remainder = 0; works. no adjustment
[Teste3x5:  F$ 1, 2, 3,
                F$ 7, 8, 9,
                F$ 13, 14, 15,
                F$ 19, 20, 21,
                F$ 89, 70, 51]

; remainder = 0; works. no adjustment
[Teste2x5:  F$ 1, 2,
                F$ 7, 8,
                F$ 13, 14,
                F$ 19, 20,
                F$ 11, 122]

; remainder = 0; works. no adjustment
[Teste1x5:  F$ 1,
                F$ 7,
                F$ 13,
                F$ 19
                F$ 77]

; Now we test for height be smaller then 4

; remainder = 0; works. no adjustment
[Teste4x3:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16]

[Teste5x3:  F$ 1, 2, 3, 4, 77,
                F$ 7, 8, 9, 10, 91,
                F$ 13, 14, 15, 16,125]

[Teste5x2:  F$ 1, 2, 3, 4, 77,
                F$ 7, 8, 9, 10, 91]

[Teste5x1:  F$ 1, 2, 3, 4, 77]

[Teste1x1:  F$ 1]

mineiro · April 04, 2017, 03:11:06 AM

opa guga;
Agora o projeto ficou ainda mais audaz após ler os links postados, muito bacana mesmo. Não conhecia sobre perceputal hash, me veio na memória perceptron no ramo da inteligência artificial mas agora consigo pensar em outros algoritmos que no futuro podem ser agregados para uma automação.
http://courses.cs.washington.edu/courses/cse599/01wi/admin/Assignments/nn.html
Lembro que existia um programa chamado Altamira Composer, e este oferecia um zoom infinito, e não sei como funciona este algoritmo.
Tenho uma pergunta guga; o algoritmo é eficiente entre objetos similares? Pensei em uma laranja, mexerica e limão doce (tangerina). Tem como encontrar apenas laranjas neste pomar de fotos?

The MASM Forum

News:

Fast Transposing Matrix Algorithm for FPU Data

guga

jj2007

guga

guga

guga

guga

jj2007

mineiro

guga

guga

guga

jj2007

mineiro

guga

mineiro