News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Fast Transposing Matrix Algorithm for FPU Data

Started by guga, April 01, 2017, 06:01:08 PM

Previous topic - Next topic

guga

Thanks SiekManski. I´l take a look. It seems pretty fast.

Hi JJ

I made another test and a check for the size.  So far it is working for any Height longer then 4 bytes and not multiple of 4. Tomorrow i´ll test to see if it works also on height smaller then 4


Proc Fast_transpose_blockJJguga:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output
    mov ebx D@Width |  mov D@CurXPos ebx | shl ebx 2

    ; get remainders for edi
    xor eax eax
    mov edx D@Height | mov ecx edx | shr edx 2 | shl edx 2 | sub ecx edx ; integer count divide by 16 bytes (4 dwords)
    jz L1>
         mov eax 4 | sub eax ecx | shl eax 2
   
L1:   

L2:
    xor ecx ecx
    xor edx edx

    Do
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM0 D$esi+edx | add edx ebx
        movd XMM1 D$esi+edx | add edx ebx
        movd XMM2 D$esi+edx | add edx ebx
        movd XMM3 D$esi+edx | add edx ebx

        pslldq XMM1 4
        pslldq XMM2 8
        pslldq XMM3 12

        xorps XMM0 XMM1
        xorps XMM0 XMM2
        xorps XMM0 XMM3

        movups X$edi XMM0 ; <---- must be movups (unaligned) due to the remainders in height. otherwise it will crash. it work regardless the output in aligned or not
        add edi 16 ; 4*4
        add ecx 4
    Loop_Until ecx >= D@Height
    sub edi eax; adjust edi from the remainder
    add esi 4
    dec D@CurXPos | jnz L2<<

    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0
L1:

    mov eax D@Output

EndP



I´ll check tomorrow if it works when Height or width are smaller then 4. So far, it works for 6x5; 6x6; 6x7, 30x32, 32x30 ....

i´ll test tomorrow for 4x30, 3x10, 10x3, 1x2, 2x20 etc :)

If it works on those situations, i´ll see if i can be able to adapt Siekmanski onto your variation as well. On this way we may have something around 20-30% of speed gain, i suppose ?.

I realise that small matrix don´t actually needs speed, but we may have a matrix containing a small height (smaller then 4) and a higher width. Ex: width = 100, height = 3 (Or vice-versa)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

#16
Time for some timings?Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

3442    cycles for 100 * TransposeA
570     cycles for 100 * TransposeB
786     cycles for 100 * TransposeC (old)

2285    cycles for 100 * TransposeA
571     cycles for 100 * TransposeB
772     cycles for 100 * TransposeC (old)

2285    cycles for 100 * TransposeA
570     cycles for 100 * TransposeB
786     cycles for 100 * TransposeC (old)

2284    cycles for 100 * TransposeA
570     cycles for 100 * TransposeB
770     cycles for 100 * TransposeC (old)

1913    cycles for 100 * TransposeA
570     cycles for 100 * TransposeB
784     cycles for 100 * TransposeC (old)

61      bytes for TransposeA
111     bytes for TransposeB
109     bytes for TransposeC (old)


P.S.: Don't take the slow version I posted earlier. Take V3 instead :greensml:

guga

Great, but not working for non multiple of 4 matrices

Try with this:
Width = 6
Height = 7
[ValueD4:   D$ 1, 2, 3, 4, 5, 6,
                D$ 7, 8, 9, 10, 11, 12,
                D$ 13, 14, 15, 16, 17, 18,
                D$ 19, 20, 21, 22, 23, 24,
                D$ 25, 26, 27, 28, 29, 30,
                D$ 31, 32, 33, 34, 35, 36,
                D$ 37, 38, 39, 40, 41, 42]


The other version worked on non multiple of 4.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Hi JJ

Ok...i guess i fix it to make work for 6x7 and not multiple of 4

It seems to work on 6x7, 20x30 etc


Proc Fast_transpose_blockJJgugaNew:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @Reminder
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output
    ;mov ebx D@Width |  mov D@CurXPos ebx | shl ebx 2

    ; get remainders for edi
    xor eax eax
    mov edx D@Height | mov ecx edx | shr edx 2 | shl edx 2 | sub ecx edx ; integer count divide by 16 bytes (4 dwords)
    jz L1>
         mov eax 4 | sub eax ecx | shl eax 2
L1:   
    mov D@Reminder eax

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@Height
    mov edx esi
    sar ecx 2 | jnc L3> | inc ecx | clc | L3: ; if height have a remainder (i mean, not multiple of 4, increment it). And clear Carry flag just for safety
    ;Align 4 Alignment not really needed ?

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@Reminder; | mov D$edi 0; adjust edi from the remainder
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@Reminder
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP


Speed tests:

With fixing the height (jnc) 249.51956409010458 ns
Without fixing the height (not work for non 4 multiples): 243.87414443039290 ns (A bit faster, because it misses the data for non multiple of 4)

Old version: 289.57585729750531 ns

Gain of speed from new to old: something around 14%


I´ll start making the tests for height and width smaller then 4 and also exchanging the width and height (To see if it works when height is smaller then width or vice-versa)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Hmm...it seems that align4 make no effect, but align 16 gained more speed 8)


; fastest
Proc Fast_transpose_blockJJgugaNew:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @Reminder
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; get remainders for edi
    xor eax eax
    mov edx D@Height | mov ecx edx | shr edx 2 | shl edx 2 | sub ecx edx ; integer count divide by 16 bytes (4 dwords)
    jz L1>
         mov eax 4 | sub eax ecx | shl eax 2
L1:
    mov D@Reminder eax

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@Height
    mov edx esi
    sar ecx 2 | jnc L3> | inc ecx | clc | L3: ; if height have a remainder (i mean, not multiple of 4, increment it)
    Align 16 ; <--------- Must be aligned with 16 to gain more speed

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@Reminder; | mov D$edi 0; adjust edi from the remainder
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@Reminder
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP


New result: 217.25499826251323 ns
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Another faster version that seems to work with any size (width and height) both bigger then 4 bytes

The previous "sar" operation is no longer needed. Since we can compute the max value of Y (Col) only once and not inside each loop



; fastest
Proc Fast_transpose_blockJJ:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @Reminder, @MAxYPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; get remainders for edi
    xor eax eax
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | shl edx 2 | sub ecx edx ; integer count divide by 4 dwords
    jz L1>
        inc D@MaxYPos  ; if height have a remainder (i mean, not multiple of 4, increment it)
        mov eax 4 | sub eax ecx | shl eax 2
L1:
    mov D@Reminder eax

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@MaxYPos ; < No longer needed "sar" operation since we alreadu calculated the max value of Y (Col/Height)
    mov edx esi
    Align 16 ; <---- Must be aligned to 16 to gain more speed and stability

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@Reminder; | mov D$edi 0; adjust edi from the remainder to we reach the correct position
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@Reminder
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP


New timming: 189.46967097307137 ns


JJ, It is something around 50% faster then the old version (the 1st one) and something around 30% faster then your new one. It seems that using align 16 and putting a variable to hold the Maximum Y Position (col) was the necessary to gain more speed. Can you test on yours ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Quote from: guga on April 03, 2017, 05:13:32 AMJJ, It is something around 50% faster then the old version (the 1st one) and something around 30% faster then your new one. It seems that using align 16 and putting a variable to hold the Maximum Y Position (col) was the necessary to gain more speed. Can you test on yours ?

Sorry, can't test that one, I know Masm syntax only. From what I see, it is unlikely that your version is any faster than mine, though.

mineiro

hello sir guga, I hope you're fine irmão;
I don't have tested because don't have rosasm in hands, so I can only give suggestion.
On line:
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | shl edx 2 | sub ecx edx ; integer count divide by 4 dwords
You can get remainder of a 4 division by "and" number with (X-1) mask. This only works to 2,4,8,16,... . To get remainder of 8 the mask is 8-1==7, to 16 the mask is 15, ... .
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | and ecx,3

If conditional or inconditional jump address is aligned we have a gain. So, if you analyse opcode generated, from my tests, sometimes is better use add instead of inc per example, because you're align next intruction or a label to a even address.
Not remember where I have read, but, text said that if you like to read just a byte on an odd adress, so machine internally read a word, discard one byte and move a byte. So, even address speed up code. Instructions on even address perfoms better than on odd address.
Some codes that I have see done to larger matrix just get remainder and get address to store too and do calculus. The idea is align address to 16 multiple. They do something like a head code (to align remainder and address) on byte/word/dword/qword way until address its a 16 multiple where the gain appear, and a tail to deal with bytes too.

I also forgot to say, on line below:
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<
add instruction set flags and dec instruction too, this create a bit dependency. You can test with lea that don't change flags to add a displacement, so flag dependency will be on dec instruction.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

guga

Oi Mineiro

Obrigado pela dica.  usando o and ecx 3 fica realmente mais pratico para encontrar a posição correta do output em edi. Estou testando agora para ver se funciona com varios tamanhos diferentes de comprimento e altura. Estou tentando fazer algo relacionado com tranposição de matrizes com tamanhos diferentes pois estou testando um algoritmo para identificar imagens a partir de um hash. A biblioteca se chama "pHash" e, em princípio, utiliza a transposição de matrizes após os pixels terem sido transformados para Luma. Ele dispõe o resultado em matrizes no formato de 32x32 que representam o padrão de iluminação da imagem (na verdade, ele cria uma nova imagem em 32x32) para que o hash possa ser gerado. No entanto, pelo que vi do código, se for utilizado nova matriz do mesmo tamanho da imagem original, o algoritmo funcionaria de forma mais eficiente.  O maior problema é que a biblioteca que usam é muito muito lenta. Usam uma versão alterada de cimg que tem dentro dela a transposição de matrizes, mas, como disse é bastante lenta por isso precisaria de uma função melhorada para a transposição de matrizes de qualquer tamanho sejam quadradas ou retangulares (até para poder utilizar em qualquer outro aplicativo e não apenas neste caso específico).

A função que JJ fez é ótima do ponto de vista de rapidez, no entanto preciso apenas tentar fazer funcionar em matrizes não quadradas e, ainda assim, tentar manter a mesma rapidez quanto o código original que ele criou.

Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Ok...I guess i got it working for retangular matrices (Quadratic or not). However, Width and Height on this version must be bigger then 4 (both)


; fastest
Proc Fast_transpose_blockJJ:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @RemainderY, @MaxYPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; get remainders for edi
    mov D@RemainderY 0
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | and ecx 3; Check if value (Height) is a multiple of 4
    jz L1>
        ; found not multiple of 4
        inc D@MaxYPos ; if height have a remainder (i mean, not multiple of 4, increment it)
        mov eax 4 | sub eax ecx | shl eax 2
        mov D@RemainderY eax
L1:

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@MaxYPos
    mov edx esi
    Align 16 ; <---- Must be aligned to 16 to gain more speed and stability

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@RemainderY; adjust edi from the remainder in YPos only
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@RemainderY
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP


; Speed results using "and ecx" 213.78110071619122 ns
The tests were done on these data sets



; 4x4 OK Fast_transpose_blockJJ4544. Start testing the newer versions for retangular matrixes
[Teste4x4:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22]
; 5x4
[Teste5x4:  F$ 1, 2, 3, 4, 5,
                F$ 6, 7, 8, 9, 10,
                F$ 11, 12, 13, 14, 15,
                F$ 16, 17, 18, 19, 20]

[Teste6x4:  F$ 1, 2, 3, 4, 5, 6,
                F$ 26, 7, 8, 9, 10,27
                F$ 11, 12, 13, 14, 15,28
                F$ 16, 17, 18, 19, 20,29]

[Teste7x4:  F$ 1, 2, 3, 4, 5, 6, 77,
                F$ 26, 7, 8, 9, 10,27, 102
                F$ 11, 12, 13, 14, 15,28, 95
                F$ 16, 17, 18, 19, 20,29, 35]

[Teste8x4:  F$ 1, 2, 3, 4, 5, 6, 77, 222,
                F$ 26, 7, 8, 9, 10,27, 102, 67,
                F$ 11, 12, 13, 14, 15,28, 95, 77,
                F$ 16, 17, 18, 19, 20,29, 35, 91]

; for Height = 4. Width works ok on any size Fast_transpose_blockJJ_WidthOk. Now let´ see Heght with odd values
; remainder =12
[Teste4x5:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32]
; remainder = 8
[Teste4x6:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67]
; remainder = 4
[Teste4x7:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67,
                F$ 59, 69, 98, 107]

; remainder = 0
[Teste4x8:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67,
                F$ 59, 69, 98, 107,
                F$ 259, 169, 908, 77]

; 5x5
[Teste5x5:  F$ 1, 2, 3, 4, 6,
                F$ 7, 8, 9, 10, 11
                F$ 13, 14, 15, 16, 12
                F$ 19, 20, 21, 22, 23
                F$ 79, 80, 81, 32, 37]



I´ll start the tests on matrix whose width or height are smaller then 4.  Ex: 3x100 ; 2x10, 1x90, 150x2, 100x3 etc
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

Ok, guys...it is working on all sizes : )

Thank you a lot !  :t :t :t :t

The only condition is that the inputed matrix must be stored on a buffer bigger then 64 bytes (4x4 dwords) and also the output buffer must be at least 64 bytes long too. But..it is working as expected :)



Proc Fast_Matrix_Transpose:
    Arguments @Input, @Output, @Width, @Height
    Local @CurXPos, @RemainderY, @MaxYPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; get remainders for edi
    mov D@RemainderY 0
    mov edx D@Height | mov ecx edx | shr edx 2 | mov D@MaxYPos edx | and ecx 3; Check if value (Height) is a multiple of 4
    jz L1>
        ; found not multiple of 4
        inc D@MaxYPos ; if height have a remainder (i mean, not multiple of 4, increment it)
        mov eax 4 | sub eax ecx | shl eax 2
        mov D@RemainderY eax
L1:

    mov eax D@Width |  mov D@CurXPos eax | shl eax 2 | lea ebx D$eax+eax*2

L2:
    mov ecx D@MaxYPos
    mov edx esi
    Align 16 ; <---- Must be aligned to 16 to gain more speed and stability

    L8:
         ; copy the 1st 4 dwords from esi to register XMM
        movd XMM1 D$edx+eax
        movd XMM0 D$edx     ; +0*eax
        pshufd XMM1 XMM1 00__0101_0001
        xorps XMM0 XMM1
        movd XMM2 D$edx+ebx ; 3*eax
        movd xmm1 D$edx+eax*2
        pshufd xmm2 xmm2 00_00010101
        pshufd xmm1 xmm1 00_01000101
        xorps xmm0 xmm1
        xorps xmm0 xmm2
        lea edx D$edx+eax*4
        movups X$edi xmm0
        add edi (4*4) ; advance one xmm reg
        dec ecx | jg L8<

    sub edi D@RemainderY; adjust edi from the remainder in YPos only
    add esi 4
    dec D@CurXPos | jnz L2<<

    mov eax D@RemainderY
    ; clear remainder bytes if any
    test eax eax | jz L1>
        shr eax 2 | jz L1>  | mov D$edi 0
        dec eax | jz L1>    | mov D$edi+4 0
        dec eax | jz L1>    | mov D$edi+8 0

L1:

    mov eax D@Output

EndP




Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Quote from: guga on April 03, 2017, 12:03:20 PMA função que JJ fez é ótima do ponto de vista de rapidez, no entanto preciso apenas tentar fazer funcionar em matrizes não quadradas

My function works fine on non-quadratic matrices, as long as the rows are a multiple of 4.

sub edi D@RemainderY; adjust edi from the remainder in YPos only

:t

Here is the original function, in Masm format (full code with speed test in reply #16):

TransposeB proc uses esi edi ebx pSrc, pDest, SrcRows, SrcCols      ; fast
  mov esi, pSrc
  mov edi, pDest
  mov eax, SrcCols
  push eax
  shl eax, 2            ; *4
  lea ebx, [eax+2*eax]      ; 3*eax
  .Repeat
      mov ecx, SrcRows
      mov edx, esi
      sar ecx, 2
      align 4
@@:      movd xmm1, dword ptr [edx+1*eax]      ; +1*eax
      movd xmm0, dword ptr [edx]            ; +0*eax
      pshufd xmm1, xmm1, 01010001b
      xorps xmm0, xmm1
      movd xmm2, dword ptr [edx+1*ebx]      ; 3*eax
      movd xmm1, dword ptr [edx+2*eax]      ; 2*eax
      pshufd xmm2, xmm2, 00010101b
      pshufd xmm1, xmm1, 01000101b
      xorps xmm0, xmm1
      xorps xmm0, xmm2
      lea edx, [edx+4*eax]
      movups [edi], xmm0
      add edi, 4*4      ; advance one xmm reg
      dec ecx
      jg @B
      add esi, 4
      dec stack
  .Until Zero?
  pop edx
  if ShowB
      mov esi, offset ValueC
      For_ ct=0 To 19
            For_ ecx=0 To 31
                  lodsd
                  Print Str$("%i ", eax)
            Next
            Print
      Next

      Exit
  endif
  ret
TransposeB endp

mineiro

opa, beleza guga;
O projeto soa interessante, você deseja encontrar padrões pelo que supus e isto oferece uma gama de possibilidades onde possa ser usado. Na hora de testar o código com matrizes tente quadrado, retângulo e linha, quadrados e retângulos são formas geométricas que linhas sobrepostas podem formar, e linha nesse contexto significa um número primo, por exemplo uma matriz de 7x1. No código fonte, não cheguei a analisá-lo por completo, daí suponha que apenas tenho uma vaga ideia, mas sendo uma multiplicação então significa que é mais fácil fazer cinco vezes dez do que dez vezes o número cinco, em número de operações matemáticas.
Concordo com o senhor, neste fórum existe exímios programadores, aprendi muito e continuo aprendendo com os membros. Um eterno aprendiz.
Boa sorte no projeto.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

guga

Hi mineiro

Obrigado :) Sim, o phash usa padrões e um algoritmo chamado DCT para reduzir as frequências encontradas na imagem. Só é um pouco complicado entender a funcionalidade seguindo o código-fonte original mas, creio que dê para portar para assembly. Minha idéia inicial é fazer uma função que possa calcular o pHash de imagens mas, ao invés de fazer funções que criam e reduzem o tamanho das imagens etc, fazer apenas a que diretamente realizará o trabalho, ou seja, usar como input width, height (e talvez o pitch)  e o ponteiro para os pixels (convertidos já para algum formato como Luma, -por exemplo). A biblioteca menciona que convertem a imagem para greyscale mas, pelos testes que estou fazendo em um plugin para o virtualdub (para detectar mudanças de cenas em um filme), é melhor converter já para Luma usando o valor encontrado após os pixels terem sido convertidos para YCbcr ou CieLab. Como se usa apenas o Luma, não é necessário usar toda a função de conversão de YCrcb ou CieLab apenas o início que converte RGB para Luma e o valor de Luma (de 0 pra 1 ou seja, o Luma já está normalizado: dividido por 255) é o que será passado para as matrizes e para o DCT.

Se quiser, de uma lida aqui que explicam melhor todo o funcionamento do phash de modo mais claro:
http://cloudinary.com/blog/how_to_automatically_identify_similar_images_using_phash
http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html

Sobre a função de JJ fiz os testes de maneira extensiva e, em principio está funcionando para todo e qualquer formato da matrix, incluindo lineares, quadradas e retangulares, independente de serem ou não multiplos de 4 e o melhor de tudo, com uma rapidez impecável :)
Os dados dos testes que fiz foram estes:

; Matrix 20 X 30
[ValueA:    F$ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
            F$ 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40
            F$ 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
            F$ 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80
            F$ 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100
            F$ 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120
            F$ 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140
            F$ 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160
            F$ 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180
            F$ 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200
            F$ 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220
            F$ 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240
            F$ 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260
            F$ 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280
            F$ 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300
            F$ 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320
            F$ 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340
            F$ 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360
            F$ 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380
            F$ 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400
            F$ 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420
            F$ 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440
            F$ 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460
            F$ 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480
            F$ 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500
            F$ 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520
            F$ 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540
            F$ 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560
            F$ 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580
            F$ 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600]

; 5x4
[ ValueD:    F$ 1, 2, 3, 4, 5
                F$ 6, 7, 8, 9, 10
                F$ 11, 12, 13, 14, 15
                F$ 16, 17, 18, 19, 20]
; 6x5
[ValueD2:   F$ 1, 2, 3, 4, 5, 6,
                F$ 7, 8, 9, 10, 11, 12,
                F$ 13, 14, 15, 16, 17, 18,
                F$ 19, 20, 21, 22, 23, 24,
                F$ 25, 26, 27, 28, 29, 30]

; 6 x 6 8 btes ahead
[ValueD3:   F$ 1, 2, 3, 4, 5, 6,
                F$ 7, 8, 9, 10, 11, 12,
                F$ 13, 14, 15, 16, 17, 18,
                F$ 19, 20, 21, 22, 23, 24,
                F$ 25, 26, 27, 28, 29, 30,
                F$ 31, 32, 33, 34, 35, 36]
;6x7 4 bytes ahead
[ ValueD4:   F$ 1, 2, 3, 4, 5, 6,
                F$ 7, 8, 9, 10, 11, 12,
                F$ 13, 14, 15, 16, 17, 18,
                F$ 19, 20, 21, 22, 23, 24,
                F$ 25, 26, 27, 28, 29, 30,
                F$ 31, 32, 33, 34, 35, 36,
                F$ 37, 38, 39, 40, 41, 42]
; 4 x 7
[ValueD5:   F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 25, 26, 27, 28,
                F$ 31, 32, 33, 34,
                F$ 37, 38, 39, 40,
                F$ 41, 42, 43, 44]

; [ValueD6:   F$ 1, 2, 3, 4, 77,
                F$ 7, 8, 9, 10, 64,
                F$ 13, 14, 15, 16, 79
                F$ 19, 20, 21, 22, 98
                F$ 25, 26, 27, 28, 101
                F$ 31, 32, 33, 34, 222
                F$ 37, 38, 39, 40, 57
                F$ 41, 42, 43, 44, 58]


; 4x4 OK Fast_transpose_blockJJ4544
[Teste4x4:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22]
; 5x4
[Teste5x4:  F$ 1, 2, 3, 4, 5,
                F$ 6, 7, 8, 9, 10,
                F$ 11, 12, 13, 14, 15,
                F$ 16, 17, 18, 19, 20]

[Teste6x4:  F$ 1, 2, 3, 4, 5, 6,
                F$ 26, 7, 8, 9, 10,27
                F$ 11, 12, 13, 14, 15,28
                F$ 16, 17, 18, 19, 20,29]

[Teste7x4:  F$ 1, 2, 3, 4, 5, 6, 77,
                F$ 26, 7, 8, 9, 10,27, 102
                F$ 11, 12, 13, 14, 15,28, 95
                F$ 16, 17, 18, 19, 20,29, 35]

[Teste8x4:  F$ 1, 2, 3, 4, 5, 6, 77, 222,
                F$ 26, 7, 8, 9, 10,27, 102, 67,
                F$ 11, 12, 13, 14, 15,28, 95, 77,
                F$ 16, 17, 18, 19, 20,29, 35, 91]

; for Height = 4. Width works ok on any size Fast_transpose_blockJJ_WidthOk. Now let´s see Height with odd values
; remainder =12
[Teste4x5:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32]
; remainder = 8
[Teste4x6:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67]
; remainder = 4
[Teste4x7:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67,
                F$ 59, 69, 98, 107]

; remainder = 0
[Teste4x8:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22,
                F$ 79, 80, 81, 32,
                F$ 39, 49, 68, 67,
                F$ 59, 69, 98, 107,
                F$ 259, 169, 908, 77]

; 5x5
[Teste5x5:  F$ 1, 2, 3, 4, 6,
                F$ 7, 8, 9, 10, 11
                F$ 13, 14, 15, 16, 12
                F$ 19, 20, 21, 22, 23
                F$ 79, 80, 81, 32, 37]

; Fast_transpose_blockJJ_YESSSS WORKS FOR EVERYTHING BIGGER THEN 4

; Now let´ start with height or width smaller then 4

; 1st we test for width be smaller then 4

; remainder = 0; works. no adjustment
[Teste3x4:  F$ 1, 2, 3,
                F$ 7, 8, 9,
                F$ 13, 14, 15,
                F$ 19, 20, 21]

; remainder = 0; works. no adjustment
[Teste2x4:  F$ 1, 2,
                F$ 7, 8,
                F$ 13, 14,
                F$ 19, 20]

; remainder = 0; works. no adjustment
[Teste1x4:  F$ 1,
                F$ 7,
                F$ 13,
                F$ 19,]

; remainder = 0; works. no adjustment
[Teste3x5:  F$ 1, 2, 3,
                F$ 7, 8, 9,
                F$ 13, 14, 15,
                F$ 19, 20, 21,
                F$ 89, 70, 51]

; remainder = 0; works. no adjustment
[Teste2x5:  F$ 1, 2,
                F$ 7, 8,
                F$ 13, 14,
                F$ 19, 20,
                F$ 11, 122]

; remainder = 0; works. no adjustment
[Teste1x5:  F$ 1,
                F$ 7,
                F$ 13,
                F$ 19
                F$ 77]

; Now we test for height be smaller then 4

; remainder = 0; works. no adjustment
[Teste4x3:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16]

[Teste5x3:  F$ 1, 2, 3, 4, 77,
                F$ 7, 8, 9, 10, 91,
                F$ 13, 14, 15, 16,125]

[Teste5x2:  F$ 1, 2, 3, 4, 77,
                F$ 7, 8, 9, 10, 91]

[Teste5x1:  F$ 1, 2, 3, 4, 77]

[Teste1x1:  F$ 1]
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

mineiro

opa guga;
Agora o projeto ficou ainda mais audaz após ler os links postados, muito bacana mesmo. Não conhecia sobre perceputal hash, me veio na memória perceptron no ramo da inteligência artificial mas agora consigo pensar em outros algoritmos que no futuro podem ser agregados para uma automação.
http://courses.cs.washington.edu/courses/cse599/01wi/admin/Assignments/nn.html
Lembro que existia um programa chamado Altamira Composer, e este oferecia um zoom infinito, e não sei como funciona este algoritmo.
Tenho uma pergunta guga; o algoritmo é eficiente entre objetos similares? Pensei em uma laranja, mexerica e limão doce (tangerina). Tem como encontrar apenas laranjas neste pomar de fotos?
I'd rather be this ambulant metamorphosis than to have that old opinion about everything