Fast Matrix Flip

Started by guga, April 07, 2017, 03:56:33 AM

guga

Hi guys

Continuing with the matrix operations, I built one that can flip a matrix along the X (width) axis using SSE (thanks, Jochen and Marinus).

The goal was to flip matrices like this:

[Teste4x4:  F$ 1, 2, 3, 4,
                F$ 7, 8, 9, 10,
                F$ 13, 14, 15, 16,
                F$ 19, 20, 21, 22]
into:

[Teste4x4Inverted:  F$ 4, 3, 2, 1,
                F$ 10, 9, 8, 7,
                F$ 16, 15, 14, 13,
                F$ 22, 21, 20, 19]
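
For reference, the flip shown above can also be written as a plain scalar loop. The C++ sketch below is not from the original post: the name flip_horizontal_ref, the row-major layout and the use of 32-bit integers (any 4-byte element, such as the F$ floats above, behaves the same way) are assumptions made for illustration. Each row is reversed on its own, so nothing in it depends on the matrix being square.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference (hypothetical helper, not from the thread):
// out[y][x] = in[y][width - 1 - x] for a row-major width x height matrix
// of 4-byte elements. The two buffers must not overlap.
void flip_horizontal_ref(const std::int32_t* in, std::int32_t* out,
                         std::size_t width, std::size_t height) {
    assert(in != out);
    for (std::size_t y = 0; y < height; ++y) {
        const std::int32_t* src = in + y * width;  // start of the source row
        std::int32_t* dst = out + y * width;       // start of the destination row
        for (std::size_t x = 0; x < width; ++x) {
            dst[x] = src[width - 1 - x];           // mirror the column index
        }
    }
}
```

A routine like this is slow compared with an SSE version, but it is useful as a checker: run both on the same input and compare the two outputs element by element.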

The problem, however, remains with non-square matrices. I'm still struggling with how to set the proper flags, or with the math involved, when dealing with non-square matrices (such as 3x2, 27x18, 13x9, etc.).

The code I made for square matrices along the X axis is:



Proc SquaredMatrix_FlipHorizontal_SSE2new:
    Arguments @Input, @Output, @Width, @Height
    Local @MaxXPos, @CurYPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    mov edx D@Height | mov D@CurYPos edx

    mov ebx D@Width
    mov eax ebx | shr eax 2 | mov D@MaxXPos eax ; MaxXPos = Width/4 (xmm moves per row)
    shl ebx 2 | mov eax ebx ; ebx = eax = Width*4 (bytes per scanline)


L2:
    mov ecx D@MaxXPos
    mov edx esi
    Align 64 ; <---- Must be aligned to 64 to gain more speed and stability. (If align to 16 the result is a bit slow)

    L8:

        movdqu XMM0 X$edx+eax-16 ; edx+(Width*4)-16
        pshufd XMM0 XMM0 27 ; invert all 4 dwords from left to right
        sub edx (4*4)
        movups X$edi xmm0
        add edi (4*4)
        dec ecx | jg L8<

    add eax ebx; next scanline in ebx
    dec D@CurYPos | jnz L2<<

    mov eax D@Output

EndP

Timing for the whole function is only 128.74 nanoseconds :)
Aligning to 16 reduces the speed a bit, resulting in something around 136 nanoseconds.


But... how to make it work for non-square matrices and also keep it fast? What is the math operation involved for non-square matrices?


I don't remember the MASM syntax exactly, but the RosAsm code above is probably something like this:



; Needs .686 and .xmm at the top of the file.
; WIDTH is a reserved MASM operator, so the size parameters are named MatWidth/MatHeight here.
SquaredMatrix_FlipHorizontal_SSE2new proc uses esi edi ebx ecx edx Input:dword, Output:dword, MatWidth:dword, MatHeight:dword

LOCAL CurYPos: dword
LOCAL MaxXPos: dword

mov esi, Input
mov edi, Output
mov edx, MatHeight
mov CurYPos, edx
mov ebx, MatWidth
mov eax, ebx
shr eax, 2
mov MaxXPos, eax ; MaxXPos = Width/4
shl ebx, 2 ; ebx = Width*4 (bytes per scanline)
mov eax, ebx

Loop2:
mov ecx, MaxXPos
mov edx, esi
; ---------------------------------------------------------------------------
align 16 ; MASM spells it "align N"; a value above the segment alignment (such as 64) is normally rejected
; ---------------------------------------------------------------------------

Loop1:

movdqu xmm0, xmmword ptr [eax+edx-10h]
pshufd xmm0, xmm0, 1Bh ; reverse the 4 dwords
sub edx, 10h
movups xmmword ptr [edi], xmm0
add edi, 10h
dec ecx
jg Loop1
add eax, ebx ; next scanline
dec CurYPos
jnz Loop2
mov eax, Output
ret

SquaredMatrix_FlipHorizontal_SSE2new endp
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com
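
Reading the loop in the first post, each row is handled on its own: the inner loop runs Width/4 times and the outer loop runs Height times. That suggests the routine is not really limited to square matrices. What it appears to need is a width that is a multiple of 4, and none of the sizes mentioned above (3x2, 27x18, 13x9) has one. The C++ sketch below mirrors the same loop with SSE2 intrinsics so the idea can be checked on, say, an 8x2 matrix. The function name is hypothetical, and the scalar fallback is there only so that the sketch also builds on non-x86 targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_IX86)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical C++ counterpart of the loop in the first post.
// Assumes width % 4 == 0; height is arbitrary. Buffers must not overlap.
void flip_rows_width_mult4(const std::int32_t* in, std::int32_t* out,
                           std::size_t width, std::size_t height) {
    assert(width % 4 == 0 && in != out);
    for (std::size_t y = 0; y < height; ++y) {
        const std::int32_t* src = in + (y + 1) * width;  // one past the row end
        std::int32_t* dst = out + y * width;             // start of the output row
        for (std::size_t i = 0; i < width / 4; ++i) {
            src -= 4;                                    // walk the row backwards
#ifdef SKETCH_HAVE_SSE2
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
            v = _mm_shuffle_epi32(v, 0x1B);              // 27: reverse the 4 dwords
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
#else
            for (int k = 0; k < 4; ++k) dst[k] = src[3 - k];
#endif
            dst += 4;
        }
    }
}
```

If that reading is right, the open question in the post is less about square versus non-square and more about what to do with the last 1 to 3 columns when the width is not a multiple of 4.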

Siekmanski

Fast transposing, for even or uneven sizes, can be done with a 4 * 4 matrix.
Reserve enough memory to read from and write to (aligned to 4 * 16 bytes).
Read 4 * 4 pixels at once, with the correct memory steps for the rows and columns.
Write 4 * 4 pixels at once, also with the correct 4 * 4 block memory steps.

It seems illogical to use a 4 * 4 matrix for uneven image sizes (the unused pixels...), but that only affects the last 1, 2 or 3 pixels of the right border and the bottom border.
If you process larger image sizes than in the example below, it is really mega fast (12 cycles per 16 pixels on my PC).


;The 4 * 4 transpose algorithm:
;     In:         Out:
; [0 1 2 3]    [0 4 8 C]
; [4 5 6 7]    [1 5 9 D]
; [8 9 A B]    [2 6 A E]
; [C D E F]    [3 7 B F]

    mov         eax,offset MatrixIn

    movaps      xmm0,[eax+0]    ; [0 1 2 3]
    movaps      xmm1,[eax+16]   ; [4 5 6 7]
    movaps      xmm2,[eax+32]   ; [8 9 A B]
    movaps      xmm3,[eax+48]   ; [C D E F]

    mov         eax,offset MatrixOut

    movaps      xmm4,xmm0       ; [0 1 2 3]
    movaps      xmm5,xmm2       ; [8 9 A B]
    unpcklps    xmm4,xmm1       ; [0 4 1 5]
    unpcklps    xmm5,xmm3       ; [8 C 9 D]
    unpckhps    xmm0,xmm1       ; [2 6 3 7]
    unpckhps    xmm2,xmm3       ; [A E B F]
    movaps      xmm1,xmm4       ; [0 4 1 5]
    movaps      xmm6,xmm0       ; [2 6 3 7]
    movlhps     xmm4,xmm5       ; [0 4 8 C]
    movlhps     xmm6,xmm2       ; [2 6 A E]
    movhlps     xmm5,xmm1       ; [1 5 9 D]
    movaps      xmm7,xmm2       ; [A E B F]
    movhlps     xmm7,xmm0       ; [3 7 B F]

    movaps      [eax+0],xmm4    ; [0 4 8 C]
    movaps      [eax+16],xmm5   ; [1 5 9 D]
    movaps      [eax+32],xmm6   ; [2 6 A E]
    movaps      [eax+48],xmm7   ; [3 7 B F]


I don't have time to write the source code for it, but I'll explain the principle below.
To make it easier to understand, I used a small image as the example.

; each pixel is 4 bytes
pixelbufferIn   dd 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,-,-,-,-,-,-,-,-,-,- ( - = padding, alignment to 4 * 16 bytes )
pixelbufferOut dd 24 dup (0)

Example of an uneven image width = 5, height = 3

0, 1, 2, 3, 4
5, 6, 7, 8, 9
10,11,12,13,14

           ; mov    esi, offset pixelbufferIn
           ; mov    edi, offset pixelbufferOut

Gather the pixels from the pixelbufferIn with a step of 5 pixels ( image width )
Move the transposed pixels to the pixelbufferOut with a step of 3 ( image height)

[ 0  1  2  3] ; movups xmm0,[esi+0]
[ 5  6  7  8] ; movups xmm1,[esi+20]
[10 11 12 13] ; movups xmm2,[esi+40]
[-- -- -- --] ; movups xmm3,[esi+60] ; these are the values 15, 16, 17, 18, we don't need them but they are needed for the algorithm...

transpose the block of 16 pixels at once:

    movaps      xmm4,xmm0
    movaps      xmm5,xmm2
    unpcklps    xmm4,xmm1
    unpcklps    xmm5,xmm3
    unpckhps    xmm0,xmm1
    unpckhps    xmm2,xmm3
    movaps      xmm1,xmm4
    movaps      xmm6,xmm0
    movlhps     xmm4,xmm5
    movlhps     xmm6,xmm2
    movhlps     xmm5,xmm1
    movaps      xmm7,xmm2
    movhlps     xmm7,xmm0

[ 0  5 10 --] ; movups [edi+0],xmm4
[ 1  6 11 --] ; movups [edi+12],xmm5
[ 2  7 12 --] ; movups [edi+24],xmm6
[ 3  8 13 --] ; movups [edi+36],xmm7

this gives a result of:

0,5,10,1,6,11,2,7,12,3,8,13

Now get the uneven pixels of column 5

[ 4 -- -- --] ; movups xmm0,[esi+16]
[ 9 -- -- --] ; movups xmm1,[esi+36]
[14 -- -- --] ; movups xmm2,[esi+56]
[-- -- -- --] ; movups xmm3,[esi+76]

transpose the block of 16 pixels at once:

    movaps      xmm4,xmm0
    movaps      xmm5,xmm2
    unpcklps    xmm4,xmm1
    unpcklps    xmm5,xmm3
    unpckhps    xmm0,xmm1
    unpckhps    xmm2,xmm3
    movaps      xmm1,xmm4
    movaps      xmm6,xmm0
    movlhps     xmm4,xmm5
    movlhps     xmm6,xmm2
    movhlps     xmm5,xmm1
    movaps      xmm7,xmm2
    movhlps     xmm7,xmm0

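[ 4  9 14 --] ; movups [edi+48],xmm4
[-- -- -- --] ; movups [edi+60],xmm5 ; padding only, this store can be skipped
[-- -- -- --] ; movups [edi+72],xmm6 ; padding only, this store can be skipped
[-- -- -- --] ; movups [edi+84],xmm7 ; padding only, and it would write 4 bytes past the 24-dword pixelbufferOut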

this gives a result of:

0,5,10,1,6,11,2,7,12,3,8,13,4,9,14

here is your transposed image:

0, 5, 10
1, 6, 11
2, 7, 12
3, 8, 13
4, 9, 14
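
A scalar reference is handy for checking a blocked version like the one described above. The C++ sketch below is an illustration only (transpose_ref is a hypothetical name, and the row-major layout is an assumption). It reads with a step of width elements and writes with a step of height elements, which matches the "step of 5" and "step of 3" in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar transpose reference (hypothetical helper, not from the thread).
// in  is row-major, width x height; out is row-major, height x width.
// Element (x, y) of the input lands at (y, x) of the output:
//   out[x * height + y] = in[y * width + x]
void transpose_ref(const std::int32_t* in, std::int32_t* out,
                   std::size_t width, std::size_t height) {
    assert(in != out);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            out[x * height + y] = in[y * width + x];
        }
    }
}
```

Called with the 15 pixels above, width = 5 and height = 3, it fills the output with 0,5,10,1,6,11,2,7,12,3,8,13,4,9,14, the same sequence the two 4 * 4 block passes produce.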


Creative coders use backward thinking techniques as a strategy.

Siekmanski

Improved and faster 4*4 Transpose Matrix  :biggrin:

; [0 1 2 3]    [0 4 8 C]
; [4 5 6 7]    [1 5 9 D]
; [8 9 A B]    [2 6 A E]
; [C D E F]    [3 7 B F]

    mov         esi,offset MatrixIn
    mov         edi,offset MatrixOut

    movaps      xmm0,[esi+0]    ; [0 1 2 3]
    movaps      xmm1,[esi+16]   ; [4 5 6 7]
    movaps      xmm2,[esi+32]   ; [8 9 A B]
    movaps      xmm3,[esi+48]   ; [C D E F]

    movaps      xmm4,xmm0       ; [0 1 2 3]
    movaps      xmm5,xmm2       ; [8 9 A B]
    unpcklps    xmm4,xmm1       ; [0 4 1 5]
    unpcklps    xmm5,xmm3       ; [8 C 9 D]
    unpckhps    xmm0,xmm1       ; [2 6 3 7]
    unpckhps    xmm2,xmm3       ; [A E B F]
    movaps      xmm1,xmm4       ; [0 4 1 5]
    movaps      xmm6,xmm0       ; [2 6 3 7]
    movlhps     xmm4,xmm5       ; [0 4 8 C]
    movlhps     xmm6,xmm2       ; [2 6 A E]
    movhlps     xmm5,xmm1       ; [1 5 9 D]
    movhlps     xmm2,xmm0       ; [3 7 B F]

    movaps      [edi+0],xmm4    ; [0 4 8 C]
    movaps      [edi+16],xmm5   ; [1 5 9 D]
    movaps      [edi+32],xmm6   ; [2 6 A E]
    movaps      [edi+48],xmm2   ; [3 7 B F]
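
For anyone calling this from C or C++, the same unpack/move sequence can be sketched with SSE intrinsics. This is an illustration under assumptions, not a translation checked against the assembler build: transpose4x4 is a hypothetical name, unaligned loads and stores are used instead of movaps so the buffers need no special alignment, and the scalar branch exists only so that the sketch builds on non-x86 targets.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#endif

// Hypothetical 4x4 float transpose; in and out are 16 floats each, row-major.
void transpose4x4(const float* in, float* out) {
    assert(in != out);
#ifdef SKETCH_HAVE_SSE
    __m128 r0 = _mm_loadu_ps(in + 0);     // [0 1 2 3]
    __m128 r1 = _mm_loadu_ps(in + 4);     // [4 5 6 7]
    __m128 r2 = _mm_loadu_ps(in + 8);     // [8 9 A B]
    __m128 r3 = _mm_loadu_ps(in + 12);    // [C D E F]
    __m128 t0 = _mm_unpacklo_ps(r0, r1);  // [0 4 1 5]
    __m128 t1 = _mm_unpacklo_ps(r2, r3);  // [8 C 9 D]
    __m128 t2 = _mm_unpackhi_ps(r0, r1);  // [2 6 3 7]
    __m128 t3 = _mm_unpackhi_ps(r2, r3);  // [A E B F]
    _mm_storeu_ps(out + 0,  _mm_movelh_ps(t0, t1));  // [0 4 8 C]
    _mm_storeu_ps(out + 4,  _mm_movehl_ps(t1, t0));  // [1 5 9 D]
    _mm_storeu_ps(out + 8,  _mm_movelh_ps(t2, t3));  // [2 6 A E]
    _mm_storeu_ps(out + 12, _mm_movehl_ps(t3, t2));  // [3 7 B F]
#else
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[c * 4 + r] = in[r * 4 + c];
#endif
}
```

The compiler picks the registers here, so the register-saving trick of the improved version above (reusing xmm2 instead of xmm7) has no direct counterpart in the intrinsics form.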

Siekmanski

Hi guga,

Just read your post again. Thought it was about transposing images with sizes that are not multiples of 4.
Now I see it's about flipping the X-axis, sorry my mistake.

aw27

#4
Quote from: guga on April 07, 2017, 03:56:33 AM
But... how to make it work for non-square matrices and also keep it fast? What is the math operation involved for non-square matrices?

Hi guga,

I don't know if this is what you want, but here is my solution to flip a 2D matrix along the X-axis. It should work for any number of columns and rows.


option casemap:none
option frame:auto
OPTION STACKBASE:RBP

.code

flipMatrix proc public outMat : ptr, inMat : ptr, rows : qword, cols : qword
LOCAL xmmMovesRequired : qword
LOCAL remainder : qword

    mov outMat, rcx
    mov inMat, rdx
    mov rows, r8
    mov cols, r9

    ; How many xmm moves per row are required
    mov rax, cols
    mov r10, 4
    xor rdx, rdx
    div r10
    mov xmmMovesRequired, rax
    ; Remainder
    mov remainder, rdx

    mov r11, 0
    .while r11 < rows
        mov r10, 0
        ; make destination point to the end of every row
        mov rax, r11
        inc rax
        mul cols
        shl rax, 2
        mov rcx, outMat
        add rcx, rax
        .if xmmMovesRequired > 0
            sub rcx, 16 ; subtract enough to fill one xmm register
        .else
            sub rcx, 4 ; case of less than 4 columns
        .endif
        ; source points to the start of every row
        mov rax, r11
        mul cols
        shl rax, 2
        mov rdx, inMat
        add rdx, rax
        .while r10 < xmmMovesRequired
            movups xmm0, xmmword ptr [rdx]
            pshufd xmm0, xmm0, 00011011b ; reverse the 4 dwords
            movups xmmword ptr [rcx], xmm0
            sub rcx, 16
            add rdx, 16
            inc r10
        .endw
        .if xmmMovesRequired > 0
            add rcx, 12 ; step back to the slot of the first leftover element
        .endif

        ; copy the 1 to 3 leftover columns one dword at a time
        .if remainder >= 1
            mov r10d, dword ptr [rdx]
            mov dword ptr [rcx], r10d
            .if remainder >= 2
                mov r10d, dword ptr [rdx+4]
                mov dword ptr [rcx-4], r10d
                .if remainder == 3
                    mov r10d, dword ptr [rdx+8]
                    mov dword ptr [rcx-8], r10d
                .endif
            .endif
        .endif

        inc r11
    .endw

    ret
flipMatrix endp

end



It was tested by calling from C++ console application, like this:


#include "stdafx.h"
#include <stdio.h>   // printf, getchar (in case stdafx.h does not pull it in)


/* tested
#define TOTALROWS 2
#define TOTALCOLS 2
int inmatrix[TOTALROWS][TOTALCOLS] = {
{ 11, 12 },
{ 21, 22}
};
*/
/* tested
#define TOTALROWS 1
#define TOTALCOLS 1
int inmatrix[TOTALROWS][TOTALCOLS] = {
{ 11 }
};
*/
/* tested
#define TOTALROWS 4
#define TOTALCOLS 4
int inmatrix[TOTALROWS][TOTALCOLS] = {
{ 11, 12, 13,14},
{ 21, 22, 23,24},
{ 31, 32, 33,34},
{ 41, 42, 43,44},
};
*/

#define TOTALROWS 5
#define TOTALCOLS 9
int inmatrix[TOTALROWS][TOTALCOLS] = {
{ 11, 12, 13,14,15,16,17,18,19},
{ 21, 22, 23,24,25,26,27,28,29},
{ 31, 32, 33,34,35,36,37,38,39},
{ 41, 42, 43,44,45,46,47,48,49},
{ 51, 52, 53,54,55,56,57,58,59}
};

int outmatrix[TOTALROWS][TOTALCOLS];

extern "C"
{
    void flipMatrix(void* outMat, void* inMat, size_t rows, size_t cols);
}

int main()
{
    flipMatrix(outmatrix, inmatrix, TOTALROWS, TOTALCOLS);

    printf("Input matrix\n");
    for (int row = 0; row < TOTALROWS; row++)
    {
        for (int columns = 0; columns < TOTALCOLS; columns++)
            printf("%d     ", inmatrix[row][columns]);
        printf("\n");
    }

    printf("Output matrix\n");
    for (int row = 0; row < TOTALROWS; row++)
    {
        for (int columns = 0; columns < TOTALCOLS; columns++)
            printf("%d     ", outmatrix[row][columns]);
        printf("\n");
    }
    getchar();
    return 0;
}
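
The approach in the assembler routine above (whole blocks of 4 columns reversed with a shuffle, then the 1 to 3 leftover columns copied one at a time) can also be sketched in C++ with SSE2 intrinsics. This is an illustration under assumptions, not aw27's code: flip_matrix_sketch and reverse4 are hypothetical names, the argument order simply follows flipMatrix, and the scalar branch exists only so that it builds on non-x86 targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_IX86)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Copy 4 dwords from src to dst in reverse order (hypothetical helper).
static inline void reverse4(const std::int32_t* src, std::int32_t* dst) {
#ifdef SKETCH_HAVE_SSE2
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    v = _mm_shuffle_epi32(v, 0x1B);   // 00011011b, same immediate as pshufd above
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
#else
    for (int k = 0; k < 4; ++k) dst[k] = src[3 - k];
#endif
}

// Flip each row of a rows x cols matrix; the buffers must not overlap.
void flip_matrix_sketch(std::int32_t* out, const std::int32_t* in,
                        std::size_t rows, std::size_t cols) {
    assert(in != out);
    const std::size_t blocks = cols / 4;  // xmm moves per row
    const std::size_t rem = cols % 4;     // leftover columns (0..3)
    for (std::size_t r = 0; r < rows; ++r) {
        const std::int32_t* src = in + r * cols;   // start of the source row
        std::int32_t* dst = out + (r + 1) * cols;  // one past the destination row
        for (std::size_t b = 0; b < blocks; ++b) {
            dst -= 4;                              // fill the row from its end
            reverse4(src, dst);
            src += 4;
        }
        for (std::size_t k = 0; k < rem; ++k) {
            *--dst = *src++;                       // leftover columns, one by one
        }
    }
}
```

With the 5 x 9 test matrix above, the first output row should come out as 19, 18, 17, 16, 15, 14, 13, 12, 11.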


guga

Hi Marinus

No problem :) I was also testing your transpose algo but, unfortunately, I'm still not able to make it work on non-square matrices. The algo seems faster than JJ's, but I'm not succeeding in making it work as expected (non-square).


Many thanks, Aw. I'll give it a try :) :t

Siekmanski

Hi guga,

aw27 worked out the method i presented in Reply #1
http://masm32.com/board/index.php?topic=6140.msg65145#msg65145

guga

Thanks a lot, Marinus and AW. I'll give it a try and post the results for speed testing :)

guga

Aw

Here is the port to RosAsm in x86. Many thanks. I'll try to optimize it.


Proc Matrix_FlipHorizontal:
    Arguments @Input, @Output, @Width, @Height ;@Width, @Height
    Local @xmmMovesRequired, @remainder
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; How many xmm moves per row are required;
    mov eax D@Width
    mov ebx 4
    xor edx edx
    div ebx
    mov D@xmmMovesRequired eax
   
    ; Remainder
    mov D@Remainder edx

    mov ecx 0

    .While ecx < D@Height
        mov ebx 0
        ; make destination point to the end of every row
        mov eax ecx
        inc eax
        mul D@Width
        shl eax 2
        mov edi D@Output
        add edi eax
        If D@xmmMovesRequired > 0
            sub edi 16 ; subtract enough to fill one xmm register
        Else
            sub edi 4 ; case of less than 4 columns
        End_if

        ; source points to the start of every row
        mov eax ecx
        mul D@Width
        shl eax 2
        mov esi D@Input 
        add esi eax
        While ebx < D@xmmMovesRequired
            movups xmm0 X$esi
            pshufd xmm0 xmm0 27
            movups X$edi xmm0
            sub edi 16
            add esi 16
            inc ebx
        End_While

        If D@xmmMovesRequired > 0
            add edi 12
        End_if

        ..If D@remainder >= 1
            mov ebx D$esi
            mov D$edi ebx
            .If D@remainder >= 2
                mov ebx D$esi+4
                mov D$edi-4 ebx
                If D@remainder = 3
                    mov ebx D$esi+8
                    mov D$edi-8 ebx
                End_if
            .End_if
        ..End_if
 
        inc ecx     
    .End_While


EndP

guga

#9
Ok... here it is :)

Fully optimized version, faster than the original  :t :t :t

Note: Later I'll rename the variables to proper names.

New version:


Proc SquaredMatrix_FlipHorizontal_SSE3Guga:
    Arguments @Input, @Output, @Width, @Height
    Local @xmmMovesRequired, @remainder, @Var1, @Var2, @Var3, @MaxHeight, @CurMovesRequired
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; How many xmm moves per row are required;
    mov eax D@Width
    mov edx eax
    mov ebx eax
    and edx 3 | mov D@Remainder edx
    shr eax 2 | mov D@xmmMovesRequired eax

    mov eax D@Height | mov D@MaxHeight eax

    mov D@Var2 0
    mov D@Var1 4 ; case of less than 4 columns
    If D@xmmMovesRequired > 0
        mov D@Var1 16 ; subtract enough to fill one xmm register
        mov D@Var2 12
    End_if

    mov eax D@Width | shl eax 2 | mov D@Var3 eax |  sub eax D@Var1
    mov edi D@Output | add edi eax

L1:
        mov ebx D@xmmMovesRequired
        ; source points to the start of every row
        mov eax edi
        mov ecx esi
        test ebx ebx | Jz L4>

        L0:
            movups xmm0 X$ecx
            pshufd xmm0 xmm0 27
            movups X$eax xmm0
            sub eax 16
            add ecx 16
            dec ebx | jg L0<
L4:
        ..If D@remainder >= 1
            mov edx D@Var2
            mov ebx D$ecx | mov D$eax+edx ebx
            .If D@remainder >= 2
                mov ebx D$ecx+4 | mov D$eax+edx-4 ebx
                If D@remainder = 3
                    mov ebx D$ecx+8 | mov D$eax+edx-8 ebx
                End_if
            .End_if
        ..End_if

         add edi D@Var3
         add esi D@Var3

        dec D@MaxHeight | jg L1<


EndP


Benchmark tests:
Original version: 249.31571145976065 ns (731 clock cycles)
New version: 163.09080899440107 ns (480 clock cycles)

Speed improvement: something around 35%

It may be possible to optimize it more, I guess.

Btw... changing to movdqu instead of movups may increase the speed a little bit
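
The "around 35%" figure follows directly from the two timings. A minimal C++ sketch of the arithmetic, with hypothetical function names, in case someone wants to compare their own measurements the same way:

```cpp
#include <cassert>

// Percentage of run time saved by the new version relative to the old one.
double percent_time_saved(double old_ns, double new_ns) {
    assert(old_ns > 0.0);
    return 100.0 * (old_ns - new_ns) / old_ns;
}

// How many times faster the new version is (old time divided by new time).
double speedup_factor(double old_ns, double new_ns) {
    assert(new_ns > 0.0);
    return old_ns / new_ns;
}
```

With the numbers above (249.31571145976065 ns and 163.09080899440107 ns) this gives roughly 34.6% less time per call, or a speedup of about 1.53x.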

aw27

Quote from: guga on April 20, 2017, 10:12:13 PM
Ok..here it is :)

Fully optimized version 4 times faster then the original  :t :t :t


Congratulations, although I cannot test it because I don't have RosAsm. I know I can download it, but I am just lazy.

guga

I'll try to port it to MASM and post it here.

guga

Ok. Here is the MASM version.

Sorry, I don't know the MASM macros for repeat loops (not the while, just the dec + jcc chains), so I posted the full disassembled source plus some macros that I presume are correct, if I remember the MASM syntax well.


; WIDTH is a reserved MASM operator, so the size parameters are named MatWidth/MatHeight here.
Matrix_FlipX proc public USES esi edi ebx ecx edx Input : ptr, Output : ptr, MatWidth : dword, MatHeight : dword
LOCAL MaxXPos: dword
LOCAL MaxYPos: dword
LOCAL Remainder: dword
LOCAL AdjustSmallSize: dword
LOCAL NextScanLine: dword

mov esi, Input
mov edi, Output
mov eax, MatWidth
mov edx, eax
mov ebx, eax
and edx, 3
mov Remainder, edx ; Remainder = Width mod 4
shr eax, 2
mov MaxXPos, eax ; MaxXPos = Width / 4 (xmm moves per row)
mov eax, MatHeight
mov MaxYPos, eax
mov AdjustSmallSize, 0
mov ebx, 4 ; case of less than 4 columns

.If MaxXPos > 0
mov ebx, 16 ; subtract enough to fill one xmm register
mov AdjustSmallSize, 12
.Endif

mov eax, MatWidth
shl eax, 2
mov NextScanLine, eax ; bytes per row
sub eax, ebx
mov edi, Output
add edi, eax

loc_40AF18: ; row loop
mov ebx, MaxXPos
mov eax, edi
mov ecx, esi
test ebx, ebx
jz short loc_40AF38

loc_40AF23: ; reverse 4 dwords per iteration
movdqu xmm0, xmmword ptr [ecx]
pshufd xmm0, xmm0, 27
movups xmmword ptr [eax], xmm0
sub eax, 16
add ecx, 16
dec ebx
jg short loc_40AF23

loc_40AF38: ; copy the 1 to 3 leftover columns
.If Remainder >= 1
mov edx, AdjustSmallSize
mov ebx, [ecx]
mov [edx+eax], ebx
.If Remainder >= 2
mov ebx, [ecx+4]
mov [edx+eax-4], ebx
.If Remainder == 3
mov ebx, [ecx+8]
mov [edx+eax-8], ebx
.EndIf
.EndIf
.EndIf ; closes ".If Remainder >= 1" (missing in the first listing)

add edi, NextScanLine
add esi, NextScanLine
dec MaxYPos
jg short loc_40AF18

ret

Matrix_FlipX endp



Note: I'm pretty sure it can be optimized further by removing those chains of If Remainder (especially because we can have large matrices with a small width, such as 3x280, etc.), but I'm unable to produce faster code right now. Anyway, it is faster than the original version.
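
One possible way to drop the If Remainder chain when the width is at least 4 is to let the last vector overlap the previous one: after the full blocks, the last 4 source columns are reversed once more into the first 4 destination columns. The overlapping elements are simply written twice with the same value. This is a different technique from the one in the thread, sketched in C++ under assumptions (hypothetical names, non-overlapping input and output buffers), and whether it is actually faster would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_IX86)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Copy 4 dwords from src to dst in reverse order (hypothetical helper).
static inline void reverse4(const std::int32_t* src, std::int32_t* dst) {
#ifdef SKETCH_HAVE_SSE2
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    v = _mm_shuffle_epi32(v, 0x1B);   // reverse the 4 dwords
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
#else
    for (int k = 0; k < 4; ++k) dst[k] = src[3 - k];
#endif
}

// Horizontal flip with an overlapping final vector instead of a scalar tail.
void flip_rows_overlap(const std::int32_t* in, std::int32_t* out,
                       std::size_t width, std::size_t height) {
    assert(in != out);
    for (std::size_t y = 0; y < height; ++y) {
        const std::int32_t* src = in + y * width;
        std::int32_t* dst = out + y * width;
        if (width < 4) {                      // too narrow for a whole vector
            for (std::size_t x = 0; x < width; ++x) dst[x] = src[width - 1 - x];
            continue;
        }
        std::size_t x = 0;
        for (; x + 4 <= width; x += 4)        // full blocks, left to right
            reverse4(src + x, dst + width - 4 - x);
        if (x < width)                        // 1..3 columns left over:
            reverse4(src + width - 4, dst);   // redo the last 4 source columns
    }
}
```

For widths below 4 (such as the 3x280 case mentioned above) there is no whole vector to overlap, so the sketch falls back to a scalar loop for those rows.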

newrobert

I want to know how you disassembled the source?

guga

RosAsm or, in this case, IDA Pro (to make it MASM compatible), but it was mainly to recall the MASM syntax so I would be able to port it. RosAsm is Nasm compatible, so to make it easier for others to read, I assembled the new function with RosAsm and disassembled it with IDA to make a compatible version for MASM users.

I still have little time to make a RosAsm to MASM converter/translator (as a standalone tool or inside the RosAsm project itself), so the easier and faster way was simply disassembling the code.