Fast Matrix Flip

guga · April 07, 2017, 03:56:33 AM

Hi guys

Continuing with the matrix operations, I built one that flips a matrix along the X (width) axis using SSE (thanks, Jochen and Marinus).

The goal was to flip matrices like this:

[Teste4x4: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22]
into:

[Teste4x4Inverted: F$ 4, 3, 2, 1,
F$ 10, 9, 8, 7,
F$ 16, 15, 14, 13,
F$ 22, 21, 20, 19]

The problem, however, remains with non-square matrices. I'm still struggling to work out the proper flags or the math involved when dealing with non-square matrices (such as 3x2, 27x18, 13x9, etc.).
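
As a reference point for the non-square question, here is a minimal scalar sketch in C++ (not SSE, and not anyone's code from this thread). It assumes a row-major matrix of 4-byte elements; `flip_horizontal_reference` is a hypothetical name and `std::int32_t` stands in for any 4-byte element type. It shows that a flip along X reverses each row on its own, so the height never enters the index math; in an SSE version only a width that is not a multiple of 4 should need extra care.

```cpp
#include <cstddef>
#include <cstdint>

// Reference (non-SIMD) horizontal flip of a row-major matrix.
// Each row is reversed independently: 'rows' only controls how many
// times the inner loop runs and never appears in the per-row index math.
void flip_horizontal_reference(const std::int32_t* in, std::int32_t* out,
                               std::size_t rows, std::size_t cols) {
    for (std::size_t y = 0; y < rows; ++y) {
        const std::int32_t* src = in + y * cols;   // start of source row y
        std::int32_t* dst = out + y * cols;        // start of destination row y
        for (std::size_t x = 0; x < cols; ++x) {
            dst[cols - 1 - x] = src[x];            // mirror column x
        }
    }
}
```

For a 2-row, 3-column input {1,2,3, 4,5,6} this writes {3,2,1, 6,5,4} to the output buffer.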

The code I made for square matrices, flipping along the X axis, is:

Proc SquaredMatrix_FlipHorizontal_SSE2new:
    Arguments @Input, @Output, @Width, @Height
    Local @MaxXPos, @CurYPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    mov edx D@Height | mov D@CurYPos edx

    mov ebx D@Width
    mov eax ebx | shr eax 2 | mov D@MaxXPos eax ; MaxXPos = Width/4 (number of 4-dword groups per row)
    shl ebx 2 | mov eax ebx ; ebx = Width*4 (bytes per row); eax = byte offset of the end of the first row


L2:
    mov ecx D@MaxXPos
    mov edx esi
    Align 64 ; align the loop entry to 64 bytes for more speed and stability (aligning to 16 is a bit slower)

    L8:

        movdqu XMM0 X$edx+eax-16 ; edx+(Width*4)-16
        pshufd XMM0 XMM0 27 ; reverse the order of the 4 dwords (27 = 00011011b)
        sub edx (4*4)
        movups X$edi xmm0
        add edi (4*4)
        dec ecx | jg L8<

    add eax ebx; next scanline in ebx
    dec D@CurYPos | jnz L2<<

    mov eax D@Output

EndP

Timing for the whole function is only 128.74 nanoseconds :)
Aligning to 16 makes it slightly slower, around 136 nanoseconds.

But how do I make it work for non-square matrices and keep it fast? What is the math involved with non-square matrices?

I don't remember the MASM syntax exactly, but the code above is probably something like this:

SquaredMatrix_FlipHorizontal_SSE2new proc uses esi edi ebx ecx edx Input:dword, Output:dword, Width:dword, Height:dword

LOCAL  CurYPos: dword
LOCAL MaxXPos: dword

	mov esi, Input
	mov	edi, Output
	mov	edx, Height
	mov	CurYPos, edx
	mov	ebx, Width
	mov	eax, ebx
	shr	eax, 2
	mov	MaxXPos, eax
	shl	ebx, 2
	mov	eax, ebx

Loop2:
		mov	ecx, MaxXPos
		mov	edx, esi
		jmp	Loop1
; ---------------------------------------------------------------------------
		align 64 ; MASM's directive is ALIGN <number>; the number may not exceed the segment's alignment
; ---------------------------------------------------------------------------

Loop1:

		movdqu	xmm0, xmmword ptr [eax+edx-10h]
		pshufd	xmm0, xmm0, 1Bh
		sub	edx, 10h
		movups	xmmword	ptr [edi], xmm0
		add	edi, 10h
		dec	ecx
		jg	Loop1
		add	eax, ebx
		dec	CurYPos
		jnz	Loop2
		mov	eax, Output
		ret

SquaredMatrix_FlipHorizontal_SSE2new endp
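
The shuffle constant in both listings (27 in RosAsm, 1Bh in the MASM version) can be decoded with a small scalar model of PSHUFD. This is only an illustrative C++ sketch; `Lanes` and `pshufd_model` are hypothetical names. Each destination dword takes the source dword selected by its own 2-bit field of the immediate, and 27 = 00011011b selects lanes 3, 2, 1, 0 for destinations 0, 1, 2, 3, which reverses the four dwords.

```cpp
#include <array>
#include <cstdint>

using Lanes = std::array<std::uint32_t, 4>;   // lane 0 is the lowest dword in memory

// Scalar model of PSHUFD: destination lane i takes the source lane
// selected by bits [2i+1 : 2i] of the 8-bit immediate.
Lanes pshufd_model(const Lanes& src, std::uint8_t imm) {
    Lanes dst{};
    for (int i = 0; i < 4; ++i) {
        dst[i] = src[(imm >> (2 * i)) & 3];
    }
    return dst;
}
```

With this model, an immediate of 27 turns {1, 2, 3, 4} into {4, 3, 2, 1}, while 0E4h (11100100b) leaves the lanes unchanged.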

Siekmanski · April 07, 2017, 09:38:23 AM

Fast transposing of even or uneven sizes can be done with a 4 * 4 matrix.
Reserve enough memory to read from and write to ( 4 * 16 bytes alignment )
Read 4 * 4 pixels at once with the correct memory steps for the rows and columns.
Write 4 * 4 pixels at once, also with the correct memory steps for the 4 * 4 blocks.

It seems illogical to use a 4 * 4 matrix for uneven image sizes (because of the unused pixels), but that only affects the last 1, 2 or 3 pixels at the right border and the bottom border.
And if you process larger images than in the example below, it is really fast ( 12 cycles per 16 pixels on my PC ).

;The 4 * 4 transpose algorithm:
;     In:         Out:
; [0 1 2 3]    [0 4 8 C]
; [4 5 6 7]    [1 5 9 D]
; [8 9 A B]    [2 6 A E]
; [C D E F]    [3 7 B F]

    mov         eax,offset MatrixIn

    movaps      xmm0,[eax+0]    ; [0 1 2 3]
    movaps      xmm1,[eax+16]   ; [4 5 6 7]
    movaps      xmm2,[eax+32]   ; [8 9 A B]
    movaps      xmm3,[eax+48]   ; [C D E F]

    mov         eax,offset MatrixOut

    movaps      xmm4,xmm0       ; [0 1 2 3]
    movaps      xmm5,xmm2       ; [8 9 A B]
    unpcklps    xmm4,xmm1       ; [0 4 1 5]
    unpcklps    xmm5,xmm3       ; [8 C 9 D]
    unpckhps    xmm0,xmm1       ; [2 6 3 7]
    unpckhps    xmm2,xmm3       ; [A E B F]
    movaps      xmm1,xmm4       ; [0 4 1 5]
    movaps      xmm6,xmm0       ; [2 6 3 7]
    movlhps     xmm4,xmm5       ; [0 4 8 C]
    movlhps     xmm6,xmm2       ; [2 6 A E]
    movhlps     xmm5,xmm1       ; [1 5 9 D]
    movaps      xmm7,xmm2       ; [A E B F]
    movhlps     xmm7,xmm0       ; [3 7 B F]

    movaps      [eax+0],xmm4    ; [0 4 8 C]
    movaps      [eax+16],xmm5   ; [1 5 9 D]
    movaps      [eax+32],xmm6   ; [2 6 A E]
    movaps      [eax+48],xmm7   ; [3 7 B F]

I don't have time to write the source code for it, but I'll explain the principle below.
To keep it easier to understand I used a small size image example.

; each pixel number is 4 byte
pixelbufferIn   dd 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,-,-,-,-,-,-,-,-,-,- ( -, = alignment to 4 * 16 bytes )
pixelbufferOut dd 25 dup (0) ; 25 dwords (100 bytes): the last store below, [edi+84], writes bytes 84..99

Example of an uneven image width = 5, height = 3

 0, 1, 2, 3, 4 
 5, 6, 7, 8, 9
10,11,12,13,14

           ; mov    esi, offset pixelbufferIn
           ; mov    edi, offset pixelbufferOut

Gather the pixels from the pixelbufferIn with a step of 5 pixels ( image width )
Move the transposed pixels to the pixelbufferOut with a step of 3 ( image height)

 [ 0  1  2  3] ; movups xmm0,[esi+0]
 [ 5  6  7  8] ; movups xmm1,[esi+20]
 [10 11 12 13] ; movups xmm2,[esi+40]
 [-- -- -- --] ; movups xmm3,[esi+60] ; these are the values 15, 16, 17, 18, we don't need them but they are needed for the algorithm...

 transpose the block of 16 pixels at once:

    movaps      xmm4,xmm0
    movaps      xmm5,xmm2
    unpcklps    xmm4,xmm1
    unpcklps    xmm5,xmm3
    unpckhps    xmm0,xmm1
    unpckhps    xmm2,xmm3
    movaps      xmm1,xmm4
    movaps      xmm6,xmm0
    movlhps     xmm4,xmm5
    movlhps     xmm6,xmm2
    movhlps     xmm5,xmm1
    movaps      xmm7,xmm2
    movhlps     xmm7,xmm0

 [ 0  5 10 --] ; movups [edi+0],xmm4
 [ 1  6 11 --] ; movups [edi+12],xmm5
 [ 2  7 12 --] ; movups [edi+24],xmm6
 [ 3  8 13 --] ; movups [edi+36],xmm7

this gives a result of:

 0,5,10,1,6,11,2,7,12,3,8,13

Now get the uneven pixels of column 5

 [ 4 -- -- --] ; movups xmm0,[esi+16]
 [ 9 -- -- --] ; movups xmm1,[esi+36]
 [14 -- -- --] ; movups xmm2,[esi+56]
 [-- -- -- --] ; movups xmm3,[esi+76]

 transpose the block of 16 pixels at once:

    movaps      xmm4,xmm0
    movaps      xmm5,xmm2
    unpcklps    xmm4,xmm1
    unpcklps    xmm5,xmm3
    unpckhps    xmm0,xmm1
    unpckhps    xmm2,xmm3
    movaps      xmm1,xmm4
    movaps      xmm6,xmm0
    movlhps     xmm4,xmm5
    movlhps     xmm6,xmm2
    movhlps     xmm5,xmm1
    movaps      xmm7,xmm2
    movhlps     xmm7,xmm0

 [ 4  9 14 --] ; movups [edi+48],xmm4
 [-- -- -- --] ; movups [edi+60],xmm5
 [-- -- -- --] ; movups [edi+72],xmm6
 [-- -- -- --] ; movups [edi+84],xmm7

this gives a result of:

 0,5,10,1,6,11,2,7,12,3,8,13,4,9,14

here is your transposed image:

 0, 5, 10
 1, 6, 11
 2, 7, 12
 3, 8, 13
 4, 9, 14
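
The stepping described above (read with a step of the image width, write with a step of the image height) can be sketched in scalar C++. Note that this is a simplification and not the method above: instead of reading and writing the unused "--" lanes into padding, it clips the 4 * 4 tiles at the right and bottom borders. `transpose_tiled` is a hypothetical name, and 4-byte integer pixels are assumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Transposes a row-major (height x width) image into a (width x height)
// image, walking the input in 4 x 4 tiles. Border tiles are clipped, so
// no padding is read or written.
void transpose_tiled(const std::int32_t* in, std::int32_t* out,
                     std::size_t width, std::size_t height) {
    for (std::size_t y0 = 0; y0 < height; y0 += 4) {
        for (std::size_t x0 = 0; x0 < width; x0 += 4) {
            const std::size_t yEnd = std::min(y0 + 4, height);
            const std::size_t xEnd = std::min(x0 + 4, width);
            for (std::size_t y = y0; y < yEnd; ++y) {
                for (std::size_t x = x0; x < xEnd; ++x) {
                    // input step = width, output step = height
                    out[x * height + y] = in[y * width + x];
                }
            }
        }
    }
}
```

Called with width = 5 and height = 3 on the pixels 0..14, it produces 0,5,10,1,6,11,2,7,12,3,8,13,4,9,14, the same order as the worked example.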

Siekmanski · April 07, 2017, 10:39:06 AM

Improved and faster 4*4 Transpose Matrix

; [0 1 2 3]    [0 4 8 C]
; [4 5 6 7]    [1 5 9 D]
; [8 9 A B]    [2 6 A E]
; [C D E F]    [3 7 B F]

    mov         esi,offset MatrixIn
    mov         edi,offset MatrixOut

    movaps      xmm0,[esi+0]    ; [0 1 2 3]
    movaps      xmm1,[esi+16]   ; [4 5 6 7]
    movaps      xmm2,[esi+32]   ; [8 9 A B]
    movaps      xmm3,[esi+48]   ; [C D E F]

    movaps      xmm4,xmm0       ; [0 1 2 3]
    movaps      xmm5,xmm2       ; [8 9 A B]
    unpcklps    xmm4,xmm1       ; [0 4 1 5]
    unpcklps    xmm5,xmm3       ; [8 C 9 D]
    unpckhps    xmm0,xmm1       ; [2 6 3 7]
    unpckhps    xmm2,xmm3       ; [A E B F]
    movaps      xmm1,xmm4       ; [0 4 1 5]
    movaps      xmm6,xmm0       ; [2 6 3 7]
    movlhps     xmm4,xmm5       ; [0 4 8 C]
    movlhps     xmm6,xmm2       ; [2 6 A E]
    movhlps     xmm5,xmm1       ; [1 5 9 D]
    movhlps     xmm2,xmm0       ; [3 7 B F]

    movaps      [edi+0],xmm4    ; [0 4 8 C]
    movaps      [edi+16],xmm5   ; [1 5 9 D]
    movaps      [edi+32],xmm6   ; [2 6 A E]
    movaps      [edi+48],xmm2   ; [3 7 B F]
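
The register comments in the listing above can be checked with a scalar model of the four instructions it uses. This is only a C++ sketch for verification; `Vec4` and `transpose4x4_model` are hypothetical names, and integers are used in place of floats so the lanes are easy to compare.

```cpp
#include <array>

using Vec4 = std::array<int, 4>;   // one XMM register as 4 lanes, lane 0 first

// Scalar models of the SSE instructions (first operand is the destination).
Vec4 unpcklps(Vec4 a, Vec4 b) { return {a[0], b[0], a[1], b[1]}; }
Vec4 unpckhps(Vec4 a, Vec4 b) { return {a[2], b[2], a[3], b[3]}; }
Vec4 movlhps(Vec4 a, Vec4 b)  { return {a[0], a[1], b[0], b[1]}; }
Vec4 movhlps(Vec4 a, Vec4 b)  { return {b[2], b[3], a[2], a[3]}; }

// Follows the improved listing step by step, with x0..x6 standing for xmm0..xmm6.
std::array<Vec4, 4> transpose4x4_model(const std::array<Vec4, 4>& rows) {
    Vec4 x0 = rows[0], x1 = rows[1], x2 = rows[2], x3 = rows[3];
    Vec4 x4 = x0;
    Vec4 x5 = x2;
    x4 = unpcklps(x4, x1);   // [0 4 1 5]
    x5 = unpcklps(x5, x3);   // [8 C 9 D]
    x0 = unpckhps(x0, x1);   // [2 6 3 7]
    x2 = unpckhps(x2, x3);   // [A E B F]
    x1 = x4;                 // [0 4 1 5]
    Vec4 x6 = x0;            // [2 6 3 7]
    x4 = movlhps(x4, x5);    // [0 4 8 C]
    x6 = movlhps(x6, x2);    // [2 6 A E]
    x5 = movhlps(x5, x1);    // [1 5 9 D]
    x2 = movhlps(x2, x0);    // [3 7 B F]
    std::array<Vec4, 4> out;
    out[0] = x4; out[1] = x5; out[2] = x6; out[3] = x2;
    return out;
}
```

The saving compared with the first version is that the last movhlps reuses xmm2 as its destination, so the extra copy into xmm7 is no longer needed.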

Siekmanski · April 07, 2017, 07:35:50 PM

Hi guga,

I just read your post again. I thought it was about transposing images with sizes that are not multiples of 4.
Now I see it's about flipping along the X-axis. Sorry, my mistake.

aw27 · April 08, 2017, 01:43:18 AM

Quote from: guga on April 07, 2017, 03:56:33 AM
But how do I make it work for non-square matrices and keep it fast? What is the math involved with non-square matrices?

Hi guga,

I don't know if this is what you want, but here is my solution for flipping a 2D matrix along the X-axis. It should work for any number of columns and rows.

option casemap:none
option frame:auto
OPTION STACKBASE:RBP

.code

flipMatrix proc public outMat : ptr, inMat : ptr, rows : qword, cols : qword
	LOCAL xmmMovesRequired : qword
	LOCAL remainder : qword
	
	mov outMat, rcx
	mov inMat, rdx
	mov rows, r8
	mov cols, r9
	
	; How many xmm moves per row are required;
	mov rax, cols
	mov r10, 4
	xor rdx, rdx
	div r10
	mov xmmMovesRequired, rax
	; Remainder
	mov remainder, rdx
	
	mov r11,0
	.while r11<rows
		mov r10,0
		; make destination point to the end of every row
		mov rax, r11
		inc rax
		mul cols
		shl rax, 2
		mov rcx, outMat
		add rcx, rax 
		.if xmmMovesRequired>0
			sub rcx, 16 ; subtract enough to fill one xmm register
		.else
			sub rcx, 4 ; case of less than 4 columns
		.endif
		; source points to the start of every row
		mov rax, r11
		mul cols
		shl rax, 2
		mov rdx, inMat		
		add rdx, rax
		.while r10<xmmMovesRequired
			movups xmm0, xmmword ptr [rdx]
			pshufd xmm0, xmm0, 00011011b
			movups xmmword ptr [rcx], xmm0
			sub rcx, 16
			add rdx, 16
			inc r10
		.endw
		.if xmmMovesRequired>0
			add rcx, 12
		.endif

		.if remainder >= 1
			mov r10d, dword ptr [rdx]
			mov dword ptr [rcx], r10d
			.if remainder >= 2
				mov r10d, dword ptr [rdx+4]
				mov dword ptr [rcx-4], r10d
				.if remainder == 3	
					mov r10d, dword ptr [rdx+8]
					mov dword ptr [rcx-8], r10d
				.endif	
			.endif
		.endif
		
		inc r11
	.endw

	ret
flipMatrix endp

end

It was tested by calling it from a C++ console application, like this:

#include "stdafx.h"
#include <stdio.h> // printf, getchar


/* tested
#define TOTALROWS 2
#define TOTALCOLS 2
int inmatrix[TOTALROWS][TOTALCOLS] = {
	{ 11, 12 },
	{ 21, 22}
};
*/
/* tested
#define TOTALROWS 1
#define TOTALCOLS 1
int inmatrix[TOTALROWS][TOTALCOLS] = {
	{ 11 }
};
*/
/* tested
#define TOTALROWS 4
#define TOTALCOLS 4
int inmatrix[TOTALROWS][TOTALCOLS] = {
{ 11, 12, 13,14},
{ 21, 22, 23,24},
{ 31, 32, 33,34},
{ 41, 42, 43,44},
};
*/

#define TOTALROWS 5
#define TOTALCOLS 9
int inmatrix[TOTALROWS][TOTALCOLS] = {
	{ 11, 12, 13,14,15,16,17,18,19},
	{ 21, 22, 23,24,25,26,27,28,29},
	{ 31, 32, 33,34,35,36,37,38,39},
	{ 41, 42, 43,44,45,46,47,48,49},
	{ 51, 52, 53,54,55,56,57,58,59}
};

int outmatrix[TOTALROWS][TOTALCOLS];

extern "C"
{
	void flipMatrix(void* outMat, void* inMat, size_t rows, size_t cols);

}

int main()
{
	flipMatrix(outmatrix, inmatrix, TOTALROWS, TOTALCOLS);


	printf("Input matrix\n");
	for (int row = 0; row < TOTALROWS; row++)
	{
		for (int columns = 0; columns < TOTALCOLS; columns++)
			printf("%d     ", inmatrix[row][columns]);
		printf("\n");
	}


	printf("Output matrix\n");
	for (int row = 0; row < TOTALROWS; row++)
	{
		for (int columns = 0; columns < TOTALCOLS; columns++)
			printf("%d     ", outmatrix[row][columns]);
		printf("\n");
	}
	getchar();
	return 0;
}
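
The pointer arithmetic in flipMatrix can be followed more easily in a scalar model. The C++ sketch below is an approximation of the per-row work, not a translation of the assembly; `flip_rows_chunked` is a hypothetical name. Full groups of 4 are reversed and stored from the end of the destination row backwards (the movups / pshufd / movups loop), and the 1 to 3 leftover elements land in the first slots of the destination row.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of flipMatrix: groups of 4 first, then the remainder.
void flip_rows_chunked(const std::int32_t* in, std::int32_t* out,
                       std::size_t rows, std::size_t cols) {
    const std::size_t groups = cols / 4;      // xmmMovesRequired in the listing
    const std::size_t remainder = cols % 4;   // remainder in the listing
    for (std::size_t r = 0; r < rows; ++r) {
        const std::int32_t* src = in + r * cols;   // start of source row
        std::int32_t* dst = out + r * cols;        // start of destination row
        for (std::size_t g = 0; g < groups; ++g) {
            const std::int32_t* s = src + 4 * g;          // next 4 source elements
            std::int32_t* d = dst + cols - 4 * (g + 1);   // 4 slots counted from the row end
            d[0] = s[3]; d[1] = s[2]; d[2] = s[1]; d[3] = s[0];
        }
        for (std::size_t k = 0; k < remainder; ++k) {
            // leftover source elements go to the front of the row, mirrored
            dst[remainder - 1 - k] = src[4 * groups + k];
        }
    }
}
```

For the 5 x 9 test matrix above, the first output row becomes 19, 18, 17, 16, 15, 14, 13, 12, 11.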

guga · April 18, 2017, 03:44:08 PM

Hi Marinus

No problem :) I was testing your transpose algo too, but unfortunately I still can't make it work on non-square matrices. The algo seems faster than JJ's, but I haven't managed to make it work as expected (non-square).

Many thanks, AW. I'll give it a try :) :t

Siekmanski · April 18, 2017, 07:07:59 PM

Hi guga,

aw27 worked out the method I presented in Reply #1
http://masm32.com/board/index.php?topic=6140.msg65145#msg65145

guga · April 18, 2017, 10:20:08 PM

Thanks a lot, Marinus and AW. I'll give it a try and post the results of the speed tests :)

guga · April 20, 2017, 06:08:37 PM

Aw

Here is the port to RosAsm in x86. Many thanks. I'll try to optimize it.

Proc Matrix_FlipHorizontal:
    Arguments @Input, @Output, @Width, @Height
    Local @xmmMovesRequired, @remainder
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; How many xmm moves per row are required;
    mov eax D@Width
    mov ebx 4
    xor edx edx
    div ebx
    mov D@xmmMovesRequired eax
    
    ; Remainder
    mov D@remainder edx

    mov ecx 0

    .While ecx < D@Height
        mov ebx 0
        ; make destination point to the end of every row
        mov eax ecx
        inc eax
        mul D@Width
        shl eax 2
        mov edi D@Output
        add edi eax
        If D@xmmMovesRequired > 0
            sub edi 16 ; subtract enough to fill one xmm register
        Else
            sub edi 4 ; case of less than 4 columns
        End_if

        ; source points to the start of every row
        mov eax ecx
        mul D@Width
        shl eax 2
        mov esi D@Input  
        add esi eax
        While ebx < D@xmmMovesRequired
            movups xmm0 X$esi
            pshufd xmm0 xmm0 27
            movups X$edi xmm0
            sub edi 16
            add esi 16
            inc ebx
        End_While

        If D@xmmMovesRequired > 0
            add edi 12
        End_if

        ..If D@remainder >= 1
            mov ebx D$esi
            mov D$edi ebx
            .If D@remainder >= 2
                mov ebx D$esi+4
                mov D$edi-4 ebx
                If D@remainder = 3 
                    mov ebx D$esi+8
                    mov D$edi-8 ebx
                End_if
            .End_if
        ..End_if
  
        inc ecx     
    .End_While


EndP

guga · April 20, 2017, 10:12:13 PM

Ok..here it is :)

Fully optimized version, faster than the original :t :t :t

Note: Later I'll rename the variables to proper names.

New version:

Proc SquaredMatrix_FlipHorizontal_SSE3Guga:
    Arguments @Input, @Output, @Width, @Height
    Local @xmmMovesRequired, @remainder, @Var1, @Var2, @Var3, @MaxHeight, @CurMovesRequired
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; How many xmm moves per row are required;
    mov eax D@Width
    mov edx eax
    mov ebx eax
    and edx 3 | mov D@remainder edx
    shr eax 2 | mov D@xmmMovesRequired eax

    mov eax D@Height | mov D@MaxHeight eax

    mov D@Var2 0
    mov D@Var1 4 ; case of less than 4 columns
    If D@xmmMovesRequired > 0
        mov D@Var1 16 ; subtract enough to fill one xmm register
        mov D@Var2 12
    End_if

    mov eax D@Width | shl eax 2 | mov D@Var3 eax |  sub eax D@Var1
    mov edi D@Output | add edi eax

L1:
        mov ebx D@xmmMovesRequired
        ; source points to the start of every row
        mov eax edi
        mov ecx esi
        test ebx ebx | Jz L4>

        L0:
            movups xmm0 X$ecx
            pshufd xmm0 xmm0 27
            movups X$eax xmm0
            sub eax 16
            add ecx 16
            dec ebx | jg L0<
L4:
        ..If D@remainder >= 1
            mov edx D@Var2
            mov ebx D$ecx | mov D$eax+edx ebx
            .If D@remainder >= 2
                mov ebx D$ecx+4 | mov D$eax+edx-4 ebx
                If D@remainder = 3
                    mov ebx D$ecx+8 | mov D$eax+edx-8 ebx
                End_if
            .End_if
        ..End_if

         add edi D@Var3
         add esi D@Var3

        dec D@MaxHeight | jg L1<


EndP

benchmark tests:
Original version: 249.31571145976065 ns (731 clock cycles)
New version: 163.09080899440107 ns (480 clock cycles)

Speed Improvement: Something around 35%

It may be possible to optimize it more, I guess.

Btw, changing to movdqu instead of movups may increase the speed a little bit.

aw27 · April 20, 2017, 11:54:07 PM

Quote from: guga on April 20, 2017, 10:12:13 PM
Ok..here it is :)

Fully optimized version 4 times faster than the original :t :t :t

Note: Later I'll rename the variables to proper names.

Congratulations, although I cannot test it because I don't have RosAsm. I know I can download it, but I am just lazy.

guga · April 21, 2017, 04:13:29 AM

I'll try to port it to MASM and post it here.

guga · April 21, 2017, 06:34:54 AM

Ok. Here is the masm version.

Sorry, I don't know which macros MASM uses for repeat loops (not the while, just the dec + jcc chains), so I posted the full disassembled source plus some macros that I presume are correct, if I remember the MASM syntax well.

Matrix_FlipX	proc public USES esi edi ebx ecx edx Input : ptr, Output : ptr, Width : dword, Height : dword
	LOCAL MaxXPos: dword 
	LOCAL MaxYPos: dword
	LOCAL Remainder: dword
	LOCAL AdjustSmallSize: dword
	LOCAL NextScanLine: dword

		mov	esi, Input
		mov	edi, Output
		mov	eax, Width
		mov	edx, eax
		mov	ebx, eax
		and	edx, 3
		mov	Remainder, edx
		shr	eax, 2
		mov	MaxXPos, eax
		mov	eax, Height
		mov	MaxYPos, eax
		mov	AdjustSmallSize, 0
		mov	ebx, 4

		.If MaxXPos > 0
			mov	ebx, 16
			mov	AdjustSmallSize, 12
		.Endif
				
		mov	eax, Width
		shl	eax, 2
		mov	NextScanLine, eax
		sub	eax, ebx
		mov	edi, Output
		add	edi, eax

loc_40AF18:				
		mov	ebx, MaxXPos
		mov	eax, edi
		mov	ecx, esi
		test	ebx, ebx
		jz	short loc_40AF38

loc_40AF23:				
		movdqu	xmm0, xmmword ptr [ecx]
		pshufd	xmm0, xmm0, 27
		movups	xmmword	ptr [eax], xmm0
		sub	eax, 16
		add	ecx, 16
		dec	ebx
		jg	short loc_40AF23

loc_40AF38:
		.If Remainder >= 1			
			mov	edx, AdjustSmallSize
			mov	ebx, [ecx]
			mov	[edx+eax], ebx
			.If Remainder >= 2
				mov	ebx, [ecx+4]
				mov	[edx+eax-4], ebx
				.If Remainder == 3
					mov	ebx, [ecx+8]
					mov	[edx+eax-8], ebx
				.EndIf
			.EndIf
		.EndIf

		add	edi, NextScanLine
		add	esi, NextScanLine
		dec	MaxYPos
		jg	short loc_40AF18
		ret

Matrix_FlipX	endp

Note: I'm pretty sure it can be optimized further by removing those chains of If Remainder (especially because we can have large matrices with a small width, such as 3x280), but I'm unable to produce faster code right now. Anyway, it is faster than the original version.
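
One possible way to drop the If Remainder chain, sketched here in scalar C++ and not benchmarked: when the width is at least 4, the 1 to 3 leftover elements can be covered by one more group of 4 that overlaps the previous group. This is a different technique from the code above, and it assumes the input and output buffers do not overlap. `reverse4` and `flip_row_overlapped` are hypothetical names; `reverse4` stands for one movdqu / pshufd 27 / movups sequence.

```cpp
#include <cstddef>
#include <cstdint>

// Reverses one group of 4 elements, as pshufd with immediate 27 does.
static void reverse4(const std::int32_t* src, std::int32_t* dst) {
    dst[0] = src[3]; dst[1] = src[2]; dst[2] = src[1]; dst[3] = src[0];
}

// Flips one row; for cols >= 4 there is no per-element tail.
void flip_row_overlapped(const std::int32_t* src, std::int32_t* dst, std::size_t cols) {
    if (cols < 4) {                          // too narrow for a group of 4
        for (std::size_t i = 0; i < cols; ++i) dst[cols - 1 - i] = src[i];
        return;
    }
    std::size_t i = 0;
    for (; i + 4 <= cols; i += 4) {          // full groups, stored from the row end backwards
        reverse4(src + i, dst + cols - 4 - i);
    }
    if (i < cols) {                          // 1..3 elements left over:
        reverse4(src + cols - 4, dst);       // redo the last 4 source elements as one group
    }
}
```

The overlapping group rewrites up to three destination slots with the values they already hold, which is harmless here. Widths below 4 (such as the 3x280 case) still take the scalar path, so this only helps when at least one full group fits in a row.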

newrobert · April 21, 2017, 11:52:30 AM

I want to know: how did you disassemble the source?

guga · April 21, 2017, 01:54:36 PM

RosAsm, or in this case IDA Pro (to make it MASM compatible), but that was mainly to remind myself of the MASM syntax so I would be able to port it. RosAsm is NASM compatible, so to make it easier for others to read, I assembled the new function with RosAsm and disassembled it with IDA to produce a version compatible for MASM users.

I still have little time to write a RosAsm-to-MASM converter/translator (as a standalone tool or inside the RosAsm project itself), so the easier and faster way was simply to disassemble the code.

The MASM Forum
