Title: Fast Matrix Flip
Post by: guga on April 07, 2017, 03:56:33 AM

Hi guys,

Continuing with the matrix operations, I built a routine that flips a matrix along the X (width) axis using SSE (thanks, Jochen and Marinus).

The goal was to flip matrices like this:

[Teste4x4: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22]
into:

[Teste4x4Inverted: F$ 4, 3, 2, 1,
F$ 10, 9, 8, 7,
F$ 16, 15, 14, 13,
F$ 22, 21, 20, 19]

The problem, however, remains with non-square matrices. I'm still struggling with how to set the proper flags, or with the math involved, when dealing with non-square matrices (such as 3x2, 27x18, 13x9, etc.).

The code I made for square matrices, flipping along the X axis, is:

Code:

Proc SquaredMatrix_FlipHorizontal_SSE2new:
    Arguments @Input, @Output, @Width, @Height
    Local @MaxXPos, @CurYPos
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    mov edx D@Height | mov D@CurYPos edx

    mov ebx D@Width
    mov eax ebx | shr eax 2 | mov D@MaxXPos eax ; MaxXPos = Width/4 (number of 16-byte blocks per row)
    shl ebx 2 | mov eax ebx ; ebx = eax = Width*4 (bytes per scanline)


L2:
    mov ecx D@MaxXPos
    mov edx esi
    Align 64 ; <---- Aligning the inner loop to 64 gave the best speed and stability here. (With Align 16 the result is a bit slower)

    L8:

        movdqu XMM0 X$edx+eax-16 ; edx+(Width*4)-16
        pshufd XMM0 XMM0 27 ; invert all 4 dwords from left to right
        sub edx (4*4)
        movups X$edi xmm0
        add edi (4*4)
        dec ecx | jg L8<

    add eax ebx ; advance eax to the next scanline (ebx = Width*4)
    dec D@CurYPos | jnz L2<<

    mov eax D@Output

EndP

Timing for the whole function is only 128.74 nanoseconds :)
Aligning to 16 instead decreases the speed a bit, resulting in something around 136 nanoseconds.

But... how to make it work for non-square matrices and keep it fast? What is the math involved with non-square matrices?
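
For what it's worth, a horizontal flip is done row by row, so the height never enters the math; what the SSE loop above actually depends on is the width being a multiple of 4. A minimal scalar sketch in C++ that works for any width and height and can serve as a reference to check an SSE version against (the name flip_horizontal_ref is made up for illustration; it is not part of the code above):

```cpp
#include <cassert>
#include <cstddef>

// Reference (non-SSE) horizontal flip: out[y][x] = in[y][width - 1 - x].
// Assumes tightly packed rows and non-overlapping buffers.
void flip_horizontal_ref(const float* in, float* out,
                         std::size_t width, std::size_t height)
{
    assert(in != out);
    for (std::size_t y = 0; y < height; ++y) {
        const float* src = in + y * width;   // start of the source row
        float* dst = out + y * width;        // start of the destination row
        for (std::size_t x = 0; x < width; ++x)
            dst[x] = src[width - 1 - x];
    }
}
```

Only the inner loop changes when SSE is used: full groups of 4 columns can be reversed with pshufd, and the 1 to 3 leftover columns need a separate path.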

I don't remember the MASM syntax exactly, but the code above is probably something like this:

Code:

SquaredMatrix_FlipHorizontal_SSE2new proc uses esi edi ebx ecx edx Input:dword, Output:dword, MatWidth:dword, Height:dword
	; note: WIDTH is a reserved MASM operator, so the width parameter is named MatWidth here
	LOCAL CurYPos:dword
	LOCAL MaxXPos:dword

	mov	esi, Input
	mov	edi, Output
	mov	edx, Height
	mov	CurYPos, edx
	mov	ebx, MatWidth
	mov	eax, ebx
	shr	eax, 2
	mov	MaxXPos, eax		; MaxXPos = Width/4
	shl	ebx, 2
	mov	eax, ebx		; ebx = eax = Width*4

Loop2:
		mov	ecx, MaxXPos
		mov	edx, esi
		jmp	Loop1
; ---------------------------------------------------------------------------
		align 16	; "align64" is not a MASM directive; as far as I know, values above 16 are rejected with the default code segment alignment
; ---------------------------------------------------------------------------

Loop1:

		movdqu	xmm0, xmmword ptr [eax+edx-10h]
		pshufd	xmm0, xmm0, 1Bh	; reverse the 4 dwords
		sub	edx, 10h
		movups	xmmword	ptr [edi], xmm0
		add	edi, 10h
		dec	ecx
		jg	Loop1
		add	eax, ebx
		dec	CurYPos
		jnz	Loop2
		mov	eax, Output
		ret

SquaredMatrix_FlipHorizontal_SSE2new endp

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 07, 2017, 09:38:23 AM

Fast transposing of even or uneven sizes can be done with a 4 * 4 matrix.
Reserve enough memory to read from and write to (aligned to 4 * 16 bytes).
Read 4 * 4 pixels at once with the correct memory steps for the rows and columns.
Write 4 * 4 pixels at once, also with the correct memory steps for the 4 * 4 blocks.

It seems illogical to use a 4 * 4 matrix for uneven image sizes (the unused pixels...), but that only affects the last 1, 2 or 3 pixels of the right border and the bottom border.
And if you process larger image sizes than in the example below, it is really mega fast (12 cycles per 16 pixels on my PC).

Code:

;The 4 * 4 transpose algorithm:
;     In:         Out:
; [0 1 2 3]    [0 4 8 C]
; [4 5 6 7]    [1 5 9 D]
; [8 9 A B]    [2 6 A E]
; [C D E F]    [3 7 B F]

    mov         eax,offset MatrixIn

    movaps      xmm0,[eax+0]    ; [0 1 2 3]
    movaps      xmm1,[eax+16]   ; [4 5 6 7]
    movaps      xmm2,[eax+32]   ; [8 9 A B]
    movaps      xmm3,[eax+48]   ; [C D E F]

    mov         eax,offset MatrixOut

    movaps      xmm4,xmm0       ; [0 1 2 3]
    movaps      xmm5,xmm2       ; [8 9 A B]
    unpcklps    xmm4,xmm1       ; [0 4 1 5]
    unpcklps    xmm5,xmm3       ; [8 C 9 D]
    unpckhps    xmm0,xmm1       ; [2 6 3 7]
    unpckhps    xmm2,xmm3       ; [A E B F]
    movaps      xmm1,xmm4       ; [0 4 1 5]
    movaps      xmm6,xmm0       ; [2 6 3 7]
    movlhps     xmm4,xmm5       ; [0 4 8 C]
    movlhps     xmm6,xmm2       ; [2 6 A E]
    movhlps     xmm5,xmm1       ; [1 5 9 D]
    movaps      xmm7,xmm2       ; [A E B F]
    movhlps     xmm7,xmm0       ; [3 7 B F]

    movaps      [eax+0],xmm4    ; [0 4 8 C]
    movaps      [eax+16],xmm5   ; [1 5 9 D]
    movaps      [eax+32],xmm6   ; [2 6 A E]
    movaps      [eax+48],xmm7   ; [3 7 B F]

I don't have time to write the source code for it, but I'll explain the principle below.
To keep it easier to understand I used a small size image example.

Code:

; each pixel number is 4 byte
pixelbufferIn   dd 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,-,-,-,-,-,-,-,-,-,- ( -, = alignment to 4 * 16 bytes )
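pixelbufferOut dd 32 dup (0) ; also padded to 4 * 16 bytes: the last 16-byte store below ends at byte 100, so 24 dwords would be too small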

Example of an uneven image width = 5, height = 3

 0, 1, 2, 3, 4 
 5, 6, 7, 8, 9
10,11,12,13,14

           ; mov    esi, offset pixelbufferIn
           ; mov    edi, offset pixelbufferOut

Gather the pixels from the pixelbufferIn with a step of 5 pixels ( image width )
Move the transposed pixels to the pixelbufferOut with a step of 3 ( image height)

 [ 0  1  2  3] ; movups xmm0,[esi+0]
 [ 5  6  7  8] ; movups xmm1,[esi+20]
 [10 11 12 13] ; movups xmm2,[esi+40]
 [-- -- -- --] ; movups xmm3,[esi+60] ; these are the values 15, 16, 17, 18, we don't need them but they are needed for the algorithm...

 transpose the block of 16 pixels at once:

    movaps      xmm4,xmm0
    movaps      xmm5,xmm2
    unpcklps    xmm4,xmm1
    unpcklps    xmm5,xmm3
    unpckhps    xmm0,xmm1
    unpckhps    xmm2,xmm3
    movaps      xmm1,xmm4
    movaps      xmm6,xmm0
    movlhps     xmm4,xmm5
    movlhps     xmm6,xmm2
    movhlps     xmm5,xmm1
    movaps      xmm7,xmm2
    movhlps     xmm7,xmm0

 [ 0  5 10 --] ; movups [edi+0],xmm4
 [ 1  6 11 --] ; movups [edi+12],xmm5
 [ 2  7 12 --] ; movups [edi+24],xmm6
 [ 3  8 13 --] ; movups [edi+36],xmm7

this gives a result of:

 0,5,10,1,6,11,2,7,12,3,8,13

Now get the uneven pixels of column 5

 [ 4 -- -- --] ; movups xmm0,[esi+16]
 [ 9 -- -- --] ; movups xmm1,[esi+36]
 [14 -- -- --] ; movups xmm2,[esi+56]
 [-- -- -- --] ; movups xmm3,[esi+76]

 transpose the block of 16 pixels at once:

    movaps      xmm4,xmm0
    movaps      xmm5,xmm2
    unpcklps    xmm4,xmm1
    unpcklps    xmm5,xmm3
    unpckhps    xmm0,xmm1
    unpckhps    xmm2,xmm3
    movaps      xmm1,xmm4
    movaps      xmm6,xmm0
    movlhps     xmm4,xmm5
    movlhps     xmm6,xmm2
    movhlps     xmm5,xmm1
    movaps      xmm7,xmm2
    movhlps     xmm7,xmm0

 [ 4  9 14 --] ; movups [edi+48],xmm4
 [-- -- -- --] ; movups [edi+60],xmm5
 [-- -- -- --] ; movups [edi+72],xmm6
 [-- -- -- --] ; movups [edi+84],xmm7

this gives a result of:

 0,5,10,1,6,11,2,7,12,3,8,13,4,9,14

here is your transposed image:

 0, 5, 10
 1, 6, 11
 2, 7, 12
 3, 8, 13
 4, 9, 14
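
The index math behind the steps above (read with a step of the image width, write with a step of the image height) can be sketched in plain C++. This is only a scalar reference under the assumption of tightly packed rows, not the 4 * 4 block algorithm itself; transpose_ref is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>

// Scalar transpose: the pixel at (x, y) of a width x height image
// ends up at (y, x) of a height x width image.
void transpose_ref(const int* in, int* out,
                   std::size_t width, std::size_t height)
{
    assert(in != out);
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            out[x * height + y] = in[y * width + x];
}
```

Running it on the 5 x 3 example should reproduce the 0, 5, 10, 1, 6, 11, ... sequence shown above, which makes it a convenient check for the block version.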

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 07, 2017, 10:39:06 AM

Improved and faster 4*4 Transpose Matrix :biggrin:

Code:

; [0 1 2 3]    [0 4 8 C]
; [4 5 6 7]    [1 5 9 D]
; [8 9 A B]    [2 6 A E]
; [C D E F]    [3 7 B F]

    mov         esi,offset MatrixIn
    mov         edi,offset MatrixOut

    movaps      xmm0,[esi+0]    ; [0 1 2 3]
    movaps      xmm1,[esi+16]   ; [4 5 6 7]
    movaps      xmm2,[esi+32]   ; [8 9 A B]
    movaps      xmm3,[esi+48]   ; [C D E F]

    movaps      xmm4,xmm0       ; [0 1 2 3]
    movaps      xmm5,xmm2       ; [8 9 A B]
    unpcklps    xmm4,xmm1       ; [0 4 1 5]
    unpcklps    xmm5,xmm3       ; [8 C 9 D]
    unpckhps    xmm0,xmm1       ; [2 6 3 7]
    unpckhps    xmm2,xmm3       ; [A E B F]
    movaps      xmm1,xmm4       ; [0 4 1 5]
    movaps      xmm6,xmm0       ; [2 6 3 7]
    movlhps     xmm4,xmm5       ; [0 4 8 C]
    movlhps     xmm6,xmm2       ; [2 6 A E]
    movhlps     xmm5,xmm1       ; [1 5 9 D]
    movhlps     xmm2,xmm0       ; [3 7 B F]

    movaps      [edi+0],xmm4    ; [0 4 8 C]
    movaps      [edi+16],xmm5   ; [1 5 9 D]
    movaps      [edi+32],xmm6   ; [2 6 A E]
    movaps      [edi+48],xmm2   ; [3 7 B F]
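
For readers working from C++, the shuffle sequence above maps more or less one-to-one onto SSE intrinsics. A sketch, assuming 16 consecutive floats in row-major order (unaligned loads and stores are used so the buffers need no special alignment; transpose4x4 is an illustrative name):

```cpp
#include <cassert>
#include <xmmintrin.h>

// 4 x 4 transpose with unpcklps/unpckhps/movlhps/movhlps.
// Comments use the same 0..F element labels as the listing above.
void transpose4x4(const float* in, float* out)
{
    assert(in != nullptr && out != nullptr);

    __m128 r0 = _mm_loadu_ps(in + 0);    // [0 1 2 3]
    __m128 r1 = _mm_loadu_ps(in + 4);    // [4 5 6 7]
    __m128 r2 = _mm_loadu_ps(in + 8);    // [8 9 A B]
    __m128 r3 = _mm_loadu_ps(in + 12);   // [C D E F]

    __m128 t0 = _mm_unpacklo_ps(r0, r1); // [0 4 1 5]
    __m128 t1 = _mm_unpacklo_ps(r2, r3); // [8 C 9 D]
    __m128 t2 = _mm_unpackhi_ps(r0, r1); // [2 6 3 7]
    __m128 t3 = _mm_unpackhi_ps(r2, r3); // [A E B F]

    _mm_storeu_ps(out + 0,  _mm_movelh_ps(t0, t1)); // [0 4 8 C]
    _mm_storeu_ps(out + 4,  _mm_movehl_ps(t1, t0)); // [1 5 9 D]
    _mm_storeu_ps(out + 8,  _mm_movelh_ps(t2, t3)); // [2 6 A E]
    _mm_storeu_ps(out + 12, _mm_movehl_ps(t3, t2)); // [3 7 B F]
}
```

The compiler picks the registers, so the extra movaps copies from the assembly version do not appear in the source.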

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 07, 2017, 07:35:50 PM

Hi guga,

I just read your post again. I thought it was about transposing images with sizes that are not multiples of 4.
Now I see it's about flipping along the X-axis. Sorry, my mistake.

Title: Re: Fast Matrix Flip
Post by: aw27 on April 08, 2017, 01:43:18 AM

Quote from: guga on April 07, 2017, 03:56:33 AM
But... how to make it work for non-square matrices and keep it fast? What is the math involved with non-square matrices?

Hi guga,

I don't know if this is what you want, but here is my solution to flip a 2D matrix along the X-axis. It should work for any number of columns and rows.

Code:

option casemap:none
option frame:auto
OPTION STACKBASE:RBP

.code

flipMatrix proc public outMat : ptr, inMat : ptr, rows : qword, cols : qword
	LOCAL xmmMovesRequired : qword
	LOCAL remainder : qword
	
	mov outMat, rcx
	mov inMat, rdx
	mov rows, r8
	mov cols, r9
	
	; How many xmm moves per row are required;
	mov rax, cols
	mov r10, 4
	xor rdx, rdx
	div r10
	mov xmmMovesRequired, rax
	; Remainder
	mov remainder, rdx
	
	mov r11,0
	.while r11<rows
		mov r10,0
		; make destination point to the end of every row
		mov rax, r11
		inc rax
		mul cols
		shl rax, 2
		mov rcx, outMat
		add rcx, rax 
		.if xmmMovesRequired>0
			sub rcx, 16 ; subtract enough to fill one xmm register
		.else
			sub rcx, 4 ; case of less than 4 columns
		.endif
		; source points to the start of every row
		mov rax, r11
		mul cols
		shl rax, 2
		mov rdx, inMat		
		add rdx, rax
		.while r10<xmmMovesRequired
			movups xmm0, xmmword ptr [rdx]
			pshufd xmm0, xmm0, 00011011b
			movups xmmword ptr [rcx], xmm0
			sub rcx, 16
			add rdx, 16
			inc r10
		.endw
		.if xmmMovesRequired>0
			add rcx, 12
		.endif

		.if remainder >= 1
			mov r10d, dword ptr [rdx]
			mov dword ptr [rcx], r10d
			.if remainder >= 2
				mov r10d, dword ptr [rdx+4]
				mov dword ptr [rcx-4], r10d
				.if remainder == 3	
					mov r10d, dword ptr [rdx+8]
					mov dword ptr [rcx-8], r10d
				.endif	
			.endif
		.endif
		
		inc r11
	.endw

	ret
flipMatrix endp

end

It was tested by calling from C++ console application, like this:

Code:

#include "stdafx.h"
#include <stdio.h>   // printf, getchar (harmless if stdafx.h already includes it)


/* tested
#define TOTALROWS 2
#define TOTALCOLS 2
int inmatrix[TOTALROWS][TOTALCOLS] = {
	{ 11, 12 },
	{ 21, 22}
};
*/
/* tested
#define TOTALROWS 1
#define TOTALCOLS 1
int inmatrix[TOTALROWS][TOTALCOLS] = {
	{ 11 }
};
*/
/* tested
#define TOTALROWS 4
#define TOTALCOLS 4
int inmatrix[TOTALROWS][TOTALCOLS] = {
{ 11, 12, 13,14},
{ 21, 22, 23,24},
{ 31, 32, 33,34},
{ 41, 42, 43,44},
};
*/

#define TOTALROWS 5
#define TOTALCOLS 9
int inmatrix[TOTALROWS][TOTALCOLS] = {
	{ 11, 12, 13,14,15,16,17,18,19},
	{ 21, 22, 23,24,25,26,27,28,29},
	{ 31, 32, 33,34,35,36,37,38,39},
	{ 41, 42, 43,44,45,46,47,48,49},
	{ 51, 52, 53,54,55,56,57,58,59}
};

int outmatrix[TOTALROWS][TOTALCOLS];

extern "C"
{
	void flipMatrix(void* outMat, void* inMat, size_t rows, size_t cols);

}

int main()
{
	flipMatrix(outmatrix, inmatrix, TOTALROWS, TOTALCOLS);


	printf("Input matrix\n");
	for (int row = 0; row < TOTALROWS; row++)
	{
		for (int columns = 0; columns < TOTALCOLS; columns++)
			printf("%d     ", inmatrix[row][columns]);
		printf("\n");
	}


	printf("Output matrix\n");
	for (int row = 0; row < TOTALROWS; row++)
	{
		for (int columns = 0; columns < TOTALCOLS; columns++)
			printf("%d     ", outmatrix[row][columns]);
		printf("\n");
	}
	getchar();
	return 0;
}
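
The same idea (reverse 4 dwords at a time with pshufd, then copy the 1 to 3 leftover columns one by one) can be written with SSE2 intrinsics. This is a rough C++ sketch rather than a translation of the routine above; flip_rows_sse2 is a hypothetical name and it assumes tightly packed rows and non-overlapping buffers:

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>

// Flips every row of a rows x cols matrix of 32-bit values left to right.
void flip_rows_sse2(const int* in, int* out, std::size_t rows, std::size_t cols)
{
    assert(in != out);
    const std::size_t blocks = cols / 4;   // full 16-byte moves per row
    const std::size_t rem = cols % 4;      // leftover columns (0..3)

    for (std::size_t r = 0; r < rows; ++r) {
        const int* src = in + r * cols;    // start of the source row
        int* dst = out + (r + 1) * cols;   // one past the end of the destination row

        for (std::size_t b = 0; b < blocks; ++b) {
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
            v = _mm_shuffle_epi32(v, 0x1B);   // pshufd ..., 27: reverse the 4 dwords
            dst -= 4;
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
            src += 4;
        }
        for (std::size_t i = 0; i < rem; ++i)  // scalar tail, still written backwards
            *--dst = src[i];
    }
}
```

With the 5 x 9 test matrix above, the first output row should come out as 19, 18, ..., 11.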

Title: Re: Fast Matrix Flip
Post by: guga on April 18, 2017, 03:44:08 PM

Hi Marinus,

No problem :) I was also testing your transpose algo but, unfortunately, I'm still not able to make it work on non-square matrices. The algo seems faster than JJ's, but I'm not succeeding in making it work as expected (non-square).

Many thanks, AW. I'll give it a try :) :t

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 18, 2017, 07:07:59 PM

Hi guga,

aw27 worked out the method i presented in Reply #1
http://masm32.com/board/index.php?topic=6140.msg65145#msg65145

Title: Re: Fast Matrix Flip
Post by: guga on April 18, 2017, 10:20:08 PM

Thanks a lot, Marinus and AW. I'll give it a try and post the results for speed testing :)

Title: Re: Fast Matrix Flip
Post by: guga on April 20, 2017, 06:08:37 PM

AW,

Here is the port to RosAsm in x86. Many thanks. I'll try to optimize it.

Code:

Proc Matrix_FlipHorizontal:
    Arguments @Input, @Output, @Width, @Height
    Local @xmmMovesRequired, @remainder
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; How many xmm moves per row are required;
    mov eax D@Width
    mov ebx 4
    xor edx edx
    div ebx
    mov D@xmmMovesRequired eax
    
    ; Remainder
    mov D@Remainder edx

    mov ecx 0

    .While ecx < D@Height
        mov ebx 0
        ; make destination point to the end of every row
        mov eax ecx
        inc eax
        mul D@Width
        shl eax 2
        mov edi D@Output
        add edi eax
        If D@xmmMovesRequired > 0
            sub edi 16 ; subtract enough to fill one xmm register
        Else
            sub edi 4 ; case of less than 4 columns
        End_if

        ; source points to the start of every row
        mov eax ecx
        mul D@Width
        shl eax 2
        mov esi D@Input  
        add esi eax
        While ebx < D@xmmMovesRequired
            movups xmm0 X$esi
            pshufd xmm0 xmm0 27
            movups X$edi xmm0
            sub edi 16
            add esi 16
            inc ebx
        End_While

        If D@xmmMovesRequired > 0
            add edi 12
        End_if

        ..If D@remainder >= 1
            mov ebx D$esi
            mov D$edi ebx
            .If D@remainder >= 2
                mov ebx D$esi+4
                mov D$edi-4 ebx
                If D@remainder = 3 
                    mov ebx D$esi+8
                    mov D$edi-8 ebx
                End_if
            .End_if
        ..End_if
  
        inc ecx     
    .End_While


EndP

Title: Re: Fast Matrix Flip
Post by: guga on April 20, 2017, 10:12:13 PM

Ok... here it is :)

Fully optimized version, faster than the original :t :t :t

Note: Later I'll relabel the variables with proper names.

New version:

Code:

Proc SquaredMatrix_FlipHorizontal_SSE3Guga:
    Arguments @Input, @Output, @Width, @Height
    Local @xmmMovesRequired, @remainder, @Var1, @Var2, @Var3, @MaxHeight, @CurMovesRequired
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; How many xmm moves per row are required;
    mov eax D@Width
    mov edx eax
    mov ebx eax
    and edx 3 | mov D@Remainder edx
    shr eax 2 | mov D@xmmMovesRequired eax

    mov eax D@Height | mov D@MaxHeight eax

    mov D@Var2 0
    mov D@Var1 4 ; case of less than 4 columns
    If D@xmmMovesRequired > 0
        mov D@Var1 16 ; subtract enough to fill one xmm register
        mov D@Var2 12
    End_if

    mov eax D@Width | shl eax 2 | mov D@Var3 eax |  sub eax D@Var1
    mov edi D@Output | add edi eax

L1:
        mov ebx D@xmmMovesRequired
        ; source points to the start of every row
        mov eax edi
        mov ecx esi
        test ebx ebx | Jz L4>

        L0:
            movups xmm0 X$ecx
            pshufd xmm0 xmm0 27
            movups X$eax xmm0
            sub eax 16
            add ecx 16
            dec ebx | jg L0<
L4:
        ..If D@remainder >= 1
            mov edx D@Var2
            mov ebx D$ecx | mov D$eax+edx ebx
            .If D@remainder >= 2
                mov ebx D$ecx+4 | mov D$eax+edx-4 ebx
                If D@remainder = 3
                    mov ebx D$ecx+8 | mov D$eax+edx-8 ebx
                End_if
            .End_if
        ..End_if

         add edi D@Var3
         add esi D@Var3

        dec D@MaxHeight | jg L1<


EndP

Benchmark tests:
Original version: 249.31571145976065 ns (731 clock cycles)
New version: 163.09080899440107 ns (480 clock cycles)

Speed improvement: around 35%.

It can probably be optimized further, I guess.

Btw... changing movups to movdqu may increase the speed a little bit.

Title: Re: Fast Matrix Flip
Post by: aw27 on April 20, 2017, 11:54:07 PM

Quote from: guga on April 20, 2017, 10:12:13 PM
Ok..here it is :)

Fully optimized version 4 times faster then the original :t :t :t

Congratulations, although I cannot test it because I don't have RosAsm. I know I can download it, but I am just lazy.

Title: Re: Fast Matrix Flip
Post by: guga on April 21, 2017, 04:13:29 AM

I'll try to port it to MASM and post it here.

Title: Re: Fast Matrix Flip
Post by: guga on April 21, 2017, 06:34:54 AM

Ok. Here is the MASM version.

Sorry, I don't know what the macros for repeat are (not the while, just the dec + jcc chains), so I posted the full disassembled source plus some macros that I presume are correct, if I remember the MASM syntax well.

Code:

Matrix_FlipX	proc public USES esi edi ebx ecx edx Input : ptr, Output : ptr, MatWidth : dword, Height : dword
	; note: WIDTH is a reserved MASM operator, so the width parameter is named MatWidth here
	LOCAL MaxXPos: dword 
	LOCAL MaxYPos: dword
	LOCAL Remainder: dword
	LOCAL AdjustSmallSize: dword
	LOCAL NextScanLine: dword

		mov	esi, Input
		mov	edi, Output
		mov	eax, MatWidth
		mov	edx, eax
		and	edx, 3
		mov	Remainder, edx		; Remainder = Width mod 4
		shr	eax, 2
		mov	MaxXPos, eax		; MaxXPos = Width / 4
		mov	eax, Height
		mov	MaxYPos, eax
		mov	AdjustSmallSize, 0
		mov	ebx, 4			; case of less than 4 columns

		.If MaxXPos > 0
			mov	ebx, 16		; subtract enough to fill one xmm register
			mov	AdjustSmallSize, 12
		.Endif
				
		mov	eax, MatWidth
		shl	eax, 2
		mov	NextScanLine, eax	; bytes per scanline
		sub	eax, ebx
		mov	edi, Output
		add	edi, eax		; edi = last 16-byte (or 4-byte) slot of the first output row

loc_40AF18:				
		mov	ebx, MaxXPos
		mov	eax, edi
		mov	ecx, esi
		test	ebx, ebx
		jz	short loc_40AF38

loc_40AF23:				
		movdqu	xmm0, xmmword ptr [ecx]
		pshufd	xmm0, xmm0, 27		; reverse the 4 dwords
		movups	xmmword	ptr [eax], xmm0
		sub	eax, 16
		add	ecx, 16
		dec	ebx
		jg	short loc_40AF23

loc_40AF38:
		.If Remainder >= 1			
			mov	edx, AdjustSmallSize
			mov	ebx, [ecx]
			mov	[edx+eax], ebx
			.If Remainder >= 2
				mov	ebx, [ecx+4]
				mov	[edx+eax-4], ebx
				.If Remainder == 3
					mov	ebx, [ecx+8]
					mov	[edx+eax-8], ebx
				.EndIf
			.EndIf
		.EndIf			
					
		add	edi, NextScanLine
		add	esi, NextScanLine
		dec	MaxYPos
		jg	short loc_40AF18
		ret

Matrix_FlipX	endp

Note: I'm pretty sure it can be optimized further by removing those chains of If Remainder (especially because we can have large matrices with a small width, such as 3x280), but I'm unable to produce faster code right now. Anyway, it is faster than the original version.

Title: Re: Fast Matrix Flip
Post by: newrobert on April 21, 2017, 11:52:30 AM

I'd like to know how you disassembled the source?

Title: Re: Fast Matrix Flip
Post by: guga on April 21, 2017, 01:54:36 PM

RosAsm, or in this case IDA Pro (to make it MASM compatible), but it was mainly to remember the MASM syntax so I would be able to port it. RosAsm is NASM compatible, so to make it easier for others to read, I assembled the new function with RosAsm and disassembled it with IDA to get a MASM-compatible version for MASM users.

I still have too little time to make a RosAsm to MASM converter/translator (as a standalone tool or inside the RosAsm project itself), so the easier and faster way was simply to disassemble the code.

Title: Re: Fast Matrix Flip
Post by: aw27 on April 21, 2017, 05:57:21 PM

Quote from: guga on April 21, 2017, 06:34:54 AM
Ok. Here is the masm version.

Well done. The gain was mostly due to the elimination of the mul instructions, which have a great impact, particularly inside a loop.

Title: Re: Fast Matrix Flip
Post by: guga on April 21, 2017, 07:10:37 PM

Thanks.

And also the div, which was replaced by "and". The div instruction is always slow. Also, in the updated versions I'm working on, I replaced all movups with movdqu and got a speed gain of around 12% (measured on my i7). And if you remove all the push/pop operations at the beginning (the USES directive in MASM; in RosAsm we have a similar thing, but it is not a directive, it is a user-made macro also called "uses"), the code will be faster still. I only kept them because I don't want the function to alter the registers.

Don't forget that a bare dec/jcc pair is generally a bit faster than cmp xxx / jcc. Also, test is a bit faster than cmp.
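
The div replacement mentioned above relies on the fact that, for unsigned values, dividing by 4 is a right shift by 2 and the remainder is the low two bits. A small C++ sketch of that identity (WidthSplit and split_width are illustrative names only):

```cpp
#include <cassert>
#include <cstdint>

struct WidthSplit {
    std::uint32_t blocks;     // number of full 4-dword (16-byte) moves per row
    std::uint32_t remainder;  // leftover dwords, 0..3
};

// Equivalent of "shr eax 2" and "and edx 3" for an unsigned width.
WidthSplit split_width(std::uint32_t width)
{
    WidthSplit s = { width >> 2, width & 3u };
    assert(s.blocks * 4u + s.remainder == width);  // same result as div by 4
    return s;
}
```

A compiler normally does this strength reduction by itself for a constant power-of-two divisor; in hand-written assembly it has to be done explicitly.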

Btw... I also just made a version that rotates a matrix by 180º directly. (Later I'll check the speed and see if it also gains from replacing movups with movdqu.) I'll convert it to MASM once I finish testing that it works as expected.

Code:

; Rotate a Matrix at 180º

Proc Matrix_FlipXY:
    Arguments @Input, @Output, @Width, @Height
    Local @MaxXPos, @MaxYPos, @remainder, @NextScanLine, @AdjustSmallSize
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; How many xmm moves per row are required;
    mov eax D@Width
    mov edx eax
    mov ebx eax
    and edx 3 | mov D@Remainder edx
    shr eax 2 | mov D@MaxXPos eax

    ; make destination point to the end of the last row (the output is filled from the bottom up)
    mov eax D@Width | shl eax 2 | mov D@NextScanLine eax
    mov eax D@Height | mov D@MaxYPos eax | mul D@NextScanLine; | sub eax 4
    ;mov eax D@Height | mov D@MaxYPos eax | dec eax | mul D@NextScanLine
    mov edi D@Output | add edi eax

    mov D@AdjustSmallSize 0
    mov ebx 4 ; case of less than 4 columns
    If D@MaxXPos > 0
        mov ebx 16 ; subtract enough to fill one xmm register
        mov D@AdjustSmallSize 12
    End_if
    sub edi ebx

    ;mov eax D@Width | shl eax 2 | mov D@NextScanLine eax |  sub eax ebx
    ;mov edi D@Output | add edi eax

    L1:
        mov ebx D@MaxXPos
        mov eax edi
        mov ecx esi
        test ebx ebx | jz L4>

        L0:
            movups xmm0 X$ecx
            pshufd xmm0 xmm0 27
            movups X$eax xmm0
            sub eax 16
            add ecx 16
            dec ebx | jg L0<
L4:
        ..If D@remainder >= 1
            mov edx D@AdjustSmallSize
            mov ebx D$ecx | mov D$eax+edx ebx
            .If D@remainder >= 2
                mov ebx D$ecx+4 | mov D$eax+edx-4 ebx
                If D@remainder = 3
                    mov ebx D$ecx+8 | mov D$eax+edx-8 ebx
                End_if
            .End_if
        ..End_if

        sub edi D@NextScanLine
        add esi D@NextScanLine

        dec D@MaxYPos | jg L1<

EndP

Note: Do you have any idea how to do a matrix convolution? For images, like this one: https://www.tutorialspoint.com/dip/concept_of_convolution.htm

I'm optimizing all the matrix code and will later adapt it to image processing (all that seems to be needed is including the pitch at the end of each scanline. It shouldn't affect the speed that much, I hope).

Note2: Later I'll try replacing all the "If remainder" checks with a chain of dec instructions, using "movdqu XMM0 X$ecx | movd D$eax+edx XMM0" to store the remainder dwords. It should be faster even on small matrices, but I haven't tested it yet.
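
A handy way to check a routine like Matrix_FlipXY: rotating a tightly packed matrix by 180º is the same as reversing the whole buffer, because (H-1-y)*W + (W-1-x) = W*H - 1 - (y*W + x). A scalar C++ reference sketch (rotate180_ref is a made-up name; once a pitch is involved, this simple reversal no longer applies as-is):

```cpp
#include <cassert>
#include <cstddef>

// 180-degree rotation of a width x height matrix with no padding between rows:
// element i of the input becomes element (n - 1 - i) of the output.
void rotate180_ref(const int* in, int* out,
                   std::size_t width, std::size_t height)
{
    assert(in != out);
    const std::size_t n = width * height;
    for (std::size_t i = 0; i < n; ++i)
        out[n - 1 - i] = in[i];
}
```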

Title: Re: Fast Matrix Flip
Post by: guga on April 21, 2017, 09:06:09 PM

Ok, the sequence of remainder Ifs can be replaced with a code chain like this (for the example above, Matrix_FlipXY):

Code:

        mov ebx D@remainder
        test ebx ebx | jz L3>
            mov edx D@AdjustSmallSize
            movdqu XMM0 X$ecx | movd D$eax+edx XMM0 ; note: this loads 16 bytes at ecx, i.e. up to 12 bytes past the remainder dwords
            dec ebx | jz L3> | PSRLDQ xmm0 4 | movd D$eax+edx-4 XMM0
            dec ebx | jz L3> | PSRLDQ xmm0 4 | movd D$eax+edx-8 XMM0
        L3:
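
In intrinsics form the chain above looks roughly like the following. Note that, like the movdqu version, it loads a full 16 bytes, so this sketch assumes at least 4 readable dwords at src even when fewer are used (store_tail_reversed is a hypothetical name):

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>

// Stores rem (0..3) leftover dwords backwards:
// dst_last[0] = src[0], dst_last[-1] = src[1], dst_last[-2] = src[2].
void store_tail_reversed(const int* src, int* dst_last, std::size_t rem)
{
    assert(rem < 4);
    if (rem == 0) return;

    // One 16-byte load (movdqu); only the low rem dwords are used.
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    dst_last[0] = _mm_cvtsi128_si32(v);      // movd
    if (--rem == 0) return;

    v = _mm_srli_si128(v, 4);                // psrldq xmm0, 4
    dst_last[-1] = _mm_cvtsi128_si32(v);
    if (--rem == 0) return;

    v = _mm_srli_si128(v, 4);
    dst_last[-2] = _mm_cvtsi128_si32(v);
}
```

If the source buffer ends right after the last row, the scalar If chain is the safer choice for that row, since it never reads past the remainder.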

Title: Re: Fast Matrix Flip
Post by: aw27 on April 21, 2017, 09:20:07 PM

Quote from: guga on April 21, 2017, 07:10:37 PM
see if it also have some advantage replacing the movups on it for movdqu.

For very large matrices you may gain from using several xmm registers, prefetching and movntdq. Something like this (pasting some code I had here for a different purpose):
            .repeat
               prefetchnta BYTE PTR [esi+ecx+4096]
               movdqu xmm0, XMMWORD ptr [esi+ecx]
               movdqu xmm1, XMMWORD ptr [esi+ecx+16]
               movdqu xmm2, XMMWORD ptr [esi+ecx+32]
               movdqu xmm3, XMMWORD ptr [esi+ecx+48]
               movntdq XMMWORD ptr [edi+ecx], xmm0
               movntdq XMMWORD ptr [edi+ecx+16], xmm1
               movntdq XMMWORD ptr [edi+ecx+32], xmm2
               movntdq XMMWORD ptr [edi+ecx+48], xmm3
               sub ecx, 64
            .until ecx==0
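
A C++ sketch of the streaming-store idea with intrinsics, leaving the prefetch out. It assumes the byte count is a multiple of 64 and that the destination is 16-byte aligned, which movntdq requires; stream_copy64 is an illustrative name, and whether non-temporal stores pay off depends on the matrix size and the CPU:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>

// Copies bytes (a multiple of 64) from src to dst, 4 xmm registers per step,
// using non-temporal stores that bypass the cache.
void stream_copy64(const void* src, void* dst, std::size_t bytes)
{
    assert(bytes % 64 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);  // movntdq needs alignment

    const char* s = static_cast<const char*>(src);
    char* d = static_cast<char*>(dst);

    for (std::size_t off = 0; off < bytes; off += 64) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + off));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + off + 16));
        __m128i c = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + off + 32));
        __m128i e = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + off + 48));
        _mm_stream_si128(reinterpret_cast<__m128i*>(d + off), a);
        _mm_stream_si128(reinterpret_cast<__m128i*>(d + off + 16), b);
        _mm_stream_si128(reinterpret_cast<__m128i*>(d + off + 32), c);
        _mm_stream_si128(reinterpret_cast<__m128i*>(d + off + 48), e);
    }
    _mm_sfence();  // order the non-temporal stores before later stores
}
```

For a flip, a pshufd on each register plus a mirrored destination offset would go between the loads and the stores.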

Title: Re: Fast Matrix Flip
Post by: guga on April 22, 2017, 06:39:10 AM

Yes, this is similar to what I'm doing with Marinus's transpose matrix, but implementing it is a bit hard. I'll run the tests on it later. The problem is the leftovers (remainders), which have to be handled in separate loops. It can be overcome, although the resulting code is a bit messy. For this particular function Jochen's code is cleaner, and given the small speed difference between JJ's and Marinus's code (Marinus's is about 35% faster), it may be worth using JJ's code for readability. (I'll post the fixes and the implementation I did on Marinus's code later, so you can better understand what I mean.)

Note: There is no need to use prefetchnta. Its effect on performance seems to depend on the processor. If the code is already doing a large number of loads and stores on the same cache line, prefetches can still hurt performance.
According to Agner Fog, prefetch throughput on Ivy Bridge is only one per 43 cycles, so we need to be careful not to prefetch too much if we don't want prefetches to slow down the code on IvB. This is a performance bug specific to IvB. On other designs, too much prefetching just takes up uop throughput that could have gone to useful instructions (apart from the harm done by prefetching useless addresses).
http://agner.org/optimize/

Even on older processors, the prefetchXXX instructions give no guarantee that the data will be in the cache when it is needed (Intel IA-32 Software Developer's Manual, Volume 2).

In some cases, you get better performance by simply putting an Align 16 directive before the main SSE2 loop (as I did in Jochen's transpose matrix algo).

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 22, 2017, 07:22:02 AM

Guga, you're right. On my Ivy Bridge prefetching uses 43 cycles. Better to let the CPU handle this.
But there are still many PCs around that could benefit from software prefetching.

It must be the month of Matrix calculations. :biggrin:

https://www.codeproject.com/Articles/1182724/Blowing-the-Doors-Off-D-Math-Part-I-Matrix-Multipl

Title: Re: Fast Matrix Flip
Post by: guga on April 22, 2017, 09:04:40 AM

Quote: It must be the month of Matrix calculations.

:icon_mrgreen: :icon_mrgreen: :icon_mrgreen:

Actually, I'm trying to understand the matrix functions because I'm porting the PHash algo to assembly, to see if I can use it in my scene detector plugin for VirtualDub.

PHash compares the similarity between 2 images. It can be used for a wide range of applications, such as image searching, object recognition (including face recognition) and, most probably, scene detection and movement detection. Not to mention it can also be used for audio :) (You may like this :) :greensml:

The main problem with PHash is that it uses the crappy CImg library, which is insanely slow. CImg is a pile of C++ classes that bloats the PHash algo and hurts its performance.

Since PHash uses CImg's matrix manipulation and convolution, I'm trying to recreate not only the matrix functions involved but also the convolution function, to be used later in the scene detection plugin.

The current version of my plugin can detect hard-cut scenes accurately, at a rate near 100%, simply by computing and comparing the minimum standard deviation values of the difference between the images (the difference is obtained with an xor operation, not a simple sub). Processing an entire 1:30-hour video (around 190,000 frames of 720x480) takes around 20 minutes on my i7. That is a somewhat acceptable amount of time, considering the total number of operations, all the frames involved and also the underlying functions used by VirtualDub itself (which, unfortunately, also uses crappy C++ classes). The problem is with soft-cut scenes (transitions, fades, etc.), which are not detected that accurately. So I'm trying to use PHash instead of the STD algos to gain accuracy in general (hard-cut and soft-cut scenes). My goal is to make the scene detector plugin work with nearly 100% accuracy in all cases (soft and hard cuts) in less than 10 minutes per hour of video. (So if I can get it to a rate of 10,000 frames (720x480) per minute or better, it will be great to use.)

I'm currently trying to build a convolution function and check whether the results match CImg's, so I can continue my tests, but so far no success in making this damn convolution thing work :(

Btw: Did you write the article? Well done :)

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 22, 2017, 09:30:33 AM

Hi guga,

No, it is not my article. I just saw it on CodeProject and thought: yet another matrix article, it has to be the month of matrix calculations.
Since it is popular on this forum now.

Have you ever thought about using the video card and Direct3D9 to do the 2D image transformations, the flipping, etc.?
It would be much faster.

Title: Re: Fast Matrix Flip
Post by: guga on April 22, 2017, 09:40:07 AM

Quote: Have you ever thought about using the video card and Direct3D9 to do the 2D image transformations, the flipping, etc.?
It would be much faster.

I didn't think of that, but I'm not sure if VirtualDub will handle D3D9, nor if it will be useful. VirtualDub's functions point to the pixel data and not to a copy of the full image (with the header etc.). So I'm not sure if it is worth using D3D9 directly for pixel manipulation.

Do you have an example of matrix manipulation with D3D9? I mean, for when we already have the pixel data rather than fully processing an image? Another question: does D3D9 have a convolution function?

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 22, 2017, 09:55:00 AM

You can access the pixel data in D3D9 after the image manipulations and save it if you want.
I can make an example if you like, but I need a week or so (short on free time).

Do you need an example that loads an image of 720x480 pixels and does the transposing, flipping X and flipping Y?
And copies the result to a memory buffer or saves the result as an image?

Or something else maybe... only if I know how to do it, of course. :biggrin:

Title: Re: Fast Matrix Flip
Post by: guga on April 22, 2017, 10:33:00 AM

Hi Marinus

Quote
Do you need an example that loads an image of 720x480 pixels and do the transposing, flipping X and flipping Y ?
And copy the result to a memory buffer or save the result as an image?

Yes... that would be nice :) If D3D9 can run these routines faster than what we already did, it may be handy to use it in VDub.

I have never combined VDub with D3D9 before, but it may be helpful if the results are faster than what we already made :)

Another thing is that a convolution is also needed. So, besides transposing and flipping (X and Y), can you give convolution a try too?
The convolution seems to obey this technique. https://www.tutorialspoint.com/dip/concept_of_convolution.htm
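
For readers following along, here is a minimal C++ sketch of that kind of 2D convolution on a single-channel float image. It is not CImg's implementation: the names convolve2d and boxKernel are made up for illustration, and clamping coordinates at the borders is an assumption (other edge policies exist and will give different border pixels).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Convolve a row-major width x height image with a k x k kernel (k odd).
// Assumption: pixels outside the image are taken from the nearest edge pixel.
std::vector<float> convolve2d(const std::vector<float>& src, int width, int height,
                              const std::vector<float>& kernel, int k) {
    assert(k > 0 && k % 2 == 1);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    assert(kernel.size() == static_cast<std::size_t>(k) * k);
    std::vector<float> dst(src.size(), 0.0f);
    const int r = k / 2;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int ky = -r; ky <= r; ++ky) {
                for (int kx = -r; kx <= r; ++kx) {
                    // True convolution flips the kernel: sample at (x - kx, y - ky).
                    const int sx = std::max(0, std::min(x - kx, width - 1));
                    const int sy = std::max(0, std::min(y - ky, height - 1));
                    acc += src[sy * width + sx] * kernel[(ky + r) * k + (kx + r)];
                }
            }
            dst[y * width + x] = acc;
        }
    }
    return dst;
}

// Normalized k x k box (mean) kernel: every weight is 1 / (k * k).
std::vector<float> boxKernel(int k) {
    return std::vector<float>(static_cast<std::size_t>(k) * k, 1.0f / (k * k));
}
```

Calling convolve2d(image, w, h, boxKernel(7), 7) gives a 7x7 mean blur. For a symmetric kernel like this one the flip makes no difference; it only matters for asymmetric kernels such as Sobel.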

Many thanks :)

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 22, 2017, 11:38:28 AM

I have never done convolution on images, but we could give it a try.
First i will make a stripped down version of the transposing and flipping stuff then it will be easier for you to translate it to RosAsm.
Later we could try to get the convolution stuff working. Could be a nice project. :t

Title: Re: Fast Matrix Flip
Post by: guga on April 22, 2017, 01:19:19 PM

many thanks, marinus :t

Stripping down the matrix functions will be handy, especially because they can work with any size of matrices. I found some code that can be used for convolution and will start studying it for us.

One thing: since the matrices will be used for image and video processing, we cannot forget to include the pitch (stride) in the functions. Basically, from what I understood in VDub, the pitch is just a leftover added to the width of an image, used for alignment in memory. So the image width seems to be, in fact, Width+Pitch, where the pitch contains only null data. From the code we did so far, adding a pitch argument to the functions won't change the speed, since we only need to include one instruction and multiply it by the height to find the 1st scanline when going from bottom to top. Or, instead of multiplying, it may be added to the width if we are going from top to bottom. Probably something like this:

mov eax D@Width | add eax D@Pitch | shl eax 2 | mov D@NextScanLine eax

This way, the user can work with the function regardless of whether the data has a pitch value (video) or not (images or simple matrix calculation). So, if he is working with an image/simple matrix, he sets Pitch = 0. Otherwise he inputs the pitch value. Something that may end up like this:

RosAsm syntax
call Matrix_FlipY Input, Output, D$Width, D$height, D$Pitch

masm syntax
invoke Matrix_FlipY, offset Input, offset Output, Width, height, Pitch

Where:
Input = Pointer to a buffer containing the input data as an array of Dwords (which also works for floats).
Output = Pointer to a buffer to hold the resulting data (also an array of Dwords or floats).
Width = Width of the array/image/video (Dword)
Height = Height of the array/image/video (Dword)
Pitch = Pitch of the video/image (Dword). 0 if there is no pitch; otherwise the pitch value to be taken into account.
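
To make the parameter list above concrete, here is a small C++ sketch of a vertical flip with that signature. It is only an illustration, not the RosAsm routine: matrixFlipY is a made-up name, and the pitch is interpreted the way this post describes it (extra dwords of padding after each row, 0 when there is none), which is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Flip the rows of a dword matrix top-to-bottom.
// Assumption: `pitch` is the number of padding dwords after each row,
// so one scanline occupies (width + pitch) dwords in both buffers.
// Input and output must not overlap.
void matrixFlipY(const std::uint32_t* input, std::uint32_t* output,
                 int width, int height, int pitch) {
    assert(input != nullptr && output != nullptr);
    assert(width > 0 && height > 0 && pitch >= 0);
    const std::size_t stride =
        static_cast<std::size_t>(width) + static_cast<std::size_t>(pitch);
    const std::size_t rowBytes = static_cast<std::size_t>(width) * sizeof(std::uint32_t);
    for (int y = 0; y < height; ++y) {
        // Row y of the output comes from row (height - 1 - y) of the input.
        const std::uint32_t* src = input + static_cast<std::size_t>(height - 1 - y) * stride;
        std::memcpy(output + static_cast<std::size_t>(y) * stride, src, rowBytes);
    }
}
```

The stride is computed once outside the loop, which matches the idea of keeping the pitch handling out of the inner loop. Padding dwords in the output are left untouched.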

Not sure yet if this is correct, but it seems it may be used like that outside the loop (so we gain speed).
I believe the functionality of pitch in VDub is the same as described by M$.
https://msdn.microsoft.com/en-us/library/windows/desktop/aa473780(v=vs.85).aspx

For example, to copy an image with VDub (using pitch), I can do this:

Code Select


Proc CopyImageBuffer:
    Arguments @Input, @Output, @Width, @Height, @Pitch
    Local @CurYPos
    Uses eax, ebx, ecx, edx, edi

    mov eax D@Height
    xor ebx ebx
    mov D@CurYPos eax
    .Do
        mov eax D@Output | add eax ebx
        mov ecx D@Input | add ecx ebx
        mov edx D@Width
        Do
            mov edi D$ecx
            add ecx 4
            mov D$eax edi
            add eax 4
            dec edx
        Loop_Until edx = 0
        add ebx D@Pitch
        dec D@CurYPos
    .Loop_Until D@CurYPos = 0

EndP
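
For comparison, the same row-by-row copy can be sketched in C++. This is a hedged illustration, not VDub code: copyImage32 is a made-up name, the pitch is taken to be the byte distance between the starts of consecutive scanlines, and allowing separate source and destination pitches is an extra assumption (the routine above uses one pitch for both).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy `height` rows of `width` 32-bit pixels.
// srcPitch / dstPitch: distance in bytes from the start of one row to the next.
// A pitch may be negative, in which case the pointer walks backwards through memory.
void copyImage32(const void* src, std::ptrdiff_t srcPitch,
                 void* dst, std::ptrdiff_t dstPitch,
                 int width, int height) {
    assert(src != nullptr && dst != nullptr);
    assert(width > 0 && height > 0);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const std::size_t rowBytes = static_cast<std::size_t>(width) * 4;
    for (int y = 0; y < height; ++y) {
        std::memcpy(d, s, rowBytes);  // copy only the visible pixels, skip the padding
        s += srcPitch;
        d += dstPitch;
    }
}
```

Only width*4 bytes are copied per row, so any padding at the end of a scanline is never read or written.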

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 22, 2017, 01:56:53 PM

Stride/pitch will be dealt with in the image loader. I'm familiar with it.

Title: Re: Fast Matrix Flip
Post by: guga on April 22, 2017, 03:18:03 PM

Wonderful :) So we don't actually need to handle pitch :)

I'm testing Matrix_FlipY in VDub (without pitch) and it works like a charm :icon_mrgreen: I'll post a screenshot once I finish these tests in VDub.

Title: Re: Fast Matrix Flip
Post by: guga on April 22, 2017, 04:00:01 PM

Hi Marinus.

Here are some screenshots of a video plugin I tested using the different matrix functions. Working like a charm :t

VDub, in fact, handles the pitch as you said. And it seems that the pitch is, in fact, the width*4 that I labeled in the functions as "NextScanLine".

Title: Re: Fast Matrix Flip
Post by: FORTRANS on April 22, 2017, 10:59:01 PM

Hi,

I have done some "convolution" code for image processing.
Both in assembly and FORTRAN. Mostly in the early 1990's.
Code was for Sobel edge detection, sharpening, blurring,
deinterlacing of TV image capture, autocorrelation (image
repair), and so forth. So it can't be too difficult.

Also tried FFT image processing about that time. Trying out
bandpass filtering. A bit more difficult. And probably less
successful.

Cheers,

Steve N.

Title: Re: Fast Matrix Flip
Post by: guga on April 22, 2017, 11:04:02 PM

Hi Steve, do you still have the autocorrelation functions for image repairing ? Concerning FFT, I ported the algo to assembly, but it is not optimized yet and I have not tested it on image processing. Dunno if it will work as expected.

Title: Re: Fast Matrix Flip
Post by: FORTRANS on April 23, 2017, 11:01:49 PM

Hi,

Quote from: guga on April 22, 2017, 11:04:02 PM
Hi Steve, do you still have the autocorrelation functions for image repairing ?

Yes, I still have the code. The TV capture card I was using would
"tear" the last few lines of the image. The bottom of the image had
lines offset from where they should be. Run the autocorrelation on
the last good line and the first bad one to find out how far to shift it.
Then repeat for the next lines. Sort of worked, but too much tweaking
by hand was needed. Haven't looked at it since then. Don't think I
have made any TV image captures either.

Regards,

Steve N.

Title: Re: Fast Matrix Flip
Post by: guga on April 24, 2017, 12:13:01 AM

If you have time, can you post the code here so we can analyse it?

I would like to see how it works in practice. I have some PDFs explaining image reconstruction, but I am clueless about how to write proper code for it. It seems way too complex for my head :greenclp:

it is a sort of inpaint algorithm, right ?

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 24, 2017, 12:54:48 AM

Hi guga,

I made a start, is this what you want?

Title: Re: Fast Matrix Flip
Post by: guga on April 24, 2017, 01:45:55 AM

Hi marinus.

yes :) That´s it. Many thanks :)

About transposing: when you transpose, width becomes height and vice versa, right? That would explain why my version got crossing lines all over :greensml:

Not sure what I did wrong. Maybe exchanging width x height will solve it.

The pope with an alien was simply hilarious :greenclp: :greenclp: :greenclp:

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 24, 2017, 02:01:03 AM

:biggrin:

Yes, if you have different row and column sizes they need to be switched or else your image is distorted.
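
A tiny C++ sketch of that point, assuming a tightly packed row-major matrix (the function name transpose is just for illustration): the destination index uses the source height as its row length, which is exactly the width/height switch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Transpose a row-major matrix that is `width` columns by `height` rows.
// The result is row-major with `height` columns and `width` rows.
std::vector<float> transpose(const std::vector<float>& src, int width, int height) {
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Element (x, y) moves to (y, x); the new row length is `height`.
            dst[static_cast<std::size_t>(x) * height + y] =
                src[static_cast<std::size_t>(y) * width + x];
        }
    }
    return dst;
}
```

If the destination is indexed with the old width instead, rows wrap at the wrong place, which shows up as the diagonal "crossing lines" effect on non-square images.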

This week i'll write an image saver. Is the PNG format OK?
Because it is lossless, the pixel data will be the same as the original.
When ready i'll post all the source code.

How fast are the Matrix manipulations "Timing Result in milliseconds" on your PC?
I'm very curious, bet they are much faster than the CPU code we did.

Title: Re: Fast Matrix Flip
Post by: guga on April 24, 2017, 02:33:33 AM

On My I7 it is:

0074246 for transposing but the counter keeps changing fast and varies, so it is hard to tell the exact speed. 006xxx to 009xxx for FlipX and FlipY, FLipX-Y

I'll try the filters I'm testing directly in the VDub plugin. I don't know yet how to rotate (transpose) the video in VDub. I tried simply exchanging width x height and changing the copied buffer, but it crashes, probably because VDub is trying to keep the original ratio of the video.

Title: Re: Fast Matrix Flip
Post by: guga on April 24, 2017, 03:03:22 AM

Found it. In VDub it seems that the transposing can be done through the structure VDXPixmapLayout. I have never used that before. I'll try today to see if I can activate it and use the transposing algorithm we are making with it.

Title: Re: Fast Matrix Flip
Post by: FORTRANS on April 24, 2017, 11:12:29 PM

Hi,

Quote from: guga on April 24, 2017, 12:13:01 AM
If you have time, can you post the code here so we can analyse it?

Actually there were at least three versions. This is the oldest and
simplest version. As I did not comment at the time (1995) on what
the changes were for, why confuse things. It also means that this
may not be a working example.

Code Select

      PROGRAM ALIGNV
C
C     Align the lower scan lines in an image file.  Stuff digitized from my VCR
C  has poor scans at the bottom of the picture.
C  SRN, 16 October 1994
C
C     LSTART = Number of lines to skip before processing.
C     MSHIFT = Maximum shift to look for.
C     NSHIFT = 3 x Maximum shift to look for.
C     LEN    = 3 x picture line length = RGB line length.
C
      PARAMETER ( LEN=684*3, MSHIFT=24, NSHIFT=72, LSTART = 450 )
      CHARACTER  CHAR1*(LEN), CHAR2*(LEN), CHAR3*(LEN)
C
      OPEN (1,FILE='FRAME.RAW',FORM='BINARY')
      OPEN (2,FILE='OUT.RAW',FORM='BINARY')
C
      MINDIF = 0
      MAXDIF = 0
C
        DO 10 LINE=1,LSTART
        READ (1,END=91)  CHAR1
        WRITE (2)  CHAR1
   10   CONTINUE
      LINE = LSTART
C
   15 CONTINUE
      LINE = LINE + 1
      READ (1,END=92)  CHAR2
      CALL AUTO ( CHAR1, CHAR2, K )
      MINDIF = MAX( MINDIF, K )
      MAXDIF = MAX( MAXDIF, K )
      JSTART = MAX( 1, 1+K*3 )
      JEND = MIN ( LEN, LEN+K*3 )
        DO 20 J=JSTART,JEND
        CHAR3(J:J) = CHAR2(J+K*3:J+K*3)
   20   CONTINUE
C
        DO 30 J=1,JSTART-1
        CHAR3(J:J) = CHAR2(J:J)
   30   CONTINUE
        DO 40 J=JEND+1,480
        CHAR3(J:J) = CHAR2(J:J)
   40   CONTINUE
C
      WRITE (2)  CHAR3
      WRITE (*,'(1X,2I5)')  LINE, K
      CHAR1 = CHAR3
      GO TO 15
C
   91 PRINT *, 'PREMATURE END ON FRAME.RAW, LINE', LINE
      GO TO 94
C
   92 PRINT *, 'END ON FRAME.RAW, LINE', LINE
      GO TO 94
C
   94 CONTINUE
      WRITE (*,*) MINDIF, MAXDIF
      STOP
      END
      SUBROUTINE AUTO ( CHAR1, CHAR2, K )
      PARAMETER ( LEN=684*3, MSHIFT=24, NSHIFT=72, LSTART = 450 )
      CHARACTER  CHAR1*(LEN), CHAR2*(LEN)
      INTEGER  INT1(LEN), INT2(LEN), ADIF(-MSHIFT:MSHIFT)
C
        DO 10 I=1,LEN
        INT1(I) = ICHAR( CHAR1(I:I) )
        INT2(I) = ICHAR( CHAR2(I:I) )
   10   CONTINUE
C
        DO 20 I= -MSHIFT,MSHIFT
        ADIF(I) = 0
   20   CONTINUE
C
        DO 30 I=NSHIFT+1,LEN-NSHIFT,3
          DO 30 J = -MSHIFT,MSHIFT
          ADIF(J) = ADIF(J) + ABS( INT1(I)-INT2(I+J) )
          ADIF(J) = ADIF(J) + ABS( INT1(I+1)-INT2(I+J+1) )
   30     ADIF(J) = ADIF(J) + ABS( INT1(I+2)-INT2(I+J+2) )
C
      MINDIF = ABS( ADIF(-MSHIFT) )
      K = -MSHIFT
        DO 40 I= -MSHIFT,MSHIFT
        IF ( ABS( ADIF(I) ) .LT. MINDIF )  THEN
          K = I
          MINDIF = ABS( ADIF(I) )
        END IF
   40   CONTINUE
*     K = -K
      WRITE (*,100)  (I,ADIF(I),I=-MSHIFT,MSHIFT)
      RETURN
  100 FORMAT ( 1X, 49(I3,I7) )
      END

This looks like an error. But it isn't used. Change MAX() to MIN().

Code Select

      MINDIF = MAX( MINDIF, K )

I hope you find it interesting if not useful as is. This looks like
monochrome processing. The later versions look RGBish.
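
For anyone who does not read FORTRAN, the core idea of the AUTO subroutine (try every shift, keep the one with the smallest sum of absolute differences) can be sketched in C++ as below. This is a simplified single-channel version under stated assumptions, not a translation: bestShift is a made-up name, and the border handling (skipping maxShift samples at each end) is chosen for simplicity.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Return the shift k in [-maxShift, maxShift] that minimizes
// sum |ref[i] - line[i + k]| over the samples where every shift stays in range.
int bestShift(const std::vector<std::uint8_t>& ref,
              const std::vector<std::uint8_t>& line, int maxShift) {
    assert(ref.size() == line.size());
    const int n = static_cast<int>(ref.size());
    assert(maxShift >= 0 && n > 2 * maxShift);
    int best = -maxShift;
    long bestSad = -1;  // -1 means "no candidate yet"
    for (int k = -maxShift; k <= maxShift; ++k) {
        long sad = 0;
        for (int i = maxShift; i < n - maxShift; ++i) {
            sad += std::abs(static_cast<int>(ref[i]) - static_cast<int>(line[i + k]));
        }
        if (bestSad < 0 || sad < bestSad) {
            bestSad = sad;
            best = k;
        }
    }
    return best;
}
```

The returned shift would then be used to move the bad line back into place before comparing the next line against it, as the program above does with K.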

Quote
it is a sort of inpaint algorithm, right ?

Not sure what you mean.

Cheers,

Steve N.

Title: Re: Fast Matrix Flip
Post by: guga on April 24, 2017, 11:44:27 PM

Many thanks, Steve. I'm taking a look and trying to understand it.

Inpainting techniques let you do things like this:

https://en.wikipedia.org/wiki/Inpainting

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 25, 2017, 10:40:48 AM

Here are the sources as promised released under the SHARE & ENJOY license. :biggrin:

This is an example how to use Direct3D9 in 2D mode without all the fancy 3D stuff, for very fast image manipulations.

The matrix calculations are done by the video device.
The results can be saved as images ( all the GDIplus formats ) and the raw bitmap data can be read from memory.
I made a comment in the image saver routine.

EDIT: I made a mistake in where I put the comment in "2D_Image_loader_saver.Asm" for fetching the raw bitmap data.
Uploaded a new zip file with the comment at the correct spot in the source code.

Title: Re: Fast Matrix Flip
Post by: avcaballero on April 25, 2017, 06:22:55 PM

Hello. I get an error when saving as PNG.

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 25, 2017, 10:26:41 PM

Hi caballero,

And it doesn't save the MatrixImage.png i assume?
Can you trace down where exactly in the code the error occurs?

Title: Re: Fast Matrix Flip
Post by: guga on April 25, 2017, 10:45:31 PM

Hi Marinus

Thank you :) It works like a gem here :t

Quote
Here are the sources as promised released under the SHARE & ENJOY license.

:greensml: :greensml: :greensml: :greensml:

One thing only: I'm not sure using DirectX is faster than the direct manipulation of the pixels as we were doing before. The functions will be used for video manipulation. For that I'm using VirtualDub, making those functions part of a plugin to test the speed. I believe VDub can handle D3D9, but I don't know how to set it up. What I know from VDub is that the functions related to DirectX manipulation are these:

Code Select



class VDFilterAccelEngine;

class VDFilterAccelContext : public IVDXAContext {
public:
	VDFilterAccelContext();
	~VDFilterAccelContext();

	int VDXAPIENTRY AddRef();
	int VDXAPIENTRY Release();
	void *VDXAPIENTRY AsInterface(uint32 iid);

	bool Init(VDFilterAccelEngine& eng);
	void Shutdown();

	bool Restore();

	uint32 RegisterRenderTarget(IVDTSurface *surf, uint32 rw, uint32 rh, uint32 bw, uint32 bh);
	uint32 RegisterTexture(IVDTTexture2D *tex, uint32 imageW, uint32 imageH);

	uint32 VDXAPIENTRY CreateTexture2D(uint32 width, uint32 height, uint32 mipCount, VDXAFormat format, bool wrap, const VDXAInitData2D *initData);
	uint32 VDXAPIENTRY CreateRenderTexture(uint32 width, uint32 height, uint32 borderWidth, uint32 borderHeight, VDXAFormat format, bool wrap);
	uint32 VDXAPIENTRY CreateFragmentProgram(VDXAProgramFormat programFormat, const void *data, uint32 length);
	void VDXAPIENTRY DestroyObject(uint32 handle);

	void VDXAPIENTRY GetTextureDesc(uint32 handle, VDXATextureDesc& desc);

	void VDXAPIENTRY SetTextureMatrix(uint32 coordIndex, uint32 textureHandle, float xoffset, float yoffset, const float uvMatrix[12]);
	void VDXAPIENTRY SetTextureMatrixDual(uint32 coordIndex, uint32 textureHandle, float xoffset, float yoffset, float xoffset2, float yoffset2);
	void VDXAPIENTRY SetSampler(uint32 samplerIndex, uint32 textureHandle, VDXAFilterMode filter);
	void VDXAPIENTRY SetFragmentProgramConstF(uint32 startIndex, uint32 count, const float *data);
	void VDXAPIENTRY DrawRect(uint32 renderTargetHandle, uint32 fragmentProgram, const VDXRect *destRect);
	void VDXAPIENTRY FillRects(uint32 renderTargetHandle, uint32 rectCount, const VDXRect *rects, uint32 colorARGB);

protected:
	enum {
		kHTFragmentProgram	= 0x00010000,
		kHTRenderTarget		= 0x00020000,
		kHTTexture			= 0x00030000,
		kHTRenderTexture	= 0x00040000,
		kHTTypeMask			= 0xFFFF0000
	};

	struct HandleEntry {
		uint32	mFullHandle;
		IVDTResource *mpObject;

		uint32	mImageW;
		uint32	mImageH;
		uint32	mSurfaceW;
		uint32	mSurfaceH;
		uint32	mRenderBorderW;
		uint32	mRenderBorderH;
		bool	mbWrap;
	};

	uint32 AllocHandle(IVDTResource *obj, uint32 handleType);
	HandleEntry *AllocHandleEntry(uint32 handleType);

	IVDTResource *DecodeHandle(uint32 handle, uint32 handleType) const;
	const HandleEntry *DecodeHandleEntry(uint32 handle, uint32 handleType) const;

	void ReportLogicError(const char *msg);

	IVDTContext *mpParent;
	VDFilterAccelEngine *mpEngine;

	typedef vdfastvector<HandleEntry> Handles;
	Handles mHandles;
	uint32	mNextFreeHandle;

	bool	mbErrorState;

	float	mUVTransforms[8][12];

	VDAtomicInt	mRefCount;
};

#endif	// f_VD2_FILTERACCELCONTEXT_H

The VDXAPIENTRY methods seem to be used to set up the D3D9 DLL, but I have no idea how to make it work yet.

There is an example of an internal plugin inside VDub itself called invert which, as the name suggests, inverts the colors of a video. The plugin itself is fairly fast, but I didn't compare it with other versions because it uses another way to set up and initialize the plugin engine that is a bit harder to understand. (Damn C++ classes) :greensml: :greensml: :greensml:

The full plugin is written like this:

Code Select



namespace {
#ifdef _M_IX86
	void __declspec(naked) VDInvertRect32(uint32 *data, long w, long h, ptrdiff_t pitch) {
		__asm {
			push	ebp
			push	edi
			push	esi
			push	ebx

			mov		edi,[esp+4+16]
			mov		edx,[esp+8+16]
			mov		ecx,[esp+12+16]
			mov		esi,[esp+16+16]
			mov		eax,edx
			xor		edx,-1
			shl		eax,2
			inc		edx
			add		edi,eax
			test	edx,1
			jz		yloop
			sub		edi,4
	yloop:
			mov		ebp,edx
			inc		ebp
			sar		ebp,1
			jz		zero
	xloop:
			mov		eax,[edi+ebp*8  ]
			mov		ebx,[edi+ebp*8+4]
			xor		eax,-1
			xor		ebx,-1
			mov		[edi+ebp*8  ],eax
			mov		[edi+ebp*8+4],ebx
			inc		ebp
			jne		xloop
	zero:
			test	edx,1
			jz		notodd
			not		dword ptr [edi]
	notodd:
			add		edi,esi
			dec		ecx
			jne		yloop

			pop		ebx
			pop		esi
			pop		edi
			pop		ebp
			ret
		};
	}
#else
	void VDInvertRect32(uint32 *data, long w, long h, ptrdiff_t pitch) {
		pitch -= 4*w;

		do {
			long wt = w;
			do {
				*data = ~*data;
				++data;
			} while(--wt);

			data = (uint32 *)((char *)data + pitch);
		} while(--h);
	}
#endif
}

///////////////////////////////////////////////////////////////////////////////

class VDVideoFilterInvert : public VDXVideoFilter {
public:
	VDVideoFilterInvert();

	uint32 GetParams();
	void Run();

	void StartAccel(IVDXAContext *vdxa);
	void RunAccel(IVDXAContext *vdxa);
	void StopAccel(IVDXAContext *vdxa);

protected:
	uint32 mAccelFP;
};

VDVideoFilterInvert::VDVideoFilterInvert()
	: mAccelFP(0)
{
}

uint32 VDVideoFilterInvert::GetParams() {
	const VDXPixmapLayout& pxlsrc = *fa->src.mpPixmapLayout;
	VDXPixmapLayout& pxldst = *fa->dst.mpPixmapLayout;

	switch(pxlsrc.format) {
		case nsVDXPixmap::kPixFormat_XRGB8888:
			pxldst.pitch = pxlsrc.pitch;
			return FILTERPARAM_SUPPORTS_ALTFORMATS | FILTERPARAM_PURE_TRANSFORM;

		case nsVDXPixmap::kPixFormat_VDXA_RGB:
		case nsVDXPixmap::kPixFormat_VDXA_YUV:
			return FILTERPARAM_SWAP_BUFFERS | FILTERPARAM_SUPPORTS_ALTFORMATS | FILTERPARAM_PURE_TRANSFORM;

		default:
			return FILTERPARAM_NOT_SUPPORTED;
	}
}

void VDVideoFilterInvert::Run() {
	VDInvertRect32(
			fa->src.data,
			fa->src.w,
			fa->src.h,
			fa->src.pitch
			);
}

void VDVideoFilterInvert::StartAccel(IVDXAContext *vdxa) {
	mAccelFP = vdxa->CreateFragmentProgram(kVDXAPF_D3D9ByteCodePS20, kVDFilterInvertPS, sizeof kVDFilterInvertPS);
}

void VDVideoFilterInvert::RunAccel(IVDXAContext *vdxa) {
	vdxa->SetTextureMatrix(0, fa->src.mVDXAHandle, 0, 0, NULL);
	vdxa->SetSampler(0, fa->src.mVDXAHandle, kVDXAFilt_Point);
	vdxa->DrawRect(fa->dst.mVDXAHandle, mAccelFP, NULL);
}

void VDVideoFilterInvert::StopAccel(IVDXAContext *vdxa) {
	if (mAccelFP) {
		vdxa->DestroyObject(mAccelFP);
		mAccelFP = 0;
	}
}

///////////////////////////////////////////////////////////////////////////////

extern const VDXFilterDefinition g_VDVFInvert = VDXVideoFilterDefinition<VDVideoFilterInvert>(
		NULL,
		"invert",
		"Inverts the colors in the image.\n\n[Assembly optimized]");

#ifdef _MSC_VER
	#pragma warning(disable: 4505)	// warning C4505: 'VDXVideoFilter::[thunk]: __thiscall VDXVideoFilter::`vcall'{48,{flat}}' }'' : unreferenced local function has been removed
#endif

I'll try to port this to assembly to run proper tests of DX video manipulation in VDub, but I'm not sure I'll succeed, or whether it is in fact faster than the direct pixel manipulation we were doing before. (I believe it cannot be faster, because we need to take into account all the internal functions used to access DirectX itself.)

One good thing is that I finally succeeded in making VDub change the layout with the other functions. Now all that is missing is to see if it works with the Matrix_transpose function :)

Title: Re: Fast Matrix Flip
Post by: avcaballero on April 25, 2017, 11:17:24 PM

Quote from: Siekmanski on April 25, 2017, 10:26:41 PM
Hi caballero,

And it doesn't save the MatrixImage.png i assume?
Can you trace down where exactly in the code the error occurs?

MatrixImage.png is created, but with 0 bytes. We have already seen that my computer is a bit odd with gdip, so don't worry. Nevertheless, here are some captures from debugging. The flow stops when executing "GdipSaveImageToFile" with "F8", so maybe the problem is there.

Regards

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 26, 2017, 12:38:00 AM

Hi caballero,

Phewww...., so we can blame GDIplus for the error.

Hi guga,

I don't know what VDub is, is it freeware ?
It seems to use directX9 in some way.

I suppose your goal is to manipulate color values and color positions in an image, am i right ?
This is a perfect job for the video device instead of only the CPU.

For example transposing a 720 by 480 32bit image via CPU:

1 - setup your program code.
2 - load the image data to system memory
3 - the calculation loop:

a - read an array of 1382400 bytes
b - calculate the new positions in the array ( the transpose matrix routine )
c - write 1382400 bytes
d - write 1382400 bytes to the correct position in video memory ( this is slow )
e - present the new image to the screen

For example transposing a 720 by 480 32bit image via GPU (via Direct3D9):

1 - setup your program code.
2 - load the image data to video memory
3 - the calculation loop:

a - calculate the new 4 * X,Y positions for the screen positions, and the new 4 * X,Y for the image corner coordinates.
b - write the new calculated 64 bytes to the video device ( the transpose matrix routine )
c - present the new image to the screen.

I think you will agree, that the second method is much much much faster.
No transfers between system and video memory, and only 16 values to calculate for the whole transpose matrix routine.
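
The byte counts in the two lists can be checked with a few lines of C++. This is only the arithmetic from the post; the function names are made up, and the quad layout (4 corners, each with a screen X,Y and an image X,Y stored as 32-bit floats) is an assumption that happens to reproduce the 16 values and 64 bytes mentioned above.

```cpp
#include <cstddef>

// Bytes in a tightly packed 32-bit image (no row padding assumed).
constexpr std::size_t imageBytes(std::size_t width, std::size_t height) {
    return width * height * 4;
}

// CPU path, steps a, c and d: read the image, write the result,
// then write it again to video memory.
constexpr std::size_t cpuBytesPerFrame(std::size_t width, std::size_t height) {
    return 3 * imageBytes(width, height);
}

// GPU path: 4 corners, each with a screen X,Y and an image X,Y,
// stored as 4-byte floats (assumed layout).
constexpr std::size_t quadValues() { return 4 * (2 + 2); }
constexpr std::size_t quadBytes() { return quadValues() * 4; }

static_assert(imageBytes(720, 480) == 1382400, "720 x 480 x 4 bytes");
static_assert(quadBytes() == 64, "16 values x 4 bytes");
```

So per frame the CPU path touches about 4.1 million bytes, while the GPU path only sends 64 bytes of corner data, as long as the image already lives in video memory.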

0.005 milliseconds for transposing a 32bit 720 by 480 pixels image, try to beat that with CPU coding.

For fast image manipulations try to avoid data transfers between system and video memory because they are slow.
So, better do all the image manipulations etc. by using the video device itself if possible.

Title: Re: Fast Matrix Flip
Post by: guga on April 26, 2017, 01:08:11 AM

VDub is free and open source (I mean in the sense that the sources are released, regardless of the license itself :icon_mrgreen:).

It is a simple and very powerful video editing tool, although the plugins are a bit hard to configure. It was originally made more than a decade ago, but it is used as an alternative to professional video editors such as Sony Vegas, for example.

http://www.virtualdub.org
https://sourceforge.net/projects/virtualdub/?source=top3_dlp_t5

For example, there is a university in Russia that makes incredible plugins for it, such as a subtitle remover, motion estimation, noise remover, TV commercial detection, video stabilizer, etc. Their plugins can also be found here: http://www.compression.ru/video/video_codecs.htm

Other places where people have published plugins for it (with or without source) can be found here:

http://www.guthspot.se/video/deshaker.htm
http://avisynth.nl/users/fizick/fizick.html
https://forum.doom9.org/
https://forum.videohelp.com/threads/281594-Avisynth-Virtualdub

Some tutorials on YouTube explain several kinds of plugins as well. One of those that I like is:
https://www.youtube.com/watch?v=6QRJZpOrX0s

Quote
I suppose your goal is to manipulate color values and color positions in an image, am i right ?

Yeah... it is for image and video manipulation. I'm currently trying to understand and create those matrix functions in order to create a plugin (or app/dll) that is a variation of the PHash algorithm, which is used to compare images (either from video or pure images). The PHash algo is a sort of image signature and the field of application is huge: similar to what Google and YouTube use for their image searching tools (I don't know if Google uses a sort of PHash algorithm, but it probably does), or for object removal in a video or image, face detection, motion estimation, tracking, image/video reconstruction, etc. PHash can also be used for audio recognition.

Rebuilding PHash is the 1st step I can test to create a plugin for scene detection in videos. Currently I have a plugin for VDub that can detect scenes in a video. The only problem is that the accuracy is limited to hard cut scenes; for transitions (fades, etc.) the algorithm I'm using fails. Basically, a scene can be detected by comparing the difference between 2 frames. The difference is obtained by calculating the minimum standard deviation of the light/luma values of one frame and the other. So we compute the STD on each frame and simply subtract one from the other (with xor, to amplify the differences, and not a simple sub). 2 frames are completely different from each other when the difference of the minimum STD between them is positive, but when we deal with soft cut scenes the algo fails, and that is where I'll try to use PHash. PHash uses matrix manipulation internally, taken from the CImg library that it uses for loading the images to be compared.
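
The description above leaves some details open, so the following C++ sketch is only one possible reading of it, not the plugin's actual code: XOR the luma bytes of two frames, take the standard deviation of the result, and call it a hard cut when that value exceeds a threshold. The names xorStdDev and isHardCut and the threshold parameter are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Standard deviation of the per-pixel XOR of two luma frames of equal size.
double xorStdDev(const std::vector<std::uint8_t>& a,
                 const std::vector<std::uint8_t>& b) {
    assert(a.size() == b.size() && !a.empty());
    double sum = 0.0;
    double sumSq = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i] ^ b[i]);
        sum += d;
        sumSq += d * d;
    }
    const double n = static_cast<double>(a.size());
    const double mean = sum / n;
    double variance = sumSq / n - mean * mean;
    if (variance < 0.0) variance = 0.0;  // guard against rounding error
    return std::sqrt(variance);
}

// Hypothetical decision rule: the threshold has to be tuned on real footage.
bool isHardCut(const std::vector<std::uint8_t>& a,
               const std::vector<std::uint8_t>& b, double threshold) {
    return xorStdDev(a, b) > threshold;
}
```

Identical frames give 0. A slow fade changes every pixel by a small, similar amount, so the spread stays low from frame to frame, which fits the reported weakness on soft cuts.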

I believe PHash can be used as a replacement for the scene detection algorithm I'm currently creating. The advantage of using PHash is that we would not be limited to scene detection. A wide range of things can be done with this algorithm (for video and image processing, and also for audio).

PHash can be found here: http://phash.org But, as I said before, it uses the crappy CImg library and, in the end, it is incredibly slow compared to what we are doing. It is impossible to use the current version of PHash to identify a full video, for example. It would take hours to complete, whereas if we simplify the algo it could be processed completely in a matter of minutes, with whatever other image library you wish. This way you are not forced to use CImg all the time.

Quote
0.005 milliseconds for transposing a 32bit 720 by 480 pixels image, try to beat that with CPU coding.

Yes... that's fast, but I wonder if the performance is the same when using it for videos. If it is faster than what we are doing, then it's OK for us to use :) But I didn't measure it. The timings I had for your previous algo were in nanoseconds (273 nanosecs on average was the timing I got on your previous function, about 0.000273 milliseconds).

I don't know how to separately measure the matrix manipulation in DX to see whether it beats your previous work, but the DX function used to manipulate the matrix probably can't beat it. I mean, if you take the DX function responsible for matrix manipulation (transpose, flip, etc.) in isolation and compare it to the function you did, I doubt DX is faster than yours.

Title: Re: Fast Matrix Flip
Post by: guga on April 26, 2017, 02:24:17 AM

I'm not sure if it was clear, but basically what I'm trying to do is:

a) the user accesses the pixel data with whatever method he wants (DX, LoadBitmap, GDIPlus, etc.)

Once he gets the pointers to the image pixels and knows the width and height (and perhaps the pitch, in the case of videos), he simply does this:

b) uses the pointers to create the PHash of an image, to be compared with another one he already loaded

The problem lies in PHash, which loads the image with that crappy CImg library, which uses matrix transposition internally and reacts according to the width and height. But, in fact, we don't need PHash to load the image. We only need the minimum necessary (the true algorithm) used to create the hash, plus a few functions to manipulate the matrices (the pixels we previously loaded) so that we can create a convolution function to retrieve the hash. So how the image is loaded, to pass the pixel data pointer to the PHash algo, is up to the user.

Sure, once we create the matrix manipulation functions, they can be used elsewhere, with PHash or not, but since we need a minimum of matrix manipulation for PHash, it is worth creating them.

Phash works like this:

Code Select


int ph_dct_imagehash(const char* file,ulong64 &hash){

    if (!file){
        return -1;
    }
    CImg<uint8_t> src;
    try {
        src.load(file);
    } catch (CImgIOException ex){
        return -1;
    }
    CImg<float> meanfilter(7,7,1,1,1);
    CImg<float> img;
    if (src.spectrum() == 3){
        img = src.RGBtoYCbCr().channel(0).get_convolve(meanfilter);
    } else if (src.spectrum() == 4){
        int width = img.width();
        int height = img.height();
        int depth = img.depth();
        img = src.crop(0,0,0,0,width-1,height-1,depth-1,2).RGBtoYCbCr().channel(0).get_convolve(meanfilter);
    } else {
        img = src.channel(0).get_convolve(meanfilter);
    }

    img.resize(32,32);
    CImg<float> *C  = ph_dct_matrix(32);
    CImg<float> Ctransp = C->get_transpose();

    CImg<float> dctImage = (*C)*img*Ctransp;

    CImg<float> subsec = dctImage.crop(1,1,8,8).unroll('x');;

    float median = subsec.median();
    ulong64 one = 0x0000000000000001;
    hash = 0x0000000000000000;
    for (int i=0;i< 64;i++){
        float current = subsec(i);
        if (current > median)
            hash |= one;
        one = one << 1;
    }

    delete C;

    return 0;
}

We don't actually need all of the CImg crap. All we need from it is the minimum set of matrix manipulation functions and the convolution, working directly on the pixel data we already have (because we already got the pixel data with whatever other method we chose). And, to make things a bit easier, we don't even need RGBtoYCbCr, because PHash uses only luma (Y) and it is up to the user to choose whatever method he wants to retrieve the luma values. Probably all we need is the pointers to the pixel data already converted to luma.
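
As a rough illustration of how little is needed once the luma data is available, here is a C++ sketch of the hashing stage without CImg. It assumes the caller has already produced a filtered 32x32 luma matrix (row-major floats). The names dctMatrix, multiply, dctHash and hammingDistance are made up, the DCT is the standard orthonormal DCT-II, and the median is taken as the mean of the two middle values, so bit-for-bit agreement with pHash is not claimed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using Matrix = std::vector<float>;  // row-major n x n

// Orthonormal DCT-II matrix: row 0 is constant, row u >= 1 is a cosine of frequency u.
Matrix dctMatrix(int n) {
    const double pi = 3.14159265358979323846;
    Matrix c(static_cast<std::size_t>(n) * n);
    for (int u = 0; u < n; ++u)
        for (int x = 0; x < n; ++x)
            c[u * n + x] = static_cast<float>(
                u == 0 ? 1.0 / std::sqrt(static_cast<double>(n))
                       : std::sqrt(2.0 / n) * std::cos(pi * u * (2 * x + 1) / (2.0 * n)));
    return c;
}

// Returns a * b, or a * transpose(b) when transposeB is true.
Matrix multiply(const Matrix& a, const Matrix& b, int n, bool transposeB) {
    Matrix out(static_cast<std::size_t>(n) * n, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                out[i * n + j] += a[i * n + k] * (transposeB ? b[j * n + k] : b[k * n + j]);
    return out;
}

// 64-bit hash from the 8x8 block of low frequencies (rows and columns 1..8).
std::uint64_t dctHash(const Matrix& luma, int n = 32) {
    assert(n >= 9 && luma.size() == static_cast<std::size_t>(n) * n);
    const Matrix c = dctMatrix(n);
    const Matrix dct = multiply(multiply(c, luma, n, false), c, n, true);  // C * img * C^T
    std::vector<float> block;
    for (int y = 1; y <= 8; ++y)
        for (int x = 1; x <= 8; ++x)
            block.push_back(dct[y * n + x]);
    std::vector<float> sorted = block;
    std::sort(sorted.begin(), sorted.end());
    const float median = 0.5f * (sorted[31] + sorted[32]);  // assumed median definition
    std::uint64_t hash = 0;
    for (int i = 0; i < 64; ++i)
        if (block[i] > median) hash |= std::uint64_t(1) << i;
    return hash;
}

// Number of differing bits between two hashes.
int hammingDistance(std::uint64_t a, std::uint64_t b) {
    std::uint64_t x = a ^ b;
    int count = 0;
    while (x) { x &= x - 1; ++count; }
    return count;
}
```

Two frames would then be compared with hammingDistance(dctHash(lumaA), dctHash(lumaB)); a small distance suggests similar images. The resize to 32x32 and the mean filter are left to the caller here.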

Now... imagine this new PHash being used as a video plugin :icon_cool:... It could do amazing things in a faster and more reliable way.

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 26, 2017, 04:28:14 AM

Don't know for sure if i understand it well.

1. You need an image loaded and encoded to raw image data in memory ?
2. PHash ( don't know what it is, or does ) the raw data ?
3. Where comes the matrix and convolution stuff ?

You told me it is for video editing and making plugins to create video effects right ?
Am i right that you need to fetch each picture from a movie, let the effect doing its work and save it back in the movie ?

I have a little trouble understanding everything you want to achieve.

Title: Re: Fast Matrix Flip
Post by: guga on April 26, 2017, 06:19:44 AM

Let's do like Jack the Ripper and take it in parts :) :icon_mrgreen:

About VDub functionality:

Quote
1. You need an image loaded and encoded to raw image data in memory ?

Yes... I need to know the pointer to the pixels of the image that was loaded (with whatever method, API, etc.: DX, GdiPlus...)
This is the easier part, since VDub (which is used to edit videos) gives me direct access to the pixels of each frame. So I can know exactly the width, height and pitch of the image that belongs to a certain frame of a video.

This is done through a structure called VFBitmap, which I ported to Asm as:

Code Select


[VFBitmap:
 VFBitmap.pVBitmapFunctions: D$ 0 ; Pointer inside VDub (vtable). It is an array of offsets that points to the deprecated VBitmap functions
 VFBitmap.data: D$ 0    ; Pixel32 (data of the image).  Pointer to start of _bottom-most_ scanline of plane 0.
 VFBitmap.palette: D$ 0 ; Pointer to palette (reserved - set to NULL).
 VFBitmap.depth: D$ 0   ; image depth. Same as biBitCount in BitmapInfoHeader, but this is a dword
 VFBitmap.w: D$ 0       ; The width of the bitmap, in pixels. Same as in BitmapInfoHeader
 VFBitmap.h: D$ 0       ; The height of the bitmap, in pixels. Same as in BitmapInfoHeader
 VFBitmap.pitch: D$ 0   ; Distance, in bytes, from the start of one scanline in plane 0 to the next. ( Bitmaps can be stored top-down as well as bottom-up. The pitch value is positive if the image is stored bottom-up in memory and negative if the image is stored top-down.)
 VFBitmap.modulo: D$ 0  ; Distance, in bytes, from the end of one scanline in plane 0 to the start of the next. (This value is positive or zero if the image is stored bottom-up in memory and negative if the image is stored top-down. A value of zero indicates that the image is stored bottom-up with no padding between scanlines. For a 32-bit bitmap, modulo is equal to pitch -)
 VFBitmap.size: D$ 0    ; The size, in bytes, of the image. Size of plane 0, including padding. Same as in BITMAPINFOHEADER
 VFBitmap.offset: D$ 0  ; Offset from beginning of buffer to beginning of plane 0.
 VFBitmap.dwFlags: D$ 0 ; Set in paramProc if the filter requires a Win32 GDI display context for a bitmap.
                        ; (Deprecated as of API V12 - do not use) NEEDS_HDC  = 0x00000001L,
 VFBitmap.hdc: D$ 0]    ; A handle to a device context.

So, the member "data" of the VFBitmap structure points to the start of the pixels in memory. (In general, they are in RGB8888 format, which is easy to convert to RGBQUAD. I have already done this part.)

The code that retrieves the pixels in memory is already done (in this case VDub loads the video and grants me access to each frame containing the pixels to be manipulated).

This is the easier part.

In VDub, images are passed in and out of video filters through the VFBitmap structure. Each VFBitmap represents an image as follows:
(http://i65.tinypic.com/10engk2.jpg)
The image is stored as a series of sequential scanlines, where each scanline consists of a series of pixels.

Since the video filter system works with 32-bit bitmaps, scanlines are guaranteed to be aligned to 32-bit boundaries. No further alignment is guaranteed. In particular, filter code must not assume that scanlines are 16-byte aligned for SSE code.

It is important to note that there may be padding between scanlines. The pitch field indicates the true spacing in bytes, and should be used to step from one scanline to the next.
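To make the scanline stepping concrete, here is a small C sketch that walks an image row by row using the pitch and writes a tightly packed Luma plane. It is only an illustration: extract_luma is a hypothetical helper, and both the 0x00RRGGBB pixel layout and the integer luma weights are assumptions, not something VDub or PHash dictates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of VDub or pHash.
   Assumes 32-bit pixels laid out as 0x00RRGGBB and an integer BT.601-style
   luma approximation: Y = (77*R + 150*G + 29*B) >> 8. */
void extract_luma(const uint8_t *data, int width, int height,
                  ptrdiff_t pitch, uint8_t *luma)
{
    for (int y = 0; y < height; y++) {
        /* Step by pitch (in bytes), never by width * 4: rows may be padded,
           and a negative pitch walks through memory backwards. */
        const uint32_t *row = (const uint32_t *)(data + (ptrdiff_t)y * pitch);
        for (int x = 0; x < width; x++) {
            uint32_t p = row[x];
            uint32_t r = (p >> 16) & 0xFF, g = (p >> 8) & 0xFF, b = p & 0xFF;
            luma[(size_t)y * width + x] = (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
        }
    }
}
```

The output plane has no padding, so the matrix functions that run on it afterwards can treat it as a plain width x height array.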

All of this is how VDub accesses and handles the image data in memory. This part, in general, i already did. The only thing i'm having trouble understanding is the scanline stuff, because when manipulating the images with Matrix_Transpose, for example, the width and height of the resulting image were weird, as you saw in the image i posted earlier. But... i think i found how to make it work properly. It seems that i didn't properly configure the way the layout is displayed (i'm currently working on it to see if i can fix the transposing mode).

Why am i doing this with VDub? Because with VDub i'll then use the PHash algo to identify scenes from a video i load into it.

Now... about PHash.

Quote: 2 - PHash ( don't know what it is, or does ) the raw data ?

Yes... but PHash as it is written is bloated, because it loads the image for you and does all the fancy convolution and matrix manipulation using a library called CImg. The main problem is that it is insanely slow for video processing (and also for images, btw), although it is incredibly accurate.

I'm posting a small example of PHash being used here. The source code is embedded (RosAsm file), but it is simply this:

Code Select


[Float_Half: R$ 0.5]

[ImgHash1: R$ 0]
[ImgHash2: R$ 0]
[Similarity_Result: R$ 0]
[Float_64: R$ 64.0]
[Float_Threshold: R$ 0.85]

Main:

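; ph_dct_imagehash takes the file name and a pointer to the 64-bit hash it fills in
C_call 'pHash.ph_dct_imagehash' {B$ "Img1.jpg", 0}, ImgHash1
C_call 'pHash.ph_dct_imagehash' {B$ "Img2.jpg", 0}, ImgHash2
; ph_hamming_distance takes both 64-bit hashes by value, so each one is passed as 2 dwords (low, high)
C_call 'pHash.ph_hamming_distance' D$ImgHash1, D$ImgHash1+4, D$ImgHash2, D$ImgHash2+4
mov D$Similarity_Result eax
; Similarity = 1 - (distance / 64). fild loads the distance as a 32-bit integer
fld1 | fild D$Similarity_Result | fdiv R$Float_64 | fsubp ST1 ST0 | fstp R$Similarity_Result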

Fpu_If R$Similarity_Result >= R$Float_Half
    call 'USER32.MessageBoxA' 0, {B$ 'Images are similar', 0}, {B$ 'PHash test', 0}, &MB_YESNO
Fpu_End_If

call 'Kernel32.ExitProcess' 0

The functionality and an example are explained here:
http://cloudinary.com/blog/how_to_automatically_identify_similar_images_using_phash

and the technical details are explained here:
http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html

Quote: 3. Where comes the matrix and convolution stuff ?

The matrix and convolution come from PHash itself. Internally it creates a matrix that is used as a mask to build the convolution later, in order to make the hash for each image being compared.

The matrix and convolution functions used in PHash (which is open source, i.e., we have access to the source code to read it and learn exactly how it works) are freely available at phash.org.

http://www.phash.org/releases/win/pHash-0.9.4.zip

And here are the technical aspects of PHash and the usage guide:

http://www.phash.org/docs/design.html
http://www.phash.org/docs/howto.html

The major problem is that, to work, PHash uses the bloated CImg library to load the image and to create the matrix and convolution routines the algo needs to produce the hash for each image (the part of the code i posted in the other post in this thread).

How does the comparison work? The final comparison is kinda easy to understand. Once you have the hashes, all you need to do is compare the hash found for one image with the hash found for the other. The hashes are 64 bits long, so the similarity is 1 - (hamming distance / 64). If the similarity is above 50%, the images are similar (the bigger the value, the more similar they are). Below 50%, the images are definitely different.
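As a rough illustration of that comparison, here is a minimal C sketch. The function names are made up for this post (they are not the pHash exports), and the bit counting is a plain loop, not an optimized popcount:

```c
#include <stdint.h>

/* Counts the bits that differ between two 64-bit hashes. */
int hamming_distance64(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;   /* a 1 bit wherever the hashes disagree */
    int count = 0;
    while (x) {
        x &= x - 1;       /* clears the lowest set bit */
        count++;
    }
    return count;
}

/* Similarity in [0, 1]: 1.0 means identical hashes, 0.0 means all 64 bits differ. */
double hash_similarity(uint64_t a, uint64_t b)
{
    return 1.0 - hamming_distance64(a, b) / 64.0;
}
```

For example, two hashes that differ in 4 bits give a similarity of 1 - 4/64 = 0.9375.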

How do we solve the speed problem, so that phash can be used on videos (or regular images or audio) in a fast and more reliable way? By creating our own set of matrix and convolution functions that work on pixels already loaded in memory, instead of having to write several functions to load the image or relying on bloated external libraries to do it for us.

PHash as it is, without fixing it, is simply not useful for video detection, because it is slow as hell despite its high level of accuracy. So, the goal is to recreate PHash using our own set of matrix and convolution functions, instead of being forced to use bloated libraries that do a terrible job internally and result in an algo that is impossible to use for video manipulation, or even image manipulation in general. We need to recreate a phash that doesn't load an image and doesn't use bloated libraries. We need one that simply takes the pixel data of an image previously loaded in memory and computes its hash.

So, the matrix and convolution functions basically need to manipulate the pixel data of a given image directly (no matter which method/API was used to load it), and we only need to feed them the pixel pointer, height and width (and perhaps the pitch/scanline, since it seems to be necessary sometimes for VDub). We are not using the matrix and convolution functions to load the image; we are using them to manipulate pixel data already loaded in memory. So the method chosen to load the image doesn't matter; what matters is that we have access to the pixels and manipulate them directly.

Since the images can be of any size, the matrix manipulation and convolution functions need to work on any size (square or not), because an image can have a different width and height.

Quote: You told me it is for video editing and making plugins to create video effects right ?

yes :)

Quote: Am i right that you need to fetch each picture from a movie, let the effect doing its work and save it back in the movie ?

yes :) :)

But all of the functions responsible for saving the video back, loading it, etc. are already provided by the main app (VDub). Basically, it is a plugin that takes the pixel data loaded by VDub and manipulates the pixels directly. In this case, it is a plugin i'm creating to identify the scenes of a video using an algorithm called PHash, which uses matrix functions and convolution to find the hash of each image/frame.

And, since creating the phash algo requires building the matrix and convolution functions, those functions can also be used later in other plugins or apps that manipulate pixels directly in memory.

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 26, 2017, 07:27:24 AM

Quote: Let's do like Jack the ripper and go in parts :) :icon_mrgreen:

:lol:

The easiest and most flexible way to load and get access to the image data and even control the format you want no matter the pixel format of the original image, is this small piece of GDIplus code.

Code Select

.const

;the formats you may need
PixelFormat1bppIndexed          equ 30101h
PixelFormat4bppIndexed          equ 30402h
PixelFormat8bppIndexed          equ 30803h
PixelFormat16bppGreyScale       equ 101004h
PixelFormat16bppRGB555          equ 21005h
PixelFormat16bppRGB565          equ 21006h
PixelFormat16bppARGB1555        equ 61007h
PixelFormat24bppRGB             equ 21808h
PixelFormat32bppRGB             equ 22009h
PixelFormat32bppARGB            equ 26200Ah
PixelFormat32bppPARGB           equ 0E200Bh

BitmapData struct                          
    dwWidth      dd ?
    dwHeight     dd ?   
    Stride       dd ?   
    PixelFormat  dd ?   
    Scan0        dd ?   ; pointer to the raw bitmap data
    Reserved     dd ?
BitmapData ends

.data?
GDIplusBitmapData BitmapData <?>
pImage dd ?
GdiplusToken dd ?

.code
    invoke  GdiplusStartup,offset GdiplusToken,offset GdiplusInput,NULL
    invoke  GdipCreateBitmapFromFile,offset FilenameW,addr pImage
    invoke  GdipBitmapLockBits,pImage,NULL,1,PixelFormat32bppARGB,offset GDIplusBitmapData

; do your stuff here on the bitmap data
    mov     esi,GDIplusBitmapData.Scan0     ; pointer to the bitmap data
    mov     ecx,GDIplusBitmapData.dwHeight
    mov     edx,GDIplusBitmapData.dwWidth
    add     esi,GDIplusBitmapData.Stride    ; jump to the next scan line etc.
 
; close everything
    invoke  GdipBitmapUnlockBits,pImage,offset GDIplusBitmapData
    invoke  GdipDisposeImage,pImage
    invoke  GdiplusShutdown,GdiplusToken

I think you still have a lot to do to get that PHash routine working the way you describe. :t

Title: Re: Fast Matrix Flip
Post by: guga on April 26, 2017, 02:46:10 PM

Hi Marinus

Thanks for the code. I'll give it a try and see how it works with VDub. A few questions about gdiplus. Since VDub already retrieves the pixel data, how do i use gdiplus on bitmap data rather than on a filename? Can GdipCreateBitmapFromScan0 do it?

For now, i'm asking this out of curiosity, because i don't believe we need a routine that works with GdiPlus right now, since VDub already retrieves the pixel data and format for us to work with. That is a good thing, because it is one less step to do :) Maybe we can use this GdiPlus code only for testing purposes while i'm also testing the main matrix functions on VDub to see if everything works ok.

Quote: I think you still have a lot to do to get that PHash routine working the way you describe. :t

Yep :) :bgrin: :bgrin:

That's why i'm trying to do it in steps. The first ones are basically these:
a) Create some matrix manipulation functions that work on any size
b) Create the convolution function.

Once those 2 steps are done, i can start analysing the internal routines of PHash itself. I believe a few more steps will be needed after the convolution is done. Perhaps i'll only need to create 4 or 5 functions to retrieve the hash after that, but first i need to see if the matrix and convolution functions are ok and give the same results as the ones produced by the internal CImg routines inside PHash itself.

One of the routines used by PHash i have already ported: the hamming_distance function. I haven't tested it for speed or optimized it yet, because the important thing at this stage of development is to make everything work first :)

So we can start testing with your older functions. Some of them i converted, but i'm unsure if they will work as expected on VDub with all layout formats (switching height x width, for example).

The first one we can test is Matrix_FlipX (later we can work on Matrix_FlipY and Matrix_FlipXY). I remade it without the stride, using the examples that you, Jochen and Aw provided. It works, but i'm not sure it will work in all cases, because it does not use the stride/pitch. How can we add the stride/pitch to it?

Code Select


Proc Matrix_FlipX:
    Arguments @Input, @Output, @Width, @Height
    Local @MaxXPos, @MaxYPos, @Remainder, @AdjustSmallSize, @NextScanLine
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; How many xmm moves per row are required;
    mov eax D@Width
    mov edx eax
    mov ebx eax
    and edx 3 | mov D@Remainder edx
    shr eax 2 | mov D@MaxXPos eax

    mov eax D@Height | mov D@MaxYPos eax

    mov D@AdjustSmallSize 0
    mov ebx 4 ; case of less than 4 columns
    If D@MaxXPos > 0
        mov ebx 16 ; subtract enough to fill one xmm register
        mov D@AdjustSmallSize 12
    End_if

    mov eax D@Width | shl eax 2 | mov D@NextScanLine eax |  sub eax ebx
    mov edi D@Output | add edi eax

L1:
        mov ebx D@MaxXPos
        ; source points to the start of every row
        mov eax edi
        mov ecx esi
        test ebx ebx | Jz L4>

        L0:
            movdqu xmm0 X$ecx
            pshufd xmm0 xmm0 27
            movdqu X$eax xmm0
            sub eax 16
            add ecx 16
            dec ebx | jg L0<

L4:
        ; Copy the remaining 1 to 3 dwords one at a time. A 16-byte movdqu here
        ; would read past the end of the input buffer on the last row.
        mov ebx D@Remainder
        test ebx ebx | jz L3>
            add eax D@AdjustSmallSize ; eax = last free dword slot of this output row
        L5:
            mov edx D$ecx | mov D$eax edx
            add ecx 4 | sub eax 4
            dec ebx | jnz L5<
        L3:

         add edi D@NextScanLine
         add esi D@NextScanLine

        dec D@MaxYPos | jg L1<

EndP
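On the stride/pitch question, one possible shape is sketched below in plain C rather than SSE, just to show where the pitch goes. It assumes 32-bit pixels and lets the input and output have different pitches; flip_x_pitch and its parameter names are hypothetical, not a port of the routine above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: horizontal flip of 32-bit pixels with separate pitches.
   Both pitches are in bytes and may be negative (top-down vs bottom-up). */
void flip_x_pitch(const void *input, void *output, int width, int height,
                  ptrdiff_t in_pitch, ptrdiff_t out_pitch)
{
    const uint8_t *src_row = (const uint8_t *)input;
    uint8_t *dst_row = (uint8_t *)output;

    for (int y = 0; y < height; y++) {
        const uint32_t *src = (const uint32_t *)src_row;
        uint32_t *dst = (uint32_t *)dst_row;

        /* Only the visible width is mirrored; padding bytes are never touched. */
        for (int x = 0; x < width; x++)
            dst[width - 1 - x] = src[x];

        /* The pitch only shows up here, once per row, outside the inner loop. */
        src_row += in_pitch;
        dst_row += out_pitch;
    }
}
```

The point of the sketch is that the inner loop (where the pshufd work would go) stays unchanged; only the two row pointers advance by the pitch instead of by Width*4.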

I did use the pitch to copy the buffer in another function i'm using with VDub. It was written after Matrix_X worked. It is this:

Code Select


Proc CopyImageBuffer:
    Arguments @Input, @Output, @Width, @Height, @Pitch
    Local @CurYPos
    Uses eax, ebx, ecx, edx, edi

    mov eax D@Height
    xor ebx ebx
    mov D@CurYPos eax
    .Do
        mov eax D@Output | add eax ebx
        mov ecx D@Input | add ecx ebx
        mov edx D@Width
        Do
            mov edi D$ecx
            add ecx 4
            mov D$eax edi
            add eax 4
            dec edx
        Loop_Until edx = 0
        add ebx D@Pitch
        dec D@CurYPos
    .Loop_Until D@CurYPos = 0

EndP

But i wonder if the pitch really needs to be handled inside the main loop (or if it is really necessary at all, btw).

Also, once the routines are ok (with or without the pitch), maybe we can see if it would be faster to use the same buffer as input and output. That way we can make variations of the matrix functions where the input and output are the same, so we can test the speed. Ex: we can create a sort of Matrix_FlipXEx taking only 3 arguments: Input, Width, Height (plus the pitch if needed), where the input is also used as the output. This avoids needing another function to copy the contents of the output to another buffer.

This way we may end up with 6 major functions to manipulate the matrix, instead of 3:
Matrix_FlipX -> input and output buffers are distinct
Matrix_FlipXEx -> input is also used as the output
Matrix_FlipY -> input and output buffers are distinct
Matrix_FlipYEx -> input is also used as the output
Matrix_FlipXY -> input and output buffers are distinct
Matrix_FlipXYEx -> input is also used as the output
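A minimal C sketch of what such an "Ex" variant could look like for the horizontal flip, assuming 32-bit pixels. The name flip_x_inplace is hypothetical; it swaps pixels from both ends of each row toward the middle, so no second buffer is needed:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-place horizontal flip: the input buffer is also the output.
   pitch is the distance in bytes between the starts of two consecutive rows. */
void flip_x_inplace(void *data, int width, int height, ptrdiff_t pitch)
{
    uint8_t *row_bytes = (uint8_t *)data;

    for (int y = 0; y < height; y++) {
        uint32_t *row = (uint32_t *)row_bytes;

        /* Swap the outermost pair and move inward. With an odd width
           the middle pixel stays where it is. */
        for (int lo = 0, hi = width - 1; lo < hi; lo++, hi--) {
            uint32_t tmp = row[lo];
            row[lo] = row[hi];
            row[hi] = tmp;
        }
        row_bytes += pitch;
    }
}
```

Whether this is faster than the two-buffer version has to be measured; it does half as many iterations per row, but each iteration reads and writes both ends of the row.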

Title: Re: Fast Matrix Flip
Post by: Siekmanski on April 26, 2017, 10:13:55 PM

I posted the code because you didn't want to use the bloated Cimg library. :biggrin:

I'm starting to grasp your intentions now.

1. you need 2 images. ( size may differ )
2. load them to system memory as raw bitmap data.
3. apply a matrix function on both images.
4. apply a convolution. ( this part is still fuzzy to me, what mask etc. )
5. create a PHash value for both images and compare the 2 values.
6. let it run as fast as possible.

Are those the steps to be done and are they in the correct order?

The best way is to come up with a good strategy: work backwards and design the best algorithms for an optimal situation.
This way you can control how the data needs to be organized when loading the images.

For example:
- What input is needed to make the PHash routine happy.
- Is it faster to transpose a complete image in video memory and copy it back to system memory,
or calculate it in system memory?
- Is it faster to get rid of the difference between stride/pitch and the actual bitmap width?
- Is it faster to add zeros to the horizontal bitmap lines that are not multiples of 4 pixels?
- Try to do as much inner-loop coding on 1 cache line ( < 64 byte ) and align that code for fast execution.
- Align the data for fast reading and writing.
- Thus create the best situation to perform the fastest code possible.

What i meant by working backwards is that you then don't need to add extra code to adjust things to make it work:
your code is already prepared for the next step.
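To illustrate one of the options listed above (adding zeros to lines that are not multiples of 4 pixels), here is a small C sketch. It assumes 32-bit pixels in a tightly packed source buffer; padded_width and copy_zero_padded are hypothetical names, and whether the extra copy pays off still has to be measured.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Rounds a width in pixels up to the next multiple of 4
   (4 dwords = one 16-byte SSE register). */
int padded_width(int width)
{
    return (width + 3) & ~3;
}

/* Hypothetical helper: copies a width x height image of 32-bit pixels into a
   new zero-filled buffer whose rows are padded_width(width) pixels long.
   Returns NULL on allocation failure; the caller frees the result. */
uint32_t *copy_zero_padded(const uint32_t *src, int width, int height)
{
    int pw = padded_width(width);
    uint32_t *dst = calloc((size_t)pw * height, sizeof *dst);
    if (!dst)
        return NULL;

    for (int y = 0; y < height; y++)
        memcpy(dst + (size_t)y * pw, src + (size_t)y * width,
               (size_t)width * sizeof *dst);
    return dst;
}
```

With rows padded like this, the inner loop can process whole groups of 4 pixels and no longer needs a remainder path.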

The MASM Forum

General => The Campus => Topic started by: guga on April 07, 2017, 03:56:33 AM