Hi guys,
Continuing with the matrix operations, I built a routine that flips a matrix along the X (width) axis using SSE (thanks, Jochen and Marinus).
The goal is to flip a matrix like this:
[Teste4x4: F$ 1, 2, 3, 4,
F$ 7, 8, 9, 10,
F$ 13, 14, 15, 16,
F$ 19, 20, 21, 22]
into:
[Teste4x4Inverted: F$ 4, 3, 2, 1,
F$ 10, 9, 8, 7,
F$ 16, 15, 14, 13,
F$ 22, 21, 20, 19]
The problem, however, remains with non-square matrices. I'm still struggling to work out the proper flags or the math involved when the matrix is not square (such as 3x2, 27x18, 13x9, etc.).
The code I made for square matrices along the X axis is:
Proc SquaredMatrix_FlipHorizontal_SSE2new:
Arguments @Input, @Output, @Width, @Height
Local @MaxXPos, @CurYPos
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
mov edx D@Height | mov D@CurYPos edx
mov ebx D@Width
mov eax ebx | shr eax 2 | mov D@MaxXPos eax ; MaxXPos = Width/4 (16-byte blocks per row; assumes Width is a multiple of 4)
shl ebx 2 | mov eax ebx ; ebx = Width*4 (bytes per scanline); eax = offset just past the end of the first row
L2:
mov ecx D@MaxXPos
mov edx esi
Align 64 ; <---- Must be aligned to 64 to gain more speed and stability. (If align to 16 the result is a bit slow)
L8:
movdqu XMM0 X$edx+eax-16 ; edx+(Width*4)-16
pshufd XMM0 XMM0 27 ; invert all 4 dwords from left to right
sub edx (4*4)
movups X$edi xmm0
add edi (4*4)
dec ecx | jg L8<
add eax ebx; next scanline in ebx
dec D@CurYPos | jnz L2<<
mov eax D@Output
EndP
Timing for the whole function is only 128.74 nanoseconds :)
Aligning to 16 instead slows it down a bit, to something around 136 nanoseconds.
But... how do I make it work for non-square matrices and keep it fast? What is the math involved for non-square matrices?
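For what it's worth, the math for a non-square matrix is the same as for a square one: every row is reversed on its own, so element (r, c) goes to (r, cols-1-c). Below is a minimal scalar C++ sketch of that rule. The name `flip_horizontal_ref` is made up for illustration and is not one of the routines in this thread; it is only meant as a slow reference to check the SSE versions against.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference routine (not from the thread): reverses every row
// of a row-major rows x cols matrix of 32-bit elements. Any size works.
std::vector<std::uint32_t> flip_horizontal_ref(const std::vector<std::uint32_t>& in,
                                               std::size_t rows, std::size_t cols)
{
    assert(in.size() == rows * cols);
    std::vector<std::uint32_t> out(in.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[r * cols + (cols - 1 - c)] = in[r * cols + c]; // (r, c) -> (r, cols-1-c)
    return out;
}
```

Since rows never interact, the height plays no role in the indexing; only the width decides how many full 4-dword blocks and how many leftover elements each row has.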
I don't remember the MASM syntax exactly, but the code above is probably something like this:
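; Note: WIDTH is a reserved MASM operator, so the size parameters are named MatWidth/MatHeight here.
SquaredMatrix_FlipHorizontal_SSE2new proc uses esi edi ebx ecx edx Input:dword, Output:dword, MatWidth:dword, MatHeight:dword
LOCAL CurYPos: dword
LOCAL MaxXPos: dword
mov esi, Input
mov edi, Output
mov edx, MatHeight
mov CurYPos, edx
mov ebx, MatWidth
mov eax, ebx
shr eax, 2
mov MaxXPos, eax ; MaxXPos = MatWidth/4
shl ebx, 2 ; ebx = MatWidth*4 (bytes per scanline)
mov eax, ebx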
Loop2:
mov ecx, MaxXPos
mov edx, esi
jmp Loop1
; In MASM the directive is written "align n"; n is limited by the segment alignment, which is typically 16.
align 16
Loop1:
movdqu xmm0, xmmword ptr [eax+edx-10h]
pshufd xmm0, xmm0, 1Bh
sub edx, 10h
movups xmmword ptr [edi], xmm0
add edi, 10h
dec ecx
jg Loop1
add eax, ebx
dec CurYPos
jnz Loop2
mov eax, Output
ret
SquaredMatrix_FlipHorizontal_SSE2new endp
Fast transposing of matrices with even or uneven dimensions can be done with a 4 * 4 matrix.
Reserve enough memory to read from and write to (aligned to 4 * 16 bytes).
Read 4 * 4 pixels at once, with the correct memory steps for the rows and columns.
Write 4 * 4 pixels at once, also with the correct 4 * 4 block steps.
It may seem illogical to use a 4 * 4 matrix for uneven image sizes (because of the unused pixels), but that only affects the last 1, 2 or 3 pixels at the right and bottom borders.
For image sizes larger than the example below it is really fast (12 cycles per 16 pixels on my PC).
;The 4 * 4 transpose algorithm:
; In: Out:
; [0 1 2 3] [0 4 8 C]
; [4 5 6 7] [1 5 9 D]
; [8 9 A B] [2 6 A E]
; [C D E F] [3 7 B F]
mov eax,offset MatrixIn
movaps xmm0,[eax+0] ; [0 1 2 3]
movaps xmm1,[eax+16] ; [4 5 6 7]
movaps xmm2,[eax+32] ; [8 9 A B]
movaps xmm3,[eax+48] ; [C D E F]
mov eax,offset MatrixOut
movaps xmm4,xmm0 ; [0 1 2 3]
movaps xmm5,xmm2 ; [8 9 A B]
unpcklps xmm4,xmm1 ; [0 4 1 5]
unpcklps xmm5,xmm3 ; [8 C 9 D]
unpckhps xmm0,xmm1 ; [2 6 3 7]
unpckhps xmm2,xmm3 ; [A E B F]
movaps xmm1,xmm4 ; [0 4 1 5]
movaps xmm6,xmm0 ; [2 6 3 7]
movlhps xmm4,xmm5 ; [0 4 8 C]
movlhps xmm6,xmm2 ; [2 6 A E]
movhlps xmm5,xmm1 ; [1 5 9 D]
movaps xmm7,xmm2 ; [A E B F]
movhlps xmm7,xmm0 ; [3 7 B F]
movaps [eax+0],xmm4 ; [0 4 8 C]
movaps [eax+16],xmm5 ; [1 5 9 D]
movaps [eax+32],xmm6 ; [2 6 A E]
movaps [eax+48],xmm7 ; [3 7 B F]
I don't have time to write the full source code for it, but I'll explain the principle below.
To keep it easy to understand I use a small image as the example.
; each pixel number is 4 bytes
pixelbufferIn dd 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
 dd 10 dup (?) ; padding so the 4 * 4 block reads stay inside the buffer
pixelbufferOut dd 28 dup (0) ; 15 results plus room for the unused lanes of the last block stores
Example of an uneven image width = 5, height = 3
0, 1, 2, 3, 4
5, 6, 7, 8, 9
10,11,12,13,14
; mov esi, offset pixelbufferIn
; mov edi, offset pixelbufferOut
Gather the pixels from the pixelbufferIn with a step of 5 pixels ( image width )
Move the transposed pixels to the pixelbufferOut with a step of 3 ( image height)
[ 0 1 2 3] ; movups xmm0,[esi+0]
[ 5 6 7 8] ; movups xmm1,[esi+20]
[10 11 12 13] ; movups xmm2,[esi+40]
[-- -- -- --] ; movups xmm3,[esi+60] ; these are the values 15, 16, 17, 18, we don't need them but they are needed for the algorithm...
transpose the block of 16 pixels at once:
movaps xmm4,xmm0
movaps xmm5,xmm2
unpcklps xmm4,xmm1
unpcklps xmm5,xmm3
unpckhps xmm0,xmm1
unpckhps xmm2,xmm3
movaps xmm1,xmm4
movaps xmm6,xmm0
movlhps xmm4,xmm5
movlhps xmm6,xmm2
movhlps xmm5,xmm1
movaps xmm7,xmm2
movhlps xmm7,xmm0
[ 0 5 10 --] ; movups [edi+0],xmm4
[ 1 6 11 --] ; movups [edi+12],xmm5
[ 2 7 12 --] ; movups [edi+24],xmm6
[ 3 8 13 --] ; movups [edi+36],xmm7
this gives a result of:
0,5,10,1,6,11,2,7,12,3,8,13
Now get the leftover pixels of column 5:
[ 4 -- -- --] ; movups xmm0,[esi+16]
[ 9 -- -- --] ; movups xmm1,[esi+36]
[14 -- -- --] ; movups xmm2,[esi+56]
[-- -- -- --] ; movups xmm3,[esi+76]
transpose the block of 16 pixels at once:
movaps xmm4,xmm0
movaps xmm5,xmm2
unpcklps xmm4,xmm1
unpcklps xmm5,xmm3
unpckhps xmm0,xmm1
unpckhps xmm2,xmm3
movaps xmm1,xmm4
movaps xmm6,xmm0
movlhps xmm4,xmm5
movlhps xmm6,xmm2
movhlps xmm5,xmm1
movaps xmm7,xmm2
movhlps xmm7,xmm0
[ 4 9 14 --] ; movups [edi+48],xmm4
[-- -- -- --] ; movups [edi+60],xmm5
[-- -- -- --] ; movups [edi+72],xmm6
[-- -- -- --] ; movups [edi+84],xmm7
this gives a result of:
0,5,10,1,6,11,2,7,12,3,8,13,4,9,14
here is your transposed image:
0, 5, 10
1, 6, 11
2, 7, 12
3, 8, 13
4, 9, 14
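The walk-through above can be sketched in scalar C++ as follows. This is only an illustration under my own assumptions: `transpose_tiled` is a made-up name, and instead of reading and writing padding like the movups version does, it clips the border tiles, so no over-allocation is needed. The important part is the same: the source is stepped by the image width and the destination by the image height.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: transposes a row-major (height x width) image into a
// row-major (width x height) image, walking it in 4x4 tiles. Border tiles
// are clipped instead of touching padding.
std::vector<int> transpose_tiled(const std::vector<int>& in,
                                 std::size_t width, std::size_t height)
{
    assert(in.size() == width * height);
    std::vector<int> out(in.size());
    for (std::size_t y0 = 0; y0 < height; y0 += 4) {
        const std::size_t y1 = std::min(y0 + 4, height);
        for (std::size_t x0 = 0; x0 < width; x0 += 4) {
            const std::size_t x1 = std::min(x0 + 4, width);
            for (std::size_t y = y0; y < y1; ++y)
                for (std::size_t x = x0; x < x1; ++x)
                    out[x * height + y] = in[y * width + x]; // source step = width, destination step = height
        }
    }
    return out;
}
```

For the 5 x 3 example this gives 0,5,10,1,6,11,2,7,12,3,8,13,4,9,14, the same sequence as the SSE walk-through.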
Improved and faster 4 * 4 transpose matrix (one movaps fewer: the last row is built directly in xmm2 instead of a copy in xmm7):
; [0 1 2 3] [0 4 8 C]
; [4 5 6 7] [1 5 9 D]
; [8 9 A B] [2 6 A E]
; [C D E F] [3 7 B F]
mov esi,offset MatrixIn
mov edi,offset MatrixOut
movaps xmm0,[esi+0] ; [0 1 2 3]
movaps xmm1,[esi+16] ; [4 5 6 7]
movaps xmm2,[esi+32] ; [8 9 A B]
movaps xmm3,[esi+48] ; [C D E F]
movaps xmm4,xmm0 ; [0 1 2 3]
movaps xmm5,xmm2 ; [8 9 A B]
unpcklps xmm4,xmm1 ; [0 4 1 5]
unpcklps xmm5,xmm3 ; [8 C 9 D]
unpckhps xmm0,xmm1 ; [2 6 3 7]
unpckhps xmm2,xmm3 ; [A E B F]
movaps xmm1,xmm4 ; [0 4 1 5]
movaps xmm6,xmm0 ; [2 6 3 7]
movlhps xmm4,xmm5 ; [0 4 8 C]
movlhps xmm6,xmm2 ; [2 6 A E]
movhlps xmm5,xmm1 ; [1 5 9 D]
movhlps xmm2,xmm0 ; [3 7 B F]
movaps [edi+0],xmm4 ; [0 4 8 C]
movaps [edi+16],xmm5 ; [1 5 9 D]
movaps [edi+32],xmm6 ; [2 6 A E]
movaps [edi+48],xmm2 ; [3 7 B F]
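For readers who prefer intrinsics, the same unpack/move sequence can be written in C++ roughly like this. It is a sketch, not a drop-in replacement: `transpose4x4` and the `TRANSPOSE_SKETCH_SSE` macro are names I made up, unaligned loads/stores are used so the buffers need no special alignment, and a scalar fallback is compiled when SSE is not available.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define TRANSPOSE_SKETCH_SSE 1
#endif

// Transposes a 4x4 float matrix stored row-major in in[16] into out[16].
void transpose4x4(const float* in, float* out)
{
    assert(in != out); // the scalar fallback below is not in-place safe
#ifdef TRANSPOSE_SKETCH_SSE
    const __m128 r0 = _mm_loadu_ps(in + 0);   // [0 1 2 3]
    const __m128 r1 = _mm_loadu_ps(in + 4);   // [4 5 6 7]
    const __m128 r2 = _mm_loadu_ps(in + 8);   // [8 9 A B]
    const __m128 r3 = _mm_loadu_ps(in + 12);  // [C D E F]
    const __m128 t0 = _mm_unpacklo_ps(r0, r1); // [0 4 1 5]
    const __m128 t1 = _mm_unpacklo_ps(r2, r3); // [8 C 9 D]
    const __m128 t2 = _mm_unpackhi_ps(r0, r1); // [2 6 3 7]
    const __m128 t3 = _mm_unpackhi_ps(r2, r3); // [A E B F]
    _mm_storeu_ps(out + 0,  _mm_movelh_ps(t0, t1)); // [0 4 8 C]
    _mm_storeu_ps(out + 4,  _mm_movehl_ps(t1, t0)); // [1 5 9 D]
    _mm_storeu_ps(out + 8,  _mm_movelh_ps(t2, t3)); // [2 6 A E]
    _mm_storeu_ps(out + 12, _mm_movehl_ps(t3, t2)); // [3 7 B F]
#else
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[c * 4 + r] = in[r * 4 + c];
#endif
}
```

The compiler takes care of the register copies that the assembly listing does with movaps, so the intrinsic form does not distinguish between the first and the improved version.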
Hi guga,
I just read your post again. I thought it was about transposing images with sizes that are not multiples of 4.
Now I see it's about flipping along the X axis. Sorry, my mistake.
Quote from: guga on April 07, 2017, 03:56:33 AM
But... how do I make it work for non-square matrices and keep it fast? What is the math involved for non-square matrices?
Hi guga,
I don't know if this is what you want, but here is my solution to flip a 2D matrix along the X axis. It should work for any number of columns and rows.
option casemap:none
option frame:auto
OPTION STACKBASE:RBP
.code
flipMatrix proc public outMat : ptr, inMat : ptr, rows : qword, cols : qword
LOCAL xmmMovesRequired : qword
LOCAL remainder : qword
mov outMat, rcx
mov inMat, rdx
mov rows, r8
mov cols, r9
; How many xmm moves per row are required;
mov rax, cols
mov r10, 4
xor rdx, rdx
div r10
mov xmmMovesRequired, rax
; Remainder
mov remainder, rdx
mov r11,0
.while r11<rows
mov r10,0
; make destination point to the end of every row
mov rax, r11
inc rax
mul cols
shl rax, 2
mov rcx, outMat
add rcx, rax
.if xmmMovesRequired>0
sub rcx, 16 ; subtract enough to fill one xmm register
.else
sub rcx, 4 ; case of less than 4 columns
.endif
; source points to the start of every row
mov rax, r11
mul cols
shl rax, 2
mov rdx, inMat
add rdx, rax
.while r10<xmmMovesRequired
movups xmm0, xmmword ptr [rdx]
pshufd xmm0, xmm0, 00011011b
movups xmmword ptr [rcx], xmm0
sub rcx, 16
add rdx, 16
inc r10
.endw
.if xmmMovesRequired>0
add rcx, 12
.endif
.if remainder >= 1
mov r10d, dword ptr [rdx]
mov dword ptr [rcx], r10d
.if remainder >= 2
mov r10d, dword ptr [rdx+4]
mov dword ptr [rcx-4], r10d
.if remainder == 3
mov r10d, dword ptr [rdx+8]
mov dword ptr [rcx-8], r10d
.endif
.endif
.endif
inc r11
.endw
ret
flipMatrix endp
end
It was tested by calling it from a C++ console application, like this:
#include "stdafx.h"
#include <stdio.h> // printf, getchar
/* tested
#define TOTALROWS 2
#define TOTALCOLS 2
int inmatrix[TOTALROWS][TOTALCOLS] = {
{ 11, 12 },
{ 21, 22}
};
*/
/* tested
#define TOTALROWS 1
#define TOTALCOLS 1
int inmatrix[TOTALROWS][TOTALCOLS] = {
{ 11 }
};
*/
/* tested
#define TOTALROWS 4
#define TOTALCOLS 4
int inmatrix[TOTALROWS][TOTALCOLS] = {
{ 11, 12, 13,14},
{ 21, 22, 23,24},
{ 31, 32, 33,34},
{ 41, 42, 43,44},
};
*/
#define TOTALROWS 5
#define TOTALCOLS 9
int inmatrix[TOTALROWS][TOTALCOLS] = {
{ 11, 12, 13,14,15,16,17,18,19},
{ 21, 22, 23,24,25,26,27,28,29},
{ 31, 32, 33,34,35,36,37,38,39},
{ 41, 42, 43,44,45,46,47,48,49},
{ 51, 52, 53,54,55,56,57,58,59}
};
int outmatrix[TOTALROWS][TOTALCOLS];
extern "C"
{
void flipMatrix(void* outMat, void* inMat, size_t rows, size_t cols);
}
int main()
{
flipMatrix(outmatrix, inmatrix, TOTALROWS, TOTALCOLS);
printf("Input matrix\n");
for (int row = 0; row < TOTALROWS; row++)
{
for (int columns = 0; columns < TOTALCOLS; columns++)
printf("%d ", inmatrix[row][columns]);
printf("\n");
}
printf("Output matrix\n");
for (int row = 0; row < TOTALROWS; row++)
{
for (int columns = 0; columns < TOTALCOLS; columns++)
printf("%d ", outmatrix[row][columns]);
printf("\n");
}
getchar();
return 0;
}
Hi marinus,
No problem :) I was testing your transpose algo too but, unfortunately, I still can't make it work on non-square matrices. The algo seems faster than JJ's, but I'm not succeeding in making it work as expected for non-square sizes.
Many thanks, AW. I'll give it a try :)
Hi guga,
aw27 worked out the method I presented in Reply #1:
http://masm32.com/board/index.php?topic=6140.msg65145#msg65145
Thanks a lot, marinus and AW. I'll give it a try and post the results for speed testing :)
AW,
Here is the port to RosAsm in x86. Many thanks. I'll try to optimize it.
Proc Matrix_FlipHorizontal:
Arguments @Input, @Output, @Width, @Height
Local @xmmMovesRequired, @remainder
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
; How many xmm moves per row are required;
mov eax D@Width
mov ebx 4
xor edx edx
div ebx
mov D@xmmMovesRequired eax
; Remainder
mov D@Remainder edx
mov ecx 0
.While ecx < D@Height
mov ebx 0
; make destination point to the end of every row
mov eax ecx
inc eax
mul D@Width
shl eax 2
mov edi D@Output
add edi eax
If D@xmmMovesRequired > 0
sub edi 16 ; subtract enough to fill one xmm register
Else
sub edi 4 ; case of less than 4 columns
End_if
; source points to the start of every row
mov eax ecx
mul D@Width
shl eax 2
mov esi D@Input
add esi eax
While ebx < D@xmmMovesRequired
movups xmm0 X$esi
pshufd xmm0 xmm0 27
movups X$edi xmm0
sub edi 16
add esi 16
inc ebx
End_While
If D@xmmMovesRequired > 0
add edi 12
End_if
..If D@remainder >= 1
mov ebx D$esi
mov D$edi ebx
.If D@remainder >= 2
mov ebx D$esi+4
mov D$edi-4 ebx
If D@remainder = 3
mov ebx D$esi+8
mov D$edi-8 ebx
End_if
.End_if
..End_if
inc ecx
.End_While
EndP
OK, here it is :)
An optimized version, faster than the original.
Note: later I'll rename the variables to proper names.
New version:
Proc SquaredMatrix_FlipHorizontal_SSE3Guga:
Arguments @Input, @Output, @Width, @Height
Local @xmmMovesRequired, @remainder, @Var1, @Var2, @Var3, @MaxHeight, @CurMovesRequired
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
; How many xmm moves per row are required;
mov eax D@Width
mov edx eax
mov ebx eax
and edx 3 | mov D@Remainder edx
shr eax 2 | mov D@xmmMovesRequired eax
mov eax D@Height | mov D@MaxHeight eax
mov D@Var2 0
mov D@Var1 4 ; case of less than 4 columns
If D@xmmMovesRequired > 0
mov D@Var1 16 ; subtract enough to fill one xmm register
mov D@Var2 12
End_if
mov eax D@Width | shl eax 2 | mov D@Var3 eax | sub eax D@Var1
mov edi D@Output | add edi eax
L1:
mov ebx D@xmmMovesRequired
; source points to the start of every row
mov eax edi
mov ecx esi
test ebx ebx | Jz L4>
L0:
movups xmm0 X$ecx
pshufd xmm0 xmm0 27
movups X$eax xmm0
sub eax 16
add ecx 16
dec ebx | jg L0<
L4:
..If D@remainder >= 1
mov edx D@Var2
mov ebx D$ecx | mov D$eax+edx ebx
.If D@remainder >= 2
mov ebx D$ecx+4 | mov D$eax+edx-4 ebx
If D@remainder = 3
mov ebx D$ecx+8 | mov D$eax+edx-8 ebx
End_if
.End_if
..End_if
add edi D@Var3
add esi D@Var3
dec D@MaxHeight | jg L1<
EndP
benchmark tests:
Original version: 249.31571145976065 ns (731 clock cycles)
New version: 163.09080899440107 ns (480 clock cycles)
Speed improvement: around 35%
It may be possible to optimize it more, I guess.
By the way, changing movups to movdqu may increase the speed a little bit.
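Since not everybody has RosAsm installed, here is roughly the same idea expressed with C++ intrinsics: full 4-dword blocks are reversed with a shuffle (immediate 27 = 1Bh) and written from the end of the row backwards, then the 0 to 3 leftover elements are copied one by one. This is a sketch under my own assumptions, not a translation of the listing above: `flip_rows` and `FLIP_SKETCH_SSE2` are hypothetical names, and a scalar path is compiled when SSE2 is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define FLIP_SKETCH_SSE2 1
#endif

// Hypothetical sketch: flips every row of a row-major rows x cols matrix of
// 32-bit elements. Works for any cols, including cols < 4.
void flip_rows(const std::uint32_t* in, std::uint32_t* out,
               std::size_t rows, std::size_t cols)
{
    assert(in != out);                       // not an in-place algorithm
    const std::size_t blocks = cols >> 2;    // full 4-dword blocks per row
    const std::size_t rem    = cols & 3;     // leftover elements per row
    for (std::size_t r = 0; r < rows; ++r) {
        const std::uint32_t* src = in + r * cols;
        std::uint32_t* dst = out + r * cols + cols;   // one past the row end
        for (std::size_t b = 0; b < blocks; ++b) {
            dst -= 4;
#ifdef FLIP_SKETCH_SSE2
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
            v = _mm_shuffle_epi32(v, 0x1B);  // reverse the 4 dwords
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
#else
            dst[3] = src[0]; dst[2] = src[1]; dst[1] = src[2]; dst[0] = src[3];
#endif
            src += 4;
        }
        for (std::size_t i = 0; i < rem; ++i)
            *--dst = *src++;                 // leftovers land at the row start
    }
}
```

Like the assembly version, there is no multiplication or division inside the per-block loop; the row pointers are the only things derived from `cols`.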
Congratulations, although I cannot test it because I don't have RosAsm. I know I can download it, but I'm just lazy.
I'll try to port it to MASM and post it here.
OK, here is the MASM version.
Sorry, I don't know the MASM macros for repeat loops (not the while ones, just the dec + jcc chains), so I posted the full disassembled source plus some macros I presume are correct, if I remember the MASM syntax well.
; Note: WIDTH is a reserved MASM operator, so the size parameters are named MatWidth/MatHeight here.
Matrix_FlipX proc public USES esi edi ebx ecx edx Input : ptr, Output : ptr, MatWidth : dword, MatHeight : dword
LOCAL MaxXPos: dword
LOCAL MaxYPos: dword
LOCAL Remainder: dword
LOCAL AdjustSmallSize: dword
LOCAL NextScanLine: dword
mov esi, Input
mov edi, Output
mov eax, MatWidth
mov edx, eax
and edx, 3
mov Remainder, edx ; Remainder = MatWidth mod 4
shr eax, 2
mov MaxXPos, eax ; MaxXPos = MatWidth / 4 (xmm moves per row)
mov eax, MatHeight
mov MaxYPos, eax
mov AdjustSmallSize, 0
mov ebx, 4 ; case of less than 4 columns
.If MaxXPos > 0
mov ebx, 16 ; subtract enough to fill one xmm register
mov AdjustSmallSize, 12
.Endif
mov eax, MatWidth
shl eax, 2
mov NextScanLine, eax ; bytes per row
sub eax, ebx
mov edi, Output
add edi, eax
loc_40AF18:
mov ebx, MaxXPos
mov eax, edi
mov ecx, esi
test ebx, ebx
jz short loc_40AF38
loc_40AF23:
movdqu xmm0, xmmword ptr [ecx]
pshufd xmm0, xmm0, 27
movups xmmword ptr [eax], xmm0
sub eax, 16
add ecx, 16
dec ebx
jg short loc_40AF23
loc_40AF38:
.If Remainder >= 1
mov edx, AdjustSmallSize
mov ebx, [ecx]
mov [edx+eax], ebx
.If Remainder >= 2
mov ebx, [ecx+4]
mov [edx+eax-4], ebx
.If Remainder == 3
mov ebx, [ecx+8]
mov [edx+eax-8], ebx
.EndIf
.EndIf
.EndIf
add edi, NextScanLine
add esi, NextScanLine
dec MaxYPos
jg short loc_40AF18
ret
Matrix_FlipX endp
Note: I'm pretty sure it can be optimized further by removing those chains of .If Remainder (especially because we can have large matrices with a small width, such as 3x280), but I'm unable to produce faster code right now. Anyway, it is faster than the original version.
I want to know how you disassembled the source?
With RosAsm or, in this case, IDA Pro (to make it MASM compatible), but it was mainly to remember the MASM syntax so I could port it. RosAsm is NASM compatible, so to make it easier for others to read, I assembled the new function with RosAsm and disassembled it with IDA to produce a version compatible with MASM.
I still don't have time to make a RosAsm to MASM converter/translator (as a standalone tool or inside the RosAsm project itself), so the easiest and fastest way was simply to disassemble the code.
Quote from: guga on April 21, 2017, 06:34:54 AM
Ok. Here is the masm version.
Well done. The gain came mostly from eliminating the mul instructions, which have a great impact, particularly inside a loop.
Thanks.
And also the div that was replaced by "and". The div instruction is always slow. Also, in the updated versions I'm working on, I replaced all movups with movdqu and got a speed gain of around 12% (measured on my i7). And if you remove all the push/pop operations at the beginning, the code will be even faster (they come from the USES directive in MASM; in RosAsm we have a similar thing, but it is not a directive, it is a user-made macro also called "uses"). I just kept it because I don't want the function to alter the caller's registers.
Don't forget that a naked dec/jcc pair is generally a bit faster than cmp xxx / jcc. Also, test is a bit faster than cmp.
By the way, I also just made a version that rotates a matrix by 180º directly. (Later I'll check the speed and see if replacing movups with movdqu gives any advantage here too.) I'll convert it to MASM once I finish testing that it works as expected.
; Rotate a matrix by 180º
Proc Matrix_FlipXY:
Arguments @Input, @Output, @Width, @Height
Local @MaxXPos, @MaxYPos, @remainder, @NextScanLine, @AdjustSmallSize
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
; How many xmm moves per row are required;
mov eax D@Width
mov edx eax
mov ebx eax
and edx 3 | mov D@Remainder edx
shr eax 2 | mov D@MaxXPos eax
; make destination point to the end of every row
mov eax D@Width | shl eax 2 | mov D@NextScanLine eax
mov eax D@Height | mov D@MaxYPos eax | mul D@NextScanLine; | sub eax 4
;mov eax D@Height | mov D@MaxYPos eax | dec eax | mul D@NextScanLine
mov edi D@Output | add edi eax
mov D@AdjustSmallSize 0
mov ebx 4 ; case of less than 4 columns
If D@MaxXPos > 0
mov ebx 16 ; subtract enough to fill one xmm register
mov D@AdjustSmallSize 12
End_if
sub edi ebx
;mov eax D@Width | shl eax 2 | mov D@NextScanLine eax | sub eax ebx
;mov edi D@Output | add edi eax
L1:
mov ebx D@MaxXPos
mov eax edi
mov ecx esi
test ebx ebx | jz L4>
L0:
movups xmm0 X$ecx
pshufd xmm0 xmm0 27
movups X$eax xmm0
sub eax 16
add ecx 16
dec ebx | jg L0<
L4:
..If D@remainder >= 1
mov edx D@AdjustSmallSize
mov ebx D$ecx | mov D$eax+edx ebx
.If D@remainder >= 2
mov ebx D$ecx+4 | mov D$eax+edx-4 ebx
If D@remainder = 3
mov ebx D$ecx+8 | mov D$eax+edx-8 ebx
End_if
.End_if
..End_if
sub edi D@NextScanLine
add esi D@NextScanLine
dec D@MaxYPos | jg L1<
EndP
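A handy way to check the 180º routine: flipping along X and Y at the same time sends element (r, c) to (rows-1-r, cols-1-c), which for a row-major buffer without padding is simply the whole buffer read backwards. A minimal C++ reference under that assumption (the name `rotate180_ref` is made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference routine: rotates a row-major rows x cols matrix by
// 180 degrees. Index i = r*cols + c maps to (rows*cols - 1 - i), so the
// result is the input buffer in reverse order.
std::vector<std::uint32_t> rotate180_ref(const std::vector<std::uint32_t>& in,
                                         std::size_t rows, std::size_t cols)
{
    assert(in.size() == rows * cols);
    return std::vector<std::uint32_t>(in.rbegin(), in.rend());
}
```

This shortcut only holds when rows are stored back to back; with a pitch larger than the row size each row has to be handled separately.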
Note: do you have any idea how to do a matrix convolution? For images, like this one: https://www.tutorialspoint.com/dip/concept_of_convolution.htm
I'm optimizing all the matrix code and will later adapt it to image processing (all that seems to be needed is including the pitch at the end of each scanline. It shouldn't affect the speed that much, I hope).
Note 2: Later I'll try replacing all the "If remainder" blocks with a chain of dec instructions, using "movdqu XMM0 X$ecx | movd D$eax+edx XMM0" to store the remainder dwords. It should be faster even on small matrices, but I haven't tested it yet.
OK, the remainder sequence of Ifs can be replaced with a code chain like this (used in the example above, Matrix_FlipXY). Note that the movdqu always reads 16 bytes, so on the last row it can read up to 12 bytes past the end of the input buffer:
mov ebx D@remainder
test ebx ebx | jz L3>
mov edx D@AdjustSmallSize
movdqu XMM0 X$ecx | movd D$eax+edx XMM0
dec ebx | jz L3> | PSRLDQ xmm0 4 | movd D$eax+edx-4 XMM0
dec ebx | jz L3> | PSRLDQ xmm0 4 | movd D$eax+edx-8 XMM0
L3:
Quote from: guga on April 21, 2017, 07:10:37 PM
see if it also have some advantage replacing the movups on it for movdqu.
For very large matrices you may gain from using several xmm registers, prefetching and movntdq. Something like this (pasting some code I had here for a different purpose; note that movntdq requires a 16-byte aligned destination):
.repeat
prefetchnta BYTE PTR [esi+ecx+4096]
movdqu xmm0, XMMWORD ptr [esi+ecx]
movdqu xmm1, XMMWORD ptr [esi+ecx+16]
movdqu xmm2, XMMWORD ptr [esi+ecx+32]
movdqu xmm3, XMMWORD ptr [esi+ecx+48]
movntdq XMMWORD ptr [edi+ecx], xmm0
movntdq XMMWORD ptr [edi+ecx+16], xmm1
movntdq XMMWORD ptr [edi+ecx+32], xmm2
movntdq XMMWORD ptr [edi+ecx+48], xmm3
sub ecx, 64
.until ecx==0
Yes, this would be similar to what I'm doing with Marinus's transpose matrix, but implementing it is a bit hard. Later I'll try to run the tests on it. The problem is with the leftovers (remainders), which need to be handled in other loops. It can be overcome, although the resulting code is a bit messy. For this particular function, Jochen's code is cleaner, and given the small speed difference between JJ's and marinus's code (marinus's is about 35% faster), it may be worth using JJ's code for readability. (Later I'll post the fixes and the implementation I did on marinus's code so you can better understand what I mean.)
Note: there is no need to use prefetchnta. Its effect on performance seems to depend on the processor. If the code is doing a large number of loads and stores on the same cache line, these could still negatively affect performance.
According to Agner Fog, prefetch throughput on Ivy Bridge is only one per 43 cycles, so we need to be careful not to prefetch too much if we don't want prefetches to slow down the code on IvB. This is a performance bug specific to IvB. On other designs, too much prefetching will just take up uop throughput that could have been used by useful instructions (apart from the harm of prefetching useless addresses).
http://agner.org/optimize/
Even on older processors, the prefetchXXX instructions give no guarantee that the data will be in the cache when it is needed. (Intel IA-32 Software Developer's Manual, Volume 2)
In some cases, performance can be better if you simply use an align 16 directive before the main SSE2 instruction (as I did in Jochen's transpose matrix algo).
Guga, you're right. On my Ivy Bridge prefetching takes 43 cycles. Better to let the CPU handle this.
But there are still many PCs around that could benefit from software prefetching.
It must be the month of matrix calculations.
https://www.codeproject.com/Articles/1182724/Blowing-the-Doors-Off-D-Math-Part-I-Matrix-Multipl
Quote: It must be the month of matrix calculations.
Actually, I'm trying to understand the matrix functions because I'm porting the pHash algorithm to assembly, to see if I can use it in my scene detector plugin for VirtualDub.
pHash compares the similarity between two images. It can be used in a wide range of applications, such as image searching, object recognition (including face recognition) and most probably scene detection and movement detection. Not to mention that it can also be used for audio :) (you may like this :)
The main problem with pHash is that it uses the crappy CImg library, which is insanely slow. CImg is a pile of C++ classes that bloats the performance of the pHash algorithm.
Since pHash uses CImg's matrix manipulation and convolution, I'm trying to recreate not only the matrix functions involved, but also the convolution function, to be used later in the scene detection plugin.
The current version of my plugin can accurately detect hard-cut scenes at a rate near 100% simply by computing and comparing the minimum standard deviation values of the difference between the images (the difference is obtained with an xor operation, not a simple sub). Processing an entire video of 1:30 hours (around 190,000 frames of 720x480) takes around 20 minutes on my i7. So it is a somewhat acceptable amount of time, considering the total number of operations, all the frames involved and also the underlying functions used by VDub itself (which, unfortunately, also uses crappy C++ classes). The problem is with soft-cut scenes (transitions, fades, etc.), which are not detected that accurately. So I'm trying to use pHash instead of the standard deviation algos to gain accuracy in general (hard-cut and soft-cut scenes). My goal is to make the scene detector plugin work with nearly 100% accuracy in all cases (soft and hard cuts) in less than 10 minutes per hour of video (so if I can reach a rate of 10,000 frames (720x480) per minute, or better, it will be great).
I'm currently trying to build a convolution function and check whether the results are the same as CImg's, so I can continue my tests, but so far no success in making this damn convolution thing work :(
By the way, did you write the article? Well done :)
Hi guga,
No, it is not my article. I just saw it on CodeProject and thought: yet another matrix article, it has to be the month of matrix calculations,
since it is popular on this forum now.
Have you ever thought about using the video card and Direct3D 9 to do the 2D image transformations, the flipping, etc.?
It would be much faster.
Quote: Have you ever thought about using the video card and Direct3D 9 to do the 2D image transformations, the flipping, etc.?
It would be much faster.
I hadn't thought of that, but I'm not sure whether VDub can handle D3D9, or whether it would be useful. VDub's functions point to the pixel data, not to a copy of the full image (with the header, etc.). So I'm not sure it is worth using D3D9 directly for pixel manipulation.
Do you have an example of matrix manipulation with D3D9? I mean, for when we already have the pixel data, rather than fully processing an image? Another question: does D3D9 have a convolution function?
You can access the pixel data in D3D9 after the image manipulations and save it if you want.
I can make an example if you like, but I need a week or so (short on free time).
Do you need an example that loads an image of 720x480 pixels and does the transposing, flipping X and flipping Y?
And copies the result to a memory buffer or saves the result as an image?
Or something else maybe... only if I know how to do it, of course.
Hi Marinus,
Quote: "Do you need an example that loads an image of 720x480 pixels and does the transposing, flipping X and flipping Y?
And copies the result to a memory buffer or saves the result as an image?"
Yes, that would be nice :) If D3D9 can perform these routines faster than what we already did, it may be handy to use it in VDub.
I never combined VDub with D3D9 before, but it may be helpful if the results are faster than what we already made :)
Another thing is that a convolution is also needed. So, besides transposing and flipping (X and Y), can you give convolution a try too?
The convolution seems to follow this technique: https://www.tutorialspoint.com/dip/concept_of_convolution.htm
Many thanks :)
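As a starting point for the convolution discussion, here is a plain, unoptimized C++ sketch of a 2D convolution as I understand the linked description: the mask is flipped in both directions, slid over the image, and at each position the products are summed. Several things here are my own assumptions and not from the thread or from CImg: the name `convolve2d`, float pixels, an odd square kernel, and treating pixels outside the image as zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: out(y, x) = sum over (ky, kx) of
// kernel(ky, kx) * img(y + r - ky, x + r - kx), with r = ksize / 2.
// Reading the image "backwards" relative to the kernel index is what
// flipping the mask in both axes amounts to. Out-of-range pixels count as 0.
std::vector<float> convolve2d(const std::vector<float>& img, int width, int height,
                              const std::vector<float>& kernel, int ksize)
{
    assert(ksize > 0 && ksize % 2 == 1);
    assert(img.size() == static_cast<std::size_t>(width) * height);
    assert(kernel.size() == static_cast<std::size_t>(ksize) * ksize);
    const int r = ksize / 2;
    std::vector<float> out(img.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < ksize; ++ky) {
                const int sy = y + r - ky;
                if (sy < 0 || sy >= height) continue;
                for (int kx = 0; kx < ksize; ++kx) {
                    const int sx = x + r - kx;
                    if (sx < 0 || sx >= width) continue;
                    acc += img[sy * width + sx] * kernel[ky * ksize + kx];
                }
            }
            out[y * width + x] = acc;
        }
    }
    return out;
}
```

With a kernel that is all zeros except a 1 in the centre, the output equals the input, which makes a convenient first test before comparing against another library's results. Border handling is the part most likely to differ between implementations.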
I have never done convolution on images, but we could give it a try.
First I will make a stripped-down version of the transposing and flipping stuff; then it will be easier for you to translate it to RosAsm.
Later we could try to get the convolution stuff working. Could be a nice project.
Many thanks, marinus.
A stripped-down version of the matrix functions will be handy, especially because they can work with any matrix size. I found some code that can be used for convolution and will start studying it for us.
One thing: since the matrices will be used for image and video processing, we cannot forget to include the pitch (stride) in the functions. Basically, from what I understood in VDub, the pitch is just a leftover on the width of an image that was used for alignment in memory. So the image width seems to be, in fact, Width+Pitch, where the pitch contains only null data. From the code we did so far, adding a pitch argument to the functions won't change the speed, since we only need to include one instruction and multiply it by the height to find the first scanline when going from bottom to top; or, instead of multiplying, it may be added to the width if we are going from top to bottom.
Probably something like this:
mov eax D@Width | add eax D@Pitch | shl eax 2 | mov D@NextScanLine eax
This way, the user can use the function regardless of whether the data has a pitch value (video) or not (images or simple matrix calculations). So, when working with an image or a simple matrix, he sets Pitch = 0. Otherwise he passes the pitch value. It may end up looking like this:
RosAsm syntax:
call Matrix_FlipY Input, Output, D$Width, D$Height, D$Pitch
MASM syntax:
invoke Matrix_FlipY, offset Input, offset Output, MatWidth, MatHeight, Pitch ; WIDTH is a reserved MASM operator, hence MatWidth
Where:
Input = Pointer to a buffer containing the input data as an array of dwords (which also works for floats).
Output = Pointer to a buffer that receives the resulting data (also an array of dwords or floats).
Width = Width of the array/image/video (dword).
Height = Height of the array/image/video (dword).
Pitch = Pitch of the video/image (dword). 0 if there is no pitch; otherwise the pitch value to be taken into account.
I'm not sure yet if this is correct, but it seems it can be used like that outside the loop (so we gain speed).
I believe the pitch in VDub works the same way as described by Microsoft:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa473780(v=vs.85).aspx
For example, to copy an image in VDub (using the pitch), I can do this:
Proc CopyImageBuffer:
Arguments @Input, @Output, @Width, @Height, @Pitch
Local @CurYPos
Uses eax, ebx, ecx, edx, edi
mov eax D@Height
xor ebx ebx
mov D@CurYPos eax
.Do
mov eax D@Output | add eax ebx
mov ecx D@Input | add ecx ebx
mov edx D@Width
Do
mov edi D$ecx
add ecx 4
mov D$eax edi
add eax 4
dec edx
Loop_Until edx = 0
add ebx D@Pitch
dec D@CurYPos
.Loop_Until D@CurYPos = 0
EndP
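The copy routine above advances by the pitch in bytes after each row, and the same pattern can be applied to the flip. Below is a hedged C++ sketch of a horizontal flip that takes the pitch into account. Assumptions on my side: pixels are 4 bytes, the pitch is the distance in bytes from the start of one row to the start of the next (as in the Microsoft description), source and destination share the same pitch, and the name `flip_horizontal_pitched` is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: flips each row of 4-byte pixels horizontally.
// pitch = bytes from one row start to the next (>= width * 4).
// Padding bytes in the destination are left untouched.
void flip_horizontal_pitched(const std::uint8_t* src, std::uint8_t* dst,
                             std::size_t width, std::size_t height, std::size_t pitch)
{
    assert(pitch >= width * 4);
    for (std::size_t y = 0; y < height; ++y) {
        const std::uint8_t* s = src + y * pitch;  // row start in the source
        std::uint8_t* d = dst + y * pitch;        // row start in the destination
        for (std::size_t x = 0; x < width; ++x)
            std::memcpy(d + (width - 1 - x) * 4, s + x * 4, 4);
    }
}
```

When the pitch equals width * 4 this behaves exactly like the unpadded routines, so a single function can serve both plain matrices and video frames.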
Stride/pitch will be dealt with in the image loader. I'm familiar with it.
Wonderful :) So we don't actually need to handle the pitch :)
I'm testing Matrix_FlipY in VDub (without pitch) and it works like a charm. I'll post a screenshot once I finish these tests.
Hi Marinus,
Here are some screenshots of a video plugin I tested using the different matrix functions. Working like a charm.
VDub does, in fact, handle the pitch as you said. And it seems that the pitch is, in fact, the Width*4 value that I labeled "NextScanLine" in the functions.
Hi,
I have done some "convolution" code for image processing.
Both in assembly and FORTRAN. Mostly in the early 1990's.
Code was for Sobel edge detection, sharpening, blurring,
deinterlacing of TV image capture, autocorrelation (image
repair), and so forth. So it can't be too difficult.
Also tried FFT image processing about that time. Trying out
bandpass filtering. A bit more difficult. And probably less
successful.
Cheers,
Steve N.
Hi Steve, do you still have the autocorrelation functions for image repair? As for the FFT, I ported the algo to assembly, but it is not optimized yet and I have not tested it on image processing. I don't know if it will work as expected.
Hi,
Quote from: guga on April 22, 2017, 11:04:02 PM
Hi Steve, do you still have the autocorrelation functions for image repairing ?
Yes, I still have the code. The TV capture card I was using would
"tear" the last few lines of the image. The bottom of the image had
lines offset from where they should be. Run the autocorrelation on
the last good line and the first bad one to find out how far to shift it.
Then repeat for the next lines. Sort of worked, but too much tweaking
by hand was needed. Haven't looked at it since then. Don't think I
have made any TV image captures either.
Regards,
Steve N.
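The "find out how far to shift it" step described above can be sketched in a few lines of Python. This is only an illustration of the idea (a sum of absolute differences tried over a range of shifts), not Steve's code; `find_line_shift` and its parameters are hypothetical names, and the lines are assumed to be plain lists of sample values.

```python
def find_line_shift(good, bad, max_shift):
    """Return the shift k in [-max_shift, max_shift] that best aligns `bad`
    with `good`, scoring each k by the sum of absolute differences (SAD)."""
    n = len(good)
    best_k, best_score = 0, None
    for k in range(-max_shift, max_shift + 1):
        # Compare only the interior so good[i] and bad[i + k] are both valid.
        score = sum(abs(good[i] - bad[i + k])
                    for i in range(max_shift, n - max_shift))
        if best_score is None or score < best_score:
            best_k, best_score = k, score
    return best_k
```

Once k is known, the repaired line would be built from `bad[j + k]`, and the repaired line then serves as the reference for the next one, as described in the post.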
If you have time, can you post the code here so we can analyse it?
I would like to see how it works in practice. I have some PDFs explaining image reconstruction, but I'm clueless about how to write proper code for it. It seems way too complex for my head :greenclp:
it is a sort of inpaint algorithm, right ?
Hi guga,
I made a start, is this what you want?
Hi marinus.
yes :) That´s it. Many thanks :)
About transposing: when you transpose, width becomes height and vice versa, right? That would explain why my version got crossing lines all over :greensml:
Not sure what I did wrong. Maybe exchanging width and height will solve it.
The pope with an alien was simply hilarious :greenclp: :greenclp: :greenclp:
:biggrin:
Yes, if you have different row and column sizes they need to be switched or else your image is distorted.
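A small Python sketch of that point, assuming a row-major buffer of pixel values (the function name and the list representation are just for illustration): the destination row length has to be the old height, otherwise the rows wrap at the wrong place and the picture comes out sheared.

```python
def transpose(pixels, width, height):
    """Transpose a row-major width x height buffer.

    The result is row-major too, but its width is the old height and its
    height is the old width.
    """
    out = [0] * (width * height)
    for y in range(height):
        for x in range(width):
            # Source (x, y) lands at (y, x); the destination row length is `height`.
            out[x * height + y] = pixels[y * width + x]
    return out, height, width          # data, new width, new height
```

For a 3x2 input `[1, 2, 3, 4, 5, 6]` the result is the 2x3 buffer `[1, 4, 2, 5, 3, 6]`.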
This week i'll write an image saver. Is the PNG format OK?
When ready i'll post all the source code.
Because it is lossless, the pixel data will be the same as the original.
How fast are the Matrix manipulations "Timing Result in milliseconds" on your PC?
I'm very curious, bet they are much faster than the CPU code we did.
On my i7 it is:
0074246 for transposing, but the counter keeps changing quickly, so it is hard to tell the exact speed. 006xxx to 009xxx for FlipX, FlipY and FlipX-Y.
I'll try the filters I'm testing directly in the VDub plugin. I don't know yet how to rotate (transpose) the video in VDub. I tried simply exchanging width and height and changing the copied buffer, but it crashes, probably because VDub is trying to keep the original aspect ratio of the video.
Found it. In VDub it seems that transposing can be set up through the VDXPixmapLayout structure. I've never used it before. I'll give it a try today to see if I can activate it and use the transposing algorithm we are making with it.
Hi,
Quote from: guga on April 24, 2017, 12:13:01 AM
If you have time, can you post the code here so we can analyse it?
Actually there were at least three versions. This is the oldest and
simplest version. As I did not comment at the time (1995) on what
the changes were for, why confuse things. It also means that this
may not be a working example.
PROGRAM ALIGNV
C
C Align the lower scan lines in an image file. Stuff digitized from my VCR
C has poor scans at the bottom of the picture.
C SRN, 16 October 1994
C
C LSTART = Number of lines to skip before processing.
C MSHIFT = Maximum shift to look for.
C NSHIFT = 3 x Maximum shift to look for.
C LEN = 3 x picture line length = RGB line length.
C
PARAMETER ( LEN=684*3, MSHIFT=24, NSHIFT=72, LSTART = 450 )
CHARACTER CHAR1*(LEN), CHAR2*(LEN), CHAR3*(LEN)
C
OPEN (1,FILE='FRAME.RAW',FORM='BINARY')
OPEN (2,FILE='OUT.RAW',FORM='BINARY')
C
MINDIF = 0
MAXDIF = 0
C
DO 10 LINE=1,LSTART
READ (1,END=91) CHAR1
WRITE (2) CHAR1
10 CONTINUE
LINE = LSTART
C
15 CONTINUE
LINE = LINE + 1
READ (1,END=92) CHAR2
CALL AUTO ( CHAR1, CHAR2, K )
MINDIF = MAX( MINDIF, K )
MAXDIF = MAX( MAXDIF, K )
JSTART = MAX( 1, 1+K*3 )
JEND = MIN ( LEN, LEN+K*3 )
DO 20 J=JSTART,JEND
CHAR3(J:J) = CHAR2(J+K*3:J+K*3)
20 CONTINUE
C
DO 30 J=1,JSTART-1
CHAR3(J:J) = CHAR2(J:J)
30 CONTINUE
DO 40 J=JEND+1,480
CHAR3(J:J) = CHAR2(J:J)
40 CONTINUE
C
WRITE (2) CHAR3
WRITE (*,'(1X,2I5)') LINE, K
CHAR1 = CHAR3
GO TO 15
C
91 PRINT *, 'PREMATURE END ON FRAME.RAW, LINE', LINE
GO TO 94
C
92 PRINT *, 'END ON FRAME.RAW, LINE', LINE
GO TO 94
C
94 CONTINUE
WRITE (*,*) MINDIF, MAXDIF
STOP
END
SUBROUTINE AUTO ( CHAR1, CHAR2, K )
PARAMETER ( LEN=684*3, MSHIFT=24, NSHIFT=72, LSTART = 450 )
CHARACTER CHAR1*(LEN), CHAR2*(LEN)
INTEGER INT1(LEN), INT2(LEN), ADIF(-MSHIFT:MSHIFT)
C
DO 10 I=1,LEN
INT1(I) = ICHAR( CHAR1(I:I) )
INT2(I) = ICHAR( CHAR2(I:I) )
10 CONTINUE
C
DO 20 I= -MSHIFT,MSHIFT
ADIF(I) = 0
20 CONTINUE
C
DO 30 I=NSHIFT+1,LEN-NSHIFT,3
DO 30 J = -MSHIFT,MSHIFT
ADIF(J) = ADIF(J) + ABS( INT1(I)-INT2(I+J) )
ADIF(J) = ADIF(J) + ABS( INT1(I+1)-INT2(I+J+1) )
30 ADIF(J) = ADIF(J) + ABS( INT1(I+2)-INT2(I+J+2) )
C
MINDIF = ABS( ADIF(-MSHIFT) )
K = -MSHIFT
DO 40 I= -MSHIFT,MSHIFT
IF ( ABS( ADIF(I) ) .LT. MINDIF ) THEN
K = I
MINDIF = ABS( ADIF(I) )
END IF
40 CONTINUE
* K = -K
WRITE (*,100) (I,ADIF(I),I=-MSHIFT,MSHIFT)
RETURN
100 FORMAT ( 1X, 49(I3,I7) )
END
The following line looks like an error, but its result isn't used. Change MAX() to MIN().
MINDIF = MAX( MINDIF, K )
I hope you find it interesting if not useful as is. This looks like
monochrome processing. The later versions look RGBish.
Quote
it is a sort of inpaint algorithm, right ?
Not sure what you mean.
Cheers,
Steve N.
Many thanks, Steve. I'm taking a look and trying to understand it.
Inpainting techniques are used to do things like this:
https://en.wikipedia.org/wiki/Inpainting
Here are the sources as promised released under the SHARE & ENJOY license. :biggrin:
This is an example how to use Direct3D9 in 2D mode without all the fancy 3D stuff, for very fast image manipulations.
The matrix calculations are done by the video device.
The results can be saved as images ( all the GDIplus formats ) and the raw bitmap data can be read from memory.
I made a comment in the image saver routine.
EDIT: I made a mistake about where to put the comment in "2D_Image_loader_saver.Asm" for fetching the raw bitmap data.
I uploaded a new zip file with the comment at the correct spot in the source code.
Hello. I get an error when saving as PNG.
Hi caballero,
And it doesn't save the MatrixImage.png i assume?
Can you trace down where exactly in the code the error occurs?
Hi Marinus
Thank you :) It works like a gem here :t
Quote
Here are the sources as promised released under the SHARE & ENJOY license.
:greensml: :greensml: :greensml: :greensml:
One thing only: I'm not sure whether using DirectX is faster than direct manipulation of the pixels, as we were doing before. The functions will be used for video manipulation. For that I'm using VirtualDub, making those functions part of a plugin to test the speed. I believe VDub can handle D3D9, but I don't know how to set it up. What I know from VDub is that the functions related to DirectX manipulation are these:
class VDFilterAccelEngine;
class VDFilterAccelContext : public IVDXAContext {
public:
VDFilterAccelContext();
~VDFilterAccelContext();
int VDXAPIENTRY AddRef();
int VDXAPIENTRY Release();
void *VDXAPIENTRY AsInterface(uint32 iid);
bool Init(VDFilterAccelEngine& eng);
void Shutdown();
bool Restore();
uint32 RegisterRenderTarget(IVDTSurface *surf, uint32 rw, uint32 rh, uint32 bw, uint32 bh);
uint32 RegisterTexture(IVDTTexture2D *tex, uint32 imageW, uint32 imageH);
uint32 VDXAPIENTRY CreateTexture2D(uint32 width, uint32 height, uint32 mipCount, VDXAFormat format, bool wrap, const VDXAInitData2D *initData);
uint32 VDXAPIENTRY CreateRenderTexture(uint32 width, uint32 height, uint32 borderWidth, uint32 borderHeight, VDXAFormat format, bool wrap);
uint32 VDXAPIENTRY CreateFragmentProgram(VDXAProgramFormat programFormat, const void *data, uint32 length);
void VDXAPIENTRY DestroyObject(uint32 handle);
void VDXAPIENTRY GetTextureDesc(uint32 handle, VDXATextureDesc& desc);
void VDXAPIENTRY SetTextureMatrix(uint32 coordIndex, uint32 textureHandle, float xoffset, float yoffset, const float uvMatrix[12]);
void VDXAPIENTRY SetTextureMatrixDual(uint32 coordIndex, uint32 textureHandle, float xoffset, float yoffset, float xoffset2, float yoffset2);
void VDXAPIENTRY SetSampler(uint32 samplerIndex, uint32 textureHandle, VDXAFilterMode filter);
void VDXAPIENTRY SetFragmentProgramConstF(uint32 startIndex, uint32 count, const float *data);
void VDXAPIENTRY DrawRect(uint32 renderTargetHandle, uint32 fragmentProgram, const VDXRect *destRect);
void VDXAPIENTRY FillRects(uint32 renderTargetHandle, uint32 rectCount, const VDXRect *rects, uint32 colorARGB);
protected:
enum {
kHTFragmentProgram = 0x00010000,
kHTRenderTarget = 0x00020000,
kHTTexture = 0x00030000,
kHTRenderTexture = 0x00040000,
kHTTypeMask = 0xFFFF0000
};
struct HandleEntry {
uint32 mFullHandle;
IVDTResource *mpObject;
uint32 mImageW;
uint32 mImageH;
uint32 mSurfaceW;
uint32 mSurfaceH;
uint32 mRenderBorderW;
uint32 mRenderBorderH;
bool mbWrap;
};
uint32 AllocHandle(IVDTResource *obj, uint32 handleType);
HandleEntry *AllocHandleEntry(uint32 handleType);
IVDTResource *DecodeHandle(uint32 handle, uint32 handleType) const;
const HandleEntry *DecodeHandleEntry(uint32 handle, uint32 handleType) const;
void ReportLogicError(const char *msg);
IVDTContext *mpParent;
VDFilterAccelEngine *mpEngine;
typedef vdfastvector<HandleEntry> Handles;
Handles mHandles;
uint32 mNextFreeHandle;
bool mbErrorState;
float mUVTransforms[8][12];
VDAtomicInt mRefCount;
};
#endif // f_VD2_FILTERACCELCONTEXT_H
The methods marked VDXAPIENTRY seem to be used to set up the D3D9 DLL, but I have no idea how to make them work yet.
There is an example of an internal plugin inside VDub itself called invert which, as the name suggests, inverts the colors of a video. The plugin itself is fairly fast, but I didn't compare it with other versions because it uses another way to set up and initialize the plugin engine that is a bit harder to understand. (Damn C++ classes) :greensml: :greensml: :greensml:
The full plugin is written like this:
namespace {
#ifdef _M_IX86
void __declspec(naked) VDInvertRect32(uint32 *data, long w, long h, ptrdiff_t pitch) {
__asm {
push ebp
push edi
push esi
push ebx
mov edi,[esp+4+16]
mov edx,[esp+8+16]
mov ecx,[esp+12+16]
mov esi,[esp+16+16]
mov eax,edx
xor edx,-1
shl eax,2
inc edx
add edi,eax
test edx,1
jz yloop
sub edi,4
yloop:
mov ebp,edx
inc ebp
sar ebp,1
jz zero
xloop:
mov eax,[edi+ebp*8 ]
mov ebx,[edi+ebp*8+4]
xor eax,-1
xor ebx,-1
mov [edi+ebp*8 ],eax
mov [edi+ebp*8+4],ebx
inc ebp
jne xloop
zero:
test edx,1
jz notodd
not dword ptr [edi]
notodd:
add edi,esi
dec ecx
jne yloop
pop ebx
pop esi
pop edi
pop ebp
ret
};
}
#else
void VDInvertRect32(uint32 *data, long w, long h, ptrdiff_t pitch) {
pitch -= 4*w;
do {
long wt = w;
do {
*data = ~*data;
++data;
} while(--wt);
data = (uint32 *)((char *)data + pitch);
} while(--h);
}
#endif
}
///////////////////////////////////////////////////////////////////////////////
class VDVideoFilterInvert : public VDXVideoFilter {
public:
VDVideoFilterInvert();
uint32 GetParams();
void Run();
void StartAccel(IVDXAContext *vdxa);
void RunAccel(IVDXAContext *vdxa);
void StopAccel(IVDXAContext *vdxa);
protected:
uint32 mAccelFP;
};
VDVideoFilterInvert::VDVideoFilterInvert()
: mAccelFP(0)
{
}
uint32 VDVideoFilterInvert::GetParams() {
const VDXPixmapLayout& pxlsrc = *fa->src.mpPixmapLayout;
VDXPixmapLayout& pxldst = *fa->dst.mpPixmapLayout;
switch(pxlsrc.format) {
case nsVDXPixmap::kPixFormat_XRGB8888:
pxldst.pitch = pxlsrc.pitch;
return FILTERPARAM_SUPPORTS_ALTFORMATS | FILTERPARAM_PURE_TRANSFORM;
case nsVDXPixmap::kPixFormat_VDXA_RGB:
case nsVDXPixmap::kPixFormat_VDXA_YUV:
return FILTERPARAM_SWAP_BUFFERS | FILTERPARAM_SUPPORTS_ALTFORMATS | FILTERPARAM_PURE_TRANSFORM;
default:
return FILTERPARAM_NOT_SUPPORTED;
}
}
void VDVideoFilterInvert::Run() {
VDInvertRect32(
fa->src.data,
fa->src.w,
fa->src.h,
fa->src.pitch
);
}
void VDVideoFilterInvert::StartAccel(IVDXAContext *vdxa) {
mAccelFP = vdxa->CreateFragmentProgram(kVDXAPF_D3D9ByteCodePS20, kVDFilterInvertPS, sizeof kVDFilterInvertPS);
}
void VDVideoFilterInvert::RunAccel(IVDXAContext *vdxa) {
vdxa->SetTextureMatrix(0, fa->src.mVDXAHandle, 0, 0, NULL);
vdxa->SetSampler(0, fa->src.mVDXAHandle, kVDXAFilt_Point);
vdxa->DrawRect(fa->dst.mVDXAHandle, mAccelFP, NULL);
}
void VDVideoFilterInvert::StopAccel(IVDXAContext *vdxa) {
if (mAccelFP) {
vdxa->DestroyObject(mAccelFP);
mAccelFP = 0;
}
}
///////////////////////////////////////////////////////////////////////////////
extern const VDXFilterDefinition g_VDVFInvert = VDXVideoFilterDefinition<VDVideoFilterInvert>(
NULL,
"invert",
"Inverts the colors in the image.\n\n[Assembly optimized]");
#ifdef _MSC_VER
#pragma warning(disable: 4505) // warning C4505: 'VDXVideoFilter::[thunk]: __thiscall VDXVideoFilter::`vcall'{48,{flat}}' }'' : unreferenced local function has been removed
#endif
I'll try to port this to assembly to run proper tests of DX video manipulation in VDub, but I'm not sure if I'll succeed, or if it is in fact faster than the direct pixel manipulation we were doing before. (I believe it cannot be faster, because we need to take into account all the internal functions used to access DirectX itself.)
One good thing is that I finally succeeded in making VDub change the layout with the other functions. Now all that is missing is to see if it will work with the Matrix_transpose function :)
Quote from: Siekmanski on April 25, 2017, 10:26:41 PM
Hi caballero,
And it doesn't save the MatrixImage.png i assume?
Can you trace down where exactly in the code the error occurs?
MatrixImage.png is created but with 0 bytes. We have already seen that my computer is a bit odd with gdip, so don't worry. Nevertheless, here are some captures from debugging. The flow stops when executing "GdipSaveImageToFile" with "F8", maybe here.
Regards
Hi caballero,
Phewww...., so we can blame GDIplus for the error.
Hi guga,
I don't know what VDub is, is it freeware ?
It seems to use directX9 in some way.
I suppose your goal is to manipulate color values and color positions in an image, am i right ?
This is a perfect job for the video device instead of only the CPU.
For example transposing a 720 by 480 32bit image via CPU:
1 - setup your program code.
2 - load the image data to system memory
3 - the calculation loop:
a - read an array of 1382400 bytes
b - calculate the new positions in the array ( the transpose matrix routine )
c - write 1382400 bytes
d - write 1382400 bytes to the correct position in video memory ( this is slow )
e - present the new image to the screen
For example transposing a 720 by 480 32bit image via GPU (via Direct3D9):
1 - setup your program code.
2 - load the image data to video memory
3 - the calculation loop:
a - calculate the new 4 * X,Y positions for the screen positions, and the new 4 * X,Y for the image corner coordinates.
b - write the new calculated 64 bytes to the video device ( the transpose matrix routine )
c - present the new image to the screen.
I think you will agree, that the second method is much much much faster.
No transfers between system and video memory, and only 16 values to calculate for the whole transpose matrix routine.
0.005 milliseconds for transposing a 32bit 720 by 480 pixels image, try to beat that with CPU coding.
For fast image manipulations try to avoid data transfers between system and video memory because they are slow.
So, better do all the image manipulations etc. by using the video device itself if possible.
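To make the numbers in the two lists above concrete, here is the arithmetic as a short Python sketch. The only assumption added here is that the 16 coordinate values are 32-bit floats, which is what 64 bytes for 16 values implies.

```python
WIDTH, HEIGHT, BYTES_PER_PIXEL = 720, 480, 4

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL      # one 32-bit frame

# CPU path: read the frame, write the transposed copy, upload it to video memory.
cpu_bytes_moved = 3 * frame_bytes

# GPU path: 4 corners, each with a screen (x, y) and an image (x, y) coordinate,
# assumed to be stored as 32-bit floats.
gpu_values = 4 * 2 * 2
gpu_bytes_moved = gpu_values * 4

print(frame_bytes, cpu_bytes_moved, gpu_bytes_moved)
```

So per frame the CPU route touches about 4 MB while the GPU route sends 64 bytes; this counts bytes only and says nothing about how long the device takes to draw.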
VDub is free and open source (in the sense that the sources are released, regardless of the license itself :icon_mrgreen:).
It is a simple and very powerful video editing tool, although its plugins are a bit hard to configure. It was originally made more than a decade ago, but it is used as an alternative to professional video editors such as Sony Vegas, for example.
http://www.virtualdub.org
https://sourceforge.net/projects/virtualdub/?source=top3_dlp_t5
For example, there is a university in Russia that makes incredible plugins for it, such as a subtitle remover, motion estimation, noise remover, TV commercial detection, video stabilizer, etc. Their plugins can also be found here: http://www.compression.ru/video/video_codecs.htm
Other places where people have published plugins for it (with or without source) can be found here:
http://www.guthspot.se/video/deshaker.htm
http://avisynth.nl/users/fizick/fizick.html
https://forum.doom9.org/
https://forum.videohelp.com/threads/281594-Avisynth-Virtualdub
Some tutorials on youtube explain several kinds of plugins as well. One of those that i like is:
https://www.youtube.com/watch?v=6QRJZpOrX0s
Quote
I suppose your goal is to manipulate color values and color positions in an image, am i right ?
Yeah... it is for image and video manipulation. I'm currently trying to understand and create those matrix functions in order to build a plugin (or app/DLL) that is a variation of the pHash algorithm, which is used to compare images (either from video or still images). pHash is a sort of image signature and its field of application is huge: it is similar to what Google and YouTube use for image search (dunno if Google uses a pHash-like algorithm, but it probably does), and it can be used for object removal in a video or image, face detection, motion estimation, tracking, image/video reconstruction, etc. pHash can also be used for audio recognition.
Rebuilding pHash is the first step I can test towards a plugin for scene detection in videos. I have already made a plugin for VDub that is able to detect scenes in a video. The only problem is that its accuracy is limited to hard cuts; for transitions (fades, etc.) the algorithm I'm using fails. Basically, a scene change can be detected by comparing the difference between two frames. The difference is obtained by calculating the minimum standard deviation of the light/luma values of one frame and the other. So we compute the STD of each frame and subtract one from the other (with xor, to amplify the differences, rather than a simple sub). Two frames are completely different from each other when the difference of the minimum STD between them is positive, but soft cuts are where the algorithm fails, and that is where I'll try to use pHash. Internally pHash relies on matrix manipulation provided by the CImg library, which it also uses to load the images to be compared.
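A very simplified Python sketch of that frame comparison, only to show the shape of the idea. It is not the plugin's code: it uses a plain population standard deviation and an absolute difference against a threshold, leaves out the "minimum STD" and xor details, and the names `luma_std`, `is_hard_cut` and `threshold` are made up for the example.

```python
from math import sqrt

def luma_std(luma):
    """Population standard deviation of a frame's luma samples."""
    mean = sum(luma) / len(luma)
    return sqrt(sum((v - mean) ** 2 for v in luma) / len(luma))

def is_hard_cut(prev_luma, cur_luma, threshold):
    """Flag a cut when the two frames' luma deviations differ by more than `threshold`."""
    return abs(luma_std(cur_luma) - luma_std(prev_luma)) > threshold
```

It also hints at why fades are hard for this kind of test: during a slow transition the deviation changes a little on every frame, so no single pair of frames needs to cross the threshold.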
I believe pHash can be used as a replacement for the scene detection algorithm I'm currently creating. The advantage of using pHash is that we would not be limited to scene detection: a wide range of things can be done with this algorithm (for video and image processing, and also for audio).
pHash can be found here: http://phash.org But, as I said before, it uses the crappy CImg library and, in the end, it is incredibly slow compared to what we are doing. It is impossible to use the current version of pHash to identify a full video, for example. It would take hours to complete, whereas if we simplify the algorithm it could be processed completely in a matter of minutes, using whatever other image library you wish. That way you are not forced to use CImg all the time; you can use any other library you want.
Quote
0.005 milliseconds for transposing a 32bit 720 by 480 pixels image, try to beat that with CPU coding.
Yes, that's fast, but I wonder if the performance is the same when using it for videos. If it is faster than what we are doing, then it's OK for us to use :) But I didn't measure it. The timings I got for your previous algorithm were in nanoseconds (273 nanoseconds on average, i.e. about 0.000273 milliseconds for that function).
I don't know how to measure the DX matrix manipulation separately to see if it beats your previous work or not, but the DX function used to manipulate the matrix probably can't beat it. I mean, if you isolate the DX function responsible for matrix manipulation (transpose, flip, etc.) and compare it with the function you wrote, I doubt DX is faster than yours.
I'm not sure if it was clear, but basically what I'm trying to do is:
a) the user accesses the pixel data with whatever method he wants (DX, LoadBitmap, GDIPlus, etc.)
Once he has the pointers to the image pixels and knows the width and height (and perhaps pitch, in the case of videos), he simply does this:
b) uses the pointers to create the pHash of an image, to be compared with the other one he has already loaded
The problem lies in pHash, which loads the image with that crappy CImg library, which uses matrix transposition internally and reacts according to the width and height. But in fact we don't need pHash to load the image; we only need the bare minimum (the true algorithm) used to create the hash, plus a few functions to manipulate the matrices (the pixels we previously loaded) so that we can create a convolution function to retrieve the hash. So how the image is loaded, in order to pass the pixel data pointer to the pHash algorithm, is up to the user.
Sure, once we create the matrix manipulation functions they can be used elsewhere, with pHash or not, but since we need a minimum of matrix manipulation for pHash, it is worth creating them.
pHash works like this:
int ph_dct_imagehash(const char* file,ulong64 &hash){
if (!file){
return -1;
}
CImg<uint8_t> src;
try {
src.load(file);
} catch (CImgIOException ex){
return -1;
}
CImg<float> meanfilter(7,7,1,1,1);
CImg<float> img;
if (src.spectrum() == 3){
img = src.RGBtoYCbCr().channel(0).get_convolve(meanfilter);
} else if (src.spectrum() == 4){
int width = img.width();
int height = img.height();
int depth = img.depth();
img = src.crop(0,0,0,0,width-1,height-1,depth-1,2).RGBtoYCbCr().channel(0).get_convolve(meanfilter);
} else {
img = src.channel(0).get_convolve(meanfilter);
}
img.resize(32,32);
CImg<float> *C = ph_dct_matrix(32);
CImg<float> Ctransp = C->get_transpose();
CImg<float> dctImage = (*C)*img*Ctransp;
CImg<float> subsec = dctImage.crop(1,1,8,8).unroll('x');;
float median = subsec.median();
ulong64 one = 0x0000000000000001;
hash = 0x0000000000000000;
for (int i=0;i< 64;i++){
float current = subsec(i);
if (current > median)
hash |= one;
one = one << 1;
}
delete C;
return 0;
}
We don't actually need any of the CImg crap. All we need from it is the minimum set of matrix manipulation functions and the convolution, working directly on the pixel data we already have (because we already got the pixel data with whatever other method we chose). And, to make things a bit easier, we don't even need RGBtoYCbCr, because pHash uses only luma (Y) and it is up to the user to choose whatever method he wants to retrieve the luma values. Probably all we need is the pointers to the pixel data already converted to luma.
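As a rough idea of how little is left once CImg is out of the picture, here is a minimal pure-Python sketch of the hashing core. It assumes the input is already a 32 x 32 luma image (the 7x7 mean filter and the resize are skipped), the function names are hypothetical, and the DCT matrix is the standard orthonormal DCT-II, which I assume matches what ph_dct_matrix builds. The median here is the upper of the two middle values; CImg's own median() may differ slightly for an even number of elements.

```python
from math import cos, pi, sqrt

N = 32  # pHash resizes the filtered luma image to 32 x 32

def dct_matrix(n):
    """Orthonormal DCT-II matrix: row 0 is constant, rows 1..n-1 are cosines."""
    c = [[1.0 / sqrt(n)] * n]
    for y in range(1, n):
        c.append([sqrt(2.0 / n) * cos(pi / (2 * n) * y * (2 * x + 1))
                  for x in range(n)])
    return c

def matmul(a, b):
    """Plain row-by-column product of two square matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]

def dct_image_hash(luma):
    """64-bit hash of a 32 x 32 luma image given as a list of 32 rows."""
    c = dct_matrix(N)
    c_t = [list(col) for col in zip(*c)]
    dct = matmul(matmul(c, luma), c_t)          # C * img * C^T
    # Keep the 8 x 8 low-frequency block, skipping the first row and column.
    block = [dct[y][x] for y in range(1, 9) for x in range(1, 9)]
    median = sorted(block)[len(block) // 2]
    bits = 0
    for i, value in enumerate(block):
        if value > median:                      # one bit per coefficient
            bits |= 1 << i
    return bits
```

Only a transpose, two matrix products and a median are needed, which is why a small set of matrix routines working on an existing luma buffer should be enough.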
Now... imagine this new pHash being used as a video plugin :icon_cool: It can do amazing things in a faster and more reliable way.
Don't know for sure if i understand it well.
1. You need an image loaded and encoded to raw image data in memory ?
2. PHash ( don't know what it is, or does ) the raw data ?
3. Where comes the matrix and convolution stuff ?
You told me it is for video editing and making plugins to create video effects right ?
Am i right that you need to fetch each picture from a movie, let the effect doing its work and save it back in the movie ?
I have a little trouble understanding everything you want to achieve.
Let´s do like Jack the ripper and go in parts :) :icon_mrgreen:
About VDub functionality:
Quote
1. You need an image loaded and encoded to raw image data in memory ?
Yes, I needed to know the pointer to the pixels of the image that was loaded (with whatever method or API: DX, GdiPlus, etc.)
This is the easier part, since VDub (which is used to edit videos) gives me direct access to the pixels of each frame. So I know exactly the width, height and pitch of the image that belongs to a certain frame of a video.
This is done through a structure called VFBitmap, which I ported to asm as:
[VFBitmap:
VFBitmap.pVBitmapFunctions: D$ 0 ; It is a pointer inside VDub (vtable). It is an array of offsets. Points to deprecated VBitmap functions
VFBitmap.data: D$ 0 ; Pixel32 (data of the image). Pointer to start of _bottom-most_ scanline of plane 0.
VFBitmap.palette: D$ 0 ; Pointer to palette (reserved - set to NULL).
VFBitmap.depth: D$ 0 ; image depth. Same as biBitCount in BitmapInfoHeader, but this is a dword
VFBitmap.w: D$ 0 ; The width of the bitmap, in pixels. Same as in BitmapInfoHeader
VFBitmap.h: D$ 0 ; The height of the bitmap, in pixels. Same as in BitmapInfoHeader
VFBitmap.pitch: D$ 0 ; Distance, in bytes, from the start of one scanline in plane 0 to the next. (Bitmaps can be stored top-down as well as bottom-up. The pitch value is positive if the image is stored bottom-up in memory and negative if the image is stored top-down.)
VFBitmap.modulo: D$ 0 ; Distance, in bytes, from the end of one scanline in plane 0 to the start of the next. (This value is positive or zero if the image is stored bottom-up in memory and negative if the image is stored top-down. A value of zero indicates that the image is stored bottom-up with no padding between scanlines. For a 32-bit bitmap, modulo is equal to pitch -)
VFBitmap.size: D$ 0 ; The size, in bytes, of the image. Size of plane 0, including padding. Same as in BITMAPINFOHEADER
VFBitmap.offset: D$ 0 ; Offset from beginning of buffer to beginning of plane 0.
VFBitmap.dwFlags: D$ 0 ; Set in paramProc if the filter requires a Win32 GDI display context for a bitmap.
; (Deprecated as of API V12 - do not use) NEEDS_HDC = 0x00000001L,
VFBitmap.hdc: D$ 0] ; A handle to a device context.
So, the "data" member of the VFBitmap structure points to the start of the pixels in memory. (In general they are in RGB8888 format, which is easy to convert to RGBQUAD; I have already done this part.)
This part of the code, which retrieves the pixels in memory, is already done (in this case by VDub, which loads the video and grants me access to each frame containing the pixels to be manipulated).
This is the easier part.
In VDub, images are passed in and out of video filters through the VFBitmap structure. Each VFBitmap represents an image as follows:
(http://i65.tinypic.com/10engk2.jpg)
The image is stored as a series of sequential scanlines, where each scanline consists of a series of pixels.
Since the video filter system works with 32-bit bitmaps, scanlines are guaranteed to be aligned to 32-bit boundaries. No further alignment is guaranteed. In particular, filter code must not assume that scanlines are 16-byte aligned for SSE code.
It is important to note that there may be padding between scanlines. The pitch field indicates the true spacing in bytes, and should be used to step from one scanline to the next
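A tiny Python sketch of what "use the pitch to step from one scanline to the next" means in practice. The helper names are made up for the example; the modulo formula simply follows from the two definitions in the structure above (pitch is start-to-start, modulo is end-to-start), assuming 4 bytes per pixel.

```python
def scanline_offsets(height, pitch):
    """Byte offset of each scanline relative to the `data` pointer.

    Stepping by `pitch` works for either sign: a negative pitch simply walks
    backwards through memory from the first scanline.
    """
    return [row * pitch for row in range(height)]

def modulo_for(width, pitch, bytes_per_pixel=4):
    """Gap in bytes between the end of one scanline and the start of the next."""
    return pitch - width * bytes_per_pixel
```

For a 720-pixel-wide 32-bit frame a pitch of 2880 means no padding at all, while a pitch of 2896 leaves 16 padding bytes per row that a filter has to skip.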
All of this is how VDub accesses and handles the image data in memory. This part I have, in general, already done. I'm only having trouble understanding the scanline stuff, because when manipulating images with Matrix_Transpose, for example, the width and height of the resulting image were weird, as you saw in the image I posted earlier. But I think I found out how to make it work properly. It seems that I didn't configure the layout properly (I'm currently working on it to see if I can fix the transposing mode).
Why am I doing this with VDub? Because with VDub I'll then use the pHash algorithm to identify scenes in a video I load into it.
Now....about Phash.
Quote
2 - PHash ( don't know what it is, or does ) the raw data ?
Yes, but pHash as written is bloated, because it loads the image for you and does all the fancy stuff with convolution and matrix manipulation using a library called CImg. The main problem is that it is insanely slow for video processing (and also for images, by the way), although it is incredibly accurate.
I'm posting a small example of pHash being used. The source code is embedded (RosAsm file), but it is simply this:
[Float_Half: R$ 0.5]
[ImgHash1: R$ 0]
[ImgHash2: R$ 0]
[Similarity_Result: R$ 0]
[Float_64: R$ 64.0]
[Float_Thrsehold: R$ 0.85]
Main:
C_call 'pHash.ph_dct_imagehash' {B$ "Img1.jpg", 0}, ImgHash1
C_call 'pHash.ph_dct_imagehash' {B$ "Img2.jpg", 0}, ImgHash2
C_call 'pHash.ph_hamming_distance' ImgHash1, ImgHash2
mov D$Similarity_Result eax
fld1 | fild F$Similarity_Result | fdiv R$Float_64 | fsubp ST1 ST0 | fstp R$Similarity_Result
Fpu_If R$Similarity_Result >= R$Float_Half
call 'USER32.MessageBoxA' 0, {B$ 'Images are similar', 0}, {B$ 'PHash test', 0}, &MB_YESNO
Fpu_End_If
call 'Kernel32.ExitProcess' 0
The functionality and an example are explained here:
http://cloudinary.com/blog/how_to_automatically_identify_similar_images_using_phash
and the technical details are explained further here:
http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html
Quote
3. Where comes the matrix and convolution stuff ?
The matrix and convolution come from pHash itself. Internally it creates a matrix to be used as a mask for building the convolution later, in order to make the hashes for each image being compared.
The matrix and convolution functions used in pHash (which is open source, i.e. we have access to the source code to read it and learn exactly how it works) are freely available at phash.org.
http://www.phash.org/releases/win/pHash-0.9.4.zip
And here are the technical aspects of pHash and the usage guide:
http://www.phash.org/docs/design.html
http://www.phash.org/docs/howto.html
The major problem is that, to work, pHash uses the bloated CImg library to load the image and to build the matrix and convolution routines that let the algorithm produce the hash of each image. (This is the part of the code I posted earlier in this thread.)
How does the comparison work? The final comparison is fairly easy to understand. Once you have the hashes, all you need to do is compare the hash of one image with the hash of the other. If the resulting similarity score is above 50%, the images are considered similar (the bigger the value, the more similar the images are). Below 50% the images are considered different.
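In Python the comparison boils down to the sketch below. It mirrors the arithmetic of the RosAsm example (1 minus the Hamming distance divided by 64, tested against 0.5); the function names are illustrative only and are not the pHash API.

```python
def hamming_distance(hash1, hash2):
    """Number of bit positions in which two 64-bit hashes differ."""
    return bin((hash1 ^ hash2) & 0xFFFFFFFFFFFFFFFF).count("1")

def similarity(hash1, hash2):
    """1.0 for identical hashes, 0.0 when all 64 bits differ."""
    return 1.0 - hamming_distance(hash1, hash2) / 64.0

def looks_similar(hash1, hash2, threshold=0.5):
    """True when the similarity score reaches the chosen threshold."""
    return similarity(hash1, hash2) >= threshold
```

The 0.5 threshold is the one used in the example; a stricter value such as the 0.85 constant declared there would flag fewer pairs as similar.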
How do we solve the speed problem, so that pHash can be used on videos (or regular images, or audio) in a fast and more reliable way? By creating our own set of matrix and convolution functions that work on pixels already loaded in memory, instead of having to create several functions to load the image or using bloated external libraries to do it for us.
pHash as it stands, without our fixes, is simply not useful for video detection, because it is slow as hell despite its high accuracy. So the goal is to recreate pHash using our own set of matrix and convolution functions, instead of being forced to use bloated libraries that do a terrible job internally and result in an algorithm that is impractical for video manipulation, or even image manipulation in general. We need to recreate a pHash that doesn't load an image and doesn't use bloated libraries. We need one that simply takes the pixel data of an image previously loaded in memory and computes its hash.
So the matrix and convolution functions basically need to manipulate the pixel data of a given image directly (an image previously loaded with whatever method/API), and we only need to feed them the pixel pointer, height and width (and perhaps pitch/scanline, since it seems to be necessary sometimes for VDub). We are not using the matrix and convolution functions to load the image; we are using them to manipulate pixel data already loaded in memory. So the method chosen to load the image doesn't matter: what matters is that we have access to the pixels and manipulate them directly.
Since images can be of any size, the matrix manipulation and convolution functions need to work on any size (square or not), because images can have different widths and heights.
Quote
You told me it is for video editing and making plugins to create video effects right ?
yes :)
Quote
Am i right that you need to fetch each picture from a movie, let the effect doing its work and save it back in the movie ?
yes :) :)
But all of the functions responsible for loading the video, saving it back, etc. are already provided by the main app (VDub). Basically it is a plugin that takes the pixel data loaded by VDub and manipulates the pixels directly. In this case, it is a plugin I'm creating to identify the scenes of a video using an algorithm called pHash, which uses matrix functions and convolution to compute the hash of each image/frame.
And since creating the pHash algorithm requires building the matrix and convolution functions, those functions can also be used later in other plugins or apps that manipulate pixels in memory directly.
Quote: Let's do it like Jack the Ripper and go in parts :) :icon_mrgreen:
:lol:
The easiest and most flexible way to load the image, get access to its data and even choose the pixel format you want, no matter the pixel format of the original image, is this small piece of GDI+ code.
.const
;the formats you may need
PixelFormat1bppIndexed equ 30101h
PixelFormat4bppIndexed equ 30402h
PixelFormat8bppIndexed equ 30803h
PixelFormat16bppGreyScale equ 101004h
PixelFormat16bppRGB555 equ 21005h
PixelFormat16bppRGB565 equ 21006h
PixelFormat16bppARGB1555 equ 61007h
PixelFormat24bppRGB equ 21808h
PixelFormat32bppRGB equ 22009h
PixelFormat32bppARGB equ 26200Ah
PixelFormat32bppPARGB equ 0E200Bh
BitmapData struct
dwWidth dd ?
dwHeight dd ?
Stride dd ?
PixelFormat dd ?
Scan0 dd ? ; pointer to the raw bitmap data
Reserved dd ?
BitmapData ends
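.data
; GdiplusStartupInput: GdiplusVersion = 1, no debug callback, no suppress flags
GdiplusInput dd 1,0,0,0
; FilenameW must be a zero-terminated UTF-16 path supplied by the caller
.data?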
GDIplusBitmapData BitmapData <?>
pImage dd ?
GdiplusToken dd ?
.code
invoke GdiplusStartup,offset GdiplusToken,offset GdiplusInput,NULL
invoke GdipCreateBitmapFromFile,offset FilenameW,addr pImage
invoke GdipBitmapLockBits,pImage,NULL,1,PixelFormat32bppARGB,offset GDIplusBitmapData
; do your stuff here on the bitmap data
mov esi,GDIplusBitmapData.Scan0 ; pointer to the bitmap data
mov ecx,GDIplusBitmapData.dwHeight
mov edx,GDIplusBitmapData.dwWidth
add esi,GDIplusBitmapData.Stride ; jump to the next scan line etc.
; close everything
invoke GdipBitmapUnlockBits,pImage,offset GDIplusBitmapData
invoke GdipDisposeImage,pImage
invoke GdiplusShutdown,GdiplusToken
I think you still have a lot to do to get that PHash routine working the way you describe. :t
Hi Marinus
Thanks for the code. I'll give it a try and see how it works with VDub. A few questions about GDI+: since VDub already retrieves the pixel data, how can GDI+ be used on bitmap data in memory rather than on a filename? Can GdipCreateBitmapFromScan0 do that?
For now I'm asking out of curiosity, because I don't believe we need a routine that works with GDI+ right now, since VDub already hands us the pixel data and format to work with. That is a good thing, because it is one less step to do :) Maybe we can use GDI+ only for testing purposes, while I'm also testing the main matrix functions on VDub to see if everything works OK.
Quote: I think you still have a lot to do to get that PHash routine working the way you describe. :t
Yep :) :bgrin: :bgrin:
That's why I'm trying to do it in steps. The first ones are basically these:
a) Create some matrix manipulation functions that work on any size
b) Create the convolution function.
Once those 2 steps are done, I can start analysing the internal routines of PHash itself. I believe a few more steps will be needed after the convolution is done. Perhaps I'll only need to create 4 or 5 functions to retrieve the PHash after the convolution, but first I need to see if the matrix and convolution functions are OK and give the same results as the internal Cimg routines inside PHash itself.
One of the routines used by PHash I have already ported: the hamming_distance function. I haven't tested it for speed or optimized it yet, because the important thing at this stage of development is to make everything work first :)
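For reference, a Hamming distance between two hashes can be sketched in a few lines of C. This assumes each hash is a single 64-bit value; the name hamming_distance64 is hypothetical and this is not the ported routine mentioned above, only an illustration of the idea (count the bits that differ):

```c
#include <stdint.h>

/* Number of bit positions in which two 64-bit hashes differ. */
int hamming_distance64(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;   /* 1 bits mark the positions that differ */
    int count = 0;
    while (x) {
        x &= x - 1;       /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

A small distance means the two frames are probably similar; which threshold to use is a tuning decision and is not fixed by this sketch.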
So we can start testing with your older functions. I converted some of them, but I'm unsure whether they will work as expected on VDub with all layouts (switching height and width, for example).
The first one we can test is Matrix_FlipX (later we can work on Matrix_FlipY and Matrix_FlipXY). I remade it without the stride, using the examples that you, Jochen and Aw provided. It works, but I'm not sure it will work in all cases, because it is not using the stride/pitch. How can we add the stride/pitch to it?
Proc Matrix_FlipX:
Arguments @Input, @Output, @Width, @Height
Local @MaxXPos, @MaxYPos, @Remainder, @AdjustSmallSize, @NextScanLine
Uses esi, edi, ebx, ecx, edx
mov esi D@Input
mov edi D@Output
; How many xmm moves per row are required;
mov eax D@Width
mov edx eax
mov ebx eax
and edx 3 | mov D@Remainder edx
shr eax 2 | mov D@MaxXPos eax
mov eax D@Height | mov D@MaxYPos eax
mov D@AdjustSmallSize 0
mov ebx 4 ; case of less than 4 columns
If D@MaxXPos > 0
mov ebx 16 ; subtract enough to fill one xmm register
mov D@AdjustSmallSize 12
End_if
mov eax D@Width | shl eax 2 | mov D@NextScanLine eax | sub eax ebx
mov edi D@Output | add edi eax
L1:
mov ebx D@MaxXPos
; source points to the start of every row
mov eax edi
mov ecx esi
test ebx ebx | Jz L4>
L0:
movdqu xmm0 X$ecx
pshufd xmm0 xmm0 27
movdqu X$eax xmm0
sub eax 16
add ecx 16
dec ebx | jg L0<
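L4:
; handle the 1 to 3 leftover dwords of the row
mov ebx D@Remainder
test ebx ebx | jz L3>
mov edx D@AdjustSmallSize
; note: this loads 16 bytes even when fewer remain, so it can read past the end of the last row
movdqu XMM0 X$ecx | movd D$eax+edx XMM0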
dec ebx | jz L3> | PSRLDQ xmm0 4 | movd D$eax+edx-4 XMM0
dec ebx | jz L3> | PSRLDQ xmm0 4 | movd D$eax+edx-8 XMM0
L3:
add edi D@NextScanLine
add esi D@NextScanLine
dec D@MaxYPos | jg L1<
EndP
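One possible way to bring the pitch into the flip, sketched in C rather than assembler so the logic is easy to verify: the row start is computed from the pitch, while the reversal inside a row still only depends on the width. The name flip_x_pitch and the separate source/destination pitches are assumptions for illustration, with 32-bit pixels:

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror each row left to right. Pitches are in bytes and may be larger
   than width*4; padding bytes are neither read nor written. */
void flip_x_pitch(const uint8_t *src, ptrdiff_t src_pitch,
                  uint8_t *dst, ptrdiff_t dst_pitch,
                  int width, int height)
{
    for (int y = 0; y < height; y++) {
        const uint32_t *s = (const uint32_t *)(src + (ptrdiff_t)y * src_pitch);
        uint32_t *d = (uint32_t *)(dst + (ptrdiff_t)y * dst_pitch);
        for (int x = 0; x < width; x++)
            d[width - 1 - x] = s[x];   /* column x goes to column width-1-x */
    }
}
```

In terms of the routine above, this would correspond to advancing esi and edi by the pitch at L3 instead of by Width*4, while the inner loop stays as it is.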
I used the pitch to copy the buffer in another function I'm using with VDub. That one does use the pitch information, and it was written after Matrix_FlipX worked. Here it is:
Proc CopyImageBuffer:
Arguments @Input, @Output, @Width, @Height, @Pitch
Local @CurYPos
Uses eax, ebx, ecx, edx, edi
mov eax D@Height
xor ebx ebx
mov D@CurYPos eax
.Do
mov eax D@Output | add eax ebx
mov ecx D@Input | add ecx ebx
mov edx D@Width
Do
mov edi D$ecx
add ecx 4
mov D$eax edi
add eax 4
dec edx
Loop_Until edx = 0
add ebx D@Pitch
dec D@CurYPos
.Loop_Until D@CurYPos = 0
EndP
But I wonder whether the pitch really needs to be handled inside the main loop (or whether it is necessary at all, btw).
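A minimal C sketch of the same copy may help with that question. The pitch is only touched once per row, in the outer loop; the inner copy of one row never needs it. When both pitches equal width*4 the rows are contiguous and the whole image can be copied in one go. The name copy_image_pitch and the separate source/destination pitches are assumptions, not what CopyImageBuffer does today:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a 32-bit-per-pixel image row by row. Pitches are in bytes. */
void copy_image_pitch(const uint8_t *src, ptrdiff_t src_pitch,
                      uint8_t *dst, ptrdiff_t dst_pitch,
                      int width, int height)
{
    size_t row_bytes = (size_t)width * 4;

    /* No padding on either side: one big copy is enough. */
    if (src_pitch == (ptrdiff_t)row_bytes && dst_pitch == (ptrdiff_t)row_bytes) {
        memcpy(dst, src, row_bytes * (size_t)height);
        return;
    }
    for (int y = 0; y < height; y++) {
        memcpy(dst, src, row_bytes);  /* inner work: width only */
        src += src_pitch;             /* outer step: pitch only */
        dst += dst_pitch;
    }
}
```

So the pitch matters only when it differs from width*4; if it never does for a given source, the outer step collapses to the plain row size.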
Also, once the routines are OK (with or without the pitch), maybe we can see whether it would be faster to use the same buffer as input and output. That way we can make variations of the matrix functions where the input and output are the same, so we can test the speed. Ex: we can create a sort of Matrix_FlipXEx taking only 3 arguments: Input, Width, Height (plus the pitch, if needed), where the input is also used as the output. This avoids needing another function to copy the contents of the output to another buffer.
This way we would end up with 6 main functions to manipulate the matrix, instead of 3:
Matrix_FlipX -> input and output buffers are distinct
Matrix_FlipXEx -> input is also used as output
Matrix_FlipY -> input and output buffers are distinct
Matrix_FlipYEx -> input is also used as output
Matrix_FlipXY -> input and output buffers are distinct
Matrix_FlipXYEx -> input is also used as output
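As a rough idea of what an "Ex" (in-place) variant involves, here is a C sketch of an in-place horizontal flip: each row is reversed by swapping pixels from both ends towards the middle, so no second buffer is needed. The name flip_x_inplace is hypothetical, 32-bit pixels are assumed, and nothing here says whether it would be faster than the two-buffer version:

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror each row left to right inside the same buffer. Pitch is in bytes. */
void flip_x_inplace(uint8_t *pixels, ptrdiff_t pitch, int width, int height)
{
    for (int y = 0; y < height; y++) {
        uint32_t *row = (uint32_t *)(pixels + (ptrdiff_t)y * pitch);
        for (int lo = 0, hi = width - 1; lo < hi; lo++, hi--) {
            uint32_t tmp = row[lo];   /* swap the two ends */
            row[lo] = row[hi];
            row[hi] = tmp;
        }
    }
}
```

For an odd width the middle pixel stays where it is, which is why the loop stops when lo meets hi.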
I posted the code because you didn't want to use the bloated Cimg library. :biggrin:
I'm starting to grasp your intentions now.
1. you need 2 images. ( size may differ )
2. load them to system memory as raw bitmap data.
3. apply a matrix function on both images.
4. apply a convolution. ( this part is still fuzzy to me, what mask etc. )
5. create a PHash value for both images and compare the 2 values.
6. let it run as fast as possible.
Are those the steps to be done and are they in the correct order?
The best way is to come up with a good strategy, that is, to work backwards and design the best algorithms for an optimal situation.
This way you control how the data needs to be organized when loading the images.
For example:
- What input is needed to make the PHash routine happy.
- Is it faster to transpose a complete image in video memory and copy it back to system memory,
or calculate it in system memory?
- Is it faster to get rid of the difference between stride/pitch and the actual bitmap width?
- Is it faster to add zeros to the horizontal bitmap lines that are not multiples of 4 pixels?
- Try to fit as much of the inner-loop code as possible in 1 cache line ( at most 64 bytes ) and align that code for fast execution.
- Align the data for fast reading and writing.
- Thus create the best situation to perform the fastest code possible.
What I meant by working backwards is that you then don't need to add extra code to adjust things to make it work.
Your code is already prepared for the next step.
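The idea of adding zeros to rows that are not a multiple of 4 pixels can be sketched in C as follows. The helper names padded_width and copy_zero_padded are hypothetical, 32-bit pixels are assumed, and the source is assumed to be tightly packed (pitch = width*4):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round a width in pixels up to the next multiple of 4 (one 16-byte block). */
int padded_width(int width)
{
    return (width + 3) & ~3;
}

/* Copy a tightly packed image into a buffer whose rows are padded_width()
   pixels long, filling the extra pixels with zeros. */
void copy_zero_padded(const uint32_t *src, int width, int height, uint32_t *dst)
{
    int pw = padded_width(width);
    for (int y = 0; y < height; y++) {
        const uint32_t *s = src + (size_t)y * width;
        uint32_t *d = dst + (size_t)y * pw;
        memcpy(d, s, (size_t)width * sizeof(uint32_t));
        memset(d + width, 0, (size_t)(pw - width) * sizeof(uint32_t));
    }
}
```

With rows padded like this, a 13-pixel row becomes 16 pixels and a 27-pixel row becomes 28, so an SSE loop can work in whole 16-byte blocks without a remainder path. Whether the extra copy pays for itself would need to be timed.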