Fast Matrix Flip

aw27 · April 21, 2017, 05:57:21 PM

Quote from: guga on April 21, 2017, 06:34:54 AM
Ok. Here is the masm version.

Well done, the gain was mostly done to elimination of mul instructions which have a great impact particularly inside a loop.

guga · April 21, 2017, 07:10:37 PM

Thanks.

And also the div that was replaced by "and". Div instruction is always slow. Also on the updated versions i´m working, i replaced all movups changing them to movdqu instead, and got a gain of speed of around 12% (Measured on my I7) . And, if you remove all the push/pop operation at the beginning, the code will be way more faster (with the USES directive in Masm. In RoAsm we have a similar thing, but it is not a directive it is a user made macro also called "uses") . I just maintained it because i don´t want the function altering the registers after being used.

Don´t forget that, using a naked dec/jcc instruction is generally a bit faster then cmp xxx / jcc. Also, test is a bit faster then cmp.

Btw...I also just made a version that rotates a Matrix on 180º directly. (Later i´ll check the speed and see if it also have some advantage replacing the movups on it for movdqu. I´ll convert it to Masm, once i finish the tests to see if it is working as expected.

Code Select


; Rotate a Matrix at 180º

Proc Matrix_FlipXY:
    Arguments @Input, @Output, @Width, @Height
    Local @MaxXPos, @MaxYPos, @remainder, @NextScanLine, @AdjustSmallSize
    Uses esi, edi, ebx, ecx, edx

    mov esi D@Input
    mov edi D@Output

    ; How many xmm moves per row are required;
    mov eax D@Width
    mov edx eax
    mov ebx eax
    and edx 3 | mov D@Remainder edx
    shr eax 2 | mov D@MaxXPos eax

    ; make destination point to the end of every row
    mov eax D@Width | shl eax 2 | mov D@NextScanLine eax
    mov eax D@Height | mov D@MaxYPos eax | mul D@NextScanLine; | sub eax 4
    ;mov eax D@Height | mov D@MaxYPos eax | dec eax | mul D@NextScanLine
    mov edi D@Output | add edi eax

    mov D@AdjustSmallSize 0
    mov ebx 4 ; case of less than 4 columns
    If D@MaxXPos > 0
        mov ebx 16 ; subtract enough to fill one xmm register
        mov D@AdjustSmallSize 12
    End_if
    sub edi ebx

    ;mov eax D@Width | shl eax 2 | mov D@NextScanLine eax |  sub eax ebx
    ;mov edi D@Output | add edi eax

    L1:
        mov ebx D@MaxXPos
        mov eax edi
        mov ecx esi
        test ebx ebx | jz L4>

        L0:
            movups xmm0 X$ecx
            pshufd xmm0 xmm0 27
            movups X$eax xmm0
            sub eax 16
            add ecx 16
            dec ebx | jg L0<
L4:
        ..If D@remainder >= 1
            mov edx D@AdjustSmallSize
            mov ebx D$ecx | mov D$eax+edx ebx
            .If D@remainder >= 2
                mov ebx D$ecx+4 | mov D$eax+edx-4 ebx
                If D@remainder = 3
                    mov ebx D$ecx+8 | mov D$eax+edx-8 ebx
                End_if
            .End_if
        ..End_if

        sub edi D@NextScanLine
        add esi D@NextScanLine

        dec D@MaxYPos | jg L1<

EndP

Note: Do you have n idea how to make a matrix convolution ? For images, like this one: https://www.tutorialspoint.com/dip/concept_of_convolution.htm

I´m optimize all the matrix codes and later adapt it to image processing (all it seems to be needed is including Pitch at the end of scanline. It shouldn´t affect the speed that much, i hope)

Note2: I´ll later give a try replacing all the "If remainder" with a set of dec and using "movdqu XMM0 X$ecx | movd D$eax+edx XMM0" to set the remainder bytes. It should be faster even on small matrices, but, i didn´t tested yet.

guga · April 21, 2017, 09:06:09 PM

Ok, the reminder sequence of If´s can be replaced with code chain like this (used on the example above: Matrix_FlipXY):

Code Select


        mov ebx D@remainder
        test ebx ebx | jz L3>
            mov edx D@AdjustSmallSize
            movdqu XMM0 X$ecx | movd D$eax+edx XMM0
            dec ebx | jz L3> | PSRLDQ xmm0 4 | movd D$eax+edx-4 XMM0
            dec ebx | jz L3> | PSRLDQ xmm0 4 | movd D$eax+edx-8 XMM0
        L3:

aw27 · April 21, 2017, 09:20:07 PM

Quote from: guga on April 21, 2017, 07:10:37 PM
see if it also have some advantage replacing the movups on it for movdqu.

For very large matrixes you may have an advantage making use of various xmm registers, prefetching and using movntdq . Something like this (pasting some code I had here for a different purpose):
            .repeat
               prefetchnta BYTE PTR [esi+ecx+4096]
               movdqu xmm0, XMMWORD ptr [esi+ecx]
               movdqu xmm1, XMMWORD ptr [esi+ecx+16]
               movdqu xmm2, XMMWORD ptr [esi+ecx+32]
               movdqu xmm3, XMMWORD ptr [esi+ecx+48]
               movntdq XMMWORD ptr [edi+ecx], xmm0
               movntdq XMMWORD ptr [edi+ecx+16], xmm1
               movntdq XMMWORD ptr [edi+ecx+32], xmm2
               movntdq XMMWORD ptr [edi+ecx+48], xmm3
               sub ecx, 64
            .until ecx==0

guga · April 22, 2017, 06:39:10 AM

Yes, this will be a similar solution to what i´m doing with Marinus´s transpose matrix, but the implementation of that is a bit hard. Later i´ll try to make the tests on it. The problem is with the leftovers (remainders) of the function that will need to be enter on other loops. It can be overcome, although the resultant code is a bit messy. On this particular function, Jochen code is cleaner and the small difference of speed between JJ´s and marinus code (marinus is about 35% faster) may worth use JJ´s code for readability reasons. (I´ll later post the fixes and implementation i did on marinus to you better understand what i mean)

Note: There is no need to use prefetchnta. It seems to change performance according to the processor. If the code is doing a large number of loads on the same cache line and stores these could still negatively effect performance.
According to Agner´s Frog, Prefetch throughput on IvyBridge is only one per 43 cycles, so we need to be careful not to prefetch too much if we don't want prefetches to slow down the code on IvB. This is a performance bug specific to IvB. On other designs, too much prefetch will just take up uop throughput that could have been useful instructions (other than harm from prefetching useless addresses).
http://agner.org/optimize/

Even on older processors, prefetchXXX instructions have no guarantee that the data will be in the cache when it is needed. (Intel IA32 Software Developer Manual, Volume 2)

In some cases, the performance can be better results if you simply use a Align16 directive before the SSE2 main instruction (As i did on Jochen´s transpose matrix algo.)

Siekmanski · April 22, 2017, 07:22:02 AM

Guga, you're right. On my Ivybridge prefetching uses 43 cycles. Better to let the CPU handle this.
But there are still many PC's around that could benefit from sotware prefetching.

It must be the month of Matrix calculations.

https://www.codeproject.com/Articles/1182724/Blowing-the-Doors-Off-D-Math-Part-I-Matrix-Multipl

guga · April 22, 2017, 09:04:40 AM

QuoteIt must be the month of Matrix calculations.

:icon_mrgreen: :icon_mrgreen: :icon_mrgreen:

Actually i´m trying to understand the matrix functions because i´m porting PHash algo to Assembly in order to see if i can use it on my scene detector plugin for VirtualDub.

PHash compares the similarity between 2 images. It can be used to a wide range of applications, such as image searching, object recognition (including face recognition) and most probably: scene detection, movement detection. Not to mention, it can also be used for audio :) (You may like this :)

The main problem on PHash is that it uses a crappy CImg library that is insanely slow. CImg is a pile of crap C++ Classes that results on a bloated performance of the PHash algo.

Since Phash uses matrix manipulaton and Convolution of CImg, i´m trying to recreate not only the matrixes functions involved, but also the convolution function to be used later on the scene detection plugin.

The current version of my plugin can detect (Accurately) hard cut scenes on a rate near to 100% simply computing and comparing the Minimum standard deviation values of the difference of the images (Difference achieved from xor operation and not a simple sub). Processing a entire video of 1:30 hour (something around 190.000 frames of 720x480) can be done in something around 20 minutes on my I7. So, it is somewhat a acceptable amount of time, considering the total operations and all the frames involved and also the underneath functions used by vdub itself (Which, unfortunately, also uses crap C++ classes). The problem is with soft cut scenes (transitions or fades etc) that are not that accurate. So, i´m trying to use PHash instead the STD algos to gain accuracy in general (Hard cut and soft cut scenes). MY goal is try to make the scene detector plugin works nearly 100% of accuracy on all cases (soft and hard cut scenes) in less then 10 minutes per hour of video (So, if i can make it at a rate of 10.000 frames (720x480) per minute (or even less), it will be great to be used)

I´m currently trying to build a convolution function and see if the results are the same as in CImg so i can continue the tests i´m doing, but, so far, no success on making this damn convolution thing :(

Btw: You did the article ? Well done :)

Siekmanski · April 22, 2017, 09:30:33 AM

Hi guga,

No, it is not my article. Just saw it on codeproject and thought, yet an other matrix article, it has to be the month of matrix calculations.
Since it is populair now on this forum.

Have you ever thougth about using the videocard and direct3D9 to do the 2D image transformations and the flipping etc.
It would be much faster.

guga · April 22, 2017, 09:40:07 AM

QuoteHave you ever thougth about using the videocard and direct3D9 to do the 2D image transformations and the flipping etc.
It would be much faster.

Didn´t though of that, but, i´m not sure if vdub will handle D3d9, neither if it will be usefull. VDub functions points to the pixel data and not a copy of the full image (with the header etc). So, i´m not sure if it worth using D3d9 directly for pixel manipulation.

Do you have an example of matrix manipulation of D3d9 ? I mean, when we already have the pixel data, and not to fully process a image ? Another question, do d3d9 have a convolution function ?

Siekmanski · April 22, 2017, 09:55:00 AM

You can access the pixel data in D3D9 after the image manipulations and save it if you want.
I can make an example if you like, but i need a week or so ( short on free time )

Do you need an example that loads an image of 720x480 pixels and do the transposing, flipping X and flipping Y ?
And copy the result to a memory buffer or save the result as an image?

Or something else maybe... only if i know how to do it of course.

guga · April 22, 2017, 10:33:00 AM

Hi Marinus

Quote"Do you need an example that loads an image of 720x480 pixels and do the transposing, flipping X and flipping Y ?
And copy the result to a memory buffer or save the result as an image?"

Yes...that would be nice :) If D3d9 can be used to perform faster routines then we already did this maybe handfull using it on vdub.

I never combined vdub with d3d9 before, but, it maybe helpfull if the results were faster then we already made :)

Another thing is that a convolution also is needed. So, not only transposing, flipping (x and Y), but can you give a try on convolution too ?
The convolution seems to obey this technique. https://www.tutorialspoint.com/dip/concept_of_convolution.htm

Many thanks :)

Siekmanski · April 22, 2017, 11:38:28 AM

I have never done convolution on images, but we could give it a try.
First i will make a stripped down version of the transposing and flipping stuff then it will be easier for you to translate it to RosAsm.
Later we could try to get the convolution stuff working. Could be a nice project. :t

guga · April 22, 2017, 01:19:19 PM

many thanks, marinus :t

Stripping down the matrix functions will be handfull, specially because they can work with any size of matrices. I fond some code that can be used for convolution and will start studying it for us.

One thing, since the matrices will be used for image and video processing, we cannot forget to include the pitch (stride) on the functions. basically, for what i understood on vdub, the pitch is just a leftover on the width of a image. that was used as a alingment in memory. So, the image width seems to be,in fact Width+Pitch. Where pitch contains only null data. From the code we did so far, adding a pitch argument to the functions won´t change the speed, since we can simply need to include one instruction and multiply it to the height to find the 1st scanline from bottom to top. or, instead multiply, it may be added to width if we are doing it from top to bottom. Like:
probably something like this:

mov eax D@Width | add D@Picth eax | shl eax 2 | mov D@NextScanLine eax

On this way, the user can work with the function regardless it contains pitch value (video) or not (images or simple matrix calculation). So, if it is working with image/simple matix, he set Picth = 0. Otherwise he input the pitch value. Something that may result like this:

RosAsm syntax
call Matrix_FlipY Input, Output, D$Width, D$height, D$Pitch

masm syntax
call Matrix_FlipY offset Input, offset Output, Width, height, Pitch

Where:
Input = Pointer to a buffer containing he inputed data in a array of Dwords (which also works for floats).
Output = Pointer to a buffer to hold the resultant data (also a array of Dwords or Floats)
Width = Width of the array/image/video (Dword)
Height = Height of the array/image/video (Dword)
Pitch = Pitch of the video/image. 0 if no pitch. Any value if exists a pitch to be computed. (Dword)

Not sure yet if this is correct, but, it seems it may be used like that outside the loop (to we gain speed)
I believe the functionality of pitch in Vdub is the same as described in M$.
https://msdn.microsoft.com/en-us/library/windows/desktop/aa473780(v=vs.85).aspx

For example, to copy a image with vdub (using pitch), i can do this:

Code Select


Proc CopyImageBuffer:
    Arguments @Input, @Output, @Width, @Height, @Pitch
    Local @CurYPos
    Uses eax, ebx, ecx, edx, edi

    mov eax D@Height
    xor ebx ebx
    mov D@CurYPos eax
    .Do
        mov eax D@Output | add eax ebx
        mov ecx D@Input | add ecx ebx
        mov edx D@Width
        Do
            mov edi D$ecx
            add ecx 4
            mov D$eax edi
            add eax 4
            dec edx
        Loop_Until edx = 0
        add ebx D@Pitch
        dec D@CurYPos
    .Loop_Until D@CurYPos = 0

EndP

Siekmanski · April 22, 2017, 01:56:53 PM

Stride/pitch will be dealt with in the image loader. I'm familiar with it.

guga · April 22, 2017, 03:18:03 PM

Wonderfull :) So, we don´t actually need to handle pitch :)

I´m giving a test on the Matrix_FlipY on Vdub (without pitch) and it works like a charm :icon_mrgreen:). I´ll post a screenshot once i finish these testings on Vdub.

The MASM Forum

News:

Fast Matrix Flip

aw27

guga

guga

aw27

guga

Siekmanski

guga

Siekmanski

guga

Siekmanski

guga

Siekmanski

guga

Siekmanski

guga