Author Topic: Fast transposing matrix procedure for any size  (Read 476 times)

RuiLoureiro

  • Member
  • ****
  • Posts: 747
Fast transposing matrix procedure for any size
« on: June 26, 2018, 07:41:14 PM »
Hi all,
        Here we have the procedure MatrixTransposeSSE46 which seems to be as fast as possible,
       for now. It is inside the file sse46.inc
        In the folder TESTCODE we have all the asm code necessary to do new tests if we modify
        something in sse46.inc or in the macros used by it.
        So we have all work done to do nearly all things that we may want to do hereafter. The text
        FORUM_TEXT_ABOUT_SSE46.txt explains what is the problem and how we may solve it.

        Now, do yourself your fast version and be happy.
       
Documentation
                To understand what was done, please read the file

                              FORUM_TEXT_ABOUT_SSE46.txt
                     
Macro's files inside sse46.inc used by MatrixTransposeSSE46:

                        Mov3x3SSE_LCMacrosUPS_APS.mac
                        Mov03x3SSE_LCMacrosUPS_v25.mac
                        Mov4Times1x1_LCMacros.mac
               
Main results in i5 and i7 CPU:

                    results1x3_112_771SSE46_v25_ter26_i5_i7.txt
 
Good luck :t
Rui Loureiro

EDIT: my first post/topic about this issue using x86 instructions is here
                              http://masm32.com/board/index.php?topic=7126.0

Important note: Despite MatrixTransposeSSE46 transpose a matrix of any number of columns and rows it may benefit from data alignment by 16 in many cases, starting by the cases 1xN and Mx1 (N,M >=4). When we align by 16 we are not aligning the first row only. All info inside FORUM_TEXT_ABOUT_SSE46.txt. Note also that we must align by 16 both matrices. Otherwise
MatrixTransposeSSE46 doesnt work properly. If it works, you have luck.
« Last Edit: June 28, 2018, 06:56:08 AM by RuiLoureiro »

Siekmanski

  • Member
  • *****
  • Posts: 1548
Re: Fast transposing matrix procedure for any size
« Reply #1 on: June 26, 2018, 08:16:08 PM »
That's a lot of work you have done.  :t
Creative coders use backward thinking techniques as their strategy.

zedd151

  • Member
  • ****
  • Posts: 702
Re: Fast transposing matrix procedure for any size
« Reply #2 on: June 26, 2018, 10:07:46 PM »
 
Quote from: Rui...
Hi all,
       
Documentation ...
Macro's files inside sse46.inc used by MatrixTransposeSSE46...
Main results in i5 and i7 CPU...
Good luck  :t
Rui Loureiro

That's a lot of work you have done.  :t

The undisputed champion of exploring Matrix Transposition algos.  If anything he sure is persistent.  :icon_mrgreen:

AW

  • Member
  • *****
  • Posts: 1346
  • Let's Make ASM Great Again!
Re: Fast transposing matrix procedure for any size
« Reply #3 on: June 26, 2018, 11:44:27 PM »
A whole lot of senseless masturbation using my original idea posted here: http://masm32.com/board/index.php?topic=6140.msg65148#msg65148
 :P

guga

  • Member
  • ****
  • Posts: 843
  • Assembly is a state of art.
    • RosAsm
Re: Fast transposing matrix procedure for any size
« Reply #4 on: July 13, 2018, 01:27:25 AM »
Great Work, Ruy, AW and Siekmanski :)

Can this be used to transpose an image as well considering that the image have a BitDepht that may alter the algo ?  I mean, suppose we have a image with 100 x 50 pixels, with a bitdepth/pitch of 3 . How can this be done with the algo ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

  • Member
  • *****
  • Posts: 1548
Re: Fast transposing matrix procedure for any size
« Reply #5 on: July 13, 2018, 04:35:53 AM »
Yes it can, it shuffles 32 bit values in place.
A pixel is 32 bit, so it can transpose bitmap images as well.
Creative coders use backward thinking techniques as their strategy.

RuiLoureiro

  • Member
  • ****
  • Posts: 747
Re: Fast transposing matrix procedure for any size
« Reply #6 on: July 13, 2018, 08:16:19 AM »
Great Work, Ruy, AW and Siekmanski :)

Can this be used to transpose an image as well considering that the image have a BitDepht that may alter the algo ?  I mean, suppose we have a image with 100 x 50 pixels, with a bitdepth/pitch of 3 . How can this be done with the algo ?
Hi guga
           nice to see you here...  :t
           About images Siekmanski has the correct answer i have no doubts. Dont forget that you
may write your own faster transpose procedure for your CPU modifying something in that i posted.
Good luck

guga

  • Member
  • ****
  • Posts: 843
  • Assembly is a state of art.
    • RosAsm
Re: Fast transposing matrix procedure for any size
« Reply #7 on: July 13, 2018, 07:36:56 PM »
Thanks, Rui. I´ll give a try :)

But...for images, they are not all in a DWORD value. Some images formats have 3 pixels only (24 bits), like FreeImage library that by default uses RGBTRIPLE structure rather then RGBA or ARGB :( The same goes for virtualdub that recommends computing the stride rather then a direct computation of the width of a image. Like this:

Code: [Select]
Stride = 4 * (Bitmap.Width * BitsPerPixel + 31)/32)
Code: [Select]
int bitsPerPixel = ((int)format & 0xff00) >> 8;
        int bytesPerPixel = (bitsPerPixel + 7) / 8;
        int stride = 4 * ((width * bytesPerPixel + 3) / 4);

Also computed as:
Code: [Select]
int bitsPerPixel = ((int)format & 0xff00) >> 8;
        int bytesPerPixel = (bitsPerPixel + 7) / 8;
        int stride = 4 * ((width * bytesPerPixel + 3) / 4);

Of course, on freeimage, it can be overcome using FreeImage_ConvertTo32Bits...but take a look, for example the code they made for rotating a image:
https://github.com/lubosz/FreeImage/blob/master/Source/FreeImageToolkit/ClassicRotate.cpp on rotate90º function:

Code: [Select]
for(xs = 0; xs < dst_width; xs += RBLOCK) {    // for all image blocks of RBLOCK*RBLOCK pixels
for(ys = 0; ys < dst_height; ys += RBLOCK) {
for(y = ys; y < MIN(dst_height, ys + RBLOCK); y++) {    // do rotation
y2 = dst_height - y - 1;
// point to src pixel at (y2, xs)
BYTE *src_bits = bsrc + (xs * src_pitch) + (y2 * bytespp);
// point to dst pixel at (xs, y)
BYTE *dst_bits = bdest + (y * dst_pitch) + (xs * bytespp);
for (x = xs; x < MIN(dst_width, xs + RBLOCK); x++) {
// dst.SetPixel(x, y, src.GetPixel(y2, x));
for(int j = 0; j < bytespp; j++) {
dst_bits[j] = src_bits[j];
}
dst_bits += bytespp;
src_bits += src_pitch;
}
}
}

I like Freeimage but i keep wondering why they didn´t standardized everything to work only on a 4 byte boundary formats such as:  RGBA, ARGB, BGRA, RGBQUAD etc ? And if the user opened a image with an unusual format (RGBTRIPLE or even 1 byte channel or those horrible RGB565, RGB555, RGB444 etc etc etc) then  simple conversion functions should be used instead.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

  • Member
  • *****
  • Posts: 1548
Re: Fast transposing matrix procedure for any size
« Reply #8 on: July 13, 2018, 07:53:43 PM »
Why not take full control over your own needs and skip those horrible not flexible 3th party graphics libraries.
With Direct3D9 you can easily deal with all those "1 byte channel or those horrible RGB565, RGB555, RGB444 etc etc etc" formats.
You can transpose your bitmap images with almost no CPU usage ( changing only 4 coordinates ) and let the GPU do it for you using the fantastic very fast Direct3D9 interface.
Creative coders use backward thinking techniques as their strategy.

guga

  • Member
  • ****
  • Posts: 843
  • Assembly is a state of art.
    • RosAsm
Re: Fast transposing matrix procedure for any size
« Reply #9 on: July 14, 2018, 02:06:23 AM »
Hi Siekmanski, Maybe later i´ll restart studying the direct3d functions.  I´m currently trying to make the algorithms works 1st using freeimage because it is is the library i´m using to open images and manipulate their data. I don´t know how to get access to the pixels contents (Load and export ) using directX yet.

But...using freeimage is just for making a small app that can load/save these images so i can test the algorithms i´m trying to create. My idea is basically using a small app from where i can test whatever algorithms that can also be used on video manipulation. For video, it is easier for me to build them for virtualdub because i already studied how it works on a way i can be able to make some plugins for it. I gave a test years ago on the Sdk for sony vegas and make some tests but...had no more time to continue the tests.

Of course, if there is some app that uses direct3d to manipulate videos (load and export in whatever format or codecs) and at the same time allow us to make plugins for it then it maybe a better alternative, but i don´t know if there is such a thing yet :( For video editing i use Sony vegas and virtualdub on a regular daily basis. (Eventually i use premiere but, this one i recently started trying to work on it - Personally, i still prefer sony vegas and virtualdub for that...but...i kept it as an alternative)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

Siekmanski

  • Member
  • *****
  • Posts: 1548
Re: Fast transposing matrix procedure for any size
« Reply #10 on: July 14, 2018, 05:50:54 AM »
 :t Didn't know you where coding plugins.

With GDIplus you can load bitmap images ( bmp, gif, jpeg, png, tiff and ico ), access the pixel data, manipulate it and saving it again in one of those formats.

You can also use DirectShow with Direct3D9 to manipulate Video content on the fly, but it takes some time and learning curve to write such a program from scratch....
Creative coders use backward thinking techniques as their strategy.

daydreamer

  • Member
  • ***
  • Posts: 465
Re: Fast transposing matrix procedure for any size
« Reply #11 on: July 14, 2018, 04:37:19 PM »
With GDIplus you can load bitmap images ( bmp, gif, jpeg, png, tiff and ico ), access the pixel data, manipulate it and saving it again in one of those formats.

You can also use DirectShow with Direct3D9 to manipulate Video content on the fly, but it takes some time and learning curve to write such a program from scratch....
thats good to know,GDIplus can be used to load .image's,perform pixel manipulation and in GDI game or d3d9 texturefrommemory

I also Think d3d9 would be prefered for loads of matrix operations
Quote from Flashdance
Nick  :  When you give up your dream, you die.
*wears a flameproof asbestos suit*

AW

  • Member
  • *****
  • Posts: 1346
  • Let's Make ASM Great Again!
Re: Fast transposing matrix procedure for any size
« Reply #12 on: July 15, 2018, 07:34:58 PM »
For transposing images with pixel formats less than 32-bit, SIMD instructions can not be used because are not able to manipulate less than dwords.
But we can transpose without SIMD, and will still be much faster than using the slow getpixel and setpixel functions.

We need to be aware than Windows bitmaps are laid bottom-up in memory. This means that it will be necessary to make an horizontal flip to fix things after the transpose.

This is a sample that loads a 24-bit .bmp file from disk, transposes it and saves to disk afterwards.
The .bmp has width and height not multiples of 4, so I had to handle the stride stuff.



Siekmanski

  • Member
  • *****
  • Posts: 1548
Re: Fast transposing matrix procedure for any size
« Reply #13 on: July 15, 2018, 09:21:59 PM »
Transposing or rotation, that's the question.  :biggrin:

You can load all pixel formats to 32 bit pixel format in memory with GDIplus and use SIMD intructions on the the pixels.  8)
Creative coders use backward thinking techniques as their strategy.

zedd151

  • Member
  • ****
  • Posts: 702
Re: Fast transposing matrix procedure for any size
« Reply #14 on: July 15, 2018, 09:24:12 PM »
Transposing or rotation, that's the question.  :biggrin:

I'd think if it's only 90 degrees, it's transposition. More than 90 degrees means more 'movement', hence rotation.   :P
 
The .bmp example was a nice little exercise with visual results.   :icon14: