The MASM Forum

General => The Workshop => Topic started by: RuiLoureiro on June 26, 2018, 07:41:14 PM

Title: Fast transposing matrix procedure for any size
Post by: RuiLoureiro on June 26, 2018, 07:41:14 PM
Hi all,
        Here is the procedure MatrixTransposeSSE46, which seems to be as fast as possible
        for now. It is in the file sse46.inc.
        The folder TESTCODE contains all the asm code needed to run new tests if we modify
        something in sse46.inc or in the macros it uses.
        So all the groundwork is done for nearly anything we may want to do from here on. The text
        FORUM_TEXT_ABOUT_SSE46.txt explains what the problem is and how we may solve it.

        Now write your own fast version and be happy.
       
Documentation
                To understand what was done, please read the file

                              FORUM_TEXT_ABOUT_SSE46.txt
                     
Macro files inside sse46.inc used by MatrixTransposeSSE46:

                        Mov3x3SSE_LCMacrosUPS_APS.mac
                        Mov03x3SSE_LCMacrosUPS_v25.mac
                        Mov4Times1x1_LCMacros.mac
               
Main results on i5 and i7 CPUs:

                    results1x3_112_771SSE46_v25_ter26_i5_i7.txt

Good luck :t
Rui Loureiro

EDIT: my first post/topic about this issue, using x86 instructions, is here:
                              http://masm32.com/board/index.php?topic=7126.0

Important note: Although MatrixTransposeSSE46 transposes a matrix with any number of columns and rows, it may benefit from 16-byte data alignment in many cases, starting with the cases 1xN and Mx1 (N, M >= 4). When we align by 16 we are not aligning the first row only. All the info is inside FORUM_TEXT_ABOUT_SSE46.txt. Note also that we must align both matrices by 16. Otherwise
MatrixTransposeSSE46 doesn't work properly. If it works, you are lucky.
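
The 16-byte requirement above can be checked before a transpose routine is called. The following is only a sketch in C, assuming a C11 runtime that provides `aligned_alloc`; the helper names `is_aligned16` and `alloc_matrix16` are hypothetical and are not part of sse46.inc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 when p sits on a 16-byte boundary, as movaps requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: allocates a rows x cols matrix of 32-bit values on a
   16-byte boundary. aligned_alloc (C11) expects a size that is a multiple of
   the alignment, so the byte count is rounded up to the next multiple of 16. */
static float *alloc_matrix16(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    size_t bytes = rows * cols * sizeof(float);
    bytes = (bytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, bytes);
}
```

Both the source and the destination matrix would be obtained this way and released with `free`. Toolchains without `aligned_alloc` need their own aligned allocator instead.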
Title: Re: Fast transposing matrix procedure for any size
Post by: Siekmanski on June 26, 2018, 08:16:08 PM
That's a lot of work you have done.  :t
Title: Re: Fast transposing matrix procedure for any size
Post by: zedd151 on June 26, 2018, 10:07:46 PM
 
Quote from: Rui...
Hi all,
       
Documentation ...
Macro's files inside sse46.inc used by MatrixTransposeSSE46...
Main results in i5 and i7 CPU...
Good luck  :t
Rui Loureiro

Quote from: Siekmanski on June 26, 2018, 08:16:08 PM
That's a lot of work you have done.  :t

The undisputed champion of exploring Matrix Transposition algos.  If anything he sure is persistent.  :icon_mrgreen:
Title: Re: Fast transposing matrix procedure for any size
Post by: aw27 on June 26, 2018, 11:44:27 PM
A whole lot of senseless masturbation using my original idea posted here: http://masm32.com/board/index.php?topic=6140.msg65148#msg65148
:P
Title: Re: Fast transposing matrix procedure for any size
Post by: guga on July 13, 2018, 01:27:25 AM
Great Work, Ruy, AW and Siekmanski :)

Can this be used to transpose an image as well, considering that the image has a bit depth that may alter the algo? I mean, suppose we have an image of 100 x 50 pixels with 3 bytes per pixel. How can this be done with the algo?
Title: Re: Fast transposing matrix procedure for any size
Post by: Siekmanski on July 13, 2018, 04:35:53 AM
Yes it can, it shuffles 32 bit values in place.
A pixel is 32 bit, so it can transpose bitmap images as well.
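
As a plain reference for what transposing 32-bit pixels means, here is a scalar sketch in C. It is not the SSE procedure from this topic and it writes to a second buffer rather than working in place; the function name `transpose_u32` is made up for the example. It can serve to check the output of a faster routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference transpose: src is rows x cols, dst becomes cols x rows.
   Both are row-major arrays of 32-bit values (one pixel per value). */
static void transpose_u32(const uint32_t *src, uint32_t *dst,
                          size_t rows, size_t cols)
{
    assert(src != dst);   /* this simple version is not in-place */
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}
```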
Title: Re: Fast transposing matrix procedure for any size
Post by: RuiLoureiro on July 13, 2018, 08:16:19 AM
Quote from: guga on July 13, 2018, 01:27:25 AM
Great Work, Ruy, AW and Siekmanski :)

Can this be used to transpose an image as well, considering that the image has a bit depth that may alter the algo? I mean, suppose we have an image of 100 x 50 pixels with 3 bytes per pixel. How can this be done with the algo?
Hi guga
           nice to see you here...  :t
           About images, Siekmanski has the correct answer, I have no doubts. Don't forget that you
may write your own faster transpose procedure for your CPU by modifying something in what I posted.
Good luck
Title: Re: Fast transposing matrix procedure for any size
Post by: guga on July 13, 2018, 07:36:56 PM
Thanks, Rui. I'll give it a try :)

But... for images, the pixels are not always DWORD values. Some image formats have only 3 bytes per pixel (24 bits), like the FreeImage library, which by default uses the RGBTRIPLE structure rather than RGBA or ARGB :( The same goes for VirtualDub, which recommends computing the stride rather than relying directly on the width of an image. Like this:

Stride = 4 * ((Bitmap.Width * BitsPerPixel + 31) / 32)

Also computed as:
int bitsPerPixel = ((int)format & 0xff00) >> 8;
int bytesPerPixel = (bitsPerPixel + 7) / 8;
int stride = 4 * ((width * bytesPerPixel + 3) / 4);
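
The two stride formulas above can be tried out with a small C sketch. The function names `stride_from_bits` and `stride_from_bytes` are made up for the example; both round the row length up to a multiple of 4 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes per scanline, rounded up to a DWORD boundary, from bits per pixel. */
static size_t stride_from_bits(size_t width, size_t bits_per_pixel)
{
    assert(bits_per_pixel > 0);
    return 4 * ((width * bits_per_pixel + 31) / 32);
}

/* The same rounding, starting from whole bytes per pixel. */
static size_t stride_from_bytes(size_t width, size_t bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    return 4 * ((width * bytes_per_pixel + 3) / 4);
}
```

For a 24-bit image 50 pixels wide, a row holds 150 bytes of pixel data and both functions return 152, so each row carries 2 padding bytes.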


Of course, in FreeImage this can be overcome using FreeImage_ConvertTo32Bits... but take a look, for example, at the code they wrote for rotating an image:
https://github.com/lubosz/FreeImage/blob/master/Source/FreeImageToolkit/ClassicRotate.cpp in the rotate90º function:


for(xs = 0; xs < dst_width; xs += RBLOCK) {    // for all image blocks of RBLOCK*RBLOCK pixels
    for(ys = 0; ys < dst_height; ys += RBLOCK) {
        for(y = ys; y < MIN(dst_height, ys + RBLOCK); y++) {    // do rotation
            y2 = dst_height - y - 1;
            // point to src pixel at (y2, xs)
            BYTE *src_bits = bsrc + (xs * src_pitch) + (y2 * bytespp);
            // point to dst pixel at (xs, y)
            BYTE *dst_bits = bdest + (y * dst_pitch) + (xs * bytespp);
            for (x = xs; x < MIN(dst_width, xs + RBLOCK); x++) {
                // dst.SetPixel(x, y, src.GetPixel(y2, x));
                for(int j = 0; j < bytespp; j++) {
                    dst_bits[j] = src_bits[j];
                }
                dst_bits += bytespp;
                src_bits += src_pitch;
            }
        }
    }
}


I like FreeImage, but I keep wondering why they didn't standardize everything to work only on 4-byte formats such as RGBA, ARGB, BGRA, RGBQUAD, etc. If the user opened an image with an unusual format (RGBTRIPLE, or even a 1-byte channel, or those horrible RGB565, RGB555, RGB444, etc.), then simple conversion functions could be used instead.
Title: Re: Fast transposing matrix procedure for any size
Post by: Siekmanski on July 13, 2018, 07:53:43 PM
Why not take full control over your own needs and skip those horrible, inflexible third-party graphics libraries?
With Direct3D9 you can easily deal with all those "1 byte channel or those horrible RGB565, RGB555, RGB444 etc etc etc" formats.
You can transpose your bitmap images with almost no CPU usage (changing only 4 coordinates) and let the GPU do it for you using the fantastic, very fast Direct3D9 interface.
Title: Re: Fast transposing matrix procedure for any size
Post by: guga on July 14, 2018, 02:06:23 AM
Hi Siekmanski, maybe later I'll go back to studying the Direct3D functions. I'm currently trying to make the algorithms work first using FreeImage, because it is the library I'm using to open images and manipulate their data. I don't know yet how to get access to the pixel contents (load and export) using DirectX.

But... using FreeImage is just for making a small app that can load/save these images so I can test the algorithms I'm trying to create. My idea is basically to use a small app in which I can test whatever algorithms could also be used for video manipulation. For video, it is easier for me to build them for VirtualDub, because I have already studied how it works well enough to make some plugins for it. Years ago I tried the SDK for Sony Vegas and made some tests, but... I had no more time to continue them.

Of course, if there is some app that uses Direct3D to manipulate videos (load and export in whatever format or codec) and at the same time allows us to make plugins for it, then it may be a better alternative, but I don't know if there is such a thing yet :( For video editing I use Sony Vegas and VirtualDub on a daily basis. (Occasionally I use Premiere, but I only recently started working with it. Personally, I still prefer Sony Vegas and VirtualDub for that, but I keep it as an alternative.)
Title: Re: Fast transposing matrix procedure for any size
Post by: Siekmanski on July 14, 2018, 05:50:54 AM
 :t Didn't know you were coding plugins.

With GDIplus you can load bitmap images (bmp, gif, jpeg, png, tiff and ico), access the pixel data, manipulate it and save it again in one of those formats.

You can also use DirectShow with Direct3D9 to manipulate Video content on the fly, but it takes some time and learning curve to write such a program from scratch....
Title: Re: Fast transposing matrix procedure for any size
Post by: daydreamer on July 14, 2018, 04:37:19 PM
Quote from: Siekmanski on July 14, 2018, 05:50:54 AM
With GDIplus you can load bitmap images (bmp, gif, jpeg, png, tiff and ico), access the pixel data, manipulate it and save it again in one of those formats.

You can also use DirectShow with Direct3D9 to manipulate Video content on the fly, but it takes some time and learning curve to write such a program from scratch....
That's good to know: GDIplus can be used to load images and perform pixel manipulation, and the result can then be used in a GDI game or in D3D9 as a texture from memory.

I also think D3D9 would be preferred for loads of matrix operations.
Title: Re: Fast transposing matrix procedure for any size
Post by: aw27 on July 15, 2018, 07:34:58 PM
For transposing images with pixel formats of less than 32 bits, the SIMD instructions used so far cannot be applied, because they do not manipulate anything smaller than dwords.
But we can transpose without SIMD, and it will still be much faster than using the slow GetPixel and SetPixel functions.

We need to be aware that Windows bitmaps are laid out bottom-up in memory. This means that a horizontal flip is needed to fix things after the transpose.

This is a sample that loads a 24-bit .bmp file from disk, transposes it and saves it to disk afterwards.
The .bmp has a width and height that are not multiples of 4, so I had to handle the stride stuff.
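
The attached sample is not reproduced in this text. As a rough illustration of the idea only, here is a C sketch that transposes 24-bit pixels between two buffers whose rows are padded to their own strides. The function name `transpose_rgb24` and its parameters are assumptions made for this example, and the flip mentioned above is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Transposes a 24-bit image: pixel (x, y) of src goes to pixel (y, x) of dst.
   src is width x height with src_stride bytes per row;
   dst is height x width with dst_stride bytes per row.
   Padding bytes at the end of each row are left untouched. */
static void transpose_rgb24(const uint8_t *src, size_t src_stride,
                            uint8_t *dst, size_t dst_stride,
                            size_t width, size_t height)
{
    assert(src_stride >= width * 3 && dst_stride >= height * 3);
    for (size_t y = 0; y < height; y++) {
        const uint8_t *s = src + y * src_stride;
        for (size_t x = 0; x < width; x++) {
            uint8_t *d = dst + x * dst_stride + y * 3;
            d[0] = s[0];   /* copy the three colour bytes of one pixel */
            d[1] = s[1];
            d[2] = s[2];
            s += 3;
        }
    }
}
```

Note that the two strides differ in general: a 3 x 2 image has 9 data bytes per row (stride 12), while its 2 x 3 transpose has 6 data bytes per row (stride 8).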


Title: Re: Fast transposing matrix procedure for any size
Post by: Siekmanski on July 15, 2018, 09:21:59 PM
Transposing or rotation, that's the question.  :biggrin:

You can load all pixel formats as 32-bit pixels in memory with GDIplus and use SIMD instructions on the pixels.  8)
Title: Re: Fast transposing matrix procedure for any size
Post by: zedd151 on July 15, 2018, 09:24:12 PM
Quote from: Siekmanski on July 15, 2018, 09:21:59 PM
Transposing or rotation, that's the question.  :biggrin:

I'd think if it's only 90 degrees, it's transposition. More than 90 degrees means more 'movement', hence rotation.   :P

The .bmp example was a nice little exercise with visual results.   :icon14:
Title: Re: Fast transposing matrix procedure for any size
Post by: daydreamer on July 16, 2018, 02:15:40 AM
Quote from: zedd151 on July 15, 2018, 09:24:12 PM
Quote from: Siekmanski on July 15, 2018, 09:21:59 PM
Transposing or rotation, that's the question.  :biggrin:

I'd think if it's only 90 degrees, it's transposition. More than 90 degrees means more 'movement', hence rotation.   :P

The .bmp example was a nice little exercise with visual results.   :icon14:
Not sure about that. I have seen very fast image-rotating demo code: once it is initialized with fsincos for the angle, it just reuses those values.
The rotation math is roughly newx=oldx*cos-oldy*sin, newy=oldx*sin+oldy*cos (probably not the exact formula, but you get the idea of how much slower that is per pixel); it is replaced with the much faster x=x+constant1, y=y+constant2 while looping through all pixels, probably in fixed-point format.
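
The "add a constant per pixel" idea can be sketched in C as follows. This is an illustration under assumptions, not the demo code referred to above: it uses doubles instead of fixed point, it takes the cosine and sine already computed by the caller, and the name `rotate_row_coords` is made up. For one destination row it produces the source coordinates of an inverse rotation about the origin.

```c
#include <assert.h>
#include <stddef.h>

/* Fills sx[] and sy[] with the source coordinates for destination row y.
   c and s are cos(angle) and sin(angle), computed once by the caller.
   Direct formula: sx = x*c + y*s, sy = y*c - x*s.
   Incrementally, every further pixel costs only two additions. */
static void rotate_row_coords(double c, double s, double y, size_t width,
                              double *sx, double *sy)
{
    assert(sx != NULL && sy != NULL);
    double u = y * s;      /* source x for destination x = 0 */
    double v = y * c;      /* source y for destination x = 0 */
    for (size_t x = 0; x < width; x++) {
        sx[x] = u;
        sy[x] = v;
        u += c;            /* constant step per pixel */
        v -= s;
    }
}
```

With fixed-point integers the two additions become integer adds, which is presumably what such demo code does.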
Title: Re: Fast transposing matrix procedure for any size
Post by: FORTRANS on July 16, 2018, 03:01:51 AM
Hi,

Quote from: zedd151 on July 15, 2018, 09:24:12 PM
Quote from: Siekmanski on July 15, 2018, 09:21:59 PM
Transposing or rotation, that's the question.  :biggrin:

I'd think if it's only 90 degrees, it's transposition. More than 90 degrees means more 'movement', hence rotation.

   Transposing an array is a reflection about a diagonal axis.
Rotations and reflections, in the simplest cases, are not the
same thing.  Rotation preserves handedness.  Reflection
inverts it.

Cheers,

Steve N.
Title: Re: Fast transposing matrix procedure for any size
Post by: zedd151 on July 16, 2018, 03:23:54 AM
@daydreamer
cc: FORTRANS

Math geeks.   :P
Title: Re: Fast transposing matrix procedure for any size
Post by: Siekmanski on July 16, 2018, 04:10:27 AM
Matrix transposition is changing the rows to columns and vice versa.
A diagonal is a straight line that joins two opposite corners of a four-sided flat shape.
If you mirror it along the diagonal of a square it is ok.
But a matrix transposition can also be done on a rectangle, and then it can't be mirrored along the diagonal.

This is why I made the comment "Transposing or rotation, that's the question."
AW's example is not about matrix transposition but, X-Y axis flipping or -90 degree rotation, else the text would be mirrored in the resulting bitmap image.
Title: Re: Fast transposing matrix procedure for any size
Post by: aw27 on July 16, 2018, 04:21:55 AM
Thank you everybody for the comments. :eusa_clap:

I consider this a transposition of a matrix of triplets, as opposed to the matrices of dwords we have been dealing with so far.
The need for the flip has to do with the bitmap being bottom-up. Had I started transposing from the bottom, there would be no need for the flip.
This technique can easily be extended to any other bit depth: 16, 8, 4, 1.

Well, I know that it is also called -90 degrees image rotation, but I was looking at this from another perspective.

Title: Re: Fast transposing matrix procedure for any size
Post by: Siekmanski on July 16, 2018, 05:25:56 AM
My response was only to make clear that matrix transposition is about switching rows and columns.
With a matrix transposition, your result should look like this,

(http://members.home.nl/siekmanski/MatrixImageAW.png)
Title: Re: Fast transposing matrix procedure for any size
Post by: aw27 on July 16, 2018, 06:23:51 AM

If I understand well, your suggestion would involve a 4-step process:
1- Convert to 32-bit pixel format using slow GDI+
2- Transpose
3- Flip
4- Convert back to 24-bit pixel format using slow GDI+

I am doing it in 2 stages:
1- Transpose
2- Flip

Sure, I should have called it precisely Transpose AND Flip, however I explained clearly what was being done. So, I don't know what we are disagreeing about.  :biggrin:
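
The relation between "Transpose AND Flip" and a 90 degree rotation can be checked with a small C sketch on 32-bit values. The function names are made up for the example and the rows are assumed to be stored top-down; with bottom-up storage, as in Windows bitmaps, the visible direction of the rotation is reversed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step 1: transpose. src is rows x cols, dst becomes cols x rows. */
static void transpose(const uint32_t *src, uint32_t *dst,
                      size_t rows, size_t cols)
{
    assert(src != dst);
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}

/* Step 2: horizontal flip, reversing every row of a rows x cols matrix. */
static void flip_horizontal(uint32_t *m, size_t rows, size_t cols)
{
    assert(cols > 0);
    for (size_t r = 0; r < rows; r++)
        for (size_t a = 0, b = cols - 1; a < b; a++, b--) {
            uint32_t t = m[r * cols + a];
            m[r * cols + a] = m[r * cols + b];
            m[r * cols + b] = t;
        }
}
```

Applied to the 2 x 3 matrix [1 2 3 / 4 5 6], the transpose alone gives [1 4 / 2 5 / 3 6], which is mirrored along the diagonal. The flip then turns it into [4 1 / 5 2 / 6 3], which is the original turned by 90 degrees.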

Title: Re: Fast transposing matrix procedure for any size
Post by: Siekmanski on July 16, 2018, 07:05:00 AM
 :biggrin:
QuoteSo, I don't know what we are disagreeing about.  :biggrin:

I don't know either.  :biggrin:

As always, there are many ways to get to the result you want.  :t

Another way,

1- load bmp
2- change 8 GPU 32 bit values ( or 4 x 64 bit values )
3- save bmp

And get the same result without flipping and transposing.
Title: Re: Fast transposing matrix procedure for any size
Post by: RuiLoureiro on July 16, 2018, 07:33:00 AM
Hi all
        It seems that aw wants to do rotation by 90º ... but the image was called transpImg24.zip. That's all.
Title: Re: Fast transposing matrix procedure for any size
Post by: aw27 on July 16, 2018, 03:01:02 PM
Quote from: Siekmanski on July 16, 2018, 07:05:00 AM
1- load bmp
2- change 8 GPU 32 bit values ( or 4 x 64 bit values )
3- save bmp
I don't think we should set the whole DirectX machinery in motion simply to rotate a bitmap, unless we are playing a game. It is overkill.

@Ruy,
You are the expert in transposing my ideas to your head and produce a lot of masturbation with them as if they were your own.  :badgrin:
Title: Re: Fast transposing matrix procedure for any size
Post by: RuiLoureiro on July 16, 2018, 06:34:45 PM
aw,
"Scientific" arguments

1) It seems that matrix transposition is your idea, in the same way that any mathematical
   exercise/issue is also your property, following your arguments;
2) Following your arguments, no one can write anything about mathematics, because
   the rules are always the same and there are thousands and thousands of versions of the same
   thing already written by someone else everywhere, so the ideas belong to someone else.
   Students are also prohibited from using what the teacher teaches...

What is your scientific background? (coffee talk? slang language?)
Who is interested in your poor (incorrect) "ideas"? I am not.
No one can choose his family, but we may choose our friends. I have no
doubts about you either. And I have no time to lose with you. I do not have
your education. I am not at your level. Choose another target.
Title: Re: Fast transposing matrix procedure for any size
Post by: jj2007 on July 16, 2018, 06:43:07 PM
@Rui: (https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Forbidden-151987.svg/220px-Forbidden-151987.svg.png)
Title: Re: Fast transposing matrix procedure for any size
Post by: aw27 on July 16, 2018, 10:34:12 PM
@Ruy,
You were the one who came with the provocations.  :P
You have not yet learned that when we use works from others we must quote the source. You did not, and started doing silly comparisons against the source of your knowledge, which I never cared to optimize. Perhaps you want to make a bet on whether I can produce faster code than your senseless masturbations around my own code?
And people are evaluated by what they do, not by what school titles they have. The value of most doctors is close to zero.
I presumably have more university degrees than you, but never mentioned it here.
The fact that you and some other people are not interested in advanced ideas simply means that you are paralyzed in time and not able to cope with new things. Probably everything stopped for you when you purchased the Pentium 4 in the last century.

@JJ
I prefer to show some pictures of your flagship product. After more than 10 years, the main menu continues to disappear and when not the submenu covers part of it. Have you ever heard about user experience? Can't you do something with better quality?
Title: Re: Fast transposing matrix procedure for any size
Post by: jj2007 on July 16, 2018, 10:46:14 PM
Quote from: AW on July 16, 2018, 10:34:12 PM
@Ruy,
You were the one who came with the provocations.  :P
Strange that you see provocations from a member as peaceful as Rui. Btw misspelling his name repeatedly is a provocation.

Quote from: AW on July 16, 2018, 10:34:12 PM@JJ
I prefer to show some pictures of your flagship product. ... Can't you do something with better quality?
It works fine on the machines I own (XP, W7 and Win10). I have no control over your machine - your problem.

You have (hopefully) control over what you write:
Quote from: AW on July 16, 2018, 03:01:02 PMYou are the expert in transposing my ideas to your head and produce a lot of masturbation with them as if they were your own.
You are emotionally challenged, José, but sometimes a good therapy can help.
Title: Re: Fast transposing matrix procedure for any size
Post by: aw27 on July 17, 2018, 01:28:28 AM
Quote
It works fine on the machines I own (XP, W7 and Win10). I have no control over your machine - your problem.
It is not a question of A's computer or B's computer. Main menus always work fine and I have never seen a behaviour like this anywhere. Stop blaming and  finding excuses and learn how to make a menu (if you want of course).  :badgrin:

BTW, you appear to like trying to establish alliances with all kinds of strange people or scammers. What happened to Ascend or Lone Wolf, the guy who had developed many DirectX games and was showing the M-16 rifle picture all the time?  :lol:
Title: Re: Fast transposing matrix procedure for any size
Post by: RuiLoureiro on July 20, 2018, 09:22:27 PM
Hi all,
         Here are the basic procedures (.asm) used to write and test
         some basic code needed to transpose any matrix

Good luck

Particular note: my thanks to Siekmanski  :t

Example
Quote
--------------------------------------------
HOW TO WRITE THE CODE TO TRANSPOSE 4x4
--------------------------------------------
u  is unpack   -> ul is unpack low and uh is unpack high
l  is low
h  is high

lh is move  low to high
hl is move high to low

Start with «what is the input» and «what we want to get»

                the input                                          we want to get this
          ----------------------                                 -----------------------
line 0 ->  0  1  2  3 = x0 = x4                                       0  4  8  C
line 1 ->  4  5  6  7 = x1   <- after unpack we may use x1            1  5  9  D
line 2 ->  8  9  A  B = x2 = x5                                       2  6  A  E
line 3 ->  C  D  E  F = x3   <- after unpack we may use x3            3  7  B  F


note: As we need to get [0  4  8  C] we need to unpack x0 with x1 and x2 with x3.
      Because we destroy x0 and x2 we need to get a copy x4, x5           
----------------------------------------------------------------------------------
      x4=x0         <--- copy x0 to x4
      x5=x2

ul    x0,x1  = 0  4  1  5  = x6 ( x1 )
ul    x2,x3  = 8  C  9  D

lh    x0,x2  = 0  4  8  C
hl    x2,x6  = 1  5  9  D
----above (low) and below (high) are similar------
uh    x4,x1  = 2  6  3  7  = x7 ( x3 )
uh    x5,x3  = A  E  B  F

lh    x4,x5  = 2  6  A  E
hl    x5,x7  = 3  7  B  F
+++++++++++++++++++++++++++++++++++++++++++++++
      x4=x0
      x5=x2

ul    x0,x1  = 0  4  1  5  = x6 ( x1 )
ul    x2,x3  = 8  C  9  D
uh    x4,x1  = 2  6  3  7  = x7 ( x3 )
uh    x5,x3  = A  E  B  F

      x6=x0  or x1=x0
      x7=x4  or x3=x4
     
lh    x0,x2  = 0  4  8  C
hl    x2,x6  = 1  5  9  D   <<- replace x6 by x1
lh    x4,x5  = 2  6  A  E
hl    x5,x7  = 3  7  B  F   <<- replace x7 by x3
     
+++++++++++++++++++++++++++++++++++++++++++++++
      x4=x0
      x5=x2

ul    x0,x1  = 0  4  1  5  = x1
ul    x2,x3  = 8  C  9  D
uh    x4,x1  = 2  6  3  7  = x3
uh    x5,x3  = A  E  B  F

      x1=x0
      x3=x4
     
lh    x0,x2  = 0  4  8  C   <-- first output
hl    x2,x1  = 1  5  9  D
lh    x4,x5  = 2  6  A  E
hl    x5,x3  = 3  7  B  F   <-- fourth output

+++++++++++++++++++++++++++++++++++++++++++++++
                                   instructions
+++++++++++++++++++++++++++++++++++++++++++++++
                ; write the input xmm0,xmm1,xmm2,xmm3
                ;
                movaps      xmm4,xmm0           ; [0  1  2  3]
                movaps      xmm5,xmm2           ; [8  9  A  B]

                unpcklps    xmm0, xmm1          ; [0  4  1  5]
                unpcklps    xmm2, xmm3          ; [8  C  9  D]
                unpckhps    xmm4, xmm1          ; [2  6  3  7]
                unpckhps    xmm5, xmm3          ; [A  E  B  F]

                movaps      xmm1,xmm0           ; [0  4  1  5]
                movaps      xmm3,xmm4           ; [2  6  3  7]

                movlhps     xmm0,xmm2           ; [0  4  8  C] first  line
                movhlps     xmm2,xmm1           ; [1  5  9  D] second line
                movlhps     xmm4,xmm5           ; [2  6  A  E] third  line
                movhlps     xmm5,xmm3           ; [3  7  B  F] fourth line
                ;
                ; write the output xmm0,xmm2,xmm4,xmm5
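
For readers who prefer C, the same 4x4 sequence can be sketched with SSE intrinsics. This is an assumed translation written for illustration, not code taken from the attached files, and the function name `transpose4x4_sse` is made up. The intrinsics return their result in a new value, so the two register copies made with movaps in the asm are not needed here.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Transposes a 4x4 matrix of floats. in and out need not be 16-byte aligned,
   because unaligned loads and stores (movups) are used. */
static void transpose4x4_sse(const float *in, float *out)
{
    assert(in != NULL && out != NULL);
    __m128 x0 = _mm_loadu_ps(in + 0);    /* [0 1 2 3] */
    __m128 x1 = _mm_loadu_ps(in + 4);    /* [4 5 6 7] */
    __m128 x2 = _mm_loadu_ps(in + 8);    /* [8 9 A B] */
    __m128 x3 = _mm_loadu_ps(in + 12);   /* [C D E F] */

    __m128 lo01 = _mm_unpacklo_ps(x0, x1);   /* unpcklps: [0 4 1 5] */
    __m128 lo23 = _mm_unpacklo_ps(x2, x3);   /* unpcklps: [8 C 9 D] */
    __m128 hi01 = _mm_unpackhi_ps(x0, x1);   /* unpckhps: [2 6 3 7] */
    __m128 hi23 = _mm_unpackhi_ps(x2, x3);   /* unpckhps: [A E B F] */

    _mm_storeu_ps(out + 0,  _mm_movelh_ps(lo01, lo23));  /* movlhps: [0 4 8 C] */
    _mm_storeu_ps(out + 4,  _mm_movehl_ps(lo23, lo01));  /* movhlps: [1 5 9 D] */
    _mm_storeu_ps(out + 8,  _mm_movelh_ps(hi01, hi23));  /* movlhps: [2 6 A E] */
    _mm_storeu_ps(out + 12, _mm_movehl_ps(hi23, hi01));  /* movhlps: [3 7 B F] */
}
```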

For input, we may use the code below (do similar things for output).
If there are no additional columns we may use movaps for input;
if there are no additional lines we may use movaps for output.

There are additional columns (or lines) when the count is a multiple of 4 plus 1, 2 or 3.
Quote
            mov     edx, LinLengthX

For 1 input:
            movups  xmm0, [esi]

For 2 inputs:
            movups  xmm0, [esi]
            movups  xmm1, [esi+edx]
           
For 3 inputs:
            movups  xmm0, [esi]
            movups  xmm1, [esi+edx]
            movups  xmm2, [esi+2*edx]

For 4 inputs:
            movups  xmm0, [esi]
            add     esi, edx            ; step to line 1 (there is no [esi+3*edx] addressing mode)
            movups  xmm1, [esi]
            movups  xmm2, [esi+edx]
            movups  xmm3, [esi+2*edx]
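
The four load cases can be folded into one C sketch with intrinsics. It assumes that the value in edx (LinLengthX) is the distance between two lines in bytes; the name `load_rows` is made up, and zeroing the unused registers is a choice of this sketch that does not appear in the asm above.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Loads `count` (1 to 4) consecutive lines of four floats each, starting at
   `base`; consecutive lines are `row_bytes` apart. Unused entries are zeroed. */
static void load_rows(const float *base, size_t row_bytes, int count,
                      __m128 rows[4])
{
    assert(count >= 1 && count <= 4);
    const char *p = (const char *)base;
    for (int i = 0; i < 4; i++) {
        if (i < count)   /* movups xmmI, [esi + i*edx] */
            rows[i] = _mm_loadu_ps((const float *)(p + (size_t)i * row_bytes));
        else
            rows[i] = _mm_setzero_ps();
    }
}
```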
Title: Re: Fast transposing matrix procedure for any size
Post by: Siekmanski on July 20, 2018, 09:32:00 PM
 :t
Title: Re: Fast transposing matrix procedure for any size
Post by: zedd151 on July 21, 2018, 01:43:01 AM
Hi Rui!


Just the kind of building block a beginner needs. Never mind the naysayers.


Also, I'm glad it's 32 bit code, as I have given up on 64 bit coding for the time being.
Title: Re: Fast transposing matrix procedure for any size
Post by: guga on July 21, 2018, 03:08:29 AM
Excellent work, Rui, as usual  :t :t :t :t
Title: Re: Fast transposing matrix procedure for any size
Post by: HSE on July 21, 2018, 03:32:12 AM
Fantastic Rui  :t
Title: Re: Fast transposing matrix procedure for any size
Post by: RuiLoureiro on July 24, 2018, 08:33:12 PM
Hi all
       Here we have SSE49_FASCODE_v8, which contains MatrixTransposeSSE49 inside the file sse49.
       MatrixTransposeSSE49 is a procedure that uses 4x8, 4x16, 4x32, 4x64, 8x16, 8x32, 16x16,
       16x32, 32x32, 16x64, 32x64 and 64x64 transpose code to build one transpose procedure for
       each case. So we may build the best transpose procedure by inserting or removing one of
       those block sizes for each matrix case. For example, to transpose a 512x512 matrix we may
       use 4x32 in each loop instead of 4x4. So, for each block of 4 lines we need only 16
       loops (512/32) instead of 128 loops (512/4) with 4x4.
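
The loop counts quoted above can be reproduced with a scalar C sketch that walks the matrix in blocks of 4 lines by `tile_w` columns. It only illustrates the blocking idea and is not the code of MatrixTransposeSSE49; the name `transpose_tiled` and its return value are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Transposes a rows x cols matrix in tiles of 4 x tile_w elements and
   returns how many tiles were processed for each band of 4 lines.
   rows must be a multiple of 4 and cols a multiple of tile_w. */
static size_t transpose_tiled(const float *src, float *dst,
                              size_t rows, size_t cols, size_t tile_w)
{
    assert(tile_w > 0 && rows % 4 == 0 && cols % tile_w == 0);
    size_t tiles_per_band = 0;
    for (size_t r0 = 0; r0 < rows; r0 += 4) {
        tiles_per_band = 0;
        for (size_t c0 = 0; c0 < cols; c0 += tile_w) {
            /* one "loop" in the sense of the post: a 4 x tile_w block */
            for (size_t r = r0; r < r0 + 4; r++)
                for (size_t c = c0; c < c0 + tile_w; c++)
                    dst[c * rows + r] = src[r * cols + c];
            tiles_per_band++;
        }
    }
    return tiles_per_band;
}
```

For a 512x512 matrix the function returns 16 with `tile_w` = 32 and 128 with `tile_w` = 4, matching the figures in the post.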

Good luck
Rui Loureiro