Solutions to save 1/2 CPU registers inside a procedure

Started by RuiLoureiro, June 04, 2018, 01:58:26 AM

RuiLoureiro

Hi all,
        I wrote 8 versions of a small procedure to transpose NxN matrices:

                ...SSE38XA, ...SSE38XB, ...SSE38XC, ..., ...SSE38XI
                ...SSE38YA, ...SSE38YB, ...SSE38YC, ..., ...SSE38YI

        The differences between them are identified in the test results.
        I want to know which is the best way to save the registers:
        push esi, edi on entry and pop edi, esi on exit?
        Or mov LocalVarX, esi + mov LocalVarY, edi on entry and mov esi, LocalVarX + mov edi, LocalVarY on exit?
        Or mov LocalVarX, esi + mov edx, edi on entry and mov esi, LocalVarX + mov edi, edx on exit?

        If you have an i5/i7/AMD CPU, would you mind posting your results, please?

Thanks  :t

Note: the program starts by testing all cases up to 120x120.

EDIT: the question here is how we save the registers, not the procedure that transposes a matrix.
The proc is only there to use up time. See reply #9.

My sample
Quote
***** Time table - LoopCount =10 000 *****

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
...
1233385  cycles, MatrixTransposeSSE38YB,  TestMatWA 500x500 -LOCAL var+var- ups B
1235908  cycles, MatrixTransposeSSE38XB,  TestMatWA 500x500 -push esi+edi- ups B
1282101  cycles, MatrixTransposeSSE38YI,  TestMatWA 500x500 -LOCAL var,edx- ups I
1282982  cycles, MatrixTransposeSSE38XI,  TestMatWA 500x500 -push esi,edx- ups I
1383583  cycles, MatrixTransposeSSE38XA,  TestMatWA 500x500 -push esi+edi- ups A
1395110  cycles, MatrixTransposeSSE38YA,  TestMatWA 500x500 -LOCAL var+var- ups A
1448037  cycles, MatrixTransposeSSE38XC,  TestMatWA 500x500 -push esi+edi- ups C
1629964  cycles, MatrixTransposeSSE38YC,  TestMatWA 500x500 -LOCAL var+var- ups C

1263363  cycles, MatrixTransposeSSE38XB,  TestMatWW 512x512 -push esi+edi- ups B
1268120  cycles, MatrixTransposeSSE38YB,  TestMatWW 512x512 -LOCAL var+var- ups B
1548259  cycles, MatrixTransposeSSE38YA,  TestMatWW 512x512 -LOCAL var+var- ups A
1549101  cycles, MatrixTransposeSSE38XA,  TestMatWW 512x512 -push esi+edi- ups A
1566116  cycles, MatrixTransposeSSE38XC,  TestMatWW 512x512 -push esi+edi- ups C
1590263  cycles, MatrixTransposeSSE38YC,  TestMatWW 512x512 -LOCAL var+var- ups C
1392685  cycles, MatrixTransposeSSE38YI,  TestMatWC 512x512 -LOCAL var,edx- ups I
1400638  cycles, MatrixTransposeSSE38XI,  TestMatWW 512x512 -push esi,edx- ups I

1284184  cycles, MatrixTransposeSSE38XB,  TestMatWB 504x504 -push esi+edi- ups B
1319583  cycles, MatrixTransposeSSE38YI,  TestMatWB 504x504 -LOCAL var,edx- ups I
1320214  cycles, MatrixTransposeSSE38YB,  TestMatWB 504x504 -LOCAL var+var- ups B
1324752  cycles, MatrixTransposeSSE38XI,  TestMatWB 504x504 -push esi,edx- ups I
1416786  cycles, MatrixTransposeSSE38YC,  TestMatWB 504x504 -LOCAL var+var- ups C
1428926  cycles, MatrixTransposeSSE38XA,  TestMatWB 504x504 -push esi+edi- ups A
1435277  cycles, MatrixTransposeSSE38YA,  TestMatWB 504x504 -LOCAL var+var- ups A
1439214  cycles, MatrixTransposeSSE38XC,  TestMatWB 504x504 -push esi+edi- ups C

1333058  cycles, MatrixTransposeSSE38XB,  TestMatWC 508x508 -push esi+edi- ups B
1337661  cycles, MatrixTransposeSSE38YB,  TestMatWC 508x508 -LOCAL var+var- ups B
1354388  cycles, MatrixTransposeSSE38XI,  TestMatWC 508x508 -push esi,edx- ups I
1358452  cycles, MatrixTransposeSSE38YI,  TestMatWZ 508x508 -LOCAL var,edx- ups I
1459252  cycles, MatrixTransposeSSE38YA,  TestMatWC 508x508 -LOCAL var+var- ups A
1468728  cycles, MatrixTransposeSSE38XA,  TestMatWC 508x508 -push esi+edi- ups A
1483957  cycles, MatrixTransposeSSE38XC,  TestMatWC 508x508 -push esi+edi- ups C
1530930  cycles, MatrixTransposeSSE38YC,  TestMatWC 508x508 -LOCAL var+var- ups C
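
One way to read these tables for the question asked here is to pair each push version (X) with its LOCAL counterpart (Y) and look at the relative gap. A minimal Python sketch, using the four 500x500 pairs from the Pentium 4 table; the `rel_diff_percent` helper is made up for this illustration and is not part of the test program:

```python
# Cycle counts copied from the Pentium 4 table above (TestMatWA 500x500).
# Keys are the variant suffixes: X* = push/pop, Y* = LOCAL variables.
cycles = {
    "XA": 1383583, "YA": 1395110,
    "XB": 1235908, "YB": 1233385,
    "XC": 1448037, "YC": 1629964,
    "XI": 1282982, "YI": 1282101,
}

def rel_diff_percent(push_cycles, local_cycles):
    """Relative cost of the LOCAL version versus the push version, in percent.

    Positive means LOCAL is slower, negative means LOCAL is faster.
    """
    return (local_cycles - push_cycles) / push_cycles * 100.0

for body in "ABCI":
    diff = rel_diff_percent(cycles["X" + body], cycles["Y" + body])
    print(f"body {body}: LOCAL vs push = {diff:+.2f} %")
```

For bodies A, B and I the gap is below 1 % in either direction. Only body C stands out at this size, and that does not repeat at 504x504, where YC is the faster of the two.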

jj2007

 :t
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

26  cycles, MatrixTransposeSSE38XB,  TestMatXA 4x4     -push esi+edi- ups B
27  cycles, MatrixTransposeSSE38YA,  TestMatXA 4x4     -LOCAL var+var- ups A
27  cycles, MatrixTransposeSSE38XI,  TestMatXA 4x4    -push esi,edx- ups I
27  cycles, MatrixTransposeSSE38YI,  TestMatXA 4x4     -LOCAL var,edx- ups I
28  cycles, MatrixTransposeSSE38XC,  TestMatXA 4x4    -push esi+edi- ups C
28  cycles, MatrixTransposeSSE38YB,  TestMatXA 4x4     -LOCAL var+var- ups B
29  cycles, MatrixTransposeSSE38YC,  TestMatXA 4x4     -LOCAL var+var- ups C
57  cycles, MatrixTransposeSSE38XB,  TestMatXB 8x8     -push esi+edi- ups B
60  cycles, MatrixTransposeSSE38YB,  TestMatXB 8x8     -LOCAL var+var- ups B
63  cycles, MatrixTransposeSSE38YA,  TestMatXB 8x8     -LOCAL var+var- ups A
65  cycles, MatrixTransposeSSE38YC,  TestMatXB 8x8     -LOCAL var+var- ups C
67  cycles, MatrixTransposeSSE38XC,  TestMatXB 8x8    -push esi+edi- ups C
69  cycles, MatrixTransposeSSE38YI,  TestMatXB 8x8     -LOCAL var,edx- ups I
70  cycles, MatrixTransposeSSE38XI,  TestMatXB 8x8    -push esi,edx- ups I
103  cycles, MatrixTransposeSSE38XA,  TestMatXA 4x4     -push esi+edi- ups A
120  cycles, MatrixTransposeSSE38XB,  TestMatXC 12x12   -push esi+edi- ups B
121  cycles, MatrixTransposeSSE38XI,  TestMatXC 12x12  -push esi,edx- ups I
123  cycles, MatrixTransposeSSE38XC,  TestMatXC 12x12  -push esi+edi- ups C
124  cycles, MatrixTransposeSSE38YC,  TestMatXC 12x12   -LOCAL var+var- ups C
124  cycles, MatrixTransposeSSE38YA,  TestMatXC 12x12   -LOCAL var+var- ups A
125  cycles, MatrixTransposeSSE38YB,  TestMatXC 12x12   -LOCAL var+var- ups B
156  cycles, MatrixTransposeSSE38YI,  TestMatXC 12x12   -LOCAL var,edx- ups I
217  cycles, MatrixTransposeSSE38XA,  TestMatXB 8x8     -push esi+edi- ups A
315  cycles, MatrixTransposeSSE38XB,  TestMatXD 20x20   -push esi+edi- ups B
337  cycles, MatrixTransposeSSE38YB,  TestMatXD 20x20   -LOCAL var+var- ups B
343  cycles, MatrixTransposeSSE38YA,  TestMatXD 20x20   -LOCAL var+var- ups A
345  cycles, MatrixTransposeSSE38YC,  TestMatXD 20x20   -LOCAL var+var- ups C
347  cycles, MatrixTransposeSSE38XC,  TestMatXD 20x20  -push esi+edi- ups C
412  cycles, MatrixTransposeSSE38XA,  TestMatXC 12x12   -push esi+edi- ups A
566  cycles, MatrixTransposeSSE38XI,  TestMatXD 20x20  -push esi,edx- ups I
573  cycles, MatrixTransposeSSE38YI,  TestMatXD 20x20   -LOCAL var,edx- ups I
1163  cycles, MatrixTransposeSSE38XA,  TestMatXD 20x20   -push esi+edi- ups A
9056  cycles, MatrixTransposeSSE38XB,  TestMatYA 100x100 -push esi+edi- ups B
9156  cycles, MatrixTransposeSSE38YB,  TestMatYA 100x100 -LOCAL var+var- ups B
9370  cycles, MatrixTransposeSSE38XI,  TestMatYA 100x100 -push esi,edx- ups I
9393  cycles, MatrixTransposeSSE38YI,  TestMatYA 100x100 -LOCAL var,edx- ups I
9748  cycles, MatrixTransposeSSE38XC,  TestMatYA 100x100 -push esi+edi- ups C
9775  cycles, MatrixTransposeSSE38YA,  TestMatYA 100x100 -LOCAL var+var- ups A
9950  cycles, MatrixTransposeSSE38YC,  TestMatYA 100x100 -LOCAL var+var- ups C
15266  cycles, MatrixTransposeSSE38YB,  TestMatYY 120x120 -LOCAL var+var- ups B
15385  cycles, MatrixTransposeSSE38XB,  TestMatYY 120x120 -push esi+edi- ups B
15690  cycles, MatrixTransposeSSE38XI,  TestMatYY 120x120 -push esi,edx- ups I
15841  cycles, MatrixTransposeSSE38XB,  TestMatYC 132x132 -push esi+edi- ups B
15863  cycles, MatrixTransposeSSE38YB,  TestMatYC 132x132 -LOCAL var+var- ups B
15960  cycles, MatrixTransposeSSE38YI,  TestMatYY 120x120 -LOCAL var,edx- ups I
16103  cycles, MatrixTransposeSSE38YA,  TestMatYY 120x120 -LOCAL var+var- ups A
16118  cycles, MatrixTransposeSSE38YC,  TestMatYY 120x120 -LOCAL var+var- ups C
16134  cycles, MatrixTransposeSSE38XC,  TestMatYY 120x120 -push esi+edi- ups C
16416  cycles, MatrixTransposeSSE38XA,  TestMatYY 120x120 -push esi+edi- ups A
16883  cycles, MatrixTransposeSSE38XA,  TestMatYA 100x100 -push esi+edi- ups A
16941  cycles, MatrixTransposeSSE38XA,  TestMatYC 132x132 -push esi+edi- ups A
16979  cycles, MatrixTransposeSSE38XC,  TestMatYC 132x132 -push esi+edi- ups C
17038  cycles, MatrixTransposeSSE38YA,  TestMatYC 132x132 -LOCAL var+var- ups A
17368  cycles, MatrixTransposeSSE38XI,  TestMatYC 132x132 -push esi,edx- ups I
17421  cycles, MatrixTransposeSSE38YI,  TestMatYC 132x132 -LOCAL var,edx- ups I
17567  cycles, MatrixTransposeSSE38YC,  TestMatYC 132x132 -LOCAL var+var- ups C
24206  cycles, MatrixTransposeSSE38YB,  TestMatYB 128x128 -LOCAL var+var- ups B
24368  cycles, MatrixTransposeSSE38XB,  TestMatYB 128x128 -push esi+edi- ups B
24619  cycles, MatrixTransposeSSE38XI,  TestMatYB 128x128 -push esi,edx- ups I
24639  cycles, MatrixTransposeSSE38YI,  TestMatYB 128x128 -LOCAL var,edx- ups I
25893  cycles, MatrixTransposeSSE38XA,  TestMatYB 128x128 -push esi+edi- ups A
26104  cycles, MatrixTransposeSSE38XC,  TestMatYB 128x128 -push esi+edi- ups C
26120  cycles, MatrixTransposeSSE38YA,  TestMatYB 128x128 -LOCAL var+var- ups A
26638  cycles, MatrixTransposeSSE38YC,  TestMatYB 128x128 -LOCAL var+var- ups C
118597  cycles, MatrixTransposeSSE38XI,  TestMatZB 260x260 -push esi,edx- ups I
118983  cycles, MatrixTransposeSSE38YI,  TestMatZB 260x260 -LOCAL var,edx- ups I
119003  cycles, MatrixTransposeSSE38YB,  TestMatZB 260x260 -LOCAL var+var- ups B
119393  cycles, MatrixTransposeSSE38XB,  TestMatZB 260x260 -push esi+edi- ups B
122371  cycles, MatrixTransposeSSE38YC,  TestMatZB 260x260 -LOCAL var+var- ups C
122542  cycles, MatrixTransposeSSE38XC,  TestMatZB 260x260 -push esi+edi- ups C
122872  cycles, MatrixTransposeSSE38YA,  TestMatZB 260x260 -LOCAL var+var- ups A
123103  cycles, MatrixTransposeSSE38XA,  TestMatZB 260x260 -push esi+edi- ups A
126953  cycles, MatrixTransposeSSE38YI,  TestMatZC 268x268 -LOCAL var,edx- ups I
126974  cycles, MatrixTransposeSSE38XB,  TestMatZC 268x268 -push esi+edi- ups B
126977  cycles, MatrixTransposeSSE38XI,  TestMatZC 268x268 -push esi,edx- ups I
127359  cycles, MatrixTransposeSSE38YB,  TestMatZC 268x268 -LOCAL var+var- ups B
130655  cycles, MatrixTransposeSSE38XB,  TestMatZZ 264x264 -push esi+edi- ups B
130727  cycles, MatrixTransposeSSE38YB,  TestMatZZ 264x264 -LOCAL var+var- ups B
131699  cycles, MatrixTransposeSSE38XI,  TestMatZZ 264x264 -push esi,edx- ups I
131864  cycles, MatrixTransposeSSE38YC,  TestMatZC 268x268 -LOCAL var+var- ups C
131980  cycles, MatrixTransposeSSE38YI,  TestMatZZ 264x264 -LOCAL var,edx- ups I
132072  cycles, MatrixTransposeSSE38XC,  TestMatZC 268x268 -push esi+edi- ups C
132254  cycles, MatrixTransposeSSE38YA,  TestMatZC 268x268 -LOCAL var+var- ups A
132316  cycles, MatrixTransposeSSE38XA,  TestMatZC 268x268 -push esi+edi- ups A
134342  cycles, MatrixTransposeSSE38XA,  TestMatZZ 264x264 -push esi+edi- ups A
134931  cycles, MatrixTransposeSSE38XC,  TestMatZZ 264x264 -push esi+edi- ups C
134959  cycles, MatrixTransposeSSE38YC,  TestMatZZ 264x264 -LOCAL var+var- ups C
135762  cycles, MatrixTransposeSSE38YA,  TestMatZZ 264x264 -LOCAL var+var- ups A
140127  cycles, MatrixTransposeSSE38YB,  TestMatZA 256x256 -LOCAL var+var- ups B
140296  cycles, MatrixTransposeSSE38XB,  TestMatZA 256x256 -push esi+edi- ups B
141620  cycles, MatrixTransposeSSE38YI,  TestMatZA 256x256 -LOCAL var,edx- ups I
141762  cycles, MatrixTransposeSSE38XI,  TestMatZA 256x256 -push esi,edx- ups I
169636  cycles, MatrixTransposeSSE38XC,  TestMatZA 256x256 -push esi+edi- ups C
169722  cycles, MatrixTransposeSSE38YA,  TestMatZA 256x256 -LOCAL var+var- ups A
169767  cycles, MatrixTransposeSSE38YC,  TestMatZA 256x256 -LOCAL var+var- ups C
170001  cycles, MatrixTransposeSSE38XA,  TestMatZA 256x256 -push esi+edi- ups A
378139  cycles, MatrixTransposeSSE38XB,  TestMatWA 500x500 -push esi+edi- ups B
381761  cycles, MatrixTransposeSSE38YI,  TestMatWA 500x500 -LOCAL var,edx- ups I
384240  cycles, MatrixTransposeSSE38YB,  TestMatWA 500x500 -LOCAL var+var- ups B
386579  cycles, MatrixTransposeSSE38XB,  TestMatWB 504x504 -push esi+edi- ups B
387787  cycles, MatrixTransposeSSE38YB,  TestMatWB 504x504 -LOCAL var+var- ups B
389193  cycles, MatrixTransposeSSE38XI,  TestMatWA 500x500 -push esi,edx- ups I
389826  cycles, MatrixTransposeSSE38YI,  TestMatWB 504x504 -LOCAL var,edx- ups I
390552  cycles, MatrixTransposeSSE38XI,  TestMatWB 504x504 -push esi,edx- ups I
393505  cycles, MatrixTransposeSSE38YB,  TestMatWC 508x508 -LOCAL var+var- ups B
393692  cycles, MatrixTransposeSSE38XB,  TestMatWC 508x508 -push esi+edi- ups B
397996  cycles, MatrixTransposeSSE38YI,  TestMatWZ 508x508 -LOCAL var,edx- ups I
399456  cycles, MatrixTransposeSSE38XI,  TestMatWC 508x508 -push esi,edx- ups I
480503  cycles, MatrixTransposeSSE38YC,  TestMatWB 504x504 -LOCAL var+var- ups C
480817  cycles, MatrixTransposeSSE38YA,  TestMatWB 504x504 -LOCAL var+var- ups A
480847  cycles, MatrixTransposeSSE38XC,  TestMatWB 504x504 -push esi+edi- ups C
480883  cycles, MatrixTransposeSSE38XA,  TestMatWB 504x504 -push esi+edi- ups A
513839  cycles, MatrixTransposeSSE38XA,  TestMatWA 500x500 -push esi+edi- ups A
513988  cycles, MatrixTransposeSSE38YA,  TestMatWA 500x500 -LOCAL var+var- ups A
514119  cycles, MatrixTransposeSSE38YC,  TestMatWA 500x500 -LOCAL var+var- ups C
514560  cycles, MatrixTransposeSSE38XC,  TestMatWA 500x500 -push esi+edi- ups C
608451  cycles, MatrixTransposeSSE38YC,  TestMatWC 508x508 -LOCAL var+var- ups C
609226  cycles, MatrixTransposeSSE38XC,  TestMatWC 508x508 -push esi+edi- ups C
610219  cycles, MatrixTransposeSSE38YA,  TestMatWC 508x508 -LOCAL var+var- ups A
612253  cycles, MatrixTransposeSSE38XA,  TestMatWC 508x508 -push esi+edi- ups A
673824  cycles, MatrixTransposeSSE38XB,  TestMatWW 512x512 -push esi+edi- ups B
677774  cycles, MatrixTransposeSSE38YB,  TestMatWW 512x512 -LOCAL var+var- ups B
686959  cycles, MatrixTransposeSSE38XI,  TestMatWW 512x512 -push esi,edx- ups I
687090  cycles, MatrixTransposeSSE38YI,  TestMatWC 512x512 -LOCAL var,edx- ups I
929165  cycles, MatrixTransposeSSE38YC,  TestMatWW 512x512 -LOCAL var+var- ups C
930455  cycles, MatrixTransposeSSE38XA,  TestMatWW 512x512 -push esi+edi- ups A
931261  cycles, MatrixTransposeSSE38XC,  TestMatWW 512x512 -push esi+edi- ups C
943437  cycles, MatrixTransposeSSE38YA,  TestMatWW 512x512 -LOCAL var+var- ups A

zedd151

Okay, I'll bite.   :P

Results are in the attached zip file.

CPU speed: 1.60 GHz

What I find odd is that my cycle counts differ so much; cycle counts should stay very similar from one machine to another. My CPU runs at a little less than half the speed of the typical CPU here (e.g. ~3.40 GHz). I know the timings would be doubled or worse for me, but cycle counts shouldn't be affected by CPU speed.
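
The distinction between cycle counts and timings can be made concrete with a little arithmetic. A small Python sketch, assuming the counter ticks once per core clock at the nominal frequency (a simplification on CPUs with turbo or power saving); the `cycles_to_microseconds` name is only for this illustration:

```python
def cycles_to_microseconds(cycles, clock_ghz):
    """Convert a cycle count to microseconds at a given clock rate.

    Assumes one counter tick per core clock at the nominal frequency.
    """
    return cycles / (clock_ghz * 1000.0)  # 1 GHz = 1000 cycles per microsecond

# The same cycle count (378139, the fastest 500x500 result in the i5 table
# above) evaluated at the two clock rates mentioned in this post.
for ghz in (3.40, 1.60):
    us = cycles_to_microseconds(378139, ghz)
    print(f"{ghz:.2f} GHz: {us:.1f} us")
```

With an identical cycle count, the elapsed time scales with the inverse of the clock rate. When the work is limited by memory rather than by the core, the cycle count itself can also differ between machines, since memory latency does not scale with the core clock.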

Of course I could be mistaken about the reason; it could be a cache size issue...   :icon_confused:
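
To check the cache idea, one can estimate how much memory each test touches and compare it with the cache sizes of the CPU. A rough Python sketch: the 4-byte element size, the out-of-place default and the cache sizes below are all assumptions for illustration, since the thread does not state them.

```python
def transpose_working_set_bytes(n, element_bytes=4, in_place=False):
    """Bytes touched when transposing an n x n matrix.

    element_bytes=4 is an assumption (32-bit elements); the thread does not
    state the element type. Out-of-place transposition touches a source and
    a destination matrix, in-place only one.
    """
    matrices = 1 if in_place else 2
    return matrices * n * n * element_bytes

# Hypothetical cache sizes, for illustration only; look up the real values
# for the CPU under test.
CACHE_BYTES = {"L1 data": 32 * 1024, "L2": 256 * 1024, "L3": 3 * 1024 * 1024}

for n in (20, 128, 256, 512):
    size = transpose_working_set_bytes(n)
    fits = [name for name, cap in CACHE_BYTES.items() if size <= cap]
    print(f"{n}x{n}: {size / 1024:.0f} KiB, fits in: {fits or 'none of these'}")
```

Under these assumptions the small matrices stay in the first-level cache while the largest ones only fit in the last level, so two machines with different cache sizes may well report different cycle counts for the same test.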

RuiLoureiro

 :t
Because this information is only useful (IMO) if we see it sorted by matrix type,
here are the results (only up to 20000 characters).
(hutch, my apologies for taking up this space, if that is a problem)
Quote
Jochen:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

   26  cycles, MatrixTransposeSSE38XB,  TestMatXA 4x4     -push esi+edi- ups B
   27  cycles, MatrixTransposeSSE38YA,  TestMatXA 4x4     -LOCAL var+var- ups A
   27  cycles, MatrixTransposeSSE38XI,  TestMatXA 4x4     -push esi,edx- ups I
   27  cycles, MatrixTransposeSSE38YI,  TestMatXA 4x4     -LOCAL var,edx- ups I
   28  cycles, MatrixTransposeSSE38XC,  TestMatXA 4x4     -push esi+edi- ups C
   28  cycles, MatrixTransposeSSE38YB,  TestMatXA 4x4     -LOCAL var+var- ups B
   29  cycles, MatrixTransposeSSE38YC,  TestMatXA 4x4     -LOCAL var+var- ups C
  103  cycles, MatrixTransposeSSE38XA,  TestMatXA 4x4     -push esi+edi- ups A
   
   57  cycles, MatrixTransposeSSE38XB,  TestMatXB 8x8     -push esi+edi- ups B
   60  cycles, MatrixTransposeSSE38YB,  TestMatXB 8x8     -LOCAL var+var- ups B
   63  cycles, MatrixTransposeSSE38YA,  TestMatXB 8x8     -LOCAL var+var- ups A
   65  cycles, MatrixTransposeSSE38YC,  TestMatXB 8x8     -LOCAL var+var- ups C
   67  cycles, MatrixTransposeSSE38XC,  TestMatXB 8x8     -push esi+edi- ups C
   69  cycles, MatrixTransposeSSE38YI,  TestMatXB 8x8     -LOCAL var,edx- ups I
   70  cycles, MatrixTransposeSSE38XI,  TestMatXB 8x8     -push esi,edx- ups I
  217  cycles, MatrixTransposeSSE38XA,  TestMatXB 8x8     -push esi+edi- ups A
 
  120  cycles, MatrixTransposeSSE38XB,  TestMatXC 12x12   -push esi+edi- ups B
  121  cycles, MatrixTransposeSSE38XI,  TestMatXC 12x12   -push esi,edx- ups I
  123  cycles, MatrixTransposeSSE38XC,  TestMatXC 12x12   -push esi+edi- ups C
  124  cycles, MatrixTransposeSSE38YC,  TestMatXC 12x12   -LOCAL var+var- ups C
  124  cycles, MatrixTransposeSSE38YA,  TestMatXC 12x12   -LOCAL var+var- ups A
  125  cycles, MatrixTransposeSSE38YB,  TestMatXC 12x12   -LOCAL var+var- ups B
  156  cycles, MatrixTransposeSSE38YI,  TestMatXC 12x12   -LOCAL var,edx- ups I
  412  cycles, MatrixTransposeSSE38XA,  TestMatXC 12x12   -push esi+edi- ups A
 
  315  cycles, MatrixTransposeSSE38XB,  TestMatXD 20x20   -push esi+edi- ups B
  337  cycles, MatrixTransposeSSE38YB,  TestMatXD 20x20   -LOCAL var+var- ups B
  343  cycles, MatrixTransposeSSE38YA,  TestMatXD 20x20   -LOCAL var+var- ups A
  345  cycles, MatrixTransposeSSE38YC,  TestMatXD 20x20   -LOCAL var+var- ups C
  347  cycles, MatrixTransposeSSE38XC,  TestMatXD 20x20   -push esi+edi- ups C
  566  cycles, MatrixTransposeSSE38XI,  TestMatXD 20x20   -push esi,edx- ups I
  573  cycles, MatrixTransposeSSE38YI,  TestMatXD 20x20   -LOCAL var,edx- ups I
1163  cycles, MatrixTransposeSSE38XA,  TestMatXD 20x20   -push esi+edi- ups A

9056  cycles, MatrixTransposeSSE38XB,  TestMatYA 100x100 -push esi+edi- ups B
9156  cycles, MatrixTransposeSSE38YB,  TestMatYA 100x100 -LOCAL var+var- ups B
9370  cycles, MatrixTransposeSSE38XI,  TestMatYA 100x100 -push esi,edx- ups I
9393  cycles, MatrixTransposeSSE38YI,  TestMatYA 100x100 -LOCAL var,edx- ups I
9748  cycles, MatrixTransposeSSE38XC,  TestMatYA 100x100 -push esi+edi- ups C
9775  cycles, MatrixTransposeSSE38YA,  TestMatYA 100x100 -LOCAL var+var- ups A
9950  cycles, MatrixTransposeSSE38YC,  TestMatYA 100x100 -LOCAL var+var- ups C
16883  cycles, MatrixTransposeSSE38XA,  TestMatYA 100x100 -push esi+edi- ups A

15266  cycles, MatrixTransposeSSE38YB,  TestMatYY 120x120 -LOCAL var+var- ups B
15385  cycles, MatrixTransposeSSE38XB,  TestMatYY 120x120 -push esi+edi- ups B
15690  cycles, MatrixTransposeSSE38XI,  TestMatYY 120x120 -push esi,edx- ups I
15960  cycles, MatrixTransposeSSE38YI,  TestMatYY 120x120 -LOCAL var,edx- ups I
16103  cycles, MatrixTransposeSSE38YA,  TestMatYY 120x120 -LOCAL var+var- ups A
16118  cycles, MatrixTransposeSSE38YC,  TestMatYY 120x120 -LOCAL var+var- ups C
16134  cycles, MatrixTransposeSSE38XC,  TestMatYY 120x120 -push esi+edi- ups C
16416  cycles, MatrixTransposeSSE38XA,  TestMatYY 120x120 -push esi+edi- ups A

15841  cycles, MatrixTransposeSSE38XB,  TestMatYC 132x132 -push esi+edi- ups B
15863  cycles, MatrixTransposeSSE38YB,  TestMatYC 132x132 -LOCAL var+var- ups B
16941  cycles, MatrixTransposeSSE38XA,  TestMatYC 132x132 -push esi+edi- ups A
16979  cycles, MatrixTransposeSSE38XC,  TestMatYC 132x132 -push esi+edi- ups C
17038  cycles, MatrixTransposeSSE38YA,  TestMatYC 132x132 -LOCAL var+var- ups A
17368  cycles, MatrixTransposeSSE38XI,  TestMatYC 132x132 -push esi,edx- ups I
17421  cycles, MatrixTransposeSSE38YI,  TestMatYC 132x132 -LOCAL var,edx- ups I
17567  cycles, MatrixTransposeSSE38YC,  TestMatYC 132x132 -LOCAL var+var- ups C

24206  cycles, MatrixTransposeSSE38YB,  TestMatYB 128x128 -LOCAL var+var- ups B
24368  cycles, MatrixTransposeSSE38XB,  TestMatYB 128x128 -push esi+edi- ups B
24619  cycles, MatrixTransposeSSE38XI,  TestMatYB 128x128 -push esi,edx- ups I
24639  cycles, MatrixTransposeSSE38YI,  TestMatYB 128x128 -LOCAL var,edx- ups I
25893  cycles, MatrixTransposeSSE38XA,  TestMatYB 128x128 -push esi+edi- ups A
26104  cycles, MatrixTransposeSSE38XC,  TestMatYB 128x128 -push esi+edi- ups C
26120  cycles, MatrixTransposeSSE38YA,  TestMatYB 128x128 -LOCAL var+var- ups A
26638  cycles, MatrixTransposeSSE38YC,  TestMatYB 128x128 -LOCAL var+var- ups C

; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
118597  cycles, MatrixTransposeSSE38XI,  TestMatZB 260x260 -push esi,edx- ups I
118983  cycles, MatrixTransposeSSE38YI,  TestMatZB 260x260 -LOCAL var,edx- ups I
119003  cycles, MatrixTransposeSSE38YB,  TestMatZB 260x260 -LOCAL var+var- ups B
119393  cycles, MatrixTransposeSSE38XB,  TestMatZB 260x260 -push esi+edi- ups B
122371  cycles, MatrixTransposeSSE38YC,  TestMatZB 260x260 -LOCAL var+var- ups C
122542  cycles, MatrixTransposeSSE38XC,  TestMatZB 260x260 -push esi+edi- ups C
122872  cycles, MatrixTransposeSSE38YA,  TestMatZB 260x260 -LOCAL var+var- ups A
123103  cycles, MatrixTransposeSSE38XA,  TestMatZB 260x260 -push esi+edi- ups A

126953  cycles, MatrixTransposeSSE38YI,  TestMatZC 268x268 -LOCAL var,edx- ups I
126974  cycles, MatrixTransposeSSE38XB,  TestMatZC 268x268 -push esi+edi- ups B
126977  cycles, MatrixTransposeSSE38XI,  TestMatZC 268x268 -push esi,edx- ups I
127359  cycles, MatrixTransposeSSE38YB,  TestMatZC 268x268 -LOCAL var+var- ups B
131864  cycles, MatrixTransposeSSE38YC,  TestMatZC 268x268 -LOCAL var+var- ups C
132072  cycles, MatrixTransposeSSE38XC,  TestMatZC 268x268 -push esi+edi- ups C
132254  cycles, MatrixTransposeSSE38YA,  TestMatZC 268x268 -LOCAL var+var- ups A
132316  cycles, MatrixTransposeSSE38XA,  TestMatZC 268x268 -push esi+edi- ups A

130655  cycles, MatrixTransposeSSE38XB,  TestMatZZ 264x264 -push esi+edi- ups B
130727  cycles, MatrixTransposeSSE38YB,  TestMatZZ 264x264 -LOCAL var+var- ups B
131699  cycles, MatrixTransposeSSE38XI,  TestMatZZ 264x264 -push esi,edx- ups I
131980  cycles, MatrixTransposeSSE38YI,  TestMatZZ 264x264 -LOCAL var,edx- ups I
134342  cycles, MatrixTransposeSSE38XA,  TestMatZZ 264x264 -push esi+edi- ups A
134931  cycles, MatrixTransposeSSE38XC,  TestMatZZ 264x264 -push esi+edi- ups C
134959  cycles, MatrixTransposeSSE38YC,  TestMatZZ 264x264 -LOCAL var+var- ups C
135762  cycles, MatrixTransposeSSE38YA,  TestMatZZ 264x264 -LOCAL var+var- ups A

140127  cycles, MatrixTransposeSSE38YB,  TestMatZA 256x256 -LOCAL var+var- ups B
140296  cycles, MatrixTransposeSSE38XB,  TestMatZA 256x256 -push esi+edi- ups B
141620  cycles, MatrixTransposeSSE38YI,  TestMatZA 256x256 -LOCAL var,edx- ups I
141762  cycles, MatrixTransposeSSE38XI,  TestMatZA 256x256 -push esi,edx- ups I
169636  cycles, MatrixTransposeSSE38XC,  TestMatZA 256x256 -push esi+edi- ups C
169722  cycles, MatrixTransposeSSE38YA,  TestMatZA 256x256 -LOCAL var+var- ups A
169767  cycles, MatrixTransposeSSE38YC,  TestMatZA 256x256 -LOCAL var+var- ups C
170001  cycles, MatrixTransposeSSE38XA,  TestMatZA 256x256 -push esi+edi- ups A
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««

378139  cycles, MatrixTransposeSSE38XB,  TestMatWA 500x500 -push esi+edi- ups B
381761  cycles, MatrixTransposeSSE38YI,  TestMatWA 500x500 -LOCAL var,edx- ups I
384240  cycles, MatrixTransposeSSE38YB,  TestMatWA 500x500 -LOCAL var+var- ups B
389193  cycles, MatrixTransposeSSE38XI,  TestMatWA 500x500 -push esi,edx- ups I
513839  cycles, MatrixTransposeSSE38XA,  TestMatWA 500x500 -push esi+edi- ups A
513988  cycles, MatrixTransposeSSE38YA,  TestMatWA 500x500 -LOCAL var+var- ups A
514119  cycles, MatrixTransposeSSE38YC,  TestMatWA 500x500 -LOCAL var+var- ups C
514560  cycles, MatrixTransposeSSE38XC,  TestMatWA 500x500 -push esi+edi- ups C

386579  cycles, MatrixTransposeSSE38XB,  TestMatWB 504x504 -push esi+edi- ups B
387787  cycles, MatrixTransposeSSE38YB,  TestMatWB 504x504 -LOCAL var+var- ups B
389826  cycles, MatrixTransposeSSE38YI,  TestMatWB 504x504 -LOCAL var,edx- ups I
390552  cycles, MatrixTransposeSSE38XI,  TestMatWB 504x504 -push esi,edx- ups I
480503  cycles, MatrixTransposeSSE38YC,  TestMatWB 504x504 -LOCAL var+var- ups C
480817  cycles, MatrixTransposeSSE38YA,  TestMatWB 504x504 -LOCAL var+var- ups A
480847  cycles, MatrixTransposeSSE38XC,  TestMatWB 504x504 -push esi+edi- ups C
480883  cycles, MatrixTransposeSSE38XA,  TestMatWB 504x504 -push esi+edi- ups A

393505  cycles, MatrixTransposeSSE38YB,  TestMatWC 508x508 -LOCAL var+var- ups B
393692  cycles, MatrixTransposeSSE38XB,  TestMatWC 508x508 -push esi+edi- ups B
397996  cycles, MatrixTransposeSSE38YI,  TestMatWZ 508x508 -LOCAL var,edx- ups I
399456  cycles, MatrixTransposeSSE38XI,  TestMatWC 508x508 -push esi,edx- ups I
608451  cycles, MatrixTransposeSSE38YC,  TestMatWC 508x508 -LOCAL var+var- ups C
609226  cycles, MatrixTransposeSSE38XC,  TestMatWC 508x508 -push esi+edi- ups C
610219  cycles, MatrixTransposeSSE38YA,  TestMatWC 508x508 -LOCAL var+var- ups A
612253  cycles, MatrixTransposeSSE38XA,  TestMatWC 508x508 -push esi+edi- ups A

673824  cycles, MatrixTransposeSSE38XB,  TestMatWW 512x512 -push esi+edi- ups B
677774  cycles, MatrixTransposeSSE38YB,  TestMatWW 512x512 -LOCAL var+var- ups B
686959  cycles, MatrixTransposeSSE38XI,  TestMatWW 512x512 -push esi,edx- ups I
687090  cycles, MatrixTransposeSSE38YI,  TestMatWC 512x512 -LOCAL var,edx- ups I
929165  cycles, MatrixTransposeSSE38YC,  TestMatWW 512x512 -LOCAL var+var- ups C
930455  cycles, MatrixTransposeSSE38XA,  TestMatWW 512x512 -push esi+edi- ups A
931261  cycles, MatrixTransposeSSE38XC,  TestMatWW 512x512 -push esi+edi- ups C
943437  cycles, MatrixTransposeSSE38YA,  TestMatWW 512x512 -LOCAL var+var- ups A
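
The regrouping above can also be done with a short script instead of by hand. A minimal Python sketch, assuming the result lines keep the layout shown in this thread; `LINE_RE` and `group_by_matrix` are names invented for this example. It groups by the NxN size rather than by the TestMat label, because one label in the output does not match its size (TestMatWC appears on a 512x512 line), and it orders the groups by size.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   "26  cycles, MatrixTransposeSSE38XB,  TestMatXA 4x4     -push esi+edi- ups B"
LINE_RE = re.compile(r"^\s*(\d+)\s+cycles,\s+(\w+),\s+\w+\s+(\d+)x\d+\s+-(.+?)-\s")

def group_by_matrix(lines):
    """Return {matrix_size: [(cycles, proc_name, save_method), ...]}.

    Groups are ordered by matrix size, rows within a group by cycle count.
    """
    groups = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip headers, blank lines and separators
        cycles, proc, size, method = m.groups()
        groups[int(size)].append((int(cycles), proc, method))
    return {size: sorted(rows) for size, rows in sorted(groups.items())}

sample = [
    "57  cycles, MatrixTransposeSSE38XB,  TestMatXB 8x8     -push esi+edi- ups B",
    "103  cycles, MatrixTransposeSSE38XA,  TestMatXA 4x4     -push esi+edi- ups A",
    "26  cycles, MatrixTransposeSSE38XB,  TestMatXA 4x4     -push esi+edi- ups B",
    "27  cycles, MatrixTransposeSSE38YA,  TestMatXA 4x4     -LOCAL var+var- ups A",
]

for size, rows in group_by_matrix(sample).items():
    print(f"--- {size}x{size} ---")
    for cycles, proc, method in rows:
        print(f"{cycles:>8}  {proc}  {method}")
```

Feeding it the full output of a run (one string per line) gives one block per matrix size, with the fastest version first in each block.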

mineiro

Hello sir, I can't get all the results, I suppose because of some Wine configuration in console mode. This is what I get; I changed the font size to 8 to fit more results. I can't redirect the results to a text file because the program asks for user input (hit a key). I'm attaching the results because this board limits posts to 20000 characters.

Siekmanski

***** Time table - LoopCount =10 000 *****

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

18  cycles, MatrixTransposeSSE38XI,  TestMatXA 4x4    -push esi,edx- ups I
19  cycles, MatrixTransposeSSE38YI,  TestMatXA 4x4     -LOCAL var,edx- ups I
20  cycles, MatrixTransposeSSE38YA,  TestMatXA 4x4     -LOCAL var+var- ups A
22  cycles, MatrixTransposeSSE38XC,  TestMatXA 4x4    -push esi+edi- ups C
29  cycles, MatrixTransposeSSE38YC,  TestMatXA 4x4     -LOCAL var+var- ups C
30  cycles, MatrixTransposeSSE38YB,  TestMatXA 4x4     -LOCAL var+var- ups B
54  cycles, MatrixTransposeSSE38XC,  TestMatXB 8x8    -push esi+edi- ups C
56  cycles, MatrixTransposeSSE38XB,  TestMatXA 4x4     -push esi+edi- ups B
62  cycles, MatrixTransposeSSE38YC,  TestMatXB 8x8     -LOCAL var+var- ups C
62  cycles, MatrixTransposeSSE38YB,  TestMatXB 8x8     -LOCAL var+var- ups B
63  cycles, MatrixTransposeSSE38YI,  TestMatXB 8x8     -LOCAL var,edx- ups I
63  cycles, MatrixTransposeSSE38XI,  TestMatXB 8x8    -push esi,edx- ups I
85  cycles, MatrixTransposeSSE38XA,  TestMatXA 4x4     -push esi+edi- ups A
115  cycles, MatrixTransposeSSE38XC,  TestMatXC 12x12  -push esi+edi- ups C
116  cycles, MatrixTransposeSSE38YA,  TestMatXC 12x12   -LOCAL var+var- ups A
117  cycles, MatrixTransposeSSE38YB,  TestMatXC 12x12   -LOCAL var+var- ups B
123  cycles, MatrixTransposeSSE38YC,  TestMatXC 12x12   -LOCAL var+var- ups C
127  cycles, MatrixTransposeSSE38XI,  TestMatXC 12x12  -push esi,edx- ups I
136  cycles, MatrixTransposeSSE38YI,  TestMatXC 12x12   -LOCAL var,edx- ups I
145  cycles, MatrixTransposeSSE38XA,  TestMatXB 8x8     -push esi+edi- ups A
150  cycles, MatrixTransposeSSE38XB,  TestMatXB 8x8     -push esi+edi- ups B
178  cycles, MatrixTransposeSSE38YA,  TestMatXB 8x8     -LOCAL var+var- ups A
260  cycles, MatrixTransposeSSE38YB,  TestMatXD 20x20   -LOCAL var+var- ups B
273  cycles, MatrixTransposeSSE38XC,  TestMatXD 20x20  -push esi+edi- ups C
277  cycles, MatrixTransposeSSE38YA,  TestMatXD 20x20   -LOCAL var+var- ups A
281  cycles, MatrixTransposeSSE38YC,  TestMatXD 20x20   -LOCAL var+var- ups C
304  cycles, MatrixTransposeSSE38YI,  TestMatXD 20x20   -LOCAL var,edx- ups I
308  cycles, MatrixTransposeSSE38XI,  TestMatXD 20x20  -push esi,edx- ups I
308  cycles, MatrixTransposeSSE38XA,  TestMatXC 12x12   -push esi+edi- ups A
313  cycles, MatrixTransposeSSE38XB,  TestMatXC 12x12   -push esi+edi- ups B
616  cycles, MatrixTransposeSSE38XA,  TestMatXD 20x20   -push esi+edi- ups A
720  cycles, MatrixTransposeSSE38XB,  TestMatXD 20x20   -push esi+edi- ups B
10932  cycles, MatrixTransposeSSE38YI,  TestMatYA 100x100 -LOCAL var,edx- ups I
10935  cycles, MatrixTransposeSSE38XI,  TestMatYA 100x100 -push esi,edx- ups I
11083  cycles, MatrixTransposeSSE38YB,  TestMatYA 100x100 -LOCAL var+var- ups B
11809  cycles, MatrixTransposeSSE38YA,  TestMatYA 100x100 -LOCAL var+var- ups A
11810  cycles, MatrixTransposeSSE38YC,  TestMatYA 100x100 -LOCAL var+var- ups C
11884  cycles, MatrixTransposeSSE38XC,  TestMatYA 100x100 -push esi+edi- ups C
17345  cycles, MatrixTransposeSSE38YB,  TestMatYY 120x120 -LOCAL var+var- ups B
17347  cycles, MatrixTransposeSSE38XB,  TestMatYY 120x120 -push esi+edi- ups B
17626  cycles, MatrixTransposeSSE38XA,  TestMatYA 100x100 -push esi+edi- ups A
17904  cycles, MatrixTransposeSSE38YI,  TestMatYY 120x120 -LOCAL var,edx- ups I
18008  cycles, MatrixTransposeSSE38XB,  TestMatYA 100x100 -push esi+edi- ups B
18012  cycles, MatrixTransposeSSE38XI,  TestMatYY 120x120 -push esi,edx- ups I
18024  cycles, MatrixTransposeSSE38YA,  TestMatYY 120x120 -LOCAL var+var- ups A
18097  cycles, MatrixTransposeSSE38XA,  TestMatYY 120x120 -push esi+edi- ups A
18119  cycles, MatrixTransposeSSE38XC,  TestMatYY 120x120 -push esi+edi- ups C
18153  cycles, MatrixTransposeSSE38YC,  TestMatYY 120x120 -LOCAL var+var- ups C
19222  cycles, MatrixTransposeSSE38XI,  TestMatYC 132x132 -push esi,edx- ups I
19262  cycles, MatrixTransposeSSE38YB,  TestMatYC 132x132 -LOCAL var+var- ups B
19291  cycles, MatrixTransposeSSE38YI,  TestMatYC 132x132 -LOCAL var,edx- ups I
19311  cycles, MatrixTransposeSSE38XB,  TestMatYC 132x132 -push esi+edi- ups B
20528  cycles, MatrixTransposeSSE38XC,  TestMatYC 132x132 -push esi+edi- ups C
20538  cycles, MatrixTransposeSSE38YC,  TestMatYC 132x132 -LOCAL var+var- ups C
20586  cycles, MatrixTransposeSSE38YA,  TestMatYC 132x132 -LOCAL var+var- ups A
20604  cycles, MatrixTransposeSSE38XA,  TestMatYC 132x132 -push esi+edi- ups A
28122  cycles, MatrixTransposeSSE38YB,  TestMatYB 128x128 -LOCAL var+var- ups B
28304  cycles, MatrixTransposeSSE38XB,  TestMatYB 128x128 -push esi+edi- ups B
28641  cycles, MatrixTransposeSSE38XI,  TestMatYB 128x128 -push esi,edx- ups I
28750  cycles, MatrixTransposeSSE38YI,  TestMatYB 128x128 -LOCAL var,edx- ups I
31052  cycles, MatrixTransposeSSE38YC,  TestMatYB 128x128 -LOCAL var+var- ups C
31103  cycles, MatrixTransposeSSE38YA,  TestMatYB 128x128 -LOCAL var+var- ups A
31142  cycles, MatrixTransposeSSE38XA,  TestMatYB 128x128 -push esi+edi- ups A
31158  cycles, MatrixTransposeSSE38XC,  TestMatYB 128x128 -push esi+edi- ups C
171916  cycles, MatrixTransposeSSE38XB,  TestMatZA 256x256 -push esi+edi- ups B
172156  cycles, MatrixTransposeSSE38YB,  TestMatZA 256x256 -LOCAL var+var- ups B
173787  cycles, MatrixTransposeSSE38XI,  TestMatZA 256x256 -push esi,edx- ups I
173918  cycles, MatrixTransposeSSE38YI,  TestMatZA 256x256 -LOCAL var,edx- ups I
188429  cycles, MatrixTransposeSSE38YB,  TestMatZB 260x260 -LOCAL var+var- ups B
188530  cycles, MatrixTransposeSSE38XB,  TestMatZB 260x260 -push esi+edi- ups B
188702  cycles, MatrixTransposeSSE38XI,  TestMatZB 260x260 -push esi,edx- ups I
188740  cycles, MatrixTransposeSSE38YI,  TestMatZB 260x260 -LOCAL var,edx- ups I
193331  cycles, MatrixTransposeSSE38XC,  TestMatZB 260x260 -push esi+edi- ups C
193569  cycles, MatrixTransposeSSE38YC,  TestMatZB 260x260 -LOCAL var+var- ups C
193777  cycles, MatrixTransposeSSE38XA,  TestMatZB 260x260 -push esi+edi- ups A
194257  cycles, MatrixTransposeSSE38YA,  TestMatZB 260x260 -LOCAL var+var- ups A
196732  cycles, MatrixTransposeSSE38XB,  TestMatZZ 264x264 -push esi+edi- ups B
196883  cycles, MatrixTransposeSSE38YB,  TestMatZZ 264x264 -LOCAL var+var- ups B
200864  cycles, MatrixTransposeSSE38YI,  TestMatZZ 264x264 -LOCAL var,edx- ups I
200891  cycles, MatrixTransposeSSE38XI,  TestMatZZ 264x264 -push esi,edx- ups I
201147  cycles, MatrixTransposeSSE38XB,  TestMatZC 268x268 -push esi+edi- ups B
201184  cycles, MatrixTransposeSSE38YB,  TestMatZC 268x268 -LOCAL var+var- ups B
201439  cycles, MatrixTransposeSSE38XI,  TestMatZC 268x268 -push esi,edx- ups I
201640  cycles, MatrixTransposeSSE38YI,  TestMatZC 268x268 -LOCAL var,edx- ups I
203106  cycles, MatrixTransposeSSE38YA,  TestMatZZ 264x264 -LOCAL var+var- ups A
203159  cycles, MatrixTransposeSSE38XA,  TestMatZZ 264x264 -push esi+edi- ups A
203764  cycles, MatrixTransposeSSE38YC,  TestMatZZ 264x264 -LOCAL var+var- ups C
203870  cycles, MatrixTransposeSSE38XC,  TestMatZZ 264x264 -push esi+edi- ups C
208185  cycles, MatrixTransposeSSE38YC,  TestMatZC 268x268 -LOCAL var+var- ups C
208212  cycles, MatrixTransposeSSE38XC,  TestMatZC 268x268 -push esi+edi- ups C
208660  cycles, MatrixTransposeSSE38XA,  TestMatZC 268x268 -push esi+edi- ups A
209059  cycles, MatrixTransposeSSE38YA,  TestMatZC 268x268 -LOCAL var+var- ups A
236710  cycles, MatrixTransposeSSE38XA,  TestMatZA 256x256 -push esi+edi- ups A
236904  cycles, MatrixTransposeSSE38YC,  TestMatZA 256x256 -LOCAL var+var- ups C
236960  cycles, MatrixTransposeSSE38YA,  TestMatZA 256x256 -LOCAL var+var- ups A
236968  cycles, MatrixTransposeSSE38XC,  TestMatZA 256x256 -push esi+edi- ups C
471095  cycles, MatrixTransposeSSE38YB,  TestMatWB 504x504 -LOCAL var+var- ups B
471488  cycles, MatrixTransposeSSE38XB,  TestMatWB 504x504 -push esi+edi- ups B
478281  cycles, MatrixTransposeSSE38YI,  TestMatWB 504x504 -LOCAL var,edx- ups I
478743  cycles, MatrixTransposeSSE38XI,  TestMatWB 504x504 -push esi,edx- ups I
504164  cycles, MatrixTransposeSSE38YB,  TestMatWC 508x508 -LOCAL var+var- ups B
504648  cycles, MatrixTransposeSSE38XB,  TestMatWA 500x500 -push esi+edi- ups B
505222  cycles, MatrixTransposeSSE38XB,  TestMatWC 508x508 -push esi+edi- ups B
505627  cycles, MatrixTransposeSSE38YB,  TestMatWA 500x500 -LOCAL var+var- ups B
509811  cycles, MatrixTransposeSSE38XI,  TestMatWA 500x500 -push esi,edx- ups I
510504  cycles, MatrixTransposeSSE38YI,  TestMatWA 500x500 -LOCAL var,edx- ups I

512446  cycles, MatrixTransposeSSE38XI,  TestMatWC 508x508 -push esi,edx- ups I
512484  cycles, MatrixTransposeSSE38YI,  TestMatWZ 508x508 -LOCAL var,edx- ups I

681134  cycles, MatrixTransposeSSE38XA,  TestMatWB 504x504 -push esi+edi- ups A
681323  cycles, MatrixTransposeSSE38XC,  TestMatWB 504x504 -push esi+edi- ups C
681479  cycles, MatrixTransposeSSE38YC,  TestMatWB 504x504 -LOCAL var+var- ups C

682633  cycles, MatrixTransposeSSE38YA,  TestMatWB 504x504 -LOCAL var+var- ups A

792368  cycles, MatrixTransposeSSE38YC,  TestMatWA 500x500 -LOCAL var+var- ups C

792495  cycles, MatrixTransposeSSE38XA,  TestMatWA 500x500 -push esi+edi- ups A
793080  cycles, MatrixTransposeSSE38XC,  TestMatWA 500x500 -push esi+edi- ups C
793390  cycles, MatrixTransposeSSE38YA,  TestMatWA 500x500 -LOCAL var+var- ups A

829152  cycles, MatrixTransposeSSE38XB,  TestMatWW 512x512 -push esi+edi- ups B
829328  cycles, MatrixTransposeSSE38YB,  TestMatWW 512x512 -LOCAL var+var- ups B

838242  cycles, MatrixTransposeSSE38XI,  TestMatWW 512x512 -push esi,edx- ups I
838321  cycles, MatrixTransposeSSE38YI,  TestMatWC 512x512 -LOCAL var,edx- ups I

979318  cycles, MatrixTransposeSSE38XA,  TestMatWC 508x508 -push esi+edi- ups A
980942  cycles, MatrixTransposeSSE38XC,  TestMatWC 508x508 -push esi+edi- ups C
981519  cycles, MatrixTransposeSSE38YC,  TestMatWC 508x508 -LOCAL var+var- ups C

981731  cycles, MatrixTransposeSSE38YA,  TestMatWC 508x508 -LOCAL var+var- ups A

1318720  cycles, MatrixTransposeSSE38XC,  TestMatWW 512x512 -push esi+edi- ups C

1319166  cycles, MatrixTransposeSSE38YC,  TestMatWW 512x512 -LOCAL var+var- ups C
1321178  cycles, MatrixTransposeSSE38XA,  TestMatWW 512x512 -push esi+edi- ups A

1321444  cycles, MatrixTransposeSSE38YA,  TestMatWW 512x512 -LOCAL var+var- ups A
********** END **********
Creative coders use backward thinking techniques as a strategy.

hutch--

Rui,

Could you spare us these massive blocks of tables, put them in a ZIP file.

RuiLoureiro

#7
Quote from: hutch-- on June 04, 2018, 04:46:16 AM
Rui,

Could you spare us these massive blocks of tables, put them in a ZIP file.
Hutch,
          It is here, sorted by matrix type (4x4 takes little time ... 512x512 takes much more time).
          Do you have any comment about it? What's wrong ?
          From one basic procedure I wrote 8 different cases. As I have 4 procs, we have 32 different procedures to do the task. And so far I have written many other different cases for the same thing.

hutch--

I had a look at the zip file and the results look fine and it saves dumping large amounts of data like that directly into the forum. The problem with dumping large tables into the forum is it makes the topics unreadable. Also note that the Campus is for people who are learning assembler, it is not the place for advanced mathematics which should be posted in a more specialised forum.

RuiLoureiro

#9
Quote from: hutch-- on June 04, 2018, 11:08:12 AM
I had a look at the zip file and the results look fine and it saves dumping large amounts of data like that directly into the forum. The problem with dumping large tables into the forum is it makes the topics unreadable. Also note that the Campus is for people who are learning assembler, it is not the place for advanced mathematics which should be posted in a more specialised forum.

The main problem I posted here has nothing to do with advanced mathematics, as you "are seeing",
simply because the main problem here is not how to transpose a matrix. That was used only as time-consuming code.

I will try to explain in more detail, but remember that the title is:
"Solutions to save 1/2 CPU registers inside a procedure" (we may define local variables and use them).
I have a block of code inside a procedure where I defined a set of LOCAL variables, the same in all versions.
The code where I want to test the solutions to save/restore registers can be described like this:
Quote
    ; esi and edi are pointers already defined
    ; start here and do something - 5 instructions.
   
loopA:
    ; here we need to save esi and edi pointers           <<<<<---- this is the first problem HERE

        loopB:                                     ; <<< macro starts here
             ; here we need to save esi                       <<<<<---- this is the second problem HERE
   
                    ; here i put a time-consuming code         <<<<--- this is not the problem

             ; here we need to restore esi
             
             ; do something               
             ; while ecx is not the end go to loopB ; <<< macro ends here

     ; here we need to restore edi and esi
     ; do something
     ; while ebx is not the end goto loopA

From this, i wrote 4 cases XA, XB, XC, XI where the macros are A,B,C,I
        and another 4 cases YA, YB, YC, YI where the macros are A,B,C,I also.
     
Cases X: the solution is push esi + push edi, but case I is push esi + mov edx, edi

Cases Y: the solution is mov LOCAL, esi + mov LOCAL, edi, but case I is mov LOCAL, esi + mov edx, edi
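
For readers who want to see the three save/restore variants side by side, here is a minimal MASM32 sketch. It is not the code from the tested procedures: the procedure names and the LOCAL names SaveEsi and SaveEdi are made up for illustration, and the comment lines stand in for the time-consuming code.

```asm
; Sketch only: three ways to preserve esi/edi around a block of code.
; All names here are hypothetical, not taken from the posted procedures.

.686
.model flat, stdcall
option casemap:none

.code

; Cases X (A, B, C): save both registers on the stack
SaveWithPush proc
    push esi
    push edi
    ; ... time-consuming code that changes esi and edi ...
    pop  edi                ; restore in reverse order of the pushes
    pop  esi
    ret
SaveWithPush endp

; Cases Y (A, B, C): save both registers in LOCAL variables
SaveWithLocals proc
    LOCAL SaveEsi:DWORD
    LOCAL SaveEdi:DWORD
    mov  SaveEsi, esi
    mov  SaveEdi, edi
    ; ... time-consuming code that changes esi and edi ...
    mov  esi, SaveEsi
    mov  edi, SaveEdi
    ret
SaveWithLocals endp

; Case YI: one LOCAL variable plus a spare register
SaveWithLocalAndEdx proc
    LOCAL SaveEsi:DWORD
    mov  SaveEsi, esi
    mov  edx, edi           ; only valid if the code in between leaves edx untouched
    ; ... time-consuming code that changes esi and edi ...
    mov  esi, SaveEsi
    mov  edi, edx
    ret
SaveWithLocalAndEdx endp

end
```

Case XI would combine the first and third forms: push esi / pop esi for one register and mov edx, edi / mov edi, edx for the other. The edx variants only work when the code in between does not need edx for anything else.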

It is much more a problem of assembly than of math itself, because we may replace the time-consuming code with something else and still have the same main problem. Because the test needs to consume anything from very little time up to as much time as possible, using operations on matrices seems to me a good way to test what I want to test (IMO).

      Now, you realize what the problem is ? Is it what you call an advanced mathematics problem ?
      If it is, where Hutch ?
Regards :biggrin:

RuiLoureiro

Hi all  :biggrin:
            Thanks all for your work in posting your results (Jochen, zedd151, Siekmanski).
You have all the results sorted by matrix type, so you can see which solution is better.
In many cases it seems to be push esi + push edi. For me it was something of a surprise.
But it is what I usually do.
:t
mineiro,
            your results are not good.
            Thanks also

hutch--

Rui,
Quote
      Now, you realize what the problem is ? Is it what you call an advanced mathematics problem ?
      If it is, where Hutch ?
Be warned that if you start giving me lip you will get R_SOULED out the door faster than Halley's comet. I am not a free kick for anyone who thinks they can make a pest of themselves. I have asked you to stop dumping mountains of data in the forum, especially the Campus as it makes the forum unreadable for other members.

I have had to in the past move much of what you have posted due to the mess and I have even deleted some of it as you refuse to co-operate when asked. No matter what I will solve any problem you may initiate so do us all a favour and treat other people as you would want yourself to be treated.

jj2007

Hi Rui,

Your setup is really a bit "bloated" - can't you organise the results in a way that it produces only a few averages? What I usually do to get reliable timings is cut off the worst 10% (slow due to interrupts...), and take the average of the 90%.
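
As a rough illustration of the "cut off the worst 10%" idea, here is a minimal MASM32 sketch. TrimmedAverage is a hypothetical helper name, not part of any existing timing library, and the sketch assumes the cycle counts are DWORDs that have already been sorted in ascending order, so the slowest runs sit at the end of the array.

```asm
; Sketch only: average of the fastest 90% of a set of timings.
; TrimmedAverage is a hypothetical name; the sorting step is assumed done elsewhere.

.686
.model flat, stdcall
option casemap:none

.code

; pTimings -> array of DWORD cycle counts, ALREADY sorted ascending (assumption)
; count    =  number of entries, must be at least 1
; returns eax = average of the entries that are kept
TrimmedAverage proc uses esi pTimings:DWORD, count:DWORD
    mov  eax, count
    xor  edx, edx
    mov  ecx, 10
    div  ecx                ; eax = count / 10 = number of slow entries to discard
    mov  ecx, count
    sub  ecx, eax           ; ecx = number of entries to keep
    push ecx                ; remember it as the divisor
    mov  esi, pTimings
    xor  eax, eax           ; low dword of the running sum
    xor  edx, edx           ; high dword of the running sum
@@:
    add  eax, [esi]
    adc  edx, 0             ; carry into the high dword
    add  esi, 4
    dec  ecx
    jnz  @B
    pop  ecx
    div  ecx                ; edx:eax / kept entries, quotient in eax
    ret
TrimmedAverage endp

end
```

It would be called as invoke TrimmedAverage, addr Timings, NumRuns, where Timings and NumRuns are likewise placeholder names. With fewer than 10 entries nothing is discarded, because count / 10 is 0.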

RuiLoureiro

#13
Hi Jochen,
              I will try to think about it in the next problem.
This problem seems to be done.
Thanks  :t

EDIT: Jochen, I am using a little prog, ShowCpu, in the file Timing.inc to identify the CPU,
         but zedd151 says that it doesn't identify the frequency of his CPU (AMD ?). If I am right,
         that prog was written by you. Do you have a more recent version for these new cases ?
         Timing.inc is somewhere here, I think at least in the folder Converter8.

RuiLoureiro

Hutch,

I started this topic writing "... a little procedure to transpose some matrix NxN..."
which is not the main problem here. So you are right about what you said.
My apologies.

Regards