Testing Transpose of a Matrix

Started by RuiLoureiro, May 07, 2018, 02:13:55 AM


RuiLoureiro

Hi all,
        Could you run TestTranspose_cycles1 and post your results here?
        (Files: all the .asm files and Timing.inc are included.)

        Thanks  :t
Note: Each element of the matrices used here is a REAL4 number.

EDIT: See all results in my reply #7 below.
EDIT: tests using SSE

Here are my results:
Quote
434 cycles, MatrixTransposeZ, transposeMatX

399 cycles, MatrixTransposeDD, transposeMatZ

411 cycles, MatrixTransposeDF, transposeMatZ

417 cycles, MatrixTransposeX, transposeMatX

476 cycles, MatrixTransposeZZ, transposeMatX

470 cycles, MatrixTransposeDDD, transposeMatZ

540 cycles, MatrixTransposeDFF, transposeMatZ

414 cycles, MatrixTransposeXX, transposeMatX

1311 cycles, MatrixTransposeZ, transposeMatXX

1170 cycles, MatrixTransposeDD, transposeMatZZ

1156 cycles, MatrixTransposeDF, transposeMatZZ

1326 cycles, MatrixTransposeX, transposeMatXX

1620 cycles, MatrixTransposeZZ, transposeMatXX

1548 cycles, MatrixTransposeDDD, transposeMatZZ

1610 cycles, MatrixTransposeDFF, transposeMatZZ

1322 cycles, MatrixTransposeXX, transposeMatXX

*** STOP. Press any key to show the Time Table ***

***** Time table *****

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

399  cycles, MatrixTransposeDD, testMatZ 4x4- last to first
411  cycles, MatrixTransposeDF,  testMatZ 4x4- first to last
414  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
417  cycles, MatrixTransposeX,   testMatX 4x4, Lin, Col - last to first
434  cycles, MatrixTransposeZ,   testMatX 4x4- last to first
470  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
476  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
540  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last

1156  cycles, MatrixTransposeDF,  testMatZZ 7x8- first to last
1170  cycles, MatrixTransposeDD,  testMatZZ 7x8- last to first
1311  cycles, MatrixTransposeZ,   testMatXX 7x8- last to first
1322  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
1326  cycles, MatrixTransposeX,   testMatXX 7x8, Lin, Col - last to first
1548  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
1610  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
1620  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
********** END **********

Jochen  :t
For now (2 samples: P4, i5), MatrixTransposeDD seems to be the best.

jj2007

Intel Core i5:

145 cycles, MatrixTransposeZ, transposeMatX

109 cycles, MatrixTransposeDD, transposeMatZ

115 cycles, MatrixTransposeDF, transposeMatZ

192 cycles, MatrixTransposeX, transposeMatX

264 cycles, MatrixTransposeZZ, transposeMatX

140 cycles, MatrixTransposeDDD, transposeMatZ

118 cycles, MatrixTransposeDFF, transposeMatZ

200 cycles, MatrixTransposeXX, transposeMatX

332 cycles, MatrixTransposeZ, transposeMatXX

316 cycles, MatrixTransposeDD, transposeMatZZ

329 cycles, MatrixTransposeDF, transposeMatZZ

300 cycles, MatrixTransposeX, transposeMatXX

378 cycles, MatrixTransposeZZ, transposeMatXX

402 cycles, MatrixTransposeDDD, transposeMatZZ

387 cycles, MatrixTransposeDFF, transposeMatZZ

310 cycles, MatrixTransposeXX, transposeMatXX

*** STOP. Press any key to show the Time Table ***

***** Time table *****

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

109  cycles, MatrixTransposeDD, testMatZ 4x4- last to first
115  cycles, MatrixTransposeDF, testMatZ 4x4- first to last
118  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
140  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
145  cycles, MatrixTransposeZ,  testMatX 4x4- last to first
192  cycles, MatrixTransposeX,  testMatX 4x4, Lin, Col - last to first
200  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
264  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
300  cycles, MatrixTransposeX,  testMatXX 7x8, Lin, Col - last to first
310  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
316  cycles, MatrixTransposeDD, testMatZZ 7x8- last to first
329  cycles, MatrixTransposeDF, testMatZZ 7x8- first to last
332  cycles, MatrixTransposeZ,  testMatXX 7x8- last to first
378  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
387  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
402  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
********** END **********

zedd151


86 cycles, MatrixTransposeZ, transposeMatX

87 cycles, MatrixTransposeDD, transposeMatZ

91 cycles, MatrixTransposeDF, transposeMatZ

88 cycles, MatrixTransposeX, transposeMatX

92 cycles, MatrixTransposeZZ, transposeMatX

90 cycles, MatrixTransposeDDD, transposeMatZ

99 cycles, MatrixTransposeDFF, transposeMatZ

97 cycles, MatrixTransposeXX, transposeMatX

402 cycles, MatrixTransposeZ, transposeMatXX

315 cycles, MatrixTransposeDD, transposeMatZZ

318 cycles, MatrixTransposeDF, transposeMatZZ

324 cycles, MatrixTransposeX, transposeMatXX

664 cycles, MatrixTransposeZZ, transposeMatXX

272 cycles, MatrixTransposeDDD, transposeMatZZ

323 cycles, MatrixTransposeDFF, transposeMatZZ

369 cycles, MatrixTransposeXX, transposeMatXX

*** STOP. Press any key to show the Time Table ***

***** Time table *****

AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4) 1.6 GHz

86  cycles, MatrixTransposeZ,  testMatX 4x4- last to first
87  cycles, MatrixTransposeDD, testMatZ 4x4- last to first
88  cycles, MatrixTransposeX,  testMatX 4x4, Lin, Col - last to first
90  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
91  cycles, MatrixTransposeDF, testMatZ 4x4- first to last
92  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
97  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
99  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
272  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
315  cycles, MatrixTransposeDD, testMatZZ 7x8- last to first
318  cycles, MatrixTransposeDF, testMatZZ 7x8- first to last
323  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
324  cycles, MatrixTransposeX,  testMatXX 7x8, Lin, Col - last to first
369  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
402  cycles, MatrixTransposeZ,  testMatXX 7x8- last to first
664  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
********** END **********


:biggrin:

felipe


223 cycles, MatrixTransposeZ, transposeMatX

217 cycles, MatrixTransposeDD, transposeMatZ

219 cycles, MatrixTransposeDF, transposeMatZ

220 cycles, MatrixTransposeX, transposeMatX

290 cycles, MatrixTransposeZZ, transposeMatX

289 cycles, MatrixTransposeDDD, transposeMatZ

304 cycles, MatrixTransposeDFF, transposeMatZ

230 cycles, MatrixTransposeXX, transposeMatX

743 cycles, MatrixTransposeZ, transposeMatXX

746 cycles, MatrixTransposeDD, transposeMatZZ

743 cycles, MatrixTransposeDF, transposeMatZZ

780 cycles, MatrixTransposeX, transposeMatXX

1021 cycles, MatrixTransposeZZ, transposeMatXX

1016 cycles, MatrixTransposeDDD, transposeMatZZ

1055 cycles, MatrixTransposeDFF, transposeMatZZ

826 cycles, MatrixTransposeXX, transposeMatXX

*** STOP. Press any key to show the Time Table ***

***** Time table *****

Intel(R) Core(TM) i5 CPU         650  @ 3.20GHz (SSE4)

217  cycles, MatrixTransposeDD, testMatZ 4x4- last to first
219  cycles, MatrixTransposeDF, testMatZ 4x4- first to last
220  cycles, MatrixTransposeX,  testMatX 4x4, Lin, Col - last to first
223  cycles, MatrixTransposeZ,  testMatX 4x4- last to first
230  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
289  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
290  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
304  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
743  cycles, MatrixTransposeDF, testMatZZ 7x8- first to last
743  cycles, MatrixTransposeZ,  testMatXX 7x8- last to first
746  cycles, MatrixTransposeDD, testMatZZ 7x8- last to first
780  cycles, MatrixTransposeX,  testMatXX 7x8, Lin, Col - last to first
826  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
1016  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
1021  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
1055  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
********** END **********


i5 here too. Windows 8.1.  :icon14:

LiaoMi

100 cycles, MatrixTransposeZ, transposeMatX

101 cycles, MatrixTransposeDD, transposeMatZ

101 cycles, MatrixTransposeDF, transposeMatZ

79 cycles, MatrixTransposeX, transposeMatX

117 cycles, MatrixTransposeZZ, transposeMatX

98 cycles, MatrixTransposeDDD, transposeMatZ

123 cycles, MatrixTransposeDFF, transposeMatZ

102 cycles, MatrixTransposeXX, transposeMatX

368 cycles, MatrixTransposeZ, transposeMatXX

337 cycles, MatrixTransposeDD, transposeMatZZ

333 cycles, MatrixTransposeDF, transposeMatZZ

258 cycles, MatrixTransposeX, transposeMatXX

413 cycles, MatrixTransposeZZ, transposeMatXX

317 cycles, MatrixTransposeDDD, transposeMatZZ

375 cycles, MatrixTransposeDFF, transposeMatZZ

266 cycles, MatrixTransposeXX, transposeMatXX

*** STOP. Press any key to show the Time Table ***

***** Time table *****

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

79  cycles, MatrixTransposeX,  testMatX 4x4, Lin, Col - last to first
98  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
100  cycles, MatrixTransposeZ,  testMatX 4x4- last to first
101  cycles, MatrixTransposeDD, testMatZ 4x4- last to first
101  cycles, MatrixTransposeDF, testMatZ 4x4- first to last
102  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
117  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
123  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
258  cycles, MatrixTransposeX,  testMatXX 7x8, Lin, Col - last to first
266  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
317  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
333  cycles, MatrixTransposeDF, testMatZZ 7x8- first to last
337  cycles, MatrixTransposeDD, testMatZZ 7x8- last to first
368  cycles, MatrixTransposeZ,  testMatXX 7x8- last to first
375  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
413  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
********** END **********

Siekmanski

Intel i7-4930K, Win 8.1

203 cycles, MatrixTransposeZ, transposeMatX

125 cycles, MatrixTransposeDD, transposeMatZ

123 cycles, MatrixTransposeDF, transposeMatZ

101 cycles, MatrixTransposeX, transposeMatX

133 cycles, MatrixTransposeZZ, transposeMatX

117 cycles, MatrixTransposeDDD, transposeMatZ

139 cycles, MatrixTransposeDFF, transposeMatZ

103 cycles, MatrixTransposeXX, transposeMatX

384 cycles, MatrixTransposeZ, transposeMatXX

398 cycles, MatrixTransposeDD, transposeMatZZ

380 cycles, MatrixTransposeDF, transposeMatZZ

318 cycles, MatrixTransposeX, transposeMatXX

430 cycles, MatrixTransposeZZ, transposeMatXX

380 cycles, MatrixTransposeDDD, transposeMatZZ

465 cycles, MatrixTransposeDFF, transposeMatZZ

330 cycles, MatrixTransposeXX, transposeMatXX

*** STOP. Press any key to show the Time Table ***

***** Time table *****

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

101  cycles, MatrixTransposeX,  testMatX 4x4, Lin, Col - last to first
103  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
117  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
123  cycles, MatrixTransposeDF, testMatZ 4x4- first to last
125  cycles, MatrixTransposeDD, testMatZ 4x4- last to first
133  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
139  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
203  cycles, MatrixTransposeZ,  testMatX 4x4- last to first
318  cycles, MatrixTransposeX,  testMatXX 7x8, Lin, Col - last to first
330  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
380  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
380  cycles, MatrixTransposeDF, testMatZZ 7x8- first to last
384  cycles, MatrixTransposeZ,  testMatXX 7x8- last to first
398  cycles, MatrixTransposeDD, testMatZZ 7x8- last to first
430  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
465  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
********** END **********

HSE

248 cycles, MatrixTransposeZ, transposeMatX

203 cycles, MatrixTransposeDD, transposeMatZ

225 cycles, MatrixTransposeDF, transposeMatZ

192 cycles, MatrixTransposeX, transposeMatX

255 cycles, MatrixTransposeZZ, transposeMatX

226 cycles, MatrixTransposeDDD, transposeMatZ

256 cycles, MatrixTransposeDFF, transposeMatZ

190 cycles, MatrixTransposeXX, transposeMatX

752 cycles, MatrixTransposeZ, transposeMatXX

668 cycles, MatrixTransposeDD, transposeMatZZ

727 cycles, MatrixTransposeDF, transposeMatZZ

619 cycles, MatrixTransposeX, transposeMatXX

830 cycles, MatrixTransposeZZ, transposeMatXX

718 cycles, MatrixTransposeDDD, transposeMatZZ

831 cycles, MatrixTransposeDFF, transposeMatZZ

621 cycles, MatrixTransposeXX, transposeMatXX

*** STOP. Press any key to show the Time Table ***

***** Time table *****

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

190  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
192  cycles, MatrixTransposeX,  testMatX 4x4, Lin, Col - last to first
203  cycles, MatrixTransposeDD, testMatZ 4x4- last to first
225  cycles, MatrixTransposeDF, testMatZ 4x4- first to last
226  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
248  cycles, MatrixTransposeZ,  testMatX 4x4- last to first
255  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
256  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
619  cycles, MatrixTransposeX,  testMatXX 7x8, Lin, Col - last to first
621  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
668  cycles, MatrixTransposeDD, testMatZZ 7x8- last to first
718  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
727  cycles, MatrixTransposeDF, testMatZZ 7x8- first to last
752  cycles, MatrixTransposeZ,  testMatXX 7x8- last to first
830  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
831  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
********** END **********

RuiLoureiro

#7
Hi all,
        Thank you for posting your results (they are collected below). :t

        For now, I want to point out how the matrices used in this test are defined.
        Note that immediately before each name there is a BYTE (for X) or a DWORD (for Z)
        holding the number of LINES and, before that, one holding the number of columns.

        The procedures ...X , ...Z , ...DD , ...DF       use EDI (push edi / pop edi).
        The procedures ...XX, ...ZZ, ...DDD, ...DFF don't use EDI, but are otherwise the same as the ones above.

        To the procedures ...X and ...XX we must pass the address of a sequence of DWORDs,
        treated as REAL4 values, plus the number of lines and the number of columns. I am
        using them here only for test purposes.
        :icon14:
Quote
                    DEFINITIONS OF MATRICES
                     --------------------------------
MAXLINES_X      equ 4                 ; number of lines
MAXCOLUMNS_X    equ 4                 ; number of columns
MAXDWORDS_X     equ MAXLINES_X * MAXCOLUMNS_X
;-------------------------------------------------------
;                       testMatX
;=======================================================
                  db MAXCOLUMNS_X           ;           <<< is BYTE
                  db MAXLINES_X             ;           <<< is BYTE
testMatX          dd 11.0,12.0,13.0,14.0    ;     line 1
                  dd 21.0,22.0,23.0,24.0    ; +16 line 2
                  dd 31.0,32.0,33.0,34.0    ; +32 line 3
                  dd 41.0,42.0,43.0,44.0    ; +48 line 4
;-------------------------------------------------------
;                       testMatZ
;=======================================================
ALIGN 16
                  dd ?
                  dd ?
                  dd MAXCOLUMNS_X               ;           <<< is DWORD
                  dd MAXLINES_X                 ;           <<< is DWORD
testMatZ          dd 11.1, 12.2, 13.3, 14.4     ;   line 1
                  dd 21.1, 22.2, 23.3, 24.4     ;   line 2
                  dd 31.1, 32.2, 33.3, 34.4     ;   line 3
                  dd 41.1, 42.2, 43.3, 44.4     ;   line 4
;=======================================================
MAXLINES_Z      equ 7                 ; number of lines
MAXCOLUMNS_Z    equ 8                 ; number of columns
MAXDWORDS_Z     equ MAXLINES_Z * MAXCOLUMNS_Z
;-------------------------------------------------------
;                       testMatXX
;=======================================================
                  db MAXCOLUMNS_Z               ;          <<< is BYTE
                  db MAXLINES_Z                 ;          <<< is BYTE
testMatXX         dd 11.1, 12.2, 13.3, 14.4, 15.5, 16.6, 17.7, 18.8    ;   line 1
                  dd 21.1, 22.2, 23.3, 24.4, 25.5, 26.6, 27.7, 28.8    ;   line 2
                  dd 31.1, 32.2, 33.3, 34.4, 35.5, 36.6, 37.7, 38.8    ;   line 3
                  dd 41.1, 42.2, 43.3, 44.4, 45.5, 46.6, 47.7, 48.8    ;   line 4
                  dd 51.1, 52.2, 53.3, 54.4, 55.5, 56.6, 57.7, 58.8    ;   line 5
                  dd 61.1, 62.2, 63.3, 64.4, 65.5, 66.6, 67.7, 68.8    ;   line 6
                  dd 71.1, 72.2, 73.3, 74.4, 75.5, 76.6, 77.7, 78.8    ;   line 7
;-------------------------------------------------------
;                       testMatZZ
;=======================================================
ALIGN 16
                  dd ?
                  dd ?
                  dd MAXCOLUMNS_Z               ;           <<< is DWORD
                  dd MAXLINES_Z                 ;           <<< is DWORD
testMatZZ         dd 11.1, 12.2, 13.3, 14.4, 15.5, 16.6, 17.7, 18.8    ;   line 1
                  dd 21.1, 22.2, 23.3, 24.4, 25.5, 26.6, 27.7, 28.8    ;   line 2
                  dd 31.1, 32.2, 33.3, 34.4, 35.5, 36.6, 37.7, 38.8    ;   line 3
                  dd 41.1, 42.2, 43.3, 44.4, 45.5, 46.6, 47.7, 48.8    ;   line 4
                  dd 51.1, 52.2, 53.3, 54.4, 55.5, 56.6, 57.7, 58.8    ;   line 5
                  dd 61.1, 62.2, 63.3, 64.4, 65.5, 66.6, 67.7, 68.8    ;   line 6
                  dd 71.1, 72.2, 73.3, 74.4, 75.5, 76.6, 77.7, 78.8    ;   line 7
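For readers more at home in C, the layout defined above can be sketched as follows. This is only an illustration, assuming the DWORD header used by testMatZ and testMatZZ (column count 8 bytes before the label, line count 4 bytes before it). The names mat_dims and mat_transpose are hypothetical; they are not the MatrixTranspose* procedures from the attachment, whose source is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers illustrating the "header before the label" layout.
   Assumed layout (as in testMatZ / testMatZZ):
       [data - 8] = DWORD number of columns
       [data - 4] = DWORD number of lines
       [data + 0] = first REAL4 element, stored line by line            */

/* Read the two header DWORDs that sit just before the first element. */
void mat_dims(const float *data, uint32_t *lines, uint32_t *cols)
{
    const unsigned char *p = (const unsigned char *)data;
    memcpy(cols,  p - 8, sizeof *cols);
    memcpy(lines, p - 4, sizeof *lines);
}

/* Out-of-place transpose: writes the swapped header in front of dst and
   stores dst[c][r] = src[r][c]. dst must have 8 writable bytes before it
   and room for lines * cols elements. */
void mat_transpose(const float *src, float *dst)
{
    uint32_t lines, cols;
    assert(src != dst);               /* this sketch does not work in place */
    mat_dims(src, &lines, &cols);

    unsigned char *h = (unsigned char *)dst;
    memcpy(h - 8, &lines, sizeof lines);  /* new column count = old line count */
    memcpy(h - 4, &cols,  sizeof cols);   /* new line count = old column count */

    for (uint32_t r = 0; r < lines; ++r)
        for (uint32_t c = 0; c < cols; ++c)
            dst[(size_t)c * lines + r] = src[(size_t)r * cols + c];
}
```

Applied to a 7x8 matrix laid out like testMatZZ, mat_transpose(src, dst) would leave an 8x7 matrix at dst with the two header DWORDs swapped accordingly.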
All results
Quote
RuiLoureiro:
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

399  cycles, MatrixTransposeDD,  testMatZ 4x4- last to first
411  cycles, MatrixTransposeDF,  testMatZ 4x4- first to last
414  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
417  cycles, MatrixTransposeX,   testMatX 4x4, Lin, Col - last to first
434  cycles, MatrixTransposeZ,   testMatX 4x4- last to first
470  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
476  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
540  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last

1156  cycles, MatrixTransposeDF,  testMatZZ 7x8- first to last
1170  cycles, MatrixTransposeDD,  testMatZZ 7x8- last to first
1311  cycles, MatrixTransposeZ,   testMatXX 7x8- last to first
1322  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
1326  cycles, MatrixTransposeX,   testMatXX 7x8, Lin, Col - last to first
1548  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
1610  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
1620  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
********** END **********

FORTRANS:
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

221  cycles, MatrixTransposeDD,  testMatZ 4x4- last to first
222  cycles, MatrixTransposeZ,   testMatX 4x4- last to first
249  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
249  cycles, MatrixTransposeDF,  testMatZ 4x4- first to last
251  cycles, MatrixTransposeX,   testMatX 4x4, Lin, Col - last to first
282  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
286  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
337  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last

722  cycles, MatrixTransposeDD,  testMatZZ 7x8- last to first
725  cycles, MatrixTransposeZ,   testMatXX 7x8- last to first
828  cycles, MatrixTransposeDF,  testMatZZ 7x8- first to last
868  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
872  cycles, MatrixTransposeX,   testMatXX 7x8, Lin, Col - last to first
944  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
952  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
1050  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
********** END **********
Jochen:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

109  cycles, MatrixTransposeDD,  testMatZ 4x4- last to first
115  cycles, MatrixTransposeDF,  testMatZ 4x4- first to last
118  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
140  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
145  cycles, MatrixTransposeZ,   testMatX 4x4- last to first
192  cycles, MatrixTransposeX,   testMatX 4x4, Lin, Col - last to first
200  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
264  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first

300  cycles, MatrixTransposeX,   testMatXX 7x8, Lin, Col - last to first
310  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
316  cycles, MatrixTransposeDD,  testMatZZ 7x8- last to first
329  cycles, MatrixTransposeDF,  testMatZZ 7x8- first to last
332  cycles, MatrixTransposeZ,   testMatXX 7x8- last to first
378  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
387  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
402  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
********** END **********
Felipe:
Intel(R) Core(TM) i5 CPU         650  @ 3.20GHz (SSE4)

217  cycles, MatrixTransposeDD,  testMatZ 4x4- last to first
219  cycles, MatrixTransposeDF,  testMatZ 4x4- first to last
220  cycles, MatrixTransposeX,   testMatX 4x4, Lin, Col - last to first
223  cycles, MatrixTransposeZ,   testMatX 4x4- last to first
230  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
289  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
290  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
304  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last

743  cycles, MatrixTransposeDF,  testMatZZ 7x8- first to last
743  cycles, MatrixTransposeZ,   testMatXX 7x8- last to first
746  cycles, MatrixTransposeDD,  testMatZZ 7x8- last to first
780  cycles, MatrixTransposeX,   testMatXX 7x8, Lin, Col - last to first
826  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
1016  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
1021  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
1055  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
********** END **********
aw27:
Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz (SSE4)

68  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
73  cycles, MatrixTransposeX,  testMatX 4x4, Lin, Col - last to first
84  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
87  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
87  cycles, MatrixTransposeDF, testMatZ 4x4- first to last
94  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
97  cycles, MatrixTransposeZ,  testMatX 4x4- last to first
103  cycles, MatrixTransposeDD, testMatZ 4x4- last to first

225  cycles, MatrixTransposeX,  testMatXX 7x8, Lin, Col - last to first
231  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
243  cycles, MatrixTransposeDF, testMatZZ 7x8- first to last
249  cycles, MatrixTransposeZ,  testMatXX 7x8- last to first
255  cycles, MatrixTransposeDD, testMatZZ 7x8- last to first
284  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
286  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
289  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
********** END **********
LiaoMi:

Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz (SSE4)

79  cycles, MatrixTransposeX,     testMatX 4x4, Lin, Col - last to first
98  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
100  cycles, MatrixTransposeZ,    testMatX 4x4- last to first
101  cycles, MatrixTransposeDD,  testMatZ 4x4- last to first
101  cycles, MatrixTransposeDF,  testMatZ 4x4- first to last
102  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
117  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
123  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last

258  cycles, MatrixTransposeX,   testMatXX 7x8, Lin, Col - last to first
266  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
317  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
333  cycles, MatrixTransposeDF,  testMatZZ 7x8- first to last
337  cycles, MatrixTransposeDD,  testMatZZ 7x8- last to first
368  cycles, MatrixTransposeZ,   testMatXX 7x8- last to first
375  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
413  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
********** END **********
Siekmanski:
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

101  cycles, MatrixTransposeX,   testMatX 4x4, Lin, Col - last to first
103  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
117  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
123  cycles, MatrixTransposeDF,  testMatZ 4x4- first to last
125  cycles, MatrixTransposeDD,  testMatZ 4x4- last to first
133  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
139  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
203  cycles, MatrixTransposeZ,   testMatX 4x4- last to first

318  cycles, MatrixTransposeX,   testMatXX 7x8, Lin, Col - last to first
330  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
380  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
380  cycles, MatrixTransposeDF,  testMatZZ 7x8- first to last
384  cycles, MatrixTransposeZ,   testMatXX 7x8- last to first
398  cycles, MatrixTransposeDD,  testMatZZ 7x8- last to first
430  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
465  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
********** END **********
; ««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
zedd151:
AMD A6-9220e RADEON R4, 5 COMPUTE CORES 2C+3G   (SSE4) 1.6 GHz

86  cycles, MatrixTransposeZ,    testMatX 4x4- last to first
87  cycles, MatrixTransposeDD,   testMatZ 4x4- last to first
88  cycles, MatrixTransposeX,    testMatX 4x4, Lin, Col - last to first
90  cycles, MatrixTransposeDDD,  testMatZ 4x4- last to first
91  cycles, MatrixTransposeDF,   testMatZ 4x4- first to last
92  cycles, MatrixTransposeZZ,   testMatX 4x4- last to first
97  cycles, MatrixTransposeXX,   testMatX 4x4, Lin, Col - last to first
99  cycles, MatrixTransposeDFF,  testMatZ 4x4- first to last

272  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
315  cycles, MatrixTransposeDD,  testMatZZ 7x8- last to first
318  cycles, MatrixTransposeDF,  testMatZZ 7x8- first to last
323  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
324  cycles, MatrixTransposeX,   testMatXX 7x8, Lin, Col - last to first
369  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
402  cycles, MatrixTransposeZ,   testMatXX 7x8- last to first
664  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
********** END **********
HSE:
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

190  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
192  cycles, MatrixTransposeX,   testMatX 4x4, Lin, Col - last to first
203  cycles, MatrixTransposeDD,  testMatZ 4x4- last to first
225  cycles, MatrixTransposeDF,  testMatZ 4x4- first to last
226  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
248  cycles, MatrixTransposeZ,   testMatX 4x4- last to first
255  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
256  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last

619  cycles, MatrixTransposeX,   testMatXX 7x8, Lin, Col - last to first
621  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
668  cycles, MatrixTransposeDD,  testMatZZ 7x8- last to first
718  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
727  cycles, MatrixTransposeDF,  testMatZZ 7x8- first to last
752  cycles, MatrixTransposeZ,   testMatXX 7x8- last to first
830  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
831  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
********** END **********

aw27

Some work was done on transposing matrices with SSE instructions: http://masm32.com/board/index.php?topic=6140.0.
A quick test shows that it is a few times faster, and the bigger the matrix, the bigger the gain. No wonder, of course.
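As an illustration of the SSE approach (this is not the code from the linked topic), a 4x4 REAL4 transpose can be sketched in C with the _MM_TRANSPOSE4_PS macro from xmmintrin.h. The name transpose4x4 is hypothetical, and the scalar branch is only there so that the sketch still builds with compilers that do not define __SSE__. No timing is claimed for this sketch.

```c
#include <assert.h>
#ifdef __SSE__
#include <xmmintrin.h>   /* __m128, _mm_loadu_ps, _mm_storeu_ps, _MM_TRANSPOSE4_PS */
#endif

/* Hypothetical helper: transpose a 4x4 matrix of floats stored line by line. */
void transpose4x4(const float *src, float *dst)
{
    assert(src != dst);  /* keep the contract identical for both branches */
#ifdef __SSE__
    /* Load the four lines; unaligned loads, so no alignment is assumed. */
    __m128 r0 = _mm_loadu_ps(src + 0);
    __m128 r1 = _mm_loadu_ps(src + 4);
    __m128 r2 = _mm_loadu_ps(src + 8);
    __m128 r3 = _mm_loadu_ps(src + 12);

    /* Transposes the four registers in place using shuffles. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    _mm_storeu_ps(dst + 0,  r0);
    _mm_storeu_ps(dst + 4,  r1);
    _mm_storeu_ps(dst + 8,  r2);
    _mm_storeu_ps(dst + 12, r3);
#else
    /* Scalar fallback with the same result. */
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            dst[c * 4 + r] = src[r * 4 + c];
#endif
}
```

Matrices that are not a multiple of 4 in both dimensions, such as the 7x8 test matrix, would need extra handling for the leftover line, which this sketch does not attempt.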

FORTRANS

Hi,

   Windows 2000; "Is not a valid Win32 application."  In the past,
some of these would work after reassembling.

   Windows XP:
222 cycles, MatrixTransposeZ, transposeMatX

221 cycles, MatrixTransposeDD, transposeMatZ

249 cycles, MatrixTransposeDF, transposeMatZ

251 cycles, MatrixTransposeX, transposeMatX

282 cycles, MatrixTransposeZZ, transposeMatX

286 cycles, MatrixTransposeDDD, transposeMatZ

337 cycles, MatrixTransposeDFF, transposeMatZ

249 cycles, MatrixTransposeXX, transposeMatX

725 cycles, MatrixTransposeZ, transposeMatXX

722 cycles, MatrixTransposeDD, transposeMatZZ

828 cycles, MatrixTransposeDF, transposeMatZZ

872 cycles, MatrixTransposeX, transposeMatXX

952 cycles, MatrixTransposeZZ, transposeMatXX

944 cycles, MatrixTransposeDDD, transposeMatZZ

1050 cycles, MatrixTransposeDFF, transposeMatZZ

868 cycles, MatrixTransposeXX, transposeMatXX

*** STOP. Press any key to show the Time Table ***

***** Time table *****

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)

221  cycles, MatrixTransposeDD, testMatZ 4x4- last to first
222  cycles, MatrixTransposeZ,  testMatX 4x4- last to first
249  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
249  cycles, MatrixTransposeDF, testMatZ 4x4- first to last
251  cycles, MatrixTransposeX,  testMatX 4x4, Lin, Col - last to first
282  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
286  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
337  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
722  cycles, MatrixTransposeDD, testMatZZ 7x8- last to first
725  cycles, MatrixTransposeZ,  testMatXX 7x8- last to first
828  cycles, MatrixTransposeDF, testMatZZ 7x8- first to last
868  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
872  cycles, MatrixTransposeX,  testMatXX 7x8, Lin, Col - last to first
944  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
952  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
1050  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
********** END **********


Cheers,

Steve N.

FORTRANS

Hi,

Quote from: FORTRANS on May 07, 2018, 11:09:05 PM
   Windows 2000; "Is not a valid Win32 application."  In the past,
some of these would work after reassembling.

   Well, tried that and it worked...

Win2k:

274 cycles, MatrixTransposeZ, transposeMatX

252 cycles, MatrixTransposeDD, transposeMatZ

258 cycles, MatrixTransposeDF, transposeMatZ

273 cycles, MatrixTransposeX, transposeMatX

288 cycles, MatrixTransposeZZ, transposeMatX

280 cycles, MatrixTransposeDDD, transposeMatZ

301 cycles, MatrixTransposeDFF, transposeMatZ

272 cycles, MatrixTransposeXX, transposeMatX

986 cycles, MatrixTransposeZ, transposeMatXX

963 cycles, MatrixTransposeDD, transposeMatZZ

943 cycles, MatrixTransposeDF, transposeMatZZ

1005 cycles, MatrixTransposeX, transposeMatXX

1048 cycles, MatrixTransposeZZ, transposeMatXX

1032 cycles, MatrixTransposeDDD, transposeMatZZ

1165 cycles, MatrixTransposeDFF, transposeMatZZ

999 cycles, MatrixTransposeXX, transposeMatXX

*** STOP. Press any key to show the Time Table ***

***** Time table *****

 (SSE1)

252  cycles, MatrixTransposeDD, testMatZ 4x4- last to first
258  cycles, MatrixTransposeDF, testMatZ 4x4- first to last
272  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
273  cycles, MatrixTransposeX,  testMatX 4x4, Lin, Col - last to first
274  cycles, MatrixTransposeZ,  testMatX 4x4- last to first
280  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
288  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
301  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
943  cycles, MatrixTransposeDF, testMatZZ 7x8- first to last
963  cycles, MatrixTransposeDD, testMatZZ 7x8- last to first
986  cycles, MatrixTransposeZ,  testMatXX 7x8- last to first
999  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
1005  cycles, MatrixTransposeX,  testMatXX 7x8, Lin, Col - last to first
1032  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
1048  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
1165  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
********** END **********


Win98:


904 cycles, MatrixTransposeZ, transposeMatX

640 cycles, MatrixTransposeDD, transposeMatZ

639 cycles, MatrixTransposeDF, transposeMatZ

914 cycles, MatrixTransposeX, transposeMatX

1014 cycles, MatrixTransposeZZ, transposeMatX

687 cycles, MatrixTransposeDDD, transposeMatZ

692 cycles, MatrixTransposeDFF, transposeMatZ

869 cycles, MatrixTransposeXX, transposeMatX

3474 cycles, MatrixTransposeZ, transposeMatXX

2164 cycles, MatrixTransposeDD, transposeMatZZ

2181 cycles, MatrixTransposeDF, transposeMatZZ

3278 cycles, MatrixTransposeX, transposeMatXX

3595 cycles, MatrixTransposeZZ, transposeMatXX

2322 cycles, MatrixTransposeDDD, transposeMatZZ

2361 cycles, MatrixTransposeDFF, transposeMatZZ

3137 cycles, MatrixTransposeXX, transposeMatXX

*** STOP. Press any key to show the Time Table ***

***** Time table *****


639  cycles, MatrixTransposeDF, testMatZ 4x4- first to last
640  cycles, MatrixTransposeDD, testMatZ 4x4- last to first
687  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
692  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
869  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
904  cycles, MatrixTransposeZ,  testMatX 4x4- last to first
914  cycles, MatrixTransposeX,  testMatX 4x4, Lin, Col - last to first
1014  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
2164  cycles, MatrixTransposeDD, testMatZZ 7x8- last to first
2181  cycles, MatrixTransposeDF, testMatZZ 7x8- first to last
2322  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
2361  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
3137  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
3278  cycles, MatrixTransposeX,  testMatXX 7x8, Lin, Col - last to first
3474  cycles, MatrixTransposeZ,  testMatXX 7x8- last to first
3595  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
********** END **********


Regards,

Steve N.

RuiLoureiro

#11
Quote from: aw27 on May 07, 2018, 08:03:04 PM
There was some work done on matrix transposing using SSE instructions http://masm32.com/board/index.php?topic=6140.0.
A quick test shows that it is a few times faster, and the bigger the matrix, the faster it is. No wonder, of course.
Hi,
   Yes, in general, code that uses SSE instructions on aligned data is faster.
   That is the case here.
:icon14:
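To make the "SSE plus aligned data" point concrete, here is a minimal C sketch of a 4x4 REAL4 transpose using the standard SSE intrinsics. It is not the MASM code from the linked topic. The function name transpose4x4_sse is made up for illustration, and a row-major layout is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE1 intrinsics: __m128, _mm_load_ps, _MM_TRANSPOSE4_PS */

/* Hypothetical helper (not from the thread): transpose a 4x4 matrix of
   floats stored row-major. Both pointers must be 16-byte aligned, because
   _mm_load_ps and _mm_store_ps are the aligned forms (movaps). */
static void transpose4x4_sse(const float *src, float *dst)
{
    assert(((uintptr_t)src & 15) == 0);
    assert(((uintptr_t)dst & 15) == 0);

    /* One 128-bit register holds one row of four floats */
    __m128 r0 = _mm_load_ps(src + 0);
    __m128 r1 = _mm_load_ps(src + 4);
    __m128 r2 = _mm_load_ps(src + 8);
    __m128 r3 = _mm_load_ps(src + 12);

    /* Standard macro from xmmintrin.h: transposes the four rows in place
       using unpack/move (or shuffle) instructions, no scalar moves */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    _mm_store_ps(dst + 0, r0);
    _mm_store_ps(dst + 4, r1);
    _mm_store_ps(dst + 8, r2);
    _mm_store_ps(dst + 12, r3);
}
```

The whole matrix is moved with four loads and four stores, where the scalar procedures need sixteen of each. Calling it with two 16-byte aligned arrays of 16 floats (for example declared with _Alignas(16)) is enough to try it.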
EDIT:
These are my results:
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

180  cycles, MatrixTransposeMO,  testMatZ 4x4
182  cycles, MatrixTransposeAW,  testMatZ 4x4, Lin, Col
398  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
404  cycles, MatrixTransposeDD,  testMatZ 4x4- last to first
405  cycles, MatrixTransposeX,   testMatX 4x4, Lin, Col - last to first
413  cycles, MatrixTransposeDF,  testMatZ 4x4- first to last
426  cycles, MatrixTransposeZ,   testMatX 4x4- last to first

458  cycles, MatrixTransposeMO,  testMatZZ 7x8
462  cycles, MatrixTransposeZZ,  testMatX  4x4- last to first
468  cycles, MatrixTransposeAW,  testMatZZ 7x8, Lin, Col
487  cycles, MatrixTransposeDDD, testMatZ  4x4- last to first
528  cycles, MatrixTransposeDFF, testMatZ  4x4- first to last

1153  cycles, MatrixTransposeDF,  testMatZZ 7x8- first to last
1171  cycles, MatrixTransposeDD,  testMatZZ 7x8- last to first
1280  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
1294  cycles, MatrixTransposeX,   testMatXX 7x8, Lin, Col - last to first
1408  cycles, MatrixTransposeZ,   testMatXX 7x8- last to first
1531  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
1563  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
1620  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
********** END **********

Note: MatrixTransposeAW is your version.
         MatrixTransposeMO is my modification of your version, in which the line and column counts
         are stored behind the address in both matrices.

aw27

Hello Rui,

There are significant differences between the results for your code and the results for mine.
These are the results for your code:
97 cycles, MatrixTransposeZ, transposeMatX

103 cycles, MatrixTransposeDD, transposeMatZ

87 cycles, MatrixTransposeDF, transposeMatZ

73 cycles, MatrixTransposeX, transposeMatX

87 cycles, MatrixTransposeZZ, transposeMatX

94 cycles, MatrixTransposeDDD, transposeMatZ

84 cycles, MatrixTransposeDFF, transposeMatZ

68 cycles, MatrixTransposeXX, transposeMatX

249 cycles, MatrixTransposeZ, transposeMatXX

255 cycles, MatrixTransposeDD, transposeMatZZ

243 cycles, MatrixTransposeDF, transposeMatZZ

225 cycles, MatrixTransposeX, transposeMatXX

289 cycles, MatrixTransposeZZ, transposeMatXX

284 cycles, MatrixTransposeDDD, transposeMatZZ

286 cycles, MatrixTransposeDFF, transposeMatZZ

231 cycles, MatrixTransposeXX, transposeMatXX

*** STOP. Press any key to show the Time Table ***

***** Time table *****

Intel(R) Core(TM) i5-7300HQ CPU @ 2.50GHz (SSE4)

68  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
73  cycles, MatrixTransposeX,  testMatX 4x4, Lin, Col - last to first
84  cycles, MatrixTransposeDFF, testMatZ 4x4- first to last
87  cycles, MatrixTransposeZZ,  testMatX 4x4- last to first
87  cycles, MatrixTransposeDF, testMatZ 4x4- first to last
94  cycles, MatrixTransposeDDD, testMatZ 4x4- last to first
97  cycles, MatrixTransposeZ,  testMatX 4x4- last to first
103  cycles, MatrixTransposeDD, testMatZ 4x4- last to first
225  cycles, MatrixTransposeX,  testMatXX 7x8, Lin, Col - last to first
231  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
243  cycles, MatrixTransposeDF, testMatZZ 7x8- first to last
249  cycles, MatrixTransposeZ,  testMatXX 7x8- last to first
255  cycles, MatrixTransposeDD, testMatZZ 7x8- last to first
284  cycles, MatrixTransposeDDD, testMatZZ 7x8- last to first
286  cycles, MatrixTransposeDFF, testMatZZ 7x8- first to last
289  cycles, MatrixTransposeZZ,  testMatXX 7x8- last to first
********** END **********

These are my code results (I reused your test setup):

32 cycles, MatrixTransposeAW, transposeMatX
81 cycles, MatrixTransposeAW, transposeMatXX

It takes roughly 1/3 the number of cycles: 32 against 68 to 103 for the 4x4 matrix, and 81 against 225 to 289 for the 7x8 matrix.

RuiLoureiro

#13
Hello aw27,
           I think that your results are reasonable, particularly on your i5 (see all the results in reply #7).
           In general, code that uses SSE instructions on aligned data is faster.
           If you cannot use SSE instructions ...
           
But using your trans32.asm from the transposeTest folder,
I got this on my Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3):
Quote
245 cycles, MatrixTransposeAW, transposeMatX         ; 3* = 735

661 cycles, MatrixTransposeAW, transposeMatXX       ; 3* =1983
Some previous results:
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

398  cycles, MatrixTransposeXX,  testMatX 4x4, Lin, Col - last to first
404  cycles, MatrixTransposeDD,  testMatZ 4x4- last to first
405  cycles, MatrixTransposeX,   testMatX 4x4, Lin, Col - last to first
413  cycles, MatrixTransposeDF,  testMatZ 4x4- first to last

1153  cycles, MatrixTransposeDF,  testMatZZ 7x8- first to last
1171  cycles, MatrixTransposeDD,  testMatZZ 7x8- last to first
1280  cycles, MatrixTransposeXX,  testMatXX 7x8, Lin, Col - last to first
1294  cycles, MatrixTransposeX,   testMatXX 7x8, Lin, Col - last to first

RuiLoureiro

Hi all,
        I wrote some procedures using SSE instructions (and Siekmanski's proc to transpose a 4x4 matrix :t ).
        This work is based on simple concepts and math for kids :biggrin: .
        You may test them on your CPU. If you want, you may post your results.
See you
:icon14:
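One simple way to build a general transpose on top of a 4x4 SSE transpose is to work in 4x4 tiles: the tile at (row r, column c) of the source, once transposed, lands at (row c, column r) of the destination, and the edges that do not fill a tile are copied element by element. The C sketch below shows that idea only. It is an assumption about the general approach, not the SSE12 procedures themselves, and the name transpose_tiled_sse is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical sketch: transpose a rows x cols row-major float matrix
   (src) into a cols x rows row-major matrix (dst), out of place. */
static void transpose_tiled_sse(const float *src, float *dst,
                                size_t rows, size_t cols)
{
    assert(src != dst);                 /* in-place is not supported */

    size_t r4 = rows - rows % 4;        /* rows covered by full tiles */
    size_t c4 = cols - cols % 4;        /* columns covered by full tiles */

    for (size_t r = 0; r < r4; r += 4) {
        for (size_t c = 0; c < c4; c += 4) {
            /* Unaligned loads: a row start is 16-byte aligned only when
               the base is aligned and cols is a multiple of 4 */
            __m128 t0 = _mm_loadu_ps(src + (r + 0) * cols + c);
            __m128 t1 = _mm_loadu_ps(src + (r + 1) * cols + c);
            __m128 t2 = _mm_loadu_ps(src + (r + 2) * cols + c);
            __m128 t3 = _mm_loadu_ps(src + (r + 3) * cols + c);

            _MM_TRANSPOSE4_PS(t0, t1, t2, t3);

            /* Tile (r, c) of src becomes tile (c, r) of dst */
            _mm_storeu_ps(dst + (c + 0) * rows + r, t0);
            _mm_storeu_ps(dst + (c + 1) * rows + r, t1);
            _mm_storeu_ps(dst + (c + 2) * rows + r, t2);
            _mm_storeu_ps(dst + (c + 3) * rows + r, t3);
        }
    }

    /* Leftover columns to the right of the tiled area, for every row */
    for (size_t r = 0; r < rows; r++)
        for (size_t c = c4; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];

    /* Leftover rows below the tiled area, tiled columns only */
    for (size_t r = r4; r < rows; r++)
        for (size_t c = 0; c < c4; c++)
            dst[c * rows + r] = src[r * cols + c];
}
```

With this layout a 12x12 or 4x8 matrix is handled entirely by tiles, while a 7x7 or 2x4 matrix spends much or all of its time in the scalar edge loops, so how close a given size gets to the pure 4x4 speed depends on how much of it the tiles cover.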
Some results for SSE12, using AW as a reference:
Quote
Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)

68  cycles, MatrixTransposeSSE12,  testMatX 4x4
187  cycles, MatrixTransposeAW,     testMatX 4x4, Lin, Col      ; +119 cycles

99  cycles, MatrixTransposeSSE12,  testMatS 4x8
245  cycles, MatrixTransposeAW,     testMatS 4x8, Lin, Col      ; +146 cycles

119  cycles, MatrixTransposeSSE12,  testMatR 8x4
257  cycles, MatrixTransposeAW,     testMatR 8x4, Lin, Col      ; +138 cycles

192  cycles, MatrixTransposeSSE12,  testMatV 4x2
219  cycles, MatrixTransposeAW,     testMatV 4x2, Lin, Col      ; +27 cycles

194  cycles, MatrixTransposeSSE12,  testMatY 2x4
204  cycles, MatrixTransposeAW,     testMatY 2x4, Lin, Col      ; +10 cycles

255  cycles, MatrixTransposeSSE12,  testMatW 8x7
477  cycles, MatrixTransposeAW,     testMatW 8x7, Lin, Col      ; +222 cycles

376  cycles, MatrixTransposeSSE12,  testMatZ 7x8
611  cycles, MatrixTransposeAW,     testMatZ 7x8, Lin, Col      ; +235 cycles

478  cycles, MatrixTransposeSSE12,  testMatT 7x7
552  cycles, MatrixTransposeAW,     testMatT 7x7, Lin, Col      ; +74 cycles

530  cycles, MatrixTransposeSSE12,  testMatQ 12x12
794  cycles, MatrixTransposeAW,     testMatQ 12x12, Lin, Col    ; +264 cycles