Transpose a matrix

Started by jj2007, April 04, 2017, 09:43:51 AM


jj2007

This is a spinoff from Guga's RosAsm thread. Can I have some timings please, also from AMD machines? Thanks.
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

1524    cycles for 100 * TransposeA
467     cycles for 100 * TransposeB (fast+small)
469     cycles for 100 * TransposeD (fast)

1475    cycles for 100 * TransposeA
465     cycles for 100 * TransposeB (fast+small)
468     cycles for 100 * TransposeD (fast)

1474    cycles for 100 * TransposeA
465     cycles for 100 * TransposeB (fast+small)
468     cycles for 100 * TransposeD (fast)

1605    cycles for 100 * TransposeA
466     cycles for 100 * TransposeB (fast+small)
466     cycles for 100 * TransposeD (fast)

1474    cycles for 100 * TransposeA
455     cycles for 100 * TransposeB (fast+small)
466     cycles for 100 * TransposeD (fast)

60      bytes for TransposeA
105     bytes for TransposeB (fast+small)
111     bytes for TransposeD (fast)

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

2275    cycles for 100 * TransposeA
538     cycles for 100 * TransposeB (fast+small)
561     cycles for 100 * TransposeD (fast)

2158    cycles for 100 * TransposeA
542     cycles for 100 * TransposeB (fast+small)
560     cycles for 100 * TransposeD (fast)

2160    cycles for 100 * TransposeA
537     cycles for 100 * TransposeB (fast+small)
561     cycles for 100 * TransposeD (fast)

2159    cycles for 100 * TransposeA
537     cycles for 100 * TransposeB (fast+small)
560     cycles for 100 * TransposeD (fast)

2159    cycles for 100 * TransposeA
544     cycles for 100 * TransposeB (fast+small)
559     cycles for 100 * TransposeD (fast)

60      bytes for TransposeA
105     bytes for TransposeB (fast+small)
111     bytes for TransposeD (fast)
Creative coders use backward thinking techniques as a strategy.

HSE

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

2860 cycles for 100 * TransposeA
860 cycles for 100 * TransposeB (fast+small)
865 cycles for 100 * TransposeD (fast)

2515 cycles for 100 * TransposeA
860 cycles for 100 * TransposeB (fast+small)
741 cycles for 100 * TransposeD (fast)

2750 cycles for 100 * TransposeA
729 cycles for 100 * TransposeB (fast+small)
870 cycles for 100 * TransposeD (fast)

2753 cycles for 100 * TransposeA
728 cycles for 100 * TransposeB (fast+small)
870 cycles for 100 * TransposeD (fast)

2755 cycles for 100 * TransposeA
860 cycles for 100 * TransposeB (fast+small)
732 cycles for 100 * TransposeD (fast)

60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)




And here TransposeA moves a qword (2 dwords) and TransposeC uses the FPU, optimized... for easy understanding  :biggrin:

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

4750    cycles for 100 * TransposeA
960     cycles for 100 * TransposeB
12471   cycles for 100 * TransposeC

4923    cycles for 100 * TransposeA
964     cycles for 100 * TransposeB
12321   cycles for 100 * TransposeC

4909    cycles for 100 * TransposeA
969     cycles for 100 * TransposeB
12477   cycles for 100 * TransposeC

4685    cycles for 100 * TransposeA
1047    cycles for 100 * TransposeB
12336   cycles for 100 * TransposeC

4596    cycles for 100 * TransposeA
964     cycles for 100 * TransposeB
12302   cycles for 100 * TransposeC

67      bytes for TransposeA
114     bytes for TransposeB
116     bytes for TransposeC

Equations in Assembly: SmplMath

mineiro

Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)

1861    cycles for 100 * TransposeA
1397    cycles for 100 * TransposeB (fast+small)
1377    cycles for 100 * TransposeD (fast)

1854    cycles for 100 * TransposeA
1396    cycles for 100 * TransposeB (fast+small)
1376    cycles for 100 * TransposeD (fast)

1876    cycles for 100 * TransposeA
1396    cycles for 100 * TransposeB (fast+small)
1391    cycles for 100 * TransposeD (fast)

1830    cycles for 100 * TransposeA
1398    cycles for 100 * TransposeB (fast+small)
1376    cycles for 100 * TransposeD (fast)

1830    cycles for 100 * TransposeA
1398    cycles for 100 * TransposeB (fast+small)
1377    cycles for 100 * TransposeD (fast)

60      bytes for TransposeA
105     bytes for TransposeB (fast+small)
111     bytes for TransposeD (fast)
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

guga

Remarkably fast as usual :)

JJ and Siekmanski are the kings of optimization :bgrin: :bgrin: :greenclp: :greenclp:

Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz (SSE4)

1439 cycles for 100 * TransposeA
385 cycles for 100 * TransposeB (fast+small)
283 cycles for 100 * TransposeD (fast)

1371 cycles for 100 * TransposeA
385 cycles for 100 * TransposeB (fast+small)
286 cycles for 100 * TransposeD (fast)

1372 cycles for 100 * TransposeA
741 cycles for 100 * TransposeB (fast+small)
588 cycles for 100 * TransposeD (fast)

2218 cycles for 100 * TransposeA
739 cycles for 100 * TransposeB (fast+small)
585 cycles for 100 * TransposeD (fast)

1437 cycles for 100 * TransposeA
875 cycles for 100 * TransposeB (fast+small)
309 cycles for 100 * TransposeD (fast)

60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Quote from: Siekmanski on April 04, 2017, 10:10:26 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
537     cycles for 100 * TransposeB (fast+small)
561     cycles for 100 * TransposeD (fast)

537     cycles for 100 * TransposeB (fast+small)
560     cycles for 100 * TransposeD (fast)

Quote from: guga on April 04, 2017, 02:15:59 PM
Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz (SSE4)
385 cycles for 100 * TransposeB (fast+small)
283 cycles for 100 * TransposeD (fast)

385 cycles for 100 * TransposeB (fast+small)
286 cycles for 100 * TransposeD (fast)

Hmmm... these i7 CPUs are a PITA 8)

J:\Masm32\MasmBasic>TransposeMatrix.exe
Intel(R) Celeron(R) CPU  N2840  @ 2.16GHz (SSE4)

4081    cycles for 100 * TransposeA
984     cycles for 100 * TransposeB (fast+small)
1101    cycles for 100 * TransposeD (fast)

4315    cycles for 100 * TransposeA
985     cycles for 100 * TransposeB (fast+small)
1088    cycles for 100 * TransposeD (fast)

3895    cycles for 100 * TransposeA
1011    cycles for 100 * TransposeB (fast+small)
1159    cycles for 100 * TransposeD (fast)

3900    cycles for 100 * TransposeA
983     cycles for 100 * TransposeB (fast+small)
1102    cycles for 100 * TransposeD (fast)

5852    cycles for 100 * TransposeA
985     cycles for 100 * TransposeB (fast+small)
1087    cycles for 100 * TransposeD (fast)

Siekmanski

Quote from: jj2007
Hmmm... these i7 CPUs are a PITA 8)

Yeah, they are....  :biggrin:

One thing I noticed is the align64 code macro.
On my PC it aligns to 32 instead of 64. That's odd (maybe only on my PC).
I'll have a look at this issue. I myself use the memory pointer from a "VirtualAlloc" call as the reference for the align64 macro instead of the program entry point.


jj2007

Quote from: Siekmanski on April 04, 2017, 07:56:53 PM
One thing I noticed, is the code align64 macro.
On my PC it aligns to 32 instead of 64.

Try this instead - same code produced but it tells you what it does:

; .err
align_64 macro      ; nidud
Local curalign, xbytes, tmp$
  curalign=$-_TEXT                      ; current offset from the start of _TEXT
  xbytes=64-(($-_TEXT) and (64-1))      ; bytes up to the next multiple of 64 (64 if already aligned)
  tmp$ CATSTR <## current=>, %curalign, <      nops added: >, %xbytes, <      sum=>, %(curalign+xbytes), <      (line >, %@Line, <)>
  % echo tmp$
  if xbytes
      db xbytes dup(90h)                ; pad with NOPs
  endif
endm


## current=0 nops added: 64 sum=64 (line 96)
## current=159 nops added: 33 sum=192 (line 144)
## current=317 nops added: 3 sum=320 (line 211)
## current=464 nops added: 48 sum=512 (line 271)
## current=642 nops added: 62 sum=704 (line 324)
## current=723 nops added: 45 sum=768 (line 338)
## current=787 nops added: 45 sum=832 (line 352)
## current=851 nops added: 45 sum=896 (line 366)
## current=1453 nops added: 19 sum=1472 (line 573)


Looks OK, with the minor glitch that the first one doesn't need alignment.
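
The 64 NOPs in the first line of that output come from the formula itself: when the offset is already a multiple of 64, 64-(offset and 63) evaluates to 64 rather than 0. A minimal sketch of one way to avoid that, assuming the same $-_TEXT reference as in the macro above, is to mask the result once more. The name align_64z is made up for this illustration, and the variant was not tested in this thread:

```masm
align_64z macro         ; hypothetical variant of nidud's align_64
Local xbytes
  ; distance to the next multiple of 64; the outer "and" turns 64 into 0
  xbytes=(64-(($-_TEXT) and (64-1))) and (64-1)
  if xbytes
      db xbytes dup(90h)    ; pad with NOPs only when padding is needed
  endif
endm
```

For the other offsets listed above the result is unchanged, e.g. current=159 still gives (64-31) and 63 = 33 NOPs.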

Siekmanski

I followed the "align_64" thread by you and nidud some time ago.
Couldn't get it to work correctly.

Tried it again:
## current=462      nops added: 50      sum=512      (line 58)
## current=556      nops added: 20      sum=576      (line 74)
## current=692      nops added: 12      sum=704      (line 116)
## current=799      nops added: 33      sum=832      (line 155)
## current=1649      nops added: 15      sum=1664      (line 274)


This looks OK.
But, this piece of test code gives an alignment of 16.
align_64
test1:

mov eax,offset test1 ; 4200016
and eax,63 ; = 16
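
The value in the comment can be checked by hand: 4200016 is 00401650h, and the low six bits of 50h are 10h, i.e. 16. A minimal sketch of a complete console version of this test follows. It assumes a default MASM32 SDK install (masm32rt.inc with its print, str$ and exit macros) and uses nidud's macro from this thread without the echo line:

```masm
include \masm32\include\masm32rt.inc   ; assumption: default MASM32 SDK path

align_64 macro                  ; nidud's macro from this thread, echo removed
Local xbytes
  xbytes=64-(($-_TEXT) and (64-1))
  if xbytes
      db xbytes dup(90h)
  endif
endm

.code
start:
    align_64                    ; the NOP padding is simply executed
test1:
    mov eax, offset test1
    and eax, 63                 ; 0 means test1 sits on a 64-byte boundary
    print str$(eax), 13, 10     ; show the remainder
    exit
end start
```

Note that the macro pads relative to the start of _TEXT, so the printed value is 0 only if the linker also places the start of this module's _TEXT on a 64-byte boundary. Whether that is what happens here is not established in this thread.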


jj2007

Quote from: Siekmanski on April 04, 2017, 09:45:27 PM
But, this piece of test code gives an alignment of 16.

Strange. Can you post the complete source please?

FORTRANS

Hi Jochen,

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
+19 of 20 tests valid,
2812 cycles for 100 * TransposeA
2261 cycles for 100 * TransposeB (fast+small)
3126 cycles for 100 * TransposeD (fast)

2125 cycles for 100 * TransposeA
2241 cycles for 100 * TransposeB (fast+small)
2553 cycles for 100 * TransposeD (fast)

2767 cycles for 100 * TransposeA
2288 cycles for 100 * TransposeB (fast+small)
3304 cycles for 100 * TransposeD (fast)

2105 cycles for 100 * TransposeA
2905 cycles for 100 * TransposeB (fast+small)
2533 cycles for 100 * TransposeD (fast)

2915 cycles for 100 * TransposeA
2261 cycles for 100 * TransposeB (fast+small)
3088 cycles for 100 * TransposeD (fast)

60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

1281 cycles for 100 * TransposeA
672 cycles for 100 * TransposeB (fast+small)
837 cycles for 100 * TransposeD (fast)

1218 cycles for 100 * TransposeA
675 cycles for 100 * TransposeB (fast+small)
833 cycles for 100 * TransposeD (fast)

1693 cycles for 100 * TransposeA
671 cycles for 100 * TransposeB (fast+small)
834 cycles for 100 * TransposeD (fast)

1220 cycles for 100 * TransposeA
673 cycles for 100 * TransposeB (fast+small)
833 cycles for 100 * TransposeD (fast)

1224 cycles for 100 * TransposeA
671 cycles for 100 * TransposeB (fast+small)
837 cycles for 100 * TransposeD (fast)

60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)



   "+19 of 20 tests valid, " ?  The Pentium M results do look erratic.

HTH,

Steve N.

Siekmanski

Align_64 test.

Align4  00
Align16 00
Align64 32

Press any key to continue...



jj2007

Quote from: FORTRANS on April 04, 2017, 11:32:06 PM
   "+19 of 20 tests valid, " ?  The Pentium M results do look erratic.

Steve,
19/20 is quite good. The macros I use check for outliers, and take only the more stable values to calculate the average. Most of the time that works fine, but there is no guarantee. It remains slightly "alchimistic" 8)

In case of doubt, look at the lowest values. The processor cannot cheat at the lower end: it can run slower, yes, but running faster is difficult :biggrin:
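
A minimal sketch of "look at the lowest values", assuming the MASM32 SDK runtime include and using the five TransposeA figures from the first post as sample data. This is not the outlier logic of the timing macros mentioned above, which is not shown in this thread; it only picks the smallest sample:

```masm
include \masm32\include\masm32rt.inc   ; assumption: default MASM32 SDK path

.data
timings dd 1524, 1475, 1474, 1605, 1474   ; TransposeA runs, i5-2450M

.code
start:
    mov esi, offset timings
    mov ecx, lengthof timings   ; number of samples
    or eax, -1                  ; eax = 0FFFFFFFFh, above any sample
nextSample:
    mov edx, [esi]
    cmp edx, eax
    jae @F                      ; keep eax unless this sample is lower
    mov eax, edx
@@:
    add esi, 4                  ; next dword
    dec ecx
    jnz nextSample
    print str$(eax), 13, 10     ; eax holds the lowest cycle count
    exit
end start
```

For these five samples the lowest value is 1474, while the 1605 run is the kind of reading that pulls a plain average upwards.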

jj2007

Quote from: Siekmanski on April 04, 2017, 11:37:18 PM
Align_64 test.

Align4  00
Align16 00
Align64 32

Your code, in the exe you posted:
004011E0        |.  B8 E0114000             mov eax, 004011E0
004011E5        |.  83E0 3F                 and eax, 0000003F


Your code, built here with ML/HJWasm and standard options:
004010C0        |.  B8 C0104000             mov eax, 004010C0
004010C5        |.  83E0 3F                 and eax, 0000003F


Did you use the command-line option DONT_OVERDO_IT_WITH_THE_NOPS?  ::)

Siekmanski

NOP(e)  :biggrin:

It's a mystery to me.  :(