The MASM Forum

General => The Laboratory => Topic started by: jj2007 on April 04, 2017, 09:43:51 AM

Title: Transpose a matrix
Post by: jj2007 on April 04, 2017, 09:43:51 AM
This is a spinoff from Guga's RosAsm thread (http://masm32.com/board/index.php?topic=6105.0). Can I have some timings please, also from AMD machines? Thanks.
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

1524    cycles for 100 * TransposeA
467     cycles for 100 * TransposeB (fast+small)
469     cycles for 100 * TransposeD (fast)

1475    cycles for 100 * TransposeA
465     cycles for 100 * TransposeB (fast+small)
468     cycles for 100 * TransposeD (fast)

1474    cycles for 100 * TransposeA
465     cycles for 100 * TransposeB (fast+small)
468     cycles for 100 * TransposeD (fast)

1605    cycles for 100 * TransposeA
466     cycles for 100 * TransposeB (fast+small)
466     cycles for 100 * TransposeD (fast)

1474    cycles for 100 * TransposeA
455     cycles for 100 * TransposeB (fast+small)
466     cycles for 100 * TransposeD (fast)

60      bytes for TransposeA
105     bytes for TransposeB (fast+small)
111     bytes for TransposeD (fast)
Title: Re: Transpose a matrix
Post by: Siekmanski on April 04, 2017, 10:10:26 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

2275    cycles for 100 * TransposeA
538     cycles for 100 * TransposeB (fast+small)
561     cycles for 100 * TransposeD (fast)

2158    cycles for 100 * TransposeA
542     cycles for 100 * TransposeB (fast+small)
560     cycles for 100 * TransposeD (fast)

2160    cycles for 100 * TransposeA
537     cycles for 100 * TransposeB (fast+small)
561     cycles for 100 * TransposeD (fast)

2159    cycles for 100 * TransposeA
537     cycles for 100 * TransposeB (fast+small)
560     cycles for 100 * TransposeD (fast)

2159    cycles for 100 * TransposeA
544     cycles for 100 * TransposeB (fast+small)
559     cycles for 100 * TransposeD (fast)

60      bytes for TransposeA
105     bytes for TransposeB (fast+small)
111     bytes for TransposeD (fast)
Title: Re: Transpose a matrix
Post by: HSE on April 04, 2017, 11:03:22 AM
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

2860 cycles for 100 * TransposeA
860 cycles for 100 * TransposeB (fast+small)
865 cycles for 100 * TransposeD (fast)

2515 cycles for 100 * TransposeA
860 cycles for 100 * TransposeB (fast+small)
741 cycles for 100 * TransposeD (fast)

2750 cycles for 100 * TransposeA
729 cycles for 100 * TransposeB (fast+small)
870 cycles for 100 * TransposeD (fast)

2753 cycles for 100 * TransposeA
728 cycles for 100 * TransposeB (fast+small)
870 cycles for 100 * TransposeD (fast)

2755 cycles for 100 * TransposeA
860 cycles for 100 * TransposeB (fast+small)
732 cycles for 100 * TransposeD (fast)

60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)




And here TransposeA moves a qword (2 dwords) and TransposeC uses the FPU, optimized... for easy understanding  :biggrin:
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

4750    cycles for 100 * TransposeA
960     cycles for 100 * TransposeB
12471   cycles for 100 * TransposeC

4923    cycles for 100 * TransposeA
964     cycles for 100 * TransposeB
12321   cycles for 100 * TransposeC

4909    cycles for 100 * TransposeA
969     cycles for 100 * TransposeB
12477   cycles for 100 * TransposeC

4685    cycles for 100 * TransposeA
1047    cycles for 100 * TransposeB
12336   cycles for 100 * TransposeC

4596    cycles for 100 * TransposeA
964     cycles for 100 * TransposeB
12302   cycles for 100 * TransposeC

67      bytes for TransposeA
114     bytes for TransposeB
116     bytes for TransposeC

Title: Re: Transpose a matrix
Post by: mineiro on April 04, 2017, 12:40:35 PM
Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)

1861    cycles for 100 * TransposeA
1397    cycles for 100 * TransposeB (fast+small)
1377    cycles for 100 * TransposeD (fast)

1854    cycles for 100 * TransposeA
1396    cycles for 100 * TransposeB (fast+small)
1376    cycles for 100 * TransposeD (fast)

1876    cycles for 100 * TransposeA
1396    cycles for 100 * TransposeB (fast+small)
1391    cycles for 100 * TransposeD (fast)

1830    cycles for 100 * TransposeA
1398    cycles for 100 * TransposeB (fast+small)
1376    cycles for 100 * TransposeD (fast)

1830    cycles for 100 * TransposeA
1398    cycles for 100 * TransposeB (fast+small)
1377    cycles for 100 * TransposeD (fast)

60      bytes for TransposeA
105     bytes for TransposeB (fast+small)
111     bytes for TransposeD (fast)
Title: Re: Transpose a matrix
Post by: guga on April 04, 2017, 02:15:59 PM
Remarkably fast as usual :)

JJ and Siekmanski are the kings of optimization :bgrin: :bgrin: :greenclp: :greenclp:

Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz (SSE4)

1439 cycles for 100 * TransposeA
385 cycles for 100 * TransposeB (fast+small)
283 cycles for 100 * TransposeD (fast)

1371 cycles for 100 * TransposeA
385 cycles for 100 * TransposeB (fast+small)
286 cycles for 100 * TransposeD (fast)

1372 cycles for 100 * TransposeA
741 cycles for 100 * TransposeB (fast+small)
588 cycles for 100 * TransposeD (fast)

2218 cycles for 100 * TransposeA
739 cycles for 100 * TransposeB (fast+small)
585 cycles for 100 * TransposeD (fast)

1437 cycles for 100 * TransposeA
875 cycles for 100 * TransposeB (fast+small)
309 cycles for 100 * TransposeD (fast)

60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)
Title: Re: Transpose a matrix
Post by: jj2007 on April 04, 2017, 05:26:07 PM
Quote from: Siekmanski on April 04, 2017, 10:10:26 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
537     cycles for 100 * TransposeB (fast+small)
561     cycles for 100 * TransposeD (fast)

537     cycles for 100 * TransposeB (fast+small)
560     cycles for 100 * TransposeD (fast)

Quote from: guga on April 04, 2017, 02:15:59 PM
Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz (SSE4)
385 cycles for 100 * TransposeB (fast+small)
283 cycles for 100 * TransposeD (fast)

385 cycles for 100 * TransposeB (fast+small)
286 cycles for 100 * TransposeD (fast)

Hmmm... these i7 CPUs are a PITA 8)

J:\Masm32\MasmBasic>TransposeMatrix.exe
Intel(R) Celeron(R) CPU  N2840  @ 2.16GHz (SSE4)

4081    cycles for 100 * TransposeA
984     cycles for 100 * TransposeB (fast+small)
1101    cycles for 100 * TransposeD (fast)

4315    cycles for 100 * TransposeA
985     cycles for 100 * TransposeB (fast+small)
1088    cycles for 100 * TransposeD (fast)

3895    cycles for 100 * TransposeA
1011    cycles for 100 * TransposeB (fast+small)
1159    cycles for 100 * TransposeD (fast)

3900    cycles for 100 * TransposeA
983     cycles for 100 * TransposeB (fast+small)
1102    cycles for 100 * TransposeD (fast)

5852    cycles for 100 * TransposeA
985     cycles for 100 * TransposeB (fast+small)
1087    cycles for 100 * TransposeD (fast)
Title: Re: Transpose a matrix
Post by: Siekmanski on April 04, 2017, 07:56:53 PM
Quote
Hmmm... these i7 CPUs are a PITA 8)

Yeah, they are....  :biggrin:

One thing I noticed is the align64 macro for code.
On my PC it aligns to 32 instead of 64. This is a funny thing (maybe only on my PC).
I'll have a look at this issue. I myself use the memory pointer from a "VirtualAlloc" call as a reference for the align64 macro instead of the program entry point.

Title: Re: Transpose a matrix
Post by: jj2007 on April 04, 2017, 08:52:58 PM
Quote from: Siekmanski on April 04, 2017, 07:56:53 PM
One thing I noticed is the align64 macro for code.
On my PC it aligns to 32 instead of 64.

Try this instead - same code produced but it tells you what it does:

; .err
align_64 macro      ; nidud (http://masm32.com/board/index.php?topic=4545.msg48734#msg48734)
Local curalign, xbytes, tmp$
  curalign=$-_TEXT
  xbytes=64-(($-_TEXT) and (64-1))
  tmp$ CATSTR <## current=>, %curalign, <      nops added: >, %xbytes, <      sum=>, %(curalign+xbytes), <      (line >, %@Line, <)>
  % echo tmp$
  if xbytes
      db xbytes dup(90h)
  endif
endm


## current=0 nops added: 64 sum=64 (line 96)
## current=159 nops added: 33 sum=192 (line 144)
## current=317 nops added: 3 sum=320 (line 211)
## current=464 nops added: 48 sum=512 (line 271)
## current=642 nops added: 62 sum=704 (line 324)
## current=723 nops added: 45 sum=768 (line 338)
## current=787 nops added: 45 sum=832 (line 352)
## current=851 nops added: 45 sum=896 (line 366)
## current=1453 nops added: 19 sum=1472 (line 573)


Looks OK, with the minor glitch that the first one doesn't need alignment.
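
A minimal sketch of how that glitch could be avoided, assuming the same `$-_TEXT` basis as the macro above. The macro name `align_64z` is made up for illustration. Masking the padding with 63 a second time turns a padding of 64 into 0, so nothing is emitted when the offset is already a multiple of 64:

```masm
; Hypothetical variant of align_64: identical padding, except that an
; offset which is already a multiple of 64 gets 0 nops instead of 64.
align_64z macro
Local xbytes
  xbytes=(64-(($-_TEXT) and (64-1))) and (64-1)
  if xbytes
      db xbytes dup(90h)    ; pad with single-byte nops
  endif
endm
```

With the offsets echoed above, `current=0` would give 0 nops and `current=159` would still give 33. Like the original, this only pads relative to the start of `_TEXT` in this module.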
Title: Re: Transpose a matrix
Post by: Siekmanski on April 04, 2017, 09:45:27 PM
I followed the "align_64" thread by you and nidud some time ago.
Couldn't get it to work correctly.

Tried it again:
## current=462      nops added: 50      sum=512      (line 58)
## current=556      nops added: 20      sum=576      (line 74)
## current=692      nops added: 12      sum=704      (line 116)
## current=799      nops added: 33      sum=832      (line 155)
## current=1649      nops added: 15      sum=1664      (line 274)


This looks OK.
But, this piece of test code gives an alignment of 16.
align_64
test1:

mov eax,offset test1 ; 4200016
and eax,63 ; = 16

Title: Re: Transpose a matrix
Post by: jj2007 on April 04, 2017, 10:11:12 PM
Quote from: Siekmanski on April 04, 2017, 09:45:27 PM
But, this piece of test code gives an alignment of 16.

Strange. Can you post the complete source please?
Title: Re: Transpose a matrix
Post by: FORTRANS on April 04, 2017, 11:32:06 PM
Hi Jochen,

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
+19 of 20 tests valid,
2812 cycles for 100 * TransposeA
2261 cycles for 100 * TransposeB (fast+small)
3126 cycles for 100 * TransposeD (fast)

2125 cycles for 100 * TransposeA
2241 cycles for 100 * TransposeB (fast+small)
2553 cycles for 100 * TransposeD (fast)

2767 cycles for 100 * TransposeA
2288 cycles for 100 * TransposeB (fast+small)
3304 cycles for 100 * TransposeD (fast)

2105 cycles for 100 * TransposeA
2905 cycles for 100 * TransposeB (fast+small)
2533 cycles for 100 * TransposeD (fast)

2915 cycles for 100 * TransposeA
2261 cycles for 100 * TransposeB (fast+small)
3088 cycles for 100 * TransposeD (fast)

60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

1281 cycles for 100 * TransposeA
672 cycles for 100 * TransposeB (fast+small)
837 cycles for 100 * TransposeD (fast)

1218 cycles for 100 * TransposeA
675 cycles for 100 * TransposeB (fast+small)
833 cycles for 100 * TransposeD (fast)

1693 cycles for 100 * TransposeA
671 cycles for 100 * TransposeB (fast+small)
834 cycles for 100 * TransposeD (fast)

1220 cycles for 100 * TransposeA
673 cycles for 100 * TransposeB (fast+small)
833 cycles for 100 * TransposeD (fast)

1224 cycles for 100 * TransposeA
671 cycles for 100 * TransposeB (fast+small)
837 cycles for 100 * TransposeD (fast)

60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)

   "+19 of 20 tests valid, " ?  The Pentium M results do look erratic.

HTH,

Steve N.
Title: Re: Transpose a matrix
Post by: Siekmanski on April 04, 2017, 11:37:18 PM
Align_64 test.

Align4  00
Align16 00
Align64 32

Press any key to continue...


Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 01:13:15 AM
Quote from: FORTRANS on April 04, 2017, 11:32:06 PM
   "+19 of 20 tests valid, " ?  The Pentium M results do look erratic.

Steve,
19/20 is quite good. The macros I use check for outliers, and take only the more stable values to calculate the average. Most of the time that works fine, but there is no guarantee. It remains slightly "alchemistic" 8)

In case of doubt, look at the lowest values. The processor cannot cheat at the lower end: it can run slower, yes, but running faster is difficult :biggrin:
Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 01:32:30 AM
Quote from: Siekmanski on April 04, 2017, 11:37:18 PM
Align_64 test.

Align4  00
Align16 00
Align64 32

Your code, in the exe you posted:
004011E0        │.  B8 E0114000             mov eax, 004011E0
004011E5        │.  83E0 3F                 and eax, 0000003F


Your code, built here with ML/HJWasm and standard options:
004010C0        │.  B8 C0104000             mov eax, 004010C0
004010C5        │.  83E0 3F                 and eax, 0000003F


Did you use commandline option DONT_OVERDO_IT_WITH_THE_NOPS ?  ::)
Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 02:48:52 AM
NOP(e)  :biggrin:

It's a mystery to me.  :(
Title: Re: Transpose a matrix
Post by: guga on April 05, 2017, 04:19:22 AM
 :bgrin: :bgrin:

Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 04:20:37 AM
Quote from: Siekmanski on April 05, 2017, 02:48:52 AM
It's a mystery to me.  :(

For me, too. Did you try different assemblers and/or linkers? I tried a number of options, none produced a non-zero result.
Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 04:51:31 AM
The commandline options used,

@echo off

if exist "Align_64.obj" del "Align_64.obj"
\masm32\bin\ml.exe /c /coff "Align_64.Asm"
if errorlevel 1 goto Einde

if exist "Align_64.exe" del "Align_64.exe"
\masm32\bin\Link.exe /SUBSYSTEM:CONSOLE /OPT:NOREF /OUT:Align_64.exe Align_64.obj
if errorlevel 1 goto Einde

Align_64.exe

:Einde


Microsoft (R) Macro Assembler Version 12.00.21005.1
Copyright (C) Microsoft Corporation.  All rights reserved.

Assembling: Align_64.Asm
## current=209      nops added: 47      sum=256      (line 60)
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.

Align_64 test.

Align4  00
Align16 00
Align64 32

Press any key to continue...


Same result with Polink.exe
Title: Re: Transpose a matrix
Post by: nidud on April 05, 2017, 05:29:00 AM
deleted
Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 06:40:04 AM
Stripped the assembler and the linker from the vs2015.com_enu.iso
Still the same result.....

Microsoft (R) Macro Assembler Version 14.00.23026.0
Copyright (C) Microsoft Corporation.  All rights reserved.

Assembling: Align_64.Asm
## current=218      nops added: 38      sum=256      (line 60)
Microsoft (R) Incremental Linker Version 14.00.23026.0
Copyright (C) Microsoft Corporation.  All rights reserved.

Align_64 test.

Align4  00
Align16 00
Align64 32

Press any key to continue...


commandline options used,

@echo off

if exist "Align_64.obj" del "Align_64.obj"
d:\RadASM2212\Masm2015\ml.exe /c /coff "Align_64.Asm"
if errorlevel 1 goto Einde

if exist "Align_64.exe" del "Align_64.exe"
d:\RadASM2212\Masm2015\link.exe /SUBSYSTEM:CONSOLE /OPT:NOREF /OUT:Align_64.exe Align_64.obj
if errorlevel 1 goto Einde

Align_64.exe

:Einde
Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 06:42:56 AM
Quote from: Siekmanski on April 05, 2017, 04:51:31 AM
The commandline options used
...
Same result with Polink.exe

I can produce the error with /debug and the M$ linkers (6.14 and 9.0), but not with polink.
Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 07:12:00 AM
I'm clueless.
Maybe it's how my CPU handles things......

Entry point is 401120h (32-byte aligned)

(http://members.home.nl/siekmanski/Image1.png)
Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 08:20:34 AM
Quote from: Siekmanski on April 05, 2017, 07:12:00 AM
I'm clueless.
Maybe it's how my CPU handles things......

No, your CPU is not the culprit. I can produce the same error. It's the linker... all linkers, actually :(
The macro adds the right number of nops afaics.
Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 08:42:41 AM
Which linker do you use ?
Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 08:48:17 AM
Mostly polink, then ML link 5.12 or 10.0
I've given up on other LINK versions, it's dll hell:
LINKV14 : fatal error LNK2023: bad DLL or entry point 'msobj140.dll'
etc etc

I just don't have the energy to shove the missing DLLs around. Besides, polink and link10 are sufficient.
Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 08:54:09 AM
Then it's not the linker. I used version 5.12 with the same wrong result.

d:\RadASM2212\Masm\Projects\Align_64>makeit
Microsoft (R) Macro Assembler Version 12.00.21005.1
Copyright (C) Microsoft Corporation.  All rights reserved.

Assembling: Align_64.Asm
## current=219      nops added: 37      sum=256      (line 65)
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.

Align_64 test.

Align4  00
Align16 00
Align64 32
Entry  401120
Title: Re: Transpose a matrix
Post by: nidud on April 05, 2017, 09:20:43 AM
deleted
Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 09:35:09 AM
Quote from: nidud on April 05, 2017, 09:20:43 AM
libraries or include files then..

No, tried it.
Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 10:28:57 AM
Try this one:
align_64 MACRO
Local curalign, tmp$
  repeat 3
    curalign=($-_TEXT) and 63
    tmp$ CATSTR <current alignment=>, %curalign
    % echo tmp$
    if curalign
      nop
      align 16
      echo * align16
    endif
  endm
endm


I get the impression that I do not understand what the linker really does :(
Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 05:42:12 PM
Hi Jochen,
Test with the new macro:

Microsoft Windows [Version 6.3.9600]
(c) 2013 Microsoft Corporation. Alle rechten voorbehouden.

d:\RadASM2212\Masm\Projects\Align_64>makeit
Microsoft (R) Macro Assembler Version 12.00.21005.1
Copyright (C) Microsoft Corporation.  All rights reserved.

Assembling: Align_64.Asm
current alignment=27
* align16
current alignment=43
* align16
current alignment=59
* align16
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.

Align_64 test.

Align4  00
Align16 00
Align64 48
Entry  4010A0

Press any key to continue...

d:\RadASM2212\Masm\Projects\Align_64>makeit2 (with polink)
Microsoft (R) Macro Assembler Version 12.00.21005.1
Copyright (C) Microsoft Corporation.  All rights reserved.

Assembling: Align_64.Asm
current alignment=27
* align16
current alignment=43
* align16
current alignment=59
* align16
Align_64 test.

Align4  00
Align16 00
Align64 48
Entry  4010A0

Press any key to continue...

d:\RadASM2212\Masm\Projects\Align_64>makeit3
Microsoft (R) Macro Assembler Version 14.00.23026.0
Copyright (C) Microsoft Corporation.  All rights reserved.

Assembling: Align_64.Asm
current alignment=36
* align16
current alignment=52
* align16
current alignment=4
* align16
Microsoft (R) Incremental Linker Version 14.00.23026.0
Copyright (C) Microsoft Corporation.  All rights reserved.

Align_64 test.

Align4  00
Align16 00
Align64 48
Entry  3010A0
Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 05:54:51 PM
Yes indeed, I get the same type of problem, depending on commandline options but apparently without much logic behind it. I guess the linkers "optimise" something - shuffle sections around, no idea. It seems that "$-_TEXT" does not give a reliable basis for this kind of adjustment. Unfortunately, there is no proper documentation available. Maybe one of our senior experts (did I see "grey hair and a walking stick" somewhere? ;)) can enlighten us.
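
One possible explanation, not confirmed by anyone in this thread: if the object file marks the default code section as only 16-byte aligned, the linker may place this module's `_TEXT` contribution at any 16-byte boundary. Padding computed from `$-_TEXT` is then only correct relative to that start. The remainders reported above (16, 32 and 48) are all multiples of 16, which would fit. A sketch of a workaround under that assumption is a separate segment that declares its own alignment. The segment name `_TEXT64` and the labels are hypothetical, and `ALIGN(64)` in `SEGMENT` assumes an ML version that accepts it for COFF output:

```masm
; Sketch only, not tested in this thread.
; The segment itself asks for 64-byte alignment, so the requirement is
; recorded in the object file instead of being guessed from $-_TEXT.
_TEXT64 segment align(64) 'CODE'
HotLoopA:               ; segment start, expected on a 64-byte boundary
  ret
  align 64              ; allowed: does not exceed the segment alignment
HotLoopB:
  ret
_TEXT64 ends
; reopen the normal code segment with .code before continuing
```

The same check as in the earlier test code (`mov eax, offset HotLoopB` followed by `and eax, 63`) could be used to see whether the result is really zero with a given assembler and linker.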