Title: Transpose a matrix
Post by: jj2007 on April 04, 2017, 09:43:51 AM

This is a spinoff from Guga's RosAsm thread (http://masm32.com/board/index.php?topic=6105.0). Can I have some timings please, also from AMD machines? Thanks.

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

1524    cycles for 100 * TransposeA
467     cycles for 100 * TransposeB (fast+small)
469     cycles for 100 * TransposeD (fast)

1475    cycles for 100 * TransposeA
465     cycles for 100 * TransposeB (fast+small)
468     cycles for 100 * TransposeD (fast)

1474    cycles for 100 * TransposeA
465     cycles for 100 * TransposeB (fast+small)
468     cycles for 100 * TransposeD (fast)

1605    cycles for 100 * TransposeA
466     cycles for 100 * TransposeB (fast+small)
466     cycles for 100 * TransposeD (fast)

1474    cycles for 100 * TransposeA
455     cycles for 100 * TransposeB (fast+small)
466     cycles for 100 * TransposeD (fast)

60      bytes for TransposeA
105     bytes for TransposeB (fast+small)
111     bytes for TransposeD (fast)

Title: Re: Transpose a matrix
Post by: Siekmanski on April 04, 2017, 10:10:26 AM

Code Select

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

2275    cycles for 100 * TransposeA
538     cycles for 100 * TransposeB (fast+small)
561     cycles for 100 * TransposeD (fast)

2158    cycles for 100 * TransposeA
542     cycles for 100 * TransposeB (fast+small)
560     cycles for 100 * TransposeD (fast)

2160    cycles for 100 * TransposeA
537     cycles for 100 * TransposeB (fast+small)
561     cycles for 100 * TransposeD (fast)

2159    cycles for 100 * TransposeA
537     cycles for 100 * TransposeB (fast+small)
560     cycles for 100 * TransposeD (fast)

2159    cycles for 100 * TransposeA
544     cycles for 100 * TransposeB (fast+small)
559     cycles for 100 * TransposeD (fast)

60      bytes for TransposeA
105     bytes for TransposeB (fast+small)
111     bytes for TransposeD (fast)

Title: Re: Transpose a matrix
Post by: HSE on April 04, 2017, 11:03:22 AM

Code Select

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

2860	cycles for 100 * TransposeA
860	cycles for 100 * TransposeB (fast+small)
865	cycles for 100 * TransposeD (fast)

2515	cycles for 100 * TransposeA
860	cycles for 100 * TransposeB (fast+small)
741	cycles for 100 * TransposeD (fast)

2750	cycles for 100 * TransposeA
729	cycles for 100 * TransposeB (fast+small)
870	cycles for 100 * TransposeD (fast)

2753	cycles for 100 * TransposeA
728	cycles for 100 * TransposeB (fast+small)
870	cycles for 100 * TransposeD (fast)

2755	cycles for 100 * TransposeA
860	cycles for 100 * TransposeB (fast+small)
732	cycles for 100 * TransposeD (fast)

60	bytes for TransposeA
105	bytes for TransposeB (fast+small)
111	bytes for TransposeD (fast)

And here TransposeA moves a qword (2 dwords) at a time and TransposeC uses the FPU, optimized... for easy understanding :biggrin:

Code Select

AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)

4750    cycles for 100 * TransposeA
960     cycles for 100 * TransposeB
12471   cycles for 100 * TransposeC

4923    cycles for 100 * TransposeA
964     cycles for 100 * TransposeB
12321   cycles for 100 * TransposeC

4909    cycles for 100 * TransposeA
969     cycles for 100 * TransposeB
12477   cycles for 100 * TransposeC

4685    cycles for 100 * TransposeA
1047    cycles for 100 * TransposeB
12336   cycles for 100 * TransposeC

4596    cycles for 100 * TransposeA
964     cycles for 100 * TransposeB
12302   cycles for 100 * TransposeC

67      bytes for TransposeA
114     bytes for TransposeB
116     bytes for TransposeC

Title: Re: Transpose a matrix
Post by: mineiro on April 04, 2017, 12:40:35 PM

Code Select

Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz (SSE4)

1861    cycles for 100 * TransposeA
1397    cycles for 100 * TransposeB (fast+small)
1377    cycles for 100 * TransposeD (fast)

1854    cycles for 100 * TransposeA
1396    cycles for 100 * TransposeB (fast+small)
1376    cycles for 100 * TransposeD (fast)

1876    cycles for 100 * TransposeA
1396    cycles for 100 * TransposeB (fast+small)
1391    cycles for 100 * TransposeD (fast)

1830    cycles for 100 * TransposeA
1398    cycles for 100 * TransposeB (fast+small)
1376    cycles for 100 * TransposeD (fast)

1830    cycles for 100 * TransposeA
1398    cycles for 100 * TransposeB (fast+small)
1377    cycles for 100 * TransposeD (fast)

60      bytes for TransposeA
105     bytes for TransposeB (fast+small)
111     bytes for TransposeD (fast)

Title: Re: Transpose a matrix
Post by: guga on April 04, 2017, 02:15:59 PM

Remarkably fast as usual :)

JJ and Siekmanski are the kings of optimization :bgrin: :bgrin: :greenclp: :greenclp:

Code Select

Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz (SSE4)

1439	cycles for 100 * TransposeA
385	cycles for 100 * TransposeB (fast+small)
283	cycles for 100 * TransposeD (fast)

1371	cycles for 100 * TransposeA
385	cycles for 100 * TransposeB (fast+small)
286	cycles for 100 * TransposeD (fast)

1372	cycles for 100 * TransposeA
741	cycles for 100 * TransposeB (fast+small)
588	cycles for 100 * TransposeD (fast)

2218	cycles for 100 * TransposeA
739	cycles for 100 * TransposeB (fast+small)
585	cycles for 100 * TransposeD (fast)

1437	cycles for 100 * TransposeA
875	cycles for 100 * TransposeB (fast+small)
309	cycles for 100 * TransposeD (fast)

60	bytes for TransposeA
105	bytes for TransposeB (fast+small)
111	bytes for TransposeD (fast)

Title: Re: Transpose a matrix
Post by: jj2007 on April 04, 2017, 05:26:07 PM

Quote from: Siekmanski on April 04, 2017, 10:10:26 AM
Code Select

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
537 cycles for 100 * TransposeB (fast+small)
561 cycles for 100 * TransposeD (fast)
537 cycles for 100 * TransposeB (fast+small)
560 cycles for 100 * TransposeD (fast)

Quote from: guga on April 04, 2017, 02:15:59 PM
Code Select

Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (SSE4)
385 cycles for 100 * TransposeB (fast+small)
283 cycles for 100 * TransposeD (fast)
385 cycles for 100 * TransposeB (fast+small)
286 cycles for 100 * TransposeD (fast)

Hmmm... these i7 CPUs are a PITA 8)

Code Select

J:\Masm32\MasmBasic>TransposeMatrix.exe
Intel(R) Celeron(R) CPU  N2840  @ 2.16GHz (SSE4)

4081    cycles for 100 * TransposeA
984     cycles for 100 * TransposeB (fast+small)
1101    cycles for 100 * TransposeD (fast)

4315    cycles for 100 * TransposeA
985     cycles for 100 * TransposeB (fast+small)
1088    cycles for 100 * TransposeD (fast)

3895    cycles for 100 * TransposeA
1011    cycles for 100 * TransposeB (fast+small)
1159    cycles for 100 * TransposeD (fast)

3900    cycles for 100 * TransposeA
983     cycles for 100 * TransposeB (fast+small)
1102    cycles for 100 * TransposeD (fast)

5852    cycles for 100 * TransposeA
985     cycles for 100 * TransposeB (fast+small)
1087    cycles for 100 * TransposeD (fast)

Title: Re: Transpose a matrix
Post by: Siekmanski on April 04, 2017, 07:56:53 PM

Quote
Hmmm... these i7 CPUs are a PITA 8)

Yeah, they are.... :biggrin:

One thing I noticed is the code align64 macro.
On my PC it aligns to 32 instead of 64. This is a funny thing (maybe only on my PC).
I'll have a look at this issue. I myself use the memory pointer from a "VirtualAlloc" call as a reference for the align64 macro instead of the program entry point.

Title: Re: Transpose a matrix
Post by: jj2007 on April 04, 2017, 08:52:58 PM

Quote from: Siekmanski on April 04, 2017, 07:56:53 PM
One thing I noticed is the code align64 macro.
On my PC it aligns to 32 instead of 64.

Try this instead - same code produced but it tells you what it does:

; .err
align_64 macro ; nidud (http://masm32.com/board/index.php?topic=4545.msg48734#msg48734)
Local curalign, xbytes, tmp$
curalign=$-_TEXT	; current offset from the start of _TEXT
xbytes=64-(($-_TEXT) and (64-1))	; bytes up to the next multiple of 64 (1..64, never 0)
tmp$ CATSTR <## current=>, %curalign, < nops added: >, %xbytes, < sum=>, %(curalign+xbytes), < (line >, %@Line, <)>
% echo tmp$
if xbytes
db xbytes dup(90h)	; pad with nops
endif
endm

Code Select

## current=0	nops added: 64	sum=64	(line 96)
## current=159	nops added: 33	sum=192	(line 144)
## current=317	nops added: 3	sum=320	(line 211)
## current=464	nops added: 48	sum=512	(line 271)
## current=642	nops added: 62	sum=704	(line 324)
## current=723	nops added: 45	sum=768	(line 338)
## current=787	nops added: 45	sum=832	(line 352)
## current=851	nops added: 45	sum=896	(line 366)
## current=1453	nops added: 19	sum=1472	(line 573)

Looks OK, with the minor glitch that the first one doesn't need alignment.
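
The padding arithmetic of the macro can be checked outside the assembler. What follows is a minimal Python sketch, not part of the original macro; the names `nops_added` and `nops_needed` are made up for illustration. It reproduces the "nops added" column above and shows why an offset that is already a multiple of 64 still receives a full 64 bytes. Masking the result once more with 63 is one possible way to avoid that.

```python
ALIGN = 64  # alignment the macro aims for

def nops_added(offset):
    """Padding as the macro computes it: 64-((offset) and (64-1))."""
    return ALIGN - (offset & (ALIGN - 1))

def nops_needed(offset):
    """Hypothetical variant: mask again so an aligned offset gets 0 bytes."""
    return (ALIGN - (offset & (ALIGN - 1))) & (ALIGN - 1)

# "current" values taken from the echo output above
for current in (0, 159, 317, 464, 642):
    print(current, nops_added(current), nops_needed(current))
```

For `current=0` the first function returns 64 and the second 0; for all other offsets in the list both agree.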

Title: Re: Transpose a matrix
Post by: Siekmanski on April 04, 2017, 09:45:27 PM

I followed the "align_64" thread by you and nidud some time ago.
Couldn't get it to work correctly.

Tried it again:

Code Select

## current=462      nops added: 50      sum=512      (line 58)
## current=556      nops added: 20      sum=576      (line 74)
## current=692      nops added: 12      sum=704      (line 116)
## current=799      nops added: 33      sum=832      (line 155)
## current=1649      nops added: 15      sum=1664      (line 274)

This looks OK.
But, this piece of test code gives an alignment of 16.

Code Select

align_64
test1:

mov eax,offset test1	; 4200016
and eax,63		; = 16

Title: Re: Transpose a matrix
Post by: jj2007 on April 04, 2017, 10:11:12 PM

Quote from: Siekmanski on April 04, 2017, 09:45:27 PM
But, this piece of test code gives an alignment of 16.

Strange. Can you post the complete source please?

Title: Re: Transpose a matrix
Post by: FORTRANS on April 04, 2017, 11:32:06 PM

Hi Jochen,

Code Select

Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
+19 of 20 tests valid, 
2812	cycles for 100 * TransposeA
2261	cycles for 100 * TransposeB (fast+small)
3126	cycles for 100 * TransposeD (fast)

2125	cycles for 100 * TransposeA
2241	cycles for 100 * TransposeB (fast+small)
2553	cycles for 100 * TransposeD (fast)

2767	cycles for 100 * TransposeA
2288	cycles for 100 * TransposeB (fast+small)
3304	cycles for 100 * TransposeD (fast)

2105	cycles for 100 * TransposeA
2905	cycles for 100 * TransposeB (fast+small)
2533	cycles for 100 * TransposeD (fast)

2915	cycles for 100 * TransposeA
2261	cycles for 100 * TransposeB (fast+small)
3088	cycles for 100 * TransposeD (fast)

60	bytes for TransposeA
105	bytes for TransposeB (fast+small)
111	bytes for TransposeD (fast)

Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)

1281	cycles for 100 * TransposeA
672	cycles for 100 * TransposeB (fast+small)
837	cycles for 100 * TransposeD (fast)

1218	cycles for 100 * TransposeA
675	cycles for 100 * TransposeB (fast+small)
833	cycles for 100 * TransposeD (fast)

1693	cycles for 100 * TransposeA
671	cycles for 100 * TransposeB (fast+small)
834	cycles for 100 * TransposeD (fast)

1220	cycles for 100 * TransposeA
673	cycles for 100 * TransposeB (fast+small)
833	cycles for 100 * TransposeD (fast)

1224	cycles for 100 * TransposeA
671	cycles for 100 * TransposeB (fast+small)
837	cycles for 100 * TransposeD (fast)

60	bytes for TransposeA
105	bytes for TransposeB (fast+small)
111	bytes for TransposeD (fast)

"+19 of 20 tests valid, " ? The Pentium M results do look erratic.

HTH,

Steve N.

Title: Re: Transpose a matrix
Post by: Siekmanski on April 04, 2017, 11:37:18 PM

Code Select

Align_64 test.

Align4  00
Align16 00
Align64 32

Press any key to continue...

Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 01:13:15 AM

Quote from: FORTRANS on April 04, 2017, 11:32:06 PM
"+19 of 20 tests valid, " ? The Pentium M results do look erratic.

Steve,
19/20 is quite good. The macros I use check for outliers and take only the more stable values to calculate the average. Most of the time that works fine, but there is no guarantee. It remains slightly "alchemistic" 8)

In case of doubt, look at the lowest values. The processor cannot cheat at the lower end: it can run slower, yes, but running faster is difficult :biggrin:
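
One simple way to follow this advice is to put the fastest run next to an average of the runs that stay close to it. The Python sketch below is only an illustration: it is not the algorithm used by the timing macros, and the 10% tolerance and the name `stable_average` are assumptions. The input values are the TransposeA cycle counts from the Pentium M results earlier in the thread.

```python
def stable_average(cycles, tolerance=0.10):
    """Average only the runs within `tolerance` of the fastest run."""
    fastest = min(cycles)
    kept = [c for c in cycles if c <= fastest * (1 + tolerance)]
    return sum(kept) / len(kept)

# TransposeA on the Pentium M (five runs)
transpose_a = [2812, 2125, 2767, 2105, 2915]

print(min(transpose_a))             # fastest run
print(stable_average(transpose_a))  # mean of the runs close to the fastest
```

With these numbers only two of the five runs survive the filter, which fits the impression that the Pentium M results are erratic.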

Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 01:32:30 AM

Quote from: Siekmanski on April 04, 2017, 11:37:18 PM
Code Select

Align_64 test.
Align4 00
Align16 00
Align64 32

Your code, in the exe you posted:

Code Select

004011E0        │.  B8 E0114000             mov eax, 004011E0
004011E5        │.  83E0 3F                 and eax, 0000003F

Your code, built here with ML/HJWasm and standard options:

Code Select

004010C0        │.  B8 C0104000             mov eax, 004010C0
004010C5        │.  83E0 3F                 and eax, 0000003F

Did you use commandline option DONT_OVERDO_IT_WITH_THE_NOPS ? ::)
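
The `and eax, 3Fh` results for the label addresses seen so far can be reproduced with a few lines of Python. This is just a quick check, not part of the original posts, and `low6` is an illustrative name.

```python
def low6(address):
    """Return address mod 64, i.e. what `and eax, 63` leaves in eax."""
    return address & 0x3F

print(low6(0x004011E0))  # label address in the posted exe
print(low6(0x004010C0))  # label address in the exe rebuilt with ML/HJWasm
print(low6(4200016))     # decimal address from the earlier test snippet
```

A result of 0 means the label sits on a 64-byte boundary; any other value is the distance past the previous boundary.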

Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 02:48:52 AM

NOP(e) :biggrin:

It's a mystery to me. :(

Title: Re: Transpose a matrix
Post by: guga on April 05, 2017, 04:19:22 AM

:bgrin: :bgrin:

Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 04:20:37 AM

Quote from: Siekmanski on April 05, 2017, 02:48:52 AM
It's a mystery to me. :(

For me, too. Did you try different assemblers and/or linkers? I tried a number of options, none produced a non-zero result.

Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 04:51:31 AM

The commandline options used,

Code Select

@echo off

if exist "Align_64.obj" del "Align_64.obj"
\masm32\bin\ml.exe /c /coff "Align_64.Asm"
if errorlevel 1 goto Einde

if exist "Align_64.exe" del "Align_64.exe"
\masm32\bin\Link.exe /SUBSYSTEM:CONSOLE /OPT:NOREF /OUT:Align_64.exe Align_64.obj
if errorlevel 1 goto Einde

Align_64.exe

:Einde


Microsoft (R) Macro Assembler Version 12.00.21005.1
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: Align_64.Asm
## current=209      nops added: 47      sum=256      (line 60)
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.

Align_64 test.

Align4  00
Align16 00
Align64 32

Press any key to continue...

Same result with Polink.exe

Title: Re: Transpose a matrix
Post by: nidud on April 05, 2017, 05:29:00 AM

deleted

Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 06:40:04 AM

Stripped the assembler and the linker from the vs2015.com_enu.iso
Still the same result.....

Code Select

Microsoft (R) Macro Assembler Version 14.00.23026.0
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: Align_64.Asm
## current=218      nops added: 38      sum=256      (line 60)
Microsoft (R) Incremental Linker Version 14.00.23026.0
Copyright (C) Microsoft Corporation.  All rights reserved.

Align_64 test.

Align4  00
Align16 00
Align64 32

Press any key to continue...

Commandline options used:

Code Select

@echo off

if exist "Align_64.obj" del "Align_64.obj"
d:\RadASM2212\Masm2015\ml.exe /c /coff "Align_64.Asm"
if errorlevel 1 goto Einde

if exist "Align_64.exe" del "Align_64.exe"
d:\RadASM2212\Masm2015\link.exe /SUBSYSTEM:CONSOLE /OPT:NOREF /OUT:Align_64.exe Align_64.obj
if errorlevel 1 goto Einde

Align_64.exe

:Einde

Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 06:42:56 AM

Quote from: Siekmanski on April 05, 2017, 04:51:31 AM
The commandline options used
...
Same result with Polink.exe

I can produce the error with /debug and the M$ linkers (6.14 and 9.0), but not with polink.

Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 07:12:00 AM

I'm clueless.
Maybe it's how my CPU handles things......

Entry point is 401120h (32-byte aligned)

(http://members.home.nl/siekmanski/Image1.png)

Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 08:20:34 AM

Quote from: Siekmanski on April 05, 2017, 07:12:00 AM
I'm clueless.
Maybe it's how my CPU handles things......

No, your CPU is not the culprit. I can produce the same error. It's the linker... all linkers, actually :(
The macro adds the right number of nops afaics.

Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 08:42:41 AM

Which linker do you use ?

Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 08:48:17 AM

Mostly polink, then ML link 5.12 or 10.0
I've given up on other LINK versions, it's dll hell:

Code Select

LINKV14 : fatal error LNK2023: bad DLL or entry point 'msobj140.dll'
etc etc

I just don't have the energy to shove the missing DLLs around. Besides, polink and link10 are sufficient.

Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 08:54:09 AM

Then it's not the linker. I used version 5.12 with the same wrong result.

Code Select

d:\RadASM2212\Masm\Projects\Align_64>makeit
Microsoft (R) Macro Assembler Version 12.00.21005.1
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: Align_64.Asm
## current=219      nops added: 37      sum=256      (line 65)
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.

Align_64 test.

Align4  00
Align16 00
Align64 32
Entry  401120

Title: Re: Transpose a matrix
Post by: nidud on April 05, 2017, 09:20:43 AM

deleted

Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 09:35:09 AM

Quote from: nidud on April 05, 2017, 09:20:43 AM
libraries or include files then..

No, tried it.

Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 10:28:57 AM

Try this one:

Code Select

align_64 MACRO
Local curalign, tmp$
  repeat 3
  	curalign=($-_TEXT) and 63
  	tmp$ CATSTR <current alignment=>, %curalign
	% echo tmp$
	if curalign
		nop
		align 16
		echo * align16
	endif
  endm
endm

I get the impression that I do not understand what the linker really does :(

Title: Re: Transpose a matrix
Post by: Siekmanski on April 05, 2017, 05:42:12 PM

Hi Jochen,
Test with the new macro:

Code Select

Microsoft Windows [Version 6.3.9600]
(c) 2013 Microsoft Corporation. Alle rechten voorbehouden.

d:\RadASM2212\Masm\Projects\Align_64>makeit
Microsoft (R) Macro Assembler Version 12.00.21005.1
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: Align_64.Asm
current alignment=27
* align16
current alignment=43
* align16
current alignment=59
* align16
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.

Align_64 test.

Align4  00
Align16 00
Align64 48
Entry  4010A0

Press any key to continue...

d:\RadASM2212\Masm\Projects\Align_64>makeit2 (with polink)
Microsoft (R) Macro Assembler Version 12.00.21005.1
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: Align_64.Asm
current alignment=27
* align16
current alignment=43
* align16
current alignment=59
* align16
Align_64 test.

Align4  00
Align16 00
Align64 48
Entry  4010A0

Press any key to continue...

d:\RadASM2212\Masm\Projects\Align_64>makeit3
Microsoft (R) Macro Assembler Version 14.00.23026.0
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: Align_64.Asm
current alignment=36
* align16
current alignment=52
* align16
current alignment=4
* align16
Microsoft (R) Incremental Linker Version 14.00.23026.0
Copyright (C) Microsoft Corporation.  All rights reserved.

Align_64 test.

Align4  00
Align16 00
Align64 48
Entry  3010A0

Title: Re: Transpose a matrix
Post by: jj2007 on April 05, 2017, 05:54:51 PM

Yes indeed, I get the same type of problem, depending on commandline options but apparently without much logic behind it. I guess the linkers "optimise" something - shuffle sections around, no idea. It seems that "$-_TEXT" does not give a reliable basis for this kind of adjustment. Unfortunately, there is no proper documentation available. Maybe one of our senior experts (did I see "grey hair and a walking stick" somewhere? ;)) can enlighten us.
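
Why a module-relative offset may not be a reliable basis can be illustrated with plain arithmetic. The Python sketch below rests on an assumption that this thread does not confirm: that the linker places this module's code at a start address which is itself only 16-byte aligned. The base addresses and the name `absolute_alignment` are hypothetical.

```python
def absolute_alignment(module_base, offset_in_module, align=64):
    """Alignment remainder of a label once the module is placed in memory."""
    return (module_base + offset_in_module) % align

offset = 256  # "sum=256" from the assembler's echo output, a multiple of 64

# Hypothetical start addresses of this module's code, each 16-byte aligned
for base in (0x401000, 0x401010, 0x401020, 0x401030):
    print(hex(base), absolute_alignment(base, offset))
```

Under that assumption the four bases give remainders of 0, 16, 32 and 48, so the macro can pad to a multiple of 64 relative to the module start and the label can still land off a 64-byte boundary in the final image.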

The MASM Forum

General => The Laboratory => Topic started by: jj2007 on April 04, 2017, 09:43:51 AM