This is a spinoff from Guga's RosAsm thread (http://masm32.com/board/index.php?topic=6105.0). Can I have some timings please, also from AMD machines? Thanks.
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
1524 cycles for 100 * TransposeA
467 cycles for 100 * TransposeB (fast+small)
469 cycles for 100 * TransposeD (fast)
1475 cycles for 100 * TransposeA
465 cycles for 100 * TransposeB (fast+small)
468 cycles for 100 * TransposeD (fast)
1474 cycles for 100 * TransposeA
465 cycles for 100 * TransposeB (fast+small)
468 cycles for 100 * TransposeD (fast)
1605 cycles for 100 * TransposeA
466 cycles for 100 * TransposeB (fast+small)
466 cycles for 100 * TransposeD (fast)
1474 cycles for 100 * TransposeA
455 cycles for 100 * TransposeB (fast+small)
466 cycles for 100 * TransposeD (fast)
60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
2275 cycles for 100 * TransposeA
538 cycles for 100 * TransposeB (fast+small)
561 cycles for 100 * TransposeD (fast)
2158 cycles for 100 * TransposeA
542 cycles for 100 * TransposeB (fast+small)
560 cycles for 100 * TransposeD (fast)
2160 cycles for 100 * TransposeA
537 cycles for 100 * TransposeB (fast+small)
561 cycles for 100 * TransposeD (fast)
2159 cycles for 100 * TransposeA
537 cycles for 100 * TransposeB (fast+small)
560 cycles for 100 * TransposeD (fast)
2159 cycles for 100 * TransposeA
544 cycles for 100 * TransposeB (fast+small)
559 cycles for 100 * TransposeD (fast)
60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
2860 cycles for 100 * TransposeA
860 cycles for 100 * TransposeB (fast+small)
865 cycles for 100 * TransposeD (fast)
2515 cycles for 100 * TransposeA
860 cycles for 100 * TransposeB (fast+small)
741 cycles for 100 * TransposeD (fast)
2750 cycles for 100 * TransposeA
729 cycles for 100 * TransposeB (fast+small)
870 cycles for 100 * TransposeD (fast)
2753 cycles for 100 * TransposeA
728 cycles for 100 * TransposeB (fast+small)
870 cycles for 100 * TransposeD (fast)
2755 cycles for 100 * TransposeA
860 cycles for 100 * TransposeB (fast+small)
732 cycles for 100 * TransposeD (fast)
60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)
And here TransposeA moves a qword (2 dwords) and TransposeC is an optimized FPU version... for easy understanding :biggrin:
AMD A6-3500 APU with Radeon(tm) HD Graphics (SSE3)
4750 cycles for 100 * TransposeA
960 cycles for 100 * TransposeB
12471 cycles for 100 * TransposeC
4923 cycles for 100 * TransposeA
964 cycles for 100 * TransposeB
12321 cycles for 100 * TransposeC
4909 cycles for 100 * TransposeA
969 cycles for 100 * TransposeB
12477 cycles for 100 * TransposeC
4685 cycles for 100 * TransposeA
1047 cycles for 100 * TransposeB
12336 cycles for 100 * TransposeC
4596 cycles for 100 * TransposeA
964 cycles for 100 * TransposeB
12302 cycles for 100 * TransposeC
67 bytes for TransposeA
114 bytes for TransposeB
116 bytes for TransposeC
Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz (SSE4)
1861 cycles for 100 * TransposeA
1397 cycles for 100 * TransposeB (fast+small)
1377 cycles for 100 * TransposeD (fast)
1854 cycles for 100 * TransposeA
1396 cycles for 100 * TransposeB (fast+small)
1376 cycles for 100 * TransposeD (fast)
1876 cycles for 100 * TransposeA
1396 cycles for 100 * TransposeB (fast+small)
1391 cycles for 100 * TransposeD (fast)
1830 cycles for 100 * TransposeA
1398 cycles for 100 * TransposeB (fast+small)
1376 cycles for 100 * TransposeD (fast)
1830 cycles for 100 * TransposeA
1398 cycles for 100 * TransposeB (fast+small)
1377 cycles for 100 * TransposeD (fast)
60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)
Remarkably fast as usual :)
JJ and Siekmanski are the kings of optimization :bgrin: :bgrin: :greenclp: :greenclp:
Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (SSE4)
1439 cycles for 100 * TransposeA
385 cycles for 100 * TransposeB (fast+small)
283 cycles for 100 * TransposeD (fast)
1371 cycles for 100 * TransposeA
385 cycles for 100 * TransposeB (fast+small)
286 cycles for 100 * TransposeD (fast)
1372 cycles for 100 * TransposeA
741 cycles for 100 * TransposeB (fast+small)
588 cycles for 100 * TransposeD (fast)
2218 cycles for 100 * TransposeA
739 cycles for 100 * TransposeB (fast+small)
585 cycles for 100 * TransposeD (fast)
1437 cycles for 100 * TransposeA
875 cycles for 100 * TransposeB (fast+small)
309 cycles for 100 * TransposeD (fast)
60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)
Quote from: Siekmanski on April 04, 2017, 10:10:26 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
537 cycles for 100 * TransposeB (fast+small)
561 cycles for 100 * TransposeD (fast)
537 cycles for 100 * TransposeB (fast+small)
560 cycles for 100 * TransposeD (fast)
Quote from: guga on April 04, 2017, 02:15:59 PM
Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (SSE4)
385 cycles for 100 * TransposeB (fast+small)
283 cycles for 100 * TransposeD (fast)
385 cycles for 100 * TransposeB (fast+small)
286 cycles for 100 * TransposeD (fast)
Hmmm... these i7 CPUs are a PITA 8)
J:\Masm32\MasmBasic>TransposeMatrix.exe
Intel(R) Celeron(R) CPU N2840 @ 2.16GHz (SSE4)
4081 cycles for 100 * TransposeA
984 cycles for 100 * TransposeB (fast+small)
1101 cycles for 100 * TransposeD (fast)
4315 cycles for 100 * TransposeA
985 cycles for 100 * TransposeB (fast+small)
1088 cycles for 100 * TransposeD (fast)
3895 cycles for 100 * TransposeA
1011 cycles for 100 * TransposeB (fast+small)
1159 cycles for 100 * TransposeD (fast)
3900 cycles for 100 * TransposeA
983 cycles for 100 * TransposeB (fast+small)
1102 cycles for 100 * TransposeD (fast)
5852 cycles for 100 * TransposeA
985 cycles for 100 * TransposeB (fast+small)
1087 cycles for 100 * TransposeD (fast)
Quote:
Hmmm... these i7 CPUs are a PITA 8)
Yeah, they are.... :biggrin:
One thing I noticed, is the code align64 macro.
On my PC it aligns to 32 instead of 64. This is a funny thing. ( maybe only on my PC )
I'll have a look at this issue. I myself use the memory pointer from a "VirtualAlloc" call as a reference for the align64 macro instead of the program entry point.
Quote from: Siekmanski on April 04, 2017, 07:56:53 PM
One thing I noticed, is the code align64 macro.
On my PC it aligns to 32 instead of 64.
Try this instead - same code produced but it tells you what it does:
; .err
align_64 macro ; nidud (http://masm32.com/board/index.php?topic=4545.msg48734#msg48734)
  Local curalign, xbytes, tmp$
  curalign=$-_TEXT
  xbytes=64-(($-_TEXT) and (64-1))
  tmp$ CATSTR <## current=>, %curalign, < nops added: >, %xbytes, < sum=>, %(curalign+xbytes), < (line >, %@Line, <)>
  % echo tmp$
  if xbytes
    db xbytes dup(90h)
  endif
endm
## current=0 nops added: 64 sum=64 (line 96)
## current=159 nops added: 33 sum=192 (line 144)
## current=317 nops added: 3 sum=320 (line 211)
## current=464 nops added: 48 sum=512 (line 271)
## current=642 nops added: 62 sum=704 (line 324)
## current=723 nops added: 45 sum=768 (line 338)
## current=787 nops added: 45 sum=832 (line 352)
## current=851 nops added: 45 sum=896 (line 366)
## current=1453 nops added: 19 sum=1472 (line 573)
Looks OK, with the minor glitch that the first one doesn't need alignment.
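The glitch in the first output line (64 nops at offset 0) follows from the arithmetic: `64-(x and 63)` evaluates to 64, not 0, when `x` is already a multiple of 64. A minimal sketch of a variant that masks the padding a second time is shown below. The name `align_64z` is hypothetical, and the sketch still relies on `$-_TEXT` exactly as the macro above does, so it only pads relative to the start of this module's `_TEXT`.

```asm
; Hypothetical variant of align_64: adds no padding when already aligned.
; Still based on $-_TEXT, like the original macro.
align_64z macro
  Local xbytes
  ; (64 - offset mod 64) mod 64 -> 0..63 instead of 1..64
  xbytes = (64 - (($-_TEXT) and (64-1))) and (64-1)
  if xbytes
    db xbytes dup(90h)    ; pad with nops up to the next multiple of 64
  endif
endm
```

With the offsets listed above, offset 0 would get 0 nops instead of 64, while offset 159 would still get 33.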
I followed the "align_64" thread by you and Nidud some time ago.
Couldn't get it to work correctly.
Tried it again:
## current=462 nops added: 50 sum=512 (line 58)
## current=556 nops added: 20 sum=576 (line 74)
## current=692 nops added: 12 sum=704 (line 116)
## current=799 nops added: 33 sum=832 (line 155)
## current=1649 nops added: 15 sum=1664 (line 274)
This looks OK.
But, this piece of test code gives an alignment of 16.
align_64
test1:
  mov eax,offset test1 ; 4200016
  and eax,63 ; = 16
Quote from: Siekmanski on April 04, 2017, 09:45:27 PM
But, this piece of test code gives an alignment of 16.
Strange. Can you post the complete source please?
Hi Jochen,
Intel(R) Pentium(R) M processor 1.70GHz (SSE2)
+19 of 20 tests valid,
2812 cycles for 100 * TransposeA
2261 cycles for 100 * TransposeB (fast+small)
3126 cycles for 100 * TransposeD (fast)
2125 cycles for 100 * TransposeA
2241 cycles for 100 * TransposeB (fast+small)
2553 cycles for 100 * TransposeD (fast)
2767 cycles for 100 * TransposeA
2288 cycles for 100 * TransposeB (fast+small)
3304 cycles for 100 * TransposeD (fast)
2105 cycles for 100 * TransposeA
2905 cycles for 100 * TransposeB (fast+small)
2533 cycles for 100 * TransposeD (fast)
2915 cycles for 100 * TransposeA
2261 cycles for 100 * TransposeB (fast+small)
3088 cycles for 100 * TransposeD (fast)
60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz (SSE4)
1281 cycles for 100 * TransposeA
672 cycles for 100 * TransposeB (fast+small)
837 cycles for 100 * TransposeD (fast)
1218 cycles for 100 * TransposeA
675 cycles for 100 * TransposeB (fast+small)
833 cycles for 100 * TransposeD (fast)
1693 cycles for 100 * TransposeA
671 cycles for 100 * TransposeB (fast+small)
834 cycles for 100 * TransposeD (fast)
1220 cycles for 100 * TransposeA
673 cycles for 100 * TransposeB (fast+small)
833 cycles for 100 * TransposeD (fast)
1224 cycles for 100 * TransposeA
671 cycles for 100 * TransposeB (fast+small)
837 cycles for 100 * TransposeD (fast)
60 bytes for TransposeA
105 bytes for TransposeB (fast+small)
111 bytes for TransposeD (fast)
"+19 of 20 tests valid, " ? The Pentium M results do look erratic.
HTH,
Steve N.
Align_64 test.
Align4 00
Align16 00
Align64 32
Press any key to continue...
Quote from: FORTRANS on April 04, 2017, 11:32:06 PM
"+19 of 20 tests valid, " ? The Pentium M results do look erratic.
Steve,
19/20 is quite good. The macros I use check for outliers and take only the more stable values to calculate the average. Most of the time that works fine, but there is no guarantee. It remains slightly "alchemistic" 8)
In case of doubt, look at the lowest values. The processor cannot cheat at the lower end: it can run slower, yes, but running faster is difficult :biggrin:
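The "look at the lowest values" advice can be sketched as a small helper that returns the minimum of a set of cycle counts. This is not the code of the timing macros used for the results in this thread (their source is not shown here); `MinCycles`, `pCounts` and `nCounts` are hypothetical names, and a 32-bit flat, stdcall build as in a standard MASM32 setup is assumed.

```asm
; Hypothetical helper, not part of the timing macros used above.
; Returns in eax the smallest DWORD of an array of cycle counts.
; Assumes nCounts >= 1 and a 32-bit flat, stdcall build.
MinCycles proc uses esi pCounts:DWORD, nCounts:DWORD
    mov esi, pCounts        ; esi -> first cycle count
    mov ecx, nCounts        ; ecx = number of entries
    mov eax, [esi]          ; start with the first entry as the minimum
NextCount:
    cmp [esi], eax          ; unsigned compare: current entry vs. minimum
    jae KeepMin             ; not smaller: keep the current minimum
    mov eax, [esi]          ; smaller: this entry is the new minimum
KeepMin:
    add esi, 4              ; advance to the next DWORD
    dec ecx
    jnz NextCount
    ret
MinCycles endp
```

Fed with the five TransposeA values of the first i5 run (1524, 1475, 1474, 1605, 1474), it would return 1474.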
Quote from: Siekmanski on April 04, 2017, 11:37:18 PM
Align_64 test.
Align4 00
Align16 00
Align64 32
Your code, in the exe you posted:
004011E0 │. B8 E0114000 mov eax, 004011E0
004011E5 │. 83E0 3F and eax, 0000003F
Your code, built here with ML/HJWasm and standard options:
004010C0 │. B8 C0104000 mov eax, 004010C0
004010C5 │. 83E0 3F and eax, 0000003F
Did you use commandline option DONT_OVERDO_IT_WITH_THE_NOPS ? ::)
NOP(e) :biggrin:
It's a mystery to me. :(
:bgrin: :bgrin:
Quote from: Siekmanski on April 05, 2017, 02:48:52 AM
It's a mystery to me. :(
For me, too. Did you try different assemblers and/or linkers? I tried a number of options, none produced a non-zero result.
The command-line options used:
@echo off
if exist "Align_64.obj" del "Align_64.obj"
\masm32\bin\ml.exe /c /coff "Align_64.Asm"
if errorlevel 1 goto Einde
if exist "Align_64.exe" del "Align_64.exe"
\masm32\bin\Link.exe /SUBSYSTEM:CONSOLE /OPT:NOREF /OUT:Align_64.exe Align_64.obj
if errorlevel 1 goto Einde
Align_64.exe
:Einde
Microsoft (R) Macro Assembler Version 12.00.21005.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: Align_64.Asm
## current=209 nops added: 47 sum=256 (line 60)
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
Align_64 test.
Align4 00
Align16 00
Align64 32
Press any key to continue...
Same result with Polink.exe
Stripped the assembler and the linker from the vs2015.com_enu.iso
Still the same result.....
Microsoft (R) Macro Assembler Version 14.00.23026.0
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: Align_64.Asm
## current=218 nops added: 38 sum=256 (line 60)
Microsoft (R) Incremental Linker Version 14.00.23026.0
Copyright (C) Microsoft Corporation. All rights reserved.
Align_64 test.
Align4 00
Align16 00
Align64 32
Press any key to continue...
Command-line options used:
@echo off
if exist "Align_64.obj" del "Align_64.obj"
d:\RadASM2212\Masm2015\ml.exe /c /coff "Align_64.Asm"
if errorlevel 1 goto Einde
if exist "Align_64.exe" del "Align_64.exe"
d:\RadASM2212\Masm2015\link.exe /SUBSYSTEM:CONSOLE /OPT:NOREF /OUT:Align_64.exe Align_64.obj
if errorlevel 1 goto Einde
Align_64.exe
:Einde
Quote from: Siekmanski on April 05, 2017, 04:51:31 AM
The commandline options used
...
Same result with Polink.exe
I can produce the error with /debug and the M$ linkers (6.14 and 9.0), but not with polink.
I'm clueless.
Maybe it's how my CPU handles things......
Entry point is 401120h (32-byte aligned)
(http://members.home.nl/siekmanski/Image1.png)
Quote from: Siekmanski on April 05, 2017, 07:12:00 AM
I'm clueless.
Maybe it's how my CPU handles things......
No, your CPU is not the culprit. I can produce the same error. It's the linker... all linkers, actually :(
The macro adds the right number of nops afaics.
Which linker do you use ?
Mostly polink, then ML link 5.12 or 10.0
I've given up on other LINK versions, it's dll hell:
LINKV14 : fatal error LNK2023: bad DLL or entry point 'msobj140.dll'
etc etc
I just don't have the energy to shove the missing DLLs around. Besides, polink and link10 are sufficient.
Then it's not the linker. I used version 5.12 with the same wrong result.
d:\RadASM2212\Masm\Projects\Align_64>makeit
Microsoft (R) Macro Assembler Version 12.00.21005.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: Align_64.Asm
## current=219 nops added: 37 sum=256 (line 65)
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
Align_64 test.
Align4 00
Align16 00
Align64 32
Entry 401120
Try this one:
align_64 MACRO
  Local curalign, tmp$
  repeat 3
    curalign=($-_TEXT) and 63
    tmp$ CATSTR <current alignment=>, %curalign
    % echo tmp$
    if curalign
      nop
      align 16
      echo * align16
    endif
  endm
endm
I get the impression that I do not understand what the linker really does :(
Hi Jochen,
Test with the new macro:
Microsoft Windows [Version 6.3.9600]
(c) 2013 Microsoft Corporation. Alle rechten voorbehouden.
d:\RadASM2212\Masm\Projects\Align_64>makeit
Microsoft (R) Macro Assembler Version 12.00.21005.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: Align_64.Asm
current alignment=27
* align16
current alignment=43
* align16
current alignment=59
* align16
Microsoft (R) Incremental Linker Version 5.12.8078
Copyright (C) Microsoft Corp 1992-1998. All rights reserved.
Align_64 test.
Align4 00
Align16 00
Align64 48
Entry 4010A0
Press any key to continue...
d:\RadASM2212\Masm\Projects\Align_64>makeit2 (with polink)
Microsoft (R) Macro Assembler Version 12.00.21005.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: Align_64.Asm
current alignment=27
* align16
current alignment=43
* align16
current alignment=59
* align16
Align_64 test.
Align4 00
Align16 00
Align64 48
Entry 4010A0
Press any key to continue...
d:\RadASM2212\Masm\Projects\Align_64>makeit3
Microsoft (R) Macro Assembler Version 14.00.23026.0
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: Align_64.Asm
current alignment=36
* align16
current alignment=52
* align16
current alignment=4
* align16
Microsoft (R) Incremental Linker Version 14.00.23026.0
Copyright (C) Microsoft Corporation. All rights reserved.
Align_64 test.
Align4 00
Align16 00
Align64 48
Entry 3010A0
Yes indeed, I get the same type of problem, depending on commandline options but apparently without much logic behind it. I guess the linkers "optimise" something - shuffle sections around, no idea. It seems that "$-_TEXT" does not give a reliable basis for this kind of adjustment. Unfortunately, there is no proper documentation available. Maybe one of our senior experts (did I see "grey hair and a walking stick" somewhere? ;)) can enlighten us.
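One reading of the results above, offered as an assumption rather than a confirmed explanation: `$-_TEXT` is an offset from the start of this module's own `_TEXT` contribution, and the linker may place that contribution at any address that satisfies the section's declared alignment. The outputs are consistent with a 16-byte section alignment, since Align16 is always 00 while Align64 comes out as 16, 32 or 48. If that is the cause, a code segment that itself requests 64-byte alignment might behave better. The sketch below is untested; the segment name `_TEXT64` is hypothetical, and `ALIGN(64)` in the `SEGMENT` directive is assumed to be accepted by the assembler in use (check its documentation, older ML versions may reject it).

```asm
; Untested sketch: a separate code segment that asks for 64-byte alignment.
; _TEXT64 is a hypothetical name; ALIGN(n) support is an assumption.
_TEXT64 segment align(64) 'CODE'
test1:                          ; first label in the segment
    mov eax, offset test1
    and eax, 63                 ; 0 only if the linker honoured the alignment
    ret
_TEXT64 ends
```

The runtime check is the same `offset`/`and 63` test used earlier in the thread, so the result can be compared directly with the Align64 values printed above.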