The MASM Forum

General => The Workshop => Topic started by: Siekmanski on April 11, 2017, 11:21:44 AM

Title: Testing Code Align64
Post by: Siekmanski on April 11, 2017, 11:21:44 AM
Could you guys run this Align64 code-alignment test program and post the results?
jj2007 and nidud created this macro to align code to a 64-byte boundary.

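Align64 MACRO
  LOCAL num_nops
  ; bytes needed to reach the next 64-byte boundary, counted from the start of _TEXT
  num_nops = 64-($-_TEXT) and 63
  if num_nops
    db num_nops dup(90h)   ; pad with NOPs (opcode 90h)
  endif
ENDM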

Because I didn't get the right results, jj2007 and I tested it with different combinations of MASM assemblers and linkers and, sadly, got different results.
To make use of this very useful macro, I adapted it to my needs and wrote a test program to check whether it works with my assembler and linker combination.

Marinus

Code Align64 test.

Program Entry Point: 004010A0, 64 byte alignment: 32

Memory_1: 0040110F aligned to: 00401140  alignment: 00 '49 NOP(S) inserted.'
Memory_2: 0040126E aligned to: 00401280  alignment: 00 '18 NOP(S) inserted.'
Memory_3: 00401552 aligned to: 00401580  alignment: 00 '46 NOP(S) inserted.'
Memory_4: 00401C80 aligned to: 00401C80  alignment: 00 '00 NOP(S) inserted.'

Press any key to continue...
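
As a cross-check of the numbers above, the padding formula can be modelled outside the assembler. This is only a sketch in Python, not part of the test program; `nops_needed` is a made-up helper name, and it assumes the code section itself starts on a 64-byte boundary, so that the low six bits of the address equal the low six bits of `$-_TEXT`.

```python
def nops_needed(address: int) -> int:
    """Model of `64-($-_TEXT) and 63`: bytes to the next 64-byte boundary.

    Assumption: the section base is 64-byte aligned, so only the low
    six bits of the address matter.
    """
    return (64 - (address & 63)) & 63

# Addresses taken from the test output above.
for label, addr in [("Memory_1", 0x0040110F), ("Memory_2", 0x0040126E),
                    ("Memory_3", 0x00401552), ("Memory_4", 0x00401C80)]:
    pad = nops_needed(addr)
    print(f"{label}: {addr:08X} + {pad:2d} NOPs -> {addr + pad:08X}")
```

With these four addresses the model gives 49, 18, 46 and 0 NOPs, the same counts the test program printed.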
Title: Re: Testing Code Align64
Post by: jj2007 on April 11, 2017, 12:32:42 PM
Program Entry Point: 004010A0, 64 byte alignment: 32

Memory_1: 0040110F aligned to: 00401140  alignment: 00 '49 NOP(S) inserted.'
Memory_2: 0040126E aligned to: 00401280  alignment: 00 '18 NOP(S) inserted.'
Memory_3: 00401552 aligned to: 00401580  alignment: 00 '46 NOP(S) inserted.'
Memory_4: 00401C80 aligned to: 00401C80  alignment: 00 '00 NOP(S) inserted.'
Title: Re: Testing Code Align64
Post by: hutch-- on April 11, 2017, 03:04:31 PM
I must be a barbarian here: since the Core2 series of Intel hardware, code alignment almost never matters, and often when I have aligned labels the algo is slower. Data is another matter: if you want speed, data must be aligned. In 64-bit code the stack must be aligned to a multiple of 16, and at least with MASM that is easy enough; the latest version of my prologue code (which I have not posted yet) lets the user choose any alignment that is a multiple of 16.
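
Rounding an address down to a multiple of 16 is a single mask operation, the same thing `and rsp, -16` does in 64-bit code. The lines below are a small Python model of that arithmetic only, not the prologue code mentioned above (which is not shown in this thread); `align_down` is a hypothetical name and the sample address is arbitrary.

```python
def align_down(address: int, alignment: int = 16) -> int:
    """Round address down to a multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return address & ~(alignment - 1)

print(hex(align_down(0x12FF7C)))      # rounded down to a multiple of 16
print(hex(align_down(0x12FF7C, 64)))  # rounded down to a multiple of 64
```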
Title: Re: Testing Code Align64
Post by: jj2007 on April 11, 2017, 06:41:04 PM
Quote from: hutch-- on April 11, 2017, 03:04:31 PM
I must be a barbarian here: since the Core2 series of Intel hardware, code alignment almost never matters, and often when I have aligned labels the algo is slower.

True. In my experience, align 2 or align 4 before a loop can help a little bit, but not always. Before a proc, I always align 16 hoping that the code cache is being used more efficiently - and more specifically, to put timings on a comparable basis.

For the latter, align 64 would be nice, but so far the linker plays foul. I am curious to see what Marinus cooks up.
Title: Re: Testing Code Align64
Post by: Siekmanski on April 12, 2017, 06:42:52 AM
Found it... at least I think so.
I had more than one code section in my program, and that messed it up.
So after all it is a reliable macro, as long as you keep that in mind.
Title: Re: Testing Code Align64
Post by: jj2007 on April 12, 2017, 07:08:40 AM
So what happens if you use a library??
Title: Re: Testing Code Align64
Post by: Siekmanski on April 12, 2017, 08:17:59 AM
I'm now writing a speed test for code loops of 64 bytes or less, so that they fit in one code cache line,
to see if we can benefit from the Align64 macro.
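
To make the "fits in one cache line" idea concrete, here is a rough Python sketch that counts how many 64-byte lines a block of code touches. The function name, the addresses and the 60-byte loop size are all made up for illustration, and a 64-byte line size is assumed.

```python
LINE = 64  # assumed cache line size in bytes

def lines_touched(start: int, size: int) -> int:
    """Number of 64-byte cache lines covered by size bytes of code at start."""
    if size <= 0:
        return 0
    first = start // LINE
    last = (start + size - 1) // LINE
    return last - first + 1

print(lines_touched(0x401140, 60))  # loop starts on a line boundary
print(lines_touched(0x401150, 60))  # same loop, 16 bytes into a line
```

A loop of up to 64 bytes occupies a single line only when it starts on a 64-byte boundary or is short enough not to cross the next one.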

But the linker messed things up again and the program entry moved up 16 bytes (just from adding more code).
I can still use the macro by changing this line:

num_nops = 64-($-_TEXT) and 63
to:
num_nops = 48-($-_TEXT) and 63

Later I will test it in a library.
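
One way to read the change from 64 to 48: if the code the macro measures from starts 16 bytes past a 64-byte boundary, the padding has to be 16 bytes shorter. The Python sketch below models that. `base_mod64` is a hypothetical parameter standing for the load address of the `_TEXT` start modulo 64, which is not stated explicitly here, and the base address is only illustrative.

```python
def nops_needed(offset: int, base_mod64: int = 0) -> int:
    """Padding so that (base + offset + padding) is a multiple of 64.

    offset     -- the value of $-_TEXT at the macro call
    base_mod64 -- assumed: address of the _TEXT start, modulo 64
    """
    return (64 - base_mod64 - offset) & 63

base = 0x00401050   # illustrative base address; 0x50 % 64 == 16
for offset in (0x0F, 0x6E, 0x152):
    pad = nops_needed(offset, base % 64)
    assert pad == (48 - offset) & 63          # same as the modified macro line
    assert (base + offset + pad) % 64 == 0    # absolute address is aligned
    print(f"offset {offset:#05x}: {pad:2d} NOPs")
```

Under that assumption the constant is simply 64 minus the base's offset into its cache line, so it would have to be revisited whenever the linker moves the code again.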
Title: Re: Testing Code Align64
Post by: Siekmanski on April 12, 2017, 09:15:44 AM
Here is a test piece with source code to check out the Align64 macro.
It can be used to align code to 64 bytes so that the code cache can execute it faster.
As an example, I have used a code loop from my FFT routine to test whether it runs faster with it.
And it does!

Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  003D0000
CodeAlign64 test:   00

Timing starts now:

1124 Cycles for Test_Align4
1102 Cycles for Test_Align16
1023 Cycles for Test_Align64

Title: Re: Testing Code Align64
Post by: jj2007 on April 12, 2017, 10:20:43 AM
About 4% faster compared to align 16, and about 9% faster compared to align 4.
Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  00250000
CodeAlign64 test:   00

Timing starts now:

1281 Cycles for Test_Align4
1216 Cycles for Test_Align16
1168 Cycles for Test_Align64
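
For anyone who wants to compare runs the same way, the gain can be computed directly from the cycle counts. A minimal Python sketch, where "percent faster" is taken to mean cycles saved relative to the slower variant (other definitions give slightly different figures) and `percent_faster` is a made-up helper name:

```python
def percent_faster(slow_cycles: int, fast_cycles: int) -> float:
    """Cycles saved by the faster variant, as a percentage of the slower one."""
    return 100.0 * (slow_cycles - fast_cycles) / slow_cycles

align4, align16, align64 = 1281, 1216, 1168   # cycle counts from the run above
print(f"Align64 vs Align16: {percent_faster(align16, align64):.1f}%")
print(f"Align64 vs Align4:  {percent_faster(align4, align64):.1f}%")
```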
Title: Re: Testing Code Align64
Post by: hutch-- on April 12, 2017, 12:24:40 PM
There is an option in a COFF object module to align the code in the module, but it's not easy to get at. With the data module tool I have in MASM32, "fda.exe", it's a simple option: you just supply a number, as long as it is a power of 2, and it works fine. I guess it would require a dedicated tool to modify an existing object module, but that may be a way to get reliable code alignment greater than 16 bytes.
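
For reference, that alignment lives in the `Characteristics` field of each COFF section header: bits 20 to 23 hold a code n meaning an alignment of 2^(n-1) bytes, so `IMAGE_SCN_ALIGN_16BYTES` is 0x00500000 and `IMAGE_SCN_ALIGN_64BYTES` is 0x00700000. The Python sketch below only does the bit arithmetic. It does not read or patch an object file, the helper names are made up, and the sample value 0x60500020 is just an illustrative code-section flag set.

```python
IMAGE_SCN_ALIGN_MASK = 0x00F00000  # bits 20-23 of Characteristics

def get_alignment(characteristics: int) -> int:
    """Decode the section alignment in bytes (0 if no alignment is given)."""
    code = (characteristics & IMAGE_SCN_ALIGN_MASK) >> 20
    return 1 << (code - 1) if code else 0

def set_alignment(characteristics: int, alignment: int) -> int:
    """Return characteristics with the alignment bits set to alignment bytes."""
    if alignment <= 0 or alignment & (alignment - 1) or alignment > 8192:
        raise ValueError("alignment must be a power of two from 1 to 8192")
    code = alignment.bit_length()          # 1 -> 1, 16 -> 5, 64 -> 7
    return (characteristics & ~IMAGE_SCN_ALIGN_MASK) | (code << 20)

flags = 0x60500020                 # code | execute | read | align 16
patched = set_alignment(flags, 64)
print(hex(patched), get_alignment(patched))
```

A dedicated tool of the kind suggested above would still have to locate the right section header in the object file before rewriting this field.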
Title: Re: Testing Code Align64
Post by: Mikl__ on April 12, 2017, 12:31:24 PM
Hi, Siekmanski!
align 64 is ((-$+_TEXT)and 63)
Title: Re: Testing Code Align64
Post by: Siekmanski on April 12, 2017, 02:24:10 PM
jj2007, it seems we can use it to speed things up.

Hi Hutch,
That's an option, but it would be quite a workaround. With this macro you can adjust it to the program entry point when the program is ready to be released and use it just like align 4 and 16. Once assembled and linked, it works.

But the most important question is: is it faster on your PC than align 16?

Hi Mikl__
((-$+_TEXT)and 63) gives this: error A2071:initializer magnitude too large for specified size
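
Arithmetically, the negated form and the one in the macro ask for the same number of bytes, because negating an offset and subtracting it from 64 differ by a multiple of 64. A quick Python check of that identity follows. It says nothing about how the assembler evaluates the expression or why it reports A2071, and the function names are made up.

```python
def pad_macro(offset: int) -> int:
    return (64 - offset) & 63      # form used in the Align64 macro

def pad_negated(offset: int) -> int:
    return (-offset) & 63          # negated form, (-$+_TEXT) and 63

# The two forms agree for every offset in a 4 KB range.
assert all(pad_macro(o) == pad_negated(o) for o in range(4096))
print(pad_macro(0x10F), pad_negated(0x10F))
```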
Title: Re: Testing Code Align64
Post by: hutch-- on April 12, 2017, 02:59:24 PM
Marinus,

Same results with multiple tests.


Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  005D0000
CodeAlign64 test:   00

Timing starts now:

832 Cycles for Test_Align4
819 Cycles for Test_Align16
775 Cycles for Test_Align64

Press any key to continue...
Title: Re: Testing Code Align64
Post by: Siekmanski on April 12, 2017, 03:02:08 PM
Thanks.
Title: Re: Testing Code Align64
Post by: TWell on April 12, 2017, 03:46:12 PM
Cheap AMD
Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  00360000
CodeAlign64 test:   00

Timing starts now:

887 Cycles for Test_Align4
777 Cycles for Test_Align16
736 Cycles for Test_Align64
Title: Re: Testing Code Align64
Post by: habran on April 12, 2017, 03:48:08 PM
Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  00260000
CodeAlign64 test:   00

Timing starts now:

266 Cycles for Test_Align4
274 Cycles for Test_Align16
403 Cycles for Test_Align64

Press any key to continue...


My laptop:


Number of cores           4 (max 8)
Number of threads         8 (max 16)
Name                      Intel Core i7 4700MQ
Codename                  Haswell
Specification             Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
Instruction sets          MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES, AVX, AVX2, FMA3
L1 Data cache             4 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache      4 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache                  4 x 256 KBytes, 8-way set associative, 64-byte line size
L3 cache                  6 MBytes, 12-way set associative, 64-byte line size
FID/VID Control           yes
Turbo Mode                supported, enabled

Title: Re: Testing Code Align64
Post by: hutch-- on April 12, 2017, 04:12:46 PM
Here is the result on my old i7.

Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  00230000
CodeAlign64 test:   00

Timing starts now:

386 Cycles for Test_Align4
379 Cycles for Test_Align16
351 Cycles for Test_Align64

Press any key to continue...
Title: Re: Testing Code Align64
Post by: Siekmanski on April 12, 2017, 09:00:51 PM
Thanks guys,

Habran, the results on your laptop go from fast to slow?

Results on my i7 are almost the same as on Hutch's i7.

Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  00230000
CodeAlign64 test:   00

Timing starts now:

397 Cycles for Test_Align4
386 Cycles for Test_Align16
366 Cycles for Test_Align64


Title: Re: Testing Code Align64
Post by: habran on April 12, 2017, 09:22:30 PM
I am surprised myself.
Title: Re: Testing Code Align64
Post by: mineiro on April 12, 2017, 09:32:43 PM
$ wine Align64SpeedTest.exe
Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  00340000
CodeAlign64 test:   00

Timing starts now:

941 Cycles for Test_Align4
791 Cycles for Test_Align16
521 Cycles for Test_Align64

Press any key to continue...

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Pentium(R) Dual  CPU  E2160  @ 1.80GHz
stepping        : 11
microcode       : 0xba
cpu MHz         : 1200.000
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm
bogomips        : 3599.77
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
...

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 15
Stepping:              11
CPU MHz:               1800.000
BogoMIPS:              3599.77
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0,1


Title: Re: Testing Code Align64
Post by: habran on April 12, 2017, 09:35:19 PM
This one looks better :P
Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  00230000
CodeAlign64 test:   00

Timing starts now:

435 Cycles for Test_Align4
219 Cycles for Test_Align16
213 Cycles for Test_Align64

Press any key to continue...
Title: Re: Testing Code Align64
Post by: Siekmanski on April 12, 2017, 09:43:32 PM
mineiro,
Wow, that's a big gain running it on Wine.
Title: Re: Testing Code Align64
Post by: mineiro on April 12, 2017, 10:08:09 PM
Hello sir Siekmanski,
in my tests the Wine results are the same as when I test on Win32 XP SP3,
at least with the benchmark algos on the masm32 board that are crafted to do just this.