Testing Code Align64

Started by Siekmanski, April 11, 2017, 11:21:44 AM

Siekmanski

Could you guys run this Align64 code-alignment test program and post the results?
jj2007 and nidud created this macro to align code to a 64 byte boundary.

Align64 MACRO
    LOCAL num_nops
    num_nops = 64-($-_TEXT) and 63
    if num_nops
        db num_nops dup(90h)
    endif
ENDM

Because I didn't get the right results, jj2007 and I tested it with different combinations of MASM assemblers and linkers and, sadly, got different results.
To be able to use this very useful macro, I adapted it to my needs and wrote a test program to check whether it works with my assembler and linker combination.

Marinus

Code Align64 test.

Program Entry Point: 004010A0, 64 byte alignment: 32

Memory_1: 0040110F aligned to: 00401140  alignment: 00 '49 NOP(S) inserted.'
Memory_2: 0040126E aligned to: 00401280  alignment: 00 '18 NOP(S) inserted.'
Memory_3: 00401552 aligned to: 00401580  alignment: 00 '46 NOP(S) inserted.'
Memory_4: 00401C80 aligned to: 00401C80  alignment: 00 '00 NOP(S) inserted.'

Press any key to continue...
Creative coders use backward thinking techniques as a strategy.
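
For anyone wanting to check the NOP counts by hand, here is a small Python sketch of the arithmetic in the macro's `num_nops` line. It is not MASM and not part of the original test program; `nops_needed` is a made-up name, and the sketch assumes the code section itself starts on a 64-byte boundary, so that absolute addresses and `$-_TEXT` offsets give the same padding.

```python
def nops_needed(offset, boundary=64):
    """Bytes of padding needed to move `offset` up to the next multiple of
    `boundary` (a power of two). Mirrors `64-($-_TEXT) and 63`."""
    return (boundary - offset) & (boundary - 1)


# Unaligned addresses taken from the test output above.
for addr in (0x0040110F, 0x0040126E, 0x00401552, 0x00401C80):
    pad = nops_needed(addr)
    print(f"{addr:08X} aligned to: {addr + pad:08X}  {pad:02d} NOP(S) inserted.")
```

The final `and 63` is what turns a full 64 into 0 when the offset is already aligned, as in the Memory_4 case.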

jj2007

Program Entry Point: 004010A0, 64 byte alignment: 32

Memory_1: 0040110F aligned to: 00401140  alignment: 00 '49 NOP(S) inserted.'
Memory_2: 0040126E aligned to: 00401280  alignment: 00 '18 NOP(S) inserted.'
Memory_3: 00401552 aligned to: 00401580  alignment: 00 '46 NOP(S) inserted.'
Memory_4: 00401C80 aligned to: 00401C80  alignment: 00 '00 NOP(S) inserted.'

hutch--

I must be a barbarian here: since the Core2 series of Intel hardware, code alignment almost never matters, and often when I have aligned labels the algo is slower. Data is another matter; if you want speed, data must be aligned. In 64-bit code the stack must be aligned to a multiple of 16, and at least with MASM that is easy enough. The latest version of my prologue code (which I have not posted yet) lets the user specify any alignment that is a multiple of 16.

jj2007

Quote from: hutch-- on April 11, 2017, 03:04:31 PM
I must be a barbarian here: since the Core2 series of Intel hardware, code alignment almost never matters, and often when I have aligned labels the algo is slower.

True. In my experience, align 2 or align 4 before a loop can help a little bit, but not always. Before a proc, I always align 16 hoping that the code cache is being used more efficiently - and more specifically, to put timings on a comparable basis.

For the latter, align 64 would be nice, but so far the linker plays foul. I am curious to see what Marinus cooks up.

Siekmanski

Found it... at least I think.
I had more than one code section in my program, and that messed it up.
So after all, it is a reliable macro as long as you keep that in mind.

jj2007

So what happens if you use a library??

Siekmanski

I'm now writing a speed test for code loops of 64 bytes or less, which fit in one code cache line,
to see if we can benefit from the Align64 macro.

But the linker messed things up again and the program entry point moved up 16 bytes (just from adding more code).
I can still use the macro by changing this line:

num_nops = 64-($-_TEXT) and 63
to:
num_nops = 48-($-_TEXT) and 63

Later I will test it in a library.
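
One way to read the 64 to 48 change, sketched in Python rather than MASM: if the module's code ends up starting 16 bytes past a 64-byte boundary in the linked image, every `$-_TEXT` offset is 16 short of the real position, and the padding has to shrink by the same amount. The names `padding` and `base_misalignment`, and the base address `0x00401010`, are hypothetical and only serve the illustration.

```python
def padding(offset, base_misalignment=0, boundary=64):
    """NOPs to insert at `offset` (bytes from the start of the module's code)
    when that start sits `base_misalignment` bytes past a boundary."""
    return (boundary - base_misalignment - offset) & (boundary - 1)


base = 0x00401010  # hypothetical start of the module's code: 16 mod 64
for offset in (0x00, 0x0F, 0x30, 0x45):
    pad = padding(offset, base_misalignment=base % 64)
    aligned = base + offset + pad
    print(f"offset {offset:02X}h: {pad:2d} NOPs -> {aligned:08X}")
```

With `base_misalignment=16` the expression reduces to `(48 - offset) and 63`, which is the edited line above.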

Siekmanski

Here is a test piece with source code to check out the Align64 macro.
The macro aligns code to 64 bytes so that it starts on a code cache line boundary and can execute faster.
As an example I have used a code loop from my FFT routine to test whether it runs faster with it.
And it does!

Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  003D0000
CodeAlign64 test:   00

Timing starts now:

1124 Cycles for Test_Align4
1102 Cycles for Test_Align16
1023 Cycles for Test_Align64


jj2007

About 4% faster compared to align 16, and about 9% faster compared to align 4.
Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  00250000
CodeAlign64 test:   00

Timing starts now:

1281 Cycles for Test_Align4
1216 Cycles for Test_Align16
1168 Cycles for Test_Align64
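
To compare runs across machines it helps to turn the cycle counts into percentages. A quick Python sketch follows; the helper name `percent_saved` is made up, the numbers are the ones posted above, and a single run per machine is of course not a rigorous benchmark.

```python
# Cycle counts (align 4, align 16, Align64) copied from the two posts above.
results = {
    "Siekmanski": (1124, 1102, 1023),
    "jj2007": (1281, 1216, 1168),
}


def percent_saved(baseline, aligned):
    """Cycles saved by `aligned` relative to `baseline`, in percent."""
    return 100.0 * (baseline - aligned) / baseline


for name, (a4, a16, a64) in results.items():
    print(f"{name}: {percent_saved(a16, a64):.1f}% vs align 16, "
          f"{percent_saved(a4, a64):.1f}% vs align 4")
```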

hutch--

There is an option in a COFF object module to align the code in the module, but it's not easy to get at. With the data module tool I have in MASM32, "fda.exe", it's a simple option: you just supply a number, as long as it is a power of 2, and it works fine. I guess it would require a dedicated tool to modify an existing object module, but that may be a way to get reliable code alignment greater than 16 bytes.
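
The option mentioned here is presumably the alignment field in the COFF section header: per the PE/COFF specification, bits 20 to 23 of a section's `Characteristics` value encode the alignment as `IMAGE_SCN_ALIGN_*` flags. A Python sketch of decoding and changing that field follows. The helper names and the sample value `0x60500020` (code, execute, read, 16-byte alignment) are illustrative assumptions; this does not patch a real object file and says nothing about how fda.exe works internally.

```python
IMAGE_SCN_ALIGN_MASK = 0x00F00000  # bits 20..23 of Characteristics


def get_alignment(characteristics):
    """Return the section alignment in bytes, or None if no flag is set."""
    field = (characteristics & IMAGE_SCN_ALIGN_MASK) >> 20
    return None if field == 0 else 1 << (field - 1)


def set_alignment(characteristics, alignment):
    """Return Characteristics with the alignment field set to `alignment`."""
    if alignment < 1 or alignment > 8192 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two from 1 to 8192")
    field = alignment.bit_length()  # 1 -> 1, 16 -> 5, 64 -> 7
    return (characteristics & ~IMAGE_SCN_ALIGN_MASK) | (field << 20)


text_flags = 0x60500020  # sample value: CODE | EXECUTE | READ | ALIGN_16
print(get_alignment(text_flags))
print(hex(set_alignment(text_flags, 64)))
```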

Mikl__

Hi, Siekmanski!
align 64 is ((-$+_TEXT)and 63)
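
As plain arithmetic the two expressions agree: negating the offset and masking with 63 gives the same count as subtracting from 64 first. A Python sketch checking this for a range of offsets follows; the function names are made up, and whether a given assembler accepts `-$+_TEXT` is a separate question that this does not model.

```python
def pad_original(offset):
    """Arithmetic of `64-($-_TEXT) and 63`."""
    return (64 - offset) & 63


def pad_negated(offset):
    """Arithmetic of `(-$+_TEXT) and 63`, with offset standing in for $-_TEXT."""
    return (-offset) & 63


mismatches = [off for off in range(4096) if pad_original(off) != pad_negated(off)]
print("mismatches:", mismatches)
```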

Siekmanski

jj2007, it seems we can use it to speed things up.

Hi Hutch,
That's an option, but it would be quite a workaround. With this macro you can adjust the constant to the program entry point once the program is ready to be released, and then use it just like align 4 and align 16. Once assembled and linked, it works.

But the most important question is: is it faster on your PC than align 16?

Hi Mikl__
((-$+_TEXT)and 63) gives this: error A2071: initializer magnitude too large for specified size

hutch--

Marinus,

Same results with multiple tests.


Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  005D0000
CodeAlign64 test:   00

Timing starts now:

832 Cycles for Test_Align4
819 Cycles for Test_Align16
775 Cycles for Test_Align64

Press any key to continue...


TWell

Cheap AMD
Align64 speed test by Siekmanski.

ProgramEntry: 00401050
BufferStart:  00360000
CodeAlign64 test:   00

Timing starts now:

887 Cycles for Test_Align4
777 Cycles for Test_Align16
736 Cycles for Test_Align64