I found this interesting whilst researching cache optimization techniques, and wanted to ask the ALIGNment gurus
what mitigation steps they would suggest for such scenarios.
Agner Fog: "Dynamic linking of function libraries (DLL's or shared objects) makes code caching less efficient.
Dynamic link libraries are typically loaded at round memory addresses. This can cause cache contentions if the
distances between multiple DLL's are divisible by high powers of 2."
Just to add some value as well (can't just ask and not give :bgrin:) - here are the top 7 techniques to improve application cache utilization performance....
Basics: 1) Code instructions and the data they operate on should at least fit into the L2 cache
2) If multi-core processors are available, performance may be improved by pinning software threads to specific cores (processor affinity) to exploit the caching architecture
Intermediate/Advanced:
• Use the same memory operand sizes
• The PREFETCH instruction can help reduce latencies for irregular, sequential or large memory accesses (e.g. data that does not fit inside the cache)
• Alignment of the stack, code and data, done correctly, will improve cache performance
• Do not use self-modifying code or store data within code segments
• When writing SIMD code, interleave instructions in a Load then Store, Load then Store pattern
I'll add some more later if the thread gets popular :idea:
Quote from: Raistlin on August 16, 2017, 08:49:32 PM
• When writing SIMD code, interleave instructions in a Load then Store, Load then Store pattern
load store load store:
align 4
.While 1
sub ecx, XMMWORD
.Break .if Sign?
movups xmm0, [esi+ecx]
movups [edi+ecx], xmm0
.Endw
x8:
sub ecx, 8*XMMWORD
.Break .if Sign?
movups xmm0, [esi+ecx+0*XMMWORD]
movups xmm1, [esi+ecx+1*XMMWORD]
movups xmm2, [esi+ecx+2*XMMWORD]
movups xmm3, [esi+ecx+3*XMMWORD]
movups xmm4, [esi+ecx+4*XMMWORD]
movups xmm5, [esi+ecx+5*XMMWORD]
movups xmm6, [esi+ecx+6*XMMWORD]
movups xmm7, [esi+ecx+7*XMMWORD]
movups [edi+ecx+0*XMMWORD], xmm0
movups [edi+ecx+1*XMMWORD], xmm1
movups [edi+ecx+2*XMMWORD], xmm2
movups [edi+ecx+3*XMMWORD], xmm3
movups [edi+ecx+4*XMMWORD], xmm4
movups [edi+ecx+5*XMMWORD], xmm5
movups [edi+ecx+6*XMMWORD], xmm6
movups [edi+ecx+7*XMMWORD], xmm7
... but the results are not that simple:
10000000 bytes of memory used
4475 kCycles for 1 * load load store store x8
4569 kCycles for 1 * load store load store
4552 kCycles for 1 * load load store store x4
1000000 bytes of memory used
246554 cycles for 1 * load load store store x8
180390 cycles for 1 * load store load store <<<<<<<<<<<+++++
306627 cycles for 1 * load load store store x4
800000 bytes of memory used
197757 cycles for 1 * load load store store x8
136593 cycles for 1 * load store load store <<<<<<<<<<<+++++
242996 cycles for 1 * load load store store x4
500000 bytes of memory used
131991 cycles for 1 * load load store store x8
202065 cycles for 1 * load store load store <<<<<<<<<<<-----
170440 cycles for 1 * load load store store x4
100000 bytes of memory used
14357 cycles for 1 * load load store store x8
23628 cycles for 1 * load store load store <<<<<<<<<<<-----
21685 cycles for 1 * load load store store x4
10000 bytes of memory used
798 cycles for 1 * load load store store x8
1045 cycles for 1 * load store load store <<<<<<<<<<<-----
792 cycles for 1 * load load store store x4
Keep the code loops within 64 bytes and align the code loops to 64 byte in code memory.
Align the data to 8192 bytes to please the data cache.
Use tiling to read and write the data.
Optimize the instruction order. ( instruction latency and reciprocal throughput )
Make smart use of the execution ports.
Use non-temporal write instructions ( MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPD, MOVNTPS ) when writing to a memory location that is unlikely to be cached.
And multitask the hell out of it. ( keep the threads in an idle state if not running code, assign each thread to its own CPU using processor affinity. )
On some CPUs the PREFETCH instruction takes 43 cycles and can spoil the fun.
In most cases the automatic prefetching is sufficient.
Just check the date of this stuff, it sounds like PIV technology and I am not sure how much use it is in Core2 and later. In later hardware the PREFETCH family of instructions appear to be useless. Data alignment is always important but from Core2 upwards code alignment rarely ever makes any difference and often makes an algo slower.
Quote from: hutch-- on August 16, 2017, 10:51:40 PM
Data alignment is always important but from Core2 upwards code alignment rarely ever makes any difference and often makes an algo slower.
http://masm32.com/board/index.php?topic=6141.msg65172#msg65172
:t These are the kinds of posts that I like most!
:greenclp:
@ Hutch (references with dates)............
Basics:
1) Code instructions and the data they operate on should at least fit into the L2 cache (Akhter, 2006)
2) If multi-core processors are available, performance may be improved by pinning software threads to specific cores (processor affinity) to exploit the caching architecture (Kazempor, 2008)
Intermediate/Advanced: (AMD - 2014 /Intel -2016, Agner Fog, 2017) - Optimization Manuals
• Use the same memory operand sizes
• The PREFETCH instruction can help reduce latencies for irregular, sequential or large memory accesses (e.g. data that does not fit inside the cache)
• Alignment of the stack, code and data, done correctly, will improve cache performance
• Do not use self-modifying code or store data within code segments
• When writing SIMD code, interleave instructions in a Load then Store, Load then Store pattern (AMD - 2014 specifically says this @ jj)
@ JJ2007 - Re: SIMD interleave...........
Interestingly - AMD is quite clear about interleaving instructions, whereas Intel just says it's a good idea (re: suggested).
Just looking at Siekmanski's advice - let's see what happens when we align 64.... otherwise it looks like load, store patterns
do form some sort of predictable average performance set, though I might be looking at it wrong.
@Siekmanski - Re: ALIGN 64 - do you think this might have something to do with cache line size and bus bandwidth ? Or please clarify.
Yes - I see "it just works" - but what might be the mechanism by which it speeds things up? Intel/AMD say we should avoid
referencing instructions or data more than a 4K boundary stride away, essentially equating to non-temporal page loads and cache thrashing.
@ Everyone else: sooooo, is anybody going to give me advice on DLL module alignment ? - the original question :icon_eek:
Quote from: Raistlin on August 17, 2017, 03:37:37 PM
@ JJ2007 - Re: SIMD interleave...........
Interestingly - AMD is quite clear about interleaving instructions, whereas Intel just says it's a good idea (re: suggested).
If you have an AMD, post your timings, see attachment above. For the Intel Core i5:
at 10MB, load store load store makes no difference
between 0.5 and 10 MB it's clearly better
below 0.5 MB it's clearly worse
ALIGN 64 - Just to improve cache hit rates and to avoid penalties.
I asked myself what the gain would be if I put a 64-byte-aligned code loop in 1 cache line, so it doesn't need to be fetched over and over again.
How many cycles do you have to wait before the next line is fetched? ( differs per system of course )
For modern CPUs:
accessing L1, +/- 4 cycles
accessing L2, +/- 10 cycles
accessing L3, +/- 30 cycles
missing all cache levels, 100 cycles or more.
What about DLL module alignment...
I've never written a DLL myself, but can't you use self-modifying code to align code and data? ( it only needs to be done once at initialization )
I have a funny old fashioned approach to these things, put them on the clock, the rest does not matter.