I found this interesting whilst researching cache optimization techniques, and wanted to ask the ALIGNment gurus
what mitigation steps they would suggest for such scenarios.
Agner Fog: "Dynamic linking of function libraries (DLL's or shared objects) makes code caching less efficient.
Dynamic link libraries are typically loaded at round memory addresses. This can cause cache contentions if the
distances between multiple DLL's are divisible by high powers of 2."
Just to add some value as well (can't just ask and not give :bgrin:) - here are the top 7 techniques to improve application cache utilization performance....
Basics: 1) Code instructions and the data they operate on should at least fit into the L2 cache
2) If multi-core processors are available, performance may be improved by pinning software threads to specific cores (processor affinity) to exploit the caching architecture
Intermediate/Advanced:
• Use the same memory operand sizes
• The PREFETCH instruction can help reduce latencies for irregular, sequential or large memory accesses (e.g. data that does not fit inside the cache)
• Alignment of the stack, code and data, done correctly, will improve cache performance
• Do not use self-modifying code or store data within code segments
• When writing SIMD code, interleave instructions in a Load then Store, Load then Store pattern
I'll add some more later if the thread gets popular :idea:
Quote from: Raistlin on August 16, 2017, 08:49:32 PM
• When writing SIMD code, interleave instructions in a Load then Store, Load then Store pattern
load store load store:
align 4
.While 1
sub ecx, XMMWORD
.Break .if Sign?
movups xmm0, [esi+ecx]
movups [edi+ecx], xmm0
.Endw
x8:
sub ecx, 8*XMMWORD
.Break .if Sign?
movups xmm0, [esi+ecx+0*XMMWORD]
movups xmm1, [esi+ecx+1*XMMWORD]
movups xmm2, [esi+ecx+2*XMMWORD]
movups xmm3, [esi+ecx+3*XMMWORD]
movups xmm4, [esi+ecx+4*XMMWORD]
movups xmm5, [esi+ecx+5*XMMWORD]
movups xmm6, [esi+ecx+6*XMMWORD]
movups xmm7, [esi+ecx+7*XMMWORD]
movups [edi+ecx+0*XMMWORD], xmm0
movups [edi+ecx+1*XMMWORD], xmm1
movups [edi+ecx+2*XMMWORD], xmm2
movups [edi+ecx+3*XMMWORD], xmm3
movups [edi+ecx+4*XMMWORD], xmm4
movups [edi+ecx+5*XMMWORD], xmm5
movups [edi+ecx+6*XMMWORD], xmm6
movups [edi+ecx+7*XMMWORD], xmm7
... but the results are not that simple:
10000000 bytes of memory used
4475 kCycles for 1 * load load store store x8
4569 kCycles for 1 * load store load store
4552 kCycles for 1 * load load store store x4
1000000 bytes of memory used
246554 cycles for 1 * load load store store x8
180390 cycles for 1 * load store load store <<<<<<<<<<<+++++
306627 cycles for 1 * load load store store x4
800000 bytes of memory used
197757 cycles for 1 * load load store store x8
136593 cycles for 1 * load store load store <<<<<<<<<<<+++++
242996 cycles for 1 * load load store store x4
500000 bytes of memory used
131991 cycles for 1 * load load store store x8
202065 cycles for 1 * load store load store <<<<<<<<<<<-----
170440 cycles for 1 * load load store store x4
100000 bytes of memory used
14357 cycles for 1 * load load store store x8
23628 cycles for 1 * load store load store <<<<<<<<<<<-----
21685 cycles for 1 * load load store store x4
10000 bytes of memory used
798 cycles for 1 * load load store store x8
1045 cycles for 1 * load store load store <<<<<<<<<<<-----
792 cycles for 1 * load load store store x4
Keep the code loops within 64 bytes and align the code loops to 64 byte in code memory.
Align the data to 8192 bytes to please the data cache.
Use tiling to read and write the data.
Optimize the instruction order. ( instruction latency and reciprocal throughput )
Make smart use of the execution ports.
Use non-temporal write instructions ( MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPD, MOVNTPS ) when writing to a memory location that is unlikely to be cached.
And multitask the hell out of it. ( keep the threads in an idle state if not running code, assign each thread to its own CPU using processor affinity. )
On some CPUs the PREFETCH instruction takes 43 cycles and can spoil the fun.
In most cases the automatic prefetching is sufficient.
Just check the date of this stuff, it sounds like PIV technology and I am not sure how much use it is in Core2 and later. In later hardware the PREFETCH family of instructions appear to be useless. Data alignment is always important but from Core2 upwards code alignment rarely ever makes any difference and often makes an algo slower.
Quote from: hutch-- on August 16, 2017, 10:51:40 PM
Data alignment is always important but from Core2 upwards code alignment rarely ever makes any difference and often makes an algo slower.
http://masm32.com/board/index.php?topic=6141.msg65172#msg65172
:t These are the kinds of posts that I like most!
:greenclp:
@ Hutch (references with dates)............
Basics:
1) Code instructions and the data they operate on should at least fit into the L2 cache (Akhter, 2006)
2) If multi-core processors are available, performance may be improved by pinning software threads to specific cores (processor affinity) to exploit the caching architecture (Kazempor, 2008)
Intermediate/Advanced: (AMD - 2014 /Intel -2016, Agner Fog, 2017) - Optimization Manuals
• Use the same memory operand sizes
• The PREFETCH instruction can help reduce latencies for irregular, sequential or large memory accesses (e.g. data that does not fit inside the cache)
• Alignment of the stack, code and data, done correctly, will improve cache performance
• Do not use self-modifying code or store data within code segments
• When writing SIMD code, interleave instructions in a Load then Store, Load then Store pattern (AMD - 2014 specifically says this @ jj)
@ JJ2007 - Re: SIMD interleave...........
Interestingly - AMD is quite clear about interleaving instructions, whereas Intel just says it's a good idea (re: suggested).
Just looking at Siekmanski's advice - let's see what happens when we align 64.... otherwise it looks like load, store patterns
do form some sort of predictable average performance set, though I might be looking at it wrong.
@Siekmanski - Re: ALIGN 64 - do you think this might have something to do with cache line size and bus bandwidth ? Or please clarify.
Yes - I see "it just works" - but what might be the mechanism by which it speeds things up? Intel/AMD say we should avoid
referencing instructions or data more than a 4K boundary stride away, essentially equating to non-temporal page loads and cache thrashing.
@ Everyone else: sooooo, is anybody going to give me advice on DLL module alignment ? - the original question :icon_eek:
Quote from: Raistlin on August 17, 2017, 03:37:37 PM
@ JJ2007 - Re: SIMD interleave...........
Interestingly - AMD is quite clear about interleaving instructions, whereas Intel just says it's a good idea (re: suggested).
If you have an AMD, post your timings, see attachment above. For the Intel Core i5:
at 10MB, load store load store makes no difference
between 0.5 and 10 MB it's clearly better
below 0.5 MB it's clearly worse
ALIGN 64 - Just to improve cache hit rates and to avoid penalties.
I asked myself what the gain would be if I put a 64-byte-aligned code loop in 1 cache line, so it doesn't need to be fetched over and over again.
How many cycles do you have to wait before the next line is fetched? ( differs per system of course )
For modern CPUs:
accessing L1, +/- 4 cycles
accessing L2, +/- 10 cycles
accessing L3, +/- 30 cycles
missing all cache levels, 100 cycles or more.
What about DLL module alignment...
I've never written a DLL myself, but can't you use self-modifying code to align code and data? ( it only needs to be done once at initialization )
I have a funny old fashioned approach to these things, put them on the clock, the rest does not matter.