Aligning memory for later instructions.

hutch-- · September 19, 2016, 08:10:52 PM

Over time I have learnt that Microsoft have changed the default alignment of various memory allocation strategies, so for reliable operation with whatever strategy you choose, manually controlling the memory alignment is the only safe technique. As per Michael's suggestion, the CRT aligned memory functions are a viable technique that works OK for exactly the same reason: you directly control the alignment and make no assumptions about what the default may happen to be.

For SSE you need 128 byte alignment, AVX requires 256 byte alignment and AVX2 512 byte alignment.

jj2007 · September 19, 2016, 11:05:44 PM

Quote from: hutch-- on September 19, 2016, 08:10:52 PM
Over time I have learnt that Microsoft have changed the default alignment of various memory allocation strategies

See screenshot below from the 1994 TechEd Conference. M$ may have had good intentions, but (test attached) GlobalAlloc is align 8 on XP and Win7-64 alike, exactly as for HeapAlloc 8)
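The attached test is not reproduced here. As a rough, portable sketch of the same kind of check, the C below measures the alignment that heap pointers actually have instead of trusting a documented default. It uses plain `malloc` as a stand-in for GlobalAlloc/HeapAlloc, and the helper names `observed_alignment` and `min_malloc_alignment` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Largest power of two that divides the address, i.e. the alignment the
   pointer really has. Returns 0 for a null pointer. */
static size_t observed_alignment(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    return (size_t)(addr & (~addr + 1u));   /* isolate the lowest set bit */
}

/* Allocate `count` blocks of `size` bytes with malloc, keep them all alive
   at once, and return the smallest alignment seen (0 if malloc failed). */
static size_t min_malloc_alignment(size_t count, size_t size)
{
    assert(count > 0 && size > 0);

    void **blocks = malloc(count * sizeof *blocks);
    if (blocks == NULL)
        return 0;

    size_t lowest = SIZE_MAX;
    size_t made = 0;
    for (; made < count; made++) {
        blocks[made] = malloc(size);
        if (blocks[made] == NULL) {
            lowest = 0;                     /* report failure */
            break;
        }
        size_t a = observed_alignment(blocks[made]);
        if (a < lowest)
            lowest = a;
    }

    for (size_t i = 0; i < made; i++)
        free(blocks[i]);
    free(blocks);
    return lowest;
}
```

A result of 8 from `min_malloc_alignment(100, 24)` would correspond to the align-8 behaviour reported above. Other allocators and platforms may well return more, which is the reason to measure rather than assume.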

Quote
For SSE you need 128 byte alignment

The great majority of SSE instructions is happy with align 16 or no alignment at all. Or did you mean 128 bits?

nidud · September 19, 2016, 11:26:06 PM

deleted

nidud · September 20, 2016, 12:50:18 AM

deleted

hutch-- · September 20, 2016, 01:05:16 AM

> The great majority of SSE instructions is happy with align 16 or no alignment at all. Or did you mean 128 bits?

This is the Intel manual.
The 128-bit (V)MOVNTDQA addresses must be 16-byte aligned or the instruction will cause a #GP.
The 256-bit VMOVNTDQA addresses must be 32-byte aligned or the instruction will cause a #GP.
The 512-bit VMOVNTDQA addresses must be 64-byte aligned or the instruction will cause a #GP.

This was a blunder, tired and too much work.
> For SSE you need 128 byte alignment, AVX requires 256 byte alignment and AVX2 512 byte alignment.

It should be,
For SSE you need 128 BIT alignment, AVX requires 256 BIT alignment and AVX2 512 BIT alignment.
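To make the bits-versus-bytes distinction concrete, here is a small C sketch that converts the vector widths from the Intel text quoted above (128, 256 and 512 bits) into the byte alignment an address needs, and checks a pointer against it. The function names are illustrative only, and the static buffer relies on the C11 `_Alignas` keyword.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte alignment needed for an aligned access to a vector that is `bits`
   wide: 128 -> 16, 256 -> 32, 512 -> 64. */
static size_t vector_alignment_bytes(unsigned bits)
{
    assert(bits % 8 == 0);
    return bits / 8;
}

/* Non-zero if `p` is a multiple of `alignment` (a power of two). */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* A static buffer aligned for the widest case above (512 bits = 64 bytes). */
static _Alignas(64) unsigned char wide_buffer[64];
```

For example, `is_aligned(wide_buffer, vector_alignment_bytes(512))` holds, while an address such as 0x1008 fails the 16-byte check that a 128-bit aligned access requires.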

MichaelW · September 20, 2016, 03:07:56 AM

At least under Windows 7-64 and Windows 10-64, for the aligned malloc functions a 16-byte alignment is the minimum actual alignment. There are also the _aligned_offset_malloc functions that allow you to specify the alignment of a specific offset in the allocated memory. IIRC they were not supported under Windows XP, but are under Windows 7-64.

jj2007 · September 20, 2016, 09:18:05 AM

Quote from: hutch-- on September 20, 2016, 01:05:16 AM
tired and too much work

Slow down, man. You are the Masm32 BDFL anyway, even if you don't finish the 64-bit version by tomorrow ;-)

Still 32-bit, almost plain HeapAlloc under the hood:

include \masm32\MasmBasic\MasmBasic.inc ; Version 20 September 2016
  Init
  Dim PtrSSE() As DWORD
  For_ ct=0 To A16Max-1 ; 100 aligned pointers
    Alloc16 Rand(10000)
    movaps [eax], xmm0 ; the proof ;-)
    mov PtrSSE(ct), eax
    Print Hex$(al), " "
  Next
  For_ ct=0 To A16Max-1
    Free16 PtrSSE(ct)
  Next
  Inkey "OK?"
EndOfCode

Output:

50 20 20 A0 80 40 70 30 10 20 70 90 F0 A0 30 20 00 20 50 50 B0 C0 50 50 40
80 F0 70 D0 B0 40 E0 A0 C0 30 70 10 F0 70 E0 80 20 C0 60 A0 E0 10 00 70 10
D0 B0 00 90 20 B0 90 70 00 90 30 90 B0 30 00 60 C0 C0 10 10 B0 50 F0 60 C0
F0 B0 E0 10 90 C0 D0 F0 60 00 30 F0 A0 C0 A0 10 A0 90 30 80 A0 F0 E0 10 B0 OK?

hutch-- · September 20, 2016, 09:41:24 PM

I don't claim to understand your notation, but if I have it right, why not make a version where you can set the alignment to any power of 2 size you like so you can also handle AVX and AVX2?
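One common way to get an "any power of 2" version is to over-allocate and round the pointer up. The C sketch below illustrates that general technique on top of `malloc`. It is not the MasmBasic `Alloc16` implementation, and `alloc_aligned` / `free_aligned` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return `size` bytes whose address is a multiple of
   `alignment` (any power of two), or NULL on failure. */
static void *alloc_aligned(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* Room for worst-case padding plus a slot holding malloc's pointer. */
    size_t extra = alignment - 1 + sizeof(void *);
    if (size > SIZE_MAX - extra)
        return NULL;

    unsigned char *raw = malloc(size + extra);
    if (raw == NULL)
        return NULL;

    /* Skip the pointer slot, then round up to the next aligned address. */
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t target = (start + alignment - 1) & ~(uintptr_t)(alignment - 1);
    unsigned char *aligned = raw + (target - (uintptr_t)raw);

    /* Save the original pointer just below the aligned block. memcpy is
       used because that slot may itself be misaligned for small alignments. */
    memcpy(aligned - sizeof raw, &raw, sizeof raw);
    return aligned;
}

/* Release a block obtained from alloc_aligned (NULL is allowed). */
static void free_aligned(void *p)
{
    if (p == NULL)
        return;
    void *raw;
    memcpy(&raw, (unsigned char *)p - sizeof raw, sizeof raw);
    free(raw);
}
```

With this shape, `alloc_aligned(n, 16)`, `alloc_aligned(n, 32)` and `alloc_aligned(n, 64)` cover the 128-, 256- and 512-bit cases. The cost is up to `alignment - 1 + sizeof(void *)` wasted bytes per block.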

nidud · September 22, 2016, 11:41:19 PM

deleted

jj2007 · September 23, 2016, 09:32:53 AM

Quote from: nidud on September 22, 2016, 11:41:19 PM
Using the stack is way faster than using HeapAlloc.

That's correct, and StackBuffer() proves it, but a HeapAlloc-based macro as shown above is normally fast enough, and not limited to the procedure where it was called.

hutch-- · September 23, 2016, 10:36:26 AM

I generally choose dynamic memory allocation when I need large single memory blocks, which I generally chop up into the size bits I need. I have seen code where massive counts of small allocations occur, but it's lousy code design and often very slow. Stack is easy and fast, but I only use it for relatively small amounts, a few K here and there. You can alter the linker options for stack reserve/stack commit if you want a lot more stack space.
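The "one large block, chopped up" approach can be sketched in C as a simple bump allocator that hands out aligned pieces of a single allocation. This is only an illustration of the idea, not code from the thread, and the `Arena` type and its functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    unsigned char *base;   /* the one large allocation */
    size_t size;           /* total bytes in the block */
    size_t used;           /* bytes handed out so far, padding included */
} Arena;

/* Grab the large block once. Returns non-zero on success. */
static int arena_init(Arena *a, size_t size)
{
    a->base = malloc(size);
    a->size = a->base ? size : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Carve `size` bytes aligned to `alignment` (a power of two) out of the
   block. Returns NULL when the block is exhausted. */
static void *arena_take(Arena *a, size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (a->base == NULL)
        return NULL;

    /* Padding needed to reach the next aligned address. */
    uintptr_t cur = (uintptr_t)a->base + a->used;
    size_t pad = (alignment - (size_t)(cur & (alignment - 1))) & (alignment - 1);

    size_t left = a->size - a->used;
    if (pad > left || size > left - pad)
        return NULL;

    void *p = a->base + a->used + pad;
    a->used += pad + size;
    return p;
}

/* Free everything in one call; individual pieces are never freed. */
static void arena_release(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

The pieces cost one pointer adjustment each, so many small requests avoid a heap call per request. The trade-off is that pieces cannot be returned individually.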

The MASM Forum
