Aligning memory for later instructions.

Started by hutch--, August 25, 2016, 10:28:20 PM

hutch--

Over time I have learnt that Microsoft have changed the default alignment of various memory allocation strategies, so for reliable operation with whatever strategy you choose, manually controlling the memory alignment is the only safe technique. As per Michael's suggestion, the CRT aligned memory functions are a viable technique that works for exactly the same reason: you directly control the alignment and make no assumptions about what the default may happen to be.

For SSE you need 128 byte alignment, AVX requires 256 byte alignment and AVX2 512 byte alignment.

jj2007

Quote from: hutch-- on September 19, 2016, 08:10:52 PM
Over time I have learnt that Microsoft have changed the default alignment of various memory allocation strategies

See screenshot below from the 1994 TechEd Conference. M$ may have had good intentions, but (test attached) GlobalAlloc is align 8 on XP and Win7-64 alike, exactly as for HeapAlloc 8)

Quote: For SSE you need 128 byte alignment

The great majority of SSE instructions is happy with align 16 or no alignment at all. Or did you mean 128 bits?

hutch--

> The great majority of SSE instructions is happy with align 16 or no alignment at all. Or did you mean 128 bits?

This is from the Intel manual.
The 128-bit (V)MOVNTDQA addresses must be 16-byte aligned or the instruction will cause a #GP.
The 256-bit VMOVNTDQA addresses must be 32-byte aligned or the instruction will cause a #GP.
The 512-bit VMOVNTDQA addresses must be 64-byte aligned or the instruction will cause a #GP.

This was a blunder, tired and too much work.
> For SSE you need 128 byte alignment, AVX requires 256 byte alignment and AVX2 512 byte alignment.

It should be,
For SSE you need 128 BIT (16-byte) alignment, AVX and AVX2 require 256 BIT (32-byte) alignment and AVX-512 requires 512 BIT (64-byte) alignment.
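
The bit-versus-byte mix-up is easy to check in code. The following is a minimal C sketch, not taken from the thread: the helper names align_bytes_for_bits and is_aligned are hypothetical. It converts a register width in bits to the byte alignment quoted from the Intel manual above and tests whether an address meets it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: register width in bits -> alignment in bytes. */
static size_t align_bytes_for_bits(size_t bits)
{
    assert(bits % 8 == 0);
    return bits / 8;            /* 128 -> 16, 256 -> 32, 512 -> 64 */
}

/* Hypothetical helper: nonzero if p is a multiple of align.
   align must be a power of 2, so the low bits act as the remainder. */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

A pointer that passes is_aligned(p, align_bytes_for_bits(128)) is safe for the 128-bit aligned moves. The 256-bit and 512-bit forms need the same check with 32 and 64 bytes.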



MichaelW

At least under Windows 7-64 and Windows 10-64, for the aligned malloc functions a 16-byte alignment is the minimum actual alignment. There are also the _aligned_offset_malloc functions that allow you to specify the alignment of a specific offset in the allocated memory. IIRC they were not supported under Windows XP, but are under Windows 7-64.
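
To illustrate what "alignment of a specific offset" means: as far as I understand the CRT documentation, _aligned_offset_malloc(size, alignment, offset) returns a pointer p for which p + offset is aligned, which is handy when a small header precedes the SSE data. The C sketch below shows only the address arithmetic behind that idea. The function name place_with_offset is hypothetical and this is not the CRT's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: smallest address p >= base such that
   (p + offset) is a multiple of align. align must be a power of 2. */
static uintptr_t place_with_offset(uintptr_t base, size_t align, size_t offset)
{
    uintptr_t target;

    assert(align != 0 && (align & (align - 1)) == 0);

    /* Round the address of the field at 'offset' up to the boundary... */
    target = (base + offset + align - 1) & ~(uintptr_t)(align - 1);

    /* ...then step back so the block itself starts 'offset' bytes earlier. */
    return target - offset;
}
```

With an 8-byte header and 16-byte alignment, for example, the block starts 8 bytes before a 16-byte boundary, so the data following the header lands exactly on it.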
Well Microsoft, here's another nice mess you've gotten us into.

jj2007

Quote from: hutch-- on September 20, 2016, 01:05:16 AM
tired and too much work

Slow down, man. You are the Masm32 BDFL anyway, even if you don't finish the 64-bit version by tomorrow ;-)

Still 32-bit, almost plain HeapAlloc under the hood:

include \masm32\MasmBasic\MasmBasic.inc      ; Version 20 September 2016
  Init
  Dim PtrSSE() As DWORD
  For_ ct=0 To A16Max-1      ; 100 aligned pointers
      Alloc16 Rand(10000)
      movaps [eax], xmm0      ; the proof ;-)
      mov PtrSSE(ct), eax
      Print Hex$(al), " "
  Next
  For_ ct=0 To A16Max-1
      Free16 PtrSSE(ct)
  Next
  Inkey "OK?"
EndOfCode


Output:
50 20 20 A0 80 40 70 30 10 20 70 90 F0 A0 30 20 00 20 50 50 B0 C0 50 50 40
80 F0 70 D0 B0 40 E0 A0 C0 30 70 10 F0 70 E0 80 20 C0 60 A0 E0 10 00 70 10
D0 B0 00 90 20 B0 90 70 00 90 30 90 B0 30 00 60 C0 C0 10 10 B0 50 F0 60 C0
F0 B0 E0 10 90 C0 D0 F0 60 00 30 F0 A0 C0 A0 10 A0 90 30 80 A0 F0 E0 10 B0 OK?

hutch--

I don't claim to understand your notation, but if I have it right, why not make a version where you can set the alignment to any power of 2 you like, so you can also handle AVX and AVX2?
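
A version with a selectable power-of-2 alignment could look roughly like the C sketch below. It is not MasmBasic's Alloc16 and not the CRT's _aligned_malloc. The names alloc_aligned and free_aligned are hypothetical, and the sketch uses the common trick of over-allocating from malloc and storing the original pointer just below the aligned block.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate 'size' bytes on an 'align' boundary.
   'align' must be a power of 2 (16 for SSE, 32 for AVX, 64 for 512-bit). */
static void *alloc_aligned(size_t size, size_t align)
{
    unsigned char *raw;
    uintptr_t addr;

    assert(align != 0 && (align & (align - 1)) == 0);

    /* Keep the saved-pointer slot itself properly aligned. */
    if (align < sizeof(void *))
        align = sizeof(void *);

    /* Refuse sizes that would overflow the padded request. */
    if (size > SIZE_MAX - align - sizeof(void *))
        return NULL;

    /* Payload + worst-case padding + room for the saved raw pointer. */
    raw = (unsigned char *)malloc(size + align - 1 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    /* Skip the pointer slot, then round up to the requested boundary. */
    addr = (uintptr_t)(raw + sizeof(void *));
    addr = (addr + align - 1) & ~(uintptr_t)(align - 1);

    /* Remember what malloc returned, directly below the aligned block. */
    ((void **)addr)[-1] = raw;
    return (void *)addr;
}

/* Release a block obtained from alloc_aligned (NULL is allowed). */
static void free_aligned(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

Calling alloc_aligned(n, 32) gives a block usable with the 32-byte aligned AVX moves. The cost is up to align - 1 + sizeof(void *) wasted bytes per allocation, which matters little for large blocks.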

jj2007

Quote from: nidud on September 22, 2016, 11:41:19 PM
Using the stack is way faster than using HeapAlloc.

That's correct, and StackBuffer() proves it, but a HeapAlloc-based macro as shown above is normally fast enough, and not limited to the procedure where it was called.

hutch--

I generally choose dynamic memory allocation when I need large single memory blocks, which I then chop up into pieces of whatever size I need. I have seen code where massive counts of small allocations occur, but it's lousy code design and often very slow. Stack is easy and fast, but I only use it for relatively small amounts, a few K here and there. You can alter the linker options for stack reserve/stack commit if you want a lot more stack space.
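
The "one big block, chopped up" approach can be sketched in C as a simple bump allocator. This is an illustration under my own assumptions, not code from the thread: the Arena type and the arena_init, arena_take and arena_free names are hypothetical, and individual pieces are never freed on their own, only the whole block at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical arena: one large allocation handed out piece by piece. */
typedef struct {
    unsigned char *base;   /* start of the single big block */
    size_t size;           /* total bytes in the block */
    size_t used;           /* bytes handed out so far, padding included */
} Arena;

/* Allocate the big block once; returns nonzero on success. */
static int arena_init(Arena *a, size_t size)
{
    a->base = (unsigned char *)malloc(size);
    a->size = (a->base != NULL) ? size : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Hand out the next piece, padded so it starts on an 'align' boundary.
   'align' must be a power of 2. Returns NULL when the block is exhausted. */
static void *arena_take(Arena *a, size_t bytes, size_t align)
{
    uintptr_t start, aligned;
    size_t pad;

    assert(align != 0 && (align & (align - 1)) == 0);

    start = (uintptr_t)(a->base + a->used);
    aligned = (start + align - 1) & ~(uintptr_t)(align - 1);
    pad = (size_t)(aligned - start);

    if (pad > a->size - a->used || bytes > a->size - a->used - pad)
        return NULL;

    a->used += pad + bytes;
    return (void *)aligned;
}

/* Release the whole block in one call. */
static void arena_free(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

Each arena_take is a few additions and a mask, so a thousand small aligned buffers cost one malloc and one free instead of a thousand of each.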