The MASM Forum

Microsoft 64 bit MASM => MASM64 SDK => Topic started by: hutch-- on August 25, 2016, 10:28:20 PM

Title: Aligning memory for later instructions.
Post by: hutch-- on August 25, 2016, 10:28:20 PM
Simple enough to do and necessary for XMM and YMM operations.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    LOCAL pMem :QWORD           ; allocated memory pointer
    LOCAL aMem :QWORD           ; aligned memory pointer

    padd equ <512>              ; extra bytes (must be at least size of required alignment)
    bcnt equ <1024*1024*4>      ; 4 meg

    mov pMem, alloc(bcnt+padd)  ; allocate the memory plus padding
    memalign rax, 256           ; align the memory up to the next 256 byte boundary
    mov aMem, rax               ; store result in aligned memory pointer

  ; do what you need with the 256 byte aligned memory (YMM addresses, register etc ....)

    mfree pMem                  ; free the original allocated address

    waitkey

    invoke ExitProcess,0

    ret

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end
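
For readers who want to see the arithmetic outside of the macros, here is a minimal C sketch of the same idea: over-allocate by the alignment, round the address up, and keep the original pointer for freeing. The names `aligned_block`, `align_up`, `alloc_aligned` and `free_aligned` are hypothetical and are not part of the MASM64 SDK; plain `malloc` stands in for whatever allocator `alloc` uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper type: mirrors the pMem / aMem pair in the listing above. */
typedef struct {
    void *raw;      /* pointer returned by the allocator, the one that must be freed */
    void *aligned;  /* first address in the block that is a multiple of the alignment */
} aligned_block;

/* Round addr up to the next multiple of alignment (which must be a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Allocate size bytes plus padding, then compute the aligned pointer. */
static aligned_block alloc_aligned(size_t size, size_t alignment)
{
    aligned_block b = { NULL, NULL };
    b.raw = malloc(size + alignment);          /* padding >= alignment, as in the post */
    if (b.raw != NULL)
        b.aligned = (void *)align_up((uintptr_t)b.raw, alignment);
    return b;
}

/* Always free the original pointer, never the aligned one. */
static void free_aligned(aligned_block *b)
{
    free(b->raw);
    b->raw = NULL;
    b->aligned = NULL;
}
```

Calling `alloc_aligned(4 * 1024 * 1024, 256)` would correspond to the 4 meg, 256 byte aligned block in the listing.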
Title: Re: Aligning memory for later instructions.
Post by: K_F on September 09, 2016, 06:32:41 AM
Doing something similar with dynamic pointer allocations... but using bit/and comparisons ?
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 09, 2016, 06:51:43 AM
For the amount of memory usually needed for working efficiently with YMM regs, wouldn't VirtualAlloc be a good choice?
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 10, 2016, 04:29:35 PM
 :biggrin:

Once memory is allocated, it's all the same, whatever floats your boat.  :P
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 11, 2016, 10:20:39 PM
Yes, but with VirtualAlloc you get the alignment "for free" ;-)
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 15, 2016, 03:16:03 PM
Another trick while tweaking the main include file, mis-align at least some structures and it crashes when the procedure that tries to use it is called. 8 byte alignment is necessary with most and probably all structures used in 64 bit.
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 15, 2016, 08:27:03 PM
Quote from: hutch-- on September 15, 2016, 03:16:03 PM
Another trick while tweaking the main include file, mis-align at least some structures and it crashes when the procedure that tries to use it is called. 8 byte alignment is necessary with most and probably all structures used in 64 bit.

Indeed. The Zp switch is the way to go...

Quote from: jj2007 on September 13, 2016, 10:28:02 PM
And that one works fine if you build it with Zp4 for 32-bit and Zp8 for 64-bit code, but it crashes for X64 and Zp4.

Which means that the default structure alignment of the Windows API is DWORD in 32-bit code and QWORD in 64-bit code. Both on a 64-bit processor, of course (see Reply #30); Redmond should take their documentation a bit more seriously 8)
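
The effect of the packing switch can be seen from C as well. This is only a sketch: `#pragma pack` is used here as the C-side analogue of Zp4 and Zp8, the structure and its field names are made up for illustration, and the Zp8 figures assume a typical 64-bit target where a 64-bit integer has 8 byte natural alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same members, packed to at most 4 bytes (comparable to building with Zp4). */
#pragma pack(push, 4)
struct sample_zp4 {
    uint32_t flags;   /* hypothetical DWORD member */
    uint64_t handle;  /* hypothetical QWORD member, lands at offset 4 */
};
#pragma pack(pop)

/* Same members, packed to at most 8 bytes (comparable to building with Zp8). */
#pragma pack(push, 8)
struct sample_zp8 {
    uint32_t flags;   /* followed by 4 bytes of padding */
    uint64_t handle;  /* lands at offset 8 on a typical 64-bit target */
};
#pragma pack(pop)

/* Report where the QWORD member ends up under each packing. */
static size_t handle_offset_zp4(void) { return offsetof(struct sample_zp4, handle); }
static size_t handle_offset_zp8(void) { return offsetof(struct sample_zp8, handle); }
```

If the API expects the second layout and the caller builds the first, every member after `flags` is read from the wrong offset, which is consistent with the crash described for X64 with Zp4.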
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 15, 2016, 08:36:54 PM
 :biggrin:

This is only a problem if you are trying to multiport similar assemblers. ML64 is NOT MASM compatible, it is only ML64 compatible, its error messages while organising include files are unintelligible and buggy and at times it crashes with stack dumps. It does not have the tolerance that the old ML had and is a genuine joy to make working include files for structures and equates.

Now if you have a look at the guts of Japheth's h2incX output you will have genuine nightmares at this tangled mess of typedefs, prototypes, bugs, equates, the odd "tag" attached to the front of structures and with a possible delivery date at about the year 3000.
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 15, 2016, 09:08:45 PM
Quote from: hutch-- on September 15, 2016, 08:36:54 PM
if you have a look at the guts of Japheth's h2incX output you will have genuine nightmares at this tangled mess of typedefs, prototypes, bugs, equates, ...

The C++ faction will insist that the 100+ types are necessary. IMHO the only real change from Windows.inc+WinExtra.inc is the distinction between "data" DWORDs (they can stay "as is") and "pointers" that are DWORDs in 32-bit code, and QWORDs in 64-bit code. In \Masm32\MasmBasic\Res\DualWin.inc the latter is called SIZE_P, and its size depends on whether you build 32- or 64-bit code, obviously. Otherwise, there are only minor changes compared to Windows.inc+WinExtra.inc - a few structure members choked with ML64.

So, give DualWin.inc a try - no tangled mess of typedefs, prototypes, bugs, equates, just the old Windows.inc format.

Besides, it runs with all assemblers. So if you are fond of unintelligible and buggy error messages and stack dumps, use ML64, if instead you like the old .if eax>99 etc syntax, you can use the same include file with HJWasm.
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 16, 2016, 08:50:56 AM
 :biggrin:

>  So if you are fond of unintelligible and buggy error messages and stack dumps, use ML64

I don't have the problem, I am not trying to use Japheth's includes. With ML64 I am free of "Open Sauce" licencing and the army of parasites that come with it.  :badgrin:
Title: Re: Aligning memory for later instructions.
Post by: mineiro on September 17, 2016, 09:29:14 AM
Quote from: hutch-- on September 16, 2016, 08:50:56 AM
With ML64 I am free of "Open Sauce" licencing and the army of parasites that come with it.  :badgrin:
C'mon sir hutch, tell us the truth, you probably have a lot of "open sauce" software on your computer, from PDF readers to disassemblers, just look at the libraries.

But I take your point of view, I don't like open source either, for one reason: it has lawyers inside. This is why I prefer public domain.
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 17, 2016, 12:33:59 PM
Quote from: hutch-- on September 16, 2016, 08:50:56 AM
:biggrin:

>  So if you are fond of unintelligible and buggy error messages and stack dumps, use ML64

I don't have the problem, I am not trying to use Japheth's includes. With ML64 I am free of "Open Sauce" licencing and the army of parasites that come with it.  :badgrin:

Me neither. I am trying to use the standard Masm32 includes. Unfortunately, they make Microsoft ML64 crash with exceptions.
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 17, 2016, 12:54:48 PM
That is because they were written for 32 bit ML.EXE, they are not compatible with ML64. I would not be doing the work if it was. Shortly I will have another tool that isolates the prototypes in the Microsoft vc2015 header files and that will be a method of creating prototypes for 64 bit assembler. I don't need it for ML64 but it will be useful for the assemblers that need prototypes. I doubt you can get an auto converter to work on the C .H files, they are too much of a tangled mess and too much useless noise but you can get equates, structures and prototypes which will ease the production of assembler include files.

I have attached a zip file with a C header file cleaner in it. A lot more needs to be done with it but it converts C hex to asm hex, removes the comments and a lot of the junk. Just drop a C header file onto it and it will produce a text file "cleaned.txt" that is at least readable.
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 17, 2016, 06:07:30 PM
Quote from: hutch-- on September 17, 2016, 12:54:48 PM
That is because they were written for 32 bit ML.EXE, they are not compatible with ML64. I would not be doing the work if it was. Shortly I will have another tool that isolates the prototypes in the Microsoft vc2015 header files and that will be a method of creating prototypes for 64 bit assembler. I don't need it for ML64

You may compare your results to the attached \Masm32\MasmBasic\Res\pt.inc, which gets generated if somebody attempts to run the MasmBasic 64-bit examples.

Quote from: jj2007 on July 21, 2016, 04:58:40 PM
Hutch,

The error checking of invoke is relevant for this type of case:

invoke CreateFile, esi, GENERIC_READ, FILE_SHARE_READ,
NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0, 0


And for nothing else. I guess we can safely assume that everybody can write their own proc with named parameters like hWnd and the like. If then somebody is thick enough to add a fifth parameter to the PROTO list, he should really go for Scratch or Logo 8)

Btw Microsoft's CrippleWare Assembler™ can be convinced to count the paras. I am currently a bit stuck with the PROLOGUE bug (http://masm32.com/board/index.php?topic=5528.0), but with ML64 jinvoke works like a charm :icon_mrgreen:
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 18, 2016, 09:47:08 AM
This looks like an interesting technique.

j@BitBlt equ 2069/41:s111111111

Just a suggestion, instead of just supplying the argument count, use a 1 character notation for the data size.

q = QWORD
d = DWORD
w = WORD
b = BYTE
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 18, 2016, 10:20:57 AM
See e.g. j@VariantTimeToSystemTime equ 8777/89:s39
oleaut32.inc: VariantTimeToSystemTime PROTO :REAL8,:PTR SYSTEMTIME

Most args are DWORD in 32-bit code (even those that should be REAL4, see GdiPlus.inc). DWORDs can become SIZE_P for dual 64/32-bit assembly - the stack is organised in DWORD resp. QWORD slots, so even if (in 64-bit code) the arg should be "only" DWORD according to the C header file, it will do no harm to declare it a QWORD.

The conversion code starts in line 43ff of \Masm32\MasmBasic\Res\GetPT.asm

Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 18, 2016, 12:15:31 PM
Interestingly enough, this works OK.

    PPROTO TYPEDEF PTR PROC

    MessageBox MACRO args:VARARG
      externdef __imp_MessageBoxA:PPROTO
      IF argcount(args) NE 4
        echo *********************************************
        echo MessageBox MACRO arg count error, 4 expected
        echo *********************************************
        .err
      ENDIF
      invoke __imp_MessageBoxA,args
    ENDM

It's just that I could not be bothered, I have enjoyed working without prototypes.
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 18, 2016, 07:23:09 PM
Erol and Paul had an alternative approach (http://www.masmforum.com/board/index.php?topic=8863.msg64295#msg64295), as you may remember:

EXTERNDEF MessageBox@16:PROC
MessageBox EQU <invoke pr4 PTR MessageBox@16>


My version would be
jinvoke MessageBox, 0, Str$(rax), Chr$("Title"), MB_OK

based on
jd@130 equ user32
...
j@MessageBoxA equ 12837/130:s1111


where 12837 is a global counter, 130 is the ID of the DLL, and s1111 means stdcall, 4*DWORD (or QWORD in 64-bit)
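
To make the descriptor format concrete, here is a small C sketch that splits such a string into its parts. It is not taken from GetPT.asm; the names `proto_info` and `parse_proto` are hypothetical, and it assumes one character per argument after the calling-convention letter, which matches `s1111` for the four MessageBoxA arguments but is otherwise an inference from the examples in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one descriptor such as "12837/130:s1111". */
typedef struct {
    unsigned counter;     /* global counter */
    unsigned dll_id;      /* ID of the DLL */
    char     convention;  /* 's' means stdcall */
    size_t   arg_count;   /* number of size codes after the convention letter */
} proto_info;

/* Returns 1 on success, 0 if the text does not match counter/dll:convention[codes]. */
static int parse_proto(const char *text, proto_info *out)
{
    char codes[64] = "";
    int fields = sscanf(text, "%u/%u:%c%63[0-9]",
                        &out->counter, &out->dll_id, &out->convention, codes);
    if (fields < 3)
        return 0;                 /* counter, DLL ID and convention are mandatory */
    out->arg_count = strlen(codes); /* assumption: one code per argument */
    return 1;
}
```

With this reading, `2069/41:s111111111` yields nine arguments for BitBlt and `8777/89:s39` yields two for VariantTimeToSystemTime.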
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 18, 2016, 07:57:48 PM
This will build but will not start and does not recognise an extra argument. It is old left over code that worked on 32 bit ML using a macro I designed a long time ago that is still in the 32 bit windows.inc file.

EXTERNDEF MessageBox@16:PROC
MessageBox EQU <invoke pr4 PTR MessageBox@16>

This method of prototyping does not work in ML64.
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 18, 2016, 08:27:38 PM
In the C++ header file this is the prototype.

WINOLEAUTAPI_(INT) VariantTimeToSystemTime(__in DOUBLE vtime, __out LPSYSTEMTIME lpSystemTime);
vtime is a DOUBLE
lpSystemTime is a QWORD pointer to a structure

Both are 64 bit values.
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 18, 2016, 09:08:38 PM
Quote from: hutch-- on September 18, 2016, 08:27:38 PM
In the C++ header file this is the prototype.

WINOLEAUTAPI_(INT) VariantTimeToSystemTime(__in DOUBLE vtime, __out LPSYSTEMTIME lpSystemTime);
vtime is a DOUBLE
lpSystemTime is a QWORD pointer to a structure

Both are 64 bit values.

Indeed. Little test:

include \Masm32\MasmBasic\Res\JBasic.inc            ; OPT_64 1      ; put 0 for 32 bit, 1 for 64 bit assembly
.data?
MyR8      REAL8 ?
MyST      SYSTEMTIME <>

Init
  PrintLine Chr$("This code was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format")
  jinvoke SetLastError, 0
  jinvoke GetLocalTime, addr MyST
  jinvoke SystemTimeToVariantTime, addr MyST, addr MyR8
  jinvoke VariantTimeToSystemTime, MyR8, addr MyST
  deb 4, "Result", eax, MyR8, MyST.wDay, MyST.wMonth, MyST.wYear
  Inkey Err$()
EndOfCode


Output:
This code was assembled with ml64 in 64-bit format
Result
eax     1
MyR8    42631.545960648
MyST.wDay       18
MyST.wMonth     9
MyST.wYear      2016

Operazione completata.
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 18, 2016, 10:13:50 PM
Beware of being trapped in the past, long mode /LARGEADDRESSAWARE is the future and the native data size for Win64 is 64 bit.
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 18, 2016, 11:02:15 PM
Quote from: hutch-- on September 18, 2016, 10:13:50 PM
Beware of being trapped in the past, long mode /LARGEADDRESSAWARE is the future and the native data size for Win64 is 64 bit.

Can you please elaborate the relevance of your statement for the VariantTimeToSystemTime example? Or for any other 64-bit example I ever posted here?
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 18, 2016, 11:48:17 PM
Quote
Most args are DWORD in 32-bit code (even those that should be REAL4, see GdiPlus.inc). DWORDs can become SIZE_P for dual 64/32-bit assembly - the stack is organised in DWORD resp. QWORD slots, so even if (in 64-bit code) the arg should be "only" DWORD according to the C header file, it will do no harm to declare it a QWORD.
According to the C++ header file, both values are 64 bit. If you want to write reliable code you will get the data sizes right. You may get away with 2 x 32 bit values but you are at risk with the future. The other test is if your code builds successfully with the linker option /LARGEADDRESSAWARE. If not you are tied to the past using 32 bit values.
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 19, 2016, 12:50:12 AM
All my 64-bit code builds fine with /large.

int 3
jinvoke VariantTimeToSystemTime, MyR8, addr MyST
nop


translates to:
CC                         | int3                                          |
48 8D 15 8C 21 00 00       | lea rdx, qword ptr ds:[140003268]             | pointer to SYSTEMTIME
48 8B 0D 7D 21 00 00       | mov rcx, qword ptr ds:[140003260]             | 40E4D0F6612F684C, double 42631.701273148
FF 15 9F 24 00 00          | call qword ptr ds:[<&VariantTimeToSystemTime> |
90                         | nop                                           |


Of course, with OPT_64 0,  i.e. 32-bit assembly, there will be two DWORD pushes for the REAL8:
CC                  int3
68 107BD400         push offset MyST
FF35 0C7BD400       push dword ptr [0D47B0C]
FF35 087BD400       push dword ptr [MyR8]
FF15 887CD400       call near [0D47C88]
90                  nop
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 19, 2016, 08:07:00 AM
The 64 bit code looks simple enough as it's only 3 register args but there is no gain in using the wrong data size in the register when the spec is 64 bit. I don't know what the value is in the ml64 subforum for the 32 bit code, it's been around since the middle 1990s.
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 19, 2016, 08:30:13 AM
Where am I "using the wrong data size in the register when the spec is 64 bit"??? ::)
Title: Re: Aligning memory for later instructions.
Post by: MichaelW on September 19, 2016, 10:30:58 AM
Why not use the aligned malloc (https://msdn.microsoft.com/en-us/library/8z34s9c6.aspx) family of functions? The attachment contains a demo done in C and GAS assembly.
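
This is not the attached demo, just a minimal C sketch of what calling that family looks like. On Windows it uses `_aligned_malloc` and `_aligned_free` from `<malloc.h>`; elsewhere it falls back to C11 `aligned_alloc` so the sketch can be compiled on other systems. The wrapper names `simd_alloc` and `simd_free` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#ifdef _WIN32
#include <malloc.h>   /* _aligned_malloc, _aligned_free */
#endif

/* Allocate size bytes aligned to alignment (a power of two). */
static void *simd_alloc(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
#ifdef _WIN32
    return _aligned_malloc(size, alignment);
#else
    /* C11 aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    return aligned_alloc(alignment, rounded);
#endif
}

/* Memory from _aligned_malloc must go back through _aligned_free, not free. */
static void simd_free(void *p)
{
#ifdef _WIN32
    _aligned_free(p);
#else
    free(p);
#endif
}
```

The point of the wrapper pair is that the caller states the alignment directly and never has to keep a second, unaligned pointer around.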
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 19, 2016, 10:45:41 AM
Michael,

You will probably like this.

  ; --------------------------------------------------------
  ; alignment must be an immediate operand and a power of 2
  ; when no longer required the original address must be
  ; freed with either GlobalFree() or the macro "mfree".
  ; --------------------------------------------------------
    aalloc MACRO pmem:REQ,bcnt:REQ,alignment:REQ
      mov rdx, bcnt
      add rdx, alignment
      mov rcx, GMEM_FIXED or GMEM_ZEROINIT
      call GlobalAlloc
      mov pmem, rax
      add rax, alignment - 1
      and rax, -alignment
      EXITM <rax>
    ENDM

Your suggestion is a good idea though, I have seen the function but have not had the time to try it out yet.

I have a longer version as well that will take memory operands for the alignment.
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 19, 2016, 07:48:34 PM
Since HeapAlloc returns align 8, and align 16 is what you need for SIMD, a thin wrapper around HeapAlloc and HeapFree is another option. 8 bytes extra per call, obviously.
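
One common way to build such a thin wrapper is to stash the allocator's original pointer just below the aligned address, so that the free routine can recover it. The C sketch below shows that variant under stated assumptions: `malloc` and `free` stand in for HeapAlloc and HeapFree, the names `stash_alloc16` and `stash_free16` are hypothetical, and it spends more than the 8 extra bytes mentioned above because it makes no assumption about the allocator's own alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate size bytes on a 16 byte boundary; the original pointer is
   stored in the pointer-sized slot directly below the returned address. */
static void *stash_alloc16(size_t size)
{
    /* room for the saved pointer plus the worst-case shift of 15 bytes */
    unsigned char *raw = malloc(size + sizeof(void *) + 16);
    if (raw == NULL)
        return NULL;
    uintptr_t addr = ((uintptr_t)raw + sizeof(void *) + 15) & ~(uintptr_t)15;
    ((void **)addr)[-1] = raw;   /* remember what the allocator gave us */
    return (void *)addr;
}

/* Free a block obtained from stash_alloc16 by reading back the saved pointer. */
static void stash_free16(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

The caller only ever sees the aligned pointer, which is what makes the wrapper usable as a drop-in pair.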
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 19, 2016, 08:10:52 PM
Over time I have learnt that Microsoft have changed the default alignment of various memory allocation strategies so for reliable operation with whatever strategy you choose, manually controlling the memory alignment is the only safe technique. As per Michael's suggestion, the CRT aligned memory is a viable technique that does work OK for exactly the same reason, you can directly control the alignment and not make assumptions about what the default may happen to be.

For SSE you need 128 byte alignment, AVX requires 256 byte alignment and AVX2 512 byte alignment.
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 19, 2016, 11:05:44 PM
Quote from: hutch-- on September 19, 2016, 08:10:52 PM
Over time I have learnt that Microsoft have changed the default alignment of various memory allocation strategies

See screenshot below from the 1994 TechEd Conference (https://en.wikipedia.org/wiki/TechEd#Dates_and_Locations_of_TechEd_Events). M$ may have had good intentions, but (test attached) GlobalAlloc is align 8 on XP and Win7-64 alike, exactly as for HeapAlloc 8)

Quote
For SSE you need 128 byte alignment

The great majority of SSE instructions is happy with align 16 or no alignment at all. Or did you mean 128 bits?
Title: Re: Aligning memory for later instructions.
Post by: nidud on September 19, 2016, 11:26:06 PM
deleted
Title: Re: Aligning memory for later instructions.
Post by: nidud on September 20, 2016, 12:50:18 AM
deleted
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 20, 2016, 01:05:16 AM
> The great majority of SSE instructions is happy with align 16 or no alignment at all. Or did you mean 128 bits?

This is the Intel manual.
The 128-bit (V)MOVNTDQA addresses must be 16-byte aligned or the instruction will cause a #GP.
The 256-bit VMOVNTDQA addresses must be 32-byte aligned or the instruction will cause a #GP.
The 512-bit VMOVNTDQA addresses must be 64-byte aligned or the instruction will cause a #GP.

This was a blunder, tired and too much work.
> For SSE you need 128 byte alignment, AVX requires 256 byte alignment and AVX2 512 byte alignment.

It should be,
For SSE you need 128 BIT (16 byte) alignment, AVX and AVX2 require 256 BIT (32 byte) alignment and AVX-512 requires 512 BIT (64 byte) alignment.
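
Since the mix-up was between bits and bytes, a tiny C sketch of the conversion and of an alignment check may help. The function names `alignment_bytes` and `is_aligned` are hypothetical; the figures are just the operand width divided by 8, which gives the 16, 32 and 64 byte values in the Intel manual lines quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Operand width in bits to required alignment in bytes: 128 -> 16, 256 -> 32, 512 -> 64. */
static size_t alignment_bytes(unsigned width_bits)
{
    return width_bits / 8;
}

/* Non-zero if p is a multiple of alignment (alignment must be a power of two). */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

A check like `is_aligned(ptr, alignment_bytes(256))` before an aligned 256-bit load is a cheap way to catch a pointer that would otherwise fault.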


Title: Re: Aligning memory for later instructions.
Post by: MichaelW on September 20, 2016, 03:07:56 AM
At least under Windows 7-64 and Windows 10-64, for the aligned malloc functions a 16-byte alignment is the minimum actual alignment. There are also the _aligned_offset_malloc functions that allow you to specify the alignment of a specific offset in the allocated memory. IIRC they were not supported under Windows XP, but are under Windows 7-64.
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 20, 2016, 09:18:05 AM
Quote from: hutch-- on September 20, 2016, 01:05:16 AM
tired and too much work

Slow down, man. You are the Masm32 BDFL anyway, even if you don't finish the 64-bit version by tomorrow ;-)

Still 32-bit, almost plain HeapAlloc under the hood:

include \masm32\MasmBasic\MasmBasic.inc      ; Version 20 September 2016 (http://masm32.com/board/index.php?topic=94.0)
  Init
  Dim PtrSSE() As DWORD
  For_ ct=0 To A16Max-1      ; 100 aligned pointers
      Alloc16 Rand(10000)
      movaps [eax], xmm0      ; the proof ;-)
      mov PtrSSE(ct), eax
      Print Hex$(al), " "
  Next
  For_ ct=0 To A16Max-1
      Free16 PtrSSE(ct)
  Next
  Inkey "OK?"
EndOfCode


Output:
50 20 20 A0 80 40 70 30 10 20 70 90 F0 A0 30 20 00 20 50 50 B0 C0 50 50 40 80 F0 70 D0 B0 40 E0 A0 C0 30 70 10 F0 70 E0 80 20 C0 60 A0 E0 10 00 70 10 D0 B0 00 90 20 B0 90 70 00 90 30 90 B0 30 00 60 C0 C0 10 10 B0 50 F0 60 C0 F0 B0 E0 10 90 C0 D0 F0 60 00 30 F0 A0 C0 A0 10 A0 90 30 80 A0 F0 E0 10 B0 OK?
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 20, 2016, 09:41:24 PM
I don't claim to understand your notation but if I have it right, why not make a version where you can set the alignment to any power of 2 size you like so you can also handle AVX and AVX2 ?
Title: Re: Aligning memory for later instructions.
Post by: nidud on September 22, 2016, 11:41:19 PM
deleted
Title: Re: Aligning memory for later instructions.
Post by: jj2007 on September 23, 2016, 09:32:53 AM
Quote from: nidud on September 22, 2016, 11:41:19 PM
Using the stack is way faster than using HeapAlloc.

That's correct, and StackBuffer() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1255) proves it, but a HeapAlloc-based macro as shown above is normally fast enough, and not limited to the procedure where it was called.
Title: Re: Aligning memory for later instructions.
Post by: hutch-- on September 23, 2016, 10:36:26 AM
I generally choose dynamic memory allocation when I need large single memory blocks which I generally chop up into the size bits I need from it. I have seen code where massive counts of small allocations occur but it's lousy code design and often very slow. Stack is easy and fast but I only use it for relatively small amounts, a few K here and there. You can alter the linker option on stack reserve/stack commit if you want a lot more stack space.
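
Chopping one large block into aligned pieces can be sketched in C as a simple bump allocator. This is an illustration under assumptions, not anyone's code from this thread: the `arena` type and the `arena_init`, `arena_take` and `arena_release` names are hypothetical, and `malloc` stands in for the single large allocation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One large block that is handed out piece by piece. */
typedef struct {
    unsigned char *base;      /* start of the single large allocation */
    size_t         capacity;  /* total bytes in the block */
    size_t         used;      /* bytes consumed so far, including alignment gaps */
} arena;

/* Make the one big allocation; returns 1 on success, 0 on failure. */
static int arena_init(arena *a, size_t capacity)
{
    a->base = malloc(capacity);
    a->capacity = capacity;
    a->used = 0;
    return a->base != NULL;
}

/* Carve size bytes aligned to alignment (a power of two) out of the block;
   returns NULL when the remaining space is too small. */
static void *arena_take(arena *a, size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    uintptr_t start   = (uintptr_t)a->base + a->used;
    uintptr_t aligned = (start + alignment - 1) & ~(uintptr_t)(alignment - 1);
    size_t    offset  = (size_t)(aligned - (uintptr_t)a->base);
    if (offset > a->capacity || size > a->capacity - offset)
        return NULL;
    a->used = offset + size;
    return a->base + offset;
}

/* A single free releases every piece at once. */
static void arena_release(arena *a)
{
    free(a->base);
    a->base = NULL;
    a->capacity = 0;
    a->used = 0;
}
```

The design choice is the one described in the post: one allocation and one free, with the per-piece cost reduced to a little pointer arithmetic.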