Simple enough to do and necessary for XMM and YMM operations.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc
LOCAL pMem :QWORD ; allocated memory pointer
LOCAL aMem :QWORD ; aligned memory pointer
padd equ <512> ; extra bytes (must be at least size of required alignment)
bcnt equ <1024*1024*4> ; 4 meg
mov pMem, alloc(bcnt+padd) ; allocate the memory plus padding
memalign rax, 256 ; align the memory up to the next 256 byte boundary
mov aMem, rax ; store result in aligned memory pointer
; do what you need with the 256 byte aligned memory (YMM addresses, register etc ....)
mfree pMem ; free the original allocated address
waitkey
invoke ExitProcess,0
ret
entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
Doing something similar with dynamic pointer allocations... but using bit/and comparisons ?
For the amount of memory usually needed for working efficiently with YMM regs, wouldn't VirtualAlloc be a good choice?
:biggrin:
Once memory is allocated, it's all the same, whatever floats your boat. :P
Yes, but with VirtualAlloc you get the alignment "for free" ;-)
Another trick while tweaking the main include file, mis-align at least some structures and it crashes when the procedure that tries to use it is called. 8 byte alignment is necessary with most and probably all structures used in 64 bit.
Quote from: hutch-- on September 15, 2016, 03:16:03 PM
> Another trick while tweaking the main include file, mis-align at least some structures and it crashes when the procedure that tries to use it is called. 8 byte alignment is necessary with most and probably all structures used in 64 bit.
Indeed. The Zp switch is the way to go...
Quote from: jj2007 on September 13, 2016, 10:28:02 PM
> And that one works fine if you build it with Zp4 for 32-bit and Zp8 for 64-bit code, but it crashes for X64 and Zp4.
Which means that the default structure alignment of the Windows API is DWORD in 32-bit code and QWORD in 64-bit code. Both on a 64-bit processor, of course (see Reply #30); Redmond should take their documentation a bit more seriously 8)
:biggrin:
This is only a problem if you are trying to multiport similar assemblers. ML64 is NOT MASM compatible, it is only ML64 compatible. Its error messages while organising include files are unintelligible and buggy, and at times it crashes with stack dumps. It does not have the tolerance that the old ML had and is a genuine joy to make working include files for structures and equates.
Now if you have a look at the guts of Japheth's h2incX output you will have genuine nightmares at this tangled mess of typedefs, prototypes, bugs, equates, the odd "tag" attached to the front of structures and with a possible delivery date at about the year 3000.
Quote from: hutch-- on September 15, 2016, 08:36:54 PM
> if you have a look at the guts of Japheth's h2incX output you will have genuine nightmares at this tangled mess of typedefs, prototypes, bugs, equates, ...
The C++ faction will insist that the 100+ types are necessary. IMHO the only real change from Windows.inc+WinExtra.inc is the distinction between "data" DWORDs (they can stay "as is") and "pointers" that are DWORDs in 32-bit code, and QWORDs in 64-bit code. In \Masm32\MasmBasic\Res\DualWin.inc the latter is called SIZE_P, and its size depends on whether you build 32- or 64-bit code, obviously. Otherwise, there are only minor changes compared to Windows.inc+WinExtra.inc - a few structure members choked with ML64.
So, give DualWin.inc a try - no tangled mess of typedefs, prototypes, bugs, equates, just the old Windows.inc format.
Besides, it runs with all assemblers. So if you are fond of unintelligible and buggy error messages and stack dumps, use ML64; if instead you like the old .if eax>99 etc. syntax, you can use the same include file with HJWasm.
:biggrin:
> So if you are fond of unintelligible and buggy error messages and stack dumps, use ML64
I don't have the problem, I am not trying to use Japheth's includes. With ML64 I am free of "Open Sauce" licencing and the army of parasites that come with it. :badgrin:
Quote from: hutch-- on September 16, 2016, 08:50:56 AM
> With ML64 I am free of "Open Sauce" licencing and the army of parasites that come with it. :badgrin:
C'mon sir hutch, tell us the truth: you probably have a lot of "open sauce" software on your computer, from PDF readers to disassemblers. Just look at the libraries.
But I take your point. I don't like open source either, for one reason: it has lawyers inside. This is why I prefer public domain.
Quote from: hutch-- on September 16, 2016, 08:50:56 AM
> :biggrin:
> > So if you are fond of unintelligible and buggy error messages and stack dumps, use ML64
> I don't have the problem, I am not trying to use Japheth's includes. With ML64 I am free of "Open Sauce" licencing and the army of parasites that come with it. :badgrin:
Me neither. I am trying to use the standard Masm32 includes. Unfortunately, they make Microsoft ML64 crash with exceptions.
That is because they were written for 32 bit ML.EXE, they are not compatible with ML64. I would not be doing the work if it was. Shortly I will have another tool that isolates the prototypes in the Microsoft vc2015 header files and that will be a method of creating prototypes for 64 bit assembler. I don't need it for ML64 but it will be useful for the assemblers that need prototypes. I doubt you can get an auto converter to work on the C .H files, they are too much of a tangled mess and too much useless noise but you can get equates, structures and prototypes which will ease the production of assembler include files.
I have attached a zip file with a C header file cleaner in it. A lot more needs to be done with it but it converts C hex to asm hex, removes the comments and a lot of the junk. Just drop a C header file onto it and it will produce a text file "cleaned.txt" that is at least readable.
Quote from: hutch-- on September 17, 2016, 12:54:48 PM
> That is because they were written for 32 bit ML.EXE, they are not compatible with ML64. I would not be doing the work if it was. Shortly I will have another tool that isolates the prototypes in the Microsoft vc2015 header files and that will be a method of creating prototypes for 64 bit assembler. I don't need it for ML64
You may compare your results to the attached \Masm32\MasmBasic\Res\pt.inc, which gets generated if somebody attempts to run the MasmBasic 64-bit examples.
Quote from: jj2007 on July 21, 2016, 04:58:40 PM
Hutch,
The error checking of invoke is relevant for this type of case:
invoke CreateFile, esi, GENERIC_READ, FILE_SHARE_READ,
NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0, 0
And for nothing else. I guess we can safely assume that everybody can write their own proc with named parameters like hWnd and the like. If then somebody is thick enough to add a fifth parameter to the PROTO list, he should really go for Scratch or Logo 8)
Btw Microsoft's CrippleWare Assembler™ can be convinced to count the paras. I am currently a bit stuck with the PROLOGUE bug (http://masm32.com/board/index.php?topic=5528.0), but with ML64 jinvoke works like a charm :icon_mrgreen:
This looks like an interesting technique.
j@BitBlt equ 2069/41:s111111111
Just a suggestion, instead of just supplying the argument count, use a 1 character notation for the data size.
q = QWORD
d = DWORD
w = WORD
b = BYTE
See e.g. j@VariantTimeToSystemTime equ 8777/89:s39
oleaut32.inc: VariantTimeToSystemTime PROTO :REAL8,:PTR SYSTEMTIME
Most args are DWORD in 32-bit code (even those that should be REAL4, see GdiPlus.inc). DWORDs can become SIZE_P for dual 64/32-bit assembly - the stack is organised in DWORD resp. QWORD slots, so even if (in 64-bit code) the arg should be "only" DWORD according to the C header file, it will do no harm to declare it a QWORD.
The conversion code starts in line 43ff of \Masm32\MasmBasic\Res\GetPT.asm
Interestingly enough, this works OK.
PPROTO TYPEDEF PTR PROC
MessageBox MACRO args:VARARG
externdef __imp_MessageBoxA:PPROTO
IF argcount(args) NE 4
echo *********************************************
echo MessageBox MACRO arg count error, 4 expected
echo *********************************************
.err
ENDIF
invoke __imp_MessageBoxA,args
ENDM
It's just that I could not be bothered, I have enjoyed working without prototypes.
Erol and Paul had an alternative approach (http://www.masmforum.com/board/index.php?topic=8863.msg64295#msg64295), as you may remember:
EXTERNDEF MessageBox@16:PROC
MessageBox EQU <invoke pr4 PTR MessageBox@16>
My version would be
jinvoke MessageBox, 0, Str$(rax), Chr$("Title"), MB_OK
based on
jd@130 equ user32
...
j@MessageBoxA equ 12837/130:s1111
where 12837 is a global counter, 130 is the ID of the DLL, and s1111 means stdcall, 4*DWORD (or QWORD in 64-bit)
This will build but will not start and does not recognise an extra argument. It is old left over code that worked on 32 bit ML using a macro I designed a long time ago that is still in the 32 bit windows.inc file.
EXTERNDEF MessageBox@16:PROC
MessageBox EQU <invoke pr4 PTR MessageBox@16>
This method of prototyping does not work in ML64.
In the C++ header file this is the prototype.
WINOLEAUTAPI_(INT) VariantTimeToSystemTime(__in DOUBLE vtime, __out LPSYSTEMTIME lpSystemTime);
vtime is a DOUBLE
lpSystemTime is a QWORD pointer to a structure
Both are 64 bit values.
Quote from: hutch-- on September 18, 2016, 08:27:38 PM
> In the C++ header file this is the prototype.
> WINOLEAUTAPI_(INT) VariantTimeToSystemTime(__in DOUBLE vtime, __out LPSYSTEMTIME lpSystemTime);
> vtime is a DOUBLE
> lpSystemTime is a QWORD pointer to a structure
> Both are 64 bit values.
Indeed. Little test:
include \Masm32\MasmBasic\Res\JBasic.inc ; OPT_64 1 ; put 0 for 32 bit, 1 for 64 bit assembly
.data?
MyR8 REAL8 ?
MyST SYSTEMTIME <>
Init
PrintLine Chr$("This code was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format")
jinvoke SetLastError, 0
jinvoke GetLocalTime, addr MyST
jinvoke SystemTimeToVariantTime, addr MyST, addr MyR8
jinvoke VariantTimeToSystemTime, MyR8, addr MyST
deb 4, "Result", eax, MyR8, MyST.wDay, MyST.wMonth, MyST.wYear
Inkey Err$()
EndOfCode
Output:
This code was assembled with ml64 in 64-bit format
Result
eax 1
MyR8 42631.545960648
MyST.wDay 18
MyST.wMonth 9
MyST.wYear 2016
Operazione completata.
Beware of being trapped in the past, long mode /LARGEADDRESSAWARE is the future and the native data size for Win64 is 64 bit.
Quote from: hutch-- on September 18, 2016, 10:13:50 PM
> Beware of being trapped in the past, long mode /LARGEADDRESSAWARE is the future and the native data size for Win64 is 64 bit.
Can you please elaborate the relevance of your statement for the VariantTimeToSystemTime example? Or for any other 64-bit example I ever posted here?
Quote
> Most args are DWORD in 32-bit code (even those that should be REAL4, see GdiPlus.inc). DWORDs can become SIZE_P for dual 64/32-bit assembly - the stack is organised in DWORD resp. QWORD slots, so even if (in 64-bit code) the arg should be "only" DWORD according to the C header file, it will do no harm to declare it a QWORD.
According to the C++ header file, both values are 64 bit. If you want to write reliable code you will get the data sizes right. You may get away with 2 x 32 bit values but you are at risk with the future. The other test is if your code builds successfully with the linker option /LARGEADDRESSAWARE. If not you are tied to the past using 32 bit values.
All my 64-bit code builds fine with /large.
int 3
jinvoke VariantTimeToSystemTime, MyR8, addr MyST
nop
translates to:
CC | int3 |
48 8D 15 8C 21 00 00 | lea rdx, qword ptr ds:[140003268] | pointer to SYSTEMTIME
48 8B 0D 7D 21 00 00 | mov rcx, qword ptr ds:[140003260] | 40E4D0F6612F684C, double 42631.701273148
FF 15 9F 24 00 00 | call qword ptr ds:[<&VariantTimeToSystemTime>] |
90 | nop |
Of course, with OPT_64 0, i.e. 32-bit assembly, there will be two DWORD pushes for the REAL8:
CC int3
68 107BD400 push offset MyST
FF35 0C7BD400 push dword ptr [0D47B0C]
FF35 087BD400 push dword ptr [MyR8]
FF15 887CD400 call near [0D47C88]
90 nop
The 64 bit code looks simple enough as it's only 3 register args, but there is no gain in using the wrong data size in the register when the spec is 64 bit. I don't know what the value is in the ml64 subforum for the 32 bit code; it's been around since the middle 1990s.
Where am I "using the wrong data size in the register when the spec is 64 bit"??? ::)
Why not use the aligned malloc (https://msdn.microsoft.com/en-us/library/8z34s9c6.aspx) family of functions ? The attachment contains a demo done in C and GAS assembly.
Michael,
You will probably like this.
; --------------------------------------------------------
; alignment must be an immediate operand and a power of 2
; when no longer required the original address must be
; freed with either GlobalFree() or the macro "mfree".
; --------------------------------------------------------
aalloc MACRO pmem:REQ,bcnt:REQ,alignment:REQ
mov rdx, bcnt
add rdx, alignment
mov rcx, GMEM_FIXED or GMEM_ZEROINIT
call GlobalAlloc
mov pmem, rax
add rax, alignment - 1
and rax, -alignment
EXITM <rax>
ENDM
Your suggestion is a good idea though, I have seen the function but have not had the time to try it out yet.
I have a longer version as well that will take memory operands for the alignment.
Since HeapAlloc returns align 8, and align 16 is what you need for SIMD, a thin wrapper around HeapAlloc and HeapFree is another option. 8 bytes extra per call, obviously.
Over time I have learnt that Microsoft have changed the default alignment of various memory allocation strategies so for reliable operation with whatever strategy you choose, manually controlling the memory alignment is the only safe technique. As per Michael's suggestion, the CRT aligned memory is a viable technique that does work OK for exactly the same reason, you can directly control the alignment and not make assumptions about what the default may happen to be.
For SSE you need 128 byte alignment, AVX requires 256 byte alignment and AVX2 512 byte alignment.
Quote from: hutch-- on September 19, 2016, 08:10:52 PM
> Over time I have learnt that Microsoft have changed the default alignment of various memory allocation strategies
See screenshot below from the 1994 TechEd Conference (https://en.wikipedia.org/wiki/TechEd#Dates_and_Locations_of_TechEd_Events). M$ may have had good intentions, but (test attached) GlobalAlloc is align 8 on XP and Win7-64 alike, exactly as for HeapAlloc 8)
Quote
> For SSE you need 128 byte alignment
The great majority of SSE instructions is happy with align 16 or no alignment at all. Or did you mean 128 bits?
> The great majority of SSE instructions is happy with align 16 or no alignment at all. Or did you mean 128 bits?
This is the Intel manual.
The 128-bit (V)MOVNTDQA addresses must be 16-byte aligned or the instruction will cause a #GP.
The 256-bit VMOVNTDQA addresses must be 32-byte aligned or the instruction will cause a #GP.
The 512-bit VMOVNTDQA addresses must be 64-byte aligned or the instruction will cause a #GP.
This was a blunder, tired and too much work.
> For SSE you need 128 byte alignment, AVX requires 256 byte alignment and AVX2 512 byte alignment.
It should be,
For SSE you need 128 BIT alignment, AVX requires 256 BIT alignment and AVX2 512 BIT alignment.
At least under Windows 7-64 and Windows 10-64, for the aligned malloc functions a 16-byte alignment is the minimum actual alignment. There are also the _aligned_offset_malloc functions that allow you to specify the alignment of a specific offset in the allocated memory. IIRC they were not supported under Windows XP, but are under Windows 7-64.
Quote from: hutch-- on September 20, 2016, 01:05:16 AM
> tired and too much work
Slow down, man. You are the Masm32 BDFL anyway, even if you don't finish the 64-bit version by tomorrow ;-)
Still 32-bit, almost plain HeapAlloc under the hood:
include \masm32\MasmBasic\MasmBasic.inc ; Version 20 September 2016 (http://masm32.com/board/index.php?topic=94.0)
Init
Dim PtrSSE() As DWORD
For_ ct=0 To A16Max-1 ; 100 aligned pointers
Alloc16 Rand(10000)
movaps [eax], xmm0 ; the proof ;-)
mov PtrSSE(ct), eax
Print Hex$(al), " "
Next
For_ ct=0 To A16Max-1
Free16 PtrSSE(ct)
Next
Inkey "OK?"
EndOfCode
Output:
50 20 20 A0 80 40 70 30 10 20 70 90 F0 A0 30 20 00 20 50 50 B0 C0 50 50 40 80 F0 70 D0 B0 40 E0 A0 C0 30 70 10 F0 70 E0 80 20 C0 60 A0 E0 10
00 70 10 D0 B0 00 90 20 B0 90 70 00 90 30 90 B0 30 00 60 C0 C0 10 10 B0 50 F0 60 C0 F0 B0 E0 10 90 C0 D0 F0 60 00 30 F0 A0 C0 A0 10 A0 90 30
80 A0 F0 E0 10 B0 OK?
I don't claim to understand your notation but if I have it right, why not make a version where you can set the alignment to any power of 2 size you like so you can also handle AVX and AVX2 ?
Quote from: nidud on September 22, 2016, 11:41:19 PM
> Using the stack is way faster than using HeapAlloc.
That's correct, and StackBuffer() (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1255) proves it, but a HeapAlloc-based macro as shown above is normally fast enough, and not limited to the procedure where it was called.
I generally choose dynamic memory allocation when I need large single memory blocks, which I generally chop up into the size bits I need from it. I have seen code where massive counts of small allocations occur, but it's lousy code design and often very slow. Stack is easy and fast but I only use it for relatively small amounts, a few K here and there. You can alter the linker option on stack reserve/stack commit if you want a lot more stack space.