What's in it?
1) stackbase:esp back in (thanks Nidud!)
2) incbin offset bug fixed (reported by Vortex)
3) fixed a bug in coff 32bit name mangle output (reported by JJ)
4) command line switch -nomlib to disable the built-in macro library (if you need or want to)
5) stackbase:rsp and stackbase:rbp code has been totally separated, refactored and optimised.
6) both of these options have been simplified as follows:
- option stackbase:rsp will enforce win64:11 and frame:auto. All procs with default prologue/epilogue settings will be frame procs.
- option stackbase:rbp will enforce frame:auto but leaves option win64:1 -> 7 available. All procs with default prologue/epilogue settings will be frame procs.
- Both align the first local to 16 when required.
- RSP already had all the smart optimisations in, which have now been moved to RBP and extended.
- If no locals or parameters are used at all, the frame pointer will be omitted.
- If the procedure is a leaf proc (i.e. no locals and no further invokes) the reservation of stack via sub/add rsp will be automatically removed.
- Both options will only copy parameters to home space "IF" they're actually used, so if you just use the source registers you get an optimised proc out.
7) There is now an OSX universal binary package on the site for 2.23
(This was an absolute pain to get GCC on OSX building it at all and then NOT giving Bus Error 10's).
Enjoy!
PS, We're now implementing the macros for 2.24 - due in store shortly ;)
Nice!
Thanks for all your hard work.
Quote from: fearless on April 01, 2017, 09:08:42 AM
Thanks for all your hard work.
Same from me :t
Builds my larger sources without any problems:
; RichMasm source, 18786 lines:
OxPT_Assembler HJWasm32 ; 1555 ms
OxPT_Assembler JWasm ; 1330 ms
OxPT_Assembler HJWasm64 ; 1274 ms
OPT_Assembler AsmC ; 1085 ms
; MasmBasic source, 31670 lines:
OxPT_Assembler mlv615 ; 8.0 secs
OxPT_Assembler JWasm ; 6.9 secs
OxPT_Assembler HJWasm32 ; 3.7 secs
OxPT_Assembler HJWasm64 ; 3.2 secs
OPT_Assembler AsmC ; 2.6 secs
Quote from: johnsa on April 01, 2017, 08:44:22 AM
- option stackbase:rsp will enforce win64:11 and frame:auto
I am not so sure that stackbase:rsp will enforce win64:11.
I have used:
OPTION casemap:none
OPTION FRAME:auto
OPTION WIN64:6
OPTION STACKBASE:RSP
and it disregarded WIN64:6, and the enforced win64:11 failed to align a local variable to a 16-byte boundary. So a movaps SSE instruction storing an xmm register to local memory caused a crash.
I can't produce a proof of concept right now; I will try early next Monday if necessary.
What I can say already is that the procedure uses FRAME and has no USES. I can also say that it does not produce an error with JWASM.
This is not important for me; in my case STACKBASE:RSP increases the code size and reduces execution speed. I think I mentioned that before.
Hi Johnsa and Habran,
Thanks for the new release. Keep up the nice work :t
Not too sure how the align local to 16 isn't working, are you calling the function from HLL or is it being run directly from an asm app ?
Here is an example from my side, and the locals bob/bob1 are aligned to 16 every time:
option frame:auto
option win64:6
option stackbase:rsp
;assemble with
; c:\jwasm\hjwasm64 -c -win64 -Zi -Zd -Zf -Zp8 aw.asm
; d:\vs2015\vc\bin\link /subsystem:console /machine:x64 /debug /entry:proc1 /Libpath:"%WINSDK%\v7.1\Lib\x64" aw.obj
__m128f struct
f0 real4 ?
f1 real4 ?
f2 real4 ?
f3 real4 ?
__m128f ends
__m128q struct
q0 QWORD ?
q1 QWORD ?
__m128q ends
__m128 union
f32 __m128f <>
q64 __m128q <>
__m128 ends
OPTION ARCH:AVX
includelib kernel32.lib
includelib user32.lib
externdef MessageBoxW : near
externdef MessageBoxA : near
MessageBoxW PROTO :qword, :qword, :qword, :qword
MessageBoxA PROTO :qword, :qword, :qword, :qword
.data
; Automatic type promotion from integer to float
aReal REAL4 2
; This is example of initializing a union with floats (first sub-type)
; using normal syntax as well as hjwasm 2.17 update to promote integer literal to float
myVec1 __m128 { < 1.0, 2.0, 3.0, 4.0 > }
myVec2 __m128 { < 1, 2, 3, 4 > }
; Hjwasm 2.22 enhanced union type (now allows direct specification of sub-type to use in initialisation):
myVec4 __m128.f32 { < 1.0, 2.0, 3.0, 4.0 > } ; you can try .f33 and hjwasm will emit an error when testing for valid sub-type.
myVec3 __m128.q64 { < 0x1234, 0x5678 > }
myVec5 __m128.f32 { < 1.0, 2.0, 3.0, 4.0 > } ; you can try .f33 and hjwasm will emit an error when testing for valid sub-type.
floatVar real4 2.3
awideStr dw "wide caption ",0
.code
start:
LOADSS xmm0,2.0
OPTION ARCH:SSE
LOADSS xmm1,3.0
OPTION ARCH:AVX
LOADSD xmm2,4.0
;this proc is creating a dud sub rsp,8 :( (FIXED)
proc2 proc public
ret
proc2 endp
sub1 proc public arg1:ptr, arg2:ptr
ret
sub1 endp
sub2 proc public uses rdi xmm0 arg1:ptr, arg2:ptr
ret
sub2 endp
newproc3 proc arg1:qword, arg2:qword
ret
newproc3 endp
newproc proc arg1:qword, arg2:real4
movss xmm3,arg2 ; with option win64:7 , this loads from [rbp+20h] but it SHOULD be [rbp+18h] :(
ret
newproc endp
newproc2 proc FRAME arg1:qword, arg2:real4, arg3:dword, arg4:dword, arg5:dword
movss xmm3,arg2 ; with option win64:7 , this loads from [rbp+20h] but it SHOULD be [rbp+18h] :(
mov eax,arg3
mov ebx,arg4
mov ecx,arg5
ret
newproc2 endp
; This one will implement FPO (frame pointer omission, as no parameters or locals are used).
newproc5 proc FRAME arg1:qword, arg2:real4, arg3:dword, arg4:dword, arg5:dword
xor eax,eax
mov ebx,eax
ret
newproc5 endp
proc1 proc FRAME arg1:qword, arg2:qword, arg3 :qword
local bob:XMMWORD
local bob1:XMMWORD
mov r9, rcx
mov r10, rdx
mov r11, r8
invoke newproc3, rax, "this is an ascii string"
movss xmm1, FP4(1.28)
movss xmm1, FP4(2.28)
movss xmm1, FP4(3.28)
invoke MessageBoxW, 0, ADDR awideStr, ADDR awideStr, 0
invoke MessageBoxA, 0, "yay string literals", "oops", 0
invoke newproc3, rax,"this is an ascii string"
invoke newproc3, rcx, L"a wide string yay"
invoke MessageBoxW, 0, L"yay wide string literal", ADDR awideStr, 0
invoke MessageBoxA, 0, "yay string literals2", "oops", 0
invoke MessageBoxA, 0, "yay string literals3", "oops", 0
invoke MessageBoxA, 0, "yay string literals4", "oops", 0
invoke newproc2, rax,xmm4,ebx,r10d,r11d
invoke newproc5, rax,xmm4,ebx,r10d,r11d
invoke newproc, rax, xmm4
invoke newproc, rax, floatVar
invoke newproc, rax, xmm1
INVOKE sub1, r10, r8
INVOKE sub2, r9, r11
mov rax, r9
vmovaps xmm0,bob
vmovaps bob1,xmm1
ret
proc1 endp
WinMainCRTStartup PROC FRAME
invoke proc1, 10, 20, 30
ret
WinMainCRTStartup ENDP
end WinMainCRTStartup
This is how I get the problem.
#include "stdafx.h"
#if defined (__cplusplus)
extern "C" {
#endif
void proc1(size_t var1, size_t var2, size_t var3, size_t var4,size_t var5);
#if defined (__cplusplus)
}
#endif
int main()
{
proc1(1, 2, 3, 4, 5);
return 0;
}
; test.asm
option casemap:none
option frame:auto
OPTION WIN64:11 ; same error if using OPTION WIN64:6
OPTION STACKBASE:RSP
.code
proc1 proc public FRAME _rcx : qword, _rdx: qword, _r8: qword, _r9 : qword, other: qword
LOCAL lvar1 : ptr
LOCAL lvar2 : XMMWORD
mov eax, 2.0
movd xmm0, eax
shufps xmm0, xmm0,0
movaps XMMWORD ptr lvar2, xmm0
ret
proc1 endp
end
Compiled in release 2.23 with
hjwasm64 -c -win64 -Zp8 -archSSE test.asm
No problems in JWASM, it aligns correctly lvar2
Ahh I see :)
In JWASM the align setting uses 16 bytes for every local, which is why they're all aligned. We don't: we only align the first local, so if you put something first that's a qword it will throw the alignment out. But it can save a lot of stack by not wasting the 16 bytes per local, in some cases in my test procs 1-2 whole cache lines.
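The effect described here can be modelled with a few lines of C. This is only a sketch of the arithmetic, assuming locals are packed back to back from a 16-byte aligned base with no per-local padding; the function names are hypothetical and the real frame offsets HJWasm emits will differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: locals are packed back to back starting at a
   16-byte aligned base, with no padding between them.
   Returns the offset of the local at position `index`. */
size_t packed_offset(const size_t *sizes, size_t index)
{
    size_t off = 0;
    assert(sizes != NULL);
    for (size_t i = 0; i < index; ++i)
        off += sizes[i];
    return off;
}

/* Nonzero when `offset` is a multiple of 16, i.e. usable with movaps. */
int is_aligned16(size_t offset)
{
    return (offset & 15) == 0;
}
```

With the locals from test.asm (an 8-byte ptr followed by an XMMWORD) the XMMWORD lands at offset 8 in this model, which is not a multiple of 16; declaring the XMMWORD first puts it at offset 0.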
Quote from: johnsa on April 02, 2017, 07:02:16 PM
Ahh I see :)
In JWASM the align setting uses 16 bytes for every local, which is why they're all aligned. We don't: we only align the first local, so if you put something first that's a qword it will throw the alignment out. But it can save a lot of stack by not wasting the 16 bytes per local, in some cases in my test procs 1-2 whole cache lines.
But this is the whole purpose of bit 2 of the OPTION WIN64 directive, as I have always understood it, because the first local is always 16-byte aligned anyway when it is a 16-byte variable. It ends up 16-byte aligned as a result of the required stack alignment that takes place during the prolog.
This is valid for STACKBASE:RSP or STACKBASE:RBP.
Note that with bit 2 of OPTION WIN64, JWASM DOES NOT 16-byte align every local; it only 16-byte aligns 16-byte variables.
Quote from: aw27 on April 02, 2017, 07:31:53 PM
the first local is always 16 byte aligned when it corresponds to 16 byte variable.
Interesting observation :t
Quote
Note that with bit 2 of OPTION WIN64, JWASM DOES NOT 16-byte align every local, only 16-byte aligns 16-byte variables.
I agree with John that aligning everything is a waste of stack, but aligning only xmmwords would indeed be an intelligent option. For efficient code, we could still use e.g.
Local x0:XMMWORD, x1, x2, x3, x4
Local y0:XMMWORD, y1, y2, y3, y4
Quote from: jj2007 on April 02, 2017, 10:51:23 PM
Local x0:XMMWORD, x1, x2, x3, x4
Local y0:XMMWORD, y1, y2, y3, y4
This is a trap, isn't it?
I think you will not get what you want, if I know what you want.
I've fixed this now.. but for option stackbase:rsp it means you need the value 15.
If you supply any value < 11, it becomes 11 (as this is the most sensible)... except if you specify 15 which it will accept as 15 = 11 + align16 bit set.
So you have 2 options with stackbase:rsp ... 11 or 15 and that's it.
For stackbase rbp you can use any values between 0-7 as per normal.
This will be in 2.24 shortly along with some new presents :)
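As a sketch only, the value handling described above could be written like this in C. The function and constant names are hypothetical, and mapping the values 12 to 14 to 11 is an assumption: the post only says that values below 11 become 11 and that 15 is accepted.

```c
#include <assert.h>

enum {
    WIN64_RSP_DEFAULT = 11,   /* value enforced for stackbase:rsp */
    WIN64_ALIGN16_BIT = 4,    /* bit 2 of option win64 */
    WIN64_RSP_ALIGN16 = 15    /* 11 plus the align16 bit */
};

/* Returns the win64 value that would actually be in effect.
   stackbase_is_rsp: nonzero for stackbase:rsp, zero for stackbase:rbp. */
int effective_win64(int stackbase_is_rsp, int requested)
{
    assert(requested >= 0);
    if (!stackbase_is_rsp)
        return requested;            /* rbp: 0-7 are used as given */
    if (requested == WIN64_RSP_ALIGN16)
        return WIN64_RSP_ALIGN16;    /* 15 is accepted unchanged */
    return WIN64_RSP_DEFAULT;        /* assumption: everything else becomes 11 */
}
```

So `effective_win64(1, 6)` gives 11, `effective_win64(1, 15)` gives 15, and `effective_win64(0, 6)` stays 6.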
Quote from: aw27 on April 03, 2017, 01:51:16 AM
This is a trap, isn't it?
No, it's quite real. Just efficient code: you start with a local that can be used with movaps and friends, you add exactly 4 dwords, then the next SSE variable, etc. Hand-crafted but it works, of course.
include \Masm32\MasmBasic\Res\JBasic.inc ; ## builds in 32- or 64-bit mode with ML, AsmC, JWasm, HJWasm ##
usedeb=1
.code
MyTest proc <cb> uses rsi rdi rbx arg1, arg2, arg3, arg4, arg5
Local x0:XMMWORD, x1, x2, x3, x4
Local y0:XMMWORD, y1, y2, y3, y4
lea rax, x0
lea rdx, y0
deb 4, "are x0+y0 aligned?", x:rax, x:rdx, x:rbp, x:rsp, arg1, arg2, arg3, arg4, arg5
ret
MyTest endp
Init
PrintLine Chr$("This code was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format", 13, 10)
jinvoke MyTest, 111, 222, 333, 444, 555
Inkey
EndOfCode
Output:
This code was assembled with HJWasm32 in 64-bit format
are x0+y0 aligned?
x:rax 12feb0h
x:rdx 12fe90h
x:rbp 12fef0h
x:rsp 12fe00h
arg1 111
arg2 222
arg3 333
arg4 444
arg5 555
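The two addresses in that output can be checked by hand or with a small C helper. This is a minimal sketch; `is_aligned` is a hypothetical name, and the remark about dword-sized untyped locals is an inference from the 20h spacing, not something the output states.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero when addr is a multiple of align (align must be a power of two). */
int is_aligned(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}
```

Both `is_aligned(0x12feb0, 16)` and `is_aligned(0x12fe90, 16)` hold, and the two addresses are 20h bytes apart, which matches one XMMWORD (16 bytes) plus four dword-sized locals (4 x 4 bytes) per group.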
Quote from: jj2007 on April 03, 2017, 06:38:57 AM
No, it's quite real. Just efficient code: you start with a local that can be used with movaps and friends, you add exactly 4 dwords, then the next SSE variable, etc. Hand-crafted but it works, of course.
My take on this is that we should declare the variables in descending order of the size of their TYPE (in powers of 2).
For example:
LOCAL a : XMMWORD
LOCAL b : XMMWORD
LOCAL c : QWORD
LOCAL d : QWORD
LOCAL e[10] : DWORD
LOCAL f : DWORD
LOCAL g[5] : WORD
LOCAL h[20] : BYTE
LOCAL i : BYTE
I believe this will pack as tightly as possible. On the other hand, I don't think that, in general, placing variables out of order will cause any problem whatsoever; it only increases memory consumption a little when the compiler helps us maintain the alignment, which is why I like bit 2 of OPTION WIN64 when there are XMMWORDs. Maybe the compiler should consider other alignments for stack variables, such as 32, 64, 128, 256 or 512 bytes. I am adding that to my Wish List.
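To put numbers on the packing argument, here is a rough C model. It assumes every local is aligned to its element size (natural alignment), which is an assumption for illustration and not necessarily what any of the assemblers do; `local_t`, `align_up` and `layout_size` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t elem_size;   /* size of one element in bytes (power of two) */
    size_t count;       /* number of elements, 1 for a scalar */
} local_t;

/* Round offset up to the next multiple of align (a power of two). */
size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Total bytes used when the locals are laid out in the given order,
   each one aligned to its own element size. */
size_t layout_size(const local_t *locals, size_t n)
{
    size_t off = 0;
    for (size_t i = 0; i < n; ++i) {
        off = align_up(off, locals[i].elem_size);
        off += locals[i].elem_size * locals[i].count;
    }
    return off;
}
```

Under this model the nine locals a..i above take 123 bytes in the order shown and 128 bytes when declared in exactly the reverse order, so the descending order avoids 5 bytes of padding.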
Quote from: johnsa on April 03, 2017, 04:46:06 AM
I've fixed this now.. but for option stackbase:rsp it means you need the value 15.
If you supply any value < 11, it becomes 11 (as this is the most sensible)... except if you specify 15 which it will accept as 15 = 11 + align16 bit set.
So you have 2 options with stackbase:rsp ... 11 or 15 and that's it.
For stackbase rbp you can use any values between 0-7 as per normal.
This will be in 2.24 shortly along with some new presents :)
Looking forward to it. :t
The change I've made for 2.24 is to just take any local greater than a qword (in size) and align its position up to 16. This happens in order; I don't want to shuffle the local order around in the assembler as it's not transparent to the user, which might in some cases lead to undesirable outcomes, if for example someone writes code that assumes local B occurs after local A and they can be grabbed simultaneously.
MyProc PROC FRAME ....
LOCAL a:XMMWORD ; This is aligned 16 in all cases anyway..
LOCAL b:QWORD
LOCAL c:XMMWORD ; This is now aligned to 16 with win64:15
LOCAL d:DWORD
LOCAL e:DWORD
lea rax,e
mov rax,[rax] ; something like this where it might want to load in D and E at the same time... not that I'd recommend this.. but you never know!
So the alignment for win64:15 now will just insert the necessary padding in front of a local where it's required.
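The in-order padding rule can be sketched in C as follows. This is a simplified model, assuming offsets grow upward from a 16-byte aligned base; `layout_in_order` is a hypothetical name and the real frame offsets relative to rsp or rbp will look different.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the rule described above: walk the locals in declaration
   order and round the position of any local larger than a qword up to
   the next multiple of 16. Smaller locals follow without extra padding. */
void layout_in_order(const size_t *sizes, size_t n, size_t *offsets)
{
    size_t off = 0;
    assert(sizes != NULL && offsets != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (sizes[i] > 8)
            off = (off + 15) & ~(size_t)15;   /* pad in front of the local */
        offsets[i] = off;
        off += sizes[i];
    }
}
```

For the locals a, b, c, d, e (16, 8, 16, 4, 4 bytes) this gives offsets 0, 16, 32, 48, 52: eight bytes of padding are inserted in front of c, while d and e stay adjacent, so code that reads both through one qword access still sees them side by side.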
Quote from: johnsa on April 03, 2017, 05:45:26 PM
The change I've made for 2.24 is to just take any local greater than a qword (in size) and align its position up to 16. This happens in order; I don't want to shuffle the local order around in the assembler as it's not transparent to the user, which might in some cases lead to undesirable outcomes, if for example someone writes code that assumes local B occurs after local A and they can be grabbed simultaneously.
MyProc PROC FRAME ....
LOCAL a:XMMWORD ; This is aligned 16 in all cases anyway..
LOCAL b:QWORD
LOCAL c:XMMWORD ; This is now aligned to 16 with win64:15
LOCAL d:DWORD
LOCAL e:DWORD
lea rax,e
mov rax,[rax] ; something like this where it might want to load in D and E at the same time... not that I'd recommend this.. but you never know!
So the alignment for win64:15 now will just insert the necessary padding in front of a local where it's required.
It should work even for structure and union variables, independently of the alignment set for their fields.
If the structure or union is >= 16 bytes it will, I can adjust this so that it applies it to ANY struct/union, but logically if the structure isn't at least 16 bytes in total size you wouldn't be using any aligned operations against it anyway ?
Quote from: johnsa on April 03, 2017, 08:28:29 PM
If the structure or union is >= 16 bytes it will, I can adjust this so that it applies it to ANY struct/union, but logically if the structure isn't at least 16 bytes in total size you wouldn't be using any aligned operations against it anyway ?
I don't think you should. Users are expected to know that they should align the structure fields to 16 if the structure contains an XMMWORD, because aligning the structure itself to 16 bytes on the stack does not guarantee that an XMMWORD field inside the structure will be aligned to 16 bytes.
Yep, well that's how it will be then in 2.24 as currently fixed, only the start address of the structure is guaranteed to be 16, not the fields inside it.
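A small C illustration of that point: even when the start of a structure sits on a 16-byte boundary, a 16-byte field inside it can still be misaligned. The struct and function names are hypothetical, and the base address in the usage note is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 4-byte tag followed by 16 bytes of vector data. The float array only
   requires 4-byte alignment, so it sits at offset 4 inside the struct. */
typedef struct {
    uint32_t tag;
    float    vec[4];
} tagged_vec;

/* Distance of a field from the previous 16-byte boundary, given a
   structure whose start address `base` is 16-byte aligned. */
size_t field_misalignment(uintptr_t base, size_t field_offset)
{
    assert((base & 15) == 0);   /* only the start is guaranteed aligned */
    return (size_t)((base + field_offset) & 15);
}
```

With a base of 0x1000, `field_misalignment(0x1000, offsetof(tagged_vec, vec))` is 4, so an aligned 16-byte move on `vec` would fault even though the structure itself starts on a 16-byte boundary.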
Hi johnsa.
I am not familiar with x64 asm. This is my first program, so I don't know whether it's my mistake, an HJWasm bug, or something already fixed in version 2.23, because I tested everything with version 2.22.
Simple source:
.686P
.x64
option casemap :none
option win64 : 11
option frame : auto
option stackbase : rsp
include WINDOWS.INC
includelib user32.lib
includelib Kernel32.Lib
.data
Text db '0123456789',0
.code
ShowMessage proc FRAME
LOCAL TxtBuff[11]:byte
LOCAL Flag:BOOL
invoke ZeroMemory,addr TxtBuff,sizeof(TxtBuff)
invoke lstrcpy,addr TxtBuff,addr Text
invoke MessageBox,0,addr TxtBuff,"Info",MB_OK
; ---------------------------
; "Info" < Why double quote added to the text ???
; ---------------------------
; 0123456789 < Text is ok.
; ---------------------------
; OK
; ---------------------------
mov Flag,TRUE
invoke MessageBox,0,addr TxtBuff,"Info",MB_OK
; ---------------------------
; "Info" < Why double quote added to the text ???
; ---------------------------
; 01234567 < "89" was rewritten by "mov Flag,TRUE"
; ---------------------------
; OK
; ---------------------------
ret
ShowMessage endp
start proc FRAME
invoke ShowMessage
invoke ExitProcess,0
start endp
end start
Besides, could you explain when I need to add "FRAME" after the "proc"? I can't find any info about it.
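The symptom in the second message box can be reproduced in a few lines of C. This is only a model of what the output suggests, assuming Flag ended up 8 bytes into TxtBuff; that offset is an inference from the truncated text, not something confirmed in the thread, and the names are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { BUF_LEN = 11, FLAG_OFFSET = 8, FLAG_SIZE = 4 };

/* Simulates a frame in which the 4-byte Flag overlaps the tail of the
   11-byte TxtBuff. FLAG_OFFSET is an assumption taken from the output. */
void simulate_overlap(unsigned char *frame, size_t frame_len)
{
    /* TRUE (1) stored as a little-endian dword */
    static const unsigned char true_le[FLAG_SIZE] = { 1, 0, 0, 0 };
    assert(frame != NULL);
    assert(frame_len >= (size_t)BUF_LEN);
    assert(frame_len >= (size_t)(FLAG_OFFSET + FLAG_SIZE));
    memset(frame, 0, frame_len);
    memcpy(frame, "0123456789", BUF_LEN);             /* lstrcpy TxtBuff, Text */
    memcpy(frame + FLAG_OFFSET, true_le, FLAG_SIZE);  /* mov Flag, TRUE */
}
```

After the call the buffer holds "01234567" followed by the byte 01h and a terminating zero, so the digits 8 and 9 are gone, which is consistent with the second message box.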
Amazingly.. I think you have found a bug, i'm working on it now !!!
So you can leave out .686P (you don't need that)
FRAME is now implied, so you don't have to specify it at all..
Basically, under Win64 procedures should have a pdata/xdata entry, which allows for proper exception handling. Originally jwasm used frame:auto and FRAME on the PROC declaration to specify this behaviour, but along the way it also got roped into how the prologue/epilogue is generated, by inserting .PUSHREG and other exception-frame-related operations. Bottom line: there is no reason not to have it, and to keep the code consistent we've forced it on for all procs.
So you can specify it, or not.. it doesn't matter anymore from 2.23
A PROC is a PROC is a PROC. :)
Quote from: johnsa on April 04, 2017, 12:42:25 AM
Amazingly.. I think you have found a bug, i'm working on it now !!!
The bug happens as well with option stackbase : rbp, unless you use option win64 : 2 which happens to fix the alignment issue.
This is fixed now too for both rsp and rbp.
I'm just waiting for confirmation from Habran on some of his changes and as soon as he's ready we will put 2.24 up.
I've put 2.24 up.
The fixes are:
stackbase:rsp overwriting locals in some cases. (powershadow's bug #1).
stackbase:rbp enforce stack allocations so setting bit 2 of win64 isn't required to get alignment working (aw27's bug).
invoke string literals no longer include the quotes " " (powershadow's bug #2).
stackbase:rsp with win64:15 supports LOCAL alignment to 32 now (if you want to use aligned YMMWORDs or greater).
Any structure or union .. in fact any local that is >= 16 bytes in size will be aligned to 16 under both rsp and rbp.
SYSTEM V ABI calls are in now too (this is early-days experimental, and meant to serve as a test case for adding Delphi calls next). It also helps move towards the goal of full OSX and Linux support, now that we have an OSX build of HJWASM too.
To use it, the win64 flags still apply (they will be aliased in future) and stackbase:rbp must be used; a proc is then simply decorated with:
OPTION ARCH:SSE
nixproc PROC SYSTEMV USES rbx xmm0 arg1:qword, arg2:DWORD, arg3:REAL4
LOCAL mem:DWORD
LOCAL vec:XMMWORD
mov rbx,arg1
mov ecx,arg2
mov eax,10
mov mem,eax
mov ecx,mem
IF @Arch EQ 0
movss xmm10,arg3
movaps xmm0,vec
ELSE
vmovss xmm10,arg3
vmovaps xmm0,vec
ENDIF
ret
nixproc ENDP
OPTION ARCH:AVX