HJWasm 2.23 Release

Started by johnsa, April 01, 2017, 08:44:22 AM


johnsa

What's in it?

1) stackbase:esp back in (thanks Nidud!)

2) incbin offset bug fixed (reported by Vortex)

3) fixed a bug in coff 32bit name mangle output (reported by JJ)

4) command line switch -nomlib to disable the built-in macro library (if you need or want to)

5) stackbase:rsp and stackbase:rbp code has been totally separated, refactored and optimised.

6) both of these options have been simplified as follows:
   - option stackbase:rsp will enforce win64:11 and frame:auto. All procs with default prologue/epilogue settings will be frame procs.
   - option stackbase:rbp will enforce frame:auto but leaves option win64:1 -> 7 available. All procs with default prologue/epilogue settings will be frame procs.
   - Both align the first local to 16 when required.
   - RSP already had all the smart optimisations; these have now been carried over to RBP and extended.
   - If no locals or parameters are used at all, the frame pointer will be omitted.
   - If the procedure is a leaf proc (i.e. no locals and no further invokes), the reservation of stack via sub/add rsp will be automatically removed.
   - Both options will only copy parameters to home space if they're actually used, so if you just use the source registers you get an optimised proc out.

7) There is now an OSX universal binary package on the site for 2.23
     (It was an absolute pain to get GCC on OSX to build it at all, and then to stop it giving Bus Error 10s.)
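
The rules in item 6 can be read as three independent decisions per procedure. The following is only a toy model of those rules as worded in the list, not HJWasm's actual logic; the function name `plan_prologue` and its parameters are made up for illustration.

```python
def plan_prologue(local_names, params_used, makes_calls):
    """Toy model of item 6: which prologue pieces survive for one proc.

    local_names -- names of LOCAL variables declared in the proc
    params_used -- set of parameter names the body actually references
    makes_calls -- True if the body contains any invoke/call
    (All three inputs are hypothetical; the real assembler derives them itself.)
    """
    has_locals = len(local_names) > 0
    return {
        # The frame pointer is dropped only when no locals and no parameters are used.
        "frame_pointer": has_locals or len(params_used) > 0,
        # A leaf proc (no locals, no invokes) needs no sub/add rsp pair.
        "reserve_stack": has_locals or makes_calls,
        # Only parameters that are referenced get copied to home space.
        "home_space_copies": sorted(params_used),
    }


print(plan_prologue([], set(), False))
print(plan_prologue(["bob", "bob1"], {"arg2"}, True))
```

The first call corresponds to a proc that touches neither locals nor parameters and calls nothing, so all three pieces drop out; the second keeps everything but copies only `arg2` to home space.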

Enjoy!

PS, We're now implementing the macros for 2.24 - due in store shortly ;)

fearless

Nice!

Thanks for all your hard work.

jj2007

Quote from: fearless on April 01, 2017, 09:08:42 AM
Thanks for all your hard work.

Same from me :t

Builds my larger sources without any problems:
; RichMasm source, 18786 lines:
OxPT_Assembler HJWasm32 ; 1555 ms
OxPT_Assembler JWasm ; 1330 ms
OxPT_Assembler HJWasm64 ; 1274 ms
OPT_Assembler AsmC ; 1085 ms

; MasmBasic source, 31670 lines:
OxPT_Assembler mlv615 ; 8.0 secs
OxPT_Assembler JWasm ; 6.9 secs
OxPT_Assembler HJWasm32 ; 3.7
OxPT_Assembler HJWasm64 ; 3.2
OPT_Assembler AsmC ; 2.6 secs

aw27

Quote from: johnsa on April 01, 2017, 08:44:22 AM
- option stackbase:rsp will enforce win64:11 and frame:auto

I am not so sure that stackbase:rsp will enforce win64:11.
I have used:
OPTION casemap:none
OPTION FRAME:auto
OPTION WIN64:6
OPTION STACKBASE:RSP

The assembler disregarded WIN64:6, and the enforced win64:11 failed to align a local variable to a 16-byte boundary, so a movaps SSE store from an xmm register to that local caused a crash.
I cannot produce a proof of concept right now; I will try early next Monday if necessary.
What I can say already is that the procedure uses FRAME and has no USES. I can also say that it does not produce an error with JWASM.
This is not important for me: in my case STACKBASE:RSP increases the code size and reduces execution speed. I think I mentioned that before.



Vortex

Hi Johnsa and Habran,

Thanks for the new release. Keep up the nice work :t

johnsa

I'm not sure why aligning the local to 16 isn't working. Are you calling the function from an HLL, or is it being run directly from an asm app?

Here is an example from my side, and the locals bob/bob1 are 16-byte aligned every time:



option frame:auto
option win64:6
option stackbase:rsp

;assemble with
; c:\jwasm\hjwasm64 -c -win64 -Zi -Zd -Zf -Zp8 aw.asm
; d:\vs2015\vc\bin\link /subsystem:console /machine:x64 /debug /entry:proc1 /Libpath:"%WINSDK%\v7.1\Lib\x64" aw.obj


__m128f struct
f0 real4 ?
f1 real4 ?
f2 real4 ?
f3 real4 ?
__m128f ends

__m128q struct
q0 QWORD ?
q1 QWORD ?
__m128q ends

__m128 union
f32 __m128f <>
q64 __m128q <>
__m128 ends

OPTION ARCH:AVX

    includelib kernel32.lib
    includelib user32.lib

externdef MessageBoxW : near
externdef MessageBoxA : near

MessageBoxW PROTO :qword, :qword, :qword, :qword
MessageBoxA PROTO :qword, :qword, :qword, :qword

.data

; Automatic type promotion from integer to float
aReal REAL4 2

; This is example of initializing a union with floats (first sub-type)
; using normal syntax as well as hjwasm 2.17 update to promote integer literal to float
myVec1 __m128 { < 1.0, 2.0, 3.0, 4.0 > }
myVec2 __m128 { < 1, 2, 3, 4 > }

; Hjwasm 2.22 enhanced union type (now allows direct specification of sub-type to use in initialisation):
myVec4 __m128.f32 { < 1.0, 2.0, 3.0, 4.0 > }   ; you can try .f33 and hjwasm will emit an error when testing for valid sub-type.
myVec3 __m128.q64 { < 0x1234, 0x5678 > }
myVec5 __m128.f32 { < 1.0, 2.0, 3.0, 4.0 > }   ; you can try .f33 and hjwasm will emit an error when testing for valid sub-type.

floatVar real4 2.3

awideStr dw "wide caption ",0

.code

start:

LOADSS xmm0,2.0

OPTION ARCH:SSE

LOADSS xmm1,3.0

OPTION ARCH:AVX

LOADSD xmm2,4.0

;this proc is creating a dud sub rsp,8 :( (FIXED)
proc2 proc public

   ret
proc2 endp

sub1 proc public arg1:ptr, arg2:ptr

   ret
sub1 endp

sub2 proc public uses rdi xmm0 arg1:ptr, arg2:ptr

   ret
sub2 endp

newproc3 proc arg1:qword, arg2:qword

ret
newproc3 endp

newproc proc arg1:qword, arg2:real4

movss xmm3,arg2 ; with option win64:7 , this loads from [rbp+20h] but it SHOULD be [rbp+18h] :(

ret
newproc endp

newproc2 proc FRAME arg1:qword, arg2:real4, arg3:dword, arg4:dword, arg5:dword

movss xmm3,arg2 ; with option win64:7 , this loads from [rbp+20h] but it SHOULD be [rbp+18h] :(
mov eax,arg3
mov ebx,arg4
mov ecx,arg5

ret
newproc2 endp

; This one will implement FPO (frame pointer omission), as no parameters or locals are used.
newproc5 proc FRAME arg1:qword, arg2:real4, arg3:dword, arg4:dword, arg5:dword

xor eax,eax
mov ebx,eax

ret
newproc5 endp

proc1 proc FRAME arg1:qword, arg2:qword, arg3 :qword
   
   local bob:XMMWORD
   local bob1:XMMWORD
   
   mov r9, rcx
   mov r10, rdx
   mov r11, r8

   invoke newproc3, rax, "this is an ascii string"
   movss xmm1, FP4(1.28)
   movss xmm1, FP4(2.28)
   movss xmm1, FP4(3.28)
   
invoke MessageBoxW, 0, ADDR awideStr, ADDR awideStr, 0
invoke MessageBoxA, 0, "yay string literals", "oops", 0

   invoke newproc3, rax,"this is an ascii string"
   invoke newproc3, rcx, L"a wide string yay"
   
    invoke MessageBoxW, 0, L"yay wide string literal", ADDR awideStr, 0
invoke MessageBoxA, 0, "yay string literals2", "oops", 0
invoke MessageBoxA, 0, "yay string literals3", "oops", 0
invoke MessageBoxA, 0, "yay string literals4", "oops", 0

invoke newproc2, rax,xmm4,ebx,r10d,r11d
invoke newproc5, rax,xmm4,ebx,r10d,r11d
   invoke newproc, rax, xmm4
   
   invoke newproc, rax, floatVar
   invoke newproc, rax, xmm1
   
   INVOKE sub1, r10, r8
   INVOKE sub2, r9, r11
   mov rax, r9
   
   vmovaps xmm0,bob
   vmovaps bob1,xmm1

   ret
proc1 endp


WinMainCRTStartup PROC FRAME
invoke proc1, 10, 20, 30
ret
WinMainCRTStartup ENDP

end WinMainCRTStartup



aw27

This is how I get the problem.

#include "stdafx.h"
#if defined (__cplusplus)
extern "C" {
#endif
   void proc1(size_t var1, size_t var2, size_t var3, size_t var4,size_t var5);
#if defined (__cplusplus)
}
#endif

int main()
{
    proc1(1, 2, 3, 4, 5);
    return 0;
}

; test.asm

option casemap:none
option frame:auto
OPTION WIN64:11 ; same error if using OPTION WIN64:6
OPTION STACKBASE:RSP

.code

proc1 proc public FRAME _rcx : qword, _rdx: qword, _r8: qword, _r9 : qword, other: qword
   LOCAL lvar1 : ptr
   LOCAL lvar2 : XMMWORD

   mov eax, 2.0
   movd xmm0, eax
   shufps xmm0, xmm0,0
   movaps XMMWORD ptr lvar2, xmm0
   
   ret
proc1 endp

end

Compiled in release 2.23 with
hjwasm64 -c -win64 -Zp8 -archSSE test.asm

No problems in JWASM, it aligns correctly lvar2

johnsa

Ahh I see :)

In JWASM the align setting uses 16 bytes for every local, which is why they're all aligned. We don't do that: we only align the first local, so if you put a qword in first it will throw the alignment out. The upside is that it can save a lot of stack by not spending 16 bytes per local, in some of my test procs 1-2 whole cache lines.
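
The trade-off described here can be illustrated with a small layout calculation. This is a simplified sketch, not either assembler's real allocator: it assumes the locals area starts on a 16-byte boundary and assigns offsets upward from there, and the function names are invented for the example.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def slot_per_local(sizes):
    """Every local starts on its own 16-byte boundary."""
    offsets, cursor = [], 0
    for size in sizes:
        cursor = align_up(cursor, 16)
        offsets.append(cursor)
        cursor += size
    return offsets, align_up(cursor, 16)


def first_local_only(sizes):
    """Only the first local is aligned; the rest are packed back to back."""
    offsets, cursor = [], 0
    for size in sizes:
        offsets.append(cursor)
        cursor += size
    return offsets, align_up(cursor, 16)


# lvar1:ptr (8 bytes) followed by lvar2:XMMWORD (16 bytes)
print(slot_per_local([8, 16]))    # ([0, 16], 32)
print(first_local_only([8, 16]))  # ([0, 8], 32)

# Eight qword locals: packing halves the stack used.
print(slot_per_local([8] * 8)[1], first_local_only([8] * 8)[1])  # 128 64
```

With the qword first, the packed layout puts the XMMWORD at offset 8, which is the misalignment that makes movaps fault; with eight qwords, packing saves 64 bytes, i.e. one cache line.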

aw27

Quote from: johnsa on April 02, 2017, 07:02:16 PM
Ahh I see :)

In JWASM the align setting uses 16bytes for every local which is why they're all aligned, we're not , we only align the first local, so if you put something in first thats a qword it will throw the alignment out, but it can save a lot of stack by not wasting the 16 bytes per local, in some cases in my test procs 1-2 whole cache lines.

But this is the whole purpose of bit 2 of the OPTION WIN64 directive, at least as I have always understood it, because the first local is always 16-byte aligned anyway when it is a 16-byte variable: it ends up 16-byte aligned as a result of the stack alignment the prologue is required to perform.
This holds for both STACKBASE:RSP and STACKBASE:RBP.

Note that with bit 2 of OPTION WIN64 set, JWASM DOES NOT 16-byte align every local; it only 16-byte aligns 16-byte variables.
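
The behaviour described in this note, padding only in front of 16-byte variables, can be sketched as follows. It is a model of the description above rather than of JWASM's source; it assumes a 16-byte-aligned base for the locals area, and `align_xmm_only` is a hypothetical name.

```python
def align_xmm_only(locals_):
    """Pack locals in order, rounding the offset up to 16 only for 16-byte variables.

    locals_ is a list of (name, size_in_bytes) pairs in declaration order.
    """
    layout, cursor = {}, 0
    for name, size in locals_:
        if size == 16:
            cursor = (cursor + 15) & ~15  # pad up to the next 16-byte boundary
        layout[name] = cursor
        cursor += size
    return layout


print(align_xmm_only([("lvar1", 8), ("lvar2", 16)]))    # {'lvar1': 0, 'lvar2': 16}
print(align_xmm_only([("a", 8), ("b", 8), ("c", 16)]))  # {'a': 0, 'b': 8, 'c': 16}
```

In the first case 8 bytes of padding are inserted before `lvar2`; in the second the two qwords already fill a 16-byte slot, so no padding is spent at all.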



nidud

deleted

jj2007

Quote from: aw27 on April 02, 2017, 07:31:53 PM
the first local is always 16 byte aligned when it corresponds to 16 byte variable.

Interesting observation :t

Quote
Note that with bit 2 of OPTION WIN64, JWASM DOES NOT 16-byte align every local, only 16-byte aligns 16-byte variables.

I agree with John that aligning everything is a waste of stack, but aligning only xmmwords would indeed be an intelligent option. For efficient code, we could still use e.g.
Local x0:XMMWORD, x1, x2, x3, x4
Local y0:XMMWORD, y1, y2, y3, y4

aw27

Quote from: jj2007 on April 02, 2017, 10:51:23 PM
Local x0:XMMWORD, x1, x2, x3, x4
Local y0:XMMWORD, y1, y2, y3, y4


This is a trap, isn't it?
I think you will not get what you want, if I know what you want.

johnsa

I've fixed this now, but for option stackbase:rsp it means you need the value 15.

Any value below 11 becomes 11 (as this is the most sensible); the one other value it accepts is 15, since 15 = 11 + the align16 bit.
So you have 2 options with stackbase:rsp: 11 or 15, and that's it.

For stackbase rbp you can use any values between 0-7 as per normal.
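
The value handling announced here for 2.24 can be written down as a small function. This is a sketch of the rule as stated in this post, not code from HJWasm; `effective_win64` is a made-up name, and mapping the unmentioned values 12-14 to 11 is an assumption based on "11 or 15 and that's it".

```python
ALIGN16_BIT = 4  # bit 2 of OPTION WIN64, the "align16" bit referred to above


def effective_win64(stackbase, requested):
    """Return the WIN64 value that would actually be in force."""
    if stackbase == "rsp":
        # Only 15 is accepted as-is; everything else collapses to 11.
        # (Treating 12-14 this way is an assumption, see the lead-in.)
        return 15 if requested == 15 else 11
    if stackbase == "rbp":
        if not 0 <= requested <= 7:
            raise ValueError("stackbase:rbp accepts win64 values 0-7")
        return requested
    raise ValueError("unknown stackbase: " + stackbase)


print(effective_win64("rsp", 6))   # 11
print(effective_win64("rsp", 15))  # 15
print(effective_win64("rbp", 6))   # 6
print(11 | ALIGN16_BIT)            # 15
```

So a source that needs aligned XMMWORD locals under stackbase:rsp would ask for 15 rather than 6.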

This will be in 2.24 shortly along with some new presents :)

jj2007

Quote from: aw27 on April 03, 2017, 01:51:16 AM
This is a trap, isn't it?

No, it's quite real. Just efficient code: you start with a local that can be used with movaps and friends, you add exactly 4 dwords, then the next SSE variable, etc. Hand-crafted but it works, of course.

include \Masm32\MasmBasic\Res\JBasic.inc      ; ## builds in 32- or 64-bit mode with ML, AsmC, JWasm, HJWasm ##
usedeb=1

.code
MyTest proc <cb> uses rsi rdi rbx arg1, arg2, arg3, arg4, arg5
Local x0:XMMWORD, x1, x2, x3, x4
Local y0:XMMWORD, y1, y2, y3, y4
  lea rax, x0
  lea rdx, y0
  deb 4, "are x0+y0 aligned?", x:rax, x:rdx, x:rbp, x:rsp, arg1, arg2, arg3, arg4, arg5
  ret
MyTest endp

  Init
  PrintLine Chr$("This code was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format", 13, 10)
  jinvoke MyTest, 111, 222, 333, 444, 555
  Inkey
EndOfCode


Output:
This code was assembled with HJWasm32 in 64-bit format

are x0+y0 aligned?
x:rax   12feb0h
x:rdx   12fe90h
x:rbp   12fef0h
x:rsp   12fe00h
arg1    111
arg2    222
arg3    333
arg4    444
arg5    555
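
The two addresses printed above can be checked directly. The snippet below only re-does the arithmetic on the values from this output; `is_aligned` is a helper written for the example.

```python
def is_aligned(address, boundary=16):
    """True if address is a multiple of boundary."""
    return address % boundary == 0


x0 = 0x12FEB0  # x:rax in the output above (lea rax, x0)
y0 = 0x12FE90  # x:rdx in the output above (lea rdx, y0)

print(is_aligned(x0), is_aligned(y0))  # True True
print(x0 - y0)                         # 32
```

The 32-byte distance is one XMMWORD (16 bytes) plus the four dword locals (4 x 4 bytes), which is why the second XMMWORD lands on a 16-byte boundary again.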

aw27

Quote from: jj2007 on April 03, 2017, 06:38:57 AM
No, it's quite real. Just efficient code: you start with a local that can be used with movaps and friends, you add exactly 4 dwords, then the next SSE variable, etc. Hand-crafted but it works, of course.

My take on this is that we should declare the variables in descending order of size, by powers of 2 of the variable's TYPE.
For example:
LOCAL a : XMMWORD
LOCAL b : XMMWORD
LOCAL c : QWORD
LOCAL d : QWORD
LOCAL e[10] : DWORD
LOCAL f : DWORD
LOCAL g[5] : WORD
LOCAL h[20] : BYTE
LOCAL i : BYTE

I believe this will pack as tightly as possible. On the other hand, I don't think that declaring variables out of order will in general cause any problem beyond a slight increase in memory consumption, as long as the compiler helps us maintain the alignment; that's why I like bit 2 of OPTION WIN64 when there are XMMWORDs. Maybe the compiler should also consider other alignments for stack variables, such as 32, 64, 128, 256 or 512 bytes. I am adding that to my wish list.
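
The packing claim can be checked with a short calculation. This sketch assumes a hypothetical allocator that aligns each local to its element size (TYPE) and lays locals out upward from an aligned base; no assembler in this thread is claimed to work exactly this way.

```python
def padding_needed(locals_):
    """Padding bytes when each local is aligned to its element size.

    locals_ is a list of (name, element_size, count) in declaration order.
    """
    cursor = 0
    payload = 0
    for _name, elem, count in locals_:
        cursor = (cursor + elem - 1) // elem * elem  # align to the element size
        cursor += elem * count
        payload += elem * count
    return cursor - payload


# The declaration order from the post above.
DESCENDING = [
    ("a", 16, 1), ("b", 16, 1), ("c", 8, 1), ("d", 8, 1), ("e", 4, 10),
    ("f", 4, 1), ("g", 2, 5), ("h", 1, 20), ("i", 1, 1),
]

print(padding_needed(DESCENDING))        # 0
print(padding_needed(DESCENDING[::-1]))  # 5
```

Declared largest-first, the nine locals need no padding; the same locals in reverse order cost 5 bytes under this model, which matches the "a little more memory" estimate.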