The MASM Forum

64 bit assembler => UASM Assembler Development => Topic started by: aw27 on March 01, 2017, 04:37:40 AM

Title: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 01, 2017, 04:37:40 AM
Hello developers,

I am trying to build with HJWASM a large program that previously compiled successfully with JWASM.
I found an issue, which I hope the following makes clear:

Source code:
"
option frame:auto
OPTION WIN64:6

.code

sub1 proc public arg1:ptr, arg2:ptr

   ret
sub1 endp

sub2 proc public arg1:ptr, arg2:ptr

   ret
sub2 endp

proc1 proc public FRAME uses xmm8 xmm9 arg1:qword, arg2:qword, arg3 :qword
   mov r9, rcx
   mov r10, rdx
   mov r11, r8
   
   INVOKE sub1, r10, r8
   INVOKE sub2, r9, r11
   mov rax, r9
   ret
proc1 endp

end

"
Command line:
hjwasm64 -c -win64 -Zp8 test.asm

The above proc1 compiles with HJWASM to:
proc1:
000000013FBF16A4  push        rbp 
000000013FBF16A5  mov         rbp,rsp 
000000013FBF16A8  sub         rsp,20h 
000000013FBF16AC  sub         rsp,40h 
000000013FBF16B0  vmovdqu     xmmword ptr [rsp+40h],xmm8 
000000013FBF16B6  vmovdqu     xmmword ptr [rsp+50h],xmm9 
000000013FBF16BC  mov         r9,rcx 
000000013FBF16BF  mov         r10,rdx 
000000013FBF16C2  mov         r11,r8 
000000013FBF16C5  mov         rcx,r10 
000000013FBF16C8  mov         rdx,r8 
000000013FBF16CB  call        sub1 (13FBF1690h) 
000000013FBF16D0  mov         rcx,r9 
000000013FBF16D3  mov         rdx,r11 
000000013FBF16D6  call        sub2 (13FBF169Ah) 
000000013FBF16DB  mov         rax,r9 
000000013FBF16DE  vmovdqu     xmm8,xmmword ptr [rsp+40h] 
000000013FBF16E4  vmovdqu     xmm9,xmmword ptr [rsp+50h] 
000000013FBF16EA  add         rsp,40h 
000000013FBF16EE  pop         rbp 
000000013FBF16EF  ret 

and with JWASM to:
000000013F3016AC  push        rbp 
000000013F3016AD  mov         rbp,rsp 
000000013F3016B0  sub         rsp,40h 
000000013F3016B4  movdqa      xmmword ptr [rsp+20h],xmm8 
000000013F3016BB  movdqa      xmmword ptr [rsp+30h],xmm9 
000000013F3016C2  mov         r9,rcx 
000000013F3016C5  mov         r10,rdx 
000000013F3016C8  mov         r11,r8 
000000013F3016CB  mov         rcx,r10 
000000013F3016CE  mov         rdx,r8 
000000013F3016D1  call        sub1 (13F301690h) 
000000013F3016D6  mov         rcx,r9 
000000013F3016D9  mov         rdx,r11 
000000013F3016DC  call        sub2 (13F30169Eh) 
000000013F3016E1  mov         rax,r9 
000000013F3016E4  movdqa      xmm8,xmmword ptr [rsp+20h] 
000000013F3016EB  movdqa      xmm9,xmmword ptr [rsp+30h] 
000000013F3016F2  add         rsp,40h 
000000013F3016F6  pop         rbp 
000000013F3016F7  ret

So the stack becomes corrupted with HJWASM: the prologue subtracts 20h and then 40h from rsp, but the epilogue adds back only 40h.

AW27





Title: Re: Crashes in HJWASM but works well in JWASM
Post by: coder on March 01, 2017, 05:48:59 PM
Not much into HJWASM but I'm wondering about this;

000000013FBF16A8  sub         rsp,20h

not being properly restored at the epilogue? Maybe I am wrong.
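
The imbalance raised here can be checked with simple bookkeeping. The C sketch below is only a model, not anything HJWASM or JWASM ships: it sums the rsp adjustments read off the two listings above and reports how many bytes are still allocated when ret executes. The type, function and array names are made up for illustration.

```c
#include <stddef.h>

/* One stack-pointer adjustment in bytes: positive values lower rsp
   (push, sub rsp), negative values raise it (pop, add rsp). */
typedef long stack_delta;

/* Returns the number of bytes still allocated when `ret` executes.
   Zero means `ret` finds the return address; anything else means it does not. */
static long bytes_left_at_ret(const stack_delta *ops, size_t count)
{
    long depth = 0;
    for (size_t i = 0; i < count; ++i)
        depth += ops[i];
    return depth;
}

/* HJWASM listing: push rbp; sub rsp,20h; sub rsp,40h; add rsp,40h; pop rbp */
static const stack_delta hjwasm_proc1[] = { 8, 0x20, 0x40, -0x40, -8 };

/* JWASM listing: push rbp; sub rsp,40h; add rsp,40h; pop rbp */
static const stack_delta jwasm_proc1[] = { 8, 0x40, -0x40, -8 };
```

Under this model the HJWASM sequence leaves 20h bytes allocated, so pop rbp and ret read the wrong stack slots, while the JWASM sequence balances to zero.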
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: habran on March 02, 2017, 09:23:28 AM
Hi aw27,
Good find; the error will be fixed in the next release :biggrin:
However, it was overlooked because we would never use such a combination of win64 flags.
In the case below I have used:
option frame:auto
OPTION WIN64:3

Unless you know exactly what you are doing, I would suggest you use:
option casemap:none      ; causes internal symbol recognition to be case sensitive
option frame:auto           ; generate SEH-compatible prologues and epilogues
option win64:11              ; reserve stack space once per procedure, save registers, calculate required stack space
option STACKBASE:RSP   ; use rsp as a stack base instead of rbp

Values for option win64 are as follows:
enum win64_flag_values {
    W64F_SAVEREGPARAMS = 0x01, /* 1=save register params in shadow space on proc entry */
    W64F_AUTOSTACKSP     = 0x02, /* 1=calculate required stack space for arguments of INVOKE */
    W64F_STACKALIGN16    = 0x04, /* 1=stack variables are 16-byte aligned; added in v2.12 */
    W64F_SMART                = 0x08, /* 1=takes care of everything */
    .....
    .....
};
So, option win64:11 is equivalent to W64F_SAVEREGPARAMS + W64F_AUTOSTACKSP + W64F_SMART (1 + 2 + 8 = 11).
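
The same decomposition works for any win64 value discussed in this thread. Here is a small C sketch, assuming only the four flag values quoted above (the enum in the post elides further members, which are not guessed at here). The helper name describe_win64 is hypothetical and not part of HJWasm.

```c
#include <string.h>

/* Flag values copied from the enum quoted in the post above;
   members elided there are deliberately left out. */
enum win64_flag_values {
    W64F_SAVEREGPARAMS = 0x01,
    W64F_AUTOSTACKSP   = 0x02,
    W64F_STACKALIGN16  = 0x04,
    W64F_SMART         = 0x08
};

/* Hypothetical helper: names the known flags set in `value`, joined by " + ".
   Returns a pointer to a static buffer that the next call overwrites. */
static const char *describe_win64(unsigned value)
{
    static char text[128];
    static const struct { unsigned bit; const char *name; } table[] = {
        { W64F_SAVEREGPARAMS, "W64F_SAVEREGPARAMS" },
        { W64F_AUTOSTACKSP,   "W64F_AUTOSTACKSP"   },
        { W64F_STACKALIGN16,  "W64F_STACKALIGN16"  },
        { W64F_SMART,         "W64F_SMART"         }
    };

    text[0] = '\0';
    for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i) {
        if ((value & table[i].bit) == 0)
            continue;
        if (text[0] != '\0')
            strcat(text, " + ");
        strcat(text, table[i].name);
    }
    return text;
}
```

Calling describe_win64(6) names the two flags behind the OPTION WIN64:6 setting from the first post (W64F_AUTOSTACKSP and W64F_STACKALIGN16), which is the combination without W64F_SAVEREGPARAMS.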




    16: proc1 proc public FRAME uses xmm8 xmm9 arg1:qword, arg2:qword, arg3 :qword
00007FF70377103A 48 83 EC 48          sub         rsp,48h 
00007FF70377103E C5 7A 7F 44 24 20    vmovdqu     xmmword ptr [rsp+20h],xmm8 
00007FF703771044 C5 7A 7F 4C 24 30    vmovdqu     xmmword ptr [rsp+30h],xmm9 
    17:    mov r9, rcx
00007FF70377104A 4C 8B C9             mov         r9,rcx 
    18:    mov r10, rdx
00007FF70377104D 4C 8B D2             mov         r10,rdx 
    19:    mov r11, r8
00007FF703771050 4D 8B D8             mov         r11,r8 
    20:   
    21:    INVOKE sub1, r10, r8
00007FF703771053 49 8B CA             mov         rcx,r10 
00007FF703771056 49 8B D0             mov         rdx,r8 
00007FF703771059 E8 D2 FF FF FF       call        sub1 (07FF703771030h) 
    22:    INVOKE sub2, r9, r11
00007FF70377105E 49 8B C9             mov         rcx,r9 
00007FF703771061 49 8B D3             mov         rdx,r11 
00007FF703771064 E8 CC FF FF FF       call        sub2 (07FF703771035h) 
    23:    mov rax, r9
00007FF703771069 49 8B C1             mov         rax,r9 
    24:    ret
00007FF70377106C C5 7A 6F 44 24 20    vmovdqu     xmm8,xmmword ptr [rsp+20h] 
00007FF703771072 C5 7A 6F 4C 24 30    vmovdqu     xmm9,xmmword ptr [rsp+30h] 
00007FF703771078 48 83 C4 48          add         rsp,48h 
00007FF70377107C C3                   ret
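
The 48h in the listing above can be sanity-checked against the Windows x64 requirement that rsp be 16-byte aligned at every call site. The C sketch below assumes the caller followed that rule, so rsp sits 8 bytes below a 16-byte boundary on entry because of the pushed return address. The function names are illustrative only.

```c
/* Bytes the prologue moves rsp down by: 8 per pushed register
   plus the immediate of `sub rsp, N`. */
static unsigned long prologue_bytes(unsigned pushes, unsigned long sub_rsp)
{
    return 8ul * pushes + sub_rsp;
}

/* On entry rsp is 8 bytes below a 16-byte boundary (assumption: the caller
   was aligned when it executed `call`). Calls made from inside the procedure
   see an aligned rsp only if 8 + prologue bytes is a multiple of 16. */
static int calls_see_aligned_rsp(unsigned pushes, unsigned long sub_rsp)
{
    return (8 + prologue_bytes(pushes, sub_rsp)) % 16 == 0;
}
```

Both sub rsp,48h with no push (this listing) and push rbp followed by sub rsp,40h (the JWASM listing) satisfy the check.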


Title: Re: Crashes in HJWASM but works well in JWASM
Post by: habran on March 02, 2017, 05:16:41 PM
The bug is exterminated now, thanks aw27 :icon14:
Until next release make sure that the flag W64F_SAVEREGPARAMS is set.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 02, 2017, 05:23:43 PM
Hi habran,

It is good to know you are working on this, I will test again when you are done.

I have used OPTION WIN64:6 because
1) I need stack variables to be 16-byte aligned because there are many SSE instructions in the procedures. I know the 1st stack variable is always 16-byte aligned, but if the second variable is, for example, a REAL4, the 3rd variable will no longer be 16-byte aligned. I know I can move the variables around by hand, but I found this to be safer. This is how it worked in JWASM; I am not sure if it is different in HJWASM. BTW, why are you using "vmovdqu" to save xmm registers if the memory is guaranteed to be 16-byte aligned at that point?
2) I am not using W64F_SAVEREGPARAMS because in many small procedures it is a waste of time to save the parameters. And my program is large and has procedures of all kinds.
3) Thank you for the hint about STACKBASE:RSP, I will check that out.

AW27
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: habran on March 02, 2017, 06:30:41 PM
Hi aw27,
If you need 16-byte alignment you can use OPTION WIN64:15,
and of course, option STACKBASE:RSP.
If you use OPTION WIN64:11, or in your case OPTION WIN64:15, you don't have to worry about anything: HJWasm will not create a stack frame if it is not needed and will store a register in its home space only if the parameter is used in the function.
I would suggest reading the HJWasm Extended Guide (http://www.terraspace.co.uk/hjwasm219_ext.pdf) to learn about its features.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 02, 2017, 07:38:57 PM
Thank you, I will check all again on the next release. :t
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 14, 2017, 07:50:14 PM
I tested the latest release, 2.20. If you remove the INVOKEs from my previous example, i.e.:

proc1 proc public FRAME uses xmm8 xmm9 arg1:qword, arg2:qword, arg3 :qword
   mov r9, rcx
   mov r10, rdx
   mov r11, r8
   
   ; INVOKE sub1, r10, r8
   ; INVOKE sub2, r9, r11
   mov rax, r9
   ret
proc1 endp

It will compile to:
proc1:
000000013F0516A4  push        rbp 
000000013F0516A5  mov         rbp,rsp 
000000013F0516A8  mov         r9,rcx 
000000013F0516AB  mov         r10,rdx 
000000013F0516AE  mov         r11,r8 
000000013F0516B1  mov         rax,r9 
000000013F0516B4  vmovdqu     xmm8,xmmword ptr [rsp] 
000000013F0516B9  vmovdqu     xmm9,xmmword ptr [rsp+10h] 
000000013F0516BF  add         rsp,28h 
000000013F0516C3  pop         rbp 
000000013F0516C4  ret 

And you know the end result.
I would like to use HJWASM, but I don't have much hope.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 14, 2017, 08:14:59 PM
Quote from: aw27 on March 02, 2017, 05:23:43 PM
I know the 1st stack variable is always 16-byte aligned, but if the second variable is, for example, a REAL4, the 3rd variable will no longer be 16-byte aligned. I know I can move the variables around by hand, ...

I hope Johnsa & Habran will fix the bug, but in the meantime, if that is a serious issue, why not use a local structure? If its beginning is aligned 16, you have full control over the rest.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 14, 2017, 09:28:55 PM
"if that is a serious issue, why not use a local structure"

That is not the point: it compiles and works with JWasm without a glitch. It is a big library of more than 540 functions that I recently put on Codeproject.com (https://www.codeproject.com/Tips/1174521/DMath-is-DirectXMath-for-All). I am sure that I could find a way for it to compile and work with HJWasm, but is it worthwhile? I am not convinced, although I found some good points in the specification. Wait and see.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 14, 2017, 10:12:51 PM
Quote from: aw27 on March 14, 2017, 09:28:55 PM
It is a big library of more than 540 functions that I recently put on Codeproject.com (https://www.codeproject.com/Tips/1174521/DMath-is-DirectXMath-for-All).

Compliments, José, that looks like a big project :t
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 14, 2017, 10:45:12 PM
Grazie  :bgrin:
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 15, 2017, 09:00:19 PM
We're busy putting together a fix for this.

Alignment to 16 will apply only to the first local (at least for the near future), as allocating 16 bytes per local is wasteful.

vmovdqu is used instead of the aligned equivalent because the same prolog/epilog code is generated to handle not only xmm but ymm and zmm too, and we're not going to align the stack to 32/64 bytes.
We can special-case xmm and align it specifically, but in theory it shouldn't be required, as most newer processors automatically take the faster aligned path micro-architecturally for unaligned instructions when the address is in fact aligned.

Given that you're building a math lib (and thus using vectors/matrices etc.) I would strongly suggest looking at vectorcall. We added it a while back specifically for this type of work, to avoid the massive overhead of passing simd types by ref/memory all the time. It's all in the extended manual.

Will have the update ready shortly.

Cheers
John
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 15, 2017, 09:43:13 PM
Quote from: johnsa on March 15, 2017, 09:00:19 PM
allocating 16 bytes per local is wasteful

Correct. One could limit the align 16 to OWORD variables, though. Or insert an align 16 in the right place:

include \Masm32\MasmBasic\Res\JBasic.inc      ; part of MasmBasic (http://masm32.com/board/index.php?topic=94.0)

.code
MyTest proc <cb> hwnd
LOCAL MyDword:DWORD
align 16
LOCAL MyOword:XMMWORD

  lea rax, MyDword
  Print Str$("MyDword:\t%x\n", rax)

  lea rax, MyOword
  Print Str$("MyOword:\t%x\n", rax)

  ret
MyTest endp

Init            ; OPT_64 1      ; put 0 for 32 bit, 1 for 64 bit assembly
  PrintLine Chr$("This code was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format")
  jinvoke MyTest, 123

EndOfCode


Surprisingly enough, the assemblers tested allow this syntax 8)

Problem is that it's misleading - here are the results:
This code was assembled with AsmC in 64-bit format
MyDword:        12fedc
MyOword:        12fec8

This code was assembled with ml64 in 64-bit format
MyDword:        12fedc
MyOword:        12fecc

This code was assembled with HJWasm32 in 64-bit format
MyDword:        12fedc
MyOword:        12fec8
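
Whether any of these addresses is 16-byte aligned can be read off the low four bits. A minimal C sketch of that check follows (the helper names are illustrative; alignment is assumed to be a power of two):

```c
#include <stdint.h>

/* Distance of `address` above the previous `alignment` boundary.
   `alignment` must be a power of two. */
static uintptr_t misalignment(uintptr_t address, uintptr_t alignment)
{
    return address & (alignment - 1);
}

/* Non-zero when `address` sits exactly on an `alignment` boundary. */
static int is_aligned(uintptr_t address, uintptr_t alignment)
{
    return misalignment(address, alignment) == 0;
}
```

Applied to the output above, 12fec8 is 8 bytes past a 16-byte boundary and 12fecc is 12 bytes past one, so MyOword is not 16-byte aligned in any of the three results.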
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 15, 2017, 10:30:34 PM
I did think about that: automatically re-arranging the simd aligned items to occur first, which is totally do-able, but I'm not sure I like the idea of the assembler doing its own thing that much.
I like to know it does what I tell it; if I want things aligned I'll keep them at the front of the list.. sometimes I like to keep variables initially grouped by usage, and then when optimising I'll re-arrange them for alignment / locality.

The other option would be to have something which aligns just the specified local, but we'd need to think about how that should look syntax-wise:

Specifically to allow vectorcall to work properly, hjwasm has built-in types for __m128, __m256 etc., which allow it to know if you're talking about a vector rather than an HFA (homogeneous float aggregate), as they're handled differently in the calling convention.

So those types are what are used to define proc arguments, locals etc. instead of the old-fashioned oword/xmmword etc. (Plus they have the union include file.. which will be even more useful with HJWasm 2.21 when we extend the union to allow it to be initialised with any of its component structs, not just the first one..)

We could make it the case that __m128 is specifically aligned when used

so

LOCAL myVector:OWORD
LOCAL myVector:XMMWORD and so on wouldn't be aligned unless by virtue of the first item

but

LOCAL myVector:__m128

would be aligned no matter where in the list it's used.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 15, 2017, 10:50:28 PM
My stake on this is that you are wrong, not just a little bit wrong, but near 100% wrong.
1) IMHO, you should not use AVX instructions (vmovdqu) without making sure the OS supports AVX. Does XP support AVX? Does Vista support AVX? Does 2008 support AVX? NO!
2) Even when the OS supports AVX, it turns out to be a BIG performance penalty to mix AVX and SSE instructions (ref:Intel books).
3) Vector calls will not solve anything. Sooner or later you will need to save data to memory because even 16 XMM registers are not always enough, or turn out to be a confusing mess, and you will thank God to have aligned memory to use.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 15, 2017, 10:54:02 PM
Quote from: johnsa on March 15, 2017, 10:30:34 PM
I did think about that: automatically re-arranging the simd aligned items to occur first, which is totally do-able, but I'm not sure I like the idea of the assembler doing its own thing that much.

I agree, it should be transparent.

MyTest proc arg1, arg2
LOCAL MyDword:DWORD
ALIGN 16
LOCAL MyXmmword:XMMWORD


Problem is that the "natural" syntax as shown above is FAKE: the align 16 does ... nothing. In HJWasm, in AsmC and in ML64. And this bug crept in decades before you started to work on JWasm :(
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 15, 2017, 10:54:32 PM
Quote from: jj2007 on March 15, 2017, 09:43:13 PM
Correct. One could limit the align 16 to OWORD variables, though. Or insert an align 16 in the right place:

You can't align stack (local) variables that way. align 16 does not work :eusa_naughty:
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 15, 2017, 11:26:04 PM
Quote from: aw27 on March 15, 2017, 10:54:32 PM
You can't align stack (local) variables that way. align 16 does not work :eusa_naughty:

I see that. An intelligent assembler would throw a syntax error, though, instead of letting the coder believe that it aligns the variable.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: habran on March 16, 2017, 12:23:32 AM
Some people should be banned from using "IMHO"; they should use "IMAO" instead.
If someone is not happy with HJWasm they should take it to the shop where they bought it and they will be fully refunded.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 16, 2017, 12:49:43 AM
Quote from: aw27 on March 15, 2017, 10:50:28 PM
My stake on this is that you are wrong, not just a little bit wrong, but near 100% wrong.
1) IMHO, you should not use AVX instructions (vmovdqu) without making sure the OS supports AVX. Does XP support AVX? Does Vista support AVX? Does 2008 support AVX? NO!
2) Even when the OS supports AVX, it turns out to be a BIG performance penalty to mix AVX and SSE instructions (ref:Intel books).
3) Vector calls will not solve anything. Sooner or later you will need to save data to memory because even 16 XMM registers are not always enough, or turn out to be a confusing mess, and you will thank God to have aligned memory to use.

1. Every cpu since 2011 supports avx.. there was a specific reason the avx equivalents were chosen over the sse variants and it relates to your point 2: there is no transition penalty from sse -> avx, but there is from avx -> sse. Suppose you have a loop making use of avx code, inside which you call a procedure. If that procedure uses movdqa/movdqu you'd get an implicit performance penalty that is entirely hidden from the programmer. You'd have to insert a vzeroupper/vzeroall prior to the call to mitigate the prologue penalty. Following that, if you used avx inside the procedure you'd need another one prior to the epilogue generation.
This way around you can use avx outside and inside without issue; if you use sse inside the procedure, a single vzeroupper can be used prior to the sse code. So it's 1 transition cleanup required as opposed to 2.
https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties

With regard to support in xp and vista.. do I care? .. not particularly.. https://en.wikipedia.org/wiki/Usage_share_of_operating_systems .. xp and vista account for 0.5% of the pc market.. welcome to 2017 :)
2008 sp1 does support avx.
You can always switch off the default prologue/epilogue and use your own specifically if you want to support non AVX systems.

3. I'm not sure how to respond to that, it's a ridiculous comment.. vectorcall makes a huge difference
Writing simd assembler code that passes data to procs through memory will end up being far less performant than if you'd written it in C++ with intrinsics and vectorcall, no matter how cleverly you optimise the register usage and dependency chains etc.

Let's consider a vector normalize function in fastcall vs vectorcall ...


VectorNormalize PROC FRAME ptrVec:QWORD

vmovaps xmm0,[rcx]
vdpps xmm1,xmm0,xmm0,0x77
vsqrtps xmm1,xmm1
vdivps xmm0,xmm0,xmm1

ret
VectorNormalize ENDP

vmovaps someVector,xmm0
invoke VectorNormalize, ADDR someVector 
vmovaps xmm0,someVector
;<- continue using xmm0 now that it's normalized..

; Assuming I already had someVector in an xmm register, I'd either have to use a non-prototyped proc and leave it in a register.. or worse, I'd have to store it to memory simply to pass it to the proc.. let's assume the latter as we want clean code.. that's one 16-byte aligned memory store.
; Secondly we have a qword argument which will now (depending on your code style) be saved to home space.. another 8-byte store.
; If after the proc I wanted it in a register again (quite likely).. that's ANOTHER 16-byte load.
; Inside the proc.. because it's passed by mem, that means another 16-byte load to work on the vector..

So we have 48 or 56 bytes of memory traffic...



Now with vectorcall ...



VectorNormalize PROC VECTORCALL FRAME myVec:__m128

vdpps xmm1,xmm0,xmm0,0x77  ;< vector is already in xmm0..
vsqrtps xmm1,xmm1
vdivps xmm0,xmm0,xmm1

ret
VectorNormalize ENDP

invoke VectorNormalize, xmm0     ;<- xmm regs can be passed as arguments... because the argument and register abi are == here, no operation .. so it's free..
;<- continue using xmm0 now that it's normalized..

Memory traffic == .... 0 ... and the code is still clean and prototyped.
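
The "48 or 56 bytes" figure for the fastcall version is just the sum of the transfers listed in the comments above. A small C tally, using those byte counts as given in the post and made-up names:

```c
/* Byte counts taken from the comments in the fastcall example above. */
enum {
    STORE_VECTOR_BEFORE_CALL = 16, /* vmovaps someVector, xmm0 */
    STORE_POINTER_TO_HOME    = 8,  /* optional: qword argument saved to home space */
    LOAD_VECTOR_INSIDE_PROC  = 16, /* vmovaps xmm0, [rcx] */
    RELOAD_VECTOR_AFTER_CALL = 16  /* vmovaps xmm0, someVector */
};

/* Bytes moved through memory per call when the vector is passed by pointer. */
static unsigned fastcall_traffic(int saves_pointer_to_home_space)
{
    unsigned bytes = STORE_VECTOR_BEFORE_CALL
                   + LOAD_VECTOR_INSIDE_PROC
                   + RELOAD_VECTOR_AFTER_CALL;
    if (saves_pointer_to_home_space)
        bytes += STORE_POINTER_TO_HOME;
    return bytes;
}

/* With vectorcall the vector stays in xmm0 across the call. */
static unsigned vectorcall_traffic(void)
{
    return 0;
}
```

The tally gives 48 bytes without the home-space store and 56 bytes with it, against 0 for the vectorcall version.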


Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 16, 2017, 01:24:50 AM
Quote from: habran on March 16, 2017, 12:23:32 AM
Some people should be banned from using "IMHO"; they should use "IMAO" instead.
If someone is not happy with HJWasm they should take it to the shop where they bought it and they will be fully refunded.

You are no hero, legions of people supply free, I do it many times.
Supplying free is no excuse for supplying bad. Even worse, you are doing the inverse of the Midas Touch, transforming gold into crap.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 16, 2017, 01:37:35 AM
Quote from: johnsa on March 16, 2017, 12:49:43 AM
1. Every cpu since 2011 supports avx.. there was a specific reason the avx equivalents were chosen
So, you are producing a compiler for people with modern computers.
Quote from: johnsa on March 16, 2017, 12:49:43 AM
2.
With regard to support in xp and vista.. do I care? .. not particularly.. https://en.wikipedia.org/wiki/Usage_share_of_operating_systems .. xp and vista account for 0.5% of the pc market.. welcome to 2017 :)
2008 sp1 does support avx.
Much much more than 0.5% and not 2008 but 2008R2.
So we also need Windows 7 or later.
Quote from: johnsa on March 16, 2017, 12:49:43 AM
3. I'm not sure how to respond to that, it's a ridiculous comment.. vectorcall makes a huge difference
Does not make any difference at all if your routine has more than a dozen lines.


Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 16, 2017, 01:40:41 AM
Quote from: jj2007 on March 15, 2017, 11:26:04 PM
Quote from: aw27 on March 15, 2017, 10:54:32 PM
You can't align stack (local) variables that way. align 16 does not work :eusa_naughty:

I see that. An intelligent assembler would throw a syntax error, though, instead of letting the coder believe that it aligns the variable.

Actually, in the code section, it aligns the code. :eusa_clap:
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 16, 2017, 01:55:28 AM
1. if by modern you mean > 2011 by default yes. There are two options here, either replace the prologue/epilogue for specific functions to be compatible with pre 2011 systems.. OR.. and this depends on how everyone feels, if by group consensus we feel this is useful, we can add a command line switch like -arch:sse/avx .. so you can specify which to prefer when auto-generating code. If people like that idea we can do that?

2. same as 1.

3. ?!?!?!? Any proc/function which is going to be called repeatedly, and which is designed to be optimised and potentially deal with millions of calls, is going to suffer badly from N*48 bytes worth of unnecessary memory overhead, especially in multi-threaded applications. Peak bandwidth for mem is about 25GB/s, let's say, on average.. that means without any other factors, call overhead etc. you've already capped yourself at 520 million calls per second, and this will get worse with more arguments or more complex types like matrices! Given 4c/8t on most desktops I routinely hit 1 to 1.5 billion calls per second on these sorts of vector/matrix functions (but you have to keep stuff in registers). Now with Ryzen and the 8c/16t hedt cpus that would be even worse, and your memory would be limiting you from achieving 3 billion (+-) calls per second to only get 500 million.

another reference: http://www.codersnotes.com/notes/maths-lib-2016/
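
The 520 million figure in point 3 follows from dividing the assumed bandwidth by the bytes moved per call. As a C sketch under the post's own assumptions (25 billion bytes per second, every byte going to main memory; the function name is illustrative):

```c
/* Upper bound on calls per second if each call must move `bytes_per_call`
   through a memory system sustaining `bandwidth_bytes_per_s`. This mirrors
   the post's assumption that all of the traffic reaches main memory. */
static double max_calls_per_second(double bandwidth_bytes_per_s,
                                   double bytes_per_call)
{
    return bandwidth_bytes_per_s / bytes_per_call;
}
```

With 25e9 bytes per second and 48 bytes per call this gives roughly 520 million calls per second; with 56 bytes per call it drops to roughly 446 million.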
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: GoneFishing on March 16, 2017, 02:51:25 AM
Quote from: johnsa on March 16, 2017, 01:55:28 AM
...
another reference: http://www.codersnotes.com/notes/maths-lib-2016/

Interesting blog  :t
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 16, 2017, 02:52:28 AM
Quote
3. ?!?!?!? Any proc/function which is going to be called repeatedly, and which is designed to be optimised and potentially deal with millions of calls, is going to suffer badly from N*48 bytes worth of unnecessary memory overhead, especially in multi-threaded applications. Peak bandwidth for mem is about 25GB/s, let's say, on average.. that means without any other factors, call overhead etc. you've already capped yourself at 520 million calls per second, and this will get worse with more arguments or more complex types like matrices! Given 4c/8t on most desktops I routinely hit 1 to 1.5 billion calls per second on these sorts of vector/matrix functions (but you have to keep stuff in registers). Now with Ryzen and the 8c/16t hedt cpus that would be even worse, and your memory would be limiting you from achieving 3 billion (+-) calls per second to only get 500 million.

This is bullsh%&t, you are not taking into account the work involved in loading the xmm registers before the call. Take as an example the function
XMMatrixMultiply from the DirectXMath library (XMMATRIX XMMatrixMultiply(FXMMATRIX M1, CXMMATRIX M2)).
Look at the vector call:
000000013F6AB12E  movaps      xmm0,xmmword ptr [__xmm@412e66664131999a412666663e4ccccd (13F6D08F0h)] 
000000013F6AB135  lea         rdx,[rbp+40h] 
000000013F6AB139  movaps      xmm2,xmmword ptr [__xmm@3fc000004129999a4083333340c9999a (13F6D06D0h)] 
000000013F6AB140  movaps      xmm1,xmm14 
000000013F6AB144  movaps      xmm3,xmmword ptr [__xmm@408ccccd411b33333fd9999a40400000 (13F6D0830h)] 
000000013F6AB14B  movups      xmm6,xmmword ptr [rax] 
000000013F6AB14E  movups      xmm7,xmmword ptr [rax+10h] 
000000013F6AB152  movups      xmm8,xmmword ptr [rax+20h] 
000000013F6AB157  movups      xmm9,xmmword ptr [rax+30h] 
000000013F6AB15C  call        XMMatrixMultiply (13F6A3330h) 

Too bad: we do have to load 8 XMM registers before the call; they are not loaded by a miracle :(
Then we could do the same inside the ASM routine!

Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 16, 2017, 03:47:50 AM
Quote from: aw27 on March 16, 2017, 01:40:41 AM
Actually, in the code section, it aligns the code. :eusa_clap:

Yes, I've used that occasionally (100+ times in one of my bigger sources). Point is that I never tried an ALIGN between LOCAL statements, and it simply does not do what you intuitively expect it to do:
MyTest proc
LOCAL MyDword:DWORD
ALIGN 16
LOCAL MyXmmword:XMMWORD


Now this looks suspiciously similar to...
.data
align 16
somevar dd ?


... but it doesn't align the local variable. Instead, it aligns the code and inserts nop word ptr [rax + rax] before the push rbp 8)

Strictly speaking, it does what the manual says. Only that no sane person would put the align 16 between local statements.

Btw we are all good friends here. No need to explain the open sauce etiquette to anybody (and I mean both sides :P)
While I don't need all that sophisticated stuff in HJWasm (and AsmC), I am watching with awe that these guys do develop their babies, thus challenging Micros**t's CripplewareTM assembler.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 16, 2017, 04:58:04 AM
Quote
Btw we are all good friends here. No need to explain the open sauce etiquette to anybody (and I mean both sides :P)
While I don't need all that sophisticated stuff in HJWasm (and AsmC), I am watching with awe that these guys do develop their babies, thus challenging Micros**t's CripplewareTM assembler.
Yes, and in spite of all this I am still interested in HJWASM because, from what I read, it should handle better the INVOKE parameters in x64 code than JWasm. The first time I compiled with HJWasm I obtained smaller code and I thought it was due to that. Now, after the latest events I am not sure about the real reasons.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 16, 2017, 05:14:59 AM
Quote from: aw27 on March 16, 2017, 04:58:04 AM
it should handle better the INVOKE parameters in x64 code than JWasm. The first time I compiled with HJWasm I obtained smaller code

The x64 ABI is not exactly user-friendly; it took me some time to understand it. There are some tricks to get smaller code, and there are also some people who bark at you if you dare to favour size over speed. Perhaps it would help if you posted one or two examples where different coding styles make a difference for your project, size- or speed-wise. A lot can be done inside the PROLOGUE macro btw.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 16, 2017, 05:25:44 AM
Quote from: jj2007 on March 16, 2017, 05:14:59 AM
Quote from: aw27 on March 16, 2017, 04:58:04 AM
it should handle better the INVOKE parameters in x64 code than JWasm. The first time I compiled with HJWasm I obtained smaller code

The x64 ABI is not exactly user-friendly; it took me some time to understand it. There are some tricks to get smaller code, and there are also some people who bark at you if you dare to favour size over speed. Perhaps it would help if you posted one or two examples where different coding styles make a difference for your project, size- or speed-wise. A lot can be done inside the PROLOGUE macro btw.

I think it can be automated without manual tweaking of the PROLOGUE. The rules are only these:  ;)
; RCX, RDX, R8, R9 are used for integer and pointer arguments in that order left to right.
; XMM0, 1, 2, and 3 are used for floating point arguments.
; When used, an XMM register displaces the corresponding general register. For example, xmm2 displaces r8, which will then not be used to pass a parameter.
; Additional arguments are pushed on the stack right to left.
; Parameters less than 64 bits long are not zero extended; the high bits contain garbage.
; It is the caller's responsibility to allocate 32 bytes of "shadow space" (for storing RCX, RDX, R8, and R9 if needed) before calling the function.
; It is the caller's responsibility to clean the stack after the call.
; Integer return values (similar to x86) are returned in RAX if 64 bits or less. Pointer to small type are returned in RAX.
; Floating point return values are returned in XMM0.
; Larger return values (structs) have space allocated on the stack by the caller, and RCX then contains a pointer to the return space when the callee is called. Register usage for integer parameters is then pushed one to the right. RAX returns this address to the caller.
; The stack is 16-byte aligned. The "call" instruction pushes an 8-byte return address, so all non-leaf functions must adjust the stack by a value of the form 16n+8 when allocating stack space.
; Registers RAX, RCX, RDX, R8, R9, R10, and R11 are considered volatile and must be considered destroyed on function calls.
; RBX, RBP, RDI, RSI, R12, R13, R14, and R15 must be saved in any function using them.
; xmm0 to xmm5 are volatile

For example, I would like to be able to do something like INVOKE Func, xmm0, xmm1, xmm2, r9, but that is not possible with JWasm. I would have to call something like INVOKE Func, rcx, rdx, r8, r9 even though rcx, rdx and r8 are not used in that call.
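
The register rules in the list above are positional: each of the first four arguments owns one slot, and the argument's kind picks either the general register or the XMM register for that slot. A C sketch of that mapping (the type and function names are invented for illustration and do not come from any assembler):

```c
#include <string.h> /* size_t, NULL */

typedef enum { ARG_INTEGER, ARG_FLOAT } arg_kind;

/* Returns the register that carries argument number `position` (0-based)
   under the Windows x64 convention described above, or NULL when the
   argument is passed on the stack instead. */
static const char *arg_register(size_t position, arg_kind kind)
{
    static const char *const integer_regs[4] = { "rcx", "rdx", "r8", "r9" };
    static const char *const float_regs[4]   = { "xmm0", "xmm1", "xmm2", "xmm3" };

    if (position >= 4)
        return NULL;
    return kind == ARG_FLOAT ? float_regs[position] : integer_regs[position];
}
```

For three floating-point arguments followed by one integer argument, this yields xmm0, xmm1, xmm2 and r9, which is the register set named in the INVOKE wish above.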

Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 16, 2017, 06:08:01 AM
There are quite a few things we do in the prologue/epilogue generation which will make it shorter/faster than jwasm..

avoiding some of the pointless code that jwasm generated with things like add rsp,0 or sub rsp,0
replacing some zero'ing of values in invoke with xor
making sure that registers that are already set correctly aren't updated
re-using zero value to fill in nulls and others in invoke

Inside the proc itself we also have some smart logic that applies when you use stackbase rsp and win64:11: we re-use unused param space for USES registers, we only store things that are actually used/referenced in the proc, and using rsp as the base frees up rbp for general use and shortens the code a bit more too.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 16, 2017, 06:08:42 AM
Quote from: aw27 on March 16, 2017, 05:25:44 AM
I would like to be able to do something like INVOKE Func, xmm0, xmm1, xmm2, r9

Interesting. What would your arglist look like, and what would you expect under the hood? Would MyFunc move xmm0 into the stack, or use it directly?

MyFunc proc arg1:???, arg2:???, arg3:???, arg4

Right now, my jinvoke chokes on that one, but it's a macro... you can do almost everything with a macro ;-)

@johnsa: As in 32-bit code, [rbp+x] is one byte shorter. Not sure whether stackbase rsp provides any real advantage, given that you have many more general purpose registers at hand... except perhaps for very short procs, but then inlining would be the faster option.

000000014000102C | 48 8B 45 64                                      | mov rax, qword ptr ss:[rbp+64]                |
0000000140001030 | 48 8B 44 24 64                                   | mov rax, qword ptr ss:[rsp+64]                |
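
The extra byte comes from the SIB byte that an rsp base always requires. A small Python sketch that rebuilds the two encodings above by hand (the function name `encode_mov_rax_mem` is made up for illustration, and it only handles these two base registers with an 8-bit displacement):

```python
def encode_mov_rax_mem(base, disp8):
    """Encode `mov rax, qword ptr [base+disp8]` for base rbp or rsp."""
    base_codes = {"rbp": 5, "rsp": 4}
    rex_w, opcode = 0x48, 0x8B
    # mod=01 selects an 8-bit displacement, reg=0 selects rax.
    mod, reg, rm = 0b01, 0, base_codes[base]
    code = [rex_w, opcode, (mod << 6) | (reg << 3) | rm]
    if base == "rsp":
        # rm=100 does not mean "rsp"; it means "a SIB byte follows".
        # 0x24 = scale 1, no index, base rsp.
        code.append(0x24)
    code.append(disp8 & 0xFF)
    return bytes(code)


for base in ("rbp", "rsp"):
    encoded = encode_mov_rax_mem(base, 0x64)
    print(base, " ".join("%02X" % b for b in encoded), len(encoded), "bytes")
```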
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: hutch-- on March 16, 2017, 10:33:13 AM
Quote
For example, I would like to be able to do something like INVOKE Func, xmm0, xmm1, xmm2, r9 but it is not possible with JWasm. I would have to call something like INVOKE Func, rcx, rdx, r8, r9 even though rcx, rdx and r8 are not used in that call.
There appears to be some redundancy in this desire, why would you use an "invoke" call when there are no memory operands involved in the argument list ? You don't really want to use the stack as it involves writing to stack memory on call and at the procedure level the called proc must then translate the stack addresses back to different sized registers. Without the extra clutter the proposed form,

INVOKE Func, xmm0, xmm1, xmm2, r9

would simply be with the registers loaded with whatever required values,

call Func


Now given that in most instances the data for xmm, ymm registers must come from somewhere in the application and for performance reasons that memory must be aligned correctly, if you don't want to load different sized registers directly, you pass the addresses of the data items as 64 bit pointers in the normal manner.

The option of a macro something like "regcall" would also do the job but only again if you were performing the double process of loading registers first then calling the procedure.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: coder on March 16, 2017, 01:34:38 PM
In 64-bit assembly, the best calling convention is MOV + CALL. Can't go wrong with it   :icon_cool:
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 16, 2017, 03:50:52 PM
Quote from: coder on March 16, 2017, 01:34:38 PM
In 64-bit assembly, the best calling convention is MOV + CALL. Can't go wrong with it   :icon_cool:
I could not disagree more. :eusa_naughty:
The real problem is to align the stack, especially when you have a lot of parameters in your call. It is good to know that INVOKE does all the calculations for us.
Look at the following code where from proc1 you will set the 16 values of a 4x4 matrix, where the first row will be all 0.1, the second all 0.2, the third all 0.3 and the fourth all 0.4.


option frame:auto

TXMMATRIX struct
r0 XMMWORD ?
r1 XMMWORD ?
r2 XMMWORD ?
r3 XMMWORD ?
TXMMATRIX ends

_XMVECTORSET MACRO r, float1, float2, float3, float4
movss xmm0, float1
movss xmm1, float2
movss xmm2, float3
movss xmm3, float4

unpcklps xmm1,xmm3
unpcklps xmm2,xmm0
unpcklps xmm1, xmm2
lea r10, [rcx].r
movups XMMWORD ptr [r10], xmm1
ENDM

.code

XMMatrixSet proc public retVal:QWORD, dumbpar1:QWORD, dumbpar2:QWORD, dumbpar3:QWORD, mm03: REAL4, mm10: REAL4, mm11: REAL4, mm12: REAL4, mm13: REAL4, mm20: REAL4, mm21: REAL4, mm22: REAL4, mm23: REAL4, mm30: REAL4, mm31: REAL4, mm32: REAL4, mm33: REAL4

        movss xmm0, mm03
unpcklps xmm2,xmm0
unpcklps xmm3,xmm1
unpcklps xmm2, xmm3
ASSUME rcx : ptr TXMMATRIX
lea r10, [rcx].r0
movups XMMWORD ptr [r10], xmm2
_XMVECTORSET r1, mm10, mm11, mm12, mm13
_XMVECTORSET r2, mm20, mm21, mm22, mm23
_XMVECTORSET r3, mm30, mm31, mm32, mm33
ASSUME rcx : NOTHING
mov rax, rcx
ret
XMMatrixSet endp

proc1 proc public
LOCAL M : TXMMATRIX
mov eax, 0.1
movd xmm1, eax
movd xmm2, eax
movd xmm3, eax
INVOKE XMMatrixSet, addr M, rdx, r8, r9, 0.1, 0.2,0.2,0.2, 0.2, 0.3, 0.3, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4
; do other stuff
; ......
; end other stuff
ret
proc1 endp


Now do it by the MOV + CALL in order to compile to the same, i.e.:


proc1:
000000013F201726  push        rbp 
000000013F201727  mov         rbp,rsp 
000000013F20172A  sub         rsp,40h 
000000013F20172E  mov         eax,3DCCCCCDh 
000000013F201733  movd        xmm1,eax 
000000013F201737  movd        xmm2,eax 
000000013F20173B  movd        xmm3,eax 
000000013F20173F  sub         rsp,90h 
000000013F201746  lea         rcx,[rbp-40h] 
000000013F20174A  mov         dword ptr [rsp+20h],3DCCCCCDh 
000000013F201752  mov         dword ptr [rsp+28h],3E4CCCCDh 
000000013F20175A  mov         dword ptr [rsp+30h],3E4CCCCDh 
000000013F201762  mov         dword ptr [rsp+38h],3E4CCCCDh 
000000013F20176A  mov         dword ptr [rsp+40h],3E4CCCCDh 
000000013F201772  mov         dword ptr [rsp+48h],3E99999Ah 
000000013F20177A  mov         dword ptr [rsp+50h],3E99999Ah 
000000013F201782  mov         dword ptr [rsp+58h],3E99999Ah 
000000013F20178A  mov         dword ptr [rsp+60h],3E99999Ah 
000000013F201792  mov         dword ptr [rsp+68h],3ECCCCCDh 
000000013F20179A  mov         dword ptr [rsp+70h],3ECCCCCDh 
000000013F2017A2  mov         dword ptr [rsp+78h],3ECCCCCDh 
000000013F2017AA  mov         dword ptr [rsp+80h],3ECCCCCDh 
000000013F2017B5  call        XMMatrixSet (13F201690h) 
000000013F2017BA  add         rsp,90h 
000000013F2017C1  leave 
000000013F2017C2  ret 


No joy.

This example also shows how nice it would be to have the possibility to include the XMM registers directly in the INVOKE statement instead of including general purpose registers that will not be used at all in the callee, just placeholders.
This is something for the developers of HJWASM to think about. Prototypes and function declarations would need to be reviewed accordingly.
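
For anyone who wants to check the numbers in the listing above by hand, here is a rough Python sketch of the two calculations INVOKE did for this call: the size of the call area and the bit patterns of the REAL4 constants. The helper names are made up, and the size rule assumes rsp is already 16-byte aligned before the `sub rsp` (true here, after `push rbp` and `sub rsp,40h`).

```python
import struct


def float_bits(value):
    """Return the IEEE-754 single-precision bit pattern of value."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def call_area_size(arg_count):
    """One 8-byte slot per argument (never fewer than the 4 shadow
    slots), rounded up to the next multiple of 16."""
    slots = max(arg_count, 4)
    return (slots * 8 + 15) & ~15


# XMMatrixSet takes 17 parameters: 17 * 8 = 136, rounded up to 144.
print("sub rsp,%Xh" % call_area_size(17))
for value in (0.1, 0.2, 0.3, 0.4):
    print(value, "-> %08Xh" % float_bits(value))
```

This reproduces the `sub rsp,90h` and the four immediates stored at [rsp+20h] through [rsp+80h].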


Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 16, 2017, 07:05:23 PM
Quote from: hutch-- on March 16, 2017, 10:33:13 AM
There appears to be some redundancy in this desire, why would you use an "invoke" call when there are no memory operands involved in the argument list ? You don't really want to use the stack as it involves writing to stack memory on call and at the procedure level the called proc must then translate the stack addresses back to different sized registers. Without the extra clutter the proposed form,

INVOKE Func, xmm0, xmm1, xmm2, r9

would simply be with the registers loaded with whatever required values,

call Func


Now given that in most instances the data for xmm, ymm registers must come from somewhere in the application and for performance reasons that memory must be aligned correctly, if you don't want to load different sized registers directly, you pass the addresses of the data items as 64 bit pointers in the normal manner.

The option of a macro something like "regcall" would also do the job but only again if you were performing the double process of loading registers first then calling the procedure.

I tried to explain in my previous message that the great usefulness of the INVOKE, for me at least, is to take care of all the stack adjustments for us. I know that some people do such calculations very easily by themselves. Those can obviously work as you suggest.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 16, 2017, 07:22:50 PM
Quote from: jj2007 on March 16, 2017, 06:08:42 AM
Would MyFunc move xmm0 into the stack, or use it directly?

xmm0 will not go to the stack, is always passed as is.

My suggestion to use xmm registers in an INVOKE statement would involve 2 possibilities:
1) Just place them there in place of dummy placeholder parameters.
2) Load the xmm registers from whatever you put on the INVOKE command line. This is actually what the INVOKE already does for parameters that go into general purpose registers.

BTW, since you are a specialist in macros you know that macros can take xmm registers as parameters.
Well, thinking better, macros can take literally everything as parameters.  ::)
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 16, 2017, 08:08:39 PM
I tend to agree with you on this.. given that fastcall for x64 abi specifies floating point operands are passed in xmm() regs, I can think of no reason not to modify the invoke handling to allow xmm registers to be used as parameters in these positions, once again with the same optimisation applied that we use elsewhere, if the reg's are already in the right order nothing happens, if not it does the corresponding movaps into the register from that specified.

It doesn't really make any difference generated code wise to MOV+CALL, but i personally like having this stuff kept clean with argument checking and going via invoke just makes it all more tidy.

So that said, unless someone disagrees.. I will implement this change to hjwasm.

So we have quite a few things on the list now worthy to promote it as 2.21.. the list of changes are:

1) Fix aw27's bug with sub rsp,8
2) Double check local alignments to 16
3) Add arch flag to allow generated code to use either sse or avx
4) Support xmm reg type arguments to invoke in fastcall x64

Unfortunately this means we'll push out the changes we had planned for 2.21 (union initialization enhancement and string literals in invoke / data declaration for both ascii and unicode) to 2.22
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: coder on March 16, 2017, 08:37:46 PM
Quote from: aw27 on March 16, 2017, 03:50:52 PM
I could not disagree more. :eusa_naughty:
The real problem is to align the stack, especially when you have a lot of parameters in your call. It is good to know that INVOKE does all the calculations for us.
Look at the following code where from proc1 you will set the 16 values of a 4x4 matrix, where the first row will be all 0.1, the second all 0.2, the third all 0.3 and the fourth all 0.4.

It is not about aligning the stack. IMHO, what you need exactly is custom-built PROC and INVOKE for your own specific needs because AFAIK, there's no single PROC/INVOKE set out there that has the fits-all capability when dealing with uneven parameters. Not even from the likes of NASM and FASM. Of course it can be done with macros but the overhead may outweigh its benefits.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 16, 2017, 08:47:19 PM
we've tried to get as close to that as possible with hjwasm, especially using stackbase:rsp / option win64:11
I think with this addition of xmm regs to invoke for float arguments and it should be pretty much bang on. It tries to make the invoke/prologue/epilogue generation as optimal as possible and deal with removing anything unused or not needed while supporting all the many combinations.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: coder on March 16, 2017, 08:59:27 PM
Quote from: johnsa on March 16, 2017, 08:47:19 PM
we've tried to get as close to that as possible with hjwasm, especially using stackbase:rsp / option win64:11
I think with this addition of xmm regs to invoke for float arguments and it should be pretty much bang on. It tries to make the invoke/prologue/epilogue generation as optimal as possible and deal with removing anything unused or not needed while supporting all the many combinations.

In other words, HJWASM is trying to anticipate all other custom needs of the users. For how long and how far can you go with it? That beats one design idea of MS 64-ABI - that most of the pre-entry works (alignment, saving volatiles etc) are the responsibility of the user codes / callers and not the modules. Excessive wrappings and abstracting may jeopardize stability and portability in the long run. Have nothing against HJWASM. You guys are doing great job, but the limit must be set somewhere.

Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 16, 2017, 09:02:21 PM
Quote from: coder on March 16, 2017, 08:37:46 PM
It is not about aligning the stack. IMHO, what you need exactly is custom-built PROC and INVOKE for your own specific needs because AFAIK, there's no single PROC/INVOKE set out there that has the fits-all capability when dealing with uneven parameters. Not even from the likes of NASM and FASM. Of course it can be done with macros but the overhead may outweigh its benefits.

Except for XMM registers, the INVOKE covers pretty much all the possibilities. I can't recall any other case it does not for the x64 ABI, cdecl, stdcall and pascal calling conventions.
INVOKE is useless for the Borland calling convention, which I use a lot, which is a variation of the Pascal calling convention with 3 registers used to pass data.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 16, 2017, 09:06:23 PM
Quote from: johnsa on March 16, 2017, 08:08:39 PM
So we have quite a few things on the list now worthy to promote it as 2.21.. the list of changes are:

1) Fix aw27's bug with sub rsp,8
2) Double check local alignments to 16
3) Add arch flag to allow generated code to use either sse or avx
4) Support xmm reg type arguments to invoke in fastcall x64

Looks like a good plan!  :t
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 16, 2017, 09:18:25 PM
Quote from: coder on March 16, 2017, 08:59:27 PM
Quote from: johnsa on March 16, 2017, 08:47:19 PM
we've tried to get as close to that as possible with hjwasm, especially using stackbase:rsp / option win64:11
I think with this addition of xmm regs to invoke for float arguments and it should be pretty much bang on. It tries to make the invoke/prologue/epilogue generation as optimal as possible and deal with removing anything unused or not needed while supporting all the many combinations.

In other words, HJWASM is trying to anticipate all other custom needs of the users. For how long and how far can you go with it? That beats one design idea of MS 64-ABI - that most of the pre-entry works (alignment, saving volatiles etc) are the responsibility of the user codes / callers and not the modules. Excessive wrappings and abstracting may jeopardize stability and portability in the long run. Have nothing against HJWASM. You guys are doing great job, but the limit must be set somewhere.

It shouldn't really be a problem because most combinations are deterministic and usually obvious

step #1: follow the ABI
step #2: allow invoke to be a bit "nicer" to use by allowing immediates, literal strings, direct register arguments etc..
step #3: ensure invoke generates optimal code, use xor, re-use repeated arguments, avoid pointless copies ie: regA -> regA
(so-far there is no reason for anything to change under any use condition)
step #4: using other win64 modes you can customise how the procs behave and how epilogue/prologue is generated (so at this point you have complete control to do what you want when required.. but the hope is that the default mode is ideal)

For example, a main reason for setting up a "raw" proc or a custom prologue would be to optimise the generated code to avoid home-space copies, ensure alignments etc.
There is no reason the "default" prologue generator shouldn't be able to cope with that on its own.. which is our approach.. if the argument isn't referenced, don't store it.. etc
for example if in your proc you refer to the arguments via their registers and not the argument by name, then it won't generate the bloat in the prologue.. which is one of the main reasons for having a "raw" proc.

But we keep the option so that can always be over-ridden if required.

Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 16, 2017, 11:50:25 PM
Quote from: aw27 on March 16, 2017, 07:22:50 PM
Quote from: jj2007 on March 16, 2017, 06:08:42 AM
Would MyFunc move xmm0 into the stack, or use it directly?

xmm0 will not go to the stack, is always passed as is.

My suggestion to use xmm registers in an INVOKE statement would involve 2 possibilities:
1) Just place them there in place of dummy placeholder parameters.
2) Load the xmm registers from whatever you put on the INVOKE command line. This is actually what the INVOKE already does for parameters that go into general purpose registers.

BTW, since you are a specialist in macros you know that macros can take xmm registers as parameters.
Well, thinking better, macros can take literally everything as parameters.  ::)

I've given option 1) a try:
include \Masm32\MasmBasic\Res\JBasic.inc ; ## console demo, builds in 32- or 64-bit mode with ML, AsmC, JWasm, HJWasm ##
.code
MyFunc proc <cb xm3> _xmm1, _xmm2, _xmm3, arg4, arg5, arg6 ; callback, 3 xmm regs passed
  nop
  usedeb=1
  deb 4, "MyFunc, original xmm regs", o:xmm1, o:xmm2, o:xmm3 ; o: means oword size
  deb 4, "MyFunc, normal args in stack", x:arg4, x:arg5, x:arg6 ; x: means hex
  usedeb=0
  ret
MyFunc endp

Init ; OPT_64 1 ; put 0 for 32 bit, 1 for 64 bit assembly
  movaps xmm1, Oword16(1A1111111B1111111C1111111D111111h)
  movaps xmm2, Oword16(2A2222222B2222222C2222222D222222h)
  movaps xmm3, Oword16(3A3333333B3333333C3333333D333333h)
;   int 3
  jinvoke MyFunc, xmm1, xmm2, xmm3, 44444444h, 55555555h, 66666666h
  Inkey Chr$("This code was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format")
EndOfCode


Output:
MyFunc, original xmm regs
o:xmm1  1a111111 1b111111 1c111111 1d111111h
o:xmm2  2a222222 2b222222 2c222222 2d222222h
o:xmm3  3a333333 3b333333 3c333333 3d333333h
MyFunc, normal args in stack
x:arg4  44444444h
x:arg5  55555555h
x:arg6  66666666h
This code was assembled with ml64 in 64-bit format


Builds also as 32-bit code with HJWasm & friends. Under the hood (the 64-bit version):
00000001400018E2 | CC                            | int3                                          |
00000001400018E3 | 41 BA 66 66 66 66             | mov r10d, 66666666                            |
00000001400018E9 | 4C 89 54 24 28                | mov qword ptr ss:[rsp+28], r10                |
00000001400018EE | 41 BA 55 55 55 55             | mov r10d, 55555555                            |
00000001400018F4 | 4C 89 54 24 20                | mov qword ptr ss:[rsp+20], r10                |
00000001400018F9 | 41 B9 44 44 44 44             | mov r9d, 44444444                             | r9d:"@(\\w"
00000001400018FF | E8 FE F6 FF FF                | call 140001002                                |

...
0000000140001002 | 55                            | push rbp                                      |
0000000140001003 | 48 8B EC                      | mov rbp, rsp                                  |
0000000140001006 | 4C 89 4D 28                   | mov qword ptr ss:[rbp+28], r9                 |
000000014000100A | 48 81 EC 90 00 00 00          | sub rsp, 90                                   |
0000000140001011 | 90                            | nop                                           |
0000000140001012 | E8 49 0A 00 00                | call <jdebP>         (the deb macro)
...
00000001400018C2 | C9                            | leave                                         |
00000001400018C3 | C3                            | ret                                           |


The number of xmm regs passed can be 1-4, in this case 3: see <cb xm3>

This is what can be done with macros. Your option 2) is more difficult to realise, because the PROLOG macro doesn't give you the list of arguments. It could probably be done in HJWasm itself, though. My example assembles even with ML64, but at that point, one could drop support for that one.

Project attached, requires MasmBasic of today (http://masm32.com/board/index.php?topic=94.0).
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 17, 2017, 12:16:09 AM
Just to keep you all updated,

There is a new branch (v2.21) on our git repository.

I've already completed all the arch sse/avx stuff and that is in that branch and tested.

You can now use:

OPTION ARCH:SSE
OPTION ARCH:AVX

or the command line switches -archSSE and -archAVX

and ANY code that is generated and uses any of : movss, movsd, movd, movq, movaps, movdqa, movdqu, movups or their avx counterparts will now be substituted with the correct version for either AVX or SSE.

AVX is set as the default.

I'm now working on adding the XMM arguments to INVOKE, so that should be done soon while Habran is investigating the stack alignment issues.

Cheers
John

Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 17, 2017, 01:54:25 AM
Quote from: jj2007 on March 16, 2017, 11:50:25 PM
This is what can be done with macros. Your option 2) is more difficult to realise, because the PROLOG macro doesn't give you the list of arguments. It could probably be done in HJWasm itself, though. My example assembles even with ML64, but at that point, one could drop support for that one.
Project attached, requires MasmBasic of today (http://masm32.com/board/index.php?topic=94.0).

I was very curious to see the MasmBasic in action but after installing and running ml64, I just got errors. Am I missing something?

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Assembling: PassXmmRegs.asm
h:\Masm32\MasmBasic\Res\JBasic.inc(47) : error A2008:syntax error : .
h:\Masm32\MasmBasic\Res\JBasic.inc(48) : error A2008:syntax error : .
h:\Masm32\MasmBasic\Res\JBasic.inc(49) : error A2008:syntax error : .
*** using Res\JBasic32.lib ***
h:\Masm32\MasmBasic\Res\JBasic.inc(62) : error A2008:syntax error : rax
h:\Masm32\MasmBasic\Res\JBasic.inc(63) : error A2008:syntax error : rcx
h:\Masm32\MasmBasic\Res\JBasic.inc(64) : error A2008:syntax error : rdx
h:\Masm32\MasmBasic\Res\JBasic.inc(65) : error A2008:syntax error : rsi
h:\Masm32\MasmBasic\Res\JBasic.inc(66) : error A2008:syntax error : rdi
h:\Masm32\MasmBasic\Res\JBasic.inc(67) : error A2008:syntax error : rbx
h:\Masm32\MasmBasic\Res\JBasic.inc(68) : error A2008:syntax error : rbp
h:\Masm32\MasmBasic\Res\JBasic.inc(69) : error A2008:syntax error : rsp
h:\Masm32\MasmBasic\Res\JBasic.inc(558) : fatal error A1000:cannot open file : \Masm32\MasmBasic\Res\DualWin.inc

Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 17, 2017, 02:04:13 AM
update..

sub rsp,8 rogue item from aw27 fixed.
stack alignment to 16 working for win64:6 and 7 and 15 modes... fixed..

Busy adding xmm support to invoke now ;)

If you're all really lucky and I don't need to take a nap.. 2.21 might still be released today :)
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 17, 2017, 02:45:04 AM
Quote from: aw27 on March 17, 2017, 01:54:25 AM
Am I missing something?
...
fatal error A1000:cannot open file : \Masm32\MasmBasic\Res\DualWin.inc

Open the PassXmmRegs.asc in \Masm32\MasmBasic\RichMasm.exe and hit F6; the editor should show you a MessageBox "JBasic installed" - did you see that one?

Afterwards, you should have two new files:
C:\Masm32\MasmBasic\Res\pt.inc
C:\Masm32\MasmBasic\Res\DualWin.inc

I just tested with a fresh installation, and on C: something blocks the download of HJWasm32.
However, \Masm32\bin\ml64.exe should work fine, too - just insert OPT_Assembler ML under the EndOfCode.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 17, 2017, 03:21:22 AM
Quote from: jj2007 on March 17, 2017, 02:45:04 AM
Open the PassXmmRegs.asc in \Masm32\MasmBasic\RichMasm.exe and hit F6; the editor should show you a MessageBox "JBasic installed" - did you see that one?
Yes, I see.
Quote
Afterwards, you should have two new files:
C:\Masm32\MasmBasic\Res\pt.inc
C:\Masm32\MasmBasic\Res\DualWin.inc
I don't see those. I just noticed a batch file bldallRM.bat
I run it and get:

**** 64-bit assembly ****


*** Assemble, link and run PassXmmRegs ***

*** Assemble using \masm32\bin\ml64 /c /Zp8  tmp_file.asm ***
Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Assembling: tmp_file.asm
MASM : fatal error A1000:cannot open file : tmp_file.asm
*** Assembly Error ***
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 17, 2017, 03:55:03 AM
xmm registers are now supported in invoke for real4/real8 types.



newproc proc FRAME arg1:qword, arg2:real4
movss xmm3,arg2
ret
newproc endp

invoke newproc, rax, xmm4                     ; xmm1 set to xmm4 via movaps
invoke newproc, rax, floatVar                   ; xmm1 set from mem via movd
invoke newproc, rax, xmm1 ; no op, as xmm1 == xmm1
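
The three cases in the comments above boil down to a small decision. Here it is as a Python sketch; this is only an illustration of the rule as described in this post, not the HJWasm source. The function name is made up, it handles REAL4 only, and it ignores the OPTION ARCH setting.

```python
def xmm_arg_setup(position, source):
    """Return the instruction needed to place a REAL4 argument that is
    passed at the given zero-based position, or None if nothing is
    needed. Only valid for the first four (register) positions."""
    target = "xmm%d" % position
    if source == target:
        # Register is already where the callee expects it.
        return None
    if source.startswith("xmm"):
        # Register to register copy.
        return "movaps %s, %s" % (target, source)
    # Anything else is treated as a memory operand.
    return "movd %s, %s" % (target, source)


for src in ("xmm4", "floatVar", "xmm1"):
    print("invoke newproc, rax, %-8s ->" % src, xmm_arg_setup(1, src))
```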

Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 17, 2017, 04:10:08 AM
Quote from: aw27 on March 17, 2017, 03:21:22 AM
MASM : fatal error A1000:cannot open file : tmp_file.asm

RichMasm deletes the tmp_file.asm after a successful build. When you hit F6 again in RichMasm, still no success?

Sorry for that - I sent you a PM.

Quote from: johnsa on March 17, 2017, 03:55:03 AM
xmm registers are now supported in invoke for real4/real8 types.

You should respect the speed limits, johnsa :eusa_naughty:
;)
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 17, 2017, 06:13:50 AM
Speed limit is more of a "suggestion" to me ;)

All going to plan it should still be finished tonight.
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 17, 2017, 03:21:50 PM
Quote from: johnsa on March 17, 2017, 03:55:03 AM
xmm registers are now supported in invoke for real4/real8 types.



newproc proc FRAME arg1:qword, arg2:real4
movss xmm3,arg2
ret
newproc endp

invoke newproc, rax, xmm4                     ; xmm1 set to xmm4 via movaps
invoke newproc, rax, floatVar                   ; xmm1 set from mem via movd
invoke newproc, rax, xmm1 ; no op, as xmm1 == xmm1



Looking forward to test it ASAP.  :icon14:
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: johnsa on March 17, 2017, 08:20:30 PM
Hi all,

We're so slow.. only ready now ;)

V2.21 is up on the site, we've run all our testing and regression scripts and all seems to be in order.
The list of changes is in the documentation and as discussed previously :

real4/real8 via xmm support on invoke
fixed align 16 on stack local
fixed rogue sub rsp,8
fixed rbp relative stack locations
added command line switches -archSSE, -archAVX and OPTION ARCH:AVX|SSE to switch between generating sse or avx instructions in generated code
refactored the entire code-base to make use of this arch setup instead of hardcoded vmovdqa/movdqu etc..

Give it a test :)
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 18, 2017, 12:02:15 AM
Quote from: johnsa on March 17, 2017, 08:20:30 PM
real4/real8 via xmm support on invoke
It works! Actually it always worked in JWASM as well and I never noticed. :greenclp:

I found no bugs so far, but will try harder. Compiles very fast and code is about 2% smaller.

Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 18, 2017, 12:23:01 AM
Quote from: aw27 on March 18, 2017, 12:02:15 AM
I found no bugs so far, but will try harder

That is the right spirit :icon_mrgreen:

So far everything fine with my big sources. MB assembles in
  2.2  AsmC
  2.8  HJWasm64
  3.2  HJWasm32
  7.0  Microsoft MASM 6.15 & 10.0

seconds. This is the first time the 64-bit version is faster. Did you test new compiler optimisations?
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 18, 2017, 01:42:14 AM
Quote from: jj2007 on March 18, 2017, 12:23:01 AM
Quote from: aw27 on March 18, 2017, 12:02:15 AM
I found no bugs so far, but will try harder

That is the right spirit :icon_mrgreen:

So far everything fine with my big sources. MB assembles in
  2.2  AsmC
  2.8  HJWasm64
  3.2  HJWasm32
  7.0  Microsoft MASM 6.15 & 10.0

seconds. This is the first time the 64-bit version is faster. Did you test new compiler optimisations?

DMath64.asm: 21545 lines, 8 passes, 411 ms, 0 warnings, 0 errors
I don't know what else to optimize  :icon_cool:
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: TWell on March 18, 2017, 02:08:14 AM
If DMath64.asm belongs to a library and has many functions, COMDAT is useful for it, and [h]jwasm supports it.
(ml64 doesn't have it, so for that kind of one-file library the source must be split into several parts.)
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 18, 2017, 02:31:11 AM
Quote from: TWell on March 18, 2017, 02:08:14 AM
If DMath64.asm belongs to a library and has many functions, COMDAT is useful for it, and [h]jwasm supports it.
(ml64 doesn't have it, so for that kind of one-file library the source must be split into several parts.)
DMath is itself a library and does not depend on anything else, not even on the Windows API, I am afraid I can only optimize it by hand. :(
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: jj2007 on March 18, 2017, 03:21:43 AM
Quote from: aw27 on March 18, 2017, 01:42:14 AM
DMath64.asm: 21545 lines, 8 passes, 411 ms, 0 warnings, 0 errors

You have a fast machine then :biggrin:

RichMasm.asm: 18332 lines, 9 passes, 925 ms, 0 warnings, 0 errors (Intel Core i5 on Win7-64)
Title: Re: Crashes in HJWASM but works well in JWASM
Post by: aw27 on March 18, 2017, 04:05:54 AM
Quote from: jj2007 on March 18, 2017, 03:21:43 AM
Quote from: aw27 on March 18, 2017, 01:42:14 AM
DMath64.asm: 21545 lines, 8 passes, 411 ms, 0 warnings, 0 errors

You have a fast machine then :biggrin:

RichMasm.asm: 18332 lines, 9 passes, 925 ms, 0 warnings, 0 errors (Intel Core i5 on Win7-64)

I found it lazy though, it is an old Sandy-Bridge from 2013 with 6 cores/12 virtual CPUs. By default each application receives only a maximum of about 8% (100/12) of total power.