Crashes in HJWASM but works well in JWASM

Started by aw27, March 01, 2017, 04:37:40 AM


aw27

Quote from: jj2007 on March 16, 2017, 05:14:59 AM
Quote from: aw27 on March 16, 2017, 04:58:04 AM
it should handle better the INVOKE parameters in x64 code than JWasm. The first time I compiled with HJWasm I obtained smaller code

The x64 ABI is not exactly user-friendly; it took me some time to understand it. There are some tricks to get smaller code, and there are also some people who bark at you if you dare to favour size over speed. Perhaps it would help if you posted one or two examples where different coding styles make a difference for your project, size- or speed-wise. A lot can be done inside the PROLOGUE macro btw.

I think it can be automated without manual tweaking of the PROLOGUE. The rules are only these: ;)
; RCX, RDX, R8, R9 are used for integer and pointer arguments in that order left to right.
; XMM0, 1, 2, and 3 are used for floating point arguments.
; When used, an XMM register displaces the corresponding general register. For example, xmm2 displaces r8, which will then not be used to pass a parameter.
; Additional arguments are pushed on the stack right to left, so the fifth argument ends up just above the shadow space.
; Parameters less than 64 bits long are not zero extended; the high bits contain garbage.
; It is the caller's responsibility to allocate 32 bytes of "shadow space" (for storing RCX, RDX, R8, and R9 if needed) before calling the function.
; It is the caller's responsibility to clean the stack after the call.
; Integer return values (similar to x86) are returned in RAX if 64 bits or less. Pointer to small type are returned in RAX.
; Floating point return values are returned in XMM0.
; Larger return values (structs) have space allocated on the stack by the caller, and RCX then contains a pointer to the return space when the callee is called. Register usage for integer parameters is then pushed one to the right. RAX returns this address to the caller.
; The stack is 16-byte aligned. The "call" instruction pushes an 8-byte return address, so all non-leaf functions must adjust the stack by a value of the form 16n+8 when allocating stack space.
; Registers RAX, RCX, RDX, R8, R9, R10, and R11 are considered volatile and must be considered destroyed on function calls.
; RBX, RBP, RDI, RSI, R12, R13, R14, and R15 must be saved in any function using them.
; xmm0 to xmm5 are volatile
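
As a rough illustration of how these rules assign slots, here is a hand-written call with mixed integer and floating point arguments. It is only a sketch: MixedFunc, CallMixed, fval and dval are names made up for this example, and the syntax assumes a MASM-compatible 64-bit assembler such as ML64 or JWasm with -win64.

```asm
; Hypothetical callee: MixedFunc(a:QWORD, b:REAL4, c:QWORD, d:REAL8, e:QWORD)
.data
fval REAL4 1.5
dval REAL8 2.5

.code
MixedFunc proc                       ; stand-in body, only here so the example assembles
    ret
MixedFunc endp

CallMixed proc
    sub   rsp, 28h                   ; 20h shadow space + 8 for e; 28h has the form 16n+8
    mov   rcx, 1                     ; slot 1, integer -> RCX
    movss xmm1, fval                 ; slot 2, float   -> XMM1 (RDX is not used)
    mov   r8, 3                      ; slot 3, integer -> R8
    movsd xmm3, dval                 ; slot 4, double  -> XMM3 (R9 is not used)
    mov   qword ptr [rsp+20h], 5     ; slot 5 sits just above the shadow space
    call  MixedFunc
    add   rsp, 28h                   ; the caller cleans up
    ret
CallMixed endp
end
```

The register is picked by the position of the argument, not by counting integer and float arguments separately, which is why RDX and R9 stay idle here.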

For example, I would like to be able to do something like INVOKE Func, xmm0, xmm1, xmm2, r9 but it is not possible with JWasm. I would have to call something like INVOKE Func, rcx, rdx, r8, r9 even though rcx, rdx and r8 are not used in that call.


johnsa

There are quite a few things we do in the prologue/epilogue generation which will make it shorter/faster than jwasm..

avoiding some of the pointless code that jwasm generated with things like add rsp,0 or sub rsp,0
replacing some zero'ing of values in invoke with xor
making sure that registers that are already set correctly aren't updated
re-using zero value to fill in nulls and others in invoke
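
To picture the xor and zero re-use points in the list above, here is a hand-written guess at the difference for a call with two zero arguments. It is an illustrative fragment, not actual JWasm or HJWasm output; txt and cap are hypothetical byte strings, and the shadow space is assumed to be reserved already by the surrounding proc.

```asm
; Source line:
;   invoke MessageBoxA, 0, addr txt, addr cap, 0

; Straightforward expansion
    mov  rcx, 0          ; 7 bytes
    lea  rdx, txt
    lea  r8,  cap
    mov  r9,  0          ; 7 bytes
    call MessageBoxA

; Shorter expansion
    xor  ecx, ecx        ; 2 bytes; writing ECX also clears the upper half of RCX
    lea  rdx, txt
    lea  r8,  cap
    mov  r9d, ecx        ; 3 bytes; re-uses the zero that is already in ECX
    call MessageBoxA
```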

Inside the proc itself we also have some smart logic that kicks in when you use stackbase rsp and win64:11: we re-use unused param space for uses, and we only store things that are actually used/referenced in the proc. Using rsp as the base obviously also frees up rbp for general use and shortens the code a bit more.

jj2007

Quote from: aw27 on March 16, 2017, 05:25:44 AM
I would like to be able to do something like INVOKE Func, xmm0, xmm1, xmm2, r9

Interesting. What would your arglist look like, and what would you expect under the hood? Would MyFunc move xmm0 into the stack, or use it directly?

MyFunc proc arg1:???, arg2:???, arg3:???, arg4

Right now, my jinvoke chokes on that one, but it's a macro... you can do almost everything with a macro ;-)

@johnsa: As in 32-bit code, [rbp+x] is one byte shorter. Not sure whether stackbase rsp provides any real advantage, given that you have many more general purpose registers at hand... except perhaps for very short procs, but then inlining would be the faster option.

000000014000102C | 48 8B 45 64                                      | mov rax, qword ptr ss:[rbp+64]                |
0000000140001030 | 48 8B 44 24 64                                   | mov rax, qword ptr ss:[rsp+64]                |

hutch--

Quote
For example, I would like to be able to do something like INVOKE Func, xmm0, xmm1, xmm2, r9 but it is not possible with JWasm. I would have to call something like INVOKE Func, rcx, rdx, r8, r9 even though rcx, rdx and r8 are not used in that call.
There appears to be some redundancy in this desire. Why would you use an "invoke" call when there are no memory operands involved in the argument list? You don't really want to use the stack, as it involves writing to stack memory on call, and at the procedure level the called proc must then translate the stack addresses back to different sized registers. Without the extra clutter the proposed form,

INVOKE Func, xmm0, xmm1, xmm2, r9

would simply be with the registers loaded with whatever required values,

call Func


Now given that in most instances the data for xmm, ymm registers must come from somewhere in the application and for performance reasons that memory must be aligned correctly, if you don't want to load different sized registers directly, you pass the addresses of the data items as 64 bit pointers in the normal manner.

The option of a macro something like "regcall" would also do the job but only again if you were performing the double process of loading registers first then calling the procedure.
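
A minimal sketch of the "pass the addresses as 64 bit pointers" approach described above, written as plain register loads followed by CALL. AddVec4, Caller, vecA and vecB are names invented for this example; the syntax assumes ML64 or JWasm with -win64.

```asm
.data
align 16                             ; movaps below needs 16-byte aligned operands
vecA REAL4 1.0, 2.0, 3.0, 4.0        ; 16 bytes, so vecB that follows is aligned as well
vecB REAL4 5.0, 6.0, 7.0, 8.0

.code
; Hypothetical leaf callee: rcx -> first vector, rdx -> second vector.
; Adds the two vectors and writes the sum back to the first one.
AddVec4 proc
    movaps xmm0, XMMWORD PTR [rcx]
    addps  xmm0, XMMWORD PTR [rdx]
    movaps XMMWORD PTR [rcx], xmm0
    ret
AddVec4 endp

Caller proc
    sub  rsp, 28h                    ; shadow space plus 8 to restore 16-byte alignment
    lea  rcx, vecA                   ; first pointer argument
    lea  rdx, vecB                   ; second pointer argument
    call AddVec4
    add  rsp, 28h
    ret
Caller endp
end
```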

coder

In 64-bit assembly, the best calling convention is MOV + CALL. Can't go wrong with it   :icon_cool:

aw27

Quote from: coder on March 16, 2017, 01:34:38 PM
In 64-bit assembly, the best calling convention is MOV + CALL. Can't go wrong with it   :icon_cool:
I could not disagree more. :eusa_naughty:
The real problem is aligning the stack, especially when you have a lot of parameters in your call. It is good to know that INVOKE does all the calculations for us.
Look at the following code, where proc1 sets the 16 values of a 4x4 matrix: the first row will be all 0.1, the second all 0.2, the third all 0.3 and the fourth all 0.4.


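option frame:auto

; 4x4 matrix of REAL4 values, stored as four 16-byte rows
TXMMATRIX struct
    r0 XMMWORD ?
    r1 XMMWORD ?
    r2 XMMWORD ?
    r3 XMMWORD ?
TXMMATRIX ends

; Loads four REAL4 values, packs them into one XMM register
; and stores the result to row r of the TXMMATRIX that rcx points to
_XMVECTORSET MACRO r, float1, float2, float3, float4
    movss xmm0, float1
    movss xmm1, float2
    movss xmm2, float3
    movss xmm3, float4

    unpcklps xmm1, xmm3
    unpcklps xmm2, xmm0
    unpcklps xmm1, xmm2
    lea r10, [rcx].r
    movups XMMWORD ptr [r10], xmm1
ENDM

.code

; retVal arrives in rcx. dumbpar1..dumbpar3 are only placeholders for the slots
; whose values actually arrive in xmm1..xmm3; the other 13 values are on the stack.
XMMatrixSet proc public retVal:QWORD, dumbpar1:QWORD, dumbpar2:QWORD, dumbpar3:QWORD, mm03:REAL4, mm10:REAL4, mm11:REAL4, mm12:REAL4, mm13:REAL4, mm20:REAL4, mm21:REAL4, mm22:REAL4, mm23:REAL4, mm30:REAL4, mm31:REAL4, mm32:REAL4, mm33:REAL4

    ; first row: three values from xmm1..xmm3, the fourth from the stack
    movss xmm0, mm03
    unpcklps xmm2, xmm0
    unpcklps xmm3, xmm1
    unpcklps xmm2, xmm3
    ASSUME rcx : ptr TXMMATRIX
    lea r10, [rcx].r0
    movups XMMWORD ptr [r10], xmm2
    ; remaining rows come entirely from stack arguments
    _XMVECTORSET r1, mm10, mm11, mm12, mm13
    _XMVECTORSET r2, mm20, mm21, mm22, mm23
    _XMVECTORSET r3, mm30, mm31, mm32, mm33
    ASSUME rcx : NOTHING
    mov rax, rcx                ; return the matrix pointer
    ret
XMMatrixSet endp

proc1 proc public
    LOCAL M : TXMMATRIX
    ; the three first-row values that travel in xmm1..xmm3
    mov eax, 0.1
    movd xmm1, eax
    movd xmm2, eax
    movd xmm3, eax
    ; rdx, r8 and r9 are only placeholders for the slots taken by xmm1..xmm3
    INVOKE XMMatrixSet, addr M, rdx, r8, r9, 0.1, 0.2, 0.2, 0.2, 0.2, 0.3, 0.3, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4
    ; do other stuff
    ; ......
    ; end other stuff
    ret
proc1 endp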


Now do it with MOV + CALL so that it compiles to the same code, i.e.:


proc1:
000000013F201726  push        rbp 
000000013F201727  mov         rbp,rsp 
000000013F20172A  sub         rsp,40h 
000000013F20172E  mov         eax,3DCCCCCDh 
000000013F201733  movd        xmm1,eax 
000000013F201737  movd        xmm2,eax 
000000013F20173B  movd        xmm3,eax 
000000013F20173F  sub         rsp,90h 
000000013F201746  lea         rcx,[rbp-40h] 
000000013F20174A  mov         dword ptr [rsp+20h],3DCCCCCDh 
000000013F201752  mov         dword ptr [rsp+28h],3E4CCCCDh 
000000013F20175A  mov         dword ptr [rsp+30h],3E4CCCCDh 
000000013F201762  mov         dword ptr [rsp+38h],3E4CCCCDh 
000000013F20176A  mov         dword ptr [rsp+40h],3E4CCCCDh 
000000013F201772  mov         dword ptr [rsp+48h],3E99999Ah 
000000013F20177A  mov         dword ptr [rsp+50h],3E99999Ah 
000000013F201782  mov         dword ptr [rsp+58h],3E99999Ah 
000000013F20178A  mov         dword ptr [rsp+60h],3E99999Ah 
000000013F201792  mov         dword ptr [rsp+68h],3ECCCCCDh 
000000013F20179A  mov         dword ptr [rsp+70h],3ECCCCCDh 
000000013F2017A2  mov         dword ptr [rsp+78h],3ECCCCCDh 
000000013F2017AA  mov         dword ptr [rsp+80h],3ECCCCCDh 
000000013F2017B5  call        XMMatrixSet (13F201690h) 
000000013F2017BA  add         rsp,90h 
000000013F2017C1  leave 
000000013F2017C2  ret 


No joy.

This example also shows how nice it would be to have the possibility to include the XMM registers directly in the INVOKE statement instead of including general purpose registers that will not be used at all in the callee, just placeholders.
This is something for the developers of HJWASM to think about. Prototypes and function declarations would need to be reviewed accordingly.
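
For reference, the "sub rsp,90h" in the listing above follows from simple arithmetic that INVOKE does behind the scenes. A sketch with MASM equates (NARGS, SLOT and ARGAREA are names made up for this illustration; the fragment assumes rsp is already 16-byte aligned, as it is in proc1 after its prologue):

```asm
NARGS   equ 17                                ; arguments in the INVOKE XMMatrixSet line
SLOT    equ 8                                 ; every argument occupies one 8-byte slot
ARGAREA equ (NARGS * SLOT + 15) AND (NOT 15)  ; 88h rounded up to a multiple of 16 = 90h

    sub  rsp, ARGAREA                         ; rsp stays 16-byte aligned
    ; slots 1..4 are the shadow space at [rsp+0] .. [rsp+18h]
    ; slot n (n >= 5) lives at [rsp + (n-1)*SLOT]:
    ; slot 5 is [rsp+20h] and slot 17 is [rsp+80h], matching the listing
    ; ... fill the slots and call XMMatrixSet here ...
    add  rsp, ARGAREA
```

With MOV + CALL these offsets and the rounding have to be kept in sync by hand every time the argument list changes.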



aw27

Quote from: hutch-- on March 16, 2017, 10:33:13 AM
There appears to be some redundancy in this desire. Why would you use an "invoke" call when there are no memory operands involved in the argument list? You don't really want to use the stack, as it involves writing to stack memory on call, and at the procedure level the called proc must then translate the stack addresses back to different sized registers. Without the extra clutter the proposed form,

INVOKE Func, xmm0, xmm1, xmm2, r9

would simply be with the registers loaded with whatever required values,

call Func


Now given that in most instances the data for xmm, ymm registers must come from somewhere in the application and for performance reasons that memory must be aligned correctly, if you don't want to load different sized registers directly, you pass the addresses of the data items as 64 bit pointers in the normal manner.

The option of a macro something like "regcall" would also do the job but only again if you were performing the double process of loading registers first then calling the procedure.

I tried to explain in my previous message that the great usefulness of the INVOKE, for me at least, is to take care of all the stack adjustments for us. I know that some people do such calculations very easily by themselves. Those can obviously work as you suggest.

aw27

Quote from: jj2007 on March 16, 2017, 06:08:42 AM
Would MyFunc move xmm0 into the stack, or use it directly?

xmm0 will not go to the stack; it is always passed as is.

My suggestion to use xmm registers in an INVOKE statement would involve 2 possibilities:
1) Just place them there in place of dummy placeholder parameters.
2) Load the xmm registers from whatever you put on the INVOKE command line. This is actually what the INVOKE already does for parameters that go into general purpose registers.

BTW, since you are a specialist in macros you know that macros can take xmm registers as parameters.
Well, thinking better, macros can take literally everything as parameters.  ::)

johnsa

I tend to agree with you on this.. given that fastcall for x64 abi specifies floating point operands are passed in xmm() regs, I can think of no reason not to modify the invoke handling to allow xmm registers to be used as parameters in these positions, once again with the same optimisation applied that we use elsewhere, if the reg's are already in the right order nothing happens, if not it does the corresponding movaps into the register from that specified.
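
A hand-written guess at what that could expand to, for illustration only. The INVOKE line uses the proposed syntax, which JWasm does not accept according to the posts above; Func is the placeholder name used earlier in the thread, and the shadow space is assumed to be reserved already.

```asm
; Proposed source line:
;   invoke Func, xmm0, xmm5, xmm2, r9

    ; slot 1 expects XMM0 and was given xmm0: nothing to emit
    movaps xmm1, xmm5     ; slot 2 expects XMM1, so copy from the register named in the call
    ; slot 3 expects XMM2 and was given xmm2: nothing to emit
    ; slot 4 expects R9 and was given r9: nothing to emit
    call   Func
```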

It doesn't really make any difference generated code wise to MOV+CALL, but i personally like having this stuff kept clean with argument checking and going via invoke just makes it all more tidy.

So that said, unless someone disagrees.. I will implement this change to hjwasm.

So we have quite a few things on the list now worthy to promote it as 2.21.. the list of changes is:

1) Fix aw27's bug with sub rsp,8
2) Double check local alignments to 16
3) Add arch flag to allow generated code to use either sse or avx
4) Support xmm reg type arguments to invoke in fastcall x64

Unfortunately this means we'll push out the changes we had planned for 2.21 (union initialization enhancement and string literals in invoke / data declaration for both ascii and unicode) to 2.22

coder

Quote from: aw27 on March 16, 2017, 03:50:52 PM
I could not disagree more. :eusa_naughty:
The real problem is aligning the stack, especially when you have a lot of parameters in your call. It is good to know that INVOKE does all the calculations for us.
Look at the following code, where proc1 sets the 16 values of a 4x4 matrix: the first row will be all 0.1, the second all 0.2, the third all 0.3 and the fourth all 0.4.

It is not about aligning the stack. IMHO, what you need exactly is a custom-built PROC and INVOKE for your own specific needs because AFAIK, there's no single PROC/INVOKE set out there that has the fits-all capability when dealing with uneven parameters. Not even from the likes of NASM and FASM. Of course it can be done with macros but the overhead may outweigh its benefits.

johnsa

we've tried to get as close to that as possible with hjwasm, especially using stackbase:rsp / option win64:11
I think with this addition of xmm regs to invoke for float arguments it should be pretty much bang on. It tries to make the invoke/prologue/epilogue generation as optimal as possible and deal with removing anything unused or not needed while supporting all the many combinations.

coder

Quote from: johnsa on March 16, 2017, 08:47:19 PM
we've tried to get as close to that as possible with hjwasm, especially using stackbase:rsp / option win64:11
I think with this addition of xmm regs to invoke for float arguments it should be pretty much bang on. It tries to make the invoke/prologue/epilogue generation as optimal as possible and deal with removing anything unused or not needed while supporting all the many combinations.

In other words, HJWASM is trying to anticipate all other custom needs of the users. For how long and how far can you go with it? That beats one design idea of the MS 64-bit ABI: that most of the pre-entry work (alignment, saving volatiles, etc.) is the responsibility of the user code / callers and not the modules. Excessive wrapping and abstracting may jeopardize stability and portability in the long run. I have nothing against HJWASM. You guys are doing a great job, but the limit must be set somewhere.


aw27

Quote from: coder on March 16, 2017, 08:37:46 PM
It is not about aligning the stack. IMHO, what you need exactly is a custom-built PROC and INVOKE for your own specific needs because AFAIK, there's no single PROC/INVOKE set out there that has the fits-all capability when dealing with uneven parameters. Not even from the likes of NASM and FASM. Of course it can be done with macros but the overhead may outweigh its benefits.

Except for XMM registers, the INVOKE covers pretty much all the possibilities. I can't recall any other case it does not for the x64 ABI, cdecl, stdcall and pascal calling conventions.
INVOKE is useless for the Borland calling convention, which I use a lot, which is a variation of the Pascal calling convention with 3 registers used to pass data.

aw27

Quote from: johnsa on March 16, 2017, 08:08:39 PM
So we have quite a few things on the list now worthy to promote it as 2.21.. the list of changes is:

1) Fix aw27's bug with sub rsp,8
2) Double check local alignments to 16
3) Add arch flag to allow generated code to use either sse or avx
4) Support xmm reg type arguments to invoke in fastcall x64

Looks like a good plan!  :t

johnsa

Quote from: coder on March 16, 2017, 08:59:27 PM
Quote from: johnsa on March 16, 2017, 08:47:19 PM
we've tried to get as close to that as possible with hjwasm, especially using stackbase:rsp / option win64:11
I think with this addition of xmm regs to invoke for float arguments it should be pretty much bang on. It tries to make the invoke/prologue/epilogue generation as optimal as possible and deal with removing anything unused or not needed while supporting all the many combinations.

In other words, HJWASM is trying to anticipate all other custom needs of the users. For how long and how far can you go with it? That beats one design idea of the MS 64-bit ABI: that most of the pre-entry work (alignment, saving volatiles, etc.) is the responsibility of the user code / callers and not the modules. Excessive wrapping and abstracting may jeopardize stability and portability in the long run. I have nothing against HJWASM. You guys are doing a great job, but the limit must be set somewhere.

It shouldn't really be a problem because most combinations are deterministic and usually obvious.

step #1: follow the ABI
step #2: allow invoke to be a bit "nicer" to use by allowing immediates, literal strings, direct register arguments etc..
step #3: ensure invoke generates optimal code, use xor, re-use repeated arguments, avoid pointless copies ie: regA -> regA
(so far there is no reason for anything to change under any use condition)
step #4: using other win64 modes you can customise how the procs behave and how epilogue/prologue is generated (so at this point you have complete control to do what you want when required.. but the hope is that the default mode is ideal)

For example, a main reason for setting up a "raw" proc or a custom prologue would be to optimise the generated code to avoid home-space copies, ensure alignments etc.
There is no reason the "default" prologue generator shouldn't be able to cope with that on its own.. which is our approach.. if the argument isn't referenced, don't store it.. etc
for example if in your proc you refer to the arguments via their registers and not the argument by name, then it won't generate the bloat in the prologue.. which is one of the main reasons for having a "raw" proc.
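
To make that last point concrete, here is the same hypothetical two-argument proc written both ways. AddByName and AddByReg are made-up names, and the comments describe the behaviour as explained in this post; the actual prologue depends on the assembler version and the OPTION WIN64 / STACKBASE settings in effect.

```asm
option frame:auto

.code
; Arguments referenced by name: a and b must be addressable in memory,
; so RCX and RDX have to be stored to their home slots first.
AddByName proc a:QWORD, b:QWORD
    mov  rax, a
    add  rax, b
    ret
AddByName endp

; Arguments used through their registers: nothing in the body names a or b,
; so with the approach described above no home-space stores are needed.
AddByReg proc a:QWORD, b:QWORD
    lea  rax, [rcx+rdx]
    ret
AddByReg endp
end
```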

But we keep the option so that can always be over-ridden if required.