HJWasm 2.15 uploaded

Started by habran, September 05, 2016, 09:14:38 AM

jj2007

Quote from: johnsa on September 06, 2016, 01:35:08 AM
mainCRTStartup:
000000013F161010 48 83 EC 20          sub         rsp,20h
000000013F161014 48 8D 0D E5 3F 00 00 lea         rcx,[msg (013F165000h)]
000000013F16101B E8 14 10 00 00       call        printf (013F162034h) 


The sub rsp,20h is the prologue for mainCRTStartup and has nothing to do with the invoke of printf.

Are you sure? I thought the ABI requires that you grant 4 QWORDS of stack to printf ::)

include \Masm32\MasmBasic\Res\JBasic.inc ; OPT_64 1 ; put 0 for 32 bit, 1 for 64 bit assembly
Init
  int 3
  jinvoke crt_printf, Chr$("Hello msvcrt.dll", 13, 10)
EndOfCode


translates to:
CC                       | int3                               |
53                       | push rbx                           |
53                       | push rbx                           |
53                       | push rbx                           |
48 8D 0D EF 1F 00 00     | lea rcx, qword ptr ds:[140003004]  | 140003004:"Hello rt.dll, , "
51                       | push rcx                           |
FF 15 E4 20 00 00        | call qword ptr ds:[<&printf>]      |
48 83 C4 20              | add rsp, 20                        |
53                       | push rbx                           |
53                       | push rbx                           |
53                       | push rbx                           |
6A 00                    | push 0                             |
48 8B 0C 24              | mov rcx, qword ptr ss:[rsp]        | correct the stack
FF 15 D9 20 00 00        | call qword ptr ds:[<&RtlExitUserProcess |


A different way to allocate the 4 QWORDs.

johnsa

Yes, the minimum reservation for any non-leaf function should be 32 bytes (which is 20h).
There is no need to allocate stack space per call; instead it should be done once per PROC, by examining all the invokes inside that proc and taking the largest reservation any of them needs.
So, for example, if you had:

Main PROC

invoke a
invoke b,1,2,3,4
invoke c,1,2

ENDP

There would be a single sub rsp,20h (because no PROC used inside Main requires more than 4 arguments).
However, if you had:

Main PROC

invoke a
invoke b,1,2,3,4
invoke c,1,2,3,4,5

ENDP

You would get a single sub rsp,28h to account for the 5 arguments of function c, but to keep the stack xmmword aligned we'd round that up to 30h, so that the first local allocated in any proc is 16-byte aligned too.
This is better for cache access to the stack, and also saves a load of unnecessary add/sub instructions (or, in your example, pushes).
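
The arithmetic above can be sketched in a few lines of Python. This is only a simplified model of the rule as described in this post (largest argument count in the PROC, never fewer than the 4 home slots, rounded up to 16 bytes); the function name is made up for illustration and it says nothing about how hjwasm implements it internally.

```python
SHADOW_SLOTS = 4   # Win64 ABI: home space for RCX, RDX, R8, R9
SLOT_BYTES = 8     # each argument slot is one QWORD


def reserve_bytes(arg_counts):
    """Stack bytes to reserve once per PROC for all of its invokes (toy model)."""
    slots = max([SHADOW_SLOTS, *arg_counts])  # never less than the 4 home slots
    raw = slots * SLOT_BYTES
    return (raw + 15) & ~15                   # round up to a 16-byte multiple


# First Main: a(), b(1,2,3,4), c(1,2)
print(hex(reserve_bytes([0, 4, 2])))   # 0x20
# Second Main: c now takes 5 arguments, so 28h is rounded up to 30h
print(hex(reserve_bytes([0, 4, 5])))   # 0x30
```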

On a slightly unrelated note, we do a lot of these behind-the-scenes optimisations in hjwasm to set up stack prologue/epilogue generation, manage alignment etc., which is another reason that invoke is so important (I believe there was a comment on another thread about it). It allows the assembler to take care of things that would otherwise be extremely difficult or painful to manage manually (especially in 64-bit), and that would more than likely end up as hard-to-find bugs or sub-optimal code. And if writing assembly by hand ends up generating less optimal code than a C compiler .... then it really has lost any reason to exist!
This was one of the main drivers for having vectorcall support (apart from interop): we can't have a tool that doesn't provide every opportunity to write the most performant code possible with the minimum of trouble when some silly old HLL compiler can do it! :)

habran

Quote
Are you sure? I thought the ABI requires that you grant 4 QWORDS of stack to printf ::)
You are correct JJ :t
Because TWell is not using FRAME and PROLOGUE, only OPTION WIN64:2, the 20h is the allocation of home space (shadow space for the 4 register parameters) for invoke:
    W64F_SAVEREGPARAMS = 0x01, /* 1=save register params in shadow space on proc entry */
    W64F_AUTOSTACKSP   = 0x02, /* 1=calculate required stack space for arguments of INVOKE */
    W64F_STACKALIGN16  = 0x04, /* 1=stack variables are 16-byte aligned; added in v2.12 */
    W64F_SMART         = 0x08, /* 1=takes care of everything */


A long time ago I wrote the first STACKBASE:RSP implementation and pushed Japheth to include it in the official version, because I was pissed off with the stupid adding and subtracting of 20h on every subroutine call.
Now hutch is having the same problem with ML64.

So, my answer to your original question:
Is it possible to avoid using those sub rsp/add rsp in every invoke with some option switch?
Yes, use this at the beginning of your source:
   option casemap:none
   option win64:11          ;W64F_SAVEREGPARAMS + W64F_AUTOSTACKSP + W64F_SMART
   option frame:auto        ;this writes PROLOGUE and EPILOGUE for you automatically
   option STACKBASE:RSP     ;this allocates home space only once for all invokes you use in a PROC
Cod-Father
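
For anyone wondering how 11 comes about: it is just the sum of the flag bits listed above (1 + 2 + 8). A small Python sketch that decodes a win64 value into flag names; the flag names and values are copied from the list in this post, while the class and function names are invented for illustration.

```python
from enum import IntFlag


class Win64Flag(IntFlag):
    # Names and values copied from the flag list quoted in the post.
    W64F_SAVEREGPARAMS = 0x01
    W64F_AUTOSTACKSP = 0x02
    W64F_STACKALIGN16 = 0x04
    W64F_SMART = 0x08


def decode_win64(value):
    """Return the names of the flags set in an `option win64:<value>` number."""
    return [flag.name for flag in Win64Flag if value & flag]


print(decode_win64(11))  # ['W64F_SAVEREGPARAMS', 'W64F_AUTOSTACKSP', 'W64F_SMART']
print(decode_win64(2))   # ['W64F_AUTOSTACKSP']
```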

johnsa

I posted a bit further back the modified version that makes use of these options, which generated the shorter/optimal proc without all the add/sub'ishness.
Quote
   .x64
   option casemap:none
   option win64:11
   option frame:auto
   option STACKBASE:RSP
   
exit proto :dword
printf proto args:vararg
includelib msvcrt.lib

.data
msg  db "Hello msvcrt.dll",13,10,0

.code
mainCRTStartup proc
  invoke printf,ADDR msg
  invoke exit, 0
mainCRTStartup endp
end

Exactly as per my post :) Habran and I both insisted on this back while Japheth was still in charge of JWasm. It's exactly the way a compiler would generate it. Implementing the same sort of logic/optimisation from a macro would take a tremendous amount of trouble (I'm not sure it is possible at all), as it would require scoped knowledge of the PROC it's being used in and of every other invoke within that same proc. Or you can go "low level" and roll this yourself in ML64, manually updating your stack allocation prologue every time you add an invoke anywhere.. fun times .. not ;)

hutch--

> Now, hutch is having the same problem with ML64

Only in the experimental stages. I have written a prologue/epilogue that is problem-free and adjustable. It can be used for high-level code or turned off for pure mnemonic code.

This empty proc,

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc


    ret

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

gives me this disassembly.

.text:0000000140001000 C8800000                   enter 0x80, 0x0
.text:0000000140001004 4883EC40                   sub rsp, 0x40
.text:0000000140001008 C9                         leave
.text:0000000140001009 C3                         ret

Both the ENTER 1st arg and the stack adjustment are adjustable and it has been super reliable.

johnsa

How are you planning to calculate the correct amount to adjust RSP by? Using a forward reference to a total which is only determined once the epilogue is reached, and back-filling it into the prologue?
(I'm not sure that can be done with macros without trying; it depends on whether forward references are allowed and on the order of macro expansion :) ) I guess you would have to completely replace ML64's invoke/prologue/epilogue generation, and assuming the former idea worked you could possibly achieve the same sort of result as hjwasm's invoke.
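
To make the back-filling idea concrete, here is a toy Python sketch. It is not ML64 macro code and not how any of the assemblers do it; it only shows the shape of the problem: emit the prologue with a placeholder, scan the invokes, then patch the placeholder once the total is known. All names are hypothetical and argument loading is left out.

```python
def assemble_proc(name, invokes):
    """invokes: list of (target, arg_count) pairs found in the PROC body."""
    out = [f"{name} proc", None]           # slot 1 is the prologue placeholder
    max_slots = 4                          # the 4 home slots are always reserved
    for target, argc in invokes:
        max_slots = max(max_slots, argc)   # track the largest call seen so far
        out.append(f"    call {target}")   # no per-call sub/add rsp
    reserve = (max_slots * 8 + 15) & ~15   # round up to a 16-byte multiple
    out[1] = f"    sub rsp, {reserve:X}h"  # back-fill the placeholder
    out.append(f"    add rsp, {reserve:X}h")
    out.append("    ret")
    out.append(f"{name} endp")
    return out


for line in assemble_proc("Main", [("a", 0), ("b", 4), ("c", 5)]):
    print(line)
```

Whether a macro assembler lets you do the equivalent of `out[1] = ...` after the fact is exactly the open question here.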

hutch--

This high level code generates the following disassembly.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

WndProc proc hWin:QWORD,uMsg:QWORD,wParam:QWORD,lParam:QWORD

    LOCAL rct :RECT
    LOCAL buffer[128]:BYTE
    LOCAL pbuf :QWORD

    ret

WndProc endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

empty WndProc
sub_14000102b   proc
.text:000000014000102b C8800000                   enter 0x80, 0x0
.text:000000014000102f 4881ECE0000000             sub rsp, 0xe0
.text:0000000140001036 48894D10                   mov qword ptr [rbp+0x10], rcx
.text:000000014000103a 48895518                   mov qword ptr [rbp+0x18], rdx
.text:000000014000103e 4C894520                   mov qword ptr [rbp+0x20], r8
.text:0000000140001042 4C894D28                   mov qword ptr [rbp+0x28], r9
.text:0000000140001046 C9                         leave
.text:0000000140001047 C3                         ret
sub_14000102b   endp

An empty procedure with no stack frame generates the following.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

empty proc


    ret

empty endp

STACKFRAME

empty NOSTACKFRAME proc
.text:000000014000104d
.text:000000014000104d 0x14000104d:
.text:000000014000104d C3                         ret

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

Quote
I guess you would have to completely replace ML64's invoke/prologue/epilogue generation and assuming the former idea worked you could possibly achieve the same sort of result as hjwasm invoke.
ML64 comes unconfigured, so you have no choice other than to write a prologue/epilogue and an automated call notation, "invoke" being the most common. It calculates the byte count for the locals, then subtracts it from RSP while maintaining the correct alignment to provide the LOCAL space. Passed arguments are addressed above RSP.

hutch--

One more.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

stackframe_dbgdata equ <1>      ; turn on stackframe output
stackframe_default equ <256>    ; increase default stack size
stackframe_dynamic equ <512>    ; increase ENTER dynamic size

STACKFRAME

testproc proc a1:QWORD,a2:QWORD,a3:QWORD,a4:QWORD,a5:QWORD

    LOCAL var1 :QWORD
    LOCAL var2 :QWORD
    LOCAL var3 :QWORD
    LOCAL var4 :QWORD
    LOCAL var5 :QWORD
    LOCAL var6 :QWORD
    LOCAL var7 :QWORD
    LOCAL var8 :QWORD

    nop

    ret

testproc endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end

.  ****************************
.  PROLOGUE testproc
.  arg count   = 5
.  local bytes = 64
.  ****************************

sub_14000104d   proc
.text:000000014000104d C3                         ret
.text:000000014000104e C8000200                   enter 0x200, 0x0
.text:0000000140001052 4881EC40010000             sub rsp, 0x140
.text:0000000140001059 48894D10                   mov qword ptr [rbp+0x10], rcx
.text:000000014000105d 48895518                   mov qword ptr [rbp+0x18], rdx
.text:0000000140001061 4C894520                   mov qword ptr [rbp+0x20], r8
.text:0000000140001065 4C894D28                   mov qword ptr [rbp+0x28], r9
.text:0000000140001069 90                         nop
.text:000000014000106a C9                         leave
.text:000000014000106b C3                         ret
sub_14000104d   endp
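
The three `sub rsp` values posted above are consistent with a simple model: a default reservation plus the LOCAL bytes rounded up to 16. The Python sketch below is only a guess fitted to these listings, not taken from the macro source; the 40h default is inferred from the empty entry_point, and the function names are made up.

```python
def round_up(n, align=16):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def frame_sub(local_bytes, default=0x40):
    """Guess at the `sub rsp` operand: default reservation plus aligned locals."""
    return default + round_up(local_bytes)


RECT_BYTES = 16  # RECT is four 32-bit LONG fields

print(hex(frame_sub(0)))                     # empty entry_point          -> 0x40
print(hex(frame_sub(RECT_BYTES + 128 + 8)))  # WndProc: rct, buffer, pbuf -> 0xe0
print(hex(frame_sub(8 * 8, default=256)))    # testproc, default = 256    -> 0x140
```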

jj2007

Quote from: johnsa on September 06, 2016, 06:10:52 AM
without a tremendous amount of trouble (I'm not sure if at all) to implement this same sort of logic/optimisation from a macro would be impossible as it would require scoped knowledge of the PROC it's being used in and every other invoke within that same proc..

Quote from: johnsa on September 06, 2016, 06:32:06 AM
How are you planning on handling the calculation of the correct amount to adjust RSP by ? using a forward reference to a total which is only determined once the epilogue is reached and back-filling it into the prologue ?

It is not that difficult, actually: you start with a reasonable default (e.g. 12 args as in CreateWindowEx), and if that's not enough, let the EPILOGUE macro tell the user that he must manually increase the reserved stack. Plus, if the user is worried about running out of stack, e.g. in a recursive proc, he can start with a very low value and let the epilogue macro tell him how much is really needed. Manual intervention will be a very rare case.
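
The logic is roughly the following, sketched here in Python rather than as macros. The class and method names are invented for illustration; the real thing would live in the invoke and EPILOGUE macros.

```python
DEFAULT_ARG_SLOTS = 12  # suggested starting point: CreateWindowEx takes 12 args


class ProcTracker:
    """Tracks the largest invoke in one PROC against a fixed reservation."""

    def __init__(self, name, reserved_slots=DEFAULT_ARG_SLOTS):
        self.name = name
        self.reserved_slots = reserved_slots
        self.max_seen = 0

    def invoke(self, argc):
        # Called for every invoke in the PROC body.
        self.max_seen = max(self.max_seen, argc)

    def epilogue(self):
        """Return a warning if the reservation is too small, else None."""
        if self.max_seen > self.reserved_slots:
            return (f"{self.name}: reserve at least {self.max_seen} arg slots "
                    f"(currently {self.reserved_slots})")
        return None


wnd = ProcTracker("WndProc")
wnd.invoke(12)
print(wnd.epilogue())   # None: the default is enough

rec = ProcTracker("Recurse", reserved_slots=4)  # deliberately low for a recursive proc
rec.invoke(5)
print(rec.epilogue())   # tells the user how many slots are really needed
```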

I had it running already but, see here, there is a really weird behaviour of the PROLOGUE macro that I am fighting with right now. Maybe one of the masters of the Watcom universe has a clue what happens there. Plain Masm32 test case attached; the Watcom assemblers also give a wrong line count, see @Line in the attached source.

johnsa

Hi,

Just to let you all know .. I spotted a bug in the new invoke code when using parameter types like [rsi], [reg+ofs] and [reg].struct.member on fastcall procedures.. I've fixed this and updated both the repository and packages on the site so there's an update for you dated 6/9/2016.

John

jj2007

Perhaps you should post a more recent version here:
HJWasm 2.15 (32bit)    6/08/2016    hjwasm215_x86.zip    32bit Binary Package (Windows)
HJWasm 2.15 (64bit)    6/08/2016    hjwasm215_x64.zip    64bit Binary Package (Windows)

6 August is one month ago 8)

johnsa

arghh... dumb typo, fixing now :) thanks for spotting!

jj2007

No problem, I saw the correct timestamp in the zip archive :P

There is still the prolog issue. What I found out is that
- the prolog macro kicks in only when the assembler hits the first real instruction after the locals (ML and Watcom)
- within the prolog macro, the @Line macro yields the sometest proc line in ML but the line after the locals plus 5 in the Watcom family.

Attached new plain Masm32 testbed. I assume it would be the same for 64-bit code.

johnsa

I've found the bits of code responsible for it. I'm just not 100% sure yet what to do about it..  ::)
Technically the macro is being run in the same place by both ML and the Watcom family, and our line number is more "correct" than ML's (not to say there isn't a reason for a custom prologue to do this).
So can you help me understand why you need the line number at the time of prologue execution to equal the proc line? (Maybe it will help decide how best to change it.)


johnsa

As a fix, I've implemented a new built-in equate, @ProcLine, which gives us:


SomeProlog MACRO procname, flags, argbytes, localbytes, reglist, userparms
Local tmp$, up$, alignOK, alignBad, is, alignedUses
  pLine=@Line
  % echo ## prologue of procname ##
  hello$ equ <The prologue macro has changed the string>
  push ebp
  mov ebp, esp ; create frame
  up$ CATSTR <userparms>, < >
  tmp$ CATSTR <line >, %@Line, < (should be line 24, ok with ML but not with Watcom family)>
  % echo tmp$
  tmp$ CATSTR <## PROLOG, line >, %@Line, <: &procname>, <: args+locals=>, <argbytes>, <+>, <localbytes>, <=>, %(argbytes+localbytes),  <, _flags=>, <flags>,  <, _userparms=>, <userparms>
  % echo tmp$
  % echo ## end of procname prologue ##
  IFDEF @ProcLine
  tmp$ CATSTR <line >, %@Line, < - proc line >, %@ProcLine
  ELSE
  tmp$ CATSTR <line >, %@Line, < - proc line >, %@Line
  ENDIF
  % echo tmp$
  EXITM %localbytes
ENDM


This then gives the correct results for both ML and HJWASM.
If this works for you, I will update the source/packages?