First time UASM 2.49 user noob question about 32-bit vs 64-bit

Started by asmguru, September 19, 2019, 04:18:17 PM

asmguru

First question by new masm32.com user; also a first-time user of UASM.

I assembled a one-instruction 32-bit fastcall routine that simply loads its first argument into eax, and I get what I'd expect, a "mov eax,ecx" opcode.  Note the source opcode was "mov eax,arg1" -- changed in the list file to mov eax,ecx (as if arg1 was a text equate for "ecx").

UASM v2.49, Jun 21 2019, Masm-compatible assembler.
                                        .586
                                        .model  flat, fastcall
                                        .code
                                        .listall
00000000                        xyz     proc    arg1:ptr
00000000  8BC1                          mov     eax, ecx
00000002                                ret
00000002  C3                *   retn
00000003                        xyz     endp
                                        end
00000003                    *   _TEXT ends

But a similar routine in 64-bit generates an rbp stack frame, and then attempts to load the first argument from its homing area (which arg1 has not yet been stored to).  Here the source file opcode was "mov rax,arg1" and arg1 was NOT changed to rcx (as was done in the 32-bit case).

UASM v2.49, Jun 21 2019, Masm-compatible assembler.
                                        .x64
                                        .model  flat, fastcall
                                        .code
                                        .listall
00000000                        xyz     proc    arg1:ptr
00000000  55                *   push rbp
00000001  488BEC            *   mov rbp, rsp
00000004  488B4510                      mov     rax, arg1
00000008                                ret
00000008  C9                *   leave
00000009  C3                *   retn
0000000A                        xyz     endp
                                        end
0000000A                    *   _TEXT ends

I would have hoped for just a  "mov rax,rcx"  opcode in the 64-bit case -- I don't see how the 64-bit code can even work, as arg1 has never been stored to the stack.  How can I get the 64-bit case to generate only a  "mov  rax,rcx"  opcode, as was done in the 32-bit case?

Thanks!

BugCatcher

Ml64 has a different calling convention than ml. There is no proc. Read the masm64 help file. It has all the information you need.

aw27

@asmguru

The default is to do it like MASM. This will do what you want:


; uasm64 -c /Flfile.lst -win64 -Zp8 test.asm
; link /ENTRY:xyz /SUBSYSTEM:console /MACHINE:X64 test.obj

option win64:7
option frame:noauto

.code
.listall

xyz     proc    arg1:ptr
        mov     rax, arg1
        ret
xyz     endp

        end

asmguru

AW's suggestion produced for me:
UASM v2.49, Jun 21 2019, Masm-compatible assembler.
                                        option  win64:7
                                        option  frame:noauto
                                        .code
                                        .listall
00000000                        xyz     proc    arg1:ptr
00000000  48894C2408        *   mov [rsp+8], rcx
00000005  55                *   push rbp
00000006  488BEC            *   mov rbp, rsp
00000009  488B4510                      mov     rax, arg1
0000000D                                ret
0000000D  C9                *   leave
0000000E  C3                *   retn
0000000F                        xyz     endp
                                        end
0000000F                    *   _TEXT ends

That is working code  :thumbsup:

But alas not what I want.  I hoped for a simple   mov rax,rcx   as per the Win64 ABI.   I did not want setup/teardown of an rbp local frame; I did not want arg1 saved to its homing area above the return address (64-bit offers lots of registers!  that's why the ABI uses them, keeping them out of memory for speed), nor do I need para stack alignment (I am dealing with a leaf function).

I had hoped for what I'd see from a C compiler, given such a leaf function.  I've continued to read for days on this and I've concluded that although 32-bit mode assembles fine (Freudian slip... it does what I want), in 64-bit mode the assembler requires arguments coming in registers to be homed before they can be accessed by name.

Sure, avoiding the smarts of the assembler via macros (epilogue/prologue, or ML64-style "roll-your-own" functions) is possible.  But I wished to use a modern, smarter (than clueless ML64) assembler that could generate low size/overhead code while allowing me to use a standard fastcall proc and argument names.  Again, basically exactly how the register-passed fastcall args can be handled in 32-bit mode, never leaving a register.

My functions are obviously more complex (more arguments, more logic...) than the above example I simplified to illustrate my confusion.

Thank you for the answers, they are helpful!

jj2007

Have you tried simply using a label plus ret, instead of proc?

aw27

@asmguru,

It is not risk free for an assembler to decide that it can safely replace mov rax, arg1 with mov rax,rcx.
You need to use
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
to roll your own way of doing things.

I don't think a compiler will do it either, except in very basic cases, and there the optimization is likely to simply strip out that function and present the result directly to the caller. What we sometimes see in compilers is that they save rcx into another register instead of onto the stack.
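
A minimal sketch of the roll-your-own approach described in this post, assuming UASM in 64-bit mode and the same command line used earlier in the thread. It relies on the assumption that, with the generated prologue switched off, the name arg1 would still resolve to an rbp-relative slot that nobody has set up, so the body reads the register directly.

```asm
; Sketch only: assumes UASM in -win64 mode.

option prologue:none        ; no generated push rbp / mov rbp, rsp
option epilogue:none        ; ret assembles to a bare retn, no leave

.code

xyz     proc    arg1:ptr
        mov     rax, rcx    ; first argument arrives in rcx under the Win64 ABI
        ret                 ; assumption: using the name arg1 here would still
                            ; reference the rbp-based home slot, so it is avoided
xyz     endp

        end
```

The parameter list is kept only for documentation; the routine is a leaf that never touches the stack.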


Vortex

Assembling with Poasm Version 9 :

.code

    xyz PROC arg1:PTR

    mov     rax,arg1
    ret

    xyz ENDP

END


Disassembling the object module file :

public xyz

_text   SEGMENT PARA 'CODE'

xyz     PROC
        mov     rax, rcx
        ret
xyz     ENDP

_text   ENDS

END

aw27

Poasm assumes that users never do things like this:



.code

xyz proc arg1:ptr
mov rcx, 10
mov  rax, arg1
add rax, rcx
ret
xyz endp

end

Vortex

Well, the user has to be careful as rcx is a member of the fastcall convention. That's the trick.
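
As a hedged illustration of that care, the clobbering example above could be written so that the argument is copied out of rcx before rcx is reused. This sketch assumes UASM in 64-bit mode with the generated prologue and epilogue switched off; r10 is chosen because it is a volatile scratch register under the Win64 ABI.

```asm
; Sketch only: assumes UASM in -win64 mode.

option prologue:none
option epilogue:none

.code

xyz     proc
        mov     r10, rcx    ; preserve the first argument before rcx is overwritten
        mov     rcx, 10     ; rcx is now free for other use
        mov     rax, r10    ; rax = original argument
        add     rax, rcx    ; rax = argument + 10
        ret
xyz     endp

        end
```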

asmguru

AW,
My latest C project (Microsoft 64-bit C) is a couple dozen functions.  None of the functions home the arguments or set up an rbp stack frame (rbp is too valuable to lose as a general-purpose register).  Functions access the incoming arguments from registers (or load them via [rsp+xxx] for the 5th+).  An example prologue/epilogue:


; 552  : wchar_t *GetAt(ARGS *pA, wchar_t *pT) {

$LN27:
  00000 40 53 push rbx
  00002 55 push rbp
  00003 56 push rsi
  00004 57 push rdi
  00005 48 81 ec 38 02
00 00 sub rsp, 568 ; 00000238H

; 553  : enum { fn_elem = 260 };
; 554  : wchar_t fn[fn_elem];
; 555  : FILE *fh = NULL;

  0000c 33 ed xor ebp, ebp

... 150 lines removed ...

  000b1 48 81 c4 38 02
00 00 add rsp, 568 ; 00000238H
  000b8 5f pop rdi
  000b9 5e pop rsi
  000ba 5d pop rbp
  000bb 5b pop rbx
  000bc c3 ret 0
GetAt ENDP


Take a look at the output of any MSVC 64-bit compiler -- you'll see the same thing.  Use of registers for incoming args and no rbp stack frame.  Which makes sense to me -- if the 64-bit ABI wanted arguments forced to the stack, they would have left the calling convention stdcall.

I know there are good reasons things work the way they do, and that a lot of code depends on it working that way.  Also, a lot of smart people have worked very hard and I respect what's been accomplished, and it's up to me to figure out a way to use what's offered.

What I'd like to see someday as an enhancement to 64-bit mode: a switch/mode (perhaps a Win64 flag) that causes named args to be loaded from their incoming regs (or via [rsp] for the 5th+), and that does not require use of rbp for a stack frame (perhaps option stackbase:rsp already accomplishes this).  Sure, one would have to ensure the arguments are handled properly; nothing is risk-free in assembly -- we have to ensure our pushes and pops match!

hutch--

It would appear that you need to learn how the Win64 ABI works, 4 registers matched by 4 locations on the stack (shadow space) and stack address locations above that shadow space for more than 4 arguments. You certainly CAN write RBP stack frames if you need them which is mainly for LOCAL storage, otherwise you can create procedures with no stack frame for pure mnemonic code or stack adjusted procedures for high level procedure calls. Using PUSH / POP in the same manner as 32 bit STDCALL is a failure to understand that win64 is different.
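
A rough sketch of the layout described here, as seen at entry to a frameless leaf function. The name sum5 and the five integer arguments are hypothetical, and the code assumes UASM in 64-bit mode with the generated prologue and epilogue switched off.

```asm
; Sketch only. Stack at entry to a Win64 function, before any push:
;   [rsp]      return address
;   [rsp+8]    shadow slot for rcx (1st argument)
;   [rsp+16]   shadow slot for rdx (2nd argument)
;   [rsp+24]   shadow slot for r8  (3rd argument)
;   [rsp+32]   shadow slot for r9  (4th argument)
;   [rsp+40]   5th argument, passed on the stack by the caller

option prologue:none
option epilogue:none

.code

; hypothetical leaf: rax = a + b + c + d + e, all 64-bit integers
sum5    proc
        lea     rax, [rcx+rdx]      ; a + b
        add     rax, r8             ; + c
        add     rax, r9             ; + d
        add     rax, [rsp+40]       ; + e, read from above the shadow space
        ret
sum5    endp

        end
```

The shadow slots are left unused here; the caller reserves them whether or not the callee stores anything there.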

aw27

Modern releases of MSVC do not use RBP-based stack frames. RSP-based stack frames save 2 instructions and allow the use of RBP for other purposes. That is fine, although in practice it does not contribute much to performance. However, compilers use many other tricks, so it is not easy for an ASM programmer to beat a C/C++ compiler, except with SIMD instructions, where compilers are not particularly smart. Learning ASM is not just about getting more speed, but that is another discussion.
Now, MASM does not provide any support for RSP-based frames. UASM does provide support for RSP-based stack frames. Personally, I prefer RBP because I find it easier to debug with RBP.
I am not very concerned when assemblers do not provide many features. People who like to work on auto-pilot or have their little nice hand guided all the time ((C) Hutch) should consider using only an HLL.

KradMoonRa

Yep,

Like AW explains,

This:

.code

xyz proc arg1:ptr
mov  rax, rcx
ret
xyz endp

end


Can be this:

.code

xyz proc arg1:ptr
mov  rax, rcx
mov  rax, [rcx]
movd  xmm0, [rcx]
movd  xmm0, rcx
movdqu  xmm0, [rcx]
movdqa  xmm0, [rcx]
ret
xyz endp

end


I mean that if the assembler, in 64-bit mode, automatically treated procedure arguments as the registers they arrive in, that would probably not be what you want in some cases, and you would lose the other options shown above.


For what you intend, this needs careful thought.

uasm -win64 -Zp8 -Sg -nologo -c -Sa -FlMain64.lst -FoMain64.obj Main64.asm

    option win64:11   ;Get us on the stack optimization point
    option frame:auto ;Leave for the optimizer get our result

    .code
    .listall
   
     xyz     proc   arg1:ptr
        mov     rax, rcx
        ret
    xyz     endp

    end



    option win64:11   ;Get us on the stack optimization point
    option frame:auto ;Leave for the optimizer get our result

;win32
IF @Platform LT 1
    RRET TEXTEQU <EAX>
    RPARAM0 TEXTEQU <ECX>
    RPARAM1 TEXTEQU <EDX>
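    RPARAM2 TEXTEQU <[ESP+4]>   ;valid at entry only: fastcall passes just ECX and EDX in registers,
    RPARAM3 TEXTEQU <[ESP+8]>   ;so the 3rd and 4th arguments sit right above the return address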
ENDIF

;win64
IF @Platform EQ 1
   RRET TEXTEQU <RAX>
   RPARAM0 TEXTEQU <RCX>
   RPARAM1 TEXTEQU <RDX>
   RPARAM2 TEXTEQU <R8>
   RPARAM3 TEXTEQU <R9>
ENDIF

    .code
    .listall
   
     xyz     proc   arg1:ptr
                                         ;Best to forget about named arguments: in 64-bit they are treated like locals. To keep
                                         ;the code portable between 32-bit and 64-bit Windows, simple parameter macros like these help.

        mov     RRET, RPARAM0
        ret
    xyz     endp

    end


jj2007

Quote from: asmguru on September 22, 2019, 03:34:29 AM
... set up an rbp stack frame (rbp is too valuable to lose as a general-purpose register).

On extremely rare occasions I use ebp as a register to get a speed advantage. That is in 32-bit land. Claiming that you don't have enough registers in 64-bit land is courageous. Show us one proc of yours that requires rbp not for a stack frame.

Re calling conventions etc: The simplest solution is to do it "by hand":

include \Masm32\MasmBasic\Res\JBasic.inc        ; ## builds & runs in 32- or 64-bit mode with UAsm and ML ##
.code
Just_a_label:
  mText equ <rcx>
  mTitle equ <rdx>
  push rax      ; align 16 - important for Windows, no idea about Linux
  jinvoke MessageBox, 0, mText, mTitle, MB_OK or MB_SETFOREGROUND
  pop rdx
ret
Init
  PrintLine Chr$("This program was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format.")
  mov rdx, Chr$("Hello")
  mov rcx, Chr$("I am a message box")
  call Just_a_label                     ; order of args: rcx rdx r8 r9 pushed5 pushed6 etc
  Inkey "ok?"
EndOfCode                      ; OPT_64 1      ; put 0 for 32 bit, 1 for 64 bit assembly

asmguru

I do understand the Win64 ABI well.  Yes there are lots of registers in 64-bit mode.

I should have said in the original question the concern was code space, but I didn't want to muddle the issue of not understanding what UASM was doing in the 64-bit prologue.  I was a long time ML guy brand new to UASM when the question was written.

If one is assembling a small routine (in my case many of them, into a library), homing the first four arguments then fetching them from there, setting up an rbp frame, and para-aligning the stack can double the size of a routine.

Yes... if it's a small routine, who cares about doubling its size.
Yes... none of this matters a smidgen from a performance standpoint.

I've shut off the prologue/epilogue.
Sorry for beating a dead horse.  Thanks again for all the feedback.

BTW I just saw the UASM COMDAT support -- this is a wonderful feature and it works great.