My First MASM 64-bit Hello World console

Started by coder, January 10, 2017, 01:29:50 AM

coder

Hi, this is my first post. Nice to know you.

Finally got a minimal "Hello World" console program working with ML64.exe and GoLink. You have no idea how happy I am to get it to work.

64-bit is my entry point into MASM, though; I have never tried 32-bit MASM. Here's my initiation code:

;-----------------------------------
;Descr  : My First 64-bit MASM program
;ml64 prog.asm /c
;golink /console /entry main prog.obj msvcrt.dll, or
;gcc -m64 prog.obj -o prog.exe (main/ret). This needs 64-bit GCC.
;-----------------------------------
extrn printf:proc
extrn exit:proc

.data
hello db 'Hello 64-bit world!',0ah,0

.code
main proc

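        mov     rcx,offset hello    ; 1st argument: address of the format string
        sub     rsp,20h             ; reserve 32 bytes of shadow space for printf
        call    printf
        add     rsp,20h             ; release the shadow space

        mov     rcx,0               ; exit code 0
        call    exit                ; does not return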

main endp
end


I have more questions to ask, but I just wanted to share the joy here first. Maybe you can add some advice and tricks.

Thanks

coder

My next questions:

1. Is this the right board for a beginner post like this, or should I move it somewhere else?
2. ML64 did not complain about the use of "extern" and "extrn". Should I be concerned about this? Are they interchangeable in this particular setting?
3. Don't I need some kind of alignment somewhere? This alignment stuff scares me a little bit though.

Thanks

hutch--

I can sympathise with anyone who has had to twiddle the stack manually, but the simple answer is that you need a stack frame for any high level procedure, including any of the C runtime functions. You basically have 2 types of procedures: pure mnemonic procedures, where you pass arguments through registers to a limit of 4 registers, and higher level procedures, for which you set up a stack frame. The recommended method is a pair of prologue / epilogue macros for managing the stack frame, so you can turn the stack frame on and off.

Having suffered the same learning curve as yourself, I wrote a pair of macros (among many others) that do both the stackframe and a reasonable simulation of the old 32 bit MASM "invoke". If you have a look in the ML64 subforum you will find a reasonable amount of stuff already up and working correctly, API libraries including MSVCRT, the masm64 macro file and a collection of examples. The macros for the stackframe are complicated but they will show you exactly how they work. The most recent help file explains how the prologue and invoke code works.

coder

Quote from: hutch-- on January 10, 2017, 04:19:32 AM

Thanks for the reply. I've looked into your 64-bit project and can't praise you enough for what you have been doing. Lots of future 64-bit younglings will rely heavily on it for sure.

As for the alignment, I think you are right. A friend of mine also told me that the safest bet for a beginner like me wading through the 64-bit Win APIs is to use the old school prolog/epilog setup until I can figure out the different alignment / frame setup requirements for different things. So for the above code, I should have set up my main function like this:

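main proc
        push    rbp                 ; save caller's RBP (this also realigns RSP to 16 bytes)
        mov     rbp,rsp

        mov     rcx,offset hello    ; 1st argument: address of the format string
        sub     rsp,20h             ; reserve 32 bytes of shadow space for printf
        call    printf
        add     rsp,20h             ; release the shadow space

        mov     rsp,rbp             ; tear down the frame
        pop     rbp
        ret                         ; for GCC
main endp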
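
The alignment arithmetic behind this can also be sketched without RBP. This is only a minimal sketch, assuming the standard Windows x64 convention (RSP is 16-byte aligned just before every CALL, so it is 8 bytes off on entry to a procedure) and the same msvcrt printf / exit imports and GoLink command line as in the first post:

```asm
extrn printf:proc
extrn exit:proc

.data
hello db 'Hello 64-bit world!',0ah,0

.code
main proc
        ; on entry RSP = 16n+8, because the CALL that got us here pushed 8 bytes
        sub     rsp,28h         ; 20h shadow space + 8 bytes padding, so RSP = 16n again
        lea     rcx,hello       ; 1st argument: address of the format string
        call    printf
        xor     ecx,ecx         ; exit code 0
        call    exit            ; does not return, so RSP is never restored
main endp
end
```

In the push rbp version above, the push itself supplies those extra 8 bytes, which is why sub rsp,20h is enough there.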


I am still confused about PROC/ENDP though. What exactly does a "PROC / ENDP" pair do at the assembly level? What does it hide or do silently in a 64-bit environment, where MS employs only a single fastcall-style convention? For example, does it automatically set up the shadow space / stack frame for my routine, or is it just a simple label...ret scope definition?

Regards

hutch--

You control the "proc" / "endp" pair with the prologue / epilogue macros if the stack frame is turned on. If the stack frame is turned off, it is just the start and end of a pure mnemonic procedure. With ML64 you only use RET, not RETN. The return value is in RAX, but you can use any of the other unprotected registers as well to return more than one value. RAX RCX RDX R8 R9 R10 R11 are the unprotected or transient registers. The numbered registers can specify sizes as well: R8 R8d R8w R8b gives you the four sizes. The high byte registers AH BH CH DH cannot be used in any instruction that needs a REX prefix, which means any instruction that also uses R8 to R15, SIL, DIL, SPL, BPL or a 64-bit operand, so they are best avoided. It's not a problem, as you have enough registers to handle it.

Something I found incredibly useful was a disassembler/debugger called ArkDasm. It allowed me to have a look at exactly the questions you are asking and it was one of the main tools I used while developing the prologue and invoke style macros. The stackframe is pretty low overhead because it uses ENTER and LEAVE. The stackframe macro loads the first 4 register arguments into the shadow space which is necessary in a high level procedure so you don't overwrite the register values.

With the stack frame you can write high level code without the melodrama of twiddling the stack and when you need real speed and low overhead with a pure mnemonic leaf procedure you can turn it off.
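
As a rough illustration of what such a prologue does, here is a hand-written sketch. It is not the output of the masm64 macros: it uses push / mov instead of ENTER, spills only two arguments, and addtwo is a hypothetical procedure invented for the example.

```asm
.code
; addtwo is a hypothetical example: returns rcx + rdx in rax
addtwo proc
        mov     [rsp+8],rcx     ; spill 1st argument into its shadow space slot
        mov     [rsp+16],rdx    ; spill 2nd argument into the next slot
        push    rbp
        mov     rbp,rsp         ; the spilled arguments are now at [rbp+16] and [rbp+24]
        sub     rsp,20h         ; shadow space for anything this procedure calls

        mov     rax,[rbp+16]    ; reload the arguments from memory,
        add     rax,[rbp+24]    ; leaving rcx and rdx free for other use

        leave                   ; same as mov rsp,rbp followed by pop rbp
        ret
addtwo endp
end
```

The point of the spill is that rcx and rdx can then be reused, for example as arguments to a nested call, without losing the original values.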

coder

Great explanations. Thanks.
That means I can go bare-metal with MASM without any obvious limitations as far as the OS is concerned.


hutch--

One more thing: try to avoid using PUSH and POP, as they change the RSP register. Here is one of the procs so far for the 64 bit library. This one uses other registers, but if you want to write a pure mnemonic procedure with many registers, including the ones that must not change across procedure calls, you can use ".DATA?" memory.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

mcopy proc

  ; rcx = source address
  ; rdx = destination address
  ; r8  = byte count

  ; --------------
  ; save rsi & rdi
  ; --------------
    mov r11, rsi
    mov r10, rdi

    cld
    mov rsi, rcx
    mov rdi, rdx
    mov rcx, r8

    shr rcx, 3
    rep movsq

    mov rcx, r8
    and rcx, 7
    rep movsb

  ; -----------------
  ; restore rsi & rdi
  ; -----------------
    mov rdi, r10
    mov rsi, r11

    ret

mcopy endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
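
A possible way to call it, as a sketch only: copydemo, src and dst are hypothetical names, and the fragment assumes mcopy is assembled in the same module. Note the argument order is source first, destination second.

```asm
.data
src     db 'Hello 64-bit world!',0      ; 20 bytes including the terminator

.data?
dst     db 32 dup(?)                    ; uninitialised destination buffer

.code
copydemo proc                           ; hypothetical caller
        lea     rcx,src                 ; rcx = source address
        lea     rdx,dst                 ; rdx = destination address
        mov     r8,sizeof src           ; r8  = byte count (20)
        call    mcopy                   ; 2 qwords via movsq, then 4 bytes via movsb
        ret
copydemo endp
```

No shadow space is reserved here only because mcopy is a leaf procedure that never touches the stack; a call into the C runtime or the Windows API would still need it.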

coder

Nice! If I read it correctly, it's a memory copy function?

A few questions, Hutch:

1. Why use R11? R15/R14 are the loneliest registers on earth and nobody seems to be using them. AFAIK, the SYSCALL instruction saves RFLAGS in R11. This is true in Linux too.

2. Do you use SSE instructions / registers in your 64-bit libraries or just standard x64 instructions? Love your mcopy routine. But if you could shr it by 4, it would open up the opportunity to use SSE instructions and registers.

Love what you're doing Hutch. Keep it up and God bless you.

hutch--

There is no problem using SSE and AVX instructions; you just have to look up the instructions in the Intel manuals. Sad to say, I have to do a lot more of the hacky stuff before I can get into the really fast stuff.

Now as far as register usage goes, I learnt long ago to fully comply with whatever the appropriate ABI happens to be, and you get reliable code that works across different versions of Windows. From memory, Linux has a slightly different ABI but it's the same mentality. C compilers generally use the full spread of registers and use them according to the OS register convention, so it's worth playing safe here.

This algo below is an AVX memory copy procedure and it is faster than the legacy version and the SSE version. Note that both the source and destination addresses must be 256 bit (32 byte) aligned.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

NOSTACKFRAME

ymmcopya proc

  ; rcx = source address (32 byte aligned)
  ; rdx = destination address (32 byte aligned)
  ; r8  = byte count

    mov r11, r8
    shr r11, 5                  ; div by 32 for loop count
    xor r10, r10                ; zero r10 to use as index
    test r11, r11
    jz rmdr                     ; fewer than 32 bytes, skip the block loop

  lpst:
    vmovntdqa ymm0, YMMWORD PTR [rcx+r10]   ; the 256 bit load form requires AVX2
    vmovntdq YMMWORD PTR [rdx+r10], ymm0

    add r10, 32
    sub r11, 1
    jnz lpst

  rmdr:
    mov rax, r8                 ; calculate remainder if any
    and rax, 31
    test rax, rax
    jnz @F
    ret

  @@:
    mov r9b, [rcx+r10]          ; copy any remainder
    mov [rdx+r10], r9b
    add r10, 1
    sub rax, 1
    jnz @B

    ret

ymmcopya endp

STACKFRAME

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

FORTRANS

Hi,

Quote from: coder on January 10, 2017, 01:46:43 AM
2. ML64 did not complain about the use of "extern" and "extrn". Should I be concerned about this? Are they interchangeable in this particular setting?

   The oldest versions of MASM used "EXTRN" only.  Somewhere
along the way "EXTERN" was added.  They are the same directive,
fully interchangeable.

Regards,

Steve N.

coder

Quote from: hutch-- on January 10, 2017, 09:21:10 AM

That's very impressive, Hutch. Thanks for the code.

coder

Quote from: FORTRANS on January 10, 2017, 09:48:55 AM
Got it.

nidud

deleted

hutch--

nidud is right here. When I first started on Win64 I was using the simple EXTERN but was getting massive object modules from it. One of our members, "qword", said to use EXTERNDEF, and they dropped to normal size. Here are the first few prototypes from kernel32.inc for ML64.

externdef __imp_ActivateActCtx:PPROC
ActivateActCtx equ <__imp_ActivateActCtx>

externdef __imp_AddAtomA:PPROC
AddAtomA equ <__imp_AddAtomA>
  IFNDEF __UNICODE__
    AddAtom equ <__imp_AddAtomA>
  ENDIF

externdef __imp_AddAtomW:PPROC
AddAtomW equ <__imp_AddAtomW>
  IFDEF __UNICODE__
    AddAtom equ <__imp_AddAtomW>
  ENDIF

externdef __imp_AddConsoleAliasA:PPROC
AddConsoleAliasA equ <__imp_AddConsoleAliasA>
  IFNDEF __UNICODE__
    AddConsoleAlias equ <__imp_AddConsoleAliasA>
  ENDIF

externdef __imp_AddConsoleAliasW:PPROC
AddConsoleAliasW equ <__imp_AddConsoleAliasW>
  IFDEF __UNICODE__
    AddConsoleAlias equ <__imp_AddConsoleAliasW>
  ENDIF

coder

Thanks nidud and Hutch for that important pointer on EXTERNDEF. One more future hiccup to avoid.