64-bit assembly starter kit

Started by jj2007, July 16, 2016, 11:48:29 AM


hutch--

Any struct that I have written in win64 is done like so.


strname STRUCT QWORD
  ; members dtype ? etc ....
strname ENDS


With the macros "LOCAL64" and "STRUCT64", they are written in the uninitialised data section, so the data is independent of the stack. I can use them anywhere in the code section without having to put them at the top like normal LOCAL variables, and you can also use them in procedures with no stack frame.

qWord

The problem is the default prologue/epilogue, which respects neither the alignment constraints for fastcall nor the structure member alignment of 8 (Zp8 or direct). Conclusion => set up your own prologue/epilogue macro (straightforward). The second step is to write a LOCAL replacement that adds padding to 8 bytes for structure member alignment (or, like Hutch, 16 if wished).
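
The padding arithmetic such a LOCAL replacement needs can be sketched in C. This is only an illustration of the rounding step, not the macro itself; `align_up` and `reserve_local` are hypothetical names that do not exist in ml64 or in any package mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Round `size` up to the next multiple of `align` (must be a power of two). */
static size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Hypothetical bookkeeping for a LOCAL replacement: `used` is the number of
   bytes already reserved for locals; the return value is the new total after
   adding one more local, padded so the total stays a multiple of `align`. */
static size_t reserve_local(size_t used, size_t local_size, size_t align)
{
    return align_up(used + local_size, align);
}
```

With an alignment of 8, a 16-byte RECT followed by a DWORD would reserve 24 bytes in total; passing 16 instead gives the stricter variant mentioned above.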


BTW: I'm curious what all these 64-suffixes are good for:-D
MREAL macros - when you need floating point arithmetic while assembling!

hutch--

 :biggrin:

> BTW: I'm curious what all these 64-postfixes are good for:-D

It was just a naming convenience so I knew what they were for. Now, interestingly enough, if you write the "proc" line in the same way as 32-bit ML, with specified data types such as "item:QWORD", the first 4 arguments, which correspond to the registers of the 64-bit calling convention, are blank, while from argument 5 onwards the arguments contain the right data. So I have used the following:


    mov QWORD PTR [rbp+10h], rcx
    mov QWORD PTR [rbp+18h], rdx
    mov QWORD PTR [rbp+20h], r8
    mov QWORD PTR [rbp+28h], r9


This preserves the registers so they are not overwritten by any other procedure call, and you can use the names from the "proc" line once the data is copied to the stack. It also means that you have rax, rcx, rdx, r8, r9, r10 and r11 as volatile registers to use in the algo if needed, and that is before preserving any of the non-volatile registers, so if you need it you can have genuine PHUN in algo design.
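
The offsets in those four mov instructions follow a simple pattern, sketched here in C. It assumes the usual frame set up by "push rbp / mov rbp, rsp", so that [rbp+0] holds the saved rbp and [rbp+8] the return address; `arg_home_offset` is a hypothetical helper name used only for this illustration.

```c
#include <assert.h>

/* Offset from rbp of the stack slot for argument n (1-based), assuming
   [rbp+0] = saved rbp and [rbp+8] = return address. Arguments 1..4 get the
   caller-provided home slots; arguments 5 and up are already on the stack
   at the offsets that continue the same sequence. */
static int arg_home_offset(int n)
{
    assert(n >= 1);
    return 0x10 + 8 * (n - 1);
}
```

Under these assumptions, arguments 1 to 4 map to 10h, 18h, 20h and 28h, matching the code above, and argument 5 would sit at 30h.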

jj2007

Quote from: qWord on July 19, 2016, 04:59:16 AM
The second step is to write LOCAL-replacement that does add padding to 8 bytes

This is the part where I fail to see the logic. The x64 ABI makes a big fuss about the stack being aligned to 16, and I always thought that's because SIMD instructions are faster with, or require, such alignment. But why then align 8 for local variables? It doesn't make sense 8)

qWord

Quote from: jj2007 on July 19, 2016, 06:02:48 AM
But why then align 8 for local variables? It doesn't make sense 8)
To save memory, of course. Even for compilers it is no problem to keep track of the alignment and reorder locals to get the best results. Actually, this problem is specific to ml64 and its broken HLL capabilities**.
Note that the structure alignment applies to all segments and not just the stack!

**when doing low-level assembler programming it is no problem, of course

mineiro

You can see this problem better if you think in terms of odd, even, odd, even, ... (or, at some points, even, odd, even, odd, ...).
At the entry point of our Windows program, rsp=???8.

Before we do a 'call', the stack should be 16-byte aligned, so we subtract 8 bytes from rsp to get rsp=???0 and then do the call; on procedure entry rsp is back at rsp=???8.

Suppose we call a function that has only 4 parameters, with the configuration below:

4*8 because rcx, rdx, r8 and r9 will be saved on the stack, like function parameters.
7*8 for the locals.
3*8 because I'm supposing that the function with the most parameters called inside this procedure has 7 parameters; the first 4 go into registers, so we need 3 on the stack, and this space is reused by all other calls with fewer parameters. We set up for the function with the most parameters.

So our prologue can be
sub rsp, (4*8+7*8+3*8)
4+7+3 == 14; the result is even, which is not OK because rsp=???8 (odd) on entry. We need to insert a dummy qword (1*8) into this prologue to get the stack aligned.
This way you can free rbp.
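
The odd/even rule above can be written down as a small C sketch. It assumes a prologue with no pushes and rsp=???8 on entry, as described in this post; `frame_bytes` is a hypothetical name chosen for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes to subtract from rsp in a prologue without pushes, assuming rsp
   ends in 8 on entry. The qword count must be odd so that rsp ends in 0
   afterwards; if it is even, one dummy qword is added. */
static size_t frame_bytes(size_t shadow_qwords, size_t local_qwords,
                          size_t stack_arg_qwords)
{
    size_t qwords = shadow_qwords + local_qwords + stack_arg_qwords;
    if (qwords % 2 == 0)
        qwords += 1;                      /* the dummy (1*8) from the text */
    assert((qwords * 8 + 8) % 16 == 0);   /* entry offset 8 + frame = 16n  */
    return qwords * 8;
}
```

For the configuration above, 4+7+3 gives 14 qwords, so the sketch returns 15*8 = 120 bytes; a leaf-like case with only the 4 shadow qwords gives the familiar 5*8 = 40 (28h).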

The other way is:
at the entry point of our program, rsp=???8, but if we just do a call to a procedure without parameters, like 'call main', this will align the stack to a multiple of 16 bytes.
This is why I talked about odds and evens, and in my opinion this is the real challenge of coding for an x86-64 OS (the same problem exists on the Linux side, only the entry point is aligned to 16 bytes instead of 8, and instead of 4 register parameters you have 6; both are even). But I'm not talking about non-volatile registers in this context, so it's odds and evens again.

Dword locals are pseudo-promoted to qwords; this is why each one takes a qword, as has been said.

A simple test: at the entry point, call the printf function with 7 parameters, then with 8. One of these will be invalid.
Then do the same as before but with the stack aligned; one of them will be invalid again.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

jj2007

Quote from: qWord on July 19, 2016, 06:30:20 AM
Quote from: jj2007 on July 19, 2016, 06:02:48 AM
But why then align 8 for local variables? It doesn't make sense 8)
To save memory of course.

I meant the opposite: if they want the stack aligned to 16, why align only to 8 for a RECT? For a movaps xmm0, rc, you really need align 16 8)

Quote
Alignment
Most structures are aligned to their natural alignment. The primary exceptions are the stack pointer and malloc or alloca memory, which are aligned to 16 bytes

"Most" is nice :eusa_boohoo:

Besides, that stack alignment story should be investigated. Insert the equivalent of PrintLine Hex$(rsp) in your WndProc and see something amazing...

qWord

Quote from: jj2007 on July 19, 2016, 10:19:21 AM
I meant the opposite: If they want the stack align 16, why only align for a RECT? For a movaps xmm0, rc, you really need align 16 8)
There is no conflict: you have a defined alignment on entry, thus you can align the locals as you like.

jj2007

Quote from: qWord on July 19, 2016, 12:38:06 PM
there is no conflict: you have a defined alignment on entry thus you can align the locals as you like.

It doesn't seem so straightforward. In my tests, the WM_PAINT handler crashes if the PAINTSTRUCT is local, it works with a global ps. Padding with an extra qword left or right won't help. Apparently I've got the spill space wrong, BeginPaint writes into my locals :(

Besides, on entry into the WndProc, the stack is aligned to 8 bytes, not to 16. Windows doesn't follow the 16-byte alignment rule 8)
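
For reference, a small C sketch of the check under discussion. It assumes, in line with mineiro's description earlier in the thread (rsp=???8 on procedure entry), that the 16-byte rule applies just before the call instruction and that the pushed return address then leaves rsp ending in 8. All function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* True if rsp is a multiple of 16, as required just before a 'call'. */
static int aligned16_before_call(uint64_t rsp)
{
    return (rsp & 15) == 0;
}

/* True if rsp ends in 8, the value expected on procedure entry when the
   caller was aligned and 'call' has pushed the 8-byte return address. */
static int expected_on_entry(uint64_t rsp)
{
    return (rsp & 15) == 8;
}

/* Model of what 'call' does to rsp: push an 8-byte return address. */
static uint64_t rsp_after_call(uint64_t rsp)
{
    assert(aligned16_before_call(rsp));
    return rsp - 8;
}
```

Under that assumption, an rsp ending in 8 inside the WndProc, before its own prologue has run, would be the expected value and not a violation.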

jj2007

include \Masm32\MasmBasic\Res\JBasic.inc      ; click to download version 22 July 2016
; ### simple console demo, assembles in 32- or 64-bit mode with ML, AsmC, JWasm, HJWasm ###
j@start            ; OPT_64 1      ; put 0 for 32 bit, 1 for 64 bit assembly
  PrintLine Chr$("This code was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format")      ; OPT_Assembler ML      ; make sure you got the right versions
  Print Str$(" \ntest with multiple tabbed args:\n\t%i\t%i\t%i\n\n", 12, 13, 14)
  Print "Type a number: "
  Print Str$("The value of your number is %i\n", Val(Input$()))
  Print Chr$(jbit$, "-bit assembly is easy, it seems...")
j@end


Output:
This code was assembled with ml64 in 64-bit format

test with multiple tabbed args:
        12      13      14

Type a number: 123.456
The value of your number is 123
64-bit assembly is easy, it seems...


No 64-bit libraries required, a standard Masm32 installation plus MasmBasic are sufficient. Code sizes are slightly higher for 64-bit code, although the snippet above builds at 2048 bytes for both the 64- and 32-bit versions.

Attached a simple window with a RichEdit control. Open in RichMasm and hit F6.
EDIT: Removed. Use the new one further down, or go to menu File/New Masm source in RichMasm.

jj2007

Quote from: sinsi on July 16, 2016, 07:18:38 PM
If you write your own prologue you can pass "maximum param bytes" in the PROC declaration and adjust the stack once.

Implemented in the latest MasmBasic release.

Quote
One big gotcha is what happens to the upper 32 bits of a register when you manipulate the lower 32 bits.
"sub eax,eax" will zero the top 32 bits of rax. So "sub eax,eax" is the same as "sub rax,rax" except one byte smaller (no rex prefix).

Just stumbled over this one:
| 48 31 C0                  | xor rax, rax                      |
| 83 C8 FF                  | or eax, FFFFFFFF                  |
| 48 0D FF FF FF FF         | or rax, FFFFFFFFFFFFFFFF          |
| 31 C0                     | xor eax, eax                      |
| 48 31 C0                  | xor rax, rax                      |


Looks harmless, but watch out: one of the instructions does not behave as you might expect ::)

Mikl__

Ciao, jj2007!
xor eax, eax => movsx rax,eax => mov rax,0000000000000000h
or eax, 0FFFFFFFFh => movsx rax,eax => mov rax,0FFFFFFFFFFFFFFFFh

jj2007

Yes, exactly, that's the one:
   int 3
   xor rax, rax                     
   or eax, -1   ; no movsx rax,eax!!!
   or rax, -1
   xor eax, eax
   xor rax, rax


But sinsi wrote indeed "zero the top 32 bits of rax". Would movzx rax, eax be the correct rule in all cases?
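
The rule in question can be modeled in C. The sketch assumes the documented x86-64 behaviour that any write to a 32-bit destination register zero-extends into the full 64-bit register, while 16-bit writes leave the upper bits alone; the helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* A write to eax replaces all of rax: the upper 32 bits become zero,
   whatever they held before (zero extension, never sign extension). */
static uint64_t write_eax(uint64_t old_rax, uint32_t result)
{
    (void)old_rax;                 /* the old upper half is discarded */
    return (uint64_t)result;
}

/* A write to ax, by contrast, leaves the upper 48 bits untouched. */
static uint64_t write_ax(uint64_t old_rax, uint16_t result)
{
    return (old_rax & ~(uint64_t)0xFFFF) | result;
}

/* or eax, imm32 */
static uint64_t or_eax(uint64_t rax, uint32_t imm)
{
    return write_eax(rax, (uint32_t)rax | imm);
}

/* or rax, imm32: the immediate is sign-extended to 64 bits first. */
static uint64_t or_rax(uint64_t rax, int32_t imm)
{
    return rax | (uint64_t)(int64_t)imm;
}

/* Replays the sequence from the disassembly above. */
static void replay_sequence(void)
{
    uint64_t rax = 0;                        /* xor rax, rax */
    rax = or_eax(rax, 0xFFFFFFFFu);          /* or eax, -1   */
    assert(rax == 0x00000000FFFFFFFFull);    /* not all ones */
    rax = or_rax(rax, -1);                   /* or rax, -1   */
    assert(rax == 0xFFFFFFFFFFFFFFFFull);
    rax = write_eax(rax, 0);                 /* xor eax, eax */
    assert(rax == 0);
}
```

In this model "or eax, -1" leaves rax at 00000000FFFFFFFFh, which is the surprise in the sequence, and the zero-extension rule holds for every 32-bit write.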

jj2007

For the fans of qEditor:
- install the latest MasmBasic package
- run the 64-bit example once, i.e. click on (similar as 64-bit code) in line 12 of MbGuide.rtf and hit F6
- add a line in \Masm32\menus.ini e.g. under "console build all":
Build+Run &64,\MASM32\MasmBasic\Res\Bldallc64.bat "{b}"
- move the attached batch file to its location
- open one of the attached examples in qEditor and go to the Project menu

jj2007

Update 30 August 2016 (download):

During installation, you will see the MbGuide file (greenish background). Select Init in the (lower) 64-bit example on the first page, then hit F6.

Or, better, click on menu File/New Masm source, then on Dual 32/64 bit console/GUI templates, which compile both as 64- and 32-bit applications.

Feedback needed, this is not yet a stable release but for me it works fine. However, I wasted several hours chasing mysterious ML64 bugs, including "internal error" with hex dumps staring at me, for a source that had absolutely no problem with JWasm, HJWasm and AsmC. Besides, ML complains every now and then about "invalid characters" in the source; when retrying with the same identical file, it doesn't find the problem any more. Time to ditch this beast in favour of better and faster assemblers (where are you, Habran, Johnsa and Nidud??).

The current versions of the console and GUI templates (attached), though, still compile with all assemblers tested, in 64- and 32-bit mode. All you need is the current Masm32 installation plus MasmBasic - no other libs required. Btw, even with ML64, the jinvoke macro counts the CreateWindowEx parameters and barks at you if there aren't exactly 12 of them 8)

P.S. for the developers: have a look at \Masm32\MasmBasic\Res\DualWin.inc (search for SIZE_P)