Shadow space in 64-bit programming

Started by jj2007, March 08, 2021, 09:36:47 AM


jj2007

Shadow space (sometimes also called home space) is a poorly understood concept in 64-bit programming. To quote Microsoft (my highlighting):
Quote: The x64 Application Binary Interface (ABI) uses a four-register fast-call calling convention by default. Space is allocated on the call stack as a shadow store for callees to save those registers.
This means the shadow space is dedicated memory for saving four registers, namely rcx, rdx, r8 and r9.
The callee (i.e. the called proc) receives arguments 1-4 in registers rcx, rdx, r8 and r9. The callee can save them in the shadow space, but is not obliged to do so. In fact, debugging of, for example, CreateWindowEx shows that Windows often does not move the 4 registers into the shadow space; instead, it uses them directly ("fastcall").
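A minimal sketch of a callee that does save all four registers, in plain ml64 syntax without any prologue macro. HomeAllArgs is a hypothetical name, and the proc is assumed to be called with at least five arguments. The offsets follow from the fact that, on entry, [rsp] holds the return address and the caller's shadow space sits directly above it:

```asm
; Hypothetical frameless callee (fragment for a .code section, ml64 syntax).
HomeAllArgs proc
    mov [rsp+8],  rcx       ; arg1 -> its shadow space slot
    mov [rsp+16], rdx       ; arg2 -> its shadow space slot
    mov [rsp+24], r8        ; arg3 -> its shadow space slot
    mov [rsp+32], r9        ; arg4 -> its shadow space slot
    mov rax, [rsp+40]       ; arg5 was written to the stack by the caller
    add rax, [rsp+16]       ; arg2 can now be read back from memory
    ret                     ; returns arg5 + arg2 in rax
HomeAllArgs endp
```

With an rbp frame (push rbp, then mov rbp, rsp) the same four slots sit at [rbp+10h] to [rbp+28h].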

Below is a simple example of a procedure with 7 arguments. It uses JBasic (which is included in the MasmBasic package), but other packages will do it very much the same way. Source and exe are attached; when debugging, type y to stop at the int 3:
include \Masm32\MasmBasic\Res\JBasic.inc ; ## console demo, builds in 32- or 64-bit mode with UAsm, ML, AsmC ##
.code
useClv=0
SayHi proc arg1:SIZE_P, arg2, arg3, arg4, arg5, arg6, arg7  ; SIZE_P is QWORD for 64-bit and DWORD for 32-bit code
Local v1, v2, v3, v4, rc:RECT
  if @64 ; don't use in 32-bit code
mov arg1, rcx ; the callee uses one shadow space slot
jinvoke crt_printf, Chr$("Arg1 as register: %s", 13, 10), rcx
  else ; 32-bit code looks a bit simpler:
jinvoke crt_printf, Chr$("Arg1 'as is' in 32-bit: %s", 13, 10), arg1
  endif
  Print Str$("The arguments 1...7: %s %i %i %i %i %i %i", arg1, arg2, arg3, arg4, arg5, arg6, arg7)
  mov rc.left, 123
  Print Str$(" \nThe locals v1...v4 and rc.left: %i %i %i %i %i", v1, v2, v3, v4, rc.left)
  jinvoke MessageBox, 0, arg1, Chr$("Hi folks"), MB_OK or MB_SETFOREGROUND
  ret
SayHi endp
Init ; OPT_64 1 ; put 0 for 32 bit, 1 for 64 bit assembly
  PrintLine Chr$("This program was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format.")
  Inkey Chr$("Int 3?", 13, 10, 10)
  cmp rax, "y"
  jne @F
INT 3
@@:
  jinvoke SayHi, Chr$("The text for the MessageBox"), 2, 3, 4, 5, 6, 7
  jinvoke MessageBox, 0, Chr$("That was easy, right?"), Chr$("Title"), MB_OK
EndOfCode


Console output:
This program was assembled with ml64 in 64-bit format.
Int 3?

Arg1 as register: The text for the MessageBox
The arguments 1...7: The text for the MessageBox 0 0 0 5 6 7
The locals v1...v4 and rc.left: 0 0 0 0 123


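As you can see, arguments 2-4 print as zero: the caller passed them in rdx, r8 and r9 only, and SayHi never copied these registers into their shadow space slots, so reading arg2...arg4 from memory returns whatever happened to be there (here: zeros). Here is the disassembly: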
sub rsp, 8*QWORD ; 64 bytes reserved somewhere higher up, e.g. on entry of a proc
...
; jinvoke SayHi, Chr$("The text for the MessageBox"), 2, 3, 4, 5, 6, 7 ; seven arguments
; mov qword ptr ss:[rsp+38],8                             | 38h=56 dec: unused here (we have only 7 args)
mov qword ptr ss:[rsp+30],7                               | 30h=48 dec: arg7
mov qword ptr ss:[rsp+28],6                               | 28h=40 dec: arg6
mov qword ptr ss:[rsp+20],5                               | 20h=32 dec: arg5

; NO mov qword ptr ss:[rsp+18],4                          | 18h=24 dec: the arg4 slot is shadow space and does
mov r9d,4                                                 | NOT get filled; instead, arg4...arg1 go to
mov r8d,3                                                 | the registers r9, r8, rdx, rcx
mov edx,2                                                 | using the required size: normally DWORD, but
lea rcx,qword ptr ds:[140001521]                          | QWORD for the pointer to "The text for the MessageBox"

call <sub_140001002>                                      | call SayHi: stack is 000000000012FF00
...
add rsp, 8*QWORD ; release the 64 reserved bytes
; SayHi proc arg1:SIZE_P, arg2, arg3, arg4, arg5, arg6, arg7 ; arg1...arg4 are the shadow space
push rbp                                                  | create a
mov rbp,rsp                                               | stack frame
sub rsp,B0                                                | create space for the local variables
mov qword ptr ss:[rbp+10],rcx                             | use the arg1 shadow space for storing rcx
mov rdx,rcx                                               | arg2: "The text for the MessageBox"
lea rcx,qword ptr ds:[140001448]                          | arg1: "Arg1 as register: %s\r\n"
call qword ptr ds:[<sub_140001808>]                       | CRT printf
...
leave                                                     | the short and efficient way
ret                                                       | to get rid of the stack frame


In the example above, the seven arguments require at least 7 QWORDs on the stack, but we used 8:
sub rsp, 8*QWORD      ; 64 bytes reserved somewhere higher up, e.g. on entry of a proc

If you plan to call e.g. CreateWindowEx with its 12 arguments, reserve 12 QWORDs (or any larger even number, so that the size stays a multiple of 16 bytes). Remember that the stack must always be aligned to 16 bytes, see above: call SayHi: stack is 000000000012FF00. At the moment of the call, the last hex digit of rsp must be 0, i.e. the low four bits are clear.
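The rule of thumb can be written down as assembly-time arithmetic. This is only a sketch in plain MASM syntax; nArgs, slots and resBytes are hypothetical names and not part of JBasic. It assumes rsp is already 16-byte aligned before the sub, e.g. right after push rbp on entry of a proc:

```asm
nArgs    = 7                            ; argument count of the biggest call in the proc
slots    = nArgs
IF slots LT 4
  slots  = 4                            ; the four shadow space slots are always reserved
ENDIF
resBytes = ((slots*8 + 15) / 16) * 16   ; round up to a multiple of 16 bytes

    sub rsp, resBytes                   ; 7 args -> 64 bytes, i.e. the 8*QWORD used above
    ; ... store arg5 and higher at [rsp+20h], [rsp+28h], ...,
    ;     load rcx, rdx, r8, r9, then call ...
    add rsp, resBytes
```

Seven arguments need 56 bytes, which rounds up to 64; that is why the example reserves 8 QWORDs and not 7.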

jj2007

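Here comes a handy macro for the debugger, to be put just before a call xxx (more precisely: before the arguments are set up, since it overwrites rcx, rdx and the stack slots at and above rsp):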

FillShadowSpace macro sBytes:=<80h>
lea rdx, [rsp+8]
mov qword ptr [rdx-32], "Ldne" ; endLocal
mov qword ptr [rdx-28], "laco" ; endLocal
mov qword ptr [rdx-24], "_pbr" ; rbp_rbp_
mov qword ptr [rdx-20], "_pbr" ; rbp_rbp_
mov qword ptr [rdx-8], 522D2D3Ch ; <--RetAd
mov qword ptr [rdx-4], "dAte" ; <--RetAd
xor ecx, ecx
@@: mov byte ptr [rdx], "x"
inc rdx
inc ecx
cmp ecx, sBytes
jb @B
ENDM


Before stepping into the subproc, start a dump at address rsp, then scroll some lines to see what happens below rsp.

jj2007

In general, rbp is used as the "frame register", but M$ allows other nonvolatile registers (for exception handling), too:

https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=msvc-160
Quote: Frame register

If nonzero, then the function uses a frame pointer (FP), and this field is the number of the nonvolatile register used as the frame pointer, using the same encoding for the operation info field of UNWIND_CODE nodes.

Frame register offset (scaled)

If the frame register field is nonzero, this field is the scaled offset from RSP that is applied to the FP register when it's established. The actual FP register is set to RSP + 16 * this number, allowing offsets from 0 to 240. This offset permits pointing the FP register into the middle of the local stack allocation for dynamic stack frames, allowing better code density through shorter instructions. (That is, more instructions can use the 8-bit signed offset form.)
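To make the quoted formula concrete, here is a sketch, modelled on the pattern in Microsoft's ml64 documentation, of a proc that points rbp into the middle of a 256-byte local area. MidFrame is a hypothetical name and the sizes are arbitrary; .pushreg, .allocstack, .setframe and .endprolog are the ml64 unwind directives, which only apply to a PROC FRAME:

```asm
MidFrame proc frame
    push rbp
    .pushreg rbp
    sub rsp, 100h               ; 256 bytes of locals
    .allocstack 100h
    lea rbp, [rsp+80h]          ; FP = RSP + 16*8, so the scaled offset field is 8
    .setframe rbp, 80h
    .endprolog
    mov qword ptr [rbp-80h], 1  ; lowest local QWORD:  disp8 = -128
    mov qword ptr [rbp+78h], 2  ; highest local QWORD: disp8 = +120
    lea rsp, [rbp+80h]          ; epilogue: release the locals
    pop rbp
    ret
MidFrame endp
```

With rbp in the middle, all 256 bytes of locals are reachable with the short 8-bit displacement form; with rbp at the top of the frame only the upper 128 bytes would be.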

For my JBasic library I will stick to rbp frames. In 64-bit code, there is one good reason to always use a frame, even if there are no local variables and no arguments: if you want to use one or more nonvolatile registers, how do you save and restore them?
mov global_rsi, rsi  ; 2*7=14 bytes
...
mov rsi, global_rsi

mov [rbp-80h], rsi  ; 2*4=8 bytes
...
mov rsi, [rbp-80h]


The enter...leave sequence needed to create a frame is 5 bytes, so even if you save only one nonvolatile register to a local variable, your code is already one byte shorter, compared to a "global" solution.

Plus, any access to arguments is one byte shorter as compared to e.g. mov rax, [rsp+x].
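The byte counts can be checked against the instruction encodings. These are the standard x86-64 encodings as a debugger or listing file shows them; global_rsi is assumed to be addressed RIP-relative, which is what ml64 generates for a plain global variable:

```asm
mov global_rsi, rsi     ; 48 89 35 xx xx xx xx   7 bytes (32-bit RIP-relative displacement)
mov [rbp-80h], rsi      ; 48 89 75 80            4 bytes (8-bit displacement)
enter 80h, 0            ; C8 80 00 00            4 bytes
leave                   ; C9                     1 byte
mov rax, [rbp+10h]      ; 48 8B 45 10            4 bytes
mov rax, [rsp+10h]      ; 48 8B 44 24 10         5 bytes: rsp as base register needs a SIB byte
```

So save plus restore costs 14 bytes with a global and 8 + 5 = 13 bytes with a frame, and every rsp-based access pays one extra byte for the SIB.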

Note that, instead of enter nn, 0, you can use a faster...
  push rbp
  mov rbp, rsp
  sub rsp, nn

... sequence. You will gain about 300 ms if you call the proc a hundred million times. In case you want to do some benchmarks, here is a testbed:
align 16
testNoFrame:
  mov [rsp+8], rcx ; first arg
  mov rax, [rsp+8]
  sub rax, 12345677h
  ret
align 16
testPush:
  push rbp
  mov rbp, rsp
  sub rsp, 80h
  mov [rbp+16], rcx ; first arg
  mov rax, [rbp+16]
  sub rax, 12345677h
  leave
  ret
align 16
testEnter:
  enter 80h, 0
  mov [rbp+16], rcx ; first arg
  mov rax, [rbp+16]
  sub rax, 12345677h
  leave
  ret


All three are used like this: mov rcx, 12345678h, call test***, as shown below:
  loops=100000000
  PrintLine Str$("%i Million loops\n", loops/1000000)
  REPEAT 5
  mov t0, rv(GetTickCount)
  mov loopCt, loops-1
  @@: mov rcx, 12345678h
call testPush
dec loopCt
jns @B
  sub t0, rv(GetTickCount)
  neg t0
  Print Str$("Push:    ticks=%i\n", t0)
  mov t0, rv(GetTickCount)
  mov loopCt, loops-1
  @@: mov rcx, 12345678h
call testEnter
dec loopCt
jns @B
  sub t0, rv(GetTickCount)
  neg t0
  Print Str$("Enter:   ticks=%i\n", t0)
  mov t0, rv(GetTickCount)
  mov loopCt, loops-1
  @@: mov rcx, 12345678h
call testNoFrame
dec loopCt
jns @B
  sub t0, rv(GetTickCount)
  neg t0
  Print Str$("NoFrame: ticks=%i\n\n", t0)
  ENDM
  Inkey "hit any key"
  Exit


I attach an exe; since I am playing with many other things, the source is still too messy to be shown ;-)

jj2007

A hint for the developers of custom prologues (same behaviour for ML64, UAsm and AsmC):

JBasicProlog MACRO procname, flags, argbytes, localbytes, reglist, userparms
  jbProc$ equ <procname>
  % echo ### we are in the jbProc$ prolog ###
  ...

PrintFileSize proc uses rsi rdi rbx arg1
tmp$ CATSTR <*** #arguments used in >, jbProc$, <: >, %jbArgsUsed, < ***>
% echo tmp$


Echoes in the output window:
*** #arguments used in DoTheWrite: 0 ***
### we are in the PrintFileSize prolog ###


This caused me some headaches, so I share it with you: above the PrintFileSize proc there was another proc called DoTheWrite. Why do we see the wrong jbProc$ then?

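Because the PROLOG kicks in only when there is something to do, i.e. when the assembler reaches the first instruction inside the proc! The tmp$ line gets evaluated before that, while jbProc$ still holds the name of the previous proc. Once you insert a nop between PrintFileSize proc and tmp$, everything is as you expect it...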