Shadow space (sometimes also called home space) is a badly understood concept in 64-bit programming. To quote Microsoft (https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-160) (my highlighting):
Quote: "The x64 Application Binary Interface (ABI) uses a four-register fast-call calling convention by default. Space is allocated on the call stack as a shadow store for callees to save those registers."
This means that the shadow space is memory dedicated to saving exactly four registers: rcx, rdx, r8 and r9.
The callee (i.e. the called proc) receives arguments 1-4 in the registers rcx, rdx, r8 and r9. The callee can save them in the shadow space, but is not obliged to do so. In fact, debugging CreateWindowEx, for example, shows that Windows often does not move the 4 registers into the shadow space; instead, it uses them directly ("fastcall").
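To make the layout concrete, here is a minimal Python sketch (not part of JBasic; the function names are made up for illustration) that computes where each integer or pointer argument lives at the moment the callee is entered, assuming the standard Windows x64 convention with the return address at [rsp]:

```python
# Sketch of the Windows x64 stack layout as seen by a callee on entry.
# Only integer/pointer arguments are modelled; names are hypothetical.
HOME_REGS = ("rcx", "rdx", "r8", "r9")

def slot_offset_at_entry(arg_no):
    """Offset from rsp at callee entry; [rsp+0] holds the return address."""
    if arg_no < 1:
        raise ValueError("arguments are numbered from 1")
    return 8 * arg_no

def describe_arg(arg_no):
    """Say where argument arg_no is passed and which stack slot belongs to it."""
    off = slot_offset_at_entry(arg_no)
    if arg_no <= len(HOME_REGS):
        reg = HOME_REGS[arg_no - 1]
        return f"arg{arg_no}: in {reg}, shadow slot at [rsp+{off:02X}h]"
    return f"arg{arg_no}: on the stack at [rsp+{off:02X}h]"

if __name__ == "__main__":
    for n in range(1, 8):
        print(describe_arg(n))
```

After a push rbp / mov rbp, rsp prologue, every offset grows by 8 relative to rbp, so the arg1 slot sits at [rbp+10h]; that is the slot the SayHi disassembly further down writes rcx to.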
Below is a simple example of a procedure with 7 arguments. It uses JBasic (which is included in the MasmBasic package (http://masm32.com/board/index.php?topic=94.0)), but other packages will do it in much the same way. Source and exe are attached; when debugging, type y to stop at the int 3:
include \Masm32\MasmBasic\Res\JBasic.inc ; ## console demo, builds in 32- or 64-bit mode with UAsm, ML, AsmC ##
.code
useClv=0
SayHi proc arg1:SIZE_P, arg2, arg3, arg4, arg5, arg6, arg7 ; SIZE_P is QWORD for 64-bit and DWORD for 32-bit code
Local v1, v2, v3, v4, rc:RECT
if @64 ; don't use in 32-bit code
mov arg1, rcx ; the callee uses one shadow space slot
jinvoke crt_printf, Chr$("Arg1 as register: %s", 13, 10), rcx
else ; 32-bit code looks a bit simpler:
jinvoke crt_printf, Chr$("Arg1 'as is' in 32-bit: %s", 13, 10), arg1
endif
Print Str$("The arguments 1...7: %s %i %i %i %i %i %i", arg1, arg2, arg3, arg4, arg5, arg6, arg7)
mov rc.left, 123
Print Str$(" \nThe locals v1...v4 and rc.left: %i %i %i %i %i", v1, v2, v3, v4, rc.left)
jinvoke MessageBox, 0, arg1, Chr$("Hi folks"), MB_OK or MB_SETFOREGROUND
ret
SayHi endp
Init ; OPT_64 1 ; put 0 for 32 bit, 1 for 64 bit assembly
PrintLine Chr$("This program was assembled with ", @AsmUsed$(1), " in ", jbit$, "-bit format.")
Inkey Chr$("Int 3?", 13, 10, 10)
cmp rax, "y"
jne @F
INT 3
@@:
jinvoke SayHi, Chr$("The text for the MessageBox"), 2, 3, 4, 5, 6, 7
jinvoke MessageBox, 0, Chr$("That was easy, right?"), Chr$("Title"), MB_OK
EndOfCode
Console output:
This program was assembled with ml64 in 64-bit format.
Int 3?
Arg1 as register: The text for the MessageBox
The arguments 1...7: The text for the MessageBox 0 0 0 5 6 7
The locals v1...v4 and rc.left: 0 0 0 0 123
As you can see, arguments 2-4 show up as zero: their shadow space slots were never written, because the callee saved only rcx. Here is the disassembly:
sub rsp, 8*QWORD ; 64 bytes reserved somewhere higher up, e.g. on entry of a proc
...
; jinvoke MyProc, 1, 2, 3, 4, 5, 6, 7 ; seven arguments
; mov qword ptr ss:[rsp+38],8 | 38h=56 dec: unused here (we have only 7 args)
mov qword ptr ss:[rsp+30],7 | 30h=48 dec: arg7
mov qword ptr ss:[rsp+28],6 | 28h=40 dec: arg6
mov qword ptr ss:[rsp+20],5 | 20h=32 dec: arg5
; NO mov qword ptr ss:[rsp+18],4 | 18h=24 dec: the arg4 slot is shadow space and does
mov r9d,4 | NOT get filled; instead, arg4...arg1 go to
mov r8d,3 | the registers r9, r8, rdx, rcx
mov edx,2 | using the required size: normally DWORD, but
lea rcx,qword ptr ds:[140001521] | QWORD for the pointer to "The text for the MessageBox"
call <sub_140001002> | call SayHi: stack is 000000000012FF00
...
add rsp, 8*QWORD ; release the 64 reserved bytes
; SayHi proc arg1:SIZE_P, arg2, arg3, arg4, arg5, arg6, arg7 ; arg1...arg4 are the shadow space
push rbp | create a
mov rbp,rsp | stack frame
sub rsp,B0 | create space for the local variables
mov qword ptr ss:[rbp+10],rcx | use the arg1 shadow space for storing rcx
mov rdx,rcx | arg2: "The text for the MessageBox"
lea rcx,qword ptr ds:[140001448] | arg1: "Arg1 as register: %s\r\n"
call qword ptr ds:[<sub_140001808>] | CRT printf
...
leave | the short and efficient way
ret | to get rid of the stack frame
In the example above, the seven arguments require at least 7 QWORDs on the stack, but we used 8:
sub rsp, 8*QWORD ; 64 bytes reserved somewhere higher up, e.g. on entry of a proc
If you plan to call e.g. CreateWindowEx with its 12 arguments, reserve 12 or 16 QWORDs. Remember that the stack must always be aligned to 16 bytes, see above:
call SayHi: stack is 000000000012FF00. At the moment of the call, the least significant hex digit of rsp must be 0.
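The rule of thumb for the size of the reserved area can be written down as a small Python sketch. It assumes that rsp is already 16-byte aligned before the sub rsp (as it is after push rbp in an rbp-framed proc); the function names are hypothetical:

```python
# Sketch: how many bytes to reserve for the outgoing arguments of a call.
QWORD = 8
SHADOW_SLOTS = 4  # rcx, rdx, r8, r9 always get a slot, even if unused

def bytes_to_reserve(n_args):
    """Argument area for a call with n_args arguments, rounded up to 16 bytes."""
    if n_args < 0:
        raise ValueError("n_args must not be negative")
    slots = max(n_args, SHADOW_SLOTS)
    raw = slots * QWORD
    return (raw + 15) // 16 * 16

def is_call_aligned(rsp):
    """True if rsp satisfies the 16-byte rule at the moment of the call."""
    return rsp % 16 == 0

if __name__ == "__main__":
    for n in (0, 4, 7, 12):
        print(n, "args ->", bytes_to_reserve(n), "bytes")
```

For the 7 arguments of SayHi this yields 64 bytes, i.e. the 8 QWORDs used in the example; a 12-argument call needs 96 bytes, i.e. 12 QWORDs.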
Here is a handy macro for debugging, to be put just before a call xxx. It overwrites rcx, rdx and the stack from rsp+8 upwards, so it must run before the arguments are loaded:
FillShadowSpace macro sBytes:=<80h>
lea rdx, [rsp+8]
mov qword ptr [rdx-32], "Ldne" ; endLocal
mov qword ptr [rdx-28], "laco" ; endLocal
mov qword ptr [rdx-24], "_pbr" ; rbp_rbp_
mov qword ptr [rdx-20], "_pbr" ; rbp_rbp_
mov qword ptr [rdx-8], 522D2D3Ch ; <--RetAd
mov qword ptr [rdx-4], "dAte" ; <--RetAd
xor ecx, ecx
@@: mov byte ptr [rdx], "x"
inc rdx
inc ecx
cmp ecx, sBytes
jb @B
ENDM
Before stepping into the subproc, start a dump at address rsp, then scroll some lines to see what happens below rsp.
In general, rbp is used as the "frame register", but M$ allows other nonvolatile registers (for exception handling), too:
https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=msvc-160
Quote:
Frame register
If nonzero, then the function uses a frame pointer (FP), and this field is the number of the nonvolatile register used as the frame pointer, using the same encoding for the operation info field of UNWIND_CODE nodes.
Frame register offset (scaled)
If the frame register field is nonzero, this field is the scaled offset from RSP that is applied to the FP register when it's established. The actual FP register is set to RSP + 16 * this number, allowing offsets from 0 to 240. This offset permits pointing the FP register into the middle of the local stack allocation for dynamic stack frames, allowing better code density through shorter instructions. (That is, more instructions can use the 8-bit signed offset form.)
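The arithmetic in the quote is simple enough to sketch in Python. This only models the formula "RSP + 16 * this number"; the function name and the sample rsp value are made up for illustration:

```python
# Sketch of the scaled frame register offset described in the quote above.
def frame_pointer(rsp, scaled_offset):
    """Address the frame register is set to, given rsp after the stack allocation."""
    if not 0 <= scaled_offset <= 15:
        raise ValueError("offsets 0..240 in steps of 16 mean a field value of 0..15")
    return rsp + 16 * scaled_offset

if __name__ == "__main__":
    rsp = 0x12FE00  # hypothetical value
    for field in (0, 8, 15):
        print(field, hex(frame_pointer(rsp, field)))
```

With a field value of 8 the frame register points 128 bytes above rsp, so the signed 8-bit displacements -128...+127 reach everything from rsp to rsp+255. That is the code density argument made in the quote.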
For my JBasic library I will stick to rbp frames. In 64-bit code, there is one good reason to always use a frame, even if there are no local variables and no arguments: if you want to use one or more nonvolatile registers, how do you save and restore them?
; Option 1: save to a global variable
mov global_rsi, rsi ; 2*7=14 bytes
...
mov rsi, global_rsi
; Option 2: save to a local variable in the frame
mov [rbp-80h], rsi ; 2*4=8 bytes
...
mov rsi, [rbp-80h]
The enter...leave sequence needed to create a frame is 5 bytes, so even if you save only one nonvolatile register to a local variable, your code is already one byte shorter (5+8=13 instead of 14 bytes), compared to a "global" solution.
Plus, any access to arguments is one byte shorter as compared to e.g. mov rax, [rsp+x].
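A quick Python sketch of this byte count, for those who want to see how the gap grows with the number of saved registers. The instruction sizes are the ones given above (7 bytes per access to a global, 4 bytes per access via rbp with an 8-bit displacement, 5 bytes for enter plus leave); the names are hypothetical:

```python
# Sketch: code size of saving/restoring n nonvolatile registers.
MOV_GLOBAL = 7   # mov global_rsi, rsi / mov rsi, global_rsi
MOV_FRAME = 4    # mov [rbp-80h], rsi / mov rsi, [rbp-80h] (8-bit displacement)
FRAME_COST = 5   # enter nn, 0 (4 bytes) + leave (1 byte)

def global_bytes(n_regs):
    """One save and one restore per register, no frame needed."""
    return 2 * MOV_GLOBAL * n_regs

def frame_bytes(n_regs):
    """Frame setup/teardown plus one save and one restore per register."""
    return FRAME_COST + 2 * MOV_FRAME * n_regs

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "register(s):", global_bytes(n), "vs", frame_bytes(n), "bytes")
```

The sketch assumes that all saved registers stay within reach of an 8-bit displacement from rbp; beyond that, each frame access grows by 3 bytes.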
Note that, instead of enter nn, 0, you can use this faster sequence:
push rbp
mov rbp, rsp
sub rsp, nn
You will gain about 300 ms if you call the proc a hundred million times. In case you want to do some benchmarks, here is a testbed:
align 16
testNoFrame:
mov [rsp+8], rcx ; first arg
mov rax, [rsp+8]
sub rax, 12345677h
ret
align 16
testPush:
push rbp
mov rbp, rsp
sub rsp, 80h
mov [rbp+16], rcx ; first arg
mov rax, [rbp+16]
sub rax, 12345677h
leave
ret
align 16
testEnter:
enter 80h, 0
mov [rbp+16], rcx ; first arg
mov rax, [rbp+16]
sub rax, 12345677h
leave
ret
All three are used like this: mov rcx, 12345678h, call test***, as shown below:
loops=100000000
PrintLine Str$("%i Million loops\n", loops/1000000)
REPEAT 5
mov t0, rv(GetTickCount)
mov loopCt, loops-1
@@: mov rcx, 12345678h
call testPush
dec loopCt
jns @B
sub t0, rv(GetTickCount)
neg t0
Print Str$("Push: ticks=%i\n", t0)
mov t0, rv(GetTickCount)
mov loopCt, loops-1
@@: mov rcx, 12345678h
call testEnter
dec loopCt
jns @B
sub t0, rv(GetTickCount)
neg t0
Print Str$("Enter: ticks=%i\n", t0)
mov t0, rv(GetTickCount)
mov loopCt, loops-1
@@: mov rcx, 12345678h
call testNoFrame
dec loopCt
jns @B
sub t0, rv(GetTickCount)
neg t0
Print Str$("NoFrame: ticks=%i\n\n", t0)
ENDM
Inkey "hit any key"
Exit
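To put the tick counts into perspective, it helps to convert them into time per call. A minimal Python sketch, assuming the ticks are GetTickCount milliseconds as in the loop above (the function name is made up):

```python
# Sketch: convert a GetTickCount difference into nanoseconds per call.
def ns_per_call(ticks_ms, loops):
    """Average time per loop iteration in nanoseconds."""
    if loops <= 0:
        raise ValueError("loops must be positive")
    return ticks_ms * 1_000_000 / loops

if __name__ == "__main__":
    print(ns_per_call(300, 100_000_000), "ns per call")
```

A difference of 300 ms over a hundred million calls therefore corresponds to about 3 ns per call. Note that the measured time includes the loop overhead (mov rcx, call, dec, jns), and that GetTickCount typically has a resolution of only 10-16 ms, so small differences between runs should not be over-interpreted.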
I attach an exe; since I am playing with many other things, the source is still too confused to be shown ;-)
A hint for the developers of custom prologues (same behaviour for ML64, UAsm and AsmC):
JBasicProlog MACRO procname, flags, argbytes, localbytes, reglist, userparms
jbProc$ equ <procname>
% echo ### we are in the jbProc$ prolog ###
...
PrintFileSize proc uses rsi rdi rbx arg1
tmp$ CATSTR <*** #arguments used in >, jbProc$, <: >, %jbArgsUsed, < ***>
% echo tmp$
This echoes in the output window:
*** #arguments used in DoTheWrite: 0 ***
### we are in the PrintFileSize prolog ###
This caused me some headaches, so I share it with you: above the PrintFileSize proc there was another proc called DoTheWrite. Why do we see the wrong jbProc$ then?
Because the PROLOG kicks in only when there is something to do! Once you insert a nop between PrintFileSize proc and tmp$, everything is as you expect it...