I have tested an idea that dates from the days of Win 3.0, when writing a DLL meant working with very limited stack space and one of the techniques was to place the equivalent of a LOCAL variable in the uninitialised data section. In Win64 the virtue of not using locals in the ordinary sense is that you don't disturb the stack alignment, so you can run procedures without a stack frame at all. In the source code below you will see the use of the macros,
LOCAL64 pMem
LOCAL64 hMem
LOCAL64 tc
which simply reserve the data space in the BSS section, but the variable can only be accessed locally within a procedure. There are a number of macros in the "macros64.inc" file that handle normal API calls: "fn64" and "rv64" wrap the invoke macro and protect RSP on the way in and out.
I have not yet given much thought to what you need to do with structures. In Win32 you simply allocated the space on the stack, but with stack alignment being problematic in Win64 I will have to see if there is another way to do it without having to fill out every structure in the .data? section.
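As a point of comparison for the structure question, here is a minimal sketch of the plain .data? approach in ml64 syntax. It is only one option, not something the test piece does. MYRECT, rc and fillrect are hypothetical names made up for the illustration, and nothing here depends on "macros64.inc".

```asm
; Hypothetical structure, declared here only for illustration.
MYRECT STRUCT
    left    DWORD ?
    top     DWORD ?
    right   DWORD ?
    bottom  DWORD ?
MYRECT ENDS

.data?
    rc MYRECT <>                ; reserved in uninitialised data, no field values supplied

.code
fillrect proc
    mov rc.left, 0              ; fields are addressed by name, no stack frame involved
    mov rc.top, 0
    mov rc.right, 640
    mov rc.bottom, 480
    lea rax, rc                 ; address of the structure, e.g. to hand to an API call
    ret
fillrect endp
```

The trade-off is the same as with any static storage: the structure is shared by every call to the procedure, so it is not safe for recursion or for two threads running the same code.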
This is the test piece source.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION DOTNAME
option casemap:none
include \masm64\include\win64.inc
include \masm64\include\temphls.inc
include \masm64\include\kernel32.inc
include \masm64\include\user32.inc
include \masm64\include\msvcrt.inc
includelib \masm64\lib\user32.lib
includelib \masm64\lib\kernel32.lib
includelib \masm64\lib\msvcrt.lib
include macros64.inc
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
; --------------
; gigabyte count
; --------------
gbcnt equ <4>
.data
pttl db "Milliseconds duration",0
ptxt dq pttl
tmsg db "The following operation benchmarks 4 gig memory copy",0
pmsg dq tmsg
titl db "Proceed",0
capt dq titl
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
main proc
LOCAL64 pMem
LOCAL64 hMem
LOCAL64 tc
; -----------------------------------------------------------
; the following macros provide spill space for four 64 bit registers
; plus 8 bytes on entry and restore the stack on exit.
; -----------------------------------------------------------
mov pMem, rv64(GlobalAlloc,GMEM_FIXED or GMEM_ZEROINIT,1024*1024*1024*gbcnt)
mov hMem, alloc64(1024*1024*1024*gbcnt)
fn64 MessageBox,0,pmsg,capt,0 ; introduction text message
mov tc, rv64(GetTickCount)
; -----------------------------
; copy memory from pMem to hMem
; -----------------------------
mov rcx, pMem
mov rdx, hMem
mov r8, 1024*1024*1024*gbcnt
call mcopy64
sub rv64(GetTickCount), tc ; calculate duration
mov tc, rax
fn64 MessageBox,0,str64$(tc),ptxt,0 ; display millisecond timing
free64 pMem ; release memory
free64 hMem ; release memory
exit64 0 ; exit the process
main endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
mcopy64 proc
; rcx = source address
; rdx = destination address
; r8 = byte count
; --------------
; save rsi & rdi
; --------------
mov r11, rsi
mov r10, rdi
cld
mov rsi, rcx
mov rdi, rdx
mov rcx, r8
shr rcx, 3
rep movsq
mov rcx, r8
and rcx, 7
rep movsb
; -----------------
; restore rsi & rdi
; -----------------
mov rdi, r10
mov rsi, r11
retn
mcopy64 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
comment #
https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx
Volatile
rax rcx rdx r8 r9 r10 r11
Non Volatile
r12 r13 r14 r15 rdi rsi rbx rbp rsp
Volatile
xmm0 ymm0
xmm1 ymm1
xmm2 ymm2
xmm3 ymm3
xmm4 ymm4
xmm5 ymm5
Nonvolatile (XMM), Volatile (upper half of YMM)
xmm6-15
ymm6-15
#
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
Something I should have added: on this Haswell box the 4 gig copy runs at about 1.5 gig a second, which was comfortably fast enough in Win32 but is clearly off the pace in Win64. If anyone has done a faster memory copy routine than the one above that uses REP MOVSQ, I would be interested in seeing it. I have in the past done copy routines that used MOVDQA/U but they were not all that fast.
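One variant that could be timed against the REP MOVSQ version is a copy loop that uses non-temporal stores (MOVNTDQ), which write around the cache. The sketch below is only a candidate for testing, not a measured improvement. The name ntcopy64 is hypothetical, the register interface mirrors mcopy64, and it assumes both buffers are 16-byte aligned and the byte count is a multiple of 64 (true for the 4 gig test size).

```asm
; rcx = source address      (assumed 16-byte aligned)
; rdx = destination address (assumed 16-byte aligned)
; r8  = byte count          (assumed to be a multiple of 64)
ntcopy64 proc
    shr r8, 6                               ; number of 64-byte blocks
    jz ntdone
  ntloop:
    movdqa xmm0, xmmword ptr [rcx]          ; aligned loads, xmm0-xmm3 are volatile
    movdqa xmm1, xmmword ptr [rcx+16]
    movdqa xmm2, xmmword ptr [rcx+32]
    movdqa xmm3, xmmword ptr [rcx+48]
    movntdq xmmword ptr [rdx], xmm0         ; non-temporal stores bypass the cache
    movntdq xmmword ptr [rdx+16], xmm1
    movntdq xmmword ptr [rdx+32], xmm2
    movntdq xmmword ptr [rdx+48], xmm3
    add rcx, 64
    add rdx, 64
    sub r8, 1
    jnz ntloop
    sfence                                  ; make the non-temporal stores globally visible
  ntdone:
    ret
ntcopy64 endp
```

It only touches volatile registers, so like mcopy64 it needs no stack frame. Whether it beats REP MOVSQ on a given machine is something only a timing run can tell.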
This technique resembles the SetGlobals (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1015) logic: syntax as in LOCAL, but use a .data? location and reference it using a persistent register, ebx in the case of SetGlobals.
I use it very often in my projects. The pro is that you get short code from [ebx-128] to [ebx+127] and persistent variables. The con is that you need to keep in mind that one register is blocked. For frequently used variables it really reduces code size, and probably also improves speed, given that you have about 64 frequently used dwords (for example) at one compact memory location which will almost certainly always be in cache.
Another con: variable names are obtained through equates, e.g. myvar equ dword ptr [ebx-40]. Olly doesn't know about that equate, so disassembly is a little bit more difficult.
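To make the equate idea concrete, here is a minimal hand-written sketch of the same pattern in 64-bit MASM. It is not the SetGlobals macro itself. The names gblock, counter, total and useglobals are hypothetical, and rbx is simply assumed to be the register set aside for the purpose.

```asm
; Hypothetical 256-byte block of globals; the names are illustrative only.
.data?
    gblock  db 256 dup (?)

; Text equates give the slots LOCAL-style names.
counter equ <dword ptr [rbx-128]>       ; first slot of the block
total   equ <qword ptr [rbx-120]>       ; a QWORD slot 8 bytes further on

.code
useglobals proc
    push rbx                            ; rbx is non-volatile in Win64
    lea rbx, gblock+128                 ; point rbx at the middle of the block
    mov counter, eax                    ; 89 43 80     (3 bytes)
    mov total, rax                      ; 48 89 43 88  (4 bytes)
    pop rbx
    ret
useglobals endp
```

Pointing the register at the middle of the block is what lets the whole 256 bytes be reached with a signed 8-bit displacement.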
No registers involved. This is the macro.
LOCAL64 MACRO arg1
LOCAL var, pvar
.data?
var QWORD ?
.data
pvar QWORD var
.code
arg1 = pvar
ENDM
It's a re-assignable equate that is renamed to the user's preference. What I have not nutted out yet is how to do structures.
This is the disassembly.
0x00000001`40001000: 4883EC28 sub rsp,0x28
0x00000001`40001004: B940000000 mov ecx,0x0000000000000040
0x00000001`40001009: 48BA000000000100 mov rdx,0x0000000100000000
0x00000001`40001013: FF15EF0F0000 call qword ptr [kernel32.dll!GlobalAlloc]
0x00000001`40001019: 4883C428 add rsp,0x28
0x00000001`4000101D: 48890547200000 mov qword ptr [0x000000014000306B],rax
The pseudo local variable is the QWORD address on the last line that rax is copied into.
Quote from: hutch-- on June 30, 2016, 07:33:17 PM
No registers involved.
You are accessing globals by their offset, and that means relatively long encodings. Can't speak for x64 yet, but in x32 it makes a real difference (EDIT: for x64, it's 4 bytes vs 7 bytes):
include \masm32\MasmBasic\MasmBasic.inc
globalint dd ? ; a real global
SetGlobals someint, rc:RECT, double srcR8, destR8 ; ebx globals
Init
mov someint, eax
mov rc.right, eax
fld srcR8
fstp destR8
nop
mov globalint, eax
nop
EndOfCode
0128107F |. 8943 80 mov [ebx-80], eax ; short encodings
01281082 |. 8943 8C mov [ebx-74], eax
01281085 |. DD43 94 fld qword ptr [ebx-6C]
01281088 |. DD5B 9C fstp qword ptr [ebx-64]
0128108B |. 90 nop
0128108C |. A3 00902801 mov [globalint], eax ; long encoding, two bytes more
Speedwise, it's probably identical, but the shorter encodings use less of the instruction cache.
I don't lose much sleep over any API code; it is so much slower that the odd extra bytes don't matter. As always, the real action is in mnemonic code, where the extra registers seem to come in handy and the volume of allocatable memory is a great leap forward over Win32.