Memory benchmark with some new macros.

hutch-- · June 30, 2016, 12:24:21 PM

I have tested out an idea that comes from the days of Win 3.0, when a DLL had very limited stack space and one of the techniques was to write the equivalent of a LOCAL variable in the uninitialised data section. With Win64, the virtue of not using locals in the ordinary sense is that you don't mess up the stack alignment, so you can run the procedures without a stack frame at all. In the source code below you will see the use of the macros,

LOCAL64 pMem
LOCAL64 hMem
LOCAL64 tc

which simply write the data space in the BSS section, but the names can only be accessed locally within a procedure. There are a number of macros in the "macros64.inc" file that handle normal API calls, among them "fn64" and "rv64", which wrap the invoke macro and protect RSP on the way in and out.

I have not yet given much thought to what you need to do with structures. In Win32 you simply allocated the space on the stack, but with Win64 being problematic with stack alignment I will have to see if there is another way to do it without having to fill out every structure in the .data? section.

This is the test piece source.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION DOTNAME

option casemap:none

include \masm64\include\win64.inc
include \masm64\include\temphls.inc

include \masm64\include\kernel32.inc
include \masm64\include\user32.inc
include \masm64\include\msvcrt.inc

includelib \masm64\lib\user32.lib
includelib \masm64\lib\kernel32.lib
includelib \masm64\lib\msvcrt.lib

include macros64.inc

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

; --------------
; gigabyte count
; --------------
gbcnt equ <4>

.data
pttl db "Milliseconds duration",0
ptxt dq pttl
tmsg db "The following operation benchmarks 4 gig memory copy",0
pmsg dq tmsg
titl db "Proceed",0
capt dq titl

.code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

main proc

LOCAL64 pMem
LOCAL64 hMem
LOCAL64 tc

; -----------------------------------------------------------
; following macros provide spill space for 4 64 bit registers
; plus 8 bytes on entry and restore the stack on exit.
; -----------------------------------------------------------
mov pMem, rv64(GlobalAlloc,GMEM_FIXED or GMEM_ZEROINIT,1024*1024*1024*gbcnt)
mov hMem, alloc64(1024*1024*1024*gbcnt)

fn64 MessageBox,0,pmsg,capt,0 ; introduction text message

mov tc, rv64(GetTickCount)

; -----------------------------
; copy memory from pMem to hMem
; -----------------------------
mov rcx, pMem
mov rdx, hMem
mov r8, 1024*1024*1024*gbcnt
call mcopy64

sub rv64(GetTickCount), tc ; calculate duration
mov tc, rax

fn64 MessageBox,0,str64$(tc),ptxt,0 ; display millisecond timing
free64 pMem ; release memory
free64 hMem ; release memory
exit64 0 ; exit the process

main endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

mcopy64 proc

; rcx = source address
; rdx = destination address
; r8 = byte count

; --------------
; save rsi & rdi
; --------------
mov r11, rsi
mov r10, rdi

cld
mov rsi, rcx
mov rdi, rdx
mov rcx, r8

shr rcx, 3
rep movsq

mov rcx, r8
and rcx, 7
rep movsb

; -----------------
; restore rsi & rdi
; -----------------
mov rdi, r10
mov rsi, r11

retn

mcopy64 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment #

https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx

Volatile
rax rcx rdx r8 r9 r10 r11

Non Volatile
r12 r13 r14 r15 rdi rsi rbx rbp rsp

Volatile
xmm0 ymm0
xmm1 ymm1
xmm2 ymm2
xmm3 ymm3
xmm4 ymm4
xmm5 ymm5

Nonvolatile (XMM), Volatile (upper half of YMM)
xmm6-15
ymm6-15

#

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end

hutch-- · June 30, 2016, 12:36:49 PM

Something I should have added: on this Haswell box the 4 gig copy runs at about 1.5 gig a second, which was comfortably fast enough in Win32 but is clearly off the pace in Win64. If anyone has done a faster memory copy routine than the one above that uses REP MOVSQ, I would be interested in seeing it. I have in the past done copy routines that used MOVDQA/U but they were not all that fast.
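
One variant that is sometimes tried for very large copies is non-temporal (streaming) stores, which are meant to avoid filling the cache with data that will not be read again soon. The routine below is only a sketch of that idea and has not been timed, so whether it beats REP MOVSQ on a given machine would have to be measured. The name mcopy64nt is hypothetical, and the routine assumes the destination is 16 byte aligned and the byte count is a multiple of 64.

```asm
; Sketch only. mcopy64nt is a hypothetical name and the routine has
; not been benchmarked. Assumes the destination is 16 byte aligned
; and the byte count is a multiple of 64 (true for the 4 gig test).

mcopy64nt proc

  ; rcx = source address
  ; rdx = destination address
  ; r8  = byte count

    shr r8, 6                               ; count of 64 byte blocks
    jz  copyend                             ; nothing to copy

  copyloop:
    movdqu xmm0, xmmword ptr [rcx]          ; unaligned loads, 64 bytes per pass
    movdqu xmm1, xmmword ptr [rcx+16]
    movdqu xmm2, xmmword ptr [rcx+32]
    movdqu xmm3, xmmword ptr [rcx+48]
    movntdq xmmword ptr [rdx], xmm0         ; non-temporal stores
    movntdq xmmword ptr [rdx+16], xmm1
    movntdq xmmword ptr [rdx+32], xmm2
    movntdq xmmword ptr [rdx+48], xmm3
    add rcx, 64
    add rdx, 64
    sub r8, 1
    jnz copyloop

    sfence                                  ; order the streaming stores before returning
  copyend:
    ret

mcopy64nt endp
```

Only volatile registers are used (rcx, rdx, r8 and xmm0 to xmm3), so nothing needs to be saved. It takes the same three arguments as mcopy64, so it could be called in its place in the test piece if the alignment assumption holds for the allocated blocks.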

jj2007 · June 30, 2016, 07:22:40 PM

This technique resembles the SetGlobals logic: the syntax is as in LOCAL, but it uses a .data? location and references it through a persistent register, ebx in the case of SetGlobals.

I use it very often in my projects. Pro: you get short encodings from [ebx-128] to [ebx+127], and persistent variables. Contra: you need to keep in mind that one register is blocked. For frequently used variables it really reduces code size, and probably also improves speed, given that you have about 64 frequently used dwords (for example) at one compact memory location which no doubt will always be in cache.

Another contra: variable names are obtained through equates, i.e. myvar equ dword ptr [ebx-40]. Olly doesn't know about that equate, so disassembly is a little bit more difficult.

hutch-- · June 30, 2016, 07:33:17 PM

No registers involved. This is the macro.

LOCAL64 MACRO arg1
LOCAL var, pvar
.data?
var QWORD ?
.data
pvar QWORD var
.code
arg1 = pvar
ENDM

It's a re-assignable equate that is renamed to the user's preference. What I have not nutted out yet is how to do structures.

This is the disassembly.

0x00000001`40001000: 4883EC28 sub rsp,0x28
0x00000001`40001004: B940000000 mov ecx,0x0000000000000040
0x00000001`40001009: 48BA000000000100 mov rdx,0x0000000100000000
0x00000001`40001013: FF15EF0F0000 call qword ptr [kernel32.dll!GlobalAlloc]
0x00000001`40001019: 4883C428 add rsp,0x28
0x00000001`4000101D: 48890547200000 mov qword ptr [0x000000014000306B],rax

The pseudo local variable is the QWORD address on the last line that rax is copied into.
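
On the open question of structures, one direction that might work is to let a macro reserve the whole structure in .data? and hand back a text equate for it, so that field access still reads like a local. This is an untested sketch, not something from macros64.inc: the macro name LOCALSTRUCT64 and the PAIR64 structure are made up for illustration, and it assumes ml64 expands a text equate in front of a field name the same way it does for a plain variable.

```asm
; Sketch only. LOCALSTRUCT64 and PAIR64 are hypothetical names,
; they are not part of macros64.inc.

PAIR64 STRUCT
  valA QWORD ?
  valB QWORD ?
PAIR64 ENDS

LOCALSTRUCT64 MACRO vname, stype
  LOCAL var                   ;; unique label for each expansion
  .data?
    var stype <>              ;; reserve one uninitialised structure
  .code
  vname TEXTEQU <var>         ;; re-assignable name chosen by the caller
ENDM

; usage inside a procedure (assumed, not tested)
    LOCALSTRUCT64 pair, PAIR64
    mov pair.valA, rax        ; store to the first member
    mov pair.valB, rdx        ; store to the second member
```

No stack space is involved, so RSP alignment is not affected, and the structure does not have to be filled out member by member in the .data? section. As with LOCAL64 the storage is static, so a procedure written this way would not be re-entrant.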

jj2007 · June 30, 2016, 08:26:51 PM

Quote from: hutch-- on June 30, 2016, 07:33:17 PM
No registers involved.

You are accessing globals by their offset, and that means relatively long encodings. Can't speak for x64 yet, but in x32 it makes a real difference (EDIT: for x64, it's 4 bytes vs 7 bytes):

include \masm32\MasmBasic\MasmBasic.inc
  globalint dd ?	; a real global

  SetGlobals someint, rc:RECT, double srcR8, destR8	; ebx globals
  Init
  mov someint, eax
  mov rc.right, eax
  fld srcR8
  fstp destR8
  nop
  mov globalint, eax
  nop
EndOfCode

0128107F  │.  8943 80          mov [ebx-80], eax                 ; short encodings
01281082  │.  8943 8C          mov [ebx-74], eax
01281085  │.  DD43 94          fld qword ptr [ebx-6C]
01281088  │.  DD5B 9C          fstp qword ptr [ebx-64]
0128108B  │.  90               nop
0128108C  │.  A3 00902801      mov [globalint], eax              ; long encoding, two bytes more

Speedwise, it's probably identical, but the shorter encodings use less of the instruction cache.
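
For the x64 case mentioned in the EDIT, the same comparison might look like the fragment below. It is a sketch that has not been assembled: gvars is a hypothetical block of globals, and rbx is nonvolatile in Win64, so a real procedure would have to save and restore it. The byte counts in the comments follow from the instruction encodings (REX.W, opcode, ModRM and an 8 bit displacement, against REX.W, opcode, ModRM and a 32 bit displacement).

```asm
; Sketch only, gvars is a hypothetical name.
.data?
  gvars QWORD 32 dup (?)            ; 256 bytes of globals
.code
    lea rbx, gvars+128              ; point rbx at the middle of the block
    mov qword ptr [rbx-128], rax    ; 48 89 43 80        = 4 bytes
    mov gvars, rax                  ; 48 89 05 + disp32  = 7 bytes
```

The second form is the same 7 byte RIP-relative store that appears on the last line of the LOCAL64 disassembly earlier in the thread.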

hutch-- · June 30, 2016, 10:29:35 PM

I don't lose much sleep over any API code; it is so much slower that the odd extra bytes don't matter. As always the real action is in mnemonic code, where the extra registers seem to come in handy and the volume of allocatable memory is a great leap forward over Win32.

The MASM Forum
