Memory benchmark with some new macros.

Started by hutch--, June 30, 2016, 12:24:21 PM


hutch--

I have tested out an idea that comes from the days of Win 3.0, when creating a DLL was subject to very limited stack space and one of the techniques was to write the equivalent of a LOCAL variable in the uninitialised data section. With Win64, the virtue of not using locals in the ordinary sense is that you don't mess up the stack alignment and can then run the procedures without a stack frame at all. In the source code below you will see the use of the macros,

    LOCAL64 pMem
    LOCAL64 hMem
    LOCAL64 tc

which simply write the data space in the BSS section but can only be accessed locally within a procedure. There are a number of macros in the "macros64.inc" file that handle normal API calls, "fn64" and "rv64", which wrap the invoke macro with protection of RSP in and out.

I have not yet given much thought to what you need to do with structures. In Win32 you simply allocated the space on the stack, but with Win64 being problematic with stack alignment I will have to see if there is another way to do it without having to fill out every structure in the .data? section.

This is the test piece source.


; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    OPTION DOTNAME
   
    option casemap:none

    include \masm64\include\win64.inc
    include \masm64\include\temphls.inc

    include \masm64\include\kernel32.inc
    include \masm64\include\user32.inc
    include \masm64\include\msvcrt.inc

    includelib \masm64\lib\user32.lib   
    includelib \masm64\lib\kernel32.lib
    includelib \masm64\lib\msvcrt.lib

    include macros64.inc

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

  ; --------------
  ; gigabyte count
  ; --------------
    gbcnt equ <4>

  .data
    pttl db "Milliseconds duration",0
    ptxt dq pttl
    tmsg db "The following operation benchmarks 4 gig memory copy",0
    pmsg dq tmsg
    titl db "Proceed",0
    capt dq titl

  .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

main proc

    LOCAL64 pMem
    LOCAL64 hMem
    LOCAL64 tc

  ; -----------------------------------------------------------
  ; following macros provide spill space for 4 64 bit registers
  ; plus 8 bytes on entry and restore the stack on exit.
  ; -----------------------------------------------------------
    mov pMem, rv64(GlobalAlloc,GMEM_FIXED or GMEM_ZEROINIT,1024*1024*1024*gbcnt)
    mov hMem, alloc64(1024*1024*1024*gbcnt)

    fn64 MessageBox,0,pmsg,capt,0                   ; introduction text message

    mov tc, rv64(GetTickCount)

  ; -----------------------------
  ; copy memory from pMem to hMem
  ; -----------------------------
    mov rcx, pMem
    mov rdx, hMem
    mov r8, 1024*1024*1024*gbcnt
    call mcopy64

    sub rv64(GetTickCount), tc                      ; calculate duration
    mov tc, rax

    fn64 MessageBox,0,str64$(tc),ptxt,0            ; display millisecond timing
    free64 pMem                                     ; release memory
    free64 hMem                                     ; release memory
    exit64 0                                        ; exit the process

main endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

mcopy64 proc

    ; rcx = source address
    ; rdx = destination address
    ; r8  = byte count

  ; --------------
  ; save rsi & rdi
  ; --------------
    mov r11, rsi
    mov r10, rdi

    cld
    mov rsi, rcx
    mov rdi, rdx
    mov rcx, r8

    shr rcx, 3
    rep movsq

    mov rcx, r8
    and rcx, 7
    rep movsb

  ; -----------------
  ; restore rsi & rdi
  ; -----------------
    mov rdi, r10
    mov rsi, r11

    retn

mcopy64 endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

comment #

    https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx

    Volatile
    rax rcx rdx r8 r9 r10 r11

    Non Volatile
    r12 r13 r14 r15 rdi rsi rbx rbp rsp

    Volatile
    xmm0 ymm0
    xmm1 ymm1
    xmm2 ymm2
    xmm3 ymm3
    xmm4 ymm4
    xmm5 ymm5

    Nonvolatile (XMM), Volatile (upper half of YMM)
    xmm6-15
    ymm6-15

#

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

  end

hutch--

Something I should have added: on this Haswell box the 4 gig copy runs at about 1.5 gig a second, which was comfortably fast enough in Win32 but is clearly off the pace in Win64. If anyone has done a faster memory copy routine than the one above that uses REP MOVSQ, I would be interested in seeing it. I have in the past done copy routines that used MOVDQA/U but they were not all that fast.
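One variation that may be worth timing against REP MOVSQ is an SSE2 loop that reads with MOVDQA and writes with non-temporal MOVNTDQ stores, so the destination does not pass through the cache. The following is only an untimed sketch: the name mcopynt is hypothetical, and it assumes both buffers are 16-byte aligned (worth checking for whatever allocator is used) and that the byte count is a non-zero multiple of 64, which holds for the 4 gig test. Whether it beats REP MOVSQ on any particular box would have to be measured.

```asm
; Hypothetical routine for comparison, not part of macros64.inc.
; rcx = source address       (assumed 16-byte aligned)
; rdx = destination address  (assumed 16-byte aligned)
; r8  = byte count           (assumed non-zero multiple of 64)
; Only volatile registers are used (rcx, rdx, r8, xmm0-xmm3).

mcopynt proc

  copy64:
    movdqa  xmm0, xmmword ptr [rcx]         ; load 64 bytes from the source
    movdqa  xmm1, xmmword ptr [rcx+16]
    movdqa  xmm2, xmmword ptr [rcx+32]
    movdqa  xmm3, xmmword ptr [rcx+48]

    movntdq xmmword ptr [rdx],    xmm0      ; non-temporal stores to the destination
    movntdq xmmword ptr [rdx+16], xmm1
    movntdq xmmword ptr [rdx+32], xmm2
    movntdq xmmword ptr [rdx+48], xmm3

    add rcx, 64
    add rdx, 64
    sub r8, 64
    jnz copy64                              ; loop until the count reaches zero

    sfence                                  ; make the non-temporal stores globally visible
    ret

mcopynt endp
```

It takes the same three arguments in the same registers as mcopy64, so it can be swapped into the test piece by changing the call.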

jj2007

This technique resembles the SetGlobals logic: syntax as in LOCAL, but use a .data? location and reference it using a persistent register, ebx in the case of SetGlobals.

I use it very often in my projects. Pro is that you have short code from [ebx-128] to [ebx+127] and persistent variables. Contra is that you need to keep in mind that one register is blocked. For frequently used variables, it really reduces code size and probably also improves speed, given that you have about 64 frequently used dwords (for example) at one compact memory location which no doubt will always be in cache.

Another contra: variable names are obtained through equates, i.e. myvar equ dword ptr [ebx-40]. Olly doesn't know about that equate, so disassembly is a little bit more difficult.
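For readers who have not seen the technique, here is a hand-rolled sketch of the general idea in plain 32-bit MASM. It is not the actual SetGlobals macro, and the names gblock, someint and counter are made up for the illustration.

```asm
.386
.model flat, stdcall
option casemap:none

.data?
    gblock  db 256 dup(?)               ; one compact block for the "globals"

.code
    ; text equates give the slots names; offsets fit in a signed byte
    someint equ <dword ptr [ebx-128]>
    counter equ <dword ptr [ebx-124]>

start:
    push ebx                            ; ebx is non-volatile, so preserve it
    mov ebx, offset gblock+128          ; point ebx at the middle of the block

    mov someint, eax                    ; assembles as mov [ebx-128], eax
    inc counter                         ; assembles as inc dword ptr [ebx-124]

    pop ebx
    ret

end start
```

Pointing ebx at the middle of the block is what makes the whole 256 bytes reachable with a one-byte displacement.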

hutch--

No registers involved. This is the macro.


    LOCAL64 MACRO arg1
      LOCAL var, pvar
      .data?
        var QWORD ?
      .data
        pvar QWORD var
      .code
      arg1 = pvar
    ENDM

It's a re-assignable equate that is renamed to the user's preference. What I have not nutted out yet is how to do structures.

This is the disassembly.

0x00000001`40001000: 4883EC28          sub     rsp,0x28
0x00000001`40001004: B940000000        mov     ecx,0x0000000000000040
0x00000001`40001009: 48BA000000000100  mov     rdx,0x0000000100000000
0x00000001`40001013: FF15EF0F0000      call    qword ptr [kernel32.dll!GlobalAlloc]
0x00000001`40001019: 4883C428          add     rsp,0x28
0x00000001`4000101D: 48890547200000    mov     qword ptr [0x000000014000306B],rax


The pseudo local variable is the QWORD address on the last line that rax is copied into.

jj2007

Quote from: hutch-- on June 30, 2016, 07:33:17 PM
No registers involved.

You are accessing globals by their offset, and that means relatively long encodings. Can't speak for x64 yet, but in x32 it makes a real difference (EDIT: for x64, it's 4 bytes vs 7 bytes):

include \masm32\MasmBasic\MasmBasic.inc
  globalint dd ? ; a real global

  SetGlobals someint, rc:RECT, double srcR8, destR8 ; ebx globals
  Init
  mov someint, eax
  mov rc.right, eax
  fld srcR8
  fstp destR8
  nop
  mov globalint, eax
  nop
EndOfCode


0128107F  |.  8943 80          mov [ebx-80], eax                 ; short encodings
01281082  |.  8943 8C          mov [ebx-74], eax
01281085  |.  DD43 94          fld qword ptr [ebx-6C]
01281088  |.  DD5B 9C          fstp qword ptr [ebx-64]
0128108B  |.  90               nop
0128108C  |.  A3 00902801      mov [globalint], eax              ; long encoding, two bytes more


Speedwise, it's probably identical, but the shorter encodings use less of the instruction cache.
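For the x64 figures in the EDIT above, a minimal sketch of where the 4 and 7 bytes come from, assuming ml64 and hypothetical names (gblock, plainq, demo). The 7-byte form is the same RIP-relative encoding seen in the last line of the disassembly posted earlier.

```asm
.data?
    gblock  db 256 dup(?)       ; block addressed through rbx
    plainq  dq ?                ; ordinary global

.code

demo proc
    push rbx                    ; rbx is non-volatile in the Win64 ABI
    lea rbx, gblock+128         ; point rbx at the middle of the block

    mov [rbx-40], rax           ; 48 89 43 D8           = 4 bytes (base + disp8)
    mov plainq, rax             ; 48 89 05 xx xx xx xx  = 7 bytes (RIP-relative disp32)

    pop rbx
    ret
demo endp

end
```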

hutch--

I don't lose much sleep over any API code; it is so much slower that the odd extra bytes don't matter. As always, the real action is in mnemonic code, where the extra registers seem to come in handy and the volume of allocatable memory is a great leap forward on Win32.