local vs. .data?

Started by Elegant, March 21, 2015, 04:13:55 PM


Elegant

Hi guys, new here and just starting out with some assembly. I'm currently trying to port a small x64 project to x86, so naturally I have the missing extra registers to contend with. I've been creating some .data? variables to compensate, using a push/pop type method for the XMMs. This solution was working great when I only had to handle the occasional extra register or two, but now I've gotten to a point where I need to account for xmm8-14. I thought I'd give local a go, but no matter how I set it up I never get the correct result. Should I even be using local?

I've also been wondering whether I should consider straight C code instead of assembly for this. Below is an example where the push/pop idea works, but with more registers it becomes convoluted very quickly (and doesn't seem to work), and I begin to wonder about performance (if it ever did work).


.data?

align 16

xmm8_var DB 16 DUP (?)
xmm9_var DB 16 DUP (?)



checkOscillation5_SSE2_ASM proc p2p:dword,p1p:dword,s1p:dword,n1p:dword,n2p:dword,dstp:dword,p2_pitch:dword,p1_pitch:dword,s1_pitch:dword,n1_pitch:dword,n2_pitch:dword,dst_pitch:dword,width_:dword,height:dword,thresh:dword

public checkOscillation5_SSE2_ASM

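;local xmm8_var:xmmword,xmm9_var:xmmword ; one xmmword each ([16] would reserve 16 per name); locals are not guaranteed to be 16-aligned, which movdqa requires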

mov eax,p2p
mov ebx,p1p
mov edx,s1p
mov edi,n1p
mov esi,n2p
pxor xmm6,xmm6
dec thresh
movd xmm7,thresh
punpcklbw xmm7,xmm7
punpcklwd xmm7,xmm7
punpckldq xmm7,xmm7
punpcklqdq xmm7,xmm7
movdqa xmmword ptr xmm8_var,xmm6 ; push the xmm6 register
movdqa xmmword ptr xmm9_var,xmm7 ; push the xmm7 register
pcmpeqb xmm7, xmm7
psrlw xmm7,15
movdqa xmm6,xmm7
psllw xmm6,8
por xmm7,xmm6
movdqa xmm7,xmmword ptr xmm9_var ; pop the xmm7 register
movdqa xmm6,xmmword ptr xmm8_var ; pop the xmm6 register
...
pminub xmm0,xmmword ptr xmm8_var
pmaxub xmm1,xmmword ptr xmm8_var
...
checkOscillation5_SSE2_ASM endp
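
On the C question: a minimal sketch, assuming SSE2 intrinsics from <emmintrin.h> are available, of the threshold setup from the listing above. The function name broadcast_thresh_minus_1 is made up for illustration and is not part of the original project. With intrinsics the compiler does the register allocation, and a value that does not fit in xmm0-xmm7 is normally spilled to a suitably aligned stack slot without any hand-written push/pop.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: fills out[0..15] with the low byte of (thresh - 1),
   which is what the dec / movd / punpckl* chain in the asm produces. */
static void broadcast_thresh_minus_1(int thresh, uint8_t out[16])
{
    /* Same byte replicated into all 16 lanes. */
    __m128i v = _mm_set1_epi8((char)(uint8_t)(thresh - 1));

    /* Unaligned store, so out does not need to be 16-byte aligned. */
    _mm_storeu_si128((__m128i *)out, v);
}
```

Whether this ends up as fast as the hand-written version depends on the compiler and flags, so it is worth checking the generated code.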

sinsi

I don't think that xmm8-xmm15 are available in 32-bit mode.

Elegant

They aren't. That's the point of my own xmm8_var: it's a variable that compensates for the lack of xmm8 by spilling the value to memory. I'm hoping I can achieve similar results using local instead, so that I don't have to continuously push/pop with many variables.

jj2007

A better idea might be to use a frameless proc, and let the fake "lobals" point to a properly aligned .data? area:

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
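someproc proc
LOCAL xm1:OWORD, xm2:OWORD   ; no frame is built, so these are only EBP-relative names
mov ebp, someglobaloffset    ; point EBP just past a 16-aligned .data? area (save the caller's EBP first)
; ... body ...
someproc endp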
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

MichaelW

This code compares the cycle counts for
   MOVDQA xmm, xmm
and
   MOVDQA xmm, m128
I could not think of any good way to align LOCALs so I just did it by trial and error.

;==============================================================================
include \masm32\include\masm32rt.inc
.686
.xmm
include alignment.asm
include counter.asm
;==============================================================================
LOOP_COUNT equ 1000000
PP equ REALTIME_PRIORITY_CLASS
TP equ THREAD_PRIORITY_TIME_CRITICAL
;==============================================================================
.data
.code
;==============================================================================
testproc proc

    LOCAL pad[1]:byte
    LOCAL xmm_var:xmmword
   
    lea   ebx, xmm_var
    printf("%d\n",alignment(ebx))
   
    REPEAT 3
        counter_begin LOOP_COUNT, PP, TP
            REPEAT 8
                movdqa  xmm_var, xmm0
                movdqa  xmm0, xmm_var           
            ENDM   
        counter_end
        printf("%d cycles\n", eax)
        counter_begin LOOP_COUNT, PP, TP
            REPEAT 8
                movdqa  xmm1, xmm0
                movdqa  xmm0, xmm1           
            ENDM
        counter_end
        printf("%d cycles\n", eax)
    ENDM
    ret
testproc endp
;==============================================================================
start:
;==============================================================================
    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1
    invoke Sleep, 5000
    call testproc
    printf("\n")
    inkey
    exit
;==============================================================================
end start


Alignment.asm:

;---------------------------------------------------
; This macro returns the maximum alignment of _ptr.
;---------------------------------------------------
alignment MACRO _ptr
    push ecx
    xor eax, eax
    mov ecx, _ptr
    bsf ecx, ecx
    jz @F
    mov eax, 1
    shl eax, cl
  @@:
    pop ecx
    EXITM <eax>
ENDM
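
For readers more at home in C, the same idea as the alignment macro can be sketched as below. This is only an illustration under the assumption that addresses are handled as uintptr_t; the name max_alignment is made up here. It isolates the lowest set bit of the address, which is what the BSF/SHL pair computes, and returns 0 for a zero input just like the macro.

```c
#include <stdint.h>

/* Hypothetical helper: largest power of two that divides addr.
   Returns 0 when addr is 0, matching the macro's behaviour. */
static uintptr_t max_alignment(uintptr_t addr)
{
    /* addr & -addr keeps only the lowest set bit; written with ~ and + 1
       so the negation is done on an unsigned value. */
    return addr & (~addr + 1);
}
```

An address of 0x1230 gives 16, so it would be safe for movdqa, while 12 gives 4.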


Counter.asm (newer than the one in the MASM32 macros directory):

  ;----------------------------------------------------------------------
  ; These two macros perform the grunt work involved in measuring the
  ; processor clock cycle count for a block of code. These macros must
  ; be used in pairs, and the block of code must be placed in between
  ; the counter_begin and counter_end macro calls. The counter_end macro
  ; returns the clock cycle count for a single pass through the block of
  ; code, corrected for the test loop overhead, in EAX.
  ;
  ; These macros require a .586 or higher processor directive.
  ;
  ; The essential differences between these macros and my previous macros
  ; are that these save and restore the original priorities, and provide
  ; a way to control the thread priority. Control of the thread priority
  ; allows timing code at the highest possible priority by combining
  ; REALTIME_PRIORITY_CLASS with THREAD_PRIORITY_TIME_CRITICAL.
  ;
  ; Note that running at the higher priority settings on a single core
  ; processor involves some risk, as it will cause your process to
  ; preempt most or *all* other processes, including critical Windows
  ; processes. Using HIGH_PRIORITY_CLASS in combination with
  ; THREAD_PRIORITY_NORMAL should generally be safe.
  ;
  ; Note that CPUID will change the value of EBX, and that I did not
  ; correct this problem because I could see no way to do so without
  ; adding an additional instruction to the timed code (that is, in
  ; addition to the XOR EAX, EAX).
  ;----------------------------------------------------------------------

    counter_begin MACRO loopcount:REQ, process_priority:REQ, thread_priority
        LOCAL label

        IFNDEF __counter__qword__count__
          .data
          ALIGN 8             ;; Optimal alignment for QWORD
            __counter__qword__count__  dq 0
            __counter__loop__count__   dd 0
            __counter__loop__counter__ dd 0
            __process_priority_class__ dd 0
            __thread_priority__        dd 0
            __current_process__        dd 0
            __current_thread__         dd 0
          .code
        ENDIF

        mov __counter__loop__count__, loopcount
        invoke GetCurrentProcess
        mov __current_process__, eax
        invoke GetPriorityClass, __current_process__
        mov __process_priority_class__, eax
        invoke SetPriorityClass, __current_process__, process_priority
        IFNB <thread_priority>
            invoke GetCurrentThread
            mov __current_thread__, eax
            invoke GetThreadPriority, __current_thread__
            mov __thread_priority__, eax
            invoke SetThreadPriority, __current_thread__, thread_priority
        ENDIF
        xor eax, eax          ;; Use same CPUID input value for each call
        cpuid                 ;; Flush pipe & wait for pending ops to finish
        rdtsc                 ;; Read Time Stamp Counter

        push edx              ;; Preserve high-order 32 bits of start count
        push eax              ;; Preserve low-order 32 bits of start count
        mov   __counter__loop__counter__, loopcount
        xor eax, eax
        cpuid                 ;; Make sure loop setup instructions finish
      ALIGN 16                ;; Optimal loop alignment for P6
      @@:                     ;; Start an empty reference loop
        sub __counter__loop__counter__, 1
        jnz @B

        xor eax, eax
        cpuid                 ;; Make sure loop instructions finish
        rdtsc                 ;; Read end count
        pop ecx               ;; Recover low-order 32 bits of start count
        sub eax, ecx          ;; Low-order 32 bits of overhead count in EAX
        pop ecx               ;; Recover high-order 32 bits of start count
        sbb edx, ecx          ;; High-order 32 bits of overhead count in EDX
        push edx              ;; Preserve high-order 32 bits of overhead count
        push eax              ;; Preserve low-order 32 bits of overhead count

        xor eax, eax
        cpuid
        rdtsc
        push edx              ;; Preserve high-order 32 bits of start count
        push eax              ;; Preserve low-order 32 bits of start count
        mov   __counter__loop__counter__, loopcount
        xor eax, eax
        cpuid                 ;; Make sure loop setup instructions finish
      ALIGN 16                ;; Optimal loop alignment for P6
      label:                  ;; Start test loop
        __counter__loop__label__ equ <label>
    ENDM

    counter_end MACRO
        sub __counter__loop__counter__, 1
        jnz  __counter__loop__label__

        xor eax, eax
        cpuid                 ;; Make sure loop instructions finish
        rdtsc                 ;; Read end count
        pop ecx               ;; Recover low-order 32 bits of start count
        sub eax, ecx          ;; Low-order 32 bits of test count in EAX
        pop ecx               ;; Recover high-order 32 bits of start count
        sbb edx, ecx          ;; High-order 32 bits of test count in EDX
        pop ecx               ;; Recover low-order 32 bits of overhead count
        sub eax, ecx          ;; Low-order 32 bits of adjusted count in EAX
        pop ecx               ;; Recover high-order 32 bits of overhead count
        sbb edx, ecx          ;; High-order 32 bits of adjusted count in EDX

        mov DWORD PTR __counter__qword__count__, eax
        mov DWORD PTR __counter__qword__count__ + 4, edx

        invoke SetPriorityClass,__current_process__,__process_priority_class__
        .IF __current_thread__ != 0   ;; counter_end has no thread_priority parameter, so test at run time instead
            invoke SetThreadPriority, __current_thread__, __thread_priority__
        .ENDIF

        finit
        fild __counter__qword__count__
        fild __counter__loop__count__
        fdiv
        fistp __counter__qword__count__

        mov eax, DWORD PTR __counter__qword__count__
    ENDM
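
The arithmetic at the end of counter_end (test count minus empty-loop count, divided by the loop count) can be sketched in C as follows. This is a rough illustration only: cycles_per_pass is a made-up name, the tick values are assumed to come from two RDTSC differences, and loop_count is assumed to be nonzero.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the last steps of counter_end:
   (test loop ticks - empty loop ticks) / loop count, rounded to nearest. */
static int64_t cycles_per_pass(uint64_t test_ticks, uint64_t overhead_ticks,
                               uint32_t loop_count)
{
    /* The difference can be negative on a noisy run, so treat it as signed. */
    int64_t net = (int64_t)(test_ticks - overhead_ticks);
    int64_t half = loop_count / 2;

    /* Rounds halves away from zero; the x87 code rounds ties to even. */
    if (net >= 0)
        return (net + half) / loop_count;
    return (net - half) / loop_count;
}
```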


Typical result after assembling with JWasm v2.12pre, Nov 27 2013 and linking as a console app with the MS linker distributed with MASM32, and running under Windows 7-64 on a Core i3:

16
42 cycles
2 cycles
42 cycles
2 cycles
42 cycles
2 cycles


The 2 cycles is somewhat surprising, and may possibly indicate that processors are now smart enough to recognize monkey motion and skip over it, so they look good in benchmarks.
Well Microsoft, here's another nice mess you've gotten us into.

hutch--

> I could not think of any good way to align LOCALs so I just did it by trial and error.

I confess I don't see the problem here, just allocate a bigger buffer and align the start OFFSET in the buffer.

LOCAL buffer[128]:BYTE
LOCAL pbuf :DWORD

lea eax, buffer
add eax, 16 - 1
and eax, -16
mov pbuf, eax
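
The add/and pair above is the usual round-up-to-a-power-of-two trick. A minimal C sketch of it, with align_up as a made-up helper name and the address treated as a uintptr_t:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round addr up to the next multiple of alignment.
   alignment must be a power of two (16 for movdqa operands). */
static uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* Same as "add eax, 16 - 1" followed by "and eax, -16". */
    return (addr + alignment - 1) & ~(alignment - 1);
}
```

Rounding up skips at most 15 bytes, so a 128-byte buffer always leaves at least 113 usable bytes after the aligned start.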


MichaelW

I was assuming that the goal was individually named variables accessed by name.

dedndave

maybe you can write your own prologue/epilogue, such that EBP is 16-aligned
that way, labels can be created to 16-aligned stack frame offsets

on the other hand, it would be simpler to just create the registers in .DATA?   :P

dedndave

i often write my own block like this...
        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

MyFunc  PROC  dwArg1:DWORD,dwArg2:DWORD

;--------------------------------

_dwArg2     TEXTEQU <DWORD PTR [EBP+24]>
_dwArg1     TEXTEQU <DWORD PTR [EBP+20]>
;                              [EBP+16]      ;RETurn address
;                              [EBP+12]      ;saved EBX contents
;                              [EBP+8]       ;saved ESI contents
;                              [EBP+4]       ;saved EDI contents
;                              [EBP]         ;saved EBP contents
_dwLocal1   TEXTEQU <DWORD PTR [EBP-4]>
_dwLocal2   TEXTEQU <DWORD PTR [EBP-8]>

;--------------------------------

    push    ebx
    push    esi
    push    edi
    push    ebp
    mov     ebp,esp
    push    eax                              ;[EBP-4] = _dwLocal1
    push    eax                              ;[EBP-8] = _dwLocal2

;body of procedure here

    mov     esp,ebp                          ;these 2 instructions are typically implemented with LEAVE
    pop     ebp                              ;expanded in this example for clarity
    pop     edi
    pop     esi
    pop     ebx
    ret     8

MyFunc  ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef


if we modify that to 16-align EBP...
        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

MyFunc  PROC  dwArg1:DWORD,dwArg2:DWORD

;--------------------------------

_dwArg2     TEXTEQU <DWORD PTR [EBP+8]>
_dwArg1     TEXTEQU <DWORD PTR [EBP+4]>
_dwRestore  TEXTEQU <DWORD PTR [EBP]>
_dwLocal1   TEXTEQU <DWORD PTR [EBP-4]>
_dwLocal2   TEXTEQU <DWORD PTR [EBP-8]>
_owLocal3   TEXTEQU <OWORD PTR [EBP-32]>

;--------------------------------

    push    ebx
    push    esi
    push    edi
    push    ebp
    mov     edx,esp                          ;keep a pointer to the saved EBP contents in EDX
    lea     ebp,[esp-12]                     ;make space for 3 dword values above EBP
    and     ebp,-16                          ;EBP is now 16-aligned
    mov     esp,ebp
    mov     eax,[edx+24]                     ;get Arg2 from stack
    mov     _dwArg2,eax                      ;save it at an EBP addressable address
    mov     eax,[edx+20]                     ;get Arg1 from stack ([edx+16] is the return address)
    mov     _dwArg1,eax                      ;save it at an EBP addressable address
    mov     _dwRestore,edx                   ;save the original stack frame base
    push    eax                              ;[EBP-4] = _dwLocal1
    push    eax                              ;[EBP-8] = _dwLocal2
    sub     esp,8+16                         ;8 for alignment, 16 for OWORD local

;body of procedure here

    mov     esp,_dwRestore                   ;restore ESP to point to saved EBP contents
    pop     ebp
    pop     edi
    pop     esi
    pop     ebx
    ret     8

MyFunc  ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef


that's a bit wordy, but should work
if you put a little thought into it, you might be able to simplify the code

EDIT: fixed a couple things, OWORD PTR and _dwRestore usage

dedndave

here's a simpler one....

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

MyFunc  PROC  dwArg1:DWORD,dwArg2:DWORD

;--------------------------------

_dwRestore  TEXTEQU <DWORD PTR [EBP-4]>
_dwArg2     TEXTEQU <DWORD PTR [EBP-8]>
_dwArg1     TEXTEQU <DWORD PTR [EBP-12]>
_dwLocal1   TEXTEQU <DWORD PTR [EBP-16]>
_dwLocal2   TEXTEQU <DWORD PTR [EBP-20]>
_owLocal3   TEXTEQU <OWORD PTR [EBP-48]>

;--------------------------------

    push    ebx
    push    esi
    push    edi
    push    ebp
    mov     edx,esp                          ;keep a pointer to the saved EBP contents in EDX
    mov     ebp,esp
    and     ebp,-16                          ;EBP is now 16-aligned
    mov     esp,ebp
    push    edx                              ;[EBP-4]  = _dwRestore
    push dword ptr [edx+24]                  ;[EBP-8]  = _dwArg2
    push dword ptr [edx+20]                  ;[EBP-12] = _dwArg1
    push    eax                              ;[EBP-16] = _dwLocal1
    push    eax                              ;[EBP-20] = _dwLocal2
    sub     esp,12+16                        ;12 for alignment, 16 for _owLocal3

;body of procedure here

    mov     esp,_dwRestore                   ;restore ESP to point to saved EBP contents
    pop     ebp
    pop     edi
    pop     esi
    pop     ebx
    ret     8

MyFunc  ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

satpro

@Elegant...

Not bad, eh?  The website already paid for itself.  :biggrin:

dedndave

    mov     ebp,esp
    and     ebp,-16                          ;EBP is now 16-aligned
    mov     esp,ebp


i guess you could replace those 3 lines with these 2
    and     esp,-16                          ;ESP is now 16-aligned
    mov     ebp,esp
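
A quick way to sanity-check the frame layout from the simpler version is to redo the address arithmetic in C. This is only a sketch: align_down16 and owlocal3_address are made-up names, and the ESP value is whatever the stack pointer happens to be right after the four register pushes.

```c
#include <stdint.h>

/* Round an address down to a 16-byte boundary, like "and esp,-16". */
static uintptr_t align_down16(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}

/* Hypothetical helper: address of _owLocal3 ([EBP-48]) for a given ESP
   value taken right after the four register pushes. */
static uintptr_t owlocal3_address(uintptr_t esp_after_pushes)
{
    uintptr_t ebp = align_down16(esp_after_pushes);
    return ebp - 48;   /* 48 is a multiple of 16, so alignment is preserved */
}
```

Because rounding down never moves the pointer up, the aligned frame always sits at or below the saved registers, which is why the original ESP has to be kept in _dwRestore.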

jj2007

Quote from: dedndave on March 21, 2015, 11:17:38 PM
> on the other hand, it would be simpler to just create the registers in .DATA?   :P

I rewrote SetGlobals for use with xmm regs. The first variable is now always 16-byte aligned:

include \masm32\MasmBasic\MasmBasic.inc      ; download
  SetGlobals OWORD xx0, xx1, xx2, xx3, xx4, xx5, xx6, xx7
  Init
  mov eax, 11111111
  movd xmm0, eax
  movaps xx0, xmm0

  add eax, 11111111
  movd xmm1, eax
  movaps xx1, xmm1
  Exit
end start