Hi guys, new here and just starting out with some assembly. I'm currently trying to port a small x64 project to x86. Naturally, I have the extra registers to contend with. I've been creating some .data? variables to counteract this, using a push/pop type method for XMMs. This solution was working great when I only had to handle the occasional extra register or two, but now I've gotten to a point where I need to account for xmm8-14. I thought I'd give local a go, but no matter how I set it up I never get the correct result. Should I even be using local?
I've also been wondering whether I should consider straight C code instead of assembly when having to deal with this. Below is an example where the push/pop idea works, but when you have to deal with more registers it becomes convoluted very quickly (and doesn't seem to work), and I begin to wonder about performance (if it ever did work).
.data?
align 16
xmm8_var DB 16 DUP (?)
xmm9_var DB 16 DUP (?)
checkOscillation5_SSE2_ASM proc p2p:dword,p1p:dword,s1p:dword,n1p:dword,n2p:dword,dstp:dword,p2_pitch:dword,p1_pitch:dword,s1_pitch:dword,n1_pitch:dword,n2_pitch:dword,dst_pitch:dword,width_:dword,height:dword,thresh:dword
public checkOscillation5_SSE2_ASM
;local xmm8_var[16]:xmmword,xmm9_var[16]:xmmword
mov eax,p2p
mov ebx,p1p
mov edx,s1p
mov edi,n1p
mov esi,n2p
pxor xmm6,xmm6
dec thresh
movd xmm7,thresh
punpcklbw xmm7,xmm7
punpcklwd xmm7,xmm7
punpckldq xmm7,xmm7
punpcklqdq xmm7,xmm7
movdqa xmmword ptr xmm8_var,xmm6 ; push the xmm6 register
movdqa xmmword ptr xmm9_var,xmm7 ; push the xmm7 register
pcmpeqb xmm7, xmm7
psrlw xmm7,15
movdqa xmm6,xmm7
psllw xmm6,8
por xmm7,xmm6
movdqa xmm7,xmmword ptr xmm9_var ; pop the xmm7 register
movdqa xmm6,xmmword ptr xmm8_var ; pop the xmm6 register
...
pminub xmm0,xmmword ptr xmm8_var
pmaxub xmm1,xmmword ptr xmm8_var
...
checkOscillation5_SSE2_ASM endp
I don't think that xmm8-xmm15 are available in 32-bit mode.
They aren't. That's the point of using my own xmm8_var: it's a variable that compensates for the lack of xmm8 by saving the value to memory. I'm hoping I can achieve similar results using local instead, so that I don't have to continuously push/pop with many variables.
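A better idea might be to use a frameless proc, and let the fake "locals" point to a properly aligned .data? area:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
someproc proc
LOCAL xm1:OWORD, xm2:OWORD ; 16-byte locals, addressed at negative offsets from EBP
mov ebp, offset someglobaloffset ; 16-aligned END of the .data? area (locals sit below EBP); save/restore the caller's EBP yourself
...
someproc endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef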
This code compares the cycle counts for
MOVDQA xmm, xmm
and
MOVDQA xmm, m128
I could not think of any good way to align LOCALs so I just did it by trial and error.
;==============================================================================
include \masm32\include\masm32rt.inc
.686
.xmm
include alignment.asm
include counter.asm
;==============================================================================
LOOP_COUNT equ 1000000
PP equ REALTIME_PRIORITY_CLASS
TP equ THREAD_PRIORITY_TIME_CRITICAL
;==============================================================================
.data
.code
;==============================================================================
testproc proc
LOCAL pad[1]:byte
LOCAL xmm_var:xmmword
lea ebx, xmm_var
printf("%d\n",alignment(ebx))
REPEAT 3
counter_begin LOOP_COUNT, PP, TP
REPEAT 8
movdqa xmm_var, xmm0
movdqa xmm0, xmm_var
ENDM
counter_end
printf("%d cycles\n", eax)
counter_begin LOOP_COUNT, PP, TP
REPEAT 8
movdqa xmm1, xmm0
movdqa xmm0, xmm1
ENDM
counter_end
printf("%d cycles\n", eax)
ENDM
ret
testproc endp
;==============================================================================
start:
;==============================================================================
invoke GetCurrentProcess
invoke SetProcessAffinityMask, eax, 1
invoke Sleep, 5000
call testproc
printf("\n")
inkey
exit
;==============================================================================
end start
Alignment.asm:
;---------------------------------------------------
; This macro returns the maximum alignment of _ptr.
;---------------------------------------------------
alignment MACRO _ptr
push ecx
xor eax, eax
mov ecx, _ptr
bsf ecx, ecx
jz @F
mov eax, 1
shl eax, cl
@@:
pop ecx
EXITM <eax>
ENDM
Counter.asm (newer than the one in the MASM32 macros directory):
;----------------------------------------------------------------------
; These two macros perform the grunt work involved in measuring the
; processor clock cycle count for a block of code. These macros must
; be used in pairs, and the block of code must be placed in between
; the counter_begin and counter_end macro calls. The counter_end macro
; returns the clock cycle count for a single pass through the block of
; code, corrected for the test loop overhead, in EAX.
;
; These macros require a .586 or higher processor directive.
;
; The essential differences between these macros and my previous macros
; are that these save and restore the original priorities, and provide
; a way to control the thread priority. Control of the thread priority
; allows timing code at the highest possible priority by combining
; REALTIME_PRIORITY_CLASS with THREAD_PRIORITY_TIME_CRITICAL.
;
; Note that running at the higher priority settings on a single core
; processor involves some risk, as it will cause your process to
; preempt most or *all* other processes, including critical Windows
; processes. Using HIGH_PRIORITY_CLASS in combination with
; THREAD_PRIORITY_NORMAL should generally be safe.
;
; Note that CPUID will change the value of EBX, and that I did not
; correct this problem because I could see no way to do so without
; adding an additional instruction to the timed code (that is, in
; addition to the XOR EAX, EAX).
;----------------------------------------------------------------------
counter_begin MACRO loopcount:REQ, process_priority:REQ, thread_priority
LOCAL label
IFNDEF __counter__qword__count__
.data
ALIGN 8 ;; Optimal alignment for QWORD
__counter__qword__count__ dq 0
__counter__loop__count__ dd 0
__counter__loop__counter__ dd 0
__process_priority_class__ dd 0
__thread_priority__ dd 0
__current_process__ dd 0
__current_thread__ dd 0
.code
ENDIF
mov __counter__loop__count__, loopcount
invoke GetCurrentProcess
mov __current_process__, eax
invoke GetPriorityClass, __current_process__
mov __process_priority_class__, eax
invoke SetPriorityClass, __current_process__, process_priority
IFNB <thread_priority>
invoke GetCurrentThread
mov __current_thread__, eax
invoke GetThreadPriority, __current_thread__
mov __thread_priority__, eax
invoke SetThreadPriority, __current_thread__, thread_priority
ENDIF
xor eax, eax ;; Use same CPUID input value for each call
cpuid ;; Flush pipe & wait for pending ops to finish
rdtsc ;; Read Time Stamp Counter
push edx ;; Preserve high-order 32 bits of start count
push eax ;; Preserve low-order 32 bits of start count
mov __counter__loop__counter__, loopcount
xor eax, eax
cpuid ;; Make sure loop setup instructions finish
ALIGN 16 ;; Optimal loop alignment for P6
@@: ;; Start an empty reference loop
sub __counter__loop__counter__, 1
jnz @B
xor eax, eax
cpuid ;; Make sure loop instructions finish
rdtsc ;; Read end count
pop ecx ;; Recover low-order 32 bits of start count
sub eax, ecx ;; Low-order 32 bits of overhead count in EAX
pop ecx ;; Recover high-order 32 bits of start count
sbb edx, ecx ;; High-order 32 bits of overhead count in EDX
push edx ;; Preserve high-order 32 bits of overhead count
push eax ;; Preserve low-order 32 bits of overhead count
xor eax, eax
cpuid
rdtsc
push edx ;; Preserve high-order 32 bits of start count
push eax ;; Preserve low-order 32 bits of start count
mov __counter__loop__counter__, loopcount
xor eax, eax
cpuid ;; Make sure loop setup instructions finish
ALIGN 16 ;; Optimal loop alignment for P6
label: ;; Start test loop
__counter__loop__label__ equ <label>
ENDM
counter_end MACRO
sub __counter__loop__counter__, 1
jnz __counter__loop__label__
xor eax, eax
cpuid ;; Make sure loop instructions finish
rdtsc ;; Read end count
pop ecx ;; Recover low-order 32 bits of start count
sub eax, ecx ;; Low-order 32 bits of test count in EAX
pop ecx ;; Recover high-order 32 bits of start count
sbb edx, ecx ;; High-order 32 bits of test count in EDX
pop ecx ;; Recover low-order 32 bits of overhead count
sub eax, ecx ;; Low-order 32 bits of adjusted count in EAX
pop ecx ;; Recover high-order 32 bits of overhead count
sbb edx, ecx ;; High-order 32 bits of adjusted count in EDX
mov DWORD PTR __counter__qword__count__, eax
mov DWORD PTR __counter__qword__count__ + 4, edx
invoke SetPriorityClass,__current_process__,__process_priority_class__
IFNB <thread_priority>
invoke SetThreadPriority, __current_thread__, __thread_priority__
ENDIF
finit
fild __counter__qword__count__
fild __counter__loop__count__
fdiv
fistp __counter__qword__count__
mov eax, DWORD PTR __counter__qword__count__
ENDM
Typical result after assembling with JWasm v2.12pre, Nov 27 2013 and linking as a console app with the MS linker distributed with MASM32, and running under Windows 7-64 on a Core i3:
16
42 cycles
2 cycles
42 cycles
2 cycles
42 cycles
2 cycles
The 2 cycles is somewhat surprising, and may possibly indicate that processors are now smart enough to recognize monkey motion and skip over it, so they look good in benchmarks.
> I could not think of any good way to align LOCALs so I just did it by trial and error.
I confess I don't see the problem here: just allocate a bigger buffer and align the start offset within the buffer.
LOCAL buffer[128]:BYTE
LOCAL pbuf :DWORD
lea eax, buffer
add eax, 16 - 1
and eax, -16
mov pbuf, eax
I was assuming that the goal was individually named variables accessed by name.
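The two ideas can probably be combined: keep the aligned start of the buffer in a register and give each 16-byte slot a name with TEXTEQU. What follows is only a sketch, not tested against the original project. The names AlignedLocals, xmm8_loc and xmm9_loc are made up for illustration, and it assumes you can spare ESI for the whole procedure (the code in the first post already uses ESI for n2p, so another register or a reload from pbuf would be needed there).

```asm
; Sketch only: AlignedLocals, xmm8_loc and xmm9_loc are illustrative names.
.686
.xmm
.model flat, stdcall
option casemap:none

.code
AlignedLocals proc uses esi
    LOCAL buffer[48]:BYTE           ; 2 x 16 bytes of data + up to 15 bytes of padding

    lea     esi, buffer
    add     esi, 16 - 1
    and     esi, -16                ; ESI = first 16-byte aligned address inside buffer

; Named 16-byte slots relative to the aligned pointer in ESI
xmm8_loc TEXTEQU <xmmword ptr [esi]>
xmm9_loc TEXTEQU <xmmword ptr [esi+16]>

    pxor    xmm6, xmm6
    pcmpeqb xmm7, xmm7
    movdqa  xmm8_loc, xmm6          ; spill xmm6 into the "xmm8" slot
    movdqa  xmm9_loc, xmm7          ; spill xmm7 into the "xmm9" slot
    pminub  xmm0, xmm8_loc          ; slots also work directly as memory operands
    pmaxub  xmm1, xmm9_loc
    ret
AlignedLocals endp
end
```

The slots stay on the stack, so the procedure remains reentrant, at the cost of tying up one general-purpose register.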
maybe you can write your own prologue/epilogue, such that EBP is 16-aligned
that way, labels can be created to 16-aligned stack frame offsets
on the other hand, it would be simpler to just create the registers in .DATA? :P
i often write my own block like this...
OPTION PROLOGUE:None
OPTION EPILOGUE:None
MyFunc PROC dwArg1:DWORD,dwArg2:DWORD
;--------------------------------
_dwArg2 TEXTEQU <DWORD PTR [EBP+24]>
_dwArg1 TEXTEQU <DWORD PTR [EBP+20]>
; [EBP+16] ;RETurn address
; [EBP+12] ;saved EBX contents
; [EBP+8] ;saved ESI contents
; [EBP+4] ;saved EDI contents
; [EBP] ;saved EBP contents
_dwLocal1 TEXTEQU <DWORD PTR [EBP-4]>
_dwLocal2 TEXTEQU <DWORD PTR [EBP-8]>
;--------------------------------
push ebx
push esi
push edi
push ebp
mov ebp,esp
push eax ;[EBP-4] = _dwLocal1
push eax ;[EBP-8] = _dwLocal2
;body of procedure here
mov esp,ebp ;these 2 instructions are typically implemented with LEAVE
pop ebp ;expanded in this example for clarity
pop edi
pop esi
pop ebx
ret 8
MyFunc ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
if we modify that to 16-align EBP...
OPTION PROLOGUE:None
OPTION EPILOGUE:None
MyFunc PROC dwArg1:DWORD,dwArg2:DWORD
;--------------------------------
_dwArg2 TEXTEQU <DWORD PTR [EBP+8]>
_dwArg1 TEXTEQU <DWORD PTR [EBP+4]>
_dwRestore TEXTEQU <DWORD PTR [EBP]>
_dwLocal1 TEXTEQU <DWORD PTR [EBP-4]>
_dwLocal2 TEXTEQU <DWORD PTR [EBP-8]>
_owLocal3 TEXTEQU <OWORD PTR [EBP-32]>
;--------------------------------
push ebx
push esi
push edi
push ebp
mov edx,esp ;keep a pointer to the saved EBP contents in EDX
lea ebp,[esp-12] ;make space for 3 dword values above EBP
and ebp,-16 ;EBP is now 16-aligned
mov esp,ebp
mov eax,[edx+24] ;get Arg2 from stack
mov _dwArg2,eax ;save it at an EBP addressable address
mov eax,[edx+20] ;get Arg1 from stack ([edx+16] is the RETurn address)
mov _dwArg1,eax ;save it at an EBP addressable address
mov _dwRestore,edx ;save the original stack frame base
push eax ;[EBP-4] = _dwLocal1
push eax ;[EBP-8] = _dwLocal2
sub esp,8+16 ;8 for alignment, 16 for OWORD local
;body of procedure here
mov esp,_dwRestore ;restore ESP to point to saved EBP contents
pop ebp
pop edi
pop esi
pop ebx
ret 8
MyFunc ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
that's a bit wordy, but should work
if you put a little thought into it, you might be able to simplify the code
EDIT: fixed a couple things, OWORD PTR and _dwRestore usage
here's a simpler one....
OPTION PROLOGUE:None
OPTION EPILOGUE:None
MyFunc PROC dwArg1:DWORD,dwArg2:DWORD
;--------------------------------
_dwRestore TEXTEQU <DWORD PTR [EBP-4]>
_dwArg2 TEXTEQU <DWORD PTR [EBP-8]>
_dwArg1 TEXTEQU <DWORD PTR [EBP-12]>
_dwLocal1 TEXTEQU <DWORD PTR [EBP-16]>
_dwLocal2 TEXTEQU <DWORD PTR [EBP-20]>
_owLocal3 TEXTEQU <OWORD PTR [EBP-48]>
;--------------------------------
push ebx
push esi
push edi
push ebp
mov edx,esp ;keep a pointer to the saved EBP contents in EDX
mov ebp,esp
and ebp,-16 ;EBP is now 16-aligned
mov esp,ebp
push edx ;[EBP-4] = _dwRestore
push dword ptr [edx+24] ;[EBP-8] = _dwArg2
push dword ptr [edx+20] ;[EBP-12] = _dwArg1
push eax ;[EBP-16] = _dwLocal1
push eax ;[EBP-20] = _dwLocal2
sub esp,12+16 ;12 for alignment, 16 for _owLocal3
;body of procedure here
mov esp,_dwRestore ;restore ESP to point to saved EBP contents
pop ebp
pop edi
pop esi
pop ebx
ret 8
MyFunc ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
@Elegant...
Not bad, eh? The website already paid for itself. :biggrin:
mov ebp,esp
and ebp,-16 ;EBP is now 16-aligned
mov esp,ebp
i guess you could replace those 3 lines with these 2
and esp,-16 ;ESP is now 16-aligned
mov ebp,esp
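For the case in the first post, that two-instruction alignment could be used roughly as follows. This is a cut-down sketch with no arguments and only EBP saved. SpillDemo, _savedEsp, _xmm8 and _xmm9 are names invented for the example, not part of anyone's posted code.

```asm
; Sketch only: SpillDemo, _savedEsp, _xmm8 and _xmm9 are illustrative names.
.686
.xmm
.model flat, stdcall
option casemap:none

.code
OPTION PROLOGUE:None
OPTION EPILOGUE:None
SpillDemo PROC
;--------------------------------
_savedEsp TEXTEQU <DWORD PTR [EBP-4]>
_xmm8     TEXTEQU <OWORD PTR [EBP-32]>
_xmm9     TEXTEQU <OWORD PTR [EBP-48]>
;--------------------------------
        push    ebp
        mov     edx,esp         ;EDX points at the saved EBP contents
        and     esp,-16         ;ESP is now 16-aligned
        mov     ebp,esp
        push    edx             ;[EBP-4] = _savedEsp
        sub     esp,12+32       ;12 for alignment, 32 for two OWORD slots

        movdqa  _xmm8,xmm6      ;"push" xmm6
        movdqa  _xmm9,xmm7      ;"push" xmm7
        pcmpeqb xmm7,xmm7       ;scratch work that clobbers xmm7
        movdqa  xmm7,_xmm9      ;"pop" xmm7
        movdqa  xmm6,_xmm8      ;"pop" xmm6

        mov     esp,_savedEsp   ;restore ESP to point to saved EBP contents
        pop     ebp
        ret
SpillDemo ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
end
```

Because EBP itself is 16-aligned, every slot at EBP minus a multiple of 16 is safe for MOVDQA, and more slots only need a bigger SUB ESP and another TEXTEQU.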
Quote from: dedndave on March 21, 2015, 11:17:38 PM
> on the other hand, it would be simpler to just create the registers in .DATA? :P
I rewrote SetGlobals for use with xmm regs. The first variable is now always 16-byte aligned:
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
SetGlobals OWORD xx0, xx1, xx2, xx3, xx4, xx5, xx6, xx7
Init
mov eax, 11111111
movd xmm0, eax
movaps xx0, xmm0
add eax, 11111111
movd xmm1, eax
movaps xx1, xmm1
Exit
end start