What is the best way for a thread to update a global 64-bit variable?
I am playing around with multithreaded file searching, adding totals as I go.
file: mov eax,wfd.nFileSizeLow
mov edx,wfd.nFileSizeHigh
lock inc files ;OK because files is a dword
add totalbytes,eax ;lock here guarantees atomic
adc totalbytes+4,edx ;but what to do here?
Maybe it doesn't matter? Maybe I'm overthinking it.
The mmx/sse instructions all seem to use a register as the destination
There is a function InterlockedExchangeAdd64 which seems to do it but it's only Vista or better (but we all know MSDN lies about minimum supported OS).
you can use a byte (dword, whatever) as a semaphore
(you can also create a semaphore with API)
i sometimes use a byte, then use XCHG on the semaphore
mov al,1 ;use bit 0 to indicate the semaphore is "being queried"
xchg al,bSemaphore ;get the semaphore byte into AL
;test conditions
;if the previous value is 1, it indicates the semaphore is being queried elsewhere - no action required
;otherwise....
xchg al,bSemaphore ;put the semaphore flags back
if you get a value that isn't 1, you "own" the semaphore, and you can update the 64-bit value
if the semaphore is busy, use Sleep,0 or Sleep,1 to create a wait loop
that allows the other thread to finish it's work and release the semaphore
that's just an example
the point is, XCHG AL,mem has an implied bus lock
API method...
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686946%28v=vs.85%29.aspx (http://msdn.microsoft.com/en-us/library/windows/desktop/ms686946%28v=vs.85%29.aspx)
Here's Intel's spinlock
Spin_Lock:
CMP lockvar, 0 ;Check if lock is free
JE Get_Lock
PAUSE ;Short delay
JMP Spin_Lock
Get_Lock:
MOV EAX, 1
XCHG EAX, lockvar ;Try to get lock
CMP EAX, 0 ;Test if successful
JNE Spin_Lock
Critical_Section:
<critical section code>
MOV lockvar, 0
...
Continue:
I would prefer to use something like that rather than a Windows API, but maybe a CRITICAL_SECTION would work better i.e. let Windows take care of it.
you can write a little function to get lock
QueryLock PROC
QLloop: mov al,1
xchg al,bSemaphore
cmp al,1
jnz QLexit
INVOKE Sleep,1
jmp QLloop
QLexit: ret
QueryLock ENDP
Hmmm very interesting(just activating account lol)
I have doubts that my test code is testing everything it should, but I think the macro is functionally correct.
;==============================================================================
include \masm32\include\masm32rt.inc
.686
;==============================================================================
LOCKED_ADD64 MACRO m64, x64
LOCAL lbl
lbl:
;; CMPXCHG8B m64: if(m64 == EDX:EAX) m64 = ECX:EBX, ZF = 1
mov eax, DWORD PTR m64
mov edx, DWORD PTR m64+4
mov ebx, DWORD PTR x64
mov ecx, DWORD PTR x64+4
add ebx, eax
adc ecx, edx
lock cmpxchg8b m64
jnz lbl
ENDM
;==============================================================================
.data
q0 dq 0
q1 dq 100000001h
q3 dq 0
.code
;==============================================================================
ThreadProc proc lpParam:LPVOID
@@:
;printf("1")
LOCKED_ADD64 q0, q1
jmp @B
ThreadProc endp
;==============================================================================
start:
;==============================================================================
xadd eax, ebx
REPEAT 100
printf("%016I64X\t%016I64X\t",q0,q1)
LOCKED_ADD64 q0, q1
printf("%016I64X\n",q0)
ENDM
inkey
invoke CreateThread, NULL, 0, ThreadProc, 0, 0, NULL
L0:
;printf("0")
fild q0
fistp q3
mov eax, DWORD PTR q3
cmp eax, DWORD PTR q3+4
je @F
printf("ERROR")
inkey
exit
@@:
push eax
LOCKED_ADD64 q0, q1
pop eax
cmp eax, 10000000
jb L0
printf("\n%016I64X\t%d\n\n",q0,q0)
inkey
exit
;==============================================================================
end start
One bothersome detail is that leaving off the lock prefix does not make the test fail.
you may find that LOCK is implied for CMPXCHG
thus, your test won't catch it
try a similar sequence of instructions that does not use XCHG
The Intel manual I'm looking at, from January of this year, does not mention an implicit lock. My code compares the dword components to ensure that they are in sync, and I can't see how they could be, for all 10M cycles, if the instruction were not being executed atomically. For the code in the macro, less the jump, I get ~34 cycles on my P3 and ~196 cycles on my P4 Northwood. For a xchg reg, mem instruction, known to have an implicit lock, I get ~19 cycles on my P3 and ~125 cycles on my P4, so it looks like the instruction does do an implicit lock.
this may be related to what you are seeing
http://en.wikipedia.org/wiki/Pentium_F00F_bug (http://en.wikipedia.org/wiki/Pentium_F00F_bug)
another
scroll about half way down, then the comments become interesting :P
http://stackoverflow.com/questions/1746344/illegal-instruction-in-asm-lock-cmpxchg-dest-src (http://stackoverflow.com/questions/1746344/illegal-instruction-in-asm-lock-cmpxchg-dest-src)
Hi,
Well the Intel documentation says:
QuoteCMPXCHG8B—Compare and Exchange 8 Bytes
[Snip]
This instruction can be used with a LOCK prefix to allow the instruction to be
executed atomically. To simplify the interface to the processor's bus, the
destination operand receives a write cycle without regard to the result of
the comparison. The destination operand is written back if the comparison
fails; otherwise, the source operand is written into the destination. (The
processor never produces a locked read without also producing
a locked write.)
Though I don't see any reason for, or use of, the instruction if it
is not atomic. Nor do I really see how it could be non-atomic for
a single processor.
Regards,
Steve N.
Well, the lock prefix is having an effect, just not the expected one.
;==============================================================================
; ----------------------------------------------------------------------
; These two macros perform the grunt work involved in measuring the
; processor clock cycle count for a block of code. These macros must
; be used in pairs, and the block of code must be placed in between
; the counter_begin and counter_end macro calls. The counter_end macro
; returns the clock cycle count for a single pass through the block of
; code, corrected for the test loop overhead, in EAX.
;
; These macros require a .586 or higher processor directive.
;
; The essential differences between these macros and the prvious macros
; are that these save and restore the original priorities, and provide
; a way to control the thread priority. Control of the thread priority
; allows timing code at the highest possible priority by combining
; REALTIME_PRIORITY_CLASS with THREAD_PRIORITY_TIME_CRITICAL.
;
; Note that running at the higher priority settings on a single core
; processor involves some risk, as it will cause your process to
; preempt *all* other processes, including critical Windows processes.
; Using HIGH_PRIORITY_CLASS in combination with THREAD_PRIORITY_NORMAL
; should generally be safe.
; ----------------------------------------------------------------------
counter_begin MACRO loopcount:REQ, process_priority:REQ, thread_priority
LOCAL label
IFNDEF __counter__qword__count__
.data
ALIGN 8 ;; Optimal alignment for QWORD
__counter__qword__count__ dq 0
__counter__loop__count__ dd 0
__counter__loop__counter__ dd 0
__process_priority_class__ dd 0
__thread_priority__ dd 0
__current_process__ dd 0
__current_thread__ dd 0
.code
ENDIF
mov __counter__loop__count__, loopcount
invoke GetCurrentProcess
mov __current_process__, eax
invoke GetPriorityClass, __current_process__
mov __process_priority_class__, eax
invoke SetPriorityClass, __current_process__, process_priority
IFNB <thread_priority>
invoke GetCurrentThread
mov __current_thread__, eax
invoke GetThreadPriority, __current_thread__
mov __thread_priority__, eax
invoke SetThreadPriority, __current_thread__, thread_priority
ENDIF
xor eax, eax ;; Use same CPUID input value for each call
cpuid ;; Flush pipe & wait for pending ops to finish
rdtsc ;; Read Time Stamp Counter
push edx ;; Preserve high-order 32 bits of start count
push eax ;; Preserve low-order 32 bits of start count
mov __counter__loop__counter__, loopcount
xor eax, eax
cpuid ;; Make sure loop setup instructions finish
ALIGN 16 ;; Optimal loop alignment for P6
@@: ;; Start an empty reference loop
sub __counter__loop__counter__, 1
jnz @B
xor eax, eax
cpuid ;; Make sure loop instructions finish
rdtsc ;; Read end count
pop ecx ;; Recover low-order 32 bits of start count
sub eax, ecx ;; Low-order 32 bits of overhead count in EAX
pop ecx ;; Recover high-order 32 bits of start count
sbb edx, ecx ;; High-order 32 bits of overhead count in EDX
push edx ;; Preserve high-order 32 bits of overhead count
push eax ;; Preserve low-order 32 bits of overhead count
xor eax, eax
cpuid
rdtsc
push edx ;; Preserve high-order 32 bits of start count
push eax ;; Preserve low-order 32 bits of start count
mov __counter__loop__counter__, loopcount
xor eax, eax
cpuid ;; Make sure loop setup instructions finish
ALIGN 16 ;; Optimal loop alignment for P6
label: ;; Start test loop
__counter__loop__label__ equ <label>
ENDM
counter_end MACRO
LOCAL lbl
sub __counter__loop__counter__, 1
jnz __counter__loop__label__
xor eax, eax
cpuid ;; Make sure loop instructions finish
rdtsc ;; Read end count
pop ecx ;; Recover low-order 32 bits of start count
sub eax, ecx ;; Low-order 32 bits of test count in EAX
pop ecx ;; Recover high-order 32 bits of start count
sbb edx, ecx ;; High-order 32 bits of test count in EDX
pop ecx ;; Recover low-order 32 bits of overhead count
sub eax, ecx ;; Low-order 32 bits of adjusted count in EAX
pop ecx ;; Recover high-order 32 bits of overhead count
sbb edx, ecx ;; High-order 32 bits of adjusted count in EDX
mov DWORD PTR __counter__qword__count__, eax
mov DWORD PTR __counter__qword__count__ + 4, edx
invoke SetPriorityClass,__current_process__,__process_priority_class__
IFNB <thread_priority>
invoke SetThreadPriority, __current_thread__, __thread_priority__
ENDIF
finit
fild __counter__qword__count__
fild __counter__loop__count__
fdiv
fistp __counter__qword__count__
mov eax, DWORD PTR __counter__qword__count__
test eax, eax
jns lbl
xor eax, eax ;; Assume count < 0 is effectively 0
lbl:
ENDM
;==============================================================================
include \masm32\include\masm32rt.inc
.686
;==============================================================================
.data
m64 dq 0
x64 dq 0
.code
;==============================================================================
start:
;==============================================================================
invoke GetCurrentProcess
invoke SetProcessAffinityMask, eax, 1
invoke Sleep, 8000
REPEAT 3
counter_begin 10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL
xchg eax, DWORD PTR m64
counter_end
printf("xchg reg, mem :%d cycles\n", eax)
counter_begin 10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL
lock xchg eax, DWORD PTR m64
counter_end
printf("lock xchg reg, mem :%d cycles\n", eax)
counter_begin 10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL
mov eax, DWORD PTR m64
mov edx, DWORD PTR m64+4
mov ebx, DWORD PTR x64
mov ecx, DWORD PTR x64+4
counter_end
printf("overhead code :%d cycles\n", eax)
counter_begin 10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL
mov eax, DWORD PTR m64
mov edx, DWORD PTR m64+4
mov ebx, DWORD PTR x64
mov ecx, DWORD PTR x64+4
cmpxchg8b m64
counter_end
printf("overhead + cmpxchg8b m64 :%d cycles\n", eax)
counter_begin 10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL
mov eax, DWORD PTR m64
mov edx, DWORD PTR m64+4
mov ebx, DWORD PTR x64
mov ecx, DWORD PTR x64+4
lock cmpxchg8b m64
counter_end
printf("overhead + lock cmpxchg8b m64 :%d cycles\n\n", eax)
ENDM
inkey
exit
;==============================================================================
end start
Running on my P3:
xchg reg, mem :25 cycles
lock xchg reg, mem :25 cycles
overhead code :0 cycles
overhead + cmpxchg8b m64 :11 cycles
overhead + lock cmpxchg8b m64 :34 cycles
xchg reg, mem :25 cycles
lock xchg reg, mem :25 cycles
overhead code :0 cycles
overhead + cmpxchg8b m64 :11 cycles
overhead + lock cmpxchg8b m64 :34 cycles
xchg reg, mem :25 cycles
lock xchg reg, mem :25 cycles
overhead code :0 cycles
overhead + cmpxchg8b m64 :11 cycles
overhead + lock cmpxchg8b m64 :34 cycles
Running on my P4 Northwood:
xchg reg, mem :178 cycles
lock xchg reg, mem :178 cycles
overhead code :0 cycles
overhead + cmpxchg8b m64 :33 cycles
overhead + lock cmpxchg8b m64 :196 cycles
xchg reg, mem :178 cycles
lock xchg reg, mem :178 cycles
overhead code :0 cycles
overhead + cmpxchg8b m64 :33 cycles
overhead + lock cmpxchg8b m64 :197 cycles
xchg reg, mem :179 cycles
lock xchg reg, mem :180 cycles
overhead code :0 cycles
overhead + cmpxchg8b m64 :33 cycles
overhead + lock cmpxchg8b m64 :196 cycles
Running on my P3 or my (single core w HTT) P4 the test succeeds without the lock prefix and without the jnz. How does it behave on a real multi-core processor?
AMD Athlon(tm) II X2 220
xchg reg, mem :2 cycles
lock xchg reg, mem :11 cycles
overhead code :0 cycles
overhead + cmpxchg8b m64 :9 cycles
overhead + lock cmpxchg8b m64 :22 cycles
xchg reg, mem :11 cycles
lock xchg reg, mem :11 cycles
overhead code :0 cycles
overhead + cmpxchg8b m64 :9 cycles
overhead + lock cmpxchg8b m64 :22 cycles
xchg reg, mem :11 cycles
lock xchg reg, mem :11 cycles
overhead code :0 cycles
overhead + cmpxchg8b m64 :9 cycles
overhead + lock cmpxchg8b m64 :22 cycles
Quote from: dedndave on December 05, 2013, 02:51:46 PM
this may be related to what you are seeing
http://en.wikipedia.org/wiki/Pentium_F00F_bug (http://en.wikipedia.org/wiki/Pentium_F00F_bug)
this may be unrelated to what you were posting (http://en.wikipedia.org/wiki/Killer_poke)
QuoteThe floppy drive of the Commodore Amiga personal computer could be made to produce noises of various pitches, by making the drive heads move back and forth. A program existed which could play El Cóndor Pasa, more or less correctly, on the Amiga's floppy drive.
;)
Thanks Michael, that macro seems to work OK.
Here's how I tested
include \masm32\include\masm32rt.inc
.686
THREADS = 50
LOCKED_ADD64 MACRO m64, x64
LOCAL lbl
lbl:
;; CMPXCHG8B m64: if(m64 == EDX:EAX) m64 = ECX:EBX, ZF = 1
mov eax, DWORD PTR m64
mov edx, DWORD PTR m64+4
mov ebx, DWORD PTR x64
mov ecx, DWORD PTR x64+4
add ebx, eax
adc ecx, edx
lock cmpxchg8b m64
jnz lbl
ENDM
UNLOCKED_ADD64 MACRO m64, x64
LOCAL lbl
lbl:
;; CMPXCHG8B m64: if(m64 == EDX:EAX) m64 = ECX:EBX, ZF = 1
mov eax, DWORD PTR m64
mov edx, DWORD PTR m64+4
mov ebx, DWORD PTR x64
mov ecx, DWORD PTR x64+4
add ebx, eax
adc ecx, edx
cmpxchg8b m64
jnz lbl
ENDM
.data?
hthread dd THREADS dup (?)
mem64 dq ?
mem264 dq ?
.data
exx64 dq 1
.code
thread1: push esi
mov esi,100000
@@: LOCKED_ADD64 mem64,exx64
UNLOCKED_ADD64 mem264,exx64
dec esi
jnz @b
pop esi
ret 4
start:
sub ebx,ebx
@@: invoke CreateThread,0,0,offset thread1,0,0,0
mov hthread[ebx*4],eax
inc ebx
cmp ebx,THREADS
jb @b
invoke WaitForMultipleObjects,THREADS,offset hthread,TRUE,INFINITE
printf("Should be 5000000\nUNLOCKED_ADD64 is %I64d\nLOCKED_ADD64 is %I64d\n",mem264,mem64)
inkey
exit
end start
dirty. yucky code.