Threads and adding 64-bit integers

sinsi · November 19, 2013, 09:23:22 PM

What is the best way for a thread to update a global 64-bit variable?
I am playing around with multithreaded file searching, adding totals as I go.

Code Select


file:   mov eax,wfd.nFileSizeLow
        mov edx,wfd.nFileSizeHigh
        lock inc files         ;OK because files is a dword
        add totalbytes,eax     ;lock here guarantees atomic
        adc totalbytes+4,edx   ;but what to do here?

Maybe it doesn't matter? Maybe I'm overthinking it.

The mmx/sse instructions all seem to use a register as the destination
There is a function InterlockedExchangeAdd64 which seems to do it but it's only Vista or better (but we all know MSDN lies about minimum supported OS).

dedndave · November 19, 2013, 09:38:29 PM

you can use a byte (dword, whatever) as a semaphore
(you can also create a semaphore with API)

i sometimes use a byte, then use XCHG on the semaphore

Code Select

    mov     al,1                ;use bit 0 to indicate the semaphore is "being queried"
    xchg    al,bSemaphore       ;get the semaphore byte into AL

;test conditions
;if the previous value is 1, it indicates the semaphore is being queried elsewhere - no action required
;otherwise....

    xchg    al,bSemaphore       ;put the semaphore flags back

if you get a value that isn't 1, you "own" the semaphore, and you can update the 64-bit value
if the semaphore is busy, use Sleep,0 or Sleep,1 to create a wait loop
that allows the other thread to finish it's work and release the semaphore

that's just an example
the point is, XCHG AL,mem has an implied bus lock

API method...
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686946%28v=vs.85%29.aspx

sinsi · November 19, 2013, 10:13:36 PM

Here's Intel's spinlock

Code Select


Spin_Lock:
CMP lockvar, 0 ;Check if lock is free
JE Get_Lock
PAUSE ;Short delay
JMP Spin_Lock
Get_Lock:
MOV EAX, 1
XCHG EAX, lockvar ;Try to get lock
CMP EAX, 0  ;Test if successful
JNE Spin_Lock
Critical_Section:
<critical section code>
MOV lockvar, 0
...
Continue:

I would prefer to use something like that rather than a Windows API, but maybe a CRITICAL_SECTION would work better i.e. let Windows take care of it.

dedndave · November 19, 2013, 11:04:49 PM

you can write a little function to get lock

Code Select

QueryLock PROC

QLloop: mov     al,1
        xchg    al,bSemaphore
        cmp     al,1
        jnz     QLexit

        INVOKE  Sleep,1
        jmp     QLloop

QLexit: ret

QueryLock ENDP

LoveToLrn · December 02, 2013, 04:02:59 PM

Hmmm very interesting(just activating account lol)

MichaelW · December 04, 2013, 03:27:09 PM

I have doubts that my test code is testing everything it should, but I think the macro is functionally correct.

Code Select


;==============================================================================
    include \masm32\include\masm32rt.inc
    .686
;==============================================================================
LOCKED_ADD64 MACRO m64, x64
    LOCAL lbl
  lbl:
    ;; CMPXCHG8B m64: if(m64 == EDX:EAX) m64 = ECX:EBX, ZF = 1
    mov eax, DWORD PTR m64
    mov edx, DWORD PTR m64+4
    mov ebx, DWORD PTR x64
    mov ecx, DWORD PTR x64+4
    add ebx, eax
    adc ecx, edx
    lock cmpxchg8b m64
    jnz lbl
ENDM
;==============================================================================
    .data
        q0 dq 0
        q1 dq 100000001h
        q3 dq 0
    .code
;==============================================================================
ThreadProc proc lpParam:LPVOID
  @@:
    ;printf("1")
    LOCKED_ADD64 q0, q1
    jmp @B
ThreadProc endp
;==============================================================================
start:
;==============================================================================
    xadd eax, ebx
    REPEAT 100
        printf("%016I64X\t%016I64X\t",q0,q1)
        LOCKED_ADD64 q0, q1
        printf("%016I64X\n",q0)
    ENDM
    inkey
    invoke CreateThread, NULL, 0, ThreadProc, 0, 0, NULL
  L0:
    ;printf("0")
    fild q0
    fistp q3
    mov eax, DWORD PTR q3
    cmp eax, DWORD PTR q3+4
    je  @F
    printf("ERROR")
    inkey
    exit
  @@:
    push eax
    LOCKED_ADD64 q0, q1
    pop eax
    cmp eax, 10000000
    jb  L0
    printf("\n%016I64X\t%d\n\n",q0,q0)
    inkey
    exit
;==============================================================================
end start

One bothersome detail is that leaving off the lock prefix does not make the test fail.

dedndave · December 05, 2013, 06:40:43 AM

you may find that LOCK is implied for CMPXCHG
thus, your test won't catch it
try a similar sequence of instructions that does not use XCHG

MichaelW · December 05, 2013, 11:07:21 AM

The Intel manual I'm looking at, from January of this year, does not mention an implicit lock. My code compares the dword components to ensure that they are in sync, and I can't see how they could be, for all 10M cycles, if the instruction were not being executed atomically. For the code in the macro, less the jump, I get ~34 cycles on my P3 and ~196 cycles on my P4 Northwood. For a xchg reg, mem instruction, known to have an implicit lock, I get ~19 cycles on my P3 and ~125 cycles on my P4, so it looks like the instruction does do an implicit lock.

dedndave · December 05, 2013, 02:51:46 PM

this may be related to what you are seeing

http://en.wikipedia.org/wiki/Pentium_F00F_bug

dedndave · December 05, 2013, 02:54:50 PM

another
scroll about half way down, then the comments become interesting :P

http://stackoverflow.com/questions/1746344/illegal-instruction-in-asm-lock-cmpxchg-dest-src

FORTRANS · December 06, 2013, 12:29:50 AM

Hi,

Well the Intel documentation says:

QuoteCMPXCHG8B—Compare and Exchange 8 Bytes
[Snip]
This instruction can be used with a LOCK prefix to allow the instruction to be
executed atomically. To simplify the interface to the processor's bus, the
destination operand receives a write cycle without regard to the result of
the comparison. The destination operand is written back if the comparison
fails; otherwise, the source operand is written into the destination. (The
processor never produces a locked read without also producing
a locked write.)

Though I don't see any reason for, or use of, the instruction if it
is not atomic. Nor do I really see how it could be non-atomic for
a single processor.

Regards,

Steve N.

MichaelW · December 06, 2013, 09:03:29 AM

Well, the lock prefix is having an effect, just not the expected one.

Code Select


;==============================================================================
  ; ----------------------------------------------------------------------
  ; These two macros perform the grunt work involved in measuring the
  ; processor clock cycle count for a block of code. These macros must
  ; be used in pairs, and the block of code must be placed in between
  ; the counter_begin and counter_end macro calls. The counter_end macro
  ; returns the clock cycle count for a single pass through the block of
  ; code, corrected for the test loop overhead, in EAX.
  ;
  ; These macros require a .586 or higher processor directive.
  ;
  ; The essential differences between these macros and the prvious macros
  ; are that these save and restore the original priorities, and provide
  ; a way to control the thread priority. Control of the thread priority
  ; allows timing code at the highest possible priority by combining
  ; REALTIME_PRIORITY_CLASS with THREAD_PRIORITY_TIME_CRITICAL.
  ;
  ; Note that running at the higher priority settings on a single core
  ; processor involves some risk, as it will cause your process to
  ; preempt *all* other processes, including critical Windows processes.
  ; Using HIGH_PRIORITY_CLASS in combination with THREAD_PRIORITY_NORMAL
  ; should generally be safe.
  ; ----------------------------------------------------------------------

    counter_begin MACRO loopcount:REQ, process_priority:REQ, thread_priority
        LOCAL label

        IFNDEF __counter__qword__count__
          .data
          ALIGN 8             ;; Optimal alignment for QWORD
            __counter__qword__count__  dq 0
            __counter__loop__count__   dd 0
            __counter__loop__counter__ dd 0
            __process_priority_class__ dd 0
            __thread_priority__        dd 0
            __current_process__        dd 0
            __current_thread__         dd 0
          .code
        ENDIF

        mov __counter__loop__count__, loopcount
        invoke GetCurrentProcess
        mov __current_process__, eax
        invoke GetPriorityClass, __current_process__
        mov __process_priority_class__, eax
        invoke SetPriorityClass, __current_process__, process_priority
        IFNB <thread_priority>
            invoke GetCurrentThread
            mov __current_thread__, eax
            invoke GetThreadPriority, __current_thread__
            mov __thread_priority__, eax
            invoke SetThreadPriority, __current_thread__, thread_priority
        ENDIF
        xor eax, eax          ;; Use same CPUID input value for each call
        cpuid                 ;; Flush pipe & wait for pending ops to finish
        rdtsc                 ;; Read Time Stamp Counter

        push edx              ;; Preserve high-order 32 bits of start count
        push eax              ;; Preserve low-order 32 bits of start count
        mov   __counter__loop__counter__, loopcount
        xor eax, eax
        cpuid                 ;; Make sure loop setup instructions finish
      ALIGN 16                ;; Optimal loop alignment for P6
      @@:                     ;; Start an empty reference loop
        sub __counter__loop__counter__, 1
        jnz @B

        xor eax, eax
        cpuid                 ;; Make sure loop instructions finish
        rdtsc                 ;; Read end count
        pop ecx               ;; Recover low-order 32 bits of start count
        sub eax, ecx          ;; Low-order 32 bits of overhead count in EAX
        pop ecx               ;; Recover high-order 32 bits of start count
        sbb edx, ecx          ;; High-order 32 bits of overhead count in EDX
        push edx              ;; Preserve high-order 32 bits of overhead count
        push eax              ;; Preserve low-order 32 bits of overhead count

        xor eax, eax
        cpuid
        rdtsc
        push edx              ;; Preserve high-order 32 bits of start count
        push eax              ;; Preserve low-order 32 bits of start count
        mov   __counter__loop__counter__, loopcount
        xor eax, eax
        cpuid                 ;; Make sure loop setup instructions finish
      ALIGN 16                ;; Optimal loop alignment for P6
      label:                  ;; Start test loop
        __counter__loop__label__ equ <label>
    ENDM

    counter_end MACRO
        LOCAL lbl
        sub __counter__loop__counter__, 1
        jnz  __counter__loop__label__

        xor eax, eax
        cpuid                 ;; Make sure loop instructions finish
        rdtsc                 ;; Read end count
        pop ecx               ;; Recover low-order 32 bits of start count
        sub eax, ecx          ;; Low-order 32 bits of test count in EAX
        pop ecx               ;; Recover high-order 32 bits of start count
        sbb edx, ecx          ;; High-order 32 bits of test count in EDX
        pop ecx               ;; Recover low-order 32 bits of overhead count
        sub eax, ecx          ;; Low-order 32 bits of adjusted count in EAX
        pop ecx               ;; Recover high-order 32 bits of overhead count
        sbb edx, ecx          ;; High-order 32 bits of adjusted count in EDX

        mov DWORD PTR __counter__qword__count__, eax
        mov DWORD PTR __counter__qword__count__ + 4, edx

        invoke SetPriorityClass,__current_process__,__process_priority_class__
        IFNB <thread_priority>
            invoke SetThreadPriority, __current_thread__, __thread_priority__
        ENDIF

        finit
        fild __counter__qword__count__
        fild __counter__loop__count__
        fdiv
        fistp __counter__qword__count__

        mov eax, DWORD PTR __counter__qword__count__
        test eax, eax
        jns lbl
        xor eax, eax          ;; Assume count < 0 is effectively 0
      lbl:
    ENDM
;==============================================================================
include \masm32\include\masm32rt.inc
.686
;==============================================================================
.data
    m64 dq 0
    x64 dq 0
.code
;==============================================================================
start:
;==============================================================================

    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1

    invoke Sleep, 8000

    REPEAT 3

    counter_begin 10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL
        xchg eax, DWORD PTR m64
    counter_end
    printf("xchg reg, mem                 :%d cycles\n", eax)

    counter_begin 10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL
        lock xchg eax, DWORD PTR m64
    counter_end
    printf("lock xchg reg, mem            :%d cycles\n", eax)

    counter_begin 10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL
        mov eax, DWORD PTR m64
        mov edx, DWORD PTR m64+4
        mov ebx, DWORD PTR x64
        mov ecx, DWORD PTR x64+4
    counter_end
    printf("overhead code                 :%d cycles\n", eax)

    counter_begin 10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL
        mov eax, DWORD PTR m64
        mov edx, DWORD PTR m64+4
        mov ebx, DWORD PTR x64
        mov ecx, DWORD PTR x64+4
        cmpxchg8b m64
    counter_end
    printf("overhead + cmpxchg8b m64      :%d cycles\n", eax)

    counter_begin 10000000,REALTIME_PRIORITY_CLASS,THREAD_PRIORITY_TIME_CRITICAL
        mov eax, DWORD PTR m64
        mov edx, DWORD PTR m64+4
        mov ebx, DWORD PTR x64
        mov ecx, DWORD PTR x64+4
        lock cmpxchg8b m64
    counter_end
    printf("overhead + lock cmpxchg8b m64 :%d cycles\n\n", eax)

    ENDM

    inkey
    exit
;==============================================================================
end start

Running on my P3:

Code Select


xchg reg, mem                 :25 cycles
lock xchg reg, mem            :25 cycles
overhead code                 :0 cycles
overhead + cmpxchg8b m64      :11 cycles
overhead + lock cmpxchg8b m64 :34 cycles

xchg reg, mem                 :25 cycles
lock xchg reg, mem            :25 cycles
overhead code                 :0 cycles
overhead + cmpxchg8b m64      :11 cycles
overhead + lock cmpxchg8b m64 :34 cycles

xchg reg, mem                 :25 cycles
lock xchg reg, mem            :25 cycles
overhead code                 :0 cycles
overhead + cmpxchg8b m64      :11 cycles
overhead + lock cmpxchg8b m64 :34 cycles

Running on my P4 Northwood:

Code Select


xchg reg, mem                 :178 cycles
lock xchg reg, mem            :178 cycles
overhead code                 :0 cycles
overhead + cmpxchg8b m64      :33 cycles
overhead + lock cmpxchg8b m64 :196 cycles

xchg reg, mem                 :178 cycles
lock xchg reg, mem            :178 cycles
overhead code                 :0 cycles
overhead + cmpxchg8b m64      :33 cycles
overhead + lock cmpxchg8b m64 :197 cycles

xchg reg, mem                 :179 cycles
lock xchg reg, mem            :180 cycles
overhead code                 :0 cycles
overhead + cmpxchg8b m64      :33 cycles
overhead + lock cmpxchg8b m64 :196 cycles

Running on my P3 or my (single core w HTT) P4 the test succeeds without the lock prefix and without the jnz. How does it behave on a real multi-core processor?

TWell · December 06, 2013, 05:39:12 PM

AMD Athlon(tm) II X2 220

Code Select

xchg reg, mem                 :2 cycles
lock xchg reg, mem            :11 cycles
overhead code                 :0 cycles
overhead + cmpxchg8b m64      :9 cycles
overhead + lock cmpxchg8b m64 :22 cycles

xchg reg, mem                 :11 cycles
lock xchg reg, mem            :11 cycles
overhead code                 :0 cycles
overhead + cmpxchg8b m64      :9 cycles
overhead + lock cmpxchg8b m64 :22 cycles

xchg reg, mem                 :11 cycles
lock xchg reg, mem            :11 cycles
overhead code                 :0 cycles
overhead + cmpxchg8b m64      :9 cycles
overhead + lock cmpxchg8b m64 :22 cycles

jj2007 · December 06, 2013, 05:46:13 PM

Quote from: dedndave on December 05, 2013, 02:51:46 PM
this may be related to what you are seeing

http://en.wikipedia.org/wiki/Pentium_F00F_bug

this may be unrelated to what you were posting

QuoteThe floppy drive of the Commodore Amiga personal computer could be made to produce noises of various pitches, by making the drive heads move back and forth. A program existed which could play El Cóndor Pasa, more or less correctly, on the Amiga's floppy drive.

;)

sinsi · December 06, 2013, 07:02:34 PM

Thanks Michael, that macro seems to work OK.

Here's how I tested

Code Select


include \masm32\include\masm32rt.inc
.686

THREADS = 50

LOCKED_ADD64 MACRO m64, x64
    LOCAL lbl
  lbl:
    ;; CMPXCHG8B m64: if(m64 == EDX:EAX) m64 = ECX:EBX, ZF = 1
    mov eax, DWORD PTR m64
    mov edx, DWORD PTR m64+4
    mov ebx, DWORD PTR x64
    mov ecx, DWORD PTR x64+4
    add ebx, eax
    adc ecx, edx
    lock cmpxchg8b m64
    jnz lbl
ENDM

UNLOCKED_ADD64 MACRO m64, x64
    LOCAL lbl
  lbl:
    ;; CMPXCHG8B m64: if(m64 == EDX:EAX) m64 = ECX:EBX, ZF = 1
    mov eax, DWORD PTR m64
    mov edx, DWORD PTR m64+4
    mov ebx, DWORD PTR x64
    mov ecx, DWORD PTR x64+4
    add ebx, eax
    adc ecx, edx
    cmpxchg8b m64
    jnz lbl
ENDM

.data?
hthread     dd THREADS dup (?)
mem64       dq ?
mem264      dq ?

.data
exx64       dq 1

.code

thread1:    push esi
            mov esi,100000
@@:         LOCKED_ADD64 mem64,exx64
            UNLOCKED_ADD64 mem264,exx64
            dec esi
            jnz @b
            pop esi
            ret 4
            
start:      

            sub ebx,ebx
@@:         invoke CreateThread,0,0,offset thread1,0,0,0
            mov hthread[ebx*4],eax
            inc ebx
            cmp ebx,THREADS
            jb @b
            invoke WaitForMultipleObjects,THREADS,offset hthread,TRUE,INFINITE
            
            printf("Should be         5000000\nUNLOCKED_ADD64 is %I64d\nLOCKED_ADD64 is   %I64d\n",mem264,mem64)
            
        

        inkey
        exit

end start

dirty. yucky code.

The MASM Forum

News:

Threads and adding 64-bit integers

sinsi

dedndave

sinsi

dedndave

LoveToLrn

MichaelW

dedndave

MichaelW

dedndave

dedndave

FORTRANS

MichaelW

TWell

jj2007

sinsi