Topic: Parallel threads test piece.

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4873
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Parallel threads test piece.
« on: September 08, 2016, 01:21:07 AM »
I have done a test piece that starts 8 threads running as close as possible to the same time. Each thread runs an arbitrary 1 billion iterations of a loop to provide the appropriate delay, while the caller polls in a loop to check whether all 8 threads have terminated. The thread count better suits a quad core processor, but the general idea is to perform parallel processing of whatever task the author has in mind to get the speed up. Each thread is started, and a spinlock in the caller waits until the thread has copied its arguments to local variables before calling CloseHandle(). The threads run to completion, then signal back to the caller that they have finished.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    LOCAL rvar :QWORD           ; thread return value
    LOCAL pvar :QWORD           ; its pointer

    mov pvar, ptr$(rvar)        ; load address into pointer
    mov rvar, 0                 ; set rval to 0

    fn callthread,0,1,pvar
    fn callthread,0,2,pvar
    fn callthread,0,3,pvar
    fn callthread,0,4,pvar
    fn callthread,0,5,pvar
    fn callthread,0,6,pvar
    fn callthread,0,7,pvar
    fn callthread,0,8,pvar

  ; --------------------------------------
  ; wait until all threads have terminated
  ; --------------------------------------
  @@:
    pause
    cmp rvar, 8
    jne @B

    conout lf,"--------------------------",lf, \
              "thread finished count = ",str$(rvar),lf, \
              "--------------------------",lf,lf
    waitkey

    invoke ExitProcess,0

    ret

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

callthread proc a1:QWORD,a2:QWORD,a3:QWORD

    LOCAL arr[8]   :QWORD
    LOCAL hThread  :QWORD
    LOCAL hProcess :QWORD

    mov hProcess, rv(GetCurrentProcess)

    invoke SetPriorityClass,hProcess,HIGH_PRIORITY_CLASS

    lea r10, arr                    ; load the array address
    mov QWORD PTR [r10],    0       ; first member is the flag to be tested set to zero
    mrm QWORD PTR [r10+8],  a1
    mrm QWORD PTR [r10+16], a2
    mrm QWORD PTR [r10+24], a3

    mov hThread, rv(CreateThread,0,0,ADDR threadproc,ADDR arr,0,0)

  ; --------------------------------------------------------
  ; spinlock ensures the arguments are written to the thread
  ; before the call to close the thread handle is made.
  ; NOTE : "pause" is not used here as it is too slow.
  ; --------------------------------------------------------
    lea rax, arr
  spinlock:                         ; loop until flag NE 0
    cmp QWORD PTR [rax], 0
    je spinlock

    conout "Thread ",str$(a2)," started",lf

    fn CloseHandle, hThread

    invoke SetPriorityClass,hProcess,NORMAL_PRIORITY_CLASS

    ret

callthread endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

threadproc proc arg:QWORD

    LOCAL arg1 :QWORD
    LOCAL arg2 :QWORD
    LOCAL arg3 :QWORD

  ; ------------------------------------------------------
  ; copy arguments to LOCAL values before setting the flag
  ; ------------------------------------------------------
  ; arg is passed in RCX

    mrm arg1, QWORD PTR [rcx+8]
    mrm arg2, QWORD PTR [rcx+16]
    mrm arg3, QWORD PTR [rcx+24]

  ; -----------------------------------------------------
  ; set the flag to be evaluated by the calling spinlock
  ; -----------------------------------------------------
    mov QWORD PTR [rcx], 1

  @@:
    ; conout str$(arg1),lf
    add arg1, 1
    cmp arg1, 1000000000
    jbe @B

    conout "thread ",str$(arg2)," finished",lf

  ; -------------------------------------------
  ; increment the initial thread closed counter
  ; -------------------------------------------
    mov rax, arg3
    add QWORD PTR [rax], 1

    ret

threadproc endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end

The results look like this. The first 8 lines appear quickly, then each thread is reported at the console as it terminates. It also shows that the Win10 thread scheduler is a bit all over the place; the finishing order varies each time it's run.

Thread 1 started
Thread 2 started
Thread 3 started
Thread 4 started
Thread 5 started
Thread 6 started
Thread 7 started
Thread 8 started
thread 2 finished
thread 3 finished
thread 5 finished
thread 7 finished
thread 4 finished
thread 1 finished
thread 6 finished
thread 8 finished

--------------------------
thread finished count = 8
--------------------------

Press any key to continue...
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

rrr314159

  • Member
  • *****
  • Posts: 1382
Re: Parallel threads test piece.
« Reply #1 on: September 08, 2016, 02:28:22 AM »
IMHO it's better to use WaitForMultipleObjects than the counter rvar. It's fast; at least, if the threads are doing any real work, the overhead is entirely negligible. Plus it's cleaner and more readable. Of course, maybe you know something about it that I don't; I wouldn't be surprised.

BTW the comment in entry_point, 4th line, should be "; set rvar to 0"
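
As a rough illustration of waiting on the threads themselves instead of polling a counter, here is a minimal sketch in portable C. It is not the Win32 code: POSIX pthread_join stands in for WaitForMultipleObjects with its wait-all flag set, and worker, run_and_wait_all and NUM_WORKERS are names invented for this example. On Windows the equivalent would need the thread handles kept open in an array until the wait returns, rather than closed straight after start-up as in the test piece.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NUM_WORKERS 8                 /* matches the 8 threads in the demo */

/* Hypothetical worker: burns some iterations, then returns its own id. */
static void *worker(void *arg)
{
    uintptr_t id = (uintptr_t)arg;
    volatile uint64_t n = 0;
    while (n < 1000000) {             /* scaled-down stand-in for the 1 billion loop */
        n++;
    }
    return (void *)id;
}

/* Start all workers, then block until every one has finished.
   Returns the number of workers that were joined successfully. */
int run_and_wait_all(void)
{
    pthread_t handles[NUM_WORKERS];
    int finished = 0;

    for (uintptr_t i = 0; i < NUM_WORKERS; i++) {
        int rc = pthread_create(&handles[i], NULL, worker, (void *)(i + 1));
        assert(rc == 0);
        (void)rc;
    }

    /* Joining each handle blocks without spinning; no shared counter is needed. */
    for (int i = 0; i < NUM_WORKERS; i++) {
        void *ret = NULL;
        if (pthread_join(handles[i], &ret) == 0 &&
            (uintptr_t)ret == (uintptr_t)(i + 1)) {
            finished++;
        }
    }
    return finished;
}
```

Calling run_and_wait_all() from main returns 8 once every worker has been joined. Because the caller counts the joins itself, there is no shared variable for the threads to race on.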

I am NaN ;)

qWord

  • Member
  • *****
  • Posts: 1460
  • The base type of a type is the type itself
    • SmplMath macros
Re: Parallel threads test piece.
« Reply #2 on: September 08, 2016, 07:00:07 AM »

  ; -------------------------------------------
  ; increment the initial thread closed counter
  ; -------------------------------------------
    mov rax, arg3
    add QWORD PTR [rax], 1

The lock prefix must be used here; otherwise the waiting loop in entry_point might loop forever because of one or more "lost" increment operations (e.g. consider the case where two threads/cores start the read for ADD mem,1 simultaneously: both read the same old value, both write back old+1, and one increment is lost).
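
The lost update can be avoided with an atomic read-modify-write. The following is a minimal sketch in portable C rather than MASM: C11 atomic_fetch_add plays the role of the lock-prefixed ADD (on x86-64, compilers typically emit a lock-prefixed instruction for it), and POSIX threads stand in for CreateThread. All names (finished_count, bump_counter, count_with_atomic_add, INCREMENTS_PER_WORKER) are made up for the illustration, and each thread increments many times only to make contention likely.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_WORKERS 8
#define INCREMENTS_PER_WORKER 100000

/* Shared counter, playing the role of rvar in the first listing. */
static atomic_ullong finished_count;

static void *bump_counter(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_WORKER; i++) {
        /* Atomic read-modify-write: no increment can be lost, even when
           several threads hit the counter at the same moment. */
        atomic_fetch_add(&finished_count, 1);
    }
    return NULL;
}

/* Runs NUM_WORKERS threads against one counter and returns the final value. */
unsigned long long count_with_atomic_add(void)
{
    pthread_t t[NUM_WORKERS];

    atomic_store(&finished_count, 0);
    for (int i = 0; i < NUM_WORKERS; i++) {
        int rc = pthread_create(&t[i], NULL, bump_counter, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(t[i], NULL);
    }
    return atomic_load(&finished_count);
}
```

With the atomic increment the function returns exactly NUM_WORKERS * INCREMENTS_PER_WORKER (800000). Replacing it with a plain, non-atomic increment of an ordinary variable would be a data race, and the total may come up short.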
MREAL macros - when you need floating point arithmetic while assembling!

hutch--

Re: Parallel threads test piece.
« Reply #3 on: September 08, 2016, 11:40:30 AM »
Thanks for the comments. The polling loop in the original caller was an afterthought just to get the demo going; the real target of the test was sequentially passing data to a number of new threads as fast as possible, using the spinlock to ensure that the passed data had been copied by the thread before the call to CloseHandle() was made. There are numerous ways of catching when all threads have terminated. I would be inclined to give each thread its own variable to write to, but something somewhere would still need to poll all of those variables to know when they have all terminated.

WaitForMultipleObjects() is probably the right way to go, but it would need to be tested to see if it's laggy. I found the "pause" instruction slow when used in a spinlock, so I only used it in the caller's polling loop, where speed did not matter.
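
A minimal sketch of the start-up handshake in portable C, assuming that POSIX threads and C11 atomics are acceptable stand-ins for CreateThread and the MASM spin loop. start_block, thread_body, start_thread and handshake_demo are hypothetical names. What it illustrates is that the argument block lives on the caller's stack, so the caller must not return until the new thread has taken its copies.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical argument block, mirroring arr[] in callthread:
   the first member is the flag, the rest are the arguments. */
struct start_block {
    atomic_int copied;      /* 0 until the new thread has taken its copy */
    uint64_t   a1;
    uint64_t   a2;
    uint64_t  *result;      /* where the thread reports back */
};

static void *thread_body(void *p)
{
    struct start_block *blk = p;

    /* Copy everything to locals first ... */
    uint64_t  a1     = blk->a1;
    uint64_t  a2     = blk->a2;
    uint64_t *result = blk->result;

    /* ... then release the caller. blk must not be touched after this. */
    atomic_store_explicit(&blk->copied, 1, memory_order_release);

    *result = a1 + a2;      /* stand-in for the real work */
    return NULL;
}

static int start_thread(uint64_t a1, uint64_t a2, uint64_t *result, pthread_t *out)
{
    struct start_block blk;             /* lives on this function's stack */
    atomic_init(&blk.copied, 0);
    blk.a1 = a1;
    blk.a2 = a2;
    blk.result = result;

    int rc = pthread_create(out, NULL, thread_body, &blk);
    if (rc != 0) {
        return rc;
    }

    /* Spin until the thread signals that its copies are made. */
    while (atomic_load_explicit(&blk.copied, memory_order_acquire) == 0) {
    }
    return 0;                           /* blk may now safely go out of scope */
}

uint64_t handshake_demo(void)
{
    uint64_t result = 0;
    pthread_t t;
    int rc = start_thread(40, 2, &result, &t);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);              /* the join makes result safe to read */
    return result;
}
```

handshake_demo() returns 42 here. Without the spin in start_thread, the stack space holding blk could be reused by the caller before the new thread had read its arguments.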

TWell

  • Member
  • ****
  • Posts: 748
Re: Parallel threads test piece.
« Reply #4 on: September 08, 2016, 05:02:00 PM »
With AMD
Code:
Thread 1 started
thread 1 finished
Thread 2 started
thread 2 finished
Thread 3 started
thread 3 finished
Thread 4 started
thread 4 finished
Thread 5 started
thread 5 finished
Thread 6 started
thread 6 finished
Thread 7 started
thread 7 finished
Thread 8 started
thread 8 finished

--------------------------
thread finished count = 8
--------------------------

hutch--

Re: Parallel threads test piece.
« Reply #5 on: September 08, 2016, 07:10:22 PM »
Tim,

What is the processor and OS version ?

TWell

Re: Parallel threads test piece.
« Reply #6 on: September 08, 2016, 07:40:28 PM »
AMD Athlon II X2 220, 2 CPUs
Windows 10 Home 64-bit

sinsi

  • Member
  • *****
  • Posts: 1000
Re: Parallel threads test piece.
« Reply #7 on: September 08, 2016, 08:09:32 PM »
i7 4790 (4 cores, 8 threads), Windows 10 Pro x64
Code:
Thread 1 started
Thread 2 started
Thread 3 started
Thread 4 started
Thread 5 started
Thread 6 started
Thread 7 started
thread 6 finished
Thread 8 started
thread 6 finished
thread 7thread 7 finished
 finished
thread 7 finished
thread 3 finished
thread 2 finished
thread 8 finished

--------------------------
thread finished count = 8
--------------------------

Press any key to continue...
I can walk on water but stagger on beer.

hutch--

Re: Parallel threads test piece.
« Reply #8 on: September 10, 2016, 10:38:07 PM »
This one should do the job a bit better with respect to the warning that qWord gave about non-locked thread signalling. The method is a bit clunky, but it should be free of the risk that he mentioned: each thread now writes only to its own variable, so no two threads ever modify the same memory location. The part that I was interested in was the technique of the caller waiting in a spinlock until the arguments were written to local variables, and that has not changed. The caller's polling loop now also gives up after a 30 second timeout.

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include64\masm64rt.inc

    .code

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

entry_point proc

    LOCAL rvar1 :QWORD          ; thread return value
    LOCAL rvar2 :QWORD          ; thread return value
    LOCAL rvar3 :QWORD          ; thread return value
    LOCAL rvar4 :QWORD          ; thread return value
    LOCAL rvar5 :QWORD          ; thread return value
    LOCAL rvar6 :QWORD          ; thread return value
    LOCAL rvar7 :QWORD          ; thread return value
    LOCAL rvar8 :QWORD          ; thread return value

    LOCAL pvar1 :QWORD          ; its pointer
    LOCAL pvar2 :QWORD          ; its pointer
    LOCAL pvar3 :QWORD          ; its pointer
    LOCAL pvar4 :QWORD          ; its pointer
    LOCAL pvar5 :QWORD          ; its pointer
    LOCAL pvar6 :QWORD          ; its pointer
    LOCAL pvar7 :QWORD          ; its pointer
    LOCAL pvar8 :QWORD          ; its pointer
    LOCAL result:QWORD
    LOCAL tc    :QWORD

    mov pvar1, ptr$(rvar1)      ; load address into pointer
    mov rvar1, 0                ; set rvar1 to 0
    mov pvar2, ptr$(rvar2)      ; load address into pointer
    mov rvar2, 0                ; set rvar2 to 0
    mov pvar3, ptr$(rvar3)      ; load address into pointer
    mov rvar3, 0                ; set rvar3 to 0
    mov pvar4, ptr$(rvar4)      ; load address into pointer
    mov rvar4, 0                ; set rvar4 to 0
    mov pvar5, ptr$(rvar5)      ; load address into pointer
    mov rvar5, 0                ; set rvar5 to 0
    mov pvar6, ptr$(rvar6)      ; load address into pointer
    mov rvar6, 0                ; set rvar6 to 0
    mov pvar7, ptr$(rvar7)      ; load address into pointer
    mov rvar7, 0                ; set rvar7 to 0
    mov pvar8, ptr$(rvar8)      ; load address into pointer
    mov rvar8, 0                ; set rvar8 to 0

    fn callthread,0,1,pvar1
    fn callthread,0,2,pvar2
    fn callthread,0,3,pvar3
    fn callthread,0,4,pvar4
    fn callthread,0,5,pvar5
    fn callthread,0,6,pvar6
    fn callthread,0,7,pvar7
    fn callthread,0,8,pvar8

    mov tc, rv(GetTickCount)
    add tc, 30000               ; add 30 seconds for timeout

  ; --------------------------------------
  ; wait until all threads have terminated
  ; --------------------------------------
  @@:
    fn SleepEx,1,0              ; 1 timeslice delay
    fn GetTickCount             ; check duration for timeout
    cmp rax, tc
    jae timeout                 ; exit on timeout delay
    xor r11, r11
    add r11, rvar1              ; add all the indicators together
    add r11, rvar2
    add r11, rvar3
    add r11, rvar4
    add r11, rvar5
    add r11, rvar6
    add r11, rvar7
    add r11, rvar8
    cmp r11, 8
    jne @B                      ; terminate the loop when all threads have exited
    jmp nxt

  timeout:
    conout "Operation timed out",lf
    jmp bye

  nxt:
    mov result, r11

    conout lf,"--------------------------",lf, \
              "thread finished count = ",str$(result),lf, \
              "--------------------------",lf,lf
  bye:
    waitkey

    invoke ExitProcess,0

    ret

entry_point endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

callthread proc a1:QWORD,a2:QWORD,a3:QWORD

    LOCAL arr[8]   :QWORD
    LOCAL hThread  :QWORD
    LOCAL hProcess :QWORD

    mov hProcess, rv(GetCurrentProcess)

    invoke SetPriorityClass,hProcess,HIGH_PRIORITY_CLASS

    lea r10, arr                    ; load the array address
    mov QWORD PTR [r10],    0       ; first member is the flag to be tested set to zero
    mrm QWORD PTR [r10+8],  a1
    mrm QWORD PTR [r10+16], a2
    mrm QWORD PTR [r10+24], a3

    mov hThread, rv(CreateThread,0,0,ADDR threadproc,ADDR arr,0,0)

  ; --------------------------------------------------------
  ; spinlock ensures the arguments are written to the thread
  ; before the call to close the thread handle is made.
  ; NOTE : "pause" is not used here as it is too slow.
  ; --------------------------------------------------------
    lea rax, arr
  spinlock:                         ; loop until flag NE 0
    cmp QWORD PTR [rax], 0
    je spinlock

    conout "Thread ",str$(a2)," started",lf

    fn CloseHandle, hThread

    invoke SetPriorityClass,hProcess,NORMAL_PRIORITY_CLASS

    ret

callthread endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

threadproc proc arg:QWORD

    LOCAL arg1 :QWORD
    LOCAL arg2 :QWORD
    LOCAL arg3 :QWORD

  ; ------------------------------------------------------
  ; copy arguments to LOCAL values before setting the flag
  ; ------------------------------------------------------
  ; arg is passed in RCX

    mrm arg1, QWORD PTR [rcx+8]
    mrm arg2, QWORD PTR [rcx+16]
    mrm arg3, QWORD PTR [rcx+24]

  ; -----------------------------------------------------
  ; set the flag to be evaluated by the calling spinlock
  ; -----------------------------------------------------
    mov QWORD PTR [rcx], 1

  @@:
    ; conout str$(arg1),lf
    add arg1, 1
    cmp arg1, 1000000000
    jbe @B

    conout "thread ",str$(arg2)," finished",lf

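  ; -----------------------------------------------------------
  ; mark this thread as finished in its own indicator variable
  ; (each variable has a single writer, so no lock is needed)
  ; -----------------------------------------------------------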
    mov rax, arg3
    add QWORD PTR [rax], 1

    ret

threadproc endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    end
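
The per-thread indicator scheme in the listing above can also be sketched in portable C. This is an illustration under assumptions, not a translation: POSIX threads replace CreateThread, sched_yield replaces the SleepEx call, the standard time function replaces GetTickCount (so the timeout is in whole seconds), and done_flag, flag_worker, poll_flags and run_with_flags are invented names.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <time.h>

#define NUM_WORKERS 8

/* One indicator per thread; each slot has exactly one writer. */
static atomic_int done_flag[NUM_WORKERS];

static void *flag_worker(void *arg)
{
    atomic_int *my_flag = arg;
    /* ... the real work would go here ... */
    atomic_store(my_flag, 1);           /* report completion in our own slot */
    return NULL;
}

/* Poll until every flag is set, or give up after timeout_seconds.
   Returns the number of set flags on success, -1 on timeout. */
static int poll_flags(int timeout_seconds)
{
    time_t deadline = time(NULL) + timeout_seconds;

    for (;;) {
        int sum = 0;
        for (int i = 0; i < NUM_WORKERS; i++) {
            sum += atomic_load(&done_flag[i]);   /* add the indicators together */
        }
        if (sum == NUM_WORKERS) {
            return sum;
        }
        if (time(NULL) >= deadline) {
            return -1;
        }
        sched_yield();                  /* give the workers a chance to run */
    }
}

int run_with_flags(void)
{
    pthread_t t[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++) {
        atomic_store(&done_flag[i], 0);
        int rc = pthread_create(&t[i], NULL, flag_worker, &done_flag[i]);
        assert(rc == 0);
        (void)rc;
    }

    int set = poll_flags(30);           /* 30 second timeout, as in the listing */

    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(t[i], NULL);
    }
    return set;
}
```

run_with_flags() returns 8 when all workers have reported in before the deadline. The cost of this design is one variable per thread and a polling loop that has to read all of them on every pass.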

TWell

Re: Parallel threads test piece.
« Reply #9 on: September 10, 2016, 11:35:37 PM »
AMD Athlon II X2 220, 2 CPUs, Windows 10 Home 64-bit
Code:
Thread 1 started
thread 1 finished
Thread 2 started
thread 2 finished
Thread 3 started
thread 3 finished
Thread 4 started
thread 4 finished
Thread 5 started
thread 5 finished
Thread 6 started
thread 6 finished
Thread 7 started
thread 7 finished
Thread 8 started
thread 8 finished

--------------------------
thread finished count = 8
--------------------------

sinsi

Re: Parallel threads test piece.
« Reply #10 on: September 10, 2016, 11:56:31 PM »
Code:
Thread 1 started
Thread 2 started
Thread 3 started
Thread 4 started
Thread 5 started
Thread 6 started
Thread 7 started
thread 6 finished
Thread 8 started
thread 4 finished
thread 2 finished
thread 3 finished
thread 5 finished
thread 7 finished
thread 1 finished
thread 8 finished

--------------------------
thread finished count = 8
--------------------------

Press any key to continue...

qWord

Re: Parallel threads test piece.
« Reply #11 on: September 11, 2016, 12:35:33 AM »
hutch,
you can keep a single counter variable if you use the lock prefix on the read-modify-write access (in your initial post):

  ; -------------------------------------------
  ; increment the initial thread closed counter
  ; -------------------------------------------
    mov rax, arg3
    lock add QWORD PTR [rax], 1