I have put together a test piece that starts 8 threads as close to simultaneously as possible. Each thread runs an arbitrary 1 billion iterations of a loop to provide a suitable delay while the caller polls in a loop, waiting for all 8 threads to terminate. The thread count better suits a quad core processor, but the general idea is to perform parallel processing of whatever task the author has in mind to get the speed up. As each thread starts, a spinlock in the caller waits until the thread has copied its arguments to local variables before the call to CloseHandle() is made. The threads run to completion and then signal back to the caller that they have finished.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc

    LOCAL rvar :QWORD                   ; thread return value
    LOCAL pvar :QWORD                   ; its pointer

    mov pvar, ptr$(rvar)                ; load address into pointer
    mov rvar, 0                         ; set rval to 0

    fn callthread,0,1,pvar
    fn callthread,0,2,pvar
    fn callthread,0,3,pvar
    fn callthread,0,4,pvar
    fn callthread,0,5,pvar
    fn callthread,0,6,pvar
    fn callthread,0,7,pvar
    fn callthread,0,8,pvar

  ; --------------------------------------
  ; wait until all threads have terminated
  ; --------------------------------------
  @@:
    pause
    cmp rvar, 8
    jne @B

    conout lf,"--------------------------",lf, \
           "thread finished count = ",str$(rvar),lf, \
           "--------------------------",lf,lf

    waitkey
    invoke ExitProcess,0

    ret

entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
callthread proc a1:QWORD,a2:QWORD,a3:QWORD

    LOCAL arr[8]   :QWORD
    LOCAL hThread  :QWORD
    LOCAL hProcess :QWORD

    mov hProcess, rv(GetCurrentProcess)
    invoke SetPriorityClass,hProcess,HIGH_PRIORITY_CLASS

    lea r10, arr                        ; load the array address
    mov QWORD PTR [r10], 0              ; first member is the flag to be tested set to zero
    mrm QWORD PTR [r10+8], a1
    mrm QWORD PTR [r10+16], a2
    mrm QWORD PTR [r10+24], a3

    mov hThread, rv(CreateThread,0,0,ADDR threadproc,ADDR arr,0,0)

  ; --------------------------------------------------------
  ; spinlock ensures the arguments are written to the thread
  ; before the call to close the thread handle is made.
  ; NOTE : "pause" is not used here as it is too slow.
  ; --------------------------------------------------------
    lea rax, arr
  spinlock:                             ; loop until flag NE 0
    cmp QWORD PTR [rax], 0
    je spinlock

    conout "Thread ",str$(a2)," started",lf

    fn CloseHandle, hThread
    invoke SetPriorityClass,hProcess,NORMAL_PRIORITY_CLASS

    ret

callthread endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
threadproc proc arg:QWORD

    LOCAL arg1 :QWORD
    LOCAL arg2 :QWORD
    LOCAL arg3 :QWORD

  ; ------------------------------------------------------
  ; copy arguments to LOCAL values before setting the flag
  ; ------------------------------------------------------
  ; arg is passed in RCX
    mrm arg1, QWORD PTR [rcx+8]
    mrm arg2, QWORD PTR [rcx+16]
    mrm arg3, QWORD PTR [rcx+24]

  ; -----------------------------------------------------
  ; set the flag to be evaluated by the calling spinlock
  ; -----------------------------------------------------
    mov QWORD PTR [rcx], 1

  @@:
    ; conout str$(arg1),lf
    add arg1, 1
    cmp arg1, 1000000000
    jbe @B

    conout "thread ",str$(arg2)," finished",lf

  ; -------------------------------------------
  ; increment the initial thread closed counter
  ; -------------------------------------------
    mov rax, arg3
    add QWORD PTR [rax], 1

    ret

threadproc endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
The results look like this. The first 8 lines appear quickly, then each thread is reported at the console as it terminates. It also shows that the Win10 thread scheduler is a bit all over the place, and the finishing order varies each time it's run.
Thread 1 started
Thread 2 started
Thread 3 started
Thread 4 started
Thread 5 started
Thread 6 started
Thread 7 started
Thread 8 started
thread 2 finished
thread 3 finished
thread 5 finished
thread 7 finished
thread 4 finished
thread 1 finished
thread 6 finished
thread 8 finished
--------------------------
thread finished count = 8
--------------------------
Press any key to continue...
IMHO it's better to use WaitForMultipleObjects than the counter rvar. It's fast; at least, if the threads are doing any real work, the overhead is entirely negligible. Plus it's cleaner and more readable. Of course, maybe you know something about it that I don't; I wouldn't be surprised.
BTW the comment in entry_point, 4th line, should be "; set rvar to 0"
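For anyone who wants to try the WaitForMultipleObjects route suggested above, here is a rough, untested sketch of how the caller might look. It assumes a hypothetical callthread2, a variant of callthread that returns the thread handle in RAX instead of calling CloseHandle() on it, and it assumes TRUE and INFINITE are defined by the include file. The names callthread2, hArr, idx, tnum, hTmp, mkthread and clshandle are made up for the illustration. threadproc is left unchanged, so rvar is still passed in but is no longer used for waiting.

```asm
; Sketch only: callthread2 is a hypothetical variant of callthread that
; returns the new thread handle in RAX and does NOT call CloseHandle.
entry_point proc

    LOCAL hArr[8] :QWORD                ; one thread handle per slot
    LOCAL idx     :QWORD                ; array index 0..7
    LOCAL tnum    :QWORD                ; thread number 1..8
    LOCAL hTmp    :QWORD                ; scratch copy of one handle
    LOCAL rvar    :QWORD                ; still written by threadproc
    LOCAL pvar    :QWORD                ; its pointer

    mov pvar, ptr$(rvar)
    mov rvar, 0
    mov idx, 0

  mkthread:
    mov rax, idx
    add rax, 1
    mov tnum, rax                       ; thread number = index + 1
    fn callthread2,0,tnum,pvar          ; handle comes back in RAX
    lea r10, hArr
    mov r11, idx
    mov QWORD PTR [r10+r11*8], rax      ; store the handle in the array
    add idx, 1
    cmp idx, 8
    jb mkthread

  ; block until all 8 thread handles are signalled (threads have exited)
    invoke WaitForMultipleObjects,8,ADDR hArr,TRUE,INFINITE

    mov idx, 0
  clshandle:
    lea r10, hArr
    mov r11, idx
    mrm hTmp, QWORD PTR [r10+r11*8]     ; copy the handle to a local
    fn CloseHandle, hTmp                ; handles are closed after the wait
    add idx, 1
    cmp idx, 8
    jb clshandle

    conout "all threads finished",lf
    waitkey
    invoke ExitProcess,0
    ret

entry_point endp
```

The handles have to stay open until the wait returns, which is why the close is moved out of the thread-starting procedure. A millisecond value in place of INFINITE would give a timeout.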
Quote from: hutch-- on September 08, 2016, 01:21:07 AM
; -------------------------------------------
; increment the initial thread closed counter
; -------------------------------------------
mov rax, arg3
add QWORD PTR [rax], 1
The lock prefix must be used here; otherwise the waiting loop in entry_point might loop forever because of one or more "lost" increment operations (e.g. consider the case where two threads/cores start the read for ADD mem,1 simultaneously).
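To make the "lost" increment concrete, here is one possible interleaving written out as comments, followed by the locked form of the instruction. The starting value of 3 and the core names are just an example.

```asm
; Without LOCK, ADD on memory is a separate read, add and write,
; so two cores can interleave like this (counter starts at 3):
;
;   core A reads  [rax]  -> 3
;   core B reads  [rax]  -> 3
;   core A writes [rax]  <- 4
;   core B writes [rax]  <- 4       ; should be 5, one increment is lost
;
; If one increment is lost the counter ends at 7, and the
; "cmp rvar, 8 / jne @B" loop in entry_point never exits.
; With the LOCK prefix the read-modify-write is atomic:

    mov rax, arg3
    lock add QWORD PTR [rax], 1
```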
Thanks for the comments. The polling loop in the original caller was an afterthought just to get the demo going. The real target of the test was sequentially passing data to a number of new threads as fast as possible, using the spinlock to ensure that the passed data had been written before the call to CloseHandle() was made. There are numerous ways of catching when all threads have terminated, and I would be inclined to give each thread its own variable to write to, but something somewhere would still need to poll all of those variables to know when they have all terminated.
WaitForMultipleObjects() is probably the right way to go, but it would need to be tested to see if it's laggy. I found the "pause" instruction slow when used in a spinlock, so I only used it in the caller's polling loop where speed did not matter.
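For comparison, the spinlock in callthread with "pause" added would look something like the sketch below, so it can be timed against the bare loop. The label names are made up. PAUSE is documented as a hint for spin-wait loops, but its delay differs between processor generations, so how much it costs here is something to measure on the target machine.

```asm
    lea rax, arr
  spinwait:                             ; hypothetical label name
    cmp QWORD PTR [rax], 0              ; has the thread set the flag ?
    jne spindone                        ; yes, arguments have been copied
    pause                               ; spin-wait hint to the processor
    jmp spinwait
  spindone:
```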
With AMD
Thread 1 started
thread 1 finished
Thread 2 started
thread 2 finished
Thread 3 started
thread 3 finished
Thread 4 started
thread 4 finished
Thread 5 started
thread 5 finished
Thread 6 started
thread 6 finished
Thread 7 started
thread 7 finished
Thread 8 started
thread 8 finished
--------------------------
thread finished count = 8
--------------------------
Tim,
What are the processor and OS version?
AMD Athlon II X2 220, 2 CPU's
Windows 10 Home 64-bit
i7 4790 (4 cores, 8 threads), Windows 10 Pro x64
Thread 1 started
Thread 2 started
Thread 3 started
Thread 4 started
Thread 5 started
Thread 6 started
Thread 7 started
thread 6 finished
Thread 8 started
thread 6 finished
thread 7thread 7 finished
finished
thread 7 finished
thread 3 finished
thread 2 finished
thread 8 finished
--------------------------
thread finished count = 8
--------------------------
Press any key to continue...
This one should do the job a bit better with respect to the warning that qWord gave about non-locked thread signalling. The method is a bit clunky, but it should be free of the risk that he mentioned. The part that I was interested in was the technique of the caller waiting in a spinlock until the arguments were written to local variables, and that has not changed. Since each thread now writes only to its own variable, polling the 8 separate variables should be free of the potential problem.
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include64\masm64rt.inc
.code
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
entry_point proc

    LOCAL rvar1 :QWORD                  ; thread return value
    LOCAL rvar2 :QWORD                  ; thread return value
    LOCAL rvar3 :QWORD                  ; thread return value
    LOCAL rvar4 :QWORD                  ; thread return value
    LOCAL rvar5 :QWORD                  ; thread return value
    LOCAL rvar6 :QWORD                  ; thread return value
    LOCAL rvar7 :QWORD                  ; thread return value
    LOCAL rvar8 :QWORD                  ; thread return value

    LOCAL pvar1 :QWORD                  ; its pointer
    LOCAL pvar2 :QWORD                  ; its pointer
    LOCAL pvar3 :QWORD                  ; its pointer
    LOCAL pvar4 :QWORD                  ; its pointer
    LOCAL pvar5 :QWORD                  ; its pointer
    LOCAL pvar6 :QWORD                  ; its pointer
    LOCAL pvar7 :QWORD                  ; its pointer
    LOCAL pvar8 :QWORD                  ; its pointer

    LOCAL result:QWORD
    LOCAL tc    :QWORD

    mov pvar1, ptr$(rvar1)              ; load address into pointer
    mov rvar1, 0                        ; set rvar1 to 0
    mov pvar2, ptr$(rvar2)              ; load address into pointer
    mov rvar2, 0                        ; set rvar2 to 0
    mov pvar3, ptr$(rvar3)              ; load address into pointer
    mov rvar3, 0                        ; set rvar3 to 0
    mov pvar4, ptr$(rvar4)              ; load address into pointer
    mov rvar4, 0                        ; set rvar4 to 0
    mov pvar5, ptr$(rvar5)              ; load address into pointer
    mov rvar5, 0                        ; set rvar5 to 0
    mov pvar6, ptr$(rvar6)              ; load address into pointer
    mov rvar6, 0                        ; set rvar6 to 0
    mov pvar7, ptr$(rvar7)              ; load address into pointer
    mov rvar7, 0                        ; set rvar7 to 0
    mov pvar8, ptr$(rvar8)              ; load address into pointer
    mov rvar8, 0                        ; set rvar8 to 0

    fn callthread,0,1,pvar1
    fn callthread,0,2,pvar2
    fn callthread,0,3,pvar3
    fn callthread,0,4,pvar4
    fn callthread,0,5,pvar5
    fn callthread,0,6,pvar6
    fn callthread,0,7,pvar7
    fn callthread,0,8,pvar8

    mov tc, rv(GetTickCount)
    add tc, 30000                       ; add 30 seconds for timeout

  ; --------------------------------------
  ; wait until all threads have terminated
  ; --------------------------------------
  @@:
    fn SleepEx,1,0                      ; 1 timeslice delay
    fn GetTickCount                     ; check duration for timeout
    cmp rax, tc
    jae timeout                         ; exit on timeout delay

    xor r11, r11
    add r11, rvar1                      ; add all the indicators together
    add r11, rvar2
    add r11, rvar3
    add r11, rvar4
    add r11, rvar5
    add r11, rvar6
    add r11, rvar7
    add r11, rvar8
    cmp r11, 8
    jne @B                              ; terminate the loop when all threads have exited
    jmp nxt

  timeout:
    conout "Operation timed out",lf
    jmp bye

  nxt:
    mov result, r11
    conout lf,"--------------------------",lf, \
           "thread finished count = ",str$(result),lf, \
           "--------------------------",lf,lf

  bye:
    waitkey
    invoke ExitProcess,0

    ret

entry_point endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
callthread proc a1:QWORD,a2:QWORD,a3:QWORD

    LOCAL arr[8]   :QWORD
    LOCAL hThread  :QWORD
    LOCAL hProcess :QWORD

    mov hProcess, rv(GetCurrentProcess)
    invoke SetPriorityClass,hProcess,HIGH_PRIORITY_CLASS

    lea r10, arr                        ; load the array address
    mov QWORD PTR [r10], 0              ; first member is the flag to be tested set to zero
    mrm QWORD PTR [r10+8], a1
    mrm QWORD PTR [r10+16], a2
    mrm QWORD PTR [r10+24], a3

    mov hThread, rv(CreateThread,0,0,ADDR threadproc,ADDR arr,0,0)

  ; --------------------------------------------------------
  ; spinlock ensures the arguments are written to the thread
  ; before the call to close the thread handle is made.
  ; NOTE : "pause" is not used here as it is too slow.
  ; --------------------------------------------------------
    lea rax, arr
  spinlock:                             ; loop until flag NE 0
    cmp QWORD PTR [rax], 0
    je spinlock

    conout "Thread ",str$(a2)," started",lf

    fn CloseHandle, hThread
    invoke SetPriorityClass,hProcess,NORMAL_PRIORITY_CLASS

    ret

callthread endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
threadproc proc arg:QWORD

    LOCAL arg1 :QWORD
    LOCAL arg2 :QWORD
    LOCAL arg3 :QWORD

  ; ------------------------------------------------------
  ; copy arguments to LOCAL values before setting the flag
  ; ------------------------------------------------------
  ; arg is passed in RCX
    mrm arg1, QWORD PTR [rcx+8]
    mrm arg2, QWORD PTR [rcx+16]
    mrm arg3, QWORD PTR [rcx+24]

  ; -----------------------------------------------------
  ; set the flag to be evaluated by the calling spinlock
  ; -----------------------------------------------------
    mov QWORD PTR [rcx], 1

  @@:
    ; conout str$(arg1),lf
    add arg1, 1
    cmp arg1, 1000000000
    jbe @B

    conout "thread ",str$(arg2)," finished",lf

  ; -------------------------------------------
  ; increment the initial thread closed counter
  ; -------------------------------------------
    mov rax, arg3
    add QWORD PTR [rax], 1

    ret

threadproc endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end
AMD Athlon II X2 220, 2 CPU's, Windows 10 Home 64-bit
Thread 1 started
thread 1 finished
Thread 2 started
thread 2 finished
Thread 3 started
thread 3 finished
Thread 4 started
thread 4 finished
Thread 5 started
thread 5 finished
Thread 6 started
thread 6 finished
Thread 7 started
thread 7 finished
Thread 8 started
thread 8 finished
--------------------------
thread finished count = 8
--------------------------
Thread 1 started
Thread 2 started
Thread 3 started
Thread 4 started
Thread 5 started
Thread 6 started
Thread 7 started
thread 6 finished
Thread 8 started
thread 4 finished
thread 2 finished
thread 3 finished
thread 5 finished
thread 7 finished
thread 1 finished
thread 8 finished
--------------------------
thread finished count = 8
--------------------------
Press any key to continue...
hutch,
you can use a single counter variable if you use the lock prefix for the read-modify-write access (in your initial post):
; -------------------------------------------
; increment the initial thread closed counter
; -------------------------------------------
mov rax, arg3
lock add QWORD PTR [rax], 1