If you are spoiled by a 4.5 GHz CPU on a PC, you don't notice any lag or problems with your code.
But if you program for a 1.6+ GHz console, I guess the point is to get the most out of SIMT first by distributing the work across many cores, and maybe to convert performance-critical sections from scalar code to SIMD.
It must also be an advantage that the code then runs more easily on Atom laptops.
Also, one SIMT question:
So I get a separate stack space for each worker thread? Then I could keep a thread running inside a PROC and do stack tricks there without affecting the main thread's stack?
Magnus, this is why you write efficient assembler: minimum instruction counts and well-designed algorithms. Writing code for old machines is an art form and, apart from some hardware differences, you generally get better code.
Quote from: hutch-- on January 19, 2021, 01:28:00 AM
Magnus, this is why you write efficient assembler: minimum instruction counts and well-designed algorithms. Writing code for old machines is an art form and, apart from some hardware differences, you generally get better code.
Yes, I have already tried to do a lot of SIMD before. It would be good to get some suggested SIMT exercises for 2-4 cores.
Split execution between several threads and use some timer to synchronize the threads, or use the more low-level LOCK prefix?
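One small exercise along these lines, sketched in Python rather than MASM: several threads add to one shared counter, and a lock makes each add behave like a single atomic read-modify-write, which is roughly what a `lock`-prefixed `inc` or `xadd` on a shared memory location would give you in assembler. The function name and parameters are made up for illustration, and the threads are synchronized by joining them rather than by a timer.

```python
import threading

def count_with_lock(num_threads=4, increments=10_000):
    """Each thread adds to one shared counter; the lock makes each add atomic."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:          # stand-in for an atomic read-modify-write
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # wait for completion instead of polling a timer
    return counter

total = count_with_lock()
print(total)
```

The total should always equal threads times increments; a useful follow-up exercise is to remove the lock and see whether the count still comes out right.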
here's a rough outline of a program i wrote in Cobol nearly half a century ago,
primarily to merge a lot of files [ originally on the ICL 1900 and a few years later on the IBM 360/158 ]
maybe you'd like to try this?
spec for a sort "but largely merge" of data
assume the data is the key in this case, ie chop the data up into dword-sized pieces
if you're doing 32 bit, or qword if you're trying out x64
the strategy in outline is:
sort small batches of data
when you get enough, release them to a merge
when you get enough batches of merged data, release them to subsequent merges
divide the data up roughly into power-of-2 sized chunks [ ie nearest power of 2 ]
probably best to stick to a power-of-2 amount of data to start with, to avoid any messy end processing,
though it's not at all difficult
.................................................................
loop a) until data exhausted
a)
in a new thread sort a small "power of 2" chunk of this data
eg 64 or 128 items
if the no of sorted batches reaches a convenient number, eg a power of 2 ... say 16, 32 ...
release them to the merge in b)
otherwise
go to a)
......................
loop b) on return of a small "power of 2" number, say 16 or 32, of batches of sorted data
b)
in a new thread
keep taking the lowest key across the sorted batches of data and moving it to an "outfile"
as an example this can be done roughly as follows
i'm going to assume you've used [16] batches of data, to avoid any confusion with stage a) the sort, and
128 data items in each sort
initially
b0) compare the [16] lowest keys, one from each of the batches, so you know which is the lowest
b1) remove the lowest and place it in the "outfile"
take the next key from the batch you just removed the item from
is it lower than the new lowest key?
if it is [ this is quite likely with general user data ]
go to b1)
if it isn't [ this is quite likely with random data ]
place the new key in sequence in the lookup
go to b0)
c) you now have an "outfile" of 16 * 128 = 2048 sorted items
in a new thread
take all the lots of 2048 items, eg [16] of them,
and do b)
repeat stages b) and c) until all the data is sorted
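Stages a) to c) above can be sketched roughly as follows, in Python rather than assembler. This is only one reading of the outline: the names `merge_batches`, `sort_merge`, `CHUNK` and `FAN_IN` are made up for illustration, a thread pool stands in for "a new thread" per batch, and the "lookup" is kept as a small sorted list of the current lowest key from each batch. In standard CPython the threads will not run this CPU-bound work in parallel, so the sketch shows the structure, not a speedup.

```python
import bisect
import random
from concurrent.futures import ThreadPoolExecutor

CHUNK = 128    # items per sorted batch (stage a)
FAN_IN = 16    # batches per merge (stages b and c)

def merge_batches(batches):
    """Stage b: merge sorted batches using a sorted lookup of their current heads."""
    # b0) the lookup holds (key, batch index, position) in ascending key order
    lookup = sorted((b[0], i, 0) for i, b in enumerate(batches) if b)
    out = []
    while lookup:
        key, i, pos = lookup.pop(0)          # b1) remove the lowest key
        out.append(key)
        pos += 1
        batch = batches[i]
        # keep draining the same batch while its next key is still the lowest
        while pos < len(batch) and (not lookup or batch[pos] <= lookup[0][0]):
            out.append(batch[pos])
            pos += 1
        if pos < len(batch):
            # otherwise place the new key in sequence in the lookup
            bisect.insort(lookup, (batch[pos], i, pos))
    return out

def sort_merge(data, chunk=CHUNK, fan_in=FAN_IN, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # stage a) sort small chunks concurrently
        chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
        runs = list(pool.map(sorted, chunks))
        # stages b) and c) merge fan_in runs at a time until one run is left
        while len(runs) > 1:
            groups = [runs[i:i + fan_in] for i in range(0, len(runs), fan_in)]
            runs = list(pool.map(merge_batches, groups))
    return runs[0] if runs else []

data = [random.randrange(1 << 32) for _ in range(16 * 16 * 128)]
result = sort_merge(data)
print(result == sorted(data))
```

Changing `CHUNK`, `FAN_IN` and `workers` is the obvious knob to turn when comparing batch counts and sizes.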
the machines then were not multiple core but did use virtual paging [ well the IBM did, i can't remember whether the ICL
did, but it was generally a superior machine to the American offerings ]
quite a lot of data went through the IBM version, merging 8 files of user related data, and it was quite quick
but i'd be interested to see how the threading affects the performance
as there's the opportunity to process much of the sort merge concurrently
i can't remember what the limit to the number of threads is on x32 [64 ????]
also
you can vary the number and size of the batches and see how this impacts the speed, because you've now got
a thread control overhead implicit in the M$ software and its co-ordination, as well as the sort and merge,
both of which provide different challenges
i hope you like this Magnus, and anyone else who is interested in trying it
obviously if you're merging a lot of files and they're already in a convenient similar sequence, omit stage a)
regards mike b
thanks Mike :thumbsup:
After benchmarking PeekMessage (millions/second) vs a worker thread (billions/second),
I decided the WndProc should handle a minimum of messages: mouse move and mouse button messages just store coordinates and flags in global variables, and keyboard messages are handled the same way.
Maybe also include some code detecting what the mouse points at or has clicked on.
And do most of the work in worker threads.
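A rough sketch of that split, in Python for brevity: the message handler only records the latest input in shared state, and a worker thread takes a snapshot and does the hit testing. Everything here is hypothetical (the message codes, `InputState`, `wnd_proc`, `hit_test`); a real WndProc would be handling messages such as WM_MOUSEMOVE instead.

```python
import threading
from dataclasses import dataclass

@dataclass
class InputState:
    x: int = 0
    y: int = 0
    buttons: int = 0
    last_key: int = 0

state = InputState()            # plays the role of the global variables
state_lock = threading.Lock()

# Hypothetical message codes, for illustration only
MSG_MOUSEMOVE, MSG_BUTTON, MSG_KEY = 1, 2, 3

def wnd_proc(msg, a=0, b=0, flags=0):
    """Minimal handler: record the latest input and return immediately."""
    with state_lock:
        if msg == MSG_MOUSEMOVE:
            state.x, state.y = a, b
        elif msg == MSG_BUTTON:
            state.x, state.y, state.buttons = a, b, flags
        elif msg == MSG_KEY:
            state.last_key = a

def snapshot():
    """Copy the shared state so the worker can use it without holding the lock."""
    with state_lock:
        return InputState(state.x, state.y, state.buttons, state.last_key)

def hit_test(snap, rect):
    left, top, right, bottom = rect
    return left <= snap.x < right and top <= snap.y < bottom

def worker(rect, results):
    # The heavier work (here just a hit test) happens off the message path
    snap = snapshot()
    results.append(hit_test(snap, rect) and snap.buttons != 0)

wnd_proc(MSG_MOUSEMOVE, 10, 20)
wnd_proc(MSG_BUTTON, 15, 25, flags=1)
results = []
t = threading.Thread(target=worker, args=((0, 0, 100, 100), results))
t.start()
t.join()
print(results)
```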
https://www.drdobbs.com/cpp/ccli-threading-part-i/184402018
https://www.drdobbs.com/cpp/ccli-threading-part-ii/184402029
How do I do try/catch/finally in MASM?
I suppose that is 'exception handling'.
http://masm32.com/board/index.php?topic=6614.0
A debugger will receive two notifications before your program is terminated: on the first one you need to deal with the supposed error; after the second, your program is terminated.
To test it you can do a division by zero.
The link below has a document that talks a bit about exception handling (swconventions).
http://masm32.com/board/index.php?topic=5455.0
Quote from: mineiro on January 26, 2021, 04:29:15 AM
I suppose that is 'exception handling'.
http://masm32.com/board/index.php?topic=6614.0
A debugger will receive two notifications before your program is terminated: on the first one you need to deal with the supposed error; after the second, your program is terminated.
To test it you can do a division by zero.
The link below has a document that talks a bit about exception handling (swconventions).
http://masm32.com/board/index.php?topic=5455.0
thanks mineiro :thumbsup:
GP faults are very common and would be nice to catch, especially when I have several threads that can cause bugs or GP faults, so that a try/catch block can show a message box or something similar to show which thread is causing the problem.
On 32-bit, it says the OS automatically allocates stack space for a thread if you don't specify a size. Does it work the same way with the 64-bit shadow space? And how much?
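As an illustration of the "which thread is causing the problem" idea only, here is a sketch in Python. It is not Windows SEH: Python exceptions are a language feature, so this shows the shape of a per-thread catch, not how to install a handler in MASM. The names `guarded`, `failures` and `divide` are made up, and the fault is a division by zero, as suggested earlier in the thread.

```python
import threading

failures = []                       # (thread name, exception) pairs
failures_lock = threading.Lock()

def guarded(fn):
    """Wrap a thread entry point so a fault is caught and attributed to its thread."""
    def entry(*args):
        try:
            fn(*args)
        except Exception as exc:    # the "catch" part
            with failures_lock:
                failures.append((threading.current_thread().name, exc))
    return entry

def divide(a, b):
    return a // b                   # raises ZeroDivisionError when b == 0

threads = [
    threading.Thread(target=guarded(divide), args=(10, 2), name="worker-ok"),
    threading.Thread(target=guarded(divide), args=(10, 0), name="worker-bad"),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

for name, exc in failures:
    print(f"{name} failed: {exc!r}")   # a message box could go here instead
```

Wrapping the entry point of every thread means the report always carries the name of the thread that faulted, and the other threads keep running.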
Hello sir daydreamer,
I don't know the answer.
I only played with that so as not to get bored, but after some tries I ended up more bored.
Maybe Mark Russinovich's book has an answer.
Something to read:
http://bytepointer.com/resources/pietrek_crash_course_depths_of_win32_seh.htm
https://www.osronline.com/article.cfm%5earticle=469.htm
https://www.codeproject.com/Articles/1212332/bit-Structured-Exception-Handling-SEH-in-ASM
http://www.rohitab.com/structured-exception-handling-in-assembly-language
BTW:
TIOBE Index for January 2021 (https://www.tiobe.com/tiobe-index/)
Quote from: TimoVJL on January 26, 2021, 03:04:06 PM
BTW:
TIOBE Index for January 2021 (https://www.tiobe.com/tiobe-index/)
Thanks mineiro and TimoVJL.
Assembler has risen from 15th to 11th place :greenclp: :thumbsup:
@Hutch, more MASM videos and we will soon reach #1 :greenclp: :thumbsup:
Now I have found some exercises and will try the producer/consumer way,
and learn which algorithms are most suitable for parallel execution and which are less so.
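For reference, a minimal producer/consumer sketch in Python, assuming a bounded queue between one producer and a couple of consumer threads. The names `run`, `producer`, `consumer` and `SENTINEL` are made up, and squaring a number is just a placeholder for real work.

```python
import queue
import threading

SENTINEL = None    # tells a consumer there is no more work

def producer(q, items, num_consumers):
    for item in items:
        q.put(item)                  # blocks while the queue is full
    for _ in range(num_consumers):
        q.put(SENTINEL)              # one stop marker per consumer

def consumer(q, results, lock):
    while True:
        item = q.get()               # blocks while the queue is empty
        if item is SENTINEL:
            break
        value = item * item          # placeholder for the real work
        with lock:
            results.append(value)

def run(items, num_consumers=2, capacity=8):
    q = queue.Queue(maxsize=capacity)
    results, lock = [], threading.Lock()
    consumers = [threading.Thread(target=consumer, args=(q, results, lock))
                 for _ in range(num_consumers)]
    for t in consumers:
        t.start()
    producer(q, items, num_consumers)    # the producer runs on the calling thread
    for t in consumers:
        t.join()
    return sorted(results)               # arrival order is not deterministic

print(run(range(10)))
```

The bounded queue is the design point: it stops a fast producer from running arbitrarily far ahead of slow consumers.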