console and low freq cpu programming?

daydreamer · January 18, 2021, 07:21:00 PM

If you are spoiled by a 4.5 GHz CPU on a PC, you don't notice any lag or problems with your code.
But if you program for a 1.6+ GHz console, I guess the goal is to get the most out of multithreading first, distributing the work between many cores; maybe the critical sections are then converted from scalar code to SIMD.

It must also be an advantage that such code runs more easily on Atom laptops.

Also one multithreading question:
do I get a separate stack space for each worker thread? Then I could keep it running in a PROC, doing stack tricks there, without affecting the main thread's stack.

hutch-- · January 19, 2021, 01:28:00 AM

Magnus, this is why you write efficient assembler: minimum instruction counts and well-designed algorithms. Writing code for old machines is an art form and, apart from some hardware differences, you generally get better code.

daydreamer · January 20, 2021, 04:57:46 AM

Quote from: hutch-- on January 19, 2021, 01:28:00 AM
Magnus, this is why you write efficient assembler: minimum instruction counts and well-designed algorithms. Writing code for old machines is an art form and, apart from some hardware differences, you generally get better code.

Yes, I already try to do lots of SIMD; it would be good to get some suggested multithreading exercises for 2-4 cores.
Split execution between several threads and use some timer to synchronize them, or the more low-level LOCK prefix?
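
One possible starter exercise for 2-4 cores, as a minimal sketch: split an array between worker threads, let each thread write only to its own result slot, and synchronize by joining the threads instead of using a timer. The names `parallel_sum`, `partial` and `work` are made up for this illustration. Python is used only to show the structure; in CPython the interpreter lock means these threads will not actually run the sums in parallel, so treat it as a layout to port to assembler, not as a benchmark.

```python
import threading

def parallel_sum(values, workers=4):
    """Sum `values` by giving each worker thread its own slice and result slot."""
    partial = [0] * workers                        # one slot per worker, no shared counter
    step = (len(values) + workers - 1) // workers  # slice length, rounded up

    def work(index):
        chunk = values[index * step:(index + 1) * step]
        partial[index] = sum(chunk)                # each thread writes only its own slot

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                                   # synchronize by waiting, not by a timer
    return sum(partial)
```

Because no two threads touch the same memory location, nothing here needs an atomic operation. A LOCK-prefixed instruction only becomes necessary once several threads update one shared variable, for example a common counter.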

mikeburr · January 21, 2021, 02:04:23 AM

Here's a rough outline of a program I wrote in COBOL nearly half a century ago,
primarily to merge a lot of files [originally ICL 1900 and a few years later IBM 360/158].
Maybe you'd like to try this?

Spec for a sort, "but largely merge", of data.
Assume the data is the key in this case, i.e. chop the data up into dword-length pieces
if you're doing 32-bit, or qwords if you're trying out x64.
The strategy in outline is:

sort small batches of data
when you get enough release them to a merge
when you get enough batches of merged data release them to subsequent merges

Divide the data into chunks whose size is a power of 2 [i.e. round to the nearest power of 2].
It's probably best to start with a power-of-2 amount of data to avoid any messy end processing,
though that's not at all difficult.
.................................................................

loop a) until the data is exhausted
a)
in a new thread sort a small "power of 2" chunk of this data
e.g. 64 or 128 items
if the number of batches reaches a convenient number, e.g. a power of 2, say 16 or 32,
release them to the merge in b)
otherwise
go to a)

......................

loop b) on return of a small "power of 2" number, say 16 or 32, of batches of sorted data

b)
in a new thread
keep removing the lowest key across the sorted batches of data and writing it to an "outfile"
as an example, this can be done roughly as follows
I'm going to assume you've used [16] batches of data, bracketed to avoid any confusion with the stage a) sort, and
128 data items in each sort

initially
b0) compare the [16] lowest keys from each of the batches so you know which is the lowest

b1) remove the lowest and place it in the "outfile"
take the next from the batch you just removed the item from
is it lower than the new lowest key?
if it is [this is quite likely with general user data]
go to b1)
if it isn't [this is quite likely with random data]
place the new key in sequence in the look up
go to b0)

c) you now have an "outfile" of 16 * 128 sorted items = 2048
in a new thread
take all the lots of 2048 items eg [16] of them
and do b)

repeat stages b) and c) until all the data is sorted
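
The outline above can be sketched roughly as follows. This is one reading of stages a) to c), not the original COBOL: the "look up" of lowest keys is kept as a heap, which is a substitution of mine, and the names `merge_batches`, `sort_merge`, `chunk` and `fan_in` are made up for the illustration. `fan_in` must be at least 2, otherwise the merge loop never shrinks the number of runs. Python threads show the structure only; CPython will not run the sorts truly in parallel.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def merge_batches(batches):
    """Stage b): merge already-sorted batches into one sorted 'outfile'."""
    # b0) the "look up": lowest remaining key of every batch, kept as a heap
    lookup = [(batch[0], n, 0) for n, batch in enumerate(batches) if batch]
    heapq.heapify(lookup)
    outfile = []
    while lookup:
        key, n, pos = heapq.heappop(lookup)        # b1) remove the lowest key
        outfile.append(key)
        batch = batches[n]
        pos += 1
        # keep taking from the same batch while its next key is still the lowest
        while pos < len(batch) and (not lookup or batch[pos] <= lookup[0][0]):
            outfile.append(batch[pos])
            pos += 1
        if pos < len(batch):
            # otherwise place the new key in sequence in the look up
            heapq.heappush(lookup, (batch[pos], n, pos))
    return outfile

def sort_merge(data, chunk=128, fan_in=16):
    """Stage a) sorts chunks in worker threads; b) and c) merge fan_in runs at a time."""
    with ThreadPoolExecutor() as pool:
        pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
        runs = list(pool.map(sorted, pieces))              # stage a)
        while len(runs) > 1:                               # repeat b) and c)
            groups = [runs[i:i + fan_in] for i in range(0, len(runs), fan_in)]
            runs = list(pool.map(merge_batches, groups))
    return runs[0] if runs else []
```

Calling `sort_merge(data, chunk=64, fan_in=32)` instead of the defaults is one way to vary the batch size and the number of batches per merge.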

The machines then were not multiple core but did use virtual paging [well, the IBM did; I can't remember whether the ICL
did, but it was generally a superior machine to the American offerings].
Quite a lot of data went through the IBM version, merging 8 files of user-related data, and it was quite quick,
but I'd be interested to see how the threading affects the performance,
as there's the opportunity to process much of the sort merge concurrently.
I can't remember what the limit to the number of threads is on x32 [64 ????]
Also,
you can vary the number of batches and their size and see how this impacts the speed, because you've now got
a thread control overhead implicit in the M$ software and its co-ordination, as well as the sort and merge,
both of which provide different challenges.

I hope you like this, Magnus, and anyone else who is interested in trying it.
Obviously, if you're merging a lot of files and they're already in a convenient similar sequence, omit stage a).
regards mike b

daydreamer · January 21, 2021, 05:50:39 AM

thanks Mike

daydreamer · January 25, 2021, 01:46:41 AM

After benchmarking PeekMessage (millions/second) against a worker thread (billions/second),
I decided WndProc should handle a minimum of messages: mouse messages just store the coordinates and flags for mouse-move and mouse-button messages in global variables, and keyboard messages likewise.
Maybe include some code detecting what the mouse is pointing at or has clicked on.
And do most of the work in worker threads.
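
The split described above might look like this in outline. It is a sketch only: `InputState`, `store`, `snapshot` and `run_demo` are made-up names, the loop stands in for a stream of mouse-move messages, and a lock replaces the bare global variables so the worker never reads a half-updated coordinate pair.

```python
import threading

class InputState:
    """Latest mouse state, written by the message handler and read by workers."""
    def __init__(self):
        self._lock = threading.Lock()
        self._x = self._y = self._buttons = 0

    def store(self, x, y, buttons):
        # all the window procedure does: record the values and return at once
        with self._lock:
            self._x, self._y, self._buttons = x, y, buttons

    def snapshot(self):
        with self._lock:
            return self._x, self._y, self._buttons

def run_demo():
    state = InputState()
    done = threading.Event()
    seen = []

    def worker():
        # the heavy work lives here and uses whatever input was stored last
        done.wait()
        seen.append(state.snapshot())

    t = threading.Thread(target=worker)
    t.start()
    for x in range(5):                  # stand-in for incoming mouse-move messages
        state.store(x, x * 2, 1)
    done.set()
    t.join()
    return seen
```

The worker only ever sees the most recent state, which is usually what is wanted for mouse movement; clicks that must not be lost would need a queue instead.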

daydreamer · January 26, 2021, 02:21:26 AM

https://www.drdobbs.com/cpp/ccli-threading-part-i/184402018

https://www.drdobbs.com/cpp/ccli-threading-part-ii/184402029

How do I do try/catch/finally in MASM?

mineiro · January 26, 2021, 04:29:15 AM

I suppose that is 'exception handling'.
http://masm32.com/board/index.php?topic=6614.0

A debugger will see 2 attempts to terminate your program: on the first one you need to deal with the supposed error, and the second will terminate your program.
To test, you can do a division by zero.

The link below has a document that talks a bit about exception handling (swconventions).
http://masm32.com/board/index.php?topic=5455.0

daydreamer · January 26, 2021, 05:27:45 AM

Quote from: mineiro on January 26, 2021, 04:29:15 AM
I suppose that is 'exception handling'.
http://masm32.com/board/index.php?topic=6614.0

A debugger will see 2 attempts to terminate your program: on the first one you need to deal with the supposed error, and the second will terminate your program.
To test, you can do a division by zero.

The link below has a document that talks a bit about exception handling (swconventions).
http://masm32.com/board/index.php?topic=5455.0

thanks mineiro

GP faults are very common and would be nice to catch, especially when I have several threads that can cause bugs or GP faults, so a try/catch block could show a message box or something to show which thread is causing the problem.
On 32-bit it says the OS automatically allocates stack space for a thread if you don't specify a size; does it work the same with 64-bit shadow space? And how much?
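
The "which thread is causing the problem" idea can be sketched at the level of control flow like this. It shows language-level try/catch/finally only, not how a hardware fault reaches a handler on Windows. The names `guarded`, `run_workers`, `worker-ok` and `worker-bad` are made up, and the faulting worker uses a division by zero as the test case.

```python
import threading

def guarded(target, failures):
    """Wrap a thread body so a fault is caught and tagged with the thread's name."""
    def body():
        name = threading.current_thread().name
        try:
            target()
        except Exception as exc:         # the "catch": record who failed and why
            failures.append((name, type(exc).__name__))
        finally:                         # the "finally": runs on success and on failure
            pass                         # per-thread cleanup would go here
    return body

def run_workers():
    failures = []
    jobs = {"worker-ok": lambda: 1 + 1, "worker-bad": lambda: 1 // 0}
    threads = [threading.Thread(target=guarded(fn, failures), name=name)
               for name, fn in jobs.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failures
```

After the joins, `failures` holds one entry per thread that faulted, which is the information a message box would display.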

mineiro · January 26, 2021, 07:41:46 AM

Hello sir daydreamer;
I don't know the answer.
I only played with that so as not to get bored, but after some tries I got more bored.
Maybe Mark Russinovich's book has an answer.

TimoVJL · January 26, 2021, 03:04:06 PM

Something to read:
http://bytepointer.com/resources/pietrek_crash_course_depths_of_win32_seh.htm
https://www.osronline.com/article.cfm%5earticle=469.htm
https://www.codeproject.com/Articles/1212332/bit-Structured-Exception-Handling-SEH-in-ASM
http://www.rohitab.com/structured-exception-handling-in-assembly-language

BTW:
TIOBE Index for January 2021

daydreamer · January 27, 2021, 01:14:06 AM

Quote from: TimoVJL on January 26, 2021, 03:04:06 PM
BTW:
TIOBE Index for January 2021

Thanks mineiro and TimoVJL.
Assembler has risen from 15th to 11th place.

@Hutch more MASM videos and we'll soon reach #1

Now I have found some exercises and am trying the producer/consumer approach,
and learning which algorithms are well suited to parallel execution and which are less so.
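
For reference, a minimal producer/consumer sketch, assuming one producer, a bounded queue and a sentinel value to shut the consumers down. The names `produce_consume`, `work` and `STOP` are made up, and squaring each item is a placeholder for real work.

```python
import queue
import threading

def produce_consume(items, consumers=2):
    """One producer feeds a bounded queue; several consumers square each item."""
    work = queue.Queue(maxsize=8)       # bounded: the producer blocks if consumers lag
    results = []
    results_lock = threading.Lock()
    STOP = object()                     # sentinel telling a consumer to exit

    def producer():
        for item in items:
            work.put(item)
        for _ in range(consumers):
            work.put(STOP)              # one sentinel per consumer

    def consumer():
        while True:
            item = work.get()
            if item is STOP:
                break
            with results_lock:
                results.append(item * item)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)              # arrival order varies between runs, so sort
```

The queue is the only point where the threads meet, so the producer and the consumers can be developed and tested separately.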

The MASM Forum
