Multithreaded apps in 64 bit assembler

Started by AKRichard, August 08, 2012, 09:06:45 AM


AKRichard

Hello again all,

  I posted about memory management a few weeks ago, and a problem has cropped up since then.  After writing the memory management routines, I started getting errors when running the library in a multithreaded app (it used to run fine in multithreaded apps), while it still runs perfectly in a single thread.  The problem I am having is figuring out HOW to debug it.  It never errors out on the first run (usually only after a few thousand runs), and when it does error, most of the time it trashes every object being used.  I had set up code in the algorithms to try to catch the errors before it crashes, with stuff like

cmp          rax, 0
ja             ItsGood
mov          rax, rax

ItsGood:

mov           rcx, qword ptr[rax]      ; <-------- will error here, showing rax with 0


when checking the values returned by calls for memory (of course there is a breakpoint set at the mov     rax, rax).  At this point, most of the time if I try to stop debugging it will take 5 minutes or so to stop, or I have to use Task Manager to stop it.

Also, every once in a great while the debugger will stop and tell me something about an error there is no handler for (something like that, and it doesn't tell me WHAT the error is).

Anyhow, I am not finding much out there about debugging multithreaded assembler programs.

I am using Visual Studio 2010, .NET 4, mixed mode managed/native; the assembler is in an asm file all to itself.

Thanks

qWord

Aside from your debugging problem: did you consider synchronization of shared resources? Not doing so will inevitably cause unpredictable behavior.
MREAL macros - when you need floating point arithmetic while assembling!

AKRichard

As far as the managed side of the library goes, there are no shared resources; I have the library set up as immutable.  Whenever a change is made, what you're getting is a whole new object.  On the native side, there really is nothing to share; all the native side is there for is to pass variables down to the assembly language routines and pass the return values back.  My test program consuming the library is not sharing any of the resources: each thread calls the same function, and everything the function uses is created locally inside it, so each thread should have its own copy of everything.

  It was running fine in multithreaded mode before I changed the memory handling of the assembly routines, so I am pretty sure the problem is there.  I am just having a hard time figuring out how to debug it.  Since the errors don't happen right away, stepping into it wouldn't work (I could be here all night before I actually stepped into the error).  But I haven't been able to trap the error like I could in single-threaded mode; somehow, it bypasses the traps.  Finally, if I just wait for the error, I lose everything except the call stack: most of the registers read 0, all the local variables read 0, and all the objects at the native and managed levels read undefined.

  If I am understanding correctly, the runtime dumps all the registers to the stack when changing context, but what about local variables within the assembly language routines?  I do see how it could wreck things if it didn't save the local variables, though I would've thought it would have shown up long before now.  I was just wondering if I'm not using the correct tools.  Do I need to configure it differently?  Would sacrificing a chicken help?

AKRichard

Doesn't anyone have a little advice?  I finally figured out a way to get my assembly code to print to a text file, though I am sure there is a more efficient way.  However, I am now running into issues with writing to a text file in multithreaded mode.  Please don't laugh too hard; it looks like I found about 30 more subjects I need to study in order to write algorithms in assembly language.

  I have a few more questions.

1.  Do I have all the synchronization mechanisms available in native C++?
2.  Are there other mechanisms I should be considering?
3.  I have an initialization section of code that should only be run once, but I have found out that if the runtime switches at an unfortunate moment (right after the first thread passes the check to see if it has been initialized, but before it sets the variable to signal that it has begun initialization), then I end up with 2 threads attempting the initialization.  How do you deal with that?
4.  Are there any good articles about coding assembly language for multithreaded apps (and debugging multithreaded apps)?  I've found precious little on this subject, especially on the 64-bit platform.


  I answered the question about shared resources way too early: the memory management is a shared resource in the assembly code.  I am sure that is what is screwing me up, but I still haven't been able to catch it in action.

  I really don't mind working my way through the problem, but I am having a hard time with the lack of information I am finding on this subject.

qWord

Quote from: AKRichard on August 13, 2012, 03:34:00 PM
1.  Do I have all the synchronization mechanisms available in native C++?
When writing assembler you can use the Windows Synchronization Functions and/or the LOCK instruction prefix, which allows atomic memory access (see documentation).
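For illustration, here is a minimal sketch of calling the Windows synchronization functions from 64-bit MASM. CreateSemaphoreW, WaitForSingleObject and ReleaseSemaphore are real kernel32 functions; g_hSem, InitSem and GuardedCall are hypothetical names made up for this example, and error checking is left out. Nothing here is a method call on an instance: the handle returned by the create call is simply passed as the first argument (RCX) to every later call.

```asm
; Sketch only: g_hSem, InitSem and GuardedCall are assumed names.
includelib kernel32.lib
extern CreateSemaphoreW:proc
extern WaitForSingleObject:proc
extern ReleaseSemaphore:proc

.data
g_hSem  dq 0                        ; semaphore handle, filled in by InitSem

.code
InitSem proc                        ; call once at startup
        sub     rsp, 28h            ; 32 bytes shadow space + 8 to realign the stack
        xor     ecx, ecx            ; lpSemaphoreAttributes = NULL
        mov     edx, 1              ; lInitialCount = 1 (free)
        mov     r8d, 1              ; lMaximumCount = 1 (behaves like a mutex)
        xor     r9d, r9d            ; lpName = NULL (unnamed)
        call    CreateSemaphoreW
        mov     g_hSem, rax         ; keep the handle for later calls
        add     rsp, 28h
        ret
InitSem endp

GuardedCall proc
        sub     rsp, 28h
        mov     rcx, g_hSem         ; the handle identifies the semaphore
        mov     edx, 0FFFFFFFFh     ; INFINITE timeout
        call    WaitForSingleObject ; blocks until the semaphore is available
        ; ... protected work goes here ...
        mov     rcx, g_hSem
        mov     edx, 1              ; lReleaseCount
        xor     r8d, r8d            ; lpPreviousCount = NULL
        call    ReleaseSemaphore
        add     rsp, 28h
        ret
GuardedCall endp
end
```

A waiting thread blocks inside WaitForSingleObject instead of spinning, so it does not consume CPU time while another thread holds the semaphore.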
Quote from: AKRichard on August 13, 2012, 03:34:00 PM
3.  I have an initialization section of code that should only be run once, but I have found out that if the runtime switches at an unfortunate moment (right after the first thread passes the check to see if it has been initialized, but before it sets the variable to signal that it has begun initialization), then I end up with 2 threads attempting the initialization.  How do you deal with that?
The simplest solution is to use an initialization function that is called from the main thread at startup. This routine then creates the needed synchronization objects.
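If calling an initialization function from the main thread is not practical, the check and the "I have started" signal can be merged into one atomic step. The following is only a sketch of that idea, not the approach used in this thread: g_initState, EnsureInit and DoInit are hypothetical names, and DoInit stands in for the real one-time setup. Windows also offers InitOnceExecuteOnce for the same purpose.

```asm
; Sketch only: g_initState, EnsureInit and DoInit are assumed names.
.data
align 4
g_initState dd 0                    ; 0 = not started, 1 = in progress, 2 = done

.code
DoInit proc                         ; placeholder for the real one-time setup
        ret
DoInit endp

EnsureInit proc
        sub     rsp, 28h            ; shadow space + stack alignment for the call
        cmp     g_initState, 2
        je      Done                ; fast path: already initialized
        xor     eax, eax            ; expected value: 0 (not started)
        mov     edx, 1              ; new value: 1 (in progress)
        lock cmpxchg g_initState, edx ; test and claim in one atomic instruction
        jne     WaitForOther        ; another thread claimed it first
        call    DoInit              ; only the claiming thread gets here
        mov     g_initState, 2      ; publish completion
        jmp     Done
WaitForOther:
        pause                       ; spin-wait hint to the processor
        cmp     g_initState, 2
        jne     WaitForOther        ; wait until the claiming thread has finished
Done:
        add     rsp, 28h
        ret
EnsureInit endp
end
```

Because the compare and the store happen in a single locked instruction, there is no window between "check" and "set" in which a second thread can slip through.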

Quote from: AKRichard on August 13, 2012, 03:34:00 PM
I answered the question about shared resources way too early: the memory management is a shared resource in the assembly code.  I am sure that is what is screwing me up, but I still haven't been able to catch it in action.

  I really don't mind working my way through the problem, but I am having a hard time with the lack of information I am finding on this subject.
This problem is not trivial - you may want to consider using the ready-to-use Heap Functions.

AKRichard

Quote from: qWord on August 14, 2012, 12:50:17 AM
When writing assembler you can use the Windows Synchronization Functions and/or the LOCK instruction prefix, which allows atomic memory access (see documentation).

That's what I thought.  I am pretty sure I understand the process of creating the structures needed in assembly, but how do I access methods within a particular instance?  For example, with semaphores, how would I call the Release method on a particular instance?  I haven't even started reading on these subjects yet, but even if I fail to find much on the web, I'll probably be able to find the answers in masm32 somewhere.

Quote from: qWord on August 14, 2012, 12:50:17 AM
The simplest solution is to use an initialization function that is called from the main thread at startup. This routine then creates the needed synchronization objects.

That is what I figured out shortly after I posted the question.  Now I have the assembly code initialized the very first time an instance of the managed class is instantiated.

Quote from: qWord on August 14, 2012, 12:50:17 AM
This problem is not trivial - you may want to consider using the ready-to-use Heap Functions.

  Those are the functions I am using in the release version of the code that I use within my other programs, but I like the response I am getting from these routines using the memory management I've created, though every time I identify another bug and fix it, it does lose some speed (it's amazing how fast an incorrectly running algorithm can run lol).

  All of this brings up a whole slew of other subjects I need to study up on, which I will work on in my own time instead of bugging the forums too much, but I do have a few quick questions.

  From what I am reading about the LOCK prefix in the AMD docs, it sounds like it is specific to isolating access to resources between processors (not threads specifically).  If that is correct, would these problems disappear on a single-processor system?  I don't have a single-processor system at my disposal at the moment, but I am thinking about dropping an old motherboard into a case to try it out.

  While reading about the LOCK prefix in the AMD docs, I came across the fence instructions.  The docs mention that a fence can be used for ordering memory reads and writes.  What I am confused about is atomicity: are the fence instructions used to make a set of instructions execute atomically?  I read that they are used for controlling the order of memory accesses, but I don't think I understand it clearly.

Thanks again for the reply

AKRichard

Turns out the answer was quite a bit simpler than I thought it would be.  Your suggestion of using the lock prefix was great; all it took was


mov rax, 0
mov rbx, 1

TryAgain:

cmpxchg         _gm, rbx
jnz TryAgain
sub rsp, 28h
call GoMem64
add rsp, 28h


to get the memory routine to run on only one thread/processor at a time, and it appears to be running fine now in multithreaded mode.  It did cost me some of my gains, but now that I understand it a little better I can play with it to make it better.

  I would still be interested in knowing how y'all debug multithreaded apps yourselves.  I never could catch the problem while debugging; I only managed to guess what the problem was.  I was able to change my code to make it stop crashing, but I never managed to trap the errors before they happened or catch the problem in any other way.

Anyways, thanks, you pointed straight to a solution.

dedndave

XCHG implies a LOCK, as well
i use it from time to time...
        mov     al,1
        xchg    al,bSemaphore       ;if semaphore = 1, query
        cmp     al,1
        jz      sleep_then_try_again

;otherwise, you own the semaphore

MichaelW

XCHG has an implicit lock only when a memory operand is involved.

Also, from here:

Quote
Simple reads and writes to properly-aligned 32-bit variables are atomic operations. In other words, you will not end up with only one portion of the variable updated; all bits are updated in an atomic fashion. However, access is not guaranteed to be synchronized. If two threads are reading and writing from the same variable, you cannot determine if one thread will perform its read operation before the other performs its write operation.
Simple reads and writes to properly aligned 64-bit variables are atomic on 64-bit Windows. Reads and writes to 64-bit values are not guaranteed to be atomic on 32-bit Windows. Reads and writes to variables of other sizes are not guaranteed to be atomic on any platform.


Well Microsoft, here's another nice mess you've gotten us into.

AKRichard

in the amd docs:

Quote
The LOCK prefix can only be used with forms of the following instructions that write a memory
operand: ADC, ADD, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, CMPXCHG16B, DEC,
INC, NEG, NOT, OR, SBB, SUB, XADD, XCHG, and XOR. An invalid-opcode exception occurs if
the LOCK prefix is used with any other instruction.
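As a small illustration of the quoted rule, here is a sketch showing two of the listed instructions with a LOCK prefix and a memory destination. g_counter, g_flags and NextTicket are hypothetical names, not something from this thread.

```asm
; Sketch only: g_counter, g_flags and NextTicket are assumed names.
.data
align 8
g_counter dq 0                      ; shared counter
g_flags   dd 0                      ; shared bit flags

.code
NextTicket proc
        mov     eax, 1              ; writing eax zero-extends into rax
        lock xadd g_counter, rax    ; atomically: rax = old value, g_counter = old + 1
        lock bts g_flags, 0         ; atomically set bit 0; CF = previous bit value
        ret                         ; rax holds the counter value before the increment
NextTicket endp
end
```

Both forms write a memory operand, so the prefix is legal. The same instructions with only register operands would raise the invalid-opcode exception mentioned in the quote.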

dedndave, how do you release the semaphore?  Originally I was thinking of using a semaphore or something, but I couldn't find an example of how to release it.  I think the solution I came up with was rather simple: using the _gm variable to signal when another thread is in the memory management routine, and resetting the variable right before the RET statement (only the one lock prefix used).  I've already run some 15 million iterations through the algorithms and it hasn't errored yet.  It's almost disappointing that it could be that simple but took me a week to get here.

dedndave

with XCHG reg,mem, you get a bus lock
that means only one thread may access that memory at a given time

when you put a 1 there, it tells other threads that it is being queried
the way i use it - the other bits can mean other things
if you XCHG and get a 1 back - you do not have access
if you XCHG and get back a value other than 1, it is the semaphore and you own it
to release it, you put the original byte back
other threads will wait for that bit to be low before accessing it and whatever it locks

if i am waiting for the semaphore, i use the Sleep function to allow other threads/processes to run
it isn't critical, but i use something like 1 to 10 mS
it depends on how fast you want things to process
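A simplified sketch of the acquire/sleep/release cycle described above, assuming the flag holds only 0 (free) or 1 (owned) rather than the multi-purpose byte dedndave uses. AcquireFlag and ReleaseFlag are hypothetical names; Sleep is the real kernel32 function.

```asm
; Sketch only: AcquireFlag and ReleaseFlag are assumed names.
includelib kernel32.lib
extern Sleep:proc

.data
bSemaphore db 0                     ; 0 = free, 1 = owned

.code
AcquireFlag proc
        sub     rsp, 28h            ; shadow space + stack alignment for the Sleep call
TryAgain:
        mov     al, 1
        xchg    al, bSemaphore      ; implicitly locked because one operand is memory
        test    al, al
        jz      Owned               ; previous value was 0: this thread owns the flag
        mov     ecx, 1              ; Sleep(1): give other threads a chance to run
        call    Sleep
        jmp     TryAgain
Owned:
        add     rsp, 28h
        ret
AcquireFlag endp

ReleaseFlag proc
        mov     bSemaphore, 0       ; a plain store is enough to release on x86-64
        ret
ReleaseFlag endp
end
```

The release needs no LOCK prefix because only the owning thread writes 0, and x86-64 does not reorder that store ahead of the owner's earlier writes.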

qWord

The problem is that you can't efficiently build a signaling mechanism between threads using the LOCK prefix alone. Your code enters a loop while waiting, and that forces the OS to assign more execution time to that thread. The thread that currently 'owns' the mutex is disadvantaged, so it takes longer to release it.
IMO there is no way around using the corresponding OS functions if waiting is required. However, a loop that executes a few times before calling a blocking API function may be an interesting option (I wouldn't be surprised if this is even done by some of the APIs).
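One API that combines both ideas is the critical section with a spin count: on a multiprocessor system EnterCriticalSection spins for up to the given number of iterations before it falls back to a blocking wait. The sketch below assumes a 40-byte CRITICAL_SECTION (its size on 64-bit Windows) and an arbitrary spin count of 4000. g_cs, InitMemLock, GuardedWork and DoProtectedWork are hypothetical names.

```asm
; Sketch only: g_cs, InitMemLock, GuardedWork and DoProtectedWork are assumed names.
includelib kernel32.lib
extern InitializeCriticalSectionAndSpinCount:proc
extern EnterCriticalSection:proc
extern LeaveCriticalSection:proc

.data
align 8
g_cs    db 40 dup(0)                ; storage for a CRITICAL_SECTION (40 bytes on x64)

.code
DoProtectedWork proc                ; placeholder for the code that must not run concurrently
        ret
DoProtectedWork endp

InitMemLock proc                    ; call once at startup, before any thread uses g_cs
        sub     rsp, 28h
        lea     rcx, g_cs           ; lpCriticalSection
        mov     edx, 4000           ; dwSpinCount (arbitrary value for this sketch)
        call    InitializeCriticalSectionAndSpinCount
        add     rsp, 28h
        ret
InitMemLock endp

GuardedWork proc
        sub     rsp, 28h
        lea     rcx, g_cs
        call    EnterCriticalSection ; spins briefly, then blocks if still contended
        call    DoProtectedWork
        lea     rcx, g_cs
        call    LeaveCriticalSection
        add     rsp, 28h
        ret
GuardedWork endp
end
```

Short waits are handled without a kernel transition, while long waits do not burn CPU time, which is the trade-off described above.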

hool

Quote from: AKRichard on August 15, 2012, 04:51:59 PM
... some code ... to get the memory routine to run on only one thread/processor at a time and appears to be running fine now in multithreaded mode.
I'm afraid you got lucky. Cmpxchg is very fast compared to switching threads, for example, and it would probably take a long time before your software starts misbehaving. Do use the LOCK prefix.
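A minimal sketch of what the locked version could look like. One further detail is worth noting: when CMPXCHG fails, it loads the current memory value into RAX, so the expected value has to be reloaded inside the loop. In the code quoted above, RAX is set only once before the label, which means a retry would compare 1 against 1 and appear to succeed. g_memLock, LockMem and UnlockMem are hypothetical names standing in for the poster's _gm handling.

```asm
; Sketch only: g_memLock, LockMem and UnlockMem are assumed names.
.data
align 8
g_memLock dq 0                      ; 0 = free, 1 = held

.code
LockMem proc
        mov     edx, 1              ; value to store when the lock is free
TryAgain:
        xor     eax, eax            ; expected value 0, reloaded on every attempt
        lock cmpxchg g_memLock, rdx ; if g_memLock == rax: g_memLock = rdx, ZF = 1
        jz      Acquired
        pause                       ; spin-wait hint to the processor
        jmp     TryAgain
Acquired:
        ret
LockMem endp

UnlockMem proc
        mov     g_memLock, 0        ; aligned store releases the lock
        ret
UnlockMem endp
end
```

The sketch uses RDX instead of RBX for the new value because RBX is a nonvolatile register in the Windows x64 calling convention and would otherwise have to be saved and restored.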

Quote from: dedndave
XCHG implies a LOCK, as well
Ouch, by "as well" you meant cmpxchg?

dedndave

"as well", as in "also"   :P

when you use XCHG reg,mem, a LOCK prefix is implied
i.e., you do not have to explicitly use LOCK

the reasons i use this method - you control the Sleep time
if you want faster semaphore tests, use 1 mS
if you want the thread that owns the semaphore to have more time, use 10 mS or more

and - you don't have to worry about OS version - lol

i have played with mutexes and semaphores via the API functions, as well
the XCHG method takes a little more code, but it is simple code

hool

how cool  :lol:
cmpxchg implies a lock  :dazzled:

I am just following the discussion post by post from above