Register usage in PROCs

Started by CCurl, October 15, 2015, 03:00:43 AM

CCurl

In my own code, I assume I am free to use the registers however I want, so long as it behaves the way I want it to. Is that too cavalier?

I guess my question is more this ... if I write my own proc that will only ever be called internally (never by a C program or as part of a library), and that proc uses one or more registers internally, should I save (push) them at the beginning and then restore (pop) them at the end?

In other words ... is it "proper" practice to save the registers one uses and then restore them before leaving, in case a caller depends on them not changing?

How significant a performance hit is all that pushing and popping? If there is a hit, I would think it would add up over time because of how often it happens.

How do YOU decide which registers you can trash, and which to leave alone?

Again, this is specifically in the context of internal subroutine procs that will never be called by anything but your own code, not procs that would end up in a library or be called from another language.

bsdsource

Quoted from Japheth's website:

Quote: The "register gets trashed" trap:
Assembly beginners inevitably stumble over the phenomenon that registers "suddenly" change their values,
although there is apparently no reason for such magic changes. Here is a short explanation why.
Register  Name               Usage
eax       accumulator        trashable, used e.g. for returning values
ebx       base               general purpose protected register
ecx       count              general purpose trashable register (but protected in MasmBasic macros)
edx       data               general purpose trashable register
esi       source index       general purpose protected register, used e.g. in lodsd, movsd
edi       destination index  general purpose protected register, used e.g. in stosd, scasb
ebp       base pointer       used for handling LOCAL variables - don't touch
esp       stack pointer      used for push & pop and passing parameters - don't touch
* protected register (ebx, esi, edi, ebp):
1. If you use one of these registers, you must save it on entry into a proc, and restore it before the ret
   (you should avoid using ebp, as it is used for LOCAL variables; if you use LOCALs, do not use ebp explicitly, and do not save it)

Source:
http://www.webalice.it/jj2006/Masm32_Tips_Tricks_and_Traps.htm
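To make rule 1 concrete, here is a minimal sketch, assuming the MASM32 SDK (masm32rt.inc with its print and exit macros) and a console build. CopyZ is a hypothetical helper, not a MASM32 library routine. It modifies ESI and EDI, so it lists them after USES; MASM then pushes them in the generated prologue and pops them before the RET. EAX, ECX and EDX are treated as trashable and are not saved.

```asm
; Sketch: a PROC that follows the register rules (MASM32 SDK assumed).
include \masm32\include\masm32rt.inc

.data
  src db "register rules", 0
  dst db 32 dup(0)

.code

; CopyZ (hypothetical helper): copies a zero-terminated string.
; It changes ESI and EDI, so they are listed after USES: MASM pushes
; them in the prologue and pops them before the RET.
; EAX, ECX and EDX are trashable and are not saved.
CopyZ proc uses esi edi pSrc:DWORD, pDst:DWORD
    mov esi, pSrc
    mov edi, pDst
  @@:
    lodsb                   ; AL = byte at [ESI], then ESI + 1
    stosb                   ; byte at [EDI] = AL, then EDI + 1
    test al, al
    jnz @B                  ; stop after the terminating zero is copied
    mov eax, pDst           ; return the destination address in EAX
    ret
CopyZ endp

start:
    invoke CopyZ, addr src, addr dst
    print eax, 13, 10       ; prints the copied string
    exit
end start
```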

TouEnMasm


https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx

K_F

Essentially every operating system standardises on an Application Binary Interface (ABI).
This gives you a guideline on how to use the particular Application Programming Interface (API) with respect to the platform (CPU type).

https://en.wikipedia.org/wiki/Application_binary_interface will get you started... take your time  :biggrin:

dedndave

it's probably a little confusing to someone unfamiliar

the trick is to understand when the rules of the ABI apply, and when they do not

when you call an API function, let's use MessageBox as an example,
we know that function follows the rules of the ABI - practically all API functions do
so - i can put a value in EBX, ESI, EDI, or EBP, call the API function, and know those registers will remain unaltered across the call

this is particularly critical in callback functions, like WndProc
WndProc is a function, often written by the user, that handles messages for a window
in WndProc code, the rules must be followed
so, we write our "general purpose" routines so that the rules are not violated
we write them to behave much like API functions

when you start a new process (or thread), all of the registers belong to you
you can trash whatever you like, at that point
in console-mode programs, you don't usually use any callback functions
but, it's nice to know that the API functions, and all the functions in the masm32 library follow the rules

for example, you can use EBX as a loop counter, without push/pop
        mov     ebx,4

loop00: call    SomeFunc
        dec     ebx
        jnz     loop00


you can do this because you know SomeFunc follows the ABI
you could just as well have used EBP for the loop count, or ESI or EDI
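A complete version of the loop above, as a sketch assuming a MASM32 SDK console build: here the MASM32 print and str$ macros stand in for SomeFunc. They call StdOut and dwtoa, which may change EAX, ECX and EDX, but because those routines follow the ABI, EBX still holds the count on every pass. Since this is the program's entry code, EBX is used without saving it first.

```asm
; Sketch: EBX as a loop counter across ABI-compliant calls.
include \masm32\include\masm32rt.inc

.code
start:
    mov ebx, 4              ; counter in a protected register, no push/pop
loop00:
    print str$(ebx), 13, 10 ; str$/print call dwtoa and StdOut, which may
                            ; change EAX, ECX and EDX but preserve EBX
    dec ebx
    jnz loop00              ; prints 4, 3, 2, 1
    exit
end start
```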

inside your PROC is a little different story
first, you probably want your PROC to follow the rules
so, if EBP is used, it is preserved and restored across the call

as it happens, the assembler will generate code that uses EBP to access arguments and/or local variables
in that case, you do not want to use EBP because those variables will no longer be accessible
this applies to PROCs written using the default prologue and epilogue
but, it does not apply to PROCs that have no arguments or locals

when the assembler uses EBP, it preserves and restores it across the call
so that other functions do not lose their EBP pointers
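A sketch of what the assembler does with EBP, assuming MASM's default prologue and epilogue and a MASM32 SDK console build (AddTwice and NoFrame are hypothetical names). The comments show roughly the frame setup MASM generates for a PROC with arguments and a LOCAL; the exact epilogue instructions can vary, so a listing file (ml /Fl) is the place to check what your assembler actually emits.

```asm
; Sketch: which PROCs get an EBP frame (MASM32 SDK assumed).
include \masm32\include\masm32rt.inc

.code

; AddTwice (hypothetical): returns 2 * (valA + valB) using a LOCAL.
; With arguments or LOCALs, the default prologue is roughly:
;     push ebp
;     mov  ebp, esp
;     sub  esp, 4        ; room for nSum
; valA is at [ebp+8], valB at [ebp+12], nSum at [ebp-4], so EBP
; must keep pointing at this frame while they are in use.
; The default epilogue restores ESP and EBP before "ret 8".
AddTwice proc valA:DWORD, valB:DWORD
    LOCAL nSum:DWORD
    mov eax, valA
    add eax, valB
    mov nSum, eax
    add eax, nSum
    ret
AddTwice endp

; NoFrame (hypothetical): no arguments and no LOCALs,
; so MASM generates no EBP frame at all.
NoFrame proc
    mov eax, 42
    ret
NoFrame endp

start:
    invoke AddTwice, 3, 4
    print str$(eax), 13, 10     ; 14
    call NoFrame
    print str$(eax), 13, 10     ; 42
    exit
end start
```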

really, you can use whatever registers you like, however you like
but, it is important to understand how PROCs are written, so you know what's volatile

if you are writing code that is called from inside a WndProc, it must observe the rules
there are other examples of callback functions, like enumeration operations - same thing applies
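A callback sketch, assuming a MASM32 SDK console build: EnumWindows calls EnumProc (a hypothetical name) once per top-level window, so EnumProc has to follow the rules. It borrows EBX, therefore it lists EBX after USES, returns TRUE in EAX to continue the enumeration, and receives the address of the counter through lParam.

```asm
; Sketch: an enumeration callback that obeys the ABI (MASM32 SDK assumed).
include \masm32\include\masm32rt.inc

.data?
  nWindows dd ?

.code

; EnumProc (hypothetical) is called by Windows, not by our own code,
; so EBX must be preserved: USES ebx saves and restores it.
EnumProc proc uses ebx hWin:DWORD, lParam:DWORD
    mov ebx, lParam             ; lParam carries the address of the counter
    inc DWORD PTR [ebx]
    mov eax, TRUE               ; TRUE = keep enumerating
    ret
EnumProc endp

start:
    mov nWindows, 0
    invoke EnumWindows, addr EnumProc, addr nWindows
    print str$(nWindows), 13, 10    ; number of top-level windows
    exit
end start
```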

on rare occasion, i need that extra register for use inside a loop
i push ebp, perform the loop, and pop ebp when i'm done
inside that loop, no arguments or locals may be accessed
but, i can use ebp to hold an integer or whatever i want
once ebp has been restored, i may again access the local variables or arguments
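A sketch of that trick, assuming a MASM32 SDK console build; SumAndMax is a hypothetical PROC. The arguments are copied into registers before EBP is pushed, because pArr and nCount are addressed through EBP. Between push ebp and pop ebp, EBP is just another scratch register and no argument or LOCAL may be touched. The example is contrived for brevity: in real code you would borrow EBP only once EBX, ESI and EDI are already busy.

```asm
; Sketch: borrowing EBP as an extra register inside a loop (MASM32 SDK assumed).
include \masm32\include\masm32rt.inc

.data
  nums dd 1, 2, 3, 4, 5, 6, 7, 8

.code

; SumAndMax (hypothetical): sum of the array in EAX, signed maximum in EDX.
; Assumes nCount >= 1.
SumAndMax proc uses esi pArr:DWORD, nCount:DWORD
    mov esi, pArr               ; read the arguments while EBP is still the frame pointer
    mov ecx, nCount
    xor eax, eax                ; running sum
    mov edx, [esi]              ; running max = first element
    push ebp                    ; from here on: no arguments or LOCALs
  @@:
    mov ebp, [esi]              ; EBP used as a plain scratch register
    add eax, ebp
    cmp ebp, edx
    jle notMax                  ; signed comparison
    mov edx, ebp
  notMax:
    add esi, 4
    dec ecx
    jnz @B
    pop ebp                     ; EBP is the frame pointer again
    ret
SumAndMax endp

start:
    invoke SumAndMax, addr nums, 8
    mov ebx, edx                ; keep the max: print/str$ may change EDX
    print str$(eax), 13, 10     ; 36
    print str$(ebx), 13, 10     ; 8
    exit
end start
```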

another case pops up where i am writing a function that follows the ABI
inside that function, i preserve EBX, ESI, EDI, and EBP - and restore them at exit
i may have a PROC that is called by my function that does not observe the rules
but, the function that is called by the user (the "outer" PROC) does
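A sketch of that arrangement, assuming a MASM32 SDK console build; Inner and Outer are hypothetical names. Inner ignores the rules and overwrites EBX, ESI and EDI, but it is only ever called from Outer, which saves and restores those registers for its own caller. The value placed in EBX before the call is therefore still there afterwards.

```asm
; Sketch: an ABI-compliant outer PROC around a rule-breaking helper (MASM32 SDK assumed).
include \masm32\include\masm32rt.inc

.code

; Inner (hypothetical) does NOT follow the ABI: it freely changes
; EBX, ESI and EDI. It is only ever called from Outer.
Inner proc
    mov ebx, 10
    mov esi, 20
    mov edi, 30
    mov eax, ebx
    add eax, esi
    add eax, edi                ; EAX = 60
    ret
Inner endp

; Outer (hypothetical) is what other code calls, so it obeys the rules:
; USES saves EBX, ESI and EDI on entry and restores them before RET.
Outer proc uses ebx esi edi
    call Inner
    ret
Outer endp

start:
    mov ebx, 12345
    call Outer
    print str$(eax), 13, 10     ; 60
    print str$(ebx), 13, 10     ; 12345: EBX survived the call to Outer
    exit
end start
```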

CCurl

I understand. How much overhead does all the pushing and popping add?

I would think that for an app that is built on lots of very small, compact procedures (like my Forth implementation for example), all that pushing and popping could end up slowing it down when executing a loop very many times.

Most of my primitive operations (like FETCH "@" and STORE "!") contain fewer than 10 instructions, and they can easily be called hundreds of times while executing a word. So naturally, I want them to be as lean as possible.

qWord

Quote from: CCurl on October 21, 2015, 06:37:33 AM
So naturally, I want them to be as lean as possible.
You can avoid the call overhead by inlining the code using macros (possible drawback: code size ... if that matters). Another option is to use a calling convention that uses registers to pass arguments.
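A sketch of both ideas, assuming a MASM32 SDK console build and a hypothetical Forth-like design in which the address to fetch from arrives in EAX. FetchProc takes its argument in a register, so there is nothing to push, but each use still costs a CALL/RET pair; the FETCH macro pastes the single MOV at each use site, so the call disappears at the price of code size.

```asm
; Sketch: register-passed argument vs. inlined macro (MASM32 SDK assumed).
include \masm32\include\masm32rt.inc

.data
  cell dd 1234

.code

; FetchProc (hypothetical): argument and result both in EAX,
; so no pushes are needed, only the CALL/RET pair.
FetchProc proc
    mov eax, [eax]              ; replace the address in EAX with the value it points at
    ret
FetchProc endp

; FETCH (hypothetical macro): the body is pasted at every use site.
FETCH MACRO
    mov eax, [eax]
ENDM

start:
    mov eax, offset cell
    call FetchProc
    print str$(eax), 13, 10     ; 1234

    mov eax, offset cell
    FETCH                       ; expands to mov eax, [eax]
    print str$(eax), 13, 10     ; 1234
    exit
end start
```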

BTW: "hundreds of times" does not seem time-critical at first sight...

dedndave

the stack is a very powerful tool
imagine how much overhead there would be without it   :redface:

like any tool - use it when it is the correct tool for the job
the stack is not always the right answer, but often adds speed to a program

hutch--

Herein lies the problem: if you are not going to interact with the operating system, then you can do what you like with the available registers, but as soon as you need to interact with the operating system, you either comply with the Intel ABI or you write very unreliable applications that go bang when you call an API on a different OS version. The trick with the Intel ABI is to properly understand how it works and use it efficiently. There are situations where you enter a procedure and call other procedures, and there are no API calls anywhere in the tree after the procedure entry point; there you can use all 8 general purpose registers, but before the initial procedure exits the procedure tree it MUST fully comply with the ABI or risk unreliable operation.

jj2007

Quote from: CCurl on October 21, 2015, 06:37:33 AM
I understand. How much overhead does all the pushing and popping add?

I would think that for an app that is built on lots of very small, compact procedures (like my Forth implementation for example), all that pushing and popping could end up slowing it down when executing a loop very many times.

Most of my primitive operations (like FETCH "@" and STORE "!") contain fewer than 10 instructions, and they can easily be called hundreds of times while executing a word. So naturally, I want them to be as lean as possible.

You seem to be really worried about speed :P

"Hundreds of times" won't be noticeable. The timings that I post here so generously typically run the loop two million times. Here is one more timing:
1 ms for translating Windows.inc into an array of 26902 strings
2 ms for finding 1011 lines containing 'struct'

The code loads Windows.inc (about one MB) into memory and parses it for CrLf sequences to create a table of pointers and string lengths. Then it uses Instr_() to find occurrences of 'struct' - case-insensitive, full word mode, i.e. it counts Struct, STRUCT, struct but not ERROR_OUT_OF_STRUCTURES in line 4240.

This is heavy work, and it takes no noticeable time. Here is a loop with some pushin' and poppin' - how long does it take? Give us your best guess, or test it yourself...

   xor ecx, ecx
   .Repeat
      push eax
      push edx
      inc ecx
      pop edx
      pop eax
   .Until ecx>=100000000   ; one hundred million iterations, must take a week or so ;)
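To actually measure the loop, here is a sketch of a timing harness, assuming a MASM32 SDK console build. It brackets the loop with GetTickCount, whose resolution is typically around 10 to 16 ms, so only the order of magnitude of the result is meaningful, and the figure will vary by CPU.

```asm
; Sketch: timing the push/pop loop with GetTickCount (MASM32 SDK assumed).
include \masm32\include\masm32rt.inc

.code
start:
    invoke GetTickCount
    mov esi, eax                ; start time in ms; ESI survives API calls

    xor ecx, ecx
    .Repeat
        push eax
        push edx
        inc ecx
        pop edx
        pop eax
    .Until ecx>=100000000       ; 100,000,000 iterations

    invoke GetTickCount
    sub eax, esi                ; elapsed milliseconds
    print str$(eax)
    print " ms", 13, 10
    exit
end start
```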

CCurl

Cool. I will stop worrying about overhead with push and pop.

Re: worrying about the speed ... of course I am. Heck, I have already written this thing in C++, and the performance is OK, but not great. My primary reason for re-writing it in assembler is for the performance gain.

That, and I wanted to re-familiarize myself with x86 assembler.

dedndave

if you were writing a Forth compiler, you might even overlook the ABI in some respects
the Forth code I have seen relied heavily on ESI and LODSD to get arguments from command lists
this, of course, would not be kosher, in terms of ABI behavior
on the bright side, the API functions would preserve ESI for you   :biggrin:
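A minimal sketch of that style, assuming a MASM32 SDK console build; all labels (p_five, p_inc, p_double, p_halt, RunThreaded, threadList) and the DO_NEXT macro are hypothetical, not taken from any particular Forth. ESI serves as the instruction pointer into a list of primitive addresses, and DO_NEXT uses LODSD to fetch the next address and jump to it. The primitives ignore the ABI, but RunThreaded saves EBX and ESI on entry and p_halt restores them, so the caller sees an ABI-compliant routine.

```asm
; Sketch: a tiny threaded-code interpreter with ESI as the instruction pointer.
include \masm32\include\masm32rt.inc

; DO_NEXT: fetch the next primitive address from [ESI] and jump to it.
DO_NEXT MACRO
    lodsd                       ; EAX = dword at [ESI], ESI + 4
    jmp eax
ENDM

.code
p_five:
    mov ebx, 5                  ; EBX acts as a one-cell "top of stack"
    DO_NEXT
p_inc:
    inc ebx
    DO_NEXT
p_double:
    add ebx, ebx
    DO_NEXT
p_halt:
    pop esi                     ; restore the caller's ESI (pushed last)
    mov eax, ebx                ; result in EAX
    pop ebx                     ; restore the caller's EBX
    ret                         ; back to whoever called RunThreaded

; RunThreaded (hypothetical): ABI-compliant entry into the threaded code.
RunThreaded proc
    push ebx
    push esi
    mov esi, offset threadList
    DO_NEXT
RunThreaded endp

.data
  ; hypothetical "compiled word": (5 + 1) * 2, then stop
  threadList dd offset p_five, offset p_inc, offset p_double, offset p_halt

.code
start:
    call RunThreaded
    print str$(eax), 13, 10     ; 12
    exit
end start
```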

CCurl

Quote from: jj2007 on October 21, 2015, 02:21:04 PM
...
"Hundreds of times" won't be noticeable.
...
This is a Forth system, where the initial words in the dictionary are based on primitive operations. The programmer's paradigm is to define new words that are based on words that are already in the dictionary. Eventually, the dictionary has the words that can perform the programmer's desired behavior. Depending on the programmer's need, those words could be very high level. The cool thing about this paradigm is that it seems to lend itself very well to machine learning and robotics, which is my eventual focus of study.

So sure, for low-level words, the number of primitives is pretty small. But as the words get more and more high-level, the nesting of the words can increase by orders of magnitude, and then we are looking at executing the primitives potentially millions of times.

Now if my typical primitive has 10 instructions, and I add 8 instructions for pushing and popping 4 registers (EBP, ESI, EDI, EBX), then my primitive is executing almost twice as many instructions as it was before.  So if push and pop take as long as other instructions, then it follows that the total time to execute the primitive is almost twice as long, which I am worried may make a noticeable difference when the words are sufficiently high-level.

dedndave

push and pop are not "super fast", but they are pretty fast
the instructions to watch out for....

DIV - takes a lot of cycles
if you are dividing by a constant value, use the Multiply-to-Divide technique

POPFD, STD, CLD, SAHF - for some reason, these instructions are surprisingly slow
notice that some EFlags operations are ok, like PUSHFD, CLC, STC, CMC, LAHF

CBW, CWDE, CDQ, CWD are not as fast as they ought to be (not nearly as bad as the EFlags instructions, though)
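A sketch of the multiply-to-divide technique for unsigned division by 10, assuming a MASM32 SDK console build; DivBy10 is a hypothetical name. The constant 0CCCCCCCDh is the ceiling of 2^35 / 10, and for every 32-bit unsigned x, (x * 0CCCCCCCDh) >> 35 equals x / 10. MUL leaves the 64-bit product in EDX:EAX, so shifting EDX right by 3 completes the shift by 35. Other divisors need their own constant and shift count.

```asm
; Sketch: unsigned EAX / 10 without DIV (MASM32 SDK assumed).
include \masm32\include\masm32rt.inc

.code

; DivBy10 (hypothetical): in EAX = x (unsigned), out EAX = x / 10.
; Trashes EDX. 0CCCCCCCDh = ceil(2^35 / 10).
DivBy10 proc
    mov edx, 0CCCCCCCDh
    mul edx                     ; EDX:EAX = x * 0CCCCCCCDh
    mov eax, edx                ; high dword = product >> 32
    shr eax, 3                  ; total shift of 35
    ret
DivBy10 endp

start:
    mov eax, 12345
    call DivBy10
    print str$(eax), 13, 10     ; 1234
    mov eax, 0FFFFFFFFh
    call DivBy10
    print str$(eax), 13, 10     ; 429496729
    exit
end start
```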

CCurl

Thanks. To check a reg and branch if it is zero, is it better to use


     test   eax, eax
     jz     isZero

or

     cmp   eax, 0
     je    isZero

Or is there another better way?
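A sketch comparing the two forms, assuming a MASM32 SDK console build. Both set ZF when EAX is zero and leave EAX unchanged; TEST EAX,EAX is the more common idiom and is usually one byte shorter (85 C0 versus 83 F8 00 when the assembler picks the sign-extended 8-bit immediate form of CMP). EAX is cleared again before the CMP because the print macro may change it.

```asm
; Sketch: two ways to branch when EAX is zero (MASM32 SDK assumed).
include \masm32\include\masm32rt.inc

.code
start:
    xor eax, eax

    test eax, eax               ; 85 C0: ZF = 1 if EAX == 0
    jz isZero1
    print "not zero", 13, 10
    jmp next1
isZero1:
    print "zero via test", 13, 10
next1:

    xor eax, eax                ; print may have changed EAX
    cmp eax, 0                  ; 83 F8 00: ZF = 1 if EAX == 0
    je isZero2
    print "not zero", 13, 10
    jmp done
isZero2:
    print "zero via cmp", 13, 10
done:
    exit
end start
```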