Memory management from a performance standpoint

Started by AKRichard, July 23, 2012, 10:47:23 AM


AKRichard

Hello all,

  I posted a question a month or so ago about getting memory management working properly in my library (a big-number library). I have that working great now without any memory leaks; however, when comparing the performance of my code to Microsoft's, I find my functions take 3 times longer for small numbers (small here meaning fewer than 6400 bits), while once variables grow larger than about 6400 bits, Microsoft's code takes approximately 50% longer than mine. A few questions:

1.  How can I find where my bottleneck is on the smaller values when comparing to Microsoft's library?
2.  My memory management scheme simply entails calling malloc for all temporary arrays and the return array, but I've been reading that many iterations with small values can eat up performance when using malloc. Should I be attempting to write my own memory allocation scheme?

thanks in advance

Antariy

What if you try to use SysAllocStringByteLen as a malloc replacement?


malloc_new proc cbAlloc:DWORD
    ; NULL source string: just allocate cbAlloc bytes (plus a trailing null)
    invoke SysAllocStringByteLen, 0, cbAlloc
    ret                     ; pointer is returned in eax
malloc_new endp
    ; note: memory obtained this way is released with SysFreeString, not free()


Isn't it a bit faster?

MichaelW

Have you tried allocating the temporary arrays from the stack?
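For illustration, a rough C sketch of the stack idea (the library itself is C++/CLI plus assembly, and the _alloca name is MSVC-specific, so this is only the shape of it):

#include <stddef.h>
#include <malloc.h>     /* _alloca on MSVC */

/* Take a temporary limb array from the current stack frame instead of the
   heap. It disappears automatically on return, but a large request can
   overflow the stack, so the size needs an upper bound. */
static void example_temp_on_stack(size_t limbs)
{
    unsigned long long *tmp = (unsigned long long *)_alloca(limbs * sizeof(*tmp));
    /* ... use tmp as scratch space for one operation ... */
    (void)tmp;
}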

hutch--

It's usually the case that if you know how many allocations you need to make AND you know the size of each (so you can get a total), then a single large allocation split into parts will be significantly faster than a separate allocation for each item. Depending on the count, even pre-allocating "slots" that may be bigger than required is a viable option; you simply re-use them by overwriting the old values.

The logic here is simple: even though modern OS design produces reasonably fast memory allocation, more allocations are slower, whereas if you allocate once and control the split-up yourself, you reduce your overhead by a significant amount.
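As a rough C illustration of that idea (the slot size, count and names are invented, not from anyone's actual code):

#include <stdlib.h>

#define SLOT_SIZE  16384     /* big enough for the largest temporary */
#define SLOT_COUNT 32        /* known number of allocations needed   */

static char *pool;           /* the single large allocation          */

/* one malloc up front instead of one call per temporary */
static int pool_init(void)
{
    pool = (char *)malloc((size_t)SLOT_SIZE * SLOT_COUNT);
    return pool != NULL;
}

/* an "allocation" is now just a fixed offset into the big block */
static void *pool_slot(int i)
{
    return pool + (size_t)i * SLOT_SIZE;
}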

AKRichard

Quote from: Antariy on July 23, 2012, 12:19:32 PM
What if you try to use SysAllocStringByteLen as a malloc replacement?


Isn't it a bit faster?

To be honest, I hadn't heard of that function until now (I just looked it up on MSDN). I just started learning C++/CLI and only learned enough native C++ to be able to implement the math in assembly. From the MSDN page at http://msdn.microsoft.com/en-us/library/ms891286.aspx :

"len [in] Number of bytes to copy from psz. A null character is placed afterwards, allocating a total of len+1 bytes.
Allocates a new string of len bytes, copies len bytes from the passed string into it, and then appends a null character. Valid only for 32-bit systems."

I have all this implemented in 64-bit mode; I'm not sure why it's only valid for 32-bit systems.

Quote from: MichaelW on July 23, 2012, 01:17:02 PM
Have you tried allocating the temporary arrays from the stack?

  I considered this; however, some of the algorithms call others and I never know how large a chunk I will need. It might be viable if there were a way to specify the stack size at runtime, but it also seems like it would easily introduce stack corruption, since return addresses and the like would end up dumped in between the variables of different calls to other algorithms.

Quote from: hutch-- on July 23, 2012, 03:29:38 PM
It's usually the case that if you know how many allocations you need to make AND you know the size of each (so you can get a total), then a single large allocation split into parts will be significantly faster than a separate allocation for each item. Depending on the count, even pre-allocating "slots" that may be bigger than required is a viable option; you simply re-use them by overwriting the old values.

The logic here is simple: even though modern OS design produces reasonably fast memory allocation, more allocations are slower, whereas if you allocate once and control the split-up yourself, you reduce your overhead by a significant amount.

  This is what I am understanding from my reading, but I am not even sure where to begin. I would need a way to mark chunks as in use or free, and I would probably need some type of compaction scheme, because I can see how it would quickly become fragmented. For example, modular exponentiation calls the multiply and the modular reduction algorithms as many times as there are bits in the exponent, plus once per one-bit in the exponent. The multiply routine only creates a return value; the modular reduction creates 2 temporaries plus the return value (plus possibly 2 more temporaries). The temporaries are no big issue since they are released before leaving the algorithm, but the return values are used by the calling algorithm and not released until a later call has completed (e.g. the return from the multiply isn't released until the modular reduction is completed).
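A minimal C sketch of the in-use/free bookkeeping described above, assuming fixed-size slots so that no compaction is needed (all names and sizes are illustrative):

#include <stdlib.h>

#define NSLOTS     64
#define SLOT_BYTES 8192

static char *slot_pool;            /* one big block, allocated once      */
static char  slot_used[NSLOTS];    /* 1 = handed out, 0 = free           */

static int slots_init(void)
{
    slot_pool = (char *)malloc((size_t)NSLOTS * SLOT_BYTES);
    return slot_pool != NULL;
}

static void *slot_get(void)
{
    for (int i = 0; i < NSLOTS; i++)
        if (!slot_used[i]) {
            slot_used[i] = 1;
            return slot_pool + (size_t)i * SLOT_BYTES;
        }
    return NULL;                   /* pool exhausted; caller must handle */
}

static void slot_release(void *p)
{
    slot_used[((char *)p - slot_pool) / SLOT_BYTES] = 0;
}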

  I know the heading of this thread is about memory management, but I am only guessing that's where the differences in running time come from. I am not sure how to profile my algorithms in order to pinpoint why their algorithms are 3 times faster than mine when there are fewer than 100 elements in the arrays, while mine are faster when there are more than 100 elements.

jj2007

Allocating little chunks of memory is, as Hutch pointed out above, slow. Actually, it's incredibly slow - the OS needs thousands of cycles to give you a handful of bytes. Therefore, the best option is to get one fat allocation and to use that as a circular buffer (MasmBasic uses this technique).

For example, if you know that in the course of one call to your routines you expect to need 20 times 16 kBytes, that makes it 20*16=320 kBytes, well below the cache size of 1 or 2 MB.

You then need a 20-dword array named SlotManager that keeps the temporary pointers:
SlotManager dd 20 dup(?)
The process is as follows:
- request e.g. 10 kB from the circular buffer
- get back a pointer somewhere in the buffer
- by adding the requested size to that pointer, you get the next pointer
- store it in SlotManager[4*CurrentSlot]
- increase CurrentSlot
- if it is higher than MaxSlots, reset it to zero

etc. A full example is attached (the scheme is sketched in C after the timings below); it is about a factor of 50 faster than GlobalAlloc:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
ticks circular buffer (ms):     47
ticks circular buffer (ms):     46
ticks circular buffer (ms):     31
ticks circular buffer (ms):     47
ticks circular buffer (ms):     32
ticks circular buffer (ms):     47
ticks circular buffer (ms):     31
ticks circular buffer (ms):     31
ticks circular buffer (ms):     46
ticks circular buffer (ms):     47
ticks GlobalAlloc (ms):         2184
ticks HeapAlloc (ms):           2013
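For illustration only, here is a C sketch of the circular-buffer/SlotManager scheme described above (this is not the attached example; the names, sizes and the assumption that no single request exceeds the buffer are all added here):

#include <stdlib.h>

#define BUF_SIZE   (320 * 1024)   /* 20 * 16 kB, as in the example above  */
#define MAX_SLOTS  20

static char  *cb_buffer;                /* the one fat allocation          */
static size_t cb_offset;                /* next free byte in the buffer    */
static void  *cb_slots[MAX_SLOTS];      /* SlotManager: recycled pointers  */
static int    cb_current;               /* CurrentSlot                     */

static int cb_init(void)
{
    cb_buffer  = (char *)malloc(BUF_SIZE);
    cb_offset  = 0;
    cb_current = 0;
    return cb_buffer != NULL;
}

static void *cb_alloc(size_t bytes)     /* assumes bytes <= BUF_SIZE       */
{
    /* wrap to the start of the buffer when the request no longer fits */
    if (cb_offset + bytes > BUF_SIZE)
        cb_offset = 0;

    void *p = cb_buffer + cb_offset;
    cb_offset += bytes;

    /* remember the pointer round-robin, as SlotManager does above */
    cb_slots[cb_current] = p;
    if (++cb_current >= MAX_SLOTS)
        cb_current = 0;

    return p;   /* nothing to free; the space is simply reused later */
}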

mywan

That certainly makes me rethink some things  :icon_cool:

hutch--

If I have the description right, it's only on small arrays with a low element count that you have a speed problem, and if you allocated the largest slot size for your maximum member count it would still not be a very big allocation, probably just a few megabytes. If that is the case, you can simply allocate a single large block, then map a set of pointers into it in the order and sizes you want, allow for alignment to optimise memory access speed in and out, and you will probably get decent speed gains by doing so.
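A hedged C sketch of laying aligned pointers into one block as suggested (the sizes and the align16 helper are invented for illustration):

#include <stdint.h>
#include <stdlib.h>

/* round an address up to the next 16-byte boundary */
static char *align16(char *p)
{
    return (char *)(((uintptr_t)p + 15) & ~(uintptr_t)15);
}

static void layout_example(void)
{
    char *block = (char *)malloc(1 << 20);   /* one large allocation */
    if (!block)
        return;

    char *a = align16(block);                /* first array          */
    char *b = align16(a + 100000);           /* second array         */
    char *c = align16(b + 250000);           /* third array          */

    /* ... use a, b and c as separate, aligned work areas ... */
    (void)c;
    free(block);                             /* one free at the end  */
}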

AKRichard

Quote from: hutch-- on July 24, 2012, 12:51:20 AM
If I have the description right, it's only on small arrays with a low element count that you have a speed problem, and if you allocated the largest slot size for your maximum member count it would still not be a very big allocation, probably just a few megabytes. If that is the case, you can simply allocate a single large block, then map a set of pointers into it in the order and sizes you want, allow for alignment to optimise memory access speed in and out, and you will probably get decent speed gains by doing so.

Theoretically there would be no maximum size to the arrays. In practice, using managed C++, I am limited by the fact that the managed runtime expects an int for the index, which allows some 1x10^9 elements (more than I have ever used). In all my use I have never needed more than a few million bits, and rarely even a few hundred thousand bits in length.

  Anyways, I am in fact working on what you suggested. In my initial run it did make the small arrays faster (my division and modular reduction algorithms now take around 1/3 longer than Microsoft's), but at the expense of the lead I had when the arrays grow large (those are now only running slightly faster than Microsoft's).

  On my first run I allocated 1048576 bytes divided into 16-byte chunks to get these results, but if I just left it crunching numbers it would eventually error out (usually after a few million calculations of random sizes). That version left no room for allocating more space, so now I am working on keeping 2 arrays of pointers (one to keep track of pages in use, the other to keep the pointers to the bases of those pages) so that it can grow dynamically, but the code is already looking bloated to me.

  jj2007, I tried the circular buffer idea first because it sounded simple to implement (I like simple), and it worked great in my test program, which generates random numbers to use in the algorithms, checks the answers for accuracy, then discards the results; it gave good, consistent performance. But it became problematic when the lifetime of the return values was uncertain. The only way I managed to get it to work outside of the test program was to make it local to the algorithm, which then went through the whole initialization on every call. I couldn't figure out how to make it a static (is static a valid term in assembly?) set of methods for all the algorithms to use.

  Anyways, from what I am reading on memory management, there is no single best way to implement it. It looks like I am in for a lot of trial and error to figure it out (which means I'll be back with more questions eventually, lol).

  If my thinking is wrong in this post, please let me know; I am fumbling around while learning. I do appreciate the feedback and the ideas. Thanks.

jj2007

Quote from: AKRichard on July 24, 2012, 03:58:04 PM
jj2007, I tried the circular buffer idea first because it sounded simple to implement (I like simple), and it worked great in my test program, which generates random numbers to use in the algorithms, checks the answers for accuracy, then discards the results; it gave good, consistent performance. But it became problematic when the lifetime of the return values was uncertain.

That should be no problem if the sum of all "active" chunks does not exceed total buffer size. Have you ever monitored your allocations, i.e. written to a file when and where you allocated how many bytes?
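A hypothetical sketch of such monitoring in C (the file name and the traced_malloc wrapper are made up; the same idea wraps around any allocator):

#include <stdio.h>
#include <stdlib.h>

static FILE *alloc_log;

/* log when, where and how many bytes were handed out */
static void *traced_malloc(size_t bytes, const char *where)
{
    if (!alloc_log)
        alloc_log = fopen("alloc.log", "w");

    void *p = malloc(bytes);
    if (alloc_log)
        fprintf(alloc_log, "%s: %zu bytes -> %p\n", where, bytes, p);
    return p;
}

Calling traced_malloc(n, "modular reduction") instead of malloc(n) then shows exactly which routine asks for what, and when.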

AKRichard

Quote from: jj2007 on July 24, 2012, 05:01:06 PM
That should be no problem if the sum of all "active" chunks does not exceed total buffer size. Have you ever monitored your allocations, i.e. written to a file when and where you allocated how many bytes?

I haven't done that for the memory management, or from assembly for that matter. I started working on figuring out how to get assembly to write to a file, but ended up doing it from within C++/CLI instead. I have stepped through the algorithm, though. I didn't follow your idea exactly: instead of calculating how much memory was needed for each call, I allocated more than ANY call would need. The problem that crept up was that if a value was not released before the slot manager circled back around to it, the manager handed out a block that was technically still being used by another variable, or, once I inserted code to check for this, it stopped the algorithm in its tracks (I couldn't figure out how to recover from that condition). Of course the problem disappeared if I didn't allocate the return value from the slot manager, but then I also lost more of the gain than I had hoped by allocating the return value with a call to malloc.

  The problem did not crop up when I made it local to the algorithm; I was just trying to get away from having to initialize a slot manager for every call in every algorithm.

jj2007

You need to find a balance between speed and complexity. The circular buffer works just fine for many small allocations, and is a lot faster than HeapAlloc.

Try to distinguish
- parts of your code that need many small allocations ---> circular buffer
- parts of your code that need a few big allocations ---> HeapAlloc or VirtualAlloc

Your OS allows you to allocate a 400 MB circular buffer, and you can specify 20,000 slots if that is needed. The speed advantage will stay.

Gunther

Jochen,

Quote from: jj2007 on July 24, 2012, 05:54:47 PM
Try to distinguish
- parts of your code that need many small allocations ---> circular buffer
- parts of your code that need a few big allocations ---> HeapAlloc or VirtualAlloc

Your OS allows you to allocate a 400 MB circular buffer, and you can specify 20,000 slots if that is needed. The speed advantage will stay.

It seems to me that this is the right strategy. Solid work.

Gunther

jj2007


Gunther

Never mind; you're welcome Jochen.  ;)

Gunther