The MASM Forum

General => The Campus => Topic started by: AKRichard on July 23, 2012, 10:47:23 AM

Title: Memory management from a performance standpoint
Post by: AKRichard on July 23, 2012, 10:47:23 AM
Hello all,

  I posted a question a month or so ago about getting memory management working properly in my library (a big-number library). It is working great now with no memory leaks. However, when comparing the performance of my code to Microsoft's, I find my functions take 3 times longer for small numbers (small here meaning less than 6400 bits), but once the variables grow larger than about 6400 bits, Microsoft's code takes approximately 50% longer than mine. A few questions:

1.  How can I find where my bottleneck is on the smaller values when comparing to Microsoft's library?
2.  My memory management scheme simply calls malloc for every temporary array and for the return array, but I've been reading that many iterations with small values can eat up performance when using malloc. Should I be attempting to write my own memory allocation scheme?

thanks in advance
Title: Re: Memory management from a performance standpoint
Post by: Antariy on July 23, 2012, 12:19:32 PM
What if you try to use SysAllocStringByteLen as a malloc replacement?


malloc_new proc cbAlloc:DWORD
    ; SysAllocStringByteLen with a NULL psz just reserves a buffer of
    ; cbAlloc bytes (plus a terminating null); the pointer is returned in eax
    invoke SysAllocStringByteLen, 0, cbAlloc
    ret
malloc_new endp


Isn't it a bit faster?
Title: Re: Memory management from a performance standpoint
Post by: MichaelW on July 23, 2012, 01:17:02 PM
Have you tried allocating the temporary arrays from the stack?
Title: Re: Memory management from a performance standpoint
Post by: hutch-- on July 23, 2012, 03:29:38 PM
It's usually the case that if you know how many allocations you need to make AND you know the size of each to get a total, then a single large allocation split into parts will be significantly faster than an allocation for each item. Depending on the count, even pre-allocating "slots" that may be bigger than required is a viable option; you simply re-use them by overwriting the old values.

The logic here is simple: even though modern OS design produces reasonably fast memory allocation, more calls are slower, whereas if you allocate once and control the split-up yourself, you reduce your overhead by a significant amount.
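
For illustration, a minimal C sketch of that idea - one up-front allocation carved into fixed slots that get re-used by overwriting; the slot count, slot size, and the pool_* names are illustrative, not taken from any posted code:

#include <stdlib.h>

/* One up-front allocation carved into fixed slots that are simply re-used
   by overwriting; no further malloc/free per operation.                   */
#define SLOT_COUNT 20
#define SLOT_BYTES (16 * 1024)

static unsigned char *g_pool;
static unsigned char *g_slot[SLOT_COUNT];

int pool_init(void)                      /* one allocation for everything   */
{
    g_pool = malloc((size_t)SLOT_COUNT * SLOT_BYTES);
    if (g_pool == NULL)
        return -1;
    for (int i = 0; i < SLOT_COUNT; i++)
        g_slot[i] = g_pool + (size_t)i * SLOT_BYTES;
    return 0;
}

unsigned char *pool_slot(int i)          /* hand out slot i, no allocation  */
{
    return g_slot[i];
}

void pool_shutdown(void)                 /* one free for everything         */
{
    free(g_pool);
    g_pool = NULL;
}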
Title: Re: Memory management from a performance standpoint
Post by: AKRichard on July 23, 2012, 04:16:41 PM
Quote from: Antariy on July 23, 2012, 12:19:32 PM
What if you try to use SysAllocStringByteLen as a malloc replacement?


Isn't it a bit faster?

To be honest, I hadn't heard of that function until now (I just looked it up on MSDN). I just started learning C++/CLI and only learned enough native C++ to be able to implement the math in assembly. From the MSDN page at http://msdn.microsoft.com/en-us/library/ms891286.aspx :

"len [in] Number of bytes to copy from psz. A null character is placed afterwards, allocating a total of len+1 bytes.
Allocates a new string of len bytes, copies len bytes from the passed string into it, and then appends a null character. Valid only for 32-bit systems."

I have all this implemented in 64-bit mode; I'm not sure why it's only valid for 32-bit systems.

Quote from: MichaelW on July 23, 2012, 01:17:02 PM
Have you tried allocating the temporary arrays from the stack?

  I considered this; however, some of the algorithms call others and I never know how large a chunk I will need. It might be viable if there were a way to specify the stack size at runtime, but it also seems it would easily introduce stack corruption, since return addresses and such would be dumped in between the variables of the different calls to other algorithms.
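
(For what it's worth, in native C the usual way to take call-local temporaries from the stack is MSVC's _alloca: the memory lives inside the current frame, so deeper calls push their return addresses below it rather than through it, but it cannot outlive the call and big requests risk overflowing the stack. A hedged sketch, with an illustrative helper name:)

#include <malloc.h>   /* _alloca on MSVC */
#include <string.h>

/* Hypothetical helper: a temporary limb array taken from the stack.
   Suitable only for data that dies before this function returns.    */
void work_on_copy(const unsigned long long *src, size_t limbs)
{
    unsigned long long *tmp = _alloca(limbs * sizeof *tmp);
    memcpy(tmp, src, limbs * sizeof *tmp);
    /* ... operate on tmp; nothing to free, it vanishes on return ... */
}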

Quote from: hutch-- on July 23, 2012, 03:29:38 PM
It's usually the case that if you know how many allocations you need to make AND you know the size of each to get a total, then a single large allocation split into parts will be significantly faster than an allocation for each item. Depending on the count, even pre-allocating "slots" that may be bigger than required is a viable option; you simply re-use them by overwriting the old values.

The logic here is simple: even though modern OS design produces reasonably fast memory allocation, more calls are slower, whereas if you allocate once and control the split-up yourself, you reduce your overhead by a significant amount.

  This is what I am understanding from my reading, but I am not even sure where to begin. I would need a way to mark chunks as in use or free, and I would probably need some type of compaction scheme, as I can see how it would quickly become fragmented. For example, modular exponentiation calls the multiply and modular reduction algorithms as many times as there are bits in the exponent, plus once more for each one bit in the exponent. The multiply routine only creates a return value; the modular reduction creates 2 temporaries plus the return value (plus possibly 2 more temporaries). The temporaries are no big issue since they are released before leaving the algorithm, but the return values are used by the calling algorithm and are not released until after the next call has completed (the return from the multiply isn't released until the modular reduction is completed).

  I know the heading of this thread is about memory management, but I am only guessing that's where the differences in running time are coming from. I am not sure how to profile my algorithms in order to pinpoint why theirs are 3 times faster than mine when there are fewer than 100 elements in the arrays, but mine are faster when there are more than 100 elements.
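
One common way to mark equal-sized chunks as in use or free, without any compaction, is a free list threaded through the free chunks themselves. A rough C sketch, purely illustrative (chunk size, count, and names are made up; fixed-size chunks sidestep fragmentation at the cost of wasting space on small values):

#include <stdlib.h>

/* A fixed pool of equal-sized chunks with a singly linked free list. */
#define CHUNK_BYTES 1024
#define CHUNK_COUNT 256

typedef union chunk {
    union chunk *next;                 /* valid while the chunk is free   */
    unsigned char data[CHUNK_BYTES];   /* valid while the chunk is in use */
} chunk_t;

static chunk_t  g_chunks[CHUNK_COUNT];
static chunk_t *g_free_head;

void chunks_init(void)                 /* thread all chunks onto the list */
{
    for (int i = 0; i < CHUNK_COUNT - 1; i++)
        g_chunks[i].next = &g_chunks[i + 1];
    g_chunks[CHUNK_COUNT - 1].next = NULL;
    g_free_head = &g_chunks[0];
}

void *chunk_alloc(void)                /* O(1): pop the free-list head    */
{
    chunk_t *c = g_free_head;
    if (c != NULL)
        g_free_head = c->next;
    return c;                          /* NULL means the pool is exhausted */
}

void chunk_free(void *p)               /* O(1): push back onto the list   */
{
    chunk_t *c = p;
    c->next = g_free_head;
    g_free_head = c;
}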
Title: Re: Memory management from a performance standpoint
Post by: jj2007 on July 23, 2012, 06:49:49 PM
Allocating little chunks of memory is, as Hutch pointed out above, slow. Actually, it's incredibly slow - the OS needs thousands of cycles to give you a handful of bytes. Therefore, the best option is to get one fat allocation and to use that as a circular buffer (MasmBasic uses this technique).

For example, if you know that in the course of one call to your routines you expect to need 20 times 16 kBytes, that makes it 20*16=320 kBytes, well below the cache size of 1 or 2 MB.

You then need one 20*4 array named SlotManager that keeps the temporary pointers. The process is as follows:
SlotManager dd 20 dup(?)
- request a chunk (say 10 KB) from the circular buffer
- you get a pointer somewhere in the buffer
- by adding the requested size to the pointer, you get the next pointer
- store that in SlotManager[4*CurrentSlot]
- increase CurrentSlot
- if it is higher than MaxSlots, reset it to zero

etc. - full example attached (a C sketch of the same scheme follows the timings below). It is about a factor of 50 faster than GlobalAlloc:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
ticks circular buffer (ms):     47
ticks circular buffer (ms):     46
ticks circular buffer (ms):     31
ticks circular buffer (ms):     47
ticks circular buffer (ms):     32
ticks circular buffer (ms):     47
ticks circular buffer (ms):     31
ticks circular buffer (ms):     31
ticks circular buffer (ms):     46
ticks circular buffer (ms):     47
ticks GlobalAlloc (ms):         2184
ticks HeapAlloc (ms):           2013
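
For readers without the attachment, here is a rough C rendering of the scheme described above; the buffer size, slot count, and function names are illustrative (SlotManager and CurrentSlot mirror the names in the description), and no attempt is made to detect still-active chunks being overwritten:

#include <stddef.h>
#include <stdlib.h>

/* One fat allocation used as a circular buffer: each request bumps a
   position, wraps when it runs out of room, and records the handed-out
   pointer in the slot array; nothing is ever freed per call.           */
#define BUF_BYTES (320 * 1024)
#define MAX_SLOTS 20

static unsigned char *g_buf;
static size_t g_pos;                      /* next free offset in the buffer */
static unsigned char *SlotManager[MAX_SLOTS];
static int CurrentSlot;

int ring_init(void)
{
    g_buf = malloc(BUF_BYTES);
    return g_buf ? 0 : -1;
}

void *ring_alloc(size_t bytes)
{
    if (bytes > BUF_BYTES)
        return NULL;
    if (g_pos + bytes > BUF_BYTES)        /* no room left at the end: wrap  */
        g_pos = 0;
    unsigned char *p = g_buf + g_pos;
    g_pos += bytes;                       /* adding the size gives the next pointer */
    SlotManager[CurrentSlot] = p;         /* remember the pointer handed out */
    if (++CurrentSlot >= MAX_SLOTS)       /* past the last slot: reset       */
        CurrentSlot = 0;
    return p;
}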
Title: Re: Memory management from a performance standpoint
Post by: mywan on July 23, 2012, 09:56:21 PM
That certainly makes me rethink some things  :icon_cool:
Title: Re: Memory management from a performance standpoint
Post by: hutch-- on July 24, 2012, 12:51:20 AM
If I have the description right, it's only on small arrays of low count that you have a speed problem, and if you allocated the largest slot size for your maximum member count it would still not be a very big allocation, probably just a few megabytes. If that is the case, you can simply allocate a single large block, then map a set of pointers into it in the order and at the sizes you want, allow for alignment to optimise memory access speed in and out, and you will probably get decent speed gains by doing so.
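
Rounding each slot size up to an alignment boundary while carving the block costs almost nothing; a small C sketch of that rounding (the 16-byte boundary and the function names are illustrative, and the block itself is assumed to start on an aligned address):

#include <stddef.h>

/* Round a size up to the next multiple of align (a power of two), so that
   every slot carved out of one big block starts on an aligned boundary.  */
size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Example: carve three differently sized slots out of an aligned block,
   each slot starting on a 16-byte boundary.                              */
void carve(unsigned char *block, const size_t sizes[3], void *out[3])
{
    size_t off = 0;
    for (int i = 0; i < 3; i++) {
        out[i] = block + off;
        off += round_up(sizes[i], 16);
    }
}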
Title: Re: Memory management from a performance standpoint
Post by: AKRichard on July 24, 2012, 03:58:04 PM
Quote from: hutch-- on July 24, 2012, 12:51:20 AM
If I have the description right, it's only on small arrays of low count that you have a speed problem, and if you allocated the largest slot size for your maximum member count it would still not be a very big allocation, probably just a few megabytes. If that is the case, you can simply allocate a single large block, then map a set of pointers into it in the order and at the sizes you want, allow for alignment to optimise memory access speed in and out, and you will probably get decent speed gains by doing so.

Theoretically there would be no maximum size to the arrays; in practice, using managed C++, I am limited by the fact that the managed runtime expects an int for the index, which allows on the order of 10^9 elements (more than I have ever used). In all my use I have never gone past a few million bits, and rarely even reach a few hundred thousand bits in length.

  Anyway, I am in fact working on what you suggested. In my initial run it did make the small arrays faster (I am around 1/3 longer running time in my division and modular reduction algorithms now), but at the expense of the lead I had when the arrays grew large (those are now just running slightly faster than Microsoft's).

  On my first run I allocated 1048576 bytes divided into 16-byte chunks to get these results, but if I just left it crunching numbers it would eventually error out (usually after a few million calculations of random sizes). That version left no room for allocating more space, so now I am working on keeping 2 arrays of pointers (one to track which pages are in use, the other to hold the pointers to the base of those pages) so that it can grow dynamically, but the code is already looking bloated to me.

  jj2007, I tried the circular buffer idea first because it sounded simple to implement (I like simple), and it worked great in my test program, where it generates random numbers to use in the algorithms, checks the answers for accuracy, then discards the results; it gave good, consistent performance. But it became problematic when the lifetime of the return values was uncertain. The only way I managed to get it to work outside of the test program was to make it local to the algorithm, which then went through the whole initialization on every call. I couldn't figure out how to make it a static (is static a valid term in assembly?) set of methods for all the algorithms to use.

  Anyway, from what I am reading on memory management, there is no single best way to implement it; it looks like I am in for a lot of trial and error to figure it out (which means I'll be back with more questions eventually, lol).

  If my thinking on this is wrong, please let me know - I am fumbling around while learning. I do appreciate the feedback and the ideas. Thanks.
Title: Re: Memory management from a performance standpoint
Post by: jj2007 on July 24, 2012, 05:01:06 PM
Quote from: AKRichard on July 24, 2012, 03:58:04 PM
jj2007, I tried the circular buffer idea first because it sounded simple to implement (I like simple), and it worked great in my test program, where it generates random numbers to use in the algorithms, checks the answers for accuracy, then discards the results; it gave good, consistent performance. But it became problematic when the lifetime of the return values was uncertain.

That should be no problem if the sum of all "active" chunks does not exceed total buffer size. Have you ever monitored your allocations, i.e. written to a file when and where you allocated how many bytes?
Title: Re: Memory management from a performance standpoint
Post by: AKRichard on July 24, 2012, 05:40:15 PM
Quote from: jj2007 on July 24, 2012, 05:01:06 PM
That should be no problem if the sum of all "active" chunks does not exceed total buffer size. Have you ever monitored your allocations, i.e. written to a file when and where you allocated how many bytes?

I haven't done that for the memory management, or from assembly for that matter. I started working on figuring out how to get assembly to write to a file, but did it from within C++/CLI instead. I have stepped through the algorithm, though. I didn't follow your idea exactly: instead of calculating how much memory was needed for each call, I allocated more than ANY call would need. The problem that crept up was that if a value was not released before the slot manager circled back to it, the manager was handing out a pointer that was technically still being used by another variable - or, once I inserted code to check for this, the algorithm stopped in its tracks (I couldn't figure out how to recover from that condition). Of course the problem disappeared if I didn't allocate the return value from the slot manager, but then I also lost more of the gain than I had hoped when I allocated the return value with a call to malloc.

  The problem did not crop up when I made it local to the algorithm; I was just trying to get away from having to initialize a slot manager for every call in every algorithm.
Title: Re: Memory management from a performance standpoint
Post by: jj2007 on July 24, 2012, 05:54:47 PM
You need to find a balance between speed and complexity. The circular buffer works just fine for many small allocations, and is a lot faster than HeapAlloc.

Try to distinguish
- parts of your code that need many small allocations ---> circular buffer
- parts of your code that need a few big allocations ---> HeapAlloc or VirtualAlloc

Your OS will let you allocate a 400 MB circular buffer, and you can specify 20,000 slots if that is needed. The speed advantage will stay.
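
Read as code, that split might look roughly like this C sketch: a pre-allocated ring for the many small requests and the process heap for the occasional big ones. The 64 KB cut-off, the ring size, and the num_* names are all illustrative:

#include <windows.h>

/* Many small, short-lived requests come from a pre-allocated circular
   buffer; the occasional big or long-lived block comes from the heap. */
#define SMALL_LIMIT (64 * 1024)
#define RING_BYTES  (4 * 1024 * 1024)

static unsigned char g_ring[RING_BYTES];
static size_t g_ring_pos;

static void *ring_alloc(size_t bytes)     /* bump-and-wrap, never freed */
{
    if (g_ring_pos + bytes > RING_BYTES)
        g_ring_pos = 0;
    void *p = &g_ring[g_ring_pos];
    g_ring_pos += bytes;
    return p;
}

void *num_alloc(size_t bytes)
{
    if (bytes <= SMALL_LIMIT)
        return ring_alloc(bytes);
    return HeapAlloc(GetProcessHeap(), 0, bytes);
}

void num_free(void *p, size_t bytes)      /* caller passes the size back */
{
    if (bytes > SMALL_LIMIT)              /* ring chunks need no freeing */
        HeapFree(GetProcessHeap(), 0, p);
}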
Title: Re: Memory management from a performance standpoint
Post by: Gunther on July 24, 2012, 09:31:00 PM
Jochen,

Quote from: jj2007 on July 24, 2012, 05:54:47 PM
Try to distinguish
- parts of your code that need many small allocations ---> circular buffer
- parts of your code that need a few big allocations ---> HeapAlloc or VirtualAlloc

Your OS will let you allocate a 400 MB circular buffer, and you can specify 20,000 slots if that is needed. The speed advantage will stay.

It seems to me that this is the right strategy. Solid work.

Gunther
Title: Re: Memory management from a performance standpoint
Post by: jj2007 on July 24, 2012, 11:51:22 PM
Quote from: Gunther on July 24, 2012, 09:31:00 PM
It seems to me that this is the right strategy. Solid work.

Grazie :redface:
Title: Re: Memory management from a performance standpoint
Post by: Gunther on July 25, 2012, 01:19:35 AM
Never mind; you're welcome Jochen.  ;)

Gunther
Title: Re: Memory management from a performance standpoint
Post by: Greenhorn on July 25, 2012, 04:12:01 AM
Yes, very interesting approach, Jochen.  :t


Greenhorn
Title: Re: Memory management from a performance standpoint
Post by: AKRichard on July 26, 2012, 08:41:25 PM
Quote from: jj2007 on July 24, 2012, 05:54:47 PM
You need to find a balance between speed and complexity. The circular buffer works just fine for many small allocations, and is a lot faster than HeapAlloc.

Try to distinguish
- parts of your code that need many small allocations ---> circular buffer
- parts of your code that need a few big allocations ---> HeapAlloc or VirtualAlloc

Your OS will let you allocate a 400 MB circular buffer, and you can specify 20,000 slots if that is needed. The speed advantage will stay.

  I do have the slot manager working, though the return values are still being handled with malloc, and I am finally getting comparable speeds. I have just a few more questions, then I'll let this thread drop.

1.  VirtualAlloc reserves a range of virtual address space and then maps physical memory to it? If that is correct, wouldn't that be slower than malloc?
2.  What's the difference between malloc and HeapAlloc?

  Sorry if some of the questions seem pretty basic, but I really haven't messed with the native side of C++ a whole lot.
Title: Re: Memory management from a performance standpoint
Post by: npnw on July 27, 2012, 03:06:09 PM
VirtualAlloc
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx

malloc
http://msdn.microsoft.com/en-us/library/aa246461(v=vs.60).aspx

malloc returns a void pointer to the allocated space, or NULL if there is insufficient memory available. To return a pointer to a type other than void, use a type cast on the return value. The storage space pointed to by the return value is guaranteed to be suitably aligned for storage of any type of object. If size is 0, malloc allocates a zero-length item in the heap and returns a valid pointer to that item. Always check the return from malloc, even if the amount of memory requested is small.

HeapAlloc

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366597(v=vs.85).aspx

There may be some differences depending on the target platform for development and on the environment (Windows XP, Vista, 7, or 8) - I'm not sure; these pages were specific to their environment.
From memory, it doesn't look like the functions have changed for a long time.

Hope this helps.
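
For a side-by-side feel of the two calls (in current Microsoft CRTs malloc ends up drawing from a Win32 heap internally, though the exact path depends on the CRT version):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* CRT allocation: portable, no heap handle to manage, freed with free() */
    void *a = malloc(1000);

    /* Win32 allocation: explicit heap handle and flags, freed with HeapFree();
       HEAP_ZERO_MEMORY is optional (malloc never zeroes, calloc does)        */
    void *b = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, 1000);

    printf("malloc    -> %p\n", a);
    printf("HeapAlloc -> %p\n", b);

    free(a);
    HeapFree(GetProcessHeap(), 0, b);
    return 0;
}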
Title: Re: Memory management from a performance standpoint
Post by: MichaelW on July 27, 2012, 03:21:27 PM
The "storage space pointed to by the return value is guaranteed to be suitably aligned for storage of any type of object" statement is far out of date.
Title: Re: Memory management from a performance standpoint
Post by: npnw on July 27, 2012, 05:05:18 PM
Quote from: MichaelW on July 27, 2012, 03:21:27 PM
The "storage space pointed to by the return value is guaranteed to be suitably aligned for storage of any type of object" statement is far out of date.

I didn't catch the last part of that when I posted it. My fault.  It does remind me of early Windows programming.
Have to find a more recent explanation of this in MSDN.

Title: Re: Memory management from a performance standpoint
Post by: npnw on July 27, 2012, 06:01:52 PM
Here is the 2012 Visual Studio RC version of malloc.
http://msdn.microsoft.com/en-us/library/6ewkz86d(v=vs.110).aspx
Title: Re: Memory management from a performance standpoint
Post by: jj2007 on July 27, 2012, 06:10:03 PM
VS 2012: "The storage space pointed to by the return value is guaranteed to be suitably aligned for storage of any type of object."

Since M$ won't tell you, you'll probably have to launch the debugger to find out that it's VirtualAlloc under the hood, i.e. "align 4096" - simple but a bit slow for small allocations.
Title: Re: Memory management from a performance standpoint
Post by: npnw on July 28, 2012, 02:57:42 AM
jj,

Wasn't 4096 the page size? That way, if it was paged, it wouldn't cause a fault?
Title: Re: Memory management from a performance standpoint
Post by: jj2007 on July 28, 2012, 03:26:29 AM
Well, that arrogant phrase "guaranteed to be suitably aligned for storage of any type of object" suggests they are using a full page. Post an exe with an int 3 inserted before the "malloc", and we'll see if it's VirtualAlloc...
Title: Re: Memory management from a performance standpoint
Post by: npnw on July 28, 2012, 09:03:57 AM
Quote from: jj2007 on July 28, 2012, 03:26:29 AM
Well, that arrogant phrase "guaranteed to be suitably aligned for storage of any type of object" suggests they are using a full page. Post an exe with an int 3 inserted before the "malloc", and we'll see if it's VirtualAlloc...

LOL!

A man after my own heart. 

The INT 3 instruction is defined for use by debuggers to temporarily replace an instruction in a running program, in order to set a breakpoint. Other INT instructions are encoded using two bytes. This makes them unsuitable for use in patching instructions (which can be one byte long). (see SIGTRAP)
The opcode for INT 3 is 0xCC, as opposed to the opcode for INT imm8, which is 0xCD followed by the immediate byte. According to Intel documentation, "Intel and Microsoft assemblers will not generate the CD03 opcode from any mnemonic", and 0xCC has some special features that are not shared by "the normal 2-byte opcode for INT 3 (CD03)" [IA-32 Arch. Software Developer's Manual, Vol. 2A].
Title: Re: Memory management from a performance standpoint
Post by: jj2007 on July 28, 2012, 09:19:34 AM
lol ;-)

I am not a C coder, but if I remember correctly, you can insert something like _asm int 3; in C code, which would trigger the Just-In-Time debugger - Olly in my case.

Cheers, Jochen
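
A tiny illustration of that trick; on 64-bit MSVC inline _asm is not available, so the __debugbreak intrinsic is the usual stand-in for int 3:

#include <intrin.h>   /* __debugbreak on MSVC */
#include <stdlib.h>

int main(void)
{
    __debugbreak();           /* same idea as _asm int 3: break into the
                                 attached (or just-in-time) debugger here */
    void *p = malloc(4096);   /* step into this call to see which Win32
                                 allocator the CRT ends up using          */
    free(p);
    return 0;
}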
Title: Re: Memory management from a performance standpoint
Post by: MichaelW on July 28, 2012, 11:28:56 AM
Under Windows 2000 and XP the minimum alignment for VirtualAlloc appears to be 65536, or 16 pages.
Title: Re: Memory management from a performance standpoint
Post by: AKRichard on August 06, 2012, 08:05:14 PM
Lake Tahoe is great for taking your mind off things.  I hated to leave.

  Anyway, I have a modified version of the slot manager that can grow as needed; it's not truly circular anymore, but it works well even with the return values now. I'm not getting results as fast with my code as you did with yours, but it is running stably in single-threaded mode (I have a glitch somewhere that only shows up in multithreaded apps, which I may be posting about in another thread if I don't figure it out soon).

npnw, thanks for the links. I had read the one on malloc, but not VirtualAlloc or HeapAlloc. Kind of off topic, but is VirtualAllocEx how they inject code (for creating a hook into one of the Windows APIs)?

  Thanks everyone for the help.
Title: Re: Memory management from a performance standpoint
Post by: merlin on August 17, 2012, 01:57:10 AM
Check this link.

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366887%28v=vs.85%29.aspx

"VirtualAlloc cannot reserve a reserved page. It can commit a page that is already committed. This means you can commit a range of pages, regardless of whether they have already been committed, and the function will not fail."

MEM_LARGE_PAGES
0x20000000

Allocates memory using large page support.
The size and alignment must be a multiple of the large-page minimum.
To obtain this value, use the GetLargePageMinimum function.

You could just use VirtualAlloc together with GetLargePageMinimum. I don't know how efficient that would be.
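
A hedged sketch of that approach; note that MEM_LARGE_PAGES also requires the caller to hold the "Lock pages in memory" privilege, which this snippet does not set up, and the 4 MB request size is illustrative:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Large-page allocations must be a multiple of the large-page minimum,
       and the caller needs the "Lock pages in memory" privilege.          */
    SIZE_T large = GetLargePageMinimum();
    if (large == 0) {
        printf("Large pages are not supported on this system.\n");
        return 1;
    }

    SIZE_T want = 4 * 1024 * 1024;                    /* illustrative size */
    SIZE_T size = (want + large - 1) & ~(large - 1);  /* round up          */

    void *p = VirtualAlloc(NULL, size,
                           MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                           PAGE_READWRITE);
    if (p == NULL) {
        printf("VirtualAlloc failed, error %lu\n", GetLastError());
        return 1;
    }

    printf("Got %Iu bytes at %p using large pages.\n", size, p);
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}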



Title: Re: Memory management from a performance standpoint
Post by: kinjo on January 16, 2017, 07:15:43 PM
I am having a MEMORY_MANAGEMENT problem on my PC.
Title: Re: Memory management from a performance standpoint
Post by: kinjo on January 17, 2017, 07:38:07 PM
Here's the mail I got recently about my problem:
The MEMORY_MANAGEMENT error caused the blue screen and in turn the computer shut itself down to prevent further damage to your computer which in turn caused this error to happen. As the log says "This error could be caused if the system stopped responding, crashed, or lost power unexpectedly" proves that the sole reason of this error was from the MEMORY_MANAGEMENT error and is not the MEMORY_MANAGEMENT error log itself. This error is not as severe as the MEMORY_MANAGEMENT error and you shouldn't need to fix it either.
How To Fix MEMORY_MANAGEMENT – 0x0000001A? (http://www.deskdecode.com/memory_management-0x0000001a/)
Title: Re: Memory management from a performance standpoint
Post by: hutch-- on January 17, 2017, 08:29:54 PM
kinjo,

This forum is for assembler language programmers; it is not a help desk. You need to look for a site that is geared to helping you, which you can find with a Google search.