I get strange results for this benchmark comparing various allocation methods. Sometimes GlobalAlloc is simply too fast, although it does not fail. Any ideas why??
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
309 cycles per kByte for HeapAlloc 00001000h bytes (4 kB)
1565 cycles per kByte for VirtualAlloc 00001000h bytes (4 kB)
162 cycles per kByte for GlobalAlloc 00001000h bytes (4 kB)
182 cycles per kByte for HeapAlloc 00004000h bytes (16 kB)
866 cycles per kByte for VirtualAlloc 00004000h bytes (16 kB)
41 cycles per kByte for GlobalAlloc 00004000h bytes (16 kB)
1705 cycles per kByte for HeapAlloc 00010000h bytes (64 kB)
683 cycles per kByte for VirtualAlloc 00010000h bytes (64 kB)
214 cycles per kByte for GlobalAlloc 00010000h bytes (64 kB)
188 cycles per kByte for HeapAlloc 00040000h bytes (256 kB)
608 cycles per kByte for VirtualAlloc 00040000h bytes (256 kB)
7 cycles per kByte for GlobalAlloc 00040000h bytes (256 kB)
1112 cycles per kByte for HeapAlloc 00100000h bytes (1 MB)
1116 cycles per kByte for VirtualAlloc 00100000h bytes (1 MB)
1113 cycles per kByte for GlobalAlloc 00100000h bytes (1 MB)
606 cycles per kByte for HeapAlloc 00400000h bytes (4 MB)
604 cycles per kByte for VirtualAlloc 00400000h bytes (4 MB)
603 cycles per kByte for GlobalAlloc 00400000h bytes (4 MB)
Memory was touched
(to assemble the source, extract all files to a folder, then drag HeapVsVirtualJJ.asm over TinyIDE.exe and hit F6)
GlobalAlloc uses the default process heap (probably... or maybe its own heap), which usually has the size of 1 MB allocated at the time of process start (/HEAP switch of the linker). So, when you're allocating small pieces, it "just grabs" first suitable block from those 1 MB, and returns it without a need in "physical" heap (re)allocation.
Results:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
879 cycles per kByte for HeapAlloc 00001000h bytes (4 kB)
4478 cycles per kByte for VirtualAlloc 00001000h bytes (4 kB)
565 cycles per kByte for GlobalAlloc 00001000h bytes (4 kB)
494 cycles per kByte for HeapAlloc 00004000h bytes (16 kB)
2954 cycles per kByte for VirtualAlloc 00004000h bytes (16 kB)
145 cycles per kByte for GlobalAlloc 00004000h bytes (16 kB)
5771 cycles per kByte for HeapAlloc 00010000h bytes (64 kB)
2951 cycles per kByte for VirtualAlloc 00010000h bytes (64 kB)
726 cycles per kByte for GlobalAlloc 00010000h bytes (64 kB)
1102 cycles per kByte for HeapAlloc 00040000h bytes (256 kB)
3142 cycles per kByte for VirtualAlloc 00040000h bytes (256 kB)
18 cycles per kByte for GlobalAlloc 00040000h bytes (256 kB)
3066 cycles per kByte for HeapAlloc 00100000h bytes (1 MB)
3199 cycles per kByte for VirtualAlloc 00100000h bytes (1 MB)
3110 cycles per kByte for GlobalAlloc 00100000h bytes (1 MB)
3067 cycles per kByte for HeapAlloc 00400000h bytes (4 MB)
3045 cycles per kByte for VirtualAlloc 00400000h bytes (4 MB)
3036 cycles per kByte for GlobalAlloc 00400000h bytes (4 MB)
Memory was touched
Edit: the stated above is truely right only for heap functions at all. VirtualAlloc is slow because it needs to reserve/commit the memory, while heap functions just getting suitable piece from entire block - reallocation occured only when the block size is not enough for the piece requested.
As for difference between Global/HeapAlloc, below is the results when code changed to use GMEM_ZEROINIT when calling GlobalAlloc:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
636 cycles per kByte for HeapAlloc 00001000h bytes (4 kB)
4639 cycles per kByte for VirtualAlloc 00001000h bytes (4 kB)
762 cycles per kByte for GlobalAlloc 00001000h bytes (4 kB)
443 cycles per kByte for HeapAlloc 00004000h bytes (16 kB)
2599 cycles per kByte for VirtualAlloc 00004000h bytes (16 kB)
475 cycles per kByte for GlobalAlloc 00004000h bytes (16 kB)
5648 cycles per kByte for HeapAlloc 00010000h bytes (64 kB)
2818 cycles per kByte for VirtualAlloc 00010000h bytes (64 kB)
1748 cycles per kByte for GlobalAlloc 00010000h bytes (64 kB)
1007 cycles per kByte for HeapAlloc 00040000h bytes (256 kB)
3049 cycles per kByte for VirtualAlloc 00040000h bytes (256 kB)
965 cycles per kByte for GlobalAlloc 00040000h bytes (256 kB)
3088 cycles per kByte for HeapAlloc 00100000h bytes (1 MB)
3090 cycles per kByte for VirtualAlloc 00100000h bytes (1 MB)
3025 cycles per kByte for GlobalAlloc 00100000h bytes (1 MB)
3193 cycles per kByte for HeapAlloc 00400000h bytes (4 MB)
3056 cycles per kByte for VirtualAlloc 00400000h bytes (4 MB)
2950 cycles per kByte for GlobalAlloc 00400000h bytes (4 MB)
Memory was touched
Could be, Alex, but why does it need 726 cycles for the 64k block and only 18 for 256k? That doesn't look plausible...
I've changed the post above - new results :biggrin: (arrgh... these boring tags!).
Maybe 64k block is the fist big enough block, which causes heap subsystem to "physically" commit the pages into memory (they was just reserved before)?
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
250 cycles per kByte for HeapAlloc 00001000h bytes (4 kB)
1653 cycles per kByte for VirtualAlloc 00001000h bytes (4 kB)
152 cycles per kByte for GlobalAlloc 00001000h bytes (4 kB)
160 cycles per kByte for HeapAlloc 00004000h bytes (16 kB)
825 cycles per kByte for VirtualAlloc 00004000h bytes (16 kB)
38 cycles per kByte for GlobalAlloc 00004000h bytes (16 kB)
1232 cycles per kByte for HeapAlloc 00010000h bytes (64 kB)
629 cycles per kByte for VirtualAlloc 00010000h bytes (64 kB)
188 cycles per kByte for GlobalAlloc 00010000h bytes (64 kB)
202 cycles per kByte for HeapAlloc 00040000h bytes (256 kB)
1065 cycles per kByte for VirtualAlloc 00040000h bytes (256 kB)
6 cycles per kByte for GlobalAlloc 00040000h bytes (256 kB)
856 cycles per kByte for HeapAlloc 00100000h bytes (1 MB)
570 cycles per kByte for VirtualAlloc 00100000h bytes (1 MB)
699 cycles per kByte for GlobalAlloc 00100000h bytes (1 MB)
693 cycles per kByte for HeapAlloc 00400000h bytes (4 MB)
536 cycles per kByte for VirtualAlloc 00400000h bytes (4 MB)
536 cycles per kByte for GlobalAlloc 00400000h bytes (4 MB)
Memory was touched
Looks like it is: changed NumBytes=NumBytes*4 to "*2", results:
Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
633 cycles per kByte for HeapAlloc 00001000h bytes (4 kB)
4290 cycles per kByte for VirtualAlloc 00001000h bytes (4 kB)
756 cycles per kByte for GlobalAlloc 00001000h bytes (4 kB)
497 cycles per kByte for HeapAlloc 00002000h bytes (8 kB)
3002 cycles per kByte for VirtualAlloc 00002000h bytes (8 kB)
575 cycles per kByte for GlobalAlloc 00002000h bytes (8 kB)
687 cycles per kByte for HeapAlloc 00004000h bytes (16 kB)
2484 cycles per kByte for VirtualAlloc 00004000h bytes (16 kB)
475 cycles per kByte for GlobalAlloc 00004000h bytes (16 kB)
578 cycles per kByte for HeapAlloc 00008000h bytes (32 kB)
2644 cycles per kByte for VirtualAlloc 00008000h bytes (32 kB)
434 cycles per kByte for GlobalAlloc 00008000h bytes (32 kB)
5605 cycles per kByte for HeapAlloc 00010000h bytes (64 kB)
2852 cycles per kByte for VirtualAlloc 00010000h bytes (64 kB)
1682 cycles per kByte for GlobalAlloc 00010000h bytes (64 kB)
642 cycles per kByte for HeapAlloc 00020000h bytes (128 kB)
3052 cycles per kByte for VirtualAlloc 00020000h bytes (128 kB)
693 cycles per kByte for GlobalAlloc 00020000h bytes (128 kB)
Memory was touched
Note: 64 KB block is also allocated slower than 128 KB block - it seem that it is just the big enough to cause the reallocation/commiting of the heap to bigger size (it is unlikely that it is the constant value), and next 128 KB block is again just grabbed from it without big delays. Note also that I've modified the code to use GMEM_ZEROINIT with GlobalAlloc.
Jochen, which are your results with these modifications:
invoke GlobalAlloc, GMEM_FIXED or GMEM_ZEROINIT, NumBytes
NumBytes=NumBytes*2
Changed program attached.
It is hard to predict what is the reason for the timings difference in each particular test. If try to do more complex tests with the different sizes and different memory conditions of the real running program, the timing results will be much more various. Because different scenarios are possible, for instance:
1. Heap is not big enough to provide the block requested => reallocation (if possible).
2. Heap is big enough, but the heap manager needs some time to find suitable block size in the entire heap block, what requiring different time depending on the heap allocation layout.
3. Heap is big enough overall (in the total sum of the free pieces), but highly fragmented, and thus need to be reallocated anyway to provide the monolithic block requested.
4. Other thread was eventually locked the heap to access it.
5. ...
Alex,
Thanks a lot. Here are my results:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
312 cycles per kByte for HeapAlloc 00001000h bytes (4 kB)
1564 cycles per kByte for VirtualAlloc 00001000h bytes (4 kB)
326 cycles per kByte for GlobalAlloc 00001000h bytes (4 kB)
217 cycles per kByte for HeapAlloc 00002000h bytes (8 kB)
1097 cycles per kByte for VirtualAlloc 00002000h bytes (8 kB)
242 cycles per kByte for GlobalAlloc 00002000h bytes (8 kB)
180 cycles per kByte for HeapAlloc 00004000h bytes (16 kB)
871 cycles per kByte for VirtualAlloc 00004000h bytes (16 kB)
188 cycles per kByte for GlobalAlloc 00004000h bytes (16 kB)
182 cycles per kByte for HeapAlloc 00008000h bytes (32 kB)
751 cycles per kByte for VirtualAlloc 00008000h bytes (32 kB)
182 cycles per kByte for GlobalAlloc 00008000h bytes (32 kB)
1679 cycles per kByte for HeapAlloc 00010000h bytes (64 kB)
672 cycles per kByte for VirtualAlloc 00010000h bytes (64 kB)
601 cycles per kByte for GlobalAlloc 00010000h bytes (64 kB)
190 cycles per kByte for HeapAlloc 00020000h bytes (128 kB)
621 cycles per kByte for VirtualAlloc 00020000h bytes (128 kB)
191 cycles per kByte for GlobalAlloc 00020000h bytes (128 kB)