News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Speed test HeapAlloc vs VirtualAlloc vs GlobalAlloc

Started by jj2007, May 22, 2012, 09:45:43 AM

Previous topic - Next topic

jj2007

I get strange results for this benchmark comparing various allocation methods. Sometimes GlobalAlloc is simply too fast, although it does not fail. Any ideas why??

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
309     cycles per kByte for HeapAlloc       00001000h bytes (4 kB)
1565    cycles per kByte for VirtualAlloc    00001000h bytes (4 kB)
162     cycles per kByte for GlobalAlloc     00001000h bytes (4 kB)

182     cycles per kByte for HeapAlloc       00004000h bytes (16 kB)
866     cycles per kByte for VirtualAlloc    00004000h bytes (16 kB)
41      cycles per kByte for GlobalAlloc     00004000h bytes (16 kB)

1705    cycles per kByte for HeapAlloc       00010000h bytes (64 kB)
683     cycles per kByte for VirtualAlloc    00010000h bytes (64 kB)
214     cycles per kByte for GlobalAlloc     00010000h bytes (64 kB)

188     cycles per kByte for HeapAlloc       00040000h bytes (256 kB)
608     cycles per kByte for VirtualAlloc    00040000h bytes (256 kB)
7       cycles per kByte for GlobalAlloc     00040000h bytes (256 kB)

1112    cycles per kByte for HeapAlloc       00100000h bytes (1 MB)
1116    cycles per kByte for VirtualAlloc    00100000h bytes (1 MB)
1113    cycles per kByte for GlobalAlloc     00100000h bytes (1 MB)

606     cycles per kByte for HeapAlloc       00400000h bytes (4 MB)
604     cycles per kByte for VirtualAlloc    00400000h bytes (4 MB)
603     cycles per kByte for GlobalAlloc     00400000h bytes (4 MB)

Memory was touched


(to assemble the source, extract all files to a folder, then drag HeapVsVirtualJJ.asm over TinyIDE.exe and hit F6)

Antariy

GlobalAlloc uses the default process heap (probably... or maybe its own heap), which usually has the size of 1 MB allocated at the time of process start (/HEAP switch of the linker). So, when you're allocating small pieces, it "just grabs" first suitable block from those 1 MB, and returns it without a need in "physical" heap (re)allocation.

Results:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
879     cycles per kByte for HeapAlloc       00001000h bytes (4 kB)
4478    cycles per kByte for VirtualAlloc    00001000h bytes (4 kB)
565     cycles per kByte for GlobalAlloc     00001000h bytes (4 kB)

494     cycles per kByte for HeapAlloc       00004000h bytes (16 kB)
2954    cycles per kByte for VirtualAlloc    00004000h bytes (16 kB)
145     cycles per kByte for GlobalAlloc     00004000h bytes (16 kB)

5771    cycles per kByte for HeapAlloc       00010000h bytes (64 kB)
2951    cycles per kByte for VirtualAlloc    00010000h bytes (64 kB)
726     cycles per kByte for GlobalAlloc     00010000h bytes (64 kB)

1102    cycles per kByte for HeapAlloc       00040000h bytes (256 kB)
3142    cycles per kByte for VirtualAlloc    00040000h bytes (256 kB)
18      cycles per kByte for GlobalAlloc     00040000h bytes (256 kB)

3066    cycles per kByte for HeapAlloc       00100000h bytes (1 MB)
3199    cycles per kByte for VirtualAlloc    00100000h bytes (1 MB)
3110    cycles per kByte for GlobalAlloc     00100000h bytes (1 MB)

3067    cycles per kByte for HeapAlloc       00400000h bytes (4 MB)
3045    cycles per kByte for VirtualAlloc    00400000h bytes (4 MB)
3036    cycles per kByte for GlobalAlloc     00400000h bytes (4 MB)

Memory was touched


Edit: the stated above is truely right only for heap functions at all. VirtualAlloc is slow because it needs to reserve/commit the memory, while heap functions just getting suitable piece from entire block - reallocation occured only when the block size is not enough for the piece requested.
As for difference between Global/HeapAlloc, below is the results when code changed to use GMEM_ZEROINIT when calling GlobalAlloc:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
636     cycles per kByte for HeapAlloc       00001000h bytes (4 kB)
4639    cycles per kByte for VirtualAlloc    00001000h bytes (4 kB)
762     cycles per kByte for GlobalAlloc     00001000h bytes (4 kB)

443     cycles per kByte for HeapAlloc       00004000h bytes (16 kB)
2599    cycles per kByte for VirtualAlloc    00004000h bytes (16 kB)
475     cycles per kByte for GlobalAlloc     00004000h bytes (16 kB)

5648    cycles per kByte for HeapAlloc       00010000h bytes (64 kB)
2818    cycles per kByte for VirtualAlloc    00010000h bytes (64 kB)
1748    cycles per kByte for GlobalAlloc     00010000h bytes (64 kB)

1007    cycles per kByte for HeapAlloc       00040000h bytes (256 kB)
3049    cycles per kByte for VirtualAlloc    00040000h bytes (256 kB)
965     cycles per kByte for GlobalAlloc     00040000h bytes (256 kB)

3088    cycles per kByte for HeapAlloc       00100000h bytes (1 MB)
3090    cycles per kByte for VirtualAlloc    00100000h bytes (1 MB)
3025    cycles per kByte for GlobalAlloc     00100000h bytes (1 MB)

3193    cycles per kByte for HeapAlloc       00400000h bytes (4 MB)
3056    cycles per kByte for VirtualAlloc    00400000h bytes (4 MB)
2950    cycles per kByte for GlobalAlloc     00400000h bytes (4 MB)

Memory was touched

jj2007

Could be, Alex, but why does it need 726 cycles for the 64k block and only 18 for 256k? That doesn't look plausible...

Antariy

I've changed the post above - new results :biggrin: (arrgh... these boring tags!).

Maybe 64k block is the fist big enough block, which causes heap subsystem to "physically" commit the pages into memory (they was just reserved before)?

hutch--



Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
250     cycles per kByte for HeapAlloc       00001000h bytes (4 kB)
1653    cycles per kByte for VirtualAlloc    00001000h bytes (4 kB)
152     cycles per kByte for GlobalAlloc     00001000h bytes (4 kB)

160     cycles per kByte for HeapAlloc       00004000h bytes (16 kB)
825     cycles per kByte for VirtualAlloc    00004000h bytes (16 kB)
38      cycles per kByte for GlobalAlloc     00004000h bytes (16 kB)

1232    cycles per kByte for HeapAlloc       00010000h bytes (64 kB)
629     cycles per kByte for VirtualAlloc    00010000h bytes (64 kB)
188     cycles per kByte for GlobalAlloc     00010000h bytes (64 kB)

202     cycles per kByte for HeapAlloc       00040000h bytes (256 kB)
1065    cycles per kByte for VirtualAlloc    00040000h bytes (256 kB)
6       cycles per kByte for GlobalAlloc     00040000h bytes (256 kB)

856     cycles per kByte for HeapAlloc       00100000h bytes (1 MB)
570     cycles per kByte for VirtualAlloc    00100000h bytes (1 MB)
699     cycles per kByte for GlobalAlloc     00100000h bytes (1 MB)

693     cycles per kByte for HeapAlloc       00400000h bytes (4 MB)
536     cycles per kByte for VirtualAlloc    00400000h bytes (4 MB)
536     cycles per kByte for GlobalAlloc     00400000h bytes (4 MB)

Memory was touched

Antariy

Looks like it is: changed NumBytes=NumBytes*4 to "*2", results:


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
633     cycles per kByte for HeapAlloc       00001000h bytes (4 kB)
4290    cycles per kByte for VirtualAlloc    00001000h bytes (4 kB)
756     cycles per kByte for GlobalAlloc     00001000h bytes (4 kB)

497     cycles per kByte for HeapAlloc       00002000h bytes (8 kB)
3002    cycles per kByte for VirtualAlloc    00002000h bytes (8 kB)
575     cycles per kByte for GlobalAlloc     00002000h bytes (8 kB)

687     cycles per kByte for HeapAlloc       00004000h bytes (16 kB)
2484    cycles per kByte for VirtualAlloc    00004000h bytes (16 kB)
475     cycles per kByte for GlobalAlloc     00004000h bytes (16 kB)

578     cycles per kByte for HeapAlloc       00008000h bytes (32 kB)
2644    cycles per kByte for VirtualAlloc    00008000h bytes (32 kB)
434     cycles per kByte for GlobalAlloc     00008000h bytes (32 kB)

5605    cycles per kByte for HeapAlloc       00010000h bytes (64 kB)
2852    cycles per kByte for VirtualAlloc    00010000h bytes (64 kB)
1682    cycles per kByte for GlobalAlloc     00010000h bytes (64 kB)

642     cycles per kByte for HeapAlloc       00020000h bytes (128 kB)
3052    cycles per kByte for VirtualAlloc    00020000h bytes (128 kB)
693     cycles per kByte for GlobalAlloc     00020000h bytes (128 kB)

Memory was touched


Note: 64 KB block is also allocated slower than 128 KB block - it seem that it is just the big enough to cause the reallocation/commiting of the heap to bigger size (it is unlikely that it is the constant value), and next 128 KB block is again just grabbed from it without big delays. Note also that I've modified the code to use GMEM_ZEROINIT with GlobalAlloc.

Jochen, which are your results with these modifications:

invoke GlobalAlloc, GMEM_FIXED or GMEM_ZEROINIT, NumBytes

NumBytes=NumBytes*2

Changed program attached.

Antariy

It is hard to predict what is the reason for the timings difference in each particular test. If try to do more complex tests with the different sizes and different memory conditions of the real running program, the timing results will be much more various. Because different scenarios are possible, for instance:
1. Heap is not big enough to provide the block requested => reallocation (if possible).
2. Heap is big enough, but the heap manager needs some time to find suitable block size in the entire heap block, what requiring different time depending on the heap allocation layout.
3. Heap is big enough overall (in the total sum of the free pieces), but highly fragmented, and thus need to be reallocated anyway to provide the monolithic block requested.
4. Other thread was eventually locked the heap to access it.
5. ...

jj2007

Alex,
Thanks a lot. Here are my results:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
312     cycles per kByte for HeapAlloc       00001000h bytes (4 kB)
1564    cycles per kByte for VirtualAlloc    00001000h bytes (4 kB)
326     cycles per kByte for GlobalAlloc     00001000h bytes (4 kB)

217     cycles per kByte for HeapAlloc       00002000h bytes (8 kB)
1097    cycles per kByte for VirtualAlloc    00002000h bytes (8 kB)
242     cycles per kByte for GlobalAlloc     00002000h bytes (8 kB)

180     cycles per kByte for HeapAlloc       00004000h bytes (16 kB)
871     cycles per kByte for VirtualAlloc    00004000h bytes (16 kB)
188     cycles per kByte for GlobalAlloc     00004000h bytes (16 kB)

182     cycles per kByte for HeapAlloc       00008000h bytes (32 kB)
751     cycles per kByte for VirtualAlloc    00008000h bytes (32 kB)
182     cycles per kByte for GlobalAlloc     00008000h bytes (32 kB)

1679    cycles per kByte for HeapAlloc       00010000h bytes (64 kB)
672     cycles per kByte for VirtualAlloc    00010000h bytes (64 kB)
601     cycles per kByte for GlobalAlloc     00010000h bytes (64 kB)

190     cycles per kByte for HeapAlloc       00020000h bytes (128 kB)
621     cycles per kByte for VirtualAlloc    00020000h bytes (128 kB)
191     cycles per kByte for GlobalAlloc     00020000h bytes (128 kB)