Author Topic: Fast memory allocation  (Read 2647 times)

nidud

  • Member
  • *****
  • Posts: 1370
    • https://github.com/nidud/asmc
Fast memory allocation
« on: December 21, 2015, 11:35:32 PM »
As mention before using a stream for allocating memory may speed up the process by up to 100:1. The idea is to use a fixed size stream buffer for fast alloc. Requests that do not fit in this buffer are allocated using HeapAlloc(). Optionally these may be linked to the base pointer if external modules need this information.

When the total sum of blocks allocated is less then the stream buffer the result will be something like this:
Code: [Select]
    70170 cycles - code(317) 7: malloc - fixed
  2665902 cycles - code( 72) 4: HeapAlloc: hProcessHeap
  2697484 cycles - code( 53) 0: HeapAlloc
  2759501 cycles - code( 26) 1: GlobalAlloc
  2984940 cycles - code( 61) 3: crt_malloc
  6846653 cycles - code( 30) 5: HeapAlloc: HEAP_ZERO_MEMORY
  6953949 cycles - code( 54) 6: MASM32: alloc()
  7614503 cycles - code( 53) 2: VirtualAlloc

You see here how slow VirtualAlloc is on small blocks, in this case from 0.5 to 2M total size. However, the speed of VirtualAlloc increase with the total size:
Code: [Select]
3M -- 200000h
1568483   cycles -   2904K - 0: HeapAlloc
1813055   cycles -   2904K - 2: VirtualAlloc
4M -- 200000h
2042805   cycles -   3993K - 0: HeapAlloc
1811942   cycles -   3993K - 2: VirtualAlloc
8M -- 200000h
2844942   cycles -   7986K - 0: HeapAlloc
1858876   cycles -   7986K - 2: VirtualAlloc
16M -- 200000h
2925022   cycles -  15246K - 0: HeapAlloc
1920887   cycles -  15246K - 2: VirtualAlloc

When the total sum of blocks allocated is larger then the buffer the result will be something like this (2M):
Code: [Select]
2M -- 200000h
980645    cycles -   1996K - 0: HeapAlloc
19009     cycles -   1996K - 7: malloc - fixed
3M -- 200000h
1471771   cycles -   2904K - 0: HeapAlloc
826741    cycles -   2904K - 7: malloc - fixed
4M -- 200000h
1849849   cycles -   3993K - 0: HeapAlloc
1328642   cycles -   3993K - 7: malloc - fixed

So this strategy works best when the scratchpad is large enough for the total size and evens out if it tips over.

Another strategy is to continue to add fixed sized blocks dynamicly and unlink on delete. The result for small sizes:
Code: [Select]
    70363 cycles - code(325) 7: malloc - fixed
    93349 cycles - code(496) 9: malloc - dynamic
  2835179 cycles - code( 53) 0: HeapAlloc

The result for large sizes:
Code: [Select]
  6219507 cycles - code(496) 9: malloc - dynamic
  7343700 cycles - code(325) 7: malloc - fixed
  9147267 cycles - code( 53) 0: HeapAlloc

This is a relatively safe method where all allocated blocks are deleted after use. However the overhang of link/unlink removes most of the benefit speed-wise.

The third option which I ended up using is simply to add blocks when needed and skip the unlink bit. This will auto create a scratchpad according to usage and the function ReduceHeap() is added for special cases.

A special case will be when a few hundred thousand objects are temporary allocated increasing the heap by a few hundred Megs. In that case unlinking the chain after will be more productive than for each of the million calls to free().

result:
Code: [Select]
512K -- 200000h
274965    cycles -    499K - 0: HeapAlloc
16959     cycles -    499K - 7: malloc - fixed
16697     cycles -    499K - 8: malloc - auto
22166     cycles -    499K - 9: malloc - dynamic
2M -- 200000h
1241205   cycles -   1996K - 0: HeapAlloc
18601     cycles -   1996K - 7: malloc - fixed
20413     cycles -   1996K - 8: malloc - auto
24729     cycles -   1996K - 9: malloc - dynamic
3M -- 200000h
1849906   cycles -   2904K - 0: HeapAlloc
991063    cycles -   2904K - 7: malloc - fixed
42134     cycles -   2904K - 8: malloc - auto
1113107   cycles -   2904K - 9: malloc - dynamic
4M -- 200000h
2098923   cycles -   3993K - 0: HeapAlloc
1516105   cycles -   3993K - 7: malloc - fixed
30139     cycles -   3993K - 8: malloc - auto
1282315   cycles -   3993K - 9: malloc - dynamic
8M -- 200000h
2807686   cycles -   7986K - 0: HeapAlloc
2650790   cycles -   7986K - 7: malloc - fixed
63680     cycles -   7986K - 8: malloc - auto
1903944   cycles -   7986K - 9: malloc - dynamic

guga

  • Member
  • ****
  • Posts: 826
  • Assembly is a state of art.
    • RosAsm
Re: Fast memory allocation
« Reply #1 on: January 04, 2016, 03:13:51 AM »
my results

Code: [Select]
Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz (SSE4.2)
----------------------------------------------
512K -- 200000h
185290    cycles -    499K - 0: HeapAlloc
23885     cycles -    499K - 7: malloc - fixed
23607     cycles -    499K - 8: malloc - auto
17786     cycles -    499K - 9: malloc - dynamic
2M -- 200000h
1058865   cycles -   1996K - 0: HeapAlloc
24210     cycles -   1996K - 7: malloc - fixed
39535     cycles -   1996K - 8: malloc - auto
26909     cycles -   1996K - 9: malloc - dynamic
3M -- 200000h
1556853   cycles -   2904K - 0: HeapAlloc
1088794   cycles -   2904K - 7: malloc - fixed
50941     cycles -   2904K - 8: malloc - auto
694567    cycles -   2904K - 9: malloc - dynamic
4M -- 200000h
1624188   cycles -   3993K - 0: HeapAlloc
1235034   cycles -   3993K - 7: malloc - fixed
25434     cycles -   3993K - 8: malloc - auto
728995    cycles -   3993K - 9: malloc - dynamic
8M -- 200000h
2322818   cycles -   7986K - 0: HeapAlloc
1964005   cycles -   7986K - 7: malloc - fixed
71131     cycles -   7986K - 8: malloc - auto
1154075   cycles -   7986K - 9: malloc - dynamic

result:
   210648 cycles - code(544) 8: malloc - auto
  2622332 cycles - code(496) 9: malloc - dynamic
  4335928 cycles - code(325) 7: malloc - fixed
  6748014 cycles - code( 53) 0: HeapAlloc
hit any key to continue...


question. Can this new malloc (auto) be a replacement for virtualalloc ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

  • Member
  • *****
  • Posts: 7542
  • Assembler is fun ;-)
    • MasmBasic
Re: Fast memory allocation
« Reply #2 on: January 04, 2016, 03:19:58 AM »
Here are some more, but honestly, I have no idea how to interpret your numbers ::)

Can you give a verbose version of this?
   176684 cycles - code(544) 8: malloc - auto
  2541613 cycles - code(496) 9: malloc - dynamic



Code: [Select]
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (AVX)
----------------------------------------------
512K -- 200000h
173231    cycles -    499K - 0: HeapAlloc
15217     cycles -    499K - 7: malloc - fixed
14822     cycles -    499K - 8: malloc - auto
20625     cycles -    499K - 9: malloc - dynamic
2M -- 200000h
406836    cycles -   1996K - 0: HeapAlloc
16913     cycles -   1996K - 7: malloc - fixed
19218     cycles -   1996K - 8: malloc - auto
21011     cycles -   1996K - 9: malloc - dynamic
3M -- 200000h
729334    cycles -   2904K - 0: HeapAlloc
985867    cycles -   2904K - 7: malloc - fixed
50724     cycles -   2904K - 8: malloc - auto
735871    cycles -   2904K - 9: malloc - dynamic
4M -- 200000h
587725    cycles -   3993K - 0: HeapAlloc
1070638   cycles -   3993K - 7: malloc - fixed
29647     cycles -   3993K - 8: malloc - auto
711767    cycles -   3993K - 9: malloc - dynamic
8M -- 200000h
846687    cycles -   7986K - 0: HeapAlloc
1006395   cycles -   7986K - 7: malloc - fixed
62273     cycles -   7986K - 8: malloc - auto
1052339   cycles -   7986K - 9: malloc - dynamic

result:
   176684 cycles - code(544) 8: malloc - auto
  2541613 cycles - code(496) 9: malloc - dynamic
  2743813 cycles - code( 53) 0: HeapAlloc
  3095030 cycles - code(325) 7: malloc - fixed

nidud

  • Member
  • *****
  • Posts: 1370
    • https://github.com/nidud/asmc
Re: Fast memory allocation
« Reply #3 on: January 04, 2016, 03:56:50 AM »
Quote
question. Can this new malloc (auto) be a replacement for virtualalloc ?
Yes, by using VirtualAlloc instead of HeapAlloc.

Quote
Can you give a verbose version of this?
   176684 cycles - code(544) 8: malloc - auto
  2541613 cycles - code(496) 9: malloc – dynamic
The first number is the total time it takes to complete all tests. This second number is the code size of the algorithm tested. The third number is the index of each algorithm tested, which in this case referred to the file 8.asm and 9.asm.

guga

  • Member
  • ****
  • Posts: 826
  • Assembly is a state of art.
    • RosAsm
Re: Fast memory allocation
« Reply #4 on: January 04, 2016, 04:18:18 AM »
Tks nidud,

Btw, i presume that your function uses a table of structures (M_BLOCK) to store the address of the memory to be allocated/deallocated, right ?

So, if it is a array of these structures, is there a limit of the number of elements ?  mean, what is the maximum amount of M_BLOCKS structures ? When this amount exceeds, another array is created to ?

I´m asking this, because, let´s say a program have 300 calls to this function  and the maximum amount of these calls are 100. The new memalloc will create the extra 200 group of M_BLOCKS so all 300 calls can work ? In other words, in case of exceeding is there a correspondence of realloc or a function that extendes the memory blocks ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

nidud

  • Member
  • *****
  • Posts: 1370
    • https://github.com/nidud/asmc
Re: Fast memory allocation
« Reply #5 on: January 04, 2016, 05:24:38 AM »
Btw, i presume that your function uses a table of structures (M_BLOCK) to store the address of the memory to be allocated/deallocated, right ?

No. There is only a header for each block. This has to be 16 byte to ensure 16-byte alignment of each block.

Quote
So, if it is a array of these structures, is there a limit of the number of elements ?  mean, what is the maximum amount of M_BLOCKS structures ? When this amount exceeds, another array is created to ?

I´m asking this, because, let´s say a program have 300 calls to this function  and the maximum amount of these calls are 100. The new memalloc will create the extra 200 group of M_BLOCKS so all 300 calls can work ?

You have to think of millions in this case, not hundreds, so there can't be any limits. For this reason the algorithm have to be as simple as possible.

Grincheux

  • Member
  • ***
  • Posts: 328
  • Never be pleased, Always improve
    • Asm for fun
Re: Fast memory allocation
« Reply #6 on: January 04, 2016, 08:58:44 AM »
Here are my results. What can you say in final? Which kind of memory allocation must we use? Is there an algo (HeapAlloc, malloc, VirtualAlloc...) for intel and an other to use and/or if user have <8Mb memory or > 16Mb. What about the size of the memory allocated ? Same question as above.

I expect your are patient, this a very hard analyze.

Quote
C:\Users\Grincheux\Downloads\malloc>timeit
AMD Athlon(tm) II X2 250 Processor (SSE3)
----------------------------------------------
512K -- 200000h
341027    cycles -    499K - 0: HeapAlloc
17903     cycles -    499K - 7: malloc - fixed
17255     cycles -    499K - 8: malloc - auto
23946     cycles -    499K - 9: malloc - dynamic
2M -- 200000h
659761    cycles -   1996K - 0: HeapAlloc
18834     cycles -   1996K - 7: malloc - fixed
22671     cycles -   1996K - 8: malloc - auto
25272     cycles -   1996K - 9: malloc - dynamic
3M -- 200000h
682142    cycles -   2904K - 0: HeapAlloc
765550    cycles -   2904K - 7: malloc - fixed
42529     cycles -   2904K - 8: malloc - auto
1188740   cycles -   2904K - 9: malloc - dynamic
4M -- 200000h
633392    cycles -   3993K - 0: HeapAlloc
908497    cycles -   3993K - 7: malloc - fixed
30416     cycles -   3993K - 8: malloc - auto
1344787   cycles -   3993K - 9: malloc - dynamic
8M -- 200000h
1255745   cycles -   7986K - 0: HeapAlloc
1048506   cycles -   7986K - 7: malloc - fixed
65802     cycles -   7986K - 8: malloc - auto
2025293   cycles -   7986K - 9: malloc - dynamic

result:
   178673 cycles - code(544) 8: malloc - auto
  2759290 cycles - code(325) 7: malloc - fixed
  3572067 cycles - code( 53) 0: HeapAlloc
  4608038 cycles - code(496) 9: malloc - dynamic
hit any key to continue...
C:\Users\Grincheux\Downloads\malloc>
Kenavo (Bye)
----------------------
Asm for Fun
My Links
"La garde meurt mais ne rend pas"
Cambronne à Waterloo

Grincheux

  • Member
  • ***
  • Posts: 328
  • Never be pleased, Always improve
    • Asm for fun
Re: Fast memory allocation
« Reply #7 on: January 04, 2016, 09:00:07 AM »
Here are my results. What can you say in final? Which kind of memory allocation must we use? Is there an algo (HeapAlloc, malloc, VirtualAlloc...) for intel and an other to use and/or if user have <8Gb memory or > 16Gb. What about the size of the memory allocated ? Same question as above.
Is there a possibility that the rams vendor affects the algo, or the quality ot this ram?

I expect your are patient, this a very hard analyze.

Quote
C:\Users\Grincheux\Downloads\malloc>timeit
AMD Athlon(tm) II X2 250 Processor (SSE3)
----------------------------------------------
512K -- 200000h
341027    cycles -    499K - 0: HeapAlloc
17903     cycles -    499K - 7: malloc - fixed
17255     cycles -    499K - 8: malloc - auto
23946     cycles -    499K - 9: malloc - dynamic
2M -- 200000h
659761    cycles -   1996K - 0: HeapAlloc
18834     cycles -   1996K - 7: malloc - fixed
22671     cycles -   1996K - 8: malloc - auto
25272     cycles -   1996K - 9: malloc - dynamic
3M -- 200000h
682142    cycles -   2904K - 0: HeapAlloc
765550    cycles -   2904K - 7: malloc - fixed
42529     cycles -   2904K - 8: malloc - auto
1188740   cycles -   2904K - 9: malloc - dynamic
4M -- 200000h
633392    cycles -   3993K - 0: HeapAlloc
908497    cycles -   3993K - 7: malloc - fixed
30416     cycles -   3993K - 8: malloc - auto
1344787   cycles -   3993K - 9: malloc - dynamic
8M -- 200000h
1255745   cycles -   7986K - 0: HeapAlloc
1048506   cycles -   7986K - 7: malloc - fixed
65802     cycles -   7986K - 8: malloc - auto
2025293   cycles -   7986K - 9: malloc - dynamic

result:
   178673 cycles - code(544) 8: malloc - auto
  2759290 cycles - code(325) 7: malloc - fixed
  3572067 cycles - code( 53) 0: HeapAlloc
  4608038 cycles - code(496) 9: malloc - dynamic
hit any key to continue...
C:\Users\Grincheux\Downloads\malloc>
Kenavo (Bye)
----------------------
Asm for Fun
My Links
"La garde meurt mais ne rend pas"
Cambronne à Waterloo

nidud

  • Member
  • *****
  • Posts: 1370
    • https://github.com/nidud/asmc
Re: Fast memory allocation
« Reply #8 on: January 04, 2016, 10:38:34 AM »
Quote
What can you say in final?

The conclusion(s) made is in the first post.

Quote
Which kind of memory allocation must we use?

The stack. This is the fastest and simplest way of allocating memory. The rest is optional but it is recommended that you use HeapAlloc() to allocate additional memory. If the memory block is to be shared with other applications (clipboard) you have to use GlobalAlloc().

Quote
Is there an algo (HeapAlloc, malloc, VirtualAlloc...) for intel and an other to use and/or if user have <8Gb memory or > 16Gb. What about the size of the memory allocated ? Same question as above.
Is there a possibility that the rams vendor affects the algo, or the quality ot this ram?

The OS will handle this so you don’t have to worry about that.

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4807
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Fast memory allocation
« Reply #9 on: January 04, 2016, 12:50:05 PM »
It would be worth adding to the list of allocation techniques the old API GlobalAlloc() with the GMEM_FIXED flag as it is a fair bit faster that the older style of GlobalLock() then GlobalAlloc(). Very few of the memory allocation strategies are well suited for large counts of small allocations, VirtualAlloc() is usually the slowest followed by HeapAlloc() depending on the flags used. A language like BASIC pays a penalty for its string allocation as it does a separate allocation for each string and a reallocation for each string change.

Depending on your needs, a single fixed size allocation that is addressed in slices with a pointer array is far faster if you can live with fixed sized blocks, it also has the additional advantage that you can clear it with a zero fill of the entire allocation while retaining the same set of pointers. If you need individually sized blocks that require to be resized on a needs basis, you are stuck with individual allocations for each item but you pay a speed penalty for doing so both in the allocation and freeing each block when no longer required.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

nidud

  • Member
  • *****
  • Posts: 1370
    • https://github.com/nidud/asmc
Re: Fast memory allocation
« Reply #10 on: October 27, 2016, 01:30:21 AM »
Here's the 64-bit version.

The bottle-neck include version for 32/64-bit is currently like this:
Code: [Select]
.xlist

__LIBC_INC EQU 1

IFDEF _WIN64
 IFDEF __PE__ ; For Asmc -pe
.x64
.MODEL FLAT, FASTCALL
 ENDIF
OPTION WIN64:3 ; Standard stack frame
ELSE
.486
.MODEL FLAT, STDCALL
 IFDEF __SSE__
.686
.XMM
 ENDIF
ENDIF
OPTION CASEMAP:NONE

TRUE EQU 1
FALSE EQU 0
NULL EQU 0

IFDEF _WIN64
_CDecl EQU <> ; fastcall
__WXP__ EQU 1 ; Windows XP++
ELSE
_CDecl EQU <C> ; c
ENDIF

IFDEF _WIN64
SIZE_T TYPEDEF QWORD
ELSE
SIZE_T TYPEDEF DWORD
ENDIF
size_t EQU <SIZE_T>

ifdef _UNICODE
TCHAR TYPEDEF SWORD
ELSE
TCHAR TYPEDEF SBYTE
ENDIF
LPSTR TYPEDEF PTR SBYTE
LPWSTR TYPEDEF PTR SWORD
LPTSTR TYPEDEF PTR TCHAR
SINT TYPEDEF SDWORD
UINT TYPEDEF DWORD
ULONG TYPEDEF DWORD
PVOID TYPEDEF PTR
HANDLE TYPEDEF PTR
LPWORD TYPEDEF PTR WORD
LPDWORD TYPEDEF PTR DWORD
LPQWORD TYPEDEF PTR QWORD

IFDEF DEBUG

.ASSERT:ON
assert_exit PROTO
ENDIF

.list

The _HEAP_GROWSIZE definition is now set to 64K which is the standard default size.
Code: [Select]
ifndef __LIBC_INC
 include libc.inc
endif
.xlist
;
; All allocations are of size n * _GRANULARITY
;
_GRANULARITY equ 0x10 ; align 16
_PAGESIZE_ equ 0x1000 ; one page
_SEGSIZE_ equ 0x10000 ; one segment (i.e., 64 Kb)
_HEAP_REGIONMAX equ 0x40 ; Max number of regions: 64
; For small memory systems:
_HEAP_REGIONSIZE_S equ 0x4000 ; Initial region size (16K)
_HEAP_MAXREGSIZE_S equ 0x1000000 ; Maximum region size (16M)
; For large memory systems:
_HEAP_REGIONSIZE_L equ 0x100000 ; Initial region size  (1M)
_HEAP_MAXREGSIZE_L equ 0x1000000 ; Maximum region size (16M)
_HEAP_GROWSIZE equ 0x10000 ; Default grow increment (64K)
_HEAP_GROWMIN equ _PAGESIZE_ ; Minimum grow inc (1 page)
_HEAP_GROWSTART equ _PAGESIZE_ ; Startup grow increment
_HEAP_COALESCE equ -1 ; Coalesce heap value
_HEAP_EMPTYLIST_SIZE equ (1 * _PAGESIZE_)

S_HEAP STRUC SIZE_T ; Memory Block Header: 16/32 byte
h_size SIZE_T ?
h_type BYTE ? ; 0 unused, 1 local, 2 global, 3 aligned
h_prev PVOID ? ; global block's (user size)
h_next PVOID ? ; local block's (chunck size)
S_HEAP ENDS

externdef _amblksiz:dword
externdef _heap_base:size_t
externdef _heap_free:size_t

malloc proto :size_t
realloc PROTO :ptr, :size_t
free proto :ptr
calloc proto :dword, :dword
_alloca64 proto :dword, :dword
ifdef _WIN64
alloca macro dwSize:req, ReservedStack:=<@ReservedStack>
exitm<_alloca64( dwSize, ReservedStack )>
endm
else
alloca proto :dword
endif
salloc proto :LPSTR
_aligned_malloc proto dwSize:size_t, alignment:UINT

__coreleft proto
__purgeheap proto

;;;;;;;;;;;;;;; kernel32

PAGE_NOACCESS equ 0x01
PAGE_READONLY equ 0x02
PAGE_READWRITE equ 0x04
PAGE_WRITECOPY equ 0x08
PAGE_EXECUTE equ 0x10
PAGE_EXECUTE_READ equ 0x20
PAGE_EXECUTE_READWRITE equ 0x40
PAGE_EXECUTE_WRITECOPY equ 0x80
PAGE_GUARD equ 0x0100
PAGE_NOCACHE equ 0x0200
PAGE_WRITECOMBINE equ 0x0400

MEM_COMMIT equ 0x1000
MEM_RESERVE equ 0x2000
MEM_DECOMMIT equ 0x4000
MEM_RELEASE equ 0x8000
MEM_FREE equ 0x00010000
MEM_PRIVATE equ 0x00020000
MEM_MAPPED equ 0x00040000
MEM_RESET equ 0x00080000
MEM_TOP_DOWN equ 0x00100000
MEM_4MB_PAGES equ 0x80000000
SEC_FILE equ 0x00800000
SEC_IMAGE equ 0x01000000
SEC_VLM equ 0x02000000
SEC_RESERVE equ 0x04000000
SEC_COMMIT equ 0x08000000
SEC_NOCACHE equ 0x10000000
MEM_IMAGE equ SEC_IMAGE

; Global Memory Flags

GMEM_FIXED equ 0x0000
GMEM_MOVEABLE equ 0x0002
GMEM_NOCOMPACT equ 0x0010
GMEM_NODISCARD equ 0x0020
GMEM_ZEROINIT equ 0x0040
GMEM_MODIFY equ 0x0080
GMEM_DISCARDABLE equ 0x0100
GMEM_NOT_BANKED equ 0x1000
GMEM_SHARE equ 0x2000
GMEM_DDESHARE equ 0x2000
GMEM_NOTIFY equ 0x4000
GMEM_LOWER equ GMEM_NOT_BANKED
GMEM_VALID_FLAGS equ 0x7F72
GMEM_INVALID_HANDLE equ 0x8000

;GHND equ GMEM_MOVEABLE or GMEM_ZEROINIT
;GPTR equ GMEM_FIXED or GMEM_ZEROINIT

; Flags returned by GlobalFlags (in addition to GMEM_DISCARDABLE)

GMEM_DISCARDED equ 0x4000
GMEM_LOCKCOUNT equ 0x00FF

; Local Memory Flags

LMEM_FIXED equ 0x0000
LMEM_MOVEABLE equ 0x0002
LMEM_NOCOMPACT equ 0x0010
LMEM_NODISCARD equ 0x0020
LMEM_ZEROINIT equ 0x0040
LMEM_MODIFY equ 0x0080
LMEM_DISCARDABLE equ 0x0F00
LMEM_VALID_FLAGS equ 0x0F72
LMEM_INVALID_HANDLE equ 0x8000

;LHND equ LMEM_MOVEABLE OR LMEM_ZEROINIT
;LPTR equ LMEM_FIXED OR LMEM_ZEROINIT
;NONZEROLHND equ LMEM_MOVEABLE
;NONZEROLPTR equ LMEM_FIXED

; Flags returned by LocalFlags (in addition to LMEM_DISCARDABLE)

LMEM_DISCARDED equ 0x4000
LMEM_LOCKCOUNT equ 0x00FF

PROCESS_HEAP_REGION equ 0x0001
PROCESS_HEAP_UNCOMMITTED_RANGE equ 0x0002
PROCESS_HEAP_ENTRY_BUSY equ 0x0004
PROCESS_HEAP_ENTRY_MOVEABLE equ 0x0010
PROCESS_HEAP_ENTRY_DDESHARE equ 0x0020

HEAP_NO_SERIALIZE equ 0x00000001
HEAP_GROWABLE equ 0x00000002
HEAP_GENERATE_EXCEPTIONS equ 0x00000004
HEAP_ZERO_MEMORY equ 0x00000008
HEAP_REALLOC_IN_PLACE_ONLY equ 0x00000010
HEAP_TAIL_CHECKING_ENABLED equ 0x00000020
HEAP_FREE_CHECKING_ENABLED equ 0x00000040
HEAP_DISABLE_COALESCE_ON_FREE equ 0x00000080
HEAP_CREATE_ALIGN_16 equ 0x00010000
HEAP_CREATE_ENABLE_TRACING equ 0x00020000
HEAP_MAXIMUM_TAG equ 0x0FFF
HEAP_PSEUDO_TAG_FLAG equ 0x8000
HEAP_TAG_SHIFT equ 18

PROCESS_HEAP_ENTRY STRUC SIZE_T
lpData PVOID ?
cbData DWORD ?
cbOverhead BYTE ?
iRegionIndex BYTE ?
wFlags WORD ?
UNION
    STRUC ;Block
hMem HANDLE ?
dwReserved DD 3 dup(?)
    ENDS
    STRUC ;Region
dwCommittedSize DWORD ?
dwUnCommittedSize DWORD ?
lpFirstBlock PVOID ?
lpLastBlock PVOID ?
    ENDS
ENDS
PROCESS_HEAP_ENTRY ENDS

MEMORY_BASIC_INFORMATION STRUC SIZE_T
BaseAddress PVOID ?
AllocationBase PVOID ?
AllocationProtect DWORD ?
RegionSize SIZE_T ?
State DWORD ?
Protect DWORD ?
_Type DWORD ?
MEMORY_BASIC_INFORMATION ENDS

MEMORYSTATUS STRUC
dwLength dd ? ; sizeof(MEMORYSTATUS): no need to set..
dwMemoryLoad dd ? ; percent of memory in use
dwTotalPhys dd ? ; bytes of physical memory
dwAvailPhys dd ? ; free physical memory bytes
dwTotalPageFile dd ? ; bytes of paging file
dwAvailPageFile dd ? ; free bytes of paging file
dwTotalVirtual dd ? ; user bytes of address space
dwAvailVirtual dd ? ; free user bytes
MEMORYSTATUS ENDS

MEMORYSTATUSEX STRUC
dwLength dd ? ; Size of structure: must be set before call
dwMemoryLoad dd ?
ullTotalPhys dq ?
ullAvailPhys dq ?
ullTotalPageFile dq ?
ullAvailPageFile dq ?
ullTotalVirtual dq ?
ullAvailVirtual dq ?
ullAvailExtendedVirtual dq ? ; Reserved. This value is always 0.
MEMORYSTATUSEX ENDS

GlobalAlloc proto \
uFlags: UINT,
dwBytes: SIZE_T

GlobalFree proto \
hMem: HANDLE

GlobalReAlloc proto \
hMem: HANDLE,
dwBytes: SIZE_T,
uFlags: UINT

GlobalSize proto \
hMem: HANDLE

GlobalFlags proto \
hMem: HANDLE

GlobalHandle proto \
pMem: PVOID

GlobalLock proto \
hMem: HANDLE

GlobalUnlock proto \
hMem: HANDLE

GlobalCompact proto \
dwMinFree: SIZE_T

GlobalFix proto \
hMem: HANDLE

GlobalUnfix proto \
hMem: HANDLE

GlobalWire proto \
hMem: HANDLE

GlobalUnWire proto \
hMem: HANDLE

GlobalMemoryStatus proto \
lpBuffer: PTR MEMORYSTATUS

GlobalMemoryStatusEx proto \
lpBuffer: PTR MEMORYSTATUSEX

LocalAlloc proto \
uFlags: UINT,
uBytes: SIZE_T

LocalReAlloc proto \
hMem: HANDLE,
uBytes: SIZE_T,
uFlags: UINT

LocalLock proto \
hMem: HANDLE

LocalHandle proto \
pMem: PVOID

LocalUnlock proto \
hMem: HANDLE

LocalSize proto \
hMem: HANDLE

LocalFlags proto \
hMem: HANDLE

LocalFree proto \
hMem: HANDLE

LocalShrink proto \
hMem: HANDLE,
cbNewSize: SIZE_T

LocalCompact proto \
uMinFree: SIZE_T

GetProcessHeap proto

GetProcessHeaps proto \
NumberOfHeaps: DWORD,
ProcessHeaps: PVOID

HeapCreate proto \
flOptions: DWORD,
dwInitialSize: SIZE_T,
dwMaximumSize: SIZE_T

HeapDestroy proto \
hHeap: HANDLE

HeapAlloc proto \
hHeap: HANDLE,
dwFlags: DWORD,
dwBytes: SIZE_T

HeapReAlloc proto \
hHeap: HANDLE,
dwFlags: DWORD,
lpMem: PVOID,
dwBytes: SIZE_T

HeapFree proto \
hHeap: HANDLE,
dwFlags: DWORD,
lpMem: PVOID

HeapSize proto \
hHeap: HANDLE,
dwFlags: DWORD,
lpMem: PVOID

HeapValidate proto \
hHeap: HANDLE,
dwFlags: DWORD,
lpMem: PVOID

HeapCompact proto \
hHeap: HANDLE,
dwFlags: DWORD

HeapLock proto \
hHeap: HANDLE

HeapUnlock proto \
hHeap: HANDLE

HeapWalk proto \
hHeap: HANDLE,
lpEntry: PTR LPPROCESS_HEAP_ENTRY

VirtualAlloc proto \
lpAddress: PVOID,
dwSize: SIZE_T,
flAllocationType:DWORD,
flProtect: DWORD

VirtualFree proto \
lpAddress: PVOID,
dwSize: SIZE_T,
dwFreeType: DWORD

VirtualProtect proto \
lpAddress: PVOID,
dwSize: SIZE_T,
flNewProtect: DWORD,
lpflOldProtect: PTR DWORD

VirtualQuery proto \
lpAddress: PVOID,
lpBuffer: PTR MEMORY_BASIC_INFORMATION,
dwLength: SIZE_T

VirtualAllocEx proto \
hProcess: HANDLE,
lpAddress: PVOID,
dwSize: SIZE_T,
flAllocationType:DWORD,
flProtect: DWORD

VirtualFreeEx proto \
hProcess: HANDLE,
lpAddress: PVOID,
dwSize: SIZE_T,
dwFreeType: DWORD

VirtualProtectEx proto \
hProcess: HANDLE,
lpAddress: PVOID,
dwSize: SIZE_T,
flNewProtect: DWORD,
lpflOldProtect: PTR DWORD

VirtualQueryEx proto \
hProcess: HANDLE,
lpAddress: PVOID,
lpBuffer: PTR MEMORY_BASIC_INFORMATION,
dwLength: SIZE_T

.list

The heap-block allocated is a linked list defined by size rather than pointers so the prev/next pointers are normally not used. At the end of the list is a block-header of size 0. The prev pointer in this last block points to the first block where the prev/next pointers are real (from HeapAlloc). These blocks, with fixed size, are linked from base->next. Additional blocks (larger than _HEAP_GROWSIZE) are linked from base->prev.
Code: [Select]
include alloc.inc

PUBLIC _amblksiz

.data
_amblksiz dd _HEAP_GROWSIZE

END

The implementation of malloc()/free()
Code: [Select]
include alloc.inc
include errno.inc

_FREE equ 0
_LOCAL equ 1
_GLOBAL equ 2
_ALIGNX equ 3

.data

_heap_base dq 0 ; address of main memory block
_heap_free dq 0 ; address of free memory block

.code

OPTION PROLOGUE:NONE, EPILOGUE:NONE

malloc PROC byte_count:SIZE_T

add rcx,sizeof(S_HEAP)+_GRANULARITY-1
and rcx,-(_GRANULARITY)

mov rdx,_heap_free
test rdx,rdx
jz create_heap

cmp [rdx].S_HEAP.h_type,_FREE
mov rax,[rdx].S_HEAP.h_size
jne find_block
cmp rax,rcx
jb find_block

block_found:

mov [rdx].S_HEAP.h_type,_LOCAL
je @F ; same size ?

mov [rdx].S_HEAP.h_size,rcx
sub rax,rcx ; create new free block
mov [rdx+rcx].S_HEAP.h_size,rax
mov [rdx+rcx].S_HEAP.h_type,_FREE
@@:

lea rax,[rdx+sizeof(S_HEAP)]; return address of memory block
add rdx,[rdx].S_HEAP.h_size
mov _heap_free,rdx
toend:
ret

find_block:

mov rdx,_heap_base
xor rax,rax
lupe:
add rdx,rax
mov rax,[rdx].S_HEAP.h_size
test rax,rax
jz last
cmp [rdx].S_HEAP.h_type,_FREE
jne lupe
cmp rax,rcx
jae block_found
cmp [rdx+rax].S_HEAP.h_type,_FREE
jne lupe
merge:
add rax,[rdx+rax].S_HEAP.h_size
mov [rdx].S_HEAP.h_size,rax
cmp [rdx+rax].S_HEAP.h_type,_FREE
je merge
cmp rax,rcx
jb lupe
jmp block_found
last:
mov rdx,[rdx].S_HEAP.h_prev
mov rdx,[rdx].S_HEAP.h_prev
test rdx,rdx
jnz lupe

create_heap:

mov eax,_amblksiz
test eax,eax
jz h_alloc
cmp rax,rcx
jb h_alloc
add rax,sizeof(S_HEAP) * 2

push rcx
push rbx
mov rbx,rax
sub rsp,28h
HeapAlloc( GetProcessHeap(), 0, rbx )
add rsp,28h
mov rdx,rbx
pop rbx
pop rcx

test rax,rax
jz nomem
sub rdx,sizeof(S_HEAP)
mov [rax].S_HEAP.h_size,rdx
mov [rax].S_HEAP.h_type,_FREE
mov [rax].S_HEAP.h_next,0
add rdx,rax
mov [rdx].S_HEAP.h_size,0
mov [rdx].S_HEAP.h_type,_LOCAL
mov [rdx].S_HEAP.h_prev,rax
mov rdx,_heap_base
mov [rax].S_HEAP.h_prev,rdx
.if rdx
push rcx
mov rcx,[rdx].S_HEAP.h_next
mov [rdx].S_HEAP.h_next,rax
mov [rax].S_HEAP.h_next,rcx
.if rcx
mov [rcx].S_HEAP.h_prev,rax
.endif
pop rcx
.endif
mov _heap_free,rax
mov _heap_base,rax
mov rdx,rax
mov rax,[rdx].S_HEAP.h_size
cmp rax,rcx
jae block_found
nomem:
mov errno,ENOMEM
xor rax,rax
jmp toend
h_alloc:
push rcx
push rbx
mov rbx,rcx
sub rsp,28h
HeapAlloc( GetProcessHeap(), 0, rbx )
add rsp,28h
pop rbx
pop rcx
test rax,rax
jz nomem
mov [rax].S_HEAP.h_size,rcx
mov [rax].S_HEAP.h_type,_GLOBAL
mov rcx,_heap_base
mov [rax].S_HEAP.h_prev,rcx
mov [rax].S_HEAP.h_next,rcx
.if rcx
mov rdx,[rcx].S_HEAP.h_next
mov [rcx].S_HEAP.h_next,rax
mov [rax].S_HEAP.h_next,rdx
.if rdx
mov [rdx].S_HEAP.h_prev,rax
.endif
.endif
add rax,sizeof(S_HEAP)
jmp toend
malloc ENDP

free PROC maddr:PVOID
push rax
aligned:
mov rax,rcx
sub rax,sizeof(S_HEAP)
js toend
cmp [rax].S_HEAP.h_type,_LOCAL
jne h_free
mov [rax].S_HEAP.h_type,_FREE
mov _heap_free,rax
toend:
pop rax
ret
h_free:
cmp [rax].S_HEAP.h_type,_ALIGNX
mov rcx,[rax].S_HEAP.h_prev
je aligned
cmp [rax].S_HEAP.h_type,_GLOBAL
jne toend
push rdx
mov rdx,[rax].S_HEAP.h_prev
mov rcx,[rax].S_HEAP.h_next
.if rdx
mov [rdx].S_HEAP.h_next,rcx
.endif
.if rcx
mov [rcx].S_HEAP.h_prev,rdx
.endif
push rbx
mov rbx,rax
sub rsp,20h
HeapFree( GetProcessHeap(), 0, rbx )
add rsp,20h
pop rbx
pop rdx
pop rax
ret
free ENDP

END

The implementation of realloc()
Code: [Select]
include alloc.inc
include errno.inc

_FREE equ 0
_LOCAL equ 1
_GLOBAL equ 2

.code

OPTION PROLOGUE:NONE, EPILOGUE:NONE

realloc PROC pblck:ptr, newsize:size_t

; special cases, handling mandated by ANSI

.if !rcx
;
; just do a malloc of newsize bytes and return a pointer to
; the new block
;
malloc( rdx )
jmp toend
.endif

.if !rdx
;
; free the block and return NULL
;
free  ( rcx )
xor rax,rax
jmp toend
.endif

; make newsize a valid allocation block size (i.e., round up to the
; nearest granularity)

lea rax,[rdx+sizeof(S_HEAP)+_GRANULARITY-1]
and rax,-(_GRANULARITY)
lea r8,[rcx-sizeof(S_HEAP)]
mov r9,[r8].S_HEAP.h_size

.if rax == r9

mov rax,rcx
jmp toend
.endif

.if [r8].S_HEAP.h_type == _LOCAL

.if rax < r9

sub r9,rax
.if r9 <= sizeof(S_HEAP) + _GRANULARITY

mov rax,rcx
jmp toend
.endif
mov [r8].S_HEAP.h_size,rax
add rax,r8
mov [rax].S_HEAP.h_size,r9
mov [rax].S_HEAP.h_type,0
mov rax,rcx
jmp toend
.endif

; see if block is big enough already, or can be expanded

mov r10,r9 ; add up free blocks
mov r11,r9
.while [r8+r10].S_HEAP.h_type == _FREE && r11

mov r11,[r8+r10].S_HEAP.h_size
add r10,r11
.endw

.if r10 >= rax

; expand block

sub r10,rax
.if r10 >= sizeof(S_HEAP)

mov [r8].S_HEAP.h_size,rax
mov [r8+rax].S_HEAP.h_size,r10
mov [r8+rax].S_HEAP.h_type,_FREE
.else

mov [r8].S_HEAP.h_size,r10
.endif
mov rax,rcx
jmp toend
.endif
.endif

push rsi
push rdi
mov rsi,r8 ; block
mov rdi,rax ; new size
.if malloc( rax )

mov rcx,[rsi].S_HEAP.h_size
sub rcx,sizeof(S_HEAP)
add rsi,sizeof(S_HEAP)
mov rdx,rsi
mov rdi,rax
rep movsb
free  ( rdx )
.endif
pop rdi
pop rsi

toend:
ret
realloc ENDP

END

In addition to _LOCAL and _GLOBAL blocks _ALIGNX is added for _aligned_malloc()
Code: [Select]
include alloc.inc

.code

OPTION WIN64:2

_aligned_malloc proc uses rdi dwSize:size_t, Alignment:UINT

mov edi,edx
lea rcx,[rcx+rdx+sizeof(S_HEAP)]

.if malloc( rcx )

dec edi
.if eax & edi

lea rdx,[rax-sizeof(S_HEAP)]
lea rax,[rax+rdi+sizeof(S_HEAP)]
not rdi
and rax,rdi
lea rcx,[rax-sizeof(S_HEAP)]
mov [rcx].S_HEAP.h_prev,rdx
mov [rcx].S_HEAP.h_type,3

.endif
.endif
ret

_aligned_malloc ENDP

END

As seen in the definition of alloca() the implementation of stack-alloc needs to include the @ReservedStack value of the current function, so this is now handled using a macro.
Code: [Select]
include alloc.inc

.code

OPTION PROLOGUE:NONE, EPILOGUE:NONE

_alloca64 PROC byte_count, res_stack

lea rax,[rsp+8] ; start of call
add rcx,rdx ; @ReservedStack
.while rcx > _PAGESIZE_
sub rax,_PAGESIZE_
test [rax],rax
sub rcx,_PAGESIZE_
.endw
sub rax,rcx
and rax,-(_GRANULARITY)
test [rax],rax
mov rcx,[rsp]
mov rsp,rax
add rax,rdx
jmp rcx

_alloca64 ENDP

END

This assumes a standard stack frame (-Cs) and a common spill-size (win64:2).