ASM alloca/_alloca

Started by aw27, May 19, 2017, 04:25:42 PM

aw27

Quote from: jj2007 on May 20, 2017, 01:42:14 AM
So the stack changes each time ::)

It has to be like this:  :t


; hjwasm64 -c -coff awalloca.asm
; link awalloca.obj /STACK:52428800,52428800 /SUBSYSTEM:CONSOLE

.386

.MODEL FLAT, C
OPTION CASEMAP:NONE

includelib h:\Masm32\lib\msvcrt.lib
includelib h:\Masm32\lib\Kernel32.lib

printf PROTO C arg1:Ptr Byte, printlist: VARARG
ExitProcess PROTO STDCALL :dword

.data
result db "Finished",13,10,0

.code

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE


_alloca proc C public thesize:dword, alignm:dword
pop ecx ; pops the return address
pop eax ; thesize
pop edx ; align
sub esp, eax
neg edx
and esp, edx
mov eax, esp
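push edx ; push a dummy dword in place of the first argument; the caller's cleanup (add esp, 8) removes it
push edx ; ditto for the second argument, so esp ends up at the allocated block after cleanup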
push ecx ; re-push the return address in the top of stack
ret
_alloca endp



myLooping proc
push ebp
mov ebp, esp ; FRAME is always required, even if no LOCALS, to re-establish the stack pointer on exit.

invoke _alloca, 50000000, 16 ; We have a mega stack of 50 MB
mov dword ptr [eax], 12345678h
mov dword ptr [eax+40000000], 12345678h
mov esp, ebp
pop ebp
ret
myLooping endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

start proc
LOCAL counter
;int 3
mov counter, 3000000
.Repeat
INVOKE myLooping
dec counter
.Until ZERO?
INVOKE printf, addr result
INVOKE ExitProcess, 0

start endp

end start
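
For anyone puzzled by the "neg edx / and esp, edx" pair in _alloca, here is a small worked example of the rounding it performs. It is only a sketch: the start address and the size are made-up illustration values rather than figures from a real run, and it works on eax instead of esp so that it is safe to single-step in a debugger.

```asm
; Worked example of rounding an address down to a power-of-two boundary.
; All values are hypothetical and chosen only to make the arithmetic visible.
.386
.MODEL FLAT, STDCALL
OPTION CASEMAP:NONE

.code
start proc
    mov eax, 0018FF40h      ; pretend this is esp before the allocation
    sub eax, 100            ; reserve 100 bytes               -> 0018FEDCh
    mov edx, 16             ; requested alignment (must be a power of two)
    neg edx                 ; -16 = 0FFFFFFF0h, a mask that clears the low 4 bits
    and eax, edx            ; round down to a multiple of 16  -> 0018FED0h
    ret                     ; eax = aligned address, 112 bytes below the start value
start endp

end start
```

Rounding down can only make the block bigger, never smaller, so the caller always gets at least the requested size. The trick relies on the alignment being a power of two; for any other value the negated number is not a usable bit mask.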





aw27

Quote from: jj2007 on May 20, 2017, 01:42:14 AM
But it means 4096 bytes allocated for every aligned variable...

No, you can put many variables in each segment, no need to have one segment per variable.

jj2007

Quote from: aw27 on May 20, 2017, 02:50:39 AM
Quote from: jj2007 on May 20, 2017, 01:42:14 AM
But it means 4096 bytes allocated for every aligned variable...

No, you can put many variables in each segment, no need to have one segment per variable.

Yes indeed, but to have all these variables aligned 32, you again need manual fumbling.

coder

JJ, you can use align 32 for each variable in the same segment.
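
A minimal sketch of that idea, assuming an assembler that accepts the ALIGN(n) attribute on SEGMENT (the same syntax as the `_DATA2` example further down the thread). The segment name `_DATA32` and the variable names are invented for illustration.

```asm
; Several 32-byte aligned variables in one user-defined segment.
; Assumption: the assembler supports ALIGN(32) as a SEGMENT attribute.
; _DATA32, vecA, flag and vecB are hypothetical names.
.686
.MODEL FLAT, STDCALL
OPTION CASEMAP:NONE

_DATA32 SEGMENT ALIGN(32) FLAT 'DATA'
vecA    REAL4 8 DUP(1.0)    ; first item starts at the segment's 32-byte boundary
flag    BYTE  1             ; an odd-sized item breaks the alignment...
ALIGN 32                    ; ...so re-align before the next vector
vecB    REAL4 8 DUP(2.0)
_DATA32 ENDS

.code
start proc
    mov eax, offset vecB
    and eax, 31             ; eax = 0 if vecB ended up on a 32-byte boundary
    ret
start endp

end start
```

The ALIGN directive inside a segment cannot exceed the alignment of the segment itself, which is why the segment is declared with ALIGN(32) in the first place.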

jj2007

Quote from: coder on May 20, 2017, 07:37:15 AM
JJ, you can use align 32 for each variable in the same segment.

I was prepared to reply "nonsense", based on the horrible experience in the align64/transpose a matrix thread. BUT here the align 64 works :dazzled:

More precisely, it works only in this very special segment. Try replacing it with a .data?, and it will choke. Attached a testbed, it works even with ML 6.14 and its linker...! José made a great discovery :icon14:

coder

I have a different view on this though. I think a segment is nothing special. It's just more primitive / low-level than sections and high-level macros like .code and .data. I think MASM's .code and .data are actually wrapper macros for sections instead of segments, to facilitate external linking, hence the difficulties in setting up the alignments. Older MASM reference books made good use of segments, but of course in MZ format (executable). Then again, PE format is MZ format in disguise. I just don't understand the sudden drop in their usage / popularity when it comes to Win / modern asm programming. A segment is the final memory partition seen by the CPU.








jj2007

Quote from: coder on May 20, 2017, 09:04:57 AM
I think MASM's .code and .data are actually some wrapper macros to sections instead of segments to facilitate external linking, hence the difficulties in setting up the alignments.

That could be an explanation. Still, it is rather strange that ML chokes for align 32 in .DATA? but has no problems with align 64 in an old-fashioned DOS segment.

@José: I got it running now:

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz

Allocating 5000*40800 bytes:
00186010        _alloca:           1 ms a16: 0
00186000        StackBuffer z:    11 ms a16: 0
00186000        StackBuffer nz:    0 ms a16: 0
0068E6B8        GlobalAlloc z:    81 ms a16: 1
0068E6B8        HeapAlloc nz:     69 ms a16: 1
0068E7A4        SysAlloc nz:      71 ms a16: 1
0068E7A0        HeapAlloc16 z:    68 ms a16: 0
0068E7A0        HeapAlloc16 nz:   68 ms a16: 0

Allocating 5000*163200 bytes:
001681C0        StackBuffer z:    37 ms a16: 0
001681C0        StackBuffer nz:    1 ms a16: 0
0068E7A0        GlobalAlloc z:   356 ms a16: 0
0068E7A0        HeapAlloc nz:    354 ms a16: 0
0068E7A4        SysAlloc nz:     354 ms a16: 1
0068E7A0        HeapAlloc16 z:   357 ms a16: 0
0068E7A0        HeapAlloc16 nz:  356 ms a16: 0

Allocating 5000*652800 bytes:
000F0940        StackBuffer z:   149 ms a16: 0
000F0940        StackBuffer nz:    1 ms a16: 0
00260020        GlobalAlloc z:    28 ms a16: 0
00260020        HeapAlloc nz:     27 ms a16: 0
00260024        SysAlloc nz:      30 ms a16: 1
00260020        HeapAlloc16 z:    22 ms a16: 0
00260020        HeapAlloc16 nz:   22 ms a16: 0


The problem is that _alloca can be used only for allocations up to 40800 bytes; beyond that, the guard pages kick in (see this old thread for the reasons). Speed-wise it is of course identical to StackBuffer(). Source & exe attached.

coder

@JJ

Yeah, sadly align 32 doesn't work in normal .data .code settings. I remember posting something similar here

Either you pad them manually (macros etc.) or you use SEGMENT.


hutch--

If it's for DATA, use dynamic memory and align it yourself. Verx jarst phine.  :biggrin:
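
A minimal sketch of "align it yourself", assuming plain HeapAlloc on the process heap: over-allocate by alignment minus one bytes, then round the pointer up. The names `BUFSIZE`, `ALIGNMENT`, `pRaw` and `pAligned` are invented for illustration, and the library path follows the one used elsewhere in this thread.

```asm
; Over-allocate and round the pointer up to a 32-byte boundary.
; BUFSIZE, ALIGNMENT, hHeap, pRaw and pAligned are hypothetical names.
.686
.MODEL FLAT, STDCALL
OPTION CASEMAP:NONE

includelib \Masm32\lib\Kernel32.lib

GetProcessHeap PROTO STDCALL
HeapAlloc      PROTO STDCALL :DWORD, :DWORD, :DWORD
HeapFree       PROTO STDCALL :DWORD, :DWORD, :DWORD
ExitProcess    PROTO STDCALL :DWORD

BUFSIZE   equ 4096
ALIGNMENT equ 32            ; must be a power of two

.data?
hHeap    dd ?
pRaw     dd ?               ; pointer as returned by HeapAlloc (needed for HeapFree)
pAligned dd ?               ; pointer actually used for the data

.code
start proc
    invoke GetProcessHeap
    mov hHeap, eax
    invoke HeapAlloc, hHeap, 0, BUFSIZE + ALIGNMENT - 1
    mov pRaw, eax
    test eax, eax
    jz done                         ; allocation failed
    add eax, ALIGNMENT - 1          ; step past the next boundary...
    and eax, not (ALIGNMENT - 1)    ; ...and round down onto it
    mov pAligned, eax
    mov dword ptr [eax], 12345678h  ; the aligned block is usable from here
    invoke HeapFree, hHeap, 0, pRaw ; always free the original pointer
done:
    invoke ExitProcess, 0
start endp

end start
```

The extra ALIGNMENT - 1 bytes guarantee that BUFSIZE bytes remain available after rounding up, whatever address the heap returns.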

aw27

#24
Quote from: jj2007 on May 20, 2017, 11:02:51 AM
@José: I got it running now.
The problem is that _alloca can be used only for allocations up to 40800 bytes; beyond that, the guard pages kick in

I downloaded the attachment and noticed a couple of points:
1) You are using the Pascal version of _alloca in the test without declaring it as Pascal in the Proc. Moreover, the Pascal version was buggy; I fixed it and modified the initial post but forgot to announce that. Sorry, you are using the stdcall version, which is fine as long as everything is stdcall.
2) You cannot test _alloca the way you did: the stack is not freed during the test and will eventually overflow. You need an intermediate function, as I showed in Reply 15. Remember that with _alloca the memory is only freed when the function returns, so you need a function that returns during the test: the intermediate function.
3) The problem with the guard pages applies when pages are not committed. If you make a big, fully committed stack, you are all good.

aw27

Quote from: jj2007 on May 20, 2017, 08:17:15 AM
More precisely, it works only in this very special segment. Try replacing it with a .data?, and it will choke.

Maybe we can do it with a BSS segment, like this:

_DATA2 SEGMENT ALIGN(32) FLAT 'BSS'
bigArray DWORD 50000 DUP(?)
_DATA2 ends



Appears to work, although BSS is not in the list of typical segment classes which are:
'DATA', 'CODE', 'CONST' and 'STACK'


Siekmanski

Cool  8)
I was looking for a class like the 'BSS' value, couldn't find it.
Tested it and it works.

Found only these class values, 'DATA', 'CODE', 'CONST', 'MODULES' and 'STACK'

https://docs.microsoft.com/nl-nl/cpp/assembler/masm/segment

jj2007

Quote from: aw27 on May 20, 2017, 04:00:24 PM
2) You can not test _alloca like you did, the effect is that the stack is not freed during the test and will eventually overflow. You need an intermediate function as I have shown in Reply 15. Remember that with _alloca the memory is only freed when the function returns, so you need a function that return during the test, the intermediate function.

7-zip tricked me into adding an old version of Allocs.asm. The good one is attached - sorry for that :icon_redface:

Quote:
3) The problem with the guard pages applies when pages are not committed. If you make a big all committed stack you are all good.

There is kind of a silent understanding here to avoid 'special' commandline options, but of course what you propose is possible. Using the options, results are as shown below. StackBuffer() works with default options, and timings are identical; _alloca does not zero-init the pages, so compare to StackBuffer(size, nz).

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz

Allocating 50000*40000 bytes:
00186320        _alloca:           1 ms a16: 0
00186300        StackBuffer z:    83 ms a16: 0
00186300        StackBuffer nz:    1 ms a16: 0
00662DE0        GlobalAlloc z:    99 ms a16: 0
00662DE0        HeapAlloc nz:     13 ms a16: 0
00662DE4        SysAlloc nz:      18 ms a16: 1
00662DE0        HeapAlloc16 z:   100 ms a16: 0
00662DE0        HeapAlloc16 nz:   13 ms a16: 0

Allocating 50000*110000 bytes:
001751C0        _alloca:           1 ms a16: 0
00175180        StackBuffer z:   240 ms a16: 0
00175180        StackBuffer nz:    1 ms a16: 0
00662DE0        GlobalAlloc z:   255 ms a16: 0
00662DE0        HeapAlloc nz:     15 ms a16: 0
00662DE4        SysAlloc nz:      19 ms a16: 1
00662DE0        HeapAlloc16 z:   255 ms a16: 0
00662DE0        HeapAlloc16 nz:   15 ms a16: 0

Allocating 50000*302500 bytes:
001461C0        _alloca:           1 ms a16: 0
001461C0        StackBuffer z:   678 ms a16: 0
001461C0        StackBuffer nz:    1 ms a16: 0
00662DE0        GlobalAlloc z:   678 ms a16: 0
00662DE0        HeapAlloc nz:     15 ms a16: 0
00662DE4        SysAlloc nz:      19 ms a16: 1
00662DE0        HeapAlloc16 z:   676 ms a16: 0
00662DE0        HeapAlloc16 nz:   15 ms a16: 0

Allocating 50000*831875 bytes:
000C4DE0        _alloca:           1 ms a16: 0
000C4DC0        StackBuffer z:  1825 ms a16: 0
000C4DC0        StackBuffer nz:    1 ms a16: 0
022B0020        GlobalAlloc z:   265 ms a16: 0
022B0020        HeapAlloc nz:    261 ms a16: 0
022B0024        SysAlloc nz:     273 ms a16: 1
022B0020        HeapAlloc16 z:   204 ms a16: 0
022B0020        HeapAlloc16 nz:  204 ms a16: 0

aw27

Quote from: jj2007 on May 20, 2017, 07:57:53 PM
There is kind of a silent understanding here to avoid 'special' commandline options

Actually, you don't need any special command line options to have _alloca pass all the tests. All you have to do is turn the reserved stack pages into committed pages at run time, and this is done by probing them. This is actually what Microsoft's open source _chkstk does, so I included it here in the middle of the example. In this example, before the test starts, you probe 850000 bytes of the default 1 MB stack, which leaves more than enough space for your maximum allocation of 831875 bytes  :t


.686
.XMM

.MODEL FLAT, STDCALL

OPTION CASEMAP:NONE

includelib \Masm32\lib\Kernel32.lib

_PAGESIZE_      equ     1000h

.code


_chkstk proc C

_alloca_probe    =  _chkstk

        push    ecx

; Calculate new TOS.

        lea     ecx, [esp] + 8 - 4      ; TOS before entering function + size for ret value
        sub     ecx, eax                ; new TOS (Top of Stack)

; Handle allocation size that results in wraparound.
; Wraparound will result in StackOverflow exception.

        sbb     eax, eax                ; 0 if CF==0, ~0 if CF==1
        not     eax                     ; ~0 if TOS did not wrap around, 0 otherwise
        and     ecx, eax                ; set to 0 if wraparound

        mov     eax, esp                ; current TOS
        and     eax, not ( _PAGESIZE_ - 1) ; Round down to current page boundary

cs10:
        cmp     ecx, eax                ; Is new TOS
    bnd jb      short cs20              ; in probed page?
        mov     eax, ecx                ; yes.
        pop     ecx
        xchg    esp, eax                ; update esp
        mov     eax, dword ptr [eax]    ; get return address
        mov     dword ptr [esp], eax    ; and put it at new TOS
    bnd ret

; Find next lower page and probe
cs20:
        sub     eax, _PAGESIZE_         ; decrease by PAGESIZE
        test    dword ptr [eax],eax     ; probe page.
        jmp     short cs10

_chkstk endp

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

_alloca proc stdcall public thesize:dword, alignm:dword
pop ecx ; pops the return address
pop eax ; thesize
pop edx ; align
sub esp, eax
neg edx
and esp, edx
mov eax, esp
push ecx ; re-push the return address in the top of stack
ret
_alloca endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

p0 proc dummy1, dummy2, allocSize
Local pFatBuffer, a16:BYTE
  invoke _alloca, allocSize, 32
  mov pFatBuffer, eax
  test al, 15
  setne a16
  .if Zero?
movaps [eax], xmm0
  .else
movups [eax], xmm0
  .endif
  mov dword ptr [eax], 11111111h
  mov edx, allocSize
  mov dword ptr [eax+edx-4], 33333333h
  shr edx, 1
  mov dword ptr [eax+edx], 22222222h
  movsx ecx, a16
  mov eax, pFatBuffer
  ret
p0 endp

start proc
LOCAL loops : dword
LOCAL allobytes : dword
LOCAL oldEsp : dword

mov loops, 50000
mov allobytes, 831875

mov oldEsp, esp
mov eax, 850000
invoke _chkstk
mov esp, oldEsp
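mov ebx, loops
dec ebx ; ebx = loops-1 ("loops" is a LOCAL, so "mov ebx, loops-1" would read from [ebp-5] instead of subtracting 1)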

.Repeat
invoke p0, 123, 456, allobytes
dec ebx
.Until Sign?


ret
start endp

end start

end



jj2007

Quote from: aw27 on May 21, 2017, 02:34:19 AM
turn the reserved stack pages into committed pages at run time. And this is done by probing them.

Yes, this is what StackBuffer() does, too:
Quote:
- StackBuffer does the stack probing for you; up to about half a megabyte, it is significantly faster than HeapAlloc

The "significantly faster" refers to the zeroing version. The nz version is always a lot faster than *Alloc.

Btw chkstk = open source? It's not in the VC header files, but of course, disassembling would be easy.