Thanks, Dave - I had not seen that thread. Now it's clearer...
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles
4710 kCycles for 100 * rep stosd
2220 kCycles for 100 * HeapAlloc (*8)
2193 kCycles for 100 * StackBuffer (with zeroing)
2192 kCycles for 100 * StackBuffer (unrolled)
4697 kCycles for 100 * dedndave
1738 kCycles for 100 * rep stosd up
These timings are for slightly modified code that takes into account the need to save & restore the old stack:
.Repeat
    mov edx, edi                ; save edi
    mov edi, esp
    mov eax, esp                ; save old stack
    sub edi, (bufsize+3+4)      ; <NumberOfBytesRequiredPlus3Mod4>, plus 4 for the saved esp
    and edi, -4                 ; aligns new stack
    .repeat
        push eax                ; tickle the guard page
        ASSUME FS:Nothing
        mov esp, fs:[8]         ; stack limit from the TIB; might be 4k lower now
        ASSUME FS:ERROR
    .until edi>=esp             ; loop until we've got enough
    mov esp, edi                ; new stack
    stosd                       ; save old stack to [edi]; edi now points to the buffer
    xchg eax, ecx               ; ecx = old stack
    push edi                    ; retval for macro
    sub ecx, edi                ; bytes from buffer start up to old stack
    shr ecx, 2                  ; ... as a dword count
    xor eax, eax
    rep stosd                   ; zero the buffer
    pop eax                     ; retval for macro
    mov edi, edx                ; restore edi
    ; ... code that uses buffer...
    pop esp                     ; restore stack
    dec ebx
.Until Sign?
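To double-check the pointer arithmetic, here is a rough model in C of what the loop body computes. It is only a sketch: the struct and function names are made up for illustration, it assumes a 32-bit, dword-aligned esp, and it ignores the guard page probing entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef struct {
    uint32_t new_esp;  /* esp after "mov esp, edi"; [new_esp] holds the old esp */
    uint32_t buffer;   /* value handed back in eax: start of the zeroed buffer  */
    uint32_t dwords;   /* ecx count used by "rep stosd"                         */
} StackBufLayout;

static StackBufLayout stack_buf_layout(uint32_t old_esp, uint32_t bufsize)
{
    StackBufLayout r;
    uint32_t edi;

    assert(old_esp % 4 == 0);            /* assumption: stack is dword aligned */

    edi = old_esp - (bufsize + 3 + 4);   /* sub edi, (bufsize+3+4)             */
    edi &= ~3u;                          /* and edi, -4                        */
    r.new_esp = edi;                     /* mov esp, edi                       */
    edi += 4;                            /* stosd: store old esp, advance edi  */
    r.buffer = edi;                      /* push edi ... pop eax               */
    r.dwords = (old_esp - edi) >> 2;     /* sub ecx, edi / shr ecx, 2          */
    return r;
}
```

With old_esp = 0012FF80h and bufsize = 1000 this gives new_esp = 0012FB90h, buffer = 0012FB94h and 251 dwords, i.e. 1004 zeroed bytes, which covers the requested 1000. The final "pop esp" then reads the old esp back from [new_esp], provided the code using the buffer leaves esp balanced.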
I hope I didn't misunderstand anything - for some time I was thoroughly confused by your NumberOfBytesRequiredPlus3Mod4 ::)