Zero a stack buffer (and probe it)

Started by jj2007, October 25, 2013, 07:31:54 PM

nidud

#75
deleted

dedndave

i think this is a case where you have to apply some common sense
how is the heap/stack space going to be used in a typical program ?
i can't think of too many cases where you actually waffle back and forth between allocating heap memory and committing stack space
if you do, you should probably re-think your design   :P

however, it would still be interesting to see how fast the OS can commit under different conditions
for example: allocate a large heap block, then commit some stack space

from what i have seen (with no heap allocated), the commit loop seems pretty fast

dedndave

ok, well it doesn't seem to be as fast as i thought
or maybe it's just hard to properly measure something with 1 pass   :P

11629 Clock cycles per page

EDIT: a more accurate version - results about the same - lol
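
To put the per-page figure in perspective, here is a rough back-of-the-envelope estimate in Python (not part of the attached test). It assumes the usual 4096-byte x86 page and simply multiplies the single-pass figure above by the number of pages a buffer spans; the name commit_cost is made up for illustration.

```python
PAGE_SIZE = 4096          # x86 page size in bytes (assumption: standard 4 KB pages)
CYCLES_PER_PAGE = 11629   # single-pass figure quoted above

def commit_cost(buffer_bytes):
    """Return (pages, estimated cycles) for first-time commit of a buffer."""
    pages = -(-buffer_bytes // PAGE_SIZE)   # ceiling division
    return pages, pages * CYCLES_PER_PAGE

for size in (8192, 131072, 524288):
    pages, cycles = commit_cost(size)
    print(f"{size:7d} bytes: {pages:4d} pages, ~{cycles} cycles to commit")
```

This is only an estimate of the first touch; it says nothing about how the cost changes under memory pressure.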

nidud

#78
deleted

dedndave

well - that seems counter-intuitive

if you can allocate all available with HeapAlloc, it *should* reset the commit

nidud

#80
deleted

dedndave

i guess it doesn't matter - my way of thinking was wrong, of course

it seems that, once the space has been committed, it stays committed
it simply gets swapped out to the paging file if you try to HeapAlloc(nMaxBytes)

it might work if you create a thread to commit and release stack space, then terminate the thread

EDIT: we really aren't interested in measuring swaps between memory and the page file   :lol:

nidud

#82
deleted

dedndave

Quote from: nidud on October 29, 2013, 04:00:41 AM
...However, once the stack is committed (one way or the other) it will be available
as a substitute for HeapAlloc(), and that will save both code space and cycles.

i think that's how you have to look at it, too

btw - i see Mark has a relatively new tool (new version, at least) - called VMMap

http://technet.microsoft.com/en-us/sysinternals/dd535533.aspx

i have to do some reading to interpret what it's showing me   :P

jj2007

Quote from: dedndave on October 29, 2013, 03:30:12 AM
it seems that, once the space has been committed, it stays committed

This is also my interpretation.
IMHO a StackBuffer() macro is best for repeatedly used small local buffers that are bigger than the 2 pages you can have without probing, and smaller than the size at which HeapAlloc becomes competitive. Another advantage is that it avoids heap fragmentation.
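
For readers wondering why probing is needed at all, here is a toy model in Python. It is a sketch under simplifying assumptions, not the macro and not the real OS logic: the stack has one guard page just below the committed area, touching the guard page commits it and moves the guard one page down, and touching anything below the guard page faults. ToyStack, probe and all the addresses are hypothetical.

```python
PAGE = 4096  # x86 page size

class ToyStack:
    """Toy model of a thread stack with a single guard page."""
    def __init__(self, top, reserved_pages, committed_pages):
        self.limit = top - reserved_pages * PAGE          # lowest reserved address
        self.guard = top - (committed_pages + 1) * PAGE   # base of the guard page

    def touch(self, addr):
        if addr >= self.guard + PAGE:
            return                                        # page already committed
        if addr >= self.guard and self.guard - PAGE >= self.limit:
            self.guard -= PAGE                            # guard hit: commit, move guard down
            return
        raise MemoryError(f"fault at {addr:#x}")          # below the guard page, or stack exhausted

def probe(stack, esp, size):
    """Touch one address per page, from high to low, and return the new esp."""
    new_esp = esp - size
    addr = esp
    while addr - PAGE > new_esp:
        addr -= PAGE
        stack.touch(addr)
    stack.touch(new_esp)
    return new_esp

TOP = 0x130000            # hypothetical stack top
esp = TOP - 64

unprobed = ToyStack(TOP, reserved_pages=256, committed_pages=2)
try:
    unprobed.touch(esp - 32768)        # jump straight past the guard page
    print("unprobed access: ok")
except MemoryError as err:
    print("unprobed access:", err)

probed = ToyStack(TOP, reserved_pages=256, committed_pages=2)
new_esp = probe(probed, esp, 32768)
probed.touch(new_esp)                  # fine now: every page on the way was committed
print("probed access: ok, guard page now at", hex(probed.guard))
```

In this model a 32 KB buffer taken in one jump lands below the guard page and faults, while walking down page by page commits each page in turn.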

Attached a new testbed with sizes 2k ... 512k. Feel free to modify - no MasmBasic needed ;-)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles

2048 bytes:
103705  cycles for 100 * HeapAlloc
28460   cycles for 100 * StackBuffer (xmm)
48535   cycles for 100 * StackBuffer (rep stosd)
47423   cycles for 100 * dedndave

103956  cycles for 100 * HeapAlloc
28460   cycles for 100 * StackBuffer (xmm)
48544   cycles for 100 * StackBuffer (rep stosd)
47424   cycles for 100 * dedndave

8192 bytes:
185329  cycles for 100 * HeapAlloc
105525  cycles for 100 * StackBuffer (xmm)
128172  cycles for 100 * StackBuffer (rep stosd)
127549  cycles for 100 * dedndave

184025  cycles for 100 * HeapAlloc
105314  cycles for 100 * StackBuffer (xmm)
128170  cycles for 100 * StackBuffer (rep stosd)
127050  cycles for 100 * dedndave

32768 bytes:
547     kCycles for 100 * HeapAlloc
438     kCycles for 100 * StackBuffer (xmm)
440     kCycles for 100 * StackBuffer (rep stosd)
439     kCycles for 100 * dedndave

548     kCycles for 100 * HeapAlloc
438     kCycles for 100 * StackBuffer (xmm)
444     kCycles for 100 * StackBuffer (rep stosd)
437     kCycles for 100 * dedndave

131072 bytes:
2810    kCycles for 100 * HeapAlloc
2808    kCycles for 100 * StackBuffer (xmm)
2230    kCycles for 100 * StackBuffer (rep stosd)
2222    kCycles for 100 * dedndave

2319    kCycles for 100 * HeapAlloc
2808    kCycles for 100 * StackBuffer (xmm)
2225    kCycles for 100 * StackBuffer (rep stosd)
2224    kCycles for 100 * dedndave

524288 bytes:
742     kCycles for 100 * HeapAlloc
12067   kCycles for 100 * StackBuffer (xmm)
8928    kCycles for 100 * StackBuffer (rep stosd)
8977    kCycles for 100 * dedndave

751     kCycles for 100 * HeapAlloc
12305   kCycles for 100 * StackBuffer (xmm)
8921    kCycles for 100 * StackBuffer (rep stosd)
8920    kCycles for 100 * dedndave

104     bytes for HeapAlloc
9       bytes for StackBuffer (xmm)
8       bytes for StackBuffer (rep stosd)
42      bytes for dedndave

33      bytes for MbStackB
16      bytes for MbStackX

dedndave

prescott w/htt xp mce2005 sp3
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 246/100 cycles

2048 bytes:
230072  cycles for 100 * HeapAlloc
59433   cycles for 100 * StackBuffer (xmm)
71350   cycles for 100 * StackBuffer (rep stosd)
71069   cycles for 100 * dedndave

229937  cycles for 100 * HeapAlloc
60252   cycles for 100 * StackBuffer (xmm)
71518   cycles for 100 * StackBuffer (rep stosd)
70532   cycles for 100 * dedndave

8192 bytes:
447586  cycles for 100 * HeapAlloc
233879  cycles for 100 * StackBuffer (xmm)
245398  cycles for 100 * StackBuffer (rep stosd)
245187  cycles for 100 * dedndave

447288  cycles for 100 * HeapAlloc
234628  cycles for 100 * StackBuffer (xmm)
246152  cycles for 100 * StackBuffer (rep stosd)
244349  cycles for 100 * dedndave

32768 bytes:
1304    kCycles for 100 * HeapAlloc
945     kCycles for 100 * StackBuffer (xmm)
925     kCycles for 100 * StackBuffer (rep stosd)
948     kCycles for 100 * dedndave

1274    kCycles for 100 * HeapAlloc
913     kCycles for 100 * StackBuffer (xmm)
932     kCycles for 100 * StackBuffer (rep stosd)
924     kCycles for 100 * dedndave

131072 bytes:
5997    kCycles for 100 * HeapAlloc
3639    kCycles for 100 * StackBuffer (xmm)
3682    kCycles for 100 * StackBuffer (rep stosd)
3704    kCycles for 100 * dedndave

4671    kCycles for 100 * HeapAlloc
3663    kCycles for 100 * StackBuffer (xmm)
3654    kCycles for 100 * StackBuffer (rep stosd)
3651    kCycles for 100 * dedndave

524288 bytes:
2084    kCycles for 100 * HeapAlloc
14688   kCycles for 100 * StackBuffer (xmm)
14708   kCycles for 100 * StackBuffer (rep stosd)
14847   kCycles for 100 * dedndave

2091    kCycles for 100 * HeapAlloc
14649   kCycles for 100 * StackBuffer (xmm)
14950   kCycles for 100 * StackBuffer (rep stosd)
16294   kCycles for 100 * dedndave

Gunther

StackBuffer6 brings:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 239/100 cycles

2048 bytes:
71208   cycles for 100 * HeapAlloc
20078   cycles for 100 * StackBuffer (xmm)
40880   cycles for 100 * StackBuffer (rep stosd)
40775   cycles for 100 * dedndave

68901   cycles for 100 * HeapAlloc
47258   cycles for 100 * StackBuffer (xmm)
40798   cycles for 100 * StackBuffer (rep stosd)
40362   cycles for 100 * dedndave

8192 bytes:
111836  cycles for 100 * HeapAlloc
129559  cycles for 100 * StackBuffer (xmm)
122825  cycles for 100 * StackBuffer (rep stosd)
122018  cycles for 100 * dedndave

104378  cycles for 100 * HeapAlloc
55644   cycles for 100 * StackBuffer (xmm)
122728  cycles for 100 * StackBuffer (rep stosd)
122127  cycles for 100 * dedndave

32768 bytes:
262658  cycles for 100 * HeapAlloc
209534  cycles for 100 * StackBuffer (xmm)
196808  cycles for 100 * StackBuffer (rep stosd)
187779  cycles for 100 * dedndave

604     kCycles for 100 * HeapAlloc
477     kCycles for 100 * StackBuffer (xmm)
445     kCycles for 100 * StackBuffer (rep stosd)
443     kCycles for 100 * dedndave

131072 bytes:
1203    kCycles for 100 * HeapAlloc
1090    kCycles for 100 * StackBuffer (xmm)
1178    kCycles for 100 * StackBuffer (rep stosd)
1197    kCycles for 100 * dedndave

1843    kCycles for 100 * HeapAlloc
1108    kCycles for 100 * StackBuffer (xmm)
1159    kCycles for 100 * StackBuffer (rep stosd)
1195    kCycles for 100 * dedndave

524288 bytes:
586     kCycles for 100 * HeapAlloc
6736    kCycles for 100 * StackBuffer (xmm)
5396    kCycles for 100 * StackBuffer (rep stosd)
4757    kCycles for 100 * dedndave

591     kCycles for 100 * HeapAlloc
6740    kCycles for 100 * StackBuffer (xmm)
5389    kCycles for 100 * StackBuffer (rep stosd)
4787    kCycles for 100 * dedndave

104     bytes for HeapAlloc
9       bytes for StackBuffer (xmm)
8       bytes for StackBuffer (rep stosd)
42      bytes for dedndave

33      bytes for MbStackB
16      bytes for MbStackX

--- ok ---


Gunther
You have to know the facts before you can distort them.

jj2007

OK, thanks to everybody :icon14:

The new StackBuffer() is now implemented in MasmBasic of 30 Oct. In the end, rep stosd won the race. Usage examples:

          mov sbuf1, StackBuffer(4000h)       ; buffer is 16-byte aligned for use with SSE2
          invoke GetFileSize, hFile, 0        ; you may use a register to specify the buffer size
          mov sbuf2, StackBuffer(eax, nz)     ; option nz means "no zeroing" - much faster, of course
          ...
          StackBuffer()                       ; release all buffers (sb without args = free the buffer)

With the nz option, the macro only probes the stack and zeroes the last two bytes of the buffer, plus two bytes beyond the buffer. This lets you load e.g. a text file into the buffer and be sure that the end is zero-delimited.
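
The effect of those four zero bytes can be modelled in a few lines of Python. This is only an illustration of the idea, not the macro: stack_buffer_nz is a hypothetical helper, the 0xCC fill stands in for whatever garbage is on the stack, and the "file" is just a byte string of exactly the requested size.

```python
def stack_buffer_nz(size, fill=0xCC):
    """Model of a non-zeroed buffer: 'size' usable bytes plus 2 bytes beyond it."""
    if size < 2:
        raise ValueError("model needs at least 2 bytes")
    mem = bytearray([fill]) * (size + 2)      # garbage everywhere
    mem[size - 2:size + 2] = b"\0\0\0\0"      # last 2 bytes + 2 bytes beyond
    return mem

data = b"some text file content"              # stands in for the file on disk
buf = stack_buffer_nz(len(data))              # size taken from the file size
buf[:len(data)] = data                        # "read" the file into the buffer

# The file overwrote the last two bytes, but the two bytes beyond are still zero.
print(bytes(buf[len(data):]))
print(bytes(buf[:buf.index(0)]) == data)
```

The two bytes inside the buffer get overwritten when the file fills it completely; the two bytes beyond it are what keep the text zero-delimited in that case.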

Gunther

Jochen,

Quote from: jj2007 on October 30, 2013, 11:16:13 AM
The new StackBuffer() is now implemented in MasmBasic of 30 Oct
:t
