Zero a stack buffer (and probe it)

Gunther · October 28, 2013, 02:18:17 AM

Jochen,

StackBuffer3.exe produces this result:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 193/100 cycles

512000 bytes:
12600   kCycles for 100 * rep stosd
4696    kCycles for 100 * HeapAlloc
4672    kCycles for 100 * StackBuffer (with zeroing)
4538    kCycles for 100 * dedndave
4593    kCycles for 100 * rep stosd up (no probing)

12567   kCycles for 100 * rep stosd
5334    kCycles for 100 * HeapAlloc
5236    kCycles for 100 * StackBuffer (with zeroing)
4937    kCycles for 100 * dedndave
4692    kCycles for 100 * rep stosd up (no probing)

12518   kCycles for 100 * rep stosd
4685    kCycles for 100 * HeapAlloc
4674    kCycles for 100 * StackBuffer (with zeroing)
5286    kCycles for 100 * dedndave
4666    kCycles for 100 * rep stosd up (no probing)

18      bytes for rep stosd
103     bytes for HeapAlloc
54      bytes for StackBuffer (with zeroing)
41      bytes for dedndave
17      bytes for rep stosd up (no probing)

--- ok ---

Gunther

dedndave · October 28, 2013, 02:32:04 AM

prescott w/htt

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 261/100 cycles

512000 bytes:
26444   kCycles for 100 * rep stosd
22020   kCycles for 100 * HeapAlloc
16433   kCycles for 100 * StackBuffer (with zeroing)
15254   kCycles for 100 * dedndave
15176   kCycles for 100 * rep stosd up (no probing)

26155   kCycles for 100 * rep stosd
17181   kCycles for 100 * HeapAlloc
16086   kCycles for 100 * StackBuffer (with zeroing)
15254   kCycles for 100 * dedndave
15979   kCycles for 100 * rep stosd up (no probing)

26160   kCycles for 100 * rep stosd
17103   kCycles for 100 * HeapAlloc
16196   kCycles for 100 * StackBuffer (with zeroing)
15333   kCycles for 100 * dedndave
15132   kCycles for 100 * rep stosd up (no probing)

--- ok ---

loop overhead is approx. 254/100 cycles

512000 bytes:
26153   kCycles for 100 * rep stosd
22074   kCycles for 100 * HeapAlloc
16154   kCycles for 100 * StackBuffer (with zeroing)
15852   kCycles for 100 * dedndave
15254   kCycles for 100 * rep stosd up (no probing)

26087   kCycles for 100 * rep stosd
16510   kCycles for 100 * HeapAlloc
16647   kCycles for 100 * StackBuffer (with zeroing)
15258   kCycles for 100 * dedndave
15187   kCycles for 100 * rep stosd up (no probing)

26145   kCycles for 100 * rep stosd
16325   kCycles for 100 * HeapAlloc
16303   kCycles for 100 * StackBuffer (with zeroing)
15257   kCycles for 100 * dedndave
15032   kCycles for 100 * rep stosd up (no probing)

Siekmanski · October 28, 2013, 04:03:55 AM

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 542/100 cycles

512000 bytes:
14386 kCycles for 100 * rep stosd
5242 kCycles for 100 * HeapAlloc
5160 kCycles for 100 * StackBuffer (with zeroing)
5182 kCycles for 100 * dedndave
5283 kCycles for 100 * rep stosd up (no probing)

14350 kCycles for 100 * rep stosd
5261 kCycles for 100 * HeapAlloc
5202 kCycles for 100 * StackBuffer (with zeroing)
5187 kCycles for 100 * dedndave
5258 kCycles for 100 * rep stosd up (no probing)

14356 kCycles for 100 * rep stosd
5276 kCycles for 100 * HeapAlloc
5216 kCycles for 100 * StackBuffer (with zeroing)
5154 kCycles for 100 * dedndave
5289 kCycles for 100 * rep stosd up (no probing)

18 bytes for rep stosd
103 bytes for HeapAlloc
54 bytes for StackBuffer (with zeroing)
41 bytes for dedndave
17 bytes for rep stosd up (no probing)

nidud · October 28, 2013, 04:48:24 AM

deleted

jj2007 · October 28, 2013, 06:24:57 AM

SbTestJ proc uses esi MySize
    mov esi, StackBuffer(MySize) ; works like a charm, no linker options needed
    ; ... use the buffer ...
    StackBuffer()
    ret
SbTestJ endp

SbTestN proc uses esi MySize
    local buf[MySize+16]:byte ; error A2026: constant expected
    local buffer:dword
    lea eax,buf
    and al,0F0h               ; round down to a 16-byte boundary
    add eax,16                ; step up again so the pointer stays inside buf
    mov buffer,eax
    mov edi,eax
    sub eax,eax               ; eax = 0, the fill value for stosd
    mov ecx,bufsize           ; bufsize is not defined in this snippet
    rep stosd
    ; ... use the buffer ...
    ret
SbTestN endp
;)

nidud · October 28, 2013, 07:58:57 AM

deleted

jj2007 · October 28, 2013, 09:01:16 AM

Quote from: nidud on October 28, 2013, 07:58:57 AM
If you plan on calling this macro frequently it may be better to set the stack one time

There's no need for that. Dave's fs:[8] loop is extremely clever: if the stack is already committed, it costs just 1 or 2 cycles.
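
For readers unfamiliar with the trick: on 32-bit Windows, fs:[8] is the StackLimit field of the thread information block, i.e. the lowest stack address committed so far. A probe loop built on it might look like the sketch below. It only illustrates the idea and is not necessarily Dave's exact code; the procedure name ProbeDemo and the parameter nBytes are made up for the example, and the sketch probes the buffer without zeroing it.

```asm
.686
.model flat, stdcall
option casemap:none

.code
assume fs:nothing               ; flat model otherwise rejects fs: overrides

; Hypothetical helper: reserve nBytes on the stack,
; touching each guard page in order on the way down.
ProbeDemo proc nBytes:DWORD
    mov eax, esp
    sub eax, nBytes             ; eax = intended new stack pointer
    and al, -16                 ; align it down to 16 bytes
@@: push edx                    ; write one dword just below esp
    mov esp, fs:[8]             ; esp = TIB StackLimit (lowest committed address)
    cmp eax, esp
    jb @B                       ; target still below the limit: touch the next page
    mov esp, eax                ; the buffer now starts at esp
    ; ... use the buffer ...
    ret                         ; generated epilogue (leave) restores esp
ProbeDemo endp

end
```

If the requested range is already committed, the loop body runs once and falls through, which is where the very low cost comes from. Otherwise each push lands in the guard page just below StackLimit, the OS commits that page and lowers StackLimit, and the loop repeats until the target address is covered.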

Farabi · October 28, 2013, 12:49:12 PM

I'm curious: movups is slower than conventional instructions, but it handles wider data, right? push edx writes 4 bytes; what about movups? I think it was 8 to 16 bytes. So if the timing was 22 cycles, it should be divided by 2 or 4 to compare the byte transfer rate.

jj2007 · October 28, 2013, 07:03:06 PM

Just stumbled over an oddity with HeapAlloc: It gets very, very slow for a small range of bytes requested (Win7-32):

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles

512000 bytes:
19585 kCycles for 100 * rep stosd <<< the reference

512000 bytes:
19662 kCycles for 100 * HeapAlloc <<< so far so good
519168 bytes:
131619 kCycles for 100 * HeapAlloc <<< oops
520192 bytes:
915 kCycles for 100 * HeapAlloc <<< VirtualAlloc kicks in

19565 kCycles for 100 * rep stosd

512000 bytes:
19667 kCycles for 100 * HeapAlloc
519168 bytes:
133256 kCycles for 100 * HeapAlloc
520192 bytes:
932 kCycles for 100 * HeapAlloc
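
To try the three sizes outside the benchmark harness, a loop along the following lines could be assembled with the MASM32 SDK. This is a sketch under assumptions: the thread does not show the source of the HeapAlloc test, so the use of HEAP_ZERO_MEMORY and the alloc/free pairing per pass are guesses, the name HeapLoop is made up, and the timing code is left out. Note that 520192 is 7F000h.

```asm
include \masm32\include\masm32rt.inc    ; MASM32 SDK include, assumed installed

.code
; Hypothetical helper: allocate (zeroed) and free nBytes, nLoops times.
; nLoops must be at least 1.
HeapLoop proc uses ebx esi nBytes:DWORD, nLoops:DWORD
    invoke GetProcessHeap
    mov ebx, eax                ; ebx = handle of the default process heap
    mov esi, nLoops
  NextAlloc:
    invoke HeapAlloc, ebx, HEAP_ZERO_MEMORY, nBytes
    test eax, eax
    jz Done                     ; allocation failed: stop early
    invoke HeapFree, ebx, 0, eax
    dec esi
    jnz NextAlloc
  Done:
    ret
HeapLoop endp

start:
    invoke HeapLoop, 512000, 100    ; "so far so good"
    invoke HeapLoop, 519168, 100    ; the slow size on the Athlon/Win7-32 run
    invoke HeapLoop, 520192, 100    ; the size at which VirtualAlloc kicks in
    invoke ExitProcess, 0
end start
```

Wrapping each call in the same cycle counter used for the other tests would show whether the slowdown also appears with a plain alloc/free pair on a given machine.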

Siekmanski · October 28, 2013, 10:58:06 PM

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 575/100 cycles

512000 bytes:
14335 kCycles for 100 * rep stosd

512000 bytes:
5293 kCycles for 100 * HeapAlloc
519168 bytes:
5381 kCycles for 100 * HeapAlloc
520192 bytes:
1314 kCycles for 100 * HeapAlloc

14333 kCycles for 100 * rep stosd

512000 bytes:
5311 kCycles for 100 * HeapAlloc
519168 bytes:
5340 kCycles for 100 * HeapAlloc
520192 bytes:
1314 kCycles for 100 * HeapAlloc

18 bytes for rep stosd
104 bytes for HeapAlloc

dedndave · October 28, 2013, 11:07:19 PM

XP MCE2005 SP3

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 427/100 cycles

512000 bytes:
34190   kCycles for 100 * rep stosd

512000 bytes:
28714   kCycles for 100 * HeapAlloc
519168 bytes:
23241   kCycles for 100 * HeapAlloc
520192 bytes:
2671    kCycles for 100 * HeapAlloc

34132   kCycles for 100 * rep stosd

512000 bytes:
23103   kCycles for 100 * HeapAlloc
519168 bytes:
23419   kCycles for 100 * HeapAlloc
520192 bytes:
2635    kCycles for 100 * HeapAlloc

jj2007 · October 28, 2013, 11:22:36 PM

Thanks, Marinus & Dave.
The switch to VirtualAlloc shows up in your results, but not the slowdown just below it. That could be Win7 only, or some peculiarity of my machine.

dedndave · October 28, 2013, 11:28:18 PM

see if you can test a single pass
HeapAlloc may not like being in a x100 loop :P

jj2007 · October 28, 2013, 11:41:10 PM

Quote from: dedndave on October 28, 2013, 11:28:18 PM
see if you can test a single pass
HeapAlloc may not like being in a x100 loop :P

It's quite happy to be in that loop for everything below 500k*1.01 and above 500k*1.016... and switching to e.g. 5 loops doesn't change the pattern. Weird.

dedndave · October 29, 2013, 01:28:57 AM

that makes me wonder if there are other "holes" in the number line
and - is it specific to your hardware in some way
say, if you had more memory - would it act differently

The MASM Forum
