Zero a stack buffer (and probe it)

Started by jj2007, October 25, 2013, 07:31:54 PM

Gunther

Jochen,

StackBuffer3.exe comes up with this result:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 193/100 cycles

512000 bytes:
12600   kCycles for 100 * rep stosd
4696    kCycles for 100 * HeapAlloc
4672    kCycles for 100 * StackBuffer (with zeroing)
4538    kCycles for 100 * dedndave
4593    kCycles for 100 * rep stosd up (no probing)

12567   kCycles for 100 * rep stosd
5334    kCycles for 100 * HeapAlloc
5236    kCycles for 100 * StackBuffer (with zeroing)
4937    kCycles for 100 * dedndave
4692    kCycles for 100 * rep stosd up (no probing)

12518   kCycles for 100 * rep stosd
4685    kCycles for 100 * HeapAlloc
4674    kCycles for 100 * StackBuffer (with zeroing)
5286    kCycles for 100 * dedndave
4666    kCycles for 100 * rep stosd up (no probing)

18      bytes for rep stosd
103     bytes for HeapAlloc
54      bytes for StackBuffer (with zeroing)
41      bytes for dedndave
17      bytes for rep stosd up (no probing)

--- ok ---


Gunther
You have to know the facts before you can distort them.

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 261/100 cycles

512000 bytes:
26444   kCycles for 100 * rep stosd
22020   kCycles for 100 * HeapAlloc
16433   kCycles for 100 * StackBuffer (with zeroing)
15254   kCycles for 100 * dedndave
15176   kCycles for 100 * rep stosd up (no probing)

26155   kCycles for 100 * rep stosd
17181   kCycles for 100 * HeapAlloc
16086   kCycles for 100 * StackBuffer (with zeroing)
15254   kCycles for 100 * dedndave
15979   kCycles for 100 * rep stosd up (no probing)

26160   kCycles for 100 * rep stosd
17103   kCycles for 100 * HeapAlloc
16196   kCycles for 100 * StackBuffer (with zeroing)
15333   kCycles for 100 * dedndave
15132   kCycles for 100 * rep stosd up (no probing)

--- ok ---

loop overhead is approx. 254/100 cycles

512000 bytes:
26153   kCycles for 100 * rep stosd
22074   kCycles for 100 * HeapAlloc
16154   kCycles for 100 * StackBuffer (with zeroing)
15852   kCycles for 100 * dedndave
15254   kCycles for 100 * rep stosd up (no probing)

26087   kCycles for 100 * rep stosd
16510   kCycles for 100 * HeapAlloc
16647   kCycles for 100 * StackBuffer (with zeroing)
15258   kCycles for 100 * dedndave
15187   kCycles for 100 * rep stosd up (no probing)

26145   kCycles for 100 * rep stosd
16325   kCycles for 100 * HeapAlloc
16303   kCycles for 100 * StackBuffer (with zeroing)
15257   kCycles for 100 * dedndave
15032   kCycles for 100 * rep stosd up (no probing)

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 542/100 cycles

512000 bytes:
14386   kCycles for 100 * rep stosd
5242    kCycles for 100 * HeapAlloc
5160    kCycles for 100 * StackBuffer (with zeroing)
5182    kCycles for 100 * dedndave
5283    kCycles for 100 * rep stosd up (no probing)

14350   kCycles for 100 * rep stosd
5261    kCycles for 100 * HeapAlloc
5202    kCycles for 100 * StackBuffer (with zeroing)
5187    kCycles for 100 * dedndave
5258    kCycles for 100 * rep stosd up (no probing)

14356   kCycles for 100 * rep stosd
5276    kCycles for 100 * HeapAlloc
5216    kCycles for 100 * StackBuffer (with zeroing)
5154    kCycles for 100 * dedndave
5289    kCycles for 100 * rep stosd up (no probing)

18      bytes for rep stosd
103     bytes for HeapAlloc
54      bytes for StackBuffer (with zeroing)
41      bytes for dedndave
17      bytes for rep stosd up (no probing)
Creative coders use backward thinking techniques as a strategy.

nidud

#63
deleted

jj2007

SbTestJ proc uses esi MySize
  mov esi, StackBuffer(MySize)   ; works like a charm, no linker options needed
  ; ... use the buffer ...
  StackBuffer()
  ret
SbTestJ endp

SbTestN proc uses esi MySize
local buf[MySize+16]:byte   ; error A2026: constant expected
local buffer:dword
  lea eax,buf
  and al,0F0h
  add eax,16
  mov buffer,eax
  mov edi,eax
  sub eax,eax
  mov ecx,bufsize
  rep stosd
  ; ... use the buffer ...
  ret
SbTestN endp

;)

nidud

#65
deleted

jj2007

Quote from: nidud on October 28, 2013, 07:58:57 AM
If you plan on calling this macro frequently it may be better to set the stack one time

No need for doing that. Dave's fs:[8] loop is extremely clever - if the stack is already committed, it costs just 1 or 2 cycles.
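
For readers who have not seen the trick: on 32-bit Windows, fs:[8] is the StackLimit field of the thread information block, i.e. the lowest stack address committed so far. A minimal sketch of such a probe loop might look like the following. It is not Dave's actual routine; the proc name ProbeSketch and the 16-byte alignment are assumptions made for illustration, and it expects a flat, stdcall source such as one that includes masm32rt.inc.

```asm
; Hypothetical sketch of a guard-page probe loop (32-bit Windows only).
; Not the routine benchmarked in this thread.
ProbeSketch proc bytes:DWORD        ; the argument forces an ebp stack frame
    assume fs:nothing               ; flat model assumes fs:error by default
    mov   eax, esp
    sub   eax, bytes                ; eax = where the buffer should start
    and   eax, -16                  ; align down to 16 bytes (assumption)
@@: push  edx                       ; dummy write just below esp; at StackLimit it touches the guard page
    mov   esp, fs:[8]               ; esp = StackLimit (lowest committed stack address)
    cmp   eax, esp
    jb    @B                        ; buffer start still below committed memory: commit one more page
    mov   esp, eax                  ; everything from eax upwards is now committed
    ; ... use the buffer at eax (it is not zeroed here) ...
    ret                             ; the epilogue's "leave" restores the original esp
ProbeSketch endp
```

If the target address already lies in committed stack, the loop body runs exactly once, which is consistent with the near-zero cost described above. No "uses" clause is given on purpose: the generated pops would read from the relocated esp.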

Farabi

I'm curious: movups is slower than conventional instructions, but it was for double data, right? Push edx is for 4 bytes; what about movups? I think it was 8 to 16 bytes, so if the measured clock count was 22, it should be divided by 2 or 4 to get the per-byte transfer rate.

jj2007

Just stumbled over an oddity with HeapAlloc: It gets very, very slow for a small range of bytes requested (Win7-32):

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles

512000 bytes:
19585   kCycles for 100 * rep stosd <<< the reference

512000 bytes:
19662   kCycles for 100 * HeapAlloc <<< so far so good
519168 bytes:
131619  kCycles for 100 * HeapAlloc <<< oops
520192 bytes:
915     kCycles for 100 * HeapAlloc <<< VirtualAlloc kicks in

19565   kCycles for 100 * rep stosd

512000 bytes:
19667   kCycles for 100 * HeapAlloc
519168 bytes:
133256  kCycles for 100 * HeapAlloc
520192 bytes:
932     kCycles for 100 * HeapAlloc

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 575/100 cycles

512000 bytes:
14335   kCycles for 100 * rep stosd


512000 bytes:
5293    kCycles for 100 * HeapAlloc
519168 bytes:
5381    kCycles for 100 * HeapAlloc
520192 bytes:
1314    kCycles for 100 * HeapAlloc

14333   kCycles for 100 * rep stosd


512000 bytes:
5311    kCycles for 100 * HeapAlloc
519168 bytes:
5340    kCycles for 100 * HeapAlloc
520192 bytes:
1314    kCycles for 100 * HeapAlloc

18      bytes for rep stosd
104     bytes for HeapAlloc

dedndave

XP MCE2005 SP3
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 427/100 cycles

512000 bytes:
34190   kCycles for 100 * rep stosd

512000 bytes:
28714   kCycles for 100 * HeapAlloc
519168 bytes:
23241   kCycles for 100 * HeapAlloc
520192 bytes:
2671    kCycles for 100 * HeapAlloc

34132   kCycles for 100 * rep stosd

512000 bytes:
23103   kCycles for 100 * HeapAlloc
519168 bytes:
23419   kCycles for 100 * HeapAlloc
520192 bytes:
2635    kCycles for 100 * HeapAlloc

jj2007

Thanks, Marinus & Dave :icon14:
The switch to VirtualAlloc shows up on your machines too, but not the slowdown just below it. Could be Win-7 only, or some special feature of my machine ::)

dedndave

see if you can test a single pass
HeapAlloc may not like being in a x100 loop   :P
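
A single-pass test along those lines could be sketched as below. This is not the code behind the numbers in this thread. It assumes the MASM32 SDK (masm32rt.inc with its print, str$ and exit macros), assumes the timed call zeroes the block via HEAP_ZERO_MEMORY, and reads only the low dword of rdtsc without serialising, so the figure is rough.

```asm
; Hypothetical single-pass timing sketch; MASM32 SDK assumed.
include \masm32\include\masm32rt.inc
.586                                ; rdtsc needs at least .586

.code
start:
    invoke GetProcessHeap
    mov   ebx, eax                  ; ebx = default process heap
    rdtsc
    mov   esi, eax                  ; low dword of the start count
    invoke HeapAlloc, ebx, HEAP_ZERO_MEMORY, 519168
    mov   edi, eax                  ; edi = block pointer (0 on failure)
    rdtsc
    sub   eax, esi                  ; elapsed cycles, low dword only
    print str$(eax), " cycles for one HeapAlloc of 519168 bytes", 13, 10
    invoke HeapFree, ebx, 0, edi
    exit
end start
```

Changing the size constant to 512000 or 520192 gives the other two cases from the tables above. The first call in a fresh process also pays one-time costs, so a single pass is not directly comparable with the x100 averages.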

jj2007

Quote from: dedndave on October 28, 2013, 11:28:18 PM
see if you can test a single pass
HeapAlloc may not like being in a x100 loop   :P

It's quite happy to be in that loop for everything below 500k*1.01 and above 500k*1.016... and switching to e.g. 5 loops doesn't change the pattern. Weird.

dedndave

that makes me wonder if there are other "holes" in the number line
and - is it specific to your hardware in some way
say, if you had more memory - would it act differently