Author Topic: Zero a stack buffer (and probe it)

Gunther

  • Member
  • *****
  • Posts: 3723
  • Forgive your enemies, but never forget their names
Re: Zero a stack buffer (and probe it)
« Reply #60 on: October 28, 2013, 02:18:17 AM »
Jochen,

StackBuffer3.exe gives this result:

Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 193/100 cycles

512000 bytes:
12600   kCycles for 100 * rep stosd
4696    kCycles for 100 * HeapAlloc
4672    kCycles for 100 * StackBuffer (with zeroing)
4538    kCycles for 100 * dedndave
4593    kCycles for 100 * rep stosd up (no probing)

12567   kCycles for 100 * rep stosd
5334    kCycles for 100 * HeapAlloc
5236    kCycles for 100 * StackBuffer (with zeroing)
4937    kCycles for 100 * dedndave
4692    kCycles for 100 * rep stosd up (no probing)

12518   kCycles for 100 * rep stosd
4685    kCycles for 100 * HeapAlloc
4674    kCycles for 100 * StackBuffer (with zeroing)
5286    kCycles for 100 * dedndave
4666    kCycles for 100 * rep stosd up (no probing)

18      bytes for rep stosd
103     bytes for HeapAlloc
54      bytes for StackBuffer (with zeroing)
41      bytes for dedndave
17      bytes for rep stosd up (no probing)

--- ok ---

Gunther
Get your facts first, and then you can distort them.

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
Re: Zero a stack buffer (and probe it)
« Reply #61 on: October 28, 2013, 02:32:04 AM »
prescott w/htt
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 261/100 cycles

512000 bytes:
26444   kCycles for 100 * rep stosd
22020   kCycles for 100 * HeapAlloc
16433   kCycles for 100 * StackBuffer (with zeroing)
15254   kCycles for 100 * dedndave
15176   kCycles for 100 * rep stosd up (no probing)

26155   kCycles for 100 * rep stosd
17181   kCycles for 100 * HeapAlloc
16086   kCycles for 100 * StackBuffer (with zeroing)
15254   kCycles for 100 * dedndave
15979   kCycles for 100 * rep stosd up (no probing)

26160   kCycles for 100 * rep stosd
17103   kCycles for 100 * HeapAlloc
16196   kCycles for 100 * StackBuffer (with zeroing)
15333   kCycles for 100 * dedndave
15132   kCycles for 100 * rep stosd up (no probing)

--- ok ---

loop overhead is approx. 254/100 cycles

512000 bytes:
26153   kCycles for 100 * rep stosd
22074   kCycles for 100 * HeapAlloc
16154   kCycles for 100 * StackBuffer (with zeroing)
15852   kCycles for 100 * dedndave
15254   kCycles for 100 * rep stosd up (no probing)

26087   kCycles for 100 * rep stosd
16510   kCycles for 100 * HeapAlloc
16647   kCycles for 100 * StackBuffer (with zeroing)
15258   kCycles for 100 * dedndave
15187   kCycles for 100 * rep stosd up (no probing)

26145   kCycles for 100 * rep stosd
16325   kCycles for 100 * HeapAlloc
16303   kCycles for 100 * StackBuffer (with zeroing)
15257   kCycles for 100 * dedndave
15032   kCycles for 100 * rep stosd up (no probing)

Siekmanski

  • Member
  • *****
  • Posts: 2365
Re: Zero a stack buffer (and probe it)
« Reply #62 on: October 28, 2013, 04:03:55 AM »
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 542/100 cycles

512000 bytes:
14386   kCycles for 100 * rep stosd
5242    kCycles for 100 * HeapAlloc
5160    kCycles for 100 * StackBuffer (with zeroing)
5182    kCycles for 100 * dedndave
5283    kCycles for 100 * rep stosd up (no probing)

14350   kCycles for 100 * rep stosd
5261    kCycles for 100 * HeapAlloc
5202    kCycles for 100 * StackBuffer (with zeroing)
5187    kCycles for 100 * dedndave
5258    kCycles for 100 * rep stosd up (no probing)

14356   kCycles for 100 * rep stosd
5276    kCycles for 100 * HeapAlloc
5216    kCycles for 100 * StackBuffer (with zeroing)
5154    kCycles for 100 * dedndave
5289    kCycles for 100 * rep stosd up (no probing)

18      bytes for rep stosd
103     bytes for HeapAlloc
54      bytes for StackBuffer (with zeroing)
41      bytes for dedndave
17      bytes for rep stosd up (no probing)
Creative coders use backward thinking techniques as a strategy.

nidud

  • Member
  • *****
  • Posts: 2216
    • https://github.com/nidud/asmc
Re: Zero a stack buffer (and probe it)
« Reply #63 on: October 28, 2013, 04:48:24 AM »
A valid thought but
a) it requires more discipline with linker settings, environment variables etc
b) that local buffer is still full of garbage and
c) it is not aligned for use with SSE2
 ;)

 :biggrin:

Code: [Select]
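bufsize=102400

StackBuffer proc uses edi       ; edi is callee-saved and gets clobbered by rep stosd
local buf[bufsize+16]:byte      ; 16 spare bytes leave room for the alignment below
local buffer:dword
lea eax,buf
and al,0F0h                     ; round down to a 16-byte boundary ...
add eax,16                      ; ... then step up so the pointer stays inside buf
mov buffer,eax
mov edi,eax
sub eax,eax
mov ecx,bufsize/4               ; rep stosd counts dwords, not bytes
rep stosd
; ... use the buffer ...
ret
StackBuffer endp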

if (%1) == () goto probe
link /stack:%1,%1 ...
goto end
:probe
makeit 102416
:end

so, there you go  :lol:

jj2007

  • Member
  • *****
  • Posts: 11551
  • Assembler is fun ;-)
    • MasmBasic
Re: Zero a stack buffer (and have fun)
« Reply #64 on: October 28, 2013, 06:24:57 AM »
SbTestJ proc uses esi MySize
  mov esi, StackBuffer(MySize)   ; works like a charm, no linker options needed
  ; ... use the buffer ...
  StackBuffer()
  ret
SbTestJ endp

SbTestN proc uses esi MySize
local buf[MySize+16]:byte   ; error A2026: constant expected
local buffer:dword
  lea eax,buf
  and al,0F0h
  add eax,16
  mov buffer,eax
  mov edi,eax
  sub eax,eax
  mov ecx,bufsize
  rep stosd
  ; ... use the buffer ...
  ret
SbTestN endp

 ;)

nidud

Re: Zero a stack buffer (and probe it)
« Reply #65 on: October 28, 2013, 07:58:57 AM »
If you plan on calling this macro frequently it may be better to set the stack one time, either using a link-switch or a function to avoid the probing. The stack could then be used by the alloc function if that was the intended usage.

Code: [Select]
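new_stack proc stklen:dword
mov eax,esp
mov edx,eax             ; save esp so it can be restored after probing
mov ecx,stklen
sub eax,ecx             ; eax = lowest address the stack must reach
ASSUME FS:NOTHING
.if eax < fs:[8]        ; fs:[8] = current stack limit (lowest committed address)
    shr ecx,2           ; byte count -> dword count
    .repeat
push eax                ; each push touches the stack, committing pages on the way down
    .untilcxz
.endif
ASSUME FS:ERROR
mov esp,edx             ; back to the original stack pointer
ret
new_stack endp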

start:
invoke new_stack,bufsize
...

I would however prefer the switch option.

jj2007

Re: Zero a stack buffer (and probe it)
« Reply #66 on: October 28, 2013, 09:01:16 AM »
Quote from: nidud on October 28, 2013, 07:58:57 AM
If you plan on calling this macro frequently it may be better to set the stack one time

No need for doing that. Dave's fs:[8] loop is extremely clever - if the stack is already committed, it costs just 1 or 2 cycles.
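
For anyone reading along without the earlier pages: Dave's loop itself is not reproduced in this part of the thread, so the following is only a minimal sketch of how an fs:[8] probe of this kind can work, not his actual code. Assumptions: 32-bit Windows, where fs:[8] is the StackLimit field of the thread information block (the lowest stack address committed so far), and a standard MASM32 install for the include line. ProbeDemo and PROBE_BYTES are made-up names for illustration, and the zeroing step is left out.

```asm
include \masm32\include\masm32rt.inc    ; assumes a standard MASM32 install

PROBE_BYTES equ 512000          ; hypothetical buffer size, same as in the timings

.code
ProbeDemo proc
    push ebp
    mov ebp, esp                ; remember the caller's stack pointer
    ASSUME FS:NOTHING
    mov eax, esp
    sub eax, PROBE_BYTES        ; eax = lowest address the buffer will occupy
    and al, -16                 ; round down to a 16-byte boundary for SSE2
  @@:
    push eax                    ; write just below esp (lands in the guard page if uncommitted)
    mov esp, fs:[8]             ; continue from the current stack limit
    cmp eax, esp
    jb @B                       ; limit still above the target: probe one page lower
    mov esp, eax                ; esp now points at a committed, aligned buffer
    ASSUME FS:ERROR
    ; ... use PROBE_BYTES bytes starting at [esp] ...
    mov esp, ebp                ; release the buffer
    pop ebp
    ret
ProbeDemo endp

start:
    call ProbeDemo
    exit
end start
```

If the stack is already committed down to the target address, the loop body runs once and falls through, which is why the probe is nearly free in that case. Otherwise each push lands in the guard page just below the limit, Windows commits that page and lowers fs:[8], and the loop repeats one page further down.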

Farabi

  • Member
  • ****
  • Posts: 969
  • Neuroscience Fans
Re: Zero a stack buffer (and probe it)
« Reply #67 on: October 28, 2013, 12:49:12 PM »
I'm curious: movups is slower than conventional instructions, but it handles wider data, right? Push edx writes 4 bytes; what about movups? I think it moves 8 to 16 bytes, so if it takes 22 clocks, that figure should be divided by 2 or 4 to get a byte transfer rate comparable with push.
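
On the byte counts: push edx writes 4 bytes, while movups moves a full 16-byte XMM register, so a per-instruction cycle count has to be divided by the bytes moved before the two can be compared. Below is a minimal sketch of a movups zeroing loop, for illustration only. It is not one of the routines timed in this thread; BUF_BYTES and buf are made-up names, and a standard MASM32 install is assumed.

```asm
include \masm32\include\masm32rt.inc    ; assumes a standard MASM32 install
.686
.xmm                                    ; enable the SSE instructions

BUF_BYTES equ 512000                    ; same size as the timings in this thread

.data?
buf db BUF_BYTES dup (?)

.code
start:
    mov edx, offset buf
    mov ecx, BUF_BYTES/16               ; one iteration per 16-byte store
    xorps xmm0, xmm0                    ; xmm0 = 16 zero bytes
  @@:
    movups [edx], xmm0                  ; stores 16 bytes; push or stosd stores 4
    add edx, 16
    dec ecx
    jnz @B
    exit
end start
```

For 512000 bytes this loop executes 32000 stores, where a push or stosd based loop needs 128000 dword stores.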

jj2007

Re: Zero a stack buffer (and probe it)
« Reply #68 on: October 28, 2013, 07:03:06 PM »
Just stumbled over an oddity with HeapAlloc: It gets very, very slow for a small range of bytes requested (Win7-32):

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles

512000 bytes:
19585   kCycles for 100 * rep stosd <<< the reference

512000 bytes:
19662   kCycles for 100 * HeapAlloc <<< so far so good
519168 bytes:
131619  kCycles for 100 * HeapAlloc <<< oops
520192 bytes:
915     kCycles for 100 * HeapAlloc <<< VirtualAlloc kicks in

19565   kCycles for 100 * rep stosd

512000 bytes:
19667   kCycles for 100 * HeapAlloc
519168 bytes:
133256  kCycles for 100 * HeapAlloc
520192 bytes:
932     kCycles for 100 * HeapAlloc

Siekmanski

Re: Zero a stack buffer (and probe it)
« Reply #69 on: October 28, 2013, 10:58:06 PM »
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 575/100 cycles

512000 bytes:
14335   kCycles for 100 * rep stosd


512000 bytes:
5293    kCycles for 100 * HeapAlloc
519168 bytes:
5381    kCycles for 100 * HeapAlloc
520192 bytes:
1314    kCycles for 100 * HeapAlloc

14333   kCycles for 100 * rep stosd


512000 bytes:
5311    kCycles for 100 * HeapAlloc
519168 bytes:
5340    kCycles for 100 * HeapAlloc
520192 bytes:
1314    kCycles for 100 * HeapAlloc

18      bytes for rep stosd
104     bytes for HeapAlloc

dedndave

Re: Zero a stack buffer (and probe it)
« Reply #70 on: October 28, 2013, 11:07:19 PM »
XP MCE2005 SP3
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 427/100 cycles

512000 bytes:
34190   kCycles for 100 * rep stosd

512000 bytes:
28714   kCycles for 100 * HeapAlloc
519168 bytes:
23241   kCycles for 100 * HeapAlloc
520192 bytes:
2671    kCycles for 100 * HeapAlloc

34132   kCycles for 100 * rep stosd

512000 bytes:
23103   kCycles for 100 * HeapAlloc
519168 bytes:
23419   kCycles for 100 * HeapAlloc
520192 bytes:
2635    kCycles for 100 * HeapAlloc

jj2007

Re: Zero a stack buffer (and probe it)
« Reply #71 on: October 28, 2013, 11:22:36 PM »
Thanks, Marinus & Dave :icon14:
The switch to VirtualAlloc shows up on your machines too, but not the slowdown just below it. Could be Win-7 only, or some special feature of my machine ::)

dedndave

Re: Zero a stack buffer (and probe it)
« Reply #72 on: October 28, 2013, 11:28:18 PM »
see if you can test a single pass
HeapAlloc may not like being in a x100 loop   :P

jj2007

Re: Zero a stack buffer (and probe it)
« Reply #73 on: October 28, 2013, 11:41:10 PM »
Quote from: dedndave on October 28, 2013, 11:28:18 PM
see if you can test a single pass
HeapAlloc may not like being in a x100 loop   :P

It's quite happy to be in that loop for everything below 500k*1.01 and above 500k*1.016... and switching to e.g. 5 loops doesn't change the pattern. Weird.

dedndave

Re: Zero a stack buffer (and probe it)
« Reply #74 on: October 29, 2013, 01:28:57 AM »
that makes me wonder if there are other "holes" in the number line
and - is it specific to your hardware in some way
say, if you had more memory - would it act differently
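
One way to look for further holes is to sweep the request size in small steps across the range in question. The following is a minimal sketch, not the code behind the timings above. It assumes HeapAlloc is called with HEAP_ZERO_MEMORY on the default process heap and a standard MASM32 install, and it keeps only the low dword of the time stamp counter. The 1024-byte step and the 524288-byte upper bound are arbitrary choices. For orientation: 520192 bytes is 512000*1.016, and also 7F000h, i.e. 512 KB minus one 4 KB page.

```asm
include \masm32\include\masm32rt.inc    ; assumes a standard MASM32 install
.686                                    ; rdtsc is not available under .486

.data?
hHeap dd ?

.code
start:
    mov hHeap, rv(GetProcessHeap)       ; default process heap (assumption)
    mov esi, 512000                     ; first request size in bytes
SizeLoop:
    rdtsc
    mov edi, eax                        ; low dword of the start time stamp
    mov ebx, 100                        ; same repeat count as the tests above
  @@:
    invoke HeapAlloc, hHeap, HEAP_ZERO_MEMORY, esi
    invoke HeapFree, hHeap, 0, eax
    dec ebx
    jnz @B
    rdtsc
    sub eax, edi                        ; elapsed cycles, low dword only
    xor edx, edx
    mov ecx, 1000
    div ecx                             ; eax = elapsed kCycles
    mov edi, eax                        ; keep the result across the print calls
    print str$(esi), " bytes: "
    print str$(edi), " kCycles", 13, 10
    add esi, 1024                       ; next size, one step up
    cmp esi, 524288                     ; stop after 512 KB
    jbe SizeLoop
    exit
end start
```

With these settings the sweep passes through both 519168 and 520192 bytes, so the slow spot and the switch to VirtualAlloc should each show up as one line of output if they exist on the machine under test. Widening the range or shrinking the step would show whether there are other slow spots.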