Topic: Zero a stack buffer (and probe it)

Siekmanski

  • Member
  • *****
  • Posts: 2365
Re: Zero a stack buffer (and probe it)
« Reply #45 on: October 27, 2013, 06:28:33 PM »
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles

2673    kCycles for 100 * rep stosd
1437    kCycles for 100 * HeapAlloc (*8 )
1667    kCycles for 100 * StackBuffer (with zeroing)
2611    kCycles for 100 * StackBuffer (unrolled)
1680    kCycles for 100 * movaps xmm0
1027    kCycles for 100 * rep stosd up
1680    kCycles for 100 * movaps xmm0 (down)
957     kCycles for 100 * movaps xmm0 (up)
973     kCycles for 100 * movaps xmm0 (unrolled)

2672    kCycles for 100 * rep stosd
1500    kCycles for 100 * HeapAlloc (*8 )
1687    kCycles for 100 * StackBuffer (with zeroing)
2608    kCycles for 100 * StackBuffer (unrolled)
1681    kCycles for 100 * movaps xmm0
1029    kCycles for 100 * rep stosd up
1699    kCycles for 100 * movaps xmm0 (down)
948     kCycles for 100 * movaps xmm0 (up)
982     kCycles for 100 * movaps xmm0 (unrolled)

2671    kCycles for 100 * rep stosd
1446    kCycles for 100 * HeapAlloc (*8 )
1677    kCycles for 100 * StackBuffer (with zeroing)
2607    kCycles for 100 * StackBuffer (unrolled)
1681    kCycles for 100 * movaps xmm0
1070    kCycles for 100 * rep stosd up
1678    kCycles for 100 * movaps xmm0 (down)
966     kCycles for 100 * movaps xmm0 (up)
994     kCycles for 100 * movaps xmm0 (unrolled)

18      bytes for rep stosd
103     bytes for HeapAlloc (*8 )
29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
25      bytes for movaps xmm0
17      bytes for rep stosd up
36      bytes for movaps xmm0 (down)
36      bytes for movaps xmm0 (up)
85      bytes for movaps xmm0 (unrolled)
Creative coders use backward thinking techniques as a strategy.

Gunther

  • Member
  • *****
  • Posts: 3722
  • Forgive your enemies, but never forget their names
« Reply #46 on: October 27, 2013, 08:29:54 PM »
Results with Dave's (KeepingRealBusy) version:

Code:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 201/100 cycles

2356    kCycles for 100 * rep stosd
943     kCycles for 100 * HeapAlloc (*8)
853     kCycles for 100 * StackBuffer (with zeroing)
2285    kCycles for 100 * StackBuffer (unrolled)
824     kCycles for 100 * movaps xmm0
886     kCycles for 100 * rep stosd up
825     kCycles for 100 * movaps xmm0 (down)
880     kCycles for 100 * movaps xmm0 (up)
1443    kCycles for 100 * movaps xmm0 (unrolled)

2354    kCycles for 100 * rep stosd
967     kCycles for 100 * HeapAlloc (*8)
877     kCycles for 100 * StackBuffer (with zeroing)
2300    kCycles for 100 * StackBuffer (unrolled)
846     kCycles for 100 * movaps xmm0
911     kCycles for 100 * rep stosd up
842     kCycles for 100 * movaps xmm0 (down)
883     kCycles for 100 * movaps xmm0 (up)
839     kCycles for 100 * movaps xmm0 (unrolled)

2948    kCycles for 100 * rep stosd
981     kCycles for 100 * HeapAlloc (*8)
845     kCycles for 100 * StackBuffer (with zeroing)
2286    kCycles for 100 * StackBuffer (unrolled)
865     kCycles for 100 * movaps xmm0
868     kCycles for 100 * rep stosd up
828     kCycles for 100 * movaps xmm0 (down)
844     kCycles for 100 * movaps xmm0 (up)
865     kCycles for 100 * movaps xmm0 (unrolled)

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
25      bytes for movaps xmm0
17      bytes for rep stosd up
36      bytes for movaps xmm0 (down)
36      bytes for movaps xmm0 (up)
85      bytes for movaps xmm0 (unrolled)

--- ok ---

Gunther
Get your facts first, and then you can distort them.

jj2007

  • Member
  • *****
  • Posts: 11527
  • Assembler is fun ;-)
    • MasmBasic
« Reply #47 on: October 27, 2013, 09:14:23 PM »
Quote
The modifications were to move the "constant" initializations out of the REPEAT loops

Dave,
That defeats the purpose of these loops to simulate a complete HeapAlloc/.../HeapFree sequence...

Dave (the dedn),
Your algo is now included in the testbed below. I have improved it so dramatically that you are now morally obliged to donate it to MasmBasic's StackBuffer() :icon_mrgreen:

        push edi
        push ebp
        mov ebp, esp
        mov edi, esp
        mov ecx, bufsize     ; to be replaced with immediate, global, local, reg etc
        sub edi, ecx
        and edi, -64         ; aligns buffer to a cache line
        ASSUME FS:Nothing
        .repeat
                push eax     ; tickle the guard page - limit might be 4k lower now
                mov esp, fs:[8]  ; esp = current stack limit (TIB StackLimit)
        .until edi>=esp      ; loop until we've got enough
        ASSUME FS:ERROR
        mov esp, edi         ; new stack
        add ecx, 3           ; bufsize might be badly aligned
        shr ecx, 2           ; stosD
        xor eax, eax
        rep stosd
        mov eax, esp         ; retval for macro
        ; ... use the buffer ...
        leave
        pop edi

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles

4743    kCycles for 100 * rep stosd
2232    kCycles for 100 * HeapAlloc (*8)
2201    kCycles for 100 * StackBuffer (with zeroing)
2208    kCycles for 100 * StackBuffer (unrolled)
1749    kCycles for 100 * dedndave
1746    kCycles for 100 * rep stosd up

4725    kCycles for 100 * rep stosd
1846    kCycles for 100 * HeapAlloc (*8)
2205    kCycles for 100 * StackBuffer (with zeroing)
2204    kCycles for 100 * StackBuffer (unrolled)
1747    kCycles for 100 * dedndave
1746    kCycles for 100 * rep stosd up

4726    kCycles for 100 * rep stosd
1850    kCycles for 100 * HeapAlloc (*8)
2203    kCycles for 100 * StackBuffer (with zeroing)
2203    kCycles for 100 * StackBuffer (unrolled)
1747    kCycles for 100 * dedndave
1746    kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
41      bytes for dedndave   <<<<<<<<<<<<<<<< BLOATWARE ALARM!!!
17      bytes for rep stosd up

Gunther

« Reply #48 on: October 27, 2013, 09:36:09 PM »
Okay Jochen, here we go again:
Code:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 268/100 cycles

2404    kCycles for 100 * rep stosd
1003    kCycles for 100 * HeapAlloc (*8)
887     kCycles for 100 * StackBuffer (with zeroing)
905     kCycles for 100 * StackBuffer (unrolled)
831     kCycles for 100 * dedndave
872     kCycles for 100 * rep stosd up

2342    kCycles for 100 * rep stosd
982     kCycles for 100 * HeapAlloc (*8)
859     kCycles for 100 * StackBuffer (with zeroing)
835     kCycles for 100 * StackBuffer (unrolled)
927     kCycles for 100 * dedndave
897     kCycles for 100 * rep stosd up

2346    kCycles for 100 * rep stosd
965     kCycles for 100 * HeapAlloc (*8)
838     kCycles for 100 * StackBuffer (with zeroing)
835     kCycles for 100 * StackBuffer (unrolled)
828     kCycles for 100 * dedndave
873     kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
41      bytes for dedndave
17      bytes for rep stosd up

--- ok ---

Gunther

jj2007

« Reply #49 on: October 27, 2013, 09:55:59 PM »
Thanks, Gunther. The dedndave algo is a clear winner, as expected. Almost ten times as fast as HeapAlloc is a good result :biggrin:

Gunther

« Reply #50 on: October 27, 2013, 10:33:42 PM »
Jochen,

Quote
... The dedndave algo is a clear winner, as expected. Almost ten times as fast as HeapAlloc is a good result :biggrin:

that's evident. Congrats, Dave.  :t

Gunther

dedndave

  • Member
  • *****
  • Posts: 8829
  • Still using Abacus 2.0
    • DednDave
« Reply #51 on: October 27, 2013, 11:18:21 PM »
version 3 (DD) prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 253/100 cycles

5243    kCycles for 100 * rep stosd
4087    kCycles for 100 * HeapAlloc (*8 )
2834    kCycles for 100 * StackBuffer (with zeroing)
2834    kCycles for 100 * StackBuffer (unrolled)
2905    kCycles for 100 * dedndave
2861    kCycles for 100 * rep stosd up

5121    kCycles for 100 * rep stosd
3114    kCycles for 100 * HeapAlloc (*8 )
2861    kCycles for 100 * StackBuffer (with zeroing)
2842    kCycles for 100 * StackBuffer (unrolled)
2855    kCycles for 100 * dedndave
2873    kCycles for 100 * rep stosd up

5157    kCycles for 100 * rep stosd
3039    kCycles for 100 * HeapAlloc (*8 )
2893    kCycles for 100 * StackBuffer (with zeroing)
2867    kCycles for 100 * StackBuffer (unrolled)
2849    kCycles for 100 * dedndave
2845    kCycles for 100 * rep stosd up


a few words of caution - that apply to all algos...
be sure you leave some space for the OS (stay well under the stack reserve)
try not to REP STOSD with ECX = 0   :lol:
i didn't test for that in my algo, but it could easily be added
Code:
    shr     ecx,2
    .if !ZERO?
        rep     stosd
    .endif

dedndave

« Reply #52 on: October 27, 2013, 11:30:34 PM »
oh - and REP STOSD may still not be the fastest way to 0 the memory - that's another test, really

we still have an issue to deal with, as far as timing tests:
once the stack space has been committed the first time, it's already committed on any successive pass
if you really want to know how many cycles it takes, you have to force the OS to "un-commit" before the next timing pass
i think a HeapAlloc of a large block can do that

it's probably best to separate the 2 operations and optimize each
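
One way to look at the two operations apart is to give the probing its own routine. The block below is a hypothetical C model of the probing half only, not code from the testbed: `probe_top_down` and `PROBE_STEP` are names invented here, and the 4 KB step is an assumption taken from the "4k lower" comment earlier in the thread. It reads one byte per step from the top of a buffer downwards, which is the order in which a guard page has to be hit. On a heap or static buffer it only shows the access pattern and commits no stack.

```c
#include <assert.h>
#include <stddef.h>

#define PROBE_STEP 4096u  /* assumption: 4 KB pages */

/* Read one byte per PROBE_STEP, highest address first, then make sure
   the lowest byte was read too. Returns the number of reads. */
size_t probe_top_down(const volatile unsigned char *buf, size_t size)
{
    size_t reads = 0;
    size_t off = size;
    size_t last = size;               /* offset of the most recent read */

    assert(buf != NULL || size == 0);
    while (off > 0) {
        last = off - 1;
        (void)buf[last];              /* volatile read: touches the page */
        reads++;
        off = (off > PROBE_STEP) ? off - PROBE_STEP : 0;
    }
    if (size > 0 && last != 0) {      /* lowest page may still be untouched */
        (void)buf[0];
        reads++;
    }
    return reads;
}
```

Zeroing can then be timed on memory that has already been touched, and the probing cost measured on its own.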

Gunther

« Reply #53 on: October 28, 2013, 12:09:43 AM »
Dave,

Quote
it's probably best to separate the 2 operations and optimize each

yes, it seems to me that this is true. But that's probably another story and another test.

Gunther

jj2007

« Reply #54 on: October 28, 2013, 12:26:26 AM »
Quote
oh - and REP STOSD may still not be the fastest way to 0 the memory

It is, it is, at least for large buffers and for most CPUs - that's pretty obvious

Quote
we still have an issue to deal with, as far as timing tests:
once the stack space has been committed the first time, it's already committed on any successive pass
if you really want to know how many cycles it takes, you have to force the OS to "un-commit" before the next timing pass

Using StackBuffer will happen somewhere between "proc" and "endp". There are two extreme cases:
1. You use this proc once - then the handful of nanoseconds lost in committing will not matter.
2. You use this proc a million times - then you will not want the OS to uncommit and re-commit that stack space every time you call the proc.

So in effect the timings are extremely valid as they are...

Quote
try not to REP STOSD with ECX = 0   :lol:
See source:
   add ecx, 3   ; bufsize might be badly aligned
   shr ecx, 2   ; stosD
   xor eax, eax
   rep stosd


For ecx=0, rep stosd does absolutely nothing... caution, though, passing negative buffer sizes might result in unexpected behaviour :eusa_naughty:
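
Both points can be checked with a small model. This is a sketch in C rather than the testbed code: `dword_count` and `rep_stosd` are hypothetical names, and the loop only imitates how `rep` tests the count before every store, using 32-bit wraparound as in the registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ecx after "add ecx, 3" and "shr ecx, 2", with 32-bit wraparound. */
uint32_t dword_count(uint32_t bufsize)
{
    return (uint32_t)(bufsize + 3u) >> 2;
}

/* Model of "rep stosd": the count is tested before each store, so a
   count of zero stores nothing. Returns the advanced pointer (edi). */
uint32_t *rep_stosd(uint32_t *edi, uint32_t eax, uint32_t ecx)
{
    assert(edi != NULL);
    while (ecx != 0) {
        *edi++ = eax;
        ecx--;
    }
    return edi;
}
```

A size of 0 gives a count of 0 and leaves memory untouched. A "negative" size of -4 is 0FFFFFFFCh as an unsigned value and becomes a count of 3FFFFFFFh dwords, close to 4 GB of stores. A size of -1 wraps to a count of 0 after the `add`, so it would pass silently.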

Siekmanski

« Reply #55 on: October 28, 2013, 12:55:56 AM »
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 215/100 cycles

2777    kCycles for 100 * rep stosd
1158    kCycles for 100 * HeapAlloc (*8)
1040    kCycles for 100 * StackBuffer (with zeroing)
1038    kCycles for 100 * StackBuffer (unrolled)
1090    kCycles for 100 * dedndave
1104    kCycles for 100 * rep stosd up

2770    kCycles for 100 * rep stosd
1144    kCycles for 100 * HeapAlloc (*8)
1065    kCycles for 100 * StackBuffer (with zeroing)
1047    kCycles for 100 * StackBuffer (unrolled)
1252    kCycles for 100 * dedndave
1069    kCycles for 100 * rep stosd up

2617    kCycles for 100 * rep stosd
1086    kCycles for 100 * HeapAlloc (*8)
981     kCycles for 100 * StackBuffer (with zeroing)
993     kCycles for 100 * StackBuffer (unrolled)
1037    kCycles for 100 * dedndave
1044    kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
41      bytes for dedndave
17      bytes for rep stosd up

dedndave

« Reply #56 on: October 28, 2013, 01:02:05 AM »
Quote
For ecx=0, rep stosd does absolutely nothing...

oh - that's good   :P
as i recall, on an 8088, CX = 0 would do 64 KB

jj2007

« Reply #57 on: October 28, 2013, 01:43:15 AM »
Thanks to everybody for testing :icon14:

New version:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles

512000 bytes:
23582   kCycles for 100 * rep stosd
10726   kCycles for 100 * HeapAlloc
8653    kCycles for 100 * StackBuffer (with zeroing)
8653    kCycles for 100 * dedndave
8627    kCycles for 100 * rep stosd up (no probing)


To my embarrassment, it seems the bufsize/8 disappeared somewhere, so the HeapAlloc values are for the full buffer size. And they are close to the others.
Try changing line 5: bufsize=102400*6 - you are in for a virtual surprise ;)

nidud

  • Member
  • *****
  • Posts: 2174
    • https://github.com/nidud/asmc
« Reply #58 on: October 28, 2013, 01:45:46 AM »
here is a thought...

Code:
bufsize=102400

StackBuffer proc
local buffer[bufsize]:byte
; ... use the buffer ...
ret
StackBuffer endp

link /stack:102400,102400 ...

 :biggrin:

jj2007

« Reply #59 on: October 28, 2013, 01:57:54 AM »
Quote
here is a thought... link /stack:102400,102400

A valid thought but
a) it requires more discipline with linker settings, environment variables etc
b) that local buffer is still full of garbage and
c) it is not aligned for use with SSE2
 ;)
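
Points b) and c) can also be dealt with inside the procedure itself. As a rough sketch in C (not anything from MasmBasic; `align_up`, `use_local_buffer` and `BUFSIZE` are hypothetical names), a plain local can be over-allocated by 15 bytes, rounded up to a 16-byte boundary and zeroed by hand. It assumes the stack is large enough for the local, which is what the linker setting is for.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUFSIZE 102400u  /* same size as in the post above */

/* Round addr up to the next multiple of a power-of-two alignment. */
uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Returns 1 if the aligned view of the local buffer is usable. */
int use_local_buffer(void)
{
    unsigned char raw[BUFSIZE + 15];  /* garbage on entry, alignment unknown */
    unsigned char *buf = (unsigned char *)align_up((uintptr_t)raw, 16);

    memset(buf, 0, BUFSIZE);          /* point b): clear the garbage         */
    return ((uintptr_t)buf % 16 == 0) /* point c): aligned for SSE2          */
        && buf[0] == 0
        && buf[BUFSIZE - 1] == 0;
}
```

The extra 15 bytes guarantee that BUFSIZE bytes remain after rounding up, whatever address the local starts at.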