Zero a stack buffer (and probe it)

Siekmanski · October 27, 2013, 06:28:33 PM

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles

2673 kCycles for 100 * rep stosd
1437 kCycles for 100 * HeapAlloc (*8)
1667 kCycles for 100 * StackBuffer (with zeroing)
2611 kCycles for 100 * StackBuffer (unrolled)
1680 kCycles for 100 * movaps xmm0
1027 kCycles for 100 * rep stosd up
1680 kCycles for 100 * movaps xmm0 (down)
957 kCycles for 100 * movaps xmm0 (up)
973 kCycles for 100 * movaps xmm0 (unrolled)

2672 kCycles for 100 * rep stosd
1500 kCycles for 100 * HeapAlloc (*8)
1687 kCycles for 100 * StackBuffer (with zeroing)
2608 kCycles for 100 * StackBuffer (unrolled)
1681 kCycles for 100 * movaps xmm0
1029 kCycles for 100 * rep stosd up
1699 kCycles for 100 * movaps xmm0 (down)
948 kCycles for 100 * movaps xmm0 (up)
982 kCycles for 100 * movaps xmm0 (unrolled)

2671 kCycles for 100 * rep stosd
1446 kCycles for 100 * HeapAlloc (*8)
1677 kCycles for 100 * StackBuffer (with zeroing)
2607 kCycles for 100 * StackBuffer (unrolled)
1681 kCycles for 100 * movaps xmm0
1070 kCycles for 100 * rep stosd up
1678 kCycles for 100 * movaps xmm0 (down)
966 kCycles for 100 * movaps xmm0 (up)
994 kCycles for 100 * movaps xmm0 (unrolled)

18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)
25 bytes for movaps xmm0
17 bytes for rep stosd up
36 bytes for movaps xmm0 (down)
36 bytes for movaps xmm0 (up)
85 bytes for movaps xmm0 (unrolled)

Gunther · October 27, 2013, 08:29:54 PM

Results with Dave's (KeepingRealBusy) version:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 201/100 cycles

2356    kCycles for 100 * rep stosd
943     kCycles for 100 * HeapAlloc (*8)
853     kCycles for 100 * StackBuffer (with zeroing)
2285    kCycles for 100 * StackBuffer (unrolled)
824     kCycles for 100 * movaps xmm0
886     kCycles for 100 * rep stosd up
825     kCycles for 100 * movaps xmm0 (down)
880     kCycles for 100 * movaps xmm0 (up)
1443    kCycles for 100 * movaps xmm0 (unrolled)

2354    kCycles for 100 * rep stosd
967     kCycles for 100 * HeapAlloc (*8)
877     kCycles for 100 * StackBuffer (with zeroing)
2300    kCycles for 100 * StackBuffer (unrolled)
846     kCycles for 100 * movaps xmm0
911     kCycles for 100 * rep stosd up
842     kCycles for 100 * movaps xmm0 (down)
883     kCycles for 100 * movaps xmm0 (up)
839     kCycles for 100 * movaps xmm0 (unrolled)

2948    kCycles for 100 * rep stosd
981     kCycles for 100 * HeapAlloc (*8)
845     kCycles for 100 * StackBuffer (with zeroing)
2286    kCycles for 100 * StackBuffer (unrolled)
865     kCycles for 100 * movaps xmm0
868     kCycles for 100 * rep stosd up
828     kCycles for 100 * movaps xmm0 (down)
844     kCycles for 100 * movaps xmm0 (up)
865     kCycles for 100 * movaps xmm0 (unrolled)

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
25      bytes for movaps xmm0
17      bytes for rep stosd up
36      bytes for movaps xmm0 (down)
36      bytes for movaps xmm0 (up)
85      bytes for movaps xmm0 (unrolled)

--- ok ---

Gunther

jj2007 · October 27, 2013, 09:14:23 PM

Quote from: KeepingRealBusy on October 27, 2013, 02:04:42 PM
The modifications were to move the "constant" initializations out of the REPEAT loops

Dave,
That defeats the purpose of these loops to simulate a complete HeapAlloc/.../HeapFree sequence...

Dave (the dedn),
Your algo is now included in the testbed below. I have improved it so dramatically that you are now morally obliged to donate it to MasmBasic's StackBuffer() :icon_mrgreen:

    push edi
    push ebp
    mov ebp, esp
    mov edi, esp
    mov ecx, bufsize    ; to be replaced with immediate, global, local, reg etc
    sub edi, ecx
    and edi, -64        ; aligns buffer to a cache line
    ASSUME FS:Nothing
    .repeat
        push eax        ; tickle the guard page - limit might be 4k lower now
        mov esp, fs:[8] ; esp = current stack limit from the TIB
    .until edi>=esp     ; loop until we've got enough
    ASSUME FS:ERROR
    mov esp, edi        ; new stack
    add ecx, 3          ; bufsize might be badly aligned
    shr ecx, 2          ; stosD
    xor eax, eax
    rep stosd
    mov eax, esp        ; retval for macro
    ; ... use the buffer ...
    leave
    pop edi
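
The address arithmetic in the snippet above can be checked outside the assembler. The following is a minimal Python sketch, not part of the testbed: `stack_buffer_layout` and the sample esp value are made up for illustration, and the 32-bit registers are modelled by masking.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide

def stack_buffer_layout(esp, bufsize):
    """Return (buffer start, dword count) as the snippet would compute them."""
    edi = (esp - bufsize) & MASK32        # mov edi, esp / sub edi, ecx
    edi &= -64 & MASK32                   # and edi, -64: round down to a 64-byte boundary
    ecx = ((bufsize + 3) & MASK32) >> 2   # add ecx, 3 / shr ecx, 2: round up to whole dwords
    return edi, ecx

# Hypothetical stack pointer and an odd-sized request
start, dwords = stack_buffer_layout(0x0012FF80, 1001)
print(hex(start), dwords)
```

In this model a request of 1001 bytes is rounded up to 251 dwords (1004 bytes), and the buffer start always lands on a multiple of 64.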

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles

4743 kCycles for 100 * rep stosd
2232 kCycles for 100 * HeapAlloc (*8)
2201 kCycles for 100 * StackBuffer (with zeroing)
2208 kCycles for 100 * StackBuffer (unrolled)
1749 kCycles for 100 * dedndave
1746 kCycles for 100 * rep stosd up

4725 kCycles for 100 * rep stosd
1846 kCycles for 100 * HeapAlloc (*8)
2205 kCycles for 100 * StackBuffer (with zeroing)
2204 kCycles for 100 * StackBuffer (unrolled)
1747 kCycles for 100 * dedndave
1746 kCycles for 100 * rep stosd up

4726 kCycles for 100 * rep stosd
1850 kCycles for 100 * HeapAlloc (*8)
2203 kCycles for 100 * StackBuffer (with zeroing)
2203 kCycles for 100 * StackBuffer (unrolled)
1747 kCycles for 100 * dedndave
1746 kCycles for 100 * rep stosd up

18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
34 bytes for StackBuffer (with zeroing)
44 bytes for StackBuffer (unrolled)
41 bytes for dedndave <<<<<<<<<<<<<<<< BLOATWARE ALARM!!!
17 bytes for rep stosd up

Gunther · October 27, 2013, 09:36:09 PM

Okay Jochen, here we go again:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 268/100 cycles

2404    kCycles for 100 * rep stosd
1003    kCycles for 100 * HeapAlloc (*8)
887     kCycles for 100 * StackBuffer (with zeroing)
905     kCycles for 100 * StackBuffer (unrolled)
831     kCycles for 100 * dedndave
872     kCycles for 100 * rep stosd up

2342    kCycles for 100 * rep stosd
982     kCycles for 100 * HeapAlloc (*8)
859     kCycles for 100 * StackBuffer (with zeroing)
835     kCycles for 100 * StackBuffer (unrolled)
927     kCycles for 100 * dedndave
897     kCycles for 100 * rep stosd up

2346    kCycles for 100 * rep stosd
965     kCycles for 100 * HeapAlloc (*8)
838     kCycles for 100 * StackBuffer (with zeroing)
835     kCycles for 100 * StackBuffer (unrolled)
828     kCycles for 100 * dedndave
873     kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
41      bytes for dedndave
17      bytes for rep stosd up

--- ok ---

Gunther

jj2007 · October 27, 2013, 09:55:59 PM

Thanks, Gunther. The dedndave algo is a clear winner, as expected. Almost ten times as fast as HeapAlloc is a good result.
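
The "almost ten times" figure appears to come from scaling the HeapAlloc timing by 8, on the assumption that the "(*8)" label means HeapAlloc was timed on bufsize/8. A quick Python check of that arithmetic, using the first run of Gunther's i7-3770 results above (a sketch only; the variable names are made up and the linear extrapolation is itself an assumption):

```python
heapalloc_kcycles = 1003   # HeapAlloc (*8), first run
dedndave_kcycles = 831     # dedndave, first run

scaled = heapalloc_kcycles * 8            # extrapolate to the full buffer size
ratio = scaled / dedndave_kcycles         # how many times slower the scaled HeapAlloc is
print(f"{scaled} kCycles vs {dedndave_kcycles} kCycles: {ratio:.1f}x")
```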

Gunther · October 27, 2013, 10:33:42 PM

Jochen,

Quote from: jj2007 on October 27, 2013, 09:55:59 PM
... The dedndave algo is a clear winner, as expected. Almost ten times as fast as HeapAlloc is a good result

that's evident. Congrats, Dave. :t

Gunther

dedndave · October 27, 2013, 11:18:21 PM

version 3 (DD) prescott w/htt

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 253/100 cycles

5243    kCycles for 100 * rep stosd
4087    kCycles for 100 * HeapAlloc (*8)
2834    kCycles for 100 * StackBuffer (with zeroing)
2834    kCycles for 100 * StackBuffer (unrolled)
2905    kCycles for 100 * dedndave
2861    kCycles for 100 * rep stosd up

5121    kCycles for 100 * rep stosd
3114    kCycles for 100 * HeapAlloc (*8)
2861    kCycles for 100 * StackBuffer (with zeroing)
2842    kCycles for 100 * StackBuffer (unrolled)
2855    kCycles for 100 * dedndave
2873    kCycles for 100 * rep stosd up

5157    kCycles for 100 * rep stosd
3039    kCycles for 100 * HeapAlloc (*8)
2893    kCycles for 100 * StackBuffer (with zeroing)
2867    kCycles for 100 * StackBuffer (unrolled)
2849    kCycles for 100 * dedndave
2845    kCycles for 100 * rep stosd up

a few words of caution - that apply to all algos...
be sure you leave some space for the OS (stay well under the stack reserve)
try not to REP STOSD with ECX = 0 :lol:
i didn't test for that in my algo, but it could easily be added

    shr     ecx,2
    .if !ZERO?
        rep     stosd
    .endif

dedndave · October 27, 2013, 11:30:34 PM

oh - and REP STOSD may still not be the fastest way to 0 the memory - that's another test, really

we still have an issue to deal with, as far as timing tests:
once the stack space has been committed the first time, it's already committed on any successive pass
if you really want to know how many cycles it takes, you have to force the OS to "un-commit" before the next timing pass
i think a HeapAlloc of a large block can do that

it's probably best to separate the 2 operations and optimize each
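
The effect on the timings can be pictured with a toy model of the guard-page mechanism. This is a sketch under assumptions, not OS code: it assumes 4 KB pages and that each touch of the guard page lowers the stack limit by exactly one page; `probe` and the addresses are invented for illustration.

```python
PAGE = 4096  # assumed page size

def probe(stack_limit, target):
    """Touch guard pages until the committed stack reaches down to target.

    Returns (number of guard-page touches, new stack limit).
    """
    touches = 0
    while stack_limit > target:   # committed region does not yet cover the buffer
        stack_limit -= PAGE       # touching the guard page commits one more page
        touches += 1
    return touches, stack_limit

limit = 0x0012D000                # hypothetical initial stack limit
target = 0x000B2F80               # hypothetical buffer start
first, limit = probe(limit, target)    # first timing pass
second, limit = probe(limit, target)   # any later pass
print(first, second)
```

In this model only the first pass pays for committing pages; the second finds them already committed, which is the situation described above.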

Gunther · October 28, 2013, 12:09:43 AM

Dave,

Quote from: dedndave on October 27, 2013, 11:30:34 PM
it's probably best to separate the 2 operations and optimize each

yes, it seems to me that this is true. But that's probably another story and another test.

Gunther

jj2007 · October 28, 2013, 12:26:26 AM

Quote from: dedndave on October 27, 2013, 11:30:34 PM
oh - and REP STOSD may still not be the fastest way to 0 the memory

It is, it is, at least for large buffers and for most CPUs - that's pretty obvious

Quote from: dedndave on October 27, 2013, 11:30:34 PM
we still have an issue to deal with, as far as timing tests:
once the stack space has been committed the first time, it's already committed on any successive pass
if you really want to know how many cycles it takes, you have to force the OS to "un-commit" before the next timing pass

Using StackBuffer will happen somewhere between "proc" and "endp". There are two extreme cases:
1. You use this proc once - then the handful of nanoseconds lost in committing will not matter.
2. You use this proc a million times - then you will not want the OS to uncommit and re-commit that stack space every time you call the proc.

So in effect the timings are extremely valid as they are...

Quote from: dedndave on October 27, 2013, 11:18:21 PM
try not to REP STOSD with ECX = 0 :lol:

See source:
   add ecx, 3   ; bufsize might be badly aligned
   shr ecx, 2   ; stosD
   xor eax, eax
   rep stosd

For ecx=0, rep stosd does absolutely nothing... caution, though, passing negative buffer sizes might result in unexpected behaviour
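
Both cases can be illustrated with a small Python model of the count calculation (a sketch only; `dword_count` is a made-up helper, and 32-bit wraparound is emulated with a mask). A size of 0 yields a count of 0, for which rep stosd stores nothing, while a negative size wraps around to a very large unsigned count.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide

def dword_count(bufsize):
    """Model of: mov ecx, bufsize / add ecx, 3 / shr ecx, 2."""
    ecx = bufsize & MASK32      # a negative size is just a large unsigned value in ecx
    ecx = (ecx + 3) & MASK32    # add ecx, 3 (may wrap around)
    return ecx >> 2             # shr ecx, 2 is an unsigned shift

for size in (0, 1, 4, 5, -8):
    print(size, dword_count(size))
```

With bufsize = -8 the model gives 1073741822 dwords, so rep stosd would try to clear close to 4 GB.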

Siekmanski · October 28, 2013, 12:55:56 AM

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 215/100 cycles

2777 kCycles for 100 * rep stosd
1158 kCycles for 100 * HeapAlloc (*8)
1040 kCycles for 100 * StackBuffer (with zeroing)
1038 kCycles for 100 * StackBuffer (unrolled)
1090 kCycles for 100 * dedndave
1104 kCycles for 100 * rep stosd up

2770 kCycles for 100 * rep stosd
1144 kCycles for 100 * HeapAlloc (*8)
1065 kCycles for 100 * StackBuffer (with zeroing)
1047 kCycles for 100 * StackBuffer (unrolled)
1252 kCycles for 100 * dedndave
1069 kCycles for 100 * rep stosd up

2617 kCycles for 100 * rep stosd
1086 kCycles for 100 * HeapAlloc (*8)
981 kCycles for 100 * StackBuffer (with zeroing)
993 kCycles for 100 * StackBuffer (unrolled)
1037 kCycles for 100 * dedndave
1044 kCycles for 100 * rep stosd up

18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
34 bytes for StackBuffer (with zeroing)
44 bytes for StackBuffer (unrolled)
41 bytes for dedndave
17 bytes for rep stosd up

dedndave · October 28, 2013, 01:02:05 AM

Quote from: jj2007 on October 28, 2013, 12:26:26 AM
For ecx=0, rep stosd does absolutely nothing...

oh - that's good :P
as i recall, on an 8088, CX = 0 would do 64 KB

jj2007 · October 28, 2013, 01:43:15 AM

Thanks to everybody for testing :icon14:

New version:

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles

512000 bytes:
23582 kCycles for 100 * rep stosd
10726 kCycles for 100 * HeapAlloc
8653 kCycles for 100 * StackBuffer (with zeroing)
8653 kCycles for 100 * dedndave
8627 kCycles for 100 * rep stosd up (no probing)

To my embarrassment, it seems the bufsize/8 disappeared somewhere, so the HeapAlloc values are for the full buffer size. And they are close to the others.
Try changing line 5: bufsize=102400*6 - you are in for a virtual surprise ;)

nidud · October 28, 2013, 01:45:46 AM

deleted

jj2007 · October 28, 2013, 01:57:54 AM

Quote from: nidud on October 28, 2013, 01:45:46 AM
here is a thought... link /stack:102400,102400

A valid thought but
a) it requires more discipline with linker settings, environment variables etc
b) that local buffer is still full of garbage and
c) it is not aligned for use with SSE2
;)
