Zero a stack buffer (and probe it)

Started by jj2007, October 25, 2013, 07:31:54 PM


Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles

2673    kCycles for 100 * rep stosd
1437    kCycles for 100 * HeapAlloc (*8 )
1667    kCycles for 100 * StackBuffer (with zeroing)
2611    kCycles for 100 * StackBuffer (unrolled)
1680    kCycles for 100 * movaps xmm0
1027    kCycles for 100 * rep stosd up
1680    kCycles for 100 * movaps xmm0 (down)
957     kCycles for 100 * movaps xmm0 (up)
973     kCycles for 100 * movaps xmm0 (unrolled)

2672    kCycles for 100 * rep stosd
1500    kCycles for 100 * HeapAlloc (*8 )
1687    kCycles for 100 * StackBuffer (with zeroing)
2608    kCycles for 100 * StackBuffer (unrolled)
1681    kCycles for 100 * movaps xmm0
1029    kCycles for 100 * rep stosd up
1699    kCycles for 100 * movaps xmm0 (down)
948     kCycles for 100 * movaps xmm0 (up)
982     kCycles for 100 * movaps xmm0 (unrolled)

2671    kCycles for 100 * rep stosd
1446    kCycles for 100 * HeapAlloc (*8 )
1677    kCycles for 100 * StackBuffer (with zeroing)
2607    kCycles for 100 * StackBuffer (unrolled)
1681    kCycles for 100 * movaps xmm0
1070    kCycles for 100 * rep stosd up
1678    kCycles for 100 * movaps xmm0 (down)
966     kCycles for 100 * movaps xmm0 (up)
994     kCycles for 100 * movaps xmm0 (unrolled)

18      bytes for rep stosd
103     bytes for HeapAlloc (*8 )
29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
25      bytes for movaps xmm0
17      bytes for rep stosd up
36      bytes for movaps xmm0 (down)
36      bytes for movaps xmm0 (up)
85      bytes for movaps xmm0 (unrolled)
Creative coders use backward thinking techniques as a strategy.

Gunther

Results with Dave's (KeepingRealBusy) version:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 201/100 cycles

2356    kCycles for 100 * rep stosd
943     kCycles for 100 * HeapAlloc (*8)
853     kCycles for 100 * StackBuffer (with zeroing)
2285    kCycles for 100 * StackBuffer (unrolled)
824     kCycles for 100 * movaps xmm0
886     kCycles for 100 * rep stosd up
825     kCycles for 100 * movaps xmm0 (down)
880     kCycles for 100 * movaps xmm0 (up)
1443    kCycles for 100 * movaps xmm0 (unrolled)

2354    kCycles for 100 * rep stosd
967     kCycles for 100 * HeapAlloc (*8)
877     kCycles for 100 * StackBuffer (with zeroing)
2300    kCycles for 100 * StackBuffer (unrolled)
846     kCycles for 100 * movaps xmm0
911     kCycles for 100 * rep stosd up
842     kCycles for 100 * movaps xmm0 (down)
883     kCycles for 100 * movaps xmm0 (up)
839     kCycles for 100 * movaps xmm0 (unrolled)

2948    kCycles for 100 * rep stosd
981     kCycles for 100 * HeapAlloc (*8)
845     kCycles for 100 * StackBuffer (with zeroing)
2286    kCycles for 100 * StackBuffer (unrolled)
865     kCycles for 100 * movaps xmm0
868     kCycles for 100 * rep stosd up
828     kCycles for 100 * movaps xmm0 (down)
844     kCycles for 100 * movaps xmm0 (up)
865     kCycles for 100 * movaps xmm0 (unrolled)

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
25      bytes for movaps xmm0
17      bytes for rep stosd up
36      bytes for movaps xmm0 (down)
36      bytes for movaps xmm0 (up)
85      bytes for movaps xmm0 (unrolled)

--- ok ---


Gunther
You have to know the facts before you can distort them.

jj2007

Quote from: KeepingRealBusy on October 27, 2013, 02:04:42 PM
The modifications were to move the "constant" initializations out of the REPEAT loops

Dave,
That defeats the purpose of these loops to simulate a complete HeapAlloc/.../HeapFree sequence...

Dave (the dedn),
Your algo is now included in the testbed below. I have improved it so dramatically that you are now morally obliged to donate it to MasmBasic's StackBuffer() :icon_mrgreen:

        push edi
        push ebp
        mov ebp, esp
        mov edi, esp
        mov ecx, bufsize     ; to be replaced with immediate, global, local, reg etc
        sub edi, ecx
        and edi, -64         ; aligns buffer to a cache line
        ASSUME FS:Nothing
        .repeat
                push eax     ; tickle the guard page - limit might be 4k lower now
                mov esp, fs:[8]
        .until edi>=esp      ; loop until we've got enough
        ASSUME FS:ERROR
        mov esp, edi         ; new stack
        add ecx, 3           ; bufsize might be badly aligned
        shr ecx, 2           ; stosD
        xor eax, eax
        rep stosd
        mov eax, esp         ; retval for macro
        ; ... use the buffer ...
        leave
        pop edi

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles

4743    kCycles for 100 * rep stosd
2232    kCycles for 100 * HeapAlloc (*8)
2201    kCycles for 100 * StackBuffer (with zeroing)
2208    kCycles for 100 * StackBuffer (unrolled)
1749    kCycles for 100 * dedndave
1746    kCycles for 100 * rep stosd up

4725    kCycles for 100 * rep stosd
1846    kCycles for 100 * HeapAlloc (*8)
2205    kCycles for 100 * StackBuffer (with zeroing)
2204    kCycles for 100 * StackBuffer (unrolled)
1747    kCycles for 100 * dedndave
1746    kCycles for 100 * rep stosd up

4726    kCycles for 100 * rep stosd
1850    kCycles for 100 * HeapAlloc (*8)
2203    kCycles for 100 * StackBuffer (with zeroing)
2203    kCycles for 100 * StackBuffer (unrolled)
1747    kCycles for 100 * dedndave
1746    kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
41      bytes for dedndave   <<<<<<<<<<<<<<<< BLOATWARE ALARM!!!
17      bytes for rep stosd up

Gunther

Okay Jochen, here we go again:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 268/100 cycles

2404    kCycles for 100 * rep stosd
1003    kCycles for 100 * HeapAlloc (*8)
887     kCycles for 100 * StackBuffer (with zeroing)
905     kCycles for 100 * StackBuffer (unrolled)
831     kCycles for 100 * dedndave
872     kCycles for 100 * rep stosd up

2342    kCycles for 100 * rep stosd
982     kCycles for 100 * HeapAlloc (*8)
859     kCycles for 100 * StackBuffer (with zeroing)
835     kCycles for 100 * StackBuffer (unrolled)
927     kCycles for 100 * dedndave
897     kCycles for 100 * rep stosd up

2346    kCycles for 100 * rep stosd
965     kCycles for 100 * HeapAlloc (*8)
838     kCycles for 100 * StackBuffer (with zeroing)
835     kCycles for 100 * StackBuffer (unrolled)
828     kCycles for 100 * dedndave
873     kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
41      bytes for dedndave
17      bytes for rep stosd up

--- ok ---


Gunther

jj2007

Thanks, Gunther. The dedndave algo is a clear winner, as expected. Almost ten times as fast as HeapAlloc is a good result :biggrin:

Gunther

Jochen,

Quote from: jj2007 on October 27, 2013, 09:55:59 PM
... The dedndave algo is a clear winner, as expected. Almost ten times as fast as HeapAlloc is a good result :biggrin:

that's manifest. Congrats, Dave.  :t

Gunther

dedndave

version 3 (DD) prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 253/100 cycles

5243    kCycles for 100 * rep stosd
4087    kCycles for 100 * HeapAlloc (*8 )
2834    kCycles for 100 * StackBuffer (with zeroing)
2834    kCycles for 100 * StackBuffer (unrolled)
2905    kCycles for 100 * dedndave
2861    kCycles for 100 * rep stosd up

5121    kCycles for 100 * rep stosd
3114    kCycles for 100 * HeapAlloc (*8 )
2861    kCycles for 100 * StackBuffer (with zeroing)
2842    kCycles for 100 * StackBuffer (unrolled)
2855    kCycles for 100 * dedndave
2873    kCycles for 100 * rep stosd up

5157    kCycles for 100 * rep stosd
3039    kCycles for 100 * HeapAlloc (*8 )
2893    kCycles for 100 * StackBuffer (with zeroing)
2867    kCycles for 100 * StackBuffer (unrolled)
2849    kCycles for 100 * dedndave
2845    kCycles for 100 * rep stosd up


a few words of caution - that apply to all algos...
be sure you leave some space for the OS (stay well under the stack reserve)
try not to REP STOSD with ECX = 0   :lol:
i didn't test for that in my algo, but it could easily be added
    shr     ecx,2
    .if !ZERO?
        rep     stosd
    .endif

dedndave

oh - and REP STOSD may still not be the fastest way to 0 the memory - that's another test, really

we still have an issue to deal with, as far as timing tests:
once the stack space has been committed the first time, it's already committed on any successive pass
if you really want to know how many cycles it takes, you have to force the OS to "un-commit" before the next timing pass
i think HeapAlloc a large block can do that

it's probably best to separate the 2 operations and optimize each
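
A minimal sketch of the probing half on its own, for anyone who wants to time it apart from the zeroing. It is not the testbed code: it assumes the MASM32 SDK at its default path and the default 1 MB stack reserve, and BufBytes is a hypothetical size. It reads one dword per 4 KB page below esp so that the guard page moves down, and it zeroes nothing.

```asm
include \masm32\include\masm32rt.inc    ; assumes the MASM32 SDK at its default location

BufBytes equ 102400                     ; hypothetical size, well under the default 1 MB reserve

.code
start:
        mov eax, esp                    ; eax walks down the stack, esp itself stays put
        mov ecx, BufBytes
@@:     sub eax, 4096                   ; one page lower
        test dword ptr [eax], eax       ; read access: touching the guard page commits it
        sub ecx, 4096
        ja  @B                          ; unsigned: loop while bytes remain
        print "probed", 13, 10
        exit
end start
```

Because esp is left alone, the pages are only committed here; a real buffer would still need esp lowered afterwards, as in the StackBuffer code.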

Gunther

Dave,

Quote from: dedndave on October 27, 2013, 11:30:34 PM
it's probably best to separate the 2 operations and optimize each

yes, it seems to me that this is true. But that's probably another story and another test.

Gunther

jj2007

Quote from: dedndave on October 27, 2013, 11:30:34 PM
oh - and REP STOSD may still not be the fastest way to 0 the memory
It is, it is, at least for large buffers and for most CPUs - that's pretty obvious
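
For comparison, a minimal sketch of the kind of upward movaps loop that the earlier timings label "movaps xmm0 (up)". This is an illustration, not the testbed source: it assumes the MASM32 SDK at its default path and an SSE-capable CPU, and BufBytes and ssebuf are hypothetical names. The buffer must be 16-byte aligned and its size a multiple of 64.

```asm
include \masm32\include\masm32rt.inc    ; assumes the MASM32 SDK at its default location
.686
.xmm                                    ; enable SSE mnemonics

BufBytes equ 4096                       ; hypothetical size, must be a multiple of 64

.data?
align 16
ssebuf  db BufBytes dup(?)              ; 16-byte aligned, as movaps requires

.code
start:
        mov edx, offset ssebuf
        mov ecx, BufBytes/64            ; four 16-byte stores per pass
        xorps xmm0, xmm0                ; xmm0 = 0
@@:     movaps [edx], xmm0
        movaps [edx+16], xmm0
        movaps [edx+32], xmm0
        movaps [edx+48], xmm0
        add edx, 64                     ; move up through the buffer
        dec ecx
        jnz @B
        print "zeroed", 13, 10
        exit
end start
```

Whether this beats rep stosd depends on the CPU and the buffer size, as the posted timings show.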

Quote
we still have an issue to deal with, as far as timing tests:
once the stack space has been committed the first time, it's already committed on any successive pass
if you really want to know how many cycles it takes, you have to force the OS to "un-commit" before the next timing pass

Using StackBuffer will happen somewhere between "proc" and "endp". There are two extreme cases:
1. You use this proc once - then the handful of nanoseconds lost in committing will not matter.
2. You use this proc a Million times - then you will not want the OS to uncommit and re-commit that stack space every time you call the proc.

So in effect the timings are extremely valid as they are...

Quote
try not to REP STOSD with ECX = 0   :lol:
See source:
   add ecx, 3   ; bufsize might be badly aligned
   shr ecx, 2   ; stosD
   xor eax, eax
   rep stosd


For ecx=0, rep stosd does absolutely nothing... caution, though, passing negative buffer sizes might result in unexpected behaviour :eusa_naughty:
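
A minimal sketch of one way to guard against the negative-size case, under stated assumptions: the MASM32 SDK at its default path, and a hypothetical helper ZeroChecked with a hypothetical limit MaxBuf. Neither is part of StackBuffer. A single unsigned compare rejects negative and oversized values alike, and a size of 0 passes through harmlessly.

```asm
include \masm32\include\masm32rt.inc    ; assumes the MASM32 SDK at its default location

MaxBuf  equ 64                          ; hypothetical limit, matches testbuf below

.data
testbuf dd 16 dup(-1)                   ; 64 bytes pre-filled with FFFFFFFFh

.code
; ZeroChecked (hypothetical helper): zero ecx bytes at edi, rounded up to a dword.
; Returns eax=1 if done, eax=0 if the size was rejected. Changes ecx and edi.
ZeroChecked proc
        cmp ecx, MaxBuf                 ; unsigned compare: a negative size looks huge
        ja  rejected                    ; so one test catches negative and oversized values
        add ecx, 3                      ; bufsize might be badly aligned
        shr ecx, 2                      ; bytes -> dwords, 0 stays 0
        xor eax, eax
        rep stosd                       ; stores nothing when ecx=0
        inc eax                         ; eax=1: accepted
        ret
rejected:
        xor eax, eax                    ; eax=0: rejected
        ret
ZeroChecked endp

start:
        mov edi, offset testbuf
        mov ecx, -1                     ; negative size
        call ZeroChecked                ; rejected, buffer untouched
        print str$(eax), 13, 10
        mov edi, offset testbuf
        xor ecx, ecx                    ; size 0
        call ZeroChecked                ; accepted, but nothing is stored
        print str$(testbuf), 13, 10     ; first dword is still -1
        exit
end start
```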

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 215/100 cycles

2777    kCycles for 100 * rep stosd
1158    kCycles for 100 * HeapAlloc (*8)
1040    kCycles for 100 * StackBuffer (with zeroing)
1038    kCycles for 100 * StackBuffer (unrolled)
1090    kCycles for 100 * dedndave
1104    kCycles for 100 * rep stosd up

2770    kCycles for 100 * rep stosd
1144    kCycles for 100 * HeapAlloc (*8)
1065    kCycles for 100 * StackBuffer (with zeroing)
1047    kCycles for 100 * StackBuffer (unrolled)
1252    kCycles for 100 * dedndave
1069    kCycles for 100 * rep stosd up

2617    kCycles for 100 * rep stosd
1086    kCycles for 100 * HeapAlloc (*8)
981     kCycles for 100 * StackBuffer (with zeroing)
993     kCycles for 100 * StackBuffer (unrolled)
1037    kCycles for 100 * dedndave
1044    kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
41      bytes for dedndave
17      bytes for rep stosd up

dedndave

Quote from: jj2007 on October 28, 2013, 12:26:26 AM
For ecx=0, rep stosd does absolutely nothing...

oh - that's good   :P
as i recall, on an 8088, CX = 0 would do 64 KB

jj2007

Thanks to everybody for testing :icon14:

New version:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles

512000 bytes:
23582   kCycles for 100 * rep stosd
10726   kCycles for 100 * HeapAlloc
8653    kCycles for 100 * StackBuffer (with zeroing)
8653    kCycles for 100 * dedndave
8627    kCycles for 100 * rep stosd up (no probing)


To my embarrassment, it seems the bufsize/8 disappeared somewhere, so the HeapAlloc values are for the full buffer size. And they are close to the others.
Try changing line 5: bufsize=102400*6 - you are in for a virtual surprise ;)

nidud

deleted

jj2007

Quote from: nidud on October 28, 2013, 01:45:46 AM
here is a thought... link /stack:102400,102400

A valid thought but
a) it requires more discipline with linker settings, environment variables etc
b) that local buffer is still full of garbage and
c) it is not aligned for use with SSE2
;)
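
A minimal sketch of what points b) and c) mean in practice when a plain LOCAL buffer is used instead. It assumes the MASM32 SDK at its default path and an SSE-capable CPU; UseLocal and rawbuf are hypothetical names. The local is over-allocated by 15 bytes so it can be aligned by hand, and it still has to be cleared explicitly.

```asm
include \masm32\include\masm32rt.inc    ; assumes the MASM32 SDK at its default location
.686
.xmm                                    ; enable SSE mnemonics

.code
UseLocal proc
LOCAL rawbuf[256+15]:BYTE               ; 15 spare bytes leave room to align by hand
        lea edx, rawbuf
        add edx, 15
        and edx, -16                    ; edx -> 16-byte aligned block of 256 bytes
        xorps xmm0, xmm0
        mov ecx, 256/16
@@:     movaps [edx], xmm0              ; clear the garbage, 16 bytes per store
        add edx, 16
        dec ecx
        jnz @B
        ret
UseLocal endp

start:
        call UseLocal
        print "ok", 13, 10
        exit
end start
```

For locals larger than a page this only works safely if the stack is already committed, which is what the /stack linker setting in the quote would provide.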