The MASM Forum

General => The Laboratory => Topic started by: jj2007 on October 25, 2013, 07:31:54 PM

Title: Zero a stack buffer (and probe it)
Post by: jj2007 on October 25, 2013, 07:31:54 PM
Spin-off from MemStrategy (http://masm32.com/board/index.php?topic=2515.msg26340#msg26340):

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 268/100 cycles

3778    kCycles for 100 * rep stosd
4905    kCycles for 100 * push 0
4890    kCycles for 100 * push edx
3343    kCycles for 100 * movups xmm0
3319    kCycles for 100 * movaps xmm0

3785    kCycles for 100 * rep stosd
4891    kCycles for 100 * push 0
4894    kCycles for 100 * push edx
3457    kCycles for 100 * movups xmm0
3319    kCycles for 100 * movaps xmm0

3785    kCycles for 100 * rep stosd
4891    kCycles for 100 * push 0
4896    kCycles for 100 * push edx
3342    kCycles for 100 * movups xmm0
3320    kCycles for 100 * movaps xmm0

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
22      bytes for movups xmm0
25      bytes for movaps xmm0
Title: Re: Zero a stack buffer (and probe it)
Post by: Siekmanski on October 25, 2013, 07:58:50 PM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles

2700    kCycles for 100 * rep stosd
5467    kCycles for 100 * push 0
4888    kCycles for 100 * push edx
4266    kCycles for 100 * movups xmm0
1411    kCycles for 100 * movaps xmm0

2752    kCycles for 100 * rep stosd
4887    kCycles for 100 * push 0
5651    kCycles for 100 * push edx
4262    kCycles for 100 * movups xmm0
1030    kCycles for 100 * movaps xmm0

2699    kCycles for 100 * rep stosd
4892    kCycles for 100 * push 0
4888    kCycles for 100 * push edx
4263    kCycles for 100 * movups xmm0
1744    kCycles for 100 * movaps xmm0

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
22      bytes for movups xmm0
25      bytes for movaps xmm0

Title: Re: Zero a stack buffer (and probe it)
Post by: sinsi on October 25, 2013, 08:40:15 PM
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 310/100 cycles

2385    kCycles for 100 * rep stosd
4530    kCycles for 100 * push 0
4508    kCycles for 100 * push edx
3932    kCycles for 100 * movups xmm0
871     kCycles for 100 * movaps xmm0
Title: Re: Zero a stack buffer (and probe it)
Post by: TWell on October 25, 2013, 09:05:27 PM
AMD Athlon(tm) II X2 220 Processor (SSE3) 2.80 GHz
loop overhead is approx. 239/100 cycles

2621    kCycles for 100 * rep stosd
4891    kCycles for 100 * push 0
4895    kCycles for 100 * push edx
1666    kCycles for 100 * movups xmm0
1605    kCycles for 100 * movaps xmm0
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 25, 2013, 10:09:55 PM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 248/100 cycles

4986    kCycles for 100 * rep stosd
4827    kCycles for 100 * push 0
4991    kCycles for 100 * push edx
6187    kCycles for 100 * movups xmm0
2767    kCycles for 100 * movaps xmm0

5023    kCycles for 100 * rep stosd
4857    kCycles for 100 * push 0
4935    kCycles for 100 * push edx
6207    kCycles for 100 * movups xmm0
2766    kCycles for 100 * movaps xmm0

5023    kCycles for 100 * rep stosd
4855    kCycles for 100 * push 0
4990    kCycles for 100 * push edx
6225    kCycles for 100 * movups xmm0
2765    kCycles for 100 * movaps xmm0
Title: Re: Zero a stack buffer (and probe it)
Post by: nidud on October 25, 2013, 10:18:42 PM
I cleaned up the rep stosd function a bit
Code:
mov edx,edi
lea edi,[esp-bufsize]
mov ecx,bufsize/4
xor eax,eax
rep stosd
mov edi,edx
dec ebx

AMD Athlon(tm) II X2 245 Processor (SSE3)
loop overhead is approx. 239/100 cycles

2623    kCycles for 100 * rep stosd
4900    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
1592    kCycles for 100 * movups xmm0
1597    kCycles for 100 * movaps xmm0
1955    kCycles for 100 * rep stosd
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 25, 2013, 10:34:57 PM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 245/100 cycles

5107    kCycles for 100 * rep stosd
4844    kCycles for 100 * push 0
4902    kCycles for 100 * push edx
6153    kCycles for 100 * movups xmm0
2827    kCycles for 100 * movaps xmm0
2815    kCycles for 100 * rep stosd

5111    kCycles for 100 * rep stosd
4873    kCycles for 100 * push 0
4887    kCycles for 100 * push edx
6150    kCycles for 100 * movups xmm0
2795    kCycles for 100 * movaps xmm0
2782    kCycles for 100 * rep stosd

5053    kCycles for 100 * rep stosd
4892    kCycles for 100 * push 0
4850    kCycles for 100 * push edx
6179    kCycles for 100 * movups xmm0
2767    kCycles for 100 * movaps xmm0
2827    kCycles for 100 * rep stosd
Title: Re: Zero a stack buffer (and probe it)
Post by: Gunther on October 25, 2013, 11:22:20 PM
Jochen,

here are the results from an old computer (located in a university laboratory). The other tests from my machine at home will come this evening.

Code:
AMD Athlon(tm) Dual Core Processor 5000B (SSE3)
loop overhead is approx. 239/100 cycles

3779    kCycles for 100 * rep stosd
4897    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
3344    kCycles for 100 * movups xmm0
3347    kCycles for 100 * movaps xmm0

3774    kCycles for 100 * rep stosd
4897    kCycles for 100 * push 0
4899    kCycles for 100 * push edx
3343    kCycles for 100 * movups xmm0
3341    kCycles for 100 * movaps xmm0

3778    kCycles for 100 * rep stosd
4897    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
3344    kCycles for 100 * movups xmm0
3331    kCycles for 100 * movaps xmm0

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
22      bytes for movups xmm0
25      bytes for movaps xmm0

--- ok ---

Gunther
Title: Re: Zero a stack buffer (and probe it)
Post by: nidud on October 25, 2013, 11:35:41 PM
Quote from: dedndave
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 245/100 cycles

5107    kCycles for 100 * rep stosd
4844    kCycles for 100 * push 0
4902    kCycles for 100 * push edx
6153    kCycles for 100 * movups xmm0
2827    kCycles for 100 * movaps xmm0
2815    kCycles for 100 * rep stosd

manipulation of the (direction) flag again  :biggrin:
shaves off some cycles on AMD but more on Intel
Title: Re: Zero a stack buffer (and probe it)
Post by: FORTRANS on October 26, 2013, 12:05:07 AM
Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
loop overhead is approx. 211/100 cycles

7356    kCycles for 100 * rep stosd
4902    kCycles for 100 * push 0
4902    kCycles for 100 * push edx
3059    kCycles for 100 * movups xmm0
2312    kCycles for 100 * movaps xmm0
2207    kCycles for 100 * rep stosd

7358    kCycles for 100 * rep stosd
4905    kCycles for 100 * push 0
4897    kCycles for 100 * push edx
3064    kCycles for 100 * movups xmm0
2303    kCycles for 100 * movaps xmm0
2212    kCycles for 100 * rep stosd

7372    kCycles for 100 * rep stosd
4913    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
3063    kCycles for 100 * movups xmm0
2303    kCycles for 100 * movaps xmm0
2214    kCycles for 100 * rep stosd

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
22      bytes for movups xmm0
25      bytes for movaps xmm0
17      bytes for rep stosd


--- ok ---
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 26, 2013, 12:52:01 AM
Thanks to everybody :icon14:

Quote from: nidud
I cleaned up the rep stosd function a bit

I appreciate your good intentions, Nidud. Put it under TestA, just for fun ;)
(hint: look at this thread's title)
Title: Re: Zero a stack buffer (and probe it)
Post by: MichaelW on October 26, 2013, 01:43:49 AM
Northwood w/htt
Code:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
++18 of 20 tests valid, loop overhead is approx. 309/100 cycles

4910    kCycles for 100 * rep stosd
4902    kCycles for 100 * push 0
4904    kCycles for 100 * push edx
4904    kCycles for 100 * movups xmm0
2144    kCycles for 100 * movaps xmm0

4910    kCycles for 100 * rep stosd
5130    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
4893    kCycles for 100 * movups xmm0
2140    kCycles for 100 * movaps xmm0

4911    kCycles for 100 * rep stosd
4909    kCycles for 100 * push 0
4903    kCycles for 100 * push edx
4895    kCycles for 100 * movups xmm0
2150    kCycles for 100 * movaps xmm0

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
22      bytes for movups xmm0
25      bytes for movaps xmm0

Title: Re: Zero a stack buffer (and probe it)
Post by: nidud on October 26, 2013, 01:47:07 AM
Quote from: jj2007
Put it under TestA, just for fun ;)

the intention should at best be educational  :P

having both of them will illustrate the penalty of manipulating the flags on different CPUs
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 26, 2013, 01:50:10 AM
i try to avoid STD's   :lol:

in fact, i have gotten to where i don't use them at all
if i have to move things in that direction, i write a discrete loop

in this case, you could probe, then clear, one page at a time
something like this...
Code:
    ASSUME  FS:Nothing

    mov     edx,esp
    mov     fs:[700h],edi
    xor     eax,eax
    sub     edx,<NumberOfBytesRequiredPlus3Mod4>
    .repeat
        push    eax
        mov     ecx,esp
        mov     esp,fs:[8]
        sub     ecx,esp
        shr     ecx,2
        .if !ZERO?
            mov     edi,esp
            rep     stosd
        .endif
    .until edx>=esp
    mov     edi,fs:[700h]
    mov     esp,edx

    ASSUME  FS:ERROR
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 26, 2013, 02:28:02 AM
this is a simpler version...
Code:
    ASSUME  FS:Nothing

    mov     edx,esp
    mov     ecx,esp
    sub     edx,<NumberOfBytesRequiredPlus3Mod4>
    .repeat
        push    eax
        mov     esp,fs:[8]
    .until edx>=esp
    sub     ecx,edx
    xchg    edx,edi
    shr     ecx,2
    xor     eax,eax
    mov     esp,edi
    rep     stosd
    mov     edi,edx

    ASSUME  FS:ERROR
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 26, 2013, 09:40:39 AM
Quote from: nidud
(quoting jj2007: Put it under TestA, just for fun ;) )
the intention should at best be educational  :P
having both of them will illustrate the penalty of manipulating the flags on different CPUs

Yes, that's true. Although it seems it's not the flag setting but rather the "wrong" direction that makes rep stosd slow.

Attached a new version with a modified StackBuffer() macro. Your algo is "rep stosd up" ;-)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles

4601    kCycles for 100 * rep stosd
4893    kCycles for 100 * push 0
4890    kCycles for 100 * push edx
2151    kCycles for 100 * StackBuffer (with zeroing)
2141    kCycles for 100 * movaps xmm0
1701    kCycles for 100 * rep stosd up

4592    kCycles for 100 * rep stosd
4891    kCycles for 100 * push 0
4894    kCycles for 100 * push edx
2142    kCycles for 100 * StackBuffer (with zeroing)
2141    kCycles for 100 * movaps xmm0
1697    kCycles for 100 * rep stosd up
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 26, 2013, 09:44:49 AM
prescott w/htt
Code:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 263/100 cycles

4957    kCycles for 100 * rep stosd
4902    kCycles for 100 * push 0
4931    kCycles for 100 * push edx
2934    kCycles for 100 * StackBuffer (with zeroing)
3006    kCycles for 100 * movaps xmm0
3035    kCycles for 100 * rep stosd up

4956    kCycles for 100 * rep stosd
4982    kCycles for 100 * push 0
4869    kCycles for 100 * push edx
2777    kCycles for 100 * StackBuffer (with zeroing)
2840    kCycles for 100 * movaps xmm0
2823    kCycles for 100 * rep stosd up

4954    kCycles for 100 * rep stosd
4863    kCycles for 100 * push 0
4940    kCycles for 100 * push edx
2799    kCycles for 100 * StackBuffer (with zeroing)
2801    kCycles for 100 * movaps xmm0
2779    kCycles for 100 * rep stosd up

4993    kCycles for 100 * rep stosd
4908    kCycles for 100 * push 0
4940    kCycles for 100 * push edx
2767    kCycles for 100 * StackBuffer (with zeroing)
2811    kCycles for 100 * movaps xmm0
2790    kCycles for 100 * rep stosd up

5074    kCycles for 100 * rep stosd
4911    kCycles for 100 * push 0
5082    kCycles for 100 * push edx
2860    kCycles for 100 * StackBuffer (with zeroing)
2767    kCycles for 100 * movaps xmm0
2779    kCycles for 100 * rep stosd up

5073    kCycles for 100 * rep stosd
4907    kCycles for 100 * push 0
5175    kCycles for 100 * push edx
2826    kCycles for 100 * StackBuffer (with zeroing)
2796    kCycles for 100 * movaps xmm0
2801    kCycles for 100 * rep stosd up

the last 3 are more-or-less the same on a P4
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 26, 2013, 09:50:17 AM
Quote from: dedndave
prescott w/htt
...
the last 3 are more-or-less the same on a P4

Yes, they look similar on older CPUs. The i7s behave quite differently.
For probing only (no zeroing), StackBuffer is more than twice as fast.

P.S.: Jeri's CastAR looks really impressive :t
Title: Re: Zero a stack buffer (and probe it)
Post by: MichaelW on October 26, 2013, 09:53:50 AM
Northwood w/htt
Code:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
++18 of 20 tests valid, loop overhead is approx. 331/100 cycles

4915    kCycles for 100 * rep stosd
4928    kCycles for 100 * push 0
4911    kCycles for 100 * push edx
2141    kCycles for 100 * StackBuffer (with zeroing)
2153    kCycles for 100 * movaps xmm0
2326    kCycles for 100 * rep stosd up

4912    kCycles for 100 * rep stosd
4905    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
2140    kCycles for 100 * StackBuffer (with zeroing)
2141    kCycles for 100 * movaps xmm0
2199    kCycles for 100 * rep stosd up

4907    kCycles for 100 * rep stosd
4892    kCycles for 100 * push 0
4907    kCycles for 100 * push edx
2141    kCycles for 100 * StackBuffer (with zeroing)
2141    kCycles for 100 * movaps xmm0
2212    kCycles for 100 * rep stosd up

Title: Re: Zero a stack buffer (and probe it)
Post by: KeepingRealBusy on October 26, 2013, 01:42:11 PM
My laptop:
Code:
AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
loop overhead is approx. 495/100 cycles

4656    kCycles for 100 * rep stosd
8331    kCycles for 100 * push 0
7636    kCycles for 100 * push edx
2845    kCycles for 100 * StackBuffer (with zeroing)
2711    kCycles for 100 * movaps xmm0
2585    kCycles for 100 * rep stosd up

3488    kCycles for 100 * rep stosd
5722    kCycles for 100 * push 0
5240    kCycles for 100 * push edx
1712    kCycles for 100 * StackBuffer (with zeroing)
1650    kCycles for 100 * movaps xmm0
1600    kCycles for 100 * rep stosd up

2200    kCycles for 100 * rep stosd
4188    kCycles for 100 * push 0
4251    kCycles for 100 * push edx
1715    kCycles for 100 * StackBuffer (with zeroing)
1565    kCycles for 100 * movaps xmm0
1580    kCycles for 100 * rep stosd up

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
29      bytes for StackBuffer (with zeroing)
25      bytes for movaps xmm0
17      bytes for rep stosd up

--- ok ---

A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?

Dave.
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 26, 2013, 04:19:12 PM
Quote from: KeepingRealBusy
A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?

Why not ;-)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

4725    kCycles for 100 * rep stosd
2683    kCycles for 100 * HeapAlloc (*8)
2202    kCycles for 100 * StackBuffer (with zeroing)
2887    kCycles for 100 * StackBuffer (unrolled)
2207    kCycles for 100 * movaps xmm0
1746    kCycles for 100 * rep stosd up

29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
Title: Re: Zero a stack buffer (and probe it)
Post by: Siekmanski on October 26, 2013, 05:57:10 PM
StackBuffer2b.exe doesn't work with Windows 8.1

StackBuffer2.exe works OK:

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles

2712    kCycles for 100 * rep stosd
4887    kCycles for 100 * push 0
4885    kCycles for 100 * push edx
1686    kCycles for 100 * StackBuffer (with zeroing)
923     kCycles for 100 * movaps xmm0
1027    kCycles for 100 * rep stosd up

2609    kCycles for 100 * rep stosd
5312    kCycles for 100 * push 0
4887    kCycles for 100 * push edx
943     kCycles for 100 * StackBuffer (with zeroing)
1989    kCycles for 100 * movaps xmm0
1086    kCycles for 100 * rep stosd up

2608    kCycles for 100 * rep stosd
5588    kCycles for 100 * push 0
5649    kCycles for 100 * push edx
971     kCycles for 100 * StackBuffer (with zeroing)
1654    kCycles for 100 * movaps xmm0
981     kCycles for 100 * rep stosd up

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
29      bytes for StackBuffer (with zeroing)
25      bytes for movaps xmm0
17      bytes for rep stosd up


--- ok ---
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 26, 2013, 06:07:28 PM
Quote from: Siekmanski
StackBuffer2b.exe doesn't work with Windows 8.1

Oops, it seems I uploaded a version with a nice little int 3 inside. Try 2c above...
Apologies :redface:
Title: Re: Zero a stack buffer (and probe it)
Post by: Siekmanski on October 26, 2013, 06:20:02 PM
StackBuffer2c   :t

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 576/100 cycles

2674    kCycles for 100 * rep stosd
1810    kCycles for 100 * HeapAlloc (*8 )
940     kCycles for 100 * StackBuffer (with zeroing)
3376    kCycles for 100 * StackBuffer (unrolled)
990     kCycles for 100 * movaps xmm0
1804    kCycles for 100 * rep stosd up

2672    kCycles for 100 * rep stosd
1117    kCycles for 100 * HeapAlloc (*8 )
957     kCycles for 100 * StackBuffer (with zeroing)
2613    kCycles for 100 * StackBuffer (unrolled)
1391    kCycles for 100 * movaps xmm0
1054    kCycles for 100 * rep stosd up

2672    kCycles for 100 * rep stosd
1824    kCycles for 100 * HeapAlloc (*8 )
962     kCycles for 100 * StackBuffer (with zeroing)
3380    kCycles for 100 * StackBuffer (unrolled)
934     kCycles for 100 * movaps xmm0
1789    kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8 )
29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
25      bytes for movaps xmm0
17      bytes for rep stosd up
Title: Re: Zero a stack buffer (and probe it)
Post by: sinsi on October 26, 2013, 06:36:11 PM
A bit of a difference in rep stosd...

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 328/100 cycles

2449    kCycles for 100 * rep stosd
1025    kCycles for 100 * HeapAlloc (*8 )
951     kCycles for 100 * StackBuffer (with zeroing)
2384    kCycles for 100 * StackBuffer (unrolled)
927     kCycles for 100 * movaps xmm0
955     kCycles for 100 * rep stosd up
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 26, 2013, 06:50:14 PM
Thanks :icon14:
Astonishing that the unrolled version is so much slower...

         xorps xmm0, xmm0
         ifnb <unrolled>
            shr eax, 4+2   ; bufsize/(16*4): four OWORDs per iteration
            mov edx, esp   ; save current stack pointer
            and esp, -16   ; aligned for SSE2
            align 4
            .Repeat
                  sub esp, 4*OWORD
                  movdqa OWORD ptr [esp], xmm0
                  movdqa OWORD ptr [1*OWORD+esp], xmm0
                  movdqa OWORD ptr [2*OWORD+esp], xmm0
                  movdqa OWORD ptr [3*OWORD+esp], xmm0
                  dec eax
            .Until Zero?
         else
            shr eax, 4   ; /16
            mov edx, esp   ; save current stack pointer
            and esp, -16   ; aligned for SSE2
            align 4
            .Repeat
                  sub esp, OWORD
                  movaps OWORD ptr [esp], xmm0
                  dec eax
            .Until Zero?
         endif
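
A quick check of the loop counts in the fragment above, as a sketch only. It assumes eax holds the buffer size on entry (the fragment does not show this) and uses the bufsize of 102400 bytes that is quoted later in the thread. Python is used here purely for the arithmetic; the variable names are made up for illustration.

```python
OWORD = 16          # size of one 128-bit store in bytes
bufsize = 102400    # assumption: the value quoted later in the thread

simple_loops = bufsize >> 4          # shr eax, 4    -> one OWORD per pass
unrolled_loops = bufsize >> (4 + 2)  # shr eax, 4+2  -> four OWORDs per pass

# Both variants should cover the same number of bytes.
simple_bytes = simple_loops * OWORD
unrolled_bytes = unrolled_loops * 4 * OWORD

print(simple_loops, unrolled_loops, simple_bytes, unrolled_bytes)
```

Under these assumptions the unrolled loop runs a quarter as many passes over the same bytes, so any slowdown would have to come from something other than the amount of work.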
Title: Re: Zero a stack buffer (and probe it)
Post by: sinsi on October 26, 2013, 07:19:16 PM
Aligning the stack buffer to 64 (one cache line) gives me almost identical times for looped/unrolled.

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 202/100 cycles

945     kCycles for 100 * StackBuffer (with zeroing)
993     kCycles for 100 * StackBuffer (unrolled)

949     kCycles for 100 * StackBuffer (with zeroing)
933     kCycles for 100 * StackBuffer (unrolled)

926     kCycles for 100 * StackBuffer (with zeroing)
933     kCycles for 100 * StackBuffer (unrolled)
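
For readers following along: `and esp, -16` in the macro rounds the stack pointer down to a 16-byte boundary. The 64-byte change is not shown in this post; one way to get it (an assumption here, not confirmed by the thread) would be to mask with -64 instead. A small Python sketch of the masking arithmetic, with a made-up example address:

```python
def align_down(addr, alignment):
    """Round addr down to a multiple of alignment (a power of two).

    Same effect as `and reg, -alignment` on a 32-bit register.
    """
    return addr & -alignment

esp = 0x12FF5C  # hypothetical stack pointer, for illustration only

print(hex(align_down(esp, 16)))  # boundary needed by movaps/movdqa
print(hex(align_down(esp, 64)))  # cache-line boundary
```

A 64-byte aligned pointer is also 16-byte aligned, so the aligned SSE stores keep working either way.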
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 26, 2013, 07:48:44 PM
Quote from: sinsi
Aligning the stack buffer to 64 (one cache line) gives me almost identical times for looped/unrolled.

Same effect for reordering:
            and esp, -16   ; aligned for SSE2
            align 4
            .Repeat
                  sub esp, 4*OWORD
                  movaps OWORD ptr [3*OWORD+esp], xmm0
                  movaps OWORD ptr [2*OWORD+esp], xmm0
                  movaps OWORD ptr [1*OWORD+esp], xmm0
                  movaps OWORD ptr [0*OWORD+esp], xmm0
                  dec eax
            .Until Zero?

But the timings are identical, so no need for unrolling.
Title: Re: Zero a stack buffer (and probe it)
Post by: Gunther on October 26, 2013, 10:55:20 PM
Jochen,

timings from my machine at home:

Code:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 229/100 cycles

2287    kCycles for 100 * rep stosd
4347    kCycles for 100 * push 0
4314    kCycles for 100 * push edx
830     kCycles for 100 * StackBuffer (with zeroing)
822     kCycles for 100 * movaps xmm0
847     kCycles for 100 * rep stosd up

2290    kCycles for 100 * rep stosd
4321    kCycles for 100 * push 0
4289    kCycles for 100 * push edx
828     kCycles for 100 * StackBuffer (with zeroing)
821     kCycles for 100 * movaps xmm0
859     kCycles for 100 * rep stosd up

2293    kCycles for 100 * rep stosd
4303    kCycles for 100 * push 0
4298    kCycles for 100 * push edx
830     kCycles for 100 * StackBuffer (with zeroing)
812     kCycles for 100 * movaps xmm0
847     kCycles for 100 * rep stosd up

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
29      bytes for StackBuffer (with zeroing)
25      bytes for movaps xmm0
17      bytes for rep stosd up

--- ok ---

Gunther
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 26, 2013, 11:05:51 PM
version 2c on a prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 264/100 cycles

5102    kCycles for 100 * rep stosd
4717    kCycles for 100 * HeapAlloc (*8 )
2851    kCycles for 100 * StackBuffer (with zeroing)
3962    kCycles for 100 * StackBuffer (unrolled)
2835    kCycles for 100 * movaps xmm0
2872    kCycles for 100 * rep stosd up

5145    kCycles for 100 * rep stosd
3681    kCycles for 100 * HeapAlloc (*8 )
2862    kCycles for 100 * StackBuffer (with zeroing)
3950    kCycles for 100 * StackBuffer (unrolled)
2894    kCycles for 100 * movaps xmm0
2844    kCycles for 100 * rep stosd up

5111    kCycles for 100 * rep stosd
3769    kCycles for 100 * HeapAlloc (*8 )
2836    kCycles for 100 * StackBuffer (with zeroing)
3950    kCycles for 100 * StackBuffer (unrolled)
2846    kCycles for 100 * movaps xmm0
2900    kCycles for 100 * rep stosd up


can't beat REP STOSD for simplicity   :P
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 26, 2013, 11:20:33 PM
Quote from: dedndave
can't beat REP STOSD for simplicity   :P

But for the fast "rep stosd up" you need to write an SEH, that makes it slightly more complicated again :icon_mrgreen:
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 26, 2013, 11:24:40 PM
why SEH ?
i posted code that does STOSD up with no SEH
but, you haven't incorporated it

a little update...
Code:
    ASSUME  FS:Nothing

    mov     edx,edi
    mov     edi,esp
    mov     ecx,esp
    sub     edi,<NumberOfBytesRequiredPlus3Mod4>
    .repeat
        push    eax
        mov     esp,fs:[8]
    .until edi>=esp
    sub     ecx,edi
    shr     ecx,2
    xor     eax,eax
    mov     esp,edi
    rep     stosd
    mov     edi,edx

    ASSUME  FS:ERROR
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 26, 2013, 11:33:48 PM
Quote from: dedndave
but, you haven't incorporated it

I've tried to but it crashes ::)
Set useE=1 in the source...
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 26, 2013, 11:52:07 PM
if it crashes, there must be a simple reason - lol
how much memory are you trying to allocate ?

try the attached test code...
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 26, 2013, 11:55:02 PM
virgin 2d prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 254/100 cycles

5184    kCycles for 100 * rep stosd
4122    kCycles for 100 * HeapAlloc (*8 )
2853    kCycles for 100 * StackBuffer (with zeroing)
2868    kCycles for 100 * StackBuffer (unrolled)
2899    kCycles for 100 * rep stosd up

5116    kCycles for 100 * rep stosd
3073    kCycles for 100 * HeapAlloc (*8 )
2839    kCycles for 100 * StackBuffer (with zeroing)
2849    kCycles for 100 * StackBuffer (unrolled)
2862    kCycles for 100 * rep stosd up

5161    kCycles for 100 * rep stosd
3080    kCycles for 100 * HeapAlloc (*8 )
2843    kCycles for 100 * StackBuffer (with zeroing)
2873    kCycles for 100 * StackBuffer (unrolled)
2848    kCycles for 100 * rep stosd up
Title: Re: Zero a stack buffer (and probe it)
Post by: Gunther on October 27, 2013, 12:01:55 AM
StackBuffer2d results:
Code:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 193/100 cycles

2341    kCycles for 100 * rep stosd
987     kCycles for 100 * HeapAlloc (*8)
839     kCycles for 100 * StackBuffer (with zeroing)
830     kCycles for 100 * StackBuffer (unrolled)
921     kCycles for 100 * rep stosd up

2345    kCycles for 100 * rep stosd
985     kCycles for 100 * HeapAlloc (*8)
829     kCycles for 100 * StackBuffer (with zeroing)
867     kCycles for 100 * StackBuffer (unrolled)
872     kCycles for 100 * rep stosd up

2339    kCycles for 100 * rep stosd
989     kCycles for 100 * HeapAlloc (*8)
850     kCycles for 100 * StackBuffer (with zeroing)
906     kCycles for 100 * StackBuffer (unrolled)
875     kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
17      bytes for rep stosd up

--- ok ---

Dave,

your ProbeTest works fine under 64 bit.

Gunther
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 27, 2013, 12:53:50 AM
thanks Gunther - whew !

only thing i can think of is Jochen is trying to allocate more than is reserved
or - perhaps he has some other flaw that makes the try/catch thing necessary
(he must have been a C programmer in a previous life  :P )

but - i think there is a major flaw in the idea of speed-tests for probing code
once you have committed that memory, it remains committed until you release it and the OS allocates it elsewhere
to overcome this, you might try HeapAlloc
if the OS needs that space for the heap, it should "reset" the amount committed

i don't think altering the value at FS:[8] is a good idea - lol
sounds like a memory leak waiting to happen
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 27, 2013, 01:19:38 AM
ok
the default reserve is supposed to be 1 MB = 1,048,576 (100000h)
i can only allocate up to 1,032,192 (0FC000h) without a crash

that must be why Jochen is having to use SEH
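
For reference, the gap between the two figures above is easy to check. The snippet is plain arithmetic in Python; the 4 KB page size is an assumption (the usual x86 value), and nothing here explains why the last pages cannot be used.

```python
RESERVE = 0x100000    # default stack reserve quoted above (1 MB)
MAX_ALLOC = 0x0FC000  # largest allocation that did not crash
PAGE_SIZE = 0x1000    # assumption: 4 KB pages

gap_bytes = RESERVE - MAX_ALLOC
gap_pages = gap_bytes // PAGE_SIZE

print(hex(gap_bytes), gap_bytes, gap_pages)
```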
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 27, 2013, 04:38:46 AM
Quote from: dedndave
only thing i can think of is Jochen is trying to allocate more than is reserved
or - perhaps he has some other flaw that makes the try/catch thing necessary

Dave,

bufsize is 102400 bytes, no big deal. The Try/Catch thing would be needed for the "rep stosd up" algo, simply because it doesn't probe the stack.

Here is your code embedded in the testbed, it doesn't crash any more but 2 kCycles is a bit fast... some more comments would be nice, or maybe I am just too tired to understand it :(

TestE proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  mov esi, esp  ; check the stack
  align 4
  .Repeat
   mov edx, edi
   mov edi, esp
   mov ecx, esp
   sub edi, (bufsize+3) MOD 4      ;<NumberOfBytesRequiredPlus3Mod4>
   .repeat
      push eax
      ASSUME FS:Nothing
      mov esp, fs:[8]
      ASSUME FS:ERROR
   .until edi>=esp
   sub ecx, edi
   shr ecx, 2
   xor eax, eax
   mov esp, edi
   rep stosd
   mov edi, edx
   add esp, (bufsize+3) MOD 4   ; restore stack
   dec ebx
  .Until Sign?
  sub esi, esp
  .if !Zero?   ; OK
   print str$(esi), " STACKDIFF"
   exit
  .endif
  ret
TestE endp
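
A note on the 2 kCycles figure, assuming bufsize is 102400 as stated above: read literally, `(bufsize+3) MOD 4` is a remainder, so only 3 bytes are subtracted from the stack pointer and the `rep stosd` count (after `shr ecx, 2`) comes out as zero. If the placeholder was meant as "size rounded up to a multiple of 4", that would be `(bufsize+3) AND -4`. The sketch below uses Python only to check the arithmetic; the function names are made up for illustration.

```python
BUFSIZE = 102400  # bufsize as stated in the post above

def plus3_mod4(n):
    """Literal reading of the placeholder: remainder of (n + 3) / 4."""
    return (n + 3) % 4

def round_up_to_dword(n):
    """Round n up to the next multiple of 4, i.e. (n + 3) AND -4."""
    return (n + 3) & -4

literal = plus3_mod4(BUFSIZE)
rounded = round_up_to_dword(BUFSIZE)

# rep stosd count as computed in TestE: bytes between old and new esp, / 4
print(literal, literal >> 2)
print(rounded, rounded >> 2)
```

With the literal reading the store count is zero, which would be consistent with a loop that does almost no work.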
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 27, 2013, 10:58:16 AM
it's not too bad
i probe down the stack by using the TEB.StackLimit value from FS:[8]
then, i use REP STOSD to clear it out

the probe part was discussed at length...
http://masm32.com/board/index.php?topic=1363
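
The probe loop can be modelled in a few lines, as a rough sketch rather than a description of what Windows does internally. The model assumes 4 KB pages and that a write just below the current stack limit makes the OS commit one more page and lower the limit by one page (compare the comment "limit might be 4k lower now" further down the thread). The function name and the example addresses are hypothetical.

```python
PAGE = 0x1000  # assumption: 4 KB pages

def probe(esp, stack_limit, nbytes):
    """Model of the probe loop; returns (new_esp, new_stack_limit, pushes)."""
    target = esp - nbytes            # sub edi, nbytes
    pushes = 0
    while True:                      # .repeat
        esp -= 4                     # push eax
        pushes += 1
        if esp < stack_limit:        # write landed in the guard page
            stack_limit -= PAGE      # modelled OS reaction: commit one page
        esp = stack_limit            # mov esp, fs:[8]
        if target >= esp:            # .until edi>=esp
            break
    return target, stack_limit, pushes  # mov esp, edi

new_esp, new_limit, pushes = probe(0x12FF00, 0x12E000, 102400)
print(hex(new_esp), hex(new_limit), pushes)
```

In this model the loop costs one push per page, independent of how many bytes are later cleared by `rep stosd`.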
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 27, 2013, 12:03:16 PM
Thanks, Dave - I had not seen that thread. Now it's clearer...

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

4710    kCycles for 100 * rep stosd
2220    kCycles for 100 * HeapAlloc (*8)
2193    kCycles for 100 * StackBuffer (with zeroing)
2192    kCycles for 100 * StackBuffer (unrolled)
4697    kCycles for 100 * dedndave
1738    kCycles for 100 * rep stosd up


This is for slightly modified code, taking account of the need to save & restore the old stack:

  .Repeat
        mov edx, edi        ; save edi
        mov edi, esp
        mov eax, esp        ; save old stack
        sub edi, (bufsize+3+4)        ;<NumberOfBytesRequiredPlus3Mod4>
        and edi, -4        ; aligns new stack
        .repeat
                push eax        ; tickle the guard page
                ASSUME FS:Nothing
                mov esp, fs:[8]        ; limit might be 4k lower now
                ASSUME FS:ERROR
        .until edi>=esp        ; loop until we've got enough
        mov esp, edi        ; new stack
        stosd        ; save old stack to [edi]
        xchg eax, ecx
        push edi        ; retval for macro
        sub ecx, edi
        shr ecx, 2
        xor eax, eax
        rep stosd
        pop eax        ; retval for macro
        mov edi, edx        ; restore edi
        ; ... code that uses buffer...
        pop esp        ; restore stack
        dec ebx
  .Until Sign?


I hope I didn't misunderstand anything - for some time I was thoroughly confused by your NumberOfBytesRequiredPlus3Mod4 ::)
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 27, 2013, 12:10:08 PM
sorry for the confusion - it's just a number that is mod4=0
it could be an immediate - or a value calculated in EAX

as for restoring the stack.....

Code:
        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

MyProc PROC parm1:DWORD

        push    ebx
        push    esi
        push    edi            ;push/pops on EBX ESI EDI are optional, of course

        push    ebp
        mov     ebp,esp        ;set up the frame pointer so LEAVE can restore ESP

;stack probe code here

;stack clear code here

;use stack space, as required

        leave

        pop     edi
        pop     esi
        pop     ebx
        ret     4

MyProc ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 27, 2013, 12:34:09 PM
Quote from: dedndave
as for restoring the stack..... leave

I've tried that but it crashes. If you have working code, please insert into the source :icon14:

Anyway, speed-wise it doesn't look so convincing. By the way, the forum software translates *8 into a smiley - HeapAlloc is actually tested with one eighth of the buffer size, because it's so slow :(
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 27, 2013, 01:46:48 PM
give this a try, my friend
i am anxious to see if it crashes on you   :P

it should display the allocation size (F0000), then 0 (cleared OR test result)

i commented it heavily, just for you   :biggrin:
Title: Re: Zero a stack buffer (and probe it)
Post by: KeepingRealBusy on October 27, 2013, 02:04:42 PM
A question, what about unrolling the xmm loop and use 8 movdqa's to reduce the loop overhead.

Why not ;-)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

4725    kCycles for 100 * rep stosd
2683    kCycles for 100 * HeapAlloc (*8)
2202    kCycles for 100 * StackBuffer (with zeroing)
2887    kCycles for 100 * StackBuffer (unrolled)
2207    kCycles for 100 * movaps xmm0
1746    kCycles for 100 * rep stosd up

29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)


Jochen,

Using this version, I made some changes. The original movaps test was TestE. I made an unrolled version as TestI, then used some similar code to modify TestE and saved the results as TestG and TestH. The modifications were to move the "constant" initializations out of the REPEAT loops and execute them at the beginning of the test (before the REPEATs). The following are the .lst sections for TestE, TestG, TestH, and TestI (just to check alignments):

Code: [Select]
align 16
 00000190 TestE_s:
 = movaps xmm0 NameE equ movaps xmm0 ; assign a descriptive name here
 00000190 TestE proc
 00000190  BB 00000063   mov ebx, AlgoLoops-1 ; loop e.g. 100x
  align 4
  .Repeat
 00000198    *@C0011:
 00000198  8B CC mov ecx, esp
 0000019A  8D 84 24 lea eax, [esp-bufsize]
     FFFE7000
 000001A1  83 E4 F0 and esp, -16 ;  needs a reg or local to store original esp
 000001A4  0F 57 C0 xorps xmm0, xmm0
; align 4
.Repeat
 000001A7    *@C0012:
 000001A7  83 EC 10 sub esp, OWORD
 000001AA  0F 29 04 24 movaps OWORD ptr [esp], xmm0 ; movaps <1% faster on AMD
.Until esp<=eax
 000001AE  3B E0    *     cmp    esp, eax
 000001B0  77 F5    *     ja @C0012
 000001B2  8B E1 mov esp, ecx
; add esp, bufsize
 000001B4  4B dec ebx
  .Until Sign?
 000001B5  79 E1    *     jns    @C0011
 000001B7  C3   ret
 000001B8 TestE endp
 000001B8 TestE_endp:

align 16
 000001E0 TestG_s:
 = movaps xmm0 (down) NameG equ movaps xmm0 (down) ; assign a descriptive name here
 000001E0 TestG proc
 000001E0  BB 00000063   mov ebx, AlgoLoops-1 ; loop e.g. 100x
 000001E5  8B CC   mov ecx, esp
 000001E7  8B F4   mov esi, esp
 000001E9  BA FFFFFFF0   mov edx, -OWORD
 000001EE  83 E6 F0   and esi, -16
 000001F1  0F 57 C0   xorps xmm0, xmm0
 000001F4  8D 86 FFFE7000   lea eax, [esi-bufsize]
  align 16
  .Repeat
 00000200    *@C0017:
 00000200  8B E6         mov esp, esi
.Repeat
 00000202    *@C0018:
 00000202  8D 24 14                 lea esp,[esp+edx]
 00000205  0F 29 04 24 movaps OWORD ptr [esp], xmm0 ; movaps <1% faster on AMD
.Until esp==eax
 00000209  3B E0    *     cmp    esp, eax
 0000020B  75 F5    *     jne    @C0018
 0000020D  4B dec ebx
  .Until Sign?
 0000020E  79 F0    *     jns    @C0017
 00000210  8B E1   mov esp, ecx
 00000212  C3   ret
 00000213 TestG endp
 00000213 TestG_endp:

align 16
 00000220 TestH_s:
 = movaps xmm0 (up) NameH equ movaps xmm0 (up) ; assign a descriptive name here
 00000220 TestH proc
 00000220  BB 00000063   mov ebx, AlgoLoops-1 ; loop e.g. 100x
 00000225  8B CC   mov ecx, esp
 00000227  8D B4 24   lea esi, [esp-bufsize]
     FFFE7000
 0000022E  BA 00000010   mov edx, OWORD
 00000233  83 E6 F0   and esi, -16
 00000236  0F 57 C0   xorps xmm0, xmm0
 00000239  8D 86 00019000   lea eax, [esi+bufsize]
  align 16
  .Repeat
 00000240    *@C001B:
 00000240  8B E6         mov esp, esi
.Repeat
 00000242    *@C001C:
 00000242  0F 29 04 24 movaps OWORD ptr [esp+(0*OWORD)], xmm0 ; movaps <1% faster on AMD
 00000246  8D 24 14                 lea esp,[esp+edx]
.Until esp==eax
 00000249  3B E0    *     cmp    esp, eax
 0000024B  75 F5    *     jne    @C001C
 0000024D  4B dec ebx
  .Until Sign?
 0000024E  79 F0    *     jns    @C001B
 00000250  8B E1   mov esp, ecx
 00000252  C3   ret
 00000253 TestH endp
 00000253 TestH_endp:

align 16
 00000260 TestI_s:
 = movaps xmm0 (unrolled) NameI equ movaps xmm0 (unrolled) ; assign a descriptive name here
 00000260 TestI proc
 00000260  BB 00000063   mov ebx, AlgoLoops-1 ; loop e.g. 100x
 00000265  8B CC   mov ecx, esp
 00000267  8D B4 24   lea esi, [esp-bufsize]
     FFFE7000
 0000026E  BA 00000080   mov edx, (8*OWORD)
 00000273  83 E6 F0   and esi, -16
 00000276  0F 57 C0   xorps xmm0, xmm0
 00000279  8D 86 00019000   lea eax, [esi+bufsize]
  .Repeat
 0000027F    *@C001F:
 0000027F  8B E6         mov esp, esi
  align 16
.Repeat
 00000290    *@C0020:
 00000290  0F 29 04 24 movaps OWORD ptr [esp+(0*OWORD)], xmm0 ; movaps <1% faster on AMD
 00000294  0F 29 44 24 10 movaps OWORD ptr [esp+(1*OWORD)], xmm0 ; movaps <1% faster on AMD
 00000299  0F 29 44 24 20 movaps OWORD ptr [esp+(2*OWORD)], xmm0 ; movaps <1% faster on AMD
 0000029E  0F 29 44 24 30 movaps OWORD ptr [esp+(3*OWORD)], xmm0 ; movaps <1% faster on AMD
 000002A3  0F 29 44 24 40 movaps OWORD ptr [esp+(4*OWORD)], xmm0 ; movaps <1% faster on AMD
 000002A8  0F 29 44 24 50 movaps OWORD ptr [esp+(5*OWORD)], xmm0 ; movaps <1% faster on AMD
 000002AD  0F 29 44 24 60 movaps OWORD ptr [esp+(6*OWORD)], xmm0 ; movaps <1% faster on AMD
 000002B2  0F 29 44 24 70 movaps OWORD ptr [esp+(7*OWORD)], xmm0 ; movaps <1% faster on AMD
 000002B7  8D 24 14                 lea esp,[esp+edx]
.Until esp==eax
 000002BA  3B E0    *     cmp    esp, eax
 000002BC  75 D2    *     jne    @C0020
; add esp, bufsize
 000002BE  4B dec ebx
  .Until Sign?
 000002BF  79 BE    *     jns    @C001F
 000002C1  8B E1   mov esp, ecx
 000002C3  C3   ret
 000002C4 TestI endp
 000002C4 TestI_endp:

The following are my executions:

Code: [Select]
AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
loop overhead is approx. 433/100 cycles

5229    kCycles for 100 * rep stosd
3627    kCycles for 100 * HeapAlloc (*8)
3274    kCycles for 100 * StackBuffer (with zeroing)
3278    kCycles for 100 * StackBuffer (unrolled)
3193    kCycles for 100 * movaps xmm0
3118    kCycles for 100 * rep stosd up
2798    kCycles for 100 * movaps xmm0 (down)
2974    kCycles for 100 * movaps xmm0 (up)
2895    kCycles for 100 * movaps xmm0 (unrolled)

3573    kCycles for 100 * rep stosd
2709    kCycles for 100 * HeapAlloc (*8)
2458    kCycles for 100 * StackBuffer (with zeroing)
2481    kCycles for 100 * StackBuffer (unrolled)
2426    kCycles for 100 * movaps xmm0
2218    kCycles for 100 * rep stosd up
2086    kCycles for 100 * movaps xmm0 (down)
2329    kCycles for 100 * movaps xmm0 (up)
2273    kCycles for 100 * movaps xmm0 (unrolled)

2244    kCycles for 100 * rep stosd
1512    kCycles for 100 * HeapAlloc (*8)
1422    kCycles for 100 * StackBuffer (with zeroing)
1403    kCycles for 100 * StackBuffer (unrolled)
1546    kCycles for 100 * movaps xmm0
1448    kCycles for 100 * rep stosd up
1424    kCycles for 100 * movaps xmm0 (down)
1561    kCycles for 100 * movaps xmm0 (up)
1502    kCycles for 100 * movaps xmm0 (unrolled)

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
25      bytes for movaps xmm0
17      bytes for rep stosd up
36      bytes for movaps xmm0 (down)
36      bytes for movaps xmm0 (up)
85      bytes for movaps xmm0 (unrolled)

--- ok ---

The times are interesting. I have attached a zip of my .asm and .exe file.

Dave.
Title: Re: Zero a stack buffer (and probe it)
Post by: Siekmanski on October 27, 2013, 06:28:33 PM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles

2673    kCycles for 100 * rep stosd
1437    kCycles for 100 * HeapAlloc (*8 )
1667    kCycles for 100 * StackBuffer (with zeroing)
2611    kCycles for 100 * StackBuffer (unrolled)
1680    kCycles for 100 * movaps xmm0
1027    kCycles for 100 * rep stosd up
1680    kCycles for 100 * movaps xmm0 (down)
957     kCycles for 100 * movaps xmm0 (up)
973     kCycles for 100 * movaps xmm0 (unrolled)

2672    kCycles for 100 * rep stosd
1500    kCycles for 100 * HeapAlloc (*8 )
1687    kCycles for 100 * StackBuffer (with zeroing)
2608    kCycles for 100 * StackBuffer (unrolled)
1681    kCycles for 100 * movaps xmm0
1029    kCycles for 100 * rep stosd up
1699    kCycles for 100 * movaps xmm0 (down)
948     kCycles for 100 * movaps xmm0 (up)
982     kCycles for 100 * movaps xmm0 (unrolled)

2671    kCycles for 100 * rep stosd
1446    kCycles for 100 * HeapAlloc (*8 )
1677    kCycles for 100 * StackBuffer (with zeroing)
2607    kCycles for 100 * StackBuffer (unrolled)
1681    kCycles for 100 * movaps xmm0
1070    kCycles for 100 * rep stosd up
1678    kCycles for 100 * movaps xmm0 (down)
966     kCycles for 100 * movaps xmm0 (up)
994     kCycles for 100 * movaps xmm0 (unrolled)

18      bytes for rep stosd
103     bytes for HeapAlloc (*8 )
29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
25      bytes for movaps xmm0
17      bytes for rep stosd up
36      bytes for movaps xmm0 (down)
36      bytes for movaps xmm0 (up)
85      bytes for movaps xmm0 (unrolled)
Title: Re: Zero a stack buffer (and probe it)
Post by: Gunther on October 27, 2013, 08:29:54 PM
Results with Dave's (KeepingRealBusy) version:

Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 201/100 cycles

2356    kCycles for 100 * rep stosd
943     kCycles for 100 * HeapAlloc (*8)
853     kCycles for 100 * StackBuffer (with zeroing)
2285    kCycles for 100 * StackBuffer (unrolled)
824     kCycles for 100 * movaps xmm0
886     kCycles for 100 * rep stosd up
825     kCycles for 100 * movaps xmm0 (down)
880     kCycles for 100 * movaps xmm0 (up)
1443    kCycles for 100 * movaps xmm0 (unrolled)

2354    kCycles for 100 * rep stosd
967     kCycles for 100 * HeapAlloc (*8)
877     kCycles for 100 * StackBuffer (with zeroing)
2300    kCycles for 100 * StackBuffer (unrolled)
846     kCycles for 100 * movaps xmm0
911     kCycles for 100 * rep stosd up
842     kCycles for 100 * movaps xmm0 (down)
883     kCycles for 100 * movaps xmm0 (up)
839     kCycles for 100 * movaps xmm0 (unrolled)

2948    kCycles for 100 * rep stosd
981     kCycles for 100 * HeapAlloc (*8)
845     kCycles for 100 * StackBuffer (with zeroing)
2286    kCycles for 100 * StackBuffer (unrolled)
865     kCycles for 100 * movaps xmm0
868     kCycles for 100 * rep stosd up
828     kCycles for 100 * movaps xmm0 (down)
844     kCycles for 100 * movaps xmm0 (up)
865     kCycles for 100 * movaps xmm0 (unrolled)

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
25      bytes for movaps xmm0
17      bytes for rep stosd up
36      bytes for movaps xmm0 (down)
36      bytes for movaps xmm0 (up)
85      bytes for movaps xmm0 (unrolled)

--- ok ---

Gunther
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 27, 2013, 09:14:23 PM
The modifications were to move the "constant" initializations out of the REPEAT loops

Dave,
That defeats the purpose of these loops to simulate a complete HeapAlloc/.../HeapFree sequence...

Dave (the dedn),
Your algo is now included in the testbed below. I have improved it so dramatically that you are now morally obliged to donate it to MasmBasic's StackBuffer() :icon_mrgreen:

        push edi
        push ebp
        mov ebp, esp
        mov edi, esp
        mov ecx, bufsize     ; to be replaced with immediate, global, local, reg etc
        sub edi, ecx
        and edi, -64         ; aligns buffer to a cache line
        ASSUME FS:Nothing
        .repeat
                push eax     ; tickle the guard page - limit might be 4k lower now
                mov esp, fs:[8]
        .until edi>=esp      ; loop until we've got enough
        ASSUME FS:ERROR
        mov esp, edi         ; new stack
        add ecx, 3           ; bufsize might be badly aligned
        shr ecx, 2           ; stosD
        xor eax, eax
        rep stosd
        mov eax, esp         ; retval for macro
        ; ... use the buffer ...
        leave
        pop edi

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles

4743    kCycles for 100 * rep stosd
2232    kCycles for 100 * HeapAlloc (*8)
2201    kCycles for 100 * StackBuffer (with zeroing)
2208    kCycles for 100 * StackBuffer (unrolled)
1749    kCycles for 100 * dedndave
1746    kCycles for 100 * rep stosd up

4725    kCycles for 100 * rep stosd
1846    kCycles for 100 * HeapAlloc (*8)
2205    kCycles for 100 * StackBuffer (with zeroing)
2204    kCycles for 100 * StackBuffer (unrolled)
1747    kCycles for 100 * dedndave
1746    kCycles for 100 * rep stosd up

4726    kCycles for 100 * rep stosd
1850    kCycles for 100 * HeapAlloc (*8)
2203    kCycles for 100 * StackBuffer (with zeroing)
2203    kCycles for 100 * StackBuffer (unrolled)
1747    kCycles for 100 * dedndave
1746    kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
41      bytes for dedndave   <<<<<<<<<<<<<<<< BLOATWARE ALARM!!!
17      bytes for rep stosd up
Title: Re: Zero a stack buffer (and probe it)
Post by: Gunther on October 27, 2013, 09:36:09 PM
Okay Jochen, here we go again:
Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 268/100 cycles

2404    kCycles for 100 * rep stosd
1003    kCycles for 100 * HeapAlloc (*8)
887     kCycles for 100 * StackBuffer (with zeroing)
905     kCycles for 100 * StackBuffer (unrolled)
831     kCycles for 100 * dedndave
872     kCycles for 100 * rep stosd up

2342    kCycles for 100 * rep stosd
982     kCycles for 100 * HeapAlloc (*8)
859     kCycles for 100 * StackBuffer (with zeroing)
835     kCycles for 100 * StackBuffer (unrolled)
927     kCycles for 100 * dedndave
897     kCycles for 100 * rep stosd up

2346    kCycles for 100 * rep stosd
965     kCycles for 100 * HeapAlloc (*8)
838     kCycles for 100 * StackBuffer (with zeroing)
835     kCycles for 100 * StackBuffer (unrolled)
828     kCycles for 100 * dedndave
873     kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
41      bytes for dedndave
17      bytes for rep stosd up

--- ok ---

Gunther
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 27, 2013, 09:55:59 PM
Thanks, Gunther. The dedndave algo is a clear winner, as expected. Almost ten times as fast as HeapAlloc is a good result :biggrin:
Title: Re: Zero a stack buffer (and probe it)
Post by: Gunther on October 27, 2013, 10:33:42 PM
Jochen,

... The dedndave algo is a clear winner, as expected. Almost ten times as fast as HeapAlloc is a good result :biggrin:

that's evident. Congrats Dave.  :t

Gunther
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 27, 2013, 11:18:21 PM
version 3 (DD) prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 253/100 cycles

5243    kCycles for 100 * rep stosd
4087    kCycles for 100 * HeapAlloc (*8 )
2834    kCycles for 100 * StackBuffer (with zeroing)
2834    kCycles for 100 * StackBuffer (unrolled)
2905    kCycles for 100 * dedndave
2861    kCycles for 100 * rep stosd up

5121    kCycles for 100 * rep stosd
3114    kCycles for 100 * HeapAlloc (*8 )
2861    kCycles for 100 * StackBuffer (with zeroing)
2842    kCycles for 100 * StackBuffer (unrolled)
2855    kCycles for 100 * dedndave
2873    kCycles for 100 * rep stosd up

5157    kCycles for 100 * rep stosd
3039    kCycles for 100 * HeapAlloc (*8 )
2893    kCycles for 100 * StackBuffer (with zeroing)
2867    kCycles for 100 * StackBuffer (unrolled)
2849    kCycles for 100 * dedndave
2845    kCycles for 100 * rep stosd up


a few words of caution - that apply to all algos...
be sure you leave some space for the OS (stay well under the stack reserve)
try not to REP STOSD with ECX = 0   :lol:
i didn't test for that in my algo, but it could easily be added
Code: [Select]
    shr     ecx,2
    .if !ZERO?
        rep     stosd
    .endif
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 27, 2013, 11:30:34 PM
oh - and REP STOSD may still not be the fastest way to 0 the memory - that's another test, really

we still have an issue to deal with, as far as timing tests:
once the stack space has been committed the first time, it's already committed on any successive pass
if you really want to know how many cycles it takes, you have to force the OS to "un-commit" before the next timing pass
i think using HeapAlloc on a large block can do that

it's probably best to separate the 2 operations and optimize each
Title: Re: Zero a stack buffer (and probe it)
Post by: Gunther on October 28, 2013, 12:09:43 AM
Dave,

it's probably best to separate the 2 operations and optimize each

yes, it seems to me that this is true. But that's probably another story and another test.

Gunther
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 28, 2013, 12:26:26 AM
oh - and REP STOSD may still not be the fastest way to 0 the memory
It is, it is, at least for large buffers and for most CPUs - that's pretty obvious

Quote
we still have an issue to deal with, as far as timing tests:
once the stack space has been committed the first time, it's already committed on any successive pass
if you really want to know how many cycles it takes, you have to force the OS to "un-commit" before the next timing pass

Using StackBuffer will happen somewhere between "proc" and "endp". There are two extreme cases:
1. You use this proc once - then the handful of nanoseconds lost in committing will not matter.
2. You use this proc a Million times - then you will not want the OS to uncommit and re-commit that stack space every time you call the proc.

So in effect the timings are extremely valid as they are...

Quote
try not to REP STOSD with ECX = 0   :lol:
See source:
   add ecx, 3   ; bufsize might be badly aligned
   shr ecx, 2   ; stosD
   xor eax, eax
   rep stosd


For ecx=0, rep stosd does absolutely nothing... caution, though, passing negative buffer sizes might result in unexpected behaviour :eusa_naughty:
Title: Re: Zero a stack buffer (and probe it)
Post by: Siekmanski on October 28, 2013, 12:55:56 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 215/100 cycles

2777    kCycles for 100 * rep stosd
1158    kCycles for 100 * HeapAlloc (*8)
1040    kCycles for 100 * StackBuffer (with zeroing)
1038    kCycles for 100 * StackBuffer (unrolled)
1090    kCycles for 100 * dedndave
1104    kCycles for 100 * rep stosd up

2770    kCycles for 100 * rep stosd
1144    kCycles for 100 * HeapAlloc (*8)
1065    kCycles for 100 * StackBuffer (with zeroing)
1047    kCycles for 100 * StackBuffer (unrolled)
1252    kCycles for 100 * dedndave
1069    kCycles for 100 * rep stosd up

2617    kCycles for 100 * rep stosd
1086    kCycles for 100 * HeapAlloc (*8)
981     kCycles for 100 * StackBuffer (with zeroing)
993     kCycles for 100 * StackBuffer (unrolled)
1037    kCycles for 100 * dedndave
1044    kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
41      bytes for dedndave
17      bytes for rep stosd up
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 28, 2013, 01:02:05 AM
For ecx=0, rep stosd does absolutely nothing...

oh - that's good   :P
as i recall, on an 8088, CX = 0 would do 64 KB
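
The ecx=0 case discussed above is easy to check in isolation. The following is a minimal sketch, assuming the MASM32 SDK is installed in its default location (masm32rt.inc and the print/str$ macros come from there); the label buf and the sentinel value 12345 are made up for illustration. As far as the 8086 documentation describes it, REP tests the count before the first iteration there as well, so the 64 KB wrap-around is a property of LOOP rather than of REP.

```asm
include \masm32\include\masm32rt.inc

.data
buf     dd 4 dup(12345)         ; hypothetical sentinel values that must survive

.code
start:
        mov edi, offset buf
        mov esi, edi            ; remember the start address
        xor eax, eax            ; value to store
        xor ecx, ecx            ; count = 0
        rep stosd               ; zero iterations: nothing is written, edi stays put
        sub edi, esi            ; number of bytes that were written
        print "bytes written: "
        print str$(edi), 13, 10
        print "buf[0] is still: "
        print str$(buf), 13, 10
        exit
end start
```

With a count of zero this should report 0 bytes written and leave the sentinel at 12345, which is why the .if !ZERO? guard is a matter of taste rather than a necessity.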
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 28, 2013, 01:43:15 AM
Thanks to everybody for testing :icon14:

New version:

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles

512000 bytes:
23582   kCycles for 100 * rep stosd
10726   kCycles for 100 * HeapAlloc
8653    kCycles for 100 * StackBuffer (with zeroing)
8653    kCycles for 100 * dedndave
8627    kCycles for 100 * rep stosd up (no probing)


To my embarrassment, it seems the bufsize/8 disappeared somewhere, so the HeapAlloc values are for the full buffer size. And they are close to the others.
Try changing line 5: bufsize=102400*6 - you are in for a virtual surprise ;)
Title: Re: Zero a stack buffer (and probe it)
Post by: nidud on October 28, 2013, 01:45:46 AM
here is a thought...

Code: [Select]
bufsize=102400

StackBuffer proc
local buffer[bufsize]:byte
; ... use the buffer ...
ret
StackBuffer endp

link /stack:102400,102400 ...

 :biggrin:
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 28, 2013, 01:57:54 AM
here is a thought... link /stack:102400,102400

A valid thought but
a) it requires more discipline with linker settings, environment variables etc
b) that local buffer is still full of garbage and
c) it is not aligned for use with SSE2
 ;)
Title: Re: Zero a stack buffer (and probe it)
Post by: Gunther on October 28, 2013, 02:18:17 AM
Jochen,

StackBuffer3.exe comes up with that result:

Code: [Select]
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 193/100 cycles

512000 bytes:
12600   kCycles for 100 * rep stosd
4696    kCycles for 100 * HeapAlloc
4672    kCycles for 100 * StackBuffer (with zeroing)
4538    kCycles for 100 * dedndave
4593    kCycles for 100 * rep stosd up (no probing)

12567   kCycles for 100 * rep stosd
5334    kCycles for 100 * HeapAlloc
5236    kCycles for 100 * StackBuffer (with zeroing)
4937    kCycles for 100 * dedndave
4692    kCycles for 100 * rep stosd up (no probing)

12518   kCycles for 100 * rep stosd
4685    kCycles for 100 * HeapAlloc
4674    kCycles for 100 * StackBuffer (with zeroing)
5286    kCycles for 100 * dedndave
4666    kCycles for 100 * rep stosd up (no probing)

18      bytes for rep stosd
103     bytes for HeapAlloc
54      bytes for StackBuffer (with zeroing)
41      bytes for dedndave
17      bytes for rep stosd up (no probing)

--- ok ---

Gunther
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 28, 2013, 02:32:04 AM
prescott w/htt
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 261/100 cycles

512000 bytes:
26444   kCycles for 100 * rep stosd
22020   kCycles for 100 * HeapAlloc
16433   kCycles for 100 * StackBuffer (with zeroing)
15254   kCycles for 100 * dedndave
15176   kCycles for 100 * rep stosd up (no probing)

26155   kCycles for 100 * rep stosd
17181   kCycles for 100 * HeapAlloc
16086   kCycles for 100 * StackBuffer (with zeroing)
15254   kCycles for 100 * dedndave
15979   kCycles for 100 * rep stosd up (no probing)

26160   kCycles for 100 * rep stosd
17103   kCycles for 100 * HeapAlloc
16196   kCycles for 100 * StackBuffer (with zeroing)
15333   kCycles for 100 * dedndave
15132   kCycles for 100 * rep stosd up (no probing)

--- ok ---

loop overhead is approx. 254/100 cycles

512000 bytes:
26153   kCycles for 100 * rep stosd
22074   kCycles for 100 * HeapAlloc
16154   kCycles for 100 * StackBuffer (with zeroing)
15852   kCycles for 100 * dedndave
15254   kCycles for 100 * rep stosd up (no probing)

26087   kCycles for 100 * rep stosd
16510   kCycles for 100 * HeapAlloc
16647   kCycles for 100 * StackBuffer (with zeroing)
15258   kCycles for 100 * dedndave
15187   kCycles for 100 * rep stosd up (no probing)

26145   kCycles for 100 * rep stosd
16325   kCycles for 100 * HeapAlloc
16303   kCycles for 100 * StackBuffer (with zeroing)
15257   kCycles for 100 * dedndave
15032   kCycles for 100 * rep stosd up (no probing)
Title: Re: Zero a stack buffer (and probe it)
Post by: Siekmanski on October 28, 2013, 04:03:55 AM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 542/100 cycles

512000 bytes:
14386   kCycles for 100 * rep stosd
5242    kCycles for 100 * HeapAlloc
5160    kCycles for 100 * StackBuffer (with zeroing)
5182    kCycles for 100 * dedndave
5283    kCycles for 100 * rep stosd up (no probing)

14350   kCycles for 100 * rep stosd
5261    kCycles for 100 * HeapAlloc
5202    kCycles for 100 * StackBuffer (with zeroing)
5187    kCycles for 100 * dedndave
5258    kCycles for 100 * rep stosd up (no probing)

14356   kCycles for 100 * rep stosd
5276    kCycles for 100 * HeapAlloc
5216    kCycles for 100 * StackBuffer (with zeroing)
5154    kCycles for 100 * dedndave
5289    kCycles for 100 * rep stosd up (no probing)

18      bytes for rep stosd
103     bytes for HeapAlloc
54      bytes for StackBuffer (with zeroing)
41      bytes for dedndave
17      bytes for rep stosd up (no probing)
Title: Re: Zero a stack buffer (and probe it)
Post by: nidud on October 28, 2013, 04:48:24 AM
A valid thought but
a) it requires more discipline with linker settings, environment variables etc
b) that local buffer is still full of garbage and
c) it is not aligned for use with SSE2
 ;)

 :biggrin:

Code: [Select]
bufsize=102400

StackBuffer proc
local buf[bufsize+16]:byte
local buffer:dword
lea eax,buf
and al,0F0h
add eax,16
mov buffer,eax
mov edi,eax
sub eax,eax
mov ecx,bufsize/4 ; rep stosd counts dwords, not bytes
rep stosd
; ... use the buffer ...
ret
StackBuffer endp

if (%1) == () goto probe
link /stack:%1,%1 ...
goto end
:probe
makeit 102416
:end

so, there you go  :lol:
Title: Re: Zero a stack buffer (and have fun)
Post by: jj2007 on October 28, 2013, 06:24:57 AM
SbTestJ proc uses esi MySize
  mov esi, StackBuffer(MySize)   ; works like a charm, no linker options needed
  ; ... use the buffer ...
  StackBuffer()
  ret
SbTestJ endp

SbTestN proc uses esi MySize
local buf[MySize+16]:byte   ; error A2026: constant expected
local buffer:dword
  lea eax,buf
  and al,0F0h
  add eax,16
  mov buffer,eax
  mov edi,eax
  sub eax,eax
  mov ecx,bufsize/4   ; rep stosd counts dwords, not bytes
  rep stosd
  ; ... use the buffer ...
  ret
SbTestN endp

 ;)
Title: Re: Zero a stack buffer (and probe it)
Post by: nidud on October 28, 2013, 07:58:57 AM
If you plan on calling this macro frequently it may be better to set the stack one time, either using a link-switch or a function to avoid the probing. The stack could then be used by the alloc function if that was the intended usage.

Code: [Select]
new_stack proc stklen:dword
mov eax,esp
mov edx,eax
mov ecx,stklen
sub eax,ecx
ASSUME FS:NOTHING
.if eax < fs:[8]
    shr ecx,2
    .repeat
        push eax
    .untilcxz
.endif
ASSUME FS:ERROR
mov esp,edx
ret
new_stack endp

start:
invoke new_stack,bufsize
...

I would however prefer the switch option.
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 28, 2013, 09:01:16 AM
If you plan on calling this macro frequently it may be better to set the stack one time

No need for doing that. Dave's fs:[8] loop is extremely clever - if the stack is already committed, it costs just 1 or 2 cycles.
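
One way to see why the loop is cheap on an already committed stack is to count its iterations on two consecutive calls. This is only a sketch under assumptions: it needs the MASM32 SDK (masm32rt.inc), ProbeCount is a hypothetical helper that is not part of the testbed, and bufsize is set to the 102400 bytes used earlier in the thread. It counts loop iterations, not cycles.

```asm
include \masm32\include\masm32rt.inc

bufsize = 102400                ; assumption: same size as in the testbed

.code
ProbeCount proc                 ; hypothetical helper: returns loop iterations in eax
        push edi
        push ebp
        mov ebp, esp            ; frame pointer, so LEAVE can restore esp
        mov edi, esp
        sub edi, bufsize
        and edi, -64            ; target address, aligned to a cache line
        xor edx, edx            ; iteration counter
        ASSUME FS:Nothing
        .repeat
                push eax        ; tickle the guard page
                inc edx
                mov esp, fs:[8] ; current stack limit from the TIB
        .until edi>=esp         ; done once the limit is at or below the target
        ASSUME FS:ERROR
        mov eax, edx
        leave                   ; esp = ebp, then pop ebp
        pop edi
        ret
ProbeCount endp

start:
        call ProbeCount         ; first call may have to commit pages
        mov esi, eax
        call ProbeCount         ; second call finds them committed
        mov edi, eax
        print "first call:  "
        print str$(esi), 13, 10
        print "second call: "
        print str$(edi), 13, 10
        exit
end start
```

The first number depends on how much stack the OS committed at startup. The second call should need a single iteration, because fs:[8] is already at or below the target address when the loop condition is tested.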
Title: Re: Zero a stack buffer (and probe it)
Post by: Farabi on October 28, 2013, 12:49:12 PM
I'm curious, movups is slower than conventional instructions, but it was for double data, right? Push edx is for 4 bytes, what about movups? I think it was 8 to 16 bytes; if the data speed clock was 22, it should be divided by 2 or 4 to know the byte transfer rate.
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 28, 2013, 07:03:06 PM
Just stumbled over an oddity with HeapAlloc: It gets very, very slow for a small range of bytes requested (Win7-32):

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles

512000 bytes:
19585   kCycles for 100 * rep stosd <<< the reference

512000 bytes:
19662   kCycles for 100 * HeapAlloc <<< so far so good
519168 bytes:
131619  kCycles for 100 * HeapAlloc <<< oops
520192 bytes:
915     kCycles for 100 * HeapAlloc <<< VirtualAlloc kicks in

19565   kCycles for 100 * rep stosd

512000 bytes:
19667   kCycles for 100 * HeapAlloc
519168 bytes:
133256  kCycles for 100 * HeapAlloc
520192 bytes:
932     kCycles for 100 * HeapAlloc
Title: Re: Zero a stack buffer (and probe it)
Post by: Siekmanski on October 28, 2013, 10:58:06 PM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 575/100 cycles

512000 bytes:
14335   kCycles for 100 * rep stosd


512000 bytes:
5293    kCycles for 100 * HeapAlloc
519168 bytes:
5381    kCycles for 100 * HeapAlloc
520192 bytes:
1314    kCycles for 100 * HeapAlloc

14333   kCycles for 100 * rep stosd


512000 bytes:
5311    kCycles for 100 * HeapAlloc
519168 bytes:
5340    kCycles for 100 * HeapAlloc
520192 bytes:
1314    kCycles for 100 * HeapAlloc

18      bytes for rep stosd
104     bytes for HeapAlloc
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 28, 2013, 11:07:19 PM
XP MCE2005 SP3
Code: [Select]
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 427/100 cycles

512000 bytes:
34190   kCycles for 100 * rep stosd

512000 bytes:
28714   kCycles for 100 * HeapAlloc
519168 bytes:
23241   kCycles for 100 * HeapAlloc
520192 bytes:
2671    kCycles for 100 * HeapAlloc

34132   kCycles for 100 * rep stosd

512000 bytes:
23103   kCycles for 100 * HeapAlloc
519168 bytes:
23419   kCycles for 100 * HeapAlloc
520192 bytes:
2635    kCycles for 100 * HeapAlloc
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 28, 2013, 11:22:36 PM
Thanks, Marinus & Dave :icon14:
The switch to VirtualAlloc is there but not the slowdown shortly below. Could be Win-7 only, or some special feature of my machine ::)
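
For anyone who wants to try the three sizes outside the testbed, a stripped-down single-pass sketch could look like the one below. It assumes the MASM32 SDK (masm32rt.inc), uses the process heap with HEAP_ZERO_MEMORY, and times one HeapAlloc/HeapFree pair per size with rdtsc, so the figures will be much noisier than the 100-loop averages above and include first-touch page faults. The array name sizes is made up. Note that 520192 is 7F000h, which seems to fit Microsoft's statement that a 32-bit heap hands blocks above "slightly less than 512 KB" to VirtualAlloc.

```asm
include \masm32\include\masm32rt.inc
.586                            ; rdtsc needs at least .586

.data
sizes   dd 512000, 519168, 520192       ; the three sizes from the post above

.code
start:
        invoke GetProcessHeap
        mov edi, eax            ; heap handle
        xor ebx, ebx            ; index into sizes
        .repeat
                rdtsc
                mov esi, eax    ; start time (low dword is enough here)
                invoke HeapAlloc, edi, HEAP_ZERO_MEMORY, sizes[ebx*4]
                invoke HeapFree, edi, 0, eax
                rdtsc
                sub eax, esi    ; elapsed cycles for one alloc/free pair
                mov esi, eax
                mov eax, sizes[ebx*4]
                print str$(eax), " bytes: "
                print str$(esi), " cycles", 13, 10
                inc ebx
        .until ebx==3
        exit
end start
```

Whether the slow spot just below the threshold shows up will depend on the Windows version and the machine, as the results above already suggest.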
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 28, 2013, 11:28:18 PM
see if you can test a single pass
HeapAlloc may not like being in a x100 loop   :P
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 28, 2013, 11:41:10 PM
see if you can test a single pass
HeapAlloc may not like being in a x100 loop   :P
It's quite happy to be in that loop for everything below 500k*1.01 and above 500k*1.016... and switching to e.g. 5 loops doesn't change the pattern. Weird.
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 29, 2013, 01:28:57 AM
that makes me wonder if there are other "holes" in the number line
and - is it specific to your hardware in some way
say, if you had more memory - would it act differently
Title: Re: Zero a stack buffer (and probe it)
Post by: nidud on October 29, 2013, 01:54:55 AM
If you plan on calling this macro frequently it may be better to set the stack one time

No need for doing that. Dave's fs:[8] loop is extremely clever - if the stack is already committed, it costs just 1 or 2 cycles.

And each time you commit an extra page will add how many cycles?

If the stack is already committed you only have to do the rep stosd thing, so the next (speed) test will be unfair compared to the first one.
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 29, 2013, 02:09:53 AM
i think this is a case where you have to apply some common sense
how is the heap/stack space going to be used in a typical program ?
i can't think of too many cases where you actually waffle back and forth between allocating heap memory and committing stack space
if you do, you should probably re-think your design   :P

however, it would still be interesting to see how fast the OS can commit under different conditions
for example: allocate a large heap block, then commit some stack space

from what i have seen (with no heap allocated), the commit loop seems pretty fast
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 29, 2013, 02:31:33 AM
ok, well it doesn't seem to be as fast as i thought
or maybe it's just hard to properly measure something with 1 pass   :P

Code:
11629 Clock cycles per page
EDIT: a more accurate version - results about the same - lol
Title: Re: Zero a stack buffer (and probe it)
Post by: nidud on October 29, 2013, 02:45:34 AM
One way to find out is to commit one page for each loop iteration,
allocate one page using HeapAlloc for each iteration without deleting it,
and then compare the result
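
The comparison proposed above could be set up roughly as follows. This is only a sketch under stated assumptions: it assumes the MASM32 SDK (masm32rt.inc with its print, str$ and exit macros), and PAGES, PAGESIZE, tStack and tHeap are hypothetical names. Timing with a bare rdtsc is rough, and the first stack iteration or two may hit pages that are already committed.

```
; Sketch only: cycles per page for stack commit vs. HeapAlloc (never freed).
include \masm32\include\masm32rt.inc
.686

PAGES    equ 64                    ; hypothetical page count, 64 * 4096 = 256 kB
PAGESIZE equ 4096

.data?
tStack  dd ?                       ; cycles for the stack loop
tHeap   dd ?                       ; cycles for the HeapAlloc loop

.code
start:
        ; --- touch one stack page per iteration ---
        mov     edi, esp           ; remember the stack pointer
        rdtsc
        mov     esi, eax           ; start count (the low dword is enough here)
        mov     ecx, PAGES
@@:     sub     esp, PAGESIZE
        mov     dword ptr [esp], 0 ; touching the guard page makes the OS commit it
        dec     ecx
        jnz     @B
        rdtsc
        sub     eax, esi
        mov     esp, edi           ; give the stack space back
        mov     tStack, eax

        ; --- allocate one page per iteration, without freeing ---
        invoke  GetProcessHeap
        mov     ebx, eax           ; heap handle
        rdtsc
        mov     esi, eax
        mov     edi, PAGES
@@:     invoke  HeapAlloc, ebx, 0, PAGESIZE
        mov     dword ptr [eax], 0 ; touch the block too (no NULL check in this sketch)
        dec     edi
        jnz     @B
        rdtsc
        sub     eax, esi
        mov     tHeap, eax

        ; --- report the averages ---
        mov     eax, tStack
        xor     edx, edx
        mov     ecx, PAGES
        div     ecx
        print   str$(eax), " cycles per page (stack commit)", 13, 10
        mov     eax, tHeap
        xor     edx, edx
        mov     ecx, PAGES
        div     ecx
        print   str$(eax), " cycles per page (HeapAlloc)", 13, 10
        exit
end start
```

Because committed stack pages stay committed, only the first run of the stack loop in a given thread measures the commit cost; repeating it in the same process would just measure the memory writes.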
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 29, 2013, 02:50:19 AM
well - that seems counter-intuitive

if you can allocate all available with HeapAlloc, it *should* reset the commit
Title: Re: Zero a stack buffer (and probe it)
Post by: nidud on October 29, 2013, 03:11:15 AM
well, I was thinking of doing the stack thing first and time it
and then the alloc thing separately, not combined  :P
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 29, 2013, 03:30:12 AM
i guess it doesn't matter - my way of thinking was wrong, of course

it seems that, once the space has been committed, it stays committed
it simply gets swapped out to the paging file if you try to HeapAlloc(nMaxBytes)

it might work if you create a thread to commit and release stack space, then terminate the thread

EDIT: we really aren't interested in measuring swaps between memory and the page file   :lol:
Title: Re: Zero a stack buffer (and probe it)
Post by: nidud on October 29, 2013, 04:00:41 AM
Ok, I downloaded the test  :biggrin:

So, if you multiply the result by the number of pages committed and compare it to the /STACK: option, you will get a fair estimate of how many cycles you saved. However, once the stack is committed (one way or the other) it will be available as a substitute for HeapAlloc(), and that will save both code space and cycles.
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 29, 2013, 04:27:30 AM
Quote from: nidud
...However, once the stack is committed (one way or the other) it will be available
as a substitute for HeapAlloc(), and that will save both code space and cycles.

i think that's how you have to look at it, too

btw - i see Mark has a relatively new tool (new version, at least) - called VMMap

http://technet.microsoft.com/en-us/sysinternals/dd535533.aspx

i have to do some reading to interpret what it's showing me   :P
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 29, 2013, 10:28:18 AM
Quote from: dedndave
it seems that, once the space has been committed, it stays committed

This is also my interpretation.
IMHO a StackBuffer() macro is best for repeatedly used small local buffers that are bigger than the 2 pages you can have without probing, and smaller than the range of bytes where HeapAlloc becomes competitive. Another advantage is that it avoids heap fragmentation.

Attached a new testbed with sizes 2k ... 512k. Feel free to modify - no MasmBasic needed ;-)

Code:
Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles

2048 bytes:
103705  cycles for 100 * HeapAlloc
28460   cycles for 100 * StackBuffer (xmm)
48535   cycles for 100 * StackBuffer (rep stosd)
47423   cycles for 100 * dedndave

103956  cycles for 100 * HeapAlloc
28460   cycles for 100 * StackBuffer (xmm)
48544   cycles for 100 * StackBuffer (rep stosd)
47424   cycles for 100 * dedndave

8192 bytes:
185329  cycles for 100 * HeapAlloc
105525  cycles for 100 * StackBuffer (xmm)
128172  cycles for 100 * StackBuffer (rep stosd)
127549  cycles for 100 * dedndave

184025  cycles for 100 * HeapAlloc
105314  cycles for 100 * StackBuffer (xmm)
128170  cycles for 100 * StackBuffer (rep stosd)
127050  cycles for 100 * dedndave

32768 bytes:
547     kCycles for 100 * HeapAlloc
438     kCycles for 100 * StackBuffer (xmm)
440     kCycles for 100 * StackBuffer (rep stosd)
439     kCycles for 100 * dedndave

548     kCycles for 100 * HeapAlloc
438     kCycles for 100 * StackBuffer (xmm)
444     kCycles for 100 * StackBuffer (rep stosd)
437     kCycles for 100 * dedndave

131072 bytes:
2810    kCycles for 100 * HeapAlloc
2808    kCycles for 100 * StackBuffer (xmm)
2230    kCycles for 100 * StackBuffer (rep stosd)
2222    kCycles for 100 * dedndave

2319    kCycles for 100 * HeapAlloc
2808    kCycles for 100 * StackBuffer (xmm)
2225    kCycles for 100 * StackBuffer (rep stosd)
2224    kCycles for 100 * dedndave

524288 bytes:
742     kCycles for 100 * HeapAlloc
12067   kCycles for 100 * StackBuffer (xmm)
8928    kCycles for 100 * StackBuffer (rep stosd)
8977    kCycles for 100 * dedndave

751     kCycles for 100 * HeapAlloc
12305   kCycles for 100 * StackBuffer (xmm)
8921    kCycles for 100 * StackBuffer (rep stosd)
8920    kCycles for 100 * dedndave

104     bytes for HeapAlloc
9       bytes for StackBuffer (xmm)
8       bytes for StackBuffer (rep stosd)
42      bytes for dedndave

33      bytes for MbStackB
16      bytes for MbStackX
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 29, 2013, 10:51:16 AM
prescott w/htt xp mce2005 sp3
Code:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 246/100 cycles

2048 bytes:
230072  cycles for 100 * HeapAlloc
59433   cycles for 100 * StackBuffer (xmm)
71350   cycles for 100 * StackBuffer (rep stosd)
71069   cycles for 100 * dedndave

229937  cycles for 100 * HeapAlloc
60252   cycles for 100 * StackBuffer (xmm)
71518   cycles for 100 * StackBuffer (rep stosd)
70532   cycles for 100 * dedndave

8192 bytes:
447586  cycles for 100 * HeapAlloc
233879  cycles for 100 * StackBuffer (xmm)
245398  cycles for 100 * StackBuffer (rep stosd)
245187  cycles for 100 * dedndave

447288  cycles for 100 * HeapAlloc
234628  cycles for 100 * StackBuffer (xmm)
246152  cycles for 100 * StackBuffer (rep stosd)
244349  cycles for 100 * dedndave

32768 bytes:
1304    kCycles for 100 * HeapAlloc
945     kCycles for 100 * StackBuffer (xmm)
925     kCycles for 100 * StackBuffer (rep stosd)
948     kCycles for 100 * dedndave

1274    kCycles for 100 * HeapAlloc
913     kCycles for 100 * StackBuffer (xmm)
932     kCycles for 100 * StackBuffer (rep stosd)
924     kCycles for 100 * dedndave

131072 bytes:
5997    kCycles for 100 * HeapAlloc
3639    kCycles for 100 * StackBuffer (xmm)
3682    kCycles for 100 * StackBuffer (rep stosd)
3704    kCycles for 100 * dedndave

4671    kCycles for 100 * HeapAlloc
3663    kCycles for 100 * StackBuffer (xmm)
3654    kCycles for 100 * StackBuffer (rep stosd)
3651    kCycles for 100 * dedndave

524288 bytes:
2084    kCycles for 100 * HeapAlloc
14688   kCycles for 100 * StackBuffer (xmm)
14708   kCycles for 100 * StackBuffer (rep stosd)
14847   kCycles for 100 * dedndave

2091    kCycles for 100 * HeapAlloc
14649   kCycles for 100 * StackBuffer (xmm)
14950   kCycles for 100 * StackBuffer (rep stosd)
16294   kCycles for 100 * dedndave
Title: Re: Zero a stack buffer (and probe it)
Post by: Gunther on October 30, 2013, 01:19:47 AM
StackBuffer6 gives:

Code:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 239/100 cycles

2048 bytes:
71208   cycles for 100 * HeapAlloc
20078   cycles for 100 * StackBuffer (xmm)
40880   cycles for 100 * StackBuffer (rep stosd)
40775   cycles for 100 * dedndave

68901   cycles for 100 * HeapAlloc
47258   cycles for 100 * StackBuffer (xmm)
40798   cycles for 100 * StackBuffer (rep stosd)
40362   cycles for 100 * dedndave

8192 bytes:
111836  cycles for 100 * HeapAlloc
129559  cycles for 100 * StackBuffer (xmm)
122825  cycles for 100 * StackBuffer (rep stosd)
122018  cycles for 100 * dedndave

104378  cycles for 100 * HeapAlloc
55644   cycles for 100 * StackBuffer (xmm)
122728  cycles for 100 * StackBuffer (rep stosd)
122127  cycles for 100 * dedndave

32768 bytes:
262658  cycles for 100 * HeapAlloc
209534  cycles for 100 * StackBuffer (xmm)
196808  cycles for 100 * StackBuffer (rep stosd)
187779  cycles for 100 * dedndave

604     kCycles for 100 * HeapAlloc
477     kCycles for 100 * StackBuffer (xmm)
445     kCycles for 100 * StackBuffer (rep stosd)
443     kCycles for 100 * dedndave

131072 bytes:
1203    kCycles for 100 * HeapAlloc
1090    kCycles for 100 * StackBuffer (xmm)
1178    kCycles for 100 * StackBuffer (rep stosd)
1197    kCycles for 100 * dedndave

1843    kCycles for 100 * HeapAlloc
1108    kCycles for 100 * StackBuffer (xmm)
1159    kCycles for 100 * StackBuffer (rep stosd)
1195    kCycles for 100 * dedndave

524288 bytes:
586     kCycles for 100 * HeapAlloc
6736    kCycles for 100 * StackBuffer (xmm)
5396    kCycles for 100 * StackBuffer (rep stosd)
4757    kCycles for 100 * dedndave

591     kCycles for 100 * HeapAlloc
6740    kCycles for 100 * StackBuffer (xmm)
5389    kCycles for 100 * StackBuffer (rep stosd)
4787    kCycles for 100 * dedndave

104     bytes for HeapAlloc
9       bytes for StackBuffer (xmm)
8       bytes for StackBuffer (rep stosd)
42      bytes for dedndave

33      bytes for MbStackB
16      bytes for MbStackX

--- ok ---

Gunther
Title: Re: Zero a stack buffer (and probe it)
Post by: jj2007 on October 30, 2013, 11:16:13 AM
OK, thanks to everybody :icon14:

The new StackBuffer() is now implemented in MasmBasic of 30 Oct (more (http://masm32.com/board/index.php?topic=94.msg26592#msg26592)). In the end, rep stosd won the race. Usage examples:

          mov sbuf1, StackBuffer(4000h)       ; buffer is 16-byte aligned for use with SSE2
          invoke GetFileSize, hFile, 0        ; you may use a register to specify the buffer size
          mov sbuf2, StackBuffer(eax, nz)     ; option nz means "no zeroing" - much faster, of course
          ...
          StackBuffer()                       ; release all buffers (sb without args = free the buffer)

The nz option only probes the stack and zeroes the last two bytes of the buffer, plus two bytes beyond the buffer. This allows loading e.g. a text file into the buffer while being sure that the end is zero-delimited.
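
The difference between the zeroing default and nz can be sketched as follows. This is only an illustration of the idea, not the MasmBasic source: it assumes esp already points to a probed buffer, that the hypothetical constant BUFSIZE is a multiple of 4, and that the reservation includes at least two spare bytes above the buffer.

```
; Sketch only, not the MasmBasic source. esp = start of the probed buffer.
; BUFSIZE is a hypothetical constant (multiple of 4); at least two spare
; bytes are assumed to be reserved above the buffer.

; default: zero the whole buffer with rep stosd
        mov     edx, edi                    ; keep edi in a scratch register
        mov     edi, esp                    ; edi = start of the buffer
        mov     ecx, BUFSIZE/4              ; number of dwords to clear
        xor     eax, eax
        cld                                 ; normally already clear under Windows calling conventions
        rep     stosd
        mov     edi, edx                    ; restore edi

; nz: skip the fill, clear only the tail
        mov     dword ptr [esp+BUFSIZE-2], 0 ; last two bytes plus two bytes beyond
```

The single dword store covers offsets BUFSIZE-2 to BUFSIZE+1, so text that fills the buffer completely still ends in two zero bytes.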
Title: Re: Zero a stack buffer (and probe it)
Post by: dedndave on October 30, 2013, 02:24:07 PM
 :t
Title: Re: Zero a stack buffer (and probe it)
Post by: Gunther on October 31, 2013, 02:14:56 AM
Jochen,

Quote from: jj2007
The new StackBuffer() is now implemented in MasmBasic of 30 Oct
  :t

Gunther