Zero a stack buffer (and probe it)

jj2007 · October 25, 2013, 07:31:54 PM

Spin-off from MemStrategy:

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 268/100 cycles

3778 kCycles for 100 * rep stosd
4905 kCycles for 100 * push 0
4890 kCycles for 100 * push edx
3343 kCycles for 100 * movups xmm0
3319 kCycles for 100 * movaps xmm0

3785 kCycles for 100 * rep stosd
4891 kCycles for 100 * push 0
4894 kCycles for 100 * push edx
3457 kCycles for 100 * movups xmm0
3319 kCycles for 100 * movaps xmm0

3785 kCycles for 100 * rep stosd
4891 kCycles for 100 * push 0
4896 kCycles for 100 * push edx
3342 kCycles for 100 * movups xmm0
3320 kCycles for 100 * movaps xmm0

18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
22 bytes for movups xmm0
25 bytes for movaps xmm0

Siekmanski · October 25, 2013, 07:58:50 PM

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles

2700 kCycles for 100 * rep stosd
5467 kCycles for 100 * push 0
4888 kCycles for 100 * push edx
4266 kCycles for 100 * movups xmm0
1411 kCycles for 100 * movaps xmm0

2752 kCycles for 100 * rep stosd
4887 kCycles for 100 * push 0
5651 kCycles for 100 * push edx
4262 kCycles for 100 * movups xmm0
1030 kCycles for 100 * movaps xmm0

2699 kCycles for 100 * rep stosd
4892 kCycles for 100 * push 0
4888 kCycles for 100 * push edx
4263 kCycles for 100 * movups xmm0
1744 kCycles for 100 * movaps xmm0

18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
22 bytes for movups xmm0
25 bytes for movaps xmm0

sinsi · October 25, 2013, 08:40:15 PM

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 310/100 cycles

2385 kCycles for 100 * rep stosd
4530 kCycles for 100 * push 0
4508 kCycles for 100 * push edx
3932 kCycles for 100 * movups xmm0
871 kCycles for 100 * movaps xmm0
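
To compare methods on one machine, it can help to normalize each timing against `rep stosd`. A minimal sketch in Python (not part of the test program; the helper names `parse_timings` and `speedup_vs` are made up for illustration), fed with the i7-3770K figures above:

```python
import re

# Timings copied from the i7-3770K run above.
RESULTS = """\
2385 kCycles for 100 * rep stosd
4530 kCycles for 100 * push 0
4508 kCycles for 100 * push edx
3932 kCycles for 100 * movups xmm0
871 kCycles for 100 * movaps xmm0
"""

def parse_timings(text):
    """Map each method name to its kCycles figure."""
    pattern = r"^\s*(\d+)\s+kCycles for 100 \* (.+?)\s*$"
    return {m.group(2): int(m.group(1))
            for m in re.finditer(pattern, text, re.MULTILINE)}

def speedup_vs(timings, baseline):
    """Return baseline_time / method_time; values above 1 beat the baseline."""
    base = timings[baseline]
    return {name: base / kcycles for name, kcycles in timings.items()}

if __name__ == "__main__":
    for name, factor in speedup_vs(parse_timings(RESULTS), "rep stosd").items():
        print(f"{name:12s} {factor:.2f}x")
```

On these figures, `movaps` comes out roughly 2.7 times faster than `rep stosd`, while both `push` variants take nearly twice as long.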

TWell · October 25, 2013, 09:05:27 PM

AMD Athlon(tm) II X2 220 Processor (SSE3) 2.80 GHz
loop overhead is approx. 239/100 cycles

2621 kCycles for 100 * rep stosd
4891 kCycles for 100 * push 0
4895 kCycles for 100 * push edx
1666 kCycles for 100 * movups xmm0
1605 kCycles for 100 * movaps xmm0

dedndave · October 25, 2013, 10:09:55 PM

prescott w/htt

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 248/100 cycles

4986    kCycles for 100 * rep stosd
4827    kCycles for 100 * push 0
4991    kCycles for 100 * push edx
6187    kCycles for 100 * movups xmm0
2767    kCycles for 100 * movaps xmm0

5023    kCycles for 100 * rep stosd
4857    kCycles for 100 * push 0
4935    kCycles for 100 * push edx
6207    kCycles for 100 * movups xmm0
2766    kCycles for 100 * movaps xmm0

5023    kCycles for 100 * rep stosd
4855    kCycles for 100 * push 0
4990    kCycles for 100 * push edx
6225    kCycles for 100 * movups xmm0
2765    kCycles for 100 * movaps xmm0

nidud · October 25, 2013, 10:18:42 PM

deleted

dedndave · October 25, 2013, 10:34:57 PM

prescott w/htt

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 245/100 cycles

5107    kCycles for 100 * rep stosd
4844    kCycles for 100 * push 0
4902    kCycles for 100 * push edx
6153    kCycles for 100 * movups xmm0
2827    kCycles for 100 * movaps xmm0
2815    kCycles for 100 * rep stosd

5111    kCycles for 100 * rep stosd
4873    kCycles for 100 * push 0
4887    kCycles for 100 * push edx
6150    kCycles for 100 * movups xmm0
2795    kCycles for 100 * movaps xmm0
2782    kCycles for 100 * rep stosd

5053    kCycles for 100 * rep stosd
4892    kCycles for 100 * push 0
4850    kCycles for 100 * push edx
6179    kCycles for 100 * movups xmm0
2767    kCycles for 100 * movaps xmm0
2827    kCycles for 100 * rep stosd

Gunther · October 25, 2013, 11:22:20 PM

Jochen,

Here are the results from an old computer (located in a university laboratory). The results from my machine at home will follow this evening.

AMD Athlon(tm) Dual Core Processor 5000B (SSE3)
loop overhead is approx. 239/100 cycles

3779    kCycles for 100 * rep stosd
4897    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
3344    kCycles for 100 * movups xmm0
3347    kCycles for 100 * movaps xmm0

3774    kCycles for 100 * rep stosd
4897    kCycles for 100 * push 0
4899    kCycles for 100 * push edx
3343    kCycles for 100 * movups xmm0
3341    kCycles for 100 * movaps xmm0

3778    kCycles for 100 * rep stosd
4897    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
3344    kCycles for 100 * movups xmm0
3331    kCycles for 100 * movaps xmm0

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
22      bytes for movups xmm0
25      bytes for movaps xmm0

--- ok ---

Gunther

nidud · October 25, 2013, 11:35:41 PM

deleted

FORTRANS · October 26, 2013, 12:05:07 AM

Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
loop overhead is approx. 211/100 cycles

7356 kCycles for 100 * rep stosd
4902 kCycles for 100 * push 0
4902 kCycles for 100 * push edx
3059 kCycles for 100 * movups xmm0
2312 kCycles for 100 * movaps xmm0
2207 kCycles for 100 * rep stosd

7358 kCycles for 100 * rep stosd
4905 kCycles for 100 * push 0
4897 kCycles for 100 * push edx
3064 kCycles for 100 * movups xmm0
2303 kCycles for 100 * movaps xmm0
2212 kCycles for 100 * rep stosd

7372 kCycles for 100 * rep stosd
4913 kCycles for 100 * push 0
4901 kCycles for 100 * push edx
3063 kCycles for 100 * movups xmm0
2303 kCycles for 100 * movaps xmm0
2214 kCycles for 100 * rep stosd

18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
22 bytes for movups xmm0
25 bytes for movaps xmm0
17 bytes for rep stosd

--- ok ---

jj2007 · October 26, 2013, 12:52:01 AM

Thanks to everybody!

Quote from: nidud on October 25, 2013, 10:18:42 PM
I cleaned up the rep stosd function a bit

I appreciate your good intentions, Nidud. Put it under TestA, just for fun ;)
(hint: look at this thread's title)

MichaelW · October 26, 2013, 01:43:49 AM

Northwood w/htt

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
++18 of 20 tests valid, loop overhead is approx. 309/100 cycles

4910    kCycles for 100 * rep stosd
4902    kCycles for 100 * push 0
4904    kCycles for 100 * push edx
4904    kCycles for 100 * movups xmm0
2144    kCycles for 100 * movaps xmm0

4910    kCycles for 100 * rep stosd
5130    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
4893    kCycles for 100 * movups xmm0
2140    kCycles for 100 * movaps xmm0

4911    kCycles for 100 * rep stosd
4909    kCycles for 100 * push 0
4903    kCycles for 100 * push edx
4895    kCycles for 100 * movups xmm0
2150    kCycles for 100 * movaps xmm0

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
22      bytes for movups xmm0
25      bytes for movaps xmm0

nidud · October 26, 2013, 01:47:07 AM

deleted

dedndave · October 26, 2013, 01:50:10 AM

i try to avoid STD's

in fact, i have gotten to where i don't use them at all
if i have to move things in that direction, i write a discrete loop

in this case, you could probe, then clear, one page at a time
something like this...

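    ASSUME  FS:Nothing

    mov     edx,esp                 ; edx = current stack pointer
    mov     fs:[700h],edi           ; stash edi at fs:[700h]
    xor     eax,eax                 ; eax = 0: probe value and fill value
    sub     edx,<NumberOfBytesRequiredPlus3Mod4> ; edx = target esp
    .repeat
        push    eax                 ; probe: a write into the guard page commits it
        mov     ecx,esp
        mov     esp,fs:[8]          ; esp = TIB StackLimit (lowest committed address)
        sub     ecx,esp             ; bytes between the stack limit and the probe
        shr     ecx,2               ; convert to dwords
        .if !ZERO?
            mov     edi,esp
            rep     stosd           ; zero from the stack limit up to the probe
        .endif
    .until edx>=esp                 ; done once the limit is at or below the target
    mov     edi,fs:[700h]           ; restore edi
    mov     esp,edx                 ; esp = start of the buffer

    ASSUME  FS:ERROR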
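
Why the probe has to walk down one page at a time can be shown with a toy model. The sketch below is Python rather than MASM, purely for illustration. Everything in it (`GuardedStack`, `touch`, `probe`, the addresses) is hypothetical. It assumes the usual Windows arrangement of a single guard page just below the committed stack, and it ignores the reserve limit.

```python
PAGE = 4096  # x86 page size in bytes

class GuardedStack:
    """Toy model of a stack that grows through a single guard page."""

    def __init__(self, base, committed_pages):
        # Lowest committed address, the value the loop above reads from fs:[8].
        self.limit = base - committed_pages * PAGE
        # The guard page sits directly below the committed region.
        self.guard = self.limit - PAGE

    def touch(self, addr):
        """Access addr: commit the guard page if hit, fault if below it."""
        if addr >= self.limit:
            return                      # already committed, nothing happens
        if addr >= self.guard:
            self.limit = self.guard     # the guard page becomes committed
            self.guard -= PAGE          # a new guard page appears below it
            return
        raise MemoryError("access below the guard page")

def probe(stack, esp, nbytes):
    """Touch one page at a time until esp - nbytes lies in committed memory."""
    target = esp - nbytes
    while stack.limit > target:
        stack.touch(stack.limit - 4)    # like `push` right after esp = fs:[8]
    return target

if __name__ == "__main__":
    s = GuardedStack(0x130000, committed_pages=2)
    esp = 0x12FF00
    try:
        s.touch(esp - 5 * PAGE)         # skipping the guard page faults
    except MemoryError as e:
        print("unprobed:", e)
    print("probed target:", hex(probe(s, esp, 5 * PAGE)))
```

Jumping straight to the target skips the guard page and faults. Stepping down from the current limit hits the guard page on every step, so each page is committed in turn.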

dedndave · October 26, 2013, 02:28:02 AM

this is a simpler version...

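    ASSUME  FS:Nothing

    mov     edx,esp
    mov     ecx,esp                 ; ecx = original stack pointer
    sub     edx,<NumberOfBytesRequiredPlus3Mod4> ; edx = target esp
    .repeat
        push    eax                 ; probe only, the value in eax does not matter
        mov     esp,fs:[8]          ; esp = TIB StackLimit (lowest committed address)
    .until edx>=esp                 ; done once the limit is at or below the target
    sub     ecx,edx                 ; ecx = buffer size in bytes
    xchg    edx,edi                 ; edi = buffer start, edx = saved edi
    shr     ecx,2                   ; convert to dwords
    xor     eax,eax                 ; fill value
    mov     esp,edi                 ; esp = start of the buffer
    rep     stosd                   ; zero the buffer in one pass
    mov     edi,edx                 ; restore edi

    ASSUME  FS:ERROR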
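
A note on the `<NumberOfBytesRequiredPlus3Mod4>` placeholder: one plausible reading (an assumption, the post does not spell it out) is the requested byte count rounded up to a multiple of 4, so that `shr ecx,2` loses nothing and esp stays dword-aligned. In MASM that would be a constant expression along the lines of `(N+3) and -4`. The arithmetic, sketched in Python with hypothetical helper names:

```python
def round_up_to_dword(nbytes):
    """Round a byte count up to the next multiple of 4."""
    return (nbytes + 3) & ~3

def dword_count(nbytes):
    """Number of dwords `rep stosd` must store to cover nbytes."""
    return round_up_to_dword(nbytes) >> 2

if __name__ == "__main__":
    for n in (1, 4, 5, 10, 4096):
        print(n, round_up_to_dword(n), dword_count(n))
```

A request for 10 bytes, for example, reserves 12 bytes and clears 3 dwords.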
