Zero a stack buffer (and probe it)

Started by jj2007, October 25, 2013, 07:31:54 PM

jj2007

Spin-off from MemStrategy:

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 268/100 cycles

3778    kCycles for 100 * rep stosd
4905    kCycles for 100 * push 0
4890    kCycles for 100 * push edx
3343    kCycles for 100 * movups xmm0
3319    kCycles for 100 * movaps xmm0

3785    kCycles for 100 * rep stosd
4891    kCycles for 100 * push 0
4894    kCycles for 100 * push edx
3457    kCycles for 100 * movups xmm0
3319    kCycles for 100 * movaps xmm0

3785    kCycles for 100 * rep stosd
4891    kCycles for 100 * push 0
4896    kCycles for 100 * push edx
3342    kCycles for 100 * movups xmm0
3320    kCycles for 100 * movaps xmm0

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
22      bytes for movups xmm0
25      bytes for movaps xmm0

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles

2700    kCycles for 100 * rep stosd
5467    kCycles for 100 * push 0
4888    kCycles for 100 * push edx
4266    kCycles for 100 * movups xmm0
1411    kCycles for 100 * movaps xmm0

2752    kCycles for 100 * rep stosd
4887    kCycles for 100 * push 0
5651    kCycles for 100 * push edx
4262    kCycles for 100 * movups xmm0
1030    kCycles for 100 * movaps xmm0

2699    kCycles for 100 * rep stosd
4892    kCycles for 100 * push 0
4888    kCycles for 100 * push edx
4263    kCycles for 100 * movups xmm0
1744    kCycles for 100 * movaps xmm0

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
22      bytes for movups xmm0
25      bytes for movaps xmm0

sinsi

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 310/100 cycles

2385    kCycles for 100 * rep stosd
4530    kCycles for 100 * push 0
4508    kCycles for 100 * push edx
3932    kCycles for 100 * movups xmm0
871     kCycles for 100 * movaps xmm0

TWell

AMD Athlon(tm) II X2 220 Processor (SSE3) 2.80 GHz
loop overhead is approx. 239/100 cycles

2621    kCycles for 100 * rep stosd
4891    kCycles for 100 * push 0
4895    kCycles for 100 * push edx
1666    kCycles for 100 * movups xmm0
1605    kCycles for 100 * movaps xmm0

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 248/100 cycles

4986    kCycles for 100 * rep stosd
4827    kCycles for 100 * push 0
4991    kCycles for 100 * push edx
6187    kCycles for 100 * movups xmm0
2767    kCycles for 100 * movaps xmm0

5023    kCycles for 100 * rep stosd
4857    kCycles for 100 * push 0
4935    kCycles for 100 * push edx
6207    kCycles for 100 * movups xmm0
2766    kCycles for 100 * movaps xmm0

5023    kCycles for 100 * rep stosd
4855    kCycles for 100 * push 0
4990    kCycles for 100 * push edx
6225    kCycles for 100 * movups xmm0
2765    kCycles for 100 * movaps xmm0

nidud

#5
deleted

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 245/100 cycles

5107    kCycles for 100 * rep stosd
4844    kCycles for 100 * push 0
4902    kCycles for 100 * push edx
6153    kCycles for 100 * movups xmm0
2827    kCycles for 100 * movaps xmm0
2815    kCycles for 100 * rep stosd

5111    kCycles for 100 * rep stosd
4873    kCycles for 100 * push 0
4887    kCycles for 100 * push edx
6150    kCycles for 100 * movups xmm0
2795    kCycles for 100 * movaps xmm0
2782    kCycles for 100 * rep stosd

5053    kCycles for 100 * rep stosd
4892    kCycles for 100 * push 0
4850    kCycles for 100 * push edx
6179    kCycles for 100 * movups xmm0
2767    kCycles for 100 * movaps xmm0
2827    kCycles for 100 * rep stosd

Gunther

Jochen,

Here are the results from an old computer (located in a university laboratory). The tests from my machine at home will follow this evening.


AMD Athlon(tm) Dual Core Processor 5000B (SSE3)
loop overhead is approx. 239/100 cycles

3779    kCycles for 100 * rep stosd
4897    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
3344    kCycles for 100 * movups xmm0
3347    kCycles for 100 * movaps xmm0

3774    kCycles for 100 * rep stosd
4897    kCycles for 100 * push 0
4899    kCycles for 100 * push edx
3343    kCycles for 100 * movups xmm0
3341    kCycles for 100 * movaps xmm0

3778    kCycles for 100 * rep stosd
4897    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
3344    kCycles for 100 * movups xmm0
3331    kCycles for 100 * movaps xmm0

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
22      bytes for movups xmm0
25      bytes for movaps xmm0

--- ok ---


Gunther

nidud

#8
deleted

FORTRANS

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
loop overhead is approx. 211/100 cycles

7356    kCycles for 100 * rep stosd
4902    kCycles for 100 * push 0
4902    kCycles for 100 * push edx
3059    kCycles for 100 * movups xmm0
2312    kCycles for 100 * movaps xmm0
2207    kCycles for 100 * rep stosd

7358    kCycles for 100 * rep stosd
4905    kCycles for 100 * push 0
4897    kCycles for 100 * push edx
3064    kCycles for 100 * movups xmm0
2303    kCycles for 100 * movaps xmm0
2212    kCycles for 100 * rep stosd

7372    kCycles for 100 * rep stosd
4913    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
3063    kCycles for 100 * movups xmm0
2303    kCycles for 100 * movaps xmm0
2214    kCycles for 100 * rep stosd

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
22      bytes for movups xmm0
25      bytes for movaps xmm0
17      bytes for rep stosd


--- ok ---

jj2007

Thanks to everybody!

Quote from: nidud on October 25, 2013, 10:18:42 PM
I cleaned up the rep stosd function a bit

I appreciate your good intentions, Nidud. Put it under TestA, just for fun ;)
(hint: look at this thread's title)

MichaelW

Northwood w/htt

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
++18 of 20 tests valid, loop overhead is approx. 309/100 cycles

4910    kCycles for 100 * rep stosd
4902    kCycles for 100 * push 0
4904    kCycles for 100 * push edx
4904    kCycles for 100 * movups xmm0
2144    kCycles for 100 * movaps xmm0

4910    kCycles for 100 * rep stosd
5130    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
4893    kCycles for 100 * movups xmm0
2140    kCycles for 100 * movaps xmm0

4911    kCycles for 100 * rep stosd
4909    kCycles for 100 * push 0
4903    kCycles for 100 * push edx
4895    kCycles for 100 * movups xmm0
2150    kCycles for 100 * movaps xmm0

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
22      bytes for movups xmm0
25      bytes for movaps xmm0


nidud

#12
deleted

dedndave

i try to avoid STD's   :lol:

in fact, i have gotten to where i don't use them at all
if i have to move things in that direction, i write a discrete loop

in this case, you could probe, then clear, one page at a time
something like this...
    ASSUME  FS:Nothing

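    mov     edx,esp                 ; edx = current stack top
    mov     fs:[700h],edi           ; stash edi at fs:[700h]; the stack cannot hold it while esp moves
    xor     eax,eax                 ; eax = 0, the fill value
    sub     edx,<NumberOfBytesRequiredPlus3Mod4>    ; edx = final esp (bottom of the buffer)
    .repeat
        push    eax                 ; probe: write a zero just below esp
        mov     ecx,esp
        mov     esp,fs:[8]          ; esp = stack limit from the TIB (lowest committed address)
        sub     ecx,esp             ; ecx = bytes between the limit and the probed dword
        shr     ecx,2               ; bytes to dwords (sets ZF if the count is zero)
        .if !ZERO?
            mov     edi,esp
            rep     stosd           ; zero from the limit upwards
        .endif
    .until edx>=esp                 ; repeat until the limit has reached the target
    mov     edi,fs:[700h]           ; restore edi
    mov     esp,edx                 ; claim the buffer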

    ASSUME  FS:ERROR

dedndave

this is a simpler version...
    ASSUME  FS:Nothing

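    mov     edx,esp
    mov     ecx,esp                 ; ecx = original stack top
    sub     edx,<NumberOfBytesRequiredPlus3Mod4>    ; edx = final esp (bottom of the buffer)
    .repeat
        push    eax                 ; probe only; the value written does not matter
        mov     esp,fs:[8]          ; jump down to the current stack limit
    .until edx>=esp
    sub     ecx,edx                 ; ecx = buffer size in bytes
    xchg    edx,edi                 ; edi = buffer bottom, edx = saved edi
    shr     ecx,2                   ; bytes to dwords
    xor     eax,eax                 ; eax = 0, the fill value
    mov     esp,edi                 ; claim the buffer
    rep     stosd                   ; zero the whole buffer in one pass
    mov     edi,edx                 ; restore edi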

    ASSUME  FS:ERROR
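
For comparison, here is a minimal sketch of a more conventional probe loop that does not read the stack limit from fs:[8] at all. It is not taken from this thread. It assumes 32-bit Windows with 4096-byte pages, a clear direction flag, and a buffer size that is a multiple of the page size; BUFSIZE is a made-up name for illustration. Each page is touched in descending order, so the guard page is always hit before anything below it, and only then is the buffer claimed and zeroed.

```asm
; Hypothetical sketch: probe, claim, then zero a stack buffer.
; BUFSIZE is an illustrative constant and must be a multiple of 4096.
BUFSIZE equ 16*4096

    mov     eax, esp                ; eax walks down the stack
    mov     ecx, BUFSIZE/4096       ; one probe per page
@@:
    sub     eax, 4096               ; step down one page
    test    dword ptr [eax], eax    ; read-only touch; a guard page hit lets the OS commit it
    dec     ecx
    jnz     @B
    sub     esp, BUFSIZE            ; every page has been touched, so claim the buffer
    mov     edx, edi                ; keep edi in a spare register
    mov     edi, esp                ; start at the lowest address of the buffer
    mov     ecx, BUFSIZE/4          ; dword count
    xor     eax, eax                ; fill value
    rep     stosd                   ; zero upwards
    mov     edi, edx                ; restore edi
```

The trade-off against the two versions above is one memory read per page in exchange for not depending on the TIB layout; whether that is faster or slower on a given CPU would have to be timed.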