Zero a stack buffer (and probe it)

Started by jj2007, October 25, 2013, 07:31:54 PM

jj2007

Quote from: nidud on October 26, 2013, 01:47:07 AM
Quote from: jj2007 on October 26, 2013, 12:52:01 AM
Put it under TestA, just for fun ;)

the intention should at best be educational  :P

having both of them will illustrate the penalty of manipulating the flags on different CPUs

Yes, that's true. Although it seems it's not the flag setting but rather the "wrong" direction that makes rep stosd slow.

Attached a new version with a modified StackBuffer() macro. Your algo is "rep stosd up" ;-)
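
For anyone who wants to try the two directions in isolation, here is a minimal sketch. It is not the code from the attached testbed: the buffer size and the way the buffer is reserved are assumptions, and masm32rt.inc is assumed to sit in its default location. BUFSIZE is kept at one page so that the "up" variant cannot step over the stack guard page.

```asm
include \masm32\include\masm32rt.inc

BUFSIZE = 4096                  ; assumed size: one page, multiple of 4

.code
start:
    push edi                    ; edi is callee-saved
    xor eax, eax                ; the fill value

    ; "rep stosd down": start at the highest dword, direction flag set
    mov edx, esp                ; save current stack pointer
    sub esp, BUFSIZE            ; reserve the buffer
    lea edi, [edx-4]            ; highest dword of the buffer
    mov ecx, BUFSIZE/4          ; dword count
    std
    rep stosd                   ; edi moves towards lower addresses
    cld                         ; Windows expects the direction flag clear
    mov esp, edx                ; release the buffer

    ; "rep stosd up": start at the lowest dword, direction flag clear
    mov edx, esp
    sub esp, BUFSIZE
    mov edi, esp                ; lowest dword of the buffer
    mov ecx, BUFSIZE/4
    rep stosd                   ; edi moves towards higher addresses
    mov esp, edx

    pop edi
    invoke ExitProcess, 0
end start
```

For buffers larger than one page, the "up" variant would need the pages probed first, because its first write lands at the far end of the buffer.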

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles

4601    kCycles for 100 * rep stosd
4893    kCycles for 100 * push 0
4890    kCycles for 100 * push edx
2151    kCycles for 100 * StackBuffer (with zeroing)
2141    kCycles for 100 * movaps xmm0
1701    kCycles for 100 * rep stosd up

4592    kCycles for 100 * rep stosd
4891    kCycles for 100 * push 0
4894    kCycles for 100 * push edx
2142    kCycles for 100 * StackBuffer (with zeroing)
2141    kCycles for 100 * movaps xmm0
1697    kCycles for 100 * rep stosd up

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 263/100 cycles

4957    kCycles for 100 * rep stosd
4902    kCycles for 100 * push 0
4931    kCycles for 100 * push edx
2934    kCycles for 100 * StackBuffer (with zeroing)
3006    kCycles for 100 * movaps xmm0
3035    kCycles for 100 * rep stosd up

4956    kCycles for 100 * rep stosd
4982    kCycles for 100 * push 0
4869    kCycles for 100 * push edx
2777    kCycles for 100 * StackBuffer (with zeroing)
2840    kCycles for 100 * movaps xmm0
2823    kCycles for 100 * rep stosd up

4954    kCycles for 100 * rep stosd
4863    kCycles for 100 * push 0
4940    kCycles for 100 * push edx
2799    kCycles for 100 * StackBuffer (with zeroing)
2801    kCycles for 100 * movaps xmm0
2779    kCycles for 100 * rep stosd up

4993    kCycles for 100 * rep stosd
4908    kCycles for 100 * push 0
4940    kCycles for 100 * push edx
2767    kCycles for 100 * StackBuffer (with zeroing)
2811    kCycles for 100 * movaps xmm0
2790    kCycles for 100 * rep stosd up

5074    kCycles for 100 * rep stosd
4911    kCycles for 100 * push 0
5082    kCycles for 100 * push edx
2860    kCycles for 100 * StackBuffer (with zeroing)
2767    kCycles for 100 * movaps xmm0
2779    kCycles for 100 * rep stosd up

5073    kCycles for 100 * rep stosd
4907    kCycles for 100 * push 0
5175    kCycles for 100 * push edx
2826    kCycles for 100 * StackBuffer (with zeroing)
2796    kCycles for 100 * movaps xmm0
2801    kCycles for 100 * rep stosd up


the last 3 are more-or-less the same on a P4

jj2007

Quote from: dedndave on October 26, 2013, 09:44:49 AM
prescott w/htt
...
the last 3 are more-or-less the same on a P4

Yes, they look similar on older CPUs. The i7 behaves quite differently.
For probing only (no zeroing), StackBuffer is more than twice as fast.

P.S.: Jeri's CastAR looks really impressive :t
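
On probing without zeroing: the sketch below shows what a probe-only loop could look like. It is not the StackBuffer macro itself, whose source is not shown here; the page size, the buffer size and the use of a read access are all assumptions.

```asm
include \masm32\include\masm32rt.inc

PAGESIZE = 4096                 ; assumed page size
BUFSIZE  = 16*PAGESIZE          ; assumed size, a multiple of PAGESIZE

.code
start:
    mov edx, esp                ; save current stack pointer
    mov ecx, BUFSIZE/PAGESIZE   ; number of pages to touch
@@:
    sub esp, PAGESIZE           ; step down one page
    test dword ptr [esp], eax   ; read access only, memory stays unchanged
    dec ecx
    jnz @B
    ; esp now points to a probed, but not zeroed, buffer of BUFSIZE bytes
    mov esp, edx                ; release the buffer
    invoke ExitProcess, 0
end start
```

Touching one dword per page in descending order lets the guard page mechanism commit the stack step by step, which is far less work than writing every byte.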

MichaelW

Northwood w/htt

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
++18 of 20 tests valid, loop overhead is approx. 331/100 cycles

4915    kCycles for 100 * rep stosd
4928    kCycles for 100 * push 0
4911    kCycles for 100 * push edx
2141    kCycles for 100 * StackBuffer (with zeroing)
2153    kCycles for 100 * movaps xmm0
2326    kCycles for 100 * rep stosd up

4912    kCycles for 100 * rep stosd
4905    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
2140    kCycles for 100 * StackBuffer (with zeroing)
2141    kCycles for 100 * movaps xmm0
2199    kCycles for 100 * rep stosd up

4907    kCycles for 100 * rep stosd
4892    kCycles for 100 * push 0
4907    kCycles for 100 * push edx
2141    kCycles for 100 * StackBuffer (with zeroing)
2141    kCycles for 100 * movaps xmm0
2212    kCycles for 100 * rep stosd up


Well Microsoft, here's another nice mess you've gotten us into.

KeepingRealBusy

My laptop:

AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
loop overhead is approx. 495/100 cycles

4656    kCycles for 100 * rep stosd
8331    kCycles for 100 * push 0
7636    kCycles for 100 * push edx
2845    kCycles for 100 * StackBuffer (with zeroing)
2711    kCycles for 100 * movaps xmm0
2585    kCycles for 100 * rep stosd up

3488    kCycles for 100 * rep stosd
5722    kCycles for 100 * push 0
5240    kCycles for 100 * push edx
1712    kCycles for 100 * StackBuffer (with zeroing)
1650    kCycles for 100 * movaps xmm0
1600    kCycles for 100 * rep stosd up

2200    kCycles for 100 * rep stosd
4188    kCycles for 100 * push 0
4251    kCycles for 100 * push edx
1715    kCycles for 100 * StackBuffer (with zeroing)
1565    kCycles for 100 * movaps xmm0
1580    kCycles for 100 * rep stosd up

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
29      bytes for StackBuffer (with zeroing)
25      bytes for movaps xmm0
17      bytes for rep stosd up

--- ok ---


A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?

Dave.

jj2007

Quote from: KeepingRealBusy on October 26, 2013, 01:42:11 PM
A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?

Why not ;-)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

4725    kCycles for 100 * rep stosd
2683    kCycles for 100 * HeapAlloc (*8)
2202    kCycles for 100 * StackBuffer (with zeroing)
2887    kCycles for 100 * StackBuffer (unrolled)
2207    kCycles for 100 * movaps xmm0
1746    kCycles for 100 * rep stosd up

29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)

Siekmanski

StackBuffer2b.exe doesn't work with Windows 8.1

StackBuffer2.exe works OK:

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles

2712    kCycles for 100 * rep stosd
4887    kCycles for 100 * push 0
4885    kCycles for 100 * push edx
1686    kCycles for 100 * StackBuffer (with zeroing)
923     kCycles for 100 * movaps xmm0
1027    kCycles for 100 * rep stosd up

2609    kCycles for 100 * rep stosd
5312    kCycles for 100 * push 0
4887    kCycles for 100 * push edx
943     kCycles for 100 * StackBuffer (with zeroing)
1989    kCycles for 100 * movaps xmm0
1086    kCycles for 100 * rep stosd up

2608    kCycles for 100 * rep stosd
5588    kCycles for 100 * push 0
5649    kCycles for 100 * push edx
971     kCycles for 100 * StackBuffer (with zeroing)
1654    kCycles for 100 * movaps xmm0
981     kCycles for 100 * rep stosd up

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
29      bytes for StackBuffer (with zeroing)
25      bytes for movaps xmm0
17      bytes for rep stosd up


--- ok ---
Creative coders use backward thinking techniques as a strategy.

jj2007

Quote from: Siekmanski on October 26, 2013, 05:57:10 PM
StackBuffer2b.exe doesn't work with Windows 8.1

Oops, it seems I uploaded a version with a nice little int 3 inside. Try 2c above...
Apologies :redface:

Siekmanski

StackBuffer2c   :t

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 576/100 cycles

2674    kCycles for 100 * rep stosd
1810    kCycles for 100 * HeapAlloc (*8 )
940     kCycles for 100 * StackBuffer (with zeroing)
3376    kCycles for 100 * StackBuffer (unrolled)
990     kCycles for 100 * movaps xmm0
1804    kCycles for 100 * rep stosd up

2672    kCycles for 100 * rep stosd
1117    kCycles for 100 * HeapAlloc (*8 )
957     kCycles for 100 * StackBuffer (with zeroing)
2613    kCycles for 100 * StackBuffer (unrolled)
1391    kCycles for 100 * movaps xmm0
1054    kCycles for 100 * rep stosd up

2672    kCycles for 100 * rep stosd
1824    kCycles for 100 * HeapAlloc (*8 )
962     kCycles for 100 * StackBuffer (with zeroing)
3380    kCycles for 100 * StackBuffer (unrolled)
934     kCycles for 100 * movaps xmm0
1789    kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8 )
29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
25      bytes for movaps xmm0
17      bytes for rep stosd up

sinsi

A bit of a difference in rep stosd...

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 328/100 cycles

2449    kCycles for 100 * rep stosd
1025    kCycles for 100 * HeapAlloc (*8 )
951     kCycles for 100 * StackBuffer (with zeroing)
2384    kCycles for 100 * StackBuffer (unrolled)
927     kCycles for 100 * movaps xmm0
955     kCycles for 100 * rep stosd up
Windows eleven is better :biggrin:

jj2007

Thanks :icon14:
Astonishing that the unrolled version is so much slower...

         xorps xmm0, xmm0
         ifnb <unrolled>
            shr eax, 4+2   ; bufsize/(16*4): each pass stores four OWORDs
            mov edx, esp   ; save current stack pointer
            and esp, -16   ; aligned for SSE2
            align 4
            .Repeat
                  sub esp, 4*OWORD
                  movdqa OWORD ptr [esp], xmm0
                  movdqa OWORD ptr [1*OWORD+esp], xmm0
                  movdqa OWORD ptr [2*OWORD+esp], xmm0
                  movdqa OWORD ptr [3*OWORD+esp], xmm0
                  dec eax
            .Until Zero?
         else
            shr eax, 4   ; /16
            mov edx, esp   ; save current stack pointer
            and esp, -16   ; aligned for SSE2
            align 4
            .Repeat
                  sub esp, OWORD
                  movaps OWORD ptr [esp], xmm0
                  dec eax
            .Until Zero?
         endif

sinsi

Aligning the stack buffer to 64 (one cache line) gives me almost identical times for looped/unrolled.

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 202/100 cycles

945     kCycles for 100 * StackBuffer (with zeroing)
993     kCycles for 100 * StackBuffer (unrolled)

949     kCycles for 100 * StackBuffer (with zeroing)
933     kCycles for 100 * StackBuffer (unrolled)

926     kCycles for 100 * StackBuffer (with zeroing)
933     kCycles for 100 * StackBuffer (unrolled)
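
A sketch of the 64-byte aligned variant, assuming that the only change to the unrolled loop is the mask in the and esp instruction. The buffer size is an assumption, and the program is a stand-alone illustration, not the code that produced the timings above.

```asm
include \masm32\include\masm32rt.inc
.686
.xmm

BUFSIZE = 4096                  ; assumed size, a multiple of 64

.code
start:
    mov edx, esp                ; save current stack pointer
    and esp, -64                ; align to one cache line instead of 16
    mov eax, BUFSIZE/64         ; one pass per 64 bytes
    xorps xmm0, xmm0
@@:
    sub esp, 4*OWORD
    movaps OWORD ptr [esp], xmm0
    movaps OWORD ptr [1*OWORD+esp], xmm0
    movaps OWORD ptr [2*OWORD+esp], xmm0
    movaps OWORD ptr [3*OWORD+esp], xmm0
    dec eax
    jnz @B
    ; esp now points to a zeroed, 64-byte aligned buffer of BUFSIZE bytes
    mov esp, edx                ; release the buffer
    invoke ExitProcess, 0
end start
```

With this mask, the four stores of one pass always fall into the same cache line.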

jj2007

Quote from: sinsi on October 26, 2013, 07:19:16 PM
Aligning the stack buffer to 64 (one cache line) gives me almost identical times for looped/unrolled.

Same effect for reordering:
            and esp, -16   ; aligned for SSE2
            align 4
            .Repeat
                  sub esp, 4*OWORD
                  movaps OWORD ptr [3*OWORD+esp], xmm0
                  movaps OWORD ptr [2*OWORD+esp], xmm0
                  movaps OWORD ptr [1*OWORD+esp], xmm0
                  movaps OWORD ptr [0*OWORD+esp], xmm0
                  dec eax
            .Until Zero?

But the timings are identical, so no need for unrolling.

Gunther

Jochen,

timings from my machine at home:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 229/100 cycles

2287    kCycles for 100 * rep stosd
4347    kCycles for 100 * push 0
4314    kCycles for 100 * push edx
830     kCycles for 100 * StackBuffer (with zeroing)
822     kCycles for 100 * movaps xmm0
847     kCycles for 100 * rep stosd up

2290    kCycles for 100 * rep stosd
4321    kCycles for 100 * push 0
4289    kCycles for 100 * push edx
828     kCycles for 100 * StackBuffer (with zeroing)
821     kCycles for 100 * movaps xmm0
859     kCycles for 100 * rep stosd up

2293    kCycles for 100 * rep stosd
4303    kCycles for 100 * push 0
4298    kCycles for 100 * push edx
830     kCycles for 100 * StackBuffer (with zeroing)
812     kCycles for 100 * movaps xmm0
847     kCycles for 100 * rep stosd up

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
29      bytes for StackBuffer (with zeroing)
25      bytes for movaps xmm0
17      bytes for rep stosd up

--- ok ---


Gunther
You have to know the facts before you can distort them.

dedndave

version 2c on a prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 264/100 cycles

5102    kCycles for 100 * rep stosd
4717    kCycles for 100 * HeapAlloc (*8 )
2851    kCycles for 100 * StackBuffer (with zeroing)
3962    kCycles for 100 * StackBuffer (unrolled)
2835    kCycles for 100 * movaps xmm0
2872    kCycles for 100 * rep stosd up

5145    kCycles for 100 * rep stosd
3681    kCycles for 100 * HeapAlloc (*8 )
2862    kCycles for 100 * StackBuffer (with zeroing)
3950    kCycles for 100 * StackBuffer (unrolled)
2894    kCycles for 100 * movaps xmm0
2844    kCycles for 100 * rep stosd up

5111    kCycles for 100 * rep stosd
3769    kCycles for 100 * HeapAlloc (*8 )
2836    kCycles for 100 * StackBuffer (with zeroing)
3950    kCycles for 100 * StackBuffer (unrolled)
2846    kCycles for 100 * movaps xmm0
2900    kCycles for 100 * rep stosd up


can't beat REP STOSD for simplicity   :P