Zero a stack buffer (and probe it)

jj2007 · October 26, 2013, 09:40:39 AM

Quote from: nidud on October 26, 2013, 01:47:07 AM
Quote from: jj2007 on October 26, 2013, 12:52:01 AM
Put it under TestA, just for fun ;)

the intention should at best be educational :P

having both of them will illustrate the penalty of manipulating the flags on different CPUs

Yes, that's true. Although it seems it's not the flag setting but rather the "wrong" direction that makes rep stosd slow.

Attached a new version with a modified StackBuffer() macro. Your algo is "rep stosd up" ;-)
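For readers without the attachment, here is a minimal sketch of what "down" and "up" could look like with rep stosd. This is not the code from the attached testbed, which may differ. BUFSIZE, ZeroDown and ZeroUp are hypothetical names, and the buffer is kept below one page so that neither direction can step past the stack guard page.

```asm
.686p
.model flat, stdcall
option casemap:none

BUFSIZE equ 1024                ; hypothetical: multiple of 4, smaller than a 4 KB page

.code

; "Down": DF=1, edi starts at the highest dword of the buffer
ZeroDown proc
    push edi
    sub esp, BUFSIZE            ; reserve the buffer
    lea edi, [esp+BUFSIZE-4]    ; last dword of the buffer
    mov ecx, BUFSIZE/4          ; dword count
    xor eax, eax
    std                         ; edi decrements after each store
    rep stosd
    cld                         ; restore DF=0, which Windows code expects
    add esp, BUFSIZE            ; release the buffer
    pop edi
    ret
ZeroDown endp

; "Up": DF stays clear, edi starts at the lowest dword of the buffer
ZeroUp proc
    push edi
    sub esp, BUFSIZE
    mov edi, esp                ; first dword of the buffer
    mov ecx, BUFSIZE/4
    xor eax, eax
    rep stosd                   ; edi increments after each store
    add esp, BUFSIZE
    pop edi
    ret
ZeroUp endp

end
```

The only differences between the two are the start address and the std/cld pair, so timing them side by side separates the cost of the store direction from everything else.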

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles

4601 kCycles for 100 * rep stosd
4893 kCycles for 100 * push 0
4890 kCycles for 100 * push edx
2151 kCycles for 100 * StackBuffer (with zeroing)
2141 kCycles for 100 * movaps xmm0
1701 kCycles for 100 * rep stosd up

4592 kCycles for 100 * rep stosd
4891 kCycles for 100 * push 0
4894 kCycles for 100 * push edx
2142 kCycles for 100 * StackBuffer (with zeroing)
2141 kCycles for 100 * movaps xmm0
1697 kCycles for 100 * rep stosd up

dedndave · October 26, 2013, 09:44:49 AM

prescott w/htt

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 263/100 cycles

4957    kCycles for 100 * rep stosd
4902    kCycles for 100 * push 0
4931    kCycles for 100 * push edx
2934    kCycles for 100 * StackBuffer (with zeroing)
3006    kCycles for 100 * movaps xmm0
3035    kCycles for 100 * rep stosd up

4956    kCycles for 100 * rep stosd
4982    kCycles for 100 * push 0
4869    kCycles for 100 * push edx
2777    kCycles for 100 * StackBuffer (with zeroing)
2840    kCycles for 100 * movaps xmm0
2823    kCycles for 100 * rep stosd up

4954    kCycles for 100 * rep stosd
4863    kCycles for 100 * push 0
4940    kCycles for 100 * push edx
2799    kCycles for 100 * StackBuffer (with zeroing)
2801    kCycles for 100 * movaps xmm0
2779    kCycles for 100 * rep stosd up

4993    kCycles for 100 * rep stosd
4908    kCycles for 100 * push 0
4940    kCycles for 100 * push edx
2767    kCycles for 100 * StackBuffer (with zeroing)
2811    kCycles for 100 * movaps xmm0
2790    kCycles for 100 * rep stosd up

5074    kCycles for 100 * rep stosd
4911    kCycles for 100 * push 0
5082    kCycles for 100 * push edx
2860    kCycles for 100 * StackBuffer (with zeroing)
2767    kCycles for 100 * movaps xmm0
2779    kCycles for 100 * rep stosd up

5073    kCycles for 100 * rep stosd
4907    kCycles for 100 * push 0
5175    kCycles for 100 * push edx
2826    kCycles for 100 * StackBuffer (with zeroing)
2796    kCycles for 100 * movaps xmm0
2801    kCycles for 100 * rep stosd up

the last 3 are more-or-less the same on a P4

jj2007 · October 26, 2013, 09:50:17 AM

Quote from: dedndave on October 26, 2013, 09:44:49 AM
prescott w/htt
...
the last 3 are more-or-less the same on a P4

Yes, they look similar on older CPUs. The i7 behaves quite differently.
For probing only (no zeroing), StackBuffer is more than twice as fast.
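As background on "probing only": Windows commits stack pages on demand through a guard page, so a large buffer has to touch every page in descending order before it can be used. A minimal sketch of such a probe loop follows. It is not the StackBuffer macro itself. PAGESIZE, BUFSIZE and ProbeOnly are hypothetical names, and the 64 KB size is just an example that fits a default stack.

```asm
.686p
.model flat, stdcall
option casemap:none

PAGESIZE equ 4096               ; x86 page size
BUFSIZE  equ 16*PAGESIZE        ; hypothetical: a multiple of PAGESIZE

.code

ProbeOnly proc
    mov edx, esp                ; save the stack pointer
    mov ecx, BUFSIZE/PAGESIZE   ; number of pages to touch
@@:
    sub esp, PAGESIZE           ; step down one page
    test dword ptr [esp], eax   ; a read is enough to hit the guard page
    dec ecx
    jnz @B
    ; esp now points to the bottom of a committed BUFSIZE-byte buffer
    mov esp, edx                ; release the buffer again
    ret
ProbeOnly endp

end
```

One memory access per 4096 bytes instead of one store per 4 or 16 bytes is the reason a probe-only pass can be much cheaper than a zeroing pass.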

P.S.: Jeri's CastAR looks really impressive :t

MichaelW · October 26, 2013, 09:53:50 AM

Northwood w/htt

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
++18 of 20 tests valid, loop overhead is approx. 331/100 cycles

4915    kCycles for 100 * rep stosd
4928    kCycles for 100 * push 0
4911    kCycles for 100 * push edx
2141    kCycles for 100 * StackBuffer (with zeroing)
2153    kCycles for 100 * movaps xmm0
2326    kCycles for 100 * rep stosd up

4912    kCycles for 100 * rep stosd
4905    kCycles for 100 * push 0
4901    kCycles for 100 * push edx
2140    kCycles for 100 * StackBuffer (with zeroing)
2141    kCycles for 100 * movaps xmm0
2199    kCycles for 100 * rep stosd up

4907    kCycles for 100 * rep stosd
4892    kCycles for 100 * push 0
4907    kCycles for 100 * push edx
2141    kCycles for 100 * StackBuffer (with zeroing)
2141    kCycles for 100 * movaps xmm0
2212    kCycles for 100 * rep stosd up

KeepingRealBusy · October 26, 2013, 01:42:11 PM

My laptop:

AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
loop overhead is approx. 495/100 cycles

4656    kCycles for 100 * rep stosd
8331    kCycles for 100 * push 0
7636    kCycles for 100 * push edx
2845    kCycles for 100 * StackBuffer (with zeroing)
2711    kCycles for 100 * movaps xmm0
2585    kCycles for 100 * rep stosd up

3488    kCycles for 100 * rep stosd
5722    kCycles for 100 * push 0
5240    kCycles for 100 * push edx
1712    kCycles for 100 * StackBuffer (with zeroing)
1650    kCycles for 100 * movaps xmm0
1600    kCycles for 100 * rep stosd up

2200    kCycles for 100 * rep stosd
4188    kCycles for 100 * push 0
4251    kCycles for 100 * push edx
1715    kCycles for 100 * StackBuffer (with zeroing)
1565    kCycles for 100 * movaps xmm0
1580    kCycles for 100 * rep stosd up

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
29      bytes for StackBuffer (with zeroing)
25      bytes for movaps xmm0
17      bytes for rep stosd up

--- ok ---

A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?

Dave.

jj2007 · October 26, 2013, 04:19:12 PM

Quote from: KeepingRealBusy on October 26, 2013, 01:42:11 PM
A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?

Why not ;-)

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

4725 kCycles for 100 * rep stosd
2683 kCycles for 100 * HeapAlloc (*8)
2202 kCycles for 100 * StackBuffer (with zeroing)
2887 kCycles for 100 * StackBuffer (unrolled)
2207 kCycles for 100 * movaps xmm0
1746 kCycles for 100 * rep stosd up

29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)

Siekmanski · October 26, 2013, 05:57:10 PM

StackBuffer2b.exe doesn't work with Windows 8.1

StackBuffer2.exe works OK:

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles

2712 kCycles for 100 * rep stosd
4887 kCycles for 100 * push 0
4885 kCycles for 100 * push edx
1686 kCycles for 100 * StackBuffer (with zeroing)
923 kCycles for 100 * movaps xmm0
1027 kCycles for 100 * rep stosd up

2609 kCycles for 100 * rep stosd
5312 kCycles for 100 * push 0
4887 kCycles for 100 * push edx
943 kCycles for 100 * StackBuffer (with zeroing)
1989 kCycles for 100 * movaps xmm0
1086 kCycles for 100 * rep stosd up

2608 kCycles for 100 * rep stosd
5588 kCycles for 100 * push 0
5649 kCycles for 100 * push edx
971 kCycles for 100 * StackBuffer (with zeroing)
1654 kCycles for 100 * movaps xmm0
981 kCycles for 100 * rep stosd up

18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
29 bytes for StackBuffer (with zeroing)
25 bytes for movaps xmm0
17 bytes for rep stosd up

--- ok ---

jj2007 · October 26, 2013, 06:07:28 PM

Quote from: Siekmanski on October 26, 2013, 05:57:10 PM
StackBuffer2b.exe doesn't work with Windows 8.1

Oops, it seems I uploaded a version with a nice little int 3 inside. Try 2c above...
Apologies :redface:

Siekmanski · October 26, 2013, 06:20:02 PM

StackBuffer2c :t

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 576/100 cycles

2674 kCycles for 100 * rep stosd
1810 kCycles for 100 * HeapAlloc (*8 )
940 kCycles for 100 * StackBuffer (with zeroing)
3376 kCycles for 100 * StackBuffer (unrolled)
990 kCycles for 100 * movaps xmm0
1804 kCycles for 100 * rep stosd up

2672 kCycles for 100 * rep stosd
1117 kCycles for 100 * HeapAlloc (*8 )
957 kCycles for 100 * StackBuffer (with zeroing)
2613 kCycles for 100 * StackBuffer (unrolled)
1391 kCycles for 100 * movaps xmm0
1054 kCycles for 100 * rep stosd up

2672 kCycles for 100 * rep stosd
1824 kCycles for 100 * HeapAlloc (*8 )
962 kCycles for 100 * StackBuffer (with zeroing)
3380 kCycles for 100 * StackBuffer (unrolled)
934 kCycles for 100 * movaps xmm0
1789 kCycles for 100 * rep stosd up

18 bytes for rep stosd
103 bytes for HeapAlloc (*8 )
29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)
25 bytes for movaps xmm0
17 bytes for rep stosd up

sinsi · October 26, 2013, 06:36:11 PM

A bit of a difference in rep stosd...

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 328/100 cycles

2449 kCycles for 100 * rep stosd
1025 kCycles for 100 * HeapAlloc (*8 )
951 kCycles for 100 * StackBuffer (with zeroing)
2384 kCycles for 100 * StackBuffer (unrolled)
927 kCycles for 100 * movaps xmm0
955 kCycles for 100 * rep stosd up

jj2007 · October 26, 2013, 06:50:14 PM

Thanks :icon14:
Astonishing that the unrolled version is so much slower...

         xorps xmm0, xmm0
         ifnb <unrolled>
            shr eax, 4+2   ; bufsize/(16*4): four OWORDs per iteration
            mov edx, esp   ; save current stack pointer
            and esp, -16   ; aligned for SSE2
            align 4
            .Repeat
                  sub esp, 4*OWORD
                  movdqa OWORD ptr [esp], xmm0
                  movdqa OWORD ptr [1*OWORD+esp], xmm0
                  movdqa OWORD ptr [2*OWORD+esp], xmm0
                  movdqa OWORD ptr [3*OWORD+esp], xmm0
                  dec eax
            .Until Zero?
         else
            shr eax, 4   ; /16
            mov edx, esp   ; save current stack pointer
            and esp, -16   ; aligned for SSE2
            align 4
            .Repeat
                  sub esp, OWORD
                  movaps OWORD ptr [esp], xmm0
                  dec eax
            .Until Zero?
         endif

sinsi · October 26, 2013, 07:19:16 PM

Aligning the stack buffer to 64 bytes (one cache line) gives me almost identical times for looped/unrolled.

Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 202/100 cycles

945 kCycles for 100 * StackBuffer (with zeroing)
993 kCycles for 100 * StackBuffer (unrolled)

949 kCycles for 100 * StackBuffer (with zeroing)
933 kCycles for 100 * StackBuffer (unrolled)

926 kCycles for 100 * StackBuffer (with zeroing)
933 kCycles for 100 * StackBuffer (unrolled)
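A minimal sketch of how a 64-byte aligned variant of the unrolled loop might look, assuming the only change is the alignment mask. The exact modification behind the timings above is not shown in the thread. BUFSIZE and ZeroAligned64 are hypothetical names.

```asm
.686p
.xmm
.model flat, stdcall
option casemap:none

BUFSIZE equ 4096                ; hypothetical: a multiple of 64

.code

ZeroAligned64 proc
    mov edx, esp                ; save the stack pointer
    and esp, -64                ; round down to a 64-byte boundary
    mov eax, BUFSIZE/64         ; one iteration per 64-byte block
    xorps xmm0, xmm0            ; 16 zero bytes
@@:
    sub esp, 4*OWORD
    movaps OWORD ptr [esp], xmm0
    movaps OWORD ptr [1*OWORD+esp], xmm0
    movaps OWORD ptr [2*OWORD+esp], xmm0
    movaps OWORD ptr [3*OWORD+esp], xmm0
    dec eax
    jnz @B
    ; esp now points to BUFSIZE zeroed bytes, 64-byte aligned
    mov esp, edx                ; release the buffer again
    ret
ZeroAligned64 endp

end
```

With this mask each iteration's four stores fall into a single 64-byte block, whereas with and esp, -16 the same four stores can straddle two cache lines.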

jj2007 · October 26, 2013, 07:48:44 PM

Quote from: sinsi on October 26, 2013, 07:19:16 PM
Aligning the stack buffer to 64 bytes (one cache line) gives me almost identical times for looped/unrolled.

Same effect for reordering:
            and esp, -16   ; aligned for SSE2
            align 4
            .Repeat
                  sub esp, 4*OWORD
                  movaps OWORD ptr [3*OWORD+esp], xmm0
                  movaps OWORD ptr [2*OWORD+esp], xmm0
                  movaps OWORD ptr [1*OWORD+esp], xmm0
                  movaps OWORD ptr [0*OWORD+esp], xmm0
                  dec eax
            .Until Zero?
But the timings are identical, so no need for unrolling.

Gunther · October 26, 2013, 10:55:20 PM

Jochen,

timings from my machine at home:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 229/100 cycles

2287    kCycles for 100 * rep stosd
4347    kCycles for 100 * push 0
4314    kCycles for 100 * push edx
830     kCycles for 100 * StackBuffer (with zeroing)
822     kCycles for 100 * movaps xmm0
847     kCycles for 100 * rep stosd up

2290    kCycles for 100 * rep stosd
4321    kCycles for 100 * push 0
4289    kCycles for 100 * push edx
828     kCycles for 100 * StackBuffer (with zeroing)
821     kCycles for 100 * movaps xmm0
859     kCycles for 100 * rep stosd up

2293    kCycles for 100 * rep stosd
4303    kCycles for 100 * push 0
4298    kCycles for 100 * push edx
830     kCycles for 100 * StackBuffer (with zeroing)
812     kCycles for 100 * movaps xmm0
847     kCycles for 100 * rep stosd up

18      bytes for rep stosd
17      bytes for push 0
16      bytes for push edx
29      bytes for StackBuffer (with zeroing)
25      bytes for movaps xmm0
17      bytes for rep stosd up

--- ok ---

Gunther

dedndave · October 26, 2013, 11:05:51 PM

version 2c on a prescott w/htt

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 264/100 cycles

5102    kCycles for 100 * rep stosd
4717    kCycles for 100 * HeapAlloc (*8 )
2851    kCycles for 100 * StackBuffer (with zeroing)
3962    kCycles for 100 * StackBuffer (unrolled)
2835    kCycles for 100 * movaps xmm0
2872    kCycles for 100 * rep stosd up

5145    kCycles for 100 * rep stosd
3681    kCycles for 100 * HeapAlloc (*8 )
2862    kCycles for 100 * StackBuffer (with zeroing)
3950    kCycles for 100 * StackBuffer (unrolled)
2894    kCycles for 100 * movaps xmm0
2844    kCycles for 100 * rep stosd up

5111    kCycles for 100 * rep stosd
3769    kCycles for 100 * HeapAlloc (*8 )
2836    kCycles for 100 * StackBuffer (with zeroing)
3950    kCycles for 100 * StackBuffer (unrolled)
2846    kCycles for 100 * movaps xmm0
2900    kCycles for 100 * rep stosd up

can't beat REP STOSD for simplicity :P

The MASM Forum
