Spin-off from MemStrategy (http://masm32.com/board/index.php?topic=2515.msg26340#msg26340):
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 268/100 cycles
3778 kCycles for 100 * rep stosd
4905 kCycles for 100 * push 0
4890 kCycles for 100 * push edx
3343 kCycles for 100 * movups xmm0
3319 kCycles for 100 * movaps xmm0
3785 kCycles for 100 * rep stosd
4891 kCycles for 100 * push 0
4894 kCycles for 100 * push edx
3457 kCycles for 100 * movups xmm0
3319 kCycles for 100 * movaps xmm0
3785 kCycles for 100 * rep stosd
4891 kCycles for 100 * push 0
4896 kCycles for 100 * push edx
3342 kCycles for 100 * movups xmm0
3320 kCycles for 100 * movaps xmm0
18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
22 bytes for movups xmm0
25 bytes for movaps xmm0
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles
2700 kCycles for 100 * rep stosd
5467 kCycles for 100 * push 0
4888 kCycles for 100 * push edx
4266 kCycles for 100 * movups xmm0
1411 kCycles for 100 * movaps xmm0
2752 kCycles for 100 * rep stosd
4887 kCycles for 100 * push 0
5651 kCycles for 100 * push edx
4262 kCycles for 100 * movups xmm0
1030 kCycles for 100 * movaps xmm0
2699 kCycles for 100 * rep stosd
4892 kCycles for 100 * push 0
4888 kCycles for 100 * push edx
4263 kCycles for 100 * movups xmm0
1744 kCycles for 100 * movaps xmm0
18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
22 bytes for movups xmm0
25 bytes for movaps xmm0
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 310/100 cycles
2385 kCycles for 100 * rep stosd
4530 kCycles for 100 * push 0
4508 kCycles for 100 * push edx
3932 kCycles for 100 * movups xmm0
871 kCycles for 100 * movaps xmm0
AMD Athlon(tm) II X2 220 Processor (SSE3) 2.80 GHz
loop overhead is approx. 239/100 cycles
2621 kCycles for 100 * rep stosd
4891 kCycles for 100 * push 0
4895 kCycles for 100 * push edx
1666 kCycles for 100 * movups xmm0
1605 kCycles for 100 * movaps xmm0
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 248/100 cycles
4986 kCycles for 100 * rep stosd
4827 kCycles for 100 * push 0
4991 kCycles for 100 * push edx
6187 kCycles for 100 * movups xmm0
2767 kCycles for 100 * movaps xmm0
5023 kCycles for 100 * rep stosd
4857 kCycles for 100 * push 0
4935 kCycles for 100 * push edx
6207 kCycles for 100 * movups xmm0
2766 kCycles for 100 * movaps xmm0
5023 kCycles for 100 * rep stosd
4855 kCycles for 100 * push 0
4990 kCycles for 100 * push edx
6225 kCycles for 100 * movups xmm0
2765 kCycles for 100 * movaps xmm0
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 245/100 cycles
5107 kCycles for 100 * rep stosd
4844 kCycles for 100 * push 0
4902 kCycles for 100 * push edx
6153 kCycles for 100 * movups xmm0
2827 kCycles for 100 * movaps xmm0
2815 kCycles for 100 * rep stosd
5111 kCycles for 100 * rep stosd
4873 kCycles for 100 * push 0
4887 kCycles for 100 * push edx
6150 kCycles for 100 * movups xmm0
2795 kCycles for 100 * movaps xmm0
2782 kCycles for 100 * rep stosd
5053 kCycles for 100 * rep stosd
4892 kCycles for 100 * push 0
4850 kCycles for 100 * push edx
6179 kCycles for 100 * movups xmm0
2767 kCycles for 100 * movaps xmm0
2827 kCycles for 100 * rep stosd
Jochen,
here are the results from an old computer (located in a university laboratory). The other tests from my machine at home will come this evening.
AMD Athlon(tm) Dual Core Processor 5000B (SSE3)
loop overhead is approx. 239/100 cycles
3779 kCycles for 100 * rep stosd
4897 kCycles for 100 * push 0
4901 kCycles for 100 * push edx
3344 kCycles for 100 * movups xmm0
3347 kCycles for 100 * movaps xmm0
3774 kCycles for 100 * rep stosd
4897 kCycles for 100 * push 0
4899 kCycles for 100 * push edx
3343 kCycles for 100 * movups xmm0
3341 kCycles for 100 * movaps xmm0
3778 kCycles for 100 * rep stosd
4897 kCycles for 100 * push 0
4901 kCycles for 100 * push edx
3344 kCycles for 100 * movups xmm0
3331 kCycles for 100 * movaps xmm0
18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
22 bytes for movups xmm0
25 bytes for movaps xmm0
--- ok ---
Gunther
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
loop overhead is approx. 211/100 cycles
7356 kCycles for 100 * rep stosd
4902 kCycles for 100 * push 0
4902 kCycles for 100 * push edx
3059 kCycles for 100 * movups xmm0
2312 kCycles for 100 * movaps xmm0
2207 kCycles for 100 * rep stosd
7358 kCycles for 100 * rep stosd
4905 kCycles for 100 * push 0
4897 kCycles for 100 * push edx
3064 kCycles for 100 * movups xmm0
2303 kCycles for 100 * movaps xmm0
2212 kCycles for 100 * rep stosd
7372 kCycles for 100 * rep stosd
4913 kCycles for 100 * push 0
4901 kCycles for 100 * push edx
3063 kCycles for 100 * movups xmm0
2303 kCycles for 100 * movaps xmm0
2214 kCycles for 100 * rep stosd
18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
22 bytes for movups xmm0
25 bytes for movaps xmm0
17 bytes for rep stosd
--- ok ---
Thanks to everybody :icon14:
Quote from: nidud on October 25, 2013, 10:18:42 PM
I cleaned up the rep stosd function a bit
I appreciate your good intentions, Nidud. Put it under TestA, just for fun ;)
(hint: look at this thread's title)
Northwood w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
++18 of 20 tests valid, loop overhead is approx. 309/100 cycles
4910 kCycles for 100 * rep stosd
4902 kCycles for 100 * push 0
4904 kCycles for 100 * push edx
4904 kCycles for 100 * movups xmm0
2144 kCycles for 100 * movaps xmm0
4910 kCycles for 100 * rep stosd
5130 kCycles for 100 * push 0
4901 kCycles for 100 * push edx
4893 kCycles for 100 * movups xmm0
2140 kCycles for 100 * movaps xmm0
4911 kCycles for 100 * rep stosd
4909 kCycles for 100 * push 0
4903 kCycles for 100 * push edx
4895 kCycles for 100 * movups xmm0
2150 kCycles for 100 * movaps xmm0
18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
22 bytes for movups xmm0
25 bytes for movaps xmm0
i try to avoid STD's :lol:
in fact, i have gotten to where i don't use them at all
if i have to move things in that direction, i write a discrete loop
in this case, you could probe, then clear, one page at a time
something like this...
ASSUME FS:Nothing
mov edx,esp
mov fs:[700h],edi
xor eax,eax
sub edx,<NumberOfBytesRequiredPlus3Mod4>
.repeat
push eax
mov ecx,esp
mov esp,fs:[8]
sub ecx,esp
shr ecx,2
.if !ZERO
mov edi,esp
rep stosd
.endif
.until edx>=esp
mov edi,fs:[700h]
mov esp,edx
ASSUME FS:ERROR
this is a simpler version...
ASSUME FS:Nothing
mov edx,esp
mov ecx,esp
sub edx,<NumberOfBytesRequiredPlus3Mod4>
.repeat
push eax
mov esp,fs:[8]
.until edx>=esp
sub ecx,edx
xchg edx,edi
shr ecx,2
xor eax,eax
mov esp,edi
rep stosd
mov edi,edx
ASSUME FS:ERROR
Quote from: nidud on October 26, 2013, 01:47:07 AM
Quote from: jj2007 on October 26, 2013, 12:52:01 AM
Put it under TestA, just for fun ;)
the intention should at best be educational :P
having both of them will illustrate the penalty of manipulating the flags on different CPU's
Yes, that's true. Although it seems it's not the flag setting but rather the "wrong" direction that makes rep stosd slow.
Attached a new version with a modified StackBuffer() macro. Your algo is "rep stosd up" ;-)
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles
4601 kCycles for 100 * rep stosd
4893 kCycles for 100 * push 0
4890 kCycles for 100 * push edx
2151 kCycles for 100 * StackBuffer (with zeroing)
2141 kCycles for 100 * movaps xmm0
1701 kCycles for 100 * rep stosd up
4592 kCycles for 100 * rep stosd
4891 kCycles for 100 * push 0
4894 kCycles for 100 * push edx
2142 kCycles for 100 * StackBuffer (with zeroing)
2141 kCycles for 100 * movaps xmm0
1697 kCycles for 100 * rep stosd up
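The point about direction can be sketched in C (this sketch is mine, not from the thread): CLD + rep stosd stores ascending through memory, STD + rep stosd stores descending. The result is identical; only the store order differs, and it is the descending order that modern CPUs penalize.

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Ascending fill, like CLD + rep stosd */
static void fill_up(uint32_t *buf, size_t n, uint32_t v)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = v;
}

/* Descending fill, like STD + rep stosd */
static void fill_down(uint32_t *buf, size_t n, uint32_t v)
{
    for (size_t i = n; i-- > 0; )
        buf[i] = v;
}
```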
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 263/100 cycles
4957 kCycles for 100 * rep stosd
4902 kCycles for 100 * push 0
4931 kCycles for 100 * push edx
2934 kCycles for 100 * StackBuffer (with zeroing)
3006 kCycles for 100 * movaps xmm0
3035 kCycles for 100 * rep stosd up
4956 kCycles for 100 * rep stosd
4982 kCycles for 100 * push 0
4869 kCycles for 100 * push edx
2777 kCycles for 100 * StackBuffer (with zeroing)
2840 kCycles for 100 * movaps xmm0
2823 kCycles for 100 * rep stosd up
4954 kCycles for 100 * rep stosd
4863 kCycles for 100 * push 0
4940 kCycles for 100 * push edx
2799 kCycles for 100 * StackBuffer (with zeroing)
2801 kCycles for 100 * movaps xmm0
2779 kCycles for 100 * rep stosd up
4993 kCycles for 100 * rep stosd
4908 kCycles for 100 * push 0
4940 kCycles for 100 * push edx
2767 kCycles for 100 * StackBuffer (with zeroing)
2811 kCycles for 100 * movaps xmm0
2790 kCycles for 100 * rep stosd up
5074 kCycles for 100 * rep stosd
4911 kCycles for 100 * push 0
5082 kCycles for 100 * push edx
2860 kCycles for 100 * StackBuffer (with zeroing)
2767 kCycles for 100 * movaps xmm0
2779 kCycles for 100 * rep stosd up
5073 kCycles for 100 * rep stosd
4907 kCycles for 100 * push 0
5175 kCycles for 100 * push edx
2826 kCycles for 100 * StackBuffer (with zeroing)
2796 kCycles for 100 * movaps xmm0
2801 kCycles for 100 * rep stosd up
the last 3 are more-or-less the same on a P4
Quote from: dedndave on October 26, 2013, 09:44:49 AM
prescott w/htt
...
the last 3 are more-or-less the same on a P4
Yes, they look similar on older CPUs. The i7s behave quite differently.
For probing only (no zeroing), StackBuffer is more than twice as fast.
P.S.: Jeri's CastAR looks really impressive :t
Northwood w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
++18 of 20 tests valid, loop overhead is approx. 331/100 cycles
4915 kCycles for 100 * rep stosd
4928 kCycles for 100 * push 0
4911 kCycles for 100 * push edx
2141 kCycles for 100 * StackBuffer (with zeroing)
2153 kCycles for 100 * movaps xmm0
2326 kCycles for 100 * rep stosd up
4912 kCycles for 100 * rep stosd
4905 kCycles for 100 * push 0
4901 kCycles for 100 * push edx
2140 kCycles for 100 * StackBuffer (with zeroing)
2141 kCycles for 100 * movaps xmm0
2199 kCycles for 100 * rep stosd up
4907 kCycles for 100 * rep stosd
4892 kCycles for 100 * push 0
4907 kCycles for 100 * push edx
2141 kCycles for 100 * StackBuffer (with zeroing)
2141 kCycles for 100 * movaps xmm0
2212 kCycles for 100 * rep stosd up
My laptop:
AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
loop overhead is approx. 495/100 cycles
4656 kCycles for 100 * rep stosd
8331 kCycles for 100 * push 0
7636 kCycles for 100 * push edx
2845 kCycles for 100 * StackBuffer (with zeroing)
2711 kCycles for 100 * movaps xmm0
2585 kCycles for 100 * rep stosd up
3488 kCycles for 100 * rep stosd
5722 kCycles for 100 * push 0
5240 kCycles for 100 * push edx
1712 kCycles for 100 * StackBuffer (with zeroing)
1650 kCycles for 100 * movaps xmm0
1600 kCycles for 100 * rep stosd up
2200 kCycles for 100 * rep stosd
4188 kCycles for 100 * push 0
4251 kCycles for 100 * push edx
1715 kCycles for 100 * StackBuffer (with zeroing)
1565 kCycles for 100 * movaps xmm0
1580 kCycles for 100 * rep stosd up
18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
29 bytes for StackBuffer (with zeroing)
25 bytes for movaps xmm0
17 bytes for rep stosd up
--- ok ---
A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?
Dave.
Quote from: KeepingRealBusy on October 26, 2013, 01:42:11 PM
A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?
Why not ;-)
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles
4725 kCycles for 100 * rep stosd
2683 kCycles for 100 * HeapAlloc (*8)
2202 kCycles for 100 * StackBuffer (with zeroing)
2887 kCycles for 100 * StackBuffer (unrolled)
2207 kCycles for 100 * movaps xmm0
1746 kCycles for 100 * rep stosd up
29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)
StackBuffer2b.exe doesn't work with Windows 8.1.
StackBuffer2.exe works OK:
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles
2712 kCycles for 100 * rep stosd
4887 kCycles for 100 * push 0
4885 kCycles for 100 * push edx
1686 kCycles for 100 * StackBuffer (with zeroing)
923 kCycles for 100 * movaps xmm0
1027 kCycles for 100 * rep stosd up
2609 kCycles for 100 * rep stosd
5312 kCycles for 100 * push 0
4887 kCycles for 100 * push edx
943 kCycles for 100 * StackBuffer (with zeroing)
1989 kCycles for 100 * movaps xmm0
1086 kCycles for 100 * rep stosd up
2608 kCycles for 100 * rep stosd
5588 kCycles for 100 * push 0
5649 kCycles for 100 * push edx
971 kCycles for 100 * StackBuffer (with zeroing)
1654 kCycles for 100 * movaps xmm0
981 kCycles for 100 * rep stosd up
18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
29 bytes for StackBuffer (with zeroing)
25 bytes for movaps xmm0
17 bytes for rep stosd up
--- ok ---
Quote from: Siekmanski on October 26, 2013, 05:57:10 PM
StackBuffer2b.exe doesn't work with windows 8.1
Oops, it seems I uploaded a version with a nice little int 3 inside. Try 2c above...
Apologies :redface:
StackBuffer2c :t
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 576/100 cycles
2674 kCycles for 100 * rep stosd
1810 kCycles for 100 * HeapAlloc (*8 )
940 kCycles for 100 * StackBuffer (with zeroing)
3376 kCycles for 100 * StackBuffer (unrolled)
990 kCycles for 100 * movaps xmm0
1804 kCycles for 100 * rep stosd up
2672 kCycles for 100 * rep stosd
1117 kCycles for 100 * HeapAlloc (*8 )
957 kCycles for 100 * StackBuffer (with zeroing)
2613 kCycles for 100 * StackBuffer (unrolled)
1391 kCycles for 100 * movaps xmm0
1054 kCycles for 100 * rep stosd up
2672 kCycles for 100 * rep stosd
1824 kCycles for 100 * HeapAlloc (*8 )
962 kCycles for 100 * StackBuffer (with zeroing)
3380 kCycles for 100 * StackBuffer (unrolled)
934 kCycles for 100 * movaps xmm0
1789 kCycles for 100 * rep stosd up
18 bytes for rep stosd
103 bytes for HeapAlloc (*8 )
29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)
25 bytes for movaps xmm0
17 bytes for rep stosd up
A bit of a difference in rep stosd...
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 328/100 cycles
2449 kCycles for 100 * rep stosd
1025 kCycles for 100 * HeapAlloc (*8 )
951 kCycles for 100 * StackBuffer (with zeroing)
2384 kCycles for 100 * StackBuffer (unrolled)
927 kCycles for 100 * movaps xmm0
955 kCycles for 100 * rep stosd up
Thanks :icon14:
Astonishing that the unrolled version is so much slower...
xorps xmm0, xmm0
ifnb <unrolled>
shr eax, 4+2 ; bufsize/16*4
mov edx, esp ; save current stack pointer
and esp, -16 ; aligned for SSE2
align 4
.Repeat
sub esp, 4*OWORD
movdqa OWORD ptr [esp], xmm0
movdqa OWORD ptr [1*OWORD+esp], xmm0
movdqa OWORD ptr [2*OWORD+esp], xmm0
movdqa OWORD ptr [3*OWORD+esp], xmm0
dec eax
.Until Zero?
else
shr eax, 4 ; /16
mov edx, esp ; save current stack pointer
and esp, -16 ; aligned for SSE2
align 4
.Repeat
sub esp, OWORD
movaps OWORD ptr [esp], xmm0
dec eax
.Until Zero?
endif
Aligning the stack buffer to 64 (one cache line) gives me almost identical times for looped/unrolled.
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (SSE4)
loop overhead is approx. 202/100 cycles
945 kCycles for 100 * StackBuffer (with zeroing)
993 kCycles for 100 * StackBuffer (unrolled)
949 kCycles for 100 * StackBuffer (with zeroing)
933 kCycles for 100 * StackBuffer (unrolled)
926 kCycles for 100 * StackBuffer (with zeroing)
933 kCycles for 100 * StackBuffer (unrolled)
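The alignment trick behind sinsi's result can be sketched in C (a sketch of the idea, not code from the thread): "and esp, -16" rounds the stack pointer down to 16 bytes for SSE; using -64 instead rounds it down to one cache line, which is what evened out the looped/unrolled timings.

```c
#include <stdint.h>
#include <assert.h>

/* Round an address down to the given alignment, the C equivalent of
   "and esp, -align". align must be a power of two. */
static uintptr_t align_down(uintptr_t p, uintptr_t align)
{
    return p & ~(align - 1);
}
```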
Quote from: sinsi on October 26, 2013, 07:19:16 PM
Aligning the stack buffer to 64 (one cache line) gives me almost identical times for looped/unrolled.
Same effect for reordering:
and esp, -16 ; aligned for SSE2
align 4
.Repeat
sub esp, 4*OWORD
movaps OWORD ptr [3*OWORD+esp], xmm0
movaps OWORD ptr [2*OWORD+esp], xmm0
movaps OWORD ptr [1*OWORD+esp], xmm0
movaps OWORD ptr [0*OWORD+esp], xmm0
dec eax
.Until Zero?
But the timings are identical, so no need for unrolling.
Jochen,
timings from my machine at home:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 229/100 cycles
2287 kCycles for 100 * rep stosd
4347 kCycles for 100 * push 0
4314 kCycles for 100 * push edx
830 kCycles for 100 * StackBuffer (with zeroing)
822 kCycles for 100 * movaps xmm0
847 kCycles for 100 * rep stosd up
2290 kCycles for 100 * rep stosd
4321 kCycles for 100 * push 0
4289 kCycles for 100 * push edx
828 kCycles for 100 * StackBuffer (with zeroing)
821 kCycles for 100 * movaps xmm0
859 kCycles for 100 * rep stosd up
2293 kCycles for 100 * rep stosd
4303 kCycles for 100 * push 0
4298 kCycles for 100 * push edx
830 kCycles for 100 * StackBuffer (with zeroing)
812 kCycles for 100 * movaps xmm0
847 kCycles for 100 * rep stosd up
18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
29 bytes for StackBuffer (with zeroing)
25 bytes for movaps xmm0
17 bytes for rep stosd up
--- ok ---
Gunther
version 2c on a prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 264/100 cycles
5102 kCycles for 100 * rep stosd
4717 kCycles for 100 * HeapAlloc (*8 )
2851 kCycles for 100 * StackBuffer (with zeroing)
3962 kCycles for 100 * StackBuffer (unrolled)
2835 kCycles for 100 * movaps xmm0
2872 kCycles for 100 * rep stosd up
5145 kCycles for 100 * rep stosd
3681 kCycles for 100 * HeapAlloc (*8 )
2862 kCycles for 100 * StackBuffer (with zeroing)
3950 kCycles for 100 * StackBuffer (unrolled)
2894 kCycles for 100 * movaps xmm0
2844 kCycles for 100 * rep stosd up
5111 kCycles for 100 * rep stosd
3769 kCycles for 100 * HeapAlloc (*8 )
2836 kCycles for 100 * StackBuffer (with zeroing)
3950 kCycles for 100 * StackBuffer (unrolled)
2846 kCycles for 100 * movaps xmm0
2900 kCycles for 100 * rep stosd up
can't beat REP STOSD for simplicity :P
Quote from: dedndave on October 26, 2013, 11:05:51 PM
can't beat REP STOSD for simplicity :P
But for the fast "rep stosd up" you need to write an SEH, which makes it slightly more complicated again :icon_mrgreen:
why SEH ?
i posted code that does STOSD up with no SEH
but, you haven't incorporated it
a little update...
ASSUME FS:Nothing
mov edx,edi
mov edi,esp
mov ecx,esp
sub edi,<NumberOfBytesRequiredPlus3Mod4>
.repeat
push eax
mov esp,fs:[8]
.until edi>=esp
sub ecx,edi
shr ecx,2
xor eax,eax
mov esp,edi
rep stosd
mov edi,edx
ASSUME FS:ERROR
Quote from: dedndave on October 26, 2013, 11:24:40 PM
but, you haven't incorporated it
I've tried to but it crashes ::)
Set useE=1 in the source...
if it crashes, there must be a simple reason - lol
how much memory are you trying to allocate ?
try the attached test code...
virgin 2d prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 254/100 cycles
5184 kCycles for 100 * rep stosd
4122 kCycles for 100 * HeapAlloc (*8 )
2853 kCycles for 100 * StackBuffer (with zeroing)
2868 kCycles for 100 * StackBuffer (unrolled)
2899 kCycles for 100 * rep stosd up
5116 kCycles for 100 * rep stosd
3073 kCycles for 100 * HeapAlloc (*8 )
2839 kCycles for 100 * StackBuffer (with zeroing)
2849 kCycles for 100 * StackBuffer (unrolled)
2862 kCycles for 100 * rep stosd up
5161 kCycles for 100 * rep stosd
3080 kCycles for 100 * HeapAlloc (*8 )
2843 kCycles for 100 * StackBuffer (with zeroing)
2873 kCycles for 100 * StackBuffer (unrolled)
2848 kCycles for 100 * rep stosd up
StackBuffer2d results:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 193/100 cycles
2341 kCycles for 100 * rep stosd
987 kCycles for 100 * HeapAlloc (*8)
839 kCycles for 100 * StackBuffer (with zeroing)
830 kCycles for 100 * StackBuffer (unrolled)
921 kCycles for 100 * rep stosd up
2345 kCycles for 100 * rep stosd
985 kCycles for 100 * HeapAlloc (*8)
829 kCycles for 100 * StackBuffer (with zeroing)
867 kCycles for 100 * StackBuffer (unrolled)
872 kCycles for 100 * rep stosd up
2339 kCycles for 100 * rep stosd
989 kCycles for 100 * HeapAlloc (*8)
850 kCycles for 100 * StackBuffer (with zeroing)
906 kCycles for 100 * StackBuffer (unrolled)
875 kCycles for 100 * rep stosd up
18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
34 bytes for StackBuffer (with zeroing)
44 bytes for StackBuffer (unrolled)
17 bytes for rep stosd up
--- ok ---
Dave,
your ProbeTest works fine under 64 bit.
Gunther
thanks Gunther - whew !
only thing i can think of is Jochen is trying to allocate more than is reserved
or - perhaps he has some other flaw that makes the try/catch thing necessary
(he must have been a C programmer in a previous life :P )
but - i think there is a major flaw in the idea of speed-tests for probing code
once you have committed that memory, it remains committed until you release it and the OS allocates it elsewhere
to overcome this, you might try HeapAlloc
if the OS needs that space for the heap, it should "reset" the amount committed
i don't think altering the value at FS:[8] is a good idea - lol
sounds like a memory leak waiting to happen
ok
the default reserve is supposed to be 1 MB = 1,048,576 (100000h)
i can only allocate up to 1,032,192 (0FC000h) without a crash
that must be why Jochen is having to use SEH
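Dave's numbers can be checked with a little arithmetic (the interpretation of the 16 KB headroom as guard page plus initially committed pages is an assumption of mine, not something the thread measured):

```c
#include <assert.h>

/* Default 1 MB stack reserve vs. the largest allocation that worked
   without a crash; the 4000h-byte difference is plausibly the guard
   page plus the initially committed stack pages. */
enum {
    STACK_RESERVE = 0x100000,               /* 1,048,576 bytes */
    MAX_ALLOC     = 0xFC000,                /* 1,032,192 bytes */
    HEADROOM      = STACK_RESERVE - MAX_ALLOC
};
```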
Quote from: dedndave on October 27, 2013, 12:53:50 AM
only thing i can think of is Jochen is trying to allocate more than is reserved
or - perhaps he has some other flaw that makes the try/catch thing necessary
Dave,
bufsize is 102400 bytes, no big deal. The Try/Catch thing would be needed for the "rep stosd up" algo, simply because it doesn't probe the stack.
Here is your code embedded in the testbed; it doesn't crash any more, but 2 kCycles is a bit fast... some more comments would be nice, or maybe I am just too tired to understand it :(
TestE proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
mov esi, esp ; check the stack
align 4
.Repeat
mov edx, edi
mov edi, esp
mov ecx, esp
sub edi, (bufsize+3) MOD 4 ;<NumberOfBytesRequiredPlus3Mod4>
.repeat
push eax
ASSUME FS:Nothing
mov esp, fs:[8]
ASSUME FS:ERROR
.until edi>=esp
sub ecx, edi
shr ecx, 2
xor eax, eax
mov esp, edi
rep stosd
mov edi, edx
add esp, (bufsize+3) MOD 4 ; restore stack
dec ebx
.Until Sign?
sub esi, esp
.if !Zero? ; OK
print str$(esi), " STACKDIFF"
exit
.endif
ret
TestE endp
it's not too bad
i probe down the stack by using the TEB.StackLimit value from FS:[8]
then, i use REP STOSD to clear it out
the probe part was discussed at length...
http://masm32.com/board/index.php?topic=1363
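For reference, the layout behind FS:[8] can be sketched as a C struct (field names follow the Win32 NT_TIB structure; the struct itself is a local illustration, not a Windows header). On 32-bit Windows, FS points at the thread's TIB, so FS:[8] is StackLimit, the lowest currently committed stack address, which is why the probe loop re-reads fs:[8] after each push.

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Illustrative prefix of the 32-bit TIB as seen through FS. */
typedef struct TIB32 {
    uint32_t ExceptionList;     /* FS:[0]  SEH chain */
    uint32_t StackBase;         /* FS:[4]  high end of the stack */
    uint32_t StackLimit;        /* FS:[8]  low end of committed stack */
} TIB32;
```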
Thanks, Dave - I had not seen that thread. Now it's clearer...
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles
4710 kCycles for 100 * rep stosd
2220 kCycles for 100 * HeapAlloc (*8)
2193 kCycles for 100 * StackBuffer (with zeroing)
2192 kCycles for 100 * StackBuffer (unrolled)
4697 kCycles for 100 * dedndave
1738 kCycles for 100 * rep stosd up
This is for slightly modified code, taking account of the need to save & restore the old stack:
.Repeat
mov edx, edi ; save edi
mov edi, esp
mov eax, esp ; save old stack
sub edi, (bufsize+3+4) ;<NumberOfBytesRequiredPlus3Mod4>
and edi, -4 ; aligns new stack
.repeat
push eax ; tickle the guard page
ASSUME FS:Nothing
mov esp, fs:[8] ; limit might be 4k lower now
ASSUME FS:ERROR
.until edi>=esp ; loop until we've got enough
mov esp, edi ; new stack
stosd ; save old stack to [edi]
xchg eax, ecx
push edi ; retval for macro
sub ecx, edi
shr ecx, 2
xor eax, eax
rep stosd
pop eax ; retval for macro
mov edi, edx ; restore edi
; ... code that uses buffer...
pop esp ; restore stack
dec ebx
.Until Sign?
I hope I didn't misunderstand anything - for some time I was thoroughly confused by your NumberOfBytesRequiredPlus3Mod4 ::)
sorry for the confusion - it's just a number that is mod4=0
it could be an immediate - or a value calculated in EAX
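In other words, the value is the requested byte count rounded up to the next multiple of 4, so the dword-clearing loop covers it exactly. One way to compute it, as a C sketch (mine, not from the thread): note that an expression like (bufsize+3) MOD 4, as in the embedded TestE above, yields only the remainder 0..3, a buffer of at most 3 bytes, which would explain the suspiciously fast 2 kCycles.

```c
#include <stdint.h>
#include <assert.h>

/* Round n up to the next multiple of 4 - the intended
   "NumberOfBytesRequiredPlus3Mod4" value. (n+3) MOD 4 would instead
   give just the remainder 0..3. */
static uint32_t round_up4(uint32_t n)
{
    return (n + 3) & ~(uint32_t)3;
}
```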
as for restoring the stack.....
OPTION PROLOGUE:None
OPTION EPILOGUE:None
MyProc PROC parm1:DWORD
push ebx
push esi
push edi ;push/pops on EBX ESI EDI are optional, of course
push ebp
mov esp,ebp
;stack probe code here
;stack clear code here
;use stack space, as required
leave
pop edi
pop esi
pop ebx
ret 4
MyProc ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Quote from: dedndave on October 27, 2013, 12:10:08 PM
as for restoring the stack..... leave
I've tried that but it crashes. If you have working code, please insert into the source :icon14:
Anyway, speed-wise it doesn't look so convincing. By the way, the forum software translates *8 into a smiley - HeapAlloc is actually tested with one eighth of the buffer size, because it's so slow :(
give this a try, my friend
i am anxious to see if it crashes on you :P
it should display the allocation size (F0000), then 0 (cleared OR test result)
i commented it heavily, just for you :biggrin:
Quote from: jj2007 on October 26, 2013, 04:19:12 PM
Quote from: KeepingRealBusy on October 26, 2013, 01:42:11 PM
A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?
Why not ;-)
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles
4725 kCycles for 100 * rep stosd
2683 kCycles for 100 * HeapAlloc (*8)
2202 kCycles for 100 * StackBuffer (with zeroing)
2887 kCycles for 100 * StackBuffer (unrolled)
2207 kCycles for 100 * movaps xmm0
1746 kCycles for 100 * rep stosd up
29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)
Jochen,
Using this version, I made some changes. The original movaps test was TestE. I made an unrolled version as TestI, then used some similar code to modify TestE and saved the variants as TestG and TestH. The modifications move the "constant" initializations out of the REPEAT loops and execute them once at the beginning of each test (before the REPEATs). The following are the .lst sections for TestE, TestG, TestH, and TestI (just to check alignments):
align 16
00000190 TestE_s:
= movaps xmm0 NameE equ movaps xmm0 ; assign a descriptive name here
00000190 TestE proc
00000190 BB 00000063 mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
00000198 *@C0011:
00000198 8B CC mov ecx, esp
0000019A 8D 84 24 lea eax, [esp-bufsize]
FFFE7000
000001A1 83 E4 F0 and esp, -16 ; needs a reg or local to store original esp
000001A4 0F 57 C0 xorps xmm0, xmm0
; align 4
.Repeat
000001A7 *@C0012:
000001A7 83 EC 10 sub esp, OWORD
000001AA 0F 29 04 24 movaps OWORD ptr [esp], xmm0 ; movaps <1% faster on AMD
.Until esp<=eax
000001AE 3B E0 * cmp esp, eax
000001B0 77 F5 * ja @C0012
000001B2 8B E1 mov esp, ecx
; add esp, bufsize
000001B4 4B dec ebx
.Until Sign?
000001B5 79 E1 * jns @C0011
000001B7 C3 ret
000001B8 TestE endp
000001B8 TestE_endp:
align 16
000001E0 TestG_s:
= movaps xmm0 (down) NameG equ movaps xmm0 (down) ; assign a descriptive name here
000001E0 TestG proc
000001E0 BB 00000063 mov ebx, AlgoLoops-1 ; loop e.g. 100x
000001E5 8B CC mov ecx, esp
000001E7 8B F4 mov esi, esp
000001E9 BA FFFFFFF0 mov edx, -OWORD
000001EE 83 E6 F0 and esi, -16
000001F1 0F 57 C0 xorps xmm0, xmm0
000001F4 8D 86 FFFE7000 lea eax, [esi-bufsize]
align 16
.Repeat
00000200 *@C0017:
00000200 8B E6 mov esp, esi
.Repeat
00000202 *@C0018:
00000202 8D 24 14 lea esp,[esp+edx]
00000205 0F 29 04 24 movaps OWORD ptr [esp], xmm0 ; movaps <1% faster on AMD
.Until esp==eax
00000209 3B E0 * cmp esp, eax
0000020B 75 F5 * jne @C0018
0000020D 4B dec ebx
.Until Sign?
0000020E 79 F0 * jns @C0017
00000210 8B E1 mov esp, ecx
00000212 C3 ret
00000213 TestG endp
00000213 TestG_endp:
align 16
00000220 TestH_s:
= movaps xmm0 (up) NameH equ movaps xmm0 (up) ; assign a descriptive name here
00000220 TestH proc
00000220 BB 00000063 mov ebx, AlgoLoops-1 ; loop e.g. 100x
00000225 8B CC mov ecx, esp
00000227 8D B4 24 lea esi, [esp-bufsize]
FFFE7000
0000022E BA 00000010 mov edx, OWORD
00000233 83 E6 F0 and esi, -16
00000236 0F 57 C0 xorps xmm0, xmm0
00000239 8D 86 00019000 lea eax, [esi+bufsize]
align 16
.Repeat
00000240 *@C001B:
00000240 8B E6 mov esp, esi
.Repeat
00000242 *@C001C:
00000242 0F 29 04 24 movaps OWORD ptr [esp+(0*OWORD)], xmm0 ; movaps <1% faster on AMD
00000246 8D 24 14 lea esp,[esp+edx]
.Until esp==eax
00000249 3B E0 * cmp esp, eax
0000024B 75 F5 * jne @C001C
0000024D 4B dec ebx
.Until Sign?
0000024E 79 F0 * jns @C001B
00000250 8B E1 mov esp, ecx
00000252 C3 ret
00000253 TestH endp
00000253 TestH_endp:
align 16
00000260 TestI_s:
= movaps xmm0 (unrolled) NameI equ movaps xmm0 (unrolled) ; assign a descriptive name here
00000260 TestI proc
00000260 BB 00000063 mov ebx, AlgoLoops-1 ; loop e.g. 100x
00000265 8B CC mov ecx, esp
00000267 8D B4 24 lea esi, [esp-bufsize]
FFFE7000
0000026E BA 00000080 mov edx, (8*OWORD)
00000273 83 E6 F0 and esi, -16
00000276 0F 57 C0 xorps xmm0, xmm0
00000279 8D 86 00019000 lea eax, [esi+bufsize]
.Repeat
0000027F *@C001F:
0000027F 8B E6 mov esp, esi
align 16
.Repeat
00000290 *@C0020:
00000290 0F 29 04 24 movaps OWORD ptr [esp+(0*OWORD)], xmm0 ; movaps <1% faster on AMD
00000294 0F 29 44 24 10 movaps OWORD ptr [esp+(1*OWORD)], xmm0 ; movaps <1% faster on AMD
00000299 0F 29 44 24 20 movaps OWORD ptr [esp+(2*OWORD)], xmm0 ; movaps <1% faster on AMD
0000029E 0F 29 44 24 30 movaps OWORD ptr [esp+(3*OWORD)], xmm0 ; movaps <1% faster on AMD
000002A3 0F 29 44 24 40 movaps OWORD ptr [esp+(4*OWORD)], xmm0 ; movaps <1% faster on AMD
000002A8 0F 29 44 24 50 movaps OWORD ptr [esp+(5*OWORD)], xmm0 ; movaps <1% faster on AMD
000002AD 0F 29 44 24 60 movaps OWORD ptr [esp+(6*OWORD)], xmm0 ; movaps <1% faster on AMD
000002B2 0F 29 44 24 70 movaps OWORD ptr [esp+(7*OWORD)], xmm0 ; movaps <1% faster on AMD
000002B7 8D 24 14 lea esp,[esp+edx]
.Until esp==eax
000002BA 3B E0 * cmp esp, eax
000002BC 75 D2 * jne @C0020
; add esp, bufsize
000002BE 4B dec ebx
.Until Sign?
000002BF 79 BE * jns @C001F
000002C1 8B E1 mov esp, ecx
000002C3 C3 ret
000002C4 TestI endp
000002C4 TestI_endp:
The following are my executions:
AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
loop overhead is approx. 433/100 cycles
5229 kCycles for 100 * rep stosd
3627 kCycles for 100 * HeapAlloc (*8)
3274 kCycles for 100 * StackBuffer (with zeroing)
3278 kCycles for 100 * StackBuffer (unrolled)
3193 kCycles for 100 * movaps xmm0
3118 kCycles for 100 * rep stosd up
2798 kCycles for 100 * movaps xmm0 (down)
2974 kCycles for 100 * movaps xmm0 (up)
2895 kCycles for 100 * movaps xmm0 (unrolled)
3573 kCycles for 100 * rep stosd
2709 kCycles for 100 * HeapAlloc (*8)
2458 kCycles for 100 * StackBuffer (with zeroing)
2481 kCycles for 100 * StackBuffer (unrolled)
2426 kCycles for 100 * movaps xmm0
2218 kCycles for 100 * rep stosd up
2086 kCycles for 100 * movaps xmm0 (down)
2329 kCycles for 100 * movaps xmm0 (up)
2273 kCycles for 100 * movaps xmm0 (unrolled)
2244 kCycles for 100 * rep stosd
1512 kCycles for 100 * HeapAlloc (*8)
1422 kCycles for 100 * StackBuffer (with zeroing)
1403 kCycles for 100 * StackBuffer (unrolled)
1546 kCycles for 100 * movaps xmm0
1448 kCycles for 100 * rep stosd up
1424 kCycles for 100 * movaps xmm0 (down)
1561 kCycles for 100 * movaps xmm0 (up)
1502 kCycles for 100 * movaps xmm0 (unrolled)
18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)
25 bytes for movaps xmm0
17 bytes for rep stosd up
36 bytes for movaps xmm0 (down)
36 bytes for movaps xmm0 (up)
85 bytes for movaps xmm0 (unrolled)
--- ok ---
The times are interesting. I have attached a zip of my .asm and .exe files.
Dave.
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 579/100 cycles
2673 kCycles for 100 * rep stosd
1437 kCycles for 100 * HeapAlloc (*8 )
1667 kCycles for 100 * StackBuffer (with zeroing)
2611 kCycles for 100 * StackBuffer (unrolled)
1680 kCycles for 100 * movaps xmm0
1027 kCycles for 100 * rep stosd up
1680 kCycles for 100 * movaps xmm0 (down)
957 kCycles for 100 * movaps xmm0 (up)
973 kCycles for 100 * movaps xmm0 (unrolled)
2672 kCycles for 100 * rep stosd
1500 kCycles for 100 * HeapAlloc (*8 )
1687 kCycles for 100 * StackBuffer (with zeroing)
2608 kCycles for 100 * StackBuffer (unrolled)
1681 kCycles for 100 * movaps xmm0
1029 kCycles for 100 * rep stosd up
1699 kCycles for 100 * movaps xmm0 (down)
948 kCycles for 100 * movaps xmm0 (up)
982 kCycles for 100 * movaps xmm0 (unrolled)
2671 kCycles for 100 * rep stosd
1446 kCycles for 100 * HeapAlloc (*8 )
1677 kCycles for 100 * StackBuffer (with zeroing)
2607 kCycles for 100 * StackBuffer (unrolled)
1681 kCycles for 100 * movaps xmm0
1070 kCycles for 100 * rep stosd up
1678 kCycles for 100 * movaps xmm0 (down)
966 kCycles for 100 * movaps xmm0 (up)
994 kCycles for 100 * movaps xmm0 (unrolled)
18 bytes for rep stosd
103 bytes for HeapAlloc (*8 )
29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)
25 bytes for movaps xmm0
17 bytes for rep stosd up
36 bytes for movaps xmm0 (down)
36 bytes for movaps xmm0 (up)
85 bytes for movaps xmm0 (unrolled)
Results with Dave's (KeepingRealBusy) version:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 201/100 cycles
2356 kCycles for 100 * rep stosd
943 kCycles for 100 * HeapAlloc (*8)
853 kCycles for 100 * StackBuffer (with zeroing)
2285 kCycles for 100 * StackBuffer (unrolled)
824 kCycles for 100 * movaps xmm0
886 kCycles for 100 * rep stosd up
825 kCycles for 100 * movaps xmm0 (down)
880 kCycles for 100 * movaps xmm0 (up)
1443 kCycles for 100 * movaps xmm0 (unrolled)
2354 kCycles for 100 * rep stosd
967 kCycles for 100 * HeapAlloc (*8)
877 kCycles for 100 * StackBuffer (with zeroing)
2300 kCycles for 100 * StackBuffer (unrolled)
846 kCycles for 100 * movaps xmm0
911 kCycles for 100 * rep stosd up
842 kCycles for 100 * movaps xmm0 (down)
883 kCycles for 100 * movaps xmm0 (up)
839 kCycles for 100 * movaps xmm0 (unrolled)
2948 kCycles for 100 * rep stosd
981 kCycles for 100 * HeapAlloc (*8)
845 kCycles for 100 * StackBuffer (with zeroing)
2286 kCycles for 100 * StackBuffer (unrolled)
865 kCycles for 100 * movaps xmm0
868 kCycles for 100 * rep stosd up
828 kCycles for 100 * movaps xmm0 (down)
844 kCycles for 100 * movaps xmm0 (up)
865 kCycles for 100 * movaps xmm0 (unrolled)
18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)
25 bytes for movaps xmm0
17 bytes for rep stosd up
36 bytes for movaps xmm0 (down)
36 bytes for movaps xmm0 (up)
85 bytes for movaps xmm0 (unrolled)
--- ok ---
Gunther
Quote from: KeepingRealBusy on October 27, 2013, 02:04:42 PM
The modifications were to move the "constant" initializations out of the REPEAT loops
Dave,
That defeats the purpose of these loops, which is to simulate a complete HeapAlloc/.../HeapFree sequence...
Dave (the dedn),
Your algo is now included in the testbed below. I have improved it so dramatically that you are now morally obliged to donate it to MasmBasic's StackBuffer() :icon_mrgreen:
push edi
push ebp
mov ebp, esp
mov edi, esp
mov ecx, bufsize ; to be replaced with immediate, global, local, reg etc
sub edi, ecx
and edi, -64 ; aligns buffer to a cache line
ASSUME FS:Nothing
.repeat
push eax ; tickle the guard page - limit might be 4k lower now
mov esp, fs:[8]
.until edi>=esp ; loop until we've got enough
ASSUME FS:ERROR
mov esp, edi ; new stack
add ecx, 3 ; bufsize might be badly aligned
shr ecx, 2 ; stosD
xor eax, eax
rep stosd
mov eax, esp ; retval for macro
; ... use the buffer ...
leave
pop edi
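For readers less fluent in MASM, the size arithmetic in the snippet above can be sketched in portable C (a sketch only; the guard-page probing via fs:[8] is Windows-specific and omitted here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* `and edi, -64` aligns the buffer start down to a 64-byte cache line. */
static uintptr_t align_down_64(uintptr_t p)
{
    return p & ~(uintptr_t)63;
}

/* `add ecx, 3` / `shr ecx, 2` rounds the byte count up to whole dwords,
   so that rep stosd covers a badly aligned bufsize completely. */
static size_t dwords_for(size_t bytes)
{
    return (bytes + 3) >> 2;
}
```

Aligning down to 64 means the buffer may start up to 63 bytes below esp minus bufsize, which is why the probe loop keeps pushing until edi is above the committed stack limit.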
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 189/100 cycles
4743 kCycles for 100 * rep stosd
2232 kCycles for 100 * HeapAlloc (*8)
2201 kCycles for 100 * StackBuffer (with zeroing)
2208 kCycles for 100 * StackBuffer (unrolled)
1749 kCycles for 100 * dedndave
1746 kCycles for 100 * rep stosd up
4725 kCycles for 100 * rep stosd
1846 kCycles for 100 * HeapAlloc (*8)
2205 kCycles for 100 * StackBuffer (with zeroing)
2204 kCycles for 100 * StackBuffer (unrolled)
1747 kCycles for 100 * dedndave
1746 kCycles for 100 * rep stosd up
4726 kCycles for 100 * rep stosd
1850 kCycles for 100 * HeapAlloc (*8)
2203 kCycles for 100 * StackBuffer (with zeroing)
2203 kCycles for 100 * StackBuffer (unrolled)
1747 kCycles for 100 * dedndave
1746 kCycles for 100 * rep stosd up
18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
34 bytes for StackBuffer (with zeroing)
44 bytes for StackBuffer (unrolled)
41 bytes for dedndave <<<<<<<<<<<<<<<< BLOATWARE ALARM!!!
17 bytes for rep stosd up
Okay Jochen, here we go again:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 268/100 cycles
2404 kCycles for 100 * rep stosd
1003 kCycles for 100 * HeapAlloc (*8)
887 kCycles for 100 * StackBuffer (with zeroing)
905 kCycles for 100 * StackBuffer (unrolled)
831 kCycles for 100 * dedndave
872 kCycles for 100 * rep stosd up
2342 kCycles for 100 * rep stosd
982 kCycles for 100 * HeapAlloc (*8)
859 kCycles for 100 * StackBuffer (with zeroing)
835 kCycles for 100 * StackBuffer (unrolled)
927 kCycles for 100 * dedndave
897 kCycles for 100 * rep stosd up
2346 kCycles for 100 * rep stosd
965 kCycles for 100 * HeapAlloc (*8)
838 kCycles for 100 * StackBuffer (with zeroing)
835 kCycles for 100 * StackBuffer (unrolled)
828 kCycles for 100 * dedndave
873 kCycles for 100 * rep stosd up
18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
34 bytes for StackBuffer (with zeroing)
44 bytes for StackBuffer (unrolled)
41 bytes for dedndave
17 bytes for rep stosd up
--- ok ---
Gunther
Thanks, Gunther. The dedndave algo is a clear winner, as expected. Almost ten times as fast as HeapAlloc is a good result :biggrin:
Jochen,
Quote from: jj2007 on October 27, 2013, 09:55:59 PM
... The dedndave algo is a clear winner, as expected. Almost ten times as fast as HeapAlloc is a good result :biggrin:
that's manifestly true. Congrats, Dave. :t
Gunther
version 3 (DD) prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 253/100 cycles
5243 kCycles for 100 * rep stosd
4087 kCycles for 100 * HeapAlloc (*8 )
2834 kCycles for 100 * StackBuffer (with zeroing)
2834 kCycles for 100 * StackBuffer (unrolled)
2905 kCycles for 100 * dedndave
2861 kCycles for 100 * rep stosd up
5121 kCycles for 100 * rep stosd
3114 kCycles for 100 * HeapAlloc (*8 )
2861 kCycles for 100 * StackBuffer (with zeroing)
2842 kCycles for 100 * StackBuffer (unrolled)
2855 kCycles for 100 * dedndave
2873 kCycles for 100 * rep stosd up
5157 kCycles for 100 * rep stosd
3039 kCycles for 100 * HeapAlloc (*8 )
2893 kCycles for 100 * StackBuffer (with zeroing)
2867 kCycles for 100 * StackBuffer (unrolled)
2849 kCycles for 100 * dedndave
2845 kCycles for 100 * rep stosd up
a few words of caution - that apply to all algos...
be sure you leave some space for the OS (stay well under the stack reserve)
try not to REP STOSD with ECX = 0 :lol:
i didn't test for that in my algo, but it could easily be added
shr ecx,2
.if !ZERO?
rep stosd
.endif
oh - and REP STOSD may still not be the fastest way to 0 the memory - that's another test, really
we still have an issue to deal with, as far as timing tests:
once the stack space has been committed the first time, it's already committed on any successive pass
if you really want to know how many cycles it takes, you have to force the OS to "un-commit" before the next timing pass
i think HeapAlloc'ing a large block can do that
it's probably best to separate the 2 operations and optimize each
Dave,
Quote from: dedndave on October 27, 2013, 11:30:34 PM
it's probably best to separate the 2 operations and optimize each
yes, it seems to me that this is true. But that's probably another story and another test.
Gunther
Quote from: dedndave on October 27, 2013, 11:30:34 PM
oh - and REP STOSD may still not be the fastest way to 0 the memory
It is, it is, at least for large buffers and for most CPUs - that's pretty obvious
Quotewe still have an issue to deal with, as far as timing tests:
once the stack space has been committed the first time, it's already committed on any successive pass
if you really want to know how many cycles it takes, you have to force the OS to "un-commit" before the next timing pass
Using StackBuffer will happen somewhere between "proc" and "endp". There are two extreme cases:
1. You use this proc once - then the handful of nanoseconds lost in committing will not matter.
2. You use this proc a million times - then you will not want the OS to uncommit and re-commit that stack space every time you call the proc.
So in effect the timings are extremely valid as they are...
Quotetry not to REP STOSD with ECX = 0 :lol:
See source:
add ecx, 3 ; bufsize might be badly aligned
shr ecx, 2 ; stosD
xor eax, eax
rep stosd
For ecx=0, rep stosd does absolutely nothing... caution, though, passing negative buffer sizes might result in unexpected behaviour :eusa_naughty:
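The wrap-around behind that caution can be shown with a small C sketch (an illustration only, not MasmBasic code): the rounding treats the size as unsigned, so ecx=0 is harmless, but a "negative" size reinterpreted as unsigned produces an enormous dword count.

```c
#include <assert.h>

/* Mirrors `add ecx,3` / `shr ecx,2` on a 32-bit unsigned size;
   the addition wraps modulo 2^32, exactly like the CPU. */
static unsigned dword_count(unsigned bytes)
{
    return (bytes + 3u) >> 2;
}
```

So dword_count(0) is 0 and rep stosd does nothing, but a size of -4 becomes roughly a billion dwords - enough to blow well past any stack reserve.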
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 215/100 cycles
2777 kCycles for 100 * rep stosd
1158 kCycles for 100 * HeapAlloc (*8)
1040 kCycles for 100 * StackBuffer (with zeroing)
1038 kCycles for 100 * StackBuffer (unrolled)
1090 kCycles for 100 * dedndave
1104 kCycles for 100 * rep stosd up
2770 kCycles for 100 * rep stosd
1144 kCycles for 100 * HeapAlloc (*8)
1065 kCycles for 100 * StackBuffer (with zeroing)
1047 kCycles for 100 * StackBuffer (unrolled)
1252 kCycles for 100 * dedndave
1069 kCycles for 100 * rep stosd up
2617 kCycles for 100 * rep stosd
1086 kCycles for 100 * HeapAlloc (*8)
981 kCycles for 100 * StackBuffer (with zeroing)
993 kCycles for 100 * StackBuffer (unrolled)
1037 kCycles for 100 * dedndave
1044 kCycles for 100 * rep stosd up
18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
34 bytes for StackBuffer (with zeroing)
44 bytes for StackBuffer (unrolled)
41 bytes for dedndave
17 bytes for rep stosd up
Quote from: jj2007 on October 28, 2013, 12:26:26 AM
For ecx=0, rep stosd does absolutely nothing...
oh - that's good :P
as i recall, on an 8088, CX = 0 would do 64 KB
Thanks to everybody for testing :icon14:
New version:
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles
512000 bytes:
23582 kCycles for 100 * rep stosd
10726 kCycles for 100 * HeapAlloc
8653 kCycles for 100 * StackBuffer (with zeroing)
8653 kCycles for 100 * dedndave
8627 kCycles for 100 * rep stosd up (no probing)
To my embarrassment, it seems the bufsize/8 disappeared somewhere, so the HeapAlloc values are for the full buffer size. And they are close to the others.
Try changing line 5: bufsize=102400*6 - you are in for a virtual surprise ;)
deleted
Quote from: nidud on October 28, 2013, 01:45:46 AM
here is a thought... link /stack:102400,102400
A valid thought but
a) it requires more discipline with linker settings, environment variables etc
b) that local buffer is still full of garbage and
c) it is not aligned for use with SSE2
;)
Jochen,
StackBuffer3.exe comes up with that result:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 193/100 cycles
512000 bytes:
12600 kCycles for 100 * rep stosd
4696 kCycles for 100 * HeapAlloc
4672 kCycles for 100 * StackBuffer (with zeroing)
4538 kCycles for 100 * dedndave
4593 kCycles for 100 * rep stosd up (no probing)
12567 kCycles for 100 * rep stosd
5334 kCycles for 100 * HeapAlloc
5236 kCycles for 100 * StackBuffer (with zeroing)
4937 kCycles for 100 * dedndave
4692 kCycles for 100 * rep stosd up (no probing)
12518 kCycles for 100 * rep stosd
4685 kCycles for 100 * HeapAlloc
4674 kCycles for 100 * StackBuffer (with zeroing)
5286 kCycles for 100 * dedndave
4666 kCycles for 100 * rep stosd up (no probing)
18 bytes for rep stosd
103 bytes for HeapAlloc
54 bytes for StackBuffer (with zeroing)
41 bytes for dedndave
17 bytes for rep stosd up (no probing)
--- ok ---
Gunther
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 261/100 cycles
512000 bytes:
26444 kCycles for 100 * rep stosd
22020 kCycles for 100 * HeapAlloc
16433 kCycles for 100 * StackBuffer (with zeroing)
15254 kCycles for 100 * dedndave
15176 kCycles for 100 * rep stosd up (no probing)
26155 kCycles for 100 * rep stosd
17181 kCycles for 100 * HeapAlloc
16086 kCycles for 100 * StackBuffer (with zeroing)
15254 kCycles for 100 * dedndave
15979 kCycles for 100 * rep stosd up (no probing)
26160 kCycles for 100 * rep stosd
17103 kCycles for 100 * HeapAlloc
16196 kCycles for 100 * StackBuffer (with zeroing)
15333 kCycles for 100 * dedndave
15132 kCycles for 100 * rep stosd up (no probing)
--- ok ---
loop overhead is approx. 254/100 cycles
512000 bytes:
26153 kCycles for 100 * rep stosd
22074 kCycles for 100 * HeapAlloc
16154 kCycles for 100 * StackBuffer (with zeroing)
15852 kCycles for 100 * dedndave
15254 kCycles for 100 * rep stosd up (no probing)
26087 kCycles for 100 * rep stosd
16510 kCycles for 100 * HeapAlloc
16647 kCycles for 100 * StackBuffer (with zeroing)
15258 kCycles for 100 * dedndave
15187 kCycles for 100 * rep stosd up (no probing)
26145 kCycles for 100 * rep stosd
16325 kCycles for 100 * HeapAlloc
16303 kCycles for 100 * StackBuffer (with zeroing)
15257 kCycles for 100 * dedndave
15032 kCycles for 100 * rep stosd up (no probing)
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 542/100 cycles
512000 bytes:
14386 kCycles for 100 * rep stosd
5242 kCycles for 100 * HeapAlloc
5160 kCycles for 100 * StackBuffer (with zeroing)
5182 kCycles for 100 * dedndave
5283 kCycles for 100 * rep stosd up (no probing)
14350 kCycles for 100 * rep stosd
5261 kCycles for 100 * HeapAlloc
5202 kCycles for 100 * StackBuffer (with zeroing)
5187 kCycles for 100 * dedndave
5258 kCycles for 100 * rep stosd up (no probing)
14356 kCycles for 100 * rep stosd
5276 kCycles for 100 * HeapAlloc
5216 kCycles for 100 * StackBuffer (with zeroing)
5154 kCycles for 100 * dedndave
5289 kCycles for 100 * rep stosd up (no probing)
18 bytes for rep stosd
103 bytes for HeapAlloc
54 bytes for StackBuffer (with zeroing)
41 bytes for dedndave
17 bytes for rep stosd up (no probing)
deleted
SbTestJ proc uses esi MySize
mov esi, StackBuffer(MySize) ; works like a charm, no linker options needed
; ... use the buffer ...
StackBuffer() ; no args = release the buffer
ret
SbTestJ endp
SbTestN proc uses esi MySize
local buf[MySize+16]:byte ; error A2026: constant expected
local buffer:dword
lea eax,buf
and al,0F0h
add eax,16
mov buffer,eax
mov edi,eax
sub eax,eax
mov ecx,bufsize
shr ecx,2 ; stosd stores dwords, so divide the byte count by 4
rep stosd
; ... use the buffer ...
ret
SbTestN endp
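The align-down-then-step trick in SbTestN (`and al,0F0h` followed by `add eax,16`) can be expressed in C; a sketch, assuming the buffer is over-allocated by 16 bytes as in `buf[MySize+16]`:

```c
#include <assert.h>
#include <stdint.h>

/* Clear the low 4 address bits (align down to 16), then add 16:
   the result is 16-byte aligned and strictly greater than p, so it
   always lands inside a buffer padded with 16 extra bytes. */
static uintptr_t align16_inside(uintptr_t p)
{
    return (p & ~(uintptr_t)15) + 16;
}
```

The unconditional +16 wastes up to 16 bytes, but it avoids a branch and guarantees SSE2-friendly alignment regardless of where the local lands on the stack.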
;)
deleted
Quote from: nidud on October 28, 2013, 07:58:57 AM
If you plan on calling this macro frequently it may be better to set the stack one time
No need for doing that. Dave's fs:[8] loop is extremely clever - fs:[8] holds the current stack limit (the lowest committed stack address) in the 32-bit TIB, so if the stack is already committed, the loop costs just 1 or 2 cycles.
I'm curious: movups is slower than conventional instructions, but it moves wider data, right? push edx stores 4 bytes, while movups stores 8 to 16 bytes - so if the cycle count was 22, it should be divided by 2 or 4 to get the per-byte transfer rate.
Just stumbled over an oddity with HeapAlloc: It gets very, very slow for a small range of bytes requested (Win7-32):
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles
512000 bytes:
19585 kCycles for 100 * rep stosd <<< the reference
512000 bytes:
19662 kCycles for 100 * HeapAlloc <<< so far so good
519168 bytes:
131619 kCycles for 100 * HeapAlloc <<< oops
520192 bytes:
915 kCycles for 100 * HeapAlloc <<< VirtualAlloc kicks in
19565 kCycles for 100 * rep stosd
512000 bytes:
19667 kCycles for 100 * HeapAlloc
519168 bytes:
133256 kCycles for 100 * HeapAlloc
520192 bytes:
932 kCycles for 100 * HeapAlloc
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
loop overhead is approx. 575/100 cycles
512000 bytes:
14335 kCycles for 100 * rep stosd
512000 bytes:
5293 kCycles for 100 * HeapAlloc
519168 bytes:
5381 kCycles for 100 * HeapAlloc
520192 bytes:
1314 kCycles for 100 * HeapAlloc
14333 kCycles for 100 * rep stosd
512000 bytes:
5311 kCycles for 100 * HeapAlloc
519168 bytes:
5340 kCycles for 100 * HeapAlloc
520192 bytes:
1314 kCycles for 100 * HeapAlloc
18 bytes for rep stosd
104 bytes for HeapAlloc
XP MCE2005 SP3
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 427/100 cycles
512000 bytes:
34190 kCycles for 100 * rep stosd
512000 bytes:
28714 kCycles for 100 * HeapAlloc
519168 bytes:
23241 kCycles for 100 * HeapAlloc
520192 bytes:
2671 kCycles for 100 * HeapAlloc
34132 kCycles for 100 * rep stosd
512000 bytes:
23103 kCycles for 100 * HeapAlloc
519168 bytes:
23419 kCycles for 100 * HeapAlloc
520192 bytes:
2635 kCycles for 100 * HeapAlloc
Thanks, Marinus & Dave :icon14:
The switch to VirtualAlloc is there, but not the slowdown just below the threshold. Could be Win7-only, or some special feature of my machine ::)
see if you can test a single pass
HeapAlloc may not like being in a x100 loop :P
Quote from: dedndave on October 28, 2013, 11:28:18 PM
see if you can test a single pass
HeapAlloc may not like being in a x100 loop :P
It's quite happy to be in that loop for everything below 500k*1.01 and above 500k*1.016... and switching to e.g. 5 loops doesn't change the pattern. Weird.
that makes me wonder if there are other "holes" in the number line
and - is it specific to your hardware in some way
say, if you had more memory - would it act differently
deleted
i think this is a case where you have to apply some common sense
how is the heap/stack space going to be used in a typical program ?
i can't think of too many cases where you actually waffle back and forth between allocating heap memory and committing stack space
if you do, you should probably re-think your design :P
however, it would still be interesting to see how fast the OS can commit under different conditions
for example: allocate a large heap block, then commit some stack space
from what i have seen (with no heap allocated), the commit loop seems pretty fast
ok, well it doesn't seem to be as fast as i thought
or maybe it's just hard to properly measure something with 1 pass :P
11629 Clock cycles per page
EDIT: a more accurate version - results about the same - lol
deleted
well - that seems counter-intuitive
if you can allocate all available with HeapAlloc, it *should* reset the commit
deleted
i guess it doesn't matter - my way of thinking was wrong, of course
it seems that, once the space has been committed, it stays committed
it simply gets swapped out to the paging file if you try to HeapAlloc(nMaxBytes)
it might work if you create a thread to commit and release stack space, then terminate the thread
EDIT: we really aren't interested in measuring swaps between memory and the page file :lol:
deleted
Quote from: nidud on October 29, 2013, 04:00:41 AM
...However, once the stack is committed (one way or the other) it will be available
as a substitute for HeapAlloc(), and that will save both code space and cycles.
i think that's how you have to look at it, too
btw - i see Mark has a relatively new tool (new version, at least) - called VMMap
http://technet.microsoft.com/en-us/sysinternals/dd535533.aspx
i have to do some reading to interpret what it's showing me :P
Quote from: dedndave on October 29, 2013, 03:30:12 AM
it seems that, once the space has been committed, it stays committed
This is also my interpretation.
IMHO a StackBuffer() macro is best for repeatedly used small local buffers that are bigger than the 2 pages you can have without probing, and smaller than the range of sizes where HeapAlloc becomes competitive. Another advantage is that it avoids heap fragmentation.
Attached a new testbed with sizes 2k ... 512k. Feel free to modify - no MasmBasic needed ;-)
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles
2048 bytes:
103705 cycles for 100 * HeapAlloc
28460 cycles for 100 * StackBuffer (xmm)
48535 cycles for 100 * StackBuffer (rep stosd)
47423 cycles for 100 * dedndave
103956 cycles for 100 * HeapAlloc
28460 cycles for 100 * StackBuffer (xmm)
48544 cycles for 100 * StackBuffer (rep stosd)
47424 cycles for 100 * dedndave
8192 bytes:
185329 cycles for 100 * HeapAlloc
105525 cycles for 100 * StackBuffer (xmm)
128172 cycles for 100 * StackBuffer (rep stosd)
127549 cycles for 100 * dedndave
184025 cycles for 100 * HeapAlloc
105314 cycles for 100 * StackBuffer (xmm)
128170 cycles for 100 * StackBuffer (rep stosd)
127050 cycles for 100 * dedndave
32768 bytes:
547 kCycles for 100 * HeapAlloc
438 kCycles for 100 * StackBuffer (xmm)
440 kCycles for 100 * StackBuffer (rep stosd)
439 kCycles for 100 * dedndave
548 kCycles for 100 * HeapAlloc
438 kCycles for 100 * StackBuffer (xmm)
444 kCycles for 100 * StackBuffer (rep stosd)
437 kCycles for 100 * dedndave
131072 bytes:
2810 kCycles for 100 * HeapAlloc
2808 kCycles for 100 * StackBuffer (xmm)
2230 kCycles for 100 * StackBuffer (rep stosd)
2222 kCycles for 100 * dedndave
2319 kCycles for 100 * HeapAlloc
2808 kCycles for 100 * StackBuffer (xmm)
2225 kCycles for 100 * StackBuffer (rep stosd)
2224 kCycles for 100 * dedndave
524288 bytes:
742 kCycles for 100 * HeapAlloc
12067 kCycles for 100 * StackBuffer (xmm)
8928 kCycles for 100 * StackBuffer (rep stosd)
8977 kCycles for 100 * dedndave
751 kCycles for 100 * HeapAlloc
12305 kCycles for 100 * StackBuffer (xmm)
8921 kCycles for 100 * StackBuffer (rep stosd)
8920 kCycles for 100 * dedndave
104 bytes for HeapAlloc
9 bytes for StackBuffer (xmm)
8 bytes for StackBuffer (rep stosd)
42 bytes for dedndave
33 bytes for MbStackB
16 bytes for MbStackX
prescott w/htt xp mce2005 sp3
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 246/100 cycles
2048 bytes:
230072 cycles for 100 * HeapAlloc
59433 cycles for 100 * StackBuffer (xmm)
71350 cycles for 100 * StackBuffer (rep stosd)
71069 cycles for 100 * dedndave
229937 cycles for 100 * HeapAlloc
60252 cycles for 100 * StackBuffer (xmm)
71518 cycles for 100 * StackBuffer (rep stosd)
70532 cycles for 100 * dedndave
8192 bytes:
447586 cycles for 100 * HeapAlloc
233879 cycles for 100 * StackBuffer (xmm)
245398 cycles for 100 * StackBuffer (rep stosd)
245187 cycles for 100 * dedndave
447288 cycles for 100 * HeapAlloc
234628 cycles for 100 * StackBuffer (xmm)
246152 cycles for 100 * StackBuffer (rep stosd)
244349 cycles for 100 * dedndave
32768 bytes:
1304 kCycles for 100 * HeapAlloc
945 kCycles for 100 * StackBuffer (xmm)
925 kCycles for 100 * StackBuffer (rep stosd)
948 kCycles for 100 * dedndave
1274 kCycles for 100 * HeapAlloc
913 kCycles for 100 * StackBuffer (xmm)
932 kCycles for 100 * StackBuffer (rep stosd)
924 kCycles for 100 * dedndave
131072 bytes:
5997 kCycles for 100 * HeapAlloc
3639 kCycles for 100 * StackBuffer (xmm)
3682 kCycles for 100 * StackBuffer (rep stosd)
3704 kCycles for 100 * dedndave
4671 kCycles for 100 * HeapAlloc
3663 kCycles for 100 * StackBuffer (xmm)
3654 kCycles for 100 * StackBuffer (rep stosd)
3651 kCycles for 100 * dedndave
524288 bytes:
2084 kCycles for 100 * HeapAlloc
14688 kCycles for 100 * StackBuffer (xmm)
14708 kCycles for 100 * StackBuffer (rep stosd)
14847 kCycles for 100 * dedndave
2091 kCycles for 100 * HeapAlloc
14649 kCycles for 100 * StackBuffer (xmm)
14950 kCycles for 100 * StackBuffer (rep stosd)
16294 kCycles for 100 * dedndave
StackBuffer6 brings:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 239/100 cycles
2048 bytes:
71208 cycles for 100 * HeapAlloc
20078 cycles for 100 * StackBuffer (xmm)
40880 cycles for 100 * StackBuffer (rep stosd)
40775 cycles for 100 * dedndave
68901 cycles for 100 * HeapAlloc
47258 cycles for 100 * StackBuffer (xmm)
40798 cycles for 100 * StackBuffer (rep stosd)
40362 cycles for 100 * dedndave
8192 bytes:
111836 cycles for 100 * HeapAlloc
129559 cycles for 100 * StackBuffer (xmm)
122825 cycles for 100 * StackBuffer (rep stosd)
122018 cycles for 100 * dedndave
104378 cycles for 100 * HeapAlloc
55644 cycles for 100 * StackBuffer (xmm)
122728 cycles for 100 * StackBuffer (rep stosd)
122127 cycles for 100 * dedndave
32768 bytes:
262658 cycles for 100 * HeapAlloc
209534 cycles for 100 * StackBuffer (xmm)
196808 cycles for 100 * StackBuffer (rep stosd)
187779 cycles for 100 * dedndave
604 kCycles for 100 * HeapAlloc
477 kCycles for 100 * StackBuffer (xmm)
445 kCycles for 100 * StackBuffer (rep stosd)
443 kCycles for 100 * dedndave
131072 bytes:
1203 kCycles for 100 * HeapAlloc
1090 kCycles for 100 * StackBuffer (xmm)
1178 kCycles for 100 * StackBuffer (rep stosd)
1197 kCycles for 100 * dedndave
1843 kCycles for 100 * HeapAlloc
1108 kCycles for 100 * StackBuffer (xmm)
1159 kCycles for 100 * StackBuffer (rep stosd)
1195 kCycles for 100 * dedndave
524288 bytes:
586 kCycles for 100 * HeapAlloc
6736 kCycles for 100 * StackBuffer (xmm)
5396 kCycles for 100 * StackBuffer (rep stosd)
4757 kCycles for 100 * dedndave
591 kCycles for 100 * HeapAlloc
6740 kCycles for 100 * StackBuffer (xmm)
5389 kCycles for 100 * StackBuffer (rep stosd)
4787 kCycles for 100 * dedndave
104 bytes for HeapAlloc
9 bytes for StackBuffer (xmm)
8 bytes for StackBuffer (rep stosd)
42 bytes for dedndave
33 bytes for MbStackB
16 bytes for MbStackX
--- ok ---
Gunther
OK, thanks to everybody :icon14:
The new StackBuffer() is now implemented in MasmBasic of 30 Oct (more (http://masm32.com/board/index.php?topic=94.msg26592#msg26592)). In the end, rep stosd won the race. Usage examples:
mov sbuf1, StackBuffer(4000h) ; buffer is 16-byte aligned for use with SSE2
invoke GetFileSize, hFile, 0 ; you may use a register to specify the buffer size
mov sbuf2, StackBuffer(eax, nz) ; option nz means "no zeroing" - much faster, of course
...
StackBuffer() ; release all buffers (sb without args = free the buffer)
The nz option performs only the probing, and zeroes the last two bytes of the buffer plus two bytes beyond it. This allows loading e.g. a text file into the buffer while being sure that the end is zero-delimited.
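The zero-delimiting idea can be sketched in C (a hypothetical helper, not the MasmBasic implementation): after reserving two bytes more than the data needs, zeroing the two bytes past the data guarantees a terminator whether the text is later read as 1-byte ANSI or 2-byte UTF-16 characters.

```c
#include <assert.h>
#include <string.h>

/* Copy len bytes of text into buf (which must hold at least len+2
   bytes) and zero the two bytes past the data, so both char and
   wchar_t scans find a terminator. */
static void load_zero_delimited(char *buf, const char *src, size_t len)
{
    memcpy(buf, src, len);
    buf[len] = 0;
    buf[len + 1] = 0;
}
```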
:t
Jochen,
Quote from: jj2007 on October 30, 2013, 11:16:13 AM
The new StackBuffer() is now implemented in MasmBasic of 30 Oct
:t
Gunther