Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 248/100 cycles
4986 kCycles for 100 * rep stosd
4827 kCycles for 100 * push 0
4991 kCycles for 100 * push edx
6187 kCycles for 100 * movups xmm0
2767 kCycles for 100 * movaps xmm0
5023 kCycles for 100 * rep stosd
4857 kCycles for 100 * push 0
4935 kCycles for 100 * push edx
6207 kCycles for 100 * movups xmm0
2766 kCycles for 100 * movaps xmm0
5023 kCycles for 100 * rep stosd
4855 kCycles for 100 * push 0
4990 kCycles for 100 * push edx
6225 kCycles for 100 * movups xmm0
2765 kCycles for 100 * movaps xmm0
mov edx,edi            ; preserve edi
lea edi,[esp-bufsize]  ; point edi below the stack pointer
mov ecx,bufsize/4      ; dword count
xor eax,eax            ; fill value: zero
rep stosd              ; zero the buffer
mov edi,edx            ; restore edi
dec ebx                ; outer benchmark loop counter
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 245/100 cycles
5107 kCycles for 100 * rep stosd
4844 kCycles for 100 * push 0
4902 kCycles for 100 * push edx
6153 kCycles for 100 * movups xmm0
2827 kCycles for 100 * movaps xmm0
2815 kCycles for 100 * rep stosd
5111 kCycles for 100 * rep stosd
4873 kCycles for 100 * push 0
4887 kCycles for 100 * push edx
6150 kCycles for 100 * movups xmm0
2795 kCycles for 100 * movaps xmm0
2782 kCycles for 100 * rep stosd
5053 kCycles for 100 * rep stosd
4892 kCycles for 100 * push 0
4850 kCycles for 100 * push edx
6179 kCycles for 100 * movups xmm0
2767 kCycles for 100 * movaps xmm0
2827 kCycles for 100 * rep stosd
AMD Athlon(tm) Dual Core Processor 5000B (SSE3)
loop overhead is approx. 239/100 cycles
3779 kCycles for 100 * rep stosd
4897 kCycles for 100 * push 0
4901 kCycles for 100 * push edx
3344 kCycles for 100 * movups xmm0
3347 kCycles for 100 * movaps xmm0
3774 kCycles for 100 * rep stosd
4897 kCycles for 100 * push 0
4899 kCycles for 100 * push edx
3343 kCycles for 100 * movups xmm0
3341 kCycles for 100 * movaps xmm0
3778 kCycles for 100 * rep stosd
4897 kCycles for 100 * push 0
4901 kCycles for 100 * push edx
3344 kCycles for 100 * movups xmm0
3331 kCycles for 100 * movaps xmm0
18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
22 bytes for movups xmm0
25 bytes for movaps xmm0
--- ok ---
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 245/100 cycles
5107 kCycles for 100 * rep stosd
4844 kCycles for 100 * push 0
4902 kCycles for 100 * push edx
6153 kCycles for 100 * movups xmm0
2827 kCycles for 100 * movaps xmm0
2815 kCycles for 100 * rep stosd
I cleaned up the rep stosd function a bit
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
++18 of 20 tests valid, loop overhead is approx. 309/100 cycles
4910 kCycles for 100 * rep stosd
4902 kCycles for 100 * push 0
4904 kCycles for 100 * push edx
4904 kCycles for 100 * movups xmm0
2144 kCycles for 100 * movaps xmm0
4910 kCycles for 100 * rep stosd
5130 kCycles for 100 * push 0
4901 kCycles for 100 * push edx
4893 kCycles for 100 * movups xmm0
2140 kCycles for 100 * movaps xmm0
4911 kCycles for 100 * rep stosd
4909 kCycles for 100 * push 0
4903 kCycles for 100 * push edx
4895 kCycles for 100 * movups xmm0
2150 kCycles for 100 * movaps xmm0
18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
22 bytes for movups xmm0
25 bytes for movaps xmm0
Put it under TestA, just for fun ;)
ASSUME FS:Nothing
mov edx,esp
mov fs:[700h],edi
xor eax,eax
sub edx,<NumberOfBytesRequiredPlus3Mod4>
.repeat
push eax
mov ecx,esp
mov esp,fs:[8]
sub ecx,esp
shr ecx,2
.if !ZERO?
mov edi,esp
rep stosd
.endif
.until edx>=esp
mov edi,fs:[700h]
mov esp,edx
ASSUME FS:ERROR
ASSUME FS:Nothing
mov edx,esp
mov ecx,esp
sub edx,<NumberOfBytesRequiredPlus3Mod4>
.repeat
push eax
mov esp,fs:[8]
.until edx>=esp
sub ecx,edx
xchg edx,edi
shr ecx,2
xor eax,eax
mov esp,edi
rep stosd
mov edi,edx
ASSUME FS:ERROR
Put it under TestA, just for fun ;)
the intention should at best be educational :P
having both of them will illustrate the penalty of manipulating the flags on different CPUs
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 263/100 cycles
4957 kCycles for 100 * rep stosd
4902 kCycles for 100 * push 0
4931 kCycles for 100 * push edx
2934 kCycles for 100 * StackBuffer (with zeroing)
3006 kCycles for 100 * movaps xmm0
3035 kCycles for 100 * rep stosd up
4956 kCycles for 100 * rep stosd
4982 kCycles for 100 * push 0
4869 kCycles for 100 * push edx
2777 kCycles for 100 * StackBuffer (with zeroing)
2840 kCycles for 100 * movaps xmm0
2823 kCycles for 100 * rep stosd up
4954 kCycles for 100 * rep stosd
4863 kCycles for 100 * push 0
4940 kCycles for 100 * push edx
2799 kCycles for 100 * StackBuffer (with zeroing)
2801 kCycles for 100 * movaps xmm0
2779 kCycles for 100 * rep stosd up
4993 kCycles for 100 * rep stosd
4908 kCycles for 100 * push 0
4940 kCycles for 100 * push edx
2767 kCycles for 100 * StackBuffer (with zeroing)
2811 kCycles for 100 * movaps xmm0
2790 kCycles for 100 * rep stosd up
5074 kCycles for 100 * rep stosd
4911 kCycles for 100 * push 0
5082 kCycles for 100 * push edx
2860 kCycles for 100 * StackBuffer (with zeroing)
2767 kCycles for 100 * movaps xmm0
2779 kCycles for 100 * rep stosd up
5073 kCycles for 100 * rep stosd
4907 kCycles for 100 * push 0
5175 kCycles for 100 * push edx
2826 kCycles for 100 * StackBuffer (with zeroing)
2796 kCycles for 100 * movaps xmm0
2801 kCycles for 100 * rep stosd up
prescott w/htt
...
the last 3 are more-or-less the same on a P4
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
++18 of 20 tests valid, loop overhead is approx. 331/100 cycles
4915 kCycles for 100 * rep stosd
4928 kCycles for 100 * push 0
4911 kCycles for 100 * push edx
2141 kCycles for 100 * StackBuffer (with zeroing)
2153 kCycles for 100 * movaps xmm0
2326 kCycles for 100 * rep stosd up
4912 kCycles for 100 * rep stosd
4905 kCycles for 100 * push 0
4901 kCycles for 100 * push edx
2140 kCycles for 100 * StackBuffer (with zeroing)
2141 kCycles for 100 * movaps xmm0
2199 kCycles for 100 * rep stosd up
4907 kCycles for 100 * rep stosd
4892 kCycles for 100 * push 0
4907 kCycles for 100 * push edx
2141 kCycles for 100 * StackBuffer (with zeroing)
2141 kCycles for 100 * movaps xmm0
2212 kCycles for 100 * rep stosd up
AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
loop overhead is approx. 495/100 cycles
4656 kCycles for 100 * rep stosd
8331 kCycles for 100 * push 0
7636 kCycles for 100 * push edx
2845 kCycles for 100 * StackBuffer (with zeroing)
2711 kCycles for 100 * movaps xmm0
2585 kCycles for 100 * rep stosd up
3488 kCycles for 100 * rep stosd
5722 kCycles for 100 * push 0
5240 kCycles for 100 * push edx
1712 kCycles for 100 * StackBuffer (with zeroing)
1650 kCycles for 100 * movaps xmm0
1600 kCycles for 100 * rep stosd up
2200 kCycles for 100 * rep stosd
4188 kCycles for 100 * push 0
4251 kCycles for 100 * push edx
1715 kCycles for 100 * StackBuffer (with zeroing)
1565 kCycles for 100 * movaps xmm0
1580 kCycles for 100 * rep stosd up
18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
29 bytes for StackBuffer (with zeroing)
25 bytes for movaps xmm0
17 bytes for rep stosd up
--- ok ---
A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?
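The unrolling idea can be sketched in portable C (the thread's code is MASM; this analog uses plain 32-bit stores rather than 16-byte movdqa stores, and `zero_unrolled` is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* 8-wide unrolled zeroing loop: eight stores per iteration cut the
   loop-control overhead (compare/branch) to one-eighth. Assumes
   n_dwords is a multiple of 8, which holds for the page-sized
   buffers used in these tests. */
static void zero_unrolled(uint32_t *p, size_t n_dwords)
{
    for (size_t i = 0; i < n_dwords; i += 8) {
        p[i + 0] = 0; p[i + 1] = 0; p[i + 2] = 0; p[i + 3] = 0;
        p[i + 4] = 0; p[i + 5] = 0; p[i + 6] = 0; p[i + 7] = 0;
    }
}
```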
StackBuffer2b.exe doesn't work with windows 8.1
Aligning the stack buffer to 64 (one cache line) gives me almost identical times for looped/unrolled.
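The alignment step is the same arithmetic as `and esp, -16` in the listings below (use 64 for a full cache line, as suggested above); a minimal C sketch with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Align an address DOWN to a power-of-two boundary by clearing
   the low bits - equivalent to `and esp, -align` in the MASM code. */
static uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    return addr & ~(align - 1); /* align must be a power of two */
}
```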
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 229/100 cycles
2287 kCycles for 100 * rep stosd
4347 kCycles for 100 * push 0
4314 kCycles for 100 * push edx
830 kCycles for 100 * StackBuffer (with zeroing)
822 kCycles for 100 * movaps xmm0
847 kCycles for 100 * rep stosd up
2290 kCycles for 100 * rep stosd
4321 kCycles for 100 * push 0
4289 kCycles for 100 * push edx
828 kCycles for 100 * StackBuffer (with zeroing)
821 kCycles for 100 * movaps xmm0
859 kCycles for 100 * rep stosd up
2293 kCycles for 100 * rep stosd
4303 kCycles for 100 * push 0
4298 kCycles for 100 * push edx
830 kCycles for 100 * StackBuffer (with zeroing)
812 kCycles for 100 * movaps xmm0
847 kCycles for 100 * rep stosd up
18 bytes for rep stosd
17 bytes for push 0
16 bytes for push edx
29 bytes for StackBuffer (with zeroing)
25 bytes for movaps xmm0
17 bytes for rep stosd up
--- ok ---
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 264/100 cycles
5102 kCycles for 100 * rep stosd
4717 kCycles for 100 * HeapAlloc (*8 )
2851 kCycles for 100 * StackBuffer (with zeroing)
3962 kCycles for 100 * StackBuffer (unrolled)
2835 kCycles for 100 * movaps xmm0
2872 kCycles for 100 * rep stosd up
5145 kCycles for 100 * rep stosd
3681 kCycles for 100 * HeapAlloc (*8 )
2862 kCycles for 100 * StackBuffer (with zeroing)
3950 kCycles for 100 * StackBuffer (unrolled)
2894 kCycles for 100 * movaps xmm0
2844 kCycles for 100 * rep stosd up
5111 kCycles for 100 * rep stosd
3769 kCycles for 100 * HeapAlloc (*8 )
2836 kCycles for 100 * StackBuffer (with zeroing)
3950 kCycles for 100 * StackBuffer (unrolled)
2846 kCycles for 100 * movaps xmm0
2900 kCycles for 100 * rep stosd up
can't beat REP STOSD for simplicity :P
ASSUME FS:Nothing
mov edx,edi                                 ; preserve edi
mov edi,esp
mov ecx,esp                                 ; remember the original esp
sub edi,<NumberOfBytesRequiredPlus3Mod4>    ; edi = target stack pointer
.repeat
push eax                                    ; touch the guard page to commit it
mov esp,fs:[8]                              ; esp = StackLimit (lowest committed address)
.until edi>=esp
sub ecx,edi                                 ; bytes between old esp and target
shr ecx,2                                   ; convert to dword count
xor eax,eax
mov esp,edi
rep stosd                                   ; zero the new buffer
mov edi,edx                                 ; restore edi
ASSUME FS:ERROR
but, you haven't incorporated it
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 254/100 cycles
5184 kCycles for 100 * rep stosd
4122 kCycles for 100 * HeapAlloc (*8 )
2853 kCycles for 100 * StackBuffer (with zeroing)
2868 kCycles for 100 * StackBuffer (unrolled)
2899 kCycles for 100 * rep stosd up
5116 kCycles for 100 * rep stosd
3073 kCycles for 100 * HeapAlloc (*8 )
2839 kCycles for 100 * StackBuffer (with zeroing)
2849 kCycles for 100 * StackBuffer (unrolled)
2862 kCycles for 100 * rep stosd up
5161 kCycles for 100 * rep stosd
3080 kCycles for 100 * HeapAlloc (*8 )
2843 kCycles for 100 * StackBuffer (with zeroing)
2873 kCycles for 100 * StackBuffer (unrolled)
2848 kCycles for 100 * rep stosd up
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 193/100 cycles
2341 kCycles for 100 * rep stosd
987 kCycles for 100 * HeapAlloc (*8)
839 kCycles for 100 * StackBuffer (with zeroing)
830 kCycles for 100 * StackBuffer (unrolled)
921 kCycles for 100 * rep stosd up
2345 kCycles for 100 * rep stosd
985 kCycles for 100 * HeapAlloc (*8)
829 kCycles for 100 * StackBuffer (with zeroing)
867 kCycles for 100 * StackBuffer (unrolled)
872 kCycles for 100 * rep stosd up
2339 kCycles for 100 * rep stosd
989 kCycles for 100 * HeapAlloc (*8)
850 kCycles for 100 * StackBuffer (with zeroing)
906 kCycles for 100 * StackBuffer (unrolled)
875 kCycles for 100 * rep stosd up
18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
34 bytes for StackBuffer (with zeroing)
44 bytes for StackBuffer (unrolled)
17 bytes for rep stosd up
--- ok ---
only thing i can think of is that Jochen is trying to allocate more than is reserved
or - perhaps he has some other flaw that makes the try/catch thing necessary
OPTION PROLOGUE:None
OPTION EPILOGUE:None
MyProc PROC parm1:DWORD
push ebx
push esi
push edi ;push/pops on EBX ESI EDI are optional, of course
push ebp
mov ebp,esp
;stack probe code here
;stack clear code here
;use stack space, as required
leave
pop edi
pop esi
pop ebx
ret 4
MyProc ENDP
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
as for restoring the stack..... leave
A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?
Why not ;-)
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles
4725 kCycles for 100 * rep stosd
2683 kCycles for 100 * HeapAlloc (*8)
2202 kCycles for 100 * StackBuffer (with zeroing)
2887 kCycles for 100 * StackBuffer (unrolled)
2207 kCycles for 100 * movaps xmm0
1746 kCycles for 100 * rep stosd up
29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)
align 16
00000190 TestE_s:
= movaps xmm0 NameE equ movaps xmm0 ; assign a descriptive name here
00000190 TestE proc
00000190 BB 00000063 mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
00000198 *@C0011:
00000198 8B CC mov ecx, esp
0000019A 8D 84 24 lea eax, [esp-bufsize]
FFFE7000
000001A1 83 E4 F0 and esp, -16 ; needs a reg or local to store original esp
000001A4 0F 57 C0 xorps xmm0, xmm0
; align 4
.Repeat
000001A7 *@C0012:
000001A7 83 EC 10 sub esp, OWORD
000001AA 0F 29 04 24 movaps OWORD ptr [esp], xmm0 ; movaps <1% faster on AMD
.Until esp<=eax
000001AE 3B E0 * cmp esp, eax
000001B0 77 F5 * ja @C0012
000001B2 8B E1 mov esp, ecx
; add esp, bufsize
000001B4 4B dec ebx
.Until Sign?
000001B5 79 E1 * jns @C0011
000001B7 C3 ret
000001B8 TestE endp
000001B8 TestE_endp:
align 16
000001E0 TestG_s:
= movaps xmm0 (down) NameG equ movaps xmm0 (down) ; assign a descriptive name here
000001E0 TestG proc
000001E0 BB 00000063 mov ebx, AlgoLoops-1 ; loop e.g. 100x
000001E5 8B CC mov ecx, esp
000001E7 8B F4 mov esi, esp
000001E9 BA FFFFFFF0 mov edx, -OWORD
000001EE 83 E6 F0 and esi, -16
000001F1 0F 57 C0 xorps xmm0, xmm0
000001F4 8D 86 FFFE7000 lea eax, [esi-bufsize]
align 16
.Repeat
00000200 *@C0017:
00000200 8B E6 mov esp, esi
.Repeat
00000202 *@C0018:
00000202 8D 24 14 lea esp,[esp+edx]
00000205 0F 29 04 24 movaps OWORD ptr [esp], xmm0 ; movaps <1% faster on AMD
.Until esp==eax
00000209 3B E0 * cmp esp, eax
0000020B 75 F5 * jne @C0018
0000020D 4B dec ebx
.Until Sign?
0000020E 79 F0 * jns @C0017
00000210 8B E1 mov esp, ecx
00000212 C3 ret
00000213 TestG endp
00000213 TestG_endp:
align 16
00000220 TestH_s:
= movaps xmm0 (up) NameH equ movaps xmm0 (up) ; assign a descriptive name here
00000220 TestH proc
00000220 BB 00000063 mov ebx, AlgoLoops-1 ; loop e.g. 100x
00000225 8B CC mov ecx, esp
00000227 8D B4 24 lea esi, [esp-bufsize]
FFFE7000
0000022E BA 00000010 mov edx, OWORD
00000233 83 E6 F0 and esi, -16
00000236 0F 57 C0 xorps xmm0, xmm0
00000239 8D 86 00019000 lea eax, [esi+bufsize]
align 16
.Repeat
00000240 *@C001B:
00000240 8B E6 mov esp, esi
.Repeat
00000242 *@C001C:
00000242 0F 29 04 24 movaps OWORD ptr [esp+(0*OWORD)], xmm0 ; movaps <1% faster on AMD
00000246 8D 24 14 lea esp,[esp+edx]
.Until esp==eax
00000249 3B E0 * cmp esp, eax
0000024B 75 F5 * jne @C001C
0000024D 4B dec ebx
.Until Sign?
0000024E 79 F0 * jns @C001B
00000250 8B E1 mov esp, ecx
00000252 C3 ret
00000253 TestH endp
00000253 TestH_endp:
align 16
00000260 TestI_s:
= movaps xmm0 (unrolled) NameI equ movaps xmm0 (unrolled) ; assign a descriptive name here
00000260 TestI proc
00000260 BB 00000063 mov ebx, AlgoLoops-1 ; loop e.g. 100x
00000265 8B CC mov ecx, esp
00000267 8D B4 24 lea esi, [esp-bufsize]
FFFE7000
0000026E BA 00000080 mov edx, (8*OWORD)
00000273 83 E6 F0 and esi, -16
00000276 0F 57 C0 xorps xmm0, xmm0
00000279 8D 86 00019000 lea eax, [esi+bufsize]
.Repeat
0000027F *@C001F:
0000027F 8B E6 mov esp, esi
align 16
.Repeat
00000290 *@C0020:
00000290 0F 29 04 24 movaps OWORD ptr [esp+(0*OWORD)], xmm0 ; movaps <1% faster on AMD
00000294 0F 29 44 24 10 movaps OWORD ptr [esp+(1*OWORD)], xmm0 ; movaps <1% faster on AMD
00000299 0F 29 44 24 20 movaps OWORD ptr [esp+(2*OWORD)], xmm0 ; movaps <1% faster on AMD
0000029E 0F 29 44 24 30 movaps OWORD ptr [esp+(3*OWORD)], xmm0 ; movaps <1% faster on AMD
000002A3 0F 29 44 24 40 movaps OWORD ptr [esp+(4*OWORD)], xmm0 ; movaps <1% faster on AMD
000002A8 0F 29 44 24 50 movaps OWORD ptr [esp+(5*OWORD)], xmm0 ; movaps <1% faster on AMD
000002AD 0F 29 44 24 60 movaps OWORD ptr [esp+(6*OWORD)], xmm0 ; movaps <1% faster on AMD
000002B2 0F 29 44 24 70 movaps OWORD ptr [esp+(7*OWORD)], xmm0 ; movaps <1% faster on AMD
000002B7 8D 24 14 lea esp,[esp+edx]
.Until esp==eax
000002BA 3B E0 * cmp esp, eax
000002BC 75 D2 * jne @C0020
; add esp, bufsize
000002BE 4B dec ebx
.Until Sign?
000002BF 79 BE * jns @C001F
000002C1 8B E1 mov esp, ecx
000002C3 C3 ret
000002C4 TestI endp
000002C4 TestI_endp:
AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
loop overhead is approx. 433/100 cycles
5229 kCycles for 100 * rep stosd
3627 kCycles for 100 * HeapAlloc (*8)
3274 kCycles for 100 * StackBuffer (with zeroing)
3278 kCycles for 100 * StackBuffer (unrolled)
3193 kCycles for 100 * movaps xmm0
3118 kCycles for 100 * rep stosd up
2798 kCycles for 100 * movaps xmm0 (down)
2974 kCycles for 100 * movaps xmm0 (up)
2895 kCycles for 100 * movaps xmm0 (unrolled)
3573 kCycles for 100 * rep stosd
2709 kCycles for 100 * HeapAlloc (*8)
2458 kCycles for 100 * StackBuffer (with zeroing)
2481 kCycles for 100 * StackBuffer (unrolled)
2426 kCycles for 100 * movaps xmm0
2218 kCycles for 100 * rep stosd up
2086 kCycles for 100 * movaps xmm0 (down)
2329 kCycles for 100 * movaps xmm0 (up)
2273 kCycles for 100 * movaps xmm0 (unrolled)
2244 kCycles for 100 * rep stosd
1512 kCycles for 100 * HeapAlloc (*8)
1422 kCycles for 100 * StackBuffer (with zeroing)
1403 kCycles for 100 * StackBuffer (unrolled)
1546 kCycles for 100 * movaps xmm0
1448 kCycles for 100 * rep stosd up
1424 kCycles for 100 * movaps xmm0 (down)
1561 kCycles for 100 * movaps xmm0 (up)
1502 kCycles for 100 * movaps xmm0 (unrolled)
18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)
25 bytes for movaps xmm0
17 bytes for rep stosd up
36 bytes for movaps xmm0 (down)
36 bytes for movaps xmm0 (up)
85 bytes for movaps xmm0 (unrolled)
--- ok ---
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 201/100 cycles
2356 kCycles for 100 * rep stosd
943 kCycles for 100 * HeapAlloc (*8)
853 kCycles for 100 * StackBuffer (with zeroing)
2285 kCycles for 100 * StackBuffer (unrolled)
824 kCycles for 100 * movaps xmm0
886 kCycles for 100 * rep stosd up
825 kCycles for 100 * movaps xmm0 (down)
880 kCycles for 100 * movaps xmm0 (up)
1443 kCycles for 100 * movaps xmm0 (unrolled)
2354 kCycles for 100 * rep stosd
967 kCycles for 100 * HeapAlloc (*8)
877 kCycles for 100 * StackBuffer (with zeroing)
2300 kCycles for 100 * StackBuffer (unrolled)
846 kCycles for 100 * movaps xmm0
911 kCycles for 100 * rep stosd up
842 kCycles for 100 * movaps xmm0 (down)
883 kCycles for 100 * movaps xmm0 (up)
839 kCycles for 100 * movaps xmm0 (unrolled)
2948 kCycles for 100 * rep stosd
981 kCycles for 100 * HeapAlloc (*8)
845 kCycles for 100 * StackBuffer (with zeroing)
2286 kCycles for 100 * StackBuffer (unrolled)
865 kCycles for 100 * movaps xmm0
868 kCycles for 100 * rep stosd up
828 kCycles for 100 * movaps xmm0 (down)
844 kCycles for 100 * movaps xmm0 (up)
865 kCycles for 100 * movaps xmm0 (unrolled)
18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)
25 bytes for movaps xmm0
17 bytes for rep stosd up
36 bytes for movaps xmm0 (down)
36 bytes for movaps xmm0 (up)
85 bytes for movaps xmm0 (unrolled)
--- ok ---
The modifications were to move the "constant" initializations out of the REPEAT loops
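Hoisting the "constant" setup (the xorps/lea initializations) out of the .Repeat loops is classic loop-invariant hoisting; a small C illustration with a hypothetical function, not code from the thread:

```c
#include <assert.h>
#include <stddef.h>

/* Loop-invariant hoisting: anything that does not change per
   iteration is computed once, before the loop. */
static long sum_scaled(const long *v, size_t n, long a, long b)
{
    long scale = a * b;       /* hoisted: invariant across iterations */
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i] * scale;  /* instead of v[i] * a * b every pass */
    return sum;
}
```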
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 268/100 cycles
2404 kCycles for 100 * rep stosd
1003 kCycles for 100 * HeapAlloc (*8)
887 kCycles for 100 * StackBuffer (with zeroing)
905 kCycles for 100 * StackBuffer (unrolled)
831 kCycles for 100 * dedndave
872 kCycles for 100 * rep stosd up
2342 kCycles for 100 * rep stosd
982 kCycles for 100 * HeapAlloc (*8)
859 kCycles for 100 * StackBuffer (with zeroing)
835 kCycles for 100 * StackBuffer (unrolled)
927 kCycles for 100 * dedndave
897 kCycles for 100 * rep stosd up
2346 kCycles for 100 * rep stosd
965 kCycles for 100 * HeapAlloc (*8)
838 kCycles for 100 * StackBuffer (with zeroing)
835 kCycles for 100 * StackBuffer (unrolled)
828 kCycles for 100 * dedndave
873 kCycles for 100 * rep stosd up
18 bytes for rep stosd
103 bytes for HeapAlloc (*8)
34 bytes for StackBuffer (with zeroing)
44 bytes for StackBuffer (unrolled)
41 bytes for dedndave
17 bytes for rep stosd up
--- ok ---
... The dedndave algo is a clear winner, as expected. Almost ten times as fast as HeapAlloc is a good result :biggrin:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 253/100 cycles
5243 kCycles for 100 * rep stosd
4087 kCycles for 100 * HeapAlloc (*8 )
2834 kCycles for 100 * StackBuffer (with zeroing)
2834 kCycles for 100 * StackBuffer (unrolled)
2905 kCycles for 100 * dedndave
2861 kCycles for 100 * rep stosd up
5121 kCycles for 100 * rep stosd
3114 kCycles for 100 * HeapAlloc (*8 )
2861 kCycles for 100 * StackBuffer (with zeroing)
2842 kCycles for 100 * StackBuffer (unrolled)
2855 kCycles for 100 * dedndave
2873 kCycles for 100 * rep stosd up
5157 kCycles for 100 * rep stosd
3039 kCycles for 100 * HeapAlloc (*8 )
2893 kCycles for 100 * StackBuffer (with zeroing)
2867 kCycles for 100 * StackBuffer (unrolled)
2849 kCycles for 100 * dedndave
2845 kCycles for 100 * rep stosd up
shr ecx,2
.if !ZERO?
rep stosd
.endif
it's probably best to separate the 2 operations and optimize each
oh - and REP STOSD may still not be the fastest way to 0 the memory
It is, it is, at least for large buffers and for most CPUs - that's pretty obvious
we still have an issue to deal with, as far as timing tests:
once the stack space has been committed the first time, it's already committed on any successive pass
if you really want to know how many cycles it takes, you have to force the OS to "un-commit" before the next timing pass
try not to REP STOSD with ECX = 0 :lol:
See source:
For ecx=0, rep stosd does absolutely nothing...
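The same holds for a C fill: a zero-length `memset` on a valid pointer is a defined no-op, so no guard branch is needed around it (hypothetical helper, not code from the thread):

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/* Like REP STOSD with ECX = 0: a zero-length fill touches nothing. */
static void zero_dwords(unsigned *p, size_t count)
{
    memset(p, 0, count * sizeof *p); /* count == 0 does absolutely nothing */
}
```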
bufsize=102400
StackBuffer proc
local buffer[bufsize]:byte
; ... use the buffer ...
ret
StackBuffer endp
link /stack:102400,102400 ...
here is a thought... link /stack:102400,102400
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 193/100 cycles
512000 bytes:
12600 kCycles for 100 * rep stosd
4696 kCycles for 100 * HeapAlloc
4672 kCycles for 100 * StackBuffer (with zeroing)
4538 kCycles for 100 * dedndave
4593 kCycles for 100 * rep stosd up (no probing)
12567 kCycles for 100 * rep stosd
5334 kCycles for 100 * HeapAlloc
5236 kCycles for 100 * StackBuffer (with zeroing)
4937 kCycles for 100 * dedndave
4692 kCycles for 100 * rep stosd up (no probing)
12518 kCycles for 100 * rep stosd
4685 kCycles for 100 * HeapAlloc
4674 kCycles for 100 * StackBuffer (with zeroing)
5286 kCycles for 100 * dedndave
4666 kCycles for 100 * rep stosd up (no probing)
18 bytes for rep stosd
103 bytes for HeapAlloc
54 bytes for StackBuffer (with zeroing)
41 bytes for dedndave
17 bytes for rep stosd up (no probing)
--- ok ---
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 261/100 cycles
512000 bytes:
26444 kCycles for 100 * rep stosd
22020 kCycles for 100 * HeapAlloc
16433 kCycles for 100 * StackBuffer (with zeroing)
15254 kCycles for 100 * dedndave
15176 kCycles for 100 * rep stosd up (no probing)
26155 kCycles for 100 * rep stosd
17181 kCycles for 100 * HeapAlloc
16086 kCycles for 100 * StackBuffer (with zeroing)
15254 kCycles for 100 * dedndave
15979 kCycles for 100 * rep stosd up (no probing)
26160 kCycles for 100 * rep stosd
17103 kCycles for 100 * HeapAlloc
16196 kCycles for 100 * StackBuffer (with zeroing)
15333 kCycles for 100 * dedndave
15132 kCycles for 100 * rep stosd up (no probing)
--- ok ---
loop overhead is approx. 254/100 cycles
512000 bytes:
26153 kCycles for 100 * rep stosd
22074 kCycles for 100 * HeapAlloc
16154 kCycles for 100 * StackBuffer (with zeroing)
15852 kCycles for 100 * dedndave
15254 kCycles for 100 * rep stosd up (no probing)
26087 kCycles for 100 * rep stosd
16510 kCycles for 100 * HeapAlloc
16647 kCycles for 100 * StackBuffer (with zeroing)
15258 kCycles for 100 * dedndave
15187 kCycles for 100 * rep stosd up (no probing)
26145 kCycles for 100 * rep stosd
16325 kCycles for 100 * HeapAlloc
16303 kCycles for 100 * StackBuffer (with zeroing)
15257 kCycles for 100 * dedndave
15032 kCycles for 100 * rep stosd up (no probing)
A valid thought but
a) it requires more discipline with linker settings, environment variables etc
b) that local buffer is still full of garbage and
c) it is not aligned for use with SSE2
;)
bufsize=102400
StackBuffer proc
local buf[bufsize+16]:byte
local buffer:dword
lea eax,buf
and al,0F0h
add eax,16
mov buffer,eax
mov edi,eax
sub eax,eax
mov ecx,bufsize/4 ; dword count (bufsize is in bytes)
rep stosd
; ... use the buffer ...
ret
StackBuffer endp
if (%1) == () goto probe
link /stack:%1,%1 ...
goto end
:probe
makeit 102416
:end
new_stack proc stklen:dword
mov eax,esp
mov edx,eax
mov ecx,stklen
sub eax,ecx
ASSUME FS:NOTHING
.if eax < fs:[8]
shr ecx,2
.repeat
push eax
.untilcxz
.endif
ASSUME FS:ERROR
mov esp,edx
ret
new_stack endp
start:
invoke new_stack,bufsize
...
If you plan on calling this macro frequently it may be better to set the stack one time
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 427/100 cycles
512000 bytes:
34190 kCycles for 100 * rep stosd
512000 bytes:
28714 kCycles for 100 * HeapAlloc
519168 bytes:
23241 kCycles for 100 * HeapAlloc
520192 bytes:
2671 kCycles for 100 * HeapAlloc
34132 kCycles for 100 * rep stosd
512000 bytes:
23103 kCycles for 100 * HeapAlloc
519168 bytes:
23419 kCycles for 100 * HeapAlloc
520192 bytes:
2635 kCycles for 100 * HeapAlloc
see if you can test a single pass
It's quite happy to be in that loop for everything below 500k*1.01 and above 500k*1.016... and switching to e.g. 5 loops doesn't change the pattern. Weird.
HeapAlloc may not like being in a x100 loop :P
If you plan on calling this macro frequently it may be better to set the stack one time
No need for doing that. Dave's fs:[8] loop is extremely clever - if the stack is already committed, it costs just 1 or 2 cycles.
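Why the fs:[8] loop is cheap on a committed stack can be modeled in C. This is a conceptual sketch, not the Windows API: fs:[8] holds the TEB StackLimit (the lowest committed stack address), and each push below it commits one more guard page; `pages_probed` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

enum { PAGE = 4096 }; /* x86 page size assumed */

/* If target is already above the limit, the loop body never runs -
   which is why re-probing an already committed stack costs almost
   nothing. Otherwise one "push" commits one page per iteration. */
static unsigned pages_probed(uintptr_t stack_limit, uintptr_t target)
{
    unsigned touched = 0;
    while (target < stack_limit) { /* the stack grows downward */
        stack_limit -= PAGE;       /* guard page committed, limit drops */
        touched++;
    }
    return touched;
}
```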
11629 Clock cycles per page
...However, once the stack is committed (one way or the other) it will be available
as a substitute for HeapAlloc(), and that will save both code space and cycles.
it seems that, once the space has been committed, it stays committed
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 190/100 cycles
2048 bytes:
103705 cycles for 100 * HeapAlloc
28460 cycles for 100 * StackBuffer (xmm)
48535 cycles for 100 * StackBuffer (rep stosd)
47423 cycles for 100 * dedndave
103956 cycles for 100 * HeapAlloc
28460 cycles for 100 * StackBuffer (xmm)
48544 cycles for 100 * StackBuffer (rep stosd)
47424 cycles for 100 * dedndave
8192 bytes:
185329 cycles for 100 * HeapAlloc
105525 cycles for 100 * StackBuffer (xmm)
128172 cycles for 100 * StackBuffer (rep stosd)
127549 cycles for 100 * dedndave
184025 cycles for 100 * HeapAlloc
105314 cycles for 100 * StackBuffer (xmm)
128170 cycles for 100 * StackBuffer (rep stosd)
127050 cycles for 100 * dedndave
32768 bytes:
547 kCycles for 100 * HeapAlloc
438 kCycles for 100 * StackBuffer (xmm)
440 kCycles for 100 * StackBuffer (rep stosd)
439 kCycles for 100 * dedndave
548 kCycles for 100 * HeapAlloc
438 kCycles for 100 * StackBuffer (xmm)
444 kCycles for 100 * StackBuffer (rep stosd)
437 kCycles for 100 * dedndave
131072 bytes:
2810 kCycles for 100 * HeapAlloc
2808 kCycles for 100 * StackBuffer (xmm)
2230 kCycles for 100 * StackBuffer (rep stosd)
2222 kCycles for 100 * dedndave
2319 kCycles for 100 * HeapAlloc
2808 kCycles for 100 * StackBuffer (xmm)
2225 kCycles for 100 * StackBuffer (rep stosd)
2224 kCycles for 100 * dedndave
524288 bytes:
742 kCycles for 100 * HeapAlloc
12067 kCycles for 100 * StackBuffer (xmm)
8928 kCycles for 100 * StackBuffer (rep stosd)
8977 kCycles for 100 * dedndave
751 kCycles for 100 * HeapAlloc
12305 kCycles for 100 * StackBuffer (xmm)
8921 kCycles for 100 * StackBuffer (rep stosd)
8920 kCycles for 100 * dedndave
104 bytes for HeapAlloc
9 bytes for StackBuffer (xmm)
8 bytes for StackBuffer (rep stosd)
42 bytes for dedndave
33 bytes for MbStackB
16 bytes for MbStackX
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 246/100 cycles
2048 bytes:
230072 cycles for 100 * HeapAlloc
59433 cycles for 100 * StackBuffer (xmm)
71350 cycles for 100 * StackBuffer (rep stosd)
71069 cycles for 100 * dedndave
229937 cycles for 100 * HeapAlloc
60252 cycles for 100 * StackBuffer (xmm)
71518 cycles for 100 * StackBuffer (rep stosd)
70532 cycles for 100 * dedndave
8192 bytes:
447586 cycles for 100 * HeapAlloc
233879 cycles for 100 * StackBuffer (xmm)
245398 cycles for 100 * StackBuffer (rep stosd)
245187 cycles for 100 * dedndave
447288 cycles for 100 * HeapAlloc
234628 cycles for 100 * StackBuffer (xmm)
246152 cycles for 100 * StackBuffer (rep stosd)
244349 cycles for 100 * dedndave
32768 bytes:
1304 kCycles for 100 * HeapAlloc
945 kCycles for 100 * StackBuffer (xmm)
925 kCycles for 100 * StackBuffer (rep stosd)
948 kCycles for 100 * dedndave
1274 kCycles for 100 * HeapAlloc
913 kCycles for 100 * StackBuffer (xmm)
932 kCycles for 100 * StackBuffer (rep stosd)
924 kCycles for 100 * dedndave
131072 bytes:
5997 kCycles for 100 * HeapAlloc
3639 kCycles for 100 * StackBuffer (xmm)
3682 kCycles for 100 * StackBuffer (rep stosd)
3704 kCycles for 100 * dedndave
4671 kCycles for 100 * HeapAlloc
3663 kCycles for 100 * StackBuffer (xmm)
3654 kCycles for 100 * StackBuffer (rep stosd)
3651 kCycles for 100 * dedndave
524288 bytes:
2084 kCycles for 100 * HeapAlloc
14688 kCycles for 100 * StackBuffer (xmm)
14708 kCycles for 100 * StackBuffer (rep stosd)
14847 kCycles for 100 * dedndave
2091 kCycles for 100 * HeapAlloc
14649 kCycles for 100 * StackBuffer (xmm)
14950 kCycles for 100 * StackBuffer (rep stosd)
16294 kCycles for 100 * dedndave
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+19 of 20 tests valid, loop overhead is approx. 239/100 cycles
2048 bytes:
71208 cycles for 100 * HeapAlloc
20078 cycles for 100 * StackBuffer (xmm)
40880 cycles for 100 * StackBuffer (rep stosd)
40775 cycles for 100 * dedndave
68901 cycles for 100 * HeapAlloc
47258 cycles for 100 * StackBuffer (xmm)
40798 cycles for 100 * StackBuffer (rep stosd)
40362 cycles for 100 * dedndave
8192 bytes:
111836 cycles for 100 * HeapAlloc
129559 cycles for 100 * StackBuffer (xmm)
122825 cycles for 100 * StackBuffer (rep stosd)
122018 cycles for 100 * dedndave
104378 cycles for 100 * HeapAlloc
55644 cycles for 100 * StackBuffer (xmm)
122728 cycles for 100 * StackBuffer (rep stosd)
122127 cycles for 100 * dedndave
32768 bytes:
262658 cycles for 100 * HeapAlloc
209534 cycles for 100 * StackBuffer (xmm)
196808 cycles for 100 * StackBuffer (rep stosd)
187779 cycles for 100 * dedndave
604 kCycles for 100 * HeapAlloc
477 kCycles for 100 * StackBuffer (xmm)
445 kCycles for 100 * StackBuffer (rep stosd)
443 kCycles for 100 * dedndave
131072 bytes:
1203 kCycles for 100 * HeapAlloc
1090 kCycles for 100 * StackBuffer (xmm)
1178 kCycles for 100 * StackBuffer (rep stosd)
1197 kCycles for 100 * dedndave
1843 kCycles for 100 * HeapAlloc
1108 kCycles for 100 * StackBuffer (xmm)
1159 kCycles for 100 * StackBuffer (rep stosd)
1195 kCycles for 100 * dedndave
524288 bytes:
586 kCycles for 100 * HeapAlloc
6736 kCycles for 100 * StackBuffer (xmm)
5396 kCycles for 100 * StackBuffer (rep stosd)
4757 kCycles for 100 * dedndave
591 kCycles for 100 * HeapAlloc
6740 kCycles for 100 * StackBuffer (xmm)
5389 kCycles for 100 * StackBuffer (rep stosd)
4787 kCycles for 100 * dedndave
104 bytes for HeapAlloc
9 bytes for StackBuffer (xmm)
8 bytes for StackBuffer (rep stosd)
42 bytes for dedndave
33 bytes for MbStackB
16 bytes for MbStackX
--- ok ---
The new StackBuffer() is now implemented in MasmBasic of 30 Oct :t