Zero a stack buffer (and probe it)

jj2007 · October 26, 2013, 11:20:33 PM

Quote from: dedndave on October 26, 2013, 11:05:51 PM
can't beat REP STOSD for simplicity :P

But for the fast "rep stosd up" you need to write an SEH handler, which makes it slightly more complicated again.

dedndave · October 26, 2013, 11:24:40 PM

why SEH ?
i posted code that does STOSD up with no SEH
but, you haven't incorporated it

a little update...

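    ASSUME  FS:Nothing

    mov     edx,edi                 ;save EDI
    mov     edi,esp
    mov     ecx,esp                 ;ECX = original ESP (top of the buffer)
    sub     edi,<NumberOfBytesRequiredPlus3Mod4>    ;EDI = bottom of the buffer
    .repeat
        push    eax                 ;touch the page just below ESP
        mov     esp,fs:[8]          ;ESP = TEB.StackLimit
    .until edi>=esp                 ;loop until the limit has reached EDI
    sub     ecx,edi                 ;ECX = buffer size in bytes
    shr     ecx,2                   ;ECX = buffer size in dwords
    xor     eax,eax
    mov     esp,edi                 ;ESP = bottom of the buffer
    rep     stosd                   ;zero the buffer, going up
    mov     edi,edx                 ;restore EDI

    ASSUME  FS:ERROR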

jj2007 · October 26, 2013, 11:33:48 PM

Quote from: dedndave on October 26, 2013, 11:24:40 PM
but, you haven't incorporated it

I've tried to, but it crashes ::)
Set useE=1 in the source...

dedndave · October 26, 2013, 11:52:07 PM

if it crashes, there must be a simple reason - lol
how much memory are you trying to allocate ?

try the attached test code...

dedndave · October 26, 2013, 11:55:02 PM

virgin 2d prescott w/htt

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 254/100 cycles

5184    kCycles for 100 * rep stosd
4122    kCycles for 100 * HeapAlloc (*8)
2853    kCycles for 100 * StackBuffer (with zeroing)
2868    kCycles for 100 * StackBuffer (unrolled)
2899    kCycles for 100 * rep stosd up

5116    kCycles for 100 * rep stosd
3073    kCycles for 100 * HeapAlloc (*8)
2839    kCycles for 100 * StackBuffer (with zeroing)
2849    kCycles for 100 * StackBuffer (unrolled)
2862    kCycles for 100 * rep stosd up

5161    kCycles for 100 * rep stosd
3080    kCycles for 100 * HeapAlloc (*8)
2843    kCycles for 100 * StackBuffer (with zeroing)
2873    kCycles for 100 * StackBuffer (unrolled)
2848    kCycles for 100 * rep stosd up

Gunther · October 27, 2013, 12:01:55 AM

StackBuffer2d results:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 193/100 cycles

2341    kCycles for 100 * rep stosd
987     kCycles for 100 * HeapAlloc (*8)
839     kCycles for 100 * StackBuffer (with zeroing)
830     kCycles for 100 * StackBuffer (unrolled)
921     kCycles for 100 * rep stosd up

2345    kCycles for 100 * rep stosd
985     kCycles for 100 * HeapAlloc (*8)
829     kCycles for 100 * StackBuffer (with zeroing)
867     kCycles for 100 * StackBuffer (unrolled)
872     kCycles for 100 * rep stosd up

2339    kCycles for 100 * rep stosd
989     kCycles for 100 * HeapAlloc (*8)
850     kCycles for 100 * StackBuffer (with zeroing)
906     kCycles for 100 * StackBuffer (unrolled)
875     kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
17      bytes for rep stosd up

--- ok ---

Dave,

Your ProbeTest works fine under 64 bit.

Gunther

dedndave · October 27, 2013, 12:53:50 AM

thanks Gunther - whew !

only thing i can think of is Jochen is trying to allocate more than is reserved
or - perhaps he has some other flaw that makes the try/catch thing necessary
(he must have been a C programmer in a previous life :P )

but - i think there is a major flaw in the idea of speed-tests for probing code
once you have committed that memory, it remains committed until you release it and the OS allocates it elsewhere
to overcome this, you might try HeapAlloc
if the OS needs that space for the heap, it should "reset" the amount committed

i don't think altering the value at FS:[8] is a good idea - lol
sounds like a memory leak waiting to happen

dedndave · October 27, 2013, 01:19:38 AM

ok
the default reserve is supposed to be 1 MB = 1,048,576 (100000h)
i can only allocate up to 1,032,192 (0FC000h) without a crash

that must be why Jochen is having to use SEH
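
A quick check of those two figures, as a sketch only. The 4096-byte page size is an assumption (the usual value on x86 Windows, but not stated in the post), and the helper name is made up:

```python
PAGE_SIZE = 4096  # assumed x86 page size, not taken from the post

def reserve_gap(reserve, max_alloc, page_size=PAGE_SIZE):
    """Return (bytes, whole pages) between the reserve and the largest working allocation."""
    gap = reserve - max_alloc
    return gap, gap // page_size

if __name__ == "__main__":
    gap, pages = reserve_gap(0x100000, 0xFC000)
    print(f"{gap} bytes ({gap:#x}) = {pages} pages")   # 16384 bytes (0x4000) = 4 pages
```

So about four pages of the 1 MB reserve are not available to the buffer. Presumably part of that is stack already in use plus the guard area, but the post does not break it down.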

jj2007 · October 27, 2013, 04:38:46 AM

Quote from: dedndave on October 27, 2013, 12:53:50 AM
only thing i can think of is Jochen is trying to allocate more than is reserved
or - perhaps he has some other flaw that makes the try/catch thing necessary

Dave,

bufsize is 102400 bytes, no big deal. The Try/Catch thing would be needed for the "rep stosd up" algo, simply because it doesn't probe the stack.

Here is your code embedded in the testbed. It doesn't crash any more, but 2 kCycles is a bit fast... Some more comments would be nice, or maybe I am just too tired to understand it :(

TestE proc
mov ebx, AlgoLoops-1   ; loop e.g. 100x
mov esi, esp ; check the stack
align 4
.Repeat
   mov edx, edi
   mov edi, esp
   mov ecx, esp
   sub edi, (bufsize+3) MOD 4      ;<NumberOfBytesRequiredPlus3Mod4>
   .repeat
      push eax
      ASSUME FS:Nothing
      mov esp, fs:[8]
      ASSUME FS:ERROR
   .until edi>=esp
   sub ecx, edi
   shr ecx, 2
   xor eax, eax
   mov esp, edi
   rep stosd
   mov edi, edx
   add esp, (bufsize+3) MOD 4   ; restore stack
   dec ebx
.Until Sign?
sub esi, esp
.if !Zero? ; OK
   print str$(esi), " STACKDIFF"
   exit
.endif
ret
TestE endp
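
A quick check of the size expression used in TestE, as a sketch (the helper names are hypothetical; bufsize is the 102400 mentioned in this post). In MASM, `MOD` yields the remainder, so `(bufsize+3) MOD 4` is a number from 0 to 3 rather than the size rounded up to a multiple of 4. The rounding form would be `(bufsize+3) AND -4`:

```python
def size_with_mod(bufsize):
    """What `(bufsize+3) MOD 4` evaluates to: just the remainder."""
    return (bufsize + 3) % 4

def size_rounded_up(bufsize):
    """What `(bufsize+3) AND -4` evaluates to: bufsize rounded up to a multiple of 4."""
    return (bufsize + 3) & -4

def dwords_zeroed(nbytes):
    """Count left in ECX after `sub ecx, edi` / `shr ecx, 2`."""
    return nbytes >> 2

if __name__ == "__main__":
    bufsize = 102400
    for f in (size_with_mod, size_rounded_up):
        n = f(bufsize)
        print(f.__name__, n, dwords_zeroed(n))
    # size_with_mod 3 0
    # size_rounded_up 102400 25600
```

With the MOD form only 3 bytes are reserved and `rep stosd` runs with ECX = 0, which would be consistent with the very low cycle count.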

dedndave · October 27, 2013, 10:58:16 AM

it's not too bad
i probe down the stack by using the TEB.StackLimit value from FS:[8]
then, i use REP STOSD to clear it out

the probe part was discussed at length...
http://masm32.com/board/index.php?topic=1363
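
The probe loop can be modelled in a few lines, as a rough sketch only. It assumes 4096-byte pages and that each write into the guard page makes the OS commit that page and lower TEB.StackLimit by exactly one page. The addresses in the example are invented:

```python
PAGE = 4096  # assumed page size

def probe(esp, stack_limit, nbytes, page=PAGE):
    """Model of the `push eax` / `mov esp, fs:[8]` loop.

    Returns (final_stack_limit, number_of_pushes).
    """
    target = esp - nbytes          # EDI: where the new stack pointer should end up
    pushes = 0
    while True:
        pushes += 1
        if esp - 4 < stack_limit:  # push eax lands in the guard page
            stack_limit -= page    # assumed OS reaction: commit one more page
        esp = stack_limit          # mov esp, fs:[8]
        if target >= esp:          # .until edi>=esp
            return stack_limit, pushes

if __name__ == "__main__":
    limit, pushes = probe(esp=0x12FF00, stack_limit=0x12D000, nbytes=102400)
    print(hex(limit), pushes)      # 0x116000 24
```

In this model the loop touches one page at a time, so the guard page is never skipped. That would explain why no exception handler is needed, at least while the request stays inside the reserved region.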

jj2007 · October 27, 2013, 12:03:16 PM

Thanks, Dave - I had not seen that thread. Now it's clearer...

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

4710 kCycles for 100 * rep stosd
2220 kCycles for 100 * HeapAlloc (*8)
2193 kCycles for 100 * StackBuffer (with zeroing)
2192 kCycles for 100 * StackBuffer (unrolled)
4697 kCycles for 100 * dedndave
1738 kCycles for 100 * rep stosd up

This is for slightly modified code, taking account of the need to save & restore the old stack:

.Repeat
   mov edx, edi                 ; save edi
   mov edi, esp
   mov eax, esp                 ; save old stack
   sub edi, (bufsize+3+4)       ;<NumberOfBytesRequiredPlus3Mod4>
   and edi, -4                  ; aligns new stack
   .repeat
      push eax                  ; tickle the guard page
      ASSUME FS:Nothing
      mov esp, fs:[8]           ; limit might be 4k lower now
      ASSUME FS:ERROR
   .until edi>=esp              ; loop until we've got enough
   mov esp, edi                 ; new stack
   stosd                        ; save old stack to [edi]
   xchg eax, ecx
   push edi                     ; retval for macro
   sub ecx, edi
   shr ecx, 2
   xor eax, eax
   rep stosd
   pop eax                      ; retval for macro
   mov edi, edx                 ; restore edi
   ; ... code that uses buffer...
   pop esp                      ; restore stack
   dec ebx
.Until Sign?

I hope I didn't misunderstand anything - for some time I was thoroughly confused by your NumberOfBytesRequiredPlus3Mod4 ::)

dedndave · October 27, 2013, 12:10:08 PM

sorry for the confusion - it's just a number that is mod4=0
it could be an immediate - or a value calculated in EAX

as for restoring the stack.....

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

MyProc PROC parm1:DWORD

        push    ebx
        push    esi
        push    edi            ;push/pops on EBX ESI EDI are optional, of course

        push    ebp
        mov     ebp,esp        ;set up the frame pointer that LEAVE restores ESP from

;stack probe code here

;stack clear code here

;use stack space, as required

        leave

        pop     edi
        pop     esi
        pop     ebx
        ret     4

MyProc ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

jj2007 · October 27, 2013, 12:34:09 PM

Quote from: dedndave on October 27, 2013, 12:10:08 PM
as for restoring the stack..... leave

I've tried that, but it crashes. If you have working code, please insert it into the source.

Anyway, speed-wise it doesn't look so convincing. By the way, the forum software translates *8 into a smiley - HeapAlloc is actually tested with one eighth of the buffer size, because it's so slow :(

dedndave · October 27, 2013, 01:46:48 PM

give this a try, my friend
i am anxious to see if it crashes on you :P

it should display the allocation size (F0000), then 0 (cleared OR test result)

i commented it heavily, just for you

KeepingRealBusy · October 27, 2013, 02:04:42 PM

Quote from: jj2007 on October 26, 2013, 04:19:12 PM
Quote from: KeepingRealBusy on October 26, 2013, 01:42:11 PM
A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?

Why not ;-)

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

4725 kCycles for 100 * rep stosd
2683 kCycles for 100 * HeapAlloc (*8)
2202 kCycles for 100 * StackBuffer (with zeroing)
2887 kCycles for 100 * StackBuffer (unrolled)
2207 kCycles for 100 * movaps xmm0
1746 kCycles for 100 * rep stosd up

29 bytes for StackBuffer (with zeroing)
48 bytes for StackBuffer (unrolled)

Jochen,

Using this version, I made some changes. The original movaps test was TestE. I made an unrolled version as TestI, then used similar code to modify TestE and saved the results as TestG and TestH. The modifications move the "constant" initializations out of the REPEAT loops so that they execute once at the beginning of the test (before the REPEATs). The following are the .lst sections for TestE, TestG, TestH, and TestI (just to check alignments):

				align 16
 00000190			TestE_s:
 = movaps xmm0			NameE equ movaps xmm0	; assign a descriptive name here
 00000190			TestE proc
 00000190  BB 00000063		  mov ebx, AlgoLoops-1	; loop e.g. 100x
				  align 4
				  .Repeat
 00000198		   *@C0011:
 00000198  8B CC			mov ecx, esp
 0000019A  8D 84 24			lea eax, [esp-bufsize]
	     FFFE7000
 000001A1  83 E4 F0			and esp, -16		;  needs a reg or local to store original esp
 000001A4  0F 57 C0			xorps xmm0, xmm0
				; 	align 4
					.Repeat
 000001A7		   *@C0012:
 000001A7  83 EC 10				sub esp, OWORD
 000001AA  0F 29 04 24				movaps OWORD ptr [esp], xmm0	; movaps <1% faster on AMD
					.Until esp<=eax
 000001AE  3B E0	   *	    cmp    esp, eax
 000001B0  77 F5	   *	    ja	@C0012
 000001B2  8B E1			mov esp, ecx
					; add esp, bufsize
 000001B4  4B				dec ebx
				  .Until Sign?
 000001B5  79 E1	   *	    jns    @C0011
 000001B7  C3			  ret
 000001B8			TestE endp
 000001B8			TestE_endp:

				align 16
 000001E0			TestG_s:
 = movaps xmm0 (down)		NameG equ movaps xmm0 (down)	; assign a descriptive name here
 000001E0			TestG proc
 000001E0  BB 00000063		  mov ebx, AlgoLoops-1	; loop e.g. 100x
 000001E5  8B CC		  mov ecx, esp
 000001E7  8B F4		  mov esi, esp
 000001E9  BA FFFFFFF0		  mov edx, -OWORD
 000001EE  83 E6 F0		  and esi, -16
 000001F1  0F 57 C0		  xorps xmm0, xmm0
 000001F4  8D 86 FFFE7000	  lea eax, [esi-bufsize]
				  align 16
				  .Repeat
 00000200		   *@C0017:
 00000200  8B E6		        mov esp, esi
					.Repeat
 00000202		   *@C0018:
 00000202  8D 24 14		                lea esp,[esp+edx]
 00000205  0F 29 04 24				movaps OWORD ptr [esp], xmm0	; movaps <1% faster on AMD
					.Until esp==eax
 00000209  3B E0	   *	    cmp    esp, eax
 0000020B  75 F5	   *	    jne    @C0018
 0000020D  4B				dec ebx
				  .Until Sign?
 0000020E  79 F0	   *	    jns    @C0017
 00000210  8B E1		  mov esp, ecx
 00000212  C3			  ret
 00000213			TestG endp
 00000213			TestG_endp:

				align 16
 00000220			TestH_s:
 = movaps xmm0 (up)		NameH equ movaps xmm0 (up)	; assign a descriptive name here
 00000220			TestH proc
 00000220  BB 00000063		  mov ebx, AlgoLoops-1	; loop e.g. 100x
 00000225  8B CC		  mov ecx, esp
 00000227  8D B4 24		  lea esi, [esp-bufsize]
	     FFFE7000
 0000022E  BA 00000010		  mov edx, OWORD
 00000233  83 E6 F0		  and esi, -16
 00000236  0F 57 C0		  xorps xmm0, xmm0
 00000239  8D 86 00019000	  lea eax, [esi+bufsize]
				  align 16
				  .Repeat
 00000240		   *@C001B:
 00000240  8B E6		        mov esp, esi
					.Repeat
 00000242		   *@C001C:
 00000242  0F 29 04 24				movaps OWORD ptr [esp+(0*OWORD)], xmm0	; movaps <1% faster on AMD
 00000246  8D 24 14		                lea esp,[esp+edx]
					.Until esp==eax
 00000249  3B E0	   *	    cmp    esp, eax
 0000024B  75 F5	   *	    jne    @C001C
 0000024D  4B				dec ebx
				  .Until Sign?
 0000024E  79 F0	   *	    jns    @C001B
 00000250  8B E1		  mov esp, ecx
 00000252  C3			  ret
 00000253			TestH endp
 00000253			TestH_endp:

				align 16
 00000260			TestI_s:
 = movaps xmm0 (unrolled)	NameI equ movaps xmm0 (unrolled)	; assign a descriptive name here
 00000260			TestI proc
 00000260  BB 00000063		  mov ebx, AlgoLoops-1	; loop e.g. 100x
 00000265  8B CC		  mov ecx, esp
 00000267  8D B4 24		  lea esi, [esp-bufsize]
	     FFFE7000
 0000026E  BA 00000080		  mov edx, (8*OWORD)
 00000273  83 E6 F0		  and esi, -16
 00000276  0F 57 C0		  xorps xmm0, xmm0
 00000279  8D 86 00019000	  lea eax, [esi+bufsize]
				  .Repeat
 0000027F		   *@C001F:
 0000027F  8B E6		        mov esp, esi
				  align 16
					.Repeat
 00000290		   *@C0020:
 00000290  0F 29 04 24				movaps OWORD ptr [esp+(0*OWORD)], xmm0	; movaps <1% faster on AMD
 00000294  0F 29 44 24 10			movaps OWORD ptr [esp+(1*OWORD)], xmm0	; movaps <1% faster on AMD
 00000299  0F 29 44 24 20			movaps OWORD ptr [esp+(2*OWORD)], xmm0	; movaps <1% faster on AMD
 0000029E  0F 29 44 24 30			movaps OWORD ptr [esp+(3*OWORD)], xmm0	; movaps <1% faster on AMD
 000002A3  0F 29 44 24 40			movaps OWORD ptr [esp+(4*OWORD)], xmm0	; movaps <1% faster on AMD
 000002A8  0F 29 44 24 50			movaps OWORD ptr [esp+(5*OWORD)], xmm0	; movaps <1% faster on AMD
 000002AD  0F 29 44 24 60			movaps OWORD ptr [esp+(6*OWORD)], xmm0	; movaps <1% faster on AMD
 000002B2  0F 29 44 24 70			movaps OWORD ptr [esp+(7*OWORD)], xmm0	; movaps <1% faster on AMD
 000002B7  8D 24 14		                lea esp,[esp+edx]
					.Until esp==eax
 000002BA  3B E0	   *	    cmp    esp, eax
 000002BC  75 D2	   *	    jne    @C0020
					; add esp, bufsize
 000002BE  4B				dec ebx
				  .Until Sign?
 000002BF  79 BE	   *	    jns    @C001F
 000002C1  8B E1		  mov esp, ecx
 000002C3  C3			  ret
 000002C4			TestI endp
 000002C4			TestI_endp:
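
Because TestG, TestH and TestI end their inner loops with `.Until esp==eax` rather than a range test, they rely on the buffer size being an exact multiple of the step in EDX. A small sketch of that check (the function name is made up; 102400 is the bufsize given earlier in the thread, which matches the FFFE7000 / 00019000 displacements in the listing):

```python
def inner_iterations(bufsize, stride):
    """Number of inner-loop passes; an equality exit test needs an exact multiple."""
    if bufsize % stride:
        raise ValueError("esp would step past eax without ever being equal to it")
    return bufsize // stride

if __name__ == "__main__":
    OWORD = 16
    print(inner_iterations(102400, OWORD))       # 6400
    print(inner_iterations(102400, 8 * OWORD))   # 800
```

For this bufsize both strides divide evenly, so the equality test is safe here. A size that is not a multiple of 128 would need a `<=` or `>=` style comparison in the unrolled version.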

The following are my executions:

AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
loop overhead is approx. 433/100 cycles

5229    kCycles for 100 * rep stosd
3627    kCycles for 100 * HeapAlloc (*8)
3274    kCycles for 100 * StackBuffer (with zeroing)
3278    kCycles for 100 * StackBuffer (unrolled)
3193    kCycles for 100 * movaps xmm0
3118    kCycles for 100 * rep stosd up
2798    kCycles for 100 * movaps xmm0 (down)
2974    kCycles for 100 * movaps xmm0 (up)
2895    kCycles for 100 * movaps xmm0 (unrolled)

3573    kCycles for 100 * rep stosd
2709    kCycles for 100 * HeapAlloc (*8)
2458    kCycles for 100 * StackBuffer (with zeroing)
2481    kCycles for 100 * StackBuffer (unrolled)
2426    kCycles for 100 * movaps xmm0
2218    kCycles for 100 * rep stosd up
2086    kCycles for 100 * movaps xmm0 (down)
2329    kCycles for 100 * movaps xmm0 (up)
2273    kCycles for 100 * movaps xmm0 (unrolled)

2244    kCycles for 100 * rep stosd
1512    kCycles for 100 * HeapAlloc (*8)
1422    kCycles for 100 * StackBuffer (with zeroing)
1403    kCycles for 100 * StackBuffer (unrolled)
1546    kCycles for 100 * movaps xmm0
1448    kCycles for 100 * rep stosd up
1424    kCycles for 100 * movaps xmm0 (down)
1561    kCycles for 100 * movaps xmm0 (up)
1502    kCycles for 100 * movaps xmm0 (unrolled)

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
25      bytes for movaps xmm0
17      bytes for rep stosd up
36      bytes for movaps xmm0 (down)
36      bytes for movaps xmm0 (up)
85      bytes for movaps xmm0 (unrolled)

--- ok ---

The times are interesting. I have attached a zip of my .asm and .exe files.

Dave.
