Zero a stack buffer (and probe it)

Started by jj2007, October 25, 2013, 07:31:54 PM

jj2007

Quote from: dedndave on October 26, 2013, 11:05:51 PM
can't beat REP STOSD for simplicity   :P

But for the fast "rep stosd up" you need to write an SEH handler, which makes it slightly more complicated again :icon_mrgreen:

dedndave

why SEH ?
i posted code that does STOSD up with no SEH
but, you haven't incorporated it

a little update...
    ASSUME  FS:Nothing

    mov     edx,edi
    mov     edi,esp
    mov     ecx,esp
    sub     edi,<NumberOfBytesRequiredPlus3Mod4>
    .repeat
        push    eax
        mov     esp,fs:[8]
    .until edi>=esp
    sub     ecx,edi
    shr     ecx,2
    xor     eax,eax
    mov     esp,edi
    rep     stosd
    mov     edi,edx

    ASSUME  FS:ERROR

jj2007

Quote from: dedndave on October 26, 2013, 11:24:40 PM
but, you haven't incorporated it

I've tried to but it crashes ::)
Set useE=1 in the source...

dedndave

if it crashes, there must be a simple reason - lol
how much memory are you trying to allocate ?

try the attached test code...

dedndave

virgin 2d prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 254/100 cycles

5184    kCycles for 100 * rep stosd
4122    kCycles for 100 * HeapAlloc (*8)
2853    kCycles for 100 * StackBuffer (with zeroing)
2868    kCycles for 100 * StackBuffer (unrolled)
2899    kCycles for 100 * rep stosd up

5116    kCycles for 100 * rep stosd
3073    kCycles for 100 * HeapAlloc (*8)
2839    kCycles for 100 * StackBuffer (with zeroing)
2849    kCycles for 100 * StackBuffer (unrolled)
2862    kCycles for 100 * rep stosd up

5161    kCycles for 100 * rep stosd
3080    kCycles for 100 * HeapAlloc (*8)
2843    kCycles for 100 * StackBuffer (with zeroing)
2873    kCycles for 100 * StackBuffer (unrolled)
2848    kCycles for 100 * rep stosd up

Gunther

StackBuffer2d results:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 193/100 cycles

2341    kCycles for 100 * rep stosd
987     kCycles for 100 * HeapAlloc (*8)
839     kCycles for 100 * StackBuffer (with zeroing)
830     kCycles for 100 * StackBuffer (unrolled)
921     kCycles for 100 * rep stosd up

2345    kCycles for 100 * rep stosd
985     kCycles for 100 * HeapAlloc (*8)
829     kCycles for 100 * StackBuffer (with zeroing)
867     kCycles for 100 * StackBuffer (unrolled)
872     kCycles for 100 * rep stosd up

2339    kCycles for 100 * rep stosd
989     kCycles for 100 * HeapAlloc (*8)
850     kCycles for 100 * StackBuffer (with zeroing)
906     kCycles for 100 * StackBuffer (unrolled)
875     kCycles for 100 * rep stosd up

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
34      bytes for StackBuffer (with zeroing)
44      bytes for StackBuffer (unrolled)
17      bytes for rep stosd up

--- ok ---


Dave,

your ProbeTest works fine under 64 bit.

Gunther
You have to know the facts before you can distort them.

dedndave

thanks Gunther - whew !

only thing i can think of is Jochen is trying to allocate more than is reserved
or - perhaps he has some other flaw that makes the try/catch thing necessary
(he must have been a C programmer in a previous life  :P )

but - i think there is a major flaw in the idea of speed-tests for probing code
once you have committed that memory, it remains committed until you release it and the OS allocates it elsewhere
to overcome this, you might try HeapAlloc
if the OS needs that space for the heap, it should "reset" the amount committed

i don't think altering the value at FS:[8] is a good idea - lol
sounds like a memory leak waiting to happen

dedndave

ok
the default reserve is supposed to be 1 MB = 1,048,576 (100000h)
i can only allocate up to 1,032,192 (0FC000h) without a crash

that must be why Jochen is having to use SEH
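
The gap between those two figures can be checked by hand. A small Python sketch of the arithmetic only; the 4 KB page size is an assumption (it matches the "4k" guard-page step mentioned elsewhere in the thread), and the variable names are made up for illustration:

```python
# Arithmetic only: how much of the default reserve could not be allocated.
# PAGE_SIZE = 4096 is an assumption (typical x86 Windows page size).
PAGE_SIZE = 4096

reserve = 0x100000      # 1,048,576 bytes, the default reserve quoted above
max_alloc = 0x0FC000    # 1,032,192 bytes, the largest size that did not crash

shortfall = reserve - max_alloc
pages_short = shortfall // PAGE_SIZE

print(f"shortfall: {shortfall} bytes ({shortfall:#x}), {pages_short} pages")
```

That comes to 16 KB, or four pages under the 4 KB assumption. The thread does not say what occupies them; stack already in use when the test starts and the guard area at the bottom of the reserve are plausible candidates, but that is a guess.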

jj2007

Quote from: dedndave on October 27, 2013, 12:53:50 AM
only thing i can think of is Jochen is trying to allocate more than is reserved
or - perhaps he has some other flaw that makes the try/catch thing necessary

Dave,

bufsize is 102400 bytes, no big deal. The Try/Catch thing would be needed for the "rep stosd up" algo, simply because it doesn't probe the stack.

Here is your code embedded in the testbed. It doesn't crash any more, but 2 kCycles is a bit fast... some more comments would be nice, or maybe I am just too tired to understand it :(

TestE proc
  mov ebx, AlgoLoops-1   ; loop e.g. 100x
  mov esi, esp  ; check the stack
  align 4
  .Repeat
   mov edx, edi
   mov edi, esp
   mov ecx, esp
   sub edi, (bufsize+3) MOD 4      ;<NumberOfBytesRequiredPlus3Mod4>
   .repeat
      push eax
      ASSUME FS:Nothing
      mov esp, fs:[8]
      ASSUME FS:ERROR
   .until edi>=esp
   sub ecx, edi
   shr ecx, 2
   xor eax, eax
   mov esp, edi
   rep stosd
   mov edi, edx
   add esp, (bufsize+3) MOD 4   ; restore stack
   dec ebx
  .Until Sign?
  sub esi, esp
  .if !Zero?   ; OK
   print str$(esi), " STACKDIFF"
   exit
  .endif
  ret
TestE endp

dedndave

it's not too bad
i probe down the stack by using the TEB.StackLimit value from FS:[8]
then, i use REP STOSD to clear it out

the probe part was discussed at length...
http://masm32.com/board/index.php?topic=1363
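
The probe loop described above can be modelled with plain integers. This is a simulation only, written in Python, and it never touches a real stack: the addresses are invented, PAGE_SIZE is assumed to be 4096, and the rule "a write into the guard page lowers the limit by one page" is a simplified assumption about how the OS reacts. The function name probe_down is hypothetical.

```python
# Simulation only: models the push / mov esp,fs:[8] loop with integers.
PAGE_SIZE = 4096  # assumed page size


def probe_down(esp, stack_limit, target):
    """Return (new_stack_limit, loop_passes).

    esp         -- stack pointer on entry
    stack_limit -- lowest committed address (what FS:[8] would report)
    target      -- lowest address the buffer needs (EDI in the posted code)
    """
    passes = 0
    while True:
        passes += 1
        esp -= 4                      # push eax: write one dword below ESP
        if esp < stack_limit:         # the write landed in the guard page
            stack_limit -= PAGE_SIZE  # assumed: OS commits it, limit drops 4 KB
        esp = stack_limit             # mov esp, fs:[8]
        if target >= esp:             # .until edi>=esp
            return stack_limit, passes


start_esp = 0x12FF00             # made-up entry ESP
limit = 0x12D000                 # made-up initial stack limit
target = start_esp - 102400      # room for a 102400-byte buffer

new_limit, passes = probe_down(start_esp, limit, target)
print(hex(new_limit), passes)
```

In this model every pass after the first writes just below the current limit, so the limit moves down one page at a time and no guard page is skipped. Once the limit is at or below the target address, the whole buffer is committed and REP STOSD can clear it without faulting.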

jj2007

Thanks, Dave - I had not seen that thread. Now it's clearer...

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

4710    kCycles for 100 * rep stosd
2220    kCycles for 100 * HeapAlloc (*8)
2193    kCycles for 100 * StackBuffer (with zeroing)
2192    kCycles for 100 * StackBuffer (unrolled)
4697    kCycles for 100 * dedndave
1738    kCycles for 100 * rep stosd up


This is for slightly modified code, taking account of the need to save & restore the old stack:

  .Repeat
        mov edx, edi        ; save edi
        mov edi, esp
        mov eax, esp        ; save old stack
        sub edi, (bufsize+3+4)        ;<NumberOfBytesRequiredPlus3Mod4>
        and edi, -4        ; aligns new stack
        .repeat
                push eax        ; tickle the guard page
                ASSUME FS:Nothing
                mov esp, fs:[8]        ; limit might be 4k lower now
                ASSUME FS:ERROR
        .until edi>=esp        ; loop until we've got enough
        mov esp, edi        ; new stack
        stosd        ; save old stack to [edi]
        xchg eax, ecx
        push edi        ; retval for macro
        sub ecx, edi
        shr ecx, 2
        xor eax, eax
        rep stosd
        pop eax        ; retval for macro
        mov edi, edx        ; restore edi
        ; ... code that uses buffer...
        pop esp        ; restore stack
        dec ebx
  .Until Sign?


I hope I didn't misunderstand anything - for some time I was thoroughly confused by your NumberOfBytesRequiredPlus3Mod4 ::)

dedndave

sorry for the confusion - it's just a number that is mod4=0
it could be an immediate - or a value calculated in EAX
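
The difference between the two readings of NumberOfBytesRequiredPlus3Mod4 is easy to see with numbers. A short Python sketch of the arithmetic only, using the bufsize quoted earlier in the thread; the helper name round_up4 is made up for illustration:

```python
# Arithmetic only. bufsize is the value jj2007 quotes in the thread.
bufsize = 102400


def round_up4(n):
    """Round n up to the next multiple of 4: add 3, clear the two low bits.
    In assembler terms this is (n+3) AND -4."""
    return (n + 3) & -4


# Reading 1: "(bufsize+3) MOD 4" is a remainder, so it is always 0..3.
remainder = (bufsize + 3) % 4

# Reading 2: "a number that is mod4=0", i.e. bufsize rounded up.
rounded = round_up4(bufsize)

print(remainder)
print(rounded)
```

With the remainder reading only 3 bytes are subtracted from the stack pointer, and the REP STOSD count (3 shr 2) is zero, which would be consistent with the test appearing to finish in about 2 kCycles.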

as for restoring the stack.....

        OPTION  PROLOGUE:None
        OPTION  EPILOGUE:None

MyProc PROC parm1:DWORD

        push    ebx
        push    esi
        push    edi            ;push/pops on EBX ESI EDI are optional, of course

        push    ebp
        mov     ebp,esp        ;set up the frame so LEAVE can restore ESP later

;stack probe code here

;stack clear code here

;use stack space, as required

        leave

        pop     edi
        pop     esi
        pop     ebx
        ret     4

MyProc ENDP

        OPTION  PROLOGUE:PrologueDef
        OPTION  EPILOGUE:EpilogueDef

jj2007

Quote from: dedndave on October 27, 2013, 12:10:08 PM
as for restoring the stack..... leave

I've tried that but it crashes. If you have working code, please insert into the source :icon14:

Anyway, speed-wise it doesn't look so convincing. By the way, the forum software translates *8 into a smiley - HeapAlloc is actually tested with one eighth of the buffer size, because it's so slow :(

dedndave

give this a try, my friend
i am anxious to see if it crashes on you   :P

it should display the allocation size (F0000), then 0 (cleared OR test result)

i commented it heavily, just for you   :biggrin:

KeepingRealBusy

Quote from: jj2007 on October 26, 2013, 04:19:12 PM
Quote from: KeepingRealBusy on October 26, 2013, 01:42:11 PM
A question: what about unrolling the xmm loop and using 8 movdqa's to reduce the loop overhead?

Why not ;-)

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 188/100 cycles

4725    kCycles for 100 * rep stosd
2683    kCycles for 100 * HeapAlloc (*8)
2202    kCycles for 100 * StackBuffer (with zeroing)
2887    kCycles for 100 * StackBuffer (unrolled)
2207    kCycles for 100 * movaps xmm0
1746    kCycles for 100 * rep stosd up

29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)


Jochen,

Using this version, I made some changes. The original movaps test was TestE. I made an unrolled version as TestI, then used some similar code to modify TestE and saved the results as TestG and TestH. The modifications were to move the "constant" initializations out of the REPEAT loops and execute them at the beginning of the test (before the REPEATs). The following are the .lst sections for TestE, TestG, TestH, and TestI (just to check alignments):


align 16
00000190 TestE_s:
= movaps xmm0 NameE equ movaps xmm0 ; assign a descriptive name here
00000190 TestE proc
00000190  BB 00000063   mov ebx, AlgoLoops-1 ; loop e.g. 100x
  align 4
  .Repeat
00000198    *@C0011:
00000198  8B CC mov ecx, esp
0000019A  8D 84 24 lea eax, [esp-bufsize]
     FFFE7000
000001A1  83 E4 F0 and esp, -16 ;  needs a reg or local to store original esp
000001A4  0F 57 C0 xorps xmm0, xmm0
; align 4
.Repeat
000001A7    *@C0012:
000001A7  83 EC 10 sub esp, OWORD
000001AA  0F 29 04 24 movaps OWORD ptr [esp], xmm0 ; movaps <1% faster on AMD
.Until esp<=eax
000001AE  3B E0    *     cmp    esp, eax
000001B0  77 F5    *     ja @C0012
000001B2  8B E1 mov esp, ecx
; add esp, bufsize
000001B4  4B dec ebx
  .Until Sign?
000001B5  79 E1    *     jns    @C0011
000001B7  C3   ret
000001B8 TestE endp
000001B8 TestE_endp:

align 16
000001E0 TestG_s:
= movaps xmm0 (down) NameG equ movaps xmm0 (down) ; assign a descriptive name here
000001E0 TestG proc
000001E0  BB 00000063   mov ebx, AlgoLoops-1 ; loop e.g. 100x
000001E5  8B CC   mov ecx, esp
000001E7  8B F4   mov esi, esp
000001E9  BA FFFFFFF0   mov edx, -OWORD
000001EE  83 E6 F0   and esi, -16
000001F1  0F 57 C0   xorps xmm0, xmm0
000001F4  8D 86 FFFE7000   lea eax, [esi-bufsize]
  align 16
  .Repeat
00000200    *@C0017:
00000200  8B E6         mov esp, esi
.Repeat
00000202    *@C0018:
00000202  8D 24 14                 lea esp,[esp+edx]
00000205  0F 29 04 24 movaps OWORD ptr [esp], xmm0 ; movaps <1% faster on AMD
.Until esp==eax
00000209  3B E0    *     cmp    esp, eax
0000020B  75 F5    *     jne    @C0018
0000020D  4B dec ebx
  .Until Sign?
0000020E  79 F0    *     jns    @C0017
00000210  8B E1   mov esp, ecx
00000212  C3   ret
00000213 TestG endp
00000213 TestG_endp:

align 16
00000220 TestH_s:
= movaps xmm0 (up) NameH equ movaps xmm0 (up) ; assign a descriptive name here
00000220 TestH proc
00000220  BB 00000063   mov ebx, AlgoLoops-1 ; loop e.g. 100x
00000225  8B CC   mov ecx, esp
00000227  8D B4 24   lea esi, [esp-bufsize]
     FFFE7000
0000022E  BA 00000010   mov edx, OWORD
00000233  83 E6 F0   and esi, -16
00000236  0F 57 C0   xorps xmm0, xmm0
00000239  8D 86 00019000   lea eax, [esi+bufsize]
  align 16
  .Repeat
00000240    *@C001B:
00000240  8B E6         mov esp, esi
.Repeat
00000242    *@C001C:
00000242  0F 29 04 24 movaps OWORD ptr [esp+(0*OWORD)], xmm0 ; movaps <1% faster on AMD
00000246  8D 24 14                 lea esp,[esp+edx]
.Until esp==eax
00000249  3B E0    *     cmp    esp, eax
0000024B  75 F5    *     jne    @C001C
0000024D  4B dec ebx
  .Until Sign?
0000024E  79 F0    *     jns    @C001B
00000250  8B E1   mov esp, ecx
00000252  C3   ret
00000253 TestH endp
00000253 TestH_endp:

align 16
00000260 TestI_s:
= movaps xmm0 (unrolled) NameI equ movaps xmm0 (unrolled) ; assign a descriptive name here
00000260 TestI proc
00000260  BB 00000063   mov ebx, AlgoLoops-1 ; loop e.g. 100x
00000265  8B CC   mov ecx, esp
00000267  8D B4 24   lea esi, [esp-bufsize]
     FFFE7000
0000026E  BA 00000080   mov edx, (8*OWORD)
00000273  83 E6 F0   and esi, -16
00000276  0F 57 C0   xorps xmm0, xmm0
00000279  8D 86 00019000   lea eax, [esi+bufsize]
  .Repeat
0000027F    *@C001F:
0000027F  8B E6         mov esp, esi
  align 16
.Repeat
00000290    *@C0020:
00000290  0F 29 04 24 movaps OWORD ptr [esp+(0*OWORD)], xmm0 ; movaps <1% faster on AMD
00000294  0F 29 44 24 10 movaps OWORD ptr [esp+(1*OWORD)], xmm0 ; movaps <1% faster on AMD
00000299  0F 29 44 24 20 movaps OWORD ptr [esp+(2*OWORD)], xmm0 ; movaps <1% faster on AMD
0000029E  0F 29 44 24 30 movaps OWORD ptr [esp+(3*OWORD)], xmm0 ; movaps <1% faster on AMD
000002A3  0F 29 44 24 40 movaps OWORD ptr [esp+(4*OWORD)], xmm0 ; movaps <1% faster on AMD
000002A8  0F 29 44 24 50 movaps OWORD ptr [esp+(5*OWORD)], xmm0 ; movaps <1% faster on AMD
000002AD  0F 29 44 24 60 movaps OWORD ptr [esp+(6*OWORD)], xmm0 ; movaps <1% faster on AMD
000002B2  0F 29 44 24 70 movaps OWORD ptr [esp+(7*OWORD)], xmm0 ; movaps <1% faster on AMD
000002B7  8D 24 14                 lea esp,[esp+edx]
.Until esp==eax
000002BA  3B E0    *     cmp    esp, eax
000002BC  75 D2    *     jne    @C0020
; add esp, bufsize
000002BE  4B dec ebx
  .Until Sign?
000002BF  79 BE    *     jns    @C001F
000002C1  8B E1   mov esp, ecx
000002C3  C3   ret
000002C4 TestI endp
000002C4 TestI_endp:
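
One detail in the listings above is worth a quick check: TestE ends its inner loop on "esp<=eax", while TestG, TestH and TestI end on "esp==eax". An equality test only terminates if the stride divides the buffer size exactly. A small Python sketch of that arithmetic, taking bufsize from the 00019000 displacement in the listing; the function name inner_passes is made up for illustration:

```python
# Arithmetic only: inner-loop pass counts for the movaps variants.
OWORD = 16
bufsize = 0x19000   # 102400, the displacement shown in the listing


def inner_passes(step):
    """Passes needed to cover bufsize in 'step'-byte strides.

    Raises ValueError if step does not divide bufsize, because a loop
    that ends on 'esp == eax' would then step past the end address.
    """
    if bufsize % step:
        raise ValueError("stride does not divide bufsize")
    return bufsize // step


print("one movaps per pass:   ", inner_passes(OWORD))
print("eight movaps per pass: ", inner_passes(8 * OWORD))
```

For this bufsize both strides divide evenly, so the equality test is safe here, and the unrolled version runs one eighth as many compare-and-branch pairs. With a buffer size that is not a multiple of 128, the unrolled loop would need a "<=" or ">=" style test instead.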


The following are my executions:


AMD A8-3520M APU with Radeon(tm) HD Graphics (SSE3)
loop overhead is approx. 433/100 cycles

5229    kCycles for 100 * rep stosd
3627    kCycles for 100 * HeapAlloc (*8)
3274    kCycles for 100 * StackBuffer (with zeroing)
3278    kCycles for 100 * StackBuffer (unrolled)
3193    kCycles for 100 * movaps xmm0
3118    kCycles for 100 * rep stosd up
2798    kCycles for 100 * movaps xmm0 (down)
2974    kCycles for 100 * movaps xmm0 (up)
2895    kCycles for 100 * movaps xmm0 (unrolled)

3573    kCycles for 100 * rep stosd
2709    kCycles for 100 * HeapAlloc (*8)
2458    kCycles for 100 * StackBuffer (with zeroing)
2481    kCycles for 100 * StackBuffer (unrolled)
2426    kCycles for 100 * movaps xmm0
2218    kCycles for 100 * rep stosd up
2086    kCycles for 100 * movaps xmm0 (down)
2329    kCycles for 100 * movaps xmm0 (up)
2273    kCycles for 100 * movaps xmm0 (unrolled)

2244    kCycles for 100 * rep stosd
1512    kCycles for 100 * HeapAlloc (*8)
1422    kCycles for 100 * StackBuffer (with zeroing)
1403    kCycles for 100 * StackBuffer (unrolled)
1546    kCycles for 100 * movaps xmm0
1448    kCycles for 100 * rep stosd up
1424    kCycles for 100 * movaps xmm0 (down)
1561    kCycles for 100 * movaps xmm0 (up)
1502    kCycles for 100 * movaps xmm0 (unrolled)

18      bytes for rep stosd
103     bytes for HeapAlloc (*8)
29      bytes for StackBuffer (with zeroing)
48      bytes for StackBuffer (unrolled)
25      bytes for movaps xmm0
17      bytes for rep stosd up
36      bytes for movaps xmm0 (down)
36      bytes for movaps xmm0 (up)
85      bytes for movaps xmm0 (unrolled)

--- ok ---


The times are interesting. I have attached a zip of my .asm and .exe file.

Dave.