News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Code location sensitivity of timings

Started by nidud, July 12, 2014, 09:15:44 PM

Previous topic - Next topic

RuiLoureiro

File length = 977412

1484 ms
1453 ms
1516 ms
1547 ms
Press any key to continue ...

RuiLoureiro

1344 ms
1344 ms
1343 ms
2016 ms...

1343 ms
1344 ms
1344 ms
2015 ms...

1344 ms
1359 ms
1344 ms
2016 ms
If we remove the worst case ...

RuiLoureiro

If i am not wrong, you are using 2 counters:
           First  counter     = 1000
           Second counter = count (=4000,etc.)

You get the result only when
the first counter is 0 (counter_end).
So the result has something to do with the execution
of this:

   mov edi,count
   mov ebx,esp
   .while edi
       pushargs
       call esi
       mov esp,ebx
       dec edi

Is there any particular reason for this ?
Quote
   counter_begin 1000, HIGH_PRIORITY_CLASS
   mov edi,count
   mov ebx,esp
   .while edi
       pushargs
       call esi
       mov esp,ebx
       dec edi
   .endw
   counter_end

jj2007

Quote from: nidud on July 26, 2014, 02:57:09 AM
QuoteCheck if the align is really needed
I normally tune them from the list file in the end

What I intended is that rep movsX may not need ANY alignment, simply because it isn't a loop at this level (at micro code level, it is of course a loop).

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (MMX, SSE, SSE2, SSE3)
movsd align 16  10476 µs
movsd align 3   10456 µs
movsd align 13  10347 µs
movsb align 16  10510 µs
movsb align 3   10503 µs
movsb align 13  10407 µs

movsd align 16  10514 µs
movsd align 3   10469 µs
movsd align 13  10516 µs
movsb align 16  10455 µs
movsb align 3   10515 µs
movsb align 13  10502 µs

movsd align 16  10526 µs
movsd align 3   10455 µs
movsd align 13  10469 µs
movsb align 16  10360 µs
movsb align 3   10485 µs
movsb align 13  10456 µs


Sample:
test4a proc uses esi edi ecx
  align 16
  nops 3
  rep movsb
  ret
test4a endp


Interesting, though, that movsb is indeed equally fast on my trusty old Celeron, at least for a 10 MB string.

nidud

#49
deleted

Gunther

That's the result by memcpy.exe by 1234.zip:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498952    cycles -  10 (  0) 0: crt_memcpy
898756    cycles -  10 ( 75) 1: movsd - mov eax,ecx
903577    cycles -  10 ( 75) 2: movsd - push ecx
354813    cycles -  10 ( 59) 3: movsb
487954    cycles -  10 (182) 4: SSE
-- unaligned strings --
494936    cycles -  10 (  0) 0: crt_memcpy
895940    cycles -  10 ( 75) 1: movsd - mov eax,ecx
895968    cycles -  10 ( 75) 2: movsd - push ecx
373553    cycles -  10 ( 59) 3: movsb
491344    cycles -  10 (182) 4: SSE
-- short strings 15 --
175961    cycles - 8000 (  0) 0: crt_memcpy
361324    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
361586    cycles - 8000 ( 75) 2: movsd - push ecx
313550    cycles - 8000 ( 59) 3: movsb
92719     cycles - 8000 (182) 4: SSE
-- short strings 271 --
841879    cycles - 8000 (  0) 0: crt_memcpy
780741    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
806939    cycles - 8000 ( 75) 2: movsd - push ecx
623419    cycles - 8000 ( 59) 3: movsb
275466    cycles - 8000 (182) 4: SSE
-- short strings 2014 --
1002628   cycles - 4000 (  0) 0: crt_memcpy
2239737   cycles - 4000 ( 75) 1: movsd - mov eax,ecx
2226209   cycles - 4000 ( 75) 2: movsd - push ecx
962207    cycles - 4000 ( 59) 3: movsb
972245    cycles - 4000 (182) 4: SSE
--- ok ---


Gunther
You have to know the facts before you can distort them.

RuiLoureiro

#51
Quote
The macro can only be called by EDI, ESI, or EBX or an immediate value.
I think I just run out of regs once and inserted a loop.
The count for small functions is also rather high so it's
just a way of skipping zeros I guess.
I think you are talking about this macro:
   
        counter_begin MACRO loopcount:REQ, priority
   or  counter_end

    If it is, we cannot use EBX because cpuid destroys EBX

I modified counter_begin -written by MichaelW- to this:
(COUNTERLOOPS=1000 or 10000 or 100000 or ...)

; this macro uses EDI inside = length from kIni to kEnd
; we need to define an array to save the means.
; we need to define _LoopCount,_MaxLength...etc. in .DATA
BEGIN_COUNTER_CYCLE_HIGH_PRIORITY_CLASS MACRO   kIni, kEnd
                                        LOCAL   labelA,labelB

                mov     _LoopCount, COUNTERLOOPS
                mov     _MaxLength, kEnd
                mov     edi, kIni
                ;mov     _MinLength, edi         ;; not used yet
                mov     _MeanValue, 0           ;; mean is 0

                invoke  GetCurrentProcess
                invoke  SetPriorityClass, eax, HIGH_PRIORITY_CLASS

    labelA:                                         ;; Begin test loop
   
                BEGIN_LOOP_TEST equ <labelA>
           
                xor     eax, eax        ;; Use same CPUID input value for each call
                cpuid                   ;; Flush pipe & wait for pending ops to finish
                rdtsc                   ;; Read Time Stamp Counter

                push    edx             ;; Preserve high-order 32 bits of start count
                push    eax             ;; Preserve low-order 32 bits of start count
           
                mov     _LoopCounter, COUNTERLOOPS
                xor     eax, eax
                cpuid                   ;; Make sure loop setup instructions finish
          ALIGN 16                      ;; Optimal loop alignment for P6
          @@:                           ;; Start an empty reference loop
                sub     _LoopCounter, 1
                jnz     short @B

                xor     eax, eax
                cpuid                   ;; Make sure loop instructions finish
                rdtsc                   ;; Read end count
                pop     ecx             ;; Recover low-order 32 bits of start count
                sub     eax, ecx        ;; Low-order 32 bits of overhead count in EAX
                pop     ecx             ;; Recover high-order 32 bits of start count
                sbb     edx, ecx        ;; High-order 32 bits of overhead count in EDX
                push    edx             ;; Preserve high-order 32 bits of overhead count
                push    eax             ;; Preserve low-order 32 bits of overhead count

                xor     eax, eax
                cpuid
                rdtsc
                push    edx             ;; Preserve high-order 32 bits of start count
                push    eax             ;; Preserve low-order 32 bits of start count
                ;;-------------------------------------
                ;;              Start
                ;;-------------------------------------
                mov         _LoopCounter, COUNTERLOOPS
                xor         eax, eax
                cpuid                   ;; Make sure loop setup instructions finish
    ALIGN 16                            ;; Optimal loop alignment for P6
    labelB:                             ;; Start test loop
                START_LOOP_TEST equ <labelB>
ENDM
; ------------------------------------------------------------------------
END_COUNTER_CYCLE       MACRO  arg
                        LOCAL  $tmpstr$

                sub         _LoopCounter, 1
                jnz         START_LOOP_TEST                ;; goto labelB
                ;;---------------------------
                ;;   stop this count
                ;;---------------------------
                xor         eax, eax
                cpuid                       ;; Make sure loop instructions finish
                rdtsc                       ;; Read end count
                pop         ecx             ;; Recover low-order 32 bits of start count
                sub         eax, ecx        ;; Low-order 32 bits of test count in EAX
                pop         ecx             ;; Recover high-order 32 bits of start count
                sbb         edx, ecx        ;; High-order 32 bits of test count in EDX
                pop         ecx             ;; Recover low-order 32 bits of overhead count
                sub         eax, ecx        ;; Low-order 32 bits of adjusted count in EAX
                pop         ecx             ;; Recover high-order 32 bits of overhead count
                sbb         edx, ecx        ;; High-order 32 bits of adjusted count in EDX

                mov         DWORD PTR _CounterQword, eax
                mov         DWORD PTR _CounterQword + 4, edx
                finit
                fild        _CounterQword
                fild        _LoopCount
                fdiv
                fistp       _CounterQword

                mov         ebx, dword ptr _CounterQword
               
                ;---------------------------------------------------
                ;               print cycles
                ;---------------------------------------------------
                add         ebx, _MeanValue
                mov         _MeanValue, ebx

                add         edi, 1
                cmp         edi, _MaxLength
                jbe         BEGIN_LOOP_TEST                      ;; goto labelA

                invoke      GetCurrentProcess
                invoke      SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
               
                ; --------------------------------------------------
                ;          Save mean and print mean               
                ; --------------------------------------------------
                invoke      SaveMeans, ebx          ;; save it in one array
                                                       ;; one after another
               
                ;---------------------------------------------------
                print       str$(ebx)                       
                $tmpstr$    CATSTR <chr$(">, <arg>, <",13,10)>       
                print       $tmpstr$
                ;---------------------------------------------------                 
ENDM



.data
ALIGN 8                         ;; Optimal alignment for QWORD
_CounterQword   dq 0
_LoopCount      dd 0
_LoopCounter    dd 0                                   

_MinLength      dd 0
_MaxLength      dd 0
_MeanValue      dd 0
;------------------------------
ALIGN   4
                dd 0                ; <<<--- start with 0   
_TblTiming0     dd 600 dup (?)
.code
SaveMeans       proc        kMean:DWORD                   
                mov         eax, kMean
                mov         edx, offset _TblTiming0                   
                mov         ecx, [edx-4]            ; number of means
                mov         [edx+ecx*4], eax                   
                add         ecx, 1
                mov         [edx-4], ecx
                ret
SaveMeans       endp

nidud

#52
deleted

Gunther

Hi nidud,

Quote from: nidud on July 27, 2014, 02:10:39 AM
I added some bits to Dave's test:

there's nothing attached.

Gunther
You have to know the facts before you can distort them.

nidud

#54
deleted

nidud

#55
deleted

jj2007

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
-----------------------------------------------
-- aligned strings --
995933  cycles - 10 (  0) 0: crt_strrchr
995891  cycles - 10 ( 40) 1: strrchr
273823  cycles - 10 (154) 2: x
94668   cycles - 10 (112) 3: SSE
-- unaligned strings --
996477  cycles - 10 (  0) 0: crt_strrchr
997094  cycles - 10 ( 40) 1: strrchr
298219  cycles - 10 (154) 2: x
121529  cycles - 10 (112) 3: SSE
-- small strings 128 --
324263  cycles - 500 (  0) 0: crt_strrchr
323710  cycles - 500 ( 40) 1: strrchr
84786   cycles - 500 (154) 2: x
34915   cycles - 500 (112) 3: SSE
-- small strings 1 --
67914   cycles - 500 (  0) 0: crt_strrchr
67286   cycles - 500 ( 40) 1: strrchr
12595   cycles - 500 (154) 2: x
16622   cycles - 500 (112) 3: SSE

Gunther

Hi nidud,

here's the output of auto.zip:

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
-- aligned strings --
491469    cycles -  10 (  0) 0: crt_memcpy
889651    cycles -  10 ( 63) 1: movsd - mov eax,ecx
887273    cycles -  10 ( 63) 2: movsd - push ecx
355080    cycles -  10 ( 51) 3: movsb
487046    cycles -  10 (182) 4: SSE
355990    cycles -  10 (  0) 5: auto
-- unaligned strings --
490269    cycles -  10 (  0) 0: crt_memcpy
886259    cycles -  10 ( 63) 1: movsd - mov eax,ecx
886778    cycles -  10 ( 63) 2: movsd - push ecx
372520    cycles -  10 ( 51) 3: movsb
491780    cycles -  10 (182) 4: SSE
378881    cycles -  10 (  0) 5: auto
-- short strings 15 --
174897    cycles - 8000 (  0) 0: crt_memcpy
349626    cycles - 8000 ( 63) 1: movsd - mov eax,ecx
343812    cycles - 8000 ( 63) 2: movsd - push ecx
307384    cycles - 8000 ( 51) 3: movsb
98073     cycles - 8000 (182) 4: SSE
293479    cycles - 8000 (  0) 5: auto
-- short strings 271 --
832627    cycles - 8000 (  0) 0: crt_memcpy
773797    cycles - 8000 ( 63) 1: movsd - mov eax,ecx
764418    cycles - 8000 ( 63) 2: movsd - push ecx
586580    cycles - 8000 ( 51) 3: movsb
279676    cycles - 8000 (182) 4: SSE
557134    cycles - 8000 (  0) 5: auto
-- short strings 2014 --
998188    cycles - 4000 (  0) 0: crt_memcpy
2198740   cycles - 4000 ( 63) 1: movsd - mov eax,ecx
2195833   cycles - 4000 ( 63) 2: movsd - push ecx
935710    cycles - 4000 ( 51) 3: movsb
961563    cycles - 4000 (182) 4: SSE
906474    cycles - 4000 (  0) 5: auto
--- ok ---


Gunther
You have to know the facts before you can distort them.

nidud

#58
deleted

nidud

#59
deleted