Code location sensitivity of timings

RuiLoureiro · July 26, 2014, 03:38:14 AM

File length = 977412

1484 ms
1453 ms
1516 ms
1547 ms
Press any key to continue ...

RuiLoureiro · July 26, 2014, 03:40:43 AM

1344 ms
1344 ms
1343 ms
2016 ms...

1343 ms
1344 ms
1344 ms
2015 ms...

1344 ms
1359 ms
1344 ms
2016 ms
If we remove the worst case ...

RuiLoureiro · July 26, 2014, 04:08:58 AM

If I am not wrong, you are using 2 counters:
First counter = 1000
Second counter = count (=4000, etc.)

You get the result only when the first counter
reaches 0 (counter_end).
So the result has something to do with the execution
of this:

Code
   mov edi,count
   mov ebx,esp
   .while edi
       pushargs
       call esi
       mov esp,ebx
       dec edi
   .endw

Is there any particular reason for this?

Quote
counter_begin 1000, HIGH_PRIORITY_CLASS
mov edi,count
mov ebx,esp
.while edi
pushargs
call esi
mov esp,ebx
dec edi
.endw
counter_end

jj2007 · July 26, 2014, 05:09:09 AM

Quote from: nidud on July 26, 2014, 02:57:09 AM
Quote
Check if the align is really needed
I normally tune them from the list file in the end

What I meant is that rep movsX may not need ANY alignment, simply because it isn't a loop at this level (at microcode level, it is of course a loop).

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (MMX, SSE, SSE2, SSE3)
movsd align 16 10476 µs
movsd align 3 10456 µs
movsd align 13 10347 µs
movsb align 16 10510 µs
movsb align 3 10503 µs
movsb align 13 10407 µs

movsd align 16 10514 µs
movsd align 3 10469 µs
movsd align 13 10516 µs
movsb align 16 10455 µs
movsb align 3 10515 µs
movsb align 13 10502 µs

movsd align 16 10526 µs
movsd align 3 10455 µs
movsd align 13 10469 µs
movsb align 16 10360 µs
movsb align 3 10485 µs
movsb align 13 10456 µs

Sample:
test4a proc uses esi edi ecx
align 16
nops 3
rep movsb
ret
test4a endp

Interesting, though, that movsb is indeed equally fast on my trusty old Celeron, at least for a 10 MB string.
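
The nops macro used in the sample is not shown in the post. Here is a minimal sketch of what it might look like, assuming it simply emits n one-byte NOPs so that rep movsb starts n bytes past the 16-byte boundary. The macro body, the CopyBytes proc and the src/dst buffers are all made up for illustration, and the build assumes the MASM32 SDK in its default location:

```masm
include \masm32\include\masm32rt.inc

; Hypothetical definition (not from the post): emit n one-byte NOPs
nops MACRO n
    REPEAT n
        nop
    ENDM
ENDM

.data
src     db "rep movsb needs no loop label", 0
dst     db sizeof src dup (0)

.code
; Illustrative copy routine; the name and arguments are assumptions
CopyBytes proc uses esi edi pSrc:DWORD, pDst:DWORD, nBytes:DWORD
    mov     esi, pSrc
    mov     edi, pDst
    mov     ecx, nBytes
    align 16                ; pad up to the next 16-byte boundary
    nops 3                  ; rep movsb now starts 3 bytes past that boundary
    rep movsb               ; copy ECX bytes from [ESI] to [EDI]
    ret
CopyBytes endp

start:
    invoke  CopyBytes, addr src, addr dst, sizeof src
    print   offset dst, 13, 10      ; show the copied string
    exit
end start
```

Changing the argument of nops (3, 13, ...) moves the instruction without touching anything else, which is presumably how the "align 3" and "align 13" rows were produced.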

nidud · July 26, 2014, 05:43:01 AM

deleted

Gunther · July 26, 2014, 08:50:17 AM

That's the result of memcpy.exe from 1234.zip:

Code
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
----------------------------------------------
-- aligned strings --
498952    cycles -  10 (  0) 0: crt_memcpy
898756    cycles -  10 ( 75) 1: movsd - mov eax,ecx
903577    cycles -  10 ( 75) 2: movsd - push ecx
354813    cycles -  10 ( 59) 3: movsb
487954    cycles -  10 (182) 4: SSE
-- unaligned strings --
494936    cycles -  10 (  0) 0: crt_memcpy
895940    cycles -  10 ( 75) 1: movsd - mov eax,ecx
895968    cycles -  10 ( 75) 2: movsd - push ecx
373553    cycles -  10 ( 59) 3: movsb
491344    cycles -  10 (182) 4: SSE
-- short strings 15 --
175961    cycles - 8000 (  0) 0: crt_memcpy
361324    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
361586    cycles - 8000 ( 75) 2: movsd - push ecx
313550    cycles - 8000 ( 59) 3: movsb
92719     cycles - 8000 (182) 4: SSE
-- short strings 271 --
841879    cycles - 8000 (  0) 0: crt_memcpy
780741    cycles - 8000 ( 75) 1: movsd - mov eax,ecx
806939    cycles - 8000 ( 75) 2: movsd - push ecx
623419    cycles - 8000 ( 59) 3: movsb
275466    cycles - 8000 (182) 4: SSE
-- short strings 2014 --
1002628   cycles - 4000 (  0) 0: crt_memcpy
2239737   cycles - 4000 ( 75) 1: movsd - mov eax,ecx
2226209   cycles - 4000 ( 75) 2: movsd - push ecx
962207    cycles - 4000 ( 59) 3: movsb
972245    cycles - 4000 (182) 4: SSE
--- ok ---

Gunther

RuiLoureiro · July 26, 2014, 09:02:01 AM

Quote
The macro can only be called by EDI, ESI, or EBX or an immediate value.
I think I just run out of regs once and inserted a loop.
The count for small functions is also rather high so it's
just a way of skipping zeros I guess.

I think you are talking about this macro:

counter_begin MACRO loopcount:REQ, priority
or counter_end

If it is, we cannot use EBX, because cpuid overwrites EBX.
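
To illustrate the point: cpuid writes EAX, EBX, ECX and EDX, so any value kept in EBX has to be saved around the serialising cpuid/rdtsc pair. This is only a small sketch, assuming the MASM32 SDK; the value 12345 is arbitrary and stands in for a loop counter held in EBX:

```masm
include \masm32\include\masm32rt.inc
.686                                ; cpuid and rdtsc need at least .586

.code
start:
    mov     ebx, 12345              ; arbitrary value we want to survive the timing code
    push    ebx                     ; cpuid overwrites EAX, EBX, ECX and EDX
    xor     eax, eax                ; same CPUID input value for each call
    cpuid                           ; serialise before reading the counter
    rdtsc                           ; EDX:EAX = time stamp counter
    pop     ebx                     ; EBX is back to the value saved above
    print   str$(ebx), 13, 10       ; prints the preserved value
    exit
end start
```

Without the push/pop pair, EBX would hold part of the CPUID vendor string after the call.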

I modified counter_begin (written by MichaelW) to this:
(COUNTERLOOPS=1000 or 10000 or 100000 or ...)

Code
; this macro uses EDI internally: EDI = current length, running from kIni to kEnd
; we need to define an array to save the means.
; we need to define _LoopCount, _MaxLength, etc. in .DATA
BEGIN_COUNTER_CYCLE_HIGH_PRIORITY_CLASS MACRO   kIni, kEnd
                                        LOCAL   labelA,labelB

                mov     _LoopCount, COUNTERLOOPS
                mov     _MaxLength, kEnd
                mov     edi, kIni
                ;mov     _MinLength, edi         ;; not used yet
                mov     _MeanValue, 0           ;; mean is 0

                invoke  GetCurrentProcess
                invoke  SetPriorityClass, eax, HIGH_PRIORITY_CLASS

    labelA:                                         ;; Begin test loop
    
                BEGIN_LOOP_TEST equ <labelA>
            
                xor     eax, eax        ;; Use same CPUID input value for each call
                cpuid                   ;; Flush pipe & wait for pending ops to finish
                rdtsc                   ;; Read Time Stamp Counter

                push    edx             ;; Preserve high-order 32 bits of start count
                push    eax             ;; Preserve low-order 32 bits of start count
            
                mov     _LoopCounter, COUNTERLOOPS
                xor     eax, eax
                cpuid                   ;; Make sure loop setup instructions finish
          ALIGN 16                      ;; Optimal loop alignment for P6
          @@:                           ;; Start an empty reference loop
                sub     _LoopCounter, 1
                jnz     short @B

                xor     eax, eax
                cpuid                   ;; Make sure loop instructions finish
                rdtsc                   ;; Read end count
                pop     ecx             ;; Recover low-order 32 bits of start count
                sub     eax, ecx        ;; Low-order 32 bits of overhead count in EAX
                pop     ecx             ;; Recover high-order 32 bits of start count
                sbb     edx, ecx        ;; High-order 32 bits of overhead count in EDX
                push    edx             ;; Preserve high-order 32 bits of overhead count
                push    eax             ;; Preserve low-order 32 bits of overhead count

                xor     eax, eax
                cpuid
                rdtsc
                push    edx             ;; Preserve high-order 32 bits of start count
                push    eax             ;; Preserve low-order 32 bits of start count
                ;;-------------------------------------
                ;;              Start
                ;;-------------------------------------
                mov         _LoopCounter, COUNTERLOOPS
                xor         eax, eax
                cpuid                   ;; Make sure loop setup instructions finish
    ALIGN 16                            ;; Optimal loop alignment for P6
    labelB:                             ;; Start test loop
                START_LOOP_TEST equ <labelB>
ENDM
; ------------------------------------------------------------------------
END_COUNTER_CYCLE       MACRO  arg
                        LOCAL  $tmpstr$

                sub         _LoopCounter, 1
                jnz         START_LOOP_TEST                ;; goto labelB
                ;;---------------------------
                ;;   stop this count
                ;;---------------------------
                xor         eax, eax
                cpuid                       ;; Make sure loop instructions finish
                rdtsc                       ;; Read end count
                pop         ecx             ;; Recover low-order 32 bits of start count
                sub         eax, ecx        ;; Low-order 32 bits of test count in EAX
                pop         ecx             ;; Recover high-order 32 bits of start count
                sbb         edx, ecx        ;; High-order 32 bits of test count in EDX
                pop         ecx             ;; Recover low-order 32 bits of overhead count
                sub         eax, ecx        ;; Low-order 32 bits of adjusted count in EAX
                pop         ecx             ;; Recover high-order 32 bits of overhead count
                sbb         edx, ecx        ;; High-order 32 bits of adjusted count in EDX

                mov         DWORD PTR _CounterQword, eax
                mov         DWORD PTR _CounterQword + 4, edx
                finit
                fild        _CounterQword
                fild        _LoopCount
                fdiv
                fistp       _CounterQword

                mov         ebx, dword ptr _CounterQword
                
                ;---------------------------------------------------
                ;      add cycles for this length to _MeanValue
                ;---------------------------------------------------
                add         ebx, _MeanValue
                mov         _MeanValue, ebx

                add         edi, 1
                cmp         edi, _MaxLength 
                jbe         BEGIN_LOOP_TEST                      ;; goto labelA

                invoke      GetCurrentProcess
                invoke      SetPriorityClass, eax, NORMAL_PRIORITY_CLASS
                
                ; --------------------------------------------------
                ;          Save mean and print mean                
                ; --------------------------------------------------
                invoke      SaveMeans, ebx          ;; save it in one array
                                                       ;; one after another
                
                ;--------------------------------------------------- 
                print       str$(ebx)                       
                $tmpstr$    CATSTR <chr$(">, <arg>, <",13,10)>        
                print       $tmpstr$
                ;---------------------------------------------------                 
ENDM

Code
.data
ALIGN 8                         ;; Optimal alignment for QWORD
_CounterQword   dq 0
_LoopCount      dd 0
_LoopCounter    dd 0                                   

_MinLength      dd 0
_MaxLength      dd 0
_MeanValue      dd 0
;------------------------------
ALIGN   4
                dd 0                ; <<<--- start with 0   
_TblTiming0     dd 600 dup (?)
.code
SaveMeans       proc        kMean:DWORD
                mov         eax, kMean
                mov         edx, offset _TblTiming0
                mov         ecx, [edx-4]            ; number of means saved so far
                mov         [edx+ecx*4], eax        ; append the new mean
                add         ecx, 1
                mov         [edx-4], ecx            ; update the count
                ret
SaveMeans       endp
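
The table keeps its entry count in the DWORD just before _TblTiming0, so reading the means back is a matter of loading that count and walking the array. A rough sketch of how that could look, assuming the MASM32 SDK: PrintMeans is a hypothetical helper that is not part of the macros above, and the two values saved in start are arbitrary samples, not measurements:

```masm
include \masm32\include\masm32rt.inc

.data
ALIGN 4
                dd 0                    ; number of saved means, starts at 0
_TblTiming0     dd 600 dup (?)

.code
SaveMeans       proc    kMean:DWORD
                mov     eax, kMean
                mov     edx, offset _TblTiming0
                mov     ecx, [edx-4]    ; number of means saved so far
                mov     [edx+ecx*4], eax
                add     ecx, 1
                mov     [edx-4], ecx
                ret
SaveMeans       endp

; Hypothetical helper: print every saved mean, one per line
PrintMeans      proc    uses esi ebx
                mov     esi, offset _TblTiming0
                mov     ebx, [esi-4]    ; how many means were saved
                .while ebx
                    mov     eax, [esi]
                    print   str$(eax), 13, 10
                    add     esi, 4      ; next DWORD entry
                    dec     ebx
                .endw
                ret
PrintMeans      endp

start:
                invoke  SaveMeans, 100  ; arbitrary sample values
                invoke  SaveMeans, 200
                invoke  PrintMeans
                exit
end start
```

Note that neither SaveMeans nor this helper checks the 600-entry limit, so the caller has to keep the number of saved means below that.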

nidud · July 27, 2014, 02:10:39 AM

deleted

Gunther · July 27, 2014, 02:54:54 AM

Hi nidud,

Quote from: nidud on July 27, 2014, 02:10:39 AM
I added some bits to Dave's test:

There's nothing attached.

Gunther

nidud · July 27, 2014, 04:48:41 AM

deleted

nidud · July 27, 2014, 05:17:38 AM

deleted

jj2007 · July 27, 2014, 08:06:15 AM

Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
-----------------------------------------------
-- aligned strings --
995933    cycles -  10 (  0) 0: crt_strrchr
995891    cycles -  10 ( 40) 1: strrchr
273823    cycles -  10 (154) 2: x
94668     cycles -  10 (112) 3: SSE
-- unaligned strings --
996477    cycles -  10 (  0) 0: crt_strrchr
997094    cycles -  10 ( 40) 1: strrchr
298219    cycles -  10 (154) 2: x
121529    cycles -  10 (112) 3: SSE
-- small strings 128 --
324263    cycles - 500 (  0) 0: crt_strrchr
323710    cycles - 500 ( 40) 1: strrchr
84786     cycles - 500 (154) 2: x
34915     cycles - 500 (112) 3: SSE
-- small strings 1 --
67914     cycles - 500 (  0) 0: crt_strrchr
67286     cycles - 500 ( 40) 1: strrchr
12595     cycles - 500 (154) 2: x
16622     cycles - 500 (112) 3: SSE

Gunther · July 27, 2014, 10:40:16 AM

Hi nidud,

here's the output of auto.zip:

Code
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (AVX)
----------------------------------------------
-- aligned strings --
491469    cycles -  10 (  0) 0: crt_memcpy
889651    cycles -  10 ( 63) 1: movsd - mov eax,ecx
887273    cycles -  10 ( 63) 2: movsd - push ecx
355080    cycles -  10 ( 51) 3: movsb
487046    cycles -  10 (182) 4: SSE
355990    cycles -  10 (  0) 5: auto
-- unaligned strings --
490269    cycles -  10 (  0) 0: crt_memcpy
886259    cycles -  10 ( 63) 1: movsd - mov eax,ecx
886778    cycles -  10 ( 63) 2: movsd - push ecx
372520    cycles -  10 ( 51) 3: movsb
491780    cycles -  10 (182) 4: SSE
378881    cycles -  10 (  0) 5: auto
-- short strings 15 --
174897    cycles - 8000 (  0) 0: crt_memcpy
349626    cycles - 8000 ( 63) 1: movsd - mov eax,ecx
343812    cycles - 8000 ( 63) 2: movsd - push ecx
307384    cycles - 8000 ( 51) 3: movsb
98073     cycles - 8000 (182) 4: SSE
293479    cycles - 8000 (  0) 5: auto
-- short strings 271 --
832627    cycles - 8000 (  0) 0: crt_memcpy
773797    cycles - 8000 ( 63) 1: movsd - mov eax,ecx
764418    cycles - 8000 ( 63) 2: movsd - push ecx
586580    cycles - 8000 ( 51) 3: movsb
279676    cycles - 8000 (182) 4: SSE
557134    cycles - 8000 (  0) 5: auto
-- short strings 2014 --
998188    cycles - 4000 (  0) 0: crt_memcpy
2198740   cycles - 4000 ( 63) 1: movsd - mov eax,ecx
2195833   cycles - 4000 ( 63) 2: movsd - push ecx
935710    cycles - 4000 ( 51) 3: movsb
961563    cycles - 4000 (182) 4: SSE
906474    cycles - 4000 (  0) 5: auto
--- ok ---

Gunther

nidud · July 27, 2014, 11:12:01 PM

deleted

nidud · August 11, 2014, 08:46:40 PM

deleted

The MASM Forum