The MASM Forum

General => The Laboratory => Topic started by: jj2007 on September 17, 2015, 08:22:21 AM

Title: New cycle counter routines
Post by: jj2007 on September 17, 2015, 08:22:21 AM
For some time, MasmBasic has featured a cycle count template (in RichMasm: Menu File/New Masm source, Timer example).

I have released today a new version (http://masm32.com/board/index.php?topic=94.0), with more precise and more stable cycle counts. Typical example:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

7011    cycles for 100 * CRT strlen
5236    cycles for 100 * Masm32 StrLen
2170    cycles for 100 * MasmBasic Len

7024    cycles for 100 * CRT strlen
5232    cycles for 100 * Masm32 StrLen
2263    cycles for 100 * MasmBasic Len

7267    cycles for 100 * CRT strlen
5233    cycles for 100 * Masm32 StrLen
2173    cycles for 100 * MasmBasic Len

7010    cycles for 100 * CRT strlen
5224    cycles for 100 * Masm32 StrLen
2173    cycles for 100 * MasmBasic Len

7024    cycles for 100 * CRT strlen
5223    cycles for 100 * Masm32 StrLen
2169    cycles for 100 * MasmBasic Len


Executable attached; if you encounter unstable timings for your cpu, please let me know.
Title: Re: New cycle counter routines
Post by: dedndave on September 17, 2015, 10:25:42 AM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

15847   cycles for 100 * CRT strlen
15680   cycles for 100 * Masm32 StrLen
7114    cycles for 100 * MasmBasic Len

15721   cycles for 100 * CRT strlen
14284   cycles for 100 * Masm32 StrLen
7265    cycles for 100 * MasmBasic Len

15632   cycles for 100 * CRT strlen
14383   cycles for 100 * Masm32 StrLen
7121    cycles for 100 * MasmBasic Len

15678   cycles for 100 * CRT strlen
14341   cycles for 100 * Masm32 StrLen
7194    cycles for 100 * MasmBasic Len

15745   cycles for 100 * CRT strlen
14251   cycles for 100 * Masm32 StrLen
7123    cycles for 100 * MasmBasic Len
Title: Re: New cycle counter routines
Post by: rrr314159 on September 17, 2015, 10:30:46 AM
AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

11079   cycles for 100 * CRT strlen
8101    cycles for 100 * Masm32 StrLen
3235    cycles for 100 * MasmBasic Len

10602   cycles for 100 * CRT strlen
8073    cycles for 100 * Masm32 StrLen
3301    cycles for 100 * MasmBasic Len

9716    cycles for 100 * CRT strlen
8044    cycles for 100 * Masm32 StrLen
3267    cycles for 100 * MasmBasic Len

10433   cycles for 100 * CRT strlen
8057    cycles for 100 * Masm32 StrLen
3244    cycles for 100 * MasmBasic Len

11064   cycles for 100 * CRT strlen
8023    cycles for 100 * Masm32 StrLen
3297    cycles for 100 * MasmBasic Len

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for MasmBasic Len
Title: Re: New cycle counter routines
Post by: jj2007 on September 17, 2015, 12:00:33 PM
Thanks to both of you :icon14:

Looks OK, especially the P4 was always a problem, now it seems solved. I attach a new version (with source) that uses an "align 64" macro, which may help to force the test loops into a 64-byte instruction cache boundary (I write "may" because I'm not sure if that theory is correct; besides, depending on how other code is arranged, the short main loop might fit into the cache, while the bigger algo might cross cache boundaries).

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6922    cycles for 100 * CRT strlen
5157    cycles for 100 * Masm32 StrLen
17467   cycles for 100 * Windows lstrlen
1668    cycles for 100 * MasmBasic Len

6917    cycles for 100 * CRT strlen
5157    cycles for 100 * Masm32 StrLen
17478   cycles for 100 * Windows lstrlen
1668    cycles for 100 * MasmBasic Len

6933    cycles for 100 * CRT strlen
5156    cycles for 100 * Masm32 StrLen
17505   cycles for 100 * Windows lstrlen
1669    cycles for 100 * MasmBasic Len

6922    cycles for 100 * CRT strlen
5161    cycles for 100 * Masm32 StrLen
17444   cycles for 100 * Windows lstrlen
1670    cycles for 100 * MasmBasic Len

6936    cycles for 100 * CRT strlen
5157    cycles for 100 * Masm32 StrLen
17461   cycles for 100 * Windows lstrlen
1668    cycles for 100 * MasmBasic Len
Title: Re: New cycle counter routines
Post by: hutch-- on September 17, 2015, 12:10:51 PM

Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)

8893    cycles for 100 * CRT strlen
6779    cycles for 100 * Masm32 StrLen
24229   cycles for 100 * Windows lstrlen
2195    cycles for 100 * MasmBasic Len

8887    cycles for 100 * CRT strlen
6774    cycles for 100 * Masm32 StrLen
24229   cycles for 100 * Windows lstrlen
2195    cycles for 100 * MasmBasic Len

8884    cycles for 100 * CRT strlen
6778    cycles for 100 * Masm32 StrLen
24228   cycles for 100 * Windows lstrlen
2005    cycles for 100 * MasmBasic Len

8884    cycles for 100 * CRT strlen
6774    cycles for 100 * Masm32 StrLen
23511   cycles for 100 * Windows lstrlen
2200    cycles for 100 * MasmBasic Len

8885    cycles for 100 * CRT strlen
6776    cycles for 100 * Masm32 StrLen
24232   cycles for 100 * Windows lstrlen
2196    cycles for 100 * MasmBasic Len

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
Title: Re: New cycle counter routines
Post by: rrr314159 on September 17, 2015, 12:37:33 PM
AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

11095   cycles for 100 * CRT strlen
8118    cycles for 100 * Masm32 StrLen
22622   cycles for 100 * Windows lstrlen
3248    cycles for 100 * MasmBasic Len

11086   cycles for 100 * CRT strlen
8144    cycles for 100 * Masm32 StrLen
22499   cycles for 100 * Windows lstrlen
3255    cycles for 100 * MasmBasic Len

11071   cycles for 100 * CRT strlen
8048    cycles for 100 * Masm32 StrLen
22709   cycles for 100 * Windows lstrlen
3271    cycles for 100 * MasmBasic Len

10127   cycles for 100 * CRT strlen
8056    cycles for 100 * Masm32 StrLen
22686   cycles for 100 * Windows lstrlen
3252    cycles for 100 * MasmBasic Len

11076   cycles for 100 * CRT strlen
8077    cycles for 100 * Masm32 StrLen
22624   cycles for 100 * Windows lstrlen
3233    cycles for 100 * MasmBasic Len

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len


--- ok ---
Title: Re: New cycle counter routines
Post by: jj2007 on September 17, 2015, 12:44:03 PM
Quote from: rrr314159 on September 17, 2015, 12:37:33 PM10127   cycles for 100 * CRT strlen

Umpfff... that one is difficult to explain ::)
Title: Re: New cycle counter routines
Post by: six_L on September 17, 2015, 02:11:01 PM
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (SSE4)

5260   cycles for 100 * CRT strlen
4988   cycles for 100 * Masm32 StrLen
16209   cycles for 100 * Windows lstrlen
1668   cycles for 100 * MasmBasic Len

5268   cycles for 100 * CRT strlen
4910   cycles for 100 * Masm32 StrLen
16190   cycles for 100 * Windows lstrlen
1665   cycles for 100 * MasmBasic Len

5279   cycles for 100 * CRT strlen
4914   cycles for 100 * Masm32 StrLen
16167   cycles for 100 * Windows lstrlen
1664   cycles for 100 * MasmBasic Len

5275   cycles for 100 * CRT strlen
4914   cycles for 100 * Masm32 StrLen
16193   cycles for 100 * Windows lstrlen
1667   cycles for 100 * MasmBasic Len

5263   cycles for 100 * CRT strlen
4913   cycles for 100 * Masm32 StrLen
16169   cycles for 100 * Windows lstrlen
1666   cycles for 100 * MasmBasic Len

14   bytes for CRT strlen
10   bytes for Masm32 StrLen
10   bytes for Windows lstrlen
10   bytes for MasmBasic Len


--- ok ---

Title: Re: New cycle counter routines
Post by: dedndave on September 17, 2015, 02:56:10 PM
how about the source for the timing code ?   :redface:

prescott w/htt xp mce2005 sp3
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

18315   cycles for 100 * CRT strlen
16580   cycles for 100 * Masm32 StrLen
46696   cycles for 100 * Windows lstrlen
8590    cycles for 100 * MasmBasic Len

18517   cycles for 100 * CRT strlen
16776   cycles for 100 * Masm32 StrLen
47050   cycles for 100 * Windows lstrlen
8736    cycles for 100 * MasmBasic Len

18248   cycles for 100 * CRT strlen
16638   cycles for 100 * Masm32 StrLen
46668   cycles for 100 * Windows lstrlen
8534    cycles for 100 * MasmBasic Len

18377   cycles for 100 * CRT strlen
16693   cycles for 100 * Masm32 StrLen
47407   cycles for 100 * Windows lstrlen
8518    cycles for 100 * MasmBasic Len

18318   cycles for 100 * CRT strlen
16796   cycles for 100 * Masm32 StrLen
46341   cycles for 100 * Windows lstrlen
8702    cycles for 100 * MasmBasic Len
Title: Re: New cycle counter routines
Post by: dedndave on September 17, 2015, 03:10:05 PM
the whole crux of the issue seems to be serialization
CPUID appears to be the best we have, to date

my "theory" was to try and force instructions to serialize without the use of CPUID
the fewer cycles required to serialize, the more stable the readings should be

here is what i was working on the last time i played with it
and - the results weren't as good as i had hoped
but, maybe it will spawn some ideas

    push    ebx
    push    esi
    push    edi
    push    ebp

    pushad           ;EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI

;time measurement

    pop     edi
    pop     esi
    pop     ebp
    pop     eax
    pop     ebx
    pop     eax
    pop     ecx
    pop     edx

    pop     ebp
    pop     edi
    pop     esi
    pop     ebx
    xchg    eax,edx


the idea being, because the registers are popped in a different order than they are pushed,
the CPU is forced to wait for completion - thus a sort of "natural" serialization
Title: Re: New cycle counter routines
Post by: rrr314159 on September 17, 2015, 03:19:11 PM
 Intel suggests mfence may be better than cpuid for serialization
Title: Re: New cycle counter routines
Post by: TWell on September 17, 2015, 03:25:15 PM
AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)

15529   cycles for 100 * CRT strlen
10607   cycles for 100 * Masm32 StrLen
32263   cycles for 100 * Windows lstrlen
5946    cycles for 100 * MasmBasic Len

15516   cycles for 100 * CRT strlen
10544   cycles for 100 * Masm32 StrLen
32228   cycles for 100 * Windows lstrlen
5956    cycles for 100 * MasmBasic Len

15543   cycles for 100 * CRT strlen
10599   cycles for 100 * Masm32 StrLen
32194   cycles for 100 * Windows lstrlen
6021    cycles for 100 * MasmBasic Len

15544   cycles for 100 * CRT strlen
10644   cycles for 100 * Masm32 StrLen
32107   cycles for 100 * Windows lstrlen
5931    cycles for 100 * MasmBasic Len

15541   cycles for 100 * CRT strlen
10598   cycles for 100 * Masm32 StrLen
32839   cycles for 100 * Windows lstrlen
5970    cycles for 100 * MasmBasic Len

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len
Title: Re: New cycle counter routines
Post by: Siekmanski on September 17, 2015, 07:55:13 PM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

8307    cycles for 100 * CRT strlen
6156    cycles for 100 * Masm32 StrLen
20840   cycles for 100 * Windows lstrlen
2076    cycles for 100 * MasmBasic Len

8307    cycles for 100 * CRT strlen
6162    cycles for 100 * Masm32 StrLen
20883   cycles for 100 * Windows lstrlen
2078    cycles for 100 * MasmBasic Len

8305    cycles for 100 * CRT strlen
6166    cycles for 100 * Masm32 StrLen
20836   cycles for 100 * Windows lstrlen
2078    cycles for 100 * MasmBasic Len

8304    cycles for 100 * CRT strlen
6162    cycles for 100 * Masm32 StrLen
20852   cycles for 100 * Windows lstrlen
2077    cycles for 100 * MasmBasic Len

8305    cycles for 100 * CRT strlen
6153    cycles for 100 * Masm32 StrLen
20839   cycles for 100 * Windows lstrlen
2077    cycles for 100 * MasmBasic Len

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len


--- ok ---
Title: Re: New cycle counter routines
Post by: jj2007 on September 17, 2015, 08:47:31 PM
Quote from: dedndave on September 17, 2015, 02:56:10 PM
how about the source for the timing code ?   :redface:

It's included in the attachment above, Dave. And you can even set useMB=0 :P

Apart from the one negative outlier in rr's post that is really difficult to explain, timings appear fairly stable, compared to what we have seen over the years. The SpinUp macro does a good job, it seems.
Title: Re: New cycle counter routines
Post by: MichaelW on September 18, 2015, 06:17:50 AM
Quote from: dedndave on September 17, 2015, 03:10:05 PM
the whole crux of the issue seems to be serialization
CPUID appears to be the best we have, to date

There is RDTSCP. I recently coded some 64-bit macros for GCC,  that use CPUID function 0 for serialization (~95 cycles IIRC), and running at the highest-possible priority on my Window 7, i3 system the run to run variation was normally only 1 cycle +-. The problem I ran into was CPUID's modification of EBX. While there were no problems with no compiler optimization, with any level of optimization the app would trigger an access-violation exception, even when the test code did not use EBX. The solution was to preserve RBX around CPUID, and doing so did not alter the repeatability, and a RDTSCP version of the macros did no better, so I stuck with using CPUID. Also, I seem to recall that RDTSC is only a little faster than CPUID function 0, at least on my i3.
Title: Re: New cycle counter routines
Post by: dedndave on September 22, 2015, 12:36:42 PM
i have to say that i am very pleased with this iteration of my idea   :biggrin:
with the exception of a few out-liers, it seems extremely stable on my quirky P4 with XP media center edition 2005

a single run of the test program...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
0 0 0 0 0 0 0 0 0 455305 Iterations
15030 15038 15038 13380 15038 15038 15030 13381 15038 114344 Iterations
30038 30038 30038 30038 26715 30038 30038 30038 30038 64327 Iterations


the actual serialization code looks like this
(_dwClock2Lo and _dwClock2Hi are stack-frame variables)

notice that the registers are PUSH'ed in different order than POP'ed,
thus forcing the CPU to complete instructions before and after

        push    eax
        push    ecx
        push    edx
        push    ebx
        push    esp
        push    ebp
        push    esi
        push    edi

        rdtsc
        mov     _dwClock2Lo,eax
        mov     _dwClock2Hi,edx

        pop     eax
        pop     ecx
        pop     edx
        pop     ebx
        pop     eax
        pop     ecx
        pop     esi
        pop     edi
Title: Re: New cycle counter routines
Post by: dedndave on September 22, 2015, 02:52:38 PM
maybe i should have started a seperate thread
can i get some members to run this for me ?   :t
Title: Re: New cycle counter routines
Post by: zedd151 on September 22, 2015, 03:51:53 PM
Quote from: dedndave on September 22, 2015, 02:52:38 PM
maybe i should have started a seperate thread
can i get some members to run this for me ?   :t


Genuine Intel(R) CPU           T2060  @ 1.60GHz (SSE3)
0 0 0 0 0 0 -12 0 0 516805 Iterations
10020 10020 10020 10020 10032 10020 10020 10020 10020 99370 Iterations
20028 20016 20028 20028 20028 20028 20028 20040 20028 49035 Iterations
Press any key to continue ...



some time later::

I have run your test dave sevral times.
It seems to have a strong affinity for the number twelve - that is
when the counts are off, it is usually +/- 12, +/- 24, etc
Title: Re: New cycle counter routines
Post by: jj2007 on September 22, 2015, 04:14:34 PM
Looks convincing, Dave :t

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
0 0 0 0 0 0 0 0 0 797143 Iterations
10028 10028 10028 10028 10028 10028 10028 10028 10028 96273 Iterations
20028 20028 20028 20028 20028 20028 20028 20028 20028 49035 Iterations
Title: Re: New cycle counter routines
Post by: sinsi on September 22, 2015, 04:36:52 PM

Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz (SSE4)
0 0 0 0 0 0 0 0 0 2528866 Iterations
9021 9018 9021 9021 9018 9018 9021 9018 9021 265816 Iterations
18018 18021 18021 18018 18018 18018 18018 18021 18018 99370 Iterations

Title: Re: New cycle counter routines
Post by: Siekmanski on September 22, 2015, 05:19:18 PM
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
0 0 0 0 0 0 0 0 0 1589917 Iterations
10040 10032 10040 10040 10040 10032 10040 10040 10032 224319 Iterations
20040 20040 20040 20040 20040 20040 20040 20040 20040 118358 Iterations
Title: Re: New cycle counter routines
Post by: TWell on September 22, 2015, 06:47:41 PM
AMD E-450 APU with Radeon(tm) HD Graphics (SSE4) 1,6 GHz
0 0 0 0 0 0 0 0 0 455305 Iterations
20010 20010 20010 20010 20010 20010 20010 20010 20010 32300 Iterations
40011 40011 40011 40011 40011 40011 40011 40011 40011 27642 Iterations
Press any key to continue ...
Title: Re: New cycle counter routines
Post by: dedndave on September 22, 2015, 10:16:52 PM
many thanks, guys  :t

that looks pretty good, to me
there might be some things i can do to improve it
let me play with it a bit
i may want to put it in a LIB, so the code falls early in  the .CODE section

it's pretty convenient to use, really
your Function Under Test, of course
then, i let you supply a "NUL" reference function to compare it to
and, i let you supply the run time (mS) and both process and thread priorities (OR'ed together)

TestTime PROTO :DWORD,:DWORD,:LPVOID,:LPVOID
fnNUL    PROTO
fnFUT    PROTO


INVOKE  TestTime,500,HIGH_PRIORITY_CLASS or THREAD_PRIORITY_ABOVE_NORMAL,fnNUL,fnFUT

(no more playing with LoopCounts   :P )

the fnFUT might look something like this

fnFUT PROC

    INVOKE  StrLen,offset szTest
    ret

fnFUT ENDP


and, you'd subtract out the fnNUL time

fnNUL PROC

    ret

fnNUL ENDP


the serialization technique is a bit faster than CPUID
and, if it's stable on my machine, it should be stable on everyone elses - lol
Title: Re: New cycle counter routines
Post by: rrr314159 on September 23, 2015, 03:00:57 AM
Tried it out,

the good news is, using your test it is, in fact, more stable than mfence (which I'm currently using for serialization). Didn't compare to cpuid since I assume you did that already.

However there are couple problems. Main one, this serialization technique blows away all the registers! I modified it to preserve registers, and it still works OK, not as well tho. So for "on-the-fly" timing, which must not impact the running program by destroying registers, this technique not so good. Of course mfence is a lot faster, and doesn't need "pushad / popad".

In your FUT you use only ecx, and in your timing code you preserve only ecx. So if you use other registers in the FUT, does the timing code still work? Didn't get around to checking that, but assume it does. If so, this looks like a better way to do "formal" timing but not good for "on-the-fly"

BTW zedd151 when cycle count is off, it can be by multiple of 12, but I also have seen 18, 3, ... If there's a pattern it's not obvious
Title: Re: New cycle counter routines
Post by: zedd151 on September 23, 2015, 03:38:20 AM
Quote from: rrr314159 on September 23, 2015, 03:00:57 AM
BTW zedd151 when cycle count is off, it can be by multiple of 12, but I also have seen 18, 3, ... If there's a pattern it's not obvious

I think it is my machine that is the problem, that's why I had such difficulty taming even the
'tried and true' timing/counting routines. :(

It seems to be running hot often.

Perhaps the airways need a good cleaning? (it's an old -12yr - laptop)
Title: Re: New cycle counter routines
Post by: dedndave on September 23, 2015, 04:35:23 AM
rrr - it's not "inline", as you say
and, yes, the "ABI" registers are preserved - just not around individual calls

at the beginning of the routine...
;preserve registers and set up stack frame

    push    ebx                                        ;[EBP+12] = preserved EBX contents
    push    esi                                        ;[EBP+8]  = preserved ESI contents
    push    edi                                        ;[EBP+4]  = preserved EDI contents
    push    ebp                                        ;[EBP]    = preserved EBP contents
    mov     ebp,esp                                    ;EBP = stack frame base pointer


at the end...
;return measured time in EDX:EAX and iterations in ECX

    mov     eax,_dwMeasuredLo
    mov     edx,_dwMeasuredHi
    sub     eax,_dwReferenceLo
    mov     ecx,_dwIterations
    sbb     edx,_dwReferenceHi

;release stack frame and restore registers

    leave                  ;performs MOV ESP,EBP then POP EBP
    pop     edi
    pop     esi
    pop     ebx
    ret     16


when the RDTSC operation is made, the only registers that matter are ESP and EBP
Title: Re: New cycle counter routines
Post by: dedndave on September 23, 2015, 04:42:07 AM
also, i have written the current code so that _SerializeA and SerializeB macros
may be easily modified for trying other serialization methods

i did try CPUID

_SerializeA MACRO

        xor     eax,eax
        cpuid

            ENDM

_SerializeB TEXTEQU <_SerializeA>