News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

New cycle counter routines

Started by jj2007, September 17, 2015, 08:22:21 AM

Previous topic - Next topic

jj2007

For some time, MasmBasic has featured a cycle count template (in RichMasm: Menu File/New Masm source, Timer example).

I have released today a new version, with more precise and more stable cycle counts. Typical example:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

7011    cycles for 100 * CRT strlen
5236    cycles for 100 * Masm32 StrLen
2170    cycles for 100 * MasmBasic Len

7024    cycles for 100 * CRT strlen
5232    cycles for 100 * Masm32 StrLen
2263    cycles for 100 * MasmBasic Len

7267    cycles for 100 * CRT strlen
5233    cycles for 100 * Masm32 StrLen
2173    cycles for 100 * MasmBasic Len

7010    cycles for 100 * CRT strlen
5224    cycles for 100 * Masm32 StrLen
2173    cycles for 100 * MasmBasic Len

7024    cycles for 100 * CRT strlen
5223    cycles for 100 * Masm32 StrLen
2169    cycles for 100 * MasmBasic Len


Executable attached; if you encounter unstable timings for your cpu, please let me know.

dedndave

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

15847   cycles for 100 * CRT strlen
15680   cycles for 100 * Masm32 StrLen
7114    cycles for 100 * MasmBasic Len

15721   cycles for 100 * CRT strlen
14284   cycles for 100 * Masm32 StrLen
7265    cycles for 100 * MasmBasic Len

15632   cycles for 100 * CRT strlen
14383   cycles for 100 * Masm32 StrLen
7121    cycles for 100 * MasmBasic Len

15678   cycles for 100 * CRT strlen
14341   cycles for 100 * Masm32 StrLen
7194    cycles for 100 * MasmBasic Len

15745   cycles for 100 * CRT strlen
14251   cycles for 100 * Masm32 StrLen
7123    cycles for 100 * MasmBasic Len

rrr314159

AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

11079   cycles for 100 * CRT strlen
8101    cycles for 100 * Masm32 StrLen
3235    cycles for 100 * MasmBasic Len

10602   cycles for 100 * CRT strlen
8073    cycles for 100 * Masm32 StrLen
3301    cycles for 100 * MasmBasic Len

9716    cycles for 100 * CRT strlen
8044    cycles for 100 * Masm32 StrLen
3267    cycles for 100 * MasmBasic Len

10433   cycles for 100 * CRT strlen
8057    cycles for 100 * Masm32 StrLen
3244    cycles for 100 * MasmBasic Len

11064   cycles for 100 * CRT strlen
8023    cycles for 100 * Masm32 StrLen
3297    cycles for 100 * MasmBasic Len

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for MasmBasic Len
I am NaN ;)

jj2007

Thanks to both of you :icon14:

Looks OK, especially the P4 was always a problem, now it seems solved. I attach a new version (with source) that uses an "align 64" macro, which may help to force the test loops into a 64-byte instruction cache boundary (I write "may" because I'm not sure if that theory is correct; besides, depending on how other code is arranged, the short main loop might fit into the cache, while the bigger algo might cross cache boundaries).

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)

6922    cycles for 100 * CRT strlen
5157    cycles for 100 * Masm32 StrLen
17467   cycles for 100 * Windows lstrlen
1668    cycles for 100 * MasmBasic Len

6917    cycles for 100 * CRT strlen
5157    cycles for 100 * Masm32 StrLen
17478   cycles for 100 * Windows lstrlen
1668    cycles for 100 * MasmBasic Len

6933    cycles for 100 * CRT strlen
5156    cycles for 100 * Masm32 StrLen
17505   cycles for 100 * Windows lstrlen
1669    cycles for 100 * MasmBasic Len

6922    cycles for 100 * CRT strlen
5161    cycles for 100 * Masm32 StrLen
17444   cycles for 100 * Windows lstrlen
1670    cycles for 100 * MasmBasic Len

6936    cycles for 100 * CRT strlen
5157    cycles for 100 * Masm32 StrLen
17461   cycles for 100 * Windows lstrlen
1668    cycles for 100 * MasmBasic Len

hutch--


Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz (SSE4)

8893    cycles for 100 * CRT strlen
6779    cycles for 100 * Masm32 StrLen
24229   cycles for 100 * Windows lstrlen
2195    cycles for 100 * MasmBasic Len

8887    cycles for 100 * CRT strlen
6774    cycles for 100 * Masm32 StrLen
24229   cycles for 100 * Windows lstrlen
2195    cycles for 100 * MasmBasic Len

8884    cycles for 100 * CRT strlen
6778    cycles for 100 * Masm32 StrLen
24228   cycles for 100 * Windows lstrlen
2005    cycles for 100 * MasmBasic Len

8884    cycles for 100 * CRT strlen
6774    cycles for 100 * Masm32 StrLen
23511   cycles for 100 * Windows lstrlen
2200    cycles for 100 * MasmBasic Len

8885    cycles for 100 * CRT strlen
6776    cycles for 100 * Masm32 StrLen
24232   cycles for 100 * Windows lstrlen
2196    cycles for 100 * MasmBasic Len

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len

rrr314159

AMD A6-6310 APU with AMD Radeon R4 Graphics     (SSE4)

11095   cycles for 100 * CRT strlen
8118    cycles for 100 * Masm32 StrLen
22622   cycles for 100 * Windows lstrlen
3248    cycles for 100 * MasmBasic Len

11086   cycles for 100 * CRT strlen
8144    cycles for 100 * Masm32 StrLen
22499   cycles for 100 * Windows lstrlen
3255    cycles for 100 * MasmBasic Len

11071   cycles for 100 * CRT strlen
8048    cycles for 100 * Masm32 StrLen
22709   cycles for 100 * Windows lstrlen
3271    cycles for 100 * MasmBasic Len

10127   cycles for 100 * CRT strlen
8056    cycles for 100 * Masm32 StrLen
22686   cycles for 100 * Windows lstrlen
3252    cycles for 100 * MasmBasic Len

11076   cycles for 100 * CRT strlen
8077    cycles for 100 * Masm32 StrLen
22624   cycles for 100 * Windows lstrlen
3233    cycles for 100 * MasmBasic Len

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len


--- ok ---
I am NaN ;)

jj2007

Quote from: rrr314159 on September 17, 2015, 12:37:33 PM10127   cycles for 100 * CRT strlen

Umpfff... that one is difficult to explain ::)

six_L

Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (SSE4)

5260   cycles for 100 * CRT strlen
4988   cycles for 100 * Masm32 StrLen
16209   cycles for 100 * Windows lstrlen
1668   cycles for 100 * MasmBasic Len

5268   cycles for 100 * CRT strlen
4910   cycles for 100 * Masm32 StrLen
16190   cycles for 100 * Windows lstrlen
1665   cycles for 100 * MasmBasic Len

5279   cycles for 100 * CRT strlen
4914   cycles for 100 * Masm32 StrLen
16167   cycles for 100 * Windows lstrlen
1664   cycles for 100 * MasmBasic Len

5275   cycles for 100 * CRT strlen
4914   cycles for 100 * Masm32 StrLen
16193   cycles for 100 * Windows lstrlen
1667   cycles for 100 * MasmBasic Len

5263   cycles for 100 * CRT strlen
4913   cycles for 100 * Masm32 StrLen
16169   cycles for 100 * Windows lstrlen
1666   cycles for 100 * MasmBasic Len

14   bytes for CRT strlen
10   bytes for Masm32 StrLen
10   bytes for Windows lstrlen
10   bytes for MasmBasic Len


--- ok ---

Say you, Say me, Say the codes together for ever.

dedndave

how about the source for the timing code ?   :redface:

prescott w/htt xp mce2005 sp3
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)

18315   cycles for 100 * CRT strlen
16580   cycles for 100 * Masm32 StrLen
46696   cycles for 100 * Windows lstrlen
8590    cycles for 100 * MasmBasic Len

18517   cycles for 100 * CRT strlen
16776   cycles for 100 * Masm32 StrLen
47050   cycles for 100 * Windows lstrlen
8736    cycles for 100 * MasmBasic Len

18248   cycles for 100 * CRT strlen
16638   cycles for 100 * Masm32 StrLen
46668   cycles for 100 * Windows lstrlen
8534    cycles for 100 * MasmBasic Len

18377   cycles for 100 * CRT strlen
16693   cycles for 100 * Masm32 StrLen
47407   cycles for 100 * Windows lstrlen
8518    cycles for 100 * MasmBasic Len

18318   cycles for 100 * CRT strlen
16796   cycles for 100 * Masm32 StrLen
46341   cycles for 100 * Windows lstrlen
8702    cycles for 100 * MasmBasic Len

dedndave

the whole crux of the issue seems to be serialization
CPUID appears to be the best we have, to date

my "theory" was to try and force instructions to serialize without the use of CPUID
the fewer cycles required to serialize, the more stable the readings should be

here is what i was working on the last time i played with it
and - the results weren't as good as i had hoped
but, maybe it will spawn some ideas

    push    ebx
    push    esi
    push    edi
    push    ebp

    pushad           ;EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI

;time measurement

    pop     edi
    pop     esi
    pop     ebp
    pop     eax
    pop     ebx
    pop     eax
    pop     ecx
    pop     edx

    pop     ebp
    pop     edi
    pop     esi
    pop     ebx
    xchg    eax,edx


the idea being, because the registers are popped in a different order than they are pushed,
the CPU is forced to wait for completion - thus a sort of "natural" serialization

rrr314159

 Intel suggests mfence may be better than cpuid for serialization
I am NaN ;)

TWell

AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)

15529   cycles for 100 * CRT strlen
10607   cycles for 100 * Masm32 StrLen
32263   cycles for 100 * Windows lstrlen
5946    cycles for 100 * MasmBasic Len

15516   cycles for 100 * CRT strlen
10544   cycles for 100 * Masm32 StrLen
32228   cycles for 100 * Windows lstrlen
5956    cycles for 100 * MasmBasic Len

15543   cycles for 100 * CRT strlen
10599   cycles for 100 * Masm32 StrLen
32194   cycles for 100 * Windows lstrlen
6021    cycles for 100 * MasmBasic Len

15544   cycles for 100 * CRT strlen
10644   cycles for 100 * Masm32 StrLen
32107   cycles for 100 * Windows lstrlen
5931    cycles for 100 * MasmBasic Len

15541   cycles for 100 * CRT strlen
10598   cycles for 100 * Masm32 StrLen
32839   cycles for 100 * Windows lstrlen
5970    cycles for 100 * MasmBasic Len

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len

Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)

8307    cycles for 100 * CRT strlen
6156    cycles for 100 * Masm32 StrLen
20840   cycles for 100 * Windows lstrlen
2076    cycles for 100 * MasmBasic Len

8307    cycles for 100 * CRT strlen
6162    cycles for 100 * Masm32 StrLen
20883   cycles for 100 * Windows lstrlen
2078    cycles for 100 * MasmBasic Len

8305    cycles for 100 * CRT strlen
6166    cycles for 100 * Masm32 StrLen
20836   cycles for 100 * Windows lstrlen
2078    cycles for 100 * MasmBasic Len

8304    cycles for 100 * CRT strlen
6162    cycles for 100 * Masm32 StrLen
20852   cycles for 100 * Windows lstrlen
2077    cycles for 100 * MasmBasic Len

8305    cycles for 100 * CRT strlen
6153    cycles for 100 * Masm32 StrLen
20839   cycles for 100 * Windows lstrlen
2077    cycles for 100 * MasmBasic Len

14      bytes for CRT strlen
10      bytes for Masm32 StrLen
10      bytes for Windows lstrlen
10      bytes for MasmBasic Len


--- ok ---
Creative coders use backward thinking techniques as a strategy.

jj2007

Quote from: dedndave on September 17, 2015, 02:56:10 PM
how about the source for the timing code ?   :redface:

It's included in the attachment above, Dave. And you can even set useMB=0 :P

Apart from the one negative outlier in rr's post that is really difficult to explain, timings appear fairly stable, compared to what we have seen over the years. The SpinUp macro does a good job, it seems.

MichaelW

Quote from: dedndave on September 17, 2015, 03:10:05 PM
the whole crux of the issue seems to be serialization
CPUID appears to be the best we have, to date

There is RDTSCP. I recently coded some 64-bit macros for GCC,  that use CPUID function 0 for serialization (~95 cycles IIRC), and running at the highest-possible priority on my Window 7, i3 system the run to run variation was normally only 1 cycle +-. The problem I ran into was CPUID's modification of EBX. While there were no problems with no compiler optimization, with any level of optimization the app would trigger an access-violation exception, even when the test code did not use EBX. The solution was to preserve RBX around CPUID, and doing so did not alter the repeatability, and a RDTSCP version of the macros did no better, so I stuck with using CPUID. Also, I seem to recall that RDTSC is only a little faster than CPUID function 0, at least on my i3.
Well Microsoft, here's another nice mess you've gotten us into.