For some time, MasmBasic has featured a cycle count template (in RichMasm: Menu File/New Masm source, Timer example).
I have released today a new version (http://masm32.com/board/index.php?topic=94.0), with more precise and more stable cycle counts. Typical example:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
7011 cycles for 100 * CRT strlen
5236 cycles for 100 * Masm32 StrLen
2170 cycles for 100 * MasmBasic Len
7024 cycles for 100 * CRT strlen
5232 cycles for 100 * Masm32 StrLen
2263 cycles for 100 * MasmBasic Len
7267 cycles for 100 * CRT strlen
5233 cycles for 100 * Masm32 StrLen
2173 cycles for 100 * MasmBasic Len
7010 cycles for 100 * CRT strlen
5224 cycles for 100 * Masm32 StrLen
2173 cycles for 100 * MasmBasic Len
7024 cycles for 100 * CRT strlen
5223 cycles for 100 * Masm32 StrLen
2169 cycles for 100 * MasmBasic Len
Executable attached; if you encounter unstable timings for your cpu, please let me know.
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
15847 cycles for 100 * CRT strlen
15680 cycles for 100 * Masm32 StrLen
7114 cycles for 100 * MasmBasic Len
15721 cycles for 100 * CRT strlen
14284 cycles for 100 * Masm32 StrLen
7265 cycles for 100 * MasmBasic Len
15632 cycles for 100 * CRT strlen
14383 cycles for 100 * Masm32 StrLen
7121 cycles for 100 * MasmBasic Len
15678 cycles for 100 * CRT strlen
14341 cycles for 100 * Masm32 StrLen
7194 cycles for 100 * MasmBasic Len
15745 cycles for 100 * CRT strlen
14251 cycles for 100 * Masm32 StrLen
7123 cycles for 100 * MasmBasic Len
AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)
11079 cycles for 100 * CRT strlen
8101 cycles for 100 * Masm32 StrLen
3235 cycles for 100 * MasmBasic Len
10602 cycles for 100 * CRT strlen
8073 cycles for 100 * Masm32 StrLen
3301 cycles for 100 * MasmBasic Len
9716 cycles for 100 * CRT strlen
8044 cycles for 100 * Masm32 StrLen
3267 cycles for 100 * MasmBasic Len
10433 cycles for 100 * CRT strlen
8057 cycles for 100 * Masm32 StrLen
3244 cycles for 100 * MasmBasic Len
11064 cycles for 100 * CRT strlen
8023 cycles for 100 * Masm32 StrLen
3297 cycles for 100 * MasmBasic Len
14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for MasmBasic Len
Thanks to both of you :icon14:
Looks OK, especially the P4 was always a problem, now it seems solved. I attach a new version (with source) that uses an "align 64" macro, which may help to force the test loops into a 64-byte instruction cache boundary (I write "may" because I'm not sure if that theory is correct; besides, depending on how other code is arranged, the short main loop might fit into the cache, while the bigger algo might cross cache boundaries).
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
6922 cycles for 100 * CRT strlen
5157 cycles for 100 * Masm32 StrLen
17467 cycles for 100 * Windows lstrlen
1668 cycles for 100 * MasmBasic Len
6917 cycles for 100 * CRT strlen
5157 cycles for 100 * Masm32 StrLen
17478 cycles for 100 * Windows lstrlen
1668 cycles for 100 * MasmBasic Len
6933 cycles for 100 * CRT strlen
5156 cycles for 100 * Masm32 StrLen
17505 cycles for 100 * Windows lstrlen
1669 cycles for 100 * MasmBasic Len
6922 cycles for 100 * CRT strlen
5161 cycles for 100 * Masm32 StrLen
17444 cycles for 100 * Windows lstrlen
1670 cycles for 100 * MasmBasic Len
6936 cycles for 100 * CRT strlen
5157 cycles for 100 * Masm32 StrLen
17461 cycles for 100 * Windows lstrlen
1668 cycles for 100 * MasmBasic Len
Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz (SSE4)
8893 cycles for 100 * CRT strlen
6779 cycles for 100 * Masm32 StrLen
24229 cycles for 100 * Windows lstrlen
2195 cycles for 100 * MasmBasic Len
8887 cycles for 100 * CRT strlen
6774 cycles for 100 * Masm32 StrLen
24229 cycles for 100 * Windows lstrlen
2195 cycles for 100 * MasmBasic Len
8884 cycles for 100 * CRT strlen
6778 cycles for 100 * Masm32 StrLen
24228 cycles for 100 * Windows lstrlen
2005 cycles for 100 * MasmBasic Len
8884 cycles for 100 * CRT strlen
6774 cycles for 100 * Masm32 StrLen
23511 cycles for 100 * Windows lstrlen
2200 cycles for 100 * MasmBasic Len
8885 cycles for 100 * CRT strlen
6776 cycles for 100 * Masm32 StrLen
24232 cycles for 100 * Windows lstrlen
2196 cycles for 100 * MasmBasic Len
14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
AMD A6-6310 APU with AMD Radeon R4 Graphics (SSE4)
11095 cycles for 100 * CRT strlen
8118 cycles for 100 * Masm32 StrLen
22622 cycles for 100 * Windows lstrlen
3248 cycles for 100 * MasmBasic Len
11086 cycles for 100 * CRT strlen
8144 cycles for 100 * Masm32 StrLen
22499 cycles for 100 * Windows lstrlen
3255 cycles for 100 * MasmBasic Len
11071 cycles for 100 * CRT strlen
8048 cycles for 100 * Masm32 StrLen
22709 cycles for 100 * Windows lstrlen
3271 cycles for 100 * MasmBasic Len
10127 cycles for 100 * CRT strlen
8056 cycles for 100 * Masm32 StrLen
22686 cycles for 100 * Windows lstrlen
3252 cycles for 100 * MasmBasic Len
11076 cycles for 100 * CRT strlen
8077 cycles for 100 * Masm32 StrLen
22624 cycles for 100 * Windows lstrlen
3233 cycles for 100 * MasmBasic Len
14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
--- ok ---
Quote from: rrr314159 on September 17, 2015, 12:37:33 PM10127 cycles for 100 * CRT strlen
Umpfff... that one is difficult to explain ::)
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (SSE4)
5260 cycles for 100 * CRT strlen
4988 cycles for 100 * Masm32 StrLen
16209 cycles for 100 * Windows lstrlen
1668 cycles for 100 * MasmBasic Len
5268 cycles for 100 * CRT strlen
4910 cycles for 100 * Masm32 StrLen
16190 cycles for 100 * Windows lstrlen
1665 cycles for 100 * MasmBasic Len
5279 cycles for 100 * CRT strlen
4914 cycles for 100 * Masm32 StrLen
16167 cycles for 100 * Windows lstrlen
1664 cycles for 100 * MasmBasic Len
5275 cycles for 100 * CRT strlen
4914 cycles for 100 * Masm32 StrLen
16193 cycles for 100 * Windows lstrlen
1667 cycles for 100 * MasmBasic Len
5263 cycles for 100 * CRT strlen
4913 cycles for 100 * Masm32 StrLen
16169 cycles for 100 * Windows lstrlen
1666 cycles for 100 * MasmBasic Len
14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
--- ok ---
how about the source for the timing code ? :redface:
prescott w/htt xp mce2005 sp3
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
18315 cycles for 100 * CRT strlen
16580 cycles for 100 * Masm32 StrLen
46696 cycles for 100 * Windows lstrlen
8590 cycles for 100 * MasmBasic Len
18517 cycles for 100 * CRT strlen
16776 cycles for 100 * Masm32 StrLen
47050 cycles for 100 * Windows lstrlen
8736 cycles for 100 * MasmBasic Len
18248 cycles for 100 * CRT strlen
16638 cycles for 100 * Masm32 StrLen
46668 cycles for 100 * Windows lstrlen
8534 cycles for 100 * MasmBasic Len
18377 cycles for 100 * CRT strlen
16693 cycles for 100 * Masm32 StrLen
47407 cycles for 100 * Windows lstrlen
8518 cycles for 100 * MasmBasic Len
18318 cycles for 100 * CRT strlen
16796 cycles for 100 * Masm32 StrLen
46341 cycles for 100 * Windows lstrlen
8702 cycles for 100 * MasmBasic Len
the whole crux of the issue seems to be serialization
CPUID appears to be the best we have, to date
my "theory" was to try and force instructions to serialize without the use of CPUID
the fewer cycles required to serialize, the more stable the readings should be
here is what i was working on the last time i played with it
and - the results weren't as good as i had hoped
but, maybe it will spawn some ideas
push ebx
push esi
push edi
push ebp
pushad ;EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI
;time measurement
pop edi
pop esi
pop ebp
pop eax
pop ebx
pop eax
pop ecx
pop edx
pop ebp
pop edi
pop esi
pop ebx
xchg eax,edx
the idea being, because the registers are popped in a different order than they are pushed,
the CPU is forced to wait for completion - thus a sort of "natural" serialization
Intel suggests mfence may be better than cpuid for serialization
AMD E-450 APU with Radeon(tm) HD Graphics (SSE4)
15529 cycles for 100 * CRT strlen
10607 cycles for 100 * Masm32 StrLen
32263 cycles for 100 * Windows lstrlen
5946 cycles for 100 * MasmBasic Len
15516 cycles for 100 * CRT strlen
10544 cycles for 100 * Masm32 StrLen
32228 cycles for 100 * Windows lstrlen
5956 cycles for 100 * MasmBasic Len
15543 cycles for 100 * CRT strlen
10599 cycles for 100 * Masm32 StrLen
32194 cycles for 100 * Windows lstrlen
6021 cycles for 100 * MasmBasic Len
15544 cycles for 100 * CRT strlen
10644 cycles for 100 * Masm32 StrLen
32107 cycles for 100 * Windows lstrlen
5931 cycles for 100 * MasmBasic Len
15541 cycles for 100 * CRT strlen
10598 cycles for 100 * Masm32 StrLen
32839 cycles for 100 * Windows lstrlen
5970 cycles for 100 * MasmBasic Len
14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
8307 cycles for 100 * CRT strlen
6156 cycles for 100 * Masm32 StrLen
20840 cycles for 100 * Windows lstrlen
2076 cycles for 100 * MasmBasic Len
8307 cycles for 100 * CRT strlen
6162 cycles for 100 * Masm32 StrLen
20883 cycles for 100 * Windows lstrlen
2078 cycles for 100 * MasmBasic Len
8305 cycles for 100 * CRT strlen
6166 cycles for 100 * Masm32 StrLen
20836 cycles for 100 * Windows lstrlen
2078 cycles for 100 * MasmBasic Len
8304 cycles for 100 * CRT strlen
6162 cycles for 100 * Masm32 StrLen
20852 cycles for 100 * Windows lstrlen
2077 cycles for 100 * MasmBasic Len
8305 cycles for 100 * CRT strlen
6153 cycles for 100 * Masm32 StrLen
20839 cycles for 100 * Windows lstrlen
2077 cycles for 100 * MasmBasic Len
14 bytes for CRT strlen
10 bytes for Masm32 StrLen
10 bytes for Windows lstrlen
10 bytes for MasmBasic Len
--- ok ---
Quote from: dedndave on September 17, 2015, 02:56:10 PM
how about the source for the timing code ? :redface:
It's included in the attachment above, Dave. And you can even set useMB=0 :P
Apart from the one negative outlier in rr's post that is really difficult to explain, timings appear fairly stable, compared to what we have seen over the years. The SpinUp macro does a good job, it seems.
Quote from: dedndave on September 17, 2015, 03:10:05 PM
the whole crux of the issue seems to be serialization
CPUID appears to be the best we have, to date
There is RDTSCP. I recently coded some 64-bit macros for GCC, that use CPUID function 0 for serialization (~95 cycles IIRC), and running at the highest-possible priority on my Window 7, i3 system the run to run variation was normally only 1 cycle +-. The problem I ran into was CPUID's modification of EBX. While there were no problems with no compiler optimization, with
any level of optimization the app would trigger an access-violation exception, even when the test code did not use EBX. The solution was to preserve RBX around CPUID, and doing so did not alter the repeatability, and a RDTSCP version of the macros did no better, so I stuck with using CPUID. Also, I seem to recall that RDTSC is only a little faster than CPUID function 0, at least on my i3.
i have to say that i am very pleased with this iteration of my idea :biggrin:
with the exception of a few out-liers, it seems extremely stable on my quirky P4 with XP media center edition 2005
a single run of the test program...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
0 0 0 0 0 0 0 0 0 455305 Iterations
15030 15038 15038 13380 15038 15038 15030 13381 15038 114344 Iterations
30038 30038 30038 30038 26715 30038 30038 30038 30038 64327 Iterations
the actual serialization code looks like this
(_dwClock2Lo and _dwClock2Hi are stack-frame variables)
notice that the registers are PUSH'ed in different order than POP'ed,
thus forcing the CPU to complete instructions before and after
push eax
push ecx
push edx
push ebx
push esp
push ebp
push esi
push edi
rdtsc
mov _dwClock2Lo,eax
mov _dwClock2Hi,edx
pop eax
pop ecx
pop edx
pop ebx
pop eax
pop ecx
pop esi
pop edi
maybe i should have started a seperate thread
can i get some members to run this for me ? :t
Quote from: dedndave on September 22, 2015, 02:52:38 PM
maybe i should have started a seperate thread
can i get some members to run this for me ? :t
Genuine Intel(R) CPU T2060 @ 1.60GHz (SSE3)
0 0 0 0 0 0 -12 0 0 516805 Iterations
10020 10020 10020 10020 10032 10020 10020 10020 10020 99370 Iterations
20028 20016 20028 20028 20028 20028 20028 20040 20028 49035 Iterations
Press any key to continue ...
some time later::
I have run your test dave sevral times.
It seems to have a strong affinity for the number twelve - that is
when the counts are off, it is usually +/- 12, +/- 24, etc
Looks convincing, Dave :t
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
0 0 0 0 0 0 0 0 0 797143 Iterations
10028 10028 10028 10028 10028 10028 10028 10028 10028 96273 Iterations
20028 20028 20028 20028 20028 20028 20028 20028 20028 49035 Iterations
Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz (SSE4)
0 0 0 0 0 0 0 0 0 2528866 Iterations
9021 9018 9021 9021 9018 9018 9021 9018 9021 265816 Iterations
18018 18021 18021 18018 18018 18018 18018 18021 18018 99370 Iterations
Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
0 0 0 0 0 0 0 0 0 1589917 Iterations
10040 10032 10040 10040 10040 10032 10040 10040 10032 224319 Iterations
20040 20040 20040 20040 20040 20040 20040 20040 20040 118358 Iterations
AMD E-450 APU with Radeon(tm) HD Graphics (SSE4) 1,6 GHz
0 0 0 0 0 0 0 0 0 455305 Iterations
20010 20010 20010 20010 20010 20010 20010 20010 20010 32300 Iterations
40011 40011 40011 40011 40011 40011 40011 40011 40011 27642 Iterations
Press any key to continue ...
many thanks, guys :t
that looks pretty good, to me
there might be some things i can do to improve it
let me play with it a bit
i may want to put it in a LIB, so the code falls early in the .CODE section
it's pretty convenient to use, really
your Function Under Test, of course
then, i let you supply a "NUL" reference function to compare it to
and, i let you supply the run time (mS) and both process and thread priorities (OR'ed together)
TestTime PROTO :DWORD,:DWORD,:LPVOID,:LPVOID
fnNUL PROTO
fnFUT PROTO
INVOKE TestTime,500,HIGH_PRIORITY_CLASS or THREAD_PRIORITY_ABOVE_NORMAL,fnNUL,fnFUT
(no more playing with LoopCounts :P )
the fnFUT might look something like this
fnFUT PROC
INVOKE StrLen,offset szTest
ret
fnFUT ENDP
and, you'd subtract out the fnNUL time
fnNUL PROC
ret
fnNUL ENDP
the serialization technique is a bit faster than CPUID
and, if it's stable on my machine, it should be stable on everyone elses - lol
Tried it out,
the good news is, using your test it is, in fact, more stable than mfence (which I'm currently using for serialization). Didn't compare to cpuid since I assume you did that already.
However there are couple problems. Main one, this serialization technique blows away all the registers! I modified it to preserve registers, and it still works OK, not as well tho. So for "on-the-fly" timing, which must not impact the running program by destroying registers, this technique not so good. Of course mfence is a lot faster, and doesn't need "pushad / popad".
In your FUT you use only ecx, and in your timing code you preserve only ecx. So if you use other registers in the FUT, does the timing code still work? Didn't get around to checking that, but assume it does. If so, this looks like a better way to do "formal" timing but not good for "on-the-fly"
BTW zedd151 when cycle count is off, it can be by multiple of 12, but I also have seen 18, 3, ... If there's a pattern it's not obvious
Quote from: rrr314159 on September 23, 2015, 03:00:57 AM
BTW zedd151 when cycle count is off, it can be by multiple of 12, but I also have seen 18, 3, ... If there's a pattern it's not obvious
I think it is my machine that is the problem, that's why I had such difficulty taming even the
'tried and true' timing/counting routines. :(
It seems to be running hot often.
Perhaps the airways need a good cleaning? (it's an old -12yr - laptop)
rrr - it's not "inline", as you say
and, yes, the "ABI" registers are preserved - just not around individual calls
at the beginning of the routine...
;preserve registers and set up stack frame
push ebx ;[EBP+12] = preserved EBX contents
push esi ;[EBP+8] = preserved ESI contents
push edi ;[EBP+4] = preserved EDI contents
push ebp ;[EBP] = preserved EBP contents
mov ebp,esp ;EBP = stack frame base pointer
at the end...
;return measured time in EDX:EAX and iterations in ECX
mov eax,_dwMeasuredLo
mov edx,_dwMeasuredHi
sub eax,_dwReferenceLo
mov ecx,_dwIterations
sbb edx,_dwReferenceHi
;release stack frame and restore registers
leave ;performs MOV ESP,EBP then POP EBP
pop edi
pop esi
pop ebx
ret 16
when the RDTSC operation is made, the only registers that matter are ESP and EBP
also, i have written the current code so that _SerializeA and SerializeB macros
may be easily modified for trying other serialization methods
i did try CPUID
_SerializeA MACRO
xor eax,eax
cpuid
ENDM
_SerializeB TEXTEQU <_SerializeA>