The MASM Forum

General => The Laboratory => Topic started by: hutch-- on May 21, 2012, 03:47:53 PM

Title: Michael Webster's code timing macros
Post by: hutch-- on May 21, 2012, 03:47:53 PM
This is a reposting from the old forum.
Title: Re: Michael Webster's code timing macros
Post by: jj2007 on May 21, 2012, 05:37:29 PM
The attached cyct_macros.inc uses MichaelW's timer macros (so you also need timers.zip, attached above) but cuts off outliers, which improves the consistency of the timings, especially on older CPUs such as the P4. Simple example:

Code:
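include \masm32\include\masm32rt.inc
include \masm32\macros\Cyct_Macros.inc   ; also requires MichaelW's timer macros (timers.zip)

.code
start: ShowCpu
REPEAT 3                 ; run the whole set of tests three times
cyct_begin
invoke GetTickCount
cyct_end <GetTickCount>

cyct_begin               ; time what a PROC with "uses esi edi ebx" adds:
push esi                 ; the register saves in its prologue...
push edi
push ebx
nop
pop ebx                  ; ...and the restores in its epilogue
pop edi
pop esi
cyct_end <uses esi edi ebx>

print chr$(13, 10)       ; newline between passes
ENDM
inkey "ok"
exit
end start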
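The exact outlier rule in cyct_macros.inc is in the attachment and is not shown here. As a rough sketch of the general idea only, in C (the language of MichaelW's later GCC macros): collect one cycle count per run, sort them, discard the slowest fraction, and average the rest. The names `trimmed_mean` and `cmp_u64` and the percentage parameter are hypothetical, not taken from jj2007's macros.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>   /* qsort */

/* qsort comparator for unsigned 64-bit cycle counts. */
static int cmp_u64(const void *pa, const void *pb)
{
    uint64_t a = *(const uint64_t *)pa;
    uint64_t b = *(const uint64_t *)pb;
    return (a > b) - (a < b);
}

/* Sort the samples in place, drop the slowest `drop_pct` percent,
   and return the mean of the remaining (fastest) samples. */
static double trimmed_mean(uint64_t *samples, size_t n, unsigned drop_pct)
{
    if (n == 0)
        return 0.0;
    qsort(samples, n, sizeof samples[0], cmp_u64);
    size_t drop = (n * drop_pct) / 100;
    size_t keep = (drop < n) ? n - drop : 1;   /* always keep at least one */
    uint64_t sum = 0;
    for (size_t i = 0; i < keep; i++)
        sum += samples[i];
    return (double)sum / (double)keep;
}
```

This sketch trims only the slow end, on the reasoning that interrupts, cache misses and task switches add cycles to a run rather than remove them.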
Title: Re: Michael Webster's code timing macros
Post by: qWord on May 21, 2012, 09:54:01 PM
The attachment contains an x64 version of the counter macro (for JWasm and ML64).
Title: Re: Michael Webster's code timing macros
Post by: Farabi on December 23, 2012, 09:19:27 AM
Can we make a list of the timings for all instructions?
Title: Re: Michael Webster's code timing macros
Post by: MichaelW on December 23, 2012, 01:21:54 PM
Hi Onan,

If I understand your question correctly, the answer is effectively no. Due to the large variations in processor design, the list would need to include timings for each processor family, and possibly for each individual model within a family. And there is the larger problem that the timing of an instruction depends increasingly on the instructions around it. For the earlier, much simpler processors, the Intel instruction set listings included instruction timings, but these timings were valid only under certain conditions. For example, from the 386 DX Programmer’s Reference:
Quote
The “Clocks” column gives the number of clock cycles the instruction takes to execute. The clock count calculations makes the following assumptions:

The instruction has been prefetched and decoded and is ready for execution.

Bus cycles do not require wait states.

There are no local bus HOLD request delaying processor access to the bus.

No exceptions are detected during instruction execution.

Memory operands are aligned.

With each succeeding processor family the number of conditions increased. For example, from the 486 DX Programmer’s Reference:
Quote
Data and instruction accesses hit in the cache.

The target of a jump instruction is in the cache.

No invalidate cycles contend with the instruction for use of the cache.

Page translation hits in the TLB.

Memory operands are aligned.

Effective address calculations use one base register and no index register, and the base register is not the destination register of the preceding instruction.

Displacement and immediate are not used together.

No exceptions are detected during instruction execution.

There are no write-buffer delays.

This continued to the point that, some time after the Pentium was introduced, Intel dropped the timings from the instruction set listings.

Another problem is that the TSC has a resolution of one clock cycle, while under the right conditions almost any recent processor can, depending on the instruction, execute more than one instruction per clock cycle.
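Because of that one-cycle resolution, a common workaround is to time many repetitions of a block and divide by the total number of instructions executed, which can yield averages below one cycle per instruction. A minimal sketch in C, assuming GCC or Clang on x86-64 with `__rdtsc()` from `<x86intrin.h>`; the function names are hypothetical. The average includes loop overhead, and on processors with an invariant TSC the counter ticks at a fixed reference rate rather than the current core clock, so the result is in TSC ticks, not necessarily core cycles.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

/* Average TSC ticks per operation when `loops` passes each ran
   `ops_per_loop` copies of the instruction under test. */
static double ticks_per_op(uint64_t total_ticks, uint64_t loops,
                           unsigned ops_per_loop)
{
    return (double)total_ticks / ((double)loops * ops_per_loop);
}

/* Time `loops` passes of 8 ADDs spread over 4 independent registers.
   The volatile asm stops the compiler from merging or deleting them. */
static double time_independent_adds(uint64_t loops)
{
    uint64_t a = 0, b = 0, c = 0, d = 0;
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < loops; i++) {
        __asm__ volatile("add $1, %0\n\t" "add $1, %1\n\t"
                         "add $1, %2\n\t" "add $1, %3\n\t"
                         "add $1, %0\n\t" "add $1, %1\n\t"
                         "add $1, %2\n\t" "add $1, %3"
                         : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
    }
    uint64_t end = __rdtsc();
    return ticks_per_op(end - start, loops, 8);
}
```

The four independent dependency chains give the processor room to issue several ADDs per cycle; timing a single ADD in isolation would show nothing but the TSC's resolution and overhead.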
Title: Re: Michael Webster's code timing macros
Post by: sinsi on January 08, 2013, 07:30:29 PM
Possible problem:
Code:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
    push count
    push number
    call ScaleBits1
    mov ebx,eax          ; EBX does not survive: the counter macros execute CPUID
counter_end
EBX did not hold the expected result; even "mov ebx,1" gave the same (wrong) value.
I had assumed the macro would not use EBX, but CPUID trashes it.
Title: Re: Michael Webster's code timing macros
Post by: dedndave on January 09, 2013, 12:46:16 AM
On the old forum, Michael had a second version that preserves EBX.
It also did the calculations differently, so that was not the only difference.
But you can modify the macro to preserve EBX   :P
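In the MASM macros, one way to make that change is a `push ebx` before each CPUID and a `pop ebx` after it. In C with GCC-style inline assembly, the equivalent is to list EBX as an output of the CPUID asm so the compiler never assumes a value it left there survives. A minimal sketch for x86-64, assuming GCC or Clang; `cpuid_raw` and `serialized_rdtsc` are hypothetical names, not part of MichaelW's macros.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

/* Execute CPUID for `leaf` (subleaf 0). EBX/RBX is in the output list,
   so the compiler knows CPUID overwrites it and will not keep a live
   value there across the instruction (the problem sinsi hit above). */
static inline void cpuid_raw(unsigned int leaf, unsigned int r[4])
{
    __asm__ volatile("cpuid"
                     : "=a"(r[0]), "=b"(r[1]), "=c"(r[2]), "=d"(r[3])
                     : "0"(leaf), "2"(0u)
                     : "memory");
}

/* CPUID is a serializing instruction, so earlier instructions
   complete before the TSC is read. */
static inline uint64_t serialized_rdtsc(void)
{
    unsigned int r[4];
    cpuid_raw(0, r);
    return __rdtsc();
}
```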
Title: Re: Michael Webster's code timing macros
Post by: Raistlin on September 05, 2016, 08:25:28 PM
Hi Masters (I'm only a master of disaster  :t),

Someone may have mentioned somewhere that this version of Michael's timing routines isn't thread-safe.
What exactly are the implications, and is there a workaround or a best-practice method for measuring
hardware multi-threaded code (apps or otherwise)? (Reason: my test pieces require individual thread timings and total workload metrics.)

Thanks in advance.
Raistlin
Title: Re: Michael Webster's code timing macros
Post by: MichaelW on September 08, 2016, 11:13:19 PM
The problem with using the previous counters in a multi-threaded app is that the loop counter is stored in a global variable, which every thread shares. I'm not at home, so my resources are limited, but I did manage to modify a set of 64-bit timer macros that I had previously created for GCC, replacing the loop counter variable with the nonvolatile register r12. The macros store the loop count in a global variable, a practical necessity for the C macros, but an assembly version of the code could reasonably return the loop count in RAX or EDX:EAX.
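MichaelW's fix keeps the loop counter in R12 inside GCC inline assembly. A simpler way to sketch the same principle in plain C is to give each thread its own context structure and keep the counter in a local variable, so no two threads ever update the same counter. This assumes POSIX threads and x86 `__rdtsc()`; `timing_ctx`, `timing_thread` and `run_threads` are hypothetical names, and the empty asm statement merely stands in for the code under test.

```c
#include <stdint.h>
#include <pthread.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

/* Hypothetical per-thread timing state: nothing here is shared. */
typedef struct {
    uint64_t loops;   /* iterations requested */
    uint64_t done;    /* iterations actually completed */
    uint64_t ticks;   /* TSC ticks for the whole run */
} timing_ctx;

static void *timing_thread(void *arg)
{
    timing_ctx *ctx = arg;
    uint64_t done = 0;                      /* local, so private to this thread */
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < ctx->loops; i++) {
        __asm__ volatile("" ::: "memory");  /* placeholder for the code under test */
        done++;
    }
    ctx->ticks = __rdtsc() - start;
    ctx->done = done;
    return NULL;
}

/* Start up to 8 timing threads, wait for all of them; 0 on success. */
static int run_threads(timing_ctx *ctx, int n)
{
    pthread_t t[8];
    int started = 0, rc = 0;
    if (n < 0 || n > 8)
        return -1;
    for (; started < n; started++) {
        if (pthread_create(&t[started], NULL, timing_thread, &ctx[started]) != 0) {
            rc = -1;
            break;
        }
    }
    for (int k = 0; k < started; k++)
        pthread_join(t[k], NULL);
    return rc;
}
```

If a thread migrates between cores during its run, the tick difference can be distorted on systems whose TSCs are not synchronized, which is one reason to pin threads as discussed below.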

The code times the CPUID instruction because it consumes a sizeable number of clock cycles, the exact number depending on the function number.
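To see the leaf dependence directly, one could time CPUID separately for each function number. A minimal sketch in C, assuming GCC or Clang on x86-64; `cpuid_ticks` is a hypothetical name, and the average includes loop overhead.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

/* Average TSC ticks per CPUID for one leaf (function number),
   over `loops` executions. Returns 0.0 when loops is 0. */
static double cpuid_ticks(unsigned int leaf, unsigned int loops)
{
    unsigned int a = 0, b = 0, c = 0, d = 0;
    uint64_t start = __rdtsc();
    for (unsigned int i = 0; i < loops; i++) {
        /* EBX is listed as an output, so the compiler knows CPUID overwrites it. */
        __asm__ volatile("cpuid"
                         : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                         : "0"(leaf), "2"(0u));
    }
    uint64_t end = __rdtsc();
    (void)a; (void)b; (void)c; (void)d;
    return loops ? (double)(end - start) / loops : 0.0;
}
```

Comparing `cpuid_ticks(0, 1000)` with another leaf on the same machine may show the difference. The figures vary by processor and are typically far larger under a hypervisor, where CPUID is intercepted.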

The source creates only two test threads in addition to the main thread, because my crappy test system has only two cores (both physical cores, no HTT support). Assuming everything works as I intended, the main thread is effectively blocked while the test threads are active. I could not test this, but I doubt that this code will work as expected on an HTT "core".

I tested running the cycle counts on any available core and on separate cores. Running on separate cores reduced the number of anomalies, but unfortunately did not eliminate them.

Note that GCC, unlike the Microsoft and Pelles compilers, supports 64-bit inline assembly.

Source code and compiled executable in the attachment.
Title: Re: Michael Webster's code timing macros
Post by: Raistlin on September 09, 2016, 12:02:07 AM
Code:
48
40

Threads running on any available core:
1: 219 cycles
2: 219 cycles
1: 220 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
1: 219 cycles
2: 219 cycles
1: 220 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
72
68

15
15

Threads running on separate cores:
2: 220 cycles
1: 220 cycles
2: 219 cycles
1: 220 cycles
2: 219 cycles
1: 219 cycles
2: 219 cycles
1: 220 cycles
2: 219 cycles
1: 220 cycles
2: 219 cycles
1: 220 cycles
2: 220 cycles
1: 219 cycles
2: 219 cycles
1: 220 cycles

Tested on HTT - Intel i5 HP ProDesk - Win 8.1

Seems to work! :t But obviously it doesn't test beyond the 2 cores, as per your description.
Title: Re: Michael Webster's code timing macros
Post by: rrr314159 on September 09, 2016, 12:17:33 AM
I made various versions of multi-threaded timers, for 32-bit and/or 64-bit. For instance, look at this code: http://masm32.com/board/index.php?topic=4832.msg51985#msg51985. It might not be obvious how to use the timer macros; if you're interested, ask.
Title: Re: Michael Webster's code timing macros
Post by: Siekmanski on September 09, 2016, 05:09:28 AM
i7-4930K CPU Windows 8.1

Code:
40
44

Threads running on any available core:
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
56
64

4095
4095

Threads running on separate cores:
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 284 cycles
Title: Re: Michael Webster's code timing macros
Post by: MichaelW on September 12, 2016, 03:59:57 PM
Why a consistent 257 vs 284/285, on a processor that apparently supports 12 threads (judging from the previous affinity mask 4095 = FFFh = 1111 1111 1111b), when the test app has only 2 active threads?
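For reference, an affinity mask is a bit vector in which bit n permits logical processor n. So the masks printed in this thread allow 4 (15), 8 (255), 12 (4095) and 2 (3) logical processors, and a mask of 3 permits logical processors 0 and 1 together rather than a single one. A small C sketch for decoding such masks; the helper names are hypothetical.

```c
#include <stdint.h>

/* Number of logical processors an affinity mask allows. */
static unsigned mask_count(uint64_t mask)
{
    unsigned n = 0;
    for (; mask; mask &= mask - 1)   /* clear the lowest set bit each pass */
        n++;
    return n;
}

/* Store the index of every set bit in out[]; return how many. */
static unsigned mask_to_indices(uint64_t mask, unsigned out[64])
{
    unsigned n = 0;
    for (unsigned bit = 0; bit < 64; bit++)
        if (mask & ((uint64_t)1 << bit))
            out[n++] = bit;
    return n;
}
```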
Title: Re: Michael Webster's code timing macros
Post by: Siekmanski on September 12, 2016, 05:43:15 PM
I think that because of HTT the 2 threads are running on the same core (AffinityMasks 1 & 2).
Try AffinityMasks 1 & 3 so they run on separate cores.
I can't test it; I don't have a C compiler installed.
Title: Re: Michael Webster's code timing macros
Post by: Raistlin on September 12, 2016, 06:56:21 PM
Yes, I think it's time we converted this into a workable, thread-safe macro include for 32 and 64 bits.

A sample test environment would need a switch parameter so that we can select all, even, or odd "cores" (HTT logical processors) or physical cores.

Hope that all made sense. :icon_rolleyes:
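The all/even/odd switch suggested above could map to masks like the ones below. A sketch in C, assuming at most 64 logical processors (one Windows processor group) and assuming the common enumeration in which hyperthread siblings get adjacent numbers (0/1, 2/3, ...). That numbering is not guaranteed; on Windows the actual core layout can be read with `GetLogicalProcessorInformation`. `lp_select` and `build_mask` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical selector for the proposed switch parameter. */
enum lp_select { LP_ALL, LP_EVEN, LP_ODD };

/* Build an affinity mask over the first `nlp` logical processors. */
static uint64_t build_mask(unsigned nlp, enum lp_select sel)
{
    uint64_t mask = 0;
    for (unsigned i = 0; i < nlp && i < 64; i++) {
        int take = (sel == LP_ALL) ||
                   (sel == LP_EVEN && i % 2 == 0) ||
                   (sel == LP_ODD && i % 2 == 1);
        if (take)
            mask |= (uint64_t)1 << i;
    }
    return mask;
}
```

With 12 logical processors the even mask is 0x555 (bits 0, 2, ..., 10). Under the adjacency assumption, giving one thread mask 1 (logical processor 0) and another mask 4 (logical processor 2) would place them on different physical cores.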
Title: Re: Michael Webster's code timing macros
Post by: hutch-- on September 13, 2016, 12:40:44 AM
Hehe, I still time algos with GetTickCount over a long enough sample. What you will have PHUN with is the Win10 64 thread scheduling, which is all over the place. A recent thread test I did and posted here shows how patchy the scheduling is: the thread completion order is nowhere near the thread start order.
Title: Re: Michael Webster's code timing macros
Post by: MichaelW on September 13, 2016, 04:07:25 PM
Quote
Try AffinityMasks 1 & 3 so they run on separate cores.

Title: Re: Michael Webster's code timing macros
Post by: Raistlin on September 13, 2016, 04:20:16 PM
Code:
44
48

Threads running on any available core:
1: 220 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
1: 222 cycles
2: 222 cycles
1: 220 cycles
2: 220 cycles
1: 221 cycles
2: 221 cycles
1: 220 cycles
2: 220 cycles
60
56

15
15

Threads running on separate cores:
2: 220 cycles
1: 221 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
1: 221 cycles
2: 221 cycles
1: 221 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
1: 221 cycles
2: 220 cycles
1: 220 cycles

HP Intel - i5 Prodesk 600 G1
Title: Re: Michael Webster's code timing macros
Post by: TWell on September 13, 2016, 04:47:03 PM
AMD Athlon II X2 220, 2 CPUs, Windows 10 Home 64-bit
Code:
88
92

Threads running on any available core:
1: 49 cycles
2: 49 cycles
1: 49 cycles
2: 49 cycles
1: 49 cycles
2: 0 cycles
1: 49 cycles
2: 49 cycles
2: 49 cycles
1: 49 cycles
1: 49 cycles
2: 0 cycles
1: 49 cycles
2: 49 cycles
1: 49 cycles
2: 49 cycles
104
96

3
3

Threads running on separate cores:
2: 49 cycles
1: 58 cycles
2: 49 cycles
1: 49 cycles
2: 49 cycles
1: 56 cycles
2: 49 cycles
1: 62 cycles
2: 49 cycles
2: 49 cycles
1: 62 cycles
2: 49 cycles
1: 64 cycles
2: 49 cycles
1: 57 cycles
1: 49 cycles
Title: Re: Michael Webster's code timing macros
Post by: Siekmanski on September 13, 2016, 06:56:31 PM
i7-4930K CPU, Windows 8.1 (AffinityMasks 1 & 3)
Code:
40
44

Threads running on any available core:
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
60
52

4095
4095

Threads running on separate cores:
2: 282 cycles
1: 283 cycles
2: 262 cycles
1: 263 cycles
2: 256 cycles
1: 257 cycles
2: 255 cycles
1: 256 cycles
2: 255 cycles
1: 256 cycles
2: 258 cycles
1: 258 cycles
2: 255 cycles
1: 255 cycles
2: 255 cycles
1: 255 cycles
Title: Re: Michael Webster's code timing macros
Post by: FORTRANS on September 14, 2016, 12:14:28 AM
Hi,

   Windows 8.1, i3, laptop.

Code:
44
48

Threads running on any available core:
1: 284 cycles
2: 285 cycles
1: 268 cycles
2: 274 cycles
1: 280 cycles
2: 281 cycles
1: 282 cycles
2: 286 cycles
1: 271 cycles
2: 271 cycles
1: 271 cycles
2: 270 cycles
1: 268 cycles
2: 269 cycles
1: 266 cycles
2: 267 cycles
76
72

15
15

Threads running on separate cores:
2: 268 cycles
1: 268 cycles
2: 267 cycles
1: 268 cycles
2: 267 cycles
1: 268 cycles
2: 268 cycles
1: 268 cycles
2: 268 cycles
1: 268 cycles
2: 267 cycles
1: 268 cycles
2: 267 cycles
1: 268 cycles
2: 268 cycles
1: 268 cycles

Steve
Title: Re: Michael Webster's code timing macros
Post by: TWell on September 21, 2016, 02:59:14 AM
Modified test13, built with mingw-w64 v6.2 without the MinGW headers and runtime, so the size is now 5 KB.
Only a minimal environment (gcc.exe, cc1.exe, as.exe, ld.exe) is needed for compilation.

GCC binaries here (https://sourceforge.net/projects/mingw-w64/files/Toolchains%20targetting%20Win64/Personal%20Builds/mingw-builds/6.2.0/threads-win32/seh/)

With this, anyone can compile their own tests without a full installation.
Title: Re: Michael Webster's code timing macros
Post by: Gunther on September 21, 2016, 09:55:11 PM
Results, Windows 7, desktop computer:
Code:
32
36

Threads running on any available core:
2: 231 cycles
1: 237 cycles
2: 231 cycles
1: 242 cycles
2: 231 cycles
1: 238 cycles
2: 231 cycles
1: 237 cycles
2: 231 cycles
1: 240 cycles
2: 231 cycles
1: 239 cycles
2: 231 cycles
1: 237 cycles
2: 231 cycles
1: 236 cycles
48
44

255
255

Threads running on separate cores:
2: 257 cycles
1: 258 cycles
2: 257 cycles
1: 258 cycles
2: 258 cycles
1: 258 cycles
2: 257 cycles
1: 258 cycles
2: 257 cycles
1: 258 cycles
2: 258 cycles
1: 258 cycles
2: 258 cycles
1: 258 cycles
2: 258 cycles
1: 258 cycles

Gunther