Topic: Michael Webster's code timing macros

hutch--

Michael Webster's code timing macros
« on: May 21, 2012, 03:47:53 PM »
This is a reposting from the old forum.
« Last Edit: May 21, 2012, 03:57:56 PM by hutch-- »

jj2007

Re: Michael Webster's code timing macros
« Reply #1 on: May 21, 2012, 05:37:29 PM »
The attached cyct_macros.inc uses MichaelW's timer macros (i.e. you need timers.zip attached above) but cuts off outliers, thus improving the consistency of timings especially on older CPUs such as the P4. Simple example:

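Code:
include \masm32\include\masm32rt.inc
; jj2007's macros, built on MichaelW's timers
include \masm32\macros\Cyct_Macros.inc

.code
start:
    ShowCpu
    ; assemble both measurements three times
    REPEAT 3
        ; first test: time an API call
        cyct_begin
            invoke GetTickCount
        cyct_end <GetTickCount>

        ; second test: time saving and restoring three registers
        cyct_begin
            push esi
            push edi
            push ebx
            nop
            pop ebx
            pop edi
            pop esi
        cyct_end <uses esi edi ebx>

        ; print CR LF between runs
        print chr$(13, 10)
    ENDM
    inkey "ok"
    exit
end start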
« Last Edit: May 21, 2012, 07:04:41 PM by jj2007 »
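
To illustrate what "cutting off outliers" can mean, here is a minimal C sketch. It is not the method used in cyct_macros.inc, whose exact rule is not shown in this thread. It assumes the cycle counts have already been collected into an array, and the name `trimmed_mean_cycles` and the percentage parameter are hypothetical. The slowest samples are the ones most likely to include an interrupt or a task switch, so the sketch sorts the samples, drops the top `drop_percent`, and averages the rest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for unsigned 64-bit cycle counts (ascending). */
static int cmp_cycles(const void *a, const void *b)
{
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place, discards the slowest drop_percent of them,
   and returns the integer mean of the remaining samples.
   Hypothetical helper, not taken from cyct_macros.inc. */
unsigned long long trimmed_mean_cycles(unsigned long long *samples,
                                       size_t n, unsigned drop_percent)
{
    assert(samples != NULL && n > 0 && drop_percent < 100);

    qsort(samples, n, sizeof samples[0], cmp_cycles);

    /* Number of samples kept; at least one because drop_percent < 100. */
    size_t keep = n - (n * drop_percent) / 100;

    unsigned long long sum = 0;
    for (size_t i = 0; i < keep; i++)
        sum += samples[i];
    return sum / keep;
}
```

With ten samples of about 220 cycles plus two interrupted runs of 4000 and 9000 cycles, dropping 20 percent removes exactly the two disturbed runs, so the result stays near 220 instead of being pulled above 1000.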

qWord

Re: Michael Webster's code timing macros
« Reply #2 on: May 21, 2012, 09:54:01 PM »
The attachment contains an x64 version of the counter macro (for JWasm and ml64).

Farabi

Re: Michael Webster's code timing macros
« Reply #3 on: December 23, 2012, 09:19:27 AM »
Could we make a complete list of instruction timings?

MichaelW

Re: Michael Webster's code timing macros
« Reply #4 on: December 23, 2012, 01:21:54 PM »
Hi Onan,

If I understand your question correctly, the answer is effectively no. Because processor designs vary so widely, the list would need to include timings for each processor family, and possibly for each individual model within a family. Then there is the larger problem that the timing of an instruction increasingly depends on the instructions around it. For the earlier, much simpler processors, the Intel instruction set listings included instruction timings, but these timings were valid only under certain conditions. For example, from the 386 DX Programmer’s Reference:
Quote
The “Clocks” column gives the number of clock cycles the instruction takes to execute. The clock count calculations makes the following assumptions:

The instruction has been prefetched and decoded and is ready for execution.

Bus cycles do not require wait states.

There are no local bus HOLD request delaying processor access to the bus.

No exceptions are detected during instruction execution.

Memory operands are aligned.

With each succeeding processor family the number of conditions increased. For example, from the 486 DX Programmer’s Reference:
Quote
Data and instruction accesses hit in the cache.

The target of a jump instruction is in the cache.

No invalidate cycles contend with the instruction for use of the cache.

Page translation hits in the TLB.

Memory operands are aligned.

Effective address calculations use one base register and no index register, and the base register is not the destination register of the preceding instruction.

Displacement and immediate are not used together.

No exceptions are detected during instruction execution.

There are no write-buffer delays.

This continued to the point that, some time after the Pentium was introduced, Intel dropped the timings from the instruction set listings.

And another problem is that the TSC has a resolution of one clock cycle, but under the right conditions anything resembling a recent processor can, depending on the instruction, execute more than one instruction per clock cycle.

sinsi

Re: Michael Webster's code timing macros
« Reply #5 on: January 08, 2013, 07:30:29 PM »
Possible problem:
Code:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
    push count
    push number
    call ScaleBits1
    mov ebx,eax         ; result kept in EBX, which the macro then overwrites
counter_end
EBX did not hold the expected result; even using "mov ebx,1" gave the same (wrong) result.
I was assuming the macro wouldn't be using EBX, but... CPUID trashes it.
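
CPUID always writes EAX, EBX, ECX and EDX, which is why a value parked in EBX does not survive the macros. The C sketch below is not part of the timing macros. It assumes GCC or Clang on x86 with the `<cpuid.h>` helper header, and `cpuid_vendor` is a hypothetical name. It reads leaf 0, whose vendor string comes back in EBX, EDX and ECX, so the overwrite of EBX is directly visible.

```c
#include <assert.h>
#include <string.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang helper header, x86 only */
#endif

/* Fills vendor[13] with the CPUID leaf-0 vendor string.
   Returns 1 on success, 0 if CPUID is unavailable (e.g. a non-x86 build).
   Hypothetical helper for illustration only. */
int cpuid_vendor(char vendor[13])
{
    assert(vendor != NULL);
    memset(vendor, 0, 13);
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    /* Leaf 0 returns the vendor string in EBX, EDX, ECX (in that order),
       so whatever was in EBX before the instruction is gone. */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    return 1;
#else
    return 0;
#endif
}
```

In assembly the usual remedy is to save EBX before the macro's CPUID and restore it afterwards, or to keep results in a register that CPUID does not touch, such as ESI or EDI.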

dedndave

Re: Michael Webster's code timing macros
« Reply #6 on: January 09, 2013, 12:46:16 AM »
On the old forum, Michael had a second version that preserves EBX.
It also did the calculations differently, so that wasn't the only difference.
But you can modify the macro to preserve EBX   :P

Raistlin

Re: Michael Webster's code timing macros
« Reply #7 on: September 05, 2016, 08:25:28 PM »
Hi Masters (I'm only a master of disaster  :t),

Someone may have mentioned somewhere that this version of Michael's timing routines isn't thread-safe.
What exactly are the implications of that, and is there a workaround or best-practice method for measuring
multi-threaded code (apps or other) running on multiple hardware threads? (Reasoning: I need individual thread timings and total workload metrics for my test pieces.)

Thanks in advance.
Raistlin

MichaelW

Re: Michael Webster's code timing macros
« Reply #8 on: September 08, 2016, 11:13:19 PM »
The problem with using the previous counters in a multi-threaded app is the loop counter stored in a global variable. I'm not at home so my resources are limited, but I did manage to modify a set of 64-bit timer macros that I had previously created for GCC, replacing the loop counter variable with the nonvolatile register r12. The macros store the cycle count in a global variable, a practical necessity for the C macros, but an assembly version of the code could reasonably return the cycle count in RAX or EDX:EAX.
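
The same idea can be sketched in plain C without inline assembly. This is not the code from the attachment. It assumes each thread owns a small state structure instead of sharing a global loop counter, and the names `timer_state`, `timer_start`, `timer_record` and `timer_average` are hypothetical. Because every function receives the state by pointer, two threads that interleave their calls cannot disturb each other's counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-thread timing state: nothing here is global or shared. */
typedef struct {
    uint32_t loops;         /* total number of timed iterations requested */
    uint32_t remaining;     /* iterations still to run (the "loop counter") */
    uint64_t total_cycles;  /* sum of the per-iteration cycle counts */
} timer_state;

/* Prepares a state for loop_count timed iterations. */
void timer_start(timer_state *t, uint32_t loop_count)
{
    assert(t != NULL && loop_count > 0);
    t->loops = loop_count;
    t->remaining = loop_count;
    t->total_cycles = 0;
}

/* Adds one measured iteration; returns nonzero while iterations remain. */
int timer_record(timer_state *t, uint64_t cycles)
{
    assert(t != NULL && t->remaining > 0);
    t->total_cycles += cycles;
    t->remaining--;
    return t->remaining != 0;
}

/* Integer average of the recorded cycle counts. */
uint64_t timer_average(const timer_state *t)
{
    assert(t != NULL && t->loops > 0);
    return t->total_cycles / t->loops;
}
```

A thread would declare its own `timer_state` on its stack, call `timer_start`, and feed each TSC difference to `timer_record`. Keeping the counter in a nonvolatile register such as r12 achieves the same isolation in assembly, since registers are saved and restored per thread.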

The code times the CPUID instruction because it consumes a sizeable number of clock cycles, more or less, depending on the function number.

The source creates only two test threads in addition to the main thread, because my crappy test system has only two cores (both physical cores, no HTT support). Assuming everything works as I intended, the main thread is effectively blocked while the test threads are active. I could not test this, but I doubt that this code will work as expected on an HTT "core".

I tested running the cycle counts on any available core and on separate cores. Running on separate cores reduced the number of anomalies, but unfortunately did not eliminate them.

Note that GCC, unlike the Microsoft and Pelles compilers, supports 64-bit inline assembly.

Source code and compiled executable in the attachment.

Raistlin

Re: Michael Webster's code timing macros
« Reply #9 on: September 09, 2016, 12:02:07 AM »
Code:
48
40

Threads running on any available core:
1: 219 cycles
2: 219 cycles
1: 220 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
1: 219 cycles
2: 219 cycles
1: 220 cycles
2: 220 cycles
1: 220 cycles
2: 220 cycles
72
68

15
15

Threads running on separate cores:
2: 220 cycles
1: 220 cycles
2: 219 cycles
1: 220 cycles
2: 219 cycles
1: 219 cycles
2: 219 cycles
1: 220 cycles
2: 219 cycles
1: 220 cycles
2: 219 cycles
1: 220 cycles
2: 220 cycles
1: 219 cycles
2: 219 cycles
1: 220 cycles

Tested on an HTT system: Intel i5 HP ProDesk, Win 8.1

Seems to work! :t  But obviously it doesn't test beyond the 2 cores, as per your description.

rrr314159

Re: Michael Webster's code timing macros
« Reply #10 on: September 09, 2016, 12:17:33 AM »
I made various versions of multi-thread timers, for 32-bit and/or 64-bit. For instance, look at this code: http://masm32.com/board/index.php?topic=4832.msg51985#msg51985. It might not be obvious how to use the timer macros; if you're interested, ask.

Siekmanski

Re: Michael Webster's code timing macros
« Reply #11 on: September 09, 2016, 05:09:28 AM »
i7-4930K CPU Windows 8.1

Code:
40
44

Threads running on any available core:
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
2: 257 cycles
1: 257 cycles
56
64

4095
4095

Threads running on separate cores:
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 285 cycles
2: 284 cycles
1: 284 cycles

MichaelW

Re: Michael Webster's code timing macros
« Reply #12 on: September 12, 2016, 03:59:57 PM »
Why a consistent 257 vs 284/285, on a processor that apparently supports 12 threads (judging from the previous affinity mask 4095 = FFFh = 1111 1111 1111b), when the test app has only 2 active threads?
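
The thread count can be read off an affinity mask by counting its set bits, one bit per logical processor. A minimal C sketch of that arithmetic follows; the function name `count_logical_processors` is hypothetical and the code is not taken from the test app.

```c
#include <assert.h>
#include <stdint.h>

/* Counts the set bits in an affinity mask; each set bit stands for one
   logical processor the process or thread may run on. */
unsigned count_logical_processors(uint64_t affinity_mask)
{
    unsigned count = 0;
    while (affinity_mask != 0) {
        affinity_mask &= affinity_mask - 1;  /* clear the lowest set bit */
        count++;
    }
    assert(count <= 64);
    return count;
}
```

For the mask 4095 (FFFh) this gives 12, and for the mask 15 printed by the i5 run earlier in the thread it gives 4.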

Siekmanski

Re: Michael Webster's code timing macros
« Reply #13 on: September 12, 2016, 05:43:15 PM »
I think it's because of HTT: the 2 threads are running on the same core (AffinityMasks 1 & 2).
Try AffinityMasks 1 & 3 so they run on separate cores.
I can't test it, as I don't have a C compiler installed.
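
When trying this, it helps to keep the mask value apart from the processor number: an affinity mask is a bit vector in which bit n selects logical processor n. The C sketch below uses the hypothetical helpers `affinity_mask_for` and `affinity_mask_contains` and only does the bit arithmetic, without calling any OS API. It shows that the numeric mask 3 selects logical processors 0 and 1 together, while the third logical processor alone (number 2) corresponds to the mask value 4.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the affinity mask that selects exactly one logical processor,
   numbered from 0. */
uint64_t affinity_mask_for(unsigned logical_cpu)
{
    assert(logical_cpu < 64);
    return (uint64_t)1 << logical_cpu;
}

/* Returns nonzero if the mask allows the given logical processor. */
int affinity_mask_contains(uint64_t mask, unsigned logical_cpu)
{
    assert(logical_cpu < 64);
    return (mask >> logical_cpu) & 1u;
}
```

Whether two logical processors share a physical core depends on how the system enumerates them, which this thread does not establish, so the pairing should be checked on the machine under test.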

Raistlin

Re: Michael Webster's code timing macros
« Reply #14 on: September 12, 2016, 06:56:21 PM »
Yes, I think it's time we converted this into a workable thread-safe macro include for 32 and 64 bits.

A test environment sample would need a switch parameter, so that we can select all, even or odd logical processors (the HTT "cores") or only the physical cores.
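
One possible shape for such a switch, as a C sketch only: the enum `cpu_select` and the function `select_processors` are hypothetical names, and the code assumes (this is not confirmed anywhere in the thread) that HTT siblings are enumerated as adjacent pairs, so that the even-numbered logical processors give one per physical core.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical switch values for the test environment. */
typedef enum { SELECT_ALL, SELECT_EVEN, SELECT_ODD } cpu_select;

/* Filters a system affinity mask down to all, even-numbered or
   odd-numbered logical processors (numbered from 0). */
uint64_t select_processors(uint64_t system_mask, cpu_select mode)
{
    assert(mode == SELECT_ALL || mode == SELECT_EVEN || mode == SELECT_ODD);
    switch (mode) {
    case SELECT_EVEN:
        return system_mask & 0x5555555555555555ULL;  /* bits 0, 2, 4, ... */
    case SELECT_ODD:
        return system_mask & 0xAAAAAAAAAAAAAAAAULL;  /* bits 1, 3, 5, ... */
    case SELECT_ALL:
    default:
        return system_mask;
    }
}
```

Applied to the 12-thread mask 4095 (FFFh), the even selection yields 555h and the odd selection AAAh, six logical processors each.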

Hope that all made sense. :icon_rolleyes: