Michael Webster's code timing macros

hutch--:
This is a reposting from the old forum.

jj2007:
The attached cyct_macros.inc builds on MichaelW's timer macros (so you also need the timers.zip attached above) but discards outliers, which makes the timings more consistent, especially on older CPUs such as the P4. A simple example:


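--- Code: ---
include \masm32\include\masm32rt.inc
include \masm32\macros\Cyct_Macros.inc

.code
start:  ShowCpu                     ; print the CPU identification
    REPEAT 3                        ; assemble both tests three times
        cyct_begin                  ; first timed block
            invoke GetTickCount
        cyct_end <GetTickCount>     ; the argument is the text shown with the result

        cyct_begin                  ; second timed block: save and restore three registers
            push esi
            push edi
            push ebx
            nop
            pop ebx
            pop edi
            pop esi
        cyct_end <uses esi edi ebx>

        print chr$(13, 10)          ; blank line between runs
    ENDM
    inkey "ok"                      ; wait for a key press
    exit
end start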

--- End code ---

qWord:
Attached is an x64 version of the counter macros (JWasm, ml64).

Farabi:
Can we make a list of all instruction timings?

MichaelW:
Hi Onan,

If I understand your question correctly, the answer is effectively no. Due to the large variations in processor design, the list would need to include timings for each processor family, and possibly for each individual model of a given family. And then there is the larger problem that the timing of an instruction increasingly depends on the instructions around it. For the earlier, much simpler processors, the Intel instruction set listings included instruction timings, but these timings were valid only under certain conditions. For example, from the 386 DX Programmer’s Reference:

--- Quote ---
The “Clocks” column gives the number of clock cycles the instruction takes to execute. The clock count calculations makes the following assumptions:

The instruction has been prefetched and decoded and is ready for execution.

Bus cycles do not require wait states.

There are no local bus HOLD request delaying processor access to the bus.

No exceptions are detected during instruction execution.

Memory operands are aligned.

--- End quote ---

With each succeeding processor family the number of conditions increased. For example, from the 486 DX Programmer’s Reference:

--- Quote ---
Data and instruction accesses hit in the cache.

The target of a jump instruction is in the cache.

No invalidate cycles contend with the instruction for use of the cache.

Page translation hits in the TLB.

Memory operands are aligned.

Effective address calculations use one base register and no index register, and the base register is not the destination register of the preceding instruction.

Displacement and immediate are not used together.

No exceptions are detected during instruction execution.

There are no write-buffer delays.

--- End quote ---

This continued to the point that, some time after the Pentium was introduced, Intel dropped the timings from the instruction set listings.

Another problem is that the TSC has a resolution of one clock cycle, while under the right conditions any reasonably recent processor can, depending on the instructions involved, execute more than one instruction per clock cycle.
