New cycle counter routines

Started by jj2007, September 17, 2015, 08:22:21 AM

dedndave

i have to say that i am very pleased with this iteration of my idea   :biggrin:
with the exception of a few outliers, it seems extremely stable on my quirky P4 with XP media center edition 2005

a single run of the test program...
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
0 0 0 0 0 0 0 0 0 455305 Iterations
15030 15038 15038 13380 15038 15038 15030 13381 15038 114344 Iterations
30038 30038 30038 30038 26715 30038 30038 30038 30038 64327 Iterations


the actual serialization code looks like this
(_dwClock2Lo and _dwClock2Hi are stack-frame variables)

notice that the registers are PUSH'ed in different order than POP'ed,
thus forcing the CPU to complete instructions before and after

        push    eax
        push    ecx
        push    edx
        push    ebx
        push    esp
        push    ebp
        push    esi
        push    edi

        rdtsc
        mov     _dwClock2Lo,eax
        mov     _dwClock2Hi,edx

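        ;the 8 POPs rebalance the stack but do not restore the registers:
        ;EAX, ECX, EDX, EBX, ESI and EDI are clobbered, ESP and EBP survive
        pop     eax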
        pop     ecx
        pop     edx
        pop     ebx
        pop     eax
        pop     ecx
        pop     esi
        pop     edi

dedndave

maybe i should have started a separate thread
can i get some members to run this for me ?   :t

zedd151

Quote from: dedndave on September 22, 2015, 02:52:38 PM
maybe i should have started a separate thread
can i get some members to run this for me ?   :t


Genuine Intel(R) CPU           T2060  @ 1.60GHz (SSE3)
0 0 0 0 0 0 -12 0 0 516805 Iterations
10020 10020 10020 10020 10032 10020 10020 10020 10020 99370 Iterations
20028 20016 20028 20028 20028 20028 20028 20040 20028 49035 Iterations
Press any key to continue ...



some time later:

I have run your test several times, dave.
It seems to have a strong affinity for the number twelve - that is,
when the counts are off, it is usually +/- 12, +/- 24, etc.

jj2007

Looks convincing, Dave :t

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz (SSE4)
0 0 0 0 0 0 0 0 0 797143 Iterations
10028 10028 10028 10028 10028 10028 10028 10028 10028 96273 Iterations
20028 20028 20028 20028 20028 20028 20028 20028 20028 49035 Iterations

sinsi


Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz (SSE4)
0 0 0 0 0 0 0 0 0 2528866 Iterations
9021 9018 9021 9021 9018 9018 9021 9018 9021 265816 Iterations
18018 18021 18021 18018 18018 18018 18018 18021 18018 99370 Iterations


Siekmanski

Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz (SSE4)
0 0 0 0 0 0 0 0 0 1589917 Iterations
10040 10032 10040 10040 10040 10032 10040 10040 10032 224319 Iterations
20040 20040 20040 20040 20040 20040 20040 20040 20040 118358 Iterations

TWell

AMD E-450 APU with Radeon(tm) HD Graphics (SSE4) 1.6 GHz
0 0 0 0 0 0 0 0 0 455305 Iterations
20010 20010 20010 20010 20010 20010 20010 20010 20010 32300 Iterations
40011 40011 40011 40011 40011 40011 40011 40011 40011 27642 Iterations
Press any key to continue ...

dedndave

many thanks, guys  :t

that looks pretty good, to me
there might be some things i can do to improve it
let me play with it a bit
i may want to put it in a LIB, so the code falls early in the .CODE section

it's pretty convenient to use, really
you supply your Function Under Test (FUT), of course
then, i let you supply a "NUL" reference function to compare it to
and, i let you supply the run time (ms) and both process and thread priorities (OR'ed together)

TestTime PROTO :DWORD,:DWORD,:LPVOID,:LPVOID
fnNUL    PROTO
fnFUT    PROTO


INVOKE  TestTime,500,HIGH_PRIORITY_CLASS or THREAD_PRIORITY_ABOVE_NORMAL,fnNUL,fnFUT

(no more playing with LoopCounts   :P )
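
A rough C sketch of the time-budget idea described above: instead of picking a loop count, call the function until a fixed amount of time has passed and count the iterations. This is not dave's TestTime code. The names `run_for_budget`, `tick_fn`, `test_fn`, `fake_now` and `fn_nul` are hypothetical, and the clock is passed in as a parameter so the sketch can be dry-run with a stand-in counter instead of RDTSC.

```c
#include <stdint.h>

/* Hypothetical types: a tick source and a parameterless function to time. */
typedef uint64_t (*tick_fn)(void);
typedef void (*test_fn)(void);

/* Call fn repeatedly until at least `budget` ticks have elapsed on the
   supplied clock, then return how many calls completed. */
uint32_t run_for_budget(tick_fn now, uint64_t budget, test_fn fn)
{
    uint64_t start = now();
    uint32_t iterations = 0;

    do {
        fn();
        iterations++;
    } while (now() - start < budget);

    return iterations;
}

/* Stand-in clock for dry runs (an assumption, not a real timer):
   every read advances the counter by one tick. */
uint64_t fake_ticks;
uint64_t fake_now(void) { return fake_ticks++; }

/* Empty reference function, in the spirit of fnNUL. */
void fn_nul(void) { }
```

With the stand-in clock reset to zero, `run_for_budget(fake_now, 10, fn_nul)` completes 10 iterations. A real harness would pass a tick source built on RDTSC or a system timer instead.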

the fnFUT might look something like this

fnFUT PROC

    INVOKE  StrLen,offset szTest
    ret

fnFUT ENDP


and, you'd subtract out the fnNUL time

fnNUL PROC

    ret

fnNUL ENDP
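
The subtraction of the fnNUL time can be sketched in C as follows. This is only an illustration under assumptions, not the code in the test program: `ticks_for_call`, `net_ticks` and the stand-in clock are hypothetical names, and the stand-in clock charges a fixed 5 ticks per read to play the role of measurement overhead. The result is signed because noise can make the reference run look slower than the measured one, as in the -12 reported earlier in the thread.

```c
#include <stdint.h>

/* Hypothetical types: a tick source and a parameterless function to time. */
typedef uint64_t (*tick_fn)(void);
typedef void (*test_fn)(void);

/* Ticks spent in one call of fn, including the cost of reading the clock. */
uint64_t ticks_for_call(tick_fn now, test_fn fn)
{
    uint64_t start = now();
    fn();
    return now() - start;
}

/* Net cost of fn_fut: measured time minus reference time.
   The overhead common to both runs cancels out. */
int64_t net_ticks(tick_fn now, test_fn fn_nul, test_fn fn_fut)
{
    uint64_t reference = ticks_for_call(now, fn_nul);
    uint64_t measured  = ticks_for_call(now, fn_fut);
    return (int64_t)(measured - reference);
}

/* Stand-ins for a dry run (assumptions, not real timing):
   each clock read costs 5 ticks, the function under test costs 100. */
uint64_t fake_ticks;
uint64_t fake_now(void) { fake_ticks += 5; return fake_ticks; }
void fn_nul(void) { }
void fn_fut(void) { fake_ticks += 100; }
```

In this dry run both calls carry the same 5-tick read overhead, so `net_ticks(fake_now, fn_nul, fn_fut)` comes out as the 100 ticks charged to `fn_fut`.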


the serialization technique is a bit faster than CPUID
and, if it's stable on my machine, it should be stable on everyone else's - lol

rrr314159

Tried it out,

the good news is, using your test it is, in fact, more stable than mfence (which I'm currently using for serialization). Didn't compare to cpuid since I assume you did that already.

However there are a couple of problems. Main one: this serialization technique blows away all the registers! I modified it to preserve registers, and it still works OK, though not as well. So for "on-the-fly" timing, which must not impact the running program by destroying registers, this technique is not so good. Of course mfence is a lot faster, and doesn't need "pushad / popad".

In your FUT you use only ecx, and in your timing code you preserve only ecx. So if you use other registers in the FUT, does the timing code still work? Didn't get around to checking that, but assume it does. If so, this looks like a better way to do "formal" timing but not good for "on-the-fly"

BTW zedd151 when cycle count is off, it can be by multiple of 12, but I also have seen 18, 3, ... If there's a pattern it's not obvious

zedd151

Quote from: rrr314159 on September 23, 2015, 03:00:57 AM
BTW zedd151 when cycle count is off, it can be by multiple of 12, but I also have seen 18, 3, ... If there's a pattern it's not obvious

I think it is my machine that is the problem, that's why I had such difficulty taming even the
'tried and true' timing/counting routines. :(

It seems to be running hot often.

Perhaps the airways need a good cleaning? (it's an old - 12 yr - laptop)

dedndave

rrr - it's not "inline", as you say
and, yes, the "ABI" registers are preserved - just not around individual calls

at the beginning of the routine...
;preserve registers and set up stack frame

    push    ebx                                        ;[EBP+12] = preserved EBX contents
    push    esi                                        ;[EBP+8]  = preserved ESI contents
    push    edi                                        ;[EBP+4]  = preserved EDI contents
    push    ebp                                        ;[EBP]    = preserved EBP contents
    mov     ebp,esp                                    ;EBP = stack frame base pointer


at the end...
;return measured time in EDX:EAX and iterations in ECX

    mov     eax,_dwMeasuredLo
    mov     edx,_dwMeasuredHi
    sub     eax,_dwReferenceLo
    mov     ecx,_dwIterations
    sbb     edx,_dwReferenceHi

;release stack frame and restore registers

    leave                  ;performs MOV ESP,EBP then POP EBP
    pop     edi
    pop     esi
    pop     ebx
    ret     16
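
For readers less used to SUB/SBB pairs, here is a small C model of the 64-bit subtraction done above on two 32-bit halves. The function name `sub64_by_halves` and its parameters are hypothetical. It only mirrors the arithmetic of the EDX:EAX result, not the routine itself.

```c
#include <stdint.h>

/* Model of SUB on the low dwords followed by SBB on the high dwords.
   Returns the 64-bit difference that would end up in EDX:EAX. */
uint64_t sub64_by_halves(uint32_t measured_lo, uint32_t measured_hi,
                         uint32_t reference_lo, uint32_t reference_hi)
{
    /* SUB sets the carry flag when the low subtraction borrows. */
    uint32_t borrow = measured_lo < reference_lo;

    uint32_t lo = measured_lo - reference_lo;           /* SUB, wraps mod 2^32 */
    uint32_t hi = measured_hi - reference_hi - borrow;  /* SBB takes the borrow */

    return ((uint64_t)hi << 32) | lo;
}
```

For example, a measured value of 0x0000000100000005 minus a reference of 0x00000000FFFFFFFF borrows from the high dword and gives 6. The MOV ECX between the SUB and the SBB in the listing is harmless because MOV does not change the flags.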


when the RDTSC operation is made, the only registers that matter are ESP and EBP

dedndave

also, i have written the current code so that the _SerializeA and _SerializeB macros
may be easily modified for trying other serialization methods

i did try CPUID

_SerializeA MACRO

        xor     eax,eax         ;select CPUID leaf 0
        cpuid                   ;serializing instruction; overwrites EAX, EBX, ECX and EDX

            ENDM

_SerializeB TEXTEQU <_SerializeA>