Author Topic: A Different Approach  (Read 3905 times)

rrr314159

  • Member
  • *****
  • Posts: 1381
Re: A Different Approach
« Reply #15 on: September 17, 2015, 10:26:46 AM »
Quote from: zedd151
My cpu behaves unpredictably in regards to using RDTSC especially. If I fiddle with the 'LoopCount' - only sometimes can I achieve stable and repeatable results. Most other times, there is doubling or quadrupling of the resulting cycle 'counts' - and often the results are just erratic.

How to Benchmark Code Execution Times on Intel mentions some reasons why RDTSC may be unsatisfactory.

Quote from: zedd151
Therefore, I decided to do some experimenting using GetTickCount. Yes GTC also has its faults, I am sure

- the big problem with it, the resolution is about 16 milliseconds. RDTSC resolution is more than a million times greater! Probably if you "rounded off" RDTSC to 16 ms, it would be very stable. U might also check out timeGetTime which is similar to GetTickCount but with some slight advantages. They both take much longer than RDTSC, hundreds of nanoseconds, but with resolution of 16 ms, that doesn't matter
« Last Edit: September 17, 2015, 03:27:00 PM by rrr314159 »
I am NaN ;)

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 4813
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: A Different Approach
« Reply #16 on: September 18, 2015, 12:16:50 AM »
You can improve the resolution using a multi-media timer but its a lot of messing around. On a multi-core machine it works OK but on a single core machine the multi-media overhead slows it all down.
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :biggrin:

zedd151

  • Member
  • **
  • Posts: 241
Re: A Different Approach
« Reply #17 on: September 18, 2015, 07:53:51 PM »
 :biggrin:  :biggrin:
Code: [Select]
        include \masm32\include\masm32rt.inc
        include \masm32\include\zeddlib.inc
        includelib \masm32\lib\zeddlib.lib
       
        zCount      proto :dword, :dword
        zTime       proto :dword, :dword
    .data
        align 16
        teststring db "00003456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123", 0
    .code
   
    start:

        fn zCount, addr TestProc1, "lstrlen"
        fn zCount, addr TestProc2, "StrLen"

        fn zTime, addr TestProc1, "lstrlen"
        fn zTime, addr TestProc2, "StrLen"

        print chr$( 13, 10)
       
        inkey chr$(13, 10, "-- done! --", 13, 10)
       
        exit

    TestProc1 proc
        invoke lstrlen, addr teststring
        ret
    TestProc1 endp

    TestProc2 proc
        invoke StrLen, addr teststring
        ret
    TestProc2 endp

    end start

 :shock:

Code: [Select]
align 16
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------
    zCount proc src:dword, fnName:dword                 ; src = address of test procedure, fnName = display text
        SleepMs =       250                             ; ZZZzzz
        LOOP_COUNT =    5000000                         ; calculated for short texts and/or small algos
        repeet     =    10                              ; number of consecutive runs of same test
        pushad                                          ; save the registers
   
        print fnName, 13, 10, 13, 10                    ; print current test title
        invoke Sleep, 1                                 ; get new time slice
       
        ; ------------------------------------------------------------------------------------
        counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS   ; run once extra, for 'warmup'
            call src                                    ; call test procedure
        counter_end
        ; ------------------------------------------------------------------------------------
               
; ------------------------------------------------------------------------------------
    repeat repeet
   
        ; ------------------------------------------------------------------------------------
        counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS   ; cycle counter

        call src                                        ; call test procedure

        counter_end
        ; ------------------------------------------------------------------------------------
       
        print str$(eax), 20h, " = cycles", 13, 10       ;   print result of current test
    endm
; ------------------------------------------------------------------------------------

        print chr$(13, 10)                              ; extra line feed for formatting
        popad                                           ; restore registers
        ret
    zCount endp
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------

 :P  :P
align 16
Code: [Select]
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------
    zTime proc src:dword, fnName:dword                  ; src = address of test procedure, fnName = display text
        LOOP_COUNT =    4000000                         ; 4 x 1000000
        repeet     =    10                              ; number of consecutive runs of same test
        pushad                                          ; save the registers

        print fnName, 13, 10, 13, 10                    ; print current test title
       
        ; ------------------------------------------------------------------------------------
        timer_begin LOOP_COUNT, HIGH_PRIORITY_CLASS     ; run once extra, for 'warmup'
            call src                                    ; call test procedure
        timer_end
        ; ------------------------------------------------------------------------------------
           
; ------------------------------------------------------------------------------------
    repeat repeet
   
        ; ------------------------------------------------------------------------------------
        timer_begin LOOP_COUNT, HIGH_PRIORITY_CLASS     ; high resolution timer (microseconds)

            call src                                    ; call test procedure

        timer_end
        shr eax, 2                                      ; divide by 4 to get same result as if LoopCount is 1000000
        ; ------------------------------------------------------------------------------------
       
        print str$(eax), 20h, " = microseconds", 13, 10 ; print result of current test
    endm
; ------------------------------------------------------------------------------------

        print chr$(13, 10)                              ; extra line feed for formatting
        popad                                           ; restore registers
        ret
    zTime endp
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------

 :greensml:

Code: [Select]
lstrlen

316  = cycles
313  = cycles
312  = cycles
315  = cycles
312  = cycles
313  = cycles
313  = cycles
312  = cycles
313  = cycles
313  = cycles

StrLen

118  = cycles
118  = cycles
118  = cycles
118  = cycles
117  = cycles
117  = cycles
117  = cycles
117  = cycles
118  = cycles
117  = cycles

lstrlen

195  = microseconds
196  = microseconds
195  = microseconds
195  = microseconds
196  = microseconds
197  = microseconds
195  = microseconds
196  = microseconds
196  = microseconds
196  = microseconds

StrLen

73  = microseconds
73  = microseconds
74  = microseconds
73  = microseconds
74  = microseconds
74  = microseconds
74  = microseconds
74  = microseconds
73  = microseconds
74  = microseconds
8)

The above procedures are once again based upon MichaelW's counter/timer macros.

Regarding all of these tests and experimenting that I have been doing with timers/counters,
I find it is always better to run your tests for both cycle counts AND time.

Both of them have their peculiarities, but it seems that even if one method fails, (hopefully they both don't)
you can always compare the results of the other method. This is just my point of view.

A lot of research has been done along these lines, and yes most of the methods researched ALL
have their flaws. I don't think that their exists a true perfect way to obtain accurate cycle counts,
or even timings with truly high precision.

The best we can hope for is a close approximation. Which, in most instances is good enough
for most purposes.

That is all I have to say about cycle counting and timing.


zedd
Links broken - to be remedied soon
----------------------------------------------
hutch: "MASM does not have bugs, it just has features you need to understand." 8)

zedd151

  • Member
  • **
  • Posts: 241
Re: A Different Approach
« Reply #18 on: September 18, 2015, 09:24:06 PM »
An addition to my library.

The tests in the above post are suitable for using with small algos, or
those which operate on small strings or data.

This one is for larger algos, or operating on a larger data set.

Code: [Select]
lstrlen
1718925  = cycles
1718997  = cycles
1715641  = cycles
1716976  = cycles
1713258  = cycles
1711389  = cycles
1711370  = cycles
1710355  = cycles
1710596  = cycles
1710472  = cycles

849788 = sizeof windows.inc ; added a validation of the algo used (for 'optimizing' similar functions)

StrLen
762612  = cycles
762458  = cycles
762232  = cycles
762239  = cycles
762812  = cycles
762311  = cycles
762612  = cycles
762117  = cycles
763243  = cycles
762298  = cycles

849788 = sizeof windows.inc

lstrlen
1071  = milliseconds
1072  = milliseconds
1075  = milliseconds
1072  = milliseconds
1071  = milliseconds
1071  = milliseconds
1074  = milliseconds
1076  = milliseconds
1073  = milliseconds
1071  = milliseconds

849788 = sizeof windows.inc

StrLen
483  = milliseconds
480  = milliseconds
480  = milliseconds
477  = milliseconds
477  = milliseconds
477  = milliseconds
477  = milliseconds
478  = milliseconds
478  = milliseconds
477  = milliseconds

849788 = sizeof windows.inc

The mid size counter:
Code: [Select]
align 16
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------
    zCountM proc src:dword, fnName:dword                 ; src = address of test procedure, fnName = display text
        SleepMs =       250                             ; ZZZzzz
        LOOP_COUNT =    5000                            ; calculated for short texts and/or small algos
        repeet     =    10                              ; number of consecutive runs of same test
        pushad                                          ; save the registers
   
        print fnName, 13, 10                            ; print current test title
        invoke Sleep, 1                                 ; get new time slice
       
        ; ------------------------------------------------------------------------------------
        counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS   ; run once extra, for 'warmup'
            call src                                    ; call test procedure
        counter_end
        ; ------------------------------------------------------------------------------------
               
; ------------------------------------------------------------------------------------
    repeat repeet
   
        ; ------------------------------------------------------------------------------------
        counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS   ; cycle counter

        call src                                        ; call test procedure

        counter_end
        ; ------------------------------------------------------------------------------------
       
        print str$(eax), 20h, " = cycles", 13, 10       ;   print result of current test
    endm
; ------------------------------------------------------------------------------------

        print chr$(13, 10)                              ; extra line feed for formatting
        popad                                           ; restore registers
        ret
    zCountM endp
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------

And the mid-size timer:

Code: [Select]
align 16
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------
    zTimeM proc src:dword, fnName:dword                 ; src = address of test procedure, fnName = display text
        LOOP_COUNT =    4000                            ; 4 x 1000
        repeet     =    10                              ; number of consecutive runs of same test
        pushad                                          ; save the registers

        print fnName, 13, 10                            ; print current test title
       
        ; ------------------------------------------------------------------------------------
        timer_begin LOOP_COUNT, HIGH_PRIORITY_CLASS     ; run once extra, for 'warmup'
            call src                                    ; call test procedure
        timer_end
        ; ------------------------------------------------------------------------------------
           
; ------------------------------------------------------------------------------------
    repeat repeet
   
        ; ------------------------------------------------------------------------------------
        timer_begin LOOP_COUNT, HIGH_PRIORITY_CLASS     ; high resolution timer (microseconds)

            call src                                    ; call test procedure

        timer_end
        shr eax, 2                                      ; divide by 4 to get same result as if LoopCount is 1000000
        ; ------------------------------------------------------------------------------------
       
        print str$(eax), 20h, " = milliseconds", 13, 10 ; print result of current test
    endm
; ------------------------------------------------------------------------------------

        print chr$(13, 10)                              ; extra line feed for formatting
        popad                                           ; restore registers
        ret
    zTimeM endp
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------
; ------------------------------------------------------------------------------------

Now, if I can, I will also add a large size counter and timer to my library.
But it will take eons for those to run. I'll grow older while waiting - lol.

The test:

Code: [Select]
        include \masm32\include\masm32rt.inc
        include \masm32\include\zeddlib.inc
        includelib \masm32\lib\zeddlib.lib
       
        zCount      proto :dword, :dword
        zTime       proto :dword, :dword
    .data
        align 16
        winincl     db "C:\masm32\include\windows.inc", 0
        winincmem   dd 0
    .code
   
    start:

        push esi
        invoke LoadFilex, addr winincl
        mov winincmem, esi
        pop esi
       
        fn zCountM, addr TestProc1, "lstrlen"
        invoke lstrlen, winincmem               ; validate test results by actually running the algo
        print str$(eax), " = sizeof windows.inc", 13, 10, 13, 10

        fn zCountM, addr TestProc2, "StrLen"
        invoke StrLen, winincmem                ; validate test results by actually running the algo
        print str$(eax), " = sizeof windows.inc", 13, 10, 13, 10

        fn zTimeM, addr TestProc1, "lstrlen"
        invoke lstrlen, winincmem               ; validate test results by actually running the algo
        print str$(eax), " = sizeof windows.inc", 13, 10, 13, 10

        fn zTimeM, addr TestProc2, "StrLen"
        invoke StrLen, winincmem                ; validate test results by actually running the algo
        print str$(eax), " = sizeof windows.inc", 13, 10, 13, 10

        print chr$( 13, 10)
       
        inkey chr$(13, 10, "-- done! --", 13, 10)
       
        exit

    TestProc1 proc
        invoke lstrlen, winincmem
        ret
    TestProc1 endp

    TestProc2 proc
        invoke StrLen, winincmem
        ret
    TestProc2 endp

    end start
Links broken - to be remedied soon
----------------------------------------------
hutch: "MASM does not have bugs, it just has features you need to understand." 8)

zedd151

  • Member
  • **
  • Posts: 241
Re: A Different Approach
« Reply #19 on: September 19, 2015, 01:37:54 AM »
One thing I like about having the timer/counter in a procedure, is you can put a call to it anywhere.
The way that this is done, the algorithm under test must be also in a procedure. Which is actually
okay if you think about it. - you are not restricted to having it in close proximity to the other
tests being made. - spacial displacement.

Also, since it is not a macro (yes a macro is used internally) no extra code is generated you can
run thousands of tests at a fraction of the size if everything was done with macros. (as I have
experimented with in a previous posting)

I did try working on a Large scale timer and counter, but either accuracy or speed would have to be
eliminated. You can't have both. I was working with a 100 MB text file, (actually random ascii chars)
and if the LOOPCOUNT was high enough to ensure more reliablility, then the timing would take
hours to complete, literally. If I reduced it to where the duration of the tests were more acceptable
then of course reliability and predictability suffered.

But who would try to open a 100 MB text file anyway.

The mid size counter/timers should be usable for most purposes. The original (small size)
timer/counter should be usable for more critical timing and cycle counting.

But as I stated in a previous post above, these timer/counter tests are only close
approximation, but should be sufficient for comparison purposes.

zedd
Links broken - to be remedied soon
----------------------------------------------
hutch: "MASM does not have bugs, it just has features you need to understand." 8)

MichaelW

  • Global Moderator
  • Member
  • *****
  • Posts: 1209
Re: A Different Approach
« Reply #20 on: September 19, 2015, 07:03:57 AM »
Results will of course vary from one cpu to the next due to different cpu speeds, cpu
load, additional processes running, etc.

Running on recent systems which implement thermal throttling (to control circuit temperature), which as far as I know includes virtually all recent systems, the clock speed can vary and this will cause a proportional variation in real-time timings, but will have no effect on clock-cycle timings.

Well Microsoft, here’s another nice mess you’ve gotten us into.

MichaelW

  • Global Moderator
  • Member
  • *****
  • Posts: 1209
Re: A Different Approach
« Reply #21 on: September 19, 2015, 07:26:56 AM »
Quote from: zedd151
My cpu behaves unpredictably in regards to using RDTSC especially. If I fiddle with the 'LoopCount' - only sometimes can I achieve stable and repeatable results. Most other times, there is doubling or quadrupling of the resulting cycle 'counts' - and often the results are just erratic.

How to Benchmark Code Execution Times on Intel mentions some reasons why RDTSC may be unsatisfactory.

Interesting, this is the second document I have seen where the "expert" author failed to control the CPUID function, introducing something like a 100-cycle variation in the CPUID overhead.
Well Microsoft, here’s another nice mess you’ve gotten us into.

rrr314159

  • Member
  • *****
  • Posts: 1381
Re: A Different Approach
« Reply #22 on: September 19, 2015, 01:15:11 PM »
hi MichaelW,

I have been using mfence because some Intel guy used it, I forget where. He thought it better than CPUID for serialization; after all, that's what it's designed for. Do u have an opinion about mfence?
I am NaN ;)

MichaelW

  • Global Moderator
  • Member
  • *****
  • Posts: 1209
Re: A Different Approach
« Reply #23 on: September 19, 2015, 02:21:16 PM »
Do u have an opinion about mfence?

I can't recall testing it, and I see no need because CPUID function 0 works well for me. Even with having to preserve EBX in my latest macros (GCC), and running on my Core i3, counts outside the range +/- 1 cycle are rare, for test code that ranges from single instructions to CRT function calls.

Quote
after all, that's what it's designed for

Yes, I have read that, and it does make sense, but Agner Fog has for a long time and still does recommend using CPUID function 0.

Quote
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction.

Per Agner Fog, you should serialize before and after each RDTSC, and RDTSC does not access memory, only a MSR.

Another possibility is RDTSCP, which "waits until all previous instructions have been executed before reading the counter." I did test RDTSCP, and could see no difference in the repeatability of the counts, compared to CPUID function 0, but decided against using it because (I assumed, without doing any research) it would be supported by far fewer systems.
Well Microsoft, here’s another nice mess you’ve gotten us into.

guga

  • Member
  • ****
  • Posts: 826
  • Assembly is a state of art.
    • RosAsm
Re: A Different Approach
« Reply #24 on: December 13, 2015, 11:19:25 PM »
Personally, i prefer using  QPC method since it seems to produce a more reliable results (But, it is a bit slow in huge loopings count. Example, above 3000000 (For usage with LOOP_COUNT, i suppose) it takes a couple of seconds to finish, whereas using the rdtsc version is a bit faster (But, unstable and not so accurate).

One question though.. The results that are being achieve are only the ones of the last loop on  ? If so, why ?

Even thinking that QPC works "better" then the raw version (rdtsc only), the results in edx/eax can be "negative", resulting in innacuracy. Ok, ok, rtdsc and QPC should not suppose to produce negative values (since they are positive 64 bit integers), but.........the fact is that they do result in a negative form of a Qword (LARGE_INTEGER), which for the testing purposes may represents wrong lattency for example, or a bad Throughput.

I´m making some tests storing the results and computing the standard deviation of them (The sample Std and population Std) and it shows that the values resulting in edx/eax or from QPC can, sometimes, be innacurate. This is because STD after analyzing the resultants values, displays a negative minimum STD, which means that the Standard Deviation is bigger then the Mean. In other words, some data collected during the Loop (The ones from edx/eax) are extremelly different then the rest of the data.

Example
Loop1 = 150 nanosecs
Loop2 = 124 nanosecs
Loop3 = 100 nanosecs
Loop4 = 6000 nanosecs <---- Here the results may vary and produce a negative STD minimum
Loop5 = 200 nanosecs
Loop6 = 140 nanosecs
Loop7 = 135 nanosecs

The STD values for those chain of data is:

Total Numbers: 7
Mean (Average): 978.42857

Standard deviation: 2214.51438
Variance(Standard deviation) :4904073.95238
Miinmum Standard STD: -1236.08

Population Standard deviation : 2050.24193
Variance(Population Standard deviation) :4203491.95918
Minimum STD (Population): -1071.813

The negative results are due to the 4th Loop (6000), whose value is extremely different from the other samples.

For benchmarking purposes, these can be interesting because we may be collecting bad samples from edx/eax and using them, instead simply ignoring.

One thing that may be interesting to analyse/compute is when the Standard Deviation produces negative results. (Std> Mean). I´m not sure if there is a math formula for this, but, it seeems that whenever thie inconsistencies will happens whenever one sample is bugger then 1 or smaller then the half of the next one. I mean, if you take the ration between on sample and the next one. If the resultant fraction is > 0.5 and smaller then 1 then the it will result on a negative minimum. But, i´m not sure yet of that assumption, since i didn´t found if there is a formula to calculate when the thus situation happens.

Another thing to consider is the differences between the Standard and Population STD results. It seems that the less differences between them means more stability. (I´ll make such tests later)

Or, the macros already fixes the "negative" qword problem ?

Another dumb question. The time of the results are being displayed as microseconds or nanoseconds ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

  • Member
  • *****
  • Posts: 7558
  • Assembler is fun ;-)
    • MasmBasic
Re: A Different Approach
« Reply #25 on: December 14, 2015, 12:49:26 AM »
The mysterious negative timings can be avoided with a simple macro:

Code: [Select]
UseCore MACRO CoNum
  invoke SetThreadAffinityMask, rv(GetCurrentThread), 1 shl (CoNum-1)
  ExternDef MbError0:NEAR ; used with WriteFile and SetWin$
  TestRetVal PROTO
  call TestRetVal
  Delay 1 ; <--- force to switch mode by dropping current time slice (Alex)
ENDM

Btw I would never use QPC inside a loop 8)

guga

  • Member
  • ****
  • Posts: 826
  • Assembly is a state of art.
    • RosAsm
Re: A Different Approach
« Reply #26 on: December 14, 2015, 01:08:10 AM »
I understand QPC may be slow for larger loops, but, isn´t it more accurate then "raw" method ? (rdtsc) ? I´m trying to build a app to display the different methods to test the accuracy of both.  (QPC and raw)

About the delay, i suppose it is meant for serialize. So, what about using lfence for that purpose ?

PS.: What is the final value ? I mean, what is being collected ? The mean/STD or simply the last one after XX loops ?
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

  • Member
  • *****
  • Posts: 7558
  • Assembler is fun ;-)
    • MasmBasic
Re: A Different Approach
« Reply #27 on: December 14, 2015, 04:01:08 AM »
simply the last one after XX loops ?

Exactly. You measure -(loop entry - loop exit), after 10,000 ... 1 Mio iterations, depending on the expected number of cycles. As Dave writes sometimes, make sure the loop takes about half a second.

guga

  • Member
  • ****
  • Posts: 826
  • Assembly is a state of art.
    • RosAsm
Re: A Different Approach
« Reply #28 on: December 14, 2015, 09:11:49 AM »
Tks, Jochen. I´m trying to analyze the different results.

But, instead using the loop entry/exit as a measure, i´ll use the Standard Deviations of them to guarantee accuracy.

I´m rebuilding a small timmings app made for RoAsm years ago and will adapt it to the one Steve made. I really really like the Gui version and the checkboxes running etc  :icon_mrgreen: Of course, steve used GetTickCount to achieve the results, but i´ll use it as an alternative method.

The goal is focus on the more accurated method to achieve the benchmark tests. As i said, i´ll use the rdtsc, QPC, gettickcount and *perhaps* a variation of the rdtsc using lfence to serialize (I found this method on a linux app btw).

Since i´plan to build a skeleton of a code profiler for RosAsm (I´ll finish it...well..eventually :greensml:), i need to test all this different methods in order to no ruin the performance of the assembler etc. The more accurate the better for the user write his own apps biasing on the resultant tests  :)

Btw. I attached the tests i made on steve´s App. Converted it to RosAsm
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

  • Member
  • ****
  • Posts: 826
  • Assembly is a state of art.
    • RosAsm
Re: A Different Approach
« Reply #29 on: December 14, 2015, 10:24:53 AM »
Updated test version. I forgot to remove a test i was making in functions using createthread . Fixed now ;)
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com