New timing macros

Started by jj2007, May 16, 2022, 09:42:21 AM


TimoVJL

AMD Athlon(tm) II X2 220 Processor
+0       Cycles for PI*100
+17      Cycles for PI*100/10
+1       Cycles for push & pop eax
+0       Cycles for empty loop
+16      Cycles for 10 * inc & dec eax
+16      Cycles for 10 * add eax,1 & sub eax,1
+16      Cycles for 10 * inc+dec with lea
481     kCycles for finding 'Duplicate' in Window.inc
+471     Cycles for finding 'test' with InString

+0       Cycles for PI*100
+17      Cycles for PI*100/10
+0       Cycles for push & pop eax
+0       Cycles for empty loop
+16      Cycles for 10 * inc & dec eax
+16      Cycles for 10 * add eax,1 & sub eax,1
+16      Cycles for 10 * inc+dec with lea
481     kCycles for finding 'Duplicate' in Window.inc
+476     Cycles for finding 'test' with InString

+0       Cycles for PI*100
+17      Cycles for PI*100/10
+1       Cycles for push & pop eax
+0       Cycles for empty loop
+16      Cycles for 10 * inc & dec eax
+16      Cycles for 10 * add eax,1 & sub eax,1
+16      Cycles for 10 * inc+dec with lea
481     kCycles for finding 'Duplicate' in Window.inc
+477     Cycles for finding 'test' with InString

+0       Cycles for PI*100
+17      Cycles for PI*100/10
+0       Cycles for push & pop eax
+0       Cycles for empty loop
+16      Cycles for 10 * inc & dec eax
+16      Cycles for 10 * add eax,1 & sub eax,1
+16      Cycles for 10 * inc+dec with lea
481     kCycles for finding 'Duplicate' in Window.inc
+477     Cycles for finding 'test' with InString

more (y)?

May the source be with you

jj2007

Thanks @everybody :thup:

I attach a new graphical version. Two interesting aspects:
- when sizing the window, the cycle count for ...
      fldpi
      fdiv FP4(0.1)
      fstp st
... is pretty stable at 18 cycles for my macros and 14 cycles for Michael Webster's macros. Why that difference? I can only guess that my "isolation" of the core code with nops all over the place plays a role:

Address   Hex dump          Command                                  Comments
00401403   .  0FA2          .cpuid
00401405   .  0F31          .rdtsc
00401407   .  A3 F0A34000   .mov [CyCtrdtsc], eax
0040140C   .  8915 F4A34000 .mov [40A3F4], edx
00401412   .  8B1D E4A34000 .mov ebx, [CyCtinner]
00401418   .  90            .nop
00401419   .  90            .nop
0040141A   .  90            .nop
0040141B   .  90            .nop
0040141C   .  90            .nop
0040141D   .  90            .nop
0040141E   .  90            .nop
0040141F   .  90            .nop
00401420   >  D9EB          .fldpi                                   ; loop start ---------------------
00401422   .  D835 68904000 .fdiv dword ptr [??0232]                 ; float 0.1000000
00401428   .  DDD8          .fstp st
0040142A   .  90            .nop
0040142B   .  90            .nop
0040142C   .  90            .nop
0040142D   .  90            .nop
0040142E   .  90            .nop
0040142F   .  90            .nop
00401430   .  90            .nop
00401431   .  90            .nop
00401432   .  4B            .dec ebx
00401433   .^ 79 EB         .jns short CyCtcode2_s                   ; jump back ---------------------
00401435   .  50            .push eax
00401436   .  51            .push ecx
00401437   .  52            .push edx
00401438   .  33C0          .xor eax, eax
0040143A   .  0FA2          .cpuid
0040143C   .  0F31          .rdtsc
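
The arithmetic behind a listing like this can be sketched in Python (used here only for illustration, this is not the source of the macros). rdtsc returns the counter split across edx:eax, and the dec ebx / jns pair runs the loop body once more than the initial value of ebx. Subtracting a separately measured empty-loop count is an assumption, suggested by the "empty loop" line in the results; all function and parameter names are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def tsc_value(edx: int, eax: int) -> int:
    """Combine the two 32-bit halves returned by rdtsc into one 64-bit count."""
    return ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)


def cycles_per_iteration(start: int, end: int, inner: int, overhead: int = 0) -> float:
    """Average cycles for one pass through the timed loop.

    inner    -- initial value loaded into ebx; dec ebx / jns repeats the body
                for ebx = inner, inner-1, ..., 0, i.e. inner + 1 times
    overhead -- count measured for the same loop with an empty body
                (assumed calibration step, not taken from the listing)
    """
    iterations = inner + 1
    elapsed = (end - start) & MASK64  # tolerate a 64-bit counter wraparound
    return (elapsed - overhead) / iterations


# Hypothetical readings: 1000 iterations taking 18000 cycles in total.
start = tsc_value(0, 1000)
end = tsc_value(0, 19000)
print(cycles_per_iteration(start, end, inner=999))
```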

HSE

 :biggrin:

Perhaps it is a bad idea to take the timings at the same time as painting (you are not requesting High Priority?).
Equations in Assembly: SmplMath

jj2007

Quote from: HSE on May 20, 2022, 09:55:41 AM
Perhaps it is a bad idea to take the timings at the same time as painting.

Well, it's not taken during the WM_PAINT processing. But apparently there is an influence. When quickly sizing the window (lots of painting), the cycle counts go down, though. I have absolutely no clue why that is so.

Anyway, for timing your algo you would use a console program. The GUI version is for illustrative purposes, i.e. for showing the skewed distribution of results.
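
A skewed distribution means that a few disturbed runs pull the average well above the typical value, so the minimum and the median usually say more than the mean. A minimal sketch in Python, with made-up sample values (not measurements from this thread) and a hypothetical helper name:

```python
from statistics import mean, median


def summarize(samples):
    """Return min, median, mean and max of a list of cycle counts."""
    ordered = sorted(samples)
    return {
        "min": ordered[0],
        "median": median(ordered),
        "mean": mean(ordered),
        "max": ordered[-1],
    }


# Hypothetical cycle counts: most runs sit near the floor, two were disturbed.
samples = [16, 16, 17, 16, 18, 16, 73, 17, 16, 240]
stats = summarize(samples)
print(stats)
```

With values like these, the two outliers dominate the mean while the median stays close to the minimum, which is the asymmetry the graphical version is meant to show.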

To build the source attached above, you need the latest MasmBasic version of 20 May 2022.

hutch--

Most recent version!

Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
+1       Cycles for PI*100
+12      Cycles for PI*100/10
+2       Cycles for push & pop eax
+0       Cycles for empty loop
+14      Cycles for 10 * inc & dec eax
+14      Cycles for 10 * add eax,1 & sub eax,1
+14      Cycles for 10 * inc+dec with lea
187     kCycles for finding 'Duplicate' in Window.inc
+145     Cycles for finding 'test' with InString

+1       Cycles for PI*100
+12      Cycles for PI*100/10
+2       Cycles for push & pop eax
+0       Cycles for empty loop
+14      Cycles for 10 * inc & dec eax
+14      Cycles for 10 * add eax,1 & sub eax,1
+14      Cycles for 10 * inc+dec with lea
187     kCycles for finding 'Duplicate' in Window.inc
+144     Cycles for finding 'test' with InString

+1       Cycles for PI*100
+12      Cycles for PI*100/10
+2       Cycles for push & pop eax
+0       Cycles for empty loop
+14      Cycles for 10 * inc & dec eax
+14      Cycles for 10 * add eax,1 & sub eax,1
+14      Cycles for 10 * inc+dec with lea
187     kCycles for finding 'Duplicate' in Window.inc
+144     Cycles for finding 'test' with InString

+1       Cycles for PI*100
+12      Cycles for PI*100/10
+2       Cycles for push & pop eax
+0       Cycles for empty loop
+14      Cycles for 10 * inc & dec eax
+14      Cycles for 10 * add eax,1 & sub eax,1
+14      Cycles for 10 * inc+dec with lea
187     kCycles for finding 'Duplicate' in Window.inc
+143     Cycles for finding 'test' with InString

more (y)?


What is the "more (y)?" for?  :tongue:

jj2007


hutch--

I did, nothing happened apart from exiting. Is that what the "y" is for?  :tongue:

jj2007

Strange, I get this from reply #27:

more (y)?

+73      Cycles for fldpi+1*fdiv+fstp st
+12     kCycles for finding 'test' with Instr_
328     kCycles for finding 'Duplicate' with Instr_
2492    kCycles for finding 'Duplicate' with InString
2553    kCycles for finding 'Duplicate' with crt strstr
720     kCycles for finding 'Duplicate' with Boyer-Moore

+16      Cycles for fldpi+1*fdiv+fstp st
+3286    Cycles for finding 'test' with Instr_
192     kCycles for finding 'Duplicate' with Instr_
2496    kCycles for finding 'Duplicate' with InString
2560    kCycles for finding 'Duplicate' with crt strstr
720     kCycles for finding 'Duplicate' with Boyer-Moore

+16      Cycles for fldpi+1*fdiv+fstp st
+3286    Cycles for finding 'test' with Instr_
192     kCycles for finding 'Duplicate' with Instr_
2494    kCycles for finding 'Duplicate' with InString
2560    kCycles for finding 'Duplicate' with crt strstr
719     kCycles for finding 'Duplicate' with Boyer-Moore

+17      Cycles for fldpi+1*fdiv+fstp st
+3309    Cycles for finding 'test' with Instr_
192     kCycles for finding 'Duplicate' with Instr_
2493    kCycles for finding 'Duplicate' with InString
2555    kCycles for finding 'Duplicate' with crt strstr
720     kCycles for finding 'Duplicate' with Boyer-Moore

+16      Cycles for fldpi+1*fdiv+fstp st
+3313    Cycles for finding 'test' with Instr_
192     kCycles for finding 'Duplicate' with Instr_
2491    kCycles for finding 'Duplicate' with InString
2558    kCycles for finding 'Duplicate' with crt strstr
720     kCycles for finding 'Duplicate' with Boyer-Moore

hit any key

felipe

Hi all, long time since my last reply/post in the forum  :sad:. One question please: generally speaking, what would be more accurate to test performance in a windows program, counting cycles or counting time (with QueryPerformanceCounter)?
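
For reference, the time-based route works like this: QueryPerformanceCounter returns a tick count and QueryPerformanceFrequency the number of ticks per second, so elapsed time is the tick difference divided by the frequency. A minimal sketch of that arithmetic in Python; the function name and the 10 MHz frequency are assumptions for illustration, the real frequency has to be queried at run time.

```python
def qpc_elapsed_microseconds(start_ticks: int, end_ticks: int, ticks_per_second: int) -> int:
    """Convert a difference of two counter readings to whole microseconds."""
    # Multiply before dividing so integer arithmetic keeps sub-second precision.
    return (end_ticks - start_ticks) * 1_000_000 // ticks_per_second


# Hypothetical readings with an assumed frequency of 10 MHz.
frequency = 10_000_000
print(qpc_elapsed_microseconds(5_000, 5_000 + 25_000, frequency))
```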

daydreamer

Quote from: felipe on May 24, 2022, 04:01:59 AM
Hi all, long time since my last reply/post in the forum  :sad:. One question please: generally speaking, what would be more accurate to test performance in a windows program, counting cycles or counting time (with QueryPerformanceCounter)?
Welcome back felipe  :thumbsup:
Counting cycles, but invoking gettimemillis before and after makes it easier to compare timings between your asm program and other languages, for example for very fast, huge Fibonacci calculations with matrices:
http://masm32.com/board/index.php?topic=9773.0

my none asm creations
https://masm32.com/board/index.php?topic=6937.msg74303#msg74303
I am an Invoker
"An Invoker is a mage who specializes in the manipulation of raw and elemental energies."
Like SIMD coding

jj2007

Quote from: felipe on May 24, 2022, 04:01:59 AM
counting cycles or counting time

Time varies with cpu speed, cycles don't.
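
As a rough illustration of that point: the same cycle count maps to different wall-clock times depending on the clock the core actually runs at. The sketch below uses the 187 kCycles reported above for the i7-5820K at its nominal 3.30 GHz; the halved clock is a hypothetical throttled state, and the function name is made up.

```python
def cycles_to_seconds(cycles: float, clock_hz: float) -> float:
    """Wall-clock time for a given cycle count at a given core clock."""
    return cycles / clock_hz


kcycles = 187  # 'Duplicate' search on the i7-5820K, from the results above
nominal = cycles_to_seconds(kcycles * 1000, 3.30e9)
throttled = cycles_to_seconds(kcycles * 1000, 1.65e9)  # hypothetical reduced clock

print(f"nominal:   {nominal * 1e6:.1f} microseconds")
print(f"throttled: {throttled * 1e6:.1f} microseconds")
```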

Quote from: daydreamer on May 24, 2022, 04:57:15 AM
invoke gettimemillis before and after

Interesting, can you post an example using gettimemillis?

felipe

This is an interesting topic. Let me quote an Intel page about clock cycles to help newcomers (like myself  :toothy:) a little bit:

Quote:
Sometimes, multiple instructions are completed in a single clock cycle; in other cases, one instruction might be handled over multiple clock cycles. Since different CPU designs handle instructions differently, it's best to compare clock speeds within the same CPU brand and generation.

For example, a CPU with a higher clock speed from five years ago might be outperformed by a new CPU with a lower clock speed, as the newer architecture deals with instructions more efficiently. An X-series Intel® processor might outperform a K-series processor with a higher clock speed, because it splits tasks between more cores and features a larger CPU cache. But within the same generation of CPUs, a processor with a higher clock speed will generally outperform a processor with a lower clock speed across many applications. This is why it's important to compare processors from the same brand and generation.

https://www.intel.com/content/www/us/en/gaming/resources/cpu-clock-speed.html

hutch--

Hi felipe,

Don't be led astray, the only reliable timing technique is real time[tm]. Any recent CPU does many things that affect attempts at close range timings: variable frequency depending on load, OS ring0 overrides, core sharing, CPU time slicing, CPU hardware differences, etc.

Timing is at best unreliable due to these factors, and the best I have come up with over many years is to design each test to match the algo's usage, anywhere from attack speed to sustained data transfer and everything in between. Run the test for long enough (over 1 second usually does the job).

Short duration tests run many times versus long duration tests for absolute transfer speed.
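
The "run it for long enough" advice can be sketched as a small harness that keeps calling the code under test until a minimum wall-clock time has passed. This is only an illustration in Python, assuming time.perf_counter as the clock (on Windows it is based on QueryPerformanceCounter); time_for_at_least and the sample workload are hypothetical.

```python
import time


def time_for_at_least(func, min_seconds=1.0):
    """Call func repeatedly until min_seconds have elapsed.

    Returns (number of calls, average seconds per call).
    """
    calls = 0
    start = time.perf_counter()
    while True:
        func()
        calls += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            return calls, elapsed / calls


# Sample workload: a substring search, loosely modelled on the tests above.
haystack = "hay" * 1000 + "Duplicate"


def workload():
    return haystack.find("Duplicate")


# A short duration keeps the demo quick; a real test would use 1.0 or more.
calls, per_call = time_for_at_least(workload, min_seconds=0.05)
print(calls, per_call)
```

A short-duration test run many times and a single long run both fit this shape: only the workload and min_seconds change.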

felipe

I share that vision Hutch. I mean, if you can see the program running fast, then you have finished the job. If you want your program to run fast on multiple CPUs, you test the program on all the different CPUs you can. If you have the task of constructing the fastest program in the world, you will still need to face the reality that the program will run differently on two different CPUs.  :icon_idea:

Although I recognize that counting cycles, I mean, bringing out strange numbers from this electronic world, can be fun... :mrgreen:

quarantined

off topic == on

Quote from: felipe on May 24, 2022, 04:01:59 AM
Hi all, long time since my last reply/post in the forum

Well, hello. I haven't seen you around here since forever. How are you?

off topic == off. Oops