I am trying to improve on the standard timer macros, and would appreciate some test runs. So far it looks ok but far from perfect...
TimerStart
void Instr_("This is just a stupid little test", "test", 0)
TimerEnd a small test
TimerStart
Let esi=FileRead$("\Masm32\include\Windows.inc")
void Instr_(esi, "Duplicate", 0)
TimerEnd a big test
TimerStart
fld FP8(10000.0)
fdiv FP4(10.0)
fdiv FP4(10.0)
fdiv FP4(10.0)
fdiv FP4(10.0)
fstp st
TimerEnd 4*fdiv
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
0 cycles for an empty loop
96 cycles for a small test
2965 kCycles for a big test
87 cycles for 4*fdiv
102 cycles for a small test
2950 kCycles for a big test
87 cycles for 4*fdiv
96 cycles for a small test
2960 kCycles for a big test
88 cycles for 4*fdiv
95 cycles for a small test
2963 kCycles for a big test
88 cycles for 4*fdiv
96 cycles for a small test
2962 kCycles for a big test
87 cycles for 4*fdiv
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz
-3 cycles for an empty loop
61 cycles for a small test
2286 kCycles for a big test
51 cycles for 4*fdiv
59 cycles for a small test
2235 kCycles for a big test
52 cycles for 4*fdiv
61 cycles for a small test
2252 kCycles for a big test
51 cycles for 4*fdiv
60 cycles for a small test
2234 kCycles for a big test
52 cycles for 4*fdiv
60 cycles for a small test
2224 kCycles for a big test
54 cycles for 4*fdiv
hit any key
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
0 cycles for an empty loop
63 cycles for a small test
2372 kCycles for a big test
75 cycles for 4*fdiv
60 cycles for a small test
2521 kCycles for a big test
75 cycles for 4*fdiv
65 cycles for a small test
2529 kCycles for a big test
75 cycles for 4*fdiv
65 cycles for a small test
2447 kCycles for a big test
75 cycles for 4*fdiv
64 cycles for a small test
2360 kCycles for a big test
75 cycles for 4*fdiv
hit any key
Thanks, Hector & Hutch :thup:
I try to stabilise timings by cutting off the lowest (2%) and highest (40%) results, and taking the average of the middle values. Unfortunately, even this sophisticated approach does not yield good results. Sometimes the entire array of results is higher or lower, and I have no good explanation why :sad:
Here is the full array for the "big test"
Let esi=FileRead$("\Masm32\include\Windows.inc")
void Instr_(esi, "Duplicate", 0)
0 2930312
1 2933809
2 2936058
3 2936948
4 2938885
5 2941745
6 2943788
7 2945243
8 2945521
9 2949879
10 2952848
11 2953736
12 2956294
13 2957591
14 2958842
15 2961609
16 2964733
17 2965709
18 2970148
19 2970306
20 2970579
21 2970582
22 2970948
23 2974688
24 2978464
25 2979094
26 2980370
27 2980833
28 2981564
29 2981591
30 2983309
31 2983336
32 2983730
33 2983836
34 2984424
35 2986352
36 2986682
37 2988949
38 2990821
39 2990963
40 2990973
41 2991821
42 2992367
43 2993306
44 2993379
45 2993466
46 2995312
47 2995399
48 2996222
49 2997051
50 2997827
51 2999610
52 2999851
53 3000382
54 3002397
55 3002476
56 3003954
57 3004109
58 3004753
59 3004775
60 3005360
61 3007000
62 3007688
63 3009229
64 3010270
65 3012121
66 3014591
67 3015858
68 3016028
69 3016749
70 3017688
71 3017825
72 3018927
73 3019824
74 3020443
75 3021909
76 3022003
77 3022600
78 3023700
79 3023961
80 3028300
81 3028418
82 3029500
83 3029745
84 3029827
85 3031285
86 3031724
87 3033169
88 3033377
89 3035666
90 3035970
91 3037459
92 3038854
93 3039341
94 3039362
95 3041824
96 3043073
97 3043763
98 3044064
99 3047979
100 3048806
101 3050252
102 3052958
103 3053943
104 3056554
105 3058038
106 3060349
107 3060516
108 3060972
109 3065804
110 3066998
111 3070333
112 3070509
113 3071900
114 3071960
115 3072677
116 3073488
117 3073609
118 3075852
119 3076869
120 3077982
121 3080533
122 3081342
123 3081397
124 3082915
125 3083890
126 3087896
127 3088278
128 3089543
129 3089922
130 3091488
131 3092003
132 3093339
133 3093891
134 3095475
135 3097324
136 3097631
137 3098403
138 3099164
139 3102406
140 3108633
141 3109394
142 3109521
143 3110369
144 3115745
145 3118991
146 3122158
147 3124303
148 3124874
149 3127251
150 3128369
151 3129200
152 3143797
153 3153240
154 3153543
155 3159927
156 3161839
157 3173825
158 3179803
159 3188800
160 3190745
161 3195179
162 3210408
163 3221058
164 3225581
165 3227312
166 3232076
167 3246871
168 3254988
169 3257103
170 3265397
171 3270494
172 3270606
173 3285873
174 3290494
175 3293318
176 3295071
177 3298356
178 3311715
179 3313875
180 3331991
181 3335859
182 3355524
183 3360715
184 3367790
185 3372294
186 3400711
187 3425828
188 3426903
189 3502082
190 3596308
191 3917123
192 4030047
193 4219088
194 8213100
195 8340445
196 8358491
197 8548839
198 8652628
199 8803120
200 8929203
2949 kCycles for a big test
For comparison, the 4*fdiv array:
0 172
1 172
2 172
3 172
4 172
5 172
6 172
7 172
8 172
9 172
10 172
11 172
12 172
13 172
14 172
15 172
16 172
17 172
18 172
19 172
20 172
21 172
22 172
23 172
24 172
25 172
26 172
27 172
28 172
29 172
30 172
31 172
32 172
33 172
34 172
35 172
36 172
37 172
38 172
39 172
40 172
41 172
42 172
43 172
44 172
45 172
46 172
47 172
48 172
49 172
50 172
51 172
52 172
53 172
54 172
55 172
56 172
57 172
58 172
59 172
60 172
61 172
62 172
63 172
64 172
65 172
66 172
67 172
68 172
69 172
70 172
71 172
72 172
73 172
74 172
75 172
76 172
77 172
78 172
79 172
80 172
81 172
82 172
83 172
84 172
85 172
86 172
87 172
88 172
89 175
90 175
91 175
92 175
93 175
94 175
95 175
96 175
97 175
98 175
99 175
100 175
101 175
102 175
103 175
104 175
105 175
106 175
107 175
108 175
109 175
110 175
111 175
112 175
113 175
114 175
115 175
116 175
117 175
118 175
119 175
120 175
121 175
122 175
123 175
124 175
125 175
126 175
127 175
128 175
129 175
130 175
131 175
132 175
133 175
134 175
135 175
136 175
137 175
138 175
139 175
140 175
141 175
142 175
143 175
144 175
145 175
146 178
147 185
148 185
149 185
150 185
151 185
152 185
153 185
154 185
155 185
156 185
157 185
158 185
159 185
160 185
161 185
162 185
163 185
164 185
165 185
166 185
167 185
168 185
169 185
170 185
171 185
172 185
173 185
174 185
175 185
176 185
177 185
178 185
179 185
180 185
181 185
182 185
183 185
184 185
185 185
186 185
187 185
188 185
189 185
190 185
191 185
192 185
193 185
194 185
195 185
196 185
197 185
198 185
199 185
200 185
201 185
202 185
203 185
204 185
205 185
206 185
207 185
208 185
209 185
210 185
211 185
212 185
213 185
214 185
215 185
216 185
217 185
218 185
219 185
220 185
221 185
222 185
223 185
224 185
225 185
226 185
227 185
228 185
229 185
230 185
231 185
232 185
233 185
234 185
235 185
236 185
237 185
238 185
239 185
240 185
241 185
242 185
243 185
244 185
245 185
246 185
247 185
248 185
249 185
250 185
251 185
252 185
253 185
254 185
255 185
256 185
257 185
258 185
259 185
260 185
261 185
262 185
263 185
264 185
265 185
266 185
267 185
268 185
269 185
270 185
271 185
272 185
273 185
274 185
275 185
276 185
277 185
278 185
279 185
280 185
281 185
282 185
283 185
284 185
285 185
286 185
287 185
288 185
289 185
290 185
291 185
292 185
293 185
294 185
295 185
296 185
297 185
298 185
299 185
300 185
301 185
302 185
303 185
304 185
305 185
306 185
307 185
308 185
309 185
310 185
311 185
312 185
313 185
314 185
315 185
316 185
317 185
318 185
319 185
320 185
321 185
322 185
323 185
324 185
325 185
326 185
327 185
328 185
329 185
330 185
331 185
332 185
333 185
334 185
335 185
336 185
337 185
338 185
339 185
340 185
341 185
342 185
343 185
344 185
345 185
346 185
347 185
348 185
349 185
350 185
351 185
352 185
353 185
354 185
355 185
356 185
357 185
358 185
359 185
360 185
361 185
362 185
363 185
364 185
365 185
366 185
367 185
368 185
369 185
370 185
371 185
372 185
373 185
374 185
375 185
376 185
377 185
378 185
379 185
380 185
381 185
382 185
383 185
384 185
385 185
386 185
387 185
388 185
389 185
390 185
391 185
392 185
393 185
394 185
395 185
396 185
397 185
398 185
399 185
400 185
401 185
402 185
403 185
404 185
405 185
406 185
407 185
408 185
409 185
410 185
411 185
412 185
413 185
414 185
415 185
416 185
417 185
418 185
419 185
420 185
421 185
422 185
423 185
424 185
425 185
426 185
427 185
428 185
429 185
430 185
431 185
432 185
433 185
434 185
435 185
436 185
437 185
438 185
439 185
440 185
441 185
442 185
443 185
444 185
445 185
446 185
447 185
448 185
449 185
450 185
451 185
452 185
453 185
454 185
455 185
456 185
457 185
458 185
459 185
460 185
461 185
462 185
463 185
464 185
465 185
466 185
467 185
468 185
469 185
470 185
471 185
472 185
473 185
474 185
475 185
476 185
477 185
478 185
479 185
480 185
481 185
482 185
483 185
484 185
485 185
486 185
487 185
488 185
489 185
490 185
491 185
492 185
493 185
494 185
495 185
496 185
497 185
498 185
499 185
500 185
501 185
502 185
503 185
504 185
505 185
506 185
507 185
508 185
509 185
510 185
511 185
512 185
513 185
514 185
515 185
516 185
517 185
518 185
519 185
520 185
521 185
522 185
523 185
524 185
525 185
526 185
527 185
528 185
529 185
530 185
531 185
532 185
533 185
534 185
535 185
536 185
537 185
538 185
539 185
540 185
541 185
542 185
543 185
544 185
545 185
546 185
547 185
548 185
549 185
550 185
551 185
552 185
553 185
554 185
555 185
556 185
557 185
558 185
559 185
560 185
561 185
562 185
563 185
564 185
565 185
566 185
567 185
568 185
569 188
570 188
571 188
572 188
573 188
574 188
575 188
576 188
577 188
578 188
579 188
580 188
581 188
582 188
583 188
584 188
585 188
586 188
587 188
588 188
589 188
590 188
591 188
592 188
593 188
594 188
595 188
596 188
597 188
598 188
599 188
600 188
601 188
602 188
603 188
604 188
605 188
606 188
607 188
608 188
609 188
610 188
611 188
612 188
613 188
614 188
615 188
616 188
617 188
618 188
619 188
620 188
621 188
622 188
623 188
624 188
625 188
626 188
627 188
628 188
629 188
630 188
631 188
632 188
633 188
634 188
635 188
636 188
637 188
638 188
639 188
640 188
641 188
642 188
643 188
644 188
645 188
646 188
647 188
648 188
649 188
650 188
651 188
652 188
653 188
654 188
655 188
656 188
657 188
658 188
659 188
660 188
661 188
662 188
663 188
664 188
665 188
666 188
667 188
668 188
669 188
670 188
671 188
672 188
673 188
674 188
675 188
676 188
677 188
678 188
679 188
680 188
681 188
682 188
683 188
684 188
685 188
686 188
687 188
688 188
689 188
690 188
691 188
692 188
693 188
694 188
695 188
696 188
697 188
698 188
699 188
700 188
701 188
702 188
703 188
704 188
705 188
706 188
707 188
708 188
709 188
710 188
711 188
712 188
713 188
714 188
715 188
716 188
717 188
718 188
719 188
720 188
721 188
722 188
723 188
724 188
725 188
726 188
727 188
728 188
729 188
730 188
731 188
732 188
733 188
734 188
735 188
736 188
737 188
738 188
739 188
740 188
741 188
742 188
743 188
744 188
745 188
746 188
747 188
748 188
749 188
750 188
751 188
752 188
753 188
754 188
755 188
756 188
757 188
758 188
759 188
760 188
761 188
762 188
763 188
764 188
765 188
766 188
767 188
768 188
769 188
770 188
771 188
772 188
773 188
774 188
775 188
776 188
777 188
778 188
779 188
780 188
781 188
782 188
783 188
784 188
785 188
786 188
787 188
788 188
789 188
790 188
791 188
792 188
793 188
794 188
795 188
796 188
797 188
798 188
799 188
800 188
801 188
802 188
803 188
804 188
805 188
806 188
807 188
808 188
809 188
810 188
811 188
812 188
813 188
814 188
815 188
816 188
817 188
818 188
819 188
820 188
821 188
822 188
823 188
824 188
825 188
826 188
827 188
828 188
829 188
830 188
831 188
832 188
833 188
834 188
835 188
836 188
837 188
838 188
839 188
840 188
841 188
842 188
843 188
844 188
845 188
846 188
847 188
848 188
849 188
850 188
851 188
852 188
853 188
854 188
855 188
856 188
857 188
858 188
859 188
860 188
861 188
862 188
863 188
864 188
865 188
866 188
867 188
868 188
869 188
870 188
871 188
872 188
873 188
874 188
875 188
876 188
877 188
878 188
879 188
880 188
881 188
882 188
883 188
884 188
885 188
886 188
887 188
888 188
889 188
890 188
891 188
892 188
893 188
894 188
895 188
896 188
897 188
898 188
899 188
900 188
901 188
902 188
903 188
904 188
905 188
906 188
907 188
908 188
909 188
910 188
911 188
912 188
913 188
914 188
915 188
916 188
917 188
918 188
919 188
920 188
921 188
922 188
923 188
924 188
925 188
926 188
927 188
928 188
929 188
930 188
931 188
932 188
933 188
934 188
935 188
936 188
937 188
938 188
939 188
940 188
941 188
942 188
943 188
944 188
945 188
946 188
947 188
948 188
949 188
950 191
951 191
952 191
953 191
954 191
955 191
956 191
957 191
958 278
959 278
960 281
961 288
962 288
963 291
964 291
965 291
966 291
967 291
968 291
969 291
970 291
971 291
972 291
973 291
974 291
975 291
976 291
977 291
978 291
979 291
980 291
981 294
982 294
983 294
984 294
985 294
986 294
987 294
988 294
989 294
990 294
991 294
992 294
993 294
994 294
995 294
996 297
997 297
998 397
999 400
1000 821
Note the little jumps:
88 172
89 175
...
145 175
146 178
147 185
How do you explain them? It's not the variable processor speed, as it's measured with rdtsc, i.e. cycles. Can't be interrupts either, they would cause much bigger hiccups. So what makes the processor insert 3 cycles here and there? :rolleyes:
Quote
I try to stabilise timings by cutting off the lowest (2%) and highest (40%) results, and taking the average of the middle values.
:badgrin: :badgrin: :rolleyes:
There are always going to be inconsistencies which seem to be inherent with the methods used in the timing macros. Not the macro authors fault though. I had once tried several different averaging methods using the timing macros and even a custom rewritten version of the macros. And I got a lot of grief from you jj for doing so.
You'll probably just have to live with the 'outliers'... :tongue:
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
0 cycles for an empty loop
98 cycles for a small test
5438 kCycles for a big test
108 cycles for 4*fdiv
98 cycles for a small test
5108 kCycles for a big test
108 cycles for 4*fdiv
98 cycles for a small test
4872 kCycles for a big test
108 cycles for 4*fdiv
98 cycles for a small test
4828 kCycles for a big test
108 cycles for 4*fdiv
98 cycles for a small test
4785 kCycles for a big test
108 cycles for 4*fdiv
hit any key
[code]
Yeah, the cycle count should'nt change I would think.
From Michael Webster's to other variations, they all suffer the same problem, CPU hardware has a whole pile of tricks that make timing at very close range, nearly impossible. Calculating the loop and other overhead varies from one CPU and computer setup to another and this is where the main problem lies.
Having tried many over time, I still revert to clocking algos in real time which is just about useless at very short durations. The alternative is to use far longer durations which means very large samples OR high loop counts.
Roughly, if you are looking for the sustain rate of an algo (absolute throughput) you use much longer samples where alternatively, if you want to clock the attack rate, you produce algos with very low overhead and run them through a high count of iterations. Dial a test for what you want an algo to perform and you will get something like useful "comparison" information.
Quote from: quarantined on May 16, 2022, 01:14:50 PMYou'll probably just have to live with the 'outliers'... :tongue:
...
Yeah, the cycle count should'nt change I would think.
The point is that I
don't live with the outliers, because I do eliminate the 40% slowest. And yet, there are runs that are
systematically some % slower. Same code, same position, so there is absolutely no reason why some fild [esp] ... fistp myvar (for example) sequence should take 80x 77 cycles in the first run, 80x 83 in the second and 80x 77 again in the third run. In all three runs, there are a few outliers with 99 or even 200 cycles, but they are thrown out by sorting and eliminating the slower ones.
It's not logical.
I attach a new version that allows (by pressing y at the end) to export the sorted arrays of timings. Have a look at the "tails", but also at the start of each array.
In theory,
the fastest should be the "true" result, because a cpu can't just drop some cycles if it feels in the right mood, right?
Sample top results (the 0 0 is a bug that I still have to fix):
Sorted timings for finding 'Duplicate' with crt strstr
0 0
1 2494306
2 2518611
3 2526306
Sorted timings for finding 'Duplicate' with crt strstr
0 0
1 2471897
2 2532000
3 2538188
Sorted timings for finding 'Duplicate' with crt strstr
0 0
1 2514656
2 2552965
3 2554821
Sorted timings for finding 'Duplicate' with crt strstr
0 0
1 2515663
2 2536230
3 2539026
Sorted timings for finding 'Duplicate' with crt strstr
0 0
1 2520319
2 2551161
3 2557537
Corresponding 'tails' (note #4, 10189333!):
198 4024483
199 4029707
200 4168568
2567 kCycles for finding 'Duplicate' with crt strstr
198 4039726
199 4358796
200 4849407
2568 kCycles for finding 'Duplicate' with crt strstr
198 3059861
199 3060097
200 3106652
2563 kCycles for finding 'Duplicate' with crt strstr
198 6868355
199 10076800
200 10189333
2588 kCycles for finding 'Duplicate' with crt strstr
198 4001836
199 4147888
200 4317080
2567 kCycles for finding 'Duplicate' with crt strstr
What probably would be a good idea to do if someone had the time is to create a lookup table with the published cycle count (Intel, AMD, other?) for the myriad of instructions for the target cpu. Then iterate through all the instructions in the algorithm and calculate the total cycle count via the cycle count lookup table to find exactly what the cycle count 'should' be. Of course for short sequences can be done manually.
Will depend on exact processor, architecture, etc., of course. Should avoid branching in test algo of course.
Just a thought.
RDTSC problem? Processor problem? Glitch in the proverbial MATRIX? Who knows?
I don't have the Force, so cannot provide the Jedi Mind Tricks that Agner Fog had up his sleeves. I had given up on cycle count comparisons in favor of running a given algo multiple times and getting the average running time for speed comparisons. Usually produces satisfactory results for "is algo A faster than algo B..."
Quote from: quarantined on May 17, 2022, 01:13:54 AMa lookup table with the published cycle count
The times when this used to work are gone, see instruction pipelining (https://en.wikipedia.org/wiki/Instruction_pipelining). Many instructions can nowadays run in parallel, others have to wait until the previous one has completed. It's very messy.
Just for fun... size the window to see the distribution of cycle counts, it's hilarious (and I was kind: I cut off the slowest 20) :cool:
CyCtStart
fldpi
fdiv FP4(0.1)
fstp st
CyCtEnd <fldpi+fdiv+fstp st>
CyCtStart
void Instr_("Hello World, this is a little test of MasmBasic's Instr macro", "macro")
CyCtEnd <Instr>
Quote from: jj2007 on May 17, 2022, 01:57:31 AM
The times when this used to work are gone, see instruction pipelining (https://en.wikipedia.org/wiki/Instruction_pipelining). Many instructions can nowadays run in parallel, others have to wait until the previous one has completed. It's very messy.
Well that really throws a spanner into the gears. I'll check out your latest when I'm back at my computer...
Hi JJ!
That look really wrong :biggrin:
Could be interesting comparations with MichaelW's timers and GetTickCount.
Hi Hector,
In a console proggie, as attached, the new macros perform much better than MichaelW's \Masm32\macros\timers.asm, especially for somewhat longer tests:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
23 Cycles for some short instructions
18 Cycles for fldpi+1*fdiv+fstp st
18 Cycles for fldpi+1*fdiv+fstp st
188 kCycles for finding 'Duplicate' with Instr_
2503 kCycles for finding 'Duplicate' with InString
2571 kCycles for finding 'Duplicate' with crt strstr
710 kCycles for finding 'Duplicate' with Boyer-Moore
18 Cycles for fldpi+1*fdiv+fstp st
18 Cycles for fldpi+1*fdiv+fstp st
191 kCycles for finding 'Duplicate' with Instr_
2494 kCycles for finding 'Duplicate' with InString
2574 kCycles for finding 'Duplicate' with crt strstr
713 kCycles for finding 'Duplicate' with Boyer-Moore
19 Cycles for fldpi+1*fdiv+fstp st
19 Cycles for fldpi+1*fdiv+fstp st
188 kCycles for finding 'Duplicate' with Instr_
2493 kCycles for finding 'Duplicate' with InString
2565 kCycles for finding 'Duplicate' with crt strstr
710 kCycles for finding 'Duplicate' with Boyer-Moore
21 Cycles for fldpi+1*fdiv+fstp st
24 Cycles for fldpi+1*fdiv+fstp st
188 kCycles for finding 'Duplicate' with Instr_
2490 kCycles for finding 'Duplicate' with InString
2573 kCycles for finding 'Duplicate' with crt strstr
710 kCycles for finding 'Duplicate' with Boyer-Moore
24 Cycles for fldpi+1*fdiv+fstp st
19 Cycles for fldpi+1*fdiv+fstp st
188 kCycles for finding 'Duplicate' with Instr_
2486 kCycles for finding 'Duplicate' with InString
2573 kCycles for finding 'Duplicate' with crt strstr
711 kCycles for finding 'Duplicate' with Boyer-Moore
The interesting question is why that is so dramatically different in a GUI application :rolleyes:
Quote from: jj2007 on May 17, 2022, 11:03:07 PM
the new macros perform much better than MichaelW's \Masm32\macros\timers.asm, especially for somewhat longer tests:
Doesn't matter that :biggrin:
Just interesting to see how are fluctuations of other timings in same graphic.
See yourself... Michael's macros have a loop inside the rdtsc ... test code ... rdtsc sequence.
It's a lot more stable. What is stack?
stack equ dword ptr [esp]
stack[4] = first argument
etc
But must be edi-10
Correct :thumbsup:
(wow, somebody reads and understands my code :rolleyes:)
I can read my own code! (most of the time )
Hi Hector,
Here is a console program using mostly CyCt* macros plus one of the MichaelW counters:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
23 Cycles for some short instructions
19 Cycles for fldpi+1*fdiv+fstp st
19 Cycles for fldpi+1*fdiv+fstp st
186 kCycles for finding 'Duplicate' with Instr_
2483 kCycles for finding 'Duplicate' with InString
2551 kCycles for finding 'Duplicate' with crt strstr
710 kCycles for finding 'Duplicate' with Boyer-Moore
2508 kCycles for InString using MichaelW's macro
18 Cycles for fldpi+1*fdiv+fstp st
19 Cycles for fldpi+1*fdiv+fstp st
186 kCycles for finding 'Duplicate' with Instr_
2480 kCycles for finding 'Duplicate' with InString
2548 kCycles for finding 'Duplicate' with crt strstr
710 kCycles for finding 'Duplicate' with Boyer-Moore
2534 kCycles for InString using MichaelW's macro
19 Cycles for fldpi+1*fdiv+fstp st
18 Cycles for fldpi+1*fdiv+fstp st
186 kCycles for finding 'Duplicate' with Instr_
2480 kCycles for finding 'Duplicate' with InString
2545 kCycles for finding 'Duplicate' with crt strstr
710 kCycles for finding 'Duplicate' with Boyer-Moore
2506 kCycles for InString using MichaelW's macro
19 Cycles for fldpi+1*fdiv+fstp st
18 Cycles for fldpi+1*fdiv+fstp st
186 kCycles for finding 'Duplicate' with Instr_
2487 kCycles for finding 'Duplicate' with InString
2551 kCycles for finding 'Duplicate' with crt strstr
710 kCycles for finding 'Duplicate' with Boyer-Moore
2521 kCycles for InString using MichaelW's macro
19 Cycles for fldpi+1*fdiv+fstp st
18 Cycles for fldpi+1*fdiv+fstp st
186 kCycles for finding 'Duplicate' with Instr_
2483 kCycles for finding 'Duplicate' with InString
2549 kCycles for finding 'Duplicate' with crt strstr
710 kCycles for finding 'Duplicate' with Boyer-Moore
2510 kCycles for InString using MichaelW's macro
MichaelW, Instring: 2508 2534 2506 2521 2510 2534-2506=28
CyCt*, Instring: 2483 2480 2480 2487 2483 2487-2480=7
Run the attachment, and you will see that the new macros have another advantage :cool:
:biggrin:
Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz
42 Cycles for some short instructions
31 Cycles for fldpi+1*fdiv+fstp st
25 Cycles for fldpi+1*fdiv+fstp st
211 kCycles for finding 'Duplicate' with Instr_
2093 kCycles for finding 'Duplicate' with InString
1886 kCycles for finding 'Duplicate' with crt strstr
712 kCycles for finding 'Duplicate' with Boyer-Moore
2118 kCycles for InString using MichaelW's macro
34 Cycles for fldpi+1*fdiv+fstp st
26 Cycles for fldpi+1*fdiv+fstp st
209 kCycles for finding 'Duplicate' with Instr_
2064 kCycles for finding 'Duplicate' with InString
1880 kCycles for finding 'Duplicate' with crt strstr
696 kCycles for finding 'Duplicate' with Boyer-Moore
2114 kCycles for InString using MichaelW's macro
34 Cycles for fldpi+1*fdiv+fstp st
25 Cycles for fldpi+1*fdiv+fstp st
210 kCycles for finding 'Duplicate' with Instr_
2081 kCycles for finding 'Duplicate' with InString
1883 kCycles for finding 'Duplicate' with crt strstr
696 kCycles for finding 'Duplicate' with Boyer-Moore
2117 kCycles for InString using MichaelW's macro
37 Cycles for fldpi+1*fdiv+fstp st
24 Cycles for fldpi+1*fdiv+fstp st
210 kCycles for finding 'Duplicate' with Instr_
2088 kCycles for finding 'Duplicate' with InString
1926 kCycles for finding 'Duplicate' with crt strstr
713 kCycles for finding 'Duplicate' with Boyer-Moore
2119 kCycles for InString using MichaelW's macro
38 Cycles for fldpi+1*fdiv+fstp st
21 Cycles for fldpi+1*fdiv+fstp st
209 kCycles for finding 'Duplicate' with Instr_
2088 kCycles for finding 'Duplicate' with InString
1926 kCycles for finding 'Duplicate' with crt strstr
695 kCycles for finding 'Duplicate' with Boyer-Moore
2117 kCycles for InString using MichaelW's macro
MichaelW, Instring: 2119-2114= 5
CyCt*, Instring: 2093-2064=29
YMMV :cool:
But in general, I get fairly consistent output, and at much, much greater speed. Which is an argument when you are trying to optimise an algo and it takes you many small changes.
More refining done... it starts to look convincing :tongue:
Can I have some timings, please?
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
+0 Cycles for PI*100
+16 Cycles for PI*100/10
+2 Cycles for push & pop eax
+0 Cycles for empty loop
+13 Cycles for 10 * inc & dec eax
+13 Cycles for 10 * add eax,1 & sub eax,1
+13 Cycles for 10 * inc+dec with lea
190 kCycles for finding 'Duplicate' in Window.inc
+189 Cycles for finding 'test' with InString
+0 Cycles for PI*100
+16 Cycles for PI*100/10
+4 Cycles for push & pop eax
+0 Cycles for empty loop
+13 Cycles for 10 * inc & dec eax
+16 Cycles for 10 * add eax,1 & sub eax,1
+13 Cycles for 10 * inc+dec with lea
190 kCycles for finding 'Duplicate' in Window.inc
+187 Cycles for finding 'test' with InString
+0 Cycles for PI*100
+16 Cycles for PI*100/10
+2 Cycles for push & pop eax
+0 Cycles for empty loop
+13 Cycles for 10 * inc & dec eax
+13 Cycles for 10 * add eax,1 & sub eax,1
+13 Cycles for 10 * inc+dec with lea
190 kCycles for finding 'Duplicate' in Window.inc
+187 Cycles for finding 'test' with InString
+0 Cycles for PI*100
+16 Cycles for PI*100/10
+2 Cycles for push & pop eax
+0 Cycles for empty loop
+13 Cycles for 10 * inc & dec eax
+13 Cycles for 10 * add eax,1 & sub eax,1
+13 Cycles for 10 * inc+dec with lea
190 kCycles for finding 'Duplicate' in Window.inc
+187 Cycles for finding 'test' with InString
Could you civilize this a bit. Would not run where I downloaded it and had to be installed on my dev drive.
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
+1 Cycles for PI*100
+12 Cycles for PI*100/10
+2 Cycles for push & pop eax
+0 Cycles for empty loop
+14 Cycles for 10 * inc & dec eax
+14 Cycles for 10 * add eax,1 & sub eax,1
+14 Cycles for 10 * inc+dec with lea
190 kCycles for finding 'Duplicate' in Window.inc
+163 Cycles for finding 'test' with InString
+1 Cycles for PI*100
+12 Cycles for PI*100/10
+2 Cycles for push & pop eax
+0 Cycles for empty loop
+14 Cycles for 10 * inc & dec eax
+14 Cycles for 10 * add eax,1 & sub eax,1
+14 Cycles for 10 * inc+dec with lea
189 kCycles for finding 'Duplicate' in Window.inc
+145 Cycles for finding 'test' with InString
+1 Cycles for PI*100
+12 Cycles for PI*100/10
+2 Cycles for push & pop eax
+0 Cycles for empty loop
+14 Cycles for 10 * inc & dec eax
+14 Cycles for 10 * add eax,1 & sub eax,1
+14 Cycles for 10 * inc+dec with lea
190 kCycles for finding 'Duplicate' in Window.inc
+145 Cycles for finding 'test' with InString
+1 Cycles for PI*100
+12 Cycles for PI*100/10
+2 Cycles for push & pop eax
+0 Cycles for empty loop
+14 Cycles for 10 * inc & dec eax
+14 Cycles for 10 * add eax,1 & sub eax,1
+14 Cycles for 10 * inc+dec with lea
190 kCycles for finding 'Duplicate' in Window.inc
+147 Cycles for finding 'test' with InString
Here is the civilised version, with an Inkey at the end. The other one runs fine from a DOS prompt, of course :biggrin:
This onje at least stops on the inkey but to run it, it must be copied to my dev drive. Would be a lot easier to run if I could just copy windows.inc to where it gets unzipped. I test on a ramdrive, not my dev drive.
Quote from: hutch-- on May 19, 2022, 09:13:15 PMif I could just copy windows.inc to where it gets unzipped
Now I understand the problem :biggrin:
I know one guy who is very proud about all paths in include files being hardcoded to the root as in \thatfolder\whatever :mrgreen:
Let esi=FileRead$("\Masm32\include\Windows.inc")
Attached a version with embedded Windows.inc, and some more timings. Don't complain that it's a fat attachment :cool:
; Let esi=FileRead$("\Masm32\include\Windows.inc")
Let esi=FileRead$(99) ; resource ID 99
Hi,
Two laptops.
Intel(R) Pentium(R) M processor 1.70GHz
+3 Cycles for PI*100
+31 Cycles for PI*100/10
+-1 Cycles for push & pop eax
+0 Cycles for empty loop
+16 Cycles for 10 * inc & dec eax
+16 Cycles for 10 * add eax,1 & sub eax,1
+14 Cycles for 10 * inc+dec with lea
+33 Cycles for finding 'Duplicate' in Window.inc
+458 Cycles for finding 'test' with InString
+3 Cycles for PI*100
+31 Cycles for PI*100/10
+-1 Cycles for push & pop eax
+0 Cycles for empty loop
+16 Cycles for 10 * inc & dec eax
+16 Cycles for 10 * add eax,1 & sub eax,1
+14 Cycles for 10 * inc+dec with lea
+33 Cycles for finding 'Duplicate' in Window.inc
+467 Cycles for finding 'test' with InString
+3 Cycles for PI*100
+31 Cycles for PI*100/10
+-1 Cycles for push & pop eax
+0 Cycles for empty loop
+16 Cycles for 10 * inc & dec eax
+16 Cycles for 10 * add eax,1 & sub eax,1
+14 Cycles for 10 * inc+dec with lea
+33 Cycles for finding 'Duplicate' in Window.inc
+423 Cycles for finding 'test' with InString
+3 Cycles for PI*100
+31 Cycles for PI*100/10
+-1 Cycles for push & pop eax
+0 Cycles for empty loop
+16 Cycles for 10 * inc & dec eax
+16 Cycles for 10 * add eax,1 & sub eax,1
+14 Cycles for 10 * inc+dec with lea
+33 Cycles for finding 'Duplicate' in Window.inc
+422 Cycles for finding 'test' with InString
more (y)?
+32 Cycles for fldpi+1*fdiv+fstp st
+3581 Cycles for finding 'test' with Instr_
+33 Cycles for finding 'Duplicate' with Instr_
3497 kCycles for finding 'Duplicate' with InString
3315 kCycles for finding 'Duplicate' with crt strstr
1190 kCycles for finding 'Duplicate' with Boyer-Moore
+32 Cycles for fldpi+1*fdiv+fstp st
+3582 Cycles for finding 'test' with Instr_
+33 Cycles for finding 'Duplicate' with Instr_
3497 kCycles for finding 'Duplicate' with InString
3315 kCycles for finding 'Duplicate' with crt strstr
1190 kCycles for finding 'Duplicate' with Boyer-Moore
+32 Cycles for fldpi+1*fdiv+fstp st
+3582 Cycles for finding 'test' with Instr_
+33 Cycles for finding 'Duplicate' with Instr_
3497 kCycles for finding 'Duplicate' with InString
3315 kCycles for finding 'Duplicate' with crt strstr
1190 kCycles for finding 'Duplicate' with Boyer-Moore
+32 Cycles for fldpi+1*fdiv+fstp st
+3581 Cycles for finding 'test' with Instr_
+33 Cycles for finding 'Duplicate' with Instr_
3497 kCycles for finding 'Duplicate' with InString
3315 kCycles for finding 'Duplicate' with crt strstr
1190 kCycles for finding 'Duplicate' with Boyer-Moore
+32 Cycles for fldpi+1*fdiv+fstp st
+3582 Cycles for finding 'test' with Instr_
+33 Cycles for finding 'Duplicate' with Instr_
3497 kCycles for finding 'Duplicate' with InString
3315 kCycles for finding 'Duplicate' with crt strstr
1190 kCycles for finding 'Duplicate' with Boyer-Moore
hit any key - bye
Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz
+1 Cycles for PI*100
+15 Cycles for PI*100/10
+3 Cycles for push & pop eax
+0 Cycles for empty loop
+17 Cycles for 10 * inc & dec eax
+17 Cycles for 10 * add eax,1 & sub eax,1
+17 Cycles for 10 * inc+dec with lea
227 kCycles for finding 'Duplicate' in Window.inc
+175 Cycles for finding 'test' with InString
+1 Cycles for PI*100
+15 Cycles for PI*100/10
+3 Cycles for push & pop eax
+0 Cycles for empty loop
+17 Cycles for 10 * inc & dec eax
+17 Cycles for 10 * add eax,1 & sub eax,1
+17 Cycles for 10 * inc+dec with lea
227 kCycles for finding 'Duplicate' in Window.inc
+174 Cycles for finding 'test' with InString
+1 Cycles for PI*100
+15 Cycles for PI*100/10
+3 Cycles for push & pop eax
+0 Cycles for empty loop
+17 Cycles for 10 * inc & dec eax
+17 Cycles for 10 * add eax,1 & sub eax,1
+17 Cycles for 10 * inc+dec with lea
227 kCycles for finding 'Duplicate' in Window.inc
+174 Cycles for finding 'test' with InString
+1 Cycles for PI*100
+15 Cycles for PI*100/10
+3 Cycles for push & pop eax
+0 Cycles for empty loop
+17 Cycles for 10 * inc & dec eax
+17 Cycles for 10 * add eax,1 & sub eax,1
+17 Cycles for 10 * inc+dec with lea
227 kCycles for finding 'Duplicate' in Window.inc
+175 Cycles for finding 'test' with InString
more (y)?
+33 Cycles for fldpi+1*fdiv+fstp st
+3522 Cycles for finding 'test' with Instr_
227 kCycles for finding 'Duplicate' with Instr_
2418 kCycles for finding 'Duplicate' with InString
2265 kCycles for finding 'Duplicate' with crt strstr
826 kCycles for finding 'Duplicate' with Boyer-Moore
+15 Cycles for fldpi+1*fdiv+fstp st
+3108 Cycles for finding 'test' with Instr_
227 kCycles for finding 'Duplicate' with Instr_
2419 kCycles for finding 'Duplicate' with InString
2264 kCycles for finding 'Duplicate' with crt strstr
825 kCycles for finding 'Duplicate' with Boyer-Moore
+15 Cycles for fldpi+1*fdiv+fstp st
+3108 Cycles for finding 'test' with Instr_
227 kCycles for finding 'Duplicate' with Instr_
2418 kCycles for finding 'Duplicate' with InString
2264 kCycles for finding 'Duplicate' with crt strstr
825 kCycles for finding 'Duplicate' with Boyer-Moore
+15 Cycles for fldpi+1*fdiv+fstp st
+3108 Cycles for finding 'test' with Instr_
227 kCycles for finding 'Duplicate' with Instr_
2419 kCycles for finding 'Duplicate' with InString
2264 kCycles for finding 'Duplicate' with crt strstr
825 kCycles for finding 'Duplicate' with Boyer-Moore
+15 Cycles for fldpi+1*fdiv+fstp st
+3109 Cycles for finding 'test' with Instr_
227 kCycles for finding 'Duplicate' with Instr_
2419 kCycles for finding 'Duplicate' with InString
2264 kCycles for finding 'Duplicate' with crt strstr
825 kCycles for finding 'Duplicate' with Boyer-Moore
hit any key - bye
Cheers,
Steve N.
Thanks, Steve, now it looks reasonably stable :thumbsup:
A typical usage example:
CyCtStart
fldpi
fmul FP4(100.0)
fistp testDD
CyCtEnd PI*100
CyCtStart
fldpi
fmul FP4(100.0)
fdiv FP4(10.0)
fistp testDD
CyCtEnd PI*100/10
CyCtStart
push eax
pop eax
CyCtEnd push & pop eax
AMD Athlon(tm) II X2 220 Processor
+0 Cycles for PI*100
+17 Cycles for PI*100/10
+1 Cycles for push & pop eax
+0 Cycles for empty loop
+16 Cycles for 10 * inc & dec eax
+16 Cycles for 10 * add eax,1 & sub eax,1
+16 Cycles for 10 * inc+dec with lea
481 kCycles for finding 'Duplicate' in Window.inc
+471 Cycles for finding 'test' with InString
+0 Cycles for PI*100
+17 Cycles for PI*100/10
+0 Cycles for push & pop eax
+0 Cycles for empty loop
+16 Cycles for 10 * inc & dec eax
+16 Cycles for 10 * add eax,1 & sub eax,1
+16 Cycles for 10 * inc+dec with lea
481 kCycles for finding 'Duplicate' in Window.inc
+476 Cycles for finding 'test' with InString
+0 Cycles for PI*100
+17 Cycles for PI*100/10
+1 Cycles for push & pop eax
+0 Cycles for empty loop
+16 Cycles for 10 * inc & dec eax
+16 Cycles for 10 * add eax,1 & sub eax,1
+16 Cycles for 10 * inc+dec with lea
481 kCycles for finding 'Duplicate' in Window.inc
+477 Cycles for finding 'test' with InString
+0 Cycles for PI*100
+17 Cycles for PI*100/10
+0 Cycles for push & pop eax
+0 Cycles for empty loop
+16 Cycles for 10 * inc & dec eax
+16 Cycles for 10 * add eax,1 & sub eax,1
+16 Cycles for 10 * inc+dec with lea
481 kCycles for finding 'Duplicate' in Window.inc
+477 Cycles for finding 'test' with InString
more (y)?
Thanks @everybody :thup:
I attach a new graphical version. Two interesting aspects:
- when sizing the window, the cycle count for ...
fldpi
fdiv FP4(0.1)
fstp st
... is pretty stable at 18 cycles for my macros and 14 cycles for Michael Webster's macros. Why that difference? I can only guess that my "isolation" of the core code with nops all over the place place a role:
Address Hex dump Command Comments
00401403 . 0FA2 .cpuid
00401405 . 0F31 .rdtsc
00401407 . A3 F0A34000 .mov [CyCtrdtsc], eax
0040140C . 8915 F4A34000 .mov [40A3F4], edx
00401412 . 8B1D E4A34000 .mov ebx, [CyCtinner]
00401418 . 90 .nop
00401419 . 90 .nop
0040141A . 90 .nop
0040141B . 90 .nop
0040141C . 90 .nop
0040141D . 90 .nop
0040141E . 90 .nop
0040141F . 90 .nop
00401420 > D9EB .fldpi ; loop start ---------------------
00401422 . D835 68904000 .fdiv dword ptr [??0232] ; float 0.1000000
00401428 . DDD8 .fstp st
0040142A . 90 .nop
0040142B . 90 .nop
0040142C . 90 .nop
0040142D . 90 .nop
0040142E . 90 .nop
0040142F . 90 .nop
00401430 . 90 .nop
00401431 . 90 .nop
00401432 . 4B .dec ebx
00401433 .^ 79 EB .jns short CyCtcode2_s ; jump back ---------------------
00401435 . 50 .push eax
00401436 . 51 .push ecx
00401437 . 52 .push edx
00401438 . 33C0 .xor eax, eax
0040143A . 0FA2 .cpuid
0040143C . 0F31 .rdtsc
:biggrin:
Perhaps is a bad idea to take the timing at same time that paint (you are not calling for High Priority¿?):
Quote from: HSE on May 20, 2022, 09:55:41 AM
Perhaps is a bad idea to take the timing at same time that paint.:
Well, it's
not taken during the WM_PAINT processing. But apparently there is an influence. When quickly sizing the window (lots of painting), the cycle counts go down, though. I have absolutely no clue why that is so.
Anyway, for timing your algo you would use a console program. The GUI version is for illustrative purposes, i.e. for showing the skewed distribution of results.
To build the source attached above, you need the latest MasmBasic version of 20 May 2022. (http://masm32.com/board/index.php?topic=94.0)
mOST RECENT VERSION !
Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
+1 Cycles for PI*100
+12 Cycles for PI*100/10
+2 Cycles for push & pop eax
+0 Cycles for empty loop
+14 Cycles for 10 * inc & dec eax
+14 Cycles for 10 * add eax,1 & sub eax,1
+14 Cycles for 10 * inc+dec with lea
187 kCycles for finding 'Duplicate' in Window.inc
+145 Cycles for finding 'test' with InString
+1 Cycles for PI*100
+12 Cycles for PI*100/10
+2 Cycles for push & pop eax
+0 Cycles for empty loop
+14 Cycles for 10 * inc & dec eax
+14 Cycles for 10 * add eax,1 & sub eax,1
+14 Cycles for 10 * inc+dec with lea
187 kCycles for finding 'Duplicate' in Window.inc
+144 Cycles for finding 'test' with InString
+1 Cycles for PI*100
+12 Cycles for PI*100/10
+2 Cycles for push & pop eax
+0 Cycles for empty loop
+14 Cycles for 10 * inc & dec eax
+14 Cycles for 10 * add eax,1 & sub eax,1
+14 Cycles for 10 * inc+dec with lea
187 kCycles for finding 'Duplicate' in Window.inc
+144 Cycles for finding 'test' with InString
+1 Cycles for PI*100
+12 Cycles for PI*100/10
+2 Cycles for push & pop eax
+0 Cycles for empty loop
+14 Cycles for 10 * inc & dec eax
+14 Cycles for 10 * add eax,1 & sub eax,1
+14 Cycles for 10 * inc+dec with lea
187 kCycles for finding 'Duplicate' in Window.inc
+143 Cycles for finding 'test' with InString
more (y)?
What is the "more (y)?" for ? :tongue:
Quote from: hutch-- on May 20, 2022, 10:48:51 AM
What is the "more (y)?" for ? :tongue:
Be courageous, type
y :biggrin:
I did, nothing happened apart from exiting, is that what the "y" is for ? :tongue:
Strange, I get this from reply #27 (http://masm32.com/board/index.php?topic=10060.msg110088#msg110088):
more (y)?
+73 Cycles for fldpi+1*fdiv+fstp st
+12 kCycles for finding 'test' with Instr_
328 kCycles for finding 'Duplicate' with Instr_
2492 kCycles for finding 'Duplicate' with InString
2553 kCycles for finding 'Duplicate' with crt strstr
720 kCycles for finding 'Duplicate' with Boyer-Moore
+16 Cycles for fldpi+1*fdiv+fstp st
+3286 Cycles for finding 'test' with Instr_
192 kCycles for finding 'Duplicate' with Instr_
2496 kCycles for finding 'Duplicate' with InString
2560 kCycles for finding 'Duplicate' with crt strstr
720 kCycles for finding 'Duplicate' with Boyer-Moore
+16 Cycles for fldpi+1*fdiv+fstp st
+3286 Cycles for finding 'test' with Instr_
192 kCycles for finding 'Duplicate' with Instr_
2494 kCycles for finding 'Duplicate' with InString
2560 kCycles for finding 'Duplicate' with crt strstr
719 kCycles for finding 'Duplicate' with Boyer-Moore
+17 Cycles for fldpi+1*fdiv+fstp st
+3309 Cycles for finding 'test' with Instr_
192 kCycles for finding 'Duplicate' with Instr_
2493 kCycles for finding 'Duplicate' with InString
2555 kCycles for finding 'Duplicate' with crt strstr
720 kCycles for finding 'Duplicate' with Boyer-Moore
+16 Cycles for fldpi+1*fdiv+fstp st
+3313 Cycles for finding 'test' with Instr_
192 kCycles for finding 'Duplicate' with Instr_
2491 kCycles for finding 'Duplicate' with InString
2558 kCycles for finding 'Duplicate' with crt strstr
720 kCycles for finding 'Duplicate' with Boyer-Moore
hit any key
Hi all, long time since my last reply/post in the forum :sad:. One question please: generally speaking, what would be more accurate to test performance in a windows program, counting cycles or counting time (with QueryPerformanceCounter)?
Quote from: felipe on May 24, 2022, 04:01:59 AM
Hi all, long time since my last reply/post in the forum :sad:. One question please: generally speaking, what would be more accurate to test performance in a windows program, counting cycles or counting time (with QueryPerformanceCounter)?
Welcome back felipe :thumbsup:
Counting cycles,but invoke gettimemillis before and after easier to compare between your asm program and other languages timings for example very fast, huge fibonnaci calculations with matrices
http://masm32.com/board/index.php?topic=9773.0 (http://masm32.com/board/index.php?topic=9773.0)
Quote from: felipe on May 24, 2022, 04:01:59 AMcounting cycles or counting time
Time varies with cpu speed, cycles don't.
Quote from: daydreamer on May 24, 2022, 04:57:15 AMinvoke gettimemillis before and after
Interesting, can you post an example using
gettimemillis?
This is an interesting topic. Let me put a quote from a page of intel about clock cycles to help a little bit to new ones (like my self :toothy:):
QuoteSometimes, multiple instructions are completed in a single clock cycle; in other cases, one instruction might be handled over multiple clock cycles. Since different CPU designs handle instructions differently, it's best to compare clock speeds within the same CPU brand and generation.
For example, a CPU with a higher clock speed from five years ago might be outperformed by a new CPU with a lower clock speed, as the newer architecture deals with instructions more efficiently. An X-series Intel® processor might outperform a K-series processor with a higher clock speed, because it splits tasks between more cores and features a larger CPU cache. But within the same generation of CPUs, a processor with a higher clock speed will generally outperform a processor with a lower clock speed across many applications. This is why it's important to compare processors from the same brand and generation.
https://www.intel.com/content/www/us/en/gaming/resources/cpu-clock-speed.html (https://www.intel.com/content/www/us/en/gaming/resources/cpu-clock-speed.html)
Hi felipe,
Don't be lead astray, the only reliable timing technique is real time[tm]. Any recent CPU does many things that effect trying to perform close range timings, variable frequency depending on load, OS ring0 over-rides, core sharing, CPU time slicing, CPU hardware differences etc etc etc ....
Timing is at best unreliable due to these factors and the best I have come up with over many years is to design each test to match the algo's usage. Run the test for long enough, (over 1 second usually does the job). With the choice of attack speed versus sustained data transfer and everything inbetween.
Short duration tests run many times versus long duration tests for absolute transfer speed.
I share that vision Hutch, i mean, if you can see the program running fast then you finished the job. If you want your program to run fast on multiple cpu's, you test the program in all the different cpu's you can. If you have the task of constructing the fastest program in the world, you will still need to face the reallity that the program will run different in two different cpu's. :icon_idea:
Although i recognize that counting cycles, i mean, bringing out strange numbers from this electronic world, can be fun... :mrgreen:
off topic == on
Quote from: felipe on May 24, 2022, 04:01:59 AM
Hi all, long time since my last reply/post in the forum
Well, hello. I haven't seen you around here since forever. How are you?
off topic == off. Oops
Quote from: felipe on May 24, 2022, 07:59:48 AMif you can see the program running fast then you finished the job.
Sure, but between starting the work and finishing, there may be weeks or years. In the meantime, you might want to know which instructions are the fastest in the 100+ innermost loops of your program. For example, are inc, add and lea any different? Should you prefer one of them? Does one of them really suck on AMD cpus, or on the cheap Celerons that your company uses?
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
once:
0 Cycles for inc eax
0 Cycles for add eax,1
0 Cycles for lea eax,[eax+1]
0 Cycles for inc eax
0 Cycles for add eax,1
0 Cycles for lea eax,[eax+1]
0 Cycles for inc eax
0 Cycles for add eax,1
0 Cycles for lea eax,[eax+1]
0 Cycles for inc eax
0 Cycles for add eax,1
0 Cycles for lea eax,[eax+1]
100x:
87 Cycles for 100*inc eax
96 Cycles for 100*add eax,1
84 Cycles for 100*lea eax,[eax+1]
84 Cycles for 100*inc eax
81 Cycles for 100*add eax,1
83 Cycles for 100*lea eax,[eax+1]
84 Cycles for 100*inc eax
85 Cycles for 100*add eax,1
84 Cycles for 100*lea eax,[eax+1]
84 Cycles for 100*inc eax
78 Cycles for 100*add eax,1
84 Cycles for 100*lea eax,[eax+1]
1 bytes for inc eax
3 bytes for add eax, 1
3 bytes for lea eax, [eax+1]
100 bytes for 100*inc eax
300 bytes for 100*add eax, 1
300 bytes for 100*lea eax, [eax+1]
Actually that's a good point jj. You will need to test during the development. How you test performance specifically i suppose will always be a function (at least) of the type of program you are doing.