pre-P4 (SSE1)
loop overhead is approx. 208/100 cycles
883 cycles for 100 * pushpush
802 cycles for 100 * sub8
813 cycles for 100 * pushpush fst
806 cycles for 100 * sub8 fst
883 cycles for 100 * pushpush
803 cycles for 100 * sub8
809 cycles for 100 * pushpush fst
806 cycles for 100 * sub8 fst
881 cycles for 100 * pushpush
802 cycles for 100 * sub8
817 cycles for 100 * pushpush fst
807 cycles for 100 * sub8 fst
P4 Northwood:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
loop overhead is approx. 231/100 cycles
1027 cycles for 100 * pushpush
998 cycles for 100 * sub8
1276 cycles for 100 * pushpush fst
1002 cycles for 100 * sub8 fst
1161 cycles for 100 * pushpush
988 cycles for 100 * sub8
1201 cycles for 100 * pushpush fst
989 cycles for 100 * sub8 fst
1120 cycles for 100 * pushpush
988 cycles for 100 * sub8
1216 cycles for 100 * pushpush fst
989 cycles for 100 * sub8 fst
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
+19 of 20 tests valid, loop overhead is approx. 252/100 cycles
803 cycles for 100 * pushpush
797 cycles for 100 * sub8
1002 cycles for 100 * pushpush fst
818 cycles for 100 * sub8 fst
801 cycles for 100 * pushpush
875 cycles for 100 * sub8
1000 cycles for 100 * pushpush fst
807 cycles for 100 * sub8 fst
881 cycles for 100 * pushpush
798 cycles for 100 * sub8
1017 cycles for 100 * pushpush fst
809 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
--- ok ---
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++13 of 20 tests valid, loop overhead is approx. 329/100 cycles
310 cycles for 100 * pushpush
878 cycles for 100 * sub8
592 cycles for 100 * pushpush fst
298 cycles for 100 * sub8 fst
304 cycles for 100 * pushpush
273 cycles for 100 * sub8
906 cycles for 100 * pushpush fst
297 cycles for 100 * sub8 fst
318 cycles for 100 * pushpush
276 cycles for 100 * sub8
988 cycles for 100 * pushpush fst
297 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
--- ok ---
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
loop overhead is approx. 137/100 cycles
1072 cycles for 100 * pushpush
1029 cycles for 100 * sub8
1227 cycles for 100 * pushpush fst
948 cycles for 100 * sub8 fst
1069 cycles for 100 * pushpush
1029 cycles for 100 * sub8
1223 cycles for 100 * pushpush fst
942 cycles for 100 * sub8 fst
1070 cycles for 100 * pushpush
1029 cycles for 100 * sub8
1223 cycles for 100 * pushpush fst
942 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
I’m seeing a lot of variation on my P3, where I normally get very consistent counts. I suspect that your code is not waiting long enough between program launch and the start of timing. I used to wait 3 seconds, but these days I wait 5.
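To illustrate the delay I mean, here is a rough sketch in Python rather than assembly, just to keep it short. The function name and parameters are made up for illustration, and it measures wall-clock nanoseconds instead of cycle counts, so it only shows the idea of letting the system settle before the first measurement. It is not how the test program above is actually written.

```python
import time


def time_after_settling(work, settle_seconds=5.0, runs=3):
    """Wait settle_seconds, then time work() `runs` times.

    Returns a list of elapsed wall-clock times in nanoseconds,
    one entry per run. (Hypothetical helper, for illustration only.)
    """
    # Give launch-time activity (loader, disk, scheduler) time to die down
    # before taking the first measurement.
    time.sleep(settle_seconds)

    results = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        work()
        results.append(time.perf_counter_ns() - start)
    return results
```

Calling `time_after_settling(my_test, settle_seconds=5.0)` with some test function `my_test` would correspond to the 5 second wait; if the counts for repeated runs only agree with the longer delay, the delay was probably too short before.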
pre-P4 (SSE1)
loop overhead is approx. 209/100 cycles
847 cycles for 100 * pushpush
807 cycles for 100 * sub8
816 cycles for 100 * pushpush fst
814 cycles for 100 * sub8 fst
842 cycles for 100 * pushpush
807 cycles for 100 * sub8
816 cycles for 100 * pushpush fst
813 cycles for 100 * sub8 fst
843 cycles for 100 * pushpush
818 cycles for 100 * sub8
816 cycles for 100 * pushpush fst
814 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
--- ok ---
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
loop overhead is approx. 211/100 cycles
536 cycles for 100 * pushpush
721 cycles for 100 * sub8
650 cycles for 100 * pushpush fst
722 cycles for 100 * sub8 fst
553 cycles for 100 * pushpush
719 cycles for 100 * sub8
633 cycles for 100 * pushpush fst
722 cycles for 100 * sub8 fst
551 cycles for 100 * pushpush
720 cycles for 100 * sub8
632 cycles for 100 * pushpush fst
722 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
--- ok ---
pre-P4
loop overhead is approx. 223/100 cycles
1007 cycles for 100 * pushpush
1107 cycles for 100 * sub8
1206 cycles for 100 * pushpush fst
1600 cycles for 100 * sub8 fst
1008 cycles for 100 * pushpush
1107 cycles for 100 * sub8
1206 cycles for 100 * pushpush fst
1600 cycles for 100 * sub8 fst
1007 cycles for 100 * pushpush
1107 cycles for 100 * sub8
1205 cycles for 100 * pushpush fst
1601 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
--- ok ---
Three more.
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 264/100 cycles
831 cycles for 100 * pushpush
788 cycles for 100 * sub8
1192 cycles for 100 * pushpush+fst
825 cycles for 100 * sub8+fst
802 cycles for 100 * lea esp+fst
796 cycles for 100 * pushpush
790 cycles for 100 * sub8
1161 cycles for 100 * pushpush+fst
898 cycles for 100 * sub8+fst
807 cycles for 100 * lea esp+fst
800 cycles for 100 * pushpush
791 cycles for 100 * sub8
1099 cycles for 100 * pushpush+fst
921 cycles for 100 * sub8+fst
817 cycles for 100 * lea esp+fst