Hi folks,
dummy proc arg1:REAL8
nop
ret
dummy endp
Pelles C uses this:
sub esp, 8
fst REAL8 PTR [esp]
call dummy
The attachment compares it against:
push eax
push edx
fst REAL8 PTR [esp]
call dummy
Can I have some timings please?
Thanks :t
Intel(R) Celeron(R) M CPU 420 @ 1.60GHz (SSE3)
529 cycles for 100 * pushpush
753 cycles for 100 * sub8
666 cycles for 100 * pushpush fst
750 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (SSE4)
loop overhead is approx. 123/100 cycles
681 cycles for 100 * pushpush
655 cycles for 100 * sub8
784 cycles for 100 * pushpush fst
595 cycles for 100 * sub8 fst
680 cycles for 100 * pushpush
656 cycles for 100 * sub8
782 cycles for 100 * pushpush fst
594 cycles for 100 * sub8 fst
680 cycles for 100 * pushpush
653 cycles for 100 * sub8
783 cycles for 100 * pushpush fst
596 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
--- ok ---
P3:
pre-P4 (SSE1)
loop overhead is approx. 208/100 cycles
883 cycles for 100 * pushpush
802 cycles for 100 * sub8
813 cycles for 100 * pushpush fst
806 cycles for 100 * sub8 fst
883 cycles for 100 * pushpush
803 cycles for 100 * sub8
809 cycles for 100 * pushpush fst
806 cycles for 100 * sub8 fst
881 cycles for 100 * pushpush
802 cycles for 100 * sub8
817 cycles for 100 * pushpush fst
807 cycles for 100 * sub8 fst
P4 Northwood:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
loop overhead is approx. 231/100 cycles
1027 cycles for 100 * pushpush
998 cycles for 100 * sub8
1276 cycles for 100 * pushpush fst
1002 cycles for 100 * sub8 fst
1161 cycles for 100 * pushpush
988 cycles for 100 * sub8
1201 cycles for 100 * pushpush fst
989 cycles for 100 * sub8 fst
1120 cycles for 100 * pushpush
988 cycles for 100 * sub8
1216 cycles for 100 * pushpush fst
989 cycles for 100 * sub8 fst
I'm seeing a lot of variation on my P3, where I normally get very consistent counts. I suspect that your code is not waiting long enough after the program is launched and before it starts timing. I used to use 3 seconds, but these days I use 5.
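One way to put a number on "variation" is the max-min spread of each test across the three runs. A rough sketch in Python, not part of the test program; the helper name spread is made up here, and the counts are the P4 Northwood ones posted above:

```python
# Cycle counts per 100 iterations, copied from the P4 Northwood runs above.
runs = {
    "pushpush":     [1027, 1161, 1120],
    "sub8":         [998, 988, 988],
    "pushpush fst": [1276, 1201, 1216],
    "sub8 fst":     [1002, 989, 989],
}

def spread(counts):
    """Return max - min of the counts, i.e. the run-to-run variation."""
    return max(counts) - min(counts)

for name, counts in runs.items():
    print(f"{name:13s} spread {spread(counts):4d} cycles per 100 iterations")
```

On these numbers the two push/push tests move by 134 and 75 cycles between runs, the two sub8 tests by 10 and 13.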
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
+19 of 20 tests valid, loop overhead is approx. 252/100 cycles
803 cycles for 100 * pushpush
797 cycles for 100 * sub8
1002 cycles for 100 * pushpush fst
818 cycles for 100 * sub8 fst
801 cycles for 100 * pushpush
875 cycles for 100 * sub8
1000 cycles for 100 * pushpush fst
807 cycles for 100 * sub8 fst
881 cycles for 100 * pushpush
798 cycles for 100 * sub8
1017 cycles for 100 * pushpush fst
809 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
--- ok ---
Jochen,
your timings from my machine:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++13 of 20 tests valid, loop overhead is approx. 329/100 cycles
310 cycles for 100 * pushpush
878 cycles for 100 * sub8
592 cycles for 100 * pushpush fst
298 cycles for 100 * sub8 fst
304 cycles for 100 * pushpush
273 cycles for 100 * sub8
906 cycles for 100 * pushpush fst
297 cycles for 100 * sub8 fst
318 cycles for 100 * pushpush
276 cycles for 100 * sub8
988 cycles for 100 * pushpush fst
297 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
--- ok ---
My test:
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (SSE4)
loop overhead is approx. 137/100 cycles
1072 cycles for 100 * pushpush
1029 cycles for 100 * sub8
1227 cycles for 100 * pushpush fst
948 cycles for 100 * sub8 fst
1069 cycles for 100 * pushpush
1029 cycles for 100 * sub8
1223 cycles for 100 * pushpush fst
942 cycles for 100 * sub8 fst
1070 cycles for 100 * pushpush
1029 cycles for 100 * sub8
1223 cycles for 100 * pushpush fst
942 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
Quote from: MichaelW on February 02, 2013, 08:29:16 PM
I'm seeing a lot of variation on my P3, where I normally get very consistent counts. I suspect that your code is not waiting long enough after the program is launched and before it starts timing. I used to use 3 seconds, but these days I use 5.
Michael,
You could be right. I am not very satisfied with the setup either. I get the most consistent results on the Celeron. If I find the time, ... ;-)
Thanks to everybody. It seems pushpush is about one or two cycles per call slower, except on my Celeron ::)
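Assuming the counts are totals for 100 repetitions, as the "100 *" in the output suggests, the per-call difference is the gap divided by 100. A quick sketch in Python (the helper name per_call_delta is made up for illustration), using the first run of the "fst" variants on the Core 2 Quad Q9650 and the Celeron M 420:

```python
REPS = 100  # each reported count covers 100 repetitions

def per_call_delta(pushpush_cycles, sub8_cycles, reps=REPS):
    """Cycles per call by which pushpush is slower (negative: faster)."""
    return (pushpush_cycles - sub8_cycles) / reps

# (pushpush fst, sub8 fst) counts from the first run on each CPU
samples = {
    "Core 2 Quad Q9650": (784, 595),
    "Celeron M 420":     (666, 750),
}

for cpu, (pp, sub8) in samples.items():
    print(f"{cpu}: {per_call_delta(pp, sub8):+.2f} cycles per call")
```

That gives +1.89 cycles per call on the Core 2 Quad and -0.84 on the Celeron M, where push/push comes out ahead.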
Hi Jochen,
Three more.
Regards,
Steve N.
pre-P4 (SSE1)
loop overhead is approx. 209/100 cycles
847 cycles for 100 * pushpush
807 cycles for 100 * sub8
816 cycles for 100 * pushpush fst
814 cycles for 100 * sub8 fst
842 cycles for 100 * pushpush
807 cycles for 100 * sub8
816 cycles for 100 * pushpush fst
813 cycles for 100 * sub8 fst
843 cycles for 100 * pushpush
818 cycles for 100 * sub8
816 cycles for 100 * pushpush fst
814 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
--- ok ---
Mobile Intel(R) Celeron(R) processor 600MHz (SSE2)
loop overhead is approx. 211/100 cycles
536 cycles for 100 * pushpush
721 cycles for 100 * sub8
650 cycles for 100 * pushpush fst
722 cycles for 100 * sub8 fst
553 cycles for 100 * pushpush
719 cycles for 100 * sub8
633 cycles for 100 * pushpush fst
722 cycles for 100 * sub8 fst
551 cycles for 100 * pushpush
720 cycles for 100 * sub8
632 cycles for 100 * pushpush fst
722 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
--- ok ---
pre-P4
loop overhead is approx. 223/100 cycles
1007 cycles for 100 * pushpush
1107 cycles for 100 * sub8
1206 cycles for 100 * pushpush fst
1600 cycles for 100 * sub8 fst
1008 cycles for 100 * pushpush
1107 cycles for 100 * sub8
1206 cycles for 100 * pushpush fst
1600 cycles for 100 * sub8 fst
1007 cycles for 100 * pushpush
1107 cycles for 100 * sub8
1205 cycles for 100 * pushpush fst
1601 cycles for 100 * sub8 fst
4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst
--- ok ---
Quote from: FORTRANS on February 03, 2013, 01:47:03 AM
Three more.
Thanks, Steve.
I have added some Sleep time before the timings start, and a test with lea esp, [esp-8] (8/100 cycles less on my Celeron).
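The idea behind the added delay, sketched in Python rather than in the actual test code (which is in the attachment and not reproduced here): wait a few seconds after launch so start-up activity can die down, then time. The name timed_after_settle and its parameters are hypothetical, the 5 seconds is the figure MichaelW mentioned, and the sketch measures wall-clock nanoseconds, not CPU cycles.

```python
import time

def timed_after_settle(work, settle_seconds=5.0, repeats=3):
    """Sleep first, then time work() several times.

    Returns the elapsed nanoseconds of each repeat as a list.
    """
    time.sleep(settle_seconds)  # let the freshly launched process settle
    results = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        work()
        results.append(time.perf_counter_ns() - start)
    return results
```

Called as timed_after_settle(some_test), it sleeps for 5 seconds and then returns three timings. Printing all repeats instead of an average keeps any remaining run-to-run variation visible.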
I still get inconsistent readings.
The loop counts could be increased a bit.
Prescott w/HTT:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 264/100 cycles
831 cycles for 100 * pushpush
788 cycles for 100 * sub8
1192 cycles for 100 * pushpush+fst
825 cycles for 100 * sub8+fst
802 cycles for 100 * lea esp+fst
796 cycles for 100 * pushpush
790 cycles for 100 * sub8
1161 cycles for 100 * pushpush+fst
898 cycles for 100 * sub8+fst
807 cycles for 100 * lea esp+fst
800 cycles for 100 * pushpush
791 cycles for 100 * sub8
1099 cycles for 100 * pushpush+fst
921 cycles for 100 * sub8+fst
817 cycles for 100 * lea esp+fst