Pelles C and passing REAL8: timings wanted

Started by jj2007, February 02, 2013, 07:10:11 PM

jj2007

Hi folks,

dummy proc arg1:REAL8
  nop
  ret
dummy endp


Pelles C uses this:
   sub esp, 8
   fst REAL8 PTR [esp]
   call dummy


Attachment compares against:
   push eax
   push edx
   fst REAL8 PTR [esp]
   call dummy
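
For context, here is a minimal C sketch of the kind of call being discussed. The original C source is not shown in this thread, so the names `last_arg` and `pass_real8` are hypothetical, and the store into `last_arg` is there only to make the call observable. The exact instructions a compiler emits for it depend on the compiler version and options.

```c
/* Hypothetical C counterpart of the MASM "dummy" proc: it takes one
   REAL8, which is a double in C. */
static double last_arg;   /* illustration only, not part of the original test */

static void dummy(double arg1)
{
    last_arg = arg1;      /* the MASM version just does nop / ret */
}

/* With stack-based argument passing on 32-bit x86, the 8-byte double
   has to be placed on the stack before the call. Reserving those
   8 bytes is the step that "sub esp, 8" and "push eax / push edx"
   perform in the two sequences above. */
double pass_real8(double x)
{
    dummy(x);
    return last_arg;
}
```

Both sequences only differ in how they reserve the 8 bytes; the `fst` that writes the value is the same.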


Can I have some timings please?
Thanks :t

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
529     cycles for 100 * pushpush
753     cycles for 100 * sub8
666     cycles for 100 * pushpush fst
750     cycles for 100 * sub8 fst

4       bytes for pushpush
5       bytes for sub8
11      bytes for pushpush fst
12      bytes for sub8 fst

hutch--


Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
loop overhead is approx. 123/100 cycles

681     cycles for 100 * pushpush
655     cycles for 100 * sub8
784     cycles for 100 * pushpush fst
595     cycles for 100 * sub8 fst

680     cycles for 100 * pushpush
656     cycles for 100 * sub8
782     cycles for 100 * pushpush fst
594     cycles for 100 * sub8 fst

680     cycles for 100 * pushpush
653     cycles for 100 * sub8
783     cycles for 100 * pushpush fst
596     cycles for 100 * sub8 fst

4       bytes for pushpush
5       bytes for sub8
11      bytes for pushpush fst
12      bytes for sub8 fst


--- ok ---

MichaelW

P3:

pre-P4 (SSE1)
loop overhead is approx. 208/100 cycles

883     cycles for 100 * pushpush
802     cycles for 100 * sub8
813     cycles for 100 * pushpush fst
806     cycles for 100 * sub8 fst

883     cycles for 100 * pushpush
803     cycles for 100 * sub8
809     cycles for 100 * pushpush fst
806     cycles for 100 * sub8 fst

881     cycles for 100 * pushpush
802     cycles for 100 * sub8
817     cycles for 100 * pushpush fst
807     cycles for 100 * sub8 fst

P4 Northwood:

Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
loop overhead is approx. 231/100 cycles

1027    cycles for 100 * pushpush
998     cycles for 100 * sub8
1276    cycles for 100 * pushpush fst
1002    cycles for 100 * sub8 fst

1161    cycles for 100 * pushpush
988     cycles for 100 * sub8
1201    cycles for 100 * pushpush fst
989     cycles for 100 * sub8 fst

1120    cycles for 100 * pushpush
988     cycles for 100 * sub8
1216    cycles for 100 * pushpush fst
989     cycles for 100 * sub8 fst


I'm seeing a lot of variation on my P3, where I normally get very consistent counts. I suspect that your code is not waiting long enough after the program is launched and before it starts timing. I used to use 3 seconds, but these days I use 5.

Well Microsoft, here's another nice mess you've gotten us into.

Vortex

Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
+19 of 20 tests valid, loop overhead is approx. 252/100 cycles

803     cycles for 100 * pushpush
797     cycles for 100 * sub8
1002    cycles for 100 * pushpush fst
818     cycles for 100 * sub8 fst

801     cycles for 100 * pushpush
875     cycles for 100 * sub8
1000    cycles for 100 * pushpush fst
807     cycles for 100 * sub8 fst

881     cycles for 100 * pushpush
798     cycles for 100 * sub8
1017    cycles for 100 * pushpush fst
809     cycles for 100 * sub8 fst

4       bytes for pushpush
5       bytes for sub8
11      bytes for pushpush fst
12      bytes for sub8 fst


--- ok ---

Gunther

Jochen,

your timings from my machine:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++13 of 20 tests valid, loop overhead is approx. 329/100 cycles

310 cycles for 100 * pushpush
878 cycles for 100 * sub8
592 cycles for 100 * pushpush fst
298 cycles for 100 * sub8 fst

304 cycles for 100 * pushpush
273 cycles for 100 * sub8
906 cycles for 100 * pushpush fst
297 cycles for 100 * sub8 fst

318 cycles for 100 * pushpush
276 cycles for 100 * sub8
988 cycles for 100 * pushpush fst
297 cycles for 100 * sub8 fst

4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst


--- ok ---
You have to know the facts before you can distort them.

frktons

My test:
Quote
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
loop overhead is approx. 137/100 cycles

1072    cycles for 100 * pushpush
1029    cycles for 100 * sub8
1227    cycles for 100 * pushpush fst
948     cycles for 100 * sub8 fst

1069    cycles for 100 * pushpush
1029    cycles for 100 * sub8
1223    cycles for 100 * pushpush fst
942     cycles for 100 * sub8 fst

1070    cycles for 100 * pushpush
1029    cycles for 100 * sub8
1223    cycles for 100 * pushpush fst
942     cycles for 100 * sub8 fst

4       bytes for pushpush
5       bytes for sub8
11      bytes for pushpush fst
12      bytes for sub8 fst

There are only two days a year when you can't do anything: one is called yesterday, the other is called tomorrow, so today is the right day to love, believe, do and, above all, live.

Dalai Lama

jj2007

Quote from: MichaelW on February 02, 2013, 08:29:16 PM
I'm seeing a lot of variation on my P3, where I normally get very consistent counts. I suspect that your code is not waiting long enough after the program is launched and before it starts timing. I used to use 3 seconds, but these days I use 5.

Michael,
You could be right. I am not very satisfied with the setup either. The most consistent results I get on the Celeron. If I find the time, ... ;-)

Thanks to everybody. It seems pushpush is one or two cycles slower, except on my Celeron ::)
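
The "one or two cycles" figure comes from dividing the difference between two printed totals by the 100 calls each total covers. A small sketch of that arithmetic follows; the helper name `per_call_delta` is made up for illustration. It assumes both variants run inside the same timing loop, so any loop overhead cancels out in the difference.

```c
#include <assert.h>

/* Difference in cycles per single call between two variants, given
   the totals the test program prints for `iterations` calls
   (100 in this thread). A positive result means variant A is slower. */
double per_call_delta(int cycles_a, int cycles_b, int iterations)
{
    assert(iterations > 0);
    return (double)(cycles_a - cycles_b) / iterations;
}
```

With hutch--'s Core 2 Quad numbers, `per_call_delta(784, 595, 100)` for "pushpush fst" against "sub8 fst" gives about 1.89 cycles per call. With the Celeron M numbers, `per_call_delta(666, 750, 100)` gives about -0.84, meaning pushpush is the faster variant there.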

FORTRANS

Hi Jochen,

   Three more.

Regards,

Steve N.


pre-P4 (SSE1)
loop overhead is approx. 209/100 cycles

847 cycles for 100 * pushpush
807 cycles for 100 * sub8
816 cycles for 100 * pushpush fst
814 cycles for 100 * sub8 fst

842 cycles for 100 * pushpush
807 cycles for 100 * sub8
816 cycles for 100 * pushpush fst
813 cycles for 100 * sub8 fst

843 cycles for 100 * pushpush
818 cycles for 100 * sub8
816 cycles for 100 * pushpush fst
814 cycles for 100 * sub8 fst

4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst


--- ok ---

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
loop overhead is approx. 211/100 cycles

536 cycles for 100 * pushpush
721 cycles for 100 * sub8
650 cycles for 100 * pushpush fst
722 cycles for 100 * sub8 fst

553 cycles for 100 * pushpush
719 cycles for 100 * sub8
633 cycles for 100 * pushpush fst
722 cycles for 100 * sub8 fst

551 cycles for 100 * pushpush
720 cycles for 100 * sub8
632 cycles for 100 * pushpush fst
722 cycles for 100 * sub8 fst

4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst


--- ok ---

pre-P4
loop overhead is approx. 223/100 cycles
                                                                               
1007    cycles for 100 * pushpush                                               
1107    cycles for 100 * sub8                                                   
1206    cycles for 100 * pushpush fst                                           
1600    cycles for 100 * sub8 fst                                               
                                                                               
1008    cycles for 100 * pushpush                                               
1107    cycles for 100 * sub8                                                   
1206    cycles for 100 * pushpush fst                                           
1600    cycles for 100 * sub8 fst                                               
                                                                               
1007    cycles for 100 * pushpush                                               
1107    cycles for 100 * sub8                                                   
1205    cycles for 100 * pushpush fst                                           
1601    cycles for 100 * sub8 fst                                               
                                                                               
4       bytes for pushpush                                                     
5       bytes for sub8                                                         
11      bytes for pushpush fst                                                 
12      bytes for sub8 fst                                                     
                                                                               
                                                                               
--- ok ---                                                                     
                                                                               

jj2007

Quote from: FORTRANS on February 03, 2013, 01:47:03 AM
   Three more.

Thanks, Steve.
I have added Sleep time and a test with lea esp, [esp-8] (8/100 cycles less on my Celeron).

dedndave

i still get inconsistent readings
the loop counts could be increased a bit

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 264/100 cycles

831     cycles for 100 * pushpush
788     cycles for 100 * sub8
1192    cycles for 100 * pushpush+fst
825     cycles for 100 * sub8+fst
802     cycles for 100 * lea esp+fst

796     cycles for 100 * pushpush
790     cycles for 100 * sub8
1161    cycles for 100 * pushpush+fst
898     cycles for 100 * sub8+fst
807     cycles for 100 * lea esp+fst

800     cycles for 100 * pushpush
791     cycles for 100 * sub8
1099    cycles for 100 * pushpush+fst
921     cycles for 100 * sub8+fst
817     cycles for 100 * lea esp+fst