Topic: Pelles C and passing REAL8: timings wanted

jj2007

  • Member
  • *****
  • Posts: 10543
  • Assembler is fun ;-)
    • MasmBasic
Pelles C and passing REAL8: timings wanted
« on: February 02, 2013, 07:10:11 PM »
Hi folks,

dummy proc arg1:REAL8
  nop
  ret
dummy endp


Pelles C uses this:
   sub esp, 8
   fst REAL8 PTR [esp]
   call dummy


Attachment compares against:
   push eax
   push edx
   fst REAL8 PTR [esp]
   call dummy


Can I have some timings please?
Thanks :t

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
529     cycles for 100 * pushpush
753     cycles for 100 * sub8
666     cycles for 100 * pushpush fst
750     cycles for 100 * sub8 fst

4       bytes for pushpush
5       bytes for sub8
11      bytes for pushpush fst
12      bytes for sub8 fst
« Last Edit: February 03, 2013, 01:55:23 AM by jj2007 »

hutch--

  • Administrator
  • Member
  • ******
  • Posts: 7539
  • Mnemonic Driven API Grinder
    • The MASM32 SDK
Re: Pelles C and passing REAL8: timings wanted
« Reply #1 on: February 02, 2013, 08:16:36 PM »

Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (SSE4)
loop overhead is approx. 123/100 cycles

681     cycles for 100 * pushpush
655     cycles for 100 * sub8
784     cycles for 100 * pushpush fst
595     cycles for 100 * sub8 fst

680     cycles for 100 * pushpush
656     cycles for 100 * sub8
782     cycles for 100 * pushpush fst
594     cycles for 100 * sub8 fst

680     cycles for 100 * pushpush
653     cycles for 100 * sub8
783     cycles for 100 * pushpush fst
596     cycles for 100 * sub8 fst

4       bytes for pushpush
5       bytes for sub8
11      bytes for pushpush fst
12      bytes for sub8 fst


--- ok ---
hutch at movsd dot com
http://www.masm32.com    :biggrin:  :skrewy:

MichaelW

  • Global Moderator
  • Member
  • *****
  • Posts: 1209
Re: Pelles C and passing REAL8: timings wanted
« Reply #2 on: February 02, 2013, 08:29:16 PM »
P3:
Code:
pre-P4 (SSE1)
loop overhead is approx. 208/100 cycles

883     cycles for 100 * pushpush
802     cycles for 100 * sub8
813     cycles for 100 * pushpush fst
806     cycles for 100 * sub8 fst

883     cycles for 100 * pushpush
803     cycles for 100 * sub8
809     cycles for 100 * pushpush fst
806     cycles for 100 * sub8 fst

881     cycles for 100 * pushpush
802     cycles for 100 * sub8
817     cycles for 100 * pushpush fst
807     cycles for 100 * sub8 fst
P4 Northwood:
Code:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE2)
loop overhead is approx. 231/100 cycles

1027    cycles for 100 * pushpush
998     cycles for 100 * sub8
1276    cycles for 100 * pushpush fst
1002    cycles for 100 * sub8 fst

1161    cycles for 100 * pushpush
988     cycles for 100 * sub8
1201    cycles for 100 * pushpush fst
989     cycles for 100 * sub8 fst

1120    cycles for 100 * pushpush
988     cycles for 100 * sub8
1216    cycles for 100 * pushpush fst
989     cycles for 100 * sub8 fst

I’m seeing a lot of variation on my P3, where I normally get very consistent counts. I suspect that your code is not waiting long enough after the program is launched and before it starts timing. I used to use 3 seconds, but these days I use 5.
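
A minimal sketch of that idea in C, not the code from the attachment: wait for a settle period before timing, then repeat the measurement and keep the smallest count. The names `settle`, `best_of`, `touch` and `sink` are hypothetical, and the portable `clock()` stands in for the cycle counter that the real test presumably reads.

```c
#include <time.h>

/* Hypothetical workload standing in for the code under test. */
static volatile long sink;
static void touch(void) { sink++; }

/* Wait roughly `seconds` of wall-clock time before measuring, so that
   start-up activity can die down. This busy-waits for portability;
   a Windows test would more likely call Sleep(). */
static void settle(double seconds)
{
    time_t start = time(NULL);
    while (difftime(time(NULL), start) < seconds) {
        /* spin */
    }
}

/* Run `fn` `reps` times per trial and return the smallest clock() tick
   count seen over `trials` trials, or -1.0 if `trials` is not positive. */
static double best_of(void (*fn)(void), long reps, int trials)
{
    double best = -1.0;
    for (int t = 0; t < trials; t++) {
        clock_t start = clock();
        for (long i = 0; i < reps; i++)
            fn();
        double ticks = (double)(clock() - start);
        if (best < 0.0 || ticks < best)
            best = ticks;
    }
    return best;
}
```

Calling `settle(5.0)` and then `best_of(touch, 100, 20)` would mirror the 5-second wait mentioned above; taking the minimum is one common way to discard trials disturbed by interrupts or task switches.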

Well Microsoft, here’s another nice mess you’ve gotten us into.

Vortex

  • Member
  • *****
  • Posts: 2334
Re: Pelles C and passing REAL8: timings wanted
« Reply #3 on: February 02, 2013, 08:29:32 PM »
Code:
Intel(R) Pentium(R) 4 CPU 3.20GHz (SSE3)
+19 of 20 tests valid, loop overhead is approx. 252/100 cycles

803     cycles for 100 * pushpush
797     cycles for 100 * sub8
1002    cycles for 100 * pushpush fst
818     cycles for 100 * sub8 fst

801     cycles for 100 * pushpush
875     cycles for 100 * sub8
1000    cycles for 100 * pushpush fst
807     cycles for 100 * sub8 fst

881     cycles for 100 * pushpush
798     cycles for 100 * sub8
1017    cycles for 100 * pushpush fst
809     cycles for 100 * sub8 fst

4       bytes for pushpush
5       bytes for sub8
11      bytes for pushpush fst
12      bytes for sub8 fst


--- ok ---

Gunther

  • Member
  • *****
  • Posts: 3585
  • Forgive your enemies, but never forget their names
Re: Pelles C and passing REAL8: timings wanted
« Reply #4 on: February 02, 2013, 10:07:15 PM »
Jochen,

your timings from my machine:

Code:
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
+++++++13 of 20 tests valid, loop overhead is approx. 329/100 cycles

310 cycles for 100 * pushpush
878 cycles for 100 * sub8
592 cycles for 100 * pushpush fst
298 cycles for 100 * sub8 fst

304 cycles for 100 * pushpush
273 cycles for 100 * sub8
906 cycles for 100 * pushpush fst
297 cycles for 100 * sub8 fst

318 cycles for 100 * pushpush
276 cycles for 100 * sub8
988 cycles for 100 * pushpush fst
297 cycles for 100 * sub8 fst

4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst


--- ok ---
Get your facts first, and then you can distort them.

frktons

  • Member
  • ***
  • Posts: 491
Re: Pelles C and passing REAL8: timings wanted
« Reply #5 on: February 02, 2013, 11:31:26 PM »
My test:
Quote
Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz (SSE4)
loop overhead is approx. 137/100 cycles

1072    cycles for 100 * pushpush
1029    cycles for 100 * sub8
1227    cycles for 100 * pushpush fst
948     cycles for 100 * sub8 fst

1069    cycles for 100 * pushpush
1029    cycles for 100 * sub8
1223    cycles for 100 * pushpush fst
942     cycles for 100 * sub8 fst

1070    cycles for 100 * pushpush
1029    cycles for 100 * sub8
1223    cycles for 100 * pushpush fst
942     cycles for 100 * sub8 fst

4       bytes for pushpush
5       bytes for sub8
11      bytes for pushpush fst
12      bytes for sub8 fst


jj2007

  • Member
  • *****
  • Posts: 10543
  • Assembler is fun ;-)
    • MasmBasic
Re: Pelles C and passing REAL8: timings wanted
« Reply #6 on: February 03, 2013, 01:22:35 AM »
Quote from: MichaelW
I’m seeing a lot of variation on my P3, where I normally get very consistent counts. I suspect that your code is not waiting long enough after the program is launched and before it starts timing. I used to use 3 seconds, but these days I use 5.

Michael,
You could be right. I am not very satisfied with the setup either. I get the most consistent results on the Celeron. If I find the time, ... ;-)

Thanks to everybody. It seems pushpush is one or two cycles slower, except on my Celeron ::)
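
The per-call figure follows from dividing the difference between two reported totals by the 100 calls they cover. A small C sketch of that arithmetic, where `per_call_delta` is a hypothetical helper and the inputs are taken from the results posted above:

```c
/* Cycles per call by which variant A is slower than variant B, given
   the totals reported for `reps` calls. A loop overhead that is the
   same for both variants cancels in the subtraction. */
static double per_call_delta(double cycles_a, double cycles_b, double reps)
{
    return (cycles_a - cycles_b) / reps;
}
```

With hutch's Core 2 Quad numbers, `per_call_delta(784, 595, 100)` gives about 1.89 cycles in favour of sub8 fst, while the Celeron M numbers, `per_call_delta(666, 750, 100)`, give about -0.84, meaning pushpush fst is the faster one there.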

FORTRANS

  • Member
  • *****
  • Posts: 1077
Re: Pelles C and passing REAL8: timings wanted
« Reply #7 on: February 03, 2013, 01:47:03 AM »
Hi Jochen,

   Three more.

Regards,

Steve N.

Code:
pre-P4 (SSE1)
loop overhead is approx. 209/100 cycles

847 cycles for 100 * pushpush
807 cycles for 100 * sub8
816 cycles for 100 * pushpush fst
814 cycles for 100 * sub8 fst

842 cycles for 100 * pushpush
807 cycles for 100 * sub8
816 cycles for 100 * pushpush fst
813 cycles for 100 * sub8 fst

843 cycles for 100 * pushpush
818 cycles for 100 * sub8
816 cycles for 100 * pushpush fst
814 cycles for 100 * sub8 fst

4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst


--- ok ---

Mobile Intel(R) Celeron(R) processor     600MHz (SSE2)
loop overhead is approx. 211/100 cycles

536 cycles for 100 * pushpush
721 cycles for 100 * sub8
650 cycles for 100 * pushpush fst
722 cycles for 100 * sub8 fst

553 cycles for 100 * pushpush
719 cycles for 100 * sub8
633 cycles for 100 * pushpush fst
722 cycles for 100 * sub8 fst

551 cycles for 100 * pushpush
720 cycles for 100 * sub8
632 cycles for 100 * pushpush fst
722 cycles for 100 * sub8 fst

4 bytes for pushpush
5 bytes for sub8
11 bytes for pushpush fst
12 bytes for sub8 fst


--- ok ---

pre-P4
loop overhead is approx. 223/100 cycles
                                                                               
1007    cycles for 100 * pushpush                                               
1107    cycles for 100 * sub8                                                   
1206    cycles for 100 * pushpush fst                                           
1600    cycles for 100 * sub8 fst                                               
                                                                               
1008    cycles for 100 * pushpush                                               
1107    cycles for 100 * sub8                                                   
1206    cycles for 100 * pushpush fst                                           
1600    cycles for 100 * sub8 fst                                               
                                                                               
1007    cycles for 100 * pushpush                                               
1107    cycles for 100 * sub8                                                   
1205    cycles for 100 * pushpush fst                                           
1601    cycles for 100 * sub8 fst                                               
                                                                               
4       bytes for pushpush                                                     
5       bytes for sub8                                                         
11      bytes for pushpush fst                                                 
12      bytes for sub8 fst                                                     
                                                                               
                                                                               
--- ok ---                                                                     
                                                                               

jj2007

  • Member
  • *****
  • Posts: 10543
  • Assembler is fun ;-)
    • MasmBasic
Re: Pelles C and passing REAL8: timings wanted
« Reply #8 on: February 03, 2013, 01:56:42 AM »
Quote from: FORTRANS
   Three more.

Thanks, Steve.
I have added Sleep time and a test with lea esp, [esp-8] (8/100 cycles less on my Celeron).

dedndave

  • Member
  • *****
  • Posts: 8827
  • Still using Abacus 2.0
    • DednDave
Re: Pelles C and passing REAL8: timings wanted
« Reply #9 on: February 03, 2013, 02:09:47 AM »
i still get inconsistent readings
the loop counts could be increased a bit

prescott w/htt
Code:
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 264/100 cycles

831     cycles for 100 * pushpush
788     cycles for 100 * sub8
1192    cycles for 100 * pushpush+fst
825     cycles for 100 * sub8+fst
802     cycles for 100 * lea esp+fst

796     cycles for 100 * pushpush
790     cycles for 100 * sub8
1161    cycles for 100 * pushpush+fst
898     cycles for 100 * sub8+fst
807     cycles for 100 * lea esp+fst

800     cycles for 100 * pushpush
791     cycles for 100 * sub8
1099    cycles for 100 * pushpush+fst
921     cycles for 100 * sub8+fst
817     cycles for 100 * lea esp+fst