The MASM Forum

General => The Laboratory => Topic started by: jj2007 on April 10, 2013, 10:30:23 PM

Title: 2^x timings
Post by: jj2007 on April 10, 2013, 10:30:23 PM
Hi folks,
Could I please have some timings for these Y=2^x algos?
Thanks, JJ

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 79/20 cycles

913     cycles for 20 * Pow2 fist/fild
907     cycles for 20 * Pow2 fadd One
1500    cycles for 20 * Pow2 frndint
Title: Re: 2^x timings
Post by: anta40 on April 10, 2013, 10:43:31 PM
Updated:

Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (SSE4)
loop overhead is approx. 30/20 cycles

1037    cycles for 20 * Pow2 fist/fild
1039    cycles for 20 * Pow2 fadd One
1183    cycles for 20 * Pow2 frndint
2301    cycles for 20 * PowX fist/fild
2429    cycles for 20 * PowX frndint

1071    cycles for 20 * Pow2 fist/fild
1046    cycles for 20 * Pow2 fadd One
1153    cycles for 20 * Pow2 frndint
2544    cycles for 20 * PowX fist/fild
2740    cycles for 20 * PowX frndint

1343    cycles for 20 * Pow2 fist/fild
1202    cycles for 20 * Pow2 fadd One
1448    cycles for 20 * Pow2 frndint
2600    cycles for 20 * PowX fist/fild
2438    cycles for 20 * PowX frndint

1295    cycles for 20 * Pow2 fist/fild
1322    cycles for 20 * Pow2 fadd One
1452    cycles for 20 * Pow2 frndint
2775    cycles for 20 * PowX fist/fild
2624    cycles for 20 * PowX frndint

44      bytes for Pow2 fist/fild
38      bytes for Pow2 fadd One
36      bytes for Pow2 frndint
50      bytes for PowX fist/fild
47      bytes for PowX frndint


--- ok ---
Title: Re: 2^x timings
Post by: Gunther on April 10, 2013, 11:04:27 PM
Jochen,

here are my results:


Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 77/20 cycles

1036 cycles for 20 * Pow2 fist/fild
1053 cycles for 20 * Pow2 fadd One
24582 cycles for 20 * Pow2 frndint

1027 cycles for 20 * Pow2 fist/fild
1054 cycles for 20 * Pow2 fadd One
24499 cycles for 20 * Pow2 frndint

1026 cycles for 20 * Pow2 fist/fild
1351 cycles for 20 * Pow2 fadd One
24550 cycles for 20 * Pow2 frndint

44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
34 bytes for Pow2 frndint

--- ok ---


By the way: well done.  :t

Gunther
Title: Re: 2^x timings
Post by: Magnum on April 10, 2013, 11:19:15 PM
Intel(R) Core(TM)2 Duo CPU     P8600  @ 2.40GHz (SSE4)
loop overhead is approx. 68/20 cycles

2296   cycles for 20 * Pow2 fist/fild
2324   cycles for 20 * Pow2 fadd One
14751   cycles for 20 * Pow2 frndint

2298   cycles for 20 * Pow2 fist/fild
2300   cycles for 20 * Pow2 fadd One
14857   cycles for 20 * Pow2 frndint

2325   cycles for 20 * Pow2 fist/fild
2295   cycles for 20 * Pow2 fadd One
14731   cycles for 20 * Pow2 frndint

44   bytes for Pow2 fist/fild
38   bytes for Pow2 fadd One
34   bytes for Pow2 frndint
Title: Re: 2^x timings
Post by: FORTRANS on April 10, 2013, 11:27:18 PM
Hi,

   P-III, others if wanted.


pre-P4 (SSE1)
loop overhead is approx. 48/20 cycles

2665    cycles for 20 * Pow2 fist/fild
2654    cycles for 20 * Pow2 fadd One
8635    cycles for 20 * Pow2 frndint

2664    cycles for 20 * Pow2 fist/fild
2655    cycles for 20 * Pow2 fadd One
8635    cycles for 20 * Pow2 frndint

2664    cycles for 20 * Pow2 fist/fild
2650    cycles for 20 * Pow2 fadd One
8635    cycles for 20 * Pow2 frndint

44      bytes for Pow2 fist/fild
38      bytes for Pow2 fadd One
34      bytes for Pow2 frndint


--- ok ---


Regards,

Steve N.
Title: Re: 2^x timings
Post by: dedndave on April 11, 2013, 12:26:50 AM
prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 51/20 cycles

5718    cycles for 20 * Pow2 fist/fild
5759    cycles for 20 * Pow2 fadd One
78540   cycles for 20 * Pow2 frndint

5740    cycles for 20 * Pow2 fist/fild
5725    cycles for 20 * Pow2 fadd One
78548   cycles for 20 * Pow2 frndint

5732    cycles for 20 * Pow2 fist/fild
5716    cycles for 20 * Pow2 fadd One
79000   cycles for 20 * Pow2 frndint


i don't know what's in frndint, but it doesn't like P4's - lol
Title: Re: 2^x timings
Post by: sinsi on April 11, 2013, 12:39:46 AM

AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 75/20 cycles

895     cycles for 20 * Pow2 fist/fild
890     cycles for 20 * Pow2 fadd One
1465    cycles for 20 * Pow2 frndint


Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)
loop overhead is approx. 39/20 cycles

1254    cycles for 20 * Pow2 fist/fild
1279    cycles for 20 * Pow2 fadd One
28154   cycles for 20 * Pow2 frndint
Title: Re: 2^x timings
Post by: jj2007 on April 11, 2013, 12:53:11 AM
Quote from: dedndave on April 11, 2013, 12:26:50 AM
i don't know what's in frndint, but it doesn't like P4's - lol

Could be related to the fact that I forgot one fld st in that algo :redface:

New version attached on top of this thread. My apologies for having wasted your time. To compensate, I added z=x^y to the list - see PowX.

AMD Athlon(tm) Dual Core Processor 4450B (SSE3)

912     cycles for 20 * Pow2 fist/fild
980     cycles for 20 * Pow2 fadd One
1041    cycles for 20 * Pow2 frndint
4059    cycles for 20 * PowX fist/fild
3982    cycles for 20 * PowX frndint

841     cycles for 20 * Pow2 fist/fild
906     cycles for 20 * Pow2 fadd One
1039    cycles for 20 * Pow2 frndint
4054    cycles for 20 * PowX fist/fild
4065    cycles for 20 * PowX frndint
Title: Re: 2^x timings
Post by: dedndave on April 11, 2013, 01:18:18 AM
much better   :biggrin:

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 47/20 cycles

5724    cycles for 20 * Pow2 fist/fild
5745    cycles for 20 * Pow2 fadd One
5932    cycles for 20 * Pow2 frndint
8048    cycles for 20 * PowX fist/fild
8082    cycles for 20 * PowX frndint

5739    cycles for 20 * Pow2 fist/fild
5715    cycles for 20 * Pow2 fadd One
5947    cycles for 20 * Pow2 frndint
8371    cycles for 20 * PowX fist/fild
8105    cycles for 20 * PowX frndint

5737    cycles for 20 * Pow2 fist/fild
5759    cycles for 20 * Pow2 fadd One
5924    cycles for 20 * Pow2 frndint
8060    cycles for 20 * PowX fist/fild
8094    cycles for 20 * PowX frndint

5748    cycles for 20 * Pow2 fist/fild
5736    cycles for 20 * Pow2 fadd One
5930    cycles for 20 * Pow2 frndint
8021    cycles for 20 * PowX fist/fild
8116    cycles for 20 * PowX frndint
Title: Re: 2^x timings
Post by: Gunther on April 11, 2013, 01:27:45 AM
Jochen,

never mind; no need for excuses. Here are the new timings:

Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 78/20 cycles

1481   cycles for 20 * Pow2 fist/fild
1505   cycles for 20 * Pow2 fadd One
1479   cycles for 20 * Pow2 frndint
2755   cycles for 20 * PowX fist/fild
2820   cycles for 20 * PowX frndint

1325   cycles for 20 * Pow2 fist/fild
1356   cycles for 20 * Pow2 fadd One
1636   cycles for 20 * Pow2 frndint
2774   cycles for 20 * PowX fist/fild
2810   cycles for 20 * PowX frndint

1325   cycles for 20 * Pow2 fist/fild
1352   cycles for 20 * Pow2 fadd One
1644   cycles for 20 * Pow2 frndint
2595   cycles for 20 * PowX fist/fild
2794   cycles for 20 * PowX frndint

1494   cycles for 20 * Pow2 fist/fild
1509   cycles for 20 * Pow2 fadd One
1180   cycles for 20 * Pow2 frndint
2591   cycles for 20 * PowX fist/fild
2803   cycles for 20 * PowX frndint

44   bytes for Pow2 fist/fild
38   bytes for Pow2 fadd One
36   bytes for Pow2 frndint
50   bytes for PowX fist/fild
47   bytes for PowX frndint

--- ok ---

Gunther

Title: Re: 2^x timings
Post by: FORTRANS on April 11, 2013, 01:46:09 AM
P-III


pre-P4 (SSE1)
loop overhead is approx. 48/20 cycles

2665 cycles for 20 * Pow2 fist/fild
2652 cycles for 20 * Pow2 fadd One
2717 cycles for 20 * Pow2 frndint
4829 cycles for 20 * PowX fist/fild
5023 cycles for 20 * PowX frndint

2664 cycles for 20 * Pow2 fist/fild
2658 cycles for 20 * Pow2 fadd One
2711 cycles for 20 * Pow2 frndint
4833 cycles for 20 * PowX fist/fild
5017 cycles for 20 * PowX frndint

2670 cycles for 20 * Pow2 fist/fild
2652 cycles for 20 * Pow2 fadd One
2712 cycles for 20 * Pow2 frndint
4835 cycles for 20 * PowX fist/fild
5028 cycles for 20 * PowX frndint

2664 cycles for 20 * Pow2 fist/fild
2653 cycles for 20 * Pow2 fadd One
2711 cycles for 20 * Pow2 frndint
4837 cycles for 20 * PowX fist/fild
5026 cycles for 20 * PowX frndint

44 bytes for Pow2 fist/fild
38 bytes for Pow2 fadd One
36 bytes for Pow2 frndint
50 bytes for PowX fist/fild
47 bytes for PowX frndint


--- ok ---
Title: Re: 2^x timings
Post by: qWord on April 11, 2013, 03:26:59 AM
it may also be interesting to replace FSCALE with non-FPU instructions, because it does nothing more than compute st(0)*2^rndint(st(1)). This could be replaced by code that directly manipulates the exponent field (of the value 1.0) to get 2^rndint(st(1)). We even already have the rounded value of st(1) as an integer...
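A minimal sketch of the exponent-field idea, assuming the MASM32 SDK (masm32rt.inc in its default path) and an integer exponent n that is already in eax and lies in the normal REAL8 range (-1022..1023). Pow2Int, dbl and result are hypothetical names, not from the thread. For simplicity it builds a REAL8 rather than a REAL10, handles only the integer part, and does no range checking.

```masm
include \masm32\include\masm32rt.inc    ; assumption: standard MASM32 SDK install

.data?
result  dd ?                            ; hypothetical: receives 2^n as an integer

.code

; In:   eax = n (signed integer, assumed -1022..1023, not checked)
; Out:  st(0) = 2^n
Pow2Int proc
LOCAL dbl:REAL8
    add eax, 1023                       ; add the REAL8 exponent bias
    shl eax, 20                         ; exponent field = bits 20..30 of the high dword
    mov dword ptr dbl, 0                ; low 32 mantissa bits = 0
    mov dword ptr dbl[4], eax           ; sign = 0, exponent = n+1023, top mantissa bits = 0
    fld dbl                             ; implicit leading 1 => value is 1.0 * 2^n
    ret
Pow2Int endp

start:
    mov eax, 10
    call Pow2Int                        ; st(0) = 2^10
    fistp result                        ; store as integer and pop
    print str$(result), 13, 10          ; 10+1023 = 409h, so dbl = 40900000_00000000h = 1024.0
    exit
end start
```

No FSCALE is needed for the integer part. The fractional part would still have to go through f2xm1 and one multiply.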
Title: Re: 2^x timings
Post by: dedndave on April 11, 2013, 05:46:16 AM
i was playing with a little code to do that
and, while it may be simple to manipulate directly in 99.99 % of the cases,   :P
there are those special cases where you need a bunch of if/else statements to handle properly
Title: Re: 2^x timings
Post by: qWord on April 11, 2013, 06:56:58 AM
Quote from: dedndave on April 11, 2013, 05:46:16 AM
i was playing with a little code to do that
and, while it may be simple to manipulate directly in 99.99 % of the cases,   :P
there are those special cases where you need a bunch of if/else statements to handle properly

the following code should do it in all cases of valid input:
; calc: 2^x = 2^(a+b) = 2^a*2^b = 2^fract_part(st0)*2^int_part(st0)
; In:   st0 == exponent
; Out:  st0 == 2^st0
pow2 proc
LOCAL r10[3]:DWORD
LOCAL exp:SDWORD

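    mov eax,3fffh           ; REAL10 exponent bias
    fist exp                ; exp = st0 rounded to an integer (current rounding mode)
    fisub exp               ; st0 = remaining fractional part
    add eax,exp ; case: add 3fffh,-X ==> sub 3fffh,X
    jle @err1 ; underflow
    cmp eax,8000h
    jae @err2 ; overflow
    mov r10[0],0            ; mantissa, low dword
    mov r10[4],80000000h    ; mantissa, high dword: only the explicit integer bit set (1.0)
    mov r10[8],eax          ; biased exponent, sign bit clear
    f2xm1                   ; st0 = 2^fract_part - 1
    fadd FP4(1.0)           ; st0 = 2^fract_part
    fld REAL10 ptr r10      ; st0 = 2^int_part
    fmulp st(1),st          ; st0 = 2^fract_part * 2^int_part
    ret

@err1:
    fstp st(0)
    fld FP4(0.0)            ; underflow: return 0
    ret
@err2:
    fstp st(0)
    fld FP4(07F800000r) ; +infinity
    ret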
   
pow2 endp


EDIT: overflow detection was incomplete
Title: Re: 2^x timings
Post by: jj2007 on April 11, 2013, 07:48:51 AM
Looks very competitive, qWord - compliments :t
Would you mind if I added it to MasmBasic, with proper acknowledgement, of course?

EDIT: Shaved off a cycle and a few bytes. See Reply#16 for attachment.
Title: Re: 2^x timings
Post by: qWord on April 11, 2013, 08:08:48 AM
Quote from: jj2007 on April 11, 2013, 07:48:51 AM
Looks very competitive, qWord - compliments :t
Would you mind if I added it to MasmBasic, with proper acknowledgement, of course?

I'm afraid Agner Fog had found that long before me ;-D
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz (SSE4)
loop overhead is approx. 64/20 cycles

1194    cycles for 20 * Pow2 fist/fild
1118    cycles for 20 * Pow2 fadd One
1189    cycles for 20 * Pow2 frndint
2202    cycles for 20 * PowX fist/fild
754     cycles for 20 * pow qWord

885     cycles for 20 * Pow2 fist/fild
1011    cycles for 20 * Pow2 fadd One
1240    cycles for 20 * Pow2 frndint
2100    cycles for 20 * PowX fist/fild
752     cycles for 20 * pow qWord

1085    cycles for 20 * Pow2 fist/fild
1118    cycles for 20 * Pow2 fadd One
1354    cycles for 20 * Pow2 frndint
2147    cycles for 20 * PowX fist/fild
925     cycles for 20 * pow qWord

877     cycles for 20 * Pow2 fist/fild
894     cycles for 20 * Pow2 fadd One
1208    cycles for 20 * Pow2 frndint
2103    cycles for 20 * PowX fist/fild
1006    cycles for 20 * pow qWord

44      bytes for Pow2 fist/fild
38      bytes for Pow2 fadd One
36      bytes for Pow2 frndint
50      bytes for PowX fist/fild
96      bytes for pow qWord


--- ok ---
Title: Re: 2^x timings
Post by: jj2007 on April 11, 2013, 08:16:29 AM
Quote from: qWord on April 11, 2013, 08:08:48 AM
I'm afraid Agner Fog had found that long before me ;-D

Just found bitRAKE's version (http://www.asmcommunity.net/board/index.php?PHPSESSID=cf1081402952c2c1ed7f56c24f03a5ba&topic=2979.msg30586#msg30586) - he quotes Agner.


Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
loop overhead is approx. 38/20 cycles

2647    cycles for 20 * Pow2 fist/fild
1399    cycles for 20 * pow qWord/Agner

2654    cycles for 20 * Pow2 fist/fild
1399    cycles for 20 * pow qWord/Agner

2645    cycles for 20 * Pow2 fist/fild
1406    cycles for 20 * pow qWord/Agner

44      bytes for Pow2 fist/fild
90      bytes for pow qWord/Agner

Title: Re: 2^x timings
Post by: Antariy on April 11, 2013, 02:41:47 PM
Hi, Jochen :t


Intel(R) Celeron(R) CPU 2.13GHz (SSE3)
loop overhead is approx. 68/20 cycles

6315    cycles for 20 * Pow2 fist/fild
4175    cycles for 20 * pow qWord/Agner

6285    cycles for 20 * Pow2 fist/fild
3900    cycles for 20 * pow qWord/Agner

6479    cycles for 20 * Pow2 fist/fild
4086    cycles for 20 * pow qWord/Agner

44      bytes for Pow2 fist/fild
90      bytes for pow qWord/Agner


--- ok ---
Title: Re: 2^x timings
Post by: jj2007 on April 11, 2013, 03:22:09 PM
Thanks, Alex :icon14:

Not much difference for your Celeron, it seems ;-)
Title: Re: 2^x timings
Post by: sinsi on April 11, 2013, 03:43:42 PM

AMD Phenom(tm) II X6 1100T Processor (SSE3)
loop overhead is approx. 71/20 cycles
837     cycles for 20 * Pow2 fist/fild
1054    cycles for 20 * pow qWord/Agner

Intel(R) Core(TM) i3-3110M CPU @ 2.40GHz (SSE4)
loop overhead is approx. 55/20 cycles
1227    cycles for 20 * Pow2 fist/fild
1065    cycles for 20 * pow qWord/Agner

Intel(R) Core(TM)2 Duo CPU     T8100  @ 2.10GHz (SSE4)
loop overhead is approx. 37/20 cycles
2394    cycles for 20 * Pow2 fist/fild
1249    cycles for 20 * pow qWord/Agner

Title: Re: 2^x timings
Post by: TouEnMasm on April 11, 2013, 03:58:30 PM

Intel(R) Celeron(R) CPU 2.80GHz (SSE3)
loop overhead is approx. 53/20 cycles

5799    cycles for 20 * Pow2 fist/fild
3774    cycles for 20 * pow qWord/Agner

5806    cycles for 20 * Pow2 fist/fild
3794    cycles for 20 * pow qWord/Agner

5801    cycles for 20 * Pow2 fist/fild
3772    cycles for 20 * pow qWord/Agner

44      bytes for Pow2 fist/fild
90      bytes for pow qWord/Agner


--- ok ---
Title: Re: 2^x timings
Post by: jj2007 on April 11, 2013, 04:34:56 PM
Thanks, John and Yves. Apparently the algo doesn't like AMD...


AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 78/20 cycles

913     cycles for 20 * Pow2 fist/fild
1161    cycles for 20 * pow qWord/Agner


Intel(R) Core(TM) i5 CPU       M 520  @ 2.40GHz (SSE4)
loop overhead is approx. 69/20 cycles

1090    cycles for 20 * Pow2 fist/fild
813     cycles for 20 * pow qWord/Agner
Title: Re: 2^x timings
Post by: sinsi on April 11, 2013, 04:49:43 PM
>Apparently the algo doesn't like AMD...
I sometimes feel left out amid all these intel cpus.
Plenty of timing code seems to be the opposite for my amd, probably because everyone else is intel.
Title: Re: 2^x timings
Post by: Magnum on April 11, 2013, 09:36:04 PM
Are there instructions that don't work on AMDs, or that work in a non-standard way?

I used to have a K-6 myself.

Andy
Title: Re: 2^x timings
Post by: Antariy on April 11, 2013, 11:55:28 PM
Jochen, what if you add invoke SetProcessAffinityMask,1 to the init of the program? Maybe on the mostly multicore AMD CPUs the thread just gets switched from one core to another?
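Spelled out, the Win32 call takes a process handle as well as the mask. A minimal sketch, assuming the MASM32 SDK (masm32rt.inc) and that pinning to logical CPU 0 is what is wanted. It is not the code from the attachment.

```masm
include \masm32\include\masm32rt.inc    ; assumption: standard MASM32 SDK install

.code
start:
    invoke GetCurrentProcess                ; pseudo-handle of this process in eax
    invoke SetProcessAffinityMask, eax, 1   ; mask bit 0 set = run on logical CPU 0 only
    .if eax == 0                            ; the API returns zero on failure
        print "SetProcessAffinityMask failed", 13, 10
    .else
        print "pinned to CPU 0", 13, 10
    .endif
    exit
end start
```

With the mask set once at startup, every thread of the process stays on that core, so cycle counts are not disturbed by migration between cores.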
Title: Re: 2^x timings
Post by: jj2007 on April 12, 2013, 12:58:59 AM
Quote from: Antariy on April 11, 2013, 11:55:28 PM
Jochen, what if you add invoke SetProcessAffinityMask,1 to the init of the program? Maybe on the mostly multicore AMD CPUs the thread just gets switched from one core to another?

Alex,
I have been scratching my head all this time over why the timings were so volatile on some CPUs... thanks for reminding me of SetProcessAffinityMask :t
I was deeply convinced that I had set it somewhere but nope, it just wasn't included :redface:

The good news is it's now included, see attachment.
The bad news is it doesn't make AMD any faster :icon_mrgreen:
Title: Re: 2^x timings
Post by: dedndave on April 12, 2013, 02:10:37 AM
Quote
thanks for reminding me of SetProcessAffinityMask

:icon_eek:
from the guy who has probably written more timing tests than anyone else - lol
they could probably run a little longer, too
1) select a single core
2) use Sleep,500 after that (or more) to bind and settle
3) try to make each test use about 0.5 seconds
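The three steps above can be sketched as follows, assuming the MASM32 SDK (masm32rt.inc). The names t0 and loops and the fld1/fstp placeholder are hypothetical stand-ins for the real code under test, and GetTickCount is used only to size the run to roughly half a second, not to count cycles.

```masm
include \masm32\include\masm32rt.inc    ; assumption: standard MASM32 SDK install

.data?
t0      dd ?                            ; tick count at start of the test
loops   dd ?                            ; iterations completed

.code
start:
    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1   ; 1) select a single core
    invoke Sleep, 500                       ; 2) give the scheduler time to bind and settle

    invoke GetTickCount
    mov t0, eax
    mov loops, 0
  @@:
    fld1                                    ; placeholder for the code under test
    fstp st(0)
    inc loops
    test loops, 0FFFFh                      ; look at the clock only every 65536 iterations
    jnz @B
    invoke GetTickCount
    sub eax, t0                             ; elapsed milliseconds
    cmp eax, 500                            ; 3) keep the test running for about 0.5 s
    jb @B

    print str$(loops), " iterations in about 0.5 s", 13, 10
    exit
end start
```

Checking the clock only once per 65536 iterations keeps the GetTickCount call from dominating the loop being measured.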

prescott w/htt
Intel(R) Pentium(R) 4 CPU 3.00GHz (SSE3)
loop overhead is approx. 58/20 cycles

5899    cycles for 20 * Pow2 fist/fild
4082    cycles for 20 * pow qWord/Agner

5930    cycles for 20 * Pow2 fist/fild
4049    cycles for 20 * pow qWord/Agner

5925    cycles for 20 * Pow2 fist/fild
4096    cycles for 20 * pow qWord/Agner
Title: Re: 2^x timings
Post by: Gunther on April 12, 2013, 05:24:38 AM
Jochen,

results from Pow2Timings3.zip:

Quote
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (SSE4)
loop overhead is approx. 76/20 cycles

1029   cycles for 20 * Pow2 fist/fild
1503   cycles for 20 * pow qWord/Agner

1027   cycles for 20 * Pow2 fist/fild
1726   cycles for 20 * pow qWord/Agner

1639   cycles for 20 * Pow2 fist/fild
899   cycles for 20 * pow qWord/Agner

44   bytes for Pow2 fist/fild
90   bytes for pow qWord/Agner

--- ok ---

Gunther
Title: Re: 2^x timings
Post by: Antariy on April 12, 2013, 02:12:06 PM
Hi Jochen!

Quote from: jj2007 on April 12, 2013, 12:58:59 AM
I have been scratching my head all this time over why the timings were so volatile on some CPUs... thanks for reminding me of SetProcessAffinityMask :t
I was deeply convinced that I had set it somewhere but nope, it just wasn't included :redface:

The good news is it's now included, see attachment.
The bad news is it doesn't make AMD any faster :icon_mrgreen:

:redface:

At least the timings will no longer jump like crazy rabbits in other tests, multicore-affected or otherwise, where the testbed could be used - you made it a very comprehensive template :t
Title: Re: 2^x timings
Post by: jj2007 on April 15, 2013, 08:48:17 AM
Quote from: Antariy on April 12, 2013, 02:12:06 PM
At least the timings will no longer jump like crazy rabbits in other tests, multicore-affected or otherwise, where the testbed could be used - you made it a very comprehensive template :t

Thanks, Alex.

Exp10, Exp2, ExpE and ExpXY are now implemented, see here (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1192).
Title: Re: 2^x timings
Post by: Gunther on April 16, 2013, 01:11:55 AM
Jochen,

Quote from: jj2007 on April 15, 2013, 08:48:17 AM
Exp10, Exp2, ExpE and ExpXY are now implemented, see here (http://www.webalice.it/jj2006/MasmBasicQuickReference.htm#Mb1192).

rock solid work. I've checked it.  :t

Gunther